Large-scale clustering of cDNA-fingerprinting data

Cited by: 151
Authors
Herwig, R
Poustka, AJ
Müller, C
Bull, C
Lehrach, H
O'Brien, J
Affiliations
[1] Max Planck Inst Mol Genet, D-14195 Berlin, Germany
[2] Univ Gottingen, Inst Math Stochast, D-37083 Gottingen, Germany
DOI
10.1101/gr.9.11.1093
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
Clustering is one of the main mathematical challenges in large-scale gene expression analysis. We describe a clustering procedure based on a sequential k-means algorithm with additional refinements that is able to handle high-throughput data on the order of hundreds of thousands of data items measured on hundreds of variables. The practical motivation for our algorithm is oligonucleotide fingerprinting (a method for simultaneous determination of expression level for every active gene of a specific tissue), although the algorithm can be applied as well to other large-scale projects like EST clustering and qualitative clustering of DNA-chip data. As a pairwise similarity measure between two p-dimensional data points, x and y, we introduce mutual information that can be interpreted as the amount of information about x in y, and vice versa. We show that for our purposes this measure is superior to commonly used metric distances, for example, Euclidean distance. We also introduce a modified version of mutual information as a novel method for validating clustering results when the true clustering is known. The performance of our algorithm with respect to experimental noise is shown by extensive simulation studies. The algorithm is tested on a subset of 2029 cDNA clones coming from 15 different genes from a cDNA library derived from human dendritic cells. Furthermore, the clustering of these 2029 cDNA clones is demonstrated when the entire set of 76,032 cDNA clones is processed.
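To make the similarity measure named in the abstract concrete, the following is a minimal sketch of empirical mutual information between two discretized fingerprint vectors. It is not the authors' implementation: the function name `mutual_information`, the assumption that hybridization signals have already been discretized (here to 0/1), and the use of base-2 logarithms are all illustrative choices that this record does not specify.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Empirical mutual information (in bits) between two equal-length
    sequences of discrete symbols.

    Assumption: x and y are already discretized fingerprints, e.g. one
    0/1 value per oligonucleotide probe. The discretization scheme is
    hypothetical and not taken from the paper.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    joint = Counter(zip(x, y))  # counts of co-occurring symbol pairs
    px = Counter(x)             # marginal counts for x
    py = Counter(y)             # marginal counts for y
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a, b) * log2( p(a, b) / (p(a) * p(b)) ), written with counts
        mi += (c / n) * log2(c * n / (px[a] * py[b]))
    return mi


# Two hypothetical binarized fingerprints over eight probes
clone_a = [1, 0, 1, 1, 0, 0, 1, 0]
clone_b = [1, 0, 1, 0, 0, 0, 1, 1]
print(mutual_information(clone_a, clone_b))
```

The measure is symmetric in its arguments, which matches the abstract's reading of it as the information about x in y and vice versa. Unlike Euclidean distance, it depends only on how the symbols of the two vectors co-occur, not on their numeric magnitudes.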
Pages: 1093-1105
Number of pages: 13