Mining of Biological Data II: Assessing Data Structure and Class Homogeneity by Cluster Analysis

Cited by: 28
Authors
Kamimura, Roy T. [1]
Bicciato, Silvio [2]
Shimizu, Hiroshi [3]
Alford, Joe [4]
Stephanopoulos, Gregory [1]
Affiliations
[1] MIT, Dept Chem Engn, Cambridge, MA 02139 USA
[2] Univ Padua, Dept Chem Engn, I-35131 Padua, Italy
[3] Osaka Univ, Fac Engn, Dept Biotechnol, Suita, Osaka 565, Japan
[4] Eli Lilly & Co, Lilly Res Labs, Indianapolis, IN 46285 USA
DOI
10.1006/mben.2000.0155
Chinese Library Classification number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology]
Subject classification codes
071005; 0836; 090102; 100705
Abstract
An important step in data analysis is class assignment, which is usually done on the basis of a macroscopic phenotypic or bioprocess characteristic, such as high vs low growth, healthy vs diseased state, or high vs low productivity. Unfortunately, such an assignment may lump together samples that are dissimilar when judged by a more detailed phenotypic or bioprocess description, giving rise to models of lower quality and predictive power. In this paper we present a clustering algorithm for data preprocessing that identifies fundamentally similar lots on the basis of the extent of similarity among the system variables. The algorithm combines aspects of cluster analysis and principal component analysis by applying agglomerative clustering methods to the first principal component of the system data matrix. As part of a rational strategy for developing empirical models, this technique selects the lots (samples) that are most appropriate for inclusion in a training set by analyzing multivariate data homogeneity. Samples with similar data structures are identified and grouped together into distinct clusters. This knowledge is used in the formation of potential training sets. Additionally, this technique can identify atypical lots, i.e., samples that are not simply outliers but that exhibit the general properties of one class while having been assigned to the other. The method is presented along with examples from its application to fermentation data sets. © 2000 Academic Press
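The idea in the abstract, agglomerative clustering applied to the first principal component and then used to flag atypical lots, can be illustrated with a small sketch. This is not the authors' implementation. The variable scaling, the power-iteration estimate of the first component, the average-linkage rule, the fixed number of clusters, the function names, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def first_pc_scores(rows, iters=200):
    """Scores of each lot on the first principal component.

    Assumption: variables are mean-centered and scaled to unit variance,
    and the leading eigenvector is estimated by power iteration.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)) or 1.0
        for j in range(p)
    ]
    z = [[(r[j] - means[j]) / stds[j] for j in range(p)] for r in rows]
    cov = [
        [sum(zr[a] * zr[b] for zr in z) / (n - 1) for b in range(p)]
        for a in range(p)
    ]
    # Power iteration; the uniform start vector is an assumption and would
    # fail only if it were orthogonal to the leading eigenvector.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return [sum(zr[j] * v[j] for j in range(p)) for zr in z]


def agglomerate(scores, n_clusters):
    """Average-linkage agglomerative clustering of one-dimensional scores."""
    clusters = [[i] for i in range(len(scores))]

    def dist(a, b):
        total = sum(abs(scores[i] - scores[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Merge the closest pair of clusters.
        _, a, b = min(
            (dist(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(scores)
    for k, members in enumerate(clusters):
        for i in members:
            labels[i] = k
    return labels


def atypical_lots(labels, assigned):
    """Indices of lots whose assigned class differs from their cluster's majority."""
    majority = {}
    for k in set(labels):
        classes = [c for lab, c in zip(labels, assigned) if lab == k]
        majority[k] = Counter(classes).most_common(1)[0][0]
    return [i for i, (lab, c) in enumerate(zip(labels, assigned)) if majority[lab] != c]


# Hypothetical toy data: seven lots, three process variables each.
lots = [
    [1.0, 2.1, 0.9],
    [1.2, 1.9, 1.1],
    [0.9, 2.0, 1.0],
    [5.0, 6.1, 4.9],
    [5.2, 5.9, 5.1],
    [4.9, 6.0, 5.0],
    [1.1, 2.0, 1.0],  # assigned "high" but resembles the "low" lots
]
assigned = ["low", "low", "low", "high", "high", "high", "high"]

labels = agglomerate(first_pc_scores(lots), n_clusters=2)
atypical = atypical_lots(labels, assigned)
```

In this toy case the last lot clusters with the "low" lots despite its "high" assignment, so it is the one flagged. A real analysis would need a principled choice of the number of clusters and of the linkage rule, neither of which the abstract specifies.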
Pages: 228-238
Number of pages: 11