Mining of Biological Data II: Assessing Data Structure and Class Homogeneity by Cluster Analysis

Cited by: 28

Authors:

Kamimura, Roy T. [1]

Bicciato, Silvio [2]

Shimizu, Hiroshi [3]

Alford, Joe [4]

Stephanopoulos, Gregory [1]

Affiliations:

[1] MIT, Dept Chem Engn, Cambridge, MA 02139 USA

[2] Univ Padua, Dept Chem Engn, I-35131 Padua, Italy

[3] Osaka Univ, Fac Engn, Dept Biotechnol, Suita, Osaka 565, Japan

[4] Eli Lilly & Co, Lilly Res Labs, Indianapolis, IN 46285 USA

Source:

METABOLIC ENGINEERING | 2000 / Vol. 2 / Issue 3

Keywords:

DOI:

10.1006/mben.2000.0155

Chinese Library Classification (CLC) number:

Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology];

Discipline classification code:

071005 ; 0836 ; 090102 ; 100705 ;

Abstract:

An important step in data analysis is class assignment, which is usually done on the basis of a macroscopic phenotypic or bioprocess characteristic, such as high vs low growth, healthy vs diseased state, or high vs low productivity. Unfortunately, such an assignment may lump together samples that are dissimilar when judged by a more detailed phenotypic or bioprocess description, giving rise to models of lower quality and predictive power. In this paper we present a clustering algorithm for data preprocessing that identifies fundamentally similar lots on the basis of the extent of similarity among the system variables. The algorithm combines aspects of cluster analysis and principal component analysis by applying agglomerative clustering methods to the first principal component of the system data matrix. As part of a rational strategy for developing empirical models, this technique selects the lots (samples) that are most appropriate for inclusion in a training set by analyzing multivariate data homogeneity. Samples with similar data structures are identified and grouped together into distinct clusters. This knowledge is used in the formation of potential training sets. Additionally, this technique can identify atypical lots, i.e., samples that are not simply outliers but that exhibit the general properties of one class while having been assigned to the other. The method is presented along with examples from its application to fermentation data sets. © 2000 Academic Press
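
The idea in the abstract, agglomerative clustering applied to the first principal component and then used to spot atypical lots, can be illustrated with a small sketch. This is not the authors' implementation: the function names (`first_pc_scores`, `agglomerate`, `flag_atypical`), the use of power iteration, the choice of average linkage, the majority-vote rule for flagging lots, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter

def first_pc_scores(X, iters=500):
    """Scores of mean-centered rows of X on the first principal component.
    Uses power iteration on the covariance matrix (an illustrative choice)."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    C = [[sum(r[a] * r[b] for r in Xc) / (n - 1) for b in range(p)]
         for a in range(p)]
    # Start vector; assumes it is not orthogonal to the first component.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(C[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return [sum(r[j] * v[j] for j in range(p)) for r in Xc]

def agglomerate(scores, n_clusters):
    """Naive agglomerative clustering of 1-D scores with average linkage
    (the linkage is an assumption; the abstract does not specify one)."""
    clusters = [[i] for i in range(len(scores))]

    def dist(a, b):
        total = sum(abs(scores[i] - scores[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge it.
        _, i, j = min((dist(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    labels = [0] * len(scores)
    for k, members in enumerate(clusters):
        for i in members:
            labels[i] = k
    return labels

def flag_atypical(labels, assigned):
    """Lots whose assigned class differs from their cluster's majority class
    (a hypothetical rule for 'atypical' lots)."""
    flagged = []
    for k in set(labels):
        members = [i for i, lab in enumerate(labels) if lab == k]
        majority = Counter(assigned[i] for i in members).most_common(1)[0][0]
        flagged += [i for i in members if assigned[i] != majority]
    return sorted(flagged)

# Toy data (invented): rows are lots, columns are process variables.
X = [[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
     [5.0, 5.2], [5.1, 4.9], [4.8, 5.0],
     [1.1, 1.0]]
assigned = ["low", "low", "low", "high", "high", "high", "high"]

labels = agglomerate(first_pc_scores(X), n_clusters=2)
atypical = flag_atypical(labels, assigned)
print(labels, atypical)
```

In this toy case the last lot carries the "high" label but clusters with the "low" lots, so it is the one flagged. A training set in the spirit of the abstract would then be drawn from lots whose cluster and assigned class agree.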


Pages: 228-238

Page count: 11

