An important step in data analysis is class assignment which is usually done on the basis of a macroscopic phenotypic or bioprocess characteristic, such as high vs low growth, healthy vs diseased state, or high vs low productivity. Unfortunately, such an assignment may lump together samples, which when derived from a more detailed phenotypic or bioprocess description are dissimilar, giving rise to models of lower quality and predictive power. In this paper we present a clustering algorithm for data preprocessing which involves the identification of fundamentally similar lots on the basis of the extent of similarity among the system variables. The algorithm combines aspects of cluster analysis and principal component analysis by applying agglomerative clustering methods to the first principal component of the system data matrix. As part of a rational strategy for developing empirical models, this technique selects lots (samples) which are most appropriate for inclusion in a training set by analyzing multivariate data homogeneity. Samples with similar data structures are identified and grouped together into distinct clusters. This knowledge is used in the formation of potential training sets. Additionally, this technique can identify atypical lots, i.e., samples that are not simply outliers but exhibit the general properties of one class but have been given the assignment of the other. The method is presented along with examples from its application to fermentation data sets. (C) 2000 Academic Press
引用
收藏
页码:228 / 238
页数:11
相关论文
共 15 条
[11]
QUINLAN JR, 1986, MACH INTELL, V11, P710
[12]
Rousseeuw P.J., 1990, Finding groups in data: An introduction to cluster analysis, V1