Mining of Biological Data I: Identifying Discriminating Features Via Mean Hypothesis Testing

Cited by: 22

Authors:

Kamimura, Roy T. [1]

Bicciato, Silvio [2]

Shimizu, Hiroshi [3]

Alford, Joe [4]

Stephanopoulos, Gregory [1]

Affiliations:

[1] MIT, Dept Chem Engn, Cambridge, MA 02139 USA

[2] Univ Padua, Dept Chem Engn Proc, I-35131 Padua, Italy

[3] Osaka Univ, Fac Engn, Dept Biotechnol, Suita, Osaka 565, Japan

[4] Eli Lilly & Co, Lilly Res Labs, Indianapolis, IN 46285 USA

Source:

METABOLIC ENGINEERING | 2000 / Vol. 2 / Issue 3

DOI:

10.1006/mben.2000.0154

Chinese Library Classification (CLC) number:

Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology]

Discipline classification codes:

071005; 0836; 090102; 100705

Abstract:

Large volumes of data are routinely collected during bioprocess operations and, more recently, in basic biological research using genomics-based technologies. While these data often lack sufficient detail to be used for mechanism identification, it is possible that the underlying mechanisms affecting cell phenotype or process outcome are reflected as specific patterns in the overall or temporal sensor logs. This raises the possibility of identifying outcome-specific fingerprints that can be used for process or phenotype classification and for the identification of discriminating characteristics, such as specific genes or process variables. The aim of this work is to provide a systematic approach to identifying and modeling patterns in historical records and using this information for process classification. This approach differs from others in that emphasis is placed on analyzing the data structure first and thereby extracting potentially relevant features prior to model creation. The initial step in this approach is to identify the discriminating features of the relevant measurements and time windows, which can then be used to discriminate among different classes of process behavior. This is achieved via a mean hypothesis testing algorithm. Next, the homogeneity of the multivariate data in each class is explored via a novel cluster analysis technique called PC1 Time Series Clustering to ensure that the data subsets used accurately reflect the variability displayed in the historical records. This will be the topic of the second paper in this series. We present here the method for identifying discriminating features in data via mean hypothesis testing, along with results from the analysis of case studies from industrial fermentations. © 2000 Academic Press
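The abstract does not spell out the mean hypothesis testing algorithm, so the following is only a minimal illustrative sketch of the general idea, not the authors' method. It assumes a univariate Welch (unequal-variance) two-sample t statistic computed separately for each feature, where a feature stands for one measurement in one time window. The feature names, the data values, and the helper names `welch_t` and `rank_features` are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    na, nb = len(a), len(b)
    # Squared standard errors of the two sample means.
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


def rank_features(class_a, class_b):
    """Rank features by |t|, largest first.

    class_a and class_b map a feature name to the list of values
    observed for that feature across the runs of one class.
    """
    scores = []
    for name in class_a:
        t, df = welch_t(class_a[name], class_b[name])
        scores.append((name, t, df))
    return sorted(scores, key=lambda s: abs(s[1]), reverse=True)


# Hypothetical data: four runs per class, two (measurement, window) features.
high_yield = {
    "pH_window_1": [7.0, 7.1, 6.9, 7.0],
    "DO_window_3": [40.0, 42.0, 41.0, 43.0],
}
low_yield = {
    "pH_window_1": [7.0, 6.9, 7.1, 7.0],
    "DO_window_3": [30.0, 31.0, 29.0, 32.0],
}

ranked = rank_features(high_yield, low_yield)
for name, t, df in ranked:
    print(f"{name}: t = {t:.2f}, df = {df:.1f}")
```

Features with a large |t| are candidates for discriminating between the two classes. A real analysis would convert each statistic to a p-value and correct for the number of features tested; both steps are omitted here for brevity.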

Pages: 218-227

Number of pages: 10
