Microarray missing data imputation based on a set theoretic framework and biological knowledge

Cited by: 74
Authors
Gan, XC
Liew, AWC
Yan, H
Affiliations
[1] Chinese Univ Hong Kong, Dept Comp Sci & Engn, Shatin, Hong Kong, Peoples R China
[2] Chinese Univ Hong Kong, Dept Comp Sci & Engn, Shatin, Hong Kong, Peoples R China
[3] Univ Sydney, Sch Elect & Informat Engn, Sydney, NSW 2006, Australia
Keywords
DOI
10.1093/nar/gkl047
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
Gene expression data measured using microarrays usually suffer from the missing value problem, yet many data analysis methods require a complete data matrix. Although existing missing value imputation algorithms perform well, they have limitations. For example, some algorithms perform well only when strong local correlation exists in the data, while others provide the best estimates when the data are dominated by global structure. In addition, these algorithms do not take any biological constraints into account in their imputation. In this paper, we propose a set theoretic framework based on projection onto convex sets (POCS) for missing data imputation. POCS allows us to incorporate different types of a priori knowledge about missing values into the estimation process. The main idea of POCS is to formulate every piece of prior knowledge as a corresponding convex set and then use a convergence-guaranteed iterative procedure to obtain a solution in the intersection of all these sets. In this work, we design several convex sets that take the biological characteristics of the data into consideration: the first set mainly exploits the local correlation structure among genes in microarray data, while the second set captures the global correlation structure among arrays. The third set (actually a series of sets) exploits synchronization loss, a biological phenomenon that is common in cyclic systems, in microarray experiments. Experiments show that our algorithm can achieve a significant reduction of error compared to the KNNimpute, SVDimpute and LSimpute methods.
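The POCS idea described in the abstract (one convex set per piece of prior knowledge, then cyclic projection until the estimate lies in the intersection) can be illustrated with a minimal Python sketch. The three sets used here are hypothetical stand-ins chosen for simplicity, not the paper's local-correlation, global-correlation or synchronization-loss sets: agreement with the observed entries, a known profile sum, and a value range. All function and variable names are assumptions made for illustration.

```python
import numpy as np

def project_hyperplane(x, a, b):
    """Project x onto the hyperplane {z : a @ z = b}."""
    return x - (a @ x - b) / (a @ a) * a

def project_box(x, lo, hi):
    """Project x onto the box {z : lo <= z_i <= hi}."""
    return np.clip(x, lo, hi)

def project_observed(x, mask, values):
    """Project x onto the affine set of vectors that match the observed entries."""
    y = x.copy()
    y[mask] = values[mask]
    return y

def pocs(x0, projections, n_iter=500, tol=1e-10):
    """Apply the projections cyclically until the iterate stops moving."""
    x = x0.copy()
    for _ in range(n_iter):
        x_prev = x
        for proj in projections:
            x = proj(x)
        if np.linalg.norm(x - x_prev) < tol:
            break
    return x

# Toy expression profile with two missing values (NaN).
values = np.array([1.0, np.nan, 3.0, np.nan])
mask = ~np.isnan(values)

# Hypothetical prior knowledge: the profile sums to 8 and lies in [0, 5].
a, b = np.ones(4), 8.0
projections = [
    lambda x: project_hyperplane(x, a, b),
    lambda x: project_box(x, 0.0, 5.0),
    lambda x: project_observed(x, mask, values),  # last, so observed entries are exact
]

x0 = np.where(mask, values, 0.0)  # start with zeros in the missing positions
x_hat = pocs(x0, projections)
print(x_hat)
```

Because every set in this sketch is convex and their intersection is non-empty, the cyclic projections converge to a point that satisfies all three constraints at once; with different starting values for the missing entries the procedure may land on a different point of the same intersection.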
Pages: 1608-1619
Page count: 12