Binary matrix factorization for analyzing gene expression data

Cited by: 91
Authors
Zhang, Zhong-Yuan [2 ]
Li, Tao [1 ]
Ding, Chris [3 ]
Ren, Xian-Wen [4 ]
Zhang, Xiang-Sun [4 ]
Affiliations
[1] Florida Int Univ, Sch Comp & Informat Sci, Miami, FL 33199 USA
[2] Cent Univ Finance & Econ, Sch Stat, Beijing, Peoples R China
[3] Univ Texas Arlington, Dept Comp Sci & Engn, Arlington, TX 76019 USA
[4] Chinese Acad Sci, Acad Math & Syst Sci, Beijing, Peoples R China
Funding
National Science Foundation (USA);
Keywords
Biclustering; Non-negative matrix factorization; Boundedness property of NMF; Binary matrix; MICROARRAY DATA; ERROR; ORGANIZATION; ALGORITHMS; MODEL; PARTS;
DOI
10.1007/s10618-009-0145-2
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification code
140502 [Artificial intelligence];
Abstract
The advent of microarray technology enables us to monitor an entire genome on a single chip using a systematic approach. Clustering, a widely used data mining approach, has been used to discover phenotypes from the raw expression data. However, traditional clustering algorithms have limitations, since they cannot identify the substructures of samples and features hidden behind the data. Unlike clustering, biclustering is a new methodology for discovering genes that are highly related to a subset of samples. Several biclustering models and methods have been presented and used for tumor clinical diagnosis and pathological research. In this paper, we present a new biclustering model using Binary Matrix Factorization (BMF). BMF is a new variant rooted in non-negative matrix factorization (NMF). We begin by proving a new boundedness property of NMF. We then present two different algorithms that implement the model and compare them. We show that the microarray data biclustering problem can be formulated as a BMF problem and can be solved effectively using our proposed algorithms. Unlike greedy strategy-based algorithms, our proposed algorithms for BMF are more likely to find the global optima. Experimental results on synthetic and real datasets demonstrate the advantages of BMF over existing biclustering methods. Besides its attractive clustering performance, BMF can generate sparse results (i.e., the number of genes/features involved in each biclustering structure is very small relative to the total number of genes/features), in accordance with common practice in molecular biology.
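The abstract does not spell out the two algorithms, so the following is only a rough sketch of the general idea of factoring a 0/1 matrix into 0/1 factors, not the authors' method. It assumes a plain multiplicative-update NMF, a per-component rescaling that leaves the product W·H unchanged, and a fixed cutoff of 0.5; the function names `nmf`, `balance`, and `binarize` and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def nmf(X, rank, n_iter=1000, seed=0, eps=1e-9):
    """Plain multiplicative-update NMF: X ~ W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def balance(W, H, eps=1e-9):
    """Rescale each component so the largest entry of W[:, k] equals the
    largest entry of H[k, :]. The product W @ H is left unchanged."""
    w_max = W.max(axis=0) + eps
    h_max = H.max(axis=1) + eps
    scale = np.sqrt(h_max / w_max)
    return W * scale, H / scale[:, None]

def binarize(W, H, threshold=0.5):
    """Assumed cutoff: entries at or above the threshold become 1, others 0."""
    return (W >= threshold).astype(int), (H >= threshold).astype(int)

# Toy 0/1 matrix with two disjoint blocks (rows = genes, columns = samples).
X = np.zeros((6, 6))
X[:3, :3] = 1
X[3:, 3:] = 1

W, H = nmf(X, rank=2)
W_bal, H_bal = balance(W, H)
W_bin, H_bin = binarize(W_bal, H_bal)

# Each component k defines one bicluster: rows where W_bin[:, k] == 1
# and columns where H_bin[k, :] == 1.
for k in range(W_bin.shape[1]):
    print(k, np.flatnonzero(W_bin[:, k]), np.flatnonzero(H_bin[k, :]))
```

Reading each column of the binary W together with the matching row of the binary H as one bicluster is what links the factorization to biclustering in this sketch. On real expression data the input would first have to be discretized to 0/1, a step not shown here.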
Pages: 28-52
Number of pages: 25