A statistical framework for genomic data fusion

Cited by: 430
Authors
Lanckriet, GRG
De Bie, T
Cristianini, N
Jordan, MI
Noble, WS
Affiliations
[1] Univ Washington, Hlth Sci Ctr, Dept Genome Sci, Seattle, WA 98195 USA
[2] Univ Calif Berkeley, Dept Elect Engn & Comp Sci, Berkeley, CA 94720 USA
[3] Univ Calif Berkeley, Dept Stat, Div Comp Sci, Berkeley, CA 94720 USA
[4] Katholieke Univ Leuven, ESAT SCD, Dept Elect Engn, B-3001 Louvain, Belgium
[5] Univ Calif Davis, Dept Stat, Davis, CA 95618 USA
Funding
U.S. National Science Foundation
DOI
10.1093/bioinformatics/bth294
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: During the past decade, the new focus on genomics has highlighted a particular challenge: to integrate the different views of the genome that are provided by various types of experimental data.
Results: This paper describes a computational framework for integrating and drawing inferences from a collection of genome-wide measurements. Each dataset is represented via a kernel function, which defines generalized similarity relationships between pairs of entities, such as genes or proteins. The kernel representation is both flexible and efficient, and can be applied to many different types of data. Furthermore, kernel functions derived from different types of data can be combined in a straightforward fashion. Recent advances in the theory of kernel methods have provided efficient algorithms to perform such combinations in a way that minimizes a statistical loss function. These methods exploit semidefinite programming techniques to reduce the problem of finding optimizing kernel combinations to a convex optimization problem. Computational experiments performed using yeast genome-wide datasets, including amino acid sequences, hydropathy profiles, gene expression data and known protein-protein interactions, demonstrate the utility of this approach. A statistical learning algorithm trained from all of these data to recognize particular classes of proteins (membrane proteins and ribosomal proteins) performs significantly better than the same algorithm trained on any single type of data.
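As a rough illustration of the idea in the abstract that each dataset yields a kernel matrix and that kernels from different data types can be combined by a weighted sum, the following minimal sketch may help. It is not the authors' method: the paper learns the combination weights by semidefinite programming, whereas this sketch substitutes a simple kernel-target alignment heuristic to pick nonnegative weights. The two synthetic "views", the function names, and all parameter values are assumptions made only for the example.

```python
import numpy as np


def linear_kernel(X):
    """Gram matrix of inner products between the rows of X."""
    return X @ X.T


def rbf_kernel(X, gamma=0.5):
    """Gaussian (RBF) Gram matrix between the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))


def normalize(K):
    """Rescale K so that every diagonal entry equals 1."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)


def alignment(K, y):
    """Kernel-target alignment: <K, yy^T>_F / (||K||_F * ||yy^T||_F)."""
    Y = np.outer(y, y)
    return np.sum(K * Y) / (np.linalg.norm(K) * np.linalg.norm(Y))


def combine(kernels, weights):
    """Weighted sum of kernel matrices; nonnegative weights keep the sum PSD."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0):
        raise ValueError("weights must be nonnegative")
    return sum(w * K for w, K in zip(weights, kernels))


# Synthetic stand-ins for two data types describing the same n entities.
rng = np.random.default_rng(0)
n = 40
y = np.where(np.arange(n) < n // 2, 1.0, -1.0)       # two hypothetical classes
view_a = rng.normal(size=(n, 5)) + 0.8 * y[:, None]  # informative view
view_b = rng.normal(size=(n, 3))                     # pure-noise view

# One kernel per view, both with unit diagonal so their scales are comparable.
kernels = [normalize(linear_kernel(view_a)), rbf_kernel(view_b)]

# Heuristic weights (NOT the paper's SDP): clip alignments at 0, normalize.
scores = np.array([max(alignment(K, y), 0.0) for K in kernels])
if scores.sum() > 0:
    weights = scores / scores.sum()
else:
    weights = np.full(len(kernels), 1.0 / len(kernels))

K_combined = combine(kernels, weights)
min_eig = np.linalg.eigvalsh(K_combined).min()

print("weights:", weights)
print("smallest eigenvalue of combined kernel:", min_eig)
```

The combined matrix could then be passed to any kernel-based learner that accepts a precomputed Gram matrix. Restricting the weights to be nonnegative is one simple way to guarantee that the combination remains a valid (positive semidefinite) kernel.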
Pages: 2626-2635
Number of pages: 10