Machine learning methods for predictive proteomics

Cited by: 48
Authors
Barla, Annalisa [1]
Jurman, Giuseppe [1]
Riccadonna, Samantha [1]
Merler, Stefano [1]
Chierici, Marco [1]
Furlanello, Cesare [1]
Affiliations
[1] FBK, MPBA Unit, I-38100 Trento, Italy
Keywords
proteomics; selection bias; feature selection; functional profiling
DOI
10.1093/bib/bbn008
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
The search for predictive biomarkers of disease from high-throughput mass spectrometry (MS) data requires a complex analysis path. Preprocessing and machine-learning modules are pipelined, starting from raw spectra, to set up a predictive classifier based on a shortlist of candidate features. As a machine-learning problem, proteomic profiling on MS data calls for the same caution as the microarray case. The risk of overfitting and of selection bias effects is pervasive: not only do potential features easily outnumber samples by a factor of 10^3, but it is also easy to overlook information-leakage effects during preprocessing from spectra to peaks. The aim of this review is to explain how to build a general-purpose design analysis protocol (DAP) for predictive proteomic profiling: we show how to limit leakage due to parameter tuning and how to organize classification and ranking on large numbers of replicate versions of the original data to avoid selection bias. The DAP can be used with alternative components, i.e. with different preprocessing methods (peak clustering or wavelet based), classifiers (e.g. Support Vector Machine, SVM) or feature ranking methods (recursive feature elimination or I-Relief). A procedure for assessing the stability and predictive value of the resulting biomarker list is also provided. The approach is exemplified with experiments on synthetic datasets (from the Cromwell MS simulator) and with publicly available datasets from cancer studies.
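The selection-bias effect mentioned in the abstract can be illustrated with a minimal sketch. This is not the DAP itself: as stand-ins for the SVM and the ranking methods named above, it assumes a nearest-centroid classifier and a class-mean-difference ranking, and it runs on pure-noise data rather than spectra. All function names, sizes and the seed are hypothetical choices made for illustration. The only point is that ranking features on all samples before cross-validation leaks label information, whereas ranking inside each training split on every replicate does not.

```python
import random

def rank_features(X, y, k):
    """Indices of the k features with the largest absolute class-mean difference."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    scores = []
    for j in range(len(X[0])):
        diff = sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg)
        scores.append((abs(diff), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]

def centroid(rows, feats):
    """Mean vector of the given rows, restricted to the selected features."""
    return [sum(r[j] for r in rows) / len(rows) for j in feats]

def predict(row, feats, c0, c1):
    """Nearest-centroid rule (illustrative stand-in for an SVM)."""
    d0 = sum((row[j] - a) ** 2 for j, a in zip(feats, c0))
    d1 = sum((row[j] - b) ** 2 for j, b in zip(feats, c1))
    return 1 if d1 < d0 else 0

def cv_accuracy(X, y, k, n_folds, select_inside, rng):
    """One cross-validation replicate on a fresh random split of the samples."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    # Leaky variant: features are ranked once, using ALL samples and labels.
    leaky_feats = None if select_inside else rank_features(X, y, k)
    correct = 0
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        # Sound variant: features are ranked on the training split only.
        feats = rank_features(X_tr, y_tr, k) if select_inside else leaky_feats
        c0 = centroid([x for x, l in zip(X_tr, y_tr) if l == 0], feats)
        c1 = centroid([x for x, l in zip(X_tr, y_tr) if l == 1], feats)
        correct += sum(predict(X[i], feats, c0, c1) == y[i] for i in test)
    return correct / len(X)

# Pure-noise data: features far outnumber samples and carry no class signal.
rng = random.Random(0)
n_samples, n_features, k = 40, 1000, 10
X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
y = [0] * 20 + [1] * 20

# Average each estimate over several replicate splits of the same data.
n_replicates = 10
biased = sum(cv_accuracy(X, y, k, 5, False, rng) for _ in range(n_replicates)) / n_replicates
unbiased = sum(cv_accuracy(X, y, k, 5, True, rng) for _ in range(n_replicates)) / n_replicates
print(f"selection outside CV: {biased:.2f}; selection inside CV: {unbiased:.2f}")
```

Because the labels here are unrelated to the features, an honest estimate should stay near chance level; an estimate well above chance from the first variant reflects leakage rather than predictive value.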
Pages: 119-128
Number of pages: 10