A data-analytic strategy for protein biomarker discovery: profiling of high-dimensional proteomic data for cancer detection

Cited by: 191
Authors
Yasui, Y
Pepe, M
Thompson, ML
Adam, BL
Wright, GL
Qu, YS
Potter, JD
Winget, M
Thornquist, M
Feng, ZD
Affiliations
[1] Fred Hutchinson Canc Res Ctr, Canc Prevent Res Program, Seattle, WA 98109 USA
[2] Univ Washington, Dept Biostat, Seattle, WA 98195 USA
[3] Eastern Virginia Med Sch, Dept Microbiol & Mol Cell Biol, Norfolk, VA USA
[4] Eastern Virginia Med Sch, Virginia Prostate Ctr, Norfolk, VA USA
[5] Fred Hutchinson Canc Res Ctr, Canc Prevent Res Program, Seattle, WA 98109 USA
Keywords
bioinformatics; classification; disease markers; machine learning; mass spectrometry
DOI
10.1093/biostatistics/4.3.449
Chinese Library Classification (CLC) number
Q [Biological Sciences]
Discipline classification code
07; 0710; 09
Abstract
With recent advances in mass spectrometry techniques, it is now possible to investigate proteins over a wide range of molecular weights in small biological specimens. This advance has generated data-analytic challenges in proteomics, similar to those created by microarray technologies in genetics, namely, discovery of 'signature' protein profiles specific to each pathologic state (e.g. normal vs. cancer) or differential profiles between experimental conditions (e.g. treated by a drug of interest vs. untreated) from high-dimensional data. We propose a data-analytic strategy for discovering protein biomarkers based on such high-dimensional mass spectrometry data. A real biomarker-discovery project on prostate cancer is taken as a concrete example throughout the paper: the project aims to identify proteins in serum that distinguish cancer, benign hyperplasia, and normal states of prostate using the Surface Enhanced Laser Desorption/Ionization (SELDI) technology, a recently developed mass spectrometry technique. Our data-analytic strategy takes properties of the SELDI mass spectrometer into account: the SELDI output of a specimen contains about 48 000 (x, y) points where x is the protein mass divided by the number of charges introduced by ionization and y is the protein intensity of the corresponding mass per charge value, x, in that specimen. Given high coefficients of variation and other characteristics of protein intensity measures (y values), we reduce the measures of protein intensities to a set of binary variables that indicate peaks in the y-axis direction in the nearest neighborhoods of each mass per charge point in the x-axis direction. We then account for a shifting (measurement error) problem of the x-axis in SELDI output. After this pre-analysis processing of data, we combine the binary predictors to generate classification rules for cancer, benign hyperplasia, and normal states of prostate. 
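The pre-processing described above (reducing intensities to binary peak indicators, then tolerating x-axis shift) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the neighborhood size `half_window`, and the relative mass tolerance `rel_tol` are all assumptions, since the abstract gives no specific values.

```python
import numpy as np

def local_peak_indicators(y, half_window):
    """Return a 0/1 array: 1 where y[i] is the maximum of its neighborhood.

    half_window (number of points on each side) is a hypothetical tuning
    parameter; flat neighborhoods are not counted as peaks.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    peaks = np.zeros(n, dtype=int)
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        window = y[lo:hi]
        if y[i] == window.max() and y[i] > window.min():
            peaks[i] = 1
    return peaks

def widen_for_shift(peaks, x, rel_tol):
    """Mark x[i] as 'peak present' if any detected peak lies within
    x[i] * rel_tol of it, as one simple way to absorb x-axis shift.

    rel_tol is a hypothetical relative tolerance on mass per charge.
    """
    x = np.asarray(x, dtype=float)
    peak_x = x[np.asarray(peaks) == 1]
    out = np.zeros(len(x), dtype=int)
    for i, xi in enumerate(x):
        if np.any(np.abs(peak_x - xi) <= rel_tol * xi):
            out[i] = 1
    return out
```

Applied to every specimen, the widened indicators would form the matrix of binary predictors that the classification step takes as input.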
Our approach is to apply the boosting algorithm to select binary predictors and construct a summary classifier. We empirically evaluate sensitivity and specificity of the resulting summary classifiers with a test dataset that is independent from the training dataset used to construct the summary classifiers. The proposed method performed nearly perfectly in distinguishing cancer and benign hyperplasia from normal. In the classification of cancer vs. benign hyperplasia, however, an appreciable proportion of the benign specimens were classified incorrectly as cancer. We discuss practical issues associated with our proposed approach to the analysis of SELDI output and its application in cancer biomarker discovery.
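As a rough sketch of the classification step, the code below combines binary peak predictors by boosting and then computes sensitivity and specificity on held-out data. Discrete AdaBoost with single-feature rules is used here as one common boosting variant; the abstract does not say which variant the authors used, and all function names and the `n_rounds` parameter are assumptions made for illustration.

```python
import numpy as np

def fit_adaboost_binary(X, y, n_rounds):
    """Discrete AdaBoost over single binary features.

    X: (n, p) array of 0/1 peak indicators; y: labels in {+1, -1}.
    Returns a list of (feature index, polarity, weight) triples.
    """
    X = np.asarray(X)
    y = np.asarray(y)
    n, p = X.shape
    w = np.full(n, 1.0 / n)  # observation weights
    model = []
    for _ in range(n_rounds):
        best = None
        for j in range(p):
            pred = 2 * X[:, j] - 1  # +1 if peak present, -1 otherwise
            for polarity in (1, -1):
                err = np.sum(w[polarity * pred != y])
                if best is None or err < best[0]:
                    best = (err, j, polarity)
        err, j, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * np.log((1 - err) / err)
        pred = polarity * (2 * X[:, j] - 1)
        w = w * np.exp(-alpha * y * pred)  # upweight misclassified cases
        w = w / w.sum()
        model.append((j, polarity, alpha))
    return model

def predict(model, X):
    """Sign of the weighted vote of the selected single-feature rules."""
    X = np.asarray(X)
    score = np.zeros(X.shape[0])
    for j, polarity, alpha in model:
        score += alpha * polarity * (2 * X[:, j] - 1)
    return np.where(score >= 0, 1, -1)

def sensitivity_specificity(y_true, y_pred):
    """Fraction of true +1 called +1, and of true -1 called -1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    sens = np.mean(y_pred[y_true == 1] == 1)
    spec = np.mean(y_pred[y_true == -1] == -1)
    return sens, spec
```

In keeping with the evaluation described above, `fit_adaboost_binary` would be run on the training specimens only, and `sensitivity_specificity` would be computed on predictions for the independent test specimens.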
Pages: 449-463
Number of pages: 15