Protocols for disease classification from mass spectrometry data

Cited by: 85
Authors
Wagner, M
Naik, D
Pothen, A [1]
Affiliations
[1] Old Dominion Univ, Dept Comp Sci, Norfolk, VA 23529 USA
[2] Cincinnati Childrens Hosp, Med Ctr, Cincinnati, OH USA
Keywords
biomarker discovery; discrimination methods; matrix-assisted laser desorption/ionization-time of flight mass spectrometry; support vector machines
DOI
10.1002/pmic.200300519
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
We report our results in classifying protein matrix-assisted laser desorption/ionization-time of flight mass spectra obtained from serum samples into diseased and healthy groups. We discuss in detail five of the steps in preprocessing the mass spectral data for biomarker discovery, as well as our criterion for choosing a small set of peaks for classifying the samples. Cross-validation studies with four selected proteins yielded misclassification rates in the 10-15% range for all the classification methods. Three of these proteins or protein fragments are down-regulated and one is up-regulated in lung cancer, the disease under consideration in this data set. When cross-validation studies are performed, care must be taken to ensure that the test set does not influence the choice of the peaks used in the classification. Misclassification rates are lower when both the training and test sets are used to select the peaks used in classification than when only the training set is used. This expectation was validated for various statistical discrimination methods when thirteen peaks were used in cross-validation studies. One particular classification method, a linear support vector machine, exhibited especially robust performance when the number of peaks was varied from four to thirteen, and when the peaks were selected from the training set alone. Experiments with the samples randomly assigned to the two classes confirmed that misclassification rates were significantly higher in such cases than those observed with the true data. This indicates that our findings are indeed significant. We found closely matching masses in a database for protein expression in lung cancer for three of the four proteins we used to classify lung cancer. Data from additional samples, increased experience with the performance of various preprocessing techniques, and affirmation of the biological roles of the proteins that help in classification will strengthen our conclusions in the future.
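The abstract's caution about cross-validation (peaks must be chosen from the training set only) and its random-label check can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' protocol: the data are synthetic Gaussian "peak intensities" with three peaks lowered and one raised in the diseased class, peaks are ranked by a simple two-sample t statistic, and a nearest-centroid rule stands in for the discrimination methods (including the linear support vector machine) that the paper evaluates. All function names are hypothetical.

```python
import random
import statistics


def make_data(n_per_class=20, n_peaks=50, seed=1):
    """Synthetic peak intensities: peaks 0-2 lowered, peak 3 raised in class 1."""
    rng = random.Random(seed)
    shifts = [-5.0, -5.0, -5.0, 5.0]  # assumed effect sizes, not taken from the paper
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_peaks)]
            if label == 1:
                for j, shift in enumerate(shifts):
                    row[j] += shift
            X.append(row)
            y.append(label)
    return X, y


def t_score(values, labels):
    """Absolute two-sample t statistic (unequal variances) for one peak."""
    a = [v for v, lab in zip(values, labels) if lab == 1]
    b = [v for v, lab in zip(values, labels) if lab == 0]
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return abs(statistics.mean(a) - statistics.mean(b)) / se if se > 0 else 0.0


def select_peaks(X, y, k):
    """Indices of the k peaks with the largest |t| on the samples given."""
    n_peaks = len(X[0])
    scores = [t_score([row[j] for row in X], y) for j in range(n_peaks)]
    return sorted(range(n_peaks), key=lambda j: scores[j], reverse=True)[:k]


def fit_centroids(X, y, peaks):
    """Per-class mean intensity on the selected peaks."""
    centroids = {}
    for label in (0, 1):
        rows = [r for r, lab in zip(X, y) if lab == label]
        centroids[label] = [statistics.mean(r[j] for r in rows) for j in peaks]
    return centroids


def predict(centroids, x, peaks):
    """Label of the closest class centroid (squared Euclidean distance)."""
    def dist(label):
        return sum((x[j] - c) ** 2 for j, c in zip(peaks, centroids[label]))
    return min(centroids, key=dist)


def cv_error(X, y, k_peaks, n_folds=5, select_on_all=False, seed=0):
    """Cross-validated misclassification rate.

    select_on_all=True reproduces the biased setup in which test samples
    also influence which peaks are chosen.
    """
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    all_peaks = select_peaks(X, y, k_peaks) if select_on_all else None
    wrong = 0
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        # Unbiased variant: peaks are re-selected from the training fold only.
        peaks = all_peaks if all_peaks is not None else select_peaks(X_tr, y_tr, k_peaks)
        centroids = fit_centroids(X_tr, y_tr, peaks)
        wrong += sum(predict(centroids, X[i], peaks) != y[i] for i in test)
    return wrong / len(X)


if __name__ == "__main__":
    X, y = make_data()
    print("CV error, peaks chosen per training fold:", cv_error(X, y, 4))
    print("CV error, peaks chosen on all samples:", cv_error(X, y, 4, select_on_all=True))
    y_perm = y[:]
    random.Random(2).shuffle(y_perm)  # random class assignment, as in the sanity check
    print("CV error with shuffled labels:", cv_error(X, y_perm, 4))
```

Because the synthetic classes are strongly separated, the two selection modes are likely to give similar error rates here. The optimistic bias from selecting peaks on all samples matters most when the signal is weak relative to the number of candidate peaks, which is the situation the abstract warns about.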
Pages: 1692-1698
Number of pages: 7