Learning Curves in Classification With Microarray Data

Cited by: 10
Authors
Hess, Kenneth R. [1 ]
Wei, Caimiao [1 ]
Affiliations
[1] Univ Texas MD Anderson Canc Ctr, Dept Biostat, Houston, TX 77030 USA
DOI
10.1053/j.seminoncol.2009.12.002
Chinese Library Classification (CLC) number
R73 [Oncology];
Discipline classification code
100214;
Abstract
The performance of many repeated tasks improves with experience and practice. This improvement tends to be rapid initially and then slows. The term "learning curve" is often used to describe this phenomenon. In supervised machine learning, the performance of classification algorithms often increases with the number of observations used to train the algorithm. We use progressively larger samples of observations to train the algorithm and then plot performance against the number of training observations. This yields the familiar negatively accelerating learning curve. To quantify the learning curve, we fit inverse power law models to the progressively sampled data. We fit such learning curves to four large clinical cancer genomic datasets, using three classifiers (diagonal linear discriminant analysis, K-nearest-neighbor with three neighbors, and support vector machines) and four values for the number of top genes included (5, 50, 500, and 5,000). The inverse power law models fit the progressively sampled data reasonably well and showed considerable diversity when multiple classifiers were applied to the same data. Some classifiers showed a rapid and continued increase in performance as the number of training samples increased, while others showed little if any improvement. Assessing classifier efficiency is particularly important in genomic studies because samples are so expensive to obtain, so it is important to employ an algorithm that uses the predictive information efficiently. With a modest number of training samples (>50), learning curves can be used to assess the predictive efficiency of classification algorithms. © 2010 Elsevier Inc. All rights reserved.
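As a rough illustration of the curve-fitting step described in the abstract, the sketch below fits an inverse power law of the form score(n) = a - b * n^(-c) to performance measured at progressively larger training-set sizes. The abstract does not give the parametric form or the fitting procedure, so the parameter names a, b, and c, the grid of candidate exponents, and the profile least-squares approach are all assumptions made here for illustration, not the authors' method. The data are synthetic.

```python
def fit_inverse_power_law(sizes, scores, exponents=None):
    """Fit score(n) = a - b * n**(-c) by profile least squares.

    Assumed (hypothetical) parameterisation: a is the plateau, b the
    initial gap below it, and c the rate at which the gap closes.
    For each candidate exponent c the model is linear in a and b, so
    they are solved in closed form; the c giving the smallest residual
    sum of squares is kept.
    """
    if exponents is None:
        exponents = [k / 100 for k in range(1, 301)]  # c = 0.01 .. 3.00
    m = len(sizes)
    y_mean = sum(scores) / m
    best = None
    for c in exponents:
        x = [n ** (-c) for n in sizes]
        x_mean = sum(x) / m
        sxx = sum((xi - x_mean) ** 2 for xi in x)
        sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, scores))
        slope = sxy / sxx                 # equals -b in the model
        a = y_mean - slope * x_mean       # intercept equals the plateau a
        rss = sum((yi - (a + slope * xi)) ** 2 for xi, yi in zip(x, scores))
        if best is None or rss < best[0]:
            best = (rss, a, -slope, c)
    _, a, b, c = best
    return a, b, c


def predict(n, a, b, c):
    """Predicted performance at training-set size n."""
    return a - b * n ** (-c)


# Synthetic, noise-free accuracies generated from known parameters,
# standing in for performance at progressively larger training samples.
sizes = [10, 20, 40, 80, 160]
scores = [predict(n, 0.90, 0.80, 0.50) for n in sizes]
a, b, c = fit_inverse_power_law(sizes, scores)
```

Under this parameterisation, a fitted curve with a large c approaches its plateau a after relatively few samples, while a small c corresponds to a classifier that is still improving as samples are added. With real data the scores would come from repeated resampling at each training-set size, and a general nonlinear least-squares routine could replace the grid search.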
Pages: 65-68
Page count: 4