Feature selection with limited datasets

Cited by: 48
Authors
Kupinski, MA [1]
Giger, ML [1]
Affiliation
[1] Univ Chicago, Dept Radiol, Kurt Rossmann Labs, Chicago, IL 60637 USA
Keywords
feature selection; classification; computer-aided diagnosis
DOI
10.1118/1.598821
Chinese Library Classification (CLC) number
R8 [Special Medicine]; R445 [Diagnostic Imaging]
Discipline classification code
1002; 100207; 1009
Abstract
Computer-aided diagnosis has the potential to increase diagnostic accuracy by providing a second reading to radiologists. In many computerized schemes, numerous features can be extracted to describe suspect image regions. A subset of these features is then employed in a data classifier to determine whether the suspect region is abnormal or normal. Different subsets of features will, in general, result in different classification performances. A feature selection method is often used to determine an "optimal" subset of features to use with a particular classifier. A classifier performance measure (such as the area under the receiver operating characteristic curve) must be incorporated into this feature selection process. With limited datasets, however, there is a distribution in the classifier performance measure for a given classifier and subset of features. In this paper, we investigate the variation in the selected subset of "optimal" features as compared with the true optimal subset of features caused by this distribution of classifier performance. We consider examples in which the probability that the optimal subset of features is selected can be analytically computed. We show the dependence of this probability on the dataset sample size, the total number of features from which to select, the number of features selected, and the performance of the true optimal subset. Once a subset of features has been selected, the parameters of the data classifier must be determined. We show that, with limited datasets and/or a large number of features from which to choose, bias is introduced if the classifier parameters are determined using the same data that were employed to select the "optimal" subset of features. (C) 1999 American Association of Physicists in Medicine. [S0094-2405(99)01010-X].
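The effect described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the authors' analytical method. It is a simplified simulation under assumptions introduced here: each "subset" is a single Gaussian feature, only feature 0 is truly informative (its class means differ by a hypothetical `shift`), and the performance measure is the empirical area under the ROC curve. The function names and all parameter values are illustrative. The sketch estimates how often the truly best feature is selected from a small dataset, and compares the selected feature's AUC on the selection data with its AUC on independently drawn data.

```python
import random


def auc(neg, pos):
    """Empirical AUC (Mann-Whitney statistic); ties count as one half."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def draw_dataset(rng, n_per_class, n_features, shift):
    """Draw unit-variance Gaussian features for both classes.

    Assumption for this sketch: only feature 0 separates the classes,
    with the abnormal-class mean offset by `shift`.
    Returns (normal, abnormal), each indexed as [feature][case].
    """
    def sample(is_abnormal):
        return [
            [rng.gauss(shift if (is_abnormal and j == 0) else 0.0, 1.0)
             for _ in range(n_per_class)]
            for j in range(n_features)
        ]
    return sample(False), sample(True)


def run_trials(n_per_class=20, n_features=30, shift=0.8, trials=200, seed=0):
    """Return (fraction of trials selecting feature 0,
    mean AUC of the selected feature on the selection data,
    mean AUC of the selected feature on independent data)."""
    rng = random.Random(seed)
    hits = 0
    auc_selection = 0.0
    auc_independent = 0.0
    for _ in range(trials):
        neg, pos = draw_dataset(rng, n_per_class, n_features, shift)
        scores = [auc(neg[j], pos[j]) for j in range(n_features)]
        best = max(range(n_features), key=scores.__getitem__)
        hits += (best == 0)
        auc_selection += scores[best]
        # Score the chosen feature on data not used for selection.
        neg2, pos2 = draw_dataset(rng, n_per_class, n_features, shift)
        auc_independent += auc(neg2[best], pos2[best])
    return hits / trials, auc_selection / trials, auc_independent / trials


if __name__ == "__main__":
    p_opt, a_sel, a_ind = run_trials()
    print(f"P(true best feature selected): {p_opt:.2f}")
    print(f"mean AUC on selection data:    {a_sel:.3f}")
    print(f"mean AUC on independent data:  {a_ind:.3f}")
```

Varying `n_per_class`, `n_features`, and `shift` in `run_trials` gives a rough feel for the dependencies the abstract lists: sample size, the number of candidate features, and the performance of the true optimal feature.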
Pages: 2176-2182
Number of pages: 7