A Comprehensive Empirical Comparison of Modern Supervised Classification and Feature Selection Methods for Text Categorization

Cited by: 46
Authors
Aphinyanaphongs, Yindalon [1 ,2 ]
Fu, Lawrence D. [1 ,2 ]
Li, Zhiguo [1 ]
Peskin, Eric R. [1 ]
Efstathiadis, Efstratios [1 ]
Aliferis, Constantin F. [1 ,3 ,4 ]
Statnikov, Alexander [1 ,2 ]
Affiliations
[1] NYU, Langone Med Ctr, Ctr Hlth Informat & Bioinformat, New York, NY 10016 USA
[2] NYU, Sch Med, Dept Med, New York, NY 10016 USA
[3] NYU, Sch Med, Dept Pathol, New York, NY 10016 USA
[4] Vanderbilt Univ, Dept Biostat, Nashville, TN 37232 USA
Keywords
machine learning; text processing; information retrieval; MARKOV BLANKET INDUCTION; SUPPORT VECTOR MACHINES; FALSE DISCOVERY RATE; CAUSAL DISCOVERY; LOCAL CAUSAL; MICROARRAY; ALGORITHM;
DOI
10.1002/asi.23110
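Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
080201 [Mechanical Manufacturing and Automation];
Abstract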
An important aspect of performing text categorization is selecting appropriate supervised classification and feature selection methods. A comprehensive benchmark is needed to inform best practices in this broad application field. Previous benchmarks have evaluated performance for a few supervised classification and feature selection methods and limited ways to optimize them. The present work updates prior benchmarks by increasing the number of classifiers and feature selection methods by an order of magnitude, including adding recently developed, state-of-the-art methods. Specifically, this study used 229 text categorization data sets/tasks, and evaluated 28 classification methods (both well-established and proprietary/commercial) and 19 feature selection methods according to 4 classification performance metrics. We report several key findings that will be helpful in establishing best methodological practices for text categorization.
Pages: 1964-1987
Page count: 24
Related papers
58 total
[1]
Aliferis CF, 2010, J MACH LEARN RES, V11, P171
[2]
Aliferis CF, 2010, J MACH LEARN RES, V11, P235
[3]
Androutsopoulos I., 2000, SIGIR Forum, V34, P160
[4]
[Anonymous], 2010, Search engines: Information retrieval in practice
[5]
[Anonymous], 2003, Leslie Pack Kaelbling, DOI 10.1162/153244303322753616
[6]
[Anonymous], 2002, Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms
[7]
Text categorization models for high-quality article retrieval in internal medicine [J].
Aphinyanaphongs, Y ;
Tsamardinos, I ;
Statnikov, A ;
Hardin, D ;
Aliferis, CF .
JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2005, 12 (02) :207-216
[8]
Aphinyanaphongs Y., 2006, P ANN AM MED INF ASS, P6
[9]
Benjamini Y, 2001, ANN STAT, V29, P1165
[10]
CONTROLLING THE FALSE DISCOVERY RATE - A PRACTICAL AND POWERFUL APPROACH TO MULTIPLE TESTING [J].
BENJAMINI, Y ;
HOCHBERG, Y .
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY, 1995, 57 (01) :289-300