Knowledge discovery by accuracy maximization

Cited by: 38
Authors
Cacciatore, Stefano [1, 2, 3, 4]
Luchinat, Claudio [1, 2, 5]
Tenori, Leonardo [5]
Affiliations
[1] Univ Florence, Magnet Resonance Ctr CERM, I-50019 Sesto Fiorentino, Italy
[2] Univ Florence, Dept Chem, I-50019 Sesto Fiorentino, Italy
[3] Harvard Univ, Sch Med, Dept Med Oncol, Dana Farber Canc Inst, Boston, MA 02115 USA
[4] Univ Rovira & Virgili, Spanish Biomed Res Ctr Diabet & Associated Disord, Metabol Platform, E-43007 Tarragona, Spain
[5] FiorGen Fdn, I-50019 Sesto Fiorentino, Italy
Keywords
dissimilarity matrix; mapping; multivariate statistics; clustering; data visualization; nonlinear dimensionality reduction; cluster analysis; expression; classification
DOI
10.1073/pnas.1220873111
Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
07; 0710; 09
Abstract
Here we describe KODAMA (knowledge discovery by accuracy maximization), an unsupervised and semisupervised learning algorithm that performs feature extraction from noisy and high-dimensional data. Unlike other data mining methods, KODAMA is driven by an integrated procedure of cross-validation of the results: the discovery of a local manifold's topology is led by a classifier through a Monte Carlo procedure that maximizes cross-validated predictive accuracy. In this way, the method ensures the highest robustness of the obtained solution. This robustness is demonstrated on experimental datasets of gene expression and metabolomics, where KODAMA compares favorably with other existing feature extraction methods. KODAMA is then applied to an astronomical dataset, revealing unexpected features. Interesting and not easily predictable features are also found in the analysis of the State of the Union speeches by American presidents: KODAMA reveals an abrupt linguistic transition that sharply separates all post-Reagan speeches from all pre-Reagan speeches. The transition occurs during Reagan's presidency rather than at its beginning.
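The idea of a classifier guiding a Monte Carlo search toward higher cross-validated accuracy can be illustrated with a small sketch. This is not the KODAMA implementation: the function names (`loo_knn_accuracy`, `maximize_cv_accuracy`), the choice of a k-nearest-neighbour classifier, leave-one-out validation, the single-label proposal move, and all parameter values are assumptions made only for illustration.

```python
import random


def loo_knn_accuracy(points, labels, k=3):
    """Leave-one-out accuracy of a k-nearest-neighbour classifier."""
    correct = 0
    for i, p in enumerate(points):
        # Squared distances from sample i to every other sample;
        # sample i itself is held out.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, q)), labels[j])
            for j, q in enumerate(points)
            if j != i
        )
        votes = [lab for _, lab in dists[:k]]
        predicted = max(set(votes), key=votes.count)  # majority vote
        correct += predicted == labels[i]
    return correct / len(points)


def maximize_cv_accuracy(points, n_classes=2, n_iter=200, k=3, seed=0):
    """Hypothetical Monte Carlo search over label assignments.

    Starts from random labels, proposes one random label change per
    iteration, and keeps the proposal only if the cross-validated
    accuracy does not decrease.
    """
    rng = random.Random(seed)
    labels = [rng.randrange(n_classes) for _ in points]
    best = loo_knn_accuracy(points, labels, k)
    for _ in range(n_iter):
        candidate = labels[:]
        i = rng.randrange(len(points))
        candidate[i] = rng.randrange(n_classes)
        acc = loo_knn_accuracy(points, candidate, k)
        if acc >= best:
            labels, best = candidate, acc
    return labels, best


if __name__ == "__main__":
    # Two well-separated toy groups (made-up data).
    pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
           (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
    found_labels, accuracy = maximize_cv_accuracy(pts, n_iter=300)
    print(found_labels, accuracy)
```

By construction, the accuracy returned is never lower than that of the random starting labels. Note that this naive version can also reach perfect accuracy by assigning every sample to one class, so a practical method needs additional constraints that this sketch does not attempt to reproduce.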
Pages: 5117-5122
Page count: 6