Predicting interpretability of metabolome models based on behavior, putative identity, and biological relevance of explanatory signals

Cited by: 30
Authors
Enot, David P. [1]
Beckmann, Manfred [1]
Overy, David [1]
Draper, John [1]
Affiliations
[1] Univ Wales, Inst Biol Sci, Aberystwyth SY23 3DA, Dyfed, Wales
Keywords
mass spectral fingerprinting; phenotyping; random forest data analysis
DOI
10.1073/pnas.0605152103
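Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification code
07; 0710; 09
Abstract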
Powerful algorithms are required to deal with the dimensionality of metabolomics data. Although many achieve high classification accuracy, the models they generate have limited value unless it can be demonstrated that they are reproducible and statistically relevant to the biological problem under investigation. Random forest (RF) generates models, without any requirement for dimensionality reduction or feature selection, in which individual variables are ranked for significance and displayed in an explicit manner. In metabolome fingerprinting by mass spectrometry, each metabolite can be represented by signals at several m/z. Exploiting a prior understanding of expected biochemical differences between sample classes, we aimed to develop meaningful metrics relevant to the significance both of the overall RF model and individual, potentially explanatory, signals. Pair-wise comparison of related plant genotypes with strong phenotypic differences demonstrated that robust models are not only reproducible but also logically structured, highlighting correlated m/z derived from just a small number of explanatory metabolites reflecting the biological differences between sample classes. RF models were also generated by using groupings of samples known to be increasingly phenotypically similar. Although classification accuracy was often reasonable, we demonstrated reproducibly in both Arabidopsis and potato a performance threshold based on margin statistics beyond which such models showed little structure indicative of either generalizability or further biological interpretability. In a multiclass problem using 25 Arabidopsis genotypes, despite the complicating effects of ecotype background and secondary metabolome perturbations common to several mutations, the ranking of metabolome signals by RF provided scope for deeper interpretability.
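The abstract refers to a performance threshold based on margin statistics but does not define the statistic. As a minimal sketch only, the block below assumes the standard random forest margin from Breiman (2001): for each sample, the share of tree votes for the true class minus the largest share for any other class. The function name `rf_margins`, the class labels, and the vote data are hypothetical and are not taken from the paper.

```python
from collections import Counter
from statistics import mean


def rf_margins(votes, true_labels):
    """Return one margin per sample, in the range -1 to 1.

    votes[i] holds the class predicted by each tree for sample i.
    Assumed definition (Breiman, 2001): votes for the true class minus
    the most votes for any other class, divided by the number of trees.
    """
    margins = []
    for sample_votes, truth in zip(votes, true_labels):
        counts = Counter(sample_votes)
        true_count = counts.get(truth, 0)
        other_count = max(
            (c for label, c in counts.items() if label != truth),
            default=0,
        )
        margins.append((true_count - other_count) / len(sample_votes))
    return margins


# Hypothetical votes from a 5-tree forest for three samples.
votes = [
    ["wt", "wt", "wt", "wt", "mutant"],          # correct, wide margin
    ["wt", "mutant", "mutant", "wt", "wt"],      # correct, narrow margin
    ["mutant", "mutant", "mutant", "wt", "wt"],  # misclassified
]
true_labels = ["wt", "wt", "wt"]

margins = rf_margins(votes, true_labels)
average_margin = mean(margins)
print(margins, average_margin)
```

A positive margin means the forest voted for the correct class, and a negative one means it did not. An average margin near zero indicates a model that separates the classes only weakly, even when its accuracy looks acceptable.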
Pages: 14865-14870
Page count: 6