Influence of Missing Values Substitutes on Multivariate Analysis of Metabolomics Data

Cited by: 152
Authors
Gromski, Piotr S. [1]
Xu, Yun [1]
Kotze, Helen L. [1]
Correa, Elon [1]
Ellis, David I. [1]
Armitage, Emily Grace [1]
Turner, Michael L. [2]
Goodacre, Royston [1]
Affiliations
[1] Univ Manchester, Manchester Inst Biotechnol, Sch Chem, 131 Princess St, Manchester M1 7DN, Lancs, England
[2] Univ Manchester, Sch Chem, Manchester M13 9PL, Lancs, England
Keywords
missing values; metabolomics; unsupervised learning; supervised learning
DOI
10.3390/metabo4020433
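Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010 [Biochemistry and Molecular Biology]; 081704 [Applied Chemistry]
Abstract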
Missing values are known to be problematic for the analysis of gas chromatography-mass spectrometry (GC-MS) metabolomics data. Typically these values account for about 10%-20% of all data and can arise from various sources: analytical, computational, and biological. Currently, the best-known substitute for missing values is mean imputation. In fact, some researchers consider this step of their metabolomics pipeline so routine that they do not even mention using it. However, such a replacement may have a significant influence on the data analysis output(s) and might be highly sensitive to the distribution of samples between different classes. Therefore, in this study we have analysed different substitutes for missing values, namely zero, mean, median, k-nearest neighbours (kNN) and random forest (RF) imputation, in terms of their influence on unsupervised and supervised learning and, thus, their impact on the final output(s) in terms of biological interpretation. These comparisons are demonstrated both visually and computationally (classification rate) to support our findings. The results show that the choice of replacement method for imputing missing values may have a considerable effect on classification accuracy; if performed incorrectly, it may negatively influence the biomarkers selected for early disease diagnosis or the identification of cancer-related metabolites. For the GC-MS metabolomics data studied here, our findings recommend that RF should be favoured for imputing missing values over the other tested methods. This approach gave excellent classification rates for both supervised methods, namely principal components-linear discriminant analysis (PC-LDA) (98.02%) and partial least squares-discriminant analysis (PLS-DA) (97.96%), outperforming the other imputation methods.
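To make the substitutes named in the abstract concrete, the following is a minimal Python sketch of zero, mean, median and kNN imputation on toy data. It is not the authors' implementation: the function names (`impute_column`, `knn_impute`), the toy values, the choice of k and the distance measure are all assumptions made for illustration. RF imputation is left out because it needs a tree-ensemble library rather than a few lines of standard-library code.

```python
import math
import statistics

NAN = float("nan")


def impute_column(values, method):
    """Fill NaN entries of one feature column with a single substitute value.

    `method` is "zero", "mean" or "median"; the mean and median are computed
    from the observed (non-NaN) entries only.
    """
    observed = [v for v in values if not math.isnan(v)]
    if method == "zero":
        fill = 0.0
    elif method == "mean":
        fill = statistics.fmean(observed)
    elif method == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if math.isnan(v) else v for v in values]


def knn_impute(rows, k=2):
    """Fill each NaN with the mean of that feature over the k nearest samples.

    Assumption: distance is the root-mean-square difference over the features
    observed in both samples; other kNN variants weight or scale differently.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b)
                  if not (math.isnan(x) or math.isnan(y))]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if not math.isnan(value):
                continue
            # Candidate donors: other samples where feature j was observed.
            donors = sorted(
                (distance(row, other), other[j])
                for n, other in enumerate(rows)
                if n != i and not math.isnan(other[j])
            )
            nearest = [val for dist, val in donors[:k] if dist < math.inf]
            if nearest:
                filled[i][j] = statistics.fmean(nearest)
    return filled


if __name__ == "__main__":
    # Hypothetical peak areas for one metabolite across four samples.
    peak_area = [1.0, 2.0, NAN, 9.0]
    for method in ("zero", "mean", "median"):
        print(method, impute_column(peak_area, method))

    # Hypothetical samples (rows) by metabolites (columns).
    samples = [[1.0, 10.0], [2.0, 20.0], [2.1, NAN], [9.0, 90.0]]
    print("kNN", knn_impute(samples, k=2))
```

On this toy column the three single-value substitutes already disagree (0, 4 and 2 for the same gap), which illustrates how the choice of substitute can shift the data that later unsupervised and supervised models see.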
Pages: 433-452
Number of pages: 20