Dealing with missing values in large-scale studies: microarray data imputation and beyond

Cited by: 120
Authors
Aittokallio, Tero [1]
Affiliations
[1] Univ Turku, Dept Math, Biomath Res Grp, FI-20014 Turku, Finland
Funding
Academy of Finland
Keywords
missing value imputation; gene expression microarrays; mass-spectrometry proteomics; statistical modelling; biomarker discovery; disease classification; GENE-EXPRESSION DATA; LEAST-SQUARES IMPUTATION; DIFFERENTIAL EXPRESSION; BIOLOGICAL KNOWLEDGE; DNA MICROARRAYS; INCOMPLETE DATA; PROFILES; QUANTITATION; ASSOCIATION; INTERACTOME;
DOI
10.1093/bib/bbp059
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
High-throughput biotechnologies, such as gene expression microarrays or mass-spectrometry-based proteomic assays, suffer from frequent missing values due to various experimental reasons. Since the missing data points can hinder downstream analyses, there exists a wide variety of ways in which to deal with missing values in large-scale data sets. Nowadays, it has become routine to estimate (or impute) the missing values prior to the actual data analysis. Nearly a decade after the publication of the first missing value imputation methods for gene expression microarray data, new imputation approaches are still being developed at an increasing rate. However, what is lagging behind is a systematic and objective evaluation of the strengths and weaknesses of the different approaches when faced with different types of data sets and experimental questions. In this review, the present strategies for missing value imputation and the measures for evaluating their performance are described. The imputation methods are first reviewed in the context of gene expression microarray data, since most of the methods have been developed for estimating gene expression levels; then, we turn to other large-scale data sets that also suffer from the problems posed by missing values, together with pointers to possible imputation approaches in these settings. Along with a description of the basic principles behind the different imputation approaches, the review tries to provide practical guidance for the users of high-throughput technologies on how to choose the imputation tool for their data and questions, and some additional research directions for the developers of imputation methodologies.
Pages: 253-264
Page count: 12