Centroid estimation in discrete high-dimensional spaces with applications in biology

Cited by: 63
Authors
Carvalho, Luis E. [1]
Lawrence, Charles E. [1]
Affiliations
[1] Brown Univ, Div Appl Math, Providence, RI 02912 USA
Keywords
prediction; statistical inference; computational biology; discrete decoding
DOI
10.1073/pnas.0712329105
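Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification code
07; 0710; 09
Abstract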
Maximum likelihood estimators and other direct optimization-based estimators dominated statistical estimation and prediction for decades. Yet, the principled foundations supporting their dominance do not apply to the discrete high-dimensional inference problems of the 21st century. As is well known, statistical decision theory shows that maximum likelihood and related estimators use data only to identify the single most probable solution. Accordingly, unless this one solution so dominates the immense ensemble of all solutions that its probability is near one, there is no principled reason to expect such an estimator to be representative of the posterior-weighted ensemble of solutions, and thus represent inferences drawn from the data. We employ statistical decision theory to find more representative estimators, centroid estimators, in a general high-dimensional discrete setting by using a family of loss functions with penalties that increase with the number of differences in components. We show that centroid estimates are obtained by maximizing the marginal probabilities of the solution components for unconstrained ensembles and for an important class of problems, including sequence alignment and the prediction of RNA secondary structure, whose ensembles contain exclusivity constraints. Three genomics examples are described that show that these estimators substantially improve predictions of ground-truth reference sets.
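The contrast the abstract draws between the single most probable solution and a centroid can be illustrated with a minimal Python sketch. It assumes the simplest member of the loss family (plain Hamming distance) and an unconstrained ensemble of binary vectors; the toy posterior values and all function names are hypothetical and are not taken from the paper. Under these assumptions, thresholding each component's marginal probability at 1/2 should agree with a brute-force search for the vector of minimum expected Hamming loss.

```python
from itertools import product

# Hypothetical toy posterior over 3-component binary solutions
# (illustrative values only, not data from the paper).
posterior = {
    (0, 0, 0): 0.30,
    (1, 1, 0): 0.25,
    (1, 0, 1): 0.25,
    (1, 1, 1): 0.20,
}

def hamming(a, b):
    """Number of components in which two solutions differ."""
    return sum(x != y for x, y in zip(a, b))

def map_estimate(post):
    """Single most probable solution in the ensemble."""
    return max(post, key=post.get)

def marginals(post):
    """Posterior probability that each component equals 1."""
    n = len(next(iter(post)))
    return [sum(p for s, p in post.items() if s[i] == 1) for i in range(n)]

def centroid_by_marginals(post):
    """Keep each component whose marginal probability exceeds 1/2."""
    return tuple(int(m > 0.5) for m in marginals(post))

def expected_loss(candidate, post):
    """Posterior-expected Hamming distance from candidate to the ensemble."""
    return sum(p * hamming(candidate, s) for s, p in post.items())

def centroid_by_search(post):
    """Brute-force minimizer of expected Hamming loss over all binary vectors."""
    n = len(next(iter(post)))
    return min(product((0, 1), repeat=n), key=lambda c: expected_loss(c, post))

if __name__ == "__main__":
    print("MAP:", map_estimate(posterior))
    print("centroid (marginals):", centroid_by_marginals(posterior))
    print("centroid (search):", centroid_by_search(posterior))
```

In this toy ensemble the most probable solution carries only 0.30 of the posterior mass, so it differs from the centroid. The sketch does not cover ensembles with exclusivity constraints, such as alignments or RNA secondary structures.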
Pages: 3209-3214
Page count: 6