Efficient exact p-value computation for small sample, sparse, and surprising categorical data

被引:17
作者
Bejerano, G
Friedman, N
Tishby, N
机构
[1] Univ Calif Santa Cruz, Sch Engn, Ctr Biomol Sci & Engn, Santa Cruz, CA 95064 USA
[2] Hebrew Univ Jerusalem, Sch Engn & Comp Sci, IL-91904 Jerusalem, Israel
关键词
p-value; exact tests; branch and bound; real extension; categorical data;
D O I
10.1089/cmb.2004.11.867
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
A major obstacle in applying various hypothesis testing procedures to datasets in bioinformatics is the computation of ensuing p-values. In this paper, we define a generic branch-and-bound approach to efficient exact p-value computation and enumerate the required conditions for successful application. Explicit procedures are developed for the entire Cressie-Read family of statistics, which includes the widely used Pearson and likelihood ratio statistics in a one-way frequency table goodness-of-fit test. This new formulation constitutes a first practical exact improvement over the exhaustive enumeration performed by existing statistical software. The general techniques we develop to exploit the convexity of many statistics are also shown to carry over to contingency table tests, suggesting that they are readily extendible to other tests and test statistics of interest. Our empirical results demonstrate a speed-up of orders of magnitude over the exhaustive computation, significantly extending the practical range for performing exact tests. We also show that the relative speed-up gain increases as the null hypothesis becomes sparser, that computation precision increases with increase in speed-up, and that computation time is very moderately affected by the magnitude of the computed p-value. These qualities make our algorithm especially appealing in the regimes of small samples, sparse null distributions, and rare events, compared to the alternative asymptotic approximations and Monte Carlo samplers. We discuss several established bioinformatics applications, where small sample size, small expected counts in one or more categories ( sparseness), and very small p-values do occur. Our computational framework could be applied in these, and similar cases, to improve performance.
引用
收藏
页码:867 / 886
页数:20
相关论文
共 22 条
[1]   Exact inference for categorical data: recent advances and continuing controversies [J].
Agresti, A .
STATISTICS IN MEDICINE, 2001, 20 (17-18) :2709-2722
[2]  
Agresti A., 1992, STAT SCI, V7, P131, DOI DOI 10.1214/SS/1177011454
[3]   METHODS FOR EXACT GOODNESS-OF-FIT TESTS [J].
BAGLIVO, J ;
OLIVIER, D ;
PAGANO, M .
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 1992, 87 (418) :464-469
[4]  
Benson Harold P., 1995, Handbook of Global Optimization, P43
[5]   Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors [J].
Bulyk, ML ;
Johnson, PLF ;
Church, GM .
NUCLEIC ACIDS RESEARCH, 2002, 30 (05) :1255-1261
[6]   Protein fold similarity estimated by a probabilistic approach based on Cα-Cα distance comparison [J].
Carugo, O ;
Pongor, S .
JOURNAL OF MOLECULAR BIOLOGY, 2002, 315 (04) :887-898
[7]   GENERALIZED ITERATIVE SCALING FOR LOG-LINEAR MODELS [J].
DARROCH, JN ;
RATCLIFF, D .
ANNALS OF MATHEMATICAL STATISTICS, 1972, 43 (05) :1470-&
[8]   Blocks+: a non-redundant database of protein alignment blocks derived from multiple compilations [J].
Henikoff, S ;
Henikoff, JG ;
Pietrokovski, S .
BIOINFORMATICS, 1999, 15 (06) :471-479
[9]   Identifying DNA and protein patterns with statistically significant alignments of multiple sequences [J].
Hertz, GZ ;
Stormo, GD .
BIOINFORMATICS, 1999, 15 (7-8) :563-577
[10]   VALIDITY OF THE CHI-SQUARED TEST WHEN EXPECTED FREQUENCIES ARE SMALL - LIST OF RECENT RESEARCH REFERENCES [J].
HUTCHINSON, TP .
COMMUNICATIONS IN STATISTICS PART A-THEORY AND METHODS, 1979, 8 (04) :327-335