A statistical methodology for analyzing co-occurrence data from a large sample

Cited by: 32
Authors
Cao, Hui
Hripcsak, George
Markatou, Marianthi
Affiliations
[1] Columbia Univ, Dept Biomed Informat, New York, NY 10032 USA
[2] Columbia Univ, Dept Biostat, New York, NY 10032 USA
Keywords
associations; co-occurrence; two-way tables; volume test adjustments; p-value plot; large-scale testing
DOI
10.1016/j.jbi.2006.11.003
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Determining important associations among items in a large database is challenging due to multiple simultaneous hypotheses and the ability to select weak associations that are statistically but not clinically significant. The simple application of the χ² test among all possible pairs of items results in mostly inappropriate associations surpassing the traditional (α = .05, χ² = 3.84) threshold. One can choose a stricter threshold to find stronger associations, but the choice may be arbitrary. We combined the volume test of Diaconis and Efron with a p-value plot to select a more rigorous and less arbitrary threshold. The volume test adjusts the p-value of the χ² statistic. A plot of adjusted p-values ($1 - p$ versus $N_p$), where $N_p$ is the number of test statistics with a p-value greater than $p$, should be linear if there are no true associations. The point where the plot deviates from a line can be used as a threshold. We used linear regression to select the threshold in a reproducible fashion. In one experiment, we found that the method selected a threshold similar to that previously obtained by manually reviewing associations. © 2006 Elsevier Inc. All rights reserved.
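The p-value plot and regression step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes unadjusted χ² p-values with one degree of freedom as a stand-in for the volume-test adjustment, which is not reproduced here. The function names, the `fit_upto` cutoff for the regression, and the `tol` deviation rule are all hypothetical choices made for illustration.

```python
import math
from bisect import bisect_right


def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate margins: no evidence of association
    return n * (a * d - b * c) ** 2 / denom


def chi2_pvalue_1df(x):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def p_value_plot(pvalues):
    """Return (1 - p, N_p, p) triples ordered by increasing 1 - p.

    N_p is the number of p-values strictly greater than p.
    """
    asc = sorted(pvalues)
    n = len(asc)
    return [(1.0 - p, n - bisect_right(asc, p), p) for p in reversed(asc)]


def fit_line(points):
    """Ordinary least-squares line through the (1 - p, N_p) coordinates."""
    xs = [pt[0] for pt in points]
    ys = [pt[1] for pt in points]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)  # needs at least two distinct x values
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def select_threshold(pvalues, fit_upto=0.5, tol=1.0):
    """Hypothetical rule: fit the line where 1 - p <= fit_upto, then return the
    first p (scanning toward small p) whose N_p exceeds the line by more than tol."""
    points = p_value_plot(pvalues)
    baseline = [pt for pt in points if pt[0] <= fit_upto]
    slope, intercept = fit_line(baseline)
    for x, n_p, p in points:
        if n_p - (slope * x + intercept) > tol:
            return p, slope
    return None, slope


# Synthetic data (not from the paper): 100 evenly spread "null" p-values
# plus 20 very small p-values standing in for true associations.
demo_pvalues = [(i + 0.5) / 100 for i in range(100)]
demo_pvalues += [(k + 1) * 1e-6 for k in range(20)]
demo_threshold, demo_slope = select_threshold(demo_pvalues)
```

Under the null, p-values are roughly uniform, so $N_p$ grows linearly in $1 - p$ and the fitted slope estimates the number of true null hypotheses. An excess of small p-values pushes the right end of the plot above that line, which is the deviation this sketch looks for.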
Pages: 343-352
Page count: 10
Related papers
7 in total
[1] CAO H, 2005, P AMIA S, P106
[2] COCHRAN WG. Some methods for strengthening the common χ² tests [J]. BIOMETRICS, 1954, 10(4): 417-451
[3] DIACONIS P, 1985, ANN STAT, V13, P845, DOI 10.1214/aos/1176349634
[4] EFRON B. Large-scale simultaneous hypothesis testing: The choice of a null hypothesis [J]. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2004, 99(465): 96-104
[5] LINDSAY BG, IN PRESS SPRINGER SE
[6] SCHWEDER T, 1982, BIOMETRIKA, V69, P493
[7] YATES F, 1934, JR STATIST SOC S, V1, P217, DOI 10.2307/2983604