Second-generation PLINK: rising to the challenge of larger and richer datasets

Cited by: 7546
Authors
Chang, Christopher C. [1, 2]
Chow, Carson C. [3]
Tellier, Laurent C. A. M. [2, 4]
Vattikuti, Shashaank [3]
Purcell, Shaun M. [5, 6, 7, 8]
Lee, James J. [3, 9]
Affiliations
[1] Complete Genom, Mountain View, CA 94043 USA
[2] BGI Cognit Genom Lab, Shenzhen 518083, Peoples R China
[3] NIDDK, Math Biol Sect, LBM, NIH, Bethesda, MD 20892 USA
[4] Univ Copenhagen, Bioinformat Ctr, DK-2200 Copenhagen, Denmark
[5] Broad Inst MIT & Harvard, Stanley Ctr Psychiat Res, Cambridge, MA 02142 USA
[6] Icahn Sch Med Mt Sinai, Div Psychiat Genom, Dept Psychiat, New York, NY 10029 USA
[7] Icahn Sch Med Mt Sinai, Inst Genom & Multiscale Biol, New York, NY 10029 USA
[8] Massachusetts Gen Hosp, Analyt & Translat Genet Unit, Psychiat & Neurodev Genet Unit, Boston, MA 02114 USA
[9] Univ Minnesota Twin Cities, Dept Psychol, Minneapolis, MN 55455 USA
Source
GIGASCIENCE | 2015 / Vol. 4
Keywords
GWAS; Population genetics; Whole-genome sequencing; High-density SNP genotyping; Computational statistics; FISHERS EXACT TEST; LINKAGE DISEQUILIBRIUM; EXACT TESTS; ASSOCIATION; ALGORITHM; PERMUTATION; PERFORMANCE; IDENTITY; FORMAT; FEXACT;
DOI
10.1186/s13742-015-0047-8
Chinese Library Classification (CLC) number
Q [Biological Sciences]
Subject classification codes
07; 0710; 09
Abstract
Background: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format.
Findings: To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(√n)-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0).
Conclusions: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
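The abstract names bit-level parallelism without describing how it is implemented. As a rough illustration of the general idea only, the sketch below assumes a hypothetical layout in which 32 two-bit genotype codes are packed into one 64-bit word, and tallies all four codes with a handful of bitwise operations instead of a 32-iteration loop. The function names (`pack_genotypes`, `count_codes`) and the packing order are assumptions made for this example; they are not taken from the PLINK source.

```python
# Illustrative sketch only: hypothetical 2-bit packing, 32 genotypes per 64-bit word.
MASK_LOW = 0x5555555555555555  # selects the low bit of every 2-bit field


def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def pack_genotypes(codes):
    """Pack exactly 32 two-bit codes (0..3) into one integer, first code in the lowest bits."""
    if len(codes) != 32:
        raise ValueError("this sketch expects exactly 32 codes per word")
    word = 0
    for i, code in enumerate(codes):
        word |= (code & 3) << (2 * i)
    return word


def count_codes(word: int):
    """Return (n0, n1, n2, n3): how many of the 32 fields hold each code.

    All 32 fields are processed at once by masking and popcounting,
    which is the sense of "bit-level parallelism" illustrated here.
    """
    lo = word & MASK_LOW          # low bit of each field
    hi = (word >> 1) & MASK_LOW   # high bit of each field, shifted into the low slot
    n3 = popcount(lo & hi)        # both bits set   -> code 3
    n1 = popcount(lo) - n3        # only low bit    -> code 1
    n2 = popcount(hi) - n3        # only high bit   -> code 2
    n0 = 32 - n1 - n2 - n3        # neither bit set -> code 0
    return n0, n1, n2, n3
```

In a compiled language the `popcount` helper would typically map to a single hardware instruction, which is where most of the practical speedup of this style of counting comes from.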
Pages: 16