Second-generation PLINK: rising to the challenge of larger and richer datasets

Cited by: 7546
Authors
Chang, Christopher C. [1 ,2 ]
Chow, Carson C. [3 ]
Tellier, Laurent C. A. M. [2 ,4 ]
Vattikuti, Shashaank [3 ]
Purcell, Shaun M. [5 ,6 ,7 ,8 ]
Lee, James J. [3 ,9 ]
Affiliations
[1] Complete Genom, Mountain View, CA 94043 USA
[2] BGI Cognit Genom Lab, Shenzhen 518083, Peoples R China
[3] NIDDK, Math Biol Sect, LBM, NIH, Bethesda, MD 20892 USA
[4] Univ Copenhagen, Bioinformat Ctr, DK-2200 Copenhagen, Denmark
[5] Broad Inst MIT & Harvard, Stanley Ctr Psychiat Res, Cambridge, MA 02142 USA
[6] Icahn Sch Med Mt Sinai, Div Psychiat Genom, Dept Psychiat, New York, NY 10029 USA
[7] Icahn Sch Med Mt Sinai, Inst Genom & Multiscale Biol, New York, NY 10029 USA
[8] Massachusetts Gen Hosp, Analyt & Translat Genet Unit, Psychiat & Neurodev Genet Unit, Boston, MA 02114 USA
[9] Univ Minnesota Twin Cities, Dept Psychol, Minneapolis, MN 55455 USA
Source
GIGASCIENCE | 2015 / Vol. 4
Keywords
GWAS; Population genetics; Whole-genome sequencing; High-density SNP genotyping; Computational statistics; FISHERS EXACT TEST; LINKAGE DISEQUILIBRIUM; EXACT TESTS; ASSOCIATION; ALGORITHM; PERMUTATION; PERFORMANCE; IDENTITY; FORMAT; FEXACT;
DOI
10.1186/s13742-015-0047-8
Chinese Library Classification number
Q [Biological Sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract
Background: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format. Findings: To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(√n)-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0). Conclusions: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
Pages: 16