Second-generation PLINK: rising to the challenge of larger and richer datasets

Cited by: 7546

Authors:

Chang, Christopher C. [1,2]

Chow, Carson C. [3]

Tellier, Laurent C. A. M. [2,4]

Vattikuti, Shashaank [3]

Purcell, Shaun M. [5,6,7,8]

Lee, James J. [3,9]

Affiliations:

[1] Complete Genom, Mountain View, CA 94043 USA

[2] BGI Cognit Genom Lab, Shenzhen 518083, Peoples R China

[3] NIDDK, Math Biol Sect, LBM, NIH, Bethesda, MD 20892 USA

[4] Univ Copenhagen, Bioinformat Ctr, DK-2200 Copenhagen, Denmark

[5] Broad Inst MIT & Harvard, Stanley Ctr Psychiat Res, Cambridge, MA 02142 USA

[6] Icahn Sch Med Mt Sinai, Div Psychiat Genom, Dept Psychiat, New York, NY 10029 USA

[7] Icahn Sch Med Mt Sinai, Inst Genom & Multiscale Biol, New York, NY 10029 USA

[8] Massachusetts Gen Hosp, Analyt & Translat Genet Unit, Psychiat & Neurodev Genet Unit, Boston, MA 02114 USA

[9] Univ Minnesota Twin Cities, Dept Psychol, Minneapolis, MN 55455 USA

Source:

GIGASCIENCE | 2015 / Vol. 4

Keywords:

GWAS; Population genetics; Whole-genome sequencing; High-density SNP genotyping; Computational statistics; FISHERS EXACT TEST; LINKAGE DISEQUILIBRIUM; EXACT TESTS; ASSOCIATION; ALGORITHM; PERMUTATION; PERFORMANCE; IDENTITY; FORMAT; FEXACT;

DOI:

10.1186/s13742-015-0047-8

Chinese Library Classification:

Q [Biological sciences];

Subject classification codes:

07 ; 0710 ; 09 ;

Abstract:

Background: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and more scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format.

Findings: To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(√n)-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0).

Conclusions: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
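The abstract names bit-level parallelism without showing what it looks like. The sketch below is one illustrative possibility, not PLINK's actual code: it assumes genotypes are stored as 2-bit codes packed 32 to a 64-bit word, and it tallies all four codes with a few bitwise operations and population counts instead of a per-sample loop. The function names (`pack`, `popcount`, `count_codes`) and the choice of codes 0 to 3 are hypothetical, chosen only for this example.

```python
# Illustrative sketch only: names and encoding are assumptions, not PLINK's API.
MASK_LO = 0x5555555555555555  # selects the low bit of every 2-bit slot in a 64-bit word


def pack(genotypes):
    """Pack up to 32 two-bit genotype codes (0..3) into one integer word."""
    if len(genotypes) > 32:
        raise ValueError("at most 32 genotypes fit in one 64-bit word")
    word = 0
    for i, g in enumerate(genotypes):
        word |= (g & 3) << (2 * i)
    return word


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def count_codes(word, n):
    """Count how many of the first n slots hold code 0, 1, 2, and 3.

    Each slot is split into its low and high bit; three popcounts on
    combinations of those bit planes give the counts for codes 1, 2, 3,
    and code 0 follows by subtraction from n.
    """
    lo = word & MASK_LO          # low bit of each slot
    hi = (word >> 1) & MASK_LO   # high bit of each slot, shifted into low position
    c3 = popcount(lo & hi)       # both bits set
    c1 = popcount(lo & ~hi)      # only the low bit set
    c2 = popcount(hi & ~lo)      # only the high bit set
    c0 = n - c1 - c2 - c3        # unused trailing slots are zero, so subtract from n
    return c0, c1, c2, c3


genotypes = [0, 1, 2, 3, 3, 2, 2, 0]
counts = count_codes(pack(genotypes), len(genotypes))
print(counts)
```

In a compiled language the same idea maps onto hardware popcount instructions, so a single word operation handles 32 samples at once; that is the general sense in which packed genotype data lends itself to bit-level parallelism.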

Pages: 16

References: 44 total
