Second-generation PLINK: rising to the challenge of larger and richer datasets

Cited by: 7546

Authors:

Chang, Christopher C. ^{[1,2]}

Chow, Carson C. ^{[3]}

Tellier, Laurent C. A. M. ^{[2,4]}

Vattikuti, Shashaank ^{[3]}

Purcell, Shaun M. ^{[5,6,7,8]}

Lee, James J. ^{[3,9]}

Affiliations:

[1] Complete Genom, Mountain View, CA 94043 USA

[2] BGI Cognit Genom Lab, Shenzhen 518083, Peoples R China

[3] NIDDK, Math Biol Sect, LBM, NIH, Bethesda, MD 20892 USA

[4] Univ Copenhagen, Bioinformat Ctr, DK-2200 Copenhagen, Denmark

[5] Broad Inst MIT & Harvard, Stanley Ctr Psychiat Res, Cambridge, MA 02142 USA

[6] Icahn Sch Med Mt Sinai, Div Psychiat Genom, Dept Psychiat, New York, NY 10029 USA

[7] Icahn Sch Med Mt Sinai, Inst Genom & Multiscale Biol, New York, NY 10029 USA

[8] Massachusetts Gen Hosp, Analyt & Translat Genet Unit, Psychiat & Neurodev Genet Unit, Boston, MA 02114 USA

[9] Univ Minnesota Twin Cities, Dept Psychol, Minneapolis, MN 55455 USA

Source:

GIGASCIENCE | 2015 / Vol. 4

Keywords:

GWAS; Population genetics; Whole-genome sequencing; High-density SNP genotyping; Computational statistics; FISHERS EXACT TEST; LINKAGE DISEQUILIBRIUM; EXACT TESTS; ASSOCIATION; ALGORITHM; PERMUTATION; PERFORMANCE; IDENTITY; FORMAT; FEXACT;

DOI:

10.1186/s13742-015-0047-8

Chinese Library Classification (CLC) number:

Q [Biological Sciences];

Discipline classification codes:

07 ; 0710 ; 09 ;

Abstract:

Background: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format. Findings: To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(√n)-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0). Conclusions: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
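Illustrative note: the "bit-level parallelism" mentioned in the abstract can be pictured with a small sketch. It assumes the two-bit genotype codes of the PLINK 1 binary format (00 = homozygous for the first allele, 01 = missing, 10 = heterozygous, 11 = homozygous for the second allele) and packs 32 genotypes into one 64-bit word, so that a few mask and population-count operations process all 32 at once. The function names `pack_genotypes` and `count_word` are hypothetical and chosen for this example; this is not PLINK's actual implementation, which is written in C/C++.

```python
# Low bit of every 2-bit genotype slot in a 64-bit word (0b0101...01).
LOW_BITS = 0x5555555555555555


def pack_genotypes(codes):
    """Pack up to 32 two-bit genotype codes into one integer.

    The first genotype occupies the lowest two bits. Unused slots stay 00.
    """
    if len(codes) > 32:
        raise ValueError("at most 32 genotypes fit in one 64-bit word")
    word = 0
    for i, code in enumerate(codes):
        word |= (code & 0b11) << (2 * i)
    return word


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def count_word(word):
    """Return (second_allele_copies, missing_genotypes) for one packed word.

    Assumed codes: 00 = 0 copies, 10 = 1 copy, 11 = 2 copies, 01 = missing.
    """
    lo = word & LOW_BITS          # low bit of each genotype
    hi = (word >> 1) & LOW_BITS   # high bit of each genotype, shifted into the low slot
    # A set high bit contributes one copy; high and low both set contributes a second.
    copies = popcount(hi) + popcount(hi & lo)
    # Missing genotypes have the low bit set and the high bit clear.
    missing = popcount(lo & (hi ^ LOW_BITS))
    return copies, missing


if __name__ == "__main__":
    genotypes = [0b00, 0b10, 0b11, 0b01, 0b11]
    print(count_word(pack_genotypes(genotypes)))
```

The counts come from three mask operations and three population counts per word, regardless of how many of the 32 slots are occupied, which is the sense in which the work is done in parallel at the bit level.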


Pages: 16
