Fast Principal Component Analysis of Large-Scale Genome-Wide Data

Cited by: 211
Authors
Abraham, Gad [1]
Inouye, Michael
Affiliations
[1] Univ Melbourne, Dept Pathol, Parkville, Vic 3052, Australia
Source
PLOS ONE | 2014 / Vol. 9 / Issue 04
Funding
Medical Research Council (UK); National Health and Medical Research Council (Australia);
Keywords
ASSOCIATION; POPULATIONS; COMMON;
DOI
10.1371/journal.pone.0093766
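Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract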
Principal component analysis (PCA) is routinely used to analyze genome-wide single-nucleotide polymorphism (SNP) data, for detecting population structure and potential outliers. However, the size of SNP datasets has increased immensely in recent years, and PCA of large datasets has become a time-consuming task. We have developed flashpca, a highly efficient PCA implementation based on randomized algorithms, which delivers identical accuracy in extracting the top principal components compared with existing tools, in substantially less time. We demonstrate the utility of flashpca on both HapMap3 and a large Immunochip dataset. For the latter, flashpca performed PCA of 15,000 individuals up to 125 times faster than existing tools, with identical results, and PCA of 150,000 individuals using flashpca completed in 4 hours. The increasing size of SNP datasets will make tools such as flashpca essential, as traditional approaches will not adequately scale. This approach will also help to scale other applications that leverage PCA or eigen-decomposition to substantially larger datasets.
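The abstract says flashpca is based on randomized algorithms but does not spell out the procedure. As a rough illustration only, the following NumPy sketch shows a generic randomized PCA (random projection, a few power iterations, then a small exact SVD). It is not flashpca's actual code: the function name `randomized_pca`, its parameters, the column-centering step, and the default oversampling and iteration counts are all assumptions made for this example.

```python
import numpy as np


def randomized_pca(X, n_components, n_oversamples=10, n_power_iter=4, seed=0):
    """Approximate the top principal components of X (samples x SNPs).

    Hypothetical helper for illustration; not the flashpca implementation.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape

    # Assumption: centre each column (SNP); real tools may also rescale.
    Xc = X - X.mean(axis=0)

    # Work with a slightly larger subspace than requested for stability.
    k = min(n_components + n_oversamples, n, p)

    # Random projection: sample the column space of Xc.
    Q, _ = np.linalg.qr(Xc @ rng.standard_normal((p, k)))

    # Power iterations sharpen the separation between leading and
    # trailing singular values; QR keeps the basis well conditioned.
    for _ in range(n_power_iter):
        Z, _ = np.linalg.qr(Xc.T @ Q)
        Q, _ = np.linalg.qr(Xc @ Z)

    # Exact SVD of the small projected matrix (k x p).
    B = Q.T @ Xc
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)

    # Map back to sample space and keep the requested components.
    U = Q @ U_small
    scores = U[:, :n_components] * s[:n_components]
    return scores, s[:n_components], Vt[:n_components]
```

The cost is dominated by a handful of products between the data matrix and a thin matrix with `k` columns, rather than a full decomposition, which is the general reason randomized methods scale to many individuals when only the top components are needed.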
Pages: 5