Sparse Principal Component Analysis for Identifying Ancestry-Informative Markers in Genome-Wide Association Studies

Times cited: 32
Authors
Lee, Seokho [2]
Epstein, Michael P. [3]
Duncan, Richard [3]
Lin, Xihong [1]
Affiliations
[1] Harvard Univ, Sch Publ Hlth, Dept Biostat, Boston, MA 02115 USA
[2] Hankuk Univ Foreign Studies, Dept Stat, Yongin, South Korea
[3] Emory Univ, Sch Med, Dept Human Genet, Atlanta, GA USA
Funding
National Institutes of Health (USA); National Research Foundation, Singapore
Keywords
ancestry-informative markers; genome-wide association studies; population stratification; principal component analysis; variable selection; POPULATION STRATIFICATION; SEMIPARAMETRIC TEST; ADMIXTURE; PANEL; MAP;
DOI
10.1002/gepi.21621
Chinese Library Classification number
Q3 [Genetics]
Subject classification codes
071007; 090102
Abstract
Genome-wide association studies (GWAS) routinely apply principal component analysis (PCA) to infer population structure within a sample and correct for confounding due to ancestry. GWAS implementations of PCA use tens of thousands of single-nucleotide polymorphisms (SNPs) to infer structure, even though only a small fraction of these SNPs provide useful information on ancestry. Identifying this reduced set of ancestry-informative markers (AIMs) from a GWAS has practical value; for example, researchers can genotype the AIM set to correct for potential confounding due to ancestry in follow-up studies that use custom SNP or sequencing technology. We propose a novel technique to identify AIMs from genome-wide SNP data using sparse PCA. The procedure uses penalized regression methods to identify those SNPs in a genome-wide panel that contribute significantly to the principal components, while encouraging SNPs with negligible loadings to vanish from the analysis. We found that sparse PCA leads to negligible loss of ancestry information compared to traditional PCA of genome-wide SNP data. We further demonstrate the value of sparse PCA for AIM selection using real data from the International HapMap Project and a genome-wide study of inflammatory bowel disease. We have implemented our approach in open-source R software for public use. Genet. Epidemiol. 36:293-302, 2012. © 2012 Wiley Periodicals, Inc.
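The abstract describes the selection idea only at a high level, so the following is a minimal sketch of one way it could work, not the authors' R implementation. It assumes a rank-one sparse PCA computed by alternating power iterations with a soft-threshold (lasso-type) penalty on the loadings. The toy genotype matrix `G`, the penalty value `lam = 0.5`, the all-ones starting vector, and every function name are illustrative assumptions.

```python
import math

def center_columns(G):
    """Subtract each SNP's mean genotype (column mean) from its column."""
    n, p = len(G), len(G[0])
    means = [sum(row[j] for row in G) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in G]

def soft_threshold(x, lam):
    """Lasso-type shrinkage: move x toward zero by lam, clipping at zero."""
    return math.copysign(max(abs(x) - lam, 0.0), x)

def unit(vec):
    """Rescale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def pc_scores(X, v):
    """Project each individual (row of X) onto the loading vector v."""
    return [sum(xij * vj for xij, vj in zip(row, v)) for row in X]

def first_loading(X, lam, n_iter=200):
    """First loading vector of column-centered X.

    lam = 0 gives ordinary power iteration (dense PCA loadings);
    lam > 0 shrinks small loadings exactly to zero (sparse loadings).
    """
    n, p = len(X), len(X[0])
    v = [1.0] * p  # assumption: simple start; an SVD start is more usual
    for _ in range(n_iter):
        u = unit(pc_scores(X, v))  # current scores, unit length
        # Regress every SNP on the scores, then penalize the coefficients.
        w = [sum(X[i][j] * u[i] for i in range(n)) for j in range(p)]
        v = [soft_threshold(wj, lam) for wj in w]
        if not any(v):
            raise ValueError("penalty too large: all loadings shrunk to zero")
    return unit(v)

def correlation(a, b):
    """Correlation of two mean-zero score vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

# Hypothetical genotypes (0/1/2 minor-allele counts): 6 individuals x 5 SNPs.
# Rows 0-2 and rows 3-5 mimic two populations; only SNPs 0 and 1 differ
# systematically between them.
G = [
    [2, 2, 1, 0, 1],
    [2, 1, 0, 1, 1],
    [2, 2, 1, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1],
    [0, 1, 1, 1, 0],
]

X = center_columns(G)
dense = first_loading(X, lam=0.0)   # traditional PCA loadings
sparse = first_loading(X, lam=0.5)  # penalized loadings
selected = [j for j, vj in enumerate(sparse) if vj != 0]
agreement = correlation(pc_scores(X, dense), pc_scores(X, sparse))

print("selected SNP indices:", selected)
print("score correlation, dense vs sparse:", round(agreement, 3))
```

In this toy setting the SNPs with nonzero sparse loadings play the role of the AIM set, and the correlation between dense and sparse scores is a rough stand-in for the "negligible loss of ancestry information" claim. In a real analysis the penalty would need to be tuned rather than fixed by hand, and further components would be extracted after the first.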
Pages: 293-302
Number of pages: 10