PCA-correlated SNPs for structure identification in worldwide human populations

Cited by: 200
Authors
Paschou, Peristera [1]
Ziv, Elad
Burchard, Esteban G.
Choudhry, Shweta
Rodriguez-Cintron, William
Mahoney, Michael W.
Drineas, Petros
Affiliations
[1] Democritus Univ Thrace, Dept Mol Biol & Genet, Alexandroupolis, Greece
[2] Univ Calif San Francisco, Div Gen Internal Med, San Francisco, CA 94143 USA
[3] Univ Calif San Francisco, Inst Human Genet, San Francisco, CA 94143 USA
[4] Univ Calif San Francisco, Ctr Comprehens Canc, San Francisco, CA 94143 USA
[5] Univ Calif San Francisco, Ctr Biopharmaceut Sci, San Francisco, CA 94143 USA
[6] Univ Calif San Francisco, Dept Med, San Francisco, CA 94143 USA
[7] Univ Calif San Francisco, Lung Biol Ctr, Dept Med, San Francisco, CA 94143 USA
[8] Univ Puerto Rico, Sch Med, Pulm CCM Vet Caribbean Healthcare Syst, San Juan, PR 00936 USA
[9] Yahoo Res, Sunnyvale, CA USA
[10] Rensselaer Polytech Inst, Dept Comp Sci, Troy, NY 12180 USA
Source
PLOS GENETICS | 2007 / Vol. 3 / Issue 09
DOI
10.1371/journal.pgen.0030160
Chinese Library Classification
Q3 [Genetics];
Discipline classification code
071007 ; 090102 ;
Abstract
Existing methods to ascertain small sets of markers for the identification of human population structure require prior knowledge of individual ancestry. Based on Principal Components Analysis (PCA), and recent results in theoretical computer science, we present a novel algorithm that, applied to genome-wide data, selects small subsets of SNPs (PCA-correlated SNPs) to reproduce the structure found by PCA on the complete dataset, without use of ancestry information. Evaluating our method on a previously described dataset (10,805 SNPs, 11 populations), we demonstrate that a very small set of PCA-correlated SNPs can be effectively employed to assign individuals to particular continents or populations, using a simple clustering algorithm. We validate our methods on the HapMap populations and achieve perfect intercontinental differentiation with 14 PCA-correlated SNPs. The Chinese and Japanese populations can be easily differentiated using fewer than 100 PCA-correlated SNPs ascertained after evaluating 1.7 million SNPs from HapMap. We show that, in general, structure-informative SNPs are not portable across geographic regions. However, we manage to identify a general set of 50 PCA-correlated SNPs that effectively assigns individuals to one of nine different populations. Compared to analysis with the measure of informativeness, our methods, although unsupervised, achieved similar results. We proceed to demonstrate that our algorithm can be effectively used for the analysis of admixed populations without having to trace the origin of individuals. Analyzing a Puerto Rican dataset (192 individuals, 7,257 SNPs), we show that PCA-correlated SNPs can be used to successfully predict structure and ancestry proportions. We subsequently validate these SNPs for structure identification in an independent Puerto Rican dataset. The algorithm that we introduce runs in seconds and can be easily applied to large genome-wide datasets, facilitating the identification of population substructure, stratification assessment in multi-stage whole-genome association studies, and the study of demographic history in human populations.
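The selection idea in the abstract can be illustrated with a minimal sketch. The abstract does not state the scoring rule, so the rule used here (sum of squared loadings of each SNP on the top principal components) is an assumption for illustration, not necessarily the authors' exact procedure. The function name `pca_correlated_snps`, the -1/0/+1 genotype coding, and the synthetic two-population data are all hypothetical.

```python
import numpy as np

def pca_correlated_snps(genotypes, n_components, n_select):
    """Rank SNPs by squared loading on the top principal components.

    genotypes: (individuals x SNPs) array; the -1/0/+1 coding is assumed.
    Returns (indices of the n_select highest-scoring SNPs, all scores).
    The scoring rule is an assumption, not taken from the abstract.
    """
    X = np.asarray(genotypes, dtype=float)
    X = X - X.mean(axis=0)                    # centre each SNP column
    # Thin SVD: rows of vt are the principal axes in SNP space.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    top = vt[:n_components]                   # shape (k, n_snps)
    scores = (top ** 2).sum(axis=0)           # one score per SNP
    order = np.argsort(scores)[::-1]          # highest score first
    return order[:n_select], scores

# Synthetic data: 2 populations of 20 individuals, 200 SNPs.
# Only the first 20 SNPs differ between populations (hypothetical setup).
rng = np.random.default_rng(0)
n_per_pop, n_snps, n_informative = 20, 200, 20
data = rng.integers(-1, 2, size=(2 * n_per_pop, n_snps)).astype(float)
data[:n_per_pop, :n_informative] = 1.0        # population A is fixed at +1
data[n_per_pop:, :n_informative] = -1.0       # population B is fixed at -1

selected, scores = pca_correlated_snps(data, n_components=1, n_select=20)
print(sorted(selected.tolist()))
```

No ancestry labels are passed to the function: the ranking depends only on the genotype matrix, which mirrors the unsupervised setting the abstract describes. Individuals could then be clustered using only the selected columns, for example with k-means.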
Pages: 1672 - 1686
Number of pages: 15