Reference-guided assembly of four diverse Arabidopsis thaliana genomes

Cited by: 183
Authors
Schneeberger, Korbinian [1, 2]
Ossowski, Stephan [1, 3, 4]
Ott, Felix [1 ]
Klein, Juliane D. [5 ]
Wang, Xi [1 ]
Lanz, Christa [1 ]
Smith, Lisa M. [1 ]
Cao, Jun [1 ]
Fitz, Joffrey [1 ]
Warthmann, Norman [1 ]
Henz, Stefan R. [1 ]
Huson, Daniel H. [5 ]
Weigel, Detlef [1 ]
Affiliations
[1] Max Planck Inst Dev Biol, Dept Mol Biol, D-72076 Tubingen, Germany
[2] Max Planck Inst Plant Breeding Res, Dept Plant Dev Biol, D-50829 Cologne, Germany
[3] UPF, Barcelona 08003, Spain
[4] CRG, Genes & Dis Program, Genom & Epigen Variat Dis Grp, Barcelona 08003, Spain
[5] Univ Tubingen, Ctr Bioinformat Tubingen, D-72076 Tubingen, Germany
Keywords
STRUCTURAL VARIATION; SEQUENCE DATA; SHORT READS; IDENTIFICATION; EXPRESSION; POLYMORPHISMS; ALGORITHMS; ALIGNMENT;
DOI
10.1073/pnas.1107739108
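Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Subject classification codes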
07 ; 0710 ; 09 ;
Abstract
We present whole-genome assemblies of four divergent Arabidopsis thaliana strains that complement the 125-Mb reference genome sequence released a decade ago. Using a newly developed reference-guided approach, we assembled large contigs from 9 to 42 Gb of Illumina short-read data from the Landsberg erecta (Ler-1), C24, Bur-0, and Kro-0 strains, which have been sequenced as part of the 1,001 Genomes Project for this species. Using alignments against the reference sequence, we first reduced the complexity of the de novo assembly and later integrated reads without similarity to the reference sequence. As an example, half of the noncentromeric C24 genome was covered by scaffolds that are longer than 260 kb, with a maximum of 2.2 Mb. Moreover, over 96% of the reference genome was covered by the reference-guided assembly, compared with only 87% with a complete de novo assembly. Comparisons with 2 Mb of dideoxy sequence reveal that the per-base error rate of the reference-guided assemblies was below 1 in 10,000. Our assemblies provide a detailed, genomewide picture of large-scale differences between A. thaliana individuals, most of which are difficult to access with alignment-consensus methods only. We demonstrate their practical relevance in studying the expression differences of polymorphic genes and show how the analysis of sRNA sequencing data can lead to erroneous conclusions if aligned against the reference genome alone. Genome assemblies, raw reads, and further information are accessible through http://1001genomes.org/projects/assemblies.html.
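The abstract's statement that half of the noncentromeric C24 genome is covered by scaffolds longer than 260 kb is an N50-style contiguity summary. As a minimal sketch of how such a figure can be computed, and not the authors' own pipeline, the hypothetical function `n50` below takes a list of scaffold lengths and an optional `genome_size` (both names are assumptions for illustration). The lengths in the example are toy values, not data from the paper.

```python
def n50(lengths, genome_size=None):
    """Return the length L such that scaffolds of length >= L together
    cover at least half of the target size.

    If genome_size is None, the target is half of the total assembly
    length (classic N50); otherwise it is half of genome_size (often
    called NG50). Returns None when the scaffolds cannot cover half of
    genome_size.
    """
    if not lengths:
        raise ValueError("no scaffold lengths given")
    total = genome_size if genome_size is not None else sum(lengths)
    target = total / 2
    covered = 0
    # Walk from the longest scaffold to the shortest, accumulating length.
    for length in sorted(lengths, reverse=True):
        covered += length
        if covered >= target:
            return length
    return None


# Toy scaffold lengths (illustrative only).
toy_lengths = [500, 400, 300, 200, 100]
print(n50(toy_lengths))                    # relative to assembly size
print(n50(toy_lengths, genome_size=3000))  # relative to an assumed genome size
```

Passing an explicit `genome_size` matters when an assembly is incomplete: the same scaffolds give a smaller value when measured against the full genome than against the assembled length alone.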
Pages: 10249-10254
Number of pages: 6