Optimization of de novo transcriptome assembly from next-generation sequencing data

Cited by: 267
Authors
Surget-Groba, Yann [1]
Montoya-Burgos, Juan I. [1]
Affiliations
[1] Univ Geneva, Dept Zool & Anim Biol, CH-1211 Geneva 4, Switzerland
Keywords
RNA-SEQ; MODEL ORGANISMS; GENE DISCOVERY; LARGE SETS; GENOME; RESOLUTION; ALLPATHS; PROGRAM; PROTEIN; FISHES
DOI
10.1101/gr.103846.109
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
Transcriptome analysis has important applications in many biological fields. However, assembling a transcriptome without a known reference remains a challenging task requiring algorithmic improvements. We present two methods for substantially improving transcriptome de novo assembly. The first method relies on the observation that the use of a single k-mer length by current de novo assemblers is suboptimal to assemble transcriptomes where the sequence coverage of transcripts is highly heterogeneous. We present the Multiple-k method in which various k-mer lengths are used for de novo transcriptome assembly. We demonstrate its good performance by assembling de novo a published next-generation transcriptome sequence data set of Aedes aegypti, using the existing genome to check the accuracy of our method. The second method relies on the use of a reference proteome to improve the de novo assembly. We developed the Scaffolding using Translation Mapping (STM) method that uses mapping against the closest available reference proteome for scaffolding contigs that map onto the same protein. In a controlled experiment using simulated data, we show that the STM method considerably improves the assembly, with few errors. We applied these two methods to assemble the transcriptome of the non-model catfish Loricaria gr. cataphracta. Using the Multiple-k and STM methods, the assembly increases in contiguity and in gene identification, showing that our methods clearly improve quality and can be widely used. The new methods were used to assemble successfully the transcripts of the core set of genes regulating tooth development in vertebrates, while classic de novo assembly failed.
Pages: 1432-1440
Page count: 9