TopHat: discovering splice junctions with RNA-Seq

Cited by: 9270
Authors
Trapnell, Cole [1]
Pachter, Lior [2]
Salzberg, Steven L. [1]
Affiliations
[1] Univ Maryland, Ctr Bioinformat & Computat Biol, College Pk, MD 20742 USA
[2] Univ Calif Berkeley, Dept Math, Berkeley, CA 94720 USA
Funding
National Science Foundation (USA);
Keywords
MESSENGER-RNA; ALIGNMENT; TRANSCRIPTOME; SEQUENCES; EFFICIENT; LIBRARY; GENOME;
DOI
10.1093/bioinformatics/btp120
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Motivation: A new protocol for sequencing the messenger RNA in a cell, known as RNA-Seq, generates millions of short sequence fragments in a single run. These fragments, or 'reads', can be used to measure levels of gene expression and to identify novel splice variants of genes. However, current software for aligning RNA-Seq data to a genome relies on known splice junctions and cannot identify novel ones. TopHat is an efficient read-mapping algorithm designed to align reads from an RNA-Seq experiment to a reference genome without relying on known splice sites.
Results: We mapped the RNA-Seq reads from a recent mammalian RNA-Seq experiment and recovered more than 72% of the splice junctions reported by the annotation-based software from that study, along with nearly 20 000 previously unreported junctions. The TopHat pipeline is much faster than previous systems, mapping nearly 2.2 million reads per CPU hour, which is sufficient to process an entire RNA-Seq experiment in less than a day on a standard desktop computer. We describe several challenges unique to ab initio splice site discovery from RNA-Seq reads that will require further algorithm development.
Pages: 1105-1111
Number of pages: 7
Related papers
20 items in total
  • [1] Abouelhoda M. I., 2004, Journal of Discrete Algorithms, V2, P53, DOI 10.1016/S1570-8667(03)00065-0
  • [2] RAPID CDNA SEQUENCING (EXPRESSED SEQUENCE TAGS) FROM A DIRECTIONALLY CLONED HUMAN INFANT BRAIN CDNA LIBRARY
    ADAMS, MD
    SOARES, MB
    KERLAVAGE, AR
    FIELDS, C
    VENTER, JC
    [J]. NATURE GENETICS, 1993, 4 (04) : 373 - 386
  • [3] Burrows M., 1994, 124 DEC DIG SYST RES
  • [4] Stem cell transcriptome profiling via massive-scale mRNA sequencing
    Cloonan, Nicole
    Forrest, Alistair R. R.
    Kolle, Gabriel
    Gardiner, Brooke B. A.
    Faulkner, Geoffrey J.
    Brown, Mellissa K.
    Taylor, Darrin F.
    Steptoe, Anita L.
    Wani, Shivangi
    Bethel, Graeme
    Robertson, Alan J.
    Perkins, Andrew C.
    Bruce, Stephen J.
    Lee, Clarence C.
    Ranade, Swati S.
    Peckham, Heather E.
    Manning, Jonathan M.
    McKernan, Kevin J.
    Grimmond, Sean M.
    [J]. NATURE METHODS, 2008, 5 (07) : 613 - 619
  • [5] Optimal spliced alignments of short sequence reads
    De Bona, Fabio
    Ossowski, Stephan
    Schneeberger, Korbinian
    Raetsch, Gunnar
    [J]. BIOINFORMATICS, 2008, 24 (16) : I174 - I180
  • [6] SeqAn An efficient, generic C++ library for sequence analysis
    Doering, Andreas
    Weese, David
    Rausch, Tobias
    Reinert, Knut
    [J]. BMC BIOINFORMATICS, 2008, 9 (1)
  • [7] Ferragina P, 2001, SIAM PROC S, P269
  • [8] Whole-genome sequencing and variant discovery in C. elegans
    Hillier, LaDeana W.
    Marth, Gabor T.
    Quinlan, Aaron R.
    Dooling, David
    Fewell, Ginger
    Barnett, Derek
    Fox, Paul
    Glasscock, Jarret I.
    Hickenbotham, Matthew
    Huang, Weichun
    Magrini, Vincent J.
    Richt, Ryan J.
    Sander, Sacha N.
    Stewart, Donald A.
    Stromberg, Michael
    Tsung, Eric F.
    Wylie, Todd
    Schedl, Tim
    Wilson, Richard K.
    Mardis, Elaine R.
    [J]. NATURE METHODS, 2008, 5 (02) : 183 - 188
  • [9] Kent WJ, 2002, GENOME RES, V12, P656, DOI 10.1101/gr.229202
  • [10] Ultrafast and memory-efficient alignment of short DNA sequences to the human genome
    Langmead, Ben
    Trapnell, Cole
    Pop, Mihai
    Salzberg, Steven L.
    [J]. GENOME BIOLOGY, 2009, 10 (03):