HMMSplicer: A Tool for Efficient and Sensitive Discovery of Known and Novel Splice Junctions in RNA-Seq Data

Cited by: 33
Authors
Dimon, Michelle T. [1, 2]
Sorber, Katherine [1]
DeRisi, Joseph L. [1, 3]
Affiliations
[1] Univ Calif San Francisco, Dept Biochem & Biophys, San Francisco, CA 94143 USA
[2] Univ Calif San Francisco, Biol & Med Informat Program, San Francisco, CA 94143 USA
[3] Howard Hughes Med Inst, Bethesda, MD 20817 USA
Source
PLOS ONE | 2010 / Volume 5 / Issue 11
Keywords
MESSENGER-RNA; SEQUENCE; TRANSCRIPTOME; ALIGNMENT
DOI
10.1371/journal.pone.0013875
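Chinese Library Classification (CLC)
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]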
Discipline classification codes
07; 0710; 09
Abstract
Background: High-throughput sequencing of an organism's transcriptome, or RNA-Seq, is a valuable and versatile new strategy for capturing snapshots of gene expression. However, transcriptome sequencing creates a new class of alignment problem: mapping short reads that span exon-exon junctions back to the reference genome, especially in the case where a splice junction is previously unknown.
Methodology/Principal Findings: Here we introduce HMMSplicer, an accurate and efficient algorithm for discovering canonical and non-canonical splice junctions in short read datasets. HMMSplicer identifies more splice junctions than currently available algorithms when tested on publicly available A. thaliana, P. falciparum, and H. sapiens datasets without a reduction in specificity.
Conclusions/Significance: HMMSplicer was found to perform especially well in compact genomes and on genes with low expression levels, alternative splice isoforms, or non-canonical splice junctions. Because HMMSplicer does not rely on pre-built gene models, the products of inexact splicing are also detected. For H. sapiens, we find 3.6% of 3' splice sites and 1.4% of 5' splice sites are inexact, typically differing by 3 bases in either direction. In addition, HMMSplicer provides a score for every predicted junction, allowing the user to set a threshold to tune false positive rates depending on the needs of the experiment. HMMSplicer is implemented in Python. Code and documentation are freely available at http://derisilab.ucsf.edu/software/hmmsplicer.
Pages: 16