HMMSplicer: A Tool for Efficient and Sensitive Discovery of Known and Novel Splice Junctions in RNA-Seq Data

Cited by: 33

Authors:

Dimon, Michelle T. ^{[1,2]}

Sorber, Katherine ^{[1]}

DeRisi, Joseph L. ^{[1,3]}

Affiliations:

[1] Univ Calif San Francisco, Dept Biochem & Biophys, San Francisco, CA 94143 USA

[2] Univ Calif San Francisco, Biol & Med Informat Program, San Francisco, CA 94143 USA

[3] Howard Hughes Med Inst, Bethesda, MD 20817 USA

Source:

PLOS ONE | 2010 / Vol. 5 / Issue 11

Keywords:

MESSENGER-RNA; SEQUENCE; TRANSCRIPTOME; ALIGNMENT;

DOI:

10.1371/journal.pone.0013875

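Chinese Library Classification (CLC):

O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];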

Discipline classification codes:

07 ; 0710 ; 09 ;

Abstract:

Background: High-throughput sequencing of an organism's transcriptome, or RNA-Seq, is a valuable and versatile new strategy for capturing snapshots of gene expression. However, transcriptome sequencing creates a new class of alignment problem: mapping short reads that span exon-exon junctions back to the reference genome, especially in the case where a splice junction is previously unknown. Methodology/Principal Findings: Here we introduce HMMSplicer, an accurate and efficient algorithm for discovering canonical and non-canonical splice junctions in short read datasets. HMMSplicer identifies more splice junctions than currently available algorithms when tested on publicly available A. thaliana, P. falciparum, and H. sapiens datasets without a reduction in specificity. Conclusions/Significance: HMMSplicer was found to perform especially well in compact genomes and on genes with low expression levels, alternative splice isoforms, or non-canonical splice junctions. Because HMMSplicer does not rely on pre-built gene models, the products of inexact splicing are also detected. For H. sapiens, we find 3.6% of 3' splice sites and 1.4% of 5' splice sites are inexact, typically differing by 3 bases in either direction. In addition, HMMSplicer provides a score for every predicted junction allowing the user to set a threshold to tune false positive rates depending on the needs of the experiment. HMMSplicer is implemented in Python. Code and documentation are freely available at http://derisilab.ucsf.edu/software/hmmsplicer.
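
The abstract notes that every predicted junction carries a score, so a user can set a threshold to trade sensitivity against false positives. The sketch below illustrates that idea only; it is not HMMSplicer's code or output format. The `Junction` record, its field names, the `filter_by_score` helper, and the example scores are all hypothetical.

```python
from typing import List, NamedTuple


class Junction(NamedTuple):
    """Hypothetical record for one predicted splice junction."""
    chrom: str
    intron_start: int
    intron_end: int
    score: float  # assumed convention: higher score means more confident


def filter_by_score(junctions: List[Junction], min_score: float) -> List[Junction]:
    """Keep only junctions whose score is at least min_score."""
    return [j for j in junctions if j.score >= min_score]


# Made-up example predictions (not real data).
junctions = [
    Junction("chr1", 1000, 1500, 1250.0),
    Junction("chr1", 2000, 2300, 410.0),
    Junction("chr2", 500, 900, 820.0),
]

# A stricter threshold keeps fewer, higher-confidence junctions.
strict = filter_by_score(junctions, 800.0)
# A more lenient threshold keeps more junctions at the cost of more false positives.
lenient = filter_by_score(junctions, 400.0)
```

Raising `min_score` would suit an experiment that needs few false positives; lowering it would suit exploratory discovery of rare or low-expression junctions.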

Pages: 16
