Screening non-coding RNAs in transcriptomes from neglected species using PORTRAIT: case study of the pathogenic fungus Paracoccidioides brasiliensis Software

Cited by: 71
Authors
Arrial, Roberto T. [1 ]
Togawa, Roberto C.
Brigido, Marcelo de M. [1 ]
Affiliations
[1] Univ Brasilia, Inst Biol, Mol Biol Lab, BR-70910900 Brasilia, DF, Brazil
Source
BMC BIOINFORMATICS | 2009 / Vol. 10
Keywords
SUPPORT VECTOR MACHINE; SEQUENCE FEATURES; PROTEIN; DATABASE; PROGRAM; CDNA;
DOI
10.1186/1471-2105-10-239
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: Transcriptome sequences complement structural genomic information and provide snapshots of an organism's transcriptional profile. Such sequences also represent an alternative method for characterizing neglected species that are not expected to undergo whole-genome sequencing. One difficulty for transcriptome sequencing of these organisms is the low quality of reads and incomplete coverage of transcripts, both of which compromise further bioinformatics analyses. Another complicating factor is the lack of known protein homologs, which frustrates searches against established protein databases. This lack of homologs may be caused by divergence from well-characterized and over-represented model organisms. Another explanation is that non-coding RNAs (ncRNAs) may be captured during sequencing. NcRNAs are RNA sequences that, unlike messenger RNAs, do not code for protein products and instead perform unique functions by folding into higher order structural conformations. NcRNA screening software specific for transcriptome sequences is available, but its analyses are optimized for transcriptomes that are well represented in protein databases, and it assumes that input ESTs are full-length and high quality.
Results: We propose an algorithm called PORTRAIT, which is suitable for ncRNA analysis of transcriptomes from poorly characterized species. Sequences are translated by software that is resistant to sequencing errors, and the predicted putative proteins, along with their source transcripts, are evaluated for coding potential by a support vector machine (SVM). Either of two SVM models may be employed: if a putative protein is found, a protein-dependent SVM model is used; if it is not found, a protein-independent SVM model is used instead. Only ab initio features are extracted, so that no homology information is needed. We illustrate the use of PORTRAIT by predicting ncRNAs from the transcriptome of the pathogenic fungus Paracoccidioides brasiliensis and five other related fungi.
Conclusion: PORTRAIT can be integrated into pipelines, and provides a low computational cost solution for ncRNA detection in transcriptome sequencing projects.
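The two-model selection described in the Results can be illustrated with a minimal Python sketch. This is not the PORTRAIT implementation: the feature functions, the `classify` helper, the 0.5 threshold, and the stub models are all hypothetical names and values chosen for illustration, and real SVM models would stand in for the callables. It shows only the dispatch rule stated in the abstract: a transcript with a predicted protein goes to the protein-dependent model, and one without goes to the protein-independent model.

```python
from typing import Callable, Dict, Optional, Tuple

Features = Dict[str, float]
# Hypothetical model interface: maps features to a coding-potential score in [0, 1].
Model = Callable[[Features], float]


def nucleotide_features(transcript: str) -> Features:
    """Illustrative ab initio features computed from the transcript alone."""
    seq = transcript.upper()
    n = max(len(seq), 1)  # avoid division by zero on empty input
    feats = {f"freq_{base}": seq.count(base) / n for base in "ACGT"}
    feats["length"] = float(len(seq))
    return feats


def protein_features(protein: str) -> Features:
    """Illustrative features computed from the predicted putative protein."""
    seq = protein.upper()
    n = max(len(seq), 1)
    hydrophobic = sum(seq.count(aa) for aa in "AVILMFWY")
    return {
        "protein_length": float(len(seq)),
        "hydrophobic_fraction": hydrophobic / n,
    }


def classify(
    transcript: str,
    protein: Optional[str],
    protein_dependent: Model,
    protein_independent: Model,
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> Tuple[str, str]:
    """Pick the model by whether a putative protein was found, then label."""
    feats = nucleotide_features(transcript)
    if protein:  # a putative protein was predicted
        feats.update(protein_features(protein))
        model_name, score = "protein-dependent", protein_dependent(feats)
    else:  # no protein predicted: rely on transcript features only
        model_name, score = "protein-independent", protein_independent(feats)
    label = "coding" if score >= threshold else "non-coding"
    return model_name, label


# Stub models standing in for trained SVMs (hypothetical constant scores).
stub_dependent: Model = lambda feats: 0.9
stub_independent: Model = lambda feats: 0.1
```

With the stub models, `classify("ATGGCC", "MA", stub_dependent, stub_independent)` selects the protein-dependent model, while passing `None` as the protein selects the protein-independent one.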
Pages: 9