Distinguishing protein-coding from non-coding RNAs through support vector machines

被引:116
作者
Liu, Jinfeng
Gough, Julian
Rost, Burkhard
机构
[1] Columbia Univ, Bioinformat Ctr, Dept Biochem & Mol Biophys, New York, NY 10027 USA
[2] Columbia Univ, Ctr Computat Biol & Bioinformat, New York, NY USA
[3] Columbia Univ, NE Struct Genom Consortium, Dept Biochem & Mol Biophys, New York, NY USA
[4] RIKEN, Genome Explorat Res Grp, Genome Network Project Core Grp, Genom Sci Ctr,Yokohama Inst, Yokohama, Kanagawa, Japan
来源
PLOS GENETICS | 2006年 / 2卷 / 04期
关键词
D O I
10.1371/journal.pgen.0020029
中图分类号
Q3 [遗传学];
学科分类号
071007 ; 090102 ;
摘要
RIKEN's FANTOM project has revealed many previously unknown coding sequences, as well as an unexpected degree of variation in transcripts resulting from alternative promoter usage and splicing. Ever more transcripts that do not code for proteins have been identified by transcriptome studies, in general. Increasing evidence points to the important cellular roles of such non-coding RNAs ( ncRNAs). The distinction of protein- coding RNA transcripts from ncRNA transcripts is therefore an important problem in understanding the transcriptome and carrying out its annotation. Very few in silico methods have specifically addressed this problem. Here, we introduce CONC ( for "coding or non- coding''), a novel method based on support vector machines that classifies transcripts according to features they would have if they were coding for proteins. These features include peptide length, amino acid composition, predicted secondary structure content, predicted percentage of exposed residues, compositional entropy, number of homologs from database searches, and alignment entropy. Nucleotide frequencies are also incorporated into the method. Confirmed coding cDNAs for eukaryotic proteins from the Swiss- Prot database constituted the set of true positives, ncRNAs from RNAdb and NONCODE the true negatives. Ten- fold cross- validation suggested that CONC distinguished coding RNAs from ncRNAs at about 97% specificity and 98% sensitivity. Applied to 102,801 mouse cDNAs from the FANTOM3 dataset, our method reliably identified over 14,000 ncRNAs and estimated the total number of ncRNAs to be about 28,000.
引用
收藏
页码:529 / 536
页数:8
相关论文
共 43 条
[1]   Gapped BLAST and PSI-BLAST: a new generation of protein database search programs [J].
Altschul, SF ;
Madden, TL ;
Schaffer, AA ;
Zhang, JH ;
Zhang, Z ;
Miller, W ;
Lipman, DJ .
NUCLEIC ACIDS RESEARCH, 1997, 25 (17) :3389-3402
[2]  
[Anonymous], 2005, The proteomics protocols handbook. Totowa (New Jersey)
[3]   CRITICA: Coding region identification tool invoking comparative analysis [J].
Badger, JH ;
Olsen, GJ .
MOLECULAR BIOLOGY AND EVOLUTION, 1999, 16 (04) :512-524
[4]  
Bateman A, 2004, NUCLEIC ACIDS RES, V32, pD138, DOI [10.1093/nar/gkp985, 10.1093/nar/gkr1065, 10.1093/nar/gkh121]
[5]  
Benson Dennis A, 2005, Nucleic Acids Res, V33, pD34
[6]   The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003 [J].
Boeckmann, B ;
Bairoch, A ;
Apweiler, R ;
Blatter, MC ;
Estreicher, A ;
Gasteiger, E ;
Martin, MJ ;
Michoud, K ;
O'Donovan, C ;
Phan, I ;
Pilbout, S ;
Schneider, M .
NUCLEIC ACIDS RESEARCH, 2003, 31 (01) :365-370
[7]   Knowledge-based analysis of microarray gene expression data by using support vector machines [J].
Brown, MPS ;
Grundy, WN ;
Lin, D ;
Cristianini, N ;
Sugnet, CW ;
Furey, TS ;
Ares, M ;
Haussler, D .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2000, 97 (01) :262-267
[8]   The transcriptional landscape of the mammalian genome [J].
Carninci, P ;
Kasukawa, T ;
Katayama, S ;
Gough, J ;
Frith, MC ;
Maeda, N ;
Oyama, R ;
Ravasi, T ;
Lenhard, B ;
Wells, C ;
Kodzius, R ;
Shimokawa, K ;
Bajic, VB ;
Brenner, SE ;
Batalov, S ;
Forrest, ARR ;
Zavolan, M ;
Davis, MJ ;
Wilming, LG ;
Aidinis, V ;
Allen, JE ;
Ambesi-Impiombato, X ;
Apweiler, R ;
Aturaliya, RN ;
Bailey, TL ;
Bansal, M ;
Baxter, L ;
Beisel, KW ;
Bersano, T ;
Bono, H ;
Chalk, AM ;
Chiu, KP ;
Choudhary, V ;
Christoffels, A ;
Clutterbuck, DR ;
Crowe, ML ;
Dalla, E ;
Dalrymple, BP ;
de Bono, B ;
Della Gatta, G ;
di Bernardo, D ;
Down, T ;
Engstrom, P ;
Fagiolini, M ;
Faulkner, G ;
Fletcher, CF ;
Fukushima, T ;
Furuno, M ;
Futaki, S ;
Gariboldi, M .
SCIENCE, 2005, 309 (5740) :1559-1563
[9]   INVITRO SPLICING OF THE RIBOSOMAL-RNA PRECURSOR OF TETRAHYMENA - INVOLVEMENT OF A GUANOSINE NUCLEOTIDE IN THE EXCISION OF THE INTERVENING SEQUENCE [J].
CECH, TR ;
ZAUG, AJ ;
GRABOWSKI, PJ .
CELL, 1981, 27 (03) :487-496
[10]  
Cherkassky V, 1997, IEEE Trans Neural Netw, V8, P1564, DOI 10.1109/TNN.1997.641482