Improving subcellular localization prediction using text classification and the gene ontology

Cited by: 37
Authors
Fyshe, Alona [1]
Liu, Yifeng [1]
Szafron, Duane [1]
Greiner, Russ [1]
Lu, Paul [1]
Affiliation
[1] Univ Alberta, Dept Comp Sci, Edmonton, AB T6G 2E8, Canada
Funding
Natural Sciences and Engineering Research Council of Canada (NSERC)
DOI
10.1093/bioinformatics/btn463
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010 ; 081704 ;
Abstract
Motivation: Each protein performs its functions within some specific locations in a cell. This subcellular location is important for understanding protein function and for facilitating its purification. There are now many computational techniques for predicting location based on sequence analysis and database information from homologs. A few recent techniques use text from biological abstracts: our goal is to improve the prediction accuracy of such text-based techniques. We identify three techniques for improving text-based prediction: a rule for ambiguous abstract removal, a mechanism for using synonyms from the Gene Ontology (GO) and a mechanism for using the GO hierarchy to generalize terms. We show that these three techniques can significantly improve the accuracy of protein subcellular location predictors that use text extracted from PubMed abstracts whose references are recorded in Swiss-Prot.
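The three techniques named in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the function names, the `go:` feature prefix, the tiny hand-written synonym and parent tables (toy stand-ins, not data read from the real GO files), and the reading of "ambiguous" as "an abstract associated with more than one location label".

```python
# Toy stand-ins for GO data; illustrative only, not loaded from the real ontology.
SYNONYMS = {"mitochondrial membrane": "mitochondrial envelope"}
PARENTS = {
    "mitochondrial envelope": "mitochondrion",
    "mitochondrial matrix": "mitochondrion",
}


def remove_ambiguous(abstract_labels):
    """Keep only abstracts tied to exactly one location label (assumed rule)."""
    return {
        pmid: next(iter(labels))
        for pmid, labels in abstract_labels.items()
        if len(labels) == 1
    }


def ancestors(term):
    """Walk the toy parent table upward and return every more general term."""
    found = []
    while term in PARENTS:
        term = PARENTS[term]
        found.append(term)
    return found


def add_go_features(text):
    """Return the abstract's words plus GO-derived features.

    Synonyms are mapped to one canonical term name, and each matched term
    also contributes its ancestors, so specific terms generalize.
    """
    lowered = text.lower()
    known = set(SYNONYMS) | set(SYNONYMS.values()) | set(PARENTS) | set(PARENTS.values())
    go_features = set()
    for phrase in known:
        if phrase in lowered:
            canonical = SYNONYMS.get(phrase, phrase)
            for term in [canonical] + ancestors(canonical):
                go_features.add("go:" + term.replace(" ", "_"))
    return lowered.split() + sorted(go_features)


if __name__ == "__main__":
    print(remove_ambiguous({"a1": {"nucleus"}, "a2": {"nucleus", "cytoplasm"}}))
    print(add_go_features("Located in the mitochondrial membrane"))
```

The resulting token list could then be fed to any bag-of-words text classifier; the choice of classifier is outside the scope of this sketch.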
Pages: 2512-2517
Number of pages: 6