Semi-supervised prediction of protein subcellular localization using abstraction augmented Markov models

Cited by: 15

Authors:

Caragea, Cornelia [1,2]

Caragea, Doina [3]

Silvescu, Adrian [1,2]

Honavar, Vasant [1,2]

Affiliations:

[1] Iowa State Univ, Artificial Intelligence Res Lab, Dept Comp Sci, Ames, IA 50010 USA

[2] Iowa State Univ, Ctr Computat Intelligence Learning & Discovery, Ames, IA 50010 USA

[3] Kansas State Univ, Manhattan, KS USA

Source:

BMC Bioinformatics | 2010 / Vol. 11

Keywords:

Supervised learning - Markov processes - Abstracting - Maximum principle - Proteins;

DOI:

10.1186/1471-2105-11-S8-S6

Chinese Library Classification number:

Q5 [Biochemistry];

Discipline classification code:

070307 [Chemical Biology];

Abstract:

Background: Determination of protein subcellular localization plays an important role in understanding protein function. Knowledge of the subcellular localization is also essential for genome annotation and drug discovery. Supervised machine learning methods for predicting the localization of a protein in a cell rely on the availability of large amounts of labeled data. However, because of the high cost and effort involved in labeling the data, the amount of labeled data is quite small compared to the amount of unlabeled data. Hence, there is a growing interest in developing semi-supervised methods for predicting protein subcellular localization from large amounts of unlabeled data together with small amounts of labeled data. Results: In this paper, we present an Abstraction Augmented Markov Model (AAMM) based approach to the semi-supervised protein subcellular localization prediction problem. We investigate the effectiveness of AAMMs in exploiting unlabeled data. We compare semi-supervised AAMMs with: (i) Markov models (MMs), which do not take advantage of unlabeled data; and (ii) expectation maximization (EM) based and (iii) co-training based approaches to semi-supervised training of MMs, which make use of unlabeled data. Conclusions: The results of our experiments on three protein subcellular localization data sets show that semi-supervised AAMMs: (i) can effectively exploit unlabeled data; (ii) are more accurate than both the MMs and the EM based semi-supervised MMs; and (iii) are comparable in performance to, and in some cases outperform, the co-training based semi-supervised MMs.


Pages: 13
