Optimal amnesic probabilistic automata or how to learn and classify proteins in linear time and space

Cited by: 29
Authors
Apostolico, A. [1]
Bejerano, G.
Affiliations
[1] Purdue Univ, Dept Comp Sci, W Lafayette, IN 47907 USA
[2] Univ Padua, Dipartimento Elettron & Informat, I-35131 Padua, Italy
[3] Hebrew Univ Jerusalem, Sch Engn & Comp Sci, IL-91904 Jerusalem, Israel
Keywords
amnesic automata; probabilistic suffix trees; variable memory Markovian models; protein families; protein classification;
DOI
10.1089/106652700750050844
Chinese Library Classification
Q5 [Biochemistry]
Discipline Classification Codes
071010; 081704
Abstract
Statistical modeling of sequences is a central paradigm of machine learning that finds multiple uses in computational molecular biology and many other domains. The probabilistic automata typically built in these contexts are subtended by uniform, fixed-memory Markov models. In practice, such automata tend to be unnecessarily bulky and computationally imposing, both during their synthesis and their use. Recently, D. Ron, Y. Singer, and N. Tishby built much more compact, tree-shaped variants of probabilistic automata under the assumption of an underlying Markov process of variable memory length. These variants, called Probabilistic Suffix Trees (PSTs), were subsequently adapted by G. Bejerano and G. Yona and applied successfully to learning and prediction of protein families. The process of learning the automaton from a given training set S of sequences requires Θ(Ln²) worst-case time, where n is the total length of the sequences in S and L is the length of a longest substring of S to be considered for a candidate state in the automaton. Once the automaton is built, predicting the likelihood of a query sequence of m characters may cost Θ(m²) time in the worst case. The main contribution of this paper is to introduce automata that are equivalent to PSTs but have the following properties: learning the automaton, for any L, takes O(n) time, and prediction of a string of m symbols takes O(m) time. Along the way, the paper presents an evolving learning scheme and addresses notions of empirical probability and their efficient computation, a by-product possibly of more general interest.
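To make the variable-memory idea concrete, the following is a minimal illustrative sketch of how a PST-style model scores a query string by backing off, at each position, to the longest stored context. It is not the paper's linear-time automaton: the toy model toy_pst, its probabilities, and the function log_likelihood are assumptions made up for this sketch, and the naive suffix lookup shown here costs O(mL) per query rather than the O(m) bound the paper achieves.

from math import log

# Toy PST over the alphabet {A, C, G, T}: each key is a context (a suffix of
# the symbols read so far) and each value gives next-symbol probabilities.
# Contexts and probabilities are invented for illustration only.
toy_pst = {
    "":   {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # root = order-0 model
    "A":  {"A": 0.10, "C": 0.40, "G": 0.40, "T": 0.10},
    "CA": {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
}

def log_likelihood(pst, query, max_depth=12):
    """Score `query` left to right: at each position, use the longest suffix
    of the already-seen prefix that is a node of the tree. This naive backoff
    costs O(m * L); the paper's equivalent automaton answers in O(m)."""
    total = 0.0
    for i, symbol in enumerate(query):
        context = query[max(0, i - max_depth):i]
        # back off to shorter and shorter suffixes until a stored node is found
        while context not in pst:
            context = context[1:]
        probs = pst[context]
        total += log(probs.get(symbol, 1e-9))  # small floor for unseen symbols
    return total

print(log_likelihood(toy_pst, "ACAGT"))

Roughly speaking, the automata introduced in the paper remove the repeated suffix backoff above, so that each of the m prediction steps is resolved in constant amortized time.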
Pages: 381-393 (13 pages)
References (19 items)
[1] Abe, N. (1992). Machine Learning, 9, 205. DOI 10.1007/BF00992677.
[2] Aho, A.V., Corasick, M.J. (1975). Efficient string matching: an aid to bibliographic search. Communications of the ACM, 18(6), 333-340.
[3] Apostolico, A. (2000). RECOMB 2000: Proceedings of the Fourth Annual International Conference on Computational Molecular Biology, 25. DOI 10.1145/332306.332321.
[4] Apostolico, A., Bock, M.E., Lonardi, S. (1999). Linear global detectors of redundant and rare substrings. DCC '99: Data Compression Conference, Proceedings, 168-177.
[5] Apostolico, A., Bock, M.E., Lonardi, S., Xu, X.Y. (2000). Efficient detection of unusual words. Journal of Computational Biology, 7(1-2), 71-94.
[6] Apostolico, A. (1997). Pattern Matching Algorithms.
[7] Bairoch, A., Apweiler, R. (2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Research, 28(1), 45-48.
[8] Bateman, A. (2004). Nucleic Acids Research, 32, D138. DOI 10.1093/nar/gkp985; 10.1093/nar/gkh121; 10.1093/nar/gkr1065.
[9] Bejerano, G. (in press). Bioinformatics.
[10] Bejerano, G. (1999). Proceedings of RECOMB, 15.