An optimized sequential pattern matching methodology for sequence classification

Cited by: 13
Authors
Exarchos, Themis P. [2]
Tsipouras, Markos G. [1]
Papaloukas, Costas [3]
Fotiadis, Dimitrios I. [1]
Affiliations
[1] Univ Ioannina, Dept Comp Sci, Unit Med Technol & Intelligent Informat Syst, GR-45110 Ioannina, Greece
[2] Univ Ioannina, Sch Med, Dept Med Phys, GR-45110 Ioannina, Greece
[3] Univ Ioannina, Dept Biol Applicat & Technol, GR-45110 Ioannina, Greece
Keywords
Sequential pattern mining; Sequential pattern matching; Sequence classification; Optimization; Hidden Markov models
DOI
10.1007/s10115-008-0146-2
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper we present a novel methodology for sequence classification, based on sequential pattern mining and optimization algorithms. The proposed methodology automatically generates a sequence classification model through a two-stage process. In the first stage, a sequential pattern mining algorithm is applied to a set of sequences and the sequential patterns are extracted. Then, the score of every pattern with respect to each sequence is calculated using a scoring function, and the score of each class under consideration is estimated by summing the scores of its patterns. Each score is updated by multiplying it by a weight, and the output of the first stage is the classification confusion matrix of the sequences. In the second stage, an optimization technique seeks a set of weights that minimizes an objective function defined using the classification confusion matrix. The set of extracted sequential patterns and the optimal class weights comprise the sequence classification model. The methodology was evaluated extensively in the protein classification domain by varying the number of training and test sequences, the number of patterns, and the number of classes. The methodology is compared with other similar sequence classification approaches. The proposed methodology exhibits several advantages, such as automated weight assignment to classes using optimization techniques and knowledge discovery in the domain of application.
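The two-stage process described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every specific in it is an assumption made for illustration: patterns are given by hand rather than mined, a pattern matches when it is a subsequence of the sequence, the scoring function is the pattern length, the objective is the number of misclassified sequences (the off-diagonal total of the confusion matrix), and a plain grid search stands in for the optimization technique. All function names and the toy data are hypothetical.

```python
from itertools import product


def is_subsequence(pattern, sequence):
    """Return True if the symbols of `pattern` occur in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    return all(symbol in it for symbol in pattern)


def class_scores(sequence, patterns, weights):
    """Stage 1: sum the pattern scores per class, then multiply by the class weight."""
    scores = {label: 0.0 for label in weights}
    for pattern, label in patterns:
        if is_subsequence(pattern, sequence):
            scores[label] += len(pattern)  # assumed scoring function: pattern length
    return {label: weights[label] * score for label, score in scores.items()}


def classify(sequence, patterns, weights):
    """Assign the class with the highest weighted score."""
    scores = class_scores(sequence, patterns, weights)
    return max(sorted(scores), key=lambda label: scores[label])


def confusion_matrix(data, patterns, weights):
    """Count (true class, predicted class) pairs over labelled sequences."""
    labels = sorted(weights)
    matrix = {true: {pred: 0 for pred in labels} for true in labels}
    for sequence, true_label in data:
        matrix[true_label][classify(sequence, patterns, weights)] += 1
    return matrix


def objective(matrix):
    """Assumed objective: total number of off-diagonal (misclassified) entries."""
    return sum(n for true, row in matrix.items() for pred, n in row.items() if true != pred)


def fit_weights(data, patterns, labels, grid=(0.5, 1.0, 2.0)):
    """Stage 2: grid search (a stand-in optimizer) for weights minimizing the objective."""
    best_weights, best_cost = None, None
    for combo in product(grid, repeat=len(labels)):
        weights = dict(zip(labels, combo))
        cost = objective(confusion_matrix(data, patterns, weights))
        if best_cost is None or cost < best_cost:
            best_weights, best_cost = weights, cost
    return best_weights, best_cost


# Toy data (hypothetical): (pattern, class) pairs and labelled training sequences.
patterns = [("AB", "x"), ("CD", "y"), ("CDE", "y")]
data = [("ABCDE", "x"), ("CDE", "y"), ("AB", "x")]

weights, cost = fit_weights(data, patterns, ["x", "y"])
print(weights, cost)
```

With equal class weights the sequence "ABCDE" is pulled toward class y because two y patterns match it; the weight search compensates by raising the weight of class x, which is the role the abstract assigns to the second stage.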
Pages: 249-264
Number of pages: 16