Mining periodic patterns with gap requirement from sequences

被引:19
作者
National University of Singapore [1 ]
不详 [2 ]
不详 [3 ]
不详 [4 ]
不详 [5 ]
不详 [6 ]
机构
[1] School of Computing, National University of Singapore, Singapore 117590
[2] Department of Computer Science, University of Hong Kong, Hong Kong, Pokfulam Road
关键词
Gap requirement; Periodic pattern; Sequence mining;
D O I
10.1145/1267066.1267068
中图分类号
学科分类号
摘要
We study a problem of mining frequently occurring periodic patterns with a gap requirement from sequences. Given a character sequence S of length L and a pattern P of length l, we consider P a frequently occurring pattern in S if the probability of observing P given a randomly picked length-l subsequence of S exceeds a certain threshold. In many applications, particularly those related to bioinformatics, interesting patterns are periodic with a gap requirement. That is to say, the characters in P should match subsequences of S in such a way that the matching characters in S are separated by gaps of more or less the same size. We show the complexity of the mining problem and discuss why traditional mining algorithms are computationally infeasible. We propose practical algorithms for solving the problem and study their characteristics. We also present a case study in which we apply our algorithms on some DNA sequences. We discuss some interesting patterns obtained from the case study. © 2007 ACM.
引用
收藏
相关论文
共 20 条
  • [1] ALTSCHUL S.F., GISH W., MILLER W., MYERS E.W., LIPMAN D.J., Basic local alignment search tool, J Molec. Biol, 215, pp. 403-410, (1990)
  • [2] BAIROCH A., BOECKMANN B., The swiss-prot protein sequence data bank, Nucleic Acids Res, 20, SUPPL., pp. 2019-2022, (1992)
  • [3] BERNARDI G., OLOFSSON B., FILIPSKI J., ZERIAL M., SALINAS J., CUNY G., MEUNIER-ROTIVAL M., RODIER F., The mosaic genome of warm-blooded vertebrates, Science, 228, 4702, pp. 953-958, (1985)
  • [4] COWARD E., DRABLOS F., Detecting periodic patterns in biological sequences, Bioinformatics, 14, 6, pp. 498-507, (1998)
  • [5] FICKETT J.W., TUNG C.S., Assessment of protein coding measures, Nucleuic Acids Res, 20, pp. 6441-6450, (1992)
  • [6] HERZEL H., WEISS O., TRIFONOV E.N., 10-11 bp periodicities in complete genomes reflect protein structure and DNA folding, Bioinformatics, 15, 3, pp. 187-193, (1999)
  • [7] JONASSEN I., Efficient discovery of conserved patterns using a pattern graph, Tech. Rep. Report No, 118, (1996)
  • [8] KURTZ S., OHLEBUSCH E., SCHLEIERMACHER C., STOYE J., GIEGERICH R., Computation and visualization of degenerate repeats in complete genomes, In Proceedings of the 8th International Conference on Intelligent Systems for Molecular (ISMB-00), (2000)
  • [9] The National Center for Biotechnology Information web site
  • [10] PEI J., HAN J., MORTAZAVI-ASL B., PINTO H., CHEN Q., DAYAL U., HSU M.-C., Prefixspan: Mining sequential patterns by prefix-projected growth, Proceedings of the 17th IEEE International Conference on Data Engineering (ICDE), (2001)