Identifying hierarchical structure in sequences: A linear-time algorithm

被引:279
作者
NevillManning, CG
Witten, IH
机构
关键词
D O I
10.1613/jair.374
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
SEQUITUR is an algorithm that infers a hierarchical structure from a sequence of discrete symbols by replacing repeated phrases with a grammatical rule that generates the phrase, and continuing this process recursively. The result is a hierarchical representation of the original sequence, which offers insights into its lexical structure. The algorithm is driven by two constraints that reduce the size of the grammar, and produce structure as a by-product. S EQUITUR breaks new ground by . operating incrementally. Moreover, the method's simple structure permits a proof that it operates in space and time that is linear in the size of the input. Our implementation can process 50,000 symbols per second and has been applied to an extensive range of real world sequences.
引用
收藏
页码:67 / 82
页数:16
相关论文
共 24 条
[1]  
ANDREAE JH, 1977, THINKING TEACHABLE M
[2]   INFERENCE OF REVERSIBLE LANGUAGES [J].
ANGLUIN, D .
JOURNAL OF THE ACM, 1982, 29 (03) :741-765
[3]  
[Anonymous], 1993, WATCH WHAT I DO PROG
[4]  
Bell T. C., 1990, TEXT COMPRESSION
[5]  
Berwick R. C., 1987, Machine Learning, V2, P9, DOI 10.1023/A:1022860810097
[6]   ATTENTION AND STRUCTURE IN SEQUENCE LEARNING [J].
COHEN, A ;
IVRY, RI ;
KEELE, SW .
JOURNAL OF EXPERIMENTAL PSYCHOLOGY-LEARNING MEMORY AND COGNITION, 1990, 16 (01) :17-30
[7]  
COOK CM, 1976, INFORM SCIENCES, V10, P59, DOI 10.1016/0020-0255(76)90061-X
[8]   BEHAVIOR STRUCTURE TRANSFORMATIONS UNDER UNCERTAINTY [J].
GAINES, BR .
INTERNATIONAL JOURNAL OF MAN-MACHINE STUDIES, 1976, 8 (03) :337-365
[9]   LANGUAGE IDENTIFICATION IN LIMIT [J].
GOLD, EM .
INFORMATION AND CONTROL, 1967, 10 (05) :447-&
[10]  
JOHANSSON S, 1978, MANUAL INFORMATION A