Bursty and hierarchical structure in streams

被引:1457
作者
Kleinberg, J [1 ]
机构
[1] Cornell Univ, Dept Comp Sci, Ithaca, NY 14853 USA
关键词
data stream algorithms; text mining; Markov source models;
D O I
10.1023/A:1024940629314
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
A fundamental problem in text data mining is to extract meaningful structure from document streams that arrive continuously over time. E-mail and news articles are two natural examples of such streams, each characterized by topics that appear, grow in intensity for a period of time, and then fade away. The published literature in a particular research field can be seen to exhibit similar phenomena over a much longer time scale. Underlying much of the text mining work in this area is the following intuitive premise-that the appearance of a topic in a document stream is signaled by a "burst of activity," with certain features rising sharply in frequency as the topic emerges. The goal of the present work is to develop a formal approach for modeling such " bursts," in such a way that they can be robustly and efficiently identified, and can provide an organizational framework for analyzing the underlying content. The approach is based on modeling the stream using an infinite-state automaton, in which bursts appear naturally as state transitions; it can be viewed as drawing an analogy with models from queueing theory for bursty network traffic. The resulting algorithms are highly efficient, and yield a nested representation of the set of bursts that imposes a hierarchical structure on the overall stream. Experiments with e-mail and research paper archives suggest that the resulting structures have a natural meaning in terms of the content that gave rise to them.
引用
收藏
页码:373 / 397
页数:25
相关论文
共 70 条
  • [21] NEAREST NEIGHBOR PATTERN CLASSIFICATION
    COVER, TM
    HART, PE
    [J]. IEEE TRANSACTIONS ON INFORMATION THEORY, 1967, 13 (01) : 21 - +
  • [22] EHRICH R, 1976, IEEE T COMPUT, V25, P7
  • [23] ELWALID A, 1993, IEEE ACM T NETWORKIN, V1
  • [24] FINE S, 1998, MACHINE LEARNING, V32
  • [25] Forster E.M., 1927, Aspects of the Novel
  • [26] Garofalakis M., 2002, ACM SIGMOD INT C MAN
  • [27] GAY G, 2001, ED TECHNOLOGY SOC, V4
  • [28] Genette Gerard., 1993, NARRATIVE DISCOURSE
  • [29] Genette Gerard, 1980, Narrative Discourse: An Essay in Method
  • [30] Grosz B., 1986, COMPUTATIONAL LINGUI, V12