Contextual correlation based thread detection in short text message streams

Cited by: 9
Authors
Huang, Jiuming [1]
Zhou, Bin [1]
Wu, Quanyuan [1]
Wang, Xiaowei [1]
Jia, Yan [1]
Affiliations
[1] National University of Defense Technology, College of Computer, Changsha 410073, Hunan, People's Republic of China
Funding
National Natural Science Foundation of China;
Keywords
Text stream; Thread detection; Short text; Contextual correlation;
DOI
10.1007/s10844-011-0162-7
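Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
140502 [Artificial intelligence];
Abstract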
Short text message streams are produced by Instant Messaging and Short Message Service, both of which are widely used today. Each stream usually contains more than one thread. Detecting the threads in a stream is useful for applications such as business intelligence, criminal investigation, and public opinion analysis. Existing work is based mainly on text similarity and faces several challenges, including the sparse feature vectors and irregular language of short text messages. This paper introduces the novel concept of contextual correlation, in place of traditional text similarity, into the single-pass clustering algorithm to address these challenges. We first analyze the contextually correlative nature of conversations in short text message streams, and then propose an unsupervised method to compute the degree of correlation. On this basis, a single-pass algorithm employing contextual correlation is developed to detect threads in massive short text streams. Experiments on large real-life online chat logs show that our approach improves the F1 measure by 11% compared with the best similarity-based algorithm.
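As a rough illustration of the single-pass clustering scheme mentioned in the abstract, the sketch below assigns each incoming message to the best-scoring existing thread, or opens a new thread when no score reaches a threshold. The abstract does not give the formula for contextual correlation, so `toy_correlation` is a hypothetical stand-in (word overlap mixed with a time-decay term), and the names `Message`, `Thread`, `single_pass`, `half_life`, and `threshold` are assumptions made for this example only.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    timestamp: float  # seconds since the start of the stream
    author: str
    text: str


@dataclass
class Thread:
    messages: list = field(default_factory=list)


def toy_correlation(msg, thread, half_life=60.0):
    """Hypothetical stand-in score; NOT the paper's contextual correlation."""
    words = set(msg.text.lower().split())
    thread_words = set()
    for m in thread.messages:
        thread_words.update(m.text.lower().split())
    union = words | thread_words
    # Jaccard overlap between the message and all words seen in the thread.
    overlap = len(words & thread_words) / len(union) if union else 0.0
    # Recency halves every `half_life` seconds since the thread's last message.
    gap = max(0.0, msg.timestamp - thread.messages[-1].timestamp)
    recency = 0.5 ** (gap / half_life)
    return 0.5 * overlap + 0.5 * recency


def single_pass(messages, score=toy_correlation, threshold=0.4):
    """Process messages once, in arrival order, and group them into threads."""
    threads = []
    for msg in messages:
        best, best_score = None, threshold
        for thread in threads:
            s = score(msg, thread)
            if s >= best_score:
                best, best_score = thread, s
        if best is None:
            # No existing thread is correlated enough: start a new one.
            best = Thread()
            threads.append(best)
        best.messages.append(msg)
    return threads


stream = [
    Message(0, "a", "anyone seen the football match"),
    Message(5, "b", "the football match was great"),
    Message(1000, "c", "lunch plans tomorrow"),
    Message(1005, "d", "tomorrow lunch works for me"),
]
threads = single_pass(stream)
```

Because the scoring function is passed in as a parameter, a real contextual-correlation measure could replace `toy_correlation` without changing the clustering loop; the weights and threshold here are arbitrary and would need tuning on real chat logs.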
Pages: 449-464
Page count: 16