Field independent probabilistic model for clustering multi-field documents

Cited by: 22
Authors
Zhu, Shanfeng [1]
Takigawa, Ichigaku [2]
Zeng, Jia [3]
Mamitsuka, Hiroshi [2]
Affiliations
[1] Fudan Univ, Sch Comp Sci, Shanghai Key Lab Intelligent Informat Proc, Shanghai 200433, Peoples R China
[2] Kyoto Univ, Inst Chem Res, Bioinformat Ctr, Kyoto 6110011, Japan
[3] Hong Kong Baptist Univ, Dept Comp Sci, Kowloon Tong, Hong Kong, Peoples R China
Keywords
Document clustering; Finite mixture model; Multivariate Bernoulli model; Multinomial model; Field independent clustering model
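DOI
10.1016/j.ipm.2009.03.005
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
080201 [Mechanical Manufacturing and Automation];
Abstract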
We propose a new finite mixture model for clustering multiple-field documents, such as scientific literature with distinct fields: title, abstract, keywords, main text and references. This probabilistic model, which we call the field independent clustering model (FICM), incorporates the distinct word distributions of each field, both to integrate the discriminative abilities of the fields and to select the most suitable component probabilistic model for each field. We evaluated FICM by applying it to the problem of clustering three-field (title, abstract and MeSH) biomedical documents from the TREC 2004 and 2005 Genomics tracks, and two-field (title and abstract) news reports from Reuters-21578. Experimental results showed that FICM outperformed the classical multinomial model and the multivariate Bernoulli model, with statistically significant differences on all three collections. These results indicate that FICM outperformed widely used probabilistic models for document clustering by considering the characteristics of each field. We further showed that a component model consistent with the nature of the corresponding field achieved better performance, and that considering the diversity of model settings gave a further improvement. An extended abstract of parts of the work presented in this paper has appeared in Zhu et al. [Zhu, S., Takigawa, I., Zhang, S., & Mamitsuka, H. (2007). A probabilistic model for clustering text documents with multiple fields. In Proceedings of the 29th European conference on information retrieval, ECIR 2007. Lecture notes in computer science (Vol. 4425, pp. 331-342)]. © 2009 Elsevier Ltd. All rights reserved.
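The abstract gives no formulas, so the following is only a minimal sketch of the idea it describes, not the authors' implementation. It assumes, as the model's name suggests, that fields are conditionally independent given the cluster, so per-field log-likelihoods simply add, and that each field may use either a multinomial or a multivariate Bernoulli component. All function names, parameter values and the toy document are hypothetical.

```python
import math

def multinomial_loglik(counts, probs):
    # Log-likelihood of word counts under a multinomial, omitting the
    # multinomial coefficient (it is the same for every cluster).
    return sum(c * math.log(p) for c, p in zip(counts, probs) if c > 0)

def bernoulli_loglik(counts, probs):
    # Multivariate Bernoulli: only word presence/absence matters.
    # Each probability must lie strictly between 0 and 1.
    return sum(math.log(p) if c > 0 else math.log(1.0 - p)
               for c, p in zip(counts, probs))

def responsibilities(doc, priors, params, models):
    # doc:    field name -> word-count vector
    # params: one dict per cluster, field name -> probability vector
    # models: field name -> log-likelihood function chosen for that field
    log_joint = []
    for prior, cluster in zip(priors, params):
        ll = math.log(prior)
        for field, counts in doc.items():
            # Assumed field independence: per-field terms are summed.
            ll += models[field](counts, cluster[field])
        log_joint.append(ll)
    # Normalize in log space for numerical stability.
    top = max(log_joint)
    weights = [math.exp(v - top) for v in log_joint]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical toy setup: two clusters, two fields, a two-word vocabulary.
priors = [0.5, 0.5]
params = [
    {"title": [0.9, 0.1], "abstract": [0.8, 0.2]},
    {"title": [0.1, 0.9], "abstract": [0.2, 0.8]},
]
models = {"title": bernoulli_loglik, "abstract": multinomial_loglik}
doc = {"title": [1, 0], "abstract": [3, 1]}

post = responsibilities(doc, priors, params, models)
print(post)
```

In a full clustering procedure these posterior weights would serve as the E-step of an EM loop, with the per-field parameters re-estimated in the M-step; that part is omitted here.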
Pages: 555-570
Page count: 16