The complexity of non-hierarchical clustering with instance and cluster level constraints

被引:38
作者
Davidson, Ian [1 ]
Ravi, S. S. [1 ]
机构
[1] SUNY Albany, Dept Comp Sci, Albany, NY 12222 USA
关键词
non-hierarchical clustering; constraints; complexity;
D O I
10.1007/s10618-006-0053-7
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Recent work has looked at extending clustering algorithms with instance level must-link (ML) and cannot-link (CL) background information. Our work introduces delta and epsilon cluster level constraints that influence inter-cluster distances and cluster composition. The addition of background information, though useful at providing better clustering results, raises the important feasibility question: Given a collection of constraints and a set of data, does there exist at least one partition of the data set satisfying all the constraints? We study the complexity of the feasibility problem for each of the above constraints separately and also for combinations of constraints. Our results clearly delineate combinations of constraints for which the feasibility problem is computationally intractable (i.e., NP-complete) from those for which the problem is efficiently solvable (i.e., in the computational class P). We also consider the ML and CL constraints in conjunctive and disjunctive normal forms (CNF and DNF respectively). We show that for ML constraints, the feasibility problem is intractable for CNF but efficiently solvable for DNF. Unfortunately, for CL constraints, the feasibility problem is intractable for both CNF and DNF. This effectively means that CL-constraints in a non-trivial form cannot be efficiently incorporated into clustering algorithms. To overcome this, we introduce the notion of a choice-set of constraints and prove that the feasibility problem for choice-sets is efficiently solvable for both ML and CL constraints. We also present empirical results which indicate that the feasibility problem occurs extensively in real world problems.
引用
收藏
页码:25 / 61
页数:37
相关论文
共 27 条
[1]  
[Anonymous], INT C KNOWL DISC DAT, DOI DOI 10.1145/1014052.1014062
[2]  
Bansal N, 2002, ANN IEEE SYMP FOUND, P238, DOI 10.1109/SFCS.2002.1181947
[3]  
Basu S, 2004, SIAM PROC S, P333
[4]  
Basu S., 2002, P 19 INT C MACHINE L, P19, DOI [10.5555/645531.656012, DOI 10.5555/645531.656012]
[5]  
Bayardo R.J., 1999, P 5 ACM SIGKDD INT C
[6]  
Bilenko M., 2004, P 21 INT C MACH LEAR, P11, DOI [DOI 10.1145/1015330.1015360, 10.1145/1015330.1015360]
[7]  
BRADLEY PS, 1998, P 15 INT C MACH LEAR, P91
[8]  
Campers G., 1987, P OP RES 87 BUEN AIR, P917
[9]   Clustering with qualitative information [J].
Charikar, M ;
Guruswami, V ;
Wirth, A .
44TH ANNUAL IEEE SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE, PROCEEDINGS, 2003, :524-533
[10]   THE COMPUTATIONAL-COMPLEXITY OF PROBABILISTIC INFERENCE USING BAYESIAN BELIEF NETWORKS [J].
COOPER, GF .
ARTIFICIAL INTELLIGENCE, 1990, 42 (2-3) :393-405