Cost-constrained data acquisition for intelligent data preparation

Citations: 21
Authors
Zhu, XQ [1]
Wu, XD [1]
Affiliation
[1] Univ Vermont, Dept Comp Sci, Burlington, VT 05401 USA
Keywords
data mining; intelligent data preparation; data acquisition; cost-sensitive; machine learning; instance ranking
DOI
10.1109/TKDE.2005.176
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Real-world data is noisy and often suffers from corrupted or incomplete values that can degrade the models built from it. To build accurate predictive models, data acquisition is usually adopted to prepare the data and complete missing values. However, because acquisition is costly and the attributes in a data set are inherently correlated, acquiring correct information for all instances is both prohibitive and unnecessary. An interesting and important problem that arises here is which instances to complete so that the model built from the processed data receives the "maximum" performance improvement. The problem is complicated by the fact that attribute costs differ: fixing the missing values of some attributes is inherently more expensive than fixing others. The question therefore becomes: given a fixed budget, which instances should be selected for preparation so that the learner built from the processed data set maximizes its performance? In this paper, we propose a solution to this problem. The essential idea is to combine attribute costs with the relevance of each attribute to the target concept, so that data acquisition pays more attention to attributes that are cheap but informative for classification. To this end, we first introduce a unique Economical Factor (EF) that seamlessly integrates the cost and the importance (in terms of classification) of each attribute. We then propose a cost-constrained data acquisition model in which active learning, missing value prediction, and impact-sensitive instance ranking are combined for effective data acquisition. Experimental results and comparative studies on real-world data sets demonstrate the effectiveness of our method.
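The abstract does not give the definition of the Economical Factor, so the following is only a minimal sketch of the general idea of weighing an attribute's relevance against its cost. It assumes a simple stand-in score, information gain divided by cost, and a greedy budgeted selection over missing cells; the function names (`gain_per_cost`, `select_acquisitions`), the toy data, and the scoring rule itself are illustrative assumptions, not the paper's EF or its instance-ranking model.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Information gain of one attribute, ignoring instances where it is missing (None)."""
    known = [(v, y) for v, y in zip(values, labels) if v is not None]
    if not known:
        return 0.0
    remainder = 0.0
    for value in {v for v, _ in known}:
        subset = [y for v, y in known if v == value]
        remainder += len(subset) / len(known) * entropy(subset)
    return entropy([y for _, y in known]) - remainder


def gain_per_cost(columns, labels, costs):
    """ASSUMED stand-in score: information gain per unit of acquisition cost."""
    return {name: information_gain(vals, labels) / costs[name]
            for name, vals in columns.items()}


def select_acquisitions(columns, labels, costs, budget):
    """Greedily pick missing (attribute, row) cells by score until the budget is used up."""
    scores = gain_per_cost(columns, labels, costs)
    candidates = [(scores[name], name, row)
                  for name, vals in columns.items()
                  for row, v in enumerate(vals) if v is None]
    # Highest score first; ties broken by attribute name, then row index.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
    chosen, spent = [], 0.0
    for _, name, row in candidates:
        if spent + costs[name] <= budget:
            chosen.append((name, row))
            spent += costs[name]
    return chosen, spent


# Hypothetical toy data: None marks a missing value.
labels = ["y", "y", "n", "n", "y", "n"]
columns = {
    "cheap_test": ["a", "a", "b", "b", None, None],
    "pricey_scan": ["p", None, "q", "q", "p", None],
    "noisy": ["u", "v", "u", "v", None, "u"],
}
costs = {"cheap_test": 1, "pricey_scan": 5, "noisy": 1}

chosen, spent = select_acquisitions(columns, labels, costs, budget=3)
print(chosen, spent)
```

In this toy setting, `cheap_test` and `pricey_scan` are equally informative on the rows where they are known, but the cheaper attribute gets the higher score and is filled first. The paper's model additionally uses active learning, missing value prediction, and impact-sensitive instance ranking, none of which this sketch attempts.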
Pages: 1542-1556
Page count: 15