Ensemble of Bayesian predictors and decision trees for proactive failure management in cloud computing systems

Cited by: 64
Authors
Guan, Qiang [1]
Zhang, Ziming [1]
Fu, Song [1]
Affiliations
[1] Department of Computer Science and Engineering, University of North Texas, Denton
Source
Journal of Communications | 2012 / Vol. 7 / No. 1
Keywords
Cloud computing; Decision tree; Dependability assurance; Ensemble of Bayesian models; Failure prediction; Unsupervised and supervised learning
DOI
10.4304/jcm.7.1.52-61
Abstract
In modern cloud computing systems, hundreds or even thousands of cloud servers are interconnected by multi-layer networks. In such large-scale and complex systems, failures are common. Proactive failure management is a crucial technology for characterizing system behaviors and forecasting failure dynamics in the cloud. To make failure predictions, we need to monitor the system execution and collect health-related runtime performance data. However, in newly deployed or managed cloud systems, these data are usually unlabeled, so approaches based on supervised learning are not suitable. In this paper, we present an unsupervised failure detection method using an ensemble of Bayesian models. It characterizes normal execution states of the system and detects anomalous behaviors. After the anomalies are verified by system administrators, labeled data become available. Then, we apply supervised learning based on decision tree classifiers to predict future failure occurrences in the cloud. Experimental results in an institute-wide cloud computing system show that our methods can achieve a high true positive rate and a low false positive rate for proactive failure management. © 2012 ACADEMY PUBLISHER.
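The two-stage workflow described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the three synthetic metrics (CPU, memory, I/O), the choice of naive Gaussian models over random metric subsets as ensemble members, the mean-plus-two-standard-deviations threshold, the simulated administrator who confirms every flagged record, and the depth-one decision tree (a stump) standing in for a full decision tree classifier.

```python
import math
import random
from statistics import mean, pstdev

# Hypothetical unlabeled monitoring data: (cpu, mem, io) utilization per record.
# Records 0-49 imitate normal operation, records 50-54 imitate failures.
rows = [(0.30 + 0.01 * (i % 5), 0.40 + 0.01 * (i % 3), 0.20 + 0.01 * (i % 4))
        for i in range(50)]
rows += [(0.90 + 0.01 * j, 0.90, 0.80) for j in range(5)]

def fit_member(data, idx):
    """Fit one naive Gaussian model (metrics treated as independent) on subset idx."""
    member = []
    for i in idx:
        column = [r[i] for r in data]
        member.append((i, mean(column), max(pstdev(column), 1e-6)))
    return member

def neg_log_likelihood(member, row):
    """Higher values mean the record is less likely under this member."""
    return sum(0.5 * math.log(2 * math.pi * s * s) + (row[i] - mu) ** 2 / (2 * s * s)
               for i, mu, s in member)

# Stage 1 (unsupervised): average the scores of several members, each fitted
# on a random subset of metrics (assumed ensemble construction).
rng = random.Random(0)
members = [fit_member(rows, rng.sample(range(3), 2)) for _ in range(5)]
scores = [mean(neg_log_likelihood(m, r) for m in members) for r in rows]
threshold = mean(scores) + 2 * pstdev(scores)  # assumed cutoff rule
flagged = [i for i, s in enumerate(scores) if s > threshold]

# Simulated administrator verification: every flagged record is confirmed.
labels = [1 if i in flagged else 0 for i in range(len(rows))]

def gini(ys):
    """Gini impurity of a list of 0/1 labels."""
    p = sum(ys) / len(ys)
    return 2 * p * (1 - p)

def fit_stump(data, ys):
    """Stage 2 (supervised): pick the single threshold split with lowest impurity."""
    best = None
    for f in range(len(data[0])):
        for t in sorted({r[f] for r in data}):
            left = [y for r, y in zip(data, ys) if r[f] <= t]
            right = [y for r, y in zip(data, ys) if r[f] > t]
            if not left or not right:
                continue
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
            if best is None or impurity < best[0]:
                # Each side predicts its majority label.
                best = (impurity, f, t,
                        int(2 * sum(left) >= len(left)),
                        int(2 * sum(right) >= len(right)))
    return best[1:]

def predict(stump, row):
    """Return 1 (failure predicted) or 0 (normal) for a new record."""
    f, t, left_label, right_label = stump
    return left_label if row[f] <= t else right_label

stump = fit_stump(rows, labels)
print(flagged, predict(stump, (0.93, 0.88, 0.79)))
```

In this toy setting the detector needs no labels to isolate the outlying records, and the labels it produces are enough to train the supervised stage, which mirrors the order of steps in the abstract but not its actual models or data.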
Pages: 52-61
Number of pages: 9
Related papers
49 in total
[1]  
Sahoo R.K., Oliner A.J., Rish I., et al., Critical event prediction for proactive management in large-scale computer clusters, Proceedings of ACM International Conference On Knowledge Discovery and Data Mining (SIGKDD), (2003)
[2]  
Oliner A.J., Sahoo R.K., Moreira J.E., et al., Fault-aware job scheduling for BlueGene/L systems, Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS), (2004)
[3]  
Salfner F., Lenk M., Malek M., A survey of online failure prediction methods, ACM Computing Surveys, 42, (2010)
[4]  
Mickens J.W., Noble B.D., Exploiting availability prediction in distributed systems, Proceedings of USENIX Symposium On Networked Systems Design and Implementation (NSDI), (2006)
[5]  
Fu S., Xu C., Quantifying event correlations for proactive failure management in networked computing systems, Journal of Parallel and Distributed Computing, 70, 11, pp. 1100-1109, (2010)
[6]  
Gu J., Zheng Z., Lan Z., White J., Hocks E., Park B.-H., Dynamic meta-learning for failure prediction in large-scale systems: A case study, Proceedings of IEEE International Conference On Parallel Processing (ICPP), (2008)
[7]  
Song H., Leangsuksun C., Nassar R., Availability modeling and analysis on high performance cluster computing systems, Proceedings of IEEE International Conference On Availability, Reliability and Security (ARES), (2006)
[8]  
Han J., Data Mining: Concepts and Techniques, (2005)
[9]  
Cover T., Thomas J., Elements of Information Theory, (1991)
[10]  
Duda R.O., Hart P.E., Stork D.G., Pattern Classification, (2001)