Predicting disease risks from highly imbalanced data using random forest

Cited by: 455
Authors
Khalilia, Mohammed [2]
Chakraborty, Sounak [3]
Popescu, Mihail [1]
Affiliations
[1] Univ Missouri, Dept Hlth Management & Informat, Columbia, MO 65211 USA
[2] Univ Missouri, Dept Comp Sci, Columbia, MO USA
[3] Univ Missouri, Dept Stat, Columbia, MO 65211 USA
Keywords
Support Vector Machine; Random Forest; Imbalanced Data; Disease Prediction; National Inpatient Sample;
DOI
10.1186/1472-6947-11-51
Chinese Library Classification (CLC) number
R-058
Abstract
Background: We present a method that uses the Healthcare Cost and Utilization Project (HCUP) dataset to predict the disease risk of individuals based on their medical diagnosis history. The methodology may be incorporated into a variety of applications, such as risk management, tailored health communication and decision support systems in healthcare.
Methods: We used the National Inpatient Sample (NIS) data, which is publicly available through HCUP, to train random forest (RF) classifiers for disease prediction. Because the HCUP data is highly imbalanced, we used an ensemble learning approach based on repeated random sub-sampling. This technique divides the training data into multiple sub-samples while ensuring that each sub-sample is fully balanced. We compared the performance of support vector machine (SVM), bagging, boosting and RF in predicting the risk of eight chronic diseases.
Results: We predicted eight disease categories. Overall, the RF ensemble learning method outperformed SVM, bagging and boosting in terms of the area under the receiver operating characteristic (ROC) curve (AUC). In addition, RF has the advantage of computing the importance of each variable in the classification process.
Conclusions: By combining repeated random sub-sampling with RF, we were able to overcome the class imbalance problem and achieve promising results. Using the national HCUP data set, we predicted eight disease categories with an average AUC of 88.79%.
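The repeated random sub-sampling idea described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the abstract does not give the exact partitioning scheme, so the code assumes that each sub-sample pairs all minority-class records with an equally sized random chunk of majority-class records, and that the per-sub-sample models are combined by averaging their scores. The names `balanced_subsamples`, `fit_ensemble`, `fit_centroid` and `auc` are hypothetical, and `fit_centroid` is a toy stand-in for a random forest so that the example needs only the standard library.

```python
import random
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]         # (feature vector, label in {0, 1})
Scorer = Callable[[Sequence[float]], float]  # maps a feature vector to a risk score


def balanced_subsamples(data: List[Sample], seed: int = 0) -> List[List[Sample]]:
    """Pair all minority records with disjoint, equally sized majority chunks.

    Assumption: this is one plausible reading of "repeated random
    sub-sampling"; the paper's exact scheme may differ.
    """
    rng = random.Random(seed)
    ones = [s for s in data if s[1] == 1]
    zeros = [s for s in data if s[1] == 0]
    minority, majority = sorted((ones, zeros), key=len)
    if not minority:
        raise ValueError("both classes must be present")
    rng.shuffle(majority)
    k = len(minority)
    # Majority records that do not fill a whole chunk are dropped.
    return [minority + majority[i:i + k]
            for i in range(0, len(majority) - k + 1, k)]


def fit_ensemble(data: List[Sample],
                 fit: Callable[[List[Sample]], Scorer],
                 seed: int = 0) -> Scorer:
    """Train one model per balanced sub-sample and average their scores."""
    models = [fit(sub) for sub in balanced_subsamples(data, seed)]
    return lambda x: sum(m(x) for m in models) / len(models)


def fit_centroid(sub: List[Sample]) -> Scorer:
    """Toy base learner (NOT a random forest): signed distance of the first
    feature from the midpoint between the two class means."""
    pos = [x[0] for x, y in sub if y == 1]
    neg = [x[0] for x, y in sub if y == 0]
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    mid = (mean_pos + mean_neg) / 2
    sign = 1.0 if mean_pos >= mean_neg else -1.0
    return lambda x: sign * (x[0] - mid)


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve as the fraction of (positive, negative)
    pairs ranked correctly, with ties counted as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Synthetic, made-up data with a 5% positive rate (not HCUP data).
    rng = random.Random(42)
    data = [((rng.gauss(2.0, 1.0),), 1) for _ in range(50)]
    data += [((rng.gauss(0.0, 1.0),), 0) for _ in range(950)]
    model = fit_ensemble(data, fit_centroid)
    scores = [model(x) for x, _ in data]
    print("sub-samples:", len(balanced_subsamples(data)))
    print("training AUC:", round(auc(scores, [y for _, y in data]), 3))
```

In practice the `fit` argument would wrap a real classifier, for example a function that trains scikit-learn's `RandomForestClassifier` on the sub-sample and returns its predicted probability for the positive class. Evaluation would also use held-out data rather than the training records scored here.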
Pages: 13