Prediction modeling using EHR data: challenges, strategies, and a comparison of machine learning approaches.
Electronic health record (EHR) databases contain vast amounts of information about patients. Machine learning techniques such as Boosting and support vector machine (SVM) can potentially identify patients at high risk for serious conditions, such as heart disease, from EHR data. However, these techniques have not yet been widely tested. To model detection of heart failure more than 6 months before the actual date of clinical diagnosis using machine learning techniques applied to EHR data. To compare the performance of logistic regression, SVM, and Boosting, along with various variable selection methods in heart failure prediction. Geisinger Clinic primary care patients with data in the EHR data from 2001 to 2006 diagnosed with heart failure between 2003 and 2006 were identified. Controls were randomly selected matched on sex, age, and clinic for this nested case-control study. Area under the curve (AUC) of receiver operator characteristic curve was computed for each method using 10-fold cross-validation. The number of variables selected by each method was compared. Logistic regression with model selection based on Bayesian information criterion provided the most parsimonious model, with about 10 variables selected on average, while maintaining a high AUC (0.77 in 10-fold cross-validation). Boosting with strict variable importance threshold provided similar performance. Heart failure was predicted more than 6 months before clinical diagnosis, with AUC of about 0.76, using logistic regression and Boosting. These results were achieved even with strict model selection criteria. SVM had the poorest performance, possibly because of imbalanced data.