Victor Wandera Lumumba, Teddy Mutugi Wanjuki, Elizabeth Wambui Njoroge
Abstract: Hypertension remains a critical health issue, and its complications, such as cardiovascular disease, stroke, and renal failure, likewise remain a global health concern. This study compared six supervised machine learning models (Support Vector Machines, k-Nearest Neighbors, Random Forest, Naïve Bayes, Tree Bagging, and Extreme Gradient Boosting) using data from 2322 participants. The primary predictors were systolic blood pressure (SBP) of 120 mmHg or higher, body mass index (BMI), age, and haemoglobin concentration (grams per litre), together with demographic data. With oversampling, Random Forest achieved the highest evaluation metrics, with an accuracy of 100%, balanced accuracy of 100%, sensitivity of 100%, specificity of 100%, and AUC of 100%, and was therefore the best model for predicting hypertension risk among patients. In the SHAP analysis for the "No" class, SBP had the highest feature importance (SHAP value = 0.24), followed by Gender (0.06) and BMI (0.05). Variables such as advanced HIV status and log-centered creatinine showed negligible impact (SHAP value = 0.00). The Random Forest model was accurate and stable across all performance criteria, outperforming all other models and exceeding the No Information Rate (0.978), while illustrating the importance of physiological factors in hypertension risk assessment. These results demonstrate the capability of Random Forest in predicting hypertension risk and offer guidance for improving screening methods and targeting public health initiatives.
Keywords: cardiovascular disease, hypertension, machine learning, supervised learning, sensitivity, specificity, oversampling.
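
The evaluation metrics named in the abstract (accuracy, balanced accuracy, sensitivity, specificity, and the No Information Rate) can all be derived from a binary confusion matrix. The following is a minimal sketch of those definitions, not the study's own code. The function name `binary_metrics` and the counts in the example are hypothetical, chosen only for illustration; they are not taken from the study's data. The No Information Rate is computed here as the accuracy of always predicting the majority class.

```python
def binary_metrics(tp, fn, fp, tn):
    """Compute screening metrics from confusion-matrix counts.

    Assumes both classes are present (tp + fn > 0 and tn + fp > 0).
    """
    total = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)   # share of true cases correctly flagged
    specificity = tn / (tn + fp)   # share of non-cases correctly cleared
    accuracy = (tp + tn) / total
    balanced_accuracy = (sensitivity + specificity) / 2
    # No Information Rate: accuracy of always guessing the larger class
    no_information_rate = max(tp + fn, tn + fp) / total
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
        "no_information_rate": no_information_rate,
    }


# Hypothetical counts for an imbalanced sample (illustrative only)
metrics = binary_metrics(tp=40, fn=10, fp=50, tn=900)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts the accuracy falls below the No Information Rate even though it looks high, which is why comparing accuracy against that baseline matters on imbalanced data.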