Predictive modelling and identification of key risk factors for stroke using machine learning.
Ahmad Hassan, Saima Gulzar Ahmad, Ehsan Ullah Munir, Imtiaz A Khan, Naeem Ramzan
Published in: Scientific Reports (2024)
Stroke is a leading cause of death worldwide, underscoring the need for early detection and prevention strategies. However, identifying hidden risk factors and achieving accurate prediction are particularly challenging when data are imbalanced and incomplete. This study applies three imputation techniques to handle missing data and uses the synthetic minority oversampling technique (SMOTE) to address class imbalance. It begins with a baseline model and then evaluates a wide range of advanced models using k-fold cross-validation on both imbalanced and balanced datasets. The findings indicate that age, body mass index (BMI), average glucose level, heart disease, hypertension, and marital status are the most influential features for predicting stroke. A Dense Stacking Ensemble (DSE) model is then built from the fine-tuned advanced models, with the best-performing model serving as the meta-classifier. The DSE model achieved over 96% accuracy across the datasets, with an AUC of 83.94% on the imbalanced imputed dataset and 98.92% on the balanced one. These results compare favourably with previous research on the same dataset and highlight the model's potential for early stroke detection to improve patient outcomes.
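To make the oversampling step concrete, the following is a minimal sketch of the core SMOTE idea: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its k nearest minority neighbours. It is not the authors' implementation. The function name `smote`, its parameters, and the toy feature values are assumptions chosen for illustration, and the sketch handles numeric features only.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Illustrative sketch only: numeric features, Euclidean distance.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])  # one of the k nearest neighbours
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Made-up toy values (hypothetical [age, average glucose level] pairs),
# not taken from the study's dataset.
stroke_cases = [[67.0, 228.7], [80.0, 105.9], [49.0, 171.2], [79.0, 174.1]]
new_cases = smote(stroke_cases, n_new=6, k=2)
```

In a cross-validated evaluation, oversampling of this kind is normally applied to the training folds only, so that synthetic points never leak into the data used for scoring.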
Keyphrases
- body mass index
- risk factors
- atrial fibrillation
- electronic health record
- type diabetes
- big data
- coronary artery disease
- machine learning
- genome wide
- pulmonary hypertension
- air pollution
- metabolic syndrome
- physical activity
- single cell
- risk assessment
- brain injury
- convolutional neural network
- glycemic control
- deep learning
- label free
- blood glucose