Introduction
Predicting outcome after stroke can be used at an institutional level to identify whether clinical services are performing below, at or above predicted levels of efficacy [
1] which enables remedial action to be taken to support improvement of poorly performing services and to recognise and replicate systems that are delivering better than predicted care.
The complexity of stroke potentially lends itself well to the use of machine learning (ML) algorithms which are able to incorporate large amount of variables and observations into one predictive model [
2]. It has been suggested that ML might outperform clinical prediction models based on regression because they make fewer assumptions and can learn complex relationships between predictors and outcomes [
3]. However, previous literature has not consistently shown that ML models generate more accurate predictions of stroke outcomes than regression based models, and improvements in methodology and reporting are needed for studies that compare modeling algorithms [
4]. Many of the practical machine learning applications are still in their infancy and need to be explored and developed better [
5].
There have been numerous ML prediction models developed previously for stroke outcomes [
6] but all had some flaws in model building which limited their utility. From a systematic review on predicting outcomes of stroke using machine learning methods, few studies met basic reporting standards for clinical prediction tools and none made their models available in a way which could be used or evaluated [
6]. Major improvements in ML study conduct and reporting are needed before it can meaningfully be considered for practice.
This study aimed to use ML and a large, nationally representative dataset containing real-world clinical variables with high potential for practical application to understand if carefully built and reported ML models can provide more accurate predictions of post-stroke mortality than existing methods. The findings of the research are intended to inform the design of predictive analytics used for mortality risk stratification and to support quality improvement in stroke care and benchmark stroke care services.
Methods
This study is reported according to the TRIPOD guidelines: transparent reporting of a multivariable prediction model for individual prognosis or diagnosis [
7].
Data source
Data were from Sentinel Stroke National Audit Programme (SSNAP), the national registry of stroke care in England, Wales and Northern Ireland. SSNAP is a Healthcare Quality Improvement Partnership (HQIP) register for stroke care quality improvement. Data were collected prospectively, validated by clinical teams and entered into the SSNAP database via a secure web interface. SSNAP is estimated to include an estimated 95% of all adults admitted to hospital with acute stroke (ischaemic or primary intracerebral haemorrhage) in England, Wales and Northern Ireland.
Study population
The original dataset was collected from 333 teams across England, Wales and Northern Ireland between April 1, 2013 to June 30, 2019. For the generalisation of the model to the population, all patients were included and no specific exclusion criteria for stroke patients.
Study outcome
The predicted outcome was all-cause 30-day in-hospital mortality post-stroke. All patients had in-hospital status due to the collection procedure in SSNAP. Out-hospital deaths were not available for analysis.
Variables
In total, 65 variables (
Supplementary Table A) were obtained from SSNAP. According to expert advice and literature review, 30 variables collected at arrival and 24 hours were used to build prediction models, including age (band by 5), sex, ethnicity, inpatient at time of stroke, hour of admission, day of week of admission, congestive heart failure, atrial fibrillation (AF), diabetes, hypertension, previous stroke/transient ischaemic attack (TIA), prior anticoagulation if AF, pre-stroke modified Rankin Scale (mRS), National Institutes of Health Stroke Scale (NIHSS) and its 15 components, and type of stroke. Age was obtained from SSNAP as banded by 5 and no continuous age was available for analysis.
Missing values
Missing data were handled using different methods according to assumptions of missing mechanism after consulting SSNAP team and clinicians. Variables with more than 80% missing were discarded due to high level of missingness. For categorical variables with missing by design/not applicable assumption, missing values were added as a new category. Missing indicator was used for missing by design/not applicable continuous variables. After these, Multiple Imputation with Chained Equations (MICE) [
8] was used to impute variables with missing at random assumption for the development dataset. All available variables except for the discarded ones were used in MICE. Five datasets were imputed using MICE and aggregated into one using median. Details for handling missing data were presented in
Supplementary Table B. To simulate the future use of the prediction model (i.e. predicting outcomes of individual patients), the validation set and temporal validation set were imputed with the median/mean of each variable.
Analysis methods
Ordinal categorical variables were coded as integers. Categorical variables that were not ordinal were coded using one hot encoding. Continuous variables remained continuous. Details of coding of variables were presented in
Supplementary Table B.
Models were developed/trained using 80% randomly selected data from 2013 to 2018, validated on the remaining 20% data from 2013 to 2018, and temporally validated on data from 2019.
Descriptive statistics were used to compare the characteristic of death/alive at 30-day across the entire datasets, and also used to compare patients’ characteristics for development set, validation set and temporal validation set.
Logistic regression (LR), LR with elastic net [
9] with/without interaction terms, and XGBoost [
10] were used to build models with the 30 variables. A reference model was developed using the same approach as SSNAP 30-day mortality model [
11]: LR with 4 variables (age, NIHSS, previously diagnosed AF, and type of stroke). To make the models comparable, the outcome of LR reference model was in-hospital 30-day mortality.
Brier score [
12] was used as an overall summative measure of predictive performance. Discrimination was measured by AUC. Calibration was visually assessed with calibration plots [
13], and numerically by calibration-in-the-large [
13] and calibration slopes [
13]. R functions for calculating these measurements were presented in the Github repository. Comparisons of Brier scores and AUCs were conducted with one-way repeated measure Analysis of Variance (ANOVA) [
14] on the 500 bootstrap samples. If an overall significant difference had been achieved (significance threshold of
p-value is 0.05) in one-way repeated measure ANOVA test, post-hoc test [
15] was conducted further for pairwise comparison which depicts where exactly the differences occurred. Further, we used DeLong test as an addition method to test the difference between any two methods.
As stroke type is a critical factor for outcomes of stroke patients, we performed a subgroup analysis to investigate the performance of the developed models for patients with different stroke type i.e. infarction and haemorrhage.
Calibration was also evaluated by shift tables, in which we classified patients in the temporal validation set into prespecified categories of low (< 5%), moderate (5–15%), or high risk (> 15%) of 30-day mortality based on two models, creating a 9-way matrix of patients that included risk profiles assigned by the 2 models (low-low, low-moderate, and so on). We then calculated the actual rate of events in these groups and compared them against the observed rates of mortality focusing on discordant categories.
Clinical utility was assessed with decision curve analysis [
15] which shows graphically the net benefit obtained by applying the strategy of treating an individual if and only if predicted risk is larger than a threshold in function of the threshold probability. Threshold equals to 0 means treating all since all predicted risk will be larger than 0. Threshold equals to 1 means treating none since all predicted risk will be smaller than 1. All analyses were conducted using R 3.6.2 and occurred between October 2019 to May 2020.
Discussion
In this study we explored the performance of XGBoost and LR related models for predicting risk of 30-day mortality after stroke. Our findings showed that the improvement of XGBoost was modest compared to LR with elastic net and interaction terms (AUC difference 0.003,
p < 0.001) and larger compared to LR and LR with elastic net (AUC difference 0.009,
p < 0.001). There has been mixed signals about whether ML outperforms LR models [
6]. In our case, with nearly a half million patients, the gain of ML was small compared to LR with elastic net and interaction terms even though significant due to the large dataset. Also, previous studies used simpler LR models (e.g. LR or least absolute shrinkage and selection operator (LASSO)) with no interactions which constrains the models to learn only linear relationships whilst ML models are capable of learning very complicated relationships. XGBoost and LR with elastic net and interaction terms performed better than LR showed that there is an opportunity to improve the accuracy by incorporating interactions between risk predictors.
When including more variables in the model, AUCs were improved by 0.041 (p < 0.001) for both LR related models and XGBoost which showed the potential of improving the accuracy by using more variables combining data-driven variable selection (i.e. LR with elastic net achieves variable selection by shrinking the coefficients of the variables and XGBoost by calculating variable importance).
A variety of models have been developed previously to predict post-stroke mortality, including ML models [
6], LR models and scores [
18]. The model developed by Bray et al. [
11] has been externally validated twice [
19,
20] with different population. Our reference model was developed with the same approach but with more patients enrolled between 2013 and 2019 compared to Bray et al. model [
11] which used patients from 2013. The AUC with LR reference model was slightly lower (0.854 (0.848–0.860) versus 0.86 (0.85–0.88)) with the Bray et al. model [
11]. Our model with 30 variables using XGBoost had a higher AUC of 0.896 (0.891–0.900).
Existing scores for 30-day mortality prediction are PLAN [
21] and IScore [
22]. PLAN was externally validated with AUC (0.87 (0.85–0.90)) [
23] and IScore with AUC (0.88 (0.86–0.91)) which were higher compared to the original studies (PLAN: AUC 0.79, IScore: AUC 0.79–0.85). Due to lack of certain variables (PLAN score: cancer, Iscore: Cancer, renal dialysis, stroke scale score, non-lacunar, and abnormal glucose level), we could not externally validate these scores.
The subgroup analysis with types of stroke showed that the models with 30 variables predicted better at haemorrhage patients than infarction patients. The reasons for this are not clear, but might relate to the proportionally higher number of mortality events in the patients with haemorrhagic stroke. The feature importance of the variables in the XGBoost model were consistent with previous literature about prognostic factors after stroke, with NIHSS, age, stroke type and prestroke mRS being the most important predictors. Notably, many individual components of the NIHSS score also contributed to the predictions in addition to the overall NIHSS, indicating that there is value in using all components of the NIHSS as part of prognostic models.
ML models have been notably performing well with non-structured data such as text mining and imaging. However, in terms of structured data, the advantages of LR models in interpretability outweigh the small improvement of the prediction accuracy by ML models. Furthermore, even though the use of ML for predicting stroke outcomes is increasing, few met basic reporting standards for clinical prediction tools and none made their models available in a way which could be used or evaluated [
6]. Major issues found in the ML studies are small dataset size, lack of calibration and external validation, lack of details in hyperparameter tuning, lack of decision curve analysis, and the non-availability of the final model [
6]. None of the ML model were externally validated due to the low quality of reporting and lack of final model. From the low reporting quality, few external validation of ML models, and lack of guidelines on developing and reporting, ML still has a long way to go in being accepted by clinicians and implemented in real-world setting.
Strengths and limitations
This is by far the largest stroke dataset used for developing prediction of post-stroke mortality model using ML (around 0.5 million versus < 1000 in previous ML post-stroke mortality prognosis studies [
6] and 77,653 as the largest, to the best of our knowledge, for LR model/score-based approach [
24]). The participants in the study are presentative for the nation and complete in terms of a nearly complete national population of hospitalised stroke and richness of available data and data quality.
The models were built with robust approaches and reported according to the TRIPOD reporting guidelines with several terms adjusted for ML studies such as hyperparameter turning and final model presentation. For missing data, we explored the missing mechanism which was not explored in previous studies before imputing or fitting the model [
18]. Hyperparameter selection was well reported in our study but not in previous ML studies [
6]. Temporal validation was performed to make sure that the models can apply to data collected after the model was developed. Finally, a repository on Github was built to share the pre-trained models for other studies to externally validate.
The main limitations are that we were restricted to using variables available in SSNAP and there may be other variables (e.g. imaging) that might improve the accuracy of prediction. We used 30 variables that were collected within SSNAP, which might not be available in other databases. However, the variables are generally available in stroke registries which can benefit from the models developed in this study. The validation was limited to temporal validation and ideally the model should be validated in external data from other data sources. Finally, death outcomes were limited to inpatient mortality and it was not possible to ascertain deaths occurring outside hospital within 30 days.
Open AccessThis article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit
http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (
http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.