Background
Measures of cardiac size, mass, and function derived from imaging are some of the most fundamental biomarkers in medicine. For example, left ventricular (LV) ejection fraction (LVEF) determines selection for drug therapy in heart failure [1-3], detects myocardial injury (e.g. in cardio-oncology) [4], acts as a gatekeeper for ~$9 billion spent per year on cardiac devices, and serves as a surrogate endpoint for drug development and outcome prediction [5-7].
LVEF was initially proposed as a simple way of measuring heart function using cardiac catheterization [8]. The introduction of imaging modalities such as echocardiography, cardiac computed tomography, and cardiovascular magnetic resonance (CMR) permitted absolute blood and myocardial volume measurements. CMR is accepted as the best technique for measuring cardiac structure and global systolic function (i.e., LVEF) [9]. Image acquisition is standardized and can be delivered in as little as 15 min [10, 11], but the image analysis process can take as long again and requires a clinician, which introduces variability because of intra- and inter-operator differences [12, 13].
Recent developments in deep learning using convolutional neural networks (CNNs), computational models inspired by the architecture of the human brain, have revolutionized automated image analysis [14]. The potential of CNNs in healthcare is being recognized; for example, a deep learning system has been shown to improve on human performance for detecting breast cancer in mammograms [15]. Many CNN applications in cardiac image segmentation have been described and deployed within commercial packages [16-18], but none surpass human performance, and most algorithms are neither directly compared to human analysis on the same data nor validated on independent clinical datasets [19].
We hypothesized that a carefully trained, fully automated machine learning analysis could be developed and proven to exceed human performance on any CMR scanner and any (non-congenital) disease. This approach requires a means of evaluating and comparing CMR measures of myocardial structure and function, but this is hampered by the lack of a truth standard. Most existing approaches report measurement accuracy, treating expert analysis (or a consensus thereof) as a truth standard, but this is fundamentally flawed because of the inherent variability and subjectivity of humans. We therefore concentrate on measurement precision (reproducibility) and develop an evaluation framework using an independent dataset to quantify it; precision determines the smallest detectable interval change in clinical serial assessment and the sample size required for research studies.
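The link between precision and the smallest detectable interval change can be made concrete. A minimal sketch using the standard 95% repeatability calculation (the 1.96·√2 factor is textbook statistics, and the example SD is hypothetical, not a value from this study):

```python
import math

def smallest_detectable_change(within_subject_sd: float) -> float:
    """Smallest serial change distinguishable from measurement noise
    at 95% confidence: the difference of two measurements has
    SD = sqrt(2) * within-subject SD, scaled by z = 1.96."""
    return 1.96 * math.sqrt(2) * within_subject_sd

# Hypothetical scan-rescan SD of 2.5 LVEF percentage points
print(round(smallest_detectable_change(2.5), 2))  # -> 6.93
```

Halving the measurement SD therefore halves the change that can be confidently attributed to the patient rather than to the measurement process.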
Discussion
We present a fully automated method of CMR LV volumetric analysis and demonstrate that it has superior precision to a human expert. Widespread adoption has the potential to standardize global care, reduce the need for clinical expert time, and significantly reduce sample sizes for clinical trials.
Automated CMR analysis has already demonstrated faster performance with non-inferiority to humans [13]. Here, we demonstrate a generalizable algorithm with better-than-human precision, a substantial step-change that could impact both clinical and research work. Clinically, improved imaging biomarker precision builds confidence in quantitative analysis of cardiac structure and function and will strengthen the cascade of clinical decisions based on cardiac measurements. For research, there is a potential reduction in required sample sizes, potentially accelerating therapeutic development. The automated method also permits retrospective analyses with considerable power: for example, re-analyzing a 200-patient CMR study would take 60 min and could unearth findings previously masked by human analysis variability.
Machine errors were seen in circumstances not encountered in the training data, such as a laminar thrombus mimicking the LV wall, or a pericardial effusion resembling LV blood (Fig. 6; Additional file 1: Table S7). This in part reflects a limitation of training data collected from consented research subjects who, by definition, must be well enough to give consent. However, it also highlights that humans consider contextual information and use high-level executive function outside the scope of current AI systems. The method presented here has also not been tested on patients with congenital heart disease, and it is unlikely to generalize to such cases because of significant differences in structure and topology. This poses an interesting question about how best to model different diseases: should we create a separate model for each phenotype, or lump them all together into a single model?
Sources of longitudinal variability in image-derived metrics can be grouped into three categories: variable image slice prescription by the radiographer, inconsistent analysis by the clinician, and physiological or pathological changes. Our goal is to minimize the first two sources (errors caused by inconsistency) so that we can focus on true physiological (or pathological) variability, which is crucial for serial assessment in clinical cardiology (e.g. monitoring for signs of cardiotoxicity during chemotherapy) and in clinical trials. Here, we have shown how image analysis variability can be minimized, but in future work we will extend this to automate the image acquisition process using deep learning to prescribe consistent image planes.
The difference between the scan-rescan coefficients of variation (Fig. 7) and the mean absolute difference (Additional file 1: Table S4) may appear small, but they translate to a significant difference for both research studies and clinical care. As an example, if we used machine segmentations instead of human analysis in a clinical trial with an LVEF endpoint, we would need 46% fewer subjects to achieve the same statistical power. Minor differences in LV metric values also propagate in clinical care because normal values are treated as simple 'cut-off' values. Furthermore, the reported values represent the mean, and larger differences are seen at the individual level.
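The sample-size gain follows from the quadratic relationship between measurement SD and trial size. A minimal sketch using the standard two-arm normal-approximation formula (the SD and effect-size values are hypothetical, chosen only to illustrate a reduction of roughly this magnitude):

```python
import math

def sample_size_per_arm(sd: float, delta: float) -> int:
    """Subjects per arm to detect a mean difference `delta` with
    80% power at two-sided alpha = 0.05 (normal approximation)."""
    z_alpha, z_beta = 1.96, 0.84
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * (sd / delta) ** 2)

# Hypothetical LVEF endpoint: detect a 3-point treatment difference
n_human = sample_size_per_arm(sd=5.0, delta=3.0)      # 44 per arm
n_machine = sample_size_per_arm(sd=3.675, delta=3.0)  # 24 per arm
print(n_human, n_machine)  # a ~46% reduction in subjects
```

Because required n scales with SD squared, a modest improvement in measurement precision compounds into a much smaller (and cheaper) trial.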
Clinical adoption of machine learning methods is slow because of challenges with data access, integrating the computer science and clinical domains, and validation, transparency, ethical, and regulatory concerns [32]. Here, we have demonstrated how machine learning can be applied to an important medical problem, cardiac volumetric analysis by CMR, and its performance measured using a clinically important metric: precision. Unlike the majority of previous approaches, we directly compared machine performance to a clinician on the same independent data [19]. Such datasets could be a cornerstone of regulatory approval, where all new algorithms are systematically and independently evaluated and benchmarked against existing approaches.
There are significant differences between the normative reference ranges for cardiac structure and function reported in the literature (for example [33, 34]), and even small changes can lead to large differences in the number of subjects beyond cut-off thresholds. Much of this variability could be due to differences in sample populations (e.g. age, sex, comorbidities), imaging techniques (e.g. gradient echo vs. balanced steady-state free precession), and measurement convention (papillary muscles considered as part of the LV blood pool vs. as myocardium). Experts have their own biases, but so do automated algorithms, which is why we believe it is important to report method-specific reference ranges. We performed a pilot study to estimate a normal reference range, but the number of patients, particularly in older age groups, is small, so we aim to refine this in future using larger samples, such as the UK Biobank.
Limitations
Technical limitations of the algorithm include the method of co-registering the data from SAx, 2Ch, and 4Ch images in 3D, which assumes a consistent breath hold during each acquisition; this may not always be the case. In future work, we will investigate ways of compensating for breath-hold inconsistencies and of integrating the data into true 3D volumes.
Other limitations include the small number of human observers, from a single centre, used to benchmark the Precision dataset. This benchmarking took 210 h of manual segmentation, but we acknowledge that validation with more clinicians from more centres is required before we can generalize the finding that the algorithm outperforms humans.