Background
Breast cancer is a complex disease; early detection and evaluation of the tumor are critical for prognosis and long-term survival. Once a tumor is detected, histopathologic analyses of estrogen receptor, progesterone receptor, and human epidermal growth factor receptor 2, together with evaluation of the Nottingham histologic grade (NHG), are performed. More recently, assessment of the proliferation antigen Ki67 has been increasingly recommended [1, 2]. These biomarkers provide valuable prognostic information for survival and treatment outcomes [3, 4] and are therefore used to guide the selection of therapeutic strategies. However, current approaches to evaluating these biomarkers, i.e., immunohistochemistry stains, require careful assessment by trained pathologists, and disagreements between pathologists are often observed, especially for NHG and Ki67. Other technical factors, such as sample fixation, antibody batches, and scoring methods, also contribute to inconsistent results. To obtain consistent assessments, more robust methods that are amenable to automation are highly desirable.
In recent years, technologies for transcriptome sequencing have stabilized and matured, and their clinical applications are steadily increasing. Some researchers use RNA sequencing to discover new biomarkers; others use it to evaluate existing ones. Although the results vary, the assessment of many markers is comparable to that of histopathologic evaluation. As more and more RNA sequencing data accumulate, data-driven and machine learning (ML)-based approaches have been explored to discover and classify biomarkers [5–7]. Most of these methods use a variety of strategies to select RNA variants (genes and transcripts) and build classification models. One successful example is the establishment of PAM50 [8], where a collection of expressed genes is used to classify breast cancer into four different subtypes. One key issue in these analyses is the selection of genes and transcripts: because many genes are transcribed coordinately, the high correlation among these genes and transcripts, i.e., multicollinearity, makes selection necessary. Another issue with biomarker discovery and modeling is that most researchers focus on identifying one, or a limited number, of markers that can be used to predict the outcome measures. This is partly because traditional modeling approaches cannot handle a very large number of variants, especially when the number of variants/factors is much larger than the number of observations/sample size.
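The multicollinearity problem described above can be illustrated with a small sketch. This is not the selection procedure used in any of the cited studies; the toy expression matrix, the 0.9 cutoff, and the greedy filter are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy expression matrix: 100 samples x 6 genes. Genes 0-2 are driven by one
# shared transcriptional program, so they are nearly collinear; genes 3-5 are
# independent. All values are simulated for illustration only.
program = rng.normal(size=100)
expr = np.column_stack([
    program + 0.05 * rng.normal(size=100),
    program + 0.05 * rng.normal(size=100),
    program + 0.05 * rng.normal(size=100),
    rng.normal(size=100),
    rng.normal(size=100),
    rng.normal(size=100),
])

def drop_collinear(x, threshold=0.9):
    """Greedily keep one representative from each highly correlated group."""
    corr = np.abs(np.corrcoef(x, rowvar=False))
    keep = []
    for j in range(x.shape[1]):
        if all(corr[j, k] < threshold for k in keep):
            keep.append(j)
    return keep

kept = drop_collinear(expr)
print(kept)  # genes 1 and 2 are dropped as near-duplicates of gene 0
```

In practice the threshold and the choice of representative would be tuned to the data; the point is only that co-transcribed genes carry largely redundant signal.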
The rise of modern computational power and ML algorithms provides an opportunity to address these issues. The convolutional neural network (CNN) is one such algorithm that has been used very successfully in computer vision and image classification [9, 10]. Recently, CNN algorithms have been applied to classify medical images with exciting results [11–13]. More recently, several reports have applied CNN algorithms to the analysis of genomics data [14–16]. We have developed a technology that first transforms tabulated data into artificial image objects (AIOs) and then applies ML algorithms such as CNN to classify these AIOs [17]. In this study, we apply the AIO technique to classify breast cancer biomarkers, focusing on Ki67 and NHG, for which disagreements between pathologists are frequently observed. We hope to demonstrate that a data-driven, ML-based approach can produce consistent assignments for Ki67 and NHG. This report summarizes the results from the study.
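As a rough illustration of the AIO idea (not the published implementation; the min-max scaling scheme and the gene ordering are assumptions), a 16,384-gene expression vector can be mapped one gene per pixel onto a 128 × 128 grayscale image:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical expression vector for one sample: 16,384 genes, matching the
# gene count used in this study. Values here are simulated.
expression = rng.normal(loc=5.0, scale=2.0, size=16_384)

def to_aio(values, shape=(128, 128)):
    """Map each gene to one pixel: min-max scale to [0, 1], then reshape.

    The gene-to-pixel assignment is fixed across samples, so the same gene
    always lands on the same coordinate; this is what keeps features traceable.
    """
    lo, hi = values.min(), values.max()
    scaled = (values - lo) / (hi - lo)
    return scaled.reshape(shape)

aio = to_aio(expression)
print(aio.shape)  # (128, 128): a grayscale artificial image object
```

Once every sample is rendered this way, a stack of AIOs can be fed to a standard image-classification CNN with no changes to the network code.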
Discussion
With the advancement of high-throughput DNA sequencing technologies, transcriptome sequencing has been used increasingly in clinical studies. The rapid accumulation of large transcriptome datasets presents a great opportunity to apply ML algorithms to clinical questions. In this study, we applied CNN algorithms to breast cancer RNA sequencing data and developed models to classify two commonly used biomarkers. Our goals were twofold: first, to evaluate the application of the AIO technique to RNA sequencing data, and second, to compare the performance of our CNN models with other methods currently used for the classification of breast cancer biomarkers. We focused on Ki67 and NHG because pathologists' assessments of these markers are not very consistent, so improvement in prediction accuracy would be of high clinical value.
We designed two sets of experiments to evaluate how the combination of the AIO technique and CNN algorithms performed with RNA sequencing data for biomarker classification. In the first experiment, we used cross-validation to assess the models built with the AIO technique and CNN algorithms. Here we combined the GSE96058 and GSE81538 datasets and performed five-fold cross-validation for both the Ki67 and NHG markers. For Ki67, a binary classification, we achieved an accuracy of 0.821 ± 0.023 and an AUC of 0.891 ± 0.021 (Table 2). The precision, recall, and F1 score were 0.822 ± 0.023, 0.822 ± 0.024, and 0.822 ± 0.024, respectively. For NHG, a multi-class classification, the weighted average of categorical accuracy and the weighted average of class-specific AUC were 0.820 ± 0.012 and 0.931 ± 0.006 (Table 3).
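The cross-validation protocol can be sketched as follows. This is a minimal stand-in, not the study's pipeline: a logistic regression on simulated data substitutes for the CNN on AIOs, and all dataset parameters are invented for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Stand-in data: the study's actual inputs are AIO-transformed expression
# profiles with binary Ki67 labels; here we simulate a small binary task.
X, y = make_classification(n_samples=400, n_features=50, random_state=0)

accs, aucs = [], []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    prob = clf.predict_proba(X[test_idx])[:, 1]
    accs.append(accuracy_score(y[test_idx], prob > 0.5))
    aucs.append(roc_auc_score(y[test_idx], prob))

# Report mean +/- standard deviation across the five folds, as in Table 2.
print(f"accuracy {np.mean(accs):.3f} +/- {np.std(accs):.3f}")
print(f"AUC      {np.mean(aucs):.3f} +/- {np.std(aucs):.3f}")
```

Stratification keeps the class balance of each fold close to that of the full dataset, which matters when marker-positive and marker-negative cases are unevenly represented.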
In the second experiment, independent-sample testing, we used GSE96058 as the training dataset to build the model and tested its performance on the independent GSE81538 and GSE163882 datasets. We used GSE96058 as training data because it had a much larger sample size (n = 2976) than GSE81538 (n = 405). Both GSE96058 and GSE81538 were produced by a Swedish team [18, 19], whereas GSE163882 was produced by a different team. For Ki67, the Swedish team reported a multi-gene model with an accuracy of 0.663 as compared to the consensus calls from trained pathologists; our model achieved an accuracy of 0.826 (Table 2). For NHG, the Swedish team reported an accuracy of 0.677, and our CNN model achieved an accuracy of 0.764 (Table 3). For GSE163882, our model trained with GSE96058 did not perform well, with a weighted accuracy of 0.580 ± 0.006 and a weighted AUC of 0.587 ± 0.038. To investigate why, we examined the correlation of gene expression between GSE96058 and GSE163882 and found that it was not very good (Pearson correlation coefficient, R = 0.79). In comparison, the correlation of gene expression between GSE96058 and GSE81538 was very good (R = 0.99) (Additional file 1: Fig. S1). These analyses indicated that the poor performance on GSE163882 was, at least in part, due to data inconsistency between the GSE96058 and GSE163882 datasets. While both datasets were produced on the Illumina NextSeq 500 platform, the technical details of how the sequencing was conducted could vary significantly. Furthermore, GSE96058 and GSE163882 used different sample treatments: GSE96058 used fresh-frozen tissue samples, whereas GSE163882 used formalin-fixed paraffin-embedded samples. All of these factors contributed to the inconsistent sequencing data.

While the models trained with GSE96058 did not perform well on GSE163882, cross-validation using only the GSE163882 data with the same genes and CNN architecture did produce accuracy (0.813 ± 0.026) and AUC (0.938 ± 0.007) comparable to those of the combined GSE96058 and GSE81538 dataset (compare Additional file 1: Table S1 and Table 3). These results indicated that by transforming gene expression data into AIOs, we could apply mature algorithms such as CNN to classify biomarkers effectively and achieve accuracy comparable to or better than that of other modeling methods. However, the significant performance difference between the GSE81538 and GSE163882 datasets underscores the importance of data consistency. For a model to generalize well and be used in clinical applications, it is critically important to establish stable and consistent pipelines for data production. The difference in NHG performance between the GSE81538 and GSE163882 datasets clearly illustrates this principle, which is well documented in the literature.
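The dataset-consistency check described above amounts to correlating per-gene summary expression between cohorts. A minimal sketch with simulated cohorts follows; all values are invented, and only the contrast between a consistent cohort and a drifted cohort mirrors the GSE96058 vs. GSE81538/GSE163882 comparison.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical per-gene mean expression in three cohorts. Cohort B tracks A
# closely (consistent pipelines, fresh-frozen tissue); cohort C drifts
# (different sample handling), mimicking the cross-dataset contrast above.
mean_a = rng.normal(5.0, 2.0, size=2_000)
mean_b = mean_a + rng.normal(0.0, 0.3, size=2_000)  # small technical noise
mean_c = mean_a + rng.normal(0.0, 2.0, size=2_000)  # large systematic drift

def pearson(x, y):
    """Pearson correlation coefficient between two per-gene summaries."""
    return float(np.corrcoef(x, y)[0, 1])

print(f"A vs B: R = {pearson(mean_a, mean_b):.2f}")  # high: good transfer expected
print(f"A vs C: R = {pearson(mean_a, mean_c):.2f}")  # lower: poor transfer expected
```

A low cross-cohort correlation does not by itself explain which step of the pipeline diverged, but it is a cheap first screen before training a model on one cohort and deploying it on another.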
We evaluated how AIO configurations and CNN architecture affected Ki67 and NHG classification. The 16,384 genes could be configured into two different image objects. For both the grayscale 128 × 128 and pseudo-color 64 × 64 × 4 configurations, our models produced similar classification accuracies (see Tables 2, 3, and Additional file 1: Tables S2 and S3). This was consistent with results obtained from the classification of black-and-white and colored images in the field of computer vision. For the AIO technique, however, this finding was significant because it suggested that we could place different types of genomics data into separate layers/channels of an AIO, allowing integrated analyses with multi-omics data. We also built models to classify Ki67 and NHG using one-dimensional convolution, because genes are aligned linearly on chromosomes and, conceptually, can be considered one-dimensional data suitable for one-dimensional convolution analyses [15]. In our analyses, 1D-CNN and 2D-CNN produced comparable results (compare Tables 2 and 3 with Additional file 1: Tables S4 and S5). Typically, 1D-CNN is used to analyze one-dimensional data, such as time series, audio, text, and electrocardiogram data [25], whereas 2D-CNN is used for computer vision and image classification. While there are techniques for transforming one-dimensional data into images for 2D-CNN analyses [26], we are not aware of extensive analyses comparing the performance of 1D-CNN and 2D-CNN on the same data. With our limited analyses, it would be difficult to conclude that one approach is better than the other; researchers should explore both when the data can be analyzed with either algorithm.
We evaluated the predictive power of the calls produced by our CNN models by comparing them to the consensus calls from trained pathologists. With a similar sample size (see Table 1), the Ki67 calls produced by our models had greater power in survival analyses than the calls from trained pathologists (Fig. 4a, b). The reason may be that model-produced calls are more consistent than those of human raters. For NHG, with a very small sample size (n = 59), the model-produced calls showed a trend toward significance; with a larger sample size, they might well reach statistical significance. These results suggest that, once implemented, such models could improve productivity and consistency in clinical applications.
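The survival comparison above relies on separating patients by marker call and testing the difference between the resulting survival curves. A self-contained sketch of the standard two-group log-rank test follows; this is not the study's code, and the toy groups are deliberately well separated for illustration.

```python
import numpy as np
from scipy.stats import chi2

def logrank_p(time_a, event_a, time_b, event_b):
    """Two-group log-rank test p-value (standard tied-time formula)."""
    times = np.concatenate([time_a, time_b])
    events = np.concatenate([event_a, event_b]).astype(bool)
    group = np.concatenate([np.zeros(len(time_a)), np.ones(len(time_b))])
    o_minus_e, var = 0.0, 0.0
    for t in np.unique(times[events]):            # each distinct event time
        at_risk = times >= t
        n = at_risk.sum()
        n_a = (at_risk & (group == 0)).sum()      # group A still at risk
        d = (events & (times == t)).sum()         # events at t, both groups
        d_a = (events & (times == t) & (group == 0)).sum()
        o_minus_e += d_a - d * n_a / n            # observed minus expected
        if n > 1:
            var += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    stat = o_minus_e ** 2 / var                   # ~ chi-squared with 1 df
    return float(chi2.sf(stat, df=1))

# Toy cohorts: one group's events all occur late, the other's all early,
# so the curves separate cleanly and the p-value is very small.
p = logrank_p(np.arange(11, 21), np.ones(10), np.arange(1, 11), np.ones(10))
print(f"p = {p:.2e}")
```

In a real analysis the event indicators would encode censoring (0 for censored, 1 for an event), which the at-risk counts above already handle.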
In the literature, a report proposed a different method, DeepInsight, for transforming non-image data into images for CNN classification [16]. Both our AIO technique and DeepInsight enable the application of CNN algorithms to non-image data and have the potential to be applied to large and multiple datasets. There is, however, a key difference between the two. In the AIO technique, we treat each feature/variable as a pixel and map the features directly onto the feature map. In contrast, DeepInsight first performs a kernel PCA or tSNE transformation to determine the relationships among the features and then uses this information to determine the coordinates of the features on the image. The kernel PCA or tSNE transformation can not only cause substantial loss of information but also create situations where multiple features map to the same coordinates on the transformed image. This makes it very difficult to track which features contribute most to the patterns that the CNN algorithms learn and use to classify the label, a piece of information that can be significant for understanding the underlying biology when genomics data are used. The ability to track the genes in the spatial patterns provides a new approach for discovering multi-gene interactions and interaction networks. Given the differences between the two methods, it would be interesting to compare their performance on the same datasets. Overall, both methods should have broad applications in biomedical research.
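The traceability argument can be made concrete: a direct one-gene-per-pixel mapping is bijective, so any image region a CNN highlights can be inverted back to gene identities. A small sketch (the gene names and the row-major layout are illustrative assumptions, not the published mapping):

```python
import numpy as np

n_genes, side = 16_384, 128
# Hypothetical gene identifiers; a real mapping would use actual gene symbols.
gene_ids = np.array([f"gene_{i}" for i in range(n_genes)])

# Direct mapping: gene i occupies pixel (i // side, i % side), one gene per pixel.
def pixel_of(gene_index):
    return divmod(gene_index, side)

def gene_at(row, col):
    return gene_ids[row * side + col]

# Because the mapping is one-to-one, a salient pixel found by the CNN can be
# traced back to exactly one gene. A tSNE or kernel-PCA layout, by contrast,
# may place several features on the same coordinate, breaking this inversion.
r, c = pixel_of(5000)
assert gene_at(r, c) == "gene_5000"
print(r, c)  # (39, 8): 5000 // 128 = 39, 5000 % 128 = 8
```

The same inverse lookup extends to a rectangular region: collecting `gene_at(row, col)` over the region's pixels yields the candidate gene set behind a learned spatial pattern.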
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.