Comparison of Artificial Intelligence Techniques to Evaluate Performance of a Classifier for Automatic Grading of Prostate Cancer From Digitized Histopathologic Images

IMPORTANCE Proper evaluation of the performance of artificial intelligence techniques in the analysis of digitized medical images is paramount for the adoption of such techniques by the medical community and regulatory agencies.

OBJECTIVES To compare several cross-validation (CV) approaches to evaluate the performance of a classifier for automatic grading of prostate cancer in digitized histopathologic images and compare the performance of the classifier when trained using data from 1 expert and multiple experts.

DESIGN, SETTING, AND PARTICIPANTS This quality improvement study used tissue microarray data (333 cores) from 231 patients who underwent radical prostatectomy at the Vancouver General Hospital between June 27, 1997, and June 7, 2011. Digitized images of tissue cores were annotated by 6 pathologists for 4 classes (benign and Gleason grades 3, 4, and 5) between December 12, 2016, and October 5, 2017. Patches of 192 μm² were extracted from these images. There was no overlap between patches. A deep learning classifier based on convolutional neural networks was trained to predict a class label from among the 4 classes (benign and Gleason grades 3, 4, and 5) for each image patch. The classification performance was evaluated in leave-patches-out CV, leave-cores-out CV, and leave-patients-out 20-fold CV. The analysis was performed between November 15, 2018, and January 1, 2019.


Introduction
In the last decade, the literature on medical imaging in general, and on digital pathology in particular, has seen a dramatic increase in articles involving artificial intelligence and machine learning for automatic image analysis and classification, 1,2 as part of the development of computer-aided diagnosis systems to increase accuracy, reproducibility, and efficient throughput. This trend has been enabled by an increase in computational power, improvement of image processing and machine learning algorithms, and the availability of more comprehensive data sets for training and evaluation.
Typical computer-aided diagnosis systems consist of a training phase and an inference phase.
During training, a set of labeled instances (ie, instances with known inputs and outputs) is used to learn or determine the optimal values of the model parameters. The set of instances used in the training phase is referred to as the training set. If the output is a continuous number, such as blood pressure, the model is called a regressor. If, on the other hand, the output is a small number of discrete categories, such as benign and cancerous samples, the model is called a classifier. This work is focused on classifiers as we target the problem of classifying histopathologic images into several classes such as benign, low-grade cancer, and high-grade cancer.
A well-known pitfall when determining the model parameters from training data is overfitting, whereby the parameters are tuned such that the model performs very well on the training data but underperforms on other data. Therefore, the performance of the model should be evaluated on a separate set of instances, known as the test set. To avoid the overfitting problem and obtain an unbiased evaluation of the model, the data in the test set should be independent and separate from the data in the training set.
A common practice, known as k-fold cross-validation (CV), 3 is to divide all of the available data into k partitions, or "folds," of approximately equal sizes. Then, the model is trained k times from scratch. Each time, one of the k folds is held out to be used as the test data, and the remaining k − 1 folds are used as the training data. This approach allows one to make full use of the data by evaluating the model on all instances in an unbiased manner.
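To make the procedure concrete, the following is a minimal sketch of k-fold CV in Python with scikit-learn; the toy features, the 4-class labels, and the logistic regression stand-in for the model are illustrative placeholders, not the classifier used in this study.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import KFold

    rng = np.random.default_rng(0)
    X = rng.random((200, 16))      # 200 toy instances with 16 features each
    y = rng.integers(0, 4, 200)    # 4 classes, eg, benign and Gleason grades 3, 4, and 5

    scores = []
    for train_idx, test_idx in KFold(n_splits=20, shuffle=True, random_state=0).split(X):
        model = LogisticRegression(max_iter=1000)  # retrained from scratch each fold
        model.fit(X[train_idx], y[train_idx])
        scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

    print(f"mean accuracy over 20 folds: {np.mean(scores):.3f}")

Note that every instance appears in exactly 1 test fold, so the model is evaluated on all of the data without ever being tested on instances it was trained on.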
As opposed to other modalities, such as magnetic resonance imaging, in which entire images can be used as input, classification of digitized histopathologic images presents a challenge in processing because of their extremely large size, in the order of 0.1 to 10 gigapixels. To overcome this challenge, each whole slide image is divided into a grid of smaller image "patches," which are typically square and only a few hundred pixels in length. Consequently, many instances are generated from the same slide and patient.
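A minimal sketch of this tiling step is shown below; the pixel size at ×40 magnification (about 0.25 μm per pixel) and the resulting 768-pixel patch width are assumptions for illustration, not values taken from this study.

    import numpy as np

    def extract_patches(image, patch_px):
        # Yield non-overlapping patch_px x patch_px patches; edge remainders are dropped.
        h, w = image.shape[:2]
        for top in range(0, h - patch_px + 1, patch_px):
            for left in range(0, w - patch_px + 1, patch_px):
                yield image[top:top + patch_px, left:left + patch_px]

    # Assuming about 0.25 um per pixel at x40, a 192-um-wide patch is roughly 768 x 768 pixels.
    slide = np.zeros((4000, 6000, 3), dtype=np.uint8)  # stand-in for a digitized image
    print(sum(1 for _ in extract_patches(slide, 768)))  # 5 x 7 = 35 patches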
Prostate cancer is a heterogeneous disease that is manifested in a variety of very different histopathologic patterns across patients. Because of its heterogeneity, grading of prostate cancer has a well-known high degree of interobserver variability 4-6 that leads to uncertainty in image labeling.
Therefore, training and evaluation of a classification algorithm against 1 expert may involve an inappropriate ground truth, which would yield a classifier that underperforms when evaluated against the annotations of other experts.

We report our experience in automatic grading of prostate cancer in histopathologic images based on annotations by multiple experts and examine different approaches to the evaluation of these images. The motivation for this work arose from our interest in comparing our results 7 with those of other reported studies. In this process, we found the evaluation approaches to be inconsistent across literature reports. In particular, we found that many studies have used data from a single expert to train and evaluate their models, 8-11 a practice that clearly ignores the extensive evidence of high interobserver variability. More important, we found that some studies had followed a patch-based CV, in which patches from all patients are included in the training and test sets. 10,11

A patch-based CV is fundamentally different from a patient-based CV. Because the final test of the usefulness of automatic classification is whether a new patient can be correctly assessed based on the experience from other patients, patient-based CV seems to be the only correct way of validating machine learning models for medical applications. Whereas in a patient-based CV, data from certain patients are held out during model training, in a patch-based CV there is no such guarantee. Even if there is no overlap between adjacent patches extracted from images, a patch-based CV presents a quite different problem because patches extracted from a patient are likely to include much information that is unique to that patient. Table 1 summarizes the evaluation approach and results of a sample of studies on prostate histopathologic characteristics. 7-12 Only a few studies have used a patient-based evaluation on multiexpert data. 7,12

Our goal is to demonstrate the importance of the evaluation method of artificial intelligence techniques when applied in this field, to assist clinicians in interpreting the widely divergent results reported in the literature. We aim to guide researchers in this field to improve their experimental design and performance evaluation and to avoid biased evaluations that can lead to inconsistent and erroneous conclusions.
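To make the distinction between the 2 splitting schemes concrete, the following sketch uses scikit-learn's KFold and GroupKFold on placeholder data; it shows that a naive patch-level split mixes patients across the training and test sets, whereas a patient-grouped split does not.

    import numpy as np
    from sklearn.model_selection import GroupKFold, KFold

    rng = np.random.default_rng(0)
    n_patches = 1000
    patient_ids = rng.integers(0, 231, n_patches)  # which patient each patch came from
    X = rng.random((n_patches, 16))                # placeholder patch features
    y = rng.integers(0, 4, n_patches)              # placeholder patch labels

    # Patch-based CV: patches from the same patient can land on both sides of a split.
    train_idx, test_idx = next(KFold(n_splits=20, shuffle=True, random_state=0).split(X))
    leaked = np.intersect1d(patient_ids[train_idx], patient_ids[test_idx])
    print(f"patients with patches on both sides of the split: {leaked.size}")

    # Patient-based CV: GroupKFold keeps all patches of a patient on one side.
    for train_idx, test_idx in GroupKFold(n_splits=20).split(X, y, groups=patient_ids):
        assert np.intersect1d(patient_ids[train_idx], patient_ids[test_idx]).size == 0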

Methods
Our data comprised 7 tissue microarray slides that contained tissue cores sampled from radical prostatectomy specimens. Sections of the blocks were stained with hematoxylin-eosin and digitized as virtual slides at ×40 magnification using a SCN400 Slide Scanner (Leica Microsystems). This study was approved by the Clinical Research Ethics Board of the University of British Columbia. The patient data were deidentified. Patients consented to the use of their data in research projects, including our own. This study followed the Standards for Quality Improvement Reporting Excellence (SQUIRE) reporting guideline.
A subset of 333 tissue cores was sampled from 231 patients who underwent radical prostatectomy at the Vancouver General Hospital between June 27, 1997, and June 7, 2011. The cores were annotated by 6 pathologists for 4 classes (benign and Gleason grades 3, 4, and 5).

We performed another set of experiments to study the difference between single-expert and multiple-expert data. This set of experiments followed a 20-fold leave-patients-out CV, as described above. We trained the convolutional neural network classifier using the labels of a single pathologist in each experiment and then computed the agreement level of the model with the labels of every pathologist on the held-out patients. For each fold, we also trained the classifier using the labels of the majority vote among the pathologists and repeated the same evaluation with the pathologists.
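The following sketch illustrates the label-handling side of this protocol under stated assumptions: toy annotations from 6 pathologists, a simple majority vote (with ties broken toward the lower class index), and raw per-patch agreement in place of the weighted κ statistic described below.

    import numpy as np

    rng = np.random.default_rng(0)
    n_patches, n_experts, n_classes = 500, 6, 4
    labels = rng.integers(0, n_classes, (n_patches, n_experts))  # toy annotations

    # Majority vote across the 6 pathologists (ties broken toward the lower class).
    majority = np.array([np.bincount(row, minlength=n_classes).argmax() for row in labels])

    # Stand-in for classifier predictions on the held-out patients of one fold.
    predictions = rng.integers(0, n_classes, n_patches)

    # Agreement of the model with the labels of every pathologist.
    for j in range(n_experts):
        print(f"agreement with pathologist {j + 1}: {np.mean(predictions == labels[:, j]):.2f}")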
As a common metric to evaluate agreement, we used the (quadratic) weighted κ statistic, in which disagreements are weighted by the squared difference between the ordinal class ranks. Thus, a disagreement of 1 grade (eg, between Gleason grades 3 and 4) between annotators will receive a much smaller weight than a difference of benign and Gleason grade 5, which is clinically much more significant.
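As a small illustration (not data from this study), scikit-learn's cohen_kappa_score with quadratic weights shows how near-miss disagreements are penalized far less than clinically severe ones:

    from sklearn.metrics import cohen_kappa_score

    # Ordinal class codes: 0 = benign, 1 = Gleason 3, 2 = Gleason 4, 3 = Gleason 5.
    truth = [0, 1, 2, 3, 1, 2, 3, 0]
    near  = [0, 2, 1, 3, 1, 2, 3, 0]  # 2 disagreements between adjacent grades
    far   = [3, 1, 2, 0, 1, 2, 3, 0]  # 2 disagreements between benign and Gleason 5

    print(cohen_kappa_score(truth, near, weights="quadratic"))  # mildly penalized
    print(cohen_kappa_score(truth, far, weights="quadratic"))   # heavily penalized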

Statistical Analysis
Analysis was performed between November 15, 2018, and January 1, 2019. To compare 2 classification methods, we used the McNemar test, which is based on the numbers of patches on which each of the 2 methods made correct and incorrect predictions. The null hypothesis in the McNemar test is that the probability that method 1 makes a correct prediction and method 2 makes an incorrect prediction is equal to the probability that method 1 makes an incorrect prediction and method 2 makes a correct prediction. Using this test, we computed the P value for the observed data under the null hypothesis. All P values were from 2-sided tests, and results were deemed statistically significant at P < .001.
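As an illustration of the test (with made-up counts, not results from this study), the McNemar statistic can be computed from the 2 × 2 table of per-patch outcomes, for example with statsmodels:

    import numpy as np
    from statsmodels.stats.contingency_tables import mcnemar

    # Rows: method 1 (correct, incorrect); columns: method 2 (correct, incorrect).
    table = np.array([[900, 75],
                      [20,   5]])

    result = mcnemar(table, exact=False, correction=True)  # chi-square approximation
    print(f"statistic = {result.statistic:.2f}, P = {result.pvalue:.2e}")

Only the discordant cells (here, 75 and 20) drive the statistic; under the null hypothesis they are expected to be approximately equal.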

Results
On data from the 231 patients with prostate cancer, when the classifier was trained on a single expert, the overall agreement in grading between the pathologists and the classifier ranged from 0.38 to 0.58; when it was trained using the majority vote among all experts, the agreement was 0.60. The full results of the cross-expert experiments are summarized in Table 4 and show that training and evaluating the classifier on a single pathologist yields, in most cases, higher agreement with that pathologist than with the other pathologists.

Discussion
The results of our CV experiments showed marked and statistically significant differences in the accuracy, sensitivity, and specificity for different CV methods. Leave-patches-out CV led to very high accuracy, sensitivity, and specificity estimates. As we anticipated, comparing these values with those obtained with leave-patients-out CV and leave-cores-out CV showed a significant difference. The accuracy, sensitivity, and specificity for the leave-patches-out CV were significantly higher. Even the results of 2-fold leave-patches-out CV were significantly better than the results of 20-fold leave-patients-out CV and 20-fold leave-cores-out CV, even though the number of patches available for training in 2-fold leave-patches-out CV is approximately half that available for 20-fold leave-patients-out CV and 20-fold leave-cores-out CV.
Thus, using patchwise evaluation may lead to highly overestimated performance. Hence, a patch-based CV is a flawed experimental design because the observed prediction accuracy will not be generalizable to unseen patients, which is the true goal of a machine learning model in medical applications.
Our experiments with single-expert data and multiple-expert data demonstrate the importance of evaluating the classification performance across multiple experts rather than a single expert. The results suggest that studies in the literature that used a single pathologist are likely to have overestimated their classification performance.
To our knowledge, this study is the first to systematically compare patch-based CV vs patient-based CV and to study single-expert data vs multiple-expert data for the important task of prostate cancer prediction and grading in digital histopathologic images using machine learning models. Our results show that both these factors are important to the generalizability and interpretability of the results. Hence, our results indicate that for these models to be effective in classifying new patients, they should be trained using patient-based CV on multiexpert data.
As can be seen from the example studies listed in Table 1, the results we obtained with patch-based CV were even better than those seen in previous studies. 10,11 This is likely because of the higher representation capacity of the deep learning classifier that we used.
One way to address this interobserver variability is to apply a multilabel classifier that takes into account the multiple annotations during the training process. The overall agreement levels of the classifier that was trained by the majority vote of the 6 pathologists were higher than those obtained using a single pathologist. In most cases, the agreement of the majority vote classifier with an individual pathologist was even higher than the agreement of a single pathologist-trained classifier with the very same pathologist it was trained on, which suggests that training using the multilabel approach improved the generalizability of the classifier.
Another approach to using multiple-expert data, rather than using the majority vote, is to assign weights to different labels based on prior or learned information regarding the quality of the annotations of each expert. A study on automatic grading of prostate cancer in digital pathology 7 adapted a crowdsourcing algorithm 17 that computes the sensitivity and specificity of each expert annotator for each class and estimates a ground truth on which a classifier can be trained.
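The following is only a simplified, binary, classifier-free sketch of that idea; the cited algorithm additionally handles multiple classes and jointly trains a classifier, so every detail below (the majority-vote initialization, the uniform prior, the EM-style alternation) is an illustrative assumption.

    import numpy as np

    rng = np.random.default_rng(0)
    votes = rng.integers(0, 2, (400, 6))  # toy binary labels from 6 annotators
    z = votes.mean(axis=1) > 0.5          # initialize the latent truth by majority vote

    for _ in range(10):  # EM-like alternation
        sens = (votes[z] == 1).mean(axis=0)   # P(annotator votes 1 | truth is 1)
        spec = (votes[~z] == 0).mean(axis=0)  # P(annotator votes 0 | truth is 0)
        eps = 1e-9
        log_odds = np.zeros(len(votes))       # assumes a uniform prior on the truth
        for j in range(votes.shape[1]):
            log_odds += np.where(votes[:, j] == 1,
                                 np.log(sens[j] + eps) - np.log(1 - spec[j] + eps),
                                 np.log(1 - sens[j] + eps) - np.log(spec[j] + eps))
        z = log_odds > 0  # updated estimate of the ground truth

    print("estimated sensitivities:", np.round(sens, 2))
    print("estimated specificities:", np.round(spec, 2))

A classifier can then be trained on the estimated ground truth, or the per-annotator reliabilities can be used to weight each expert's labels during training.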

Limitations
The main limitation of this study is that it has been restricted to prostate cancer prediction and grading from tissue microarray histopathologic images. We think that our arguments regarding the flawed nature of patch-based training and validation and the need for patient-based analysis hold for other digital pathology applications and also for other applications of machine learning methods in medicine. We also expect that the benefits of using multiple-expert data observed in this study should extend to many other applications in medicine. However, each application may warrant a separate investigation to determine whether these factors have a statistically significant effect.

Conclusions
In this work, we demonstrated that some of the studies on prostate cancer classification from histopathologic images that have been published in recent years have followed flawed experimental designs. Specifically, we showed that patch-based training and evaluation could lead to significant overestimation of a model's predictive accuracy. We also showed that training on data provided by a single expert can lead to biased results that have poor generalizability compared with a model trained on data from multiple experts.
Our results show that patient-based training and evaluation is the only acceptable method for developing machine learning models in this application. Furthermore, to improve reproducibility and generalizability of the results and to facilitate comparison between different works, annotation data by multiple experts should be used to develop and evaluate these models. We expect that our conclusions may apply to other fields in digital histopathology and medical image analysis in general.
However, independent studies are warranted to determine the significance of these factors in each application. The method that we proposed in this study can be followed to establish the significance of these factors in other application areas of machine learning and artificial intelligence in medicine.