Use of Deep Learning to Develop and Analyze Computational Hematoxylin and Eosin Staining of Prostate Core Biopsy Images for Tumor Diagnosis

Key Points Question Can deep learning systems perform hematoxylin and eosin (H&E) staining and destaining and are the virtual core biopsy samples generated by them as valid and interpretable as their real-life unstained and H&E dye–stained counterparts? Findings In this cross-sectional study, deep learning models were trained using nonstained prostate core biopsy images to generate computationally H&E stained images, and core biopsy images were extracted from each whole slide, consisting of approximately 87 000 registered patch pairs of 1024 × 1024 × 3 pixels each. Comprehensive analyses of virtually stained images vs H&E dye–stained images confirmed successful computational staining. Meaning The findings of this study suggest that whole slide nonstained microscopic images of prostate core biopsy, instead of tissue samples, could be integrated with deep learning algorithms to perform computational H&E staining and destaining for rapid and accurate tumor diagnosis.


Loss Function
Let x and y respectively represent the native nonstained and H&E dye–stained image patches in the training dataset. The generator G takes x as input and generates G(x), the corresponding computationally stained image patch, as output. The discriminator D analyzes the output image, G(x), and predicts the probability that it is real (from the training dataset) or fake (output from the generator). The loss function consisted of the cGAN loss,3 an L1 loss, and a Pearson correlation term. The final objective is:

G* = arg min_G max_D α·L_cGAN(G, D) + λ·L_L1(G) + γ·L_P(G)

where x is the input image, y is the target image, and z is the random noise, added as dropout in our work. L_cGAN(G, D) is the cGAN loss function, L_L1(G) is the L1 loss between the output of the generator and the target image, and L_P(G) is the proposed term that calculates the Pearson correlation coefficient between the generator output and the target image. α = 1, λ = 100, and γ = 10 gave the best results. After training, the model accepted unseen native nonstained image patches and generated computationally H&E stained image patches.
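As a sketch of how the three terms combine, the generator objective can be written as follows (assuming NumPy arrays for the generated and target patches, a precomputed adversarial loss value, and 1 − r as the correlation penalty; function names are illustrative, not the authors'):

```python
import numpy as np

def pearson_loss(generated, target, eps=1e-8):
    """L_P term: 1 minus the Pearson correlation coefficient between the
    flattened generator output and the target H&E patch (0 when identical)."""
    g = generated.astype(np.float64).ravel()
    t = target.astype(np.float64).ravel()
    g -= g.mean()
    t -= t.mean()
    r = (g @ t) / (np.linalg.norm(g) * np.linalg.norm(t) + eps)
    return 1.0 - r

def generator_objective(adv_loss, generated, target,
                        alpha=1.0, lam=100.0, gamma=10.0):
    """alpha*L_cGAN + lambda*L_L1 + gamma*L_P with the weights reported
    in the text (alpha=1, lambda=100, gamma=10)."""
    l1 = np.abs(generated.astype(np.float64) - target.astype(np.float64)).mean()
    return alpha * adv_loss + lam * l1 + gamma * pearson_loss(generated, target)
```

A perfectly reconstructed patch drives both the L1 and Pearson terms to zero, leaving only the adversarial component.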

Technical implementation of the cGAN model:
The discriminator was trained after every training step of the generator. Both networks were trained for 10 epochs using Adam optimization,12 with a batch size of one, on an NVIDIA GeForce 1080 Ti GPU (NVIDIA, Santa Clara, CA) with 12 GB of VRAM and CUDA acceleration to speed up training. One epoch of training took approximately 16 GPU hours. Patches were randomly flipped, and dropout was applied, to prevent overfitting and increase the generalization capability of the model.
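Because the patch pairs are registered, any random flip used for augmentation must be applied identically to both images of a pair. A minimal sketch (assuming NumPy arrays; the function name is ours):

```python
import numpy as np

def random_flip_pair(unstained, stained, rng):
    """Apply the same random horizontal/vertical flips to a registered
    (unstained, H&E-stained) patch pair so the two images stay aligned."""
    if rng.random() < 0.5:  # horizontal flip
        unstained, stained = unstained[:, ::-1], stained[:, ::-1]
    if rng.random() < 0.5:  # vertical flip
        unstained, stained = unstained[::-1], stained[::-1]
    return unstained, stained
```

Flipping the two images with independent random draws would silently destroy the pixel-level registration the L1 and Pearson terms depend on.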

eAppendix 3. Interpretation
Methods: Non-overlapping patches were cropped from these four sets of images, resulting in four sets of approximately 2000 patches (1024 × 1024 pixels each). Patches not containing any tissue were discarded, leaving 448 patches per set. These four sets of patches were fed into the trained staining and destaining models: nonstained and computationally destained patches were fed into the computational staining model, while computationally stained and H&E dye–stained patches were fed into the computational destaining model. The activation maps generated after each layer block (LeakyReLU – conv2d – batch norm) were saved for further analysis. Activation maps for selected patches were analyzed to understand the transformation of input images as they passed through the generator's layers and to identify which convolutional kernels were activated. Activation maps generated by the five most activated kernels were extracted, ranked, and visualized as heatmaps. Activation maps for each layer were ordered in descending order, ranked by the number of pixels with intensity greater than 200. The top five activation maps (heatmaps) from each layer are shown in Figure 3 (benign input patch) and eFigure 3 (Gleason grade 3 tumor patch). Heatmaps were standard 'jet' heatmaps plotted using the Matplotlib library in Python.
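The ranking step described above (ordering a layer's activation maps by the number of pixels with intensity greater than 200) can be sketched as follows, assuming the maps are stored as a (num_kernels, H, W) uint8 array; names are illustrative:

```python
import numpy as np

def top_activation_maps(maps, k=5, threshold=200):
    """Rank activation maps by the count of pixels brighter than
    `threshold` and return the top-k maps with their kernel indices."""
    counts = (maps > threshold).reshape(maps.shape[0], -1).sum(axis=1)
    order = np.argsort(counts)[::-1][:k]  # descending by bright-pixel count
    return maps[order], order
```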
For each input image, the activation maps of each layer were rescaled to 0–1, resized to a standard size (128 × 128 pixels), and concatenated into a grid (a 16 × K grid, where K varies and equals the number of activation maps divided by 16). This process was repeated for every one of the 448 patches in all four image sets. The activation maps for matching patches (unstained–destained and computationally stained–H&E stained input patch pairs) were compared and the mean squared error (MSE) was calculated. The MSE plot for the two sets of matching input patches is shown in eFigure 5.
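A sketch of the per-layer comparison (min–max rescaling each activation grid to 0–1, then computing the MSE between matched grids of equal shape; function names are ours):

```python
import numpy as np

def rescale01(a, eps=1e-8):
    """Min-max rescale an activation grid to the 0-1 range."""
    a = a.astype(np.float64)
    return (a - a.min()) / (a.max() - a.min() + eps)

def layerwise_mse(grids_a, grids_b):
    """MSE between matched activation grids, one value per layer."""
    return [float(np.mean((rescale01(a) - rescale01(b)) ** 2))
            for a, b in zip(grids_a, grids_b)]
```

Because both grids are rescaled before comparison, a constant brightness offset between matched maps does not contribute to the MSE.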
Results: The top five activation maps (heatmaps) obtained from the first five layers (rows I–V) and the last four layers (rows 16–19) of the generator (excluding the input and output layers) are presented in Figure 3 and eFigures 3 and 4. The first five rows represent the first five layers of the generator, and the last four rows represent its last four layers. It is evident from the activation heatmaps that the network does a good job of separating the tissue from the background (eFigures 3 and 4: L1-I and L1-III). eFigures 3 and 4: L1-II, L1-III, L2-II, L2-IV, and L2-V indicate that the model learned to recognize some features in the tissue. As the image passed through the layers, localized high activations (bright red and orange) appear, indicating regions of interest learned by the trained generator. While the initial layers learn low-level features such as tissue, background, circles, and simple patterns, deeper layers learn to recognize high-level patterns (using a combination of the knowledge accumulated by the previous layers). These deeper visualizations are difficult to interpret and look like noise to the human eye. Layers 16 through 19 show the activation heatmaps from the decoder side of the generator (Figure 3 and eFigures 3 and 4). These layers decode the encoded information back to the original size (1024 × 1024 output) while computationally staining the image. In eFigure 5, the blue lines represent the MSE values across the layers for all 448 patch pairs, unstained–destained (left) and H&E stained–computationally stained (right). The green and orange lines are the first and third quartile values. The variance between the MSE values across the 448 patches is higher for the encoder layers and lower for the decoder layers; this holds for both the unstained–destained and the H&E stained–computationally stained plots. Peaks were observed at layers 3, 10, and 17.

eFigure 1. Color Coded Overlaid Validation Images
In eFigures 1.1 through 1.13, each figure contains: (a) ground truth hematoxylin and eosin (H&E) dye–stained RGB whole slide image (RWSI); (b) computationally H&E stained RWSI; (c) computationally H&E stained RWSI overlaid with colors representing true positive (green), false positive (red), and false negative (blue) regions from the intersection over union (IoU) of healthy and tumor annotations provided by five physicians on ground truth H&E dye–stained and computationally H&E stained prostate core biopsy images; (d) ground truth native nonstained RWSI; (e) computationally destained RWSI.

eTable 2. Mean Pixel Intensity Following Computational Staining and Destaining
Average pixel intensity differences following computational staining and destaining: difference between native nonstained and computationally stained (U_C); native nonstained and H&E dye stained (U_H); ground truth H&E dye stained and computationally stained (H*_C); H&E dye stained and computationally destained (H_D); H&E dye stained and native nonstained (H_U); and computationally destained and ground truth native nonstained (D_U*). All values are pixel intensities (0 to 255) calculated by subtracting the 2nd image from the 1st. Positive values indicate a decrease in average pixel intensity and negative values indicate a gain. Values have been rounded to the nearest integer. H is the H&E dye–stained image, C the computationally stained image, D the computationally destained image, and U the native nonstained image. Ground truth images are indicated with '*' to facilitate comparisons with computational images where necessary.
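The tabulated values follow the convention above (first image minus second, positive meaning the second image is darker). A minimal sketch, assuming 8-bit images stored as NumPy arrays and an illustrative function name:

```python
import numpy as np

def mean_intensity_difference(first, second):
    """Average pixel intensity of `first` minus that of `second`,
    rounded to the nearest integer. Positive values mean the second
    image is darker on average (a decrease in intensity)."""
    diff = first.astype(np.float64).mean() - second.astype(np.float64).mean()
    return int(round(diff))
```

For example, U_C compares a bright unstained image against its darker computationally stained counterpart, so it is expected to be positive.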


eAppendix 4. Evaluation of the Activation Maps of Trained Deep Neural Network
Clinical evaluations of computationally stained images: Figure 2 in the manuscript shows representative input nonstained image patches in row (a) that contained Gleason grade 3 (columns I, II) or grade 4 (columns III, IV) tumors or were benign (column V), their computational H&E staining (row c), and accuracy calculated using annotations by multiple physicians (row d). Tissue morphology in computationally stained patches (row c) matches closely with H&E dye–stained patches (row b). Patch c-I successfully generated a benign area along with tumor signature (as indicated by arrows), confirmed in row d-I. Computationally stained patches (row c) retain the appearance of benign and malignant glands and stroma seen in H&E dye–stained patches (row b). Patch b-III also contains an edge/crush artifact (arrowheads) that is preserved in the computationally stained image (row c-III). The same patches are shown (row d) with color-coded areas of agreement and disagreement between the labels provided on H&E dye–stained images and computationally stained RWSI. It is evident that the computationally H&E stained patches represent tumor signatures with high accuracy and that pathologists are able to correctly identify tumor. The majority of observed disagreements between raters did not represent misidentification of glands as benign or malignant. Instead, they reflect differences in rater annotation at the borders of tumor labels, mainly due to differences in labeling style, with some raters providing coarse labels and others annotating detailed labels (row d-III, arrows), or at biopsy edges, as some raters chose to score partial/crushed glands at the periphery of samples and others did not (row d-III, arrowheads). Reconstructed computationally stained images shown in eFigure 1 show the most reported areas of disagreement, many of which are attributed to atypical glands that were hard to categorize on both images but were well represented in the computer-generated images (eFigures 1.4.c, 1.6.c, and 1.7.c).
eFigure 1.5 shows an uncommon Gleason pattern 5 tumor with comedo necrosis (eFigure 1.5.a, arrow). The morphology of the tumor glands is well maintained (eFigure 1.5.b, arrowheads), but the comedo necrosis is not visualized (eFigure 1.5.b, arrow). The dye-stained image in eFigure 1.7.a contains an infrequently encountered scenario (indicated by an arrow): the presence of rare malignant glands that are not well visualized in the computationally stained image (eFigure 1.7.b, arrow). Despite this altered appearance, there was no impact on clinical diagnosis, as the blinded reviewers scored these areas as tumor. Some glands are poorly formed on both the dye-stained and the computationally stained image (eFigures 1.8.a and 1.8.b, arrows), leading to disagreement between raters, even though the computationally stained image was identical to the dye-stained image. Images shown in eFigure 1.9 presented a challenging labeling exercise in which tumor cell cytoplasm was very pale and did not show significant contrast against the background stroma in the dye-stained image (eFigure 1.9.a). This cytoplasmic pallor was also well preserved in the computationally stained image (eFigure 1.9.b). Despite this, the appearance of the nuclei and the slight difference in cytoplasmic texture made the tumor identifiable in both images (eFigures 1.9.c and 1.9.d).
The computationally stained images shown in eFigures 1.10.b, 1.11.b, and 1.13.b were well represented; the majority of disagreement in these images arose from tumor/non-tumor boundaries and biopsy-edge issues. Validation images in eFigures 1.11.b and 1.13.b illustrate additional high-quality examples of preserved morphology generated by the computational staining algorithm, confirming accurate matching with dye-stained images in benign conditions. Non-necrotizing granulomas, marked chronic inflammation, reactive stromal changes, and proteinaceous debris were all morphologically identifiable in the computationally stained images (eFigure 1.11.c).
Pathologists unanimously scored the matched H&E dye–stained and computationally stained images shown in eFigures 1.3 and 1.12 as benign. After expert re-review of the original slides and additional evaluation by immunohistochemistry, the original EHR diagnosis was overturned in two cases, resulting in two additional cases of agreement. Pathologists reviewing computer-generated core 11 were better able to identify the presence of rare glands of Gleason grade 3 tumor than those who had rendered the original EHR diagnosis of benign (eFigure 1.11, marked blue/green in the supplement).
Microscopic re-review of the original glass slide confirmed that it indeed had a tiny focus of grade 3 tumor that was overlooked at the time of the original diagnosis. Subsequent immunohistochemical analysis revealed the absence of basal cells around the glands in question, confirming the diagnosis of carcinoma made during this study and showing the diagnosis conferred on the computationally generated images to be correct. eFigure 1.2 was the only study biopsy that showed a significant difference in tumor fraction: this study reported a 50% tumor fraction whereas the original EHR report stated 90%. Re-review of the original glass slide again showed this study's fraction to be more accurate than the original diagnosis (eFigure 1.2). Otherwise, the tumor fraction identified in the computationally generated images approximated the fraction reported in the EHR for all images, as evident from eTable 5. None of the differences between the EHR and computationally generated H&E diagnoses were clinically significant with regard to treatment decisions. A difference in tumor grade was identified in a minor component of computationally stained images (eFigures 1.4, 1.7, and 1.13). The small foci of higher- or lower-grade tumor identified in computationally stained images (eFigures 1.4.c, 1.7.c, and 1.13.c), which were not reported at the time of original diagnosis, comprised a very small fraction of tumor volume. These were often associated with diagnostically indeterminate questions (e.g., whether a gland represented a rare focus of grade 4 tumor or tangential sectioning of grade 3 tumor) and were not clinically significant in the context of the patient's known tumor at the time of the original EHR-reported diagnosis.