Development and Validation of a Deep Learning Model to Quantify Glomerulosclerosis in Kidney Biopsy Specimens | Nephrology | JAMA Network Open | JAMA Network
Figure 1.  Example Whole-Slide Images With Visualizations of Annotation Ground Truth and Model Prediction

Left panels, example hematoxylin-eosin (H&E)–stained whole-slide images. Middle panels, pathologist glomeruli annotations. Right panels, cross-validated model predictions obtained using H&E inputs.

Figure 2.  Model and On-Call Pathologists’ Estimates of Percent Global Glomerulosclerosis

A, Model predictions of percent global glomerulosclerosis vs expert pathologists’ annotations on individual slide levels, obtained from 10-fold cross-validation. Error bars represent 95% prediction intervals computed from the beta distribution with parameters given by number of globally sclerosed and nonglobally sclerosed glomeruli. B, Same as (A), with results for individual kidneys obtained by pooling glomeruli counts. C, On-call pathologist performance vs expert pathologists’ annotations for corresponding cases. RMSE represents root-mean-square error.

Figure 3.  Model Predictions for Number of Glomeruli vs Expert Pathologists’ Annotations

RMSE represents root-mean-square error.

Figure 4.  Sorted Percent Global Glomerulosclerosis Estimates in Comparison With Nominal Rejection Cut Point

The overall risk of erroneously discarding potentially usable kidneys is shown in panel (A) for evaluations based on individual levels and pooled levels. Results for percent global glomerulosclerosis plotted in increasing order for individual levels derived from model predictions (B) and annotations (C). Results for percent global glomerulosclerosis of individual kidneys obtained from pooling levels derived from on-call pathologist counts (D), model predictions (E), and annotations (F). Error bars represent 95% prediction intervals assuming the measurement is modeled by a beta distribution with parameters given by the number of nonglobally sclerosed and globally sclerosed glomeruli. Nominal organ rejection cut point at 20% global glomerulosclerosis is shown as a horizontal gray line. Cases where the measurement prediction interval crosses the cut point are depicted with darker coloring where there is more than 5% chance of erroneously accepting or rejecting the kidney for transplantation.

Figure 5.  Global Glomerulosclerosis Measurement Distribution Modeling for Estimating Probability of Erroneous Kidney Discard

A, The beta distribution uses the mean number of glomeruli per level (58) observed in this study. Nominal 20% rejection cut point shown by vertical dashed gray line. The chance of erroneous organ rejection is proportional to the area under the curve to the right of the cut point line. B, Probability of erroneous discard for curves shown in A. C, Estimations use on-call pathologists’ glomeruli counts as a surrogate for ground truth. The number of erroneous discards is halved by reading 4 levels vs only 1 level.

1. Tullius SG, Rabb H. Improving the supply and quality of deceased-donor organs for transplantation. N Engl J Med. 2018;378(20):1920-1929. doi:10.1056/NEJMra1507080
2. Hart A, Smith JM, Skeans MA, et al. OPTN/SRTR 2017 annual data report: kidney. Am J Transplant. 2019;19(suppl 2):19-123. doi:10.1111/ajt.15274
3. Moeckli B, Sun P, Lazeyras F, et al. Evaluation of donor kidneys prior to transplantation: an update of current and emerging methods. Transpl Int. 2019;32(5):459-469. doi:10.1111/tri.13430
4. Stewart DE, Klassen DK. Early experience with the new kidney allocation system: a perspective from UNOS. Clin J Am Soc Nephrol. 2017;12(12):2063-2065. doi:10.2215/CJN.06380617
5. Dare AJ, Pettigrew GJ, Saeb-Parsy K. Preoperative assessment of the deceased-donor kidney: from macroscopic appearance to molecular biomarkers. Transplantation. 2014;97(8):797-807. doi:10.1097/01.TP.0000441361.34103.53
6. Bosmans J-L, Woestenburg A, Ysebaert DK, et al. Fibrous intimal thickening at implantation as a risk factor for the outcome of cadaveric renal allografts. Transplantation. 2000;69(11):2388-2394. doi:10.1097/00007890-200006150-00030
7. Escofet X, Osman H, Griffiths DFR, Woydag S, Adam Jurewicz W. The presence of glomerular sclerosis at time zero has a significant impact on function after cadaveric renal transplantation. Transplantation. 2003;75(3):344-346. doi:10.1097/01.TP.0000044361.74625.E7
8. Randhawa P. Role of donor kidney biopsies in renal transplantation. Transplantation. 2001;71(10):1361-1365. doi:10.1097/00007890-200105270-00001
9. Serón D, Carrera M, Griño JM, et al. Relationship between donor renal interstitial surface and post-transplant function. Nephrol Dial Transplant. 1993;8(6):539-543. doi:10.1093/ndt/8.6.539
10. Singh P, Farber JL, Doria C, et al. Peritransplant kidney biopsies: comparison of pathologic interpretations and practice patterns of organ procurement organizations. Clin Transplant. 2012;26(3):E191-E199. doi:10.1111/j.1399-0012.2011.01584.x
11. Sung RS, Christensen LL, Leichtman AB, et al. Determinants of discard of expanded criteria donor kidneys: impact of biopsy and machine perfusion. Am J Transplant. 2008;8(4):783-792. doi:10.1111/j.1600-6143.2008.02157.x
12. Wang HJ, Kjellstrand CM, Cockfield SM, Solez K. On the influence of sample size on the prognostic accuracy and reproducibility of renal transplant biopsy. Nephrol Dial Transplant. 1998;13(1):165-172. doi:10.1093/ndt/13.1.165
13. Husain SA, Chiles MC, Lee S, et al. Characteristics and performance of unilateral kidney transplants from deceased donors. Clin J Am Soc Nephrol. 2018;13(1):118-127. doi:10.2215/CJN.06550617
14. Messina M, Diena D, Dellepiane S, et al. Long-term outcomes and discard rate of kidneys by decade of extended criteria donor age. Clin J Am Soc Nephrol. 2017;12(2):323-331. doi:10.2215/CJN.06550616
15. Liapis H, Gaut JP, Klein C, et al; Banff Working Group. Banff histopathological consensus criteria for preimplantation kidney biopsies. Am J Transplant. 2017;17(1):140-150. doi:10.1111/ajt.13929
16. Azancot MA, Moreso F, Salcedo M, et al. The reproducibility and predictive value on outcome of renal biopsies from expanded criteria donors. Kidney Int. 2014;85(5):1161-1168. doi:10.1038/ki.2013.461
17. Haas M. Donor kidney biopsies: pathology matters, and so does the pathologist. Kidney Int. 2014;85(5):1016-1019. doi:10.1038/ki.2013.439
18. Achi HE, Belousova T, Chen L, et al. Automated diagnosis of lymphoma with digital pathology images using deep learning. Ann Clin Lab Sci. 2019;49(2):153-160.
19. Djuric U, Zadeh G, Aldape K, Diamandis P. Precision histology: how deep learning is poised to revitalize histomorphology for personalized cancer care. NPJ Precis Oncol. 2017;1(1):22. doi:10.1038/s41698-017-0022-1
20. Esteva A, Kuprel B, Novoa RA, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542(7639):115-118. doi:10.1038/nature21056
21. Gulshan V, Peng L, Coram M, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA. 2016;316(22):2402-2410. doi:10.1001/jama.2016.17216
22. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436-444. doi:10.1038/nature14539
23. Litjens G, Sánchez CI, Timofeeva N, et al. Deep learning as a tool for increased accuracy and efficiency of histopathological diagnosis. Sci Rep. 2016;6:26286. doi:10.1038/srep26286
24. Mercan E, Mehta S, Bartlett J, Shapiro LG, Weaver DL, Elmore JG. Assessment of machine learning of breast pathology structures for automated differentiation of breast cancer and high-risk proliferative lesions. JAMA Netw Open. 2019;2(8):e198777. doi:10.1001/jamanetworkopen.2019.8777
25. Serag A, Ion-Margineanu A, Qureshi H, et al. Translational AI and deep learning in diagnostic pathology. Front Med (Lausanne). 2019;6:185. doi:10.3389/fmed.2019.00185
26. Wang S, Yang DM, Rong R, et al. Artificial intelligence in lung cancer pathology image analysis. Cancers (Basel). 2019;11(11):1673. doi:10.3390/cancers11111673
27. Bukowy JD, Dayton A, Cloutier D, et al. Region-based convolutional neural nets for localization of glomeruli in trichrome-stained whole kidney sections. J Am Soc Nephrol. 2018;29(8):2081-2088. doi:10.1681/ASN.2017111210
28. Ginley B, Lutnick B, Jen KY, et al. Computational segmentation and classification of diabetic glomerulosclerosis. J Am Soc Nephrol. 2019;30(10):1953-1967. doi:10.1681/ASN.2018121259
29. Hermsen M, de Bel T, den Boer M, et al. Deep learning-based histopathologic assessment of kidney tissue. J Am Soc Nephrol. 2019;30(10):1968-1979. doi:10.1681/ASN.2019020144
30. Kannan S, Morgan LA, Liang B, et al. Segmentation of glomeruli within trichrome images using deep learning. Kidney Int Rep. 2019;4(7):955-962. doi:10.1016/j.ekir.2019.04.008
31. Bueno G, Fernandez-Carrobles MM, Gonzalez-Lopez L, Deniz O. Glomerulosclerosis identification in whole slide images using semantic segmentation. Comput Methods Programs Biomed. 2020;184:105273. doi:10.1016/j.cmpb.2019.105273
32. Schindelin J, Arganda-Carreras I, Frise E, et al. Fiji: an open-source platform for biological-image analysis. Nat Methods. 2012;9(7):676-682. doi:10.1038/nmeth.2019
33. Cornell University. arXiv.org Computer Science. Very deep convolutional networks for large-scale image recognition by Karen Simonyan and Andrew Zisserman. arXiv:1409.1556. Last revised April 10, 2015. Accessed November 22, 2020. https://arxiv.org/abs/1409.1556
34. Marsh JN, Matlock MK, Kudose S, et al. Deep learning global glomerulosclerosis in transplant kidney frozen sections. IEEE Trans Med Imaging. 2018;37(12):2718-2728. doi:10.1109/TMI.2018.2851150
35. Lindeberg T. Detecting salient blob-like image structures and their scales with a scale-space primal sketch: a method for focus-of-attention. Int J Comput Vis. 1993;11(3):283-318. doi:10.1007/BF01469346
    Original Investigation
    Nephrology
    January 20, 2021

    Development and Validation of a Deep Learning Model to Quantify Glomerulosclerosis in Kidney Biopsy Specimens

    Author Affiliations
    • 1Department of Pathology and Immunology, Washington University School of Medicine in St Louis, St Louis, Missouri
    • 2Institute for Informatics (I2), Washington University School of Medicine in St Louis, St Louis, Missouri
    • 3Department of Medicine, Washington University School of Medicine in St Louis, St Louis, Missouri
    JAMA Netw Open. 2021;4(1):e2030939. doi:10.1001/jamanetworkopen.2020.30939
    Key Points

    Question  Can a deep neural network decrease likelihood of unnecessary donor kidney discard by precisely quantifying percent global glomerulosclerosis on whole-slide images of hematoxylin-eosin–stained biopsy specimens?

    Findings  In this prognostic study of 83 donor kidneys, a deep neural network segmented normal and globally sclerotic glomeruli in whole-slide images to quantify percent global glomerulosclerosis with higher performance than pathologists. Model accuracy further increased by pooling multiple sections, resulting in decreased likelihood of erroneous organ discard by 37%.

    Meaning  This study’s findings suggest that deep learning methods may help prevent erroneous organ discard by performing beyond the capacity of pathologists in biopsy specimen examination.

    Abstract

    Importance  A chronic shortage of donor kidneys is compounded by a high discard rate, and this rate is directly associated with biopsy specimen evaluation, which shows poor reproducibility among pathologists. A deep learning algorithm for measuring percent global glomerulosclerosis (an important predictor of outcome) on images of kidney biopsy specimens could enable pathologists to more reproducibly and accurately quantify percent global glomerulosclerosis, potentially saving organs that would have been discarded.

    Objective  To compare the performances of pathologists with a deep learning model on quantification of percent global glomerulosclerosis in whole-slide images of donor kidney biopsy specimens, and to determine the potential benefit of a deep learning model on organ discard rates.

    Design, Setting, and Participants  This prognostic study used whole-slide images acquired from 98 hematoxylin-eosin–stained frozen and 51 permanent donor biopsy specimen sections retrieved from 83 kidneys. Serial annotation by 3 board-certified pathologists served as ground truth for model training and for evaluation. Images of kidney biopsy specimens were obtained from the Washington University database (retrieved between June 2015 and June 2017). Cases were selected randomly from a database of more than 1000 cases to include biopsy specimens representing an equitable distribution within 0% to 5%, 6% to 10%, 11% to 15%, 16% to 20%, and more than 20% global glomerulosclerosis.

    Main Outcomes and Measures  Correlation coefficient (r) and root-mean-square error (RMSE) with respect to annotations were computed for cross-validated model predictions and on-call pathologists’ estimates of percent global glomerulosclerosis when using individual and pooled slide results. Data were analyzed from March 2018 to August 2020.

    Results  The cross-validated model results of section images retrieved from 83 donor kidneys showed higher correlation with annotations (r = 0.916; 95% CI, 0.886-0.939) than on-call pathologists (r = 0.884; 95% CI, 0.825-0.923) that was enhanced when pooling glomeruli counts from multiple levels (r = 0.933; 95% CI, 0.898-0.956). Model prediction error for single levels (RMSE, 5.631; 95% CI, 4.735-6.517) was 14% lower than on-call pathologists (RMSE, 6.523; 95% CI, 5.191-7.783), improving to 22% with multiple levels (RMSE, 5.094; 95% CI, 3.972-6.301). The model decreased the likelihood of unnecessary organ discard by 37% compared with pathologists.

    Conclusions and Relevance  The findings of this prognostic study suggest that this deep learning model provided a scalable and robust method to quantify percent global glomerulosclerosis in whole-slide images of donor kidneys. The model performance improved by analyzing multiple levels of a section, surpassing the capacity of pathologists in the time-sensitive setting of examining donor biopsy specimens. The results indicate the potential of a deep learning model to prevent erroneous donor organ discard.

    Introduction

    More than 100 000 patients are currently waiting for a kidney transplant.1 Despite the growing need, between 17% and 20% of kidneys recovered for transplant are discarded.2-4 With organ shortage and increasing demand for kidney transplants, there is an urgent need to decrease unnecessary organ discard.3

    The biopsy result is reported as the most important factor in the decision to use or discard a donor kidney.5 Numerous investigations have linked chronic damage in donor kidney biopsy specimens with transplant outcomes.6-12 A level of 20% global glomerulosclerosis is frequently used as a cut point in the decision to transplant and is a major factor underlying why the biopsy result is the most common reason an organ is rejected for transplantation in the United States.4

    Recent studies indicate that acceptable kidneys are being discarded because of variable and inconsistent donor biopsy specimen interpretation.3,13,14 Even a seemingly straightforward metric such as the percent global glomerulosclerosis is subject to significant human variation.15-17 Freezing artifacts, lack of subspecialty expertise, inadequate sampling, and the time-sensitive nature of these evaluations all contribute to human errors.

    Recently, deep learning (DL) has shown potential to improve reproducibility and accuracy in histopathologic examination.18-26 Prior studies from other laboratories have used DL approaches for automated detection of nonsclerotic and globally sclerotic glomeruli.27-31 However, these techniques rely on special stains, such as periodic acid–Schiff or Masson trichrome stains, that are impractical in the time-sensitive setting of frozen sections. Previous work from members of our group describes the only reported results, to our knowledge, showing high performance for automated quantitation of percent global glomerulosclerosis using whole-slide images (WSIs) of hematoxylin-eosin–stained frozen sections.

    We hypothesize that a DL approach to examination of donor kidney biopsy specimens will outperform human pathologists in evaluating percent global glomerulosclerosis, and that examining multiple levels of section will enhance performance further. This increased tissue sampling is hypothesized to decrease the likelihood of unnecessary organ discard, which bears on the question of whether DL techniques could substantively increase the available donor organ pool.

    Methods

    This study followed the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) reporting guideline for diagnostic and prognostic studies. This study was reviewed and approved by the Washington University institutional review board, which also waived the need for obtaining informed patient consent because this study used only nonidentifiable biospecimens from an existing data set.

    Data Collection

    The WSIs were acquired from deceased donor biopsy specimens—98 hematoxylin-eosin–stained frozen sections and 51 permanent sections—retrieved from a total of 83 kidneys by using both wedge and needle samples. Frozen-section biopsy specimens and permanent-section biopsy specimens were obtained from different kidneys. Of 83 specimens, 62 had at least 2 levels of section. Biopsy specimen images from the Washington University database originated from Gift of Life Michigan (retrieved between August 2015 and November 2016 using a Sakura scanner; magnification, ×20) and Washington University (retrieved between June 2015 and June 2017 via Mid-America Transplant using an Aperio Scanscope CS scanner; magnification, ×20). Any deceased organ donor who presented between these dates and underwent a kidney biopsy for digital intraoperative pathologic examination was eligible for this study. The demographic characteristics and clinical features of donors were unknown to the investigators. All scans were converted from SVS to TIFF format at full resolution (0.5 μm/pixel). Image sizes ranged from 105 megapixels to 1448 megapixels.

    Data Annotation

    Slides were first annotated for nonsclerotic and sclerotic glomeruli by a board-certified expert kidney pathologist (P.C.W. or J.P.G.), revised by a second board-certified pathologist (T.C.L.) with experience interpreting donor kidney biopsy specimens, and followed by a final revision by another board-certified expert kidney pathologist (P.C.W. or J.P.G.). The final revised annotations served as ground truth (ie, the gold standard) for model training and evaluation. Typical variability in glomeruli counts with each revision is illustrated in eFigure 1 in the Supplement. An in-house plug-in written for Fiji32 was used to manually outline and classify glomeruli on each WSI to generate pixelwise label masks of glomerulus regions at the same resolution as the parent WSI. Glomeruli were classified as either globally sclerotic (defined as sclerosis involving the entire glomerular tuft, including obsolescent, solidified, and disappearing global glomerulosclerosis) or nonglobally sclerotic. All other areas were grouped together and labeled tubulointerstitium. A total of 1544 globally sclerosed and 6914 nonglobally sclerosed glomeruli were labeled in 149 separate images. The biopsy specimens exhibited a wide range of percent global glomerulosclerosis (0%-77%). The mean (SD) number of glomeruli per slide was 57 (31).

    DL Model Architecture

    The DL model used in this study was a fully convolutional neural network based on the VGG16 architecture33 described in previous work, which included a member of our group.34 In brief, data were input to the pretrained VGG16 base network with weights frozen below the bottleneck (ie, immediately prior to the densely connected classification layers). The VGG16 densely connected classification layers were replaced with 5 fully convolutional layers with trainable weights. Use of a fully convolutional architecture through the entire network enabled an “image to image” transformation, rather than an “image to label” transformation, for each input image patch, the latter of which is an approach that is less accurate and much more computationally expensive.34 The fully convolutional model generated downsampled pixel maps registered to the input image patch, giving the probability that each output pixel was tubulointerstitium, nonglobally sclerosed glomerulus, or globally sclerosed glomerulus.

    Training Parameters

    Images were diced into 2048 × 2048-pixel (1024 × 1024 μm) partially overlapping image patches (stride, 1664 pixels or 832 μm) for training input. Patches were selected for training by randomly sampling from the entire pool of image patches (approximately 6500 patches in each cross-validation training set, the length of a single epoch). Input patches were randomly flipped or rotated (by 0°, 90°, 180°, or 270°), yielding 8-fold augmentation of training data for a total of approximately 52 000 possible training patches in the sampling pool. Training was performed using TensorFlow by minimizing categorical cross-entropy loss, weighted classwise using a ratio for sclerosed to nonsclerosed to tubulointerstitial categories of 10:5:1 to compensate for class imbalance. Stochastic gradient descent optimization was used with a cyclic learning rate between 1e−4 and 1e−2 and a batch size of 4 for 15 epochs.
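
The dicing and augmentation scheme can be sketched in a few lines of Python. This is an illustration only, not the study's code: the function names are hypothetical, and the handling of image borders (shifting a final patch back so it sits flush with the edge) is an assumption, since the text does not describe it.

```python
from itertools import product

PATCH = 2048   # patch edge in pixels (from the text)
STRIDE = 1664  # step between patch origins in pixels (from the text)


def patch_origins(length, patch=PATCH, stride=STRIDE):
    """Top-left offsets of patches along one image axis.

    Assumption: if the regular grid stops short of the border, one extra
    patch is placed flush with the border so the whole axis is covered.
    """
    if length <= patch:
        return [0]
    origins = list(range(0, length - patch + 1, stride))
    if origins[-1] + patch < length:
        origins.append(length - patch)
    return origins


def dice(width, height):
    """(x, y) origins of every patch covering a width x height image."""
    return [(x, y) for y, x in product(patch_origins(height), patch_origins(width))]


def dihedral_variants(tile):
    """The 8 flip/rotation variants of a square tile given as a list of rows."""
    variants = []
    current = [list(row) for row in tile]
    for _ in range(4):
        variants.append(current)
        variants.append([row[::-1] for row in current])        # horizontal flip
        current = [list(row) for row in zip(*current[::-1])]   # rotate 90° clockwise
    return variants
```

Neighboring patches on the regular grid overlap by 2048 − 1664 = 384 pixels, and the 4 rotations combined with an optional flip give the 8 variants behind the 8-fold augmentation figure.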

    Cross-Validation

    The model was trained and tested in 10-fold cross-validation, where 10% of the WSIs were withheld from training in each fold, and the resulting model (trained on the remaining 90% of data) was used to generate predictions on the withheld WSIs. Images from different levels of the same kidney were always held out together. No information from a test set of a cross-validation fold was used to inform training of the corresponding fold. Predictions for withheld slides were generated patchwise according to the image dicing scheme described above (ie, 2048 × 2048–pixel patches with 1664-pixel stride), and the results were reassembled to produce output probability maps for entire WSIs.
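
A minimal sketch of a kidney-grouped split follows, assuming a mapping from slide identifier to kidney identifier (both hypothetical names). The round-robin assignment is also an assumption: the text states only that levels from the same kidney were held out together, not how folds were balanced.

```python
import random
from collections import defaultdict


def grouped_folds(slide_to_kidney, n_folds=10, seed=0):
    """Split slides into folds so that all levels of a kidney share a fold.

    slide_to_kidney maps a slide identifier to its kidney identifier.
    Returns a list of n_folds lists of slide identifiers.
    """
    by_kidney = defaultdict(list)
    for slide, kidney in slide_to_kidney.items():
        by_kidney[kidney].append(slide)

    kidneys = sorted(by_kidney)
    random.Random(seed).shuffle(kidneys)

    folds = [[] for _ in range(n_folds)]
    # Deal whole kidneys round-robin so no kidney straddles two folds.
    for i, kidney in enumerate(kidneys):
        folds[i % n_folds].extend(by_kidney[kidney])
    return folds
```

For fold k, the withheld slides are `folds[k]` and the training slides are the union of the other folds. scikit-learn's `GroupKFold` offers the same guarantee for readers who prefer a library routine.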

    Postprocessing

    A standard Laplacian of Gaussian blob detection algorithm, well suited to identifying circular regions of high image intensity at multiple scales,35 was used to localize individual glomeruli from the probability maps. Percent global glomerulosclerosis was computed by the formula 100 × S/N, where S is the number of globally sclerosed glomeruli and N is the total glomeruli count.
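
To make the blob-detection idea concrete, here is a small pure-Python sketch of scale-normalized Laplacian of Gaussian detection on a probability map. It is not the study's implementation: the scales, the threshold, and the function names are all hypothetical, and in practice a library routine such as scikit-image's `blob_log` would normally be used instead.

```python
import math


def log_kernel(sigma):
    """Negated, scale-normalized Laplacian-of-Gaussian kernel.

    Negation makes bright blobs respond positively; the sigma**2
    normalization makes responses comparable across scales.
    """
    radius = int(math.ceil(3 * sigma))
    kernel = {}
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            q = (dx * dx + dy * dy) / (2.0 * sigma * sigma)
            kernel[(dy, dx)] = (1.0 - q) * math.exp(-q) / (math.pi * sigma * sigma)
    mean = sum(kernel.values()) / len(kernel)
    # Subtract the mean so a flat image gives zero response.
    return {offset: value - mean for offset, value in kernel.items()}


def log_response(image, sigma):
    """Filter a 2D list of floats with the kernel (zeros outside the image)."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    kernel = log_kernel(sigma)
    for y in range(h):
        for x in range(w):
            v = image[y][x]
            if v == 0:
                continue  # empty pixels contribute nothing
            for (dy, dx), k in kernel.items():
                yy, xx = y + dy, x + dx
                if 0 <= yy < h and 0 <= xx < w:
                    out[yy][xx] += v * k
    return out


def detect_blobs(image, sigmas, threshold):
    """Return (y, x, sigma, response) peaks, strongest first."""
    h, w = len(image), len(image[0])
    maps = [(s, log_response(image, s)) for s in sigmas]
    # Best (response, sigma) over all scales at each pixel.
    best = [[max((m[y][x], s) for s, m in maps) for x in range(w)] for y in range(h)]
    blobs = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            value, sigma = best[y][x]
            neighbours = [best[y + j][x + i][0]
                          for j in (-1, 0, 1) for i in (-1, 0, 1) if (i, j) != (0, 0)]
            if value > threshold and value > max(neighbours):
                blobs.append((y, x, sigma, value))
    return sorted(blobs, key=lambda b: -b[3])
```

Counting the peaks found on the sclerosed and nonsclerosed probability channels would give S and N − S, from which 100 × S/N follows. For a disk of radius R the response is strongest near sigma = R/√2, which is how the selected scale tracks glomerulus size.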

    Statistical Analysis

    Pixelwise agreement between annotation and prediction probability maps was quantified via the Dice coefficient and the intersection over union metric, computed in aggregate for all pixels in each output label. Glomeruli counts were obtained after blob detection processing on sclerosed and nonsclerosed probability map channels. Percent global glomerulosclerosis was computed from these counts for individual images, and for individual kidneys, by pooling counts for all levels (typically 2) associated with each kidney. Glomeruli counts were compared against annotation ground truth, with accuracy assessed by Pearson correlation coefficient r and root-mean-square error (RMSE). Corresponding quantities for percent global glomerulosclerosis were computed for on-call pathologists’ estimates, and those values were compared with the model’s performance.
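
The slide-level and kidney-level metrics described above can be sketched as follows. The helper names are hypothetical, and representing each level as an `(S, N)` pair is an assumption made for illustration.

```python
import math


def percent_gs(sclerosed, total):
    """Percent global glomerulosclerosis, 100 * S / N."""
    return 100.0 * sclerosed / total if total else float("nan")


def pooled_percent_gs(levels):
    """Pool (S, N) counts over the levels of one kidney, then take the ratio."""
    s = sum(level[0] for level in levels)
    n = sum(level[1] for level in levels)
    return percent_gs(s, n)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def rmse(pred, truth):
    """Root-mean-square error of predictions against ground truth."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred))
```

Pooling sums the counts before dividing, so a level with more glomeruli carries more weight than it would in a simple average of per-level percentages. For example, levels with counts (5, 40) and (9, 60) pool to 14 of 100 glomeruli, whereas the per-level values are 12.5% and 15%.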

    Categorization of kidneys as “acceptable” for transplant or “rejected” was determined at 20% global glomerulosclerosis, a commonly used cut point in current clinical practice based on historical data. An F1 score was computed as a function of correctly discriminating whether a sample was over or under the 20% cut point with respect to ground truth annotations. Cohen κ coefficient (an indicator of agreement between raters) was also computed for the model’s and on-call pathologists’ discrimination at the 20% cut point, as compared with ground truth annotation and with each other.
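
A sketch of the cut point discrimination metrics, written without external libraries, is shown below. Treating a value of exactly 20% as acceptable is an assumption (the text does not say which side the boundary falls on), and the function names are hypothetical.

```python
def over_cut(percent, cut=20.0):
    """True when percent global glomerulosclerosis exceeds the cut point.

    Assumption: a value of exactly 20% counts as acceptable.
    """
    return percent > cut


def f1_score(truth, pred, positive):
    """F1 for one class, from paired ground-truth and predicted labels."""
    tp = sum(t == positive and p == positive for t, p in zip(truth, pred))
    fp = sum(t != positive and p == positive for t, p in zip(truth, pred))
    fn = sum(t == positive and p != positive for t, p in zip(truth, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cohen_kappa(a, b):
    """Cohen's kappa between two raters' label lists."""
    n = len(a)
    labels = set(a) | set(b)
    observed = sum(x == y for x, y in zip(a, b)) / n
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0
```

Calling `f1_score` once with `positive=True` and once with `positive=False` gives the separate scores for kidneys above and below the cut point. Kappa discounts the agreement two raters would reach by chance given how often each assigns every label.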

    Because percent global glomerulosclerosis is naturally expressed as the mean of a beta distribution with parameters S (number of globally sclerosed glomeruli) and N − S (number of nonglobally sclerosed glomeruli), that distribution was used to compute 95% prediction intervals that serve as an indicator of output precision. A 2-sided P < .05 was considered statistically significant. All statistical analyses were conducted from March 2018 to August 2020 with the Python packages scikit-learn, version 0.22.1, and SciPy.stats, version 1.4.1.
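
Written out, the beta model reads as follows. The symbols α, β, X, and F are introduced here for illustration, and the equal-tailed form of the interval is an assumption, since the text does not state how the 95% interval was constructed.

```latex
\begin{align}
X &\sim \mathrm{Beta}(\alpha, \beta), \qquad \alpha = S, \quad \beta = N - S \\
\mathbb{E}[X] &= \frac{\alpha}{\alpha + \beta} = \frac{S}{N} \\
\text{95\% interval} &= \left[\, F^{-1}(0.025),\; F^{-1}(0.975) \,\right]
\end{align}
```

Here F is the cumulative distribution function of X. The mean S/N is the fraction form of the 100 × S/N estimate, and the interval narrows as N grows, which is why pooling levels tightens the estimate. The distribution is defined only when both S and N − S are positive.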

    Results
    Output Visualization

    Predicted image outputs for frozen-section and permanent-section WSIs showed qualitative agreement with target annotation maps (Figure 1). Aggregate Dice coefficients were 0.784 for nonglobally sclerosed glomeruli and 0.600 for globally sclerosed glomeruli; aggregate intersection over union metrics for the same groups were 0.645 for nonglobally sclerosed glomeruli and 0.429 for globally sclerosed glomeruli. Notably, even frozen sections with substantial artifacts showed qualitative visual agreement between annotation ground truth and predictions (example shown in Figure 1A).
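
The two overlap metrics can be computed from paired label maps as sketched below. Flattened label sequences and the function names are assumptions made for illustration.

```python
def dice_and_iou(truth, pred, label):
    """Aggregate Dice coefficient and intersection over union for one label.

    truth and pred are flattened sequences of per-pixel class labels.
    """
    inter = sum(t == label and p == label for t, p in zip(truth, pred))
    n_truth = sum(t == label for t in truth)
    n_pred = sum(p == label for p in pred)
    union = n_truth + n_pred - inter
    dice = 2 * inter / (n_truth + n_pred) if n_truth + n_pred else 1.0
    iou = inter / union if union else 1.0
    return dice, iou


def iou_from_dice(dice):
    """Both metrics use the same counts, so IoU = Dice / (2 - Dice)."""
    return dice / (2 - dice)
```

Applying `iou_from_dice` to the reported Dice values of 0.784 and 0.600 gives about 0.645 and 0.429, matching the reported intersection over union values.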

    Evaluation of Percent Global Glomerulosclerosis Based on Individual Slides

    Cross-validated glomerulosclerosis predictions on individual slides also exhibited correlation with annotations (r = 0.916; 95% CI, 0.886-0.939; and RMSE = 5.631; 95% CI, 4.735-6.517; P < .001) (Figure 2A). Separating the results by slide preparation technique indicated that predictions on frozen sections showed similar correlation with ground truth (r = 0.918; 95% CI, 0.879-0.944; RMSE = 6.20; P < .001) (eFigure 3A in the Supplement), whereas the permanent group showed higher performance (r = 0.940; 95% CI, 0.896-0.965; RMSE = 4.32; P < .001) (eFigure 3D in the Supplement). The total numbers of glomeruli detected by the model are shown in Figure 3A and B, illustrating the correlations of nonglobally sclerosed glomeruli with ground truth (r = 0.955; 95% CI, 0.938-0.967; RMSE = 8.383; P < .001) and globally sclerosed glomeruli with ground truth (r = 0.934; 95% CI, 0.909-0.952; RMSE = 4.718; P < .001). The mean (SD) glomeruli count differences between annotation and prediction were 3.1 (7.8) for nonglobally sclerosed glomeruli and 0.2 (4.7) for globally sclerosed glomeruli. Similar positive results for predicted glomeruli counts were observed when separating slides by treatment (eFigure 2A, B, E, and F in the Supplement).

    Evaluation of Percent Global Glomerulosclerosis Based on Pooled Slides

    Pooling levels improved the model’s glomeruli count performance (Figure 3C and D; eFigure 2C, D, G, and H in the Supplement) as well as glomerulosclerosis correlation with annotations, as shown in Figure 2B (r = 0.933; 95% CI, 0.898-0.956; and RMSE = 5.094; 95% CI, 3.972-6.301; P < .001, for combined frozen and permanent sections), bettering on-call pathologists’ performance on the same cases (r = 0.884; 95% CI, 0.825-0.923; and RMSE = 6.523; 95% CI, 5.191-7.783; P < .001) (Figure 2C). Global glomerulosclerosis error as measured by RMSE was 22% lower for the model than for on-call pathologists. Concordance between the model’s predictions of global glomerulosclerosis for individual and pooled levels is shown in eFigure 4 in the Supplement as a residual with respect to annotation ground truth.

    Evaluation of Kidney Mischaracterization Risk

    Pooled percent global glomerulosclerosis results for the annotations, model predictions, and on-call pathologists were sorted and plotted in order of increasing percent global glomerulosclerosis for all 83 kidneys included in the study, along with corresponding 95% prediction intervals and the 20% cut point for donor organ transplant acceptance or rejection (Figure 4B-F). Because all levels of section are evaluated by the on-call pathologists at the time of biopsy, their results are considered pooled evaluations. Kidneys with prediction intervals overlapping the 20% cut point line are more at risk for erroneous acceptance or rejection if glomeruli counts are incorrectly estimated. The chance of erroneously categorizing a kidney with greater than 20% global glomerulosclerosis is shown in Figure 4A. Using individual slides, the DL model’s projected error rate was 15% lower than for on-call pathologists and nearly identical to ground truth annotations (ie, the ideal case). With pooled levels, the DL model’s projected error rate dropped to 37% lower than that for the on-call pathologists. Similarly, the DL model’s projected error rate for erroneous organ acceptance using individual levels was 21% lower than that for on-call pathologists, and 34% lower when using pooled levels.

    The F1 score and Cohen κ showed similar results. The DL model’s F1 score for individual levels having global glomerulosclerosis below 20% was 0.896, and 0.950 for those individual levels above 20%. These metrics improved when pooling levels to 0.926 for those below 20%, and 0.964 for those above 20%. This compared favorably with the F1 scores for on-call pathologists of 0.852 for those below 20%, and 0.929 for those above 20%. Cohen κ for the model predictions on individual levels with respect to ground truth was 0.847, improving to 0.891 for pooled levels. Cohen κ for on-call pathologists with respect to pooled annotations was lower, at a value of 0.781, and was 0.714 when compared with the model’s pooled level predictions. Concordance between pathologist and model results for pooled level results is shown in eFigure 5 in the Supplement as a residual with respect to ground truth, sorted by ground truth global glomerulosclerosis percentage and total glomeruli count.

    The value of multilevel examination is shown by evaluating the prediction intervals from the beta distribution. An illustration of the beta distribution for a hypothetical biopsy specimen with 15% global glomerulosclerosis is shown in Figure 5A for pools of 1, 2, 3, and 4 levels, assuming each level has 58 observed glomeruli (the mean number for this study). The height of each curve at a given value on the horizontal axis can be interpreted as the relative likelihood of estimating percent global glomerulosclerosis to be that value, given the true distribution of sclerosed and normal glomeruli. The area under the curve thus yields an estimate of the likelihood of obtaining global glomerulosclerosis estimates within the limits of integration. The distribution narrowed with increased pooling. More importantly, the normalized area under the curve beyond the nominal 20% rejection cut point decreased from 14% using only a single level to 2% when pooling 4 levels (Figure 5B), a 7-fold decrease in the chance for incorrectly overestimating global glomerulosclerosis and erroneously discarding what should be a useable organ.
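
    The single-level vs 4-level comparison can be approximated with a short calculation. The Python sketch below assumes that the beta distribution is parameterized directly by the (possibly fractional) numbers of globally sclerosed and nonglobally sclerosed glomeruli, and it integrates the density numerically with Simpson's rule. The function names are illustrative, and this is not the authors' code.

```python
import math


def beta_pdf(x, a, b):
    """Beta(a, b) density; assumes a > 1 and b > 1 so the density is 0 at both ends."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def tail_above(cut, a, b, steps=4000):
    """P(X > cut) for X ~ Beta(a, b), by composite Simpson's rule on [cut, 1]."""
    h = (1.0 - cut) / steps  # steps must be even for Simpson's rule
    total = beta_pdf(cut, a, b) + beta_pdf(1.0, a, b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * beta_pdf(cut + i * h, a, b)
    return total * h / 3


def overestimate_risk(levels, per_level=58, fraction=0.15, cut=0.20):
    """Area beyond the cut point when `levels` levels of `per_level` glomeruli are pooled."""
    n = levels * per_level
    sclerosed = fraction * n
    return tail_above(cut, sclerosed, n - sclerosed)


risks = [overestimate_risk(k) for k in (1, 2, 3, 4)]
for k, r in zip((1, 2, 3, 4), risks):
    print(f"{k} level(s): {100 * r:.1f}% of the distribution lies above 20%")
```

    If the parameterization assumption holds, the printed tail areas should shrink with each added level and land in the neighborhood of the single-level and 4-level percentages quoted above.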

    To further illustrate the benefits of level pooling, glomeruli counts for 1000 randomly selected evaluations of donor biopsy specimens (from the same database as the 83 kidney biopsy specimens used in this study) were used to simulate the effects of level pooling for a large population. The on-call pathologist estimates were used as surrogates for ground truth glomeruli counts, and data pooling was simulated by multiplying the reported counts per level by the number of simulated levels in the pool. Applying the analysis described above to this scenario, the number of erroneous organ discards for every 1000 kidneys would decrease from 31 to 13 by increasing the number of levels evaluated from 1 to 4 (Figure 5C).
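
    The pooling simulation described above can be sketched as a small Monte Carlo experiment. In the Python example below, each case is a pair of reported counts (globally sclerosed glomeruli, total glomeruli), pooling is simulated by multiplying both counts by the number of levels, and an erroneous discard is taken to mean that a kidney at or below the 20% cut point yields an estimate above it. The case list, the function name, the number of draws, and the handling of cases with zero counts are all assumptions made for illustration; the study's database is not reproduced here.

```python
import random

CUT_POINT = 0.20  # nominal rejection cut point from the text


def simulate_erroneous_discards(cases, levels, draws=2000, seed=0):
    """Expected number of usable kidneys whose sampled estimate exceeds the cut point.

    cases: iterable of (sclerosed, total) glomeruli counts for a single level.
    levels: number of pooled levels; counts are multiplied by this factor.
    """
    rng = random.Random(seed)
    expected = 0.0
    for sclerosed, total in cases:
        if total == 0 or sclerosed / total > CUT_POINT:
            continue  # only kidneys at or below the cut point can be erroneously discarded
        a = sclerosed * levels
        b = (total - sclerosed) * levels
        if a == 0 or b == 0:
            continue  # degenerate beta distribution; treated here as carrying no risk
        over = sum(rng.betavariate(a, b) > CUT_POINT for _ in range(draws))
        expected += over / draws
    return expected


# Hypothetical (sclerosed, total) counts; not taken from the study.
hypothetical_cases = [(5, 40), (8, 58), (9, 60), (3, 30), (10, 55)]

single_level = simulate_erroneous_discards(hypothetical_cases, levels=1)
four_levels = simulate_erroneous_discards(hypothetical_cases, levels=4)
print(single_level, four_levels)
```

    Scaling the returned value to a cohort of 1000 kidneys gives a figure comparable in kind to the per-1000 discard counts reported above, although the numbers produced by this toy cohort have no clinical meaning.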

    As a demonstration of the potential clinical workflow with the incorporation of DL techniques, the DL model’s predicted annotations for 25 cases from the study data set were randomly selected (5 each with 0%-5%, 6%-10%, 11%-15%, 16%-20%, and >20% global glomerulosclerosis) and were submitted to a pathologist, who evaluated the histology images with overlaid model-generated glomeruli classifications. The pathologist then corrected any missed or inaccurately labeled glomeruli in a manner and time frame consistent with current clinical practice. The pathologist-amended evaluation was better correlated with ground truth (r = 0.958) and had lower error (RMSE = 4.352) than either the on-call pathologist (r = 0.613; RMSE = 0.898) or the DL model alone (r = 0.847; RMSE = 7.535) (eFigure 7 in the Supplement).

    Discussion

    The DL model produced encouraging results, in both qualitative (visual) and quantitative findings, and recapitulated results described in earlier work by members of our group on a smaller training set.34 The model performed well using either frozen sections or permanent sections, bettering on-call pathologists’ performance. The time for the model to process an individual WSI was approximately 5 minutes, well within the typical constraints of a pathology intraoperative consultation.

    Magnification of counting errors when using a small sample highlights the value gained from pooling results from multiple levels obtained from a single kidney biopsy. The typical thickness of a donor kidney biopsy specimen is 1 mm. The pathologist examines only a representative 5-μm–thick section of this tissue, leaving a substantial portion of the specimen unexamined. Although glomeruli sampled in subsequent sections may not be independent, the slide preparation process can lead to substantial section-to-section variability in global glomerulosclerosis, irrespective of observer variability (eFigure 6 in the Supplement). By evaluating more tissue sections, the effect of this variability can be minimized and the reliability of the biopsy specimen evaluation enhanced. This benefit is clearly observed in the present study for each metric (Figures 2-4), which all exhibited improvement with examination of additional tissue.

    Current standard of care requires evaluation of only 25 glomeruli and 1 to 2 levels of section because more evaluations are not practically achievable by human pathologists in the time-sensitive context of organ transplant. The use of DL techniques to augment human capability in this setting could add vitally needed organs to the donor pool. A potential clinical workflow with the incorporation of DL techniques could be as follows: a specimen arrives in the frozen-section laboratory, where a frozen-section slide is prepared and scanned. The WSI is then uploaded to a secure location for analysis using the DL model. While the DL model is analyzing, the pathologist may log in and review the sample for other pertinent findings. The DL model result would be available within 5 to 10 minutes, presented to the pathologist as a graphical overlay of glomeruli classifications on the histology image, then verified (and amended if necessary) by the pathologist, and incorporated into the report. The actual report would directly interface with the clinical electronic health record.

    Limitations

    There are some limitations to this study. It was a single-center study. Although the WSIs were generated using 2 scanners at 2 institutions, the frozen-section data set was generated entirely at 1 institution, whereas the permanent-section data set was generated at the other. Although a small preliminary data set (n = 17) suggested that model predictions on frozen sections exhibited reasonable correspondence with associated permanent sections and that the model outperformed on-call pathologists on these frozen sections (eFigure 8 in the Supplement), this study did not directly address the separate issue of how closely the frozen sections (as well as pathologists’ evaluations of them) corresponded to permanent sections subsequently acquired and processed from the same biopsy specimen.

    The data set was small compared with other DL studies. However, nearly 8500 glomeruli were examined in total, a relatively high number. The limitation in evaluating greater numbers of cases lies in the time-consuming process of serially annotating WSIs. To further evaluate the robustness of this model, additional studies will be required wherein the model is tested using WSIs generated from additional laboratories and scanners.

    Conclusions

    This prognostic study found better performance for quantifying percent global glomerulosclerosis from WSIs of frozen and of permanent hematoxylin-eosin–stained donor transplant kidney biopsy specimens by a DL model than by on-call board-certified pathologists. Performance was further improved by examining additional tissue sections, a process that is beyond the capacity of pathologists in the time-sensitive nature of evaluating donor biopsy specimens. The results indicated decreased likelihood of mischaracterizing percent global glomerulosclerosis when using the DL model, thereby decreasing the likelihood of inappropriate donor organ discard or of using an organ that is suboptimal. The findings illustrate the substantial gains that could be realized using DL methods in surgical pathology clinical practice.

    Article Information

    Accepted for Publication: November 1, 2020.

    Published: January 20, 2021. doi:10.1001/jamanetworkopen.2020.30939

    Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2021 Marsh JN et al. JAMA Network Open.

    Corresponding Authors: Joseph P. Gaut, MD, PhD (jpgaut@wustl.edu), and S. Joshua Swamidass, MD, PhD (swamidass@wustl.edu), Department of Pathology and Immunology, Washington University School of Medicine in St Louis, Campus Box 8118, 660 S Euclid Avenue, St Louis, MO 63110.

    Author Contributions: Dr Gaut had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.

    Concept and design: Marsh, Swamidass, Gaut.

    Acquisition, analysis, or interpretation of data: All authors.

    Drafting of the manuscript: Marsh, Liu, Gaut.

    Critical revision of the manuscript for important intellectual content: Marsh, Wilson, Swamidass, Gaut.

    Statistical analysis: Marsh.

    Obtained funding: Liu, Swamidass, Gaut.

    Administrative, technical, or material support: Gaut.

    Supervision: Swamidass, Gaut.

    Conflict of Interest Disclosures: Drs Marsh, Swamidass, and Gaut may receive royalty income based on a technology developed by them, evaluated in this study, and licensed by the Washington University to PlatformSTL; they also reported receiving grants from the National Institutes of Health during the conduct of the study. Dr Swamidass has a financial interest in PlatformSTL and may financially benefit if the company is successful in marketing its product that is related to this research. No other disclosures were reported.

    Funding/Support: This study was supported by Mid-America Transplant Foundation award 012017; the National Institute of Diabetes and Digestive and Kidney Diseases grants R41DK120253 and R42DK120253; and the Institute for Informatics (I2) and the Department of Pathology and Immunology at Washington University School of Medicine in St Louis, Missouri.

    Role of the Funder/Sponsor: The funders had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

    References
    1. Tullius SG, Rabb H. Improving the supply and quality of deceased-donor organs for transplantation. N Engl J Med. 2018;378(20):1920-1929. doi:10.1056/NEJMra1507080
    2. Hart A, Smith JM, Skeans MA, et al. OPTN/SRTR 2017 annual data report: kidney. Am J Transplant. 2019;19(suppl 2):19-123. doi:10.1111/ajt.15274
    3. Moeckli B, Sun P, Lazeyras F, et al. Evaluation of donor kidneys prior to transplantation: an update of current and emerging methods. Transpl Int. 2019;32(5):459-469. doi:10.1111/tri.13430
    4. Stewart DE, Klassen DK. Early experience with the new kidney allocation system: a perspective from UNOS. Clin J Am Soc Nephrol. 2017;12(12):2063-2065. doi:10.2215/CJN.06380617
    5. Dare AJ, Pettigrew GJ, Saeb-Parsy K. Preoperative assessment of the deceased-donor kidney: from macroscopic appearance to molecular biomarkers. Transplantation. 2014;97(8):797-807. doi:10.1097/01.TP.0000441361.34103.53
    6. Bosmans J-L, Woestenburg A, Ysebaert DK, et al. Fibrous intimal thickening at implantation as a risk factor for the outcome of cadaveric renal allografts. Transplantation. 2000;69(11):2388-2394. doi:10.1097/00007890-200006150-00030
    7. Escofet X, Osman H, Griffiths DFR, Woydag S, Adam Jurewicz W. The presence of glomerular sclerosis at time zero has a significant impact on function after cadaveric renal transplantation. Transplantation. 2003;75(3):344-346. doi:10.1097/01.TP.0000044361.74625.E7
    8. Randhawa P. Role of donor kidney biopsies in renal transplantation. Transplantation. 2001;71(10):1361-1365. doi:10.1097/00007890-200105270-00001
    9. Serón D, Carrera M, Griño JM, et al. Relationship between donor renal interstitial surface and post-transplant function. Nephrol Dial Transplant. 1993;8(6):539-543. doi:10.1093/ndt/8.6.539
    10. Singh P, Farber JL, Doria C, et al. Peritransplant kidney biopsies: comparison of pathologic interpretations and practice patterns of organ procurement organizations. Clin Transplant. 2012;26(3):E191-E199. doi:10.1111/j.1399-0012.2011.01584.x
    11. Sung RS, Christensen LL, Leichtman AB, et al. Determinants of discard of expanded criteria donor kidneys: impact of biopsy and machine perfusion. Am J Transplant. 2008;8(4):783-792. doi:10.1111/j.1600-6143.2008.02157.x
    12. Wang HJ, Kjellstrand CM, Cockfield SM, Solez K. On the influence of sample size on the prognostic accuracy and reproducibility of renal transplant biopsy. Nephrol Dial Transplant. 1998;13(1):165-172. doi:10.1093/ndt/13.1.165
    13. Husain SA, Chiles MC, Lee S, et al. Characteristics and performance of unilateral kidney transplants from deceased donors. Clin J Am Soc Nephrol. 2018;13(1):118-127. doi:10.2215/CJN.06550617
    14. Messina M, Diena D, Dellepiane S, et al. Long-term outcomes and discard rate of kidneys by decade of extended criteria donor age. Clin J Am Soc Nephrol. 2017;12(2):323-331. doi:10.2215/CJN.06550616
    15. Liapis H, Gaut JP, Klein C, et al; Banff Working Group. Banff histopathological consensus criteria for preimplantation kidney biopsies. Am J Transplant. 2017;17(1):140-150. doi:10.1111/ajt.13929
    16. Azancot MA, Moreso F, Salcedo M, et al. The reproducibility and predictive value on outcome of renal biopsies from expanded criteria donors. Kidney Int. 2014;85(5):1161-1168. doi:10.1038/ki.2013.461
    17. Haas M. Donor kidney biopsies: pathology matters, and so does the pathologist. Kidney Int. 2014;85(5):1016-1019. doi:10.1038/ki.2013.439
    18. Achi HE, Belousova T, Chen L, et al. Automated diagnosis of lymphoma with digital pathology images using deep learning. Ann Clin Lab Sci. 2019;49(2):153-160.
    19. Djuric U, Zadeh G, Aldape K, Diamandis P. Precision histology: how deep learning is poised to revitalize histomorphology for personalized cancer care. NPJ Precis Oncol. 2017;1(1):22. doi:10.1038/s41698-017-0022-1
    20. Esteva A, Kuprel B, Novoa RA, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542(7639):115-118. doi:10.1038/nature21056
    21. Gulshan V, Peng L, Coram M, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA. 2016;316(22):2402-2410. doi:10.1001/jama.2016.17216
    22. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436-444. doi:10.1038/nature14539
    23. Litjens G, Sánchez CI, Timofeeva N, et al. Deep learning as a tool for increased accuracy and efficiency of histopathological diagnosis. Sci Rep. 2016;6:26286. doi:10.1038/srep26286
    24. Mercan E, Mehta S, Bartlett J, Shapiro LG, Weaver DL, Elmore JG. Assessment of machine learning of breast pathology structures for automated differentiation of breast cancer and high-risk proliferative lesions. JAMA Netw Open. 2019;2(8):e198777. doi:10.1001/jamanetworkopen.2019.8777
    25. Serag A, Ion-Margineanu A, Qureshi H, et al. Translational AI and deep learning in diagnostic pathology. Front Med (Lausanne). 2019;6:185. doi:10.3389/fmed.2019.00185
    26. Wang S, Yang DM, Rong R, et al. Artificial intelligence in lung cancer pathology image analysis. Cancers (Basel). 2019;11(11):1673. doi:10.3390/cancers11111673
    27. Bukowy JD, Dayton A, Cloutier D, et al. Region-based convolutional neural nets for localization of glomeruli in trichrome-stained whole kidney sections. J Am Soc Nephrol. 2018;29(8):2081-2088. doi:10.1681/ASN.2017111210
    28. Ginley B, Lutnick B, Jen KY, et al. Computational segmentation and classification of diabetic glomerulosclerosis. J Am Soc Nephrol. 2019;30(10):1953-1967. doi:10.1681/ASN.2018121259
    29. Hermsen M, de Bel T, den Boer M, et al. Deep learning-based histopathologic assessment of kidney tissue. J Am Soc Nephrol. 2019;30(10):1968-1979. doi:10.1681/ASN.2019020144
    30. Kannan S, Morgan LA, Liang B, et al. Segmentation of glomeruli within trichrome images using deep learning. Kidney Int Rep. 2019;4(7):955-962. doi:10.1016/j.ekir.2019.04.008
    31. Bueno G, Fernandez-Carrobles MM, Gonzalez-Lopez L, Deniz O. Glomerulosclerosis identification in whole slide images using semantic segmentation. Comput Methods Programs Biomed. 2020;184:105273. doi:10.1016/j.cmpb.2019.105273
    32. Schindelin J, Arganda-Carreras I, Frise E, et al. Fiji: an open-source platform for biological-image analysis. Nat Methods. 2012;9(7):676-682. doi:10.1038/nmeth.2019
    33. Cornell University. arXiv.org Computer Science. Very deep convolutional networks for large-scale image recognition by Karen Simonyan and Andrew Zisserman. arXiv:1409.1556. Last revised April 10, 2015. Accessed November 22, 2020. https://arxiv.org/abs/1409.1556
    34. Marsh JN, Matlock MK, Kudose S, et al. Deep learning global glomerulosclerosis in transplant kidney frozen sections. IEEE Trans Med Imaging. 2018;37(12):2718-2728. doi:10.1109/TMI.2018.2851150
    35. Lindeberg T. Detecting salient blob-like image structures and their scales with a scale-space primal sketch: a method for focus-of-attention. Int J Comput Vis. 1993;11(3):283-318. doi:10.1007/BF01469346