Development and Validation of a Deep Learning Model to Quantify Interstitial Fibrosis and Tubular Atrophy From Kidney Ultrasonography Images | Nephrology | JAMA Network Open | JAMA Network
Figure.  Overall Analysis Pipeline

The entire process was partitioned into 4 main tasks (green boxes): preprocessing of images, segmentation of kidneys in preprocessed images, feature extraction from masked images, and image classification from feature maps. Subtasks within these main tasks are indicated with italic type. In the feature extraction and image classification phase, a test set of 612 images was generated and was never used in any training. This test set was used for a final independent evaluation of the overall analytical pipeline. US indicates ultrasonography; VGG19, Visual Geometry Group 19; XGBoost, extreme gradient boosting.

Table 1.  Characteristics of the Study Participants
Table 2.  Agreement Among Pathologists’ Independent Evaluation of IFTA Scores on Randomly Selected Subsample of Histopathology Slides
Table 3.  Predictive Performance of the Deep Learning Model to Quantify Interstitial Fibrosis and Tubular Atrophy
Table 4.  Incremental Value of the DL Model to Predict Interstitial Fibrosis and Tubular Atrophy Class at the Level of Individual Patient
1. Centers for Disease Control and Prevention. Chronic Kidney Disease in the United States, 2021. US Department of Health and Human Services, Centers for Disease Control and Prevention; 2021.
2. Bowe B, Xie Y, Li T, et al. Changes in the US burden of chronic kidney disease from 2002 to 2016: an analysis of the Global Burden of Disease Study. JAMA Netw Open. 2018;1(7):e184412. doi:10.1001/jamanetworkopen.2018.4412
3. Risdon RA, Sloper JC, De Wardener HE. Relationship between renal function and histological changes found in renal-biopsy specimens from patients with persistent glomerular nephritis. Lancet. 1968;2(7564):363-366. doi:10.1016/S0140-6736(68)90589-8
4. Nath KA. Tubulointerstitial changes as a major determinant in the progression of renal damage. Am J Kidney Dis. 1992;20(1):1-17. doi:10.1016/S0272-6386(12)80312-X
5. Mise K, Hoshino J, Ueno T, et al. Prognostic value of tubulointerstitial lesions, urinary N-acetyl-β-D-glucosaminidase, and urinary β2-microglobulin in patients with type 2 diabetes and biopsy-proven diabetic nephropathy. Clin J Am Soc Nephrol. 2016;11(4):593-601. doi:10.2215/CJN.04980515
6. Srivastava A, Palsson R, Kaze AD, et al. The prognostic value of histopathologic lesions in native kidney biopsy specimens: results from the Boston Kidney Biopsy Cohort Study. J Am Soc Nephrol. 2018;29(8):2213-2224. doi:10.1681/ASN.2017121260
7. Corapi KM, Chen JL, Balk EM, Gordon CE. Bleeding complications of native kidney biopsy: a systematic review and meta-analysis. Am J Kidney Dis. 2012;60(1):62-73. doi:10.1053/j.ajkd.2012.02.330
8. Athavale A, Kulkarni H, Arslan CD, Hart P. Desmopressin and bleeding risk after percutaneous kidney biopsy. BMC Nephrol. 2019;20(1):413. doi:10.1186/s12882-019-1595-4
9. Foley RN, Collins AJ. End-stage renal disease in the United States: an update from the United States Renal Data System. J Am Soc Nephrol. 2007;18(10):2644-2648. doi:10.1681/ASN.2007020220
10. Hogan JJ, Mocanu M, Berns JS. The native kidney biopsy: update and evidence for best practice. Clin J Am Soc Nephrol. 2016;11(2):354-362. doi:10.2215/CJN.05750515
11. Gonzalez Suarez ML, Thomas DB, Barisoni L, Fornoni A. Diabetic nephropathy: is it time yet for routine kidney biopsy? World J Diabetes. 2013;4(6):245-255. doi:10.4239/wjd.v4.i6.245
12. Moghazi S, Jones E, Schroepple J, et al. Correlation of renal histopathology with sonographic findings. Kidney Int. 2005;67(4):1515-1520. doi:10.1111/j.1523-1755.2005.00230.x
13. Friedrich-Rust M, Ong MF, Martens S, et al. Performance of transient elastography for the staging of liver fibrosis: a meta-analysis. Gastroenterology. 2008;134(4):960-974. doi:10.1053/j.gastro.2008.01.034
14. Lakhani P, Sundaram B. Deep learning at chest radiography: automated classification of pulmonary tuberculosis by using convolutional neural networks. Radiology. 2017;284(2):574-582. doi:10.1148/radiol.2017162326
15. Milea D, Najjar RP, Zhubo J, et al; BONSAI Group. Artificial intelligence to detect papilledema from ocular fundus photographs. N Engl J Med. 2020;382(18):1687-1695. doi:10.1056/NEJMoa1917130
16. Ko H, Chung H, Kim KW, et al. COVID-19 pneumonia diagnosis using a simple 2D deep learning framework with a single chest CT image: model development and validation. J Med Internet Res. 2020;22(6):e19569. doi:10.2196/19569
17. Farris AB, Adams CD, Brousaides N, et al. Morphometric and visual evaluation of fibrosis in renal biopsies. J Am Soc Nephrol. 2011;22(1):176-186. doi:10.1681/ASN.2009091005
18. Anaconda Inc. Anaconda software. Published 2016. Accessed April 9, 2021. https://anaconda.com
19. Paszke A, Gross S, Massa F, et al. PyTorch: an imperative style, high-performance deep learning library. In: Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R, eds. Advances in Neural Information Processing Systems. Curran Associates, Inc; 2019:8024-8035.
20. Crimmins TR. Geometric filter for speckle reduction. Appl Opt. 1985;24(10):1438. doi:10.1364/AO.24.001438
21. Antico M, Sasazawa F, Dunnhofer M, et al. Deep learning-based femoral cartilage automatic segmentation in ultrasound imaging for guidance in robotic knee arthroscopy. Ultrasound Med Biol. 2020;46(2):422-435. doi:10.1016/j.ultrasmedbio.2019.10.015
22. Manto M, Dupre N, Hadjivassiliou M, et al. Management of patients with cerebellar ataxia during the COVID-19 pandemic: current concerns and future implications. Cerebellum. 2020;19(4):562-568. doi:10.1007/s12311-020-01139-1
23. Park H, Lee HJ, Kim HG, et al. Endometrium segmentation on transvaginal ultrasound image using key-point discriminator. Med Phys. 2019;46(9):3974-3984. doi:10.1002/mp.13677
24. Yang J, Faraji M, Basu A. Robust segmentation of arterial walls in intravascular ultrasound images using Dual Path U-Net. Ultrasonics. 2019;96:24-33. doi:10.1016/j.ultras.2019.03.014
25. [No authors listed.] Implementation of U-Net architecture using Pytorch. Accessed April 9, 2021. https://github.com/jakeoung/Unet_pytorch
26. [No authors listed.] labelme: image polygonal annotation with Python. Published 2016. Accessed April 9, 2021. https://github.com/wkentaro/labelme
27. Bradski G. The OpenCV library. Dr Dobb's: the world of software development. Published November 1, 2000. Accessed April 9, 2021. https://www.drdobbs.com/open-source/the-opencv-library/184404319#
28. [No authors listed.] Pytorch-IntermediateLayerGetter. Accessed April 9, 2021. https://github.com/sebamenabar/Pytorch-IntermediateLayerGetter
29. [No authors listed.] Pack R package tarball with pre-built xgboost.so (with GPU support). Accessed April 9, 2021. https://github.com/dmlc/xgboost
30. Cohen EP, Olson JD, Tooze JA, Bourland JD, Dugan GO, Cline JM. Detection and quantification of renal fibrosis by computerized tomography. PLoS One. 2020;15(2):e0228626. doi:10.1371/journal.pone.0228626
31. Jiang K, Ferguson CM, Ebrahimi B, et al. Noninvasive assessment of renal fibrosis with magnetization transfer MR imaging: validation and evaluation in murine renal artery stenosis. Radiology. 2017;283(1):77-86. doi:10.1148/radiol.2016160566
32. Terrault NA, Lok ASF, McMahon BJ, et al. Update on prevention, diagnosis, and treatment of chronic hepatitis B: AASLD 2018 hepatitis B guidance. Hepatology. 2018;67(4):1560-1599. doi:10.1002/hep.29800
33. Panel AIHG, Chung RT, Davis GL, et al; AASLD/IDSA HCV Guidance Panel. Hepatitis C guidance: AASLD-IDSA recommendations for testing, managing, and treating adults infected with hepatitis C virus. Hepatology. 2015;62(3):932-954. doi:10.1002/hep.27950
    Original Investigation
    Health Informatics
    May 24, 2021

    Development and Validation of a Deep Learning Model to Quantify Interstitial Fibrosis and Tubular Atrophy From Kidney Ultrasonography Images

    Author Affiliations
    • 1Division of Nephrology, Department of Medicine, Cook County Health, Chicago, Illinois
    • 2Department of Pathology, Rush University Medical Center, Chicago, Illinois
    • 3Department of Pathology, University of Illinois at Chicago, Chicago
    • 4Division of Nephrology, University of Illinois at Chicago, Chicago
    • 5Department of Pathology, Johns Hopkins University, Baltimore, Maryland
    • 6M&H Research, LLC, San Antonio, Texas
    JAMA Netw Open. 2021;4(5):e2111176. doi:10.1001/jamanetworkopen.2021.11176
    Key Points

    Question  Can deep learning methods be used to quantify the degree of interstitial fibrosis and tubular atrophy from noninvasively assessed ultrasonography images?

    Findings  In a diagnostic evaluation of 6135 Crimmins-filtered ultrasonography images, kidney segmentation using the UNet architecture, a convolutional neural network–based feature extractor, and extreme gradient boosting was able to quantify interstitial fibrosis and tubular atrophy grade with 90% accuracy independently of baseline clinical characteristics.

    Meaning  Noninvasive, ultrasonography-based prediction of interstitial fibrosis and tubular atrophy was accurate and can be used as a first-line investigation in evaluation of kidney disease.

    Abstract

    Importance  Interstitial fibrosis and tubular atrophy (IFTA) is a strong indicator of decline in kidney function and is measured using histopathological assessment of kidney biopsy core. At present, a noninvasive test to assess IFTA is not available.

    Objective  To develop and validate a deep learning (DL) algorithm to quantify IFTA from kidney ultrasonography images.

    Design, Setting, and Participants  This was a single-center diagnostic study of consecutive patients who underwent native kidney biopsy at John H. Stroger Jr. Hospital of Cook County, Chicago, Illinois, between January 1, 2014, and December 31, 2018. A DL algorithm was trained, validated, and tested to classify IFTA from kidney ultrasonography images. Of 6135 Crimmins-filtered ultrasonography images, 5523 were used for training (5122 images) and validation (401 images), and 612 were used to test the accuracy of the DL system. Kidney segmentation was performed using the UNet architecture, and classification was performed using a convolutional neural network–based feature extractor and extreme gradient boosting. IFTA scored by a nephropathologist on trichrome-stained kidney biopsy slides was used as the reference standard. IFTA was divided into 4 grades (grade 1, 0%-24%; grade 2, 25%-49%; grade 3, 50%-74%; and grade 4, 75%-100%). Data analysis was performed from December 2019 to May 2020.

    Main Outcomes and Measures  Prediction of IFTA grade was measured using the metrics precision, recall, accuracy, and F1 score.

    Results  This study included 352 patients (mean [SD] age, 47.43 [14.37] years), of whom 193 (54.82%) were women. There were 159 patients with IFTA grade 1 (2701 ultrasonography images), 74 patients with IFTA grade 2 (1239 ultrasonography images), 41 patients with IFTA grade 3 (701 ultrasonography images), and 78 patients with IFTA grade 4 (1494 ultrasonography images). Kidney ultrasonography images were segmented with 91% accuracy. In the independent test set, the point estimates for performance metrics showed precision of 0.8927 (95% CI, 0.8682-0.9172), recall of 0.8037 (95% CI, 0.7722-0.8352), accuracy of 0.8675 (95% CI, 0.8406-0.8944), and an F1 score of 0.8389 (95% CI, 0.8098-0.8680) at the image level. Corresponding estimates at the patient level were precision of 0.9003 (95% CI, 0.8644-0.9362), recall of 0.8421 (95% CI, 0.7984-0.8858), accuracy of 0.8955 (95% CI, 0.8589-0.9321), and an F1 score of 0.8639 (95% CI, 0.8228-0.9049). Accuracy at the patient level was highest for IFTA grade 1 and IFTA grade 4. The accuracy (approximately 90%) remained high irrespective of the timing of ultrasonography studies and the biopsy diagnosis. The predictive performance of the DL system did not show significant improvement when combined with baseline clinical characteristics.

    Conclusions and Relevance  These findings suggest that a DL algorithm can accurately and independently predict IFTA from kidney ultrasonography images.

    Introduction

    Chronic kidney disease (CKD) affects 15% of the adult population in the US and has contributed to a 52% increase in cost burden from 2002 to 2016.1,2 A key pathophysiological indicator of CKD is interstitial fibrosis and tubular atrophy (IFTA), which is associated with estimated glomerular filtration rate (eGFR),3 future decline in kidney function, and development of kidney failure.4 Furthermore, IFTA incrementally improves the value of baseline proteinuria and eGFR to predict clinical outcomes of CKD5 irrespective of the underlying causes.5,6 Currently, the challenge is to have an accurate, noninvasive method to quantify IFTA because histopathological grading of kidney biopsy core by a nephropathologist is the only accepted method to quantify IFTA. In addition to being invasive, kidney biopsy is associated with bleeding complications, provides only a snapshot of IFTA, and is subject to sampling error.7,8 Consequently and importantly, most patients with CKD who will eventually need kidney replacement therapy9 never undergo a kidney biopsy10,11 and represent a missed clinical opportunity.

    Kidney ultrasonography is a routinely performed, noninvasive test for evaluation of kidney disease. Certain features on ultrasonography, such as echogenicity, kidney length, and corticomedullary differentiation, have been found to be associated with IFTA; however, these features are unable to provide a quantifiable estimate of IFTA.12 Unlike FibroScan, which provides a quantification of fibrosis in liver,13 there is no imaging modality that can provide an accurate estimate of IFTA in the kidney in routine clinical practice.

    We hypothesized that ingrained within the ultrasonographic features are subtle signs of IFTA that can be quantitatively extracted and analyzed. Artificial intelligence and deep learning (DL) are being increasingly used in diagnosis and prognosis of various medical conditions.14-16 Because DL can map complex feature relationships, we trained, validated, and tested a DL system to quantify IFTA using kidney ultrasonography images.

    Methods
    Patient Selection

    This diagnostic study was approved by the institutional review board at Cook County Health with waiver of informed consent because deidentified data and images were used, in accordance with 45 CFR §46. Reporting of the results follows the Standards for Reporting of Diagnostic Accuracy (STARD) reporting guideline.

    Consecutive patients undergoing native kidney biopsy under real-time ultrasonography guidance between January 1, 2014, and December 31, 2018, at the John H. Stroger Jr. Hospital of Cook County, Chicago, Illinois, were included in the study. Allograft biopsies and patients for whom ultrasonography images were not available or IFTA grades were not available (15 patients) were excluded. Clinical and demographic information of patients included in this study was obtained by medical record review.

    Ground Truth Quantification of IFTA

    Quantifying IFTA on Masson trichrome–stained kidney biopsy slide is the current standard of care.17 The percentage of cortex with IFTA was scored as 0% to 24%, 25% to 49%, 50% to 74%, and 75% to 100% of the cortex sampled. One nephropathologist (D.C.) provided IFTA scores from each trichrome-stained histopathological slide of the kidney biopsy core. To validate the methods of the pathologist (D.C.) in grading IFTA, a second nephropathologist (T.P.) also provided IFTA scores for a random sample of 93 whole slide images in a blinded fashion, and agreement between the 2 nephropathologists was evaluated.
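
    The 4-grade scheme described above can be written as a small lookup. The following is a minimal sketch: the function name ifta_grade is hypothetical, and the treatment of fractional percentages (eg, 24.5% falling in grade 1) is an assumption, because the text lists only whole-number bin edges.

```python
def ifta_grade(percent_cortex: float) -> int:
    """Map the percentage of cortex with IFTA to grade 1-4 (bins from the text)."""
    if not 0 <= percent_cortex <= 100:
        raise ValueError("percentage must be between 0 and 100")
    if percent_cortex < 25:   # 0%-24%
        return 1
    if percent_cortex < 50:   # 25%-49%
        return 2
    if percent_cortex < 75:   # 50%-74%
        return 3
    return 4                  # 75%-100%


# Bin edges taken from the grading scheme
edge_grades = [ifta_grade(p) for p in (0, 24, 25, 49, 50, 74, 75, 100)]
```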

    Kidney Ultrasonography Images

    Longitudinal ultrasonography images from both kidneys obtained between 6 months before and 2 weeks after kidney biopsy (including images obtained during the kidney biopsy) were included in the study. A total of 6602 ultrasonography images were deidentified and stored in the JPEG format.

    Development of DL Classification System

    Development of the DL model involved 4 independent steps: (1) preprocessing of ultrasonography images, (2) kidney segmentation, (3) feature extraction, and (4) image classification with internal and independent validation (Figure). All scripts were written in Python software version 3.7 (Python) within an Anaconda software version 3.0 (Anaconda, Inc) environment18 and used the PyTorch platform.19 Jupyter notebooks with codes and outputs are available from the authors upon reasonable request.

    Preprocessing of Ultrasonography Images

    Ultrasonography images included in the study were resized to 224 × 224 pixels, the input dimension required for many popular DL models. A Crimmins filter20 (also called the geometric filter) is most suited to reduce background noise and backscatter in ultrasonography images and was applied to each image (eFigure 1 in the Supplement).

    Kidney Segmentation

    A kidney ultrasonography image includes structures in addition to the kidney, such as muscle, adipose tissue, liver, spleen, and bowel. For the current study, it was important that the training images focused on the kidney while eliminating other structures. Thus, we first trained and validated a DL model to generate segmented (masked) ultrasonography images. We chose the UNet architecture (eFigure 2 in the Supplement) because it is suited for ultrasonography images.21-24 We used a pretrained, publicly available model25 and retrained it for kidney segmentation. For this, we randomly selected a subset of 600 ultrasonography images (Figure) and manually labeled the kidney in these images using the labelme software.26 The selected ultrasonography images were further randomly split into a training set of 500 images and a validation set of 100 images. The UNet model was trained by minimizing the mean squared error (L2) loss, and the trained model was used to generate a mask for each of the 6602 preprocessed images. We used the intersection-over-union (IoU) metric to measure the accuracy of segmentation in the validation set only. A masked image (ie, an image with everything other than the kidney blacked out) was obtained from each original image and its UNet-generated mask using the multiply function of the OpenCV Python library.27 Of 6602 images, 6135 (93%) images had adequate masks (IoU >90%) and were used for subsequent analyses.
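
    The two operations named in this step, the IoU metric and element-wise masking, can be illustrated on toy data. This is a pure-Python sketch rather than the study code: the text states that the OpenCV multiply function was used, whereas apply_mask below is a hypothetical stand-in operating on nested lists, and the 3 × 3 arrays are invented for illustration.

```python
def iou(pred, truth):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    inter = union = 0
    for row_p, row_t in zip(pred, truth):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly by convention (an assumption)
    return inter / union if union else 1.0


def apply_mask(image, mask):
    """Element-wise product: pixels outside the mask become 0 (black)."""
    return [[px * m for px, m in zip(row_i, row_m)]
            for row_i, row_m in zip(image, mask)]


image = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]   # toy grayscale image
truth = [[0, 1, 1], [0, 1, 1], [0, 0, 0]]            # manually labeled kidney
pred = [[0, 1, 1], [0, 1, 0], [0, 0, 0]]             # model-generated mask

score = iou(pred, truth)          # 3 shared pixels over 4 pixels in the union
masked = apply_mask(image, pred)
```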

    Feature Extraction

    For feature extraction, we applied transfer learning to a pretrained convolutional neural network, Visual Geometry Group 19 (VGG-19) with batch normalization (BN). This model (eFigure 3A in the Supplement) comprises an initial feature extractor component followed by a classifier component. For our purposes, we used the feature extractor component only. To adapt it to fibrosis assessment on kidney ultrasonography, we used IFTA grades as the final output and tuned the VGG-19 BN model using a categorical cross-entropy cost function. The final 7 × 7 × 512 output was then flattened into a vector of 25 088 features and further compressed (using a fully connected layer) into a 1024-length feature vector, as shown in eFigure 3A in the Supplement. From the trained model, we extracted 1024 features (from the Fc2 layer) using the IntermediateLayerGetter Python library.28 Training of the feature extractor was done on a randomly selected subset of 90% of the images (5523 images) (Figure). The remaining 612 images were retained as an independent test set. After training, the tuned VGG-19 BN model was used to extract features from all images into a 6135 × 1024 matrix. This matrix, along with the associated class labels, image identifiers, and training or test membership information, was used for subsequent image classification.
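
    The dimensions quoted in this step follow from simple bookkeeping, sketched below. The sketch assumes the standard VGG-19 layout (5 max-pooling stages, each halving height and width, and 512 channels in the last convolutional block) was left unchanged; the constant names are illustrative only.

```python
INPUT_SIDE = 224        # side of the resized image, from the text
NUM_POOL_STAGES = 5     # standard VGG-19 layout (assumed unchanged here)
FINAL_CHANNELS = 512    # channels in the last convolutional block
N_IMAGES = 6135         # images with adequate masks
N_FEATURES = 1024       # length of the extracted feature vector

side = INPUT_SIDE
for _ in range(NUM_POOL_STAGES):
    side //= 2          # each 2 x 2 max-pool with stride 2 halves the side

flattened = side * side * FINAL_CHANNELS        # length of the flattened vector
feature_matrix_shape = (N_IMAGES, N_FEATURES)   # one row of features per image
n_test = N_IMAGES - 5523                        # images held out for testing
```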

    Image Classification

    Image classification was done using extreme gradient boosting (XGBoost, using the xgboost Python library).29 During this step, we retained the 612 images as an independent test set (Figure). The remaining images were randomly split into a training set of 5122 images and a validation set of 401 images. We used the 1024 features extracted in the feature extraction step as input to XGBoost algorithm and the ground truth IFTA grades as output for training the DL algorithm. Multiclass log loss was used for optimization. Grid search was used for finding the optimum hyperparameters, and the best predicting model (one with the least multiclass error) was used to predict the IFTA grades both in the validation set of 401 images and in the independent test set of 612 images.
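
    For readers unfamiliar with the optimization target, multiclass log loss is the mean negative log-probability that the classifier assigns to the true class. The sketch below is a plain-Python illustration, not the xgboost implementation; the function name, the clipping constant eps, and the toy probabilities are assumptions made for the example.

```python
import math


def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: 0-based class indices; probs: per-sample class probabilities.
    """
    total = 0.0
    for label, p in zip(y_true, probs):
        p_true = min(max(p[label], eps), 1 - eps)   # clip to avoid log(0)
        total -= math.log(p_true)
    return total / len(y_true)


y_true = [0, 3, 1]                                   # toy labels for 4 grades
confident = [[0.90, 0.05, 0.03, 0.02],
             [0.02, 0.03, 0.05, 0.90],
             [0.05, 0.90, 0.03, 0.02]]
uniform = [[0.25] * 4 for _ in y_true]               # uninformative classifier

loss_confident = multiclass_log_loss(y_true, confident)
loss_uniform = multiclass_log_loss(y_true, uniform)
```

    A classifier that spreads probability evenly over the 4 grades scores ln 4, so any useful model should fall well below that value.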

    Statistical Analysis

    Descriptive statistics included mean (SD) for continuous variables and proportions for categorical variables. Statistical significance for distribution across grades of IFTA was assessed using 1-way analysis of variance for continuous variables and a 2-sided Pearson χ2 test for categorical variables. Agreement between pathologists’ grading of IFTA was assessed using weighted Cohen κ values (with squared weights). Performance metrics for the image classification task were precision (synonymous with positive predictive value as used in epidemiology), recall (synonymous with sensitivity), accuracy, and F1 score (which was estimated as the harmonic mean of precision and recall).
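
    The 4 performance metrics can be computed from a confusion matrix as sketched below. The text does not state how per-class values were combined for the 4-grade problem, so the macro-averaging used here is an assumption, and the 2 × 2 matrix is toy data.

```python
def macro_metrics(cm):
    """cm[i][j] = number of items with true class i predicted as class j."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    precisions, recalls, f1s = [], [], []
    for c in range(k):
        tp = cm[c][c]
        predicted_c = sum(cm[r][c] for r in range(k))   # column total
        actual_c = sum(cm[c])                           # row total
        p = tp / predicted_c if predicted_c else 0.0
        r = tp / actual_c if actual_c else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0    # harmonic mean
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    return {
        "precision": sum(precisions) / k,
        "recall": sum(recalls) / k,
        "accuracy": sum(cm[c][c] for c in range(k)) / total,
        "f1": sum(f1s) / k,
    }


cm = [[8, 2],
      [1, 9]]   # toy counts, not study data
metrics = macro_metrics(cm)
```

    With macro-averaging, the averaged F1 score generally differs from the harmonic mean of the averaged precision and averaged recall, because the harmonic mean is taken within each class before averaging.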

    Because clinical characteristics such as age, diabetes, hypertension, and the eGFR (derived using the Modification of Diet in Renal Disease equation) are associated with the IFTA grade, we examined whether the combination of the clinical characteristics with DL predictions improved the prediction. For this, we ran a baseline multinomial logistic model that predicted the observed (histopathological) IFTA class using the DL-predicted IFTA class as the independent variable. In the next nested model, we added age, sex, hypertension, diabetes, body mass index, and eGFR as the covariates. Incremental predictive performance of the clinical predictors over that of DL prediction was assessed by comparing likelihood ratio χ2, pseudo R2, and Brier score. To make the alternative model more robust, we also evaluated whether using powerful machine learning algorithms can further improve the IFTA predictions obtained by combining DL predictions with clinical characteristics. For this, we used the package CMA in R statistical software version 4.0.2 (R Project for Statistical Computing) and evaluated the following machine learning methods: component-wise boosting, linear discriminant analysis, diagonal discriminant analysis, partial least squares combined with linear discriminant analysis, feed forward neural network, random forest, and support vector machines. All statistical analyses were conducted in Stata statistical software version 12.0 (StataCorp). A global type I error rate of .05 was used to test statistical significance. Data analysis was performed from December 2019 to May 2020.

    Results
    Study Participants and Ultrasonography Images

    A total of 367 kidney biopsies were performed in the study period; information on degree of IFTA and concurrent ultrasonography images were available for 352 biopsies (96%). Of the 352 patients (mean [SD] age, 47.43 [14.37] years), 193 (54.82%) were women. Clinical and demographic characteristics of these patients are shown in Table 1. Numbers of patients assigned to different IFTA grades were as follows: grade 1, 159 patients (45.17%; 2701 ultrasonography images); grade 2, 74 patients (21.02%; 1239 ultrasonography images); grade 3, 41 patients (11.65%; 701 ultrasonography images); and grade 4, 78 patients (22.16%; 1494 ultrasonography images). IFTA grade increased with age, presence of diabetes, hypertension, and increased serum creatinine level. For the 352 biopsies included in the study, a total of 6135 ultrasonography images had adequate masks (Figure) and were used to train and test the DL algorithm.

    Agreement Between Pathologists’ IFTA Scores

    Overall, there was excellent agreement between the 2 pathologists for IFTA classification (Cohen κ, 0.84) (Table 2), except for IFTA grade 3 (50%-74% IFTA score). Thus, we proceeded with the ensuing analyses using grades assigned by the first nephropathologist (D.C.), who had graded all the histopathology slides, as the ground truth labels for IFTA grades.
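
    A weighted Cohen κ with squared weights penalizes a disagreement by the square of the distance between the 2 assigned grades. The sketch below shows one common formulation; the function name and the 2 × 2 counts are hypothetical and are not the data in Table 2.

```python
def weighted_kappa(table):
    """Cohen kappa with squared disagreement weights.

    table[i][j] = number of slides graded i by rater A and j by rater B.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    observed = expected = 0.0
    for i in range(k):
        for j in range(k):
            # Weight is 0 on the diagonal and 1 at maximum disagreement
            w = (i - j) ** 2 / (k - 1) ** 2
            observed += w * table[i][j] / n
            expected += w * row_tot[i] * col_tot[j] / n ** 2
    if expected == 0:          # both raters used a single category
        return 1.0
    return 1.0 - observed / expected


toy = [[20, 5],
       [10, 15]]               # hypothetical counts
kappa_toy = weighted_kappa(toy)
kappa_perfect = weighted_kappa([[5, 0, 0, 0], [0, 6, 0, 0],
                                [0, 0, 7, 0], [0, 0, 0, 8]])
```

    With only 2 categories the weighted and unweighted statistics coincide, which makes the toy table easy to check by hand.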

    Preprocessing, Kidney Segmentation, and Feature Extraction

    When the Crimmins-filtered, smoothed images (eFigure 1 in the Supplement) were used for training a UNet model for kidney segmentation (eFigure 2A in the Supplement), the network needed only 4 epochs to provide the best estimate of IoU with a rapidly decreasing loss (best IoU = 0.91, or 91% accuracy) (eFigure 2B in the Supplement). We then subjected all the preprocessed images to this tuned UNet model. We inspected the resulting images and their masks (eFigure 2C in the Supplement) manually and found that in poorly segmented images the proportion of the mask to the entire image was less than 0.05. We thus excluded these 256 images and retained a set of 6346 that related to the entire set of 367 patients. After further excluding images from the 15 patients for whom IFTA classes were not available, the final set of 6135 ultrasonography images was used for feature extraction. Of these images, 5523 were used for training the feature extractor. The training of the feature extractor was consistent, gradual, and reasonably smooth, as shown by the decreasing loss function (eFigure 3B in the Supplement). Using the tuned model, we generated the feature map for all the 6135 masked images as shown in eFigure 3C in the Supplement.
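
    The exclusion rule described here, dropping images whose mask covers less than 0.05 of the image, can be sketched as follows. The helper names and toy masks are hypothetical, and treating a mask at exactly 0.05 as adequate is an assumption.

```python
def mask_fraction(mask):
    """Share of pixels marked as kidney in a binary mask (nested lists of 0/1)."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


def is_adequate(mask, min_fraction=0.05):
    """Masks covering less than min_fraction of the image count as poor."""
    return mask_fraction(mask) >= min_fraction


# Toy 10 x 10 masks
tiny = [[0] * 10 for _ in range(10)]
tiny[0][0] = 1                                    # 1% coverage
good = [[1 if 2 <= r < 8 and 2 <= c < 8 else 0    # 6 x 6 block, 36% coverage
         for c in range(10)] for r in range(10)]

remaining = 6602 - 256    # images left after excluding poor masks
```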

    Image Classification

    The distribution of IFTA grades was similar in the training, validation, and test sets (eFigure 4 in the Supplement). We then trained an XGBoost classifier for image classification. An exhaustive grid search yielded an optimal classification solution with the following set of hyperparameters: learning rate (eta), 0.01; maximum tree depth, 16; subsample fraction, 0.5; and severe L2 regularization penalty (lambda), 10. The decrement in loss function was monotonic and smooth in both training (5122 images) and validation (401 images) sets, as shown in eFigure 5A in the Supplement. Concordantly, the multiclass labeling accuracy consistently increased in both sets (eFigure 5B in the Supplement), implying acceptable fit to the data. When this model was evaluated in the validation set, we found that the confusion matrix (eFigure 6A in the Supplement) was dense along the diagonal and yielded the following performance metrics (Table 3): precision, 0.8936; recall, 0.7646; accuracy, 0.8429; and F1 score, 0.8054. To further demonstrate the robustness of this approach, the image classifier was evaluated in the independent test set. We observed a very similar performance in this test set (eFigure 5B in the Supplement) with precision of 0.8927 (95% CI, 0.8682-0.9172), recall of 0.8037 (95% CI, 0.7722-0.8352), accuracy of 0.8675 (95% CI, 0.8406-0.8944), and an F1 score of 0.8389 (95% CI, 0.8098-0.8680) (Table 3). A closer look at the confusion matrices (eFigures 6A and 6B in the Supplement) showed that the accuracy of prediction was highest for IFTA grade 1 (almost perfect) and IFTA grade 4 (0.81 and 0.82 in the validation and test sets, respectively).
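
    The text does not state how the 95% CIs were computed. As an inference from the reported numbers only, the test-set intervals are consistent with a normal-approximation (Wald) interval using the 612 test images as the sample size; the sketch below illustrates that calculation, and wald_ci is a hypothetical helper.

```python
import math


def wald_ci(p, n, z=1.96):
    """Normal-approximation 95% CI for a proportion p estimated from n items."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


N_TEST = 612   # size of the independent test set

precision_ci = wald_ci(0.8927, N_TEST)
recall_ci = wald_ci(0.8037, N_TEST)
accuracy_ci = wald_ci(0.8675, N_TEST)
f1_ci = wald_ci(0.8389, N_TEST)
```

    Each pair agrees with the corresponding reported interval to within rounding.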

    Sensitivity Analyses

    We conducted sensitivity analyses of the image-level predictions in the test set regarding their temporal proximity to the date of kidney biopsy (eTable 1 in the Supplement). The predictive accuracy was better with formal ultrasonography images (0.8986) compared with ultrasonography images obtained during the biopsy (0.8543). However, irrespective of the timing of the ultrasonography studies, the predictive performance of the DL model was consistently high.

    The comparative performance of the DL model in the subset of patients with 1 of the top 3 most common biopsy diagnoses is shown in eTable 2 in the Supplement. The highest precision of 0.9590 (and lowest recall of 0.7369) was observed for patients with lupus nephritis. In contrast, for the ultrasonography images of patients with diabetic nephropathy, the precision and recall were lowest (0.7673) and highest (0.8385), respectively. In patients with focal segmental glomerulosclerosis, the DL model performance was in between that for patients with lupus nephritis and diabetic nephropathy.

    Classification Performance at the Patient Level

    For patient-level IFTA prediction, when multiple images for a patient were available, the highest IFTA class assigned by the DL model was considered as the predicted class for that patient. The performance metrics at the patient level (eFigure 6C in the Supplement and Table 3) showed improved precision (0.9003; 95% CI, 0.8644-0.9362), recall (0.8421; 95% CI, 0.7984-0.8858), accuracy (0.8955; 95% CI, 0.8589-0.9321), and F1 score (0.8639; 95% CI, 0.8228-0.9049) compared with corresponding metrics at the image level. Notably, the mean (SE) of eGFR on the day of biopsy was 83.2 (4.6) mL/min for predicted class 1, 39.5 (4.1) mL/min for predicted class 2, 30.3 (2.7) mL/min for predicted class 3, and 20.7 (1.7) mL/min for predicted class 4, demonstrating a significant dose-response relationship (regression coefficient, −21.10; P < .001).
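
    The aggregation rule stated above (the highest image-level class becomes the patient-level class) amounts to a grouped maximum. A minimal sketch with hypothetical patient identifiers and predictions:

```python
from collections import defaultdict


def patient_level_predictions(image_predictions):
    """Collapse (patient_id, predicted_grade) pairs to one grade per patient."""
    by_patient = defaultdict(list)
    for patient_id, grade in image_predictions:
        by_patient[patient_id].append(grade)
    # The highest grade predicted for any image of a patient wins
    return {pid: max(grades) for pid, grades in by_patient.items()}


image_preds = [("P01", 1), ("P01", 2), ("P01", 1),
               ("P02", 4),
               ("P03", 3), ("P03", 3)]
patient_preds = patient_level_predictions(image_preds)
```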

    Incremental Predictive Value of the DL Model

    We investigated whether the addition of clinical characteristics (those that were significantly associated with IFTA class in Table 1) to DL model predictions could further improve the prediction of IFTA at the level of the patient. A comparison of the baseline (with only DL prediction as the independent variable) and the alternative (with clinical characteristics as additional covariates) multinomial logistic regression models (Table 4 and eTable 3 in the Supplement) showed that although the overall likelihood ratio χ2 improved significantly, from 341.58 to 395.41 with 18 excess df (P < .001), as did the pseudo R2 (improving from 0.5044 to 0.5839), the other prediction metrics (precision, recall, accuracy, and F1 score) remained comparable, with overlapping 95% CIs. Using robust machine learning models also did not provide better prediction (eTable 4 in the Supplement).
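
    The nested-model comparison can be checked from the 2 reported likelihood ratio χ2 values: their difference is itself a likelihood ratio statistic with 18 df. The sketch below uses a closed form for the χ2 upper tail that holds only for even df; whether the reported P value was obtained in exactly this way is an assumption.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable (even df only)."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term = total = 1.0
    for i in range(1, df // 2):
        term *= half / i          # builds (x/2)^i / i!
        total += term
    return math.exp(-half) * total


lr_difference = 395.41 - 341.58   # alternative model minus baseline model
p_value = chi2_sf_even_df(lr_difference, 18)
```

    The resulting tail probability is far below .001, in line with the reported significance of the improvement in model fit.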

    Discussion

    We developed, validated, and tested a DL algorithm to predict IFTA (a histopathology-based classification) from ultrasonography images of the kidney. To our knowledge, no such model currently exists, although similar attempts using computerized tomography30 and magnetic resonance31 images have been made. Our prediction system capitalizes on ultrasonography imaging, which is done in patients with kidney disease irrespective of need for kidney biopsy. The overall diagnostic accuracy of the DL algorithm alone was approximately 90% at the patient level and was comparable even when combined with baseline clinical characteristics.

    There have been prior attempts to correlate ultrasonography findings with IFTA on kidney biopsy. Moghazi et al12 reported that kidney length, echogenicity, and parenchymal thickness were significantly, albeit modestly (correlation coefficient, 0.35 for echogenicity and interstitial fibrosis), correlated with IFTA, and none of the ultrasonographic findings individually or in combination was able to provide a quantitative estimate of IFTA. Other ultrasonographic techniques, such as quantitative echogenicity, shear wave velocity imaging, transient elastography, and ultrasonography corticomedullary strain, have been evaluated.12 However, unlike FibroScan, which grades liver fibrosis by transient elastography and has obviated the use of biopsy in chronic viral hepatitis, none of the current ultrasonographic methods can provide a clinically useful estimate of IFTA grade.32,33 Several serum and urinary biomarkers have been evaluated as a noninvasive or semiinvasive measure of IFTA, but no biomarker has been sufficiently accurate to be useful in routine clinical practice. In our study, the DL algorithm was able to predict IFTA grade with 90% accuracy.

    Strengths and Limitations

    Our study has significant strengths. First, our approach accurately (90% accuracy) predicted the IFTA grade. Second, the biopsy diagnoses represent a spectrum of kidney diseases without exclusions (other than cystic diseases of the kidney, for which biopsy is typically not performed). Third, our algorithm was able to segment ultrasonography images to identify the kidney contours with a high degree of accuracy.

    From a clinical standpoint, it is foreseeable that a DL system such as the one developed in this study has the potential to act as a gatekeeper for rationalizing the decision to conduct a kidney biopsy in patients with CKD. We anticipate that because this system can provide a probabilistic estimate of IFTA in real time, it is likely to be acceptable (because it is unlikely to put any time burden on the technicians) and can also reduce the costs associated with kidney biopsy. For example, the DL algorithm developed in our study was able to identify patients with IFTA less than 25% or greater than 75% with a high degree of accuracy. It is generally accepted that patients with advanced fibrosis (ie, >75%) are not suitable candidates for immunosuppressive therapy in proteinuric glomerular diseases. Thus, a noninvasive method of estimating fibrosis can help in treatment decisions without the need for an invasive kidney biopsy. Finally, although the DL-based IFTA predictions correlated with eGFR, future studies need to specifically investigate the potential of this algorithm to predict decline in kidney function prospectively.

    The results should be considered in light of some limitations. First, this was a retrospective, observational study because the kidney biopsy and ultrasonography studies were performed before this analysis. A 90% diagnostic accuracy implies that 10% of IFTA grades may potentially be misclassified. Thus, more work is needed to increase the accuracy before such an algorithm can be used in clinical practice. The diagnostic accuracy in IFTA grade 3 was low. This may partly be due to class imbalance because IFTA grade 3 also had the fewest participants (11.65% of the total sample) and the fewest ultrasonography images for training the algorithm compared with the other 3 IFTA grades. It is conceivable that the diagnostic accuracy for IFTA grade 3 would improve with a higher representation of ultrasonography images in this grade. Interestingly, the weakest agreement between pathologists was for IFTA grade 3, which points to the possibility of an inherent difficulty in assigning histopathological images to this grade. Second, many pretrained models are available. Our choice of the VGG-19 BN model was motivated by a preference for the simplest model possible, but deeper models, such as those belonging to the ResNet, DenseNet, or Inception families, may improve the accuracy of the IFTA estimate. Third, our UNet segmentation model achieved a high average IoU, but further gains in segmentation accuracy may improve the feature extraction and classification ability of the system. Future studies are needed to improve the segmentation component of our system.

    Fourth, although the DL algorithm was validated on an independent sample of images, all the ultrasonography images used in this study came from a single center. Whether the predictive performance of the model will hold across varying ultrasonography image quality, different equipment used to capture the images, different ultrasonography technicians, varying clinical profiles of patients, and differing prevalence of IFTA grades is not known. Further validation on external data sets is therefore needed. Fifth, DL models are, by nature, adaptive. We anticipate a continually improved performance of the system as more real-time data are provided for its continued learning. Our work should be viewed as a proof-of-principle that a DL algorithm has the ability to predict IFTA grade from ultrasonography images of the kidney.

    Conclusions

    In conclusion, we have developed an artificial intelligence–based and DL-driven algorithm that was trained on ultrasonography images to predict IFTA grade with a high degree of accuracy. This article provides proof-of-principle that a DL system can be used to noninvasively, accurately, and independently predict IFTA grade in patients with kidney disease. Although the system in its current form may not be an alternative to kidney biopsy, after robust external validation, a DL-based, noninvasive assessment of IFTA has the potential to significantly enhance clinical decision-making and prognostication in patients with CKD.

    Article Information

    Accepted for Publication: March 24, 2021.

    Published: May 24, 2021. doi:10.1001/jamanetworkopen.2021.11176

    Open Access: This is an open access article distributed under the terms of the CC-BY-NC-ND License. © 2021 Athavale AM et al. JAMA Network Open.

    Corresponding Author: Hemant Kulkarni, MBBS, MD, M&H Research, LLC, 12023 Waterway Ridge, San Antonio, TX 78249 (hemant.kulkarni@mnhresearch.com).

    Author Contributions: Drs Athavale and Kulkarni had full access to all of the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis.

    Concept and design: Athavale, Hart, Itteera, Arruda, Kulkarni.

    Acquisition, analysis, or interpretation of data: Athavale, Itteera, Cimbaluk, Patel, Alabkaa, Singh, Rosenberg, Kulkarni.

    Drafting of the manuscript: Athavale, Cimbaluk, Kulkarni.

    Critical revision of the manuscript for important intellectual content: Athavale, Hart, Itteera, Patel, Alabkaa, Arruda, Singh, Rosenberg, Kulkarni.

    Statistical analysis: Singh, Kulkarni.

    Obtained funding: Athavale.

    Administrative, technical, or material support: Athavale, Patel, Alabkaa, Arruda.

    Supervision: Hart, Cimbaluk.

    Conflict of Interest Disclosures: Dr Kulkarni reported receiving grants from Terumo Medical Corporation, Virginia Commonwealth University, and Washington University outside the submitted work. No other disclosures were reported.

    Funding/Support: We acknowledge funding support for this study from the Hektoen Institute of Medicine.

    Role of the Funder/Sponsor: The funder had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

    References
    1.
    Centers for Disease Control and Prevention. Chronic Kidney Disease in the United States, 2021. US Department of Health and Human Services, Centers for Disease Control and Prevention; 2021.
    2.
    Bowe B, Xie Y, Li T, et al. Changes in the US burden of chronic kidney disease from 2002 to 2016: an analysis of the Global Burden of Disease Study. JAMA Netw Open. 2018;1(7):e184412. doi:10.1001/jamanetworkopen.2018.4412
    3.
    Risdon RA, Sloper JC, De Wardener HE. Relationship between renal function and histological changes found in renal-biopsy specimens from patients with persistent glomerular nephritis. Lancet. 1968;2(7564):363-366. doi:10.1016/S0140-6736(68)90589-8
    4.
    Nath KA. Tubulointerstitial changes as a major determinant in the progression of renal damage. Am J Kidney Dis. 1992;20(1):1-17. doi:10.1016/S0272-6386(12)80312-X
    5.
    Mise K, Hoshino J, Ueno T, et al. Prognostic value of tubulointerstitial lesions, urinary N-acetyl-β-D-glucosaminidase, and urinary β2-microglobulin in patients with type 2 diabetes and biopsy–proven diabetic nephropathy. Clin J Am Soc Nephrol. 2016;11(4):593-601. doi:10.2215/CJN.04980515
    6.
    Srivastava A, Palsson R, Kaze AD, et al. The prognostic value of histopathologic lesions in native kidney biopsy specimens: results from the Boston Kidney Biopsy Cohort Study. J Am Soc Nephrol. 2018;29(8):2213-2224. doi:10.1681/ASN.2017121260
    7.
    Corapi KM, Chen JL, Balk EM, Gordon CE. Bleeding complications of native kidney biopsy: a systematic review and meta-analysis. Am J Kidney Dis. 2012;60(1):62-73. doi:10.1053/j.ajkd.2012.02.330
    8.
    Athavale A, Kulkarni H, Arslan CD, Hart P. Desmopressin and bleeding risk after percutaneous kidney biopsy. BMC Nephrol. 2019;20(1):413. doi:10.1186/s12882-019-1595-4
    9.
    Foley RN, Collins AJ. End-stage renal disease in the United States: an update from the United States Renal Data System. J Am Soc Nephrol. 2007;18(10):2644-2648. doi:10.1681/ASN.2007020220
    10.
    Hogan JJ, Mocanu M, Berns JS. The native kidney biopsy: update and evidence for best practice. Clin J Am Soc Nephrol. 2016;11(2):354-362. doi:10.2215/CJN.05750515
    11.
    Gonzalez Suarez ML, Thomas DB, Barisoni L, Fornoni A. Diabetic nephropathy: is it time yet for routine kidney biopsy? World J Diabetes. 2013;4(6):245-255. doi:10.4239/wjd.v4.i6.245
    12.
    Moghazi S, Jones E, Schroepple J, et al. Correlation of renal histopathology with sonographic findings. Kidney Int. 2005;67(4):1515-1520. doi:10.1111/j.1523-1755.2005.00230.x
    13.
    Friedrich-Rust M, Ong MF, Martens S, et al. Performance of transient elastography for the staging of liver fibrosis: a meta-analysis. Gastroenterology. 2008;134(4):960-974. doi:10.1053/j.gastro.2008.01.034
    14.
    Lakhani P, Sundaram B. Deep learning at chest radiography: automated classification of pulmonary tuberculosis by using convolutional neural networks. Radiology. 2017;284(2):574-582. doi:10.1148/radiol.2017162326
    15.
    Milea D, Najjar RP, Zhubo J, et al; BONSAI Group. Artificial intelligence to detect papilledema from ocular fundus photographs. N Engl J Med. 2020;382(18):1687-1695. doi:10.1056/NEJMoa1917130
    16.
    Ko H, Chung H, Kim KW, et al. COVID-19 pneumonia diagnosis using a simple 2D deep learning framework with a single chest CT image: model development and validation. J Med Internet Res. 2020;22(6):e19569. doi:10.2196/19569
    17.
    Farris AB, Adams CD, Brousaides N, et al. Morphometric and visual evaluation of fibrosis in renal biopsies. J Am Soc Nephrol. 2011;22(1):176-186. doi:10.1681/ASN.2009091005
    18.
    Anaconda Inc. Anaconda software. Published 2016. Accessed April 9, 2021. https://anaconda.com
    19.
    Paszke A, Gross S, Massa F, et al. PyTorch: an imperative style, high-performance deep learning library. In: Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R, eds. Advances in Neural Information Processing Systems. Curran Associates, Inc; 2019:8024-8035.
    20.
    Crimmins TR. Geometric filter for speckle reduction. Appl Opt. 1985;24(10):1438. doi:10.1364/AO.24.001438
    21.
    Antico M, Sasazawa F, Dunnhofer M, et al. Deep learning-based femoral cartilage automatic segmentation in ultrasound imaging for guidance in robotic knee arthroscopy. Ultrasound Med Biol. 2020;46(2):422-435. doi:10.1016/j.ultrasmedbio.2019.10.015
    22.
    Manto M, Dupre N, Hadjivassiliou M, et al. Management of patients with cerebellar ataxia during the COVID-19 pandemic: current concerns and future implications. Cerebellum. 2020;19(4):562-568. doi:10.1007/s12311-020-01139-1
    23.
    Park H, Lee HJ, Kim HG, et al. Endometrium segmentation on transvaginal ultrasound image using key-point discriminator. Med Phys. 2019;46(9):3974-3984. doi:10.1002/mp.13677
    24.
    Yang J, Faraji M, Basu A. Robust segmentation of arterial walls in intravascular ultrasound images using Dual Path U-Net. Ultrasonics. 2019;96:24-33. doi:10.1016/j.ultras.2019.03.014
    25.
    [No authors listed.] Implementation of U-Net architecture using Pytorch. Accessed April 9, 2021. https://github.com/jakeoung/Unet_pytorch
    26.
    [No authors listed.] labelme: image polygonal annotation with Python. Published 2016. Accessed April 9, 2021. https://github.com/wkentaro/labelme
    27.
    Bradski G. The OpenCV library. Dr Dobb's: the world of software development. Published November 1, 2000. Accessed April 9, 2021. https://www.drdobbs.com/open-source/the-opencv-library/184404319#
    28.
    [No authors listed.] Pytorch-IntermediateLayerGetter. Accessed April 9, 2021. https://github.com/sebamenabar/Pytorch-IntermediateLayerGetter
    29.
    [No authors listed.] Pack R package tarball with pre-built xgboost.so (with GPU support). Accessed April 9, 2021. https://github.com/dmlc/xgboost
    30.
    Cohen EP, Olson JD, Tooze JA, Bourland JD, Dugan GO, Cline JM. Detection and quantification of renal fibrosis by computerized tomography. PLoS One. 2020;15(2):e0228626. doi:10.1371/journal.pone.0228626
    31.
    Jiang K, Ferguson CM, Ebrahimi B, et al. Noninvasive assessment of renal fibrosis with magnetization transfer MR imaging: validation and evaluation in murine renal artery stenosis. Radiology. 2017;283(1):77-86. doi:10.1148/radiol.2016160566
    32.
    Terrault NA, Lok ASF, McMahon BJ, et al. Update on prevention, diagnosis, and treatment of chronic hepatitis B: AASLD 2018 hepatitis B guidance. Hepatology. 2018;67(4):1560-1599. doi:10.1002/hep.29800
    33.
    Panel AIHG, Chung RT, Davis GL, et al; AASLD/IDSA HCV Guidance Panel. Hepatitis C guidance: AASLD-IDSA recommendations for testing, managing, and treating adults infected with hepatitis C virus. Hepatology. 2015;62(3):932-954. doi:10.1002/hep.27950