Development of a Deep Learning Model for Retinal Hemorrhage Detection on Head Computed Tomography in Young Children

Key Points

Question: Can a deep learning–based image classification system detect retinal hemorrhage (RH) on head computed tomography (CT) from infants and toddlers with abusive head trauma (AHT)?

Findings: In this diagnostic pilot study with training, validation, and testing on 218 globes with RH and 384 globes without RH from 301 pediatric patients with AHT, a deep learning model identified RHs otherwise not observed by radiologists with high sensitivity and specificity. Thus, RH information can be accessed by deep learning on pediatric head CT images, although the deep learning model may be overfit, and the reported performance may be optimistic in the absence of an external validation data set.

Meaning: By screening pediatric head CT images for RHs, deep learning models could assist clinicians in calibrating clinical suspicion for AHT, provide decision support for which patients urgently need fundoscopic examinations, and help involve child protection agencies in a timely manner when ophthalmologic services are not readily available.


Introduction
Abusive head trauma (AHT) in infants and young children is associated with up to 25% mortality and 40% severe disability among survivors.1 Patients with AHT can present with misleading histories and a range of symptoms, such as vomiting and irritability, that overlap with common pediatric illnesses.2 As a result, 25% to 31% of AHTs in children are missed despite these children being evaluated in medical settings.2 Head computed tomography (CT) is commonly obtained in emergency departments to rule out a range of intracranial abnormalities in symptomatic infants and young children. However, retinal hemorrhages (RHs), which correlate strongly with AHT,3 currently cannot be identified on this imaging modality in children unless they are exceptionally large. Identification of RH is an essential part of an AHT assessment and requires a dilated fundoscopic examination after pediatric ophthalmologic consultation. This subspecialty is not readily available in many communities. Furthermore, the examination is uncomfortable and can require sedation, and it temporarily nullifies pupillary response as an indication of neurologic status.4,5 Thus, although head CTs are obtained routinely, dilated fundoscopic examinations are reserved for those patients with the highest likelihood of abuse.
Deep learning–based image analysis has not, to our knowledge, been previously reported for the evaluation of retinal conditions on CT, although it has been used with head CTs to classify intracranial hemorrhage subtypes6,7 and even to predict 6-month outcomes in pediatric traumatic brain injury.8 The potential for deep learning to contribute to diagnostic imaging in AHT by offering predictive analytics, clinical decision support, and image analysis has been previously recognized.9 Because computer vision can discern features that are otherwise inapparent to human visual examination, we hypothesized that deep learning–based analysis of the globes on pediatric head CTs can predict the presence and absence of RH. The aim of this study was to assess an interpretable deep learning model for the automated detection of RH in routinely acquired pediatric head CTs.

Methods
This diagnostic study was based on single-center retrospective medical records. All procedures were approved by the University of Tennessee Health Science Center's Institutional Review Board, which waived the requirement for consent because deidentified images were used appropriately and because obtaining consent was impractical for children presenting over a 15-year period whose primary caregivers have likely changed because of child abuse. The study followed the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) reporting guideline. Data for all patients and variables used in the models below were complete, with no missing values.

Study Population
Our study population consisted of 301 of 570 infants and young children who were diagnosed with AHT by Le Bonheur Children's Hospital's child abuse team from May 1, 2007, to March 31, 2021. Diagnoses were made on the basis of history, physical examination, laboratory and imaging studies, dilated fundoscopic examinations, and other necessary investigations. We excluded patients older than 3 years (to increase the uniformity in globe size and developmental stage), patients with scans from outlying centers, and patients for whom both globes could not be detected because of scan quality. Patients whose scans had an intercept parameter that did not equal 0 were also excluded for reasons explained later (eFigure 1 in Supplement 1). The outcome label for every globe was the presence or absence of RHs, which was tabulated based on the results of dilated fundoscopic examinations performed on each patient by our pediatric ophthalmology service.

RH Prediction in Individual Globes: The Deep Learning Model
The axial series from the initial CT for each patient was used in our prediction models to conform to the requirement of the globe segmentation algorithm. All scans had a slice thickness of 5 mm. Slices had a matrix size of 512 × 512 and an in-plane resolution of 0.2 to 0.5 mm. The CT images were acquired from a single scanner (Toshiba) at our center.
To isolate globes, we used a globe segmentation model (MRes-UNET2D) previously developed by Umapathy et al10 for adult CTs. Its direct application to our pediatric CTs resulted in some missed and mislabeled globe regions (eFigure 8 in Supplement 1). Therefore, we also developed a novel method for straightening CTs using the calculated angle between the globes (eFigure 9 in Supplement 1), which allowed systematic cropping of the CTs to include missed and exclude mislabeled globe regions. After globe segmentation, we compared globes with and without RH to determine whether their Hounsfield unit (HU) distributions varied according to CT parameters (eFigure 2 in Supplement 1). The distribution of these parameters in the final study population is included in eTable 1 in Supplement 1.
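The straightening idea can be sketched as follows, assuming globe centroids have already been obtained from the segmentation masks. The helper names and the rotation sign convention are illustrative, not the paper's implementation (which is detailed in eFigure 9 in Supplement 1):

```python
import numpy as np
from scipy import ndimage


def straightening_angle(left_centroid, right_centroid):
    """Angle (degrees) between the inter-globe axis and the horizontal.

    Centroids are (row, col) pixel coordinates of the two segmented globes.
    """
    dr = right_centroid[0] - left_centroid[0]
    dc = right_centroid[1] - left_centroid[1]
    return np.degrees(np.arctan2(dr, dc))


def straighten_slice(ct_slice, left_centroid, right_centroid):
    """Rotate an axial CT slice so the two globes lie on a horizontal line.

    Once the inter-globe axis is level, a fixed crop window around each
    globe becomes reliable. The rotation direction here is an assumption.
    """
    angle = straightening_angle(left_centroid, right_centroid)
    return ndimage.rotate(ct_slice, angle, reshape=False, order=1)
```

With the slice leveled, a systematic crop (e.g., a fixed-size window centered on each globe centroid) can then recover globes the segmentation model missed and discard mislabeled regions.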
Convolutional neural networks (CNNs) are state-of-the-art models for image classification because of their ability to learn important features from images without manual feature engineering. To develop our deep learning model, we used transfer learning with VGG16, which is an off-the-shelf CNN model pretrained on ImageNet. We used the architecture of its feature extraction steps, and we added a global average pooling layer, a dense layer with 100 neurons, and an output layer to its classification steps. We retrained the top layers of the network, which capture more task-specific features; froze the 3 bottom convolutional blocks; and fine-tuned the top 2 convolutional blocks with a new data set (eFigure 3 in Supplement 1).
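The transfer learning setup described above can be sketched in Keras as follows. The input size is illustrative, and `weights=None` keeps the sketch self-contained and offline; the study would instead start from the ImageNet-pretrained weights (`weights="imagenet"`):

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Input size is illustrative; the study's cropped globe images differ.
# weights=None avoids a network download in this sketch; the study would
# load ImageNet-pretrained weights with weights="imagenet".
base = VGG16(weights=None, include_top=False, input_shape=(224, 224, 3))

# Freeze the 3 bottom convolutional blocks; fine-tune the top 2.
for layer in base.layers:
    layer.trainable = layer.name.startswith(("block4", "block5"))

# New classification head: global average pooling, a 100-neuron dense
# layer, and a binary output (RH present vs absent).
x = layers.GlobalAveragePooling2D()(base.output)
x = layers.Dense(100, activation="relu")(x)
out = layers.Dense(1, activation="sigmoid")(x)

model = models.Model(base.input, out)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```

Freezing the early blocks preserves the generic low-level filters learned on ImageNet, while the fine-tuned top blocks adapt the more task-specific features to globe images.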
We randomly split the population of individual 3-dimensional globes into training (60%), validation (20%), and testing (20%) data sets. To improve the model further and reduce overfitting, we augmented the training data set by applying rotation, horizontal shift, horizontal flip, and scaling to existing data using the ImageDataGenerator function of the Keras library. After deciding on model hyperparameters, such as the number of layers, regularization parameters, and optimizers using the validation data set, we evaluated model performance on the test data set. Performance metrics included accuracy, sensitivity, and specificity (with the F1 score maximized in the validation data set).
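A minimal sketch of this augmentation setup follows; the parameter magnitudes and the dummy batch of cropped globe images are assumptions for illustration, since the paper specifies only the transform types:

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Transform types follow the paper (rotation, horizontal shift, horizontal
# flip, scaling); the magnitudes below are illustrative assumptions.
augmenter = ImageDataGenerator(
    rotation_range=10,      # small random rotations (degrees)
    width_shift_range=0.1,  # horizontal shifts (fraction of width)
    horizontal_flip=True,   # mirror left/right globes
    zoom_range=0.1,         # scaling
)

# Dummy batch of 8 single-channel globe crops (75 x 75 pixels, a size in
# line with the median globe size reported later in the paper).
x = np.random.rand(8, 75, 75, 1).astype("float32")
y = np.array([0, 1] * 4)

# Each call yields a freshly transformed batch of the same shape.
batch_x, batch_y = next(augmenter.flow(x, y, batch_size=8, seed=0))
```

Because augmentation is applied on the fly, the model sees a differently transformed variant of each training globe every epoch, which effectively enlarges a small training set.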

RH Prediction From Intracranial Findings and Demographics: The Light Gradient Boosting Machine Model
To see how well other common intracranial findings in AHT and demographic characteristics could predict RH, 4 common intracranial pathologic findings identified by radiologists (subdural hematoma, epidural hematoma, subarachnoid hemorrhage, and hypoxic ischemic injury) and the demographic features age, race and ethnicity, and sex were used as features. Racial and ethnic classifications of the patients were acquired from the electronic health record to evaluate for racial and socioeconomic skew in our AHT population as well as bias in our models. Categorical features were aggregated as numbers (percentages), and continuous features were summarized as median (range). A 2-sided P < .05 was considered statistically significant for comparing cases and controls.
Using these features, we developed a more general light gradient boosting machine (GBM) model.14 Characteristics of the cases and controls are summarized in Table 1; the differences between these groups are consistent with a prior report.17 Our study population provided 218 globes with RH and 384 globes without RH.
Of 120 patients with RH on fundoscopic examinations, 4 (3.3%) had CTs that were reported as having RH by pediatric radiologists.

Performance of the Deep Learning Model
[Table 1 footnotes: b: Other races include patients who self-identified as Hawaiian or Pacific Islander or mixed race or whose information was missing. c: The CT finding percentages add to more than 100% because each patient can have more than 1 finding. Figure caption: The contribution of pixels to our model predictions increases as the color transforms from blue to red.]

Subgroup Analyses
The median globe size of the entire study population was 72 × 73 pixels. The highest increase in performance occurred in the subgroup that contained globe sizes larger than or equal to 75 × 75 pixels, which had a median size of 79 × 79 pixels. These globe sizes were not related to patient age; rather, we determined that a higher in-plane sampling rate allowed more information to be contained in these images.

Performance of the Combined Light GBM Model
Retinal hemorrhage was predicted at the level of individual globes with a sensitivity of 79.6%; specificity,

Limitations
This pilot study should be interpreted in the context of its limitations. Deep learning models perform best when trained on very large data sets, and as such, our study was limited in the number of qualifying scans available to us. Although we used a validated, effective data augmentation strategy to overcome this limitation,28 a risk of overfitting remains. Furthermore, our study was conducted in a group of patients diagnosed with AHT, which has a high prevalence of RH. A general population of head CTs would have a lower prevalence of RH, and without external validation, our performance estimates may be optimistic. Our subgroups for analyses by age and ancestry were small. A number of CTs were excluded because of parameter heterogeneity and poor imaging quality. Standards for pediatric head CT acquisition published by the American College of Radiology are fairly broad and lack granularity29; thus, the quality of imaging varies considerably. We used 5-mm slice thicknesses, although 2 mm has become common at pediatric centers and could improve our model performance.
We did not correlate the severity and distribution of RH30 with the performance of our detection system. This study will need prospective, external validation in a cohort of all infants and young children who undergo head CTs with variable acquisition parameters across several centers.

Conclusions
Although rarely observable by radiologists, RH information is present in head CTs and can be accessed by deep learning image analysis, as shown in this diagnostic study. The technical ability to discriminate RH on head CT can offer clinicians practicing in subspecialty-limited environments greater confidence to move an AHT investigation forward and can decrease the number of missed cases, all by using a routine diagnostic modality that is objective and less susceptible to common clinical bias.