Comparison of Chest Radiograph Captions Based on Natural Language Processing vs Completed by Radiologists

Key Points Question Can natural language processing (NLP) be used to generate chest radiograph (CXR) captions? Findings In this diagnostic study including 74 082 CXR cases labeled with NLP for 23 abnormal signs to train convolutional neural networks, an independent prospective test data set of 5091 participants was examined. The reporting time using NLP-generated captions as prior information was 283 seconds, significantly shorter than the normal template (347 seconds) and rule-based model (296 seconds), while maintaining good consistency with radiologists. Meaning The findings of this study suggest that NLP can be used to generate CXR captions, which provides a priori information for writing reports and may make CXR interpretation more efficient.


Introduction
Chest radiography (CXR) accounts for 26% of imaging examinations of pulmonary and cardiac diseases. However, the interpretation of CXR findings is challenging because it mainly depends on the expertise of radiologists.1,2 The increasing number of CXR orders and the lack of experienced radiologists, especially in community clinics or primary hospitals, limit the clinical application of CXR.3 The development of artificial intelligence (AI) has accelerated the automatic interpretation of CXR.4 Artificial intelligence solutions based on convolutional neural networks (CNNs) have shown excellent performance in diagnosing pulmonary diseases,5-7 identifying the position of feeding tubes,8 and predicting temporal changes of imaging findings.9 Studies have reported that AI-assisted CXR interpretation improved diagnostic performance compared with that of a single reader,10,11 shortened reporting time,12 and helped junior radiologists write reports.13 However, CNN image classification usually relies on supervised training based on expert annotation.14,15 Radiology reports contain the imaging findings and diagnoses of clinical experts, but these reports are usually unstructured natural text and cannot be directly used as classification labels for CNN training.
Recently, bidirectional encoder representations from transformers (BERT) has been developed for natural language processing (NLP),16 which greatly improves the ability to recognize semantics and context and can be used to generate medical reports. Fonollà et al17 presented an AI-aided system that incorporated a BERT-based image captioning block to automatically describe colorectal polyps in colonoscopy. Xue et al18 applied a recurrent generative model to a public data set to generate the imaging description paragraphs and impression sentences of CXR reports. Despite these recent research advances, AI-assisted CXR interpretation has not been routinely used in clinical practice because the task remains highly challenging.
It is increasingly recognized that AI-based applications need to undergo rigorous prospective evaluation to demonstrate their effectiveness. Because most previous studies on CXR interpretation were retrospective tests on selected public data sets,19,20 a prospective study in a clinical practice setting is needed to evaluate AI-assisted CXR interpretation. Therefore, we applied the BERT model to extract language entities and associations from unstructured radiology reports to train CNNs and generated free-text descriptive captions using NLP. We randomly assigned a normal template, NLP-generated captions, or rule-based captions to CXR cases in the test group to evaluate the consistency between the generated captions and the final reports of radiologists. The hypothesis was that NLP-generated captions can assist CXR reporting.

Methods
This study followed the Transparent Reporting of Evaluations With Nonrandomized Designs (TREND) reporting guideline for diagnostic studies. The institutional review board of Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine, approved this study and waived the need for informed consent because providing information prior to the routine reporting process does not pose any risk to patients. Figure 1 shows the study workflow.

Retrospective Data Sets
The training data set consisted of consecutive symptomatic CXR cases at hospital A from February 1, 2014, to February 28, 2018. The inclusion criteria were patients (age ≥18 years) who underwent posteroanterior CXR for cardiothoracic symptoms, such as chest tightness, cough, fever, and chest pain. The exclusion criteria were mobile CXR, poor image quality, and incomplete reports not drafted and confirmed by 2 radiologists. The retrospective test data set consisted of CXR cases at hospital B from April 1 to July 31, 2019, including symptomatic patients and asymptomatic screening participants. The symptomatic patients were from emergency, inpatient, and outpatient settings and met the indications for CXR. The inclusion and exclusion criteria were similar to those of the training data set, except that the screening participants were asymptomatic.

Figure 1. Study Workflow. Colors represent different actors or actions: dark yellow, equipment or system; light yellow, function of the equipment or system; dark blue, captioning models; light blue, data sets; dark gray, operator; light gray, operator's work. AI indicates artificial intelligence; NLP, natural language processing; PACS, picture archiving and communication system; and RIS, radiology information system.

Prospective Test Data Set
The start and completion times of image reading by residents were recorded to compare the reporting time based on the 3 models.
After the residents submitted the reports, the senior radiologists observed the CXR images and confirmed the reports. In this process, the senior radiologists were blinded to the AI captioning models; that is, they viewed only the reports written by the residents and did not know which model each caption originally came from. In total, 19 residents (including L.Z. and L.W.) and 17 radiologists (including Y.Z. and X.X.) participated in reporting.

BERT-Based CXR Image Labeling
We used the BERT model21,22 to recognize language entities, entity spans, semantic types of entities, and semantic relationships between language entities. BERT relies on a transformer, an attention mechanism that learns the contextual relationships between words in a text. The BERT model is designed to pretrain deep bidirectional representations from unstructured text through the joint adjustment of left and right contexts. Therefore, the pretrained BERT model can be fine-tuned through additional output layers to complete the NLP tasks in this study, ie, to learn the semantic information of radiology reports and output semantic recognition vectors for classification.
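As a concrete illustration of this fine-tuning setup, the sketch below adds a multi-label output layer over the 23 abnormal signs to a pretrained BERT encoder. It is a minimal sketch, not the authors' implementation: the checkpoint name, the English example sentence, and the 0.5 threshold are assumptions.

```python
# Minimal sketch: BERT with an added output layer for multi-label
# classification of radiology-report text (one output per abnormal sign).
# The checkpoint, example text, and threshold are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_LABELS = 23  # the 23-label system of abnormal signs

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",  # sigmoid outputs, BCE loss
)

report = "Patchy consolidation is observed in the right lower lung field."
inputs = tokenizer(report, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, 23)
probs = torch.sigmoid(logits)        # independent probability per sign
label_vector = (probs > 0.5).int()   # 1-hot style code used downstream
```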
First, we used BERT to automatically mine all reports in the training data set, segment and extract terms or phrases from the sentences, and cluster them according to semantic distance.23,24 Second, 2 radiologists (including Y.Z.) other than the above physicians, with 10 and 15 years of experience, and 1 NLP engineer (M.L.) reviewed the language clusters by consensus to determine whether the terms correctly described the imaging findings on CXR. They iteratively ruled out wrong terms, fixed conflicting terms, and merged clusters with similar clinical meanings. Finally, a 23-label system of abnormal signs was established, covering the synonyms, parasynonyms, and phrases that may appear in radiology reports (Box). The details of the BERT-based image labeling and CNN algorithm are in the eMethods in Supplement 1.
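For the clustering step, a minimal sketch under stated assumptions (mean-pooled BERT embeddings as the semantic representation and agglomerative clustering with a cosine-distance threshold; the paper does not specify these choices) could look like:

```python
# Sketch: cluster extracted terms by semantic distance so that synonyms
# and parasynonyms fall into one cluster for radiologist review.
# Embedding choice and threshold are assumptions; needs scikit-learn >= 1.2.
import numpy as np
import torch
from sklearn.cluster import AgglomerativeClustering
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(term: str) -> np.ndarray:
    # Mean-pool the last hidden states as a simple term embedding.
    inputs = tokenizer(term, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()

terms = ["patchy consolidation", "flaky consolidation shadow",
         "pneumothorax", "air in the pleural cavity"]
vectors = np.stack([embed(t) for t in terms])

clusterer = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.4,
    metric="cosine", linkage="average",
)
cluster_ids = clusterer.fit_predict(vectors)  # same id = candidate synonyms
```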

Board Reading
Because most CXR cases lacked a pathologic reference and the original CXR reports came from medical staff with various levels of expertise, we reexamined the entire retrospective and prospective test data sets to establish a solid and unified reference standard for determining the performance of the CNN. Two different radiologists (including X.X.) with 21 and 31 years of experience independently reviewed the CXR images and BERT-extracted labels. They made necessary corrections to the labels and resolved inconsistencies by consensus.

NLP-Based Caption Generation
Caption generation was implemented with an NLP-based caption retrieval algorithm. The BERT-based CXR image labeling system generated a 1-hot code for each token sequence in the training data set.
In NLP, a token sequence is a group of characters treated as a semantic unit for processing.25 The token sequences with the same 1-hot code were combined into a subset for caption retrieval. In each subset, the bilingual evaluation understudy (BLEU) score between each token sequence and the other token sequences was calculated, and the token sequence with the largest average BLEU score was taken as the caption of that subset. This caption retrieval procedure went through all possible 1-hot combinations in the training data set.
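A minimal sketch of this retrieval step follows (illustrative only: whitespace tokenization, NLTK sentence-level BLEU, and the smoothing function are assumptions not specified in the text):

```python
# Sketch: for each 1-hot code, keep the token sequence whose average
# pairwise BLEU against the rest of its subset is highest.
from collections import defaultdict
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1

def build_caption_index(samples):
    """samples: iterable of (code, caption) pairs, where code is a tuple
    of 0/1 over the 23 abnormal signs and caption is a report sentence."""
    subsets = defaultdict(list)
    for code, caption in samples:
        subsets[code].append(caption.split())

    index = {}
    for code, seqs in subsets.items():
        if len(seqs) == 1:
            index[code] = " ".join(seqs[0])
            continue
        def avg_bleu(i):
            # Mean pairwise BLEU of sequence i against the other sequences.
            scores = [sentence_bleu([ref], seqs[i], smoothing_function=smooth)
                      for j, ref in enumerate(seqs) if j != i]
            return sum(scores) / len(scores)
        best = max(range(len(seqs)), key=avg_bleu)
        index[code] = " ".join(seqs[best])
    return index
```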
To generate a caption in the test data set, the 1-hot code of the CNN classification results for the abnormal signs in a CXR image was matched with the training subset carrying the same 1-hot code, and the corresponding caption was taken as the output. Because the CNN classification model did not provide information about the location and size of abnormal signs, the location descriptions and numbers in the tokens were left blank.
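Continuing the sketch above, test-time generation then reduces to a lookup of the CNN's thresholded outputs (training_samples and cnn_probabilities are hypothetical placeholder names):

```python
# Sketch: test-time caption lookup (training_samples and
# cnn_probabilities are hypothetical placeholders).
index = build_caption_index(training_samples)

code = tuple(int(p > 0.5) for p in cnn_probabilities)  # length-23 1-hot code
caption = index.get(code)  # None if this combination never occurred in training

# Location and size slots stay blank because the classifier does not
# localize findings, e.g., "patchy consolidation in the ___ lung field".
```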

Rule-Based Caption Generation
According to the order in which radiologists write reports and their habits of expressing different positive and negative labels, a rule-based caption generation method was proposed (eMethods in Supplement 1). In short, CNN classification results with similar patterns of language description were divided into 8 subcategories that adopt similar expression patterns. For example, subcategory 1 includes the signs of consolidation, small consolidation, patchy consolidation, nodule, calcification, mass, emphysema, pulmonary edema, cavity, and pneumothorax. In this subcategory, each sign with a positive result is described directly. If the CNN determines that the pneumothorax sign is positive, the rule-based caption is "pneumothorax is observed in the lung." If all of these signs are negative, the caption is "there are no abnormal densities in both lung fields." The results of all subcategories are linked into a complete paragraph.
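The following minimal sketch illustrates the pattern for subcategory 1 only (the sign list and sentence templates follow the examples in the text; the other 7 subcategories would be analogous):

```python
# Sketch of the rule-based pattern for subcategory 1; sign names and
# sentence templates follow the examples given in the text.
SUBCATEGORY_1 = [
    "consolidation", "small consolidation", "patchy consolidation",
    "nodule", "calcification", "mass", "emphysema",
    "pulmonary edema", "cavity", "pneumothorax",
]

def describe_subcategory_1(positive_signs: set[str]) -> str:
    findings = [s for s in SUBCATEGORY_1 if s in positive_signs]
    if not findings:
        return "There are no abnormal densities in both lung fields."
    # Each sign with a positive result is described directly.
    return " ".join(f"{sign.capitalize()} is observed in the lung."
                    for sign in findings)

# The full caption concatenates the outputs of all 8 subcategories, e.g.:
# caption = " ".join(describe(sub, positives) for sub in SUBCATEGORIES)
print(describe_subcategory_1({"pneumothorax"}))
# -> "Pneumothorax is observed in the lung."
```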

Similarity Among Captioning Models
Similarity was evaluated using the final report as the reference: the BLEU score was calculated to indicate the similarity between each caption (normal template, NLP-generated, or rule-based) and the final report (eMethods in Supplement 1).
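A minimal sketch of this comparison, assuming NLTK's sentence-level BLEU with smoothing and whitespace tokenization (the exact configuration is in the eMethods in Supplement 1, not reproduced here):

```python
# Sketch: BLEU similarity of each candidate caption to the final report.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1

def bleu_vs_report(caption: str, final_report: str) -> float:
    return sentence_bleu([final_report.split()], caption.split(),
                         smoothing_function=smooth)

final_report = "Patchy consolidation is observed in the right lower lung field."
captions = {
    "normal template": "No abnormal densities in both lung fields.",
    "NLP generated": "Patchy consolidation is observed in the lung field.",
    "rule based": "Patchy consolidation is observed in the lung.",
}
scores = {name: bleu_vs_report(c, final_report)
          for name, c in captions.items()}
```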

Statistical Analysis
The metrics used to indicate the classification performance of the CNN included the area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, specificity, and F1 score. To supplement the interpretation of the AUC on imbalanced data sets, ie, the high specificity caused by low disease prevalence, we also calculated the area under the precision-recall curve (AUPRC). The 95% CIs were calculated by bootstrapping with 100 iterations to estimate the uncertainty of these metrics.26 In this way, the original data were resampled 100 times; each time, 95% of the data were randomly selected and used to calculate the statistics of interest. Among the 3 groups of patients assigned different caption generation models, the pairwise differences in reporting time and BLEU score were evaluated by independent-sample t test. A 2-sided P < .05 was considered statistically significant. MedCalc, version 18 (MedCalc Software) was used for statistical analysis.

Results

CNN Classification Performance
In the symptomatic patients, the CNN showed high performance in classifying the 20 abnormal signs (Figure 2A and eTable 5 in Supplement 1); the mean (SD) AUC of these abnormal signs was 0.84 (0.09), ranging from 0.69 (95% CI, 0.48-0.90) to 0.99 (95% CI, 0.98-1.00). The mean (SD) AUPRC was 0.41 (0.19) (eFigure 1A in Supplement 1). The mean (SD) accuracy was 0.89 (0.12); sensitivity, 0.47 (0.20); specificity, 0.95 (0.11); and F1 score, 0.60 (0.20).
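The bootstrapped 95% CIs reported above can be sketched as follows; this is an illustration only, and details beyond "100 iterations, 95% of the data" (seeding, resampling with replacement, percentile CIs) are assumptions:

```python
# Sketch of the bootstrap CI procedure: resample 95% of the data 100
# times and report the 2.5th/97.5th percentiles of the AUC.
# Seed, replacement, and percentile CIs are assumptions.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, iterations=100, frac=0.95, seed=0):
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    n = len(y_true)
    aucs = []
    while len(aucs) < iterations:
        idx = rng.choice(n, size=int(frac * n), replace=True)
        if len(np.unique(y_true[idx])) < 2:
            continue  # need both classes present to compute an AUC
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    return np.mean(aucs), np.percentile(aucs, 2.5), np.percentile(aucs, 97.5)
```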

Reporting Time
The residents spent the least reporting time when using the NLP-generated captions: the mean reporting time was 283 seconds, significantly shorter than with the normal template (347 seconds) and the rule-based model (296 seconds).

Similarity of Captioning Models
Among the 5091 individuals, the AI server randomly assigned 1662 to the normal template, 1731 to NLP-generated captions, and 1698 to rule-based captions (Figure 3 and the Table). eFigure 2 in Supplement 1 shows representative cases. The percentage of men and the percentage of abnormal cases (with at least 1 abnormal sign) did not differ significantly among the 3 subgroups (P > .05).
The NLP-generated caption was the most similar to the final report, with a mean (SD) BLEU score of 0.69 (0.24), significantly higher than that of the normal template (0.37 [0.09]; P < .001) and the rule-based model (0.57 [0.19]; P < .001). The BLEU score of the rule-based model was also significantly higher than that of the normal template (P < .001) (eTable 7 in Supplement 1).
Box. Abnormal Signs Extracted by Bidirectional Encoder Representations From Transformers (BERT) Model From Chest Radiograph Reports in the Training Data Set (list truncated here; recovered items include peripherally inserted central catheter implant and pacemaker implant)

Figure 2. Receiver Operating Characteristic Curves of Convolutional Neural Network Classification in the Prospectively Included Test Data Set

Figure 3. Bilingual Evaluation Understudy (BLEU) Scores in the Prospectively Included Test Data Set


Supplement 1.
eFigure 1. Precision-Recall Curves of Convolutional Neural Network Classification in the Prospective Test Dataset
eFigure 2. Three Representative Cases of Different Report Generation Models and Two Cases in Which Errors Occur in the Prospective Test Dataset
eTable 1. Digital Radiography Systems
eTable 2. Classification Performance of Convolutional Neural Networks of Symptomatic Patients (n=5,996) in the Retrospective Test Dataset Using Board Reading as the Reference
eTable 3. Classification Performance of Convolutional Neural Networks of Asymptomatic Screening Participants (n=2,130) in the Retrospective Test Dataset Using Board Reading as the Reference
eTable 4. Disease Prevalence in the Prospective Test Dataset (n=5,091) Based on Board Reading
eTable 5. Classification Performance of Convolutional Neural Networks of Symptomatic Patients (n=4,175) in the Prospective Test Dataset Using Board Reading as the Reference
eTable 6. Classification Performance of Convolutional Neural Networks of Asymptomatic Screening Participants (n=916) in the Prospective Test Dataset Using Board Reading as the Reference
eTable 7. Multiple Regression Analysis on the Significance of Reporting Time and BLEU Score Among Three Models