Is machine learning applied to ultrasonography capable of risk-stratifying thyroid nodules by their genetic status?
In this diagnostic study of 134 lesions among 121 patients, a model developed through automated machine learning was able to identify genetically high-risk thyroid nodules by ultrasonography alone, with a specificity of 97% and positive predictive value of 90%.
The findings suggest that machine learning application to genetic risk stratification of thyroid nodules is feasible, affording an additional diagnostic adjunct to cytogenetics for nodules with indeterminate cytological result.
Thyroid nodules are common incidental findings. Ultrasonography and molecular testing can be used to assess risk of malignant neoplasm.
To examine whether a model developed through automated machine learning can stratify thyroid nodules as high or low genetic risk by ultrasonography imaging alone compared with stratification by molecular testing for high- and low-risk mutations.
Design, Setting, and Participants
This diagnostic study was conducted at a single tertiary care urban academic institution and included patients (n = 121) who underwent ultrasonography and molecular testing for thyroid nodules from January 1, 2017, through August 1, 2018. Nodules were classified as high risk or low risk on the basis of results of an institutional molecular testing panel for thyroid risk genes. All thyroid nodules that underwent genetic sequencing for cytological results with Bethesda System categories III and IV were reviewed. Patients without diagnostic ultrasonographic images within 6 months of fine-needle aspiration or who received definitive treatment at an outside medical center were excluded.
Main Outcomes and Measures
Thyroid nodules were categorized by the model as high risk or low risk using ultrasonographic images. These classifications were compared with the results of molecular testing.
Among the 134 lesions identified in 121 patients (mean [SD] age, 55.7 [14.2] years; 102 women [84.3%]), 683 diagnostic ultrasonographic images were selected. Of the 683 images, 556 (81.4%) were used for training the model, 74 (10.8%) for validation, and 53 (7.8%) for testing. Most nodules had no mutation (75 [56.0%]), whereas 43 nodules (32.1%) had a high-risk mutation and 16 (11.9%) had an unknown or a low-risk mutation (χ2 = 39.060; P < .001). In total, 228 images (33.4%) were of nodules classified as genetically high risk (n = 43), and 455 (66.6%) were of low-risk nodules (n = 91). The model performed with a sensitivity of 45% (95% CI, 23.1%-68.5%), a specificity of 97% (95% CI, 84.2%-99.9%), a positive predictive value of 90% (95% CI, 55.2%-98.5%), a negative predictive value of 74.4% (95% CI, 66.1%-81.3%), and an overall accuracy of 77.4% (95% CI, 63.8%-87.7%).
Conclusions and Relevance
The study found that the model developed through automated machine learning could produce high specificity for identifying nodules with high-risk mutations on molecular testing. This finding shows promise for the diagnostic applications of machine learning interpretation of sonographic imaging of indeterminate thyroid nodules.
Thyroid nodules are extremely common, with a prevalence in the general population as high as 67%, and are often discovered incidentally on imaging for an unrelated workup.1 Once discovered, a nodule should be evaluated by dedicated ultrasonography. The most recent (2015) guidelines of the American Thyroid Association do not recommend further evaluation for thyroid nodules smaller than 1 cm unless they are accompanied by clinical symptoms or lymphadenopathy.2 For any nodule larger than 1 cm, subsequent biopsy with fine-needle aspiration (FNA) may be recommended. Factors such as parenchyma heterogeneity, echogenicity, dimensions, margins, and location can be considered when determining whether to biopsy.3,4
Nodules that qualify for FNA are subsequently graded by the Bethesda System for Reporting Thyroid Cytopathology.5 The Bethesda System assigns FNA samples to 1 of 6 diagnostic categories, each with an associated risk-of-malignancy score.5 Diagnostic uncertainty remains when using the Bethesda System because 15% to 30% of thyroid nodules are categorized as Bethesda categories III or IV, indicating an indeterminate cytological result.6 Surgical resection will provide a final diagnosis; however, fewer than one-third of these indeterminate nodules prove to be malignant, resulting in potentially unnecessary resections. At some institutions, molecular genetic testing has since been used in conjunction with indeterminate cytological results to better stratify risk on the basis of the presence or absence of high-risk genetic mutations and to identify appropriate candidates for surgical resection.7 Although molecular testing provides a less invasive alternative to a surgical procedure, both options are expensive and imperfect.8,9
The primary aim of this study was to establish whether automated machine learning applied to thyroid lesion ultrasonography can identify genetically high-risk thyroid nodules, using molecular genetic testing as the criterion standard.
This retrospective diagnostic study was conducted at a single tertiary care urban academic center. It received approval from the institutional review board of Thomas Jefferson University Hospital. Informed consent was waived because the research risk was minimal, patients opting not to participate would skew the results, not all participants could be readily accessed, and the study did not involve face-to-face interaction.
We reviewed the electronic medical records of all patients who underwent ultrasonography-guided FNA with subsequent molecular testing for a workup of a thyroid nodule from January 1, 2017, through August 1, 2018. Patients were excluded if their medical records were not available or if ultrasonography with diagnostic images within 6 months of FNA was missing. One of us (A.L.), a blinded radiologist with more than 15 years’ experience in thyroid ultrasonography, selected diagnostic ultrasonographic images of each lesion, resulting in a total of 683 images (Figure 1).
Molecular testing was performed using an institution-specific panel comprising 23 gene mutations and 5 gene rearrangements to stratify FNA samples into high-risk or low-risk groups. This 23-gene panel served as a rule-in test. Samples containing 1 or more known high-risk mutations (Table 1) were classified by the molecular testing report as high risk, whereas samples with no mutation or a mutation considered to be of low or unknown risk were classified by the molecular testing report as low risk.
In total, 134 lesions across 121 patients met the criteria, including 43 genetically high-risk lesions and 91 genetically low-risk lesions. Across these 134 lesions, 683 diagnostic ultrasonographic images were selected by the blinded radiologist. The images were then cropped by a blinded nonmedical individual (S.W.) to remove annotative features.
Automated machine learning was performed on a commercial platform (AutoML Vision; Google LLC). This platform generates custom models from processed images for classification by transfer learning with neural architecture optimized on a proprietary database. Product specifications recommend between 100 and 500 images per classification label. After images are submitted for processing, roughly 80% of images are selected for training the model, roughly 10% are used for validation (selecting and modifying the model as appropriate), and roughly 10% are set aside for testing the accuracy of the model.
Lesion and patient data were compiled using REDCap (Research Electronic Data Capture), version 8.4 (Vanderbilt University).10 Image data were compiled with Excel (Microsoft Corp). In patients with multiple lesions that met the criteria, each nodule was evaluated independently for analysis. Statistical analyses were performed with SPSS, version 25 (IBM SPSS), and MedCalc for Windows, version 15.0 (MedCalc Software). Unpaired, 2-tailed t tests were used to calculate the P values. Two-sided P < .05 indicated statistical significance.
Electronic medical records and ultrasonography of 134 lesions across 121 patients were reviewed for this study. Univariate analysis of the study population is detailed in Table 2. The mean (SD) age for all participants was 55.7 (14.2) years, and 102 (84.3%) were women. Nodules included in this study (n = 118) had a predilection toward the lower pole (46 [39.0%]) and middle pole (37 [31.4%]), with statistical significance (χ2 = 50.48; P < .001). Nodules located in the upper pole (19 [16.1%]), occupying the entire lobe (9 [7.6%]), or located in the isthmus (7 [5.9%]) were less frequent. Sixteen nodules were excluded from these calculations because their location was not specified in the imaging reports or in other clinical records. Fifty-one patients (42.9%) underwent total thyroidectomy or lobectomy, of whom 24 had malignant pathological findings as detailed in Table 3.
In addition, nodules had a mean (SD) largest dimension of 2.65 (1.5) cm. No difference was observed in the mean (SD) size of nodules between genetically high-risk nodules and genetically low-risk nodules (n = 43; 2.6 [1.4] cm vs n = 91; 2.68 [1.5] cm; P = .76). Most nodules had no mutation (75 [56.0%]), but a high-risk mutation was present in some nodules (43 [32.1%]), and the presence of an unknown or a low-risk mutation was least likely (16 [11.9%]; χ2 = 39.060; P < .001).
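The reported χ2 value for the mutation distribution can be checked with a short calculation. The sketch below assumes the statistic is a Pearson goodness-of-fit test against equal expected counts in the 3 mutation categories; that assumption is not stated in the text, but it reproduces the reported value. The function names are illustrative only.

```python
import math


def chi_square_gof(observed):
    """Pearson chi-square statistic against equal expected counts."""
    total = sum(observed)
    expected = total / len(observed)
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    return statistic, len(observed) - 1


def chi_square_sf_even_df(statistic, df):
    """Upper-tail probability of a chi-square variable (closed form, even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = statistic / 2.0
    series = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * series


# Counts taken from the text: no mutation, high-risk mutation,
# unknown or low-risk mutation.
mutation_counts = [75, 43, 16]
stat, df = chi_square_gof(mutation_counts)
p_value = chi_square_sf_even_df(stat, df)
print(f"chi2 = {stat:.3f}, df = {df}, P < .001: {p_value < 0.001}")
```

Under this assumption the statistic rounds to the reported 39.060 with 2 degrees of freedom.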
The American Thyroid Association (ATA) risk scores (based on sonographic findings, which we assigned to a score range of 1 to 5, with 1 being benign and 5 being high suspicion) were available for 121 nodules (90.3%). Of these nodules, 50 (41.3%) were considered to have low levels of suspicion (ATA risk score of 3), whereas 27 nodules (22.3%) were considered to have intermediate (ATA risk score of 4) and 30 nodules (24.8%) to have high (ATA risk score of 5) levels of suspicion (Table 2).
Of the 683 diagnostic ultrasonographic images selected, 228 (33.4%) were images of genetically high-risk nodules (n = 43 nodules) and 455 (66.6%) were of genetically low-risk nodules (n = 91 nodules). High-risk genetic mutations identified in pathologically confirmed malignant nodules included PTEN (1 nodule; GenBank GeneID 5728), BRAF (9 nodules; GenBank GeneID 673), NRAS (4 nodules; GenBank GeneID 4893), TERT (3 nodules; GenBank GeneID 7015), IDH1 (1 nodule; GenBank GeneID 3417), and HRAS (1 nodule; GenBank GeneID 3265). The subsequent machine learning workflow used by AutoML Vision is demonstrated in Figure 2. Of the 683 images submitted, 556 (81.4%) were used for training, 74 (10.8%) were used for validation, and 53 (7.8%) were used for final testing.
Model classification as high or low genetic risk was compared against molecular testing classification of risk. The performance of the model is depicted in Table 4. The model-classified nodules within the test set (53 images) demonstrated a low sensitivity of 45.0% (95% CI, 23.1%-68.5%). However, specificity was higher at 97.0% (95% CI, 84.2%-99.9%). The positive predictive value (PPV) for the model was 90.0% (95% CI, 55.2%-98.5%), and the negative predictive value (NPV) was 74.4% (95% CI, 66.1%-81.3%). The overall model accuracy was 77.4% (95% CI, 63.8%-87.7%).
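The performance measures above follow from a 2 × 2 table of the 53 test images. The sketch below uses counts inferred from the reported percentages (9 true-positive, 11 false-negative, 32 true-negative, and 1 false-positive image); these counts are an assumption, because Table 4 is not reproduced here. Exact (Clopper-Pearson) binomial intervals are computed for sensitivity, specificity, and accuracy. The intervals reported for PPV and NPV appear to come from a different method and are not reproduced.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(successes, n, alpha=0.05):
    """Exact two-sided binomial confidence interval, solved by bisection."""

    def bisect(predicate):
        # predicate(p) is False below the root and True above it
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if predicate(mid):
                hi = mid
            else:
                lo = mid
        return (lo + hi) / 2

    lower = 0.0 if successes == 0 else bisect(
        lambda p: 1 - binom_cdf(successes - 1, n, p) >= alpha / 2)
    upper = 1.0 if successes == n else bisect(
        lambda p: binom_cdf(successes, n, p) <= alpha / 2)
    return lower, upper


# Assumed counts, inferred from the reported percentages (not given in the text).
tp, fn, tn, fp = 9, 11, 32, 1

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)
npv = tn / (tn + fn)
accuracy = (tp + tn) / (tp + fn + tn + fp)

sens_ci = clopper_pearson(tp, tp + fn)
spec_ci = clopper_pearson(tn, tn + fp)
acc_ci = clopper_pearson(tp + tn, tp + fn + tn + fp)

for name, value, ci in [("sensitivity", sensitivity, sens_ci),
                        ("specificity", specificity, spec_ci),
                        ("accuracy", accuracy, acc_ci)]:
    print(f"{name}: {value:.1%} (95% CI, {ci[0]:.1%}-{ci[1]:.1%})")
print(f"PPV: {ppv:.1%}, NPV: {npv:.1%}")
```

With only 20 high-risk images in the test set, the sensitivity interval is wide, which is worth keeping in mind when interpreting the point estimate.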
The management of Bethesda categories III and IV thyroid nodules remains a challenge. The goal is to maximize resection of malignant nodules and minimize resection of benign nodules. Risk-stratification adjuncts such as molecular testing can help to alleviate the management dilemma. Machine learning algorithms analyzing readily available ultrasonography data may provide a cost-effective and rapid point-of-care addition to the armamentarium of the endocrine surgeon.
Ultrasonography has been suggested as a tool to aid in risk stratification for cytologically indeterminate nodules, and various proposed rating systems consider sonographic markers such as composition, echogenicity, dimensions, and margins.3 Studies show that with sonography, increased size is actually inversely associated with malignant neoplasm risk.4,11 Although the nodules can be assigned to risk classifications on the basis of their ultrasonographic findings, evidence indicates that these classifications are not clinically significant in the management of indeterminate nodules.12-14
Commercially available molecular tests have proven to be valuable in assessing indeterminate thyroid nodules. The test ThyroSeq v2 (UPMC and CBLPath) used next-generation sequencing to find point mutations in 13 genes and 42 gene fusions that have been implicated in thyroid cancers. ThyroSeq v2 performed well at predicting the histologic structure of nodules as benign or malignant, with a sensitivity of 90%, a specificity of 93%, a PPV of 83%, an NPV of 96%, and an accuracy of 92%.15,16 Afirma GSC (Veracyte Inc), which is marketed as a rule-out test and uses an RNA genomic sequencing classifier to test more than 10 000 genes and additional sequences targeting parathyroid lesions, performed with a sensitivity of 91%, a specificity of 68%, an NPV of 96%, and a PPV of 47%.17,18 The success of these commercial products has allowed for some institutions to refer selected patients with indeterminate cytological results to surgical treatment on the basis of the identification of high-risk genetic mutations.18,19
The model we built performed with a high specificity (97%) and PPV (90%) for ruling in the presence of a high-risk mutation on the basis of sonography alone. A limited number of patients (n = 24) had pathologically confirmed malignant nodules, and a small number of malignant nodules were attributed to any single mutation. As such, mutation-specific sonographic predictors could not be evaluated at this time, which makes this endeavor an important future research direction. Although sensitivity was at 45%, the model showed promise at detecting a subset of patients for whom directly proceeding to surgical treatment might be advantageous, obviating the need for additional cytogenetic analysis and thus mitigating cost as well as unnecessary invasive procedures. Specifically, although molecular testing is available at advanced institutions, the ATA recommends cautious adoption.19 Clinical context with consideration of growth kinetics, sonographic features, and nuclear medicine should be examined. A model developed by automated machine learning may provide an additional data point to assist clinicians in decision-making.
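Because PPV and NPV depend on how common high-risk mutations are among the nodules tested, the rule-in value of the model may differ in populations with a different prevalence. The sketch below applies Bayes' theorem with the sensitivity and specificity taken as 9/20 and 32/33, fractions inferred from the reported percentages. Prevalence values other than the test-set value (20/53) are hypothetical and purely illustrative.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


def npv(sensitivity, specificity, prevalence):
    """Negative predictive value from Bayes' theorem."""
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)


SENSITIVITY = 9 / 20   # assumed fraction behind the reported 45%
SPECIFICITY = 32 / 33  # assumed fraction behind the reported 97%

# 20/53 is the share of high-risk images in the test set; the rest are hypothetical.
for prevalence in (0.10, 0.20, 20 / 53, 0.50):
    print(f"prevalence {prevalence:.1%}: "
          f"PPV {ppv(SENSITIVITY, SPECIFICITY, prevalence):.1%}, "
          f"NPV {npv(SENSITIVITY, SPECIFICITY, prevalence):.1%}")
```

At the test-set prevalence the formulas return the reported PPV and NPV; at lower prevalence the PPV falls, so a positive model result would carry less weight.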
In this study, automated machine learning was performed on AutoML Vision. This software is proprietary, but the commercial platform uses a combination of architectural search and transfer learning to optimize the development of a convolutional neural network classifier.20 Machine learning for the prediction of the malignant neoplasm risk of thyroid nodules has been described in the literature.21,22 Variable success has been described, along with concerns regarding underperformance (compared with the performance of experienced radiologists) and methodologic standardization or validation while classifying nodules as malignant or benign.21,22 Instead, we focused on the prediction of genetic status by ultrasonographic images, offering an additional diagnostic axis to malignant neoplasm risk.
This study has some limitations. First, no anatomic segmentation was used; the entire ultrasonographic image of the lesion was analyzed. Although images were cropped to exclude annotative features, adjacent lesions or structures (such as the internal jugular or carotid vessels or the trachea) were visualized in some patients. At this time, because hyperparameter optimization was performed entirely internally, ascertaining whether overfitting to irrelevant image features is occurring is not yet possible. However, analyzing the entire image also provides the advantage of incorporating locoregional changes that might stratify the lesion. In addition, mutations with unknown significance, included in the low-risk group, may in the future be found to confer high-risk status. Second, this study was retrospective and is thus reliant on the veracity of the electronic health record, and the ultrasonography machine and operator were not controlled for. Third, given that multiple images were extracted from a single nodule’s ultrasonography and each image was treated as an independent nodule, learner bias was possible if same-nodule images existed in both the training and testing sets. However, multiple images generated from a single nodule all varied in their imaging plane and overall orientation. After cropping, the relative independence of these images was confirmed by an experienced radiologist, who was unable to correctly match images to corresponding images from the same nodule. We believe this limitation will be overcome in the future by expanding the image library, at which point multi-image extraction will be unnecessary.
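One way to remove the possibility of same-nodule images appearing in both training and testing sets is to assign whole nodules, rather than single images, to each partition. The sketch below illustrates that idea on made-up data. It is not the procedure used in this study, in which the platform performed the split, and the file names, nodule identifiers, and function name are hypothetical.

```python
import random
from collections import defaultdict


def split_by_nodule(image_to_nodule, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole nodules (not single images) to train/validation/test."""
    nodules = sorted(set(image_to_nodule.values()))
    random.Random(seed).shuffle(nodules)
    n_train = round(fractions[0] * len(nodules))
    n_val = round(fractions[1] * len(nodules))

    split_of = {}
    for index, nodule in enumerate(nodules):
        if index < n_train:
            split_of[nodule] = "train"
        elif index < n_train + n_val:
            split_of[nodule] = "validation"
        else:
            split_of[nodule] = "test"

    # Every image follows its nodule, so no nodule spans two partitions.
    splits = defaultdict(list)
    for image, nodule in image_to_nodule.items():
        splits[split_of[nodule]].append(image)
    return dict(splits)


# Hypothetical toy data: 20 nodules with 3 to 7 images each.
rng = random.Random(42)
image_to_nodule = {
    f"nodule{n:03d}_img{i}.png": f"nodule{n:03d}"
    for n in range(20)
    for i in range(rng.randint(3, 7))
}

splits = split_by_nodule(image_to_nodule)
for name, images in splits.items():
    nodule_count = len({image_to_nodule[image] for image in images})
    print(f"{name}: {len(images)} images from {nodule_count} nodules")
```

Because nodules contribute different numbers of images, the image-level proportions only approximate 80%, 10%, and 10%; this is the trade-off for partitions that share no nodules.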
Ultimately, automated machine learning shows promise in the development of a model based on ultrasonography alone to predict lesions likely to harbor high-risk genetic mutations in thyroid nodules. Future research may focus on model optimization and performance along with validation, including feature extraction to confirm the use of only anatomically relevant features, anatomic segmentation for increased lesion specificity, and prospective validation. Expanded data sets with additional standardized imaging can likely improve further iterations of the model.
Computer-aided diagnosis systems are increasingly prevalent in research and early clinical application. Specifically, machine learning application to sonographic genetic risk stratification of thyroid nodules appears feasible. Future research to increase sample size, improve data standardization in a prospective fashion, and identify anatomically relevant segments of automated decision-making is necessary for model development and maturation. The model we developed showed a high specificity for identifying nodules with high-risk mutations on molecular testing. This preliminary study shows promise for the diagnostic applications of machine learning interpretation of sonographic imaging of indeterminate thyroid nodules.
Accepted for Publication: August 27, 2019.
Corresponding Author: John Eisenbrey, MD, Department of Radiology, Thomas Jefferson University, 132 S. 10th St, Suite 767, Main Building, Philadelphia, PA 19107 (email@example.com)
Published Online: October 24, 2019. doi:10.1001/jamaoto.2019.3073
Author Contributions: Dr Eisenbrey had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Concept and design: Daniels, Gummadi, Zhu, Lyshchik, Curry, Cottrill, Eisenbrey.
Acquisition, analysis, or interpretation of data: Daniels, Gummadi, Zhu, Wang, Patel, Swendseid, Cottrill, Eisenbrey.
Drafting of the manuscript: Daniels, Gummadi, Zhu, Wang, Curry, Eisenbrey.
Critical revision of the manuscript for important intellectual content: Gummadi, Wang, Patel, Swendseid, Lyshchik, Cottrill, Eisenbrey.
Statistical analysis: Daniels, Gummadi, Zhu, Wang, Eisenbrey.
Obtained funding: Lyshchik.
Administrative, technical, or material support: Daniels, Gummadi, Wang, Patel, Curry, Cottrill, Eisenbrey.
Supervision: Gummadi, Swendseid, Lyshchik, Curry, Cottrill, Eisenbrey.
Conflict of Interest Disclosures: Dr Lyshchik reported receiving grants and personal fees from GE Healthcare, grants from Bracco Diagnostics, and book royalties from Elsevier outside the submitted work. Dr Eisenbrey reported receiving grants and nonfinancial support from GE Healthcare as well as nonfinancial support from Siemens Healthineers and Canon Medical outside the submitted work. No other disclosures were reported.
Meeting Presentation: The results of this study were presented at the 2019 American Head and Neck Society Annual Meeting at the Combined Otolaryngology Spring Meetings; May 1, 2019; Austin, Texas.
et al. 2015 American Thyroid Association management guidelines for adult patients with thyroid nodules and differentiated thyroid cancer: the American Thyroid Association Guidelines Task Force on Thyroid Nodules and Differentiated Thyroid Cancer. Thyroid. 2016;26(1):1-133. doi:10.1089/thy.2015.0020
et al. Comparative analysis of diagnostic performance, feasibility and cost of different test-methods for thyroid nodules with indeterminate cytology. Oncotarget. 2017;8(30):49421-49442. doi:10.18632/oncotarget.17220
JG. Research electronic data capture (REDCap)–a metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inform. 2009;42(2):377-381. doi:10.1016/j.jbi.2008.08.010
DL. Validation of American Thyroid Association ultrasound risk assessment of thyroid nodules selected for ultrasound fine-needle aspiration. Thyroid. 2017;27(8):1077-1082. doi:10.1089/thy.2016.0555
J-H. Cytology-ultrasonography risk-stratification scoring system based on fine-needle aspiration cytology and the Korean-Thyroid Imaging Reporting and Data System. Thyroid. 2017;27(7):953-959. doi:10.1089/thy.2016.0603
et al. Ultrasound is helpful to differentiate Bethesda class III thyroid nodules: a PRISMA-compliant systematic review and meta-analysis. Medicine (Baltimore). 2017;96(16):e6564. doi:10.1097/MD.0000000000006564
et al. Risk stratification of thyroid nodules with atypia of undetermined significance/follicular lesion of undetermined significance (AUS/FLUS) cytology using ultrasonography patterns defined by the 2015 ATA guidelines. Ann Otol Rhinol Laryngol. 2017;126(9):625-633. doi:10.1177/0003489417719472
et al. Highly accurate diagnosis of cancer in thyroid nodules with follicular neoplasm/suspicious for a follicular neoplasm cytology by ThyroSeq v2 next-generation sequencing assay. Cancer. 2014;120(23):3627-3634. doi:10.1002/cncr.29038
et al; American Thyroid Association Surgical Affairs Committee. American Thyroid Association statement on surgical application of molecular profiling for thyroid nodules: current impact on perioperative decision making. Thyroid. 2015;25(7):760-768. doi:10.1089/thy.2014.0502
QV. Learning transferable architectures for scalable image recognition [published online April 2018]. Proc IEEE Comput Soc Conf Comput Vis Pattern Recognit. doi:10.1109/CVPR.2018.00907
J. Classifier model based on machine learning algorithms: application to differential diagnosis of suspicious thyroid nodules via sonography. AJR Am J Roentgenol. 2016;207(4):859-864. doi:10.2214/AJR.15.15813