eAppendix 1. Author Contributions and Data Set
eAppendix 2. Supplementary Methods
eAppendix 3. Supplementary Material
Oktay O, Nanavati J, Schwaighofer A, et al. Evaluation of Deep Learning to Augment Image-Guided Radiotherapy for Head and Neck and Prostate Cancers. JAMA Netw Open. 2020;3(11):e2027426. doi:10.1001/jamanetworkopen.2020.27426
Key Points
Question
Can machine learning models achieve clinically acceptable accuracy in image segmentation tasks in radiotherapy planning and reduce overall contouring time?
Findings
This quality improvement study was conducted on a set of 242 head and neck and 519 pelvic computed tomography scans acquired for radiotherapy planning at 8 distinct clinical sites with heterogeneous population groups and image acquisition settings. The proposed technology achieved levels of accuracy within interexpert variability; statistical agreement was observed for 13 of 15 structures while reducing the annotation time by a mean of 93% per scan.
Meaning
The study findings highlight the opportunity for widespread adoption of autosegmentation models in radiotherapy workflows to reduce overall contouring and planning time.
Importance
Personalized radiotherapy planning depends on high-quality delineation of target tumors and surrounding organs at risk (OARs). This process puts additional time burdens on oncologists and introduces variability among both experts and institutions.
Objective
To explore clinically acceptable autocontouring solutions that can be integrated into existing workflows and used in different domains of radiotherapy.
Design, Setting, and Participants
This quality improvement study used a multicenter imaging data set comprising 519 pelvic and 242 head and neck computed tomography (CT) scans from 8 distinct clinical sites and patients diagnosed with either prostate or head and neck cancer. The scans were acquired as part of treatment dose planning from patients who received intensity-modulated radiation therapy between October 2013 and February 2020. Fifteen different OARs were manually annotated by expert readers and radiation oncologists. The models were trained on a subset of the data set to automatically delineate OARs and evaluated on both internal and external data sets. Data analysis was conducted from October 2019 to September 2020.
Main Outcomes and Measures
The autocontouring solution was evaluated on external data sets, and its accuracy was quantified with volumetric agreement and surface distance measures. Models were benchmarked against expert annotations in an interobserver variability (IOV) study. Clinical utility was evaluated by measuring time spent on manual corrections and annotations from scratch.
Results
A total of 519 participants’ (519 [100%] men; 390 [75%] aged 62-75 years) pelvic CT images and 242 participants’ (184 [76%] men; 194 [80%] aged 50-73 years) head and neck CT images were included. The models achieved levels of clinical accuracy within the bounds of expert IOV for 13 of 15 structures (eg, left femur, κ = 0.982; brainstem, κ = 0.806) and performed consistently well across both external and internal data sets (eg, mean [SD] Dice score for left femur, internal vs external data sets: 98.52% [0.50] vs 98.04% [1.02]; P = .04). The correction time of autogenerated contours on 10 head and neck and 10 prostate scans was measured as a mean of 4.98 (95% CI, 4.44-5.52) min/scan and 3.40 (95% CI, 1.60-5.20) min/scan, respectively, to ensure clinically accepted accuracy. Manual segmentation of the head and neck scans took a mean of 86.75 (95% CI, 75.21-98.29) min/scan for an expert reader and 73.25 (95% CI, 68.68-77.82) min/scan for a radiation oncologist. The autogenerated contours represented a 93% reduction in time.
Conclusions and Relevance
In this study, the models achieved levels of clinical accuracy within expert IOV while reducing manual contouring time and performing consistently well across previously unseen heterogeneous data sets. With the availability of open-source libraries and reliable performance, this creates significant opportunities for the transformation of radiation treatment planning.
Each year, more than half a million patients are diagnosed with cancer and receive radiotherapy either alone or in combination with surgery.1,2 Intensity-modulated radiation therapy has become a key component of contemporary cancer treatment because of reduced treatment-induced toxic effects, with 40% of successfully cured patients undergoing some form of radiotherapy.3 Development of personalized radiation treatment plans that match a patient’s unique anatomical configuration of tumor and organs at risk (OARs) is a multistep process starting with the acquisition of cross-sectional images and the segmentation of relevant anatomical volumes within the images through to dose calculation and subsequent delivery of radiation to the patient.
The segmentation of the images represents a significant rate-limiting factor within this treatment workflow. Currently, this task is performed manually by an oncologist using specially designed software to draw contours around the regions of interest. While the task demands considerable clinical judgement, it is also laborious and repetitive, with contoured volumes needing to be constructed slice by slice across entire cross-sectional volumes. Consequently, it is an extremely time-consuming process, often taking up to several hours per patient.4 Not only can it create delays in the workflow that may be detrimental to patient outcomes, but it also comes with an increasing financial burden to the hospital. As such, there is significant motivation to provide automated or semiautomated support to reduce the overall segmentation time for this process.
In addition to long contouring times, there are challenges that derive from a dependency on computed tomography (CT) scans as primary reference images for tumor and healthy tissue anatomy. The inherent limitation of CT images in terms of image contrast on soft tissues makes segmentation challenging, and there remains uncertainty in the exact extent of tumor and normal tissues. This introduces a further key challenge for manual contouring; it is well documented as a source of interobserver variability (IOV) in segmentation.5-11 Such variability can affect subsequent dose calculations, with the potential for poorer patient outcomes.12 Likewise, it presents a concern in the context of clinical trials carried out across multiple hospital sites. In addition to time savings, automating contouring would offer the potential for greater standardization.
There has been significant investment to establish autosegmentation techniques that aim to reduce time and variability. Recent efforts are exploring machine learning (ML) methods for autosegmentation of CT scans in radiotherapy.13-16 While they achieve reasonable accuracy within the same-site data sets on which they are trained and evaluated, model performance is often compromised when deployed across other hospital sites. Such approaches can be further limited in adaptability to different clinical domains of radiotherapy. Restricting these algorithms to a single bodily region or a single hospital site with specific acquisition protocols limits the value and applicability of these approaches in real-world clinical contexts. Furthermore, integration of such tools into existing hospital workflows is often not considered. To address these limitations, we present a generic segmentation solution for both prostate and head and neck cancer treatment planning and demonstrate how it can be integrated into existing workflows.
All data sets were licensed under an agreement with the clinical sites involved and received a favorable opinion from the research ethics committee from the East of England–Essex research ethics committee and the Health Research Authority. Under the agreements between the parties, the clinical sites agreed to obtain all consents, permissions, and approvals. This study followed the Standards for Quality Improvement Reporting Excellence (SQUIRE) reporting guideline.
The proposed segmentation method is based on a state-of-the-art convolutional neural network (CNN) model, and the same methodology is applied to both prostate and head and neck imaging data sets. It uses a variant of the 3-dimensional (3D) U-Net model17 to generate contours of the OARs from raw 3D CT images (eAppendix 2 in the Supplement).
For prostate cancer, the model focuses on contouring the following 6 structures: the prostate gland, seminal vesicles, bladder, left and right femurs, and rectum. For the purposes of radiotherapy planning in prostate cancer, radiation oncologists consider the prostate gland to be the target volume, while the remaining structures are delineated as OARs. In the case of head and neck cancer, we used a subset of OAR structures defined by a panel of radiation oncologists,18 based on their relevance to routine head and neck radiotherapy (Table 1). The proposed model is trained to automatically delineate these 9 structures on a given head CT scan.
We aggregated 519 pelvic and 242 head and neck CT scans acquired at 8 different clinical sites from patients diagnosed either with prostate or head and neck cancer, as outlined in eAppendix 1 in the Supplement. The scans show variation across sites due to differences in scanner type and acquisition protocols. For experimental purposes, the images are grouped into 2 disjoint sets: main and external, as outlined in eAppendix 1 in the Supplement. The former is intended to be used for model training and testing purposes; the latter is an excluded data set composed of images from 3 randomly selected clinical sites and used to measure the model’s generalization capability to unseen data sets. The main data set does not contain any images from these 3 excluded sites, thereby enabling a masked evaluation to be performed on the external data set. The images were manually annotated by 2 clinically trained expert readers (R.J. and G.B.) and a radiation oncologist masked to the others’ annotations; as such, all structures in each image were manually contoured by 1 expert and later reviewed by a separate oncologist. For further details of the manual contouring process, see eAppendix 1 in the Supplement. The external head and neck data set was formed by using the head CT scans released by Nikolov et al,15 which is an open-source data set19 for benchmarking head and neck CT segmentation models and was acquired in The Cancer Imaging Archive Cetuximab20 and The Cancer Genome Atlas Head-Neck Squamous Cell Carcinoma studies.
To evaluate model performance, we used the Dice coefficient21 as a similarity metric, which quantifies the correspondence between pairs of volumetric segmentations for the same structure. Perfectly overlapping structures result in a Dice score of 100.00%, while a Dice score of 0.00% corresponds to a complete lack of overlap. In addition, we measured the discrepancy between pairs of contour surfaces using Hausdorff and mean surface-to-surface distance metrics (in mm). The metrics are visually presented and described further in eAppendix 3 in the Supplement.
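As a minimal illustration (not the study's implementation), the Dice coefficient for a pair of binary segmentation masks can be computed as follows; the toy cube masks are hypothetical:

```python
import numpy as np

def dice_score(a, b):
    """Dice similarity coefficient between two binary masks, in percent.

    100.00% indicates perfect overlap; 0.00% indicates no overlap.
    """
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 100.0  # both masks empty: treat as perfect agreement
    return 200.0 * np.logical_and(a, b).sum() / total

# Toy 3D example with two partially overlapping cubes (hypothetical data)
x = np.zeros((10, 10, 10)); x[2:8, 2:8, 2:8] = 1
y = np.zeros((10, 10, 10)); y[3:9, 3:9, 3:9] = 1
print(f"{dice_score(x, y):.2f}%")   # partial overlap
print(f"{dice_score(x, x):.2f}%")   # perfect overlap -> 100.00%
```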
An ensemble of CNN models was trained with different training and validation set splits from the main data set while leaving out a fixed disjoint testing set (see eAppendix 2 in the Supplement for details). The agreement between contours generated by the model and expert readers was measured statistically with the Cohen and Fleiss κ22 for single and multiple annotators, respectively. For each structure, an agreement score was computed on foreground pixels defined by a binary mask. This is intended to avoid a possible bias due to the large number of background pixels. Similarly, Bland-Altman plots23 were generated to visualize the level of agreement at the patient level (eAppendix 3 in the Supplement). The performance differences observed between the main and external sites were statistically tested with the Mann-Whitney test.24 The same model training setup was also deployed on the main head CT data set to train a head and neck model that can delineate OARs in the context of head and neck radiotherapy (Table 1). Figure 1 shows a qualitative assessment of contours predicted with the proposed models. Additionally, to identify any gross contouring mistakes, the segmentations were also compared in terms of geometric surface distances.
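The foreground-restricted κ computation can be sketched as follows. How the evaluation region is defined is an assumption made here for illustration (the union of both masks dilated by a small margin), not a detail taken from the study; the example masks are hypothetical:

```python
import numpy as np
from scipy.ndimage import binary_dilation
from sklearn.metrics import cohen_kappa_score

def foreground_kappa(pred, ref, margin=3):
    """Cohen kappa between two binary segmentations, evaluated only on
    voxels near the structure: the union of both masks, dilated by
    `margin` voxels. Restricting to this region avoids inflating the
    agreement score with the vast number of background voxels.
    (The dilated-union region is an illustrative assumption.)
    """
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    region = binary_dilation(pred | ref, iterations=margin)
    return cohen_kappa_score(pred[region].astype(int), ref[region].astype(int))

# Hypothetical masks from 2 annotators: one cube shifted along one axis
p = np.zeros((10, 10, 10), dtype=bool); p[2:8, 2:8, 2:8] = True
r = np.zeros((10, 10, 10), dtype=bool); r[3:9, 2:8, 2:8] = True
kappa = foreground_kappa(p, r)
```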
In a second set of experiments to test generalization to data sets from unseen clinical sites, the previously trained pelvic and head and neck CT models were tested on their corresponding external data sets, which comprised images acquired at 3 clinical sites that were excluded from the training and validation (main) data sets. With this experiment, the aim was to assess the generalization of the trained models to unseen CT acquisition protocols and patient groups.
All statistical analyses were conducted using Python version 3.7.3 (Python Software Foundation), with scikit-learn package version 0.21.1 for the Cohen-Fleiss κ and scipy package version 1.3.1 for the Mann-Whitney test. Statistical significance was set at P < .01 for null hypothesis testing and κ > 0.75 for the agreement analysis. All tests were 2-tailed.
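Using the packages and threshold stated above, the internal-vs-external comparison can be sketched as follows; the per-scan Dice values are hypothetical, not the study's data:

```python
from scipy.stats import mannwhitneyu

# Hypothetical per-scan Dice scores (%) for one structure
internal_dice = [98.6, 98.2, 98.9, 98.4, 98.7, 98.1, 98.5, 98.3]
external_dice = [98.1, 97.6, 98.3, 96.9, 98.0, 97.8, 98.2, 97.5]

# Two-tailed Mann-Whitney test of the internal vs external difference
stat, p_value = mannwhitneyu(internal_dice, external_dice,
                             alternative="two-sided")

# The study's threshold for statistical significance
is_significant = p_value < 0.01
```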
A total of 519 participants’ (519 [100%] men; 390 [75%] aged 62-75 years) pelvic CT images and 242 participants’ (184 [76%] men; 194 [80%] aged 50-73 years) head and neck CT images were included. The prostate segmentation results (Table 2) show that the autogenerated organ delineations (ensemble) for prostate scans were consistent with the contours produced by clinical experts, with surface errors being within the acceptable error bound (eg, left femur, κ = 0.982). Head and neck segmentation results were similarly consistent (eg, brainstem, κ = 0.806) (Table 1). Likewise, in validations on external data sets (Table 1 and Table 2), the model performed consistently well in both radiotherapy domains across multiple sites (eg, mean [SD] Dice score for left femur, internal vs external data sets: 98.52% [0.50] vs 98.04% [1.02]; P = .04), with only a slight performance drop in segmenting the submandibular glands due to low tissue contrast. We observed that segmentation errors tended to occur at the superior and inferior extent of tubular structures and at the interface between adjacent organs. However, we did not observe any inconsistencies that, if left uncorrected, could lead to significant errors in a treatment plan, as evidenced by the surface distance results. This is because the proposed postprocessing method, by design, does not allow inconsistencies at a distance from the anatomical structure.
An acceptable measure of performance is expected to be within the bounds of IOV found in human experts.5,7 The IOV Dice scores and surface distances between 3 experts contouring 10 test images for each radiotherapy domain are provided in Table 1 and Table 2. For 13 of 15 structures, statistical agreement (ie, κ > 0.75) was observed between autogenerated contours and expert annotations. The reference contours were determined by applying a majority voting scheme using all 3 annotators: at least 2 experts must have agreed for a structure to be considered present. For all the structures except the submandibular glands (SMGs), the similarity scores with ground truth met the criterion of being on par with levels of expert IOV in contouring, as indicated by the κ values and Bland-Altman plots (eAppendix 3 in the Supplement) collected for the agreement analysis. For more clearly defined structures with high contrast, such as the bladder and femurs, there is reasonably high consistency across the experts (κ > 0.96). But for lower-contrast, deformable structures, such as the prostate gland, seminal vesicles, and SMGs, we see higher variability because the organ boundaries are typically unclear under such adverse conditions (Figure 2 and Table 2). A similar pattern of performance differences is seen in the contours generated by the model, where the same test images are segmented and compared qualitatively with the same reference contours (Figure 2).
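The majority-voting scheme described above can be sketched as a voxelwise vote across the 3 annotators' masks (illustrative only; the masks below are hypothetical):

```python
import numpy as np

def majority_vote(masks):
    """Voxelwise majority vote across annotators' binary masks: a voxel
    enters the reference contour only if at least 2 of the 3 annotators
    marked it as foreground."""
    stacked = np.stack([np.asarray(m, dtype=int) for m in masks])
    return (stacked.sum(axis=0) >= 2).astype(np.uint8)

# Toy 1D example with 3 annotators (hypothetical)
a1 = np.array([1, 1, 0, 0])
a2 = np.array([1, 0, 1, 0])
a3 = np.array([1, 1, 1, 0])
reference = majority_vote([a1, a2, a3])  # -> [1, 1, 1, 0]
```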
The clinical benefit of the models was assessed by comparing the time to correct autogenerated contours with the time to manually contour images from scratch. For this analysis, the head and neck IOV-10 and prostate IOV-10 data sets were used, in which the scans had varying imaging quality, ranging from good (15) to poor (5). An in-house annotation tool was used for both contouring and correction tasks. The tool features assistive contouring and interactive contour refinement modules that ease the contouring task while ensuring high segmentation accuracy. Manual segmentation of the head and neck scans for the same 9 OARs took a mean of 86.75 (95% CI, 75.21-98.29) min/scan for an expert reader and 73.25 (95% CI, 68.68-77.82) min/scan for a radiation oncologist. For the same scans, the review and correction time of autocontours, which were inspected and updated (if necessary) by the oncologist to ensure the clinical accuracy required for treatment planning, was measured as 4.98 (95% CI, 4.44-5.52) min/scan for head and neck scans and 3.40 (95% CI, 1.60-5.20) min/scan for prostate scans. This represented a mean 93% reduction in time. Among all 20 scans, the slowest correction time per scan was measured as 7.05 minutes because of low imaging quality. The model took a mean inference time of 23 (95% CI, 20-26) seconds to segment the target and all OARs in a full input CT scan.
Several frameworks have been proposed for autosegmentation of head and neck15,25 and pelvic organs.13,16,26 In 2 studies,13,16 the authors describe an approach for prostate and OAR segmentation, where organ localization is performed prior to segmentation. Their algorithm was validated on a data set of 88 CT scans. A similar cascaded autosegmentation approach was proposed in Wang et al26 to delineate OARs in head and neck CT scans; this study was conducted by training (33 scans) and evaluating (15 scans) using the public data set released in an autosegmentation challenge.27
There have been efforts to show the potential clinical use cases of ML solutions for automatic OAR contouring. In contrast to previous work, in which evaluations were performed on small sets of homogeneous images, we evaluated how ML solutions could lead to generalized performance across (1) different radiotherapy domains and (2) data sets from multiple sites. We aimed to demonstrate the robustness and generalizability of these solutions. More importantly, we found that integrating these models into clinical workflows could reduce the time required to prepare dose plans for treatment.
The models demonstrated performance generalizability across diverse acquisition settings while achieving good levels of agreement with expert contours. This could facilitate easier deployment in new clinical sites. Of further importance for any practical adoption of this technology across large scale health care systems is the ability to work across diverse clinical domains. We have shown how our approach, without any substantial changes, can enable the training of models in diverse radiotherapy domains, as demonstrated through applications in prostate and head and neck cancer. This is especially significant given the distinct imaging challenges associated with these different domains.
Practical adoption in clinical contexts is enhanced by incorporating the presented models into the existing workflow of radiation oncologists (Figure 3). The illustrated system has been implemented and evaluated by clinical experts working at Cambridge University Hospitals. In this workflow, CT scans are acquired from patients as they attend preparations for radiotherapy treatment. These scans are initially stored in the hospital’s image database and, after anonymization, securely transferred via the gateway to the autosegmentation platform in the cloud. Once the segmentation process is completed, the resultant files are uploaded back to the hospital’s image database, creating a seamless clinical workflow in which clinicians can review and refine contours in their existing contouring and planning tools.
Bringing these ML tools to the point where they can be meaningfully adopted in clinical practice requires a level of clinical accuracy commensurate with expert observers. While the models have performed well in this regard, in instances where the model performed poorly, the opportunity to manually correct the segmentations remains a necessary component of the presented workflow. The presented workflow enables oncologists to use their existing clinical systems for review and editing, which makes this technology more accessible across clinics because the existing workflows are maintained. At the same time, clinicians can inspect and edit contours in minutes rather than hours. Such time savings are significant even when considered only in absolute terms.
The source code used in this study is made publicly available.28 This creates an opportunity for oncology centers to use this technology to train and deploy new models using their own data sets. In this way, users can include other normal tissue structures in the autocontouring pipeline, including cochlear and oral-cavity structures in head and neck cancer treatments. The availability of new public data sets and sharing across clinics is an important milestone in improving the performance of models and making them accessible. Similarly, image quality (IQ) assurance29 is essential for reliable use of models. IQ assessment should be performed prior to model deployment30 both at acquisition and processing time to filter out images with metal artifacts. Training models on a diverse set of data sets, as performed in this study, is an effective way to cope with low-contrast (eg, cone-beam CT) and high-noise images. External data set validation is also essential to measure such impacts; for instance, the images from the external head and neck data set used in this study contained severe beam-hardening artifacts.
More adaptive forms of radiotherapy, in which anatomy is resegmented and the dose plan reoptimized for each fraction, are regarded as a more ideal way to deliver treatment,31 but they have been challenging to adopt because of their heavy resource demands.32 In that regard, the presented technology can enable continuous resegmentation and adaptive reoptimization of therapy to be adopted at scale. For instance, in the cases of hypofractionated regimens or emergency treatments, extending these models to resegment anatomy on scans would have significant clinical utility, saving time and allowing patients to progress to treatment more quickly. Integration with technologies such as the magnetic resonance linear accelerator,33 used for simultaneous imaging and dose delivery, could also potentially offer more adaptive forms of treatment that pinpoint the location of tumors at the time of treatment.
This study has limitations. The data sets used in the IOV and annotation time experiments are smaller than those used in the remaining evaluations presented in this study. For greater statistical power, these experiments should be repeated with larger data sets of varying imaging quality. Additionally, the surface and Dice metrics used in model evaluation do not always correlate with time savings in the manual contouring process.15,34 This necessitates the design of new metrics that quantify segmentation errors by taking into account the cost of the user interaction required to correct them.
This study found that ML-based autosegmentation reduces contouring time while yielding clinically valid structural contours on heterogeneous data sets for both prostate and head and neck radiotherapy planning. This is evidenced in evaluations on external data sets and IOV experiments conducted on a multisite data set. Overall, the approach contributes to the practical challenges of scalable adoption across health care systems through off-the-shelf extensibility across hospital sites and applicability across multiple cancer domains. Future ML studies validating the applicability of the proposed technology on other radiotherapy domains and larger data sets will be valuable for wider adoption of ML solutions in health care systems.
Accepted for Publication: October 1, 2020.
Published: November 30, 2020. doi:10.1001/jamanetworkopen.2020.27426
Correction: This article was corrected on December 9, 2020, to fix an error in the Abstract.
Open Access: This is an open access article distributed under the terms of the CC-BY-NC-ND License. © 2020 Oktay O et al. JAMA Network Open.
Corresponding Author: Ozan Oktay, PhD, Health Intelligence, Microsoft Research, 21 Station Rd, Cambridge, CB1 2FB United Kingdom (firstname.lastname@example.org).
Author Contributions: Dr Oktay had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Concept and design: Oktay, Nanavati, Schwaighofer, Tanno, Jena, Noble, Glocker, Bishop, Alvarez-Valle, Nori.
Acquisition, analysis, or interpretation of data: Oktay, Nanavati, Schwaighofer, Carter, Bristow, Jena, Barnett, Noble, Rimmer, Glocker, O'Hara, Alvarez-Valle.
Drafting of the manuscript: Oktay, Nanavati, Schwaighofer, Bristow, Jena, Glocker, O'Hara, Alvarez-Valle, Nori.
Critical revision of the manuscript for important intellectual content: Oktay, Schwaighofer, Carter, Tanno, Jena, Barnett, Noble, Rimmer, Glocker, O'Hara, Bishop, Alvarez-Valle, Nori.
Statistical analysis: Oktay, Nanavati, Schwaighofer, Carter.
Obtained funding: Nori.
Administrative, technical, or material support: Oktay, Schwaighofer, Bristow, Jena, Noble, Bishop, Alvarez-Valle, Nori.
Supervision: Oktay, Schwaighofer, Jena, Barnett, Rimmer, Glocker, Alvarez-Valle, Nori.
Conflict of Interest Disclosures: Dr Jena reported receiving personal fees from Microsoft during the conduct of the study. Dr Noble reported receiving grants from Cancer Research UK and personal fees from Microsoft Research, Cambridge, during the conduct of the study. No other disclosures were reported.
Funding/Support: The research work reported in the manuscript was self-funded by Microsoft Research Cambridge.
Role of the Funder/Sponsor: The funder had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.