Global, Regional, and National Cancer Incidence, Mortality, Years of Life Lost, Years Lived With Disability, and Disability-Adjusted Life-Years for 29 Cancer Groups, 1990 to 2016

This systematic analysis evaluates the cancer burden over time at the global and national levels measured in incidence, mortality, years lived with disability, years of life lost, and disability-adjusted life-years.

C14 together as, "lip, oral cavity, and pharyngeal cancer." These groups were broken down into subcauses that could be mapped to single GBD causes. In this example, those include lip and oral cavity cancer (C00-C08), nasopharyngeal cancer (C11), cancer of other parts of the pharynx (C09-C10, C12-C13), and "Malignant neoplasm of other and ill-defined sites in the lip, oral cavity, and pharynx" (C14). To redistribute the data, weights were created using the same method employed in age-sex splitting (see step four above). For the undefined code (C14 in the example) an "average all cancer" weight was used, which was generated by adding all cases rom SEER/NORDCAN/CI5 and dividing those by the combined population. Then, proportions were generated by subcause for each aggregate cause as in the sex splitting example above (see step four). The total number of cases from the aggregated group (C00-C14) was recalculated for each subgroup and the undefined code (C14). C14 was then redistributed as a "garbage code" in step six. Distinct proportions were used for C46 (Kaposi sarcoma). C46 entries were redistributed as "other cancer" and HIV. In the sixth step (#6 in the flowchart), unspecified codes ("garbage code") were redistributed. Redistribution of cancer registry incidence and mortality data mirrored the process of the redistribution used in the cause of death database and has not changed compared to GBD 2013. 13 In the seventh step (#7 in the flowchart), duplicate or redundant sources were removed from the processed cancer registry dataset. Duplicate sources were present if, for example, the cancer registry was part of the CI5 database but we also had data from the registry directly. Redundancies occurred and were removed as described in "Inclusion and Exclusion Criteria," where more detailed data were available, or when national registry data could replace regionally representative data. From here, two parallel selection processes were run to generate input data for the MI models and to generate incidence for final mortality estimation. Higher priority was given to registry data from the most standardized source when creating the final incidence input, whereas for the MI model input, only sources that reported incidence and mortality were used. This is different from GBD 2015, where mortality and incidence could come from different sources as long as they covered the same population. In the eighth step (#8 in the flowchart), the processed incidence and mortality data from cancer registries were matched by cancer, age, sex, year, and location to generate MI ratios. These MI ratios were used as input for a three-step modeling approach using the general GBD ST-GPR approach with SDI as a covariate in the linear step mixed effects model using a logit link function. Predictions were made without the random effects. The ST-GPR model has three main hyper-parameters that control for smoothing across time, age, and geography. The time adjustment parameter ( ) was set to 2, which aims to borrow strength from neighboring time points (i.e., the exposure in this year is highly correlated with exposure in the previous year but less so further back in time). The age adjustment parameter ω was set to 0.5, which borrows strength from data in neighboring age groups. The space adjustment parameter was set to 0.95 in locations with data and to 0.5 in locations without data the higher was applied when at least one age-sex group in the country of estimation had at least five unique data points. The lower was applied when estimating data-scarce countries). Zeta aims to borrow strength across the hierarchy of geographical locations. 13 For the amplitude parameter in the Gaussian process regression we used 2 and for the scale we used a value of 15. We have modified the approach to estimate MI ratios compared to GBD 2015. Since for GBD 2015 MI ratio predictions for some cancers yielded similar predictions for low-SDI countries without data as for high-SDI countries, we refined the estimation process. Inclusion criteria for the MI ratio input data were changed to only include mortality and incidence data if they were reported by the same source. We excluded MI ratios reported in the CI5 4,4-10 since mortality data used for the calculation of these MI ratios by definition has to be independent from the cancer registry. We also revised the outliering 8 process and excluded data based on the SDI quintile categorization rather than on development status. For each cancer, MI ratios from locations in SDI quintiles 1-4 (low to high-middle SDI) were dropped if they were below the median of MI ratios from locations in SDI quintile 5 (high SDI). We also dropped MI ratios from locations in SDI quintiles 1-4 if the MI ratios were above the third quartile + 1.5 * IQR (interquartile range). We dropped all MIR that were based on less than 25 cases to avoid noise due to small numbers except for mesothelioma and acute lymphoid leukemia, where we dropped MIR that were based on less than 10 cases because of lower data availability for these two cancers. We also aggregated incidence and mortality to the youngest 5-year age bin where we had at least 50 data points to avoid MIR predictions in young age groups that were based on few data points. The MIR in the age-bin that was used to aggregate MIR was used to backfill the MIR for younger age groups. Since MI ratios can be above 1, especially in older age groups and cancers with low cure rates, we used the 95 th percentile of the cleaned dataset that only included MIR that were based on 50 or more cases to cap the MIR input data. This "upper cap" was used to allow MIR over 1 but to constrain the MIR to a maximum level. To run the logit model, the input data were divided by the upper caps and model predictions after ST-GPR was rescaled by multiplying them by the upper caps. Upper caps used for GBD 2016 were the following: 1.32 To constrain the model at the lower end, we used the 5 th percentile of the cancer-specific cleaned MIR input data to replace all model predictions with this lower cap. Final MI ratios were matched with the cancer registry incidence dataset in the ninth step (#9 in the flowchart) to generate mortality estimates (Incidence * Mortality/Incidence = Mortality) (#10 in the flowchart). The final mortality estimates were then uploaded into the COD database (#11 in the flowchart). Cancer-specific mortality modeling then followed the general CODEm process. Cause of death database formatting Formatting of data sources for the cause of death database has been described in detail elsewhere (#11 in the flowchart). 13

CODEm models
Mortality estimates for each cancer were generated using CODEm (#12 in the flowchart). Methods describing the CODEm approach have been described elsewhere. 13,14 In brief, the CODEm modeling approach is based on the principles that all types of available data should be used even if data quality varies; that individual models but also ensemble models should be tested for their predictive validity; and that the best model or sets of models should be chosen based on the out of sample predictive validity. Models were run separately for countries with extensive and complete vital registration data and countries with less VR data to prevent an inflation in the uncertainty around the estimates in "datarich" countries. Covariates were selected based on a possible predictive relationship between the covariate and the specific cancer mortality. Level 1 covariates have a proven strong relationship with the outcome such as etiological or biological roles. Level 2 covariates have a strong relationship but not a direct biological link. Covariates that are more distal in the causal chain or are mediated through Level 1 or 2 covariates are categorized as Level 3. 14 Differences in covariate selection between GBD 2015 and GBD 2016 by cause and direction of the covariate can be found in eTable 9.

Liver cancer etiology split models
To find the proportion of liver cancer cases due to the four etiology groups included in GBD (1. Liver cancer due to hepatitis B, 2. Liver cancer due to hepatitis C, 3. Liver cancer due to alcohol, 4. Liver cancer due to other causes), a systematic literature search was performed on 10/24/2016 in PubMed using the search string ("liver neoplasms"

[All Fields] OR "HCC"[All Fields] OR "liver cancer"[All Fields] OR "Carcinoma, Hepatocellular"[Mesh]) AND (("hepatitis B"[All Fields] OR "Hepatitis B"[Mesh] OR "Hepatitis B virus"[Mesh] OR "Hepatitis B Antibodies"[Mesh] OR "Hepatitis B Antigens"[Mesh]) OR ("hepatitis C"[All Fields] OR "Hepatitis C"[Mesh] OR "hepatitis C antibodies"[MESH] OR "Hepatitis C Antigens"[Mesh] OR "Hepacivirus"[Mesh]) OR ("alcohol"[All Fields] OR "Alcohol Drinking"[Mesh] OR "Alcohol-Related Disorders"[Mesh] OR "Alcoholism"[Mesh] OR "Alcohol-Induced Disorders"[Mesh])) AND ("2015"[PDAT]
: "2016" [PDAT]) NOT (animals [MeSH] NOT humans [MeSH]). Studies were included if the study population was representative of the liver cancer population for the respective location. For each study the proportions of liver cancer due to the three specific risk factors were calculated. Remaining risk factors were included under a combined "other" group. Cryptogenic cases were only included if other etiologies like viral hepatitis or alcoholic cirrhosis had been excluded. If multiple risk factors were reported for an individual patient these were apportioned proportionally to the individual risk factors. The proportion data found through the systematic literature review were used as input for four separate DisMod-MR 2.1 models to determine the proportion of liver cancers due to the four subgroups for all locations, both sexes, and all age groups (step #16 in the flowchart). A study covariate was used for publications that only assessed liver cancer in a cirrhotic population. The reference, or "gold standard," that was used for crosswalking was the compilation of all studies that assessed the etiology of liver cancer in a general population. A study covariate was also used for studies that only assessed hepatocellular carcinoma (HCC) as opposed to all primary liver cancers. The reference or "gold standard" that was used for crosswalking was the compilation of all studies that assessed the etiology of liver cancer for the population with all primary liver cancers. For liver cancer due to hepatitis C and hepatitis B, a prior value of 0 was set between age 0 and 0.01. For 10 liver cancer due to alcohol a prior value of 0 was set for ages 0 to 5 years. For liver cancer due to hepatitis C, hepatitis C (IgG) seroprevalence was used as a covariate as well as a covariate for alcohol (liters per capita) and hepatitis B prevalence (HBsAg seroprevalence), forcing a negative relationship between the alcohol and hepatitis B covariate and the outcome of liver cancer due to hepatitis C proportion. For liver cancer due to hepatitis B, seroprevalence of HBsAg was used as a covariate as well as a covariate for alcohol and hepatitis C IgG seroprevalence, forcing a negative relationship between the alcohol and hepatitis C covariate and the outcome of liver cancer due to hepatitis B proportion. For liver cancer due to alcohol, alcohol (liters per capita) was used as a covariate as well as a covariate for proportion of alcohol abstainers, hepatitis B and hepatitis C seroprevalence, forcing a negative relationship between the proportion of alcohol abstainers, hepatitis B and hepatitis C covariates and the outcome of liver cancer due to alcohol proportion. All covariates used were modeled independently. To ensure consistency between cirrhosis and liver cancer estimates and to take advantage of the data for the respective other related cause (e.g., liver cancer due to hepatitis C and the related cause cirrhosis due to hepatitis C), we generated covariates from the liver cancer proportion models that we used in the cirrhosis etiology proportion models. We then created covariates from the cirrhosis etiology proportion models and used those in the liver cancer etiology models.
Since the proportion models are run independently of each other, the final proportion models were scaled to sum to 100% within each age, sex, year, and location, by dividing each proportion by the sum of the four (step # 17). For the liver cancer subtype mortality estimates, we multiplied the parent cause "liver cancer" by the corresponding scaled proportions (step # 18). Single cause estimates were adjusted to fit into the separately modeled all-cause mortality in the process CodCorrect.

CodCorrect
CODEm models estimate the individual cause-level mortality without taking into account the all-cause mortality (#13 in the flowchart). To ensure that all single causes add up to the all-cause mortality and that all child-causes add up to the parent cause, an algorithm called "CodCorrect" is used (#14 and #15 in the flowchart). Details regarding the algorithm can be found elsewhere. 13

Incidence estimation
GBD cancer incidence estimates were generated by dividing final mortality estimates (after CodCorrect adjustment) by the MI ratio for the specific cancer (#1 eFigure 2). To propagate uncertainty from the MI ratios and the mortality estimates to incidence, this process was done at the 1,000-draw level. It was assumed that uncertainty in the MI ratio is independent of uncertainty in the estimated age-specific death rates.

Prevalence and YLD estimation
Prevalence is estimated as 10-year prevalence for all cancers as in GBD 2013 and GBD 2015. 15 To estimate cancer prevalence, relative cancer survival was estimated by scaling cancer-specific survival between a "best case" and "worst case" survival. The methods and input data used to generate the best and worst case survival as well as to scale countries between these boundaries remained the same as for the GBD 2013 and GBD 2015 studies (# 2, 3, and 5 in the flowchart). 15 Since the cause "other leukemia" was added for GBD 2016, survival data were updated using SEER 1973 survival data for "other leukemia" as the worst-case scenario and SEER 2010 survival data as the best-case survival. To transform relative to absolute survival (adjusting for background mortality) GBD 2016 lifetables were used (# 6 and 7 in the flowchart). 16 The access to cancer care variable to scale countries between the best and worst case survival was estimated using the same method as for GBD 2013 and GBD 2015 (# 4 in the flowchart): 15 1 c=country; y=year; s=sex; Age-standardized MI ratio min =lowest MI ratio for all countries and years; Agestandardized MI ratio max =highest MI ratio for all countries and years 12 Duration of the treatment phases (1. diagnosis and primary therapy; 2. Controlled phases; 3. Metastatic phase; 4. Terminal phase) remained the same as for GBD 2013 and GBD 2015, with the exception of the "other leukemia" cause, which was added for GBD 2016 (eTable 12). Total prevalence time was divided into phases 1, 3, and 4 for the population that died within 10 years, and the remaining prevalence was attributed to the controlled phase. For the population that survived beyond 10 years, prevalence person time was attributed to phase 1 and phase 2 (#8 in the flowchart). YLDs were calculated by multiplying each phase with the respective disability weight (eTable 13). To generate the total YLDs for each cancer (with the exception of cancers where additional disability is added due to procedures -see next paragraph) the YLDs for each cancer sequela were added (step 13 in eFigure 2).
Additional disability was estimated for breast cancer (disability due to mastectomy), larynx cancer (disability due to laryngectomy), colon and rectum cancer (disability due to stoma), bladder cancer (disability due to incontinence), and prostatectomy (disability due to incontinence and impotence) (#10 in eFigure 2). Hospital data were used to estimate the number of cancer patients undergoing mastectomy, laryngectomy, stoma, prostatectomy, and cystectomy. These proportions remained the same as in GBD 2013 and GBD 2015 and were used as input for proportion models that were run in DisMod-MR 2.0 (#9 in eFigure 2). 15,17 The procedure proportions (proportion of cancer population that undergoes procedures) from hospital data was used as input for a proportion model in DisMod-MR 2.0 in order to estimate the proportions for all locations, by age, and by sex.
Since colostomy or ileostomy procedures are done for reasons other than cancer, a literature review was done to determine the proportion of ostomies due to colorectal cancer. The "all cause" colostomy proportions were multiplied by 0.58 based on the results of the literature review showing that on average 58% of ostomies are done for colorectal cancer. [18][19][20] The final procedure proportions were applied to the incidence cases of the respective cancers and multiplied with the proportion of the incidence population surviving for 10 years to determine the incident cases of the cancer population that underwent procedures. These incident cases were used again as an input for DisMod-MR 2.1, with a remission specification of zero and an excess mortality rate prior of 0 to 0.1. This approach was different compared to GBD 2015, where we used the cause-specific mortality of the specific cancer to obtain prevalence of the sequela. The approach was changed since the mortality for the population undergoing disease-specific procedures (e.g., mastectomy for breast cancer) is likely closer to the general population after they have survived for a period of time compared to the cause-specific mortality of the underlying disease (e.g., breast cancer).
Since disability associated with prostatectomy comes from impotence and incontinence and not from the prostatectomy itself, 18% of the prostatectomy prevalence was assumed to be incontinent and 55% was assumed to be impotent based on a literature review done for GBD 2013. [21][22][23][24][25][26][27][28] Since all sequelae for a cause need to be mutually exclusive, the controlled phase for the cancers with additional procedure-related disability was adjusted to only include the population without procedurerelated disability (= controlled phases prevalence of the total population -controlled phase prevalence of the proportion that experienced procedure related disability) (#11 in eFigure 2). The disability weight for the prevalence of the population that experiences additional disability was adjusted to reflect the combined disability of the controlled phase as well as the procedure.
Lastly, the procedure sequelae prevalence and general sequelae prevalence were multiplied with disability weights (eTable 13) for the procedures to obtain the number of YLDs (#10 in eFigure 2). The sum of these YLDs are the final YLD estimate associated with each cancer.

Probability of cancer
The cumulative probability of developing cancer for certain age groups and an approximated lifetime risk for all cancer groups (age 0 to 79) as well as the odds of developing cancer for 2016 were calculated. The method use does not take into account competing risks of death. The cancer risk is approximated using the following formula 29 : Objectives and funding Reported in the manuscript/appendix 1 Define the indicator(s), populations (including age, sex, and geographic entities), and time period(s) for which estimates were made.
See appendix: "Definition of indicator" 2 List the funding sources for the work.
See main manuscript Data Inputs For all data inputs from multiple sources that are synthesized as part of the study: 3 Describe how the data were identified and how the data were accessed.
See appendix: "Data sources" 4 Specify the inclusion and exclusion criteria. Identify all ad-hoc exclusions.
See appendix: "Data sources" 5 Provide information about all included data sources and their main characteristics. For each data source used, report reference information or contact name/institution, population represented, data collection method, year(s) of data collection, sex and age range, diagnostic criteria or measurement method, and sample size, as relevant.
http://ghdx.healthdata.org/gbd-2016/data-input-sources 6 Identify and describe any categories of input data that have potentially important biases (e.g., based on characteristics listed in item 5).

See appendix: "Bias of categories of input data"
For data inputs that contribute to the analysis but were not synthesized as part of the study: 7 Describe and give sources for any other data inputs.
http://ghdx.healthdata.org/gbd-2016/data-input-sources For all data inputs: 8 Provide all data inputs in a file format from which data can be efficiently extracted (e.g., a spreadsheet rather than a PDF), including all relevant meta-data listed in item 5. For any data inputs that cannot be shared because of ethical or legal reasons, such as third-party ownership, provide a contact name or the name of the institution that retains the right to the data. http://ghdx.healthdata.org/gbd-2016/data-input-sources DATA ANALYSIS 9 Provide a conceptual overview of the data analysis method. A diagram may be helpful. See eFigure 1: Flowchart GBD cancer mortality, YLL estimation See eFigure 2: Flowchart GBD cancer incidence, prevalence, YLD estimation 10 Provide a detailed description of all steps of the analysis, including mathematical formulae. This description should cover, as relevant, data cleaning, data pre-processing, data adjustments and weighting of data sources, and mathematical or statistical model(s). eTable 2: Sources for cancer incidence and mortality-to-incidence ratio data by country, year, and registry eTable 9: Comparison of GBD 2015 and GBD 2016 covariates used and level of covariates 18 Discuss limitations of the estimates. Include a discussion of any modelling assumptions or data limitations that affect interpretation of the estimates.

See main manuscript "Limitations"
eTable 2: Sources for cancer incidence and mortality-to-incidence ratio data by country, year, and registry