Analysis of Sociodemographic, Clinical, and Genomic Factors Associated With Breast Cancer Mortality in the Linked Surveillance, Epidemiology, and End Results and Medicare Database

Key Points
Question Can existing data sets be used to link sociodemographic, clinical, and genomic data into a single population-level data set to investigate disparities in cancer outcomes?
Findings This cohort study used first-in-kind linkage of Surveillance, Epidemiology, and End Results, Medicare, and residual tumor repository data of 3522 women with newly diagnosed screening- vs symptomatic-detected estrogen receptor–positive nonmetastatic breast cancer to demonstrate that screening and socioeconomic factors remain associated with breast cancer outcomes, even after adjusting for clinical, demographic, and genomic factors.
Meaning These findings suggest that screening detection, tumor stage, gene expression, and survival are associated phenomena that may offer novel insights when examined together within a single context.


Introduction
Despite advances in our basic understanding of breast cancer biology, the relative contribution of sociocultural and biological factors in breast cancer disparities has remained an area of active debate during the past 30 years, and pure biological, social, and care access-based models cannot accurately describe all epidemiological phenomena. 1,2 Evidence of social drivers of race-based disparities has been demonstrated with respect to screening, stage at detection, treatment, and overall survival. [3][4][5][6] Poverty is associated with advanced-stage disease presentation, 7 and increased distance to care is associated with decreased use of adjuvant therapy. 8 On the other hand, analyses of phase III SWOG trials have demonstrated disparities in breast cancer outcomes, even in the setting of presumably equal care. 9 Furthermore, many features of breast cancer, including receptor status, remain stable during the course of metastatic cancer, suggesting that these molecular subtypes reflect different biological entities, [10][11][12] and genomic risk scores have prognostic and predictive capability 10 years after initial treatment. [13][14][15][16] To better understand breast cancer disparities, investigations of "nature and nurture" 17 must be combined, accounting for population sciences and dissemination of cancer care. 18 Most breast cancer research addresses basic science, health services, or clinical domains, but rarely all 3. A key driver of this siloed research is the paucity of population-level linkage containing both clinical and health services data with physical tumor samples. Last, most genomically analyzed tumor samples are collected in academic medical centers or within the context of a clinical trial, settings known to differ substantially from the general population with respect to patients, treatment, and outcomes. [19][20][21][22]
In this study, we conducted a proof-of-principle transdisciplinary investigation of health services and basic biological data within a population-level sample of patients. To accomplish this, we linked Surveillance, Epidemiology, and End Results (SEER) data, physical tumor blocks from the SEER residual tumor repository (RTR), and associated Medicare claims data. We then used this novel data set to investigate the biological and clinical progression of cancer associated with sociodemographic data and screening vs symptom detection among women with nonmetastatic invasive estrogen receptor (ER)-positive breast cancers in a population-level study. The primary aim of our study was to demonstrate the feasibility of our approach to investigate the interaction among health service, demographic, and clinical factors and their association with breast cancer-specific (BCS) and overall survival after adjusting for genomic factors. Our secondary aim was to investigate the association among health service, demographic, and clinical factors with tumor biology and progression.

Data Source
This cohort study was approved by all participating entities' individual institutional review boards, which waived the need for informed consent owing to the use of deidentified registry data. This study followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline.
The RTR banks formalin-fixed, paraffin-embedded (FFPE) blocks of tumor tissues that were clinically discarded, including primary, lymph node, and metastatic tumors from patients diagnosed in Iowa and Hawaii from January 1, 1993, to December 31, 2007. SEER data are linked with these physical tumor blocks, providing basic clinical and demographic information (eg, age, race, stage). SEER-coded race and ethnicity are determined per the SEER coding manual, which is primarily based on self-reported race and ethnicity as contained within the electronic medical record. Medicare insures approximately 97% of Americans 65 years or older, and administrative claims data are collected as part of routine operation, with deidentified claims data serving as a commonly used research data set. These data include all Medicare-billed services received by a patient, and therefore provide detailed and accurate data regarding the longitudinal treatment of patients. Linked Medicare claims data from January 1, 1992, through December 31, 2008, were available for analysis. This linkage represents, to our knowledge, the first joint data set combining the SEER, SEER-RTR, and Medicare claims data.

Study Population and Analysis Cohorts
A SEER-Medicare cohort was created using all patients who met study criteria and for whom both SEER and Medicare claims data were available (eFigure 1 in the Supplement) and included women with a SEER-based diagnosis of ER-positive invasive breast cancer from 1993 to 2007 with a confirmatory inpatient, outpatient, or carrier-based Medicare claim. Standard SEER-Medicare inclusion and exclusion criteria were then applied (Figure 1). We excluded T3 and T4 tumors because these would likely have only been symptomatic. Patients were required to be 66 years or older per standard SEER-Medicare study inclusion criteria. We limited our study to women 75 years or younger to focus on women who were more likely to undergo treatment with reasonable remaining natural life expectancy, and we included women with prior malignant disease. A subset of the SEER-Medicare cohort was then used to create a molecular cohort. Cases were selected by evenly sampling from screening- vs non-screening-detected tumors for which tissue samples were available, limited to those with adequate RNA integrity for genomic analysis, and further limited to samples confirmed as either luminal A or B cancer by molecular subtyping with a 50-gene signature (PAM50). Central pathological confirmation of all tumor cases and grade determination was performed by a single breast cancer pathologist (E.P.).
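As an illustration only, the explicitly stated cohort criteria can be sketched as a simple eligibility check. The `Case` record and all of its field names are hypothetical and do not correspond to actual SEER or Medicare variable names; the "standard SEER-Medicare inclusion and exclusion criteria" referenced above are not itemized in the text and are therefore not encoded here.

```python
from dataclasses import dataclass


@dataclass
class Case:
    """Hypothetical per-patient record; field names are illustrative only."""
    female: bool
    age_at_diagnosis: int
    diagnosis_year: int
    er_positive: bool
    invasive: bool
    metastatic: bool
    t_stage: str                  # "T1", "T2", "T3", or "T4"
    has_confirmatory_claim: bool  # inpatient, outpatient, or carrier claim


def is_eligible(case: Case) -> bool:
    """Apply only the criteria stated explicitly in the text."""
    return (
        case.female
        and case.er_positive
        and case.invasive
        and not case.metastatic
        and 1993 <= case.diagnosis_year <= 2007
        and 66 <= case.age_at_diagnosis <= 75   # age window described above
        and case.t_stage in {"T1", "T2"}        # T3 and T4 tumors excluded
        and case.has_confirmatory_claim
    )


example = Case(True, 70, 2001, True, True, False, "T1", True)
print(is_eligible(example))
```

Keeping each criterion as a separate boolean term makes it straightforward to tabulate how many patients are removed at each step, as in a cohort flow diagram.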

Primary Study End Points
Screening detection of tumors was determined using the presence of a bilateral screening mammography in the 4 months before the breast cancer diagnosis claim or in the year before the site-directed breast surgery as validated previously. 23,24 The National Cancer Institute comorbidity index was determined using inpatient, outpatient, and carrier claim files in the year before diagnosis. 25,26 For SEER stage, we used the American Joint Committee on Cancer Staging Manual,

Genomic Analyses
The FFPE tumor specimens were analyzed using a gene expression profiling panel (Breast Cancer 360 [BC360]; NanoString, Inc) to quantify continuous values for the messenger RNA expression of 752 genes and 30 cancer-related gene expression signatures (eg, androgen receptor signaling) and provide molecular subtyping into luminal A, luminal B, ERBB2 (formerly known as HER2)-enriched, and basal-like using 58 genes and the PAM50 algorithm. Heatmaps of expression profiles were created using hierarchical clustering with nSolver, version 4.0 (NanoString) and the R heatmap statistical package (R Program for Statistical Computing) for exploratory analyses of gene signature clustering in screen- and symptom-detected cancer as well as by T and N stage. Expressions of 752 genes were individually regressed as continuous values as a function of screening status, T stage, N stage, and association with BCS and overall survival controlling for clinical, demographic, and socioeconomic factors using a threshold of unadjusted P < .05 for exploratory analyses. Where reported, false discovery rate was calculated using the Benjamini-Hochberg correction. The total number of samples obtained was restricted by project resources, which allowed the molecular analysis of 140 samples.
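For readers unfamiliar with the Benjamini-Hochberg correction, the following is a minimal sketch of the standard step-up procedure in plain Python. It is not the authors' analysis code; the function name and the example P values are assumptions chosen for illustration.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, scaling by m / rank and
    # taking a running minimum so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Illustrative (made-up) per-gene P values.
p = [0.01, 0.04, 0.03, 0.005]
q = benjamini_hochberg(p)
print([round(x, 4) for x in q])
```

Under this sketch, a gene would be reported as maintaining a false discovery rate of less than 0.05 when its adjusted value falls below 0.05.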

Statistical Analysis
Data were analyzed from August 1, 2018, to July 25, 2021. Associations between screening status and stage were analyzed using bivariate t tests and the Cochran-Mantel-Haenszel test for nonzero correlation as well as unadjusted and adjusted logistic regression. Survival analyses were performed using unadjusted and adjusted Cox proportional hazards regression to estimate the associations between both BCS and all-cause mortality and patient demographic and socioeconomic factors,

SEER-Medicare Molecular Cohort: Patient Characteristics and Comparison by Screening Status
The molecular cohort consisted of women with tissue blocks pulled for molecular analysis, stratified for relatively equal representation of screen-detected vs symptomatic tumors. RNA quality assurance passed for 97% of samples. Of these, fewer than 11 samples were found to have molecular subtypes (P = .008) (Figure 2).
Within the Surveillance, Epidemiology, and End Results (SEER)-Medicare cohort (N = 3522), all-cause mortality and breast cancer-specific (BCS) mortality were significantly higher in patients whose tumors were symptomatic. Within the molecular cohort (n = 130), screening detection status and molecular subtype were associated with all-cause mortality but not BCS mortality. The number of patients at risk are censored at 75% of patients for all panels but are not shown for the molecular cohort owing to standard SEER-Medicare data use agreements limiting reporting of cell sizes of fewer than 11.

Individual Gene-Level Analyses Associated With Screening and Disease Progression
Increased expression of 95 genes was associated with BCS mortality (Figure 3A and eAppendix in the Supplement). The largest differences in gene expression were observed when comparing T2 vs T1 tumors (253 genes), in which 48 genes maintained a false discovery rate of less than 0.05 (Figure 3B). Downregulated genes (n = 224) were enriched for cellular differentiation, immune response, cell adhesion, and regulation of apoptosis (Figure 3C). Upregulated genes (n = 29) were enriched for cell cycle and proliferation, glycolytic metabolism, and regulation of apoptosis (Figure 3D). Only 46 genes were differentially expressed between symptomatic- vs screening-detected tumors and 13 genes between stages III vs II disease (Figure 3B).

Exhibition of Different Changes in Gene Expression by T Stage in Luminal A and B Tumors
We next hypothesized that T2 vs T1 changes in gene expression would differ by luminal A (88 genes) vs luminal B (100 genes) molecular subtypes owing to distinct mechanisms of disease progression.

Discussion
This study reports the first linkage connecting tumor-based genomic analyses with Medicare administrative claims and SEER clinical, sociodemographic, and vital status data. Using this population-level data set, we were able to model the interaction between screening-based breast cancer detection and sociodemographic characteristics, disease stage, and biological pathway activity as well as their association with overall and BCS mortality. Even after correcting for all clinical and genomic factors, living in a zip code with a poor level of educational attainment remained one of the factors most strongly associated with increased all-cause mortality. Genomic activation of TGFβ and p53 pathways showed adverse associations with survival, whereas improved overall survival was associated with androgen receptor signaling, macrophage infiltration, and activation of cytotoxic T cells. T stage demonstrated the strongest association with changes in gene expression, with other factors such as screening status or N stage showing no associations with gene expression when accounting for T stage. Interestingly, genomic dysregulation associated with T stage differed within luminal A vs B tumors, with luminal B molecular subtype tumors associated with distinct inhibition of interferon γ signaling and MHCII expression that was not observed in the luminal A molecular subtype, which instead was associated with cytokine-based immune dysregulation. This study serves as proof-of-principle that combining health service, clinical, sociodemographic, and genomic data within a single population-level cohort is feasible and may offer new insights into disease progression and factors driving adverse outcomes.
Genomic findings were consistent with our current understanding of the biology of breast cancer, including an adverse association between TGFβ and p53 signaling and a favorable association with androgen receptor signaling and immune infiltration, particularly macrophages and cytotoxic T cells. Differences in immune dysregulation in the progression from T1 to T2 tumors within luminal B vs A molecular subtype tumors may have prognostic or therapeutic implications in tumor immunotherapy. In support of the external validity of our analysis, we observed an adverse outcome associated with increased expression of several genes associated with breast cancer mortality that have been confirmed previously in the literature (KIFC1, 28 FAM83D, 29 GRB7, 30 UBE2C, 31 and CLDN4 32 ).
An encouraging next-generation iteration of the SEER-RTR concept is the SEER virtual tissue repository (SEER-VTR), which has been implemented recently in 7 SEER registries, including Iowa, Hawaii, Kentucky, Louisiana, Los Angeles, Greater California, and Connecticut. The SEER-VTR works by using SEER-based records to link to the location of tumor blocks stored within community pathology laboratories, which are required by the College of American Pathologists to keep tumor blocks for 10 years after a cancer diagnosis. Prospective partnerships between SEER registries and their community partners can thereby be leveraged to include the physical use of patient samples for anyone diagnosed within the past 10 years. An approach analogous to the one we report in this study using SEER-RTR could be used in collaboration with the SEER-VTR program in future research.

Limitations
There are several limitations of this study, including the retrospective and historical nature of our cohort, which likely did not undergo modern imaging, treatment, or genomic risk score profiling. The molecular cohort was limited by small sample sizes owing to the pilot nature of the study. We were unable to assess prescription of nonintravenous medications, including hormonal therapy, which was not available within the Medicare claims data until the introduction of part D in 2006. Many factors known to influence breast cancer could not be incorporated into the study design, including family history and lifestyle factors such as diet, obesity, smoking, and alcohol consumption. Many forms of biological dysregulation were not represented in this study, including somatic 33 and tumoral mutations, epigenetic changes, 34 genomic instability, hormonal signaling, 35 metabolism, 36 tumor microenvironment, 37 proteomics, and more, owing to cost and logistic constraints. Instead, we included only a single genomic platform and women with ER-positive tumors to focus on demonstrating the feasibility of linking genomic, health services, and clinical data together in a single data source and model. Tissue analysis was limited to FFPE, given the archival nature of the specimens. The Iowa and Hawaii populations did not include enough Black women to support representative analyses. However, the creation of a more racially diverse study cohort is a priority and a topic of active future investigation due to known associations between race and breast tumor biology and molecular subtype. [38][39][40][41]

Conclusions
By linking SEER-Medicare data to physical tumor specimens, additional connections may be revealed among biology, access to health care, and disparities in breast cancer outcomes.
The findings of this population-based cohort study suggest that tumor screening and socioeconomic status are associated with survival in patients who have locally advanced, ER-positive tumors, even when clinical and genomic factors are incorporated. Preliminary analyses suggest that luminal A and B molecular subtypes may be associated with distinct mechanisms of genomic progression when detected at later tumor stages within population-level cohorts.