Pathogenic/likely pathogenic (hereafter referred to as pathogenic) variants obtained from ClinVar repository; novel loss-of-function variants with annotations of frameshift, splice acceptor, splice donor, stop gained, stop lost, or start lost using Variant Effect Predictor (VEP) bioinformatic algorithm; and nonrecessive genes identified in OMIM were curated, yielding 5360 pathogenic or loss-of-function variants for analysis. Samples of genomic DNA from participants in both biobanks were prepared and then exome sequenced by the Regeneron Genetics Center. As an example, LDLR R744Ter was identified in 2 of 72 434 participants with exome sequence data from the UK Biobank and BioMe Biobank. Electronic health records (EHRs) of participants were searched for the presence of diagnosis codes indicative of clinical disease. Hypercholesterolemia was observed in 1 individual with LDLR R744Ter based on the EHR diagnosis codes. For each variant, penetrance was calculated as the number of individuals with the variant and a corresponding disease diagnosis divided by the total number of individuals with the variant. This process was repeated for all variants to obtain a penetrance data set, which was used to evaluate factors associated with penetrance. Factors included the variant’s reported pathogenicity and review status in ClinVar, molecular consequence, and gene, as well as the individual’s ancestry and age. All analyses used the set of 5360 pathogenic/loss-of-function variants, except for the analysis of pathogenicity, which used a set of 30 144 ClinVar variants (pathogenic, benign, uncertain, and conflicting).
The risk difference (RD) is shown for 4454 pathogenic/loss-of-function variants for 49 diseases that had at least 36 variants, with at least 2 of the variants identified in at least 2 individuals. Diamonds represent the unweighted mean of the RD of all variants for each disease and the whiskers represent the SD. Large numbers of variants with little to no disease risk were observed (3919 [88%] variants with RD ≤0.05). The complete set of 5360 pathogenic/loss-of-function variants for 157 diseases had 4795 (89%) variants with RD ≤0.05 and is shown in eFigure 3 and eTable 7 in the Supplement.
The penetrance of variants with different characteristics was compared (eg, pathogenic vs benign variants). The penetrance of variants with a particular characteristic was also compared to the baseline disease prevalence in individuals with normal alleles at each corresponding variant position, with the mean penetrance and disease prevalence shown as points. The violin width represents the density of variants at specified penetrance values or normal alleles at specified prevalence values, on a base-10 logarithmic scale, with all violins in a given plot scaled to the same maximum width. A, Pathogenic variants had a higher mean penetrance than benign variants (difference, 6.0 percentage points [95% CI, 5.6-6.4 percentage points]; 2-tailed unpaired t test P < .001). B, Pathogenic variants reviewed by experts had a higher mean penetrance than those with multiple submitters (difference, 12 percentage points [95% CI, 8.7-15 percentage points]; P < .001). C, Frameshift variants had a higher mean penetrance than missense variants (difference, 5.9 percentage points [95% CI, 4.3-8.3 percentage points]; P < .001).
A, Points represent mean values. The violin width represents the density of variants at specified penetrance values or normal alleles at specified prevalence values, with all violins in a given plot scaled to the same maximum width. The highest mean variant penetrance was 38% in breast cancer type 1 susceptibility (BRCA1) (n = 48 variants; mean risk difference = 0.32; P < .001) and 38% in breast cancer type 2 susceptibility (BRCA2) (n = 92 variants; mean risk difference = 0.32; P < .001).
Data Sharing Statement. Penetrance dataset availability.
eMethods. Quality control of exome and genotype data, validation of phenotyping, biomarker analyses, sensitivity analyses, and correction for multiple hypothesis testing.
eFigure 1. Variant selection for analysis of penetrance.
eFigure 2. Metabolite and clinical measurements for individuals with pathogenic or loss-of-function (pathogenic/LoF) variants with different penetrance in the BioMe Biobank.
eFigure 3. Risk estimates for 157 diseases associated with 5,360 pathogenic or loss-of-function (pathogenic/LoF) variants.
eFigure 4. Disease risk associated with loss-of-function (LoF) variants and pathogenic variants.
eFigure 5. Penetrance of pathogenic/loss-of-function (pathogenic/LoF) variants in 73 genes from the American College of Medical Genetics and Genomics’ Recommendations for Reporting of Incidental Findings (ACMG 73) or 9 genes from the Centers for Disease Control and Prevention’s tier 1 list (tier 1).
eFigure 6. Distribution of observed disease risk for pathogenic or loss-of-function (pathogenic/LoF) variants in 73 genes from the American College of Medical Genetics and Genomics’ Recommendations for Reporting of Incidental Findings (ACMG 73).
eFigure 7. Distribution of observed disease risk for pathogenic or loss-of-function (pathogenic/LoF) variants in 9 genes from the Centers for Disease Control and Prevention’s tier 1 list (tier 1).
eFigure 8. Disease risk of pathogenic/loss-of-function (pathogenic/LoF) variants for diseases with <10 cases and ≥10 cases.
eFigure 9. Sensitivity analysis of penetrance estimates by sample size of individuals with a variant.
eFigure 10. Penetrance of 13,298 singletons stratified by ClinVar pathogenicity, ClinVar review status, and molecular consequence.
eFigure 11. Subgroup analysis of penetrance estimates by self-reported ancestry.
eFigure 12. Association between age of disease onset and age-dependent change in penetrance for 157 diseases in the BioMe Biobank and UK Biobank.
eTable 1. Summary of cases/controls, ICD-10 diagnosis codes, and genes for 197 diseases analyzed for penetrance of genetic variants.
eTable 2. Summary of 37,780 clinical variants assessed for penetrance.
eTable 3. Validation of ICD-10-based phenotypes with clinical algorithms in the assessment of penetrance for 9 diseases in the BioMe Biobank.
eTable 4. Tabulated list of 208 penetrance measurements computed with ICD-10-based and clinical algorithm-based phenotypes for ClinVar pathogenic variants in the BioMe Biobank.
eTable 5. Validation of ICD-10 phenotypes with manual curation of physician notes in the problem list for 6 diseases in the BioMe Biobank.
eTable 6. Validation of ICD-10 phenotypes with manual review of electronic health records for 6 diseases in the BioMe Biobank.
eTable 7. Summary of risk difference (RD) for 5,360 pathogenic/loss-of-function (LoF) variants associated with 157 diseases.
eTable 8. Diseases in the American College of Medical Genetics and Genomics Recommendations for Reporting of Incidental Findings (ACMG 73) and Centers for Disease Control and Prevention tier 1 list (tier 1) with genes containing pathogenic/loss-of-function variants.
eTable 9. Age of onset for 197 diseases included in study.
eTable 10. Twenty pathogenic/loss-of-function variants associated with elevated risk of 7 diseases.
eTable 11. Penetrance estimates for 59 pathogenic/loss-of-function variants in LDLR.
Forrest IS, Chaudhary K, Vy HMT, et al. Population-Based Penetrance of Deleterious Clinical Variants. JAMA. 2022;327(4):350–359. doi:10.1001/jama.2021.23686
What is the population-based penetrance of pathogenic and loss-of-function clinical variants?
This cohort study included 72 434 participants from 2 biobanks who had alleles for pathogenic or loss-of-function variants reported for 157 diseases. Among the 5360 pathogenic/loss-of-function variants, 4795 (89%) were associated with less than or equal to 5% risk difference for disease in individuals with the variant allele; pathogenic variants were associated with 6.9% mean penetrance and benign variants were associated with 0.85% mean penetrance.
In these biobanks, the estimated penetrance of pathogenic/loss-of-function variants varied, but was generally associated with a small increase in the risk of disease.
Population-based assessment of disease risk associated with gene variants informs clinical decisions and risk stratification approaches.
To evaluate the population-based disease risk of clinical variants in known disease predisposition genes.
Design, Setting, and Participants
This cohort study included 72 434 individuals with 37 780 clinical variants who were enrolled in the BioMe Biobank from 2007 onwards with follow-up until December 2020 and the UK Biobank from 2006 to 2010 with follow-up until June 2020. Participants had linked exome and electronic health record data, were older than 20 years, and were of diverse ancestral backgrounds.
Variants previously reported as pathogenic or predicted to cause a loss of protein function by bioinformatic algorithms (pathogenic/loss-of-function variants).
Main Outcomes and Measures
The primary outcome was the disease risk associated with clinical variants. The risk difference (RD) between the prevalence of disease in individuals with a variant allele (penetrance) vs in individuals with a normal allele was measured.
Among 72 434 study participants, 43 395 were from the UK Biobank (mean [SD] age, 57 [8.0] years; 24 065 [55%] women; 2948 [7%] non-European) and 29 039 were from the BioMe Biobank (mean [SD] age, 56  years; 17 355 [60%] women; 19 663 [68%] non-European). Of 5360 pathogenic/loss-of-function variants, 4795 (89%) were associated with an RD less than or equal to 0.05. Mean penetrance was 6.9% (95% CI, 6.0%-7.8%) for pathogenic variants and 0.85% (95% CI, 0.76%-0.95%) for benign variants reported in ClinVar (difference, 6.0 [95% CI, 5.6-6.4] percentage points), with a median of 0% for both groups due to large numbers of nonpenetrant variants. Penetrance of pathogenic/loss-of-function variants for late-onset diseases was modified by age: mean penetrance was 10.3% (95% CI, 9.0%-11.6%) in individuals 70 years or older and 8.5% (95% CI, 7.9%-9.1%) in individuals 20 years or older (difference, 1.8 [95% CI, 0.40-3.3] percentage points). Penetrance of pathogenic/loss-of-function variants was heterogeneous even in known disease predisposition genes, including BRCA1 (mean [range], 38% [0%-100%]), BRCA2 (mean [range], 38% [0%-100%]), and PALB2 (mean [range], 26% [0%-100%]).
Conclusions and Relevance
In 2 large biobank cohorts, the estimated penetrance of pathogenic/loss-of-function variants was variable but generally low. Further research of population-based penetrance is needed to refine variant interpretation and clinical evaluation of individuals with these variant alleles.
Identification of pathogenic variants in disease predisposition genes, including 73 genes recommended by the American College of Medical Genetics & Genomics (ACMG 73),1 informs clinical diagnosis and actions.2 This genotype-first approach in medicine is feasible only if pathogenicity is known. A database of genetic variation, ClinVar,3 classifies variant pathogenicity (eg, pathogenic, benign). However, most variants have uncertain clinical significance and misclassified variants have inflated pathogenicity.4 Variants implicated in diseases such as breast cancer and cardiomyopathy have overestimated pathogenicity,5 and many pathogenic variants have been downgraded to benign or uncertain clinical significance.6 There is therefore a need to accurately assess a variant’s disease risk.
The disease risk associated with most variants (ie, penetrance) is uncertain. Examples of highly penetrant variants include pathogenic variant alleles in LDLR (Entrez Gene 3949) with a penetrance of 73% for hypercholesterolemia7 and pathogenic variants in BRCA1 (Entrez Gene 672) and BRCA2 (Entrez Gene 675) with a penetrance of approximately 60% for breast cancer by 70 years of age.8 Findings of highly penetrant variants guide clinical decisions: individuals with LDLR variants are prescribed statins as young as 8 years to prevent coronary events,9 while individuals with BRCA1 or BRCA2 variant alleles receive mammograms before 30 years of age.10 Penetrance thus gives meaningful and actionable information in quantifying disease risk for individuals with a variant.
Penetrance has traditionally been derived from family-based or clinical cohort studies.11 These studies focus on small numbers of genes and maximize penetrance estimates by recruiting patients with disease or family history of disease, and are susceptible to ascertainment bias.12 In contrast, large biobanks have exome sequences coupled to electronic health records (EHRs) for many unrelated individuals, enabling population-based estimates of penetrance. However, the penetrance of most variants in the general population remains uncharacterized. Thus, in the current study, variant penetrance was evaluated in 2 large-scale EHR-linked biobanks.
The study protocols were approved by the institutional review board of the Icahn School of Medicine at Mount Sinai. Use of UK Biobank data was approved through the UK Biobank Resource under application number 16218. Written informed consent was obtained for all study participants.
The study design is shown in Figure 1. The penetrance of pathogenic/loss-of-function variants was analyzed in a large study population of participants from 2 biobanks with linked exome and EHR data. The variant allele was defined as the pathogenic allele reported in ClinVar or loss-of-function allele annotated by bioinformatic algorithms, while the normal allele was the nonvariant allele. Penetrance for each variant was determined by the proportion of individuals with the variant allele who were affected with disease, as defined by an International Classification of Diseases, Tenth Revision (ICD-10) diagnosis code in the EHR.
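The penetrance definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline (the study's analyses were run in R); the function name and the input layout, one `(variant_id, has_diagnosis)` pair per individual with the variant, are assumptions. The example reproduces the LDLR R744Ter case from Figure 1, in which 1 of 2 individuals with the variant had a hypercholesterolemia diagnosis code.

```python
from collections import defaultdict


def penetrance_by_variant(carriers):
    """Compute per-variant penetrance.

    carriers: iterable of (variant_id, has_diagnosis) pairs, one pair per
    individual with the variant (hypothetical input layout).
    Returns {variant_id: (n_affected, n_with_variant, penetrance)}.
    """
    counts = defaultdict(lambda: [0, 0])  # variant -> [affected, total]
    for variant_id, has_diagnosis in carriers:
        counts[variant_id][1] += 1
        if has_diagnosis:
            counts[variant_id][0] += 1
    return {v: (a, n, a / n) for v, (a, n) in counts.items()}


# Figure 1 example: 2 individuals with LDLR R744Ter, 1 with a diagnosis code.
example = [("LDLR R744Ter", True), ("LDLR R744Ter", False)]
result = penetrance_by_variant(example)
print(result["LDLR R744Ter"])  # (1, 2, 0.5)
```

With only 2 individuals carrying the variant, the estimate of 50% is necessarily coarse, which is why the study also examined penetrance by the number of individuals with each variant.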
Penetrance was evaluated in a cohort of individuals enrolled, without selection for traits or disease, in 2 EHR-linked biobanks. The BioMe Biobank is a health system–based biobank comprising approximately 60 000 patients recruited from the Mount Sinai Health System in Manhattan, New York, from 2007 onwards, with follow-up for data used in this study completed on December 12, 2020. BioMe is highly diverse, with individuals of African, Asian, European, Hispanic, and multiple self-reported ancestries representative of the surrounding New York City area. All BioMe participants consented to providing biological and DNA samples linked to deidentified EHRs, and the first 31 250 participants underwent exome sequencing. Quality control was applied whereby samples with discordance between genetic and recorded sex, low coverage, contamination, or duplicate samples were excluded. In addition, samples lacking complete demographic data, from individuals younger than 20 years, or with missing ICD-10 data (ie, incomplete record in the health system) were removed to generate the final study set.
UK Biobank is a community-based cohort of approximately 500 000 individuals, chiefly of British self-reported ancestry, aged 40 to 69 years who were enrolled at 22 assessment centers across the UK between 2006 and 2010,13 with follow-up for data used in this study completed on June 3, 2020. UK Biobank participants are not more likely to have health conditions than nonparticipants.14 All individuals consented to providing medical history, demographic data, and DNA samples. A subset of individuals had their exomes sequenced and passed standard quality control.15 Samples without complete demographic or ICD-10 data were excluded to generate a final study set.
In BioMe, exome sequence data and variant call files were produced by the Regeneron Genetics Center (eMethods in the Supplement). In UK Biobank, exome data from the first release of exome sequence data generated with the functional equivalence pipeline15 were used (sequence data and quality control described elsewhere13). Variants associated with ClinVar diseases were ascertained from the variant call files using PLINK, version 2.0.16 To enrich for penetrant variants, a subset of pathogenic/likely pathogenic (hereafter collectively referred to as pathogenic) or loss-of-function variants was used for most downstream analyses. These were defined by curating variant summary information in ClinVar variant call files (December 2020 release), functional annotations from Variant Effect Predictor (version 99.2),17 and genic mode of inheritance (eg, dominant, recessive) from OMIM.18 An overview of pathogenic/loss-of-function variant selection is provided in eFigure 1 in the Supplement. First, variants of pathogenic classification in ClinVar, and novel variants with a damaging molecular consequence (splice acceptor/donor, stop gained/lost, frameshift, or start lost; collectively defined as loss-of-function) annotated by Variant Effect Predictor, were included. Loss-of-function variants in a gene were mapped to disease based on prior pathogenic variant submissions in ClinVar linking genes to diseases (eg, BRCA1 loss-of-function variants mapped to breast cancer). Because missense variants have varying degrees of pathogenicity, non-ClinVar missense variants were also excluded. The review status (level of pathogenicity evidence) was noted for each ClinVar variant: no assertion criteria (review status 0), single submitter/multiple submitters with conflicting interpretation (review status 1), multiple submitters with no conflict of interpretation (review status 2), and reviewed by an expert panel (review status 3). 
Second, variants in genes with exclusively recessive mode of inheritance in OMIM were excluded. The gene for each variant was retrieved from NCBI reference sequences (https://www.ncbi.nlm.nih.gov/refseq/) and corroborated with genomic coordinates in OMIM.
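The two selection steps can be expressed as a single filter. The sketch below is an assumption-laden illustration rather than the study's code: the function name, the argument layout, and the treatment of genes with no recorded inheritance mode (kept here) are hypothetical, while the consequence strings follow the Sequence Ontology terms that Variant Effect Predictor reports.

```python
# Loss-of-function consequences as named in the text, written as
# Sequence Ontology terms used by Variant Effect Predictor.
LOF_CONSEQUENCES = {
    "frameshift_variant",
    "splice_acceptor_variant",
    "splice_donor_variant",
    "stop_gained",
    "stop_lost",
    "start_lost",
}
PATHOGENIC_CLASSES = {"pathogenic", "likely_pathogenic"}


def keep_variant(clinvar_class, consequence, gene_inheritance):
    """Return True if a variant passes both selection steps.

    clinvar_class: ClinVar classification, or None for a novel (non-ClinVar) variant.
    consequence: predicted molecular consequence.
    gene_inheritance: set of inheritance modes listed for the gene in OMIM.
    """
    # Step 2: drop variants in genes that are exclusively recessive.
    if gene_inheritance == {"recessive"}:
        return False
    # Step 1a: ClinVar pathogenic/likely pathogenic variants are kept.
    if clinvar_class in PATHOGENIC_CLASSES:
        return True
    # Step 1b: novel variants are kept only if loss-of-function, which also
    # excludes non-ClinVar missense variants.
    return clinvar_class is None and consequence in LOF_CONSEQUENCES
```

For example, `keep_variant(None, "missense_variant", {"dominant"})` returns `False`, whereas `keep_variant(None, "stop_gained", {"dominant", "recessive"})` returns `True` because the gene is not exclusively recessive.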
In both BioMe and UK Biobank, case status was obtained using ICD-10 codes, which map directly to ClinVar diseases in Systematized Nomenclature of Medicine Clinical Terms. All samples in the final data set had ICD-10 codes available. Cases were identified by the presence of a corresponding ICD-10 code, while controls were identified by the absence of all corresponding ICD-10 codes. Case-control status was thereby defined for Systematized Nomenclature of Medicine Clinical Terms diseases of nonrecessive inheritance in ClinVar (disease inheritance was retrieved from https://www.ncbi.nlm.nih.gov/medgen/). A complete list of the cases, controls, and ICD-10 codes for each disease is provided (eTable 1 in the Supplement). Validation of the ICD-10–based phenotyping method was performed against clinical algorithms,19-27 manual review of physician notes, manual review of the EHR, and analyses of biomarkers in BioMe (eMethods in the Supplement).
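The case-control rule described above (a case has at least one corresponding ICD-10 code; a control has none of them) can be sketched as follows. The disease-to-code mapping shown is a hypothetical placeholder, not the mapping in eTable 1, and exact string matching of codes is an assumption.

```python
# Hypothetical mapping for illustration only; the study's actual codes are
# listed in eTable 1 in the Supplement.
DISEASE_CODES = {
    "hypercholesterolemia": {"E78.0"},
}


def is_case(patient_codes, disease, disease_codes=DISEASE_CODES):
    """A case has at least one ICD-10 code that corresponds to the disease."""
    return not disease_codes[disease].isdisjoint(patient_codes)


def case_control_status(patients, disease, disease_codes=DISEASE_CODES):
    """Label each patient as 'case' or 'control' for one disease.

    patients: {patient_id: set of ICD-10 codes in the EHR} (hypothetical layout).
    """
    return {
        pid: "case" if is_case(codes, disease, disease_codes) else "control"
        for pid, codes in patients.items()
    }


patients = {"p1": {"E78.0", "I10"}, "p2": {"I10"}, "p3": set()}
status = case_control_status(patients, "hypercholesterolemia")
print(status)  # {'p1': 'case', 'p2': 'control', 'p3': 'control'}
```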
Differences in continuous and categorical variables were assessed with 2-sided unpaired t tests and Fisher exact tests, respectively. The risk difference (RD) between the prevalence of disease in individuals with the variant allele and individuals with the normal allele was computed, and significance was evaluated with 2-sided Fisher exact tests. The significance level was set at P < .05 when comparing 2 groups. A strict Bonferroni-corrected significance threshold was used in analyses with multiple comparisons (eMethods in the Supplement): mean penetrance for 5 ClinVar pathogenicity classes (P < .01), 4 ClinVar review status levels (P < .01), and 8 molecular consequences (P < .006). Individuals with missing demographic or clinical data were removed from analysis during quality control. Violin plots were generated with the function geom_violin from the R package ggplot2, version 3.3.3, and all statistical tests and plots were made using R, version 3.5.3 (R Foundation for Scientific Computing).
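As an illustration of the RD and its significance test, the sketch below computes both from a 2 × 2 table using only the Python standard library. It is not the authors' code (the analyses used R); the function names and the toy counts are assumptions, and the two-sided P value uses the common convention of summing the probabilities of all tables that are no more likely than the observed one.

```python
from math import comb


def risk_difference(a, b, c, d):
    """RD = disease prevalence with the variant minus prevalence without it.

    a, b: individuals with the variant, with and without disease.
    c, d: individuals with the normal allele, with and without disease.
    """
    return a / (a + b) - c / (c + d)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x affected individuals with the variant.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum tables at least as extreme as the observed one (small float tolerance).
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def bonferroni_threshold(alpha, n_comparisons):
    """Bonferroni-corrected significance threshold."""
    return alpha / n_comparisons


# Toy counts (hypothetical): 3 of 4 with the variant affected vs 1 of 4 without.
rd = risk_difference(3, 1, 1, 3)        # 3/4 - 1/4 = 0.5
p = fisher_exact_two_sided(3, 1, 1, 3)  # 34/70, about 0.486
```

For the thresholds quoted above, 0.05/5 = 0.01 and 0.05/8 = 0.00625 (reported as P < .006); 0.05/4 = 0.0125, which the text reports as P < .01.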
Several sensitivity analyses were performed (see eMethods in the Supplement). The RD and penetrance were assessed for pathogenic/loss-of-function variants in 73 genes from the ACMG 73 list1 and in 9 genes from the Centers for Disease Control and Prevention’s tier 1 list.28 Penetrance distributions were evaluated for diseases with different numbers of cases and for variants with different numbers of individuals with the variant, including singletons (rare variants appearing once). Variant penetrance was further examined by self-reported ancestries and ages of individuals.
In BioMe, 31 250 individuals underwent exome sequencing and 437 samples with discordance between genetic and recorded sex, low coverage, or contamination or duplicated samples were excluded, leaving 30 813 samples. Samples without complete demographic data (n = 345), of participants younger than 20 years (n = 610), or lacking ICD-10 data (n = 819) were removed to generate the final study set of 29 039 samples. In UK Biobank, 49 960 individuals underwent exome sequencing and passed standard quality control. Those without complete demographic information (n = 2) or ICD-10 codes available (n = 6563) were removed, leaving a final study set of 43 395 samples.
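The exclusion counts above can be checked with simple arithmetic; the variable names below are illustrative only.

```python
# BioMe: sequenced samples minus quality-control exclusions.
biome_sequenced = 31_250
biome_after_qc = biome_sequenced - 437  # sex discordance, low coverage, contamination, duplicates
# Then remove incomplete demographics, age < 20 y, and missing ICD-10 data.
biome_final = biome_after_qc - 345 - 610 - 819

# UK Biobank: sequenced samples minus incomplete demographics and missing ICD-10 codes.
ukb_final = 49_960 - 2 - 6_563

total = biome_final + ukb_final
print(biome_after_qc, biome_final, ukb_final, total)  # 30813 29039 43395 72434
```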
The study design is illustrated in Figure 1. The study population comprised 72 434 individuals with exome and phenotype data, including 43 395 from UK Biobank (mean [SD] age, 57 [8.0] years; 24 065 [55%] women) and 29 039 from BioMe (mean [SD] age, 59 years; 17 355 [60%] women) with a spectrum of health conditions (Table). A total of 197 diseases and 37 780 clinical variants (reported in ClinVar or of predicted loss-of-function consequence) were identified (eTables 1 and 2 in the Supplement). A stringent set of 5360 pathogenic/loss-of-function variants was used for most downstream analyses (eFigure 1 in the Supplement). The population-based approach of phenotyping and computing penetrance was validated against clinical algorithms (eTables 3 and 4 in the Supplement), manual review of physician notes in the problem list (eTable 5 in the Supplement), and manual review of the EHR (eTable 6 in the Supplement) for a set of representative diseases. Biomarker measurements were also investigated for individuals with pathogenic/loss-of-function variants linked to familial hypercholesterolemia, maturity-onset diabetes of the young, and obesity (eFigure 2 in the Supplement).
The distribution of risk estimates for 157 diseases with 5360 pathogenic/loss-of-function variants was examined. A subset of 565 variants (11%) with an RD greater than 0.05 for 55 diseases (35%) was detected (eFigure 3 in the Supplement); among 4454 pathogenic/loss-of-function variants for diseases that had at least 36 variants (with at least 2 variants identified in at least 2 individuals), 535 (12%) had an RD greater than 0.05 (Figure 2). In contrast, large numbers of weakly penetrant and nonpenetrant variants that were associated with little to no disease risk (4795 [89%] variants with RD ≤0.05) were observed (eFigure 3 in the Supplement). Mean variant RD for all 157 diseases is provided in eTable 7 in the Supplement. Variants that were both pathogenic and loss-of-function had a higher mean RD than those that were pathogenic and non–loss-of-function (difference, 0.040 [95% CI, 0.023-0.058]; P < .001) or non-ClinVar and loss-of-function (difference, 0.044 [95% CI, 0.033-0.055]; P < .001) (eFigure 4 in the Supplement). Pathogenic/loss-of-function variants in ACMG 73 or tier 1 genes were identified (eTable 8 in the Supplement). Pathogenic/loss-of-function variants in ACMG 73 genes were associated with a greater mean RD for disease than those not in ACMG 73 genes (difference, 0.042 [95% CI, 0.032-0.052]; P < .001) (eFigures 5 and 6 in the Supplement). Similarly, pathogenic/loss-of-function variants in tier 1 genes had a higher mean RD for disease than those not in tier 1 genes (difference, 0.17 [95% CI, 0.15-0.19]; P < .001; eFigures 5 and 7 in the Supplement). Pathogenic/loss-of-function variants had similar mean RD for diseases with at least 10 cases and diseases with fewer than 10 cases (difference, 0.019 [95% CI, −0.0035 to 0.043]; P = .10; eFigure 8 in the Supplement).
Penetrance was stratified by characteristics of ClinVar pathogenicity, ClinVar review status, and molecular consequence (Figure 3). The penetrance of variants with different characteristics was compared (eg, penetrance of pathogenic vs benign variants); penetrance of variants with a certain characteristic (eg, ClinVar pathogenic variant) was also compared with the baseline prevalence of disease in individuals with a normal allele. Mean penetrance was higher by 6.0 percentage points ([95% CI, 5.6-6.4 percentage points]; P < .001) for pathogenic variants (mean, 6.9% [95% CI, 6.0%-7.8%]) than for benign variants (mean, 0.85% [95% CI, 0.76%-0.95%]); by 12 percentage points ([95% CI, 8.7-15 percentage points]; P < .001) for pathogenic variants reviewed by experts (mean, 18% [95% CI, 14%-22%]) than for those with multiple submitters (mean, 6.4% [95% CI, 5.1%-7.6%]); and by 5.9 percentage points ([95% CI, 4.3-8.3 percentage points]; P < .001) for frameshift variants (mean, 10% [95% CI, 9.0%-12%]) than for missense variants (mean, 4.1% [95% CI, 2.9%-5.3%]). Median penetrance was 0% for all variant groups tested due to large numbers of nonpenetrant variants. Penetrance distributions were consistent for variants with different numbers of individuals with the variant (eFigure 9 in the Supplement) and for singletons (eFigure 10 in the Supplement).
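The combination of a nonzero mean and a median of 0% follows from the shape of the distribution: when more than half of the variants are nonpenetrant, the median is 0 while a few penetrant variants raise the mean. The sketch below uses hypothetical penetrance values, not study data, to illustrate this, along with the conversion of a difference in proportions to percentage points.

```python
from statistics import mean, median

# Hypothetical per-variant penetrance values: most variants are nonpenetrant.
penetrance = [0.0] * 17 + [0.25, 0.5, 1.0]

mean_penetrance = mean(penetrance)      # raised by the few penetrant variants
median_penetrance = median(penetrance)  # 0 because more than half the values are 0


def difference_in_percentage_points(p1, p2):
    """Difference between two proportions, expressed in percentage points."""
    return (p1 - p2) * 100


# Reported means for pathogenic (6.9%) and benign (0.85%) variants differ by
# approximately 6 percentage points.
diff = difference_in_percentage_points(0.069, 0.0085)
```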
Estimates of disease risk were ascertained in self-reported African, Asian, European, and Hispanic ancestries. When stratified by ancestry, penetrance distributions for ClinVar pathogenicity, review status, and molecular consequence were similar to the primary analysis (eFigure 11 in the Supplement). Of 2211 pathogenic/loss-of-function variants in BioMe, 76 (3.4%) were exclusively present in individuals of African ancestry, 17 (0.77%) were exclusively present in individuals of Asian ancestry, 71 (3.2%) were exclusively present in individuals of European ancestry, and 43 (1.9%) were exclusively present in individuals of Hispanic ancestry. Of 3408 pathogenic/loss-of-function variants in UK Biobank, 8 (0.23%) were specific to individuals of African ancestry, 11 (0.32%) were specific to individuals of Asian ancestry, and 235 (6.9%) were specific to individuals of European ancestry. Highly penetrant ancestry-specific variants were identified, such as an Asian ancestry–specific pathogenic frameshift variant in HBB (Entrez 3043, NC_000011.10:c.5226994_5226995insC) associated with increased risk of thalassemia (RD = 0.99; P < .001) and a European ancestry–specific pathogenic frameshift variant in PALB2 (Entrez 79728, NC_000016.10:c.23636037_23636038del) associated with elevated risk of breast cancer (RD = 0.92; P = .007).
Variant penetrance was further examined across age thresholds for individuals with variants, ranging from 20 years or older to 70 years or older. Age of disease onset is pertinent for estimating penetrance: congenital or early-onset diseases will have manifested in older individuals with a penetrant variant, whereas later-onset diseases may not have presented in younger individuals with a penetrant variant. The observed change in penetrance of pathogenic/loss-of-function variants with increasing age was characterized for diseases stratified by age of onset (eFigure 12 in the Supplement). Disease onset was defined as earlier (congenital, childhood, or adolescent), later (adulthood), or any (eTable 9 in the Supplement). For each disease onset group, change in penetrance was assessed as the difference in penetrance between the lowest age threshold and increasing age thresholds (eg, variant penetrance for age ≥20 vs ≥30 y, ≥20 vs ≥40 y). For later disease onset, the mean change in penetrance increased with higher age thresholds in BioMe. Of all age threshold comparisons for later disease onset, the largest mean change in penetrance was 1.8 percentage points ([95% CI, 0.40-3.3 percentage points]; P = .02) when comparing age 20 years or older (mean penetrance, 8.5% [95% CI, 8.0%-9.0%]) and 70 years or older (mean penetrance, 10.3% [95% CI, 9.0%-11.6%]), whereas mean penetrance did not change significantly for earlier disease onset (difference, −0.21 percentage points [95% CI, −0.44 to 0.015 percentage points]; P = .09) or any disease onset (difference, 0.075 percentage points [95% CI, −1.3 to 1.4 percentage points]; P = .91).
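The age-threshold comparison can be sketched for a single variant as follows. This is a simplified illustration under stated assumptions: the function names, the `(age, has_diagnosis)` input layout, the 10-year threshold steps, and the toy data are hypothetical, and the study averaged such changes across variants within each disease onset group rather than reporting them per variant.

```python
def penetrance_at_threshold(carriers, min_age):
    """Penetrance among individuals with the variant aged >= min_age.

    carriers: list of (age, has_diagnosis) pairs (hypothetical layout).
    Returns None when no individual meets the age threshold.
    """
    subset = [dx for age, dx in carriers if age >= min_age]
    return sum(subset) / len(subset) if subset else None


def change_vs_lowest_threshold(carriers, thresholds=(20, 30, 40, 50, 60, 70)):
    """Difference in penetrance between each higher threshold and the lowest one."""
    base = penetrance_at_threshold(carriers, thresholds[0])
    changes = {}
    for t in thresholds[1:]:
        p = penetrance_at_threshold(carriers, t)
        changes[t] = None if p is None or base is None else p - base
    return changes


# Toy data (hypothetical): a later-onset disease diagnosed only in older individuals.
carriers = [(25, False), (35, False), (45, False), (55, True), (65, False), (75, True)]
changes = change_vs_lowest_threshold(carriers)
# Penetrance is 2/6 at age >= 20 and 1/1 at age >= 70, a change of about 0.67.
```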
The ACMG recommends reporting secondary findings of pathogenic variants, but acknowledges that insufficient penetrance data requires ongoing study.29 Thus, the population-based penetrance of pathogenic/loss-of-function variants in 10 known or suspected breast cancer predisposition genes30 (BRCA1, BRCA2, PALB2, CHEK2 [Entrez Gene 11200], ATM [Entrez Gene 472], PTEN [Entrez Gene 5728], CDH1 [Entrez Gene 999], BARD1 [Entrez Gene 580], BRIP1 [Entrez Gene 83990], and RAD51D [Entrez Gene 5892]) was evaluated (Figure 4). The highest disease risk was associated with pathogenic/loss-of-function variants in BRCA1 (mean RD, 0.32 [95% CI, 0.20-0.45]), BRCA2 (mean RD, 0.32 [95% CI, 0.23-0.46]), and PALB2 (mean RD, 0.20 [95% CI, 0.045-0.36]).
These analyses identified previously reported variants and novel variants associated with elevated disease risk. Examples for 7 diseases, including familial breast cancer, are highlighted in eTable 10 in the Supplement. Many pathogenic/loss-of-function variants in BRCA2 were strongly associated with familial breast cancer, such as a pathogenic frameshift variant (NC_000013.11:c.32340301del; penetrance = 47%; RD = 0.39; P < .001) and a novel frameshift variant (NC_000013.11:c.32340630_32340631del; penetrance = 100%; RD = 0.96; P < .001). There was substantial heterogeneity of penetrance of variants even within the same disease predisposition gene (Figure 4). Pathogenic/loss-of-function variants in LDLR exhibited a wide range of penetrance for familial hypercholesterolemia (0%-100%), a subset of which were also identified in a prior study31 with similar penetrance estimates (eTable 11 in the Supplement). Similarly, there was a wide range of penetrance for familial breast cancer (0%-100%), with pathogenic/loss-of-function variants in BRCA1, BRCA2, and PALB2.
In this comprehensive assessment of variant penetrance in 72 434 individuals from 2 biobanks, most pathogenic variants in the population had low disease risk, consistent with overestimated disease risk for variants reported as pathogenic. This is in line with previous studies showing inflated disease risk of pathogenic variants from conventional studies with ascertainment bias12,32 and recent population-based studies with lower risk estimates for familial hypercholesterolemia and developmental disorders.31,33
Nascent efforts have begun to probe the upward bias of penetrance estimates in traditional genetic studies. The present study adds to this literature, systematically investigating the pervasiveness of overestimated penetrance among 5360 pathogenic/loss-of-function variants for 157 diseases. Penetrance of pathogenic/loss-of-function variants in well-known disease predisposition genes from the ACMG 73 and tier 1 list was higher than that of variants in other genes, but still generally low; for instance, many pathogenic/loss-of-function variants in LDLR had low penetrance for familial hypercholesterolemia, in agreement with a previous study.31 Moreover, the reported pathogenicity of variants did not equate with penetrance. ClinVar pathogenic variants were more penetrant than variants of other ClinVar classes, yet still weakly penetrant overall, consistent with previous studies4-6 and with the ClinVar definition of pathogenic that includes “low-penetrance” variants (https://www.ncbi.nlm.nih.gov/clinvar/docs/clinsig/).
These findings raise the question of whether variants reported as pathogenic but empirically shown to have low penetrance should be classified differently or whether categorical systems of disease risk (pathogenic vs nonpathogenic) should be complemented with a quantitative system of penetrance.5 A new approach was recently proposed for classifying familial hypercholesterolemia based on 2 parameters: whether an individual has a pathogenic LDLR variant and severity of their hypercholesterolemia.34 This disease classification schema, and others like it, would benefit from knowledge of penetrance for pathogenic variants to better stratify disease risk and personalize medical care.
Population-based screening for genetic disorders depends in part on the disease prevalence in the target population; rare disorders (eg, cyclical neutropenia, with a prevalence of 0.017% in the study population) have a lower a priori probability of pathogenic and penetrant variants being detected than more common diseases (eg, familial breast cancer with a prevalence of 4.2%). In addition, although this study focused on the penetrance of individual rare variants, alternative approaches with polygenic risk scores assess the effect of common variants in aggregate, which may interplay with the disease risk of rare variants. Populations also differ by allele frequency and disease factors,35 yet most genetic studies have focused on individuals of European ancestry.36 Here, estimates of disease risk in distinct ancestral subgroups were captured, identifying more than 100 variants specific to non-European ancestries. Previous studies have reported age-dependent penetrance at the gene level, whereby all variation in a gene is aggregated for diseases such as familial breast cancer,11 amyotrophic lateral sclerosis,37,38 and obesity,39 whereas this study evaluated age-dependent penetrance at the variant level.
This study has several limitations. First, ICD-10 codes from the EHR were used to define case-control status (eTable 1 in the Supplement). Although such codes are commonly used in EHR-linked studies, some misclassification may have occurred.40 Validation analyses supported the approach used for phenotyping, but these were completed for only a subset of the diseases.
Second, assessing variant penetrance in large populations using ICD-10 codes in EHRs may lead to conservative estimates of penetrance because healthy participants with pathogenic variants but without any detectable disease were included in the study. Additionally, individuals who could be detected as having subtle manifestations of a pathogenic variant by detailed disease-directed phenotyping may not be labeled with an ICD-10 code in a general biobank. In contrast, family-based or clinical cohort studies may overestimate penetrance due to recruitment of individuals with a personal or family history of disease (ie, ascertainment bias of cases)11,12 and because their participants may undergo more detailed phenotyping. Nonetheless, variant disease risk ascertained from a large population may reasonably reflect clinically meaningful penetrance.
Third, the data sets may have had bias. BioMe comprises individuals recruited from the Mount Sinai Health System, who have a higher burden of disease, which may lead to higher penetrance estimates. In contrast, the healthy volunteers in UK Biobank may lead to conservative estimates of penetrance. True penetrance values may lie between upper estimates from family-based or clinical cohort studies with ascertainment bias of cases and lower estimates from population-based studies with healthy participants.
Fourth, although the sample size exceeded 72 000 exomes and enabled the ascertainment of rare variants, many penetrance estimates for rare variants were based on low numbers of individuals and produced estimates with wide CIs. Raw counts of individuals with the variant allele and individuals with the normal allele were included in the data set so estimates may be interpreted accordingly. Additionally, only variants in genes with nonrecessive inheritance were assessed; future studies with larger sample sizes should investigate the penetrance of variants in genes with recessive inheritance.
Fifth, variants were mapped to diseases based on disease genes reported in ClinVar from pathogenic variant submissions. Inaccurate ClinVar submissions, and therefore mappings from variant to disease, are possible and some genes tenuously associated with disorders may not have been excluded, although mappings were manually checked extensively against the literature for accuracy and well-known disease predisposition genes from the ACMG 73 and tier 1 list were separately considered.
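To illustrate the fourth limitation, the following minimal Python sketch computes penetrance as defined in this study (carriers with a disease diagnosis divided by all carriers) and attaches a 95% CI. The Wilson score interval is used here only as an assumption for illustration; the study text does not state which interval method was applied, and the function names are hypothetical. The first example uses the LDLR R744Ter counts reported in this study (1 of 2 carriers with hypercholesterolemia); the second uses hypothetical counts purely to show how the interval narrows with more carriers.

```python
from math import sqrt


def penetrance(carriers_with_disease: int, carriers_total: int) -> float:
    """Carriers with a disease diagnosis divided by all carriers of the variant."""
    if carriers_total <= 0:
        raise ValueError("carriers_total must be positive")
    return carriers_with_disease / carriers_total


def wilson_interval(events: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (an assumed choice, not the study's stated method)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = events / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Counts reported in the study: LDLR R744Ter, 1 of 2 carriers diagnosed.
ldlr_estimate = penetrance(1, 2)
ldlr_ci = wilson_interval(1, 2)

# Hypothetical counts with the same point estimate but 100 times more carriers.
hypothetical_estimate = penetrance(100, 200)
hypothetical_ci = wilson_interval(100, 200)

print(f"LDLR R744Ter: {ldlr_estimate:.2f}, 95% CI {ldlr_ci[0]:.2f}-{ldlr_ci[1]:.2f}")
print(f"Hypothetical: {hypothetical_estimate:.2f}, "
      f"95% CI {hypothetical_ci[0]:.2f}-{hypothetical_ci[1]:.2f}")
```

With only 2 carriers, the interval spans most of the range from 0 to 1, which is why the raw carrier counts should be read alongside any single-variant estimate.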
In 2 large biobank cohorts, the estimated penetrance of pathogenic/loss-of-function variants was variable but generally low. Further research on population-based penetrance is needed to refine variant interpretation and clinical evaluation of individuals with these variant alleles.
Corresponding Author: Ron Do, PhD, 1468 Madison Ave, Annenberg Building, Floor 18, Room 80B, New York, NY 10029 (firstname.lastname@example.org).
Accepted for Publication: December 13, 2021.
Author Contributions: Dr Do and Mr Forrest had full access to all of the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis.
Concept and design: Forrest, Jordan, Cho, Do.
Acquisition, analysis, or interpretation of data: All authors.
Drafting of the manuscript: Forrest, Cho, Do.
Critical revision of the manuscript for important intellectual content: All authors.
Statistical analysis: Forrest, Chaudhary, Petrazzini, Rocheleau, Cho.
Obtained funding: Forrest, Loos, Cho, Do.
Administrative, technical, or material support: Vy, Bafna, Cho.
Supervision: Rocheleau, Nadkarni, Cho, Do.
Conflict of Interest Disclosures: Mr Forrest reported receiving grants from the National Institute of General Medical Sciences of the National Institutes of Health (NIH). Dr Nadkarni reported receiving grants, personal fees, and nonfinancial support from and being a cofounder of and having equity in Renalytix; being a cofounder in Pensieve Health; being a cofounder and having equity in Verici; and receiving personal fees from Siemens, Reata, AstraZeneca, and BioVie outside the submitted work. Dr Do reported receiving grants from AstraZeneca and Goldfinch Bio; nonfinancial support from Goldfinch Bio; personal fees from Variant Bio; and being a scientific cofounder, consultant, and equity holder in Pensieve Health outside the submitted work. No other disclosures were reported.
Funding/Support: Mr Forrest is supported by the National Institute of General Medical Sciences of the National Institutes of Health (NIH) (T32-GM007280). Dr Do is supported by the National Institute of General Medical Sciences of the NIH (R35-GM124836) and the National Heart, Lung, and Blood Institute of the NIH (R01-HL139865 and R01-HL155915).
Role of the Funder/Sponsor: The National Institute of General Medical Sciences and the National Heart, Lung, and Blood Institute of the NIH had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; or decision to submit the manuscript for publication; and no right to veto publication of the manuscript.
Disclaimer: The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Additional Contributions: Bruce D. Gelb, MD; Sander Houten, PhD; Paz Polak, PhD; and Stuart Scott, PhD, all of whom are on the thesis advisory committee of Iain Forrest, provided critical feedback and expertise. All contributors are affiliated with the Icahn School of Medicine at Mount Sinai and no one received any additional compensation beyond usual salary for their contributions to this study.