Comparison of educational achievement of participants in the Estonian Genome Center, University of Tartu (EGCUT) with carriers of Database of Chromosomal Imbalance and Phenotype in Humans Using Ensembl Resources (DECIPHER)–listed rearrangements or with carriers of deletions, female carriers of deletions, and duplications segregated by size (CNV frequencies ≤0.05%). The educational attainment decreases with copy number variation (CNV) size. See Table 2 for statistically significant differences between groups. Educational levels are coded according to the Estonian education curriculum (eMethods in the Supplement).
eTable 1. Phenotypes of EGCUT individuals with DECIPHER-listed recurrent rearrangements
eTable 2. Prevalence and characteristic features of DECIPHER-listed genomic disorders
eTable 3. Sample demographics and characteristics
eTable 4. Summary scores of ALSPAC participants Standard Assessment Tests (SATs)
eTable 5. Prevalence of NAHR-mediated recurrent CNVs in clinical and general population cohorts
eTable 6. Follow-up phenotyping of 16p11.2 600kb BP4-BP5 deletions and duplications identified in the EGCUT cohort
eTable 7. Individual CNVs in EGCUT discovery and replication cohorts
eTable 8. Education attainment in EGCUT replication cohorts separately and combined with discovery cohort
eTable 9. Mean Standard Assessment Tests (SATs) scores for English and Mathematics in ALSPAC CNV carriers
eTable 10. Education attainment in Italian HYPERGENES cohort
eTable 11. Education attainment in European American MCTFR cohort
eTable 12. MetaCore Enrichment by GO Processes analysis report
eFigure 1. Diagnoses reported in the EGCUT participants according to the WHO ICD-10 classification
eFigure 2. Multidimensional scaling analysis of EGCUT population structure
eFigure 3. Assessment of CNV deleteriousness
Männik K, Mägi R, Macé A, Cole B, Guyatt AL, Shihab HA, Maillard AM, Alavere H, Kolk A, Reigo A, Mihailov E, Leitsalu L, Ferreira A, Nõukas M, Teumer A, Salvi E, Cusi D, McGue M, Iacono WG, Gaunt TR, Beckmann JS, Jacquemont S, Kutalik Z, Pankratz N, Timpson N, Metspalu A, Reymond A. Copy Number Variations and Cognitive Phenotypes in Unselected Populations. JAMA. 2015;313(20):2044-2054. doi:10.1001/jama.2015.4845
The association of copy number variations (CNVs), differing numbers of copies of genetic sequence at locations in the genome, with phenotypes such as intellectual disability has been almost exclusively evaluated using clinically ascertained cohorts. The contribution of these genetic variants to cognitive phenotypes in the general population remains unclear.
To investigate the clinical features conferred by CNVs associated with known syndromes in adult carriers without clinical preselection and to assess the genome-wide consequences of rare CNVs (frequency ≤0.05%; size ≥250 kilobase pairs [kb]) on carriers’ educational attainment and intellectual disability prevalence in the general population.
Design, Setting, and Participants
The population biobank of Estonia contains 52 000 participants enrolled from 2002 through 2010. General practitioners examined participants and filled out a questionnaire of health- and lifestyle-related questions, as well as reported diagnoses. Copy number variant analysis was conducted on a random sample of 7877 individuals and genotype-phenotype associations with education and disease traits were evaluated. Our results were replicated on a high-functioning group of 993 Estonians and 3 geographically distinct populations in the United Kingdom, the United States, and Italy.
Main Outcomes and Measures
Phenotypes of genomic disorders in the general population, prevalence of autosomal CNVs, and association of these variants with educational attainment (from less than primary school through scientific degree) and prevalence of intellectual disability.
Of the 7877 in the Estonian cohort, we identified 56 carriers of CNVs associated with known syndromes. Their phenotypes, including cognitive and psychiatric problems, epilepsy, neuropathies, obesity, and congenital malformations are similar to those described for carriers of identical rearrangements ascertained in clinical cohorts. A genome-wide evaluation of rare autosomal CNVs (frequency, ≤0.05%; ≥250 kb) identified 831 carriers (10.5%) of the screened general population. Eleven of 216 (5.1%) carriers of a deletion of at least 250 kb (odds ratio [OR], 3.16; 95% CI, 1.51-5.98; P = 1.5e-03) and 6 of 102 (5.9%) carriers of a duplication of at least 1 Mb (OR, 3.67; 95% CI, 1.29-8.54; P = .008) had an intellectual disability compared with 114 of 6819 (1.7%) in the Estonian cohort. The mean education attainment was 3.81 (P = 1.06e-04) among 248 (≥250 kb) deletion carriers and 3.69 (P = 5.024e-05) among 115 duplication carriers (≥1 Mb). Of the deletion carriers, 33.5% did not graduate from high school (OR, 1.48; 95% CI, 1.12-1.95; P = .005) and 39.1% of duplication carriers did not graduate high school (OR, 1.89; 95% CI, 1.27-2.8; P = 1.6e-03). Evidence for an association between rare CNVs and lower educational attainment was supported by analyses of cohorts of adults from Italy and the United States and adolescents from the United Kingdom.
Conclusions and Relevance
Known pathogenic CNVs in unselected, but assumed to be healthy, adult populations may be associated with unrecognized clinical sequelae. Additionally, individually rare but collectively common intermediate-size CNVs may be negatively associated with educational attainment. Replication of these findings in additional population groups is warranted given the potential implications of this observation for genomics research, clinical care, and public health.
Recent studies showed that human individuals differ on approximately 0.8% of their genome.1 The Database of Genomic Variants catalogs approximately 2.4 million DNA copy number variations (CNVs), genetic sequences that differ in numbers of copies in the human genome, mapping to approximately 200 000 unique loci that cover 72% of the human genome.2 Copy number variations have been shown to contribute to interindividual variation in a wide variety of traits and conditions by globally influencing the transcriptome.3-6 Large recurrent CNVs, defined herein as larger than 500 kb, have been associated with complex disorders, particularly developmental delay and intellectual disability.7,8 Intellectual disability is characterized by limited intellectual functioning and impaired adaptive behavior in everyday life. These CNVs are listed in the Database of Chromosomal Imbalance and Phenotype in Humans Using Ensembl Resources (DECIPHER)9 and are often grouped under the term genomic disorder.8
The effects of large CNVs in adult populations remain unclear because associations of large rare CNVs with pathologies were almost exclusively evaluated using clinically ascertained pediatric cohorts with known intellectual impairment. The purpose of this study was to investigate the characteristics of adult carriers of known pathological CNVs who were not clinically preselected and to assess the burden of rare intermediate-size autosomal CNVs (defined as 500 kb > CNV ≥250 kb) on educational attainment and intellectual disability.
The Estonian Genome Center, the University of Tartu (EGCUT), cohort is a population biobank containing 5% of the Estonian adult population.10 Samples have been collected in all 15 Estonian counties and diverse social groups by 454 general practitioners (corresponding to 56% of practitioners registered to the Estonian Health Board). The age, sex, and geographical distribution of the 52 000 participants closely reflect those of the Estonian adult population. The detailed description of the Estonian cohort was previously published.10 At baseline, general practitioners performed a standardized objective examination of the participants and filled out a questionnaire that included more than 1000 health- and lifestyle-related questions, as well as provided the diagnoses of diseases present in the medical history of the participating individual using the format of the International Statistical Classification of Diseases and Related Health Problems, Tenth Revision (ICD-10)10 (see details in eMethods of the Supplement). The data are continuously updated through periodic linking to national electronic health registries. The wide range of phenotypes, ages, and social groups makes the cohort ideally suited to population-based studies. For details on EGCUT cohort phenotype data, see eFigure 1 and the eMethods section in the Supplement. The Estonian Genome Center operates according to the Estonian Human Genes Research Act and is managed in conformity with the International Organization for Standardization ISO 9001:2008. The ethics review committee on Human Research of the University of Tartu approved the project. Written informed consent was obtained from all participants for the baseline and follow-up investigations.
The relevant phenotype traits of individual carriers of DECIPHER-listed syndromic CNVs in the EGCUT cohort (eTable 1 in the Supplement) were obtained from the baseline questionnaire and compared with the reviewed characteristics of corresponding syndromes (eTable 2 in the Supplement). To further investigate the clinical features of adult carriers not clinically preselected, all 16p11.2 600kb BP4-BP5 (breakpoint) deletion and reciprocal duplication carriers identified in the Estonian cohort were invited back for follow-up investigations. See the Results section for a detailed description and relevant references of this genomic disorder. These CNVs were selected because of their relatively high prevalence and variable phenotype. These carriers were phenotyped using the standardized clinical and neuropsychological protocol that had been developed previously to specifically study patients with the 16p11.2 syndromes who had been ascertained through clinical cohorts.11,12 In agreement with the known population prevalence of 16p11.2 600kb BP4-BP5 CNVs,11 4 deletion carriers (0.05%) and 7 duplication carriers (0.09%) were identified in the Estonian set.
The EGCUT cohort (and Estonian population in general) is an outbred population with no substantial regional or ethnic differences. Single-nucleotide polymorphism (SNP) allele frequencies and linkage disequilibrium patterns are similar to those found in populations with European ancestry.13 We did not find small series of nonrecurrent CNVs or inflation of recurrent rearrangements typical of founder effects14,15 (eMethods in the Supplement). Accordingly, EGCUT samples have been successfully used to discover or replicate hundreds of SNP associations, which are vulnerable to population frequencies and stratification differences16-18 (see eMethods and eFigure 2 in the Supplement for details on the Estonian population makeup and stratification).
The genomic DNA of 8110 individuals (7020 for the discovery and 1090 for the replication cohorts; eTable 3 in the Supplement), randomly selected among the 52 000 EGCUT participants, was subjected to CNV analysis. A third cohort of 1066 individuals (“high-functioning replication cohort”) was used to further assess the significance of the signal obtained regarding educational attainment. Due to the recruitment criteria that required participants to provide sufficient and consistent information in an advanced sleeping pattern–related questionnaire and a regular work schedule over a survey period of 6 months, the high-functioning replication cohort was biased toward higher than average sociocognitive functioning (eMethods in the Supplement). Single-nucleotide polymorphism genotyping and CNV calling were performed using Illumina platforms and the Hidden Markov Model-based software PennCNV according to the manufacturer’s and developer’s protocols,19 respectively. The 6819 discovery, 1058 replication, and 993 high-functioning replication samples that passed the quality control parameters were retained (eMethods in the Supplement).
The difference in studied phenotypes between CNV carriers and the general population was assessed. A 2-sided Fisher exact test and Welch 2-sample t test were used for statistical analysis in The R Project for Statistical Computing environment (http://www.r-project.org, R version 3.0.2). A threshold of P ≤ .05 was set to indicate statistical significance. See eMethods in the Supplement for assessment of phenotype and determination of the prevalence of recurrent genomic syndromes. Briefly, intellectual disability is defined in the Diagnostic and Statistical Manual of Mental Disorders as a deficit in overall cognitive functioning along with limitations in adaptive behavior. It was diagnosed as described in F70-F79 of the ICD-10. All diagnoses, including intellectual disability, were made according to diagnostic standards throughout the participants’ medical history and recorded before enrollment to the biobank. They were reported to the EGCUT database by the recruiting general practitioners. Intellectual disability prevalence is estimated at 1% to 3% in developed countries,20 which is consistent with the prevalence found in the EGCUT discovery cohort (1.7%). Educational levels were uniformly coded at the time of enrollment according to the Estonian education curriculum from 1 to 7, ie, from less than primary school through scientific degree (eMethods in the Supplement). In both discovery and replication cohorts, the mean education attainment (MEA) corresponded to secondary education (MEA, 4.09; 95% CI, 4.07-4.12 for the discovery cohort and 4.0; 95% CI, 3.93-4.05 for the replication cohort) in agreement with the country’s MEA. See eMethods in the Supplement for details on the Estonian population religiousness, school curriculum organization, and education system performance.
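As an illustration of the 2-sided Fisher exact test named above, the following Python sketch recomputes the intellectual disability comparison quoted in the abstract (11 of 216 deletion carriers vs 114 of 6819 in the cohort). It is not the authors' R code. The layout of the 2x2 table (carriers set against the full cohort figure) is an assumption made here, and the function name `fisher_exact_two_sided` is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]].

    Sums the probabilities of every table with the same margins that is no
    more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x affected individuals in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # A small relative tolerance guards against floating-point ties.
    total = sum(p for p in map(prob, range(lowest, highest + 1))
                if p <= observed * (1 + 1e-7))
    return min(1.0, total)


# Assumed table layout: rows are deletion carriers (>=250 kb) and the cohort
# figure quoted in the text; columns are with/without intellectual disability.
carriers_id, carriers_total = 11, 216
cohort_id, cohort_total = 114, 6819

a, b = carriers_id, carriers_total - carriers_id
c, d = cohort_id, cohort_total - cohort_id

sample_or = (a * d) / (b * c)   # simple cross-product odds ratio
p_value = fisher_exact_two_sided(a, b, c, d)
print(f"odds ratio = {sample_or:.2f}, two-sided P = {p_value:.2g}")
```

The cross-product odds ratio used here is the plain sample estimate; R's `fisher.test` reports a conditional maximum likelihood estimate instead, so small differences from published values are to be expected.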
Three previously published data sets were used to functionally annotate genes embedded in rare CNVs and assess if their characteristics could be used to predict CNV deleteriousness (eMethods in the Supplement): (1) the neurodevelopmental gene list21,22; (2) the haploinsufficiency scores (HiS), ie, the probability that a given gene does not maintain its normal function with only 1 functional copy23; and (3) the list of ohnologs, ie, genes related by ancestral whole-genome duplication events.24 Because a CNV may preserve a gene’s integrity yet indirectly affect it through changes in the copy number of its regulatory elements,4,5,25 the potential contribution of the latter was also tested by stratifying CNVs using the number of encompassed regulatory elements identified previously.26 To further assess the functions of imbalanced genes, we used Thomson Reuters MetaCore, an integrated software suite for data-mining and pathway analysis based on a manually curated biological knowledge database (eMethods in the Supplement).
The Avon Longitudinal Study of Parents and Children (ALSPAC) cohort is a birth cohort based in Bristol, United Kingdom,27 which initially enrolled 14 541 pregnant women with expected delivery dates between April 1, 1991, and December 31, 1992. Of these pregnancies, 13 988 children were alive at 1 year of age. Additional families were enrolled in later phases. Detailed phenotypic information on the children and their parents was collected during clinic visits and by the completion of questionnaires, as well as from linkage with external data sources (eMethods in the Supplement). Ethical approval for the study was obtained from ALSPAC Ethics and Law Committee and the local research ethics committees.
The Illumina HumanHap550 Quad platform was used to genotype 9912 children in ALSPAC. The CNVs were called with PennCNV.19 The subset of 5218 unrelated individuals who passed the array quality control, who gave consent, and for whom educational information was available was retained for analysis (eTable 3 and eMethods in the Supplement). Log R ratio (LRR) and B allele frequency metrics were derived from raw data using published guidelines.28
Within ALSPAC, educational attainment was assessed using data from the UK-based Key Stage 3 National Curriculum Tests in English and mathematics, taken at ages 13 and 14 years, also known as Standard Assessment Tests (SATs). A discrete level is awarded for these tests, but to further account for the exact score received and the fact that the maximum and minimum level achievable for mathematics was dependent on the tier of examination for which the child was entered, results were scaled and adjusted as described previously.29,30 Due to nonnormal distribution of the data, these 2 variables were then inverse-rank normal transformed and then standardized. Furthermore, tertiles of the English and mathematics scores were created (eTable 4 in the Supplement). Differences in MEAs according to rare CNV carrier status (frequency ≤0.05%) were compared using a Welch 2-sided t test. This was performed separately for each of the inverse-rank transformed, standardized English and mathematics educational attainment scores. To obtain an interpretable estimate of effect, univariable logistic regression models were assessed separately for English and mathematics. The top tertile was coded as the reference group and the bottom tertile as the risk group. Separate odds ratios (ORs) were estimated for membership of the risk group, comparing CNV carriers corresponding to increasing size groups against baseline (no large CNVs at a frequency ≤0.05%). The binary educational outcome was then regressed against CNV carrier status as an ordered variable, including all 4 size categories, and the P value was reported as an assessment of trend.
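The inverse-rank normal transformation followed by standardization can be sketched in Python as below. This is a minimal illustration, not the ALSPAC pipeline: the rank offset (the Blom constant 0.375), the averaging of tied ranks, and the function names are assumptions, since the text does not specify them. The scores in the usage lines are made up.

```python
from statistics import NormalDist, fmean, pstdev


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def inverse_rank_normal(values, offset=0.375):
    """Rank-based inverse normal transform, then standardization.

    offset=0.375 (Blom) is an assumption; the source does not name one.
    """
    n = len(values)
    std_normal = NormalDist()
    z = [std_normal.inv_cdf((r - offset) / (n - 2 * offset + 1))
         for r in average_ranks(values)]
    mean, sd = fmean(z), pstdev(z)
    if sd == 0:
        raise ValueError("need at least two distinct values")
    return [(v - mean) / sd for v in z]


# Hypothetical scaled test scores, for illustration only.
scores = [52.0, 17.5, 88.0, 17.5, 63.0]
transformed = inverse_rank_normal(scores)
print([round(v, 3) for v in transformed])
```

The transform depends only on the ordering of the scores, which is why it is a common choice when the raw distribution is skewed or bounded.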
Participants from 2 studies conducted by the Minnesota Center for Twin and Family Research were used as replication samples: the Sibling Interaction and Behavior Study, and the Minnesota Twin Family Study, which is a longitudinal study of a community-based sample of same-sex twins born between 1972 and 1994 in Minnesota and their parents.31 The Sibling Interaction and Behavior Study is an adoption study of sibling pairs and their parents32; its community-based sample contains families in which both siblings are adopted, in which both are biologically related to the parents, or in which one is adopted and one is biologically related. In the current analyses, only a single random individual from each family was selected for inclusion in order to create a data set of unrelated participants (n = 2390, eTable 3 in the Supplement). The collection, genotyping, and analysis of DNA samples for both studies were approved by the University of Minnesota Institutional Review Board's Human Subjects Committee. Written informed consent was obtained from all participants; parents provided written informed consent for their minor children.
Genotyping was performed using the Illumina 660W-Quad array. Whole blood extracted DNA samples were only analyzed if the participant was white non-Hispanic and the standard deviation of the GC-corrected33 autosomal log R ratios was less than 0.20. The CNVs were called using PennCNV and then processed and filtered. Adjacent CNVs were merged if they had the same copy number and if the number of markers in the intervening gap was less than 20% of the number of total markers spanning the called CNVs. To replicate the results of the EGCUT discovery cohort, the same parameters, ie, rare (frequency, ≤0.05%) deletions of 250 kb or longer and duplications of 1 Mb or longer, were retained in the burden analysis.
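The merging rule for adjacent calls can be sketched in Python as follows. This is one possible reading, not the authors' implementation: "total markers spanning the called CNVs" is interpreted here as the combined marker count of the two calls (gap excluded), and the `CnvCall` structure with marker indices is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CnvCall:
    chrom: str
    first_marker: int   # index of the first array marker in the call
    last_marker: int    # index of the last array marker in the call
    copy_number: int

    @property
    def n_markers(self) -> int:
        return self.last_marker - self.first_marker + 1


def merge_adjacent(calls, max_gap_fraction=0.20):
    """Merge neighbouring calls that share a copy number and a small gap."""
    merged = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.first_marker)):
        prev = merged[-1] if merged else None
        if prev and prev.chrom == call.chrom and prev.copy_number == call.copy_number:
            gap = call.first_marker - prev.last_marker - 1
            called = prev.n_markers + call.n_markers  # assumed denominator
            if 0 <= gap < max_gap_fraction * called:
                merged[-1] = CnvCall(prev.chrom, prev.first_marker,
                                     call.last_marker, prev.copy_number)
                continue
        merged.append(call)
    return merged


# Hypothetical calls: the first two are close enough to merge, the third is
# too far away, and the fourth has a different copy number.
calls = [
    CnvCall("chr1", 100, 149, 1),
    CnvCall("chr1", 155, 199, 1),
    CnvCall("chr1", 300, 320, 1),
    CnvCall("chr1", 322, 340, 3),
]
merged_calls = merge_adjacent(calls)
print(merged_calls)
```

One simplification to note: once two calls are merged, the gap markers count toward the merged call's size in any later comparison.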
The full-scale intelligence quotient (FSIQ) was estimated using an abbreviated form of either the Wechsler Intelligence Scale for Children-Revised (WISC-R; for children ≤16 years) or the Wechsler Adult Intelligence Scale-Revised (WAIS-R; for individuals ≥16 years). The short forms consisted of 2 Performance subtests (Block Design and Picture Arrangement) and 2 Verbal subtests (Information and Vocabulary) and were prorated to determine FSIQ. Estimates from this short form have been shown to correlate 0.94 with FSIQ from the complete test.34 For samples with multiple FSIQ measurements, the measurements were averaged for analysis (mean, 104.52; SD, 4.27; range, 67-150).
The Italian cohort follow-up is based on 451 individuals belonging to the cohort ascertained as controls for genome-wide association studies of hypertension (HYPERGENES)35 (eTable 3 in the Supplement). Years of schooling were defined in accordance with the International Standard Classification of Education 1997 classification, leading to 7 categories of educational attainment that are internationally comparable (eMethods in the Supplement). Single-nucleotide polymorphisms were genotyped using Illumina Human 1M-Duo BeadChips and CNVs called with PennCNV as for the EGCUT discovery cohort. Differences in means of educational attainment were compared using a Welch 2-sample 1-tailed t test and Wilcoxon rank-sum test in R. Both tests returned comparable results.
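For readers unfamiliar with the Welch test, the statistic and its degrees of freedom can be computed as in the Python sketch below. The data are hypothetical education levels on a 1 to 7 scale, not study data, and the function name `welch_t` is illustrative.

```python
from math import sqrt
from statistics import fmean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    se2_a = variance(sample_a) / n_a   # squared standard error of each mean
    se2_b = variance(sample_b) / n_b
    t = (fmean(sample_a) - fmean(sample_b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical education levels for carriers and noncarriers.
carrier_levels = [3, 4, 3, 2, 4, 3]
noncarrier_levels = [4, 5, 4, 3, 4, 5, 4, 4]

t_stat, dof = welch_t(carrier_levels, noncarrier_levels)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

Unlike the pooled-variance t test, this version does not assume equal variances in the two groups, which matters when a small carrier group is compared with a much larger cohort. The P value follows from the Student t distribution with `dof` degrees of freedom; a statistics package is needed for that step because the Python standard library has no t distribution.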
To investigate the medical burden of rare CNVs in the general population, we opted for a genotype-first approach and analyzed a random sample from the EGCUT cohort. Within a combined discovery and replication sample of 7877 unrelated individuals, 56 carriers (0.7%) of known recurrent autosomal genomic disorders were identified (eTable 1 in the Supplement). Although the prevalence of each genomic disorder is lower than previously reported in clinical cohorts,36,37 it is only slightly lower than the 67 individuals expected according to the reported population prevalence of the 57 autosomal syndromes listed in the DECIPHER database of genomic disorders9 (eTable 2, eTable 5, and eMethods in the Supplement). The EGCUT cohort is depleted (6 observed carriers of 17 expected; OR, 0.35; 95% CI, 0.11-0.94; P = .03) of the most deleterious CNVs (graded 1-2 by DECIPHER), whereas the frequency of CNVs graded 3 and ungraded is as expected (50 of 50; OR, 1; 95% CI, 0.66-1.51; P > .99).
The clinical features of the EGCUT carriers of DECIPHER-listed CNVs are comparable with those reported in disease cohorts. Thirty-one of 56 (55%; including only formal diagnosis) and 39 of 56 (70%; including self-reported problems) carriers recruited from the general population with no prior awareness of their genetic disorder present phenotypes previously associated with their genomic lesion in the literature (see eTable 1 for the phenotypes identified in the 56 EGCUT carriers and eTable 2 for phenotypes associated with DECIPHER-listed CNVs in the Supplement). For example, carriers of the 16p11.2 600kb BP4-BP5 deletions and reciprocal duplications identified in clinical cohorts show opposite phenotypes on body weight, head size, and volume of specific corticostriatal structures. They exhibit reduced FSIQ, as well as neuropsychiatric problems and congenital abnormalities.11,12,38-44 Correspondingly, the baseline questionnaires of the 4 deletion (case Nos. 41-44 in eTable 1 in the Supplement) and 7 duplication (Nos. 45-51) carriers identified in the EGCUT cohort indicated high and low body mass indexes, respectively, neuropsychiatric traits, and learning and developmental problems. The follow-up evaluation of these carriers uncovered additional similarities in the spectrum and severity distribution of phenotypic features found in 16p11.2 BP4-BP5 rearrangement carriers identified through pan-European recruitment via clinical genetics centers (eTable 6 in the Supplement).
A genome-wide map of rare autosomal CNVs in the discovery set of 6819 individuals was generated (eTable 3 in the Supplement) and a total of 216 deletion and 509 duplication carriers were identified (≥250 kb with carrier frequency of ≤0.05%; eTable 7 in the Supplement). The underrepresentation of those with deletions compared with those with duplications (P = 2.2e-16) is consistent with previous reports and concordant with the hypothesis that the former are more deleterious.1,14 We found evidence for an association between carrier status and prevalence of intellectual disability. Twenty-three individuals, equal to a 3.2% prevalence, were diagnosed with intellectual disability in the rare CNV carriers group vs 114 intellectual disability diagnoses (1.7%) in the EGCUT cohort (OR, 1.93; 95% CI, 1.17-3.06; P = .007). This finding was associated with deletions; 11 deletion carriers (5.1%) had intellectual disability (OR, 3.16; 95% CI, 1.51-5.98; P = 1.5e-03). The prevalence of intellectual disability was higher in the carriers of DECIPHER-listed CNVs with 4 diagnosed individuals out of 45 (8.9%; OR, 5.74; 95% CI, 1.47-16.22; P = 7.2e-03). The difference with EGCUT remained statistically significant even after exclusion of this group with known disease-causing CNVs; the remaining 19 individuals with an intellectual disability diagnosis correspond to a prevalence of 2.8% (OR, 1.64; 95% CI, 0.95-2.71; P = .05).
We next assessed the correlation between CNV size and intellectual disability. It was previously reported that cohorts of affected patients show an excess of CNVs compared with controls and that this excess is larger for longer CNVs.7 The frequency of intellectual disability increases with CNV size: 6 individuals (4.3%) with deletions ranging from 250 kb to 500 kb (OR, 2.65; 95% CI, 0.94-6.11; P = .03) vs 3 of 36 individuals (8.3%) with deletions of at least 1 Mb (OR, 5.34; 95% CI, 1.03-17.42; P = .02), whereas associations with duplications are only detectable when rearrangements exceed 1 Mb in size (6 of 102 individuals diagnosed [5.9%]; OR, 3.67; 95% CI, 1.29-8.54; P = .008; Table 1). Among the 275 individuals with smaller deletions (125 kb ≤CNV <250 kb) no apparent association existed (7 diagnosed individuals [2.5%]; OR, 1.5; 95% CI, 0.59-3.28; P = .24).
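The size strata and the rarity threshold used in these analyses can be expressed as two small Python helpers. This is a sketch under assumptions: the stratum labels and function names are illustrative, and it presumes that the number of carriers of each CNV is already known, since the grouping of overlapping calls into one CNV is not described here.

```python
# (lower bound in base pairs, label), ordered from largest to smallest.
SIZE_CLASSES = [
    (1_000_000, ">=1 Mb"),
    (500_000, "500 kb-1 Mb"),
    (250_000, "250-500 kb"),
    (125_000, "125-250 kb"),
]


def size_class(length_bp):
    """Return the size stratum of a CNV, or None if it is below 125 kb."""
    for lower_bound, label in SIZE_CLASSES:
        if length_bp >= lower_bound:
            return label
    return None


def is_rare(carriers, cohort_size):
    """True when carrier frequency is at most 0.05% (5 per 10 000)."""
    # Integer arithmetic avoids floating-point rounding at the threshold.
    return carriers * 10_000 <= 5 * cohort_size


# In a cohort of 6819, the threshold corresponds to at most 3 carriers.
max_rare_carriers = max(k for k in range(10) if is_rare(k, 6819))
print(size_class(300_000), size_class(1_200_000), max_rare_carriers)
```

In the discovery cohort of 6819 individuals, a frequency of 0.05% or less therefore means that a CNV was seen in no more than 3 participants.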
The diagnosis of intellectual disability is binary. Thus, to assess the effects of rare CNVs with greater granularity, we investigated whether their occurrence and size are related to achieved educational levels, a proxy for global cognition.45,46 For this purpose, we used the scale of 7 sublevels of the Estonian education curriculum (eMethods in the Supplement). Although 1729 individuals (25.3%) sampled in the Estonian cohort did not complete secondary school (level 4; MEA, 4.09; 95% CI, 4.07-4.12), which was similar to the at-large Estonian population,10 the proportion of those who did not complete secondary school is higher among carriers of DECIPHER-listed genomic disorders with 22 (48.9%) only reaching elementary or basic education (OR, 2.8; 95% CI, 1.49-5.3; P = 8.3e-04; MEA, 3.71; 95% CI, 3.38-4.04; P = .03; Figure). The fraction of carriers who failed to reach secondary education was associated with CNV size. For example, the carriers of CNVs of 1 Mb or larger have an MEA of 3.65 (95% CI, 3.49-3.81; P = 4.6e-07), and 56 (40.6%) of them did not complete secondary school (OR, 2.01; 95% CI, 1.40-2.87; P = 1e-04; Figure). Deletions are associated with most of the outcome, with MEAs decreasing to 3.5 (95% CI, 3.20-3.80; P = 4e-04); 17 carriers (47.2%) of those with a deletion of 1 Mb or larger did not complete their secondary education (OR, 2.63; 95% CI, 1.28-5.36; P = .006; Figure). A decrease is already seen in the group with deletions ranging from 250 to 500 kb who had an MEA of 3.86 (95% CI, 3.68-4.05; P = 1.7e-02) and 41 carriers (29.5%) not graduating from secondary school (OR, 1.23; 95% CI, 0.83-1.80; P = .28). Consistent with the intellectual disability results, smaller deletions (125 kb ≤CNV <250 kb) in 275 individuals were not associated with changes in educational attainment (MEA, 4.11; 95% CI, 3.97-4.25; P = .80); 72 (26.2%) of these carriers had less than secondary education (OR, 1.04; 95% CI, 0.78-1.38; P = .78).
Similarly, duplications were associated with an educational attainment decrease only when rearrangements were 1 Mb or larger (MEA, 3.71; 95% CI, 3.52-3.90; P = 1.5e-04) and 39 carriers (38.2%) did not complete secondary school (OR, 1.82; 95% CI, 1.19-2.77; P = 4.2e-03; Figure).
The EGCUT ancestry principal components are not associated with CNV burden (eFigure 2 in the Supplement), suggesting that genetic stratification is likely not confounding the association with educational attainment. Likewise, differences in educational achievement possibilities due to religion or ethnicity were not likely to account for the observed associations, as the surveys of the Organization for Economic Cooperation and Development (OECD) Program for International Student Assessment and Program for the International Assessment of Adult Competencies showed that the “free education for all” Estonian system is among the best in the world in terms of results and equal opportunity (eMethods in the Supplement).
A replication of the education analysis was conducted on a nonoverlapping random set of 1058 unrelated EGCUT individuals recruited similarly (eTable 3 in the Supplement) but sampled at a different time point and genotyped using a different array platform. Of those, 271 (25.6%) did not complete secondary school (MEA, 4.00; 95% CI, 3.93-4.05). Congruent with the discovery cohort, carriers of deletions with sizes ranging from 250 kb to 500 kb showed a nonsignificant trend toward diminished educational attainment (MEA, 3.68; 95% CI, 3.39-3.97; P = .056), and 9 of them (36%) had only achieved a basic education or less (OR, 1.63; 95% CI, 0.63-3.98; P = .25). Similarly, duplication carriers with rearrangements of 1 Mb or larger had an MEA of 3.54 (95% CI, 2.97-4.11; P = .15), and 6 of them (46.2%) had achieved only basic education or less (OR, 2.49; 95% CI, 0.68-8.72; P = .11). The joint analyses of these 2 random cohorts confirmed the negative association with educational attainment among those with rare deletions of at least 250 kb (MEA, 3.81; 95% CI, 3.67-3.94; P = 1.06e-04); 83 individuals (33.5%) of 248 in this group achieved less than secondary school (OR, 1.48; 95% CI, 1.12-1.95; P = .005). The same held true for duplications of 1 Mb or larger (MEA, 3.69; 95% CI, 3.51-3.87; P = 5.024e-05); 45 carriers (39.1%) achieved less than secondary school (OR, 1.89; 95% CI, 1.27-2.8; P = 1.6e-03) (Figure and Table 2). We challenged these results further using a nonoverlapping set of 993 unrelated EGCUT individuals biased toward higher than average sociocognitive functioning due to the different ascertainment criteria (eTable 3 and eMethods in the Supplement; MEA, 4.77; 95% CI, 4.69-4.84; lower than secondary education, 9.4% [n = 93]).
Even in a group that is probably partially depleted of severe CNVs, there was a trend toward a lower MEA among carriers of deletions ranging from 250 kb to 500 kb and duplications of 1 Mb or larger of the same order of magnitude as in the discovery cohort (MEA, 4.36; 95% CI, 3.71-5.00; Δ, −0.41 and MEA, 4.44; 95% CI, 3.57-5.32; Δ, −0.33, respectively). Combining both independent replication cohorts confirmed the above results (MEA, 4.36 within the replication cohorts; MEA, 3.91 in the group of carriers of deletions ranging from 250-500 kb [95% CI, 3.63-4.20; P = .004]; MEA, 3.79 in the group of carriers of duplications of 1 Mb or larger [95% CI, 3.24-4.34; P = .057]). The same held true when all 3 Estonian cohorts were analyzed together (eTable 8 in the Supplement).
We sought to strengthen the inference from our results using the SATs scores of 5218 members of the ALSPAC birth cohort as an alternative measure of educational attainment (eTable 4 in the Supplement). When the MEA was studied using the transformed variables, mathematics scores were lower in carriers of rare intermediate-size deletions than in the controls (250 kb ≤CNV <500 kb: Welch 2-sided t test comparing means, P = .019), and English language scores were lower in carriers of large deletions (≥1 Mb, Welch 2-sided t test comparing means, P = .020; eTable 9 in the Supplement). Mean education attainment in English language and mathematics was lower in those who carried large duplications (≥1 Mb; P = .020 and P = .049, respectively, Welch 2-sided t test). These results support the association between educational attainment and rare CNVs using a different education metric in a geographically distinct and differently ascertained cohort of adolescents. Larger CNV size was associated with higher odds of individuals belonging to the lowest tertile of SATs score for both English language and mathematics (eTable 4 in the Supplement). This was apparent for carriers of deletions. For English language test results, carriers of deletions of 250 kb ≤CNV <500 kb had an OR of 1.26 (95% CI, 0.81-1.95), deletions of 500 kb ≤CNV <1 Mb had an OR of 1.69 (95% CI, 0.88-3.30), and deletions of 1 Mb or larger had an OR of 4.18 (95% CI, 1.48-14.87; P for trend = .002). For mathematics test results, carriers of deletions of 250 kb ≤CNV <500 kb had an OR of 1.42 (95% CI, 0.91-2.21); deletions of 500 kb ≤CNV <1 Mb had an OR of 2.21 (95% CI, 1.01-5.06); and deletions of 1 Mb or larger had an OR of 3.69 (95% CI, 1.51-10.29; P for trend < 2.0e-04).
Substantive evidence for an association of duplications and educational attainment was only observed for English language test results, for which carriers with duplications of 250 kb ≤CNV <500 kb had an OR of 1.14 (95% CI, 0.81-1.61); carriers with duplications of 500 kb ≤CNV <1 Mb had an OR of 1.19 (95% CI, 0.76-1.87); and carriers with duplications of 1 Mb or larger had an OR of 2.22 (95% CI, 1.07-4.84; P for trend = .035). For mathematics, carriers with duplications of 250 kb ≤CNV <500 kb had an OR of 1.10 (95% CI, 0.78-1.54); carriers with duplications of 500 kb ≤CNV <1 Mb had an OR of 1.03 (95% CI, 0.68-1.55); and carriers with duplications of 1 Mb or larger had an OR of 1.54 (95% CI, 0.80-3.01; P for trend = .27; Table 3).
These results were followed up in 2 separate cohorts of healthy individuals with normal cognitive functioning (eMethods in the Supplement). Consistent with this ascertainment, both the Italian and Minnesota cohorts suggested a paucity of DECIPHER-listed CNVs: 1 observed vs 4 expected among the Italian cohort (P = .37; OR, 0.25; 95% CI, 0.005-2.53) and 14 vs 20 among the Minnesota cohort (P = .39; OR, 0.7; 95% CI, 0.32-1.46; eTable 2). Of note, the analysis of the Italian cohort was restricted by a small sample size (n = 451; eTable 3), resulting in both limited statistical power and a limited CNV frequency calculation (≥0.25%). At this 5-fold higher level of prevalence, the MEA was lower in carriers of deletions of 500 kb ≤CNV <1 Mb (Δ MEA = −0.26; P = .39, Wilcoxon test) and carriers of duplications of 1 Mb or larger (Δ MEA = −0.66; P = .11, Wilcoxon test; eTable 10 in the Supplement). A consistent, but similarly underpowered, association with lower FSIQ was found in carriers of rare deletions in the Minnesota cohort (500 kb ≤CNV <1 Mb, Δ = −4.23 IQ points, P = .43; ≥1 Mb, Δ = −13.82 IQ points, P = .09) and duplications (500 kb ≤CNV <1 Mb, Δ = −5.56, P = .01; ≥1 Mb, Δ = −6.03, P = .16; eTable 11 in the Supplement).
In contrast to duplication carriers (male:female ratio, 1.06 [303:285]), an excess of female carriers was observed in every deletion size class of 250 kb or larger separately and together within the combined EGCUT discovery and replication cohort (male:female ratio, 0.78 [109:139]; P = .14; OR, 1.22; 95% CI, 0.94-1.59). The reduction of MEA was greater in female carriers than in male carriers (Δ MEA, −0.42 for females and −0.02 for males; Figure). Specifically, the female carriers of the 250 kb ≤CNV <500 kb deletion had an MEA of 3.71 (95% CI, 3.50-3.92) compared with an MEA of 4.13 (95% CI, 4.09-4.16; P = 3e-04) in EGCUT females. The male carriers of similar size deletions had an MEA of 4.00 (95% CI, 3.76-4.24), whereas EGCUT males had an MEA of 4.02 (95% CI, 3.99-4.06; P = .85). Note that although 855 women (21.2%) in EGCUT earned college or academic degrees, the presence of a rare deletion is associated with a decreased fraction of women reaching the highest educational levels (levels 5 through 7). Only 12 women (12.9%) in the 250 kb ≤ deletion <500 kb group attained these highest educational levels (OR, 0.55; 95% CI, 0.27-1.02; P = .05; Figure). Similarly, only 1 of 20 women carrying deletions of 1 Mb or larger achieved more than a secondary school education. She carried the 17p12 deletion, which is causative for hereditary neuropathy with liability to pressure palsies (HNPP; OMIM 162500). The joint analysis of the 3 Estonian cohorts confirmed that female deletion carriers are responsible for the majority of the decrease in educational attainment. All the EGCUT women combined had an MEA of 4.22 (95% CI, 4.19-4.25), whereas women with the 250 kb ≤CNV <500 kb deletion had an MEA of 3.71 (95% CI, 3.54-3.88; P = 3.9e-08); 920 women (20.3%) achieved basic education or less compared with 49 female deletion carriers (33.6%; OR, 1.99; 95% CI, 1.37-2.85; P = 2.4e-04; eTable 8 in the Supplement).
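As an illustration of how an odds ratio such as the one reported for women reaching educational levels 5 through 7 can be derived from counts, the following minimal sketch computes an OR with a Wald-type 95% CI from a 2×2 table. The group totals (93 carriers and 4033 women overall) are not stated in the text; they are back-calculated here from the reported counts and percentages (12 is 12.9% of about 93; 855 is 21.2% of about 4033) and should be treated as assumptions. The article does not say which interval method was used, so the Wald interval below is a stand-in and is not expected to reproduce the published 0.27-1.02 exactly.

```python
from math import exp, log, sqrt


def odds_ratio_wald(a, b, c, d, z=1.96):
    """Odds ratio and Wald CI for the 2x2 table [[a, b], [c, d]].

    a, b: carriers with / without the outcome
    c, d: comparison group with / without the outcome
    """
    estimate = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation
    se_log = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(estimate) - z * se_log)
    upper = exp(log(estimate) + z * se_log)
    return estimate, lower, upper


# Assumed totals, back-calculated from the reported percentages
carriers_total = 93      # 12 women = 12.9%
women_total = 4033       # 855 women = 21.2%

or_est, ci_low, ci_high = odds_ratio_wald(
    12, carriers_total - 12,    # carriers: levels 5-7 vs lower
    855, women_total - 855,     # comparison: levels 5-7 vs lower
)
print(f"OR = {or_est:.2f}, 95% CI {ci_low:.2f}-{ci_high:.2f}")  # OR ≈ 0.55
```

The point estimate agrees with the reported OR of 0.55; the interval differs slightly in the second decimal, as would be expected if the published interval came from an exact method.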
Consistent with the Estonian results, the Minnesota cohort female carriers of deletions of 500 kb or larger were associated with a stronger decrease of FSIQ (Δ = −13.73; P = .03) than were male carriers (Δ = −0.12; P = .98; eTable 11 in the Supplement).
Investigating the functions of the 642 protein-coding genes encompassed in the identified rare deletions of 250 kb or longer, we found evidence of enrichment for genes with a role in neurogenesis, cognition, learning, memory, and behavior (29 of the top 50 gene ontology processes with strongest evidence; all with a false discovery rate of less than 2.45e-05; eTable 12 in the Supplement). We then assessed whether we could use gene characteristics to more accurately predict CNV deleteriousness. Copy number variations were stratified by the number of embedded protein-coding and noncoding genes, neurodevelopmental genes,22 ohnologs,24 the sum of imbalanced genes’ probability score for haploinsufficiency (HiS),23 and the highest HiS in the CNV. A decrease of cognitive abilities was present in carriers of deletions encompassing 2 or more genes (MEA, 3.82; 95% CI, 3.65-3.99; P = .003) and duplications including 11 or more genes (MEA, 3.74; 95% CI, 3.56-3.92; P = 3e-04; eFigure 3 in the Supplement). When genes were present in the rearranged interval, deleteriousness was associated with the presence of at least 1 protein-coding gene. Within the group of carriers of deletions with these characteristics, 8 individuals were diagnosed with an intellectual disability, a prevalence of 5.3% (OR, 3.31; 95% CI, 1.37-6.93; P = 4.6e-03). Together this group had an MEA of 3.79 (95% CI, 3.61-3.97; P = 1.4e-03), and 50 (33.3%) did not reach secondary education (OR, 1.47; 95% CI, 1.02-2.1; P = .029). These results are in agreement with the observation that the majority of mendelian pathogenic mutations disrupt coding sequences.47 Prevalence of intellectual disability was best correlated with the presence of at least 1 neurodevelopmental gene within the deleted interval or with a high HiS sum. The group of carriers of deletions encompassing a neurodevelopmental gene had an MEA of 3.76 (95% CI, 3.48-4.05; P = .03).
Within this group, 6 individuals (8.8%) were diagnosed with intellectual disability (OR, 5.69; 95% CI, 1.97-13.47; P = .001). The group of carriers of deletions with the highest quartile of HiS sums had an MEA of 3.91 (95% CI, 3.59-4.23; P = .27) with 4 individuals (8.9%) diagnosed with intellectual disability (OR, 5.74; 95% CI, 1.46-16.22; P = 7.2e-03). Presence of an ohnolog in the deletion was associated with a higher prevalence of intellectual disability, although to a lesser degree, affecting 6 individuals (5.9%; OR, 3.7; 95% CI, 1.29-8.54; P = .008). Neither separately nor together did the numbers of promoters, enhancers, transcriptional elements, and insulators within a CNV correlate with intellectual disability and educational attainment.
Although various large pathogenic CNVs are known, the vast majority of rare CNVs of intermediate size (250-500 kb) are thought to be nondeleterious. In the current report we show that the presence of both recurrent syndromic and rare intermediate-size nonrecurrent CNVs, which are cumulatively frequent in the general population (10.5%), is associated with intellectual disability and negatively associated with educational attainment. For example, the frequency of intellectual disability increases to 4.3% among carriers of 250 kb ≤ deletions <500 kb compared with 1.7% in the Estonian general population. The MEA of carriers decreased from 4.09 to 3.86, with 29.5% not graduating from secondary school compared with 25.3% in the Estonian population. These associations are likely to be underestimated because of the exclusion of the most severely affected patients, the inclusion of patients with CNVs known to have no effect on cognition, and the incorrect inclusion of carriers of large somatic or tumorigenic genomic lesions.
The link between impaired cognitive functioning and lower academic achievement in CNV carriers parallels the recognized correlation between health and education.48 This health-education gradient was postulated to result from the combination of heritable factors impacting both traits, poor early life health that affects learning, and health-related behaviors being modulated by education. Recurrent CNVs conferring risk of autism spectrum disorders or schizophrenia were associated with a decrease in IQ of individuals from the general population.49 In addition, phenotype mining of carriers of genomic variants in the Northern Finland 1966 Birth Cohort revealed an excess of lower IQ, school grade retention before age 14 years, and impaired hearing among individuals carrying deletions larger than 500 kb previously implicated in neurodevelopmental disorders.14 However, neither study recognized that other CNVs, in particular nonrecurrent ones, were also associated with decreases in cognitive capabilities.
Although 40% to 80% of the variance in intelligence and 20% to 40% in educational attainment are explained by genetic factors,50-53 studies failed to find major contributors to this heritability. For example, 3 individual SNPs each with an approximate effect size of 1 month of schooling per allele have been identified in a genome-wide association study involving more than 126 000 individuals (largest estimated effect, 0.02%),17 and only a polygenic model including approximately 300 000 common SNPs genome-wide explained 28% to 29% of variation in general cognition.54 Even though earlier studies failed to identify common CNVs as major contributors to the above heritabilities,55-58 the results presented herein suggest that rare structural variants of 250 kb or larger for deletions and 1 Mb or larger for duplications are associated with complex traits such as educational attainment and variance in intelligence in population cohorts. About 2% of the analyzed biobank participants carry a rare CNV of 1 Mb or larger. Even without considering other health problems, a fifth of them appear to be linked with decreased quality of life, for the fraction reaching a secondary educational level is 15% lower when comparing CNV carriers to the general population. This reduction results in an MEA that is half a level lower. If we take into account also the carriers of the smaller intermediate-size CNVs associated with lower educational attainment identified in this report (at least 0.2% of the population) and the highly pathogenic anomalies absent from the EGCUT cohort (0.15%), the quality of life for 1 of 40 people might be negatively affected by rare CNVs. These variants may account for a sizable portion of the heritability of the complex “educational attainment” measure.52
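One way to read the "1 of 40" estimate is as the sum of the 3 carrier fractions quoted in the preceding paragraph. This is an interpretation of the text rather than a calculation given by the authors, and it treats the "at least 0.2%" figure as exactly 0.2%:

```latex
\[
\underbrace{2\%}_{\text{rare CNVs}\ \geq 1\ \text{Mb}}
+ \underbrace{0.2\%}_{\text{smaller intermediate-size CNVs}}
+ \underbrace{0.15\%}_{\text{highly pathogenic, absent from EGCUT}}
= 2.35\% \approx \frac{1}{43}
\]
```

Because the middle term is a lower bound, the total is at least 2.35%, which rounds to roughly 1 person in 40.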
The observed excess of females carrying rare genomic deletions supports the recently described female-biased mutational burden.21,59 Females appear “protected” from neurodevelopmental disorders. This potentially allows females to be enrolled in general population cohorts despite the fact that they carry rare CNVs, whereas their male counterparts who likely present more severe phenotypes are excluded from such studies. Consequently and corroboratively, female deletion carriers are responsible for the majority of the signal on educational attainment.
Although intellectual disability prevalence was increased with presence of a neurodevelopmental or ohnolog gene in the deleted interval or a high haploinsufficiency score of imbalanced genes, none of the assessed evaluators correctly capture the variation in educational attainment, possibly because they are limited to protein-coding genes. Investigation of the function of the encompassed protein-coding genes revealed that they were enriched for genes involved in neurogenesis, cognition, learning, memory, and behavior. This is consistent with the hypothesis that these rearrangements are rare because they affect genes important for neurodevelopment and thus are rapidly purged from the population.
Although none of the carriers of known syndromic CNVs identified in the EGCUT cohort were previously diagnosed with a genetic disease, many had major clinical problems (eg, intellectual disability, congenital anomalies, neuropathies, neuropsychiatric disturbances, extreme obesity, and reproductive problems). Because the latter are most likely caused by the newly found genetic alterations, it suggests that these individuals have escaped the attention of the medical genetics system and thus far have not received proper examination and counseling.
We acknowledge several study limitations. Because this is an observational study, no causal inferences can be drawn and confounding bias due to another causal factor could not be excluded. Although caution is required in using educational attainment as a proxy for intellectual function, the confirmatory results obtained with SATs scores and FSIQ in geographically distinct cohorts mitigate this concern. Some of the results show borderline statistical significance, which can be explained by the fact that rare CNVs by definition translate to a small number of carriers. The investigation of the population variance of a complex trait such as educational attainment requires extremely large phenotyped data sets to reach sufficient power.
Known pathogenic CNVs in unselected, but assumed to be healthy, adult populations may be associated with unrecognized clinical sequelae. Additionally, individually rare but collectively common intermediate-size CNVs may be negatively associated with educational attainment. Replication of these findings in additional population groups is warranted given the potential implications of this observation for genomics research, clinical care, and public health.
Corresponding Author: Alexandre Reymond, PhD, Center for Integrative Genomics, University of Lausanne, Genopode Bldg, 1015 Lausanne, Switzerland (firstname.lastname@example.org).
Author Contributions: Dr Reymond had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Study concept and design: Männik, Reymond.
Acquisition, analysis, or interpretation of data: All the authors.
Drafting of the manuscript: Männik, Reymond.
Critical revision of the manuscript for important intellectual content: All the authors.
Statistical analysis: Männik, Mägi, Macé, Cole, Guyatt, Shihab, Mihailov, Kutalik, Pankratz, Timpson.
Obtained funding: Metspalu, Reymond.
Administrative, technical, or material support: Mihailov, Alavere, Metspalu, Reymond.
Study supervision: Metspalu, Reymond.
Conflict of Interest Disclosures: All authors have completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest. Drs Timpson, Gaunt, and Shihab work within the MRC Integrative Epidemiology Unit, supported by the Medical Research Council [MC_UU_12013/1-9]. No other disclosures were reported.
Funding/Support: Dr Männik is a grantee of a scholarship from the Swiss Scientific Exchange New Member State of the European Union Program. Ms Guyatt is funded by a PhD studentship from the Wellcome Trust (grant 102433/Z/13/Z). Dr Jacquemont is a Bursary Professor of the Swiss National Science Foundation (SNSF). This study is supported by 2 SNSF grants (31003A_160203, Drs Reymond, and Kutalik), a specific 16p11.2 SNSF Sinergia grant (CRSII33-133044, Dr Reymond), the Simons Foundation Autism Research Initiative (SFARI274424, Dr Reymond), Leenaards Foundation Prizes (Drs Jacquemont, Reymond, and Kutalik), European Commission Framework Program 7 grants (278913, 306031, and 313010) (Dr Metspalu), Center of Excellence in Genomics (EXCEGEN) and University of Tartu (SP1GVARENG, Dr Metspalu), Estonian Research Council Grant (IUT20-60, Dr Metspalu), US Public Health Service grants from the National Institute on Alcohol Abuse and Alcoholism (AA09367 and AA11886, Dr Pankratz), the National Institute on Drug Abuse (DA05147, DA13240, and DA024417, Dr Pankratz), and the National Institute of Mental Health (MH066140, Dr Pankratz).
Role of the Funder/Sponsor: None of the funders or sponsors had any role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.
Disclosure: This publication is the work of the authors, who will serve as guarantors for the contents of this article.
Additional Data: ALSPAC data were generated by Sample Logistics and Genotyping Facilities at the Wellcome Trust Sanger Institute and LabCorp (Laboratory Corporation of America) using support from 23andMe. ALSPAC analyses were completed using computational facilities of the Advanced Computing Research Centre, University of Bristol (http://www.bris.ac.uk/acrc).
Additional Contributions: We thank the EGCUT participants and the EGCUT personnel for their compensated assistance in recruiting, phenotyping, routing samples, genotyping, and administrative responsibilities, especially Mari Nelis, PhD, Lili Milani, PhD, Viljo Soo, Kairit Mikkel, Mari-Liis Tammesoo, MSc, and Steven Smit. We are grateful to Sven Bergmann, PhD, University of Lausanne, for comments on the manuscript. We are extremely grateful to all the families who took part in the ALSPAC study, the midwives for their help in recruiting them, and the whole ALSPAC team, which includes interviewers, computer and laboratory technicians, clerical workers, research scientists, volunteers, managers, receptionists, and nurses. We are grateful to Mary Ward for her advice on the use of the ALSPAC education data. The UK Medical Research Council and the Wellcome Trust (grant 102215/2/13/2) and the University of Bristol provided core support for ALSPAC.