The Genetic Architecture of Depression in Individuals of East Asian Ancestry

Key Points Question Are the genetic risk factors for depression the same in individuals of East Asian and European descent? Findings In this genome-wide association meta-analysis of depression in 194 548 individuals with East Asian ancestry, 2 novel genetic associations were identified, one of which is specific to individuals of East Asian descent living in East Asian countries. There was limited evidence for transferability with only 11% of depression loci previously identified in individuals of European descent reaching nominal significance levels in the individuals of East Asian descent. Meaning Caution is advised against generalizing findings about genetic risk factors for depression beyond the studied population.

for age, sex, principal components (PCs) and recruitment region. After filtering variants with effective sample size (Neff) < 50 2 and poorly imputed variants (info<0.7), 10,834,708 variants were included in the downstream analyses.

B. China, Oxford and Virginia Commonwealth University Experimental Research on Genetic Epidemiology cohort (CONVERGE)
The CONVERGE cohort of Han Chinese women has been previously described 3 . Briefly, ~5,000 cases of recurrent MDD (≥2 episodes), established with the CIDI, which used DSM-IV criteria, were analysed against an equal number of controls. Cases with medical history of bipolar disorder, psychosis, mental retardation and/or drug or alcohol abuse before their first depressive episode were excluded from the study.
CONVERGE samples underwent whole-genome sequencing, as previously described 3 . In brief, after genotype calling, two rounds of imputation were performed: first without a reference panel and then using the 1000 Genomes Phase 1 Asian haplotypes. Variants with a) a P-value for violation of HWE < 10⁻⁶, b) information score < 0.9 and c) MAF in CONVERGE < 0.5% were excluded from the GWAS, resulting in a final set of 5,987,610 SNPs. The GWAS was conducted with a mixed-linear model including a genetic relationship matrix (FastLMM version 2.06.20130802) as random effect and PCs from eigen-decomposition of this matrix as fixed effects. We further filtered the publicly available GWAS summary statistics by removing variants with Neff less than 50.

C. 23andMe cohort
The GWAS dataset of the personal genetics company 23andMe, Inc. (Sunnyvale, CA) that was included in this meta-analysis encompassed 2,729 depression cases and 90,310 controls of East Asian ancestry. All participants provided informed consent and answered surveys online according to 23andMe's human subject protocol, which was reviewed and approved by Ethical & Independent Review Services, an AAHRPP-accredited institutional review board. As part of the medical history survey, participants were asked if they had ever received a clinical diagnosis or treatment for depression (binary variable).
DNA extraction and genotyping were performed on saliva samples by National Genetics Institute (NGI), a CLIA licensed clinical laboratory and a subsidiary of Laboratory Corporation of America. Samples were genotyped on one of five genotyping platforms. The v1 and v2 platforms were variants of the Illumina HumanHap550+ BeadChip, including about 25,000 custom SNPs selected by 23andMe, with a total of about 560,000 SNPs. The v3 platform was based on the Illumina OmniExpress+ BeadChip, with custom content to improve the overlap with our v2 array, with a total of about 950,000 SNPs. The v4 platform was a fully customized array, including a lower redundancy subset of v2 and v3 SNPs with additional coverage of lower-frequency coding variation, and about 570,000 SNPs. The v5 platform (68.4% of the samples in the East-Asian dataset), is an Illumina Infinium Global Screening Array (~640,000 SNPs) supplemented with ~50,000 SNPs of custom content. This array was specifically designed to better capture global genetic diversity and to help standardize the platform for genetic research.
Imputation was performed with Minimac3 using a reference panel combining the May 2015 release of the 1000 Genomes Phase 3 haplotypes with the UK10K imputation reference panel. The association testing was performed by logistic regression assuming additive allelic effects, adjusting for age, sex, the top five principal components to account for residual population structure and indicators for genotype platforms to account for genotype batch effects. The association analysis and the downstream quality control were conducted separately for the genotyped and the imputed SNPs.
Genotyped GWAS results were filtered for: SNPs that were only genotyped on "v1" and/or "v2" platforms due to small sample size, SNPs on chrM or chrY, SNPs that failed a test for parent-offspring transmission, SNPs with fitted β<0.6 and P<10⁻²⁰ for a test of β<1, SNPs with a Hardy-Weinberg P<10⁻²⁰ or a call rate of <90%, SNPs with genotype date effects (determined as P<10⁻⁵⁰ by ANOVA of SNP genotypes against a factor dividing genotyping date into 20 roughly equal-sized buckets), SNPs with large sex effect (ANOVA of SNP genotypes, r²>0.1), SNPs with probes matching multiple genomic positions in the reference genome and variants with minor allele counts in the controls less than 50.
For imputed GWAS results, SNPs with poor imputation quality (rsq<0.7), Neff less than 50 and SNPs that had strong evidence of a platform batch effect were excluded from the downstream analysis. The batch effect test is an F test from an ANOVA of the SNP dosages against a factor representing v4 or v5 platform (P<10⁻⁵⁰).
Across all results, further filtering removed SNPs with an available sample size of less than 20% of the total GWAS sample size and logistic regression results that did not converge due to complete separation, identified by abs(effect)>10 or stderr>10 on the log odds scale.

D. Taiwan-Major Depressive Disorder (MDD) Study
MDD patients were included from a family study of mood disorders in Taiwan were also excluded. The GWAS was performed using PLINK 1.9 and adjusted for 5 ancestry principal components. The GWA analysis was conducted separately by platforms with (1) Affymetrix TWB2.0 and (2) all other platforms combined together. In the latter, variants significantly associated (P<0.005) with genotyping platforms were excluded from downstream analysis. We also used a stricter imputation threshold for filtering (info<0.9 instead of 0.7).

E. Women's Health Initiative study (WHI)
The WHI study is a long-term national health study in the U.S. conducted in postmenopausal women, enrolled either in a clinical trial or an observational study 5 . We analysed data from 3,492 women with Asian ancestry who were genotyped as part of the WHI-Population Architecture using Genomics and Epidemiology (PAGE) sub-study. These participants had agreed for their data to be included in the database of Genotypes and Phenotypes (dbGaP). The genotype and phenotype data were accessed via dbGaP study accession phs000200.v12.p3. Depressive symptoms in the past week were assessed at the baseline visit with a 6-item Center for Epidemiological Studies Depression Scale (CES-D) form. Based on the definitions of Smoller et al. 6 , participants with a score of 5 or more were considered as depression cases, while participants not classified as currently depressed (6-item CES-D), without medical history of depression (2-item Diagnostic Interview Schedule) and not on antidepressant therapy constituted the control group.
The Asian WHI participants included in our analyses were genotyped with the CardioMetaboChip, as part of the NHGRI's PAGE project. Samples and variants with a call rate lower than 95%, typed variants with a difference in missingness rate between the case and control groups > 0.2 and variants with MAF < 0.05 were excluded from downstream analysis. A logistic regression analysis was performed (PLINK2), adjusting for age, sex, 20 PCs and study subgroup.

F. Intern Health Study (IHS)
We also considered participants from IHS, a multi-institutional longitudinal cohort study of medical interns in the U.S. The study design has been previously described 7 . Depressive symptoms were measured through the PHQ-9 questionnaire, a self-report component of the primary care evaluation of mental disorders inventory. Subjects were asked to complete the questionnaire assessing PHQ-9 depressive symptoms in the baseline survey, as well as at months 3, 6, 9 and 12 of their internship year. Participants with a PHQ-9 score of 10 or greater 8 during their internship were considered as depression cases in this study. A total of 294 depression cases and 544 controls were considered in this association study.
IHS samples were genotyped on the Illumina Infinium CoreExome v1.0 or v1.1 array. Quality control steps and imputation were performed using the Ricopili Rapid Imputation Consortium Pipeline 9 . Study samples were assigned into distinct ancestry groups based on PCs derived from the study samples combined with the 1000 Genomes reference panel. In brief, samples with call rate < 98% or samples with a gender mismatch between genotype and reported data were excluded. For duplicated samples and up to third-degree relatives, the sample with the higher call rate was selected. Variants with call rate < 98% or missing difference > 0.20 were also excluded prior to imputation. Genotypes were imputed to the Haplotype Reference Consortium (HRC) reference panel using EAGLE and IMPUTE2 for the phasing and the imputation respectively. A logistic regression analysis was performed (PLINK2) on genotype dosages, adjusting for age, sex and the first 20 PCs. Variants with MAF < 0.05 and imputation info score < 0.7 were excluded from downstream analysis, resulting in a dataset of 4,626,568 variants.

G. UK Biobank (UKB)
UKB is a well-characterized cohort of more than 500,000 individuals recruited in the UK between 2006 and 2010 with linked health and genetic data 10 . A subset of participants has also completed the mental-health questionnaire. We used a combination of hospital diagnoses (ICD10 codes) and lifetime CIDI (A. prolonged feelings of depression OR prolonged loss of interest in normal activities AND B. affected more than half of the day during worst episode of depression AND C. the frequency of depressed days during worst episode was almost every day/every day AND D. these problems interfered with your life/activities (study/employment, childcare and housework, leisure pursuits) somewhat/a lot) to define our cases. Gender mismatches, missingness/heterozygosity outliers, participants with excessive genetic relatedness, participants with no quality control metrics, individuals that have withdrawn their consent and up to 2nd degree relatives (PC-Relate) were excluded before the analysis.
UKB genotyping was conducted by Affymetrix using two similar arrays: the Applied Biosystems™ UK BiLEVE Axiom™ Array, consisting of 807,411 genetic variants, and a bespoke UK Biobank Axiom™ array, including 825,927 genetic variants. All genetic data were quality controlled by the UKB bioinformatics team, both at sample and marker level, resulting in a dataset of 488,377 samples and 805,426 variants from both arrays. The genetic data were subsequently imputed by UKB to over 90 million SNPs, indels and large structural variants, using haplotypes from British, European and diverse-ancestry populations. For this study, we used data imputed with both the HRC and the merged UK10K and 1000 Genomes Phase 3 reference panels 10 . To assign individuals to ancestry groups based on their genetic information, we implemented the PC-AiR method to perform a PC analysis for the detection of population structure 11 . A logistic regression analysis was performed on the imputed genetic dataset (PLINK2), adjusting for age, sex, genotyping array and PCs that were calculated based on the subset of genetically defined EAS participants. Downstream analysis was restricted to the subset of common (MAF > 0.05) and well-imputed (info > 0.7) variants. The analysis was conducted under UK Biobank application 51119.

H. Army Study To Assess Risk and Resilience in Service members (Army STARRS) study
Data from Army STARRS, a study conducted in army members in the USA, were also assessed in the current analysis. Army STARRS includes the New Soldier Study (NSS) and the Pre/Post Deployment Study (PPDS). Detailed information about the design of the study has been published previously 12 . Depression outcomes were measured with the CIDI screening scales and evaluated for concordance with DSM-IV diagnoses within the Army STARRS clinical reappraisal study 13 .
The genotyping and imputation of the Army STARRS New Soldier Study (NSS) samples have been described previously 14 . In brief, samples were genotyped using the Illumina OmniExpress and Exome array and were imputed to a multi-ancestry reference panel from the 1000 Genomes Project (phase 1). Samples and genetic variants with a call rate less than 95% and 98% respectively were filtered out. A logistic regression analysis was performed on common and well-imputed variants (PLINK2), adjusting for age, sex and the first 20 PCs.
BioMe samples were genotyped with the Infinium Global Screening Array (GSA) BeadChip. Individuals with population-specific heterozygosity rate that surpassed +/-6 standard deviations of the population-specific mean, along with individuals with a call rate of <95%, individuals with discordant reported and genetic sex and with phenotypically intermediate sex were not considered in the analysis. In cases of duplicates, the sample of each pair with the lower missingness rate in the exomic data was preferentially excluded. Genetic variant exclusions included a call rate <95% and HWE p < 10⁻⁵. The resulting dataset was imputed to the 1000 Genomes Phase 3 reference panel. The GWAS was performed with a binary mixed model (SAIGE). The first 20 PCs were calculated using PLINK (v1.9) and a genomic relationship matrix (GRM) was calculated using the KING (v1.4) software (-ibs). The PCA and GRM calculations were restricted to common (MAF>0.01), autosomal sites. Additionally, variants with MAF<0.05 and info<0.7 were excluded before the meta-analysis.

Data availability statement
Summary statistics for the combined EAS meta-analysis excluding the 23andMe study are available through the PGC website (http://www.med.unc.edu/pgc/downloads). The genome-wide summary statistics for CONVERGE and the European meta-analysis are also available on the PGC website. Uploading and sharing of individual genetic data from CKB are subject to restrictions according to the Interim Measures for the Administration of Human Genetic Resources administered by the Human Genetic Resources Administration of China (HGRAC). Summary data including allele frequencies and GWAS summary statistics are available by application and restricted to research-related purposes.
Other individual-level CKB data are available through www.ckbiobank.org, subject to completion of a Material Transfer Agreement, either through Open Access or on application. CKB data access is subject to oversight by an independent Data Access Committee. Analyses using CKB data were conducted under research approval 2018-0018. Data from 23andMe, Inc were made available under a data use agreement that protects participant privacy. Please visit https://research.23andme.com/collaborate/#dataset-access for more information and to apply to access the data. The raw genetic and phenotypic UK Biobank data used in this study, which were used under license (application number 51119), are available from: http://www.ukbiobank.ac.uk/. The genotype and phenotype data for the WHI study can be requested via dbGaP study accession phs000200.v12.p3.

Genotyping
The genotyping of each study has been previously described 3, 4, 10, 14, 16 . To optimise genome-wide coverage in EAS populations, genotyping was carried out using two custom-designed Affymetrix Axiom arrays in CKB and the Affymetrix TWB2.0 array for a subset of the Taiwan-Major Depressive Disorder study samples 1,4 . CONVERGE used whole-genome sequencing with a mean depth of 1.7x 3 . More detail for all studies is provided in the study descriptions above.

Quality control
Quality control and association analyses were carried out separately for each study as described in the study descriptions and Supplementary Table 2. Genotypes were imputed to the 1000 Genomes Project reference panel, except in IHS, where the Haplotype Reference Consortium (HRC) was used, and in 23andMe and UKB, where the 1000 Genomes data were combined with the UK10K and HRC imputation reference panels, respectively. In the meta-analysis, we included only well-imputed variants (imputation accuracy > 0.7) with effective sample size (Neff) equal to or higher than 50 2 in the larger datasets (CONVERGE, CKB, 23andMe), and with minor allele frequency (MAF)>=0.05 in the other studies.
For the Taiwan-MDD study an imputation accuracy threshold of 0.9 was used.
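The inclusion rules for the meta-analysis can be summarised in a short sketch. This is only an illustration under stated assumptions: the function name `keep_variant` and the study labels are hypothetical, and Neff is taken as a precomputed per-variant value because its definition is given only by reference 2.

```python
# Illustrative sketch of the meta-analysis inclusion rules described above.
# Study labels and function name are hypothetical; Neff is assumed to be
# supplied per variant (its definition is not restated in this section).

LARGE_STUDIES = {"CONVERGE", "CKB", "23andMe"}   # filtered on Neff
INFO_THRESHOLDS = {"Taiwan-MDD": 0.9}            # stricter imputation threshold
DEFAULT_INFO_THRESHOLD = 0.7


def keep_variant(study, info, maf=None, neff=None):
    """Return True if a variant from `study` passes the stated filters."""
    # Imputation accuracy must exceed the study-specific threshold.
    if info <= INFO_THRESHOLDS.get(study, DEFAULT_INFO_THRESHOLD):
        return False
    # Larger datasets: effective sample size of at least 50.
    if study in LARGE_STUDIES:
        return neff is not None and neff >= 50
    # All other studies: common variants only.
    return maf is not None and maf >= 0.05
```

For example, a CKB variant with info 0.8 and Neff 60 would be kept regardless of its allele frequency, whereas a Taiwan-MDD variant with info 0.8 would be dropped.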

Meta-analysis
We performed a Z-score weighted meta-analysis using METAL 33 for 13,163,200 genetic variants (Supplementary Figure 1). For all meta-analyses, results were restricted to variants present in at least two studies. We also performed a Z-score weighted meta-analysis combining results from our EAS analysis and the publicly available summary statistics from the largest published GWAS in EUR samples 17 . Variants associated at genome-wide significance in this trans-ancestry meta-analysis were considered novel if they were located more than 250 kb from the lead variants from the published GWAS of depression in EUR and if the Linkage Disequilibrium (LD) with the lead variant was <0.01 17 . We calculated betas for the meta-analyses using the formula from Zhu et al. 18 . Odds ratios were based on an inverse-variance weighted meta-analysis of the study betas, where for CONVERGE we used results from a logistic regression in PLINK instead of FastLMM.
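The two weighting schemes mentioned above can be sketched as follows. This is a minimal illustration, not the METAL implementation: it assumes the standard sample-size weighted combination of Z-scores (weights proportional to the square root of each study's sample size) and the standard fixed-effect inverse-variance formula. Function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def two_sided_p(z):
    """Two-sided P-value for a standard normal test statistic."""
    return 2.0 * _STD_NORMAL.cdf(-abs(z))


def zscore_meta(z_scores, sample_sizes):
    """Sample-size weighted Z-score meta-analysis (weights = sqrt(N))."""
    weights = [sqrt(n) for n in sample_sizes]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    z_meta = numerator / sqrt(sum(w * w for w in weights))
    return z_meta, two_sided_p(z_meta)


def ivw_meta(betas, standard_errors):
    """Fixed-effect inverse-variance weighted meta-analysis of log odds ratios."""
    inv_var = [1.0 / (se * se) for se in standard_errors]
    beta = sum(b * w for b, w in zip(betas, inv_var)) / sum(inv_var)
    se = sqrt(1.0 / sum(inv_var))
    return beta, se, two_sided_p(beta / se)
```

With two equally sized studies that each report Z = 2, the combined Z-score is 2√2, which shows how consistent sub-threshold signals accumulate across cohorts.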

Functional annotation and gene-based association analysis
We functionally annotated the lead variants and their proxies (r²≥0.8). Gene-based association analysis was performed using MAGMA (v1.08), implemented in FUMA, with default settings 19,20 . SNPs were mapped to 19,575 protein coding genes from Ensembl build 85. Significance for the gene-based analysis was defined as the Bonferroni corrected threshold (P=2.6×10⁻⁶).
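The gene-based threshold follows from dividing a family-wise error rate by the number of genes tested. The short check below assumes a family-wise alpha of 0.05, which the text does not state explicitly but which reproduces the reported value.

```python
# Bonferroni threshold for the gene-based test.
# The family-wise alpha of 0.05 is an assumption; the gene count is from the text.
N_GENES = 19_575
FAMILY_WISE_ALPHA = 0.05

gene_threshold = FAMILY_WISE_ALPHA / N_GENES
print(f"{gene_threshold:.1e}")
```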
We functionally annotated the lead SNPs in the genomic regions associated with increased risk for depression using HaploReg v4 21 and the Open Targets Genetics Platform 22 . Candidate genes for each locus associated with depression were selected based on their proximity to the lead variant and/or the evidence of eQTL associations for a gene in that region. Open Targets Genetics interrogates various data sources to link genetic variation to gene expression. The GeneCards database was used to obtain summary information on the identified genes, while NCBI's PubMed database was used to interrogate literature related to gene function and association with other human traits/diseases. We queried the identified variants and their proxies in PhenoScanner 23 and the NHGRI-EBI GWAS catalogue 24 to investigate trait pleiotropy.

Reproducibility of established depression loci
We assessed whether the associations of 102 established depression loci from the largest published EUR GWAS 17 were reproducible in samples with EAS ancestry. Since the lead SNP might not be the causal variant nor correlated with it in other ancestry groups due to LD differences, we also formed credible sets that are likely to include the causal variant. These were based on all variants in LD with the lead variant of a locus (r²>0.6) based on an ancestry-matched reference (1000 Genomes Project v3 EUR samples). We assessed whether any variant in the credible set displayed evidence of association in the target study. As these credible sets contained multiple SNPs, we used a p-value threshold of P<0.01 to indicate reproducibility. While this p-value threshold might not provide conclusive evidence of reproducibility for individual loci, we used it to test reproducibility rates across sets of loci.
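The credible-set check can be expressed compactly. The sketch below assumes the credible sets have already been built from LD proxies of each lead variant; the function name, the data layout and the handling of variants missing from the target study are illustrative assumptions.

```python
def reproducible_loci(credible_sets, target_pvalues, p_threshold=0.01):
    """Return loci whose credible set contains a variant with P < p_threshold.

    credible_sets:  dict mapping locus id -> iterable of variant ids
                    (lead variant plus LD proxies, assumed precomputed).
    target_pvalues: dict mapping variant id -> P-value in the target study.
    Variants absent from the target study are skipped.
    """
    hits = []
    for locus, variants in credible_sets.items():
        pvals = [target_pvalues[v] for v in variants if v in target_pvalues]
        if pvals and min(pvals) < p_threshold:
            hits.append(locus)
    return hits
```

The reproducibility rate for a set of loci is then the number of returned loci divided by the number of credible sets available in the target study.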
We estimated the number of associations out of the 102 established loci that were expected to replicate. We accounted for the sample size of our study and the allele frequency in EAS populations. First, we calculated the power 25 to observe an association in the EAS meta-analysis for each of the 102 loci at alpha error of 0.05 using the effect estimate from the EUR discovery study 8 , the allele frequency for EAS samples from 1000 Genomes and the sample size available in the EAS meta-analysis. By summing up the probabilities across the 102 loci, we derived the absolute number of associations out of the 102 we are powered to observe if the effect estimates in EAS are consistent with the ones from the EUR studies. For benchmarking, we also assessed the reproducibility of these established loci in ancestry-matched cohorts. We used independent EUR GWAS for depression with different sample sizes (BioMe, BioVU, FinnGen 26 , 23andMe).
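The expected number of replicating loci can be sketched as a sum of per-locus power values. The code below is a normal-approximation sketch and not necessarily the method of reference 25: it assumes an additive model with small effects, so that the standard error of the log odds ratio is approximately 1/sqrt(2p(1-p) · Ncases·Ncontrols/(Ncases+Ncontrols)), where p is the effect allele frequency. Function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def replication_power(beta, eaf, n_cases, n_controls, alpha=0.05):
    """Approximate two-sided power to detect a log odds ratio `beta`.

    Assumes an additive model and a normal approximation for the test
    statistic; `eaf` is the effect allele frequency in the target population.
    """
    n_scaled = n_cases * n_controls / (n_cases + n_controls)
    se = 1.0 / sqrt(2.0 * eaf * (1.0 - eaf) * n_scaled)
    shift = abs(beta) / se                       # expected |Z| under the alternative
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def expected_replications(loci, n_cases, n_controls, alpha=0.05):
    """Sum per-locus power over (beta, eaf) pairs to get the expected count."""
    return sum(replication_power(b, f, n_cases, n_controls, alpha) for b, f in loci)
```

Under this approximation a locus with no true effect contributes alpha to the sum, so the expected count is never below alpha times the number of loci tested.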

Heritability and genetic correlations
We estimated the SNP heritability (h²) for each depression phenotype in EAS (meta-analysed cohorts) using LD score (LDSC) regression 27 . We also used bivariate GREML implemented in the GCTA software 28 to estimate h² for the two large Chinese datasets, CONVERGE and CKB (symptom-based definition), that contribute the majority of samples in our analysis for which genotype data were available. For this we excluded related individuals and used hard-calls for variants with call rate>0.95 and MAF>0.01. For this analysis we used a variety of prevalence estimates, ranging from 6.5% 29 to 15% 30 .
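Prevalence estimates matter because heritability estimated on the observed case-control scale is usually converted to the liability scale. The sketch below shows the standard liability-threshold transformation for ascertained case-control data; it is offered as an illustration of why the assumed prevalence changes the estimate, and the function name and example values are hypothetical.

```python
from statistics import NormalDist


def observed_to_liability(h2_observed, population_prevalence, sample_prevalence):
    """Convert observed-scale SNP heritability to the liability scale.

    Uses h2_liab = h2_obs * K^2 (1-K)^2 / (z^2 * P (1-P)), where K is the
    population prevalence, P the proportion of cases in the sample and z the
    standard normal density at the liability threshold.
    """
    k, p = population_prevalence, sample_prevalence
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1.0 - k)
    z = std_normal.pdf(threshold)
    return h2_observed * (k * (1.0 - k)) ** 2 / (z ** 2 * p * (1.0 - p))


# Hypothetical observed-scale estimate evaluated at the two prevalence bounds.
for prevalence in (0.065, 0.15):
    print(prevalence, observed_to_liability(0.1, prevalence, 0.5))
```

For a fixed observed-scale estimate and case proportion, a higher assumed prevalence gives a higher liability-scale estimate over this range, which is why a range of prevalence values was considered.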
To characterise the genetic architecture of depression, we estimated genetic correlations between depression in EAS and EUR studies. For clinical depression in EUR samples, we used the summary statistics from 45,396 cases with DSM-based diagnosis of major depressive disorder and 97,250 controls from a meta-analysis of 33 independent cohorts included in the latest GWAS 17 , excluding UKB and 23andMe. Additionally, we generated a symptom-based definition for EUR samples using the PHQ-9 questionnaire and a cut-off score of 10 31 , yielding 6,510 affected individuals and 116,697 controls from UK Biobank 10, 32 .
To assess the sharing of genetic risk factors for depression across the genome between the two populations, we estimated trans-ancestry genetic correlations using POPCORN 33 . We estimated the genetic effect correlation which compares effects independent of allele frequency differences between the two populations. LDSC was also used to estimate genetic correlations between different outcomes within each ancestry group. The default LD Scores computed using 1000 Genomes EAS data were used as a reference for the LD estimates. We also assessed the genetic overlap with other traits using publicly available summary statistics (PGC, NHGRI-EBI GWAS catalogue) from EAS and EUR populations, using LDSC and POPCORN respectively, as described above. We only present genetic correlation estimates where the standard error (SE) was less than 0.3.
To aid interpretation of the trans-ancestry genetic correlations, we also gathered estimates for other traits. We extracted genetic correlations between EUR and EAS from publications 34-37 . Additionally, we used publicly available summary statistics from Biobank Japan 38,39 and EUR GWASs to estimate correlations for coronary artery disease (CAD) 40 , breast cancer 41 and age at menarche 42 using POPCORN as outlined above.
A novel locus at 7p21.2 was associated with depression at genome-wide significance in the analysis of the East Asia based studies (Table 1). The lead SNP, rs10240457 (EAF=0.646, beta for A-allele=0.028, SE=0.005, P=5.0×10⁻⁹) is intronic to AGMO (Alkylglycerol Monooxygenase). This gene cleaves the O-alkyl bond of ether lipids which are essential components of brain membranes and function in cell-signalling and other critical biological processes.
We carried out a meta-analysis for the broad depression outcome in EAS and the largest GWAS of depression in EUR samples 17 (Figure 1B, Supplementary Figure 4). The lead variant at 1q25.2, rs7548487 (beta for A allele=-0.013, SE=0.002, P=1.29×10⁻⁸), is located in an intron of ASTN1 (astrotactin 1). Astrotactin is a neuronal adhesion molecule required for glial-guided migration of young postmitotic neuroblasts in cortical regions of the developing brain 47 . The C-allele of the lead variant at 18q12.1, rs547488, had beta 0.008 (SE=0.001) and P=3.3×10⁻⁸. It is located downstream of CDH2 (cadherin 2) and is nominally associated with the expression of CDH2 in the brain in UKBEC (P=0.03) and BrainSeq 48 (P=0.027). CDH2 encodes N-cadherin, which is broadly expressed in multiple tissues and has been shown to play a role in the development of the nervous system and to be associated with neurodevelopmental disorders 49 . The third locus is 22q13.31 with lead variant rs12160976 (beta for A allele=-0.009, SE=0.002, P=1.6×10⁻⁸).

Gene-based analysis
We also performed a gene-level aggregate test based on the meta-analysis summary statistics using MAGMA (v1.08), as implemented in FUMA 20 . The ETS Variant Transcription Factor 5 gene (3q27.2) was the only gene that passed the significance threshold (P=6.9×10⁻⁶). It has been previously associated with depression risk in an EUR study 50 .

Reproducibility
In addition to the comparisons described in the main manuscript, to rule out that the low reproducibility rates are due to differences in LD patterns between the ancestry groups, we created credible sets of SNPs that are likely to contain the causal variants and assessed their associations in the EAS data. Of the 102 credible sets, 13 (12.7%) contained variant(s) with P<0.01 in the EAS association analysis with depression. We also assessed a high-confidence set of loci from the largest EUR meta-analysis that were replicated in an independent dataset of 23andMe 8 . Out of the 86 which were available in the EAS meta-analysis, 13 (15.1%) of the credible sets contained a variant with P<0.01.

eFigure 7. Genetic Correlations Between the Clinical and Symptom-Based Depression Phenotypes in East Asians and Other Traits in Europeans
For this analysis we used published summary statistics for schizophrenia, age of menarche, body mass index (BMI) and type 2 diabetes, from European (EUR) GWAS (LDSC) and East Asian (EAS) GWASs (POPCORN). Colours correspond to direction and strength of the genetic correlations (rgen). Statistically significant genetic correlations are indicated by a star (*