Identification of Candidate Parkinson Disease Genes by Integrating Genome-Wide Association Study, Expression, and Epigenetic Data Sets

Key Points Question What genes and genomic processes underlie risk of sporadic Parkinson disease? Findings This genetic association study integrated Parkinson disease genome-wide association study data and brain-derived gene regulation data using various complementary bioinformatic tools and identified 11 candidate genes with evidence of disease-associated regulatory changes. Coexpression and protein level analyses of these genes demonstrated a significant functional association with known mendelian Parkinson disease genes. Meaning This study suggests that gene regulation data may be used to identify candidate genes and pathways involved in sporadic Parkinson disease.

This supplementary material has been provided by the authors to give readers additional information about their work.
Illumina GenomeStudio using the methylation module.
All subsequent DNAm data quality control procedures were performed in R using the recommended protocols for wateRmelon 2 . CpG sites were removed if they had a detection P-value > 5% in more than 1% of individuals and CpGs with a beadcount of less than three in more than 5% of individuals. Identity of samples was confirmed by matching DNAm gender with reported gender, and comparing SNP calls on the HM450 with genotyping calls. Nonautosomal probes, probes containing SNPs within the probe and cross-hybridising probes were subsequently removed 3 . Normalisation of DNAm data was performed using the "dasen" function, which performs background adjustment of methylated and unmethylated intensities and subsequent quantile normalisation of methylated and unmethylated Type 1 and Type II intensities independently.

Genotyping and quality control for mQTL analysis
DNA extracted from each sample was genotyped using the Illumina Human Exome Core array with additional custom content of approximately 27,000 SNPs which resulted in the genotyping of a total of >500,000 SNPs. SNPs were removed using the following filters using PLINK1.9; MAF<5%, genotyping rate <5%, HWE < 1E-05.
Individuals were removed if overall genotyping rates were < 95% or were genetically related with PIHAT>0.125 4 .
Eigenstrat was used to assess population substructure after which individuals were removed if they were more than 2 standard deviations away from the mean of the first 10 principle components 5 . Prior to imputation the genotypes were aligned to the Haplotype Reference Consortium (HRC) genotype reference panel using a publicly available perl script (http://www.well.ox.ac.uk/~wrayner/tools/). After alignment, genotypes were recoded in vcf format using PLINK1.9 and sorted using VCFtools 6 . Genotypes were uploaded to the Michigan Imputation Server (https://imputationserver.sph.umich.edu/index.html). Pre-imputation haplotype phasing was performed using Eagle and imputation was performed using Minimac3 (Sayantan Das et al, 2016) using the HRCv1.1 reference panel [7][8][9] .
SNPs were retained with an imputation quality score >0.8 and maf>5%. After DNAm and genotype quality control, data was available for 134 individuals. Cis PDmQTLs were defined as correlations between the target PD SNP genotype and DNAm levels of CpGs within a 500kb window of the SNP base position. mQTL analyses were performed using R package MatrixEQTL 10 . Linear models were fitted to test whether DNAm beta-values for each CpG were predicted by SNP genotypes. We included covariates for age at death, gender, population stratification, batch and post-mortem interval. We retained the strongest SNP-CpG pair at a 5% false discovery rate to be used in downstream analyses. CpGs were subsequently mapped to genes if they were within 10kb of the gene transcription start / end base position according to HG19 coordinates.

Calculation of DNA methylation Fusion weights for TWAS
FUSION weights for DNA methylation in the dorsolateral prefrontal cortex and substantia nigra were generated inhouse using methylation and genotype data in up to 134 individuals using standard parameters of FUSION.compute_weights.R (http://gusevlab.org/projects/fusion/). After excluding CpGs that were not heritable (p>0.01) or failed to converge, FUSION-CpG weights were available for 37,460 and 36,620 probes in the frontal cortex and substantia nigra respectively. In order to biologically interpret the DNA methylation results, a CpG was assigned a protein-coding gene if it was within 10kb of the gene transcription start/end base position according to HG19 coordinates.

Literature-derived PPI networks
We extracted currently known protein interactors (PPIs) for the proteins (seeds) encoded by the genes prioritized in this manuscript (coloc protein network). PPIs were identified for each seed protein based upon entries in the following databases within the IMEX consortium 11  and WDR45) and for 118 genes with a negative score in the coloc analysis (the latter being used as a negative