Phase 1: Cohort Preparation and Quality Control — Month 1-2
Assemble three primary cohorts: 23andMe (n=75,607 cases, 231,747 controls), CONVERGE (n=5,303 cases, 5,337 controls), and PGC (n=9,240 cases, 9,519 controls). Implement standardized quality control: remove samples with call rate <95%, and exclude SNPs with MAF <1%, call rate <95%, or Hardy-Weinberg equilibrium p <10^-6. Perform principal component analysis to identify and exclude population outliers (>4 standard deviations from the population mean). Remove related individuals (pairwise IBD proportion >0.1) and samples whose reported sex is discordant with their genotype-inferred sex. Detect batch effects using quantile-quantile plots and genomic inflation factor (λGC) calculations. Impute genotypes against the 1000 Genomes Project Phase 3 reference panel using IMPUTE2, retaining variants with info score >0.8.
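The variant filters and the principal-component outlier rule above can be sketched in a few lines of Python. This is only an illustration of the thresholds, not the pipeline itself: the `VariantStats` record, the function names, and the example variant IDs are hypothetical, and in practice these filters would normally be applied with a dedicated genotype QC tool.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

# Thresholds taken from the QC description above.
MIN_CALL_RATE = 0.95
MIN_MAF = 0.01
MIN_HWE_P = 1e-6


@dataclass
class VariantStats:
    """Per-variant QC metrics (hypothetical record layout)."""
    snp_id: str
    maf: float
    call_rate: float
    hwe_p: float


def passes_variant_qc(v: VariantStats) -> bool:
    """Keep a variant only if it clears the MAF, call-rate and HWE filters."""
    return (
        v.maf >= MIN_MAF
        and v.call_rate >= MIN_CALL_RATE
        and v.hwe_p >= MIN_HWE_P
    )


def pc_outliers(scores: dict, n_sd: float = 4.0) -> set:
    """Return sample IDs lying more than n_sd SDs from the mean on one PC."""
    values = list(scores.values())
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return set()
    return {sid for sid, x in scores.items() if abs(x - mu) > n_sd * sd}


variants = [
    VariantStats("rs_example_1", maf=0.20, call_rate=0.99, hwe_p=0.40),
    VariantStats("rs_example_2", maf=0.005, call_rate=0.99, hwe_p=0.40),  # MAF too low
    VariantStats("rs_example_3", maf=0.20, call_rate=0.90, hwe_p=0.40),   # low call rate
    VariantStats("rs_example_4", maf=0.20, call_rate=0.99, hwe_p=1e-9),   # fails HWE
]
kept = [v.snp_id for v in variants if passes_variant_qc(v)]
```

In a real run, `pc_outliers` would be applied to each retained principal component in turn and the flagged sets combined.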
Phase 2: Individual Cohort GWAS Analysis — Month 2-3
Perform genome-wide association analysis for each cohort separately using logistic regression under an additive genetic model. Include age, sex, and the first 10 principal components as covariates to control for population stratification. For the 23andMe cohort, define cases by self-reported clinical diagnosis or treatment. For the CONVERGE cohort, define cases as recurrent MDD assessed with the Composite International Diagnostic Interview (CIDI). For the PGC cohort, use structured clinical interviews or DSM-IV symptom checklists. Calculate genomic inflation factors and confirm that λGC <1.1 after covariate adjustment. Generate Manhattan plots and quantile-quantile plots for each cohort. Estimate SNP-based heritability using LDSC (linkage disequilibrium score regression).
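As a worked illustration of the λGC check, the genomic inflation factor is the median observed 1-df chi-square statistic divided by the median expected under the null (about 0.455). The sketch below uses only the Python standard library; the function names are illustrative, and the input is assumed to be a list of two-sided association p-values strictly greater than 0 and at most 1.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()

# Median of a chi-square distribution with 1 degree of freedom (about 0.455).
CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def p_to_chi2(p: float) -> float:
    """Convert a two-sided p-value (0 < p <= 1) to a 1-df chi-square statistic."""
    z = _STD_NORMAL.inv_cdf(p / 2.0)  # lower-tail quantile, so z <= 0
    return z * z


def genomic_inflation(p_values) -> float:
    """Genomic inflation factor: median observed chi-square over the null median."""
    return median(p_to_chi2(p) for p in p_values) / CHI2_1DF_MEDIAN


def passes_inflation_check(p_values, max_lambda: float = 1.1) -> bool:
    """True if the cohort meets the lambda_GC < 1.1 criterion stated above."""
    return genomic_inflation(p_values) < max_lambda
```

Uniformly distributed p-values give a value close to 1; values well above 1 suggest residual stratification, cryptic relatedness, or (in large samples) polygenicity.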
Phase 3: Meta-Analysis Implementation — Month 3-4
Conduct a fixed-effects, inverse-variance-weighted meta-analysis using METAL. Test the ~9 million SNPs that pass quality control in all cohorts. Weight each study's effect estimate by the inverse of its squared standard error, and apply genomic control correction if λGC >1.05. Identify genome-wide significant loci using the threshold p <5×10^-8. Test for between-study heterogeneity using Cochran's Q statistic and the I^2 measure. Perform conditional analysis with GCTA-COJO to identify independent signals within associated loci. Calculate the effective sample size, accounting for case-control imbalance. Generate forest plots for top associations and regional association plots for significant loci using LocusZoom.
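For a single SNP, the fixed-effects calculation, the heterogeneity statistics, and the effective sample size can be written out directly. The following is a minimal sketch of the standard formulas rather than a substitute for METAL; the function names and the returned dictionary layout are illustrative, and the effective sample size uses the common 4 / (1/N_cases + 1/N_controls) convention, which is an assumption about what this plan intends.

```python
import math
from statistics import NormalDist


def fixed_effects_meta(betas, ses):
    """Inverse-variance-weighted fixed-effects meta-analysis for one SNP.

    betas: per-study log odds ratios; ses: their standard errors.
    """
    weights = [1.0 / se ** 2 for se in ses]
    w_sum = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / w_sum
    se = math.sqrt(1.0 / w_sum)
    z = beta / se
    p = 2.0 * NormalDist().cdf(-abs(z))  # two-sided p-value

    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    # I^2: percentage of variation attributable to heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return {"beta": beta, "se": se, "z": z, "p": p, "q": q, "df": df, "i2": i2}


def effective_n(n_cases: int, n_controls: int) -> float:
    """Effective sample size of a case-control study: 4 / (1/Ncase + 1/Ncontrol)."""
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)


def is_genome_wide_significant(p: float) -> bool:
    """Apply the p < 5e-8 threshold stated above."""
    return p < 5e-8


# Cohort sizes from Phase 1: (cases, controls).
cohorts = {
    "23andMe": (75607, 231747),
    "CONVERGE": (5303, 5337),
    "PGC": (9240, 9519),
}
total_effective_n = sum(effective_n(ca, co) for ca, co in cohorts.values())
```

Because 23andMe has roughly three controls per case, its effective sample size is noticeably smaller than its raw total, which is why the imbalance adjustment matters when comparing cohort contributions.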
Phase 4: Functional Annotation and Prioritization — Month 4-5
Annotate genome-wide significant variants using ANNOVAR and VEP (Variant Effect Predictor). Prioritize variants based on functional consequences: missense variants with CADD score >15, eQTLs from GTEx database (brain tissues), chromatin interactions from Hi-C data, and regulatory elements from ENCODE. Perform gene-based association testing using MAGMA with default settings (SNP-wise mean model). Conduct pathway enrichment analysis using GSEA with curated gene sets from MSigDB, focusing on neurotransmitter signaling, synaptic function, and neuronal development pathways. Map lead SNPs to target genes using multiple approaches: nearest gene, eQTL mapping, chromatin conformation capture, and DEPICT gene prioritization.
Phase 5: Replication and Validation Studies — Month 5-6
Select the top 30 genome-wide significant variants for replication in independent cohorts: UK Biobank depression cases (n=40,000), iPSYCH Danish registry (n=15,000), and FinnGen (n=25,000). Calculate the sample sizes required for 80% power to replicate associations at the original effect sizes. Perform lookup analyses for established psychiatric GWAS loci from schizophrenia, bipolar disorder, and autism spectrum disorders. Test for genetic correlation between MDD and related psychiatric traits using LDSC. Conduct polygenic risk score (PRS) analysis using PRSice with p-value thresholds from 0.001 to 0.5. Validate PRS performance in independent test sets using AUC and Nagelkerke's R^2.
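The p-value thresholding behind the PRS analysis, and the AUC used to validate it, can be illustrated as follows. This is a simplified sketch and not PRSice itself: it omits LD clumping, and the function names, the `weights` layout, the intermediate threshold values, and the example SNP IDs are assumptions made for illustration.

```python
def polygenic_score(dosages: dict, weights: dict, p_threshold: float) -> float:
    """Sum of effect size x risk-allele dosage over SNPs with p <= p_threshold.

    dosages: {snp_id: allele dosage in [0, 2]} for one individual.
    weights: {snp_id: (beta, p_value)} from the discovery GWAS.
    """
    return sum(
        beta * dosages[snp]
        for snp, (beta, p) in weights.items()
        if p <= p_threshold and snp in dosages
    )


def auc(case_scores, control_scores) -> float:
    """AUC as the probability that a random case outscores a random control.

    Ties count as half, which matches the Mann-Whitney U formulation.
    """
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in case_scores
        for k in control_scores
    )
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical discovery weights and one individual's genotype dosages.
weights = {"rs_example_A": (0.10, 1e-4), "rs_example_B": (0.20, 0.30)}
dosages = {"rs_example_A": 2, "rs_example_B": 1}

# Endpoints 0.001 and 0.5 come from the text; the intermediate values are illustrative.
thresholds = [0.001, 0.01, 0.05, 0.1, 0.5]
scores_by_threshold = {t: polygenic_score(dosages, weights, t) for t in thresholds}
```

Scoring every test-set individual at each threshold and comparing the resulting AUC values is one simple way to pick the threshold that discriminates best.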
Phase 6: Clinical Translation and Reporting — Month 6-7
Perform drug target enrichment analysis using the OpenTargets platform to identify actionable genes. Conduct Mendelian randomization analyses, using the identified loci as genetic instruments, to test causal relationships between MDD and related phenotypes (BMI, smoking, education). Generate a comprehensive summary statistics file in a standardized format that includes chromosome, position, alleles, effect sizes, standard errors, and p-values for >9 million variants. Perform power calculations for future studies and estimate the sample sizes needed to identify additional loci. Create an interactive web portal for results visualization and data sharing. Conduct sensitivity analyses excluding participants with bipolar disorder or other psychiatric comorbidities. Prepare manuscripts following the STREGA guidelines for genetic association studies.