To increase the power of our analyses, we meta-analysed our vitamin D GWAS results with the publicly-available summary statistics from the SUNLIGHT Consortium (Jiang et al. 2018). That study was conducted in 79,366 individuals of European ancestry. Results were downloaded from this page. Summary statistics were available for 2,579,297 variants.




# ************************************
# General information on GWAS run ----
# ************************************

# Number of variants available in UKB GWAS
ukbSNPs_N=length(count.fields("results/fastGWA/vitD_BMIcov_fastGWA_withX.gz", skip=1))

# Number of variants that overlap between the UKB and SUNLIGHT GWAS
overlappingSNPs_N=nrow(readRDS("results/sunlight/vitD_BMIcov_fastGWA.overlappingSNPs.RDS"))

# Number of variants available in dataset used as LD reference
referenceSNPs_N=length(count.fields("input/UKB_v3EUR_HM3/ukbEUR_imp_allchr_v3_HM3.bim", sep="\t"))

Bivariate LDSC regression

# Copy sunlight results from local to Delta (run locally)
scp $local/results/sunlight/25HydroxyVitaminD_QC.METAANALYSIS1.txt $deltaID:$WD/results/sunlight

# ##############################
# Run bivariate LDSC regression
# ##############################
module load ldsc
prefix=vitD_BMIcov_fastGWA

# Directories
results=$WD/results/sunlight
tmpDir=$results/tmpDir
ldscDir=$bin/ldsc

# Format SUNLIGHT GWAS results
echo "SNP A1 A2 b n p" > $results/sunlight.ldHubFormat
awk 'NR>1 {print $1,$2,$3,$4,$14,$6}' $results/25HydroxyVitaminD_QC.METAANALYSIS1.txt >> $results/sunlight.ldHubFormat

# Munge SUNLIGHT sumstats (process sumstats and restrict to HapMap3 SNPs)
munge_sumstats.py --sumstats $results/sunlight.ldHubFormat \
                  --merge-alleles $ldscDir/w_hm3.snplist \
                  --out $tmpDir/sunlight
                  
# Format vitD_BMIcov sumstats
cd $WD/results/fastGWA/ldscFormat
awk '{print $1,$2,$3,$5,$8,$7}' $WD/results/fastGWA/maFormat/$prefix.ma > $prefix.ldHubFormat

# Munge vitD_BMIcov sumstats (process sumstats and restrict to HapMap3 SNPs)
cd $WD/results/fastGWA/ldscFormat/
munge_sumstats.py --sumstats $prefix.ldHubFormat \
                  --merge-alleles $ldscDir/w_hm3.snplist \
                  --out $prefix

# Run rg
cd $results
ldsc.py --rg  $WD/results/fastGWA/ldscFormat/$prefix.sumstats.gz,$tmpDir/sunlight.sumstats.gz\
        --ref-ld-chr $ldscDir/eur_w_ld_chr/ \
        --w-ld-chr $ldscDir/eur_w_ld_chr/ \
        --out $results/$prefix.sunlight.ldsc.rg
        
# Copy results to local computer (run command in local computer)
scp $deltaID:$WD/results/sunlight/vitD_BMIcov_fastGWA.sunlight.ldsc.rg.log $local/results/sunlight/
# Read-in bivariate LDSC regression results
tmp=read.table("results/sunlight/vitD_BMIcov_fastGWA.sunlight.ldsc.rg.log", skip=30, fill=T, h=F, colClasses="character", sep="_")[,1]

# Get results:
intercept=strsplit(grep("Intercept",tmp, value=T)[3],":")[[1]][2]
rg=strsplit(grep("Genetic Correlation:",tmp, value=T),":")[[1]][2]

We ran bivariate LDSC regression on both sets of GWAS results to assess whether there was any sample overlap between the two studies.

  • Bivariate LDSC Intercept: -0.0012 (0.0121)
  • Genetic Correlation: 0.9162 (0.0602)

The genetic correlation between the two traits is high and the bivariate LDSC intercept is close to zero, suggesting no sample overlap.
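
For context, under bivariate LDSC the expected cross-trait intercept is approximately Ns*rho/sqrt(N1*N2), where Ns is the number of overlapping individuals and rho their phenotypic correlation (Bulik-Sullivan et al. 2015). A rough sketch with assumed, approximate values, for illustration only:

# Expected bivariate LDSC intercept under different degrees of sample overlap
# (N1, N2 and rho are approximate/assumed values, for illustration only)
N1 = 4e5      # approximate UKB GWAS sample size
N2 = 79366    # SUNLIGHT sample size
rho = 0.7     # assumed phenotypic correlation among any overlapping individuals
expected_intercept = function(Ns) Ns * rho / sqrt(N1 * N2)
expected_intercept(0)    # no overlap -> 0
expected_intercept(N2)   # all SUNLIGHT samples overlap -> ~0.31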



Meta-analysis

Method used

We meta-analysed the GWAS results with the default method implemented in METAL, which generates Z-scores based on the observed (1) P-values and (2) direction of effect. For more details see Willer et al. 2010.

This approach does not require the beta coefficients (and respective standard errors) to be in the same units, which is critical in our case, since vitamin D levels were normalised/standardised differently in the two studies (rank-based inverse-normal transformed in our study; log-transformed in the SUNLIGHT study).

Note: There are 307 SNPs with P=0 in the UKB GWAS results. Since we cannot get a quantile for P=0, METAL discards these variants from analysis. To prevent this, we set these SNPs to the lowest P-value in the GWAS that was not 0 (P=3.95x10-323) before running the meta-analysis.
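
As an illustration, this scheme combines the per-study Z-scores with weights equal to the square root of each study's sample size. A minimal R sketch of the two-study case (not the METAL source code; a per-SNP check against the actual METAL output is shown further below):

# Sample-size-weighted Z-score meta-analysis (two studies)
# z1, z2: per-study Z-scores (sign gives the direction of effect); n1, n2: sample sizes
meta_z = function(z1, z2, n1, n2) (sqrt(n1)*z1 + sqrt(n2)*z2) / sqrt(n1 + n2)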


Variants included in the meta-analysis
Allele frequencies are required to calculate the effect size and the SE of the meta-analysed results (see below). We don’t have allele frequencies from the SUNLIGHT consortium, but we can use those from the UKB as both GWAS were performed in individuals with European ancestry. This also means that we can only convert meta-analysis z-scores to beta and SE for SNPs that are in the UKB GWAS results.

We can restrict the meta-analysis to SNPs in:

  • UKB GWAS (M=8806780 autosomal SNPs) - Meta-analyse SNPs present in both GWAS and keep the individual-GWAS results for SNPs that are only in the UKB GWAS
  • Both GWAS (M=1281182 autosomal SNPs) - Meta-analyse only the SNPs present in both GWAS

We could justify using the first approach based on the sample size of the UKB GWAS (the minimum sample size in the UKB is 396,459, whereas in the SUNLIGHT consortium it is 1,001).

ukb=$WD/results/fastGWA/vitD_BMIcov_fastGWA.gz
sun=$WD/results/sunlight/25HydroxyVitaminD_QC.METAANALYSIS1.txt

# Check how many SNPs have P=0 in both GWAS
zcat $ukb | awk '$10==0' | wc -l 
awk '$6==0' $sun | wc -l 
  ## 307 in UKB
  ## 0 in SUNLIGHT
  
# Convert P=0 to the smallest non-zero P in the GWAS
newP=`zcat $ukb | sed 1d | awk '$10!=0 {print $10}' | sort -g | sed 1q`
zcat $ukb | awk -v newP="$newP" '{ $10 = ($10 == "0" ? newP : $10) } 1'  > $WD/tmpDir/vitD_noP0
gzip $WD/tmpDir/vitD_noP0

# ##############################
# Run meta-analysis
# ##############################

# NOTE: Run these commands with an interactive session (I still haven't been able to submit a script that will run them)
#       Use: qsub -I -A UQ-IMB-CNSG -X -l nodes=1:ppn=5,mem=15GB,walltime=6:00:00
# NOTE: Running the analysis without restricting to any subset of variants;
#       this way we get results for all variants and can then extract whichever subset we need

cd $WD/results/sunlight
metal=$bin/metal/metal
$metal

# === DESCRIBE AND PROCESS UKB RESULTS ===
MARKER SNP
ALLELE A1 A2
EFFECT BETA
PVALUE P
WEIGHT N
#STDERR se

PROCESS $WD/tmpDir/vitD_noP0.gz

# === DESCRIBE AND PROCESS SUNLIGHT RESULTS ===
MARKER MarkerName
ALLELE Allele1 Allele2
EFFECT Effect
PVALUE P-value
WEIGHT SAMPLESIZE
#STDERR StdErr

PROCESS $WD/results/sunlight/25HydroxyVitaminD_QC.METAANALYSIS1.txt

# === RUN META-ANALYSIS ===
#MINWEIGHT 417581
OUTFILE $WD/results/sunlight/vitD_BMIcov_fastGWA_sunlight_meta .TBL
ANALYZE

# #########################################
# Copy results to local computer 
# #########################################
scp $deltaID:$WD/results/sunlight/vitD_BMIcov_fastGWA_sunlight_meta1.TBL $local/results/sunlight/


# ##############################
# Calculate allele frequencies
# ##############################
# Required to calculate the effect size and SE from the combined z-scores (see section below)

# Directories

results=$WD/results/sunlight
tmpDir=$results/tmpDir

# Files
geno=$medici/v3EURu_impQC/ukbEURu_imp_chr{TASK_ID}_v3_impQC
sumstats=$results/vitD_BMIcov_fastGWA_sunlight_meta1.TBL
snps2exclude=$WD/input/UKB_v3EUR_impQC/UKB_EUR_impQC_duplicated_rsIDs

# Generate list of SNPs to get allele frequency for
awk 'NR>1 {print $1}' $sumstats > $tmpDir/snps4freqReport.meta

# Generate list of alleles to get frequency for
awk 'NR>1 {print $1,$2}' $sumstats | awk '$2 = toupper($2)' > $tmpDir/alleles4freqReport.meta

# Generate SNP frequency report
cd $results
plink=$bin/plink/plink-1.9/plink
output=$tmpDir/vitD_BMIcov_fastGWA_sunlight_meta_A1freq.chr{TASK_ID}
tmp_command="$plink --bfile $geno \
                    --extract $tmpDir/snps4freqReport.meta \
                    --exclude $snps2exclude \
                    --a1-allele $tmpDir/alleles4freqReport.meta 2 1 \
                    --freq \
                    --out $output"
jobName=$(basename $output  | sed 's/.chr{TASK_ID}//')
qsubshcom "$tmp_command" 1 10G $jobName 10:00:00 "-array=1-22"

sed 1q $tmpDir/vitD_BMIcov_fastGWA_sunlight_meta_A1freq.chr1.frq | awk -v OFS="\t" '$1=$1' > $results/vitD_BMIcov_fastGWA_sunlight_meta_A1.frq
cat $tmpDir/vitD_BMIcov_fastGWA_sunlight_meta_A1freq.chr*.frq | grep -v CHR | awk -v OFS="\t" '$1=$1' >> $results/vitD_BMIcov_fastGWA_sunlight_meta_A1.frq
rm $genoDir/v3EURu_chr* 

Meta-analysis effect sizes and SE
The output of the meta-analysis is a combined Z-score for each variant. To calculate the combined effect size (beta) and corresponding SE, we used the first equations on page 10 of the supplementary materials of Zhu et al. 2016:

b  = z / sqrt(2p(1-p)(n + z^2))
se = 1 / sqrt(2p(1-p)(n + z^2))

where z is the meta-analysis Z-score, p the effect-allele frequency and n the total sample size.
# ************************************
# Format meta-analysis results ----
# ************************************
# Read in data
#gwas=read.table("results/fastGWA/vitD_BMIcov_fastGWA.gz", h=T, stringsAsFactors=F)
#saveRDS(gwas, file="results/fastGWA/vitD_BMIcov_fastGWA.RDS")
ukb=readRDS("results/fastGWA/vitD_BMIcov_fastGWA.RDS")
sun=read.table("results/sunlight/25HydroxyVitaminD_QC.METAANALYSIS1.txt",h=T,stringsAsFactors=F)
meta0=read.table("results/sunlight/vitD_BMIcov_fastGWA_sunlight_meta1.TBL",h=T,stringsAsFactors=F)

# ----- Format meta-analysed results
  # Get allele frequencies from the UKB
  meta=merge(meta0,ukb[,c("CHR","SNP","POS","AF1")], by.x="MarkerName", by.y="SNP")

  # Calculate beta and SE
  z=meta$Zscore
  p=meta$AF1
  n=meta$Weight
  meta$b= z / sqrt(2*p*(1-p)*(n+z^2))
  meta$se=1 / sqrt(2*p*(1-p)*(n+z^2))

  # Define which SNPs were meta-analysed
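  # METAL's Direction column has one character per input file, in processing order
  # (1st = UKB, 2nd = SUNLIGHT); "?" means the SNP was missing from that file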
  meta[substr(meta$Direction,1,1)=="?","Analysed"]="sunlight"
  meta[substr(meta$Direction,2,2)=="?","Analysed"]="ukb"
  meta[!meta$Analysed %in% c("sunlight","ukb"),"Analysed"]="both"

  # Make file with same columns as PLINK output (for clumping)
  meta$TEST="ADD"
  df=meta[,c("CHR","MarkerName","POS","Allele1","TEST","Weight","b","Zscore","P.value")]
  names(df)=c("CHR","SNP","BP","A1","TEST","NMISS","BETA","STAT","P")

  # Make file in .ma format (for COJO)
  df2=meta[,c("MarkerName","Allele1","Allele2","AF1","b","se","P.value","Weight","Analysed")]
  names(df2)=c("SNP","A1","A2","freq","b","se","p","n","Analysed")


# ----- Save formatted meta-analysed results
  ukb_prefix="vitD_BMIcov_fastGWA"
  
  # UKB SNPs 
  ukb_snps=meta[meta$Analysed %in% c("both","ukb"),"MarkerName"]
  write.table(df[df$SNP %in% ukb_snps,], file=gzfile(paste0("results/sunlight/",ukb_prefix,"_sunlight_meta.ukbSNPs.linear.gz")), quote=F, row.names=F)
  write.table(df2[df2$SNP %in% ukb_snps,], paste0("results/sunlight/",ukb_prefix,"_sunlight_meta.ukbSNPs.ma"), quote=F, row.names=F)

  # Overlapping SNPs
  overlapping_snps=meta[meta$Analysed=="both","MarkerName"]
  write.table(df[df$SNP %in% overlapping_snps,], file=gzfile(paste0("results/sunlight/",ukb_prefix,"_sunlight_meta.overlappingSNPs.linear.gz")), quote=F, row.names=F)
  write.table(df2[df2$SNP %in% overlapping_snps,], paste0("results/sunlight/",ukb_prefix,"_sunlight_meta.overlappingSNPs.ma"), quote=F, row.names=F)

  
# ----- Format UKB results to re-run clumping and COJO on subset of overlapping SNPs
  # PLINK output format (for clumping)
  ukb$TEST="ADD"; ukb$STAT="NA"
  df=ukb[,c("CHR","SNP","POS","A1","TEST","N","BETA","STAT","P")]
  names(df)=c("CHR","SNP","BP","A1","TEST","NMISS","BETA","STAT","P")
  
  # COJO output format (for COJO)
  df2=ukb[,c("SNP","A1","A2","AF1","BETA","SE","P","N")]
  names(df2)=c("SNP","A1","A2","freq","b","se","p","n")
  
# ----- Save restricted UKB results
  ukb_prefix="vitD_BMIcov_fastGWA"
  write.table(df[df$SNP %in% overlapping_snps,], file=gzfile(paste0("results/sunlight/",ukb_prefix,".overlappingSNPs.linear.gz")), quote=F, row.names=F)
  write.table(df2[df2$SNP %in% overlapping_snps,], paste0("results/sunlight/",ukb_prefix,".overlappingSNPs.ma"), quote=F, row.names=F)
  
  
  
  
# ************************************
# Check meta-analysis results ----
# ************************************
sun=(sun[,c(1:6,12:14)])
names(sun)=c("SNP","A1","A2","BETA","SE","P","POS","CHR","N")

a=merge(ukb,sun,by="SNP",suffixes=c("1","2"))
a$A12=toupper(a$A12)
a$A22=toupper(a$A22)
names(a)=tolower(names(a))


# ---- Check meta-analysis results
snp="rs1000000"

# Define statistics for this SNP
tmp=a[a$snp==snp,]
n1=tmp$n1; n2=tmp$n2
p1=tmp$p1; p2=tmp$p2
b1=tmp$beta1; b2=tmp$beta2
f=tmp$af1

# Calculate intermediate statistics
z1=qt(p1/2, n1-1, lower=F)*sign(b1)
z2=qt(p2/2, n2-1, lower=F)*sign(b2)
w1=sqrt(n1)
w2=sqrt(n2)

# Calculate meta-analysis statistics
  ## Z-score
  zm=sum(z1*w1, z2*w2) / sqrt(n1+n2); zm
  meta[meta$MarkerName==snp,"Zscore"]

  ## Beta and SE
  bm=zm / sqrt(2*f*(1-f)*(n1+n2+zm^2)); bm
  sem=1 / sqrt(2*f*(1-f)*(n1+n2+zm^2)); sem
  
tmp



Considerations taken

In our analyses, we used a rank-based inverse-normal transformation (RINT) to generate a normally-distributed phenotype. The plot below contrasts that phenotype to the phenotype obtained with log transformation:

phenos=read.table("tmpDir/mlm_norm_log_pheno", h=T, stringsAsFactors=F)

# Association between the two possible transformations
test=lm(phenos$log_vitD ~ phenos$norm_vitD)
beta=round(summary(test)$coefficients[2,1],4)
p=summary(test)$coefficients[2,4]
r2=round(cor(phenos$log_vitD, phenos$norm_vitD),2)


# Plot
plot(phenos$norm_vitD, phenos$log_vitD, col="lightcoral", 
     xlab="Vitamin D (RINT)", ylab="Vitamin D (log)", main="Comparison of phenotype scales")
abline(test)
#text(min(phenos$norm_vitD), max(phenos$log_vitD), adj=c(0,1), label=paste0("b=",beta))
text(min(phenos$norm_vitD), max(phenos$log_vitD), adj=c(0,1), label=bquote(r^2 == .(r2)))
text(min(phenos$norm_vitD), max(phenos$log_vitD)-sd(phenos$log_vitD), adj=c(0,1), label=paste0("p=",p))

# Read-in GWAS results
#gwas=read.table("results/fastGWA/vitD_BMIcov_fastGWA.gz", h=T, stringsAsFactors=F)
#saveRDS(gwas, file="results/fastGWA/vitD_BMIcov_fastGWA.RDS")
ukb=readRDS("results/fastGWA/vitD_BMIcov_fastGWA.RDS")
names(ukb)=c("CHR","SNP","POS","A1","A2","N","AF1","beta","se","p")
sun=read.table("results/sunlight/25HydroxyVitaminD_QC.METAANALYSIS1.txt",h=T,stringsAsFactors=F)
sun=(sun[,c(1:6,12:14)])
names(sun)=c("SNP","A1","A2","beta","se","p","POS","CHR","N")
sun$A1=toupper(sun$A1)
sun$A2=toupper(sun$A2)

# Merge results for top-associated SNPs
df=ukb[ukb$p<5e-8,]
df=merge(df,sun, by="SNP", suffixes=c("_ukb","_sun"))

# Align
  ## Keep effect for SNPs with matching alleles
  tmp=which(df$A1_ukb==df$A1_sun & df$A2_ukb==df$A2_sun)
  df[tmp,"b_sun"]=df[tmp,"beta_sun"]
  
  ## Invert effect for switched alleles
  tmp=which(df$A1_ukb==df$A2_sun & df$A2_ukb==df$A1_sun)
  df[tmp,"b_sun"]=df[tmp,"beta_sun"]*-1
  df[tmp,"A1_sun"]=df[tmp,"A1_ukb"]
  df[tmp,"A2_sun"]=df[tmp,"A2_ukb"]

  nrow(df[is.na(df$b_sun),])
  # Only 16 SNPs left to align (reciprocal alleles)
  
# Save the ones we aligned and plot
df=df[!is.na(df$b_sun),]
df$beta_sun=df$b_sun
df$b_sun=NULL
write.table(df, file="tmpDir/fastGWA_sunlight_alignedBetas", quote=F, row.names=F)
saveRDS(df, file="tmpDir/fastGWA_sunlight_alignedBetas.RDS")

The plot below depicts the effect sizes of variants with genome-wide significant association in the UKB GWAS (N=5801). The effect sizes estimated in both GWAS are correlated.

df=readRDS("tmpDir/fastGWA_sunlight_alignedBetas.RDS")

# Association between effect sizes
test=lm(df$beta_sun~df$beta_ukb)
beta=round(summary(test)$coefficients[2,1],4)
p=summary(test)$coefficients[2,4]
r2=round(cor(df$beta_sun,df$beta_ukb)^2,2)

# Plot effect sizes
plot(df$beta_ukb, df$beta_sun, pch=20, cex=0.8, cex.axis=1.1, cex.lab=1.2, col="lightcoral",
     xlab=expression(UK~Biobank), ylab=expression(SUNLIGHT),
     main="Effect size estimates (and SE) for SNPs with\n genome-wide significant association in the UKB")
  # Add SE
  nsnps = nrow(df)
  for( i in 1:nsnps ) {
    # x axis
    xstart = df[i,"beta_ukb"] - df[i,"se_ukb"]; xend = df[i,"beta_ukb"] + df[i,"se_ukb"]
    ystart = df[i,"beta_sun"]; yend = df[i,"beta_sun"]
    segments(xstart, ystart, xend, yend, lwd=1.5, col="lightcoral")
    # y axis
    xstart = df[i,"beta_ukb"]; xend = df[i,"beta_ukb"]
    ystart = df[i,"beta_sun"] - df[i,"se_sun"]; yend = df[i,"beta_sun"] + df[i,"se_sun"]
    segments(xstart, ystart, xend, yend, lwd=1.5, col="lightcoral")
  }
  # Add regression line and statistics
  abline(test)
  text(min(df$beta_ukb), max(df$beta_sun), adj=c(0,1), label=bquote(r^2 == .(r2)))
  text(min(df$beta_ukb), max(df$beta_sun)-sd(df$beta_sun), adj=c(0,1), label=paste0("p=",p))



Clumping and COJO

After restricting the meta-analysed results to overlapping SNPs or UKB SNPs, we conducted the following analyses:

  • Clumping to identify independent loci
  • COJO to identify independent variants

Clumping

We used stringent thresholds to clump the meta-analysed results:

  • –clump-p1 = 5x10-8 - max P-value for index variants to start a clump
  • –clump-p2 = 0.01 (default) - max P-value for variants in a clump to be listed in column SP2
  • –clump-r2 = 0.01 - minimum r2 with index variant for a site to be included in that clump
  • –clump-kb = 5000 - maximum distance from index SNP for a site to be included in that clump

Clumping was performed with PLINK 1.9, using the v3EUR_impQC samples as the LD reference.


# #########################################
# Clump meta-analysis results
# #########################################

# Directories


results=$WD/results/sunlight
tmpDir=$results/tmpDir

for subset in ukbSNPs overlappingSNPs
do 
    # Files/settings to use
    prefix=vitD_BMIcov_fastGWA_sunlight_meta.$subset
    ref=$medici/v3EUR_impQC/ukbEUR_imp_chr{TASK_ID}_v3_impQC
    inFile=$results/$prefix.linear.gz
    outFile=$tmpDir/${prefix}_chr{TASK_ID}
    snps2exclude=$WD/input/UKB_v3EUR_impQC/UKB_EUR_impQC_duplicated_rsIDs

    # Run clumping
    cd $results
    plink=$bin/plink/plink-1.9/plink
    tmp_command="$plink --bfile $ref \
                    --exclude $snps2exclude \
                    --clump $inFile \
                    --clump-p1 5e-8 \
                    --clump-r2 0.01 \
                    --clump-kb 5000 \
                    --clump-field "P" \
                    --out $outFile"          
    qsubshcom "$tmp_command" 1 30G $prefix.clump 24:00:00 "-array=1-22"
done

# Merge clumped results once all chr finished running
for subset in ukbSNPs overlappingSNPs
do
    prefix=vitD_BMIcov_fastGWA_sunlight_meta.$subset
    cat $tmpDir/${prefix}_chr*clumped | grep "CHR" | head -1 > $results/$prefix.clumped
    cat $tmpDir/${prefix}_chr*clumped | grep -v "CHR" |sed '/^$/d'  >> $results/$prefix.clumped
done

# Copy results to local computer to summarise and plot (run command in local computer)
scp $deltaID:$WD/results/sunlight/vitD_BMIcov_fastGWA_sunlight_meta.*clumped $local/results/sunlight/



# #########################################
# Clump restricted UKB results 
# #########################################

# Directories


results=$WD/results/sunlight
tmpDir=$results/tmpDir

# Files/settings to use
prefix=vitD_BMIcov_fastGWA.overlappingSNPs
ref=$medici/v3EUR_impQC/ukbEUR_imp_chr{TASK_ID}_v3_impQC
inFile=$results/$prefix.linear.gz
outFile=$tmpDir/${prefix}_chr{TASK_ID}
snps2exclude=$WD/input/UKB_v3EUR_impQC/UKB_EUR_impQC_duplicated_rsIDs

# Run clumping
cd $results
plink=$bin/plink/plink-1.9/plink
tmp_command="$plink --bfile $ref \
                    --exclude $snps2exclude \
                    --clump $inFile \
                    --clump-p1 5e-8 \
                    --clump-r2 0.01 \
                    --clump-kb 5000 \
                    --clump-field "P" \
                    --out $outFile"          
qsubshcom "$tmp_command" 1 30G $prefix.clump 24:00:00 "-array=1-22"


# Merge clumped results once all chr finished running
cat $tmpDir/${prefix}_chr*clumped | grep "CHR" | head -1 > $results/$prefix.clumped
cat $tmpDir/${prefix}_chr*clumped | grep -v "CHR" |sed '/^$/d'  >> $results/$prefix.clumped

# Copy results to local computer to summarise and plot (run command in local computer)
scp $deltaID:$WD/results/sunlight/vitD_BMIcov_fastGWA.overlappingSNPs.clumped $local/results/sunlight/

COJO

We ran COJO on the meta-analysed results with the same settings used on the UKB GWAS:

  • –bfile: We used a random subset of 20,000 unrelated individuals of European ancestry from the UKB as LD reference. The subset was generated based on the files in Q0286/UKBiobank/v3EURu_impQC/, which include 1365446 autosomal SNPs and are restricted to unrelated individuals of European ancestry.
  • –cojo-slct: Performs a step-wise model selection of independently associated SNPs
  • –cojo-p 5e-8 (default): P-value to declare a genome-wide significant hit
  • –cojo-wind 10,000 (default): Distance (in kb) beyond which SNPs are assumed to be in linkage equilibrium
  • –cojo-collinear 0.9 (default): maximum r2 to be considered an independent SNP, i.e. SNPs with r2>0.9 will not be considered further
  • –diff-freq 0.2 (default): maximum allele frequency difference allowed between the GWAS summary data and the LD reference sample
  • –extract: Restrict analysis to SNPs with MAF>0.01
# #########################################
# Run COJO on meta-analysed results
# #########################################

# Directories


tmpDir=$WD/tmpDir
results=$WD/results/sunlight

# Files/settings to use
random20k_dir=$WD/input/UKB_v3EURu_impQC/random20K
ldRef=$random20k_dir/ukbEURu_imp_chr{TASK_ID}_v3_impQC_random20k_maf0.01
filters=maf0.01
snps2include=$WD/input/UKB_v3EUR_impQC/UKB_EUR_impQC_$filters.snplist
snps2exclude=$WD/input/UKB_v3EUR_impQC/UKB_EUR_impQC_duplicated_rsIDs

for subset in ukbSNPs overlappingSNPs
do 

    prefix=vitD_BMIcov_fastGWA_sunlight_meta.$subset
    inFile=$results/$prefix.ma
    outFile=$tmpDir/${prefix}_${filters}_chr{TASK_ID}

    # Run COJO
    cd $results
    gcta=$bin/gcta/gcta_1.92.3beta3
    tmp_command="$gcta --bfile $ldRef \
                   --cojo-slct \
                   --cojo-file $inFile \
                   --extract $snps2include \
                   --exclude $snps2exclude \
                   --out $outFile"
                             
    qsubshcom "$tmp_command" 1 100G ${prefix}_COJO 24:00:00 "-array=1-22"
done

# Merge COJO results once all chr finished running
for subset in ukbSNPs overlappingSNPs
do 
    prefix=vitD_BMIcov_fastGWA_sunlight_meta.$subset
    cat $tmpDir/${prefix}_${filters}*.jma.cojo | grep "Chr" | uniq > $results/${prefix}_${filters}.cojo
    cat $tmpDir/${prefix}_${filters}_chr*.jma.cojo | grep -v "Chr" |sed '/^$/d'  >> $results/${prefix}_${filters}.cojo
done 

# Copy results to local computer to summarise and plot (run command in local computer)
scp $deltaID:$WD/results/sunlight/vitD_BMIcov_fastGWA_sunlight_meta*.cojo $local/results/sunlight/


# #########################################
# Run COJO restricted UKB results
# #########################################

# Directories


results=$WD/results/sunlight
tmpDir=$results/tmpDir

# Files/settings to use
prefix=vitD_BMIcov_fastGWA.overlappingSNPs
random20k_dir=$WD/input/UKB_v3EURu_impQC/random20K
ldRef=$random20k_dir/ukbEURu_imp_chr{TASK_ID}_v3_impQC_random20k_maf0.01
filters=maf0.01
inFile=$results/$prefix.ma
snps2include=$WD/input/UKB_v3EUR_impQC/UKB_EUR_impQC_$filters.snplist
snps2exclude=$WD/input/UKB_v3EUR_impQC/UKB_EUR_impQC_duplicated_rsIDs
outFile=$tmpDir/${prefix}_${filters}_chr{TASK_ID}

# Run COJO
cd $results
gcta=$bin/gcta/gcta_1.92.3beta3
tmp_command="$gcta --bfile $ldRef \
                   --cojo-slct \
                   --cojo-file $inFile \
                   --cojo-wind 500 \
                   --extract $snps2include \
                   --exclude $snps2exclude \
                   --out $outFile"
                             
qsubshcom "$tmp_command" 1 100G ${prefix}_COJO 24:00:00 "-array=1-22"


# Merge COJO results once all chr finished running
cat $tmpDir/${prefix}_${filters}*.jma.cojo | grep "Chr" | uniq > $results/${prefix}_${filters}.cojo
cat $tmpDir/${prefix}_${filters}_chr*.jma.cojo | grep -v "Chr" |sed '/^$/d'  >> $results/${prefix}_${filters}.cojo

# Copy results to local computer to summarise and plot (run command in local computer)
scp $deltaID:$WD/results/sunlight/vitD_BMIcov_fastGWA.overlappingSNPs_maf0.01.cojo $local/results/sunlight/


# #########################################
# Run COJO on SUNLIGHT results
# Running this for the sample size comparisons
# #########################################

# ----- Convert SUNLIGHT sumstats to .ma format
## Run this step in R

  qsub -I -A UQ-IMB-CNSG -X -l nodes=1:ppn=5,mem=5GB,walltime=6:00:00
  R
  setwd("$WD")
  
  # Load sumstats
  sun=read.table("results/sunlight/25HydroxyVitaminD_QC.METAANALYSIS1.txt", h=T, stringsAsFactors=F)
  sun=sun[,c("MarkerName","Allele1","Allele2","Effect","StdErr","P.value","Pos","Chr","SAMPLESIZE")]
  
  # Add allele frequencies
  ukb=read.table("results/fastGWA/vitD_fastGWA_withX.gz", h=T, stringsAsFactors=F)
  df=merge(sun,ukb[,c("CHR","SNP","POS","AF1","A1","A2")], by.x="MarkerName", by.y="SNP")
  df$Allele1=toupper(df$Allele1)
  df$Allele2=toupper(df$Allele2)
  df[df$Allele1==df$A1 & df$Allele2==df$A2,"freq"]=df[df$Allele1==df$A1 & df$Allele2==df$A2,"AF1"]
  df[df$Allele1==df$A2 & df$Allele2==df$A1,"freq"]=1-df[df$Allele1==df$A2 & df$Allele2==df$A1,"AF1"]
  df=df[!is.na(df$freq),c("MarkerName","Allele1","Allele2","freq","Effect","StdErr","P.value","SAMPLESIZE")]
  names(df)=c("SNP","A1","A2","freq","b","se","p","n")
  
  # Save
  write.table(df, "results/sunlight/sunlight.ma", quote=F, row.names=F)
  


# ----- Run COJO
# Directories

results=$WD/results/sunlight
tmpDir=$results/tmpDir

# Files/settings to use
prefix=sunlight
random20k_dir=$WD/input/UKB_v3EURu_impQC/random20K
ldRef=$random20k_dir/ukbEURu_imp_chr{TASK_ID}_v3_impQC_random20k_maf0.01
filters=maf0.01
inFile=$results/$prefix.ma
snps2include=$WD/input/UKB_v3EUR_impQC/UKB_EUR_impQC_$filters.snplist
snps2exclude=$WD/input/UKB_v3EUR_impQC/UKB_EUR_impQC_duplicated_rsIDs
outFile=$tmpDir/${prefix}_${filters}_chr{TASK_ID}

 

# Run COJO
cd $results
gcta=$bin/gcta/gcta_1.92.3beta3
tmp_command="$gcta --bfile $ldRef \
                   --cojo-slct \
                   --cojo-file $inFile \
                   --extract $snps2include \
                   --exclude $snps2exclude \
                   --out $outFile"
                             
qsubshcom "$tmp_command" 1 100G COJO.$prefix 24:00:00 "-array=1-22"


# Merge COJO results once all chr finished running
cat $tmpDir/${prefix}_${filters}*.jma.cojo | grep "Chr" | uniq > $results/${prefix}_${filters}.cojo
cat $tmpDir/${prefix}_${filters}_chr*.jma.cojo | grep -v "Chr" |sed '/^$/d'  >> $results/${prefix}_${filters}.cojo

# Copy results to local computer to summarise and plot (run command in local computer)
scp $deltaID:$WD/results/sunlight/sunlight_maf0.01.cojo $local/results/sunlight/



Influence of sample size

The SUNLIGHT consortium GWAS summary statistics were generated from a sample of 79,366 individuals. To better compare these results with those obtained in the UKB, we selected a random sample of 80,000 UKB individuals and conducted the GWAS as before (see the GWAS section). We repeated this five times and present the results below.

In addition, we generated random samples of increasing sample sizes to assess the influence of the sample size on the GWAS results.

# ########################################
# Select random samples
# ########################################

## !!! Run this step in R

# Generate 5 sets of 80K
  vitD_ids=read.table("input/mlm_pheno", h=T, stringsAsFactors=F)
  n=80000
  n_samples=5
  for(i in 1:n_samples)
  {
    subset=sample(vitD_ids$IID, size=n)
    tmp=data.frame(FID=subset, IID=subset)
    write.table(tmp, paste0("results/fastGWA/sampleSize_stratifiedGWAS/data_subsets/ukbSubset_",n/1000,"K_",i), quote=F, row.names=F)
  }
  
# Generate sets of different sample sizes
  n=seq(from=50000, to=400000, by=50000)
  for(i in n)
  {
    subset=sample(vitD_ids$IID, size=i)
    tmp=data.frame(FID=subset, IID=subset)
    write.table(tmp, paste0("results/fastGWA/sampleSize_stratifiedGWAS/data_subsets/ukbSubset_",i/1000,"K"), quote=F, row.names=F)
  }
  
# Transfer to Delta to run GWAS
scp $local/results/fastGWA/sampleSize_stratifiedGWAS/data_subsets/ukbSubset* $deltaID:$WD/results/fastGWA/sampleSize_stratifiedGWAS/data_subsets/
  
# ########################################
# Run UKB GWAS on sub-samples
# ########################################

# Set directories

results=$WD/results/fastGWA/sampleSize_stratifiedGWAS
tmpDir=$results/tmpDir

# Files/settings to use
prefix=vitD_fastGWA
geno=$WD/results/fastGWA/tmpDir/UKB_v3EUR_impQC.files
pheno=$WD/input/mlm_pheno
grm=$WD/input/UKB_HM3_m01_sp
quantCovars=$WD/input/quantitativeCovariates
#quantCovars=$WD/input/quantitativeCovariates_bmi
catgCovars=$WD/input/categoricalCovariates
snps2include=$WD/input/UKB_v3EUR_impQC/UKB_EUR_impQC_maf0.01.snplist
gcta=$bin/gcta/gcta_1.92.3beta3

# Run GWAS
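  # One array task per subset file: each task reads its subset name from line TASK_ID
  # of $subsets2run, then runs fastGWA restricted to those individuals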
  subsets2run=$tmpDir/subsets2run
  ls $results/data_subsets > $subsets2run
  nSubsets=`wc -l < $subsets2run`
  if test $nSubsets != 1; then arrays="-array=1-$nSubsets"; i="{TASK_ID}"; else arrays=""; i=1; fi 
  getSubset="subset=\$(awk -v i=$i 'NR==i {print \$1}' $subsets2run)"
  defineOutputName="output=$tmpDir/$prefix.\$subset"
  runGWAS="$gcta --mbfile $geno \
                 --fastGWA-lmm \
                 --pheno $pheno \
                 --grm-sparse $grm \
                 --qcovar $quantCovars\
                 --covar $catgCovars \
                 --covar-maxlevel 110 \
                 --keep $results/data_subsets/\$subset \
                 --extract $snps2include \
                 --threads 16 \
                 --out \$output"
  qsubshcom "$getSubset; $defineOutputName; $runGWAS" 16 10G $prefix.stratifiedGWAS 24:00:00 "$arrays"

# Compress results
for subset in `ls $results/data_subsets/`
do
  output=$tmpDir/$prefix.$subset
  gzip -c $output.fastGWA > $results/$prefix.$subset.gz
done

# Convert GWAS sumstats to .ma format (for COJO)
for subset in `ls $results/data_subsets/`
do
  output=$tmpDir/$prefix.$subset
  echo "SNP A1 A2 freq b se p n" > $results/maFormat/$prefix.$subset.ma
  awk '{print $2,$4,$5,$7,$8,$9,$10,$6}' $output.fastGWA | sed 1d >> $results/maFormat/$prefix.$subset.ma
done


# Copy GWAS results to local computer 
scp $deltaID:$WD/results/fastGWA/sampleSize_stratifiedGWAS/vitD_fastGWA.ukbSubset_*.gz $local/results/fastGWA/sampleSize_stratifiedGWAS/ 


# ########################################
# Run COJO on UKB GWAS from sub-samples
# ########################################
# Directories

results=$WD/results/fastGWA/sampleSize_stratifiedGWAS
tmpDir=$results/tmpDir
random20k_dir=$WD/input/UKB_v3EURu_impQC/random20K

# Files/setting to use
gcta=$bin/gcta/gcta_1.92.3beta3
ldRef=$random20k_dir/ukbEURu_imp_chr{TASK_ID}_v3_impQC_random20k_maf0.01
filters=maf0.01
snps2include=$WD/input/UKB_v3EUR_impQC/UKB_EUR_impQC_$filters.snplist


for subset in `ls $results/data_subsets/`
do
  prefix=vitD_fastGWA.$subset
  inFile=$results/maFormat/$prefix.ma
  outFile=$tmpDir/${prefix}_${filters}_chr{TASK_ID}

  # Run COJO
  cd $results
  tmp_command="$gcta --bfile $ldRef \
                     --cojo-slct \
                     --cojo-file $inFile \
                     --extract $snps2include \
                     --out $outFile"
  qsubshcom "$tmp_command" 1 100G COJO.$prefix 24:00:00 "-array=1-22"
done

# Merge COJO results
for subset in `ls $results/data_subsets/`
do
  prefix=vitD_fastGWA.$subset
  cat $tmpDir/${prefix}_${filters}*.jma.cojo | grep "Chr" | uniq > $results/${prefix}_${filters}.cojo
  cat $tmpDir/${prefix}_${filters}_chr*.jma.cojo | grep -v "Chr" |sed '/^$/d'  >>     $results/${prefix}_${filters}.cojo
done

# Copy COJO results to local computer
scp $deltaID:$WD/results/fastGWA/sampleSize_stratifiedGWAS/vitD_fastGWA.ukbSubset_*.cojo $local/results/fastGWA/sampleSize_stratifiedGWAS/ 
cojoList=list.files("results/fastGWA/sampleSize_stratifiedGWAS/", pattern="vitD_fastGWA.ukbSubset.*cojo")

# Summary of COJO results
df=c()
for(i in cojoList)
{
  print(paste(grep(i,cojoList), "of", length(cojoList), "-", i))
  # Load COJO results
  cojo=read.table(paste0("results/fastGWA/sampleSize_stratifiedGWAS/",i))
  
  # Summarise
  tmp=data.frame(file=i, M=nrow(cojo))
  
  # Define sample size
  sampleSizeString=grep("K",sapply(strsplit(i, "_"), "["), value=T)
  sampleSize=sub("K","",sampleSizeString)
  tmp$N=as.numeric(sampleSize)*1000
  
  # Merge with summary from other runs
  df=rbind(df,tmp)
}
## [1] "1 of 13 - vitD_fastGWA.ukbSubset_100K_maf0.01.cojo"
## [1] "2 of 13 - vitD_fastGWA.ukbSubset_150K_maf0.01.cojo"
## [1] "3 of 13 - vitD_fastGWA.ukbSubset_200K_maf0.01.cojo"
## [1] "4 of 13 - vitD_fastGWA.ukbSubset_250K_maf0.01.cojo"
## [1] "5 of 13 - vitD_fastGWA.ukbSubset_300K_maf0.01.cojo"
## [1] "6 of 13 - vitD_fastGWA.ukbSubset_350K_maf0.01.cojo"
## [1] "7 of 13 - vitD_fastGWA.ukbSubset_400K_maf0.01.cojo"
## [1] "8 of 13 - vitD_fastGWA.ukbSubset_50K_maf0.01.cojo"
## [1] "9 of 13 - vitD_fastGWA.ukbSubset_80K_1_maf0.01.cojo"
## [1] "10 of 13 - vitD_fastGWA.ukbSubset_80K_2_maf0.01.cojo"
## [1] "11 of 13 - vitD_fastGWA.ukbSubset_80K_3_maf0.01.cojo"
## [1] "12 of 13 - vitD_fastGWA.ukbSubset_80K_4_maf0.01.cojo"
## [1] "13 of 13 - vitD_fastGWA.ukbSubset_80K_5_maf0.01.cojo"
# Plot sub-samples of 80K + SUNLIGHT
  tmp=df[df$N==80000,]
  tmp$study="UK Biobank"
  tmp$replicate=paste0("sample",1:nrow(tmp))
  ## Add SUNLIGHT
    sunlight=read.table("results/sunlight/sunlight_maf0.01.cojo", h=T, stringsAsFactors=F)
    sun=data.frame(file=NA, M=nrow(sunlight), N=NA, study="SUNLIGHT", replicate="")
    tmp=rbind(tmp, sun)
    tmp$replicate=factor(tmp$replicate, levels=c("", paste0("sample",1:5)))

if(saveplots){tiff("manuscript/Figures/figureS3a_80Ksamples.tiff", width=6, height=5 ,units="in", res=800,compression = "lzw")}
  ggplot(tmp, aes(x=replicate, y=M, fill=study)) +
    geom_bar(alpha=0.7, stat="identity") +
    theme(legend.position = "none",
          text = element_text(size=15)) +
    labs(x="", 
         y="Number of independent associations")#,

         #title="Number of independent associations in UK Biobank sub-samples of 80K individuals")
if(saveplots){dev.off()}
  

# Plot sub-samples of increasing sample size
  samples=seq(from=50000, to=400000, by=50000)
  tmp=df[df$N %in% samples,]
  # Fit regression
  test=summary(lm(tmp$M~tmp$N))$coefficients
  b=test[2,1]
  b= gsub("e","*x10^",formatC(b, format = "e", digits = 2))
  plot_label= paste("beta ==",b)
  
if(saveplots){tiff("manuscript/Figures/figureS3a_sampleSize.tiff", width=5, height=5 ,units="in", res=800,compression = "lzw")}
  ggplot(tmp, aes(x=N, y=M)) +
    geom_smooth(method='lm', formula= y~x, col="darkgrey", fill="white", size=.5) +
    geom_point(col="lightcoral", size=3) +
    geom_line(col="lightcoral", size=1) +
    scale_x_continuous(breaks=samples, labels=paste0(samples/1000,"K")) +
    theme(text = element_text(size=15)) +
    labs(x="Sample size", 
         #title="Number of independent associations identified with COJO as a function of sample size",
         y="Number of independent associations") +
    annotate("text", x=51000, y=130, label=plot_label, parse=T, hjust=0, vjust=.2, size=5)

if(saveplots){dev.off()}

Comparison of UKB and meta-analysis results

Based on the results presented below, we decided to use only UKB results for downstream analyses.

Using overlapping SNPs

# Load results
  ## UKB
  #ukb=read.table("results/sunlight/vitD_BMIcov_fastGWA.overlappingSNPs.ma", h=T, stringsAsFactors=F)
  #saveRDS(ukb, "results/sunlight/vitD_BMIcov_fastGWA.overlappingSNPs.RDS")
  ukb=readRDS("results/sunlight/vitD_BMIcov_fastGWA.overlappingSNPs.RDS")
  ukbCOJO=read.table("results/sunlight/vitD_BMIcov_fastGWA.overlappingSNPs_maf0.01.cojo", h=T, stringsAsFactors=F)
  ukbClump=read.table("results/sunlight/vitD_BMIcov_fastGWA.overlappingSNPs.clumped", h=T, stringsAsFactors=F)

  ## Meta-analysis
  #met=read.table("results/sunlight/vitD_BMIcov_fastGWA_sunlight_meta.overlappingSNPs.ma", h=T, stringsAsFactors=F)
  #saveRDS(met, "results/sunlight/vitD_BMIcov_fastGWA_sunlight_meta.overlappingSNPs.ma.RDS")
  met=readRDS("results/sunlight/vitD_BMIcov_fastGWA_sunlight_meta.overlappingSNPs.ma.RDS")
  metClump=read.table("results/sunlight/vitD_BMIcov_fastGWA_sunlight_meta.overlappingSNPs.clumped", h=T, stringsAsFactors=F)
  metCOJO=read.table("results/sunlight/vitD_BMIcov_fastGWA_sunlight_meta.overlappingSNPs_maf0.01.cojo", h=T, stringsAsFactors=F)
  
# ************************************
# Compare results for overlapping SNPs ----
# ************************************

# SNPs with P<5x10-8
ukb_sigSNPs = ukb[ukb$p<5e-8,"SNP"]
met_sigSNPs = met[met$p<5e-8,"SNP"]

We compared the UKB and meta-analysis results using overlapping SNPs, i.e. SNPs that were tested in both the UKB and SUNLIGHT (M=2442731 SNPs). A minimal sketch of how the overlap counts below can be tallied follows the lists.

Number of genome-wide significant (GWS; P<5x10-8) variants

  • UKB: 5816
    • 506 not GWS in meta-analysis
  • Meta-analysis: 6113
    • 5310 also GWS in UKB
    • 803 not GWS in UKB

Number of independent loci (clumped)

  • UKB: 140
  • Meta-analysis: 161

Number of independent variants (COJO)

  • UKB: 181
  • Meta-analysis: 165
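
The counts above can be tallied from the objects loaded in the preceding chunk; a minimal sketch (whether the reported numbers were derived exactly this way is an assumption):

# Tally GWS overlap between the two sets of results
length(ukb_sigSNPs)                          # GWS variants in UKB
length(setdiff(ukb_sigSNPs, met_sigSNPs))    # GWS in UKB but not in the meta-analysis
length(met_sigSNPs)                          # GWS variants in the meta-analysis
length(intersect(met_sigSNPs, ukb_sigSNPs))  # GWS in both
length(setdiff(met_sigSNPs, ukb_sigSNPs))    # GWS in the meta-analysis but not in UKB

# Independent loci (clumping) and independent variants (COJO)
nrow(ukbClump); nrow(metClump)
nrow(ukbCOJO); nrow(metCOJO)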

Using UKB SNPs

# ##################################
# Calculate the LD between COJO SNPs identified with UKB or meta-analysis
# ##################################

# Directories
results=$WD/results/sunlight
tmpDir=$results/tmpDir

# Files
ukb_cojoSNPs=$WD/results/fastGWA/vitD_BMIcov_fastGWA_maf0.01.cojo
meta_cojoSNPs=$results/vitD_BMIcov_fastGWA_sunlight_meta.ukbSNPs_maf0.01.cojo
ld_ref=$medici/v3EUR_impQC/ukbEUR_imp_chr{TASK_ID}_v3_impQC

# Make list of all UKB COJO SNPs + meta-analysis COJO SNPs
sort -u <(awk '{print $2}' $ukb_cojoSNPs | sed 1d)  <(awk '{print $2}' $meta_cojoSNPs | sed 1d) > $tmpDir/ukb_meta_cojoSNPs

# Calculate LD between each SNP pair
cd $results
plink=$bin/plink/plink-1.9/plink
tmp_command="$plink --bfile $ld_ref \
                    --extract $tmpDir/ukb_meta_cojoSNPs \
                    --r2 inter-chr \
                    --ld-window-r2 0 \
                    --out $tmpDir/ukb_meta_cojoSNPs_chr{TASK_ID}"
qsubshcom "$tmp_command" 1 10G vitD_ukb_meta_cojo_ld 1:00:00 "-array=1-22"

# Merge results from the different chromosomes
cat $tmpDir/ukb_meta_cojoSNPs_chr*ld | grep "CHR" | head -1 > $results/ukb_meta_cojoSNPs_ld_r2
cat $tmpDir/ukb_meta_cojoSNPs_chr*ld | grep -v "CHR" | sed '/^$/d'  >> $results/ukb_meta_cojoSNPs_ld_r2

# Copy results to local computer
scp $deltaID:$WD/results/sunlight/ukb_meta_cojoSNPs_ld_r2 $local/results/sunlight/
# Load results
  ## UKB
  #gwas=read.table("results/fastGWA/vitD_BMIcov_fastGWA_withX.gz", h=T, stringsAsFactors=F)
  #saveRDS(gwas, file="results/fastGWA/vitD_BMIcov_fastGWA_withX.RDS")
  ukb=readRDS("results/fastGWA/vitD_BMIcov_fastGWA_withX.RDS")
  ukbClump=read.table("results/fastGWA/vitD_BMIcov_fastGWA_withX.clumped", h=T, stringsAsFactors=F)
  ukbCOJO=read.table("results/fastGWA/vitD_BMIcov_fastGWA_maf0.01_withX.cojo", h=T, stringsAsFactors=F)
    
  ## Meta-analysis
  #met=read.table("results/sunlight/vitD_BMIcov_fastGWA_sunlight_meta.ukbSNPs.ma", h=T, stringsAsFactors=F)
  #saveRDS(met, "results/sunlight/vitD_BMIcov_fastGWA_sunlight_meta.ukbSNPs.ma.RDS")
  met=readRDS("results/sunlight/vitD_BMIcov_fastGWA_sunlight_meta.ukbSNPs.ma.RDS")
  metClump=read.table("results/sunlight/vitD_BMIcov_fastGWA_sunlight_meta.ukbSNPs.clumped", h=T, stringsAsFactors=F)
  metCOJO=read.table("results/sunlight/vitD_BMIcov_fastGWA_sunlight_meta.ukbSNPs_maf0.01.cojo", h=T, stringsAsFactors=F)

  cojoSNPs_ld=read.table("results/sunlight/ukb_meta_cojoSNPs_ld_r2", h=T, stringsAsFactors=F)
  
# ************************************
# Compare results for SNPs in UKB ----
# ************************************

# SNPs with P<5x10-8
ukb_sigSNPs = ukb[ukb$P<5e-8,"SNP"]
met_sigSNPs = met[met$p<5e-8,"SNP"]

# COJO SNPs
  # Loop through UKB COJO SNPs and check overlap with meta-analysis COJO SNPs
  df=data.frame(ukb_snp=ukbCOJO$SNP, chr=ukbCOJO$Chr, bp=ukbCOJO$bp,
                met_closest_SNP=NA, dist=NA, met_highLD_SNP=NA, ld_r2=NA, stringsAsFactors=F)
  for(i in 1:nrow(df))
  {
    snp=df[i,"ukb_snp"]
    chr=df[i,"chr"]
    bp=df[i,"bp"]
  
    # Return same SNP if it is in meta COJO
    if(snp %in% metCOJO$SNP)
    {
      df[i,c("met_closest_SNP","dist","met_highLD_SNP","ld_r2")]=c(snp, 0, snp, 1)
    }
  
    # Return closest SNP and SNP with highest LD if SNP is not in meta COJO
    if(!snp %in% metCOJO$SNP & chr %in% unique(metCOJO$Chr))
    {
      # Closest by distance
      tmp=metCOJO[metCOJO$Chr==chr,]
      tmp$dist=tmp$bp-bp
      closestSNP=which(abs(tmp$dist) == min(abs(tmp$dist)))
      df[i,c("met_closest_SNP","dist")]=c(tmp[closestSNP,"SNP"],
                                            tmp[closestSNP,"dist"])
  
      # Closest by LD
      tmp=cojoSNPs_ld[(cojoSNPs_ld$SNP_A==snp & cojoSNPs_ld$SNP_B %in% metCOJO$SNP) | 
                      (cojoSNPs_ld$SNP_B==snp & cojoSNPs_ld$SNP_A %in% metCOJO$SNP),]
      if(nrow(tmp)==0)
      {
        df[i,c("met_highLD_SNP","ld_r2")]=rep(NA,2)
      }
      if(nrow(tmp)>0)
      {
        highLDsnp=which(tmp$R2==max(tmp$R2))
        if(tmp[highLDsnp,"SNP_A"]==snp)
        {
          df[i,c("met_highLD_SNP","ld_r2")]=c(tmp[highLDsnp,"SNP_B"],
                                              tmp[highLDsnp,"R2"])
        }
        if(tmp[highLDsnp,"SNP_B"]==snp)
        {
          df[i,c("met_highLD_SNP","ld_r2")]=c(tmp[highLDsnp,"SNP_A"],
                                              tmp[highLDsnp,"R2"])
        }

      }
    }
  }
  df=df[!is.na(df$dist),]
  df[,c("dist","ld_r2")] = lapply(df[,c("dist","ld_r2")], as.numeric)
  cojo_ukbVSmet=df

  
  
  # Loop through meta-analysis COJO SNPs and check overlap with UKB COJO SNPs
  df=data.frame(meta_snp=metCOJO$SNP, chr=metCOJO$Chr, bp=metCOJO$bp,
                ukb_closest_SNP=NA, dist=NA, ukb_highLD_SNP=NA, ld_r2=NA, stringsAsFactors=F)
  for(i in 1:nrow(df))
  {
    snp=df[i,"meta_snp"]
    chr=df[i,"chr"]
    bp=df[i,"bp"]
  
    # Return same SNP if it is in UKB COJO
    if(snp %in% ukbCOJO$SNP)
    {
      df[i,c("ukb_closest_SNP","dist","ukb_highLD_SNP","ld_r2")]=c(snp, 0, snp, 1)
    }
  
    # Return closest SNP and SNP with highest LD if SNP is not in UKB COJO
    if(!snp %in% ukbCOJO$SNP)
    {
      # Closest by distance
      tmp=ukbCOJO[ukbCOJO$Chr==chr,]
      tmp$dist=tmp$bp-bp
      closestSNP=which(abs(tmp$dist) == min(abs(tmp$dist)))
      df[i,c("ukb_closest_SNP","dist")]=c(tmp[closestSNP,"SNP"],
                                         tmp[closestSNP,"dist"])
  
      # Closest by LD
      tmp=cojoSNPs_ld[(cojoSNPs_ld$SNP_A==snp & cojoSNPs_ld$SNP_B %in% ukbCOJO$SNP) | 
                      (cojoSNPs_ld$SNP_B==snp & cojoSNPs_ld$SNP_A %in% ukbCOJO$SNP),]
      if(nrow(tmp)==0)
      {
        df[i,c("ukb_highLD_SNP","ld_r2")]=rep(NA,2)
      }
      if(nrow(tmp)>0)
      {
        highLDsnp=which(tmp$R2==max(tmp$R2))
        if(tmp[highLDsnp,"SNP_A"]==snp)
        {
          df[i,c("ukb_highLD_SNP","ld_r2")]=c(tmp[highLDsnp,"SNP_B"],
                                              tmp[highLDsnp,"R2"])
        }
        if(tmp[highLDsnp,"SNP_B"]==snp)
        {
          df[i,c("ukb_highLD_SNP","ld_r2")]=c(tmp[highLDsnp,"SNP_A"],
                                              tmp[highLDsnp,"R2"])
        }
      }
      
    }
  }
  df[,c("dist","ld_r2")] = lapply(df[,c("dist","ld_r2")], as.numeric)
  cojo_metVSukb=df

We compared the UKB and meta-analysis results using UKB SNPs (M=8806780), i.e. all SNPs that were tested both in UKB and in SUNLIGHT, plus SNPs that were exclusively tested in the UKB (except those that could not be meta-analysed because the P-value was zero). A sketch tallying the COJO overlap breakdowns follows the lists below.

Number of genome-wide significant (GWS; P<5x10-8) variants

  • UKB: 20052
    • 1368 not GWS in meta-analysis
  • Meta-analysis: 19487
    • 18684 also GWS in UKB
    • 803 not GWS in UKB

Number of independent loci (clumped)

  • UKB: 199
  • Meta-analysis: 224

Number of independent variants (COJO)

  • UKB: 162
    • 111 not in meta-analysis COJO set
      • 103 within 1 Mbp of a meta-analysis COJO variant
      • 54 in high LD (r2>0.9) with a meta-analysis COJO variant
      • 21 in low LD (r2<0.1) with meta-analysis COJO variants
      • 20 in low LD (r2<0.05) with meta-analysis COJO variants
  • Meta-analysis: 221
    • 172 not in UKB COJO set
      • 162 within 1 Mbp of a UKB COJO variant
      • 2 in high LD (r2>0.9) with a UKB COJO variant
      • 127 in low LD (r2<0.1) with UKB COJO variants
      • 116 in low LD (r2<0.05) with UKB COJO variants
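
The distance and LD breakdowns above can be tallied from the cojo_ukbVSmet and cojo_metVSukb data frames built in the chunk preceding this comparison. A minimal sketch (the thresholds mirror the bullets above; whether the reported numbers were derived exactly this way is an assumption):

# UKB COJO SNPs that are not in the meta-analysis COJO set
ukb_only = cojo_ukbVSmet[!cojo_ukbVSmet$ukb_snp %in% metCOJO$SNP,]
nrow(ukb_only)
sum(abs(ukb_only$dist) < 1e6, na.rm=TRUE)    # within 1 Mbp of a meta-analysis COJO variant
sum(ukb_only$ld_r2 > 0.9, na.rm=TRUE)        # high LD (r2>0.9)
sum(ukb_only$ld_r2 < 0.1, na.rm=TRUE)        # low LD (r2<0.1)
sum(ukb_only$ld_r2 < 0.05, na.rm=TRUE)       # low LD (r2<0.05)

# Meta-analysis COJO SNPs that are not in the UKB COJO set
met_only = cojo_metVSukb[!cojo_metVSukb$meta_snp %in% ukbCOJO$SNP,]
nrow(met_only)
sum(abs(met_only$dist) < 1e6, na.rm=TRUE)
sum(met_only$ld_r2 > 0.9, na.rm=TRUE)
sum(met_only$ld_r2 < 0.1, na.rm=TRUE)
sum(met_only$ld_r2 < 0.05, na.rm=TRUE)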