To increase the power of our analyses, we meta-analysed our vitamin D GWAS results with the publicly-available summary statistics from the SUNLIGHT Consortium (Jiang et al. 2018). That study was conducted in 79,366 individuals of European ancestry. Results were downloaded from this page. Summary statistics were available for 2,579,297 variants.




# ************************************
# General information on GWAS run ----
# ************************************

# Number of variants available in UKB GWAS
ukbSNPs_N=length(count.fields("results/fastGWA/vitD_BMIcov_fastGWA_withX.gz", skip=1))

# Number of variants that overlap between the UKB and SUNLIGHT GWAS
overlappingSNPs_N=nrow(readRDS("results/sunlight/vitD_BMIcov_fastGWA.overlappingSNPs.RDS"))

# Number of variants available in dataset used as LD reference
referenceSNPs_N=length(count.fields("input/UKB_v3EUR_HM3/ukbEUR_imp_allchr_v3_HM3.bim", sep="\t"))

Bivariate LDSC regression

# Copy sunlight results from local to Delta (run locally)
scp $local/results/sunlight/25HydroxyVitaminD_QC.METAANALYSIS1.txt $deltaID:$WD/results/sunlight

# ##############################
# Run bivariate LDSC regression
# ##############################
module load ldsc
prefix=vitD_BMIcov_fastGWA

# Directories
results=$WD/results/sunlight
tmpDir=$results/tmpDir
ldscDir=$bin/ldsc

# Format SUNLIGHT GWAS results
echo "SNP A1 A2 b n p" > $results/sunlight.ldHubFormat
awk 'NR>1 {print $1,$2,$3,$4,$14,$6}' $results/25HydroxyVitaminD_QC.METAANALYSIS1.txt >> $results/sunlight.ldHubFormat

# Munge SUNLIGHT sumstats (process sumstats and restrict to HapMap3 SNPs)
munge_sumstats.py --sumstats $results/sunlight.ldHubFormat \
                  --merge-alleles $ldscDir/w_hm3.snplist \
                  --out $tmpDir/sunlight
                  
# Format vitD_BMIcov sumstats
cd $WD/results/fastGWA/ldscFormat
awk '{print $1,$2,$3,$5,$8,$7}' $WD/results/fastGWA/maFormat/$prefix.ma > $prefix.ldHubFormat

# Munge vitD_BMIcov sumstats (process sumstats and restrict to HapMap3 SNPs)
cd $WD/results/fastGWA/ldscFormat/
munge_sumstats.py --sumstats $prefix.ldHubFormat \
                  --merge-alleles $ldscDir/w_hm3.snplist \
                  --out $prefix

# Run rg
cd $results
ldsc.py --rg  $WD/results/fastGWA/ldscFormat/$prefix.sumstats.gz,$tmpDir/sunlight.sumstats.gz\
        --ref-ld-chr $ldscDir/eur_w_ld_chr/ \
        --w-ld-chr $ldscDir/eur_w_ld_chr/ \
        --out $results/$prefix.sunlight.ldsc.rg
        
# Copy results to local computer (run command in local computer)
scp $deltaID:$WD/results/sunlight/vitD_BMIcov_fastGWA.sunlight.ldsc.rg.log $local/results/sunlight/
# Read-in bivariate LDSC regression results
tmp=read.table("results/sunlight/vitD_BMIcov_fastGWA.sunlight.ldsc.rg.log", skip=30, fill=T, h=F, colClasses="character", sep="_")[,1]

# Get results:
intercept=strsplit(grep("Intercept",tmp, value=T)[3],":")[[1]][2]
rg=strsplit(grep("Genetic Correlation:",tmp, value=T),":")[[1]][2]

We ran bivariate LDSC regression on both sets of GWAS results to assess whether there was any sample overlap between the two studies.

  • Bivariate LDSC Intercept: -0.0012 (0.0121)
  • Genetic Correlation: 0.9162 (0.0602)

The genetic correlation between the two traits is high and the bivariate LDSC intercept is close to zero, suggesting no sample overlap.
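
For context, under bivariate LDSC the expected cross-trait intercept is approximately Ns*rho/sqrt(N1*N2), where Ns is the number of overlapping individuals and rho their phenotypic correlation (Bulik-Sullivan et al. 2015). A rough sketch with assumed, approximate values, for illustration only:

# Expected bivariate LDSC intercept under different degrees of sample overlap
# (N1, N2 and rho are approximate/assumed values, for illustration only)
N1 = 4e5      # approximate UKB GWAS sample size
N2 = 79366    # SUNLIGHT sample size
rho = 0.7     # assumed phenotypic correlation among any overlapping individuals
expected_intercept = function(Ns) Ns * rho / sqrt(N1 * N2)
expected_intercept(0)    # no overlap -> 0
expected_intercept(N2)   # all SUNLIGHT samples overlap -> ~0.31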



Meta-analysis

Method used

We meta-analysed the GWAS results with the default method implemented in METAL, which generates Z-scores based on the observed (1) P-values and (2) direction of effect. For more details see Willer et al. 2010.

This approach does not require the beta coefficients (and respective standard errors) to be in the same units, which is critical in our case, since vitamin D levels were normalised/standardised differently in the two studies (rank-based inverse-normal transformed in our study; log-transformed in the SUNLIGHT study).

Note: There are 307 SNPs with P=0 in the UKB GWAS results. Since we cannot get a quantile for P=0, METAL discards these variants from analysis. To prevent this, we set these SNPs to the lowest P-value in the GWAS that was not 0 (P=3.95x10-323) before running the meta-analysis.
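
As an illustration, this scheme combines the per-study Z-scores with weights equal to the square root of each study's sample size. A minimal R sketch of the two-study case (not the METAL source code; a per-SNP check against the actual METAL output is shown further below):

# Sample-size-weighted Z-score meta-analysis (two studies)
# z1, z2: per-study Z-scores (sign gives the direction of effect); n1, n2: sample sizes
meta_z = function(z1, z2, n1, n2) (sqrt(n1)*z1 + sqrt(n2)*z2) / sqrt(n1 + n2)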


Variants included in the meta-analysis
Allele frequencies are required to calculate the effect size and the SE of the meta-analysed results (see below). We don’t have allele frequencies from the SUNLIGHT consortium, but we can use those from the UKB as both GWAS were performed in individuals with European ancestry. This also means that we can only convert meta-analysis z-scores to beta and SE for SNPs that are in the UKB GWAS results.

We can restrict the meta-analysis to SNPs in:

  • UKB GWAS (M=8806780 autosomal SNPs) - Meta-analyse SNPs present in both GWAS and keep the individual-GWAS results for SNPs that are only in the UKB GWAS
  • Both GWAS (M=1281182 autosomal SNPs) - Meta-analyse only the SNPs present in both GWAS

We could justify using the first approach based on the sample size of the UKB GWAS (the minimum sample size in the UKB is 396,459, whereas in the SUNLIGHT consortium it is 1,001).

ukb=$WD/results/fastGWA/vitD_BMIcov_fastGWA.gz
sun=$WD/results/sunlight/25HydroxyVitaminD_QC.METAANALYSIS1.txt

# Check how many SNPs have P=0 in both GWAS
zcat $ukb | awk '$10==0' | wc -l 
awk '$6==0' $sun | wc -l 
  ## 307 in UKB
  ## 0 in SUNLIGHT
  
# Convert P=0 to the smallest non-zero P in the GWAS
newP=`zcat $ukb | sed 1d | awk '$10!=0 {print $10}' | sort -g | sed 1q`
zcat $ukb | awk -v newP="$newP" '{ $10 = ($10 == "0" ? newP : $10) } 1'  > $WD/tmpDir/vitD_noP0
gzip $WD/tmpDir/vitD_noP0

# ##############################
# Run meta-analysis
# ##############################

# NOTE: Run these commands with an interactive session (I still haven't been able to submit a script that will run them)
#       Use: qsub -I -A UQ-IMB-CNSG -X -l nodes=1:ppn=5,mem=15GB,walltime=6:00:00
# NOTE: Running the analysis without restricting to any subset of variants;
#       this way we get results for all variants and can then extract whichever subset we need

cd $WD/results/sunlight
metal=$bin/metal/metal
$metal

# === DESCRIBE AND PROCESS UKB RESULTS ===
MARKER SNP
ALLELE A1 A2
EFFECT BETA
PVALUE P
WEIGHT N
#STDERR se

PROCESS $WD/tmpDir/vitD_noP0.gz

# === DESCRIBE AND PROCESS SUNLIGHT RESULTS ===
MARKER MarkerName
ALLELE Allele1 Allele2
EFFECT Effect
PVALUE P-value
WEIGHT SAMPLESIZE
#STDERR StdErr

PROCESS $WD/results/sunlight/25HydroxyVitaminD_QC.METAANALYSIS1.txt

# === RUN META-ANALYSIS ===
#MINWEIGHT 417581
OUTFILE $WD/results/sunlight/vitD_BMIcov_fastGWA_sunlight_meta .TBL
ANALYZE

# #########################################
# Copy results to local computer 
# #########################################
scp $deltaID:$WD/results/sunlight/vitD_BMIcov_fastGWA_sunlight_meta1.TBL $local/results/sunlight/


# ##############################
# Calculate allele frequencies
# ##############################
# Required to calculate the effect size and SE from the combined z-scores (see section below)

# Directories

results=$WD/results/sunlight
tmpDir=$results/tmpDir

# Files
geno=$medici/v3EURu_impQC/ukbEURu_imp_chr{TASK_ID}_v3_impQC
sumstats=$results/vitD_BMIcov_fastGWA_sunlight_meta1.TBL
snps2exclude=$WD/input/UKB_v3EUR_impQC/UKB_EUR_impQC_duplicated_rsIDs

# Generate list of SNPs to get allele frequency for
awk 'NR>1 {print $1}' $sumstats > $tmpDir/snps4freqReport.meta

# Generate list of alleles to get frequency for
awk 'NR>1 {print $1,$2}' $sumstats | awk '$2 = toupper($2)' > $tmpDir/alleles4freqReport.meta

# Generate SNP frequency report
cd $results
plink=$bin/plink/plink-1.9/plink
output=$tmpDir/vitD_BMIcov_fastGWA_sunlight_meta_A1freq.chr{TASK_ID}
tmp_command="$plink --bfile $geno \
                    --extract $tmpDir/snps4freqReport.meta \
                    --exclude $snps2exclude \
                    --a1-allele $tmpDir/alleles4freqReport.meta 2 1 \
                    --freq \
                    --out $output"
jobName=$(basename $output  | sed 's/.chr{TASK_ID}//')
qsubshcom "$tmp_command" 1 10G $jobName 10:00:00 "-array=1-22"

sed 1q $tmpDir/vitD_BMIcov_fastGWA_sunlight_meta_A1freq.chr1.frq | awk -v OFS="\t" '$1=$1' > $results/vitD_BMIcov_fastGWA_sunlight_meta_A1.frq
cat $tmpDir/vitD_BMIcov_fastGWA_sunlight_meta_A1freq.chr*.frq | grep -v CHR | awk -v OFS="\t" '$1=$1' >> $results/vitD_BMIcov_fastGWA_sunlight_meta_A1.frq
rm $genoDir/v3EURu_chr* 

Meta-analysis effect sizes and SE
The output of the meta-analysis is a combined Z-score for each variant. To calculate the combined effect size (beta) and corresponding SE, we used the first equations on page 10 of the supplementary materials of Zhu et al. 2016:

b  = z / sqrt(2p(1-p)(n + z^2))
se = 1 / sqrt(2p(1-p)(n + z^2))

where z is the meta-analysis Z-score, p the effect-allele frequency and n the total sample size.
# ************************************
# Format meta-analysis results ----
# ************************************
# Read in data
#gwas=read.table("results/fastGWA/vitD_BMIcov_fastGWA.gz", h=T, stringsAsFactors=F)
#saveRDS(gwas, file="results/fastGWA/vitD_BMIcov_fastGWA.RDS")
ukb=readRDS("results/fastGWA/vitD_BMIcov_fastGWA.RDS")
sun=read.table("results/sunlight/25HydroxyVitaminD_QC.METAANALYSIS1.txt",h=T,stringsAsFactors=F)
meta0=read.table("results/sunlight/vitD_BMIcov_fastGWA_sunlight_meta1.TBL",h=T,stringsAsFactors=F)

# ----- Format meta-analysed results
  # Get allele frequencies from the UKB
  meta=merge(meta0,ukb[,c("CHR","SNP","POS","AF1")], by.x="MarkerName", by.y="SNP")

  # Calculate beta and SE
  z=meta$Zscore
  p=meta$AF1
  n=meta$Weight
  meta$b= z / sqrt(2*p*(1-p)*(n+z^2))
  meta$se=1 / sqrt(2*p*(1-p)*(n+z^2))

  # Define which SNPs were meta-analysed
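  # METAL's Direction column has one character per input file, in processing order
  # (1st = UKB, 2nd = SUNLIGHT); "?" means the SNP was missing from that file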
  meta[substr(meta$Direction,1,1)=="?","Analysed"]="sunlight"
  meta[substr(meta$Direction,2,2)=="?","Analysed"]="ukb"
  meta[!meta$Analysed %in% c("sunlight","ukb"),"Analysed"]="both"

  # Make file with same columns as PLINK output (for clumping)
  meta$TEST="ADD"
  df=meta[,c("CHR","MarkerName","POS","Allele1","TEST","Weight","b","Zscore","P.value")]
  names(df)=c("CHR","SNP","BP","A1","TEST","NMISS","BETA","STAT","P")

  # Make file in .ma format (for COJO)
  df2=meta[,c("MarkerName","Allele1","Allele2","AF1","b","se","P.value","Weight","Analysed")]
  names(df2)=c("SNP","A1","A2","freq","b","se","p","n","Analysed")


# ----- Save formatted meta-analysed results
  ukb_prefix="vitD_BMIcov_fastGWA"
  
  # UKB SNPs 
  ukb_snps=meta[meta$Analysed %in% c("both","ukb"),"MarkerName"]
  write.table(df[df$SNP %in% ukb_snps,], file=gzfile(paste0("results/sunlight/",ukb_prefix,"_sunlight_meta.ukbSNPs.linear.gz")), quote=F, row.names=F)
  write.table(df2[df2$SNP %in% ukb_snps,], paste0("results/sunlight/",ukb_prefix,"_sunlight_meta.ukbSNPs.ma"), quote=F, row.names=F)

  # Overlapping SNPs
  overlapping_snps=meta[meta$Analysed=="both","MarkerName"]
  write.table(df[df$SNP %in% overlapping_snps,], file=gzfile(paste0("results/sunlight/",ukb_prefix,"_sunlight_meta.overlappingSNPs.linear.gz")), quote=F, row.names=F)
  write.table(df2[df2$SNP %in% overlapping_snps,], paste0("results/sunlight/",ukb_prefix,"_sunlight_meta.overlappingSNPs.ma"), quote=F, row.names=F)

  
# ----- Format UKB results to re-run clumping and COJO on subset of overlapping SNPs
  # PLINK output format (for clumping)
  ukb$TEST="ADD"; ukb$STAT="NA"
  df=ukb[,c("CHR","SNP","POS","A1","TEST","N","BETA","STAT","P")]
  names(df)=c("CHR","SNP","BP","A1","TEST","NMISS","BETA","STAT","P")
  
  # COJO output format (for COJO)
  df2=ukb[,c("SNP","A1","A2","AF1","BETA","SE","P","N")]
  names(df2)=c("SNP","A1","A2","freq","b","se","p","n")
  
# ----- Save restricted UKB results
  ukb_prefix="vitD_BMIcov_fastGWA"
  write.table(df[df$SNP %in% overlapping_snps,], file=gzfile(paste0("results/sunlight/",ukb_prefix,".overlappingSNPs.linear.gz")), quote=F, row.names=F)
  write.table(df2[df2$SNP %in% overlapping_snps,], paste0("results/sunlight/",ukb_prefix,".overlappingSNPs.ma"), quote=F, row.names=F)
  
  
  
  
# ************************************
# Check meta-analysis results ----
# ************************************
sun=(sun[,c(1:6,12:14)])
names(sun)=c("SNP","A1","A2","BETA","SE","P","POS","CHR","N")

a=merge(ukb,sun,by="SNP",suffixes=c("1","2"))
a$A12=toupper(a$A12)
a$A22=toupper(a$A22)
names(a)=tolower(names(a))


# ---- Check meta-analysis results
snp="rs1000000"

# Define statistics for this SNP
tmp=a[a$snp==snp,]
n1=tmp$n1; n2=tmp$n2
p1=tmp$p1; p2=tmp$p2
b1=tmp$beta1; b2=tmp$beta2
f=tmp$af1

# Calculate intermediate statistics
z1=qt(p1/2, n1-1, lower=F)*sign(b1)
z2=qt(p2/2, n2-1, lower=F)*sign(b2)
w1=sqrt(n1)
w2=sqrt(n2)

# Calculate meta-analysis statistics
  ## Z-score
  zm=sum(z1*w1, z2*w2) / sqrt(n1+n2); zm
  meta[meta$MarkerName==snp,"Zscore"]

  ## Beta and SE
  bm=zm / sqrt(2*f*(1-f)*(n1+n2+zm^2)); bm
  sem=1 / sqrt(2*f*(1-f)*(n1+n2+zm^2)); sem
  
tmp



Considerations taken

In our analyses, we used a rank-based inverse-normal transformation (RINT) to generate a normally-distributed phenotype. The plot below contrasts that phenotype to the phenotype obtained with log transformation:

phenos=read.table("tmpDir/mlm_norm_log_pheno", h=T, stringsAsFactors=F)

# Association between the two possible transformations
test=lm(phenos$log_vitD ~ phenos$norm_vitD)
beta=round(summary(test)$coefficients[2,1],4)
p=summary(test)$coefficients[2,4]
r2=round(cor(phenos$log_vitD, phenos$norm_vitD),2)


# Plot
plot(phenos$norm_vitD, phenos$log_vitD, col="lightcoral", 
     xlab="Vitamin D (RINT)", ylab="Vitamin D (log)", main="Comparison of phenotype scales")
abline(test)
#text(min(phenos$norm_vitD), max(phenos$log_vitD), adj=c(0,1), label=paste0("b=",beta))
text(min(phenos$norm_vitD), max(phenos$log_vitD), adj=c(0,1), label=bquote(r^2 == .(r2)))
text(min(phenos$norm_vitD), max(phenos$log_vitD)-sd(phenos$log_vitD), adj=c(0,1), label=paste0("p=",p))

# Read-in GWAS results
#gwas=read.table("results/fastGWA/vitD_BMIcov_fastGWA.gz", h=T, stringsAsFactors=F)
#saveRDS(gwas, file="results/fastGWA/vitD_BMIcov_fastGWA.RDS")
ukb=readRDS("results/fastGWA/vitD_BMIcov_fastGWA.RDS")
names(ukb)=c("CHR","SNP","POS","A1","A2","N","AF1","beta","se","p")
sun=read.table("results/sunlight/25HydroxyVitaminD_QC.METAANALYSIS1.txt",h=T,stringsAsFactors=F)
sun=(sun[,c(1:6,12:14)])
names(sun)=c("SNP","A1","A2","beta","se","p","POS","CHR","N")
sun$A1=toupper(sun$A1)
sun$A2=toupper(sun$A2)

# Merge results for top-associated SNPs
df=ukb[ukb$p<5e-8,]
df=merge(df,sun, by="SNP", suffixes=c("_ukb","_sun"))

# Align
  ## Keep effect for SNPs with matching alleles
  tmp=which(df$A1_ukb==df$A1_sun & df$A2_ukb==df$A2_sun)
  df[tmp,"b_sun"]=df[tmp,"beta_sun"]
  
  ## Invert effect for switched alleles
  tmp=which(df$A1_ukb==df$A2_sun & df$A2_ukb==df$A1_sun)
  df[tmp,"b_sun"]=df[tmp,"beta_sun"]*-1
  df[tmp,"A1_sun"]=df[tmp,"A1_ukb"]
  df[tmp,"A2_sun"]=df[tmp,"A2_ukb"]

  nrow(df[is.na(df$b_sun),])
  # Only 16 SNPs left to align (reciprocal alleles)
  
# Save the ones we aligned and plot
df=df[!is.na(df$b_sun),]
df$beta_sun=df$b_sun
df$b_sun=NULL
write.table(df, file="tmpDir/fastGWA_sunlight_alignedBetas", quote=F, row.names=F)
saveRDS(df, file="tmpDir/fastGWA_sunlight_alignedBetas.RDS")

The plot below depicts the effect sizes of variants with genome-wide significant association in the UKB GWAS (N=5801). The effect sizes estimated in both GWAS are correlated.

df=readRDS("tmpDir/fastGWA_sunlight_alignedBetas.RDS")

# Association between effect sizes
test=lm(df$beta_sun~df$beta_ukb)
beta=round(summary(test)$coefficients[2,1],4)
p=summary(test)$coefficients[2,4]
r2=round(cor(df$beta_sun,df$beta_ukb)^2,2)

# Plot effect sizes
plot(df$beta_ukb, df$beta_sun, pch=20, cex=0.8, cex.axis=1.1, cex.lab=1.2, col="lightcoral",
     xlab=expression(UK~Biobank), ylab=expression(SUNLIGHT),
     main="Effect size estimates (and SE) for SNPs with\n genome-wide significant association in the UKB")
  # Add SE
  nsnps = nrow(df)
  for( i in 1:nsnps ) {
    # x axis
    xstart = df[i,"beta_ukb"] - df[i,"se_ukb"]; xend = df[i,"beta_ukb"] + df[i,"se_ukb"]
    ystart = df[i,"beta_sun"]; yend = df[i,"beta_sun"]
    segments(xstart, ystart, xend, yend, lwd=1.5, col="lightcoral")
    # y axis
    xstart = df[i,"beta_ukb"]; xend = df[i,"beta_ukb"]
    ystart = df[i,"beta_sun"] - df[i,"se_sun"]; yend = df[i,"beta_sun"] + df[i,"se_sun"]
    segments(xstart, ystart, xend, yend, lwd=1.5, col="lightcoral")
  }
  # Add regression line and statistics
  abline(test)
  text(min(df$beta_ukb), max(df$beta_sun), adj=c(0,1), label=bquote(r^2 == .(r2)))
  text(min(df$beta_ukb), max(df$beta_sun)-sd(df$beta_sun), adj=c(0,1), label=paste0("p=",p))



Clumping and COJO

After restricting the meta-analysed results to overlapping SNPs or UKB SNPs, we conducted the following analyses:

  • Clumping to identify independent loci
  • COJO to identify independent variants

Clumping

We used stringent thresholds to clump the meta-analysed results:

  • –clump-p1 = 5x10-8 - max P-value for index variants to start a clump
  • –clump-p2 = 0.01 (default) - max P-value for variants in a clump to be listed in column SP2
  • –clump-r2 = 0.01 - minimum r2 with index variant for a site to be included in that clump
  • –clump-kb = 5000 - maximum distance from index SNP for a site to be included in that clump

Clumping was performed with PLINK 1.9, using the v3EUR_impQC samples as the LD reference.


# #########################################
# Clump meta-analysis results
# #########################################

# Directories


results=$WD/results/sunlight
tmpDir=$results/tmpDir

for subset in ukbSNPs overlappingSNPs
do 
    # Files/settings to use
    prefix=vitD_BMIcov_fastGWA_sunlight_meta.$subset
    ref=$medici/v3EUR_impQC/ukbEUR_imp_chr{TASK_ID}_v3_impQC
    inFile=$results/$prefix.linear.gz
    outFile=$tmpDir/${prefix}_chr{TASK_ID}
    snps2exclude=$WD/input/UKB_v3EUR_impQC/UKB_EUR_impQC_duplicated_rsIDs

    # Run clumping
    cd $results
    plink=$bin/plink/plink-1.9/plink
    tmp_command="$plink --bfile $ref \
                    --exclude $snps2exclude \
                    --clump $inFile \
                    --clump-p1 5e-8 \
                    --clump-r2 0.01 \
                    --clump-kb 5000 \
                    --clump-field "P" \
                    --out $outFile"          
    qsubshcom "$tmp_command" 1 30G $prefix.clump 24:00:00 "-array=1-22"
done

# Merge clumped results once all chr finished running
for subset in ukbSNPs overlappingSNPs
do
    prefix=vitD_BMIcov_fastGWA_sunlight_meta.$subset
    cat $tmpDir/${prefix}_chr*clumped | grep "CHR" | head -1 > $results/$prefix.clumped
    cat $tmpDir/${prefix}_chr*clumped | grep -v "CHR" |sed '/^$/d'  >> $results/$prefix.clumped
done

# Copy results to local computer to summarise and plot (run command in local computer)
scp $deltaID:$WD/results/sunlight/vitD_BMIcov_fastGWA_sunlight_meta.*clumped $local/results/sunlight/



# #########################################
# Clump restricted UKB results 
# #########################################

# Directories


results=$WD/results/sunlight
tmpDir=$results/tmpDir

# Files/settings to use
prefix=vitD_BMIcov_fastGWA.overlappingSNPs
ref=$medici/v3EUR_impQC/ukbEUR_imp_chr{TASK_ID}_v3_impQC
inFile=$results/$prefix.linear.gz
outFile=$tmpDir/${prefix}_chr{TASK_ID}
snps2exclude=$WD/input/UKB_v3EUR_impQC/UKB_EUR_impQC_duplicated_rsIDs

# Run clumping
cd $results
plink=$bin/plink/plink-1.9/plink
tmp_command="$plink --bfile $ref \
                    --exclude $snps2exclude \
                    --clump $inFile \
                    --clump-p1 5e-8 \
                    --clump-r2 0.01 \
                    --clump-kb 5000 \
                    --clump-field "P" \
                    --out $outFile"          
qsubshcom "$tmp_command" 1 30G $prefix.clump 24:00:00 "-array=1-22"


# Merge clumped results once all chr finished running
cat $tmpDir/${prefix}_chr*clumped | grep "CHR" | head -1 > $results/$prefix.clumped
cat $tmpDir/${prefix}_chr*clumped | grep -v "CHR" |sed '/^$/d'  >> $results/$prefix.clumped

# Copy results to local computer to summarise and plot (run command in local computer)
scp $deltaID:$WD/results/sunlight/vitD_BMIcov_fastGWA.overlappingSNPs.clumped $local/results/sunlight/

COJO

We ran COJO on the meta-analysed results with the same settings used on the UKB GWAS:

  • –bfile: We used a random subset of 20,000 unrelated individuals of European ancestry from the UKB as LD reference. The subset was generated based on the files in Q0286/UKBiobank/v3EURu_impQC/, which include 1365446 autosomal SNPs and are restricted to unrelated individuals of European ancestry.
  • –cojo-slct: Performs a step-wise model selection of independently associated SNPs
  • –cojo-p 5e-8 (default): P-value to declare a genome-wide significant hit
  • –cojo-wind 10,000 (default): Distance (in kb) beyond which SNPs are assumed to be in linkage equilibrium
  • –cojo-collinear 0.9 (default): maximum r2 to be considered an independent SNP, i.e. SNPs with r2>0.9 will not be considered further
  • –diff-freq 0.2 (default): maximum allele frequency difference allowed between the GWAS summary data and the LD reference sample
  • –extract: Restrict analysis to SNPs with MAF>0.01
# #########################################
# Run COJO on meta-analysed results
# #########################################

# Directories


tmpDir=$WD/tmpDir
results=$WD/results/sunlight

# Files/settings to use
random20k_dir=$WD/input/UKB_v3EURu_impQC/random20K
ldRef=$random20k_dir/ukbEURu_imp_chr{TASK_ID}_v3_impQC_random20k_maf0.01
filters=maf0.01
snps2include=$WD/input/UKB_v3EUR_impQC/UKB_EUR_impQC_$filters.snplist
snps2exclude=$WD/input/UKB_v3EUR_impQC/UKB_EUR_impQC_duplicated_rsIDs

for subset in ukbSNPs overlappingSNPs
do 

    prefix=vitD_BMIcov_fastGWA_sunlight_meta.$subset
    inFile=$results/$prefix.ma
    outFile=$tmpDir/${prefix}_${filters}_chr{TASK_ID}

    # Run COJO
    cd $results
    gcta=$bin/gcta/gcta_1.92.3beta3
    tmp_command="$gcta --bfile $ldRef \
                   --cojo-slct \
                   --cojo-file $inFile \
                   --extract $snps2include \
                   --exclude $snps2exclude \
                   --out $outFile"
                             
    qsubshcom "$tmp_command" 1 100G ${prefix}_COJO 24:00:00 "-array=1-22"
done

# Merge COJO results once all chr finished running
for subset in ukbSNPs overlappingSNPs
do 
    prefix=vitD_BMIcov_fastGWA_sunlight_meta.$subset
    cat $tmpDir/${prefix}_${filters}*.jma.cojo | grep "Chr" | uniq > $results/${prefix}_${filters}.cojo
    cat $tmpDir/${prefix}_${filters}_chr*.jma.cojo | grep -v "Chr" |sed '/^$/d'  >> $results/${prefix}_${filters}.cojo
done 

# Copy results to local computer to summarise and plot (run command in local computer)
scp $deltaID:$WD/results/sunlight/vitD_BMIcov_fastGWA_sunlight_meta*.cojo $local/results/sunlight/


# #########################################
# Run COJO restricted UKB results
# #########################################

# Directories


results=$WD/results/sunlight
tmpDir=$results/tmpDir

# Files/settings to use
prefix=vitD_BMIcov_fastGWA.overlappingSNPs
random20k_dir=$WD/input/UKB_v3EURu_impQC/random20K
ldRef=$random20k_dir/ukbEURu_imp_chr{TASK_ID}_v3_impQC_random20k_maf0.01
filters=maf0.01
inFile=$results/$prefix.ma
snps2include=$WD/input/UKB_v3EUR_impQC/UKB_EUR_impQC_$filters.snplist
snps2exclude=$WD/input/UKB_v3EUR_impQC/UKB_EUR_impQC_duplicated_rsIDs
outFile=$tmpDir/${prefix}_${filters}_chr{TASK_ID}

# Run COJO
cd $results
gcta=$bin/gcta/gcta_1.92.3beta3
tmp_command="$gcta --bfile $ldRef \
                   --cojo-slct \
                   --cojo-file $inFile \
                   --cojo-wind 500 \
                   --extract $snps2include \
                   --exclude $snps2exclude \
                   --out $outFile"
                             
qsubshcom "$tmp_command" 1 100G ${prefix}_COJO 24:00:00 "-array=1-22"


# Merge COJO results once all chr finished running
cat $tmpDir/${prefix}_${filters}*.jma.cojo | grep "Chr" | uniq > $results/${prefix}_${filters}.cojo
cat $tmpDir/${prefix}_${filters}_chr*.jma.cojo | grep -v "Chr" |sed '/^$/d'  >> $results/${prefix}_${filters}.cojo

# Copy results to local computer to summarise and plot (run command in local computer)
scp $deltaID:$WD/results/sunlight/vitD_BMIcov_fastGWA.overlappingSNPs_maf0.01.cojo $local/results/sunlight/


# #########################################
# Run COJO on SUNLIGHT results
# Running this for the sample size comparisons
# #########################################

# ----- Convert SUNLIGHT sumstats to .ma format
## Run this step in R

  qsub -I -A UQ-IMB-CNSG -X -l nodes=1:ppn=5,mem=5GB,walltime=6:00:00
  R
  setwd("$WD")
  
  # Load sumstats
  sun=read.table("results/sunlight/25HydroxyVitaminD_QC.METAANALYSIS1.txt", h=T, stringsAsFactors=F)
  sun=sun[,c("MarkerName","Allele1","Allele2","Effect","StdErr","P.value","Pos","Chr","SAMPLESIZE")]
  
  # Add allele frequencies
  ukb=read.table("results/fastGWA/vitD_fastGWA_withX.gz", h=T, stringsAsFactors=F)
  df=merge(sun,ukb[,c("CHR","SNP","POS","AF1","A1","A2")], by.x="MarkerName", by.y="SNP")
  df$Allele1=toupper(df$Allele1)
  df$Allele2=toupper(df$Allele2)
  df[df$Allele1==df$A1 & df$Allele2==df$A2,"freq"]=df[df$Allele1==df$A1 & df$Allele2==df$A2,"AF1"]
  df[df$Allele1==df$A2 & df$Allele2==df$A1,"freq"]=1-df[df$Allele1==df$A2 & df$Allele2==df$A1,"AF1"]
  df=df[!is.na(df$freq),c("MarkerName","Allele1","Allele2","freq","Effect","StdErr","P.value","SAMPLESIZE")]
  names(df)=c("SNP","A1","A2","freq","b","se","p","n")
  
  # Save
  write.table(df, "results/sunlight/sunlight.ma", quote=F, row.names=F)
  


# ----- Run COJO
# Directories

results=$WD/results/sunlight
tmpDir=$results/tmpDir

# Files/settings to use
prefix=sunlight
random20k_dir=$WD/input/UKB_v3EURu_impQC/random20K
ldRef=$random20k_dir/ukbEURu_imp_chr{TASK_ID}_v3_impQC_random20k_maf0.01
filters=maf0.01
inFile=$results/$prefix.ma
snps2include=$WD/input/UKB_v3EUR_impQC/UKB_EUR_impQC_$filters.snplist
snps2exclude=$WD/input/UKB_v3EUR_impQC/UKB_EUR_impQC_duplicated_rsIDs
outFile=$tmpDir/${prefix}_${filters}_chr{TASK_ID}

 

# Run COJO
cd $results
gcta=$bin/gcta/gcta_1.92.3beta3
tmp_command="$gcta --bfile $ldRef \
                   --cojo-slct \
                   --cojo-file $inFile \
                   --extract $snps2include \
                   --exclude $snps2exclude \
                   --out $outFile"
                             
qsubshcom "$tmp_command" 1 100G COJO.$prefix 24:00:00 "-array=1-22"


# Merge COJO results once all chr finished running
cat $tmpDir/${prefix}_${filters}*.jma.cojo | grep "Chr" | uniq > $results/${prefix}_${filters}.cojo
cat $tmpDir/${prefix}_${filters}_chr*.jma.cojo | grep -v "Chr" |sed '/^$/d'  >> $results/${prefix}_${filters}.cojo

# Copy results to local computer to summarise and plot (run command in local computer)
scp $deltaID:$WD/results/sunlight/sunlight_maf0.01.cojo $local/results/sunlight/



Influence of sample size

The SUNLIGHT consortium GWAS summary statistics were generated from a sample of 79,366 individuals. To better compare these results with those obtained in the UKB, we selected a random sample of 80,000 UKB individuals and conducted the GWAS as before (see the GWAS section). We repeated this five times and present the results below.

In addition, we generated random samples of increasing sample sizes to assess the influence of the sample size on the GWAS results.

# ########################################
# Select random samples
# ########################################

## !!! Run this step in R

# Generate 5 sets of 80K
  vitD_ids=read.table("input/mlm_pheno", h=T, stringsAsFactors=F)
  n=80000
  n_samples=5
  for(i in 1:n_samples)
  {
    subset=sample(vitD_ids$IID, size=n)
    tmp=data.frame(FID=subset, IID=subset)
    write.table(tmp, paste0("results/fastGWA/sampleSize_stratifiedGWAS/data_subsets/ukbSubset_",n/1000,"K_",i), quote=F, row.names=F)
  }
  
# Generate sets of different sample sizes
  n=seq(from=50000, to=400000, by=50000)
  for(i in n)
  {
    subset=sample(vitD_ids$IID, size=i)
    tmp=data.frame(FID=subset, IID=subset)
    write.table(tmp, paste0("results/fastGWA/sampleSize_stratifiedGWAS/data_subsets/ukbSubset_",i/1000,"K"), quote=F, row.names=F)
  }
  
# Transfer to Delta to run GWAS
scp $local/results/fastGWA/sampleSize_stratifiedGWAS/data_subsets/ukbSubset* $deltaID:$WD/results/fastGWA/sampleSize_stratifiedGWAS/data_subsets/
  
# ########################################
# Run UKB GWAS on sub-samples
# ########################################

# Set directories

results=$WD/results/fastGWA/sampleSize_stratifiedGWAS
tmpDir=$results/tmpDir

# Files/settings to use
prefix=vitD_fastGWA
geno=$WD/results/fastGWA/tmpDir/UKB_v3EUR_impQC.files
pheno=$WD/input/mlm_pheno
grm=$WD/input/UKB_HM3_m01_sp
quantCovars=$WD/input/quantitativeCovariates
#quantCovars=$WD/input/quantitativeCovariates_bmi
catgCovars=$WD/input/categoricalCovariates
snps2include=$WD/input/UKB_v3EUR_impQC/UKB_EUR_impQC_maf0.01.snplist
gcta=$bin/gcta/gcta_1.92.3beta3

# Run GWAS
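  # One array task per subset file: each task reads its subset name from line TASK_ID
  # of $subsets2run, then runs fastGWA restricted to those individuals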
  subsets2run=$tmpDir/subsets2run
  ls $results/data_subsets > $subsets2run
  nSubsets=`wc -l < $subsets2run`
  if test $nSubsets != 1; then arrays="-array=1-$nSubsets"; i="{TASK_ID}"; else arrays=""; i=1; fi 
  getSubset="subset=\$(awk -v i=$i 'NR==i {print \$1}' $subsets2run)"
  defineOutputName="output=$tmpDir/$prefix.\$subset"
  runGWAS="$gcta --mbfile $geno \
                 --fastGWA-lmm \
                 --pheno $pheno \
                 --grm-sparse $grm \
                 --qcovar $quantCovars\
                 --covar $catgCovars \
                 --covar-maxlevel 110 \
                 --keep $results/data_subsets/\$subset \
                 --extract $snps2include \
                 --threads 16 \
                 --out \$output"
  qsubshcom "$getSubset; $defineOutputName; $runGWAS" 16 10G $prefix.stratifiedGWAS 24:00:00 "$arrays"

# Compress results
for subset in `ls $results/data_subsets/`
do
  output=$tmpDir/$prefix.$subset
  gzip -c $output.fastGWA > $results/$prefix.$subset.gz
done

# Convert GWAS sumstats to .ma format (for COJO)
for subset in `ls $results/data_subsets/`
do
  output=$tmpDir/$prefix.$subset
  echo "SNP A1 A2 freq b se p n" > $results/maFormat/$prefix.$subset.ma
  awk '{print $2,$4,$5,$7,$8,$9,$10,$6}' $output.fastGWA | sed 1d >> $results/maFormat/$prefix.$subset.ma
done


# Copy GWAS results to local computer 
scp $deltaID:$WD/results/fastGWA/sampleSize_stratifiedGWAS/vitD_fastGWA.ukbSubset_*.gz $local/results/fastGWA/sampleSize_stratifiedGWAS/ 


# ########################################
# Run COJO on UKB GWAS from sub-samples
# ########################################
# Directories

results=$WD/results/fastGWA/sampleSize_stratifiedGWAS
tmpDir=$results/tmpDir
random20k_dir=$WD/input/UKB_v3EURu_impQC/random20K

# Files/setting to use
gcta=$bin/gcta/gcta_1.92.3beta3
ldRef=$random20k_dir/ukbEURu_imp_chr{TASK_ID}_v3_impQC_random20k_maf0.01
filters=maf0.01
snps2include=$WD/input/UKB_v3EUR_impQC/UKB_EUR_impQC_$filters.snplist


for subset in `ls $results/data_subsets/`
do
  prefix=vitD_fastGWA.$subset
  inFile=$results/maFormat/$prefix.ma
  outFile=$tmpDir/${prefix}_${filters}_chr{TASK_ID}

  # Run COJO
  cd $results
  tmp_command="$gcta --bfile $ldRef \
                     --cojo-slct \
                     --cojo-file $inFile \
                     --extract $snps2include \
                     --out $outFile"
  qsubshcom "$tmp_command" 1 100G COJO.$prefix 24:00:00 "-array=1-22"
done

# Merge COJO results
for subset in `ls $results/data_subsets/`
do
  prefix=vitD_fastGWA.$subset
  cat $tmpDir/${prefix}_${filters}*.jma.cojo | grep "Chr" | uniq > $results/${prefix}_${filters}.cojo
  cat $tmpDir/${prefix}_${filters}_chr*.jma.cojo | grep -v "Chr" |sed '/^$/d'  >>     $results/${prefix}_${filters}.cojo
done

# Copy COJO results to local computer
scp $deltaID:$WD/results/fastGWA/sampleSize_stratifiedGWAS/vitD_fastGWA.ukbSubset_*.cojo $local/results/fastGWA/sampleSize_stratifiedGWAS/ 
cojoList=list.files("results/fastGWA/sampleSize_stratifiedGWAS/", pattern="vitD_fastGWA.ukbSubset.*cojo")

# Summary of COJO results
df=c()
for(i in cojoList)
{
  print(paste(grep(i,cojoList), "of", length(cojoList), "-", i))
  # Load COJO results
  cojo=read.table(paste0("results/fastGWA/sampleSize_stratifiedGWAS/",i))
  
  # Summarise
  tmp=data.frame(file=i, M=nrow(cojo))
  
  # Define sample size
  sampleSizeString=grep("K",sapply(strsplit(i, "_"), "["), value=T)
  sampleSize=sub("K","",sampleSizeString)
  tmp$N=as.numeric(sampleSize)*1000
  
  # Merge with summary from other runs
  df=rbind(df,tmp)
}
## [1] "1 of 13 - vitD_fastGWA.ukbSubset_100K_maf0.01.cojo"
## [1] "2 of 13 - vitD_fastGWA.ukbSubset_150K_maf0.01.cojo"
## [1] "3 of 13 - vitD_fastGWA.ukbSubset_200K_maf0.01.cojo"
## [1] "4 of 13 - vitD_fastGWA.ukbSubset_250K_maf0.01.cojo"
## [1] "5 of 13 - vitD_fastGWA.ukbSubset_300K_maf0.01.cojo"
## [1] "6 of 13 - vitD_fastGWA.ukbSubset_350K_maf0.01.cojo"
## [1] "7 of 13 - vitD_fastGWA.ukbSubset_400K_maf0.01.cojo"
## [1] "8 of 13 - vitD_fastGWA.ukbSubset_50K_maf0.01.cojo"
## [1] "9 of 13 - vitD_fastGWA.ukbSubset_80K_1_maf0.01.cojo"
## [1] "10 of 13 - vitD_fastGWA.ukbSubset_80K_2_maf0.01.cojo"
## [1] "11 of 13 - vitD_fastGWA.ukbSubset_80K_3_maf0.01.cojo"
## [1] "12 of 13 - vitD_fastGWA.ukbSubset_80K_4_maf0.01.cojo"
## [1] "13 of 13 - vitD_fastGWA.ukbSubset_80K_5_maf0.01.cojo"
# Plot sub-samples of 80K + SUNLIGHT
  tmp=df[df$N==80000,]
  tmp$study="UK Biobank"
  tmp$replicate=paste0("sample",1:nrow(tmp))
  ## Add SUNLIGHT
    sunlight=read.table("results/sunlight/sunlight_maf0.01.cojo", h=T, stringsAsFactors=F)
    sun=data.frame(file=NA, M=nrow(sunlight), N=NA, study="SUNLIGHT", replicate="")
    tmp=rbind(tmp, sun)
    tmp$replicate=factor(tmp$replicate, levels=c("", paste0("sample",1:5)))

if(saveplots){tiff("manuscript/Figures/figureS3a_80Ksamples.tiff", width=6, height=5 ,units="in", res=800,compression = "lzw")}
  ggplot(tmp, aes(x=replicate, y=M, fill=study)) +
    geom_bar(alpha=0.7, stat="identity") +
    theme(legend.position = "none",
          text = element_text(size=15)) +
    labs(x="", 
         y="Number of independent associations")#,

         #title="Number of independent associations in UK Biobank sub-samples of 80K individuals")
if(saveplots){dev.off()}
  

# Plot sub-samples of increasing sample size
  samples=seq(from=50000, to=400000, by=50000)
  tmp=df[df$N %in% samples,]
  # Fit regression
  test=summary(lm(tmp$M~tmp$N))$coefficients
  b=test[2,1]
  b= gsub("e","*x10^",formatC(b, format = "e", digits = 2))
  plot_label= paste("beta ==",b)
  
if(saveplots){tiff("manuscript/Figures/figureS3a_sampleSize.tiff", width=5, height=5 ,units="in", res=800,compression = "lzw")}
  ggplot(tmp, aes(x=N, y=M)) +
    geom_smooth(method='lm', formula= y~x, col="darkgrey", fill="white", size=.5) +
    geom_point(col="lightcoral", size=3) +
    geom_line(col="lightcoral", size=1) +
    scale_x_continuous(breaks=samples, labels=paste0(samples/1000,"K")) +
    theme(text = element_text(size=15)) +
    labs(x="Sample size", 
         #title="Number of independent associations identified with COJO as a function of sample size",
         y="Number of independent associations") +
    annotate("text", x=51000, y=130, label=plot_label, parse=T, hjust=0, vjust=.2, size=5)

if(saveplots){dev.off()}

Comparison of UKB and meta-analysis results

Based on the results presented below, we decided to use only UKB results for downstream analyses.

Using overlapping SNPs

# Load results
  ## UKB
  #ukb=read.table("results/sunlight/vitD_BMIcov_fastGWA.overlappingSNPs.ma", h=T, stringsAsFactors=F)
  #saveRDS(ukb, "results/sunlight/vitD_BMIcov_fastGWA.overlappingSNPs.RDS")
  ukb=readRDS("results/sunlight/vitD_BMIcov_fastGWA.overlappingSNPs.RDS")
  ukbCOJO=read.table("results/sunlight/vitD_BMIcov_fastGWA.overlappingSNPs_maf0.01.cojo", h=T, stringsAsFactors=F)
  ukbClump=read.table("results/sunlight/vitD_BMIcov_fastGWA.overlappingSNPs.clumped", h=T, stringsAsFactors=F)

  ## Meta-analysis
  #met=read.table("results/sunlight/vitD_BMIcov_fastGWA_sunlight_meta.overlappingSNPs.ma", h=T, stringsAsFactors=F)
  #saveRDS(met, "results/sunlight/vitD_BMIcov_fastGWA_sunlight_meta.overlappingSNPs.ma.RDS")
  met=readRDS("results/sunlight/vitD_BMIcov_fastGWA_sunlight_meta.overlappingSNPs.ma.RDS")
  metClump=read.table("results/sunlight/vitD_BMIcov_fastGWA_sunlight_meta.overlappingSNPs.clumped", h=T, stringsAsFactors=F)
  metCOJO=read.table("results/sunlight/vitD_BMIcov_fastGWA_sunlight_meta.overlappingSNPs_maf0.01.cojo", h=T, stringsAsFactors=F)
  
# ************************************
# Compare results for overlapping SNPs ----
# ************************************

# SNPs with P<5x10-8
ukb_sigSNPs = ukb[ukb$p<5e-8,"SNP"]
met_sigSNPs = met[met$p<5e-8,"SNP"]

We compared the UKB and meta-analysis results using overlapping SNPs, i.e. SNPs that were tested in both the UKB and SUNLIGHT (M=2442731 SNPs). A minimal sketch of how the overlap counts below can be tallied follows the lists.

Number of genome-wide significant (GWS; P<5x10-8) variants

  • UKB: 5816
    • 506 not GWS in meta-analysis
  • Meta-analysis: 6113
    • 5310 also GWS in UKB
    • 803 not GWS in UKB

Number of independent loci (clumped)

  • UKB: 140
  • Meta-analysis: 161

Number of independent variants (COJO)

  • UKB: 181
  • Meta-analysis: 165
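
The counts above can be tallied from the objects loaded in the preceding chunk; a minimal sketch (whether the reported numbers were derived exactly this way is an assumption):

# Tally GWS overlap between the two sets of results
length(ukb_sigSNPs)                          # GWS variants in UKB
length(setdiff(ukb_sigSNPs, met_sigSNPs))    # GWS in UKB but not in the meta-analysis
length(met_sigSNPs)                          # GWS variants in the meta-analysis
length(intersect(met_sigSNPs, ukb_sigSNPs))  # GWS in both
length(setdiff(met_sigSNPs, ukb_sigSNPs))    # GWS in the meta-analysis but not in UKB

# Independent loci (clumping) and independent variants (COJO)
nrow(ukbClump); nrow(metClump)
nrow(ukbCOJO); nrow(metCOJO)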

Using UKB SNPs

# ##################################
# Calculate the LD between COJO SNPs identified with UKB or meta-analysis
# ##################################

# Directories
results=$WD/results/sunlight
tmpDir=$results/tmpDir

# Files
ukb_cojoSNPs=$WD/results/fastGWA/vitD_BMIcov_fastGWA_maf0.01.cojo
meta_cojoSNPs=$results/vitD_BMIcov_fastGWA_sunlight_meta.ukbSNPs_maf0.01.cojo
ld_ref=$medici/v3EUR_impQC/ukbEUR_imp_chr{TASK_ID}_v3_impQC

# Make list of all UKB COJO SNPs + meta-analysis COJO SNPs
sort -u <(awk '{print $2}' $ukb_cojoSNPs | sed 1d)  <(awk '{print $2}' $meta_cojoSNPs | sed 1d) > $tmpDir/ukb_meta_cojoSNPs

# Calculate LD between each SNP pair
cd $results
plink=$bin/plink/plink-1.9/plink
tmp_command="$plink --bfile $ld_ref \
                    --extract $tmpDir/ukb_meta_cojoSNPs \
                    --r2 inter-chr \
                    --ld-window-r2 0 \
                    --out $tmpDir/ukb_meta_cojoSNPs_chr{TASK_ID}"
qsubshcom "$tmp_command" 1 10G vitD_ukb_meta_cojo_ld 1:00:00 "-array=1-22"

# Merge results from the different chromosomes
cat $tmpDir/ukb_meta_cojoSNPs_chr*ld | grep "CHR" | head -1 > $results/ukb_meta_cojoSNPs_ld_r2
cat $tmpDir/ukb_meta_cojoSNPs_chr*ld | grep -v "CHR" | sed '/^$/d'  >> $results/ukb_meta_cojoSNPs_ld_r2

# Copy results to local computer
scp $deltaID:$WD/results/sunlight/ukb_meta_cojoSNPs_ld_r2 $local/results/sunlight/
# Load results
  ## UKB
  #gwas=read.table("results/fastGWA/vitD_BMIcov_fastGWA_withX.gz", h=T, stringsAsFactors=F)
  #saveRDS(gwas, file="results/fastGWA/vitD_BMIcov_fastGWA_withX.RDS")
  ukb=readRDS("results/fastGWA/vitD_BMIcov_fastGWA_withX.RDS")
  ukbClump=read.table("results/fastGWA/vitD_BMIcov_fastGWA_withX.clumped", h=T, stringsAsFactors=F)
  ukbCOJO=read.table("results/fastGWA/vitD_BMIcov_fastGWA_maf0.01_withX.cojo", h=T, stringsAsFactors=F)
    
  ## Meta-analysis
  #met=read.table("results/sunlight/vitD_BMIcov_fastGWA_sunlight_meta.ukbSNPs.ma", h=T, stringsAsFactors=F)
  #saveRDS(met, "results/sunlight/vitD_BMIcov_fastGWA_sunlight_meta.ukbSNPs.ma.RDS")
  met=readRDS("results/sunlight/vitD_BMIcov_fastGWA_sunlight_meta.ukbSNPs.ma.RDS")
  metClump=read.table("results/sunlight/vitD_BMIcov_fastGWA_sunlight_meta.ukbSNPs.clumped", h=T, stringsAsFactors=F)
  metCOJO=read.table("results/sunlight/vitD_BMIcov_fastGWA_sunlight_meta.ukbSNPs_maf0.01.cojo", h=T, stringsAsFactors=F)

  cojoSNPs_ld=read.table("results/sunlight/ukb_meta_cojoSNPs_ld_r2", h=T, stringsAsFactors=F)
  
# ************************************
# Compare results for SNPs in UKB ----
# ************************************

# SNPs with P<5x10-8
ukb_sigSNPs = ukb[ukb$P<5e-8,"SNP"]
met_sigSNPs = met[met$p<5e-8,"SNP"]

# COJO SNPs
  # Loop through UKB COJO SNPs and check overlap with meta-analysis COJO SNPs
  df=data.frame(ukb_snp=ukbCOJO$SNP, chr=ukbCOJO$Chr, bp=ukbCOJO$bp,
                met_closest_SNP=NA, dist=NA, met_highLD_SNP=NA, ld_r2=NA, stringsAsFactors=F)
  for(i in 1:nrow(df))
  {
    snp=df[i,"ukb_snp"]
    chr=df[i,"chr"]
    bp=df[i,"bp"]
  
    # Return same SNP if it is in meta COJO
    if(snp %in% metCOJO$SNP)
    {
      df[i,c("met_closest_SNP","dist","met_highLD_SNP","ld_r2")]=c(snp, 0, snp, 1)
    }
  
    # Return closest SNP and SNP with highest LD if SNP is not in meta COJO
    if(!snp %in% metCOJO$SNP & chr %in% unique(metCOJO$Chr))
    {
      # Closest by distance
      tmp=metCOJO[metCOJO$Chr==chr,]
      tmp$dist=tmp$bp-bp
      closestSNP=which(abs(tmp$dist) == min(abs(tmp$dist)))
      df[i,c("met_closest_SNP","dist")]=c(tmp[closestSNP,"SNP"],
                                            tmp[closestSNP,"dist"])
  
      # Closest by LD
      tmp=cojoSNPs_ld[(cojoSNPs_ld$SNP_A==snp & cojoSNPs_ld$SNP_B %in% metCOJO$SNP) | 
                      (cojoSNPs_ld$SNP_B==snp & cojoSNPs_ld$SNP_A %in% metCOJO$SNP),]
      if(nrow(tmp)==0)
      {
        df[i,c("met_highLD_SNP","ld_r2")]=rep(NA,2)
      }
      if(nrow(tmp)>0)
      {
        highLDsnp=which(tmp$R2==max(tmp$R2))
        if(tmp[highLDsnp,"SNP_A"]==snp)
        {
          df[i,c("met_highLD_SNP","ld_r2")]=c(tmp[highLDsnp,"SNP_B"],
                                              tmp[highLDsnp,"R2"])
        }
        if(tmp[highLDsnp,"SNP_B"]==snp)
        {
          df[i,c("met_highLD_SNP","ld_r2")]=c(tmp[highLDsnp,"SNP_A"],
                                              tmp[highLDsnp,"R2"])
        }

      }
    }
  }
  df=df[!is.na(df$dist),]
  df[,c("dist","ld_r2")] = lapply(df[,c("dist","ld_r2")], as.numeric)
  cojo_ukbVSmet=df

  
  
  # Loop through meta-analysis COJO SNPs and check overlap with UKB COJO SNPs
  df=data.frame(meta_snp=metCOJO$SNP, chr=metCOJO$Chr, bp=metCOJO$bp,
                ukb_closest_SNP=NA, dist=NA, ukb_highLD_SNP=NA, ld_r2=NA, stringsAsFactors=F)
  for(i in 1:nrow(df))
  {
    snp=df[i,"meta_snp"]
    chr=df[i,"chr"]
    bp=df[i,"bp"]
  
    # Return same SNP if it is in UKB COJO
    if(snp %in% ukbCOJO$SNP)
    {
      df[i,c("ukb_closest_SNP","dist","ukb_highLD_SNP","ld_r2")]=c(snp, 0, snp, 1)
    }
  
    # Return closest SNP and SNP with highest LD if SNP is not in UKB COJO
    if(!snp %in% ukbCOJO$SNP)
    {
      # Closest by distance
      tmp=ukbCOJO[ukbCOJO$Chr==chr,]
      tmp$dist=tmp$bp-bp
      closestSNP=which(abs(tmp$dist) == min(abs(tmp$dist)))
      df[i,c("ukb_closest_SNP","dist")]=c(tmp[closestSNP,"SNP"],
                                         tmp[closestSNP,"dist"])
  
      # Closest by LD
      tmp=cojoSNPs_ld[(cojoSNPs_ld$SNP_A==snp & cojoSNPs_ld$SNP_B %in% ukbCOJO$SNP) | 
                      (cojoSNPs_ld$SNP_B==snp & cojoSNPs_ld$SNP_A %in% ukbCOJO$SNP),]
      if(nrow(tmp)==0)
      {
        df[i,c("ukb_highLD_SNP","ld_r2")]=rep(NA,2)
      }
      if(nrow(tmp)>0)
      {
        highLDsnp=which(tmp$R2==max(tmp$R2))
        if(tmp[highLDsnp,"SNP_A"]==snp)
        {
          df[i,c("ukb_highLD_SNP","ld_r2")]=c(tmp[highLDsnp,"SNP_B"],
                                              tmp[highLDsnp,"R2"])
        }
        if(tmp[highLDsnp,"SNP_B"]==snp)
        {
          df[i,c("ukb_highLD_SNP","ld_r2")]=c(tmp[highLDsnp,"SNP_A"],
                                              tmp[highLDsnp,"R2"])
        }
      }
      
    }
  }
  df[,c("dist","ld_r2")] = lapply(df[,c("dist","ld_r2")], as.numeric)
  cojo_metVSukb=df

We compared the UKB and meta-analysis results using UKB SNPs (M=8806780), i.e. all SNPs that were tested both in UKB and in SUNLIGHT, plus SNPs that were exclusively tested in the UKB (except those that could not be meta-analysed because the P-value was zero). A sketch tallying the COJO overlap breakdowns follows the lists below.

Number of genome-wide significant (GWS; P<5x10-8) variants

  • UKB: 20052
    • 1368 not GWS in meta-analysis
  • Meta-analysis: 19487
    • 18684 also GWS in UKB
    • 803 not GWS in UKB

Number of independent loci (clumped)

  • UKB: 199
  • Meta-analysis: 224

Number of independent variants (COJO)

  • UKB: 162
    • 111 not in meta-analysis COJO set
      • 103 within 1 Mbp of a meta-analysis COJO variant
      • 54 in high LD (r2>0.9) with a meta-analysis COJO variant
      • 21 in low LD (r2<0.1) with meta-analysis COJO variants
      • 20 in low LD (r2<0.05) with meta-analysis COJO variants
  • Meta-analysis: 221
    • 172 not in UKB COJO set
      • 162 within 1 Mbp of a UKB COJO variant
      • 2 in high LD (r2>0.9) with a UKB COJO variant
      • 127 in low LD (r2<0.1) with UKB COJO variants
      • 116 in low LD (r2<0.05) with UKB COJO variants
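
The distance and LD breakdowns above can be tallied from the cojo_ukbVSmet and cojo_metVSukb data frames built in the chunk preceding this comparison. A minimal sketch (the thresholds mirror the bullets above; whether the reported numbers were derived exactly this way is an assumption):

# UKB COJO SNPs that are not in the meta-analysis COJO set
ukb_only = cojo_ukbVSmet[!cojo_ukbVSmet$ukb_snp %in% metCOJO$SNP,]
nrow(ukb_only)
sum(abs(ukb_only$dist) < 1e6, na.rm=TRUE)    # within 1 Mbp of a meta-analysis COJO variant
sum(ukb_only$ld_r2 > 0.9, na.rm=TRUE)        # high LD (r2>0.9)
sum(ukb_only$ld_r2 < 0.1, na.rm=TRUE)        # low LD (r2<0.1)
sum(ukb_only$ld_r2 < 0.05, na.rm=TRUE)       # low LD (r2<0.05)

# Meta-analysis COJO SNPs that are not in the UKB COJO set
met_only = cojo_metVSukb[!cojo_metVSukb$meta_snp %in% ukbCOJO$SNP,]
nrow(met_only)
sum(abs(met_only$dist) < 1e6, na.rm=TRUE)
sum(met_only$ld_r2 > 0.9, na.rm=TRUE)
sum(met_only$ld_r2 < 0.1, na.rm=TRUE)
sum(met_only$ld_r2 < 0.05, na.rm=TRUE)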