We ran classical genome-wide association (GWA) analyses with three methods: (1) PLINK, (2) fastGWA, and (3) BOLT-LMM. These methods test the association between alleles and the mean of a phenotype and identify quantitative trait loci (QLT), i.e. variants that are associated with variability in the trait.

Below we describe our analyses and present the results obtained with each method.




Methods

# *********************************
# General information on GWAS run
# *********************************
# GWAS sample size
plinkPheno_N=length(count.fields("input/plink_pheno", skip=1))
mlmPheno_N=length(count.fields("input/mlm_pheno", skip=1))

# Number of variants available to test (same whether we select unrelated or all individuals)
all_availableSNPs_N=length(count.fields("input/UKB_v3EURu_impQC/v3EURu_impQC.bim"))
hm3_availableSNPs_N=length(count.fields("input/UKB_v3EURu_HM3/v3EURu_HM3.bim"))
x_availableSNPs_N=length(count.fields("input/UKB_v3EUR_impQC/ukbEURv3_X_maf0.1_FEMALE.bim"))+length(count.fields("input/UKB_v3EUR_impQC/ukbEURv3_XY_maf0.1_FEMALE.bim"))

# Number of variants available in dataset used as LD reference
referenceSNPs_N=length(count.fields("input/UKB_v3EUR_HM3/ukbEUR_imp_allchr_v3_HM3.bim", sep="\t"))

# Number of variants tested in PLINK GWAS
plink_commonSNPstested_N=length(count.fields("results/plink/vitD_plinkGWAS_maf0.01.gz"))
plink_commonANDrareSNPstested_N=length(count.fields("results/plink/vitD_plinkGWAS_maf0.0001.gz"))

# Number of variants tested in BOLT
bolt_testedSNPs_N=length(count.fields("results/bolt/vitD_boltGWAS.gz"))

GWAS

fastGWA

This analysis was performed with fastGWA using variants with MAF>0.01. Given our sample size (N=417580), this is equivalent to restricting to SNPs seen at least 8351.6 times.

Chromosome X and its pseudo-autosomal region were analysed separately in males and females and were subsequently meta-analysed with METAL, using the inverse-variance based method. Only variants with MAF>0.01 in females were analysed.

The following files were used to run the analysis:

  • Phenotype file: This file includes one line per individual and a column with the phenotype. The phenotype was obtained after (1) restricting to individuals of European ancestry, and (2) applying a rank-based inverse-normal transformation. In total 417580 individuals of European ancestry were analysed.
  • Genotype file: In this analysis, we used the files in Q0286/UKBiobank/v3EUR_impQC/. These are restricted to individuals of European ancestry and include 44741804 autossomal variants + 260713 variants in chromosome X (including the pseudo-autosomal region). Variants had (1) MAC > 5, (2) INFO > 0.3, and (3) genotype missingness > 0.05. In addition, these files exclude variants with missingness < 0.05, pHWE < 1x10-5, and MAC > 5 in the subset of unrelated European individuals.
  • Sparse GRM: For this analysis we used a sparse GRM kindly provided by Longda Jiang. The GRM was calculated using HapMap3 SNPs from individuals of European ancestry. As such, even if we use genotype files for the entire UK Biobank cohort, GCTA would extract European individuals automatically and run the analysis on that subset.
  • Covariates files: This file includes one column per covariate to fit in the model. Covariates included were: age, sex, genotyping batch, UKB assessment centre, assessment month, information on whether vitamin supplements were ever taken, and the first 40 PCs. For more detail see the covariates section of the Pheno page.

To run the analysis faster, associations were performed by chromosome and ran in parallel. qsubshcom was used to submit the parallel jobs (array).

# Directories
genoDir=$medici/v3EUR_impQC


# #########################################
# Prepare files for analysis
# #########################################

# ------ Get sparse GRM from Longda's directory
## Run from delta
scp $rooID:$longda/UKB_geno/UKB_sparse_GRM/UKB_HM3_m01_sp.grm* $WD/input


# ------ Get list of SNPs to include in the analysis
# Extract common (MAF>0.01) variants
geno=$medici/v3EUR_impQC/ukbEUR_imp_chr{TASK_ID}_v3_impQC
filters=maf0.01
genoDir=$WD/input/UKB_v3EUR_impQC
snpList=$genoDir/UKB_EUR_impQC_$filters
cd $genoDir
tmp_command="plink --bfile $geno \
                   --write-snplist \
                   --maf 0.01 \
                   --out ${snpList}_chr{TASK_ID}"
qsubshcom "$tmp_command" 1 10G commonVariants 3:00:00 "-array=1-22"
cat ${snpList}_chr*.snplist > $snpList.snplist
rm ${snpList}_chr*


# Extract common + rare (MAF>0.0001) variants
geno=$medici/v3EUR_impQC/ukbEUR_imp_chr{TASK_ID}_v3_impQC
filters=maf0.0001
genoDir=$WD/input/UKB_v3EUR_impQC
snpList=$genoDir/UKB_EUR_impQC_$filters
cd $genoDir
tmp_command="plink --bfile $geno \
                   --write-snplist \
                   --maf 0.0001 \
                   --out ${snpList}_chr{TASK_ID}"
qsubshcom "$tmp_command" 1 10G rareVariants 3:00:00 "-array=1-22"
cat ${snpList}_chr*.snplist > $snpList.snplist
rm ${snpList}_chr*




# ------ Make mbfile with all individual-level data files to include in the analysis
tmpDir=$WD/results/fastGWA/tmpDir
echo $genoDir/ukbEUR_imp_chr1_v3_impQC > $tmpDir/UKB_v3EUR_impQC.files
for i in `seq 2 22`; do echo $genoDir/ukbEUR_imp_chr${i}_v3_impQC >> $tmpDir/UKB_v3EUR_impQC.files; done




# ------ Check X chromosome codes for males (are they coded 0/1 or 0/2?)
## Extract first 10 SNPs in bim file
awk 'NR<11{print $2}' $medici/v3EUR_chrX/ukbEURv3_X_maf0.1_MALE.bim > $WD/tmpDir/xSNP
plink --bfile $medici/v3EUR_chrX/ukbEURv3_X_maf0.1_MALE \
      --extract $WD/tmpDir/xSNP \
      --out $WD/tmpDir/checkMaleXcodes \
      --recode A
for i in `seq 7 16`; do echo $i `sed 1d $WD/tmpDir/checkMaleXcodes.raw | awk -v i="$i" '{print $i}' | sort | uniq ` ; done

## Males are coded 0/2, so we can run the GWAS in males and females separately and meta-analyse
## Only need to remember that the interpretation of the X chr results in males is different: the estimated effects are per half allele

# #########################################
# Run GWAS with fastGWA - autosomes
# #########################################

# Set directories
results=$WD/results/fastGWA
tmpDir=$results/tmpDir

# Files/settings to use
prefix=vitD_fastGWA
#prefix=vitD_BMIcov_fastGWA
#prefix=vitD_fastGWA_maf0.0001
geno=$tmpDir/UKB_v3EUR_impQC.files
pheno=$WD/input/mlm_pheno
grm=$WD/input/UKB_HM3_m01_sp
quantCovars=$WD/input/quantitativeCovariates
#quantCovars=$WD/input/quantitativeCovariates_bmi
catgCovars=$WD/input/categoricalCovariates
snps2include=$WD/input/UKB_v3EUR_impQC/UKB_EUR_impQC_maf0.01.snplist
#snps2include=$WD/input/UKB_v3EUR_impQC/UKB_EUR_impQC_maf0.0001.snplist
output=$tmpDir/$prefix

# Run GWAS
cd $results
gcta=$bin/gcta/gcta_1.92.3beta3
tmp_command="$gcta --mbfile $geno \
                   --fastGWA-lmm \
                   --pheno $pheno \
                   --grm-sparse  $grm \
                   --qcovar $quantCovars\
                   --covar $catgCovars \
                   --covar-maxlevel 110 \
                   --extract $snps2include \
                   --threads 16 \
                   --out $output"
qsubshcom "$tmp_command" 16 10G $prefix 24:00:00 ""

# Compress results
gzip -c $output.fastGWA > $results/$prefix.gz


# Convert GWAS sumstats to .ma format (for COJO)
echo "SNP A1 A2 freq b se p n" > $results/maFormat/$prefix.ma
awk '{print $2,$4,$5,$7,$8,$9,$10,$6}' $tmpDir/$prefix.fastGWA | sed 1d >> $results/maFormat/$prefix.ma



# #########################################
# Run GWAS with fastGWA - X chromosome
# #########################################

# Set directories
results=$WD/results/fastGWA
tmpDir=$results/tmpDir

# Files/settings to use
pheno=$WD/input/mlm_pheno
grm=$WD/input/UKB_HM3_m01_sp
quantCovars=$WD/input/quantitativeCovariates
#quantCovars=$WD/input/quantitativeCovariates_bmi
catgCovars=$WD/input/categoricalCovariates_noSex

# ---- Run GWAS separately for males and females
cd $results
for sex in FEMALE MALE
do
  # Run X chromosome pseudo-autosomal region (chr XY) as well
  for region in X XY
  do
    prefix=vitD_fastGWA_chr$region.$sex
    #prefix=vitD_BMIcov_fastGWA_chr$region.$sex
    geno=$medici/v3EUR_chrX/ukbEURv3_${region}_maf0.1_$sex
    output=$tmpDir/$prefix
  
    gcta=$bin/gcta/gcta_1.92.3beta3
    tmp_command="$gcta --bfile $geno \
                       --fastGWA-lmm \
                       --pheno $pheno \
                       --grm-sparse  $grm \
                       --qcovar $quantCovars\
                       --covar $catgCovars \
                       --covar-maxlevel 110 \
                       --threads 16 \
                       --out $output"
    qsubshcom "$tmp_command" 16 30G $prefix 6:00:00 ""
  done
done


# #########################################
# Meta-analyse results from males and females for X and XY
# #########################################

# NOTE: Run these commands with an interactive session (I still haven't been able to submit a script that will run them)
#       Use: qsub -I -A UQ-IMB-CNSG -X -l nodes=1:ppn=5,mem=15GB,walltime=6:00:00

cd $WD/results/fastGWA/tmpDir
metal=$bin/metal/metal
$metal

# === CHROMOSOME X ================================================
CLEAR
SCHEME STDERR
AVERAGEFREQ ON

# Describe and process results for X chromosome
MARKER SNP
ALLELE A1 A2
EFFECT BETA
PVALUE P
WEIGHT N
STDERR SE
FREQLABEL AF1

PROCESS $WD/results/fastGWA/tmpDir/vitD_fastGWA_chrX.FEMALE.fastGWA
PROCESS $WD/results/fastGWA/tmpDir/vitD_fastGWA_chrX.MALE.fastGWA
#PROCESS $WD/results/fastGWA/tmpDir/vitD_BMIcov_fastGWA_chrX.FEMALE.fastGWA
#PROCESS $WD/results/fastGWA/tmpDir/vitD_BMIcov_fastGWA_chrX.MALE.fastGWA

# Run meta-analysis
OUTFILE $WD/results/fastGWA/tmpDir/vitD_fastGWA_chrX .TBL
#OUTFILE $WD/results/fastGWA/tmpDir/vitD_BMIcov_fastGWA_chrX .TBL

ANALYZE

# === CHROMOSOME XY (X chromosome pseudo-autosomal region) ========
CLEAR
SCHEME STDERR
AVERAGEFREQ ON

# Describe and process results for XY chromosome
MARKER SNP
ALLELE A1 A2
EFFECT BETA
PVALUE P 
WEIGHT N
STDERR SE
FREQLABEL AF1

PROCESS $WD/results/fastGWA/tmpDir/vitD_fastGWA_chrXY.FEMALE.fastGWA
PROCESS $WD/results/fastGWA/tmpDir/vitD_fastGWA_chrXY.MALE.fastGWA
#PROCESS $WD/results/fastGWA/tmpDir/vitD_BMIcov_fastGWA_chrXY.FEMALE.fastGWA
#PROCESS $WD/results/fastGWA/tmpDir/vitD_BMIcov_fastGWA_chrXY.MALE.fastGWA

# Run meta-analysis
OUTFILE $WD/results/fastGWA/tmpDir/vitD_fastGWA_chrXY .TBL
#OUTFILE $WD/results/fastGWA/tmpDir/vitD_BMIcov_fastGWA_chrXY .TBL
ANALYZE


# #########################################
# Format meta-analysed results and merge with autossomes
# Generate .ma format files for COJO
# #########################################

## Note: Ran this step in R, using an interactive session
##  qsub -I -A UQ-IMB-CNSG -X -l nodes=1:ppn=5,mem=15GB,walltime=6:00:00
R
setwd("$WD/results/fastGWA")

prefix="vitD_fastGWA"
#prefix="vitD_BMIcov_fastGWA"

# Read X meta-analyses results + GWAS results for autosomes
xmeta=read.table(paste0("tmpDir/",prefix,"_chrX1.TBL"),h=T,stringsAsFactors=F)
xymeta=read.table(paste0("tmpDir/",prefix,"_chrXY1.TBL"),h=T,stringsAsFactors=F)
autosomes=read.table(paste0("tmpDir/",prefix,".fastGWA"), h=T,stringsAsFactors=F)
names(xmeta)=c("SNP","A1","A2","AF1","AF1_SE","BETA","SE","P","direction")
names(xymeta)=c("SNP","A1","A2","AF1","AF1_SE","BETA","SE","P","direction")

# Get chrX details (BP, allele freq, and N) from female and male GWAS
xfemale=read.table(paste0("tmpDir/",prefix,"_chrX.FEMALE.fastGWA"), h=T,stringsAsFactors=F)
xmale=read.table(paste0("tmpDir/",prefix,"_chrX.MALE.fastGWA"), h=T,stringsAsFactors=F)
  ## Get all results in the same order
  xmeta=xmeta[match(xfemale$SNP,xmeta$SNP),]
  xmale=xmale[match(xfemale$SNP,xmale$SNP),]
  ## Add BP
  xmeta$POS=xfemale$POS
  # Add N
  xmeta$N=xfemale$N+xmale$N
  # Change alleles to capital letters
  xmeta$A1=toupper(xmeta$A1)
  xmeta$A2=toupper(xmeta$A2)
  # Define chromosome
  xmeta$CHR=23
  
# Get chrXY details (BP, allele freq, and N) from female and male GWAS
xyfemale=read.table("tmpDir/vitD_fastGWA_chrXY.FEMALE.fastGWA", h=T,stringsAsFactors=F)
xymale=read.table("tmpDir/vitD_fastGWA_chrXY.MALE.fastGWA", h=T,stringsAsFactors=F)
  ## Get all results in the same order
  xymeta=xymeta[match(xyfemale$SNP,xymeta$SNP),]
  xymale=xymale[match(xyfemale$SNP,xymale$SNP),]
  ## Add BP
  xymeta$POS=xyfemale$POS
  ## Add N
  xymeta$N=xyfemale$N+xymale$N
  # Define chromosome
  xymeta$CHR=25
  # Change alleles to capital letters
  xymeta$A1=toupper(xymeta$A1)
  xymeta$A2=toupper(xymeta$A2)

# fastGWA format
# CHR SNP POS A1 A2 N AF1 BETA SE P
  xmeta_fastgwa=xmeta[,c("CHR","SNP","POS","A1","A2","N","AF1","BETA","SE","P")]
  xymeta_fastgwa=xymeta[,c("CHR","SNP","POS","A1","A2","N","AF1","BETA","SE","P")]
  ## Merge X + XY + autosomes
  allChr=rbind(autosomes, xmeta_fastgwa, xymeta_fastgwa)

# COJO (.ma) format
# SNP A1 A2 freq b se p n
 ## CHR X
 xmeta_ma=xmeta[,c("SNP","A1","A2","AF1","BETA","SE","P","N")]
 names(xmeta_ma)=c("SNP","A1","A2","freq","b","se","p","n")
  ## CHR XY
 xymeta_ma=xymeta[,c("SNP","A1","A2","AF1","BETA","SE","P","N")]
 names(xymeta_ma)=c("SNP","A1","A2","freq","b","se","p","n")
  

# Save new file formats
  ## All chromosomes
  write.table(allChr, file=gzfile(paste0(prefix,"_withX.gz")), quote=F, row.names=F)
  ## X and XY in fastGWA format (for clumping)
  write.table(xmeta_fastgwa, file=gzfile(paste0("tmpDir/",prefix,"_chrX.gz")), quote=F, row.names=F)
  write.table(xymeta_fastgwa, file=gzfile(paste0("tmpDir/",prefix,"_chrXY.gz")), quote=F, row.names=F)
  ## X and XY in .ma format (for COJO)
  write.table(xmeta_ma, paste0("maFormat/",prefix,"_chrX.ma"), quote=F, row.names=F)
  write.table(xymeta_ma, paste0("maFormat/",prefix,"_chrXY.ma"), quote=F, row.names=F)


# #########################################
# Make .ma file with all chr (including X)
# #########################################

prefix=vitD_fastGWA
#prefix=vitD_BMIcov_fastGWA
results=$WD/results/fastGWA/maFormat

cp $results/$prefix.ma $results/${prefix}_withX.ma 
sed 1d $results/${prefix}_chrX.ma >> $results/${prefix}_withX.ma 
sed 1d $results/${prefix}_chrXY.ma >> $results/${prefix}_withX.ma 



# Copy to results to local computer (run command from local)
scp $deltaID:$WD/results/fastGWA/vitD_fastGWA_withX.gz $local/results/fastGWA/

scp $deltaID:$WD/results/fastGWA/vitD_BMIcov_fastGWA_withX.gz $local/results/fastGWA/



BOLT-LMM

This analysis was performed with BOLT-LMM restricting to SNPs with MAF>0.01.

The following files were used to run the analysis:

  • Phenotype file: This file includes one line per individual and a column with the phenotype. The phenotype was obtained after (1) restricting to individuals of European ancestry, and (2) applying a rank-based inverse-normal transformation. In total 417580 individuals of European ancestry were analysed.
  • Genotype file: Files in Q0286/UKBiobank/v3EUR_impQC/. These are restricted to individuals of European ancestry and include 44741804 autossomal variants with Variants had (1) MAC > 5, (2) INFO > 0.3, and (3) genotype missingness > 0.05. In addition, these files exclude variants with missingness < 0.05, pHWE < 1x10-5, and MAC > 5 in the subset of unrelated European individuals.
  • Covariates file: This file includes one column per covariate to fit in the model. Covariates included were: age, sex, genotyping batch, UKB assessment centre, month of assessment, information on whether vitamin supplements were ever taken, and the first 40 PCs. For more detail see the covariates section of the Pheno page.
# #########################################
# Prepare files required for this analysis
# #########################################

# Note: Only need to run once

# Files
quantCovars=$WD/input/quantitativeCovariates
catgCovars=$WD/input/categoricalCovariates
geno=$medici/v3EUR_impQC/ukbEUR_imp_chr{TASK_ID}_v3_impQC


# Merge categorical and quantitative covariates into one file
join -1 1 -2 1 $catgCovars <(cut -d' ' -f2,2 --complement $quantCovars) > $WD/input/Covariates

# Crete list of all SNPs and exclude SNPs that we want to exclude
# BOLT needs a list of SNPs to exclude
genoDir=$WD/input/UKB_v3EUR_impQC
snpList_all=$genoDir/UKB_EUR_impQC
cd $genoDir
tmp_command="plink --bfile $geno \
                   --write-snplist \
                   --out ${snpList_all}_chr{TASK_ID}"
qsubshcom "$tmp_command" 1 30G allVariants 10:00:00 "-array=1-22"
cat ${snpList_all}_chr*.snplist > $snpList_all.snplist
rm ${snpList_all}_chr*

# Create list of variants to include (MAF>0.01)
filters=maf0.01
snpList_include=$genoDir/UKB_EUR_impQC_$filters
tmp_command="plink --bfile $geno \
                   --write-snplist \
                   --maf 0.01 \
                   --out ${snpList_include}_chr{TASK_ID}"
qsubshcom "$tmp_command" 1 30G commonVariants 10:00:00 "-array=1-22"
cat ${snpList_include}_chr*.snplist > $snpList_include.snplist
rm ${snpList_include}_chr*

# Make list of SNPs to exclude - this is what BOLT accepts as input
snpList_exclude=$WD/input/UKB_v3EUR_impQC/UKB_EUR_impQC_$filters.EXCLUDEsnplist
awk 'NR==FNR{a[$1];next} !($1 in a)' $snpList_include.snplist $snpList_all.snplist > $snpList_exclude


# Copy files from delta to RCC (-- run this step on delta -- )
cd $WD
scp input/UKB_v3EUR_impQC/UKB_EUR_impQC_maf0.01.EXCLUDEsnplist $rooID:$WD/input/UKB_v3EUR_impQC/
scp input/mlm_pheno $rooID:$WD/input/
scp input/Covariates $rooID:$WD/input/
# #########################################
# Run GWAS with BOLT-LMM
# #########################################

# RCC settings
boltDir=$bin/bolt/BOLT-LMM_v2.3.2
nThreads=22
account="-acct=UQ-IMB-CNSG"

# Delta settings
boltDir=$bin/bolt/BOLT-LMM_v2.3.2
nThreads=16
account=""

# Common settings
prefix=vitD_boltGWAS
results=$WD/results/bolt
bed=$medici/v3EUR_impQC/ukbEUR_imp_chr{1:22}_v3_impQC.bed
bim=$medici/v3EUR_impQC/ukbEUR_imp_chr{1:22}_v3_impQC.bim
fam=$medici/v3EUR_impQC/ukbEUR_imp_chr1_v3_impQC.fam
modelSNPs=$medici/v3EUR_HM3/ukbEUR_imp_allchr_v3_HM3_maf01_R9.prune.in
snps2exclude=$WD/input/UKB_v3EUR_impQC/UKB_EUR_impQC_maf0.01.EXCLUDEsnplist
pheno=$WD/input/mlm_pheno
trait=norm_vitD
covarFile=$WD/input/Covariates
output=$results/$prefix

# Run GWAS
bolt=$boltDir/bolt
cd $results
tmp_command="$bolt --bed=$bed \
                   --bim=$bim \
                   --fam=$fam
                   --exclude $snps2exclude \
                   --phenoFile=$pheno \
                   --phenoCol=$trait \
                   --covarFile $covarFile \
                   --covarCol=sex \
                   --covarCol=batch \
                   --covarCol=assessCentre \
                   --covarCol=assessMonth \
                   --covarCol=supplementsEver \
                   --covarMaxLevels=110 \
                   --qCovarCol=age \
                   --qCovarCol=PC{1:40} \
                   --lmm \
                   --modelSnps=$modelSNPs \
                   --LDscoresFile=$boltDir/tables/LDSCORE.1000G_EUR.tab.gz \
                   --geneticMapFile=$boltDir/tables/genetic_map_hg19_withX.txt.gz \
                   --statsFile=$output \
                   --numThreads $nThreads"
qsubshcom "$tmp_command" $nThreads 100G $prefix 240:00:00 "$account"
gzip $output


# Convert GWAS sumstats to .ma format (for COJO)
echo "SNP A1 A2 freq b se p n" > $output.ma
n=`wc -l < $fam`
zcat $results/$prefix.gz | awk -v N=$n 'NR>1 {print $1,$5,$6,$7,$9,$10,$11,N*(1-$8)}'  >> $output.ma

# Make file with same columns as PLINK output
echo "CHR SNP BP A1 TEST NMISS BETA STAT P" > $output.linear
zcat $output.gz | grep -v '^SNP' | awk -v N=$n '{print $2, $1, $3, $5, "ADD", N*(1-$8), $9, $9/$10, $11 }'   >> $output.linear
gzip $output.linear


# Copy results to local computer to summarise and plot (run command in local computer)
scp $deltaID:$WD/results/bolt/vitD_boltGWAS* $local/results/bolt/



Clump

fastGWA

Results were then clumped with plink 1.9. The following options were used in this analysis:

  • –bfile - v3EUR_impQC samples were used as reference for the autosomes. v3EUR_impQC female samples were used as reference for the X chromosome and the X chromosome pseudo-autossomal region
  • –clump-p1 = 5x10-8 - max P-value for index variants to start a clump
  • –clump-p2 = 0.01 (default) - max P-value for variants in a clump to be listed in column SP2
  • –clump-r2 = 0.01 - minimum r2 with index variant for a site to be included in that clump
  • –clump-kb = 10,000 - maximum distance from index SNP for a site to be included in that clump
# There are duplicated SNPs in the v3EUR_impQC files. This causes an error when running clumping using them as LD reference. 
# Generate a file with their rs IDs and exclude them from analysis when clumping
for i in `seq 1 22`
do 
  echo $i
  awk '{print $2}' $medici/v3EUR_impQC/ukbEUR_imp_chr${i}_v3_impQC.bim | sort | uniq -d >> $WD/input/UKB_v3EUR_impQC/UKB_EUR_impQC_duplicated_rsIDs
done


# #########################################
# Clump GWAS results - autossomes
# #########################################

# Directories
results=$WD/results/fastGWA
tmpDir=$results/tmpDir

# Files/settings to use
prefix=vitD_fastGWA
#prefix=vitD_BMIcov_fastGWA
ref=$medici/v3EUR_impQC/ukbEUR_imp_chr{TASK_ID}_v3_impQC
inFile=$tmpDir/$prefix.fastGWA
outFile=$tmpDir/${prefix}_chr{TASK_ID}
snps2exclude=$WD/input/UKB_v3EUR_impQC/UKB_EUR_impQC_duplicated_rsIDs
window=10000
jobName=$prefix.clump

#  # Test different window sizes and LD refs
#  ## 2Mb window + all LD ref
#  window=2000
#  outFile=$tmpDir/${prefix}.2Mb_chr{TASK_ID}
#  jobName=$prefix.2Mb.clump
#  ## 2Mb window + 20K random sample LD ref
#  ref=$WD/input/UKB_v3EURu_impQC/random20K/ukbEURu_imp_chr{TASK_ID}_v3_impQC_random20k_maf0.01
#  window=2000
#  outFile=$tmpDir/${prefix}.20kref.2Mb_chr{TASK_ID}
#  jobName=$prefix.20kref.2Mb.clump

# Run clumping
cd $results
plink=$bin/plink/plink-1.9/plink
tmp_command="$plink --bfile $ref \
                    --exclude $snps2exclude \
                    --clump $inFile \
                    --clump-p1 5e-8 \
                    --clump-r2 0.01 \
                    --clump-kb $window \
                    --clump-field "P" \
                    --out $outFile"          
qsubshcom "$tmp_command" 1 30G $jobName 24:00:00 "-array=1-22"



# #########################################
# Clump GWAS results - chromosome X
# #########################################

# Directories
results=$WD/results/fastGWA
tmpDir=$results/tmpDir

# Run clumping
cd $results
for chr in X XY
do 
  # Files/settings to use
  prefix=vitD_fastGWA_chr$chr
  #prefix=vitD_BMIcov_fastGWA_chr$chr
  ref=$medici/v3EUR_chrX/ukbEURv3_${chr}_maf0.1_FEMALE
  inFile=$tmpDir/$prefix.gz
  outFile=$tmpDir/$prefix
  plink=$bin/plink/plink-1.9/plink
  tmp_command="$plink --bfile $ref \
                      --clump $inFile \
                      --clump-p1 5e-8 \
                      --clump-r2 0.01 \
                      --clump-kb 10000 \
                      --clump-field "P" \
                      --out $outFile"          
  qsubshcom "$tmp_command" 1 10G $prefix.clump 10:00:00 ""
  echo $prefix.clump
done

# #########################################
# Merge clumped results once all chr finished running
# #########################################
cat $tmpDir/${prefix}_chr*clumped | grep "CHR" | head -1 > $results/${prefix}_withX.clumped
cat $tmpDir/${prefix}_chr*clumped | grep -v "CHR" |sed '/^$/d'  >> $results/${prefix}_withX.clumped


# Copy results to local computer to summarise and plot (run command in local computer)
scp $deltaID:$WD/results/fastGWA/vitD_fastGWA_withX.clumped $local/results/fastGWA/

scp $deltaID:$WD/results/fastGWA/vitD_BMIcov_fastGWA_withX.clumped $local/results/fastGWA/



COJO

fastGWA

COJO – a summary-statistics-based method implemented in GCTA – is an approach to identify independent associations. It starts with the most associated SNP, then, conditioning on that SNP, it finds the second most associated SNP. Then, conditioning on both those SNPs, it finds the third most associated SNP, and so on.

The following options were used in this analysis:

  • –bfile: For the autosomes, we used a random subset of 20,000 unrelated individuals of European ancestry from the UKB as LD reference. The subset was generated based with the files in Q0286/UKBiobank/v3EURu_impQC/, which include 1365446 autossomal SNPs and are restricted to unrelated individuals of European ancestry. For the X chromosome, we used v3EUR_impQC female samples as LD reference.
  • –cojo-slct: Performs a step-wise model selection of independently associated SNPs
  • –cojo-p 5e-8 (default): P-value to declare a genome-wide significant hit
  • –cojo-wind 10,000 (default): Distance (in KB) from which SNPs are considered in linkage equilibrium
  • –cojo-collinear 0.9 (default): maximum r2 to be considered an independent SNP, i.e. SNPs with r2>0.9 will not be considered further
  • –diff-freq 0.2 (default): maximum allele difference between the GWAS and the LD reference sample
  • extract: Restrict analysis to SNPs with MAF>0.01
# ########################
# Prepare files to run COJO
# ########################

# --- Extract a random set of 20K unrelated Europeans to use as LD reference for the autosomes
#     In addition, keep only the variants in the vittamin D GWAS to speed up GSMR analysis
  # Extract SNPs in fastGWA results
  prefix=vitD_fastGWA
  results=$WD/results/fastGWA

  random20k_dir=$WD/input/UKB_v3EURu_impQC/random20K
  awk 'NR>1{print $2}' $tmpDir/$prefix.fastGWA |  > $random20k_dir/$prefix.snpList
  
  # Extract data
  ld_ref=$medici/v3EURu_impQC/ukbEURu_imp_chr{TASK_ID}_v3_impQC
  ids2keep=$medici/v3Samples/ukbEURu_v3_all_20K.fam
  snps2keep=$WD/input/UKB_v3EURu_impQC/random20K/$prefix.snpList
  output=$random20k_dir/ukbEURu_imp_chr{TASK_ID}_v3_impQC_random20k_maf0.01

  cd $random20k_dir
  plink=$bin/plink/plink-1.9/plink
  tmp_command="$plink --bfile $ld_ref  \
                      --keep $ids2keep \
                      --extract $snps2keep  \
                      --make-bed \
                      --out $output"
  qsubshcom "$tmp_command" 1 10G ukbEURu_v3_impQC_random20k_maf0.01 24:00:00 "-array=1-22"


# #########################################
# Run COJO on GWAS results - autosomes
# #########################################

# RCC settings
gcta=$bin/gcta/gcta_1.92.3beta3/gcta64
account="-acct=UQ-IMB-CNSG"

# Delta settings
gcta=$bin/gcta/gcta_1.92.3beta3
account=""

# Common settings
prefix=vitD_fastGWA
#prefix=vitD_BMIcov_fastGWA
results=$WD/results/fastGWA
tmpDir=$results/tmpDir
random20k_dir=$WD/input/UKB_v3EURu_impQC/random20K
ldRef=$random20k_dir/ukbEURu_imp_chr{TASK_ID}_v3_impQC_random20k_maf0.01
filters=maf0.01
window=10000
inFile=$results/maFormat/$prefix.ma
snps2include=$WD/input/UKB_v3EUR_impQC/UKB_EUR_impQC_$filters.snplist
outFile=$tmpDir/${prefix}_${filters}_chr{TASK_ID}
jobName=$prefix.COJO
  
  # Test different window sizes and LD refs
    ## 5Mb window + 20K random sample LD ref
    window=5000
    outFile=$tmpDir/${prefix}.20kref.5Mb_chr{TASK_ID}
    jobName=$prefix.20kref.5Mb.COJO
    ## 2Mb window + 20K random sample LD ref
    window=2000
    outFile=$tmpDir/${prefix}.20kref.2Mb_chr{TASK_ID}
    jobName=$prefix.20kref.2Mb.COJO
    ## 10Mb window + all LD ref
    ldRef=$medici/v3EUR_impQC/ukbEUR_imp_chr{TASK_ID}_v3_impQC
    window=10000
    outFile=$tmpDir/${prefix}.ALLref.10Mb_chr{TASK_ID}
    jobName=$prefix.ALLref.10Mb.COJO
    ## 5Mb window + all LD ref
    ldRef=$medici/v3EUR_impQC/ukbEUR_imp_chr{TASK_ID}_v3_impQC
    window=5000
    outFile=$tmpDir/${prefix}.ALLref.5Mb_chr{TASK_ID}
    jobName=$prefix.ALLref.5Mb.COJO
    ## 5Mb window + all LD ref
    ldRef=$medici/v3EUR_impQC/ukbEUR_imp_chr{TASK_ID}_v3_impQC
    window=2000
    outFile=$tmpDir/${prefix}.ALLref.2Mb_chr{TASK_ID}
    jobName=$prefix.ALLref.2Mb.COJO


# Run COJO
cd $results
tmp_command="$gcta --bfile $ldRef \
                   --cojo-slct \
                   --cojo-file $inFile \
                   --cojo-wind $window \
                   --extract $snps2include \
                   --out $outFile"
                             
qsubshcom "$tmp_command" 1 100G $jobName 24:00:00 "-array=1-22 $account"


# #########################################
# Run COJO on GWAS results - X chromosome
# #########################################

# RCC settings
account="-acct=UQ-IMB-CNSG"

# Delta settings
account=""

# Common settings
prefix=vitD_fastGWA
#prefix=vitD_BMIcov_fastGWA
results=$WD/results/fastGWA
tmpDir=$results/tmpDir
filters=maf0.01

# Run COJO
cd $results
gcta=$bin/gcta/gcta_1.92.3beta3
for chr in X XY
do
  ref=$medici/v3EUR_chrX/ukbEURv3_${chr}_maf0.1_FEMALE
  inFile=$results/maFormat/${prefix}_chr$chr.ma
  outFile=$tmpDir/${prefix}_${filters}_chr$chr
  tmp_command="$gcta --bfile $ref \
                   --cojo-slct \
                   --cojo-file $inFile \
                   --maf 0.01 \
                   --out $outFile"
                             
  qsubshcom "$tmp_command" 1 30G ${prefix}_chr$chr.COJO 24:00:00 "$account"
done

# #########################################
# Merge COJO results once all chr finished running
# #########################################
cat $tmpDir/${prefix}_${filters}*.jma.cojo | grep "Chr" | uniq > $results/${prefix}_${filters}_withX.cojo
cat $tmpDir/${prefix}_${filters}_chr*.jma.cojo | grep -v "Chr" |sed '/^$/d'  >> $results/${prefix}_${filters}_withX.cojo


# Copy results to local computer to summarise and plot (run command in local computer)
scp $deltaID:$WD/results/fastGWA/vitD_fastGWA*_withX.cojo $local/results/fastGWA/

scp $deltaID:$WD/results/fastGWA/vitD_BMIcov_fastGWA_maf0.01_withX.cojo $local/results/fastGWA/

To assess the number of significant associations that were independent from previously reported variants, we performed an association analysis conditional on the 6 following SNPs:

  • rs2282679
  • rs10741657
  • rs12785878
  • rs10745742
  • rs8018720
  • rs6013897

Specifically, we used the --cojo-cond option implemented in GCTA for this analysis.

# ------ Condition GWAS results on known variants
# Delta settings
gcta=$bin/gcta/gcta_1.92.3beta3
account=""

# Common settings
prefix=vitD_fastGWA
results=$WD/results/fastGWA
tmpDir=$results/tmpDir
random20k_dir=$WD/input/UKB_v3EURu_impQC/random20K
ldRef=$random20k_dir/ukbEURu_imp_chr{TASK_ID}_v3_impQC_random20k_maf0.01
filters=maf0.01
inFile=$results/maFormat/$prefix.ma
snps2include=$WD/input/UKB_v3EUR_impQC/UKB_EUR_impQC_$filters.snplist
outFile=$tmpDir/${prefix}_${filters}_conditionOnKnownLoci_chr{TASK_ID}
knownLoci=$WD/input/vitD_knownLoci.txt

# Generate list of known variants
loci2conditionOn=$tmpDir/knownLoci
awk 'NR>1 {print $1}' $knownLoci > $loci2conditionOn

# Condition GWAS results on known variants
cd $results
tmp_command="$gcta --bfile $ldRef \
                   --cojo-file $inFile \
                   --extract $snps2include \
                   --cojo-cond $loci2conditionOn \
                   --out $outFile"
                             
qsubshcom "$tmp_command" 1 100G $prefix.COJO.knownLoci 24:00:00 "-array=1-22 $account"




# ------ Create file with all results after conditioning on known variants
  ## Run this step in R
  qsub -I -A UQ-IMB-CNSG -X -l nodes=1:ppn=5,mem=10GB,walltime=6:00:00
  cd $WD
  R
  knownLoci=read.table("input/vitD_knownLoci.txt", h=T, stringsAsFactors=F, sep="\t")
  
  # Load all results and 
  gwas=read.table("results/fastGWA/vitD_fastGWA_withX.gz", h=T, stringsAsFactors=F)
  
  # Keep SNP information from CHR that have been condition on known SNPs and remove results (we'll use the conditioned ones) 
  knownLoci_SNPinfo=gwas[gwas$CHR %in% knownLoci$CHR,c("SNP","A1","A2")]
  gwas=gwas[!gwas$CHR %in% knownLoci$CHR,]
  
  # Add conditioned results
  for(chr in unique(knownLoci$CHR))
  {
    print(chr)
    tmp=read.table(paste0("results/fastGWA/tmpDir/vitD_fastGWA_maf0.01_conditionOnKnownLoci_chr",chr,".cma.cojo"), h=T, stringsAsFactors=F)
    # Add information not provided in COJO output (A2)
    tmp=merge(tmp,knownLoci_SNPinfo, all.x=T)
    if(!identical(tmp$A1,tmp$refA)){print("Check alleles")}
    # Merge with unconditioned results
    tmp=tmp[,c("Chr","SNP","bp","A1","A2","n","freq","bC","bC_se","pC")]
    names(tmp)=c("CHR","SNP","POS","A1","A2","N","AF1","BETA","SE","P")
    gwas=rbind(gwas,tmp)
  }
  
  # Save conditioned results
  write.table(gwas, file=gzfile("results/fastGWA/vitD_fastGWA_withX_conditionOnKnownLoci.gz"), quote=F, row.names=F)
  
  # Convert to .ma format (to run COJO) and save
  ## SNP A1 A2 freq b se p n
  maFormat=gwas[,c("SNP","A1","A2","AF1","BETA","SE","P","N")]
  names(maFormat)=c("SNP","A1","A2","freq","b","se","p","n")
  write.table(maFormat, "results/fastGWA/maFormat/vitD_fastGWA_withX_conditionOnKnownLoci.ma", quote=F, row.names=F)
  
  

# ------ Run COJO on conditioned results

# Delta settings
gcta=$bin/gcta/gcta_1.92.3beta3
account=""

# Common settings
prefix=vitD_fastGWA_withX_conditionOnKnownLoci
results=$WD/results/fastGWA
tmpDir=$results/tmpDir
random20k_dir=$WD/input/UKB_v3EURu_impQC/random20K
ldRef=$random20k_dir/ukbEURu_imp_chr{TASK_ID}_v3_impQC_random20k_maf0.01
filters=maf0.01
inFile=$results/maFormat/$prefix.ma
snps2include=$WD/input/UKB_v3EUR_impQC/UKB_EUR_impQC_$filters.snplist
outFile=$tmpDir/${prefix}_${filters}_chr{TASK_ID}
knownLoci=$WD/input/vitD_knownLoci.txt

cd $results
tmp_command="$gcta --bfile $ldRef \
                   --cojo-slct \
                   --cojo-file $inFile \
                   --extract $snps2include \
                   --out $outFile"
qsubshcom "$tmp_command" 1 30G $prefix.COJO 24:00:00 "-array=1-22 $account"



cat $tmpDir/${prefix}_${filters}*.jma.cojo | grep "Chr" | uniq > $results/${prefix}_${filters}.cojo
cat $tmpDir/${prefix}_${filters}_chr*.jma.cojo | grep -v "Chr" |sed '/^$/d'  >> $results/${prefix}_${filters}.cojo


# Copy results to local computer to summarise and plot (run command in local computer)
scp $deltaID:$WD/results/fastGWA/vitD_fastGWA_withX_conditionOnKnownLoci.gz $local/results/fastGWA/


scp $deltaID:$WD/results/fastGWA/vitD_fastGWA_withX_conditionOnKnownLoci_maf0.01.cojo $local/results/fastGWA/



BOLT-LMM

The following options were used in this analysis:

  • –bfile: We used the GWAS individual-level genotype data (v3EUR_impQC) as reference.
  • –cojo-slct: Performs a step-wise model selection of independently associated SNPs
  • –cojo-p 5e-8 (default): P-value to declare a genome-wide significant hit
  • –cojo-wind 10,000 (default): Distance (in KB) from which SNPs are considered in linkage equilibrium
  • –cojo-collinear 0.9 (default): maximum r2 to be considered an independent SNP, i.e. SNPs with r2>0.9 will not be considered further
  • –diff-freq 0.2 (default): maximum allele difference between the GWAS and the LD reference sample
  • extract: Restrict analysis to SNPs with MAF>0.01
# #########################################
# Run COJO on GWAS results
# #########################################

# Directories
results=$WD/results/bolt
tmpDir=$results/tmpDir

# Files/settings to use
prefix=vitD_boltGWAS
ref=$medici/v3EUR_impQC/ukbEUR_imp_chr{TASK_ID}_v3_impQC
filters=maf0.01
inFile=$results/$prefix.ma
snps2include=$WD/input/UKB_v3EUR_impQC/UKB_EUR_impQC_$filters.snplist
outFile=$tmpDir/${prefix}_${filters}_chr{TASK_ID}

# Run COJO
cd $results
gcta=$bin/gcta/gcta_1.92.3beta3
tmp_command="$gcta --bfile $ref \
                   --cojo-slct \
                   --cojo-file $inFile \
                   --extract $snps2include \
                   --out ${outFile}"
                             
qsubshcom "$tmp_command" 1 100G $prefix.COJO 24:00:00 "-array=1-22"


# Merge COJO results once all chr finished running
cat $tmpDir/${prefix}_${filters}*.jma.cojo | grep "Chr" | uniq > $results/${prefix}_${filters}.cojo
cat $tmpDir/${prefix}_${filters}_chr*.jma.cojo | grep -v "Chr" |sed '/^$/d'  >> $results/${prefix}_${filters}.cojo


# Copy results to local computer to summarise and plot (run command in local computer)
scp $deltaID:$WD/results/bolt/vitD_bolt*.cojo $local/results/bolt/



LDSC

fastGWA

To run LDSC, we restricted the GWAS summary statistics to variants found in w_hm3.noMHC.snplist (downloaded from LDHub). This is a list of HapMap3 SNPs and excludes SNPs in the MHC region. Note that this does not include variants on the X chromosome.

module load ldsc
prefix=vitD_fastGWA
#prefix=vitD_BMIcov_fastGWA

# Directories
results=$WD/results/fastGWA
tmpDir=$results/tmpDir
ldscDir=$bin/ldsc

# Format GWAS results
awk '{print $1,$2,$3,$5,$8,$7}' $results/maFormat/$prefix.ma > $tmpDir/$prefix.ldHubFormat

# Restrict to common SNPs
sed 1q $tmpDir/$prefix.ldHubFormat > $results/ldscFormat/$prefix.ldHubFormat
awk 'NR==FNR{a[$1];next} ($1 in a)' $WD/input/UKB_v3EURu_impQC/UKB_EURu_impQC_maf0.01.snplist <(awk 'NR==FNR{a[$1];next} ($1 in a)' $WD/input/w_hm3.noMHC.snplist $tmpDir/$prefix.ldHubFormat) >> $results/ldscFormat/$prefix.ldHubFormat

# Munge sumstats (process sumstats and restrict to HapMap3 SNPs)
cd $results/ldscFormat/
munge_sumstats.py --sumstats $results/ldscFormat/$prefix.ldHubFormat \
                  --merge-alleles $ldscDir/w_hm3.snplist \
                  --out $results/ldscFormat/$prefix

# Run LDSC
ldsc.py --h2 $results/ldscFormat/$prefix.sumstats.gz \
        --ref-ld-chr $ldscDir/eur_w_ld_chr/ \
        --w-ld-chr $ldscDir/eur_w_ld_chr/ \
        --out $results/ldsc/$prefix.ldsc


# Copy results to local computer (run command in local computer)
scp $deltaID:$WD/results/fastGWA/ldsc/vitD_fastGWA.ldsc.log $local/results/fastGWA/ldsc/



BOLT-LMM

To run LDSC, we restricted the GWAS summary statistics to variants found in w_hm3.noMHC.snplist (downloaded from LDHub). This is a list of HapMap3 SNPs and excludes SNPs in the MHC region. Note that this does not include variants on the X chromosome.

module load ldsc
prefix=vitD_boltGWAS

# Directories
results=$WD/results/bolt
tmpDir=$results/tmpDir
ldscDir=$bin/ldsc

# Format GWAS results
awk '{print $1,$2,$3,$5,$8,$7}' $results/$prefix.ma > $tmpDir/$prefix.ldHubFormat

# Restrict to common SNPs
sed 1q $tmpDir/$prefix.ldHubFormat > $results/$prefix.ldHubFormat
awk 'NR==FNR{a[$1];next} ($1 in a)' $WD/input/UKB_v3EURu_impQC/UKB_EURu_impQC_maf0.01.snplist <(awk 'NR==FNR{a[$1];next} ($1 in a)' $WD/input/w_hm3.noMHC.snplist $tmpDir/$prefix.ldHubFormat) >> $results/$prefix.ldHubFormat

## I had to read the ldHubFormat-file with R and over-write it as ldsc was having issues with the 
## format of the p-value for some SNPs
  # cd $WD
  # R
  # a=read.table("results/bolt/vitD_boltGWAS.ldHubFormat",h=T,stringsAsFactors=F)
  # write.table(a,"results/bolt/vitD_boltGWAS.ldHubFormat",quote=F,row.names=F)

# Munge sumstats (process sumstats and restrict to HapMap3 SNPs)
munge_sumstats.py --sumstats $results/$prefix.ldHubFormat \
                  --merge-alleles $ldscDir/w_hm3.snplist \
                  --out $tmpDir/$prefix

# Run LDSC
ldsc.py --h2 $tmpDir/$prefix.sumstats.gz \
        --ref-ld-chr $ldscDir/eur_w_ld_chr/ \
        --w-ld-chr $ldscDir/eur_w_ld_chr/ \
        --out $results/$prefix.ldsc


# Copy results to local computer (run command in local computer)
scp $deltaID:$WD/results/bolt/vitD_boltGWAS.ldsc.log $local/results/bolt





Identify new loci

fastGWA

To assess the number of new loci in our GWAS, we:

  • Identified previously reported GWS variants (referred to as ‘know SNPs’ below)
  • Selected all variants tested in our GWAS that were GWAS and within 1Mb of a known SNP
  • Calculated the LD r2 between known SNPs and the selected nearby variants
  • Selected variants with LD r2>0.001
  • Defined as ‘novel’ any variants in our GWAS that were not in the list above
# Copy list of known loci to cluster (run this command locally)
scp $local/input/vitD_knownLoci.txt $deltaID:$WD/input/

# ################################################
# Check how many new loci we have
# ################################################

# ---------- Define lists to calculate LD with
## Run these steps in R in the cluster
qsub -I -A UQ-IMB-CNSG -X -l nodes=1:ppn=5,mem=5GB,walltime=6:00:00
cd $WD
R
gwas=read.table("results/fastGWA/vitD_fastGWA_withX.gz",h=T,stringsAsFactors=F)
knownLoci=read.table("input/vitD_knownLoci.txt",h=T,stringsAsFactors=F, sep="\t")
knownLoci=merge(knownLoci, gwas[,c("SNP","POS")])

# Identify variants to calculate LD with (ie in the within 1Mb and GWS at 5x10-8)
for(i in 1:nrow(knownLoci))
{
  snp=knownLoci[i,"SNP"]
  chr=knownLoci[i,"CHR"]
  minBP=knownLoci[i,"POS"]-1e6
  maxBP=knownLoci[i,"POS"]+1e6
  df=gwas[gwas$CHR==chr & gwas$POS>minBP & gwas$POS<maxBP & gwas$P<5e-8,]
  # Save list to calculate LD
  write.table(df$SNP, paste0("results/fastGWA/tmpDir/knownLoci_2checkLD_",snp), quote=F, row.names=F, col.names=F)
}



# ---------- Calculate LD between GWS SNPs within 1Mb of know loci

# Directories
results=$WD/results/fastGWA
tmpDir=$results/tmpDir

# Files/parameters
ld_ref=$medici/v3EUR_impQC/ukbEUR_imp_chr\${chr}_v3_impQC
knownLoci=$WD/input/vitD_knownLoci.txt
n=`wc -l < $knownLoci`
plink=$bin/plink/plink-1.9/plink

cd $results
if test $n != 1; then arrays="-array=2-$n"; i="{TASK_ID}"; else arrays=""; i=1; fi 
tmp_command="snp=\$(awk -v i=$i 'NR==i {print \$1}' $knownLoci);
             chr=\$(awk -v i=$i 'NR==i {print \$2}' $knownLoci);
             $plink --bfile $ld_ref \
                    --extract $tmpDir/knownLoci_2checkLD_\$snp \
                    --r2 inter-chr \
                    --ld-window-r2 0 \
                    --out $tmpDir/knownLoci_2checkLD_\$snp"
qsubshcom "$tmp_command" 1 30G  checkLD_knownLoci 24:00:00 "$arrays"

# Restrict to correlations with known loci and merge results
sed 1q $results/tmpDir/knownLoci_2checkLD*.ld | uniq | tr ' ' '\t' > $results/knownLoci_2checkLD
for snp in `sed 1d $knownLoci | awk '{print $1}'`
do
  echo $snp
  grep -he $snp $results/tmpDir/knownLoci_2checkLD*.ld >> $results/knownLoci_2checkLD
  
done


# Copy results to local (run from local)
scp $deltaID:$WD/results/fastGWA/knownLoci_2checkLD $local/results/fastGWA/
# ******************************************************************
# Format GWAS results as Table 2 of Wray et. al 2018 (PMID 29700475)
# ******************************************************************

# Read in GWAS results 
df=read.table("results/fastGWA/vitD_fastGWA.gz", h=T, stringsAsFactors=F)
cojo=read.table("results/fastGWA/vitD_fastGWA_maf0.01_hwe1e6.cojo", h=T, stringsAsFactors=F)
df$cojo_indep=ifelse(df$SNP %in% cojo$SNP, "yes","no")

# Rename columns
names(df)[c(3,7)]=c("BP","A1_freq")

# Extract data subsets for John
# File with ALL SNPs
write.table(df, "results/fastGWA/vitD_fastGWA.txt", quote=F, row.names=F)

# File restricted to SNPs with P<0.05
df2=df[df$p<0.05,]
write.table(df2, "results/fastGWA/vitD_fastGWA_p0.05.txt", quote=F, row.names=F)





Results

Note: Unless otherwise specified, all results presented are for variants with MAF>0.01.

Manhattan plot

fastGWA

# Read in fastGWA results ---------------------------
#gwas=read.table("results/fastGWA/vitD_fastGWA_withX.gz", h=T, stringsAsFactors=F)
#saveRDS(gwas, file="results/fastGWA/vitD_fastGWA_withX.RDS")
gwas=readRDS("results/fastGWA/vitD_fastGWA_withX.RDS")
names(gwas)[grep("POS|beta|p",names(gwas))]=c("BP","BETA","P")
cojo=read.table("results/fastGWA/vitD_fastGWA_maf0.01_withX.cojo", h=T, stringsAsFactors=F)
clump=read.table("results/fastGWA/vitD_fastGWA_withX.clumped", h=T, stringsAsFactors=F)


# Add cumulative position + clumping/cojo information for plotting
  # ----- Add chromosome start site to the GWAS result
  don=aggregate(BP~CHR,gwas,max)
  don$tot=cumsum(as.numeric(don$BP))-don$BP
  don=merge(gwas,don[,c("CHR","tot")])
  # ----- Add a cumulative position to plot in this order
  don=don[order(don$CHR,don$BP),]
  # ----- Restrict SNPs with P<0.03 + a random samples of 100K SNPs (to plot faster)
  don=don[don$P<0.03 | don$SNP %in% sample(gwas$SNP,100000),]
  don$BPcum=don$BP+don$tot
  # ----- Add cojo information
  don$is_cojo_sig=ifelse(don$SNP %in% cojo$SNP & don$P<1e-15, "yes","no")
  don$is_cojo=ifelse(don$SNP %in% cojo$SNP, "yes","no")

  # ----- Prepare X axis
  axisdf = don %>% group_by(CHR) %>% summarize(center=( max(BPcum) + min(BPcum) ) / 2 )
  
  
# Generate Manhattan plot
ggplot(don, aes(x=BPcum, y=-log10(P))) +
  geom_point( aes(color=as.factor(CHR)), alpha=0.3, size=0.5) +
  scale_color_manual(values = rep(c("lightsteelblue4", "lightsteelblue3"),44)) +
  # Highlighted SNPs of interest
  geom_point(data=don[don$is_cojo=="yes",], color="red", size=2) +
  #geom_label_repel( data=don[don$is_cojo_sig=="yes",], aes(label=SNP), size=2) +
  # Make X axis
  scale_x_continuous( label=axisdf$CHR, breaks=axisdf$center ) +
  labs(x="")+
  theme_bw() +
  theme(legend.position="none",
        panel.border = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) 

# Keep info to summarise in the report below
fastGWA_common=gwas
fastGWA_cojo_common=cojo
fastGWA_clump_common=clump

Red dots represent independent associations identified with COJO.

BOLT-LMM

# Read in BOLT GWAS results ----------------------------------------
#gwas=read.table("results/bolt/vitD_boltGWAS.linear.gz", h=T, stringsAsFactors=F)
#saveRDS(gwas, file="results/bolt/vitD_boltGWAS.linear.RDS")
gwas=readRDS("results/bolt/vitD_boltGWAS.linear.RDS")
cojo=read.table("results/bolt/vitD_boltGWAS_maf0.01.cojo", h=T, stringsAsFactors=F)


#  Add cumulative position + clumping/cojo information for plotting
  # ----- Add chromosome start site to the GWAS result
  don=aggregate(BP~CHR,gwas,max)
  don$tot=cumsum(as.numeric(don$BP))-don$BP
  don=merge(gwas,don[,c("CHR","tot")])
  # ----- Add a cumulative position to plot in this order
  don=don[order(don$CHR,don$BP),]
  # ----- Restrict SNPs with P<0.03 + a random samples of 100K SNPs (to plot faster)
  don=don[don$P<0.03 | don$SNP %in% sample(gwas$SNP,100000),]
  don$BPcum=don$BP+don$tot
  # ----- Add cojo information
  don$is_cojo_sig=ifelse(don$SNP %in% cojo$SNP & don$P<1e-15, "yes","no")
  don$is_cojo=ifelse(don$SNP %in% cojo$SNP, "yes","no")

  # ----- Prepare X axis
  axisdf = don %>% group_by(CHR) %>% summarise(center=( max(BPcum) + min(BPcum) ) / 2 )
  
  
#  Generate Manhattan plot
ggplot(don, aes(x=BPcum, y=-log10(P))) +
  geom_point( aes(color=as.factor(CHR)), alpha=0.3, size=0.5) +
  scale_color_manual(values = rep(c("lightsteelblue4", "lightsteelblue3"),44)) +
  # Highlighted SNPs of interest
  geom_point(data=don[don$is_cojo=="yes",], color="red", size=2) +
  #geom_label_repel( data=don[don$is_cojo_sig=="yes",], aes(label=SNP), size=2) +
  # Make X axis
  scale_x_continuous( label=axisdf$CHR, breaks=axisdf$center ) +
  labs(x="")+
  theme_bw() +
  theme(legend.position="none",
        panel.border = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank())

# Keep info to summarise in the report below
boltGWAS_common=gwas
boltGWAS_cojo_common=cojo

Red dots represent independent associations identified with COJO.



Summary of results

fastGWA

Briefly, we have:

  • 8806780 variants tested – including 260713 in the X chromosome
  • 18864 genome-wide significant (GWS; P<5x10-8) hits – including 527 in the X chromosome
  • 143 independent variants (determined with COJO) – including 1 in the X chromosome
  • 133 independent variants (determined with COJO) that were GWS in the original GWAS
  • 187 independent loci (determined with clumping) – including 1 in the X chromosome

# ---------- Check number of new loci in our GWAS - based on LD
knownLoci=read.table("input/vitD_knownLoci.txt",h=T,stringsAsFactors=F, sep="\t")
knownLoci=merge(knownLoci, fastGWA_common[,c("SNP","BP")], all.x=T)

r2_thrsh=0.001
ld=read.table("results/fastGWA/knownLoci_2checkLD", h=T, stringsAsFactors=F)
knownLoci_SNPs=c(ld[ld$R2>r2_thrsh,"SNP_A"], ld[ld$R2>r2_thrsh,"SNP_B"])

new_gwsSNPs=fastGWA_common[!fastGWA_common$SNP %in% knownLoci_SNPs & fastGWA_common$P<5e-8,]
new_cojoSNPs=fastGWA_cojo_common[!fastGWA_cojo_common$SNP %in% knownLoci_SNPs,]

# ---------- Check number of new loci in our GWAS - based on COJO results
indep=read.table("results/fastGWA/vitD_fastGWA_withX_conditionOnKnownLoci.gz", h=T, stringsAsFactors=F)
indep_cojo=read.table("results/fastGWA/vitD_fastGWA_withX_conditionOnKnownLoci_maf0.01.cojo", h=T, stringsAsFactors=F)



# ---------- Present table of previously known variants
knownLoci=knownLoci[order(knownLoci$CHR, knownLoci$BP),c("SNP","CHR","BP","ref")]
tmp2=kable(knownLoci, 
           caption=paste0("List of previously reported GWS loci (N=",nrow(knownLoci),")"), 
           row.names=F)
kable_styling(tmp2, full_width=F)
List of previously reported GWS loci (N=6)
SNP CHR BP ref
rs2282679 4 72608383 Ahn et al, Wang et al, Jiang et al
rs10741657 11 14914878 Ahn et al, Wang et al, Jiang et al
rs12785878 11 71167449 Ahn et al, Wang et al, Jiang et al
rs10745742 12 96358529 Jiang et al
rs8018720 14 39556185 Jiang et al
rs6013897 20 52742479 Wang et al, Jiang et al

To assess the number of new associations in our study, we conditioned our results on the list of variants above.

After conditioning on those, we had:

  • 14311 GWS associations – including 527 in the X chromosome
  • 140 independent variants (determined with COJO) – including 1 in the X chromosome

We also checked how many variants were GWS in our analysis and were not in LD (r2>0.001) with a nearby (i.e. within 1Mb of) variant previously reported to associate with vitamin D (see list above). Using this criteria, we had:

  • 12933 GWS hits
  • 127 novel independent (determined with COJO) variants
  • 117 novel independent (determined with COJO) variants that were GWS in the original GWAS
# ################################################
# Present list of independent associations (identified with COJO)
# ################################################
indep = fastGWA_cojo_common %>% arrange(Chr, bp)

# Format table
indep=indep[,c("Chr","SNP","bp","refA","freq","b","se","p","n","freq_geno","bJ","bJ_se","pJ")]
indep$p=format(indep$p, digits=3)
indep$pJ=format(indep$pJ, digits=3)
indep=indep %>% mutate_at(vars(se, bJ_se, b, bJ), round, 4)
indep=indep %>% mutate_at(vars(freq, freq_geno), round, 2)


# Present table
datatable(indep, 
          rownames=FALSE, 
          extensions=c('Buttons','FixedColumns'),
          options=list(pageLength=5,
                       dom='frtipB',
                       buttons=c('csv', 'excel'),
                       scrollX=TRUE),
          caption="List of independent associations") %>% 
  formatStyle("p","white-space"="nowrap") %>% 
  formatStyle("pJ","white-space"="nowrap")

Columns are: Chr, chromosome; SNP, SNP rs ID; bp, physical position; refA, the effect allele; type, type of QTL; freq, frequency of the effect allele in the original data; b, se and p, effect size, standard error and p-value from the original GWAS; n, estimated effective sample size; freq_geno, frequency of the effect allele in the reference sample; bJ, bJ_se and pJ, effect size, standard error and p-value from a joint analysis of all the selected SNPs.

BOLT-LMM

Briefly, we have:

  • 8546068 variants tested
  • 20223 genome-wide (P<5x10-8) hits
  • 161 independent associations (determined with COJO)
  • 151 independent associations (determined with COJO) that were genome-wide significant in the original GWAS

# ################################################
# Present list of independent associations (identified with COJO)
# ################################################

indep = boltGWAS_cojo_common %>% arrange(Chr, bp)

# Format table
indep=indep[,c("Chr","SNP","bp","refA","freq","b","se","p","n","freq_geno","bJ","bJ_se","pJ")]
indep$p=format(indep$p, digits=3)
indep$pJ=format(indep$pJ, digits=3)
indep=indep %>% mutate_at(vars(se, bJ_se, b, bJ), round, 4)
indep=indep %>% mutate_at(vars(freq, freq_geno), round, 2)


# Present table
datatable(indep, 
          rownames=FALSE, 
          extensions=c('Buttons','FixedColumns'),
          options=list(pageLength=5,
                       dom='frtipB',
                       buttons=c('csv', 'excel'),
                       scrollX=TRUE),
          caption="List of independent associations") %>% 
  formatStyle("p","white-space"="nowrap") %>% 
  formatStyle("pJ","white-space"="nowrap")

Columns are: Chr, chromosome; SNP, SNP rs ID; bp, physical position; refA, the effect allele; type, type of QTL; freq, frequency of the effect allele in the original data; b, se and p, effect size, standard error and p-value from the original GWAS; n, estimated effective sample size; freq_geno, frequency of the effect allele in the reference sample; bJ, bJ_se and pJ, effect size, standard error and p-value from a joint analysis of all the selected SNPs.

QQplot + LDSC

To run LDSC, we restricted the GWAS summary statistics to HapMap3 SNPs, excluding the MHC region and X chromosome. For more information see the Methods section above.

fastGWA

# Get sample for QQ plotting
gwas=fastGWA_common
pthrh=0.03
tmp=c(gwas[gwas$P<pthrh,"P"], sample(gwas[gwas$P>pthrh,"P"],100000))

# Plot
qq(tmp)

LDSC

tmp=read.table("results/fastGWA/ldsc/vitD_fastGWA.ldsc.log", skip=15, fill=T, h=F, colClasses="character", sep="_")[,1]

# Get results:
h2=strsplit(grep("Total Observed scale h2",tmp, value=T),":")[[1]][2]
lambda_gc=strsplit(grep("Lambda GC",tmp, value=T),":")[[1]][2]
chi2=strsplit(grep("Mean Chi",tmp, value=T),":")[[1]][2]
intercept=strsplit(grep("Intercept",tmp, value=T),":")[[1]][2]
ratio=strsplit(grep("Ratio",tmp, value=T),":")[[1]][2]
  • Total Observed scale h2: 0.0966 (0.0138)
  • Lambda GC: 1.4817
  • Mean Chi2: 1.881
  • Intercept: 1.0628 (0.0136)
  • Ratio: 0.0713 (0.0154)

BOLT-LMM

# Get sample for QQ plotting
gwas=boltGWAS_common
pthrh=0.03
tmp=c(gwas[gwas$P<pthrh,"P"], sample(gwas[gwas$P>pthrh,"P"],100000))

# Plot
qq(tmp)

LDSC

tmp=read.table("results/bolt/vitD_boltGWAS.ldsc.log", skip=15, fill=T, h=F, colClasses="character", sep="_")[,1]

# Get results:
h2=strsplit(grep("Total Observed scale h2",tmp, value=T),":")[[1]][2]
lambda_gc=strsplit(grep("Lambda GC",tmp, value=T),":")[[1]][2]
chi2=strsplit(grep("Mean Chi",tmp, value=T),":")[[1]][2]
intercept=strsplit(grep("Intercept",tmp, value=T),":")[[1]][2]
ratio=strsplit(grep("Ratio",tmp, value=T),":")[[1]][2]
  • Total Observed scale h2: 0.092 (0.013)
  • Lambda GC: 1.4926
  • Mean Chi2: 1.9231
  • Intercept: 1.0706 (0.014)
  • Ratio: 0.0765 (0.0152)

SBayesR

fastGWA

# Copy fastGWA results from Delta to RCC ( -- run this step on delta -- )
scp $WD/results/fastGWA/maFormat/vitD_fastGWA.ma $rooID:$WD/results/fastGWA/maFormat/

# #########################################
# Run SBayesR
# #########################################

# Delta settings
ldm=input/ukbEURu_imp_v3_HM3_n50k.chisq10.ldm.sparse
gctb=$bin/gctb/gctb_2.0_Linux/gctb
account=""

# RCC settings
mldm=$llloyd/sbayesr/pre_review/ukbEURu_hm3_sparse/ukbEURu_hm3_sparse_mldm_list.txt
gctb=$bin/gctb/gctb_2.0_Linux/gctb
account="-acct=UQ-IMB-CNSG"

# Common settings
prefix=vitD_fastGWA
results=$WD/results/fastGWA
gwas=$results/maFormat/$prefix.ma
output=$results/$prefix.sbayesr

# Run SBayesR
cd $results
tmp_command="$gctb --sbayes R \
                   --gwas-summary $gwas \
                   --mldm $mldm \
                   --out $output"

qsubshcom "$tmp_command" 1 70G $prefix.sbayesr 60:00:00 "$account"


# Copy results to local computer (run command in local computer)
scp $rooID:$WD/results/fastGWA/vitD_fastGWA.sbayesr.parRes $local/results/fastGWA/
tmp=read.table("results/fastGWA/vitD_fastGWA.sbayesr.parRes",h=T,stringsAsFactors=F, skip=2)
h2=round(as.numeric(tmp["hsq",][1]),4)
se=round(as.numeric(tmp["hsq",][2]),4)
  • SBayesR h2 (SE): 0.1266 (0.0015)