GWAS_workshopx

Download Report

Transcript GWAS_workshopx

Genome-Wide Association Studies
(GWAS)
• Study design: Case/Control, Family-based, Cohort
• Phenotype: Dichotomous, Quantitative
• 103 – 105 individuals; 105 – 106 polymorphisms
Genotyping
rs1372493
rs1372493
1.60
16000
1.40
14000
1.20
12000
1
8000
Norm R
Intensity (B)
10000
6000
0.80
0.60
4000
0.40
2000
0.20
0
0
-2000
2317
834
74
-0.20
0
2000
4000
6000
8000
10000
12000
14000
16000
Intensity (A)
Cartesian coordinate view
18000
20000
0
0.20
0.40
0.60
0.80
Norm Theta
Polar coordinate view
1
Genotyping
Poor cluster plot
PLINK
• What is PLINK?
– A software to analyse phenotype/genotype data
– Run from the command line
• Why should we use PLINK?
– Probably the most common tool used to analyse
(human) GWAS data
– Free and open source
– Designed to perform a wide range of basic, large-scale analyses
in computationally efficient manner
– Can be used on several platforms
– No programming ability required and excellent
documentation
The Original PLINK
http://pngu.mgh.harvard.edu/~purcell/plink/
PLINK 1.9: Speedier (but less informative)
https://www.cog-genomics.org/plink2/
How to get PLINK (for Windows)
• Determine if your PC is 32 bit or 64 bit
How to get PLINK (for Windows)
• Determine if your PC is 32 bit or 64 bit
• Download the relevant stable build from the
PLINK 1.9 website to a convenient location, then
unzip
the PLINK program/executable
Running PLINK (on Windows)
• PLINK is run from the command prompt
• Navigate to the location of the data or PLINK executable
using the cd command (cd = change directory)
Note
• The command prompt needs to be told where
the PLINK executable and the data is
• Easiest to direct the command prompt to the
same folder/directory as your data
• If the PLINK executable is here, all good
• If not, put the path of PLINK’s location in your
environment path to make it easier to call
> echo %PATH%
> path = C:\PLINK_location;%PATH%
Note
• The command prompt needs to be told where
the PLINK exexutable and the data is
• Easiest to direct the command prompt to the
same folder/directory as your data
• If the PLINK executable is here, all good
• If not, put the path of PLINK’s location in your
environment path to make it easier to call
• This process is temporary and will only work for
the current window
File Formats
• filename.ped
• filename.map
text based pedigree files
text based map file
• filename.bed
• filename.bim
• filename.fam
binary genotype file
marker file
family/individual file
• PHENO file phenotype file
• COVAR file covariate file
PED Files: the individual & genotype file
MAP file: the marker information file
Genetic distance
The PED and MAP file
• The Ped(igree) Files
1.
2.
3.
4.
5.
6.
Family ID
Individual ID
Paternal ID
Maternal ID
Sex (1=male; 2=female; other=unknown)
Phenotype (pheno)
• The Map Files
1. chromosome (1-22,X|23,Y|24,PAR|25,MT|26 or
0=unplaced)
2. rs# or snp identifier
3. Genetic distance (morgans)
4. Base-pair position (bp units)
Binary Files: A More Efficient Way to
Store PED and MAP files
• BED file: binary file, genotype information
• BIM file: extended MAP file: two extra columns = allele
names
• FAM file: first six columns of PED file
PLINK Commands
For PED/MAP files:
> plink --file filename –-options –-out outfile
filename
without extension, PLINK will look for filename.ped
and filename.map
options
various kind of options, see the following slides and
documentation
outfile
optional output name (without extension); if --out is absent,
output file will be named plink.suffix (where suffix depends on
the option chosen)
For BED/BIM/FAM files:
> plink --bfile filename –-options –-out outfile
Note: may need to type plink.exe on windows to call the program
Rules to Remember
• Always consult the log file
• Consult the web documentation regularly
• PLINK has no memory, each run loads data anew,
previous filters lost
• Exact syntax and spelling is important
– “minus minus” … “dash dash” … “hyphen
hyphen”
• Check the analyses are doing what you expect
Data Management
--recode
creates a new PED/MAP fileset after applying any specified operations
--make-bed
creates a new binary fileset after applying any specified operations
--update-map
update variant base-pair positions; requires text file containing marker name
in column 1 and new base-pair position in column 2
--update-ids
update sample IDs; requires text file containing original family ID in column
1, original individual ID in column 2, new family ID in column 3 and new
individual ID in column 4
--flip
Given a file containing a list of SNPs with A/C/G/T alleles, --flip swaps A↔T
and C↔G
--bmerge
--bmerge merges a specified binary fileset with the input data (which is
considered the reference)
See https://www.cog-genomics.org/plink2/index for complete list of all options
Input Filtering
--keep
accepts a text file with family IDs in column 1 and individual IDs in column 2 and
removes all unlisted samples from the current analysis
--remove
accepts a text file with family IDs in column 1 and individual IDs in column 2 and
removes all listed samples from the current analysis
--mind
filters out all individuals with missing call rates exceeding the provided value
--extract
accepts a text file with a list of variant IDs and removes all unlisted variants
from the current analysis
--exclude
accepts a text file with a list of variant IDs and removes all listed variants from
the current analysis
--chr
excludes all variants not on the listed chromosome(s); --from-kb and –to-kb may
be added to restrict analysis to a particular region of the specified chromosome
--geno
filters out all variants with missing call rates exceeding the provided value
--maf
filters out all variants with minor allele frequency below the provided threshold
--hwe
filters out all variants which have Hardy-Weinberg equilibrium exact test p-value
below the provided threshold
Quality Control of a GWAS dataset
• GWAS_build36.bed, GWAS_build36.bim,
GWAS_build36.fam
• 897 cases and 963 controls (simulated phenotype)
from Ireland and Britain genotyped on an Illumina chip
• 420755 markers genotyped, from chromosomes 1-22,
X and pseudoautosomal regions
• Important to have build information
• Important to have strand information
Quality Control:
Sample Call Rate & Heterozygosity
• Low call rate (high missingness) indicates poor DNA quality
• High heterozygosity can indicate sample contamination
• Low heterozygosity can occur for many reasons
--autosome
excludes all unplaced and non-autosomal variants
--missing
produces sample-based (plink.imiss) and variant-based (plink.lmiss) missing
data reports
--het
computes observed and expected autosomal homozygous genotype counts for
each sample
• Step 1: Create a QC file in Excel, with one individual per row
• Step 2: Calculate missingness rates for each person (based on good
quality, autosomal SNPs)
• Step 3: Calculate heterozygosity values for each person (based on
good quality, autosomal SNPs)
Quality Control:
Sample Gender Check
• Compare listed gender with gender predicted based on X
chromosome genotypes to identify potential sample mix-up
----check-sex
compares sex assignments in the input dataset with those imputed from X
chromosome inbreeding coefficients. By default, F estimates smaller than
0.2 yield female calls, and values larger than 0.8 yield male calls
• Step 4: Perform a check-sex for each person (based on good quality
X-chromosome SNPs)
• Step 5: Remove individuals with low call rates and/or failing the sexcheck. Heterozygosity?
Quality Control:
Genetic Relationships & IBD Sharing
IBD= Identical by descent
IBS= Identical by state
C/T
C/T
1
Relationship
Parent-Offspring
Full Siblings
Half Siblings
Aunt-Niece
Double First Cousins
First Cousins
Second Cousins
Unrelated
2
All of the children share 2 alleles IBS
Child 1 & 2 share 2 alleles IBD
Child 2 & 3 share 1 allele IBD
Child 2 & 4 share 0 alleles IBD
T/T
C/C
C/T
3
C/T
4
IBD Sharing Probabilities
IBD-2
IBD-1
IBD-0
0
1
0
1/4
1/2
1/4
0
1/2
1/2
0
1/2
1/2
1/16
6/16
9/16
0
1/4
3/4
0/16
1/16
15/16
0
0
1
Proportion of
Genome Shared IBD
50%
50%
25%
25%
25%
12.5%
3.125%
0%
Quality Control:
Population Stratification



Imagine a sample of individuals drawn from a population consisting of two distinct subgroups which
differ in allele frequency.
If the prevalence of disease is greater in one sub-population, then this group will be over-represented
amongst the cases.
Any marker which is also of higher frequency in that subgroup will appear to be associated with the
disease
Quality Control:
Sample Ethnicity
Compare genetic similarity/dissimilarity of GWAS individuals to others of different ethnicity
Use Principal Components Analysis (PCA)
JPT/CHB
CEU
Outliers
YRI
Quality Control:
Adjust for Population Structure
Linkage Disequilibrium (LD)
• Linkage disequilibrium: the non-random association of alleles at linked loci
• A measure of the tendency of some alleles to be inherited together on
haplotypes descended from ancestral chromosomes
• Consider a G/A SNP and a nearby C/T SNP
• Theoretically, there are 4 possible haplotypes:
• G-T
• G-C
• A-T
• A-C
• If however only the G-T and A-C haplotypes are observed in the population,
then the 2 SNPs are in perfect linkage disequilibrium, they are perfectly
correlated
• If we genotype the first SNP, we know what the alleles are at the second SNP
Quality Control:
Creating a LD-pruned Dataset
• Checking for relatedness is a relatively long process, but can be speeded
up using a reduced dataset (i.e. less SNPs)
• PCA is LD-sensitive; the dataset must be LD-pruned first
• Makes sense to use the same (reduced) set of SNPs for both processes
--range
--exclude normally removes all listed variants from the current analysis.
With the 'range' modifier, all variants within chromosomal regions specified
in a text file are excluded
--indep-pairwise requires three parameters: a window size in variant count or kilobase (if the
'kb' modifier is present) units, a variant count to shift the window at the
end of each step, a pairwise r2 threshold: at each step, pairs of variants in
the current window with r2 greater than the threshold are noted, and
variants are greedily pruned from the window until no such pairs remain.
• Step 6: Identify a list of good quality SNPs, excluding SNPs within known
regions of extensive LD, and perform further LD-pruning
• Step 7: Create a new binary fileset, including only the SNPs identified in
Step 6
Relationship Testing:
Pairwise IBD Calculations
--genome
invokes an IBS/IBD computation, and then writes a report to plink.genome.
The report includes the proportion of the genome shared IBD (PI-HAT)
between pairs of individuals
--min
plink.genome files can be VERY large. –min can be used to restrict reporting
to those pairs of individuals where PI-HAT exceeds a specified threshold (e.g.
restrict to those related at 1st cousin level or closer)
• Step 8: Calculate pairwise IBD for all individuals remaining, but
restrict reporting to those with PI-HAT values greater than 0.1
• Step 9: Remove one of each related pair from dataset prior to
population structure analysis
Population Structure:
Identifying Non-Europeans
--bmerge
--bmerge merges a specified binary fileset with the input data (which is
considered the reference)
--flip
Given a file containing a list of SNPs with A/C/G/T alleles, --flip swaps A↔T and
C↔G
--pca
--pca extracts the top specified number of principal components of the variancestandardized relationship matrix. Eigenvectors are written to plink.eigenvec, and
top eigenvalues are written to plink.eigenval.
• Step 10: Merge unrelated pruned dataset with hapmap3 data (known ethnicities, forward
strand, build36)
• Step 11: Flip strand in GWAS data for SNP flagged by bmerge process and make new binary
dataset
• Step 12: Repeat merge with hapmap3, this time using flipped dataset. Add geno filter to
remove any SNPs not genotyped in both datasets
• Step 13: Perform PCA on merged dataset and extract top 2 principal components. Include
header in a tab-delimited report
• Step 14: Remove non-European individuals
http://www.well.ox.ac.uk/~wrayner/strand/
Population Structure:
Generate PCs to use as Covariates
• Step 15: Perform PCA on European dataset and extract top 10
principal components. Include header in a tab-delimited report
• Step16: Remove individuals failing QC from original GWAS dataset
https://github.com/DReichLab/EIG
Analysis:
Apply Filters and Analyse
--geno
filters out all variants with missing call rates exceeding the provided value
--maf
filters out all variants with minor allele frequency below the provided
threshold
--hwe
filters out all variants which have Hardy-Weinberg equilibrium exact test pvalue below the provided threshold
--logistic
performs logistic regression given a case/control phenotype and some
covariates
--covar
--covar designates the file to load covariates from. The file format is optional
header line, FID and IID in first two columns, covariates in remaining columns.
By default, the main phenotype is set to missing if any covariate is missing
--covar-name
lets you specify a subset of covariates to load, by column name; separate
multiple column names with spaces or commas, and use dashes to designate
ranges
• Step 17: Perform logistic regression analysis using PC(s) and any other
appropriate covariates, applying appropriate SNP filters
Analysis:
Example QQ Plots
Quantile - Quantile (QQ) plots are informative
No signal
Enrichment of low p-values
May be true association
Population stratification?
Polygenic signal?
Analysis:
Generate Plots
--adjust qq-plot
--adjust causes an .adjusted file to be generated with each association test
report, containing several basic multiple testing corrections for the raw pvalues. 'qq-plot' adds a quantile column to simplify QQ plotting.
• Step 18: Repeat analysis adding a flag to generate random P-values
expected under the null hypothesis
• Step 19: Create QQ plot in R
data<-read.table(file="GWAS_build36_postQC_analysis_adj.assoc.logistic.adjusted", header=T)
plot(-log(data$QQ, 10), -log(data$UNADJ,10), xlab = "Expected –logP values", ylab = "Observed –logP values")
abline(a=0, b=1)
• Step 20: Create Manhattan plot in Haploview
Analysis:
Generate Manhattan Plot in Haploview