Ensembl Variations

Sequence Variation in Ensembl
Outline
• SNPs
• SNPs in Ensembl
• Linkage disequilibrium
• SNPs in BioMart
• DAS sources
Single nucleotide polymorphisms (SNPs)
• Two human genomes differ by ~0.1%
• Polymorphism: a DNA variation in which each possible sequence is present in at least 1% of people
• Most polymorphisms (~90%) take the form of SNPs: variations that involve just one nucleotide
  • ~1 out of every 300 bases in the human genome
  • ~10 million in the human genome
Functional Consequences
• SNPs in coding regions that alter the amino acid (aa) sequence
  • Cause of most monogenic disorders, e.g. hemochromatosis (HFE), cystic fibrosis (CFTR), hemophilia (F8)
• SNPs in coding regions that don’t alter the aa sequence
  • May affect splicing
• SNPs in promoter or regulatory regions
  • May affect the level, location or timing of gene expression
• SNPs in other regions
  • No known direct impact on phenotype; useful as markers
Practical Applications
• Disease diagnosis
• Association studies
• Pharmacogenomics
• Forensic testing
• Population genetics and evolutionary studies
• Marker-assisted selection
SNPs in Ensembl
• Most SNPs imported from dbSNP (rs……):
  • Imported data: alleles, flanking sequences, frequencies, ….
  • Calculated data: position, synonymous status, peptide shift, ….
• For human also:
  • HGVbase
  • TSC
  • Affy GeneChip 100K and 500K Mapping Array
  • Affy Genome-Wide SNP array 6.0
  • Ensembl-called SNPs (from Celera reads and Jim Watson’s and Craig Venter’s genomes)
• For mouse, rat, dog and chicken also:
  • Sanger- and Ensembl-called SNPs (other strains / breeds)
dbSNP
• Central repository for simple genetic polymorphisms:
  • single-base nucleotide substitutions
  • small-scale multi-base deletions or insertions
  • retroposable element insertions and microsatellite repeat variations
• http://www.ncbi.nlm.nih.gov/SNP/index.html
• For human (dbSNP build 128):
  • 34,434,159 submissions (ss#’s)
  • 11,883,685 RefSNP clusters (rs#’s)
  • 6,262,709 validated
  • 737,679 with frequency
SNPs in Ensembl - Types
Type                    Description
Non-synonymous          In coding sequence, resulting in an aa change
Synonymous              In coding sequence, not resulting in an aa change
Frameshift              In coding sequence, resulting in a frameshift
Stop lost               In coding sequence, resulting in the loss of a stop codon
Stop gained             In coding sequence, resulting in the gain of a stop codon
Essential splice site   In the first 2 or the last 2 basepairs of an intron
Splice site             1-3 bps into an exon or 3-8 bps into an intron
Upstream                Within 5 kb upstream of the 5'-end of a transcript
Regulatory region       In regulatory region annotated by Ensembl
5' UTR                  In 5' UTR
Intronic                In intron
3' UTR                  In 3' UTR
Downstream              Within 5 kb downstream of the 3'-end of a transcript
Intergenic              More than 5 kb away from a transcript
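The distance-based type definitions above can be illustrated with a small Python sketch. This is not Ensembl's code: the function names, the coordinate conventions (1-based genomic positions, strand given as +1 or -1) and the handling of the exact 5 kb boundary are assumptions made for illustration only.

```python
FLANK = 5_000  # the 5 kb window used for "Upstream" / "Downstream"


def classify_flank(pos, tx_start, tx_end, strand):
    """Classify a genomic position relative to a single transcript.

    tx_start <= tx_end are genomic coordinates; strand is +1 or -1.
    Positions inside the transcript are not broken down further here.
    """
    if tx_start <= pos <= tx_end:
        return "within transcript"
    if pos < tx_start:
        distance = tx_start - pos
        # Left of the transcript is 5' (upstream) only on the + strand.
        side = "upstream" if strand == 1 else "downstream"
    else:
        distance = pos - tx_end
        side = "downstream" if strand == 1 else "upstream"
    return side if distance <= FLANK else "intergenic"


def classify_intron_position(dist_into_intron):
    """Classify an intronic base by its 1-based distance from the nearest exon."""
    if dist_into_intron <= 2:
        return "essential splice site"
    if dist_into_intron <= 8:
        return "splice site"
    return "intronic"


print(classify_flank(900, 1000, 2000, strand=1))
print(classify_flank(900, 1000, 2000, strand=-1))
print(classify_intron_position(5))
```

Note that the same genomic position is upstream of a + strand transcript but downstream of a - strand transcript, because "upstream" is defined relative to the transcript's 5'-end.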
SNPs in Ensembl - Species
• Human
• Chimp
• Mouse
• Rat
• Dog
• Cow
• Platypus
• Chicken
• Zebrafish
• Tetraodon
• Mosquito
Caveat
For human, mouse and rat, Ensembl defines all SNP alleles relative to the + (forward) strand of the genome assembly, so that dbSNP data can be merged with Sanger resequencing data.
Exception: cases where SNPs are shown as part of a sequence.
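As a rough illustration of the caveat above: alleles reported on the - strand (for example in the coding orientation of a gene that lies on the reverse strand) have to be complemented to be expressed on the + strand. The helper below is a minimal sketch; its name and the "X/Y" allele-string format are assumptions, not part of any Ensembl API.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def to_forward_strand(alleles, strand):
    """Express an allele string such as 'C/T' on the + strand.

    strand is +1 or -1. Each allele is reverse-complemented, so
    multi-base alleles are handled too; '-' (deletion) is unchanged.
    """
    if strand == 1:
        return alleles
    return "/".join(a.translate(COMPLEMENT)[::-1] for a in alleles.split("/"))


print(to_forward_strand("C/T", -1))  # G/A
```

A C/T change described on the - strand therefore appears as G/A on the + strand. The order in which a page lists the two alleles is a separate display choice not covered by this sketch.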
5 MINUTE EXERCISE
A missense SNP, C1858T, in PTPN22 (Tyrosine-protein phosphatase non-receptor type 22) has been identified as a genetic risk factor for rheumatoid arthritis. This SNP is also referred to as R620W.
1. Find the SNPView page for this SNP.
2. Why are the alleles on this page given as A/G?
3. What is the minor allele of this SNP in Caucasians?
SNPs in Ensembl - GeneSNPView
[Screenshots of GeneSNPView; callouts mark the transcript, InterPro domains and SNP alleles]
SNPs in Ensembl - TranscriptSNPView
Shows SNP alleles in different:
• Individuals (human): Celera HuAA, HuCC, HuDD and HuFF, Craig Venter, Jim Watson
• Strains (mouse, rat)
• Breeds (chicken, dog)
[Screenshots of TranscriptSNPView; callouts mark the different individuals, resequencing coverage, SNP alleles and the alleles in different individuals]
5 MINUTE EXERCISE
1. Find the TranscriptSNPView page for human PTPN22.
2. Do all individuals (HuAA, HuCC, HuDD, HuFF, Venter and Watson) have resequencing coverage at the position of the C1858T (R620W) SNP?
3. Do any of the individuals have a higher risk of developing rheumatoid arthritis based on their genotype at this position?
4. Is any individual heterozygous at this position?
Haplotypes and Linkage Disequilibrium
A haplotype is a set of SNPs on a single chromatid that are statistically associated.
Linkage disequilibrium describes a situation in which some combinations of SNP alleles occur more or less frequently in a population than would be expected from a random formation of haplotypes from alleles based on their frequencies.
Measures of LD
• D = P(AB) – P(A)P(B)
  • D ranges from –0.25 to +0.25
  • D = 0 indicates linkage equilibrium
  • dependent on allele frequencies, therefore of little use
• D’ = D / maximum possible value of D for the given allele frequencies
  • |D’| = 1 indicates complete LD
  • estimates of D’ strongly inflated in small samples
• r² = D² / (P(A)P(B)P(a)P(b))
  • r² = 1 indicates perfect LD
  • measure of choice
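The three measures above can be worked through with a short Python sketch. It assumes two biallelic SNPs with alleles A/a and B/b, takes the haplotype frequency P(AB) and the allele frequencies P(A) and P(B) as inputs, and uses the standard definition of the maximum possible D. The function name is illustrative and is not taken from Ensembl or HaploView.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying both allele A and allele B
    p_a, p_b: frequencies of allele A and allele B (strictly between 0 and 1)
    """
    d = p_ab - p_a * p_b
    # Largest |D| attainable with these allele frequencies; depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0  # keeps the sign of D
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


# Both alleles at 50% and always inherited together.
print(ld_measures(p_ab=0.5, p_a=0.5, p_b=0.5))
# B is rarer than A and only ever occurs on A haplotypes.
print(ld_measures(p_ab=0.2, p_a=0.5, p_b=0.2))
```

In the first case D reaches its upper limit of 0.25 and both D’ and r² equal 1. In the second case D’ is still 1, because one of the four possible haplotypes is absent, but r² is only about 0.25, because the allele frequencies differ.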
Linkage Disequilibrium - LDView
It is also possible to export SNP information for upload into the HaploView software tool.
Linkage Disequilibrium - LDTableView
5 MINUTE EXERCISE
Retrieve all non-synonymous SNPs for the human CFTR gene using BioMart and export their ID, genomic position, alleles and peptide shift (hint: which dataset should you start with?).
DAS Sources
For human, data from the following DAS sources can be visualised on ContigView:
• DGV and DGV loci: structural variations from the Database of Genomic Variants (CNVs, InDels, inversions etc.)
• RedonCNV regions and RedonCNV loci: copy number variations from the Redon et al. paper
• SegDup Washu: segmental duplications, University of Washington
Questions and Answers