Variations - Bioinformatics Unit
Variation and Functional Genomics
Overview
• Genomic Diversity (SNPs)
• Variations in the Ensembl Browser
• Human genome
• HapMap
• Gen2Phen and EGA
• A bit about Functional Genomics
Genomic Diversity
• SNPs (Single Nucleotide Polymorphisms): base pair substitutions
• InDels: insertions/deletions (can cause frameshifts)
• SNPs occur about once in every 300 bp (human)
• ~3 billion base pairs in mammalian genomes!
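As a rough back-of-the-envelope check, the two figures on this slide can be combined to estimate how many SNP sites to expect in a human genome. The sketch below assumes SNPs are spread evenly along the genome, which is a simplification; the constant and function names are illustrative only.

```python
# Rough estimate only: assumes a uniform SNP rate along the genome.
GENOME_SIZE_BP = 3_000_000_000  # ~3 billion bp (figure from the slide)
BP_PER_SNP = 300                # ~1 SNP every 300 bp in human (figure from the slide)


def expected_snp_count(genome_size_bp: int, bp_per_snp: int) -> int:
    """Return the expected number of SNP sites under a uniform rate."""
    return genome_size_bp // bp_per_snp


print(expected_snp_count(GENOME_SIZE_BP, BP_PER_SNP))  # 10000000
```

Under these assumptions the estimate is on the order of 10 million SNP sites.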
Single nucleotide polymorphisms (SNPs)
• Polymorphism: a DNA variation
in which each possible
sequence is present in at least
1% of the population
• Most polymorphisms (~90%)
take the form of SNPs:
variations that involve just one
nucleotide
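The 1% rule can be written as a simple frequency check. This is a minimal sketch that reads the definition as "at least two alleles each reach the threshold"; the function name and the allele counts are made up for illustration.

```python
def is_polymorphic(allele_counts: dict, threshold: float = 0.01) -> bool:
    """True if at least two alleles each reach the frequency threshold.

    allele_counts maps an allele (e.g. "A") to the number of chromosomes
    carrying it in the sampled population (hypothetical data).
    """
    total = sum(allele_counts.values())
    if total == 0:
        return False
    common = [a for a, n in allele_counts.items() if n / total >= threshold]
    return len(common) >= 2


# Minor allele at 5%: counts as a polymorphism under the 1% rule.
print(is_polymorphic({"A": 950, "G": 50}))
# Minor allele at 0.1%: a rare variant, not a polymorphism.
print(is_polymorphic({"A": 999, "G": 1}))
```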
Origin of SNPs
[Figure: origin of SNPs, shown as alleles v, w, x, y and z changing in frequency. Adapted from Bioinformatics for Geneticists, Eds Barnes and Gray]
• Mutation in individuals
• Selection of alleles
• SNP: increase of the allele to a substantial population frequency
• Fixation of the allele in a population
Functional Consequences
Type: Consequence
• SNPs in coding regions that alter the aa sequence: cause of most monogenic disorders, e.g. cystic fibrosis (CFTR), hemophilia (F8)
• SNPs in coding regions that don’t alter the aa sequence: may affect splicing
• SNPs in promoter or regulatory regions: may affect the level, location or timing of gene expression
• SNPs in other regions: no direct known impact on phenotype; useful as markers
Studying variation – why?
• SNPs can cause disease
(SNP in clotting factor IX codes for a stop codon: haemophilia)
• SNPs can increase disease risk
(SNP in LDL receptor reduces efficiency: high cholesterol)
• SNPs can affect drug response
(SNP in CYP2D6, a gene in the drug breakdown pathway in the liver, disrupts breakdown of debrisoquine, a treatment for high blood pressure)
Studying variation – why?
• Determine disease risk
• Individualised medicine
(pharmacogenomics)
• Forensic studies
• Biological markers
• Hybridisation studies, marker-assisted
breeding
• Understanding Evolution
Practical Applications
SNPs in Ensembl
• Most SNPs imported from dbSNP (rs……):
  • Imported data: alleles, flanking sequences, frequencies, ….
  • Calculated data: position, synonymous status, peptide shift, ….
• For human also:
  • HGVbase
  • Affy GeneChip 100K and 500K Mapping Array
  • Affy Genome-Wide SNP array 6.0
  • Ensembl-called SNPs (from Celera reads and Jim Watson’s and Craig Venter’s genomes)
• For mouse, rat, dog and chicken also:
  • Sanger- and Ensembl-called SNPs (other strains / breeds)
dbSNP
• Central repository for simple genetic
polymorphisms:
• single-base nucleotide substitutions
• small-scale multi-base deletions or insertions
• retroposable element insertions and
microsatellite repeat variations
http://www.ncbi.nlm.nih.gov/SNP/
• For human (dbSNP build 129):
• 19,125,432 submissions (ss#’s)
• 2,920,818 new RefSNPs (rs#’s)
SNPs in Ensembl - Types
Type: Description
• Non-synonymous: in coding sequence, resulting in an aa change
• Synonymous: in coding sequence, not resulting in an aa change
• Frameshift: in coding sequence, resulting in a frameshift
• Stop lost: in coding sequence, resulting in the loss of a stop codon
• Stop gained: in coding sequence, resulting in the gain of a stop codon
• Essential splice site: in the first 2 or the last 2 basepairs of an intron
• Splice site: 1-3 bps into an exon or 3-8 bps into an intron
• Upstream: within 5 kb upstream of the 5'-end of a transcript
• Regulatory region: in regulatory region annotated by Ensembl
• 5' UTR: in 5' UTR
• Intronic: in intron
• 3' UTR: in 3' UTR
• Downstream: within 5 kb downstream of the 3'-end of a transcript
• Intergenic: more than 5 kb away from a transcript
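Some of these categories follow mechanically from the codon change or from the distance to a transcript. The sketch below is an illustration of those rules only, not Ensembl's own implementation; the function names, the `side` argument and the example codons are assumptions made for this example.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}  # '*' marks a stop codon


def coding_consequence(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-codon substitution inside a coding sequence."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "Synonymous"
    if alt_aa == "*":
        return "Stop gained"
    if ref_aa == "*":
        return "Stop lost"
    return "Non-synonymous"


def flanking_consequence(distance_bp: int, side: str) -> str:
    """Classify a variant outside a transcript.

    distance_bp: distance to the nearest transcript end, in bp.
    side: "5prime" or "3prime" (hypothetical labels for this sketch).
    """
    if distance_bp > 5000:
        return "Intergenic"
    return "Upstream" if side == "5prime" else "Downstream"


print(coding_consequence("GAA", "GAG"))  # Glu -> Glu
print(coding_consequence("CGA", "TGA"))  # Arg -> stop
print(flanking_consequence(3000, "5prime"))
```

Frameshift, splice site and regulatory categories need transcript structure and annotation, so they are left out of this sketch.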
SNPs in Ensembl - Species
• Human
• Chimp
• Mouse
• Rat
• Dog
• Cow
• Platypus
• Chicken
• Zebrafish
• Tetraodon
• Mosquito
Focus on Human
• Venter and Watson genomes
• 1000 genomes project
• HapMap
First diploid genomes for human
Craig Venter:
• Sequence & analysis ongoing since 2003
Jim Watson:
• 454 technology (7.4x coverage)
• 100 million unpaired reads (25 billion bp)
• $1,000,000
“The Diploid Genome Sequence of an Individual Human” PLoS Biology 5: 10 2113-2144 (2007)
“The Complete Genome of an Individual by Massively Parallel DNA Sequencing” Nature 452:872-876 (2008)
“Accurate Whole Human Genome Sequencing Using Reversible Terminator Chemistry ” Nature 456:53-59 (2008)
“The Diploid Genome Sequence of an Asian Individual” Nature 456:60-65 (2008)
www.1000genomes.org
1000 Genomes
Delivering 20 TB of sequence data…
• First Pilot. 60 HapMap samples sequenced (low coverage)
• Second Pilot. Two trios of European and African descent (high coverage)
• Third Pilot. Sequence 1,000 genes in 1,000 individuals (high coverage)
1000 Genomes Browser
Main page
• Built on Ensembl
• Navigation on the left
hand side
• Options as drop down
menus
• Currently only includes
human data
• In the future comparative
genomics information will
be available
• All pages link to Ensembl
and UCSC
Spot the difference!
Reference Sequence
• The Human Genome Project gave the
“average” DNA sequence of a small number of
people.
• This helps us find out how a human develops
and works
• Does not show us the DNA differences
between different humans
• Does not reflect the major alleles
HapMap www.hapmap.org
• A multi-country effort to identify and
catalogue genetic similarities and
differences in people.
• Collaboration among scientists and
funding agencies from Japan, the
United Kingdom, Canada, China,
Nigeria, and the United States.
• All of the information generated by
the project released into the public
domain.
HapMap (phase I & II)
• Samples from populations with African,
Asian and European ancestry.
• 270 DNA samples from 4 populations:
• 30 trios (two parents and an adult child)
from the Yoruba people of Ibadan, Nigeria
• 45 unrelated Japanese from the Tokyo
area
• 45 unrelated Han Chinese from Beijing
• 30 trios from Utah with Northern and
Western European ancestry (CEPH)
HapMap (phase III)
• Genotypes from 1115 individuals from 11 populations:
• ASW African ancestry in Southwest USA (71)
• CEU Utah residents with Northern and Western
European ancestry from the CEPH collection (162)
• CHB Han Chinese in Beijing, China (82)
• CHD Chinese in Metropolitan Denver, Colorado (70)
• GIH Gujarati Indians in Houston, Texas (83)
• JPT Japanese in Tokyo, Japan (82)
• LWK Luhya in Webuye, Kenya (83)
• MEX Mexican ancestry in Los Angeles, California (71)
• MKK Maasai in Kinyawa, Kenya (171)
• TSI Toscani in Italia (77)
• YRI Yoruba in Ibadan, Nigeria (163)
Haplotyping
• A haplotype is a set of SNPs (on
average ~25 kb) found to be
statistically associated on a single
chromatid and which therefore tend
to be inherited together over time.
• Haplotyping involves grouping
subjects by haplotypes.
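Grouping subjects by haplotype amounts to a simple bucketing step once the haplotypes have been determined. The sketch below assumes each haplotype is already available as a string of alleles at the SNPs of interest; the subject IDs and allele strings are invented for illustration.

```python
from collections import defaultdict


def group_by_haplotype(subject_haplotypes):
    """Group subjects by the haplotype they carry.

    subject_haplotypes: iterable of (subject_id, haplotype) pairs.
    A diploid subject can appear twice, once per chromosome copy.
    """
    groups = defaultdict(list)
    for subject, haplotype in subject_haplotypes:
        groups[haplotype].append(subject)
    return dict(groups)


# Hypothetical data: three SNPs per haplotype.
observations = [("s1", "ACG"), ("s2", "ATG"), ("s3", "ACG"), ("s1", "ATG")]
print(group_by_haplotype(observations))
```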
Linkage Disequilibrium
LD is the deviation from equilibrium, i.e. from random association of alleles at two loci.
(e.g. in a population, two alleles are inherited together more often than expected by chance, although recombination should separate them some of the time.)
Measures of LD
• D = P(AB) - P(A)P(B)
  • D ranges from -0.25 to +0.25
  • D = 0 indicates linkage equilibrium
  • dependent on allele frequencies, therefore of little use
• D’ = D / maximum possible value of D
  • D’ = 1 indicates perfect LD
  • estimates of D’ strongly inflated in small samples
• r² = D² / (P(A)P(a)P(B)P(b))
  • r² = 1 indicates perfect LD
  • measure of choice
High LD, or perfect LD, shows high association of SNPs.
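The three measures can be computed directly from one haplotype frequency and two allele frequencies. This is a minimal sketch for two biallelic loci; the function name is illustrative, the maximum possible D is taken as the usual bound from the allele frequencies, and D’ is reported as an absolute value (an assumption, since the slide does not say which sign convention it uses).

```python
def ld_measures(p_ab: float, p_a: float, p_b: float) -> dict:
    """Compute D, D' and r^2 for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a:  frequency of allele A at the first locus
    p_b:  frequency of allele B at the second locus
    """
    q_a, q_b = 1.0 - p_a, 1.0 - p_b  # frequencies of alleles a and b
    d = p_ab - p_a * p_b
    # Largest |D| that these allele frequencies allow.
    if d >= 0:
        d_max = min(p_a * q_b, q_a * p_b)
    else:
        d_max = min(p_a * p_b, q_a * q_b)
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * q_a * p_b * q_b
    r2 = (d * d) / denom if denom > 0 else 0.0
    return {"D": d, "D_prime": d_prime, "r2": r2}


# Alleles A and B always occur together: perfect LD.
print(ld_measures(p_ab=0.5, p_a=0.5, p_b=0.5))
# Haplotype frequency equals the product of allele frequencies: equilibrium.
print(ld_measures(p_ab=0.25, p_a=0.5, p_b=0.5))
```

With both allele frequencies at 0.5, D reaches its extreme values of +0.25 and -0.25, which is where the range quoted above comes from.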
Linkage Disequilibrium
LD values between
two variants are
displayed by means
of inverted coloured
triangles going from
white (low LD) to
red (high LD).
Tag SNPs define a haplotype
Adapted from Nature 426, 6968: 789-796 (2003)
Tag SNPs
• ‘Tag SNPs’ define the minimum SNP set needed to identify a haplotype.
• r² = 1 between 2 SNPs means 1 would be ‘redundant’ in the haplotype.
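The redundancy idea can be illustrated with a simple greedy pass over pairwise r² values. This is only a sketch: it is order-dependent, it is not guaranteed to find the true minimum set, and it is not the method used by HapMap. The SNP names and r² values are invented for illustration.

```python
def select_tag_snps(snps, r2, threshold=1.0):
    """Greedy sketch: keep a SNP only if no already-kept SNP tags it.

    snps: SNP identifiers, in the order they should be considered
    r2:   dict mapping frozenset({snp_x, snp_y}) to their r^2 value
    """
    tags = []
    for snp in snps:
        redundant = any(
            r2.get(frozenset((snp, tag)), 0.0) >= threshold for tag in tags
        )
        if not redundant:
            tags.append(snp)
    return tags


# Hypothetical pairwise r^2 values: snp1 and snp2 are in perfect LD.
pairwise_r2 = {
    frozenset(("snp1", "snp2")): 1.0,
    frozenset(("snp1", "snp3")): 0.2,
    frozenset(("snp2", "snp3")): 0.2,
}
print(select_tag_snps(["snp1", "snp2", "snp3"], pairwise_r2))
```

Here snp2 is dropped because snp1 already carries the same information.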
Locus specific databases (LSDB)
• Databases that focus on one gene or one disease
  • e.g. p53, ABO, collagen
  • e.g. albinism, cystic fibrosis, Alzheimer’s disease
• User communities:
  • Clinicians: driven by genetic testing of patients
  • Research groups: disease and function driven
LSDBs
• >700 on the Human Genome Variation Society website
LSDB examples
Why is it difficult to merge these data?
• Historical reasons. LSDBs sometimes:
  • Use sequences which do not start at methionine
  • Use transcript coordinates, not genomic coordinates
  • Use a different transcript for reporting mutations
• The reference genome sequence:
  • Regularly changes with new assemblies/gene builds
  • May contain minor alleles or rare alleles
  • May be inaccurate
  • Is missing genes (e.g. no α-haemoglobin thalassaemia)
  • Is a mixture of sequences from different individuals
Ensembl and LRGs (Locus Reference Genomic sequences)
• Define an exchange format for LRGs with
the NCBI
• Create an LRG website
• Create a pipeline for receiving the data and
creating an LRG
• Extend Ensembl databases to store LRGs
• Develop an API to query LRGs and
associated annotation
• Consult with the LSDBs to develop useful
visualisation tools
• Build displays for LRG data and annotation
Why is this important for Ensembl?
• Ensembl has traditionally focused on an
infrastructure for molecular biologists
• Needs to expand to provide support for
more stable transcript sequences used for
reporting mutations
• It will give central databases access to
patient variation, genotype, phenotype and
disease data
• This will improve our data resources
Advantages to LSDBs
• LRGs in Ensembl give LSDBs access to:
• Genome annotation (including comparative,
functional genomics and variation data)
• Data integration with other variation
resources (dbSNP, EGA, 1000 Genomes,
NHGRI GWA catalogue)
• Sequence search and data mining tools
• A Perl API to query the data
• A genome browser website for visualisation
in genomic context and local context
• Promotes discoverability of LSDBs
• Data is mapped from one assembly to the next
EGA (European Genome-phenome Archive): repository for genotype data
• www.ebi.ac.uk/ega/
Variations Team
Fiona Cunningham
Yuan Chen
Will McLaren
Functional Genomics
(Wikipedia): Functional genomics is a field of molecular
biology that attempts to make use of the vast wealth of
data produced by genomic projects
(such as genome sequencing projects)
to describe gene (and protein) functions and
interactions.
In Ensembl:
Regulatory build using ENCODE project information
Promoters and Enhancers from CisRED and VISTA
FlyReg features (for Drosophila)
ENCODE
Encyclopedia Of DNA Elements
14 June 2007, Nature
Where are the promoter, enhancer, and
other regulatory regions of the human
genome?
The pilot project showed that chromatin accessibility and histone modification analysis can be used to predict transcription start sites (TSSs).
Regulatory Build
Uses CTCF and DNase I data from multiple cell types as “core features”.
Overlapping methylation sites expand these regions.
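The expansion step can be pictured as interval arithmetic: each core feature grows to cover any supporting feature that overlaps it. The sketch below is a simplified illustration, not the Ensembl Regulatory Build pipeline; it works on one chromosome, checks overlap against the original core feature only (a single pass), and all coordinates are invented.

```python
def extend_core_regions(core, overlapping):
    """Extend each core feature by the features that overlap it.

    core, overlapping: lists of (start, end) tuples with inclusive
    coordinates on the same chromosome.
    """
    extended = []
    for c_start, c_end in core:
        start, end = c_start, c_end
        for o_start, o_end in overlapping:
            # Overlap is tested against the original core feature.
            if o_start <= c_end and o_end >= c_start:
                start = min(start, o_start)
                end = max(end, o_end)
        extended.append((start, end))
    return extended


# Hypothetical coordinates: two supporting sites overlap the core, one does not.
core_features = [(100, 200)]
supporting_sites = [(150, 300), (400, 500), (50, 120)]
print(extend_core_regions(core_features, supporting_sites))
```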
How to get there?
Click on a Regulatory Feature…
Region in Detail
BioMart
There are other sets…
cisRED
Sequence motifs determined by experimental and computational prediction tools.
http://www.cisred.org/
VISTA Enhancer Set
Tissue-specific enhancers. Tested experimentally.
Nucleic Acids Res. 2007 January; 35(Database issue): D88–D92.
Total List of Regulation Info.
• Homo sapiens
  • DNase I hypersensitivity sites for GM06990 and CD4+ T cells
  • CTCF binding sites
  • Histone modification data
  • MeDIP-chip methylation data for 17 human tissues and cell lines
  • VISTA Enhancer Assay (http://enhancer.lbl.gov)
  • cisRED motifs (www.cisred.org)
  • miRanda microRNA target prediction
  • Expression Quantitative Trait Loci (eQTL) from the Sanger Institute
• Mus musculus
  • DNase I hypersensitivity sites (ES cells)
  • Histone modifications for ES, MEF, and NPC cells
  • cisRED motifs (www.cisred.org)
• Danio rerio
  • ZFMODELS-enhancers
• Drosophila melanogaster
  • REDfly TFBSs
  • BioTIFFIN
  • REDfly CRMs
Functional Genomics Team
• eFG Ian Dunham
• Nathan Johnson
• Daniel Sobral (starts Dec 1)
• Andy Yates (multi-species support)
• ENCODE
• Steven Wilder
• Damian Keefe
End of course survey!
http://tinyurl.com/yaw6nzq