Transcript Variations

Variation and Functional Genomics
Overview of Talk
• SNPs and InDels
• Larger structural variants (CNVs)
• Phenotype data
• Individual genomes
• HapMap variations and genotypes
• Locus Specific Databases
• LRGs
Genomic Diversity
SNPs (Single Nucleotide Polymorphisms)
single base pair substitutions
InDels
insertions/deletions (can cause frameshifts in coding sequence)
These variants occur about once in every 300 bp (human)
~3 billion base pairs in mammalian genomes!
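As a rough back-of-envelope check, the two figures above can be combined to estimate how many variant sites to expect in a genome. The sketch below only does that arithmetic; the constant and function names are illustrative, not part of any Ensembl tool.

```python
# Back-of-envelope estimate using the two figures quoted on the slide.
GENOME_SIZE_BP = 3_000_000_000  # ~3 billion base pairs
BP_PER_VARIANT = 300            # roughly one variant every 300 bp


def expected_variant_count(genome_size_bp: int, bp_per_variant: int) -> int:
    """Return the approximate number of variant sites implied by a density."""
    return genome_size_bp // bp_per_variant


print(expected_variant_count(GENOME_SIZE_BP, BP_PER_VARIANT))  # 10000000
```

That is on the order of ten million sites per genome, which is why variation data needs dedicated databases and browser tracks.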
Functional Consequences
Type: Consequence
• SNPs in coding areas that alter the aa sequence: cause of most monogenic disorders, e.g. cystic fibrosis (CFTR), hemophilia (F8)
• SNPs in coding areas that don’t alter the aa sequence: may affect splicing
• SNPs in promoter or regulatory regions: may affect the level, location or timing of gene expression
• SNPs in other regions: no direct known impact on phenotype; useful as markers
Sequence Polymorphisms Effects
• Cause disease
(SNP in clotting factor IX codes for a stop codon: haemophilia)
• Increase disease risk
(SNP in LDL receptor reduces efficiency: high cholesterol)
• Affect drug response
(2 million hospitalized patients suffer serious adverse drug reactions, of which more than 100,000 are fatal*)
Studying variation – why?
• Determine disease risk
• Individualised medicine
(pharmacogenomics)
• Forensic studies
• Biological markers
• Hybridisation studies, marker-assisted breeding
• Understanding Evolution
Practical Applications
dbSNP
http://www.ncbi.nlm.nih.gov/SNP/
55 organisms covered:
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_summary.cgi
Small Scale Sequence Variants
• Most SNPs and Indels are imported from dbSNP (rs……):
• Imported data: alleles, flanking sequences, pop. frequencies
• Calculated data: position, transcript effect
• For human also:
• HGMD (Human Gene Mutation Database)
• HGVS (Human Genome Variation Society)
• Affymetrix and Illumina variations
• Ensembl-called SNPs (from aligned individual genomes)
• For mouse, rat, dog and chicken also:
• Sanger- and Ensembl-called SNPs (other strains/breeds)
SNPs and InDels in Ensembl
• Non-synonymous: In coding sequence, resulting in an aa change
• Synonymous: In coding sequence, not resulting in an aa change
• Frameshift: In coding sequence, resulting in a frameshift
• Stop lost: In coding sequence, resulting in the loss of a stop codon
• Stop gained: In coding sequence, resulting in the gain of a stop codon
• Essential splice site: In the first 2 or the last 2 basepairs of an intron
• Splice site: 1-3 bps into an exon or 3-8 bps into an intron
• Upstream: Within 5 kb upstream of the 5'-end of a transcript
• Regulatory region: In regulatory region annotated by Ensembl
• 5' UTR: In 5' UTR
• Intronic: In intron
• 3' UTR: In 3' UTR
• Downstream: Within 5 kb downstream of the 3'-end of a transcript
• Intergenic: More than 5 kb away from a transcript
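The location-based categories above (upstream, downstream, intergenic) reduce to a distance test against the transcript ends. The following is a minimal sketch of that rule only, assuming a single transcript given by genomic start/end and a strand; the function name and the "within transcript" label are hypothetical and this is not the Ensembl implementation.

```python
FLANK_BP = 5000  # the 5 kb window quoted in the consequence definitions


def classify_by_location(pos: int, tx_start: int, tx_end: int, strand: int = 1) -> str:
    """Classify a genomic position relative to one transcript.

    tx_start <= tx_end are genomic coordinates; strand is 1 or -1.
    On the reverse strand the 5' end of the transcript is at tx_end,
    so "upstream" and "downstream" swap sides.
    """
    if tx_start <= pos <= tx_end:
        return "within transcript"  # exon/intron/UTR calls need more annotation
    if pos < tx_start:
        distance = tx_start - pos
        side = "upstream" if strand == 1 else "downstream"
    else:
        distance = pos - tx_end
        side = "downstream" if strand == 1 else "upstream"
    return side if distance <= FLANK_BP else "intergenic"


print(classify_by_location(9_000, 10_000, 20_000))             # upstream
print(classify_by_location(21_000, 10_000, 20_000, strand=-1)) # upstream
print(classify_by_location(4_000, 10_000, 20_000))             # intergenic
```

A real annotation pipeline would test every overlapping transcript and report one consequence per transcript, since the same variant can be upstream of one gene and intronic in another.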
Small Scale Sequence Variants
[Figure: Ensembl Region in Detail view showing colour-coded SNPs and InDels, with legend]
Polymorphisms in Ensembl
• Chicken
• Chimp
• Cow
• Dog
• Human
• Mouse
• Rat
• Platypus
• Tetraodon
• Zebrafish
• Plants (Rice, Arabidopsis, Grapevine, Brachypodium)
• Yeast
• Fly
• Mosquito
• Plasmodium falciparum
CNV in human
Structural variants track
Phenotype Data
Genome wide association data
• 159 annotations from EGA http://www.ebi.ac.uk/ega
• 2697 from NHGRI http://www.genome.gov/gwastudies/
Somatic Variations: COSMIC
Population Data in Ensembl
http://hapmap.ncbi.nlm.nih.gov/
http://www.1000genomes.org
Population Data
Variation tab: Population genetics
Variation Tab
• Flanking sequence
• Population genetics and LD plots
• Disease relationships (human)
EGA, GWAS, HapMap, Clinical/LSDB
• Ancestral alleles
Variation Views
• View variations drawn on the sequence
Gene tab: Sequence link
Transcript tab: Exons, cDNA, protein links
• View a table of variations for each transcript
Gene tab: Variation Table
• View variations drawn along a transcript
Gene tab: Variation Image
Comparison Views
Human, Mouse, Rat, Dog and Cow
have individual or strain comparisons:
Comparison Image link at the left of the Transcript tab.
SNP Effect Calculator
Click on Manage your data at the left of any page.
Follow the link to “SNP Effect Predictor”.
Paste in variation positions and alleles
SNP Effect Calculator
The location, the variation name in Ensembl, and the consequence on the amino acid sequence are returned.
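To illustrate what "consequence on amino acid sequence" means for a single-base substitution, here is a minimal sketch that translates the reference and variant codons with the standard genetic code and compares them. The function name and interface are hypothetical; the real tool works from genomic positions and transcript annotation, which this sketch does not model.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon: str, codon_pos: int, alt_base: str) -> str:
    """Classify a single-base change inside a codon (codon_pos is 0, 1 or 2)."""
    ref_codon = ref_codon.upper()
    alt_codon = ref_codon[:codon_pos] + alt_base.upper() + ref_codon[codon_pos + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop gained"
    if ref_aa == "*":
        return "stop lost"
    return "non-synonymous"


print(classify_substitution("GAG", 1, "T"))  # Glu -> Val: non-synonymous
print(classify_substitution("CTT", 2, "C"))  # Leu -> Leu: synonymous
print(classify_substitution("GAA", 0, "T"))  # Glu -> stop: stop gained
```

Insertions and deletions are deliberately left out: their effect (frameshift or in-frame) depends on length, not on a single codon lookup.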
Ensembl Variation
• SNPs and InDels
• Larger structural variants (CNVs)
• Phenotype data
• Individual genomes (human)
• HapMap variations and genotypes
• Locus Specific Databases
• LRGs
Sequencing Individuals
• Venter and Watson genomes
• 1000 genomes project
• HapMap
First diploid genomes for human
Craig Venter:
• Sequence & analysis ongoing since 2003
Jim Watson:
• 454 technology (7.4x)
• 100 million unpaired reads (25 billion bps)
• $1,000,000
“The Diploid Genome Sequence of an Individual Human” PLoS Biology 5: 10 2113-2144 (2007)
“The Complete Genome of an Individual by Massively Parallel DNA Sequencing” Nature 452:872-876 (2008)
“Accurate Whole Human Genome Sequencing Using Reversible Terminator Chemistry ” Nature 456:53-59 (2008)
“The Diploid Genome Sequence of an Asian Individual” Nature 456:60-65 (2008)
Reference Sequence
• The Human Genome Project gave the
“average” DNA sequence of a small number of
people.
• This helps us find out how a human develops
and works
• Does not show us the DNA differences
between different humans
• Does not reflect the major alleles
1000 Genomes Project
www.1000genomes.org
1000 genomes track in Region in Detail
HapMap www.hapmap.org
• A multi-country effort to identify and
catalogue genetic similarities and
differences in people.
• Collaboration among scientists and
funding agencies from Japan, the
United Kingdom, Canada, China,
Nigeria, and the United States.
• All of the information generated by the project is released into the public domain.
HapMap (phase III)
• Genotypes from 1115 individuals from 11 populations:
• ASW African ancestry in Southwest USA (71)
• CEU Utah residents with Northern and Western
European ancestry from the CEPH collection (162)
• CHB Han Chinese in Beijing, China (70)
• CHD Chinese in Metropolitan Denver, Colorado (70)
• GIH Gujarati Indians in Houston, Texas (83)
• JPT Japanese in Tokyo, Japan (82)
• LWK Luhya in Webuye, Kenya (83)
• MEX Mexican ancestry in Los Angeles, California (71)
• MKK Maasai in Kinyawa, Kenya (171)
• TSI Toscani in Italia (77)
• YRI Yoruba in Ibadan, Nigeria (163)
Haplotyping
• A haplotype is a set of SNPs (spanning ~25 kb on average) found to be statistically associated on a single chromatid and which therefore tend to be inherited together over time.
• Haplotyping involves grouping
subjects by haplotypes.
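Grouping subjects by haplotype can be sketched as a simple keyed grouping. The example below assumes, for illustration only, that each subject already has one phased haplotype written as a string of alleles at an ordered set of SNPs; real individuals are diploid and phasing is a separate statistical step. The subject names and alleles are made up.

```python
from collections import defaultdict


def group_by_haplotype(subject_haplotypes: dict) -> dict:
    """Map each distinct haplotype string to the list of subjects carrying it."""
    groups = defaultdict(list)
    for subject, haplotype in subject_haplotypes.items():
        groups[haplotype].append(subject)
    return dict(groups)


# Hypothetical data: alleles at four linked SNPs per subject.
example = {
    "subject_1": "ACGT",
    "subject_2": "ACGT",
    "subject_3": "ATGT",
}
print(group_by_haplotype(example))
```

Because linked SNPs are inherited together, a few "tag" SNPs are often enough to tell the groups apart, which is what makes genotyping arrays practical.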
Locus specific databases
(LSDB)
• Databases that focus on one gene or one
disease
• e.g. p53, ABO, collagen
• e.g. Albinism, cystic fibrosis, Alzheimer’s
disease
• User communities:
• Clinicians: driven by genetic testing of patients
• Research groups: disease and function driven
LSDBs
• >1000 on the Human Genome Variation Society website
LSDB examples
Why is it difficult to merge these data?
• Historical reasons. LSDBs sometimes
• Use sequences which do not start at Methionine
• Use transcript coordinates not genomic
• Use a different transcript for reporting mutations
• Regularly changes with new assemblies/gene builds
• It may contain minor alleles or rare alleles
• It may be inaccurate
• Missing genes (e.g. no α-haemoglobin thalassaemia)
• Mixture of sequences from different individuals
Ensembl and LRGs
• Define an exchange format for LRGs with
the NCBI
• Create an LRG website
• Create a pipeline for receiving the data and
creating an LRG
• Extend Ensembl databases to store LRGs
• Develop an API to query LRGs and
associated annotation
• Consult with the LSDBs to develop useful
visualisation tools
• Build displays for LRG data and annotation
EGA: Repository for genotype data
• www.ebi.ac.uk/ega/
Sequences Differing from the Reference
• Common coordinate system for
reporting mutations and variation
data (stable sequence)
• Locus Reference Genomic (LRG)
• Ensembl displays LRGs
• Project in collaboration with the NCBI and GEN2PHEN
• Extension of the RefSeq gene project
• View and Request LRGs here:
http://www.lrg-sequence.org/
Locus Reference Genomic
LRG = Genomic sequence for reporting mutations
(containing transcript)
* Often differs from the reference assembly
LRGs in the Browser
LRG_13
All LRGs
LRG transcripts and underlying sequence can be viewed.
http://www.ensembl.org/Homo_sapiens/LRG/Summary?lrg=LRG_13
Variations Team
Fiona Cunningham
Pontus Larsson
Will McLaren
Graham Ritchie
Functional Genomics
(Wikipedia): Functional genomics is a field of molecular
biology that attempts to make use of the vast wealth of
data produced by genomic projects
(such as genome sequencing projects)
to describe gene (and protein) functions and
interactions.
In Ensembl:
Regulatory build using ENCODE project information
Promoters and Enhancers from CisRED and VISTA
FlyReg features (for Drosophila)
ENCODE
Encyclopedia Of DNA Elements
14 June 2007, Nature
Where are the promoter, enhancer, and
other regulatory regions of the human
genome?
Pilot project showed: chromatin accessibility and histone modification analysis can be used to predict transcription start sites (TSS)
Regulatory Build
• CTCF-binding sites
• DNase I hypersensitive sites
• TF binding sites
These are “core features”
Overlapping methylation sites expand
these regions.
http://www.ensembl.org/info/docs/funcgen/index.html
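The idea that overlapping sites expand a core region can be sketched as an interval operation. This is a simplified illustration under stated assumptions, not the Regulatory Build algorithm: features are plain (start, end) pairs on one chromosome, only features overlapping the original core are considered, and the function name is hypothetical.

```python
def expand_core_region(core, attribute_features):
    """Extend a core (start, end) interval to cover every overlapping feature.

    Only overlap with the ORIGINAL core counts; features that merely touch
    the already-extended region are ignored (an assumption of this sketch).
    """
    core_start, core_end = core
    start, end = core_start, core_end
    for feat_start, feat_end in attribute_features:
        overlaps_core = feat_start <= core_end and feat_end >= core_start
        if overlaps_core:
            start = min(start, feat_start)
            end = max(end, feat_end)
    return (start, end)


# Hypothetical coordinates: two features overlap the core, one does not.
print(expand_core_region((100, 200), [(150, 300), (50, 120), (400, 500)]))  # (50, 300)
```

Anchoring on the original core keeps a chain of loosely overlapping features from stretching a regulatory feature indefinitely.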
The Regulation Tab
How to get there?
The Location Tab
BioMart
There are other sets…
Sequence motifs determined by experimental and
prediction tools.
http://www.cisred.org/
VISTA Enhancer Set
Tissue-specific enhancers. Tested experimentally.
Nucleic Acids Res. 2007 January; 35(Database issue): D88–D92.
Gene Regulation Summary
• Homo sapiens
  • DNase I hypersensitivity, CTCF binding sites, TF binding sites (core features)
  • Histone modification data
  • MeDIP-chip methylation data for 17 human tissues and cell lines
  • VISTA Enhancer Assay (http://enhancer.lbl.gov)
  • cisRED motifs (www.cisred.org)
  • miRanda microRNA target prediction
  • Expression Quantitative Trait Loci (eQTL) from the Sanger Institute
• Mus musculus
  • DNase I hypersensitivity sites (ES cells)
  • Histone modifications for ES, MEF, and NPC cells
  • cisRED motifs (www.cisred.org)
• Danio rerio
  • ZFMODELS-enhancers
• Drosophila melanogaster
  • REDfly TFBSs
  • BioTIFFIN
  • REDfly CRMs
Functional Genomics
• eFG Ian Dunham
• Nathan Johnson
• Daniel Sobral
• Andy Yates
• ENCODE
• Steven Wilder
• Damian Keefe