Gill: Comparative Genomics II


CS273A
Lecture 11: Comparative Genomics II
MW 12:50-2:05pm in Beckman B302
Profs: Serafim Batzoglou & Gill Bejerano
TAs: Harendra Guturu & Panos Achlioptas
http://cs273a.stanford.edu [BejeranoFall13/14]
Announcements
Some midterm feedback:
• You seem to like us
• We like you too!
• Teach us more biology / Teach us more algorithms
• We’ll highlight follow-up classes towards the end of the quarter
• Give us more references
• Start with Wikipedia. Then ask us for any specifics on Piazza.
• How do all the different topics we cover tie together?
• They all teach you about the human genome!
• Its functions, its evolution and its contribution to disease - it’s a big canvas
• What are the most important problems in the field?
• Different people will give you different answers
• None of the topics we introduce to you is fully resolved!
• Homework is very technical. Hard to focus on the insights.
• This is part of our daily challenge.
• We should make you like the taste of it, because we sure do!
• Your project will give you a taste of real open ended research.
ATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA
TATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC
TAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTC
TGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACT
CTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG
AATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAA
GCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAAT
TTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA
CTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGG
TTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGAT
TGATATGCTTTGCGCCGTCAAAGTTTTGAACGATGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAAT
TTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATG
CGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATC
ATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAA
GAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCA
ATTGGGCAGCTGTCTATATGAATTAGTCAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAA
TTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGA
ATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTT
ATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTTGCGAAGTT
TGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGT
TCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATAC
ATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCT
GCAAGTTGCCAACTGACGAGATGCAGTTTCCTACGCATAATAAGAATAGGAGGGAATATCAAGCCAGACAATCTATCATTACATTTA
CGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAAGA
ATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATACA
TCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACAAC
GGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATCAA
CTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGTTG
GCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCTTC
TTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATTAAT
TGAAATCTATCTTTGGAAAAGATTTACAATGATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCT
GCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTT
AATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCT
TCTTGACATGATATGACTACCATTTTGTTATTGTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTT
AATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGA
TTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTA
CTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTT
TACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTT
ACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAA
AATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGT
Comparative Genomics
“Nothing in Biology Makes Sense
Except in the Light of Evolution”
Theodosius Dobzhansky
“Nothing in Evolution Makes Sense
Except in the Light of Computation”
Yours Truly
human
chimp
macaque
mouse
rat
cow
dog
opossum
platypus
chicken
zfish
tetra
fugu
T
Terminology
Orthologs: Genes related via speciation (e.g. C,M,H3)
Paralogs: Genes related through duplication (e.g. H1,H2,H3)
Homologs: Genes that share a common origin (e.g. C,M,H1,H2,H3)
[Figure: gene tree from a single ancestral gene, shown with the species tree; speciation, duplication and loss events are marked]
Conservation implies function
purifying selection vs.
neutral evolution
Note: Lack of sequence conservation does NOT imply lack of function.
NOR does it rule out function conservation.
Dotplots
• Dotplots are a simple way of
seeing alignments
• We really like to see good visual
demonstrations, not just tables of
numbers
• It’s a grid: put one sequence
along the top and the other
down the side, and put a dot
wherever they match.
• You see the alignment as a
diagonal
• Note that DNA dotplots are
messier because the alphabet
has only 4 letters
• Smoothing by windows helps:
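The grid and the window smoothing described above can be sketched in a few lines of Python. This is only an illustration: the names `dotplot` and `render`, and the `window` / `min_matches` parameters, are assumptions made for this example and do not come from any tool used in class.

```python
def dotplot(seq_a, seq_b, window=1, min_matches=None):
    """Return a grid of booleans: grid[i][j] is True when the window starting at
    seq_b[i] (rows, down the side) matches the window starting at seq_a[j]
    (columns, along the top) in at least min_matches positions."""
    if min_matches is None:
        min_matches = window  # by default every position in the window must match
    rows = len(seq_b) - window + 1
    cols = len(seq_a) - window + 1
    grid = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(seq_b[i + k] == seq_a[j + k] for k in range(window))
            row.append(matches >= min_matches)
        grid.append(row)
    return grid


def render(grid):
    """Draw the grid as text: '*' for a dot, '.' for no dot."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in grid)


if __name__ == "__main__":
    print(render(dotplot("GATTACA", "GATTACA", window=1)))  # noisy: 4-letter alphabet
    print()
    print(render(dotplot("GATTACA", "GATTACA", window=3)))  # only the diagonal survives
```

With `window=1` the small DNA alphabet produces many chance dots off the diagonal; with `window=3` only the true alignment diagonal remains in this toy example.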
Chaining Alignments
Chaining highlights homologous regions between genomes, bridging
the gulf between syntenic blocks and base-by-base alignments.
Local alignments tend to break at transposon insertions, inversions,
duplications, etc.
Global alignments tend to force non-homologous bases to align.
Chaining is a rigorous way of joining together local alignments into
larger structures.
Another Chain Example
[Figure: the human sequence contains blocks A, B, C, D, E; the mouse sequence contains blocks A, B, C, B', D, E. In the human browser the human sequence is implicit and the mouse chains are drawn against it; in the mouse browser the mouse sequence is implicit and the human chains are drawn against it.]
Chains join together related local alignments
likely ortholog
likely paralogs
shared domain?
Protease Regulatory Subunit 3
Note: repeats are a nuisance
mouse
human
If, for example, human and mouse each have 10,000 copies
of the same repeat:
We will obtain, and need to output, 10^8 (10,000 x 10,000) alignments of all these
copies to each other.
Note that for the sake of this comparison interspersed repeats
and simple repeats are equal nuisances.
However, note that simple repeats, but not interspersed repeats,
violate the assumption that similar sequences are homologous.
Solution:
1 Discover all repetitive sequences in each genome.
2 Mask them when doing genome to genome comparison.
3 Chain your alignments.
4 Add back to the alignments only repeat matches that lie within
pre-computed chains.
This re-introduces into the chains (mostly) orthologous copies.
(Which is valuable!)
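Step 4 of the solution above can be illustrated with a small Python sketch. It makes a strong simplifying assumption: each chain is reduced to its bounding span in target and query coordinates, and a repeat alignment is "within" a chain when it falls inside that span in both genomes. The names `Span`, `within` and `add_back_repeats` are hypothetical and exist only for this example; the real pipeline is more involved.

```python
from typing import NamedTuple


class Span(NamedTuple):
    t_start: int  # target (e.g. human) start, 0-based
    t_end: int    # target end, exclusive
    q_start: int  # query (e.g. mouse) start, 0-based
    q_end: int    # query end, exclusive


def within(inner, outer):
    """True when inner lies inside outer in both target and query coordinates."""
    return (outer.t_start <= inner.t_start and inner.t_end <= outer.t_end
            and outer.q_start <= inner.q_start and inner.q_end <= outer.q_end)


def add_back_repeats(chains, repeat_alignments):
    """Keep only the repeat alignments that fall within some pre-computed chain."""
    return [r for r in repeat_alignments if any(within(r, c) for c in chains)]


if __name__ == "__main__":
    chains = [Span(100, 500, 1000, 1400)]
    repeats = [Span(200, 250, 1100, 1150),   # inside the chain: kept
               Span(200, 250, 9000, 9050)]   # same target, unrelated query: dropped
    print(add_back_repeats(chains, repeats))
```

The second repeat copy matches the same human sequence but lands far outside the chain in mouse, so it is discarded as a likely non-orthologous copy.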
Chains
• a chain is a sequence of gapless aligned blocks, where there must be
no overlaps of blocks' target or query coords within the chain.
• Within a chain, target and query coords are monotonically nondecreasing. (i.e. always increasing or flat)
• double-sided gaps are a new capability (blastz can't do that) that
allow extremely long chains to be constructed.
• not just orthologs, but paralogs too, can result in good chains. but
that's useful!
• chains should be symmetrical -- e.g. swap human-mouse -> mouse-human chains, and you should get approx. the same chains as if you
chain swapped mouse-human blastz alignments.
• chained blastz alignments are not single-coverage in either target or
query unless some subsequent filtering (like netting) is done.
• chain tracks can contain massive pileups when a piece of the target
aligns well to many places in the query. Common causes of this
include insufficient masking of repeats and high-copy-number genes
(or paralogs).
[Angie Hinrichs, UCSC wiki]
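The first three properties above (gapless blocks, no overlaps, nondecreasing coordinates, double-sided gaps allowed) can be checked mechanically. The sketch below assumes a chain is given as a list of `(t_start, q_start, length)` gapless blocks on the same strand; this representation and the function names `is_valid_chain` and `gap_types` are assumptions for illustration, not the UCSC file format.

```python
def is_valid_chain(blocks):
    """blocks: list of (t_start, q_start, length) gapless blocks, in chain order.
    Valid when every block has positive length and each block starts at or after
    the end of the previous one in BOTH target and query coordinates."""
    if any(length <= 0 for _, _, length in blocks):
        return False
    for (t1, q1, n1), (t2, q2, _) in zip(blocks, blocks[1:]):
        if t2 < t1 + n1 or q2 < q1 + n1:
            return False  # overlap or backwards step in target or query
    return True


def gap_types(blocks):
    """Classify the gap between each pair of consecutive blocks."""
    kinds = []
    for (t1, q1, n1), (t2, q2, _) in zip(blocks, blocks[1:]):
        dt = t2 - (t1 + n1)  # unaligned target bases between the blocks
        dq = q2 - (q1 + n1)  # unaligned query bases between the blocks
        if dt > 0 and dq > 0:
            kinds.append("double-sided")
        elif dt > 0:
            kinds.append("target-only")
        elif dq > 0:
            kinds.append("query-only")
        else:
            kinds.append("none")
    return kinds


if __name__ == "__main__":
    chain = [(0, 0, 10), (15, 10, 5), (30, 40, 8)]
    print(is_valid_chain(chain), gap_types(chain))
```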
Before and After Chaining
Chaining Algorithm
Input - blocks of gapless alignments from (b)lastz
Dynamic program based on the recurrence relationship:
score(B_i) = \max_{j<i} [ score(B_j) + match(B_i) - gap(B_i, B_j) ]
Uses Miller’s KD-tree algorithm to minimize which parts of dynamic
programming graph to traverse. Timing is O(N logN), where N is
number of blocks (which is in hundreds of thousands)
See [Kent et al, 2003]
“Evolution's cauldron: Duplication,
deletion, and rearrangement in the
mouse and human genomes”
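To make the recurrence concrete, here is a plain quadratic-time Python sketch of it. It is not the algorithm from the paper: it skips the KD-tree speedup entirely, and the linear `gap_cost` function and its `per_base` parameter are placeholder assumptions chosen only to keep the example small. A block may also start a new chain, in which case its score is just `match(B_i)`.

```python
from typing import NamedTuple


class Block(NamedTuple):
    t: int        # target start of the gapless block
    q: int        # query start of the gapless block
    length: int   # block length (same in target and query, since gapless)
    match: float  # match(B): alignment score of the block itself


def gap_cost(prev, cur, per_base=0.1):
    """Hypothetical linear gap penalty (an assumption, not the real scoring)."""
    dt = cur.t - (prev.t + prev.length)
    dq = cur.q - (prev.q + prev.length)
    return per_base * (dt + dq)


def best_chain(blocks):
    """Return (score, indices of blocks in the best chain) using the recurrence
    score(B_i) = max over valid predecessors B_j of score(B_j) + match(B_i) - gap."""
    if not blocks:
        return 0.0, []
    # Any valid predecessor starts earlier in the target, so sorting by target
    # start guarantees predecessors are scored before the blocks that follow them.
    order = sorted(range(len(blocks)), key=lambda i: (blocks[i].t, blocks[i].q))
    score, back = {}, {}
    for pos, i in enumerate(order):
        bi = blocks[i]
        score[i], back[i] = bi.match, None  # option: start a new chain at B_i
        for j in order[:pos]:
            bj = blocks[j]
            follows = bj.t + bj.length <= bi.t and bj.q + bj.length <= bi.q
            if follows:
                cand = score[j] + bi.match - gap_cost(bj, bi)
                if cand > score[i]:
                    score[i], back[i] = cand, j
    end = max(score, key=score.get)
    chain, node = [], end
    while node is not None:  # walk the back-pointers to recover the chain
        chain.append(node)
        node = back[node]
    return score[end], chain[::-1]


if __name__ == "__main__":
    blocks = [Block(0, 0, 10, 10.0), Block(20, 20, 10, 10.0),
              Block(12, 500, 10, 8.0), Block(40, 40, 5, 5.0)]
    print(best_chain(blocks))
```

In the toy input the off-diagonal block (query start 500) is left out of the best chain because joining it would cost more in gap penalty than it contributes in match score.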
Netting Alignments
Commonly multiple mouse alignments can be found for a particular
human region, particularly for coding regions.
Net finds the best mouse match for each human region.
Highest scoring chains are used first.
Lower scoring chains fill in gaps within chains, inducing a natural
hierarchy.
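The "highest scoring first, lower scoring fill the gaps" idea can be sketched as a greedy assignment. This toy version works base by base on the target and only records which chain wins each position; it does not track hierarchy levels or apply any of the filtering a real netter performs. The function name `net` and the input layout are assumptions for illustration.

```python
def net(chains):
    """chains: dict name -> (score, [(t_start, t_end), ...]) where the list holds
    the chain's aligned blocks as half-open target intervals.
    Returns dict target_position -> name of the chain that owns it, giving
    single coverage of the target."""
    owner = {}
    by_score = sorted(chains.items(), key=lambda kv: kv[1][0], reverse=True)
    for name, (_score, blocks) in by_score:
        for start, end in blocks:
            for pos in range(start, end):
                # Positions already claimed by a higher-scoring chain are kept;
                # a lower-scoring chain only fills what is still uncovered.
                owner.setdefault(pos, name)
    return owner


if __name__ == "__main__":
    chains = {
        "ortholog": (1000, [(0, 10), (20, 30)]),  # best chain, with a gap at 10-20
        "paralog": (400, [(5, 25)]),              # weaker chain spanning that gap
    }
    owner = net(chains)
    print(owner[7], owner[15], owner[22])
```

In the example the weaker chain ends up owning only target positions 10-19, the gap left by the top-level chain.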
Net highlights rearrangements
A large gap in the top level of the net is filled by an
inversion containing two genes. Numerous smaller
gaps are filled in by local duplications and processed
pseudo-genes.
Nets attempt to capture the ortholog
(they also hide everything else)
Nets/chains can reveal retrogenes (and when they jumped in!)
Nets
• a net is a hierarchical collection of chains, with the highest-scoring
non-overlapping chains on top, and their gaps filled in where possible
by lower-scoring chains, for several levels.
• a net is single-coverage for target but not for query.
• because it's single-coverage in the target, it's no longer symmetrical.
• the netter has two outputs, one of which we usually ignore: the target-centric net in query coordinates. The reciprocal best process uses
that output: the query-referenced (but target-centric / target single-cov) net is turned back into component chains, and then those are
netted to get single coverage in the query too; the two outputs of that
netting are reciprocal-best in query and target coords. Reciprocal-best nets are symmetrical again.
• nets do a good job of filtering out massive pileups by collapsing them
down to (usually) a single level.
• GB: for human inspection always prefer looking at the chains!
[Angie Hinrichs, UCSC wiki]
Before and After Netting
Convert / LiftOver
"LiftOver chains" are actually chains extracted from nets, or chains
filtered by the netting process.
LiftOver – batch utility
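The core idea of converting a coordinate through a chain can be sketched as follows. This is not the UCSC LiftOver implementation or its command-line interface: it assumes a single same-strand chain given as `(t_start, q_start, length)` gapless blocks, and the function name `lift` is hypothetical.

```python
def lift(position, blocks):
    """Map a target position to query coordinates through one chain.
    blocks: list of (t_start, q_start, length) gapless blocks, same strand.
    Returns the query position, or None when the position falls in a gap
    (or outside the chain) and therefore has no aligned counterpart."""
    for t_start, q_start, length in blocks:
        if t_start <= position < t_start + length:
            return q_start + (position - t_start)
    return None


if __name__ == "__main__":
    chain = [(100, 5000, 50), (200, 5080, 30)]
    for pos in (120, 160, 205):
        print(pos, "->", lift(pos, chain))
```

Position 160 lies in the gap between the two blocks, so it cannot be converted.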
Drawbacks
• Inversions not handled optimally
Chains
> > > > chr1 > > >
> > > > chr1 > > >
< < < < chr5 < < < <
< < < < chr1 < < < <
Nets
> > > > chr1 > > >
> > > > chr1 > > >
< < < < chr5 < < < <
Self Chain reveals paralogs
(self net is meaningless)
Let’s put the chains and nets
to good use…
The Genotype - Phenotype divide
Can we find evolutionary patterns that are
distinct enough to be phenotypically revealing?
Problem #1:
Species A
Species B
Too many nucleotide changes
between any pair of related
species (or individuals).
The vast majority of these are
near/neutral.
Matching Genotype to Phenotype is hard
Phenotype
Number of rearrangements
Genotype
Most mutations
are near/neutral.
What about a tree of related species?
What if we could find evolutionary patterns that were
distinct enough to be phenotypically revealing?
[Figure: tree descending from a common ancestor to Species A, Species B, ..., Species H]
Genomes: Inherited with Modifications.
Traits: Come and Go.
What happens when an ancestral trait “goes”?
ancestral trait information
ancestor
Trait information is no longer under selection
Phenotype
Genome
Erodes away over evolutionary time
A lot of DNA and many traits vary between any two species.
What about independent trait loss?
vitamin C synthesis, tail, body hair, dentition features, etc.
The PG screen
matches trait presence/absence pattern
[Hiller et al., 2012a]
The PG screen
Capture the independent genomic switch
from purifying selection -> neutral evolution
in all and only the trait loss species.
Robust to: Different trait disabling times.
Different trait disabling mutations.
Branding ;-)
phenotype
genotype
Forward Genetics:
Search for mutations that segregate with the trait
Forward Genomics:
Search for regions that are lost only in species lacking the trait
But does it work?
Vitamin C Synthesis
rats & mice: synthesize vitamin C
human: cannot synthesize vitamin C
The Vitamin C synthesis “phenotree”
vitamin C synthesis was lost
3-4 times independently
in mammalian evolution
Fwd Genomics asks:
Do one or more
genomic loci
look like THAT?
Start by using chains and nets!
species 1   ACCCTATCGATTGCA
species 2   TCCGTATCG-TT-CA
outgroup    ACTCT-TCGATT-AA
First we use lastz, chaining & netting to
align the reference genome to orthologous
sequences in all other species’ genomes.
We quantify divergence by comparing sequences to the
reconstructed ancestral sequence
Reconstruct ancestral sequence:
Mutation in species 1 or 2?
Insertion in species 1 or deletion in species 2?
species 1   ACCCTATCGATTGCA
species 2   TCCGTATCG-TT-CA
outgroup    ACTCT-TCGATT-AA
ancestor    ACCCTATCGATT-CA
Percent of identical bases:
species 1   ACCCTATCGATTGCA   14 identical bases   93%
species 2   TCCGTATCG-TT-CA   11 identical bases   79% -> more diverged
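The percentages on this slide can be reproduced with a short Python sketch. The counting rule used here is an assumption inferred from the slide's numbers, not a definition given in the lecture: a column counts as identical when both sequences carry the same base, and columns where both the ancestor and the species have a gap are skipped. The function name `percent_identity` is hypothetical.

```python
def percent_identity(ancestor, species):
    """Percent of aligned columns in which the species matches the ancestor.
    Columns where both sequences have a gap ('-') are ignored; a gap opposite
    a base counts as a difference (an insertion or a deletion)."""
    identical = compared = 0
    for a, s in zip(ancestor, species):
        if a == "-" and s == "-":
            continue  # nothing to compare in this column
        compared += 1
        if a == s:
            identical += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * identical / compared


if __name__ == "__main__":
    ancestor = "ACCCTATCGATT-CA"
    print(round(percent_identity(ancestor, "ACCCTATCGATTGCA")))  # species 1
    print(round(percent_identity(ancestor, "TCCGTATCG-TT-CA")))  # species 2
```

Under this rule species 1 has 14 identical bases out of 15 compared columns and species 2 has 11 out of 14, which rounds to the 93% and 79% shown above.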
Sequencing errors mimic divergence
ancestor ACCCTATCGATT-CAATGG
species 1 ACCCTATCGATTGCAAGGG
89% identical bases
species 2 TCCGTAACG--T-CTATCG
61% identical bases
sequence quality scores: high sequencing error rate
-> treat species 2 as missing data
Assembly gaps mimic divergence
[Figure: Sanger reads and a conserved region aligned across species 1-5, with an assembly gap (?????????) in species 1]
-> treat species 1 as missing data
Reconstruct the evolutionary history of
all conserved regions, coding and non-coding
[Figure: for each of the 544,549 conserved regions, reconstruct the ancestral locus and measure each species' percent identity to it (e.g. 93%, 70%, 85%)]
matrix: 33 species x 544,549 regions
• Reconstruct ancestral sequence
• Measure extant species divergence
• Avoid
  • Low quality sequence
  • Assembly gaps
• Seek perfect phenotree match
We quantify the match to the vitamin C pattern
by counting the number of species that violate the pattern
[Figure: two example regions, each plotting percent identity (0-100) per species; one region has 1 violation of the vitamin C pattern, the other has 2 violations]
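One way to count violating species for a region is sketched below in Python. The rule is an assumption made for illustration and may differ from the exact procedure in [Hiller et al., 2012a]: pick the percent-identity cutoff that best separates trait-loss species (expected below the cutoff) from trait-preserving species (expected at or above it), and report how many species end up on the wrong side. Species treated as missing data are simply left out of the input. The name `count_violations` is hypothetical.

```python
def count_violations(identity, trait_lost):
    """identity: dict species -> percent identity to the reconstructed ancestor
    (species with missing data are omitted).
    trait_lost: set of species that lack the trait.
    Returns the smallest number of species on the wrong side of any cutoff."""
    # Candidate cutoffs: every observed value, plus infinity ("all diverged").
    cutoffs = sorted(set(identity.values())) + [float("inf")]
    best = len(identity)
    for cut in cutoffs:
        violations = 0
        for species, pid in identity.items():
            diverged = pid < cut
            if diverged != (species in trait_lost):
                violations += 1
        best = min(best, violations)
    return best


if __name__ == "__main__":
    lost = {"human", "guinea_pig"}
    perfect = {"human": 70, "guinea_pig": 65, "mouse": 95, "rat": 93, "dog": 94}
    one_off = dict(perfect, dog=68)  # dog looks diverged but has the trait
    print(count_violations(perfect, lost), count_violations(one_off, lost))
```

A region with 0 violations is a perfect match to the trait pattern; the made-up percentages in the example are not data from the study.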
Regions matching the vitamin C trait are clustered
[Figure: the 544,549 conserved regions by no. of violating species (axis 0-10), ranging from perfect match to no match]
-> these conserved regions are all exons of a single gene
This gene is more diverged
in all non-vitamin C synthesizing species
What is the function of this gene?
33 genomes X 544,549 regions
Vitamin C pattern
Gulo - gulonolactone (L-) oxidase
encodes the enzyme responsible for vitamin C biosynthesis
Note:
1. No likely shared disabling mutation.
2. We learned about both evolution and function.
The Power of Forward Genomics
33 genomes X 544,549 regions
Vitamin C
pattern
Gulo - gulonolactone (L-) oxidase
Forward genomics works.
Can it work for continuous traits?
With only two independent losses?
And many unknown values?
Bile
Bile is a fluid produced by the liver that aids
the digestion of lipids in the small intestine.
Bile Phospholipids
Different mammals have remarkably different levels of biliary phospholipids:
ABCB4 is a phospholipid transporter
Find “Cure” Models for Human Disease
Human ABCB4 mutations lower patient biliary phospholipid levels
to guinea pig levels but are detrimental.
Our discovery: Guinea pig and horse have inactivated the Abcb4 gene
in their natural state. How can they do it?
create KO gene -> try to fix/treat
Natural KO -> find nature’s cure!
Forward Genomics: How General?
• Maybe we just got lucky?
• Simulation: our discoveries are not serendipitous
• More losses, more branch length => more likely
[Hiller et al., 2012a]
Forward Genomics: It’s not just enzymes
We find hundreds of Conserved Non-coding Elements (CNEs)
independently lost using just 8 mammalian genomes.
[Hiller et al., 2012b]
9 independent CNE losses near DIAPH2
in dog and guinea pig
Diaphanous homolog 2 may play a role in the development and normal function
of the ovaries. Mutations of this gene have been linked to premature ovarian failure.
How many independent trait losses?
42% of measured traits!
[Hiller et al., 2012a]