W09 Presentation: Comparative Genomics File

Download Report

Transcript W09 Presentation: Comparative Genomics File

BIO 508
Comparative
Genomics
Eric Franzosa
30 March 2016
From the notes of
CHARLES DARWIN
Mid-July 1837
What evolution is
What evolution is
Biological systems replicate
Replication introduces variation
Some variants replicate more
We observe variation in phenotype
Underlying this variation is a change in genotype
Many changes in genotype
are deleterious, resulting in
less-fit organisms
gene
Some changes are adaptive,
resulting in more-fit
organisms
gene
gene
gene
Changes in genotype may
also be “neutral,” producing
no change in the phenotype
Changes in genes and genomes
Changes in genotype
GATCGGGATCGCGGACGATC
Evolution
GATCGCCATCGCGGATGATC
GGTCGGGATCGGGGACTTTC
“DNA is a victim of its own chemistry”
(Don’t be too upset: our existence depends on it!)
Changes in genotype: Nucleotide substitutions
C
A
transitions
(purines)
transitions
(pyrimidines)
transversions
G
T
S1 = AGCTATGCAGGGGCATCTAGGCATCGAGCATC
S2 = AGCTGTGGAGGAGCCTCTAGGCATTGTGCATC
Changes in genotype: Insertions & deletions
S1 = AG-TATGCAGGGGCA-CTAGGC---GAGCATC
S2 = AGCTAT-CAGGGGCATC-AGGCATCGAGCATC
Collectively called indels.
S1 = AGTATGCAGGGGCA-CTAGGC---GAGCATC
S2 = AG TAT-CAGGGGCATC-AGGCATCGAGCATC
S1 = AG TATGCAGGGGCA-CTAGGC---GAGCATC
S2 = AGCTAT-CAGGGGCATC-AGGCATCGAGCATC
“Runs” of indels are common: they happen much more often
than you’d expect from rates of single indels.
Genome evolution
ancestral
genome
gene
modern
genome
1
modern
genome
2
ORTHOLOGS
Descend directly from ancestral gene in the ancestral
species; they tend to carry out the same function.
Genome evolution
ancestral
genome
gene
modern
genome
1
PARALOGS
modern
genome
2
Result from gene duplication; one or both may
diverge from the ancestral gene’s function.
Examples: Hemoglobin protein family
Number of UNCHANGED positions ^
Examples: Yeast genomes
Whole genome
duplication!
Kellis et al. Nature 428, 617-624 (2004)
Genome evolution
ancestral
genome
gene
XENOLOG
modern
genome
1
modern
genome
2
Are not related by descent (reproduction),
giving rise to the term “horizontal gene transfer.”
Genome evolution: deletion
Genes (arrows) can be lost or shuffled.
1
2
3
deleted
1
3
Genome evolution: inversion
Rotate
Mirror
Comparison whole genomes with dot plots
Learning from change
Signs of selection in genome evolution
gene
???
ancestral genome
mutation
Selection
(bad mutations aren’t reproduced as much)
modern genome 1
modern genome 2
comparison reveals
constrained elements
e.g. genes
RealSigns
example
GENOMICS
of selection in genomeCOMPARATIVE
evolution: ENCODE
Ovcharenko, I. et al. Genome Res. 2005 Jan;15(1):184-94
Protein consequences
Why do sequences differ?
Changes in coding sequences
Protein consequences
Why do sequences differ?
Changes in coding sequences
Wild Type mRNA
5’
GCU
GGA
GCA
CCA
GGA
CAA
GAU
GGA
3’
WT Polypeptide
N
Ala
Gly
Ala
Pro
Gly
Gln
Asp
Gly
C
5’
GCU
GGA
GCC
CCA
GGA
CAA
GAU
GGA
3’
N
Ala
Gly
Ala
Pro
Gly
Gln
Asp
Gly
C
5’
GCU
GGA
GCA
CCA
AGA
CAA
GAU
GGA
3’
N
Ala
Gly
Ala
Pro
Arg
Gln
Asp
Gly
C
5’
GCU
GGA
GCA
CCA
GGA
UAA
GAU
GGA
3’
N
Ala
Gly
Ala
Pro
Gly
STOP
5’
GCU
GGA
GCC
ACC
AGG
ACA
AGA
UGG
3’
N
Ala
Gly
Ala
Thr
Arg
Thr
Arg
Trp
C
substitution
Silent Mutation
Missense Mutation
insertion
Nonsense Mutation
Frameshift Mutation
C
Changes in coding sequences
CDS 1
CDS 2
CDS 3
5:5
5:5
5:5
fixation
1:3
2:3
3:3
evolutionary rate
LOW
MEDIUM
HIGH
selective constraint
HIGH
MEDIUM
LOW
*mutation
*non-synonymous (changes amino acid) vs. synonymous (silent)
Examples: Structure-evolution relationships
Interfaces and
active sites break
the rules!
Global determinants of protein evolutionary rate
ER  (fitness effect)-1
ER  contact density
Hirsh et al., Nature, 2001
Bloom et al., Mol Biol Evol, 2006
ER  (expression)-1
ER  (# of interactors)-1
Pal et al., Genetics, 2001
Fraser et al., Science, 2002
Quantifying sequence similarity by alignment
mathamatiks
did you mean mathematics?
Search
Sequence alignment
mathamatiks
++++ ++++ +
mathematics
Score = 9
In a world of substitutions, life is pretty easy
(Just count the “identities”)
Sequence alignment
Problem: letters can be added and deleted,
mathmaticks
++++
+
mathematics
Score = 5
Allowing gaps (indels) improves the score,
math-maticks
++++ +++++ +
mathematic-s
Better
Score = 10
Sequence alignment
For two sequences of length 12, there are over
2.5 x 108 gapped alignments to consider.
Allowing gaps (indels) improves the score,
math-maticks
++++ +++++ +
mathematic-s
Better
Score = 10
Representing an alignment on a grid
Xi → 0
–
Yj
↓
0 –
1 G
2 A
3 T
4 A
5 G
6 A
1
G
2
A
3
T
4
T
5
A
6
C
7
A
Consider two DNA
sequences:
GATTACA GATAGA
Representing an alignment on a grid
Xi → 0
–
Yj
↓
0 –
1
G
2
A
3
T
4
T
5
A
6
C
7
A
Consider two DNA
sequences:
1 G
GATTACA GATAGA
2 A
3 T
4 A
5 G
6 A
G
|
G
A
|
A
T
–
T
|
T
A
|
A
C
G
A
|
A
Their best alignment
looks like this
Representing an alignment on a grid
Xi → 0
–
Yj
↓
0 –
0
1
G
2
A
3
T
4
T
5
A
6
C
7
A
Consider two DNA
sequences:
↖
1 G
GATTACA GATAGA
↖
2 A
←
The alignment is a
path in this grid
↖
3 T
↖
4 A
↖
5 G
↖
6 A
G
|
G
A
|
A
T
–
T
|
T
A
|
A
C
G
A
|
A
Representing an alignment on a grid
Xi → 0
–
Yj
↓
0 –
0
1
G
2
A
3
T
4
T
5
A
6
C
7
A
Consider two DNA
sequences:
↖
1 G
1
GATTACA GATAGA
↖
2 A
2 ← 1
A “match” step adds
1 to the score…
↖
3 T
2
↖
4 A
A “mismatch” step
adds zero…
3
↖
5 G
3
↖
6 A
Total = 4
4
G
|
G
A
|
A
T
–
T
|
T
A
|
A
C
G
A
|
A
And an indel/gap
step adds -1
(Punishment for
being less common)
Finding the best alignment
Xi → 0
1
2
3
4
5
6
7
–
G
A
T
T
A
C
A
Yj
↓
0 –
0 → -1 → -2 → -3 → -4 → -5 → -6 → -7
↓
1 G
-1
↓
2 A
-2
↓
3 T
-3
↓
4 A
-4
↓
5 G
-5
↓
6 A
-6
Start by considering only gaps (ridiculous alignments)
Finding the best alignment
Xi → 0
1
2
3
4
5
6
7
–
G
A
T
T
A
C
A
Yj
↓
0 –
0 → -1 → -2 → -3 → -4 → -5 → -6 → -7
↓↘↓
1 G
-1 → 1
↓
2 A
-2
↓
3 T
-3
↓
4 A
-4
↓
5 G
-5
↓
6 A
-6
Option 1 (↘)
-G
-G
0+1=1
Option 2 (→)
G-G
-1 + -1 = -2
Option 3 (↓)
-G
G-
Each other cell is the best of 3 options
-1 + -1 = -2
Finding the best alignment
Xi → 0
1
2
3
4
5
6
7
–
G
A
T
T
A
C
A
Yj
↓
0 –
0 → -1 → -2 → -3 → -4 → -5 → -6 → -7
↓↘
↘↓
1 G
-1
1 → 0
↓
2 A
-2
↓
3 T
-3
↓
4 A
-4
↓
5 G
-5
↓
6 A
-6
Option 1 (↘)
-G
GA
-1 + 0 = -1
Option 2 (→)
GGA
1 + -1 = 0
Option 3 (↓)
-- G
-2
+
-1
=
-3
GA -
Keep going, keeping track of the best option
Finding the best alignment
Xi → 0
1
2
3
4
5
6
–
G
A
T
T
A
C
Yj
↓
0 –
0 → -1 → -2 → -3 → -4 → -5 → -6
↓↘
1 G
-1
1 → 0 → -1 → -2 → -3 → -4
↓
↓↘
↘
2 A
-2
0
2 → 1 → 0 → -1 → -2
↓
↓
↓↘
↘
3 T
-3
-1
1
3 → 2 → 1 → 0
↓
↓↘↓
↓↘
↘
4 A
-4
-2
0
2
3
3 → 2
↓↘↓
↓
↓↘↓↘
↘
5 G
-5
-3
-1
1
2
3
3
↓
↓↘↓
↓
↓↘
↘
6 A
-6
-4
-2
0
1
3
3
7
A
→ -7
→ -5
↘
→ -3
→ -1
↘
→ 1
↘
→ 2
↘
4
The completed table (best score in lower right corner)
Finding the best alignment
Xi → 0
1
2
3
4
5
6
–
G
A
T
T
A
C
Yj
↓
0 –
0 → -1 → -2 → -3 → -4 → -5 → -6
↓↖
1 G
-1
1 → 0 → -1 → -2 → -3 → -4
↓
↓↖
↘
2 A
-2
0
2 → 1 → 0 → -1 → -2
↓
↓
↓↖
↘
3 T
-3
-1
1
3 ← 2 → 1 → 0
↓
↓↘↓
↓↘
↖
4 A
-4
-2
0
2
3
3 → 2
↓↘↓
↓
↓↘↓↘
↖
5 G
-5
-3
-1
1
2
3
3
↓
↓↘↓
↓
↓↘
↘
6 A
-6
-4
-2
0
1
3
3
G
|
G
A
|
A
T
|
T
T
–
A
|
A
C
G
7
A
→ -7
→ -5
↘
→ -3
→ -1
↘
→ 1
↘
→ 2
↖
4
A
|
A
Trace back to the
start to find the best
alignment!
You can do this in Excel 
=MAX( B2+IF($A3=C$1,1,0), B3-1, C2-1 )
match = 1
mismatch = 0
gap = -1
More alignment options
Local alignment ignores the beginning and
ends of sequences, finding common “cores”
Gaps can be added in groups, which is more
forgiving of large scale insertion/deletion
When aligning proteins, matches & mismatches are scored
to reflect biochemical similarity (e.g. BLOSUM62 scores)
Trp W
Tyr Y
Val V
-3
-2
0
A
Ala
-3
-2
-3
R
Arg
-4
-2
-3
N
Asn
-4
-3
-3
D
Asp
-2
-2
-1
C
Cys
-2
-1
-2
Q
Gln
-3
-2
-2
E
Glu
-2 -2 -3 -2
-3 2 -1 -1
-3 -3 3 1
G H I L
Gly His Ile Leu
-3 -1 1 -4
-2 -1 3 -3
-2 1 -1 -2
K M F P
Lys Met Phe Pro
-3 -2 11 2 -3
-2 -2 2 7 -1
-2 0 -3 -1 4
S T W Y V
Ser Thr Trp Tyr Val
Phylogenetic tree reconstruction
How would we build a tree like this?
How would we build a tree like this?
In the old days, we built trees
by comparing phenotypes
Can you guess some problems
with this approach?
Today trees are built
based on genotype
information, e.g.
mitochondrial DNA
sequences
How would we build a tree like this?
A CGGGG-GGTGCGACTATTC--A
B CAGGACTATGCGACTAATCTGA
C CAGGACTATGCAACTAATCTGA
D CAGGACTATGCAACTAATCTGA
E TGGGACTATGCGACTATTCTGA
Use the alignment
techniques from before to
compare each pair of
sequences…
From similarity to distance
G A G C T A C G G A
G A C C A A - T G A
+1 +1 0 +1 0 +1 -1 0 +1 +1
Alignments usually focus on similarity,
trees usually focus on distance.
0
0 +1 0 +1 0 +2 +1 0
0
Today’s simplifying assumptions:
(1) All sequences have the same length
(2) All changes are created equal
(3) Multiple changes per site (e.g. G→A→C) are rare
Building trees with UPGMA
Unweighted Pair Group Method with Arithmetic mean
(Most complicated name, most straightforward method)
Building trees with UPGMA
d
A
B
C
D
E
A
0
B
8
0
C
4
8
D
6
8
E
8
4
0
6
0
8
8
0
Start with distance table, d, and
each species as a “leaf” node
B
E
D
A
C
Building trees with UPGMA
d
A
B
C
D
E
A
0
B
8
0
C
4
8
D
6
8
E
8
4
Find the two closest nodes i and j,
separated by distance d(i, j).
0
6
0
8
8
0
Hang the two nodes from a new
node, k, at height d(i, j)/2.
F
2
B
E
D
2
A
C
Building trees with UPGMA
d
A
B
C
D
E
F
A
0
B
8
0
C
4
8
D
6
8
E
8
4
F
8
0
6
0
8
8
0
6
8
0
Add the new node k to the
distance table; its distance to
each other node h is the average
of i and j’s distances to h.
d(h, i)  d(h, j)
d(h, k) 
2
F
2
B
E
D
2
A
This is the “arithmetic mean” in the UPGMA acronym
C
Building trees with UPGMA
d
B
D
E
F
B
0
D
8
0
E
4
8
F
8
6
0
8
0
Delete i and j from
the distance table.
F
2
B
E
D
2
A
C
Building trees with UPGMA
d
B
D
E
F
G
B
0
D
8
0
E
4
8
F
8
4
G
8
0
8
0
8
0
G
2
F
2
B
2
E
Rinse and repeat.
D
2
A
C
Building trees with UPGMA
d
D
F
G
D
0
F
4
0
G
8
8
0
G
2
F
2
B
2
E
Rinse and repeat.
D
2
A
C
Building trees with UPGMA
d
D
F
G
H
D
0
F
4
0
G
8
8
H
-
0
8
0
H
G
2
F
3
2
B
1
2
E
Rinse and repeat.
D
2
A
C
Building trees with UPGMA
d
G
H
G
0
H
8
0
H
G
2
F
3
2
B
1
2
E
Rinse and repeat.
D
2
A
C
Building trees with UPGMA
d
G
H
I
G
0
H
8
0
I
0
I
1
2
H
G
2
F
3
2
B
1
2
E
Rinse and repeat.
D
2
A
C
Building trees with UPGMA
d
A
B
C
D
E
A
0
B
8
0
C
4
8
D
6
8
E
8
4
0
6
0
8
8
0
I
1
2
H
G
2
F
3
2
B
1
2
E
D
2
A
Distances in the tree reflect the original distances.
C
Problems with UPGMA
Assumes a molecular clock, i.e.
Distance between taxa is directly proportional
to the time since their divergence.
This is often not the case. Why?
B
C
D
B
C
D
A
A
A perfectly reasonable tree
UPGMA gets it wrong
Neighbor joining (NJ)NEIGHBOR JOINING
Neighbor joining is another distance-based tree-building
method. It does not assume a molecular clock.
A
E
E
B
A
D
y
x
C
D
C
B
Pull out pairs of nodes that keep the total branch length as
small as possible (not necessarily the closest nodes).
NJ geometry is nifty but tedious...
3
3
3
1
4
y
x
1
4
y
2
x
1
4
y
2
5
5
5
3
3
3
1
4
y
5
x
x
1
4
y
2
5
x
2
1
4
y
2
x
5
N
N
1
2 N 
L xy 
dij 

 dk1   dk2  (N  2)d12 
2(N  2)  k 3
N  3 ji3 
k 3
2
Maximum parsimony
Maximum parsimony (MP) looks for a tree that can explain the
differences among your sequences using the smallest possible
number of mutations.
Maximum parsimony
Begin with a set of aligned sequences:
S1
S2
S3
S4
=
=
=
=
A
G
A
A
G
G
A
T
A
A
A
C
A
A
A
T
A
A
T
T
A
A
A
G
A
A
A
A
T
T
C
T
A
A
T
C
A
A
T
C
Consider all possible trees:
S1
S3
S1
S2
S1
S2
S2
S4
S3
S4
S4
S3
Maximum parsimony
S1
S2
S3
S4
A
G
A
A
=
=
=
=
G
G
A
T
A
A
A
C
A
A
A
T
A
A
T
T
A
A
A
G
A
A
A
A
T
T
C
T
A
A
T
C
A
A
T
C
Count changes at this site required by each tree.
S1
A
A
G
S2
1
change
S3
A
A
A
S4
S1
A
A
A
S3
1
change
S2
G
A
A
S4
S1
A
A
A
S4
1
change
S2
G
A
A
S3
Maximum parsimony
This site is uninformative.
S1
S2
S3
S4
A
G
A
A
=
=
=
=
G
G
A
T
A
A
A
C
A
A
A
T
A
A
T
T
A
A
A
G
A
A
A
A
T
T
C
T
A
A
T
C
A
A
T
C
All trees look equally good.
S1
G
G
G
S2
2
change
S3
A
T
T
S4
S1
G
G
A
S3
2
change
S2
G
G
T
S4
S1
G
G
T
S4
2
change
S2
G
G
A
S3
Maximum parsimony
This site is informative.
S1
S2
S3
S4
A
G
A
A
=
=
=
=
G
G
A
T
A
A
A
C
A
A
A
T
A
A
T
T
A
A
A
G
A
A
A
A
T
T
C
T
A
A
T
C
A
A
T
C
The first tree is more parsimonious.
S1
A
A
A
S2
1
change
S3
T
T
T
S4
S1
A
A
T
S3
2
change
S2
A
A
T
S4
S1
A
A
T
S4
2
change
S2
A
A
A
S3
Maximum parsimony
S1
S2
S3
S4
S1
=
=
=
=
A
G
A
A
G
G
A
T
A
A
A
C
A
A
A
T
A
A
T
T
A
A
A
G
A
A
A
A
T
T
C
T
A
A
T
C
A
A
T
C
1
2
1
1
1
1
0
1
2
2
S3
This tree can explain the four
sequences with 12 mutations, the
most parsimonious solution.
S2
S4
Problems with maximum parsimony
The number of trees gets very big very fast.
Taxa
3
5
10
20
Rooted Trees
3
105
3.4 × 107
8.2 × 1021
Unrooted Trees
1
15
2.0 × 106
2.2 × 1020
N
(2N-3)!!
(2N-5)!!
N!! = N × (N – 2) × (N – 4) × (N – 6) × …
So what do people actually use?
Distance-based methods like NJ work well in practice, and
are not too computationally demanding.
The state-of-the-art is to apply a probabilistic approach to
maximum parsimony and find the most likely
(maximum likelihood) tree.
(Tricks are used to speed up the search through tree space
at the expense of finding an exact solution.)
RAxML
PhyML
MrBayes
FastTree
Multiple sequence alignment
Multiple sequence alignment (MSA)
Multiple sequence alignment (MSA)
Useful for summarizing whole
families of related proteins.
Input for maximum parsimony and
maximum likelihood tree-building.
Exact MSA
To align X = FIT and Y = FLIT
at each point in the alignment we
must check 3 options for each cell
in the alignment grid.
SEQ X
SEQ Y
(X0, Y0)
(Xi, Yj)
F - I T
F L I T
MATCH or
MISMATCH
GAP #1
GAP #2
X
Y
X
-
Y
Exact MSA
SEQ X
To align X = FIT and Y = FLIT
and Z = FRET we must check 7
options in each cell of the
alignment cube.
SEQ Y
(X0, Y0, Z0)
MATCH or
MISMATCH
GAP #1
GAP #2
X
Y
Z
X
Y
-
X
Z
(Xi, Yj, Zk)
F - I T
F L I T
F R E T
GAP #3
GAP #4
GAP #5
GAP #6
Y
Z
Z
X
-
Y
-
Exact MSA
MATCH or
MISMATCH
X
Y
Z
GAP #1
GAP #2
GAP #3
X
Y
-
X
Z
Y
Z
GAP #4
GAP #5
GAP #6
Z
X
-
Y
-
Exact MSA (is not feasible)
And that’s just 3 short sequences!
The number of total comparisons grows as LN × (2N – 1)
(L = sequence length, N = # of sequences)
L ~ 90
N ~ 30
This method needs
>1067 comparisons to
build the alignment...
Progressive MSA
Use iterative pairwise alignments to build the full
alignment of all sequences.
PIEKSAVTL
DIRKSAITAL
DIRKSALYAL
YEEKSITAL
YEEKSTAL
First do all pairwise alignments, then use the resulting
distance information to build a tree.
Progressive MSA
PIEKSAVTL
DIRKSAITAL
DIRKSALYAL
YEEKSITAL
YEEKSITAL
YEEKSTAL
YEEKS-TAL
Order of alignment is determined by the tree structure
(align more closely related groups first).
Progressive MSA
PIEKSAVTL
DIRKSAITAL
DIRKSAITAL
DIRKSALYAL
DIRKSALYAL
YEEKSITAL
YEEKS-TAL
Order of alignment is determined by the tree structure
(align more closely related groups first).
Progressive MSA
PIEKSAVTL
PIEKSAVT-L
DIRKSAITAL
DIRKSAITAL
DIRKSALYAL
DIRKSALYAL
YEEKSITAL
YEEKS-TAL
Order of alignment is determined by the tree structure
(align more closely related groups first).
Aligning alignments
AAAAACCCCC
AAAAACCCCC
AAAAATTCCCCC
AAAAATTCCCCC
AAAAA--CCCCC
AAAAA--CCCCC
When “gapping” an alignment, each
sequence receives a gap
F
F
F
F
L
R
R
I1
I2
E
T
T
Y
T
+36 -2 -5
9
= 38
I 1 I1 I1 I2 I 2 I2 - E - E E
+4 -1 -3 -1 -3 -1
= -5
Matches and mistmatches are scored
with a sum of all pairwise scores
Progressive MSA
PIEKSAVT-L
PIEKSAVT-L
DIRKSAITAL
DIRKSAITAL
DIRKSALYAL
DIRKSALYAL
YEEKSITAL
YEEKS-ITAL
YEEKS-TAL
YEEKS--TAL
Order of alignment is determined by the tree structure
(align more closely related groups first).
Issues with progressive MSA
This approach still requires a lot of comparisons, but it is
feasible (the exact approach was not).
PIEKSAVT-L
DIRKSAITAL
DIRKSALYAL
YEEKS-ITAL
This produces a good MSA, but
it’s not guaranteed to be the
best one!
YEEKS--TAL
The CLUSTAL algorithm follows this approach, using a
neighbor joining tree to guide the alignments.
Issues with progressive MSA
“Once a gap, always a gap!”
DIRKSAITAL
DIRKSALYAL
PIEKSAVT-L
YEEKSIT-AL
YEEKS-T-AL
Choices that worked well in a
pairwise alignment may prove
poor for the MSA, but there is no
going back with this approach.
The MUSCLE algorithm re-evaluates earlier alignment
decisions to mitigate these problems.
Activity
(time permitting)
Gather myoglobin sequences at UniProt
Download as a FASTA file
Use phylogeny.fr to build a MSA
Use phylogeny.fr to build a neighbor joining tree
Use bootstraps to assess confidence