Genomics of sensory systems - University of Maryland


Lecture 5 : Phylogenies
9/16/09
Translated blast = protein vs translated database
Blasting Genbank - blastn
Z. bruijni - long beaked echidna T. aculeatus - echidna
T. rostratus = honey possum
AX8GS9DG01S
Blasting Genbank - discontiguous megablast - exactly same as blastn
Z. bruijni - long beaked echidna T. aculeatus - echidna
T. rostratus = honey possum
AX9N23U7014
Blasting Genbank - megablast same species but different order
Z. bruijni - long beaked echidna T. aculeatus - echidna
T. rostratus = honey possum
AX9TUM1G016
AX9DYYTE01N
Blasting Genbank - Tblastn
T. aculeatus - echidna
S. brachyurus - quokka
S. crassicaudata - fat tailed dunnart
M. fasciatus - numbat
I. obesulus - quenda
Species found by BLAST
O. anatinus = platypus
T. aculeatus = echidna
Z. bruijni = long beaked echidna
S. brachyurus = quokka
M. fasciatus = numbat
S. crassicaudata = fat tailed dunnart
T. rostratus = honey possum
I. obesulus = quenda = bandicoot
Homologene - can be reached from the NCBI home page
Scroll down: the databases are listed alphabetically
Questions
Phylogenies - what are they?
1. How do we build them?
2. What do they tell us?
Phylogeny

Evolutionary history of a group of organisms, especially as depicted in a family tree
Haeckel, 1879
Things trees might tell you:
- How are organisms with a particular trait related?
- Did the trait evolve multiple times or only once?
- What is the evolutionary pathway?
  - Of organisms
  - Of genes
Molecules can be used to learn how organisms are related
To learn about vertebrate evolution:
- Compare >600 genes
- Used genes to measure time
  1) Time since common ancestor with human
  2) Time since two groups diverged
More recent version of vertebrate evolution, which shows divergence times on the animal tree
Ponting 2008
Orangutan
Human
Chimp
Rhesus monkey
Mouse
Rat
Dog
Cat
Horse
Cow
Opossum
Wallaby
Platypus
Anole
Chicken
Frog
Fish -Medaka
Fugu
Tetraodon
Zebrafish
Elephant shark
Lamprey
Primates 25 MY
Mammals 100 MY
Tetrapods 420 MY
Fish 320 MY
All vertebrates 550 MY
Molecular clock
- Molecules change at a steady rate
- We can calibrate how fast they change using fossils
- The molecules then become a timepiece to measure how recently different groups split off from each other
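The calibration idea can be sketched as below. This is a minimal illustration that assumes a strict clock, in which the distance between two sequences accumulates along both lineages since their split (so rate = d / 2T). The function names and the numbers are hypothetical and are not taken from the lecture.

```python
def calibrate_rate(distance, fossil_age_my):
    """Substitutions per site per million years along ONE lineage.

    Assumes a strict clock: the two lineages each evolved for
    fossil_age_my, so the observed distance covers 2 * fossil_age_my.
    """
    return distance / (2.0 * fossil_age_my)


def divergence_time(distance, rate):
    """Estimated time (MY) since two sequences shared a common ancestor."""
    return distance / (2.0 * rate)


# Hypothetical calibration: two groups with a fossil-dated split at 100 MY
# differ at 20% of sites.
rate = calibrate_rate(0.20, 100.0)

# Hypothetical query: another pair of sequences differs at 5% of sites.
t = divergence_time(0.05, rate)
print(f"rate = {rate:.4f} per site per MY, estimated split = {t:.1f} MY")
```

The estimate is only as good as the assumption that the rate is constant and that the distance has not started to saturate.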
Sequence conservation may be high
- Gene might code for a protein which is highly constrained
- Might have to interact with lots of other proteins
- Selection might be quite strong

Sequence conservation may be low
- Not much constraint
- Few sites of interaction
- Selection might be weak
Phylogeny steps
- Align sequences so homologous amino acids (AA) can be compared
- Determine the similarity between sequences
- Use this to generate a relationship between sequences
Clustalw2 to align sequences
Put sequences in FASTA file
>TetraodonG1
MVWDGGIEPNGTEGKNFYIPMSNRTGIVRSPFEYPQYYLVDPIM
FKMLALYMFFLICTGTPINGLTLLVTAQNKKLRQPLNYILVNLAVAGLIMCAFGFTIT
ITSAINGYFILGATACAVEGFMATLGGEVALWSLVVLAIERYIVVCKPMGSFKFTGTH
AAVGVLFTWIMAFACAGPPLFGWSRYLPEGMQCSCGPDYYTLAPGYNNESYVIYMFVV
HFFVPVFLIFFTYGSLVLTVRAAAQQQESESTQKAQREVTRMCILMVLGFLVAWTPYA
TFSGWIFMNKGAAFHPLTAALCAFFAKSSALYNPVIYVLMNKQFRNCMLSTFGMGGAV
DDETSVSASKTEVSSVS
>ZebrafishG1
MNGTEGSNFYIPMSNRTGLVRSPYDYTQYYLAEPWKFKALAFYMFLLIIFGFPINVLT
LVVTAQHKKLRQPLNYILVNLAFAGTIMVIFGFTVSFYCSLVGYMALGPLGCVMEGFF
ATLGGQVALWSLVVLAIERYIVVCKPMGSFKFSANHAMAGIAFTWFMACSCAVPPLFG
WSRYLPEGMQTSCGPDYYTLNPEYNNESYVMYMFSCHFCIPVTTIFFTYGSLVCTVKA
AAAQQQESESTQKAEREVTRMVILMVLGFLFAWVPYASFAAWIFFNRGAAFSAQAMAV
PAFFSKTSAVFNPIIYVLLNKQFRSCMLNTLFCGKSPLGDDESSSVSTSKTEVSSVSPA
>CichlidG1
MAWEGGIEPNGTEGKNFYIPMSNRTGIVRSPFEYTQYYLADPIFFKLLAFYMFFLICT
GTPINSLTLFVTAQNKKLRQPLNYILVNLAVAGLIMCCFGFTITITSAFNGYFILGST
FCAIEGFMATLGGEVALWSLVVLAIERYIVVCKPMGSFKFSGAHAGAGVLFTWIMAMA
CAAPPLFGWSRYIPEGMQCSCGPDYYTLAPGFNNESYVIYMFVVHFFVPVFIIFFTYG
SLVMTVKAAAAQQQDSASTQKAEKEVTRMCVLMVMGFLIAWTPYASFAGWIFMNKGAS
FSALTAAIPAFFAKSSALYNPVIYVLMNKQFRNCMLSTIGMGGMVEDETSVSTSKTEV
SSVS
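Before aligning, it can help to check that the FASTA file reads back as expected. The sketch below is a hypothetical helper (`parse_fasta` is not part of Clustalw2) that joins the wrapped sequence lines under each `>` header. The example records are truncated versions of the sequences above.

```python
def parse_fasta(text):
    """Return {name: sequence} from FASTA-formatted text.

    Lines starting with '>' begin a new record; all following lines
    are joined into one sequence until the next header.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


# Truncated example records (first residues only, wrapped over two lines).
example = """>TetraodonG1
MVWDGGIEPNGTEGKNFYIPMSNR
TGIVRSPFEY
>ZebrafishG1
MNGTEGSNFYIPMSNR
TGLVRSPYDY
"""

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq), seq)
```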
Aligned sequences: .aln file; Jalview gives colored version
Funky tree: .dnd file (need special program to draw)
Scroll down this page for tree (use Phylogram)
CLUSTAL W (1.83) multiple sequence alignment

TetraodonG1     MVWDGGIEPNGTEGKNFYIPMSNRTGIVRSPFEYPQYYLVDPIMFKMLALYMFFLICTGT 60
CichlidG1       MAWEGGIEPNGTEGKNFYIPMSNRTGIVRSPFEYTQYYLADPIFFKLLAFYMFFLICTGT 60
ZebrafishG1     --------MNGTEGSNFYIPMSNRTGLVRSPYDYTQYYLAEPWKFKALAFYMFLLIIFGF 52
                         *****.***********:****::*.****.:*  ** **:***:**  *

TetraodonG1     PINGLTLLVTAQNKKLRQPLNYILVNLAVAGLIMCAFGFTITITSAINGYFILGATACAV 120
CichlidG1       PINSLTLFVTAQNKKLRQPLNYILVNLAVAGLIMCCFGFTITITSAFNGYFILGSTFCAI 120
ZebrafishG1     PINVLTLVVTAQHKKLRQPLNYILVNLAFAGTIMVIFGFTVSFYCSLVGYMALGPLGCVM 112
                *** ***.****:***************.** **  ****::: .:: **: **.  *.:

TetraodonG1     EGFMATLGGEVALWSLVVLAIERYIVVCKPMGSFKFTGTHAAVGVLFTWIMAFACAGPPL 180
CichlidG1       EGFMATLGGEVALWSLVVLAIERYIVVCKPMGSFKFSGAHAGAGVLFTWIMAMACAAPPL 180
ZebrafishG1     EGFFATLGGQVALWSLVVLAIERYIVVCKPMGSFKFSANHAMAGIAFTWFMACSCAVPPL 172
                ***:*****:**************************:. ** .*: ***:** :** ***

TetraodonG1     FGWSRYLPEGMQCSCGPDYYTLAPGYNNESYVIYMFVVHFFVPVFLIFFTYGSLVLTVR- 239
CichlidG1       FGWSRYIPEGMQCSCGPDYYTLAPGFNNESYVIYMFVVHFFVPVFIIFFTYGSLVMTVKA 240
ZebrafishG1     FGWSRYLPEGMQTSCGPDYYTLNPEYNNESYVMYMFSCHFCIPVTTIFFTYGSLVCTVKA 232
                ******:***** ********* * :******:***  ** :**  ********* **:

TetraodonG1     AAAQQQESESTQKAQREVTRMCILMVLGFLVAWTPYATFSGWIFMNKGAAFHPLTAALCA 299
CichlidG1       AAAQQQDSASTQKAEKEVTRMCVLMVMGFLIAWTPYASFAGWIFMNKGASFSALTAAIPA 300
ZebrafishG1     AAAQQQESESTQKAEREVTRMVILMVLGFLFAWVPYASFAAWIFFNRGAAFSAQAMAVPA 292
                ******:* *****::***** :***:***.**.***:*:.***:*:**:* . : *: *

TetraodonG1     FFAKSSALYNPVIYVLMNKQFRNCMLSTFGMGG--AVDDETS-VSASKTEVSSVS-- 351
CichlidG1       FFAKSSALYNPVIYVLMNKQFRNCMLSTIGMGG--MVEDETS-VSTSKTEVSSVS-- 352
ZebrafishG1     FFSKTSAVFNPIIYVLLNKQFRSCMLNTLFCGKSPLGDDESSSVSTSKTEVSSVSPA 349
                **:*:**::**:****:*****.***.*:  *     :**:* **:*********
Alignment is key
- Any other analysis that you do is only as good as your alignment
- If your alignment is bad subsequent analyses will be bad
- Junk in = Junk out
Alignments
- Tell you about sequence conservation
  - How much is there?
  - Where is it?
Calculate sequence similarities
Zebrafish   M--------NGTEGSNFYIPMSNR
Trout       M------Q-NGTEGSNFYIPMSNR
Medaka      M------E-NGTEGKNFYIPMNNR
Cod         M----RMEANGTEGKNFYIPMSNR
Halibut     MVWDGGIEPNGTEGKNFYIPMSNR
Tetraodon   MVWDGGIEPNGTEGKNFYIPMSNR
Goldfish    M--------NGTEGNNFYVPLSNR
Killifish   M---GYG-PNGTEGNNFYIPMSNK
            *        *****.***:*:.*:
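To make "how much conservation, and where" concrete, the sketch below takes the eight aligned N-terminal sequences above and reports which columns are identical in every species, plus the pairwise identity for one pair. The helper names are hypothetical, and counting identity only over columns where neither sequence has a gap is one convention among several.

```python
alignment = {
    "Zebrafish": "M--------NGTEGSNFYIPMSNR",
    "Trout":     "M------Q-NGTEGSNFYIPMSNR",
    "Medaka":    "M------E-NGTEGKNFYIPMNNR",
    "Cod":       "M----RMEANGTEGKNFYIPMSNR",
    "Halibut":   "MVWDGGIEPNGTEGKNFYIPMSNR",
    "Tetraodon": "MVWDGGIEPNGTEGKNFYIPMSNR",
    "Goldfish":  "M--------NGTEGNNFYVPLSNR",
    "Killifish": "M---GYG-PNGTEGNNFYIPMSNK",
}


def conserved_columns(seqs):
    """0-based indices of columns with no gaps and a single residue."""
    return [
        i
        for i, col in enumerate(zip(*seqs))
        if "-" not in col and len(set(col)) == 1
    ]


def percent_identity(a, b):
    """Fraction of identical residues over columns where neither has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x == y for x, y in pairs) / len(pairs)


conserved = conserved_columns(list(alignment.values()))
identity = percent_identity(alignment["Zebrafish"], alignment["Goldfish"])
print("conserved columns:", conserved)
print("Zebrafish vs Goldfish identity:", identity)
```

The fully conserved columns correspond to the `*` positions in the consensus line under the alignment.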
Pairwise comparisons
Use tree to show sequence relationships
Short branches mean sequences are more similar
Long branches mean there are more differences
Q3. How do we build phylogenies?
- Assume the relationships involve bifurcating branches
[Figure: bifurcating tree relating the sequences ATC, ATG, ACG, CCG and CCC]
Methods to determine similarities
- Parsimony
- Distance
- Maximum likelihood
- Bayesian
Parsimony
- The least complex explanation is the most likely to be correct
  - Occam’s razor
- The preferred phylogenetic tree is one that requires fewest changes
  - Count up # changes for all possible trees
  - Find the shortest one
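Counting changes for each candidate tree can be sketched as follows. This uses Fitch's algorithm, one standard way to count the minimum number of changes on a given tree; the lecture does not say which counting method it has in mind. Trees are written as nested tuples, and the four sequences are the ATCG / ACCG example used on the parsimony slides.

```python
def fitch_site(tree, site):
    """Return (possible ancestral states, minimum changes) for one site."""
    if isinstance(tree, str):  # a leaf is just a sequence string
        return {tree[site]}, 0
    left, right = tree
    left_states, left_cost = fitch_site(left, site)
    right_states, right_cost = fitch_site(right, site)
    shared = left_states & right_states
    if shared:  # children agree on at least one state: no extra change
        return shared, left_cost + right_cost
    # children disagree: one change is needed on this part of the tree
    return left_states | right_states, left_cost + right_cost + 1


def first_leaf(tree):
    """Walk down the left side of the tree until a sequence is reached."""
    while not isinstance(tree, str):
        tree = tree[0]
    return tree


def parsimony_score(tree):
    """Total number of changes over all sites for this tree."""
    n_sites = len(first_leaf(tree))
    return sum(fitch_site(tree, s)[1] for s in range(n_sites))


a, b, c, d = "ATCG", "ATCG", "ACCG", "ACCG"
# The three possible groupings of four sequences.
candidates = {
    "(a,b),(c,d)": ((a, b), (c, d)),
    "(a,c),(b,d)": ((a, c), (b, d)),
    "(a,d),(b,c)": ((a, d), (b, c)),
}
scores = {label: parsimony_score(t) for label, t in candidates.items()}
best = min(scores, key=scores.get)
print(scores, "most parsimonious:", best)
```

With only four sequences all trees can be listed by hand; the number of possible trees grows very quickly with more sequences, so real programs search rather than enumerate.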
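Trees based on parsimony
[Figure: two candidate trees for the sequences ATCG and ACCG, with C/T changes marked on the branches. One tree requires two changes; the tree that requires a single change is labeled most parsimonious.]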
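Trees based on parsimony
[Figure: the same two trees drawn for the single variable site (T or C), with C/T changes marked on the branches. The tree that requires a single change is labeled most parsimonious.]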
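Can’t always distinguish tree topologies
[Figure: alternative tree topologies for tips carrying T or C, with C/T changes marked on the branches. The topologies shown are equally parsimonious.]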
Other limitations
- All changes are weighted the same
  - C-T same as C-A
  - Same no matter how long it takes for the change to occur
Distance methods
- Calculate a numerical value for sequence differences
  - Do for all pairwise combinations
- Build tree by joining most similar sequences and then more divergent

Distance methods
- Fast
- Pretty robust
- Only deals with data in pairs
Pairwise distances
Taxa1 AACGGTCATGGCGTTGCATT
Taxa2 AACGGTCAGGGCGTTGCATT
Taxa3 AACGGTCACGCCGCTGCATT

      1     2     3
1     0    .05   .15
2    .05    0    .15
3    .15   .15    0
Distance, d
- p is fractional similarity of sequence
- Simplest form of distance: d = 1 - p
AACGGTCATGGCGTTGCATT
AACGGTCACGGCGTTGCATT
p = 19/20
d = 0.05
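The d = 1 - p calculation, applied to every pair, can be sketched as below. The function name `p_distance` is hypothetical; the three sequences are Taxa1 to Taxa3 from the pairwise distances slide.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned sites that differ (d = 1 - p)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for x, y in zip(a, b) if x != y)
    return differences / len(a)


taxa = {
    "Taxa1": "AACGGTCATGGCGTTGCATT",
    "Taxa2": "AACGGTCAGGGCGTTGCATT",
    "Taxa3": "AACGGTCACGCCGCTGCATT",
}

# One distance for every pairwise combination.
distances = {
    (n1, n2): p_distance(taxa[n1], taxa[n2])
    for n1, n2 in combinations(taxa, 2)
}
for pair, d in distances.items():
    print(pair, round(d, 2))
```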
Tree building
- Neighbor joining
  - Join most similar pair of sequences
  - Add more divergent after

      1     2     3
1     0    .05   .15
2    .05    0    .15
3    .15   .15    0

[Figure: tree of sequences 1, 2 and 3]
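The "join the most similar pair, then add more divergent sequences" idea can be sketched with average-linkage clustering (UPGMA). Note that this is a simpler stand-in and not the actual neighbor-joining algorithm, which also corrects for unequal rates among lineages. The function name and the nested-tuple tree format are assumptions made for illustration.

```python
def upgma(names, matrix):
    """Build a nested-tuple tree by repeatedly joining the closest clusters.

    names  : list of sequence names
    matrix : square list of lists of pairwise distances
    """
    n = len(names)
    # Distances keyed by (smaller id, larger id).
    d = {(i, j): matrix[i][j] for i in range(n) for j in range(i + 1, n)}
    # Active clusters: id -> (tree so far, number of sequences in it).
    info = {i: (names[i], 1) for i in range(n)}
    next_id = n

    def get(x, y):
        return d[(min(x, y), max(x, y))]

    while len(info) > 1:
        active = sorted(info)
        # Find the closest pair of active clusters.
        a, b = min(
            ((x, y) for k, x in enumerate(active) for y in active[k + 1:]),
            key=lambda pair: d[pair],
        )
        (tree_a, size_a), (tree_b, size_b) = info.pop(a), info.pop(b)
        # Distance from the new cluster to each remaining one is the
        # size-weighted average of the two old distances.
        for c in info:
            d[(c, next_id)] = (
                size_a * get(a, c) + size_b * get(b, c)
            ) / (size_a + size_b)
        info[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return next(iter(info.values()))[0]


names = ["Taxa1", "Taxa2", "Taxa3"]
matrix = [
    [0.00, 0.05, 0.15],
    [0.05, 0.00, 0.15],
    [0.15, 0.15, 0.00],
]
tree = upgma(names, matrix)
print(tree)
```

For the matrix on the slide, sequences 1 and 2 are joined first and sequence 3 is added afterwards.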
How different can 2 sequences get?
- At infinite time, random probability that two sequences are the same
  - Probability a base is same = 1/4
- DNA only has 4 bases
  - Certain sites will start to change multiple times
  - Need to account for these multiple hits
Random sequences
- Write down 20 bases of sequence
  - Compare your sequence to this one
- AGTCCGATTACGGCTAGCAG
- What fraction of sites are the same in the two sequences?
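The classroom exercise can also be run many times in a short simulation. This sketch assumes all four bases are equally likely at every site, which is the assumption behind the 1/4 figure; the seed and the number of trials are arbitrary choices.

```python
import random


def fraction_identical(a, b):
    """Fraction of positions at which two equal-length sequences match."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


reference = "AGTCCGATTACGGCTAGCAG"
rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 10000

total = 0.0
for _ in range(trials):
    # A random 20-base sequence with equal base frequencies.
    guess = "".join(rng.choice("ACGT") for _ in range(len(reference)))
    total += fraction_identical(guess, reference)

mean_identity = total / trials
print(f"mean fraction identical over {trials} random sequences: {mean_identity:.3f}")
```

Any single 20-base comparison can land well away from 25%, but the average over many random sequences settles close to it.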
Sequence similarity decays to 25% over long times
[Figure: sequence similarity (y-axis) plotted against time (x-axis, 0 to 3.5)]
Sequence difference maxes at 0.75
[Figure: sequence difference (y-axis, 0 to 1) plotted against time (x-axis, 0 to 3.5)]
Sequence change accumulates linearly with time at beginning
[Figure: sequence difference (y-axis, 0 to 1) plotted against time (x-axis, 0 to 3.5)]
DNA models
- Use different DNA models to account for how sequences evolve with time
  - Allows you to apply different molecular clocks
  - Relate sequence change to time
  - Clock is not linear except for small changes and short times
- Models same as used in maximum likelihood methods
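As one concrete example of a DNA model, the sketch below uses the Jukes-Cantor correction, the simplest such model (all bases equally frequent, all changes equally likely). The slides do not say which models were used, so this choice is an assumption. It converts an observed difference into an estimated number of substitutions per site, accounting for multiple hits, and shows the reverse relationship that saturates at 0.75.

```python
import math


def jukes_cantor_distance(p_diff):
    """Estimated substitutions per site from the observed fraction of
    differing sites, under the Jukes-Cantor model."""
    if not 0 <= p_diff < 0.75:
        raise ValueError("observed difference must be in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p_diff / 3)


def expected_difference(k):
    """Observed fraction of differing sites expected after k
    substitutions per site, under the same model."""
    return 0.75 * (1 - math.exp(-4 * k / 3))


for p in (0.05, 0.15, 0.30, 0.60):
    print(f"observed {p:.2f} -> corrected {jukes_cantor_distance(p):.3f}")

# The observed difference approaches 0.75 however many changes occur.
print(f"expected difference after 100 changes per site: {expected_difference(100):.3f}")
```

For small differences the corrected and observed values are nearly equal, which is the "linear at the beginning" behavior; the correction grows as the observed difference approaches 0.75.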
How good is your tree?
- Bootstrap approach
  - Run the same method multiple times
  - Subsample data each time
    - Use 50% of data
  - See how reproducible the trees are
    - Count how many times a particular grouping occurs
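The subsampling procedure described above can be sketched as follows. It draws half of the alignment columns without replacement in each replicate, as the slide describes; the classic bootstrap instead resamples columns with replacement. Treating a "grouping" as "the single closest pair by p-distance" is a simplification, and the function names, seed and replicate count are hypothetical.

```python
import random
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned sites that differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def closest_pair(seqs):
    """Return the uniquely closest pair of names, or None if there is a tie."""
    scored = sorted(
        (p_distance(seqs[a], seqs[b]), (a, b))
        for a, b in combinations(sorted(seqs), 2)
    )
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        return None  # tie: the grouping is not resolved in this replicate
    return scored[0][1]


def subsample_support(seqs, pair, replicates=1000, fraction=0.5, seed=0):
    """Fraction of replicates in which `pair` is the closest pair."""
    rng = random.Random(seed)
    n_sites = len(next(iter(seqs.values())))
    k = max(1, int(n_sites * fraction))
    hits = 0
    for _ in range(replicates):
        cols = rng.sample(range(n_sites), k)  # columns kept this round
        sub = {name: "".join(s[c] for c in cols) for name, s in seqs.items()}
        if closest_pair(sub) == pair:
            hits += 1
    return hits / replicates


taxa = {
    "Taxa1": "AACGGTCATGGCGTTGCATT",
    "Taxa2": "AACGGTCAGGGCGTTGCATT",
    "Taxa3": "AACGGTCACGCCGCTGCATT",
}
support = subsample_support(taxa, ("Taxa1", "Taxa2"))
print(f"Taxa1+Taxa2 grouping recovered in {100 * support:.0f}% of replicates")
```

With only three informative sites in 20 bases, a sizable share of replicates miss the sites that separate Taxa3, which is why the support is well below 100%.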
Distance tree for rod and cone transducin alpha subunit
- Branch lengths are proportional to sequence differences
- Bootstrap values are given for each node, which tell how reproducible that grouping is
[Figure: distance tree with bootstrap values at the nodes, ranging from 58 to 100]