Information Encoding in Biological Molecules: DNA and

Molecular Evolution, Multiple Sequence Alignment
& Phylogenetics.
Canadian Bioinformatics Workshop Thursday June 21st
David Lynn M.Sc., Ph.D.,
Postdoctoral Research Associate,
Brinkman Lab.,
Department of Molecular Biology & Biochemistry,
Simon Fraser University,
Greater Vancouver, B.C.
Lab 4.1
1
Evidence for Evolution – Fact not Theory
• Fossils
• Observable, e.g. viral evolution: under HIV drug treatment we can predict which sites will change. It is also why you need a new flu vaccine every year!
• Overwhelming scientific evidence.
• Humans are about 99% identical to chimpanzees at the DNA level.
Nothing in biology makes sense except in the light of evolution
Dobzhansky, 1973
Why Learn about Evolution:
• Tells us where we come from, classification of species, which species are most closely related.
• Understand the fundamentals of life.
• Practical side:
• Foundation of most bioinformatics analyses:
• Gene family identification.
• Gene discovery – inferring gene function, gene annotation.
• Origins of a genetic disease, characterization of polymorphisms.
Jean Baptiste de Lamarck
Besoin (the need or desire for change in phenotype) → Change in phenotype → Change in genotype → Inherited → Change in phenotype of offspring
August Weismann
Genotype unaffected by changes in phenotype.
Spontaneous and random changes in genes during reproduction → Offspring has changed genotype → Change in phenotype of offspring
Weismann distinguished somatic and germline mutation.
Part of Darwin’s Theory
• The world is not constant, but changing.
• All organisms are derived from common ancestors by a process of branching.
• Classify organisms based on shared traits inherited from a common ancestor.
• Morphological character-based analysis – didn’t know about DNA.
For evolution to happen, there must be heredity and variation: descent with modification.
Variation by DNA mutation
• Nucleotide substitution
– Replication error
– Chemical reaction
• Insertions or deletions (indels)
– Single base indels
– Unequal crossing over
What happens when a new mutation arises?
Positive Selection
• A new allele (mutant) confers some increase in the fitness of the organism.
• Selection acts to favour this allele.
• Also called adaptive evolution.
NOTE: Fitness = ability to survive and reproduce
Advantageous Allele
Herbicide resistance gene in nightshade plant
Negative selection
• A new allele (mutant) confers some decrease in the fitness of the organism.
• Selection acts to remove this allele.
• Also called purifying selection.
Deleterious allele
Human breast cancer gene, BRCA2
5% of breast cancer cases are familial
Mutations in BRCA2 account for 20% of familial cases
[Figure: normal (wild type) allele compared with the mutant allele (Montreal 440 family). Labels: 4 base pair deletion, causes frameshift, stop codon.]
Neutral mutations
• Neither advantageous nor disadvantageous.
• Invisible to selection (no selection).
• Frequency subject to ‘drift’ in the population.
• Random drift – random changes in small populations.
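To see why drift matters most in small populations, the idea can be sketched as a simple Wright-Fisher style simulation of one neutral allele. The function name, population sizes, number of generations and seed below are illustrative assumptions, not part of the workshop material.

```python
import random

def wright_fisher(pop_size, start_freq, generations, seed=0):
    """Track the frequency of one neutral allele in a population of fixed size."""
    rng = random.Random(seed)
    count = round(pop_size * start_freq)
    freqs = [count / pop_size]
    for _ in range(generations):
        p = count / pop_size
        # Each member of the next generation inherits the allele with probability p;
        # there is no selection, so any change in frequency is pure sampling noise.
        count = sum(1 for _ in range(pop_size) if rng.random() < p)
        freqs.append(count / pop_size)
    return freqs

small = wright_fisher(pop_size=20, start_freq=0.5, generations=200)
large = wright_fisher(pop_size=2000, start_freq=0.5, generations=200)
print("final frequency, N=20:  ", small[-1])
print("final frequency, N=2000:", large[-1])
```

Running it with different seeds should show the small population wandering much further from its starting frequency, and often losing or fixing the allele, while the large population changes slowly.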
Random Genetic Drift
Selection
[Figure: allele frequency (0 to 100) under random genetic drift and under selection, for advantageous and disadvantageous alleles]
Evolutionary models
• Neo-Darwinian (Pan-selectionist) – positive selection only.
• Mutationist – mutation and random drift.
• Neutralist – mutation, random drift, and negative selection.
Neo-Darwinian Model
• Mutation is recognised as the origin of variation.
• Gene substitution (new allele replacing old) occurs by positive selection only.
• Polymorphism (multiple alleles co-existing) caused by balancing selection.
Neutral Theory
• Too much polymorphism to be explained by mutation and positive selection alone (Neo-Darwinian model).
• Why so much?
• Neutral Theory of Molecular Evolution – Motoo Kimura, 1968
• Most polymorphism is selectively neutral.
• Majority of evolutionary changes caused by random genetic drift of selectively neutral (or almost neutral) alleles.
• Still allows for some selection.
Motoo Kimura (1924-94)
What about the rate of evolution?
Molecular Clock Hypothesis
• Rate of evolution of DNA is constant over time and across lineages.
• Resolve history of species
– Timing of events
– Relationship of species
• Early protein studies showed an approximately constant rate of evolution.
• As more data accumulated, it quickly became clear that there is no universal molecular clock.
• But: still useful if you compare like with like.
Different Rates within a Gene or Genome
• Coding sequences evolve more slowly than non-coding sequences.
• Synonymous substitutions are often more common than nonsynonymous.
• 3rd codon position sites evolve faster than others.
• Some sequences are under functional constraint.
• Different genes evolve at different rates.
• Different regions of genome – higher mutation, higher recombination rates.
• Genes in different species evolve at different rates, e.g.
– rodents vs primates → generation time hypothesis.
– sharks vs mammals → metabolic rate hypothesis.
Two Sequence Alignment
Inferring Function by Homology
• The fact that functionally important aspects of sequences are conserved across evolutionary time allows us to find, by homology searching, the equivalent genes in one species to those known to be important in other model species.
• Logic: if the linear alignment of a pair of sequences is similar, then we can infer that the 3-dimensional structure is similar; if the 3-D structure is similar then there is a good chance that the function is similar.
BASIC LOCAL ALIGNMENT SEARCH TOOLS (BLAST)
• BLAST programs (there are several) compare a query sequence to all the sequences in a database in a pairwise manner.
• Breaks query and database sequences into fragments known as "words", and seeks matches between them.
• Attempts to align query words of length "W" to words in the database such that the alignment scores at least a threshold value, "T". These are known as High-Scoring Segment Pairs (HSPs).
• HSPs are then extended in either direction in an attempt to generate an alignment with a score exceeding another threshold, "S", known as a Maximal-Scoring Segment Pair (MSP).
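The word-matching step can be sketched in a few lines. This is a simplified illustration only: it finds exact word matches, whereas BLAST also accepts similar words that score at least "T", and it does no extension. The function names and the word length of 3 are assumptions made for the example.

```python
def words(seq, w):
    """Return every overlapping word of length w together with its start position."""
    return [(i, seq[i:i + w]) for i in range(len(seq) - w + 1)]

def seed_hits(query, subject, w=3):
    """List (query_pos, subject_pos) pairs where a query word occurs exactly in the subject."""
    index = {}
    for j, word in words(subject, w):
        index.setdefault(word, []).append(j)
    return [(i, j) for i, word in words(query, w) for j in index.get(word, [])]

hits = seed_hits("GARFIELDTHECAT", "GARFIELDTHERAT")
print(hits)
```

Hits that fall on the same diagonal (equal offset between the two positions) are the natural candidates for extension into a longer alignment.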
Two Sequence Alignment
To align GARFIELDTHECAT with GARFIELDTHERAT is easy
GARFIELDTHECAT
||||||||||| ||
GARFIELDTHERAT
Gaps
Sometimes, you can get a better overall alignment if you insert gaps
GARFIELDTHECAT
||||||||   |||
GARFIELDA--CAT
is better (scores higher) than
GARFIELDTHECAT
||||||||
GARFIELDACAT
No Gap Penalty
But there has to be some sort of gap penalty, otherwise you can align ANY two sequences:
G-R--E------AT
| |  |      ||
GARFIELDTHECAT
Affine Gap Penalty
• Could set a score for each indel.
• Usually use affine (open + extend).
• e.g. open –10, extend –0.05
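As a rough sketch of how an affine penalty behaves, the code below charges one opening cost per gap plus an extension cost per gapped position, using the open and extend values from the slide. The exact convention (for example whether the first position also pays the extension cost) differs between programs, so the formula here is an assumption made for illustration.

```python
from itertools import groupby

def affine_gap_penalty(length, gap_open=10.0, gap_extend=0.05):
    """Penalty for one gap: a single opening cost plus an extension cost per position."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * length

def total_gap_penalty(aligned_seq, gap_open=10.0, gap_extend=0.05):
    """Sum the affine penalty over every run of '-' in one aligned sequence."""
    return sum(
        affine_gap_penalty(len(list(run)), gap_open, gap_extend)
        for char, run in groupby(aligned_seq)
        if char == "-"
    )

print(total_gap_penalty("GARFIELDA--CAT"))  # one gap of length 2
print(total_gap_penalty("G-R--E------AT"))  # three separate gaps
```

Because opening is far more expensive than extending, one long gap is penalised much less than several short ones, which is what stops the "align anything" behaviour seen without a gap penalty.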
2+ Similar Sequences
• When doing a similarity search against a database you are trying to decide which of many sequences is the CLOSEST match to your search sequence.
• Which of the following alignment pairs is better?
Scoring Alignments
GARFIELDTHECAT
||||   |||||||
GARFRIEDTHECAT

GARFIELDTHECAT
||| |||  |||||
GARWIELESHECAT

GARFIELDTHECAT
||  ||||||| ||
GAVGIELDTHEMAT
Willie Taylor’s AA Venn Diagram
Substitution Matrices
#BLOSUM 90
     A   R   N   D   C   Q   E   G   H   I   L
A    5  -2  -2  -3  -1  -1  -1   0  -2  -2  -2
R   -2   6  -1  -3  -5   1  -1  -3   0  -4  -3
N   -2  -1   7   1  -4   0  -1  -1   0  -4  -4
D   -3  -3   1   7  -5  -1   1  -2  -2  -5  -5
C   -1  -5  -4  -5   9  -4  -6  -4  -5  -2  -2
Q   -1   1   0  -1  -4   7   2  -3   1  -4  -3
E   -1  -1  -1   1  -6   2   6  -3  -1  -4  -4
G    0  -3  -1  -2  -4  -3  -3   6  -3  -5  -5
H   -2   0   0  -2  -5   1  -1  -3   8  -4  -4
I   -2  -4  -4  -5  -2  -4  -4  -5  -4   5   1
L   -2  -3  -4  -5  -2  -3  -4  -5  -4   1   5
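A substitution matrix is used by summing the score of each aligned pair. The sketch below copies a handful of entries from the BLOSUM 90 excerpt on this slide; the dictionary name, helper functions and toy peptides are assumptions for illustration, and gaps are not handled.

```python
# A few entries from the BLOSUM 90 excerpt; the matrix is symmetric,
# so each unordered pair is stored only once.
BLOSUM90_SUBSET = {
    ("A", "A"): 5, ("R", "R"): 6, ("N", "N"): 7, ("D", "D"): 7, ("C", "C"): 9,
    ("I", "I"): 5, ("L", "L"): 5,
    ("I", "L"): 1, ("N", "D"): 1, ("A", "R"): -2, ("C", "D"): -5,
}

def pair_score(a, b, matrix=BLOSUM90_SUBSET):
    """Look up a residue pair in either order."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]

def alignment_score(seq1, seq2, matrix=BLOSUM90_SUBSET):
    """Score an ungapped alignment of two equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1, seq2))

print(alignment_score("ARNDI", "ARNDI"))  # identical peptides
print(alignment_score("ARNDI", "ARNDL"))  # conservative change, I to L
print(alignment_score("ARNDC", "ARNCC"))  # non-conservative change, D to C
```

The conservative isoleucine to leucine change still earns a positive score, while aspartate to cysteine is penalised heavily, which is how the matrix separates plausible from implausible substitutions.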
Low Complexity Masking
• Some sequences are similar even if they have no recent common ancestor.
• Huntington's disease is caused by poly-CAG tracts in the DNA, which result in polyglutamine (Gln, Q) tracts in the protein.
• If you do a homology search with QQQQQQQQQQ you get hits to other proteins that have a lot of glutamines but have totally different function.
Low Complexity Masking
Huntingtin:
MATLEKLMKA FESLKSFQQQ QQQQQQQQQQ
QQQQQQQQQQ PPPPPPPPPP PQLPQPPPQA
hits
>MM16_MOUSE MATRIX METALLOPROTEINASE-16
Score = 34.4 bits (78), Expect = 0.18
Identities = 21/65 (32%), Positives = 25/65 (38%), Gaps = 2/65 (3%)
FQQQQQQQQQQQQQQQQQQQQQQQPPPPPPPPPPPQLPQPPPQ--AQPLLPQPQPPPPPP
F Q  +    +      Q  Q+   PP   PPP   LP  PP     P  P+    P PP
FYQYMETDNFKLPNDDLQGIQKIYGPPDKIPPPTRPLPTVPPHRSVPPADPRRHDRPKPP
But not because it is involved in microtubule mediated transport!
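The effect of masking can be illustrated with a deliberately crude filter that replaces long runs of a single residue with X. Real searches use dedicated filters (SEG for proteins and DUST for nucleotides are the usual choices in BLAST), so the function below, its name and its run-length threshold are assumptions made purely for illustration.

```python
import re

def mask_runs(seq, min_run=5, mask_char="X"):
    """Replace every run of at least min_run identical residues with mask_char."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), seq)

huntingtin_start = (
    "MATLEKLMKAFESLKSFQQQQQQQQQQ"
    "QQQQQQQQQQQQQPPPPPPPPPPPQLPQPPPQA"
)
print(mask_runs(huntingtin_start))
```

Once the polyglutamine and polyproline tracts are masked, they can no longer seed spurious hits such as the metalloproteinase match shown here.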
E values
• An E-value is the number of hits with a score at least this good that you would expect to find by chance.
• Dependent on the size of the query sequence and the database.
• The lower the E-value, the more confidence you can have that a hit is a true homologue (sequence related by common descent).
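The dependence on query and database size can be made concrete with the standard relationship between a bit score S' and the expected number of chance hits, E = m × n × 2^(-S'), where m and n are the query and database lengths. The sketch below applies that formula directly; the function name and the lengths used are illustrative assumptions, and BLAST itself uses adjusted "effective" lengths, so its reported values will differ somewhat.

```python
def expected_hits(bit_score, query_length, database_length):
    """E = m * n * 2**(-S'), with S' the bit score and m, n the two lengths."""
    return query_length * database_length * 2.0 ** (-bit_score)

# Same bit score, but a database twice as large doubles the expected chance hits.
print(expected_hits(34.4, 60, 1_000_000))
print(expected_hits(34.4, 60, 2_000_000))
# A higher bit score lowers the E-value sharply.
print(expected_hits(60.0, 60, 1_000_000))
```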
Dotplot theory
Another way of comparing 2 sequences
Task: align ATGATATTCTT and ATTGTTC
  A T G A T A T T C T T
A . . . . . . . . . . .
T . . . . . . . . . . .
T . . . . . . . . . . .
G . . . . . . . . . . .
T . . . . . . . . . . .
T . . . . . . . . . . .
C . . . . . . . . . . .
Go along the first seq inserting a + wherever 2/3 bases in a
moving window match. The first seq is compared to ATT
(the first 3 bases in the vertical sequence)
  A T G A T A T T C T T
A . . . . . . . . . . .
T . + . . + . + . . + .
T . . . . . . . . . . .
G . . . . . . . . . . .
T . . . . . . . . . . .
T . . . . . . . . . . .
C . . . . . . . . . . .
Then go along the first seq inserting a + wherever 2/3 bases
in a moving window match. The first seq is compared to TTG
(the next 3 in the vertical sequence).
  A T G A T A T T C T T
A . . . . . . . . . . .
T . + . . + . + . . + .
T . + . . . . . + . . .
G . . . . . . . . . . .
T . . . . . . . . . . .
T . . . . . . . . . . .
C . . . . . . . . . . .
Iterate until
  A T G A T A T T C T T
A . . . . . . . . . . .
T . + . . + . + . . + .
T . + . . . . . + . . .
G . . + . . . . . + . .
T . . . + . . . . . + .
T . . . . . . . + . . .
C . . . . . . . . . . .
  A T G A T A T T C T T
A
T   +     +   +     +
T   +           +
G     +           +
T       +           +
T               +
C
The human eye is particularly good at picking up structure from
the pattern of dots. You might see a hint of a duplicated region in
the horizontal sequence that is not so clear from the sequence itself
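The moving-window procedure walked through above can be sketched as follows. The window is centred on each cell, and a + is placed when at least 2 of the 3 bases match; the function names are assumptions, an odd window size is assumed, and a strict 2-of-3 rule may mark a cell or two that the hand-drawn grid leaves blank.

```python
def dotplot(horizontal, vertical, window=3, min_match=2):
    """Mark cell (row, col) with '+' when the windows centred there share >= min_match bases."""
    half = window // 2
    grid = [["." for _ in horizontal] for _ in vertical]
    for r in range(half, len(vertical) - half):
        for c in range(half, len(horizontal) - half):
            matches = sum(
                vertical[r + k] == horizontal[c + k] for k in range(-half, half + 1)
            )
            if matches >= min_match:
                grid[r][c] = "+"
    return grid

def render(horizontal, vertical, grid):
    """Format the grid with the two sequences as column and row labels."""
    lines = ["  " + " ".join(horizontal)]
    for base, row in zip(vertical, grid):
        lines.append(base + " " + " ".join(row))
    return "\n".join(lines)

grid = dotplot("ATGATATTCTT", "ATTGTTC")
print(render("ATGATATTCTT", "ATTGTTC", grid))
```

Raising the window size or the match threshold reduces background noise, at the cost of missing short or diverged matches.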
Multiple Sequence Alignments
Why Do MSAs?
• Although BLAST may give you a good E-value, an MSA is more convincing evidence that proteins are related and can be aligned over their entire length.
• Identification of conserved regions or domains in proteins.
– Regions that are evolutionarily conserved are likely to be important for structure/function.
– Mutations in these areas are more likely to affect function.
• Identification of conserved residues in proteins.
• Prerequisite for doing phylogenetic trees.
Identification of Conserved Domains:
Human β-defensins
Computing MSAs
• Problem: Once you attempt to align more than a few sequences, MSA quickly becomes computationally intensive and eventually intractable.
• Solution: Clustal – invented in Kennedy’s pub, Trinity College Dublin.
• Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673-4680.
• Download Clustalx: ftp://ftp-igbmc.ustrasbg.fr/pub/ClustalX/clustalx1.81.msw.zip
• Adding evolutionary theory to multiple sequence alignment.
How MSAs are computed
You still may have to do some hand-editing!!
Alignment Editors
• Several multiple sequence alignment editors are available for manually editing MSAs.
– GeneDoc http://www.nrbsc.org/gfx/genedoc/index.html
– Jalview http://www.jalview.org/
T-Coffee Vs Clustal
• ClustalW http://www.ebi.ac.uk/clustalw/ is the standard program for MSAs.
• However, the newer program T-Coffee http://www.tcoffee.org/ often does a better job, particularly with more distantly related proteins.
• Other programs, e.g. Muscle http://www.drive5.com/muscle/, may be better than T-Coffee at aligning large numbers of sequences.
Phylogenetics –
Inferring the evolutionary relationships between
genes/sequences/species.
Terminology
[Figure: annotated phylogenetic tree with the following labels]
• Node
• Bootstrap values (%) showing level of statistical confidence in clade.
• Branch – length proportional to amount of evolution (not all trees).
• Operational taxonomic units (OTUs), e.g. genes, species, populations. This case: protein sequences.
• Clade
• Outgroup
Different Views of the Same Trees
[Figure: equivalent drawings of the same tree. Labels: star-shaped phylogeny; no branch lengths shown.]
Why Do Trees?
• Classification of life.
• Investigate the evolutionary relationship between genes/species/strains.
– What can this tell us about function?
• Epidemiology: tracing pathogen evolution/origins, e.g. viruses, SARS, foot & mouth, Avian Influenza.
• Assign orthology to related genes.
• The closest BLAST hit is often not the nearest neighbor (Koski LB, Golding GB, J Mol Evol. 2001).
SARS as an example
SARS forms a distinct clade within genus Coronavirus.
Implications for vaccine and drug design.
Implications for epidemiology.
Ortholog & Paralogs
• Orthologs – Genes derived from a speciation event, i.e. the ‘same’ gene in different species.
• Paralogs – Genes derived from a gene duplication event. Evolutionarily related but not the ‘same’ gene → may have similar functions but likely also different ones.
Importance of Ortholog Prediction:
[Figure: tree with tips Species1_GeneA, Species2_GeneA, Outgroup_GeneA]
• Why important → implies likely conservation of function in different species → necessary to make inferences of function based on analysis in one of the species.
• Example: knockout gene A in species 1 → observe phenotype → infer gene A in species 2 has same/similar function.
– Only holds if comparing orthologous genes.
Common Problems in Ortholog Prediction
• Reciprocal Best BLAST Hit (RBH) → commonly used high-throughput method for ortholog identification.
[Figure: Species 1 and Species 2 linked by BLAST searches in both directions]
• Incomplete genome sequence or gene loss often results in paralogs predicted as orthologs.
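The reciprocal best hit rule itself is simple to state in code. The sketch below assumes the best BLAST hit of every gene has already been extracted into two dictionaries, one per search direction; the gene names and the function name are hypothetical.

```python
def reciprocal_best_hits(best_1_to_2, best_2_to_1):
    """Return (gene1, gene2) pairs that are each other's best hit."""
    return sorted(
        (g1, g2) for g1, g2 in best_1_to_2.items() if best_2_to_1.get(g2) == g1
    )

# Hypothetical best hits: s1_B also hits s2_A best, but s2_A's own best hit is s1_A.
best_1_to_2 = {"s1_A": "s2_A", "s1_B": "s2_A"}
best_2_to_1 = {"s2_A": "s1_A"}
print(reciprocal_best_hits(best_1_to_2, best_2_to_1))
```

Note that the rule only checks mutual best hits among the genes that are present. If the true ortholog is missing from one genome (gene loss or incomplete sequence), a paralog can become the mutual best hit and be reported as an ortholog.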
Common Problems in Ortholog Prediction
Real Example: Assigning orthology of a novel chicken IRAK.
Lynn et al., 2003
Ortholuge: Improving the specificity of
high-throughput ortholog prediction
• Solution to problem: Putative orthologs from 2 species are compared to a third outgroup species and phylogenetic distances are calculated.
• Unusual phylogenetic distances are used to identify possible/probable paralogs.
Phylogenetic Methods
• UPGMA
– Assumes constant rate of evolution (molecular clock): don’t publish UPGMA trees.
• Neighbor-Joining
– Very fast. Often a “good enough” tree.
• Maximum Parsimony
– Minimum # mutations to construct tree. Slower than NJ.
• Maximum Likelihood
– Very CPU intensive. Requires explicit model of evolution – rate and pattern of nucleotide substitution. Only use if you know what you are doing. Rubbish in, rubbish out!!
Distance Methods
• Distance matrix.
• UPGMA assumes constant rate of evolution (molecular clock): don’t publish UPGMA trees.
• Neighbor joining is very fast.
• Often a “good enough” tree.
• Embedded in ClustalW.
• Use in publications only if too many taxa to compute with MP or ML.
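Distance methods start from a matrix of pairwise distances. A minimal sketch of the simplest such distance, the proportion of differing sites (p-distance) between aligned sequences, is shown below. The function names, the choice to skip gapped columns and the toy alignment are assumptions for illustration; tree-building programs typically apply a model-based correction as well.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites, ignoring columns where either sequence has a gap."""
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)

def distance_matrix(alignment):
    """Build a dict of pairwise p-distances keyed by (name, name)."""
    names = list(alignment)
    return {(m, n): p_distance(alignment[m], alignment[n]) for m in names for n in names}

toy = {"seq1": "ACGTACGT", "seq2": "ACGTACGA", "seq3": "ACCTAGGA"}
matrix = distance_matrix(toy)
for (m, n), d in matrix.items():
    print(m, n, d)
```

UPGMA and neighbor joining both take a matrix like this as input and differ only in how they cluster it.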
Maximum Parsimony
• Minimum # mutations to construct tree.
• Better than NJ – information lost in distance matrix – but much slower.
• Sensitive to long-branch attraction.
• No explicit evolutionary model.
• Protpars refuses to estimate branch lengths.
• Informative sites.
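"Informative sites" refers to parsimony-informative columns: those in which at least two different states each appear in at least two sequences. Only these columns can favour one tree over another under parsimony. The sketch below counts them; the function names, the decision to ignore gap characters and the toy alignment are assumptions.

```python
from collections import Counter

def is_informative(column):
    """True if at least two different states each occur in at least two sequences."""
    counts = Counter(ch for ch in column if ch != "-")
    return sum(1 for n in counts.values() if n >= 2) >= 2

def informative_sites(sequences):
    """Return the 0-based positions of parsimony-informative columns."""
    return [i for i, column in enumerate(zip(*sequences)) if is_informative(column)]

toy_alignment = ["AAGA", "AGGT", "ACAT", "AGAC"]
print(informative_sites(toy_alignment))
```

In the toy alignment only the third column qualifies: the first is invariant, and in the second and fourth only one state is shared by two or more sequences.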
Maximum Likelihood
• Very CPU intensive.
• Requires explicit model of evolution – rate and pattern of nucleotide substitution.
– JC: Jukes/Cantor
– K2P: Kimura 2 parameter, transition/transversion
– F81: Felsenstein, base composition bias
– HKY85: merges K2P and F81
• Explicit model → preferred statistically.
• Assumes change more likely on long branch.
• No long-branch attraction.
• Wrong model → wrong tree.
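A full maximum likelihood analysis involves searching tree space, which is well beyond a short example. What can be shown briefly is how the first two models in the list turn observed differences into an estimated number of substitutions per site, using their standard closed-form distance corrections. The function names are assumptions.

```python
import math

def jc69_distance(p):
    """Jukes-Cantor distance from the observed proportion of differing sites p."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)

def k2p_distance(transitions, transversions):
    """Kimura 2-parameter distance from the proportions of transitions (P) and transversions (Q)."""
    a = 1 - 2 * transitions - transversions
    b = 1 - 2 * transversions
    if a <= 0 or b <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(a) - 0.25 * math.log(b)

print(jc69_distance(0.25))        # larger than 0.25: hidden multiple hits are added back
print(k2p_distance(0.20, 0.05))   # same total difference, but transition-heavy
```

Both corrected distances exceed the raw 25% difference, and they disagree with each other, which is a small-scale version of the "wrong model → wrong tree" warning.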
DNA Trees
• More info in DNA than proteins.
• Systematic 3rd position changes can confuse.
• For distant relationships: remove 3rd positions.
• Advice: Use DNA directly only if evolutionary distance is short.
• Translate into protein to align
– then copy gaps back to DNA.
• Many issues can confuse tree – Beware.
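Copying gaps from a protein alignment back to the DNA can be sketched as follows: every gap in the aligned protein becomes three gap characters, and every residue consumes the next codon. The function name is an assumption, and the sketch assumes the DNA is in frame, has no trailing stop codon and codes for exactly the residues in the protein.

```python
def copy_gaps_to_dna(aligned_protein, dna):
    """Thread an unaligned coding sequence onto a gapped protein alignment row."""
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(dna) % 3 != 0 or len(codons) != residues:
        raise ValueError("DNA length does not match the number of residues")
    remaining = iter(codons)
    return "".join("---" if aa == "-" else next(remaining) for aa in aligned_protein)

print(copy_gaps_to_dna("M-K", "ATGAAA"))
print(copy_gaps_to_dna("MFK", "ATGTTTAAA"))
```

The resulting DNA alignment keeps codons intact, so 3rd positions can later be removed cleanly if the relationships are distant.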
Things to be aware of….
• Beware base composition bias in unrelated taxa, e.g. 2 species with high G+C content will tend to group together.
• Are sites (hairpins, CpGs?) independent? → most models assume that they are.
• Are substitution rates equal across dataset? → if not, some methods can account for this.
• Long branches prone to error – remove them?
• Excellent alignment = few informative sites.
• Exclude unreliable data – toss all gaps → but also removes phylogenetically informative indels.
Bootstrapping – statistical confidence in a tree.
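The resampling step behind bootstrap values can be sketched briefly: columns of the alignment are drawn at random with replacement to build a pseudo-alignment of the same length, a tree is built from each replicate, and the bootstrap value of a clade is the percentage of replicate trees that contain it. Only the resampling is shown here; the function name, toy alignment and seed are assumptions, and tree building is left to whichever method is being tested.

```python
import random

def bootstrap_alignment(sequences, rng):
    """Resample alignment columns with replacement, keeping the alignment length."""
    length = len(sequences[0])
    columns = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[c] for c in columns) for seq in sequences]

alignment = ["ACGTAC", "ACGAAC", "TCGAAG"]
rng = random.Random(1)
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
print(replicates[0])
```

Because whole columns are resampled, each replicate keeps the sequences aligned to one another while varying which sites contribute to the tree.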
Acknowledgements
• Thanks to Aoife McLysaght, Trinity College Dublin, Ireland for sharing some of her slides on molecular evolution with me.
• Some of the slides were adapted from material used last year at the CBW by Prof. Fiona Brinkman, Simon Fraser University.
• Some of the material used here was originally given as part of a course “Introduction to Bioinformatics” designed and implemented by myself and Dr. Andrew Lloyd, University College Dublin.
• Figures for some of the slides on phylogenetics have been taken from Baldauf SL, 2003, “Phylogeny for the faint of heart: a tutorial”, Trends in Genetics 19(6).