Building Phylogenetic trees

Download Report

Transcript Building Phylogenetic trees

Alignments and phylogeny
Peter Hantz
EMBL Heidelberg
Evolution
Mutations:
changes in the DNA sequence
chemicals, physical conditions,
replication errors (cca 10/replication/human genome)
coding region mutations:
>silent
results in the same AA
>mis-sense
results in a different AA
>nonsense
results in a stop codon
non-coding region mutations:
e.g. Transcription factor binding sites - developmental deficiencies
gene dublications + "neofunctionalization" of the extra copy
new function - new evolutionary pressure
Evolution
Homology, orthology, paralogy
Homologous:
genes having a common ancestor
orthologous:
common ancestor, in different species
paralogous:
common ancestor, gene dublication in the same species
wikipedia
Alignments
Aims:
-to compare DNA or protein sequences
that diverged during evolutionary processes
Why is it good for?
-a new piece of DNA/protein: what's it?
(proteins: structural relationships)
-to trace evolutionary relationships
(phylogenetic trees)
-to find mutations that cause diseases
-for gene cloning: degenerate PCR primer design
-forensics, whale protection...
Some applications...
Forensics
Alignments are the first step
of building evolutionary trees
A. Budd
Malaria - Sickle Cell Anemia
snp
23andme.com
Global and Local Alignments
Global: assumed to be similar overall (e.g. closely related ones)
"forces" the alignment to span the entire length of the sequences
e.g. to find evolutionary relationships
e.g. to find mutations in a gene
Local: if short pieces (motifs, domains) are similar
to identify regions of similarity within long sequences
e.g. To find coding regions in a genomic dna full with introns
Source:wikipedia
How a very simple pair-wise alignent works: the DotPlot method
Source: Gene Cloning
Global/local alignmnets:
Dynamic Programing
G: Needleman-Wunsch algorithm
L: Smith-Waterman algorithm
Aligning DNA
|:
identical
nothing: mismatch
-:
gap
How do we quantify their similarity?
Aligning DNA, continued, concepts for both global and local alignments
Measure of similarity: f(#of matches, mismatches, gaps)
How do we count for identities, mismatches, gaps?
A simple model (standard BLAST scoring matrix):
Identity +2
Mismatch -3
Gap creation -5
Gap extension -2
Terminal Mismatch 0
All mismatches are "similar"?
Kimura model:
purine-purine and pirimidine-pirimidine : more probable
Score of two aligned DNA sequences (or parts of them):
TCCTAGGACTCATCGTAAGG
TCCTAG - - AACCTCGTAAGG
+2+2+2 +2+2+2 -5-2 -3 -3 +2 -2 -3 +2+2 +2 +2+2+2+2
=28-7-9=14
Aligning Proteins
some subtitutions less relevant than other ones
change to a similar AA: not too much change in the protein structure
change to sg. else: unprobable, it will be selkected against
letter =match, +=conservative substitution, Ø= non-cons. substitution, - =gap
(software: Blast)
Identities: same AA
Positives:
"similar" AA-s, incl. identities
How "similar" are these sequences?
Measure of similarity: f(#of matches, # of cons/non-cons ch, #gaps)
AA changes:
quantification by the "substitution matrix"
>the simplest one:the identity matrix
>A more realistic one: BlockSUbstitutionMatrix
(BLOSUM62):
+probable, -unprobable,
diagoinal:prob. let like this
Gap penalties: usually -10 for gap open and -2 for gap extension
The score:
measure of similarity of a segment/the enitre length of two aligned protein sequences
high score means: good alignment/real similarity
score for a given alignment = symbol-wise score total (matrix) + gap penalty total
AA B B C C D D - - E E F
AA - - - - D D K K K E F G G
4+4 -10-2-2-2+6+6 -10-1+1+5+6 0 0
=
Significance: E-value
Why score S is not enough?
To what do we compare it? Statistics is needed.
Ex.
Given our query sequence (DNA or protein)
Let's take a random sequence from a hypothetical database of size D:
Can an alignment as good or better that this occur BY CHANCE?
(calculated from a random database sequence)
From the score S a so-called "Expectation value" is calculated
E=P(S(random)>S(query))D
If the chance is tiny (E<10^-6):
unlikely that the observed alignment is due to chance alone
Note: D is very high, the E-value increases
Multiple alignments: more complex than the pairwise ones
"Progressive alignment methods" (e.g. Clustal)
Iterative methods (e.g. Muscle)
Multiple alignment of
Nitric Oxide Synthase
protein sequences
(P. Hantz)
Viagra!
If two sequences have a large % of identity,
they can be interpreted to be homologous (y/n)
histons - very conservative
MHC gene pool - evolves like crazy
The BLAST (Basic Local Alignment Search Tool)
Scored pairwise local alignments are generated very fast
"Is my new sequence related to sg I know about?"
Also used for aligning sequences
How does it work?
In a nutshell:
-List the words of length 3 (by default) of the query:
PQGEFG >> PQG, GGE, GEF, EFG
-Scan the database sequences with "the relevant ones" of these
-try extending the exact matches, via local alignments,
until there are not too much mismatches (until a score level)
>> HSP-s (High-Scoring Segment Pairs)
-Evaluate the significance and E-value of the HSP-s
Using it: BLAST sub-programs (beside the pairwise alignments)
Blastn: Search a nucleotide database using a nucleotide query
What can my sequence be? – close relatives
(What are the sequences similar to my sequence?)
BlastP: Search protein database using a protein query
What can my protein be?
(What are the proteins similar to my sequence?)
Find conserved domains in the query
Find members of the protein family
Blastx:
Search protein database using a translated nucleotide query
Find coding sequences in a piece of genomic DNA
Protein sequences: more conserved!
Tblastn: Search translated nucleotide database using a protein query
Find similar proteins e.g. not annotated DNA sequences
Tblastx: Search translated nucleotide database using a translated nucleotide query
Find coding sequences in a piece of genomic DNA
Note: translation is done in all 6 frames, and all of these are locally analyzed
Making Phylogenetic Trees
Aims:
to show evolutinary relationships of:
genes, species (they evolve ALL the time)
even computer softwares...
A. Budd
Ji et al., 2008
The new rRNA-based animal phylogeny (1995)
Annelida
Platyhelm.
Mollusca
(1995)
Protostomia
Halanych et al, Aduotte et al., after 1995
Deuterostomia
rRNA:
present in all creatures
slow/fast evolving parts
secondary structure
The new rRNA-based Tree of Life
Woese et al., after 1990
Description of Evolutionary Trees
Internal nodes:
hypothetical ancestral organisms/genes
Terminal nodes:
existing organisms/genes (Operational Taxonomy Units, OTU)
Root:
the last common ancestor of the entire group
Sister groups:
on either side of a split, with a common ancestor
and no additional descendents
Monophyletic group
A group containing an ancestor and all of its descendants
most recent common ancestor of the group
Only the terminal nodes (OTU) exist right now
They all evolve in time!
Description of Phylogenetic Trees
Cladogram:
branch length unscaled
Phylogram:
branch length=amounts of evol. divergence
(horizontals doesn't count)
Biology: sequences, organisms evolve all the time
time
Molecular clocks:
Can the # of mutations of a DNA or protein sequence correlated
with the time lapsed after the Last Common Ancestor?
If the rate is cca constant: YES
Different rates - different advantages GENE PHYLOGENY ≠ SPECIES PHYLOGENY
too fast: saturation - back mutations - underestimation of the distance:
A-B-C-D-E-A-...
calibration: some dated fossil records are needed
Building Phylogenetic trees: A very simple example:
(a) MSATHC (b) ITATHC (c) ITAGHC (d) LTAAHC
Mutations (a)<>(b)<>(c)
One step-one mutation
Mutations (a)<>(d)<>(b)
A "rooted" tree: an assumption for the commn ancestor
source: Gene Cloning
Building Phylogenetic Trees
Distance-based methods
"Distance" of two sequences: "metric" in mathematics (4 axioms)
several ones: euclidean... non-euclidean...
Green: Euclidean distance
Others: "Manhattan" distance
A distance between two strings: Leveshtein distance d(IJ)
the minimum number of edits
(insertion, deletion, or substitution of a single character)
needed to transform one string into the other
Its calculation: intuitively easy, practically complicated (dynamic programing)
Example: d(kitten/6/, sitting/7/)=3
Leveshtein distance: a special case of the score:
gaps/mismatches/missing ends: 1; matches: 0
kitten → sitten ( 's' for 'k')
sitten → sittin ('i' for 'e')
sittin → sitting (insert 'g').
The Building of a Tree: the UPGMA method
Sequential clustering method
Starting point: distance matrix [d(IJ)] of the sequences (triangular m.)
0
ua

d ( S , S ) 0
 1 2









ua 
d ( S n 1S n ) 0 
-grouping the pairwise distances corresponding to the pairs of strings
with the with the smallest pairwise distance:
-node "placed" at the half of the distance
-creating a reduced matrix:
these two are "joined"
distances are re-calculated:
d ( X , BF ) 
d ( X , B)  d ( X , F )
2
http://www.nmsr.org/upgma.htm
FIRST DO AN ALIGNMENT
Pre-processing:
Eliminate obiously wrong sequence regions
e.g. "forgotten" introns when investigating proteins, AG|GT...AG|G
e.g. obvious sequencing errors
e.g. bad sequences
Correct the distances for multiple substitutions (homoplasy)
(measure of distances: change/site, 0…1)
Building a tree
use a program...
Rooting a tree
Rooting induces a directionality
>"automatically" done by several software:
midpoint rooting
(root on the branch on equal distance from the most distant OTU-s)
>"by hand"
by choosing an "outgroup": a homologous, but "quite far" sequence
root: on the branch between the tree and the outgroup
Problems might still appear!...
Bootstrap values: how roboust is our tree
Randomizing a bit the sequences.... does the tree/subtree persist?
Larger% - better
Taxa1
Taxa2
Taxa3
Taxa4
A
A
A
A
C
C
G
G
T
A
A
T
G
G
C
C
A
A
T
T
1
3
2
4
Resample datasets
(with replacement)
Sample1
T
A
A
T
G
G
C
C
A
A
A
A
C
C
G
G
60%
T
A
A
T
A
A
T
T
G
G
C
C
A
A
A
A
G
G
C
C
Sample99
The result:
...
Sample3
Sample2
T
A
A
T
G
G
C
C
A
A
A
A
T
A
A
T
A
A
T
T
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
Sample100
T
A
A
T
T
A
A
T
G
G
C
C
A
A
A
A
C
C
G
G
C
C
G
G
T. Larsson