Molecular Systematics & Evolution of Microorganisms


ALIGNMENT OF NUCLEOTIDE & AMINO-ACID SEQUENCES
1
An alignment is an evolutionarily meaningful comparison
of two or more sequences (DNA, RNA, or proteins).
In the case of two DNA sequences, an alignment
consists of a series of paired bases, one base from
each sequence. There are three types of pairs:
(1) matches = the same nucleotide appears in both
sequences.
(2) mismatches = different nucleotides are found in the
two sequences.
(3) gaps = a base in one sequence and a null base in the
other.
GCGGCCCATCAGGTAGTTGGTG-G
GCGTTCCATC--CTGGTTGGTGTG
***..*****  .*.******* *
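The three pair types can be read off an alignment column by column. The following is a minimal sketch, not part of the original slides; the function name `classify_columns` is hypothetical, and it assumes the marker convention used above ('*' for a match, '.' for a mismatch, a blank for a gap).

```python
def classify_columns(row_a, row_b, gap="-"):
    """Return a marker line: '*' = match, '.' = mismatch, ' ' = gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    marks = []
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            marks.append(" ")   # a base paired with a null base
        elif a == b:
            marks.append("*")   # same nucleotide in both sequences
        else:
            marks.append(".")   # different nucleotides
    return "".join(marks)


row_a = "GCGGCCCATCAGGTAGTTGGTG-G"
row_b = "GCGTTCCATC--CTGGTTGGTGTG"
markers = classify_columns(row_a, row_b)
print(row_a)
print(row_b)
print(markers)
```

Each gap column produces one blank, so the marker line always has the same length as the two aligned rows.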
2
Alignment: A hypothesis concerning
positional homology among residues
in two or more sequences.
Positional homology = A pair of nucleotides
from two aligned sequences that have
descended from one nucleotide in the ancestor
of the two sequences.
GCGGCCCATCAGGTAGTTGGTG-G
GCGTTCCATC--CTGGTTGGTGTG
***..*****  .*.******* *
3
These two nucleotides are derived from the
ancestor of cats and armadillos.
Homology:
The term was
coined by Richard
Owen in 1843.
Definition:
Similarity
resulting from
common ancestry.
5
Homology: A qualitative statement
• Homology designates a relationship of
common descent between entities
• Two genes are either homologs or not
– it doesn’t make sense to say “two
genes are 43% homologous.”
– it doesn’t make sense to say “Linda is
43% pregnant.”
6
Homology
By comparing homologous characters,
we can reconstruct the evolutionary
events that have led to the formation of
the extant sequences from the common
ancestor.
7
Homology
When dealing with sequences, we are
interested in POSITIONAL HOMOLOGY.
We identify positional homology by
ALIGNMENT.
8
Ancestral sequence:
ACTGGGCCCAAATC
Lineage 1 (1 deletion, 1 substitution):
ACTGGCCCAGATC
Lineage 2 (1 insertion, 1 substitution):
ACAGGGCCACAAATC
Correct alignment
ACT-GGCC-CAGATC
ACAGGGCCACAAATC
**.-****-**.***
Incorrect alignment
ACTGGCCCAGATC--
ACAGGGCCACAAATC
**.**.***.*..--
9
Ancestral sequence: unknown
Changes in each lineage: unknown
ACTGGCCCAGATC
ACAGGGCCACAAATC
Correct alignment?
ACT-GGCC-CAGATC
ACAGGGCCACAAATC
**.-****-**.***
Incorrect alignment?
ACTGGCCCAGATC--
ACAGGGCCACAAATC
**.**.***.*..--
10
Sequence alignment = The
identification of the location of
deletions or insertions that might
have occurred in either of the
two lineages since their
divergence from a common
ancestor.
Insertion + Deletion = Indel or Gap
11
Sequence alignment
1. Pairwise alignment
2. Multiple alignment
12
Two DNA sequences: A and B.
Lengths are m and n, respectively.
The number of matched pairs is x.
The number of mismatched pairs is y.
Total number of bases in gaps is z.
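These five quantities are tied together by a simple bookkeeping identity: every matched or mismatched pair uses one base from each sequence, and every gap position uses one base from a single sequence, so m + n = 2(x + y) + z. The identity is not stated on the slide; the sketch below (hypothetical helper `alignment_counts`) checks it on the example alignment shown earlier.

```python
def alignment_counts(row_a, row_b, gap="-"):
    """Count matched pairs (x), mismatched pairs (y), and bases in gaps (z)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    x = y = z = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            z += 1          # one real base opposite a null base
        elif a == b:
            x += 1
        else:
            y += 1
    return x, y, z


row_a = "GCGGCCCATCAGGTAGTTGGTG-G"
row_b = "GCGTTCCATC--CTGGTTGGTGTG"
x, y, z = alignment_counts(row_a, row_b)
m = len(row_a.replace("-", ""))   # length of sequence A without gaps
n = len(row_b.replace("-", ""))   # length of sequence B without gaps
print(m, n, x, y, z)
print(m + n == 2 * (x + y) + z)
```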
13
A gap indicates that a deletion
or an insertion has occurred in one
of the two lineages.
GCGG-CCATCAGGTAGTTGGTG--
GCGTTCCATC--CTGGTTGGTGTG
14
The alignment is the first step in
many evolutionary and functional
studies.
Errors in alignment tend to
amplify in later computational
stages.
15
Methods of alignment:
1. Manual
2. Dot matrix
3. Algorithmic (scoring matrices and gap penalties)
16
Manual alignment. When there are
few gaps and the two sequences
are not too different from each
other, a reasonable alignment
can be obtained by visual
inspection.
GCG-TCCATCAGGTAGTTGGTGTG
GCGTTCCATCAGGTGGTTGGTGTG
*** **********.*********
17
Advantages of manual alignment:
(1) use of a powerful and trainable
tool (the brain, well…, some
brains).
(2) ability to integrate additional
data, e.g., domain structure,
biological function (e.g., 3D
structure).
18
Disadvantages of manual alignment:
1. Subjectivity = the inability to formally
specify the algorithm.
2. Irreproducibility = the inability of two
researchers to reach the same result.
3. Unscalability = the inability to apply the
method to long sequences.
4. Incommensurability = the inability to
compare the results to those derived from
other methods.
19
The dot-matrix
method: The two
sequences are written out
as column and row
headings of a two-dimensional matrix. A dot
is put in the dot-matrix
plot at a position where
the nucleotides in the two
sequences are identical.
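The construction just described is short enough to sketch directly. This is an illustrative sketch only: the function names `dot_matrix` and `render` are hypothetical, and the text rendering ('*' for a dot, '.' for an empty element) is an assumption made for display.

```python
def dot_matrix(top, left):
    """matrix[i][j] is True where left[i] and top[j] are identical."""
    return [[row_base == col_base for col_base in top] for row_base in left]


def render(top, left, matrix):
    """Draw the matrix as text: '*' for a dot, '.' for an empty element."""
    lines = ["  " + top]
    for base, row in zip(left, matrix):
        lines.append(base + " " + "".join("*" if dot else "." for dot in row))
    return "\n".join(lines)


top = "ACTGGCCCAGATC"      # sequence written across the top (columns)
left = "ACAGGGCCACAAATC"   # sequence written down the left side (rows)
matrix = dot_matrix(top, left)
print(render(top, left, matrix))
```

Runs of '*' along a diagonal correspond to stretches where the two sequences agree without any intervening gap.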
20
The alignment
is defined by
a path from
the upper-left element
to the lower-right
element.
21
There are 4 possible steps in the path:
(1) a diagonal step through
a dot = match.
(2) a diagonal step through
an empty element of the
matrix = mismatch.
(3) a horizontal step = a
gap in the sequence on
the top of the matrix.
(4) a vertical step = a gap
in the sequence on the
left of the matrix.
22
Figure: allowed and forbidden directions for steps in the path.
23
A dot matrix may become cluttered.
With DNA sequences, ~25% of the
elements will be occupied by dots by
chance alone.
24
window size = 1
stringency = 1
alphabet size = 4
The number of spurious matches is
determined by: window size,
stringency, & alphabet size.
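The slides do not define window size and stringency explicitly. The sketch below assumes the common convention: a dot is placed at an element only if, within a window of `window` consecutive positions along the diagonal starting there, at least `stringency` positions are identical. The function name `windowed_dot_matrix` is hypothetical.

```python
def windowed_dot_matrix(top, left, window=1, stringency=1):
    """Dot at (i, j) if at least `stringency` of the `window` diagonal
    positions starting at (i, j) hold identical residues (assumed convention)."""
    if not 1 <= stringency <= window:
        raise ValueError("need 1 <= stringency <= window")
    n_rows = max(len(left) - window + 1, 0)
    n_cols = max(len(top) - window + 1, 0)
    matrix = []
    for i in range(n_rows):
        row = []
        for j in range(n_cols):
            hits = sum(left[i + k] == top[j + k] for k in range(window))
            row.append(hits >= stringency)
        matrix.append(row)
    return matrix


top = "ACTGGCCCAGATC"
left = "ACAGGGCCACAAATC"
plain = windowed_dot_matrix(top, left, window=1, stringency=1)
filtered = windowed_dot_matrix(top, left, window=3, stringency=2)
print(sum(map(sum, plain)), "dots with window 1, stringency 1")
print(sum(map(sum, filtered)), "dots with window 3, stringency 2")
```

With window size 1 and stringency 1 this reduces to the plain dot matrix; larger windows trade resolution for less clutter.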
25
window size = 1
stringency = 1
alphabet size = 4
window size = 3
stringency = 2
alphabet size = 4
26
window size = 1
stringency = 1
alphabet size = 20
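The effect of the three parameters on clutter can be estimated with a binomial calculation. This is a rough sketch under stated assumptions that are not in the slides: the two sequences are unrelated, positions are independent, all residues are equally frequent, and a dot requires at least `stringency` identities within the window. The function name `spurious_dot_probability` is hypothetical.

```python
from math import comb


def spurious_dot_probability(window, stringency, alphabet_size):
    """Chance that a random window yields a dot, assuming equal residue
    frequencies and independent positions (simplifying assumptions)."""
    p = 1.0 / alphabet_size   # chance that two random residues are identical
    return sum(
        comb(window, k) * p**k * (1 - p) ** (window - k)
        for k in range(stringency, window + 1)
    )


for window, stringency, alphabet_size in [(1, 1, 4), (3, 2, 4), (1, 1, 20)]:
    prob = spurious_dot_probability(window, stringency, alphabet_size)
    print(window, stringency, alphabet_size, round(prob, 4))
```

Under these assumptions the three parameter sets shown on the slides give probabilities of about 0.25, 0.16, and 0.05, which agrees with the ~25% figure quoted for DNA at window size 1.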
27
Dot-matrix methods:
Advantages: May unravel
information on the evolution of
sequences.
28
Window size = 60 amino acids; Stringency = 24 matches
Advantages:
Highlighting Information
The vertical gap indicates
that a coding region
corresponding to ~75
amino acids has either
been deleted from the
human gene or inserted
into the bacterial gene.
29
Window size = 60 amino acids; Stringency = 24 matches
Advantages:
Highlighting Information
The two diagonally
oriented parallel lines
most probably indicate
that a small internal
duplication has occurred
in the bacterial gene.
30
Dot-matrix
methods:
Disadvantage:
May not
identify the
best alignment.
31
Scoring Matrices & Gap Penalties
32
The true alignment between two sequences is
the one that reflects accurately the evolutionary
relationships between the sequences.
Since the true alignment is unknown, in practice
we look for the optimal alignment, which is the
one in which the numbers of mismatches and
gaps are minimized according to certain
criteria.
Unfortunately, reducing the
number of mismatches results
in an increase in the number of
gaps, and vice versa.
34
a = matches
b = mismatches
g = nucleotides in gaps
d = gaps
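The four quantities can be counted from any pairwise alignment, and a scoring scheme then combines them into a single number. The formula from the original slide is not preserved in this transcript, so the score below is only an illustration with hypothetical weights (`match`, `mismatch`, `gap_open`, `gap_extend`), not the author's formula. A "gap" is counted here as a run of consecutive null bases in one sequence.

```python
def count_events(row_a, row_b, gap="-"):
    """Return (a, b, g, d): matches, mismatches, nucleotides in gaps, gaps."""
    a = b = g = d = 0
    previous = None                 # row holding the null base in the previous column
    for ca, cb in zip(row_a, row_b):
        if ca == gap or cb == gap:
            current = "A" if ca == gap else "B"
            g += 1                  # one nucleotide sits opposite a null base
            if current != previous:
                d += 1              # a new gap (run of null bases) starts here
            previous = current
        else:
            previous = None
            if ca == cb:
                a += 1
            else:
                b += 1
    return a, b, g, d


def illustrative_score(a, b, g, d, match=1, mismatch=1, gap_open=2, gap_extend=0.5):
    """Hypothetical weighting: reward matches, penalize mismatches and gaps."""
    return match * a - mismatch * b - (gap_open * d + gap_extend * g)


counts = count_events("GCGGCCCATCAGGTAGTTGGTG-G", "GCGTTCCATC--CTGGTTGGTGTG")
print(counts, illustrative_score(*counts))
```

Raising the gap weights relative to the mismatch weight favors alignments with fewer gaps and more mismatches, and lowering them does the opposite.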
35
The scoring scheme comprises a gap penalty and a
scoring matrix, M(a,b), that specifies the score for
each type of match (a = b) or mismatch (a ≠ b).
The units in a scoring matrix may be the
nucleotides in the DNA or RNA sequences, the
codons in protein-coding regions, or the amino
acids in protein sequences.
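One standard way to find an optimal alignment under such a scoring scheme is dynamic programming (the Needleman-Wunsch global alignment algorithm). The slides do not name a specific algorithm, so the sketch below is an illustration only. It assumes integer scores, a linear gap penalty (the same cost for every null base), and a hypothetical scoring function `simple_score` standing in for M(a,b).

```python
def simple_score(a, b):
    """Hypothetical M(a,b): +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def needleman_wunsch(seq_a, seq_b, score=simple_score, gap_penalty=2):
    """Global alignment with a linear gap penalty.
    Returns (best_score, aligned_a, aligned_b)."""
    m, n = len(seq_a), len(seq_b)
    # F[i][j] = best score for aligning seq_a[:i] with seq_b[:j]
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = -i * gap_penalty
    for j in range(1, n + 1):
        F[0][j] = -j * gap_penalty
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),  # pair two residues
                F[i - 1][j] - gap_penalty,                            # null base in seq_b
                F[i][j - 1] - gap_penalty,                            # null base in seq_a
            )
    # Trace back from the lower-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]):
            out_a.append(seq_a[i - 1])
            out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] - gap_penalty:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return F[m][n], "".join(reversed(out_a)), "".join(reversed(out_b))


best, aligned_a, aligned_b = needleman_wunsch("ACTGGCCCAGATC", "ACAGGGCCACAAATC")
print(best)
print(aligned_a)
print(aligned_b)
```

Several alignments may share the best score; the traceback returns only one of them, and a different scoring matrix or gap penalty can change which alignment is optimal.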
36
If you want to know the secrets behind the black
box of sequence alignment, you will have to take
a class in BIOINFORMATICS.
37
Multiple Sequence
Alignment is
infinitely more
complicated than
pairwise alignment
38
Multiple Sequence
Alignment has no
computationally practical
exact solution.
It is solved
heuristically.
39
A Multiple Sequence Alignment
Spinach   GCGGCTCA TCAGGTAGTT GGTG-G
Rice      GCGGCCCA TCAGGTAGTT GGTG-G
Mosquito  GCGTTCCA TC--CT-GTT GGTGTG
Monkey    GCGTCCCA TCAGCTAGTT GTTG-G
Human     GCGGCGCA TTAGCTAGTT GGTG-A
          ***...** *.--.*-*** *.**-.
40
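The marker line under a multiple alignment can be derived column by column. This is a minimal sketch, assuming that the five rows belong to the species in the order listed, that the three blocks are consecutive segments of one alignment, and that the marker convention is '*' for a fully conserved column, '.' for a variable column, and '-' for a column containing a gap. The function name `column_markers` is hypothetical.

```python
def column_markers(rows, gap="-"):
    """One marker per column: '*' conserved, '.' variable, '-' contains a gap."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    marks = []
    for column in zip(*rows):
        if gap in column:
            marks.append("-")
        elif len(set(column)) == 1:
            marks.append("*")
        else:
            marks.append(".")
    return "".join(marks)


alignment = {
    "Spinach":  "GCGGCTCATCAGGTAGTTGGTG-G",
    "Rice":     "GCGGCCCATCAGGTAGTTGGTG-G",
    "Mosquito": "GCGTTCCATC--CT-GTTGGTGTG",
    "Monkey":   "GCGTCCCATCAGCTAGTTGTTG-G",
    "Human":    "GCGGCGCATTAGCTAGTTGGTG-A",
}
msa_markers = column_markers(list(alignment.values()))
for name, row in alignment.items():
    print(f"{name:<9}{row}")
print(" " * 9 + msa_markers)
```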