MBG305_03_cc

Download Report

Transcript MBG305_03_cc

Applied Bioinformatics
Week 3
Theory I
• Similarity
• Dot plot
Introduction to Bioinformatics
http://www.personeel.unimaas.nl/Westra/Education/BioInf/slides_of_bioinformatics.htm
LECTURE 3: SEQUENCE ALIGNMENT
3.2 On sequence alignment
Sequence alignment is the most important task in
bioinformatics!
3
http://www.personeel.unimaas.nl/Westra/Education/BioInf/slides_of_bioinformatics.htm
LECTURE 3: SEQUENCE ALIGNMENT
3.2 On sequence alignment
Sequence alignment is important for:
* prediction of function
* database searching
* gene finding
* sequence divergence
* sequence assembly
4
http://www.personeel.unimaas.nl/Westra/Education/BioInf/slides_of_bioinformatics.htm
LECTURE 3: SEQUENCE ALIGNMENT
3.3 On sequence similarity
Homology: genes that derive from a common ancestor-gene
are called homologs
Orthologous genes are homologous genes in different
organisms
Paralogous genes are homologous genes in one organism
that derive from gene duplication
Gene duplication: one gene is duplicated in multiple copies
that therefore free to evolve and assume new functions
5
http://www.personeel.unimaas.nl/Westra/Education/BioInf/slides_of_bioinformatics.htm
HOMOLOGOUS and PARALOGOUS
6
http://www.personeel.unimaas.nl/Westra/Education/BioInf/slides_of_bioinformatics.htm
HOMOLOGOUS and PARALOGOUS
7
http://www.personeel.unimaas.nl/Westra/Education/BioInf/slides_of_bioinformatics.htm
HOMOLOGOUS and PARALOGOUS versus ANALOGOUS
8
plants
? globin
Ath-g
analogs
http://www.personeel.unimaas.nl/Westra/Education/BioInf/slides_of_bioinformatics.htm
LECTURE 3: SEQUENCE ALIGNMENT: sequence similarity
Causes for sequence (dis)similarity
mutation:
a nucleotide at a certain location is replaced by
another nucleotide (e.g.: ATA → AGA)
insertion:
at a certain location one new nucleotide is
inserted inbetween two existing nucleotides
(e.g.: AA → AGA)
deletion:
at a certain location one existing nucleotide
is deleted (e.g.: ACTG → AC-G)
indel:
an insertion or a deletion
10
Similarity
• We can only measure current similarity
• We can form hypothesi
Similarity Searching
• DotPlot
• Needleman-Wunsch
• Smith-Waterman
• FASTA
• BLAST
Dot Plot
• Writing one sequence horizontally
• Writing the other vertically
• At each intersection with equal nucleotides
make a dot in the matrix
Dot Plot
Dot Plot
• Messy?
• Strong similarities can be visually
enhanced
• Select a window size and a similarity score
for that window (e.g. 10 and 8)
• Create a new matrix with dots where the
window score >= 8
Dot Plot
Dot Plot Interpretation
Creating a Dot Plot
End Theory I
• Mindmapping
• 10 min break
Practice I
• Dot plot
Dot Plot
• ACGTGTGCGTTTGAAC
• GGGTGTTCGTTTAAAC
• Make a Dot plot for the two sequences above
• Use a window of 3 to refine the view
• Can you use Excel?
• Get any two DNA sequences and try the tool
below
– http://www.vivo.colostate.edu/molkit/dnadot/
Similarity Searching
• DotPlot
• Needleman-Wunsch
• Smith-Waterman
• FASTA
• BLAST
Definitions
Optimal alignment - one that exhibits the
most correspondences. It is the alignment
with the highest score. May or may not
be biologically meaningful.
Global alignment - Needleman-Wunsch
(1970) maximizes the number of matches
between the sequences along the entire
length of the sequences.
Local alignment - Smith-Waterman (1981)
gives the highest scoring local match
between two sequences.
How can we find an optimal
alignment?
1
27
• ACGTCTGATACGCCGTATAGTCTATCT
CTGAT---TCG-CATCGTC--T-ATCT
• How many possible alignments?
C(27,7) gap positions = ~888,000
possibilities
• Dynamic programming: The Needleman &
Wunsch algorithm
Time Complexity
Consider two sequences:
AAGT
AGTC
How many possible alignments the 2
sequences have?
  = (2n)!/(n!)2 = (22n /n ) = (2n)
2n
n
= 70
Scoring a sequence alignment
• Match/mismatch score:
+1/+0
• Open/extension penalty: –2/–1
ACGTCTGATACGCCGTATAGTCTATCT
||||| |||
|| ||||||||
----CTGATTCGC---ATCGTCTATCT
•
•
•
•
Matches: 18 × (+1)
Mismatches: 2 × 0
Open: 2 × (–2)
Extension: 5 × (–1)
Score = +9
Pairwise Global Alignment
• Computationally:
– Given:
a pair of sequences (strings of
characters)
– Output:
an alignment that maximizes the
similarity
Needleman-Wunsch Alg
Needleman-Wunsch Alg
Needleman-Wunsch Alg
Needleman-Wunsch Alg
Needleman-Wunsch Alg
Needleman-Wunsch Alg
Needleman-Wunsch Alg
Needleman-Wunsch Alg
Needleman-Wunsch Alg
• Which Alignment is better?
• For scoring use:
– Match
– Mismatch
– Gap open
– Gap extension
1
0
-2
-1
• How can substitution matrices be
integrated?
Needleman & Wunsch
•
•
•
•
Place each sequence along one axis
Place score 0 at the up-left corner
Fill in 1st row & column with gap penalty multiples
Fill in the matrix with max value of 3 possible moves:
– Vertical move: Score + gap penalty
– Horizontal move: Score + gap penalty
– Diagonal move: Score + match/mismatch score
• The optimal alignment score is in the lower-right corner
• To reconstruct the optimal alignment, trace back where the
max at each step came from, stop when hit the origin.
Three steps in Needleman-Wunsch
Algorithm
•
•
•
•
Initialization
Scoring
Trace back (Alignment)
Consider the two DNA sequences to be
globally aligned are:
ATCG (x=4, length of sequence 1)
TCG (y=3, length of sequence 2)
Pooja Anshul Saxena, University of Mississippi
Scoring Scheme
•
•
•
•
Match Score = +1
Mismatch Score = -1
Gap penalty = -1
Substitution Matrix
A
C
G
T
A
1
-1
-1
-1
C
-1
1
-1
-1
G
-1
-1
1
-1
T
-1
-1
-1
1
Pooja Anshul Saxena, University of Mississippi
Initialization Step
• Create a matrix with X +1 Rows and Y +1
Columns
• The 1st row and the 1st column of the
score matrix are filled as multiple of gap
penalty
0
A
-1
T
-2
C
-3
T
C
G
-1
-2
-3
G
-4
Pooja Anshul Saxena, University of Mississippi
Scoring
S(T,A)
• The score of any cell C(i, j) is the maximum of:
scorediag = C(i-1, j-1) + S(i, j) = -1
scoreup = C(i-1, j) + g
= -2
Max -> C(i,j)
scoreleft = C(i, j-1) + g
= -2
where S(i, j) is the substitution score for letters i
and j, and g is the gap penalty
T
C
G
0
-1
-2
-3
A
-1
-1
T
-2
C
-3
G
-4
Scoring ….
• Example:
The calculation for the cell C(2, 2):
scorediag = C(i-1, j-1) + S(I, j) = 0 + -1 = -1
scoreup = C(i-1, j) + g = -1 + -1 = -2
scoreleft = C(i, j-1) + g = -1 + -1 = -2
T
C
G
0
-1
-2
-3
A
-1
-1
T
-2
C
-3
G
-4
Pooja Anshul Saxena, University of Mississippi
Scoring ….
• Final Scoring Matrix
T
C
G
0
-1
-2
-3
A
-1
-1
-2
-3
T
-2
0
-1
-2
C
-3
-1
1
0
G
-4
-2
0
2
Pooja Anshul Saxena, University of Mississippi
Trace back
• The trace back step determines the actual
alignment(s) that result in the maximum
score
• There are likely to be multiple maximal
alignments
• Trace back starts from the last cell, i.e.
position X, Y in the matrix
• Gives alignment in reverse order
Pooja Anshul Saxena, University of Mississippi
Trace back ….
• There are three possible moves: diagonally
(toward the top-left corner of the matrix), up, or left
• Trace back takes the current cell and looks to the
neighbor cells that could be direct predecessors.
This means it looks to the neighbor to the left (gap
in sequence #2), the diagonal neighbor
(match/mismatch), and the neighbor above it (gap
in sequence #1). The algorithm for trace back
chooses as the next cell in the sequence one of
the possible predecessors
Pooja Anshul Saxena, University of Mississippi
Trace back ….
T
C
G
0
-1
-2
-3
A
-1
-1
-2
-3
T
-2
0
-1
-2
C
-3
-1
1
0
G
-4
-2
0
2
• The only possible predecessor is the diagonal match/mismatch
neighbor. If more than one possible predecessor exists, any can be
chosen. This gives us a current alignment of
Seq 1: G
|
Seq 2: G
Pooja Anshul Saxena, University of Mississippi
Trace back ….
• Final Trace back
T
C
G
0
-1
-2
-3
A
-1
-1
-2
-3
T
-2
0
-1
-2
C
-3
-1
1
0
G
-4
-2
0
2
Best Alignment:
ATCG
| | | |
_TCG
Pooja Anshul Saxena, University of Mississippi
Similarity Searching
• DotPlot
• Needleman-Wunsch
• Smith-Waterman
• FASTA
• BLAST
Local Alignment
• Problem first formulated:
– Smith and Waterman (1981)
• Problem:
– Find an optimal alignment
between a substring of s and a
substring of t
• Algorithm:
– is a variant of the basic algorithm
for global alignment
Motivation
• Searching for unknown domains or motifs
within proteins from different families
– Proteins encoded from Homeobox genes (only
conserved in 1 region called Homeo domain – 60
amino acids long)
– Identifying active sites of enzymes
• Comparing long stretches of anonymous DNA
• Querying databases where query word much
smaller than sequences in database
• Analyzing repeated elements within a single
sequence
Smith-Waterman Alg
• Very similar to Needleman-Wunsch
• Determines local instead of global alignment
• Scores can drop and increase
• Alignments are calculated between 0 and 0
scores
Three steps in Smith-Waterman
Algorithm
•
•
•
•
Initialization
Scoring
Trace back (Alignment)
Consider the two DNA sequences to be
globally aligned are:
ATCG (x=4, length of sequence 1)
TCG (y=3, length of sequence 2)
Pooja Anshul Saxena, University of Mississippi
Scoring Scheme
•
•
•
•
Match Score = +1
Mismatch Score = -1
Gap penalty = -1
Substitution Matrix
A
C
G
T
A
1
-1
-1
-1
C
-1
1
-1
-1
G
-1
-1
1
-1
T
-1
-1
-1
1
Pooja Anshul Saxena, University of Mississippi
Initialization Step
• Create a matrix with X +1 Rows and Y +1
Columns
• The 1st row and the 1st column of the
score matrix are filled with 0s
0
A
0
T
0
C
0
T
C
G
0
0
0
G
0
Pooja Anshul Saxena, University of Mississippi
Scoring
• The score of any cell C(i, j) is the maximum
of:
scorediag = C(i-1, j-1) + S(I, j)
scoreup = C(i-1, j) + g
scoreleft = C(i, j-1) + g
And
0
(here S(I, j) is the substitution score for letters
i and j, and g is the gap penalty)
Pooja Anshul Saxena, University of Mississippi
Scoring ….
• Example:
The calculation for the cell C(2, 2):
scorediag = C(i-1, j-1) + S(I, j) = 0 + -1 = -1
scoreup = C(i-1, j) + g = 0 + -1 = -1
scoreleft = C(i, j-1) + g = 0 + -1 = -1
T
C
G
0
0
0
0
A
0
0
T
0
C
0
G
0
Pooja Anshul Saxena, University of Mississippi
Scoring ….
• Final Scoring Matrix
T
C
G
0
0
0
0
A
0
0
0
0
T
0
1
0
0
C
0
0
2
1
G
0
0
1
3
Note: It is not mandatory that the last cell
has the maximum alignment score!
Pooja Anshul Saxena, University of Mississippi
Trace back
• The trace back step determines the actual
alignment(s) that result in the maximum
score
• There are likely to be multiple maximal
alignments
• Trace back starts from the cell with
maximum value in the matrix
• Gives alignment in reverse order
Pooja Anshul Saxena, University of Mississippi
Trace back ….
• There are three possible moves: diagonally
(toward the top-left corner of the matrix), up, or left
• Trace back takes the current cell and looks to the
neighbor cells that could be direct predecessors.
This means it looks to the neighbor to the left (gap
in sequence #2), the diagonal neighbor
(match/mismatch), and the neighbor above it (gap
in sequence #1). The algorithm for trace back
chooses as the next cell in the sequence one of
the possible predecessors. This continues till cell
with value 0 is reached.
Pooja Anshul Saxena, University of Mississippi
Trace back ….
T
C
G
0
0
0
0
A
0
0
0
0
T
0
1
0
0
C
0
0
2
1
G
0
0
1
3
• The only possible predecessor is the diagonal match/mismatch
neighbor. If more than one possible predecessor exists, any can be
chosen. This gives us a current alignment of
Seq 1: G
|
Seq 2: G
Pooja Anshul Saxena, University of Mississippi
Trace back ….
• Final Trace back
T
C
G
0
0
0
0
A
0
0
0
0
T
0
1
0
0
C
0
0
2
1
G
0
0
1
3
Best Alignment:
TCG
| | |
TCG
Pooja Anshul Saxena, University of Mississippi
http://www.personeel.unimaas.nl/Westra/Education/BioInf/slides_of_bioinformatics.htm
LECTURE 3: GLOBAL ALIGNMENT
62
http://www.personeel.unimaas.nl/Westra/Education/BioInf/slides_of_bioinformatics.htm
LECTURE 3: GLOBAL ALIGNMENT
63
Significance of Sequence
Alignment
• Consider randomly generated
sequences. What distribution do you
think the best local alignment score of
two sequences of sample length should
follow?
1.
2.
3.
4.
5.
Uniform distribution
Normal distribution
Binomial distribution (n Bernoulli trails)
Poisson distribution (n, np=)
others
Extreme Value Distribution
• Yev = exp(- x - e-x )
0.4
0.35
0.3
0.25
0.2
0.15
0.1
0.05
0
-5
0
5
Extreme Value Distribution vs.
Normal Distribution
0.4
0.4
0.35
0.35
0.3
0.3
0.25
0.25
0.2
0.2
0.15
0.15
0.1
0.1
0.05
0.05
0
-5
0
5
0
-5
0
5
“Twilight Zone”
0.4
0.35
0.3
0.25
0.2
0.15
0.1
0.05
0
-5
0
Some proteins with less than 15% similarity have
exactly the same 3-D structure while some
proteins with 20% similarity have different
structures. Homology/non-homology is never
granted in the twilight zone.
5
End of Theoretical Part 2
• Mindmapping
• 10 min break
Needleman-Wunsch
• ACGTGTGCGTTTGAAC
• GGGTGTAGTCGTTTAAAC
• Apply the Needleman-Wunsch algorithm
to these two sequences
• Score the alignments
Alignments
• Explanation for alignment algorithms
– http://baba.sourceforge.net/
• Alignment of 2 sequences
– http://www.expasy.org/tools/sim-prot.html
• Get any two amino acid sequences and try
– http://bioinformatics.iyte.edu.tr/SmithWaterman/