CS790 – Introduction to Bioinformatics
Download
Report
Transcript CS790 – Introduction to Bioinformatics
Introduction to Bioinformatics
Sequence Alignments
Sequence Alignments
Cornerstone of bioinformatics
What is a sequence?
• Nucleotide sequence
• Amino acid sequence
Pairwise and multiple sequence alignments
• We will focus on pairwise alignments
What alignments can help
• Determine function of a newly discovered gene sequence
• Determine evolutionary relationships among genes, proteins,
and species
• Predicting structure and function of protein
Intro to Bioinformatics – Sequence Alignment
2
Acknowledgement: This notes is adapted from lecture notes of both Wright State University’s Bioinformatics Program and Professor Laurie Heyer of Davidson College with permission.
DNA Replication
Prior to cell division, all the
genetic instructions must be
“copied” so that each new cell
will have a complete set
DNA polymerase is the enzyme
that copies DNA
• Reads the old strand in the 3´ to 5´
direction
Intro to Bioinformatics – Sequence Alignment
3
Over time, genes accumulate mutations
Environmental factors
• Radiation
• Oxidation
Mistakes in replication or
repair
Deletions, Duplications
Insertions, Inversions
Translocations
Point mutations
Intro to Bioinformatics – Sequence Alignment
4
Deletions
Codon deletion:
ACG ATA GCG TAT GTA TAG CCG…
• Effect depends on the protein, position, etc.
• Almost always deleterious
• Sometimes lethal
Frame shift mutation:
ACG ATA GCG TAT GTA TAG CCG…
ACG ATA GCG ATG TAT AGC CG?…
• Almost always lethal
Intro to Bioinformatics – Sequence Alignment
5
Indels
Comparing two genes it is generally impossible
to tell if an indel is an insertion in one gene, or
a deletion in another, unless ancestry is known:
ACGTCTGATACGCCGTATCGTCTATCT
ACGTCTGAT---CCGTATCGTCTATCT
Intro to Bioinformatics – Sequence Alignment
6
The Genetic Code
Substitutions are
mutations
accepted by
natural selection.
Synonymous:
CGC CGA
Non-synonymous:
GAU GAA
Intro to Bioinformatics – Sequence Alignment
7
Comparing Two Sequences
Point mutations, easy:
ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGATTCGCCCTATCGTCTATCT
Indels are difficult, must align sequences:
ACGTCTGATACGCCGTATAGTCTATCT
CTGATTCGCATCGTCTATCT
ACGTCTGATACGCCGTATAGTCTATCT
----CTGATTCGC---ATCGTCTATCT
Intro to Bioinformatics – Sequence Alignment
8
Why Align Sequences?
The draft human genome is available
Automated gene finding is possible
Gene: AGTACGTATCGTATAGCGTAA
• What does it do?
One approach: Is there a similar gene in
another species?
• Align sequences with known genes
• Find the gene with the “best” match
Intro to Bioinformatics – Sequence Alignment
9
Gaps or No Gaps
Examples
Intro to Bioinformatics – Sequence Alignment
10
Scoring a Sequence Alignment
Given
Match score:
+1
Mismatch score:
+0
Gap penalty:
–1
Sequences
ACGTCTGATACGCCGTATAGTCTATCT
||||| |||
|| ||||||||
----CTGATTCGC---ATCGTCTATCT
Matches: 18 × (+1)
Mismatches: 2 × 0
Score
Gaps: 7 × (– 1)
Intro to Bioinformatics – Sequence Alignment
= +11
11
Origination and Length Penalties
Note that a bioinformatics computational model or
algorithm must be “biologically meaningful” or
even “biologically significant”
We want to find alignments that are evolutionarily
likely.
Which of the following alignments seems more
likely to you?
ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGAT-------ATAGTCTATCT
ACGTCTGATACGCCGTATAGTCTATCT
AC-T-TGA--CG-CGT-TA-TCTATCT
We can achieve this by penalizing more for a new
gap, than for extending an existing gap
Intro to Bioinformatics – Sequence Alignment
12
Scoring a Sequence Alignment (2)
Given
Match/mismatch score:
+1/+0
Gap origination/length penalty: –2/–1
Sequences
ACGTCTGATACGCCGTATAGTCTATCT
||||| |||
|| ||||||||
----CTGATTCGC---ATCGTCTATCT
Matches: 18 × (+1)
Mismatches: 2 × 0
Score
Origination: 2 × (–2)
Length: 7 × (–1)
= +7
Caution: Sometime “gap extension” used instead of “gap
length” …
Intro to Bioinformatics – Sequence Alignment
13
How can we find an optimal alignment?
Finding the alignment is computationally hard:
ACGTCTGATACGCCGTATAGTCTATCT
CTGAT---TCG-CATCGTC--T-ATCT
C(27,7) gap positions = ~888,000 possibilities
It’s possible, as long as we don’t repeat our
work!
Dynamic programming: The Needleman &
Wunsch algorithm
Intro to Bioinformatics – Sequence Alignment
14
Dynamic Programming
Technique of solving optimization problems
• Find and memorize solutions for subproblems
• Use those solutions to build solutions for larger
subproblems
• Continue until the final solution is found
Recursive computation of cost function in a
non-recursive fashion
Intro to Bioinformatics – Sequence Alignment
15
Dynamic Programming
Example
• Solving Fibonacci number f(n) = f(n-1) + f(n-2)
recursively takes exponential time
Because many numbers are recalculated
• Solving it using dynamic programming only takes a
linear time
Use an array f[] to store the numbers
f[0] = 1;
f[1] = 1;
for (i = 2; i <= n; i++)
f[i] = f[i – 1] + f[i – 2];
Intro to Bioinformatics – Sequence Alignment
16
Global Sequence Alignment
Needleman-Wunsch algorithm
Di,j = max{ Di-1,j + d(Ai, –), Di,j-1 + d(–, Bj), Di-1,j-1 + d(Ai, Bj) }
B
i-1
A
i
j-1
j
Suppose we are aligning:
Seq1
a with a …
Seq2 a
Intro to Bioinformatics – Sequence Alignment
0
-1
a
-1
17
Dynamic Programming (DP) Concept
Suppose we are aligning:
CACGA
CGA
Intro to Bioinformatics – Sequence Alignment
18
DP – Recursion Perspective
Suppose we are aligning:
ACTCG
ACAGTAG
Last position choices:
G
G
+1
ACTC
ACAGTA
G
-
-1
ACTC
ACAGTAG
G
-1
ACTCG
ACAGTA
Intro to Bioinformatics – Sequence Alignment
19
What is the optimal alignment?
ACTCG
ACAGTAG
Match: +1
Mismatch: 0
Gap: –1
Intro to Bioinformatics – Sequence Alignment
20
Needleman-Wunsch: Step 1
Each sequence along one axis
Gap penalty multiples in first row/column
0 in [1,1] (or [0,0] for the CS-minded)
A
C
A
G
T
A
G
0
-1
-2
-3
-4
-5
-6
-7
A
-1
1
Intro to Bioinformatics – Sequence Alignment
C
-2
T
-3
C
-4
G
-5
21
Needleman-Wunsch: Step 2
Vertical/Horiz. move: Score + (simple) gap penalty
Diagonal move: Score + match/mismatch score
Take the MAX of the three possibilities
A
C
A
G
T
A
G
0
-1
-2
-3
-4
-5
-6
-7
A
-1
1
Intro to Bioinformatics – Sequence Alignment
C
-2
T
-3
C
-4
G
-5
22
Needleman-Wunsch: Step 2 (cont’d)
Fill out the rest of the table likewise…
a
a
c
a
g
t
a
g
0
-1
-2
-3
-4
-5
-6
-7
c
-1
1
Intro to Bioinformatics – Sequence Alignment
t
-2
0
c
-3
-1
g
-4
-2
-5
-3
23
Needleman-Wunsch: Step 2 (cont’d)
Fill out the rest of the table likewise…
a
a
c
a
g
t
a
g
0
-1
-2
-3
-4
-5
-6
-7
c
-1
1
0
-1
-2
-3
-4
-5
t
-2
0
2
1
0
-1
-2
-3
c
-3
-1
1
2
1
1
0
-1
g
-4
-2
0
1
2
1
1
0
-5
-3
-1
0
2
2
1
2
The optimal alignment score is calculated in the
lower-right corner
Intro to Bioinformatics – Sequence Alignment
24
But what is the optimal alignment
To reconstruct the optimal alignment, we must
determine of where the MAX at each step came
from…
a
a
c
a
g
t
a
g
0
-1
-2
-3
-4
-5
-6
-7
Intro to Bioinformatics – Sequence Alignment
c
-1
1
0
-1
-2
-3
-4
-5
t
-2
0
2
1
0
-1
-2
-3
c
-3
-1
1
2
1
1
0
-1
g
-4
-2
0
1
2
1
1
0
-5
-3
-1
0
2
2
1
2
25
A path corresponds to an alignment
= GAP in top sequence
= GAP in left sequence
= ALIGN both positions
One path from the previous table:
Corresponding alignment (start at the end):
AC--TCG
ACAGTAG
Intro to Bioinformatics – Sequence Alignment
Score = +2
26
Algorithm Analysis
Brute force approach
• If the length of both sequences is n, number of
possibility = C(2n, n) = (2n)!/(n!)2 22n / (n)1/2,
using Sterling’s approximation of n! = (2n)1/2e-nnn.
• O(4n)
Dynamic programming
• O(mn), where the two sequence sizes are m and n,
respectively
• O(n2), if m is in the order of n
Intro to Bioinformatics – Sequence Alignment
27
Practice Problem
Find an optimal alignment for these two
sequences:
GCGGTT
GCGT
Match: +1
Mismatch: 0
Gap: –1 g
c
g
t
g
0
-1
-2
-3
-4
Intro to Bioinformatics – Sequence Alignment
c
-1
g
-2
g
-3
t
-4
t
-5
-6
28
Practice Problem
Find an optimal alignment for these two
sequences:
GCGGTT
GCGT
g
c
g
g
t
t
g
c
g
t
0
-1
-2
-3
-4
-1
1
0
-1
-2
-2
0
2
1
0
-3
-1
1
3
2
GCGGTT
GCG-TIntro to Bioinformatics – Sequence Alignment
-4
-2
0
2
3
-5
-3
-1
1
3
-6
-4
-2
0
2
Score = +2
29
Semi-global alignment
Suppose we are aligning:
GCG
GGCG
Which do you prefer?
G-CG
-GCG
GGCG
GGCG
g
g
g
c
g
0
-1
-2
-3
-4
c
-1
1
0
-1
-2
g
-2
0
1
1
0
-3
-1
1
1
2
Semi-global alignment allows gaps at the ends
for free.
• Terminal gaps are usually the result of incomplete
data acquisition no biologically significant
Intro to Bioinformatics – Sequence Alignment
30
Semi-global alignment
Semi-global alignment allows gaps at the ends
for free.
g
g
g
c
g
0
0
0
0
0
c
0
1
1
0
1
g
0
0
1
2
1
0
1
1
1
3
Initialize first row and column to all 0’s
Allow free horizontal/vertical moves in last row
and column
Intro to Bioinformatics – Sequence Alignment
31
Local alignment
Global alignments – score the entire alignment
Semi-global alignments – allow unscored gaps
at the beginning or end of either sequence
Local alignment – find the best matching
subsequence
CGATG
AAATGGA
This is achieved by allowing a 4th alternative at
each position in the table: zero.
Intro to Bioinformatics – Sequence Alignment
32
Local Sequence Alignment
Why local sequence alignment?
• Subsequence comparison between a DNA sequence
and a genome
• Protein function domains
• Exons matching
Smith-Waterman algorithm
Initialization: D1,j = , Di,1 =
Di,j = max{
Di-1,j + d(Ai, –),
Di,j-1 + d(–,Bj),
Di-1,j-1 + d(Ai,Bj),
0}
Intro to Bioinformatics – Sequence Alignment
33
Local alignment
Score: Match = 1, Mismatch = -1, Gap = -1
c
a
a
a
t
g
g
a
0
0
0
0
0
0
0
0
g
0
0
0
0
0
0
0
0
a
0
0
0
0
0
1
1
0
t
0
1
1
1
0
0
0
2
g
0
0
0
0
2
1
0
1
0
0
0
0
1
3
2
1
CGATG
AAATGGA
Intro to Bioinformatics – Sequence Alignment
34
Local alignment
Another example
Intro to Bioinformatics – Sequence Alignment
35
More Example
Align
ATGGCCTC
ACGGCTC
Mismatch = -3
Gap
= -4
Global
Alignment:
ATGGCCTC
ACGGC-TC
-A
T
G
G
C
C
T
C
Intro to Bioinformatics – Sequence Alignment
-0
-4
-8
-12
-16
-20
-24
-28
-32
A
-4
1
-3
-7
-11
-15
-19
-23
-27
C
-8
-3
-2
-6
-10
-10
-14
-18
-22
G
-12
-7
-6
-1
-5
-9
-13
-17
-21
G
-16
-11
-10
-5
0
-4
-8
-12
-16
C
-20
-15
-14
-9
-4
1
-3
-7
-11
T
-24
-19
-14
-13
-8
-3
-2
-2
-6
C
-28
-23
-18
-17
-12
-7
-2
-5
-1
36
More Example
Local
Alignment:
ATGGCCTC
ACGG CTC
or
ATGGCCTC
ACGGC TC
Intro to Bioinformatics – Sequence Alignment
-A
T
G
G
C
C
T
C
-0
0
0
0
0
0
0
0
0
A
0
1
0
0
0
0
0
0
0
C
0
0
0
0
0
1
1
0
1
G
0
0
0
1
1
0
0
0
0
G
0
0
0
1
2
0
0
0
0
C
0
0
0
0
0
3
1
0
1
T
0
0
1
0
0
0
0
2
0
C
0
0
0
0
0
1
1
0
3
37
Scoring Matrices for DNA Sequences
Transition: A G C T
Transversion: a purine (A or G) is replaced by a
pyrimadine (C or T) or vice versa
Intro to Bioinformatics – Sequence Alignment
38
Scoring Matrices for Protein Sequence
PAM 250
Intro to Bioinformatics – Sequence Alignment
39
Scoring Matrices for Protein Sequence
PAM (Percent Accepted Mutations)
1 PAM unit can be thought of as the amount of time in which an
“average” protein mutates 1 out of every 100 amino acids.
Perform multiple sequence alignment on a “family” of proteins that are
at least 85% similar. Find the frequency of amino acid i and j are aligned
to each other and normalize it.
Entry mij in PAM1 represents the probability of amino acid i
substituted by amino acid j in 1 PAM unit
PAM 2 = PAM 1 × PAM 1
PAM n = (PAM 1)n, e.g., PAM 250 = (PAM 1)250
Questions
Why are values in a PAM matrix integers? Shouldn’t a probability be
between 0 and 1?
Why is PAM250 possible? Shouldn’t a probability be less than or equal
to 100%?
Intro to Bioinformatics – Sequence Alignment
40
Scoring Matrices for Protein Sequence
BLOSUM (BLOcks SUbstitution Matrix) 62
Intro to Bioinformatics – Sequence Alignment
41
Using Protein Scoring Matrices
Divergence
BLOSUM 80
PAM 1
Closely related
Less divergent
Less sensitive
BLOSUM 62
PAM 120
Looking for
BLOSUM 45
PAM 250
Distantly related
More divergent
More sensitive
• Short similar sequences → use less sensitive matrix
• Long dissimilar sequences → use more sensitive matrix
• Unknown → use range of matrices
Comparison
• PAM – designed to track evolutionary origin of proteins
• BLOSUM – designed to find conserved regions of proteins
Intro to Bioinformatics – Sequence Alignment
42
Multiple Sequence Alignment
Why multiple sequence alignment (MSA)
• Two sequences might not look very similar.
• But, some “similarity” emerges with more sequences,
however.
Is dynamic programming still applicable?
CLUSTALW
• One of most popular tools for MSA
• Heuristic-based approach
• Basic
1) Calculate the distance matrix based on pairwise alignments
2) Construct a guide tree
NJ (unrooted tree) Mid-point (rooted tree)
3) Progressive alignment using the guide tree
43