Bioinformatics Workshop - Ramapo College of New Jersey

Download Report

Transcript Bioinformatics Workshop - Ramapo College of New Jersey

Algorithms in
Bioinformatics
Lawrence D’Antonio
Ramapo College of New Jersey
Bioinformatics Workshop, Fall 2003
Topics
• Algorithm basics
• Types of algorithms in bioinformatics
• Sequence alignment
• Database Searches
Bioinformatics Workshop, Fall 2003
Algorithm basics
• What is an algorithm?
• Algorithm complexity
• P vs. NP
• NP completeness
Bioinformatics Workshop, Fall 2003
What is an algorithm?
• An algorithm is a step-by-step procedure to
solve a problem
• The word “algorithm” comes from the 9th
century Islamic mathematician alKhwarizmi
Bioinformatics Workshop, Fall 2003
Algorithm Complexity
• If the algorithm works with n pieces of data
and the number of steps is proportional to n,
then we say that the running time is O(n).
• If the number of steps is proportional to
log n, then the running time is O(log n).
Bioinformatics Workshop, Fall 2003
Example
• Problem: find the largest element in a
sequence of n elements.
• Solution idea: Iteratively compare size of
elements in sequence.
Bioinformatics Workshop, Fall 2003
Algorithm:
1. Initialize first element as largest.
2. For each remaining element.
If current element larger than largest, make
that element largest.
Running time: O(n)
Bioinformatics Workshop, Fall 2003
Polynomial Time
• An algorithm is said to run in polynomial
time if its running time can be written in the
form O(nk) for some power k.
• The underlying problem is said to be of
class P.
Bioinformatics Workshop, Fall 2003
Polynomial Time Examples
• Searching
Binary Search: O(log n)
• Sorting
Quick Sort: O(n log n)
Bioinformatics Workshop, Fall 2003
NP Algorithms
• An algorithm is nondeterministic if it begins
with guessing a solution to the problem and
then verifies the guess.
• A problem is of category NP if there is a
nondeterministic algorithm for that problem
which runs in polynomial time.
Bioinformatics Workshop, Fall 2003
NP Complete
• A problem is NP-complete if it has an NP
algorithm, and solutions to this problem can
be used to solve all other NP problems.
• A problem is NP-hard if it is at least as hard
as the NP-complete problems
Bioinformatics Workshop, Fall 2003
NP Complete Examples
• Traveling salesman
• Knapsack problem
• Partition problem
• Graph coloring
Bioinformatics Workshop, Fall 2003
P = NP ?
• P  NP
• If P  NP then NP-complete problems have
exponential running time.
Bioinformatics Workshop, Fall 2003
Polynomial vs. Exponential
Bioinformatics Workshop, Fall 2003
Algorithms in Bioinformatics
• Algorithms to compare DNA, RNA, or
protein sequences
• Database searches to find homologous
sequences
• Sequence assembly
• Construction of evolutionary trees
• Structure prediction
Bioinformatics Workshop, Fall 2003
Edit operations on sequences
Substitution
Insertion
Deletion
AATAAGC
AAT-AAGC
AATAAGC
ATTAAGC
AATTAAGC
AA-AAGC
Bioinformatics Workshop, Fall 2003
What is sequence alignment?
• Compare two sequences using matches,
substitutions and indels.
G AA - - T C AT
G - TGG -CA • 3 matches, 1 substitution, 5 indels
Bioinformatics Workshop, Fall 2003
Complexity of DNA Problems
• 3 billion base pairs in human genome
• Many NP complete problems
• 10600 possible alignments for two 1000
character sequences
Bioinformatics Workshop, Fall 2003
Types of sequence alignment
• Determine the alignment of two sequences
that maximizes similarity (global
alignment)
• Determine substrings of two sequences with
maximum similarity (local alignment)
• Determine the alignment for several
sequences that maximizes the sum of pairs
similarity (multiple alignment)
Bioinformatics Workshop, Fall 2003
Significance of Alignment
• Functional similarity
• Structural similarity
• Homology
Bioinformatics Workshop, Fall 2003
Scoring System
• Assign a score for each possible match,
substitution and indel
• Distance functions – Find alignment to
minimize distance between sequences
• Similarity functions – Find alignment to
maximize similarity between sequences
Bioinformatics Workshop, Fall 2003
Edit Distance
G AA - - T C AT
G - TGG -CA • Similarity function: 1 for match, -1 for
substitution, -2 for indel
• Score: -8
Bioinformatics Workshop, Fall 2003
Dynamic Programming
• Used on optimization problems
• Bottom-up approach
• Recursively builds up solution from
subproblem optimal solutions
Bioinformatics Workshop, Fall 2003
Dynamic Programming Alignment
Algorithm (Needleman-Wunsch)
• Given sequences a1,a2,…,an and b1,b2,…,bm
to be aligned:
• Initialize alignment matrix (aligning with
spaces)
• Entry [i,j] gives optimal alignment score for
sequences a1,a2,…,ai and b1,b2,…,bj (where
1  i  n, 1  j  m)
Bioinformatics Workshop, Fall 2003
Computing Alignment Matrix
If a1,a2,…,ai and b1,b2,…,bj have been aligned,
there are three possible next moves:
• Match ai+1 with bj+1
• Match ai+1 with a space —
• Match bj+1 with a space —
Choose the move that maximizes the similarity
of the two sequences
Bioinformatics Workshop, Fall 2003
Global Alignment Matrix
—
G
G
G
C
A
T
—
0
-2
-4
-6
-8
-10
-12
G
-2
1
-1
-3
-5
-7
-9
G
-4
-1
2
0
-2
-4
-6
A
-6
-3
0
1
-1
-1
-3
Bioinformatics Workshop, Fall 2003
C
-8
-5
-2
-1
2
0
-2
A
-10
-7
-4
-3
0
3
1
Optimal Global Alignment
G
G
G
C
A
T
G
G
A
C
A
—
Bioinformatics Workshop, Fall 2003
Alignment Running Time
• Assuming two sequences n characters each
• Running time is O(n2) (each entry of matrix
must be calculated)
Bioinformatics Workshop, Fall 2003
Variations of Alignment
Algorithm
• Gap penalty
• Local alignment
• Multiple alignment
Bioinformatics Workshop, Fall 2003
Gap Penalty
• A gap is a number k of consecutive spaces
• k consecutive spaces are more probable than
k isolated spaces
• Typical gap penalty function: a + b·k
(affine gap penalty)
• Here the first space in a gap is penalized
a+b, further spaces are penalized b each.
Bioinformatics Workshop, Fall 2003
Gap Penalty Example
• Use penalty, 1 + k
A- A- C - A
AC TAT CA
• Score: -6
AA C - - - A
AC TAT CA
• Score: -4
Bioinformatics Workshop, Fall 2003
Local Alignment
• Find conserved regions in otherwise
dissimilar sequences (e.g., viral and host
DNA)
• Smith-Waterman algorithm
• Includes a fourth possibility at each step
(don’t align)
Bioinformatics Workshop, Fall 2003
Local Alignment Example
• Align the following
G C T C T G C G AATA
C G TT GAGATAC T
Bioinformatics Workshop, Fall 2003
Optimal Local Alignment
G C T C T G C G AATA
C G TT GAGATAC T
(G C T C) T G C G A A T A
(C G T) T G A G - A T A (C T)
Bioinformatics Workshop, Fall 2003
Multiple Alignment
• Find the alignment among a set of
sequences that maximizes the sum of scores
for all pairs of sequences
• Dynamic programming run-time for k
sequences of length n: O(k2 2k nk)
• Multiple alignment is NP-complete
Bioinformatics Workshop, Fall 2003
Other Features
• Usually used for protein alignment
• Can be used for global or local alignment
Bioinformatics Workshop, Fall 2003
Multiple Alignment Example
P E A A L Y G R F T - - - I K S D VW
P E S L A Y N K F - - - S I K S D VW
P E A L N Y G R Y - - - S S E S D VW
P E A L N Y GWY - - - S S E S D VW
P E V I RMQ D D N P F S F Q S D V Y
Bioinformatics Workshop, Fall 2003
Multiple vs. Pairwise Alignment
• Optimal multiple alignment does not imply
optimal pairwise alignment
AT
A-T
A-T
Bioinformatics Workshop, Fall 2003
Substitution Matrices
• In homologous sequences certain amino
acid substitutions are more likely to occur
than others
• Types of substitution matrices
* PAM
* BLOSUM
Bioinformatics Workshop, Fall 2003
PAM Matrices
• Defines units of evolutionary distance
• 1 PAM unit represents an average of one
mutation per 100 amino acids
• Start with a set of highly similar sequences
and compute
* pa = probability of occurrence of amino acid a
* Mab = probability of a mutating to b
Bioinformatics Workshop, Fall 2003
PAM Matrix Formula
• Entries in a k-PAM matrix
k
ab
M
10 log10
pb
Bioinformatics Workshop, Fall 2003
PAM250 Matrix
C
S
T
P
A
G
N
D
E
Q
H
R
K
M
I
L
V
F
Y
C
12
S
0
2
T
-2
1
3
P
-3
1
0
6
A
-2
1
1
1
2
G
-3
1
0
-1
1
5
N
-4
1
0
-1
0
0
2
D
-5
0
0
-1
0
1
2
4
E
-5
0
0
-1
0
0
1
3
4
Q
-5
-1
-1
0
0
-1
1
2
2
4
H
-3
-1
-1
0
-1
-2
2
1
1
3
6
R
-4
0
-1
0
-2
-3
0
-1
-1
1
2
6
K
-5
0
0
-1
-1
-2
1
0
0
1
0
3
5
M
-5
-2
-1
-2
-1
-3
-2
-3
-2
-1
-2
0
0
6
I
-2
-1
0
-2
-1
-3
-2
-2
-2
-2
-2
-2
-2
2
5
L
-6
-3
-2
-3
-2
-4
-3
-4
-3
-2
-2
-3
-3
4
2
6
V
-2
-1
0
-1
0
-1
-2
-2
-2
-2
-2
-2
-2
2
4
2
4
F
-4
-3
-3
-5
-4
-5
-4
-6
-5
-5
-2
-4
-5
0
1
2
-1
9
Y
0
-3
-3
-5
-3
-5
-2
-4
-4
-4
0
-4
-4
-2
-1
-1
-2
7
10
W
-8
-2
-5
-6
-6
-7
-4
-7
-7
-5
-3
2
-3
-4
-5
-2
-6
0
0
Bioinformatics Workshop, Fall 2003
W
17
BLOSUM Matrices (Omit)
• Uses log-odds ratio similar to PAM
• Uses short highly conserved sequences
• BLOSUM x matrices created after removing
sequences that are more than x percent identical
• Better at local alignments
Bioinformatics Workshop, Fall 2003
BLOSUM Matrices
• A motif is a conserved amino acid pattern
found in a group of proteins with similar
biological meaning (PROSITE)
• A block is a conserved amino acid pattern in
a group of proteins (no spaces allowed in
the pattern) (BLOCKS)
Bioinformatics Workshop, Fall 2003
Motif Example
• Motif obtained from a group of 34 tubulin
proteins
M[FYW] . . F[VLI]H . [FYW] . . EGM
Bioinformatics Workshop, Fall 2003
Defining BLOSUM (I)
• BLOSUMn uses blocks that are n%
identical (BLOSUM62 is most common)
• Consider all pairs of amino acids appearing
in the same column in the blocks
Bioinformatics Workshop, Fall 2003
Defining BLOSUM (II)
• Define n(i,j) to be the frequency that amino
acids i,j appear in a column pair
• Define e(i,j) to be the frequency that amino
acids i,j appear in any pair
• Define BLOSUM entry
n( i , j )
s( i , j )  log 2
e( i, j )
Bioinformatics Workshop, Fall 2003
PAM vs. BLOSUM
• PAM derived from highly similar sequences
(evolutionary model)
• BLOSUM derived from protein families
sharing a common ancestor (conserved
domain model)
Bioinformatics Workshop, Fall 2003
Database Searches
• FASTA
• BLAST
Bioinformatics Workshop, Fall 2003
FASTA
• Looks for sequences in a database similar to
a query sequence
• Heuristic, exclusion method
• Compares query sequence to each database
sequence (called the text)
Bioinformatics Workshop, Fall 2003
FASTA Algorithm (I)
• Look for small substrings in query and text
that exactly match (“hot spots”)
• Find ten best “diagonal runs” of hot spots
Bioinformatics Workshop, Fall 2003
Hot Spot Example
E K L A S R K L
H
A
S
H
K
L
*
*
*
*
Bioinformatics Workshop, Fall 2003
FASTA Algorithm (II)
• Find best local alignment for each run
• Combine these into larger alignment
• Do multiple alignment on query and texts
having highest score in last step
Bioinformatics Workshop, Fall 2003
BLAST
• Basic Local Alignment Search Tool
• Heuristic, exclusion method
• Computes statistical significance of
alignment scores
Bioinformatics Workshop, Fall 2003
BLAST Algorithm
• Find all w-length substrings in text that align to
some w-length substring in query with score above
a given threshold (called “hits”)
• Extend these hits as far as possible (“segment
pairs”)
• Report the highest scoring segment pairs
Bioinformatics Workshop, Fall 2003
Other Bioinformatics Algorithms
•
•
•
•
•
Palindromes
Tandem Repeats
Longest Common Subsequence
Double Digest (NP complete)
Shortest Common Superstring (NP
complete)
Bioinformatics Workshop, Fall 2003
References
• Clote and Backofen, Computational Molecular
Biology, Wiley
• Gusfield, Algorithms on Strings, Trees, and
Sequences, Cambridge University Press
• Mount, Bioinformatics, Cold Spring Harbor Press
• Setubal and Meidanis, Introduction to
Computational Molecular Biology, PWS
• Waterman, Introduction to Computational
Biology, CRC Press
Bioinformatics Workshop, Fall 2003