Transcript Slide 1

Lecture 2: Sequence Alignment
BMI/IBGP 705
Kun Huang
Department of Biomedical Informatics
Ohio State University
Major issues in genomics
• Homology
Alignment as an optimization problem
Dynamical programming
BLAST
Tools and examples (in the lab session)
“I think …”
Charles Darwin
(1809-1882)
Homology
A Working Definition:
Sequences or structures which
share a common ancestor
Homology Defined
"The same organ in different animals under a
variety of form and function."
Sir Richard Owen, Lectures on the Comparative Anatomy and
Physiology of the Invertebrate Animals, 1843.
"The mechanism of homology is heredity."
Allan Boyden, Homology and Analogy: A century after the
definitions of "homologue" and "analogue" of Richard Owen, 1943.
"Homology is a relation bearing on recency of
common ancestry.“
Olivier Rieppel, Homology and logical fallacy, 1993.
Sequence Homology



Genes in separate species derived from the
same ancestral genes in the last common
ancestor of those two species are orthologs.
Related genes resulted from a gene duplication
event within a single genome--and are likely to
have diverged in their function--are paralogs.
Both orthologs and paralogs are homologs, a
general term to cover both types of relationships
Recognizing Sequence Homology
• Relies primarily on understanding random sequence
similarity
• Only by knowing what random similarity looks like
can we tell when two sequences are significantly
similar
• Understanding mutational regularity and sequence
evolution increases the significance
1. Closely-related: Transitions/transversions
2. Distantly-related: PAM mutation probabilities
• Even distantly-related sequences can be recognized
• "Significant Similarity" is not a definition of
homology.
Databases
•
•
•
•
•
GenBank
EMBL
DDBJ
SWISSPROT
…
Major issues in genomics
• Homology
• Format
• Search
Alignment as an optimization problem
Dynamical programming
BLAST
Tools and examples
Aligning Text Strings
TCATG
CATT G
2 matches, 0 gaps
TCATG
| |
CATT G
4 matches, 1 inserted gaps
TCA - TG
| |
| |
CATTG
3 matches, 2 end gaps
TCATG
| | |
CATTG
4 matches, 1 inserted gaps
TCA T - G
| | |
|
CAT TG
Optimal solution: with respect to what criteria / cost
function?
Alignment as An Optimization Problem
• Optimization criteria / cost function
• Parameters to be adjusted
• Search algorithm / process
• Exhaustive testing
• Suboptimal solutions
• Computational cost / complexity
• Statistical significance
Alignment as An Optimization Problem
Optimization criteria / cost function
•
•
What sort of alignment should be considered
Scoring system (maximize the score)
• Additive model
• Based on probability compared with random
sequence (PAM, BLOSUM)
• Assumption of independence
• More complicated cases
Gap penalty – linear (s = -gd)
affine (s = -d – (g-1)e)
Alignment as An Optimization Problem
Optimization criteria / cost function
•
•
What sort of alignment should be considered
Scoring system
• Additive model
• Based on probability compared with random
sequence (PAM, BLOSUM)
• Assumption of independence
• More complicated cases
Gap penalty – linear (s = -gd)
affine (s = -d – (g-1)e)
Alignment as An Optimization Problem
Parameters to be adjusted
•
•
•
Shift
Number of gaps
Position of gaps
3 matches, 2 end gaps
TCATG
| | |
CATTG
4 matches, 1 inserted gaps
TCA - TG
| |
| |
CATTG
Alignment as An Optimization Problem
Search algorithm / process
•
Exhaustive testing
Try all possible configuration of parameters.
E.g., sequence a with length m, sequence b with
length n. Try all m+n shifts (if we use the O(.)
annotation, then the running time is O(m+n)).
0 matches, 0 gaps
TCATG
CATT G
2 matches, 0 gaps
TCATG
| |
CATT G
3 matches, 2 end gaps
TCATG
| | |
CATTG
Alignment as An Optimization Problem
Search algorithm / process
•
Computational cost / complexity
What if we allow gaps?
0 matches, 0 gaps
TCATG
CATT G
2 matches, 0 gaps
TCATG
| |
CATT G
3 matches, 2 end gaps
TCATG
| | |
CATTG
Many possible alignments to
consider
•
•
•
Without gaps, there are are n+m possible alignments
between sequences of length n and m
Once we start allowing gaps, there are many possible
arrangements to consider:
abcbcd abcbcd abcbcd
||| | |
| | | ||
||
abc--d a--bcd ab--cd
This becomes a very large number when we allow
mismatches, since we then need to look at every possible
pairing between elements: there are roughly nm possible
alignments.
Exponential computations get big fast
• If n=m=100, there are 100100 = 10200 =
100,000,000,000,000,000,000,000,000,000,000,000,0
00,000,000,000,000,000,000,000,000,000,000,000,00
0,000,000,000,000,000,000,000,000,000,000,000,000,
000,000,000,000,000,000,000,000,000,000,000,000,0
00,000,000,000,000,000,000,000,000,000,000,000,00
0,000,000,000,000,000 different alignments.
• And 100 amino acids is a small protein!
Alignment as An Optimization Problem
Statistical significance
•
•
•
Not only are there many possible gapped
alignments, but introducing too many gaps makes
nonsense alignments possible:
s--e-----qu---en--ce
sometimesquipsentice
Need to distinguish between alignments that occur
due to homology, and those that could be expected
to be seen just by chance.
Define a score function that accounts for statistical
significance (logarithmic scale – multiplication of
odds becomes addition of scores).
Major issues in genomics
• Homology
Alignment as an optimization problem
Dynamical programming
BLAST
Tools and examples
Dot matrix sequence comparison
• Write one sequence across top of matrix, the other across
left side, then put a dot where character on line i equals one
in column j
• Examples below: DNA and amino acid sequences of the
phage  cI (vertical axis) and phage P22 c2 (horizontal axis)
repressors
Dynamic programming
•
The name comes from an operations research task, and
has nothing to do with writing programs. Programming –
use tabular structure for computing.
• The key idea is to start aligning the sequences left to right;
once a prefix is optimally aligned, nothing about the
remainder of the alignment changes the alignment of the
prefix.
• We construct a matrix of possible alignment scores (nxm2
calculations worst case) and then "traceback" to find the
optimal alignment.
•
Called Needleman-Wunch (for global matching) or SmithWaterman (for local matching)
Dynamic programming
•
The name comes from an operations research task, and
has nothing to do with writing programs. Programming –
use tabular structure for computing.
3
18
20
5
A
2
11
25
17
B
Dynamic programming matrix
•
•
Each cell has the score for the
best aligned sequence prefix up
to that position.
Example:
ATGCT vs. ACCGCT
Match: +2, mismatch: 0,
gap: -1
Gap
A
T
G
C
T
Gap
0
-1
-1
-1
-1
-1
A
-1
2
0
0
0
0
C
-1
0
0
0
2
0
C
-1
0
0
0
2
0
G
-1
0
0
2
0
0
C
-1
0
0
0
2
0
T
-1
0
2
0
0
2
Matching matrix, NOT the
dynamical programming matrix!
Dynamic programming matrix
0
0
AT
AC
A_ T
AC_
AT _
A_C
AT
AC
1
2
3
4
5
6
Gap
A
T
G
C
T
1
Gap
0
-1
-2
-3
-4
-5
2
A
-1
2 (2)
1(0)
0
0
0
3
C
-2
1 (0)
2 (0)
1
2
1
4
C
-3
0
1 (0)
2
3
2
5
G
-4
0
0
3 (2)
2
3
6
C
-5
0
0
2
5 (2)
4
7
T
-6
0
2
1
4
7 (2)
Optimal alignment by traceback
• We “traceback” a path that gets us the highest score. If
we don't have “end gap” penalties, then take
any path from the
0
1
2
3
4
5
6
last row or column
Gap
A
T
G
C
T
0
to the first.
0
-1
-2
-3
-4
-5
1 Gap
• Otherwise we need
A
-1
2 (2)
1(0)
0
-1
-2
2
to include the top
and bottom corners
C
-2
1 (0) 2 (0)
1
2
1
3
AT - GCT
ACCGCT
A - TGCT
ACCGCT
4
C
-3
0
1 (0)
2
3
2
5
G
-4
-1
0
3 (2)
2
3
6
C
-5
-2
-1
2
5 (2)
4
7
T
-6
-3
0
1
4
7 (2)
Dynamic programming
•
Global alignment – an alignment of two or more sequences
that matches as many characters as possible in all of the
sequences. Needleman-Wunch algorithm
•
Local alignment – an alignment that includes only the best
matching, highest-scoring regions in two or more
sequences. Smith-Waterman algorithm
•
Difference – all the scores are kept in the dynamical
programming matrix for global alignment; only the positive
scores are kept in the dynamical programming matrix for
local alignment, the negative ones are converted to zero.
Major issues in genomics
• Homology
Alignment as an optimization problem
Dynamical programming
BLAST
Tools and examples
Sequence alignment (BLAST)
The Basic Local Alignment Search Tool (BLAST) finds regions of local
similarity between sequences. The program compares nucleotide or protein
sequences to sequence databases and calculates the statistical significance
of matches. BLAST can be used to infer functional and evolutionary
relationships between sequences as well as help identify members of gene
families.
http://www.ncbi.nlm.nih.gov/blast/
BLAST – Algorithm Intuition
The BLAST algorithm.The BLAST algorithm is a heuristic search method that seeks
words of length W (default = 3 in blastp) that score at least T when aligned with the
query and scored with a substitution matrix. Words in the database that score T or
greater are extended in both directions in an attempt to find a locally optimal ungapped
alignment or HSP (high scoring pair) with a score of at least S or an E value lower than
the specified threshold. HSPs that meet these criteria will be reported by BLAST,
provided they do not exceed the cutoff value specified for number of descriptions and/or
alignments to report.
BLAST – Algorithm Intuition
Databases are pre-indexed
by the words.
Without gaps:
Altschul, S. F., Gish, W.,
Miller, W., Myers, E. W.,
Lipman, D. J., J. Mol. Biol.
(1990) 215:403-410
With gaps:
Altschul, S. F., Madden, T.
L., Schaffer, A. A., Zhang,
J., Zhang, Z., Miller, W.,
Lipman, D. J., Nucleic Acids
Research (1997)
25(17):3389-3402
http://www.compbio.dundee.ac.uk/ftp/preprints/review93/Figure10.pdf
BLAST – Scoring Matrices
DNA scoring matrix (substitution matrix)
A
T
G
C
A
5
-4
-4
-4
ATTTAGCCG
ACTTGGCCT
5 55 555
T
-4
5
-4
-4
G
-4
-4
5
-4
C
-4
-4
-4
5
Score = 5X6+(-4)X3=18
http://www.ncbi.nlm.nih.gov/BLAST/matrix_info.html#matrix
BLAST – Scoring Matrices
• DNA is relatively easy to choose and protein is harder.
• PAM (Percent Accepted Mutation) matrices: predicted
matrices, most sensitive for alignments of sequences with
evolutionary related homologs. The greater the number in
the matrix name, the greater the expected evolutionary
(mutational) distance, i.e. PAM30 would be used for
alignments expected to be more closely related in evolution
than an alignment performed using the PAM250 matrix.
• BLOSUM (Blocks Substitution Matrix): calculated matrices,
most sensitive for local alignment of related sequences,
ideal when trying to identify an unknown nucleotide
sequence. BLOSUM62 is the default matrix set be the
BLAST search tool.
BLAST – Parameters
• Word size – for MegaBlast, can work between
w=16 and 64.
• Expected – statistical based notion, compare the
matched sequence with random sequence (the
likelihood). The larger the score, the smaller the
expected value, the more significant the result.
• Percent Identity, match/mismatch scores.
BLAST – Program Selection
Nucleotide
• Quickly search for highly similar sequences
(megablast)
• Quickly search for divergent sequences
(discontiguous megablast)
• Nucleotide-nucleotide BLAST (blastn)
• Search for short, nearly exact matches
• Search trace archives with megablast or
discontiguous megablast
Protein
• Protein-protein BLAST (blastp)
• Position-specific iterated and pattern-hit
initiated BLAST (PSI- and PHI-BLAST)
• Search for short, nearly exact matches
• Search the conserved domain database
(rpsblast)
• Protein homology by domain architecture
(cdart)