Bioinformatics Unit 1: Data Bases and Alignments


Bioinformatics
Unit 1: Data Bases and
Alignments
Lecture 3:
“Homology” Searches and
Sequence Alignments (cont.)
The Mechanics of Alignments
Overview
• Introduction/review
• Reading alignment outputs
• Scoring (substitution) matrices
• More on alignment algorithms and dynamic
programming
• Useful alignment algorithms
• Examples
Introduction
Sequence alignment is a useful tool with
many, diverse applications.
• Examples of sequence alignments:
– Compare a new sequence against an established
sequence from a database
– In sequencing a new gene, one usually
sequences both strands and then aligns them
(reverse-complementing one of them, of course!).
This helps ensure accuracy.
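As a small illustration of the "reversing one of them" step: before the two strands of a gene can be aligned, one read has to be reversed and complemented. The sketch below is a minimal version; the function name and the example read are hypothetical, not taken from the lecture.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence.

    Handles A, C, G, T and the ambiguity code N (upper or lower case).
    """
    complement = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    # Complement every base, then reverse the string.
    return seq.translate(complement)[::-1]


if __name__ == "__main__":
    read = "ATGC"  # hypothetical read from the opposite strand
    print(reverse_complement(read))
```

Applying the function twice returns the original sequence, which is a quick sanity check.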
Examples of Sequence
Alignments (cont.)
– Compare sequences to assess homology, i.e. to look for
evolutionary relatedness.
– To identify the sites of mutations
– To find regions of overlapping sequence
(cosmids or YACs for example)
– To identify conserved functional domains in
gene products
– Others to be sure!
Understanding Alignment
Outputs
• One sequence is placed above another and the
aligned vertical pairs are compared (scored)
• Matching pairs are joined with a bar ( | ) to
indicate identity.
• A colon ( : ) is used to identify similar but
nonidentical pairs.
– IUB ambiguity codes are used (e.g. N pairs with G, C,
T or A).
– Nonidentical amino acids with similar physical
properties can also be reported as similar.
Example
330 CCTTNATTTCCTTTTTGACA 349
    ||||:||| |||||||||||
991 CCTTAATTCCCTTTTTGACA 972
• Only 20 bases of each sequence aligned (a
local alignment)
• The numbers at each end of the alignment correspond to the
nucleotide numbers in the original sequences.
– Nucleotides 1-329 of the top sequence and 1-971 of the lower
sequence lie outside the alignment (non-identical flanking
sequence).
– There may have been non-identical sequence beyond the other end
too, or the entered sequences may only have been 349 and 991
bases long, respectively.
Example (cont.)
330 CCTTNATTTCCTTTTTGACA 349
    ||||:||| |||||||||||
991 CCTTAATTCCCTTTTTGACA 972
• The lower sequence has been reverse-complemented (its numbering runs down from 991 to 972)
• There are two non-identical pairs
– Nucleotides number 334 and 987 are paired by a colon (:). The
nucleotide at this position on the upper strand is an N indicating
that the sequencer was unable to determine the nucleotide identity.
– The nucleotide pair between numbers 338 (top) and 983 (bottom)
comprises a T and a C. These do not match and no line has been
drawn between them. This may be the result of a point mutation, or
a mistake in determining or entering the sequence.
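The reading rules above can be sketched in code. The block below rebuilds the middle line of the example alignment using the bar/colon convention from this lecture and lists the non-identical columns with their coordinates. The function names are hypothetical, and the IUB table covers only the standard nucleotide codes.

```python
# Standard IUB/IUPAC nucleotide codes and the bases each one can stand for.
IUB = {
    "A": set("A"), "C": set("C"), "G": set("G"), "T": set("T"),
    "R": set("AG"), "Y": set("CT"), "S": set("GC"), "W": set("AT"),
    "K": set("GT"), "M": set("AC"), "B": set("CGT"), "D": set("AGT"),
    "H": set("ACT"), "V": set("ACG"), "N": set("ACGT"),
}


def match_line(top, bottom):
    """'|' for identical bases, ':' for pairs compatible via an ambiguity code, ' ' otherwise."""
    marks = []
    for a, b in zip(top, bottom):
        if a == b and a in "ACGT":
            marks.append("|")
        elif IUB.get(a, set()) & IUB.get(b, set()):
            marks.append(":")
        else:
            marks.append(" ")
    return "".join(marks)


def non_identical_columns(top, bottom, top_start, bottom_start):
    """List (top position, bottom position, mark) for every column without a bar.

    Assumes the bottom sequence is shown reverse-complemented, so its
    numbering counts down from bottom_start.
    """
    line = match_line(top, bottom)
    return [(top_start + i, bottom_start - i, m) for i, m in enumerate(line) if m != "|"]


if __name__ == "__main__":
    top = "CCTTNATTTCCTTTTTGACA"
    bottom = "CCTTAATTCCCTTTTTGACA"
    print(match_line(top, bottom))
    print(non_identical_columns(top, bottom, 330, 991))
```

For the example alignment this reports the colon at 334/987 and the unmarked T/C pair at 338/983.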
Scoring Alignments
• Positive values are given for each identical match
• Smaller positive values are given for
“conservative substitutions”
• Negative values are given for non-identical, nonconservative pairs
• Gaps are penalized
• Total score is the sum of the individual pairwise
scores
• All else being equal, longer alignments give higher
scores than shorter ones
Gaps and Scoring
• Gaps may be caused by insertion in one sequence
or deletion in the other (“indel” events). We don’t
know which.
• Gaps in an alignment are indicated by a ‘-’ in one
or both of the sequences
• Gaps are penalized in scoring an alignment in two
ways
– Origination penalty - the scoring penalty for creating a
gap of any length (larger)
– Length penalty - based on the length of the gap
(smaller)
A Simple Example of Gap Scoring
If the scoring scheme says:
Match = +1
Mismatch = 0
Gap origination penalty = -2
Gap length penalty = -1 (for each base)
Calculate the score for each alignment. Which alignment is best and why?
A Simple Example of Gap Scoring (cont.)
Alignment 1: Score = -3
Alignment 2: Score = -1
Alignment 3: Score = 1
The third alignment is best. From an evolutionary standpoint it requires only one
genetic event (a single indel spanning 2 bases), so the gap origination penalty is paid only once.
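The arithmetic can be sketched in code using the scheme stated above (match +1, mismatch 0, gap origination -2, gap length -1 per base). The slide's three alignments are not reproduced in this transcript, so the two alignments below are hypothetical stand-ins chosen to show the same effect: one 2-base gap scores better than two separate 1-base gaps.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap_open=-2, gap_extend=-1):
    """Score a finished pairwise alignment ('-' marks a gap position).

    Each run of gaps pays gap_open once, plus gap_extend per gapped base.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    gap_row = None  # which row holds the gap run we are currently inside
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            if row != gap_row:  # a new gap starts here
                score += gap_open
                gap_row = row
            score += gap_extend
        else:
            gap_row = None
            score += match if a == b else mismatch
    return score


if __name__ == "__main__":
    # Hypothetical alignments of ACGTA against ACGGGTA.
    print(score_alignment("ACG--TA", "ACGGGTA"))  # one indel spanning 2 bases
    print(score_alignment("AC-G-TA", "ACGGGTA"))  # two separate 1-base indels
```

Both alignments contain five matches. The first pays -2 - 2 = -4 for its single gap (score 1); the second pays -3 twice (score -1).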
Scoring Matrices: How values are
assigned for each pair in an alignment
• DNA scoring matrices are fairly simple
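The matrix figure from this slide is not reproduced in the transcript. As an illustration of how simple a DNA matrix can be, here is a sketch of an identity-style matrix, assuming the match = +1, mismatch = 0 values used in the gap-scoring example; real tools may use different values, and the names below are hypothetical.

```python
BASES = "ACGT"


def dna_matrix(match=1, mismatch=0):
    """Build a 4 x 4 nucleotide scoring matrix as a nested dict."""
    return {a: {b: (match if a == b else mismatch) for b in BASES} for a in BASES}


def score_ungapped(x, y, matrix):
    """Sum the matrix values over each aligned pair (no gaps allowed)."""
    return sum(matrix[a][b] for a, b in zip(x, y))


if __name__ == "__main__":
    m = dna_matrix()
    print("  " + " ".join(BASES))
    for a in BASES:
        print(a + " " + " ".join(str(m[a][b]) for b in BASES))
```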
Scoring Matrices: How values are
assigned for each pair in an alignment (cont.)
• Protein matrices are far more complex
– There are 20 “letters” v. only 4 in DNA
– Far greater opportunity for conservative
substitutions
– Some are based on “observed” substitutions
– Others are based on chemical/physical
properties of the amino acids
– Others are based on the genetic code (how
easily could a codon specifying one amino acid
be changed to a codon specifying a different
amino acid?)
Two Common Protein Scoring Matrices
• The Point Accepted Mutation (PAM) matrix
– Based on observed substitution rates
– Different variations are used based on
assumptions of the length of time since the
sequences diverged
• PAM-1 may be best for comparing two closely
related sequences
• PAM-1000 may be best for comparing sequences
with distant relationships
• PAM-250 is a suitable compromise
A PAM250 Scoring Matrix
Two Common Protein Scoring Matrices
(cont.)
• BLOSUM matrices are also commonly used
• Constructed by analyzing observed substitutions in
conserved, ungapped blocks of aligned related proteins
• Also appended with numbers (but different
meaning)
– BLOSUM-62 is built from blocks in which sequences more than
62% identical are clustered together; it suits moderately
diverged sequences
– BLOSUM-80 uses an 80% clustering threshold; it suits more
closely related sequences
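To make "conservative substitution" concrete, the sketch below scores three short peptide pairs with a handful of entries taken from the standard BLOSUM62 table. Only the pairs needed for the example are included, and the dictionary name, helper names and peptides are hypothetical; a real analysis would load the full matrix from an alignment package.

```python
# A few BLOSUM62 entries (the full matrix is 20 x 20 and symmetric).
BLOSUM62_EXCERPT = {
    ("L", "L"): 4, ("K", "K"): 5, ("D", "D"): 6,   # identities
    ("L", "I"): 2, ("K", "R"): 2, ("D", "E"): 2,   # conservative substitutions
    ("L", "D"): -4,                                # nonconservative substitution
}


def pair_score(a, b):
    """Look up a pair in either order, since the matrix is symmetric."""
    if (a, b) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(a, b)]
    return BLOSUM62_EXCERPT[(b, a)]  # KeyError if the pair is not in the excerpt


def score_ungapped(x, y):
    """Sum pair scores over an ungapped alignment of equal-length peptides."""
    return sum(pair_score(a, b) for a, b in zip(x, y))


if __name__ == "__main__":
    print(score_ungapped("LKD", "LKD"))  # identical
    print(score_ungapped("LKD", "IRE"))  # all conservative substitutions
    print(score_ungapped("LKD", "DKL"))  # two nonconservative substitutions
```

Identities score highest, conservative substitutions still score positively, and the hydrophobic-for-acidic swap is penalized.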
Alignment Algorithms and
Dynamic Programming
• Computer trickery!
– The straightforward approach (scoring every possible alignment) is too computationally intensive
– For 2 sequences of 95 and 100 nucleotides
there are ~ 55 million possible alignments!
• (imagine a database search in this context!)
• Dynamic programming breaks the problem
into a series of small steps and adds the
results of these small steps to answer the
problem
Dynamic Programming (cont.)
When you run an alignment, a dynamic
programming matrix is formed with the
two sequences along the sides. Scores for
each pair are placed in the matrix. The
alignment is read by tracing a path from the
lower right corner back to the upper left
corner; if the sequences match throughout,
the path is a straight diagonal.
AC--TCG
ACAGTAG
Alignment score = 2
Vertical arrows indicate internal gaps
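A minimal sketch of global alignment by dynamic programming (the Needleman-Wunsch approach) for the two sequences in this example. The slide does not say which scoring scheme produced the score of 2. The sketch assumes match = +1, mismatch = 0 and a flat penalty of -1 per gap position, which happens to reproduce that score. The function name is hypothetical.

```python
def needleman_wunsch(x, y, match=1, mismatch=0, gap=-1):
    """Global alignment with a linear gap penalty.

    Returns (score, aligned_x, aligned_y).
    """
    n, m = len(x), len(y)
    # F[i][j] = best score for aligning x[:i] with y[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + pair,  # align x[i-1] with y[j-1]
                          F[i - 1][j] + gap,       # gap in y
                          F[i][j - 1] + gap)       # gap in x
    # Trace back from the lower right corner to the upper left corner.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + pair:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    score, top, bottom = needleman_wunsch("ACTCG", "ACAGTAG")
    print(top)
    print(bottom)
    print("Alignment score =", score)
```

Filling the table takes a number of steps proportional to the product of the two sequence lengths, rather than one step per possible alignment.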
Graphical Output: Dot plots and
Path Graphs
Comparison
• Dot Plots
– Have been popular
– Reveal complex
relationships involving
multiple regions
– Difficult to interpret as
they (may) show many
alignments
– Hard to see gaps and
visualize “best” alignment
• Path Diagrams
– Simpler to interpret
– Show only one alignment
• (Some can show more)
– Gaps appear as horizontal
or vertical segments of the
path line
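A text-mode dot plot is easy to sketch: put one sequence along the top, the other down the side, and mark every cell where the two bases are identical. The function name and test sequences below are hypothetical; real dot-plot programs usually add a window and threshold to suppress the background of chance matches.

```python
def dot_plot(x, y, mark="*", blank="."):
    """Return one string per base of y; columns correspond to bases of x."""
    return ["".join(mark if a == b else blank for a in x) for b in y]


if __name__ == "__main__":
    x = "ACGTTGCA"  # hypothetical sequence along the top
    y = "ACGTGCA"   # hypothetical sequence down the side (one T deleted)
    print("  " + x)
    for base, row in zip(y, dot_plot(x, y)):
        print(base + " " + row)
```

Matching regions appear as diagonal runs of marks. An indel shifts the run onto a neighboring diagonal, which is the same event a path graph draws as a horizontal or vertical segment.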
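Example 1: graphical comparison of sequences X and Y, axes labeled 5’ to 3’ (figure not reproduced)
Example 2: graphical comparison of sequences X and Y, axes labeled 5’ to 3’ (figure not reproduced)
Example 3: graphical comparison of sequences X and Y, axes labeled 5’ to 3’ (figure not reproduced)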
Some Useful Alignment
Programs
• BLAST 2 Sequences (NCBI)
• CLUSTALW (Biology Workbench)
• MAP (Multiple Alignment Program) at
Baylor, TX
• Many others
A Nice BLAST 2 Sequences
Example at:
http://www.ncbi.nlm.nih.gov/blast/