Thursday and Friday
Wed (tomorrow) 10am
- this suite booked for BLAST searches
Thursday and Friday
Dr Michael Carton
Formerly VO’F group, now
National Disease Surveillance
Centre (NDSC)
TODAY
www.nuigalway.ie/microbiology/bioinformaticsnode/home.html
Lots of definitions - don’t worry!!
But, later on, look stuff up on Google
or Scirus
Remember:
Homology: sequences are homologous if
they are related by divergence from a
common ancestor.
Sequence alignment
In order to detect sequence homology
we must first align sequences.
An alignment is a hypothesis of
positional homology between
nucleotides/amino acids.
Alignment example
Take the case of a hypothetical ancestral
sequence (GAATTCGC). Over time mutation
may lead to two different forms of this
sequence, GAATTCGC and GATTGGC.
Example continued
Alignment without gaps
GAATTCGC
GATTGGC
** *
Alignments with gaps
GAATTCGC
GA-TTGGC
** ** **
or
GAATTC-GC
GA-TT-GGC
** **  **
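The asterisk lines in this example can be produced mechanically. The following is a minimal Python sketch, assuming gaps are written as "-" and both rows have the same length; the function name match_line is only illustrative.

```python
def match_line(row_a: str, row_b: str, gap: str = "-") -> str:
    """Return a marker line with '*' under each column where both rows hold the same base."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return "".join(
        "*" if a == b and a != gap else " "
        for a, b in zip(row_a, row_b)
    )


print(match_line("GAATTCGC", "GA-TTGGC"))    # one gap
print(match_line("GAATTC-GC", "GA-TT-GGC"))  # three gaps
```

Both gapped alignments show six matching columns; they differ only in how many gaps were needed to get there.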
Types of alignment
Local
Local alignment finds short regions of
similarity between a pair of sequences
Global
Global alignment attempts to find the
optimal alignment over the entire length of
the sequences.
Local alignment
Finds domains and short regions of
similarity between a pair of sequences.
The two sequences under comparison
do not necessarily need to have high
levels of similarity over their entire
length in order to receive locally high
similarity scores.
Local alignment
This feature of local similarity searches
gives them the advantage of being
useful when looking for domains within
proteins or looking for regions of
genomic DNA that contain introns. Local
similarity searches do not have the
constraint that similarity between two
sequences needs to be observed over
the entire length of each gene.
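To make the idea concrete, here is a small Python sketch of an exact local alignment score using the classic Smith-Waterman dynamic-programming recurrence. The slides do not name this algorithm; it is used here only as an illustration, and the scoring values (match +1, mismatch -1, gap -1) are assumptions.

```python
def local_alignment_score(a: str, b: str, match: int = 1,
                          mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score (Smith-Waterman, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0: a local alignment may start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Two sequences that share only the short region GAATTC.
print(local_alignment_score("TTTGAATTCAAA", "CCCGAATTCGGG"))  # 6
```

The flanking regions are unrelated, yet the shared six-base region still receives the full score, which is the behaviour described above.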
Global alignment
Finds the optimal alignment over the entire
length of the two sequences under
comparison. Algorithms of this nature are not
particularly suited to the identification of
genes that have evolved by recombination or
insertion of unrelated regions of DNA. In
instances such as this, a global similarity
score will be greatly reduced. Where the
genes being aligned are of comparable length
and are homologous (descended from a common
ancestor) over their entire length, global
alignment works well.
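For comparison, a global alignment score can be sketched with the classic Needleman-Wunsch recurrence. Again, the slides do not name this algorithm, and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Best global alignment score (Needleman-Wunsch, linear gap penalty)."""
    # Row 0: aligning a prefix of b against nothing costs one gap per base.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    # Every base of both sequences must be accounted for.
    return prev[-1]


print(global_alignment_score("GAATTCGC", "GATTGGC"))  # 4
```

Unlike the local version, there is no floor at zero, so unrelated stretches anywhere in either sequence pull the total score down.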
PROGRAMS USED
Local
BLAST
FASTA3
Global
ClustalW
ClustalX
Terminology
Exact (Exhaustive):
This is a method of looking at all
possibilities for a particular problem and
then choosing the best one. It is the most
rigorous method.
Heuristic:
This class of methods takes short-cuts and
attempts to arrive at a good solution by
making educated guesses; the optimal
solution is not guaranteed.
Matrices
Write one sequence horizontally
Write the other sequence vertically to
form a grid:
[Figure: example grid for two short DNA sequences, each cell holding 1 or 0]
Calculating an Alignment
Score
An alignment’s score is calculated using:
Scoring matrix
Gap opening penalty
Gap extension penalty
Scoring an alignment
    A  C  T  G
A   1
C   0  1
T   0  0  1
G   0  0  0  1
Previous Example
Alignment without gaps
GAATTCGC
GATTGGC
** *
Alignments with gaps
GAATTCGC
GA-TTGGC
** ** **
or
GAATTC-GC
GA-TT-GGC
** **  **
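The three ingredients (scoring matrix, gap opening penalty, gap extension penalty) can be combined as in the following Python sketch. It uses the identity matrix from the earlier slide (match 1, mismatch 0). The penalty values of -2 for opening and -1 for each extension are assumptions, not values given in the lecture, as is the convention that the opening penalty covers the first gap position. Gaps are assumed to be written as "-".

```python
def alignment_score(row_a: str, row_b: str, match: int = 1, mismatch: int = 0,
                    gap_open: int = -2, gap_extend: int = -1, gap: str = "-") -> int:
    """Score an existing alignment with an identity matrix and affine gap penalties."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    gap_in_a = gap_in_b = False  # was the previous column a gap in that row?
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            in_a = x == gap
            continuing = gap_in_a if in_a else gap_in_b
            score += gap_extend if continuing else gap_open
            gap_in_a, gap_in_b = in_a, not in_a
        else:
            score += match if x == y else mismatch
            gap_in_a = gap_in_b = False
    return score


print(alignment_score("GAATTCGC", "GA-TTGGC"))    # 6 matches, 1 gap  -> 4
print(alignment_score("GAATTC-GC", "GA-TT-GGC"))  # 6 matches, 3 gaps -> 0
```

Under these assumed penalties the one-gap alignment is preferred, even though both alignments contain the same number of matches.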
Dotplot Matrix I
Dotplot Matrix II
Chimpanzee haemoglobin
intergenic DNA plotted against
itself, c. 400 bases
Noise is caused by matches that have occurred by chance
without any homology present. A filter can be used to reduce the
noise, e.g. only place a dot when a specified proportion of a small
group of successive bases match, e.g. a window of 10 is only
highlighted if 6 of the 10 bases match.
8 out of 10,
even less noise
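The window filter can be sketched in a few lines of Python. This is an illustrative, unoptimised version; the function name and the choice to return a set of (i, j) coordinates are assumptions.

```python
def dotplot(seq_x: str, seq_y: str, window: int = 10, min_matches: int = 6) -> set:
    """Return the (i, j) window start positions that deserve a dot.

    A dot is placed when at least min_matches of the window bases
    starting at seq_x[i] and seq_y[j] are identical.
    """
    dots = set()
    for i in range(len(seq_x) - window + 1):
        for j in range(len(seq_y) - window + 1):
            matches = sum(seq_x[i + k] == seq_y[j + k] for k in range(window))
            if matches >= min_matches:
                dots.add((i, j))
    return dots


seq = "GAATTCGCATC"
loose = dotplot(seq, seq, window=4, min_matches=3)
strict = dotplot(seq, seq, window=4, min_matches=4)
print(len(loose), len(strict))  # the stricter filter never has more dots
```

Raising min_matches (6 of 10 to 8 of 10) can only remove dots, which is why the stricter plot shows less noise while the main diagonal of a self-comparison remains.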
Chimp and spider monkey DNA,
but c. 4,000 bases this time
IDENTITY DOT PLOT
- identity blocks
- looks for blocks of
perfect identity
- reduces time
required
Scoring matrix
In reality, we know that certain mutations are
more likely to have occurred than others.
Conservation of the secondary structure of
proteins is an important consideration.
The mutation of the third base in a codon
often results in no change in the amino acid
coded for.
Observations of alignments of amino acid
sequences have been used to calculate the
probability of certain substitutions.
Scoring Matrices
Scoring matrices tell how similar amino
acids are.
There are two main sets of scoring
matrices: PAM and BLOSUM.
PAM is based on evolutionary distances
BLOSUM is based on structure/function
similarities
AA Matrices
Assigning a score to all of the 210
possible amino acid pairs (190 substitutions
plus 20 identities) has been done by several
authors, but 2 are especially noteworthy.
Dayhoff et al. (1978) used amino acid
alignments of sequences that were at least 85%
identical as a basis for the PAM mutation
data matrices.
AA Matrices
Henikoff and Henikoff (1992) used several
different alignments to produce the BLOSUM
matrices.
The BLOSUM62 matrix is built from aligned
blocks in which sequences that are 62% or more
identical are clustered and counted as one.
This is possibly the most used of the amino acid
substitution matrices and is the default matrix
in several applications.
Scoring matrices
These have been empirically determined and
have been calculated by the direct
comparison of related protein sequences.
In general, amino acid substitutions that are
seen to occur very rarely are given a negative
value.
Conservative substitutions (e.g., isoleucine for
leucine) are given a positive value. Identical
matches are also given a positive value.
The bottom line on PAM
Ratio: frequencies of alignment / frequencies of occurrence
This is the probability that two amino acids, i and j, are
aligned by evolutionary descent divided by the
probability that they are aligned by chance.
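In practice, PAM and BLOSUM entries are stored as the logarithm of this ratio, so that scores can be added along an alignment. The Python sketch below shows the arithmetic only. The frequencies are made-up numbers, the chance probability is modelled simply as the product of the two background frequencies, and the scaling (2 times log base 2, i.e. half-bit units) is an assumption.

```python
import math


def log_odds_score(p_aligned: float, freq_i: float, freq_j: float,
                   scale: float = 2.0) -> int:
    """Rounded log-odds score for one amino acid pair.

    p_aligned: observed frequency of i aligned with j in related sequences
    freq_i, freq_j: background frequencies of each amino acid
    """
    odds = p_aligned / (freq_i * freq_j)
    return round(scale * math.log2(odds))


# Hypothetical frequencies, chosen only to show the sign of the score.
print(log_odds_score(0.04, 0.1, 0.1))    # seen 4x more often than chance -> 4
print(log_odds_score(0.01, 0.1, 0.1))    # seen as often as chance        -> 0
print(log_odds_score(0.0025, 0.1, 0.1))  # seen 4x less often than chance -> -4
```

Pairs observed more often than chance get positive scores and rarely observed pairs get negative scores.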
BLOSUM Matrices
BLOSUM is built from distantly related
sequences whereas PAM is built from
closely related sequences.
BLOSUM is built from conserved blocks
of aligned protein segments found in the
BLOCKS database.
PAM and BLOSUM
Running searches with different matrices will
help find different sorts of hits.
PAM30 will preferentially find homologues
that are evolutionarily close
PAM250 will tend to find long, weak diffuse
matches typical of distantly related proteins.
BLOSUM62 is built from aligned blocks in which sequences
that are 62% or more identical are clustered and counted as one.
Evolutionary Basis of
Sequence Alignment
1. Similarity: Quantity that relates to how
alike two sequences are.
2. Identity: Quantity that describes how alike
two sequences are in the strictest terms.
3. Homology: a conclusion drawn from data
suggesting that two genes share a common
evolutionary history.
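Identity is the one quantity here that can be computed directly from an alignment. A minimal Python sketch follows, assuming gaps are written as "-" and that percent identity is taken over the columns where both sequences have a residue; other denominators are also in use, so this convention is an assumption.

```python
def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Percentage of aligned (non-gap) columns that hold identical residues."""
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    identical = sum(a == b for a, b in pairs)
    return 100.0 * identical / len(pairs)


print(round(percent_identity("GAATTCGC", "GA-TTGGC"), 1))  # 6 of 7 columns
```

Homology, by contrast, is not computed: it is a conclusion drawn from values such as this one.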
Evolutionary Basis of
Sequence Alignment (Cont. 1)
1. Example: Shown on the next page is a pairwise alignment of
two proteins. One is mouse trypsin and the other is crayfish
trypsin. They are homologous proteins. The sequences
share 41% identity.
2. Underlined residues are identical. Asterisks and a diamond
mark the residues that participate in catalysis. Five
gaps are placed to optimize the alignment.
Evolutionary Basis of
Sequence Alignment (Cont. 2)
Why are there regions of identity?
1) Conserved function: residues participate in the reaction.
2) Structural: residues participate in maintaining the structure of
the protein. (For example, conserved cysteine residues that
form a disulfide linkage.)
3) Historical: residues that are conserved solely due to a
common ancestor gene.
Sequence Homology
Searching
Find related sequences in the
database
Original BLAST
Segment pair: this is a pair of
subsequences of the same length that
form an ungapped alignment.
BLAST searches for all segment pairs
between the query sequence and all of
the sequences in the database (above a
certain threshold).
HSP: High-scoring Segment Pair.
Original Blast
HSPs are derived by first finding the
pairs that satisfy the threshold (T)
conditions. Then the alignment is
extended in both directions until the
quality of the alignment drops off
dramatically or falls to zero.
The HSPs are then sorted according to
their score.
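The extension step can be sketched as follows. This is a simplified Python illustration, not BLAST's actual code: the function names, the +1/-1 scoring and the drop_off value (how far the running score may fall below the best score seen before extension stops) are all assumptions.

```python
def _extend(query, subject, i, j, step, match, mismatch, drop_off):
    """Walk from (i, j) in one direction; return (best gain, columns kept)."""
    best = run = 0
    kept = length = 0
    while 0 <= i < len(query) and 0 <= j < len(subject):
        run += match if query[i] == subject[j] else mismatch
        length += 1
        if run > best:
            best, kept = run, length
        elif best - run >= drop_off:
            break  # quality has dropped off too far; stop extending
        i += step
        j += step
    return best, kept


def extend_hit(query, subject, q_pos, s_pos, word_len,
               match=1, mismatch=-1, drop_off=3):
    """Extend an ungapped seed both ways; return (score, q_start, q_end)."""
    seed = sum(match if query[q_pos + k] == subject[s_pos + k] else mismatch
               for k in range(word_len))
    right, r_len = _extend(query, subject, q_pos + word_len, s_pos + word_len,
                           1, match, mismatch, drop_off)
    left, l_len = _extend(query, subject, q_pos - 1, s_pos - 1,
                          -1, match, mismatch, drop_off)
    return seed + left + right, q_pos - l_len, q_pos + word_len + r_len


query, subject = "TTTGAATTCAAA", "CCCGAATTCGGG"
score, start, end = extend_hit(query, subject, q_pos=5, s_pos=5, word_len=3)
print(score, query[start:end])  # 6 GAATTC
```

Starting from the three-base seed ATT, the extension grows to the full shared region and stops once the flanking mismatches pull the running score down by the drop-off amount.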
Gapped BLAST
The original BLAST suffered from the
limitation of not being able to introduce gaps
into the alignment.
Gapped BLAST is an effort to circumvent this
shortcoming.
Experience shows that often several
ungapped non-overlapping alignments result
from a match to a single database entry.
Two-Hit method
Find 2 HSPs within a distance m of each
other on the same diagonal.
Do not attempt an HSP extension
unless you find two regions that meet
this criterion.
Attempt to generate a single gapped
alignment in this region.
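The two-hit criterion can be sketched in Python as below. This is a simplification under stated assumptions: hits are exact word matches of length word_len rather than scored segment pairs, a diagonal is identified by (subject position minus query position), and max_distance plays the role of the distance m. The function name is illustrative.

```python
from collections import defaultdict


def two_hit_seeds(query: str, subject: str, word_len: int = 3,
                  max_distance: int = 10) -> list:
    """Return (diagonal, first_pos, second_pos) for pairs of non-overlapping
    word hits on the same diagonal within max_distance of each other."""
    # Index every word of the subject by its start position.
    index = defaultdict(list)
    for j in range(len(subject) - word_len + 1):
        index[subject[j:j + word_len]].append(j)

    # Group word hits by diagonal, recording the query position of each hit.
    diagonals = defaultdict(list)
    for i in range(len(query) - word_len + 1):
        for j in index.get(query[i:i + word_len], []):
            diagonals[j - i].append(i)

    seeds = []
    for diag, positions in sorted(diagonals.items()):
        last = None
        for pos in sorted(positions):
            if last is not None and pos - last < word_len:
                continue  # overlaps the previous hit; keep the earlier one
            if last is not None and pos - last <= max_distance:
                seeds.append((diag, last, pos))  # two hits: worth extending
            last = pos
    return seeds


print(two_hit_seeds("TTTGAATTCAAA", "CCCGAATTCGGG"))  # [(0, 3, 6)]
```

Only regions that produce such a pair would be passed on to the more expensive extension and gapped alignment steps.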
FastA algorithm
Is the alignment significant?
Could we see an alignment like this
purely by chance?
What are the statistics involved?
ktups
Sequence X
GAATTCGCATC
This 11 base sequence can be divided into six 6-long
segments of DNA
GAATTC
AATTCG
ATTCGC
TTCGCA
TCGCAT
CGCATC
These are known as ‘ktuples’ (ktup Fasta).
Sequences in databases are stored in this form.
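Splitting a sequence into ktuples, and indexing them by position for fast lookup, can be sketched in Python as follows. The function names and the dictionary layout are illustrative assumptions, not the internal format of any particular database.

```python
from collections import defaultdict


def ktuples(seq: str, k: int = 6) -> list:
    """All overlapping words of length k, in order of start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def ktuple_index(seq: str, k: int = 6) -> dict:
    """Map each ktuple to the list of positions where it starts."""
    index = defaultdict(list)
    for pos, word in enumerate(ktuples(seq, k)):
        index[word].append(pos)
    return dict(index)


print(ktuples("GAATTCGCATC"))
# ['GAATTC', 'AATTCG', 'ATTCGC', 'TTCGCA', 'TCGCAT', 'CGCATC']
print(ktuple_index("GAATTCGCATC")["ATTCGC"])  # [2]
```

A sequence of length n yields n - k + 1 ktuples, which is why the 11-base example gives six 6-long segments.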
Global Alignment vs. Local
Alignment
Global alignment is used when the overall
gene sequence is similar to another
sequence; it is often used in multiple sequence
alignment, e.g. the Clustal W algorithm.
Local alignment is used when only a small
portion of one gene is similar to a small
portion of another gene.
BLAST
FASTA
Different forms of BLAST and
FASTA
You have a nucleotide sequence and
want to compare it with other nucleotide
sequences:
Blastn
Fasta3
Different forms of BLAST and
FASTA
To compare the 6-frame conceptual
translation of the nucleotide sequence
against a protein database
Blastx
Fastx3
Fasty3
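A 6-frame conceptual translation reads the nucleotide sequence in three frames on each strand. The Python sketch below uses the standard genetic code; the function names and frame labels ("+1" to "-3") are illustrative assumptions, and codons containing anything other than A, C, G or T are shown as "X".

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna: str) -> str:
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from the start of dna; '*' marks a stop."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna: str) -> dict:
    """Return the three forward and three reverse-strand translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


print(six_frame_translation("ATGGCC"))
```

Each of the six protein strings would then be compared against the protein database, since the true reading frame and strand are not known in advance.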
Different forms of BLAST and
FASTA
If we translate our nucleotide sequence
into protein, we can compare it to the
six-frame translation of a nucleotide database:
tBlastn
tFasty3
Homology Search Tools
BLAST (Basic Local Alignment Search
Tool) by Stephen Altschul
http://www.ncbi.nih.gov/
FASTA by William Pearson
http://www.ebi.ac.uk/
Open a new Word file and 3 web browser
windows