BLAST- bioinformatics

Download Report

Transcript BLAST- bioinformatics

Chapter 11
Assessing Pairwise Sequence Similarity: BLAST and
FASTA (Lecture follows chapter pretty closely)
This lecture is designed to introduce you to
the theory and practice of performing the
variety of sequence similiarity searches
available via the NCBI BLAST service
Searching primary databases
using sequence similarity
Basic Local Alignment Search Tool
BLAS
T
BLAST is a computer algorithm that returns
sequences in the database with the highest
percentage of bases in common to the query
sequence
Sequence alignments
• The first line of evidence that two sequences are,
because of a shared evolutionary history, related.
• If two sequences (DNA or protein) are related
by descent, they will be related by sequence.
A
B
C
The “evolutionary distance” between A and B is
smaller, and therefore, under an assumption of
brownian motion (no selection) a given stretch of
DNA or protein will share more nucleotides or
amino acids in common.
All search algorithms will produce
results when queried!
• The trick is to be able to (i) trust and evaluate
the result, and (ii) to be able to quantify this
evaluation.
Reasons for performing BLAST searches
• There can be many reasons, but common ones
are:
Human or Computational annotation: For non-model systems, for which little
bench work has been done compared to a model system, sequence alignments
with known, experimentally verified genes, can aid in the assignment of function
Evolutionary: discovering similar sequences in different organisms allows one to
ask whether and how sequence-level changes result in functional changes. Can
be done for coding or non-coding (i.e. regulatory regions) .
Multiple sequence alignments can help identify conserved regions of coding
sequences, which might have functional significance, or to help understand
evolutionary relationships among difficult to classify organisms.
Multiple sequence alignments can also help with the development of primers in
order to easily clone out a cDNA from an organism for which genome sequencing
has not been done.
Sequence similarity can only be
ascertained by aligning two sequences
ACGGCATCCGACGCTTAGCGGACTATCGATCTGA
ACCCGGCCTACGGCTACTCGCTTAGCGGACTCGG
Some basic concepts:
Sequence Similarity (Data)
Homology (Inference)
Percent similarity of base pairs Similar because of common
between any two sequences, descent from an ancestor that
over a given length of
contained that sequence.
sequence
Continuous quantity any real
number (0-100%)
Categorical quantity, either
two sequences are
homologous or not*
*Because of a variety of genomic rearrangement phenomena, two sequences
that code for a non-homologous protein per se, can contain sub-sequences
that are indeed homologous. This is actually a source of false positive hits
during Blast searches
Two kinds of Homology: Paralogy vs. Orthology
Orthologues are two sequences that are related because that sequence existed
in an ancestor.
Paralogues are two or more sequences that are related because a gene
duplication event
Paralogous sequences
AGCCTATGGCAA
ACGCTAGGGCTT
ACGCTATGGCAA
ACGCTAGGGGCAA
Orthologous sequences
Ancestral sequence
ACGCTACGGGCAA
Global versus local alignment
Local Alignment
Global Alignment
Best alignment of two
sequences, based on regions
of sub-sequence (the local
part) with the highest
similarity
Best alignment of two
sequences along their entire
length. Used with highly
similar sequences.
Better at finding weak
similarities and functional
domains within coding
sequences
Better for multiple sequence
alignments
Most common in database
searches
Not particularly useful for
database searches.
DNA versus protein alignment
DNA
Can be much longer
protein
Only as long as the
longest coding
sequence (<5K a.a.)
Because of the wobble Are more sensitive
effect, are less
because amino acids
sensitive and tend to
tend to be conserved
miss sequence
along functional
relationships among
domains, so even
distantly related
weak total similarities
species
can still be detected
Dotplot: for comparing two sequences
• A dotplot is a simple techniqe that yields a
graphical, but not statistical, representation of
sequence similarity (see Fig 11.1)
A C G T C G T A
A
C
G
A
G
G
T
A
Dot plots
• Good:
Easy visualization of syntenic regions of long regions of genomic sequence
Easy visualization of exon/intron boundaries
Easy visualization and enumeration of tandemly repeated sequence
elements
• Bad:
Can only compare two sequences at a time.
No numerical evaluation of the degree of similarity – we examine this next
Scoring matrices
• When searching a complex database full of billions of
nucleotides worth of sequences, we must not only
identify related sequences, but develop the ability to
score how “good” the match is, and then rank these
“hits” in a list or a table.
A scoring matrix is: an empirical weighting scheme used
in all sequence comparisons
Example here?
DNA Scoring matrices
A
G
C
T
Are all transitions and transversions equally likely?
Different scoring matrices make different assumptions
about this, but it should be clear that the wobble effect,
numerical probability and molecular mech. matters here.
Amino Acid Scoring Matrices
• A bit more complicated, because:
there are 20 possible substitutions at any particular site
Some substitutions are more constrained by function than
others. In other words, we need to distinguish between
absolute conservation (dark blue) and functional conservation
(light blue).
Some amino acids are more rare among all proteins, or
within proteins, so changes in these amino acids must be
given higher weight.
Depending upon evolutionary distance, some amino acid
changes are more likely than others.
The log odds ratio
• A scoring matrix is a probability matrix, which is
an attempt to understand the probability of all
pairwise substitutions, given how often they are
actually observed to change in known
sequences.
• Probabilities are calculated as being less than, or
greater than observed by random chance, hence
the negative numbers.
PAM250 Matrix
PAM amino acid scoring matrix
• The Point Accepted Mutation considers only mutations
at the single site level.
Original matrices were done using sequences with
more than 85% similarity, which means they are
very closely related
The term acceptance refers to functional conservation of
protein function, even if sequence changes.
Amino acids can be group according to their “chemistries.” Because
some amino acids are very similar in their chemistries we should not
score substitutions between them the same as between two amino
acids that are very different in their chemistries
Remember this?
Assumptions of the PAM matrices
• Substitutions at a given site are (i)
independent of previous changes at that site
and are independent of changes to adjacent
sites.
• One PAM unit corresponds to 1 a.a. change/
100 a.a., or 1% divergence in sequence.
• PAM160 = 160 (total) changes/100 a.a. This can be
problematic because this represents an extrapolation
of probabilities calculated for closely related sequence,
and so and error will simply be multiplied
BLOSUM matrices
• Considered the fact that BLOcks of sequence
corresponding to secondary structure (i.e. functional
domains like catalytic sites, DNA binding regions etc.)
are likely to display different SUbstitution probabilities.
• And, BLOSUM matrices considered subsitution
probabilities across several evolutionary distances,
and so are more accurate than PAM for weaker
sequence relationships.
• E.g. BLOSUM62 matrix means that sequences with
no more than 62% sequence similarity were used to
calculate substitution probabilities.
GAPS and penalties
• Other types of mutations involve insertions and
deletions of sequence, collectively called indels
ACGATCGTCATCGATCGA
ACGATCTCGATCGA
These two sequences
only align well across
<half of their sequences.
ACGATCGTCATCGATCGA
ACGATC - - - - TCGATCGA
If we introduce four gaps, it
is obvious that these
sequences are more related
than we thought.
But there must be a penalty for doing this (i.e. lowering the overall score)
because you ,
Two kinds of Gap scoring methods
Affine gap penalty
G + Ln; where G is a penalty for introducing a
gap, and L is the penalty for lengthening the
range of the gapped region (G > Ln)
Non-affine gap penalty
Ln; No penalty for opening, and where L is a
fixed penalty for every gap
So, How does BLAST work?
• Words, Neighbourhoods, and High Scoring
Segment pairs, Oh My!
• The Word is the minimum length of sequence
that is used to start a search, usually three
amino acids (RDQYPQW).
• Neighbourhoods are similar words to the
query word, (e.g. RDQ vs RBQ vs RDE). These
are subject to the scoring matrix
• A High scoring segment pair is the region of
sequence for which the highest scoring Word can
be extended the most as matching sequence
Detection of high scoring segments.
Lexa M et al. Bioinformatics 2011;27:2510-2517
© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions,
please email: [email protected]
Let’s Blast!!