Biology and computers


BLAST and Multiple Sequence Alignment
Announcements


Quiz #3 on Thurs., May 17 on lectures presented April
26, May 3 and May 15
Writing assignments due May 24 at the beginning of
class.
Learning objectives-Learn the basics of BLAST
and Psi-BLAST and CLUSTAL W
Workshop-Use of Psi-BLAST to determine
sequence similarities.
Homework-Due May 20
BLAST
Basic Local Alignment Search Tool
Speed is achieved by:
Pre-indexing the database before the search
 Parallel processing

Uses a hash table that contains
neighborhood words.
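The pre-indexing idea can be sketched as follows. This is a minimal illustration, assuming a toy in-memory database; `build_word_index` and `find_seeds` are hypothetical names, and real BLAST databases use a far more compact index.

```python
from collections import defaultdict

def build_word_index(database, w):
    """Map every word of length w to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for i in range(len(seq) - w + 1):
            index[seq[i:i + w]].append((seq_id, i))
    return index

def find_seeds(query, index, w):
    """Look up each query word in the index; return (query offset, seq id, db offset)."""
    seeds = []
    for q in range(len(query) - w + 1):
        for seq_id, d in index.get(query[q:q + w], []):
            seeds.append((q, seq_id, d))
    return seeds

# Toy database (made-up sequences, for illustration only)
database = {"seq1": "ACGTACGA", "seq2": "TTACGG"}
index = build_word_index(database, w=3)
seeds = find_seeds("GACGT", index, w=3)
```

Because the index is built once, each query word costs a single hash lookup instead of a scan of the whole database.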
Neighborhood words
The program declares a hit if the word taken from
the query sequence has a score >= T when a
scoring matrix is used.
This allows the word size (W) to be kept high (for
speed) without sacrificing sensitivity.
If the user increases T, the number of background
hits is reduced and the program will run faster.
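The effect of the threshold T can be sketched with a brute-force enumeration. This is only an illustration: `toy_score` (+2 match, -1 mismatch) is a stand-in for a real scoring matrix such as BLOSUM62, and `neighborhood_words` is a hypothetical helper name.

```python
from itertools import product

def toy_score(a, b):
    """Toy scoring: +2 for a match, -1 for a mismatch (stand-in for a real matrix)."""
    return 2 if a == b else -1

def neighborhood_words(word, alphabet, threshold, score=toy_score):
    """Return every word of the same length whose score against `word` is >= threshold."""
    hits = []
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            hits.append("".join(candidate))
    return hits

# A lower T admits words with one mismatch; a higher T keeps only the exact word.
loose = neighborhood_words("ACG", "ACGT", threshold=3)
strict = neighborhood_words("ACG", "ACGT", threshold=6)
```

With these toy scores, raising T from 3 to 6 shrinks the neighborhood from ten words to one, which is why a higher T gives fewer background hits and a faster search.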
Which Program should one use?
Most researchers use methods for
determining local similarities:
Smith-Waterman (gold standard)
FASTA
BLAST
FASTA and BLAST do not find every possible alignment
of query with database sequence. These
are used because they run faster than S-W
What are the different BLAST
programs?
blastp
 compares an amino acid query sequence against a protein sequence
database
blastn
 compares a nucleotide query sequence against a nucleotide
sequence database
blastx
 compares a nucleotide query sequence translated in all reading
frames against a protein sequence database
tblastn
 compares a protein query sequence against a nucleotide sequence
database dynamically translated in all reading frames
tblastx
 compares the six-frame translations of a nucleotide query sequence
against the six-frame translations of a nucleotide sequence
database. Please note that tblastx program cannot be used with the
nr database on the BLAST Web page.
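The "translated in all reading frames" step used by blastx, tblastn and tblastx can be sketched as below. This assumes the standard genetic code and unambiguous upper- or lower-case A/C/G/T input; the function names are illustrative and are not part of BLAST itself.

```python
# Standard genetic code, codons ordered with bases T, C, A, G at each position
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons only; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frame_translations(dna):
    """Three frames on the given strand, then three on the reverse complement."""
    dna = dna.upper()
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames

frames = six_frame_translations("ATGGCCTAA")
```

Each of the six protein strings is then compared against the protein database (blastx) or against the six-frame translations of the database (tblastx).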
When to use a particular program
Problem: Identify unknown protein

Program: BLASTP
Explanation: General protein comparison. Use ktup=2 for speed; ktup=1 for sensitive search (ktup is the FASTA word-size parameter).

Program: Smith-Waterman
Explanation: Slower than FASTA3 and BLAST but provides maximum sensitivity.

Program: TBLASTN
Explanation: Use if homolog cannot be found in protein databases; approx. 33% slower.

Program: Psi-BLAST
Explanation: Finds distantly related sequences. It replaces the query sequence with a position-specific score matrix after an initial BLASTP search. Then it uses the matrix to find distantly related sequences.
When to use a particular program (cont. 1)
Problem: Identify new orthologs
Program: TBLASTN; TBLASTX
Explanation: Use PAM matrix <=20 or BLOSUM90 to avoid detecting distant relationships. Search EST sequences within the same species.

Problem: Identify EST sequence
Program: BLASTX; TBLASTX
Explanation: Always attempt to translate your sequence into protein prior to searching.

Problem: Identify DNA sequence
Program: BLASTN
Explanation: Nucleotide sequence comparison.
Filtering Repetitive Sequences
Over 50% of genomic DNA is repetitive
This is due to:
retrotransposons
ALU region
microsatellites
centromeric sequences, telomeric sequences
5’ Untranslated Region of ESTs
Example of ESTs with simple low complexity regions:
T27311
GGGTGCAGGAATTCGGCACGAGTCTCTCTCTCTCTCTCTCTCTCTCTC
TCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTC
Filtering Repetitive Sequences (cont. 1)
Programs like BLAST have the option of filtering
out low-complexity regions (called masking).
Repetitive sequences increase the chance of a
spurious match during a database search.
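Masking can be sketched with a sliding-window entropy filter. This is a simplified heuristic under stated assumptions, not the DUST or SEG filters that BLAST actually uses; the window size and entropy cutoff are arbitrary illustrative values. The test sequence follows the pattern of EST T27311 (a mixed prefix followed by a TC repeat).

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mask_low_complexity(seq, window=12, max_entropy=1.2, mask_char="N"):
    """Replace every residue covered by a low-entropy window with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) <= max_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)

# Mixed prefix followed by a (TC)n microsatellite
est = "GGGTGCAGGAATTCGGCACGAG" + "TC" * 13
masked = mask_low_complexity(est)
```

A composition-only measure like this catches simple repeats such as (TC)n but misses longer-period repeats, which is one reason dedicated filters exist.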
PSI-BLAST
PSI = position-specific iterative
A position-specific scoring matrix (PSSM) is
constructed automatically from multiple HSPs (high-scoring segment pairs) of the
initial BLAST search. A normal E value is used.
The PSSM is then used as the new scoring matrix for
a second BLAST search. A low E value is used
(E = 0.001).
Result:
1) obtains distantly related sequences
2) finds the important residues that provide
function or structure.
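The PSSM idea can be sketched as column-wise log-odds scores built from aligned hits. This is a minimal illustration, assuming a uniform background frequency and a simple pseudocount; PSI-BLAST itself uses sequence weighting and matrix-derived pseudocounts, and the hit sequences below are made up.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned_hits, pseudocount=1.0):
    """One dict per alignment column: residue -> log2(observed / background)."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*aligned_hits):
        counts = Counter(res for res in column if res in AMINO_ACIDS)  # skips gaps
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({aa: math.log2(((counts[aa] + pseudocount) / total) / background)
                     for aa in AMINO_ACIDS})
    return pssm

def score_segment(pssm, segment):
    """Score an ungapped segment of the same length as the PSSM."""
    return sum(column[res] for column, res in zip(pssm, segment))

hits = ["MKVL", "MKIL", "MRVL"]  # hypothetical aligned HSPs
pssm = build_pssm(hits)
```

Conserved columns (M in column 1, L in column 4) receive high positive scores, so the matrix both highlights important residues and rewards distant homologs that preserve them.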
Multiple alignment
Learning objectives-Understand usefulness of
multiple alignment. Become familiar with
ClustalW algorithm. Understand the difference
between ClustalW and PSI-BLAST.
Steps to multiple alignment
Create Alignment
Edit the alignment to ensure that regions of functional
or structural similarity are preserved
Then use the edited alignment for:
Phylogenetic Analysis
Structural Analysis
Find conserved motifs to deduce function
Design of PCR primers
Multiple Sequence Alignment
Collection of three or more protein (or
nucleic acid) sequences partially or
completely aligned.
Aligned residues tend to occupy
corresponding positions in the 3-D structure
of each aligned protein.
Practical use of MSA
Helps to place protein into a group of
related proteins. It will provide insight into
function, structure and evolution.
Helps to detect homologs
Identifies sequencing errors
Identifies important regulatory regions in
the promoters of genes.
Clustal W (Thompson et al., 1994)
CLUSTAL=Cluster alignment
The underlying concept is that groups of
sequences are phylogenetically related. If
they can be aligned then one can construct a
tree.
Step 1-pairwise alignments
Step 2-create a guide tree
Step 3-progressive alignment

Flowchart of computation steps in
Clustal W (Thompson et al., 1994)
Pairwise Alignment: Calculation of distance matrix
Creation of unrooted Neighbor-Joining Tree
Rooted NJ Tree (guide tree) and calculation of sequence weights
Progressive alignment following the Guide Tree
Step 1-Pairwise alignments
Compare each sequence with each
other and calculate a distance matrix.

Different sequences
      A     B     C
A     -
B    .87    -
C    .59   .60    -

Each number represents the number
of exact matches divided by the
sequence length (ignoring gaps).
Thus, the higher the number the more
closely related the two sequences are.
In this distance matrix, sequence A is 87% identical to sequence B.
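The matrix entries described above (exact matches divided by compared length, ignoring gaps) can be computed as in this sketch. The helper names and the three toy sequences are illustrative assumptions, not the data behind the slide's numbers.

```python
from itertools import combinations

def fraction_identical(a, b, gap="-"):
    """Exact matches divided by the number of compared positions, ignoring gaps."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # gapped positions are not counted
        compared += 1
        matches += (x == y)
    return matches / compared if compared else 0.0

def identity_matrix(aligned):
    """Pairwise identity for every pair of named, already aligned sequences."""
    return {(n1, n2): fraction_identical(aligned[n1], aligned[n2])
            for n1, n2 in combinations(aligned, 2)}

toy = {"A": "ACGT", "B": "ACGA", "C": "TCGA"}  # made-up aligned sequences
matrix = identity_matrix(toy)
```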
Step 1-Pairwise alignments
Compare each sequence with each
other and calculate pairwise alignment scores.

human  EYSGSSEKIDLLASDPHEALICKSERVHSKSVESNIEDKIFGKTYRKKASLPNLSHVTEN 480
dog    EYSGSSEKIDLMASDPQDAFICESERVHTKPVGGNIEDKIFGKTYRRKASLPKVSHTTEV 477
mouse  GGFSSSRKTDLVTPDPHHTLMCKSGRDFSKPVEDNISDKIFGKSYQRKGSRPHLNHVTE 476

SeqA  Name   Len(aa)  SeqB  Name   Len(aa)  Score
1     human  60       2     dog    60       76
1     human  60       3     mouse  59       57
2     dog    60       3     mouse  59       49
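Each pairwise score comes from aligning two sequences. A minimal sketch of a global (Needleman-Wunsch style) alignment score is shown below, assuming toy match/mismatch scores and a linear gap penalty; Clustal W itself uses a substitution matrix with separate gap opening and extension penalties, or a faster approximate method.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score by dynamic programming, keeping one row at a time."""
    prev = [j * gap for j in range(len(b) + 1)]  # aligning "" against prefixes of b
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a prefix of a against ""
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag,                # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]

same = global_alignment_score("ACGT", "ACGT")
one_gap = global_alignment_score("ACGT", "AGT")
```

Running this for every pair of input sequences fills the score table that feeds the guide tree in Step 2.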
Step 2-Create Guide Tree
Use the Distance Matrix to create a Guide Tree to
determine the “order” of the sequences.

Different sequences
      H     D     M
H     -
D     76    -
M     57    49    -

Distance from random sequence:
$$S_{eff} = \frac{S_{real(ij)} - S_{rand(ij)}}{S_{ident(ij)} - S_{rand(ij)}} \times 100$$
$$D = -\ln(S_{eff})$$

Guide Tree
human:0.07429
dog:0.15904
mouse:0.34944
Branch length proportional
to estimated divergence
between dog and other sequences
( human:0.07429, dog:0.15904, mouse:0.34944);
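The score-to-distance conversion can be sketched as follows. Two assumptions are made here and are not from the slide: the pairwise scores in the table are used directly as S_eff in percent, and S_eff is rescaled to a fraction before taking the logarithm so that identical sequences get distance 0. The neighbor-joining step that produces the branch lengths is not reproduced.

```python
import math

def effective_score(s_real, s_rand, s_ident):
    """S_eff in percent: how far the real score lies between random and identical."""
    return (s_real - s_rand) / (s_ident - s_rand) * 100

def distance(s_eff_percent):
    """D = -ln(S_eff), with S_eff rescaled to a fraction (assumption of this sketch)."""
    return -math.log(s_eff_percent / 100)

def closest_pair(scores):
    """Pair with the smallest distance, i.e. the pair to align first."""
    return min(scores, key=lambda pair: distance(scores[pair]))

# Pairwise scores from the table in Step 1
scores = {("human", "dog"): 76, ("human", "mouse"): 57, ("dog", "mouse"): 49}
first = closest_pair(scores)
```

Under these assumptions human and dog are the closest pair, which agrees with the guide tree's order: align human and dog first, then add mouse.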
Step 3-Progressive Alignment
Guide Tree
human:0.07429
dog:0.15904
mouse:0.34944
Align human and dog first. Then add mouse to the
previous alignment. Gaps in the closely aligned sequences
are given a heavier weight (positive value) than gaps in more divergent sequences (“once a gap always a gap”).
Why a heavier weight for the closely aligned sequences?
Because those gaps suggest separations between functional or
structural entities. In more divergent sequences
gaps may be produced as an artifact of sequences
that are dissimilar.
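The “once a gap always a gap” rule can be sketched as a column operation: when a later step needs a gap in an already aligned group, the gap is added to every sequence in that group and earlier gaps are never removed. `insert_gap_column` is a hypothetical helper and the sequences are made up.

```python
def insert_gap_column(alignment, position, gap="-"):
    """Insert a gap at `position` in every sequence of an existing alignment."""
    return [seq[:position] + gap + seq[position:] for seq in alignment]

# Group aligned in an earlier step (it already contains one gap)
group = ["ACGT", "A-GT"]
# Adding a third sequence with an extra residue after position 2 forces a new column
widened = insert_gap_column(group, 2)
```

The gap introduced when the first two sequences were aligned is still present after widening, so early decisions constrain every later step.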
Gap treatment
Short stretches of 5 hydrophilic residues often indicate loop or random
coil regions (not essential for structure) and therefore gap penalties are
reduced for such stretches.
Gap penalties are lowered at positions where gaps already exist in the
alignment of the closely related sequences (“once a gap always a gap” rule). It
is thought that those gaps occur in regions that do not disrupt the
structure or function.
Alignments of proteins of known structure show that gaps do
not occur more frequently than every eight residues. Therefore the
penalty increases for a gap placed within 8 residues of an existing
gap. This gives a lower alignment score in that region.
A gap weight is assigned after each aa according to the frequency with which
a gap naturally occurs after that aa.
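Two of these rules can be sketched as a position-specific gap opening penalty. The base penalty and both scaling factors are hypothetical values chosen for illustration, not Clustal W's actual formulas; the hydrophilic residue set is the G, P, S, N, Q, E, K, R list given in these slides, and the sequence is made up.

```python
HYDROPHILIC = set("GPSNQEKR")  # residue list as given in these slides

def gap_open_penalty(seq, pos, existing_gaps, base=10.0,
                     hydrophilic_factor=0.5, crowding_factor=2.0,
                     run=5, min_spacing=8):
    """Illustrative gap opening penalty at `pos` (all numeric values are assumptions)."""
    penalty = base
    stretch = seq[pos:pos + run]
    # Rule 1: lower the penalty inside a run of hydrophilic residues (likely loop)
    if len(stretch) == run and all(res in HYDROPHILIC for res in stretch):
        penalty *= hydrophilic_factor
    # Rule 2: raise the penalty within min_spacing residues of an existing gap
    if any(0 < abs(pos - g) <= min_spacing for g in existing_gaps):
        penalty *= crowding_factor
    return penalty

seq = "MKVLAAGSPNKEWWFLLIVAAMCC"  # made-up sequence with a hydrophilic run at 6..11
in_loop = gap_open_penalty(seq, 6, existing_gaps=[])
near_gap = gap_open_penalty(seq, 0, existing_gaps=[4])
```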
Amino acid weight matrices
As we know, there are many scoring
matrices that one can use depending on the
relatedness of the aligned proteins.
As the alignment proceeds to longer
branches the aa scoring matrices are
changed to accommodate more divergent
sequences. The length of the branch is used
to determine which matrix to use and
contributes to the alignment score.
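Choosing a matrix by branch length can be sketched as a simple threshold lookup. The cutoff values below are hypothetical and chosen only for illustration; the matrix names are real BLOSUM matrices, ordered from closely related (BLOSUM80) to divergent (BLOSUM30).

```python
def choose_matrix(branch_length):
    """Pick a scoring matrix from a guide-tree branch length (cutoffs are assumptions)."""
    if branch_length < 0.2:
        return "BLOSUM80"  # short branch: closely related sequences
    if branch_length < 0.4:
        return "BLOSUM62"
    if branch_length < 0.7:
        return "BLOSUM45"
    return "BLOSUM30"      # long branch: highly divergent sequences

# Branch lengths from the guide tree in Step 2
choices = {name: choose_matrix(length)
           for name, length in [("human", 0.07429), ("dog", 0.15904), ("mouse", 0.34944)]}
```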
Flowchart of computation steps in
Clustal W (Thompson et al., 1994)
Pairwise Alignment: Calculation of distance matrix
Creation of unrooted Neighbor-Joining Tree
Rooted NJ Tree (guide tree) and calculation of sequence weights
Progressive alignment following the Guide Tree
From Baxevanis and Ouellette, 2001
Example of Sequence Alignment
using Clustal W
Asterisk represents identity
: represents high similarity
. represents low similarity
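The consensus line under an alignment can be sketched as below. The residue groups are written from memory of the Clustal documentation and should be treated as an assumption; check them against the program's own documentation before relying on them.

```python
# Residue groups (assumed; verify against the Clustal documentation)
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]

def column_symbol(column):
    """'*' identity, ':' high similarity, '.' low similarity, ' ' otherwise."""
    residues = set(column)
    if "-" in residues:
        return " "
    if len(residues) == 1:
        return "*"
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."
    return " "

def consensus_line(alignment):
    """Build the symbol line for a list of equal-length aligned sequences."""
    return "".join(column_symbol(col) for col in zip(*alignment))

line = consensus_line(["ACDS", "ACEA", "ACNG"])
```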
Multiple Alignment
Considerations
Quality of guide tree. It would be good to have a set of
closely related sequences in the alignment to set the
pattern for more divergent sequences.
If the initial alignments have a problem, the problem is
magnified in subsequent steps.
CLUSTAL W is best when aligning sequences that are
related to each other over their entire lengths
Do not use when there are variable N- and C- terminal
regions
If protein is enriched for G,P,S,N,Q,E,K,R then these
residues should be removed from gap penalty list.
(what types of residues are these?)
Reference: http://www-igbmc.u-strasbg.fr/BioInfo/ClustalW/