blastMeuEHugueCicotte - IME-USP

Download Report

Transcript blastMeuEHugueCicotte - IME-USP

Sequence Database
Searching
Local and Global Alignment,
Database Searching With Blast
Original Presentations:
Hugues Sicotte
National Center for Biotechnology Information
[email protected]
Adaptation:
Alan Durham
University of São Paulo
[email protected]
Sequence Database
Searching
Alignment definition and Type:
Alignment:
Each Base is used at most once.
Global Alignment:
All bases aligned with another base or with
a gap (symbol of “-” or sometimes “.”).
G-ATES
GRATED
Local Alignments:
Do not need to align all the bases in all sequences.
Align BILLGATESLIKESCHEESE and GRATEDCHEESE
G-ATESLIKESCHEESE
GRATED-----CHEESE
or
G-ATES
& CHEESE
GRATED
& CHEESE
Sequence Database
Searching
Insertions and
deletions (‘indels’)
are represented
by gaps in
alignments
COMPARATIVE ANALYSIS
GATTATACCA
GATTA---CA
gap of length 3
Sequence Database
Searching
Score and Statistics
Percent Identity.
Can be misleading.
Score:
A simple quality measure is the “score”. The
score assigns points for each aligned base
(or gap) of the alignment.
identical bases : “match” score
mismatching bases: “mismatch” score
gaps(optional): “gap opening” penalty for starting a gap
“gap extension” penalty for each gap symbol.
Example:
match = +1 , mismatch =-1,
gap opening = -5, gap extension=-1
G-ATESLIKESCHEESE
AND/OR
GRATED-----CHEESE
Score = 10*(+1)+1*(-1)+(-5-1)+(-5+5*(-1))
= -7
G-ATES
&
CHEESE
GRATED
&
CHEESE
Sequence Database
Searching
Which alignment is
“better”?
SCORING SYSTEMS
GCTACTAG-T-T--CGC-T-TAGC
GCTACTAGCTCTAGCGCGTATAGC
0 mismatches, 5 gaps
GCTACTAGTT------CGCTTAGC
GCTACTAGCTCTAGCGCGTATAGC
3 mismatches, 1 gap
Sequence Database
Searching
SCORING SYSTEMS
High penalty for
“opening” a gap
(e.g. G = 5)
Lower penalty for
“entending” a gap
(e.g. L = 1)
GCTACTAG-T-T--CGC-T-TAGC
GCTACTAGCTCTAGCGCGTATAGC
Penalty = 5G + 6L = 31
GCTACTAGTT------CGCTTAGC
GCTACTAGCTCTAGCGCGTATAGC
Penalty = 1G + 6L = 11
Sequence Database
Searching
Mix-and-match
protein modules
confound alignment
algorithms
LOCAL SIMILARITY
Protein modules in coagulation factor XII (F12) and
tissue plasminogen activator (PLAT)
F12
F2 E F1 E
PLAT
F1 E
F1,F2
E
K
Catalytic
K
K
K
Catalytic
Catalytic
Fibronectin repeats
EGF similarity domain
Kringle domain
Serine protease activitiy
Figure 7.3
Sequence Database
Searching
Mix-and-match
protein modules
confound alignment
algorithms
LOCAL SIMILARITY
Protein modules in coagulation factor XII (F12) and
tissue plasminogen activator (PLAT)
F12
F2 E F1 E
K
Catalytic
modules in
reverse order
PLAT
F1 E
F1,F2
E
K
Catalytic
K
K
Catalytic
Fibronectin repeats
EGF similarity domain
Kringle domain
Serine protease activitiy
Figure 7.3
Sequence Database
Searching
Mix-and-match
protein modules
confound alignment
algorithms
LOCAL SIMILARITY
Protein modules in coagulation factor XII (F12) and
tissue plasminogen activator (PLAT)
F12
F2 E F1 E
K
Catalytic
repeated
modules
PLAT
F1 E
F1,F2
E
K
Catalytic
K
K
Catalytic
Fibronectin repeats
EGF similarity domain
Kringle domain
Serine protease activitiy
Figure 7.3
Sequence Database
Searching
DOT PLOTS
Dot-plot Fitch : Biochem. Genet. (1969)3,99-108
Horizontal axis is
coordinates for
one sequence
Vertical axis is
coordinates for the
other
C
G T A C
C
G T
A
0
0
0
1
0
0
0
0
C
1
0
0
0
1
1
0
0
G
0
1
0
0
0
0
1
0
T
0
0
1
0
0
0
0
1
Figure 7.4
Sequence Database
Searching
DOT PLOTS
Dot-plot Fitch : Biochem. Genet. (1969)3,99-108
Horizontal axis is
coordinates for
one sequence
Vertical axis is
coordinates for the
other
Can also score not 1 position at a time, but in sliding window. For
example a window of 3 nucleotides where we score 1 for identical
triplets and 0 for all other combinations yields.
C
G T A C
C
A
0
0
0
0
0
0
C
1
0
0
0
0
1
G T
G
T
Figure 7.4b
Sequence Database
Searching
DOT PLOTS
Horizontal axis is
coordinates for
one sequence
Vertical axis is
coordinates for the
other
Tissue Plasminogen Activator (PLAT)
Coagulation Factor XII (F12)
Figure 7.4
Sequence Database
Searching
DOT PLOTS
K
K
Catalytic
Adjacent dots
merge to form
diagonal segments
Tissue Plasminogen Activator (PLAT)
Plot dots for high
similarity within a
short window
E F1
Coagulation Factor XII (F12)
F2
E F1 E
K
Catalytic
Figure 7.4
Sequence Database
Searching
DOT PLOTS
Catalytic
K
K
Tissue Plasminogen Activator (PLAT)
Repeated domains
show a
characteristic
pattern
E F1
Coagulation Factor XII (F12)
F2
E F1 E
K
Catalytic
Figure 7.4
Sequence Database
Searching
PATH GRAPHS
Dot plots suggest
paths through the
alignment space
EGF similarity domains of urokinse plasminogen activator
(PLAU) and tissue plasminogen activator (PLAT)
137
90
137
23
23
90
PLAU
PLAT
90
23
72
Each path is a
unique alignment
72
Path graphs are
more explicit
representations
EPKKVKDHCSKHSPCQKGGTCVNMP--SGPH-CLCPQHLTGNHCQKEK---CFE
ELHQVPSNCD----CLNGGTCVSNKYFSNIHWCNCPKKFGGQHCEIDKSKTCYE
137
72
Figure 7.5
Sequence Database
Searching
DYNAMIC PROGRAMMING
Dynamic Programming Example
Construct an
optimal of these
two sequences:
G A T A C T A
G A T T A C C A
Using these
scoring rules:
Match:
+1
Mismatch: -1
Gap:
-1
Sequence Database
Searching
Arrange the
sequence
residues along a
two-dimensional
lattice
Vertices of the
lattice fall
between letters
DYNAMIC PROGRAMMING
G A T A C T A
G
A
T
T
A
C
C
A
Sequence Database
Searching
The goal is to find
the optimal path
DYNAMIC PROGRAMMING
G A T A C T A
G
A
T
T
A
C
C
A
from here
to here
Sequence Database
Searching
Each path
corresponds to a
unique alignment
DYNAMIC PROGRAMMING
G A T A C T A
G
A
T
T
A
C
C
A
Which one is
optimal?
Sequence Database
Searching
The score for a
path is the sum of
its incremental
edges scores
DYNAMIC PROGRAMMING
G A T A C T A
G
A
T
T
A
C
C
A
A aligned with A
Match = +1
Sequence Database
Searching
The score for a
path is the sum of
its incremental
edges scores
DYNAMIC PROGRAMMING
G A T A C T A
G
A
T
T
A
C
C
A
A aligned with T
Mismatch = -1
Sequence Database
Searching
The score for a
path is the sum of
its incremental
edges scores
DYNAMIC PROGRAMMING
G A T A C T A
G
A
T
T
A
C
C
A
T aligned with NULL
Gap = -1
NULL aligned with T
Sequence Database
Searching
Incrementally
extend the path
DYNAMIC PROGRAMMING
G
A
T
T
A
C
C
A
0
-1
G A T A C T A
-1
+1
Sequence Database
Searching
Incrementally
extend the path
Remember the
best sub-path
leading to each
point on the
lattice
DYNAMIC PROGRAMMING
G
A
T
T
A
C
C
A
0
-1
G A T A C T A
-1
-2
+1
-2
Sequence Database
Searching
Incrementally
extend the path
Remember the
best sub-path
leading to each
point on the
lattice
DYNAMIC PROGRAMMING
G
A
T
T
A
C
C
A
0
-1
G A T A C T A
-1
-2
+1
-2
0
0
+2
Sequence Database
Searching
Incrementally
extend the path
Remember the
best sub-path
leading to each
point on the
lattice
DYNAMIC PROGRAMMING
G
A
T
T
A
C
C
A
0
G A T A C T A
-1
-2
-1
+1
-2
0
-2
0
+2
Sequence Database
Searching
Incrementally
extend the path
Remember the
best sub-path
leading to each
point on the
lattice
DYNAMIC PROGRAMMING
G
A
T
T
A
C
C
A
0
G A T A C T A
-1
-2
-3
-1
+1
-2
0
-1
-2
0
+2
+1
-3
-1
+1
+3
Sequence Database
Searching
Incrementally
extend the path
Remember the
best sub-path
leading to each
point on the
lattice
DYNAMIC PROGRAMMING
G
A
T
T
A
C
C
A
0
G A T A C T A
-1
-2
-3
-4
-5
-1
+1
0
-1
-2
-3
-2
0
+2
+1
0
-1
-3
-1
+1
+3
+2
+1
-4
-2
0
+2
+2
+1
-5
-3
-1
+1
+3
+2
Sequence Database
Searching
Incrementally
extend the path
Remember the
best sub-path
leading to each
point on the
lattice
DYNAMIC PROGRAMMING
G
A
T
T
A
C
C
A
0
G A T A C T A
-1
-2
-3
-4
-5
-6
-7
-1
+1
0
-1
-2
-3
-4
-5
-2
0
+2
+1
0
-1
-2
-3
-3
-1
+1
+3
+2
+1
0
-1
-4
-2
0
+2
+2
+1
+2
+1
-5
-3
-1
+1
+3
+2
+1
+3
-6
-4
-2
0
+2
+4
+3
+2
-7
-5
-3
-1
+1
+3
+3
+2
-8
-6
-4
-2
0
+2
+2
+4
Sequence Database
Searching
Trace-back to get
optimal path and
alignment
DYNAMIC PROGRAMMING
G
A
T
T
A
C
C
A
0
G A T A C T A
-1
-2
-3
-4
-5
-6
-7
-1
+1
0
-1
-2
-3
-4
-5
-2
0
+2
+1
0
-1
-2
-3
-3
-1
+1
+3
+2
+1
0
-1
-4
-2
0
+2
+2
+1
+2
+1
-5
-3
-1
+1
+3
+2
+1
+3
-6
-4
-2
0
+2
+4
+3
+2
-7
-5
-3
-1
+1
+3
+3
+2
-8
-6
-4
-2
0
+2
+2
+4
Sequence Database
Searching
Print out the
alignment
G A - T A CT A
G A T T A CC A
DYNAMIC PROGRAMMING
G A T A C T A
G
A
T
T
A
C
C
A
Sequence Database
Searching
Global Alignment
methods:
Local Alignment
methods:
Two different types of Alignment
Needleman & Wunch (J. Mol. Biol. (1970) 48,443-453 :
Problem of finding the best path. Revelation: Any partial subpath that ends at a point along the true optimal path must itself
be the optimal path leading to that point. This provides a
method to create a matrix of path “score”, the score of a path
leading to that point. Trace the optimal path from one end to
the other of the two sequences.
Smith & Waterman.(J. Mol. Biol. (1981), 147,195-197: Use
Needleman &Wunch, but report all non-overlapping paths,
starting at the highest scoring points in the path graph.
FASTP(Lipman &Pearson(1985),Science 227,1435-1441
BLAST (Altschul et al (1990),J. Mol. Bio. 215,408-410): don’t
report all overlapping paths, but only attempt to find paths if
there are words that are high-scoring. Speeds up considerably the
alignments.
Sequence Database
Searching
Implementations of
dynamic
programming for
global and local
similarities
GLOBAL & LOCAL SIMILARITY
Optimal global
alignment
Optimal local
alignment
Needleman & Wunsch (1970)
Smith & Waterman (1981)
Sequences align
essentially from
end to end
Sequences align
only in small,
isolated regions
Sequence Database
Searching
Local and Global Alignments: computing the
matrix
global alignment: as shown in previous slides
local allignment: change computation so that never put
negative values
– when value of cell will be negative, set to zero (means staring
another path)
– best local alignment comes from entry in matrix with maximum
value
semi-global alignment:
– good in assembling
– ignore gaps at the end and at the beginning of sequences
– to ignore gaps at the beginning of alignment: zeroes in first
column and first row
– to ignore gaps at the end of alignment: pick maximum value of
last row and last column
Sequence Database
Searching
•choose the best
score from
scores in last row
and collumn
•fill first row and
first column with
zeroes
•in this problem,
2 solutions
(we eliminated
the last symbol
from one of the
sequences)
D Y N A M I C P R O G R A M M I N G: semiglobal alignment
G
A
T
T
A
C
C
A
0
G A T A C T
0
0
0
0
0
0
0
+1
0
-1
-1
-1
-1
0
0
+2
+1
0
-1
-2
0
-1
+1
+3
+2
+1
0
0
-1
0
+2
+2
+1
+2
0
-1
-1
+1
+3
+2
+1
0
-1
-2
0
+2
+4
+3
0
-1
-2
-1
+1
+3
+3
0
-1
-2
-2
0
+2
+2
Sequence Database
Searching
Match/mismatch scores and Statistics
•Some amino acids mutations do not affect structure/function
very much. Amino acids with similar physico-chemical and
steric properties can often replace each other.
•Scoring system that doesn’t penalize very much mutations to
similar amino acid.
•PAM Matrices: Point Accepted Mutations. Defined in terms
of a divergence of 1 percent PAM. For distant sequences use
PAM250, while for closer sequences (like DNA) use PAM100.
Some sites accumulate mutations some others don’t, thus use
of the PAM100 matrice doesn’t mean that the sequences
compared were 100% mutated.
•BLOSUM: BLOCK substitution matrices. Started with the
BLOCKS database of multiple alignment only involving
distant sequences. BLOSUM62 means that the proteins
compated were never closer than 62% Identity. BLOSUM50
matrices involved alignment of more distant sequences.
Recommend use BLOSUM matrices (BLOSUM62) for most
protein alignments.
Sequence Database
Searching
Sequence Database
Searching
Alignment methods
Sequence Alignment
representation using
a dot plot.
Subject sequence
For a query of N
letters against a
subject sequence of
M letters, it requires
MxN comparisons.
Query sequence
Sequence Database
Searching
Hashing is a
common method for
accelerating
database searches
Compile “dictionary”
of words from the
query sequence. Put
each word in a lookup table that points
to the original
position in the
sequence. Thus
given one word, you
can know if it is in
the query in a single
operation.
HASHING METHODS
query
sequence
MLIIKRDELVISWASHERE
MLI
LII
IIK
IKR
all overlapping
KRD
words of size 3
RDE
DEL
ELV
LVI
VIS
ISW
SWA
WAS
ASH
SHE
HER
ERE
Sequence Database
Searching
Index lookup
Each word is assigned a unique integer.
E.g. for a word of 3 letters made up of an alphabet of 20
letters.
1. Assign a code to each letter Code(l) (0 to 19)
2. For a word of 3 letters L1 L2 L3 the code is
index = Code(L1)*202 + Code(L2)*201 + Code(L3)
3. Have an array with a list of the positions that have
that word.
0 1 2 3
1
Position in query sequence of word
Sequence Database
Searching
Building the
dictionary for the
query sequence
requires (N-2)
operations.
The database
contains (M-2)
words, and it takes
only one operation to
see if the word was
in the query.
HASHING METHODS
query
sequence
MLIIKRDELVISWASHERE
MLI
LII
IIK
IKR
all overlapping
KRD
words of size 3
RDE
DEL
ELV
LVI
VIS
ISW
SWA
WAS
ASH
SHE
HER
ERE
Sequence Database
Searching
HASHING METHODS
Query sequence
Use word hits to
determine were to
search for alignments
fills the dynamic
programming matrix
in (N-2)+(M-2)
operations instead
of MxN.
Subject sequence
Scan the subject,
looking up words in
the dictionary
Sequence Database
Searching
Blast: extending good hits
blast pre-processes the target sequence set
lists of hits for each possible word (
–
3-tuple for proteins - 203 = 8000 different words
–
for each word, find with ones have “good match”
•
13 in old version 11 in new version
for the “good ones” get list of sequences in database that have it
Blast (old) : extend match both ways while score is increasing
Blast2 (new):
–
when two words found in same “diagonal” withing “short” distance,
extend an un-gapped alignment.
–
continue extension like old blast
get local alignments with score greater than cutoff score
perform SW on best candidates
Sequence Database
Searching
HASHING METHODS
Query sequence
Use word hits to
determine were to
search for alignments
Database sequence
Scan the database,
looking up words in
the dictionary
BLAST extends from word hits
Sequence Database
Searching
HASHING METHODS
Query sequence
Use word hits to
determine were to
search for alignments
Database sequence
Scan the database,
looking up words in
the dictionary
BLAST2 extends pairs in same diagonal
first
Simplest Database
searching could is a
large dynamic
programming
example.
With all the database
sequences
concatenated one
after another.
Database Search Space
Query sequence
Concanated Database sequence
Sequence Database
Searching
Sequence Database
Searching
Database Search Space
Which alignment is
more significant?
Concanated Database sequence
Query sequence
Database Search Space
Score can be used to judge
alignments. But a score
absolute value is a function
of the score parameters.
Match=+1,Mismatch=-1,
Gap_open=5,
gap_extend=1
Yields same alignments as
Match=+10,Mismatch=-10,
Gap_open=50,
gap_extend=10
Scores useful for relative
ranking.
Query sequence
Concanated Database sequence
Sequence Database
Searching
Sequence Database
Searching
Database Search Space
To Judge relevancy of an
alignment, need to judge if
match is significant.
Number of Aligments with
scores >= S expected if the
query was a random given
the database size and
composition.
Expect of 0.0 means a very
good match unlikely to be
random.
Concanated Database sequence
E-value = Expect(S) is a
function of the score,
database size and
composition, and query size.
Query sequence
Sequence Database
Searching
Alligning sequences in databases: evaluating
significance
we can allign a sequence with any
other one
we want good allignents that are
statistically significant
when searching databases, statistical
relevance needs to be computed too
E value: number of hits a random
sequence of the same size would get in a
database of the same size
Sequence Database
Searching
E-value
“Hits” can be sorted according to their E-value or their
score.
The E-value is better known as the EXPECT value
and is a function of score, database size and query
sequence length.
E-value: Number of alignments with a score >=S that
you expect to find if the database was a collection of
random letters.
e.g. For a score of 1, one only requires 1 match, and there should
be an enormous amount of alignments. One expects to find less
alignments with a score of 5, and so on.. Eventually when the
score is big enough, one expects to find an insignificant number of
of alignments that could be due to chance.
E-value of less than 1e-6 (1* 10-6 in scientific notation) are usually
very good and for proteins, E<1e-2 is usually considered
significant. It is still possible for a Hit with E>1 to be biologically
meaningful, but more analysis is required to comfirm that.
Even for VERY good hits, it is possible that the hit is due to a
biological artifact (sequencing/cloning vector, repeats, lowcomplexity sequence…)
Sequence Database
Searching
The “hit list” gives
titles and scores for
matched sequences
DATABASE SEARCHING
> fasta
myquery swissprot -ktup 2
The best scores are:
initn
gi|1706794|sp|P49789|FHIT_HUMAN BIS(5'-ADENOSYL)- 996
gi|1703339|sp|P49776|APH1_SCHPO BIS(5'-NUCLEOSYL) 412
gi|1723425|sp|P49775|HNT2_YEAST HIT FAMILY PROTEI 238
gi|3915958|sp|Q58276|Y866_METJA HYPOTHETICAL HIT- 153
gi|3916020|sp|Q11066|YHIT_MYCTU HYPOTHETICAL 15.7 163
gi|3023940|sp|O07513|HIT_BACSU HIT PROTEIN
164
gi|2506515|sp|Q04344|HNT1_YEAST HIT FAMILY PROTEI 130
gi|2495235|sp|P75504|YHIT_MYCPN HYPOTHETICAL 16.1 125
gi|418447|sp|P32084|YHIT_SYNP7 HYPOTHETICAL 12.4
42
gi|3025190|sp|P94252|YHIT_BORBU HYPOTHETICAL 15.9 128
gi|1351828|sp|P47378|YHIT_MYCGE HYPOTHETICAL HIT76
gi|418446|sp|P32083|YHIT_MYCHR HYPOTHETICAL 13.1
27
gi|1708543|sp|P49773|IPK1_HUMAN HINT PROTEIN (PRO
66
gi|2495231|sp|P70349|IPK1_MOUSE HINT PROTEIN (PRO
65
gi|1724020|sp|P49774|YHIT_MYCLE HYPOTHETICAL HIT52
gi|1170581|sp|P16436|IPK1_BOVIN HINT PROTEIN (PRO
66
gi|2495232|sp|P80912|IPK1_RABIT HINT PROTEIN (PRO
66
gi|1177047|sp|P42856|ZB14_MAIZE 14 KD ZINC-BINDIN
73
gi|1177046|sp|P42855|ZB14_BRAJU 14 KD ZINC-BINDIN
76
gi|1169825|sp|P31764|GAL7_HAEIN GALACTOSE-1-PHOSP
58
gi|113999|sp|P16550|APA1_YEAST 5',5'''-P-1,P-4-TE
47
gi|1351948|sp|P49348|APA2_KLULA 5',5'''-P-1,P-4-T
63
gi|123331|sp|P23228|HMCS_CHICK HYDROXYMETHYLGLUTA
58
gi|1170899|sp|P06994|MDH_ECOLI MALATE DEHYDROGENA
70
gi|3915666|sp|Q10798|DXR_MYCTU 1-DEOXY-D-XYLULOSE
75
gi|124341|sp|P05113|IL5_HUMAN INTERLEUKIN-5 PRECU
36
gi|1170538|sp|P46685|IL5_CERTO INTERLEUKIN-5 PREC
36
gi|121369|sp|P15124|GLNA_METCA GLUTAMINE SYNTHETA
45
gi|2506868|sp|P33937|NAPA_ECOLI PERIPLASMIC NITRA
48
gi|119377|sp|P10403|ENV1_DROME RETROVIRUS-RELATED
59
gi|1351041|sp|P48415|SC16_YEAST MULTIDOMAIN VESIC
48
gi|4033418|sp|O67501|IPYR_AQUAE INORGANIC PYROPHO
38
init1 opt z-sc E(77110)
996 996 1262.1
0
382 395 507.6 1.4e-21
133 316 407.4 5.4e-16
98 190 253.1 2.1e-07
163 184 244.8 6.1e-07
164 170 227.2 5.8e-06
91 157 210.3 5.1e-05
125 148 199.7 0.0002
42 140 191.3 0.00058
73 139 188.7 0.00082
76 133 181.0 0.0022
27 119 165.2 0.017
66 118 163.0 0.022
65 116 160.5
0.03
52 117 160.3 0.031
66 115 159.3 0.035
66 112 155.5 0.057
73 112 155.4 0.058
76 110 153.8 0.072
58 104 138.5
0.51
47 103 137.8
0.56
63
98 131.3
1.3
58
99 129.4
1.6
48
91 122.9
3.7
50
92 121.9
4.3
36
85 121.3
4.7
36
84 120.0
5.5
45
90 118.9
6.3
48
92 117.4
7.6
59
89 117.0
8
48
97 117.0
8
38
83 116.8
8.3
Sequence Database
Searching
Detailed alignments
are shown farther
down in the output
DATABASE SEARCHING
> fasta
myquery swissprot -ktup 2
>>gi|1703339|sp|P49776|APH1_SCHPO BIS(5'-NUCLEOSYL)-TETR (182 aa)
initn: 412 init1: 382 opt: 395 z-score: 507.6 E(): 1.4e-21
Smith-Waterman score: 395;
52.3% identity in 109 aa overlap
10
20
30
40
50
MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLVCPLRPVERFHDLRPDEVADLF
: X: .:.:: :.:: ::..:::::: : : : :..:: :.:..:::
gi|170 MPKQLYFSKFPVGSQVFYRTKLSAAFVNLKPILPGHVLVIPQRAVPRLKDLTPSELTDLF
10
20
30
40
50
60
gi|170
60
70
80
90
100
110
gi|170 QTTQRVGTVVEKHFHGTSLTFSMQDGPEAGQTVKHVHVHVLPRKAGDFHRNDSIYEELQK
....: :.:: : ... ....::: .::::: :::::..::: .:: .:: .: :X.:
gi|170 TSVRKVQQVIEKVFSASASNIGIQDGVDAGQTVPHVHVHIIPRKKADFSENDLVYSELEK
70
80
90
100
110
120
120
130
140
gi|170 HDKEDFPASWRSEEEMAAEAAALRVYFQ
..
gi|170 NEGNLASLYLTGNERYAGDERPPTSMRQAIPKDEDRKPRTLEEMEKEAQWLKGYFSEEQE
130
140
150
160
170
180
>>gi|1723425|sp|P49775|HNT2_YEAST HIT FAMILY PROTEIN 2 (217 aa)
initn: 238 init1: 133 opt: 316 z-score: 407.4 E(): 5.4e-16
Smith-Waterman score: 316;
37.4% identity in 131 aa overlap
gi|170
10
20
30
40
MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLVCPLRP-VER
:.. :. .v^: :.. ..:::: ::.::::::. ::X :
Sequence Database
Searching
Database Search Space
Some matches are non-meaningful
because they occur VERY often in
database.
Biological repeated elements(retroposons
ALU)
Low-complexity repeated patterns.
(CAGCAG, QQQ,KKK,…)
These elements should be
FILTERED or MASKED
to avoid generating false ‘hits’.. It
is ‘OK’ to align through them if they
are near meaningful diagonal ‘hits’
Concanated Database sequence
e.g. nucleotide AAA (from polyA)
Query sequence
Sequence Database
Searching
HASHING METHODS
Query sequence
Use word hits to
determine were to
search for alignments
Subject sequence
Scan the database,
looking up words in
the dictionary
FASTA searches in a band