Database Searching

Transcript Database Searching

Database Searches
FASTA
Database searches: Why?
• To discover or verify identity of a newly
sequenced gene
• To find other members of a multigene
family
• To classify groups of genes
Database searching
• In practice, we cannot use Smith-Waterman to
search for sequences in a database:
– Databases are huge (GenBank ~30 million sequences, SwissProt >> 100,000 sequences)
– S-W is slow: Time is proportional to N × n², where n = sequence
length and N = number of sequences in the database
• Instead, use faster heuristic approaches
– FASTA
– BLAST
• Tradeoff: Sensitivity vs. false positives
• Smith-Waterman is slower, but more sensitive
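For reference, a minimal sketch of the Smith-Waterman score that the heuristics approximate. The match, mismatch, and gap values are illustrative assumptions (with a simple linear gap penalty), not values from the slides. The two nested loops are where the n² cost per database sequence comes from.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty, assumed values)."""
    prev = [0] * (len(b) + 1)          # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)      # current row; column 0 stays 0
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0,                    # start a new local alignment
                          prev[j - 1] + s,      # match or mismatch
                          prev[j] + gap,        # gap in b
                          curr[j - 1] + gap)    # gap in a
            best = max(best, curr[j])
        prev = curr
    return best
```

Running this against every one of N database sequences repeats the full table fill N times, which is what FASTA and BLAST try to avoid.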
Dot Plots
[Figure: dot plot of GATCAACTGACGTA (horizontal axis) against GTTCAGCTGCGTAC (vertical axis)]
Dot Plots
[Figure: dot plot of GATCAACTGACGTA (horizontal axis) against GTTCAGCTGCGTAC (vertical axis), filtered with a 4-base window and 75% identity]
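A filtered dot plot can be sketched as follows. This is an illustrative version, not the code behind the slide: a dot is placed at (i, j) when the window starting at x[i] and y[j] reaches the identity threshold. The function name and the choice to anchor each dot at the window start are assumptions.

```python
def dot_plot(x, y, window=4, min_identity=0.75):
    """Return the set of (i, j) window starts whose identity is >= min_identity."""
    dots = set()
    for i in range(len(x) - window + 1):
        for j in range(len(y) - window + 1):
            # Count identical positions inside the two windows
            matches = sum(x[i + k] == y[j + k] for k in range(window))
            if matches / window >= min_identity:
                dots.add((i, j))
    return dots

def render(x, y, dots):
    """Draw the plot as text, x along the top and y down the side."""
    rows = ["  " + x]
    for j, ch in enumerate(y):
        rows.append(ch + " " + "".join("*" if (i, j) in dots else "."
                                       for i in range(len(x))))
    return "\n".join(rows)
```

Calling `print(render(x, y, dot_plot(x, y)))` on the two sequences from the slide shows the diagonal stretches that survive the filter. With `window=1` and `min_identity=1.0` the function reduces to the unfiltered plot.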
FASTA
• Originally developed ~1985 by Lipman
and Pearson
• Goal: Perform fast, approximate local
alignments to find sequences in the
database that are related to the query
sequence
• Based on dot plot idea
FASTA: Step 1
• Look for exact matches between words
in query and test sequence
– Words are short
• DNA words are usually 6 bases
• Protein words are 1 or 2 amino acids
– Ktup denotes word length
– Use hash tables to locate words quickly
FASTA: Details
• Hashing: Map strings of characters to
integers, e.g.,
– AAA → 0
– AAC → 1
– ...
– TTT → 63 (oversimplified)
• Preprocess the database and create a table
that stores locations of each possible k-tuple:
– 20^k for amino acids (400 if k = 2)
– 4^k for DNA (4096 if k = 6)
• Use hash code computed from query sequence
k-tuples for quick lookup
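The lookup table can be sketched like this for DNA. It is a minimal illustration that assumes sequences contain only A, C, G, and T, and it reads each k-tuple as a base-4 number, which gives the AAA → 0, TTT → 63 mapping above. The function names are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumes no ambiguity codes such as N

def encode(word):
    """Read a DNA word as a base-4 integer."""
    h = 0
    for ch in word:
        h = h * 4 + CODE[ch]
    return h

def build_index(seq, k):
    """Table with 4**k slots; slot h lists every start position of k-tuple h."""
    table = [[] for _ in range(4 ** k)]
    for pos in range(len(seq) - k + 1):
        table[encode(seq[pos:pos + k])].append(pos)
    return table

def hot_spots(query, index, k):
    """All (query_pos, db_pos) pairs where a query k-tuple occurs exactly."""
    hits = []
    for q in range(len(query) - k + 1):
        for d in index[encode(query[q:q + k])]:
            hits.append((q, d))
    return hits
```

The index is built once per database sequence, so each query k-tuple costs one table lookup instead of a scan.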
FASTA
FASTA: Step 2
• Find 10 best diagonal runs (sequence of
nearby hot spots on same diagonal)
• Give each hot spot a positive score, and each
space between consecutive hot spots a
negative score that decreases with distance
– similar to affine gap costs in S-W
• Each diagonal run is composed of matches
(hot spots themselves) and mismatches
(interspot regions) but no indels
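One way to picture the diagonal-run search, as a rough sketch: hot spots with the same value of (database position − query position) lie on one diagonal, each hot spot adds a positive score, and the gap to the previous hot spot subtracts a penalty that grows with distance. The score values and the restart rule are assumptions chosen for illustration, not FASTA's actual parameters.

```python
from collections import defaultdict

def best_diagonal_runs(hits, k, hit_score=None, gap_penalty=1, top=10):
    """hits: (query_pos, db_pos) k-tuple matches.
    Returns up to `top` runs as (score, diagonal, query_start, query_end)."""
    if hit_score is None:
        hit_score = k                      # assumed: one point per matched base
    by_diag = defaultdict(list)
    for q, d in hits:
        by_diag[d - q].append(q)           # same d - q means same diagonal
    runs = []
    for diag, qs in by_diag.items():
        qs.sort()
        best, score, start, prev = None, 0, None, None
        for q in qs:
            if prev is not None:
                # Penalize the bases lying between consecutive hot spots
                score -= gap_penalty * max(0, q - prev - k)
            if prev is None or score <= 0:
                score, start = 0, q        # the run so far is not worth keeping
            score += hit_score
            if best is None or score > best[0]:
                best = (score, diag, start, q + k - 1)
            prev = q
        runs.append(best)
    runs.sort(key=lambda r: (-r[0], r[1]))  # best score first
    return runs[:top]
```

Because a run stays on one diagonal, it can contain matches and mismatches but never an indel.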
FASTA: Step 3
• Evaluate each diagonal run using an
appropriate scoring matrix and find
best scoring run
– Discard runs with low scores (“filtration”)
• The highest-scoring diagonal is reported
as init1
FASTA: Step 4
• After all diagonals are found, try to join diagonals by
adding gaps
• Use a weighted directed acyclic graph between
segments, with edges between those that could be combined
using indels
• Find a maximum weight path in this graph;
corresponds to a local alignment, reported as initn
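The maximum weight path can be sketched with a simple dynamic program. This is an illustrative version under stated assumptions: each segment is an ungapped run with a score and its coordinates, an edge joins two segments when the second starts after the first ends in both sequences, and every join costs a fixed penalty (the value 20 is made up).

```python
def initn_score(segments, join_penalty=20):
    """segments: (score, q_start, q_end, d_start, d_end) for each ungapped run.
    Returns the best total score over chains of compatible segments."""
    # Sorting by query start is a topological order: an edge u -> v needs
    # u.q_end < v.q_start, so u always sorts before v.
    segs = sorted(segments, key=lambda s: s[1])
    best_end = []                      # best chain score ending at segs[i]
    for i, (score, qs, qe, ds, de) in enumerate(segs):
        best = score                   # chain consisting of this segment alone
        for j in range(i):
            _, _, prev_qe, _, prev_de = segs[j]
            if prev_qe < qs and prev_de < ds:     # joinable with an indel
                best = max(best, best_end[j] + score - join_penalty)
        best_end.append(best)
    return max(best_end, default=0)
```

With a large join penalty the best chain is a single segment, so the result is never lower than the best individual segment score.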
Adding gaps
FASTA: Step 5
• If score reaches a threshold value,
compute an alternative local alignment
• Form a band around init1 in dynamic
programming table
– Width depends on ktup
• Use Smith-Waterman to find best
alignment restricted to that band.
• Result is called opt
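A banded Smith-Waterman can be sketched as below. Only cells within `half_width` of the chosen diagonal are filled, so the cost per sequence drops from n² to roughly n × band width. The scoring values, the linear gap penalty, and the parameter names are assumptions for illustration.

```python
def banded_smith_waterman(a, b, diagonal, half_width,
                          match=1, mismatch=-1, gap=-2):
    """Local alignment score using only cells (i, j) with
    |(j - i) - diagonal| <= half_width."""
    best = 0
    prev = {}                              # column -> score in the previous row
    for i in range(1, len(a) + 1):
        curr = {}
        lo = max(1, i + diagonal - half_width)
        hi = min(len(b), i + diagonal + half_width)
        for j in range(lo, hi + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Cells outside the band count as 0, i.e. a fresh local start
            score = max(0,
                        prev.get(j - 1, 0) + s,
                        prev.get(j, 0) + gap,
                        curr.get(j - 1, 0) + gap)
            curr[j] = score
            best = max(best, score)
        prev = curr
    return best
```

For example, with a = "ACGTACGT" and b = "ACGTTACGT", a zero-width band on diagonal 0 sees only the first four matches, while widening the band by one lets the alignment step over the extra T.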
FASTA: Final Steps
• Rank database sequences according to
opt scores
• Use full Smith-Waterman method to align
query sequence against each of the highest
ranking sequences from the database
• Perform statistical analysis
!!SEQUENCE_LIST 1.0
(Nucleotide) FASTA of: b2.seq from: 1 to: 693 December 9, 2002 14:02
TO: /u/browns02/Victor/Search-set/*.seq Sequences: 2,050 Symbols: 913,285 Word Size: 6
Searching with both strands of the query.
Scoring matrix: GenRunData:fastadna.cmp
Constant pamfactor used
Gap creation penalty: 16 Gap extension penalty: 4
Histogram Key:
Each histogram symbol represents 4 search set sequences
Each inset symbol represents 1 search set sequences
z-scores computed from opt scores
z-score obs exp
        (=) (*)
   < 20   0   0:
     22   0   0:
     24   3   0:=
     26   2   0:=
     28   5   0:==
     30  11   3:*==
     32  19  11:==*==
     34  38  30:=======*==
     36  58  61:===============*
     38  79 100:====================    *
     40 134 140:==================================*
     42 167 171:==========================================*
     44 205 189:===============================================*====
     46 209 192:===============================================*=====
     48 177 184:=============================================*
List
The best scores are:                       init1 initn   opt    z-sc E(1018780)..
SW:PPI1_HUMAN Begin: 1 End: 269
! Q00169 homo sapiens (human). phosph...    1854  1854  1854  2249.3  1.8e-117
SW:PPI1_RABIT Begin: 1 End: 269
! P48738 oryctolagus cuniculus (rabbi...    1840  1840  1840  2232.4  1.6e-116
SW:PPI1_RAT Begin: 1 End: 270
! P16446 rattus norvegicus (rat). pho...    1543  1543  1837  2228.7  2.5e-116
SW:PPI1_MOUSE Begin: 1 End: 270
! P53810 mus musculus (mouse). phosph...    1542  1542  1836  2227.5  2.9e-116
SW:PPI2_HUMAN Begin: 1 End: 270
! P48739 homo sapiens (human). phosph...    1533  1533  1533  1861.0   7.7e-96
SPTREMBL_NEW:BAC25830 Begin: 1 End: 270
! Bac25830 mus musculus (mouse). 10, ...    1488  1488  1522  1847.6   4.2e-95
SP_TREMBL:Q8N5W1 Begin: 1 End: 268
! Q8n5w1 homo sapiens (human). simila...    1477  1477  1522  1847.6   4.3e-95
SW:PPI2_RAT Begin: 1 End: 269
! P53812 rattus norvegicus (rat). pho...    1482  1482  1516  1840.4   1.1e-94
Alignments
SCORES
Init1: 1515 Initn: 1565 Opt: 1687 z-score: 1158.1 E(): 2.3e-58
>>GB_IN3:DMU09374 (2038 nt)
initn: 1565 init1: 1515 opt: 1687 Z-score: 1158.1 expect(): 2.3e-58
66.2% identity in 875 nt overlap (83-957:151-1022)

                   60        70        80        90       100       110
u39412.gb_pr CCCTTTGTGGCCGCCATGGACAATTCCGGGAAGGAAGCGGAGGCGATGGCGCTGTTGGCC
                                           || |||  | ||||| |    ||| |||||
DMU09374     AGGCGGACATAAATCCTCGACATGGGTGACAACGAACAGAAGGCGCTCCAACTGATGGCC
                    130       140       150       160       170       180

                  120       130       140       150       160       170
u39412.gb_pr GAGGCGGAGCGCAAAGTGAAGAACTCGCAGTCCTTCTTCTCTGGCCTCTTTGGAGGCTCA
             |||||||||   ||  |||   |   | ||  |||  |       || || ||||| ||
DMU09374     GAGGCGGAGAAGAAGTTGACCCAGCAGAAGGGCTTTCTGGGATCGCTGTTCGGAGGGTCC
                    190       200       210       220       230       240

                  180       190       200       210       220       230
u39412.gb_pr TCCAAAATAGAGGAAGCATGCGAAATCTACGCCAGAGCAGCAAACATGTTCAAAATGGCC
               |||  | ||||| ||   |||   ||||    | || |  |||||||| || ||| ||
DMU09374     AACAAGGTGGAGGACGCCATCGAGTGCTACCAGCGGGCGGGCAACATGTTTAAGATGTCC
                    250       260       270       280       290       300

                  240       250       260       270       280       290
u39412.gb_pr AAAAACTGGAGTGCTGCTGGAAACGCGTTCTGCCAGGCTGCACAGCTGCACCTGCAGCTC
             ||||||||||     |||||  |    |||||| |||| |||   || |||  || |
DMU09374     AAAAACTGGACAAAGGCTGGGGAGTGCTTCTGCGAGGCGGCAACTCTACACGCGCGGGCT
                    310       320       330       340       350       360
