Databases indexation
Laurent Falquet, EPFL, March 2005
Swiss Institute of Bioinformatics
Swiss EMBnet node
Overview
Data access concept
  sequential
  direct
Why indexing?
Indexing
  EMBOSS
  Fetch
  Other
BLAST
  formatdb
  Parsing output
Excel import/export
  Tab delimited
  Comma delimited
Data access: sequential vs direct
Sequential access
  Access time varies from very short to very long
Direct access
  Access time varies very little
[Figure: disk with track, sector, and head labelled]
Similar concept for databases
Flat files = sequential
>seq1
cgatgtcatgtg
>seq2
cgatcgtagctgtagctgtag
>seq3
catgtgcatgcgacgt
Indexing = simulated direct
ID      Position (byte)  Length (byte)
SEQ1    0                19
SEQ2    19               28
SEQ3    47               23
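The index table above can be reproduced with a few lines of Python. This is only a minimal sketch of the idea: the function names `build_index` and `fetch` are illustrative, the flat file is held in memory, and each line is assumed to end with a single newline byte.

```python
import io

def build_index(handle):
    """Map each upper-cased ID to (byte position, byte length) of its record."""
    index = {}
    current_id, start, position = None, 0, 0
    for line in handle:                      # handle must be opened in binary mode
        if line.startswith(b">"):
            if current_id is not None:       # close the previous record
                index[current_id] = (start, position - start)
            current_id = line[1:].split()[0].decode().upper()
            start = position
        position += len(line)
    if current_id is not None:               # close the last record
        index[current_id] = (start, position - start)
    return index

def fetch(handle, index, seq_id):
    """Simulated direct access: jump to the record instead of scanning the file."""
    position, length = index[seq_id.upper()]
    handle.seek(position)
    return handle.read(length).decode()

# The three sequences from the slide, as one flat file
FLAT_FILE = (b">seq1\ncgatgtcatgtg\n"
             b">seq2\ncgatcgtagctgtagctgtag\n"
             b">seq3\ncatgtgcatgcgacgt\n")

handle = io.BytesIO(FLAT_FILE)
index = build_index(handle)
for seq_id, (position, length) in index.items():
    print(seq_id, position, length)
print(fetch(handle, index, "seq2"))
```

The file is read sequentially once to build the index; every later lookup is a single `seek` plus `read`, whatever the position of the entry in the file.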
Tools
EMBOSS
  dbiflat
  dbifasta
  dbiblast
  seqret
  seqretsplit
  entret
Other examples
  SRS (icarus language)
    http://srs.ebi.ac.uk
    http://www.lionbioscience.com/
  indexer & fetch (warning: local SIB tool)
  Relational (MySQL, Oracle…)
EMBOSS: how to index?
Where is your file?
What is the format?
Where should the indices be?
Where is the emboss.default file? (.embossrc)
Other EMBOSS tools
  textsearch
  whichdb
EMBOSS example
Input file and directory
~/embossidx/ECOLI.dat
cd embossidx
Index creation
dbiflat -idformat swiss -dbname ECOLI.dat -directory . -release 1.0 -date 12/02/05 -fields AC
Generates 4 files
acnum.hit
acnum.trg
division.lkp
entrynam.idx
Don’t forget to modify ~/.embossrc
.embossrc
set emboss_filter 1
# Ecoli
DB ecoli [
type: P
comment: "E.coli proteome"
method: emblcd
format: swiss
dir: "~/embossidx"
file: "ECOLI.dat"
release: "1.0"
indexdir: "~/embossidx"
]
Example of queries
seqret ecoli:thio_ecoli
seqret ecoli:P00274
entret ecoli:thio_ecoli
and even
seqret 'ecoli:*_ECOLI'
Indexer & fetch
Warning: this is a local SIB tool!
Input file and directory
~/embossidx/ECOLI.dat
cd embossidx
Index creation
indexer -h '^ID' -t '^//' -i -p '^ID\s+(\S+)' ECOLI.dat ecoli.idx
Generates 1 file
ecoli.idx
Don’t forget to modify the config file
Config file: fetch.conf
fetch.conf
#dbkey format indexfile datafile
ecoli sp ~/embossidx/ecoli.idx ~/embossidx/ECOLI.dat
Example of queries
fetch -c fetch.conf ecoli:thio_ecoli
fetch -c fetch.conf -f 'ecoli:thio_ecoli[20..50]'
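The same idea can be sketched in Python for a Swiss-Prot style flat file. This is not the SIB tool and makes no claim about how `indexer` or `fetch` work internally. It only reuses the regular expressions from the command line above (`^ID\s+(\S+)` for the entry ID, `^//` for the end of an entry). The two records are shortened placeholders, and the function names are hypothetical.

```python
import io
import re

ID_PATTERN = re.compile(rb"^ID\s+(\S+)")   # entry start, ID captured in group 1
END_PATTERN = re.compile(rb"^//")          # entry terminator

def build_index(handle):
    """Map each lower-cased entry ID to (byte position, byte length)."""
    index = {}
    position, start, key = 0, 0, None
    for line in handle:                    # handle must be opened in binary mode
        match = ID_PATTERN.match(line)
        if match:
            key, start = match.group(1).decode().lower(), position
        position += len(line)
        if key is not None and END_PATTERN.match(line):
            index[key] = (start, position - start)
            key = None
    return index

def fetch(handle, index, query):
    """Resolve a 'dbkey:id' query; the dbkey part is ignored in this sketch."""
    _dbkey, entry_id = query.split(":", 1)
    start, length = index[entry_id.lower()]
    handle.seek(start)
    return handle.read(length).decode()

# Placeholder records: only the ID and AC lines are kept, the rest is omitted
FLAT_FILE = (b"ID   THIO_ECOLI\nAC   P00274;\n//\n"
             b"ID   ACCA_ECOLI\nAC   P30867;\n//\n")

handle = io.BytesIO(FLAT_FILE)
index = build_index(handle)
print(index)
print(fetch(handle, index, "ecoli:acca_ecoli"))
```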
BLAST
Maintained at NCBI
Source distributed freely with several accessory tools
  ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools/ncbi.tar.gz
Requires compilation to install on your local computer
blastall contains
  blastp
  blastn
  blastx
  tblastn
  tblastx
Other tools
  blastpgp
  megablast
  formatdb
Available Blast programs
Program   Query                                   Database
blastp    protein                            VS   protein
blastn    nucleotide                         VS   nucleotide
blastx    nucleotide translated to protein   VS   protein
tblastn   protein                            VS   nucleotide translated to protein
tblastx   nucleotide translated to protein   VS   nucleotide translated to protein
What makes BLAST so fast?
Indexing all words of 3 aa or 11 bp in the sequence database
Listing, for the query, all words with a score > T
Searching the indexed database for all perfect matches
Trying to align matches that are on the same diagonal
Indexing for Blast (1)
A substitution matrix is used to compute the word scores
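[Figure: words from the query (labels: REL, RSL, LKP) are scored against the list of all possible words with 3 amino acid residues (8000: AAA, AAC, AAD, ..., YYY). Words with a score < T are dropped; words with a score > T (ACT, ..., RSL, ..., TVF) make up the list of words matching the query.]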
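A minimal Python sketch of this word-list step follows. The scoring function is a toy stand-in (+5 for identical residues, -1 otherwise) and is an assumption made to keep the example short. BLAST uses a real substitution matrix, so the actual word lists differ. The names `toy_score` and `neighborhood` are illustrative.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_score(a, b):
    # Hypothetical stand-in for a substitution matrix entry
    return 5 if a == b else -1

def word_score(word1, word2):
    """Sum of the residue-by-residue scores of two words of equal length."""
    return sum(toy_score(a, b) for a, b in zip(word1, word2))

def query_words(query, k=3):
    """All overlapping words of length k in the query."""
    return [query[i:i + k] for i in range(len(query) - k + 1)]

def neighborhood(query, threshold, k=3):
    """For each query word, list every possible word scoring above threshold."""
    all_words = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    return {
        qword: [w for w in all_words if word_score(qword, w) > threshold]
        for qword in set(query_words(query, k))
    }

all_words = ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]
print(len(all_words))                      # number of possible 3-residue words

words = neighborhood("RELKP", threshold=8)
print(sorted(words))                       # the query words
print("RSL" in words["REL"])               # RSL differs from REL at one position
```

With this toy scoring and T = 8, a word is kept when at least two of its three residues are identical to the query word.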
Indexing for Blast (2)
[Figure: each word in the list of words matching the query with a score > T (ACT, ..., RSL, ..., TVF) is searched for exact matches in the database sequences, giving the list of sequences containing words similar to the query (hits).]
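The exact-match search can be sketched in Python as a dictionary from words to their positions in the database. The two database sequences below are invented for illustration, and `index_words` and `find_hits` are hypothetical names. Real BLAST databases use a binary format, not a Python dictionary.

```python
from collections import defaultdict

def index_words(database, k=3):
    """Map every word of length k to the (sequence name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, sequence in database.items():
        for i in range(len(sequence) - k + 1):
            index[sequence[i:i + k]].append((name, i))
    return index

def find_hits(index, words):
    """Exact lookup of each word; words absent from the database are skipped."""
    return {word: index[word] for word in words if word in index}

# Invented toy database
database = {"db1": "MACTRSLK", "db2": "GGTVFRSL"}
index = index_words(database)

# Words taken from the figure, plus one word that is not in the database
hits = find_hits(index, ["ACT", "RSL", "TVF", "YYY"])
for word, places in hits.items():
    print(word, places)
```

The index is built once per database; each query word then costs one dictionary lookup instead of a scan of every sequence.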
Indexing for Blast (3)
Ungapped extension if:
  2 "hits" are on the same diagonal and at a distance less than A
[Figure: query plotted against database sequence, two hits on one diagonal within distance A]
Extension using dynamic programming
  limited to a restricted region
  limited through a score drop-off threshold
[Figure: query plotted against database sequence, with A marked]
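The two-hit condition can be sketched in Python as follows. A hit is represented as a (query position, database position) pair, and two hits lie on the same diagonal when the difference between the two positions is the same. The function name and the example hits are illustrative, and refinements such as excluding overlapping hits are left out.

```python
from collections import defaultdict

def two_hit_pairs(hits, max_distance):
    """Return (diagonal, first, second) for neighbouring hits on one diagonal
    that are closer than max_distance (the A of the slide)."""
    by_diagonal = defaultdict(list)
    for query_pos, db_pos in hits:
        by_diagonal[db_pos - query_pos].append(query_pos)

    pairs = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for first, second in zip(positions, positions[1:]):
            if 0 < second - first < max_distance:
                pairs.append((diagonal, first, second))
    return sorted(pairs)

# Illustrative hits: three on diagonal 10, one on diagonal 35
hits = [(2, 12), (9, 19), (30, 40), (5, 40)]
pairs = two_hit_pairs(hits, max_distance=10)
print(pairs)
```

Only the pairs returned here would be passed on to the ungapped extension step.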
BLAST indexing with formatdb
Formatdb
mydb.seq must contain sequences in FASTA format
formatdb -i mydb.seq -p T -n mydb
Generates 3 files
mydb.psq
mydb.pin
mydb.phr
Then start a Blast:
blastall -p blastp -d mydb -i myseq (-optional parameters)
Blast local vs remote
blastall
  Executed locally
  Slow
  No need to transfer db
blastall.remote
  Executed remotely
  Fast
  Requires special privileges and db transfer
Multiple Blasts?
1 seq vs db seq
  1 FASTA seq as input
db seq vs db seq
  Several single FASTA seq files as input, or
  1 multiple FASTA seq file as input
Possibility to export results as XML
Use Perl to automate the queries and parse the output
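Automating several queries mostly means building one command line per input file. The sketch below does this in Python rather than Perl. It assumes the legacy `blastall` options `-m 7` (XML output) and `-o` (output file), a database called `mydb` as in the formatdb example, and hypothetical input file names. The commands are only printed here; running them requires a local BLAST installation.

```python
from pathlib import Path

def blastall_command(query_file, database="mydb", program="blastp", xml=True):
    """Build the argument list for one blastall run on one FASTA file."""
    command = ["blastall", "-p", program, "-d", database, "-i", str(query_file)]
    if xml:
        command += ["-m", "7"]                       # XML report
    output = Path(query_file).with_suffix(".xml" if xml else ".out")
    command += ["-o", str(output)]
    return command

# Hypothetical single-sequence FASTA files
query_files = ["seq1.fasta", "seq2.fasta"]
commands = [blastall_command(name) for name in query_files]
for command in commands:
    print(" ".join(command))
    # To actually run it: subprocess.run(command, check=True)
```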
Parsing Blast output
BLASTP 2.2.10 [Oct-19-2004]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.
Query= ACCA_BACSU O34847 Acetyl-coenzyme A carboxylase carboxyl
transferase subunit alpha (EC 6.4.1.2).
(325 letters)
Database: ecoli_blast
4339 sequences; 1,373,039 total letters
Searching.........done
                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value
ACCA_ECOLI P30867 Acetyl-coenzyme A carboxylase carboxyl transfe...   266   1e-72
Parsing Blast output (2)
>ACCA_ECOLI P30867 Acetyl-coenzyme A carboxylase carboxyl
transferase subunit alpha (EC 6.4.1.2).
Length = 318
Score = 266 bits (681), Expect = 1e-72
Identities = 143/312 (45%), Positives = 188/312 (60%), Gaps = 3/312 (0%)
Query: 5   LEFEKPVIELQTKIAELKKFTQDS---DMDLSAEIERLEDRLAKLQDDIYKNLKPWDRVQ 61
           L+FE+P+ EL+ KI  L   ++     D+++  E+ RL ++  +L   I+ +L  W   Q
Sbjct: 5   LDFEQPIAELEAKIDSLTAVSRQDEKLDINIDEEVHRLREKSVELTRKIFADLGAWQIAQ 64
Query: 62  IARLADRPTTLDYIEHLFTDFFECHGDRAYGDDEAIVGGIAKFHGLPVTVIGHQRGKDTK 121
           +AR   RP TLDY+   F +F E  GDRAY DD+AIVGGIA+  G PV +IGHQ+G++TK
Sbjct: 65  LARHPQRPYTLDYVRLAFDEFDELAGDRAYADDKAIVGGIARLDGRPVMIIGHQKGRETK 124
Query: 122 ENLVRNFGMPHPEGYRKALRLMKQADKFNRPIICFIDTKGAYPGRAAEERGQSEAIAKNL 181
           E + RNFGMP PEGYRKALRLM+ A++F  PII FIDT GAYPG  AEERGQSEAIA+NL
Sbjct: 125 EKIRRNFGMPAPEGYRKALRLMQMAERFKMPIITFIDTPGAYPGVGAEERGQSEAIARNL 184
Query: 182 FEMAGLRVPXXXXXXXXXXXXXXXXXXXXXXXHMLENSTYSVISPEGAAALLWKDSSLAK 241
            EM+ L VP                       +ML+ STYSVISPEG A++LWK +  A
Sbjct: 185 REMSRLGVPVVCTVIGEGGSGGALAIGVGDKVNMLQYSTYSVISPEGCASILWKSADKAP 244
Query: 242 KAAETMKITAPDLKELGIIDHMIKEVKGGAHHDVKLQASYMDXXXXXXXXXXXXXXXXXX 301
            AAE M I AP LKEL +ID +I E  GGAH + +  A+ +
Sbjct: 245 LAAEAMGIIAPRLKELKLIDSIIPEPLGGAHRNPEAMAASLKAQLLADLADLDVLSTEDL 304
Query: 302 VQQRYEKYKAIG 313
             +RY++  + G
Sbjct: 305 KNRRYQRLMSYG 316
Parsing Blast output (3)
With BioPerl:
#!/usr/local/bin/perl
use strict;
use warnings;
use Bio::SearchIO;

# The BLAST report to parse is given as the first command-line argument
my $blast_report = Bio::SearchIO->new(-format => 'blast',
                                      -file   => $ARGV[0]);

# Column titles (tab delimited)
print "Query name:\tQuery description:\tHit name:\tHit description:\tE-value\tScore\n";

# One result per query, one hit per database sequence, one HSP per local alignment
while (my $result = $blast_report->next_result) {
    print $result->query_name(), "\t", $result->query_description(), "\n";
    while (my $hit = $result->next_hit()) {
        print "\t\t", $hit->name(), "\t", $hit->description();
        while (my $hsp = $hit->next_hsp()) {
            print "\t", $hsp->evalue(), "\t", $hsp->score();
        }
        print "\n";
    }
}
exit 0;
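Without BioPerl, the hit name, bit score and E-value can also be pulled out of the plain-text report with regular expressions. The Python sketch below is an assumption-laden shortcut, not a full parser: it only understands the ">name description" and "Score = ... Expect = ..." lines shown in the report above, and `parse_hits` is a hypothetical name.

```python
import re

HIT_HEADER = re.compile(r"^>(\S+)")
SCORE_LINE = re.compile(r"Score\s*=\s*([\d.]+) bits \(\d+\), Expect\s*=\s*(\S+)")

def parse_hits(report_text):
    """Return one (hit name, bit score, E-value) tuple per alignment in the report."""
    hits, name = [], None
    for line in report_text.splitlines():
        header = HIT_HEADER.match(line)
        if header:
            name = header.group(1)
            continue
        score = SCORE_LINE.search(line)
        if score and name is not None:
            expect = score.group(2).rstrip(",")
            if expect.startswith("e"):          # some reports print "e-100" for 1e-100
                expect = "1" + expect
            hits.append((name, float(score.group(1)), float(expect)))
    return hits

# Lines copied from the report shown above
REPORT = """>ACCA_ECOLI P30867 Acetyl-coenzyme A carboxylase carboxyl
transferase subunit alpha (EC 6.4.1.2).
Length = 318
 Score = 266 bits (681), Expect = 1e-72
 Identities = 143/312 (45%), Positives = 188/312 (60%), Gaps = 3/312 (0%)
"""

hits = parse_hits(REPORT)
print(hits)
```

A dedicated parser such as Bio::SearchIO remains the safer choice, because the text layout changes between BLAST versions.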
MS-Excel import/export
Excel can import
  Tab delimited
  Comma delimited
Excel can export
  Tab delimited
  Space delimited
AC/ID       desc                          score  e-value
THIO_ECOLI  thioredoxin Escherichia coli  234    2.1e-5
THIO_HUMAN  thioredoxin Homo sapiens      120    0.001
MS-Excel import/export
Tab delimited file:
\t delimits the columns
\n delimits the lines
Optional first line contains the column titles
Example:
AC/ID\tdesc\tscore\te-value\n
THIO_ECOLI\tthioredoxin Escherichia coli\t234\t2.1e-5\n
THIO_HUMAN\tthioredoxin Homo sapiens\t120\t0.001\n
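A tab delimited file like the example above can be written with Python's standard `csv` module. This is a minimal sketch that writes to an in-memory buffer; the helper name `to_tab_delimited` is illustrative, and the rows are the two hits from the slide.

```python
import csv
import io

def to_tab_delimited(header, rows):
    """Return the table as text: \\t between columns, \\n at the end of each line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)      # optional first line with the column titles
    writer.writerows(rows)
    return buffer.getvalue()

header = ["AC/ID", "desc", "score", "e-value"]
rows = [
    ["THIO_ECOLI", "thioredoxin Escherichia coli", 234, "2.1e-5"],
    ["THIO_HUMAN", "thioredoxin Homo sapiens", 120, "0.001"],
]

text = to_tab_delimited(header, rows)
print(text, end="")
```

To produce a file that Excel can open, the same writer can be attached to `open("hits.txt", "w", newline="")` instead of the buffer (the file name is only an example).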
MS-Excel import/export
Comma delimited file:
, delimits the columns; each value is surrounded by ' '
\n delimits the lines
Optional first line contains the column titles
Example:
'AC/ID','desc','score','e-value'\n
'THIO_ECOLI','thioredoxin Escherichia coli','234','2.1e-5'\n
'THIO_HUMAN','thioredoxin Homo sapiens','120','0.001'\n
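The comma delimited layout can be sketched with the same `csv` module. Here the quote character is set to a single quote to match the example above. This is a choice made for the sketch: the module's default quote character is the double quote, and the spreadsheet's import settings should be checked to see which one it expects. The helper names are illustrative.

```python
import csv
import io

def to_comma_delimited(header, rows):
    """Write every value between single quotes, with commas between columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=",", quotechar="'",
                        quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

def from_comma_delimited(text):
    """Read the table back; every value comes back as a string."""
    return list(csv.reader(io.StringIO(text), delimiter=",", quotechar="'"))

header = ["AC/ID", "desc", "score", "e-value"]
rows = [
    ["THIO_ECOLI", "thioredoxin Escherichia coli", 234, "2.1e-5"],
    ["THIO_HUMAN", "thioredoxin Homo sapiens", 120, "0.001"],
]

text = to_comma_delimited(header, rows)
print(text, end="")
table = from_comma_delimited(text)
print(table[1])
```

Quoting every value keeps a description that itself contains a comma in a single column.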