Databases indexation

Laurent Falquet, EPFL, March 2005
Swiss Institute of Bioinformatics
Swiss EMBnet node
Overview

Data access concept
  sequential
  direct
Why indexing?
Indexing
  EMBOSS
  Fetch
  Other
BLAST
  formatdb
  Parsing output
Excel import/export
  Tab delimited
  Comma delimited

Data access: sequential vs direct

Sequential access
  Access time varies from very short to very long
Direct access
  Access time shows very small variations
[Figure: disk with track, sector and head]

Similar concept for databases

Flat files = sequential

>seq1
cgatgtcatgtg
>seq2
cgatcgtagctgtagctgtag
>seq3
catgtgcatgcgacgt

Indexing = simulated direct

ID    Position (byte)  Length (byte)
SEQ1  0                19
SEQ2  19               28
SEQ3  47               23

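The position/length table above can be reproduced with a short script. This is only a minimal sketch of the idea, not how any particular indexing tool is implemented: the function names `build_index` and `fetch` are made up for the illustration, and upper-casing the IDs is an assumption chosen to match the table.

```python
import io

def build_index(handle):
    """Map each FASTA ID to (byte position, byte length) of its record."""
    index = {}
    current_id, start, position = None, 0, 0
    for line in handle:                       # one sequential pass over the file
        if line.startswith(b">"):
            if current_id is not None:
                index[current_id] = (start, position - start)
            current_id = line[1:].split()[0].decode().upper()
            start = position
        position += len(line)
    if current_id is not None:                # close the last record
        index[current_id] = (start, position - start)
    return index

def fetch(handle, index, seq_id):
    """Jump straight to one record instead of reading the file from the top."""
    start, length = index[seq_id]
    handle.seek(start)
    return handle.read(length).decode()

# The three-sequence flat file from the slide, held in memory for the demo.
flat_file = io.BytesIO(
    b">seq1\ncgatgtcatgtg\n"
    b">seq2\ncgatcgtagctgtagctgtag\n"
    b">seq3\ncatgtgcatgcgacgt\n"
)
index = build_index(flat_file)
for seq_id, (start, length) in index.items():
    print(seq_id, start, length)
print(fetch(flat_file, index, "SEQ2"), end="")
```

The index is built once with a sequential read; every later lookup costs one `seek` and one `read`, whatever the size of the flat file.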
Tools

EMBOSS
  dbiflat, dbifasta, dbiblast
  seqret, seqretsplit, entret

Other examples
  SRS (Icarus language)
    http://srs.ebi.ac.uk
    http://www.lionbioscience.com/
  indexer & fetch (warning: local SIB tool)
  Relational (MySQL, Oracle…)

EMBOSS: how to index?

Where is your file?
What is the format?
Where should the indices be stored?
Where is the emboss.default file (or ~/.embossrc)?

Other EMBOSS tools
  textsearch
  whichdb

EMBOSS example

Input file and directory
  ~/embossidx/ECOLI.dat
  cd embossidx

Index creation
  dbiflat -idformat swiss -dbname ECOLI.dat -directory . -release 1.0 -date 12/02/05 -fields AC

Generates 4 files
  acnum.hit
  acnum.trg
  division.lkp
  entrynam.idx

Don’t forget to modify ~/.embossrc
.embossrc
set emboss_filter 1
# Ecoli
DB ecoli [
type: P
comment: "E.coli proteome"
method: emblcd
format: swiss
dir: "~/embossidx"
file: "ECOLI.dat"
release: "1.0"
indexdir: "~/embossidx"
]

Example of queries
  seqret ecoli:thio_ecoli
  seqret ecoli:P00274
  entret ecoli:thio_ecoli
and even
  seqret 'ecoli:*_ECOLI'
Indexer & fetch


Warning: this is a local SIB tool!

Input file and directory
  ~/embossidx/ECOLI.dat
  cd embossidx

Index creation
  indexer -h '^ID' -t '^//' -i -p '^ID\s+(\S+)' ECOLI.dat ecoli.idx

Generates 1 file
  ecoli.idx

Don’t forget to modify the config file
Config file: fetch.conf

fetch.conf
#dbkey format indexfile datafile
ecoli sp ~/embossidx/ecoli.idx ~/embossidx/ECOLI.dat

Example of queries
  fetch -c fetch.conf ecoli:thio_ecoli
  fetch -c fetch.conf -f 'ecoli:thio_ecoli[20..50]'
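The three regular expressions given to indexer (entry header, entry terminator, key pattern) are enough to illustrate the principle. The sketch below is not the SIB tool and does not read its index format: `build_index` and `fetch` are hypothetical names, the two entries are made-up and heavily shortened, and lower-casing the keys is an assumption made so that `thio_ecoli` finds `THIO_ECOLI`.

```python
import io
import re

HEADER = re.compile(rb"^ID")            # a line that starts an entry
TERMINATOR = re.compile(rb"^//")        # a line that ends an entry
KEY = re.compile(rb"^ID\s+(\S+)")       # the captured group is the index key

def build_index(handle):
    """Return {lower-cased entry name: (byte offset, byte length)}."""
    index, offset, start, key = {}, 0, None, None
    for line in handle:
        if start is None and HEADER.match(line):
            start = offset
            found = KEY.match(line)
            key = found.group(1).decode().lower() if found else None
        offset += len(line)
        if start is not None and TERMINATOR.match(line):
            if key is not None:
                index[key] = (start, offset - start)
            start, key = None, None
    return index

def fetch(handle, index, query):
    """Resolve 'db:entry' to the text of one entry (the db part is ignored here)."""
    entry = query.split(":", 1)[-1].lower()
    start, length = index[entry]
    handle.seek(start)
    return handle.read(length).decode()

# Made-up, shortened entries in a Swiss-Prot-like layout (illustration only).
data = io.BytesIO(
    b"ID   THIO_ECOLI\n"
    b"DE   (shortened example entry)\n"
    b"//\n"
    b"ID   ACCA_ECOLI\n"
    b"DE   (shortened example entry)\n"
    b"//\n"
)
index = build_index(data)
print(index)
print(fetch(data, index, "ecoli:acca_ecoli"), end="")
```

Extracting a subsequence such as `[20..50]` would additionally require parsing the sequence lines of the fetched entry, which is left out here.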

BLAST


Maintained at NCBI
Source distributed freely with several accessory tools
  ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools/ncbi.tar.gz
Requires compilation to install on your local computer

blastall contains
  blastp
  blastn
  blastx
  tblastn
  tblastx

Other tools
  blastpgp
  megablast
  formatdb

Available Blast programs

Program  Query                  VS  Database
blastp   protein                VS  protein
blastn   nucleotide             VS  nucleotide
blastx   nucleotide -> protein  VS  protein
tblastn  protein                VS  nucleotide -> protein
tblastx  nucleotide -> protein  VS  nucleotide -> protein

What makes BLAST so fast?

Index all words of 3 aa or 11 bp in the sequence database
List all words that score > T against the words of the query
Search the indexed database for all perfect matches to those words
Try to align matches that are on the same diagonal

Indexing for Blast (1)

A substitution matrix is used to compute the word scores
[Figure: the query words are scored against the list of all possible words with 3 amino acid residues (8000: AAA, AAC, AAD, ..., YYY); words with a score < T are dropped, words with a score > T (e.g. ACT, RSL, TVF) form the list of words matching the query]

Indexing for Blast (2)

[Figure: the database sequences are searched for exact matches to the list of words matching the query with a score > T (ACT, RSL, TVF, ...); the result is the list of sequences containing words similar to the query (hits)]
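The two steps above (word list with a score > T, then exact lookup in the indexed database) can be sketched as follows. This is a toy illustration, not BLAST's implementation: the scoring function is an invented stand-in for a real substitution matrix, the threshold value 8 is arbitrary, and the three database sequences are made up.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Toy substitution scores, NOT a real matrix such as BLOSUM62.
SIMILAR = {frozenset(pair) for pair in ("IL", "IV", "LV", "KR", "DE", "ST", "FY", "NQ")}

def score(a, b):
    if a == b:
        return 4
    return 1 if frozenset((a, b)) in SIMILAR else -2

def word_score(word1, word2):
    return sum(score(a, b) for a, b in zip(word1, word2))

def neighbourhood(query, k=3, threshold=8):
    """Map each word scoring > threshold to the query offsets it matches."""
    all_words = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]  # 8000 for k=3
    matches = {}
    for offset in range(len(query) - k + 1):
        query_word = query[offset:offset + k]
        for word in all_words:
            if word_score(query_word, word) > threshold:
                matches.setdefault(word, []).append(offset)
    return matches

def index_database(sequences, k=3):
    """Map every k-letter word to the (sequence name, offset) pairs where it occurs."""
    index = {}
    for name, seq in sequences.items():
        for offset in range(len(seq) - k + 1):
            index.setdefault(seq[offset:offset + k], []).append((name, offset))
    return index

def find_hits(query, db_index, k=3, threshold=8):
    """Exact lookups of the neighbourhood words in the database index."""
    hits = []
    for word, query_offsets in neighbourhood(query, k, threshold).items():
        for name, db_offset in db_index.get(word, []):
            for q_offset in query_offsets:
                hits.append((name, q_offset, db_offset, word))
    return sorted(hits)

database = {"db1": "MKRELTV", "db2": "GGRDLAA", "db3": "PPPPPPP"}  # made-up sequences
hits = find_hits("REL", index_database(database))
print(hits)
```

The database index is built once and reused for every query; only the short word list depends on the query, which is where most of the speed comes from.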
Indexing for Blast (3)

Ungapped extension if:
  2 "hits" are on the same diagonal but at a distance less than A
[Figure: query vs database sequence, two hits on one diagonal within distance A]

Extension using dynamic programming
  limited to a restricted region
  limited through a score drop-off threshold
[Figure: query vs database sequence, distance A]
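The two-hit rule and the drop-off idea can be sketched in a few lines. This is a simplified illustration under stated assumptions: hits are plain (query offset, database offset) pairs, scoring is +1/-1 rather than a substitution matrix, and only an ungapped extension to the right is shown; the gapped dynamic-programming step is not implemented. The function names are made up.

```python
def two_hit_pairs(hits, window):
    """Pairs of hits lying on the same diagonal and closer than `window`."""
    by_diagonal = {}
    for q_off, d_off in sorted(hits):
        # Hits on one diagonal share the same (database offset - query offset).
        by_diagonal.setdefault(d_off - q_off, []).append((q_off, d_off))
    pairs = []
    for same_diagonal in by_diagonal.values():
        for first, second in zip(same_diagonal, same_diagonal[1:]):
            if second[0] - first[0] < window:
                pairs.append((first, second))
    return pairs

def extend_right(query, subject, q_start, s_start, drop_off, match=1, mismatch=-1):
    """Ungapped extension; stop once the score falls more than drop_off below the best."""
    best = score = 0
    best_length = length = 0
    while q_start + length < len(query) and s_start + length < len(subject):
        same = query[q_start + length] == subject[s_start + length]
        score += match if same else mismatch
        length += 1
        if score > best:
            best, best_length = score, length
        elif best - score > drop_off:
            break
    return best, best_length

hits = [(0, 10), (5, 15), (2, 40), (30, 40)]
pairs = two_hit_pairs(hits, window=10)
print(pairs)
extension = extend_right("ACDEFGHIKL", "ACDEFGWWWW", 0, 0, drop_off=2)
print(extension)
```

In the example, three hits share diagonal 10 but only the first two are within the window, and the extension keeps the six matching residues before the drop-off threshold stops it.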
BLAST indexing with formatdb

Formatdb
mydb.seq must contain sequences in FASTA format
 formatdb -i mydb.seq -p T -n mydb


Generates 3 files
mydb.psq
 mydb.pin
 mydb.phr


Then start a Blast:

blastall -p blastp -d mydb -i myseq (-optional parameters)
Blast local vs remote

blastall
  Executed locally
  Slow
  No need to transfer the db

blastall.remote
  Executed remotely
  Fast
  Requires special privileges and db transfer

Multiple Blasts?

1 seq vs db seq
  1 FASTA seq as input
db seq vs db seq
  Several single FASTA seq files as input, or
  1 multiple FASTA seq file as input

Possibility to export results as XML

Use Perl to automate the queries and parse the output
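Automating the queries mostly means splitting a multiple FASTA file and building one command line per sequence. The sketch below shows that step in Python rather than Perl; it only builds the blastall command lists and does not run them. The helper names, the `.fasta` and `.out` file naming, and the database name `mydb` are assumptions for the illustration; `-m 7` is the blastall option for XML output.

```python
def split_fasta(text):
    """Yield (identifier, record text) for each entry of a multiple FASTA file."""
    record = []
    for line in text.splitlines():
        if line.startswith(">") and record:
            yield record[0][1:].split()[0], "\n".join(record) + "\n"
            record = []
        if line.strip():                      # skip blank lines
            record.append(line)
    if record:                                # last record of the file
        yield record[0][1:].split()[0], "\n".join(record) + "\n"

def blast_command(query_file, database, xml=True):
    """Build (but do not run) a blastall command line for one query file."""
    command = ["blastall", "-p", "blastp", "-d", database, "-i", query_file]
    if xml:
        command += ["-m", "7"]                # XML output
    command += ["-o", query_file + ".out"]
    return command

multi = ">seq1 first\nMKRELTV\n>seq2 second\nGGRDLAA\n"   # made-up sequences
records = list(split_fasta(multi))
for name, record in records:
    print(name, " ".join(blast_command(name + ".fasta", "mydb")))
```

To actually run the searches, each record would be written to its query file and the command list passed to `subprocess.run`.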
Parsing Blast output
BLASTP 2.2.10 [Oct-19-2004]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.

Query= ACCA_BACSU O34847 Acetyl-coenzyme A carboxylase carboxyl
transferase subunit alpha (EC 6.4.1.2).
         (325 letters)

Database: ecoli_blast
           4339 sequences; 1,373,039 total letters

Searching.........done

                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value

ACCA_ECOLI P30867 Acetyl-coenzyme A carboxylase carboxyl transfe...   266   1e-72

Parsing Blast output (2)
>ACCA_ECOLI P30867 Acetyl-coenzyme A carboxylase carboxyl
transferase subunit alpha (EC 6.4.1.2).
Length = 318
Score = 266 bits (681), Expect = 1e-72
Identities = 143/312 (45%), Positives = 188/312 (60%), Gaps = 3/312 (0%)

Query: 5   LEFEKPVIELQTKIAELKKFTQDS---DMDLSAEIERLEDRLAKLQDDIYKNLKPWDRVQ 61
           L+FE+P+ EL+ KI  L   ++     D+++  E+ RL ++  +L   I+ +L  W   Q
Sbjct: 5   LDFEQPIAELEAKIDSLTAVSRQDEKLDINIDEEVHRLREKSVELTRKIFADLGAWQIAQ 64

Query: 62  IARLADRPTTLDYIEHLFTDFFECHGDRAYGDDEAIVGGIAKFHGLPVTVIGHQRGKDTK 121
           +AR   RP TLDY+   F +F E  GDRAY DD+AIVGGIA+  G PV +IGHQ+G++TK
Sbjct: 65  LARHPQRPYTLDYVRLAFDEFDELAGDRAYADDKAIVGGIARLDGRPVMIIGHQKGRETK 124

Query: 122 ENLVRNFGMPHPEGYRKALRLMKQADKFNRPIICFIDTKGAYPGRAAEERGQSEAIAKNL 181
           E + RNFGMP PEGYRKALRLM+ A++F  PII FIDT GAYPG  AEERGQSEAIA+NL
Sbjct: 125 EKIRRNFGMPAPEGYRKALRLMQMAERFKMPIITFIDTPGAYPGVGAEERGQSEAIARNL 184

Query: 182 FEMAGLRVPXXXXXXXXXXXXXXXXXXXXXXXHMLENSTYSVISPEGAAALLWKDSSLAK 241
            EM+ L VP                       +ML+ STYSVISPEG A++LWK +  A
Sbjct: 185 REMSRLGVPVVCTVIGEGGSGGALAIGVGDKVNMLQYSTYSVISPEGCASILWKSADKAP 244

Query: 242 KAAETMKITAPDLKELGIIDHMIKEVKGGAHHDVKLQASYMDXXXXXXXXXXXXXXXXXX 301
            AAE M I AP LKEL +ID +I E  GGAH + +  A+ +
Sbjct: 245 LAAEAMGIIAPRLKELKLIDSIIPEPLGGAHRNPEAMAASLKAQLLADLADLDVLSTEDL 304

Query: 302 VQQRYEKYKAIG 313
             +RY++  + G
Sbjct: 305 KNRRYQRLMSYG 316

Parsing Blast output (3)

With BioPerl:

#!/usr/local/bin/perl
use strict;
use warnings;
use Bio::SearchIO;

# Open the BLAST report given as first command-line argument
my $blast_report = Bio::SearchIO->new(-format => 'blast',
                                      -file   => $ARGV[0]);

print "Query name:\tQuery description:\tHit name:\tHit description:\tE-value\tScore\n";

# One result per query, one hit per database sequence, one HSP per alignment
while (my $result = $blast_report->next_result) {
    print $result->query_name(), "\t", $result->query_description(), "\n";
    while (my $hit = $result->next_hit()) {
        print "\t\t", $hit->name(), "\t", $hit->description();
        while (my $hsp = $hit->next_hsp()) {
            print "\t", $hsp->evalue(), "\t", $hsp->score();
        }
        print "\n";
    }
}
exit 0;

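Without BioPerl, the score lines of the plain-text report can also be picked up with two regular expressions. This is a rough sketch only, written in Python: the patterns are assumptions based on the report excerpt shown earlier, they cover just the hit name, score, E-value and identities, and a real parser library is more robust against format changes.

```python
import re

SCORE = re.compile(r"Score =\s*([\d.]+) bits \((\d+)\),\s*Expect = (\S+)")
IDENT = re.compile(r"Identities = (\d+)/(\d+)")

def parse_hsps(report):
    """Yield one dict per HSP found in a plain-text BLAST report."""
    hit, hsp = None, None
    for line in report.splitlines():
        if line.startswith(">"):              # start of a new hit
            hit = line[1:].split()[0]
            continue
        found = SCORE.search(line)
        if found:
            evalue = found.group(3).rstrip(",")
            if evalue.startswith("e"):        # BLAST may print e-100 for 1e-100
                evalue = "1" + evalue
            hsp = {"hit": hit, "bits": float(found.group(1)),
                   "raw_score": int(found.group(2)), "evalue": float(evalue)}
            continue
        found = IDENT.search(line)
        if found and hsp is not None:
            hsp["identities"] = int(found.group(1))
            hsp["align_length"] = int(found.group(2))
            yield hsp
            hsp = None

report = """>ACCA_ECOLI P30867 Acetyl-coenzyme A carboxylase carboxyl
transferase subunit alpha (EC 6.4.1.2).
Length = 318
Score = 266 bits (681), Expect = 1e-72
Identities = 143/312 (45%), Positives = 188/312 (60%), Gaps = 3/312 (0%)
"""
hsps = list(parse_hsps(report))
for hsp in hsps:
    print(hsp["hit"], hsp["bits"], hsp["evalue"], hsp["identities"], sep="\t")
```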
MS-Excel import/export

Excel can import
  Tab delimited
  Comma delimited
Excel can export
  Tab delimited
  Space delimited

AC/ID       desc                          score  e-value
THIO_ECOLI  thioredoxin Escherichia coli  234    2.1e-5
THIO_HUMAN  thioredoxin Homo sapiens      120    0.001

MS-Excel import/export

Tab delimited file:
\t delimits the columns
 \n delimits the lines
 Optional first line contains the column titles
 Example:

AC/ID\tdesc\tscore\te-value\n
THIO_ECOLI\tthioredoxin Escherichia coli\t234\t2.1e-5\n
THIO_HUMAN\tthioredoxin Homo sapiens\t120\t0.001\n
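
A tab-delimited file like the example above can be written and read back with a few lines of script. The sketch uses Python's standard `csv` module and writes to an in-memory buffer; in practice the buffer would be a file opened for writing. The e-values are kept as strings so that they are written exactly as shown on the slide.

```python
import csv
import io

rows = [
    ("THIO_ECOLI", "thioredoxin Escherichia coli", 234, "2.1e-5"),
    ("THIO_HUMAN", "thioredoxin Homo sapiens", 120, "0.001"),
]

# Write: \t between columns, \n at the end of each line.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["AC/ID", "desc", "score", "e-value"])   # optional title line
writer.writerows(rows)
tab_text = buffer.getvalue()
print(tab_text, end="")

# Read back: every value comes back as a string.
parsed = list(csv.reader(io.StringIO(tab_text), delimiter="\t"))
print(parsed[1])
```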
MS-Excel import/export

Comma delimited file:
, delimits the columns, each value is surrounded by ' '
 \n delimits the lines
 Optional first line contains the column titles
 Example:

'AC/ID','desc','score','e-value'\n
'THIO_ECOLI','thioredoxin Escherichia coli','234','2.1e-5'\n
'THIO_HUMAN','thioredoxin Homo sapiens','120','0.001'\n
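
The reason for surrounding each value with quotes becomes visible as soon as a value contains a comma itself. The sketch below uses Python's standard `csv` module with a single quote as quote character, to match the example above (a double quote is the more common default). The second description is a hypothetical value, altered to contain a comma for the demonstration.

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=",", quotechar="'",
                    quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerow(["AC/ID", "desc", "score", "e-value"])
writer.writerow(["THIO_ECOLI", "thioredoxin Escherichia coli", 234, "2.1e-5"])
# Hypothetical description containing a comma, to show what the quotes protect.
writer.writerow(["THIO_HUMAN", "thioredoxin, Homo sapiens", 120, "0.001"])
csv_text = buffer.getvalue()
print(csv_text, end="")

# A quote-aware reader keeps the description in one column ...
parsed = list(csv.reader(io.StringIO(csv_text), delimiter=",", quotechar="'"))
print(parsed[2])
# ... whereas a naive split on "," breaks it into two pieces.
naive = csv_text.splitlines()[2].split(",")
print(len(parsed[2]), len(naive))
```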