How to Install and Use a Local BLAST Server


How to Install and Use
a Standalone BLAST
(Basic Local Alignment
Search Tool) Server
Doug Davis
Plant Science Division
Univ. of Missouri
6/26/06
Lab Premise

Bioinformatics research is typically web-based

Access to necessary URLs may be hampered
by need for administrator permissions

Solution: Standalone BLAST (you will be
provided a CD containing all necessary files at
the lab’s conclusion)
Lab Goals

See where BLAST fits into the larger
scheme of bioinformatics

Demonstrate installation of a standalone
BLAST server on a Windows XP PC (should
also work on a Windows 2000 PC)

Gain initial familiarity with available
standalone BLAST parameters
Bioinformatics Defined

Study of biological questions using computers in place
of traditional labware (e.g. test tubes, pH meters,
electrophoretic equipment)

Dependent on databases containing molecular data
generated over many decades

Millions of sequences are in these databases; best of
all, tools like BLAST can search for sequences in such
large databases very rapidly
What Is BLAST?

BLAST is a program that searches for similarities
among molecular sequences; it works with nucleic acids
and proteins

It performs local (as opposed to global) alignments
using a special set of scoring matrices

It calculates statistical significance for any matches it
finds (allows you to evaluate the degree of similarity)

It is a very powerful tool for characterizing unknown
sequences by aligning them to known
sequences
The usual way BLAST is
employed…

Requires an active internet connection to visit websites
where molecular databases reside (e.g.
http://www.ncbi.nlm.nih.gov). You have a lot of flexibility
working over the web (many different databases and
informatics tools can be rapidly accessed)

You specify a target database to be searched using the
website’s BLAST server

You upload the query sequences (these are the
sequences you want to learn more about) to a web
BLAST server; the BLAST alignment algorithm then
compares these sequences to all sequences in the
specified target database
BLAST Session Setup

Query sequence(s): these are sequences you want
to know more about. Consider
them as “unknowns”.

BLAST program

Target database sequences: this database contains many
sequences which are already characterized; these are
the “knowns”.

If BLAST detects a match between query sequences and
database sequences, this indicates some meaningful
relationship between the aligned sequences.
Here’s how the BLAST session looks in
“Command Prompt” (this is the program
you will use in Windows to run BLAST):
Here’s the “Hit Table” output from a BLAST
session. The Hit Table format is a stripped-down
BLAST output
Hit Table Format of BLAST
Output

The output report fields are outlined here
# BLASTN 2.2.14 [May-07-2006]
# Query: 5221 sequences
# Database: maize_genes.txt
# Fields: Query, Subject, %ID, AlignLngth, Mismatch, Gaps, Qry_start, Qry_end, Subj_start, Subj_end, e-val, bit_score
CK828121  TC279221   88.96  589  55   8   74  653   274   861  0.0     642
CF624012  TC279225   94.03  318  19   0  143  460   411    94  4e-136  480
CF624331  TC279227   99.25  665   4   1    1  665   846   183  0.0     1277
CK826720  TC279296  100.00   28   0   0    1   28  1666  1639  3e-008  56.0
CF623767  TC281097   81.85  292  39  13  283  567  1513  1229  3e-035  145
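The twelve fields listed in the "# Fields" line can be read into named values with a few lines of Python. This is a minimal sketch, not part of the lab materials: it assumes the hit table file is tab-delimited with "#" comment lines, the names `Hit` and `parse_hit_table` are made up for illustration, and the sample row is assembled from values on the hit table slide.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One row of a BLAST hit table (field names are illustrative)."""
    query: str
    subject: str
    pct_identity: float
    align_length: int
    mismatches: int
    gaps: int
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    e_value: float
    bit_score: float


def parse_hit_table(text: str) -> List[Hit]:
    """Parse hit-table text, skipping blank lines and '#' comment lines."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")  # assumption: fields are tab-separated
        if len(f) != 12:
            raise ValueError(f"expected 12 fields, got {len(f)}: {line!r}")
        hits.append(Hit(f[0], f[1], float(f[2]),
                        int(f[3]), int(f[4]), int(f[5]),
                        int(f[6]), int(f[7]), int(f[8]), int(f[9]),
                        float(f[10]), float(f[11])))
    return hits


sample = (
    "# BLASTN 2.2.14 [May-07-2006]\n"
    "CK826720\tTC279296\t100.00\t28\t0\t0\t1\t28\t1666\t1639\t3e-008\t56.0\n"
)
hits = parse_hit_table(sample)
for h in hits:
    # A subject start larger than the subject end means the match
    # lies on the reverse strand of the database sequence.
    strand = "minus" if h.s_start > h.s_end else "plus"
    print(h.query, h.subject, h.e_value, h.bit_score, strand)
```

Once the rows are parsed this way, filtering on `e_value` or sorting by `bit_score` is straightforward.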
BLAST Report Field
Explanations

mismatches: number of nucleotides that don’t match over the
length of the aligned portion

gaps: number of gaps that the matching algorithm introduced into
the alignment. This field can be confusing, as a gap appears
wherever one sequence has bases missing relative to the other
(for example, a truncated or deleted stretch) and opening a gap
gives a better-scoring alignment than forcing mismatches

e-value: the number of alignments with a score at least this good
that would be expected by chance in a database of the size
searched (the lower the value, the more significant the match); it
is strongly influenced by the size of the database searched

bit score: a normalized alignment score (high scores indicate strong
alignments); unlike the e-value, it is unaffected by the size of
the database searched
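The difference between the two statistics can be seen from the standard relationship E = m * n * 2^(-S'), where S' is the bit score, m is the query length and n is the total database length. The sketch below is only an illustration: the function name and the two lengths are hypothetical, and real BLAST uses corrected "effective" lengths, so its reported values will differ somewhat.

```python
def expected_chance_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """E-value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


bit_score = 56.0           # bit score of the 28-base hit on the hit table slide
query_len = 600            # hypothetical query length
small_db = 10_000_000      # hypothetical database size, in bases
large_db = 2 * small_db    # the same database, doubled in size

e_small = expected_chance_hits(bit_score, query_len, small_db)
e_large = expected_chance_hits(bit_score, query_len, large_db)

# The bit score is unchanged, but the E-value doubles with the database.
print(f"E (small db) = {e_small:.2e}")
print(f"E (large db) = {e_large:.2e}")
```

This is why the same alignment can look more or less significant depending on which database it was found in, while its bit score stays comparable across searches.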
Default BLAST Output: Graphical Alignment
of Query Sequence to Subject Sequence in
the Target Database (nucleotide-nucleotide)
Query= gi|44900833|gb|CK827378.1|CK827378 zmrsub1_0B20-006-a11.s4
       zmrsub1 Zea mays cDNA 3', mRNA sequence
       (609 letters)

                                                                  Score    E
Sequences producing significant alignments:                       (bits) Value

TC280752 UP|Q9LLI2_MAIZE (Q9LLI2) Cellulose synthase-8, complete     32   0.34

>TC280752 UP|Q9LLI2_MAIZE (Q9LLI2) Cellulose synthase-8, complete
          Length = 3931

 Score = 32.2 bits (16), Expect = 0.34
 Identities = 22/24 (91%)
 Strand = Plus / Plus

Query: 531 cgaggcggaggacgccgtcgacga 554
           ||||| |||||||| |||||||||
Sbjct: 519 cgaggaggaggacggcgtcgacga 542
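The "Identities = 22/24 (91%)" line can be checked by comparing the two aligned strings position by position. A minimal sketch follows; the function name is made up for illustration, and rounding the percentage down is an assumption chosen to match the 91% shown in the report.

```python
from typing import Tuple


def identities(query_aln: str, subject_aln: str) -> Tuple[int, int, str]:
    """Return (matches, alignment length, midline) for two aligned strings."""
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned strings must have the same length")
    # A position counts as a match only if both letters agree and it is not a gap.
    flags = [q == s and q != "-" for q, s in zip(query_aln, subject_aln)]
    midline = "".join("|" if same else " " for same in flags)
    return sum(flags), len(flags), midline


query = "cgaggcggaggacgccgtcgacga"   # Query: 531..554 in the report
sbjct = "cgaggaggaggacggcgtcgacga"   # Sbjct: 519..542 in the report

matches, length, midline = identities(query, sbjct)
percent = 100 * matches // length    # rounded down (assumption)

print(query)
print(midline)
print(sbjct)
print(f"Identities = {matches}/{length} ({percent}%)")
```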
How Does BLAST Make the
Alignments?
Answer: Local Alignment is based on the “Smith-Waterman Algorithm”

        C  O  E  L  A  C  A  N  T  H
     0  0  0  0  0  0  0  0  0  0  0
  P  0  0  0  0  0  0  0  0  0  0  0
  E  0  0  0  1  0  0  0  0  0  0  0
  L  0  0  0  0  2  1  0  0  0  0  0
  I  0  0  0  0  1  1  0  0  0  0  0
  C  0  1  0  0  0  0  2  0  0  0  0
  A  0  0  0  0  0  0  0  3  2  1  0
  N  0  0  0  0  0  0  0  1  4  3  2

The local alignment produced by this algorithm is:
ELACAN
ELICAN
How to Calculate Smith-Waterman Matrix Values

Matches are assigned a value of +1, mismatches are -1,
gaps (where there is no character to try matching with in one
of the sequences) are also assigned a value of -1

Calculate the match score: sum of the score in the
preceding diagonal cell plus the match value (+1 for a match,
-1 for a mismatch)

Calculate the horizontal gap score: sum of the cell to the left
plus the gap penalty

Calculate the vertical gap score: sum of the cell above plus
the gap penalty

Each cell takes the largest of these three scores, or 0 if all
three are negative, so a cell’s value is never less than 0.
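The cell-by-cell rules above can be sketched in Python. This is a teaching sketch of the textbook Smith-Waterman recurrence with the +1 / -1 / -1 values from the slide, not the code BLAST itself runs (BLAST uses faster heuristics); the function name and the traceback preference for diagonal moves are choices made for illustration.

```python
from typing import Tuple


def smith_waterman(a: str, b: str, match: int = 1, mismatch: int = -1,
                   gap: int = -1) -> Tuple[int, str, str]:
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(b) + 1, len(a) + 1
    H = [[0] * cols for _ in range(rows)]      # first row and column stay 0
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[j - 1] == b[i - 1] else mismatch
            H[i][j] = max(0,                    # never below zero
                          H[i - 1][j - 1] + s,  # diagonal: match or mismatch
                          H[i][j - 1] + gap,    # horizontal gap
                          H[i - 1][j] + gap)    # vertical gap
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a 0 cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[j - 1] == b[i - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[j - 1]); out_b.append(b[i - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i][j - 1] + gap:
            out_a.append(a[j - 1]); out_b.append("-"); j -= 1
        else:
            out_a.append("-"); out_b.append(b[i - 1]); i -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("COELACANTH", "PELICAN")
print(score)    # 4
print(top)      # ELACAN
print(bottom)   # ELICAN
```

The best score of 4 comes from five matches (E, L, C, A, N) and one mismatch (A against I).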
What Types of Questions Can
BLAST Be Used to Answer?

Find genes in a genomic sequence

Predict a protein’s function

Predict the 3-D structure of a protein

Identify members of gene/protein
families
Why install a Standalone
Copy of BLAST?

You don’t need administrator permissions to
run it

Easier to control the output format (you
aren’t stuck with what the website decides
you should have)

More user control (easier to construct
custom BLAST queries)
Flow of Events in a BLAST Session

1. Create a file that contains the query sequences

2. Create a blank file that will receive the BLAST output

3. Format the target database (protein or nucleic acid)

4. Submit the BLAST job using the command prompt

5. Review the BLAST output; formulate new hypothesis
BLAST Installation
Details: Part 1

Insert the provided CD and locate the file
named “ncbi.ini” (this file contains the path
to the BLAST\data subfolder)

Click the “Start” button on your desktop,
then click on “My Computer”, then click on
the C:\ drive

Open the WINDOWS, WINNT, or
WINDOWS NT folder (whichever one is present on your PC)
and drag the ncbi.ini file into it
BLAST Installation
Details: Part 2

Go to C:\Program Files

Drag the BLAST folder on your CD into the
C:\Program Files folder; be careful not to
place it inside another folder that resides in
C:\Program Files.

Open the BLAST folder and click the file
named “blast-2.2.14-ia32-win32” to install the
BLAST application
BLAST Installation Details:
Part 3

Drag the .txt file “maize_genes” from the CD into the “C:\Program
Files\BLAST\data” folder

Create and save a blank text (.txt) file named “query_seqs” in the
“C:\Program Files\BLAST\data” folder

Open the .txt file named “Install_Lab_seqs” from the CD, and copy
the contents; paste these into the file “query_seqs” then save the file

Create and save a .txt file named “output” in the “C:\Program
Files\BLAST\data” folder; this file will receive the BLAST output
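Before running BLAST it can help to confirm that the pasted query file really contains sequences. The sketch below assumes the query file is in FASTA format (a ">" header line followed by sequence lines), which the slides do not state explicitly; the function name and the two short sequences are made up for illustration.

```python
from typing import Dict


def read_fasta(text: str) -> Dict[str, str]:
    """Map each FASTA identifier to its sequence (line breaks removed)."""
    records: Dict[str, str] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first word after '>'.
            current = line[1:].split()[0]
            records[current] = ""
        elif current is None:
            raise ValueError("sequence data found before any '>' header line")
        else:
            records[current] += line
    return records


# Made-up sequences; only the identifiers appear in the slides.
sample = """>CK828121 example header
ACGTACGTAC
GTTT
>CF624012 example header
GGGCCC
"""

records = read_fasta(sample)
print(len(records), "sequences")
for name, seq in records.items():
    print(name, len(seq), "letters")
```

To check the real file, the text could be read with `open(r"C:\Program Files\BLAST\data\query_seqs.txt").read()` and passed to the same function.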
BLAST Installation Details:
Part 4

Move the following files from the
“C:\Program Files\BLAST\bin” folder into
the “C:\Program Files\BLAST\data” folder:
“formatdb”, “blastall”, “blastclust”, and
“megablast” (these are the “executable”
files you will need to make BLAST run)

Click Start, select “All Programs”, then
select “Accessories”; click the “Command
Prompt” icon to open a “command line”
session
Get Ready to BLAST
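At the command prompt, first move into the data
folder by typing:
cd "C:\Program Files\BLAST\data"
Then type the following:
formatdb -i maize_genes.txt -p F -o F
(this command will format the target database,
maize_genes.txt, so that it can be searched by BLAST;
“-p F” tells formatdb that the database is nucleotide
rather than protein). Type the dashes as plain
hyphens.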
Using Standalone BLAST

At the command prompt, type the following:
C:\Program Files\BLAST\data>megablast -i query_seqs.txt -d maize_genes.txt -o output.txt -F "m D" -D 3

Press Enter, and BLAST will start
processing the command

When the program terminates (you will get a new
command prompt), open the output.txt file to inspect
the results.
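The two command lines from the slides can also be assembled and launched from a script, which is convenient when the same search is repeated. This is a sketch under assumptions: the function names are invented for illustration, the flags are copied from the slides, and the executables are assumed to sit in the data folder with an ".exe" extension.

```python
import subprocess
from pathlib import Path
from typing import List

DATA_DIR = r"C:\Program Files\BLAST\data"   # folder used throughout the lab


def formatdb_command(database: str) -> List[str]:
    """Arguments for formatting a nucleotide database, as on the slide."""
    return ["formatdb", "-i", database, "-p", "F", "-o", "F"]


def megablast_command(query: str, database: str, output: str) -> List[str]:
    """Arguments for the megablast search, as on the slide."""
    return ["megablast", "-i", query, "-d", database,
            "-o", output, "-F", "m D", "-D", "3"]


def run_in_folder(command: List[str], folder: str) -> bool:
    """Run the command if its .exe exists in the folder; report whether it ran."""
    exe = Path(folder) / (command[0] + ".exe")   # assumption: Windows .exe name
    if not exe.exists():
        return False
    subprocess.run([str(exe)] + command[1:], cwd=folder, check=True)
    return True


format_cmd = formatdb_command("maize_genes.txt")
search_cmd = megablast_command("query_seqs.txt", "maize_genes.txt", "output.txt")

for cmd in (format_cmd, search_cmd):
    ran = run_in_folder(cmd, DATA_DIR)
    print(" ".join(cmd), "->", "ran" if ran else "skipped (executable not found)")
```

Passing the arguments as a list avoids quoting problems with the space in "Program Files" and in the "m D" filter setting.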
Different Types of BLAST

There are 6 types of BLAST available:
megaBLAST: very rapid (~12-fold faster than BLASTN),
DNA query against DNA databases
BLASTN: same set-up as megaBLAST, slower, but more
options for query construction
BLASTP: protein used to search protein database
BLASTX: translated DNA search of protein database
TBLASTN: protein used to search translated DNA
database
TBLASTX: DNA translated in all 6 frames versus a
translated DNA database
We’ll look more at these this afternoon
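The phrase "translated in all 6 frames" means reading the DNA as codons starting at each of three offsets on the forward strand and three on the reverse-complement strand. A minimal sketch using the standard genetic code follows; the function names are made up for illustration, and this is not how BLAST itself is implemented.

```python
from typing import List

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the string."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq: str) -> str:
    """Translate complete codons; unknown codons become 'X', '*' is a stop."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq: str) -> List[str]:
    """Frames +1, +2, +3 on the given strand, then -1, -2, -3 on the reverse."""
    strands = (seq, reverse_complement(seq))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]


frames = six_frames("ATGGCC")
for label, protein in zip(["+1", "+2", "+3", "-1", "-2", "-3"], frames):
    print(label, protein)
```

BLASTX and TBLASTX apply this kind of translation to the DNA query, and TBLASTN and TBLASTX apply it to the DNA database, before comparing at the protein level.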