4.blast-parameters

Tweaking BLAST
Although you normally see BLAST as a web page with text boxes, tick boxes and so on,
it is actually a command-line program that can be run just by typing the
appropriate command and options, e.g.
>blastall -p blastn -i my_sequence.fasta -d refseq
This is the simplest form: the basic program ‘blastall’ takes a number of
options, or parameters, each written as a flag of the form -x followed by its value.
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
There are many other parameters; any that are not given explicitly take a
default value appropriate to the BLAST flavour requested.
E.g. for -W <word size>, blastn uses -W 11, whereas blastx uses -W 3.
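As a rough illustration of how flags and values fit together, the sketch below assembles a blastall command line in Python. The helper name build_blastall_command is hypothetical, only flags mentioned in this handout are used, and nothing is actually run.

```python
def build_blastall_command(program, query_file, database, **options):
    """Return a blastall argument list (hypothetical helper; runs nothing)."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    # Extra single-letter options, e.g. W=11, become "-W 11".
    for flag, value in options.items():
        command += ["-" + flag, str(value)]
    return command


cmd = build_blastall_command("blastn", "my_sequence.fasta", "refseq", W=11)
print(" ".join(cmd))  # blastall -p blastn -i my_sequence.fasta -d refseq -W 11
```

Leaving out W here would simply let blastall fall back to its own default for the chosen flavour.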
There are also some options that appear on the web pages that are not really
parameters but manage the job in a similar way. One of the most useful of these
is on the NCBI blast pages where you can use Entrez queries or pick from an
organism list to modify your search.
The Many Parameters of BLAST
There are a great many parameters, but most are way too obscure even for
die-hard techies like me! Very few of them are regularly useful at anything
other than their default value, but just occasionally they are essential.
Here are some of the ones that I have used:
Flag  Meaning
-e    max expected value
-m    output format (graphical or tabular/spreadsheet)
-F    filter query sequence for low complexity (default TRUE)
-U    use only upper case regions of query (default FALSE)
-G    gap opening cost
-E    gap extension cost
-q    nucleotide mismatch penalty (BLASTx uses matrices)
-r    nucleotide match reward
-b    number of matching sequences to report
-g    allow gaps (default TRUE)
-W    word size
-z    effective database size (removes effect of actual database size!)
-S    query strands to search (default both directions)
-l    restrict database sequences to given list of ‘gi’ numbers
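To see why the effective database size matters, it may help to recall the standard relationship between a bit score S' and the expected number of chance hits, E = m × n × 2^(-S'), where m is the query length and n the database length. The sketch below applies that formula; the function name and the numbers are illustrative assumptions, not values from a real search.

```python
def expected_hits(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least bit_score.

    Uses E = m * n * 2**(-S'), with m the query length and n the database length.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# The same alignment looks less significant against a larger database.
small_db = expected_hits(40, 500, 1_000_000)
large_db = expected_hits(40, 500, 1_000_000_000)
print(small_db, large_db)
```

Fixing the database size to a single value, as -z does, is one way to keep E-values comparable between searches run against databases of different sizes.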
BLAST Parameters Exercises
1. BLASTn vs. BLASTx
Open the file example-sequences.html, copy the sequence:
>blastn-vs-blastx
This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page/Nucleotide-nucleotide BLAST (blastn)
section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options.
Then hit [format] and wait for the results to appear in a new page.
(Hint: if you paste the sequence definition line ‘>name’ into the box as well, your
results will be labelled accordingly, which can be useful.)
Now repeat but go to the TRANSLATED BLAST section, and BLAST against the
nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other
vertebrates?
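When thinking about this question, it can help to remember that BLASTx compares the six conceptual translations of the nucleotide query against protein sequences. The following is a minimal sketch of that translation step only, assuming the standard genetic code; the function names are hypothetical and this is not how BLAST itself is implemented.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, with codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate starting at the first base; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Return translations for frames +1..+3 and -1..-3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# Toy sequence, not one of the workshop examples.
for name, protein in six_frame_translations("ATGGCCAAGTAA").items():
    print(name, protein)
```

Because several codons encode the same amino acid, two sequences can differ at the nucleotide level and still translate to similar proteins, which is why the protein-level search can pick up more distant relatives.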
Results for Exercise 1.
BLASTn
BLASTx
2. Low complexity filtering
Open the file example-sequences.html, copy the sequence:
>low-complexity-filtering-A
This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page/TRANSLATED BLAST section/BLASTx.
Paste your sequence into the box.
Carefully UNTICK the “Choose filter [ ] Low complexity” BOX in the second
section. And then run BLASTx against the nr database.
What do you feel about these alignments?
Re-run, but leave the low-complexity filter ON this time.
Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C.
C is an especially interesting case – what can we deduce about the cDNA
sequence? Annotators beware!
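To get a feel for what a low-complexity filter does, here is a deliberately simple sketch that masks short tandem repeats with N. It is not the DUST or SEG filtering that BLAST applies; the regular expression, the threshold of five repeat units and the function name are all assumptions made for illustration.

```python
import re

# A unit of 1-3 bases repeated at least five times in a row (arbitrary threshold).
TANDEM_REPEAT = re.compile(r"([ACGT]{1,3})\1{4,}")


def mask_tandem_repeats(dna):
    """Replace each tandem-repeat run with the same number of N characters."""
    return TANDEM_REPEAT.sub(lambda m: "N" * len(m.group(0)), dna.upper())


# Toy sequence with a ten-unit AT repeat in the middle.
print(mask_tandem_repeats("GGC" + "AT" * 10 + "CCG"))
```

Masked regions cannot seed or extend alignments, so hits that rest only on the repeat disappear while hits to the rest of the sequence survive.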
Results for Exercise 2A (OFF)
BLASTx – low complexity filtering OFF
Results for Exercise 2A (ON)
BLASTx – low complexity filtering ON
Results for Exercise 2B
ON
OFF
Results for Exercise 2C
ON
OFF
There is a sequence error in the cDNA, an extra G at position 117:
                       (117)
cDNA     AGAAAAGAAGAAACATGGCAATGGATCAGAA
         |||||||||||||||| ||||||||||||||
Genomic  AGAAAAGAAGAAACAT-GCAATGGATCAGAA
3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search
to more specific items.
For instance to find only matching sequences in fruit fly, enter ‘Drosophila
melanogaster[ORGN]’ in the Limit by entrez query box in the second section
(you can also select the organism from the adjacent drop-down list).
To combine items use logical AND, OR or NOT.
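As a small illustration of combining items, the sketch below joins organism names into a single query string using OR and the [ORGN] field tag shown above. The helper name is hypothetical; the resulting text is what you would type into the Limit by entrez query box.

```python
def organism_query(*organisms):
    """Join organism names into one Entrez query with OR (hypothetical helper)."""
    return " OR ".join(f"{name}[ORGN]" for name in organisms)


print(organism_query("Drosophila melanogaster", "Xenopus tropicalis"))
# Drosophila melanogaster[ORGN] OR Xenopus tropicalis[ORGN]
```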
Open the file example-sequences.html.
Copy the sequence >cyclin-D1-Xt and go to the NCBI BLAST Home Page/
TRANSLATED BLAST section/BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good
match to cyclin-D1. At what E-value do we expect we are no longer looking at
cyclins? Try running the search again with that E-value as a limit…
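If you save results in tabular form, applying an E-value limit after the fact is straightforward. The sketch below assumes the 12-column tab-delimited layout of BLAST tabular output (query, subject, % identity, alignment length, mismatches, gap openings, query start/end, subject start/end, E-value, bit score); the column names, the function name and the example rows are invented for illustration and are not real search results.

```python
import csv
import io

COLUMNS = ["query", "subject", "percent_identity", "alignment_length", "mismatches",
           "gap_openings", "query_start", "query_end", "subject_start", "subject_end",
           "evalue", "bit_score"]


def hits_below_evalue(tabular_text, max_evalue):
    """Yield hits (as dicts) whose E-value is at most max_evalue."""
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        hit["evalue"] = float(hit["evalue"])
        if hit["evalue"] <= max_evalue:
            yield hit


# Invented example rows, not real search results.
EXAMPLE = (
    "cyclin-D1-Xt\tsubject_A\t85.0\t200\t30\t0\t1\t600\t1\t200\t1e-80\t290\n"
    "cyclin-D1-Xt\tsubject_B\t30.0\t90\t63\t2\t10\t280\t5\t94\t0.5\t31.2\n"
)
for hit in hits_below_evalue(EXAMPLE, 1e-10):
    print(hit["subject"], hit["evalue"])
```

With a limit of 1e-10 only the first invented row is kept; choosing the real cut-off is the point of the exercise.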
4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html.
Also open the NCBI BLAST Home Page/SPECIAL – Align two sequences section.
There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window
(i) Run the default comparison, which should be BLASTn. Note the alignment.
Now run again using tBLASTx – what does this do to our understanding of the
relationship between these two sequences? Are they homologs, orthologs or
paralogs – or none of these?
(ii) Revert to BLASTn, and try varying the values for mismatch penalties and
gapping. Start by reducing the mismatch penalty to -1, then try reducing the gap
open and gap extension penalties...
What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…
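To make the effect of these penalties concrete, the sketch below scores an alignment that has already been made, under different reward, mismatch and gap settings. It assumes a gap of length k costs gap_open + k × gap_extend; the function name, the default values and the toy sequences are illustrative assumptions, and real BLAST also has to find the alignment, which this does not attempt.

```python
def alignment_score(seq1, seq2, reward=1, mismatch=-2, gap_open=5, gap_extend=2):
    """Score two pre-aligned, equal-length strings; '-' marks a gap position."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap = False
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            # Assumed cost model: a gap of length k costs gap_open + k * gap_extend.
            score -= gap_extend if in_gap else gap_open + gap_extend
            in_gap = True
        else:
            score += reward if a == b else mismatch
            in_gap = False
    return score


# Toy alignments: one mismatch, then one two-base gap.
print(alignment_score("ACGTACGT", "ACGAACGT"))               # 5
print(alignment_score("ACGTACGT", "ACGAACGT", mismatch=-1))  # 6
print(alignment_score("ACGTTACG", "ACG--ACG"))               # -3
print(alignment_score("ACGTTACG", "ACG--ACG", gap_open=2, gap_extend=1))  # 2
```

Softer penalties let an alignment carry on through mismatches and gaps that would otherwise end it, which suits more diverged sequences but also lets more chance matches through.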
Results for Exercise 4 (i)
BLASTn
tBLASTx
Results for Exercise 4 (ii)
Mismatch penalty = -2 (default)
Mismatch penalty = -1
5. Word Size
Go to: informatics.gurdon.cam.ac.uk/online/workshops/useful-web-sites.html
Open example-sequences.html
Copy the sequence >morpholino and go to the NCBI BLAST Home Page.
Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.
Check OFF the low complexity filter, and then run the search.
Now re-run the search, setting the following parameters:
Low complexity    OFF
Expect            100
Word Size         7
Other advanced    -q -1  (mismatch penalty -1 instead of default -3)
What difference does this make?
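One way to think about the word size: BLASTn only starts an alignment where the query and a database sequence share an exact match of at least W bases. The sketch below counts such shared words for a short, invented oligo and a target that differs at two positions; the sequences and the function name are assumptions, and real BLAST goes on to extend these seeds and searches both strands.

```python
def shared_words(query, subject, word_size):
    """Return the set of exact words of length word_size found in both sequences."""
    def words(seq):
        return {seq[i:i + word_size] for i in range(len(seq) - word_size + 1)}
    return words(query) & words(subject)


# Invented 24-base oligo and a target differing at two positions.
QUERY = "ACGTTGCAAGCTTAGGCATCCGAT"
SUBJECT = "ACGTTGCATGCTTAGGGATCCGAT"

print(len(shared_words(QUERY, SUBJECT, 11)))  # 0: no seed, so this match is never found
print(len(shared_words(QUERY, SUBJECT, 7)))   # some seeds exist at the smaller word size
```

For a sequence as short as a morpholino, a couple of mismatches can leave no exact 11-base word at all, which is why near-matches only show up once the word size is reduced.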