Transcript Quiz

Blast
1
Review
•
•
•
Basic Local Alignment Search Tool
Initially developed and used to search genomic sequence
– Ex) I have a short piece of mRNA/EST that I sequenced -- and I'd like to
see where it is in a genome
– Especially useful before genome browsers
– Before browsers and annotation, used to "check" if the region that
contained my gene of interest had been sequenced yet
– Still very useful
Initial application
– query sequence
– target database
2
Blast output: Blast of Human Promoter VS Mouse Promoter in BMP4
Query= range=chr14:52413865-52418867 5'pad=0 3'pad=0 revComp=FALSE
strand=? repeatMasking=none
(5003 letters)
>mm3_dna range=chr14:37718906-37723909 5'pad=0 3'pad=0 revComp=FALSE
strand=? repeatMasking=none
Length = 5004
Score = 387 bits (195), Expect = e-110
Identities = 258/278 (92%), Gaps = 2/278 (0%)
Strand = Plus / Plus
Query: 1
catgcaccgactagtcgccgtaccttccaaaaatacccatgggagtctgggcttccctga 60
|||||||| ||||||| | |||||||||||||||||||||||||||||||||||||||||
Sbjct: 3285 catgcaccaactagtctctgtaccttccaaaaatacccatgggagtctgggcttccctga 3344
Query: 61
gtttagtgtaactgctgcccaaactgatgattaaaacactgagtaatcaccatttaccct 120
|||||| |||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 3345 gtttagaataactgctgcccaaactgatgattaaaacactgagtaatcaccatttacccc 3404
Query: 121
gagtaattagagataatgaaacacctctaaaatcaggttatggagaaaggagctgttgga 180
||||||||||||||||||||||||||||||||||||||||||||||| ||||||||||||
Sbjct: 3405 gagtaattagagataatgaaacacctctaaaatcaggttatggagaagggagctgttgga 3464
Query: 181
tcggtttaaaaggtggccttccataaactgttaaaagatttccaaacgttgagaaacagg 240
| |||||||| ||||||||||| |||||||||||| |||| ||||||||| ||||||||
Sbjct: 3465 ttggtttaaagggtggccttccgtaaactgttaaaggattcccaaacgtt--gaaacagg 3522
Query: 241
ctgtgtgcagaactgtgtgccagagagagatacctctt 278
||||| |||| ||||||||| || ||||||||||||
Sbjct: 3523 ctgtgagcagggctgtgtgcccgaaggagatacctctt 3560
3
Review
• Highly refined and specialized
– nucleotide-nucleotide (blastn), protein-protein (blastp), translated
nucleotide against protein (blastx), protein against translated nucleotide
(tblastn), translated nucleotide against translated nucleotide (tblastx)
• Variations
– bl2seq
– Wu-BLAST
4
Blast Algorithms
• Remove regions of low complexity
• Make k-letter words (k-mer) of query
sequence (a hash?) Default for blastn is
w=11
• Compare k-mers in query sequence with
pre-computed k-mers in target sequence
• Note, the k-mers in the target sequence (
database) are pre-computed --> significant
speed up
5
How does BLAST work?
An alignment begins with the query
sequence converted into "words" (default
for blastn is w=11)
ATGCCCTGACGCAATGACCTTAAGGAT
ATGCCCTGCCG
word1
TGCCCTGCCGC
word2
GCCCTGCCGCA
etc
CCCTGACGCAA
CCTGACGCAAT
:
:
6
Hash-Based Alignment
base-10 numbers
5805 = 5*10^3 + 8*10^2 + 0*10^1+5*10^0
k=8
ATGCCTGGGCT
A=0, C=1, G=2, and T=3 (base 4 number)
ATGCCTGG = 0*4^7+3*4^6+2*4^5+1*4^4+1*4^3+3*4^2+2*4^1+2*4^0
= 14714
Now we can compare “chunks” of sequence much faster - speed
increase by factor of 8
Can pre-compute hashes for entire genome, and only compare hashes
Premise for popular alignment tools – BLAST, BLAT,and UIcluster
7
Appendix 1: The Files Produced by Formatdb (From formatdb.html)
-----------------------------------------Using formatdb without the "-o T" indexing option results in three
BLAST database files (.nhr, .nin, ,nsq). Using the "-o T" option will result
in additional files.
If gi's are present in the FASTA definition lines of the source file, there
will be four additional files created (.nsd, nsi, nni, nnd). These are
ISAM indices for mapping a sequence identifier to a particular sequence in
the BLAST database. If gi's are not use there will be only two additional
files created (.nsd, .nsi).
These files are listed below for both an example nucleotide and a protein
sequence database. The actual sequence data is stored in the files with
extension "nsq" or "psq". The compression ratio for these files is
about 4:1 for nucleotides and 1:1 for protein sequence source files.
Extension Content
Format
--------------------------------------------Nucleotide database formatted without "-o T"
nhr
nin
nsq
deflines
binary
indices
binary
sequence data binary
8
Blast Works
After a word alignment between query, and
target, blast attempts to extend the
alignment in both directions.
The alignment continues which contributes
to the alignment "score." Once the score
falls below a threshold, the extension
halts.
The resulting alignment is called an HSP, or
high scoring pair.
9
BLAST Example
• Processing of BLAST output is not particularly special in terms of
Bioinformatics Techniques
• It is useful for comparing sequences and identifying "similarity"
– similarity – a measure of how many bases/amino acids are shared
between 2 sequences
– At the 2 extremes:
• 100 "percent identity" means 2 sequences are identical
• 0 "percent identity" means 2 sequences share no residues
• note that 2 random nucleotide sequences, without gaps will have 25%
identity
• 2 sequences may be 80% identical at the aa level, but only 55% at the nt
level
• similarity does not describe length
–
–
–
–
–
1 pair is 98% identical for 8 nucleotides
1 pair is 75% identical for 100 nucleotides
Which pair of sequences is significant and more closely related?
Measures exist that attempt to quantify these issues
Not covered (Statistical Methods for Bioinformatics)
10
BLASTN 2.2.6 [Apr-09-2003]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.
Query=
(500 letters)
Database: blastdb/yeast.nt
17 sequences; 12,155,026 total letters
Searching.done
Score
E
(bits) Value
Sequences producing significant alignments:
gi|6324971|ref|NC_001148.1|
gi|6322623|ref|NC_001143.1|
gi|7839148|ref|NC_001136.2|
gi|6323989|ref|NC_001146.1|
gi|6323501|ref|NC_001145.1|
gi|7276232|ref|NC_001137.2|
Saccharomyces
Saccharomyces
Saccharomyces
Saccharomyces
Saccharomyces
Saccharomyces
cerevisiae
cerevisiae
cerevisiae
cerevisiae
cerevisiae
cerevisiae
chromosome
chromosome
chromosome
chromosome
chromosome
chromosome
...
...
...
...
...
...
32
32
32
30
30
30
1.2
1.2
1.2
4.7
4.7
4.7
>gi|6324971|ref|NC_001148.1| Saccharomyces cerevisiae chromosome XVI,
complete chromosome sequence
Length = 948061
Score = 32.2 bits (16), Expect = 1.2
Identities = 16/16 (100%)
Strand = Plus / Minus
Query: 1
tctggcgtgacaccag 16
||||||||||||||||
Sbjct: 511257 tctggcgtgacaccag 511242
11
Query, Subject,
HSP (high scoring pair)
>gi|6226515|ref|NC_001224.1| Saccharomyces cerevisiae mitochondrion,
complete genome
Length = 85779
Score = 40.1 bits (20), Expect = 0.050
Identities = 20/20 (100%)
Strand = Plus / Plus
Query: 1562
ttaattattaaaaataataa 1581
||||||||||||||||||||
Sbjct: 74540 ttaattattaaaaataataa 74559
Score = 40.1 bits (20), Expect = 0.050
Identities = 20/20 (100%)
Strand = Plus / Plus
Query: 1562 ttaattattaaaaataataa 1581
||||||||||||||||||||
Sbjct: 2168 ttaattattaaaaataataa 2187
12
BLAST
www.ncbi.nih.gov (web interface)
Highlight parameters
output in html/text
word size
number of results
quality of hit
>hg16_dna range=chr14:52413865-52418867 5'pad=0 3'pad=0 revComp=FALSE strand=? repeatMasking=none
CATGCACCGACTAGTCGCCGTACCTTCCAAAAATACCCATGGGAGTCTGG
GCTTCCCTGAGTTTAGTGTAACTGCTGCCCAAACTGATGATTAAAACACT
GAGTAATCACCATTTACCCTGAGTAATTAGAGATAATGAAACACCTCTAA
AATCAGGTTATGGAGAAAGGAGCTGTTGGATCGGTTTAAAAGGTGGCCTT
CCATAAACTGTTAAAAGATTTCCAAACGTTGAGAAACAGGCTGTGTGCAG
AACTGTGTGCCAGAGAGAGATACCTCTTGGGCTGTTAAAAGGTGAATTTC
CTTCCAAAACATTCTTAAAGTGCGTTTTGTAGAACACCTTGGTGATGACC
CTGAGGTAGACCCCAGTAAATGAGCTCACTTCCCTCCTCCCCCCAAGGCC
TAAAGACAGAGGTGGCCTGCAGACAGGCTGGGGCCACGTTTTTTACAGTT
AGAAAAATCTGTTTCTAGCTCAGAAGCCGTCTTAAGAACCGACACAGCAG
TTATTTTTGTTGGTGTTATGGTTGTTGTTGATTTTTTTGTTCATTTTTGG
TGTACCTAAGAGTCCTGACACTCTGTCTTTCCAAGGAGATAACCACAGAA
•
•
•
application
database
sequence
13
Setting up Standalone BLAST for UNIX:
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=Basically, there are three steps needed to setup the Standalone BLAST
executable for the UNIX platform.
1) Download the UNIX binary, uncompress and untar the file. It is
suggested that you do this in a separate directory, perhaps called
"blast".
2) Create a .ncbirc file. In order for Standalone BLAST to operate, you
have will need to have a .ncbirc file that contains the following lines:
[NCBI]
Data="path/data/"
Where "path/data/" is the path to the location of the Standalone BLAST
"data" subdirectory. For Example:
Data=/root/blast/data
The data subdirectory should automatically appear in the directory where
the downloaded file was extracted. Please note that in many cases it may
be necessary to delimit the entire path including the machine name and
or the net work you are located on. Your systems administrator can help
you if you do not know the entire path to the data subdirectory.
Make sure that your .ncbirc file is either in the directory that you
call the Standalone BLAST program from or in your root directory.
3) Format your BLAST database files. The main advantage of Standalone
BLAST is to be able to create your own BLAST databases. This can be done
with any file of FASTA formatted protein or nucleotide sequences. If you
are interested in creating your own database files you should refer to
the sections "Non-redundant defline syntax" and "Appendix 1: Sequence
Identifier Syntax" of the README in the BLAST database directory
(ftp://ftp.ncbi.nih.gov/blast/db/). You can also refer to the FASTA
description available from the BLAST search pages
(http://www.ncbi.nlm.nih.gov/BLAST/fasta.html).
14
Installing BLAST on "old" CSS Itaniums
http://www.ncbi.nih.gov/Ftp/index.html
Because I want to install on a remote machine, I'll use ftp – a program that stands for 'file transfer protocol'
Allows me to access files by command line
Another version is ncftp – with more features
file completion
download stats
ssh [email protected] (this will log you into an itanium that is still supported by NCBI).
mkdir ~/blast
cd blast
ncftp ftp.ncbi.nih.gov (or ftp: login ftp, passwd [email protected])
bin
cd blast/executables/LATEST
get blast-2.2.10-ia32.linux.tar (this should work on an itanium)
## These are out of date: get blast-2.2.6-ia32-linux.tar.gz (linux) (or blast-2.2.6-hppa-hpux.tar.gz) ~ 15 meg
cd ../../../..
cd blast/db/FASTA/
get yeast.nt.gz (~ 3.5 meg)
quit
gunzip blast-2.2.10-ia32.linux.tar.gz
tar –xvf blast*
[for HPUX: cd blast-2.2.6]
[for HPUX, : mv ../yeast.nt.gz .]
gunzip yeast.nt.gz
mkdir blastdb
mv yeast.nt blastdb
./formatdb -i blastdb/yeast.nt -p F
15
Linux Example
•
•
•
•
•
•
Download binary (for CSS machines – "x64"):
ftp://ftp.ncbi.nih.gov/blast/executables/release
mkdir ~/blast
mv blast-2.2.18-x64-linux.tar.gz ~/blast
gunzip blast-2.2.18-x64-linux.tar.gz
tar –xvf blast-2.2.18-x64-linux.tar.gz
16
Linux Example
• Download database(s) of choice
• ftp://ftp.ncbi.nih.gov/blast/db/
• ftp://ftp.ncbi.nih.gov/blast/db/nt.00.tar.gz
(nt.00.tar.gz-nt.05.tar.gz)
Note – these are too LARGE for your CSS
account
• mkdir ~/blast/db
• chmod -R 755 db/
• mv nt.*.tar.gz ~/blast/db
17
Create Our Own DB
• Download chrX:
ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/
CHR_X/hs_ref_chrX.fa.gz
• mv ~/Deskstop/hs_ref_chrX.fa.gz
~/blast/db
• gunzip hs_ref_chrX.fa.gz
• ../blast-2.2.18/bin/formatdb -i
hs_ref_chrX.fa -p F -o T -n hs_ref_chrX.fa
• This will create approx 8 files.
18
Install BLAST locally
• Have all we need to run locally
• Note, there are further environment
settings to make it easier to run from any
location on the system – that is in the
readme
• Need a sequence in a file to blast against
database
19
Install Blast Locally
Create file called dmd.hs (in ~/blast) with the
following sequence:
>gi|5032280|ref|NM_000109.2| Homo sapiens dystrophin (DMD), transcript Dp427c, mRNA
CGTTAAATGCAAACGCTGCTCTGGCTCATGTGTTTGCTCCGAGGTATAGGTTTTGTTCGACTGACGTATC
AGATAGTCAGAGTGGTTACCACACCGACGTTGTAGCAGCTGCATAATAAATGACTGAAAGAATCATGTTA
GGCATGCCCACCTAACCTAACTTGAATCATGCGAAAGGGGAGCTGTTGGAATTCAAATAGACTTTCTGGT
TCCCAGCAGTCGGCAGTAATAGAATGCTTTCAGGAAGATGACAGAATCAGGAGAAAGATGCTGTTTTGCA
CTATCTTGATTTGTTACAGCAGCCAACTTATTGGCATGATGGAGTGACAGGAAAAACAGCTGGCATGGAA
cd ~/blast
./blast-2.2.18/bin/blastall -p blastn -i dmd.hs -d
db/hs_ref_chrX.fa -o dmd.hs.out
./blast-2.2.18/bin/blastall -p blastn -i dmd.hs -d
db/hs_ref_chrX.fa -o dmd.hs.out –e 0.00001
20
BLASTN 2.2.15 [Oct-15-2006]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.
Query= gi|5032280|ref|NM_000109.2| Homo sapiens dystrophin (DMD),
transcript Dp427c, mRNA
(350 letters)
Database: hs_ref_chrX.fa
22 sequences; 152,577,922 total letters
Searching..................................................done
Score
Sequences producing significant alignments:
E
(bits) Value
ref|NT_011757.15|HsX_11914 Homo sapiens chromosome X genomic con... 694 0.0
ref|NT_011638.12|HsX_11795 Homo sapiens chromosome X genomic con... 40 0.042
ref|NT_011681.15|HsX_11838 Homo sapiens chromosome X genomic con... 38 0.17
ref|NT_079573.3|HsX_79638 Homo sapiens chromosome X genomic cont... 38 0.17
ref|NT_011669.16|HsX_11826 Homo sapiens chromosome X genomic con... 36 0.65
ref|NT_011651.16|HsX_11808 Homo sapiens chromosome X genomic con... 36 0.65
ref|NT_011630.14|HsX_11787 Homo sapiens chromosome X genomic con... 34 2.6
ref|NT_113965.1|HsX_111684 Homo sapiens chromosome X genomic con... 34 2.6
ref|NT_011786.15|HsX_11943 Homo sapiens chromosome X genomic con... 34 2.6
21
>ref|NT_011757.15|HsX_11914 Homo sapiens chromosome X genomic contig, reference
assembly
Length = 34879939
Score = 694 bits (350), Expect = 0.0
Identities = 350/350 (100%)
Strand = Plus / Minus
Query: 1
cgttaaatgcaaacgctgctctggctcatgtgtttgctccgaggtataggttttgttcga 60
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 31139409 cgttaaatgcaaacgctgctctggctcatgtgtttgctccgaggtataggttttgttcga 31139350
Query: 61
ctgacgtatcagatagtcagagtggttaccacaccgacgttgtagcagctgcataataaa 120
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 31139349 ctgacgtatcagatagtcagagtggttaccacaccgacgttgtagcagctgcataataaa 31139290
ETC…
22
What Next?
• We could use BLAST to do something
useful
• Take 400,000 sequence reads, and use
blast to identify locations of read
sequences in genome
• BLAST each sequence individually
• Cluster sequences
• Generate report
• How would we do all of that?
23
Perl
• A program to read thru a big sequence file
– and strip out each individual read
• Use Perl as a "wrapper" – to run Blast
• Parse results (or possibly use a PM to
parse results)
• Process, aggregate, summarize, compare
to control sequence – and report
generation
24
Alternatively …BLAT
• BLAST-Like Alignment Tool
• http://genome.ucsc.edu/FAQ/FAQblat
25
From the FAQ
http://genome.ucsc.edu/
From a practical standpoint, Blat has several advantages
over BLAST:
• speed (no queues, response in seconds) at the price of
lesser homology depth
• the ability to submit a long list of simultaneous queries in
fasta format
• five convenient output sort options
• a direct link into the UCSC browser
• alignment block details in natural genomic order
• an option to launch the alignment later as part of a
custom track
26
Blast Suite
http://genome.ucsc.edu/goldenPath/help/blatSpec.html
Blat produces two major classes of alignments: at the DNA level between two sequences
that are of 95% or greater identity, but which may include large inserts, and at the
protein or translated DNA level between sequences that are of 80% or greater
identity and may also include large inserts.
The main programs in the blat suite are:
* gfServer – a server that maintains an index of the genome in memory and uses the
index to quickly find regions with high levels of sequence similarity to a query
sequence.
* gfClient – a program that queries gfServer over the network, and then does a
detailed alignment of the query sequence with regions found by gfServer.
* blat –combines client and server into a single program, first building the index, then
using the index, and then exiting.
* webBlat – a web based version of gfClient that presents the alignments in an
interactive fashion.
27
Example
• Download binaries (or source files and
compile)
• http://genometest.cse.ucsc.edu/~kent/exe/
• (for CSS – use the Linux version)
blat database query output.psl –out=blast
• database and query files may be either: .fa
, .nib or .2bit file
28
system
system LIST
-- executes any program on the system
#!/usr/bin/perl
system("/mnt/r0-blastdb/blast-bin/blastall –p blastn –d
/mnt/r0-blastdb/FormattedDBs/nt – i test.txt –o test.out");
# this assumes test.txt and test.out are in same directory as
your program
.
29
back ticks
#!/usr/bin/perl
$output = `/mnt/r0-blastdb/blast-bin/blastall –p
blastn –d /mnt/r0-blastdb/FormattedDBs/nt – i
test.txt –o test.out`;
print "$output\n";
# command is passed on, and interpreted by the
shell
# output of command returned
30