Running BLAST through Perl

Download Report

Transcript Running BLAST through Perl

Running BLAST through Perl
Introduction to BLAST
•
•
•
•
BLAST is a common sequence search tool that can be used for either
nucleic acid or protein sequences.
BLAST = Basic Local Alignment Search Tool
The original publication was: Altschul, S.F., Gish, W., Miller, W., Myers, E.W.
& Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol.
215:403-410.
Current reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A.
Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman
(1997), "Gapped BLAST and PSI-BLAST: a new generation of protein
database search programs", Nucleic Acids Res. 25:3389-3402.
•
We are using a stand-alone version, meaning that the code and databases
for BLAST reside on our computer. We obtained the code from the NCBI,
and it resides at /home/share/blast.dir.
•
There is also a widely used web-based version at the NCBI:
http://www.ncbi.nih.gov/BLAST/ , which is very useful for searches with
single sequences against the entire collection of known DNA sequences in
GenBank, among other things. Also it has a very good tool for giving a nice
visual comparison of two sequences. However, the web version is very
difficult to use if you need to do multiple queries.
What BLAST does
• Very useful reference: the NCBI BLAST course:
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
• Also, a section of the NCBI Handbook concerning BLAST:
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.chapter.610
• BLAST is a heuristic search method, meaning that it makes
assumptions about the data based on experience. This implies that
it is not guaranteed to find the best alignment in all possible
circumstances. It sacrifices some accuracy for a great increase in
speed.
• BLAST is based on the Smith-Waterman algorithm, which is slow
but guaranteed to get the best possible alignment given certain input
parameters. We have a Smith-Waterman implementation on biolinx
called “swat”.
BLAST Algorithm
• BLAST is a local search algorithm: it starts with small perfect
matches and tries to extend them.
• First, the query sequence is broken down into a dictionary of
“words”, all possible sequences of a certain size. For proteins, the
default word size is 3 amino acids. For DNA, the default word size is
11 nucleotides.
• The query words are then matched with a similar dictionary for the
database.
• BLAST requires 2 non-overlapping word hits within a certain
distance of each other to proceed. (Originally each individual word
hits was searched, but using 2 hits proved to be faster with only a
small observed loss of accuracy).
• All regions with 2 hits are then extended using the Smith-Waterman
algorithm to find the maximal extent of the alignment, starting at the
center of the better of the 2 matched words and extending in both
directions.
Substitution Matrices
•
•
When searching for imperfect matches, it is necesary to have penalties for base or
amino acid mismatches.
For nucleic acids:
–
–
–
•
•
the simplest approach is to assign the same penalty for any mismatch.
a more sophisticated approach is to assign a different (larger) penalty for transversions than
for transitions.
by default, blastn uses a reward score of 1 for a match and a penalty of -3 for any mismatch.
For proteins the situation is more complicated, because many amino acid
substitutions have very little effect on protein function, for example isoleucine and
valine, or aspartic acid and glutamic acid.
Protein substitution matrices have been developed by comparing the frequencies of
different possible substitutions seen in homologous proteins across species. Two
approaches have been used:
–
–
–
The PAM series (developed by Dayhoff) uses global alignments of proteins from closely
related species. Results are then extrapolated to more distant relationships by matrix
multiplication. The numbers describing the matrices reflect evolutionary distance: PAM30 is
used for more closely related species and PAM 70 is for more distant relationships.
The BLOSUM series (developed by Henikoff) use local, ungapped alignments between
proteins from distantly related species. The matrices are directly calculated, not
extrapolated. The numbers refer to the minimum percent identity allowed for a pair of
sequences to be included in the matrix calculation. Thus, BLOSUM80 required 80% identity
and only includes proteins that are more similar than those in the BLOSUM 45 matrix.
BLOSUM matrices have been found to work better in practice, and blastp uses BLOSUM62
by default.
Gap Scores
•
•
•
•
•
•
When closely related sequences are aligned, small gaps are commonly
observed in addition to base substitutions.
The frequency of gaps is about 1/10 the frequency of substitutions.
Gapped alignments mess up statistical calculations, but they must be dealt
with.
Since a gap can have any length, there are separate penalties for opening a
gap and for extending a gap. By default, blastn uses a gap opening penalty
of -5 and an extension penalty of -2; blastp has an opening penalty of -11
and an extension penalty of -1.
Using separate opening and extension penalties is called an “affine” scoring
system. For example, y = 3x is a linear function of x, while y = 3x + 5 is an
affine function of x.
Using an affine function for gaps is considerably slower than using a linear
function (i.e. the gap opening and gap extensions are identical). Megablast,
a related program useful for large-scale sequence comparisons, uses a
linear gap function.
Smith-Waterman Algorithm
•
•
•
•
The S-W algorithm is an extension of the Needlemann-Wunsch algorithm,
which produces global alignments. Both entire sequences are aligned with
N-W, but local alignments are allowed in S-W.
These algorithms use the technique called “dynamic programming” (which
is not directly related to computer programming).
Start with a 2-dimensional matrix with one sequence along the top and the
other sequence down the left side. All possible pairs of nucleotides or amino
acids are represented by the cells of the matrix.
All possible alignments are represented by the paths through the matrix.
– a diagonal step is an alignment between the query and the subject sequences at
that position
– a vertical step is a gap in the query sequence
– a horizontal step is a gap in the subject sequence.
•
Have a match reward and penalties for mismatches, gap openings, and gap
extensions. For our example, we will use:
–
–
–
–
match = +3
mismatch = -1
gap opening = -4
gap extension = -4 (to keep things simple)
Calculating Cell Scores
T A
•
Initialize the edge rows to scores of 0.
•
The cell at row i and column j has a score
S(i, j)
Starting at top left cell, proceed row-by-row,
calculating each cell’s score S(i, j). S(i, j) is
the maximum of:
•
–
–
–
–
•
0 (i.e. set to 0 if the calculated score is less
than 0)
S(i-1, j-1) + match/mismatch score for cell (i, j)
S(i, j-1) + match/mismatch score for cell (i, j) +
gap penalty
S(i-1, j) + match/mismatch score for cell (i, j)
+ gap penalty
Then, start at the highest score in the matrix
and trace back the path leading through the
highest previous scores to 0. Go left and up
only, preferring the diagonal path if a choice
needs to be made.
T 5 7
G 2 ?
For the cell in question, the bases don’t
match, so it starts with a match/mismatch
score of -1. There are 3 possible
alignment paths to this cell:
1. diagonal (query/subject alignment).
Score = 5 – 1 = 4.
2. vertical (query gap). Score = 7 – 4 –
1=2
3. horizontal (subject gap). Score = 2 –
4 – 1 = -3 (set to 0)
Since 4 is the maximum, the cell’s value is
set to 4.
S-W Example: 1
--Initialize top and left side to 0’s.
0
T
0
T
0
C
0
C
0
A
0
C
G
0
0
A
C
T
T
C
G
C
T
0
0
0
0
0
0
0
0
S-W Example: 2
--top row matches = 3; mismatches = -1 (set to 0)
A
C
T
T
C
G
C
T
0
0
0
0
0
0
0
0
0
T
0
0
0
3
3
0
0
0
3
T
0
C
0
C
0
A
0
C
G
0
0
S-W Example: 3
--2 T’s in a row = 6; T-C following gets 3 – 1 = 2
A
C
T
T
C
G
C
T
0
0
0
0
0
0
0
0
0
T
0
0
0
3
3
0
0
0
3
T
0
0
0
3
6
2
0
0
3
C
0
C
0
A
0
C
0
G
0
S-W Example: 4
--C-G mismatch gets 9 – 4 – 1 = 4 (subject gap), not 2 – 1 = 1 (aligned)
A
C
T
T
C
G
C
T
0
0
0
0
0
0
0
0
0
T
0
0
0
3
3
0
0
0
3
T
0
0
0
3
6
2
0
0
3
C
0
0
3
0
2
9
4
3
0
C
0
A
0
C
G
0
0
S-W Example: 5
--C-C match gets 9 – 4 + 3 = 8 (query gap), not 2 + 3 = 5 (aligned)
A
C
T
T
C
G
C
T
0
0
0
0
0
0
0
0
0
T
0
0
0
3
3
0
0
0
3
T
0
0
0
3
6
2
0
0
3
C
0
0
3
0
2
9
4
3
0
C
0
0
3
2
0
8
8
7
2
A
0
C
G
0
0
S-W Example: 6
--etc.
A
C
T
T
C
G
C
T
0
0
0
0
0
0
0
0
0
T
0
0
0
3
3
0
0
0
3
T
0
0
0
3
6
2
0
0
3
C
0
0
3
0
2
9
4
3
0
C
0
0
3
2
0
8
8
7
2
A
0
3
0
2
1
3
7
7
6
C
0
0
6
1
1
4
2
10
6
G
0
0
1
5
0
0
7
2
9
S-W Example: 7
--starting at the highest score (10) trace back the alignment path
A
C
T
T
C
G
C
T
0
0
0
0
0
0
0
0
0
T
0
0
0
3
3
0
0
0
3
T
0
0
0
3
6
2
0
0
3
C
0
0
3
0
2
9
4
3
0
C
0
0
3
2
0
8
8
7
2
A
0
3
0
2
1
3
7
7
6
C
0
0
6
1
1
4
2
10
6
G
0
0
1
5
0
0
7
2
9
S-W Example: 8
•
•
•
Final alignment. Note that it doesn’t
cover the first two bases of the query
(top row) sequence, and also doesn’t
cover the last mismatched pair (T-G).
The statistical distribution of scores can
be derived mathematically (to a close
approximation). The alignment score
can then be assigned a statistical
expectation, the e-value, based on the
lengths of the query and subject
sequences as well as on the match,
mismatch, and gap penalties. The evalue is the expected number of random
sequences of the same length as the
aligned sequence that would be found in
the subject sequence by chance alone.
e-values are related to the probability of
finding a random sequence in the
database by the Poisson distribution.
However, for e-values less than 0.01,
probabilities and e-values are virtually
identical.
ACTT- CGCT
| |
|
|
TTCCACG
BLAST types
•
•
•
A BLAST search involves taking a single sequence (the query sequence) and
comparing it to a database of other sequences (subject sequences), then reporting
the best matches.
In general, protein searches are more capable of detecting more distant and
imperfect matches than DNA searches, because synonymous and similar amino
acids are more conserved evolutionarily than are nucleotide sequences.
A number of different BLAST programs are available. The main ones differ primarily
by the type of query and subject sequences used:
–
–
–
–
–
•
blastn: query is DNA, subject is DNA
blastp: query is protein, subject is protein
blastx: query is nucleic acid that is translated by the program into protein sequences (all 6
reading frames); subject database is protein. Useful for imperfect query sequences such as
ESTs
tblastn: query is protein; database is DNA translated into protein sequences in all 6 reading
frames. Useful for low quality databases.
tblastx: query is DNA translated into protein, subject is nucleotide translated into protein.
Both are translated into all 6 frames. It is very slow relative to the other BLAST types.
Several other types of BLAST search exist: see the NCBI site FAQs to learn more
about them.
Query Sequence
•
•
•
•
•
•
The query sequence must be written in “FASTA” format. This consists of a single comment line
that starts with a >, followed by one or more lines of sequence.
Comments usually start with a unique identifier for the sequence, then some other descriptive
information. To save trouble later, the unique identifier should contain letters, numbers,
undescores, hyphens and periods only, and contain no spaces. The unique identifier ends at the
first space.
The sequence should contain only sequence characters: no spaces or numbers allowed.
Either upper or lower case is acceptable,
Some suggest that lines of 80 or fewer characters should be used. I prefer to put the entire
sequence on a single line, which can be quite long.
Nucleic acid sequences can be either DNA or RNA. In general, A, C, G, T and U are used, along
with N for “any base” and a single – (hyphen) for gaps of any length.
–
–
Some sequences also contain a variety of other letters, such as R for pyrimidine or M for amino-containing.
These are not widely used and may well confuse others.
However, the existence of these characters makes it necessary that you check unknown sequences for the
presence of letters other than ACGTUN-.
•
Protein sequence use the standard 20 amino acid single letter codes. In addition, X means any
amino acid, * is the translational stop codon at the end of the sequence, and there are a few other
oddities.
•
Note that FASTA format is not rigidly defined, and it is always worth checking sequence files for
what exactly is being used as a format, and the information for any new program for what is an
acceptable format.
FASTA format examples
•
>gi|282349|pir||A41961 chitinase (EC 3.2.1.14) D - Bacillus circulans
LNQAVRFRPVITFALAFILIITWFAPRADAAAQWQAGTAYKQGDLVTYLNKD
YECIQPHTALTGWEPSNVPALWKYVGEGTGGGTPTPDTTPPTVPAGLTSS
LVTDTSVNLTWASTDNVGVTGYEVYRNGTLVANTSTTTAVVTGLTAGTTYV
FTVKAKDAAGNLSAASTSLSVTTSTGSSNPGPSGSKWLIGYWHNFDNGS
TNIKLRNVSTAYDVINVSFAEPISPGSGTLAFTPYNATVEEFKSDIAYLQSQG
KKVLISMGGANGRIELTDATKKRQQFEDSLKSIISTYGFNGLDIDLEGSSLSL
NAGDTDFRSPTTPKIVNLINGVKALKSHFGANFVLTAAPETAYVQGGYLNY
GGPWGAYLPVIHALRNDLTLLHVQHYNTGSMVGLDGRSYAQGTADFHVA
MAQMLLQGFNVGGSSGPFFSPLRPDQIAIGVPASQQAAGGGYTAPAELQ
KALNYLIKGVSYGGSYTLRQLRAMSVSRAL
•
>U03518 Aspergillus awamori internal transcribed spacer 1 (ITS1)
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCT
CCCATCCGTGTCTATTGTACCCTGTTGCTTCGGCGGGCCCGCCGCTTG
TCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGCC
GGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGAT
TGAATGCAATCAGTTAAAACTTTCAACAATGGATCTCTTGGTTCCGGC
Subject Databases
•
•
•
•
•
BLAST uses a special database format to speed up the search operation.
Several pre-packaged databases exist, most notably “nr” which is the nonredundant database consisting of all sequences in GenBank. See the
BLAST program selection guide at
http://www.ncbi.nlm.nih.gov/blast/producttable.shtml .
We mostly make our own databases, which is done with the “formatdb”
program. Useful information can be found in the file
/home/share/blast.dir/README.formatdb.
Databases are located either in the same directory you are running BLAST
from, or in /home/share/blast.dir/data.
To use the /home/share/blast.dir/data directory without having to specify the
entire path every time, you need to create a file called “.ncbirc” in your home
directory that contains these two lines:
[NCBI]
Data=/home/share/blast.dir/data
Using formatdb
•
•
I am just going to cover the basic options: a number of other options can be found in
the README.formatdb file.
By default, format db produces 3 files with the same base name but different
extensions. For example, ath.nhr, ath.nsq, and ath.nin. The base name is “ath”, and
these are nucleic acid files because the first letter of the extensions is n (it would be p
for protein files).
•
The three necessary parameters for formatdb are:
-i input data file (containing one or more sequences in FASTA format)
-n output file base name (if this parameter is not set, the input file name is used as base)
-p type of file: T for protein, F for nucleic acid
•
The –o option produces another set of files used for indexing. This of course uses
more space. I haven’t found it very useful.
•
Basic syntax on the command line:
formatdb –i my_input_genes.txt –p F –n my_genes
•
This takes a FASTA file containing nucleic acid sequences and creates the database
files my_genes.nhr, my_genes.nsq, and my_genes.nin.
Using BLAST
• The best source of information is the file
/home/share/blast.dir/README.bls.
• The program we are running is called “blastall”; the various types of
blast are invoked using options to blastall. The program is found in
/home/share/blast.dir.
• The essential options are:
-p program to run (blastn, blastp, blastx, tblastn, tblastx)
-d subject database (created with formatdb)
-i input sequence file (FASTA format)
-o output file name
• basic syntax, on the command line:
blastall –p blastn –i my_input_file –d my_database –o my_blast_output.txt
Some Other BLAST options
•
-e cutoff value The default cutoff is 10.0. No hits with a worse (higher)
value than this are reported. This number needs to be in decimal, not
exponential notation. I find setting -e .001 eliminates a large number of
nuisance hits.
•
-F filter query sequence. By default this option invokes the DUST filter that
masks out simple sequence repeats (low complexity sequences) that will
match many places in the genome that aren’t really related to the query
sequence. To turn it off, use –F T (i.e. set it to TRUE).
•
-m 8 This gives a tabular output that gives summary information for each hit.
Very useful for processing many sequences.
•
-M protein substitution matrix. Default is BLOSUM62. Others available are
BLOSUM45, BLOSUM80, PAM30 and PAM70. These matrices define the
penalties for substituting one amino acid for another.
•
Other options can be found be typing “blastall” on the command line
Tabular BLAST Output
•
•
•
•
The –m 8 option gives a simple table as output. Often this information is all
you need.
The table is tab-separated, with each hit on a separate line.
You can easily parse these data by splitting each line at tabs: my @data =
split /\t/;
Starting at 0, the columns are:
0: query name
1: subject name
2: percent identities
3: aligned length
4: number of mismatched positions
5: number of gap positions
6: query sequence start
7: query sequence end
8: subject sequence start
9: subject sequence end
10: e-value
11: bit score
More Tabular BLAST Output
•
•
•
•
•
•
•
The query and subject names come from the FASTA comment lines: everything up to the first
space character.
Aligned length can include gaps in both the query and the subject, so it is not necessarily the
same length as either of these sequences.
Percent identities looks at every position in the aligned sequences and counts the number that
have the same base or amino acid. This count is then divided by the aligned length and then
multiplied by 100.
Mismatches are positions in the aligned sequences where both query and subject have a base or
amino acid. Gap positions are not included.
The number of gap positions is the number of positions in the aligned sequence with a base or
amino acid in one sequence and a hyphen (gap) in the other. (NOT the number of gaps
regardless of length).
The query sequence start and end positions are always written with the start less than the end.
However, the subject sequences can have the start position greater than the end, if the alignment
is on the opposite DNA strand from the way the subject sequence was originally put into the
database.
The e-value is the expected number of random sequences that would have at least this good a
match in a random database of the same size as the subject. Thus, an e-value of 1 means that 1
match would be expected to occur by chance alone using this database.
–
–
•
•
It is worth noting that a probability of 0.05 is often considered significant in biological experiments. This
value would almost never be considered a significant hit in BLAST searches. Perhaps a rule-of-thumb worst
significant score would be 10-6 (1e-6).
Scores less than 1e-180 are given a value of 0.0.
The bit score is a normalized (corrected for the mismatch and gap penalties) version of the score
derived from the Smith-Waterman search. It is related to the e-value through an exponential
function that also takes into account the length of both the query and the subject database. That
is, the e-value already takes into account query and subject lengths, while the bit score doesn’t.
I prefer to deal with e-values.
BLAST Output
• If you don’t specify –m 8, you get a longer, detailed output file.
• The output starts with basic info, identifier for query (FASTA
comment line up to the first space) and the database you listed on
the command line with the –d option.
• Then a summary of each database sequence that had a significant
alignment(s). Two numbers appear: the bit score (an integer) and
the e-value (a floating point number with a minimum of 0 and a
maximum of whatever you set –e to be, defaulting to 10.0).
• The summary lines add scores for all hits within each subject
sequence. This is useful and meaningful if the subject database
contains discrete sequences such as individual genes or peptides.
However, if the subject database is genomic chromosomal
sequences, the summary lines can be quite misleading.
• After the summary lines are detailed views of each hit.
More BLAST Output: Detailed View
•
•
•
At the start of the details for each subject sequence, there is a > in the first column, followed by
the subject’s FASTA comment line. The next line gives the subject sequence’s length.
Note that there can be multiple hits on each subject, and each will be listed separately following
these subject identifier and length lines.
Then, some summary information about this hit. For example:
Score = 1594 bits (804), Expect = 0.0
Identities = 804/804 (100%)
Strand = Plus / Plus
•
•
•
The Score is the raw bit score followed by the normalized bit score in parentheses. If you use bit
scores to compare sequences, you want to use the normalized bit score.
Expect is the e-value.
Identities is the number of matching positions between the aligned sequences, followed by the
percentage of identical bases.
–
–
–
•
Sometimes the Identities line will also have a “Gaps = “ section, which gives the number of gap positions.
For protein alignments, there is also a “Positives = “ section, which gives the number of amino acid matches
that are not identical but which count as similar in the scoring matrix (e.g. isolecine and valine).
The order on this line is Identities, Positives, Gaps. However, Positives and/or Gaps may be absent.
Strand is the orientation of the query sequence, then the subject sequence. Query is always
“Plus”, but the subject will be “Minus” if the alignment is on the opposite strand from the original
sequence put into the subject database. This line appears in nucleic acid searches but not in
protein searches (which only match
More BLAST Output: Detailed View
• After the summary lines, the aligned sequence appears,
in groups of 3 lines. The query sequence is on top,
followed by a line indicating matches, followed by the
subject line.
• For blastn, the match line just consists of vertical lines
where the bases are identical and a blank space where
they are not identical.
• For blastp, the match line gives the amino acid code for
indentical amino acids, a “+” for similar (positives) amino
acids, and a blank space otherwise.
• After all hits have been displayed, the end of the output
file has some statistical information about the subject
database and parameters of the blast search.
Running External Programs with
Perl
•
The most commonly used Perl command for running external programs is “system”. This
command executes the program specified by its arguments, then returns control to the next line in
your Perl program.
–
–
•
“system” returns the signal number resulting from the process (program) it executed. If all went
well, this is 0. Numbers other than 0 indicate some kind of error.
–
–
•
There is also the “exec” command, which executes the program given in its arguments, but never returns
control to the Perl command that issued the command. Occasionally useful, but not in the context of BLAST
searches.
You can also enclose the program name in backicks; the program’s output to STDOUT is the return value of
this:
$output_string = ` blastall –p blastn –i my_input_file –d my_database`;
You should always test the results of interacting with the external world.
Thus, reasonable syntax for using the system command:
if ( system(“whatever see below”) != 0) {
die “error executing system command\n”;
}
The simplest way to use “system” is to simply enclose the command line you need in quotes:
system( “blastall –p blastn –i my_input_file –d my_database –o my_blast_output.txt” )
•
The above line invokes the bash shell to interpret the command line, converting each group of
symbols separated by spaces into a separate argument.
•
You can avoid invoking the shell at all (a somewhat more secure method), by separating out the
individual space-delimited segments yourself:
system( “blastal”l, “–p”, “blastn”, “–”i”, “my_input_file”, “–d”, “my_database”, “–o”, “my_blast_output.txt” )
Parsing BLAST Output
•
•
•
•
•
•
•
Once you have run BLAST, its output is in a file. You need to open this file, then read
it line-by-line with a while statement.
Because you are reading line-by-line, most of the information you gather needs to be
stored in global variables: variables declared with “my” outside the while loop.
Each new line needs to be examined for the type of information it contains using
regular expressions. It is important to be able to identify the start and end points of
various sections, and respond appropriately.
Most information will be gathered using regular expression memory parentheses.
Variables need to be re-initialized at the end of each section (or the beginning of the
next one if the end is not easily discerned).
We will store the information in a hash whose keys are the subject identifiers. All hits
for that subject will be stored in an array.
Assuming that you want the information in the detailed analysis of each hit, note that
there are 3 important sections:
–
–
–
1. each new subject sequence starts with a “>” in the first column, followed by the rest of its
FASTA comment line. This can spill over multiple lines in the output. Immediately following
this line(s) is the “Length = “ line.
2. Within each subject sequence there are one or more hits. Each hit starts with a “Score =“
line, followed by an “Identities = “ line, followed (for blastn but not blastp) with a “Strand = “
line.
3. Within each hit there are groups of 3 lines representing the aligned sequence, including
the coordinates of the start and end position of each line. There can be several such groups,
for long hits.
1. Parsing New Subject Sequence
Information
•Each new subject sequence
starts with a line having the
entire FASTA comment line on
it (i.e. a “>” in the first column),
followed by zero or more
additional comment lines, and
ends with a “Length = “ line.
•The useful pieces of
information here are the
comment line itself, the unique
identifier for this subject
sequence (part of the comment
line), and the length.
•Because these sections have
easily recognized start and end
points, you can deal with them
as a unit, inputting new lines
separately from the main while
loop.
my ($subject_ident, $comment_line, $subject_length);
my %hits; # to store all data from this file;
while (<INFILE>) {
# code for processing other sections
if (/^>/) { # finds the “>” in the first column of the line
$subject_length = 0; # reset to 0
$comment_line = $_;
while (1) { # endless loop to get all of comment line
my $next_line = <INFILE>;
if ($next_line =~ /Length = (\d+)/ ) {
$subject_length = $1;
last; # break out of loop
}
else { # continuation of comment line
chomp $next_line;
$next_line =~ s/^\s+/ /; # remove leading
$next_line =~ s/\s+$/ /; # and trailing multiple spaces
$comment_line .= $next_line;
} # end else
} # end while(1) loop
$comment_line =~ />(\S+)\s/;
$subject_ident = $1;
$hits{$subject_ident}{subject_length} = $subject_length;
$hits{$subject_ident}{comment_line} = $comment_line;
} # end subject sequence info section
# additional processing of BLAST output file
} # end while <INFILE> loop
2. Parsing Hit Information
•
Each hit starts with the same 3 lines (2 for blastp), Score, Identities, and Strand. If you detect the
Score line, you can get the next two lines without using the overall while loop:
my ($e_value, $identities, $gaps, $aligned_length, $orientation);
while (<INFILE>) {
if (/Score = /) {
my $score_line = $_;
my $identities_line = <INFILE>;
my $strand_line = <INFILE>;
# now process the information in these 3 lines
$score_line =~ /Score.*Expect = ([-.e\d]+)/; # note the regex here!
$e_value = $1;
$identities_line =~ /Identities = (\d+)\/(\d+)/; # note the escaped slash
($identities, $aligned_length) = ($1, $2);
$gaps = 0; # set a default
if ($identities_line =~ /Gaps = (\d+)/ ) { # Gaps are not always present
$gaps = $1;
}
$orientation = 'Plus';
if ($strand_line =~ /Plus \/ (Plus|Minus)/) { # only for blastn
$orientation = $1;
}
push @{ $hits{$subject_ident}{hit_data} }, [$e_value, $identities, $gaps, $aligned_length, $orientation];
} # end if /Score/
} # end while <INFILE>
3. Parsing the Aligned Sequences
•
•
•
•
This section is the hardest to deal with, because the end point is not clearly delimited.
We solve this problem by noting that there are 3 ways the aligned sequence section can end: it
can be followed by another hit (a Score = line), by a new subject (a line with > in the first column)
or by the statistical information at the end of the file (identified by “Lambda” in the first column).
So, near the beginning of the overall while loop we will insert a section that recognizes any of
these patterns, and writes the aligned sequence information to the array of information from the
previous hit.
However, note that this should only be done if there is a previous hit, that is, not for the first hit in
the first subject sequence. We can detect this because the $subject_ident variable will be undef
before the first subject line is processed, and there will be no $hits{$subject_ident}{hit_data}.
my ($query_start, $query_end, $subject_start, $subject_end);
while (<INFILE>) {
if ($subject_ident && exists $hits{$subject_ident}{hit_data} && (/^>/ || /Score =/ || /^Lambda/ ) ) { #
but not for the first hit of the first subject
push @{ $hits{$subject_ident}[-1] }, $query_start, $query_end, $subject_start, $subject_end;
last if (/^Lambda/); # end of hit processing
$query_start = $query_end = $subject_start = $subject_end = 0; # reset
}
# process the file
} # end while <INFILE>
3. Parsing Aligned Sequences: 2
•
•
•
The aligned sequences come in groups of 3 lines, Query, the match line, and Subject. It is
easiest to process each group of 3 as a unit.
We will gather the start and end positions for both query and subject. It is of course possible to
collect more detailed information about the sequences if desired.
There will be one or more aligned sequence units. Since the start and end points are set to 0 at
the beginning of whatever section follows, the start positions are only modified if their variables
are at 0.
while (<INFILE>) {
# previous code
if (/^Query/) {
my $query_line = $_;
my $match_line = <INFILE>;
my $subject_line = <INFILE>;
$query_line =~ /Query:\s*(\d+)\D+(\d+)\s*$/;
$query_start = $1 unless $query_start;
$query_end = $2;
$subject_line =~ /Sbjct:\s*(\d+)\D+(\d+)\s*$/; # note wierd spelling
$subject_start = $1 unless $subject_start;
$subject_end = $2;
} # end /^Query/
# code for other sections
} # end while <INFILE>
Printing Out the Information
•
•
•
After the end of the while <INFILE> loop.
Very simple format.
This whole program could be a subroutine, with a reference to the %hits
hash returned.
foreach my $subject_ident (sort keys %hits) {
print "$subject_ident:\n";
print " $hits{$subject_ident}{comment_line}\n";
print " Length = $hits{$subject_ident}{subject_length}\n";
foreach my $hit ( @{$hits{$subject_ident}{hit_data}} ) {
print "@$hit\n";
}
print "\n";
}