Bioperl_chosen - Directory Listing
Download
Report
Transcript Bioperl_chosen - Directory Listing
BioPerl
Based on a presentation by
Manish Anand/Jonathan Nowacki/
Ravi Bhatt/Arvind Gopu
Introduction
Objective of BioPerl:
Develop reusable, extensible core Perl modules for
use as a standard for manipulating molecular
biological data.
Background:
Started in 1995
One of the oldest open source Bioinformatics
Toolkit Project
So what is BioPerl?
Higher level of abstraction
Re-usable collection of Perl modules that facilitate
bioinformatics application development:
Accessing databases with different formats
Sequence manipulation
Execution and Parsing of the results of molecular
biology programs
Catch? BioPerl does not include programs like Blast,
ClustalW, etc
Uses system calls to execute external programs
So what is BioPerl? (continued…)
551 modules (incl. 82 interface modules)
37 module groups
79,582 lines of code (223,310 lines total)
144 lines of code per module
For More info: BioPerl Module Listing
Major Areas covered in Bioperl
Sequences, features, annotations,
Pairwise alignment reports
Multiple sequence alignments
Bibliographic data
Graphical rendering of sequence tracks
Database for features and sequences
Additional things
Gene prediction parsers
Trees, parsing phylogenetic and
molecular evolution software output
Population genetic data and summary
statistics
Taxonomy
Protein Structure
Downloading modules
Modules can be obtained from:
www.CPAN.org (Perl Modules)
www.BioPerl.org (BioPerl Modules)
Downloading modules from CPAN
Interactive mode
perl -MCPAN -e shell
Batch mode
use CPAN;
clean, install, make, recompile, test
Directory Structure
BioPerl directory structure organization:
Bio/
BioPerl modules
models/ UML for BioPerl classes
t/
Perl built-in tests
t/data/
Data files used for the tests
scripts/ Reusable scripts that use BioPerl
scripts/contributed/ Contributed scripts not
necessarily integrated into BioPerl.
doc/
"How To" files and the FAQ as XML
Parsing Sequences
Bio::SeqIO
multiple drivers: genbank, embl, fasta,...
Sequence objects
Bio::PrimarySeq
Bio::Seq
Bio::Seq::RichSeq
Sequence Object Creation
Sequence Creation :
$sequence = Bio::Seq->new(
-seq => ‘AATGCAA’
-display_id => ‘my_sequence’);
Flat File Format Support :
Raw, FASTA, GCG, GenBank, EMBL, PIR
Via ReadSeq: IG, NBRF, DnaStrider, Fitch,
Phylip, MSF, PAUP
Sequence object
Common (Bio::PrimarySeq) methods
seq() - get the sequence as a string
length() - get the sequence length
subseq($s,$e) - get a subseqeunce
translate(...) - translate to protein [DNA]
revcom() - reverse complement [DNA]
display_id() - identifier string
description() - description string
Sequence Types
Different Sequence Objects:
Seq – Some annotations
RichSeq – Additional annotations
PrimarySeq – Bare minimum annotation
( id, accession number, alphabet)
LocatableSeq – Start, stop and gap information also
LargeSeq – Very long sequences
LiveSeq – Newly sequenced genomes
Using a sequence
use Bio::PrimarySeq;
my $str = “ATGAATGATGAA”;
my $seq = Bio::PrimarySeq->new(-seq => $str,
-display_id=>”example”);
print “id is “, $seq->display_id,”\n”;
print $seq->seq, “\n”;
my $revcom = $seq->revcom;
print $revcom->seq, “\n”;
print “frame1=”,$seq->translate->seq,“\n”;
id is example
ATGAATGATGAA
TTCATCATTCAT
trans frame1=MNDE
Accessing remote databases
$gb = new Bio::DB::GenBank();
$seq1 = $gb->get_Seq_by_id('MUSIGHBA1');
$seq2 = $gb->get_Seq_by_acc('AF303112');
$seqio = $gb->
get_Stream_by_id(["J00522","AF303112","2981014"]);
Sequence – Accession numbers
# Get a sequence from RefSeq by accession number
use Bio::DB::RefSeq;
$gb = new Bio::DB::RefSeq;
$seq = $gb->get_Seq_by_acc(“NM_007304”);
print $seq->seq();
Reading and Writing Sequences
Bio::SeqIO
fasta, genbank, embl, swissprot,...
Takes care of writing out associated
features and annotations
Two functions
next_seq (reading sequences)
write_seq (writing sequences)
Writing a Sequence
use Bio::SeqIO;
# Let’s convert swissprot to fasta format
my $in = Bio::SeqIO->new( -format => ‘swiss’,
-file => ‘file.sp’);
my $out = Bio::SeqIO->new( -format => ‘fasta’,
-file => ‘>file.fa’);`
while( my $seq = $in->next_seq ) {
$out->write_seq($seq);
}
Manipulating sequence data with Seq
methods
Allows the easy manipulation of bioinformatics data
Specific parts of various annotated formats can be
selected and rearranged.
Unwanted information can be voided out of reports
Important information can be highlighted, processed,
stored in arrays for graphs/charts/etc with relative
ease
Information can be added and subtracted in a flash
The Code
#!/usr/local/bin/perl
use Bio::Seq;
use Bio::SeqIO;
my $seqin = Bio::SeqIO->new('-file' => "genes.fasta" , '-format' =>'Fasta');
my $seqobj = $seqin->next_seq();
my $seq = $seqobj->seq(),"\n"; #plain sequence
print ">",$seqobj->display_id()," Description: ",$seqobj->desc(), " Alphabet:
",$seqobj->alphabet(),"\n";
$seq =~ s/(.{60})/$1\n/g; # convert to 60 char lines
print $seq,"\n";
Before
After
Obtaining basic sequence statistics- molecular weights,
residue & codon frequencies (SeqStats, SeqWord)
Molecular Weight
Monomer Counter
Codon Counter
DNA weights
RNA weights
Amino Weights
More
The Code
#!/usr/local/bin/perl
use Bio::PrimarySeq;
use Bio::Tools::SeqStats;
my $seqobj = new Bio::PrimarySeq(-seq =>
'ATCGTAGCTAGCTGA', -display_id => 'example1');
$seq_stats = Bio::Tools::SeqStats->new(-seq=>$seqobj);
$hash_ref = $seq_stats->count_monomers();
foreach $base (sort keys %$hash_ref) {
print "Number of bases of type ", $base, "= ",%$hash_ref>{$base},"\n";
}
The Results
More Code
use SeqStats;
$seq_stats = Bio::Tools::SeqStats->new($seqobj);
$weight = $seq_stats->get_mol_wt();
-returns the molecular weight
$monomer_ref = $seq_stats->count_monomers();
-counts the number of monomers
$codon_ref = $seq_stats->count_codons(); # for nucleic acid
sequence
-counts the number of codons
Monomer
Why the Large and The Small MW?
Note that since sequences may contain
ambiguous monomers (eg "M"
meaning "A" or "C" in a nucleic acid
sequence), the method get_mol_wt
returns a two-element array containing the
greatest lower bound and
least upper bound of the molecule. (For a
sequence with no ambiguous monomers, the
two elements of the returned array will be
equal.)
Identifying restriction enzyme sites (Restriction
Enzyme)
Bioperl's standard RestrictionEnzyme object comes with data for
more than 150 different restriction enzymes.
To select all available enzymes with cutting patterns that are six
bases long:
$re = new Bio::Tools::RestrictionEnzyme('-name'=>'EcoRI');
@sixcutters = $re->available_list(6);
sites for that enzyme on a given nucleic acid sequence can be
obtained using
$re1 = new Bio::Tools::RestrictionEnzyme(-name=>'EcoRI'); #
$seqobj is the Seq object for the dna to be cut @fragments = $re1>cut_seq($seqobj);
Manipulating sequence alignments
Bioperl offers several perl objects to facilitate
sequence alignment:
pSW (Smith-Waterman)
Clustalw.pm
TCoffee.pm
bl2seq option of StandAloneBlast.
Manipulating Alignments
Some of the manipulations possible with SimpleAlign include:
slice(): Obtaining an alignment ``slice'', that is, a subalignment
inclusive of specified start and end columns.
column_from_residue_number(): Finding column in an alignment
where a specified residue of a specified sequence is located.
consensus_string(): Making a consensus string. This method
includes an optional threshold parameter, so that positions in the
alignment with lower percent-identity than the threshold are
marked by ``?'''s in the consensus
percentage_identity(): A fast method for calculating the average
percentage identity of the alignment
consensus_iupac(): Making a consensus using IUPAC ambiguity
codes from DNA and RNA.
The Code
use Bio::SimpleAlign;
$aln = Bio::SimpleAlign->new('t/data/testaln.fasta');
$threshold_percent = 60;
$consensus_with_threshold = $aln>consensus_string($threshold_percent);
$iupac_consensus = $aln->consensus_iupac();
# dna/rna alignments only
$percent_ident = $aln->percentage_identity;
$seqname = 'AKH_HAEIN';
$pos = $aln>column_from_residue_number($seqname, 14);
Searching for Sequence Similarity
BLAST with BioPerl
Parsing Blast and FASTA Reports
Search and SearchIO
BPLite, BPpsilite, BPbl2seq
Parsing HMM Reports
Standalone BioPerl BLAST
Remote Execution of BLAST
BioPerl has built in capability of running BLAST jobs
remotely using RemoteBlast.pm
Runs these jobs at NCBI automatically
NCBI has dynamic configurations (server side) to “always” be
up and ready
Automatically updated for new BioPerl Releases
Convenient for independent researchers who do not
have access to huge computing resources
Quick submission of Blast jobs without tying up local
resources (especially if working from standalone
workstation)
Legal Restrictions!!!
Example of Remote Blast
$remote_blast = Bio::Tools::Run::RemoteBlast->new( 'prog' => 'blastp','-data' => 'ecoli','-expect' => '1e-10' );
$r = $remote_blast->submit_blast("t/data/ecolitst.fa");
while (@rids = $remote_blast->each_rid )
{
foreach $rid ( @rids )
{
$rc = $remote_blast->retrieve_blast($rid);
}
}
Sample Script to Read and Parse
BLAST Report
# Get the report $searchio = new Bio::SearchIO (-format
=> 'blast', -file => $blast_report);
$result = $searchio->next_result;
# Get info about the entire report
$algorithm_type = $result->algorithm;
# get info about the first hit
$hit = $result->next_hit;
$hit_name = $hit->name ;
# get info about the first hsp of the first hit
$hsp = $hit->next_hsp;
$hsp_start = $hsp->query->start;
Running BLAST Locally
StandAloneBlast
Bio::Tools::Run::StandAloneBlast
Factory Objects
@params = ('program' => 'blastn', 'database' => 'ecoli.nt');
$factory = Bio::Tools::Run::StandAloneBlast>new(@params);
Advantages:
Private
Use Customized Local Resources
Avoid Network Problems
Examples
# Setting parameters similar to RemoteBlast
$input = Bio::Seq->new(-id =>"test query", -seq
=>"ACTAAGTGGGGG");
$blast_report = $factory->blastall($input);
# Blast Report Object that directly accesses parser
while (my $sbjct = $blast_report->next_hit){
while (my $hsp = $sbjct->next_hsp){
print $hsp->score . " " . $hsp->subject>seqname . "\n";
}
}
Format Conversion – Sequences Example
Use: Bio::SeqIO
Core Code:
$in = Bio::SeqIO->new('-file' => "COG0001",
'-format' => 'Fasta');
$out = Bio::SeqIO->new('-file' => ">COG0001.gen",
'-format' => 'genbank');
while ( my $seq = $in->next_seq() ) {
$out->write_seq($seq);
}
Format Conversion – Alignments
Alignment formats supported:
INPUT: fasta, selex (HMMER), bl2seq,
clustalw (.aln), msf (GCG), psi (PSI-BLAST),
mase (Seaview), stockholm, prodom, water,
phylip (interleaved), nexus, mega, meme
OUTPUT: fasta, clustalw, mase, selex,
msf/gcg, and phylip (interleaved).
Next_aln( ) and write_aln( ) methods of the
‘Bio::AlignIO’ object are used
ClustalW and Profile Align
ClustalW using BioPerl
Clustalw program should be installed and
environment variable ‘CLUSTALDIR’ set
Setting Parameters – Build a factory
Some parameters: 'ktuple', 'matrix', 'outfile',
'quiet‘
Need reference to sequence array object (See
example)
Align( ) and Profile_align( ) methods used
ClustalW – Example
Use Bio::SeqIO,
Bio::Tools::Run::Alignment::Clustalw
Core code (Simple Align):
@params = ('ktuple' => 2, 'matrix' => 'BLOSUM',
'outfile' => 'clustalw_out', 'quiet' => 1);
$factory = Bio::Tools::Run::Alignment::
Clustalw->new(@params);
$seq_array_ref = \@seq_array;
$aln= $factory->align($seq_array_ref);
Smith Waterman Search
Smith Waterman pairwise alignment
Standard method for producing an optimal
local alignment of two sequences
Auxilliary Bioperl-ext library required
SW algorithm implemented in C and
incorporated into bioperl
Align_and_show() & Pairwise_alignment() in
Bio::Tools::pSW module are methods used
Smith Waterman Search – Example
Use Bio::Tools::pSW, Bio::SeqIO, Bio::AlignIO
Core code:
$factory = new Bio::Tools::pSW( '-matrix' =>
'BLOSUM62', '-gap' => 12, '-ext' => 2);
$aln = $factory->pairwise_alignment($seq_array[0],
$seq_array[1]);
my $alnout = new Bio::AlignIO(-format => 'msf',
-fh => \*STDOUT);
$alnout->write_aln($aln);
Smith Waterman Search
AlignIO object in previous slide – could also be
used to print into a file
Use double loop to do all pairwise
comparisons
More Info: Bio::Tools::pSW mapage