Bioperl_chosen - Directory Listing

Download Report

Transcript Bioperl_chosen - Directory Listing

BioPerl
Based on a presentation by
Manish Anand/Jonathan Nowacki/
Ravi Bhatt/Arvind Gopu
Introduction

Objective of BioPerl:


Develop reusable, extensible core Perl modules for
use as a standard for manipulating molecular
biological data.
Background:


Started in 1995
One of the oldest open source Bioinformatics
Toolkit Project
So what is BioPerl?

Higher level of abstraction

Re-usable collection of Perl modules that facilitate
bioinformatics application development:
 Accessing databases with different formats
 Sequence manipulation
 Execution and Parsing of the results of molecular
biology programs

Catch? BioPerl does not include programs like Blast,
ClustalW, etc
 Uses system calls to execute external programs
So what is BioPerl? (continued…)
551 modules (incl. 82 interface modules)
 37 module groups
 79,582 lines of code (223,310 lines total)
 144 lines of code per module
 For More info: BioPerl Module Listing

Major Areas covered in Bioperl
Sequences, features, annotations,
 Pairwise alignment reports
 Multiple sequence alignments
 Bibliographic data
 Graphical rendering of sequence tracks
 Database for features and sequences

Additional things
Gene prediction parsers
 Trees, parsing phylogenetic and
molecular evolution software output
 Population genetic data and summary
statistics
 Taxonomy
 Protein Structure

Downloading modules
Modules can be obtained from:
www.CPAN.org (Perl Modules)
www.BioPerl.org (BioPerl Modules)
Downloading modules from CPAN
Interactive mode
 perl -MCPAN -e shell
Batch mode
use CPAN;
clean, install, make, recompile, test
Directory Structure

BioPerl directory structure organization:







Bio/
BioPerl modules
models/ UML for BioPerl classes
t/
Perl built-in tests
t/data/
Data files used for the tests
scripts/ Reusable scripts that use BioPerl
scripts/contributed/ Contributed scripts not
necessarily integrated into BioPerl.
doc/
"How To" files and the FAQ as XML
Parsing Sequences

Bio::SeqIO


multiple drivers: genbank, embl, fasta,...
Sequence objects
Bio::PrimarySeq
 Bio::Seq
 Bio::Seq::RichSeq

Sequence Object Creation
Sequence Creation :
$sequence = Bio::Seq->new(
-seq => ‘AATGCAA’
-display_id => ‘my_sequence’);
Flat File Format Support :
Raw, FASTA, GCG, GenBank, EMBL, PIR
Via ReadSeq: IG, NBRF, DnaStrider, Fitch,
Phylip, MSF, PAUP
Sequence object

Common (Bio::PrimarySeq) methods
seq() - get the sequence as a string
 length() - get the sequence length
 subseq($s,$e) - get a subseqeunce
 translate(...) - translate to protein [DNA]
 revcom() - reverse complement [DNA]
 display_id() - identifier string
 description() - description string

Sequence Types
Different Sequence Objects:






Seq – Some annotations
RichSeq – Additional annotations
PrimarySeq – Bare minimum annotation
( id, accession number, alphabet)
LocatableSeq – Start, stop and gap information also
LargeSeq – Very long sequences
LiveSeq – Newly sequenced genomes
Using a sequence
use Bio::PrimarySeq;
my $str = “ATGAATGATGAA”;
my $seq = Bio::PrimarySeq->new(-seq => $str,
-display_id=>”example”);
print “id is “, $seq->display_id,”\n”;
print $seq->seq, “\n”;
my $revcom = $seq->revcom;
print $revcom->seq, “\n”;
print “frame1=”,$seq->translate->seq,“\n”;
id is example
ATGAATGATGAA
TTCATCATTCAT
trans frame1=MNDE
Accessing remote databases
$gb = new Bio::DB::GenBank();
$seq1 = $gb->get_Seq_by_id('MUSIGHBA1');
$seq2 = $gb->get_Seq_by_acc('AF303112');
$seqio = $gb->
get_Stream_by_id(["J00522","AF303112","2981014"]);
Sequence – Accession numbers
# Get a sequence from RefSeq by accession number
use Bio::DB::RefSeq;
$gb = new Bio::DB::RefSeq;
$seq = $gb->get_Seq_by_acc(“NM_007304”);
print $seq->seq();
Reading and Writing Sequences

Bio::SeqIO

fasta, genbank, embl, swissprot,...
Takes care of writing out associated
features and annotations
 Two functions

next_seq (reading sequences)
 write_seq (writing sequences)

Writing a Sequence
use Bio::SeqIO;
# Let’s convert swissprot to fasta format
my $in = Bio::SeqIO->new( -format => ‘swiss’,
-file => ‘file.sp’);
my $out = Bio::SeqIO->new( -format => ‘fasta’,
-file => ‘>file.fa’);`
while( my $seq = $in->next_seq ) {
$out->write_seq($seq);
}
Manipulating sequence data with Seq
methods





Allows the easy manipulation of bioinformatics data
Specific parts of various annotated formats can be
selected and rearranged.
Unwanted information can be voided out of reports
Important information can be highlighted, processed,
stored in arrays for graphs/charts/etc with relative
ease
Information can be added and subtracted in a flash
The Code
#!/usr/local/bin/perl
use Bio::Seq;
use Bio::SeqIO;
my $seqin = Bio::SeqIO->new('-file' => "genes.fasta" , '-format' =>'Fasta');
my $seqobj = $seqin->next_seq();
my $seq = $seqobj->seq(),"\n"; #plain sequence
print ">",$seqobj->display_id()," Description: ",$seqobj->desc(), " Alphabet:
",$seqobj->alphabet(),"\n";
$seq =~ s/(.{60})/$1\n/g; # convert to 60 char lines
print $seq,"\n";
Before
After
Obtaining basic sequence statistics- molecular weights,
residue & codon frequencies (SeqStats, SeqWord)







Molecular Weight
Monomer Counter
Codon Counter
DNA weights
RNA weights
Amino Weights
More
The Code
#!/usr/local/bin/perl
use Bio::PrimarySeq;
use Bio::Tools::SeqStats;
my $seqobj = new Bio::PrimarySeq(-seq =>
'ATCGTAGCTAGCTGA', -display_id => 'example1');
$seq_stats = Bio::Tools::SeqStats->new(-seq=>$seqobj);
$hash_ref = $seq_stats->count_monomers();
foreach $base (sort keys %$hash_ref) {
print "Number of bases of type ", $base, "= ",%$hash_ref>{$base},"\n";
}
The Results
More Code
use SeqStats;
$seq_stats = Bio::Tools::SeqStats->new($seqobj);
$weight = $seq_stats->get_mol_wt();
-returns the molecular weight
$monomer_ref = $seq_stats->count_monomers();
-counts the number of monomers
$codon_ref = $seq_stats->count_codons(); # for nucleic acid
sequence
-counts the number of codons
Monomer
Why the Large and The Small MW?
Note that since sequences may contain
ambiguous monomers (eg "M"
meaning "A" or "C" in a nucleic acid
sequence), the method get_mol_wt
returns a two-element array containing the
greatest lower bound and
least upper bound of the molecule. (For a
sequence with no ambiguous monomers, the
two elements of the returned array will be
equal.)
Identifying restriction enzyme sites (Restriction
Enzyme)

Bioperl's standard RestrictionEnzyme object comes with data for
more than 150 different restriction enzymes.

To select all available enzymes with cutting patterns that are six
bases long:


$re = new Bio::Tools::RestrictionEnzyme('-name'=>'EcoRI');
@sixcutters = $re->available_list(6);
sites for that enzyme on a given nucleic acid sequence can be
obtained using

$re1 = new Bio::Tools::RestrictionEnzyme(-name=>'EcoRI'); #
$seqobj is the Seq object for the dna to be cut @fragments = $re1>cut_seq($seqobj);
Manipulating sequence alignments

Bioperl offers several perl objects to facilitate
sequence alignment:




pSW (Smith-Waterman)
Clustalw.pm
TCoffee.pm
bl2seq option of StandAloneBlast.
Manipulating Alignments






Some of the manipulations possible with SimpleAlign include:
slice(): Obtaining an alignment ``slice'', that is, a subalignment
inclusive of specified start and end columns.
column_from_residue_number(): Finding column in an alignment
where a specified residue of a specified sequence is located.
consensus_string(): Making a consensus string. This method
includes an optional threshold parameter, so that positions in the
alignment with lower percent-identity than the threshold are
marked by ``?'''s in the consensus
percentage_identity(): A fast method for calculating the average
percentage identity of the alignment
consensus_iupac(): Making a consensus using IUPAC ambiguity
codes from DNA and RNA.
The Code









use Bio::SimpleAlign;
$aln = Bio::SimpleAlign->new('t/data/testaln.fasta');
$threshold_percent = 60;
$consensus_with_threshold = $aln>consensus_string($threshold_percent);
$iupac_consensus = $aln->consensus_iupac();
# dna/rna alignments only
$percent_ident = $aln->percentage_identity;
$seqname = 'AKH_HAEIN';
$pos = $aln>column_from_residue_number($seqname, 14);
Searching for Sequence Similarity
BLAST with BioPerl
 Parsing Blast and FASTA Reports

Search and SearchIO
 BPLite, BPpsilite, BPbl2seq

Parsing HMM Reports
 Standalone BioPerl BLAST

Remote Execution of BLAST


BioPerl has built in capability of running BLAST jobs
remotely using RemoteBlast.pm
Runs these jobs at NCBI automatically





NCBI has dynamic configurations (server side) to “always” be
up and ready
Automatically updated for new BioPerl Releases
Convenient for independent researchers who do not
have access to huge computing resources
Quick submission of Blast jobs without tying up local
resources (especially if working from standalone
workstation)
Legal Restrictions!!!
Example of Remote Blast
$remote_blast = Bio::Tools::Run::RemoteBlast->new( 'prog' => 'blastp','-data' => 'ecoli','-expect' => '1e-10' );
$r = $remote_blast->submit_blast("t/data/ecolitst.fa");
while (@rids = $remote_blast->each_rid )
{
foreach $rid ( @rids )
{
$rc = $remote_blast->retrieve_blast($rid);
}
}
Sample Script to Read and Parse
BLAST Report
# Get the report $searchio = new Bio::SearchIO (-format
=> 'blast', -file => $blast_report);
$result = $searchio->next_result;
# Get info about the entire report
$algorithm_type = $result->algorithm;
# get info about the first hit
$hit = $result->next_hit;
$hit_name = $hit->name ;
# get info about the first hsp of the first hit
$hsp = $hit->next_hsp;
$hsp_start = $hsp->query->start;
Running BLAST Locally

StandAloneBlast


Bio::Tools::Run::StandAloneBlast
Factory Objects


@params = ('program' => 'blastn', 'database' => 'ecoli.nt');
$factory = Bio::Tools::Run::StandAloneBlast>new(@params);
Advantages:



Private
Use Customized Local Resources
Avoid Network Problems
Examples



# Setting parameters similar to RemoteBlast
$input = Bio::Seq->new(-id =>"test query", -seq
=>"ACTAAGTGGGGG");
$blast_report = $factory->blastall($input);
# Blast Report Object that directly accesses parser
while (my $sbjct = $blast_report->next_hit){
while (my $hsp = $sbjct->next_hsp){
print $hsp->score . " " . $hsp->subject>seqname . "\n";
}
}
Format Conversion – Sequences Example

Use: Bio::SeqIO
 Core Code:
$in = Bio::SeqIO->new('-file' => "COG0001",
'-format' => 'Fasta');
$out = Bio::SeqIO->new('-file' => ">COG0001.gen",
'-format' => 'genbank');
while ( my $seq = $in->next_seq() ) {
$out->write_seq($seq);
}
Format Conversion – Alignments

Alignment formats supported:
 INPUT: fasta, selex (HMMER), bl2seq,
clustalw (.aln), msf (GCG), psi (PSI-BLAST),
mase (Seaview), stockholm, prodom, water,
phylip (interleaved), nexus, mega, meme
 OUTPUT: fasta, clustalw, mase, selex,
msf/gcg, and phylip (interleaved).
 Next_aln( ) and write_aln( ) methods of the
‘Bio::AlignIO’ object are used
ClustalW and Profile Align

ClustalW using BioPerl
 Clustalw program should be installed and
environment variable ‘CLUSTALDIR’ set
 Setting Parameters – Build a factory
 Some parameters: 'ktuple', 'matrix', 'outfile',
'quiet‘
 Need reference to sequence array object (See
example)
 Align( ) and Profile_align( ) methods used
ClustalW – Example

Use Bio::SeqIO,
Bio::Tools::Run::Alignment::Clustalw
 Core code (Simple Align):
@params = ('ktuple' => 2, 'matrix' => 'BLOSUM',
'outfile' => 'clustalw_out', 'quiet' => 1);
$factory = Bio::Tools::Run::Alignment::
Clustalw->new(@params);
$seq_array_ref = \@seq_array;
$aln= $factory->align($seq_array_ref);
Smith Waterman Search

Smith Waterman pairwise alignment
 Standard method for producing an optimal
local alignment of two sequences
 Auxilliary Bioperl-ext library required
 SW algorithm implemented in C and
incorporated into bioperl
 Align_and_show() & Pairwise_alignment() in
Bio::Tools::pSW module are methods used
Smith Waterman Search – Example

Use Bio::Tools::pSW, Bio::SeqIO, Bio::AlignIO
 Core code:
$factory = new Bio::Tools::pSW( '-matrix' =>
'BLOSUM62', '-gap' => 12, '-ext' => 2);
$aln = $factory->pairwise_alignment($seq_array[0],
$seq_array[1]);
my $alnout = new Bio::AlignIO(-format => 'msf',
-fh => \*STDOUT);
$alnout->write_aln($aln);
Smith Waterman Search
AlignIO object in previous slide – could also be
used to print into a file
 Use double loop to do all pairwise
comparisons
 More Info: Bio::Tools::pSW mapage
