Transcript Bio::Seq

The Bio::Seq Class
• The Bio::Seq class allows for efficient
manipulation and storage of nucleotide and
protein sequences
• Handles tasks such as
– Manipulating the actual sequence
– Creating new Bio::Seq subsequence objects
– Manipulating features for the sequence
– Manipulating other data associated with the
sequence (e.g. accession number, species)
Interfaces that Bio::Seq Implements
• Bio::SeqI and Bio::PrimarySeqI
– Used to access data such as the type of
sequence, its accession number and id, and its
description
• Bio::IdentifiableI
– Handles aspects regarding the source of the
sequence and the sequence’s id and version
• Bio::DescribableI
– Methods to access human readable descriptions
of the sequence object
• Bio::AnnotatableI
– Methods to access the annotation of the sequence
• Bio::FeatureHolderI
– Methods to get features present in this sequence
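As a rough illustration of the Bio::FeatureHolderI methods listed above, the sketch below attaches one feature to a sequence and reads it back. The sequence, its id, and the "exon" feature are made up for the example; Bio::SeqFeature::Generic is the BioPerl feature class assumed here.

```perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqFeature::Generic;

# A short made-up DNA sequence to attach a feature to.
my $seq = Bio::Seq->new(
    -seq        => 'ATGGCGTTTAAACCCGGGTGA',
    -display_id => 'demo_seq',
    -alphabet   => 'dna',
);

# Describe positions 1..9 on the forward strand as a hypothetical exon.
my $feature = Bio::SeqFeature::Generic->new(
    -start       => 1,
    -end         => 9,
    -strand      => 1,
    -primary_tag => 'exon',
);
$seq->add_SeqFeature($feature);

# List the top-level features now held by the sequence.
for my $feat ($seq->get_SeqFeatures) {
    printf "%s %d..%d\n", $feat->primary_tag, $feat->start, $feat->end;
}
print "feature count: ", $seq->feature_count, "\n";
```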
UML Diagram of Bio::Seq
• Bio::SeqI
– top_SeqFeatures, all_SeqFeatures, seq, write_GFF, annotation,
feature_count, species, primary_seq
• Bio::IdentifiableI
– object_id, version, authority, namespace
• Bio::AnnotatableI
– annotation
• Bio::FeatureHolderI
– get_SeqFeatures, feature_count, get_all_SeqFeatures
• Bio::DescribableI
– display_id, description
• Bio::Seq
– new, seq, validate_seq, length, subseq, display_id,
accession_number, desc, primary_id, can_call_new, alphabet,
object_id, version, authority, namespace, display_name,
description, annotation, get_SeqFeatures, get_all_SeqFeatures,
feature_count, add_SeqFeature, remove_SeqFeature, revcom, trunc,
id, primary_seq, species
Common Bio::Seq methods
• new
– Constructs a Bio::Seq object
– e.g. $seq = Bio::Seq->new(-seq => 'ACGTCGAC', -display_id => 'foo');
• seq
– Gets or sets the string representation of the nucleotides or
amino acids in this Bio::Seq object
– e.g. print $seq->seq;
• length
– Gets the length of a Bio::Seq object
– e.g. print $seq->length;
• accession_number
– Gets or sets the accession number of a Bio::Seq
object
– e.g. $seq->accession_number('AC12345');
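The four methods above can be tried together in one short script. This is a minimal sketch that reuses the slide's example values ('ACGTCGAC', 'foo', 'AC12345'); they are placeholders, not real database entries.

```perl
use strict;
use warnings;
use Bio::Seq;

# Build a sequence object from a literal string.
my $seq = Bio::Seq->new(-seq => 'ACGTCGAC', -display_id => 'foo');

print $seq->display_id, "\n";   # foo
print $seq->seq,        "\n";   # ACGTCGAC
print $seq->length,     "\n";   # 8

# accession_number is a combined getter/setter:
# called with an argument it stores the value.
$seq->accession_number('AC12345');
print $seq->accession_number, "\n";   # AC12345
```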
Common Bio::Seq methods
(continued)
• subseq
– Gets the subsequence as a string from the first integer to the
second integer, inclusive
– e.g. print $seq->subseq(3, 9);
• trunc
– Returns a new Bio::Seq object that is a truncation from the
first integer to the second integer, inclusive
– e.g. $trunc_seq = $seq->trunc(3, 9);
• revcom
– Returns a new Bio::Seq object that is the reverse
complement of this Bio::Seq object
– e.g. $revcom = $seq->revcom;
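To make the difference between subseq (returns a string) and trunc / revcom (return new objects) concrete, here is a small sketch on a made-up 12-base sequence. Coordinates are 1-based and inclusive, as described above.

```perl
use strict;
use warnings;
use Bio::Seq;

# Made-up 12-base DNA sequence for illustration.
my $seq = Bio::Seq->new(
    -seq        => 'ACGTCGACGGTA',
    -display_id => 'demo',
    -alphabet   => 'dna',
);

# subseq returns a plain string covering positions 3..9.
my $sub_string = $seq->subseq(3, 9);
print "$sub_string\n";                 # GTCGACG

# trunc returns a new sequence object covering the same range.
my $trunc_seq = $seq->trunc(3, 9);
print $trunc_seq->seq, "\n";           # GTCGACG
print $trunc_seq->length, "\n";        # 7

# revcom returns a new object holding the reverse complement.
my $revcom = $seq->revcom;
print $revcom->seq, "\n";              # TACCGTCGACGT
```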
Obtaining Bio::Seq objects
• Bio::Seq objects can be constructed from a
variety of flat file formats, Internet databases,
or from other Bio::Seq objects.
– The Bio::SeqIO class allows for sequential input or
output of sequences from or to flat files
– Bio::DB classes such as Bio::DB::GenBank or
Bio::DB::Fasta allow for random access retrieval of
sequences
– Calling the trunc method of a Bio::Seq object
returns a new Bio::Seq object.
Bio::SeqIO
• Sequential access to flat files in formats such as fasta,
gff, embl, swissprot, etc.
• The new method’s -file argument requires the path to a
file of a supported format
– Put ‘>’ before the file’s path to write to the file
Bio::SeqIO->new(-file => '>path/of/file/to/write',
-format => 'embl');
– Put ‘>>’ before the file’s path to append to the file
Bio::SeqIO->new(-file => '>>path/of/file/to/append',
-format => 'gff');
– To read, simply give the path to the file
Bio::SeqIO->new(-file => 'path/of/file/to/read',
-format => 'fasta');
Input With Bio::SeqIO
• Allows for construction of Bio::Seq objects in
a sequential order
• Use next_seq method to get the next
sequence from the file, if one exists
• To read all the sequences in a file and print
their names to the screen:
use Bio::SeqIO;
my $seqio = Bio::SeqIO->new(-file => 'foo.fasta',
-format => 'fasta');
while (my $seq = $seqio->next_seq)
{
print $seq->display_id;
print "\n";
}
Output With Bio::SeqIO
• Whether writing or appending to a file, the
new method creates the file if it does not exist
• Writing overwrites any data already in the file;
appending adds sequences to the end
• The following writes a sequence to a file
# $seq has previously been defined as a Bio::Seq
# object
my $seqio = Bio::SeqIO->new(-file => '>foo.swissprot',
-format => 'swissprot');
$seqio->write_seq($seq);
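Writing and reading can be checked against each other with a round trip. The sketch below writes two made-up sequences to a temporary FASTA file and reads them back with next_seq; the temporary directory, the file name demo.fasta, and the choice of fasta as the format are assumptions made for the example.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);
use Bio::Seq;
use Bio::SeqIO;

# Work in a throwaway directory that is removed when the script exits.
my $dir  = tempdir(CLEANUP => 1);
my $path = File::Spec->catfile($dir, 'demo.fasta');

my @seqs = (
    Bio::Seq->new(-seq => 'ACGTCGAC', -display_id => 'foo'),
    Bio::Seq->new(-seq => 'TTGACCA',  -display_id => 'bar'),
);

# A leading '>' opens the file for writing (overwriting any old content).
my $out = Bio::SeqIO->new(-file => ">$path", -format => 'fasta');
$out->write_seq($_) for @seqs;
$out->close;

# No prefix opens the file for reading; next_seq walks it record by record.
my $in = Bio::SeqIO->new(-file => $path, -format => 'fasta');
my @ids;
while (my $seq = $in->next_seq) {
    push @ids, $seq->display_id;
    print $seq->display_id, "\t", $seq->length, "\n";
}
```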
Bio::DB::*
• Bio::DB::* is a collection of similar classes
where * varies
• * may be
– GenBank
– A flatfile format
• e.g. Fasta, EMBL, SwissProt, GFF, etc…
• Modules in this form allow for random access
retrieval from
– A specific file or a directory of flat files of type *
• Uses indexing to allow for quick retrieval of sequence
information
– The GenBank database at NCBI if * is GenBank.
Random Access Retrieval from Flat Files
• The new method requires the path to either a file
containing sequences or to a directory containing
many sequence files
– Indexes the files on first run or when the -reindex
argument is used with a value of 1
• The following constructs a Bio::Seq object with a
given accession from a Fasta file
use Bio::DB::Fasta;
# given $accession is a valid accession
# number
$db = Bio::DB::Fasta->new(
'path/to/file/or/directory',
-reindex => 1);
$seq = $db->get_Seq_by_acc($accession);
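For a version that runs without an existing sequence file, the sketch below first writes a tiny FASTA file to a temporary directory and then fetches records from it by id. The file name, the ids seq1 and seq2, and the sequences are invented for the example; Bio::DB::Fasta writes its index next to the FASTA file, so the directory must be writable.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);
use Bio::DB::Fasta;

# Create a small two-record FASTA file in a throwaway directory.
my $dir  = tempdir(CLEANUP => 1);
my $path = File::Spec->catfile($dir, 'demo.fasta');
open my $fh, '>', $path or die "Cannot write $path: $!";
print {$fh} ">seq1 first record\nACGTCGACGGTA\n",
            ">seq2 second record\nTTGACCATTGAC\n";
close $fh;

# The first call builds the index; later calls would reuse it.
my $db = Bio::DB::Fasta->new($path);

# Jump straight to the second record without scanning the first.
my $record = $db->get_Seq_by_id('seq2');
print $record->seq, "\n";

# The seq method can also return just a 1-based, inclusive range.
my $piece = $db->seq('seq1', 3 => 9);
print "$piece\n";
```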
Retrieval from GenBank
• Constructing a Bio::Seq object using
Bio::DB::GenBank requires an Internet
connection to access the GenBank database
• The following is an example to construct a
Bio::Seq object
use Bio::DB::GenBank;
# given $accession is defined as a valid
# accession number
$gb = Bio::DB::GenBank->new();
$seq = $gb->get_Seq_by_acc($accession);