Sequences as arrays or strings



String ($var) and array (@array) conversion and substring extraction
Lecture 6
Split strings
• The split function is used to split (divide) data:
– Strings into an array.
– Strings into a list of scalars ($variables).
– It can also split a string into its individual characters by using an empty string ("") as the delimiter.
• Example input line:
>192a8, the lactose gene, e. coli, cambridge university, january 1981
– chomp($line = <>); # read the line into $line
– @fields = split ',', $line; # splits a string into an array
– ($clone,$laboratory,$left_oligo,$right_oligo) = split ',', $line; # splits a string into a list of scalars
• See SplitExample.pl
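The three uses of split listed above can be tried out with a short script. This is only a minimal sketch: the input line is the example from the slide, the variable names ($id, $gene, $organism) are made up for illustration, and the pattern /,\s*/ is an assumption added here so that the fields do not keep the space that follows each comma.

```perl
use strict;
use warnings;

# Example record in the comma-separated layout shown on the slide.
my $line = ">192a8, the lactose gene, e. coli, cambridge university, january 1981\n";
chomp $line;    # remove the trailing newline

# Split into an array; the pattern also swallows whitespace after each comma.
my @fields = split /,\s*/, $line;

# Split straight into a list of scalars (illustrative names).
# Fields beyond the third are simply discarded.
my ($id, $gene, $organism) = split /,\s*/, $line;

# An empty pattern splits a string into its individual characters.
my @bases = split //, "ATGC";

print scalar(@fields), " fields\n";
print "Organism: $organism\n";
print "Bases: @bases\n";
```

Splitting on ',' alone, as on the slide, also works, but every field after the first then starts with a space.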
Join: elements of an array
• The join function is the reverse of split:
– It converts an array (list) into a string.
• # initialize an array
• @seq = ("aaaaaa","tttttt","cccccc","ggggggg");
• $CombinedSeq = join '', @seq;
• The result of the join is:
• aaaaaattttttccccccggggggg
• See JoinExample.pl
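To see that join really is the reverse of split, the array from the slide can be joined and then split again. A minimal sketch; the tab separator and the names $tabbed and @back are choices made for this illustration only.

```perl
use strict;
use warnings;

my @seq = ("aaaaaa", "tttttt", "cccccc", "ggggggg");

# Join with an empty separator: the elements are glued together directly.
my $combined = join '', @seq;

# Join with a visible separator, then split on it to recover the array.
my $tabbed = join "\t", @seq;
my @back   = split /\t/, $tabbed;

print "$combined\n";
print scalar(@back), " elements recovered\n";
```

Note that a string joined with an empty separator cannot be split back into the original elements, because the element boundaries are lost.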
Concatenation
• To concatenate two strings and store the result, use the .= operator (the plain . operator joins two strings without assigning).
– Start with an empty (null) string: $seq = "";
– We can add (concatenate) a sequence to this by:
– $seq .= $input_seq2;
– This can be used to read in sequences and join them together so that they form one string.
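The read-and-append pattern described above might look like the following. This is a sketch under stated assumptions: the lines are held in a hypothetical array @lines instead of being read from a file, so the script runs on its own.

```perl
use strict;
use warnings;

# Hypothetical sequence lines, as they might arrive from a file read.
my @lines = ("ATGGCC\n", "TTTAAA\n", "GGGTAG\n");

my $seq = "";          # start with an empty string
for my $line (@lines) {
    chomp $line;       # drop the newline before appending
    $seq .= $line;     # append this line to the end of $seq
}

print "$seq\n";
print "Length: ", length($seq), "\n";
```

Forgetting the chomp would leave newline characters inside $seq, which later breaks codon extraction.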
Extracting substrings
• substr: a function for extracting a substring from a string.
• Assume the string is: AAAAGGGGCCCCTTTT
• To extract the sequence AGG (a codon) from the string we need to:
– Move to the 4th character of the string (offset 3).
– Extract 3 characters (a 3-character substring).
• The syntax for the Perl substr (substring) function:
– $sub = substr($string, offset, length); # offset = position to begin extraction, length = size of the substring
– The offset is zero-based.
• More details on substrings can be found at: http://perlmeme.org/howtos/perlfunc/substr.html
• Extract words from a sentence: Substring.pl
• Extract a codon from a DNA sequence: substring.pl
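The AGG example above can be checked directly. A minimal sketch; the sentence and the variable names are made up for illustration and are not taken from Substring.pl.

```perl
use strict;
use warnings;

my $dna = "AAAAGGGGCCCCTTTT";

# Offset 3 is the 4th character because offsets start at 0.
my $codon = substr($dna, 3, 3);
print "$codon\n";

# The same function extracts words when their positions are known.
my $sentence = "Perl handles strings well";
my $first  = substr($sentence, 0, 4);   # 4 characters from the start
my $second = substr($sentence, 5, 7);   # 7 characters starting at offset 5
my $last   = substr($sentence, -4);     # negative offset counts from the end
print "$first | $second | $last\n";
```

Leaving out the length argument, as in the last call, returns everything from the offset to the end of the string.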
Perl functions for determining the ORF of DNA sequences
• The unpack function: a Perl function that extracts sets of characters from a sequence of characters and assigns them to an array.
• It can therefore be used to extract groups of 3 bases from a DNA sequence (e.g. the codons of an open reading frame) and assign each group to an element of an array.
– @triplets = unpack("a3" x (length($line)/3), $line);
• To determine all possible open reading frames (ORFs) of a DNA sequence (reading frame 1, reading frame 2 and reading frame 3), shift one base to the right when going from reading frame 1 to reading frame 2, and one more base when going from reading frame 2 to reading frame 3.
• Frame shift (1 position to the right):
– @triplets = unpack('a1' . "a3" x (length($line)/3), $line);
– With 'a1', the first element of @triplets is the single leading base, not a codon.
• Remember: if only 2 characters are left at the end or beginning of a sequence, unpack will still assign them to an element of the array. If you look codons up in a hash table, do not forget that an exists check may be required.
• See Unpack_codons.pl (Run to show the output)
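The frame shifts above can be sketched as follows. Two things here are assumptions added for illustration and are not taken from Unpack_codons.pl: the template uses 'x1' and 'x2', which skip bytes without returning them (so the array holds only codons, unlike 'a1'), and %aa is a tiny made-up excerpt of a codon table used to show the exists check.

```perl
use strict;
use warnings;

my $dna = "ATGGCCTTTAAAGG";   # hypothetical 14-base sequence
my $len = length($dna);

# Frame 1: as many complete triplets as fit.
my @frame1 = unpack("a3" x int($len / 3), $dna);

# Frames 2 and 3: skip 1 or 2 bases first, then take complete triplets.
my @frame2 = unpack("x1" . "a3" x int(($len - 1) / 3), $dna);
my @frame3 = unpack("x2" . "a3" x int(($len - 2) / 3), $dna);

print "Frame 1: @frame1\n";
print "Frame 2: @frame2\n";
print "Frame 3: @frame3\n";

# Look codons up in a hash, guarding with exists for anything not in the table.
my %aa = (ATG => 'M', GCC => 'A', TTT => 'F', AAA => 'K');   # partial, illustrative
my $protein = join '', map { exists $aa{$_} ? $aa{$_} : '?' } @frame1;
print "Frame 1 protein: $protein\n";
```

Using int() makes the truncation to whole triplets explicit, so leftover bases at the end are dropped instead of becoming a short final element.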
Sample Exercise
• Write a script to read in the contents of a FASTA file (without the descriptor line) and print it out as a single string containing all the DNA bases/amino acids.
• Modify the unpack example to use substr instead of unpack.