Perl
Practical Extraction and Reporting Language
An Introduction by
Shwen Ho
What is Perl good for?
Designed for text manipulation
Very fast to implement
Allows many different ways to solve the
same problem
Runs on many different platforms
– Windows, Mac, Unix, Linux, DOS, etc.
Running Perl
Perl scripts do not need to be compiled beforehand.
They are interpreted at the point of
execution.
They do not need a particular file extension,
although the .pl file extension is commonly used.

Running Perl

Executing it via the command line
command line> perl script.pl arg1 arg2 ...

Or add the line "#!/usr/bin/perl" as the first line of
the script if you are using Unix/Linux
– Remember to set the correct file execution
permissions before running it.
chmod +x perlscript.pl
./perlscript.pl
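
A minimal sketch of a complete script that can be run either way shown above. The file name hello.pl and the greeting are illustrative and not from the lecture; @ARGV is Perl's built-in array of command-line arguments (arg1, arg2, ...), and "use strict", "use warnings" and "my" are additions the slides do not use.

```perl
#!/usr/bin/perl
# hello.pl (hypothetical file name): greet each command-line argument.
use strict;
use warnings;

# @ARGV holds the arguments given after the script name.
my @who = @ARGV ? @ARGV : ("world");
my $greeting = "Hello, " . join(" and ", @who) . "!";
print "$greeting\n";
```

Run with no arguments, this falls back to greeting "world".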
Beginning Perl

Every statement ends with a semicolon ";".

A comment starts with a hash "#" and runs to the end of
the line.

Variables are assigned a value using the character "=".

Variables are not statically typed, i.e., you do not
have to declare what kind of data you want to hold in
them.

Variables are created the first time you initialise
them, and this can happen anywhere in the program.
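
These rules can be seen together in a short sketch. The variable names are made up for illustration. Note one assumption: the block turns on "use strict", which is common practice but not used on the slides, and under it each variable must be introduced with "my" the first time it appears.

```perl
use strict;
use warnings;

# A comment runs from the hash to the end of the line.
my $count = 3;            # a number; no type declaration is needed
my $label = "apples";     # a string, assigned with "="
$count = "three";         # the same variable can later hold a string
print "$count $label\n";  # prints: three apples
```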
Scalar Variables

Contain a single piece of data
'$' character shows that a variable is scalar.
Scalar variables can store either a number or
a string.
A string is a chunk of text surrounded by
quotes.
$name = "paul";
$year = 1980;
print "$name is born in $year";
output: paul is born in 1980
Array Variables (Lists)

Ordered list of data, separated by commas.
'@' character shows that a variable is an array
Array of numbers
@year_of_birth = (1980, 1975, 1999);
Array of strings
@name = ("Paul", "Jake", "Tom");
Array of both strings and numbers
@paul_address = (14,"Cleveland St","NSW",2030);
Retrieving data from Arrays

Printing Arrays
@name = ("Paul", "Jake", "Tom");
print "@name";

Accessing individual elements in an array
@name = ("Paul", "Jake", "Tom");
print "$name[1]";

What has changed? @name to $name
– To access individual elements use the syntax
$array[index]

Why did $name[1] print the second element?
– Perl, like Java and C, uses index 0 to represent
the first element.
Interesting things you can do
with Arrays
@name = ("Paul", "Jake", "Tom");
print "@name";              # Paul Jake Tom
print @name;                # PaulJakeTom
$count = @name;             # $count = 3
@nameR = reverse(@name);    # @nameR = ("Tom","Jake","Paul")
@nameS = sort(@name);       # @nameS = ("Jake","Paul","Tom")
Basic Arithmetic Operators
+ Addition
- Subtraction
* multiplication
/ division
++ adding one to the variable
-- subtracting one from the variable
$a += 2 incrementing variable by 2
$b *= 3 tripling the value of the variable
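
A small sketch of these operators in use; the variable names and values are illustrative, and "use strict", "use warnings" and "my" are additions not found on the slides.

```perl
use strict;
use warnings;

my $n = 10;
my $sum  = $n + 5;   # 15
my $diff = $n - 5;   # 5
my $prod = $n * 5;   # 50
my $quot = $n / 4;   # 2.5 (division is not truncated to an integer)
$n++;                # $n is now 11
$n--;                # back to 10
$n += 2;             # 12
$n *= 3;             # 36
print "$sum $diff $prod $quot $n\n";   # prints: 15 5 50 2.5 36
```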
Relational Operators
Comparison              Numeric   String
Equals                  ==        eq
Not equal               !=        ne
Less than               <         lt
Greater than            >         gt
Less than or equal      <=        le
Greater than or equal   >=        ge
Comparison              <=>       cmp
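
Choosing the numeric or the string operator matters, because the same two values can compare differently. The sketch below uses made-up values to show this; "use strict", "use warnings" and "my" are additions not used on the slides.

```perl
use strict;
use warnings;

my $x = "10";
my $y = "9";

my $numeric_less = ($y < $x)  ? 1 : 0;   # 1: 9 is less than 10 as a number
my $string_less  = ($y lt $x) ? 1 : 0;   # 0: "9" sorts after "10" as text
my $numeric_cmp  = $x <=> $y;            # 1: 10 is greater than 9
my $string_cmp   = $x cmp $y;            # -1: "10" comes before "9" as text
print "$numeric_less $string_less $numeric_cmp $string_cmp\n";  # prints: 1 0 1 -1
```

The <=> and cmp operators return -1, 0 or 1 rather than true or false.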
Control Operators - If
if ( expression 1) {
...
}
elsif (expression 2) {
...
}
else {
...
}
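
A concrete instance of this skeleton, as a sketch only: the year and the messages are invented for illustration, and "use strict", "use warnings" and "my" are additions not used on the slides.

```perl
use strict;
use warnings;

my $year = 1980;   # illustrative value
my $era;
if ($year < 1980) {
    $era = "before the 1980s";
}
elsif ($year < 1990) {
    $era = "in the 1980s";
}
else {
    $era = "after the 1980s";
}
print "born $era\n";   # prints: born in the 1980s
```

Only the first branch whose expression is true runs; the else block runs when none is.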
Iteration Structures

while (CONDITION) { BLOCK }

until (CONDITION) {BLOCK}

do {BLOCK} while (CONDITION)

for (INITIALIZATION ; CONDITION ; Re-INITIALIZATION)
{BLOCK}

for VAR (LIST) {BLOCK}

foreach VAR (LIST) {BLOCK}
Iteration Structures
$i = 1;
while($i <= 5){
print "$i\n";
$i++;
}
for($x=1; $x <=5; $x++) {
print "$x\n";
}
@array = (1,2,3,4,5);
foreach $number (@array){
print "$number\n";
}
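
The examples above cover while, for and foreach. The remaining two forms, until and do ... while, might look like the following sketch; the variable names are illustrative, push is one of Perl's built-in array functions, and "use strict", "use warnings" and "my" are additions not used on the slides.

```perl
use strict;
use warnings;

# until: repeat as long as the condition is false.
my $i = 1;
my @seen;
until ($i > 5) {
    push @seen, $i;
    $i++;
}
print "@seen\n";   # prints: 1 2 3 4 5

# do ... while: the block runs once before the condition is first tested.
my $runs = 0;
do {
    $runs++;
} while ($runs < 0);
print "$runs\n";   # prints: 1
```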
String Operations
Strings can be concatenated with the dot operator
$lastname = "Harrison";
$firstname = "Paul";
$name = $firstname . $lastname;
$name = "$firstname$lastname";

String comparison can be done with the string relational operators, such as eq
$string1 = "hello";
$string2 = "hello";
if ($string1 eq $string2)
{ print "they are equal"; }
else { print "they are different"; }

String comparison using patterns

The =~ operator returns true if the pattern between the /
delimiters is found in the string.
$string1 = "HELLO";
$string2 = "Hi there";
# test if the string contains the pattern EL
if ($string1 =~ /EL/)
{ print "This string contains the pattern"; }
else { print "No pattern found"; }
Functions in Perl

Function calls place no strict type restrictions on arguments, unlike Java
– Java example
variable_type function (variable_type variable_name)
public int function1 (int var1, char var2) { … }

Perl provides lots of useful built-in functions
to get you started.
– chop - remove the last character of a string
– chomp - remove the trailing newline from the end of a
string, if there is one
– push - append one or more elements to the end of an array
– pop - remove the last element of an array and return it
– shift - remove the first element of an array and return it
– s/// - substitution operator; replace text matching a pattern with a string
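
A short sketch exercising each of these built-ins; the sample strings and names are illustrative, and "use strict", "use warnings" and "my" are additions not used on the slides.

```perl
use strict;
use warnings;

my $line = "hello world\n";
chomp($line);                 # removes the trailing newline: "hello world"
my $word = "Perl!";
chop($word);                  # removes the last character: "Perl"

my @queue = ("Paul", "Jake");
push(@queue, "Tom", "Mary");  # ("Paul", "Jake", "Tom", "Mary")
my $last  = pop(@queue);      # "Mary"; @queue is ("Paul", "Jake", "Tom")
my $first = shift(@queue);    # "Paul"; @queue is ("Jake", "Tom")

$line =~ s/world/Perl/;       # substitution: "hello Perl"
print "$line|$word|$last|$first|@queue\n";
# prints: hello Perl|Perl|Mary|Paul|Jake Tom
```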
Functions in Perl

The "split" function breaks a given string into
individual segments given a delimiter.
 split( /pattern/, string) returns a list
@output = split (/\s/, $string);
# breaks the sentence into words
@output = split (//, $string);
# breaks the sentence into single characters
@output = split (/,/, $string);
# breaks the sentence into chunks separated by a comma

join (delimiter, list) returns a string; the delimiter is a plain string, not a pattern
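
The two functions are roughly inverses of each other, as the sketch below suggests. The sample data is illustrative, and "use strict", "use warnings" and "my" are additions not used on the slides.

```perl
use strict;
use warnings;

my $string = "Paul,Jake,Tom";
my @names  = split(/,/, $string);      # ("Paul", "Jake", "Tom")
my $joined = join(" and ", @names);    # "Paul and Jake and Tom"
my @chars  = split(//, "abc");         # ("a", "b", "c")
print scalar(@names), " names: $joined\n";   # prints: 3 names: Paul and Jake and Tom
print "@chars\n";                            # prints: a b c
```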
Functions in Perl
A simple Perl function
sub sayHello {
print "Hello!!\n";
}
sayHello();
Executing functions in Perl

Function arguments are stored automatically in a
temporary array called @_ .
sub sayHelloto {
@name = @_;
$count = @_;
foreach $person (@name){
print "Hello $person\n";
}
return $count;
}
@array = ("Paul", "Jake", "Tom");
sayHelloto(@array);
sayHelloto("Mary", "Jane", "Tylor", 1,2,3);
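
The calls above discard the value that the function returns. A sketch of capturing it follows; say_hello_to is a hypothetical variant of the function above, written with "use strict", "use warnings" and "my", which the slides do not use.

```perl
use strict;
use warnings;

# Greets each argument and returns how many arguments it was given.
sub say_hello_to {
    my @names = @_;          # copy the arguments out of @_
    foreach my $person (@names) {
        print "Hello $person\n";
    }
    return scalar(@names);   # number of arguments
}

my $greeted = say_hello_to("Mary", "Jane");
print "greeted $greeted people\n";   # prints: greeted 2 people
```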
Input / Output

Perl allows you to read in any input that
is automatically sent to your program
via standard input by using the handle
<STDIN>.

One way of handling inputs via
<STDIN> is to use a loop to process
every line of input
Input / Output

Count the number of lines from standard
input and print the line number together with
the 1st word of each line.
$count = 1;
foreach $line (<STDIN>){
@array = split(/\s/, $line);
print "$count $array[0]\n";
$count++;
}

Other I/O topics include reading and writing to
files, Standard Error (STDERR) and Standard
Output (STDOUT).
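
As a rough sketch of those handles: the block below reads lines one at a time with a while loop and writes to both STDOUT and STDERR. To stay runnable without any input, it reads from an in-memory filehandle standing in for STDIN; that substitution, the sample text, and the use of "use strict", "use warnings" and "my" are assumptions of this example, not part of the lecture.

```perl
use strict;
use warnings;

# An in-memory filehandle stands in for <STDIN> so the sketch runs without input.
my $text = "alpha beta\ngamma delta\n";
open(my $in, "<", \$text) or die "cannot open in-memory handle: $!";

my $count = 0;
my @first_words;
while (my $line = <$in>) {      # reads one line per iteration
    chomp($line);
    my @words = split(/\s+/, $line);
    $count++;
    push @first_words, $words[0];
    print STDOUT "$count $words[0]\n";   # normal output
}
close($in);
print STDERR "read $count lines\n";      # diagnostics go to standard error
```

Replacing <$in> with <STDIN> would read from standard input instead.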
Regular Expression
A regular expression is a sequence of
characters that specifies a pattern.
Used for locating pieces of text in a file.
Regular expression syntax allows the
user to do a "wildcard" type search
without necessarily specifying the
characters literally.
Available across operating systems and
programming languages.

Simple Regular Expression

A simple regular expression contains
the exact string to match
$string = "aaaabbbbccc";
if($string =~ /bc/){
print "found pattern\n";
}
output: found pattern
Simple Regular Expression

The variable $& is automatically set to
the matched pattern
$string = "aaaabbbbccc";
if($string =~ /bc/){
print "found pattern : $&\n";
}
output: found pattern : bc
Simple Regular Expression

What happens when you want to match a
generalised pattern like an "a" followed
by some "b"s and a single "c"?
$string = "aaaabbbbccc";
if($string =~ /abbc/){
print "found pattern : $&\n";
}
else {print "nothing found\n"; }
output: nothing found
Regular Expression - Quantifiers


We can specify the number of times we want
to see a specific character in a regular
expression by adding a quantifier after the
character.
* (asterisk) matches zero or more copies of a
specific character
+ (plus) matches one or more copies of a
specific character
Regular Expression - Quantifiers
@array = ("ac", "abc", "abbc", "abbbc",
"abb", "bbc", "bcf", "abbb",
"c");
foreach $string (@array){
if($string =~ /ab*c/){
print "$string ";
}
}
output:
ac abc abbc abbbc
Regular Expression - Quantifiers
@array = ("ac", "abc", "abbc", "abbbc",
"abb", "bbc", "bcf", "abbb",
"c");
Regular Exp   Strings matched
abc           abc
ab*c          ac abc abbc abbbc
ab+c          abc abbc abbbc
Regular Expression - Anchors

You can place anchors before
and after the pattern to specify where
in the string it must match.

^ restricts the match to the beginning of a line
$ restricts the match to the end of a line

Regular Expression - Anchors
@array = ("ac", "abc", "abbc", "abbbc",
"abb", "bbc", "bcf", "abbb",
"c");
Regular Exp   Strings matched
^bc           bcf
^b*c          bbc bcf c
^b*c$         bbc c
b*c$          ac abc abbc abbbc bbc c
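
One way to check anchored patterns like these is to filter the array with each one. The sketch below does so with Perl's built-in grep function, which is not covered in the lecture; the result variable names are made up, and "use strict", "use warnings" and "my" are likewise additions.

```perl
use strict;
use warnings;

my @array = ("ac", "abc", "abbc", "abbbc", "abb", "bbc", "bcf", "abbb", "c");

# grep returns the elements for which the match succeeds.
my @starts_bc  = grep { /^bc/ }   @array;   # bcf
my @starts_b_c = grep { /^b*c/ }  @array;   # bbc bcf c
my @whole_b_c  = grep { /^b*c$/ } @array;   # bbc c
my @ends_b_c   = grep { /b*c$/ }  @array;   # ac abc abbc abbbc bbc c
print "@starts_bc | @starts_b_c | @whole_b_c | @ends_b_c\n";
```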
Regular Expression - Range

[…] lists the set of characters, any one of which
may match at that position.
[0123456789] will match a single numeric
character.
[0-9] will also match a single numeric
character
[A-Za-z] will match a single letter of either
case.

Regular Expression - Range

Search for a word that
– starts with the uppercase T
– has a lowercase letter as its second letter
– has a lowercase vowel as its third letter
– is 3 letters long followed by a space

Regular expression : "^T[a-z][aeiou] "

Note : [z-a] is backwards and does not work
Note : [A-z] does match upper and lowercase but
also 6 additional characters between the upper and
lower case letters in the ASCII chart: [ \ ] ^ _ `
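
A quick sketch of trying the pattern "^T[a-z][aeiou] " against a few strings. The test strings are invented for illustration; grep is a Perl built-in not covered in the lecture, and "use strict", "use warnings" and "my" are additions as well.

```perl
use strict;
use warnings;

my @lines = ("The cat", "Tom cat", "THE END", "Tie rack", "A Toe ");
my @hits  = grep { /^T[a-z][aeiou] / } @lines;
print join("|", @hits), "\n";   # prints: The cat|Tie rack
```

"Tom cat" fails on the third letter, "THE END" on the second, and "A Toe " because ^ requires the T at the very start.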

Regular Expression - Others

Match any single character with "." (dot)
a.c = matches any string containing "a" followed by any one
character and then "c"

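Specifying the number of repetitions with { and }
[a-z]{4,6} = match four, five or six lowercase
letters
(Basic regular expressions, as used by grep and sed, write these as \{ and \}.)

Remembering patterns with ( and ), recalling them with \1
Regular expressions allow you to remember and recall matched text
(Basic regular expressions write the parentheses as \( and \).)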
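
In Perl's own pattern syntax the braces and parentheses are written without backslashes; a backslashed brace or parenthesis matches that character literally. The sketch below shows both features with made-up word lists; grep is a Perl built-in not covered in the lecture, and "use strict", "use warnings" and "my" are additions.

```perl
use strict;
use warnings;

# {4,6}: four to six lowercase letters; ^ and $ make it cover the whole string.
my @words = ("abc", "abcd", "abcdef", "abcdefg");
my @fits  = grep { /^[a-z]{4,6}$/ } @words;    # abcd abcdef

# ( ) remembers what matched; \1 requires the same text again.
my @doubled = grep { /([a-z])\1/ } ("hello", "world", "book");   # hello book
print "@fits | @doubled\n";   # prints: abcd abcdef | hello book
```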
RegExp problem and strategies

You tend to match more lines than desired.
A.*B matches AAB as well as
AAAAAAACCCAABBBBAABBB

Knowing what you want to match
Knowing what you don’t want to match
Writing a pattern out to describe what you
want to match
Testing the pattern

More info : type "man re_syntax" in a unix
shell
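
To see how much A.*B really takes, the sketch below captures the matched text. It also shows the non-greedy form .*?, which is not covered in the lecture and is offered here only as one possible remedy; "use strict", "use warnings" and "my" are additions too.

```perl
use strict;
use warnings;

my $string = "AAAAAAACCCAABBBBAABBB";
my ($greedy) = $string =~ /(A.*B)/;    # .* grabs as much as it can
my ($lazy)   = $string =~ /(A.*?B)/;   # .*? stops at the first possible B
print "$greedy\n";   # prints: AAAAAAACCCAABBBBAABBB
print "$lazy\n";     # prints: AAAAAAACCCAAB
```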
Example problem - Background

Biologists are interested in analysing proteins
that are from a particular biochemical enzyme
class "CDK1, CDK2 or CDK3". In addition,
biologists would like to extract those protein
sequences that contain the amino acid
pattern (motif) that represents a particular
virus binding site.
Serine, Glutamic Acid, (multiple occurrences of) Alanine, Glycine
Serine = S, Glutamic Acid = E, Alanine = A, Glycine = G
Example Problem - Dataset

Dataset was downloaded from an online
phosphorylation protein database.

Contains 16472 protein entries in one file.

One entry per line, terminated by a
carriage return character.

Comma delimited entries
– field1, field2, field3, field4, …..
Example Problem - Dataset fields
1. acc - unique database ID
2. sequence - amino acid sequence for the protein
3. position - position along sequence that is
phosphorylated
4. code - amino acid that is phosphorylated
5. pmid - unique protein ID linked to an international
protein database
6. kinase - enzyme class of this protein
7. source - where this protein was found
8. entry_date - date entered into the database
The task
1. Extract those entries that have the
string CDK1, CDK2 or CDK3 in the
enzyme column.
2. Within our extracted entries, search
and match those sequences that
contain the virus binding pattern.
3. Print out the database ID of the
positively matched entries.
Problem: Divide and conquer
1. Extract the entries with enzyme class CDK1, CDK2 or CDK3
2. Extract those proteins with the pattern
Serine, Glutamic Acid, (multiple occurrences of) Alanine, Glycine
Serine = S, Glutamic Acid = E, Alanine = A, Glycine = G
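
One possible way to put the pieces together is sketched below; it is not the lecture's own solution. The four sample lines are hypothetical data laid out in the field order listed earlier. "Multiple occurrences of Alanine" is assumed here to mean one or more (A+), and fields are assumed never to contain a comma themselves. "use strict", "use warnings" and "my" are additions not used on the slides.

```perl
use strict;
use warnings;

# Hypothetical sample lines in the field order:
# acc, sequence, position, code, pmid, kinase, source, entry_date
my @lines = (
    "ID001,MKSEAAGLT,3,S,1111,CDK1,human,2004-01-01",
    "ID002,MKSEGLT,3,S,2222,CDK2,human,2004-01-02",
    "ID003,MKSEAGLT,3,S,3333,PKA,mouse,2004-01-03",
    "ID004,TTSEAGPP,3,S,4444,CDK3,mouse,2004-01-04",
);

my @matched_ids;
foreach my $line (@lines) {
    $line =~ s/[\r\n]+$//;                 # drop any trailing carriage return or newline
    my @field = split(/,/, $line);
    my ($acc, $sequence, $kinase) = @field[0, 1, 5];

    next unless $kinase =~ /^CDK[123]$/;   # step 1: enzyme class
    next unless $sequence =~ /SEA+G/;      # step 2: S, E, one or more A, G
    push @matched_ids, $acc;               # step 3: keep the database ID
}
print "$_\n" foreach @matched_ids;   # prints ID001 and ID004, one per line
```

With the real dataset, the lines would be read from the file or from standard input rather than from a hard-coded array.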
Interesting parts of Perl not
covered in this lecture

Hashes
– A unique key that is linked to a value
• "Lecture 1002" ---> "Thur 3pm"
• "Lecture 1002" ---> 25
• "Lecture 1002" ---> [name1, name2, … ]
• "Lecture 1002" ---> [{name1},{name2}.. ]
  {name1} --> student ID
  {name2} --> student ID
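
As a taste of what this looks like in code, here is a minimal sketch of the first three mappings. The hash names are invented, "name1" and "name2" are the placeholders from the slide, and "use strict", "use warnings" and "my" are additions not used in the lecture.

```perl
use strict;
use warnings;

# A hash maps unique keys to values; '%' marks a hash variable.
my %time = ("Lecture 1002" => "Thur 3pm");
my %size = ("Lecture 1002" => 25);

# A value can also be a reference to an array (here, of placeholder names).
my %students = ("Lecture 1002" => ["name1", "name2"]);

print "$time{'Lecture 1002'}\n";          # prints: Thur 3pm
print "$size{'Lecture 1002'}\n";          # prints: 25
print "$students{'Lecture 1002'}[1]\n";   # prints: name2
```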
Interesting parts of Perl not
covered in this lecture

CGI (Common Gateway Interface)
– Creation of dynamic web pages using Perl
– CGI, PHP, JavaScript, Java Applet, etc.

Object Oriented Perl

Perl books & references to explore at your
leisure
– http://perldoc.perl.org/
– http://www.oreilly.com/pub/topic/perl
– Book: O'Reilly - Perl Cookbook - This will save you
someday
– Book: O'Reilly - Mastering Regular Expressions