Perl Programming for Biologists - Part 1

Perl Programming for Biologists, Second Edition
Part 1: 9/11/2007
Yannick Pouliot, PhD
Bioresearch Informationist
Lane Medical Library & Knowledge Management Center
http://lane.stanford.edu
Class Requirements

You must:
- have wireless access
- have the admin password to your machine
To Do

Please download all class materials from
http://lane.stanford.edu/howto/index.html?id=_2796
into C:\course
Class Focus
1. Creating, writing and reading Excel files
2. Reformatting data files for input to an analysis program
3. Writing and reading from a database such as MS Access or other locally installed relational database, as well as from databases available on the Internet
And remember: Ask LOTS OF QUESTIONS
Cautions

- All examples pertain to MS Office 2003
  - Examples still work in MS Office 2007
  - However, Perl modules used here do not work with MS Office 2007-formatted documents
- All examples pertain to Perl 5.x, not 6.x
  - V.5 and 6 are NOT compatible
  - V.5 is far more common, so not much of an issue
So Why Perl?



- Perl = Practical Extraction and Report Language
- Free
- Very widely used
  - Especially in biological community
- Very flexible and portable
- Not the only language of this type
  - E.g., Python
- Not the absolute easiest
  - … but pretty easy
- Not suited for everything
  - E.g., for ultra-fast mathematically-oriented code, C is still best
Today’s session:
- Installing and understanding what is required to run Perl
- Understanding the basics of a Perl program
Part 1: Installation
Components to Install & Configure
1. Perl itself
   - More accurately, the Perl interpreter
   - We’ll use ActiveState Perl 5.8x (ActivePerl)
   - www.activestate.com/store/freedownload.aspx?prdGuid=81fbce82-6bd5-49bc-a91508d58c2648ca
2. Additional Perl modules
   - Module = extra functions not part of the interpreter
   - Described at Comprehensive Perl Archive Network (CPAN)
3. Open Perl IDE
   - IDE = integrated development environment:
     - Editor → to write/edit your program
     - Debugger → to find bugs
     - A compiler/interpreter → to run your program from within the IDE
   - sourceforge.net/project/showfiles.php?group_id=23334&release_id=91440
4. Configuring the ODBC manager (next week)
   - Part of Windows
   - Allows different programs to interact with databases on your machine or anywhere on the Web via a single “doorway”
What is an Interpreter?

= A program that translates an instruction into the computer’s language and executes it before proceeding to the next instruction
- In other words, the program is compiled and executed one instruction at a time
- Perl is usually used in interpreted mode
  - Can also be compiled once (= faster)
Installing Perl from ActiveState
1. Go to www.activestate.com/store/freedownload.aspx?prdGuid=81fbce82-6bd5-49bc-a91508d58c2648ca
We’ll be downloading Perl 5.8.x.x:
1. Select Windows MSI package for Windows X86
2. Run the installer
3. Install under c:\Perl
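
Once the installer has finished, a quick way to confirm that the interpreter works is to run a tiny script. The sketch below is not part of the class materials; the sub name perl_version is made up for illustration. It only uses Perl's built-in variables $^V (version) and $^X (path of the running interpreter).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the running interpreter's version as a dotted string, e.g. "5.8.8".
sub perl_version {
    return sprintf '%vd', $^V;
}

print 'Perl version: ', perl_version(), "\n";
print 'Interpreter:  ', $^X, "\n";   # full path of the perl executable in use
```

If Perl was installed under c:\Perl as described, the interpreter path printed here should point into that folder.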
Installing Additional Perl Modules
- The fountain of all things Perl: CPAN
  - = Comprehensive Perl Archive Network
  - http://www.cpan.org/
- What does a module look like?
- Why modules?
- PPM for downloading & installing modules
- What modules are in MY Perl?
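
One way to answer the last question is to ask Perl itself whether a module can be loaded. The following is a minimal sketch, not the method shown in class; is_installed is a hypothetical helper name, and the module names checked are taken from the class module list.

```perl
use strict;
use warnings;

# Return 1 if the named module can be loaded by this Perl, 0 otherwise.
sub is_installed {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;   # File::Copy -> File/Copy.pm
    return eval { require $file; 1 } ? 1 : 0;
}

for my $module (qw(File::Copy IO::File DBI Spreadsheet::WriteExcel)) {
    printf "%-25s %s\n", $module, is_installed($module) ? 'installed' : 'MISSING';
}
```

A module reported as MISSING is a candidate for installation through PPM.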
Perl Modules We’ll Be Using

Status       Name                       Function
Included     File::Copy                 manipulating files
Included     File::Find                 manipulating files
Included     File::Path                 manipulating files
You do it!   File::Rename               manipulating files
Included     IO::File                   accessing the insides of files
Included     Spreadsheet::WriteExcel    writing into an MS Excel spreadsheet
Included     Spreadsheet::ParseExcel    parsing an MS Excel spreadsheet
Included     Spreadsheet::BasicRead     reading the contents of an MS Excel spreadsheet
Included     Win32::OLE                 provides easy access to Windows (e.g., launching Excel)
Included     DBI                        provides access to relational databases
Included     DBD::ODBC                  provides access to relational databases
Included     URI                        accessing URLs
Included     LWP::Simple                interacting with a Web site via http
Included     Array::Unique              returns unique elements of an array
Included     List::Uniq                 returns unique elements of a list
Included     Data::Dumper               dumping data out of a data structure
Included     Switch                     switch function ("multiple if-else-then")
The PPM Module: Installing Perl Modules the Easy Way

- Perl modules can be downloaded and installed manually from CPAN (hard)
- They can also be installed via the Perl Package Manager: PPM (easy)
Installing an environment to run and edit Perl: Integrated Development Environment (IDE)
Why an IDE?
IDE = integrated development environment:
- Editor → to write/edit your program
- Debugger → to find bugs
- A “runner” (compiler/interpreter) → to run your program from within the IDE

IDEs provide tools that make writing & debugging easier
- E.g., automatic code highlighting

We’ll use Open Perl IDE
- Free, open source, portable
- sourceforge.net/project/showfiles.php?group_id=23334&release_id=91440
- IDE: Definition, description

For our Mac friends: Affrus
Installing Open Perl IDE
1. Go to sourceforge.net/project/showfiles.php?group_id=23334&release_id=91440 and download the code
2. Create folder Program Files/OpenPerlIDE
3. Unzip into Program Files/OpenPerlIDE
4. Update Path (under System Properties, Advanced, Environment Variables, System Variables)
   → this makes it possible to run Open Perl IDE from anywhere on your machine…
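
To check that the Path update took effect, a short Perl script can list the folders on the Path and look for a given one. This is a sketch under assumptions: path_dirs and on_path are hypothetical helper names, and the comparison ignores case because Windows paths are case-insensitive. The separator comes from the core Config module.

```perl
use strict;
use warnings;
use Config;   # core module; $Config{path_sep} is ';' on Windows, ':' on Unix

# Split a PATH-style string into its directories, skipping empty entries.
sub path_dirs {
    my ($path) = @_;
    return grep { length } split /\Q$Config{path_sep}\E/, $path;
}

# Return the number of PATH entries equal to $wanted (case-insensitive).
sub on_path {
    my ($path, $wanted) = @_;
    return scalar grep { lc($_) eq lc($wanted) } path_dirs($path);
}

print "$_\n" for path_dirs($ENV{PATH} || '');
```

On the class machines, a call such as on_path($ENV{PATH}, 'C:\Program Files\OpenPerlIDE') would return a non-zero count if step 4 succeeded, assuming the folder was created on drive C.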
BREAK
Part 2: What does it all do?
Example Short Program
1. Start Open Perl IDE
2. Load Simple1.pl
3. Run Simple1.pl
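
The contents of Simple1.pl are not reproduced in this transcript. As a stand-in, a first program of this kind might look like the sketch below; the sub name greeting and the DNA string are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the greeting in a sub so it can be reused.
sub greeting {
    my ($name) = @_;
    return "Hello, $name!";
}

my $dna = 'ATGCGTTAGC';                  # a scalar holding a short DNA string
print greeting('biologists'), "\n";
print "Sequence $dna is ", length($dna), " bases long\n";
```

Saving this under any name ending in .pl and running it from the IDE exercises the same load-and-run steps.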
Learning by Example

Simple2.pl
Exploring Perl’s Major Language Elements

- Norman Matloff’s introduction to Perl: http://heather.cs.ucdavis.edu/~matloff/Perl/PerlIntro.pdf
- Perl language reference: http://en.wikipedia.org/wiki/Perl#Data_types
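
The three basic data types covered by those references (scalars, arrays, hashes), together with subs and loops, can be seen in a few lines. The example below is only an illustrative sketch, not taken from the class files; base_counts and reverse_complement are made-up names and the sequence is arbitrary.

```perl
use strict;
use warnings;

# Count each base in a DNA string; returns a reference to a hash.
sub base_counts {
    my ($dna) = @_;
    my %count;
    $count{$_}++ for split //, uc $dna;
    return \%count;
}

# Reverse-complement a DNA string using reverse and tr///.
sub reverse_complement {
    my ($dna) = @_;
    my $rc = scalar reverse uc $dna;   # scalar context reverses the string
    $rc =~ tr/ACGT/TGCA/;              # swap each base for its complement
    return $rc;
}

my $seq    = 'ATGCGTTAGC';             # scalar
my @codons = $seq =~ /(.{3})/g;        # array of complete 3-base chunks
my $counts = base_counts($seq);        # hash reference

print "Codons: @codons\n";
for my $base (sort keys %$counts) {    # loop over the hash keys
    print "$base: $counts->{$base}\n";
}
print 'Reverse complement: ', reverse_complement($seq), "\n";
```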
Additional Key Books/Resources




- Learning by example: Perl Cookbook
- Perl Programming for Biologists
- Perl Quick Reference Guide
- My favorite: Perl Quick Reference
Going Further: Programming Tips

- Plan your program
  - Write down how you intend to process the data in more-or-less plain language
  - Goal: making sure that it really does make sense
  - Hacking doesn’t really pay…
- Have documentation handy
  - ActivePerl documentation (searchable)
  - Perl language reference
  - → eBooks: help served on a silver platter
  - Lane FAQs
- When you’re stuck: Search the Web
  - Google can answer almost any programming question
  - … though quality documentation is still best
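
As an illustration of planning in plain language first, the sketch below writes the plan as comments and then implements it. The task (turning a tab-delimited line into a comma-separated one) and the name tab_to_csv are assumptions chosen for this example, not something from the class files.

```perl
use strict;
use warnings;

# Plan, in plain language:
#   1. Take one tab-delimited line.
#   2. Split it into fields.
#   3. Quote any field containing a comma or a double quote.
#   4. Join the fields with commas.
sub tab_to_csv {
    my ($line) = @_;
    chomp $line;
    my @fields = split /\t/, $line, -1;   # -1 keeps trailing empty fields
    for my $field (@fields) {
        if ($field =~ /[",]/) {
            $field =~ s/"/""/g;           # double any embedded quotes
            $field = qq("$field");
        }
    }
    return join ',', @fields;
}

# Sample input line, made up for this example.
print tab_to_csv("BRCA1\tbreast cancer 1, early onset\t672"), "\n";
```

Writing the four steps down first makes it easy to check that each line of code corresponds to a step in the plan.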
Toying with Excel3.pl, a “real” program
Excel3.pl: Introducing Object Programming

Purpose: From an Excel worksheet that lists public identifiers for DNA sequences associated with genes, the program retrieves:
- UniGene cluster ID
- Gene symbol
- NCBI Gene ID
… and writes the result into another Excel worksheet

Mix of procedural and object programming

Relevant links:
- http://www.ncbi.nlm.nih.gov/sites/entrez?db=unigene&orig_db=unigene
- Entrez Utilities
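
The difference between procedural and object programming can be shown on something simpler than Excel3.pl, whose code is not reproduced here. In the sketch below, writing uses Perl's built-in functions (procedural style) and reading uses an IO::File object and its methods (object style). The sub names and the two sample identifiers are chosen for illustration only.

```perl
use strict;
use warnings;
use IO::File;                 # object-oriented file handles (core module)
use File::Temp qw(tempfile);  # core module; provides a scratch file

# Procedural style: built-in functions act on a file handle.
sub write_procedural {
    my ($path, @lines) = @_;
    open my $fh, '>', $path or die "Cannot write $path: $!";
    print {$fh} "$_\n" for @lines;
    close $fh;
}

# Object style: the file handle is an object and we call its methods.
sub read_object {
    my ($path) = @_;
    my $fh = IO::File->new($path, 'r') or die "Cannot read $path: $!";
    my @lines = $fh->getlines;
    $fh->close;
    chomp @lines;
    return @lines;
}

my (undef, $path) = tempfile(UNLINK => 1);   # scratch file, removed at exit
write_procedural($path, 'NM_007294', 'NM_000059');
print "$_\n" for read_object($path);
```

The arrow in $fh->getlines is the visible sign of a method call on an object; the Spreadsheet:: modules in the class list are used through the same kind of syntax.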
What Excel3.pl Does

[Flow diagram, in outline:]
1. Sequence identifier → search UniGene for cluster ID (UniGene ESearch) → result ID
2. Result ID → retrieve UniGene description for that cluster (UniGene ESummary) → cluster ID, written to the Excel report
3. Cluster ID → search Gene (Gene ESearch) → result ID
4. Result ID → retrieve Gene description for that gene (Gene ESummary) → gene symbols & descriptions, written to the Excel report
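
The ESearch and ESummary steps are requests to NCBI's Entrez Utilities, which are addressed by URL. The sketch below only builds such URLs for the Gene database; it is not the code of Excel3.pl, the helper names are hypothetical, and it makes no network request. The base address and the db, term and id parameters follow the public E-utilities URL scheme.

```perl
use strict;
use warnings;

my $EUTILS = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils';

# Percent-encode everything except unreserved URL characters.
sub url_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-_.~])/sprintf '%%%02X', ord $1/ge;
    return $text;
}

# URL that searches database $db for $term and returns matching record IDs.
sub esearch_url {
    my ($db, $term) = @_;
    return "$EUTILS/esearch.fcgi?db=$db&term=" . url_encode($term);
}

# URL that returns the summary of record $id in database $db.
sub esummary_url {
    my ($db, $id) = @_;
    return "$EUTILS/esummary.fcgi?db=$db&id=" . url_encode($id);
}

print esearch_url('gene', 'NM_007294'), "\n";   # sample sequence identifier
print esummary_url('gene', 672), "\n";          # sample Gene ID
```

With LWP::Simple from the class module list, the XML behind such a URL could then be fetched with its get function and parsed for the fields to write into the Excel report.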
Assignments

- Look at code for Example3.pl
  - Modify it, break it
  - Write down at least one question → so we can talk about it next week
eBooks Rule
What Does A Module Look Like?
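
The original slide content is not preserved in this transcript. As a rough sketch, a module is a package of subs that normally lives in its own .pm file and ends with a true value. The example below keeps everything in one file so it runs as is; the package name SeqTools and its gc_content sub are invented for illustration.

```perl
use strict;
use warnings;

# Normally this package would live in its own file, SeqTools.pm.
package SeqTools;

# Fraction of G and C bases in a DNA string (0 for an empty string).
sub gc_content {
    my ($dna) = @_;
    return 0 unless length $dna;
    my $gc = ($dna =~ tr/GCgc//);   # tr with an empty replacement just counts
    return $gc / length($dna);
}

1;   # a module file must end with a true value

# Back in the main program: call the module's sub by its full name.
package main;

printf "GC content: %.2f\n", SeqTools::gc_content('ATGCGTTAGC');
```

If the package were saved as SeqTools.pm in a folder Perl searches, the main program would load it with use SeqTools; in the same way it loads the modules from CPAN.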