High-Performance Computing
Survival Guide
James R. Knight
Yale Center for Genome Analysis
Department of Genetics
Yale University
January 25, 2016
1950’s – The Beginning...
2016 – Looking very similar...
...but there are differences
• Not a single computer but thousands of them, called a cluster
– Hundreds of physical “computers”, called nodes
– Each with 4-64 CPUs, called cores
• Nobody works in the server rooms anymore
– IT is there to fix what breaks, not to run computations (or help you
run computations)
– Everything is done by remote connections
• Computation is performed by submitting jobs for running
– This actually hasn’t changed...but how you run jobs has...
A Compute Cluster
[Diagram of louise.hpc.yale.edu with a “You are here!” marker: a login node (Login-0-1) and compute nodes (Compute-1-1 to Compute-3-2) joined by a network.]
• 300+ users
• 90 compute nodes for general use
• 300TB disk space
You Use a Compute Cluster! Surfing the Web
[Diagram with a “You are here!” marker: you click on a link, the compute nodes behind Blah.com construct the webpage contents, and the webpage is returned to you over the network.]
How you’ll be using Louise
[Diagram with a “You are here!” marker: connect by ssh to louise.hpc.yale.edu (login node Login-0-1), then connect by qsub -I to a compute node (Compute-1-1 to Compute-3-2).]
• Run commands on compute nodes (and submit qsub jobs to the rest of the cluster)
1970’s – Terminals, In the Beginning...
2016 – Pretty much the same...
• Terminal app on Mac
• Look in the “Other” folder in Launchpad
Your “New” User Interface – Hunt and Peck!
• Type a command at the prompt, hit the return key
program arguments...
• This runs the program, which will read the arguments, read
inputs, perform computations and produce outputs
• When it completes, the prompt is displayed, telling you it is ready
for the next command
• Key commands to learn:
ssh louise.hpc.yale.edu
qsub -I
Helpful Tips
• Take a Linux basics tutorial
• The faster you can type, the faster you will be done
• Select and learn a text editor
– Vi or Emacs
• Select and learn a programming language
– Perl, Python or R
• Ask these questions to keep you oriented
– What computer am I on?
– What directory am I in?
– Where are the files for my analysis?
– What program(s) do I have running?
– What jobs do I have running?
Directories and Paths
• Linux directory structure same as Mac/Windows folder structure
– Folders/directories containing files and other sub-folders/sub-dirs
– “Easy-to-access” directories: HOME directory
• A path is a string naming a file or directory in the structure
– The slash character (‘/’) is separator for directories
/Users/jamesknight/Desktop/hpc_survival_guide_jan_2015.pptx
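A minimal Python sketch of how a path breaks down into directories and a file name, using the standard pathlib module. It reuses the example path above. PurePosixPath only manipulates the string, so the file does not need to exist on your machine.

```python
from pathlib import PurePosixPath

# The example path from the slide, treated purely as a string of
# directory names separated by '/'.
path = PurePosixPath("/Users/jamesknight/Desktop/hpc_survival_guide_jan_2015.pptx")

print(path.parts)    # every component, split on the '/' separator
print(path.parent)   # the directory that contains the file
print(path.name)     # the file name itself
print(path.suffix)   # the file extension
```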
The Shell
• When you type commands and run programs, you are actually running
a program called a shell
– Designed to take user input, run programs and display output
– Started automatically when Terminal app started or when you log into a
computer
– Linux runs the bash shell, by default
• Maintains useful environment variables
– $PWD, which holds your current working directory path
– $HOME or ~, which holds your home directory path
– $PATH, which holds locations of programs
• Powerful tool for organizing and executing commands
– Useful to combine programs or redirect inputs and outputs, without having
to write a program to do that
– Full-fledged programming language, used to write shell scripts to run sets
of commands
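The environment variables listed above are visible to any program the shell starts. As a rough sketch, a Python script could read them like this. The function name show_environment is made up for illustration, and the fallbacks are an assumption for systems where a variable is not set.

```python
import os

def show_environment(environ=os.environ):
    """Collect the three variables named above into a dictionary."""
    return {
        # Current working directory; fall back to asking the OS directly.
        "PWD": environ.get("PWD", os.getcwd()),
        # Home directory; '~' is shell shorthand, so expand it if HOME is unset.
        "HOME": environ.get("HOME", os.path.expanduser("~")),
        # PATH is one string of directories joined by a separator (':' on Linux).
        "PATH": environ.get("PATH", "").split(os.pathsep),
    }

info = show_environment()
print("Working directory:", info["PWD"])
print("Home directory:", info["HOME"])
print("Directories searched for programs:", len(info["PATH"]))
```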
The Program’s Viewpoint
• Programs start knowing nothing, and must figure out what to do
– Lines of code are generalized instructions
– Specifics come from reading the program’s environment
• The program’s environment (diagram: The Program in the center, connected to each of these):
– Command-line arguments (what you typed)
– Standard input (keyboard)
– Standard output (screen)
– Standard error (screen)
– Files to read
– Files to write
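To make the program’s viewpoint concrete, here is a small Python sketch that takes its arguments, standard input, standard output and standard error as parameters. The function name run_program and the wording of its messages are illustrative assumptions. The demonstration uses in-memory stand-ins for the keyboard and the screen.

```python
import io

def run_program(argv, stdin, stdout, stderr):
    """Read the environment a program starts with and report on it."""
    args = argv[1:]                      # argv[0] is the program's own name
    lines = stdin.read().splitlines()    # everything arriving on standard input
    stdout.write("arguments: %s\n" % " ".join(args))   # results go to stdout
    stdout.write("input lines: %d\n" % len(lines))
    if not lines:
        # Diagnostics go to stderr so they do not mix with the results.
        stderr.write("warning: standard input was empty\n")
    return 0

# Demonstration with in-memory stand-ins for the keyboard and the screen.
fake_out, fake_err = io.StringIO(), io.StringIO()
run_program(["myprog", "a", "b"], io.StringIO("line 1\nline 2\n"), fake_out, fake_err)
print(fake_out.getvalue(), end="")
```

In a real script the same function could be called as run_program(sys.argv, sys.stdin, sys.stdout, sys.stderr) after importing sys.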
Shell Redirection, Piping and Multiple Commands
• The shell lets you redirect stdin, stdout and stderr to configure how
your program communicates
• myprog < inFile > outFile 2> errFile
– “< inFile” redirects stdin so that program reads contents of “inFile”
– “> outFile” redirects stdout so that program writes standard output to
“outFile”
– “2> errFile” redirects stderr so that program writes standard error to
“errFile”
• echo Hello | sed s/Hello/Goodbye/
– The “|” (called a pipe) redirects the echo program’s standard output so that
it writes to the standard input of the sed program
– This command writes “Goodbye” to the screen
• echo Hello ; echo Goodbye
– The semi-colon separates commands, allowing multiple programs to run
from one command-line
– This command writes “Hello” then “Goodbye” to the screen
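The shell does this wiring for you. For comparison, here is a sketch of the same redirection and piping done by hand with Python’s subprocess module. The HELLO and SWAP commands are Python one-liners standing in for echo and sed (an assumption made so the example does not depend on which programs are installed), and the helper names are made up.

```python
import os
import subprocess
import sys
import tempfile

# Stand-ins for `echo Hello` and `sed s/Hello/Goodbye/`.
HELLO = [sys.executable, "-c", "print('Hello')"]
SWAP = [sys.executable, "-c",
        "import sys; sys.stdout.write(sys.stdin.read().replace('Hello', 'Goodbye'))"]

def pipe(first_cmd, second_cmd):
    """Like `first | second`: feed first's stdout into second's stdin."""
    first = subprocess.Popen(first_cmd, stdout=subprocess.PIPE)
    second = subprocess.run(second_cmd, stdin=first.stdout,
                            stdout=subprocess.PIPE, text=True)
    first.stdout.close()
    first.wait()
    return second.stdout

def redirect_to_file(cmd, out_path):
    """Like `cmd > out_path`: the program's stdout is written to a file."""
    with open(out_path, "w") as out_file:
        subprocess.run(cmd, stdout=out_file, check=True)

# `HELLO > outFile`, then read the file back to see what was written.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "outFile")
    redirect_to_file(HELLO, out_path)
    with open(out_path) as handle:
        redirected = handle.read()

# `HELLO | SWAP`
piped = pipe(HELLO, SWAP)

print(redirected, end="")
print(piped, end="")
```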
Writing Scripts
• Sometimes Linux’s built-in programs, and existing bioinformatics
programs, are not enough
– To combine programs together in a specific way
– To run programs on many different files/datasets
– To perform custom statistical analyses on data files
• Scripting languages make it easy to write your own programs
– bash, perl, python, R
– Write the lines of the script using a text editor
– Use the language’s program to run the script
python myscript arguments...
Then, test, debug and rewrite...
Writing Scripts
• A script is like a lab protocol
– Instructions on how to perform a task
– Executed in order, from beginning to end
– Just as protocol steps can have sub-steps, repeated steps and
sub-protocols, script statements can have sub-statements, loops
and function calls
• Types of statements in a script
– Computation (assignment, input/output), if-then-else,
for and while loops, functions
• Each programming language has its own unique syntax that you
must follow
REMEMBER: You are the protocol writer...
...writing for someone very, very, very stupid
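The statement types listed above can be seen together in one short Python sketch. The dilution calculation (C1 × V1 = C2 × V2) and all names and numbers are made up for illustration. The code is written for Python 3, where print is a function.

```python
def dilution_volume(stock_conc, target_conc, final_volume):
    """Function: a reusable sub-protocol (C1 * V1 = C2 * V2, solved for V1)."""
    return target_conc * final_volume / stock_conc

# Assignment: store a value in a variable.
results = []

# for loop: repeat the same steps for each target concentration.
for target in [1.0, 2.0, 5.0]:
    volume = dilution_volume(10.0, target, 100.0)   # computation via a function call
    # if-then-else: choose between two alternatives.
    if volume < 15.0:
        note = "small"
    else:
        note = "large"
    results.append((target, volume, note))

# while loop: repeat until a condition stops being true.
halvings = 0
remaining = 100.0
while remaining > 10.0:
    remaining = remaining / 2
    halvings += 1

# Output
print(results)
print(halvings)
```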
Writing Scripts
• Instead of reagents, tubes and plates, scripts operate on values,
variables, data structures and files
– Values: numbers (1, 2, 87.5), strings (“I am a string!”)
– Variables: holder for a value
– Data structures: holder for collections of values
– Files: series of strings (text files) or numbers (binary files) stored on disk
• Important data structures:
– List or Array – ordered collection of values [ 1, 2, 4, 3 ]
– Hash or Dictionary – collection of “name, value” pairs, like a
telephone book
– Record or Struct – collection of named variables/data-structures
– Matrix – two-dimensional collection of numbers
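As a sketch, each of these data structures has a direct counterpart in Python. The sample names and numbers below are invented for illustration, and a list of lists is only one simple way to hold a matrix.

```python
from collections import namedtuple

# List: an ordered collection of values.
values = [1, 2, 4, 3]

# Dictionary: "name, value" pairs, like a telephone book.
phone_book = {"Alice": "555-0100", "Bob": "555-0199"}

# Record: a collection of named fields (here a namedtuple).
Sample = namedtuple("Sample", ["name", "tissue", "read_count"])
sample = Sample(name="S1", tissue="liver", read_count=1200)

# Matrix: a two-dimensional collection of numbers (a list of rows).
matrix = [[1, 2, 3],
          [4, 5, 6]]

print(values[2])          # third item of the list (counting starts at 0)
print(phone_book["Bob"])  # look up a value by its name
print(sample.tissue)      # read one named field of the record
print(matrix[1][0])       # second row, first column
```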
That’s fine, but how do you do this, really???
• My best recommendation: Think about it, and write it down, as a
protocol, then translate it into the programming language
– Make the step descriptions comments in the script
• Comments are lines beginning with ‘#’, which are ignored when
executing the script
– Refine into sub-steps when translation is difficult
• Example: writing echo in Python
– Echo takes the command-line arguments and writes them to
standard output
[jk2269@compute-7-2 ~]$ echo Hello from the cluster!
Hello from the cluster!
[jk2269@compute-7-2 ~]$
That’s fine, but how do you do this, really???
• Attempt #1: Implement that description
– Python has a sys.argv list with the command-line arguments
– Python has a print statement to write to standard output
• Program:
#
# Print the command-line arguments to stdout.
#
import sys
print sys.argv
[jk2269@compute-7-2 ~]$ python myecho.py Hello from the cluster!
['myecho.py', 'Hello', 'from', 'the', 'cluster!']
[jk2269@compute-7-2 ~]$
That’s fine, but how do you do this, really???
• Attempt #2: Refine, write each argument separately, so that the
output can be formatted better.
– Python can loop over the values of the sys.argv list
– The print statement can write string values like “ “ (a space)
• Program:
#
# 1. for each command-line argument, except the first,
#    a. print the argument and a space
#
import sys
for arg in sys.argv[1:]:
    print arg, " "
[jk2269@compute-7-2 ~]$ python myecho.py Hello from the cluster!
Hello
from
the
cluster!
[jk2269@compute-7-2 ~]$
That’s fine, but how do you do this, really???
• Attempt #3: Fix the extra newline characters
– Ending the print statement with a comma avoids the newline, but
does print a space (so skip the explicit space)
– Add an extra print statement to get it to print the newline
• Program:
#
# 1. for each command-line argument, except the first,
#    a. print the argument, with no newline
# 2. print a newline
#
import sys
for arg in sys.argv[1:]:
    print arg,
print
[jk2269@compute-7-2 ~]$ python myecho.py Hello from the cluster!
Hello from the cluster!
[jk2269@compute-7-2 ~]$
That’s fine, but how do you do this, really???
• Attempt #4: Try a different approach, construct the string to be
output, then print it.
– Python strings have a join method that combines a list of strings into
one string, using the string it is called on as the separator.
• Program:
#
# 1. Combine the command-line arguments into a
#    string, separating them by spaces
# 2. Print the string
#
import sys
line = " ".join(sys.argv[1:])
print line
[jk2269@compute-7-2 ~]$ python myecho.py Hello from the cluster!
Hello from the cluster!
[jk2269@compute-7-2 ~]$
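The programs above use the print statement of Python 2. Under Python 3, where print is a function, the same protocol could be sketched as follows. The helper name build_line is made up, and separating out the string construction is just one way to make the step easy to test.

```python
import sys

def build_line(arguments):
    """Step 1: combine the arguments into one string, separated by spaces."""
    return " ".join(arguments)

# Step 2: print the string. sys.argv[0] is the script name, so skip it.
print(build_line(sys.argv[1:]))
```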
That’s fine, but how do you do this, really???
• Why scripting/programming is hard:
– You must think of everything
• Use testing, iteration and refinement to make sure that you have
thought of everything
• You can get to “good enough”
– You have to write everything in a foreign language, with no
allowance for error
• My best recommendation: Think about it, and write it down, as a
protocol, then translate it into the programming language
– Design what you want the program to do as you would a protocol,
in English (or your favorite language)
– Match program statements to the steps, refining the steps so that
they can be translated
Running Jobs on the Cluster
• You must make reservations!
– Cluster is a shared resource, so you must ask for exclusive use of
nodes and cores
– The job request goes into a queue, and is granted when resources
are available
– How to do this? qsub!
• Interactive jobs
– “qsub -I” – request 1 core on 1 node
– “qsub -I -l nodes=1:ppn=8” – request 1 node, with 8 cores
• Batch jobs
– “qsub myjob.pbs” – Request to run the bash script myjob.pbs
• Louise’s cluster runs PBS/Torque to manage the queues, so the “.pbs”
suffix marks this as a script that can be submitted to the cluster
Running Jobs on the Cluster
Example myjob.pbs file
#PBS -q general
#PBS -l nodes=3:ppn=8
#PBS -o myjob_outFile.txt
#PBS -e myjob_errFile.txt
source ~/.bashrc
cd /data/scratch/firstjob_Jan2015
echo Hello
echo Goodbye
• The #PBS lines: lines containing options for the job request
• source ~/.bashrc: just do this
• cd /data/scratch/firstjob_Jan2015: set working directory
• The echo lines: the lines of your script
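If you end up writing many job files with the same layout, a small helper can fill in the pieces. This is only a sketch: the function make_pbs_script and its parameter names are hypothetical, and it reproduces just the four options shown in the example above.

```python
def make_pbs_script(queue, nodes, ppn, out_file, err_file, work_dir, commands):
    """Return the text of a PBS job script laid out like the example above."""
    lines = [
        "#PBS -q %s" % queue,                      # which queue to submit to
        "#PBS -l nodes=%d:ppn=%d" % (nodes, ppn),  # nodes and cores per node
        "#PBS -o %s" % out_file,                   # where standard output goes
        "#PBS -e %s" % err_file,                   # where standard error goes
        "source ~/.bashrc",                        # set up your environment
        "cd %s" % work_dir,                        # set the working directory
    ]
    lines.extend(commands)                         # the lines of your script
    return "\n".join(lines) + "\n"

script = make_pbs_script("general", 3, 8,
                         "myjob_outFile.txt", "myjob_errFile.txt",
                         "/data/scratch/firstjob_Jan2015",
                         ["echo Hello", "echo Goodbye"])
print(script, end="")
```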
Running Jobs on the Cluster
• What if I have to run a program on 100 datasets?
– You could make 100 scripts, or you could use Simplequeue!
• Write a text file, where each line is a one-line shell command
• Use the sqPBS.py program to make a PBS script
• Submit the PBS script
• Python program that can write the text file (let’s call it “writeit.py”)
import sys
import os
# Write one self-contained shell command per dataset named on the command line.
cwd = os.getcwd()
for arg in sys.argv[1:]:
    print "source ~/.bashrc ; cd", cwd, "; python myscript", arg
• Commands to run
python writeit.py dataset*.gz > runit.smplq
sqPBS.py general 3.2 jk2269 myscript runit.smplq > runit.pbs
qsub runit.pbs
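For Python 3, where print is a function, writeit.py could be sketched along these lines. The helper name task_line is made up. The command text matches what the version above prints.

```python
import os
import sys

def task_line(work_dir, dataset):
    """Build one self-contained, one-line shell command for a single dataset."""
    return "source ~/.bashrc ; cd %s ; python myscript %s" % (work_dir, dataset)

# One line per dataset given on the command line, starting from the
# directory the script was run in.
cwd = os.getcwd()
for arg in sys.argv[1:]:
    print(task_line(cwd, arg))
```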
What do you need to know how to do to “survive”?
• How to get into the cluster, and back out again.
• How to run commands in the shell.
• How to navigate around the directories (and make and remove
them).
• How to create, look at and edit text files.
• How to write scripts to do the computations you need to do.
• How to submit jobs, to run things on the compute nodes.
Helpful Tips
• Ask these questions to keep you oriented
– What computer am I on?
• Look at the prompt, ‘hostname’
– What directory am I in?
• Look at the prompt and window top
• ‘pwd’, ‘cd’
– Where are the files for my analysis?
• ‘ls’
• ‘mkdir’, ‘rm’, ‘rmdir’
• ‘more’ or ‘less’, ‘head’, ‘tail’
– What program(s) do I have running?
• ‘ps’, ‘top’, ‘screen’
– What jobs do I have running?
• ‘qstat’
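The first three questions can also be answered from inside a script, which is handy for writing the answers into a job’s output file. A rough Python sketch follows, with a made-up function name. The questions about running programs and jobs are left to ps and qstat.

```python
import os
import socket

def where_am_i():
    """Answer the first three orientation questions from inside a script."""
    return {
        "computer": socket.gethostname(),   # like the `hostname` command
        "directory": os.getcwd(),           # like `pwd`
        "files": sorted(os.listdir(".")),   # like `ls`
    }

answers = where_am_i()
print("What computer am I on?", answers["computer"])
print("What directory am I in?", answers["directory"])
print("How many files are here?", len(answers["files"]))
```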
Golden Rule for Bioinformatic Clusters
• Never, ever, ever read and write SAM files. If the software
doesn’t support BAM files natively, always pipe its SAM output
through samtools to convert it to BAM.
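One way to follow this rule from a Python script is to connect the program’s standard output directly to samtools, so that no SAM file is ever written to disk. This is a sketch under assumptions: samtools 1.x is on your PATH, the program writes SAM to standard output, and the function names are made up.

```python
import subprocess

def samtools_command(bam_path):
    """Command that reads SAM from standard input ('-') and writes BAM (-b)."""
    return ["samtools", "view", "-b", "-o", bam_path, "-"]

def pipe_to_bam(sam_producer_cmd, bam_path):
    """Run `sam_producer_cmd | samtools view -b -o bam_path -`."""
    producer = subprocess.Popen(sam_producer_cmd, stdout=subprocess.PIPE)
    converter = subprocess.Popen(samtools_command(bam_path), stdin=producer.stdout)
    producer.stdout.close()          # let the producer see a closed pipe if samtools exits
    converter_status = converter.wait()
    producer_status = producer.wait()
    return producer_status, converter_status

# Only show the conversion command here; nothing is executed.
print(" ".join(samtools_command("sample.bam")))
```

A call might look like pipe_to_bam(["my_aligner", "reads.fastq"], "sample.bam"), where my_aligner is a hypothetical program that writes SAM to standard output.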