Writing Scripts

High-Performance Computing
Survival Guide
James R. Knight
Yale Center for Genome Analysis
Department of Genetics
Yale University
January 25, 2016
1950’s – The Beginning...
2016 – Looking very similar...
...but there are differences
• Not a single computer but thousands of them, called a cluster
– Hundreds of physical “computers”, called nodes
– Each with 4-64 CPU’s, called cores
• Nobody works in the server rooms anymore
– IT is there to fix what breaks, not to run computations (or help you
run computations)
– Everything is done by remote connections
• Computation is performed by submitting jobs for running
– This actually hasn’t changed...but how you run jobs has...
A Compute Cluster
[Diagram: you (“You are here!”) connect over the network to louise.hpc.yale.edu, which has a login node (Login-0-1) and compute nodes (Compute-1-1, Compute-1-2, Compute-2-1, Compute-2-2, Compute-3-1, Compute-3-2). 300+ users, 90 compute nodes for general use, 300TB disk space.]
You Use a Compute Cluster! Surfing the Web
[Diagram: you (“You are here!”) click on a link; the request travels over the network to Blah.com, where compute nodes construct the webpage contents and return the webpage.]
How you’ll be using Louise
[Diagram: you (“You are here!”) connect by ssh to the login node (Login-0-1) of louise.hpc.yale.edu, then connect by qsub -I to a compute node. Run commands on compute nodes (and submit qsub jobs to the rest of the cluster). 300+ users, 90 compute nodes for general use, 300TB disk space.]
1970’s – Terminals, In the Beginning...
2016 – Pretty much the same...
• Terminal app on Mac
• Look in the “Other” folder in Launchpad
Your “New” User Interface – Hunt and Peck!
• Type a command at the prompt, hit the return key
program arguments...
• This runs the program, which will read the arguments, read
inputs, perform computations and produce outputs
• When it completes, the prompt is displayed, telling you it is ready
for the next command
• Key commands to learn: ssh [email protected]
qsub -I
Helpful Tips
• Take a Linux basics tutorial
• The faster you can type, the faster you will be done
• Select and learn a text editor
– Vi or Emacs
• Select and learn a programming language
– Perl, Python or R
• Ask these questions to keep you oriented
– What computer am I on?
– What directory am I in?
– Where are the files for my analysis?
– What program(s) do I have running?
– What jobs do I have running?
Directories and Paths
• Linux directory structure same as Mac/Windows folder structure
– Folders/directories containing files and other sub-folders/sub-dirs
– “Easy-to-access” directories: HOME directory
• A path is a string naming a file or directory in the structure
– The slash character (‘/’) is separator for directories
/Users/jamesknight/Desktop/hpc_survival_guide_jan_2015.pptx
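As a small illustration of how a path breaks down at the slash separator, here is a sketch in Python 3 (an assumption; the slides do not prescribe a version). It uses the standard posixpath module so that the "/" rules apply on any operating system.

```python
import posixpath

# Example path taken from the slide above.
path = "/Users/jamesknight/Desktop/hpc_survival_guide_jan_2015.pptx"

# Split the path into its directory part and its file name.
directory, filename = posixpath.split(path)

# Break the path into its individual components at each '/' separator.
components = [part for part in path.split("/") if part]

print(directory)
print(filename)
print(components)
```

The directory part is itself a path, so the same split can be repeated to walk up the directory structure.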
The Shell
• When you type commands and run programs, you are actually running
a program called a shell
– Designed to take user input, run programs and display output
– Started automatically when Terminal app started or when you log into a
computer
– Linux runs the bash shell, by default
• Maintains useful environment variables
– $PWD, which holds your current working directory path
– $HOME or ~, which holds your home directory path
– $PATH, which holds locations of programs
• Powerful tool for organizing and executing commands
– Useful to combine programs or redirect inputs and outputs, without having
to write a program to do that
– Full-fledged programming language, used to write shell scripts to run sets
of commands
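Programs started by the shell can look up the same information. The following is a minimal sketch, assuming Python 3; the function name describe_environment is made up for illustration.

```python
import os

def describe_environment():
    """Return the information the shell keeps in $PWD, $HOME and $PATH."""
    return {
        # Current working directory (what $PWD normally holds).
        "PWD": os.getcwd(),
        # Home directory (what $HOME and ~ refer to).
        "HOME": os.path.expanduser("~"),
        # $PATH is one string; split it into the list of program locations.
        "PATH": os.environ.get("PATH", "").split(os.pathsep),
    }

env = describe_environment()
for name, value in env.items():
    print(name, "=", value)
```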
The Program’s Viewpoint
• Programs start knowing nothing, and must figure out what to do
– Lines of code are generalized instructions
– Specifics come from reading the program’s environment
[Diagram: The Program at the center. Inputs: command-line arguments (what you typed), standard input (keyboard), files to read. Outputs: standard output (screen), standard error (screen), files to write.]
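A rough sketch of this viewpoint in Python 3 (an assumption about the version): the function below receives its arguments and its three standard streams explicitly. The name run and the upper-casing task are invented for illustration, and in-memory streams stand in for the keyboard and screen so the example runs on its own.

```python
import io

def run(argv, stdin, stdout, stderr):
    """Copy stdin to stdout in upper case and report progress on stderr."""
    args = argv[1:]  # argv[0] is the program name, the rest is what you typed
    count = 0
    for line in stdin:
        stdout.write(line.upper())
        count += 1
    stderr.write("args=%r, lines read=%d\n" % (args, count))
    return count

# In-memory stand-ins for the keyboard and the screen.
fake_in = io.StringIO("hello\nworld\n")
fake_out = io.StringIO()
fake_err = io.StringIO()
run(["myprog", "-v"], fake_in, fake_out, fake_err)
print(fake_out.getvalue(), end="")
```

In a real script the call would be run(sys.argv, sys.stdin, sys.stdout, sys.stderr), after importing sys.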
Shell Redirection, Piping and Multiple Commands
• The shell lets you redirect stdin, stdout and stderr to configure how
your program communicates
• myprog < inFile > outFile 2> errFile
– “< inFile” redirects stdin so that program reads contents of “inFile”
– “> outFile” redirects stdout so that program writes standard output to
“outFile”
– “2> errFile” redirects stderr so that program writes standard error to
“errFile”
• echo Hello | sed s/Hello/Goodbye/
– The “|” (called a pipe) redirects the echo program’s standard output so that
it writes to the standard input of the sed program
– This command writes “Goodbye” to the screen
• echo Hello ; echo Goodbye
– The semi-colon separates commands, allowing multiple programs to run
from one command-line
– This command writes “Hello” then “Goodbye” to the screen
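What the shell does for a pipe can be sketched with Python 3's subprocess module. This is only an illustration: two tiny Python one-liners stand in for echo and sed, so the sketch does not depend on which tools are installed.

```python
import subprocess
import sys

# Stand-ins for "echo Hello" and "sed s/Hello/Goodbye/".
producer_cmd = [sys.executable, "-c", "print('Hello')"]
consumer_cmd = [
    sys.executable, "-c",
    "import sys; sys.stdout.write(sys.stdin.read().replace('Hello', 'Goodbye'))",
]

# Equivalent of "producer | consumer": the producer's standard output
# becomes the consumer's standard input.
producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
consumer = subprocess.Popen(consumer_cmd, stdin=producer.stdout,
                            stdout=subprocess.PIPE, text=True)
producer.stdout.close()  # the consumer now holds the only read end of the pipe
output, _ = consumer.communicate()
producer.wait()

print(output, end="")
```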
Writing Scripts
• Sometimes Linux’s built-in programs, and existing bioinformatics
programs, are not enough
– To combine programs together in a specific way
– To run programs on many different files/datasets
– To perform custom statistical analyses on data files
• Scripting languages make it easy to write your own programs
– bash, perl, python, R
– Write the lines of the script using a text editor
– Use the language’s program to run the script
python myscript arguments...
Then, test, debug and rewrite...
Writing Scripts
• A script is like a lab protocol
– Instructions on how to perform a task
– Executed in order, from beginning to end
– Just as protocol steps can have sub-steps, repeated steps and
sub-protocols, script statements can have sub-statements, loops
and function calls
• Types of statements in a script
– Computation (assignment, input/output), if-then-else,
for and while loops, functions
• Each programming language has its own unique syntax that you
must follow
REMEMBER: You are the protocol writer...
...writing for someone very, very, very stupid
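The statement types listed above might look like this in Python 3 (an assumed version); the function name and the numbers are invented purely for illustration.

```python
def count_by_threshold(values, threshold):
    """Count how many values fall below, or at or above, a threshold."""
    below = 0                       # computation: assignment
    at_or_above = 0
    for value in values:            # for loop: repeat a step for each value
        if value < threshold:       # if-then-else: choose between two steps
            below = below + 1
        else:
            at_or_above = at_or_above + 1
    return below, at_or_above       # function: a reusable sub-protocol

remaining = 3
while remaining > 0:                # while loop: repeat until the test fails
    remaining = remaining - 1

below, at_or_above = count_by_threshold([50, 75, 100, 150], 100)
print("below:", below, "at or above:", at_or_above)   # input/output
```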
Writing Scripts
• Instead of reagents, tubes and plates, scripts operate on values,
variables, data structures and files
– Values: numbers (1, 2, 87.5), strings (“I am a string!”)
– Variables: holder for a value
– Data structures: holder for collections of values
– Files: series of strings (text files) or numbers (binary files) stored on disk
• Important data structures:
– List or Array – ordered collection of values [ 1, 2, 4, 3 ]
– Hash or Dictionary – collection of “name, value” pairs, like a
telephone book
– Record or Struct – collection of named variables/data-structures
– Matrix – two-dimensional collection of numbers
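For orientation, here is one way these four data structures can be written in Python 3 (an assumed version). All names and values are made up; other languages use different syntax for the same ideas.

```python
from collections import namedtuple

# List or array: ordered collection of values.
values = [1, 2, 4, 3]
values.append(5)

# Hash or dictionary: "name, value" pairs, like a telephone book.
phone_book = {"Alice": "555-0100", "Bob": "555-0101"}
phone_book["Carol"] = "555-0102"

# Record or struct: a fixed set of named fields.
Sample = namedtuple("Sample", ["name", "read_count"])
sample = Sample(name="sample1", read_count=1000)

# Matrix: two-dimensional collection of numbers, here a list of rows.
matrix = [[1, 2, 3],
          [4, 5, 6]]

print(values[0], phone_book["Bob"], sample.name, matrix[1][2])
```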
That’s fine, but how do you do this, really???
• My best recommendation: Think about it, and write it down, as a
protocol, then translate it into the programming language
– Make the step descriptions comments in the script
• Comments are lines beginning with ‘#’, which are ignored when
executing the script
– Refine into sub-steps when translation is difficult
• Example: writing echo in Python
– Echo takes the command-line arguments and writes them to
standard output
[jk2269@compute-7-2 ~]$ echo Hello from the cluster!
Hello from the cluster!
[jk2269@compute-7-2 ~]$
That’s fine, but how do you do this, really???
• Attempt #1: Implement that description
– Python has a sys.argv list with the command-line arguments
– Python has a print statement to write to standard output
• Program:
#
# Print the command-line arguments to stdout.
#
import sys
print sys.argv
[jk2269@compute-7-2 ~]$ python myecho.py Hello from the cluster!
['myecho.py', 'Hello', 'from', 'the', 'cluster!']
[jk2269@compute-7-2 ~]$
That’s fine, but how do you do this, really???
• Attempt #2: Refine, write each argument separately, so that the
output can be formatted better.
– Python can loop over the values of the sys.argv list
– The print statement can write string values like “ “ (a space)
• Program:
#
# 1. for each command-line argument, except the first,
#    a. print the argument and a space
#
import sys
for arg in sys.argv[1:]:
    print arg, " "
[jk2269@compute-7-2 ~]$ python myecho.py Hello from the cluster!
Hello
from
the
cluster!
[jk2269@compute-7-2 ~]$
That’s fine, but how do you do this, really???
• Attempt #3: Fix the extra newline characters
– Ending the print statement with a comma avoids the newline, but
does print a space (so skip the explicit space)
– Add an extra print statement to get it to print the newline
• Program:
#
# 1. for each command-line argument, except the first,
#    a. print the argument, with no newline
# 2. print a newline
#
import sys
for arg in sys.argv[1:]:
    print arg,
print
[jk2269@compute-7-2 ~]$ python myecho.py Hello from the cluster!
Hello from the cluster!
[jk2269@compute-7-2 ~]$
That’s fine, but how do you do this, really???
• Attempt #4: Try a different approach, construct the string to be
output, then print it.
– Python has a join function that combines a list of strings into a
string, with a separator.
• Program:
#
# 1. Combine the command-line arguments into a
#    string, separating them by spaces
# 2. Print the string
#
import sys
line = " ".join(sys.argv[1:])
print line
[jk2269@compute-7-2 ~]$ python myecho.py Hello from the cluster!
Hello from the cluster!
[jk2269@compute-7-2 ~]$
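The programs on these slides use the print statement of Python 2. If the cluster's python command is Python 3 (an assumption; check with python --version), print is a function and needs parentheses. A sketch of the join approach in that form, with echo_line as a made-up helper name:

```python
import sys

def echo_line(args):
    """Join the arguments with single spaces, as echo does."""
    return " ".join(args)

# sys.argv[0] is the script name, so skip it.
print(echo_line(sys.argv[1:]))
```

Putting the join into a function makes it easy to test on a known list of arguments before running it from the command line.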
That’s fine, but how do you do this, really???
• Why scripting/programming is hard:
– You must think of everything
• Use testing, iteration and refinement to make sure that you have
thought of everything
• You can get to “good enough”
– You have to write everything in a foreign language, with no
allowance for error
• My best recommendation: Think about it, and write it down, as a
protocol, then translate it into the programming language
– Design what you want the program to do as you would a protocol,
in English (or your favorite language)
– Match program statements to the steps, refining the steps so that
they can be translated
Running Jobs on the Cluster
• You must make reservations!
– Cluster is a shared resource, so you must ask for exclusive use of
nodes and cores
– The job request goes into a queue, and is granted when resources
are available
– How to do this? qsub!
• Interactive jobs
– “qsub -I” – request 1 core on 1 node
– “qsub -I -l nodes=1:ppn=8” – request 1 node, with 8 cores
• Batch jobs
– “qsub myjob.pbs” – Request to run the bash script myjob.pbs
• Louise’s cluster runs PBS/Torque to manage the queues, so the “.pbs”
suffix marks this as a script that can be submitted to the cluster
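The interactive requests above differ only in the resource option. As a sketch in Python 3 (an assumed version), a hypothetical helper can build the command line from the number of nodes and cores. It only constructs the text, using the two options shown on this slide, and does not run qsub.

```python
def interactive_qsub_command(nodes=None, ppn=None):
    """Build the argument list for an interactive qsub request."""
    command = ["qsub", "-I"]
    # Add a resource request only when both values are given.
    if nodes is not None and ppn is not None:
        command += ["-l", "nodes=%d:ppn=%d" % (nodes, ppn)]
    return command

print(" ".join(interactive_qsub_command()))
print(" ".join(interactive_qsub_command(nodes=1, ppn=8)))
```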
Running Jobs on the Cluster
Example myjob.pbs file
# Lines containing options for the job request
#PBS -q general
#PBS -l nodes=3:ppn=8
#PBS -o myjob_outFile.txt
#PBS -e myjob_errFile.txt

# Just do this
source ~/.bashrc

# Set working directory
cd /data/scratch/firstjob_Jan2015

# The lines of your script
echo Hello
echo Goodbye
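A job script with this layout can also be generated by a program. The sketch below, in Python 3 (an assumed version), uses a hypothetical make_pbs_script function that reproduces only the options shown in the example; it returns the script text and does not submit anything.

```python
def make_pbs_script(queue, nodes, ppn, workdir, commands, name="myjob"):
    """Return the text of a PBS job script laid out like the example."""
    lines = [
        # Options for the job request.
        "#PBS -q %s" % queue,
        "#PBS -l nodes=%d:ppn=%d" % (nodes, ppn),
        "#PBS -o %s_outFile.txt" % name,
        "#PBS -e %s_errFile.txt" % name,
        # Load the user's shell settings, then set the working directory.
        "source ~/.bashrc",
        "cd %s" % workdir,
    ]
    # The lines of your script.
    lines.extend(commands)
    return "\n".join(lines) + "\n"

script = make_pbs_script("general", 3, 8, "/data/scratch/firstjob_Jan2015",
                         ["echo Hello", "echo Goodbye"])
print(script, end="")
```

Saving the returned text to a file such as myjob.pbs gives something that can be passed to qsub.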
Running Jobs on the Cluster
• What if I have to run a program on 100 datasets?
– You could make 100 scripts, or you could use Simplequeue!
• Write a text file, where each line is a one-line shell command
• Use the sqPBS.py program to make a PBS script
• Submit the PBS script
• Python program that can write the text file (let’s call it “writeit.py”)
import sys
import os
cwd = os.getcwd()
for arg in sys.argv[1:]:
    print "source ~/.bashrc ; cd", cwd, "; python myscript", arg
• Commands to run
python writeit.py dataset*.gz > runit.smplq
sqPBS.py general 3.2 jk2269 myscript runit.smplq > runit.pbs
qsub runit.pbs
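A possible Python 3 variant of writeit.py is sketched below (Python 3 is an assumption; the slide's version is Python 2). The task_lines function and the example file names are hypothetical. It adds shell quoting with shlex.quote, so that a directory or file name containing spaces still forms a valid one-line command.

```python
import os
import shlex

def task_lines(datasets, cwd=None, script="myscript"):
    """Return one shell command line per dataset, as in the slide's writeit.py."""
    if cwd is None:
        cwd = os.getcwd()
    return [
        "source ~/.bashrc ; cd %s ; python %s %s"
        % (shlex.quote(cwd), shlex.quote(script), shlex.quote(name))
        for name in datasets
    ]

lines = task_lines(["dataset1.gz", "dataset2.gz"],
                   cwd="/data/scratch/firstjob_Jan2015")
for line in lines:
    print(line)
```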
What do you need to know how to do to “survive”?
• How to get into the cluster, and back out again.
• How to run commands in the shell.
• How to navigate around the directories (and make and remove
them).
• How to create, look at and edit text files.
• How to write scripts to do the computations you need to do.
• How to submit jobs, to run things on the compute nodes.
Helpful Tips
• Ask these questions to keep you oriented
– What computer am I on?
• Look at the prompt, ‘hostname’
– What directory am I in?
• Look at the prompt and window top
• ‘pwd’, ‘cd’
– Where are the files for my analysis?
• ‘ls’
• ‘mkdir’, ‘rm’, ‘rmdir’
• ‘more’ or ‘less’, ‘head’, ‘tail’
– What program(s) do I have running?
• ‘ps’, ‘top’, ‘screen’
– What jobs do I have running?
• ‘qstat’
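The first three questions can also be answered from inside a script. Here is a minimal sketch in Python 3 (an assumed version), with where_am_i as a made-up name. It covers only what hostname, pwd and ls report; for running programs and jobs, use the ps and qstat commands listed above.

```python
import os
import socket

def where_am_i():
    """Report the computer, the directory and the files in that directory."""
    return {
        "host": socket.gethostname(),       # like 'hostname'
        "directory": os.getcwd(),           # like 'pwd'
        "files": sorted(os.listdir(".")),   # like 'ls'
    }

info = where_am_i()
for question, answer in info.items():
    print(question, ":", answer)
```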
Golden Rule for Bioinformatic Clusters
• Never, ever, ever read and write SAM files. Always pipe the data
through samtools to convert from SAM to BAM, if the software
doesn’t support native BAM files.
