Introduction to Python for Biologists


This Lecture
Introduction to Python for Biologists
Part 1
Learning Objectives
• Install Python
• Data & Variables
• Strings
• String slicing
• String methods
• Lists
• List methods & list slicing
• Math
• Arrays
Why do Biologists need to learn Programming?
http://archive.oreilly.com/pub/a/oreilly//news/perlbio_1001.html
http://www.nature.com/nbt/journal/v31/n10/box/nbt.2721_BX1.html
• Biology is becoming a data-driven field
– New technology enables scientists to generate large data sets in semi-automated
experiments.
– Analysis of your own data is challenging
– Automation saves time
– Many interesting questions remain unanalyzed in huge amounts of publicly
available data
– Integration of new experimental results with public data is a challenging
computational problem
• Scientists who can pursue innovative data analysis methods
have an advantage over those limited to existing software (or
those who require the assistance of other people with
programming and data analysis skills)
Python*
• is a Programming Language
• Free, open source
• Runs on all types of computers
• “User friendly and easy to learn”
• “clean readable code”
• Very popular among bioinformaticians
• Good documentation available
https://wiki.python.org/moin/BeginnersGuide/Overview
• Powerful “object oriented” features
• Many add-on toolkits (“modules”) available for scientific
computing, visualization, statistics, etc.
*Python is named after a 1970’s British comedy TV show, not a large snake
Grad School
Python
Thanks to xkcd: https://xkcd.com/519/
Python.org
Online Tutorials
• You can’t learn an entire programming language
from a couple of classroom lectures.
• There are many online tutorials for Python, which
allow self-learning at your own pace
• We recommend:
• Codecademy.com
• TryPython.org
• LearnPython.org
• LearnPythontheHardWay.org/book
• Software Carpentry
• For Biologists:
• Python for Biologists
• Rosalind Python Village (learn by solving problems)
Reading for this week:
• Python for Biologists, chapters 1-3
• The anatomy of successful computational biology
software. Altschul S, Demchak B, Durbin R, Gentleman R, Krzywinski M, Li H,
Nekrutenko A, Robinson J, Rasband W, Taylor J, Trapnell C.
Nature Biotechnology 2013 Oct;31(10):894-7. DOI: 10.1038/nbt.2721
Install Python
• Assignment: Install Python on your computer
• Be sure to include the NumPy and SciPy
modules
• One easy way to set up a GUI for Python (on Mac
and Windows) is to download the free version of
Anaconda: http://continuum.io/downloads
• Or you can run the command line version on Linux or
in the Macintosh Terminal (on a Mac you will need
Xcode, a free software developer's toolkit from
Apple that is not installed by default in OS X)
Anaconda
• Your life (in this course) will probably be easier if you install
the (free) Anaconda distribution, which includes numerical, scientific,
statistical, and graphics modules.
http://continuum.io/downloads
Programming Concepts
All programming languages are built from the
same basic elements:
• data
• operators
• flow control
These concepts are expressed in a specific
syntax for each programming language
Data types
• Basic:
• Strings = 'GATCCATGCGAGACCCTTGA'
• Numbers = 7, 123.455, 4.2e-14
• Boolean = True, False
• Every data object has a type
– (try these examples on your own)
>>> type(1)
>>> type("GATCCT")
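The same check works for every basic type. A minimal sketch using values from this slide (the variable names are only illustrative):

```python
# Ask Python for the type of each basic kind of data.
seq_type = type("GATCCT")     # text -> str
int_type = type(7)            # whole number -> int
float_type = type(4.2e-14)    # decimal or scientific notation -> float
bool_type = type(True)        # Boolean -> bool

print(seq_type, int_type, float_type, bool_type)
```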
Variables
• A Variable is a named container for data (think of it
as a box or a shelf that has a name)
• In Python, a variable can hold any type of data and
does not need to be declared in advance
• The data in the variable can be changed at any
time (and can change to a different type)
• Python variable names must start with a letter or an
underscore, and can contain only letters, numbers, and the
underscore _ character.
• Variable names are case sensitive
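A short sketch of these rules in action (the names sample, gene, and Gene are made up for illustration):

```python
# A variable is created the first time a value is assigned to it.
sample = "ATGCGTA"          # holds a string
first_type = type(sample)

sample = 467                # the same name can later hold a number
second_type = type(sample)
print(first_type, second_type)

# Names are case sensitive: these are two different variables.
gene = "first value"
Gene = "second value"
print(gene, Gene)
```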
Comments
• Comments are bits of text added by the
programmer into the code that explain what is
going on. They are not executed by the computer.
• Python uses the hash symbol # to mark a
comment, anything on a line after the # is
ignored
• Use lots of clear comments in your code: for a
good grade, so others can understand your code,
and so you can understand your own code from
the past (days, weeks, years… ago).
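A small sketch of commented code (the calculation is only an illustration, reusing the gene_length value from the next slide):

```python
# Everything after a hash symbol is ignored by Python.
gene_length = 467            # length of the gene in base pairs
codons = gene_length // 3    # number of whole codons that fit in the gene
print(codons)                # prints 155
```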
Examples of Variables
A value is assigned to a variable by the = sign. The value to the
right of the = is put into the variable name on the left.
my_DNA = "ATGCGTA"
gene_length = 467
Dog_Text = "my Dog has Fleas"  # spaces are part of a string
counter = 6
pi_short = 3.14
my_list = ['a', 'b', 'c', 'd']
HBB_human = "MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH"
# this string is one line that wraps on the screen
Strings
• Strings are text. Must always be in quotes.
• Can use single or double quotes, but the opening and closing quotes must match
• A string can contain space characters and also newline
characters.
• Biology data involves a lot of strings: sequences, names
(taxonomy, gene names), etc.
• A string is usually assigned to a variable
>>> my_string = "gi|45478711|ref|NC_005816.1|Yersinia pestis biovar Microtus"
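A minimal sketch of the quoting rules and of a newline inside a string (the example strings are made up for illustration):

```python
single = 'Yersinia pestis'
double = "Yersinia pestis"
print(single == double)      # both quote styles produce the same string

# Wrap a string in one quote style when it contains the other.
note = "the 5' end of the gene"
print(note)

# \n inside a string is a newline character.
fasta = ">seq1\nATGCGTA"
print(fasta)                 # prints on two lines
```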
String Methods
• In Python, data objects of type ‘string’ have built-in
operators called ‘methods’
• Methods use a ‘dot’ syntax as follows:
>>> my_DNA = "ATGCGTA"
>>> my_DNA.count('G')
2
>>> my_DNA.lower()
'atgcgta'
String Concatenation
• Two strings can be joined with the + operator
c = 'cat'
h = 'hat'
print('cat' + 'hat')
ch = c + h
print(ch)
print(c + ' in the ' + h)
• Numbers must be converted to strings using the str() function
before using the string concatenation operator
A = 5
print(A + c)
# note the error message (a number and a string cannot be added)
print('We have' + ' ' + str(A) + ' ' + c + 's')
More String methods
• upper() and lower() return a new string with the case changed.
You usually need to put this value into a variable,
because the original string is unchanged.
>>> my_DNA = "TATGCGTA"
>>> my_DNA.lower()
'tatgcgta'
>>> my_DNA
'TATGCGTA'
• len() gives the length of a string
>>> len(my_DNA)
8
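A short sketch of keeping the result of lower() in its own variable (lower_DNA is an illustrative name):

```python
my_DNA = "TATGCGTA"
lower_DNA = my_DNA.lower()   # store the new string in its own variable

print(my_DNA)                # the original is unchanged: TATGCGTA
print(lower_DNA)             # tatgcgta
print(len(lower_DNA))        # changing case does not change the length: 8
```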
Find & Replace
• find() is another handy string method. (Note: It
only works for exact matches)
>>> my_DNA = "ATGCGTA"
>>> my_DNA.find("GC")
2
# returns the position index of the first occurrence
# of the search string in the target
• replace() finds and replaces letters in a string
>>> my_DNA.replace('T', 'X' )
'AXGCGXA'
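Two details worth trying: find() returns -1 when there is no match, and replace() returns a new string rather than changing the original. A small sketch (turning T into U to mimic transcription is just an illustrative use):

```python
my_DNA = "ATGCGTA"

hit = my_DNA.find("GC")      # index of the first match
miss = my_DNA.find("TTT")    # no match, so find() returns -1
print(hit, miss)

my_RNA = my_DNA.replace("T", "U")   # new string; my_DNA is unchanged
print(my_RNA)                # AUGCGUA
print(my_DNA)                # ATGCGTA
```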
Lists
• Lists contain a group of things, in square
brackets, separated by commas
List1 = ['a', 'b', 'c', 'd']
List2 = ["XP_008199794", "PF03769", "gi|54037254"]
List_mix = ["fish", "hat", "box", 17, 4935.45, True]
• The elements of a list do not all have to be of
the same type
• Lists are used for many tasks in Python that
involve a lot of data.
List Elements
• The elements in a list are ordered. They can be accessed by
their index number in the list.
• Python starts counting list elements at zero
• The list index is indicated by a number in square brackets
following the name of the list
• List slicing uses this format: [begin:end:step]
• You can do fancy things with list slicing, but the interval rules take
practice: the begin index is included, the end index is excluded, and
negative numbers count from the end of the list.
>>> my_list=['G', 'A', 'hat', 'cat']
>>> my_list[1]
'A'
>>> my_list[1:3]
['A', 'hat']
>>> my_list[:-2]
['G', 'A']
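A few more slices of the same list, as a sketch of the [begin:end:step] format (the variable names are illustrative):

```python
my_list = ['G', 'A', 'hat', 'cat']

first_two = my_list[0:2]      # begin included, end excluded
last_item = my_list[-1]       # negative index counts from the end
every_other = my_list[::2]    # step of 2 over the whole list
backwards = my_list[::-1]     # step of -1 reverses the list

print(first_two)      # ['G', 'A']
print(last_item)      # cat
print(every_other)    # ['G', 'hat']
print(backwards)      # ['cat', 'hat', 'A', 'G']
```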
List Methods
• You can assign a value to a specific position in a list:
>>> my_list=['G', 'A', 'hat', 'cat']
>>> my_list[1] = "X"
>>> my_list
['G', 'X', 'hat', 'cat']
• List methods are functions built into the list data type. They use the
‘dot’ syntax just like string methods.
>>> my_list.count('G')
1
• list.append() is a commonly used list method. It adds its argument to the end
of a list. It is frequently used to collect results as a program steps through a
loop
>>> my_list.append('T')
>>> my_list
['G', 'X', 'hat', 'cat', 'T']
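A sketch of collecting results with append() inside a loop. Loops are not covered in this lecture, so treat the for statement as a preview; the sequences are made up for illustration:

```python
sequences = ["ATGCGTA", "GGC", "TTAGA"]   # illustrative data
lengths = []                              # start with an empty list

for seq in sequences:
    lengths.append(len(seq))              # add one result per pass

print(lengths)    # [7, 3, 5]
```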
String Slicing
• Strings can be treated as a list of letters, and
sliced with the exact same methods as lists
>>> my_DNA = "ATGCGTA"
>>> my_DNA[1]
'T'
>>> my_DNA[1:4]
'TGC'
Split a string into a List
• Sometimes it is helpful to turn a string into a list of
words or numbers. The split() method does this.
• By default, it splits on whitespace, but any character
specified in the parentheses can be used as delimiter.
• This is useful when working with tab delimited or
comma delimited (csv) data.
>>> names = "melanogaster,simulans,yakuba,ananassae"
>>> species = names.split(",")
>>> print(names[1] + ' ' + species[2])
e yakuba
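Note that in the example above names[1] indexes the original string (one character), while species[2] indexes the list (one whole word). A sketch of the difference, plus splitting on whitespace and on tabs (the tab-delimited record is made up for illustration):

```python
names = "melanogaster,simulans,yakuba,ananassae"
species = names.split(",")

print(names[1])      # second character of the string: e
print(species[1])    # second element of the list: simulans

# With no argument, split() breaks on any whitespace.
words = "Yersinia pestis biovar Microtus".split()
print(words)

# Tab-delimited data: split on the tab character "\t".
record = "geneA\t467\tplus"          # hypothetical record
fields = record.split("\t")
length = int(fields[1])              # split() returns strings; convert to use as a number
print(fields, length)
```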
The list() function
• The list() function splits a string into a list of
characters
>>> hi = "Hello world"
>>> list(hi)
['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']
Join
• join() turns a list of strings into a single string. You
can add a spacer such as a comma or space
character. It has a backwards syntax, where the
spacer is the thing being acted upon by the method:
>>> my_list=['G', 'A', 'hat', 'cat']
>>> spacer = ':'
>>> newstring = spacer.join(my_list)
>>> newstring
'G:A:hat:cat'
>>> '#'.join(my_list)
'G#A#hat#cat'
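split() and join() are opposites, so together they can change the delimiter of a line of data. A minimal sketch reusing the species names from the split() slide:

```python
names = "melanogaster,simulans,yakuba,ananassae"

species = names.split(",")          # string -> list
tabbed = "\t".join(species)         # list -> string, now tab delimited
print(tabbed)

# An empty spacer glues the pieces together with nothing in between.
letters = list("ATG")
rejoined = "".join(letters)
print(rejoined)                     # ATG
```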
String Slicing: Exercises
(do these yourself in Python shell)
>>> dna = 'CGGTTAATAGGGACTCTC'
>>> dna[0]
>>> dna[0:3]
>>> dna[-1]
>>> dna[-1:-3]
>>> dna[-3:-1]
>>> dna[0:5]
>>> dna[0:5:2]
>>> dna[0:5][::-1]
>>> dna[0:5][::-2]
>>> dna
Math
• Python can do simple math like a calculator.
• Type the following expressions into an interactive
Python session (or the IDE editor), hit the
enter/return key (or Run button) and observe the
results:
2+2
6 - 3
8 / 3.0
9*3
6 ** 2
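A few more operators to try. This sketch assumes Python 3, where / always gives a decimal result (in Python 2, dividing two whole numbers drops the remainder, which is why the slide writes 3.0):

```python
true_div = 8 / 3       # true division: a float
floor_div = 8 // 3     # floor division: whole-number part only
remainder = 8 % 3      # modulo: what is left over
power = 6 ** 2         # exponent

print(true_div, floor_div, remainder, power)

# Multiplication happens before addition unless parentheses say otherwise.
print(2 + 3 * 4)       # 14
print((2 + 3) * 4)     # 20
```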
Math module
• Python does not activate all of its built-in
functions when you start it up
• You use the “import” command to add
modules.
• Type “import math” to get more advanced
mathematics functions. math.sqrt() is a
function in the math module. Try this:
import math
math.sqrt(36)
6.0
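A few other functions and constants from the math module, as a sketch (the radius value is arbitrary):

```python
import math

root = math.sqrt(36)           # square root, returned as a float
log_value = math.log2(8)       # base-2 logarithm
rounded_down = math.floor(4.7) # largest whole number not above 4.7

radius = 2.0
area = math.pi * radius ** 2   # math.pi is a constant, not a function

print(root, log_value, rounded_down, area)
```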
Simple Navigation
• Doing some simple file system navigation in Python is
unreasonably difficult (uses a module called os)
• Where am I?
>>> import os
>>> os.getcwd()
'C:\\Python27'
• What files are in this directory (folder)?
>>> os.listdir('.')
['at.py', 'hello.py', 'JASPAR-pfm_all.txt', 'JasparClient.py', 'MA0024.1.pfm',
'my_blast.xml', 'ros4.py', 'rosalind_ini5.txt', 'SRR020192.fastq', 'Test_100.fasta']
• Change directory
>>> os.chdir('/Users/stu/Python')
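The same steps as a small script. This is a sketch: the output depends on your own computer, and the file name is borrowed from the listing above purely as an example:

```python
import os

start_dir = os.getcwd()              # where am I?
entries = os.listdir(start_dir)      # what is in this directory?
print(start_dir)
print(sorted(entries))

# os.path.join builds a path with the right separator for your system.
path = os.path.join(start_dir, "Test_100.fasta")
print(path)
print(os.path.exists(path))          # True only if that file is really there
```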
NumPy and Arrays
• Arrays are like lists, but all elements share one data
type (usually numbers), and they have dimensions.
• NumPy is a Python module that enables array
operations.
Here is a simple one dimensional array of integers (just like
a list):
>>> import numpy as np
>>> x = np.array([42,47,11], int)
>>> x
array([42, 47, 11])
Software Carpentry has a nice introduction to NumPy arrays:
http://swcarpentry.github.io/python-novice-inflammation/01-numpy.html
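The practical difference from a list is that arithmetic applies to every element at once. A minimal sketch, assuming NumPy is installed (for example through Anaconda):

```python
import numpy as np

x = np.array([42, 47, 11])
doubled = x * 2            # multiplies each element
total = x.sum()            # adds up all elements
print(doubled)             # [84 94 22]
print(total)               # 100

# The same operator on a plain list repeats the list instead.
my_list = [42, 47, 11]
repeated = my_list * 2
print(repeated)            # [42, 47, 11, 42, 47, 11]
```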
2-Dimensional Array
• A two dimensional array is like a list of lists, but
each row must have the same number of
elements.
>>> x = np.array( ((11,12,13), (21,22,23), (31,32,33)) )
>>> print(x)
[[11 12 13]
 [21 22 23]
 [31 32 33]]
• Note the nested square brackets
• NumPy has no problem with 3, 4, or more
dimensions, but it is annoying to represent as text.
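Elements of a 2-dimensional array are reached with a row index and a column index, both counted from zero. A sketch using the same array (the variable names are illustrative):

```python
import numpy as np

x = np.array([[11, 12, 13], [21, 22, 23], [31, 32, 33]])

shape = x.shape            # (rows, columns)
one_value = x[0, 1]        # row 0, column 1
second_row = x[1]          # a whole row
first_column = x[:, 0]     # every row, column 0

print(shape)               # (3, 3)
print(one_value)           # 12
print(second_row)          # [21 22 23]
print(first_column)        # [11 21 31]
```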
Matrix Math
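• Matrices are 2-dimensional arrays.
• NumPy has linear algebra methods for operations on
matrices. Element-wise operations such as addition and subtraction
generally require that the two arrays have the same shape; matrix
multiplication requires that the number of columns in the first
matrix equal the number of rows in the second.
• Vector addition
• Matrix subtraction
• Matrix multiplication
• Scalar product (dot product)
• Cross product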
>>> x = np.array([3,2])
>>> y = np.array([5,1])
>>> z = x + y
>>> z
array([8, 3])
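The other operations in the list can be sketched the same way. The matrices A and B and the vectors u and v are made-up examples; note that * multiplies element by element, while @ performs matrix multiplication:

```python
import numpy as np

x = np.array([3, 2])
y = np.array([5, 1])
diff = x - y                  # element-wise subtraction
dot = np.dot(x, y)            # scalar (dot) product: 3*5 + 2*1
print(diff, dot)              # [-2  1] 17

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
product = A @ B               # matrix multiplication
elementwise = A * B           # NOT matrix multiplication
print(product)
print(elementwise)

# The cross product is shown here for 3-dimensional vectors.
u = np.array([1, 0, 0])
v = np.array([0, 1, 0])
cross = np.cross(u, v)
print(cross)                  # [0 0 1]
```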
Assignment:
• Rosalind Python Village
– All 6 problems (should take you 1-2 hours)
Rosalind Python Village:
http://rosalind.info/problems/list-view/?location=python-village
Summary
• Install Python
• Data & Variables
• Strings
• String slicing
• String methods
• Lists
• List methods & list slicing
• Math
• Arrays