Transcript BioPython

PYTHON
WHAT IS BIOPYTHON?
Biopython is a Python library of resources for developers of
Python-based software for bioinformatics and research.
• Can parse bioinformatics files into local data structures
• FASTA, GenBank, BLAST output, ClustalW, etc.
• Can access many online resources directly (web databases, NCBI)
from within the script.
• Works with sequences and records
• Provides many search algorithms, comparative algorithms and
format options.
INSTALLING BIOPYTHON
Biopython may already be included with Anaconda; if it is not,
install it with conda install biopython. You still need to type
the import statements shown in the examples.
If you use the standard IDLE environment, you will need to
install Biopython yourself, for example with pip install biopython.
Bioinformatics has become so important in recent years that
almost every programming language (C++, Perl, etc.) has
its own bioinformatics libraries.
SEQUENCE OBJECTS
Biological sequences are the main point of interest in
bioinformatics processing. Biopython provides a special datatype
for them called Seq (a sequence object).
Sequence objects are not the same as Python strings. They are
really strings together with additional information, such as an
alphabet, and a variety of methods such as translate(),
reverse_complement() and so on.
The alphabet argument shown here applies to Biopython releases
before 1.78; later releases removed Bio.Alphabet, and Seq takes
only the string.
from Bio.Seq import Seq
from Bio.Alphabet import Alphabet
dna = 'AGTACACTGGT'  # this is a pure string
# Here is how you create a sequence object.
seqdna = Seq('AGTACACTGGT', Alphabet())  # sequence object
Note that seqdna is a sequence object, not just a string.
ALPHABETS - SEE IUPAC
(INTERNATIONAL UNION OF PURE AND APPLIED CHEMISTRY)
An alphabet is just the set of allowable characters that may be
used in the sequence string.
IUPAC.unambiguous_dna is really just the set {A,C,G,T} of
nucleotides.
IUPAC.unambiguous_rna is {A,C,G,U}.
IUPAC.protein is just the 20 standard amino acids
{A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V}.
Other alphabets, such as IUPAC.ambiguous_dna, are also available.
We will mainly use the {A,C,G,T} DNA set.
Alphabets are handy for type checking our sequences.
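The type-checking idea can be sketched in plain Python, without Biopython: treat each alphabet as a set of letters and test whether every character of a sequence belongs to it. The names ALPHABETS and is_valid are made up for this illustration and are not part of Biopython.

```python
# Plain-Python sketch: an alphabet is just a set of allowed letters.
# ALPHABETS and is_valid are hypothetical names, not Biopython API.
ALPHABETS = {
    "unambiguous_dna": set("ACGT"),
    "unambiguous_rna": set("ACGU"),
    "protein": set("ACDEFGHIKLMNPQRSTVWY"),
}


def is_valid(sequence, alphabet_name):
    """Return True if every letter of sequence is in the named alphabet."""
    return set(sequence) <= ALPHABETS[alphabet_name]


print(is_valid("AGTACACTGGT", "unambiguous_dna"))  # True
print(is_valid("AGUACACUGGU", "unambiguous_dna"))  # False: U is not a DNA letter
```

A set comparison is used here because only membership matters; the order and number of letters in the sequence are irrelevant to the check.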
DUMPING ALPHABETS
from Bio.Alphabet import IUPAC
print(IUPAC.unambiguous_dna.letters)
print(IUPAC.ambiguous_dna.letters)
print(IUPAC.unambiguous_rna.letters)
print(IUPAC.protein.letters)
OUTPUT
GATC
GATCRYWSMKHBVDN
GAUC
ACDEFGHIKLMNPQRSTVWY
CAN WORK WITH SEQUENCE OBJECTS LIKE STRINGS
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
from Bio.SeqUtils import GC
my_seq = Seq("GATCG", IUPAC.unambiguous_dna)
print(my_seq[0])  # prints the first letter
print(len(my_seq))  # prints the length of the sequence
print(Seq("AAAA").count("AA"))  # non-overlapping count, i.e. 2
print(GC(my_seq))  # gives the GC % of the sequence
print(my_seq[2:5])  # we can even slice them; returns a Seq
# convert the Seq object to a pure string object
dna_string = str(my_seq)
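To make the GC percentage and the non-overlapping count concrete, here is a plain-Python sketch that does not need Biopython. The function names gc_percent and count_overlapping are illustrative only, and this simple version ignores ambiguity codes.

```python
# Plain-Python sketch; gc_percent and count_overlapping are hypothetical
# helper names written for illustration, not Biopython functions.
def gc_percent(sequence):
    """Percentage of letters that are G or C (case-insensitive)."""
    if not sequence:
        return 0.0
    seq = sequence.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def count_overlapping(sequence, sub):
    """Count occurrences of sub, allowing matches to overlap."""
    count = 0
    start = sequence.find(sub)
    while start != -1:
        count += 1
        start = sequence.find(sub, start + 1)
    return count


print(gc_percent("GATCG"))              # 3 of 5 letters are G or C: 60.0
print("AAAA".count("AA"))               # non-overlapping count: 2
print(count_overlapping("AAAA", "AA"))  # overlapping count: 3
```

Python's own str.count is non-overlapping, which is the behaviour shown for Seq above; the second helper shows what changes when overlaps are allowed.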
NUCLEOTIDE SEQUENCES AND (REVERSE) COMPLEMENTS
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
>>> my_seq
Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA())
>>> my_seq.complement()
Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG', IUPACUnambiguousDNA())
>>> my_seq.reverse_complement()
Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC', IUPACUnambiguousDNA())
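What complement() and reverse_complement() compute can be sketched on plain strings. This is an illustration for unambiguous DNA only, not Biopython's implementation; the two function names simply mirror the method names.

```python
# Plain-Python sketch for unambiguous DNA (A, C, G, T only).
# These are stand-alone functions written for illustration.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(dna):
    """Swap each base for its pairing partner: A<->T, C<->G."""
    return dna.translate(COMPLEMENT)


def reverse_complement(dna):
    """Complement the strand, then read it in the opposite direction."""
    return complement(dna)[::-1]


seq = "GATCGATGGGCCTATATAGGATCGAAAATCGC"
print(complement(seq))          # CTAGCTACCCGGATATATCCTAGCTTTTAGCG
print(reverse_complement(seq))  # GCGATTTTCGATCCTATATAGGCCCATCGATC
```

Applying reverse_complement twice returns the original sequence, which is a quick sanity check for any implementation.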
REVERSING A SEQUENCE
An easy way to just reverse a Seq object (or a Python string)
is to slice it with a -1 step.
# FORWARD
>>> my_seq
Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA())
# BACKWARD (using a -1 step slice)
>>> my_seq[::-1]
Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG', IUPACUnambiguousDNA())
DOUBLE STRANDED DNA
DNA coding strand (aka Crick strand, strand +1)
5’ ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG 3’
   |||||||||||||||||||||||||||||||||||||||
3’ TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC 5’
DNA template strand (aka Watson strand, strand −1)
TRANSCRIPTION
5’ ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG 3’
   |||||||||||||||||||||||||||||||||||||||
3’ TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC 5’
   Transcription (the mRNA matches the coding strand, with U in place of T)
5’ AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG 3’
Single-stranded messenger RNA
LET'S DO SOME REVERSE COMPLEMENTS
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
template_dna = coding_dna.reverse_complement()
print(template_dna)
CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT
TRANSCRIBE ( T->U )
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
messenger_rna = coding_dna.transcribe()
print(messenger_rna)
AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG
# or start from the template strand and do both steps
template_dna = coding_dna.reverse_complement()
messenger_rna = template_dna.reverse_complement().transcribe()
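The relationship between the coding strand, the template strand and the mRNA can be checked with a short plain-Python sketch. The function names transcribe and transcribe_template are illustrative; only unambiguous DNA is handled, and both strands are written 5' to 3'.

```python
# Plain-Python sketch; these helpers are written for illustration and
# are not Biopython's implementation.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def transcribe(coding_dna):
    """Coding strand -> mRNA: same letters, with T replaced by U."""
    return coding_dna.replace("T", "U")


def transcribe_template(template_dna):
    """Template strand -> mRNA: reverse complement first, then T -> U."""
    coding_dna = template_dna.translate(COMPLEMENT)[::-1]
    return transcribe(coding_dna)


coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
template_dna = "CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT"

print(transcribe(coding_dna))
# Both routes give the same messenger RNA.
print(transcribe_template(template_dna) == transcribe(coding_dna))  # True
```

Starting from the coding strand needs only the T to U substitution; starting from the template strand needs the extra reverse-complement step.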
TRANSLATE INTO PROTEIN
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
print(messenger_rna)
print(messenger_rna.translate())
# Output (spaces between codons were added by hand for readability):
AUG GCC AUU GUA AUG GGC CGC UGA AAG GGU GCC CGA UAG
MAIVMGR*KGAR*
# The * represents a stop codon.
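Translation is a codon-by-codon table lookup, which a plain-Python sketch can show. The standard genetic code is built here from the usual compact 64-letter listing in T, C, A, G order; the names CODON_TABLE and translate are this sketch's own, and the to_stop flag is modelled on the option of the same name in Biopython.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence, to_stop=False):
    """Translate DNA or RNA, reading codons from the first letter.

    Trailing letters that do not fill a whole codon are ignored.
    With to_stop=True, translation ends before the first stop codon.
    """
    dna = sequence.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*" and to_stop:
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = "AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG"
print(translate(mrna))                # MAIVMGR*KGAR*
print(translate(mrna, to_stop=True))  # MAIVMGR
```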
STANDARD TRANSLATION TABLE
PRINTING TABLES
from Bio.Data import CodonTable
# table 1 is the standard code, table 2 the vertebrate mitochondrial code
stdTable = CodonTable.unambiguous_dna_by_id[1]
print(stdTable)
mitoTable = CodonTable.unambiguous_dna_by_id[2]
print(mitoTable)
Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--

Table 2 Vertebrate Mitochondrial, SGC1

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA W   | A
T | TTG L   | TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L   | CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I(s)| ACT T   | AAT N   | AGT S   | T
A | ATC I(s)| ACC T   | AAC N   | AGC S   | C
A | ATA M(s)| ACA T   | AAA K   | AGA Stop| A
A | ATG M(s)| ACG T   | AAG K   | AGG Stop| G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V(s)| GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--
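The two tables differ in only four codon assignments, and a plain-Python sketch can show what that does to a translation. The standard table is rebuilt from its compact 64-letter listing, and the four vertebrate mitochondrial changes visible in Table 2 are applied as overrides. The names STANDARD, VERTEBRATE_MITO and translate are this sketch's own, and start-codon markings are not modelled.

```python
from itertools import product

# Standard code (Table 1), codons enumerated with bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Vertebrate mitochondrial code (Table 2): same table with four changes.
VERTEBRATE_MITO = dict(STANDARD)
VERTEBRATE_MITO.update({"TGA": "W", "ATA": "M", "AGA": "*", "AGG": "*"})


def translate(dna, table):
    """Translate a DNA string codon by codon using the given table."""
    usable = len(dna) - len(dna) % 3
    return "".join(table[dna[i:i + 3]] for i in range(0, usable, 3))


coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(translate(coding_dna, STANDARD))         # MAIVMGR*KGAR*
print(translate(coding_dna, VERTEBRATE_MITO))  # MAIVMGRWKGAR*

# Codons whose meaning differs between the two tables.
print(sorted(c for c in STANDARD if STANDARD[c] != VERTEBRATE_MITO[c]))
```

The internal TGA codon is a stop in the standard code but tryptophan (W) in the mitochondrial code, so the choice of table changes where the protein ends.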
CODON - AMINO ACIDS
Amino Acid       SLC    DNA codons
Isoleucine       I      ATT, ATC, ATA
Leucine          L      CTT, CTC, CTA, CTG, TTA, TTG
Valine           V      GTT, GTC, GTA, GTG
Phenylalanine    F      TTT, TTC
Methionine       M      ATG
Cysteine         C      TGT, TGC
Alanine          A      GCT, GCC, GCA, GCG
Glycine          G      GGT, GGC, GGA, GGG
Proline          P      CCT, CCC, CCA, CCG
Threonine        T      ACT, ACC, ACA, ACG
Serine           S      TCT, TCC, TCA, TCG, AGT, AGC
Tyrosine         Y      TAT, TAC
Tryptophan       W      TGG
Glutamine        Q      CAA, CAG
Asparagine       N      AAT, AAC
Histidine        H      CAT, CAC
Glutamic acid    E      GAA, GAG
Aspartic acid    D      GAT, GAC
Lysine           K      AAA, AAG
Arginine         R      CGT, CGC, CGA, CGG, AGA, AGG
Stop codons      Stop   TAA, TAG, TGA
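The amino-acid-to-codon listing is the standard codon table read in the opposite direction. A plain-Python sketch of that inversion follows; the name codons_for is made up for this example, and "*" stands for Stop.

```python
from collections import defaultdict
from itertools import product

# Standard code, codons enumerated with bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Invert the table: one-letter code -> list of codons that encode it.
codons_for = defaultdict(list)
for codon, amino_acid in CODON_TABLE.items():
    codons_for[amino_acid].append(codon)

print(sorted(codons_for["L"]))  # the six leucine codons
print(sorted(codons_for["*"]))  # the three stop codons
print(codons_for["M"])          # methionine has a single codon
```

Grouping this way makes the redundancy of the code visible: leucine, serine and arginine each have six codons, while methionine and tryptophan have one.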
THE SEQRECORD OBJECT
A SeqRecord is a structure that allows the storage of
additional information with a sequence. This includes the
usual information found in standard GenBank files. The
following is a sample of its attributes.
.seq - The sequence
.id - The primary ID used to identify the sequence (string)
.name - The common name of the sequence
.annotations - A dictionary of additional information about
the sequence
.features - A list of SeqFeature objects
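To picture how these attributes fit together, here is a simplified stand-in written with dataclasses. SimpleRecord and SimpleFeature are hypothetical classes for illustration only; they are not Biopython's SeqRecord and SeqFeature, and every value below is invented.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleFeature:
    """Hypothetical stand-in for a feature: a typed region of the sequence."""
    type: str
    start: int
    end: int


@dataclass
class SimpleRecord:
    """Hypothetical stand-in for a record: a sequence plus its metadata."""
    seq: str
    id: str = "<unknown id>"
    name: str = "<unknown name>"
    annotations: dict = field(default_factory=dict)
    features: list = field(default_factory=list)


# All values below are made up for the example.
record = SimpleRecord(seq="ATGGCCATTGTAATGGGCCGCTGA", id="example_1", name="demo")
record.annotations["organism"] = "hypothetical organism"
record.features.append(SimpleFeature(type="gene", start=0, end=24))

print(record.id, len(record.seq), record.features[0].type)
```

The default_factory arguments give every record its own dictionary and list, so annotations added to one record do not leak into another.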
READ A RECORD
from Bio import SeqIO
record = SeqIO.read("micoplasmaGen.gb", "genbank")
print(record.description)
# count the features that are annotated as genes
ct = 0
for f in record.features:
    if f.type == 'gene':
        ct += 1
print(ct)
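The same counting loop generalises to a tally of every feature type. The sketch below uses collections.Counter on a made-up list of features so that it runs without a GenBank file. Feature and count_feature_types are hypothetical names; the function reads only each feature's type attribute, so it should work equally on record.features.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a feature; only the .type attribute is used.
Feature = namedtuple("Feature", ["type"])


def count_feature_types(features):
    """Return a Counter mapping each feature type to its number of occurrences."""
    return Counter(feature.type for feature in features)


# Made-up feature list for illustration.
features = [
    Feature("source"),
    Feature("gene"),
    Feature("CDS"),
    Feature("gene"),
    Feature("CDS"),
    Feature("tRNA"),
]

counts = count_feature_types(features)
print(counts["gene"])  # 2
print(counts["rRNA"])  # 0, a Counter returns 0 for missing keys
```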