JOIN2004 Universidade do Minho

download report

Transcript JOIN2004 Universidade do Minho

4/7/2016
Why Life is Beautiful
James Tisdall
JOIN2004 Universidade do Minho
4/7/2016
James Tisdall (Jim)
Biocomputing Associates
PO Box 9965
Philadelphia PA 19118 USA
(610)933-1200
http:/www.biocomputingassociates.com
And
DuPont Experimental Station
[email protected]
JOIN2004 Universidade do Minho
Who am I?
4/7/2016
Musician
 Bell Laboratories, Information Principles Research
 Human Genome Project
early Perl program DNA WorkBench
 Mercator Genetics, Computational Biologist
 Fox Chase Cancer Center, Bioinformatics Manager
 Biocomputing Associates, consulting
 DuPont Genetics Research
 Author of "Beginning Perl for Bioinformatics" and
"Mastering Perl for Bioinformatics" from O'Reilly

JOIN2004 Universidade do Minho
4/7/2016
"
"
Bioinformatics means using computers for
biology research, and
Perl is a popular computer language for
bioinformatics programming
JOIN2004 Universidade do Minho
4/7/2016
OUTLINE
"
Perl resources
"
Overview of basic Perl
"
The art of programming
"
Sequences and strings
"
Motifs and loops
"
Subroutines and bugs
JOIN2004 Universidade do Minho
OUTLINE (Continued)
4/7/2016
"
Mutations and Randomization
"
The Genetic Code
"
Restriction Maps and Regular Expressions
"
GenBank
"
Protein Data Bank
"
BLAST
"
Further Topics
JOIN2004 Universidade do Minho
Advantages of Perl
4/7/2016
1. Ease of Programming
• Simplify common programming tasks. A lot of free
software for bioinformatics, web programming, …
2. Rapid Prototyping
• Often less programming time than other languages
3. Portability
• Most operating systems: Windows, Mac, Unix, etc…
• Free
• Run speed - suitable for most tasks
JOIN2004 Universidade do Minho
4/7/2016
Biological Sequence Data
Bases and amino acids are represented as letters in
Perl, just as in biology literature:
"
DNA and RNA
A C G T (or U)
"
Amino acids
ACDEFGHIKLMNPQRSTVWXY
JOIN2004 Universidade do Minho
4/7/2016
A Small Bioinformatics Program
Sequences are represented as Strings
# First we store the DNA in a variable called $DNA
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
# Next, we print the DNA onto the screen
print $DNA;
# Finally, we'll specifically tell the program to exit.
exit;
JOIN2004 Universidade do Minho
4/7/2016
Virtual Transcription: DNA to RNA
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
print "$DNA\n\n";
# Transcribe the DNA to RNA by substituting all T's with U's.
$RNA = $DNA;
$RNA =~ s/T/U/g;
print "$RNA\n";
JOIN2004 Universidade do Minho
4/7/2016
Reverse Complement
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
$revcom = reverse $DNA;
# Next substitute all bases by their complements,
# A->T, T->A, G->C, C->G, uppercase or lowercase
$revcom =~ tr/ACGTacgt/TGCAtgca/;
JOIN2004 Universidade do Minho
4/7/2016
Mutations and Randomization
JOIN2004 Universidade do Minho
4/7/2016
Modeling mutations randomly
"
How to randomly select a position in a string
"
How to randomly select an index in an array
"
Randomly select a base in DNA
"
Mutate a base to some other random base
"
Generate random DNA data sets
"
Repeatedly mutate DNA
JOIN2004 Universidade do Minho
4/7/2016
A program to simulate DNA mutation
"
Pseudocode:
1. Select a random position in the DNA
2. Choose a random nucleotide
3. Substitute the random nucleotide in the random
position
JOIN2004 Universidade do Minho
4/7/2016
A program to simulate DNA mutation
1. Select a random position in the DNA
$position = int(rand(length($DNA)))
2. Choose a random nucleotide
@nucleotides = ('A', 'C', 'G', 'T');
$new = $nucleotides[rand @nucleotides]
3. Substitute the random nucleotide in the random
position
substr($DNA, $position, 1, $new)
JOIN2004 Universidade do Minho
4/7/2016
Generating random DNA
sub make_random_DNA {
# Collect arguments, declare variables
my($length) = @_;
my $dna;
for (my $i=0 ; $i < $length ; ++$i) {
$dna .= $nucleotides[rand @nucleotides];
}
return $dna;
}
JOIN2004 Universidade do Minho
4/7/2016
The Genetic Code
JOIN2004 Universidade do Minho
4/7/2016
The genetic code
"
DNA encodes amino acids
–
3 bases of DNA = a codon
–
Each codon represents an amino acid or a "stop"
–
There are 64 codons, but 20 amino acids
–
Most amino acids are coded for by multiple codons,
which makes the genetic code a redundant code
JOIN2004 Universidade do Minho
4/7/2016
Translating codons to amino acids
sub codon2aa {
my($codon) = @_;
$codon = uc $codon;
my(%genetic_code) = (
'TCA' => 'S', # Serine
'TCC' => 'S', # Serine
'TCG' => 'S', # Serine
'TCT' => 'S', # Serine
'TTC' => 'F', # Phenylalanine
… et cetera …
JOIN2004 Universidade do Minho
4/7/2016
Translating DNA into proteins
sub dna2peptide {
my($dna) = @_;
# Initialize variables
my $protein = '';
# Translate each three-base codon to an amino acid, and append to a protein
for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
$protein .= codon2aa( substr($dna,$i,3) );
}
return $protein;
}
JOIN2004 Universidade do Minho
4/7/2016
Reading DNA from FASTA files
"
"
FASTA format is the most common way that DNA
is stored in files (but there are many others)
Example:
> sample dna (This is a typical fasta header.)
agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgcag
JOIN2004 Universidade do Minho
4/7/2016
Reading DNA from FASTA files
> sample dna (This is a typical fasta header.)
agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
Tgggacgccgccgagtggtctgtgcag
"
Notice that first line starts with a > symbol, and
contains annotation.
"
Each other line just contains sequence data.
"
Simple!
JOIN2004 Universidade do Minho
4/7/2016
Reading FASTA files: core code
From textbook, p. 170
foreach my $line (@fasta_file_data) {
if($line =~ /^>/) {# discard fasta header line
next;
} else {# keep line, add to sequence string
$sequence .= $line;
}
}
# remove non-sequence data (in this case, whitespace) from $sequence string
$sequence =~ s/\s//g;
JOIN2004 Universidade do Minho
4/7/2016
Regular expressions
Regular expressions are a language within
the Perl language that describe many kinds
of patterns in strings, e.g. motifs in DNA.
Regular expressions permit you to find and
alter many patterns with relative ease.
The excellent regular expressions in Perl
are a major reason for Perl's success as a
bioinformatics programming language.
JOIN2004 Universidade do Minho
4/7/2016
Restriction enzymes, sites, and maps
"
"
"
A restriction enzyme is a protein that cuts DNA at
short, specific sequences
EcoRI cuts at the restriction site GAATTC,
between the G and A, on both complementary
strands (the reverse complement of GAATTC is
GAATTC), leaving "sticky ends" for insertion into
a vector for cloning and sequencing.
A restriction map is the list of locations where a
given restriction site occurs in the sequence
JOIN2004 Universidade do Minho
4/7/2016
Finding restriction sites
while ( $sequence =~ /$regexp/ig ) {
push ( @positions, pos($sequence) - length($&) + 1);
}
"
while loop conditional is a pattern match
"
ig: i stands for "case insensitive"
"
g stands for "global", will keep searching entire string
"
pos is a built-in function that gives the ending position of the last successful
match; subtract the length of the match ($&) to get to the beginning of the
match; add 1 because biologists call the first base in DNA as position 1,
while Perl calls the first letter in a string as position 0
JOIN2004 Universidade do Minho
4/7/2016
GenBank
JOIN2004 Universidade do Minho
4/7/2016
GenBank
"
http://ncbi.nlm.nih.gov/Genbank
JOIN2004 Universidade do Minho
4/7/2016
GenBank Libraries
"
Examples:
PRI: primate sequences
ROD: rodent sequences
PHG: bacteriophage sequences
GSS: genome survey sequences
JOIN2004 Universidade do Minho
4/7/2016
Protein Data Bank
JOIN2004 Universidade do Minho
4/7/2016
PDB - Protein Data Bank
"
The primary depository for 3D protein structures
"
http://www.rcsb.org/pdb/
JOIN2004 Universidade do Minho
4/7/2016
BLAST
JOIN2004 Universidade do Minho
4/7/2016
BLAST
"
Basic Local Alignment Search Tool
"
Most popular bioinformatics program
"
Discovers sequence similarity between your
sequence and large databases
"
Voluminous output
"
Need to process output by program
JOIN2004 Universidade do Minho
4/7/2016
Similarity vs Homology
String matching is computer science term for finding
sequence similarity
"
Exact
"
Approximate : percent identity
Homology
"
Sequences are or are not related evolutionarily
"
Yes or no; no percentages
JOIN2004 Universidade do Minho
4/7/2016
Parsing BLAST output
"
"
"
"
Many bioinformatics projects require frequent
BLAST searches
Need to automate collection and processing
BLAST output does not label lines by their
function, so use regular expressions
Several BLAST parsers are available as Perl
modules: see for example Bioperl modules
Bio::Tools::BPlite and Bio::Tools::Blast::*
JOIN2004 Universidade do Minho
4/7/2016
Bioinformatics Resources
1. Books
•
Bioinformatics: Sequence and Genome Analysis by Mount
•
Developing Bioinformatics Computer Skills – Gibas & Jambeck
•
Introduction to Computational Biology: Maps,Sequences and Genomes,
Waterman
•
Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins.
Baxecvanis & Ouellette, editors
2. Government/Public Information Resources
•
National Center for Biotechnology Information (NCBI) – U.S. Government
Center – www.ncbi.nlm.nih.gov
•
European Molecular Biology Laboratory (EMBL). European Union
Laboratory – www.embl.org
•
European Bioinformatics Institute (EBI)- From UK, part of EMBLwww.ebi.ac.uk
•
UCSC Genome Resource- serves constructed genome data –
www.genome.ucsc.edu
JOIN2004 Universidade do Minho
4/7/2016
Bioinformatics Resources (cont.)
3. Conferences
•
ISMB: Intellegent Systems for Molecular Biology, www.iscb.org/ismb2004/
•
Bioinformatics Open Source Conference, www.bioinformatics.org
•
RECOMB: Conference on Computational Molecular Biology,
recomb04.sdsc.edu/
JOIN2004 Universidade do Minho
4/7/2016
Molecular Biology
1. Helpful Reference Books for Programmers
•
Recombinant DNA, 2nd edition, Watson et al.
•
Molecular Biology of the Gene, Watson et al.
•
Molecular Cell Biology, Lodish, et al.
JOIN2004 Universidade do Minho
4/7/2016
Approximate string matching
As an example of advanced data
structures, following is an algorithm
that relies on a 2 dimensional array.
It finds the best match for a (shorter)
pattern in a (longer) text.
It is one of several important dynamic
programming algorithms in
bioinformatics
JOIN2004 Universidade do Minho
4/7/2016
Approximate string matching 1 of 4
my $pattern = 'EIQADEVRL';
print "PATTERN:\n$pattern\n";
my $text =
'SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDE';
print "TEXT:\n$text\n";
my $TLEN = length $text;
my $PLEN = length $pattern;
# D is the Distance matrix, which shows the "edit distance" between
# substrings of the pattern and the text.
# It is implemented as a reference to an anonymous array.
my $D = [];
# The rows correspond to the text
# Initialize row 0 of D.
for (my $t=0; $t <= $TLEN ; ++$t) {
$D->[$t][0] = 0;
}
JOIN2004 Universidade do Minho
4/7/2016
Approximate string matching 2 of 4
# Compute the edit distances.
for (my $t=1; $t <= $TLEN ; ++$t) {
for (my $p=1; $p <= $PLEN ; ++$p) {
$D->[$t][$p] =
# Choose whichever alternative has the least cost
min3(
# Text and pattern may or may not match at this character
...
substr($text, $t-1, 1) eq substr($pattern, $p-1, 1)
? $D->[$t-1][$p-1] # If match, no increase in edit distance!
: $D->[$t-1][$p-1] + 1,
# If the text is missing a character
$D->[$t-1][$p] + 1,
# If the pattern is missing a character
$D->[$t][$p-1] + 1
)
}
}
JOIN2004 Universidade do Minho
4/7/2016
Approximate string matching 3 of 4
# Print D, the resulting edit distance array
for (my $p=0; $p <= $PLEN ; ++$p) {
for (my $t=0; $t <= $TLEN ; ++$t) {
print $D->[$t][$p], " ";
}
print "\n";
}
# Find the best match(es).
# The edit distances appear in the last row.
my @matches = ();
my $bestscore = 10000000;
for (my $t=1 ; $t <= $TLEN ; ++$t) {
if( $D->[$t][$PLEN] < $bestscore) {
$bestscore = $D->[$t][$PLEN];
@matches = ($t);
}elsif( $D->[$t][$PLEN] == $bestscore) {
push(@matches, $t);
}
}
JOIN2004 Universidade do Minho
4/7/2016
Approximate string matching 4 of 4
# Report the best match(es).
print "\nThe best match for the pattern $pattern\n";
print "has an edit distance of $bestscore\n";
print "and appears in the text ending at location";
print "s" if ( @matches > 1);
print " @matches\n";
JOIN2004 Universidade do Minho
4/7/2016
Approximate string matching: Output
PATTERN:
EIQADEVRL
TEXT:
SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDE
000000000000000000000000000000000000
111111111111011111011101011111111110
222222222222101222112211112222212221
333323333332211233222222222223322332
444433444443322123333333333333433343
555543455554433222344444444443444434
666654456665444333234545455554455543
776765556776555444323455556665556654
887776566787666555433456665676666765
998787667788776666543456776677777776
The best match for the pattern EIQADEVRL
has an edit distance of 3
and appears in the text ending at location 20
JOIN2004 Universidade do Minho
4/7/2016
Bioperl
Bioperl is a collection of Perl modules for
bioinformatics. It has been developed by a
large group of volunteers.
Bioperl has become a fairly stable platform
with which to develop software; it is
growing and changing rapidly, but its core
is relatively mature.
Bioperl is a must for anyone interested in
writing bioinformatics software in Perl
Don't reinvent the wheel! (Unless you like
to, for the fun or interest of it.)
JOIN2004 Universidade do Minho
4/7/2016
What can Bioperl do?
Sequence manipulation (DNA, RNA,
proteins)
"
"
"
"
"
"
"
Several kinds of sequence objects
Sequence format conversion
Translation
Revcom
Sequence statistics (mol. weights, etc.)
Restriction enzyme maps
Signal cleavage sites
JOIN2004 Universidade do Minho
4/7/2016
What can Bioperl do?
Sequence alignment
"
"
"
Interfaces with many alignment programs
such as fasta, clustalw, gcg
Extract subalignment, consensus string
Alignment statistics
JOIN2004 Universidade do Minho
4/7/2016
What can Bioperl do?
Biological databases
"
"
"
"
Retrieving Genbank, EMBL, acedb, etc.
Retrieving from remote databases
Indexing and retrieving from local databases
Parsing program reports from Blast and
variants, HMMer, genefinding programs
JOIN2004 Universidade do Minho
4/7/2016
What can Bioperl do?
Sequence annotation
"
"
Adding annotation
Parsing annotation from sequence databases
JOIN2004 Universidade do Minho
4/7/2016
What can Bioperl do?
Protein structure
"
"
Read in PDB files
Extract chains and their residues
JOIN2004 Universidade do Minho