Lecture 4
BNFO 235
Usman Roshan
IUPAC Nucleic Acid symbols
IUPAC Amino Acid symbols
Genetic code
Splitting and joining strings
• split: splits a string on a regular expression and returns an array
(with no second argument, split operates on $_)
– @s = split(/,/);
– @s = split(/\s+/);
• join: joins the elements of an array and returns a string
(the opposite of split)
– $seq = join("", @pieces);
– $seq = join("X", @pieces);
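A minimal sketch of the two functions working together. The variable names and sample strings are made up for illustration, and the string to split is passed explicitly as a second argument instead of relying on $_.

```perl
use strict;
use warnings;

# Split a comma-separated record into its fields.
my $line   = "ATG,CCC,TAA";
my @pieces = split(/,/, $line);        # ("ATG", "CCC", "TAA")

# Split on runs of whitespace (spaces and tabs alike).
my $row    = "A 10   3\t1 4";
my @fields = split(/\s+/, $row);       # ("A", "10", "3", "1", "4")

# join glues the pieces back together with the given separator.
my $seq    = join("", @pieces);        # "ATGCCCTAA"
my $marked = join("X", @pieces);       # "ATGXCCCXTAA"

print "$seq\n$marked\n";
```

Joining with the empty string rebuilds the sequence with nothing between the pieces, while any other separator is inserted between pieces only, never before the first or after the last.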
Searching and substitution
• $x =~ /$y/ --- true if the pattern in $y is found in $x
• $x =~ /ATG/ --- true if the start codon ATG is found in $x
• $x !~ /GC/ --- true if GC is not found in $x
• $x =~ s/T/U/g --- replace all T’s with U’s
• $x =~ s/g/G/g --- convert all lowercase g to uppercase G
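The operators above can be tried on a short sequence. This is only a sketch: the sequences and variable names are invented for illustration.

```perl
use strict;
use warnings;

my $dna = "ATGGATTAG";

# A match in scalar context is true when the pattern occurs anywhere in the string.
my $has_start = ($dna =~ /ATG/) ? 1 : 0;   # 1: ATG occurs at the start
my $lacks_gc  = ($dna !~ /GC/)  ? 1 : 0;   # 1: no G is directly followed by C

# s///g replaces every occurrence; without the g flag only the first is replaced.
# Copying first keeps the original string unchanged.
(my $rna = $dna) =~ s/T/U/g;               # "AUGGAUUAG"

my $mixed = "atggattag";
(my $fixed = $mixed) =~ s/g/G/g;           # "atGGattaG"

print "$has_start $lacks_gc $rna $fixed\n";
```

Matching is case-sensitive by default, so /ATG/ would not match "atg" unless the i flag is added.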
DNA regular expressions
Taken from Jagota’s Perl for Bioinformatics
DNA Sequence Evolution
[Figure: evolutionary tree drawn against a time axis labelled -3 mil yrs, -2 mil yrs, -1 mil yrs, today. Ancestral sequences shown: AAGACTT, AAGGCTT, T_GACTT, _GGGCTT, TAGACCTT, A_CACTT. Present-day sequences: _G_GCTT (Mouse), TAGGCCTT (Human), TAGCCCTTA (Monkey), A_CACTTC (Lion), A_C_CTT (Cat). An underscore marks a gap.]
Comparative Bioinformatics
• Fundamental notion of biology: all life is
related by an unknown evolutionary Tree of
Life.
• Therefore, if we know something about one
species we can make inferences about other
ones.
• Also, by comparing multiple species we can
make inferences about sets of species.
• How do we compare DNA or protein
sequences of two different species?
Comparative Bioinformatics
• We need to know how often each kind of mutation occurs,
for example A to T or A to C.
• To determine this we manually create a set of
“true” alignments and estimate the likelihood
of A changing to C, for example, by counting
the number of times A is aligned with C and
computing related statistics.
• Now we have a realistic “scoring matrix”
which can be used to evaluate how related
two species are based on their DNA.
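The counting idea can be sketched in a few lines of Perl. Everything here is an assumption made for illustration: the subroutine names count_substitutions and score_pair are hypothetical, the aligned pairs are given as in-memory strings, not read from a file, gap columns (marked "_") are simply skipped, and the matrix is stored as a hash of hashes with values copied from the example matrix in the Problems section.

```perl
use strict;
use warnings;

# Count how often each nucleotide is aligned with each other nucleotide
# across a list of equal-length aligned pairs. Each argument is [seq1, seq2].
sub count_substitutions {
    my @pairs = @_;
    my %count;
    for my $pair (@pairs) {
        my @x = split(//, $pair->[0]);
        my @y = split(//, $pair->[1]);
        for my $i (0 .. $#x) {
            next if $x[$i] eq "_" or $y[$i] eq "_";   # skip gap columns
            $count{ $x[$i] }{ $y[$i] }++;
        }
    }
    return \%count;
}

# Sum the matrix entries over the columns of two equal-length, gap-free sequences.
sub score_pair {
    my ($matrix, $s, $t) = @_;
    my $total = 0;
    for my $i (0 .. length($s) - 1) {
        $total += $matrix->{ substr($s, $i, 1) }{ substr($t, $i, 1) };
    }
    return $total;
}

my $counts = count_substitutions(["AAGACTT", "AAGGCTT"], ["T_GACTT", "TAGACTT"]);

# Fraction of aligned A's in the first sequences that face a G in the second.
my $row_total = 0;
$row_total += $_ for values %{ $counts->{A} };
my $freq_a_to_g = $counts->{A}{G} / $row_total;

my %matrix = (
    A => { A => 10, C => 3,  G => 1,  T => 4  },
    C => { A => 3,  C => 12, G => 3,  T => 5  },
    G => { A => 1,  C => 3,  G => 15, T => 2  },
    T => { A => 4,  C => 5,  G => 2,  T => 11 },
);
my $score = score_pair(\%matrix, "AAGACTT", "AAGGCTT");

print "A->G frequency: $freq_a_to_g, score: $score\n";
```

Raw frequencies like these are only the first step. Turning them into matrix scores requires the "related statistics" mentioned above, which this sketch does not attempt.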
Problems
• Write a Perl subroutine called readmatrix that reads a
DNA substitution scoring matrix from a file called
“dna.txt” and stores it in a two-dimensional array. The
format of the scoring matrix in the file is
    A   C   G   T
A  10   3   1   4
C   3  12   3   5
G   1   3  15   2
T   4   5   2  11
• Write a Perl subroutine called translate that takes an
mRNA sequence, translates it into a protein
sequence, and returns the protein sequence.
Problems
• Write a Perl program that reads in a
substitution scoring matrix from a file called
“matrix.txt”, reads in a pair of DNA sequences
of equal length from a file called “dna.txt”, and
returns the total substitution score between
the two sequences.
• Write a Perl program that reads pairs of DNA
sequences from a file called “DNApairs.txt”
and estimates the frequency of nucleotide
substitutions.