Slides - Department of Computer Science • NJIT
Download
Report
Transcript Slides - Department of Computer Science • NJIT
Lecture 1
BNFO 240
Usman Roshan
Course overview
• Perl progamming language (and some Unix basics)
• Sequence alignment problem
– Algorithm for exact pairwise alignment
– Heuristics for exact multiple alignment
– Computational complexity
– Heuristics for pairwise alignment and BLAST, FASTA database
search
– Real world alignment problems
– Substitution matrices
• Phylogeny reconstruction
– Estimating distance matrices
– Distance based phylogeny reconstruction ---- UPGMA and
neighbor joining algorithms
Overview (contd)
• Wednesdays --- meet in GITC 2305
• Fridays --- meet in PC Mall room number
PC 36
• Grade: 50% monthly programming
assignment and 50% final exam
• Texts:
– Introduction to Bioinformatics by Arthur Lesk
– Beginning Perl for Bioinformatics by James
Tisdall
DNA Sequence Evolution
-3 mil yrs
AAGACTT
AAGACTT
AAGGCTT
AAGGCTT
_GGGCTT
_GGGCTT
GGCTT
_G_GCTT
(Mouse)
(Mouse)
TAGACCTT
TAGACCTT
TAGGCCTT
TAGGCCTT
(Human)
(Human)
-2 mil yrs
T_GACTT
T_GACTT
TAGCCCTTA
TAGCCCTTA
(Monkey)
(Monkey)
A_CACTT
A_CACTT
ACACTTC
A_CACTTC
(Lion)
ACCTT
A_C_CTT
(Cat)
(Cat)
-1 mil yrs
today
Comparative bioinformatics
• What is the evolutionary relationship of a
set of DNA sequences?
• What are the evolutionary conserved
regions of a set of proteins?
• How evolutionary close is a pair of
species?
• How similar are two DNA sequences?
• How similar are a set of DNA sequences?
Representing DNA in a format
manipulatable by computers
• DNA is a double-helix molecule made
up of four nucleotides:
–
–
–
–
Adenosine (A)
Cytosine (C)
Thymine (T)
Guanine (G)
• Since A (adenosine) always pairs with
T (thymine) and C (cytosine) always
pairs with G (guanine) knowing only
one side of the ladder is enough
• We represent DNA as a sequence of
letters where each letter could be
A,C,G, or T.
• For example, for the helix shown here
we would represent this as CAGT.
Transcription and translation
Amino acids
Proteins are chains of
amino acids. There are
twenty different amino
acids that chain in
different ways to form
different proteins.
For example,
FLLVALCCRFGH
(this is how we could store
it in a file)
This sequence of amino
acids folds to form a 3-D
structure
Protein folding
Protein folding
• The protein folding
problem is to determine
the 3-D protein structure
from the sequence.
• Experimental techniques
are very expensive.
• Computational are cheap
but difficult to solve.
• By comparing sequences
we can deduce the
evolutionary conserved
portions which are also
functional (most of the time).
Protein
structure
• Primary structure: sequence of
amino acids.
• Secondary structure: parts of the
chain organizes itself into alpha
helices, beta sheets, and coils. Helices
and sheets are usually evolutionarily
conserved and can aid sequence
alignment.
• Tertiary structure: 3-D structure of
entire chain
• Quaternary structure: Complex of
several chains
Key points
• DNA can be represented as strings
consisting of four letters: A, C, G, and T.
They could be very long, e.g. thousands
and even millions of letters
• Proteins are also represented as strings
of 20 letters (each letter is an amino acid).
Their 3-D structure determines the
function to a large extent.
Pairwise sequence alignment
• How to align two sequences?
Pairwise alignment
• How to align two sequences?
• We use dynamic programming
• Treat DNA sequences as strings over the
alphabet {A, C, G, T}
Pairwise alignment
Dynamic programming
Define V(i,j) to be the optimal pairwise alignment
score between S1..i and T1..j (|S|=m, |T|=n)
Dynamic programming
Define V(i,j) to be the optimal pairwise alignment
score between S1..i and T1..j (|S|=m, |T|=n)
Time and space complexity is O(mn)
Tabular computation of scores
Traceback to get alignment
How do we understand this dynamic
programming algorithm?
• Let’s first look at some example
alignments
• Let’s look at gaps. How do we know where
to insert gaps
• Let’s look at the structure of an optimal
alignment of two sequences x and y and
how it relates optimal alignments of
subsequences of x and y