CIS786, Lecture 2

Download Report

Transcript CIS786, Lecture 2

http://creativecommons.org/licenses/by-sa/2.0/
BNFO 602, Lecture 2
Usman Roshan
Some of the slides are based upon material by David Wishart of University of Alberta and Ron Shamir
of Tel Aviv University
Previously…
• Model of DNA sequence evolution
– Poisson model under two state characters
– Derivation of expected number of changes on
a single edge of a tree
– Jukes-Cantor for four state model (DNA)
– Estimating expected number of changes of
two DNA sequences using maximum
likelihood
Previously…
• Distance based phylogeny reconstruction
methods
– UPGMA --- not additive
– Neighbor Joining --- additive
– Both are fast --- polynomial time
– Easy to implement
Previously
• Simulation
– Method for simulating evolution of DNA
sequences on a fixed tree
– Comparing two different phylogenies for
computing the error rate
– Effect of accuracy on real methods as a
function of sequence length, number of
sequences, and other factors
Sequencing Successes
T7 bacteriophage
completed in 1983
39,937 bp, 59 coded proteins
Escherichia coli
completed in 1998
4,639,221 bp, 4293 ORFs
Sacchoromyces cerevisae
completed in 1996
12,069,252 bp, 5800 genes
Sequencing Successes
Caenorhabditis elegans
completed in 1998
95,078,296 bp, 19,099 genes
Drosophila melanogaster
completed in 2000
116,117,226 bp, 13,601 genes
Homo sapiens
completed in 2003
3,201,762,515 bp, 31,780 genes
Genomes to Date
• 8 vertebrates (human, mouse, rat, fugu,
zebrafish)
• 3 plants (arabadopsis, rice, poplar)
• 2 insects (fruit fly, mosquito)
• 2 nematodes (C. elegans, C. briggsae)
• 1 sea squirt
• 4 parasites (plasmodium, guillardia)
• 4 fungi (S. cerevisae, S. pombe)
• 200+ bacteria and archebacteria
• 2000+ viruses
So what do we do with all
this sequence data?
So what do we do with all
this sequence data?
Comparative
bioinformatics
DNA Sequence Evolution
-3 mil yrs
AAGACTT
AAGACTT
AAGGCTT
AAGGCTT
_GGGCTT
_GGGCTT
GGCTT
_G_GCTT
(Mouse)
(Mouse)
TAGACCTT
TAGACCTT
TAGGCCTT
TAGGCCTT
(Human)
(Human)
-2 mil yrs
T_GACTT
T_GACTT
TAGCCCTTA
TAGCCCTTA
(Monkey)
(Monkey)
A_CACTT
A_CACTT
ACACTTC
A_CACTTC
(Lion)
ACCTT
A_C_CTT
(Cat)
(Cat)
-1 mil yrs
today
Sequence alignments
They tell us about
• Function or activity of a new gene/protein
• Structure or shape of a new protein
• Location or preferred location of a protein
• Stability of a gene or protein
• Origin of a gene or protein
• Origin or phylogeny of an organelle
• Origin or phylogeny of an organism
• And more…
Pairwise alignment
• How to align two sequences?
Pairwise alignment
• How to align two sequences?
• We use dynamic programming
• Treat DNA sequences as strings over the
alphabet {A, C, G, T}
Pairwise alignment
Dynamic programming
Define V(i,j) to be the optimal pairwise alignment
score between S1..i and T1..j (|S|=m, |T|=n)
Dynamic programming
Define V(i,j) to be the optimal pairwise alignment
score between S1..i and T1..j (|S|=m, |T|=n)
Time and space complexity is O(mn)
Tabular computation of scores
Traceback to get alignment
Local alignment
Finding optimally aligned local regions
Local alignment
Database searching
• Suppose we have a set of 1,000,000
sequences
• You have a query sequence q and want to
find the m closest ones in the database--that means 1,000,000 pairwise
alignments!
• How to speed up pairwise alignments?
FASTA
• FASTA was the first
software for quick
searching of a
database
• Introduced the idea of
searching for k-mers
• Can be done quickly
by preprocessing
database
FASTA: combine high scoring hits into
diagonal runs
BLAST
Key idea: search for k-mers (short matchig substrings)
quickly by preprocessing the database.
BLAST
This key idea can also be used for speeding up pairwise
alignments when doing multiple sequence alignments
Biologically realistic scoring matrices
• PAM and BLOSUM are most popular
• PAM was developed by Margaret Dayhoff
and co-workers in 1978 by examining
1572 mutations between 71 families of
closely related proteins
• BLOSUM is more recent and computed
from blocks of sequences with sufficient
similarity
PAM
• We need to compute the probability
transition matrix M which defines the
probability of amino acid i converting to j
• Examine a set of closely related
sequences which are easy to align---for
PAM 1572 mutations between 71 families
• Compute probabilities of change and
background probabilities by simple
counting
PAM
• In this model the unit of evolution is the amount
of evolution that will change 1 in 100 amino
acids on the average
The scoring matrix Sab is the ratio of Mab to pb
PAM Mij matrix (x10000)
Multiple sequence alignment
• “Two sequences whisper, multiple
sequences shout out loud”---Arthur Lesk
• Computationally very hard---NP-hard
Formally…
Multiple sequence alignment
Unaligned sequences
Aligned sequences
GGCTT
TAGGCCTT
TAGCCCTTA
ACACTTC
ACTT
_G_ _ GCTT_
TAGGCCTT_
TAGCCCTTA
A_ _CACTTC
A_ _C_ CTT_
Conserved regions help us
to identify functionality
Sum of pairs score
Sum of pairs score
• What is the sum of
pairs score of this
alignment?
Tree alignment score
Tree alignment score
Tree Alignment
GGCTT
(Mouse)
TAGGCCTT
(Human)
TAGCCCTTA
(Monkey)
ACACTTC
(Lion)
ACCTT
(Cat)
Tree Alignment
AGGGACTT_
2
TGGGGCTT_
3
2
TAGGCCTT_
3
0
_G__GCTT_
(Mouse)
TAGGCCTT_
(Human)
Tree alignment score = 14
2
TAGCCCTTA
(Monkey)
A__CACTT_
1
1
A__CACTTC
(Lion)
A__C_CTT_
(Cat)
Tree Alignment---depends on tree
TA_CCCTTA
1
TA_CCCTT_
0
0
TA_CCCTT_
4
2
_G__GCTT_
(Mouse)
TAGGCCTT_
(Human)
Tree alignment score = 15
TA_CCCTTA
1
3
4
A__C_CTT_
(Cat)
A__CACTTC
(Lion)
TAGCCCTTA
(Monkey)
Switch monkey and cat
Profiles
• Before we see how to construct multiple
alignments, how do we align two
alignments?
• Idea: summarize an alignment using its
profile and align the two profiles
Profile alignment
Iterative alignment
(heuristic for sum-of-pairs)
• Pick a random sequence from input set S
• Do (n-1) pairwise alignments and align to
closest one t in S
• Remove t from S and compute profile of
alignment
• While sequences remaining in S
– Do |S| pairwise alignments and align to
closest one t
– Remove t from S
Iterative alignment
• Once alignment is computed randomly
divide it into two parts
• Compute profile of each sub-alignment
and realign the profiles
• If sum-of-pairs of the new alignment is
better than the previous then keep,
otherwise continue with a different division
until specified iteration limit
Progressive alignment
• Idea: perform profile alignments in the
order dictated by a tree
• Given a guide-tree do a post-order search
and align sequences in that order
• Widely used heuristic
• Can be used for solving tree alignment
Simultaneous alignment and
phylogeny reconstruction
• Given unaligned sequences produce both
alignment and phylogeny
• Known as the generalized tree alignment
problem---MAX-SNP hard
• Iterative improvement heuristic:
–
–
–
–
Take starting tree
Modify it using say NNI, SPR, or TBR
Compute tree alignment score
If better then select tree otherwise continue until
reached a local minimum
Median alignment
• Idea: iterate over the phylogeny and align
every triplet of sequences---takes o(m3) (in
general for n sequences it takes O(2nmn)
time
• Same profiles can be used as in
progressive alignment
• Produces better tree alignment scores (as
observed in experiments)
• Iteration continues for a specified limit
Popular alignment programs
• ClustalW: most popular, progressive alignment
• MUSCLE: fast and accurate, progressive and
iterative combination
• T-COFFEE: slow but accurate, consistency
based alignment (align sequences in multiple
alignment to be close to the optimal pairwise
alignment)
• PROBCONS: slow but highly accurate,
probabilistic consistency progressive based
scheme
• DIALIGN: very good for local alignments
MUSCLE
MUSCLE
MUSCLE
Profile sum-of-pairs score
Log expectation score used by MUSCLE
Evaluation of multiple sequence
alignments
•
•
•
•
•
•
Compare to benchmark “true” alignments
Use simulation
Measure conservation of an alignment
Measure accuracy of phylogenetic trees
How well does it align motifs?
More…
BAliBASE
• Most popular benchmark of alignments
• Alignments are based upon structure
BAliBASE currently consists of 142 reference alignments, containing
over 1000 sequences. Of the 200,000 residues in the database, 58%
are defined within the core blocks. The remaining 42% are in
ambiguous regions that cannot be reliably aligned. The alignments
are divided into four hierarchical reference sets, reference 1 providing
the basis for construction of the following sets. Each of the main sets
may be further sub-divided into smaller groups, according to
sequence length and percent similarity.
BAliBASE
• The sequences included in the database are selected
from alignments in either the FSSP or HOMSTRAD
structural databases, or from manually constructed
structural alignments taken from the literature. When
sufficient structures are not available, additional
sequences are included from the HSSP database
(Schneider et al., 1997). The VAST Web server (Madej,
1995) is used to confirm that the sequences in each
alignment are structural neighbours and can be
structurally superimposed. Functional sites are identified
using the PDBsum database (Laskowski et al., 1997)
and the alignments are manually verified and adjusted,
in order to ensure that conserved residues are aligned
as well as the secondary structure elements.
BAliBASE
• Reference 1 contains alignments of (less than 6) equidistant sequences, ie. the percent identity between two
sequences is within a specified range. All the sequences
are of similar length, with no large insertions or
extensions. Reference 2 aligns up to three "orphan"
sequences (less than 25% identical) from reference 1
with a family of at least 15 closely related sequences.
Reference 3 consists of up to 4 sub-groups, with less
than 25% residue identity between sequences from
different groups. The alignments are constructed by
adding homologous family members to the more
distantly related sequences in reference 1. Reference 4
is divided into two sub-categories containing alignments
of up to 20 sequences including N/C-terminal extensions
(up to 400 residues), and insertions (up to 100 residues).
Comparison of alignments on
BAliBASE
Parsimonious aligner (PAl)
1. Construct progressive alignment A
2. Construct MP tree T on A
3. Construct progressive alignment A’ on
guide-tree T
4. Set A=A’ and go to 3
5. Output alignment and tree with best MP
score
PAl
• Faster than iterative improvement
• Speed and accuracy both depend upon
progressive alignment and MP heuristic
• In practice MUSCLE and TNT are used for
constructing alignments and MP trees
• How does PAl compare against traditional
methods?
• PAl not designed for aligning structural regions
but focuses on evolutionary conserved regions
• Let’s look at performance under simulation
Evaluating alignments under
simulation
• We first need a way to evolve sequences
with insertions and deletions
• NOTE: evolutionary models we have
encountered so far do not account for
insertions and deletions
• Not known exactly how to model insertions
and deletions
ROSE
• Evolve sequences under an i.i.d. Markov Model
• Root sequence: probabilities given by a probability
vector (for proteins default is Dayhoff et. al. values)
• Substitutions
– Edge length are integers
– Probability matrix M is given as input (default is PAM1*)
– For edge of length b probabilty of x  y is given by Mbxy
• Insertion and deletions:
– Insertions and deletions follow the same probabilistic model
– For each edge probability to insert is iins .
– Length of insertion is given by discrete probability distribution
(normally exponential)
– For edge of length b this is repeated b times.
• Model tree can be specified as input
Evaluation of alignments
Let’s simulate alignments and
phylogenies and compare them under
simulation!!
Parameters for simulation study
• Model trees: uniform random distribution and
uniformly selected random edge lengths
• Model of evolution: PAM with insertions and
deletions probabilities selected from a gamma
distribution (see ROSE software package)
• Replicate settings: Settings of 50, 100, and 400
taxa, mean sequence lengths of 200 and 500
and avg branch lengths of 10, 25, and 50 were
selected. For each setting 10 datasets were
produced
Phylogeny accuracy
Alignment accuracy
Running time
Conclusions
• DIALIGN seems to perform best followed
by PAl, MUSCLE, and PROBCONS
• DIALIGN, however, is slower than PAl
• Does this mean DIALIGN is the best
alignment program?
Conclusions
• DIALIGN seems to perform best followed
by PAl, MUSCLE, and PROBCONS
• DIALIGN, however, is slower than PAl
• Does this mean DIALIGN is the best
alignment program?
• Not necessarily: experiments were
performed under uniform random trees
with uniform random edge lengths. Not
clear if this emulates the real deal.
Conclusions
• DIALIGN seems to perform best followed
by PAl, MUSCLE, and PROBCONS
• DIALIGN, however, is slower than PAl
• Does this mean DIALIGN is the best
alignment program?
• Not necessarily: experiments were
performed under uniform random trees
with uniform random edge lengths. Not
clear if this emulates the real deal.
• What about sum-of-pairs vs MP scores?
Sum-of-pairs vs MP score
Sum-of-pairs vs MP score
Conclusions
• Optimizing MP scores under this
simulation model leads to better
phylogenies and alignments
Conclusions
• Optimizing MP scores under this
simulation model leads to better
phylogenies and alignments
• What other models can we try?
Conclusions
• Optimizing MP scores under this
simulation model leads to better
phylogenies and alignments
• What other models can we try?
• Real data phylogenies as model trees
• Birth-death model trees
• Other distributions for model trees…
• Branch lengths: similar issues…
• Evolutionary model parameters estimated
from real data