Bioinformatics - University of Oxford

Download Report

Transcript Bioinformatics - University of Oxford

Modelling genomes
Gil McVean
Department of Statistics, Oxford
Why would we want to model a genome?
•
To identify genes
– Protein-coding
– RNA
– Small RNAs
•
To identify regulatory elements
– Transcription factor binding sites
– Enhancers
•
To classify genome content
– Repeat DNA
– Unique sequence
•
To understand the processes that shape genomes
–
–
–
–
–
Mutation
Recombination
Duplication
Rearrangement
Natural selection
EMISSIONS
STATES
A rather simple model for a protein-coding gene
Non-coding DNA
T:
C:
A:
G:
30%
20%
30%
20%
s
Start codon
ATG: 100% AAA:
AAC:
AAG:
…
TTT:
Codon
1/61%
1/61%
1/61%
1/61%
t
TERM
TAA: 30%
TAG: 40%
TGA: 30%
Non-coding DNA
T:
C:
A:
G:
30%
20%
30%
20%
Define model
Explore properties
Estimate parameters
from data
Test goodness-of-fit
Refine model
A ‘genome’ model is like any other statistical model
Hidden Markov Models in bioinformatics
•
The model of a gene just described can be thought of as a hidden Markov
model (HMM)
– The underlying states evolve in a Markov fashion, but we observe features (the
DNA sequence) emitted by those states
•
You will remember that there are lots of nice computational properties of
hidden Markov models that we can use for inference
– Finding a most likely sequence of states
– Calculating posterior probabilities of a given state at a given position
•
There are also various algorithms we can use to estimate parameters of
HMMs (e.g. ML estimation by EM)
•
How would you use the model of a gene to find new genes?
– How well do you think it would do?
Making useful HMMs in bioinformatics
•
To be useful, HMMs for genes have to incorporate many features
– Regulatory sequences
– Intron-splicing features
– Correlations and biases in amino acid and base composition
•
A REALLY important feature to capture is their evolution
– Important parts of genes and genomes evolve slower due to constraint
Searching for homology
•
If we compare human and chimpanzee sequences they are approximately
98.8% identical at the DNA level. It is ‘easy’ to identify which parts of the
genome in humans correspond to which parts in chimps
•
If we compare human with, say mouse, we can see some parts that are
similar, and other parts where there is only vague or even no obvious
similarity.
•
When measuring evolution, we need to identify regions that are homologous
– Homology means similarity by descent
•
Traditionally, the problem of identifying homology has been intrinsically
linked to the problem of alignment
Alignment of PFEMP1 proteins from P. falciparum
The simplest problem: aligning two sequences
•
Suppose we have just two protein sequences that we want to align
WAKIS
WEEKS
•
In evolution, three types of event can happen
– Mutation to new amino acids
– Insertion of new amino acids
– Deletion of amino acids
•
We want to work out which amino acids in the two sequences are
homologous – i.e. related to each other through shared ancestry
W—AKIS
WEEK-S
What do the ‘-’s really mean?
How can we construct an alignment algorithm?
•
What we want to do is to look at every possible alignment and choose the
one that is ‘best’
•
What we have to do is to find an efficient algorithm that can search every
possible alignment and that has an objective measure as to what ‘best’
means
•
A natural approach is to make a model of alignments, parameterise it and
find the alignment that maximises the likelihood
•
Although the problem sounds hard we can solve it using a hidden Markov
model structure
How does is work?
•
Suppose residues Xi and Yj are aligned to each other
Xi
Yj
•
Three things could happen next
– The next two residues in each sequence could also align
– A gap could be introduced in sequence X
– A gap could be introduced in sequence Y
XiXi+1
(A)
•
YjYj+1
(A)
(B)
(C)
Xi (B)
YjYj+1
We can parameterise the probabilities of each event
XiXi+1
(C)
Yj -
The full algorithm
•
•
We need to consider similar transitions for the cases when residue Xi is
aligned to a gap after residue Yj, and when Yj is aligned to a gap after Xi
Xi …-
Yj
Yj-a…Yj
…-
We need to specify various probabilities
–
–
–
–
–
•
Xi-a…Xi
The probability of inserting a gap
The probability of extending a gap
The probability of finishing the alignment
The probability of observing an aligned pair of residues (20x20)
The probability of observing a residue aligned to a gap (20)
Once specified we can use the Viterbi and Forward/Backward algorithms to
identify ML alignments, sample from the posterior or calculate posterior
probabilities
The forward algorithm
Xi+1
Emission probabilities
= ek(Xi+1 )
H
H
D
Transition probabilities
= qij
f H (i  1)   f H (i)qHH  f D (i)qDH  eH ( X i 1 )
In alignment the state space is two-dimensional (residue i aligned to residue j)
The Viterbi algorithm
Xi-1
Xi
Xi+1
H
H
H
D
D
D

vk (i  1)  ek ( X i 1 ) max j v j (i)q jk

A traceback matrix is used to keep track of the best partial alignments
An example
•
Suppose the gap opening and extension parameters are 0.2 and 0.5
respectively. There is a 80% chance of observing a match, a 20/19%
chance of observing any given mismatch and a 5% chance of observing
each unaligned amino acid (We can ignore termination for the moment)
•
The BEST alignments are given below, each of which has log likelihood of
-16.84, or 31% of the total likelihood (lnlk = -15.67).
•
W—AKIS
WA-KIS
WEEK-S
WEEK-S
In many real situations, the best alignment represents only a fraction of
the total likelihood
Posterior decoding
•
Using the forward-backward algorithm we can calculate the posterior
probability that any residue is aligned to any other, or that a given residue is in
a gap state
Y1
X1 X2 X3 X4 X5
Y1
Y2
Y2
Y3
Y3
Y4
Y4
Y5
Y5
X1 X2 X3 X4 X5
Conditional on X2-Y3
Extending the method
•
Originally, alignment algorithms (Needleman and Wunsch, 1970; Smith and
Waterman, 1981; Gotoh 1982) were not explicitly defined as hidden Markov
models
– Finite-state automata (FSA)
•
There have been many extensions to the original idea
–
–
–
–
–
•
Local alignment
Repeat alignment
Protein family identification
Gene finding
Multiple alignment
The alignment algorithm is very much a workhorse of bioinformatics, as an
alignment is needed or almost all subsequent analyses (e.g. phylogenetic
tree reconstruction, population genetic inference)
– However, relying on a single alignment is not always a great idea
Doing away with alignment
•
For most problems, the alignment is not of primary interest
•
The natural thing to do is to integrate over alignments (as in the FB
algorithm) to estimate parameters of interest
•
The key problem is that there is no computationally efficient algorithm for
statistical multiple alignment. All widely-used methods use heuristic
approaches
Gene conversion and var gene diversity in P. falciparum
•
Multiple alignment methods typically assume the sequences are related to
each through an evolutionary tree
•
For the case of multi-gene families, this may not be the case, because gene
conversion between copies can lead to mosaic structures
•
If we wish to learn about the processes of conversion, a natural approach is
to model the mosaicism
– In the case of var genes, the sequences are so diverged that we also need to
consider the problem of alignment
Mosaic alignment
•
We could model the n+1th sequence as a mosaic of the previous n
•
We can calculate the likelihood of observing a given sequence by summing
over all possible mosaic structures and their alignment
•
We can also identify the most likely mosaic structure and calculate the
expected number of recombination events
– Repeating the procedure for all sequences provides a way of assessing the
importance of mosaicism within the family
Extensive mosaicism within the var family