
Genomics – session II
March 29, 2017
Assembled genomic sequences
……CGACAATGCACGACAGAGGAAGCAGAACAGATGATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGA
TCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGATGATAA
TATATTCAAGTTGCCGCTAATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGT
GCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAAGCCCAAAAAGACAATACACGACAGAGAGA
GAGAGCAGCGGAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTCTATATAA
TGACTGCCTCTCATTCTGTCTTATTTTACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTG
CCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTATGCACGACAGAGGAAGCAGAACAGATAT
TTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCT
TTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCATT
GCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCT
GTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATTTAGATTGCCTATTAAATATG
ATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTCTATATAATGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCA
TATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTG
GCGGATGAACGAGTAGTGCCAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAAACCCAAA
TCGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATC
GCGTATGCGAGAGTATGCACGACAGAGGAAGCAGAAATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAAT
TCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGT
TCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATTTAGATTGCCTATTAA
ATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTCTATACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATA
TTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGC
GGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATATATGAT
CGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGAC
AATACACGACAGAGAGAGAGAGCAGCGGAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATA
TTGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTAGAGTGCGAGTGCTTTTCAACATATTGTGCTCT……
An annotated gene
[the same genomic sequence as above, shown with its gene annotated]
[annotated features: promoter, TSS, 5’UTR, ATG … TAA, 3’UTR]
Scheme of a eukaryotic gene
Motif identification approaches
Various pattern recognition methods are used to identify these signals
(example signal consensus patterns shown on the slide: STATAWAWRSS, KCABHYBY, YRRCCAATSR):
– consensus sequences
– weight matrices
– decision trees
– Hidden Markov Models (HMMs)
– neural networks
– …
Consensus sequence
• obtained by choosing the most frequent base at
each position of the multiple alignment of
subsequences of interest
TACGAT
TATACT
TAGAGT
GAAACT
TATGAT
TATGTT
consensus sequence: TAt[AG]NT
consensus (IUPAC): TAtRNT
Leads to loss of information and can produce many false
positive or false negative predictions
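A minimal sketch in Python (not part of the slides) of deriving such a consensus by taking the most frequent base in each alignment column; note that ties, such as position 4 above, are resolved arbitrarily here, whereas the slide expresses them with an ambiguity code.

```python
from collections import Counter

def consensus(seqs):
    """Most frequent base at each column of an aligned set of sites."""
    columns = zip(*seqs)                       # iterate over alignment columns
    return "".join(Counter(col).most_common(1)[0][0] for col in columns)

sites = ["TACGAT", "TATACT", "TAGAGT", "GAAACT", "TATGAT", "TATGTT"]
print(consensus(sites))    # ties (e.g. A/G at position 4) are broken arbitrarily
```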
Weight matrices
• Computed by measuring the frequency (weight) of each base at each position of the site
TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT
      1   2   3   4   5   6
A     0   6   0   3   4   0
C     0   0   1   0   1   0
G     1   0   0   3   0   0
T     5   0   5   0   1   6
• Probability of a nucleotide at each site:
Site 1: P(A) = 0, P(C) = 0, P(G) = 1/6, P(T) = 5/6
• Probability of a given sequence being this motif:
P(TATAAT) = 5/6 × 6/6 × 5/6 × 3/6 × 4/6 × 6/6 ≈ 0.23
Weight matrices (log values)
• Convert probabilities to log values for computational efficiency
• Pseudocount for a base with 0 occurrences at a site (10^-5)
S(TATAAT) = -0.26 + 0.0 - 0.26 - 1.0 - 0.58 - 0.0 = -2.1
Matrices in log values:
TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT
         1        2        3        4        5        6
A     -16.61     0.0    -16.61    -1.0     -0.58   -16.61
C     -16.61   -16.61    -2.58   -16.61    -2.58   -16.61
G      -2.58   -16.61   -16.61    -1.0    -16.61   -16.61
T      -0.26   -16.61    -0.26   -16.61    -2.58     0.0
• The score for a sequence is the sum of the matrix values (the log-likelihood of being the motif)
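A minimal sketch in Python (not part of the slides) that builds the matrix from the six sites, uses the 10^-5 pseudocount for bases that never occur at a position, and scores a candidate sequence by summing the log2 values:

```python
import math

sites = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
L, N = len(sites[0]), len(sites)

# position-specific probabilities -> log2 weights (pseudocount 1e-5 for zeros)
pwm = []
for pos in range(L):
    column = [s[pos] for s in sites]
    pwm.append({b: math.log2(column.count(b) / N if column.count(b) else 1e-5)
                for b in "ACGT"})

def score(seq):
    """Sum of log2 weights = log-likelihood of seq under the motif model."""
    return sum(pwm[i][b] for i, b in enumerate(seq))

print(round(score("TATAAT"), 2))   # about -2.1, as in the worked example above
```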
More about weight matrices
• Disadvantages:
– cut-off value required
– assumes independence between adjacent bases
Considering sequence composition
• Nucleotide frequencies in the human genome
             %A      %C      %G      %T
Genome      23.95   26.08   26.02   23.94
Intergenic  29.89   20.09   20.09   29.93
Genes       28.83   21.12   21.16   28.89
Exons       23.84   26.16   26.11   23.89
Introns     29.12   20.85   20.87   29.16
• How to improve the matrices? (one possibility is sketched below)
TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT
      1   2   3   4   5   6
A     0   6   0   3   4   0
C     0   0   1   0   1   0
G     1   0   0   3   0   0
T     5   0   5   0   1   6
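One common answer, given here as an illustration rather than as the slide's own solution, is to turn the counts into log-odds scores against a background composition, for instance the intergenic frequencies from the table above:

```python
import math

sites = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
# background taken from the intergenic row of the table above (an assumption
# about which composition is appropriate, not something the slide prescribes)
background = {"A": 0.2989, "C": 0.2009, "G": 0.2009, "T": 0.2993}

L, N = len(sites[0]), len(sites)
log_odds = []
for pos in range(L):
    column = [s[pos] for s in sites]
    log_odds.append({b: math.log2((column.count(b) / N or 1e-5) / background[b])
                     for b in "ACGT"})        # motif probability vs. background

def score(seq):
    return sum(log_odds[i][b] for i, b in enumerate(seq))

print(round(score("TATAAT"), 2))
```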
Decision tree (Definition)
• Decision tree induction is one of the simplest and yet
most successful forms of learning algorithm.
• A decision tree takes as input an object or
situation described by a set of attributes and
returns a "decision" – the predicted output value
for the input.
• The input attributes and the output value can be
discrete or continuous.
Definition II
• Learning a discrete-valued function is called
classification learning.
• Learning a continuous function is called
regression.
• We will concentrate on Boolean classification,
wherein each example is classified as true
(positive) or false (negative).
Decision trees as performance elements
• A decision tree reaches its decision by
performing a sequence of tests.
• Each internal node in the tree corresponds to a
test of the value of one of the properties, and the
branches from the node are labeled with the
possible values of the test.
• Each leaf node in the tree specifies the value to
be returned if that leaf is reached.
Example
• Aim: To decide whether to wait for a table at a restaurant.
• Attributes:
1. Alternate: whether there is a suitable alternative restaurant
nearby.
2. Bar: whether the restaurant has a comfortable bar area to
wait in.
3. Fri/Sat: true on Fridays and Saturdays.
4. Hungry: whether we are hungry.
5. Patrons: how many people are in the restaurant (values are
None, Some, and Full).
6. Price: the restaurant's price range ($, $$, $$$).
7. Raining: whether it is raining outside.
8. Reservation: whether we made a reservation.
9. Type: the kind of restaurant (French, Italian, Thai, or burger).
10. WaitEstimate: the wait estimated by the host (0-10
minutes, 10-30, 30-60, >60).
A decision tree to decide whether to
wait for a table
Inducing decision trees from examples
• An example for a Boolean decision tree consists of a
vector of input attributes, X, and a single Boolean output
value y.
• The complete set of examples is called the training set.
• In order to find a decision tree that agrees with the
training set, we could simply construct a decision tree that
has one path to a leaf for each example, where the path
tests each attribute in turn and follows the value for the
example and the leaf has the classification of the example.
• When given the same example again, the decision tree
will come up with the right classification. Unfortunately, it
will not have much to say about any other cases!
Decision-tree learning algorithm
• Splitting the examples by testing on attributes (i):
(a) Splitting on Type brings us no nearer to distinguishing
between positive and negative examples.
Decision-tree learning algorithm
• Splitting the examples by testing on attributes (ii):
Patrons?
In general, after the first
attribute test splits up the
examples, each outcome is
a new decision tree
learning problem in itself
Splitting on Patrons does a good job of separating positive and negative
examples. After splitting on Patrons, Hungry is a fairly good second test.
A decision tree induced from examples
The scheme used in decision
tree learning for selecting
attributes is designed to
minimize the depth of the
final tree.
Important and useless attributes
Patrons?
The Patrons attribute is not
perfect, but it is fairly good.
Hungry is a fairly good second
attribute. The Type attribute is a
really useless one.
Information content
• One suitable measure is the expected amount of
information provided by the attribute, where we
use the term in the mathematical sense first
defined in Shannon and Weaver (1949).
• Information theory measures information
content in bits.
• One bit of information is enough to answer a
yes/no question about which one has no idea,
such as the flip of a fair coin.
Information content
• In general, if the possible answers vi have
probabilities P(vi), then the information content
of the actual answer is given by:
I(P(v1), …, P(vn)) = Σi -P(vi) log2 P(vi)
• For example, for the flip of a fair coin:
I(P(head), P(tail)) = -0.5 log2 0.5 - 0.5 log2 0.5 = 1 bit
Information content
• For decision tree learning, the question that
needs answering is; for a given example, what is
the correct classification?
• An estimate of the probabilities of the possible
answers before any of the attributes have been
tested is given by the proportions of positive and
negative examples in the training set.
• Suppose the training set contains p positive
examples and n negative examples. Then an
estimate of the information contained in a correct
answer is:
I(p/(p+n), n/(p+n)) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))
Information content
• A test on a single attribute A will not usually tell us
all the information, but it will give us some of it.
• We can measure exactly how much by looking at
how much information we still need after the
attribute test.
• Any attribute A divides the training set E into
subsets E1, . . . , Ev, according to their values for A,
where A can have v distinct values.
• Each subset Ei has pi positive examples and ni
negative examples, so if we go along that branch,
we will need an additional I(pi/(pi + ni), ni/(pi + ni))
bits of information to answer the question.
Information content
• A randomly chosen example from the training set
has the ith value for the attribute with probability
(pi + ni) / (p + n), so on average, after testing
attribute A, we will need the following amount of
information to classify the example:
R(A) = Σi (pi + ni)/(p + n) × I(pi/(pi + ni), ni/(pi + ni))
R(Patrons) =
Information content
• The information gain from the attribute test is
the difference between the original information
requirement and the new requirement:
Gain(A) = I(p/(p + n), n/(p + n)) - R(A)
• The heuristic used in the CHOOSE-ATTRIBUTE
function is just to choose the attribute with the
largest gain.
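A short illustrative sketch (Python, with hypothetical counts rather than the actual restaurant training set, which is only shown as a figure) of computing I(p, n), the remainder and the gain:

```python
import math

def information(p, n):
    """I(p/(p+n), n/(p+n)) in bits; zero if one of the classes is empty."""
    total = p + n
    return -sum((x / total) * math.log2(x / total) for x in (p, n) if x)

def gain(p, n, splits):
    """splits: one (p_i, n_i) pair per value of the attribute."""
    remainder = sum((pi + ni) / (p + n) * information(pi, ni)
                    for pi, ni in splits)
    return information(p, n) - remainder

# hypothetical attribute with three values over 6 positive / 6 negative examples
print(round(gain(6, 6, [(0, 2), (4, 0), (2, 4)]), 3))   # ~0.541 bits
```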
Information content
• Returning to our example:
Gain(Patrons) = ? bits
Markov model
• Collection of states
{S1, S2, …, Sn}
• State transition probabilities (transition matrix)
Aij = P(qt+1 = Si | qt = Sj)
• Initial state distribution
π = (.7, .25, .05)
Markov model (our example)
• Collection of states
{Sfluent, Sintermediate, Sbeginner}
• State transition probabilities (transition matrix)
A =
        flu    int    beg
flu     .8     .15    .05
int     .38    .6     .02
beg     .05    .2     .75
• Initial state distribution
π = (.7, .25, .05)
Probability of a sequence of events
flu → int → int → int → beg → beg
P(Sflu) × P(Sint | Sflu) × P(Sint | Sint) ×
P(Sint | Sint) × P(Sbeg | Sint) × P(Sbeg | Sbeg)
= 0.7 × 0.15 × 0.6 × 0.6 × 0.02 × 0.2 = 0.0001512
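A small sketch (not from the slides) that reproduces this computation; the initial and transition values are exactly those used in the worked example above:

```python
def chain_probability(path, init, trans):
    """P(s1, ..., sn) = init[s1] * product of trans[(s_{i-1}, s_i)]."""
    p = init[path[0]]
    for prev, cur in zip(path, path[1:]):
        p *= trans[(prev, cur)]
    return p

init = {"flu": 0.7, "int": 0.25, "beg": 0.05}
trans = {("flu", "int"): 0.15, ("int", "int"): 0.6,   # values as used in the
         ("int", "beg"): 0.02, ("beg", "beg"): 0.2}   # worked example above

print(chain_probability(["flu", "int", "int", "int", "beg", "beg"], init, trans))
# ≈ 0.0001512
```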
Hidden Markov model (HMM)
[Diagram: the three skill states (fluent, intermediate, beginner) are hidden (not observable); the arrows carry the transition probabilities (80%, 15%, 5%, 38%, 60%, 2%, 75%, 20%, 5%) and each state emits the observations with its own probabilities (fluent: 60/30/10%, intermediate: 5/30/65%, beginner: 0/50/50%).]
Ingredients of HMM
• Collection of states:
{S1, S2,…,SN}
• State transition probabilities (transition matrix)
Aij = P(qt+1 = Si | qt = Sj)
• Initial state distribution
πi = P(q1 = Si)
• Observations: {O1, O2,…,OM}
• Observation probabilities:
Bj(k) = P(vt = Ok | qt = Sj)
Assumptions of HMM
For the sake of mathematical and computational
tractability, the following assumptions are made in the
theory of HMMs.
(1) The Markov assumption
It is assumed that the next state depends only upon the current state.
This is called the Markov assumption, and the resulting model is
actually a first-order HMM.
(2) The stationarity assumption
Here it is assumed that the state transition probabilities are independent of
the actual time at which the transitions take place.
(3) The output independence assumption
This is the assumption that the current output (observation) is statistically
independent of the previous outputs (observations).
Ingredients of HMM (our example)
• Collection of states
{Sfluent, Sintermediate, Sbeginner}
• State transition probabilities (transition matrix)
A =
        flu    int    beg
flu     .8     .15    .05
int     .38    .6     .02
beg     .05    .2     .75
• Initial state distribution
π = (.7, .25, .05)
• Observations:
{O1, O2, O3}
• Observation probabilities (emission matrix)
B =
        O1     O2     O3
flu     .6     .3     .1
int     .05    .3     .65
beg     0      .50    .50
Probability of a sequence of events
[Diagram: one possible hidden path stays in the fluent state for all seven observations; the initial probability is .7, each of the six transitions has probability .8, and the seven emissions have probabilities .1, .1, .3, .3, .6, .3 and .3.]
P(O) = P(O1, O2, O3, …, O7)
= Σ over all paths Q of P(O | Q) P(Q) = Σ over (q1,…,q7) of P(O | q1,…,q7) P(q1,…,q7)
= 0.7 × 0.8^6 × 0.3^4 × 0.1^2 × 0.6 + …
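A brute-force sketch (illustrative Python, not from the slides) of this sum over all hidden paths, using the model tables as read from the example slides; the observation sequence is an arbitrary example. For longer sequences the forward algorithm computes the same quantity far more efficiently.

```python
from itertools import product

states = ["flu", "int", "beg"]
init = {"flu": 0.7, "int": 0.25, "beg": 0.05}
trans = {"flu": {"flu": 0.8,  "int": 0.15, "beg": 0.05},
         "int": {"flu": 0.38, "int": 0.6,  "beg": 0.02},
         "beg": {"flu": 0.05, "int": 0.2,  "beg": 0.75}}
emit  = {"flu": {"O1": 0.6,  "O2": 0.3, "O3": 0.1},
         "int": {"O1": 0.05, "O2": 0.3, "O3": 0.65},
         "beg": {"O1": 0.0,  "O2": 0.5, "O3": 0.5}}

def total_probability(obs):
    """P(O) = sum over every hidden path Q of P(O | Q) P(Q)."""
    total = 0.0
    for path in product(states, repeat=len(obs)):
        p = init[path[0]] * emit[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans[path[t - 1]][path[t]] * emit[path[t]][obs[t]]
        total += p
    return total

print(total_probability(["O1", "O2", "O2", "O1", "O3", "O2", "O2"]))
```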
Typical HMM problems
Annotation Given a model M and an observed string S,
what is the most probable path through M generating S
Classification Given a model M and an observed string S,
what is the total probability of S under M
Consensus Given a model M, what is the string having the
highest probability under M
Training Given a set of strings and a model structure, find
transition and emission probabilities assigning high
probabilities to the strings
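The annotation problem above is usually solved by dynamic programming (the Viterbi algorithm); a compact sketch, assuming the same init/trans/emit dictionaries as in the previous example:

```python
def viterbi(obs, states, init, trans, emit):
    """Most probable hidden-state path generating obs."""
    # best[t][s] = (probability, previous state) of the best path ending in s at t
    best = [{s: (init[s] * emit[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        best.append({})
        for s in states:
            p, prev = max(((best[t - 1][r][0] * trans[r][s] * emit[s][obs[t]], r)
                           for r in states), key=lambda x: x[0])
            best[t][s] = (p, prev)
    # backtrack from the most probable final state
    state = max(best[-1], key=lambda s: best[-1][s][0])
    path = [state]
    for t in range(len(obs) - 1, 0, -1):
        state = best[t][state][1]
        path.append(state)
    return path[::-1]

# e.g. print(viterbi(["O1", "O2", "O3", "O3"], states, init, trans, emit))
```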
HMM and gene structure
• Nucleotides {A,C,G,T} are the observables
• Different states generate nucleotides at different frequencies
A simple HMM for unspliced genes:
AAAGC ATG CAT TTA ACG AGA GCA CAA GGG CTC TAA TGCCG
• The sequence of states is an annotation of the generated string – each
nucleotide is generated in an intergenic, start/stop, or coding state
A HMM for eukaryotic gene finding
States: start, coding, non(-coding), end
Transition matrix A (rows as on the slide):
start:   0   0.5   0.5   0
coding:  0.998   0.002   0
non:     0.001   0.996   0
end:     0.001   0.002   0
Emission matrix B:
          A      C      G      T
coding   .28    .22    .25    .25
non      .18    .18    .32    .32
xm(i) = probability of being in state m at position i;
B(m,yi) = probability of emitting character yi in state m;
Amk = probability of transition from state k to m.
Bacteriomaker HMM machine
[Diagram: a four-state machine. The Intergene state emits A, C, G and T with probability 0.25 each and loops on itself with probability 0.9; with probability 0.1 it moves to an ATG (start codon) state, which leads into the Coding state with probability 1. The Coding state emits A, C, G and T with probabilities 0.22, 0.28, 0.28 and 0.22, loops on itself with probability 0.9, and with probability 0.1 moves to a TAA (stop codon) state, which returns to Intergene with probability 1.]
A simple gene finder
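A small generator sketch of such a machine (illustrative; the exact wiring of the four states is an assumption reconstructed from the diagram: self-loops of 0.9, exits of 0.1, and codon states passed through with probability 1):

```python
import random

# emission probabilities as shown in the diagram
emissions = {"Coding":    {"A": 0.22, "C": 0.28, "G": 0.28, "T": 0.22},
             "Intergene": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}

def generate(n_steps, rng=random.Random(0)):
    """Emit a DNA string by walking the four-state machine."""
    seq, state = [], "Intergene"
    for _ in range(n_steps):
        if rng.random() < 0.1:                  # leave the current region (10%)
            seq.append("ATG" if state == "Intergene" else "TAA")
            state = "Coding" if state == "Intergene" else "Intergene"
        else:                                   # stay (90%) and emit one base
            bases, probs = zip(*emissions[state].items())
            seq.append(rng.choices(bases, probs)[0])
    return "".join(seq)

print(generate(60))
```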
GENSCAN (Burge & Karlin)
[GENSCAN state diagram: exon states E0, E1, E2 (plus Eini, Eterm and Esingle), intron states I0, I1, I2, and promoter, 5'UTR, 3'UTR, poly-A and intergenic-region states, modelled on both the forward (+) and reverse (-) strands.]
[Example output over sequence positions 6201–9261]
The order of the Markov chain
The order of the Markov chain is the number
of previous positions on which the current
position depends
e.g., in nucleic acid sequence, 0-order is
mononucleotide, 1st-order is dinucleotide,
2nd-order is trinucleotide, etc.
e.g., a 5th-order Markov chain: in the sequence ACGATCGTCC, each base depends on the five preceding bases.
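An illustrative sketch of estimating such higher-order transition probabilities by counting (k+1)-mers; a 2nd-order model on a short string is used here only to keep the example small:

```python
from collections import defaultdict

def kth_order_probs(seq, k):
    """P(next base | previous k bases), estimated by counting (k+1)-mers."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(seq) - k):
        context, nxt = seq[i:i + k], seq[i + k]
        counts[context][nxt] += 1
    return {ctx: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for ctx, nxt in counts.items()}

probs = kth_order_probs("ACGATCGTCCACGATCGTCC", k=2)
print(probs["CG"])   # distribution of the base that follows the context "CG"
```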
Evaluation of gene finding programs
                  Actual True     Actual False
Predicted True        TP              FP           PP = TP + FP
Predicted False       FN              TN           PN = FN + TN
                  AP = TP + FN    AN = FP + TN
• Sensitivity (= coverage): Sn = TP / AP = 1 - α
• Specificity: Sp = TN / AN = 1 - β
• Precision = TP / PP
• Misclassification rates: α = FN / AP, β = FP / AN
• Normalized specificity
• FDR = FP / PP
• Accuracy = (TP + TN) / (P + N)
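These measures follow directly from the four cells of the table; a small sketch with hypothetical counts:

```python
def evaluate(tp, fp, fn, tn):
    """Standard gene-finding evaluation measures from a 2x2 confusion table."""
    ap, an = tp + fn, fp + tn            # actual positives / negatives
    pp = tp + fp                         # predicted positives
    return {"sensitivity (coverage)": tp / ap,
            "specificity":            tn / an,
            "precision":              tp / pp,
            "FDR":                    fp / pp,
            "accuracy":               (tp + tn) / (ap + an)}

print(evaluate(tp=80, fp=20, fn=15, tn=885))   # hypothetical counts
```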
Combining programs’ predictions
• The set of methods used and the way they are
integrated differ between individual programs
• Different programs often predict different
elements of an actual gene, so they could
complement each other, yielding better predictions
• High prediction accuracy, as well as the reliability of their
exon probabilities, made them the best candidates for our study
[Venn diagram of predicted exons: Genscan – 111, both – 624, HMMgene – 91]
• Genscan predicted 77% of exons correctly, HMMgene 75%,
and both together 87%
Consensus at splicing sites
[Diagram: exon – intron – exon, with the consensus sequences at the two splice sites]
Hidden Markov model