RA with Base

Transcript RA with Base

Reduced Alphabet
• Given 20 AA’s, can some be coalesced ?
• From AA Physico-Chemical properties, use clustering
• Dayhoff clustering
• (KRH) (EDQN) (PTSGA) (C) (VLIM) (FWY)
• PCA, Hierarchical clustering
• (KRH) (EDQN) (PTSGA) (C) (VLIMF) (WY)
Dayhoff Clustering
2nd Base
U
1
s
t
B
a
s
e
C
A
G
U
UUU
UUC
UUA
UUG
CUU
CUC
CUA
CUG
AUU
AUC
AUA
AUG
GUU
GUC
GUA
GUG
C
F
L
I
M
V
F
I
I
I
I
UCU
UCC
UCA
UCG
CCU
CCC
CCA
CCG
ACU
ACC
ACA
ACG
GCU
GCC
GCA
GCG
A
S
P
T
A
A
A
A
A
UAU
UAC
UAA
UAG
CAU
CAC
CAA
CAG
AAU
AAC
AAA
AAU
GAU
GAC
GAA
GAG
G
Y
H
Q
N
K
D
E
Y
H
D
D
H
D
D
UGU
UGC
UGA
UGG
CGU
CGC
CGA
CGG
AGU
AGC
AGA
AGG
GGU
GGC
GGA
GGG
(KRH) (EDQN) (PTSGA) (C) (VLIM) (FWY)
C
W
R
S
R
G
C
U
C
A
W G
U
R C
A
G
G U
C
R A
G
U
G C
A
G
Murphy, Wallqvist, Levy, 2000


To study protein folding
Used BLOSUM50 similarity matrix
 Determine correlation coefficients between similarity matrix
elements for all pairs of AA’s


e.g., CAV = (∑i MA,i MV,i )/[(∑i MA,i MA,i q)*(∑i MV,i MV,i )] with summation
over i is taken for 20 AA’s
Group two AA’s with highest CC’s, and either add the next AA with
the highest CC to a group or a new group
RA with Base-pairing
•
Four groups
•
•
(KRH)(YSTNQC)(DE)(WFGAVILLMP)
Genetic Code arranged by different base ordering
2
2nd Base
1st Base
2nd Base
1st Base
RA with Base-pairing -2
•
Group (A-U; C-G) (except for leave nodes)
RA with Base-pairing -3
•
Four AA groups followed by base-pairing
•
Results in 15 letters (4F)
2nd Base
U
1
s
t
B
a
s
e
C
A
G
U
UUU
UUC
UUA
UUG
CUU
CUC
CUA
CUG
AUU
AUC
AUA
AUG
GUU
GUC
GUA
GUG
C
F
L
I
M
V
F
F
F
F
F
UCU
UCC
UCA
UCG
CCU
CCC
CCA
CCG
ACU
ACC
ACA
ACG
GCU
GCC
GCA
GCG
A
S
P
T
A
T
A
T
A
UAU
UAC
UAA
UAG
CAU
CAC
CAA
CAG
AAU
AAC
AAA
AAU
GAU
GAC
GAA
GAG
G
Y
H
Q
N
K
D
E
Y
H
Q
N
K
D
D
UGU
UGC
UGA
UGG
CGU
CGC
CGA
CGG
AGU
AGC
AGA
AGG
GGU
GGC
GGA
GGG
C
W
R
S
R
G
C
U
C
A
W G
U
R C
A
G
S U
C
V A
G
U
G C
A
G
RA with Base-pairing -4
•
CLUTO groups followed by base-pairing
2nd Base
U
1
s
t
B
a
s
e
C
A
G
•
U
UUU
UUC
UUA
UUG
CUU
CUC
CUA
CUG
AUU
AUC
AUA
AUG
GUU
GUC
GUA
GUG
C
F
L
I
M
V
F
I
I
I
I
UCU
UCC
UCA
UCG
CCU
CCC
CCA
CCG
ACU
ACC
ACA
ACG
GCU
GCC
GCA
GCG
A
S
P
T
A
A
A
A
A
UAU
UAC
UAA
UAG
CAU
CAC
CAA
CAG
AAU
AAC
AAA
AAU
GAU
GAC
GAA
GAG
G
Y
H
Q
N
K
D
E
Y
H
Q
N
K
D
D
UGU
UGC
UGA
UGG
CGU
CGC
CGA
CGG
AGU
AGC
AGA
AGG
GGU
GGC
GGA
GGG
C
W
R
S
R
G
C
U
C
A
W G
U
R C
A
G
S U
C
V A
G
U
G C
A
G
(VLIM)(PTSA)(Y)(H)(Q)(N)(K)(DE)(C)(W)(R)(S*)(R*)(G)
RA with Base-pairing - 5
•
Phylogenetic tree of cytochrome C gene
•
Using Neucletide sequences
•
Using 20 AA’s
RA with Base-pairing - 6
•
Phylogenetic tree of cytochrome C gene
•
Using 4 group/15 letter
•
Using 6 group/14 letter
RA with Base-pairing - 7
•
Phylogenetic tree of cytochrome C gene
•
Using 6 CLUTO group
•
Using 6 group/14 letter
Transforms
• PCA
• Extract structure info from data
• Orthogonal transform of coordinate systems
• Principal components are linear combination of input space
• Can be used in dimension reduction, compression, etc.
• Given M centered observations xk =[xk1, xk2, …, xkp] (k=1,…M)
(∑Mxk = 0), solve
λ v = C v (C= ∑jMxjxjT /M)
• Kernel PCA
• Obtain principal components of variables (features) as
nonlinear combinations of input variables
• Mapping
X = Φ(x)
Φ: mapping to feature space, potentially infinite dimension
Kernel PCA
• Kernel PCA
• Given M centered (∑M Φ(xk)= 0), solve
λ v = C v (C= ∑jMΦ(xj) Φ(xj)T /M)
Kernel PCA
• Given M centered (∑M Φ(xk)= 0), solve
λ v = C v (C= ∑jMΦ(xj) Φ(xj)T /M)
• Computational Issues
•Centering in the feature space
•Computation of mapping into feature space
• Limited to special cases such as (xy)d
Support Vector Machine
• W. Noble, “What is SVM?” Computational Biology, Dec. 2006
(http://www.fml.tuebingen.mpg.de/raetsch/lectures/ismb09tutori
al/images/WhatIsASVM.pdf )
• http://www.dtreg.com/svm.htm
• SVM is similar to NN
• An SVM with sigmoid kernel function is equivalent to a twolayer perceptron
• SVM is used in
• Credit card fraud detection
• Recognize hand writing
• Classification of microarray gene expression profiles
Cancer Gene Classificaiton
• Affymetrix microarray
• Contains probes for 6,817 human genes
• For a bone marrow sample, it returns 6,817 values, each of
which represents mRNA levels of individual genes
• Trained for 27 Acute lymphoblastic leukemia (ALL) and 11
acute myeloid leukemia (AML)
• For simplicity, assume two gene probes
• Can plot gene expressions
Maximum Margin Hyperplane
• How do we tell which line is better ?
• From statistical learning theory
• Margin is defined as the distance from a separating
hyperplane to the nearest expression vector
• Determine separating hyperplane as the one that maximizes
the margin
Separating Hyperplane and Kernel Function
• hyperplane
• Data in m-dimension vector, translated into (m+i)-dimension
may give a better separation
Soft Margin and Overfitting
• Expect some errors – give a soft margin
• Overfitting may increase ambiguity
Evaluating Models -- ROC Curve
• Evaluation tool of a classifier
• Accuracy or error rates can be plotted
• ROC (Receiver Operating Characteristics) is popular
• Want to predict p(ositive) or n(egative)
• From a test (experiment)
• Actually p, predict p – true positive (TP)
• Actually n, predict p – false positive (FP, false alarm)
• Actually p, predict n – false negative (FN)
• Actually n, predict n – true negative (TN)
• ROC plots
• TP rate vs. FP rate
• TP – sensitivity of the model/test/experiment
• FP – (1-specificity)
Info in Sequence
• Chen, et al., divergence and Shannon Info in Genomes, 2005
• Measure of info – entropy
• H = -Σpi log(pi)
• Hmax when pi ‘s are equal (pi = 1/n = p’), Hmax = log(n)
• General notion that uncertainly decrease with more Info
• Define divergency (Gatlin)
• D = log(n) – H = Σpi log(pi)p’/n
• m Set: k-mers with m (A+T)
Genome as a fractal
•
•
•
•
•
Fractal, Self-similar, heavy tail distribution
www.cs.uml.edu/~kim/580/00_hao_fractal.pdf
2000, only a few genomes available
Only 1 Mbp sequence could be visualized
K string of bases {g, c, a, t}
• Bases arranged in M = { g c}
a t
Genome as a fractal
•
•
•
•
•
Fractal, Self-similar, heavy tail distribution
www.cs.uml.edu/~kim/580/00_hao_fractal.pdf
2000, only a few genomes available
Only 1 Mbp sequence could be visualized
K string of bases {g, c, a, t}
• Bases arranged in M = { g c}
a t
• MK = M✪M✪M….. ✪M (✪ -- convolution)
Genome as a fractal
• Let M00 = g, M01 = c, M10 = a, M11 = t
• Then,
MI,JK = MI1,J1 MI2,J2 …. MIK,JK
• gta: M00M11M10 : 011, 010
• Quaternary Number System
• α: {g,c,a,t} <-> {0,1,2,3}
• Given a genome with si, slide window of width K by one
base at a time, & compute index values
Genome as a fractal
• Given a genome with si, slide window of width K by one base at
a time, & compute index values
• First window:
• Index = Σi 4K-i α(si) (i=1,2,..,K)
• X = Σi 2K-I [α(si) >>1]
• Y = Σi 2K-I [α(si) &&0x01]
• Subsequent windows
• Index’ = 4 [mod(index, 4K-i)} + α(sK+1)
• X’ = 2 [mod(x, 2K-1)] + [α(sK+1) &&0x01]
• y’ = 2 [mod(x, 2K-1)] + [α(sK+1) >>1]
Genome as a fractal
• Plot of index values
Evolution Speed
• Barrick, 2009
(www.cs.uml.edu/
~kim/580/09_bar
rick.pdf)
• Cultured E. coli
over 40,000
generations over
20 years
• Compile
substitutions at
generations 2K,
5K, 10K, 20K and
40K
• At 20K, 45
mutations
• 39 SNPs, 16
indels
Substitution to Adaptation
• Rate of substitution
• Linear
• Neutral model is
supported
• Adaptation (fixed
substitution)
• Nonlinear
• Rate of fitness
improvement
decreases
• Early adaptation is
fixed
Green squares: improvement of fitness
Blue line: uniform mutation
Blue curve: 95% confidence interval
Blue circle: total mutations

RA with Base

Transcript RA with Base

Directory