
Sequence Clustering
COMP 790-90 Research Seminar
Spring 2011
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
CLUSEQ
• The primary structures of many biological (macro)molecules are “letter” sequences despite their 3D structures.
  - Proteins are built from an alphabet of 20 amino acids.
  - DNA has an alphabet of four bases {A, T, G, C}.
  - RNA has an alphabet {A, U, G, C}.
• Text documents
• Transaction logs
• Signal streams
• Structural similarities at the sequence level often suggest a high likelihood of being functionally/semantically related.
Problem Statement
• Clustering based on structural characteristics can serve as a powerful tool to discriminate sequences belonging to different functional categories.
  - The goal is to create a grouping of sequences such that sequences in each group have similar features.
  - The result can potentially reveal unknown structural and functional categories that may lead to a better understanding of nature.
• Challenge: how to measure structural similarity?
Measure of Similarity
• Edit distance:
  - computationally inefficient
  - only captures the optimal global alignment but ignores many other local alignments that often represent important features shared by the pair of sequences
• q-gram based approach:
  - ignores the sequential relationships (e.g., ordering, correlation, dependency, etc.) among q-grams
• Hidden Markov model:
  - captures some low-order correlations and statistics
  - vulnerable to noise and erroneous parameter settings
  - computationally inefficient
Measure of Similarity
• Probabilistic Suffix Tree
  - Effective in capturing significant structural features
  - Easy to compute and incrementally maintain
• Sparse Markov Transducer
  - Allows wild cards
Model of CLUSEQ
• CLUSEQ: exploring significant patterns of sequence formation.
  - Sequences belonging to one group/cluster may subsume to the same probability distribution of symbols (conditioned on the preceding segment of a certain length), while different groups/clusters may follow different underlying probability distributions.
  - By extracting and maintaining significant patterns characterizing (potential) sequence clusters, one can easily determine whether a sequence should belong to a cluster by calculating the likelihood of (re)producing the sequence under the probability distribution that characterizes the cluster.
Model of CLUSEQ
• Sequence: $\sigma = s_1 s_2 \cdots s_l$
• Cluster $S$:
  $P_S(\sigma) = P_S(s_1) \cdot P_S(s_2 \mid s_1) \cdots P_S(s_l \mid s_1 \cdots s_{l-1}) = \prod_{i=1}^{l} P_S(s_i \mid s_1 \cdots s_{i-1})$
• Random process:
  $P^r(\sigma) = P^r(s_1) \cdot P^r(s_2) \cdots P^r(s_l) = \prod_{i=1}^{l} P^r(s_i)$
• If $P_S(\sigma)$ is high, we may consider $\sigma$ a member of $S$; in particular, if $P_S(\sigma) \gg P^r(\sigma)$, we may consider $\sigma$ a member of $S$.
Model of CLUSEQ
• Similarity between $\sigma$ and $S$:
  $\mathrm{sim}_S(\sigma) = \frac{P_S(\sigma)}{P^r(\sigma)} = \frac{\prod_{i=1}^{l} P_S(s_i \mid s_1 \cdots s_{i-1})}{\prod_{i=1}^{l} p(s_i)} = \prod_{i=1}^{l} \frac{P_S(s_i \mid s_1 \cdots s_{i-1})}{p(s_i)}$
• Noise may be present.
• Different portions of a (long) sequence may subsume to different conditional probability distributions, so the similarity is taken over the best-matching segment:
  $\mathrm{SIM}_S(\sigma) = \max_{1 \le i \le j \le l} \mathrm{sim}_S(s_i \cdots s_j)$
Model of CLUSEQ
• Given a sequence $\sigma = s_1 s_2 \cdots s_l$ and a cluster $S$, a dynamic programming method can be used to calculate the similarity $\mathrm{SIM}_S(\sigma)$ via a single scan of $\sigma$. Let
  $X_i = \frac{P_S(s_i \mid s_1 \cdots s_{i-1})}{p(s_i)}$
  $Y_i = \max_{1 \le j \le i} \mathrm{sim}_S(s_j \cdots s_i)$
  $Z_i = \max_{1 \le i_1 \le i_2 \le i} \mathrm{sim}_S(s_{i_1} \cdots s_{i_2})$
• Intuitively, $X_i$, $Y_i$, and $Z_i$ can be viewed as the similarity contributed by the symbol at the $i$th position of $\sigma$ (i.e., $s_i$), the maximum similarity possessed by any segment ending at the $i$th position, and the maximum similarity possessed by any segment ending prior to or at the $i$th position, respectively.
Model of CLUSEQ
• Then, $\mathrm{SIM}_S(\sigma) = Z_l$, which can be obtained by
  $Y_i = \max\{Y_{i-1} \times X_i,\ X_i\}$
  $Z_i = \max\{Z_{i-1},\ Y_i\}$
  with $Y_1 = Z_1 = X_1$.
• For example, $\mathrm{SIM}_S(bbaa) = 2.10$ if $p(a) = 0.6$ and $p(b) = 0.4$, as traced in the table below (and reproduced in the sketch that follows).

  Sequence                |  b   |  b    |  a   |  a
  P_S(s_i | s_1…s_{i-1})  | 0.55 | 0.418 | 0.87 | 0.406
  X_i                     | 1.38 | 1.05  | 1.45 | 0.677
  Y_i                     | 1.38 | 1.45  | 2.10 | 1.42
  Z_i                     | 1.38 | 1.45  | 2.10 | 2.10
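To make the recurrence concrete, here is a minimal Python sketch (not the authors' implementation) of the single-scan computation of SIM_S; the conditional and background probabilities are passed in directly, whereas in CLUSEQ they would be retrieved from the cluster's probabilistic suffix tree.

```python
# Minimal sketch of the single-scan dynamic program for SIM_S(sigma).
# cond_probs[i] = P_S(s_i | s_1...s_{i-1}) and bg_probs[i] = p(s_i).

def sim_max_segment(cond_probs, bg_probs):
    y = 0.0   # Y_i: best similarity of a segment ending exactly at position i
    z = 0.0   # Z_i: best similarity of a segment ending at or before position i
    for p_cond, p_bg in zip(cond_probs, bg_probs):
        x = p_cond / p_bg          # X_i
        y = max(y * x, x)          # Y_i = max{Y_{i-1} * X_i, X_i}
        z = max(z, y)              # Z_i = max{Z_{i-1}, Y_i}
    return z                       # SIM_S(sigma) = Z_l

# The bbaa example from the table: p(a) = 0.6, p(b) = 0.4.
cond = [0.55, 0.418, 0.87, 0.406]
bg   = [0.4, 0.4, 0.6, 0.6]
print(sim_max_segment(cond, bg))   # ~2.08 (the slide's 2.10 uses rounded X_i values)
```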
Probabilistic Suffix Tree
• A compact representation to organize the derived conditional probability distributions (CPDs) for a cluster
• Built on the reversed sequences
• Each node corresponds to a segment π and is associated with a counter C(π) and a probability vector P(s_i | π).
Probabilistic Suffix Tree
[Figure: an example probabilistic suffix tree over the alphabet {a, b}, rooted at a node with count 300. Each node shows its count and its probability vector (P(a|·), P(b|·)); for instance, the node for segment “ba” has C(ba) = 96, P(a|ba) = 0.406, and P(b|ba) = 0.594.]
Model of CLUSEQ
• Retrieval of a CPD entry P(s_i | s_1…s_{i-1}):
  - The longest suffix s_j…s_{i-1} stored in the tree can be located by traversing from the root along the path s_{i-1} → … → s_2 → s_1 until we reach either the node labeled with s_1…s_{i-1} or a node from which no further advance can be made.
  - The traversal takes O(min{i, h}) time, where h is the height of the tree.
• Example: P(a | bbba), worked out in the figure below (a small code sketch follows it).
P(a | bbba) ≈ P(a | bba) = 0.4
[Figure: the example probabilistic suffix tree again, with the retrieval path for the context “bbba” highlighted; the longest matching suffix found is “bba”, whose node carries the probability vector (0.4, 0.6).]
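To illustrate the retrieval step, here is a minimal Python sketch of a PST node and the suffix-walk lookup. The data layout is an assumption for illustration; of the concrete numbers, only the root count 300, C(ba) = 96 with vector (0.406, 0.594), and the (0.4, 0.6) vector used above are taken from the figure, the rest are made up.

```python
class PSTNode:
    def __init__(self, count, prob):
        self.count = count        # C(segment)
        self.prob = prob          # probability vector, e.g. {'a': 0.4, 'b': 0.6}
        self.children = {}        # children keyed by the next-older context symbol

def lookup(root, context, symbol):
    """Approximate P(symbol | context) using the longest suffix of `context` in the tree."""
    node = root
    for s in reversed(context):   # walk s_{i-1}, s_{i-2}, ... from the root
        if s not in node.children:
            break                 # no further advance can be made
        node = node.children[s]
    return node.prob.get(symbol, 0.0)

# Fragment of the tree: the path root -> 'a' -> 'b' -> 'b' stores the segments
# "a", "ba", "bba" (the path is reversed because the tree is built on reversed sequences).
root = PSTNode(300, {'a': 0.45, 'b': 0.55})
root.children['a'] = PSTNode(165, {'a': 0.45, 'b': 0.55})                 # segment "a"
root.children['a'].children['b'] = PSTNode(96, {'a': 0.406, 'b': 0.594})  # segment "ba"
root.children['a'].children['b'].children['b'] = PSTNode(60, {'a': 0.4, 'b': 0.6})  # "bba"

print(lookup(root, "bbba", "a"))  # -> 0.4, i.e. P(a|bbba) is approximated by P(a|bba)
```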
CLUSEQ
• Sequence cluster: a set of sequences S is a sequence cluster if, for each sequence σ in S, the similarity SIM_S(σ) between σ and S is greater than or equal to some similarity threshold t.
• Objective: automatically group a set of sequences into a set of possibly overlapping clusters.
Algorithm of CLUSEQ
• An iterative process
  - Each cluster is represented by a probabilistic suffix tree.
  - The optimal number of clusters and the number of outliers allowed can be adapted by CLUSEQ automatically.
  - New cluster generation, cluster split, and cluster consolidation
  - Adjustment of the similarity threshold
[Flowchart: unclustered sequences → generate new clusters → sequence re-clustering → cluster split → cluster consolidation → any improvement? If so, adjust the similarity threshold and iterate; if not, output the sequence clusters.]
New Cluster Generation
• New clusters are generated from unclustered sequences at the beginning of each iteration.
  - $k' \times f$ new clusters are generated, where $k'$ is the number of clusters and
  $f = \frac{\max\{k'_n - k'_c,\ 0\}}{k'_n}$
  with $k'_n$ the number of new clusters generated at the previous iteration and $k'_c$ the number of consolidated clusters (see the one-line sketch below).
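For concreteness, the fraction f can be computed directly from the two counts; the parameter names are hypothetical, and the function assumes some clusters were generated at the previous iteration.

```python
# f = max(k'_n - k'_c, 0) / k'_n, the fraction used above for new cluster generation
# (assumes k'_n > 0, i.e. some new clusters were generated at the previous iteration).
def new_cluster_fraction(new_prev, consolidated):
    return max(new_prev - consolidated, 0) / new_prev

print(new_cluster_fraction(10, 4))   # -> 0.6, so k' * 0.6 new clusters are generated
```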
Sequence Re-Clustering
• For each (sequence, cluster) pair:
  - Calculate the similarity.
  - Update the PST if necessary.
  - Only the similar portion is used.
  - The update is weighted by the similarity value.
[Figure: the example probabilistic suffix tree, shown being updated during sequence re-clustering, alongside the algorithm flowchart.]
Cluster Split
• Check the convergence of each existing cluster.
  - Imprecise probabilities are used for each probability entry in the PST.
  - Split non-convergent clusters.
Imprecise Probabilities
• Imprecise probabilities use two values (p1, p2) (instead of one) for a probability.
  - p1 is called the lower probability and p2 the upper probability.
  - The true probability lies somewhere between p1 and p2.
  - p2 − p1 is called the imprecision.
Update Imprecise Probabilities
• Assume the prior knowledge of a (conditional) probability is $(p_1, p_2)$ and the new experiment yields $a$ occurrences out of $b$ trials. Then
  $p'_1 = \frac{a + s \cdot p_1}{b + s}$
  $p'_2 = \frac{a + s \cdot p_2}{b + s}$
  where $s$ is the learning parameter, which controls the weight that each experiment carries (see the sketch below).
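A minimal sketch of this update rule; the concrete numbers in the usage loop are illustrative.

```python
def update_imprecise(p1, p2, a, b, s):
    """Return the updated (lower, upper) probability interval."""
    return (a + s * p1) / (b + s), (a + s * p2) / (b + s)

# Usage: with a static true probability of 0.4, repeated updates shrink the
# imprecision p2 - p1 and pull both bounds toward 0.4 (the convergence property
# stated on the Properties slide that follows).
p1, p2, s = 0.0, 1.0, 5.0
for _ in range(4):
    p1, p2 = update_imprecise(p1, p2, a=40, b=100, s=s)
    print(round(p1, 3), round(p2, 3), "imprecision:", round(p2 - p1, 3))
```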
Properties
• The following two properties are important.
  - If the probability distribution stays static, then p1 and p2 will converge to the true probability.
  - If the experiment agrees with the prior assumption, the range of imprecision decreases after applying the new evidence, i.e., p'2 − p'1 < p2 − p1.
• The clustering process terminates when the imprecision of all significant nodes is less than a small threshold.
Cluster Consolidation
• Start from the smallest cluster.
• Dismiss clusters that have few sequences not covered by other clusters.
Adjustment of Similarity Threshold
• Find the sharpest turn of the similarity distribution function (count vs. similarity):
  $t_{new} = \frac{t_{old} + \hat{t}}{2}$
  where $\hat{t}$ is the similarity value at the sharpest turn, i.e., at the position $i$ ($2 \le i \le n-1$) maximizing $|b^l_i - b^r_i|$, with $b^l_i$ and $b^r_i$ the slopes of the distribution to the left and right of position $i$ (see the sketch below).
[Figure: the count-vs-similarity distribution, with the sharpest turn marking $\hat{t}$; the algorithm flowchart appears alongside.]
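Here is a rough sketch of that adjustment under my reading of the slide: bucket the similarity values into a count histogram, measure the turn at each interior bucket as the difference between its left and right slopes, and move the threshold halfway toward the similarity at the sharpest turn. The histogram representation and the finite-difference slopes are assumptions.

```python
def adjust_threshold(t_old, bucket_similarities, bucket_counts):
    """bucket_similarities[i] is the similarity of bucket i, bucket_counts[i] its count."""
    n = len(bucket_counts)
    best_i, best_turn = 1, -1.0
    for i in range(1, n - 1):                                  # interior buckets only
        left_slope = bucket_counts[i] - bucket_counts[i - 1]   # b_i^l (assumed definition)
        right_slope = bucket_counts[i + 1] - bucket_counts[i]  # b_i^r (assumed definition)
        turn = abs(left_slope - right_slope)
        if turn > best_turn:
            best_i, best_turn = i, turn
    t_hat = bucket_similarities[best_i]                        # similarity at the sharpest turn
    return (t_old + t_hat) / 2                                 # t_new = (t_old + t_hat) / 2

# Illustrative histogram: a sharp turn around similarity 2.0.
sims   = [1.0, 1.5, 2.0, 2.5, 3.0]
counts = [90, 80, 20, 15, 12]
print(adjust_threshold(2.4, sims, counts))   # -> 2.2
```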
Algorithm of CLUSEQ
• Implementation issues
  - Limited memory space (a small sketch of the pruning order follows):
    - Prune the node with the smallest count first.
    - Prune the node with the longest label first.
    - Prune the node with an expected probability vector first.
  - Probability smoothing:
    - Eliminates zero empirical probabilities.
  - Other considerations:
    - Background probabilities
    - A priori knowledge
    - Other structural features
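A small sketch of the first two pruning rules (smallest count first, then longest label first among equal counts); the node representation is assumed, and the third rule is omitted since it needs the PST probability machinery.

```python
from typing import List, Tuple

# Each candidate node is represented here as (count, label), e.g. (96, "ba").
def pruning_order(nodes: List[Tuple[int, str]]) -> List[Tuple[int, str]]:
    """Sort nodes so that the first element is the one to prune first."""
    # smallest count first; among equal counts, longest label first
    return sorted(nodes, key=lambda node: (node[0], -len(node[1])))

print(pruning_order([(96, "ba"), (21, "aab"), (21, "ab"), (300, "")]))
# -> [(21, 'aab'), (21, 'ab'), (96, 'ba'), (300, '')]
```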
Experimental Study
• We have experimented with a protein database of 8000 proteins from 30 families in the SWISS-PROT database.

  Model                                | Accuracy | Response Time (sec)
  CLUSEQ                               |   92%    |    144
  Edit Distance                        |   23%    |    487
  Edit Distance with Block Operations  |   90%    |  13754
  Hidden Markov Model                  |   91%    |   3117
  Q-gram                               |   75%    |    132
Experimental Study
Synthetic data

  Initial t      | 1.05  | 1.5   | 2     | 3
  Final t        | 1.99  | 2.01  | 2     | 1.99
  Response time  | 8011  | 7556  | 6754  | 7234
  Precision      | 81.3% | 83.1% | 83.4% | 81.9%
  Recall         | 82.1% | 82.8% | 83.6% | 82.7%
Experimental Study
Synthetic data

  Initial cluster number | 1     | 20    | 100   | 200
  Final cluster number   | 102   | 99    | 101   | 102
  Response time          | 10112 | 9023  | 6754  | 8976
  Precision              | 81.3% | 82.1% | 82.6% | 81%
  Recall                 | 81.6% | 82%   | 83.4% | 81.7%
Experimental Study
• CLUSEQ has linear scalability with respect to the number of clusters, the number of sequences, and the sequence length.
[Figure: three plots of response time (sec) versus number of clusters, number of sequences, and average sequence length, respectively, on synthetic data.]
Remarks
• Similarity measure
  - Powerful in capturing high-order statistics and dependencies
  - Efficient in computation (linear complexity)
  - Robust to noise
• Clustering algorithm
  - High accuracy
  - High adaptability
  - High scalability
  - High reliability
References
• CLUSEQ: efficient and effective sequence clustering. In Proceedings of the 19th IEEE International Conference on Data Engineering (ICDE), 2003.
• A framework towards efficient and effective protein clustering. In Proceedings of the 1st IEEE CSB, 2002.