Related Works

Download Report

Transcript Related Works

Clustered Segment Indexing for
Pattern Searching on the Secondary Structure
of Protein Sequences
Minkoo Seo
Sanghyun Park JungIm Won
{mkseo, sanghyun, jiwon}@cs.yonsei.ac.kr
Department of Computer Science, Yonsei University
Data and Knowledge Engineering Laboratory
Outline
• Introduction
• Related Works
• Basic Concept
– Segment
– Cluster and Look Ahead
• Selectivity Analysis
• Algorithm Description
– Exact Match
– Range Match
• Experiments
• Conclusion and Future Work
Data and Knowledge Engineering Laboratory
Introduction
• Similarities in Protein Structure
– Functional properties of proteins usually depend on structures of
the proteins.
– There are many proteins which are structurally similar, but their
sequences are not similar at the level of amino acids.
Amino Acids (AA)
Our target
Secondary Structure Element (SSE):
Helices and Sheets
Polymer Chain
Protein
Polymer Chain
…
Loop Regions
Polymer Chain
<Protein Structure>
Data and Knowledge Engineering Laboratory
Introduction (cont’)
• Secondary Structure of Protein Sequences
– Linear sequence of amino acids folds into three dimensional
structures.
– There are three basic types of folds:
• h : Alpha Helices
• e : Beta Sheets
• l : Turns of Loops
– Normally, these types occur in groups.
• For example, ‘lhhheeeelll’ is more likely to occur than ‘helhehhll.’
Primary :GQISDSIEEKRGFFSTKR..
Secondary:HLLLLLLLLLLHHHEEEE..
Probability:855577763445449476..
Data and Knowledge Engineering Laboratory
Outline
• Introduction
• Related Works
• Basic Concept
– Segment
– Cluster and Look Ahead
• Selectivity Analysis
• Algorithm Description
– Exact Match
– Range Match
• Experiments
• Conclusion and Future Work
Data and Knowledge Engineering Laboratory
Related Works
• BLAST and FASTA
– These exhaustive search techniques are often adapted to run on
specialized high-end hardware or parallelized.
– They do not solve the fundamental problem of needing to
compare a query to each sequence in the collection.
Data and Knowledge Engineering Laboratory
Related Works (cont’)
• Segment Based Indexing
L. Hammel and J. M. Patel, "Searching on the Secondary Structure of
Protein Sequence," In Proc. VLDB Conference, 2002
– Indexing scheme is based on the idea of the segmentation of
secondary protein sequences.
– Index selectivity is not uniform.
– Gap can be defined only as segments.
Data and Knowledge Engineering Laboratory
Outline
• Introduction
• Related Works
• Basic Concept
– Segment
– Cluster and Look Ahead
• Selectivity Analysis
• Algorithm Description
– Exact Match
– Range Match
• Experiments
• Conclusion and Future Work
Data and Knowledge Engineering Laboratory
Segment
Sample Protein Sequence
Protein ID :1
Primary :GQISDSIEEKRGFFSTKR..
Secondary :HLLLLLLLLLLHHHEEEE..
Probability:855577763445449476..
<Segments>
ID
Segmentation Process
:
Start
Loc.
1
H
1
0
1
L
10
1
H LLLLLLLLLL HHH EEEE
1
H
3
11
0 1234567890 123 4567
1
E
4
14
Type, Len: H,1
Loc
Type Length
L,10
H,3
E,4
Data and Knowledge Engineering Laboratory
Segment (cont’)
• Selectivity of Type+Length
Number of
segments in
Thousands
– Index selectivity is not uniform.
200
l
150
h
100
e
50
0
1
11
21
31
41
51
61
Segment Length
Data and Knowledge Engineering Laboratory
Cluster and Look Ahead
<Cluster table>
<Segments>
ID
Type
Str
0
1
10
1
h
3
11
e
4
14
ID
Type
Length
1
h
1
1
l
1
1
Start
Loc.
• 2k segments are combined and
put into Cluster table. By doing
this, we index several characters
instead of exactly one character.
• ‘Look Ahead’ fields contain ‘Type’
fields of next n segments.
• Maximum value of k can be
limited according to query length.
k=0
k=1
k=2
Length
Start
Loc.
Look
Ahead
h
1
0
lhe
1
l
10
1
he
1
h
3
11
e
1
e
4
14
ID
Type
Str
Length
Start
Loc.
Look
Ahead
1
hl
11
0
he
1
lh
13
1
e
1
he
7
11
ID
Type
Str
Length
Start
Loc.
1
hlhe
18
0
Look
Ahead
Data and Knowledge Engineering Laboratory
Outline
• Introduction
• Related Works
• Basic Concept
– Segment
– Cluster and Look Ahead
• Selectivity Analysis
• Algorithm Description
– Exact Match
– Range Match
• Experiments
• Conclusion and Future Work
Data and Knowledge Engineering Laboratory
(ID, LOC) in Thousands
Index Selectivity
18
16
14
12
10
8
6
4
2
0
0.140
Number of (ID, LOC) Found
0.120
Selectivity
0.100
0.080
0.060
0.040
0.020
0.000
Cluster table is searched by
TypeStr+TypeLen+LookAhead.
G
SE
K4
K3
K2
K1
K0
T
EN
M
Segment table is searched by
Type+TypeLen.
Number of Proteins = 320,000
Cluster indices are 30~3,000 times more selective than the segment index,
though the exact length information is lost.
Data and Knowledge Engineering Laboratory
LookAhead Selectivity
16
Num of (ID, LOC) Found w/o LookAhead
(ID, LOC) in Thousands
14
Num of (ID, LOC) Found w/ LookAhead
12
10
8
6
4
2
0
K0
K1
K2
K3
K4
Number of Proteins = 320,000
LookAhead improves selectivity 2~34 times.
Data and Knowledge Engineering Laboratory
Outline
• Introduction
• Related Works
• Basic Concept
– Segment
– Cluster and Look Ahead
• Selectivity Analysis
• Algorithm Description
– Exact Match
– Range Match
• Experiments
• Conclusion and Future Work
Data and Knowledge Engineering Laboratory
Exact Match
Query
<h 1 1><l 10 10><h 3 3><e 4 4><l 3 3>
Compute k
k  min( log 2 | Q |, 4)  min( log 2 5, 4)  2
Clustering
<h 1 1><l 10 10><h 3 3><e 4 4><l 3 3>
① k=2
② k=2
Sort Merge
• ID must be equal.
• Loc difference must be 1(=<h 1 1>).
Post Processing
Validate result using actual
sequences.
Searching
① TypeStr=hlhe, TypeLen=1+10+3+4,
LookAhead=l
② TypeStr=lhel, TypeLen=10+3+4+3
Data and Knowledge Engineering Laboratory
Range Match
<Range Match>
Clustering
<e 50 60><l 3 3><h 5 6><e 6 8>
Searching
TypeStr=elhe,
TypeLen=64(50+3+5+6)~
77(60+3+6+8)
<Selective Clustering Range Match (SCRM)>
Clustering
<e 50 60><l 3 3><h 5 6><e 6 8>
Comparing Histograms
<e 50 60><l 3 3><h 5 6><e 6 8>
②
①
③
④
① k=2, TypeLen=64~77
② k=1, TypeLen=53~63
③ k=1, TypeLen=8~9
④ k=1, TypeLen=11~14
Range of TypeLen is increased
too much.
Search the cluster which seems to return
the smallest number of rows.
Data and Knowledge Engineering Laboratory
Outline
• Introduction
• Related Works
• Basic Concept
– Segment
– Cluster and Look Ahead
• Selectivity Analysis
• Algorithm Description
– Exact Match
– Range Match
• Experiments
• Conclusion and Future Work
Data and Knowledge Engineering Laboratory
Environment
• Oracle Database 8.1.7 on Linux
– To store cluster table and cluster indices.
• JDK 1.4.2_04 on Windows XP
– To run searching program.
• Protein Data Used
– 80,000 amino acid sequences of PDB(Protein Data Bank) were
downloaded.
– Secondary structure of protein sequences were predicted using
PREDATOR.
– 320,000 sequences were populated by duplicating 80,000
sequences 4 times.
Data and Knowledge Engineering Laboratory
Exact Match
Ours
ISS
SSS
30
MISS(2)
20
Time in Sec.
Time in Sec.
40
Ours
5
MISS(2)
10
0
0
3
5
10
|Q|
Exact Match
15
20
3
5
10
15
20
|Q|
Exact Match
Our method is 3~20 times faster than MISS(2)
and 8~49 times faster than SSS and ISS.
Data and Knowledge Engineering Laboratory
Range Match
15
SCRM
10
MISS(2)
5
0
Time in Sec.
Time in Sec.
20
15
SCRM
MISS(2)
10
5
0
3
5
10
15
20
3
5
10
15
20
|Q|
|Q|
Range Match, First
Range Match, Middle
Time in Sec.
20
SCRM
15
MISS(2)
10
SCRM is 1.2~6 times faster than MISS(2).
5
0
3
5
10
15
20
|Q|
Range Match, End
Data and Knowledge Engineering Laboratory
Scalability
10
Ours
MISS(2)
Time in Sec.
Time in Sec.
15
5
0
80,000
160,000
320,000
Number of proteins
Exact Match, |Q|=5
50
45
40
35
30
25
20
15
10
5
0
Ours(SCRM)
MISS(2)
80,000
160,000
320,000
Number of proteins
Range Match (Middle), |Q|=5
Our methods are 2~4 times faster than MISS(2).
Data and Knowledge Engineering Laboratory
Index Size
3,000
Cluster Index
Size in MB
2,500
Segment Index
2,000
1,500
1,000
500
0
0
1
2
3
Maximum K value
4
Index size increases linearly.
Number of proteins = 320,000
Maximum K value can be set according to expected length of queries.
In most cases, maximum K value of 2 will suffice.
Data and Knowledge Engineering Laboratory
Outline
• Introduction
• Related Works
• Basic Concept
– Segment
– Cluster and Look Ahead
• Selectivity Analysis
• Algorithm Description
– Exact Match
– Range Match
• Experiments
• Conclusion and Future Work
Data and Knowledge Engineering Laboratory
Conclusion and Future Work
• We have proposed the concept of clustered indexing,
which indexes fixed number of characters, and look
ahead, which enhances index selectivity.
• Our experiments show that proposed techniques are
more efficient than the previous algorithms.
• In the future, we would like to study and propose
approximate searching algorithms for 3-D structures of
proteins.
Data and Knowledge Engineering Laboratory
Q&A