No Slide Title

Download Report

Transcript No Slide Title

Database Engine Design
a.k.a. Research@ DSL
Jayant Haritsa
August 2007
SERC Research Symposium
Slide 1
Database Management Systems (DBMS)
• Efficient and convenient mechanisms for
storing, querying and maintenance of
enterprise data
• Cornerstone of computer industry
– Uses more than 80 percent of computers worldwide
– Employs more than 70 percent of computer professionals
– Largest monetary sector of computer business
August 2007
SERC Research Symposium
Slide 2
DBMS FEATURES
• Handle data of arbitrary size
– Income-Tax records are in Petabytes (1015)
• Self-contained
– contains both data and meta-data
• Program-Data insulation
– application s/w not affected by storage changes
SR No | Name | Address | Hostel | GPA
SR No | Name | Address | GPA | Hostel
August 2007
SERC Research Symposium
Slide 3
DBMS FEATURES (contd)
• Declarative Access
– state what you want, not how to get it
• On-the-Fly Questions
– ask new questions without writing new programs
• PEACE OF MIND
– changes to the database are guaranteed to be
immune to subsequent system failures
Sri Sri Ravishankar of the Information World
August 2007
SERC Research Symposium
Slide 4
Current Database Systems
• Commercial
– IBM DB2 / Oracle / Microsoft SQL Server / Sybase
• Public-domain
– PostgreSQL / MySQL / Berkeley DB
August 2007
SERC Research Symposium
Slide 5
DBMS Myths 
• Databases? Isn’t that the boring part of
accounting?
• Hazaar dumb Cobol programming!
• Maha-bore - almost as dull as watching
Rahul Dravid bat!
• High-tech name for data entry!
• Will only get job with TCS!
• ...
August 2007
SERC Research Symposium
Slide 6
DBMS Realities 
• Design of database engines has lots of really, really
interesting intellectual problems with practical impact
– theory, algorithms, data structures, experiments,
prototypes
• Turing awards
– 1981: Edgar Codd (relational data model)
– 1999: Jim Gray (transaction model)
• Ullman, Silberschatz, Papadimitrou, …
• Rajaraman, Patnaik, Balakrishnan,
Jacob/Govindarajan …
August 2007
SERC Research Symposium
Slide 7
Database Systems Lab
(DSL)
Established 1995
August 2007
SERC Research Symposium
Slide 8
Research Topics
– Real-Time Database Systems
– Distributed Transaction Management 1995-2000
– OODBMS
– Web Databases
– Data Mining
– XML Databases
2000-2005
– Biological Databases
– Query Optimization
Last few years
– Multilingual Databases
– Music Databases
August 2007
SERC Research Symposium
Slide 9
Research Trajectory
MIDDLEWARE
OO Models
Mining
XML
CORE DB TECHNOLOGY
Transaction
Processing
August 2007
Access
Methods
SERC Research Symposium
Query
Processing
Slide 10
Research Techniques
• Theory
– real-time, data mining, query optimization
• Simulation studies
– real-time, distributed, web dbms
• Empirical evaluation
– data mining, biological, multilingual dbms, query optimization
• Prototype development
– OODBMS (Flexible Manufacturing [MIDAS], VLSI [DIAS], Bio-diversity
[Oshadhi,Bodhi] )
– XML (Storage [LegoDB], Compression [XGrind] )
– Query Optimization (Clustering [Plastic], Visualization [Picasso] )
– Multilingual Databases (Cross-lingual SQL [Mira] )
August 2007
SERC Research Symposium
Slide 11
SPINE: Putting Backbone into
Genomic Sequence Indexing
August 2007
SERC Research Symposium
Slide 12
Standard Genomic Index: Suffix Tree
[Weiner 1973]
Vertically-compressed trie of suffixes augmented with links
0 12 3 4 5 6 78 9
Suffix Links
(xW → W)
Data = ‘GTTAATTACT$’
A
T
ATTACT$
3
GTTAATTACT$
CT$
7
Search for
Query = ‘TTA’
August 2007
Tree Edges
A
TTACT$
4
8
TA
$
9
0
ATTACT$
2
SERC Research Symposium
CT$
6
ATTACT$
1
CT$
5
Slide 13
Locate all Maximal Matching Substrings
[Chang & Lawler 1990]
• For each position in query sequence Q ,
locate all longest matching substrings of
length ≥  in the indexed data sequence D
– Example:
D = ‘GTTAATTACT$’
Q = ‘CTAATGA’ and
Result:
{ TAAT:<2,1>
August 2007
=3
AAT:<3,2> }
SERC Research Symposium
Slide 14
Maximal Substring Search
with Suffix Tree Index
0 1 2 34 5 6 7 8 9
Q = ‘CTAATGA’
 = 3
D = ‘GTTAATTACT$’
A
T
ATTACT$
3
GTTAATTACT$
CT$
7
A
TTACT$
4
8
9
0
ATTACT$
2
August 2007
TA
$
SERC Research Symposium
CT$
6
ATTACT$
1
CT$
5
Slide 15
Features of Suffix Tree Index
• Accurate retrieval
– no false negatives (unlike BLAST)
• Linear Time Complexity for both Construction
and Search!
– because of Suffix-links
• Widely used
– More than 40-50 applications over biological
sequences [Gusfield, 2002]
– MUMmer [Celera Genomics], AVID, …
August 2007
SERC Research Symposium
Slide 16
Crippling Limitation
• Viable only for sequences that are short
enough for their associated suffix tree
to fit completely in main memory …
[Baeza-Yates and Navarro, 2000]
• Best that has been built so far is for sequences of
~ 10 Mbp (Human Genome is 300 times longer!)
August 2007
SERC Research Symposium
Slide 17
Difficulties in Supporting
Suffix Trees on Long Sequences - 1
Space overheads are enormous
– Order(s) of magnitude larger than data!
– Human Genome can be easily stored in
main memory (~1 GB) but the index could
be of the order of 10-100 GB
 Disk-resident suffix trees for long sequences
August 2007
SERC Research Symposium
Slide 18
Difficulties in Supporting
Suffix Trees on Long Sequences - 2
Tree Construction on Disk is Very Slow
–
Due to disk thrashing from random seeks
The active suffix creeps through the text like a
caterpillar … corresponding active node swings
through the tree like a butterfly
[Giegerich and Kurtz, 1995]
August 2007
SERC Research Symposium
Slide 19
Difficulties in Supporting
Suffix Trees on Long Sequences - 3
Searching on Disk is Very Slow
– Unbalanced Tree Structure
• Shape of tree depends on
sequence stochastic properties
– “Multi-directional” traversals
causes disk thrashing
• Tree-Edge  “Vertical Walk-Down”
Suffix Tree Search
• Edge + Link mesh
• Two phase Search
• Locate
• Report
≥
• Suffix-Link  “Horizontal Jump-Across”
Combination of
Batman and Spiderman !
August 2007
SERC Research Symposium
Slide 20
The SPINE* Index
A Horizontally-Compacted Trie Index
[*Sequence Processing INdexing Engine]
August 2007
SERC Research Symposium
Slide 22
SPINE Index Structure
D = ‘ACCACAC’
Complete horizontal compaction
into single linear chain!!
0
A
0
0
C(0)
1
Root node
C
2
• Nodes
• Forward Edges
1
Link
1
2
3
4
C
2
Vertebra
SERC Research Symposium
Rib
1(2)
5
A
– Links
August 2007
A(1)
A
– Vertebras (Backbone)
– Ribs / Ext-Ribs
• Backward Edges
C
6
C
7
Extension
rib
Slide 23
Structural Advantages of SPINE
w.r.t. Suffix Trees
1) Number of nodes is equal to length of string,
whereas in suffix tree can go up to double.
2) Entire data sequence explicitly embedded in index
 throw away the data!
3) On-line incremental algorithm (by definition)
–
do not need to possess entire data sequence in advance
D =‘ACCA’
4) Node creation order and
logical order are the same
 prefix-partitionable
0
A
1
C
C (0)
2
C
3
A (1)
A
August 2007
SERC Research Symposium
4
Slide 24
Advantages of SPINE (contd)
5) Each node represents a set of suffixes
whereas in suffix tree each node
represents only a single suffix
–
Number of suffixes processed for
construction and searching is smaller
0
A
1
C (0)
C
2
C
3
A (1)
A
4
6) Easy to develop buffering strategies for
persistent implementations
August 2007
SERC Research Symposium
Slide 25
SPINE Performance Summary
Data Sets
Ecoli: 3.5 Mbp
Celegans: 15.5 Mbp
HC 21: 28.5 Mbp
HC19: 57.5 Mbp
Suffix Tree (MUMmer - Celera Genomics)
• Spine Space
– ~ 2/3 of Suffix Tree
• Spine Time
• Construction: ~ 1/2 of Suffix Tree
• Searching: ~ 1/2 of Suffix Tree
August 2007
SERC Research Symposium
Slide 26
SPINE Summary
• First index based on horizontal (inter-path)
compaction of the trie
• Collapses into a single linear structure
• Improved features and performance w.r.t. suffix
trees, the classical index
•
•
•
•
•
August 2007
Prefix-partitionable (first index to have this property)
Easily amenable to persistent disk implementation
Retains linear time/space complexity
Better construction speed and capacity
Better search response times
SERC Research Symposium
Slide 27
Full details at http://dsl.serc.iisc.ernet.in
Questions?
August 2007
SERC Research Symposium
Slide 28
END PRESENTATION
August 2007
SERC Research Symposium
Slide 29