Transcript ppt

The MonetDB Architecture
Martin Kersten
CWI
Amsterdam
M.Kersten 2008
1
Try to keep things simple
Database
Structures
Execution
Paradigm
Query
optimizer
DBMS
Architecture
M.Kersten 2008
2
MonetDB quickstep
End-user application
SQL
JDBC
ODBC
XQuery
Python
Perl
C-mapi lib
MAPI protocol
MonetDB
kernel
M.Kersten Sep 2008
PHP
RoR
The MonetDB Software Stack
SQL/XML
SOAP
Open-GIS
XQuery
SQL 03
Optimizers
MonetDB 4
X100
MonetDB 5
MonetDB kernel
An advanced column-oriented DBMS
M.Kersten Sep 2008
Compile?
MonetDB quickstep
SQL
MAL
The MonetDB Assembly Language
glues the layers together.
- binary relational model operators
- imperative constructs
- functional abstractions
MAL
MonetDB Kernel
MonetDB Server
Original: P. A. Boncz, M. L. Kersten.
MIL Primitives for Querying a Fragmented World.
The VLDB Journal, 8(2):101-119, October 1999.
M.Kersten 2008
5
MonetDB quickstep
select count(*) from photoobjall;
function user.s3_1():void;
X1:bat[:oid,:lng] := sql.bind("sys","photoobjall","objid",0);
X6:bat[:oid,:lng] := sql.bind("sys","photoobjall","objid",1);
X9:bat[:oid,:lng] := sql.bind("sys","photoobjall","objid",2);
X13:bat[:oid,:oid] := sql.bind_dbat("sys","photoobjall",1);
X8 := algebra.kunion(X1,X6);
X11 := algebra.kdifference(X8,X9);
X12 := algebra.kunion(X11,X9);
X14 := bat.reverse(X13);
X15 := algebra.kdifference(X12,X14);
X18 := algebra.markT(X15,0@0);
X19 := bat.reverse(X18);
X20 := aggr.count(X19);
sql.exportValue(1,"sys.","count_","int",32,0,6,X20,"");
end s3_1;
0 base table
delete
2 update
1 insert
SQL
MAL
Tactical
Optimizers
MAL
MonetDB
Kernel
MonetDB Server
MonetDB quickstep
Strategic optimizer:
– Exploit the lanuage the language
– Rely on heuristics
SQL
MAL
Tactical MAL optimizer:
–Modular optimizer framework
–Focused on coarse grain resource
optimization
Operational optimizer:
– Exploit everything you know at runtime
– Re-organize if necessary
M.Kersten 2008
MAL optimizers
MAL
MonetDB Kernel
MonetDB Server
7
MonetDB quickstep
select count(*) from photoobjall;
function user.s3_1():void;
X1:bat[:oid,:lng] := sql.bind("sys","photoobjall","objid",0);
X20 := aggr.count(X1);
sql.exportValue(1,"sys.","count_","int",32,0,6,X20,"");
end s3_1;
SQL
MAL
Optimizer pipelines.
Tactical
Optimizers
sql> select optimizer;
inline,remap,evaluate,costModel,coercions
,emptySet,aliases,mergetable,deadcode,c
onstants,commonTerms,joinPath,deadcod
e,reduce,garbageCollector,dataflow,history
,replication,multiplex
MAL
MonetDB
Kernel
MonetDB Server
MonetDB quickstep
select count(*) from photoobjall;
function user.s3_1():void;
X1:bat[:oid,:lng] := sql.bind("sys","photoobjall","objid",0);
X20 := aggr.count(X1);
sql.exportValue(1,"sys.","count_","int",32,0,6,X20,"");
end s3_1;
SQL
MAL
Kernel execution paradigms
Tactical
Optimizers
Tuple-at-a-time pipelined
Operator-at-a-time
MAL
MonetDB
Kernel
MonetDB Server
Try to keep things simple
MonetDB storage
Database
Structures
PAX
stores
M.Kersten 2008
N-ary
stores
Column
stores
10
Try to keep things simple
Early 80s: tuple storage structures for PCs were simple
OK John
32
Houston
OK Mary
31
Houston
Easy to access at the cost of wasted space
M.Kersten 2008
11
Try to keep things simple
Slotted pages Logical pages equated physical pages
32
John Houston
31
M.Kersten 2008
Mary Houston
12
Try to keep things simple
Slotted pages Logical pages equated multiple physical pages
32
John Houston
31
M.Kersten 2008
Mary Houston
13
Avoid things you don’t always need
Not all attributes are equally important
M.Kersten 2008
14
Avoid moving too much around
A column orientation is as simple and acts like an array
Attributes of a tuple are correlated by offset
M.Kersten 2008
15
Try to keep things simple
• MonetDB Binary Association Tables
ID
10
11
12
13
OID
ID
100
101
102
103
104
M.Kersten 2008
OID
10
11
12
13
14
Day
Discount
4/4/98
0.195
9/4/98
0.065
1/2/98
0.175
7/2/98
0
Day
100
4/4/98
101
9/4/98
102
1/2/98
103
7/2/98
104
1/2/99
OID
Discount
100
0.195
101
0.065
102
0.175
103
0
104
0.065
16
Try to avoid doing things twice
Physical data organization
• Binary Association Tables
head
tail
100
Dense
sequence
101
102
103
104
10
11
12
13
14
Memory
mapped
files
Bat Unit
fixed size
M.Kersten 2008
17
Try to avoid doing things twice
• Binary Association Tables accelerators
OID
ID
100
101
102
103
Hash-based
access
M.Kersten 2008
104
10
11
12
13
14
Column properties:
key-ness
non-null
dense
ordered
18
Try to avoid doing things twice
• Binary Association Tables storage control
OID
100
ID
100
101
102
103
Type remappings
are used to squeeze
space
104
Amsterdam
Seattle
New York
London
Paris
ID
Amsterdam
Seattle
New York
London
A BAT can be used
as an encoding table
A VID datatype
can be used to
represent dense
enumerations
M.Kersten 2008
19
Mantra: Try to keep things simple
• Column orientation benefits datawarehousing
• Brings a much tighter packaging and improves
transport through the memory hierarchy
• Each column can be more easily optimized for
storage using compression schemes
• Each column can be replicated for read-only
access
M.Kersten 2008
20
Try to maximize performance
Volcano
model
Execution
Paradigm
Materialize
All
Model
Vectorized
model
M.Kersten 2008
22
Volcano Engines
Query
SELECT
name,
salary*.19 AS tax
FROM
employee
WHERE
age > 25
M.Kersten 2008
23
Volcano Engines
Operators
Iterator interface
-open()
-next(): tuple
-close()
M.Kersten 2008
24
Volcano Engines
Primitives
Provide computational
functionality
All arithmetic allowed in
expressions,
e.g. multiplication
mult(int,int)  int
M.Kersten 2008
25
Try to maximize performance
Volcano paradigm
• The Volcano model is based on a simple pullbased iterator model for programming
relational operators.
• The Volcano model minimizes the amount of
intermediate store
• The Volcano model is CPU intensive and can
be inefficient
M.Kersten 2008
26
Try to use simple a software pattern
MonetDB paradigm
• The MonetDB kernel is a programmable
relational algebra machine
• Relational operators operate on ‘array’-like
structures
M.Kersten 2008
27
Try to use simple a software pattern
Operator implementation
• All algebraic operators materialize their result
•
•
•
•
GOOD: small code footprints
GOOD: potential for re-use
BAD : extra storage for intermediates
BAD: cpu cost for retaining it
• Local optimization decisions
•
•
•
•
Sortedness, uniqueness, hash index
Sampling to determine sizes
Parallelism options
Properties that affect the algorithms
M.Kersten 2008
28
Try to use simple a software pattern
Operator implementation
• All algebraic operators materialize their result
• Local optimization decisions
• Heavy use of code expansion to reduce cost
•
•
•
•
•
55 selection routines
149 unary operations
335 join/group operations
134 multi-join operations
72 aggregate operations
M.Kersten 2008
29
Try to avoid the search space trap
Database
Structures
Execution
Paradigm
Query
optimizer
DBMS
Architecture
M.Kersten 2008
30
Query optimization
• Alternative ways of evaluating a given query
• Equivalent expressions
• Different algorithms for each operation (Chapter 13)
• Cost difference between a good and a bad way of
evaluating a query can be enormous
• Example: performing a r X s followed by a selection
r.A = s.B is much slower than performing a join on
the same condition
• Need to estimate the cost of operations
• Depends critically on statistical information about
relations which the database must maintain
• Need to estimate statistics for intermediate results to
compute cost of complex expressions
Introduction (Cont.)
Relations generated by two equivalent expressions
have the same set of attributes and contain the same
set of tuples, although their attributes may be ordered
differently.
Introduction (Cont.)
• Generation of query-evaluation plans for an
expression involves several steps:
1. Generating logically equivalent expressions
• Use equivalence rules to transform an expression
into an equivalent one.
2. Annotating resultant expressions to get alternative
query plans
3. Choosing the cheapest plan based on estimated
cost
• The overall process is called cost based
optimization.
Equivalence Rules
1. Conjunctive selection operations can be
deconstructed into a sequence of individual
selections.   ( E )    (  ( E ))
2. 2. Selection operations are commutative.
1
2
1
2
  (  ( E ))    (  ( E ))
1
2
2
1
3. Only the last in a sequence of projection
operations is needed, the others can be
omitted. t1 (t2 ((tn (E ))))  t1 (E )
4. Selections can be combined with Cartesian
products and theta joins.
a.
b.
(E1 X E2) = E1
 E2
1(E1
2 E2) = E1
1 2 E2
Equivalence Rules (Cont.)
5. Theta-join operations (and natural joins) are
commutative.
E1  E2 = E2  E1
6. (a) Natural join operations are associative:
(E1 E2) E3 = E1 (E2
E3)
(b) Theta joins are associative in the following
manner:
(E1
1
E2)
2  3
E3 = E1
2 3
(E2
2
E3)
where 2 involves attributes from only E2
and E3.
Pictorial Depiction of Equivalence Rules
Equivalence Rules (Cont.)
7. The selection operation distributes over the
theta join operation under the following two
conditions:
(a) When all the attributes in 0 involve only
the attributes of one of the expressions (E1)
being joined.
0E1

E2) = (0(E1))

E2
(b) When  1 involves only the attributes of E1
and 2 involves only the attributes of E2.
1 E1  E2) = (1(E1))  ( (E2))
Equivalence Rules (Cont.)
8. The projections operation distributes over the
theta join operation as follows:
(a) if it involves only attributes from L1  L2:
 L1  L2 ( E1.......  E2 )  ( L1 ( E1 ))......  ( L2 ( E2 ))
(b) Consider a join E1

E2.
• Let L1 and L2 be sets of attributes from E1 and E2,
respectively.
• Let L3 be attributes of E1 that are involved in join
condition , but are not in L1  L2, and
• let L4 be attributes of E2 that are involved in join
condition , but are not in L1  L2.
 L1  L2 ( E1..... E2 )   L1  L2 (( L1  L3 ( E1 ))......  ( L2  L4 ( E2 )))
Equivalence Rules (Cont.)
9. The set operations union and intersection are
commutative
E 1  E2 = E2  E 1
E 1  E2 = E2  E 1
 (set difference is not commutative).
10.Set union and intersection are associative.
(E1  E2)  E3 = E1  (E2  E3)
(E1  E2)  E3 = E1  (E2  E3)
11.The selection operation distributes over ,  and –.
 (E1 – E2) =  (E1) – (E2)
and similarly for  and  in place of –
Also:
 (E1 – E2) = (E1) – E2
and similarly for  in place of –, but
not for 
12.
The projection operation distributes over
union
L(E1  E2) = (L(E1))  (L(E2))
Multiple Transformations (Cont.)
Optimizer strategies
• Heuristic
• Apply the transformation rules in a specific order
such that the cost converges to a minimum
• Cost based
• Simulated annealing
• Randomized generation of candidate QEP
• Problem, how to guarantee randomness
Memoization Techniques
• How to generate alternative Query Evaluation Plans?
• Early generation systems centred around a tree
representation of the plan
• Hardwired tree rewriting rules are deployed to
enumerate part of the space of possible QEP
• For each alternative the total cost is determined
• The best (alternatives) are retained for execution
• Problems: very large space to explore, duplicate
plans, local maxima, expensive query cost
evaluation.
• SQL Server optimizer contains about 300 rules to be
deployed.
Memoization Techniques
• How to generate alternative Query Evaluation Plans?
• Keep a memo of partial QEPs and their cost.
• Use the heuristic rules to generate alternatives to
built more complex QEPs
• r1
r2
r3
r4
r4
r3
r2 r1
r1 r2
Level n plans
Level 2 plans
r3
r2 r3
r3 r4
x
r1 r4
Level 1 plans
Ditching the optimizers
• Applications have different characteristics
• Platforms have different characteristics
• The actual state of computation is crucial
• A generic all-encompassing optimizer costmodel does
M.Kersten 2008
not work
44
Try to disambiguate decisions
Code Inliner.
Constant Expression Evaluator.
Accumulator Evaluations.
Strength Reduction.
Common Term Optimizer.
Join Path Optimizer.
Ranges Propagation.
Operator Cost Reduction.
Foreign Key handling.
Aggregate Groups.
M.Kersten 2008
Code Parallizer.
Replication Manager.
Result Recycler.
MAL Compiler.
Dynamic Query Scheduler.
Memo-based Execution.
Vector Execution.
Alias Removal.
Dead Code Removal.
Garbage Collector.
45
No data from persistent store to the memory trash
Database
Structures
Execution
Paradigm
Query
optimizer
DBMS
Architecture
M.Kersten 2008
46
No data from persistent store to the memory trash
Execution paradigms
• The MonetDB kernel is set up to accommodate
different execution engines
• The MonetDB assembler program is
•
•
•
•
Interpreted in the order presented
Interpreted in a dataflow driven manner
Compiled into a C program
Vectorised processing
• X100 project
M.Kersten 2008
47
MonetDB/x100
X100 query engine
Combine Volcano model with
vector processing.
All vectors together should fit
the CPU cache
CPU
cache
RAM
Vectors are compressed
ColumnBM
(buffer manager)
Optimizer should tune this,
given the query characteristics.
networked
ColumnBM-s
M.Kersten 2008
48
No data from persistent store to the memory trash
• Varying the vector size on TPC-H query 1
mysql,
oracle, low IPC,
overhead
db2
MonetDB
X100
M.Kersten 2008
RAM
bandwidth
bound
49
Query evaluation strategy
• Pipe-line query evaluation strategy
• Called Volcano query processing model
• Standard in commercial systems and MySQL
• Basic algorithm:
• Demand-driven evaluation of query tree.
• Operators exchange data in units such as records
• Each operator supports the following interfaces:–
open, next, close
• open() at top of tree results in cascade of opens
down the tree.
• An operator getting a next() call may recursively
make next() calls from within to produce its next
answer.
• close() at top of tree results in cascade of close
down the tree
Query evaluation strategy
• Pipe-line query evaluation strategy
• Evaluation:
• Oriented towards OLTP applications
• Granule size of data interchange
• Items produced one at a time
• No temporary files
• Choice of intermediate buffer size allocations
• Query executed as one process
• Generic interface, sufficient to add the iterator
primitives for the new containers.
• CPU intensive
• Amenable to parallelization
Query evaluation strategy
• Materialized evaluation strategy
• Used in MonetDB
• Basic algorithm:
• for each relational operator produce the complete
intermediate result using materialized operands
• Evaluation:
• Oriented towards decision support queries
• Limited internal administration and dependencies
• Basis for multi-query optimization strategy
• Memory intensive
• Amendable for distributed/parallel processing
Outline
Database
Structures
Execution
Paradigm
Indexing
M.Kersten 2008
53
Find a trusted fortune teller
Materialized
Views
Indexing
M.Kersten 2008
Cracking
Traditional
B-tree, Hash
54
Find a trusted fortune teller
• Indices in database systems focus on:
• All tuples are equally important for fast retrieval
• There are ample resources to maintain indices
• MonetDB does not rely on B-trees
• MonetDB cracks the database into pieces
M.Kersten 2008
55
Find a trusted fortune teller
Database Cracking
• Cracking in a database
kernel?
D1
• Every database scan
is a physical
reorganization
• This is crazy…
• Reorganization is utterly
expensive…
• Better to have many
(update) users pay less
then one (query) user a lot
• It does not fit the Volcanostyle query processor..
• It just doesn’t work that
way…….
M.Kersten 2008
56
Find a trusted fortune teller
• Cracking in a database
kernel?
• Every database scan is a
physical reorganization
D1
select * from D1 where ccd=‘j5’
M.Kersten 2008
57
Find a trusted fortune teller
• Cracking in a database
kernel?
D1
• Every database scan is a
physical reorganization
D1
create view V as
select * from D1 where ccd=‘j5’
select * from D1
where not ccd=‘j5’
M.Kersten 2008
58
Find a trusted fortune teller
• Cracking in a database
kernel?
D1
• Every database scan is a
physical reorganization
D1
select * from D1 where ccd=‘j5’
select * from D1
where not ( ccd=‘j5’or ccd= ‘j7’)
select * from D1 where ccd=‘j7’
M.Kersten 2008
60
Try to avoid useless investments
• Cracking in a database
kernel?
D1
• Every database scan is a
physical reorganization
• Don’t pay the complete
sort cost during update
• Incremental predicate
index maintenance
• The first user pays !
M.Kersten 2008
61
Try to avoid useless investments
A simple range query
M.Kersten 2008
Try to avoid useless investments
TPC-H query 6
M.Kersten 2008
Try to avoid useless investments
• Cracking is easy in a column store and is part
of the critical execution path
• Cracking works under high volume updates
M.Kersten 2008
64
Whoa MonetDB !
Speed lines !
PostgreSQL
SQLserver
Cstore
MySQL
DB2
M.Kersten 2008
65