Column-Oriented DB, IMDB
Advanced Database Systems:
DBS CB, 2nd Edition
Advanced Topics of Interest:
In-Memory DB (IMDB) and Column-Oriented DB
Outline
In-Memory Database (IMDB)
Column-Oriented Database (C-Store)
In-Memory Database (IMDB)
Introduction
In-Memory database system (IMDB)
Data resides permanently in main physical memory
Backup copy (optionally) on disk
Disk-Resident database system (DRDB)
Data resides on disk
Data may be cached into memory for access.
The main difference is that in an IMDB, the primary copy lives permanently in memory
Questions about IMDB
Is it reasonable to assume that the entire
database fits in memory?
Yes, for some applications!
What is the difference between an IMDB and a DRDB with a very large cache?
In DRDB, even if all data fits in memory, the data
structures and algorithms are designed for disk
access.
Differences in properties of main memory and disk
The access time for main memory is orders of
magnitude less than for disk storage
Main memory is normally volatile, while disk
storage is not
The layout of data on disk is much more critical
than the layout of data in main memory
Impact of memory resident data
The differences in properties of main-memory
and disk have important implications in:
Concurrency control
Commit processing
Access methods
Data representation
Query processing
Recovery
Performance
Concurrency control
Access to main memory is much faster than disk
access, so we can expect that transactions
complete more quickly in an IMDB system
Lock contention may not be as important as it is
when the data is disk resident
Commit Processing
As protection against media failure, it is
necessary to have a backup copy and to keep a
log of transaction activity
The need for a stable log threatens to
undermine the performance advantages that can
be achieved with memory resident data
Access Methods
The cost factors to be minimized by the access structures (indexes) are different, e.g., B-tree vs. T-tree…
Data representation
Main memory databases can take advantage of
efficient pointer following for data representation
A Study of Index Structures for Main Memory Database Management Systems
Tobin J. Lehman
Michael J. Carey
VLDB 1986
Disk versus In-Memory
Primary goals for a disk-oriented index structure
design:
Minimize the number of disk accesses
Minimize disk space
Primary goals of an in-memory index design:
Reduce overall computation time
Use as little memory as possible
Classic index structures
Arrays:
Advantage: uses minimal space, provided that the size is known in advance
Disadvantage: impractical for anything but a read-only environment
AVL Trees:
Balanced binary search tree
The tree is kept balanced by executing rotation operations when
needed
Advantage: fast search
Disadvantage: poor storage utilization
Classic index structures (cont)
B-trees:
Every node contains some ordered data items and pointers
Good storage utilization
Searching is reasonably fast
Updating is also fast
Hash-based indexing
Chained Bucket Hashing:
Static structure, used both in memory and on disk
Advantage: fast, if the proper table size is known
Disadvantage: poor behavior in a dynamic environment
Extendible (Dynamic) Hashing:
Dynamic hash table that grows with the data
A hash node contains several data items and splits in two when an overflow occurs
The directory grows in powers of two when a node overflows and has reached the max depth for a particular directory size
Hash-based indexing (cont)
Linear Hashing:
Uses a dynamic hash table
Nodes are split in predefined linear order
Buckets can be ordered sequentially, allowing the bucket address to be calculated from a base address (see the sketch after this list)
The event that triggers a node split can be based on storage utilization
Modified Linear Hashing:
More oriented towards main memory
Uses a directory that grows linearly
Chained single-item nodes
The splitting criterion is based on the average length of the hash chains
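To make the base-address idea concrete, here is a minimal sketch of the classic linear hashing address calculation; the function and parameter names are my own assumptions for illustration, not code from any of the papers above.

```python
# Minimal linear hashing address calculation (illustrative sketch).
# n_buckets is the initial bucket count, `level` counts completed
# doubling rounds, and `split` is the next bucket to split in linear order.
def bucket_address(key_hash: int, n_buckets: int, level: int, split: int) -> int:
    addr = key_hash % (n_buckets * 2 ** level)
    if addr < split:
        # Buckets before the split pointer were already split this round,
        # so they are addressed with the finer-grained hash function.
        addr = key_hash % (n_buckets * 2 ** (level + 1))
    return addr

# With 4 initial buckets, midway through the first round (bucket 0 split):
assert bucket_address(12, n_buckets=4, level=0, split=1) == 4  # 12 % 4 = 0 < 1, so 12 % 8
assert bucket_address(13, n_buckets=4, level=0, split=1) == 1  # 13 % 4 = 1, not yet split
```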
The T-tree
A binary tree with many elements kept in order in each node (evolved from the AVL tree and the B-tree)
Intrinsic binary search nature
Good update and storage characteristics
Every T-tree has an associated minimum and maximum node count
Internal nodes (nodes with two children) keep their occupancy in the range given by the minimum and maximum count (see the node sketch below)
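For concreteness, here is a minimal Python sketch of such a node; the class name, field names, and the tiny occupancy bounds are my own assumptions, not code from the paper. The later algorithm sketches reuse this TNode class.

```python
# Minimal T-tree node sketch (illustrative; names are assumptions).
class TNode:
    MIN_COUNT = 2   # minimum occupancy for internal nodes
    MAX_COUNT = 4   # tiny for illustration; real T-tree nodes hold many more

    def __init__(self, keys=None):
        self.keys = sorted(keys or [])  # the ordered data items in this node
        self.left = None                # subtree with keys below keys[0]
        self.right = None               # subtree with keys above keys[-1]

    def bounds(self, x):
        # A node "bounds" x if x lies between its smallest and largest key.
        return self.keys[0] <= x <= self.keys[-1]
```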
The T-tree
(Figure: a T-tree, with each node holding an ordered run of elements)
Search algorithm for T-tree
Similar to searching in a binary tree
Algorithm:
Start at the root of the tree
Loop:
If the search value is less than the minimum value of the node
Then search down the left sub-tree
If the search value is greater than the maximum value in the
node
Then search the right sub-tree
Else search the current node
The search fails when a node is searched and the item is not found, or when a node that bounds the search value cannot be found (see the sketch below)
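A minimal sketch of this search loop, reusing the TNode class from the earlier sketch:

```python
# T-tree search following the loop above (illustrative sketch).
def search(node, x):
    while node is not None:
        if x < node.keys[0]:
            node = node.left           # below this node's minimum
        elif x > node.keys[-1]:
            node = node.right          # above this node's maximum
        else:
            # This node bounds x: search within the node itself.
            return x if x in node.keys else None
    return None                        # no bounding node: the search fails
```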
Insert algorithm
Insert (x):
Search to locate the bounding node
If a bounding node is found:
Let a be this node
If the value fits, then insert it into a and STOP
Else remove the minimum element a_min from the node, insert x, go to the leaf containing the greatest lower bound of a, and insert a_min into this leaf
21
Insert algorithm (cont)
If a bounding node is not found:
Let a be the last node on the search path
If the insert value fits, then insert it into the node
Else create a new leaf with x in it
If a new leaf was added:
For each node in the search path (from leaf to root): if the two sub-trees' heights differ by more than one, then rotate and STOP (see the condensed sketch below)
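A condensed sketch of these insert steps, reusing the TNode class from above. It assumes a non-empty tree, and the final rebalancing pass is summarized in a comment rather than implemented, so this is an illustration of the control flow, not a complete implementation.

```python
import bisect

# Condensed T-tree insert sketch (illustrative; rebalancing omitted).
def insert(root, x):
    node, last = root, None
    while node is not None and not node.bounds(x):   # find the bounding node
        last = node
        node = node.left if x < node.keys[0] else node.right
    target = node if node is not None else last      # bounding node, or last on path
    if len(target.keys) < TNode.MAX_COUNT:           # the value fits: insert and STOP
        bisect.insort(target.keys, x)
        return
    if node is not None:
        # Full bounding node: evict its minimum, insert x, then push the
        # evicted minimum toward the greatest-lower-bound leaf.
        a_min = target.keys.pop(0)
        bisect.insort(target.keys, x)
        x = a_min
        leaf = target.left
        while leaf is not None and leaf.right is not None:
            leaf = leaf.right                        # rightmost leaf of left subtree
        if leaf is not None and len(leaf.keys) < TNode.MAX_COUNT:
            bisect.insort(leaf.keys, x)
            return
        if leaf is not None:
            target = leaf
    # No room on the path: hang a new leaf with x off `target`, then (omitted)
    # walk from the new leaf back to the root and rotate wherever the two
    # sub-trees' heights differ by more than one.
    new_leaf = TNode([x])
    if x < target.keys[0]:
        target.left = new_leaf
    else:
        target.right = new_leaf
```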
Delete algorithm
(1) Search for the node that bounds the delete value; search for the delete value within this node, reporting an error and stopping if it is not found
(2) If the delete will not cause an underflow, then delete the value and STOP
Else, if this is an internal node, then delete the value and ‘borrow’ the greatest lower bound from a leaf
Else delete the element
(3) If the node is a half-leaf and can be merged with a
leaf, do it, and go to (5)
Delete algorithm (cont)
(4) If the current node (a leaf) is not empty, then STOP
Else free the node and go to (5)
(5) For every node along the path from the leaf up to the
root, if the two sub-trees of the node differ in height by
more than one, then perform a rotation operation
STOP when all nodes have been examined or a node with even balance has been encountered (see the condensed sketch below)
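A condensed sketch of steps (1) and (2), reusing the TNode class from above; the merge/free steps (3)-(4) and the rotation pass (5) are summarized in a closing comment rather than implemented.

```python
# Condensed T-tree delete sketch (illustrative; steps (3)-(5) summarized).
def delete(node, x):
    while node is not None and not node.bounds(x):   # step (1): find bounding node
        node = node.left if x < node.keys[0] else node.right
    if node is None or x not in node.keys:
        raise KeyError(x)                            # delete value not found
    is_internal = node.left is not None and node.right is not None
    node.keys.remove(x)
    if is_internal and len(node.keys) < TNode.MIN_COUNT:
        # Step (2): internal-node underflow, so 'borrow' the greatest
        # lower bound from the rightmost leaf of the left subtree.
        glb_node = node.left
        while glb_node.right is not None:
            glb_node = glb_node.right
        node.keys.insert(0, glb_node.keys.pop())
    # Steps (3)-(5), omitted here: merge a half-leaf with its leaf where
    # possible, free the node if it became empty, then rotate on the way back
    # to the root wherever the two sub-trees' heights differ by more than one.
```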
LL Rotation
(Figure: a single left rotation)
LR Rotation
(Figure: a right rotation followed by a left rotation, rearranging nodes A, B, C and the sub-trees Bl, Cl, Cr, Ar)
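The rotations themselves are the standard AVL-style ones. Here is a sketch on the TNode fields from above; note that it returns the new subtree root without re-linking it into the parent, which a full implementation must do.

```python
# AVL-style rotations used for rebalancing (illustrative sketch).
def rotate_left(a):
    b = a.right                   # b moves up, a becomes b's left child
    a.right, b.left = b.left, a
    return b                      # new root of this subtree

def rotate_right(a):
    b = a.left                    # mirror image of rotate_left
    a.left, b.right = b.right, a
    return b

def rotate_lr(a):
    # LR rotation: rotate the left child left, then rotate `a` right.
    a.left = rotate_left(a.left)
    return rotate_right(a)
```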
Summary
We introduced a new in-memory index structure, the T-tree
For unordered data, Modified Linear Hashing
should give excellent performance for exact
match queries
For ordered data, the T-tree provides excellent
overall performance for a mix of searches,
inserts and deletes, and it does so at a relatively
low cost in storage space
But…
Even though T-tree nodes hold many keys, only the two end keys are actually used for comparisons
Since we store a record pointer for every key in a node, and most of the time these record pointers are not used, the space is ‘wasted’
Column-Oriented Database
C-Store: A Column-Oriented RDBMS, Michael Stonebraker
Traditional Row-Oriented Database
Store fields in one record contiguously on disk
Use B-tree indexing
Use small (e.g. 4K) disk blocks
Align fields on byte or word boundaries
Conventional (row-oriented) query optimizer and
executor (technology from 1979)
ARIES-style transactions
Terminology -- “Row Store”
(Figure: records 1-4 stored contiguously, one after another)
E.g. DB2, Oracle, Sybase, SQLServer, …
Row-Stores are Write Optimized
Can insert, update, and delete (IUD) a record in one physical write
Good for On-Line Transaction Processing (OLTP)
But not for read-mostly applications
Data warehouses
CRM
Elephants Have Extended Row Stores
With Bitmap indices
Better sequential read
Integration of “data cube” products
Materialized views
But there may be a better idea…….
Column Stores
At 100K Feet….
Ad-hoc queries read 2 columns out of 20
In a very large warehouse, the fact table is rarely clustered correctly
Column store reads 5-10% of what a row store reads
C-Store (Column Store) Project
Brandeis/Brown/MIT/UMass-Boston project
Usual suspects participating
Enough coded to get performance numbers for some
queries
Complete status later
Pioneering Work
Sybase IQ (early ’90s)
MonetDB (see CIDR ’05 for the most recent description)
C-Store Technical Ideas
Code the columns to save space
No alignment
Big disk blocks
Only materialized views (perhaps many)
Focus on sorting, not indexing
Automatic physical DBMS design
Optimize for grid computing
Innovative redundancy
Xacts – but no need for Mohan (no ARIES-style persistent log)
Data ordered on anything, not just time
Compression
Column optimizer and executor
No Alignment
Dense pack columns
E.g. a 5-bit field takes 5 bits
CPU speeds are increasing faster than disk bandwidth
It is faster to shift data in the CPU than to waste disk bandwidth (see the packing sketch below)
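As an illustration of dense packing, here is a toy sketch that stores 5-bit values back to back with no byte alignment; this encoding is my own, not C-Store's actual storage format.

```python
# Dense-packing sketch: width-bit values stored back to back (illustrative).
def pack(values, width=5):
    buf, bits = 0, 0
    for v in values:
        buf = (buf << width) | (v & ((1 << width) - 1))
        bits += width
    pad = (-bits) % 8                       # pad to whole bytes on the right
    return (buf << pad).to_bytes((bits + pad) // 8, "big"), bits

def unpack(data, nbits, width=5):
    buf = int.from_bytes(data, "big") >> (8 * len(data) - nbits)
    return [(buf >> shift) & ((1 << width) - 1)
            for shift in range(nbits - width, -1, -width)]

packed, nbits = pack([3, 17, 31, 0, 9])
assert unpack(packed, nbits) == [3, 17, 31, 0, 9]
```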
Big Disk Blocks
Tunable
Big (minimum size is 64K)
Only Materialized Views
A projection (materialized view) is some number of columns from a fact table
Plus columns in a dimension table, with a 1-n join between the fact and dimension tables
Stored in order of one or more storage keys
Several may be stored!
With a permutation, if necessary, to map between them
Table (as the user specified it and sees it) is not stored!
No secondary indexes (they are a one column sorted MV
plus a permutation, if you really want one)
Example:
User view:
EMP (name, age, salary, dept)
Dept (dname, floor)
Possible set of MVs:
MV-1 (name, dept, floor) in floor order
MV-2 (salary, age) in age order
MV-3 (dname, salary, name) in salary order
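A toy sketch of how these projections could be laid out as parallel column arrays, each sorted on its own sort key; the data and the layout here are my own illustration, not C-Store's internal representation (MV-3 would be built the same way).

```python
# Each projection is a set of parallel column arrays in sort-key order.
emp = [("alice", 25, 9000, "toys"), ("bob", 30, 8500, "shoes")]
dept = {"toys": 1, "shoes": 2}   # dname -> floor

# MV-1 (name, dept, floor) in floor order: join EMP with Dept, then sort.
rows = sorted(((n, d, dept[d]) for n, _, _, d in emp), key=lambda r: r[2])
mv1_name, mv1_dept, mv1_floor = map(list, zip(*rows))

# MV-2 (salary, age) in age order.
rows = sorted(((s, a) for _, a, s, _ in emp), key=lambda r: r[1])
mv2_salary, mv2_age = map(list, zip(*rows))
```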
Automatic Physical DBMS Design
Not enough 4-star wizards to go around
Accept a “training set” of queries and a space budget
Choose the MVs auto-magically
Re-optimize periodically based on a log of the interactions
Optimize for Grid Computing
I.e. shared-nothing
DeWitt (Gamma) was right
Horizontal partitioning and intra-query parallelism
as in Gamma
Innovative Redundancy
Hardly any warehouse is recovered by a redo from the log
Takes too long!
Store enough MVs at enough places to ensure K-safety
Rebuild dead objects from elsewhere in the
network
K-safety is a DBMS-design problem!
XACTS – No C. Mohan
Undo from a log (that does not need to be persistent)
Redo by rebuild from elsewhere in the network
Snapshot isolation (run queries as of a tunable time in the
recent past)
To solve read-write conflicts
Distributed Xacts
Without a prepare message (no two-phase commit)
Storage (Sort) Key(s) Is Not Necessarily Time
That would be too limiting
So how can we do fast, efficient updates to dense-packed column storage that is not in entry sequence?
Solution – a Hybrid Store (H-Store)
http://www.vldb2005.org/program/paper/thu/p553-stonebraker.pdf
(Much like MonetDB)
A write-optimized I/U store takes inserts and updates
A tuple mover (a batch rebuilder) periodically migrates tuples
Into a read-optimized column store (what we have been talking about so far)
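A toy sketch of this hybrid flow: inserts land in a small write-optimized buffer, and a tuple mover periodically merges them into the sorted read-optimized store; the class and method names here are my own, not C-Store's.

```python
# Hybrid write-store / read-store sketch (illustrative).
class HybridStore:
    def __init__(self):
        self.write_store = []    # unsorted recent inserts (I/U store)
        self.read_store = []     # sorted; dense-packed columns in a real system

    def insert(self, row):
        self.write_store.append(row)   # one cheap append per insert

    def move_tuples(self):
        # Batch rebuild: merge the write store into sort-key order.
        # (A real tuple mover rewrites dense-packed columns in bulk.)
        self.read_store = sorted(self.read_store + self.write_store)
        self.write_store = []

    def scan(self):
        # Queries must consult both stores until the mover has run.
        return sorted(self.read_store + self.write_store)
```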
Column Executor
Column operations – not row operations
Columns remain coded – if possible
Late materialization of columns (see the sketch below)
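A minimal sketch of late materialization: the executor operates on one column to compute matching positions, and only then fetches values from the other columns; the data here is my own illustration.

```python
# Late materialization sketch (illustrative).
price = [9, 12, 7, 15, 11]           # column 1 (possibly still coded)
product = ["a", "b", "c", "d", "e"]  # column 2

positions = [i for i, p in enumerate(price) if p > 10]  # column op only
result = [(product[i], price[i]) for i in positions]    # materialize late
# result == [("b", 12), ("d", 15), ("e", 11)]
```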
Column Optimizer
Chooses the MVs on which to run the query
Its most important task
Built in: snowflake schemas
Which are simple to optimize without exhaustive search
Looking at extensions
Performance
100X faster than a popular row store, in 40% of the space
7X faster than a popular row store, in 1/6th of the space
Code available with BSD license
University Research
Extension of algorithms to non-snowflake schemas
Study of L2 cache performance
Study of coding strategies
Study of executor options
Study of recovery tactics
Non-cursor interface
Study of optimizer primitives
END