Transcript Document

Time for Change:
Why Not Transact In Memory?
Sang K. Cha
Email: [email protected]
on leave from Seoul National University, Korea
Transact In Memory Project
My interest in main memory databases started with:
WS-IRIS at HP Lab (1991-1992.1):
• In-memory personal object manager project of Tore Risch (before
SmallBase project which became TimesTen)
Related previous projects at Seoul National Univ.:
C++ design & implementation of MMDBMS prototypes
• Initial design of MMDBMS modules (1993-1994)
• Mr.RT for ETRI (1995-1996)
– 2+ local startups
• Xmas: Extension of Mr.RT (1997-1999)
Object-oriented middleware for spatial object database (1994-1999)
Transact in Memory project (2000 - present)
2
Transact In Memory Project
Spirit:
 Do what we think is important rather than what others
(government bureaucrats or committees) think is important.
No more Build and Perish:
 Deal with the end users directly for quick feedback on
our work
Funding Scheme:
 Experimental
 Presently, self-managed through Transact In Memory,
Inc. for research and development freedom
 Venture capital financing in the future?
3
Transact In Memory Project
Our Users

Performance-demanding users with simple applications
(with 3n-letter company product bases)
 Telco companies
 Stock brokerage companies
 ….
Our People:
 20+ and expanding
 Graduate students
 Consultant (patent attorney for IP protection)
 Full-time field service engineering & marketing staff
4
Outline
Introduction to Main Memory DBMS
MMDBMS vs DRDBMS
Memory-primary vs disk-primary architecture
More on Transact In Memory Project
Some hypotheses
Parallel Architecture with Parallel Logging and Recovery
Some Memory-Primary Design
P*TIME: a Highly Parallel Transact In Memory Engine
Summary
5
Main Memory DBMS
Database resident in memory
Read transactions simply read the in-memory data.
Update transactions do in-memory updates and write update log
to the log disk.
Occasionally, checkpoint the dirty pages of the in-memory
database to the disk-resident backup DB to shorten the recovery
time.
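A minimal C++ sketch of this execution model, assuming a hypothetical single-threaded engine with an illustrative key-value table, an append-only log file, and a naive full-copy checkpoint (names and formats are illustrative, not P*TIME's):

// Illustrative sketch of the MMDBMS execution model described above: reads
// touch only memory, updates also append a log record to the log disk, and an
// occasional checkpoint copies the in-memory state to a disk-resident backup.
#include <cstdint>
#include <fstream>
#include <string>
#include <unordered_map>

class TinyMMDB {
public:
    TinyMMDB(const std::string& log_path, const std::string& backup_path)
        : log_(log_path, std::ios::app), backup_path_(backup_path) {}

    // Read transaction: simply read the in-memory data.
    const std::string* read(uint64_t key) const {
        auto it = table_.find(key);
        return it == table_.end() ? nullptr : &it->second;
    }

    // Update transaction: in-memory update plus an update log record.
    void update(uint64_t key, const std::string& value) {
        table_[key] = value;                  // in-memory update
        log_ << key << ' ' << value << '\n';  // redo log record
        log_.flush();                         // force to the log disk at commit
    }

    // Occasional checkpoint: write the (here: entire) in-memory database to
    // the disk-resident backup DB to shorten recovery time.
    void checkpoint() {
        std::ofstream backup(backup_path_, std::ios::trunc);
        for (const auto& [key, value] : table_) backup << key << ' ' << value << '\n';
    }

private:
    std::unordered_map<uint64_t, std::string> table_;  // primary DB (in memory)
    std::ofstream log_;                                 // log on the log disk
    std::string backup_path_;                           // backup DB on disk
};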
[Diagram: the MMDBMS keeps the primary DB in memory, writes log records to the log disk (logging), and checkpoints dirty pages to the disk-resident backup DB (checkpointing).]
6
MMDBMS Technical Issues
Logging and Recovery: critical for not losing updates
In-Memory Index Structures
Efficient L2 Cache Utilization?
Concurrency Control of Index and Data
DBMS Process & Application Binding Architecture
Query Processing
Memory Management
Availability for Mission-critical Applications
7
Reminder on Hardware Advances
1000 times over the last 15 years [Patterson]
CPU power, Memory capacity, Disk capacity
• Not really for memory and disk speed
• Network interface speed: 100Mbps – Gbps (maybe 100 times)
VAX 11/750 in 1984 (Stanford KBMS personal workstation)
• CPU (32-bit, cycle time: 320 ns, CPI: 10)
• 8MB memory (800 ns for write, 640 ns for read)
• Hundreds of MB disk space
• 10Mbps Ethernet connection
How much speedup in the DBMS performance
over the same period? --- Not 1000 times!
8
In the old days (1984), a few MB was said…
With the availability of very large, relatively inexpensive
main memories, it is becoming possible to keep large
databases resident in memory. …. At the present time,
memory for VAX 11/780 costs approximately $1,500 a megabyte. In
1990, a gigabyte of memory should cost less than $200,000 (with the
availability of 1Mb chips). If 4 Mbit chips are available, the price
might be as low as $50,000.
- DeWitt, Katz, Olken, Shapiro, Stonebraker, and Wood,
“Implementation Techniques for Main Memory Database Systems,”
SIGMOD 1984
9
and today a few GB….
We wrote our paper with essentially the same wording:
As the price of memory continues to drop below $1,000/GB, it is
now feasible to place many of the database tables and indexes
in main memory.
- Kim, Cha, Kwon, SIGMOD 2001
Real numbers:
                 Max CPU#   Max Memory
Sun Blade 1K         2          8GB
Sun Fire 3800        8         64GB
Sun Fire 15K       108        0.5TB
Compaq ML 570        4         16GB
Memory prices (www.kingston.com): $1170/2GB, $7021/4GB; $914/2GB, $2143/4GB
10
MMDBMS vs DRDBMS
Well-Designed MMDBMS should outperform DRDBMS
significantly
not only for read-oriented applications
but also for update-intensive applications!
with
memory-optimized data structures and algorithms
disk access reduced to the sequential form of log writing and
occasional checkpointing
11
Q: Is Disk Database with large buffer
the same as Main Memory Database?
No!
• Complex mapping between disk and memory
– E.g., traversing index blocks in the buffer requires bookkeeping of the mapping
between disk and memory addresses (see the sketch below)
» Swizzle or not
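A toy C++ sketch (illustrative types, not from the talk) contrasting the two kinds of record reference: the disk-primary reference must be resolved, or swizzled, through the buffer pool on every access, while the memory-primary reference is a plain pointer:

// Disk-primary: a record reference is a disk address (page ID + slot) that has
// to be mapped through the buffer pool; memory-primary: it is a direct pointer.
#include <cstdint>
#include <unordered_map>

struct Record { int64_t key; /* ... payload ... */ };
struct Page   { Record slots[64]; };
struct DiskAddress { uint64_t page_id; uint16_t slot; };

// Disk-primary access: find the buffer frame holding the page, then the slot.
Record* fetch(std::unordered_map<uint64_t, Page*>& buffer_pool, DiskAddress addr) {
    auto it = buffer_pool.find(addr.page_id);
    if (it == buffer_pool.end()) return nullptr;  // page miss: would trigger disk I/O
    return &it->second->slots[addr.slot];         // extra indirection and bookkeeping
}

// Memory-primary access: the index stores Record* directly.
inline Record* fetch(Record* direct_pointer) { return direct_pointer; }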
[Diagram: index blocks and data blocks of the disk-resident database are cached in a large buffer; a record reference carries a disk address that must be mapped to a buffer address; a log sits beside the database on disk.]
Disk index block design is not optimized to avoid hardware
cache misses.
12
Other Reasons to Say No!
Update performance advantage of main memory database
Disk-resident DBMS is limited in taking advantage of
hardware advances
Huge code base
• many levels of abstraction for portability
• backward compatibility with the old customers (long history of evolution)
 Slow to change
 Too many control knobs to optimize
Disk-primary design
13
Two Different Paradigms
Disk Primary:
Design the disk-primary structures and algorithms.
Cache the disk blocks of the disk-resident database in the memory buffer.
(Uniform mapping)
Memory Primary:
Design the memory-primary structures and algorithms.
(Cache-conscious design)
If necessary, map the in-memory structures to the disk.
14
Memory Primary versus Disk Primary
[Diagram: the memory hierarchy — CPU, cache memory, main memory (DRAM), and disk. The memory-primary focus is on the CPU / cache / main memory levels; the disk-primary focus is on the lower main memory / disk levels.]
15
Why Memory-Primary Design?
Memory access is no longer uniform!
Cost of an L2 cache miss: a hundred clock cycles, or a
thousand-instruction opportunity
(from David Patterson et al., IEEE Micro, 1997)
16
Memory-Primary Design
Cache-conscious: the well-known disk access rule comes
back in a slightly different flavor
Random access is expensive:
• Example: Pointer chasing
Block access is more efficient:
• Align data structures with L2 cache lines (e.g., 64 bytes).
• Pack the right data in the cache block.
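A small, self-contained C++ experiment (not from the talk) that illustrates the two rules above: summing the same values by chasing randomly ordered pointers versus scanning a contiguous array that uses every byte of each fetched cache line.

// Pointer chasing (random access) vs. contiguous scan (block access).
// On typical hardware the scan is several times faster: each 64-byte cache
// line it loads is fully used, while each pointer hop risks a cache miss.
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

struct Node { Node* next; int64_t value; };

int main() {
    constexpr size_t kCount = 1 << 22;  // ~4M nodes
    std::vector<Node> nodes(kCount);
    std::vector<size_t> order(kCount);
    std::iota(order.begin(), order.end(), 0);
    std::shuffle(order.begin(), order.end(), std::mt19937_64{42});

    // Link the nodes in shuffled (cache-hostile) order.
    for (size_t i = 0; i + 1 < kCount; ++i) nodes[order[i]].next = &nodes[order[i + 1]];
    nodes[order[kCount - 1]].next = nullptr;
    for (size_t i = 0; i < kCount; ++i) nodes[i].value = static_cast<int64_t>(i);

    auto time_us = [](auto&& fn) {
        auto t0 = std::chrono::steady_clock::now();
        fn();
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    };

    int64_t sum1 = 0, sum2 = 0;
    long long chase = time_us([&] { for (Node* n = &nodes[order[0]]; n; n = n->next) sum1 += n->value; });
    long long scan  = time_us([&] { for (const Node& n : nodes) sum2 += n.value; });
    std::printf("pointer chasing: %lld us, contiguous scan: %lld us (sums %lld %lld)\n",
                chase, scan, static_cast<long long>(sum1), static_cast<long long>(sum2));
}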
17
Cache behavior of commercial DBMS
(on Uniprocessor Pentium II Xeon)
Anastassia Ailamaki et al., "DBMSs on a Modern Processor: Where
Does Time Go?", VLDB 1999
Memory related delays: 40-80% of execution time.
Data accesses on caches: 19-86% of memory stalls.
Multiprocessor cache behavior?
Probably worse because of coherence cache misses
18
Coherence Cache Miss Problem
Updates on shared data structures on shared-memory multiprocessor systems
=> cache invalidation => coherence cache misses
Problems with frequently updated, shared internal control structures (see the sketch below):
locks, transaction tables, etc.
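A small C++ demonstration (not from the talk) of the effect: two threads increment counters that share one cache line versus counters padded onto separate 64-byte lines; the shared-line run is typically much slower because every write invalidates the other processor's cached copy.

// Two threads update adjacent counters (same cache line) vs. padded counters
// (separate cache lines). The former suffers coherence cache misses on
// essentially every write.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

struct SharedLine { std::atomic<long> a{0}, b{0}; };  // a and b likely share a cache line
struct PaddedLines {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <typename Counters>
long long run_ms(Counters& c) {
    constexpr long kIters = 20000000;
    auto t0 = std::chrono::steady_clock::now();
    std::thread t1([&] { for (long i = 0; i < kIters; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (long i = 0; i < kIters; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join(); t2.join();
    auto t_end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(t_end - t0).count();
}

int main() {
    SharedLine s;
    PaddedLines p;
    std::printf("same cache line: %lld ms\n", run_ms(s));
    std::printf("padded lines:    %lld ms\n", run_ms(p));
}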
19
Transact In Memory Project at SNU
Objective: Build a highly parallel, highly scalable
Main-Memory DBMS
 Exploit the parallelism in logging, recovery, and concurrency control
 Exploit advances in HW computing power
Specific technical issues published so far:
 Parallel logging and recovery [ICDE01]
 Cache-conscious index [SIGMOD01]
 Cache-conscious concurrency control of index [VLDB01]
20
Transact In Memory Project
Some Hypotheses
MMDBMS should do more than just keeping the database in memory
New architectures and new algorithms are needed; otherwise, DRDBMSs will
catch up eventually.
Memory-primary design should boost the performance significantly
MMDBMS is the best place to apply the self-tuning RISC DBMS concept!
Thanks to Surajit Chaudhuri and Gerhard Weikum for their VLDB 2000
vision paper!
Once we have a high-performance MMDBMS, today’s multi-tier
architecture may not be the best fit.
Database servers may become idle …
This is also a good place for inventing …
Because of a new cost model and new premises
21
Parallel Main Memory DBMS
[Diagram: parallel main memory DBMS architecture — the transaction manager, data dictionary, replication manager, recovery manager, lock manager, and memory manager operate on the in-memory primary database; the checkpoint manager writes buffers to Backup DB 1 … Backup DB m (parallel checkpointing) and the log manager writes buffers to Log 1 … Log n (parallel logging), with a log anchor on disk.]
22
Parallel Main Memory DBMS
[Diagram: recovery dataflow — Backup DB Read (BDR) and Backup DB rePlay (BDP) components stream Backup DB 1 … Backup DB m into the primary database, while Log Read (LR) and Log rePlay (LP) components stream Log 1 … Log n, under the recovery manager and the log anchor.]
BDP: Backup DB rePlay, LP: Log rePlay, BDR: Backup DB Read, LR: Log Read
Fully Parallel Recovery:
Backup Database Loading and Log Processing all in Parallel
23
Problems with Existing Logging Schemes
Constraint on log replay by the serialization order
Conventional Wisdom
Distribute log records to multiple disks
• Pays the cost of merging and applying log records in serialization
order during recovery
Partition the database and assign different log disks to
different partitions => limited parallelism
• Poor utilization of multiple log disks when updates are skewed to
certain partitions
• Difficulty of handling transactions updating multiple segments
24
Differential Logging: Definition
Let the value of R change from p to q.
Differential log: Δ(p,q) = p ⊕ q
Redo: p ⊕ Δ(p,q) = q
Undo: q ⊕ Δ(p,q) = p
where ⊕ denotes the bit-wise XOR (see the sketch below)
[Lee, Kim, and Cha, ICDE 2001]
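A minimal C++ sketch of the definition, operating on a single 64-bit word (the record format and names are illustrative, not the ICDE01 implementation):

// Differential logging: the log record stores p XOR q for an update p -> q.
// Applying the same record with XOR performs redo (p -> q) and undo (q -> p).
#include <cassert>
#include <cstdint>

struct DiffLogRecord {
    uint64_t resource_id;  // which in-memory word was updated
    uint64_t diff;         // before XOR after
};

inline DiffLogRecord make_log(uint64_t resource_id, uint64_t before, uint64_t after) {
    return {resource_id, before ^ after};
}

inline void apply(uint64_t* resource, const DiffLogRecord& rec) {
    *resource ^= rec.diff;  // redo if *resource == before, undo if *resource == after
}

int main() {
    uint64_t r = 0b0000;                        // value of R
    DiffLogRecord l1 = make_log(1, r, 0b0101);  // update R: 0000 -> 0101
    r = 0b0101;
    DiffLogRecord l2 = make_log(1, r, 0b1001);  // update R: 0101 -> 1001 (diff = 1100)
    r = 0b1001;

    apply(&r, l2); assert(r == 0b0101);         // undo L2
    apply(&r, l1); assert(r == 0b0000);         // undo L1
    apply(&r, l1); apply(&r, l2);               // redo L1; redo L2
    assert(r == 0b1001);
    return 0;
}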
25
Differential Logging Example
Each log record carries a log sequence number, the resource ID, and the XOR difference.
The value of resource R starts at 0000 and is updated twice:
L1 (R, …): XOR difference 0101 (value 0000 → 0101)
L2 (R, …): XOR difference 1100 (value 0101 → 1001)
Replaying the log in either order recovers the same final value:
Redo L1; Redo L2;  0000 → 0101 → 1001
Redo L2; Redo L1;  0000 → 1100 → 1001
26
Source of Parallelism
XOR operation is commutative and associative
Serialization order is insignificant
• Log records can be applied in arbitrary order
• Log records can be distributed to multiple disks to improve the
logging and recovery performance.
Redo and undo can be intermixed
• Single-pass recovery possible
Backup DB loading can be intermixed with log processing
• Initialize the primary DB with 0’s. Copying Backup DB into memory
is equivalent to applying XOR to the backup data and the
corresponding memory initialized with 0’s.
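A toy C++ sketch of this recovery idea (illustrative only): the primary DB is initialized to zeros, and backup words and differential log records are all applied with XOR, so workers may apply them in any order and still reach the same image.

// Recovery by XOR: start from an all-zero primary DB, XOR in the backup image,
// and XOR in every differential log record. Since XOR is commutative and
// associative, the order and interleaving of these applications do not matter.
#include <cassert>
#include <cstdint>
#include <vector>

struct DiffLog { size_t slot; uint64_t diff; };  // diff = before XOR after

std::vector<uint64_t> recover(const std::vector<uint64_t>& backup,
                              const std::vector<DiffLog>& logs) {
    std::vector<uint64_t> primary(backup.size(), 0);  // primary DB initialized with 0's
    for (size_t i = 0; i < backup.size(); ++i) primary[i] ^= backup[i];  // "load" backup
    for (const DiffLog& rec : logs) primary[rec.slot] ^= rec.diff;       // replay logs
    return primary;
}

int main() {
    std::vector<uint64_t> backup = {0b0000, 0b1111};
    std::vector<DiffLog> logs = {{0, 0b0101}, {0, 0b1100}, {1, 0b0011}};

    // Two different replay orders produce the same recovered image.
    std::vector<uint64_t> a = recover(backup, logs);
    std::vector<uint64_t> b = recover(backup, {logs[2], logs[0], logs[1]});
    assert(a == b && a[0] == 0b1001 && a[1] == 0b1100);
    return 0;
}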
27
Cache-conscious Index Structures
Design focus: minimize the L2 cache misses
Index node alignment with cache lines
• Node size: a few multiples of L2 cache blocks
Pointer elimination for increasing fanout => reduced height =>
reduced cache misses
• CSB+-tree [Rao & Ross, SIGMOD 2000]
Key compression for increasing fanout
• CR-tree [Kim et al, SIGMOD 2001], pkT/pkB-tree [Chen et al,
SIGMOD 2001]
[Figure: a B+-tree node interleaves keys (e.g., 23, 34) with per-key child pointers, while a CSB+-tree node packs its keys contiguously (23 34 47 58) and keeps a single pointer to its contiguous group of children.]
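A simplified C++ sketch of the node-layout idea (the actual CSB+-tree and CR-tree layouts differ): the node is sized and aligned to a 64-byte cache line, keys are packed contiguously, and one pointer to a contiguous group of children replaces the per-key child pointers of a B+-tree node.

// Cache-line-sized index node with pointer elimination: more keys fit per
// cache line than in a B+-tree node that stores a child pointer next to
// every key, so the tree is shallower and causes fewer cache misses.
#include <cstdint>

constexpr int kCacheLine = 64;  // assumed L2 cache line size

struct alignas(kCacheLine) CsbNode {
    // 64 bytes: 1 child-group pointer + 2 counters + 13 packed 4-byte keys.
    CsbNode* child_group;  // children stored contiguously; child i is child_group + i
    uint16_t num_keys;
    uint16_t is_leaf;
    int32_t  keys[13];
};
static_assert(sizeof(CsbNode) == kCacheLine, "node must fill exactly one cache line");

// Find the branch within one cache-resident node, then descend.
inline CsbNode* descend(const CsbNode* node, int32_t key) {
    int i = 0;
    while (i < node->num_keys && key >= node->keys[i]) ++i;
    return node->child_group + i;
}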
28
Concurrency Control of Index
What if we combine the cache-conscious index with traditional latch-based index
concurrency control schemes?
 Lock coupling involves latching and unlatching index nodes while
traversing down the tree
 Latching/unlatching an index node involves a memory write on the
cache block containing the latch.
 Memory writes on shared data structures on shared-memory
multiprocessor systems lead to coherence cache misses.
29
OLFIT Concurrency Control: Overview
Consistent node access protocol [Cha et al, VLDB 2001,
Patent-Pending]:
ReadNode:
• Just read the node optimistically with no latch operations
• At the end, detect the read-update conflict by checking the latch state
and the version number updated by the UpdateNode operation
• Retry on the conflict.
UpdateNode:
• Latch and unlatch the index node before and after updating the node
content
– Prevent other node updates from interfering
• Increment the version number of the index node before unlatching
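A simplified C++ sketch of the two operations as described above (the full OLFIT scheme in the VLDB 2001 paper covers complete tree operations such as traversal and node splits; the field names here are illustrative):

// OLFIT-style node access: readers take no latch and write nothing to shared
// memory; they re-check the latch state and version at the end and retry on
// conflict. Updaters latch the node, update it, increment the version, unlatch.
#include <atomic>
#include <cstdint>

struct Node {
    std::atomic<uint64_t> latch{0};    // 0 = free, 1 = held by an updater
    std::atomic<uint64_t> version{0};  // incremented on every node update
    int32_t keys[14];                  // node content (illustrative)
    int32_t num_keys{0};
};

// ReadNode: optimistic read with no latch operations; retry on conflict.
template <typename ReadFn>
void read_node(const Node& node, ReadFn read_content) {
    for (;;) {
        uint64_t v = node.version.load(std::memory_order_acquire);      // remember version
        read_content(node);                                             // read optimistically
        if (node.latch.load(std::memory_order_acquire) != 0) continue;  // update in progress
        if (node.version.load(std::memory_order_acquire) == v) return;  // consistent read
    }
}

// UpdateNode: latch, update the content, increment the version, unlatch.
template <typename UpdateFn>
void update_node(Node& node, UpdateFn update_content) {
    uint64_t expected = 0;
    while (!node.latch.compare_exchange_weak(expected, 1, std::memory_order_acquire))
        expected = 0;                              // spin until the latch is acquired
    update_content(node);                          // update the node content
    node.version.fetch_add(1, std::memory_order_release);
    node.latch.store(0, std::memory_order_release);
}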
30
P*TIME:
Highly Parallel Transact In Memory Engine
[Diagram: P*TIME on a shared-memory multiprocessor — multiple CPUs, each with its own cache, access the in-memory primary DB through the cache-conscious index and concurrency control; parallel checkpointing writes to multiple backup DBs, parallel logging writes to multiple log disks / partitions, and parallel recovery restores from both.]
31
P*TIME:
Highly Parallel Transact In Memory Engine
Lightweight, multithreaded architecture
Scalable performance on multiprocessor platforms
Implemented mostly in C++
Small footprint (<10MB)
Industry standard API: SQL, JDBC (with binding to Java
and C++), ODBC
Support of DB embedded application architecture
Multi-level concurrency control
Data replication for high availability
Simple performance model
Ease of application tuning and optimization
32
P*TIME-Embedded Telco Directory Server
15M customer database
1 table (~1GB)
3 hash indexes (~1.3GB)
• Backup DB loading time: 16 sec (65MB/sec)
• Log processing time: 30-40MB/sec
• Parallel index building time: 19 sec
• Initial DB loading from text file: 6 min
33
Summary
The memory-primary architecture
Utilizes hardware resources more efficiently.
Still many things to do
P*TIME: second-generation MMDBMS
Memory-primary architecture with parallelism exploited
Other optimizations in implementation
Numerous applications in Telecom and Internet
Directory servers
Update-intensive applications: batch => on-line
34