Flash Memory Database Systems and IPL


Flash Talk
Flash Memory Database Systems
and In-Page Logging
Bongki Moon
Department of Computer Science
University of Arizona
Tucson, AZ 85721, U.S.A.
[email protected]
In collaboration with Sang-Won Lee (SKKU), Chanik Park (Samsung)
KOCSEA’09, Las Vegas, December 2009
Magnetic Disk vs Flash SSD
Champion for 50 years: Seagate ST340016A (40 GB, 7200 RPM)
New challengers: Intel X25-M Flash SSD (80 GB, 2.5 inch); Samsung Flash SSD (128 GB, 2.5/1.8 inch)
Past Trend of Disk
• From 1983 to 2003 [Patterson, CACM 47(10) 2004]
 Capacity increased about 2,500 times (0.03 GB → 73.4 GB)
 Bandwidth improved 143.3 times (0.6 MB/s → 86 MB/s)
 Latency improved 8.5 times (48.3 ms → 5.7 ms)
Year   Product            Capacity   RPM     Bandwidth (MB/sec)   Media diameter (in)   Latency (msec)
1983   CDC 94145-36       0.03 GB    3600    0.6                  5.25                  48.3
1990   Seagate ST41600    1.4 GB     5400    4                    5.25                  17.1
1994   Seagate ST15150    4.3 GB     7200    9                    3.5                   12.7
1998   Seagate ST39102    9.1 GB     10000   24                   3.0                   8.8
2003   Seagate ST373453   73.4 GB    15000   86                   2.5                   5.7
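The improvement factors quoted above follow directly from the first and last rows of the table; a quick check:

```python
# Back-of-envelope check of the 1983 -> 2003 improvement factors quoted above.
capacity_1983, capacity_2003 = 0.03, 73.4     # GB
bandwidth_1983, bandwidth_2003 = 0.6, 86.0    # MB/s
latency_1983, latency_2003 = 48.3, 5.7        # ms

print(f"Capacity:  {capacity_2003 / capacity_1983:6.0f}x")    # ~2447x, i.e. "about 2,500 times"
print(f"Bandwidth: {bandwidth_2003 / bandwidth_1983:6.1f}x")  # ~143.3x
print(f"Latency:   {latency_1983 / latency_2003:6.1f}x")      # ~8.5x
```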
I/O Crisis in OLTP Systems
• I/O becomes bottleneck in OLTP systems
 Process a large number of small random I/O operations
• Common practice to close the gap
 Use a large disk farm to exploit I/O parallelism
• Tens or hundreds of disk drives per processor core
• e.g., IBM Power 595 server: 172 15k-RPM disks per processor core
 Adopt short-stroking to reduce disk latency
• Use only the outer tracks of disk platters
 Other concerns are raised too
• Wasted capacity of disk drives
• Increased amount of energy consumption
• Then, what happens 18 months later?
 To keep pace with Moore’s law and keep CPU and I/O balanced (Amdahl’s law), must the number of spindles double again? (See the sketch below.)
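A back-of-envelope sketch of where disk-farm sizes like 172 spindles per core come from. The 450 random IOPS per 15k-RPM drive is the figure quoted later in this talk; the per-core I/O demand is a purely hypothetical number chosen for illustration.

```python
# Why OLTP installations end up with huge disk farms (rough sketch).
# 450 IOPS per 15k-RPM drive is the figure quoted later in this talk;
# the per-core random-I/O demand is a hypothetical, illustrative number.
iops_per_core = 75_000   # random I/Os per second one core can usefully drive (assumed)
iops_per_disk = 450      # sustained random IOPS of one 15k-RPM drive

spindles = iops_per_core / iops_per_disk
print(f"Spindles per core: {spindles:.0f}")   # ~167, the same order as the 172 above

# If per-core demand doubles every ~18 months (Moore's law) while per-disk
# IOPS stays flat, the spindle count must double as well to stay balanced.
print(f"18 months later:   {2 * spindles:.0f}")
```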
Flash News in the Market
• Sun Oracle Exadata Storage Server [Sep 2009]
 Each Exadata cell comes with 384 GB flash cache
• MySpace dumped disk drives [Oct 2009]
 Went all-flash, cutting power consumption by 99%
• Google Chrome OS ditched disk drives [Nov 2009]
 SSD is the key to its 7-second boot time
• Gordon at UCSD/SDSC [Nov 2009]
 64 TB RAM, 256 TB Flash, 4 PB Disks
• IBM hooked up with Fusion-io [Dec 2009]
 SSD storage appliance for System X server line
Flash for Database, Really?
• Immediate benefit for some DB operations
 Reduce commit-time delay by fast logging
 Reduce read time for multi-versioned data
 Reduce query processing time (sort, hash)
• What about the Big Fat Tables?
 Random scattered I/O is very common in OLTP
• Can the slow random writes of flash SSDs handle this?
Transactional Log
(Figure: SQL queries pass through the system buffer cache to the database, which comprises the transaction (redo) log, rollback segments, temporary table space, and table space.)
Commit-time Delay by Logging
• Write-Ahead Log (WAL)
 A committing transaction force-writes its log records
 Makes it hard to hide latency
 With a separate disk for logging
• No seek delay, but half a revolution of the spindle on average
• 4.2 msec (7200 RPM), 2.0 msec (15k RPM)
 With a flash SSD: about 0.4 msec
(Figure: transactions T1 … Tn issue SQL statements; the buffer pool writes data pages Pi to the DB device while the log buffer force-writes log records to a separate LOG device.)
• Commit-time delay remains a significant overhead
 Group commit helps, but the delay does not go away altogether.
• How much commit-time delay?
 On average, 8.2 msec (HDD) vs. 1.3 msec (SSD): a 6-fold reduction
• TPC-B benchmark with 20 concurrent users (see the sketch below).
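A minimal sketch of the arithmetic behind these numbers, assuming the per-force latencies quoted above; the group-commit model is deliberately simplified and ignores the time spent waiting for a group to fill.

```python
# Minimal model of commit-time delay: every committing transaction must
# force-write its log records, so the log device's latency sits on the
# critical path.  Per-force latencies are the figures quoted above; the
# group-commit model (k transactions share one force) is a simplification.
LOG_FORCE_MS = {"7200-RPM disk": 4.2, "15k-RPM disk": 2.0, "flash SSD": 0.4}

def amortized_force_ms(device: str, group_size: int = 1) -> float:
    """Log-force cost charged to each committing transaction, in milliseconds."""
    return LOG_FORCE_MS[device] / group_size

for dev, latency in LOG_FORCE_MS.items():
    print(f"{dev:13s}: single commit {latency:4.1f} ms, "
          f"group of 8 {amortized_force_ms(dev, 8):5.2f} ms per transaction")
```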
Rollback Segments
(Figure: the same DBMS storage architecture diagram as before, repeated to highlight the rollback segments.)
MVCC Rollback Segments
• Multi-version Concurrency Control (MVCC)
 Alternative to traditional Lock-based CC
 Support read consistency and snapshot isolation
 Oracle, PostgreSQL, Sybase, SQL Server 2005, MySQL
• Rollback Segments
 Each transaction is assigned to a rollback segment
 When an object is updated, its current value is recorded in the rollback segment sequentially (in append-only fashion)
 To fetch the correct version of an object, check whether it has been updated by other transactions (see the sketch below)
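A toy sketch of the mechanism just described: updates append the object's current value (a before-image) to a rollback segment, and a reader walks those append-only entries to find the version visible to its snapshot. The class names and the simple txn-id visibility rule are illustrative only, not how any particular DBMS implements MVCC.

```python
from dataclasses import dataclass, field

@dataclass
class Version:
    value: object
    txn_id: int            # transaction that wrote this (now old) value

@dataclass
class RollbackSegment:
    # key -> list of old versions, written strictly append-only
    entries: dict = field(default_factory=dict)

    def record_before_image(self, key, value, txn_id):
        self.entries.setdefault(key, []).append(Version(value, txn_id))

class Store:
    def __init__(self):
        self.current = {}                  # key -> (value, writer_txn_id)
        self.rollback = RollbackSegment()  # a single segment, for simplicity

    def update(self, key, new_value, txn_id):
        if key in self.current:
            old_value, old_txn = self.current[key]
            self.rollback.record_before_image(key, old_value, old_txn)
        self.current[key] = (new_value, txn_id)

    def read(self, key, snapshot_txn_id):
        """Return the newest version written by a transaction <= the reader's snapshot."""
        value, writer = self.current[key]
        if writer <= snapshot_txn_id:
            return value
        # Current version is too new: traverse old versions in the rollback segment.
        for v in reversed(self.rollback.entries.get(key, [])):
            if v.txn_id <= snapshot_txn_id:
                return v.value
        raise LookupError("no version visible to this snapshot")

store = Store()
store.update("A", 200, txn_id=0)
store.update("A", 100, txn_id=1)   # before-image (200, T0) appended to the rollback segment
store.update("A", 50,  txn_id=2)   # before-image (100, T1) appended as well
print(store.read("A", snapshot_txn_id=1))   # -> 100, found by walking the version chain
```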
MVCC Write Pattern
• Write requests from TPC-C workload
 Concurrent transactions generate multiple streams of append-only traffic in parallel (roughly 1 MB apart)
 HDD moves the disk arm very frequently
 SSD suffers no penalty from its no-in-place-update limitation, since the traffic is append-only
MVCC Read Performance
• To support MV read consistency, I/O activity will increase
 A long chain of old versions may have to be traversed for each access to a frequently updated object
• Read requests are scattered randomly
 Old versions of an object may be stored in several rollback segments
 With SSD, a 10-fold read time reduction was not surprising
(Figure: a chain of old versions of object A (values 200, 100, 50) created by transactions T0, T1, T2 is spread across multiple rollback segments.)
Database Table Space
(Figure: the same DBMS storage architecture diagram as before, repeated to highlight the database table space.)
Workload in Table Space
• TPC-C workload
 Exhibits little locality and sequentiality
• Mix of small/medium/large read-write and read-only (join) transactions
 Highly skewed
• 84% (75%) of accesses go to 20% of tuples (pages)
• Write caching is not as effective as read caching
 The physical read/write ratio is much lower than the logical read/write ratio
• All bad news for flash memory SSDs
 Due to the no-in-place-update constraint and asymmetric read/write speeds
 In-Page Logging (IPL) approach [SIGMOD’07]
In-Page Logging (IPL)
• Key Ideas of the IPL Approach
 Changes are written to a log instead of being applied in place
• Avoids frequent write and erase operations
 Log records are co-located with their data pages
• No need to write them sequentially to a separate log region
• Current data can be read more efficiently than with sequential logging
 The DBMS buffer and storage managers work together
Design of the IPL
• Logging on a per-page basis, in both memory and flash
 An in-memory log sector (512 B) can be associated with a buffer frame (an 8 KB data page) in memory
 Allocated on demand when a page becomes dirty
 An in-flash log segment is allocated in each erase unit (128 KB): 15 data pages (8 KB each) plus a log area (8 KB, 16 sectors)
The log area is shared by all the data pages in an erase unit (see the layout sketch below)
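The per-erase-unit layout described above adds up exactly; a quick check using the sizes from this slide:

```python
# Sanity check of the per-erase-unit layout described above: 15 data pages
# of 8 KB plus a shared 8 KB log area of 16 sectors (512 B each) fill one
# 128 KB erase unit exactly.  Constants are taken from the slide.
KB = 1024
ERASE_UNIT      = 128 * KB
PAGE_SIZE       = 8 * KB
PAGES_PER_UNIT  = 15
LOG_SECTOR_SIZE = 512
LOG_SECTORS     = 16

data_area = PAGES_PER_UNIT * PAGE_SIZE      # 120 KB of data pages
log_area  = LOG_SECTORS * LOG_SECTOR_SIZE   # 8 KB of in-page log

assert data_area + log_area == ERASE_UNIT
print(f"data area = {data_area // KB} KB, log area = {log_area // KB} KB, "
      f"erase unit = {ERASE_UNIT // KB} KB")
```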
IPL Write
• Data pages in memory
 Updated in place, and
 Physiological log records are written to the page's in-memory log sector
• The in-memory log sector is written to the in-flash log segment when
 the data page is evicted from the buffer pool, or
 the log sector becomes full
• When a dirty page is evicted, its content is not written to flash memory
 The previous version of the page remains intact in flash
 Data pages and their log records are physically co-located in the same erase unit (see the sketch below)
(Figure: updates/inserts/deletes are applied in place to 8 KB pages in the buffer pool; 512 B sectors of physiological log records are flushed to the log area of the corresponding 128 KB block in the data block area.)
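A minimal sketch of this write path: pages are updated in place in memory, physiological log records accumulate in a per-page log sector, and evicting a dirty page flushes only the log sector, never the page image. Class names, the in-memory record format, and the toy "flash" model are all illustrative, not the authors' implementation.

```python
# Sketch of the IPL write path described above.
LOG_SECTOR_CAPACITY = 4          # max log entries per in-memory log sector (illustrative)

class FlashEraseUnit:
    def __init__(self):
        self.data_pages = {}     # page_id -> last merged page image
        self.log_area = []       # flushed log sectors, append-only

class IPLBufferManager:
    def __init__(self, flash_unit):
        self.flash = flash_unit
        self.pages = {}          # page_id -> in-memory page image (a dict of slots)
        self.log_sectors = {}    # page_id -> list of physiological log entries

    def update(self, page_id, slot, value):
        page = self.pages.setdefault(page_id, dict(self.flash.data_pages.get(page_id, {})))
        page[slot] = value                                   # update in place, in memory
        sector = self.log_sectors.setdefault(page_id, [])    # allocated on demand
        sector.append(("set", slot, value))                  # physiological log record
        if len(sector) >= LOG_SECTOR_CAPACITY:
            self._flush_log_sector(page_id)

    def evict(self, page_id):
        if self.log_sectors.get(page_id):
            self._flush_log_sector(page_id)                  # flush the log, not the page
        self.pages.pop(page_id, None)                        # old page version stays in flash

    def _flush_log_sector(self, page_id):
        self.flash.log_area.append((page_id, self.log_sectors.pop(page_id)))

unit = FlashEraseUnit()
bm = IPLBufferManager(unit)
bm.update(page_id=3, slot="r1", value="x")
bm.update(page_id=3, slot="r2", value="y")
bm.evict(page_id=3)
print(unit.data_pages)   # {}  -- the data page itself was never rewritten
print(unit.log_area)     # [(3, [('set', 'r1', 'x'), ('set', 'r2', 'y')])]
```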
IPL Read
• When a page is read from flash, the current version is computed on the fly
 Read from flash: the original copy of Pi and all log records belonging to Pi (I/O overhead)
 Apply the “physiological actions” to the copy read from flash to re-construct the current in-memory copy (CPU overhead); see the sketch below
(Figure: page Pi is read from the erase unit's data area (120 KB, 15 pages), its log records from the log area (8 KB, 16 sectors), and the current copy is rebuilt in the buffer pool.)
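A self-contained sketch of the read path, reusing the illustrative ("set"/"delete", slot, value) record format from the write sketch above:

```python
# Sketch of the IPL read path described above: read the old page image and
# every log record that belongs to it, then replay the records to rebuild
# the current in-memory copy.  The record format is illustrative only.

def reconstruct_page(flash_page_image: dict, log_records: list) -> dict:
    """Replay physiological log records on top of the page image read from flash."""
    current = dict(flash_page_image)          # I/O: original copy of the page
    for op, slot, value in log_records:       # I/O: all log records for this page
        if op == "set":                       # CPU: apply each physiological action
            current[slot] = value
        elif op == "delete":
            current.pop(slot, None)
    return current

old_page = {"r1": "a", "r2": "b"}                        # version stored in the data area
logs = [("set", "r1", "x"), ("delete", "r2", None), ("set", "r3", "z")]
print(reconstruct_page(old_page, logs))                  # {'r1': 'x', 'r3': 'z'}
```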
IPL Merge
• When all free log sectors in an erase unit are consumed
 Log records are applied to the corresponding data pages
 The 15 up-to-date data pages are copied into a new erase unit Bnew, whose log area starts out clean; the old unit Bold can then be erased
• Consumes, erases, and releases only one erase unit (see the sketch below)
(Figure: the data pages and full log area (8 KB, 16 sectors) of physical flash block Bold are merged into a new block Bnew with a clean log area; Bold can then be erased.)
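A small sketch of the merge step under the same illustrative record format; the two dicts stand in for the old and new erase units:

```python
# Sketch of the IPL merge described above: when an erase unit's log area is
# full, apply every page's log records, write the up-to-date pages into a
# fresh erase unit with a clean log area, and erase the old unit.  The
# record format and the dict-based "erase unit" are illustrative simplifications.

def merge_erase_unit(old_unit: dict) -> dict:
    """old_unit = {"pages": {pid: image}, "log": {pid: [records]}} -> new unit (Bnew)."""
    new_pages = {}
    for pid, image in old_unit["pages"].items():
        current = dict(image)
        for op, slot, value in old_unit["log"].get(pid, []):
            if op == "set":
                current[slot] = value
            elif op == "delete":
                current.pop(slot, None)
        new_pages[pid] = current
    old_unit["pages"].clear()                 # Bold can now be erased and reused
    old_unit["log"].clear()
    return {"pages": new_pages, "log": {}}    # Bnew: up-to-date pages, clean log area

b_old = {"pages": {1: {"r1": "a"}, 2: {"r1": "b"}},
         "log":   {1: [("set", "r1", "a2")], 2: [("set", "r2", "c")]}}
b_new = merge_erase_unit(b_old)
print(b_new["pages"])   # {1: {'r1': 'a2'}, 2: {'r1': 'b', 'r2': 'c'}}
```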
Industry Response
• Common in enterprise-class SSDs
 Multi-channel, inter-command parallelism
• Improves throughput (IOPS) rather than raw bandwidth; handles write-followed-by-read patterns
 Command queuing (SATA-II NCQ)
 Large RAM buffer (with super-capacitor backup)
• Even up to 1 MB per GB
• Write-back caching, controller data (mapping, wear leveling)
 Over-provisioning (up to ~20% of capacity)
• Impressive improvement
Prototype/Product   Read (IOPS)   Write (IOPS)
EC SSD              10500         2500
X-25M               20000         1200
15k-RPM Disk        450           450
EC-SSD Architecture
• Parallel/interleaved operations
 8 channels, 2 packages/channel, 4 chips/package
 Two-plane page write, block erase, copy-back operations
(Figure: a SATA-II host interface and ARM9 main controller with ECC and 128 MB of DRAM drive a flash controller with 8 channels of NAND flash packages.)
Concluding Remarks
• Recent advances cope with random I/O better
 Write IOPS 100x higher than early SSD prototype
 TPS: 1.3~2x higher than a RAID-0 array of 8 HDDs for a read-write TPC-C workload, with much less energy consumed
• Write still lags behind
 IOPS_Disk < IOPS_SSD-Write << IOPS_SSD-Read
 IOPS_SSD-Read / IOPS_SSD-Write = 4 ~ 17
• A lot more issues to investigate
 Flash-aware buffer replacement, I/O scheduling, Energy
 Fluctuation in performance, Tiered storage architecture
 Virtualization, and much more …
Questions?
• For more of our flash memory work
 In-Page Logging [SIGMOD’07]
 Logging, Sort/Hash, Rollback [SIGMOD’08]
 SSD Architecture and TPC-C [SIGMOD’09]
 In-Page Logging for Indexes [CIKM’09]
• Even More?
 www.cs.arizona.edu/~bkmoon