Solid State Storage Deep Dive


And How It Affects SQL Server
• NAND Flash Structure
• MLC and SLC Compared
• NAND Flash Read Properties
• NAND Flash Write Properties
• Wear-Leveling
• Garbage Collection
• Write Amplification
• TRIM
• Error Detection and Correction
• Reliability
• Form Factor
• Performance Characteristics
• Determining What’s Right for You
• Not All SSDs Are Created Equal
Two Main Flavors: NAND and NOR
• NOR
  – Operates like RAM.
  – NOR is parallel at the cell level.
  – NOR reads slightly faster than NAND.
  – Can execute directly from NOR without copy to RAM.
• NAND
  – NAND operates like a block device, a.k.a. a hard disk.
  – NAND is serial at the cell level.
  – NAND writes significantly faster than NOR.
  – NAND erases much faster than NOR: 4 ms vs. 5 s.
• Serial array of transistors.
  – Each transistor holds 1 bit (or more).
• Arrays grouped into pages.
  – 4096 bytes in size.
  – Contains “spare” area for ECC and other ops.
• Pages grouped into blocks.
  – 64 to 128 pages.
  – Smallest erasable unit.
• Blocks grouped into chips.
  – As big as 16 gigabytes.
• Chips grouped onto devices.
  – Usually in a parallel arrangement.
NAND Flash Structure: Gates, Cells, Pages and Strings.
• MLC (Multi-Level Cell)
  – Higher capacity (two bits per cell).
  – Low P/E cycle count: ~3K to ~10K.
  – Cheaper per gigabyte.
  – High ECC needs.
• SLC (Single-Level Cell)
  – Fast read speed
    • 25ns vs. 50ns
  – Fast write speed
    • 220ns vs. 900ns
  – High P/E cycle count: ~100K to ~300K
    • These tend to be conservative numbers.
  – Minimal ECC requirements
    • 1 bit per 512 bytes vs. ~12 bits per 512 bytes.
  – Expensive
    • Up to 5x the cost of MLC.

 It isn’t RAM.
  ◦ Slower access times.
     ~1 ns vs. ~50 ns.
  ◦ No write in place.
 It isn’t a hard disk.
  ◦ Much faster access times.
     Nanoseconds vs. milliseconds.
  ◦ No moving parts.
 Program/Erase Cycle
  ◦ Erased state: all bits are 1.
  ◦ Programmed bits are 0.
  ◦ Programmed a page at a time.
     One-pass programming.
  ◦ Erased a block at a time (128 pages).
     Must erase the entire block to program a single page again (see the arithmetic sketch below).
  ◦ Finite life cycle: ~10K P/E cycles for MLC, ~100K for SLC.
     Once a block has failed to erase it may still be readable.
Data is written in pages and erased in blocks. Blocks are becoming larger as NAND Flash die sizes shrink.
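
To make the page/block asymmetry concrete, here is a back-of-the-envelope calculation in T-SQL (sizes taken from the structure slides above; actual geometry varies by device):

-- Rewriting a single 4 KB page can force an erase of an entire block.
DECLARE @page_bytes      int = 4096;  -- page size from the slides
DECLARE @pages_per_block int = 128;   -- upper end of the 64 to 128 range

SELECT  @page_bytes * @pages_per_block          AS block_bytes,  -- 524288
        (@page_bytes * @pages_per_block) / 1024 AS block_kb;     -- 512 KB erased to reprogram one 4 KB page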
• Wear-Leveling
  – Spreads writes across blocks.
  – Ideally, write to every block before erasing any.
  – Data grouped into two patterns.
    • Static: written once and read many times.
    • Dynamic: written often, read infrequently.
  – If you only wear-level data in motion you burn out those pages quickly.
  – If you wear-level static data you incur extra I/O.
 Background Garbage Collection
  ◦ Defers the P/E cycle.
  ◦ Pages marked as dirty, erased later.
  ◦ Requires spare area.
  ◦ Incurs additional I/O.
  ◦ Can be put under pressure by frequent small writes.
 Write Amplification
  ◦ Ripples in a pond.
  ◦ Device moves blocks around.
  ◦ I/O inside the device ends up greater than the incoming I/O.
  ◦ Every write causes additional writes (quantified in the sketch after the figures below).
     Small writes can be a real problem.
     OLTP workloads are a good example.
     TRIM can help.
Initial write of 4 pages to a single erasable block.
Four new pages and four replacement pages are written. The original pages are now marked invalid.
Garbage collection comes along, moves all valid pages to a new block, and erases the old block.
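
A rough way to quantify this behavior is the write amplification factor: total NAND writes divided by host writes. A quick T-SQL illustration with made-up figures:

-- Hypothetical: the host writes 100 GB, but garbage collection and block
-- moves cause the device to write 250 GB of NAND.
DECLARE @host_writes_gb float = 100.0;  -- illustrative figure
DECLARE @nand_writes_gb float = 250.0;  -- illustrative figure

SELECT @nand_writes_gb / @host_writes_gb AS write_amplification_factor;  -- 2.5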
• TRIM
  – Supported out of the box on Windows 7 and Windows Server 2008 R2.
    • Some manufacturers ship a TRIM service that works with their driver.
  – Acts like spare area for garbage collection.
  – The OS and file system tell the drive a block is empty.
  – Filling the file system defeats TRIM.
  – File fragmentation can hurt TRIM.
    • Grow your files manually!
    • Don’t run disk defrag!
Many things cause errors on Flash!
• Write Disturb
  – Data cells NOT being written to are corrupted.
    • Fixed with a normal erase.
• Read Disturb
  – Repeated reads of the same page affect other pages in the block.
    • Fixed with a normal erase.
• Charge Loss/Gain
  – Transistors may gain or lose charge over time.
    • Affects flash devices at rest or rarely accessed data.
    • Fixed with a normal erase.
All of these issues are generally dealt with very well using standard ECC techniques.
As cells are programmed, other cells may experience voltage change.
As cells are read, other cells in the same block can suffer voltage change.
If flash is at rest or rarely read, cells can suffer charge loss.
• Not all drives are benchmarked the same.
• Short-stroking
  – Only using a small portion of the drive.
  – Allows for lots of spare capacity via TRIM.
• Huge queue depths.
  – Increase latency.
  – Can be unrealistic.
• Odd block transfer sizes.
  – Random IO testing.
    • Some use 512 bytes while others use 4K.
  – Sequential IO testing.
    • Most use 128K.
    • Some use 64K to better fit into large buffers.
    • Some use 1MB and high queue depths.
 Read the numbers carefully.
  ◦ Random IO benchmarks usually use 4K.
     SQL Server works on 8K pages (see the sketch below).
  ◦ Sequential IO benchmarks usually use 128K.
     SQL Server works on 64K to 128MB.
  ◦ Queue depths are set high.
     SQL Server is usually configured for a low queue depth.
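
As a sanity check, a vendor’s 4K random-IO number can be translated to SQL Server’s 8K page size, assuming the drive is bandwidth-bound at that transfer size (a rough, hypothetical conversion; latency-bound drives won’t scale this cleanly):

-- If a drive sustains 50,000 IOPS at 4 KB, the same bandwidth supports
-- at most ~25,000 IOPS at SQL Server's 8 KB page size.
DECLARE @vendor_iops_4k int = 50000;  -- hypothetical spec-sheet number

SELECT  @vendor_iops_4k * 4 / 1024.0 AS vendor_mb_per_sec,    -- ~195 MB/s
        (@vendor_iops_4k * 4) / 8    AS upper_bound_8k_iops;  -- 25000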
• SLC is ready “out of the box.”
  – Requires much less infrastructure on disk to support robust write environments.
• MLC needs some help.
  – Requires lots of spare area and smarter controllers to handle the extra ECC.
  – eMLC has all management functions built onto the chip.
• Both are configured similarly.
  – RAID of chips.
  – TRIM, GC and Wear-Leveling.
 Longevity between devices can be huge.
 Consumer grade drives are consumable.
  ◦ Aren’t rated for full drive writes.
     Desktop drives are usually tested on a fraction of drive capacity!
  ◦ Aren’t rated for continuous writes.
     It may say a three-year life span.
     Could be much shorter; look at total writes (a worked estimate follows below).
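
A back-of-the-envelope endurance estimate, using the P/E cycle counts from earlier (all inputs are hypothetical, and real drives add over-provisioning on top of this):

-- Hypothetical 256 GB MLC drive rated ~5,000 P/E cycles, with 50 GB of
-- host writes per day and a write amplification factor of 2.5.
DECLARE @capacity_gb     float = 256.0;
DECLARE @pe_cycles       float = 5000.0;
DECLARE @daily_writes_gb float = 50.0;
DECLARE @write_amp       float = 2.5;

SELECT  @capacity_gb * @pe_cycles AS lifetime_nand_writes_gb,                                 -- 1,280,000
        @capacity_gb * @pe_cycles / (@daily_writes_gb * @write_amp) / 365.0 AS years_of_life; -- ~28

Numbers like these are why total writes matter more than a quoted life span: shrink the capacity or raise the daily write volume and the years fall off quickly.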
• SAS is the king of your heavy workloads.
• Command Queuing
  – SAS supports up to 2^16, usually capped at 64.
  – SATA supports up to 32.
• Error recovery and detection.
  – SMART isn’t.
  – The SCSI command set is better.
• Duplex
  – SAS is full duplex and dual-ported per drive.
  – SATA is half duplex and single-ported.
• Multi-path IO
  – Native to SAS at the drive level.
  – Available to SATA via expanders.
• Flash comes in lots of form factors.
  • Standard 2.5” and 3.5” drives.
  • Fibre attached
    • Texas Memory System RAM-SAN 620
    • Violin Memory
  • PCIe add-in cards.
    • Few “native” cards.
      • Fusion-io
      • Texas Memory System RAM-SAN 20
    • Bundled solutions.
      • LSI SSS6200
      • OCZ Z-Drive
      • OCZ Revodrive
  • PCIe to disk
    • 2.5” form factor and plugs
    • Skips SAS/SATA for direct PCIe lanes.
 You MUST understand your workloads.
  ◦ Monitor virtual file stats (a minimal query sketch follows this list).
     http://sqlserverio.com/2011/02/08/gather-virtualfile-statistics-using-t-sql-tsql2sday-15/
     Track random vs. sequential.
     Track size of transfers.
  ◦ Capture IO patterns.
     http://sqlserverio.com/2010/06/15/fundamentalsof-storage-systems-capturing-io-patterns/
  ◦ Benchmark!
     http://sqlserverio.com/2010/06/15/fundamentalsof-storage-testing-io-systems/
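
A minimal sketch of the kind of query those posts build on, using the sys.dm_io_virtual_file_stats DMV (the derived averages are illustrative; see the linked scripts for the full treatment):

-- Cumulative IO volume and stall time per database file since SQL Server
-- last started; average transfer sizes hint at random vs. sequential IO.
SELECT  DB_NAME(vfs.database_id) AS database_name,
        mf.physical_name,
        vfs.num_of_reads,
        vfs.num_of_writes,
        vfs.io_stall_read_ms,
        vfs.io_stall_write_ms,
        vfs.num_of_bytes_read    / NULLIF(vfs.num_of_reads, 0)  AS avg_bytes_per_read,
        vfs.num_of_bytes_written / NULLIF(vfs.num_of_writes, 0) AS avg_bytes_per_write
FROM    sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN    sys.master_files AS mf
        ON  mf.database_id = vfs.database_id
        AND mf.file_id     = vfs.file_id
ORDER BY vfs.io_stall_read_ms + vfs.io_stall_write_ms DESC;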
• From new
  – Best possible performance.
  – The drive will never be this fast again.
• Previous writes affect future reads.
  – Large sequential writes are nice for GC.
  – Small random writes slow GC down.
  – Wait for GC to catch up when benching a drive.
    • Give the GC time to settle when going from small random to large sequential, or vice versa.
    • Steady state is what we are after.
• Performance slows over time.
  – Cells wear out.
    • Causes multiple attempts to read or write.
    • ECC saves you, but the IO is still spent.
• Not all drives are equal.
• Understand that drives are tuned for workloads.
  – Desktop drives don’t favor 100% random writes…
  – Enterprise drives are expected to get punished.
• Fix it with firmware.
  – Most drives will have edge cases.
    • OCZ and Intel suffered poor performance after drive use over time.
    • Be wary of updates that erase your drive.
  – Gives you a temporary performance boost.
• Flash read performance is great, sequential or random.
• Flash write performance is complicated, and can be a problem if you don’t manage it.
• Flash wears out over time.
  – Not nearly the issue it used to be, but you must understand your write patterns.
  – Plan for over-provisioning and TRIM support.
    • It can have a huge impact on how much storage you actually buy.
  – Flash can be error prone.
    • Be aware that writes and reads can cause data corruption.
Solid State Storage Deep Dive
Wes Brown