NVMe - Myoungsoo Jung


An In-Depth Study of Next Generation Interface for Emerging Non-Volatile Memories
Wonil Choi, Jie Zhang, Shuwen Gao, Jaesoo Lee, Myoungsoo Jung, Mahmut Kandemir
Yonsei University
Executive Summary
Motivation
To fully utilize the potential performance of Non-Volatile Memories (NVM),
Non-Volatile Memory Express (NVMe) was recently proposed.
Challenge
Its wide range of design parameters has not been fully explored in the
literature. Hence, it remains unclear what system considerations and
limitations need to be taken into account.
Modeling
Due to the absence of publicly available tools, we developed an analytical
simulation model based on the NVMe specifications. Our model can characterize
NVMe under a wide variety of storage settings, including physical bus performance,
NVM type, and queue count/depth.
Exploration
Using our model, we explored various NVMe design parameters that can
affect I/O response time and system throughput. We also present key
observations regarding communication overheads and queue configurations.
Background and Motivation:
Advent of NVMe and
its Well-known Characteristics
Need for High-Performance Interfaces
[Figure: the storage interface bridges the host system and the storage system (NVMs); as more resources and more parallelism raise storage performance, the interface becomes the bottleneck, motivating a move from SATA/SAS to the higher-bandwidth PCIe]
• The storage interface serves as a bridge between host and storage
– Traditional SATA and SAS have been widely employed
• Storage-internal bandwidths keep increasing
– Thanks to increased resources and parallelism
• Traditional interfaces fail to deliver these very high bandwidths
• The trend has shifted from upgrading traditional interfaces to devising new high-performance interfaces
PCI Express & NVM Express
PCI Express (PCIe)
• A high-speed physical interconnect (proposed by PCI-SIG)
• Widely adopted for computer system extension
– GPU connection & SSD connection
NVM Express (NVMe)
• A brand-new logical device interface (proposed by NVMHCI)
• Designed to exploit the potential of high-performance NVMs
and to standardize PCIe-based memory interfaces
– Samsung NVMe 96X series, Intel SSD DC series, HGST Ultrastar SN series,
Micron 9100 series, Huawei ES series, etc.
• Our focus in this work is exploring NVMe on top of PCIe
NVMe's Streamlined Communication
[Figure: the traditional interface I/O stack vs. the NVMe interface I/O stack, adapted from "The Linux I/O Stack Diagram"]
NVMe’s Rich Queuing Mechanism
• Traditional interfaces provide a single I/O queue with tens of entries
– Native Command Queuing (NCQ) with 32 entries
• NVMe strives to increase throughput by providing a scalable number
of queues, each with a scalable number of entries
– Up to 64K queues with up to 64K entries each
• NVMe queues are configured in host-side memory
– Pairs of a Submission Queue (SQ) and a Completion Queue (CQ)
– Allocated per core, per process, or per thread (see the sketch below)
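To make the queue organization concrete, here is a minimal, hypothetical sketch (not the paper's code) of how a host might allocate per-core SQ/CQ pairs within the NVMe limits of up to 64K queues and 64K entries; the class and field names are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List

NVME_MAX_QUEUES = 65536       # up to 64K I/O queues per the NVMe specification
NVME_MAX_QUEUE_DEPTH = 65536  # up to 64K entries per queue per the specification

@dataclass
class QueuePair:
    """One Submission Queue (SQ) / Completion Queue (CQ) pair in host memory."""
    qid: int
    depth: int
    sq: List[dict] = field(default_factory=list)  # pending commands
    cq: List[dict] = field(default_factory=list)  # posted completions

def create_io_queues(num_cores: int, depth: int) -> List[QueuePair]:
    """Allocate one SQ/CQ pair per core, clamped to the NVMe limits."""
    num_queues = min(num_cores, NVME_MAX_QUEUES)
    depth = min(depth, NVME_MAX_QUEUE_DEPTH)
    return [QueuePair(qid=i + 1, depth=depth) for i in range(num_queues)]

# Example: a 16-core host with 1024-entry queues (the admin queue is omitted).
queues = create_io_queues(num_cores=16, depth=1024)
print(len(queues), queues[0].depth)  # 16 1024
```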
We Want to Know Real Characteristics
• Question (1), regarding the streamlined communication:
How much overhead does the NVMe communication add
for different types of NVMs when processing an I/O request?
• Question (2), regarding the rich queuing mechanism:
Is it possible to scalably extract performance improvements
as the number of queues and queue entries increases?
• There is no publicly available tool to characterize NVMe
– Its design parameters have not been studied in the literature
– NVMe design considerations and limitations remain unclear
We propose an analytical NVMe model
to uncover its real characteristics
Preliminaries:
PCIe/NVMe Operations
Memory Stack Architecture
[Figure: on the host side, I/O threads (on cores) and the NVMe driver maintain submission/completion (SUB/CPL) queues in host memory; NVMe information and RD/WR data cross the PCIe bus to the storage side, where the NVMe controller (doorbell registers) and the NVM controller access the NVM (block- or byte-addressable)]
Communication Protocol
[Figure: time flow of the NVMe communication protocol between the host side and the SSD side]
• [I/O Read] DB-Write → IO-Req → IO-Fetch → SSD-PROC / SSD RD → RD-DMA → CPL-Submit → MSI
• [I/O Write] DB-Write → IO-Req → IO-Fetch → WR-DMA → SSD-PROC / SSD WR → CPL-Submit → MSI
PCIe Bus & Packet Transfer
[Figure: the host system and the SSD are connected by a PCIe bus (v1 ~ v4) with 1 to 32 lanes]
• Per-lane PCIe bandwidth by version:
  v1 x1   250 MB/s
  v2 x1   500 MB/s
  v3 x1   1 GB/s
  v4 x1   2 GB/s
• Packet types and sizes used in NVMe communication:
  NVMe-information packets (Transaction Layer Packets, TLPs):
  DB-Write 24 B, IO-Req 24 B, IO-Fetch 20 B, CPL-Submit 20 B, MSI 20 B
  Data transfer packets:
  DMA (TLP) 4 KB, ACK (Data Link Layer Packet, DLLP) 8 B
(A transfer-time sketch follows below.)
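As a rough aid for reading the later latency breakdowns, the sketch below estimates how long each packet occupies the bus, assuming transfer time ≈ packet size / effective link bandwidth and ignoring PCIe encoding, flow control, and DLLP overheads; the per-lane bandwidths and packet sizes are taken from the slide, everything else is a simplifying assumption.

```python
# Simplified PCIe transfer-time estimate: time = bytes / (lanes * per-lane bandwidth).
PER_LANE_BW = {          # bytes per second, per the slide's per-lane figures
    "v1": 250e6,
    "v2": 500e6,
    "v3": 1e9,
    "v4": 2e9,
}

PACKET_BYTES = {         # NVMe-information and data packets from the slide
    "DB-Write": 24, "IO-Req": 24, "IO-Fetch": 20,
    "CPL-Submit": 20, "MSI": 20, "DMA-4KB": 4096, "ACK": 8,
}

def transfer_time(packet: str, version: str = "v2", lanes: int = 1) -> float:
    """Return the bus occupancy (seconds) of one packet on a given PCIe link."""
    return PACKET_BYTES[packet] / (PER_LANE_BW[version] * lanes)

# Example: a 4KB DMA vs. a 24B doorbell write on a v2 x1 link.
print(f"DMA 4KB : {transfer_time('DMA-4KB') * 1e6:.2f} us")   # ~8.19 us
print(f"DB-Write: {transfer_time('DB-Write') * 1e9:.1f} ns")  # ~48 ns
```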
Our NVMe Model:
Based on Four Components
Our Analytical Model
[Figure: the analytical model comprises four components: (1) I/O Request, (2) PCIe Bus, (3) Host System, and (4) NVM]
• (1) I/O Request Model
– Based on the handshaking process of NVMe
– T_Read-I/O = T_DB-Write + T_IO-Req + T_IO-Fetch + T_CPL-Submit + T_MSI + T_Stall + T_RD-NVM + T_RD-DMA
– T_Write-I/O = T_DB-Write + T_IO-Req + T_IO-Fetch + T_CPL-Submit + T_MSI + T_Stall + T_WR-NVM + T_WR-DMA
– T_NVMeInfo = T_DB-Write + T_IO-Req + T_IO-Fetch + T_CPL-Submit + T_MSI + T_Stall
• Key question: how does T_NVMeInfo compare with T_NVM + T_DMA? (A sketch of this model follows below.)
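Here is a minimal sketch of the I/O request model above, assuming the per-phase times are supplied as inputs; the function names and the example phase times are my own (the phase times plug in the v2 x1 packet estimates and the block-NVM read latency used later, purely for illustration, not the paper's reported numbers).

```python
def nvme_info_time(t_db_write, t_io_req, t_io_fetch, t_cpl_submit, t_msi, t_stall=0.0):
    """T_NVMeInfo: time spent on NVMe handshaking packets, excluding NVM access and DMA."""
    return t_db_write + t_io_req + t_io_fetch + t_cpl_submit + t_msi + t_stall

def read_io_time(t_nvme_info, t_rd_nvm, t_rd_dma):
    """T_Read-I/O = T_NVMeInfo + T_RD-NVM + T_RD-DMA."""
    return t_nvme_info + t_rd_nvm + t_rd_dma

def write_io_time(t_nvme_info, t_wr_nvm, t_wr_dma):
    """T_Write-I/O = T_NVMeInfo + T_WR-NVM + T_WR-DMA."""
    return t_nvme_info + t_wr_nvm + t_wr_dma

# Illustrative only: phase times (seconds) for a 4KB read on a hypothetical v2 x1 link.
info = nvme_info_time(48e-9, 48e-9, 40e-9, 40e-9, 40e-9)
total = read_io_time(info, t_rd_nvm=30e-6, t_rd_dma=8.19e-6)
print(f"NVMe-info fraction: {info / total:.2%}")  # small for a block NVM read
```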
Our Analytical Model
• (2) PCIe Bus Model
– Scalable performance (versions, lanes)
– Only one packet can be on the bus at a time → queues for packets
• (3) Host System Model
– Queue count, queue depth
– I/O submission rates
• (4) NVM Model
– Block-addressable NVM (flash), byte-addressable NVM (PCM, STT-MRAM)
– Read/write latencies, processing unit sizes (see the configuration sketch below)
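The sketch below restates components (2)-(4) as simple parameter objects. The class and field names are my own, and the assumption that a request is serviced one base unit at a time, serially, is a simplification for illustration rather than the paper's simulator.

```python
from dataclasses import dataclass

@dataclass
class PCIeBus:
    """Component (2): the bus serializes packets; only one packet occupies it at a time."""
    bandwidth_bps: float                 # e.g., v2 x1 = 500e6 B/s per the earlier slide

@dataclass
class HostSystem:
    """Component (3): queue configuration and I/O submission behavior."""
    queue_count: int
    queue_depth: int
    io_interval_s: float                 # gap between submissions per thread

@dataclass
class NVMDevice:
    """Component (4): block- or byte-addressable NVM."""
    base_unit_bytes: int                 # 4 KB for block NVM, 64 B for DRAM-like NVM
    read_latency_s: float
    write_latency_s: float

    def access_time(self, size_bytes: int, is_write: bool) -> float:
        """Latency to service a request, processed one base unit at a time (serially)."""
        units = max(1, -(-size_bytes // self.base_unit_bytes))  # ceiling division
        per_unit = self.write_latency_s if is_write else self.read_latency_s
        return units * per_unit

# Example: an 8KB read on a block NVM covers two 4KB base units.
block_nvm = NVMDevice(base_unit_bytes=4096, read_latency_s=30e-6, write_latency_s=200e-6)
print(block_nvm.access_time(8192, is_write=False))  # 6e-05 s, i.e., 60 us
```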
Experiments & Observations
Experimental Setup
• Our model features a broad range of design parameters
• But our goal here is to uncover NVMe's true characteristics
• (1) We configure two representative NVM SSDs:
                   Block-based NVM (block-addressable)   DRAM-like NVM (byte-addressable)
  Base Unit Size   4 KB                                  64 B
  R/W Latencies    30 us (read) / 200 us (write)         50 ns (read) / 1 us (write)
• (2) We configure two representative micro-benchmarks:
                   Block Access                          Byte Access
  Size Range       4 KB ~ 2 MB                           8 B ~ 1024 B
  I/O Interval     10 us                                 100 ns
These parameter values can vary significantly in practice, but we fix the representative values above to evaluate NVMe (a configuration sketch in code follows below).
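For reference, the two SSD and micro-benchmark configurations above can be captured as plain parameter sets; this is just the table restated in code, with dictionary keys of my own naming.

```python
# The two NVM SSD configurations from the table above (times in seconds, sizes in bytes).
NVM_CONFIGS = {
    "block_nvm":     {"base_unit": 4096, "read_lat": 30e-6, "write_lat": 200e-6},
    "dram_like_nvm": {"base_unit": 64,   "read_lat": 50e-9, "write_lat": 1e-6},
}

# The two micro-benchmark configurations from the table above.
BENCHMARKS = {
    "block_access": {"size_range": (4096, 2 * 1024 * 1024), "io_interval": 10e-6},
    "byte_access":  {"size_range": (8, 1024),               "io_interval": 100e-9},
}

# Each NVM type is driven by its matching access pattern.
PAIRINGS = {"block_nvm": "block_access", "dram_like_nvm": "byte_access"}
```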
Communication Overhead Analysis:
Block NVM + NVMe
• Our interest: T_NVMeInfo vs. T_SSD + T_DMA in block NVMs?
• Y-axis: latency breakdown of a single 4KB R/W I/O request
• X-axis: varying PCIe bus performance
• T_NVMeInfo: 0.15% (read) & 0.03% (write) of total latency
In block NVMs, NVMe communication is NOT a burden
Communication Overhead Analysis:
DRAM-like NVM + NVMe
• How about in DRAM-like NVMs?
• Y-axis: latency breakdown of a single 64B R/W I/O request
• T_NVMeInfo: 44% (read) & 4% (write) of total latency (V2x1 ~ V2x16)
In DRAM-like NVMs, the fraction of NVMe information is REMARKABLE
Communication Overhead Analysis:
DRAM-like NVM + NVMe
[Figure panels: 64B Read on a V2x1 PCIe bus | 64B Write on a V2x1 PCIe bus]
• Latency breakdown of a 64B RD/WR request on a V2x1 PCIe bus
• Read: T_NVMeInfo (58%) vs. T_DMA + T_SSD (42%)
• Write: T_NVMeInfo (20%) vs. T_DMA + T_SSD (80%)
• As PCIe performance increases, this high overhead can be hidden
• The number of I/O requests also increases in DRAM-like NVMs
NVMe ISR Overhead
[Figure: an MSI sent from the SSD side triggers interrupt-handling overhead on the host side]
• At the host, an Interrupt Service Routine (ISR) is triggered by each MSI
• An ISR execution imposes a long CPU intervention (overhead)
• I/O generation rates: DRAM-like NVM (20 ~ 100 ns) vs. block NVM (2 ~ 10 us)
• I/O threads (or SQs): 1 ~ 32
ISR invocations become too frequent in DRAM-like NVMs (see the arithmetic sketch below)
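To put these rates in perspective, here is a back-of-the-envelope calculation assuming one MSI, and hence one ISR, per completed I/O (that one-to-one mapping is my simplifying assumption) and using the per-thread I/O intervals quoted on the slide.

```python
# Rough ISR rate estimate: one MSI/ISR per completed I/O (a simplifying assumption).
def isr_rate(io_interval_s: float, num_threads: int) -> float:
    """Interrupts per second across all I/O threads."""
    return num_threads / io_interval_s

# DRAM-like NVM: one I/O every 100 ns per thread; block NVM: one every 10 us.
print(f"DRAM-like, 32 threads: {isr_rate(100e-9, 32):.2e} ISRs/s")  # ~3.2e8
print(f"Block,     32 threads: {isr_rate(10e-6, 32):.2e} ISRs/s")   # ~3.2e6
```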
Queue Depth Analysis: Throughput
[Figure panels: Block NVM with a V2x16 PCIe bus | DRAM-like NVM with V2x16]
• Our interest: does throughput scale as queue depth increases?
• Queue depth (entries): 1 ~ 65536 (the maximum in the NVMe specification)
• Request sizes: Block NVM (8 KB ~ 1 MB) & DRAM-like NVM (8 B ~ 1024 B)
Simply working with a very large queue does not necessarily
improve storage throughput in a scalable fashion
Queue Depth Analysis: Saturation Point
• Our interest: what is the queue-depth limit?
• PCIe bus performance: V2x1 ~ V4x32
• Saturation point: Block NVM (700) & DRAM-like NVM (10)
In DRAM-like NVMs, the high-bandwidth PCIe bus and the streamlined
NVMe protocol keep the saturating queue depth quite low
Queue Depth Analysis: Latency
[Figure panels: Block NVM with a V2x16 PCIe bus | DRAM-like NVM with V2x16]
• Users can keep filling the queue with I/O requests regardless of the saturation point
• This severely hurts I/O latencies in both block and DRAM-like NVMs
• Once the PCIe bus bandwidth runs out, all NVMe-info and DMA packets
are stalled waiting for bus service (see the queueing sketch below)
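The latency blow-up past the saturation point follows from basic queueing reasoning (Little's Law: latency = outstanding requests / throughput). The sketch below is a deliberately simplified bottleneck model with illustrative numbers, not the paper's simulator or its results.

```python
# Simplified bottleneck model: the device/bus completes at most `max_iops` requests
# per second; beyond the depth needed to saturate it, throughput stays flat and
# latency grows linearly with queue depth (Little's Law).
def throughput_and_latency(queue_depth: int, service_time_s: float, max_iops: float):
    unsaturated_iops = queue_depth / service_time_s   # perfect pipelining over the depth
    iops = min(unsaturated_iops, max_iops)
    latency = queue_depth / iops                      # Little's Law
    return iops, latency

# Illustrative numbers only: 30 us service time, 200K IOPS bottleneck.
for depth in (1, 8, 64, 512, 4096):
    iops, lat = throughput_and_latency(depth, service_time_s=30e-6, max_iops=200_000)
    print(f"depth={depth:5d}  IOPS={iops:9.0f}  latency={lat*1e6:8.1f} us")
```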
Queue Count Analysis: Throughput
[Figure panels: Block NVM with a V2x16 PCIe bus | DRAM-like NVM with V2x16]
• Our interest: does throughput scale as the queue count increases?
• Queue (thread) count: 1 ~ 65536 (the maximum in the NVMe specification)
• Request sizes: Block NVM (4 ~ 32 KB) & DRAM-like NVM (8 ~ 1024 B)
• Block NVM: the queue count can keep increasing until PCIe bandwidth runs out
• DRAM-like NVM: throughput saturates at a small queue count
Conclusions
• To better utilize NVMs, PCIe and NVMe are gaining attention as the
physical interface and the logical interface, respectively
• Due to the lack of studies in the literature, NVMe characteristics and
design considerations remain unclear
• To uncover NVMe's true characteristics, we proposed an analytical
model based on the PCIe/NVMe specifications
• Key observations
– (1) NVMe communication overhead in DRAM-like NVMs is not lightweight
– (2) Frequent ISR invocations in DRAM-like NVMs generate a big burden
– (3) Performance does not scale with increasing queue depth/count
– (4) Latencies suffer if the queue depth/count goes beyond the saturation point
Thanks a lot for your interest
Backup Slides
Queue Count Analysis: Saturation Point
• Our interest: what is the queue-count limit?
• PCIe bus performance: V2x1 ~ V4x32
• Saturation point: Block NVM (256) & DRAM-like NVM (16)
• Contrary to the expectation that more threads would be allowed and
beneficial for the DRAM-like NVM, at most 16 threads perform
best in the majority of cases