Thin Servers with Smart Pipes
Thin Servers with Smart Pipes:
Designing SoC Accelerators for
Memcached
BOHUA KOU
JING GAO
Overview
Motivation & Introduction to Memcached Servers
Performance Analysis on Existing Servers
Identifying Inefficiencies and Bottlenecks
Thin Servers with Smart Pipes (TSSP) Architecture
TSSP Performance Results
Conclusion & Questions
2
Internet Service Workload
Internet services and data grow rapidly
Database servers alone cannot maintain
performance at a justifiable cost
Other types of infrastructure needed to
allow the modern internet services to scale
3
Memcached Servers
Distributed key-value stores
Implemented using a hash table; keys are unique, and each
key maps to exactly one server at any time
Easy to scale (connecting to additional servers is simple)
Attractive because server interface is general
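As a rough illustration of "each key maps to only a single server", the sketch below picks a server by hashing the key on the client side. The function name `pick_server`, the server addresses, and the hash-modulo scheme are assumptions made for this example; the slides do not say how keys are distributed, and real clients often use consistent hashing instead.

```python
import hashlib


def pick_server(key: bytes, servers: list[str]) -> str:
    """Map a key to exactly one server (hypothetical hash-modulo scheme)."""
    digest = hashlib.md5(key).digest()
    # Use the first 8 bytes of the digest as an integer, then wrap around
    # the number of servers so every key lands on exactly one of them.
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


# Hypothetical server pool; adding an entry here is all that "scaling" needs
# on the client side, although it remaps many keys under plain modulo.
servers = ["cache-a:11211", "cache-b:11211", "cache-c:11211"]
owner = pick_server(b"user:42", servers)
```

The same key always resolves to the same server as long as the server list is unchanged, which is what lets independent clients agree on where an object lives.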
4
Workloads
Memcached behavior can vary considerably based on the size and access frequency of the
objects it stores
Designed five workloads that capture a wide range of behavior to explore bottlenecks and
inefficiencies with existing servers.
FixedSize:
fixed object size and uniform popularity distribution; small objects place the
greatest stress on Memcached performance
MicroBlog:
object size and popularity distribution are based on a sample of “tweets” collected from
Twitter
Wiki:
uses the entire Wikipedia database; each object represents an individual article
in HTML format
ImgPreview:
samples thumbnail photos and associated view counts from Flickr
FriendFeed:
same as MicroBlog
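To make the FixedSize description concrete, here is a minimal sketch of a request generator with a fixed object size and uniform key popularity. The function name, the key format, and the tuple layout are hypothetical; the slides do not describe the actual load generator.

```python
import random


def fixed_size_requests(num_keys, object_size, num_requests, seed=0):
    """Yield GET requests with uniform key popularity and one object size."""
    rng = random.Random(seed)  # seeded so a run can be repeated
    for _ in range(num_requests):
        # randrange picks every key with equal probability (uniform popularity).
        key = f"key-{rng.randrange(num_keys)}"
        yield ("GET", key, object_size)


# Example: 1000 requests over 100 keys, each object 64 bytes.
requests = list(fixed_size_requests(num_keys=100, object_size=64, num_requests=1000))
```

The other workloads would replace the uniform choice and the constant size with distributions sampled from the corresponding data sets.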
5
Workload Characteristics
6
System Under Test
Two different server systems
High-end Xeon-based server vs low-power Atom-based server
Three different classes of network interface cards (NIC)
A total of 21 different system configurations
7
Simulation Results
(CPU Bottlenecks)
On the left: requests per second for GET requests to fixed-size
objects; the gap becomes larger when object size is below 1 KB
On the right: the performance bottleneck shifts from the network to
the CPU
Xeon-class systems operate at
less than 1/8 of their theoretical
instruction throughput
Atom-class systems operate at 1/16 of their
theoretical instruction
throughput
8
Simulation Results
(Caching Bottlenecks)
L1 Icache miss rates are very high
L1 Dcache and L2 perform
moderately well
9
Simulation Results
(TLB and Branch Prediction Bottlenecks)
TLB behavior for Xeon is comparable to recent characterization work
Atom provides an insufficient ITLB, which causes most of its instruction fetch stalls
Branch misprediction rates are high for Atom because its branch predictor is less capable than Xeon's
10
Simulation Results
(Impact of NIC quality)
11
Simulation Results
Even though the Atom is a low-power processor with a significantly lower peak power than the Xeon, its power
efficiency is worse
12
Overcoming Bottlenecks:
Bottleneck: GET operation
The most latency- and throughput-critical Memcached task
Up to 30:1 GET/SET ratio
Microarchitecture Solution: Thin Servers with Smart Pipes (TSSP)
Shift GET processing from the core to a hardware pipeline
Use UDP instead of TCP
Thin Servers = Embedded class cores
Smart Pipes = Integrated, on-die NIC with nearby hardware
13
TSSP Architecture Overview
Atom Based
Low power core
Integrated NIC and Accelerator
Integrated NIC
Handle incoming requests
GET Request
MMU
Virtual address translation
All the other requests
Memcached Accelerator
Respond to GET request without software
interaction
All other request types and memory
management are handled by software
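The split described above can be sketched as a small routing function. This is only an illustration under stated assumptions: it assumes GET requests arrive over UDP in Memcached's ASCII form (`get <key>`), that the payload passed in is the command itself with any transport framing already removed, and that the names `route`, `"accelerator"`, and `"software"` are placeholders invented for this example.

```python
def route(protocol: str, payload: bytes) -> str:
    """Decide which path serves a request (hypothetical model of the split)."""
    # Assumption: only UDP GET requests take the hardware path.
    if protocol == "udp" and payload.startswith(b"get "):
        return "accelerator"
    # SET, DELETE, TCP traffic, and memory management stay in software.
    return "software"


print(route("udp", b"get user:42\r\n"))              # hardware path
print(route("udp", b"set user:42 0 0 3\r\nabc\r\n"))  # software path
print(route("tcp", b"get user:42\r\n"))              # software path
```

Keeping the hardware path this narrow is what allows the accelerator to stay small: it needs only the lookup logic, while the core retains everything that changes the data structure.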
14
NIC and Memcached Accelerator
[Block diagram; recoverable labels:]
NIC: routes based on IP address, port and protocol (TCP)
Memcached accelerator: processes GET requests using a hardware-traversable hash table
Other blocks: data center network, system MMU, SoC fabric
15
Hardware and Software Data Structure
Variable-length key (1-255 bytes) -> 64-bit identifier key’
Split Memcached’s lookup structure (hash table) from the
key and value storage (slab-allocated memory)
4 possible slots for a given hash
Need to update the key’s last-access timestamp
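The points above can be sketched in software as follows. This is a toy model, not the hardware design: the use of BLAKE2b to derive the 64-bit identifier, the class and field names, the Python list standing in for slab-allocated memory, and the evict-oldest policy when all four slots are full are all assumptions made for illustration.

```python
import hashlib
import time

SLOTS_PER_BUCKET = 4  # "4 possible slots for a given hash"


def key_id(key: bytes) -> int:
    """Reduce a 1-255 byte key to a 64-bit identifier (key')."""
    if not 1 <= len(key) <= 255:
        raise ValueError("key must be 1-255 bytes")
    return int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big")


class SplitTable:
    """Lookup table holding only identifiers; keys and values live elsewhere."""

    def __init__(self, num_buckets: int = 1024):
        # Each slot is None or [key_id, storage_index, last_access].
        self.buckets = [[None] * SLOTS_PER_BUCKET for _ in range(num_buckets)]
        self.storage = []  # stand-in for slab-allocated key/value memory

    def _bucket(self, kid: int):
        return self.buckets[kid % len(self.buckets)]

    def set(self, key: bytes, value: bytes) -> None:
        kid = key_id(key)
        bucket = self._bucket(kid)
        for slot in bucket:
            if slot is not None and slot[0] == kid:
                # Same identifier already present: overwrite in place.
                self.storage[slot[1]] = (key, value)
                slot[2] = time.monotonic()
                return
        self.storage.append((key, value))
        entry = [kid, len(self.storage) - 1, time.monotonic()]
        for i, slot in enumerate(bucket):
            if slot is None:
                bucket[i] = entry
                return
        # All four slots taken: replace the least recently accessed one
        # (assumed policy; the evicted storage entry is not reclaimed here).
        oldest = min(range(SLOTS_PER_BUCKET), key=lambda i: bucket[i][2])
        bucket[oldest] = entry

    def get(self, key: bytes):
        kid = key_id(key)
        for slot in self._bucket(kid):
            if slot is not None and slot[0] == kid:
                stored_key, value = self.storage[slot[1]]
                if stored_key == key:  # guard against identifier collisions
                    slot[2] = time.monotonic()  # update last-access timestamp
                    return value
        return None
```

A lookup touches at most four fixed-size slots before following one index into value storage, which is the property that makes the table practical to traverse in hardware.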
16
Evaluation Results
Implementation
Memcached as an FPGA appliance
Table: TSSP energy efficiency comparison for FixedSize Workload
Area
8,000 lookup tables (LUTs), about 2% of the FPGA
Power
Estimated TSSP total power = power of the Xilinx Zynq SoC
platform + power of the non-CPU components of the Atom-based system
Performance
Measured processing time for Memcached
workloads, e.g., FixedSize
Greater improvements are expected if implemented as an
ASIC rather than an FPGA
17
Conclusion
Identified several system bottlenecks in both high-performance and low-power CPUs
Proposed the TSSP design: low-power embedded-class core + Memcached
accelerator
Potential 6x to 16x improvement in energy efficiency over existing servers
18
Discussion
The evaluation of TSSP used only the FixedSize workload for comparison. Is that enough?
The analysis of TSSP power consumption is weak. Is the estimated power reliable?
TSSP also modifies the software. Is the change easy to make?
19