Exploring CPU, Memory and I/O Architectures for the 2015


I/O Acceleration in Server
Architectures
Laxmi N. Bhuyan
University of California, Riverside
http://www.cs.ucr.edu/~bhuyan
Acknowledgement
• Many slides in this presentation have been
taken from (or modified from) Li Zhao’s
Ph.D. dissertation at UCR and Ravi Iyer’s
(Intel) presentation at UCR.
• The research has been supported by NSF,
UC Micro and Intel Research.
Enterprise Workloads
• Key Characteristics
– Throughput-Oriented
• Lots of transactions, operations, etc. in flight
• Many VMs, processes, threads, fibers, etc.
• Scalability and Adaptability are key
– Rich (I/O) Content
• TCP, SoIP, SSL, XML
• High Throughput Requirements
• Efficiency and Utilization are key
Rich I/O Content in the Enterprise
• Trends
– Increasing layers of processing on I/O data
• Business-critical functions (TCP, IP storage, security, XML, etc.)
• Independent of actual application processing
• Exacerbated by high network rates
– High rates of I/O bandwidth with new technologies
• PCI-Express technology
• 10 Gb/s to 40 Gb/s network technologies, and it just keeps going
[Diagram: layered I/O processing stack on the platform – App, XML, SSL, iSCSI, TCP/IP – sitting above the network data path]
Network Protocols
• TCP/IP protocols – 4 layers
• OSI Reference Model – 7 layers

OSI layer          TCP/IP layer      Examples
7  Application     4  Application    HTTP, Telnet
6  Presentation                      XML
5  Session                           SSL
4  Transport       3  Transport      TCP, UDP
3  Network         2  Internet       IP, IPSec, ICMP
2  Data Link       1  Link           Ethernet, FDDI
1  Physical                          Coax, Signaling

[Chart: execution-time breakdown (Application, XML, SSL, TCP) for a front-end server, a secure server, and a financial server]

SSL and XML are in the Session Layer
Situation even worse with virtualization
Virtualization Overhead: Server Consolidation
• Server consolidation is when both guests run on the same physical hardware
• Server-to-server latency & bandwidth comparison under 10 Gbps
[Chart: ping-pong latency (us) vs. message size (1 B to 4 KB), Inter-VM (same host) compared with Linux-to-Linux]
[Chart: Inter-VM bandwidth (Gbps) vs. message size (40 B to 64 KB) compared with native Linux-to-Linux]
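Ping-pong latency curves like the ones above are typically produced with a small echo microbenchmark. As a minimal sketch (not the tool used for these slides) of the client side in C; SERVER_IP, PORT, MSG_SIZE and ITERS are placeholder assumptions, and an echo server is assumed to be listening already:

/* Minimal TCP ping-pong round-trip timer (client side). Error handling is
 * kept to a bare minimum; all constants are illustrative placeholders. */
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

#define SERVER_IP "192.168.1.2"
#define PORT      5001
#define MSG_SIZE  1024
#define ITERS     10000

int main(void) {
    char buf[MSG_SIZE];
    memset(buf, 0, sizeof(buf));

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(PORT);
    inet_pton(AF_INET, SERVER_IP, &addr.sin_addr);
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return 1;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++) {
        send(fd, buf, MSG_SIZE, 0);                   /* ping */
        for (int got = 0; got < MSG_SIZE; ) {         /* wait for the full echo */
            ssize_t n = recv(fd, buf + got, MSG_SIZE - got, 0);
            if (n <= 0) return 1;
            got += (int)n;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double total_us = ((t1.tv_sec - t0.tv_sec) * 1e9 +
                       (t1.tv_nsec - t0.tv_nsec)) / 1e3;
    printf("average round-trip latency: %.2f us\n", total_us / ITERS);
    close(fd);
    return 0;
}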
Communicating with the Server: The O/S Wall
[Diagram: application (user space) and protocol stack (kernel space) on the CPU, connected to the NIC over the PCI bus]
Problems:
• O/S overhead to move a packet between the network and the application level => protocol stack (TCP/IP)
• O/S interrupts
• Data copying from kernel space to user space and vice versa
• Oh, the PCI bottleneck!
The Send Operation
• The application writes the transmit data to the TCP/IP sockets interface for transmission, in payload sizes ranging from 4 KB to 64 KB.
• The data is copied from user space to kernel space.
• The OS segments the data into maximum transmission unit (MTU)–size packets and adds TCP/IP header information to each packet.
• The OS copies the data onto the network interface card (NIC) send queue.
• The NIC performs the direct memory access (DMA) transfer of each data packet from the TCP buffer space to the NIC and interrupts the CPU to indicate completion of the transfer.
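From the application’s side, the first step above is just a blocking write on a connected TCP socket; the user-to-kernel copy, MTU segmentation, and DMA all happen below that call. A minimal sketch in C, with the destination address, port, and payload size chosen purely for illustration:

/* Application side of the send path. The copy into kernel socket buffers,
 * the MTU-sized segmentation, and the DMA to the NIC happen below write(). */
#include <arpa/inet.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in dst;
    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(8080);                 /* placeholder destination */
    inet_pton(AF_INET, "10.0.0.2", &dst.sin_addr);
    if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0)
        return 1;

    size_t payload = 64 * 1024;                 /* 4 KB - 64 KB range from the slide */
    char *buf = malloc(payload);
    memset(buf, 'x', payload);

    /* write() copies buf from user space into kernel buffers; the OS then
     * segments it into MTU-sized TCP/IP packets, and the NIC DMAs each packet
     * and raises a completion interrupt. */
    ssize_t sent = 0;
    while ((size_t)sent < payload) {
        ssize_t n = write(fd, buf + sent, payload - sent);
        if (n <= 0) break;
        sent += n;
    }
    free(buf);
    close(fd);
    return 0;
}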
Transmit/Receive data using a
standard NIC
Note: Receive path is longer than Send due to extra copying
Where do the cycles go?
Network Bandwidth is Increasing
• TCP requirement rule of thumb: 1 GHz of CPU for 1 Gbps of network bandwidth
– By this rule, a 10 Gb/s link alone would demand roughly 10 GHz of processing, more than a single core provides
• Network bandwidth outpaces Moore’s Law
• The gap between the rate of processing network applications and the fast-growing network bandwidth is increasing
[Chart: GHz and Gbps on a log scale (0.01 to 1000) vs. time (1990 to 2010); network bandwidth (10, 40, 100 Gb/s) pulls away from the Moore’s-Law CPU curve]
I/O Acceleration Techniques
• TCP Offload: Offload TCP/IP checksum and segmentation to interface hardware or a programmable device (e.g., TOEs) – a TOE-enabled NIC using Remote Direct Memory Access (RDMA) can use zero-copy algorithms to place data directly into application buffers.
• O/S Bypass: User-level software techniques to bypass the protocol stack – zero-copy protocols (need a programmable device in the NIC for direct user-level memory access, i.e., virtual-to-physical memory mapping; e.g., VIA).
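TOE and RDMA hardware cannot be shown in a few lines of portable code, but the copy-elimination idea they share also exists in a stock-kernel form: Linux’s sendfile() moves data from the page cache to the socket without the usual detour through a user buffer. A hedged sketch, assuming sock_fd is an already-connected TCP socket and file_fd an open file (this illustrates zero-copy on the host, not offload to the NIC):

/* Copy-avoidance on a standard kernel: sendfile() keeps the payload out of
 * user space entirely. */
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>

static long send_file_zero_copy(int sock_fd, int file_fd) {
    struct stat st;
    if (fstat(file_fd, &st) < 0)
        return -1;

    off_t offset = 0;
    long total = 0;
    while (offset < st.st_size) {
        /* The kernel moves data from the page cache to the socket without an
         * intermediate copy through a user-space buffer. */
        ssize_t n = sendfile(sock_fd, file_fd, &offset, st.st_size - offset);
        if (n <= 0)
            return -1;
        total += n;
    }
    return total;
}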
Comparing standard TCP/IP and TOE-enabled TCP/IP stacks
(http://www.dell.com/downloads/global/power/1q04-her.pdf)
Design of a Web Switch Using IXP 2400 Network Processor
[Diagram: a request from the Internet ("GET /cgi-bin/form HTTP/1.1, Host: www.site.com…") traverses the IP and TCP layers before reaching application-level processing in the switch]
Same problems with AONs, programmable routers, and web switches: requests go through the network, IP, and TCP layers.
Our solution: bring TCP connection establishment and processing down to the network level using an NP.
Ref: L. Bhuyan, “A Network Processor Based, Content Aware Switch”, IEEE Micro, May/June 2006 (with L. Zhao and Y. Luo).
Ex: Design of a Web Switch Using Intel IXP 2400 NP (IEEE Micro, June/July 2006)
[Diagram: a client request ("GET /cgi-bin/form HTTP/1.1, Host: www.yahoo.com…") arrives from the Internet at the content-aware switch, which directs it to the appropriate back end – image server, application server, or HTML server]
[Charts: throughput (Mbps) and latency on the switch (ms) vs. request file size (1 KB to 1024 KB), Linux splicer vs. SpliceNP]
But Our Concentration in this talk!
Server Acceleration
Design server (CPU) architectures to speed up protocol
stack processing!
Also Focus on TCP/IP
Profile of a Packet
Simulation results: FreeBSD run on SimpleScalar; no system overheads included.
[Chart: per-packet cycle breakdown – descriptor & header accesses, IP processing, computes, TCB accesses, TCP processing, memory copy]
Total avg clocks / packet: ~21K
Effective bandwidth: 0.6 Gb/s (1 KB receive)
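As a quick sanity check on how those two numbers relate (my arithmetic; the ~1.6 GHz clock is an assumption, not stated on the slide):

\[
\mathrm{BW} \approx f_{\mathrm{clk}} \times \frac{\text{bits per packet}}{\text{cycles per packet}}
\approx 1.6\,\mathrm{GHz} \times \frac{8192\,\mathrm{bits}}{21{,}000\,\mathrm{cycles}}
\approx 0.62\,\mathrm{Gb/s}
\]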
Five Emerging Technologies
• Optimized Network Protocol Stack
(ISSS+CODES, 2003)
• Cache Optimization (ISSS+CODES, 2003,
ANCHOR, 2004)
• Network Stack Affinity Scheduling
• Direct Cache Access (DCA)
• Lightweight Threading
• Memory Copy Engine (ICCD 2005 and IEEE TC)
Cache Optimizations
Instruction Cache Behavior
• TCP/IP has a higher requirement on L1 I-cache size due to its program structure
• It benefits more from a larger line size and a higher degree of set associativity
[Charts: miss rate (%) for TCP/IP vs. SPECint00 as a function of cache size (2–256 KB), cache line size, and set associativity]
Execution Time Analysis
Given a total L1 cache size on the chip, more area should be devoted to I-cache and less to D-cache.
Direct Cache Access (DCA)
Normal DMA writes:
• Step 1: the NIC performs a DMA write through the memory controller
• Step 2: the snoop invalidates the copy in the CPU cache
• Step 3: the data is written to memory
• Step 4: the CPU read misses and fetches the data from memory
Direct Cache Access:
• Step 1: the NIC performs a DMA write through the memory controller
• Step 2: the cache is updated directly
• Step 3: the CPU read hits in the cache
Eliminates 3 to 25 memory accesses by placing packet data directly into the cache
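A toy model of the difference DCA makes on the CPU’s first read of freshly DMA’d packet data. This is purely illustrative: a one-line "cache", made-up hit/miss costs, and no relation to any real memory controller.

/* Toy DCA model: normal DMA invalidates the cached copy, so the next CPU
 * read misses; a DCA-style write pushes the data into the cache instead. */
#include <stdbool.h>
#include <stdio.h>

#define MISS_COST 200   /* cycles to fetch from DRAM (illustrative) */
#define HIT_COST   10   /* cycles for a cache hit (illustrative)    */

typedef struct { bool valid; int tag; } CacheLine;

static int cpu_read(CacheLine *c, int addr) {
    if (c->valid && c->tag == addr) return HIT_COST;   /* hit */
    c->valid = true; c->tag = addr;                    /* fill on miss */
    return MISS_COST;
}

/* Normal DMA write: memory is updated and the cached copy is snooped and
 * invalidated, so the CPU's next read goes to memory. */
static void dma_write_normal(CacheLine *c, int addr) { (void)addr; c->valid = false; }

/* DCA-style DMA write: the write is also placed into the cache, so the
 * CPU's next read hits. */
static void dma_write_dca(CacheLine *c, int addr) { c->valid = true; c->tag = addr; }

int main(void) {
    CacheLine line = { false, 0 };
    int pkt = 0x1000;

    dma_write_normal(&line, pkt);
    printf("normal DMA: first CPU read costs %d cycles\n", cpu_read(&line, pkt));

    dma_write_dca(&line, pkt);
    printf("DCA:        first CPU read costs %d cycles\n", cpu_read(&line, pkt));
    return 0;
}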
Memory Copy Engines
L. Bhuyan, “Hardware Support for Bulk Data Movement in Server Platforms”, ICCD, October 2005 (IEEE TC, 2006), with L. Zhao et al.
Memory Overhead Simulation
[Chart: program-counter trace over time (0–2500 cycles), showing stalls attributable to memory copy, mbuf, and header accesses]
Memory is touched for:
• NIC descriptors
• Mbufs
• TCP/IP headers
• Payload
Copy Engines
• Copy is time-consuming because
– The CPU moves data at small granularity
– The source or destination is in memory (not cache)
– Memory accesses clog up resources
• A copy engine can
– Perform fast copies while reducing CPU resource occupancy
– Do copies in parallel with CPU computation
– Avoid cache pollution and reduce interconnect traffic
• Low-overhead communication between the engine and the CPU requires
– Hardware support to allow the engine to run asynchronously with the CPU
– Hardware support to share the virtual address space between the engine and the CPU
– Low-overhead signaling of completion
Asynchronous Low-Cost Copy (ALCC)
• Today, memory-to-memory data copies require CPU execution
• Build a copy engine and tightly couple it with the CPU
– Low communication overhead; asynchronous execution w.r.t. the CPU
[Diagram: application-processing timeline – memory copies overlap with application processing instead of stalling it]
Continue computing during memory-to-memory copies
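A plausible software view of such an engine: the CPU posts a copy descriptor, keeps computing, and later checks a completion flag. The descriptor layout and function names below are invented for illustration and do not correspond to a real device interface; the "engine" is stubbed with memcpy so the sketch is self-contained.

/* Hypothetical asynchronous copy-engine interface sketch. */
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    void       *dst;
    const void *src;
    size_t      len;
    atomic_int  done;      /* engine sets this to 1 when the copy completes */
} copy_desc_t;

/* Stand-in for posting a descriptor to the engine (a real engine would be
 * driven through MMIO/doorbells); here the copy is done inline via memcpy. */
static void copy_engine_submit(copy_desc_t *d) {
    memcpy(d->dst, d->src, d->len);
    atomic_store(&d->done, 1);
}

static void do_application_processing(void) {
    /* application work overlapped with the copy would go here */
}

static void receive_path(void *app_buf, const void *kernel_buf, size_t len) {
    copy_desc_t d = { .dst = app_buf, .src = kernel_buf, .len = len };
    atomic_init(&d.done, 0);

    copy_engine_submit(&d);          /* hand the bulk copy to the engine */
    do_application_processing();     /* continue computing during the copy */

    while (!atomic_load(&d.done))    /* low-overhead completion signaling */
        ;                            /* spin (or block) until the engine is done */
}

int main(void) {
    char kernel_buf[1500] = "payload...";
    char app_buf[1500];
    receive_path(app_buf, kernel_buf, sizeof(kernel_buf));
    printf("copied: %.10s\n", app_buf);
    return 0;
}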
Performance Evaluation
[Chart: execution-time reduction (up to ~70%) for the SYNC_NLB, SYNC_WLB, and ASYNC schemes, for payload sizes of 512, 1024, and 1400 bytes, on in-order and out-of-order configurations]
Total I/O Acceleration
Potential efficiencies (10X): benefits of architectural techniques
Ref: Greg Regnier et al., “TCP Onloading for Data Center Servers,” IEEE Computer, vol. 37, Nov. 2004
On-CPU, multi-gigabit, line-speed network I/O is possible
CPU-NIC Integration
Performance comparison (RX) with various numbers of connections on a Sun Niagara 2 machine
• INIC performs better than DNIC with more than 16 connections.
[Chart: DNIC vs. INIC – receive bandwidth (Gbps) and CPU utilization (%) for 1 to 64 connections]
Latency Comparison
INIC achieves about 6 µs lower latency than DNIC, due to the smaller latency of accessing I/O registers and the elimination of PCI-E bus latency.
[Chart: DNIC vs. INIC latency (µs) for I/O sizes from 64 B to 1.5 KB]
Current and Future Work
• Architectural characteristics and Optimization
– TCP/IP Optimization, CPU+NIC Integration, TCB Cache Design,
Anatomy and Optimization of Driver Software
– Caching techniques, ISA optimization, Data Copy engines
– Simulator Design
– Similar analysis with virtualization and 10 GE with multi-core CPUs is ongoing with an Intel project
• Core Scheduling in Multicore Processors
– TCP/IP Scheduling on multi-core processors
– Application level Cache-Aware and Hash-based Scheduling
– Parallel/Pipeline Scheduling to simultaneously address
throughput and latency
– Scheduling to minimize power consumption
– Similar research with Virtualization
• Design and analysis of Heterogeneous Multiprocessors
– Heterogeneous chip multiprocessors
– Use of Network Processors, GPUs and FPGAs
THANK YOU