Exploring CPU, Memory and I/O Architectures for the 2015 Multicore Architectures
Managing Wire Delay in Large CMP Caches
Bradford M. Beckmann
David A. Wood
Multifacet Project
University of Wisconsin-Madison
MICRO 2004
12/8/04
Static NUCA
Dynamic NUCA
Current CMP: IBM Power 5
[Figure: 2 CPUs (CPU 0 and CPU 1), each with its own L1 I$ and L1 D$, sharing 3 L2 cache banks]
Baseline: CMP-SNUCA
[Figure: 8 CPUs (CPU 0–CPU 7), each with private L1 I$ and L1 D$, arranged around a shared L2 built from statically mapped S-NUCA banks]
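To make the static mapping concrete, here is a minimal C sketch (my illustration, not the paper's code) of how an S-NUCA L2 picks a bank: the bank is a fixed function of the block address, so a block's hit latency is determined by how far its fixed home bank is from the requesting CPU. The block size and bank count below are assumed values.

    #include <stdint.h>

    /* S-NUCA: a block's home bank is a fixed hash of its address, so the same
     * block always sits in the same bank no matter which CPU accesses it.
     * 64-byte blocks and 16 banks are illustrative parameters. */
    #define BLOCK_BITS 6
    #define NUM_BANKS  16

    static inline unsigned snuca_home_bank(uint64_t block_addr)
    {
        return (unsigned)((block_addr >> BLOCK_BITS) % NUM_BANKS);
    }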
Block Migration: CMP-DNUCA
[Figure: same 8-CPU layout; frequently accessed blocks A and B migrate through the banked L2 toward their requesting CPUs]
On-chip Transmission Lines
• Similar to contemporary off-chip communication
• Provides a different latency / bandwidth tradeoff
• Wires behave more “transmission-line” like as
frequency increases
– Utilize transmission line qualities to our advantage
– No repeaters – route directly over large structures
– ~10x lower latency across long distances
• Limitations
– Requires thick wires and dielectric spacing
– Increases manufacturing cost
Transmission Lines: CMP-TLC
[Figure: 8 CPUs, each with L1 I$ and L1 D$, connected to the centrally banked L2 by 16 8-byte transmission-line links]
Combination: CMP-Hybrid
[Figure: CMP-DNUCA layout augmented with 8 32-byte transmission-line links connecting the central L2 banks to the CPUs]
CMP-DNUCA: Organization
[Figure: the L2 banks are grouped into bankclusters – local (nearest each CPU), intermediate, and center]
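A rough sketch in the same spirit (the trigger policy here is an assumption, not the paper's exact protocol) of what block migration in CMP-DNUCA does: on repeated hits by the same CPU, a block steps from the center bankcluster toward that CPU's local bankcluster, shortening later hit latency.

    /* D-NUCA migration sketch: the migration path center -> intermediate -> local
     * follows the bankcluster organization above; migrating one step per hit by
     * the owning CPU is an illustrative policy. */
    enum bankcluster { CENTER = 0, INTERMEDIATE = 1, LOCAL = 2 };

    struct block_state {
        enum bankcluster cluster;  /* where the block currently resides           */
        int owner_cpu;             /* CPU whose clusters it is migrating toward   */
    };

    static void on_l2_hit(struct block_state *b, int requesting_cpu)
    {
        if (b->owner_cpu != requesting_cpu) {
            b->owner_cpu = requesting_cpu;   /* new sharer: restart from the center */
            b->cluster = CENTER;
        } else if (b->cluster != LOCAL) {
            b->cluster = (enum bankcluster)(b->cluster + 1);  /* migrate toward the requester */
        }
    }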
Hit Distribution: Grayscale Shading
[Figure: grayscale shading of the L2 banks around the 8 CPUs, where darker shading marks a greater % of L2 hits]
L2 Hit Latency
[Chart: L2 hit latency; bars labeled D: CMP-DNUCA, T: CMP-TLC, H: CMP-Hybrid]
Overall Performance
[Chart: normalized runtime (0–1) of CMP-SNUCA, perfect CMP-DNUCA, CMP-TLC, and perfect CMP-Hybrid on the benchmarks jbb, oltp, ocean, and apsi]
Transmission lines improve L2 hit and L2 miss latency
I/O Acceleration in Server
Architectures
Laxmi N. Bhuyan
University of California, Riverside
http://www.cs.ucr.edu/~bhuyan
Acknowledgement
• Many slides in this presentation have been taken from (or modified from) Li Zhao’s Ph.D. dissertation at UCR and Ravi Iyer’s (Intel) presentation at UCR.
Enterprise Workloads
• Key Characteristics
– Throughput-Oriented
• Lots of transactions, operations, etc in flight
• Many VMs, processes, threads, fibers, etc
• Scalability and Adaptability are key
– Rich (I/O) Content
• TCP, SoIP, SSL, XML
• High Throughput Requirements
• Efficiency and Utilization are key
Server Network I/O Acceleration
Bottlenecks
Rate of Technology Improvement
Rich I/O Content – How does a server communicate with I/O Devices?
Communicating with the Server: The O/S Wall
[Figure: user and kernel software on the CPU, connected over the PCI bus to the NIC]
Problems:
• O/S overhead to move a packet between the network and the application level => protocol stack (TCP/IP)
• O/S interrupts
• Data copying from kernel space to user space and vice versa
• Oh, the PCI bottleneck!
Our Aim: Design server (CPU) architectures to overcome these problems!
TCP Offloading with Cavium Octeon Multi-core MIPS64 Processor
Application Oriented Networking (AON)
[Figure: requests from the Internet pass through a content-aware switch and up the IP and TCP layers to the application data, e.g. GET /cgi-bin/form HTTP/1.1, Host: www.site.com…]
The same problems arise with programmable routers: requests go through the network, IP, and TCP layers.
Solution: Bring processing down to the network level => TCP Offload (not a topic for discussion here)!
Ref: L. Bhuyan (with L. Zhao, et al.), “A Network Processor Based, Content Aware Switch,” IEEE Micro, May/June 2006.
Application-level Processing
[Figure: timing measurement in a UDP communication]
Ref: X. Zhang, L. Bhuyan, and W. Feng, “Anatomy of UDP and M-VIA for Cluster Communication,” JPDC, October 2005.
Rich I/O Content in the Enterprise
[Figure: layered I/O processing stack – App, XML, SSL, iSCSI, TCP/IP – over the platform, network, and data]
• Trends
– Increasing layers of processing on I/O data
• Business-critical functions (TCP, IP storage, security, XML, etc.)
• Independent of actual application processing
• Exacerbated by high network rates
– High rates of I/O bandwidth with new technologies
• PCI-Express technology
• 10 Gb/s to 40 Gb/s network technologies, and it just keeps going
• Problem Statement
– Data movement latency to deliver data
• Interconnect protocols
• Data structures used for shared memory communication, serialization and locking
• Data movement instructions (e.g. rep mov)
– Data transformation latencies
• SW efficiency – degree of IA optimization
• IA cores vs. fixed-function devices
• Location of processing: Core, Uncore, Chipset vs. Device
– Virtualization and real workload requirements
Network Protocols
• TCP/IP protocols – 4 layers
• OSI Reference Model – 7 layers

  OSI Layer         TCP/IP Layer   Examples
  7 Application     Application    HTTP, Telnet
  6 Presentation    Application    XML
  5 Session         Application    SSL
  4 Transport       Transport      TCP, UDP
  3 Network         Internet       IP, IPSec, ICMP
  2 Data Link       Link           Ethernet, FDDI
  1 Physical        Link           Coax, Signaling

[Chart: execution time breakdown (0–100%) into Application, XML, SSL, and TCP/IP for a front-end server, a secure server, and a financial server]
Our Concentration in this talk: TCP/IP
Network Bandwidth is Increasing
TCP requirements rule of thumb: 1 GHz of CPU for 1 Gbps of network bandwidth
[Chart: GHz and Gbps (log scale, 0.01–1000) vs. time (1990–2010); network bandwidth (10, 40, 100 Gbps) outpaces Moore’s Law]
The gap between the rate of processing network applications and the fast-growing network bandwidth is increasing.
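Taken at face value, the 1 GHz-per-1 Gbps rule of thumb means a 10 Gb/s link would consume roughly 10 GHz worth of conventional CPU cycles and a 40 Gb/s link roughly 40 GHz, which is why the curves diverge: per-core clock frequency stopped scaling at a few GHz while link rates kept climbing.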
Profile of a Packet
System Overheads
Descriptor & Header Accesses
IP Processing
Computes
TCB Accesses
TCP Processing
Memory Copy
Total Avg Clocks / Packet: ~ 21K
Effective Bandwidth: 0.6 Gb/s (1KB Receive)
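As a sanity check on these numbers (assuming the ~1.7 GHz Banias clock cited later in this deck): 1.7 × 10^9 cycles/s ÷ ~21,000 cycles/packet ≈ 81,000 packets/s, and at 1 KB (8,192 bits) per packet that is ≈ 0.66 Gb/s, consistent with the ~0.6 Gb/s effective bandwidth above.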
Five Emerging Technologies
• Optimized Network Protocol Stack
(ISSS+CODES, 2003)
• Cache Optimization (ISSS+CODES, 2003,
ANCHOR, 2004)
• Network Stack Affinity Scheduling
• Direct Cache Access
• Lightweight Threading
• Memory Copy Engine (ICCD 2005 and IEEE TC)
Stack Optimizations (Instruction Count)
• Separate Data & Control Paths
– TCP data-path focused
– Reduce # of conditionals
– NIC assist logic (L3/L4 stateless logic)
• Basic Memory Optimizations
– Cache-line aware data structures
– SW prefetches
• Optimized Computation
– Standard compiler capability
=> 3X reduction in instructions per packet
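A minimal C sketch of the "cache-line aware data structures" and "SW prefetches" items, assuming GCC/Clang builtins; the field names and layout are illustrative, not the actual optimized stack.

    #include <stdint.h>

    /* Keep the hot receive-path fields of a connection's TCB together in one
     * 64-byte cache line so header processing touches a single line. */
    struct tcb_hot {
        uint32_t snd_nxt, rcv_nxt;
        uint32_t snd_wnd, rcv_wnd;
        uint32_t remote_ip;
        uint16_t local_port, remote_port;
    } __attribute__((aligned(64)));

    static inline void prefetch_tcb(const struct tcb_hot *tcb)
    {
        /* Issue a software prefetch for the hot line before it is needed. */
        __builtin_prefetch(tcb, 0 /* read */, 3 /* keep in cache */);
    }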
Reduce Protocol Overheads
• TCP/IP
– Data touching
• Copies: 0-copy
• Checksum: offload to NIC
– Non-data touching
• Operating system
– Interrupt processing: interrupt coalescing
– Memory management
• Protocol processing: LSO
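As one concrete (Linux-specific) example of the "0-copy" item above, sendfile() hands the copy to the kernel so the payload never crosses into user space; a minimal sketch, with error handling reduced to the essentials:

    #include <sys/sendfile.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Send a file over an already-connected TCP socket without copying the
     * payload into user space (the kernel moves the data to the socket). */
    static int send_file_zero_copy(int sock_fd, const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        off_t off = 0;
        off_t len = lseek(fd, 0, SEEK_END);
        while (off < len) {
            ssize_t n = sendfile(sock_fd, fd, &off, (size_t)(len - off));
            if (n <= 0)
                break;          /* error; real code would retry on EINTR */
        }
        close(fd);
        return (off == len) ? 0 : -1;
    }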
Instruction Mix & ILP
[Chart: instruction mix (% of loads, stores, unconditional branches, conditional branches, and integer ops) for TCP/IP vs. SPECint00]
• Higher % of unconditional branches
• Lower % of conditional branches
[Chart: execution time (billions of cycles), split into instruction access, data access, and CPU execution, vs. issue width (1, 2, 4, 8) for SPECint00 and TCP/IP]
• Less sensitive to ILP
• Issue width 1→2, 2→4: SPECint00 improves 40%, 24%; TCP/IP improves 29%, 15%
EX: Frequently Used Instruction Pairs in TCP/IP
• Identify frequent instruction pairs with dependence (RAW)
• Integer + branch: header validation, packet classification, state checking
• Combine the two instructions to create a new instruction => reduces the number of instructions and cycles

  1st instruction   2nd instruction   Occurrence
  ADDIU             BNE               4.91%
  ANDI              BEQ               4.80%
  ADDU              ADDU              3.56%
  SLL               OR                3.38%
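A toy C model (illustrative only, not the paper's hardware) of what fusing the most frequent RAW-dependent pair, ADDIU followed by BNE, means semantically: one fused operation performs the add-immediate and produces the branch decision, so the pipeline handles one instruction where it previously handled two.

    #include <stdint.h>
    #include <stdbool.h>

    /* Fused ADDIU+BNE: rt = rs + imm, then branch if rt != rcmp.
     * Returning the "taken" decision stands in for redirecting fetch. */
    static inline bool fused_addiu_bne(int32_t *rt, int32_t rs,
                                       int16_t imm, int32_t rcmp)
    {
        *rt = rs + imm;        /* ADDIU rt, rs, imm      */
        return *rt != rcmp;    /* BNE   rt, rcmp, target */
    }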
Execution Time Reduction
Ref: L. Bhuyan (with H. Xie and L. Zhao), “Architectural Analysis and Instruction Set Optimization for Network Protocol Processors,” IEEE ISSS+CODES, October 2003.
[Chart: performance improvement (%) – instruction reduction and execution time reduction – for the fused-pair configurations i-, r-, b-, -i, -r, -b, ii, ir, ib, ri, rr, rb, bi, br, bb]
[Chart: normalized execution time (CPU execution, data access, instruction access) for the original code and each fused-pair configuration]
• The number of instructions reduced is not proportional to the execution time reduction
• 1% → 6% to 23%
– Instruction access time: 47%
– CPU execution time: 3%
– Data access time: 14%
Cache Optimizations
Instruction Cache Behavior
[Charts: miss rate (%) of TCP/IP vs. SPECint00 as a function of cache size (2–256 KBytes), cache line size, and set associativity]
• TCP/IP benefits more from an L1 cache with larger size, larger line size, and higher degree of set associativity
• Higher requirement on L1 cache size due to the program structure
Execution Time Analysis
Given a total L1 cache size on the chip, more area should be devoted to the I-cache and less to the D-cache
Network Cache
• Two sub-caches
– TLC: temporal data
– SB: non-temporal data
• Benefit
– Reduce cache pollution
– Each cache has its own configuration
[Chart: TLC/SB cache size vs. regular cache size (KBytes) and misses per packet (MPP) for each, as the cache size difference varies]
Reduce Compulsory Cache Misses
• NIC descriptors and TCP headers
– Cache Region Locking w/ Auto Updates (CRL-AU)
• Lock a memory region
• Perform update
• Hybrid protocols: update
• TCP payload
– Cache Region Prefetching (CRP)
• Auto-fill: prefetch
– Support for CRL-AU
[Charts: misses per packet, Base vs. CRL and Base vs. CRP, for payload sizes of 128, 256, 512, 1024, and 1460 bytes]
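CRL-AU and CRP are hardware mechanisms, but the prefetching half of the idea has a rough software analogue (illustrative descriptor layout, GCC/Clang builtin assumed): while packet i is being processed, pull descriptor and header i+1 into the cache so its compulsory miss is hidden.

    /* Software analogue of cache-region prefetching over a NIC RX ring. */
    struct rx_desc {
        void *buf;        /* packet buffer (headers at the front) */
        unsigned len;
    };

    static void process_ring(struct rx_desc *ring, unsigned n,
                             void (*handle)(void *buf, unsigned len))
    {
        for (unsigned i = 0; i < n; i++) {
            if (i + 1 < n) {
                __builtin_prefetch(&ring[i + 1], 0, 1);     /* next descriptor */
                __builtin_prefetch(ring[i + 1].buf, 0, 1);  /* next header     */
            }
            handle(ring[i].buf, ring[i].len);
        }
    }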
Network Stack Affinity
Assigns network I/O workloads to designated devices; separates network I/O from application work
[Figure: multiple multi-core CPUs with memory and chipset; one core (or CPU) and its I/O interface are dedicated to network I/O – Intel calls it Onloading]
• Reduces scheduling overheads
• More efficient cache utilization
• Increases pipeline efficiency
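On a general-purpose OS, the software half of this idea is plain thread-to-core affinity; a minimal Linux sketch (the core number is an assumption) that pins a dedicated network-processing thread to one core so application threads stay on the others:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Pin the calling thread (e.g. the TCP/IP processing thread) to core_id,
     * leaving the remaining cores to application work. */
    static int pin_self_to_core(int core_id)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core_id, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
    }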
Direct Cache Access (DCA)
Normal DMA writes: Step 1 – the NIC DMA-writes packet data via the memory controller; Step 2 – snoop invalidate of the CPU cache; Step 3 – memory write; Step 4 – the CPU read fetches the data from memory.
Direct Cache Access: Step 1 – the NIC DMA-writes packet data via the memory controller; Step 2 – the cache is updated directly; Step 3 – the CPU read hits in the cache.
Eliminates 3 to 25 memory accesses by placing packet data directly into the cache.
Lightweight Threading
Builds on helper threads; reduces CPU stall
[Figure: a single-core pipeline with a single hardware context; on a memory informing event (e.g. cache miss), a thread manager switches the execution pipeline between S/W-controlled thread 1 and thread 2]
Continue computing in a single pipeline in the shadow of a cache miss
Memory Copy Engines
Ref: L. Bhuyan (with L. Zhao, et al.), “Hardware Support for Bulk Data Movement in Server Platforms,” ICCD, October 2005 (also to appear in IEEE TC).
Memory Overheads
[Chart: program counter vs. cycles (0–2500) during packet processing, highlighting time spent in memory copy, mbuf, and header accesses]
• NIC descriptors
• Mbufs
• TCP/IP headers
• Payload
Copy Engines
• Copy is time-consuming because
– The CPU moves data at small granularity
– The source or destination is in memory (not cache)
– Memory accesses clog up resources
• A copy engine can
– Perform fast copies while reducing CPU resource occupancy
– Overlap copies with CPU computation
– Avoid cache pollution and reduce interconnect traffic
• Low-overhead communication between the engine & the CPU requires
– Hardware support to allow the engine to run asynchronously with the CPU
– Hardware support to share the virtual address space between the engine and the CPU
– Low-overhead signaling of completion
Design of Copy Engine
• Trigger CE
– Copy initiation
– Address translation
– Copy communication
• Communication between the CPU and the CE
Performance Evaluation
[Chart: execution time reduction (%) for SYNC_NLB, SYNC_WLB, and ASYNC copy engines with payload sizes of 512, 1024, and 1400 bytes, for in-order and out-of-order processing]
Asynchronous Low-Cost Copy (ALCC)
• Today, memory-to-memory data copies require CPU execution
• Build a copy engine and tightly couple it with the CPU
– Low communication overhead; asynchronous execution w.r.t. the CPU
[Figure: application processing continues while memory-to-memory copies proceed, instead of stalling for a dedicated memory copy phase]
Continue computing during memory-to-memory copies
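A software stand-in for the ALCC idea (a pthread plays the copy engine; the descriptor layout and completion flag are my assumptions): the CPU posts a copy, keeps computing in its shadow, and then checks a low-overhead completion flag.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stddef.h>
    #include <string.h>

    /* A pthread stands in for the hardware copy engine: the CPU posts a copy
     * descriptor, continues its own work, then polls a completion flag. */
    struct copy_req {
        void *dst;
        const void *src;
        size_t len;
        atomic_int done;
    };

    static void *copy_engine(void *arg)
    {
        struct copy_req *r = arg;
        memcpy(r->dst, r->src, r->len);   /* the "engine" performs the bulk copy */
        atomic_store(&r->done, 1);        /* low-overhead completion signal      */
        return NULL;
    }

    static void alcc_copy(void *dst, const void *src, size_t len,
                          void (*app_work)(void))
    {
        struct copy_req req;
        req.dst = dst;
        req.src = src;
        req.len = len;
        atomic_init(&req.done, 0);

        pthread_t engine;
        pthread_create(&engine, NULL, copy_engine, &req);

        app_work();                        /* compute in the shadow of the copy */

        while (!atomic_load(&req.done))    /* wait for completion               */
            ;
        pthread_join(engine, NULL);
    }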
Total I/O Acceleration
Potential Efficiencies (10X)
• Benefits of Affinity
• Benefits of Architectural Techniques
Ref: Greg Regnier, et al., “TCP Onloading for Data Center Servers,” IEEE Computer, vol. 37, Nov. 2004.
On-CPU, multi-gigabit, line-speed network I/O is possible
I/O Acceleration – Problem Magnitude
I/O Processing Overheads
[Chart: cycles per I/O (Banias 1.7 GHz; 0–50,000) and achievable data rate (0–4.0 Gbps) for TCP Orig, TCP Opt, iSCSI, SSL, and XML; the dominant overheads are protocol processing, memory copies & effects of streaming, CRCs, crypto, and parsing/tree construction]
I/O processing rates are significantly limited by the CPU in the face of data movement and transformation operations
Building Block Engines
[Figure: a many-core die with small cores, cache, integrated memory controllers, a scalable on-die fabric, and building block engines (BBEs) for data validation (XORs, checksums, CRCs), bulk data operations (copies/moves, scatter/gather, inter-VM communication), encryption, compression, and XML parsing; open issues: granularity? reconfigurability? core+BBE coupling?]
• Questions:
– What are the characteristics of bulk data operations?
– Why are they performance bottlenecks today?
– What is the best way to improve their performance?
• Parallelize the computation across many small CPUs?
• Build a BBE and tightly couple it with the CPU?
• How do we expose BBEs?
Investigate architectural and platform support for building block engines in future servers
Conclusions and Future Work
• Studied architectural characteristics of key network protocols
– TCP/IP: requires more instruction cache, has a large memory overhead
• Proposed several techniques for optimization
– Caching techniques
– ISA optimization
– Data copy engines
• Further investigation on network protocols & optimization
– Heterogeneous chip multiprocessors
– Other I/O applications: SSL, XML, etc.
– Use of network processors and FPGAs