Memory Access Characteristics of Network
Infrastructure Applications
Dr. Abdul Waheed
Computer Engineering Department
KFUPM
April 7, 2003
Network Infrastructure
Network infrastructure consists of:
Core devices
Edge devices
Edge device: implements the full protocol stack (application, transport, and network layers) on top of NICs and other I/O devices; handles incoming and outgoing traffic. Examples: proxy servers, layer 4 switches, and web servers.
Core device: implements only the network layer on top of NICs; handles incoming and outgoing traffic. Examples: switches and routers.
Characteristics of applications on core and edge devices:
Transaction processing systems operating on protocol data units (PDUs)
High transaction throughput needed to match line speeds
Problem Statement
Role of infrastructure device architecture:
High performance protocol engines
High transaction throughput with low latency; throughput should ideally match the network speed
High memory bandwidth is needed
Typically, special-purpose architectures are employed: a costly solution
Network processors
Multiprocessors
ASICs
Does employing a general-purpose computer architecture help?
Does the memory hierarchy help or hinder high throughput?
Typical hierarchy: registers → caches → main memory → virtual memory
Is memory and I/O bandwidth a bottleneck?
What are the other performance bottlenecks?
Motivation
Commonly known characteristics of network infrastructure:
Two categories of transaction processing:
Those requiring heavy lifting (typically, a core device)
Those requiring heavy computing (typically, an edge device)
Lack of data locality in PDUs: stands in contrast with scientific/engineering applications
Network processors exploit these characteristics: small data and large instruction caches
Common perception: caches hinder high network throughput
This seems obvious, but why not try to confirm it, to convince ourselves?
If the above perception is incorrect, we can use general-purpose computer architectures for building network infrastructure: an inexpensive, scalable, and flexible solution
The Approach
Possible approaches for memory performance evaluation:
Trace-driven simulation
Measurements: using on-chip counters or benchmarks
Both approaches have limitations:
Accuracy of the trace-driven approach is questionable
Measurement-based approaches are (typically) not repeatable
Analytical approach:
Based on using a general-purpose computing architecture as a server
A simple latency model
Four categories of possible data movements for each transaction
Analyze the overhead for high transaction throughput due to:
Caches
Limited memory and I/O bandwidth
Typical Server Architecture and Data Flow
These data movement paths are specific to using the server as a network protocol engine
Four data transfer paths:
Memory-CPU
Memory-memory
Memory-I/O
Memory-network
Latency Model
Each transaction involves:
CPU cycles
Data transfers: one or more of four identified types
Transaction latency:
Ttrans = TCPU + n1·Tm-c + n2·Tm-m + n3·Tm-disk + n4·Tm-net
TCPU: total CPU time needed for the transaction
Tm-c: time to transfer the entire PDU from memory to CPU for processing
Tm-m: latency of a memory-to-memory copy of a PDU
Tm-disk: latency of a memory-I/O read/write of a block of data
Tm-net: latency of a memory-network read/write of a PDU
ni: number of data movement operations of each type
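As a rough illustration of how the latency model can be applied, the following C sketch evaluates Ttrans from per-transaction counts and per-operation latencies. The parameter values in main() are placeholders chosen for illustration (e.g., a 1500-byte PDU over the I/O bus), not measurements from the talk.

```c
#include <stdio.h>

/* Per-transaction counts and per-operation latencies (all times in msec).
   A minimal sketch of the latency model; numeric values below are
   illustrative placeholders, not measured data. */
struct latency_model {
    double t_cpu;       /* total CPU time for the transaction          */
    double t_m_c;       /* memory-to-CPU transfer of one PDU           */
    double t_m_m;       /* memory-to-memory copy of one PDU            */
    double t_m_disk;    /* memory-to-I/O read/write of one data block  */
    double t_m_net;     /* memory-to-network read/write of one PDU     */
    int n1, n2, n3, n4; /* number of transfers of each type            */
};

/* Ttrans = TCPU + n1*Tm-c + n2*Tm-m + n3*Tm-disk + n4*Tm-net */
static double transaction_latency(const struct latency_model *m)
{
    return m->t_cpu
         + m->n1 * m->t_m_c
         + m->n2 * m->t_m_m
         + m->n3 * m->t_m_disk
         + m->n4 * m->t_m_net;
}

int main(void)
{
    struct latency_model ip_fwd = {
        .t_cpu = 0.001, .t_m_c = 0.0, .t_m_m = 0.0,
        .t_m_disk = 0.0, .t_m_net = 0.0014, /* e.g. S/Be for a 1500-byte PDU */
        .n1 = 0, .n2 = 0, .n3 = 0, .n4 = 2  /* two memory-network transfers  */
    };
    printf("Ttrans = %.4f msec\n", transaction_latency(&ip_fwd));
    return 0;
}
```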
Memory-CPU Transfers
Used for PDU processing
Number of memory stall cycles = (IC)(AR)(MR)(MP)
CPU accesses PDU words in sequential order
Stride = 1
Typically, one-way data flow from memory to CPU via L1 cache
Example: checksum computation of a TCP segment
IC: instruction count per transaction
AR: number of memory accesses per instruction (AR = 1)
MR: ratio of cache misses to memory accesses
MP: miss penalty in clock cycles
Cache miss rate:
Worst case: MR = 1; with a typical MP = 10, stalls = 10 × IC
However, the situation is not that bad for network applications!
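To make the checksum example concrete, here is a sketch of an Internet-style 16-bit one's-complement checksum over a PDU held in memory; the single sequential pass with stride 1 is exactly the access pattern discussed above. This is a generic illustration, not code from the talk.

```c
#include <stddef.h>
#include <stdint.h>

/* Internet-style 16-bit one's-complement checksum over a PDU buffer.
   The loop reads the PDU once, sequentially (stride = 1), so every cache
   miss brings in a line whose remaining words are all used: the effective
   miss rate is roughly W/L, as discussed on the next slide. */
uint16_t pdu_checksum(const uint8_t *pdu, size_t len)
{
    uint32_t sum = 0;

    while (len > 1) {                 /* sum 16-bit words */
        sum += (uint32_t)(pdu[0] << 8 | pdu[1]);
        pdu += 2;
        len -= 2;
    }
    if (len == 1)                     /* odd trailing byte */
        sum += (uint32_t)(pdu[0] << 8);

    while (sum >> 16)                 /* fold carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;
}
```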
Memory-CPU Transfers (Cont’d)
Temporal vs. spatial locality
A PDU lacks temporal locality
Observation: PDU processing exhibits excellent spatial locality
Thus, generally MR = W/L
Suppose the data cache line is 16 words long
Sequential accesses with stride = 1
Accessing one word brings in the other 15 words as well
Thus, effective MR = 1/16 ≈ 6.2%, better than even scientific apps
W: width of each memory access (in bytes)
L: length of each cache line (in bytes)
Validation of above observation:
Similar spatial locality characteristics reported via measurements:
S. Sohoni et al., “A Study of Memory System Performance of Multimedia Applications,” in Proc. of ACM SIGMETRICS 2001
MR for streaming media player better than SPEC benchmark apps!
Memory-CPU Transfers (Cont’d)
Determine cache overhead with respect to execution time (CPI = cycles per instruction, CC = clock cycle time):
(Execution time)no-cache = (IC)(CPI)(CC)
(Execution time)with-cache = (IC)(CPI)(CC) {1 + (MR)(MP)}
Cache overhead = 1 + (MR)(MP)
Cache overhead in various cases:
Worst case: MR = 1 and MP = 10
Cache results in 11 times higher latency for each transaction!
Best case: MR = 0, which is trivial
Average case: MR = 0.1 and MP = 10, so (MR)(MP) = 1
Latency due to stalls = ideal execution time without stalls
Cache overhead can restrict the throughput to ½ of what is possible with a special-purpose network processor
Practical consideration: use internal bus bandwidth to calculate Tm-c
Memory-CPU latency depends on internal bus bandwidth
Avoid using low-level measurements
Latency hiding is very effective for such applications
Tm-c = S/Bi msec, where S is the PDU size and Bi is the internal bus bandwidth in MB/s
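A quick back-of-the-envelope check of the overhead figures above; the MR and MP values are the ones quoted on the slide, not new measurements.

```c
#include <stdio.h>

/* Cache overhead relative to the no-stall execution time:
   (with-cache time) / (no-cache time) = 1 + MR * MP */
static double cache_overhead(double mr, double mp)
{
    return 1.0 + mr * mp;
}

int main(void)
{
    printf("worst   case: %.1fx\n", cache_overhead(1.0, 10.0)); /* 11x */
    printf("average case: %.1fx\n", cache_overhead(0.1, 10.0)); /*  2x */
    printf("best    case: %.1fx\n", cache_overhead(0.0, 10.0)); /*  1x */
    return 0;
}
```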
Memory-Memory Transfers
Memory-memory transfer:
Due to memory copy of PDU between protocol layers
Transfers through caches and CPU
Stride = 1
Transfer involves memory → cache → CPU → cache → memory data movement
Latency:
Dependent on internal (system) bus bandwidth
Tm-m = 2S/Bi msec
Memory-I/O and -Network Transfers
Memory-network transfers:
Passes over the I/O bus
DMA can be used
Again, stride = 1
Latency:
Limiting factor is the I/O bus bandwidth
Tm-net = S/Be msec
Characteristics of Selected Applications
Three applications:
IP packet forwarding (a simple application)
HTTP proxying
RTP streaming
We need to calculate theoretical peak throughput
Based on application data movement characteristics
Depending only on architectural capabilities
IP Forwarding
(Diagram: IP packets, each a header plus payload, flowing into and out of a router)
Each transaction involves the following operations:
Passing of the incoming IP packet NIC → IP (mem-net)
Reading of the destination IP address (cache miss)
Hash function calculation (from cache)
Table lookup (cache miss or hit; in the long term, a hit)
IP header update (from cache)
Update time to live
Recompute header checksum
Passing of the outgoing IP packet IP → NIC (mem-net)
IP Forwarding (Cont’d)
For an IP forwarding transaction:
Data cache misses = 1 or 2 per transaction
Number of memory-network transfers = 2
Some CPU cycles are consumed for each transaction
Latency:
Depends on CPU cycles and memory-network transfers
TIP = TCPUIP + 2S/Be msec
where TCPUIP is the CPU time (in msec) to process the transaction
Using similar analysis for HTTP proxying and RTP streaming:
THTTP = TCPUH + 5S/Bi + S/Be msec
TRTP = TCPUR + 4S/Bi + S/Be msec
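The three latency expressions can be evaluated directly once S, Bi, and Be are fixed. The sketch below uses an assumed 1500-byte PDU together with the bus bandwidths quoted later in the talk (Bi = 3200 MB/s, Be = 1066 MB/s), and ignores the TCPU terms, so the numbers are illustrative lower bounds only.

```c
#include <stdio.h>

int main(void)
{
    double S  = 1500.0 / 1e6; /* PDU size in MB (assumed 1500 bytes)  */
    double Bi = 3200.0;       /* internal bus bandwidth, MB/s         */
    double Be = 1066.0;       /* I/O (external) bus bandwidth, MB/s   */

    /* CPU terms are ignored here, so these are lower bounds in msec. */
    double t_ip   = 2.0 * S / Be * 1e3;
    double t_http = (5.0 * S / Bi + S / Be) * 1e3;
    double t_rtp  = (4.0 * S / Bi + S / Be) * 1e3;

    printf("TIP   >= %.4f msec\n", t_ip);
    printf("THTTP >= %.4f msec\n", t_http);
    printf("TRTP  >= %.4f msec\n", t_rtp);
    return 0;
}
```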
Optimistic Theoretical Peak Throughput
Assume:
For each transaction, TCPU << min{Tm-m, Tm-net} and can be ignored (meaning, no software overhead)
No bus contention among independent transactions
Throughputs (in Mbytes/sec):
(Throughput)IP = Be/2
(Throughput)HTTP = BiBe/(5Be + Bi)
(Throughput)RTP = BiBe/(4Be + Bi)
Apply to architectures based on state-of-the-art processors:
Intel Pentium IV 3 GHz
AMD Athlon XP 3000+
MIPS R16000 700 MHz
Sun Ultrasparc III 900 MHz
Optimistic Peak Throughputs
Assumed PCI-X I/O bus:
133 MHz, 64 bits wide
Be = 1066 Mbytes/sec
Consider only the IP forwarding case:
Throughput of 4264 Mbps: at least a 12-port, 155 Mbps, full-duplex software router
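Plugging the PCI-X figure (Be = 1066 MB/s) and an assumed internal bus of 3200 MB/s (the value quoted for the measurement platform) into the throughput expressions reproduces the peak numbers; a minimal sketch:

```c
#include <stdio.h>

int main(void)
{
    double Bi = 3200.0; /* internal bus bandwidth, MB/s (assumed)       */
    double Be = 1066.0; /* PCI-X I/O bus: 133 MHz x 64 bits = 1066 MB/s */

    double ip   = Be / 2.0;                  /* memory-network bound    */
    double http = Bi * Be / (5.0 * Be + Bi); /* mostly memory-memory    */
    double rtp  = Bi * Be / (4.0 * Be + Bi);

    /* 533 MB/s is about 4264 Mbps for IP forwarding, matching the slide. */
    printf("IP forwarding: %.0f MB/s (%.0f Mbps)\n", ip,   ip   * 8.0);
    printf("HTTP proxying: %.0f MB/s (%.0f Mbps)\n", http, http * 8.0);
    printf("RTP streaming: %.0f MB/s (%.0f Mbps)\n", rtp,  rtp  * 8.0);
    return 0;
}
```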
Measurement Based IP Forwarding
(Charts: throughput in Mbits/sec and CPU utilization in %, measured on Ethernet interface 0, plotted against configurations 1 through 8 for 64-byte, 10K, and 64K packets)
Linux IP forwarding platform:
Pentium IV 2 GHz processor
PCI-X I/O bus; 3200 MB/sec internal bus
4 Gigabit NICs connecting to four traffic-generating hosts
Peak throughput on any interface: 450 Mbps (full-duplex)
Mean total throughput = 450 × 2 × 4 = 3600 Mbps, about 84% of the theoretical peak
Difference due to software overhead, bus contention, and interrupts
Conclusions
Contributions of this work
We show that average cache overhead for network applications is
equal to the ideal-case latency for a transaction
Excellent spatial locality for PDU processing
We show that peak throughput is limited by internal and external
bus bandwidths rather than caches
General-purpose processor-based architectures can replace special-purpose network-processor-based architectures
Open questions
What is the contribution of software overheads?
What is the role of bus contention to limit the throughput?
Impact of multithreading combined with superscalar architectures
Can latency hiding provide extra mileage?
Role of NI and device driver optimizations?
Backup Slides
Memory-Memory Transfer Bandwidth
Extended copy transfer characteristics (stride = 1)
(Chart: memory bandwidth in Mbytes/sec, from 0 to about 6000, versus block size (working set) from 0.1 KB to 32 MB)
HTTP Proxying
Each HTTP proxy transaction results in the following operations:
Passing of the HTTP request packet NIC → IP (mem-mem)
Passing of the HTTP request packet IP → TCP (mem-mem)
Passing of the HTTP request packet TCP → HTTP (mem-mem)
Formation of the HTTP response header (cache hit)
Retrieval of the target document HTTP → TCP (mem-mem)
Passing of the HTTP response TCP → IP (mem-mem + CPU visit): the entire TCP segment needs to visit the CPU (and cache) for checksum computation
Passing of the HTTP response IP → NIC (mem-mem + CPU visit): only the IP header checksum is computed (at least 20 bytes)
For all data within a transaction:
Data cache miss rate = 1/(B/W) + c1, where c1 is a constant number of compulsory misses
Number of memory-memory transfers = 3 for the request + 3 for the response
Request size is only a few bytes, while the average response size is around 10 KB
Context switches = 2
Streaming Media
There are typically two parts:
Request part: one time per transaction
Response: streaming of a very large number of packets for each transaction, as RTP packets over UDP
Ignore the request part and consider only the RTP streaming part
RTP Streaming
Each response RTP packet in a transaction involves:
RTP header calculation (compulsory cache miss)
Passing of RTP → UDP (mem-mem)
Passing of UDP → IP (mem-mem + CPU visit): checksum of the entire UDP datagram
Passing of IP → NIC (mem-mem)
For each RTP packet processed:
Data cache miss rate = 1/(B/W) + c
Number of memory-memory transfers = 3
Number of context switches = 1
Max. theoretical throughput = ? (given bus bandwidth)
General Memory Access Model
At level i in the memory hierarchy:
Average memory access time = (hit time)i + (miss rate)i * (miss
penalty)i
Added complexity when out of order execution is possible
Miss latency should include only non-overlapped stalls
Measurements should be consistent for a given processor
Memory stall cycles/instruction = (misses/instruction) × (total miss latency − overlapped miss latency)
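A small sketch of the two formulas above, with the non-overlapped miss latency handled explicitly; the parameter values in main() are illustrative placeholders (they would normally come from counters or simulation and are not given in the talk).

```c
#include <stdio.h>

/* Average memory access time at one level of the hierarchy. */
static double avg_access_time(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}

/* Memory stall cycles per instruction for an out-of-order processor:
   only the portion of the miss latency that is not overlapped with
   useful work counts as a stall. */
static double stalls_per_instr(double misses_per_instr,
                               double total_miss_latency,
                               double overlapped_miss_latency)
{
    return misses_per_instr * (total_miss_latency - overlapped_miss_latency);
}

int main(void)
{
    printf("AMAT  = %.2f cycles\n", avg_access_time(1.0, 0.05, 20.0));
    printf("stall = %.2f cycles/instr\n", stalls_per_instr(0.02, 20.0, 8.0));
    return 0;
}
```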
General Memory Performance
Optimization Techniques
Reduce miss penalty
Reduce miss rate: typical compiler-based optimization techniques
Reduce time to hit in the cache: use pipelined accesses
Parallelism: prefetching, which essentially reduces the miss penalty
How to Get High Throughput?
Bypass the cache
Direct memory access from NIC
Should decrease the latency by one order of magnitude
Memory bandwidth can still become a bottleneck
Hide the latency
Use multiple threads for independent transactions
Superscalar and pipelined architecture can ensure useful work
while one thread stalls
Throughput is dependent on the number of concurrent threads
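A minimal sketch of the latency-hiding idea: independent transactions handled by a pool of worker threads, so that when one worker stalls on memory or I/O another can run. This is a generic pthreads illustration, not the structure of any server discussed in the talk; process_transaction is a hypothetical placeholder.

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_WORKERS 4
#define TRANSACTIONS_PER_WORKER 1000

/* Hypothetical per-transaction work: in a real protocol engine this would
   touch the PDU (checksums, header updates) and stall on cache misses. */
static void process_transaction(int id)
{
    (void)id;  /* placeholder: PDU processing would go here */
}

static void *worker(void *arg)
{
    int base = *(int *)arg;
    /* Each worker handles its own independent transactions; while one
       thread stalls, the others keep the pipeline busy. */
    for (int i = 0; i < TRANSACTIONS_PER_WORKER; i++)
        process_transaction(base + i);
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_WORKERS];
    int base[NUM_WORKERS];

    for (int i = 0; i < NUM_WORKERS; i++) {
        base[i] = i * TRANSACTIONS_PER_WORKER;
        pthread_create(&tid[i], NULL, worker, &base[i]);
    }
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(tid[i], NULL);

    printf("processed %d transactions\n", NUM_WORKERS * TRANSACTIONS_PER_WORKER);
    return 0;
}
```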
Trace Driven Simulation
(Diagram of the trace-driven simulation flow: application source code, instrumented code execution, parse tree of the source code, model generator, model execution, memory parameters)
Tuning Using Hardware Counters
(Diagram of the tuning loop: application program, execution on the target system, performance-counter-based measurements, performance analysis, source code modifications)
Latency Hiding
We use measurements to compare the memory performance and throughput of the Darwin Streaming Server and the Windows Media Server
Platforms
Darwin running under Linux
WMS running under Windows 2000 Server
Metrics for comparisons
CPU utilization
Cache miss rate
Page fault rate
Data throughput (Mbits/sec)
Comparison of Streaming Media Servers
Server: Pentium 166 MHz, 128 MB RAM
Clients (client 1 through client 4): Pentium 166 MHz, 48-64 MB RAM each
Switch: 100 Mbps, layer 2
(Diagram: the server and the four clients connected through the 100 Mb/s switch)
Server machine runs Darwin Streaming Server under Linux
Same server machine runs Windows Media Server under Win2K Server
Latency Hiding (Cont’d)
(Charts comparing DSS and WMS at 4, 100, and 400 client requests: average throughput in Mbps for a single 300k stream; server CPU utilization in % for multiple 300k streams; cache misses/sec and page faults/sec for multiple 300k streams)
Latency Hiding (Cont’d)
Peak throughput (indicated by 100% CPU usage):
Windows Media Server delivers significantly larger throughput at higher load than Darwin Streaming Server
Memory performance:
WMS shows high cache miss and page fault rates at high loads but still delivers better throughput
Better exploitation of the latency hiding opportunities offered by the processor, through the OS, compiler, and application
Don't expect more from a freely available media server!
Darwin is available in the public domain with source code from www.apple.com/quicktime
Technologies
STREAMS
STREAM – a full-duplex communication channel between a
user-level process and a device
A STREAM consists of:
- STREAM head interfaces with the user process
- driver end interfaces with the device
- zero or more STREAM modules between them.
Each module contains a read queue and a write queue
Message passing is used to communicate between queues
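As a rough conceptual sketch (not the actual SVR4/Solaris STREAMS API, whose names and details differ), the structure can be pictured as a chain of modules, each holding a read queue and a write queue, with messages passed between adjacent queues:

```c
#include <stdio.h>

/* Conceptual sketch of the STREAMS layering; treat purely as an illustration. */
struct message {                 /* unit passed between queues             */
    struct message *next;
    void *data;
    int   len;
};

struct queue {                   /* one direction (read or write)          */
    struct message *head;        /* pending messages                       */
    struct queue   *next_queue;  /* queue of the adjacent module           */
};

struct stream_module {           /* zero or more sit between head and driver */
    struct queue read_q;         /* messages flowing toward the process    */
    struct queue write_q;        /* messages flowing toward the device     */
    struct stream_module *up;    /* toward the STREAM head                 */
    struct stream_module *down;  /* toward the driver end                  */
};

int main(void)
{
    struct stream_module head = {0}, driver = {0};

    head.down = &driver;                        /* head sits above the driver end */
    driver.up = &head;
    head.write_q.next_queue = &driver.write_q;  /* writes flow toward the device  */
    driver.read_q.next_queue = &head.read_q;    /* reads flow toward the process  */

    printf("stream with %d intermediate modules\n", 0);
    return 0;
}
```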
The STREAMS Structure
Interrupt-Driven I/O Cycle
Known Application Characteristics
Some applications are compute-intensive
Large number of memory accesses
Challenge: how to keep CPU busy during every cycle
Identification of the problem: simulation/measurement/static analysis
Tuning: code transformations/algorithmic modifications
Both identification and tuning are hard!
Some applications allow latency hiding
Multithreaded applications
Example: high throughput servers that process independent
transactions