
INF5063:
Programming Heterogeneous Multi-Core Processors
Network Processors
A generation of multi-core processors
April 6, 2016
Agere Payload Plus APP550
[Figure: APP550 block diagram. Classifier, scheduler and stream editor, each with its own buffer or memory, plus a statistics memory; data paths from ingress, to egress, and to and from the coprocessor; PCI bus]
University of Oslo
INF5063, Carsten Griwodz & Pål Halvorsen
Agere Payload Plus APP550
Pattern Processing Engine
- patterns specified by programmer
- programmable using a special high-level language
- only pattern matching instructions
- parallelism by hardware using multiple copies and several sets of variables
- access to different memories
Packet (protocol data unit) assembler
- collects all blocks of a frame
- not programmable
Reorder Buffer Manager
- transfers data between classifier and traffic manager
- ensures packet order, which parallelism and variable processing time in the pattern processing could otherwise disturb
State Engine
- gathers information (statistics) for scheduling
- verifies that flows stay within bounds
- provides an interface to the host
- configures and controls other functional units
Stream Editor (SED)
- two parallel engines
- modifies outgoing packets (e.g., checksum, TTL, …)
- configurable, but not programmable
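The reorder buffer manager's job of restoring packet order can be illustrated with a minimal sketch. This is not the APP550's actual mechanism; the class name `ReorderBuffer` and the idea of tagging each packet with an arrival sequence number are assumptions made for illustration only.

```python
class ReorderBuffer:
    """Releases packets in arrival order even if processing finishes out of order.

    Hypothetical illustration: each packet is assumed to carry the sequence
    number it was given on arrival (0, 1, 2, ...).
    """

    def __init__(self):
        self.next_seq = 0   # sequence number of the next packet allowed to leave
        self.done = {}      # finished packets still waiting for their turn

    def complete(self, seq, packet):
        """Register a finished packet; return the packets that may now leave, in order."""
        self.done[seq] = packet
        released = []
        while self.next_seq in self.done:
            released.append(self.done.pop(self.next_seq))
            self.next_seq += 1
        return released


rob = ReorderBuffer()
out = []
# Packets 0..3 finish pattern processing in the order 2, 0, 1, 3.
for seq in (2, 0, 1, 3):
    out.extend(rob.complete(seq, f"pkt{seq}"))
print(out)  # ['pkt0', 'pkt1', 'pkt2', 'pkt3']
```

Packet 2 finishes first but is held back until packets 0 and 1 are done, which is the price of running several pattern-processing contexts in parallel.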
PowerNP
[Figure: PowerNP block diagram. PowerPC core, eight embedded processors with instruction memory, hardware classifier, dispatch unit, control store, internal and external memory, ingress and egress queues and data stores; 4 interfaces in from and 4 out to the network, 2 interfaces in from and 2 out to the host]
PowerNP
Coprocessors
- 8 embedded processors
- 4 kbytes local memory each
- 2 cores/processor
- 2 threads/core
Embedded PowerPC GPP
- no OS on the NP
Link layer
- framing outside the processor
IXP1200 Architecture
RISC processor:
- StrongARM running Linux
- control, higher layer protocols and exceptions
- 232 MHz
Access units:
- coordinate access to external units
Scratchpad:
- on-chip memory
- used for IPC and synchronization
Microengines:
- low-level devices with limited set of instructions
- transfers between memory devices
- packet processing
- 232 MHz
IXP2400 Architecture
[Figure: IXP2400 block diagram. Embedded RISC CPU (XScale), microengines 1 to 8, coprocessor, scratch memory, and access units for SRAM, SDRAM, PCI, slowport and MSF, connected by multiple independent internal buses; external SRAM, DRAM, FLASH, PCI bus, receive bus and transmit bus]
RISC processor:
- StrongARM → XScale
- 233 MHz → 600 MHz
Microengines
- 6 → 8
- 233 MHz → 600 MHz
Coprocessors
- hash unit
- 4 timers
- general purpose I/O pins
- external JTAG connections (in-circuit tests)
- several bulk ciphers (IXP2850 only)
- checksum (IXP2850 only)
- …
Slowport
- shared interface to external units
- used for FlashROM during bootstrap
Media Switch Fabric
- forms fast path for transfers
- interconnect for several IXP2xxx
Receive/transmit buses
- shared bus → separate buses
INF5063:
Programming Heterogeneous Multi-Core Processors
Example: SpliceTCP
TCP Splicing
[Figure sequence (4 slides): message exchange between some client on the Internet and the splicer: SYN and SYNACK; ACK; HTTP-GET and DATA]
TCP Splicing
[Figure: application-level proxy loop (accept, connect, while(1) { read, write }) passing through the application, transport, network, data link and physical layers]
Linux Netfilter
• Establish upstream connection
• Receive entire packet
• Rewrite headers
• Forward packet
IXP 2400
• Establish upstream connection
• Parse packet headers
• Rewrite headers
• Forward packet
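The "rewrite headers" step of TCP splicing mainly means translating sequence and acknowledgement numbers between the client-side and the server-side connection. The sketch below shows that arithmetic only. It is not the SpliceNP code; the function name `make_splice` and its parameters are hypothetical, and it assumes the splicer has sent and received the same number of bytes on both connections at the moment of splicing, so the offsets follow directly from the initial sequence numbers (ISNs).

```python
MOD = 2 ** 32  # TCP sequence numbers are 32 bits wide and wrap around


def make_splice(client_isn, splicer_isn_to_client, splicer_isn_to_server, server_isn):
    """Return (client_to_server, server_to_client) rewrite functions.

    Hypothetical helper: all four ISNs are assumed to have been recorded
    while the two connections were set up.
    """
    # Offset from the server's numbering to the numbering the client expects.
    s2c_delta = (splicer_isn_to_client - server_isn) % MOD
    # Offset from the client's numbering to the numbering the server expects.
    c2s_delta = (splicer_isn_to_server - client_isn) % MOD

    def client_to_server(seq, ack):
        # seq moves into the server-side space; ack is mapped back to server numbers.
        return (seq + c2s_delta) % MOD, (ack - s2c_delta) % MOD

    def server_to_client(seq, ack):
        return (seq + s2c_delta) % MOD, (ack - c2s_delta) % MOD

    return client_to_server, server_to_client


c2s, s2c = make_splice(client_isn=1000, splicer_isn_to_client=5000,
                       splicer_isn_to_server=7000, server_isn=4294967000)
print(c2s(1001, 5001))        # (7001, 4294967001)
print(s2c(4294967001, 7001))  # (5001, 1001)
```

A real splicer also has to rewrite addresses and ports and update the TCP checksum after changing these fields; that part is omitted here.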
Throughput vs Request File Size
[Figure: throughput (Mbps, 0 to 800) against request file size (1 to 1024 KB) for the Linux-based and the NP-based splicer]
The NP-based splicer gives a major performance gain at all request sizes
Graph from the presentation of the paper
SpliceNP: A TCP Splicer using a Network Processor, ANCS2005, Princeton, NJ, Oct 27-28, 2005
By Li Zhao, Yan Luo, Laxmi Bhuyan (Univ. Calif. Riverside), Ravi Iyer (Intel)
INF5063:
Programming Heterogeneous Multi-Core Processors
Example:
Transparent protocol translation and load
balancing in a media streaming scenario
slides from an ACM MM 2007 presentation
by Espeland, Lunde, Stensland, Griwodz and Halvorsen
Load Balancer
[Figure: mplayer clients, network and an IXP 2400 (ingress, RTSP / RTP parser, balancer, monitor, egress) in front of several RTSP/RTP video servers; traffic labelled RTSP and RTP/UDP]
Balancer
1. identify connection
2. if exist
   send to right server (select port to use)
   else
   create new session (select one server)
   send packet
Monitor
- historic and current loads of the different servers
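The balancer's two steps (identify the connection, then either reuse the session or create one) can be sketched as a session table. This is an illustration, not the prototype's code: the class name `Balancer`, the use of the number of sessions as "load", and the least-loaded selection rule are all assumptions; the slide only says that one server is selected and that the monitor tracks historic and current loads.

```python
class Balancer:
    """Hypothetical session table for the balancer step on the slide."""

    def __init__(self, servers):
        self.load = {s: 0 for s in servers}  # assumed load metric: sessions per server
        self.sessions = {}                   # client identifier -> chosen server

    def route(self, client):
        """Return the server for a packet from `client`, creating a session if needed."""
        server = self.sessions.get(client)   # 1. identify connection
        if server is None:                   # 2. no session yet: create one
            server = min(self.load, key=self.load.get)  # assumed rule: least loaded
            self.sessions[client] = server
            self.load[server] += 1
        return server                        # existing session: send to the same server


lb = Balancer(["server1", "server2"])
first = lb.route(("10.0.0.1", 5004))
second = lb.route(("10.0.0.2", 5004))
again = lb.route(("10.0.0.1", 5004))
print(first, second, again)  # server1 server2 server1
```

The important property is that packets of an existing session always reach the same server, while new sessions are spread according to the monitor's load figures.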
Transport Protocol Translator
[Figure: mplayer clients, network and an IXP 2400 (ingress, RTSP / RTP parser, balancer, monitor, egress) in front of RTSP/RTP video servers; traffic labelled HTTP and RTSP]
HTTP-streaming is frequently used today!!
Transport Protocol Translator
[Figure: mplayer clients, network and an IXP 2400 (ingress, RTSP / RTP parser, balancer, monitor, protocol translator, egress) in front of RTSP/RTP video servers; traffic labelled HTTP, RTSP, RTSP/RTP and RTP/UDP]
Results
▪ The prototype works: it both balances load and translates between HTTP/TCP and RTP/UDP
▪ The protocol translation gives a much more stable bandwidth than using HTTP/TCP all the way from the server
[Figure: bandwidth plot with curves labelled "protocol translation" and "HTTP"]
INF5063:
Programming Heterogeneous Multi-Core Processors
Example: Booster Boxes
slide content and structure mainly from the NetGames 2002 presentation by Bauer, Rooney and Scotton
Client-Server
[Figure: client-server communication across three local distribution networks connected by a backbone network]
Peer-to-peer
[Figure: peer-to-peer communication across three local distribution networks connected by a backbone network]
Booster boxes
▪ Middleboxes
− Attached directly to ISPs’ access routers
− Less generic than, e.g., firewalls or NAT
▪ Assist distributed event-driven applications
− Improve scalability of client-server and peer-to-peer applications
▪ Application-specific code
− “Boosters”
− Caching on behalf of a server
− Aggregation of events
− Intelligent filtering
− Application-level routing
Booster boxes
[Figure (2 slides): three local distribution networks connected by a backbone network]
Booster boxes
▪ Application-specific code
− Caching on behalf of a server
• Non-real time information is cached
• Booster boxes answer on behalf of servers
− Aggregation of events
• Information from two or more clients within a time window is aggregated into one packet
− Intelligent filtering
• Outdated or redundant information is dropped
− Application-level routing
• Packets are forwarded based on
➢ Packet content
➢ Application state
➢ Destination address
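Aggregation of events within a time window can be sketched as follows. This is a minimal illustration, not the booster-box implementation; the function name `aggregate`, the event tuple layout and the millisecond timestamps are assumptions.

```python
def aggregate(events, window_ms):
    """Group (timestamp_ms, client, payload) events into one packet per time window.

    Hypothetical sketch: a window opens with the first event and closes
    `window_ms` later; everything received in between leaves as one packet.
    """
    packets = []
    current, window_start = [], None
    for ts, client, payload in sorted(events):
        if window_start is None or ts - window_start >= window_ms:
            if current:
                packets.append(current)      # close the previous window
            current, window_start = [], ts   # open a new window at this event
        current.append((client, payload))
    if current:
        packets.append(current)
    return packets


events = [(0, "a", "vote:1"), (20, "b", "vote:2"), (200, "a", "vote:3")]
packets = aggregate(events, window_ms=100)
print(packets)  # [[('a', 'vote:1'), ('b', 'vote:2')], [('a', 'vote:3')]]
```

Three client messages become two upstream packets here; the longer the window, the fewer packets reach the server, at the cost of added delay.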
Architecture
▪ Data Layer
− behaves like a layer-2 switch for the bulk of the traffic
− copies or diverts selected traffic
− IBM’s booster boxes use the packet capture library (“pcap”) filter specification to select traffic
Data Aggregation Example: Floating Car Data
Main booster task
▪ Complex message aggregation
➢ Statistical computations
➢ Context information
▪ Very low real-time requirements
[Figure labels: transmission of position, speed, driven distance, …; statistics gathering, compression, filtering, …; traffic monitoring/predictions, pay-as-you-drive insurance, car maintenance, car taxes, …]
Interactive TV Game Show
Main booster task
▪ Simple message aggregation
▪ Limited real-time requirements
[Figure: 1. packet generation, 2. packet interception, 3. packet aggregation, 4. packet forwarding]
Game with large virtual space
Main booster task
▪ Dynamic server selection
− based on current in-game location
− requires application-specific processing
▪ High real-time requirements
[Figure: virtual space split into a part handled by server 1 and a part handled by server 2]
Summary
▪ Scalability
− by application-specific knowledge
− by network awareness
▪ Main mechanisms
− Caching on behalf of a server
− Aggregation of events
− Attenuation
− Intelligent filtering
− Application-level routing
▪ Application of mechanism depends on
− Workload
− Real-time requirements
INF5063:
Programming Heterogeneous Multi-Core Processors
Multimedia Examples
Multicast Video-Quality Adjustment
[Figure: CPU, memory hub, IO hub and memory]
Multicast Video-Quality Adjustment
▪ Several ways to do video-quality adjustments
− frame dropping
− re-quantization
− scalable video codec
− …
▪ Yamada et al. 2002:
use low-pass filter to eliminate high-frequency components of the MPEG-2 video signal and thus reduce data rate
− determine a low-pass parameter for each GOP
− use low-pass parameter to calculate how many DCT coefficients to remove from each macroblock in a picture
− by eliminating the specified number of DCT coefficients the video data rate is reduced
− implemented the low-pass filter on an IXP1200
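Removing a given number of high-frequency DCT coefficients from a block can be sketched as below. This is only an illustration of the idea, not the filter from Yamada et al.: the function names are hypothetical, the block is assumed to be an 8 x 8 list of lists, and the classic zigzag scan is assumed as the low-to-high frequency order.

```python
def zigzag(n=8):
    """Return the (row, col) positions of an n x n block in zigzag scan order."""
    cells = [(r, c) for r in range(n) for c in range(n)]
    # Walk the anti-diagonals; odd diagonals run downwards, even ones upwards.
    return sorted(cells, key=lambda rc: (rc[0] + rc[1],
                                         rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]))


def low_pass(block, remove):
    """Zero the `remove` highest-frequency coefficients of a square DCT block."""
    order = zigzag(len(block))
    remove = max(0, min(remove, len(order)))   # clamp to a valid count
    out = [row[:] for row in block]            # leave the input untouched
    for r, c in order[len(order) - remove:]:
        out[r][c] = 0
    return out


block = [[1] * 8 for _ in range(8)]
filtered = low_pass(block, remove=60)
print(sum(map(sum, filtered)))  # 4 coefficients survive
```

Zeroed coefficients at the end of the scan cost almost nothing after run-length and entropy coding, which is why dropping them lowers the data rate.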
Multicast Video-Quality Adjustment
▪ Low-pass filter on IXP1200
− parallel execution on 200 MHz StrongARM and microengines
− 24 MB DRAM devoted to StrongARM only
− 8 MB DRAM and 8 MB SRAM shared
− test-filtering program on a regular PC determined work-distribution
• 75% of data from the block layer
• 56% of the processing overhead is due to DCT
▪ five-step algorithm:
1. StrongARM receives packet → copy to shared memory area
2. StrongARM processes headers and generates macroblocks (in shared memory)
3. microengines read data and information from shared memory and perform quality adjustments on each block
4. StrongARM checks if the last macroblock is processed (if not, go to 2)
5. StrongARM rebuilds packet
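The division of work in the five-step algorithm can be mimicked on an ordinary machine with a control thread and a worker pool. This is a rough analogy, not IXP1200 code: the function names, the packet layout (a dict with `header` and `macroblocks`) and the six workers standing in for microengines are assumptions, and the per-block adjustment is a simple placeholder.

```python
from concurrent.futures import ThreadPoolExecutor


def adjust_block(block, remove):
    """Placeholder for the microengine work: zero the last `remove` coefficients."""
    keep = max(0, len(block) - remove)
    return block[:keep] + [0] * (len(block) - keep)


def process_packet(packet, remove, workers=6):
    """Control path (the StrongARM's role) handing blocks to a worker pool."""
    # Steps 1-2: copy the packet and split it into macroblocks in a shared area.
    shared = [list(b) for b in packet["macroblocks"]]
    # Step 3: workers (the microengines' role) adjust the blocks in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        adjusted = list(pool.map(lambda b: adjust_block(b, remove), shared))
    # Step 4: pool.map has returned, so the last macroblock has been processed.
    # Step 5: rebuild the packet from the untouched header and the new blocks.
    return {"header": packet["header"], "macroblocks": adjusted}


pkt = {"header": "slice-7", "macroblocks": [[1, 2, 3, 4], [5, 6, 7, 8]]}
result = process_packet(pkt, remove=2)
print(result["macroblocks"])  # [[1, 2, 0, 0], [5, 6, 0, 0]]
```

The control path only parses and reassembles, while the data-heavy per-block work goes to the parallel units, mirroring the split between StrongARM and microengines described above.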
Multicast Video-Quality Adjustment
▪ Segmentation of MPEG-2 data
− slice = 16 pixel high stripes
− macroblock = 16 x 16 pixel square
• four 8 x 8 luminance
• two 8 x 8 chrominance
➢ DCT transformed with coefficients sorted in ascending order
▪ Data packetization for video filtering
− 720 x 576 pixel frames and 30 fps
➢ 36 “slices” of 45 macroblocks each per frame
− Each slice = one packet
− 8 Mbps stream → ~7 Kb per packet
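The packetization figures follow from simple arithmetic, reproduced here as a short check. The variable names are chosen for illustration; the inputs are the numbers given on the slide, with "8 Mbps" read as 8,000,000 bits per second.

```python
width, height, fps = 720, 576, 30   # frame size in pixels and frame rate
macroblock = 16                     # macroblocks are 16 x 16 pixels
stream_bps = 8_000_000              # 8 Mbps stream

slices_per_frame = height // macroblock           # one slice per 16-pixel stripe
macroblocks_per_slice = width // macroblock       # macroblocks side by side in a slice
packets_per_second = slices_per_frame * fps       # one packet per slice
bits_per_packet = stream_bps / packets_per_second

print(slices_per_frame, macroblocks_per_slice)    # 36 45
print(packets_per_second)                         # 1080
print(round(bits_per_packet))                     # 7407 bits, i.e. roughly 7 Kb
```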
Multicast Video-Quality Adjustment
▪ Evaluation: three scenarios tested
− StrongARM only → 550 kbps
− StrongARM + 1 microengine → 350 kbps
− StrongARM + all microengines → 1350 kbps
− the achieved real-time transcoding rate is not enough for practical purposes, but the distribution of workload works well
INF5063:
Programming Heterogeneous Multi-Core Processors
Parallelism, Pipelining &
Workload Partitioning
Divide and …
▪ Divide a problem into parts, but how?
[Figure: three ways of dividing the work: pipelining, parallelism and hybrid]
Key Considerations
▪ System topology
− processor capacities: different processors have different capabilities
− memory attachments:
• different memory types have different rates and access times
• different memory banks have different access times
− interconnections: different interconnects/buses have different capabilities
▪ Requirements of the workload?
− dependencies
▪ Parameters?
− width of pipeline (level of parallelism)
− depth of pipeline (number of stages)
− number of jobs sharing buses
Network Processor Example
▪ Pipelining vs. Multiprocessor by Ning Weng & Tilman Wolf
− network processor example
− pipelining, parallelism and hybrid are all possible
− packet processing scenario
− what is the performance of the different schemes taking into account…?
• … processing dependencies
• … processing demands
• … contention on memory interfaces
• … pipelining and parallelism effects (experimenting with the width and the depth of the pipeline)
Simulations
▪ Several application examples in the paper giving different DAGs, e.g.,…
− ... flow classification: classify flows according to IP addresses and transport protocols
▪ Measuring system throughput varying all the parameters
− # processors in parallel (width)
− # stages in the pipeline (depth)
− # memory interfaces (buses) between each stage in the pipeline
− memory access times
Results
▪ # memory interfaces per stage M = 1
▪ Memory service time S = 10
▪ Throughput increases with the pipeline depth D
− Good scalability: proportional to the # processors
▪ Throughput increases with the width W initially, but tails off for large W
− Poor scalability due to contention on the memory channel
▪ Efficiency per processing engine…?
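The shape of these results (linear in depth D, saturating in width W) can be reproduced with a simple bottleneck model. This is a toy model for intuition only, not the simulation or the analytic model of Weng and Wolf: the function `throughput`, the assumption that work and memory accesses split evenly over the stages, and the default values for compute time and number of accesses are all made up for illustration. Only M = 1 and S = 10 come from the slide.

```python
def throughput(depth, width, compute=100.0, accesses=4.0, service=10.0, interfaces=1):
    """Packets per time unit for `depth` stages with `width` processors per stage.

    Toy assumptions: total compute time and memory accesses per packet are
    split evenly across the stages, and each stage has its own memory interfaces.
    """
    stage_compute = compute / depth               # compute time per packet in one stage
    stage_memory = accesses / depth * service     # memory time per packet in one stage
    processor_bound = width / (stage_compute + stage_memory)
    memory_bound = interfaces / stage_memory      # the interfaces serve one access at a time
    return min(processor_bound, memory_bound)


for d in (1, 2, 4):
    print("depth", d, "width 1:", round(throughput(d, 1), 4))
for w in (1, 2, 4, 8):
    print("depth 1, width", w, ":", round(throughput(1, w), 4))
```

Doubling the depth doubles throughput because every new stage brings its own memory interface, whereas widening a stage adds processors that all queue on the same interface, so the curve flattens once that interface is busy all the time.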
Lessons learned…
▪ Memory contention can become a severe system bottleneck
− the memory interface saturates with about two processing elements per interface
− off-chip memory accesses cause a significant reduction in throughput and a drastic increase in queuing delay
− performance increases with
• more memory channels
• lower access times
▪ Most NP applications are of a sequential nature, which leads to highly pipelined NP topologies
▪ Processing tasks must be balanced to avoid slow pipeline stages
▪ Communication and synchronization are the main contributors to the pipeline stage time, next to the memory access delay
▪ “Topology” has significant impact on performance
Some References
1. Tatsuya Yamada, Naoki Wakamiya, Masayuki Murata, and Hideo Miyahara, “Implementation and Evaluation of Video-Quality Adjustment for heterogeneous Video Multicast”, 8th Asia-Pacific Conference on Communications, Bandung, September 2002, pp. 454-457
2. Daniel Bauer, Sean Rooney, Paolo Scotton, “Network Infrastructure for Massively Distributed Games”, NetGames, Braunschweig, Germany, April 2002
3. J.R. Allen, Jr., et al., “IBM PowerNP network processor: hardware, software, and applications”, IBM Journal of Research and Development, 47(2/3), pp. 177-193, March/May 2003
4. Ning Weng, Tilman Wolf, “Profiling and mapping of parallel workloads on network processors”, ACM Symposium on Applied Computing (SAC 2005), pp. 890-896
5. Ning Weng, Tilman Wolf, “Analytic modeling of network processors for parallel workload mapping”, ACM Trans. on Embedded Computing Systems, 8(3), 2009
6. Li Zhao, Yan Luo, Laxmi Bhuyan, Ravi Iyer, “SpliceNP: A TCP Splicer using a Network Processor”, ANCS2005, 2005
7. Håvard Espeland, Carl Henrik Lunde, Håkon Stensland, Carsten Griwodz, Pål Halvorsen, “Transparent Protocol Translation for Streaming”, ACM Multimedia 2007
Summary
▪ TODO