Castine Product Overview


Intel® IXP2XXX Network
Processor Architecture Overview
John Morgan
Infrastructure Processor Division
September 2004
IXP2400 External Features
[Figure: IXP2400 external interfaces. Diagram labels: Host CPU (optional) on PCI 64-bit / 66 MHz; QDR SRAM (1.6 GBs, 64 MByte); CoProc Bus with IXA SW Classification Accelerator and Customer ASICs; DDR DRAM (2 GByte); Flash on the Slow Port; IXP2400 (Ingress) and IXP2400 (Egress) MicroEngine Clusters; ATM / POS PHY or Ethernet MAC on a Utopia 1/2/3 or POS-PL2/3 interface; Switch Fabric Port Interface (Utopia 1,2,3, SPI-3 (POS-PL3), CSIX); Flow Control Bus.]
External Interfaces
• MSF Interface supports UTOPIA 1/2/3, SPI-3 (POS-PL3), and CSIX.
• Four independent, configurable, 8-bit channels with the ability to aggregate channels for wider interfaces.
• Media interface can support channelized media on RX and 32-bit connect to Switch Fabric over SPI-3 on TX (and vice versa) to support Switch Fabric option.
• 2 Quad Data Rate SRAM channels.
• A QDR SRAM channel can interface to Co-Processors.
• 1 DDR SDRAM channel.
• PCI 64/66 Host CPU interface.
• Flash and PHY Mgmt interface.
• Dedicated inter-IXP channel to communicate fabric flow control information from egress to ingress for dual chip solution.
IXP2400
[Figure: IXP2400 internal block diagram. Diagram labels: MEv2 microengines 1 to 8; Intel® XScale™ Core (32K IC, 32K DC); PCI (64b, 66 MHz); Gasket; DDRAM (72); Rbuf 64 @ 128B; Tbuf 64 @ 128B; SPI-3 or CSIX media interface (32b in each direction); Hash 64/48/128; Scratch 16KB; QDR SRAM 1 and 2, each with E/D Q (18-bit paths); CSRs (Fast_wr, UART, Timers, GPIO, BootROM/Slow Port).]
IXP2400 Resources Summary
• Half Duplex OC-48 / 2.5 Gb/sec Network Processor
• (8) Multi-Threaded Microengines
• Intel® XScale™ Core
• Media / Switch Fabric Interface
• PCI interface
• 2 QDR SRAM interface controllers
• 1 DDR SDRAM interface controller
• 8 bit asynchronous port
– Flash and CPU bus
• Additional integrated features
– Hardware Hash Unit
– 16 KByte Scratchpad Memory
– Serial UART port
– 8 general purpose I/O pins
– Four 32-bit timers
– JTAG Support
IXP2800 External Features
[Figure: IXP2800 external interfaces. Diagram labels: Host CPU (optional) on PCI 64-bit / 66 MHz; QDR SRAM (12.8 Gbps x 4, 64 MByte x 4 channels); CoProc Bus with IXA SW Classification Accelerator and Customer ASICs; RDR DRAM (50+ Gbps, 2 GByte total for 3 channels); Flash on the Slow Port; IXP2800 (Ingress) and IXP2800 (Egress) MicroEngine Clusters; ATM / POS PHY or Ethernet MAC (SPI-4, CSIX-L1); Switch Fabric Port Interface (SPI-4 or CSIX-L1); Flow Control Bus.]
External Interfaces
• Media Interface supports both SPI-4 and CSIX
• 4 Quad Data Rate (QDR) SRAM channels
– Each channel can interface to Coprocessors
• 3 RDRAM Channels
• PCI 64/66 Host CPU interface
• Flash and PHY Management interface
• Dedicated inter-IXP channel to communicate fabric flow control information from egress to ingress for dual chip solution
IXP2800
[Figure: IXP2800 internal block diagram. Diagram labels: MEv2 microengines 1 to 16; Intel® XScale™ Core (32K IC, 32K DC); PCI (64b, 66 MHz); Gasket; Stripe across RDRAM 1, 2, and 3 (18-bit paths); Rbuf 64 @ 128B; Tbuf 64 @ 128B; SPI-4 or CSIX media interface (16b in each direction); Hash 48/64/128; Scratch 16KB; QDR SRAM 1 to 4, each with E/D Q (18-bit paths); CSRs (Fast_wr, UART, Timers, GPIO, BootROM/SlowPort).]
IXP2800 Resources Summary
• Half Duplex OC-192 / 10 Gb/sec Network Processor
• (16) Multi-Threaded Microengines
• Intel® XScale™ Core
• Media / Switch Fabric Interface
• PCI interface
• 4 QDR SRAM Interface Controllers
• 3 Rambus* DRAM Interface Controllers
• 8 bit asynchronous port
– Flash and CPU bus
• Additional integrated features
– Hardware Hash Unit for generating 48-, 64-, or 128-bit adaptive polynomial hash keys
– 16 KByte Scratchpad Memory
– Serial UART port for debug
– 8 general purpose I/O pins
– Four 32-bit timers
– JTAG Support
IXP2800 and IXP2400 Comparison
Feature | IXP2800 | IXP2400
Frequency | 1.4/1.0 GHz / 650 MHz | 600/400 MHz
DRAM Memory | 3 channels RDRAM 800/1066 MHz; up to 2 GB | 1 channel DDR DRAM 150 MHz; up to 2 GB
SRAM Memory | 4 channels QDR (or coprocessor) | 2 channels QDR (or coprocessor)
Media Interface | Separate 16 bit Tx & Rx configurable to SPI-4 P2 or CSIX_L1 | Separate 32 bit Tx & Rx configurable to SPI-3, UTOPIA 3 or CSIX_L1
Number of MicroEngines | 16 (MEv2) | 8 (MEv2)
Performance | Dual chip full duplex OC192 | Dual chip full duplex OC48
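To put the line rates and clock frequencies in perspective, a rough per-packet cycle budget can be worked out from them. The sketch below is illustrative only: the 40-byte minimum packet size, the use of the highest listed clock rate, and the idea that all microengines share the load evenly with no overhead are assumptions, not figures from the slides.

```python
def cycles_per_packet(line_rate_bps, packet_bytes, me_clock_hz, num_mes):
    """Aggregate microengine cycles available per packet at full line rate."""
    packets_per_second = line_rate_bps / (packet_bytes * 8)
    total_cycles_per_second = me_clock_hz * num_mes
    return total_cycles_per_second / packets_per_second


# Assumed minimum packet size (hypothetical, not from the slides).
MIN_PACKET_BYTES = 40

# Line rates, clock rates, and microengine counts as listed in the slides.
ixp2800 = cycles_per_packet(10e9, MIN_PACKET_BYTES, 1.4e9, 16)
ixp2400 = cycles_per_packet(2.5e9, MIN_PACKET_BYTES, 600e6, 8)

print(f"IXP2800: about {ixp2800:.0f} cycles per packet")
print(f"IXP2400: about {ixp2400:.0f} cycles per packet")
```

Under these assumptions both parts have a budget of several hundred microengine cycles per minimum-size packet, which is why memory latency hiding through multithreading matters.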
MicroEngine v2
[Figure: MicroEngine v2 block diagram. Diagram labels: inputs from the D-Push Bus, S-Push Bus, and From Next Neighbor; Local Memory (640 words) with LM Addr 0 and LM Addr 1 (2 per CTX); two banks of 128 GPRs; 128 Next Neighbor registers; 128 D Xfer In and 128 S Xfer In; Control Store (4K/8K instructions); A_op / B_op and Prev A / Prev B feeding A_Operand and B_Operand; 32-bit Execution Data Path (Multiply, Find first bit, Add, shift, logical); CAM with TAGs 0-15, Status and LRU Logic (6-bit), Lock 0-15, Status, Entry#; P-Random #; CRC Unit and CRC remain; Other Local CSRs; Timers and Timestamp; ALU_Out to To Next Neighbor, 128 D Xfer Out (D-Pull Bus), and 128 S Xfer Out (S-Pull Bus).]
Microengine v2 Features – Part 1
• Clock Rates
– IXP2400 – 600/400 MHz
– IXP2800 – 1.4/1.0 GHz / 650 MHz
• Control Store
– IXP2400 – 4K Instruction store
– IXP2800 – 8K Instruction store
• Configurable to 4 or 8 threads
– Each thread has its own program counter, registers, signal and wakeup events
– Generalized Thread Signaling (15 signals per thread)
• Local Storage Options
– 256 GPRs
– 256 Transfer Registers
– 128 Next Neighbor Registers
– 640 32-bit words of local memory
Microengine v2 Features – Part 2
• CAM (Content Addressable Memory)
– Performs parallel lookup on 16 32-bit entries
– Reports a 9-bit lookup result
  – 4 State bits (software controlled, no impact to hardware)
  – Hit – entry number that hit; Miss – LRU entry
  – 4-bit index of CAM entry (Hit) or LRU (Miss)
– Improves usage of multiple threads on same data
• CRC hardware
– IXP2400 – Provides CRC_16, CRC_32
– IXP2800 – Provides CRC_16, CRC_32, iSCSI, CRC_10 and CRC_5
– Accelerates CRC computation for ATM AAL/SAR, ATM OAM and Storage applications
• Multiply hardware
– Supports 8x24, 16x16 and 32x32
– Accelerates metering in QoS algorithms
  – DiffServ, MPLS
• Pseudo Random Number generation
– Accelerates RED, WRED algorithms
• 64-bit Time-stamp and 16-bit Profile count
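The CAM behaviour described above (a lookup over 16 entries that returns the matching entry on a hit, or the least recently used entry on a miss, plus software-controlled state bits) can be modelled in software roughly as follows. This is a minimal sketch, not the hardware interface: the class and method names are hypothetical, the result is returned as a tuple rather than packed into 9 bits, and it assumes that a hit or a write makes the entry most recently used and that a miss reports zero state bits.

```python
class Cam:
    """Software model of a small CAM with LRU replacement (illustrative)."""

    def __init__(self, num_entries=16):
        self.tags = [None] * num_entries
        self.state = [0] * num_entries        # 4 software-controlled state bits
        self.lru = list(range(num_entries))   # front = least recently used

    def lookup(self, tag):
        """Return (hit, entry, state); on a miss, entry is the LRU entry."""
        if tag in self.tags:                  # hardware compares all entries at once
            entry = self.tags.index(tag)
            self._touch(entry)                # assumption: a hit refreshes the entry
            return True, entry, self.state[entry]
        return False, self.lru[0], 0          # assumption: state reads as 0 on a miss

    def write(self, entry, tag, state=0):
        """Load a tag into an entry, typically the LRU entry after a miss."""
        self.tags[entry] = tag
        self.state[entry] = state & 0xF
        self._touch(entry)

    def _touch(self, entry):
        """Mark an entry as most recently used."""
        self.lru.remove(entry)
        self.lru.append(entry)


cam = Cam()
hit, entry, _ = cam.lookup(0x1234)            # miss: entry is the LRU slot
if not hit:
    cam.write(entry, 0x1234, state=1)
print(cam.lookup(0x1234))
```

Several threads working on the same flow can use such a lookup to detect that another thread already holds the data, which is the "multiple threads on same data" case the slide mentions.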
Intel® XScale™ Core Overview
• High-performance, Low-power, 32-bit Embedded RISC processor
• Clock rate
– IXP2400 600 MHz
– IXP2800 700/500/325 MHz
• 32 Kbyte instruction cache
• 32 Kbyte data cache
• 2 Kbyte mini-data cache
• Write buffer
• Memory management unit
Web Switch Design Using
Network Processors – NSF
Project 2002-2005
Funded by NSF and Intel – Not Intel Confidential
L. Zhao, Y. Luo, L. Bhuyan and R. Iyer, “A Network
Processor-Based Content Aware Switch”
IEEE Micro, May/June 2006
Web Switch or Layer 5 Switch
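[Figure: a client request for www.yahoo.com crosses the Internet to the Switch, which forwards it to an Image Server, Application Server, or HTML Server. The packet is shown as IP | TCP | APP. DATA, where the application data is "GET /cgi-bin/form HTTP/1.1 Host: www.yahoo.com…".]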
• Layer 4 switch
– Content blind
– Storage overhead
– Difficult to administer
• Content-aware (Layer 5/7) switch
– Partition the server’s database over different nodes
– Increase the performance due to improved hit rate
– Server can be specialized for certain types of requests
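A content-aware switch has to read the HTTP request before it can pick a server. The sketch below shows the idea with a hypothetical rule set: the function name, the server pool names, and the matching rules (CGI paths go to the application server, image file extensions to the image server, everything else to the HTML server) are assumptions for illustration, not the policy used in the project.

```python
def choose_server(request_line):
    """Pick a back-end pool from the HTTP request line (hypothetical rules)."""
    _method, path, _version = request_line.split()
    if path.startswith("/cgi-bin/"):
        return "application-server"
    if path.lower().endswith((".jpg", ".gif", ".png")):
        return "image-server"
    return "html-server"


print(choose_server("GET /cgi-bin/form HTTP/1.1"))
print(choose_server("GET /logo.gif HTTP/1.1"))
print(choose_server("GET /index.html HTTP/1.1"))
```

Because the request line only arrives after the TCP handshake completes, the switch must first accept the client connection itself, which is what motivates the gateway and splicing mechanisms.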
Layer-7 Two-way Mechanisms
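• TCP gateway
– An application-level proxy on the web switch mediates the communication between the client and the server
• TCP splicing
– Reduces the overhead of the TCP gateway by forwarding packets directly in the OS kernel
[Figure: the gateway path crosses from kernel to user space; the splicing path stays in the kernel.]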
TCP Splicing
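• Establish connection with the client
– Three-way handshake
• Choose the server
• Establish connection with the server
• Splice two connections
• Map the sequence for subsequent packets
[Figure: message sequence over time between Client, Switch, and Server. C, D, and S denote the sequence numbers chosen by the client, switch, and server.
Client to Switch: SYN(C)
Switch to Client: SYN(D), ACK(C+1)
Client to Switch: ACK(D+1), Data(C+1)
Switch to Server: SYN(C)
Server to Switch: SYN(S), ACK(C+1)
Switch to Server: ACK(S+1), Data(C+1)
Server to Switch: ACK(C+len+1), Data(S+1)
Switch to Client (mapped D <- S): ACK(C+len+1), Data(D+1)
Client to Switch: ACK(D+len+1)
Switch to Server (mapped D -> S): ACK(S+len+1)]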
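The final step, mapping sequence numbers on subsequent packets, amounts to adding a fixed offset. The sketch below assumes that D and S in the figure are the initial sequence numbers chosen by the switch (toward the client) and by the server, and that the switch reuses the client's number C when it connects to the server, so only the server-side sequence space needs translating. The function names are hypothetical.

```python
MOD = 2 ** 32  # TCP sequence numbers wrap at 32 bits


def server_to_client(seq, ack, d_isn, s_isn):
    """Rewrite a server packet so the client sees the switch's sequence space."""
    return (seq + d_isn - s_isn) % MOD, ack


def client_to_server(seq, ack, d_isn, s_isn):
    """Rewrite a client packet so the server sees its own sequence space."""
    return seq, (ack + s_isn - d_isn) % MOD


# Example with C=1000, D=5000, S=9000 and a 100-byte request and response.
print(server_to_client(9001, 1101, d_isn=5000, s_isn=9000))  # Data(S+1) becomes Data(D+1)
print(client_to_server(1101, 5101, d_isn=5000, s_isn=9000))  # ACK(D+len+1) becomes ACK(S+len+1)
```

A real splicer also has to update the TCP checksum after rewriting these fields, and rewrite addresses and ports; that is omitted here.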
Partitioning the Workload
Latency on a Linux-based switch
• Latency is reduced by TCP splicing
Latency using NP
[Figure: latency on the switch (ms, axis from 0 to 20) versus request file size (1, 4, 16, 64, 256, 1024 KB) for Linux Splicer and SpliceNP.]
Throughput
[Figure: throughput (Mbps, axis from 0 to 800) versus request file size (1, 4, 16, 64, 256, 1024 KB) for Linux Splicer and SpliceNP.]
NePSim: http://www.cs.ucr.edu/~yluo/nepsim/
• Objectives
– Open-source
– Cycle-level accuracy
– Flexibility
– Integrated power model
– Fast simulation speed
• Challenges
– Domain-specific instruction set
– Porting network benchmarks
– Difficulty in debugging multithreaded programs
– Verification of the functionality and timing
Yan Luo, Jun Yang, Laxmi Bhuyan, Li Zhao, NePSim, IEEE Micro Special Issue on NP, Sept/Oct 2004; Intel IXP Summit, Sept 2004.
Users from UCSD, Univ. of Arizona, Georgia Tech, Northwestern Univ., and Tsinghua Univ. From July 2004 to October 2006, NePSim had 3530 web page visits and 806 downloads.
NePSim Software Architecture
• Microengine (six)
• Memory (SRAM/SDRAM)
• Network Device
• Debugger
• Statistic
• Verification
[Figure: NePSim block diagram with Microengine, SRAM, SDRAM, Network Device, Debugger, Stats, and Verification components.]
Power Model
H/W component | Model Type | Tool | Configurations
GPR per Microengine | Array | XCacti | 2 64-entry files, one read/write port per file
Control store, scratchpad | Cache w/o tag path | XCacti | 4KB, 4 bytes per block, direct mapped, 10-bit address
ALU, shifter | ALU and shifter | Wattch | 32-bit
… | … | … | …
Benchmarks
• ipfwdr
– IPv4 forwarding (header validation, IP lookup)
– Medium SRAM access
• nat
– Network address translation
– Medium SRAM access
• url
– Examines payload for URL pattern
– Heavy SDRAM access
• md4
– Compute a 128-bit message “signature”
– Heavy computation and SDRAM access
Verification of NePSim
[Figure: the benchmarks are run on both the IXP1200 and NePSim, and the resulting performance statistics are checked for equality. Both sides show the same trace excerpt:
23990 inst.(pc=129) executed
24008 sram req issued
24009 ….]
Assertion Based Verification
(Linear Temporal Logic/Logic Of Constraint)
X. Chen, Y. Luo, H. Hsieh, L. Bhuyan, F. Balarin, "Utilizing Formal Assertions for System Design of
Network Processors," Design Automation and Test in Europe (DATE), 2004.
Performance-Power Trend
Power consumption increases faster than performance
[Figure: four panels (url, ipfwdr, md4, nat), each plotting Power and Performance curves.]
Dynamic Voltage Scaling
• $\text{Power} = C \cdot \alpha \cdot V^2 \cdot f$
• Reduce PE voltage and frequency when the PE has idle time
[Figure labels: Voltage, Frequency.]
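Because dynamic power depends on the square of the voltage and linearly on the frequency, lowering both together gives a roughly cubic saving. The short sketch below works this through; the function names and the 0.8 scaling factors are illustrative assumptions, not values from the NePSim experiments.

```python
def dynamic_power(c, alpha, v, f):
    """Dynamic power: capacitance x activity factor x voltage squared x frequency."""
    return c * alpha * v ** 2 * f


def relative_power(v_scale, f_scale):
    """Power after scaling, as a fraction of the original (C and alpha unchanged)."""
    return v_scale ** 2 * f_scale


# Scaling both voltage and frequency to 80% leaves about 51% of the power.
print(round(relative_power(0.8, 0.8), 3))
```

The performance cost is at most proportional to the frequency reduction, and smaller when the PE would otherwise have been idle, which is why scaling during idle time is attractive.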
Power Reduction with DVS
[Figure: bar chart of Power Reduction and Perf. Reduction for url, ipfwdr, md4, nat, and avg.]
Yan Luo, Jun Yang, Laxmi Bhuyan, Li Zhao, NePSim: A Network Processor
Simulator with Power Evaluation Framework, IEEE Micro Special Issue
on Network Processors, Sept/Oct 2004
Power Saving by Clock Gating
• Shut down unnecessary PEs, re-activate PEs when needed
• Clock gating retains PE instructions
Yan Luo, Jia Yu, Jun Yang, Laxmi Bhuyan, Low Power Network Processor Design Using Clock Gating, IEEE/ACM Design Automation Conference (DAC), June 2005. Extended version to appear in ACM Trans. on Architecture and Code Optimization.
Challenges of Clock Gating PEs
• Terminating threads safely
– Threads request memory resources
– Stopping unfinished threads results in resource leakage
• Reschedule packets to avoid “orphan” ports
– Static thread-port mapping prohibits shutting down PEs
– Dynamically assign packets to any waiting threads
• Avoid “extra” packet loss
– Burst packet arrival can overflow internal buffer
– Use a small extra buffer space to handle bursts
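The point about orphan ports can be illustrated with a small dispatcher: if packets from all ports go into one shared queue and any waiting thread on any active PE may take the next one, then gating a PE never leaves a port unserved. This is a hypothetical sketch of the idea only; the function name, the data layout, and the default of four threads per PE are assumptions, not the scheduler from the paper.

```python
from collections import deque


def dispatch(packets, active_pes, threads_per_pe=4):
    """Assign queued packets to any waiting thread on an active PE.

    packets: iterable of (port, packet_id) tuples gathered from all ports.
    Returns a list of (pe, thread, port, packet_id) assignments.
    """
    queue = deque(packets)  # one shared queue instead of a fixed thread-port mapping
    waiting = deque((pe, t) for pe in active_pes for t in range(threads_per_pe))
    assignments = []
    while queue and waiting:
        pe, thread = waiting.popleft()
        port, packet_id = queue.popleft()
        assignments.append((pe, thread, port, packet_id))
    return assignments


# Four ports, but only PEs 0 and 1 are active (the others are clock gated).
packets = [(port, packet_id) for packet_id, port in enumerate([0, 1, 2, 3])]
for assignment in dispatch(packets, active_pes=[0, 1], threads_per_pe=2):
    print(assignment)
```

With a static mapping, ports 2 and 3 would have lost their threads when their PEs were gated; here every port is still served by the remaining PEs.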
Experiment Results of Clock Gating
<4% reduction on system throughput
Main Contributions
• Constructed an execution-driven multiprocessor router simulation framework, proposed a set of benchmark applications and evaluated performance
• Built NePSim, the first open-source network processor simulator, ported network benchmarks and conducted performance and power evaluation
• Applied dynamic voltage scaling to reduce power consumption
• Used clock gating to adapt the number of active PEs according to real-time traffic