CS 260
Lecture 1: Introduction to Network Processors
Instructor: L.N. Bhuyan
www.cs.ucr.edu/~bhuyan/CS260
2003 ©UCR
Outline
°Introduction to NP Systems
°Relevant Applications
°Design Issues and Challenges
°Relevant Software and Benchmarks
°A case study: Intel IXP network processors
What are Network Processors
°Any device that executes programs to handle packets in a data network
°Examples
• Processors on router line cards
• Processors in network access equipment
Why Network Processors
° Current Situation
• Data rates are increasing
• Protocols are becoming more dynamic and sophisticated
• Protocols are being introduced more rapidly
° Processing Elements
• GP (General-purpose Processor)
- Programmable, but not optimized for networking applications
• ASIC (Application Specific Integrated Circuit)
- High processing capacity, but long development time and little flexibility
• NP (Network Processor)
- Achieves high processing performance
- Programming flexibility
- Cheaper than GP
Typical NP Architecture
[Figure: a network processor containing multi-threaded processing elements and a co-processor, placed between input ports and output ports. One bus connects it to SDRAM (packet buffer) and another to SRAM (routing table).]
TCP/IP Model
Layer  OSI           TCP/IP
7      Application   Application
6      Presentation  (not present)
5      Session       (not present)
4      Transport     TCP
3      Network       IP
2      Data Link     Host-to-Net
1      Physical      Host-to-Net
° ISO OSI (Open Systems Interconnection) not fully implemented
° Presentation and Session layers not present in TCP/IP
Processing Tasks
° Control Plane
• Policy Applications
• Network Management
• Signaling
• Topology Management
° Data Plane
• Queuing / Scheduling
• Data Transformation
• Classification
• Data Parsing
• Media Access Control
• Physical Layer
Source: Network Processor Tutorial in Micro 34 - Mangione-Smith & Memik
Application Categorization
° Control-Plane tasks
• Less time-critical
• Control and management of device operation
- Table maintenance, port states, etc.
° Data-Plane tasks
• Operations occurring in real time on the “packet path”
• Core device operations
- Receive, process and transmit packets
Data Plane Tasks
° Media Access Control
• Low-level protocol implementation
- Ethernet, SONET framing, ATM cell processing, etc.
° Data Parsing
• Parsing cell or packet headers for address or protocol information
° Classification
• Match each packet against a set of criteria (filtering / forwarding decision, QoS, accounting, etc.)
° Data Transformation
• Transformation of packet data between protocols
° Traffic Management
• Queuing, scheduling and policing packet data
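The parsing and classification steps above can be sketched in Python. This is only an illustration of the idea, not network processor code: the names `parse_ipv4_header`, `classify`, `build_sample` and the `RULES` table are hypothetical, and the sketch assumes a plain 20-byte IPv4 header without options.

```python
import socket
import struct

def parse_ipv4_header(packet: bytes) -> dict:
    """Data parsing: pull address and protocol fields out of the IPv4 header."""
    if len(packet) < 20:
        raise ValueError("truncated IPv4 header")
    (ver_ihl, tos, total_len, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", packet[:20])
    return {
        "version": ver_ihl >> 4,
        "header_len": (ver_ihl & 0x0F) * 4,  # IHL is counted in 32-bit words
        "tos": tos,
        "total_len": total_len,
        "ttl": ttl,
        "protocol": proto,
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }

# Hypothetical classification rules: IP protocol number -> traffic class.
RULES = {1: "icmp", 6: "tcp", 17: "udp"}

def classify(header: dict) -> str:
    """Classification: map the parsed header to a traffic class."""
    return RULES.get(header["protocol"], "other")

def build_sample() -> bytes:
    """Build a made-up header (checksum left at 0) to exercise the parser."""
    return struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 1, 0, 64, 6, 0,
                       socket.inet_aton("10.0.0.1"),
                       socket.inet_aton("192.168.1.7"))

if __name__ == "__main__":
    hdr = parse_ipv4_header(build_sample())
    print(hdr["src"], "->", hdr["dst"], classify(hdr))
```

A real data plane would do this per packet at line rate, which is why NPs provide hardware assists for header extraction and lookup.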
Applications: IPv4 Routing
[Figure: a router connected to A, B, and C, with packets (P) in transit]
° Routers determine next hop and forward packets
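Determining the next hop is usually done by longest-prefix match on the destination address. The sketch below shows the idea with a linear scan. The routing table contents and the names `ROUTES`, `build_table`, and `next_hop` are assumptions made for illustration (the hops A, B, C only echo the figure), and real routers typically use tries or TCAMs rather than a scan.

```python
import ipaddress

# Hypothetical routing table: (prefix, next hop).
ROUTES = [
    ("0.0.0.0/0", "A"),     # default route
    ("10.0.0.0/8", "B"),
    ("10.1.0.0/16", "C"),
]

def build_table(routes):
    """Parse prefixes and order them from most to least specific."""
    table = [(ipaddress.ip_network(prefix), hop) for prefix, hop in routes]
    table.sort(key=lambda entry: entry[0].prefixlen, reverse=True)
    return table

def next_hop(table, dst: str):
    """Return the hop of the longest prefix that contains dst, or None."""
    addr = ipaddress.ip_address(dst)
    for network, hop in table:
        if addr in network:
            return hop  # first hit is the longest match because of the sort
    return None

if __name__ == "__main__":
    table = build_table(ROUTES)
    for dst in ("10.1.2.3", "10.2.0.1", "8.8.8.8"):
        print(dst, "->", next_hop(table, dst))
```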
URL-based switching – My NSF Project
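[Figure: requests for www.yahoo.com arrive from the Internet at a switch, which forwards each one to an image server, an application server, or an HTML server. Each packet carries IP and TCP headers followed by application data, for example:]
GET /cgi-bin/form HTTP/1.1
Host: www.yahoo.com…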
° Increase efficiency
° Tasks
• Traverse the packet data (request) for each arriving packet and classify it:
- Contains ‘.jpg’ -> to image server
- Contains ‘cgi-bin/’ -> to application server
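The two rules above can be sketched as a small classifier. The function names `extract_request_target` and `choose_server` are hypothetical, and sending everything else to the HTML server is an assumption based on the figure, not a rule stated on the slide. The sketch also assumes the whole request line arrives in one payload.

```python
def extract_request_target(payload: bytes) -> str:
    """Return the request target from the first line of an HTTP request."""
    request_line = payload.split(b"\r\n", 1)[0].decode("ascii", errors="replace")
    parts = request_line.split(" ")
    if len(parts) != 3:
        raise ValueError("not an HTTP request line")
    return parts[1]

def choose_server(payload: bytes) -> str:
    """Apply the slide's two rules; the 'html' fallback is an assumption."""
    target = extract_request_target(payload)
    if ".jpg" in target:
        return "image"
    if "cgi-bin/" in target:
        return "application"
    return "html"

if __name__ == "__main__":
    request = b"GET /cgi-bin/form HTTP/1.1\r\nHost: www.yahoo.com\r\n\r\n"
    print(choose_server(request))
```

Unlike IPv4 routing, this classification has to look past the IP and TCP headers into the application data, which makes it considerably more expensive per packet.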
Organizing Processor Resources
°Design decisions:
• High-level organization
• ISA and microarchitecture
• Memory and I/O integration
°Today’s commercial NPs:
• Chip multiprocessors
• Most are multithreaded
• Exploit little ILP (Cisco does)
• No cache
• Micro-programmed
Architectural Comparisons
°High-level organizations
• Aggressive superscalar (SS)
• Fine-grained multithreaded (FGMT)
• Chip multiprocessor (CMP)
• Simultaneous multithreaded (SMT)
Multithreading
°Basic idea
• multiple register sets in the processor
• fast context switch
• switch thread on a cache access (How is this different from a non-blocking cache?)
• tolerating local latency vs remote latency in CC-NUMA multiprocessors
• hybrids
- switch on notice
- simultaneous multithreading
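To see why a fast context switch helps, the toy simulation below models a single-issue processor whose threads alternate between a burst of computation and a memory reference. It is a sketch under simplifying assumptions that are not from the slides: context switches cost zero cycles, every thread behaves identically, and memory has a fixed latency. The function name `utilization` and its parameters are hypothetical.

```python
def utilization(num_threads: int, compute: int, latency: int,
                total_cycles: int = 10_000) -> float:
    """Fraction of cycles spent computing when threads switch on memory access."""
    ready_at = [0] * num_threads  # cycle at which each thread's data is back
    busy = 0
    cycle = 0
    current = 0
    while cycle < total_cycles:
        # Round-robin search for a thread whose memory reference has returned.
        for offset in range(num_threads):
            t = (current + offset) % num_threads
            if ready_at[t] <= cycle:
                break
        else:
            # Every thread is waiting on memory: idle until the first wakes up.
            cycle = min(ready_at)
            continue
        run = min(compute, total_cycles - cycle)
        busy += run
        cycle += run
        ready_at[t] = cycle + latency  # thread issues a reference and yields
        current = (t + 1) % num_threads
    return busy / total_cycles

if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, "threads:", utilization(n, compute=10, latency=30))
```

With 10 compute cycles per 30-cycle memory reference, one thread keeps the processor busy a quarter of the time, and four threads are enough to hide the latency completely in this model.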
Architectural Comparisons (cont.)
[Figure: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading. Each slot is marked as belonging to Thread 1 through Thread 5 or as an idle slot.]
Tasks and Services
Three Benchmarks used in the experiment
Some Challenges
° Intelligent Design
• Given a selection of programs and a target network link speed, find the ‘best’ design for the processor
- Least area
- Least power
- Most performance
° Write efficient multithreaded programs
• NPs have
- Heterogeneous compute resources
- Non-uniform memory
- Multiple interacting threads of execution
- Real-time constraints
• Make use of resources
- How to use special instructions and hardware assists
– Compilers
– Hand-coded
• Multithreaded programs
- Manage access to shared state
- Synchronization between threads
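The real-time constraint follows directly from the target link speed: it fixes how many cycles the processor may spend on each packet. The sketch below works through that arithmetic. The function name `packet_budget` and the example figures (1 Gbps link, 64-byte packets, 200 MHz clock, 6 engines) are hypothetical, and the calculation ignores framing overhead such as preambles and inter-packet gaps.

```python
def packet_budget(link_bps: float, packet_bytes: int,
                  clock_hz: float, num_engines: int = 1):
    """Return (packets/s, ns per packet, cycles available per packet)."""
    packets_per_sec = link_bps / (packet_bytes * 8)
    ns_per_packet = 1e9 / packets_per_sec
    # Each engine works on a different packet, so budgets add across engines.
    cycles_per_packet = clock_hz / packets_per_sec * num_engines
    return packets_per_sec, ns_per_packet, cycles_per_packet

if __name__ == "__main__":
    pps, ns, cycles = packet_budget(1e9, 64, 200e6, num_engines=6)
    print(f"{pps:.0f} packets/s, {ns:.0f} ns each, {cycles:.1f} cycles per packet")
```

Any design whose per-packet work, including memory stalls that are not hidden by multithreading, exceeds this budget cannot keep up with the link.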
Benchmarks for Network Processors
• NetBench
- 10 applications
- http://cares.icsl.ucla.edu/NetBench
• CommBench
- 8 networking and communications applications
- http://ccrc.wustl.edu/~wolf/cb/
• EEMBC
- http://www.eembc.org/benchmark
• MediaBench
- Transcoders
- Some communications applications
IXP1200 Block Diagram
° StrongARM processing core
° Microengines introduce new ISA
° I/O
• PCI
• SDRAM
• SRAM
• IX: PCI-like packet bus
° On chip FIFOs
• 16 entries, 64B each
IXP1200 Microengine
° 4 hardware contexts
• Single issue processor
• Explicit optional context switch on SRAM access
° Registers
• All are single ported
• Separate GPR
• 256*6 = 1536 registers total
° 32-bit ALU
• Can access GPR or XFER registers
° Shared hash unit
• 1/2/3 values – 48b/64b
• For IP routing hashing
° Standard 5 stage pipeline
° 4KB SRAM instruction store – not a cache!
° Barrel shifter
IXP 2400 Block Diagram
° XScale core replaces StrongARM
° Microengines
• Faster
• More: 2 clusters of 4 microengines each
° Local memory
° Next neighbor routes added between microengines
° Hardware to accelerate CRC operations and random number generation
° 16 entry CAM
[Figure: block diagram showing the XScale core, two clusters of microengines (ME0 to ME3 and ME4 to ME7), the DDR DRAM controller, the QDR SRAM controller, the Scratch/Hash/CSR unit, PCI, and the MSF unit.]
Different Types of Memory
Type             Width    Size     Approx. unloaded   Notes
                 (bytes)  (bytes)  latency (cycles)
Local            4        2560     1                  Indexed addressing, post incr/decr
On-chip Scratch  4        16K      60                 Atomic ops
SRAM             4        256M     150                Atomic ops
DRAM             8        2G       300                Direct path to/from MSF
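The latencies in the table give a rough way to reason about where data should live. The sketch below uses them for two back-of-the-envelope estimates: the unloaded stall time for a given mix of accesses, and how many threads are needed to hide one memory level. The access counts, the compute interval, and the names `stall_cycles` and `threads_to_hide` are assumptions for illustration. The estimates assume accesses are issued one at a time with no contention.

```python
import math

# Approximate unloaded latencies in cycles, taken from the table above.
LATENCY_CYCLES = {"local": 1, "scratch": 60, "sram": 150, "dram": 300}

def stall_cycles(accesses: dict) -> int:
    """Total unloaded latency if the accesses are issued back to back."""
    return sum(LATENCY_CYCLES[kind] * count for kind, count in accesses.items())

def threads_to_hide(latency: int, compute_between_accesses: int) -> int:
    """Threads needed so the others' compute covers one thread's wait.

    With N threads the processor stays busy when (N - 1) * compute >= latency.
    """
    return 1 + math.ceil(latency / compute_between_accesses)

if __name__ == "__main__":
    # Hypothetical per-packet access mix.
    print(stall_cycles({"local": 10, "sram": 2, "dram": 3}))
    print(threads_to_hide(LATENCY_CYCLES["sram"], 50))
    print(threads_to_hide(LATENCY_CYCLES["dram"], 50))
```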
IXA Software Framework
[Figure: IXA software framework, drawn as three layers.
External processors: Control Plane Protocol Stack, Control Plane PDK.
XScale core (C/C++ language): Core Components, Core Component Library, Resource Manager Library.
Microengine pipeline (Microengine C language): microblocks, Microblock Library, Protocol Library, Utility Library, Hardware Abstraction Library.]
Example Toaster System: Cisco 10000
° Almost all data plane operations execute on the programmable XMC
° Pipeline stages are assigned tasks – e.g. classification, routing, firewall, MPLS
• Classic SW load balancing problem
° External SDRAM shared by common pipe stages
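The load balancing problem mentioned above arises because a pipeline runs at the speed of its slowest stage. One way to illustrate it is to split an ordered list of tasks into contiguous stages so that the busiest stage is as light as possible. The sketch below does this with a binary search over the bottleneck. It is not Cisco's method. The task costs and the names `stages_needed` and `min_bottleneck` are hypothetical, and costs are assumed to be whole cycle counts.

```python
def stages_needed(costs, limit):
    """Stages used when each stage is filled greedily up to `limit` cycles."""
    stages, load = 1, 0
    for cost in costs:
        if load + cost > limit:
            stages += 1  # current task opens a new stage
            load = 0
        load += cost
    return stages

def min_bottleneck(costs, num_stages):
    """Smallest achievable cost of the busiest stage for an ordered task list."""
    lo, hi = max(costs), sum(costs)
    while lo < hi:
        mid = (lo + hi) // 2
        if stages_needed(costs, mid) <= num_stages:
            hi = mid      # feasible: try a tighter bottleneck
        else:
            lo = mid + 1  # infeasible: the bottleneck must be larger
    return lo

if __name__ == "__main__":
    # Hypothetical cycle counts for classification, routing, firewall, MPLS.
    costs = [40, 30, 50, 20]
    for k in (1, 2, 4):
        print(k, "stages -> bottleneck", min_bottleneck(costs, k))
```

In this example two stages give a bottleneck of 70 cycles (classification plus routing in one stage, firewall plus MPLS in the other), so doubling the stages does not quite double the throughput.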
IBM PowerNP
° 16 pico-processors and 1 PowerPC
° Each pico-processor
• Supports 2 hardware threads
• 3 stage pipeline: fetch/decode/execute
° Dyadic Processing Unit
• Two pico-processors
• 2KB Shared memory
• Tree search engine
° Focus is layers 2-4
° PowerPC 405 for control plane operations
• 16K I and D caches
° Target is OC-48
Motorola C-Port C-5 Chip Architecture
[Figure: C-5 block diagram. Processors CP-0 through CP-15 are grouped into clusters and attach to PHY interfaces. 60Gbps busses connect them to the Executive Processor, Fabric Processor, Table Lookup Unit, Queue Mngt Unit, and Buffer Mngt Unit, with external connections to SRAM, PROM, PCI, control, and the switch fabric.]
References
° [NPT] W. H. Mangione-Smith, G. Memik, Network Processor Technologies
° [NPRD] Patrick Crowley, Raj Yavatkar, An Introduction to Network Processor Research & Design, HPCA-9 Tutorial