
NARC:
Network-Attached Reconfigurable Computing for
High-performance, Network-based Applications
Chris Conger, Ian Troxel, Daniel Espinosa,
Vikas Aggarwal, and Alan D. George
High-performance Computing and Simulation (HCS) Research Lab
Department of Electrical and Computer Engineering
University of Florida
#233 MAPLD 2005
Outline

- Introduction
- NARC Board Architecture, Protocols
- Case Study Applications
- Experimental Setup
- Results and Analysis
- Pitfalls and Lessons Learned
- Conclusions
- Future Work
Introduction

- Network-Attached Reconfigurable Computer (NARC) Project
  - Inspiration: network-attached storage (NAS) devices
  - Core concept: investigate challenges and alternatives for enabling direct network access to, and control over, reconfigurable (RC) devices
  - Method: prototype a hardware interface and software infrastructure, and demonstrate a proof of concept for the benefits of network-attached RC resources
- Motivations for the NARC project include (but are not limited to) applications such as:
  - Network-accessible processing resources
    - Generic network RC resource, a viable alternative to server and supercomputer solutions
    - Power and cost savings over server-based FPGA cards are key benefits
      - No server needed to host the RC device
    - Infrastructure provided for robust operation and interfacing with users
    - Performance increase over existing RC solutions is not a primary goal of this approach
  - Network monitoring and packet analysis
    - Easy attachment; unobtrusive, fast traffic gathering and processing
    - Network intrusion and attack detection, performance monitoring, active traffic injection
    - Direct network connection of the FPGA can enable wire-speed processing of network traffic
  - Aircraft and advanced munitions systems
    - Standard Ethernet interface eases addition and integration of RC devices in aircraft and munitions systems
    - Low weight and power are also attractive characteristics of the NARC device for such applications
Envisioned Applications

- Aerospace & military applications
  - Modular, low-power design lends itself well to military craft and munitions deployment
  - FPGAs providing high-performance radar, sonar, and other computational capabilities
- Scientific field operations
  - Quickly provide first-level estimations for scientific field operations for geologists, biologists, etc.
- Field-deployable covert operations
  - Completely wireless device enabled through battery and WLAN
  - Passive network monitoring applications
  - Active network traffic injection
- Distributed computing
  - Cost-effective, RC-enabled clusters or cluster resources
  - Cluster NARC devices at a fraction of the cost, power, and cooling
- Cost-effective intelligent sensor networks
  - Use FPGAs in close conjunction with sensors to provide pre-processing functions before network transmission
- High-performance network technologies
  - Fast Ethernet may be replaced by any network technology
  - Gig-E, InfiniBand, RapidIO, proprietary communication protocols
NARC Board Architecture: Hardware

- ARM9 network control with FPGA processing power (see Figure 1)
- Prototype design consists of two boards, connected via cable:
  - Network interface board (ARM9 processor + peripherals)
  - Xilinx development board(s) (FPGA): Xilinx HW-AFX-BG560-100
- Network interface peripherals include:
  - Layer-2 network connection (hardware PHY + MAC)
  - External memory, SDRAM and Flash
  - Serial port (debug communication link)
  - FPGA control and data lines
- NARC hardware specifications:
  - ARM-core microcontroller (Atmel AT91RM9200), 1.8V core, 3.3V peripherals
    - 32-bit RISC, 5-stage pipeline, in-order execution
    - 16KB data cache, 16KB instruction cache
    - Core clock speed 180MHz, peripheral clock 60MHz
    - On-chip Ethernet MAC layer with DMA
  - External memory, 3.3V
    - 32MB SDRAM, 32-bit data bus
    - 2MB Flash, 16-bit data bus
    - Port available for additional 16-bit SRAM devices
  - Ethernet transceiver, 3.3V
    - DM9161 PHY-layer transceiver
    - 100Mbps, full-duplex capable
    - RMII interface to MAC

Figure 1 – Block diagram of NARC device (custom network interface board with ARM processor, external memory bus, Ethernet PHY/RJ-45, JTAG, and serial port, connected to the RC device(s) through 8-bit input, 8-bit output, and control lines)
NARC Board Architecture: Software

- ARM processor runs Linux (kernel 2.4.19)
  - Provides TCP(UDP)/IP stack, resource management, threaded execution, and a Berkeley Sockets interface for applications
  - Configured and compiled with drivers specifically for our board
- Bootstrap software resides in on-board Flash and automatically loads and executes on power-up
  - Configures clocks, memory controllers, I/O pins, etc.
  - Contacts a TFTP server running on the network, downloads the Linux kernel and ramdisk
  - Boots Linux, then automatically executes the NARC board software contained in the ramdisk
- Applications written in C, compiled using the GCC cross-compiler for ARM (arm-linux-gcc; see Figure 2)
- NARC API: low-level driver function library for basic services
  - Initialize and configure on-chip peripherals of the ARM-core processor
  - Configure the FPGA (SelectMAP protocol)
  - Transfer data to/from the FPGA, manipulate control lines
  - Monitor and initiate network traffic
- NARC protocol for job exchange (from a remote workstation)
  - NARC board application and client application must follow standard rules and procedures for responding to requests from a user
  - User appends a small header onto the data (if any) containing information about the request before sending it over the network (see Figure 3)
- Optional serial interface through HyperTerminal for debugging/development

Figure 2 – Software development process (main.c, client.c, and util.c built with arm-linux-gcc against the narc.h library routines; the NARC board application is placed in the ramdisk on the NARC board, while the client application runs on the user workstation)

Figure 3 – Request header field definitions
  - RTYPE (1 byte): 00 – request status; 01 – configure FPGA; 02-FE – user-definable functions; FF – reboot board
  - Job ID (1 byte): unique identifier of the request, included with the response
  - Undefined (2 bytes)
  - Data Size (4 bytes): size of the data payload in bytes
  - Data: payload follows the header
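To make the job-exchange protocol concrete, below is a minimal sketch in C of how a client might pack the 8-byte request header of Figure 3 before sending it (followed by the data payload) over a socket. The constant names, the pack_request_header() helper, and the network byte order of the Data Size field are illustrative assumptions, not part of the actual NARC API.

/* Minimal sketch of packing the 8-byte NARC request header of Figure 3.
 * Constant names, the helper function, and the byte order of the size
 * field are assumptions for illustration, not the actual NARC API.     */
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>          /* htonl() */

#define RTYPE_STATUS    0x00    /* 00 - request status             */
#define RTYPE_CONFIGURE 0x01    /* 01 - configure FPGA             */
#define RTYPE_REBOOT    0xFF    /* FF - reboot board               */
/* 0x02-0xFE are user-definable functions                           */

/* Header layout: RTYPE (1 byte), Job ID (1 byte), undefined (2 bytes),
 * Data Size in bytes (4 bytes); the data payload follows the header. */
static void pack_request_header(uint8_t hdr[8], uint8_t rtype,
                                uint8_t job_id, uint32_t data_size)
{
    hdr[0] = rtype;
    hdr[1] = job_id;                        /* echoed back with the response */
    hdr[2] = 0;                             /* undefined/reserved            */
    hdr[3] = 0;
    uint32_t size_net = htonl(data_size);   /* network byte order assumed    */
    memcpy(&hdr[4], &size_net, sizeof(size_net));
}

A client would then write these 8 bytes, followed by the data_size bytes of payload, to its socket connected to the NARC board and wait for a response carrying the same Job ID.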
NARC Board Architecture: FPGA Interface

- Data communicated to/from the FPGA by means of unidirectional data paths
  - 8-bit input port, 8-bit output port, 8 control lines (see Figure 4)
  - Control lines manage data transfer and also drive configuration signals
  - Data transferred one byte at a time; full-duplex communication possible
- Control lines include the following signals:
  - Clock – software-generated signal to clock data on the data ports
  - Reset – reset signal for the interface logic in the FPGA
  - Ready – indicates the device is ready to accept another byte of data
  - Valid – indicates the device has placed valid data on its port
  - SelectMAP – all signals necessary to drive SelectMAP configuration (PROG, INIT, CS, WRITE, DONE)
- FPGA configuration through the SelectMAP protocol
  - Fastest configuration option for Xilinx FPGAs; protocol emulated using GPIO pins of the ARM
  - NARC board enables remote configuration and management of the FPGA
    - User submits a configuration request (RTYPE = 01), along with the bitfile and a function descriptor
    - Function descriptor is an ASCII string: a formatted list of functions with their associated RTYPE definitions
    - ARM halts and configures the FPGA, then stores the descriptor in a dedicated RAM buffer for user queries
- All FPGA designs must restrict use of the SelectMAP pins after configuration
  - Some signals are shared between the SelectMAP port and the FPGA-ARM link
  - Once configured, SelectMAP pins must remain tri-stated and unused

Figure 4 – FPGA interface signal diagram (ARM and FPGA connected by the In[0:7]/Out[0:7] data ports, the a_valid/f_ready and f_valid/a_ready handshake pairs, clock, reset, and the shared SelectMAP port signals PROG, INIT, CS, WRITE, DONE)
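As an illustration of the software-generated handshake described above, the sketch below shows how the ARM side might clock one byte out to the FPGA over the 8-bit output port. The gpio_* helpers, pin identifiers, and exact signal sequencing are hypothetical placeholders inferred from the ready/valid description; they are not the actual NARC driver code.

/* Sketch of sending one byte from ARM to FPGA over the 8-bit output port
 * using the clock / ready / valid handshake described above. The gpio_*
 * helpers, pin names, and exact sequencing are hypothetical, based only
 * on the signal descriptions, not on the actual NARC driver.            */
#include <stdint.h>

extern int  gpio_read(int pin);                /* hypothetical GPIO read      */
extern void gpio_write(int pin, int value);    /* hypothetical GPIO write     */
extern void gpio_write_bus(const int pins[8], uint8_t value);

extern const int PIN_OUT[8];                   /* Out[0:7] data port to FPGA  */
extern const int PIN_CLOCK, PIN_A_VALID, PIN_F_READY;

void narc_send_byte(uint8_t byte)
{
    while (!gpio_read(PIN_F_READY))            /* wait until FPGA is ready    */
        ;
    gpio_write_bus(PIN_OUT, byte);             /* drive data onto Out[0:7]    */
    gpio_write(PIN_A_VALID, 1);                /* mark the data as valid      */
    gpio_write(PIN_CLOCK, 1);                  /* software-generated clock    */
    gpio_write(PIN_CLOCK, 0);                  /*   edge latches the byte     */
    gpio_write(PIN_A_VALID, 0);
}

Because each byte costs several GPIO operations from software, this bit-banged handshake is also what limits the raw link throughput reported in the results that follow.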
Results and Analysis: Raw Performance

- FPGA interface I/O throughput (Table 1)
  - 1KB of data transferred over the link and timed
  - Measured using hardware methods
    - Logic analyzer – to capture the raw link data rate, divide the data sent by the time from first clock to last clock (see Figure 9)
  - Performance lower than desired for the prototype
    - Handshake protocol may add unnecessary overhead
  - Widening the data paths and optimizing the software routine will significantly improve FPGA I/O performance

Table 1 – FPGA interface I/O performance (logic analyzer, Mb/s)
  Input:  6.08
  Output: 6.12

- Network throughput (Table 2)
  - Measured using the Linux network benchmark IPerf
    - NARC board located on an arbitrary switch within the network; application partner is a user workstation
    - Transfers as much data as possible in 10 seconds, then calculates throughput as data sent divided by 10 seconds
  - Performed two experiments, with the NARC board serving as client in one run and server in the other
  - Both local and remote (remote location ~400 miles away, at Florida State University) IPerf partners
  - Network interface achieves reasonably good bandwidth efficiency

Table 2 – Network throughput (Mb/s)
                     Local Network   Remote Network (WAN)
  NARC - Server           75.4              4.9
  Server - Server         78.9              5.3

- External memory throughput (Table 3)
  - 4KB transferred to external SDRAM, both read and write
  - Measurements again taken using the logic analyzer
  - Memory throughput sufficient to provide wire-speed buffering of network traffic
    - On-chip Ethernet MAC has DMA to this SDRAM
    - Should help alleviate the I/O bottleneck between ARM and FPGA

Table 3 – External SDRAM throughput (logic analyzer, Mb/s)
  Read:  183.2
  Write: 160

Figure 9 – Logic analyzer timing capture of a link transfer
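The IPerf-style measurement above amounts to streaming data for a fixed window and dividing the bytes sent by the elapsed time. The sketch below illustrates that method for the client-side run; the partner address, port, and buffer size are arbitrary values chosen for illustration, and this is not the IPerf implementation itself.

/* Sketch of the timed-transfer method described above: stream data to a
 * partner for ~10 seconds, then divide the bytes sent by the elapsed time.
 * The peer address, port, and buffer size are arbitrary illustrative
 * values; this is not the IPerf implementation itself.                   */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in peer = { 0 };
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(5001);                       /* IPerf default port   */
    inet_pton(AF_INET, "192.168.0.10", &peer.sin_addr);  /* hypothetical partner */
    if (sock < 0 || connect(sock, (struct sockaddr *)&peer, sizeof(peer)) < 0)
        return 1;

    char buf[8192];
    memset(buf, 0xA5, sizeof(buf));
    long long total_bytes = 0;
    time_t start = time(NULL);
    while (time(NULL) - start < 10) {                    /* send for ~10 seconds */
        ssize_t n = send(sock, buf, sizeof(buf), 0);
        if (n <= 0)
            break;
        total_bytes += n;
    }
    close(sock);
    printf("Throughput: %.1f Mb/s\n", (total_bytes * 8) / 10.0 / 1.0e6);
    return 0;
}

The local-network figures in Table 2 correspond to roughly 75-79% of the 100 Mb/s Fast Ethernet line rate under this method.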
Results and Analysis: Raw Performance

- Reconfiguration speed
  - Includes time to transfer the bitfile over the network, plus time to configure the device (transfer the bitfile from ARM to FPGA), plus time to receive the acknowledgement
  - Our design currently completes a user-initiated reconfiguration request with a 1.2MB bitfile in 2.35 sec
- Area/resource usage of the minimal wrapper for the Virtex-II Pro FPGA
  - Resource requirements for a minimal design providing the required link control and data transfer in an application wrapper are presented below:
    - Design implemented on an older Virtex-II Pro FPGA
    - Numbers indicate requirements for the wrapper only; unused resources remain available for user applications
    - Extremely small footprint! The footprint will be even smaller on a larger FPGA

Device utilization summary:
-------------------------------------------------------
Selected Device : 2vp20ff1152-5
  Number of Slices:            143 out of  9280    1%
  Number of Slice Flip Flops:  120 out of 18560    0%
  Number of 4 input LUTs:      238 out of 18560    1%
  Number of bonded IOBs:        24 out of   564    4%
  Number of BRAMs:               8 out of    88    9%
  Number of GCLKs:               1 out of    16    6%
Case Study Applications

- Clustered RC Devices: N-Queens
  - HPC application demonstrating the NARC board's role as a generic compute resource
    - Application characterized by minimal communication and heavy computation within the FPGA
    - NARC version of N-Queens adapted from a previously implemented application for the PCI-based Celoxica RC1000 board housed in a conventional server
    - N-Queens algorithm is part of the DoD high-performance computing benchmark suite and is representative of select military and intelligence processing algorithms
    - Exercises the functionality of various developed mechanisms and protocols for job submission, data transfer, etc. on NARC
  - Purpose of the algorithm is to determine how many possible arrangements of N queens there are on an N × N chess board, such that no queen may attack another (see Figure 5)
  - User specifies a single parameter N; upon completion the algorithm returns the total number of possible solutions
  - Results are presented from both NARC-based execution and RC1000-based execution for comparison

Figure 5 – Possible 8x8 solution (figure c/o Jeff Somers)
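For reference, the counting problem itself can be stated as a short recursive search. The sketch below is a plain software version that returns the same solution count the hardware design computes; it is only an illustration of the algorithm, not the FPGA implementation or the RC1000/NARC code.

/* Software reference for the N-Queens count: number of placements of N
 * queens on an N x N board such that no queen attacks another. This is a
 * plain recursive backtracking version, not the FPGA design.            */
#include <stdio.h>
#include <stdlib.h>

static long place(int n, int row, unsigned cols, unsigned diag1, unsigned diag2)
{
    if (row == n)
        return 1;                                  /* all queens placed: one solution */
    long count = 0;
    for (int col = 0; col < n; col++) {
        unsigned c  = 1u << col;
        unsigned d1 = 1u << (row + col);           /* "/" diagonal index  */
        unsigned d2 = 1u << (row - col + n - 1);   /* "\" diagonal index  */
        if ((cols & c) || (diag1 & d1) || (diag2 & d2))
            continue;                              /* square is attacked  */
        count += place(n, row + 1, cols | c, diag1 | d1, diag2 | d2);
    }
    return count;
}

int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 8;        /* e.g. N=8 yields 92 solutions */
    printf("N=%d: %ld solutions\n", n, place(n, 0, 0, 0, 0));
    return 0;
}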
Case Study Applications

- Network processing: Bloom Filter
  - This application performs passive packet analysis through use of a classification algorithm known as a Bloom Filter
    - Application characterized by constant, bursty communication patterns
    - Most communication is receive over the network followed by transmission to the FPGA
    - Filter may be programmed or queried
  - NARC device copies all received network frames to memory; the ARM parses the TCP/IP header and sends it to the Bloom Filter for classification
    - User can send programming requests, which include a header and a string to be programmed into the filter
    - User can also send result-collection requests, which cause a formatted results packet to be sent back to the user
    - Otherwise, the application runs constantly, querying each header against the current Bloom Filter and recording match/header pair information
  - Bloom Filter works by applying multiple hash functions to a given bit string, each hash function producing indices into a separate bit vector (see Figure 6)
    - To program: hash the input string and set the resulting bit positions to 1
    - To query: hash the input string; if all resulting bit positions are 1, the string matches
  - Implemented on a Virtex-II Pro FPGA
    - Uses a slightly larger, but ultimately more effective, application wrapper (see Figure 7)
    - Larger FPGA selected to demonstrate interoperability with any FPGA

Figure 6 – Bloom Filter algorithmic architecture
Figure 7 – Bloom Filter implementation architecture
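To make the program/query behavior concrete, here is a small software sketch of a Bloom Filter. The hash functions, vector size, and use of a single shared bit vector are illustrative simplifications; the FPGA design described above uses its own hash functions, each indexing a separate bit vector.

/* Small software sketch of the Bloom Filter operations described above:
 * programming sets the hashed bit positions, a query checks that all of
 * them are set. Hash functions, vector size, and the single shared bit
 * vector are illustrative; they are not those of the FPGA design.       */
#include <stdint.h>
#include <stddef.h>

#define BITS 4096
#define K    3                          /* number of hash functions */

static uint8_t bitvec[BITS / 8];

static uint32_t hash(const uint8_t *data, size_t len, uint32_t seed)
{
    uint32_t h = 2166136261u ^ seed;    /* seeded FNV-1a variant */
    for (size_t i = 0; i < len; i++)
        h = (h ^ data[i]) * 16777619u;
    return h % BITS;
}

/* Program a string (e.g. a TCP/IP header) into the filter. */
void bloom_program(const uint8_t *data, size_t len)
{
    for (uint32_t k = 0; k < K; k++) {
        uint32_t bit = hash(data, len, k);
        bitvec[bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
}

/* Query: returns 1 if every hashed bit position is set (possible match). */
int bloom_query(const uint8_t *data, size_t len)
{
    for (uint32_t k = 0; k < K; k++) {
        uint32_t bit = hash(data, len, k);
        if (!(bitvec[bit / 8] & (1u << (bit % 8))))
            return 0;
    }
    return 1;
}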
Experimental Setup

- N-Queens: Clustered RC devices
  - NARC device located on an arbitrary switch in the network (Figure 8 illustrates the experimental environment)
  - User interfaces through a client application on a workstation and requests the N-Queens procedure
    - Client application records the time required to satisfy the request
    - Power supply measures the current draw of the active NARC device
  - N-Queens also implemented on an RC-enabled server equipped with a Celoxica RC1000 board
    - Client-side function call to the NARC board replaced with a function call to the RC1000 board in the local workstation, using the same timing measurement
    - Comparison offered in terms of performance, power, and cost
- Bloom Filter: Network processing
  - Same experimental setup as the N-Queens case study
  - Software on the ARM co-processor captures all Ethernet frames
    - Only packet headers (TCP/IP) are passed to the FPGA
    - Data continuously sent to the FPGA as packets arrive over the network
  - By attaching the NARC device to a switch, only limited packets can be captured
    - Only broadcast packets and packets destined for the NARC device can be seen
    - A dual-port device could be inserted in-line with a network link to monitor all flow-through traffic

Figure 8 – Experimental environment (user workstation, NARC devices, and RC-enabled servers attached to an Ethernet network)
Results and Analysis: N-Queens Case Study

- First, consider an execution time comparison between our NARC board and a PCI-based RC card (see Figures 10a and 10b)
  - Both FPGA designs clocked at 50MHz
  - Performance difference is minimal between devices
  - Being able to match the performance of the PCI-based card is a resounding success!
- Power consumption and cost of NARC devices are drastically lower than those of server-plus-RC-card combinations
  - Multiple users may share a NARC device; PCI-based cards are somewhat fixed in an individual server
- Power consumption calculated using the following method:
  - Three regulated power supplies exist in the complete NARC device (network interface + FPGA board): 5V, 3.3V, 2.5V
  - Current draw from each supply was measured
  - Power consumption is calculated as the sum of the V×I products of all three supplies

Figure 10 – Performance comparison between NARC board and PCI-based RC card on server (N-Queens execution time for NARC and RC-1000: small board sizes, N = 5-10, and large board sizes, N = 11-14)
Results and Analysis: N-Queens Case Study

- Figure 11 summarizes the performance ratio of N-Queens between the NARC and RC-1000 platforms
- Consider Table 4 for a summary of cost and power statistics
  - Unit price shown excludes the cost of the FPGA
    - FPGA cost is offset when compared against the other device
    - Price shown includes PCB fabrication and component costs
  - Approximate power consumption is drastically less than a server + RC-card combination
    - Power consumption of a server varies depending on its particular hardware
    - Typical servers operate off of 200-400W power supplies

Figure 11 – NARC / RC-1000 performance ratio versus algorithm parameter N (N = 5 through 14, plotted against equivalency)

Table 4 – Price and power figures for NARC device
  Cost per unit (prototype):   $175.00
  Approx. power consumption:   3.28 W

Power consumption calculation (see Figure 12):
  P = (5V)(I5) + (3.3V)(I3.3) + (2.5V)(I2.5)
  I5 ≈ 0.2A ; I3.3 ≈ 0.49A ; I2.5 ≈ 0.27A
  P = (5)(0.2) + (3.3)(0.49) + (2.5)(0.27) = 3.28W

Figure 12 – Power consumption calculation
Results and Analysis: Bloom Filter

- Passive, continuous network traffic analysis
  - Packets received over the network link are parsed by the ARM, with the TCP/IP header saved in a buffer
  - Headers are sent one at a time as query requests to the Bloom Filter (FPGA); when a query finishes, another header is de-queued if available
    - Computation speed is significantly faster than the FPGA-ARM link communication speed
    - FPGA-side buffer will not fill up; each header is processed before the next header is transmitted to the FPGA
    - ARM-side buffer may fill up under heavy traffic loads
      - 32MB of ARM-side RAM gives a large buffer
  - User may query the NARC device at any time for a results update or to program a new pattern
- Wrapper design was slightly larger than the minimal wrapper previously used with N-Queens
  - Still a small footprint on the chip; the majority of the FPGA remains for the application
  - Maximum wrapper clock frequency of 183 MHz should not limit the application clock if in the same clock domain
- Figure 13 shows resource usage for the Virtex-II Pro FPGA
  - Maximum clock frequency of 113MHz
  - Not affected by the wrapper constraint

Device utilization summary:
-------------------------------------------------------
Selected Device : 2vp20ff1152-5
  Number of Slices:           1174 out of  9280   13%
  Number of Slice Flip Flops: 1706 out of 18560    9%
  Number of 4 input LUTs:     2032 out of 18560   11%
  Number of bonded IOBs:        24 out of   564    4%
  Number of BRAMs:               9 out of    88   10%
  Number of GCLKs:               1 out of    16    6%

Figure 13 – Device utilization statistics for Bloom Filter design
Pitfalls and Lessons Learned

- FPGA I/O throughput capacity remains a persistent problem
  - One motivation for designing custom hardware is to remove the typical PCI bottleneck and provide wire-speed network connectivity for the FPGA
  - An under-provisioned data path between the FPGA and the network interface restricts the performance benefits of our prototype design
  - Fortunately, this problem may be solved through a variety of approaches:
    - Wider data paths (16-bit, 32-bit) double or quadruple throughput, at the expense of higher pin count
    - Use of a higher-performance co-processor capable of faster I/O switching frequencies
    - An optimized data transfer protocol
- Having a co-processor in addition to the FPGA to handle the network interface is vital to the success of our approach
  - Required in order to permit initial remote configuration of the FPGA, as well as additional reconfigurations upon user request
  - Offloading the network stack, basic request handling, and other maintenance-type tasks from the FPGA saves a significant number of valuable slices for user designs
  - Drastically eases interfacing with the user application on a networked workstation
  - Acts as an active co-processor for FPGA applications, e.g. parsing network packets as in the Bloom Filter application
Conclusions

- A novel approach to providing FPGAs with standalone network connectivity has been prototyped and successfully demonstrated
  - Investigated issues critical to providing remote management of standalone NARC resources
  - Proposed and demonstrated solutions to the challenges discovered
  - Performed a pair of case studies with two distinct, representative applications for a NARC device
- Network-attached RC devices offer potential benefits for a variety of applications
  - Impressive cost and power savings over server-based RC processing
  - Independent NARC devices may be shared by multiple users without being moved
  - Tightly coupled network interface enables the FPGA to be used directly in the path of network traffic for real-time analysis and monitoring
- Two issues that are proving to be a challenge to our approach:
  - Data latency in FPGA communication
  - Software infrastructure required to achieve a robust standalone RC unit
- While the prototype design achieves relatively good performance in some areas and limited performance in others, this is acceptable for a concept demonstration
  - Fairly complex board design; architecture and software enhancements are in development
  - As a proof of the "NARC" concept, an important goal of the project was achieved in demonstrating an effective and efficient infrastructure for managing NARC devices
Future Work

- Expansion of network processing capabilities
  - Further development of the packet filtering application
    - More specific and practical activity or behavior sought from network traffic
    - Analyze streaming packets at or near wire-speed rates
  - Expansion of the Ethernet link to a 2-port hub
    - Permits transparent insertion of the device into a network path
    - Provides easier access to all packets in a switched IP network
- Merging the FPGA with the ARM co-processor and network interface into one device
  - Ultimate vision for the NARC device
  - Will restrict the number of different FPGAs that may be supported, according to the FPGA socket/footprint chosen for the board
  - Increased difficulty in PCB design
- Expansion to Gig-E and other network technologies
  - Fast Ethernet was targeted for the prototyping effort and concept demonstration
  - A true high-performance device should support Gigabit Ethernet
  - Other potential technologies include (but are not limited to) InfiniBand and RapidIO
- Further development of the management infrastructure
  - Need for more robust control/decision-making middleware
  - Automatic device discovery, concurrent job execution, fault-tolerant operation