PowerPoint - Berkeley Reconfigurable Architectures, Systems, and
Download
Report
Transcript PowerPoint - Berkeley Reconfigurable Architectures, Systems, and
A Streaming
Multi-Threaded Model
Eylon Caspi,
Joe Yeh,
Randy Huang, Yury Markovskiy,
André DeHon, John Wawrzynek
BRASS Research Group
University of California, Berkeley
MSP-3 12/2/01
Protecting Software Investment
Technology trends: bigger, faster
Moore’s Law: 2x transistors every 18 months
Device landscape growing
Microprocessors, DSPs, FPGAs, communication
processors, network processors, PSOCs, etc.
Need a way to let SW survive,
automatically scale to next-gen device
Need a strong model for SW-HW interface
with better parallelism
12/2/01
Eylon Caspi — MSP-3
2
Outline
Motivation
SCORE
SCORE for Reconfigurable Hardware
SCORE for Microprocessors
Summary / Future Work
12/2/01
Eylon Caspi — MSP-3
3
A Lesson from ISA Processors
ISA (Instruction Set Architecture)
decouples SW from HW
Survival to compatible, next generation devices
Performance scales with device speed + size
Survival for decades—e.g. IBM 360, x86
An ISA cannot scale forever
Latency scales with device size (cycles to cross chip, access mem)
Need parallelism to hide latency
ILP:
expensive to extract + exploit (caches, branch pred., etc.)
Data:
(Vector, MMX) limited applicability; MMX not scalable
Thread: (MP, multi-threaded) IPC expensive; hard to program
Gluing together conventional processors is insufficient
12/2/01
Eylon Caspi — MSP-3
4
Streams
Stream = FIFO communication channel
with blocking read, non-blocking write,
conceptually unbounded capacity
Basic primitive for communication, synchronization
Exposed at all levels—programming model, architecture
Application = data flow graph of threads, memories
Kahn process network
Stream semantics ensure determinism regardless of communication timing,
thread scheduling (Kahn continuity)
Thread
Mem
Thread
Thread
Mem
Thread
12/2/01
Eylon Caspi — MSP-3
5
Stream-Aware Scheduling
Streams expose inter-thread dependencies (data flow)
Streams enable efficient, flexible schedules
Efficient: fewer blocked cycles, shorter run time
Automatically schedule to available resources
Number of processors, memory size, network bandwidth, etc.
E.g. Fully spatial, pipelined
E.g. Time multiplexed with data batching
Amortize cost of context swap over larger data set
Thread
Mem
Thread
Thread
Mem
Thread
12/2/01
Eylon Caspi — MSP-3
6
Stream Reuse
Persistent streams enable reuse
Establish connection once (network route / buffer)
Reuse connection while threads loaded
Cheap (single cycle) stream access
Amortize per-message cost of communication
Thread
Mem
Thread
Thread
Mem
Thread
12/2/01
Eylon Caspi — MSP-3
7
SCORE Compute Model
Program = data flow graph of
stream-connected threads
Kahn process network
(blocking read, non-blocking write)
Compute: Thread
Communication: Stream
Task with local control
FIFO channel, unbounded buffer capacity,
blocking read, non-blocking write
Memory: Segment
Dynamics:
Model admits parallelism at multiple levels: ILP, pipeline, data
Memory block with stream interface (e.g. streaming read)
Dynamic local thread behavior dynamic flow rates
Unbounded resource usage: may need stream buffer expansion
Dynamic graph allocation
12/2/01
Eylon Caspi — MSP-3
8
SCORE for Reconfigurable Hardware
SCORE: Stream Computations Organized for
Reconfigurable Execution
Programmable logic +
Programmable Interconnect
E.g. Field Programmable Gate Arrays (FPGAs)
Hardware scales by tiling / duplicating
High parallelism; spatial data paths
But no abstraction for software survival
No binary compatibility
No performance scaling
12/2/01
Designer targets a specific device, specific resource constraints
Eylon Caspi — MSP-3
9
Virtual Hardware
Compute model has unbounded resources
Paging
Efficient virtualization
Programmer no longer targets particular device size
“Compute pages” swapped in/out (like VM)
Page context = thread (FSM to access streams, block)
Amortize reconfiguration cost over an entire input buffer
compute
pages
buffers
10
Transform
Quantize
RLE
Encode
SCORE Hardware Model
Paged FPGA
Compute Page (CP)
Fixed-size slice of RC hardware
(e.g. 512 4-LUTs)
Fixed number of I/O ports
Configurable Memory Block (CMB)
Distributed, on-chip memory
(e.g. 2 Mbit)
Stream access
High-level interconnect
Microprocessor
Run-time support + user code
12/2/01
Eylon Caspi — MSP-3
11
Programming Model: TDF
TDF = intermediate, behavioral language for:
EFSM Operators
• Static operator graphs
State machine for:
Firing signatures
• Control flow (branching)
Firing semantics:
When in state X, wait for X’s inputs, then fire (consume, act)
select (input boolean
s,
input unsigned[8] t,
input unsigned[8] f,
output unsigned[8] o )
{
state S (s) : if (s) goto T;
else
goto F;
state T (t) : o=t;
goto S;
state F (f) : o=f;
goto S;
}
s
t
f
select
o
12
Page Scheduling
Schedule = time-sliced eviction / loading
Choose pages to run
Manage stream buffers (modify page graph; swap memory)
Configure CPs, CMBs, network
Implemented several schedulers
Dynamic:
Static:
Quasi-Static:
Dynamic loading order based on buffered input
Static, repeated loading order
Static loading order, dynamic time slice
Page loading order (static / quasi-static)
Topological:
Min-cut:
Exhaustive:
12/2/01
dependence order
(arbitrary topological sort of page graph)
minimize # of live stream buffers
(min-cut page graph)
minimize stall cycles based on profiled I/O rates
(exhaustively search all topological orders)
Eylon Caspi — MSP-3
13
Execution Results
Hardware Size (CP-CMB Pairs)
12/2/01
Eylon Caspi — MSP-3
14
Heterogeneous SCORE
SCORE extends to other processor types
Network interface
Route traffic to network or buffer
Block on empty/full stream access
Processor
Processor
Processor
Processor
FPU
FPU
FPU
FPU
IO
FPU
Processor
Processor
Processor
Processor
Processor
Processor
Processor
Processor
Processor
12/2/01
Processor
Processor
Processor
FPU
IO
FPU
FPU
FPU
FPU
FPU
FPU
FPU
FPU
Processor
Eylon Caspi — MSP-3
15
Microprocessor Stream Support
Stream instructions: stream_read (reg,idx)
stream_write (reg,idx)
Network
Interface
12/2/01
Eylon Caspi — MSP-3
16
Summary
Exposing streams at all levels (programming model,
architecture) enables software survival + performance
scaling in high-capacity architectures
Demonstrated scalable hybrid reconfigurable architecture;
proposed heterogeneous / multi-processor extensions
Future work
Page partitioning for reconfigurable
Scheduling with I/O rate matching
More Information
SCORE web page
http://brass.cs.berkeley.edu/SCORE/
FPGA 2002 paper (February 24-26)
12/2/01
Eylon Caspi — MSP-3
17
Supplemental
12/2/01
Eylon Caspi — MSP-3
18
Functional Simulation
FPGA based on HSRA [Berkeley, FPGA ’99]
CP:
512 4-LUTs
CMB: 2Mbit DRAM
Area for CP-CMB pair: .25: 12.9mm2
(1/9 of PII-450)
.18:
6.7mm2
(1/16 of PIII-600)
Page reconfiguration:
5000 cycles (from CMB)
Synchronous operation (same clock speed as processor)
x86 microprocessor
Page Scheduler task
Swap on timer interrupt (every 250,000 cycles)
Fully dynamic scheduling
12/2/01
Eylon Caspi — MSP-3
19
Application: JPEG Encode
12/2/01
Eylon Caspi — MSP-3
20
Execution Results
Hardware Size (CP-CMB Pairs)
12/2/01
Eylon Caspi — MSP-3
21
Execution Results
Hardware Size (CP-CMB Pairs)
12/2/01
Eylon Caspi — MSP-3
22
Execution Results
Hardware Size (CP-CMB Pairs)
12/2/01
Eylon Caspi — MSP-3
23