PowerPoint - Berkeley Reconfigurable Architectures, Systems, and

Download Report

Transcript PowerPoint - Berkeley Reconfigurable Architectures, Systems, and

A Streaming
Multi-Threaded Model
Eylon Caspi,
Joe Yeh,
Randy Huang, Yury Markovskiy,
André DeHon, John Wawrzynek
BRASS Research Group
University of California, Berkeley
MSP-3 12/2/01
Protecting Software Investment
 Technology trends: bigger, faster
 Moore’s Law: 2x transistors every 18 months
 Device landscape growing
 Microprocessors, DSPs, FPGAs, communication
processors, network processors, PSOCs, etc.
 Need a way to let SW survive,
automatically scale to next-gen device
 Need a strong model for SW-HW interface
with better parallelism
12/2/01
Eylon Caspi — MSP-3
2
Outline





Motivation
SCORE
SCORE for Reconfigurable Hardware
SCORE for Microprocessors
Summary / Future Work
12/2/01
Eylon Caspi — MSP-3
3
A Lesson from ISA Processors
 ISA (Instruction Set Architecture)
decouples SW from HW
 Survival to compatible, next generation devices
 Performance scales with device speed + size
 Survival for decades—e.g. IBM 360, x86
 An ISA cannot scale forever
 Latency scales with device size (cycles to cross chip, access mem)
 Need parallelism to hide latency

ILP:
expensive to extract + exploit (caches, branch pred., etc.)
 Data:
(Vector, MMX) limited applicability; MMX not scalable
 Thread: (MP, multi-threaded) IPC expensive; hard to program
Gluing together conventional processors is insufficient
12/2/01
Eylon Caspi — MSP-3
4
Streams

Stream = FIFO communication channel
with blocking read, non-blocking write,
conceptually unbounded capacity
 Basic primitive for communication, synchronization
 Exposed at all levels—programming model, architecture

Application = data flow graph of threads, memories
 Kahn process network
 Stream semantics ensure determinism regardless of communication timing,
thread scheduling (Kahn continuity)
Thread
Mem
Thread
Thread
Mem
Thread
12/2/01
Eylon Caspi — MSP-3
5
Stream-Aware Scheduling
 Streams expose inter-thread dependencies (data flow)
 Streams enable efficient, flexible schedules
 Efficient: fewer blocked cycles, shorter run time
 Automatically schedule to available resources

Number of processors, memory size, network bandwidth, etc.
 E.g. Fully spatial, pipelined
 E.g. Time multiplexed with data batching

Amortize cost of context swap over larger data set
Thread
Mem
Thread
Thread
Mem
Thread
12/2/01
Eylon Caspi — MSP-3
6
Stream Reuse
 Persistent streams enable reuse
 Establish connection once (network route / buffer)
 Reuse connection while threads loaded
 Cheap (single cycle) stream access

Amortize per-message cost of communication
Thread
Mem
Thread
Thread
Mem
Thread
12/2/01
Eylon Caspi — MSP-3
7
SCORE Compute Model

Program = data flow graph of
stream-connected threads
 Kahn process network
(blocking read, non-blocking write)

Compute: Thread

Communication: Stream
 Task with local control
 FIFO channel, unbounded buffer capacity,
blocking read, non-blocking write

Memory: Segment

Dynamics:

Model admits parallelism at multiple levels: ILP, pipeline, data
 Memory block with stream interface (e.g. streaming read)
 Dynamic local thread behavior  dynamic flow rates
 Unbounded resource usage: may need stream buffer expansion
 Dynamic graph allocation
12/2/01
Eylon Caspi — MSP-3
8
SCORE for Reconfigurable Hardware
 SCORE: Stream Computations Organized for
Reconfigurable Execution
 Programmable logic +
Programmable Interconnect
 E.g. Field Programmable Gate Arrays (FPGAs)
 Hardware scales by tiling / duplicating
 High parallelism; spatial data paths
 But no abstraction for software survival
 No binary compatibility
 No performance scaling

12/2/01
Designer targets a specific device, specific resource constraints
Eylon Caspi — MSP-3
9
Virtual Hardware

Compute model has unbounded resources

Paging

Efficient virtualization
 Programmer no longer targets particular device size
 “Compute pages” swapped in/out (like VM)
 Page context = thread (FSM to access streams, block)
 Amortize reconfiguration cost over an entire input buffer
compute
pages
buffers
10
Transform
Quantize
RLE
Encode
SCORE Hardware Model
 Paged FPGA
 Compute Page (CP)

Fixed-size slice of RC hardware
(e.g. 512 4-LUTs)
 Fixed number of I/O ports
 Configurable Memory Block (CMB)

Distributed, on-chip memory
(e.g. 2 Mbit)
 Stream access
 High-level interconnect
 Microprocessor
 Run-time support + user code
12/2/01
Eylon Caspi — MSP-3
11
Programming Model: TDF
 TDF = intermediate, behavioral language for:
 EFSM Operators
• Static operator graphs
 State machine for:
 Firing signatures
• Control flow (branching)
 Firing semantics:
 When in state X, wait for X’s inputs, then fire (consume, act)
select (input boolean
s,
input unsigned[8] t,
input unsigned[8] f,
output unsigned[8] o )
{
state S (s) : if (s) goto T;
else
goto F;
state T (t) : o=t;
goto S;
state F (f) : o=f;
goto S;
}
s
t
f
select
o
12
Page Scheduling

Schedule = time-sliced eviction / loading
 Choose pages to run
 Manage stream buffers (modify page graph; swap memory)
 Configure CPs, CMBs, network

Implemented several schedulers
 Dynamic:
 Static:
 Quasi-Static:

Dynamic loading order based on buffered input
Static, repeated loading order
Static loading order, dynamic time slice
Page loading order (static / quasi-static)
 Topological:
 Min-cut:
 Exhaustive:
12/2/01
dependence order
(arbitrary topological sort of page graph)
minimize # of live stream buffers
(min-cut page graph)
minimize stall cycles based on profiled I/O rates
(exhaustively search all topological orders)
Eylon Caspi — MSP-3
13
Execution Results
Hardware Size (CP-CMB Pairs)
12/2/01
Eylon Caspi — MSP-3
14
Heterogeneous SCORE
 SCORE extends to other processor types
 Network interface
 Route traffic to network or buffer
 Block on empty/full stream access
Processor
Processor
Processor
Processor
FPU
FPU
FPU
FPU
IO
FPU
Processor
Processor
Processor
Processor
Processor
Processor
Processor
Processor
Processor
12/2/01
Processor
Processor
Processor
FPU
IO
FPU
FPU
FPU
FPU
FPU
FPU
FPU
FPU
Processor
Eylon Caspi — MSP-3
15
Microprocessor Stream Support
 Stream instructions: stream_read (reg,idx)
stream_write (reg,idx)
Network
Interface
12/2/01
Eylon Caspi — MSP-3
16
Summary
 Exposing streams at all levels (programming model,
architecture) enables software survival + performance
scaling in high-capacity architectures
 Demonstrated scalable hybrid reconfigurable architecture;
proposed heterogeneous / multi-processor extensions
 Future work
 Page partitioning for reconfigurable
 Scheduling with I/O rate matching
 More Information
 SCORE web page
http://brass.cs.berkeley.edu/SCORE/
 FPGA 2002 paper (February 24-26)
12/2/01
Eylon Caspi — MSP-3
17
Supplemental
12/2/01
Eylon Caspi — MSP-3
18
Functional Simulation
 FPGA based on HSRA [Berkeley, FPGA ’99]
 CP:
512 4-LUTs
 CMB: 2Mbit DRAM
 Area for CP-CMB pair: .25: 12.9mm2
(1/9 of PII-450)
.18:
6.7mm2
(1/16 of PIII-600)
 Page reconfiguration:
5000 cycles (from CMB)
 Synchronous operation (same clock speed as processor)
 x86 microprocessor
 Page Scheduler task
 Swap on timer interrupt (every 250,000 cycles)
 Fully dynamic scheduling
12/2/01
Eylon Caspi — MSP-3
19
Application: JPEG Encode
12/2/01
Eylon Caspi — MSP-3
20
Execution Results
Hardware Size (CP-CMB Pairs)
12/2/01
Eylon Caspi — MSP-3
21
Execution Results
Hardware Size (CP-CMB Pairs)
12/2/01
Eylon Caspi — MSP-3
22
Execution Results
Hardware Size (CP-CMB Pairs)
12/2/01
Eylon Caspi — MSP-3
23