Multi-core Accelerators

Revisiting Parallelism
ECE 4100/6100 (1)
Where Are We Headed?
Figure: MIPS (log scale, 0.01 to 1,000,000) vs. year (1970-2010), from the 8086, 286, 386, and 486 onward: the era of pipelined architecture, the era of instruction-level parallelism (superscalar; speculative, out-of-order), and the era of thread- and processor-level parallelism (multi-threaded, multi-core) plus special-purpose hardware.
Source: Shekhar Borkar, Intel Corp.
ECE 4100/6100 (2)
Beyond ILP
• Performance is limited by the serial fraction (Amdahl's law; a worked form appears at the end of this slide)
Figure: a fixed serial fraction plus a parallelizable fraction split across 1, 2, 3, and 4 CPUs.
• Coarse grain parallelism in the post ILP era
– Thread, process and data parallelism
• Learn from the lessons of the parallel
processing community
– Revisit the classifications and architectural
techniques
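A worked form of the serial-fraction limit from the first bullet (standard Amdahl's law, not spelled out on the slide): with parallelizable fraction f and N CPUs,

speedup(N) = 1 / ((1 - f) + f/N), which approaches 1 / (1 - f) as N grows.

For example, f = 0.9 caps the speedup at 10x no matter how many CPUs are added.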
ECE 4100/6100 (3)
Flynn’s Model*
• Flynn’s Classification
– Single instruction stream, single data stream (SISD)
• The conventional, word-sequential architecture, including pipelined computers
– Single instruction stream, multiple data stream (SIMD) → Data-Level Parallelism (DLP)
• The multiple-ALU-type architectures (e.g., array processors)
– Multiple instruction stream, single data stream (MISD)
• Not very common
– Multiple instruction stream, multiple data stream (MIMD) → Thread-Level Parallelism (TLP)
• The traditional multiprocessor system
*M.J. Flynn, “Very high speed computing systems,” Proc. IEEE, vol. 54(12), pp. 1901–1909, 1966.
ECE 4100/6100 (4)
ILP Challenges
• As machine ILP capabilities increase, i.e., ILP width
and depth, so do challenges
– OOO execution cores
• Key data structure sizes increase – ROB, ILP window, etc.
• Dependency tracking logic increases quadratically
– VLIW/EPIC
• Hardware interlocks, ports, and recovery logic (speculation) increase quadratically
• Circuit complexity increases with the number of in-flight instructions
Data Parallelism
ECE 4100/6100 (5)
Example: Itanium 2
• Note the percentage of the die devoted to control
• And this is a statically scheduled processor!
ECE 4100/6100 (6)
Data Parallel Alternatives
• Single Instruction Stream Multiple Data
Stream Cores
– Co-processors exposed through the ISA
– Co-processors exposed as a distinct
processor
• Vector Processing
– Over 5 decades of development
ECE 4100/6100 (7)
The SIMD Model
• Single instruction stream broadcast to all processors
– Processors execute in lock step on local data
– Efficient in use of silicon area: fewer resources devoted to control
– Distributed memory model vs. shared memory model
• Distributed memory
– Each processor has local memory
– Data routing network whose operation is under centralized
control.
– Processor masking for data-dependent operations (see the sketch at the end of this slide)
• Shared memory
– Access to memory modules through an alignment network
• Instruction classes: computation, routing, masking
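A minimal plain-C sketch (mine, not from the slides) of how processor masking handles a data-dependent conditional: every PE executes both sides of the branch in lock step, and a per-element mask selects which result is committed.

#include <stdio.h>

#define N 8

/* Emulate SIMD masked execution of:  if (a[i] < 0) a[i] = -a[i]; else a[i] = a[i] * 2;
 * All "processors" execute both paths; the mask decides which result is kept. */
int main(void) {
    int a[N] = {3, -1, 4, -1, 5, -9, 2, -6};
    int mask[N], then_val[N], else_val[N];

    for (int i = 0; i < N; i++) mask[i] = (a[i] < 0);      /* broadcast compare */
    for (int i = 0; i < N; i++) then_val[i] = -a[i];        /* all PEs run the "then" path */
    for (int i = 0; i < N; i++) else_val[i] = a[i] * 2;     /* all PEs run the "else" path */
    for (int i = 0; i < N; i++) a[i] = mask[i] ? then_val[i] : else_val[i]; /* merge under mask */

    for (int i = 0; i < N; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}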
ECE 4100/6100 (8)
Two Issues
• Conditional Execution
• Data alignment
ECE 4100/6100 (9)
Vector Cores
ECE 4100/6100 (10)
Classes of Vector Processors
• Vector machines fall into two classes: vector-register machines and memory-to-memory machines
• Memory-to-memory architectures have seen a resurgence on chip
ECE 4100/6100 (11)
VMIPS
• Load/store architecture
• Multiported registers
• Deeply pipelined functional units
• Separate scalar registers
ECE 4100/6100 (12)
Cray Family Architecture
• Stream oriented
• Recall data skewing and concurrent memory accesses!
• The first load/store ISA design → Cray 1 (1976)
ECE 4100/6100 (13)
Features of Vector Processors
• Significantly less dependency checking logic
– Order of complexity of scalar comparisons with a
significantly smaller number
• Vector data sets
– Hazard free operation on deep pipelines
– Conciseness of representation leads to low instruction issue
rate
– Reduction in normal control hazards
– Vector operations vs. a sequence of scalar operations
• Concurrency in operation, memory access and
address generation
– Often statically known
ECE 4100/6100 (14)
Some Examples
ECE 4100/6100 (15)
Basic Performance Concepts
• Consider the vector operation Z = A*X + Y
• Execution time
– t_exec = t_startup + n * t_cycle
• Metrics (a short derivation follows this slide)
– R_infinity
– R_half
– R_v
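A short derivation under the execution-time model above (standard vector-performance reasoning, not taken verbatim from the slide): the rate at vector length n is

R_n = n / t_exec = n / (t_startup + n * t_cycle)

so R_infinity = 1 / t_cycle in the limit of very long vectors, and half of R_infinity is reached at the vector length n = t_startup / t_cycle. R_half is commonly read as that half-performance point, and R_v as R_n evaluated at a particular vector length v.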
ECE 4100/6100 (16)
Optimizations for Vector Machines
• Chaining
MULT.V V1, V2, V3
ADD.V V4, V1, V5
– Fine-grained forwarding of elements of a vector (the ADD.V consumes elements of V1 as the MULT.V produces them)
• Needs additional ports on a vector register
– Effectively creates a deeper pipeline
• Conditional operations and vector masks
• Scatter/gather operations (a brief sketch follows this slide)
• Vector lanes
– Each lane is coupled to a portion of the vector register file
– Lanes are transparent to the code and are like caches in the family-of-machines concept
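A minimal plain-C sketch (mine, not from the slides) of what gather and scatter mean: loads and stores driven by an index vector rather than a contiguous stride; a vector machine performs each as a single instruction.

#include <stdio.h>

#define N 4

/* Gather: dst[i] = mem[idx[i]];  Scatter: mem[idx[i]] = src[i]. */
void gather(double *dst, const double *mem, const int *idx, int n) {
    for (int i = 0; i < n; i++) dst[i] = mem[idx[i]];
}

void scatter(double *mem, const double *src, const int *idx, int n) {
    for (int i = 0; i < n; i++) mem[idx[i]] = src[i];
}

int main(void) {
    double mem[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    int idx[N] = {6, 0, 3, 5};           /* sparse, data-dependent addresses */
    double vr[N];

    gather(vr, mem, idx, N);              /* vr = {6, 0, 3, 5} */
    for (int i = 0; i < N; i++) vr[i] += 10.0;
    scatter(mem, vr, idx, N);             /* write results back to the same sparse locations */

    for (int i = 0; i < 8; i++) printf("%.0f ", mem[i]);
    printf("\n");
    return 0;
}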
ECE 4100/6100 (17)
The IBM Cell Processor
ECE 4100/6100 (18)
Cell Overview
Figure: Cell die block diagram: one PPU, eight SPUs, the memory interface controller (MIC), the element interconnect bus, and the bus/Rambus (RRAC) interface controllers.
• IBM/Toshiba/Sony joint project - 4-5 years, 400 designers
– 234 million transistors, 4+ GHz
– 256 Gflops single precision (billions of floating point operations per second)
– 26 Gflops (double precision)
– Area: 221 mm2
– Technology: 90nm SOI
ECE 4100/6100 (19)
Cell Overview (cont.)
• One 64-bit PowerPC processor
– 4+ GHz, dual issue, two threads
– 512 kB of second-level cache
• Eight Synergistic Processor Elements
– Or “Streaming Processor Elements”
– Co-processors with dedicated 256kB of memory (not cache)
• EIB data ring for internal communication
– Four 16 byte data rings, supporting multiple transfers
– 96B/cycle peak bandwidth
– Over 100 outstanding requests
• Dual Rambus XDR memory controllers (on chip)
– 25.6 GB/sec of memory bandwidth
• 76.8 GB/s chip-to-chip bandwidth (to off-chip GPU)
ECE 4100/6100 (20)
Cell Features
• Security
– SPE dynamically reconfigurable as secure co-processor
• Networking
– SPEs might off-load networking overheads (TCP/IP)
• Virtualization
– Run multiple OSs at the same time
– Linux is primary development OS for Cell
• Broadband
– SPE is RISC architecture with SIMD organization and Local
Store
– 128+ concurrent transactions to memory per processor
ECE 4100/6100 (21)
PPE Block Diagram
• PPE handles operating system and control tasks
– 64-bit Power Architecture™ with VMX
– In-order, 2-way hardware multi-threading
– Coherent Load/Store with 32KB I & D L1 and 512KB L2
ECE 4100/6100 (22)
PPE Pipeline
ECE 4100/6100 (23)
SPE Organization and Pipeline
IBM Cell SPE Organization
IBM Cell SPE pipeline diagram
ECE 4100/6100 (24)
Cell Temperature Graph
• Power and heat are key constraints
– Cell is ~80 watts at 4+ GHz
– Cell has 10 temperature sensors
Source: IEEE ISSCC, 2005
ECE 4100/6100 (25)
SPE
• User-mode architecture
– No translation/protection within the SPU
– DMA uses the full Power Architecture protection/translation
• Direct programmer control
– DMA/DMA-list
– Branch hint
• VMX-like SIMD dataflow
– Broad set of operations
– Graphics SP-Float
– IEEE DP-Float (BlueGene-like)
• Unified register file
– 128 entry x 128 bit
• 256kB Local Store
– Combined I & D
– 16B/cycle L/S bandwidth
– 128B/cycle DMA bandwidth
ECE 4100/6100 (26)
Cell I/O
• XDR is a new high-speed memory from Rambus
– Dual XDR™ controller (25.6 GB/s @ 3.2 Gbps)
– Two configurable interfaces (76.8 GB/s @ 6.4 Gbps)
– Flexible bandwidth between interfaces
– Allows for multiple system configurations
• Pros:
– Fast: dual controllers give 25.6 GB/sec (see the note after this list)
• Current AMD Opteron is only 6.4 GB/s
– Small pin count
– Only need a few chips for high bandwidth
• Cons:
– Expensive ($ per bit)
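A hedged sanity check of the 25.6 GB/s figure, assuming (the slide does not say) that each of the two XDR controllers drives a 32-bit channel at the quoted 3.2 Gbps per pin:

2 channels x 32 bits x 3.2 Gbps = 204.8 Gbps = 25.6 GB/s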
ECE 4100/6100 (27)
Multiple system support
• Game console systems
• Workstations (CPBW)
• HDTV
• Home media servers
• Supercomputers
ECE 4100/6100 (28)
Programming Cell
• 10 virtual processors
– 2 threads of PowerPC
– 8 co-processor SPEs
• Communicating with SPEs
– 256kB “local storage” is NOT a cache
• Must explicitly move data in and out of the local store
• Use the DMA engine (supports scatter/gather); a minimal sketch follows
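A minimal SPE-side sketch of pulling a buffer into the local store and pushing results back, assuming the Cell SDK's spu_mfcio.h intrinsics (mfc_get/mfc_put plus tag-group waiting); the buffer name and sizes are illustrative, not a definitive recipe.

#include <spu_mfcio.h>

#define CHUNK 4096                       /* bytes per DMA transfer; keep it aligned and <= 16 KB */

static char local_buf[CHUNK] __attribute__((aligned(128)));  /* local-store buffer */

/* Pull CHUNK bytes from effective address 'ea' into local store, process, push back. */
void process_chunk(unsigned long long ea)
{
    unsigned int tag = 1;                /* DMA tag group (0..31) */

    mfc_get(local_buf, ea, CHUNK, tag, 0, 0);   /* main memory -> local store */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();           /* block until the get completes */

    for (int i = 0; i < CHUNK; i++)      /* ... compute on local_buf ... */
        local_buf[i] ^= 0x5a;

    mfc_put(local_buf, ea, CHUNK, tag, 0, 0);   /* local store -> main memory */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();           /* block until the put completes */
}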
ECE 4100/6100 (29)
Programming Cell
Figure: spectrum of Cell programming approaches, ranging from highest productivity with fully automatic compiler technology to highest performance with help from programmers:
• Automatic parallelization and automatic SIMDization
• Shared memory, single-program abstraction
• SIMD alignment directives; automatic tuning for each ISA
• Explicit parallelization with local memories; explicit SIMD coding
• Multiple-ISA hand-tuned programs
ECE 4100/6100 (30)
Execution Model
• SPE executables are embedded as read-only data in the PPE executable
• Use the memory flow controller (MFC) for DMA operations
• The "shopping list" view of memory accesses
Source: IBM
ECE 4100/6100 (31)
Programming Model
SPE Program

/* spe_foo.c
   A C program to be compiled into an SPE executable called "spe_foo". */

/* addr64 and func_foo come from the SDK headers / application code (not shown on the slide). */
int main(unsigned long long speid, addr64 argp, addr64 envp)
{
    int i;                      /* func_foo would be the real code */
    i = func_foo(argp);
    return i;
}

PPE Program

/* spe_runner.c
   A C program to be linked with spe_foo and run on the PPE. */
#include <libspe.h>             /* libspe v1 API used on the slide (header name assumed) */

extern spe_program_handle_t spe_foo;   /* handle to the embedded SPE executable */

int main()
{
    int rc, status = 0;
    speid_t spe_id;

    /* Launch the embedded SPE program on an available SPE. */
    spe_id = spe_create_thread(0, &spe_foo, 0, NULL, -1, 0);

    /* Blocking call: wait for the SPE thread to finish. */
    rc = spe_wait(spe_id, &status, 0);
    return status;
}
Source: IBM
ECE 4100/6100 (32)
SPE Programming
• Dual issue with issue constraints
• Predication and hints; no branch prediction hardware
• Alignment instructions
Source: IBM
ECE 4100/6100 (33)
Programming Idioms: Pipeline
ECE 4100/6100 (34)
Programming Idioms: Work Queue Model
Figure: SPEs pulling work items from a shared work queue.
• Pull data off of a shared queue
• Self-scheduled (a minimal sketch follows)
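A minimal, generic sketch of the self-scheduled idiom (plain C with pthreads, not Cell-specific code from the slides): each worker atomically claims the next index from a shared counter until the queue is drained.

#include <pthread.h>
#include <stdio.h>

#define NUM_ITEMS   64
#define NUM_WORKERS 4

static int next_item = 0;                    /* shared "work queue" head */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static double results[NUM_ITEMS];

static void *worker(void *arg)
{
    long id = (long)arg;
    for (;;) {
        pthread_mutex_lock(&lock);           /* atomically claim the next item */
        int i = next_item++;
        pthread_mutex_unlock(&lock);
        if (i >= NUM_ITEMS) break;           /* queue drained: self-scheduled exit */
        results[i] = i * 2.0;                /* stand-in for real per-item work */
    }
    return (void *)id;
}

int main(void)
{
    pthread_t t[NUM_WORKERS];
    for (long i = 0; i < NUM_WORKERS; i++) pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NUM_WORKERS; i++) pthread_join(t[i], NULL);
    printf("last result = %f\n", results[NUM_ITEMS - 1]);
    return 0;
}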
ECE 4100/6100 (35)
SPMD & MIMD Accelerators
Executing the same (SPMD) or different (MPMD) programs
ECE 4100/6100 (36)
Cell Processor Application Areas
• Digital content creation (games and movies)
• Game playing and game serving
• Distribution of (dynamic, media-rich) content
• Imaging and image processing
• Image analysis (e.g., video surveillance)
• Next-generation physics-based visualization
• Video conferencing (3D)
• Streaming applications (codecs etc.)
• Physical simulation & science
ECE 4100/6100 (37)
Some References and Links
• http://researchweb.watson.ibm.com/journal/rd/494/kahle.html
• http://en.wikipedia.org/wiki/Cell_(microprocessor)
• http://www.research.ibm.com/cell/home.html
• http://www.research.ibm.com/cellcompiler/slides/pact05.pdf
• http://www.hpcaconf.org/hpca11/slides/Cell_Public_Hofstee.pdf
• http://www.hpcaconf.org/hpca11/papers/25_hofstee-cellprocessor_final.pdf
• http://www.research.ibm.com/cellcompiler/papers/pham-ISSCC05.pdf
ECE 4100/6100 (38)
IRAM Cores
ECE 4100/6100 (39)
Data Parallelism and the Processor-Memory Gap
Figure: performance vs. time: CPU (µProc) performance improves ~60%/year ("Moore's Law") while DRAM performance improves ~7%/year, so the processor-memory performance gap grows about 50% per year.
How can we close this gap?
ECE 4100/6100 (40)
The Effects of the Processor-Memory Gap
• Tolerate the gap with deeper cache memories → increasing worst-case access time
• System level impact: Alpha 21164
– I & D cache access: 2 clocks
– L2 cache: 6 clocks
– L3 cache: 8 clocks
– Memory: 76 clocks
– DRAM component access: 18 clocks
• How much time is spent in the memory hierarchy? (a hedged illustration follows)
– SpecInt92: 22%
– Specfp92: 32%
– Database: 77%
– Sparse matrix: 73%
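A hedged illustration of how such fractions arise, using the Alpha 21164 latencies above with hypothetical miss rates (the miss rates are invented for illustration, not measured data):

AMAT ≈ 2 + m_L1 * (6 + m_L2 * (8 + m_L3 * 76)) clocks

With, say, m_L1 = 10%, m_L2 = 50%, and m_L3 = 50%, AMAT ≈ 2 + 0.1 * (6 + 0.5 * (8 + 0.5 * 76)) ≈ 4.9 clocks, so well over half of the average access time is spent beyond the L1.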
ECE 4100/6100 (41)
Where do the Transistors Go?

Processor          % Area (cost)   % Transistors (power)
Alpha 21164        37%             77%
StrongArm SA110    61%             94%
Pentium Pro        64%             88%

• Caches have no inherent value; they simply recover bandwidth?
ECE 4100/6100 (42)
Impact of DRAM Capacity
• Increasing capacity creates a quandary
– Continual four-fold increases in density increase the minimum memory increment for a given width
– How do we match the memory bus width?
– Cost/bit issues for wider DRAM chips
• die size, testing, package costs
• Number of DRAM chips decreases → decrease in concurrency
ECE 4100/6100 (43)
Merge Logic and DRAM!
• Bring the processors to memory
→ Tremendous on-chip bandwidth for predictable application reference patterns
• Enough memory to hold complete programs and data
→ feasible
• More applications are limited by memory speed
→ Better memory latency for applications with irregular access patterns
• Synchronous DRAMs to integrate with the higher speed logic
→ compatible
ECE 4100/6100 (44)
Potential: IRAM for Lower Latency
• DRAM Latency
– Dominant delay = RC of the word lines
– Keep wire length short & block sizes small?
• 10-30 ns for 64b-256b IRAM “RAS/CAS”?
ECE 4100/6100 (45)
Potential for IRAM Bandwidth
• 1024 1-Mbit modules (1 Gbit total), each 256b wide
– 20% @ 20 ns RAS/CAS = 320 GBytes/sec (the arithmetic is spelled out below)
• If a crossbar switch delivers 1/3 to 2/3 of the bandwidth of 20% of the modules → 100-200 GBytes/sec
• FYI: AlphaServer 8400 = 1.2 GBytes/sec
– 75 MHz, 256-bit memory bus, 4 banks
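The 320 GB/s figure works out as follows (my arithmetic, filling in the multiplication the slide leaves implicit): 20% of 1024 modules ≈ 205 modules active; each delivers 256 bits = 32 bytes per 20 ns access, i.e., 32 B / 20 ns = 1.6 GB/s per module; 205 x 1.6 GB/s ≈ 320 GB/s.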
ECE 4100/6100 (46)
IRAM Applications
• PDAs, cameras, gameboys, cell phones, pagers
• Database systems?
Figure: demand/performance vs. year (1996-2000, log scale): database demand grows 2X every 9 months ("Greg's Law"), µProc speed grows 2X every 18 months ("Moore's Law"), and DRAM speed grows 2X every 120 months, so both the database-processor and processor-memory performance gaps widen.
ECE 4100/6100 (47)
Estimating IRAM Performance
• Directly applying IRAM to existing architectures produces modest performance improvements
– Architectures were designed to overcome the
memory bottleneck
– Architectures were not designed to use
tremendous memory bandwidth
• Need to rethink the design!
– Tailor architecture to utilize the high bandwidth
ECE 4100/6100 (48)
Emerging Embedded Applications and
Characteristics
• Fastest growing application domain
– Video processing, speech recognition, 3D Graphics
– Set top boxes, game consoles, PDAs
• Data parallel
– Typically low temporal locality
– Size, weight and power constraints
– Highest speed not necessarily the best processor
• What about the role of ILP processors here?
• Real Time constraints
– Right data at the right time
ECE 4100/6100 (49)
SIMD/Vector Architectures
• VIRAM - Vector IRAM
– Logic is slow in DRAM process
– Put a vector unit in a DRAM and provide a port between a traditional
processor and the vector IRAM instead of a whole processor in DRAM
Source: Berkeley Vector IRAM
ECE 4100/6100 (50)
ISA
• Load/store (LD/ST) vector ISA defined as a co-processor to the MIPS64 ISA
• Vector register file with 32 entries
– Each can be configured as 64b, 32b, or 16b
– Integer or FP elements
• Two scalar register files
– Memory and exception handling, base addresses and stride
information
– Scalar operands
• Flag registers
• Special instructions
– Limited-scope instructions to permute the contents of vector registers
– Integer instructions for saturated arithmetic (a brief sketch follows this slide)
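A brief plain-C sketch (mine, not VIRAM code) of what saturated integer arithmetic means: results clamp at the representable extremes instead of wrapping around, which is usually what media kernels want.

#include <stdint.h>
#include <stdio.h>

/* Saturating signed 16-bit add: clamp to [INT16_MIN, INT16_MAX] instead of wrapping. */
static int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;   /* compute in a wider type */
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

int main(void)
{
    printf("%d\n", sat_add16(30000, 10000));   /* 32767, not a wrapped negative value */
    printf("%d\n", sat_add16(-30000, -10000)); /* -32768 */
    return 0;
}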
ECE 4100/6100 (51)
MIMD Machines
Figure: MIMD organization: processor + cache (P+C) nodes, each with a directory (Dir) and local memory, connected by an interconnection network.
• Parallel processing has catalyzed the development of several generations of parallel processing machines
• Unique features include the interconnection network,
support for system wide synchronization, and
programming languages/compilers
ECE 4100/6100 (52)
Basic Models for Parallel Programs
• Shared Memory
– Coherency/consistency are driving concerns
– Programming model is simplified at the expense
of system complexity
• Message Passing
– Typically implemented on distributed memory
machines
– System complexity is simplified at the expense of
increased effort by the programmer
ECE 4100/6100 (53)
Shared Memory Vs. Message Passing
• Shared memory
– Simplifies software development
– Increases complexity of hardware
• Power  directories, coherency enforcement logic
• More recently transactional memory
• Message passing doesn’t need centralized bus
– Simplifies hardware
• Scalable memory and interconnect bandwidth
– Increases complexity of software development
• Increases the burden on the developer
ECE 4100/6100 (54)
Two Emerging Challenges
Programming Models and
Compilers?
Source: Intel Corp.
Interconnection Networks
Source: IBM
ECE 4100/6100 (55)