
In search of performance
• The road to multi-cores
– Single-cycle processor
– Pipelining
– Caches
– Simplescalar
– Superscalar, VLIW and EPIC
– Multi-threading, Hyperthreading and Simultaneous Multithreading
• Multicores
– Chip Multi-Processors
– Graphics Processors
– Software Managed
What is this?
00100111101111011111111111100000
10101111101111110000000000010100
10101111101001000000000000100000
10101111101001010000000000100100
10101111101000000000000000011000
10101111101000000000000000011100
10001111101011100000000000011100
10001111101110000000000000011000
00000001110011100000000000011001
00100101110010000000000000000001
00101001000000010000000001100101
10101111101010000000000000011100
00000000000000000111100000010010
00000011000011111100100000100001
00010100001000001111111111110111
10101111101110010000000000011000
00111100000001000001000000000000
10001111101001010000000000011000
00001100000100000000000011101100
00100100100001000000010000110000
10001111101111110000000000010100
00100111101111010000000000100000
00000011111000000000000000001000
00000000000000000001000000100001
int main (int argc, char *argv[])
{
int i;
int sum = 0;
for (i = 0; i <= 100; i = i + 1) sum = sum + i * i;
printf ("The sum from 0 .. 100 is %d\n", sum);
}
MIPS machine language code for a routine to compute and print the sum of the squares of integers between 0 and 100.
High-Level Languages
• Higher-level languages
– Allow the programmer to think in a more natural language
• Customized for their intended use, e.g.,
– Fortran for scientific computation
– Cobol for business programming
– Lisp for symbol manipulation
– Improve programmer productivity and maintainability
• more understandable code that is easier to debug and validate
– Independent of
• the computer on which applications are developed
• the computer on which applications will execute
• Enabler
– Optimizing Compiler Technology
• Now very little programming at the assembler level
Translation from High Level Languages
• High-level language program (in C)
swap (int v[], int k)
. . .
↓ C compiler
• Assembly language program (for MIPS)
swap: sll $2, $5, 2
      add $2, $4, $2
      lw  $15, 0($2)
      lw  $16, 4($2)
      sw  $16, 0($2)
      sw  $15, 4($2)
      jr  $31
↓ Assembler
• Machine (object) code (for MIPS)
000000 00000 00101 0001000010000000
000000 00100 00010 0001000000100000
100011 00010 01111 0000000000000000
100011 00010 10000 0000000000000100
101011 00010 10000 0000000000000000
101011 00010 01111 0000000000000100
000000 11111 00000 0000000000001000
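The slide elides the body of swap, but it is recoverable from the MIPS code above: the loads and stores exchange v[k] and v[k+1]. A matching C body would be:

swap (int v[], int k)
{
  int temp;
  temp = v[k];
  v[k] = v[k + 1];
  v[k + 1] = temp;
}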
Fetching Instructions
• Fetching instructions involves
– reading the instruction from the Instruction Memory
– updating the PC to hold the address of the next instruction
[Figure: the PC drives the Instruction Memory Read Address and produces the Instruction; an adder computes PC + 4 for the next fetch]
– PC is updated every cycle, so it does not need an explicit write control signal
– Instruction Memory is read every cycle, so it doesn't need an explicit read control signal
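A minimal C sketch of this fetch step (an added illustration; imem and its size are hypothetical):

#include <stdint.h>

static uint32_t imem[1024];   /* instruction memory, indexed by word */
static uint32_t pc;           /* byte address of the next instruction */

uint32_t fetch(void)
{
    uint32_t instr = imem[pc >> 2];   /* read Instruction Memory at PC */
    pc += 4;                          /* PC updated every cycle: PC = PC + 4 */
    return instr;
}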
Executing ALU Operations
R-type format (bits 31–0): op (31–26) | rs (25–21) | rt (20–16) | rd (15–11) | shamt (10–6) | funct (5–0)
– perform the indicated (by op and funct) operation on values in rs and rt
– store the result back into the Register File (into location rd)
[Figure: instruction fields drive Read Addr 1 and Read Addr 2 of the Register File; Read Data 1 and Read Data 2 feed the ALU, steered by ALU control, producing overflow and zero outputs; RegWrite gates the Write Addr/Write Data port]
– Note that the Register File is not written every cycle (e.g. sw), so we need an explicit write control signal for the Register File
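A C sketch of pulling those fields out of a fetched word (field boundaries as in the format above; the struct and names are mine):

#include <stdint.h>

struct rtype { uint32_t op, rs, rt, rd, shamt, funct; };

struct rtype decode_rtype(uint32_t instr)
{
    struct rtype f;
    f.op    = (instr >> 26) & 0x3F;  /* bits 31-26 */
    f.rs    = (instr >> 21) & 0x1F;  /* bits 25-21 */
    f.rt    = (instr >> 16) & 0x1F;  /* bits 20-16 */
    f.rd    = (instr >> 11) & 0x1F;  /* bits 15-11 */
    f.shamt = (instr >>  6) & 0x1F;  /* bits 10-6  */
    f.funct =  instr        & 0x3F;  /* bits 5-0   */
    return f;
}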
Executing Load and Store Operations
[Figure: Register File Read Data 1 (the base register) and the 16-bit immediate, sign-extended to 32 bits, feed the ALU to form the memory Address; Data Memory is gated by MemRead/MemWrite, returning Read Data for loads and taking Read Data 2 as Write Data for stores; RegWrite gates the Register File write on loads]
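In C terms, the sign-extend-and-add path above looks like this (a sketch; names are mine):

#include <stdint.h>

uint32_t effective_address(uint32_t base, uint16_t imm16)
{
    int32_t offset = (int16_t)imm16;   /* Sign Extend: 16 -> 32 bits */
    return base + (uint32_t)offset;    /* ALU adds base register and offset */
}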
Executing Branch Operations
[Figure: one adder computes PC + 4; a second adder adds PC + 4 to the sign-extended 16-bit offset shifted left 2 bits, forming the branch target address; Register File Read Data 1 and Read Data 2 feed the ALU, whose zero output goes to the branch control logic]
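The branch-target computation in the same sketch style (the offset is the instruction's 16-bit immediate; names are mine):

#include <stdint.h>

uint32_t branch_target(uint32_t pc_plus_4, uint16_t imm16)
{
    int32_t offset = (int16_t)imm16;             /* sign-extend 16 -> 32 bits */
    return pc_plus_4 + ((uint32_t)offset << 2);  /* shift left 2: word offset */
}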
Executing Jump Operations
• Jump operations have to
J-type format (bits 31–0): op (31–26) | jump target address (25–0)
– replace the lower 28 bits of the PC with the lower 26 bits of the fetched instruction shifted left by 2 bits
[Figure: the 26-bit jump target from the fetched instruction is shifted left 2 bits to form a 28-bit jump address, which is concatenated with the upper 4 bits of PC + 4]
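A matching C sketch of that jump-address formation (names are mine):

#include <stdint.h>

uint32_t jump_target(uint32_t pc_plus_4, uint32_t instr)
{
    uint32_t target26 = instr & 0x03FFFFFF;             /* lower 26 bits */
    return (pc_plus_4 & 0xF0000000) | (target26 << 2);  /* keep upper 4 PC bits */
}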
Adding the Pieces Together
[Figure: the single-cycle datapath combining instruction fetch, Register File, ALU and Data Memory, with control signals RegWrite, ALUSrc, ALU control (ovf, zero), MemtoReg, MemWrite and MemRead, and the sign-extend unit widening the 16-bit immediate to 32 bits]
Adding the Branch Portion
[Figure: the same datapath extended with the branch adder and shift-left-2 unit; PCSrc selects between PC + 4 and the branch target address]
Adding the Jump Portion
[Figure: the datapath further extended with the jump logic; the 26-bit field is shifted left 2 bits to a 28-bit jump address, concatenated with PC+4[31–28], and the Jump control signal selects it as the next PC]
MIPS Machine (with Controls)
[Figure: the complete single-cycle MIPS datapath; the Control Unit decodes Instr[31–26] into RegDst, Jump, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc and RegWrite, while ALU control decodes Instr[5–0]; Instr[25–21], Instr[20–16] and Instr[15–11] address the Register File, Instr[15–0] is sign-extended to 32 bits, and Instr[25–0] feeds the jump address logic]
Single Cycle – Can we do better?
• Every instruction executes in 1 cycle
– Every instruction time = slowest instruction time
• Cannot reuse resources
– A wire can carry only one value in one cycle
[Timing diagram: over Cycles 1–10, each single-cycle instruction (lw, then sw) occupies one long clock cycle spanning Ifetch, Dec/Reg, Exec, Mem and Wr]
In search of performance
• The road to multi-cores
– Single-cycle processor
– Pipelining
– Caches
– Simplescalar
– Superscalar, VLIW and EPIC
– Multi-threading, Hyperthreading and Simultaneous Multithreading
• Multicores
– Chip Multi-Processors
– Graphics Processors
– Software Managed
Pipelined Processor
[Figure: the five-stage pipelined datapath with IF/ID, ID/EX, EX/MEM and MEM/WB pipeline registers between the PC/Instruction Memory, Register File, ALU and Data Memory; muxes select operands, a comparator and adder redirect the PC on a taken branch, and the sign-extend unit widens the 16-bit immediate to 32 bits]
Data must be carried from one stage to the next in the pipeline: pipeline registers/latches hold temporary values between clocks, along with the information needed for execution.
Applied to Computer Design
[Figure: pipeline diagram; instructions i, i+1, i+2, i+3 enter in program execution order, one per clock, and each flows through IF, ID, EX, MEM, WB (IM, Reg, ALU, DM, Reg) across clock cycles 1–8, so the stages of successive instructions overlap in time]
Potential of n-times speedup, where n = number of pipeline stages.
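To make the n-times claim concrete, the standard idealized arithmetic (an added illustration; assumes perfectly balanced stages and no hazards or stalls):

Time for k instructions on an n-stage pipeline = (n + k − 1) cycles, versus n·k cycles unpipelined.
Speedup = n·k / (n + k − 1), which approaches n as k grows.
e.g. n = 5 stages, k = 1000 instructions: 5000 / 1004 ≈ 4.98, close to the ideal 5×.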
Data Forwarding
[Figure-only slide]
Branch Hazard
[Figure: the pipelined datapath with the Control unit producing WB, M and EX signal groups carried through the ID/EX, EX/MEM and MEM/WB registers; the branch adder and shift-left-2 unit compute the target, and the ALU Zero output with the Branch signal drives PCSrc to select the next PC]
A Branch Predictor
[Figure: Branch Prediction Logic sits beside the Instruction Memory; a mux chooses between the normal PC value and the guess as to where to branch, and Branch Update Information feeds back to train the predictor]
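The slide leaves the prediction scheme open; one minimal concrete choice (an assumption, not from the slides) is a table of 2-bit saturating counters:

#include <stdint.h>

#define PRED_ENTRIES 1024
static uint8_t counters[PRED_ENTRIES];   /* 2-bit saturating counters, 0..3 */

int predict_taken(uint32_t pc)           /* the "guess" fed to the PC mux */
{
    return counters[(pc >> 2) % PRED_ENTRIES] >= 2;
}

void update_predictor(uint32_t pc, int taken)  /* branch update information */
{
    uint8_t *c = &counters[(pc >> 2) % PRED_ENTRIES];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}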
In search of performance
• The road to multi-cores
– Single-cycle processor
– Pipelining
– Caches
– Simplescalar
– Superscalar, VLIW and EPIC
– Multi-threading, Hyperthreading and Simultaneous Multithreading
• Multicores
– Chip Multi-Processors
– Graphics Processors
– Software Managed
Memory Hierarchy
Capacity / Access Time / Cost, with the staging transfer unit and manager between levels:
• CPU Registers: 100s Bytes, <10s ns – Instr. Operands (1–8 bytes), prog./compiler
• Cache: K Bytes, 10–100 ns, 1–0.1 cents/bit – Blocks (8–128 bytes), cache cntl
• Main Memory: M Bytes, 200–500 ns, $.0001–.00001 cents/bit – Pages (4K–16K bytes), OS
• Disk: G Bytes, 10 ms (10,000,000 ns), 10^-5–10^-6 cents/bit – Files (Mbytes), user/operator
• Tape: infinite, sec–min, 10^-8 cents/bit
Upper levels are smaller and faster; lower levels are larger and slower.
• Fact: Large memories are slow, fast memories are small
• How do we create a memory that gives the illusion of being large,
cheap and fast (most of the time)?
Caches: Insight
• Temporal Locality (Locality in Time):
=> Keep most recently accessed data items closer to the processor
• Spatial Locality (Locality in Space):
=> Move blocks consisting of contiguous words to the upper levels
[Figure: the processor exchanges individual data items with Upper Level Memory, which exchanges whole blocks (Blk X, Blk Y) with Lower Level Memory]
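As a concrete illustration (an added example, not from the slides), a simple C loop exhibits both kinds of locality:

/* illustrative only: sums a 100x100 matrix in row-major order */
int sum_matrix(int a[100][100])
{
    int i, j, sum = 0;
    for (i = 0; i < 100; i++)
        for (j = 0; j < 100; j++)
            sum += a[i][j];  /* spatial: consecutive j touch contiguous words */
    return sum;              /* temporal: sum, i, j are reused every iteration */
}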
Direct Mapped Cache
• Cache line 0 can be occupied by data from:
– Memory location 0, 8, 16, ... etc.
– In general: any memory location whose 3 LSBs of the address are 0s
– Address<2:0> => cache index
• Which one should we place in the cache?
– the one we reference most frequently
• How can we tell which one is in the cache?
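A minimal C sketch of the address split implied above (assumes the slide's 8-line, byte-granular cache; the tag, i.e. the remaining upper bits, is the usual answer to "which one is in the cache"):

#include <stdint.h>

/* 8-line direct-mapped cache: Address<2:0> => cache index */
uint32_t cache_index(uint32_t addr) { return addr & 0x7; }

/* the remaining upper bits form the tag, which identifies which of
   the memory locations mapping to this line is actually resident */
uint32_t cache_tag(uint32_t addr) { return addr >> 3; }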
Set-Associative Cache
• Conflict Misses are misses caused by different memory locations mapped to the same cache index
– Solution 1: make the cache size bigger
– Solution 2: multiple entries for the same cache index
• N-way set associative: N entries for each cache index
• N direct mapped caches operate in parallel
[Figure: a two-way set-associative cache; the cache index selects a set in each way's cache tag and cache data arrays, both tags are compared (=) in parallel, and a selector (MUX) picks the matching way's data and raises hit]
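A minimal sketch of that two-way lookup (structure, sizes and names are assumptions, not from the slides; 4-byte blocks, 128 sets):

#include <stdint.h>

#define WAYS 2
#define SETS 128

struct line { uint32_t tag; int valid; uint32_t data; };
static struct line cache[WAYS][SETS];   /* N direct-mapped caches in parallel */

int lookup(uint32_t addr, uint32_t *data)
{
    uint32_t block = addr >> 2;         /* 4-byte blocks (assumed) */
    uint32_t index = block % SETS;      /* cache index selects one set */
    uint32_t tag   = block / SETS;      /* upper bits become the tag */
    for (int w = 0; w < WAYS; w++) {    /* hardware compares both tags in parallel */
        if (cache[w][index].valid && cache[w][index].tag == tag) {
            *data = cache[w][index].data;   /* selector (MUX) picks the hit way */
            return 1;                       /* hit */
        }
    }
    return 0;                               /* miss */
}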
In search of performance
• The road to multi-cores
– Single-cycle processor
– Pipelining
– Caches
– Simplescalar
– Superscalar, VLIW and EPIC
– Multi-threading, Hyperthreading and Simultaneous Multithreading
• Multicores
– Chip Multi-Processors
– Graphics Processors
– Software Managed
Simple Pipelined Processor
[Figure: pipeline stages IF, ID, RF, EX, MEM, WB around the Register File]
• Fetch, issue and commit one instruction per cycle
– Execution units can take multiple cycles, including the memory
• Issue Logic
– Need to stall the instruction until its operands are ready to read, depending on the bypasses
– In-order issue, execute, and commit
• Several Execution Units
– I-ALU, FP-Add, FP-Mult, FP-Div, with different delays
Multi-cycle and Pipelined Units
[Figure: pipeline IF, ID, reorder buffer (ROB), RF, with multi-cycle units EX, FP Mult, FP Add and Divider, then MEM, commit buffer (CB), WB; and the pipelined version, where FP Mult is split into stages M1–M7 and FP Add into A1–A4, the Divider remaining unpipelined]

Functional Unit     Latency   Initiation Interval
I-ALU               1         1
FP ALU              4         1
FP + Integer Mult   7         1
FP Divider          24        24
Multi-cycle Units
[Figure: pipeline IF, ID, ROB, RF, with EX, FP Mult, FP Add and Divider units in parallel, then MEM, CB, WB]
• Actually the EX unit may have separate but parallel units for integer arithmetic, multiplication, and floating point operations
• Scheduling done using Reservation Tables

Functional Unit     Latency
I-ALU               1
FP ALU              4
FP + Integer Mult   7
FP Divider          24
Instruction Reordering
[Figure: pipeline IF, ID, ROB, RF with EX, the pipelined multiplier (M1–M7), the pipelined adder (A1–A4) and the Divider, then MEM, CB, WB]
• Even in an in-order issue processor, several instructions can be executing at the same time
• ADDD cannot be executed because it depends on DIVD's result (F0), but SUBD can be
• Instruction reordering is needed during issue
• A commit buffer is needed for in-order writes
• Front End: Instruction Fetch is separated from the rest of the pipeline by the ROB
• Back End: Instruction Commit is separated from the rest of the pipeline by the CB

DIVD F0, F2, F4
ADDD F10, F0, F8
SUBD F12, F8, F14
Dynamic Scheduling
• Register Scoreboarding
– CDC 6600
– First instruction whose source is not the destination of an unfinished instruction is issued
• In the presence of bypasses?
– Only one destination can be in flight
– Difficult to disambiguate which result to use
• Tomasulo's Algorithm
– IBM System/360 Model 91
– Writes the result in a new location
– All reads happen from the new location
– Register Renaming
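A toy sketch of the renaming idea (a simplification with hypothetical names: every write is given a fresh physical location, and later reads are redirected to it; free-list recycling is not shown):

#include <stdint.h>

#define ARCH_REGS 32
#define PHYS_REGS 128

static int rename_table[ARCH_REGS];  /* architectural -> physical register map */
static int next_free;                /* naive allocator */

void rename_init(void)
{
    for (int i = 0; i < ARCH_REGS; i++)
        rename_table[i] = i;         /* initially reg i lives in physical reg i */
    next_free = ARCH_REGS;
}

/* a write gets a new location, so older in-flight readers are unaffected */
int rename_dest(int arch_reg)
{
    rename_table[arch_reg] = next_free++ % PHYS_REGS;
    return rename_table[arch_reg];
}

/* all later reads happen from the newest location */
int rename_src(int arch_reg)
{
    return rename_table[arch_reg];
}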
In search of performance
• The road to multi-cores
– Single-cycle processor
– Pipelining
– Caches
– Simplescalar
– Superscalar, VLIW and EPIC
– Multi-threading, Hyperthreading and Simultaneous Multithreading
• Multicores
– Chip Multi-Processors
– Graphics Processors
– Software Managed
Superscalar Processors
• Exploit Instruction-Level Parallelism in hardware
– Instructions are issued from a sequential instruction stream
– CPU hardware dynamically checks for data dependencies between instructions at run time
• Dynamic re-ordering of instructions
– Accepts multiple instructions per clock cycle
[Figure: the pipelined machine, now issuing 2 instructions per cycle from the ROB and committing 2 instructions per cycle from the CB]
• Issue logic overhead
– For n different instructions and a k-issue processor: # gates ~ nk, and delay ~ k² log n, a combinatorial explosion
• ILP available in the application
VLIW Processors
• Instruction re-ordering and management are extremely resource and power hungry
– Perform it in the compiler: the VLIW processor
• Josh Fisher
• Instruction composed of operations
[Figure: one VLIW instruction packs several operation slots issued together]
– All operations execute in parallel
• All reads are done in the OR stage, and writes in the WB stage
– No re-ordering and checking by the machine
• Very simple hardware (power-efficient)
– Each operation requires a small, statically predictable delay
• Triggered a lot of compiler work
– Trace Scheduling: speculative execution and compensation code
• Problems
– Code density with lots of NOPs
– Compiler speculates on branches
– Load hoisting
– Re-compile when machine width changes
– Memory operations do not have deterministic delay
EPIC
• Intel Itanium Architecture
– 1.7 billion transistor core on a 90 nm process
– 24 MB L3 cache
• Explicitly Parallel Instruction Computing
– Beyond VLIW
[Figure captions: the sturdy heat sink of Itanium; the processor in a sea of cache]
• Compatibility
– Bit-vector in the instruction to specify dependency with previous instructions
• Load Hoisting
– Speculative load, and check load instructions
• Branches
– Predication
– Multi-way branch instructions
• Register Renaming
– Very large Register Files
• Code size
– Itanium code size ~ 2× x86 code size
Comparison
• Superscalar and VLIW are the two ends of the spectrum
• EPIC has more dynamic features, but removes the really costly hardware features
In search of performance
• The road to multi-cores
– Single-cycle processor
– Pipelining
– Caches
– Simplescalar
– Superscalar, VLIW and EPIC
– Multi-threading, Hyperthreading and Simultaneous Multithreading
• Multicores
– Chip Multi-Processors
– Graphics Processors
– Software Managed
Multi-threading
• Tackle memory wall
– Small pipeline stalls can be fixed by scheduling
– Switch to a different thread of execution for memory misses
• Can be done in software
– Store and restore the thread context
• Hardware Threading
– Hardware state for each thread
• RF, PC, CPSR, Register renaming tables
• Even separate pipeline registers for each thread
• When to switch
– When the current thread stalls, on timeout, or by priority with pre-emption
– Fine-grained
• Round robin, dynamic thread priorities
Simultaneous Multi Threading
• Issue instructions from multiple threads in
each cycle
• Chief Advantage:
– Possible to find more instructions to issue, and
hence higher performance
• Chief Disadvantages
– Maximum hardware overhead
– Need larger cache
– Load store queues need to keep memory
consistent
• Intel HT, IBM AS/400
[Figure: (a) superscalar, (b) multithreading, and (c) SMT processor]
In search of performance
• The road to multi-cores
– Single-cycle processor
– Pipelining
– Caches
– Simplescalar
– Superscalar, VLIW and EPIC
– Multi-threading, Hyperthreading and Simultaneous Multithreading
• Multicores
– Chip Multi-Processors
• Sun Niagara, Power, Intel
– Graphics Processors
– Software Managed
Sun UltraSPARC T1 (Niagara)
• 8 cores
– Small L1 cache
• Private 8 KB data cache
• Private 16 KB instruction cache
– Shared L2
• 3 MB unified, with 4 banks, 16-byte access to memory
• At 400 MHz, 25.6 GB/s
– Only one FPU
• Each FP operation takes > 40 cycles
– As a rough guideline, performance degrades if the number of floating-point instructions exceeds 1 percent of total instructions
Sun UltraSPARC T1 (Niagara)
• Core Pipeline
– No out-of-order execution
– Small cache
– 4-way Multi-threading to hide cache miss latency
Sun UltraSPARC T2 (Niagara 2)
• 1 FPU per core
• Each core is 4-way SMT
• Eight encryption engines, each supporting DES, 3DES, AES, RC4, SHA-1, SHA-256, MD5, RSA-2048, ECC, CRC32
• Power: 123 W (max), 95 W (normal)
• Verilog RTL available under the OpenSPARC project
• Rock Processor (separate from Niagara)
– Hardware-implemented Transactional Memory
– Hardware scout for prefetching
– Coming out in 2009
IBM Power Series
• POWER – Performance Optimization With Enhanced RISC
• POWER1 from the RS/6000
• POWER2
– Quad-word storage instructions. The quad-word load instruction moves two adjacent double-precision values into two adjacent floating-point registers.
– Hardware square root instruction.
– Floating-point to integer conversion instructions.
– POWER2 was the processor used in the 1997 IBM Deep Blue chess-playing supercomputer, which beat chess grandmaster Garry Kasparov.
Power 3
• Issues up to 4 FP instructions per cycle
• Peak of 8 instructions per cycle
• 32-byte interface with the L1 I$
• 16-byte interface with the L1 D$
• Unified L2 – 1 MB to 16 MB
• 6-stage pipeline
– branch mis-prediction penalty is only 3 cycles
• 2048-entry BHT
• Full register renaming – Tomasulo's alg.
• Non-blocking for up to 4 cache misses
• L2 cache latency – 6 cycles
• Data Cache – prefetch mechanism monitors accesses to two adjacent cache lines
– If found, starts a stream of prefetches
Power 4
• Dual-core, shared L2 cache
• External off-chip L3 cache
• 174 M transistors, 130 nm, 1 GHz, 115 W
Power 4
• 5 instructions can commit each cycle
• Register renaming, out-of-order execution and hit-under-miss allow for more than 200 instructions in flight.
Power 5
• 276 M transistors, 130 nm
• On-chip memory controller; L3 is still off-chip
• 2-way SMT
• Software thread priority
• Increase in resources
– GPRs: 80 → 120
– FPRs: 72 → 120
• LRQ, SRQ have 32 entries
• Dynamic Resource Balancing for SMT
– Reduce priority of a low-ILP thread
Power 5
[Figure-only slide]
Power 6
• "Often the individual components, especially the processors, are capable of additional performance, but the power and thermal costs require the system to enforce limits to ensure safe, continued operation."
• 790 M transistors, 65 nm, <5 GHz, 430 W
• 13-gate-delay pipeline
• "Historically, Power Architecture technology-based machines consumed nearly their maximum power when idle."
• PURR: Processor Utilization of Resources Register
– Added for each SMT thread
– Collects statistics for thread prioritization
• At SuperComputing 2007 (SC07) a water-cooled Power6 was revealed
• IBM and Cray get $250 M each from DARPA for a petascale computer
Nehalem
• Advertisement Video
• 730 M transistors, 45 nm, ~3 GHz, 173 W
• 2-way SMT
• Operation fusion
– Cannot change the ISA, but can execute complex instructions efficiently in hardware
• Loop Stream Detector
[Figure: Branch Prediction and Fetch feed Decode, which fills a 28 micro-op Loop Stream Detector]
Nehalem Memory Architecture
• Each core has a 32 kB L1 data cache, a 32 kB L1 instruction cache, and a private 256 kB L2 cache
• All cores share an L3 cache
• Exclusive L3: on a miss, must check all other cores' caches
• Inclusive L3: a miss guarantees no core holds the line; on a hit, core valid bits stored with the line (e.g. 0 0 1 0) mean only the core whose core valid bit is set needs to be checked
[Figure: Cores 0–3 with private L1/L2 caches above a shared L3, contrasting the exclusive (check all cores) and inclusive (check flagged cores only) designs]
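A sketch of the inclusive-L3 snoop filtering implied above (hypothetical structure; one valid bit per core per L3 line):

#include <stdint.h>

struct l3_line {
    uint32_t tag;
    int      valid;
    uint8_t  core_valid;   /* one bit per core: which cores may hold the line */
};

/* returns a bitmask of cores that must be snooped for this access */
uint8_t cores_to_check(const struct l3_line *line, int l3_hit)
{
    if (!l3_hit)
        return 0;              /* inclusive L3: a miss means no core has it */
    return line->core_valid;   /* only check cores whose valid bit is set */
}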
Nehalem Power Control Unit
[Figure: each core, and the uncore/LLC, has its own Vcc, frequency sensors and PLL, driven from BCLK; an integrated proprietary microcontroller, the PCU, manages them all]
• Shifts control from hardware to embedded firmware
• Real-time sensors for temperature, current, power
• Flexibility enables sophisticated algorithms, tuned for current operating conditions
Intel® Core™ Microarchitecture (Nehalem) Turbo Mode
• Power Gating: zero power for inactive cores
• Turbo Mode: in response to the workload, adds additional performance bins within the headroom
[Figure, shown as a three-step animation in the original slides: frequency bars for Cores 0–3; with no turbo, all cores run at the base frequency (F); when the workload is lightly threaded or below TDP, inactive cores are power-gated and the active cores step up extra frequency bins]
In search of performance
• The road to multi-cores
– Single-cycle processor
– Pipelining
– Caches
– Simplescalar
– Superscalar, VLIW and EPIC
– Multi-threading, Hyperthreading and Simultaneous Multithreading
• Multicores
– Chip Multi-Processors
• Sun Niagara, Power, Intel
– Graphics Processors
• Nvidia Tesla
– Software Managed
Nvidia GPUs
• CPUs devote more transistors to data storage
• GPUs devote more transistors to data processing
• Writing graphics applications required the use of a graphics API, like DirectX
• CUDA – Compute Unified Device Architecture
– C-like language for writing graphics applications
• CUDA Execution Model – highly multithreaded co-processor
– For loop kernels (parallel for)
GPU for Graphics
• Graphics Applications
– All the tasks for a pixel form one thread, executed on a single core
– Multiple threads execute on a single core in SMT fashion
– Pixel tasks are independent
– All frame pixels are processed simultaneously
Processor Architecture
• Thread processors have
– private registers
– a common instruction stream
– shared memory
• Tesla S1070
– 1 TFlop processor

# of Tesla GPUs: 4
# of Streaming Processor Cores: 960 (240 per processor)
Frequency of processor cores: 1.296 to 1.44 GHz
Single precision floating point performance (peak): 3.73 to 4.14 TFlops
Double precision floating point performance (peak): 311 to 345 GFlops
Floating point precision: IEEE 754 single & double
Total dedicated memory: 16 GB
Memory interface: 512-bit
Memory bandwidth: 408 GB/sec
Max power consumption: 800 W
System interface: PCIe x16 or x8
Programming environment: CUDA
In search of performance
• The road to multi-cores
– Single-cycle processor
– Pipelining
– Caches
– Simplescalar
– Superscalar, VLIW and EPIC
– Multi-threading, Hyperthreading and Simultaneous Multithreading
• Multicores
– Chip Multi-Processors
• Sun Niagara, Power, Intel
– Graphics Processors
• Nvidia Tesla
– Software Managed
• IBM Cell
IBM Cell
• Clock speed > 4 GHz
• Peak performance (single precision) > 256 GFlops
• Peak performance (double precision) > 26 GFlops*
• Local Storage per SPU: 256 KB
• Area: 221 mm²
• Technology: 90 nm
• Total number of transistors: 234 M
Heterogeneous multi-core system architecture:
• Power Processing Element for control tasks
• Synergistic Processor Element for data-intensive processing
A Synergistic Processor Element consists of:
– Synergistic Processor Unit (SPU)
– Synergistic Memory Flow Control (SMF)
Cell Components – SPE (8)
Synergistic Processing Unit
• RISC organization
– 32-bit fixed instruction encoding
– 3-operand instruction format
– Load/Store architecture
– Unified register file
• User-mode architecture
– No page translation within the SPU
• SIMD dataflow
– Broad set of operations (8, 16, 32, 64 bit)
– Graphics single-precision float
– IEEE DP float
• 256 KB Local Store
– Combined instructions and data
• DMA block transfer
– Using Power Architecture memory translation
SPE Pipeline
[Figure-only slide]
Synergistic Memory Flow Control
• The SMF implements memory management and mapping
• The SMF operates in parallel to the SPU
– Independent compute and transfer
– Command interface from the SPU
• DMA queue decouples SMF and SPU
• Block transfer between system memory and local store
• SPE programs reference system memory using the user-level effective address space
– Ease of data sharing
– Local store to local store transfers
– Protection
• 8 concurrent memory transfers
In search of performance
• The road to multi-cores
– Single-cycle processor
– Pipelining
– Caches
– Simplescalar
– Superscalar, VLIW and EPIC
– Multi-threading, Hyperthreading and Simultaneous Multithreading
• Multicores
– Chip Multi-Processors
• Sun Niagara, Power, Intel
– Graphics Processors
• Nvidia Tesla
– Software Managed
• IBM Cell
• Where are we headed to
Where are we headed to?
• Intel CMPs
– 16-core, 32-core CMPs
– With SMT and shared memory
– Cache coherency
– TLMs, hardware- and software-implemented
• Intel 80-core experimental system
– More like the Cell processor
• Intel Network Processor
– Larrabee – 10s of cores
• Nvidia
– Planning for a 1000-core processor
• IBM
– Next version of Cell
CGRAs
Coarse-Grained Reconfigurable Architectures
• Array of PEs connected in a mesh-like interconnect
• Characterized by array size, node functionalities, interconnect, and register file configurations
• Execute compute-intensive kernels in multimedia applications
[Figure: a PE containing a functional unit (FU), a local register file (LRF) and configuration memory (Config)]