Lec4-billionarch97-2


ECE8833 Polymorphous and Many-Core Computer Architecture
Lecture 4 Billion-Transistor Architecture 97 (Part II)
Prof. Hsien-Hsin S. Lee
School of Electrical and Computer Engineering
Practitioners’ Groups
Everyone has an acronym!
• IRAM
– Implementation at Berkeley
• CMP
– Led to Sun Niagara and the multicore (r)evolution
• SMT
– Intel HyperThreading (arguably Intel first envisioned the idea), IBM POWER5, Alpha 21464
– Many credit this technology to UCSB’s multistreaming work in the early 1990s
• RAW
– Led to the Tilera TILE64
C. E. Kozyrakis, S. Perissakis, D. Patterson, T. Anderson, K. Asanovic, N. Cardwell, R. Fromm, J. Golbus, B. Gribstad, K. Keeton, R. Thomas, N. Treuhaft, K. Yelick
Mission Statement
Future Roadblocks that Inspired IRAM
• Latency issues
– Continually increasing performance gap between processor and memory
– DRAM optimized for density, not speed
• Bandwidth issues
– Off-chip bus
• Slow and narrow
• High capacitance, high energy
– Especially hurts scientific codes, databases, etc.
IRAM Approach
• Move DRAM closer to the processor
– Enlarge on-chip bandwidth
• Fewer I/O pins
– Smaller package
– Serial interface
→ Anything look familiar?
IRAM Chip Design Research
• How much larger and slower is a processor designed in a straight DRAM process vs. a standard logic process?
– Microprocessor fabs offer fast transistors for fast logic and many metal layers for accelerating communication and simplifying power distribution
– DRAM fabs offer many poly layers to give small DRAM cells and low leakage for a low refresh rate
• Speed of page buffer vs. registers and cache
• New DRAM interface based on fast serial links (2.5 Gbit/s, or 300 MB/s per pin)
• Quantify bandwidth vs. area/power tradeoff
• Area overhead for IRAM vs. a DRAM
• Extra power dissipation for IRAM vs. a DRAM
• Performance of IRAM with same area and power as DRAM (“processor for free”)
Source: David Patterson’s slide in his IRAM Overview talk
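As a quick sanity check on the serial-link figure (my arithmetic, not on the slide), the two per-pin numbers are the same quantity in different units:

    \[ 2.5\ \text{Gbit/s} \times \tfrac{1\ \text{byte}}{8\ \text{bits}} = 312.5\ \text{MB/s} \approx 300\ \text{MB/s per pin} \]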
IRAM Architecture Research
• How much slower can a processor with a high-bandwidth memory be and yet be as fast as a conventional computer? (very interesting point)
• Compare memory management schemes (e.g., vector registers, scratch pad, wide TLB/cache)
• Compare schemes for running large programs, i.e., spanning multiple IRAMs
• Quantify the value of compact programs and data (e.g., compact code, on-the-fly compression)
• Quantify pros and cons of a standard instruction set vs. a custom IRAM instruction set
Source: David Patterson’s slide in his IRAM Overview talk
IRAM Compiler Research
• Explicit SW control of memory management vs. conventional implicit HW designs
– Protection (software fault isolation)
– Paging (dynamic relocation, overlapped I/O accesses)
– “Cache” control (vector registers, scratch pad)
– I/O interrupt/polling
• Evaluate benchmark performance in conjunction with architectural research
– Number crunching (vector vs. superscalar)
– Memory intensive (database, operating system)
– Real-time benchmarks (stability and performance)
– Pointer intensive (GCC compiler)
• Impact of language on IRAM (Fortran 77 vs. HPF, C/C++ vs. Java)
Source: David Patterson’s slide in his IRAM Overview talk
Potential IRAM Architecture
• “New Model”: VSIW = Very Short Instruction Word!
– Compact: describes N operations with 1 short instruction (vector)
– Predictable (real-time) perf. vs. statistical perf. (cache)
– Multimedia ready: choose Nx64b, 2Nx32b, 4Nx16b
– Easy to get high performance; the N operations:
• Are independent
• Use the same functional unit
• Access disjoint registers
• Access registers in the same order as previous instructions
• Access contiguous memory words or a known pattern
• Hide memory latency (and any other latency)
– Compiler technology already developed
Source: David Patterson’s slide in his IRAM talk
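To make the VSIW point concrete, here is what a single hypothetical vector add encodes, written out as scalar C (my illustration, not from the slide):

    #define VLEN 64   /* hypothetical vector length N */

    /* One vector instruction, e.g. "vadd.vv v3, v1, v2", encodes all VLEN
     * element operations below: they are independent, use the same
     * functional unit, and touch disjoint register elements, so no
     * dependence checking is needed at issue time. */
    void vadd(long v3[VLEN], const long v1[VLEN], const long v2[VLEN])
    {
        for (int i = 0; i < VLEN; i++)
            v3[i] = v1[i] + v2[i];   /* element i of the single vector op */
    }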
Berkeley Vector-Intelligent RAM
Why vector processing?
• Scalable design
• Higher code density
• Runs at a higher clock rate
• Better energy efficiency due to easier clock gating for vector / scalar units
• Lower die temperature to keep a good data retention rate
• On-chip DRAM is sufficient for embedded applications
• Use external off-chip DRAM as secondary memory
– Pages swapped between on-chip and off-chip DRAMs
VIRAM-1 Floorplan
• 180nm CMOS, 6-layer copper
• 125 million transistors, 325 mm2
• 2 watts @ 200MHz
• 13MB eDRAM macros from IBM and 4 vector units (total 8KB vector registers)
• VRF = 32x64b or 64x32b or 128x16b
[Floorplan labels: 64-bit MIPS M5Kc scalar core; ¼ of the 8KB VRF per lane (custom layout); IBM embedded DRAM macros, each 13Mbit]
[Gebis et al. DAC student contest 04]
S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, R. L. Stamm, D. M. Tullsen
SMT Concept vs. Other Alternatives
[Figure: functional-unit occupancy (FU1–FU4) over execution time for Threads 1–5 plus unused slots, comparing a conventional single-threaded superscalar, fine-grained multithreading (cycle-by-cycle interleaving), coarse-grained multithreading (block interleaving), a chip multiprocessor (CMP), and simultaneous multithreading (or Intel’s HT)]
• The early SMT idea was developed at UCSB (Mario Nemirovsky’s group, HICSS’94)
• The name SMT was christened by the group at the University of Washington (ISCA’95)
Exploiting Choice: SMT Inst Fetch Policies
• FIFO, Round Robin: simple but may be too naive
• RR.X.Y
– X threads for Y instructions each
– RR.1.8
– RR.2.4 or RR.4.2
– RR.2.8
• What are the main design and/or performance issues when X > 1?
[Tullsen et al. ISCA96]
Exploiting Choice: SMT Inst Fetch Policies
• Adaptive fetching policies
– BRCOUNT (reduce wrong-path issuing)
• Count # of branch insts in the decode/rename/IQ stages
• Give top priority to the thread with the smallest BRCOUNT
– MISSCOUNT (reduce IQ clog)
• Count # of outstanding D-cache misses
• Give top priority to the thread with the smallest MISSCOUNT
– ICOUNT (reduce IQ clog)
• Count # of insts in the decode/rename/IQ stages
• Give top priority to the thread with the smallest ICOUNT
– IQPOSN (reduce IQ clog)
• Give lowest priority to threads with insts closest to the head of the INT or FP instruction queues
– Because threads with the oldest instructions are most prone to IQ clog
• No counter needed
[Tullsen et al. ISCA96]
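ICOUNT is simple enough to sketch. Below is a minimal illustration in C (my code, not the paper’s): each cycle, fetch goes to the runnable thread with the fewest instructions in the decode/rename/IQ stages.

    #include <limits.h>

    #define NTHREADS 8

    /* icount[t]: # of thread t's instructions in decode/rename/IQ stages */
    int pick_fetch_thread(const int icount[NTHREADS], const int runnable[NTHREADS])
    {
        int best = -1, best_count = INT_MAX;
        for (int t = 0; t < NTHREADS; t++) {
            if (runnable[t] && icount[t] < best_count) {
                best_count = icount[t];   /* lowest ICOUNT gets top priority */
                best = t;
            }
        }
        return best;   /* -1 if no thread is runnable this cycle */
    }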
Exploiting Choice: SMT Inst Fetch Policies
[Tullsen et al. ISCA96]
Alpha 21464 (EV8)
• Leading-edge process technology
– 1.2 to 2.0GHz
– 0.125µm CMOS
– SOI-compatible
– Cu interconnect, 7 metal layers
– Low-k dielectrics
• Chip characteristics
– 1.2V Vdd, 250W (EV6: 72W, EV7: 125W)
– 250 million transistors, 350mm2
– 1100 signal pins in flip-chip packaging
Slide Source: Dr. Joel Emer
EV8 Architecture Overview
• Enhanced OoO execution
• 8-wide issue superscalar processor
• Large on-die L2 (1.75MB)
• 8 DRDRAM channels
• On-chip router for system interconnect
• Directory-based ccNUMA for up to 512-way SMP
• 4-way SMT
Slide Source: Dr. Joel Emer
SMT Pipeline
• Replicated resources
– PCs
– Register maps
• Shared resources
– Register file (RF)
– Instruction queue
– First- and second-level caches
– Translation buffers
– Branch predictor
[Pipeline figure: Fetch → Decode/Map → Queue → Reg Read → Execute → Dcache/Store Buffer → Reg Write → Retire, with per-thread PCs and register maps feeding the shared registers, Icache, and Dcache]
Slide Source: Dr. Joel Emer
Intel HyperThreading
• Intel Xeon Processor, Xeon MP Processor, and ATOM
• Enables Simultaneous Multi-Threading (SMT)
– Exploits ILP resources through TLP (Thread-Level Parallelism)
– Issues and executes instructions from multiple threads in the same snapshot
• Appears to be 2 logical processors
• Shares the same execution resources
• Duplicates architectural state and certain microarchitectural state
– IPs, iTLB, streaming buffer
– Architectural register file
– Return stack buffer
– Branch history buffer
– Register Alias Table
Sharing Resources in Intel HT
• P4’s TC (or µROM) is alternately accessed each cycle by the two logical processors unless one is stalled due to a TC miss
• TLB shared, tagged with a logical processor ID, but partitioned
– x86 does not employ ASIDs
– Hard partitioning appears to be the only option to allow HT
• µop queue (split into ½) after µops are fetched from the TC
• ROB (126/2 in P4)
• LB (48/2 in P4)
• SB (24/2 or 32/2 in P4)
• General µop queue and memory µop queue (½ each)
• Retirement: alternating between the 2 logical processors
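A minimal sketch of the hard-partitioning idea (illustrative only; the real P4 structures are not software-visible): a structure of SIZE entries is statically split in half, so a stalled logical processor can never occupy its sibling’s entries.

    #define SIZE 48   /* e.g., the P4 load buffer: 48 entries -> 24 + 24 */

    typedef struct {
        int tail[2];    /* per-logical-processor allocation point */
        int count[2];   /* entries in use per logical processor */
    } split_queue_t;

    /* Logical processor lp may only allocate inside its own half. */
    int try_alloc(split_queue_t *q, int lp)
    {
        int half = SIZE / 2;
        if (q->count[lp] == half)
            return -1;                        /* lp's half full: only lp stalls */
        int slot = lp * half + q->tail[lp];   /* index within lp's partition */
        q->tail[lp] = (q->tail[lp] + 1) % half;
        q->count[lp]++;
        return slot;
    }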
HT in Intel ATOM
• First in-order processor with HT
• HT claimed to enlarge the silicon by 8%
• Claimed 30% performance increase at 15% power increase
• Shared cache space is deprived/competed for between threads
• No dedicated multiplier – uses the SIMD multiplier
• No dedicated INT divider – uses the FP divider
[Die labels: 32KB L1 I-cache, 24KB L1 D-cache, 512KB L2; 25mm2 @ 45nm]
Source: Microprocessor Report and Intel
L. Hammond, B. A. Nayfeh, K. Olukotun
Main Argument
• A single thread of control has limited parallelism (ILP is dead)
• The cost of extracting it is prohibitive due to complexity
• Achieve parallelization with SW, not HW
– Inherently parallel multimedia applications
– Widespread multi-tasking OSes
– Emerging parallel compilers (ref. SUIF), mainly for loop-level parallelism
• Why not SMT?
– Interconnect delay issue
– Partitioning is less localized than in a CMP
• Use relatively simple single-thread processors
– Exploit only a “modest” amount of ILP
– Execute multiple threads in parallel
• Bottom line
Architectural Comparison
Single Chip Multiprocessor
Commercial CMP (AMD Phenom II Quad-Core)
• AMD K10 (Barcelona)
• Code name “Deneb”
• 45nm process
• 4 cores, private 512KB L2
• Shared 6MB L3 (2MB in Phenom)
• Integrated Northbridge
– Up to 4 DIMMs
• Sideband Stack Optimizer (SSO)
– Parallelizes many POPs and PUSHs (which were dependent on each other)
• Converts them into pure load/store instructions
– No µops occupy the FUs for stack-pointer adjustment
Intel Core i7 (Nehalem)
• 4 cores
• HT support in each core
• 8MB shared L3
• 3 DDR3 channels
• 25.6GB/s memory BW
• Turbo Boost Technology
– New P-state (Performance)
– DVFS when workloads operate under the max power limit
– Same frequency for all cores
Ultra Sparc T1
• Up to eight cores, each 4-way threaded
• Fine-grained multithreading
– Thread-selection logic
• Takes out threads that encounter long-latency events
– Round-robin, cycle-by-cycle
– 4 threads in a group share a processing pipeline (Sparc pipe)
• 1.2 GHz (90nm)
• In-order, 8 instructions per cycle (single issue from each core)
• 1 shared FPU
• Caches
– 16K 4-way 32B L1-I
– 8K 4-way 16B L1-D
– Blocking caches (reason for MT)
– 4-banked 12-way 3MB L2 + 4 memory controllers (shared by all)
– Data moves between the L2 and the cores over an integrated crossbar switch to provide high throughput (200GB/s)
Ultra Sparc T1
• Thread-select logic marks a thread inactive based on
– Instruction type
• A predecode bit in the I-cache indicates a long-latency instruction
– Misses
– Traps
– Resource conflicts
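A minimal sketch of what the slide describes (my reading, not Sun’s logic): cycle-by-cycle round-robin selection that simply skips threads marked inactive.

    #define THREADS_PER_PIPE 4

    /* last: thread issued previously; inactive[t] is set on a long-latency
     * instruction, miss, trap, or resource conflict. */
    int select_thread(int last, const int inactive[THREADS_PER_PIPE])
    {
        for (int i = 1; i <= THREADS_PER_PIPE; i++) {
            int t = (last + i) % THREADS_PER_PIPE;   /* round-robin order */
            if (!inactive[t])
                return t;       /* next ready thread issues this cycle */
        }
        return -1;              /* every thread is stalled: pipeline bubble */
    }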
Ultra Sparc T2
• A fatter version of T1
• 1.4GHz (65nm)
• 8 threads per core, 8 cores on-die
• 1 FPU per core (1 FPU per die in T1), 16 INT EUs (8 in T1)
• L2 increased to 8-banked, 16-way, 4MB, shared
• 8-stage integer pipeline (as opposed to 6 for T1)
• 16 instructions per cycle
• One PCI Express port (x8 1.0)
• Two 10 Gigabit Ethernet ports with packet classification and filtering
• Eight encryption engines
• Four dual-channel FBDIMM memory controllers
• 711 signal I/Os, 1831 total
• Subsequent T2 Plus contains 2 sockets: 16 cores / 128 threads
Sun ROCK Processor
• 16 cores, two threads per core
• Hardware scout threading (runahead)
– Invisible to SW
– A long-latency inst starts an automatic HW scout
• L1 D$ miss
• Micro-DTLB miss
• Divide
– Warms up the branch predictor
– Prefetches memory
• Execute Ahead (EXE)
– Retires independent instructions while scouting
• Simultaneous Speculative Threading (SST) [ISCA’09]
– Two hardware threads for one program
– Runahead speculatively executes under a cache miss
– OoO retirement
• HTM support
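A rough sketch of the scouting/runahead control flow (my abstraction of the slide, not Sun’s design; every type and helper below is a hypothetical stand-in for a hardware mechanism):

    typedef int inst_t;
    typedef int checkpoint_t;

    /* Stubs standing in for hardware actions. */
    static inst_t fetch_next(void)               { return 0; }
    static int    starts_long_latency(inst_t in) { (void)in; return 0; } /* D$ miss, uDTLB miss, divide */
    static checkpoint_t take_checkpoint(void)    { return 0; }
    static int    miss_returned(void)            { return 1; }
    static void   scout_execute(inst_t in)       { (void)in; } /* prefetch + train bpred only */
    static void   restore(checkpoint_t cp)       { (void)cp; }
    static void   execute_and_retire(inst_t in)  { (void)in; }

    void core_step(void)
    {
        inst_t in = fetch_next();
        if (starts_long_latency(in)) {
            checkpoint_t cp = take_checkpoint();  /* save architectural state */
            while (!miss_returned())
                scout_execute(fetch_next());      /* runahead: results discarded */
            restore(cp);                          /* replay from the checkpoint */
        } else {
            execute_and_retire(in);               /* normal commit path */
        }
    }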
Many-Core Processors
Intel Teraflops (Polaris)
• 2KB data memory per tile
• 3KB instruction memory per tile
• No coherence support
• 2 FMACs per tile
• Next-gen will have 3D-integrated memory
– SRAM first
– DRAM in the future
E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, A. Agarwal
MIT RAW Design Tenet
• Long wires across the chip will be the constraint
• Expose the architecture to software (parallelizing compilers)
– Explicit parallelization
– Pins
– Communication
• Use a tile-based architecture
– Similar designs sponsored by the DARPA PCA program: UT TRIPS, Stanford Smart Memories
• Simple point-to-point static routing network
– One cycle across each tile
– More scalable (than a bus)
– Harnessed by the compiler with a precise count of wire hops
– Use the dynamic router to support memory accesses that cannot be analyzed statically
Application Mapping on RAW
[Figure: one Raw die running several workloads at once: a video data stream flowing through a custom data path pipeline (built by the compiler) into a frame buffer and screen; four-way parallelized scalar code; a two-way threaded Java program; httpd. Also labeled: fast inter-tile ALU forwarding (3 cycles) and idle tiles in sleep mode (power saving)]
[Taylor IEEE MICRO’02]
Scalar Operand Network Design
[Figure: three scalar operand network designs: non-pipelined; pipelined with a bypass link; pipelined with a bypass link and multiple ALUs. Lots of live values end up in the SON]
[Taylor et al. HPCA’03]
Communication Scalability Issue
[Figure: superscalar bypass/wakeup structures: routing area, large MUXes, complex compare logic]
• RB (# of result buses) * WS (window size) compares are made per cycle
• Long, dense wires elongate cycle time
– Pipeline the wire
• The cost of processing incoming information is high
• A similar problem exists in bus-based snoopy cache protocols
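For scale (illustrative numbers, not the slide’s): with RB = 8 result buses and a WS = 128-entry window, wakeup performs

    \[ RB \times WS = 8 \times 128 = 1024 \]

tag comparisons every cycle, and the comparator count grows multiplicatively with both issue width and window size.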
Scalar Operand Network
[Figure: left, the Multiscalar operand network (a distributed ILP machine) chaining register files; right, a scalar operand network on a 2-D, p2p interconnect (e.g., Raw or TRIPS), with a register file and switch per tile]
[Taylor et al. HPCA’03]
Mapping Operations to a Tile-based Architecture

    i = a[j];
    q = b[i];
    r = q+j;
    s = q >> 3;
    t = r * s;
    b[j] = l;
    b[t] = t;

[Figure: the corresponding dataflow graph (ld a, ld b, +, >>, *, st b, st b) mapped onto a 3x3 grid of tiles, each with its own RegFile]
• Done at
– Compile time (RAW)
– Or runtime
• “Point-to-point” 2D mesh
• Tradeoffs
– Computation vs. communication
– Compute affinity (data flows through fewer hops)
• How to maintain control flow
RAW Core-to-Core Communication
• Static router
– Wires placed-and-routed by software
– P2P scalar transport
– Compilers (or assembly writers) handle predictable communication
• Dynamic router
– Transports dynamic, unpredictable operations
• Interrupts
• Cache misses
– Communication that is unpredictable at compile time
Architectural Comparison
[Figure: RAW vs. superscalar vs. multiprocessor block diagrams]
• Raw replaces the bus of a superscalar with a switched network
• The switched network is tightly integrated into the processor’s pipeline to support single-cycle message injection and receive operations
• Raw software (the compiler) has to implement functions such as instruction scheduling, dependency checking, etc.
• Raw yields complexity to software so that more hardware can be used for ALUs and memory
RAW’s Four On-Chip Mesh Networks
[Figure: the compute pipeline surrounded by 8 32-bit channels]
• Registered at input → longest wire = length of a tile
[Slide Source: Michael B. Taylor]
Raw Architecture
[Slide Source: Volker Strumpen]
Raw Compute Processor Pipeline
[Figure: fast ALU-to-network path (4 cycles); R24–R27 map to the 4 on-chip physical networks; 0-cycle local bypass]
[Taylor IEEE MICRO’02]
RAW Processor Tile
Each tile contains
• Tile processor
– 32-bit MIPS, 8-stage in-order, single issue
– 32KB instruction memory
– 32KB data cache (not coherent, user managed)
• Switch processor
– 8K-instruction memory
– Executes basic move and branch instructions
– Transfers between the local switch and neighbor switches
• Dynamic router
– Hardware controlled (not directly under the programmer’s control)
Raw Programming
• Compute the sum c=a+b across four tiles:
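One plausible split, sketched in C before the real datapath figures that follow (my illustration; snet_send/snet_recv are hypothetical stand-ins for writes and reads of the static-network registers, and a trivial mailbox stands in for the routed network):

    static int mailbox[4];                      /* last word sent by each tile */

    static void snet_send(int from, int v) { mailbox[from] = v; }
    static int  snet_recv(int from)        { return mailbox[from]; } /* blocking in HW */

    static void tile0(int a)  { snet_send(0, a); }   /* a lives on tile 0 */
    static void tile1(int b)  { snet_send(1, b); }   /* b lives on tile 1 */
    static void tile2(void)   { snet_send(2, snet_recv(0) + snet_recv(1)); } /* a+b */
    static void tile3(int *c) { *c = snet_recv(2); } /* c stored on tile 3 */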
Data Path: Zoom 1
• Stateful hardware: local data memory (a, c), register (b), and both static networks (snet1 and snet2)
Zoom 2: Processor Datapaths
Zoom 2: Switch Datapaths (+ tile processor)
Raw Assembly
RAW On-Chip Network
• 2D mesh
– Longest wire is no greater than one side of a tile
– Worst case: 6 hops (or cycles) for 16 tiles
• 2 static routers, “point-to-point,” each has
– A 64KB SW-managed instruction cache
– A pair of routing crossbars
– Example:

    Tile 0 (sender):
        or   $csto, $0, $5
        nop  route $csto->$cEo2    #SWITCH0
    Tile 1 (receiver):
        nop  route $cWi2->$csti2   #SWITCH1
        and  $5, $5, $csti2

• 2 dynamic routers
– Dimension-ordered routing by hardware
– Example:

    Tile 0 (sender):
        lui  $3, $0, 15
        ihdr $cgno, $3, 0x0200     #header, msg len=2
        or   $cgno, $0, $9         #sent word1
        ld   $cgno, $0, $csti      #sent word2
    Tile 15 (receiver):
        or   $2, $cgni, $0         #word1
        or   $3, $cgni, $0         #word2
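Dimension-ordered routing itself is simple enough to sketch (illustrative C, not Raw’s router logic): route fully in X, then in Y, using only local coordinates from the packet header.

    typedef enum { EAST, WEST, NORTH, SOUTH, LOCAL } port_t;

    /* A purely local, deadlock-free decision computed at every hop. */
    port_t route_xy(int cur_x, int cur_y, int dst_x, int dst_y)
    {
        if (dst_x > cur_x) return EAST;    /* finish the X dimension first */
        if (dst_x < cur_x) return WEST;
        if (dst_y > cur_y) return NORTH;   /* then travel in Y */
        if (dst_y < cur_y) return SOUTH;
        return LOCAL;                      /* arrived: eject into the tile */
    }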
Control Orchestration Optimization
• Orchestrated by the Raw compiler
• Control localization
– Hide a control-flow sequence within a “macroins” (macro-instruction) assigned to a tile
[Figure: a control-flow region collapsed into one macroins: one instruction from the schedule’s point of view]
[Lee et al. ASPLOS’98]
Example of RAW Compiler Transformation
Phases: Instruction Partitioner → Global Data Partitioner → Data & Inst Placer → Communication Code Gen → Event Scheduler

Initial code:

    y = a+b;
    z = a*a;
    a = y*a*5;
    y = y*b*6;

Initial Code Transformation (renaming; shown as a dependence graph on the slide):

    read(a)
    read(b)
    y_1   = a+b
    z_1   = a*a
    tmp_1 = y_1*a
    a_1   = tmp_1*5
    tmp_2 = y_1*b
    y_2   = tmp_2*6
    write(z)
    write(a)
    write(y)
[Lee et al. ASPLOS’98]
Example of RAW Compiler Transformation
The Instruction Partitioner splits the renamed code into two parallel instruction streams; the Global Data Partitioner groups the data into partitions {a,z} and {b,y}; the Data & Inst Placer then binds one stream plus {a,z} to tile P0 and the other plus {b,y} to tile P1:

    P0 {a,z}: read(a); z_1 = a*a; write(z); tmp_1 = y_1*a; a_1 = tmp_1*5; write(a)
    P1 {b,y}: read(b); y_1 = a+b; tmp_2 = y_1*b; y_2 = tmp_2*6; write(y)
[Lee et al. ASPLOS’98]
Example of RAW Compiler Transformation
Communication Code Gen inserts send()/rcv() pairs on the compute processors (P0, P1) and route() instructions on their switch processors (S0, S1):

    P0: read(a); send(a); z_1 = a*a; write(z); y_1 = rcv(); tmp_1 = y_1*a; a_1 = tmp_1*5; write(a)
    S0: route(P0,S1); route(S1,P0)
    S1: route(S0,P1); route(P1,S0)
    P1: read(b); a = rcv(); y_1 = a+b; send(y_1); tmp_2 = y_1*b; y_2 = tmp_2*6; write(y)
[Lee et al. ASPLOS’98]
Example of RAW Compiler Transformation
[Figure: the same P0/S0/S1/P1 instruction streams from the previous slide, now laid out by the Event Scheduler on a time axis so that each send, route, and rcv lines up with its communication partners]
[Lee et al. ASPLOS’98]
Raw Compiler Example
Assign instructions to the tiles, maximizing locality. Generate the static router instructions to transfer operands and streams between tiles.

    tmp3 = (seed*6+2)/3
    v2 = (tmp1 - tmp3)*5
    v1 = (tmp1 + tmp2)*3
    v0 = tmp0 - v1
    ....

[Figure: the renamed instruction stream (seed.0=seed, pval1=seed.0*3.0, pval5=seed.0*6.0, ..., v3=v3.10, v0=v0.9) partitioned across tiles, with copies of shared values placed on each tile that needs them]
[Slide Source: Michael B. Taylor]
Scalability
[Figure: a 16-tile die at 180 nm next to a 64-tile die at 90 nm; one cycle still crosses one tile]
Just stamp out more tiles!
The longest wire, frequency, and design and verification complexity are all independent of issue width.
The architecture is backwards compatible.
[Slide Source: Michael B. Taylor]