Lecture #3 - The University of Texas at Dallas


EE (CE) 6304 Computer Architecture
Lecture #3
(9/1/15)
Yiorgos Makris
Professor
Department of Electrical Engineering
University of Texas at Dallas
Course Web-site:
http://www.utdallas.edu/~gxm112130/EE6304FA15
Have we reached the end of ILP?
• Multiple processors easily fit on a chip
• Every major microprocessor vendor
has gone to multithreaded cores
– Thread: locus of control, execution context
– Fetch instructions from multiple threads at once,
throw them all into the execution unit
– Intel: hyperthreading
– Concept has existed in high performance computing
for 20 years (or is it 40? CDC6600)
• Vector processing
– Each instruction processes many distinct data
– Ex: MMX
• Raise the level of architecture – many
processors per chip
(Figure: Tensilica configurable processor)
Limiting Forces: Clock Speed and ILP
• Chip density continues to increase ~2x every 2 years
– Clock speed is not
– # processors/chip (cores)
may double instead
• There is little or no
more Instruction
Level Parallelism (ILP)
to be found
– Can no longer allow
programmer to think in
terms of a serial
programming model
• Conclusion:
Parallelism must be
exposed to software!
Source: Intel, Microsoft (Sutter) and
Stanford (Olukotun, Hammond)
Examples of MIMD Machines
• Symmetric Multiprocessor
– Multiple processors in box with shared
memory communication
– Current MultiCore chips like this
– Every processor runs copy of OS
• Non-uniform shared-memory with
separate I/O through host
– Multiple processors
» Each with local memory
» general scalable network
– Extremely light “OS” on node provides
simple services
» Scheduling/synchronization
– Network-accessible host for I/O
• Cluster
– Many independent machines connected by a general network
– Communication through messages
(Figures: an SMP with processors (P) sharing a bus and memory; a grid of processor/memory (P/M) nodes with a network-attached host for I/O.)
Categories of Thread Execution
(Figure: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading; slots are filled by Threads 1-5 or left idle.)
Processor-DRAM Memory Gap (latency)
(Figure: performance vs. year, 1980-2000, log scale 1 to 1000. µProc performance improves ~60%/yr (2X/1.5yr); DRAM improves ~9%/yr (2X/10 yrs); the Processor-Memory performance gap grows ~50%/year.)
The Memory Abstraction
• Association of <name, value> pairs
– typically named as byte addresses
– often values aligned on multiples of size
• Sequence of Reads and Writes
• Write binds a value to an address
• Read of addr returns most recently written
value bound to that address
(Memory interface signals: command (R/W), address (name), data (W), data (R), done)
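A minimal sketch of this abstraction in C, using a plain array as the <address, value> store (the size and helper names are illustrative, not from the lecture):

#include <stdint.h>
#include <stdio.h>

#define MEM_SIZE 1024                      /* illustrative capacity */

static uint8_t memory[MEM_SIZE];           /* the <address, value> association */

/* Write binds a value to an address. */
void mem_write(uint32_t addr, uint8_t value) { memory[addr] = value; }

/* Read returns the most recently written value bound to that address. */
uint8_t mem_read(uint32_t addr) { return memory[addr]; }

int main(void) {
    mem_write(0x10, 42);
    mem_write(0x10, 7);                    /* a later write rebinds the address */
    printf("%d\n", mem_read(0x10));        /* prints 7, the latest binding */
    return 0;
}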
Memory Hierarchy
• Take advantage of the principle of locality to:
– Present as much memory as in the cheapest technology
– Provide access at speed offered by the fastest technology
(Figure: the memory hierarchy)
Level                                     Speed (ns)                 Size (bytes)
Registers (processor datapath/control)    1s                         100s
On-chip and second-level cache (SRAM)     10s-100s                   Ks-Ms
Main memory (DRAM/FLASH/PCM)              100s                       Ms
Secondary storage (Disk/FLASH/PCM)        10,000,000s (10s ms)       Gs
Tertiary storage (Tape/Cloud storage)     10,000,000,000s (10s sec)  Ts
The Principle of Locality
• The Principle of Locality:
– Programs access a relatively small portion of the address space at
any instant of time.
• Two Different Types of Locality:
– Temporal Locality (Locality in Time): If an item is referenced, it will
tend to be referenced again soon (e.g., loops, reuse)
– Spatial Locality (Locality in Space): If an item is referenced, items
whose addresses are close by tend to be referenced soon
(e.g., straightline code, array access)
• Last 30 years, HW relied on locality for speed
(Figure: processor (P), cache ($), memory (MEM))
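A small illustration (array size arbitrary): the loop below reuses sum and i on every iteration (temporal locality) and walks a[] through adjacent addresses (spatial locality), which is exactly what caches exploit:

#include <stdio.h>

#define N 1024                             /* arbitrary size for illustration */

int main(void) {
    static int a[N];
    for (int i = 0; i < N; i++)            /* straightline code, sequential addresses */
        a[i] = i;

    long sum = 0;
    for (int i = 0; i < N; i++)            /* spatial locality: a[i], a[i+1], ... are adjacent */
        sum += a[i];                       /* temporal locality: sum and i reused each iteration */

    printf("%ld\n", sum);                  /* 523776 */
    return 0;
}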
Example of modern core: Nehalem
• On-chip cache resources:
– For each core: L1: 32K instruction and 32K data cache, L2: 1MB
– L3: 8MB shared among all 4 cores
• Integrated, on-chip memory controller (DDR3)
Memory Abstraction and Parallelism
• Maintaining the illusion of sequential access to
memory across a distributed system
• What happens when multiple processors access
the same memory at once?
– Do they see a consistent picture?
(Figures: processors P1..Pn, each with a cache ($), connected through an interconnection network to memory, either as a pool of shared memory modules or with memory attached directly to each processor.)
• Processing and processors embedded in the memory?
Is it all about communication?
(Figure: Pentium IV chipset, showing the processor, caches, busses, adapters, controllers, memory, and I/O devices: disks, displays, keyboards, networks.)
Breaking the HW/Software Boundary
• Moore’s law (more and more transistors) is all about
volume and regularity
• What if you could pour nano-acres of unspecific
digital logic “stuff” onto silicon
– Do anything with it. Very regular, large volume
• Field Programmable Gate Arrays
– Chip is covered with logic blocks w/ FFs, RAM blocks, and
interconnect
– All three are “programmable” by setting configuration bits
– These are huge?
• Can each program have its own instruction set?
• Do we compile the program entirely into
hardware?
“Bell’s Law” – new class per decade
(Figure: log(people per computer) vs. year, with successive computer classes: number crunching, data storage, productivity, interactive, and streaming information to/from the physical world.)
• Enabled by technological opportunities
• Smaller, more numerous and more intimately connected
• Brings in a new kind of application
• Used in many ways not previously imagined
It’s not just about bigger and faster!
• Complete computing systems can be tiny and cheap
• System on a chip
• Resource efficiency
– Real-estate, power, pins, …
Understanding & Quantifying Cost,
Performance, Power, Dependability & Reliability
Integrated Circuit Cost
• Integrated circuit cost (see the formulas sketched below)
• Bose-Einstein yield formula:
• Defects per unit area = 0.016-0.057 defects
per square cm (2010)
• N = process-complexity factor = 11.5-15.5 (40
nm, 2010)
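The formulas these bullets refer to are reproduced below from Hennessy & Patterson, which is where the defect-density and process-complexity values above come from:

\[
\text{Cost of integrated circuit} = \frac{\text{Cost of die} + \text{Cost of testing die} + \text{Cost of packaging and final test}}{\text{Final test yield}}
\]
\[
\text{Cost of die} = \frac{\text{Cost of wafer}}{\text{Dies per wafer} \times \text{Die yield}}
\]
\[
\text{Die yield} = \text{Wafer yield} \times \frac{1}{\left(1 + \text{Defects per unit area} \times \text{Die area}\right)^{N}}
\]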
Which is faster?
Plane              DC to Paris   Speed      Passengers   Throughput (pmph)
Boeing 747         6.5 hours     610 mph    470          286,700
BAC/Sud Concorde   3 hours       1350 mph   132          178,200
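Throughput in pmph is passenger-miles per hour, i.e. speed times passengers:

\[
610 \text{ mph} \times 470 = 286{,}700 \qquad 1350 \text{ mph} \times 132 = 178{,}200
\]

So the Concorde wins on latency (time for one trip) while the 747 wins on throughput.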
• Time to run the task (ExTime)
– Execution time, response time, latency
• Tasks per day, hour, week, sec, ns …
(Performance)
– Throughput, bandwidth
Definitions
• Performance is in units of things per sec
– bigger is better
• If we are primarily concerned with response time
– performance(x) = 1 / execution_time(x)
• “X is n times faster than Y” means
n = Performance(X) / Performance(Y) = Execution_time(Y) / Execution_time(X)
Processor performance equation

\[
\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}
\]

                Inst Count   CPI    Clock Rate (Cycle Time)
Program             X
Compiler            X        (X)
Inst. Set.          X         X
Organization                  X          X
Technology                               X
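As a quick illustration with made-up numbers (not from the lecture): a program of 10^9 instructions at CPI = 1.5 on a 2 GHz clock takes

\[
\text{CPU time} = 10^{9} \times 1.5 \times \frac{1}{2 \times 10^{9}\,\text{Hz}} = 0.75 \text{ s}
\]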
Cycles Per Instruction (Throughput)

“Average Cycles per Instruction”

\[
\mathrm{CPI} = \frac{\text{CPU Time} \times \text{Clock Rate}}{\text{Instruction Count}} = \frac{\text{Cycles}}{\text{Instruction Count}}
\]
\[
\text{CPU time} = \text{Cycle Time} \times \sum_{j=1}^{n} \mathrm{CPI}_j \times I_j
\]
\[
\mathrm{CPI} = \sum_{j=1}^{n} \mathrm{CPI}_j \times F_j \;\;\text{where}\;\; F_j = \frac{I_j}{\text{Instruction Count}} \;\;\text{(“Instruction Frequency”)}
\]
Example: Calculating CPI bottom up
Run benchmark and collect workload characterization (simulate, machine
counters, or sampling)
Base Machine (Reg / Reg)
Op       Freq   Cycles   CPI(i)   (% Time)
ALU      50%    1        .5       (33%)
Load     20%    2        .4       (27%)
Store    10%    2        .2       (13%)
Branch   20%    2        .4       (27%)
Total CPI: 1.5

Typical mix of instruction types in program
Design guideline: Make the common case fast
MIPS 1% rule: only consider adding an instruction if it is shown to add 1%
performance improvement on reasonable benchmarks.
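A short C sketch that recomputes the table above (the struct and names are just for illustration):

#include <stdio.h>

/* Instruction-class mix from the table above. */
struct op { const char *name; double freq; double cycles; };

int main(void) {
    struct op mix[] = {
        {"ALU", 0.50, 1}, {"Load", 0.20, 2}, {"Store", 0.10, 2}, {"Branch", 0.20, 2},
    };
    double cpi = 0.0;
    for (int i = 0; i < 4; i++)
        cpi += mix[i].freq * mix[i].cycles;              /* CPI = sum over classes of Fj * CPIj */
    printf("CPI = %.2f\n", cpi);                         /* 1.50 */
    for (int i = 0; i < 4; i++)                          /* fraction of execution time per class */
        printf("%-7s %2.0f%%\n", mix[i].name,
               100.0 * mix[i].freq * mix[i].cycles / cpi);
    return 0;
}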
Example: Branch Stall Impact
• Assume CPI = 1.0 ignoring branches (ideal)
• Assume branches stall for 3 cycles
• If 30% of instructions are branches, stall 3 cycles on 30% of instructions

Op       Freq   Cycles   CPI(i)   (% Time)
Other    70%    1        .7       (37%)
Branch   30%    4        1.2      (63%)

• => new CPI = 0.7 + 1.2 = 1.9
• New machine is 1/1.9 = 0.52 times as fast (i.e., slower!)
Speed Up Equation for Pipelining
\[
\mathrm{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Average stall cycles per instruction}
\]
For simple RISC pipeline, CPI = 1:
\[
\text{Speedup} = \frac{1}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}
\]
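Plugging in the branch numbers from the previous slide (30% branches, 3 stall cycles each), Pipeline stall CPI = 0.3 x 3 = 0.9, so

\[
\text{Speedup} = \frac{1}{1 + 0.9} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}} \approx 0.53 \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}
\]

which matches the 1/1.9 figure above.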
Making common case fast
• Many times an architect spends tremendous effort
and time optimizing some aspect of a system
– only to realize later that the overall speedup is unrewarding
• So it is better to measure how much that aspect of the
system is actually used before attempting to optimize it
• In making a design trade-off
– Favor the frequent case over the infrequent case
• In allocating additional resources
– Allocate to improve frequent event, rather than a rare event
So, what principle quantifies this scenario?
Amdahl’s Law
\[
\text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times \left[ \left(1 - \text{Fraction}_{\text{enhanced}}\right) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right]
\]
\[
\text{Speedup}_{\text{overall}} = \frac{\text{ExTime}_{\text{old}}}{\text{ExTime}_{\text{new}}} = \frac{1}{\left(1 - \text{Fraction}_{\text{enhanced}}\right) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}
\]
Best you could ever hope to do:
\[
\text{Speedup}_{\text{maximum}} = \frac{1}{1 - \text{Fraction}_{\text{enhanced}}}
\]
Amdahl’s Law example
• New CPU 10X faster
• I/O bound server, so 60% time waiting for I/O
\[
\text{Speedup}_{\text{overall}} = \frac{1}{\left(1 - \text{Fraction}_{\text{enhanced}}\right) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}} = \frac{1}{\left(1 - 0.4\right) + \dfrac{0.4}{10}} = \frac{1}{0.64} \approx 1.56
\]
• Apparently, it's human nature to be attracted by the
10X faster CPU, instead of keeping in perspective that the
system is just 1.6X faster
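A tiny C helper that evaluates this formula (function name and sample call are illustrative):

#include <stdio.h>

/* Amdahl's Law: overall speedup when a fraction f of execution time
   is sped up by a factor s. */
double amdahl_speedup(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void) {
    /* 40% of the time benefits from a 10X faster CPU, as in the example above. */
    printf("%.2f\n", amdahl_speedup(0.4, 10.0));   /* prints 1.56 */
    return 0;
}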
Define and quantify power ( 1 / 2)
• For CMOS chips, traditional dominant energy consumption
has been in switching transistors, called dynamic power
\[
\text{Power}_{\text{dynamic}} = \tfrac{1}{2} \times \text{Capacitive Load} \times \text{Voltage}^{2} \times \text{Frequency Switched}
\]
• For mobile devices, energy is a better metric:
\[
\text{Energy}_{\text{dynamic}} = \text{Capacitive Load} \times \text{Voltage}^{2}
\]
• For a fixed task, slowing clock rate (frequency switched)
reduces power, but not energy
• Capacitive load a function of number of transistors
connected to output and technology, which determines
capacitance of wires and transistors
• Dropping voltage helps both, so went from 5V to 1V
• To save energy & dynamic power, most CPUs now turn off
clock of inactive modules (e.g. Fl. Pt. Unit)
Example of quantifying power
• Suppose 15% reduction in voltage results in a
15% reduction in frequency. What is impact on
dynamic power?
\[
\text{Power}_{\text{dynamic}} = \tfrac{1}{2} \times \text{Capacitive Load} \times \text{Voltage}^{2} \times \text{Frequency Switched}
\]
\[
\rightarrow \tfrac{1}{2} \times \text{Capacitive Load} \times (0.85 \times \text{Voltage})^{2} \times (0.85 \times \text{Frequency Switched})
\]
\[
= (0.85)^{3} \times \text{OldPower}_{\text{dynamic}} \approx 0.6 \times \text{OldPower}_{\text{dynamic}}
\]
Define and quantify power (2 / 2)
• Because leakage current flows even when a
transistor is off, static power is now important too
\[
\text{Power}_{\text{static}} = \text{Current}_{\text{static}} \times \text{Voltage}
\]
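For example (made-up numbers, not from the lecture), a chip leaking 0.5 A from a 1 V supply dissipates

\[
\text{Power}_{\text{static}} = 0.5\,\text{A} \times 1\,\text{V} = 0.5\,\text{W}
\]

even when no switching occurs.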
• Leakage current increases in processors with
smaller transistor sizes
• Increasing the number of transistors increases
power even if they are turned off
• In 2006, goal for leakage was 25% of total power
consumption; high performance designs at 40%
• Very low power systems even gate voltage to
inactive modules to control loss due to leakage
Power and Energy
• Energy to complete operation (Joules)
– Corresponds approximately to battery life
– (Battery energy capacity actually depends
on rate of discharge)
• Peak power dissipation (Watts = Joules/second)
– Affects packaging (power and ground pins,
thermal design)
• di/dt, peak change in supply current (Amps/second)
– Affects power supply noise (power and
ground pins, decoupling capacitors)
Peak Power versus Lower Energy
(Figure: power vs. time curves for two systems; integrating the power curve gives energy. System A has the higher peak, Peak A > Peak B.)
• System A has higher peak power, but lower total energy
• System B has lower peak power, but higher total energy
Define and quantify dependability (1/3)
• How do we decide when a system is operating properly?
• Infrastructure providers now offer Service Level
Agreements (SLAs) to guarantee that their
networking or power service will be dependable
• Systems alternate between 2 states of service
with respect to an SLA:
1. Service accomplishment, where the service is
delivered as specified in SLA
2. Service interruption, where the delivered service
is different from the SLA
• Failure = transition from state 1 to state 2
• Restoration = transition from state 2 to state 1
Define and quantify dependability (2/3)
• Module reliability = measure of continuous service
accomplishment (or time to failure). 2 metrics:
1. Mean Time To Failure (MTTF) measures Reliability
2. Failures In Time (FIT) = 1/MTTF, the rate of failures
– Traditionally reported as failures per billion hours of operation
• Mean Time To Repair (MTTR) measures Service Interruption
• Mean Time Between Failures (MTBF) = MTTF + MTTR
• Module availability measures service as it alternates between
the 2 states of accomplishment and interruption (a number
between 0 and 1, e.g. 0.9)
• Module availability = MTTF / ( MTTF + MTTR)
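A quick illustration with made-up numbers: a module with MTTF = 1,000,000 hours and MTTR = 24 hours has

\[
\text{Module availability} = \frac{1{,}000{,}000}{1{,}000{,}000 + 24} \approx 0.99998
\]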
Example calculating reliability
• If modules have exponentially distributed lifetimes
(the age of a module does not affect its probability of
failure), the overall failure rate is the sum of the failure
rates of the modules
• Calculate FIT and MTTF for 10 disks (1M hour
MTTF per disk), 1 disk controller (0.5M hour
MTTF), and 1 power supply (0.2M hour MTTF):
\[
\text{FailureRate} = 10 \times \frac{1}{1{,}000{,}000} + \frac{1}{500{,}000} + \frac{1}{200{,}000} = \frac{10 + 2 + 5}{1{,}000{,}000} = \frac{17}{1{,}000{,}000} = 17{,}000 \text{ FIT}
\]
\[
\text{MTTF} = \frac{1{,}000{,}000{,}000}{17{,}000} \approx 59{,}000 \text{ hours}
\]