Hyper-threading, Multi-core
architectures,
Amdahl’s Law and Review
CENG331: Introduction to Computer Systems
14th Lecture
Instructor:
Erol Sahin
Acknowledgement: Some of the slides are adapted from the ones prepared
by R.E. Bryant, D.R. O’Hallaron of Carnegie-Mellon Univ.
Computation power is never enough
Our machines are millions of times faster than ENIAC.
Yet we ask for more:
To make sense of the universe
To build bombs
To understand the climate
To unravel the genome
Up to now, the solution was easy:
Make the clock run faster.
–2–
Hitting the limits…
No more. We've hit fundamental limits on clock
speed:
Speed of light: 30 cm/nsec in vacuum, about 20 cm/nsec in
copper
In a 10 GHz machine, signals cannot travel more than 2 cm in
total per cycle
At 100 GHz, at most 2 mm, just to get the signal from one
end to the other and back
We can make computers that small, but then we face
other problems:
More difficult to design and verify
.. and fundamental problems:
Heat dissipation!
Coolers are already bigger than the CPUs.
Going from 1 MHz to 1 GHz was simply a matter of engineering the
chip fabrication process; going from 1 GHz to 1 THz will not be.
–3–
Parallel computers
New approach:
Build parallel computers
Each computer having a CPU running at normal speeds
Systems with 1000 CPUs are already available.
The big challenge:
How to utilize parallel computers to achieve speedups?
Programming has been mostly a "sequential business"
» Parallel programming remained exotic
How to share information among different computers?
How to coordinate processing?
How to create/adapt OSs?
–4–
Superscalar architectures, Simultaneous
Multi-threading
CENG331: Introduction to Computer Systems
14th Lecture
Instructor:
Erol Sahin
Acknowledgement: Some of the slides are adapted from the ones prepared
by Neil Chakrabarty and William May
Threading Algorithms
Time-slicing
A processor switches between threads at fixed time intervals.
High overhead, especially if one of the threads is in a wait state.
Switch-on-event
Task switching in case of long pauses:
While waiting for data from a relatively slow source, CPU resources
are given to other threads
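The time-slicing idea above can be sketched as a toy round-robin simulator in Python (the task names and work units are invented for illustration):

```python
from collections import deque

def time_slice(tasks, quantum):
    """Round-robin time-slicing: each task runs for at most `quantum`
    work units, then the processor switches to the next ready task.
    `tasks` maps task name -> total work units; returns finish order."""
    ready = deque(tasks.items())
    finished = []
    while ready:
        name, remaining = ready.popleft()      # pick the next ready task
        remaining -= min(quantum, remaining)   # run it for one quantum
        if remaining == 0:
            finished.append(name)              # task is done
        else:
            ready.append((name, remaining))    # context switch: requeue it
    return finished

# Three threads with different amounts of work, 2-unit quantum:
print(time_slice({"A": 2, "B": 5, "C": 3}, quantum=2))  # -> ['A', 'C', 'B']
```

Note the cost hinted at above: a task that is merely waiting (say, blocked on slow I/O) still burns its whole turn in this scheme, which is exactly why switch-on-event hands the CPU over as soon as a long pause begins.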
–6–
Threading Algorithms (cont.)
Multiprocessing
Distribute the load over many processors
Adds extra cost
Simultaneous multi-threading
Multiple threads execute on a single processor without switching.
Basis of Intel’s Hyper-Threading technology.
–7–
Hyper-Threading Concept
At any point in time, only part of the processor's resources is
used to execute program code.
Unused resources can also be put to work, for example, by
executing another thread/application in parallel.
Extremely useful in desktop and server applications where
many threads are used.
–8–
–9–
Hyper-Threading Architecture
First used in the Intel Xeon MP processor
Makes a single physical processor appear as multiple
logical processors.
Each logical processor has its own copy of the
architectural state.
Logical processors share a single set of physical
execution resources.
– 10 –
Hyper-Threading Architecture
Operating systems and user programs can schedule processes
or threads to logical processors as if they were physical
processors in a multiprocessing system.
From an architecture perspective, we have to worry about the
logical processors sharing resources:
Caches, execution units, branch predictors, control logic, and buses.
– 11 –
Advantages
The extra architecture adds only about 5% to the
total die area.
No performance loss if only one thread is active.
Increased performance with multiple threads.
Better resource utilization.
– 12 –
Disadvantages
To take advantage of hyper-threading, execution cannot be
purely serial.
Threads are non-deterministic and involve extra design effort
Threads have increased overhead
– 13 –
Multi-Core processors
CENG331: Introduction to Computer Systems
14th Lecture
Instructor:
Erol Sahin
Acknowledgement: Some of the slides are adapted from the ones prepared
by Neil Chakrabarty and William May
Multicore chips
Recall Moore's law:
The number of transistors that can be
put on a chip doubles every 18
months!
Still holds: ~300 million transistors on an
Intel Core 2 Duo class chip
Question: what do you do with all those
transistors?
Increase the cache? But we already have
4 MB caches, and the performance gain is little.
Another option: put two or more cores
on the same chip (technically, the same die)
Dual-core and quad-core chips are
already on the market
80-core chips have been manufactured; more
will follow.
– 15 –
Multicore Chips
Shared L2:
Good for sharing resources
Greedy cores may hurt the
performance of others
Figure: (a) A quad-core chip with a shared L2 cache (Intel style).
(b) A quad-core chip with separate L2 caches (AMD style).
– 16 –
Tanenbaum, Modern Operating Systems 3/e, © 2008 Prentice-Hall, Inc.
– 17 –
Comparison: SMT versus Multi-core
Multi-core
• Several cores, each designed to be smaller and not too
powerful
• Great thread-level parallelism
SMT
• One large, powerful superscalar core
• Great performance when running a single thread
• Exploits instruction-level parallelism
– 18 –
Cloud Computing and
Server farms
– 19 –
Amdahl’s Law and Review
CENG331: Introduction to Computer Systems
14th Lecture
Instructor:
Erol Sahin
Acknowledgement: Some of the slides are adapted from the ones prepared
by R.E. Bryant, D.R. O’Hallaron of Carnegie-Mellon Univ.
Performance Evaluation
This week
Performance evaluation
Amdahl’s law
Review
– 21 –
Problem
You plan to visit a friend in Cannes, France, and
must decide whether it is worth it to take the
Concorde SST or a 747 for the NY (New York) to
Paris leg, assuming it will take 4 hours to get from
LA (Los Angeles) to NY and 4 hours from Paris to Cannes.
– 22 –
Amdahl’s Law
You plan to visit a friend in Cannes, France, and must decide
whether it is worth it to take the Concorde SST or a 747
for the NY (New York) to Paris leg, assuming it will take 4 hours
to get from LA (Los Angeles) to NY and 4 hours from Paris to Cannes.

        time NY->Paris    total trip time    speedup over 747
747     8.5 hours         16.5 hours         1
SST     3.75 hours        11.75 hours        1.4

Taking the SST (which is 2.2 times faster) speeds up the overall
trip by only a factor of 1.4!
– 23 –
Speedup
Old program (unenhanced):   T = T1 + T2
  T1 = time that can NOT be enhanced.
  T2 = time that can be enhanced.

New program (enhanced):   T' = T1' + T2'
  T1' = T1 (the part that cannot be enhanced is unchanged).
  T2' = time after the enhancement.

Speedup:   S_overall = T / T'
– 24 –
Computing Speedup
Two key parameters:
  F_enhanced = T2 / T      (fraction of original time that can be improved)
  S_enhanced = T2 / T2'    (speedup of the enhanced part)

T' = T1 + T2'
   = T(1 - F_enhanced) + T2'                          [by def of F_enhanced]
   = T(1 - F_enhanced) + (T2 / S_enhanced)            [by def of S_enhanced]
   = T(1 - F_enhanced) + T(F_enhanced / S_enhanced)
   = T((1 - F_enhanced) + F_enhanced / S_enhanced)

Amdahl's Law:
  S_overall = T / T' = 1 / ((1 - F_enhanced) + F_enhanced / S_enhanced)
Key idea:
Amdahl’s Law quantifies the general notion of diminishing returns.
It applies to any activity, not just computer programs.
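The formula above can be sketched in a few lines of Python (the function name is ours) and checked against the trip example:

```python
def amdahl_speedup(f_enhanced, s_enhanced):
    """Overall speedup when a fraction f_enhanced of the original
    running time is sped up by a factor s_enhanced (Amdahl's Law)."""
    return 1.0 / ((1.0 - f_enhanced) + f_enhanced / s_enhanced)

# Trip example: the NY->Paris leg is 8.5 of 16.5 hours, and the SST
# flies that leg 8.5/3.75 (about 2.27) times faster than the 747.
print(round(amdahl_speedup(8.5 / 16.5, 8.5 / 3.75), 1))  # -> 1.4

# Even an infinitely fast NY->Paris leg cannot beat the 8 fixed hours:
print(round(amdahl_speedup(8.5 / 16.5, 1e9), 1))         # -> 2.1
```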
– 25 –
Amdahl’s Law Example
Trip example:
Suppose that for the New York to Paris leg, we now consider the
possibility of taking a rocket ship (15 minutes) or a handy rip in the
fabric of space-time (0 minutes):

        time NY->Paris    total trip time    speedup over 747
747     8.5 hours         16.5 hours         1
SST     3.75 hours        11.75 hours        1.4
rocket  0.25 hours        8.25 hours         2.0
rip     0.0 hours         8 hours            2.1
– 26 –
Lesson from Amdahl’s Law
Useful corollary of Amdahl's law:

  1 ≤ S_overall ≤ 1 / (1 - F_enhanced)

F_enhanced    Max S_overall
0.0           1
0.5           2
0.75          4
0.875         8
0.9375        16
0.96875       32
0.984375      64
0.9921875     128
Moral: It is hard to speed up a program.
Moral++ : It is easy to make premature optimizations.
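The Max S_overall column is just the corollary's upper bound 1/(1 - F_enhanced), i.e. the limit of Amdahl's Law as S_enhanced goes to infinity; a minimal sketch (the function name is ours):

```python
def max_speedup(f_enhanced):
    """Upper bound on overall speedup, reached as the speedup of the
    enhanced part (S_enhanced) goes to infinity."""
    return 1.0 / (1.0 - f_enhanced)

for f in [0.5, 0.875, 0.9921875]:
    print(f, "->", max_speedup(f))   # 2.0, 8.0, 128.0
```

Even enhancing 99.2% of the running time caps the overall speedup at 128x: the remaining 0.8% dominates.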
– 27 –
Other Maxims
Second Corollary of Amdahl’s law:
When you identify and eliminate one bottleneck in a system,
something else will become the bottleneck
Beware of Optimizing on Small Benchmarks
Easy to cut corners that lead to asymptotic inefficiencies
– 28 –
Review: Time to look back!
– 29 –
Take-home message: Nothing is real!
– 30 –
Great Reality #1:
Int’s are not Integers, Float’s are not Reals
Example 1: Is x² ≥ 0?
Float’s: Yes!
Int’s:
40000 * 40000 --> 1600000000
50000 * 50000 --> ??
Example 2: Is (x + y) + z = x + (y + z)?
Unsigned & Signed Int’s: Yes!
Float’s:
(1e20 + -1e20) + 3.14 --> 3.14
1e20 + (-1e20 + 3.14) --> ??
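Both teasers can be checked concretely. Python integers never overflow, so the sketch below wraps products into C's 32-bit two's-complement range by hand (the `int32` helper is ours, not a library function); the float example uses Python's native IEEE 754 doubles:

```python
def int32(x):
    """Interpret x as a C-style 32-bit signed int (two's complement)."""
    x &= 0xFFFFFFFF                      # keep only the low 32 bits
    return x - 0x100000000 if x >= 0x80000000 else x

print(int32(40000 * 40000))              # -> 1600000000 (fits in 31 bits)
print(int32(50000 * 50000))              # -> -1794967296 (overflow!)

# Floating-point addition is not associative:
print((1e20 + -1e20) + 3.14)             # -> 3.14
print(1e20 + (-1e20 + 3.14))             # -> 0.0 (3.14 vanishes next to 1e20)
```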
– 31 –
Great Reality #2:
You’ve Got to Know Assembly
Chances are, you'll never write programs in assembly
Compilers are much better & more patient than you are
But: understanding assembly is key to the machine-level
execution model
Behavior of programs in presence of bugs
High-level language model breaks down
Tuning program performance
Understand optimizations done/not done by the compiler
Understanding sources of program inefficiency
Implementing system software
Compiler has machine code as target
Operating systems must manage process state
Creating / fighting malware
x86 assembly is the language of choice!
– 32 –
Great Reality #3:
Memory Matters
Random Access Memory Is an Unphysical Abstraction
Memory is not unbounded
It must be allocated and managed
Many applications are memory dominated
Memory referencing bugs especially pernicious
Effects are distant in both time and space
Memory performance is not uniform
Cache and virtual memory effects can greatly affect program
performance
Adapting program to characteristics of memory system can lead to
major speed improvements
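A toy illustration of non-uniform memory performance: traversing the same matrix row-by-row touches consecutive addresses (cache-friendly in a row-major language like C), while column-by-column traversal jumps a whole row each step. Python lists hide real cache behavior, so this only shows the two access orders, not their timing:

```python
N = 4
# Fill the matrix so each value equals its flat (row-major) address.
matrix = [[i * N + j for j in range(N)] for i in range(N)]

row_major = [matrix[i][j] for i in range(N) for j in range(N)]  # consecutive
col_major = [matrix[j][i] for i in range(N) for j in range(N)]  # stride-N jumps

print(row_major[:6])  # -> [0, 1, 2, 3, 4, 5]
print(col_major[:6])  # -> [0, 4, 8, 12, 1, 5]
```

In C, the same two loops over a large array can differ severalfold in running time, purely from cache behavior, even though they do identical work.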
– 33 –
Great Reality #4:
There’s more to performance than
asymptotic complexity
Constant factors matter too!
And even exact op count does not predict performance
Easily see a 10:1 performance range depending on how the code is written
Must optimize at multiple levels: algorithm, data representations,
procedures, and loops
Must understand system to optimize performance
How programs compiled and executed
How to measure program performance and identify bottlenecks
How to improve performance without destroying code modularity
and generality
– 34 –
Great Reality #5:
Computers do more than execute programs
They need to get data in and out
I/O system critical to program reliability and performance
They communicate with each other over networks
Many system-level issues arise in presence of network
Concurrent operations by autonomous processes
Coping with unreliable media
Cross platform compatibility
Complex performance issues
– 35 –
What waits for you in the future?
– 36 –
New tricks to learn..
– 37 –
Coming to a classroom next to you in Spring semester:
334 - Introduction to Operating Systems
Processes and threads
Synchronization, semaphores, monitors
Assignment 1: Synchronization
CPU scheduling policies
Assignment 2: System calls and scheduling
Virtual memory, paging, and TLBs
Assignment 3: Virtual memory
Filesystems, FFS, and LFS
Assignment 4: File systems
RAID and NFS filesystems
– 38 –