Transcript slides

EECE 571R (Spring 2009)
Massively parallel/distributed platforms
Matei Ripeanu
matei at ece.ubc.ca
Contact Info
Email: matei @ ece.ubc.ca
Office: KAIS 4033
Office hours: by appointment (email me)
Course page:
http://www.ece.ubc.ca/~matei/EECE571/
EECE 571R: Course Goals
• Primary
– Gain understanding of fundamental issues that affect design of:
• Large-scale systems – massively parallel / massively
distributed
– Survey main current research themes
– Gain experience with distributed systems research
• Research on: federated systems, networks
• Secondary
– By studying a set of outstanding papers, build knowledge of how to
do & present research
– Learn how to read papers & evaluate ideas
What I’ll Assume You Know
• Basic Internet architecture
– IP, TCP, DNS, HTTP
• Basic principles of distributed computing
– Asynchrony (cannot distinguish between communication failures
and latency)
– Incomplete & inconsistent global state knowledge (cannot know
everything correctly)
– Failures happen (in large systems, even rare failures of individual components aggregate to high failure rates)
• If there are things that don’t make sense, ask!
Outline
• Case study (and project ideas):
– Amdahl’s Law in the Multi-core Era
• Administrative: course schedule/organization
Amdahl’s Law in the Multicore Era
Acknowledgement: some slides borrowed from presentations by Mark D. Hill, David Patterson, Jim Larus, Saman Amarasinghe, Richard Brunner, and Luddy Harrison
Moore’s Laws
Technology & Moore’s Law
Transistor
1947
Integrated Circuit 1958
(a.k.a. Chip)
Moore’s Law 1964:
# Transistors per Chip doubles every two years (or 18 months)
Architects & Another Moore’s Law
Microprocessor 1971
50M transistors ~2000 
Popular Moore’s Law:
Processor (core) performance doubles every two years
Multicore Chip (a.k.a. Chip Multiprocessor)
Why Multicore?
• Power → simpler structures
• Memory → concurrent accesses to tolerate off-chip latency
• Wires → intra-core wires shorter
• Complexity → divide & conquer
But more cores; NOT faster cores
Will effective chip performance
keep doubling every two years?
Eight 4-way cores 2006
Virtuous Cycle, circa 1950 – 2005
[Cycle diagram linking: increased processor performance; larger, more feature-full software; slower programs; higher-level languages & abstractions; larger development teams]
World-Wide Software Market (per IDC): $212b (2005) → $310b (2010)
Virtuous Cycle, 2005 – ???
[Same cycle diagram, but with the "increased processor performance" link crossed out: GAME OVER – NEXT LEVEL?]
Thread Level Parallelism & Multicore Chips
World-Wide Software Market: $212b (2005) → ?
Summary: A Corollary to Amdahl’s Law
• Develop Simple Model of Multicore Hardware
– Complements Amdahl’s software model
– Fixed chip resources for cores
– Core performance improves sub-linearly with resources
• Shows Need For Research To
– Increase parallelism (Are you surprised?)
– Increase core performance (especially for larger chips)
– Refine asymmetric designs (e.g., one core enhanced)
– Refine dynamically harnessing cores for serial performance
Outline
• Recall Amdahl’s Law
• A Model of Multicore Hardware
• Symmetric Multicore Chips
• Asymmetric Multicore Chips
• Dynamic Multicore Chips
• Caveats & Wrap Up
Amdahl’s Law
• Begins with Simple Software Assumption (Limit Arg.)
– Fraction F of execution time perfectly parallelizable
– No Overhead for
  – Scheduling
  – Synchronization
  – Communication, etc.
– Fraction 1 – F Completely Serial
• Time on 1 core = (1 – F) / 1 + F / 1 = 1
• Time on N cores = (1 – F) / 1 + F / N
Amdahl’s Law [1967]
Amdahl's Speedup = 1 / ((1 – F)/1 + F/N)
• Implications:
– Attack the common case when introducing parallelization: when F is small, optimizations will have little effect.
– The aspects you ignore also limit speedup: as N approaches infinity, speedup is bounded by 1/(1 – F).
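Not from the slides: a minimal Python sketch of the formula above, with illustrative values of F and N, showing how speedup saturates at 1/(1 – F).

```python
def amdahl_speedup(f, n):
    """Amdahl's Law: fraction f is perfectly parallel, (1 - f) is serial."""
    return 1.0 / ((1.0 - f) + f / n)

# Speedup saturates at 1/(1 - f) no matter how many cores are added.
for f in (0.5, 0.9, 0.99):
    print(f, [round(amdahl_speedup(f, n), 1) for n in (4, 16, 256)])
```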
Amdahl’s Law [1967]
Amdahl's Speedup = 1 / ((1 – F)/1 + F/N)
• Discussion:
– Can you ever obtain super-linear speedups?
Amdahl’s Law [1967]
Amdahl's Speedup = 1 / ((1 – F)/1 + F/N)
• For mainframes, Amdahl expected 1 - F = 35%
– For a 4-processor speedup = 2
– For infinite-processor speedup < 3
– Therefore, stay with mainframes with one/few processors
• Q: Does it make sense to design massively multicore
chips? What kind?
Designing Multicore Chips Hard
• Designers must confront single-core design options
– Instruction fetch, wakeup, select
– Execution unit configuration & operand bypass
– Load/store queue(s) & data cache
– Checkpoint, log, runahead, commit
• As well as additional design degrees of freedom
– How many cores? How big each?
– Shared caches: levels? How many banks?
– Memory interface: How many banks?
– On-chip interconnect: bus, switched?
Want Simple Multicore Hardware Model
To Complement Amdahl’s Simple Software Model
(1) Chip Hardware Roughly Partitioned into
– (I) Multiple Cores (with L1 caches)
– (II) The Rest (L2/L3 cache banks, interconnect, pads, etc.)
– Assume: Changing Core Size/Number does NOT change The Rest
(2) Resources for Multiple Cores Bounded
– Bound of N resources per chip for cores
– Due to area, power, cost ($$$), or multiple factors
– Bound = Power? (but pictures here use Area)
Want Simple Multicore Hardware Model, cont.
(3) Architects can improve single-core performance
using more of the bounded resource
• A Simple Base Core
– Consumes 1 Base Core Equivalent (BCE) resources
– Provides performance normalized to 1
• An Enhanced Core (in same process generation)
– Consumes R x BCEs
– Performance as a function Perf(R)
• What does function Perf(R) look like?
More on Enhanced Cores
• (Performance Perf(R) when consuming R BCEs of resources)
• If Perf(R) > R → always enhance the core
• Cost-effectively speeds up both sequential & parallel execution
• Therefore, the equations assume Perf(R) < R
• Graphs Assume Perf(R) = square root of R
– 2x performance for 4 BCEs, 3x for 9 BCEs, etc.
– Why? Models diminishing returns with “no coefficients”
• How to speedup enhanced core?
– <Insert favorite or TBD micro-architectural ideas here>
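Not from the slides: a one-function sketch of the Perf(R) = sqrt(R) assumption used in the graphs (the function name is mine).

```python
import math

def perf(r):
    """Modeled performance of a core built from r BCEs; sqrt(r) captures
    diminishing returns without introducing tuned coefficients."""
    return math.sqrt(r)

# 2x performance for 4 BCEs, 3x for 9 BCEs, 4x for 16 BCEs.
print([perf(r) for r in (1, 4, 9, 16)])   # [1.0, 2.0, 3.0, 4.0]
```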
Outline
• Recall Amdahl’s Law
• A Model of Multicore Hardware
• Symmetric Multicore Chips
• Asymmetric Multicore Chips
• Dynamic Multicore Chips
• Caveats & Wrap Up
How Many (Symmetric) Cores per Chip?
• Each Chip Bounded to N BCEs (for all cores)
• Each Core consumes R BCEs
• Assume Symmetric Multicore = All Cores Identical
• Therefore, N/R Cores per Chip, since (N/R)*R = N
• For an N = 16 BCE Chip:
Sixteen 1-BCE cores
Four 4-BCE cores
One 16-BCE core
Performance of Symmetric Multicore Chips
• Serial Fraction 1-F uses 1 core at rate Perf(R)
• Serial time = (1 – F) / Perf(R)
• Parallel Fraction uses N/R cores at rate Perf(R) each
• Parallel time = F / (Perf(R) * (N/R)) = F*R / (Perf(R)*N)
• Therefore, w.r.t. one base core:
Symmetric Speedup = 1 / ((1 – F)/Perf(R) + F*R/(Perf(R)*N))
• Implications?
Enhanced Cores speed Serial & Parallel
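Not from the slides: a sketch of the symmetric-chip speedup above, reusing the sqrt Perf(R) model; the brute-force sweep is mine, but it reproduces the N = 16 design points quoted on the following slides.

```python
import math

def symmetric_speedup(f, n, r):
    """Chip of n BCEs built as n/r identical cores of r BCEs each."""
    perf_r = math.sqrt(r)
    return 1.0 / ((1.0 - f) / perf_r + f * r / (perf_r * n))

n = 16
for f in (0.5, 0.9, 0.99):
    best_r = max((1, 2, 4, 8, 16), key=lambda r: symmetric_speedup(f, n, r))
    print(f"F={f}: R={best_r}, cores={n // best_r}, "
          f"speedup={symmetric_speedup(f, n, best_r):.1f}")
# F=0.5 -> R=16, 1 core, speedup 4.0; F=0.9 -> R=2, 8 cores, speedup 6.7
```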
Symmetric Multicore Chip, N = 16 BCEs
[Chart: Symmetric Speedup (0–16) vs. R BCEs = 1, 2, 4, 8, 16 (16, 8, 4, 2, 1 cores); curve for F=0.5; optimum at R=16, Cores=1, Speedup=4]
F=0.5, Opt. Speedup S = 4 = 1/(0.5/4 + 0.5*16/(4*16))
Need to increase parallelism to make multicore optimal!
Symmetric Multicore Chip, N = 16 BCEs
[Chart: Symmetric Speedup vs. R BCEs = 1, 2, 4, 8, 16; curves for F=0.5 and F=0.9; F=0.9 optimum: R=2, Cores=8, Speedup=6.7; F=0.5 optimum: R=16, Cores=1, Speedup=4]
At F=0.9, Multicore optimal, but speedup limited
Need to obtain even more parallelism!
Symmetric Multicore Chip, N = 16 BCEs
[Chart: Symmetric Speedup vs. R BCEs = 1, 2, 4, 8, 16; curves for F=0.5, 0.9, 0.975, 0.99, 0.999; as F→1: R=1, Cores=16, Speedup→16]
F matters: Amdahl’s Law applies to multicore chips
Researchers should target parallelism F first
Symmetric Multicore Chip, N = 16 BCEs
[Chart: Symmetric Speedup vs. R BCEs = 1, 2, 4, 8, 16; curves for F=0.5, 0.9, 0.975, 0.99, 0.999; recall F=0.9, R=2, Cores=8, Speedup=6.7]
As Moore’s Law enables N to go from 16 to 256 BCEs,
More core enhancements? More cores? Or both?
Symmetric Multicore Chip, N = 256 BCEs
[Chart: Symmetric Speedup (0–250) vs. R BCEs = 1 … 256; curves for F=0.5, 0.9, 0.975, 0.99, 0.999]
• F→1: R=1 (vs. 1), Cores=256 (vs. 16), Speedup=204 (vs. 16) → MORE CORES!
• F=0.99: R=3 (vs. 1), Cores=85 (vs. 16), Speedup=80 (vs. 13.9) → CORE ENHANCEMENTS & MORE CORES!
• F=0.9: R=28 (vs. 2), Cores=9 (vs. 8), Speedup=26.7 (vs. 6.7) → CORE ENHANCEMENTS!
As Moore’s Law increases N, often need enhanced core designs
Researchers should target single-core performance too
Outline
• Recall Amdahl’s Law
• A Model of Multicore Hardware
• Symmetric Multicore Chips
• Asymmetric Multicore Chips
• Dynamic Multicore Chips
• Caveats & Wrap Up
Asymmetric (Heterogeneous) Multicore Chips
• Symmetric Multicore Required All Cores Equal
• Why Not Enhance Some (But Not All) Cores?
• For Amdahl’s Simple Software Assumptions
– One Enhanced Core
– Others are Base Cores
• How?
– <fill in favorite micro-architecture techniques here>
– Model ignores design cost of asymmetric design
• How does this affect our hardware model?
How Many Cores per Asymmetric Chip?
• Each Chip Bounded to N BCEs (for all cores)
• One R-BCE Core leaves N-R BCEs
• Use N-R BCEs for N-R Base Cores
• Therefore, 1 + N - R Cores per Chip
• For an N = 16 BCE Chip:
Symmetric: Four 4-BCE cores
Asymmetric: One 4-BCE core
& Twelve 1-BCE base cores
Performance of Asymmetric Multicore Chips
• Serial Fraction 1-F same, so time = (1 – F) / Perf(R)
• Parallel Fraction F
– One core at rate Perf(R)
– N-R cores at rate 1
– Parallel time = F / (Perf(R) + N - R)
• Therefore, w.r.t. one base core:
Asymmetric Speedup = 1 / ((1 – F)/Perf(R) + F/(Perf(R) + N – R))
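Not from the slides: the asymmetric-chip formula above as code, evaluated at the N = 256, F = 0.99 design point quoted two slides later (with the sqrt Perf(R) model it gives roughly the 166 shown there).

```python
import math

def asymmetric_speedup(f, n, r):
    """One enhanced core of r BCEs plus (n - r) single-BCE base cores."""
    perf_r = math.sqrt(r)
    return 1.0 / ((1.0 - f) / perf_r + f / (perf_r + n - r))

# N=256, F=0.99, R=41 -> 1 + 256 - 41 = 216 cores, speedup ~ 166.
print(round(asymmetric_speedup(0.99, 256, 41), 1))   # ~165.7
```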
Asymmetric Multicore Chip, N = 256 BCEs
[Chart: Asymmetric Speedup (0–250) vs. R BCEs = 1 … 256; curves for F=0.5, 0.9, 0.975, 0.99, 0.999; 256 cores at R=1 down to 1 core at R=256]
Number of Cores = 1 (Enhanced) + 256 – R (Base)
How do Asymmetric & Symmetric speedups compare?
Recall Symmetric Multicore Chip, N = 256 BCEs
[Chart: Symmetric Speedup (0–250) vs. R BCEs = 1 … 256; curves for F=0.5, 0.9, 0.975, 0.99, 0.999; recall F=0.9, R=28, Cores=9, Speedup=26.7]
Asymmetric Multicore Chip, N = 256 BCEs
[Chart: Asymmetric Speedup (0–250) vs. R BCEs = 1 … 256; curves for F=0.5, 0.9, 0.975, 0.99, 0.999]
• F=0.99: R=41 (vs. 3), Cores=216 (vs. 85), Speedup=166 (vs. 80)
• F=0.9: R=118 (vs. 28), Cores=139 (vs. 9), Speedup=65.6 (vs. 26.7)
Asymmetric offers greater speedup potential than Symmetric
• As Moore’s Law increases N, Asymmetric gets better
Some researchers should target developing asymmetric multicores
Outline
• Recall Amdahl’s Law
• A Model of Multicore Hardware
• Symmetric Multicore Chips
• Asymmetric Multicore Chips
• Dynamic Multicore Chips
• Caveats & Wrap Up
Dynamic Multicore Chips
• Why NOT Have Your Cake and Eat It Too?
• N Base Cores for Best Parallel Performance
• Harness R Cores Together for Serial Performance
• How? DYNAMICALLY Harness Cores Together
– <insert favorite or TBD techniques here>
[Diagram: the chip in parallel mode (all base cores run) vs. sequential mode (cores harnessed together)]
How would one model this chip?
Performance of Dynamic Multicore Chips
• N Base Cores Where R Can Be Harnessed
• Serial Fraction 1-F uses R BCEs at rate Perf(R)
• Serial time = (1 – F) / Perf(R)
• Parallel Fraction F uses N base cores at rate 1 each
• Parallel time = F / N
• Therefore, w.r.t. one base core:
Dynamic Speedup = 1 / ((1 – F)/Perf(R) + F/N)
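Not from the slides: the dynamic-chip formula as code; harnessing all N BCEs for the serial phase (R = N) gives roughly the 223 quoted on the next slide for N = 256, F = 0.99.

```python
import math

def dynamic_speedup(f, n, r):
    """n base cores; r of them are dynamically harnessed (performance
    sqrt(r)) for the serial fraction, while all n run the parallel fraction."""
    return 1.0 / ((1.0 - f) / math.sqrt(r) + f / n)

print(round(dynamic_speedup(0.99, 256, 256), 1))   # ~222.6, i.e. the ~223 on the slide
```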
Recall Asymmetric Multicore Chip, N = 256 BCEs
[Chart: Asymmetric Speedup (0–250) vs. R BCEs = 1 … 256; curves for F=0.5, 0.9, 0.975, 0.99, 0.999; recall F=0.99, R=41, Cores=216, Speedup=166]
What happens with a dynamic chip?
Dynamic Multicore Chip, N = 256 BCEs
[Chart: Dynamic Speedup (0–250) vs. R BCEs = 1 … 256; curves for F=0.5, 0.9, 0.975, 0.99, 0.999; note: #Cores always N=256]
• F=0.99: R=256 (vs. 41), Cores=256 (vs. 216), Speedup=223 (vs. 166)
Dynamic offers greater speedup potential than Asymmetric
Researchers should target dynamically harnessing cores
Outline
• Recall Amdahl’s Law
• A Model of Multicore Hardware
• Symmetric Multicore Chips
• Asymmetric Multicore Chips
• Dynamic Multicore Chips
• Caveats & Wrap Up
Three Multicore Amdahl's Laws
[Diagram: each design shown as a sequential section (run on one enhanced core, or on R dynamically harnessed cores) and a parallel section]
• Symmetric (parallel section on N/R enhanced cores):
  Symmetric Speedup = 1 / ((1 – F)/Perf(R) + F*R/(Perf(R)*N))
• Asymmetric (parallel section on 1 enhanced & N-R base cores):
  Asymmetric Speedup = 1 / ((1 – F)/Perf(R) + F/(Perf(R) + N – R))
• Dynamic (parallel section on N base cores):
  Dynamic Speedup = 1 / ((1 – F)/Perf(R) + F/N)
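Not from the slides: the three formulas side by side at one design point. Within this model (1 ≤ R ≤ N), dynamic ≥ asymmetric ≥ symmetric for the same F, N, and R.

```python
import math

def perf(r):
    return math.sqrt(r)

def symmetric(f, n, r):
    return 1 / ((1 - f) / perf(r) + f * r / (perf(r) * n))

def asymmetric(f, n, r):
    return 1 / ((1 - f) / perf(r) + f / (perf(r) + n - r))

def dynamic(f, n, r):
    return 1 / ((1 - f) / perf(r) + f / n)

f, n, r = 0.99, 256, 41
print([round(s(f, n, r), 1) for s in (symmetric, asymmetric, dynamic)])
# e.g. [38.0, 165.7, 184.2] -- same budget, increasingly flexible designs
```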
Software Model Charges 1 of 2
• Serial fraction not totally serial
  – Can extend software model to tree algorithms, etc.
• Parallel fraction not totally parallel
  – Can extend for varying or bounded parallelism
• Serial/parallel fraction may change
  – Can extend for Weak Scaling [Gustafson, CACM'88]: run a larger, more parallel problem in constant time (see the sketch below)
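Not from the slides: a one-liner for the weak-scaling (Gustafson) speedup referenced above, where the parallel part grows with the machine instead of staying fixed.

```python
def gustafson_speedup(f, n):
    """Scaled speedup when the problem grows so the run time stays constant;
    f is the parallel fraction measured on the n-core machine."""
    return (1 - f) + f * n

print(gustafson_speedup(0.9, 256))   # 230.5, vs. Amdahl's fixed-size limit of 10
```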
Software Model Charges 2 of 2
• Synchronization, communication, scheduling effects?
  – Can extend for overheads and imbalance
• Software challenges for asymmetric multicore worse
  – Can extend for asymmetric scheduling, etc.
• Software challenges for dynamic multicore greater
  – Can extend to model these overheads
Hardware Model Charges
• Naïve to consider total resources for cores fixed
  – Can extend hardware model to how core changes affect 'The Rest'
• Naïve to bound cores by one resource (esp. area)
  – Can extend for a Pareto-optimal mix of area, power, complexity, reliability, …
• Naïve to ignore challenges due to off-chip bandwidth limits & benefits of last-level caching
  – Can extend to model these
• Naïve to use performance = square root of resources
  – Can extend, as the equations can use any function
• Just because the model is simple does NOT mean the conclusions are wrong
• Prediction: while the truth is more complex, these basic observations will hold
Dynamic Multicore Chip, N = 1024 BCEs
[Chart: Dynamic Speedup (0–1000) vs. R BCEs = 1 … 1024; curves for F=0.5, 0.9, 0.975, 0.99, 0.999; as F→1: R→1024, Cores→1024, Speedup→1024!]
NOT Possible Today
NOT Possible EVER Unless We Dream & Act
Summary so far: A Corollary to Amdahl’s Law
• Developed Simple Model of Multicore Hardware
– Complements Amdahl’s software model
– Fixed chip resources for cores
– Core performance improves sub-linearly with resources
• Showed Need For Research To
– Increase parallelism (Are you surprised?)
– Increase core performance (especially for larger chips)
– Refine asymmetric designs (e.g., one core enhanced)
– Refine dynamically harnessing cores for serial performance
What about clusters / clouds / massively
distributed platforms?
Amdahl’s Law – for multi-processor/distributed
• Begins with Simple Software Assumption (Limit Arg.)
– Fraction F of execution time perfectly parallelizable
– No Overhead for
  – Scheduling
  – Synchronization
  – Communication, etc.
– Fraction 1 – F Completely Serial

Amdahl's Speedup = 1 / ((1 – F)/1 + F/N)
Clusters / Distributed systems
This class
• Weekly schedule (tentative)
Course Organization/Syllabus/etc.
Administrivia: Course structure
• Paper discussion (most classes) & lectures
• Student projects
– Aim high! Have fun! It’s a class project, not your PhD!
– Teams of up to 3 students
– Project presentations at the end of the term
Administrivia: Grading
• Paper reviews: 35%
• Class participation: 10%
• Discussion leading: 10%
• Project: 45%
Administrivia: Paper Reviewing (1)
• Goals:
  – Think about what you read
  – Expand your knowledge beyond the papers that are assigned
  – Get used to writing paper reviews
• Reviews are due by midnight the day before the class
• Be professional in your writing
• Keep an eye on writing style:
  – Clarity
  – Beware of traps: learn to use them in writing and detect them in reading
  – Detect (and stay away from) trivial claims. E.g., the 1st sentence of many Introductions: "The tremendous/unprecedented/phenomenal growth/scale/ubiquity of the Internet…"
Administrivia: Paper Reviewing (2)
Follow the form provided, when relevant:
• State the main contribution of the paper
• Critique the main contribution:
  – Rate the significance of the paper on a scale of 5 (breakthrough), 4 (significant contribution), 3 (modest contribution), 2 (incremental contribution), 1 (no contribution or negative contribution)
  – Explain your rating in a sentence or two
• Rate how convincing the methodology is:
  – Do the claims and conclusions follow from the experiments?
  – Are the assumptions realistic?
  – Are the experiments well designed?
  – Are there different experiments that would be more convincing?
  – Are there other alternatives the authors should have considered?
  – (And, of course, is the paper free of methodological errors?)
Administrivia: Paper Reviewing (3)
• What is the most important limitation of the approach?
• What are the three strongest and/or most interesting ideas in the paper?
• What are the three most striking weaknesses in the paper?
• Name three questions that you would like to ask the authors.
• Detail an interesting extension to the work not mentioned in the future work section.
• Optional comments on the paper that you'd like to see discussed in class.
Administrivia: Discussion leading
• Come prepared!
– Prepare discussion outline
– Prepare questions:
• “What if”s
• Unclear aspects of the solution proposed
• …
– Similar ideas in different contexts
– Initiate short brainstorming sessions
• Leaders do NOT need to submit paper reviews
• Main goals:
– Keep discussion flowing
– Keep discussion relevant
– Engage everybody (I’ll have an eye on this, too)
Administrivia: Projects
• Combine with your research if relevant to the class
• Get approval from all instructors if you overlap final projects:
– Don’t sell the same piece of work twice
– You can get more than twice as many results with less than twice
as much work
• Aim high!
– Put one extra month and get a publication out of it
– It is doable!
• Try ideas that you postponed out of fear: it’s just a class, not
your PhD.
Administrivia: Project deadlines (tentative)
• 2nd week – 3 slides idea presentation: “What is the research
question you aim to answer”
• 4th week: 2-page project proposal
• 6th week: 3-page related work survey
– Know relevant work in your problem area
– If implementation project, list tools, similar projects
– Expand proposal
• 8th week: 4-page Midterm project due
– Have a clear image of what’s possible/doable
– Report preliminary results
• [see schedule] In-class project presentation
– Demo, if appropriate
• Last week of exam session:
– 10-page write-up
Next Class (Mon, 19/01)
• Note room change: KAIS
• Discussion of
– Project ideas
– Papers
To do:
• Subscribe to mailing list
• Volunteers for discussion leaders for class next week
Questions?
Backup Slides
Cost-Effective Parallel Computing
• Isn’t Speedup(P) < P inefficient? (P = processors)
• If only throughput matters, use P computers instead?
• But much of a computer’s cost is NOT in the
processor [Wood & Hill, IEEE Computer 2/95]
• Let Costup(P) = Cost(P)/Cost(1)
• Parallel computing cost-effective when: Speedup(P) > Costup(P)
• E.g. for SGI PowerChallenge w/ 500MB: Costup(32) = 8.6
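Not from the slides: the cost-effectiveness test above as code; the Costup(32) = 8.6 value is the SGI PowerChallenge figure from the slide, and the speedup value is purely illustrative.

```python
def cost_effective(speedup_p, costup_p):
    """Wood & Hill: parallelism pays off when speedup exceeds costup."""
    return speedup_p > costup_p

# Costup(32) = 8.6 for the SGI PowerChallenge w/ 500MB, so any workload
# with Speedup(32) > 8.6 is cost-effective even though 8.6 << 32.
print(cost_effective(speedup_p=12.0, costup_p=8.6))   # True (12.0 is illustrative)
```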
Symmetric Multicore Chip, N = 16 BCEs
[Chart: Symmetric Speedup (0–16) vs. R BCEs = 1 … 16; curves for F=0.5, 0.9, 0.975, 0.99, 0.999]
Symmetric Multicore Chip, N = 256 BCEs
[Chart: Symmetric Speedup (0–250) vs. R BCEs = 1 … 256; curves for F=0.5, 0.9, 0.975, 0.99, 0.999]
Symmetric Multicore Chip, N = 1024 BCEs
[Chart: Symmetric Speedup (0–1000) vs. R BCEs = 1 … 1024; curves for F=0.5, 0.9, 0.975, 0.99, 0.999]
Asymmetric Multicore Chip, N = 16 BCEs
[Chart: Asymmetric Speedup (0–16) vs. R BCEs = 1 … 16; curves for F=0.5, 0.9, 0.975, 0.99, 0.999]
Asymmetric Multicore Chip, N = 256 BCEs
[Chart: Asymmetric Speedup (0–250) vs. R BCEs = 1 … 256; curves for F=0.5, 0.9, 0.975, 0.99, 0.999]
Asymmetric Multicore Chip, N = 1024 BCEs
[Chart: Asymmetric Speedup (0–1000) vs. R BCEs = 1 … 1024; curves for F=0.5, 0.9, 0.975, 0.99, 0.999]
Dynamic Multicore Chip, N = 16 BCEs
[Chart: Dynamic Speedup (0–16) vs. R BCEs = 1 … 16; curves for F=0.5, 0.9, 0.975, 0.99, 0.999]
Dynamic Multicore Chip, N = 256 BCEs
[Chart: Dynamic Speedup (0–250) vs. R BCEs = 1 … 256; curves for F=0.5, 0.9, 0.975, 0.99, 0.999]
Dynamic Multicore Chip, N = 1024 BCEs
[Chart: Dynamic Speedup (0–1000) vs. R BCEs = 1 … 1024; curves for F=0.5, 0.9, 0.975, 0.99, 0.999]