
ECE 669
Parallel Computer Architecture
Lecture 9
Workload Evaluation
February 26, 2004
Outline
° Evaluation of applications is important
° Simulation of sample data sets provides important information
° Working sets indicate grain size
° Preliminary results offer opportunity for tuning
° Understanding communication costs
• Remember: software and communication!
Workload-Driven Evaluation
° Evaluating real machines
° Evaluating an architectural idea or trade-offs
=> need good metrics of performance
=> need to pick good workloads
=> need to pay attention to scaling
• many factors involved
° Today: narrow architectural comparison
° Set in wider context
Evaluation in Uniprocessors
° Decisions made only after quantitative evaluation
° For existing systems: comparison and procurement evaluation
° For future systems: careful extrapolation from known quantities
° Wide base of programs leads to standard benchmarks
• Measured on wide range of machines and successive generations
° Measurements and technology assessment lead to proposed features
° Then simulation
• Simulator developed that can run with and without a feature
• Benchmarks run through the simulator to obtain results
• Together with cost and complexity, results drive decisions
More Difficult for Multiprocessors
° What is a representative workload?
° Software model has not stabilized
° Many architectural and application degrees of freedom
• Huge design space: no. of processors, other architectural and application parameters
• Impact of these parameters and their interactions can be huge
• High cost of communication
° What are the appropriate metrics?
° Simulation is expensive
• Realistic configurations and sensitivity analysis difficult
• Larger design space, but more difficult to cover
° Understanding of parallel programs as workloads is critical
• Particularly interaction of application and architectural parameters
A Lot Depends on Sizes
° Application parameters and no. of procs affect inherent properties
• Load balance, communication, extra work, temporal and spatial locality
° Interactions with organization parameters of extended memory hierarchy affect communication and performance
[Figure: speedup vs. number of processors (1-31). Left: Ocean for N = 130, 258, 514, 1,026. Right: Barnes-Hut on Origin (16 K, 64 K, 512 K) and Challenge (16 K, 512 K).]
° Effects often dramatic, sometimes small: application-dependent
Understanding size interactions and scaling relationships is key
Scaling: Why Worry?
° Fixed problem size is limited
° Too small a problem:
• May be appropriate for small machine
• Parallelism overheads begin to dominate benefits for larger machines
- Load imbalance
- Communication-to-computation ratio
• May even achieve slowdowns
• Doesn't reflect real usage, and inappropriate for large machines
- Can exaggerate benefits of architectural improvements, especially when measured as percentage improvement in performance
° Too large a problem
• Difficult to measure improvement (next)
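To make the "too small" effect concrete, here is a minimal sketch of a toy analytical model, not from the lecture: a 2-D grid computation whose per-processor compute shrinks as n^2/p while boundary communication shrinks only as n/sqrt(p). The cost constants are illustrative assumptions.

```python
import math

def model_time(n, p, t_compute=1.0, t_comm=10.0):
    """Estimated time per iteration of an n x n grid computation on p processors."""
    compute = t_compute * n * n / p                         # perfectly divided work
    comm = t_comm * 4 * n / math.sqrt(p) if p > 1 else 0.0  # boundary exchange
    return compute + comm

for n in (128, 1024):
    for p in (1, 4, 16, 64, 256):
        s = model_time(n, 1) / model_time(n, p)
        print(f"n={n:5d} p={p:4d} speedup={s:7.1f}")
# The small problem (n=128) flattens quickly as communication dominates;
# the large one (n=1024) keeps scaling.
```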
Too Large a Problem
° Suppose problem realistically large for big machine
° May not “fit” in small machine
• Can’t run
• Thrashing to disk
• Working set doesn’t fit in cache
° Fits at some p, leading to superlinear speedup
° Real effect, but doesn’t help evaluate effectiveness
° Finally, users want to scale problems as machines grow
• Can help avoid these problems
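A minimal numeric sketch of the superlinear effect (all numbers are invented for illustration): once the per-processor share of the data set fits in cache at some p, the measured "speedup" over the thrashing uniprocessor run jumps past p.

```python
def run_time(work, data_mb, p, cache_mb=4.0, hit_cpi=1.0, miss_cpi=5.0):
    # Crude model: if each processor's share of the data fits in its
    # cache, references are fast; otherwise charge a stall factor.
    cpi = hit_cpi if data_mb / p <= cache_mb else miss_cpi
    return (work / p) * cpi

WORK, DATA_MB = 1e9, 64
t1 = run_time(WORK, DATA_MB, 1)          # uniprocessor: data does not fit
for p in (1, 2, 4, 8, 16, 32):
    print(f"p={p:2d}  speedup={t1 / run_time(WORK, DATA_MB, p):5.1f}")
# At p=16 the 4 MB-per-processor share first fits, and speedup jumps to 80.
```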
Demonstrating Scaling Problems
° Small Ocean and big equation solver problems on SGI Origin2000
[Figure: speedup vs. number of processors (1-31) on the Origin2000. Left: grid solver, 12 K x 12 K, vs. ideal. Right: Ocean, 258 x 258, vs. ideal.]
Communication and Replication
° View parallel machine as extended memory hierarchy
• Local cache, local memory, remote memory
• Classify “misses” in “cache” at any level as for uniprocessors
- compulsory or cold misses (no size effect)
- capacity misses (size effect)
- conflict or collision misses (size effect)
- communication or coherence misses (no size effect)
° Communication induced by finite capacity is most fundamental artifact
• Like cache size and miss rate or memory traffic in uniprocessors
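As a sketch of this miss taxonomy (a simplified classification of my own, not code from the course), consider one level of a private cache under invalidation-based coherence: a first-ever reference is cold, a reference whose copy was last removed by a remote write is a coherence miss, and anything else was evicted for space.

```python
from collections import OrderedDict

class PrivateCache:
    """Fully associative LRU cache that labels each miss by its cause."""
    def __init__(self, nblocks):
        self.nblocks = nblocks
        self.lru = OrderedDict()      # resident blocks, in LRU order
        self.seen = set()             # blocks ever cached here
        self.invalidated = set()      # blocks last removed by a remote write

    def access(self, block):
        if block in self.lru:
            self.lru.move_to_end(block)
            return "hit"
        if block not in self.seen:
            kind = "cold"
        elif block in self.invalidated:
            kind = "coherence"
        else:
            kind = "capacity/conflict"
        self.seen.add(block)
        self.invalidated.discard(block)
        self.lru[block] = True
        if len(self.lru) > self.nblocks:
            self.lru.popitem(last=False)          # evict LRU block
        return kind

    def remote_write(self, block):
        if self.lru.pop(block, None) is not None:
            self.invalidated.add(block)

c = PrivateCache(nblocks=2)
print(c.access(0), c.access(1), c.access(0))      # cold cold hit
c.remote_write(0)
print(c.access(0))                                # coherence
print(c.access(2), c.access(1))                   # cold capacity/conflict
```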
Working Set Perspective
[Figure: data traffic at a given level of the hierarchy (to the next level out) vs. replication capacity (cache size). Knees at the first and second working sets separate capacity-generated traffic (including conflicts) from cold-start (compulsory) traffic, inherent communication, and other capacity-independent communication.]
• Hierarchy of working sets
• At a first-level cache (fully associative, one-word blocks), the curve is inherent to the algorithm
- the working set curve for the program
• Traffic from any type of miss can be local or nonlocal (communication)
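A minimal sketch of how such a curve can be measured (the synthetic reference stream and sizes are my own illustration): replay one processor's references through a fully associative, one-word-block LRU cache at a range of sizes and look for knees in the miss rate.

```python
from collections import OrderedDict

def miss_rate(refs, cache_words):
    cache, misses = OrderedDict(), 0
    for addr in refs:
        if addr in cache:
            cache.move_to_end(addr)               # LRU update on a hit
        else:
            misses += 1
            cache[addr] = True
            if len(cache) > cache_words:
                cache.popitem(last=False)         # evict LRU word
    return misses / len(refs)

# Synthetic stream: a hot 8-word inner set swept repeatedly, plus a
# colder 56-word region touched once per outer iteration.
refs = []
for _ in range(100):
    for _ in range(4):
        refs += list(range(8))                    # first working set
    refs += list(range(8, 64))                    # second working set
for size in (4, 8, 16, 32, 64, 128):
    print(f"{size:4d} words: miss rate {miss_rate(refs, size):6.1%}")
# Knees appear at 8 words (first working set) and 64 words (second).
```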
Questions in Scaling
° Scaling a machine: Can scale power in many ways
• Assume adding identical nodes, each bringing memory
° Problem size: vector of input parameters, e.g. N = (n, θ, Δt)
• Determines work done
• Distinct from data set size and memory usage
° Under what constraints to scale the application?
• What are the appropriate metrics for performance improvement?
- work is no longer fixed, so time alone is not enough
° How should the application be scaled?
Under What Constraints to Scale?
° Two types of constraints:
• User-oriented, e.g. particles, rows, transactions, I/Os per processor
• Resource-oriented, e.g. memory, time
° Which is more appropriate depends on application domain
• User-oriented easier for user to think about and change
• Resource-oriented more general, and often more real
° Resource-oriented scaling models:
• Problem constrained (PC)
• Memory constrained (MC)
• Time constrained (TC)
Problem Constrained Scaling
° User wants to solve same problem, only faster
• Video compression
• Computer graphics
• VLSI routing
° But limited when evaluating larger machines
Speedup_PC(p) = Time(1) / Time(p)
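In code form, a trivial helper for the formula above (the example times are invented):

```python
def speedup_pc(time_1p, time_p):
    """Problem-constrained speedup: same problem, measured at 1 and p processors."""
    return time_1p / time_p

print(speedup_pc(64.0, 2.7))   # e.g. 64 s sequential vs. 2.7 s parallel -> ~23.7x
```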
Time Constrained Scaling
° Execution time is kept fixed as system scales
• User has fixed time to use machine or wait for result
° Performance = Work/Time as usual, and time is fixed, so
Speedup_TC(p) = Work(p) / Work(1)
° How to measure work?
• Execution time on a single processor? (thrashing problems)
• Should be easy to measure, ideally analytical and intuitive
• Should scale linearly with sequential complexity
- Or ideal speedup will not be linear in p (e.g. no. of rows in matrix program)
• If cannot find intuitive application measure, as often true, measure execution time with ideal memory system on a uniprocessor
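A minimal sketch of TC scaling under an assumed work model (the n^3 cost for dense factorization is my assumption here, not stated on the slide): hold ideal parallel time n^3/p fixed, solve for n(p), and the work ratio gives the TC speedup.

```python
n1 = 10_000                       # baseline problem size at p = 1
for p in (1, 8, 64, 512):
    n = n1 * p ** (1 / 3)         # n^3 / p == n1^3 keeps the time fixed
    speedup_tc = n**3 / n1**3     # Work(p) / Work(1); equals p here
    print(f"p={p:4d}  n={n:9.0f}  Speedup_TC={speedup_tc:6.1f}")
```

Because the work measure (n^3) scales linearly with sequential complexity, ideal Speedup_TC comes out linear in p, which is exactly the property the slide asks of a work measure.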
Memory Constrained Scaling
° Scale so memory usage per processor stays fixed
° Scaled Speedup: Time(1) / Time(p) for scaled-up problem
• Hard to measure Time(1), and inappropriate
Speedup_MC(p) = [Work(p) / Work(1)] x [Time(1) / Time(p)] = Increase in Work / Increase in Time
° Can lead to large increases in execution time
• If work grows faster than linearly in memory usage
• e.g. matrix factorization
- A 10,000-by-10,000 matrix takes 800 MB and 1 hour on a uniprocessor. With 1,000 processors, can run a 320K-by-320K matrix, but ideal parallel time grows to 32 hours!
- With 10,000 processors, 100 hours ...
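A worked check of the slide's numbers, assuming memory grows as n^2 and work as n^3 for dense factorization (standard assumptions for this example, not spelled out on the slide):

```python
import math

n1, t1_hours = 10_000, 1.0                 # 800 MB, 1 hour at p = 1
for p in (1_000, 10_000):
    n = n1 * math.sqrt(p)                  # fixed memory per processor: n ~ sqrt(p)
    work_growth = (n / n1) ** 3            # work ~ n^3
    ideal_time = t1_hours * work_growth / p
    print(f"p={p:6d}  n={n:9.0f}  ideal time = {ideal_time:5.1f} hours")
# -> n ~ 316K (~320K) and ~32 hours at p=1,000; n = 1M and 100 hours at p=10,000.
```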
Scaling Summary
° Under any scaling rule, the relative structure of the problem changes with P
• PC scaling: per-processor portion gets smaller
• MC & TC scaling: total problem gets larger
° Need to understand hardware/software interactions with scale
° For a given problem, there is often a natural scaling rule
• example: equal error scaling
Types of Workloads
• Kernels: matrix factorization, FFT, depth-first tree search
• Complete Applications: ocean simulation, crew scheduling, database
• Multiprogrammed Workloads
° Spectrum of workloads, from microbenchmarks up to multiprogrammed workloads:
• Multiprog. workloads and complete applications: realistic, complex; higher-level interactions are what really matter
• Kernels and microbenchmarks: easier to understand, controlled, repeatable; expose basic machine characteristics
Each has its place: use kernels and microbenchmarks to gain understanding, but applications to evaluate effectiveness and performance
Coverage: Stressing Features
° Easy to mislead with workloads
• Choose those with features for which machine is good, avoid others
° Some features of interest:
• Compute v. memory v. communication v. I/O bound
• Working set size and spatial locality
• Local memory and communication bandwidth needs
• Importance of communication latency
• Fine-grained or coarse-grained
- Data access, communication, task size
• Synchronization patterns and granularity
• Contention
• Communication patterns
° Choose workloads that cover a range of properties
Coverage: Levels of Optimization
° Many ways in which an application can be suboptimal
• Algorithmic, e.g. assignment, blocking (e.g. 2n/p vs. 4n/p)
• Data structuring, e.g. 2-d or 4-d arrays for SAS grid problem
• Data layout, distribution and alignment, even if properly structured
• Orchestration
- contention
- long versus short messages
- synchronization frequency and cost, ...
• Also, random problems with “unimportant” data structures
° Optimizing applications takes work
• Many practical applications may not be very well optimized
Concurrency
° Should have enough to utilize the processors
• If load imbalance dominates, there may not be much the machine can do
• (Still, useful to know what kinds of workloads/configurations don't have enough concurrency)
° Algorithmic speedup: useful measure of concurrency/imbalance
• Speedup (under scaling model) assuming all memory/communication operations take zero time
• Ignores memory system, measures imbalance and extra work
• Uses PRAM machine model (Parallel Random Access Machine)
- Unrealistic, but widely used for theoretical algorithm development
° At least, should isolate performance limitations due to program characteristics that a machine cannot do much about (concurrency) from those that it can.
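As a sketch, algorithmic speedup can be computed directly from per-processor operation counts: with memory and communication free (the PRAM assumption), the slowest processor sets the time. The counts below are invented for illustration.

```python
def algorithmic_speedup(seq_ops, per_proc_ops):
    # Memory/communication cost nothing, so parallel time is just the
    # maximum operation count; imbalance and extra work are all that remain.
    return seq_ops / max(per_proc_ops)

ops = 1_000_000
print(algorithmic_speedup(ops, [ops // 16] * 16))              # 16.0: perfect balance
print(algorithmic_speedup(ops, [ops // 16] * 15 + [110_000]))  # ~9.1: one straggler
```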
Steps in Choosing Problem Sizes
° Variation of characteristics with problem size usually smooth
• So, for inherent comm. and load balance, pick some sizes along range
° Interactions of locality with architecture often have thresholds (knees)
• Greatly affect characteristics like local traffic, artifactual comm.
• May require problem sizes to be added
- to ensure both sides of a knee are captured
• But also help prune the design space
Our Cache Sizes (16x1MB, 16x64KB)
[Figure: miss rate (%) vs. cache size (1 K-1,024 K) for (a) LU, (b) Ocean, (c) Barnes-Hut, (d) Radiosity, (e) Raytrace, (f) Radix, with the L1 and L2 working-set knees marked on each curve.]
Multiprocessor Simulation
° Simulation runs on a uniprocessor (can be parallelized too)
• Simulated processes are interleaved on the processor
° Two parts to a simulator:
• Reference generator: plays role of simulated processors
- And schedules simulated processes based on simulated time
• Simulator of extended memory hierarchy
- Simulates operations (references, commands) issued by reference generator
° Coupling or information flow between the two parts varies
• Trace-driven simulation: from generator to simulator
• Execution-driven simulation: in both directions (more accurate)
° Simulator keeps track of simulated time and detailed statistics
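A minimal sketch of the trace-driven flavor (the cache organization, trace format, and latencies are all assumptions for illustration): references flow one way, from a fixed trace into the memory-hierarchy model, which accumulates simulated time and statistics.

```python
class DirectMappedCache:
    def __init__(self, nsets, block_words=4):
        self.nsets, self.block = nsets, block_words
        self.tags = [None] * nsets

    def probe(self, addr):
        """Return True on hit; install the block on a miss."""
        blk = addr // self.block
        idx, tag = blk % self.nsets, blk // self.nsets
        if self.tags[idx] == tag:
            return True
        self.tags[idx] = tag
        return False

def simulate_trace(trace, nprocs, miss_penalty=50):
    caches = [DirectMappedCache(nsets=256) for _ in range(nprocs)]
    cycles = misses = 0
    for pid, addr in trace:               # trace entries: (processor id, address)
        cycles += 1                       # assumed 1 cycle per reference
        if not caches[pid].probe(addr):
            misses += 1
            cycles += miss_penalty        # fixed memory latency
    return cycles, misses

trace = [(0, a) for a in range(0, 4096, 4)] * 2   # processor 0 sweeps a big array twice
print(simulate_trace(trace, nprocs=1))            # -> (104448, 2048): every sweep reference misses
```

Execution-driven simulation (next slide) additionally feeds each reference's computed latency back to the generator, so the interleaving of the simulated processes depends on the memory system being modeled.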
Execution-driven Simulation
° Memory hierarchy simulator returns simulated time information to reference generator, which is used to schedule simulated processes
[Diagram: the reference generator drives p simulated processors P1 ... Pp, each with its own cache $1 ... $p and memory Mem 1 ... Mem p, all connected through a network to the memory and interconnect simulator.]
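A minimal sketch of that feedback loop (the process model and latency function are illustrative assumptions): simulated processes are generators that yield memory references, and the latency the memory/interconnect model returns decides when each process is next scheduled.

```python
import heapq

def execution_driven(procs, latency_of):
    """procs: generators yielding addresses; latency_of(pid, addr) -> cycles."""
    ready = [(0, pid) for pid in range(len(procs))]   # (wakeup time, pid)
    heapq.heapify(ready)
    now = 0
    while ready:
        now, pid = heapq.heappop(ready)   # run whichever process is earliest
        try:
            addr = next(procs[pid])       # advance it to its next reference
        except StopIteration:
            continue                      # this process has finished
        # Feedback: the simulated latency schedules the process's next turn.
        heapq.heappush(ready, (now + latency_of(pid, addr), pid))
    return now                            # total simulated time

# Toy run: two processes; odd ("remote") addresses cost 10 cycles, even cost 1.
p0 = (a for a in (0, 2, 4))
p1 = (a for a in (1, 3, 5))
print(execution_driven([p0, p1], lambda pid, a: 1 if a % 2 == 0 else 10))
```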
Summary
° Evaluate design tradeoffs
• many underlying design choices
• prove coherence, consistency
° Evaluation must be based on sound understanding of workloads
• drive the factors you want to study
• representative
• scaling factors
° Use of workload-driven evaluation to resolve architectural questions