Parallelization of Stochastic Metaheuristics to Achieve Linear Speed-ups while Maintaining Quality
Course Project Presentation:
Mustafa Imran Ali
Ali Mustafa Zaidi
Outline of Presentation

- Brief Introduction
  - Motivation for this problem
  - Simulated Annealing (SA)
  - Simulated Evolution (SimE)
- Related Work
  - Classification of Parallel Strategies
  - Previous efforts
- Our Efforts and Results
  - Low-Level Parallelization of Both Heuristics
  - Domain-Decomposition Parallelization of SimE
  - Multithreaded-Search Implementations of SA and SimE
- Conclusions Drawn, Future Work

Motivation

- Stochastic heuristics are utilized to solve a wide variety of combinatorial optimization problems
  - VLSI CAD, Operations Research, Network Design, etc.
- These heuristics attempt to find near-optimal solutions by performing an 'intelligent search' of the 'solution space'
  - Each algorithm has a built-in 'intelligence' that allows it to move towards successively better solutions
- However, they usually require large execution times to reach near-optimal solutions

Motivation

- Parallelization of heuristics aims to achieve one of two basic goals:
  - Achieve speedup
  - Achieve better quality solutions
- Parallelization of heuristics is different from data and functional parallelism:
  - Different strategies alter the properties of the basic algorithm in different ways
  - Thus, the achievable solution quality is sensitive to the parallelization strategy
- The key is to select/develop the parallelization strategy that has the maximum positive impact on the heuristic, with respect to the goals we are trying to achieve:
  - For better quality solutions, the strategy should enhance algorithmic "intelligence"
  - For better runtimes, the strategy should focus on reducing the work done per processor

Our Objectives

- Our goal is to explore, implement and develop parallelization strategies that allow us to:
  - Achieve near-linear speedup (or best effort)
  - Without sacrificing quality (fixed constraint)
- Speedup trends should be as scalable as possible

Approaches Taken and Considerations

- In order to effectively consider this problem, we must look at it in terms of several aspects:
  - The nature of the heuristics themselves
  - The nature of the parallel environment
  - The nature of the problem instance and cost functions
- All three factors influence both the
  - runtime, and
  - achievable solution quality
  of any parallelization strategy.
- For our task, we must optimize with respect to all three factors.

Introduction to Simulated Annealing

- Basic Simulated Annealing Algorithm (with Metropolis loop)
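
Since the algorithm figure is not reproduced in this transcript, the following minimal Python sketch (not the project's actual placement code) illustrates the generic SA structure with a Metropolis inner loop; the cost and random_neighbor functions and all parameter values are illustrative placeholders.

    import math
    import random

    def simulated_annealing(initial, cost, random_neighbor,
                            T0=100.0, alpha=0.95, M=100, T_min=0.1):
        """Generic SA with a Metropolis inner loop (illustrative parameters)."""
        current, current_cost = initial, cost(initial)
        best, best_cost = current, current_cost
        T = T0
        while T > T_min:                      # outer cooling loop
            for _ in range(M):                # Metropolis loop: M trials per temperature
                cand = random_neighbor(current)
                delta = cost(cand) - current_cost
                # downhill moves always accepted; uphill with probability exp(-delta/T)
                if delta <= 0 or random.random() < math.exp(-delta / T):
                    current, current_cost = cand, current_cost + delta
                    if current_cost < best_cost:
                        best, best_cost = current, current_cost
            T *= alpha                        # geometric cooling schedule
        return best, best_cost
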
Introduction to Simulated Evolution

- Basic Simulated Evolution Algorithm
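
Likewise, a minimal sketch of the SimE loop (Evaluation, Selection, Allocation); the goodness and allocate functions are hypothetical placeholders, and the selection rule is the usual goodness-biased coin flip rather than the project's tuned version.

    import random

    def sim_evolution(placement, goodness, allocate, bias=0.0, iterations=100):
        """Generic SimE: Evaluation, Selection and Allocation in each iteration."""
        for _ in range(iterations):
            # Evaluation: goodness of each cell, in [0, 1]
            g = {cell: goodness(cell, placement) for cell in placement}
            # Selection: poorly placed (low-goodness) cells are more likely to be picked
            selected = [c for c in placement if random.random() > g[c] + bias]
            # Allocation: try to move each selected cell to a better location
            for cell in selected:
                allocate(cell, placement)
        return placement
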
Related Work

- Classification of parallelization strategies for metaheuristics [ref]:
  1. Low-Level Parallelization
  2. Domain Decomposition
  3. Parallel or Multithreaded Search
- Previous work done for parallel SA:
  - Extensive – general as well as problem-specific
  - All three approaches tried; Type 3 is the most promising and still an active area of research
- Previous work done for parallel SimE:
  - Minimal – a Type 2 parallelization strategy proposed by the designers of SimE

Our Efforts

- Starting point for our work this semester:
  - Basic version of Type 3 parallel SA
  - Basic version of Type 2 parallel SimE
- Parallelization of SA:
  - Several enhanced versions of Type 3
  - Implementation of Type 1
  - Type 2 not implemented because…
- Parallelization of SimE:
  - Several enhancements to the basic Type 2
  - Implementation of Type 1
  - Implementation of Type 3

Basic Type 3 Parallel SA

- Based on the Asynchronous Multiple Markov-chain Scheme developed in [ref]
  - The best Type 3 scheme developed for SA to date
  - Primarily intended for improving the solution qualities achievable over the serial version

Basic Type 3 Parallel SA Algorithm
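
As a rough illustration of the multiple-Markov-chain idea (shown with synchronous exchanges for brevity, whereas the scheme in [ref] is asynchronous), each rank runs its own SA chain and periodically adopts the globally best solution; mpi4py is assumed, and run_sa_segment is a placeholder for a chunk of serial SA such as the sketch above.

    from mpi4py import MPI

    def parallel_sa_type3(initial, run_sa_segment, segments=10):
        """Each rank runs its own SA chain; all ranks periodically adopt the global best."""
        comm = MPI.COMM_WORLD
        current = initial
        for _ in range(segments):
            # run a chunk of serial SA on this rank's own chain -> (solution, cost)
            current, current_cost = run_sa_segment(current)
            # exchange best-so-far solutions among all ranks
            candidates = comm.allgather((current_cost, current))
            best_cost, current = min(candidates, key=lambda t: t[0])
        return current
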
Type 3 Parallel SA – Second Version

- Strategy 2 - Speed-up-Oriented Type 3 Parallel SA
  - From the above starting point, we saw that the high-quality characteristics of the basic Type 3 scheme may be exploited to produce a speed-up-oriented version
  - Expected to be capable of achieving quality equivalent to the serial version, but not better, while providing near-linear runtimes
  - Near-linear runtimes are forced by dividing the workload on each processor by the number of processors:
    - M/p Metropolis iterations instead of M

Speed-up-Oriented Type 3 Parallel SA

- Results consistently show a 10% drop in achievable solution quality from the serial version
- Runtimes show near-linear speedups, as expected

Lessons Learned from Strategy 2

- We reasoned that:
  - The 10% quality drop occurs due to a negative impact on the 'intelligence' of the parallel SA
  - To restore the achievable quality, we must counteract this effect
- Intelligence of SA:
  - Lies in the "Cooling Schedule" [ref Dr. Sait's book]
  - Division of the workload directly tampers with the cooling schedule
- Proposed solution:
  - Attempt to optimize the cooling schedule to take the parallel environment into account – a "multi-dimensional" cooling schedule
  - To maintain speedup, M/p remains unchanged, while other parameters are varied across processors

Third Version of Type 3 Parallel SA

- Strategy 3 – varying parameter settings across different processors
  - Expected that this would result in a more effective search of the solution space
  - Several sets of parameter settings were tried, primarily by varying α and T across processors:
    - Processors with a higher T and lower α perform a more random search
    - Processors with a lower T and higher α behave more greedily
    - Intermittent sharing of information should diversify the search from the same position
    (see the per-processor parameter-assignment sketch after this slide)
- Results obtained from these versions:
  - NO IMPROVEMENT OF QUALITY OVER STRATEGY 2
  - Show results (hadeed_1)
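
A hedged illustration of such per-processor parameter variation (the actual settings tried are not recorded here): each MPI rank derives its own (T0, α) pair, interpolating from a "random" setting on rank 0 to a "greedy" setting on the last rank.

    from mpi4py import MPI

    def per_rank_parameters():
        """Assign each rank its own (T0, alpha); values are made up for illustration."""
        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()
        f = rank / max(size - 1, 1)              # 0.0 on rank 0 ... 1.0 on the last rank
        T0 = 500.0 * (1.0 - f) + 50.0 * f        # high T0 -> more random search
        alpha = 0.80 * (1.0 - f) + 0.99 * f      # high alpha -> slower cooling at the greedy end
        return T0, alpha
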
Lessons Learned from Strategy 3

- Based on the last two attempts, we reasoned that:
  - M/p drastically reduces the time spent searching for better solutions at a given temperature:
    - Lower temperatures are reached quicker
    - Adverse effect on the hill-climbing property of the heuristic
  - Simple division of M by p is inadequate for sustaining quality
- What to do next?
  - Develop techniques that minimize the adverse effect on algorithmic intelligence
  - Two different tracks possible:
    1. Type 1 parallel SA
    2. Further study the runtime vs. quality trends of Type 3 parallel SA

What to do next?

- Type 1 Parallel SA
  - Pros:
    - Leaves the algorithmic intelligence intact
    - Solution quality guaranteed
  - Cons:
    - High communication frequency adversely affects runtime
    - Environmental factors come into play
- Further explore the dynamics of Type 3
  - Pros:
    - Low communication frequency
    - Suitable for a cluster environment (see chart)
  - Cons:
    - Uncharted territory – progress not assured

Type 1 Parallel SA

- Low-level parallelization: divide the cost computation function
- Our cost function has 3 parts: wirelength, power, delay
  - The first two are easy to divide
  - Division of the delay computation posed a significant challenge:
    - Too many replicated computations across processors negated the benefits of division
    - Eventually it had to be excluded
  (a sketch of the divided wirelength/power cost evaluation follows)
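
A minimal sketch of this low-level division, assuming mpi4py and hypothetical per-net wirelength and power helpers (delay is left out, as it was in the project): each rank evaluates a slice of the nets and the partial sums are combined with allreduce.

    from mpi4py import MPI

    def parallel_wirelength_power(nets, placement, wl_of_net, power_of_net):
        """Each rank sums wirelength/power over its slice of nets; allreduce combines them."""
        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()
        my_nets = nets[rank::size]                         # round-robin slice of the nets
        local_wl = sum(wl_of_net(n, placement) for n in my_nets)
        local_pw = sum(power_of_net(n, placement) for n in my_nets)
        # two collective calls per cost evaluation - this is the communication overhead
        wirelength = comm.allreduce(local_wl, op=MPI.SUM)
        power = comm.allreduce(local_pw, op=MPI.SUM)
        return wirelength, power
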
Results of Type 1 Parallel SA

- Performance of Type 1 parallel SA: abysmal!
- Reasons:
  - 2 collective communications per iteration
  - The amount of work divided is small compared to the communication time
  - The communication delay increases with increasing processors (see chart)
- Found to be completely unsuitable for:
  - the MIMD-DM environment, and
  - our problem instance

(Table: circuit name, number of cells, runtime of serial SA, and runtimes of Type 1 parallel SA for p = 2 to 7. Only fragments of the table survive extraction; they show the serial SA runtime (about 42.9) dwarfed by parallel runtimes in the 1816–2217 range.)

Further Exploring Type 3 Parallel SA

- To improve the achievable quality of our Type 3 parallel SA:
  - In-depth study of the impact of the parameter M on achievable solution quality
  - All experiments were first attempted for the serial version, then replicated in the parallel version
  - Based on what we know of Type 3 parallelization schemes, see how any new lessons can be incorporated

Impact of ‘M’ on Solution Quality (Serial)
(Chart: serial run characteristics vs. division factor – quality vs. time for division factors 1, 9, 17, 25 and 57.)

Impact of ‘M’ on Solution Quality (Serial)
(Chart: serial run characteristics vs. division factor, zoomed to the early time region – quality vs. time for division factors 9, 17, 25 and 57.)

Impact of ‘M’ on Solution Quality (Parallel, 7 processors)

(Chart: parallel (p = 7) run quality vs. time for division factors 1, 9, 17, 25 and 57.)

Impact of ‘M’ on Solution Quality (Parallel, 7 processors)

(Chart: parallel (p = 7) run quality vs. time, zoomed to the early time region, for division factors 9, 17, 25 and 57.)

Observations

- We see that, initially, the fastest improvement in quality appears with the smallest M
- However, quality saturates earlier with a smaller M
- Thus it might be beneficial to increase M as time progresses
  - But by how much? And how do we minimize the runtime?

Learning From Experiments

- Another observation helps:
  - Until saturation, the rate of improvement is nearly constant (per Metropolis call)
  - This holds across all runs for a given circuit
- Thus, the best way to minimize time while sustaining quality improvement:
  - Set the value of M adaptively, such that the average rate of improvement remains constant
  - This is a new enhancement to serial SA (a sketch of the adaptive-M rule follows this slide)
- Since Type 3 parallel SA improves faster than the serial version, it is expected that some speedup will be observed
- Experiments are still under way – parameter tuning is being done
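
A minimal sketch of the adaptive-M rule described above (the target rate, growth factor and cap are illustrative, not the tuned project values): after each Metropolis segment, M is increased whenever the observed improvement per Metropolis call drops below the target.

    def adapt_M(M, quality_gain, metropolis_calls, target_rate, growth=1.5, M_max=10000):
        """Grow M when the average improvement per Metropolis call falls below target_rate."""
        rate = quality_gain / max(metropolis_calls, 1)
        if rate < target_rate:
            M = min(int(M * growth), M_max)   # spend longer at each temperature
        return M
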
Preliminary Results

(Chart: "Adaptive Type 3 Scheme" – quality vs. time for the serial version and the parallel version on 7 processors.)

Preliminary Results

- The results are similar to those of the original implementation
- The enhancement to serial SA mitigates the observed benefits of the parallel version
  - Although further parameter tuning and code refinements may improve the parallel results even more

Conclusions Drawn from Experiments

- Type 3 parallelization of SA is the most robust:
  - Minimum susceptibility to the environment and the problem instance
  - Type 1 fails due to its unsuitability to the environment and to the problem instance
- For parallel SA, a direct trade-off exists between achievable solution quality and speedup:
  - Depending on the quality desired, linear speedup is possible
  - For the highest quality, the speedup diminishes to 1 in most cases
- Further experimentation is needed to verify these points

Parallel Simulated Evolution Strategies
Base Work for SimE

- Type II (Domain Decomposition) implementation
- Problem: poor runtimes, with quality degradation
- Improvement proposed: use of random row allocation instead of the fixed row allocation used in previous work
- Results: improvement in quality over fixed row allocation, but still short of serial quality

Type II Quality Issues

- Observation: parallel quality will always lag behind serial quality
  - Reason: division of the solution into domains restricts optimal cell movement (worsens with more processors/partitions)
- Therefore, focus on improving runtime!

Type II Runtime Improvements

- How?
- Reduce communication time:
  - Reduce the frequency of communications
  - Reduce the amount of data communicated
  - Overlap communication with computation
- Reduce computations:
  - Can the workload be divided better still?
  - How will workload division affect the communication requirements?

Type II SimE Algorithm Structure

(Figure: structure of the Type II SimE algorithm, with its communication and cost-computation phases; a sketch follows.)
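
A hedged sketch of this structure, assuming mpi4py, a placement stored as cell -> (row, slot), and placeholder goodness/allocate helpers (the selection rule here is deliberately simplistic, and the real implementation also randomizes the row allocation):

    from mpi4py import MPI

    def type2_sime_iteration(placement, num_rows, goodness, allocate):
        """One Type II iteration: E/S/A on this rank's row partition, then exchange."""
        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()
        my_rows = set(range(rank, num_rows, size))            # this rank's rows
        my_cells = [c for c, (row, _) in placement.items() if row in my_rows]
        for cell in my_cells:
            # simplistic selection rule for brevity; real SimE selection is probabilistic
            if goodness(cell, placement) < 0.5:
                allocate(cell, placement, my_rows)            # re-place within own rows only
        # communication phase: gather local updates at the master, then broadcast
        local_update = {c: placement[c] for c in my_cells}
        for update in comm.gather(local_update, root=0) or []:
            placement.update(update)
        return comm.bcast(placement, root=0)
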
Type II Communication Optimization

- No potential for computation/communication overlap:
  - Implicit barrier synchronization at the communication points
- Possibility of reducing communication frequency over multiple iterations:
  - Do multiple operations (E, S, A) on the assigned partition before communicating
  - Impact on solution quality due to accumulated errors
  - The actual impact on solution quality vs. runtime improvement is not presently quantified (future work)

Type II Communication Optimization (2)

- Reducing communication frequency by combining the gather and broadcast operations into one MPI call
  - The efficiency of the collective call was not much superior in the MPI implementation used
  (a sketch of the combined collective follows this slide)
- Reduce the data communicated per call:
  - Significant impact on runtime due to the barrier synchronization
  - Essentially compressing the placement data!
  - Trade-off: added computations in the data pre-processing and post-processing steps
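
A minimal sketch of the combined collective (mpi4py assumed): the separate gather-to-master and broadcast are replaced by a single allgather, after which every rank merges every partition's update itself.

    from mpi4py import MPI

    def exchange_updates(local_update, placement):
        """Replace gather + broadcast with one allgather collective."""
        comm = MPI.COMM_WORLD
        for update in comm.allgather(local_update):   # one collective instead of two
            placement.update(update)                  # every rank merges all partitions
        return placement
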
Type II Computation Optimization

- Can the workload be divided better still?
  - Evaluation, Selection and Allocation are already localized (divided)
  - What about the cost computation?
- Goodness evaluation needs the computed costs:
  - Dependencies exist across partitions
  - Delay is computed over long paths, which span partitions
- Wirelength & power computation:
  - Potential for cost-computation division

Wirelength & Power Cost Division

- More independent of other partitions than the delay computation
- The effect of computation division can be readily evaluated for wirelength and power (within the limited time constraints)
- Trade-off: added computation and communication
  - Partial costs need to be communicated to the master
  - An additional communication phase is added per iteration!
- Actual benefit over non-division: the results will tell!

Results & Observations

s1196 (561 cells, quality 0.644)
  Original Type II:              P=1: 54 sec    P=3: ~44 sec   P=5: ~50 sec
  Communication Optimized:                      P=3: ~34 sec   P=5: ~44 sec
  Communication + Computation:                  P=3: ~30 sec   P=5: ~41 sec

s1238 (540 cells, quality 0.680)
  Original Type II:              P=1: 60 sec    P=3: ~45 sec   P=5: ~55 sec
  Communication Optimized:                      P=3: ~39 sec   P=5: ~46 sec
  Communication + Computation:                  P=3: ~34 sec   P=5: ~42 sec

- The effect of communication worsens speed-ups with increasing processors:
  - Communication time overshadows the gain from computation division
- Optimizing communication resulted in greater gains than cost-computation division:
  - Dependencies do not allow much gain

Targeting Better Solution Qualities

- Type II parallelization has poor quality issues
- How to maintain quality?
- Type I parallelization:
  - The same quality as the serial implementation is guaranteed
  - Speed-ups are governed by the gains resulting from division of the cost computation
- Type III parallelization:
  - Can we benefit from parallel cooperating searches?

Type I Parallelization

- Target as much computation division as possible without dividing the SimE algorithm's core "intelligence"
- Computations that do not affect the intelligence:
  - Cost computations
  - Goodness evaluations
- Computations that affect the intelligence:
  - Selection function
  - Allocation function

Type I Parallelization

- Again, wirelength & power are easier to divide than delay (same reasoning as for Type II)
- Workload division is achieved by dividing the cells among the PEs:
  - Each PE computes the costs and goodness values for a fraction of the cells
  - The division of cells is done at the beginning and remains fixed
  - Slave PEs communicate their goodness values to the master PE, which performs Selection & Allocation
  (a sketch of this master/slave division follows this slide)
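
A minimal sketch of this division, assuming mpi4py and placeholder goodness and select_and_allocate functions: slaves evaluate goodness for their fixed share of the cells, the master gathers the values and runs Selection and Allocation serially, and the updated placement is broadcast back.

    from mpi4py import MPI

    def type1_sime_iteration(placement, cells, goodness, select_and_allocate):
        """Goodness evaluation is divided; Selection & Allocation stay on the master."""
        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()
        my_cells = cells[rank::size]                      # fixed division (done once in practice)
        local_g = {c: goodness(c, placement) for c in my_cells}
        gathered = comm.gather(local_g, root=0)
        if rank == 0:
            all_g = {}
            for part in gathered:
                all_g.update(part)
            select_and_allocate(placement, all_g)         # SimE "intelligence" untouched
        return comm.bcast(placement, root=0)              # everyone gets the new placement
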
Type I Parallelization Results
SimE Type I Parallelization

Circuit   Cells   Quality   P=1       P=3       P=5       P=7
s1196     561     0.762     ~67 sec   ~53 sec   ~44 sec   ~45 sec
s1238     540     0.799     ~72 sec   ~49 sec   ~35 sec   ~40 sec

- Runtimes are not drastically reduced:
  - Allocation takes a significant portion of the overall computation time
- Speed-ups are again limited by communication:
  - Not by the amount of data communicated, but by the greater overheads with more participants

Type III Parallelization

- Parallel cooperating searches, communicating their best solutions through a master
- Modeled after the parallel SA scheme:
  - A Metropolis loop can be equated with a SimE compound move
- The intelligence of SimE can benefit from the best solution produced among all the searches:
  - May lead to more rapid convergence to better quality solutions

Type III Parallelization Results
SimE Type III Parallelization

Circuit   Cells   Quality   P=1       P=3       P=5       P=7
s1196     561     0.694     ~58 sec   ~57 sec   ~60 sec   ~61 sec
s1238     540     0.709     ~65 sec   ~66 sec   ~64 sec   ~67 sec

- The results are quite contrary to expectations:
  - The highest quality obtained is inferior to the serial algorithm
  - No speed-ups were observed
- Parallel searches used as-is do not provide a benefit:
  - Possible explanation: the greedy behavior of the parallel searches is counterproductive to SimE's intelligence (inability to escape local minima)

Improving Type III Parallelization

- Proposed schemes for improved SimE parallel searches:
  1. Varying the frequency of communication with time
     - Initially, exchanges can be frequent, with the frequency decreasing over time to allow more diversification
  2. Intelligently combining the good elements of each solution to get a new starting solution
     - The best location for each cell (or a cluster) can be identified by examining and comparing the goodness values among the solutions received from peers, and constructing a good solution to improve further
     (a sketch of this goodness-based recombination follows)
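
A minimal sketch of the proposed recombination (scheme 2), with a placeholder goodness function and conflict handling (two cells claiming the same slot) omitted for brevity: each cell takes the location from whichever peer solution places it best.

    def combine_solutions(solutions, cells, goodness):
        """Build a new starting solution from the best-placed copy of each cell."""
        combined = {}
        for cell in cells:
            # pick the peer placement in which this cell has the highest goodness
            best = max(solutions, key=lambda s: goodness(cell, s))
            combined[cell] = best[cell]
        return combined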