
Applications and Runtime for
multicore/manycore
March 21 2007
Geoffrey Fox
Community Grids Laboratory
Indiana University
505 N Morton Suite 224
Bloomington IN
[email protected]
http://grids.ucs.indiana.edu/ptliupages/presentations/
RMS: Recognition Mining Synthesis
• Recognition (“What is …?”): Model
  – Today: model-less; Tomorrow: model-based multimodal recognition
• Mining (“Is it …?”): Find a model instance
  – Today: real-time streaming and transactions on static, structured datasets; Tomorrow: real-time analytics on dynamic, unstructured, multimodal datasets
• Synthesis (“What if …?”): Create a model instance
  – Today: very limited realism; Tomorrow: photo-realism and physics-based animation
Pradeep K. Dubey, [email protected]
Discussed in seminars at http://grids.ucs.indiana.edu/ptliupages/presentations/PC2007/; the rest is mainly classic parallel computing.
Intel’s Application Stack
Some Bioinformatics Datamining
• 1. Multiple Sequence Alignment (MSA)
– Kernel Algorithms
• HMM (Hidden Markov Model)
• Pairwise alignments (dynamic programming) with heuristics (e.g.
progressive, iterative methods); see the sketch after this list
• 2. Motif Discovery
– Kernel Algorithms:
• MEME (Multiple Expectation Maximization for Motif Elicitation)
• Gibbs sampler
• 3. Gene Finding (Prediction)
– Hidden Markov Methods
• 4. Sequence Database Search
– Kernel Algorithms
• BLAST (Basic Local Alignment Search Tool)
• PatternHunter
• FASTA
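As a concrete illustration of the pairwise-alignment kernel referred to above, here is a minimal Needleman-Wunsch style dynamic program in C#. The scoring values (match +1, mismatch -1, gap -2) and the example sequences are illustrative choices, not parameters from the talk.

using System;

// Minimal global-alignment (Needleman-Wunsch style) scoring kernel:
// the dynamic-programming core behind the pairwise alignments listed above.
// Scoring values are illustrative only.
class PairwiseAlignment
{
    static int Score(string a, string b, int match, int mismatch, int gap)
    {
        int[,] dp = new int[a.Length + 1, b.Length + 1];
        for (int i = 1; i <= a.Length; i++) dp[i, 0] = i * gap;   // leading gaps in b
        for (int j = 1; j <= b.Length; j++) dp[0, j] = j * gap;   // leading gaps in a
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
            {
                int diag = dp[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch);
                int up   = dp[i - 1, j] + gap;
                int left = dp[i, j - 1] + gap;
                dp[i, j] = Math.Max(diag, Math.Max(up, left));
            }
        return dp[a.Length, b.Length];                            // optimal alignment score
    }

    static void Main()
    {
        Console.WriteLine(Score("GATTACA", "GCATGCU", 1, -1, -2));
    }
}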
Berkeley Dwarfs
• Dense Linear Algebra
• Sparse Linear Algebra
• Spectral Methods
• N-Body Methods
• Structured Grids
• Unstructured Grids
• Pleasingly Parallel
• Combinatorial Logic
• Graph Traversal
• Dynamic Programming
• Branch & Bound
• Graphical Models (HMM)
• Finite State Machine
Consistent in spirit with the Intel analysis
I prefer to develop a few key applications rather
than debate their classification!
Client side Multicore applications
• “Lots of not very parallel applications”
• Gaming; graphics; codec conversion for multiple-user conferencing ……
• Complex data querying and data manipulation/optimization/regression; database and datamining (including computer vision) (Recognition and Mining in the Intel analysis)
  – Statistical packages as in Excel and R
• Scenario and model simulations (Synthesis for Intel)
• Multiple users give rise to several server-side multicore applications
• There are important architecture issues, including memory bandwidth, not discussed here!
Approach I
• Integrate Intel, Berkeley and other sources, including databases (successful on current parallel machines, like scientific applications), and define parallel approaches in a “white paper”
• Develop some key examples testing 3 parallel programming paradigms (the data-parallel style is sketched after this list)
  – Coarse-grain functional parallelism (as in workflow), including pleasingly parallel instances with different data
  – Fine-grain functional parallelism (as in Integer Programming)
  – Data parallel (Loosely Synchronous, as in Science)
• Construct them so they can use different runtimes, including perhaps CCR/DSS, MPI, Data Parallel .NET
• Maybe these will become libraries used as in MapReduce, workflow coordination languages ….
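As a minimal sketch (not from the talk) of the data-parallel, loosely synchronous style listed above, the fragment below splits one array across a thread per core using plain .NET threads; the array contents and the square-root work are arbitrary placeholders.

using System;
using System.Threading;

// Illustrative only: the "data parallel (loosely synchronous)" style from the
// list above, sketched with plain .NET threads. Each thread owns a contiguous
// block of the data; the Join calls play the role of the per-stage synchronization.
class DataParallelSketch
{
    static void Main()
    {
        int cores = Environment.ProcessorCount;
        double[] data = new double[1 << 20];
        Thread[] workers = new Thread[cores];
        for (int t = 0; t < cores; t++)
        {
            int id = t;                                  // capture loop variable per thread
            workers[t] = new Thread(() =>
            {
                int block = data.Length / cores;
                int start = id * block;
                int end = (id == cores - 1) ? data.Length : start + block;
                for (int i = start; i < end; i++)        // each core updates its own block
                    data[i] = Math.Sqrt(i);
            });
            workers[t].Start();
        }
        foreach (Thread w in workers) w.Join();          // loose synchronization point
        Console.WriteLine("done on " + cores + " cores");
    }
}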
Approach II
• Have looked at CCR in MPI-style applications
  – Seems to work quite well and supports more general messaging models
• NAS Benchmark using CCR to confirm its utility
• Developing 4 exemplar multicore parallel applications:
  – Support Vector Machines (linear algebra): Data Parallel
  – Deterministic Annealing (statistical physics): Data Parallel
  – Computer Chess or Mixed Integer Programming: Fine-Grain Parallelism
  – Hidden Markov Method (Genetic Algorithms): Loosely Coupled Functional Parallelism
• Test high-level coordination of such parallel applications in libraries
CCR for Data Parallel (Loosely Synchronous) Applications
• CCR supports general coordination of messages queued in ports, in Handler or Rendezvous mode (a minimal sketch follows this list)
• DSS builds a service model on CCR and supports coarse-grain functional parallelism
• Basic CCR supports fine-grain parallelism as in computer chess (and could use STM-enabled primitives?)
• MPI has well-known collective communication primitives which supply scalable global synchronization etc.
• Look at performance of MPI_Sendrecv
• What model best encompasses shared- and distributed-memory approaches for “data parallel” problems?
  – This could be put on top of CCR?
• Much faster internal versions of CCR
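A minimal sketch of the CCR handler model referred to above, assuming the Microsoft.Ccr.Core assembly that ships with the CCR/DSS runtime. The calls follow the standard CCR samples (Port, Dispatcher, DispatcherQueue, Arbiter.Receive); exact overloads should be checked against the CCR release actually used.

using System;
using System.Threading;
using Microsoft.Ccr.Core;   // CCR assembly from the CCR/DSS runtime (assumption: available locally)

// Sketch of the CCR handler model used in these tests: messages are queued in a
// Port and a persistent Receive arbiter dispatches a handler on a thread pool.
class CcrSketch
{
    static void Main()
    {
        using (Dispatcher dispatcher = new Dispatcher(0, "worker pool"))   // 0 => one thread per core
        {
            DispatcherQueue queue = new DispatcherQueue("queue", dispatcher);
            Port<int> port = new Port<int>();

            // Persistent receive: the handler runs once for every posted item.
            Arbiter.Activate(queue, Arbiter.Receive<int>(true, port,
                delegate(int item) { Console.WriteLine("stage got " + item); }));

            for (int i = 0; i < 4; i++)
                port.Post(i);            // queue messages in the port

            Thread.Sleep(100);           // crude wait so the handlers can run before exit
        }
    }
}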
[Figure: thread/port diagrams of four communication patterns, (a) Pipeline, (b) Shift, (c) Two Shifts, (d) Exchange, each with four threads (Thread0 to Thread3) reading and writing Ports 0 to 3. Used on AMD 4-core, Xeon 4-core and Xeon 8-core systems; the latter allow up to 8-way parallelism.]
Four Communication Patterns used in CCR Tests. (a) and (b) use CCR Receive while (c) and (d) use CCR Multiple Item Receive.
[Figure: Exchange pattern detail, with each of Thread0 to Thread3 writing exchanged messages to Ports 0 to 3 and reading messages back.]
Exchanging Messages with 1D Torus Exchange topology for loosely synchronous execution in CCR.
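To make the Shift / 1D-torus structure above concrete, here is an illustrative stand-in written with plain .NET threads and a hand-rolled blocking port rather than CCR itself: each thread posts one message per stage to its right-hand neighbour's port and then blocks on its own port, which is the loosely synchronous structure the CCR tests time. Names and counts are arbitrary.

using System;
using System.Collections.Generic;
using System.Threading;

// Stand-in for the Shift pattern (not CCR): thread i writes to the port of
// thread (i+1) mod N each stage, then reads the message shifted in from the left.
class ShiftPattern
{
    // A tiny blocking "port": Post enqueues, Receive blocks until an item arrives.
    class SimplePort<T>
    {
        private readonly Queue<T> items = new Queue<T>();
        public void Post(T item) { lock (items) { items.Enqueue(item); Monitor.Pulse(items); } }
        public T Receive() { lock (items) { while (items.Count == 0) Monitor.Wait(items); return items.Dequeue(); } }
    }

    static void Main()
    {
        int n = 4, stages = 1000;
        SimplePort<int>[] ports = new SimplePort<int>[n];
        for (int i = 0; i < n; i++) ports[i] = new SimplePort<int>();

        Thread[] threads = new Thread[n];
        for (int t = 0; t < n; t++)
        {
            int me = t;
            threads[t] = new Thread(() =>
            {
                for (int s = 0; s < stages; s++)
                {
                    // ... per-stage computation would go here ...
                    ports[(me + 1) % n].Post(s);   // write to right neighbour (shift)
                    ports[me].Receive();           // read the message shifted in from the left
                }
            });
            threads[t].Start();
        }
        foreach (Thread th in threads) th.Join();
        Console.WriteLine("completed " + stages + " shift stages on " + n + " threads");
    }
}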
Break a single computation into different numbers of stages, with the computation per stage varying from 1.4 microseconds to 14 seconds for AMD (1.6 microseconds to 16 seconds for the Xeon quad core).
Average Run Time vs. Maxstage (CCR Test Results)
[Plot: average run time (seconds) vs. Maxstage (millions of stages) for the 4-way Pipeline pattern, 4 dispatcher threads, HP Opteron. Annotations mark the computation component if there were no overhead, the point where overhead equals computation, the 1.4 and 14 microsecond per-stage computation cases, and an average overhead of 8.04 microseconds per stage from 1 to 10 million stages.]
Fixed amount of computation (4×10^7 units) divided into 4 cores and from 1 to 10^7 stages on the HP Opteron Multicore. Each stage is separated by reading and writing CCR ports in Pipeline mode.
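The 8.04-microsecond figure quoted above can be recovered from the run time with a one-line calculation; the sketch below does the arithmetic. The 94.4-second total is an illustrative value consistent with the quoted per-stage numbers, not a figure read off the plot.

using System;

// Worked example of how a per-stage overhead is obtained:
// overhead per stage = (measured run time - pure computation time) / number of stages.
class OverheadPerStage
{
    static void Main()
    {
        double stages = 1.0e7;                    // Maxstage
        double computePerStageMicrosec = 1.4;     // fixed total work divided into this many stages
        double measuredRunTimeSec = 94.4;         // illustrative measured total (assumption)
        double computeSec = stages * computePerStageMicrosec * 1e-6;          // 14 s of pure computation
        double overheadPerStage = (measuredRunTimeSec - computeSec) / stages * 1e6;
        Console.WriteLine("overhead per stage = " + overheadPerStage.ToString("F2") + " microseconds");
    }
}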
Stage Overhead versus Thread Computation time
[Plot: average run time (seconds) vs. Maxstage (millions, partial result up to about 1.2 million stages) for the 4-way Pipeline pattern, 4 dispatcher threads, HP Opteron, with 14 microseconds of stage computation. The overhead per stage is roughly constant, 4.63 microseconds averaged from 1 to 1 million stages, and then increases.]
[Plot: average run time (seconds) vs. Maxstage (millions of stages) for the 4-way Pipeline pattern, 4 dispatcher threads, Dell Xeon. Annotations mark the computation component if there were no overhead, the point where overhead equals computation, and an average overhead of 12.40 microseconds per stage from 1 to 10 million stages.]
Fixed amount of computation (4×10^7 units) divided into 4 cores and from 1 to 10^7 stages on the Dell 2-processor, 2-core-each Xeon Multicore. Each stage is separated by reading and writing CCR ports in Pipeline mode.
Summary of Stage Overheads for AMD 2-core 2-processor Machine
These are stage switching overheads for a set of runs with different levels of parallelism and different message patterns; each stage takes about 28 microseconds (500,000 stages).

Stage Overhead (microseconds)
Number of Parallel Computations:       1       2       3       4       8
Straight Pipeline   match            0.77     2.4     3.6     5.0     8.9
Straight Pipeline   default          3.6      4.7     4.4     4.5     8.9
Shift               match            N/A      3.3     3.4     4.7    11.0
Shift               default          N/A      5.1     4.2     4.5     8.6
Two Shifts          match            N/A      4.8     7.7     9.5    26.0
Two Shifts          default          N/A      8.3     9.0     9.7    24.0
Exchange            match            N/A     11.0    15.8    18.3    Error
Exchange            default          N/A     16.8    18.2    18.6    Error
Summary of Stage Overheads for Intel 2-core 2-processor Machine
These are stage switching overheads for a set of runs with different levels of parallelism and different message patterns; each stage takes about 30 microseconds. AMD overheads are in parentheses.
These measurements are equivalent to MPI latencies.

Stage Overhead (microseconds)
Number of Parallel Computations:       1            2            3            4            8
Straight Pipeline   match         1.7 (0.77)    3.3 (2.4)    4.0 (3.6)    9.1 (5.0)    25.9 (8.9)
Straight Pipeline   default       6.9 (3.6)     9.5 (4.7)    7.0 (4.4)    9.1 (4.5)    16.9 (8.9)
Shift               match         N/A           3.4 (3.3)    5.1 (3.4)    9.4 (4.7)    25.0 (11.0)
Shift               default       N/A           9.8 (5.1)    8.9 (4.2)    9.4 (4.5)    11.2 (8.6)
Two Shifts          match         N/A           6.8 (4.8)   13.8 (7.7)   13.4 (9.5)    52.7 (26.0)
Two Shifts          default       N/A          23.1 (8.3)   24.9 (9.0)   13.4 (9.7)    31.5 (24.0)
Exchange            match         N/A          28.0 (11.0)  32.7 (15.8)  41.0 (18.3)   Error (Error)
Exchange            default       N/A          34.6 (16.8)  36.1 (18.2)  41.0 (18.6)   Error (Error)
Summary of Stage Overheads for Intel 4-core 2-processor Machine
These are stage switching overheads for a set of runs with different levels of parallelism and different message patterns; each stage takes about 30 microseconds. 2-core 2-processor Xeon overheads are in parentheses.
These measurements are equivalent to MPI latencies.

Stage Overhead (microseconds)
Number of Parallel Computations:       1            2            3            4            8
Straight Pipeline   match         1.33 (1.7)    4.2 (3.3)    4.3 (4.0)    4.7 (9.1)     6.5 (25.9)
Straight Pipeline   default       6.3 (6.9)     8.4 (9.5)    9.8 (7.0)    9.5 (9.1)     6.5 (16.9)
Shift               match         N/A           4.3 (3.4)    4.5 (5.1)    5.1 (9.4)     7.2 (25.0)
Shift               default       N/A           8.3 (9.8)   10.2 (8.9)   10.0 (9.4)     7.2 (11.2)
Two Shifts          match         N/A           7.5 (6.8)    6.8 (13.8)   8.4 (13.4)   22.8 (52.7)
Two Shifts          default       N/A          20.3 (23.1)  30.4 (24.9)  27.3 (13.4)   23.0 (31.5)
Exchange            match         N/A          26.6 (28.0)  23.6 (32.7)  21.4 (41.0)   33.1 (Error)
Exchange            default       N/A          31.3 (34.6)  38.7 (36.1)  46.0 (41.0)   33.5 (Error)
8-way Parallel Pipeline on two 4-core Xeon
• Histogram of 100 runs -- each run has 500,000 synchronizations, each following a thread execution that takes 33.92 microseconds
  – So the overhead of 6.1 microseconds is modest
[Histograms: frequency of runs vs. overhead in microseconds (averaged over 500,000 consecutive synchronizations) under Windows XP Pro, with a companion histogram for the AMD 4-way case using a 27.94 microsecond computation unit.]
• Message size is just one integer
• Choose a computation unit that is appropriate for a few microseconds of stage overhead
8-way Parallel Shift on two 4-core Xeon
• Histogram of 100 runs -- each run has 500,000 synchronizations, each following a thread execution that takes 33.92 microseconds
  – So the overhead of 8.2 microseconds is modest
[Histograms: frequency of runs vs. overhead in microseconds (averaged over 500,000 consecutive synchronizations) under Windows XP Pro and Vista, with a companion histogram for the AMD 4-way case using a 27.94 microsecond computation unit.]
• Shift versus pipeline adds a microsecond to the cost
• Unclear what causes the second peak
8-way Parallel Double Shift on two 4-core Xeon
• Histogram of 100 runs -- each run has 500,000 synchronizations, each following a thread execution that takes 33.92 microseconds
  – So the overhead of 22.3 microseconds is significant
  – Unclear why the double shift is slow compared to the shift
[Histograms: frequency of runs vs. overhead in microseconds (averaged over 500,000 consecutive synchronizations) under Windows XP Pro, with a companion histogram for the AMD 4-way case using a 27.94 microsecond computation unit.]
• Exchange performance partly reflects the number of messages
• Opteron overheads are significantly lower than Intel
AMD 2-core 2-processor Bandwidth Measurements
Previously we measured latency, as those measurements corresponded to small messages. We did a further set of measurements of bandwidth by exchanging larger messages of different sizes between threads.
We used three types of data structures for receiving data:
• Array in thread, equal to message size
• Array outside thread, equal to message size
• Data stored sequentially in a large array (“stepped” array)
For both AMD and Intel, total bandwidth is 1 to 2 Gigabytes/second.
Bandwidths in Gigabytes/second summed over 4 cores

Number of   Array Inside Thread    Array Outside Threads    Stepped Array Outside Thread    Approx. Compute Time
stages      Small      Large       Small      Large         Small      Large                per stage (µs)
250000      0.90       0.96        1.08       1.09          1.14       1.10                 56.0
200000      0.89       0.99        1.16       1.11          1.14       1.13                 70
5000        1.19       1.15        1.15       1.13          1.13 up to 10^7 words           2800
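An illustrative bandwidth probe in the spirit of these measurements is sketched below: each “stage” copies an array of 10^5 or 10^6 double words and the achieved rate is reported in Gigabytes/second. It uses a plain separate target array rather than the in-thread, out-of-thread and stepped variants of the actual tests, and the stage counts are arbitrary.

using System;
using System.Diagnostics;

// Copies a double array per stage and reports the sustained copy bandwidth.
class BandwidthProbe
{
    static double MeasureGigabytesPerSecond(int words, int stages)
    {
        double[] source = new double[words];
        double[] target = new double[words];
        Stopwatch clock = Stopwatch.StartNew();
        for (int s = 0; s < stages; s++)
            Array.Copy(source, target, words);          // the per-stage message payload copy
        clock.Stop();
        double bytes = (double)words * sizeof(double) * stages;
        return bytes / clock.Elapsed.TotalSeconds / 1e9;
    }

    static void Main()
    {
        Console.WriteLine("Small (1e5 words): " + MeasureGigabytesPerSecond(100000, 1000).ToString("F2") + " GB/s");
        Console.WriteLine("Large (1e6 words): " + MeasureGigabytesPerSecond(1000000, 100).ToString("F2") + " GB/s");
    }
}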
Intel 2-core 2-processor Bandwidth Measurements
For bandwidth, the Intel did better than AMD, especially when one exploited the on-chip cache with small transfers.
For both AMD and Intel, each stage executed a computational task after copying data arrays of size 10^5 (labeled small), 10^6 (labeled large) or 10^7 double words. The last column is an approximate value in microseconds of the compute time for each stage. Note that copying 100,000 double-precision words per core at a gigabyte/second bandwidth takes 3200 µs. The data to be copied (the message payload in CCR) is fixed and its creation time is outside the timed process.
Bandwidths in Gigabytes/second summed over 4 cores

Number of   Array Inside Thread    Array Outside Threads    Stepped Array Outside Thread    Approx. Compute Time
stages      Small      Large       Small      Large         Small      Large                per stage (µs)
250000      0.84       0.75        1.92       0.90          1.18       0.90                 59.5
200000      0.83       0.76        1.89       0.89          1.16       0.89                 74.4
5000        1.21       0.91        1.75       1.0                                           2970
2500        1.74       0.9         2.0        1.07          1.78       1.06                 5950
[Plot: run time (seconds) with array copy vs. array size (millions of double words) for the 4-way Pipeline pattern, 4 dispatcher threads, Dell Xeon, 5,000 stages. A slope change (cache effect) is visible: total bandwidth is 1.75 Gigabytes/sec up to 100,000 double words and 1.0 Gigabytes/sec up to one million double words.]
Typical bandwidth measurement showing the effect of cache as a slope change. Run time is plotted against the size of the double array copied in each stage from the thread to stepped locations in a large array, on the Dell Xeon Multicore.
DSS Service Measurements
[Plot: average run time (microseconds, 0 to 350) vs. number of round trips (1 to 10,000).]
Timing of the HP Opteron Multicore as a function of the number of simultaneous two-way service messages processed (November 2006 DSS release).
CGL measurements of Axis 2 show about 500 microseconds; DSS is 10 times better.
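For comparison with the DSS and Axis 2 numbers above, a round-trip timer can be sketched with plain .NET threads: the version below times in-process two-way messages through a pair of events, so it illustrates only the structure of the measurement and says nothing about DSS itself. The message count is arbitrary.

using System;
using System.Diagnostics;
using System.Threading;

// Times two-way (request/reply) messages between a client and an echo "server" thread.
class RoundTripProbe
{
    static void Main()
    {
        int roundTrips = 10000;
        AutoResetEvent request = new AutoResetEvent(false);
        AutoResetEvent reply = new AutoResetEvent(false);

        Thread server = new Thread(() =>
        {
            for (int i = 0; i < roundTrips; i++)
            {
                request.WaitOne();   // receive request
                reply.Set();         // send reply
            }
        });
        server.Start();

        Stopwatch clock = Stopwatch.StartNew();
        for (int i = 0; i < roundTrips; i++)
        {
            request.Set();           // two-way message: send the request ...
            reply.WaitOne();         // ... and wait for the response
        }
        clock.Stop();
        server.Join();
        Console.WriteLine("average round trip: "
            + (clock.Elapsed.TotalMilliseconds * 1000.0 / roundTrips).ToString("F1")
            + " microseconds");
    }
}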