Oral Qualifying Exam - Community Grids Lab
OVERVIEW OF MULTICORE, PARALLEL COMPUTING, AND DATA MINING
Indiana University
Computer Science Dept.
Seung-Hee Bae
1
OUTLINE
Multicore
Parallel Computing & MPI
Data Mining
2
Multicore
Toward Concurrency
What is Multicore?
Shared cache architecture
Recognition, Mining, and Synthesis (RMS)
Parallel Computing & MPI
Data Mining
3
TOWARD CONCURRENCY IN SOFTWARE
Exponential growth in clock speed can't continue
Previous CPU performance gains
Clock speed: getting more cycles
It has become harder to exploit higher clock speeds due to several physical issues, such as heat, power consumption, and current-leakage problems. (2 GHz: 2001, 3.4 GHz: 2004, now?)
Execution optimization: more work per cycle
Pipelining, branch prediction, executing multiple instructions in the same clock cycle
Reordering the instruction stream risks changing the meaning of programs.
Cache
Increasing the size of the on-chip cache: main memory is much slower than the CPU
4
TOWARD CONCURRENCY IN SOFTWARE 2
Current CPU performance gains
Moore's law is over? Not yet (# of transistors ↑)
Hyperthreading
Running two or more threads in parallel inside a single CPU
Runs some instructions in parallel
Only one of each of the most basic CPU resources (except for extra registers)
Typical gain of 5%~15%, up to 40% under ideal conditions
It doesn't help single-threaded applications
Multicore
Running two or more actual CPUs on one chip.
Less than double the speed even in the ideal case.
It will boost reasonably well-written multithreaded applications, but not single-threaded applications.
2 * 3GHz < 6 GHz
Coordination overhead between the cores to ensure cache coherency.
Cache
Only this will broadly benefit most existing applications.
A cache miss costs 10 to 50 times as much as a cache hit.
5
WHAT IS MULTICORE?
Single chip
Multiple distinct processing engines
E.g.) Shared-cache dual-core architecture
[Diagram: two cores (Core 0 and Core 1), each with its own CPU and private L1 cache, sharing a single L2 cache]
6
SHARED-CACHE ARCHITECTURE
Options for the last-level cache
Private to each core
Sharing the last-level cache among different cores
Benefits of the shared-cache architecture
Efficient use of the last-level cache
Reduces resource underutilization.
Reduced cache-coherence complexity
Reduced false sharing because of the shared cache.
Reduced data-storage redundancy
The same data only needs to be stored once.
Reduced front-side bus traffic
Data requests can be resolved at the shared-cache level instead of going to system memory.
7
SOFTWARE TECHNIQUES FOR SHARED-CACHE MULTICORE SYSTEMS
Cache blocking (data tiling)
Allows data to stay in the cache while it is being processed by data loops.
Reduces unnecessary cache traffic. (Better cache hit ratio.)
Hold approach (late update)
Each thread maintains its own private copy of the data.
The shared copy is updated only when necessary.
Reduces the frequency of access to the shared data.
Avoid false sharing (see the sketch after this slide)
What is false sharing? (Unnecessary cache-line updates.)
How to avoid false sharing?
Allocate non-shared data to different cache lines (padding).
Copy the global variable to a local function variable, then copy the data back before the function exits.
8
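The padding and late-update ideas above can be shown in a short C sketch. The 64-byte cache-line size, the thread count, and the count_evens workload are illustrative assumptions, not part of the original slides.

```c
/* Sketch of avoiding false sharing (illustrative; assumes a 64-byte cache
 * line and a hypothetical 4-thread counting workload). */
#include <stdio.h>

#define NUM_THREADS 4
#define CACHE_LINE  64

/* Bad: per-thread counters packed into one array share cache lines, so
 * every increment invalidates the line in the other cores' caches. */
long packed_counts[NUM_THREADS];

/* Better: pad each counter out to its own cache line. */
struct padded_count {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};
struct padded_count padded_counts[NUM_THREADS];

/* Also effective: accumulate into a local variable and write the shared
 * copy back once before the function returns (the "late update" idea). */
void count_evens(const int *data, int n, int tid)
{
    long local = 0;                    /* thread-private accumulator */
    for (int i = 0; i < n; i++)
        if (data[i] % 2 == 0)
            local++;
    padded_counts[tid].value = local;  /* single write to shared memory */
}

int main(void)
{
    int data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    count_evens(data, 8, 0);           /* single-threaded demo call */
    printf("%ld\n", padded_counts[0].value);
    return 0;
}
```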
RECOGNITION, MINING, AND SYNTHESIS (RMS)
The "Era of Tera" is coming quickly
Teraflops (computing power), terabits (communication), terabytes (storage)
World data is doubling every three years and is now measured in exabytes (a billion billion bytes)
We need a computing model to deal with this enormous sea of information
Working with models
Recognition (What is it?): identifying that a set of data constitutes a model, and then constructing that model.
Mining (Is it?): searching for instances of the model.
Synthesis (What if?): creating a potential instance of that model in an imaginary world.
9
RMS 2
(from P. Dubey, "Recognition, Mining and Synthesis Moves Computers to the Era of Tera," Technology@Intel Magazine, Feb. 2005.)
Examples
Medicine (a tumor)
Business (hiring)
Investment
10
Multicore
Parallel Computing & MPI
Parallel architectures (Shared-Memory vs. Distributed-Memory)
Decomposing Program (Data Parallelism vs. Task Parallelism)
MPI and OpenMP
Data Mining
11
PARALLEL COMPUTING: INTRODUCTION
Parallel computing
More than just a strategy for achieving good performance
A vision for how computation can seamlessly scale from a single processor to virtually limitless computing power
Parallel computing software systems
Goal: to make parallel programming easier and the resulting applications more portable and scalable, while achieving good performance.
Difficulty
Writing explicitly parallel programs is difficult
e.g.) computation partitioning, synchronization, and data movement (for both a correct answer and high performance)
Programs must be machine-independent – portability
Complexity of the problems being attacked.
Parallel computing challenges
Concurrency & communication
Need for high performance
Diversity of architectures
12
PARALLEL ARCHITECTURE 1
Shared-memory machines
Have a single shared address space that can be accessed by any processor.
Examples: multicore, symmetric multiprocessor (SMP)
Uniform Memory Access (UMA)
Access time is independent of the location.
Use a bus or a completely connected network.
Not scalable
Shared-memory programming model
Needs synchronization to preserve integrity
E.g.) Open Specifications for MultiProcessing (OpenMP)
Distributed-memory machines
The system memory is packaged with individual nodes of one or more processors (c.f. use separate computers connected by a network)
E.g.) cluster
Communication is required to provide data from one processor to a different processor.
Support the message-passing programming model
Send-receive communication steps.
E.g.) Message Passing Interface (MPI)
13
PARALLEL ARCHITECTURE 2
Shared-memory
Pros
• Lower latency and higher bandwidth
• Data are available to all of the CPUs through load and store instructions
• Single address space
Cons
• Cache-coherency issue
• Synchronization is explicitly needed to access shared data
• Scalability issue
Distributed-memory
Pros
• Scalable, if a scalable interconnection network is used
• Quite fast local data access
Cons
• Communication required to access data on a different processor
• Communication management problems
1. Long latency → consolidation of messages between the same pair of processors
2. Long transmission time → overlapping communication and computing
14
PARALLEL ARCHITECTURE 3
Hybrid systems
Distributed shared-memory (DSM)
A distributed-memory machine which allows a processor to directly access a datum in a remote memory.
Latency varies with the distance to the remote memory.
Emphasizes the Non-Uniform Memory Access (NUMA) characteristics.
SMP clusters
A distributed-memory system with an SMP as the unit.
15
PARALLEL PROGRAM: DECOMPOSITION 1
Decomposing programs
Decomposition: identifying the portions of the program available for parallelism.
Decomposition strategies
Task (functional) parallelism
Different processors carry out different functions.
Data parallelism
Subdivides the data domain of a problem into multiple regions and assigns different processors to compute the results for each region.
More commonly used in scientific problems.
Natural form of scalability
Programming models
Shared-memory programming model
Needs synchronization to preserve integrity
Message-passing model
Communication is required to access a remote data location.
16
PARALLEL PROGRAM: DECOMPOSITION 2
Data parallelism
Exploits the parallelism inherent in many large data structures.
Same task on different data (SPMD).
Can be expressed by ALL parallel programming models (i.e. MPI, HPF-like, OpenMP-like)
Features
Scalable
Hard to express when the geometry is irregular or dynamic
Functional parallelism
Coarse-grain parallelism
Parallelism between the parts of many systems.
Different tasks on the same or different data.
Features
Parallelism limited in size
Tens, not millions
Synchronization probably as good as the parallelism
Decomposition natural
E.g.) workflow
17
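To make the two decomposition styles concrete, here is a minimal C sketch. The array-summing workload, the 4-worker count, and the function names are illustrative assumptions rather than anything from the slides; the workers are run serially here only to keep the example self-contained.

```c
/* Illustrative sketch of data vs. task decomposition (hypothetical workload). */
#include <stdio.h>

#define N 1000
#define WORKERS 4

/* Data parallelism (SPMD): every worker runs the same code on its own
 * contiguous block of the data, with block bounds derived from its id. */
double sum_block(const double *x, int id)
{
    int lo = (id * N) / WORKERS;        /* first index owned by this worker */
    int hi = ((id + 1) * N) / WORKERS;  /* one past the last owned index    */
    double s = 0.0;
    for (int i = lo; i < hi; i++)
        s += x[i];
    return s;                           /* partial sums are combined later  */
}

/* Task (functional) parallelism: different workers run different functions,
 * possibly on the same data; the number of tasks is small and fixed. */
double find_min(const double *x) { double m = x[0]; for (int i = 1; i < N; i++) if (x[i] < m) m = x[i]; return m; }
double find_max(const double *x) { double m = x[0]; for (int i = 1; i < N; i++) if (x[i] > m) m = x[i]; return m; }

int main(void)
{
    static double x[N];
    for (int i = 0; i < N; i++) x[i] = (double)i;

    /* Serial stand-in for WORKERS concurrent workers. */
    double total = 0.0;
    for (int id = 0; id < WORKERS; id++)
        total += sum_block(x, id);

    printf("sum=%g min=%g max=%g\n", total, find_min(x), find_max(x));
    return 0;
}
```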
PARALLEL PROGRAM: DECOMPOSITION 3
Load balance and scalability
Scalable: the running time is inversely proportional to the number of processors used.
Speedup(n) = T(1)/T(n)
Second definition of scalability: scaled speedup
Scalable if speedup(n) ≈ n
Scalable if the running time remains the same when the number of processors and the problem size are increased by a factor of n.
Why is scalability not achieved?
A region that must be run sequentially. Total speedup ≤ T(1)/Ts (Amdahl's Law)
A requirement for a high degree of communication or coordination.
Poor load balance (balancing it is a major goal of parallel programming)
If one of the processors takes half of the parallel work, the speedup will be limited to a factor of two.
18
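The Amdahl's Law bound above can be written out explicitly; the 10% serial fraction in the worked example is an illustrative assumption.

```latex
% Speedup and Amdahl's Law (worked example with an assumed 10% serial fraction)
\[
  \mathrm{Speedup}(n) \;=\; \frac{T(1)}{T(n)}, \qquad
  T(n) \;=\; T_s + \frac{T(1) - T_s}{n}
  \;\;\Longrightarrow\;\;
  \mathrm{Speedup}(n) \;\le\; \frac{T(1)}{T_s}.
\]
\[
  \text{E.g., if } T_s = 0.1\,T(1):\quad
  \mathrm{Speedup}(n) \le \frac{T(1)}{0.1\,T(1)} = 10,
  \text{ no matter how many processors are used.}
\]
```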
PARALLEL PROGRAM
Memory-Hierarchy Management
Blocking
Ensuring that data remain in cache between subsequent accesses to the same memory location.
Elimination of false sharing
False sharing: when two different processors are accessing distinct data items that reside on the same cache block.
Ensure that data used by different processors reside on different cache blocks (by padding: inserting empty bytes in a data structure).
Communication minimization and placement
Move send and receive commands far enough apart so that time spent on communication can be overlapped with computation.
Stride-one access
Programs in which the loops access contiguous data items are much more efficient than those that do not.
19
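Stride-one access and blocking can be illustrated with a short C sketch; the matrix size, the tile size, and the row-sum/transpose workloads below are arbitrary assumptions chosen only for illustration.

```c
/* Sketch of stride-one access and cache blocking for a row-major matrix
 * (sizes are arbitrary; chosen only for illustration). */
#include <stdio.h>

#define N 256
#define B 32                              /* block (tile) size */

static double a[N][N], c[N];

/* Stride-one: the inner loop walks along a row, touching contiguous
 * memory, so each cache line brought in is fully used. */
void row_sums_stride_one(void)
{
    for (int i = 0; i < N; i++) {
        double s = 0.0;
        for (int j = 0; j < N; j++)       /* contiguous accesses a[i][0..N-1] */
            s += a[i][j];
        c[i] = s;
    }
}

/* Cache blocking: process the matrix in B x B tiles so that each tile
 * stays resident in cache while it is repeatedly accessed. */
void transpose_blocked(double out[N][N])
{
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    out[j][i] = a[i][j];
}

int main(void)
{
    static double t[N][N];
    row_sums_stride_one();
    transpose_blocked(t);
    printf("%f %f\n", c[0], t[0][0]);
    return 0;
}
```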
MESSAGE PASSING INTERFACE (MPI) 1
Message Passing Interface (MPI)
A specification for a set of functions for managing the movement of data among sets of communicating processes.
The dominant scalable parallel computing paradigm for scientific problems.
Explicit message send and receive using a rendezvous model.
Point-to-point communication
Collective communication
Commonly implemented in terms of an SPMD model
All processes execute essentially the same logic.
Pros:
Scalable and portable
Race conditions avoided (implicit synchronization with the copy)
Cons:
The programmer must implement the communication details explicitly.
20
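As a concrete illustration of the point-to-point model, here is a minimal MPI program in C using the six basic functions listed on the next slide; the message payload is arbitrary.

```c
/* Minimal MPI point-to-point example (send/recv between rank 0 and rank 1). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);                 /* start the MPI runtime        */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?          */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes in total  */

    if (size >= 2) {
        int value;
        if (rank == 0) {
            value = 42;                     /* arbitrary payload            */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d from rank 0\n", value);
        }
    }

    MPI_Finalize();                         /* shut the runtime down        */
    return 0;
}
```

Such a program is typically built with mpicc and launched with something like mpirun -np 2; every rank executes the same SPMD code, and the rank value selects the sender or receiver role.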
MPI
6 Key Functions
MPI_INIT
MPI_COMM_RANK
MPI_COMM_SIZE
MPI_SEND
MPI_RECV
MPI_FINALIZE
Collective Communications
Barrier, Broadcast, Gather, Scatter, All-to-all, Exchange
General reduction operation (sum, minimum, scan)
Blocking, nonblocking, buffered, synchronous messaging
21
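A collective operation can replace many point-to-point messages; this sketch sums one value per rank with MPI_Reduce (the per-rank contribution is again arbitrary).

```c
/* Collective communication sketch: sum one contribution per rank. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local = rank + 1;   /* each rank contributes rank+1 (arbitrary)     */
    int total = 0;

    /* All ranks participate; the result lands on the root (rank 0). */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %d\n", size, total);

    MPI_Finalize();
    return 0;
}
```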
OPEN SPECIFICATIONS FOR MULTIPROCESSING (OPENMP) 1
Appropriate for shared memory.
A sophisticated set of annotations (compiler directives) for traditional C, C++, or Fortran codes to help compilers produce parallel code.
Provides parallel loops and collective operations such as summation over loop indices.
Provides lock variables to allow fine-grain synchronization between threads.
22
OPENMP 2
Directives: instruct the compiler to
Create threads
Perform synchronization operations
Manage shared memory
Examples
PARALLEL DO ~ END PARALLEL DO: explicit parallel loop.
SCHEDULE (STATIC): assign contiguous blocks of iterations at compile time.
SCHEDULE (DYNAMIC): assign blocks of iterations at run time.
REDUCTION(+: x): the final value of variable x is determined by a global sum.
PARALLEL SECTIONS: task parallelism.
OpenMP synchronization primitives
Critical sections
Atomic updates
Barriers
Master selection
23
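The directive examples above are written in OpenMP's Fortran form; the corresponding C pragmas look roughly like the following sketch. The loop bounds and the workload are placeholders.

```c
/* OpenMP sketch in C: parallel loop with a reduction, plus task-parallel
 * sections (the workload itself is a placeholder). */
#include <omp.h>
#include <stdio.h>

#define N 1000

int main(void)
{
    static double a[N];
    double sum = 0.0;

    /* Data parallelism: iterations are split among threads; static
     * scheduling hands each thread a contiguous block of iterations. */
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = i * 0.5;
        sum += a[i];        /* each thread keeps a private partial sum */
    }

    /* Task (functional) parallelism: different sections may run on
     * different threads. */
    double mn = a[0], mx = a[0];
    #pragma omp parallel sections
    {
        #pragma omp section
        { for (int i = 1; i < N; i++) if (a[i] < mn) mn = a[i]; }
        #pragma omp section
        { for (int i = 1; i < N; i++) if (a[i] > mx) mx = a[i]; }
    }

    printf("sum=%g min=%g max=%g threads=%d\n",
           sum, mn, mx, omp_get_max_threads());
    return 0;
}
```

Compilers typically enable the pragmas with a flag such as -fopenmp; without it the same code runs correctly as a sequential program.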
OPENMP 3
Summary
Work decomposition
Ideal target system: uniform-access, shared memory.
Specify where multiple threads should be applied, and how to assign work to those threads.
Pros:
Excellent programming interface for uniform-access, shared-memory machines.
Cons:
No way to specify locality in machines with non-uniform shared memory or distributed memory.
24
Multicore
Parallel Computing & MPI
Data Mining
Expectation Maximization (EM)
Deterministic Annealing (DA)
Hidden Markov Model (HMM)
Support Vector Machine (SVM)
25
EXPECTATION MAXIMIZATION (EM)
Expectation Maximization (EM)
A general algorithm for maximum-likelihood (ML) estimation where the data are "incomplete" or the likelihood function involves latent variables.
An efficient iterative procedure
Goal: estimate unknown parameters, given measurements.
A hill-climbing approach guaranteed to reach a local maximum.
Two steps
E-step (Expectation): the missing data are estimated given the observed data and the current estimate of the model parameters.
M-step (Maximization): the likelihood function is maximized under the assumption that the missing data are known. (The estimated missing data from the E-step are used in lieu of the actual missing data.)
These two steps are repeated until the likelihood converges.
26
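As a concrete, standard instance of the two steps, the EM updates for a K-component Gaussian mixture take the following form; the mixture model is used here only as an illustration and is not part of the original slides.

```latex
% EM for a K-component Gaussian mixture (illustrative instance of the E- and M-steps)
% E-step: responsibilities, i.e. the expected latent assignments
\[
  \gamma_{nk} \;=\;
  \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}
       {\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}
\]
% M-step: re-estimate the parameters using the responsibilities
\[
  N_k = \sum_{n=1}^{N}\gamma_{nk}, \qquad
  \mu_k = \frac{1}{N_k}\sum_{n=1}^{N}\gamma_{nk}\,x_n, \qquad
  \Sigma_k = \frac{1}{N_k}\sum_{n=1}^{N}\gamma_{nk}\,(x_n-\mu_k)(x_n-\mu_k)^{\!\top}, \qquad
  \pi_k = \frac{N_k}{N}
\]
```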
DETERMINISTIC ANNEALING (DA)
Purpose: avoid local minima (optimization)
Clustering
An example of unsupervised learning
Simulated Annealing (SA)
A sequence of random moves is generated, and the random decision to accept a move depends on the cost of the resulting configuration relative to the current state's cost (Monte Carlo method).
Deterministic Annealing (DA)
Deterministic:
Does not wander randomly
(minimizes the free energy directly)
Annealing:
Still want to avoid local minima with a certain level of uncertainty.
Maintain the free energy at its minimum.
Eq.) F = D – TH (T: temperature, H: Shannon entropy, D: cost)
At large T, the entropy (H) dominates, while at small T the cost dominates.
Annealing lowers the temperature so that the solution tracks continuously.
27
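A minimal sketch of how F = D − TH is minimized in DA clustering, using the standard squared-distance formulation; data-point weights are omitted, and this is a sketch rather than the full derivation.

```latex
% DA clustering: minimizing F = D - T H at fixed temperature T
% (standard squared-distance formulation; a sketch, not the full derivation)
\[
  D = \sum_{i}\sum_{j} p(y_j \mid x_i)\, \lVert x_i - y_j \rVert^2, \qquad
  H = -\sum_{i}\sum_{j} p(y_j \mid x_i)\, \log p(y_j \mid x_i)
\]
% Minimizing F over the association probabilities gives a Gibbs distribution,
\[
  p(y_j \mid x_i) \;=\;
  \frac{\exp\!\bigl(-\lVert x_i - y_j \rVert^2 / T\bigr)}
       {\sum_{k} \exp\!\bigl(-\lVert x_i - y_k \rVert^2 / T\bigr)},
\]
% and setting dF/dy_j = 0 gives the centroid update
\[
  y_j \;=\; \frac{\sum_i p(y_j \mid x_i)\, x_i}{\sum_i p(y_j \mid x_i)} .
\]
```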
DA FOR CLUSTERING
Start with a single cluster, giving as solution its centroid Y1
For some annealing schedule for T, iterate the above algorithm, testing the covariance matrix of the Xi about each cluster center to see if it is "elongated"
Split the cluster if the elongation is "long enough"
You do not need to assume the number of clusters, but rather a final resolution T (or equivalent)
At T=0, the uninteresting solution is N clusters, one at each point xi
28
HIDDEN MARKOV MODEL (HMM) 1
Markov model
A system which may be described at any time as being in one of a set of N distinct states, S_1, S_2, …, S_N.
State transition probability
The special case of a discrete, first-order Markov chain:
P[q_t = S_j | q_{t-1} = S_i, q_{t-2} = S_k, …] = P[q_t = S_j | q_{t-1} = S_i]   (1)
Furthermore, consider those processes in which the right-hand side of (1) is independent of time, thereby leading to the set of state transition probabilities a_ij of the form
a_ij = P[q_t = S_j | q_{t-1} = S_i],  1 ≤ i, j ≤ N,  a_ij ≥ 0,  Σ_j a_ij = 1
Initial state probability
29
HIDDEN MARKOV MODEL (HMM) 2
Hidden Markov Model
The observation is a probabilistic function of the state.
The state is hidden.
Elements of an HMM
N, the number of states in the model (although the states are hidden).
M, the number of distinct observation symbols per state, i.e. the discrete alphabet size.
The state transition probability distribution A = {a_ij}, where
a_ij = P[q_t = S_j | q_{t-1} = S_i],  1 ≤ i, j ≤ N,  a_ij ≥ 0,  Σ_j a_ij = 1
The observation symbol probability distribution (emission probability) in state j, B = {b_j(k)}, where
b_j(k) = P[v_k at t | q_t = S_j],  1 ≤ j ≤ N,  1 ≤ k ≤ M
The initial state distribution π = {π_i}, where
π_i = P[q_1 = S_i],  1 ≤ i ≤ N
Compact notation: λ = (A, B, π)
30
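These elements map naturally onto a small data structure. The following C sketch is only an illustrative layout; the field names and the allocation scheme are assumptions, not something taken from the slides.

```c
/* Illustrative container for the HMM elements lambda = (A, B, pi). */
#include <stdlib.h>

typedef struct {
    int N;          /* number of hidden states                   */
    int M;          /* number of distinct observation symbols    */
    double **A;     /* N x N state transition probabilities a_ij */
    double **B;     /* N x M emission probabilities b_j(k)       */
    double *pi;     /* length-N initial state distribution pi_i  */
} HMM;

/* Allocate an HMM with all probabilities zeroed; the caller fills them in. */
HMM *hmm_alloc(int N, int M)
{
    HMM *h = malloc(sizeof *h);
    h->N = N;
    h->M = M;
    h->A  = malloc(N * sizeof *h->A);
    h->B  = malloc(N * sizeof *h->B);
    h->pi = calloc(N, sizeof *h->pi);
    for (int i = 0; i < N; i++) {
        h->A[i] = calloc(N, sizeof **h->A);
        h->B[i] = calloc(M, sizeof **h->B);
    }
    return h;
}
```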
HIDDEN MARKOV MODEL (HMM) 3
Three basic problems for HMMs
Prob(observation sequence | model): Given the observation sequence O = O_1 O_2 … O_T and a model λ = (A, B, π), how do we efficiently compute P(O | λ), the probability of the observation sequence given the model?
Finding the optimal state sequence: Given the observation sequence O = O_1 O_2 … O_T and a model λ = (A, B, π), how do we choose a corresponding state sequence Q = q_1 q_2 … q_T which is optimal in some meaningful sense (i.e. best "explains" the observations)?
Finding the optimal model parameters: How do we adjust the model parameters λ = (A, B, π) to maximize P(O | λ)?
31
HIDDEN MARKOV MODEL (HMM) 4
Solutions to the three basic problems for HMMs
Solution to problem 1 (Forward-Backward procedure)
Enumeration (the straightforward way) is computationally infeasible.
Forward procedure
Consider the forward variable α_t(i) = P(O_1 O_2 … O_t, q_t = S_i | λ), i.e., the probability of the partial observation sequence O_1 O_2 … O_t (until time t) and state S_i at time t, given the model λ.
Solution to problem 2 (Viterbi algorithm)
Optimality criterion: to find the single best state sequence (path), i.e., to maximize P(Q | O, λ), which is equivalent to maximizing P(Q, O | λ).
A formal technique for finding this single best state sequence exists, based on dynamic programming methods, and is called the Viterbi algorithm.
32
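The forward recursion sketched on this slide can be written out in a few lines of C; the tiny two-state, two-symbol model and the observation sequence in the code are purely illustrative.

```c
/* Forward procedure: compute P(O | lambda) for a small HMM.
 * The two-state, two-symbol model below is purely illustrative. */
#include <stdio.h>

#define N 2   /* hidden states        */
#define M 2   /* observation symbols  */
#define T 4   /* observation length   */

double A[N][N] = {{0.7, 0.3}, {0.4, 0.6}};   /* transition probs a_ij */
double B[N][M] = {{0.9, 0.1}, {0.2, 0.8}};   /* emission probs b_j(k) */
double pi[N]   = {0.6, 0.4};                 /* initial distribution  */
int    O[T]    = {0, 1, 1, 0};               /* observation sequence  */

double forward(void)
{
    double alpha[T][N];

    /* Initialization: alpha_1(i) = pi_i * b_i(O_1) */
    for (int i = 0; i < N; i++)
        alpha[0][i] = pi[i] * B[i][O[0]];

    /* Induction: alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(O_{t+1}) */
    for (int t = 0; t < T - 1; t++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int i = 0; i < N; i++)
                s += alpha[t][i] * A[i][j];
            alpha[t + 1][j] = s * B[j][O[t + 1]];
        }

    /* Termination: P(O | lambda) = sum_i alpha_T(i) */
    double p = 0.0;
    for (int i = 0; i < N; i++)
        p += alpha[T - 1][i];
    return p;
}

int main(void)
{
    printf("P(O | lambda) = %g\n", forward());
    return 0;
}
```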
HIDDEN MARKOV MODEL (HMM) 5
Solution to problem 3 (Baum-Welch algorithm)
The third problem of HMMs is to determine a method to adjust the model parameters (A, B, π) to maximize the probability of the observation sequence given the model.
Choose λ = (A, B, π) such that P(O | λ) is locally maximized, using an iterative procedure such as the Baum-Welch method (or, equivalently, the EM (expectation-maximization) method) or using gradient techniques.
For reestimation (iterative update and improvement), define ξ_t(i, j), the probability of being in state S_i at time t and state S_j at time t+1, given the model and the observation sequence, i.e.
ξ_t(i, j) = P(q_t = S_i, q_{t+1} = S_j | O, λ)
33
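For completeness, ξ_t(i, j) is usually evaluated from the forward variable α and the backward variable β, which then gives the transition reestimate; this is the standard Rabiner-style formulation, stated here as a sketch.

```latex
% xi_t(i,j) via the forward variable alpha and backward variable beta,
% and the resulting reestimate of a_ij (standard Baum-Welch formulas)
\[
  \xi_t(i,j) \;=\;
  \frac{\alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)},
  \qquad
  \gamma_t(i) \;=\; \sum_{j=1}^{N} \xi_t(i,j)
\]
\[
  \bar{a}_{ij} \;=\;
  \frac{\text{expected number of transitions from } S_i \text{ to } S_j}
       {\text{expected number of transitions from } S_i}
  \;=\;
  \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}
\]
```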