Oral Qualifying Exam - Community Grids Lab

Download Report

Transcript Oral Qualifying Exam - Community Grids Lab

OVERVIEW OF
MULTICORE, PARALLEL COMPUTING,
AND DATA MINING
1
Indiana University
Computer Science Dept.
Seung-Hee Bae
1
OUTLINE
Multicore
 Parallel Computing & MPI
 Data Mining

2

Multicore
Toward Concurrency
 What is Multicore?
 Shared cache architecture
 Recognition, Mining, and Synthesis (RMS)

Parallel Computing & MPI
 Data Mining

3
TOWARD CONCURRENCY IN SOFTWARE
 Exponential
growth (Moore’s Law) can’t continue
 Previous CPU performance gains
Clock speed: getting more cycles
 Become harder to exploit higher clock speeds due to several
physical issues, such as, heat, power consumption, and current
leakage problems. (2GHz:2001, 3.4GHz:2004, now?)
 Execution optimization: more work per cycle
 Pipelining, branch prediction, executing multiple instructions in
the same clock cycle
 reordering the instruction stream: changing meaning of programs.
 Cache
 Increasing the size of on-chip cache: main memory is much
slower than the CPU

4
TOWARD CONCURRENCY IN SOFTWARE 2

Current CPU performance gains
 Moore’s law is over? Not yet (# of transistors ↑)
 Hyperthreading
Running two or more threads in parallel inside a single CPU
 Runs some instructions in parallel
 One each of most basic CPU features, (except extra registers)
 5% ~ 15 %, 40% under ideal conditions
 It doesn’t help single-threaded applications


Multicore
Running two or more actual CPUs on one chip.
 Less than double the speed even in the ideal case.
 It will boost reasonably well-written multi-thread applications, but not
single-threaded applications.
 2 * 3GHz < 6 GHz



Coordination overhead between the cores to ensure cache coherency.
Cache
Only this will broadly benefit most existing applications.
 A cache miss costs 10 to 50 times.

5
WHAT IS MULTICORE?
Single Chip
 Multiple distinct processing Engine
 E.g.) Shared-cache Dual Core Architecture

Core 0
Core 1
CPU
CPU
L1 Cache
L1 Cache
L2 Cache
6
SHARED-CACHE ARCHITECTURE
 Options
for the last-level cache
private to each core
 sharing the last-level cache among diff. cores

 Benefits
of the Shared-Cache Architecture
Efficient use of the last-level cache.
 reduce resource underutilization.
 Reduce cache-coherence complexity
 reduced false sharing because of shared cache.
 reduce data-storage redundancy
 same data only needs to be stored once.
 reduce front-side bus traffic
 data requests can be resolved at the shared-cache level instead of
system memory.

7
SOFTWARE TECHNIQUES FOR SHARED-CACHE
MULTICORE SYSTEMS

Cache blocking (Data Tiling)



Allow data to stay in the cache while being processing by data loops.
Reducing unnecessary cache traffic. (Better cache hit ratio.)
Hold approach (Late update)
Each thread maintain its own private copy of data.
 Updating the shared copy only when it is necessary.
 Reducing the frequency of access to the shared data.


Avoid false sharing
What is false sharing? (unnecessary cache line update.)
 How to avoid false sharing?

To allocate non-shared data to different cache lines. (padding)
 To copy the global variable to a local function variable, then copy the data back
8
before the function exits.

RECOGNITION, MINING, AND
SYNTHESIS (RMS)

Era of Tera is coming quickly




Teraflops (computing power), Terabits (comm.), Terabytes
(storage)
World data is doubling every three years and is now measured
exabytes (a billion billion bytes)
Need computing model to deal this enormous sea of
information
Working with Models

Recognition (What is ?)


Mining (Is it ?)


Identifying that a set of data constitutes a model and then constructing that model.
Search for instances of the model.
Synthesis (What if ?)

Create a potential instance of that model in an imaginary world.
9
RMS 2
(from P.Dubey, “Recognition, Mining and Synthesis Moves Computers to the Era of
Tera,” Technology@Intel Magazine, Feb. 2005.)

Examples
Medicine (a tumor)
 Business (hiring)
 Investment

10
Multicore
 Parallel Computing & MPI

Parallel architectures (Shared-Memory vs. Distributed-Memory)
 Decomposing Program (Data Parallelism vs. Task Parallelism)
 MPI and OpenMP


Data Mining
11
PARALLEL COMPUTING: INTRODUCTION

Parallel computing



More than just a strategy for achieving good performance
Vision for how computation can seamlessly scale from a single
processor to virtually limitless computing power
Parallel computing software systems
Goal: to make parallel programming easier and the resulting applications
more portable and scalable while achieving good performance.
 Difficulty

Explicitly parallel program is difficult
e.g.) computation, partitioning, synchronization, and data movement (correct answer
& high performance)
 Must be machine-independent – portability
 Complexity of the problems being attacked.


Parallel Computing Challenges



Concurrency & Communication
Need for high performance
Diversity of Architectures
12
PARALLEL ARCHITECTURE 1

Shared-memory machines


Have a single shared address
space that can be accessed by
any processor.
Examples
Multicore
 Symmetric multiprocessor (SMP)
 Uniform Memory Access (UMA)





Access time is independent of the loc.
Use bus or completely connected net.
Not scalable
Shared-Memory Programming
model
Need for synchronization to
preserve the integrity
 E.g.) Open Specifications for
MultiProcessing (OpenMP)


Distributed-memory machines




The system memory is packaged
with individual nodes of one or
more processors (c.f. Use separate
computers connected by a network)
E.g. Cluster
communication is required to
provide data from a processor to a
different processor.
support message-passing
programming model
Send-receive communication steps.
 E.g.) Message Passing Interface
(MPI)

13
PARALLEL ARCHITECTURE 2
Shared-Memory
Distributed Memory
Pros
• Lower latency and higher BW
• Data are available to all of the CPUs
through load and store instructions
• Single address space
• Scalable, if a scalable
interconnection network is used.
• Quite fast local data access.
Cons
• cache coherency issue
• synchronization is explicitly
needed to access shared data.
• scalability issue
• Communication required to
access data in a diff. processor.
• Communication management
problem
1. Long latency  Consolidation of
messages btwn the same pair of
processors
2. Long transmission time 
Overlapping communication and
computing
14
PARALLEL ARCHITECTURE 3
 Hybrid

systems
Distributed shared-memory (DSM)
Distributed-memory machine which allows a processor to directly
access a datum in a remote memory.
 Latency varies with the distance to the remote memory.
 Emphasize the Non-Uniform Memory Access (NUMA)
characteristics.


SMP clusters

distributed-memory system with SMP as a unit.
15
PARALLEL PROGRAM: Decomposition 1
 Decomposing Programs
 Decomposition: Identifying
 Decomposition strategy

Task (Functional) parallelism


the portions for the parallelism.
Different processors carry out different functions.
Data parallelism
Subdivides the data domain of a problem into multiple regions and
assigns different processors to compute the results for each region.
 More commonly used in scientific problems.
 Natural form of scalability


Programming models
 Shared-memory programming model


Need for synchronization to preserve the integrity
Message-passing model

Communication is required to access a remote data location.
16
PARALLEL PROGRAM: DECOMPOSITION 2

Data Parallelism
Exploit the parallelism
inherent in many large
data structures.
 Same Task on diff. data.
(SPMD)
 Can be expressed by ALL
parallel programming
models (i.e. MPI, HPF like,
OpenMP like)
 Features

Scalable
 Hard to express when
geometry irregular or
dynamic


Functional Parallelism
Coarse grain parallelism
 Parallelism btwn the
parts of many systems.
 Diff. task on the same or
diff. data.
 Features

Parallelism limited in size
 Tens not millions
 Synchronization probably
good as parallelism
 Decomposition natural
 E.g.) workflow

17
PARALLEL PROGRAM: DECOMPOSITION 3

Load balance and scalability
Scalable: running time is inversely proportional to the number of
processors used.
 Speedup(n) = T(1)/T(n)



Second definition of scalability: scaled speedup


Scalable if speedup(n) ≈ n
Scalable if the running time remains the same when the number of
processors and the problem size are increased by a factor of n.
Why scalability is not achieved?
a region that must be run sequentially. Total speedup ≤ T(1)/Ts
(Amdahl’s Law)
 Require for a high degree of communication or coordination.
 Poor load balance (major goal of parallel programming)


If one of the processors takes half of the parallel work, speedup will be
limited to a factor of two.
18
PARALLEL PROGRAM

Memory-Hierarchy Management

Blocking


Ensuring that data remains in cache between subsequent accesses to the
same memory location.
Elimination of False Sharing
False sharing: When two diff. processors are accessing distinct data
items that reside on the same cache block.
 Ensure that data used by diff. processors reside on diff. cache blocks.
(by padding: inserting empty bytes in a data structure.)


Communication Minimization and Placement


Move send and receive commands far enough apart so that time spent on
communication can be overlapped.
Stride-one access

Programs in which the loops access contiguous data items are much
more efficient than those that do not.
19
MESSAGE PASSING INTERFACE (MPI) 1

Message Passing Interface (MPI)






A specification for a set of functions for managing movement of
data among sets of communicating processes.
The dominant scalable parallel computing paradigm with
scientific problem.
Explicit message send and receive using rendezvous model.
Point-to-point communication
Collective communication
Commonly implemented in terms of an SPMD model


All processes execute essentially the same logic.
Pros:
scalable and portable
 Race condition avoided (implicit synch. w/ the copy)


Cons:

implements details at communication.
20
MPI

6 Key Functions







MPI_INIT
MPI_COMM_RANK
MPI_COMM_SIZE
MPI_SEND
MPI_RECV
MPI_FINALIZE
Collective Communications
Barrier, Broadcast, Gather, Scatter, All-to-all, Exchange
 General reduction operation (sum, minimum, scan)


Blocking, nonblocking, buffered, synchronous messaging
21
OPEN SPECIFICATIONS FOR
MULTIPROCESSING (OPENMP) 1
Appropriate to Shared-Memory.
 A sophisticated set of annotations (compiler
directives) for traditional C, C++, or Fortran codes to
aid compilers producing parallel codes.
 It provides parallel loops and collective operations
such as summation over loop indices.
 Provide lock variables to allow fine-grain
synchronization btwn threads.

22
OPENMP 2

Directives: instruct the compiler to
Create threads
 Perform synchronization operations.
 Manage shared memory.
 Examples

PARALLEL DO ~ END PARALLEL DO: explicit parallel loop.
 SCHEDULE (STATIC): assign continuous blocks at compile time.
 SCHEDULE (DYNAMIC): assign continuous blocks at run-time.
 REDUCTION(+: x): final values of var. x is determined global sum.
 PARALLEL SECTIONS: task parallelism.


OpenMP synchronization primitives




Critical sections
Atomic updates
Barriers
Master selection
23
OPENMP 3

Summary
Work decomposition
 Ideal target system: uniform-access, shared-memory.
 Specify where multiple threads should be applied, and how
to assign work to those threads.
 Pros:



Excellent programming interface for uniform-access, sharedmemory machines.
Cons:

No way to specify locality in machines w/ non-uniform sharedmemory or distributed memory.
24
Multicore
 Parallel Computing & MPI
 Data Mining

Expectation Maximization (EM)
 Deterministic Annealing (DA)
 Hidden Markov Model (HMM)
 Support Vector Machine (SVM)

25
EXPECTATION MAXIMIZATION (EM)

Expectation Maximization (EM)





A general algorithm for maximum-likelihood (ML)
estimation where the data are “incomplete” or the
likelihood function involves latent variables.
An efficient iterative procedure
Goal: estimate unknown parameters, given measurement.
Hill climbing approach  guarantee to reach local maxima.
Two Steps
E-step (Expectation): the missing data are estimated given the
observed data and current estimate of the model parameters.
 M-step (Maximization): the likelihood function is maximized
under the assumption that the missing data are known. (The
estimated missing data from the E-step are used in lieu of the actual
missing data.)
26
 Those two steps are repeated until the likelihood converges.

DETERMINISTIC ANNEALING (DA)




Purpose: avoid local minima (optimization)
Clustering
 example of unsupervised learning
Simulated Annealing (SA)
 A sequence of random moves is generated and the random decision to
accept a move depends on the cost of resulting configuration relative to the
current state cost (Monte Carlo Method)
Deterministic Annealing (DA)
 Deterministic:
don’t wandering randomly
 (minimize the free energy directly)


Annealing:
still want to avoid local minima with certain level of uncertainty.
 maintain the free energy at its minimum.

eq) F = D – TH (T: temperature, H: Shannon Entropy, D: cost)
At large T, entropy (H) dominates while at small T cost dominates.
 Annealing lowers temperature so solution tracks continuously

27
DA FOR CLUSTERING
Start with a single cluster giving as solution Y1 as centroid
 For some annealing schedule for T, iterate above algorithm testing
covariance matrix in Xi about each cluster center to see if
“elongated”
 Split cluster if elongation “long enough”
 You do not need to assume number of clusters but rather a final
resolution T or equivalent
 At T=0, uninteresting solution is N clusters; one at each point xi
28

HIDDEN MARKOV MODEL (HMM) 1

Markov model
A system which may be described at any time as being in one
of a set of N distinct states, S1, S2, …, SN.
 State transition probability
 The special case of a discrete, first order Markov chain:




P[qt = Sj|qt-1 = Si, qt-2 = Sk, …] = P[qt = Sj|qt-1 = Si] (1)
Furthermore, consider those processes in which the right-hand side
of (1) is independent of time, thereby leading to the set of state
transition probability aij of the form
aij = P[qt = Sj|qt-1 = Si], 1 ≤ i, j ≤ N, aij ≥ 0 ∑J aij = 1
Initial state probability
29
HIDDEN MARKOV MODEL (HMM) 2
Hidden Markov Model



Observation is a probabilistic function of the state.
State is hidden.
Elements of an HMM




N, the number of states in the model. (Although the states are hidden)
M, the number of distinct observation symbols per state, i.e. the discrete
alphabet size.
The state transition probability distribution A = {aij},
where aij = P[qt = Sj|qt-1 = Si],



1 ≤ i, j ≤ N, aij ≥ 0 ∑J aij = 1
The observation symbol probability distribution (emission probability) in
state j, B = {bj(k)}, where
bj(k) = P[vk at t| qt = Sj], 1 ≤ j ≤ N, 1 ≤ k ≤ M
The initial state distribution π = {πi} where
πi = P[q1 = Si], 1 ≤ j ≤ N
30
Compact notation: λ = (A, B, π)
HIDDEN MARKOV MODEL (HMM) 3
Three Basic Problems for HMMs




Prob(observation seq | model): Given the observation sequence O =
O1O2 … OT, and a model λ = (A, B, π), how do we efficiently
compute P(O| λ), the probability of the observation sequence, given
the model?
Finding Optimal State Sequence: Given the observation sequence
O = O1O2 … OT, and a model λ = (A, B, π), how do we choose a
corresponding state sequence Q = q1q2 … qT which is optimal in
some meaningful sense (i.e. best “explains” the observations)?
Finding Optimal Model Parameters: How do we adjust the model
parameters λ = (A, B, π) to maximize P(O| λ) ?
31
HIDDEN MARKOV MODEL (HMM) 4
Solution to the three basic problems for HMMs


Solution to the problem 1 (Forward-Backward procedure)
Enumeration (straightforward way): computationally
unfeasible.
Forward Procedure




Consider forward variable αt(i) = P(O1O2 … Ot, qt = Si| λ) i.e.,
the probability of the partial observation sequence, O1O2
… Ot, (until time t) and state Si at time t, given the model λ.
Solution to the problem 2 (Viterbi algorithm)

Optimality criterion: to find the single best state
sequence (path), i.e., to maximize P(Q|O, λ) which is
equivalent to maximizing P(Q, O| λ).

A formal technique for finding this single best state
sequence exists, based on dynamic programming methods,
32
and is called Viterbi algorithm.
HIDDEN MARKOV MODEL (HMM) 5

Solution to the Problem 3. (Baum-Welch Algorithm)



The third problem of HMMs is to determine a method to adjust
the model parameters (A, B, π) to maximize the probability of
the observation sequence given the model.
Choose λ = (A, B, π) such that P(O| λ) is locally maximized using
an iterative procedure such as the Baum-Welch method (or
equivalently the EM (expectation-modification) method) or using
gradient techniques.
Reestimation (iterative update and improvement), define ξt(i, j),
the probability of being in state Si at time t and state Sj at time t+1,
given the model and the observation sequence, i.e.
ξt(i, j) = P(qt = Si, qt+1 = Sj|O, λ)
33