GP_on_GPU - School of Engineering and Computer Science


Genetic Programming on General
Purpose Graphics Processing Units
(GPGPGPU)
Muhammad Iqbal
Evolutionary Computation Research Group
School of Engineering and Computer Science
Overview
• Graphics Processing Units (GPUs) are no longer limited to graphics:
• High degree of programmability
• Fast floating-point operations
• GPUs are now GPGPUs
• Genetic programming is a computationally intensive methodology, making it a prime candidate for GPU acceleration.
Outline
• Genetic Programming
• Genetic Programming Resource Demands
• GPU Programming
• Genetic Programming on GPU
• Automatically Defined Functions
Genetic Programming (GP)
• Evolutionary algorithm-based methodology
• To optimize a population of computer
programs
• Tree-based representation
• Example:
X    Output
0    1
1    3
2    7
3    13
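The example values above are consistent with the classic symbolic-regression target x*x + x + 1. A minimal sketch (not from the slides; names are illustrative) of how such a tree-based individual can be represented and evaluated:

    #include <stdio.h>

    /* Node types for a tiny GP expression tree. */
    typedef enum { ADD, MUL, VAR_X, CONST } NodeType;

    typedef struct Node {
        NodeType type;
        double value;               /* used only by CONST nodes */
        struct Node *left, *right;  /* NULL for terminal nodes  */
    } Node;

    /* Recursive interpreter: evaluate the tree for a given input x. */
    double eval(const Node *n, double x) {
        switch (n->type) {
            case ADD:   return eval(n->left, x) + eval(n->right, x);
            case MUL:   return eval(n->left, x) * eval(n->right, x);
            case VAR_X: return x;
            default:    return n->value;   /* CONST */
        }
    }

    int main(void) {
        /* Hand-built tree for (x*x) + (x + 1), matching the table above. */
        Node vx   = { VAR_X, 0, NULL, NULL };
        Node one  = { CONST, 1, NULL, NULL };
        Node sq   = { MUL, 0, &vx, &vx };
        Node xp1  = { ADD, 0, &vx, &one };
        Node root = { ADD, 0, &sq, &xp1 };

        for (int x = 0; x <= 3; ++x)
            printf("x = %d  ->  %g\n", x, eval(&root, (double)x));
        return 0;
    }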
GP Resource Demands
• GP is notoriously resource consuming
• CPU cycles
• Memory
• Standard GP system, 1μs per node
• Binary trees, depth 17: 131 ms per tree
• Fitness cases: 1,000; population size: 1,000
• Generations: 1,000; number of runs: 100
» Runtime: 10 Gs ≈ 317 years
• Standard GP system, 1 ns per node
» Runtime: 116 days
• This limits the problems we can approach with GP
[Banzhaf and Harding – GECCO 2009]
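For an order-of-magnitude check of these figures (a full binary tree of depth 17 has about 2^17 ≈ 131,000 nodes, i.e. roughly 0.1 s per tree at 1 μs per node):

    1,000 fitness cases × 1,000 individuals × 1,000 generations × 100 runs ≈ 10^11 tree evaluations
    10^11 evaluations × ~0.1 s each ≈ 10^10 s = 10 Gs ≈ 317 years
    At 1 ns per node the same workload takes ≈ 10^7 s ≈ 116 days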
Sources of Speed-up
• Fast machines
• Vector Processors
• Parallel Machines (MIMD/SIMD)
• Clusters
• Loose Networks
• Multi-core
• Graphics Processing Units (GPUs)
Why is a GPU faster than a CPU?
The GPU Devotes More Transistors to Data Processing.
[CUDA C Programming Guide, Version 3.2]
GPU Programming APIs
• There are a number of toolkits available for
programming GPUs.
• CUDA
• MS Accelerator
• RapidMind
• Shader programming
• So far, researchers in GP have not converged on one platform.
CUDA Programming
A massive number (>10,000) of lightweight threads.
CUDA Memory Model
CUDA exposes all the different types of memory on the GPU:
• Registers and local memory: private to each thread
• Shared memory: shared by the threads within a block
• Global, constant, and texture memory: accessible to every thread in the grid and to the host
[Figure: CUDA memory hierarchy showing the host, the device grid, thread blocks (0,0) and (1,0), their threads, and the memory spaces listed above]
[CUDA C Programming Guide, Version 3.2]
CUDA Programming Model
The GPU is viewed as a computing device operating as a coprocessor to the main CPU (the host).
• Data-parallel, computationally intensive functions should be off-loaded to the device.
• Functions that are executed many times, but independently on different data, are prime candidates, e.g. the bodies of for-loops.
• A function compiled for the device is called a kernel.
Stop Thinking About What to
Do and Start Doing It!
• Memory transfer time is expensive.
• Computation is cheap.
• Rather than calculating a value once and storing it in memory, it is often faster to simply recalculate it.
• Built-in variables:
• threadIdx
• blockIdx
• gridDim
• blockDim
Example: Increment Array Elements
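The code shown on the original slide is not preserved in this transcript. Below is a minimal sketch of the standard array-increment example (array size, block size, and variable names are illustrative):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Kernel: each thread increments one element of the array.
    __global__ void increment(float *a, float b, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (idx < n)                                      // guard threads beyond the array end
            a[idx] += b;
    }

    int main() {
        const int n = 1 << 20;                            // one million elements (illustrative)
        const size_t bytes = n * sizeof(float);

        float *h_a = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) h_a[i] = (float)i;

        float *d_a;
        cudaMalloc(&d_a, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);   // host -> device

        int blockSize = 256;
        int numBlocks = (n + blockSize - 1) / blockSize;        // round up
        increment<<<numBlocks, blockSize>>>(d_a, 1.0f, n);      // launch the kernel

        cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);    // device -> host
        printf("a[0] = %g, a[n-1] = %g\n", h_a[0], h_a[n - 1]);

        cudaFree(d_a);
        free(h_a);
        return 0;
    }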
Example: Matrix Addition
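Again, the original slide's code is not preserved here. A minimal sketch of element-wise matrix addition using a 2D grid of 2D thread blocks (the allocation and copy steps follow the same pattern as the previous example; all names and sizes are illustrative):

    #include <cuda_runtime.h>

    // One thread per matrix element, indexed in two dimensions.
    __global__ void matAdd(const float *A, const float *B, float *C, int N) {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (row < N && col < N)                       // guard the matrix edges
            C[row * N + col] = A[row * N + col] + B[row * N + col];
    }

    void launchMatAdd(const float *dA, const float *dB, float *dC, int N) {
        dim3 threads(16, 16);                         // 256 threads per block
        dim3 blocks((N + threads.x - 1) / threads.x,  // enough blocks to cover
                    (N + threads.y - 1) / threads.y); // an N x N matrix
        matAdd<<<blocks, threads>>>(dA, dB, dC, N);
    }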
Parallel Genetic Programming
While most GP work is conducted on sequential
computers, the following computationally intensive
features make it well suited to parallel hardware:
• Individuals are run on multiple independent
training examples.
• The fitness of each individual could be calculated
on independent hardware in parallel.
• Multiple independent runs of the GP system are needed for statistical confidence, because of the stochastic nature of the results.
[Langdon and Banzhaf, EuroGP-2008]
A Many Threaded CUDA Interpreter for
Genetic Programming
• Running Tree GP on GPU
• 8692 times faster than a PC without a GPU
• Solved the 20-bit Multiplexor
• 2^20 = 1,048,576 fitness cases
• Had never been solved by tree GP before
• Previously estimated time: more than 4 years
• The GPU has consistently done it in less than an hour
• Solved the 37-bit Multiplexor
• 2^37 = 137,438,953,472 fitness cases
• Had never been attempted before
• The GPU solves it in under a day
[W.B.Langdon, EuroGP-2010]
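Langdon's interpreter itself is not reproduced in this transcript. The sketch below only illustrates the general idea of interpreting GP programs on the GPU: each thread evaluates one fitness case of a program stored in postfix (reverse Polish) form, using the Boolean function set from the following slides. The encoding, names, and data layout are assumptions for illustration, not the published implementation.

    // Opcodes 0..numInputs-1 are terminals D0..Dn-1; larger values select a function.
    __global__ void evalProgram(const unsigned char *prog, int progLen,
                                const unsigned char *inputs,   // numCases x numInputs 0/1 values
                                const unsigned char *targets,  // expected output per case
                                int numInputs, int numCases,
                                int *hits)                     // fitness: correctly answered cases
    {
        int c = blockIdx.x * blockDim.x + threadIdx.x;         // one thread per fitness case
        if (c >= numCases) return;

        unsigned char stack[64];                               // postfix evaluation stack
        int sp = 0;

        for (int i = 0; i < progLen; ++i) {
            unsigned char op = prog[i];
            if (op < numInputs) {                              // terminal: push this case's input
                stack[sp++] = inputs[c * numInputs + op];
            } else {                                           // function: pop two, push result
                unsigned char y = stack[--sp];
                unsigned char x = stack[--sp];
                unsigned char r = 0;
                switch (op - numInputs) {
                    case 0: r = x & y;           break;        // AND
                    case 1: r = x | y;           break;        // OR
                    case 2: r = (x & y) ? 0 : 1; break;        // NAND
                    case 3: r = (x | y) ? 0 : 1; break;        // NOR
                }
                stack[sp++] = r;
            }
        }
        if (stack[sp - 1] == targets[c])
            atomicAdd(hits, 1);                                // accumulate program fitness
    }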
Boolean Multiplexor
A Boolean multiplexor has a address lines and d = 2^a data lines, giving n = a + d inputs in total.
The number of test cases is 2^n:
• 20-mux: about 1 million test cases (2^20)
• 37-mux: about 137 billion test cases (2^37)
[W.B.Langdon, EuroGP-2010]
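As a concrete illustration (not from the slides), the correct output for any multiplexor fitness case can be computed directly: the a address bits select which of the 2^a data bits to return. The names and bit ordering below are assumptions:

    // Target output of an n-input multiplexor, n = a + 2^a.
    // bits[0..a-1] are the address lines, bits[a..n-1] are the data lines.
    __host__ __device__ unsigned char muxTarget(const unsigned char *bits, int a) {
        int addr = 0;
        for (int i = 0; i < a; ++i)
            addr = (addr << 1) | bits[i];   // assemble the address value
        return bits[a + addr];              // the selected data line is the correct answer
    }
    // 20-mux: a = 4 (16 data lines); 37-mux: a = 5 (32 data lines).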
Genetic Programming Parameters
for Solving 20 and 37 Multiplexors
Terminals:            20 or 37 Boolean inputs, D0 – D19 or D0 – D36 respectively
Functions:            AND, OR, NAND, NOR
Fitness:              Pseudo-random sample of 2048 of 1,048,576 (20-Mux) or 8192 of 137,438,953,472 (37-Mux) fitness cases
Tournament:           4 members run on the same random sample; new samples for each tournament and each generation
Population:           262,144
Initial population:   Ramped half-and-half 4:5 (20-Mux) or 5:7 (37-Mux)
Parameters:           50% subtree crossover, 5% subtree mutation, 45% point mutation; max depth 15, max size 511 (20-Mux) or 1023 (37-Mux)
Termination:          5000 generations
Solutions are found in generations 423 (20-Mux) and 2866 (37-Mux).
[W.B.Langdon, EuroGP-2010]
AND, OR, NAND, NOR
AND: &      OR: |      NAND: d      NOR: r

X  Y  X&Y        X  Y  X|Y
0  0   0         0  0   0
0  1   0         0  1   1
1  0   0         1  0   1
1  1   1         1  1   1

X  Y  XdY        X  Y  XrY
0  0   1         0  0   1
0  1   1         0  1   0
1  0   1         1  0   0
1  1   0         1  1   0
Evolution of 20-Mux and 37-Mux
[W.B.Langdon, EuroGP-2010]
6-Mux Tree I
[W.B.Langdon, EuroGP-2010]
6-Mux Tree II
[W.B.Langdon, EuroGP-2010]
6-Mux Tree III
[W.B.Langdon, EuroGP-2010]
Ideal 6-Mux Tree
Automatically Defined Functions
(ADFs)
• Genetic programming trees often have repeated
patterns.
• Repeated subtrees can be treated as subroutines.
• ADFs are a methodology to automatically select and implement modularity in GP.
• This modularity can:
• Reduce the size of the GP tree
• Improve readability
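A hypothetical illustration (not from the slides): suppose an evolved individual reuses the subtree (x*x + x) three times. With an ADF that repeated subtree becomes a single reusable branch, shrinking the main tree:

    /* Without an ADF the subtree (x*x + x) appears three times in the main tree:
     *     ((x*x + x) * (x*x + x)) + (x*x + x)
     * With an ADF it is defined once and called three times:                     */
    float adf0(float x)    { return x * x + x; }                   /* evolved ADF branch      */
    float program(float x) { return adf0(x) * adf0(x) + adf0(x); } /* result-producing branch */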
Langdon’s CUDA Interpreter with ADFs
• ADFs slow the interpreter down:
• 20-Mux takes 9 hours instead of less than an hour
• 37-Mux takes more than 3 days instead of less than a day
• Improved ADF implementation:
• Previously used one thread per GP program
• Now uses one thread block per GP program
• Increased level of parallelism
• Reduced divergence
• 20-Mux takes 8 to 15 minutes
• 37-Mux takes 7 to 10 hours
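One way to picture the "one thread block per GP program" mapping (a sketch based only on the bullet points above, not the actual kernel): each block interprets a single program while its threads stride over that program's fitness cases, so all threads in a block follow the same instruction stream and divergence between programs disappears. interpret() stands for the postfix evaluation loop of the earlier sketch; all other names are assumptions.

    // Shared per-program evaluation: same postfix loop as in the earlier interpreter sketch.
    __device__ unsigned char interpret(const unsigned char *prog, int len,
                                       const unsigned char *caseInputs, int numInputs);

    __global__ void evalPopulation(const unsigned char *code, const int *progStart,
                                   const int *progLen, const unsigned char *inputs,
                                   const unsigned char *targets,
                                   int numInputs, int numCases, int *hits)
    {
        int p = blockIdx.x;                              // one thread block per GP program
        const unsigned char *prog = code + progStart[p];

        // Threads of this block share one program and stride over its fitness cases.
        for (int c = threadIdx.x; c < numCases; c += blockDim.x) {
            unsigned char out = interpret(prog, progLen[p], &inputs[c * numInputs], numInputs);
            if (out == targets[c])
                atomicAdd(&hits[p], 1);                  // per-program fitness counter
        }
    }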
6-Mux with ADF
Conclusion 1: GP
• Powerful machine learning algorithm
• Capable of searching through trillions of states to
find the solution
• GP trees often have repeated patterns and can be compacted using ADFs
• But computationally expensive
Conclusion 2: GPU
• Computationally fast
• Relatively low cost
• Requires a new programming paradigm, but one that is practical
• Accelerates processing by up to 3000 times for computationally intensive problems
• But not well suited to memory-intensive problems
Acknowledgement
• Dr Will Browne and Dr Mengjie Zhang for supervision.
• Kevin Buckley for technical support.
• Eric for help with CUDA compilation.
• Victoria University of Wellington for awarding the Victoria PhD Scholarship.
• All of you for coming.
Thank You
Questions?