Transcript Description
Artificial Intelligence
Project 2:
Classification Using Genetic Programming
2008. 10. 27
Kim, MinHyeok
[email protected]
Biointelligence laboratory
Contents
Project outline
Description on the data set
Genetic Programming
Brief overview
Fitness function & Selection methods
Classification with GP (in this project)
Guide to writing reports
Style & contents
Submission guide / Marking scheme
(C) 2008, SNU Biointelligence Laboratory
2
Outline
Goal
Understand the Genetic Programming (GP) deeper
Practice researching and writing a paper
Forest Fires problem (classification)
To predict whether a fire occurs or not
Using Genetic Programming
Estimating several statistics on the dataset
Data set
Variation of the ‘Forest Fires data set’
http://archive.ics.uci.edu/ml/datasets/Forest+Fires
(C) 2008, SNU Biointelligence Laboratory
3
Forest Fires Data Set
Description
Database of 517 samples
You
can use at most 500 samples for training
17 samples for prediction
12 attributes
X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,label
Integer
or real value
Label (Class)
Two
classes
– 0 : a fire does not occur
– 1 : a fire occurs
(C) 2008, SNU Biointelligence Laboratory
4
Brief Summary of GP
A kind of evolutionary algorithms
It is represented with a tree structure
You need to set up following elements for GP run
The set of terminals (input attributes, the class variable, constants)
The set of functions (numerical / condition operators)
The fitness measure
The algorithm parameters
population
size, maximum number of generations
crossover rate and mutation rate
maximum depth of GP trees etc.
The method for designating a result and the criterion for
terminating a run.
(C) 2008, SNU Biointelligence Laboratory
5
GP Flowchart
GA loop
6
GP loop
Initialization
Maximum initial depth of trees Dmax is set.
Full method (each branch has depth = Dmax):
nodes at depth d < Dmax randomly chosen from function set F
nodes at depth d = Dmax randomly chosen from terminal set T
Grow method (each branch has depth Dmax):
nodes at depth d < Dmax randomly chosen from F T
nodes at depth d = Dmax randomly chosen from T
Common GP initialisation: ramped half-and-half, where grow
and full method each deliver half of initial population
(C) 2008, SNU Biointelligence Laboratory
7
Fitness Functions
Relative squared error
yi fˆ ( xi )
Fitness
yi
i 1
n
2
The number of outputs that are within % of the correct
value
And you can try other fitness functions which are welldefined to solve problems
Selection methods (1/2)
Fitness proportional (roulette wheel) selection
The roulette wheel can be constructed as follows.
Calculate
the total fitness for the population.
F
POP _ SIZE
f (i )
k 1
Calculate
k
selection probability pk for each chromosome vk.
f (ik )
pk
,
k 1,2,..., POP _ SIZE
F
Calculate cumulative probability qk for each chromosome vk.
k
qk p j ,
j 1
k 1,2,..., POP _ SIZE
Procedure: Proportional_Selection
Generate a random number r from the range [0,1].
If r q1, then select the first chromosome v1; else, select the kth
chromosome vk (2 k pop_size) such that qk-1 < r qk.
1
1
pk
qk
0.082407
0.082407
0.9
0.8
2
0.110652
F
0.193059
0.131931
0.324989
0.6
4
0.121423
0.446412
5
0.072597
0.519009
6
0.128834
0.647843
7
0.077959
0.725802
8
0.102013
0.827802
9
0.083663
0.911479
10
0.088521
1.000000
f (i ) 0.036441
k 1
0.7
3
pop _ size
0.5
0.4
0.3
0.2
0.1
0
1
k
Selection methods (2/2)
Tournament selection
Tournament size q
Ranking-based selection
pi
1
i 1
( )
1
2 POP_SIZE
1 + 2 and - = 2 - +
Elitism
To preserve n good solutions until the next generation
Classification with GP (in this project)
IF
Function Regression
Search a function f(x) s.t.
f(x)
≥ threshold t
f(x)< threshold t
when y=1
when y=0
f(x)
Converting to Boolean value
∧
¬
∨
>
=
rain
0
RH
<
50
1
>
wind
FFMC
+
ISI
t
0
What to do for the experiment?
Select a library that implements GP
You can find various libraries written in C++/Java/Matlab
See the list of recommended libraries on the next page
Build up your own code for the experiment
Check sample codes and tutorials of libraries for quick start
Add comments to explain the flow of your program
Caution
Running GP may take much time
(C) 2008, SNU Biointelligence Laboratory
13
Recommended Libraries for GP
C++
GPLib: http://www.cs.bham.ac.uk/~cmf/GPLib/index.html
Java
JGAP: http://jgap.sourceforge.net/
ECJ: http://cs.gmu.edu/~eclab/projects/ecj/
Matlab toolbox
GPLAB: http://gplab.sourceforge.net/
More References
Implementations section in Wiki – Genetic Programming:
http://en.wikipedia.org/wiki/Genetic_programming
(C) 2008, SNU Biointelligence Laboratory
14
Reports Style
English only!!
Scientific journal-style
How to Write A Paper in Scientific Journal Style and Format
http://abacus.bates.edu/~ganderso/biology/resources/writing/HTWsections.html
Experimental process
Section of Paper
What did I do in a nutshell?
Abstract
What is the problem?
Introduction
How did I solve the problem?
Materials and Methods
What did I find out?
Results
What does it mean?
Discussion
Who helped me out?
Acknowledgments (optional)
Whose work did I refer to?
Literature Cited
Extra Information
Appendices (optional)
(C) 2008, SNU Biointelligence Laboratory
15
Report Contents (1/3)
System description
Used programming language and running environments
Result tables
Training
Your prediction
Average SD Best
Worst 1 2 … 16 17 Equation
Setting 1
% %
%
%
Setting 2
% %
%
%
Setting 3
% %
%
%
Analysis & discussion (Very Important!!)
(C) 2008, SNU Biointelligence Laboratory
16
Report Contents (2/3)
Graph
Avg., Max. Fitness versus Generation
Tree size versus Generation
(C) 2008, SNU Biointelligence Laboratory
17
Report Contents (3/3)
Basic experiments
Changing parameters for the crossover and mutation
Various function sets: arithmetic, numerical
Optional experiments
Various selection methods
Depth limitation
Population size, generation numbers
Comparison to Neural Network
…
References
(C) 2008, SNU Biointelligence Laboratory
18
Submission Guide
Due date: Nov. 19 (Wed) 18:00
Submit both ‘hardcopy’ and ‘email’
Hardcopy submission to the office (301-417 )
E-mail submission to [email protected]
Subject : [AI Project1 Report] Student number, Name
Report + your source code with comments + executable file(s)
Length: report should be summarized within 12 pages.
We are NOT interested in the accuracy and your programming skill,
but your creativity and research ability.
If your major is not a C.S, team project with a C.S major student is
possible (Use the class board to find your partner and notice the
information of your team to TA ([email protected]) by Nov. 5)
(C) 2008, SNU Biointelligence Laboratory
19
Marking Scheme
5 points for programming
5 points for result prediction
30 points for experiment & analysis
15 pts for experiments, 15pts for analysis
10 points for report
Late work
- 10% per one day
Maximum 7 days
(C) 2008, SNU Biointelligence Laboratory
20
QnA
(C) 2008, SNU Biointelligence Laboratory
21
Test Data
X
Y
month
day
FFMC
DMC
DC
ISI
temp
RH
wind
rain
Data01
6
5
9
3
92.9
133.3
699.6
9.2
26.4
21
4.5
0
Data02
6
3
11
2
79.5
3
106.7
1.1
11.8
31
4.5
0
Data03
4
3
7
4
93.2
114.4
560
9.5
30.2
22
4.9
0
Data04
6
5
6
1
90.4
93.3
298.1
7.5
19.1
39
5.4
0
Data05
6
3
4
7
91
14.6
25.6
12.3
13.7
33
9.4
0
Data06
5
4
4
7
91
14.6
25.6
12.3
17.6
27
5.8
0
Data07
4
3
5
5
89.6
25.4
73.7
5.7
18
40
4
0
Data08
7
5
10
1
91.7
48.5
696.1
11.1
16.1
44
4
0
Data09
8
6
3
5
91.7
33.3
77.5
9
8.3
97
4
0.2
Data10
7
5
8
2
96.1
181.1
671.2
14.3
27.3
63
4.9
6.4
Data11
6
5
9
6
91.2
94.3
744.4
8.4
15.4
57
4.9
0
Data12
8
6
8
1
92.1
207
672.6
8.2
21.1
54
2.2
0
Data13
7
4
9
5
88.2
55.2
732.3
11.6
15.2
64
3.1
0
Data14
4
3
9
2
91.9
111.7
770.3
6.5
15.9
53
2.2
0
Data15
3
6
9
7
92.4
124.1
680.7
8.5
17.2
58
1.3
0
Data16
3
6
9
1
90.9
126.5
686.5
7
15.6
66
3.1
0
Data17
9
9
7
2
85.8
48.3
313.4
3.9
18
42
2.7
0