Transcript Description

Artificial Intelligence
Project 2:
Classification Using Genetic Programming
2008. 10. 27
Kim, MinHyeok
[email protected]
Biointelligence laboratory
Contents



Project outline
Description on the data set
Genetic Programming
 Brief overview
 Fitness function & Selection methods
 Classification with GP (in this project)

Guide to writing reports
 Style & contents

Submission guide / Marking scheme
(C) 2008, SNU Biointelligence Laboratory
2
Outline

Goal
 Understand the Genetic Programming (GP) deeper
 Practice researching and writing a paper

Forest Fires problem (classification)
 To predict whether a fire occurs or not
 Using Genetic Programming
 Estimating several statistics on the dataset

Data set
 Variation of the ‘Forest Fires data set’
 http://archive.ics.uci.edu/ml/datasets/Forest+Fires
(C) 2008, SNU Biointelligence Laboratory
3
Forest Fires Data Set

Description
 Database of 517 samples
 You
can use at most 500 samples for training
 17 samples for prediction
 12 attributes
 X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,label
 Integer
or real value
 Label (Class)
 Two
classes
– 0 : a fire does not occur
– 1 : a fire occurs
(C) 2008, SNU Biointelligence Laboratory
4
Brief Summary of GP



A kind of evolutionary algorithms
It is represented with a tree structure
You need to set up following elements for GP run
 The set of terminals (input attributes, the class variable, constants)
 The set of functions (numerical / condition operators)
 The fitness measure
 The algorithm parameters
 population
size, maximum number of generations
 crossover rate and mutation rate
 maximum depth of GP trees etc.
 The method for designating a result and the criterion for
terminating a run.
(C) 2008, SNU Biointelligence Laboratory
5
GP Flowchart
GA loop
6
GP loop
Initialization

Maximum initial depth of trees Dmax is set.

Full method (each branch has depth = Dmax):
 nodes at depth d < Dmax randomly chosen from function set F
 nodes at depth d = Dmax randomly chosen from terminal set T

Grow method (each branch has depth  Dmax):
 nodes at depth d < Dmax randomly chosen from F  T
 nodes at depth d = Dmax randomly chosen from T

Common GP initialisation: ramped half-and-half, where grow
and full method each deliver half of initial population
(C) 2008, SNU Biointelligence Laboratory
7
Fitness Functions

Relative squared error
 yi  fˆ ( xi ) 

Fitness   

yi
i 1 

n


2
The number of outputs that are within % of the correct
value
And you can try other fitness functions which are welldefined to solve problems
Selection methods (1/2)

Fitness proportional (roulette wheel) selection
 The roulette wheel can be constructed as follows.
 Calculate
the total fitness for the population.
F
POP _ SIZE
 f (i )
k 1
 Calculate
k
selection probability pk for each chromosome vk.
f (ik )
pk 
,
k  1,2,..., POP _ SIZE
F
 Calculate cumulative probability qk for each chromosome vk.
k
qk   p j ,
j 1
k  1,2,..., POP _ SIZE

Procedure: Proportional_Selection
 Generate a random number r from the range [0,1].
 If r  q1, then select the first chromosome v1; else, select the kth
chromosome vk (2 k  pop_size) such that qk-1 < r  qk.
1
1
pk
qk
0.082407
0.082407
0.9
0.8
2
0.110652
F
0.193059
0.131931
0.324989
0.6
4
0.121423
0.446412
5
0.072597
0.519009
6
0.128834
0.647843
7
0.077959
0.725802
8
0.102013
0.827802
9
0.083663
0.911479
10
0.088521
1.000000
 f (i )  0.036441
k 1
0.7
3
pop _ size
0.5
0.4
0.3
0.2
0.1
0
1
k
Selection methods (2/2)

Tournament selection
 Tournament size q

Ranking-based selection
pi 
1 
i 1 
  (     )


 1 
 2    POP_SIZE
 1  +  2 and - = 2 - +

Elitism
 To preserve n good solutions until the next generation
Classification with GP (in this project)

IF
Function Regression
 Search a function f(x) s.t.
 f(x)
≥ threshold t
 f(x)< threshold t

when y=1
when y=0
f(x)
Converting to Boolean value
∧
¬
∨
>
=
rain
0
RH
<
50
1
>
wind
FFMC
+
ISI
t
0
What to do for the experiment?

Select a library that implements GP
 You can find various libraries written in C++/Java/Matlab
 See the list of recommended libraries on the next page

Build up your own code for the experiment
 Check sample codes and tutorials of libraries for quick start
 Add comments to explain the flow of your program

Caution
 Running GP may take much time
(C) 2008, SNU Biointelligence Laboratory
13
Recommended Libraries for GP

C++
 GPLib: http://www.cs.bham.ac.uk/~cmf/GPLib/index.html

Java
 JGAP: http://jgap.sourceforge.net/
 ECJ: http://cs.gmu.edu/~eclab/projects/ecj/

Matlab toolbox
 GPLAB: http://gplab.sourceforge.net/

More References
 Implementations section in Wiki – Genetic Programming:
http://en.wikipedia.org/wiki/Genetic_programming
(C) 2008, SNU Biointelligence Laboratory
14
Reports Style


English only!!
Scientific journal-style
 How to Write A Paper in Scientific Journal Style and Format
 http://abacus.bates.edu/~ganderso/biology/resources/writing/HTWsections.html
Experimental process
Section of Paper
What did I do in a nutshell?
Abstract
What is the problem?
Introduction
How did I solve the problem?
Materials and Methods
What did I find out?
Results
What does it mean?
Discussion
Who helped me out?
Acknowledgments (optional)
Whose work did I refer to?
Literature Cited
Extra Information
Appendices (optional)
(C) 2008, SNU Biointelligence Laboratory
15
Report Contents (1/3)

System description
 Used programming language and running environments

Result tables
Training
Your prediction
Average  SD Best

Worst 1 2 … 16 17 Equation
Setting 1
% %
%
%
Setting 2
% %
%
%
Setting 3
% %
%
%
Analysis & discussion (Very Important!!)
(C) 2008, SNU Biointelligence Laboratory
16
Report Contents (2/3)

Graph
 Avg., Max. Fitness versus Generation
 Tree size versus Generation
(C) 2008, SNU Biointelligence Laboratory
17
Report Contents (3/3)

Basic experiments
 Changing parameters for the crossover and mutation
 Various function sets: arithmetic, numerical

Optional experiments
 Various selection methods
 Depth limitation
 Population size, generation numbers
 Comparison to Neural Network
…

References
(C) 2008, SNU Biointelligence Laboratory
18
Submission Guide

Due date: Nov. 19 (Wed) 18:00
 Submit both ‘hardcopy’ and ‘email’
 Hardcopy submission to the office (301-417 )
 E-mail submission to [email protected]

Subject : [AI Project1 Report] Student number, Name

Report + your source code with comments + executable file(s)
 Length: report should be summarized within 12 pages.
 We are NOT interested in the accuracy and your programming skill,
but your creativity and research ability.

If your major is not a C.S, team project with a C.S major student is
possible (Use the class board to find your partner and notice the
information of your team to TA ([email protected]) by Nov. 5)
(C) 2008, SNU Biointelligence Laboratory
19
Marking Scheme



5 points for programming
5 points for result prediction
30 points for experiment & analysis
 15 pts for experiments, 15pts for analysis


10 points for report
Late work
 - 10% per one day
 Maximum 7 days
(C) 2008, SNU Biointelligence Laboratory
20
QnA
(C) 2008, SNU Biointelligence Laboratory
21
Test Data
X
Y
month
day
FFMC
DMC
DC
ISI
temp
RH
wind
rain
Data01
6
5
9
3
92.9
133.3
699.6
9.2
26.4
21
4.5
0
Data02
6
3
11
2
79.5
3
106.7
1.1
11.8
31
4.5
0
Data03
4
3
7
4
93.2
114.4
560
9.5
30.2
22
4.9
0
Data04
6
5
6
1
90.4
93.3
298.1
7.5
19.1
39
5.4
0
Data05
6
3
4
7
91
14.6
25.6
12.3
13.7
33
9.4
0
Data06
5
4
4
7
91
14.6
25.6
12.3
17.6
27
5.8
0
Data07
4
3
5
5
89.6
25.4
73.7
5.7
18
40
4
0
Data08
7
5
10
1
91.7
48.5
696.1
11.1
16.1
44
4
0
Data09
8
6
3
5
91.7
33.3
77.5
9
8.3
97
4
0.2
Data10
7
5
8
2
96.1
181.1
671.2
14.3
27.3
63
4.9
6.4
Data11
6
5
9
6
91.2
94.3
744.4
8.4
15.4
57
4.9
0
Data12
8
6
8
1
92.1
207
672.6
8.2
21.1
54
2.2
0
Data13
7
4
9
5
88.2
55.2
732.3
11.6
15.2
64
3.1
0
Data14
4
3
9
2
91.9
111.7
770.3
6.5
15.9
53
2.2
0
Data15
3
6
9
7
92.4
124.1
680.7
8.5
17.2
58
1.3
0
Data16
3
6
9
1
90.9
126.5
686.5
7
15.6
66
3.1
0
Data17
9
9
7
2
85.8
48.3
313.4
3.9
18
42
2.7
0