slides - Indico

Download Report

Transcript slides - Indico

Computing in High Energy and Nuclear Physics, 13-17 February 2006, Mumbai, India
1
2




Introduction to evolutionary computation
Gene Expression Programming (GEP)
Application of GEP for Event Selection
Conclusion
Liliana Teodorescu, Brunel University
3
 Evolutionary computation simulates the natural evolution on a computer
process leading to maintenance or increase of a population
ability to survive and reproduce in a specific environment
quantitatively measured by evolutionary fitness
 Goal of natural evolution - to generate a population of individuals with
increasing fitness
 Goal of evolutionary computation - to generate a set of solutions (to a
problem) of increasing quality
Liliana Teodorescu, Brunel University
4
 Individual – candidate solution to a problem
decoding
encoding
 Chromosome – representation of the candidate solution
 Gene – constituent entity of the chromosome
 Population – set of individuals/chromosomes
 Fitness function – representation of how good a candidate solution is
 Genetic operators – operators applied on chromosomes in order
to create genetic variation (other chromosomes)
Liliana Teodorescu, Brunel University
5
Natural evolution simulation - core of the evolutionary algorithms:
Basic evolutionary algorithm
Run
Start
Initial population creation (randomly)
Fitness evaluation (of each chromosome)
New generation
 Problem definition
 Encoding of the
candidate solution
 Fitness definition
 Run
 Decoding the best fitted
chromosome = solution
yes
Terminate?
Stop
no
Selection of individuals (proportional with fitness)
Reproduction (genetic operators)
Replacement of the current population with the new one
Liliana Teodorescu, Brunel University
6
 Genetic Algorithms (GA) (J. H. Holland, 1975)
 Genetic Programming (GP) (J. R. Koza, 1992)
 Gene Expression Programming (GEP) (C. Ferreira, 2001)
Main differences
 Encoding method
 Reproduction method
Liliana Teodorescu, Brunel University
7
 search for the computer program that solve the problem (as GP)
 works with two entities: chromosomes and expression trees
Encoding
Candidate solution represented
by an expression tree (ET)
Q
( a  b )  (c  d )
*
Q*-+abcd
+
a
ET encoded in a chromosome:
read ET left - right
and top - down
b
c
Q means sqrt
d
Decoding the chromosome (translates the chromosome in an ET)
•first line of ET (root) – first element of the chromosome
•next line of ET – as many arguments needed by the element in
the previous line
Liliana Teodorescu, Brunel University
8
Reproduction
Genetic operators applied on chromosoms not on ET =>
always produce sintactically correct structures!
 Recombination – exchanges parts of two chromosomes
 Mutation – changes the value of a node
 Transposition – moves a part of the chromosome to another location
in the same chromosome
e.g. Mutation: Q replaced with *
*b+a-aQab+//+b+babbabbbababbaaa
*
+
b
-
a
a
*b+a-aQab+//+b+babbabbbababbaaa
*
+
b
-
a
Q
a
a
*
a
b
Liliana Teodorescu, Brunel University
9
Chromosome – has one or more genes of equal length
Gene – head: contains both functions and terminals (length h)
- tail: contains only terminals (length t)
t=h(n-1)+1
n – number of arguments of the function
with the highest number of arguments
e.g. set of functions: Q,*,/,-,+
set of terminals: a,b
n=2; h=15 (choosen)
t =16 => length of gene=15+16=31
*
+
b
-
a
a
*b+a-aQab+//+b+babbabbbababbaaa
ET ends before the end of the gene!
Q
a
Liliana Teodorescu, Brunel University
10
GEP for event selection
 cuts/selection criteria finding
 classification problem (signal/background classification)
 statistical learning approach
Data samples:
 Monte-Carlo simulation from BaBar experiment
 Ks production in e+e- (~10 GeV)
 5000 training events (for classification rule extraction)
 5000 test events (others than training events)
 limitations imposed by APS 3.0
 S/N = 0.25, 1, 5
Software resources
 APS 3.0 (Automatic Problem Solver)
- commercial package (Windows based)
- www.gepsoft.com
- function finding, classification, time series analysis
Liliana Teodorescu, Brunel University
11
Functions and constants to be used in the classification rules (cut type rule)
10 functions
- AND1 (x<0 and y<0 => 1 else 0), AND2 (x0 and y0 => 1 else 0)
- OR1 (x<0 or y<0 => 1 else 0), OR2 (x0 or y0 => 1 else 0)
- <, >, <=, >=, =, !=
36 functions
-previous 10 functions + common mathematical functions
constants
- floating point constants (-10,10)
 
Data – variables usually used in a cut based analysis for K S    selection
- doca (distance of closest approach)
- |cos(hel)|
- Fsig (Flight Significance)
- Mass (KS reconstructed mass)
- RXY, |RZ| (region around interaction point)
- SFL (Signed Flight Length)
- Pchi (2 probability of the vertex)
GEP parameters
- fitness function: number of hits (events correctly classified)
- gene length (head = 1-20)
- no. of chromosomes per generation: 100
- no. of generations per run: 1000-20000
- genetic operators rates: mutation 0.044, inversion 0.1, transposition 0.3, recombination 0.1
Liliana Teodorescu, Brunel University
12
Data sample: S/N =0.25
10 functions
No. of genes = 1, Head length =10
Fsig  5.26,
Rxy < 0.19,
doca <1,
Pchi > 0
Classification Accuracy = 95.36%
Liliana Teodorescu, Brunel University
13
Data samples: S/N =0.25, 1, 5
10 functions
95%
S/N = 0.25
5%
92%
S/N = 1
92%
S/N = 5
8%
8%
Liliana Teodorescu, Brunel University
14
Data sample: S/N =0.25
10 functions
Head
Acc. (%)
Selection criteria
1
83.34
Fsig  9.93
2
90.76
Fsig 8.80, doca <1
3
94.88
Fsig > 3.67, Rxy  Pchi
4
94.88
Fsig > 3.67, Rxy  Pchi
5
95.04
Fsig  3.63, |Rz|  2.65, Rxy < Pchi
7
95.04
Fsig  3.64, Rxy < Pchi, Pchi > 0
10
95.360
Fsig  5.26, Rxy < 0.19, doca <1, Pchi > 0
20
95.50
Fsig > 4.1, Rxy  0.2, SFL > 0.2, Pchi > 0,
doca > 0, Rxy  Mass
Liliana Teodorescu, Brunel University
15
GEP
20
Cut-based analysis
95.50
Fsig > 4.1, Rxy  0.2, SFL > 0.2, Pchi > 0,
doca > 0, Rxy  Mass
No cut
Fsig  4.0
Rxy  0.2cm
SFL  0cm
Pchi > 0.001
doca  0.4cm
|Rz|  2.8cm
Previous cuts
+
doca > 0
Rxy  Mass
Reduction
S: 15%
B: 98%
Reduction
S: 16%
B: 98.3%
Fsig  4.1
Rxy  0.2cm
SFL  0.2cm
Pchi > 0
Previous cuts
+
doca  0.4
| Rz|  2.8cm
Liliana Teodorescu, Brunel University
16
GEP
5
Reduction
S: 7.6%
B: 87.8%
95.04
Fsig  3.63, |Rz|  2.65, Rxy < Pchi
Fsig  3.63
|Rz|  2.65cm
Reduction
S: 16%
B: 97.8%
Fsig  3.63
|Rz|  2.65cm
Rxy<Pchi
Liliana Teodorescu, Brunel University
17
Data samples: S/N =0.25, 1, 5
10 functions
S/N = 0.25
S/N = 5
S/N = 1
Liliana Teodorescu, Brunel University
18
Data sample: S/N =0.25
36 mathematical functions
No. of genes = 1, Head length =10
Classification Accuracy = 95.00%
Liliana Teodorescu, Brunel University
19
Data sample: S/N =0.25
36 functions
Classification Accuracy  95%
Liliana Teodorescu, Brunel University
20
GEP allows
 fast identification of powerful cuts
 signal/background separation of 92-95% accuracy
for samples with S/N = 0.25, 1, 5
 potential of discovering new correlations between
variables
 large number of selection functions does not improve
the classification accuracy
GEP
 is still in the R&D phase
 needs software development -> underway
Liliana Teodorescu, Brunel University