Neural networks – model-independent data analysis?



K. M. Graczyk
IFT, Uniwersytet Wrocławski
Poland
Abstract
• In this seminar I will discuss the application of feed-forward neural networks to the analysis of experimental data. In particular, I will focus on the Bayesian approach, which allows one to classify and select the best research hypothesis. The method has a naturally built-in "Occam's razor" criterion, which prefers models of lower complexity. An additional advantage of the approach is that no test set is required to verify the training process.
• In the second part of the seminar I will discuss my own implementation of a neural network, which includes Bayesian learning methods. Finally, I will show my first applications to scattering data.
Why Neural Networks?
• Look at electromagnetic form factor data
  – Simple
  – Straightforward
  – Then attack more serious problems
• Inspired by C. Giunti (Torino)
  – Papers of Forte et al. (JHEP 0205:062,2002; JHEP 0503:080,2005; JHEP 0703:039,2007; Nucl.Phys.B809:163,2009)
  – A kind of model-independent way of fitting data and computing the associated uncertainty
• Cooperation with R. Sulej (IPJ, Warszawa) and P. Płoński (Politechnika Warszawska)
  – NetMaker
• GrANNet ;) my own C++ library
Road map
• Artificial Neural Networks (NN)
– idea
• FeedForward NN
• Bayesian statistics
• Bayesian approach to NN
• PDFs by NN
• GrANNet
• Form Factors by NN
Inspired by Nature
Applications, general list
• Function approximation, or regression analysis, including time series prediction, fitness approximation and modeling.
• Classification, including pattern and sequence recognition, novelty detection and sequential decision making.
• Data processing, including filtering, clustering, blind source separation and compression.
• Robotics, including directing manipulators, computer numerical control.
Artificial Neural Network
[Diagram: input layer → hidden layer → output (target). The i-th perceptron takes inputs 1, 2, 3, …, k with their weights, sums them, applies a threshold and an activation function, and produces an output.]
Examples of such maps:
• (Q², x) → F2(Q², x; w_ij)
• Q² → G_M(Q²; w_ij)
• (Q², ε) → σ(Q², ε; w_ij)
A map from one vector space to another.
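As an illustration, a minimal C++ sketch of such a map (not the GrANNet or NetMaker code; the layer sizes and weight values are invented for the example): a single-hidden-layer feed-forward network evaluating an output as a function of Q².

#include <cmath>
#include <cstdio>
#include <vector>

// Minimal feed-forward pass: 1 input -> H sigmoid hidden units -> 1 linear output.
struct SimpleNet {
    std::vector<double> w_hid, b_hid, w_out;
    double b_out = 0.0;

    double sigmoid(double x) const { return 1.0 / (1.0 + std::exp(-x)); }

    double evaluate(double q2) const {
        double y = b_out;
        for (std::size_t h = 0; h < w_hid.size(); ++h) {
            double a = sigmoid(w_hid[h] * q2 + b_hid[h]);  // hidden activation
            y += w_out[h] * a;                             // linear output unit
        }
        return y;
    }
};

int main() {
    // Illustrative weights only; in practice they are fixed by training.
    SimpleNet net{{0.7, -1.3}, {0.1, 0.4}, {1.2, -0.8}};
    std::printf("G(Q2=1.0) = %f\n", net.evaluate(1.0));
    return 0;
}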
Neural Networks
• The universal approximation theorem for neural networks states that every continuous function that maps intervals of real numbers to some output interval of real numbers can be approximated arbitrarily closely by a multi-layer perceptron with just one hidden layer. This result holds only for restricted classes of activation functions, e.g. for the sigmoidal functions. (Wikipedia.org)
Feed-Forward Network
[Plot: activation functions – sigmoid and tanh(x).]
Typical activation functions:
• Heaviside step function θ(x): 0 or 1 signal
• Sigmoid function g(x) = 1/(1 + e^(-x))
• tanh(x)
Architecture
• 3-layer network, two hidden layers: 1:2:1:1
• Parameters: weights 1·2 + 2·1 + 1·1 = 5 plus bias connections 2 + 1 + 1 = 4, so #par = 9
• Bias neurons, instead of thresholds
• Output G(Q²) as a function of the input Q²
• Hidden units: symmetric sigmoid function; output unit: linear function
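A small sketch of the parameter counting for an arbitrary layout (the layer sizes below are just the 1:2:1:1 example above; this is not GrANNet code):

#include <cstdio>
#include <vector>

// Number of parameters of a fully connected feed-forward network with bias
// neurons: for each pair of consecutive layers, n_in * n_out weights + n_out biases.
int countParameters(const std::vector<int>& layers) {
    int npar = 0;
    for (std::size_t l = 0; l + 1 < layers.size(); ++l)
        npar += layers[l] * layers[l + 1] + layers[l + 1];
    return npar;
}

int main() {
    std::printf("#par = %d\n", countParameters({1, 2, 1, 1}));  // prints 9
    return 0;
}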
Supervised Learning
• Propose the error function (standard error function, chi², etc. – any continuous function which has a global minimum)
• Consider a set of data
• Train the given network with the data → minimize the error function
  – Back-propagation algorithms
  – Iterative procedure which fixes the weights
Learning
• Gradient algorithms (a minimal gradient-descent sketch is given below):
  – Gradient descent
  – QuickProp (Fahlman)
  – RPROP (Riedmiller & Braun)
  – Conjugate gradients
  – Levenberg-Marquardt (Hessian)
  – Newtonian method (Hessian)
• Monte Carlo algorithms (based on the Markov chain algorithm)
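For orientation, a minimal gradient-descent sketch in C++ (a generic illustration, not the actual NetMaker/GrANNet training code; the error function, learning rate and starting point are invented). The gradient is taken numerically, so the same loop works for any continuous error function:

#include <cstdio>
#include <vector>

// One possible error function: E(w) = sum_i (w_i - c_i)^2 with an arbitrary target c.
double error(const std::vector<double>& w) {
    const double c[2] = {1.5, -0.5};
    double e = 0.0;
    for (std::size_t i = 0; i < w.size(); ++i) e += (w[i] - c[i]) * (w[i] - c[i]);
    return e;
}

int main() {
    std::vector<double> w(2, 0.0);       // initial weights
    const double eta = 0.1, h = 1e-6;    // learning rate, step for numerical gradient
    for (int cycle = 0; cycle < 200; ++cycle) {
        std::vector<double> grad(w.size());
        for (std::size_t i = 0; i < w.size(); ++i) {
            std::vector<double> wp = w;
            wp[i] += h;
            grad[i] = (error(wp) - error(w)) / h;  // forward-difference gradient
        }
        for (std::size_t i = 0; i < w.size(); ++i)
            w[i] -= eta * grad[i];                 // gradient-descent update
    }
    std::printf("w = (%f, %f), E = %f\n", w[0], w[1], error(w));
    return 0;
}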
Overfitting
• More complex models describe the data better, but lose generality
  – bias-variance trade-off
• After fitting one needs to compare with a test set (which must be twice as large as the original set)
• Overfitting → large values of the weights
• Regularization → an additional penalty term in the error function:

\tilde{E}_D = E_D + \alpha E_W, \qquad E_W = \frac{1}{2} \sum_{i=1}^{W} w_i^2

\frac{dw}{dt} = -\nabla E_D - \alpha w; \qquad \text{in the absence of data} \quad w(t) = w(0) \exp(-\alpha t)
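A sketch of how the penalty term acts during training (generic illustration, not library code): the regularizer adds αw to the gradient, so with no data term the weights decay exponentially, as in the analytic solution above.

#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const double alpha = 0.1, eta = 0.01;      // decay constant and learning rate
    std::vector<double> w = {2.0, -3.0, 0.5};  // some invented starting weights

    // Pure weight decay: with no data term, dw/dt = -alpha*w,
    // integrated here by simple Euler steps of size eta.
    for (int t = 0; t < 1000; ++t)
        for (double& wi : w) wi -= eta * alpha * wi;

    // Compare with the analytic solution w(t) = w(0) * exp(-alpha * t).
    double T = 1000 * eta;
    std::printf("numeric w[0] = %f, analytic = %f\n", w[0], 2.0 * std::exp(-alpha * T));
    return 0;
}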
Fitting data with Artificial Neural Networks
'The goal of the network training is not to learn an exact representation of the training data itself, but rather to build a statistical model of the process which generates the data.'
C. Bishop, Neural Networks for Pattern Recognition
Parton Distribution Functions with NN
Some method but…
[Network diagram: (Q², x) → F2]
Parton Distribution Functions: S. Forte, L. Garrido, J. I. Latorre and A. Piccione, JHEP 0205 (2002) 062
• A kind of model-independent analysis of the data
• Construction of the probability density P[G(Q²)] in the space of the structure functions
  – In practice only one neural network architecture
• Probability density in the space of parameters of one particular NN
But in reality Forte et al. did:
• Generating Monte Carlo pseudo-data – the idea comes from W. T. Giele and S. Keller
• Training Nrep neural networks, one for each set of Ndat pseudo-data
• The Nrep trained neural networks provide a representation of the probability measure in the space of the structure functions → uncertainty, correlation
[Plots: 10, 100 and 1000 replicas; training length – too short, long enough, too long (30 data points, overfitting)]
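A minimal sketch of the pseudo-data (replica) generation step in C++ (my reading of the procedure, not the original code; the data values, uncertainties and number of replicas are invented). Each replica is obtained by smearing every data point with a Gaussian of width equal to its quoted uncertainty; one network is then trained per replica:

#include <cstdio>
#include <random>
#include <vector>

struct DataPoint { double x, value, sigma; };

// Generate Nrep Gaussian-smeared copies of the data set; each copy would then
// be fitted by its own neural network.
std::vector<std::vector<DataPoint>> makeReplicas(const std::vector<DataPoint>& data,
                                                 int nrep, unsigned seed = 1234) {
    std::mt19937 gen(seed);
    std::vector<std::vector<DataPoint>> replicas(nrep, data);
    for (auto& rep : replicas)
        for (auto& p : rep) {
            std::normal_distribution<double> gauss(p.value, p.sigma);
            p.value = gauss(gen);   // smear within the experimental uncertainty
        }
    return replicas;
}

int main() {
    std::vector<DataPoint> data = {{0.5, 1.02, 0.03}, {1.0, 0.91, 0.02}, {2.0, 0.75, 0.04}};
    auto replicas = makeReplicas(data, 100);
    std::printf("replica 0, point 0: %f\n", replicas[0][0].value);
    return 0;
}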
My criticism
• Artificial data and a chi² error function → an overestimated error function?
• They do not discuss other architectures?
• Problems with overfitting?
Form Factors with NN, done with the FANN library
Applying Forte et al.
How to apply NN to the ep data
• First stage: checking if the NN are able to work at a reasonable level
  – GE, GM and the Ratio separately
    • Input Q² → output form factor
    • The standard error function
    • GE: 200 points
    • GM: 86 points
    • Ratio: 152 points
  – Combination of GE, GM and the Ratio
    • Input Q² → output GM and GE
    • The standard error function: a sum of three functions
    • GE+GM+Ratio: around 260 points
• One needs to constrain the fits by adding some artificial points with GE(0) = GM(0)/μp = 1
[Plots: neural network fits compared with the fit with TPE (our work) for GMp, GEp, the Ratio, GEn and GMn.]
Bayesian Approach
'common sense reduced to calculations'
Bayesian Framework for BackProp NN; MacKay, Bishop, …
• Objective criteria for comparing alternative network solutions, in particular with different architectures
• Objective criteria for setting the decay rate α
• Objective choice of the regularising function E_W
• Comparison with test data is not required.
Notation and Conventions
t_i – data point (target), vector
x_i – input, vector
y(x_i) – network response
D: (t_1, x_1), (t_2, x_2), …, (t_N, x_N) – data set
N – number of data points
W – number of network weights
Model Classification
• A collection of models: H_1, H_2, …, H_k
• We believe that the models are classified by P(H_1), P(H_2), …, P(H_k) (summing to 1)
• Usually at the beginning P(H_1) = P(H_2) = … = P(H_k)
• After observing the data D → Bayes' rule:

P(H_i | D) = \frac{P(D | H_i) P(H_i)}{P(D)}

where P(D | H_i) is the probability of D given H_i and P(D) is the normalizing constant.
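A small numerical sketch of this classification step (the evidence values and priors are invented; in the Bayesian NN case the evidences come from the expressions discussed later):

#include <cstdio>
#include <vector>

// Posterior model probabilities P(H_i|D) from evidences P(D|H_i) and priors P(H_i):
// multiply and normalize (Bayes' rule).
std::vector<double> modelPosteriors(const std::vector<double>& evidence,
                                    const std::vector<double>& prior) {
    std::vector<double> post(evidence.size());
    double norm = 0.0;  // P(D) = sum_i P(D|H_i) P(H_i)
    for (std::size_t i = 0; i < evidence.size(); ++i) {
        post[i] = evidence[i] * prior[i];
        norm += post[i];
    }
    for (double& p : post) p /= norm;
    return post;
}

int main() {
    // Three hypothetical architectures with equal priors.
    auto post = modelPosteriors({2.0e-3, 5.0e-3, 1.0e-3}, {1.0 / 3, 1.0 / 3, 1.0 / 3});
    for (std::size_t i = 0; i < post.size(); ++i)
        std::printf("P(H%zu|D) = %.3f\n", i + 1, post[i]);
    return 0;
}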
Single Model Statistics
• Assume that model H_i is the correct one
• The neural network A_i with weights w is considered

Posterior = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}

P(w | D, A_i) = \frac{P(D | w, A_i) P(w | A_i)}{P(D | A_i)}

• Task 1: assuming some prior probability of w, construct the posterior after including the data

P(D | A_i) = \int P(D | w, A_i) P(w | A_i) \, dw

P(A_i | D) \propto P(D | A_i) P(A_i)
Hierarchy

P(w | D, \alpha, A) = \frac{P(D | w, \alpha, A) P(w | \alpha, A)}{P(D | \alpha, A)}

P(\alpha | D, A) = \frac{P(D | \alpha, A) P(\alpha | A)}{P(D | A)}

P(A | D) = \frac{P(D | A) P(A)}{P(D)}
Constructing the prior and posterior functions
Assume α = constant.

E_D = \frac{1}{2} \sum_i \left( \frac{y(x_i, w) - t(x_i)}{\sigma_i} \right)^2, \qquad
E_W = \frac{1}{2} \sum_i w_i^2, \qquad
S = E_D + \alpha E_W

Likelihood:
P(D | w, A) = \frac{\exp(-E_D)}{Z_D}, \qquad
Z_D = \int d^N t \, \exp(-E_D) = (2\pi)^{N/2} \prod_{i=1}^{N} \sigma_i

Prior (the weight distribution):
P(w | \alpha, A) = \frac{\exp(-\alpha E_W)}{Z_W(\alpha)}, \qquad
Z_W(\alpha) = \int d^W w \, \exp(-\alpha E_W) = \left( \frac{2\pi}{\alpha} \right)^{W/2}

Posterior:
P(w | D, \alpha, A) = \frac{P(D | w, \alpha) P(w | \alpha)}{P(D | \alpha)} = \frac{\exp(-S)}{Z_M(\alpha)}, \qquad
Z_M(\alpha) = \int d^W w \, \exp(-E_D - \alpha E_W)

[Plots: the Gaussian prior P(w) centred at w = 0; the posterior probability peaked at w_MP.]
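A sketch of the corresponding objective in C++ (a generic illustration under the assumptions above, not GrANNet code; the toy "network", its weights and the data are invented):

#include <cstdio>
#include <vector>

struct Point { double x, t, sigma; };

// S(w) = E_D + alpha * E_W for a given network response y(x; w).
template <class Model>
double objective(const Model& y, const std::vector<double>& w,
                 const std::vector<Point>& data, double alpha) {
    double ed = 0.0;
    for (const auto& p : data) {
        double r = (y(p.x, w) - p.t) / p.sigma;   // standardized residual
        ed += 0.5 * r * r;                        // E_D
    }
    double ew = 0.0;
    for (double wi : w) ew += 0.5 * wi * wi;      // E_W
    return ed + alpha * ew;                       // S = E_D + alpha E_W
}

int main() {
    // Toy "network": a straight line y = w0 + w1 * x.
    auto line = [](double x, const std::vector<double>& w) { return w[0] + w[1] * x; };
    std::vector<Point> data = {{0.0, 1.1, 0.1}, {1.0, 1.9, 0.1}, {2.0, 3.2, 0.2}};
    std::vector<double> w = {1.0, 1.0};
    std::printf("S(w) = %f\n", objective(line, w, data, 0.01));
    return 0;
}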
Computing the Posterior
Expand S around the most probable weights w_MP:

S(w) \simeq S(w_{MP}) + \frac{1}{2} \Delta w^T A \Delta w

A_{kl} = \nabla_k \nabla_l S = \sum_{i=1}^{N} \frac{1}{\sigma_i^2} \left[ \nabla_k y_i \nabla_l y_i + (y_i - t(x_i)) \nabla_k \nabla_l y_i \right] + \alpha \delta_{kl}
\simeq \sum_{i=1}^{N} \frac{1}{\sigma_i^2} \nabla_k y_i \nabla_l y_i + \alpha \delta_{kl} \qquad \text{(Hessian)}

Z_M \simeq (2\pi)^{W/2} |A|^{-1/2} \exp(-S(w_{MP}))

\sigma_x^2 = \int dw \, \left[ y(w, x) - \bar{y}(x) \right]^2 \frac{\exp(-S(w))}{Z_M} \simeq \nabla y(w_{MP}, x)^T A^{-1} \nabla y(w_{MP}, x) \qquad \text{(covariance matrix)}
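A sketch of this Gaussian-approximation error propagation in C++ (an illustration under the assumptions above, not library code; the gradients, uncertainties and α are invented). It builds the outer-product Hessian A = Σ_i ∇y_i ∇y_i^T / σ_i² + αI and evaluates σ_y² = g^T A⁻¹ g by solving A z = g:

#include <cstdio>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Solve A z = g by Gaussian elimination (no pivoting; fine for this small example).
std::vector<double> solve(Matrix A, std::vector<double> g) {
    const std::size_t n = g.size();
    for (std::size_t k = 0; k < n; ++k) {
        for (std::size_t i = k + 1; i < n; ++i) {
            double f = A[i][k] / A[k][k];
            for (std::size_t j = k; j < n; ++j) A[i][j] -= f * A[k][j];
            g[i] -= f * g[k];
        }
    }
    std::vector<double> z(n);
    for (std::size_t i = n; i-- > 0;) {
        double s = g[i];
        for (std::size_t j = i + 1; j < n; ++j) s -= A[i][j] * z[j];
        z[i] = s / A[i][i];
    }
    return z;
}

int main() {
    // Invented example: W = 2 weights, N = 3 data points.
    // grad[i] = gradient of the network output at data point i w.r.t. the weights.
    std::vector<std::vector<double>> grad = {{1.0, 0.5}, {0.8, 1.2}, {0.3, 0.9}};
    std::vector<double> sigma = {0.1, 0.2, 0.1};
    const double alpha = 0.01;
    const std::size_t W = 2;

    // A = sum_i grad_i grad_i^T / sigma_i^2 + alpha * I  (outer-product Hessian).
    Matrix A(W, std::vector<double>(W, 0.0));
    for (std::size_t i = 0; i < grad.size(); ++i)
        for (std::size_t k = 0; k < W; ++k)
            for (std::size_t l = 0; l < W; ++l)
                A[k][l] += grad[i][k] * grad[i][l] / (sigma[i] * sigma[i]);
    for (std::size_t k = 0; k < W; ++k) A[k][k] += alpha;

    // Output uncertainty at some point x: sigma_y^2 = g^T A^{-1} g,
    // with g the gradient of y(x; w_MP) with respect to the weights.
    std::vector<double> g = {0.6, 1.1};
    std::vector<double> z = solve(A, g);
    double var = 0.0;
    for (std::size_t k = 0; k < W; ++k) var += g[k] * z[k];
    std::printf("sigma_y^2 = %f\n", var);
    return 0;
}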
How to fix the proper α

p(w | D, A) = \int d\alpha \, p(w | \alpha, D, A) \, p(\alpha | D, A)

Two ideas:
• Evidence approximation (MacKay):
  – find w_MP
  – find α_MP
• Hierarchical: perform the integral over α analytically

p(w | D, A) \simeq p(w | \alpha_{MP}, D, A) \int d\alpha \, p(\alpha | D, A) = p(w | \alpha_{MP}, D, A),
if p(\alpha | D, A) is sharply peaked.
Getting α_MP

p(\alpha | D) = \frac{p(D | \alpha) p(\alpha)}{p(D)}

p(D | \alpha) = \int p(D | w, \alpha) p(w | \alpha) \, dw = \frac{Z_M(\alpha)}{Z_D Z_W(\alpha)}

\frac{d}{d\alpha} \log p(D | \alpha) = 0 \;\Rightarrow\; 2 \alpha E_W^{MP} = \gamma, \qquad \gamma = \sum_{i=1}^{W} \frac{\lambda_i}{\lambda_i + \alpha}

γ – the effective number of well-determined parameters (λ_i are the eigenvalues of the data part of the Hessian).

\alpha_{new} = \frac{\gamma}{2 E_W} – an iterative procedure during training.
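A sketch of this re-estimation loop in C++ (illustration only; the eigenvalues λ_i of the data Hessian and the value of E_W are invented numbers, in a real fit they come from the current network):

#include <cstdio>
#include <vector>

int main() {
    // Invented eigenvalues of the data part of the Hessian and a value of E_W.
    std::vector<double> lambda = {50.0, 12.0, 3.0, 0.4, 0.01};
    const double ew = 8.0;

    double alpha = 1.0;  // starting guess
    for (int it = 0; it < 50; ++it) {
        // gamma = effective number of well-determined parameters.
        double gamma = 0.0;
        for (double li : lambda) gamma += li / (li + alpha);
        alpha = gamma / (2.0 * ew);  // re-estimate alpha = gamma / (2 E_W)
        std::printf("it %2d: gamma = %.3f, alpha = %.5f\n", it, gamma, alpha);
    }
    return 0;
}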
Bayesian Model Comparison – the Occam Factor

P(A_i | D) \propto P(D | A_i) P(A_i) \propto P(D | A_i)

P(D | A_i) = \int p(D | w, A_i) p(w | A_i) \, dw \simeq p(D | w_{MP}, A_i) \, p(w_{MP} | A_i) \, \Delta w_{posterior}

If p(w_{MP} | A_i) \simeq 1 / \Delta w_{prior}:

P(D | A_i) \simeq p(D | w_{MP}, A_i) \times \frac{\Delta w_{posterior}}{\Delta w_{prior}} \qquad \text{(best-fit likelihood} \times \text{Occam factor)}

P(D | A_i) \simeq p(D | w_{MP}, A_i) \, p(w_{MP} | A_i) \, \frac{(2\pi)^{W/2}}{\sqrt{\det A}}

• The log of the Occam factor ≈ the amount of information we gain after the data have arrived
• Complex models → larger accessible prior phase space → small Occam factor
• Simple models → smaller accessible prior phase space → large Occam factor
Evidence

\ln p(D | A) = -E_D^{MP} - \alpha E_W^{MP} - \frac{1}{2} \ln \det A + \frac{W}{2} \ln \alpha - \frac{N}{2} \ln 2\pi - \sum_{i=1}^{N} \ln \sigma_i + \ln g, \qquad g = 2^M M!

• -E_D^{MP}: the misfit of the interpolant to the data
• the remaining terms: the Occam factor – a penalty term
• g = 2^M M!: the symmetry factor of a network with M hidden units
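A numerical sketch of this expression (illustration only; every input number below is invented, and ln det A would come from the Hessian of the previous slides):

#include <cmath>
#include <cstdio>
#include <vector>

// ln evidence = -E_D - alpha*E_W - 0.5*ln det A + (W/2) ln alpha
//               - (N/2) ln(2*pi) - sum_i ln sigma_i + ln(2^M * M!)
double logEvidence(double ed, double ew, double alpha, double logDetA,
                   int W, int M, const std::vector<double>& sigma) {
    const double pi = 3.14159265358979323846;
    double val = -ed - alpha * ew - 0.5 * logDetA + 0.5 * W * std::log(alpha);
    val -= 0.5 * sigma.size() * std::log(2.0 * pi);
    for (double s : sigma) val -= std::log(s);
    double lng = M * std::log(2.0);                   // ln 2^M
    for (int m = 2; m <= M; ++m) lng += std::log(m);  // + ln M!
    return val + lng;
}

int main() {
    std::vector<double> sigma(30, 0.05);  // 30 invented data uncertainties
    std::printf("ln p(D|A) = %f\n",
                logEvidence(14.2, 3.1, 0.2, 12.7, 9, 2, sigma));
    return 0;
}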
[Network diagram (Q², x → F2): changing the sign of the weights attached to a tanh(·) hidden unit leaves the network output unchanged – the origin of the symmetry factor.]
What about cross sections?
• GE and GM simultaneously
  – Input: Q² and ε → output: cross sections
• Standard error function
• A chi²-like function, with the covariance matrix obtained from the Rosenbluth separation
• Possibilities:
  – The set of neural networks becomes a natural distribution of the differential cross sections
  – One can produce artificial data over a wide range of ε and perform the Rosenbluth separation, searching for nonlinearities of σ_R in the ε dependence.
What about TPE?
• Q², ε → GE, GM and TPE?
• In the perfect case a change of ε should not affect GE and GM.
  – Training the NN on series of artificial cross-section data with fixed ε?
  – Collecting the data in ε and Q² bins, then showing the network the set of data with a particular ε over a wide range of Q².
[Network diagram: inputs Q², ε → outputs GM, GE, TPE.]
Constraining error function

E_s = \frac{1}{2} \sum_{i=1}^{N_s} \left[ \left( G_{M,net}^2 + \varepsilon G_{E,net}^2 + TPE_{net} \right) - \left( G_{M,art}^2 + \varepsilon G_{E,art}^2 + TPE_{art} \right) \right]^2

E_R = \frac{1}{2} \sum_{i=1}^{N_s} \left[ G_{E,net}^2 / G_{M,net}^2 - G_{E,art}^2 / G_{M,art}^2 \right]^2

Every cycle is computed with a different ε. One network!
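A sketch of these constraint terms in C++ (my reading of the reconstructed formulas above, not the actual implementation; the network and artificial values are invented):

#include <cstdio>
#include <vector>

struct FFPoint { double gm, ge, tpe; };  // G_M, G_E and the TPE term at one (Q^2, eps)

// sigma_R-like combination G_M^2 + eps*G_E^2 + TPE used in the constraint.
double reduced(const FFPoint& p, double eps) {
    return p.gm * p.gm + eps * p.ge * p.ge + p.tpe;
}

// E_s and E_R constraint terms, summed over N_s artificial points.
void constraintErrors(const std::vector<FFPoint>& net, const std::vector<FFPoint>& art,
                      const std::vector<double>& eps, double& es, double& er) {
    es = er = 0.0;
    for (std::size_t i = 0; i < net.size(); ++i) {
        double ds = reduced(net[i], eps[i]) - reduced(art[i], eps[i]);
        es += 0.5 * ds * ds;
        double dr = net[i].ge * net[i].ge / (net[i].gm * net[i].gm)
                  - art[i].ge * art[i].ge / (art[i].gm * art[i].gm);
        er += 0.5 * dr * dr;
    }
}

int main() {
    std::vector<FFPoint> net = {{2.5, 0.9, 0.01}, {2.0, 0.8, 0.02}};
    std::vector<FFPoint> art = {{2.4, 0.92, 0.0}, {2.1, 0.78, 0.0}};
    std::vector<double> eps = {0.2, 0.8};  // a different epsilon in every cycle
    double es, er;
    constraintErrors(net, art, eps, es, er);
    std::printf("E_s = %f, E_R = %f\n", es, er);
    return 0;
}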
[Network diagram (Q², ε → GM, GE, TPE): the yellow lines have vanishing weights – they do not transfer a signal.]
Results
[Plots: neural network results for GEp, GMp, GEn and GMn.]