
Application of Data Mining Algorithms in Atmospheric
Neutrino Analyses with IceCube
Tim Ruhe, TU Dortmund
Outline

- Data mining is more...
- Why is IceCube interesting (from a machine learning point of view)?
- Data preprocessing and dimensionality reduction
- Training and validation of a learning algorithm
- Results
- Other detector configurations?
- Summary & outlook
Data Mining is more...

[Diagram: annotated examples (historical data, simulations) are fed into a learning algorithm, which produces a model. New data (not annotated) is passed through the model to obtain information and knowledge, and eventually Nobel prize(s).]
Data Mining is more...

[Diagram: the same chain, now with a preprocessing step between the annotated examples and the learning algorithm: garbage in, garbage out.]
Data Mining is more...

[Diagram: the complete chain with preprocessing and a validation step: annotated examples (historical data, simulations) are preprocessed and fed to the learning algorithm; the resulting model is validated and then applied to new, unannotated data to extract information and knowledge.]
Why is IceCube interesting from a machine learning point of view?

- Huge amount of data
- Highly imbalanced distribution of event classes (signal and background)
- Huge amount of data to be processed by the learner (Big Data)
- A real-life problem
Preprocessing (1): Reducing the Data Volume Through Cuts

Background rejection: 91.4%
Signal efficiency: 57.1%

BUT: the remaining background is significantly harder to reject!
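Background rejection and signal efficiency are simple ratios of event counts before and after the cuts; a minimal sketch (the counts below are made up to reproduce the quoted percentages, not taken from the analysis):

```python
def background_rejection(n_bkg_before: int, n_bkg_after: int) -> float:
    """Fraction of background events removed by the cuts."""
    return 1.0 - n_bkg_after / n_bkg_before

def signal_efficiency(n_sig_before: int, n_sig_after: int) -> float:
    """Fraction of signal events surviving the cuts."""
    return n_sig_after / n_sig_before

print(background_rejection(1_000_000, 86_000))  # 0.914 -> 91.4%
print(signal_efficiency(1_000, 571))            # 0.571 -> 57.1%
```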
Preprocessing (2): Variable Selection

Starting from 2600 variables:

1. Check for missing values: exclude a variable if more than 30% of its values are missing.
2. Check for potential bias: exclude everything that is useless, redundant, or a source of potential bias.
3. Check for correlations: exclude everything that has a correlation of 1.0.

This leaves 477 variables, which are then passed to automated feature selection.
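A minimal sketch of the missing-value and correlation filters, assuming the candidate variables are numeric columns of a pandas DataFrame (the DataFrame `df` and the exact column handling are illustrative):

```python
import pandas as pd

def filter_variables(df: pd.DataFrame, max_missing: float = 0.30) -> pd.DataFrame:
    # Drop variables with more than 30% missing values.
    df = df.loc[:, df.isna().mean() <= max_missing]

    # Drop variables that are perfectly correlated (correlation of 1.0)
    # with a variable that is kept.
    corr = df.corr().abs()
    drop = set()
    for i, a in enumerate(corr.columns):
        for b in corr.columns[i + 1:]:
            if b not in drop and corr.loc[a, b] == 1.0:
                drop.add(b)
    return df.drop(columns=sorted(drop))
```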
Relevance vs. Redundancy: MRMR (continuous case)

Relevance: maximize $V_F$ with
$$V_F = \frac{1}{|S|} \sum_{i \in S} F(i, h)$$
where $F(i, h)$ is the F-statistic between feature $i$ and the class $h$, and $S$ is the selected feature set.

Redundancy: minimize $W_c$ with
$$W_c = \frac{1}{|S|^2} \sum_{i, j \in S} c(i, j)$$
where $c(i, j)$ is the correlation between features $i$ and $j$.

MRMR: maximize $Q = V_F / W_c$ (quotient) or $D = V_F - W_c$ (difference).
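A minimal sketch of greedy MRMR selection in this continuous form, using the difference criterion $D = V_F - W_c$; the use of scikit-learn's `f_classif` for $F(i, h)$ and plain Pearson correlation for $c(i, j)$ is an assumption for illustration:

```python
import numpy as np
from sklearn.feature_selection import f_classif

def mrmr_select(X: np.ndarray, y: np.ndarray, n_select: int) -> list:
    """Greedily build a feature set S, trading relevance against redundancy."""
    relevance, _ = f_classif(X, y)                # F(i, h) for every feature
    corr = np.abs(np.corrcoef(X, rowvar=False))   # |c(i, j)| between features
    selected = [int(np.argmax(relevance))]        # start with the most relevant
    while len(selected) < n_select:
        rest = [i for i in range(X.shape[1]) if i not in selected]
        # D = V_F - W_c, evaluated for each candidate against the current set.
        scores = [relevance[i] - corr[i, selected].mean() for i in rest]
        selected.append(rest[int(np.argmax(scores))])
    return selected
```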
Feature Selection Stability

Jaccard index of two feature sets $A$ and $B$:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

Average over $l$ sets of selected variables:
$$\bar{J} = \frac{2}{l^2 - l} \sum_{i=1}^{l} \sum_{j=i+1}^{l} J(F_i, F_j)$$
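A minimal sketch of this stability measure, assuming each of the $l$ selection runs returns a set of variable names:

```python
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

def stability(feature_sets: list) -> float:
    """Average pairwise Jaccard index over l feature sets."""
    pairs = list(combinations(feature_sets, 2))   # l(l-1)/2 pairs
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Example: three runs that mostly agree.
print(stability([{"energy", "zenith", "nchan"},
                 {"energy", "zenith", "ndir"},
                 {"energy", "zenith", "nchan"}]))
```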
Comparing Forward Selection and MRMR

[Plot comparing the stability of forward selection and MRMR; not reproduced in this transcript.]
Training and Validation of a Random Forest

- Use an ensemble of simple decision trees.
- Obtain the final classification as an average over all trees:
$$s = \frac{1}{n_{\text{trees}}} \sum_{i=1}^{n_{\text{trees}}} s_i, \qquad s_i \in \{0, 1\}$$
5-fold cross validation is used to validate the performance of the forest.
Random Forest and Cross Validation in Detail (1)

- Background muons: 750,000 in total (CORSIKA, Polygonato); 600,000 available for training.
- Neutrinos: 70,000 in total (NuGen, E^-2 spectrum); 56,000 available for training.
- Sampling: 27,000 events are drawn from each class for training.
Random Forest and Cross Validation in Detail (2)

- A forest of 500 trees is trained on the 2 x 27,000 sampled events and then applied to the held-out events.
- 150,000 background muons and 14,000 neutrinos remain available for testing.
- Repeat 5 times (5-fold cross validation).
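A minimal sketch of this training scheme with scikit-learn; the balanced sample size, tree count, and fold count follow the slides, while the data itself is a random placeholder:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Placeholder features; in the analysis these are simulated CORSIKA muons
# (background) and NuGen neutrinos (signal).
X_bkg = rng.normal(0.0, 1.0, size=(600_000, 10))
X_sig = rng.normal(0.5, 1.0, size=(56_000, 10))

n_folds, n_sample = 5, 27_000
for fold in range(n_folds):
    # Balanced training sample: 27,000 events per class.
    ib = rng.choice(len(X_bkg), size=n_sample, replace=False)
    isg = rng.choice(len(X_sig), size=n_sample, replace=False)
    X_train = np.vstack([X_bkg[ib], X_sig[isg]])
    y_train = np.r_[np.zeros(n_sample), np.ones(n_sample)]

    forest = RandomForestClassifier(n_estimators=500)   # 500 trees
    forest.fit(X_train, y_train)
    # forest.predict_proba(X_test)[:, 1] gives the per-event signalness,
    # i.e. the fraction of trees voting "neutrino".
```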
Random Forest Output

[Plots of the random forest score distributions for signal and background; not reproduced in this transcript.]
Random Forest Output: Cut at 500 trees

- 28,830 ± 480 expected neutrino candidates
- 28,830 ± 480 expected background muons
- Applying the cut to experimental data yields 27,771 neutrino candidates.

This yields:
- Background rejection: 99.9999%
- Signal efficiency: 18.2%
- Estimated purity: (99.59 ± 0.37)%
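A minimal sketch of evaluating such a score cut on simulation; event weights are ignored here, and the threshold of 1.0 (all trees voting signal) is an illustrative reading of "cut at 500 trees":

```python
import numpy as np

def cut_performance(sig_scores, bkg_scores, threshold=1.0):
    """Unweighted efficiency, rejection, and purity of a score cut."""
    n_sig = int(np.sum(sig_scores >= threshold))
    n_bkg = int(np.sum(bkg_scores >= threshold))
    efficiency = n_sig / len(sig_scores)
    rejection = 1.0 - n_bkg / len(bkg_scores)
    purity = n_sig / (n_sig + n_bkg)
    return efficiency, rejection, purity
```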
Unfolding the Spectrum

The energy spectrum is reconstructed from the neutrino candidates with the unfolding software TRUEE. This is no longer data mining... but it isn't magic either.
Moving on... IC79

- The entire analysis chain can be applied to other detector configurations, with minor changes (e.g. the ice model).
- 212 neutrino candidates per day
- 66,885 neutrino candidates in total
- 330 ± 200 background muons
Summary and Outlook

- MRMR for feature selection, a random forest for classification: 99.9999% background rejection.
- Purities above 99% are routinely achieved.
- Future improvements? By starting at an earlier analysis level...
Backup Slides
RapidMiner in a Nutshell

- Developed at the Department of Computer Science at TU Dortmund (formerly YALE)
- Operator-based, written in Java
- It used to be open source
- Many, many plugins, thanks to a rather active community
- One of the most widely used data mining tools
What I like about it

- The data flow is nicely visualized and can easily be followed and comprehended.
- Rather easy to learn, even without programming experience.
- Large community (updates, bugfixes, plugins).
- A professional tool (they actually make money with it!).
- Good support.
- Many tutorials can be found online, including specialized ones.
- Most operators work like a charm.
- Extendable.
Relevance vs. Redundancy: MRMR (discrete case)

Relevance: maximize $V_F$ with
$$V_F = \frac{1}{|S|} \sum_{i \in S} I(i, h)$$

Redundancy: minimize $W_c$ with
$$W_c = \frac{1}{|S|^2} \sum_{i, j \in S} I(i, j)$$

where $I$ is the mutual information:
$$I = \sum_{i, j} p(x_i, x_j) \log \frac{p(x_i, x_j)}{p(x_i) \, p(x_j)}$$

MRMR: maximize $Q = V_F / W_c$ (quotient) or $D = V_F - W_c$ (difference).
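A minimal sketch of estimating the mutual information of two variables from a two-dimensional histogram (the binning is an illustrative choice):

```python
import numpy as np

def mutual_information(x, y, bins=20):
    """Estimate I(x, y) via a joint histogram."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()                       # joint probabilities p(x_i, x_j)
    px = pxy.sum(axis=1, keepdims=True)    # marginal p(x_i)
    py = pxy.sum(axis=0, keepdims=True)    # marginal p(x_j)
    mask = pxy > 0                         # skip empty bins (log 0)
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))
```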
Feature Selection Stability

Jaccard:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

Kuncheva:
$$I_C(A, B) = \frac{r n - k^2}{k (n - k)}$$
with $|A| = |B| = k$, $r = |A \cap B|$, and $n$ the total number of features.
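A minimal sketch of the Kuncheva index, where `n_total` is the total number of candidate variables:

```python
def kuncheva(a: set, b: set, n_total: int) -> float:
    """Kuncheva consistency index; assumes both sets have the same size k."""
    k = len(a)
    assert len(b) == k, "the index is defined for |A| == |B| == k"
    r = len(a & b)
    return (r * n_total - k**2) / (k * (n_total - k))
```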
Ensemble Methods

- With weights (e.g. boosting)
- Without weights (e.g. random forest)
Random Forest: What is randomized?

- Randomness 1: the events each tree is trained on (bagging).
- Randomness 2: the variables that are available for a split.
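Both sources of randomness correspond directly to random forest hyperparameters; a minimal sketch using scikit-learn's interface:

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=500,
    bootstrap=True,        # randomness 1: each tree trains on a bootstrap sample
    max_features="sqrt",   # randomness 2: random subset of variables per split
)
```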
Are we actually better than simpler methods?