Knowledge Extraction using Artificial Neural Networks
Wismar
Business School
Artificial Neural Networks
and
Data Mining
Uwe Lämmel
www.wi.hs-wismar.de/~laemmel
Neural Networks and Data Mining
Slide 1
Content
Data Mining
Classification: approach
Data Mining Cup
– 2004: Who will cancel?
– 2007: Who will get a rebate coupon?
– 2008: How long will someone participate in a lottery?
– 2009: Forecast of book sales figures
– 2010 ?
Clustering: approach
– Behaviour of bank customers
Data Mining
Data Mining is a
– systematic and automated
discovery and extraction
– of previously unknown knowledge
– out of huge amounts of data.
"KDD – Knowledge Discovery in Databases" is used as a synonym
The term is misleading: gold mining digs for gold, but data mining digs for knowledge, not for data
Data Mining – Applications
classification
clustering
association
prediction
text mining
web mining
classification
items are placed in subsets (classes)
classes have known properties
– customer is bad, average, good
– pattern recognition
– …
set of training items is used to train the
classification algorithm
clustering
partitioning a data set into subsets (clusters),
so that the data in each subset (ideally)
share some common features
– similarity or proximity under some defined
distance measure
– clustering builds the classes itself
Data Mining Process
[Figure: CRISP-DM model]
Content
Data Mining
Classification: approach using NN
Data Mining Cup
Clustering: approach
Classification using NN
prerequisite
– set of training patterns (many patterns)
approach
– code the values
– divide the set of training patterns into:
  – training set
  – test set
– build a network
– train the network using the training set
– check the network quality using the test set
[Diagram labels: training patterns, coded patterns, training set, test set, real data]
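The division into a training set and a test set can be sketched in a few lines of Python. This is only an illustration: the function name `split_patterns`, the 30% test fraction, and the toy patterns are assumptions, not part of the slides.

```python
import random

def split_patterns(patterns, test_fraction=0.3, seed=42):
    """Shuffle the coded patterns and split them into (training set, test set)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(patterns)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical coded patterns: (input vector, teaching output)
patterns = [([i / 10.0, 1.0 - i / 10.0], i % 2) for i in range(10)]
training_set, test_set = split_patterns(patterns)
```

The test set is never shown to the network during training, so its error is an estimate of the quality on real data.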
Development of an NN-application
1. build a network architecture
2. input of training pattern
3. calculate network output
4. compare to teaching output
   – error is too high: modify weights, continue with step 2
   – quality is good enough: continue with step 5
5. use test set data
6. evaluate output
7. compare to teaching output
   – error is too high: change parameters and train again
   – quality is good enough: done
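The training part of this cycle (calculate the output, compare it to the teaching output, modify the weights until the error is low enough) can be sketched as follows. As a simplifying assumption, the sketch uses a single sigmoid neuron trained with the delta rule on the logical OR function instead of a full multi-layer network; all names and parameter values are illustrative.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def network_output(weights, bias, inputs):
    """Output of a single sigmoid neuron (stand-in for a full network)."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def mean_squared_error(weights, bias, patterns):
    """Compare the network output to the teaching output over all patterns."""
    return sum((t - network_output(weights, bias, x)) ** 2
               for x, t in patterns) / len(patterns)

def train(patterns, learning_rate=0.5, max_epochs=5000, target_error=0.01, seed=1):
    rng = random.Random(seed)
    weights = [rng.uniform(-0.5, 0.5) for _ in patterns[0][0]]
    bias = rng.uniform(-0.5, 0.5)
    for _ in range(max_epochs):
        for inputs, teaching in patterns:
            out = network_output(weights, bias, inputs)
            # delta rule: error times the derivative of the sigmoid
            delta = (teaching - out) * out * (1.0 - out)
            weights = [w + learning_rate * delta * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * delta
        if mean_squared_error(weights, bias, patterns) <= target_error:
            break  # quality is good enough
    return weights, bias

# Toy training set: logical OR, (input vector, teaching output)
training_set = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train(training_set)
error = mean_squared_error(weights, bias, training_set)
```

The same `mean_squared_error` function would be applied to the test set afterwards; if that error is too high, the learning parameters or the architecture are changed and training starts again.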
Build an Artificial Neural Network
Number of Input Neurons?
– depends on the number of attributes
– depends on the coding
Number of Output Neurons?
– depends on the coding of the class attribute
Number of Hidden Neurons?
– experiments necessary
– generally: not more than the number of input neurons
– a quarter to a half of the number of input neurons may work
– see capacity of a neural network
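A small sketch of how the coding determines the number of input neurons, and how the quarter-to-half rule of thumb yields candidate sizes for the hidden layer. The attribute names and the one-of-n coding of categorical attributes are assumptions chosen for illustration.

```python
def count_input_neurons(attributes):
    """attributes maps a name to its list of categories (one-of-n coding,
    one neuron per category) or to None for a numeric attribute (one neuron)."""
    total = 0
    for categories in attributes.values():
        total += 1 if categories is None else len(categories)
    return total

def hidden_neuron_candidates(n_inputs):
    """Candidate hidden-layer sizes: a quarter to a half of the input neurons."""
    low = max(1, n_inputs // 4)
    high = max(low, n_inputs // 2)
    return list(range(low, high + 1))

# Hypothetical attributes of a customer record
attributes = {
    "age": None,                            # numeric: 1 neuron
    "new_customer": ["yes", "no"],          # one-of-n: 2 neurons
    "rating": ["bad", "average", "good"],   # one-of-n: 3 neurons
}
n_inputs = count_input_neurons(attributes)        # 1 + 2 + 3 = 6
candidates = hidden_neuron_candidates(n_inputs)   # [1, 2, 3]
```

The candidates are only starting points for experiments; which size works best has to be found by training and testing.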
Experiments using the JavaNNS
Build a network
Load the training patterns
open the Error Graph
open the Control Panel
Initialize the network
try different learning parameters: 0.1, 0.2, 0.5, 0.8
Start Learning
Getting Results
assess the error
Finally:
– make the test pattern set the current one
– Save Data …
– include output files
– save as a .res-file
Evaluate the .res-file
Experiments
How can we improve the results?
– Data pre-processing?
– Architecture of ANN?
– Learning Parameters?
– Evaluation of the results: post-processing?
record your work!
Content
Data Mining
Classification: approach
Data Mining Cup
– 2004: Who will cancel?
– 2007: Who will get a rebate coupon?
– 2008: How long will someone participate in a lottery?
– 2009: Forecast of book sales figures
– 2010 ?
Clustering: approach
– Behaviour of bank customers
Data Mining Cup
www.data-mining-cup.de
annual competition for students
runs April – May /June
real world problem:
– problem
– set of training data
– set of data for classification
– to be developed: classification
supported by many companies (data/software)
~ 200 – 300 participants
workshop (user day)
DMC2004: A Mailing Action
mailing action of a company:
– special offer
– estimated annual income per customer:
  customer        will cancel   will not cancel
  gets an offer   43.80€        66.30€
  gets no offer   0.00€         72.00€
given:
– 10,000 sets of customer data
  containing 1,000 cancellers (training)
problem:
– test set contains 10,000 customer data
– Who will cancel?
– Whom to send an offer?
Mailing Action – Aim?
  customer        will cancel   will not cancel
  gets an offer   43.80€        66.30€
  gets no offer   0.00€         72.00€
no mailing action:
– 9,000 x 72.00
= 648,000
everybody gets an offer:
– 1,000 x 43.80 + 9,000 x 66.30
= 640,500
maximum (100% correct classification):
– 1,000 x 43.80 + 9,000 x 72.00
= 691,800
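The three scenarios can be checked with a short calculation. The sketch below works in cents to avoid rounding problems; the function and variable names are illustrative assumptions, the income values are those of the table.

```python
# Estimated annual income per customer in cents: (customer behaviour, action)
INCOME_CENTS = {
    ("cancel", "offer"): 4380,
    ("cancel", "no offer"): 0,
    ("loyal", "offer"): 6630,
    ("loyal", "no offer"): 7200,
}

def total_income(counts):
    """counts maps (behaviour, action) to a number of customers; result in euros."""
    return sum(INCOME_CENTS[key] * n for key, n in counts.items()) / 100

# 1,000 cancellers and 9,000 loyal customers, as in the training data
no_mailing = total_income({("cancel", "no offer"): 1000, ("loyal", "no offer"): 9000})
everybody = total_income({("cancel", "offer"): 1000, ("loyal", "offer"): 9000})
perfect = total_income({("cancel", "offer"): 1000, ("loyal", "no offer"): 9000})

print(no_mailing, everybody, perfect)  # 648000.0 640500.0 691800.0
```

Sending the offer to everybody is worse than doing nothing, so the mailing only pays off if the cancellers are identified well enough.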
Goal Function: Lift
  customer        will cancel   will not cancel
  gets an offer   43.80€        66.30€
  gets no offer   0.00€         72.00€
basis: no mailing action: 9,000 · 72.00
goal = extra income:
lift_M = 43.80 · c_M + 66.30 · nk_M – 72.00 · nk_M
(c_M: cancellers who get an offer, nk_M: non-cancellers who get an offer)
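A minimal sketch of the lift as a function of a classifier's decisions. It assumes that cM counts the cancellers who receive the offer and nkM the non-cancellers who receive it, which is how the income table reads; the function names and the toy data are hypothetical.

```python
def lift(c_m, nk_m):
    """Extra income in euros compared with no mailing action.
    c_m: cancellers who get an offer, nk_m: non-cancellers who get an offer."""
    # computed in cents: 43.80 per reached canceller, 66.30 - 72.00 per mailed loyal customer
    return (4380 * c_m + (6630 - 7200) * nk_m) / 100

def lift_from_decisions(will_cancel, gets_offer):
    """Both arguments are equally long lists of booleans, one entry per customer."""
    c_m = sum(1 for c, o in zip(will_cancel, gets_offer) if c and o)
    nk_m = sum(1 for c, o in zip(will_cancel, gets_offer) if not c and o)
    return lift(c_m, nk_m)

perfect = lift(1000, 0)        # offer only to the 1,000 cancellers
everybody = lift(1000, 9000)   # offer to all 10,000 customers

# Toy example: 4 customers, offers sent to one canceller and one loyal customer
toy = lift_from_decisions([True, True, False, False], [True, False, True, False])
```

With these definitions the perfect classification yields 43,800 € and mailing everybody yields –7,500 €, which matches the differences between the totals of the three scenarios.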
[Figure: data sample with 32 input attributes; annotations: important data, results, missing values]
Feed Forward Network – What to do?
train the net with the training set (10,000)
test the net using the test set (another 10,000)
– classify all 10,000 customers as cancellers or loyal customers
– evaluate the additional income
Results
data mining cup 2002
neural network project
2004
gain:
– additional income by the mailing action
if the target group was chosen according to the analysis
DMC 2007: Rebate System
Check-out couponing allows
an individual coupon generation at the check-out
The coupon is printed at the end of the sales slip
depending on the current customer.
Questions:
– How can the retailer identify
whether a customer is a potential couponing
customer?
– Which coupons will the customer respond to?
Couponing
Print:
– coupon A
– coupon B
– No coupon
50,000 customer cards for training
Classify another 50,000 customers!
Cost function:
– coupon not redeemed (false assignment to A or B): –1
– coupon A redeemed (correct assignment to A): +3
– coupon B redeemed (correct assignment to B): +6
Maximize the value!
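The cost function can be written down as a small scoring routine. The sketch assumes that each customer has a true class "A", "B" or "N" (would redeem coupon A, coupon B, or none) and that printing no coupon scores 0; the labels and the function name are illustrative.

```python
def coupon_score(actual, assigned):
    """Score a coupon assignment: +3 for a redeemed A, +6 for a redeemed B,
    -1 for every coupon that is printed but not redeemed, 0 for no coupon."""
    score = 0
    for truth, coupon in zip(actual, assigned):
        if coupon == "N":
            continue                      # no coupon printed: nothing gained or lost
        if coupon == truth:
            score += 3 if coupon == "A" else 6
        else:
            score -= 1                    # false assignment to A or B
    return score

# Toy example with five customers
actual = ["A", "B", "N", "A", "B"]
assigned = ["A", "B", "A", "N", "A"]
score = coupon_score(actual, assigned)    # 3 + 6 - 1 + 0 - 1 = 7
```

Because a correct B is worth twice a correct A and a wrong coupon costs only 1, it can pay off to print a coupon even when the classifier is not very sure.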
Data Understanding
What is the meaning of the attributes?
Type and range of values?
20–20–2 Network
Profit = 3·AA + 6·BB – (NA + NB + BA + AB)
results:
– winner 2007: 7,890
– my version: 6,714
– our students: 6,468 (73/230)
DMC2008: Participation in a Lottery
Predicting, at the beginning of the lottery,
how long participants will participate:
0 – The first ticket has not been paid for
1 – Only the ticket for the first class has been paid for
2 – Only the first two classes were played
3 – The lottery was played until the end
    but no ticket purchased for the following lottery
4 – At least first ticket for the
    following lottery purchased
[Figure: cost matrix]
Data
113,476 patterns!
69 attributes
– new customer (yes/no)
– age
– bank
– car
– …
100–40–20–5 Network
results:
1,030,240 RWTH Aachen (1)
…
1,024,535 RWTH Aachen (8)
865,565 Bauhaus Univ. Weimar (100)
Univ. Wismar: 878,550 – 835,035
– 1,494,315 (212)
DMC 2009 – online bookshop "Libri"
Sales figures training:
– more than 1,800 books
– 2,418 shops
Sales figures forecast
– 8 books
– 2,394 shops
DMC 2009 – 83-25-9-3 network
DMC 2010:
Revenue maximisation by intelligent couponing
Many customers order from an online shop only once
Task: decide whether to send a voucher worth €5.00
The voucher should go to those
who would not have re-ordered on their own.
32,427 data sets for training
32,428 data sets for prediction
37 attributes per set + target attribute in training set
DMC 2010
out of 67 teams!
Content
Data Mining
Classification: approach
Data Mining Cup
Clustering: approach
– Behaviour of bank customers
Clustering Transaction Data
Co-operation
Hochschule Wismar
HypoVereinsbank
Medienhaus Rostock
Issue
What information can be extracted
from turnover time series?
Strategy
1. Clustering time series data
2. Assign customers/accounts to clusters
3. Examine clusters
Transaction Data & Time Series
Corporate clients
223 branches
Cumulated transactions per
– month
– account
– type of transaction
... for a total of 6 years
Original financial data not suitable:
– order of values is important
– time displacements are problematic
Fourier versus Original Data
No displacement:
– similarity detected on both the transaction curve and the frequency spectrum
Data is displaced:
– the frequency spectrum shows the similarity
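Why the frequency spectrum still shows the similarity can be illustrated with a plain discrete Fourier transform. The sketch assumes a circular displacement (values pushed out at one end re-enter at the other), for which the amplitude spectrum is exactly unchanged; for real turnover series this holds only approximately. The series values and function names are made up for the example.

```python
import cmath
import math

def amplitude_spectrum(series):
    """Amplitudes of the discrete Fourier transform (naive O(n^2) version)."""
    n = len(series)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(series)))
            for k in range(n)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

curve = [0, 0, 5, 9, 5, 0, 0, 0]         # hypothetical monthly turnover
displaced = curve[-2:] + curve[:-2]      # same curve, two months later

distance_original = euclidean(curve, displaced)
distance_spectrum = euclidean(amplitude_spectrum(curve),
                              amplitude_spectrum(displaced))
```

On the original values the two curves look very different (`distance_original` is large), while `distance_spectrum` is zero up to rounding errors, so a clustering on the spectra puts both accounts into the same cluster.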
Using a classification model
Turnover per customer over time: t0 ... t0+n (Sequence A), tm ... tm+n (Sequence B)
1. Building the model: Sequence A, preprocessing, clustering, initial cluster, classification model
2. Applying the model: Sequence B, preprocessing, classification model, new cluster
3. Comparing cluster assignments: identical or different?
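The comparison step can be sketched as counting the accounts whose cluster differs between the two sequences. The account identifiers and cluster numbers below are hypothetical; the clustering and the classification model themselves are not shown.

```python
def changed_fraction(initial_clusters, new_clusters):
    """Share of accounts whose cluster assignment differs between two periods.
    Both arguments map an account id to a cluster number."""
    accounts = initial_clusters.keys() & new_clusters.keys()
    if not accounts:
        raise ValueError("no common accounts to compare")
    changed = sum(1 for a in accounts if initial_clusters[a] != new_clusters[a])
    return changed / len(accounts)

# Hypothetical assignments for five accounts
initial = {"acc1": 3, "acc2": 7, "acc3": 7, "acc4": 12, "acc5": 0}
new = {"acc1": 3, "acc2": 7, "acc3": 8, "acc4": 12, "acc5": 0}
fraction = changed_fraction(initial, new)   # 1 of 5 accounts changed
```

Grouping the accounts by business sector before calling the function would give a variability figure per sector.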
Clustering & Prediction Results
140,000 records
1 record = 1 account
6x5 SOM = max. 30 clusters
average changes of cluster assignments: ca. 19%
Variability per Business Sector
  Taxi                  22.3%   239/1070
  Ship Broker Offices   22.3%   64/471
  Churches              20.9%   228/1091
  Trucking              20.2%   1010/5008
End