Knowledge Extraction using Artificial Neural Networks

Wismar Business School
Artificial Neural Networks and Data Mining
Uwe Lämmel
www.wi.hs-wismar.de/~laemmel
Neural Networks and Data Mining
Slide 1
Content
• Data Mining
• Classification: approach
• Data Mining Cup
– 2004: Who will cancel?
– 2007: Who will get a rebate coupon?
– 2008: How long will someone participate in a lottery?
– 2009: Forecast of book sales figures
– 2010?
• Clustering: approach
– Behaviour of bank customers
Slide 2
Data Mining
Data Mining is the
– systematic and automated discovery and extraction
– of previously unknown knowledge
– out of huge amounts of data.
"KDD – Knowledge Discovery in Databases" is used as a synonym.
The notion is misleading: gold mining yields gold, whereas data mining yields knowledge, not data.
Slide 3
Data Mining – Applications
• classification
• clustering
• association
• prediction
• text mining
• web mining
classification
• items are placed in subsets (classes)
• classes have known properties
– customer is bad, average, good
– pattern recognition
– …
• a set of training items is used to train the classification algorithm
clustering
• partitioning a data set into subsets (clusters), so that the data in each subset (ideally) share some common features
– similarity or proximity for some defined distance measure
• builds the classes itself
Slide 4
Data Mining Process: the CRISP-DM model
Slide 5
Content
• Data Mining
• Classification: approach using NN
• Data Mining Cup
• Clustering: approach
Slide 6
Classification using NN
prerequisite
• set of training patterns (many patterns)
approach
• code the values
• divide the set of training patterns into:
– training set
– test set
• build a network
• train the network using the training set
• check the network quality using the test set
[Figure: training patterns → coded patterns → training set and test set; the trained network is then applied to real data]
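The coding and splitting steps could be sketched as follows. This is only an illustration under assumptions: the attributes (age, rating), the value ranges, the 70/30 split and all function names are hypothetical and do not come from the slides.

```python
import random

def one_hot(value, categories):
    """Code a categorical value as a 0/1 vector, one position per category."""
    return [1.0 if value == c else 0.0 for c in categories]

def scale(value, low, high):
    """Map a numeric value from the range [low, high] to [0, 1]."""
    return (value - low) / (high - low)

def split_patterns(patterns, test_fraction=0.3, seed=42):
    """Shuffle the coded patterns and divide them into training and test set."""
    shuffled = patterns[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical raw patterns: (age, rating, class attribute "cancels")
raw = [(25, "good", 0), (61, "bad", 1), (43, "average", 0), (37, "bad", 1),
       (52, "good", 0), (29, "average", 1), (48, "good", 0), (33, "bad", 1),
       (57, "average", 0), (40, "good", 0)]
ratings = ["bad", "average", "good"]

# Each coded pattern is a pair (input vector, teaching output).
coded = [([scale(age, 18, 80)] + one_hot(rating, ratings), [float(label)])
         for age, rating, label in raw]

training_set, test_set = split_patterns(coded)
print(len(training_set), "training patterns,", len(test_set), "test patterns")
```

With this coding every pattern needs four input neurons (one for the scaled age, three for the rating) and one output neuron.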
Slide 7
Development of an NN application
1. build a network architecture
2. input a training pattern
3. calculate the network output
4. compare to the teaching output
– error is too high: modify the weights and continue with step 2
– quality is good enough: continue with step 5
5. use the test set data
6. evaluate the output and compare to the teaching output
– error is too high: change the parameters and train again
– quality is good enough: done
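The development loop can be illustrated with a deliberately tiny example. The sketch below assumes a single logistic neuron trained with the delta rule on the logical OR function; the patterns, the learning rate of 0.5 and the error limit of 0.05 are hypothetical choices, and a real application would use a multi-layer network and a simulator such as JavaNNS.

```python
import math

def output(weights, bias, x):
    """Calculate the network output of a single logistic neuron."""
    net = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-net))

def sum_squared_error(weights, bias, patterns):
    """Compare network output and teaching output over all patterns."""
    return sum((t - output(weights, bias, x)) ** 2 for x, t in patterns)

def train(patterns, learning_rate, target_error=0.05, max_epochs=10000):
    """Modify the weights until the error is low enough or max_epochs is reached."""
    weights, bias = [0.0] * len(patterns[0][0]), 0.0
    for epoch in range(1, max_epochs + 1):
        for x, teaching_output in patterns:
            o = output(weights, bias, x)
            delta = (teaching_output - o) * o * (1.0 - o)
            weights = [w + learning_rate * delta * xi for w, xi in zip(weights, x)]
            bias += learning_rate * delta
        if sum_squared_error(weights, bias, patterns) < target_error:
            break  # quality is good enough
    return weights, bias, epoch

# Hypothetical training set (logical OR) and a small test set with noisy inputs
training_set = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
                ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
test_set = [([0.9, 0.1], 1.0), ([0.1, 0.0], 0.0), ([0.8, 0.9], 1.0)]

weights, bias, epochs = train(training_set, learning_rate=0.5)
training_error = sum_squared_error(weights, bias, training_set)
test_hits = sum(round(output(weights, bias, x)) == t for x, t in test_set)
print("epochs:", epochs, "training error:", round(training_error, 4))
print("correct test patterns:", test_hits, "of", len(test_set))
```

If the test set result were not good enough, the parameters (learning rate, error limit, architecture) would be changed and the training repeated.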
Slide 8
Build an Artificial Neural Network
• Number of input neurons?
– depends on the number of attributes
– depends on the coding
• Number of output neurons?
– depends on the coding of the class attribute
• Number of hidden neurons?
– experiments are necessary
– generally: not more than the number of input neurons
– a quarter to a half of the number of input neurons may work
– see capacity of a neural network
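As a rough aid for these rules of thumb, the following sketch derives the number of input neurons from a coding and lists hidden layer sizes between a quarter and a half of that number. The attribute names and their codings are hypothetical, and the resulting range is only a starting point for experiments.

```python
def count_input_neurons(coding):
    """coding maps each attribute name to the number of neurons its coding uses."""
    return sum(coding.values())

def hidden_neuron_candidates(n_inputs):
    """Hidden layer sizes from a quarter to a half of the input neurons."""
    low = max(1, n_inputs // 4)
    high = max(low, n_inputs // 2)
    return list(range(low, high + 1))

# Hypothetical coding: scaled numbers use 1 neuron, categories one neuron per value
coding = {"age": 1, "new_customer": 1, "rating": 3, "region": 15}

n_inputs = count_input_neurons(coding)
print("input neurons:", n_inputs)                                 # 20
print("hidden neurons to try:", hidden_neuron_candidates(n_inputs))  # 5 to 10
```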
Slide 9
Experiments using JavaNNS
• Build a network
• Load the training patterns
• Open the Error Graph
• Open the Control Panel
• Initialize the network
• Try different learning parameters: 0.1, 0.2, 0.5, 0.8
• Start learning
Slide 10
Getting Results
• evaluate the error
• Finally:
– make the test pattern set the current one
– Save Data …
– include output files
– save as a .res file
• Evaluate the .res file
Slide 11
Experiments
How can we improve the results?
– Data pre-processing?
– Architecture of ANN?
– Learning Parameters?
– Evaluation of the results: post-processing?
record your work!
Slide 12
Content
• Data Mining
• Classification: approach
• Data Mining Cup
– 2004: Who will cancel?
– 2007: Who will get a rebate coupon?
– 2008: How long will someone participate in a lottery?
– 2009: Forecast of book sales figures
– 2010?
• Clustering: approach
– Behaviour of bank customers
Slide 13
Data Mining Cup
www.data-mining-cup.de
• annual competition for students
• runs April to May/June
• real-world problem:
– problem
– set of training data
– set of data for classification
– to be developed: classification
• supported by many companies (data/software)
• approx. 200 to 300 participants
• workshop (user day)
Slide 14
DMC2004: A Mailing Action
• mailing action of a company:
– special offer
– estimated annual income per customer:
customer         will cancel   will not cancel
gets an offer    43.80€        66.30€
gets no offer    0.00€         72.00€
• given:
– 10,000 sets of customer data containing 1,000 cancellers (training)
• problem:
– test set contains 10,000 sets of customer data
– Who will cancel?
– Whom to send an offer?
Slide 15
Mailing Action – Aim?
customer         will cancel   will not cancel
gets an offer    43.80€        66.30€
gets no offer    0.00€         72.00€
• no mailing action:
– 9,000 × 72.00 = 648,000
• everybody gets an offer:
– 1,000 × 43.80 + 9,000 × 66.30 = 640,500
• maximum (100% correct classification):
– 1,000 × 43.80 + 9,000 × 72.00 = 691,800
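The three totals can be checked with a few lines of code. The sketch below takes the income table and the 1,000 / 9,000 split from the slide; the function and variable names are made up for illustration, and amounts are kept in euro cents to avoid rounding effects.

```python
# Estimated annual income per customer in euro cents (values from the income table)
INCOME = {
    ("cancel", "offer"): 4380,
    ("cancel", "no offer"): 0,
    ("loyal", "offer"): 6630,
    ("loyal", "no offer"): 7200,
}
CANCELLERS, LOYAL = 1_000, 9_000

def total_income(action_for_cancellers, action_for_loyal):
    """Total annual income in euros if each group is treated uniformly."""
    cents = (CANCELLERS * INCOME[("cancel", action_for_cancellers)]
             + LOYAL * INCOME[("loyal", action_for_loyal)])
    return cents / 100

no_mailing = total_income("no offer", "no offer")
offer_to_all = total_income("offer", "offer")
perfect = total_income("offer", "no offer")

print(f"no mailing action:       {no_mailing:,.2f}")    # 648,000.00
print(f"everybody gets an offer: {offer_to_all:,.2f}")  # 640,500.00
print(f"perfect classification:  {perfect:,.2f}")       # 691,800.00
```

Sending the offer to everybody is worse than doing nothing, so the mailing only pays off if the cancellers are identified well enough.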
Slide 16
Goal Function: Lift
customer         will cancel   will not cancel
gets an offer    43.80€        66.30€
gets no offer    0.00€         72.00€
basis: no mailing action: 9,000 · 72.00
goal = extra income:
lift_M = 43.80 · c_M + 66.30 · nk_M – 72.00 · nk_M
where c_M is the number of cancellers and nk_M the number of non-cancellers in the mailing group M
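The goal function can be written down directly. In this sketch c_m and nk_m are read as the numbers of cancellers and non-cancellers that receive the offer, which is an interpretation of the slide's symbols; amounts are in euro cents, and the mailing group in the last call is a hypothetical example.

```python
def lift_cents(c_m, nk_m):
    """Extra income in euro cents compared with no mailing action.

    c_m:  cancellers who receive the offer
    nk_m: non-cancellers who receive the offer
    """
    return 4380 * c_m + 6630 * nk_m - 7200 * nk_m

print(lift_cents(1000, 0) / 100)     # 43800.0: offer to all cancellers and nobody else
print(lift_cents(1000, 9000) / 100)  # -7500.0: offer to everybody
print(lift_cents(600, 1500) / 100)   # 17730.0: hypothetical imperfect mailing group
```

The first two values agree with the totals of the previous slide: 691,800 – 648,000 = 43,800 and 640,500 – 648,000 = –7,500.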
Slide 17
[Figure: excerpt of the data set with 32 input attributes and the result column; important data and missing values are marked]
Slide 18
Feed Forward Network – What to do?
• train the net with the training set (10,000)
• test the net using the test set (another 10,000)
– classify all 10,000 customers as canceller or loyal
– evaluate the additional income
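One possible way to turn network outputs into mailing decisions is a break-even threshold. The sketch assumes that the network output can be read as the probability p that a customer cancels, which the slides do not state. With the income table, an offer pays off if 43.80·p + 66.30·(1 – p) > 72.00·(1 – p), that is if p > 5.70 / 49.50 ≈ 0.115. The output values below are invented.

```python
# Break-even probability derived from the income table (assumption: output = P(cancel))
BREAK_EVEN = 5.70 / 49.50   # about 0.115

def gets_offer(cancel_probability):
    """Send the offer only if the expected income with offer is higher."""
    return cancel_probability > BREAK_EVEN

# Hypothetical network outputs for five customers of the test set
network_outputs = [0.05, 0.12, 0.40, 0.10, 0.90]
decisions = [gets_offer(p) for p in network_outputs]
print(decisions)   # [False, True, True, False, True]
```

Because a wrongly mailed loyal customer costs only 5.70 while a correctly mailed canceller gains 43.80, the threshold is far below 0.5.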
Slide 19
Results
data mining cup 2002
neural network project 2004
gain:
– additional income from the mailing action if the target group is chosen according to the analysis
Slide 20
DMC 2007: Rebate System
Check-out couponing allows individual coupons to be generated at the check-out.
The coupon is printed at the end of the sales slip, depending on the current customer.
Questions:
– How can the retailer identify whether a customer is a potential couponing customer?
– Which coupons will the customer respond to?
Slide 21
Couponing
• Print:
– coupon A
– coupon B
– no coupon
• 50,000 customer cards for training
• Classify another 50,000 customers!
• Cost function:
– coupon not redeemed (false assignment to A or B): –1
– coupon A redeemed (correct assignment to A): +3
– coupon B redeemed (correct assignment to B): +6
Maximize the value!
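The cost function can be expressed as a small scoring routine. The class labels "A", "B" and "N" (no coupon) and the example lists are assumptions made for this sketch; the point values are the ones given above.

```python
def coupon_value(actual, assigned):
    """Value of one decision: actual class of the customer vs. printed coupon."""
    if assigned == "N":
        return 0                           # no coupon printed, nothing gained or lost
    if assigned == actual:
        return 3 if assigned == "A" else 6  # coupon redeemed
    return -1                              # coupon printed but not redeemed

def total_value(actuals, assignments):
    """Sum of the values over all classified customers."""
    return sum(coupon_value(a, p) for a, p in zip(actuals, assignments))

# Hypothetical classes of eight customers and the coupons assigned to them
actuals     = ["A", "A", "B", "N", "N", "B", "N", "A"]
assignments = ["A", "B", "B", "N", "A", "N", "N", "N"]
print(total_value(actuals, assignments))   # 3 - 1 + 6 + 0 - 1 + 0 + 0 + 0 = 7
```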
Slide 22
Data Understanding
• What is the meaning of the attributes?
• Type and range of values?
Slide 23
20–20–2 Network
Profit = 3 · AA + 6 · BB – (NA + NB + BA + AB)
results:
• winner 2007: 7,890
• my version: 6,714
• our students: 6,468 (73/230)
Slide 24
DMC2008: Participation in a Lottery
Predicting, at the beginning of the lottery, how long participants will participate:
• 0 – The first ticket has not been paid for
• 1 – Only the ticket for the first class has been paid for
• 2 – Only the first two classes were played
• 3 – The lottery was played until the end, but no ticket was purchased for the following lottery
• 4 – At least the first ticket for the following lottery was purchased
[Figure: cost matrix]
Slide 25
Data
• 113,476 patterns!
• 69 attributes
– new customer (yes/no)
– age
– bank
– car
– …
Slide 26
100–40–20–5 Network
results:
• 1,030,240 RWTH Aachen (1)
…
• 1,024,535 RWTH Aachen (8)
• 865,565 Bauhaus Univ. Weimar (100)
• Univ. Wismar: 878,550 – 835,035
• –1,494,315 (212)
Slide 27
DMC 2009 – online bookshop "Libri"
• Sales figures for training:
– more than 1,800 books
– 2,418 shops
• Sales figures to forecast:
– 8 books
– 2,394 shops
Slide 28
DMC 2009 – online bookshop "Libri"
Slide 29
DMC 2009 – 83-25-9-3 network
Slide 30
DMC 2010: Revenue maximisation by intelligent couponing
• Many customers order from an online shop only once
• decision whether to send a voucher worth €5.00
• voucher for those who would not have decided to re-order by themselves
• 32,427 data sets for training
• 32,428 data sets for prediction
• 37 attributes per set + target attribute in the training set
Slide 31
DMC 2010
• out of 67 teams!
Slide 32
Content
• Data Mining
• Classification: approach
• Data Mining Cup
• Clustering: approach
– Behaviour of bank customers
Slide 33
Clustering Transaction Data
Co-operation
• Hochschule Wismar
• HypoVereinsbank
• Medienhaus Rostock
Issue
• What information can be extracted from turnover time series?
Strategy
1. Clustering time series data
2. Assign customers/accounts to clusters
3. Examine clusters
Slide 34
Transaction Data & Time Series
Corporate clients
• 223 branches
Cumulated transactions per
• month
• account
• type of transaction
... for a total of 6 years
Original financial data not suitable:
• Order of values is important
• Time displacements are problematic
Slide 35
Fourier versus Original Data
No displacement
Similarity detected on both:
• transaction curve and
• frequency spectrum
Data is displaced
• frequency spectrum still shows similarity
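Why the frequency spectrum is robust against displacement can be shown with a small experiment. The sketch computes the magnitude spectrum with a plain discrete Fourier transform and compares a series with a copy displaced by one month. The turnover values are invented, and the displacement is circular, for which the magnitudes are exactly equal; for real, non-periodic data the spectra are only approximately equal.

```python
import cmath
import math

def magnitude_spectrum(series):
    """Magnitudes of the discrete Fourier transform of a time series."""
    n = len(series)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(series)))
            for k in range(n)]

def shift(series, months):
    """Circular displacement of the series by the given number of months."""
    months %= len(series)
    return series[-months:] + series[:-months]

def distance(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical monthly turnover of one account and a displaced copy
turnover = [10, 40, 15, 5, 10, 40, 15, 5, 10, 40, 15, 5]
displaced = shift(turnover, 1)

curve_distance = distance(turnover, displaced)
spectrum_distance = distance(magnitude_spectrum(turnover),
                             magnitude_spectrum(displaced))
print("distance of the curves: ", curve_distance)
print("distance of the spectra:", spectrum_distance)
```

The curves differ clearly, while the distance between the two spectra is zero up to rounding errors.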
Slide 36
Using a classification model
[Figure: turnover of a customer over time; sequence A covers t0 to t0+n, sequence B covers tm to tm+n]
1. Building the model: sequence A → preprocessing → clustering → classification model and initial cluster
2. Applying the model: sequence B → preprocessing → classification model → new cluster
3. Comparing cluster assignments: identical or different?
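The comparison of cluster assignments can be sketched as follows. As a simplification, a nearest-centroid rule stands in for the trained clustering and classification model; the centroids and the feature vectors of the two sequences are invented for illustration.

```python
import math

def nearest_cluster(vector, centroids):
    """Index of the centroid closest to the feature vector."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(vector, centroids[i]))

def change_rate(features_a, features_b, centroids):
    """Fraction of accounts whose cluster differs between sequence A and B."""
    changed = sum(nearest_cluster(a, centroids) != nearest_cluster(b, centroids)
                  for a, b in zip(features_a, features_b))
    return changed / len(features_a)

# Hypothetical cluster centroids and preprocessed features of five accounts
centroids = [[0, 0], [10, 0], [0, 10]]
sequence_a = [[1, 1], [9, 1], [1, 9], [8, 0], [0, 2]]   # initial clusters 0, 1, 2, 1, 0
sequence_b = [[2, 0], [1, 1], [0, 8], [9, 2], [1, 1]]   # new clusters     0, 0, 2, 1, 0

rate = change_rate(sequence_a, sequence_b, centroids)
print(f"changed cluster assignments: {rate:.0%}")   # 20%
```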
Slide 37
Clustering & Prediction Results
• 140,000 records
• 1 record = 1 account
• 6x5 SOM = max. 30 clusters
• average changes of cluster assignments: approx. 19%
Variability per Business Sector
Business sector        Variability   Ratio
Taxi                   22.3%         239/1070
Ship Broker Offices    22.3%         64/471
Churches               20.9%         228/1091
Trucking               20.2%         1010/5008
Slide 38
End
Slide 39