Data Mining?

Download Report

Transcript Data Mining?

Issues in Data Mining
Infrastructure
Authors:
Nemanja Jovanovic, [email protected]
Valentina Milenkovic, [email protected]
Voislav Galic, [email protected]
Dusan Zecevic, [email protected]
Sonja Tica, [email protected]
Prof. Dr. Dusan Tosic, [email protected]
Prof. Dr. Veljko Milutinovic, [email protected]
Page Number: 1
Data Mining in the Nutshell

Uncovering the hidden knowledge

Huge n-p complete search space

Multidimensional interface
NOTICE:
All trademarks and service marks mentioned in this document are marks of their
respective owners. Furthermore CRISP-DM consortium (NCR Systems Engineering
Copenhagen (USA and Denmark), DaimlerChrysler AG (Germany), SPSS Inc. (USA)
and OHRA Verzekeringen en Bank Groep B.V (The Netherlands)) permitted
presentation of their process model.
Page Number: 2
A Problem …
You are a marketing manager
for a cellular phone company

Problem: Churn is too high

Turnover (after contract expires) is 40%

Customers receive free phone (cost 125$)

You pay a sales commission of 250$ per contract

Giving a new telephone to everyone
whose contract is expiring is expensive

Bringing back a customer after quitting
is both difficult and expensive
Page Number: 3
… A Solution

Three months before a contract expires,
predict which customers will leave

If you want to keep a customer
that is predicted to churn,
offer them a new phone

The ones that are not predicted to churn
need no attention

If you don’t want to keep the customer, do nothing

How can you predict future behavior?

Tarot Cards?

Magic Ball?

Data Mining?
Page Number: 4
Still Skeptical?
Page Number: 5
The Definition
The automated extraction
of predictive information
from (large) databases

Automated

Extraction

Predictive

Databases
Page Number: 6
History of Data Mining
Page Number: 7
Repetition in Solar Activity

1613 – Galileo Galilei

1859 – Heinrich Schwabe
Page Number: 8
The Return of the
Halley Comet
Edmund Halley (1656 - 1742)
1531
1607
1682
239 BC
1910
1986
2061 ???
Page Number: 9
Data Mining is Not

Data warehousing

Ad-hoc query/reporting

Online Analytical Processing (OLAP)

Data visualization
Page Number: 10
Data Mining is

Automated extraction
of predictive information
from various data sources

Powerful technology
with great potential to help users focus
on the most important information
stored in data warehouses
or streamed through communication lines
Page Number: 11
Data Mining can

Answer question
that were too time consuming
to resolve in the past

Predict future trends and behaviors,
allowing us to make proactive,
knowledge driven decision
Page Number: 12
Data Mining Models
Page Number: 13
Neural Networks

Characterizes processed data
with single numeric value

Efficient modeling of
large and complex problems

Based on biological structures
- Neurons

Network consists of neurons
grouped into layers
Page Number: 14
Neuron Functionality
I1
W1
I2
W2
I3
W3
In
f
Output
Wn
Output = f (W1*I1, W2*I1, …, Wn*In)
Page Number: 15
Training Neural Networks
Page Number: 16
Neural Networks

Once trained, Neural Networks
can efficiently estimate the value
of an output variable for given input

Neurons and network topology
are essentials

Usually used for prediction
or regression problem types

Difficult to understand

Data pre-processing often required
Page Number: 17
Decision Trees

A way of representing a series of rules
that lead to a class or value

Iterative splitting of data
into discrete groups
maximizing distance between them
at each split

Classification trees and regression trees

Univariate splits and multivariate splits

Unlimited growth and stopping rules

CHAID, CHART, Quest, C5.0
Page Number: 18
Decision Trees
Balance>10
Balance<=10
Age<=32
Married=NO
Age>32
Married=YES
Page Number: 19
Decision Trees
Page Number: 20
Rule Induction

Method of deriving a set of rules
to classify cases

Creates independent rules
that are unlikely to form a tree

Rules may not cover
all possible situations

Rules may sometimes
conflict in a prediction
Page Number: 21
Rule Induction
If balance>100.000
then confidence=HIGH & weight=1.7
If balance>25.000 and
status=married
then confidence=HIGH & weight=2.3
If balance<40.000
then confidence=LOW & weight=1.9
Page Number: 22
K-nearest Neighbor and
Memory-Based Reasoning (MBR)

Usage of knowledge
of previously solved similar problems
in solving the new problem

Assigning the class to the group
where most of the k-”neighbors” belong

First step – finding the suitable measure
for distance between attributes in the data

+ Easy handling of non-standard data types

- Huge models
Page Number: 23
K-nearest Neighbor and
Memory-Based Reasoning (MBR)
Page Number: 24
Data Mining Algorithms

Many other available models and algorithms

Logistic regression

Discriminant analysis

Generalized Adaptive Models (GAM)

Genetic algorithms

The Apriori algorithm

Etc…

Many application specific variations
of known models

Final implementation usually involves
several techniques
Page Number: 25
The Apriori Algorithm



The task – mining association rules by finding large itemsets and
translating them to the corresponding association rules;
A  B, or A1  A2 … Am  B1  B2 … Bn, where A  B = 
The terminology
–
–
–
–
Confidence
Support
k-itemset – a set of k items;
Large itemsets – the large itemset {A, B} corresponds to the following rules
(implications): A  B and B  A;
Page Number: 26
The Apriori Algorithm

The  operator definition
–
–
–
–
n = 1: S2 = S1  S1 = {A}, {B}, {C}}  {{A}, {B}, {C}} = {{AB}, {AC}, {BC}}
n = k: Sk+1 = Sk  Sk = {X  Y| X, Y  Sk, |X  Y| = k-1}
X and Y must have the same number of elements, and must have exactly k-1
identical elements;
Every k-element subset of any resulting set element (an element is actually a k+1
element set) has to belong to the original set of itemsets;
Page Number: 27
The Apriori Algorithm

Example:
TID
elements
10
A
C
D
20
B
C
E
30
A
B
C
40
B
E
Page Number: 28
E
The Apriori Algorithm

Step 1 – generate a candidate set of 1-itemsets C1
–
–
–
–
Every possible 1-element set from the database is potentially a large itemset,
because we don’t know the number of its appearances in the database in
advance (á priori );
The task adds up to identifying (counting) all the different elements in the
database; every such element forms a 1-element candidate set;
C1 = {{A}, {B}, {C}, {D}, {E}}
Now, we are going to scan the entire database, to count the number of
appearances for each one of these elements (i.e. one-element sets);
Page Number: 29
The Apriori Algorithm

Now, we are going to scan the entire database, to count the number of
appearances for each one of these elements (i.e. one-element sets);
{A}
2
{B}
3
{C}
3
{D}
1
{E}
3
Page Number: 30
The Apriori Algorithm

Step 2 – generate a set of large 1-itemsets L1
–
Each element in C1 with support that exceeds some adopted minimum support
(for example 50%) becomes a member of L1;
–
L1 = {{A}, {B}, {C},{E}}
and we can omit D in further
steps (if D doesn’t have
enough support alone,
there is no way it could
satisfy requested support
in a combination with some
other element(s));
Page Number: 31
{A}
2
{B}
3
{C}
3
{D}
1
{E}
3
The Apriori Algorithm


Step 3 – generate a candidate set of large 2-itemsets, C2
–
C2 = L1  L1 ={{AB}, {AC}, {AE}, {BC}, {BE}, {CE}}
–
Count the corresponding appearances
Step 4 – generate a set of large 2-itemsets, L2;
–
–
Eliminate the candidates
without minimum support;
L2 = {{AC}, {BC}, {BE}, {CE}}
Page Number: 32
{AB}
1
{AC}
2
{AE}
1
{BC}
2
{BE}
3
{CE}
2
The Apriori Algorithm

Step 5 (C3)
–
–

Step 6 (L3)
–


C3 = L2  L2 = {{BCE}}
Why not {ABC} and {ACE} – because their 2-element subsets {AB} and {AE} are
not the elements of large 2-itemset set L2 (calculation is made according to the
operator  definition);
L3 = {{BCE}}, since {BCE} satisfies the required support of 50% (two
appearances);
There can be no further steps in this particular case,
because L3  L3 = ;
Answer = L1  L2  L3;
Page Number: 33
The Apriori Algorithm
L1 = {large 1-itemsets}
for (k=2; Lk-1  ; k++)
Ck = apriori-gen(Lk-1);
forall transactions t  D do begin
Ct = subset (Ck, t);
forall candidates c  Ct do
c.count++;
end;
Lk = {c  Ck | c.count  minsup}
end;
Answer = k Lk
Page Number: 34
The Apriori Algorithm


Enhancements to the basic algorithm
Scan-reduction
–
–
–
The most time consuming operation in Apriori algorithm is the database scan; it
is originally performed after each candidate set generation, to determine the
frequency of each candidate in the database;
Scan number reduction – counting candidates of multiple sizes in one pass;
Rather than counting only candidates of size k in the kth pass, we can also
calculate the candidates C’k+1, where C’k+1 is generated from Ck (instead Lk),
using the  operator;
Page Number: 35
The Apriori Algorithm
–
–
–
–
–
Compare: C’k+1 = Ck  Ck
Ck+1 = Lk  Lk
Note that C’k+1  Ck+1
This variation can pay off in later passes, when the cost of counting and keeping
in memory additional C’k+1 - Ck+1 candidates becomes less than the cost of
scanning the database;
There has to be enough space in main memory for both Ck and C’k+1;
Following this idea, we can make further scan reduction:
• C’k+1 is calculated from Ck for k > 1;
• There must be enough memory space for all Ck’s (k >
1);
–
Consequently, only two database scans need to be performed (the first to
determine L1, and the second to determine all the other Lk’s);
Page Number: 36
The Apriori Algorithm

Abstraction levels
–
–
Higher level associations are stronger (more powerful), but also less certain;
A good practice would be adopting different thresholds for different abstraction
levels (higher thresholds for higher levels of abstraction)
Page Number: 37
References







Devedzic, V., “Inteligentni informacioni sistemi,” Digit, FON, Beograd, 2003.
http://www.marconi.com
http://www.blueyed.com
http://www.fipa.org
http://www.rpi.edu
http://research.microsoft.com
http://imatch.lcs.mit.edu
Page Number: 38
DM Process Model

5A – used by SPSS Clementine
(Assess, Access, Analyze, Act and Automate)

SEMMA – used by SAS Enterprise Miner
(Sample, Explore, Modify, Model and Assess)

CRISP – tends to become a standard
Page Number: 39
CRISP - DM

CRoss-Industry Standard for DM

Conceived in 1996 by three companies:
Page Number: 40
CRISP – DM methodology
Four level breakdown of the CRISP-DM methodology:
Phases
Generic Tasks
Specialized Tasks
Process Instances
Page Number: 41
Mapping generic models
to specialized models

Analyze the specific context

Remove any details not applicable to the context

Add any details specific to the context

Specialize generic context according to
concrete characteristic of the context

Possibly rename generic contents
to provide more explicit meanings
Page Number: 42
CRISP – DM model

Business understanding

Data understanding

Data preparation

Modeling
Business
understanding
Deployment

Evaluation

Deployment
Evaluation
Page Number: 43
Data
understanding
Data
preparation
Modeling
Business Understanding

Determine business objectives

Assess situation

Determine data mining goals

Produce project plan
Page Number: 44
Data Understanding

Collect initial data

Describe data

Explore data

Verify data quality
Page Number: 45
Data Preparation

Select data

Clean data

Construct data

Integrate data

Format data
Page Number: 46
Modeling

Select modeling technique

Generate test design

Build model

Assess model
Page Number: 47
Evaluation
results = models + findings

Evaluate results

Review process

Determine next steps
Page Number: 48
Deployment

Plan deployment

Plan monitoring and maintenance

Produce final report

Review project
Page Number: 49
At Last…
Page Number: 50
Evolution of Data Mining
Evolutionary Step
Business Question
Enabling
Technologies
Product Providers
Characteristics
Data Collection
(1960s)
What was my average
total revenue over the
last 5 years?
Computers,
tapes,
disks
IBM,
CDC
Retrospective,
static data delivery
Data Access
(1980s)
What were unit sales
in New England
last March?
RDBMS,
SQL,
ODBC
Oracle, Sybase
Informix, IBM,
Microsoft
Retrospective,
dynamic data delivery
at record level
Data Navigation
(1990s)
What were unit sales
in New England last
March?
Drill down to Boston.
OLAP,
Multidimensional
databases,
data warehouses
Pilot, IRI,
Arbor, Redbrick,
Evolutionary
Technologies
Retrospective,
dynamic data delivery
at multiple levels
Data Mining
(2000)
What’s likely to
happen to Boston unit
sales next month?
Why?
Advanced algorithms,
multiprocessors,
massive databases
Lockheed,
IBM, SGI,
numerous startups
Prospective, proactive
information delivery
Page Number: 51
Examples of DM projects to stimulate your imagination

Here are six examples of how data mining is helping corporations
to operate more efficiently and profitably in today's business environment
– Targeting a set of consumers
who are most likely to respond to a direct mail campaign
– Predicting the probability of default for consumer loan applications
– Reducing fabrication flaws in VLSI chips
– Predicting audience share for television programs
– Predicting the probability that a cancer patient
will respond to radiation therapy
– Predicting the probability that an offshore oil well is actually going
to produce oil
Page Number: 52
Comparison of fourteen DM tools





Evaluated by four undergraduates inexperienced at data mining,
a relatively experienced graduate student, and
a professional data mining consultant
Run under the MS Windows 95, MS Windows NT,
Macintosh System 7.5
Use one of the four technologies:
Decision Trees, Rule Inductions, Neural, or Polynomial Networks
Solve two binary classification problems:
multi-class classification and noiseless estimation problem
Price from 75$ to 25.000$
Page Number: 53
Comparison of fourteen DM tools




The Decision Tree products were
- CART
- Scenario
- See5
- S-Plus
The Rule Induction tools were
- WizWhy
- DataMind
- DMSK
Neural Networks were built from three programs
- NeuroShell2
- PcOLPARS
- PRW
The Polynomial Network tools were
- ModelQuest Expert
- Gnosis
- a module of NeuroShell2
- KnowledgeMiner
Page Number: 54
Criteria for evaluating DM tools
A list of 20 criteria for evaluating DM tools, put into 4 categories:

Capability measures what a desktop tool can do,
and how well it does it
- Handles missing data
- Considers misclassification costs
- Allows data transformations
- Includes quality of tesing options
- Has a programming language
- Provides useful output reports
- Provides visualisation
Page Number: 55
-
Visualisation
+ excellent capability  good capability - some capability “blank” no capability
Page Number: 56
Criteria for evaluating DM tools

Learnability/Usability shows how easy a tool is to learn and use
-
Tutorials
Wizards
Easy to learn
User’s manual
Online help
Interface
Page Number: 57
Criteria for evaluating DM tools

Interoperability shows a tool’s ability to interface
with other computer applications
- Importing data
- Exporting data
- Links to other applications

Flexibility
- Model adjustment flexibility
- Customizable work enviroment
- Ability to write or change code
Page Number: 58
Data Input & Output Model
+ excellent capability
 good capability
- some capability
“blank” no capability
Page Number: 59
A classification of data sets

Pima Indians Diabetes data set
–
–

Wisconsin Breast Cancer data set
–
–

699 instances of breast tumors some of which are malignant,
most of which are benign
10 attributes plus the binary malignancy variable per case
The Forensic Glass Identification data set
–
–

768 cases of Native American women from the Pima tribe
some of whom are diabetic, most of whom are not
8 attributes plus the binary class variable for diabetes per instance
214 instances of glass collected during crime investigations
10 attributes plus the multi-class output variable per instance
Moon Cannon data set
–
–
300 solutions to the equation:
x = 2v 2 sin(g)cos(g)/g
the data were generated without adding noise
Page Number: 60
Evaluation of forteen DM tools
Page Number: 61
Potentials of R&D
in
Cooperation with U. of Belgrade
An Overview of Advanced Datamining Projects
for High-Tech Computer Industry
in the USA and EU
VLSI Detection
for
Internet/Telephony Interfaces
Goran Davidović, Miljan Vuletić, Veljko Milutinović,
Tom Chen, and Tom Brunett

* eT
Page Number: 63
USERS...
...
Superposition/DETECTION
Superposition/DETECTION
SPECIALIZED
INTERNET
REMOTE
SITE
SERVICE
PROVIDER
HOME/OFFICE/FACTORY AUTOMATION ON THE INTERNET
Page Number: 64
Reconfigurable FPGA for EBI
Božidar Radunović, Predrag Knežević, Veljko Milutinović,
Steve Casselman, and John Schewel*

* Virtual
Page Number: 65
USERS
...
SPECIALIZED
INTERNET
SERVICE
PROVIDER
VCC
VCC
CUSTOMER SATISFACTION vs CUSTOMER PROFILE
Page Number: 66
BioPoP
Veljko Milutinovic, Vladimir Jovicic, Milan Simic,
Bratislav Milic, Milan Savic, Veljko Jovanovic,
Stevo Ilic, Djordje Veljkovic, Stojan Omorac,
Nebojsa Uskokovic, and Fred Darnell
•isItWorking.com
Page Number: 67
Testing the Infrastructure for EBI

Phones
 Faxes
 Email
 Web links
 Servers
 Routers
 Software
• Statistics
• Correlation
• Innovation
Page Number: 68
CNUCE
Integration and Datamining
on Ad-Hoc Networks and the Internet
Veljko Milutinović,
Luca Simoncini, and Enrico Gregory

*University of Pisa, Santanna, CNUCE
Page Number: 69
GSM
DM
Ad-Hoc
Page Number: 70
Internet
Genetic Search
with Spatial/Temporal Mutations
Jelena Mirković, Dragana Cvetković,
and Veljko Milutinović

*Comshare
Page Number: 71
Drawbacks of INDEX-BASED:
Time to index + ranking
Advantages of LINKS-BASED:
Mission critical applications + customer tuned ranking
Well organized markets: Best first search
If elements of disorder: G w DB mutations
Chaotic markets: G w S/T mutations
Provider
Page Number: 72
e-Banking on the Internet
Miloš Kovačević, Bratislav Milic, Veljko Milutinović,
Marco Gori, and Roberto Giorgi

*University of Siena
Page Number: 73
Bottleneck#1: Searching for Clients and Investments
1472++
*University of Siena + Banco di Monte dei Paschi
Page Number: 74
WaterMarking for
e-Banking on the Internet
Darko Jovic, Ivana Vujovic, Veljko Milutinovic

Fraunhofer, IPSI, Darmstadt, Germany
Page Number: 75
Bottleneck#1: SpeedUp
Page Number: 76
SSGRR
Organizing Conferences via the Internet
Zoran Horvat, Nataša Kukulj, Vlada Stojanović,
Dušan Dingarac, Marjan Mihanović, Miodrag Stefanović,
Veljko Milutinović, and Frederic Patricelli

*SSGRR, L’Aquila
Page Number: 77
http://www.ssgrr.it
2000:
Arno Penzias
2001:
Bob
Richardson
2002:
Jerry Friedman
2003:
Harry Kroto
Page Number: 78
Summary
Books with Nobel Laureates:
Kenneth Wilson, Ohio (North-Holland)
Leon Cooper, Brown (Prentice-Hall)
Robert Richardson, Cornell (Kluwer-Academics)
Herb Simon (Kluwer-Academics)
Jerome Friedman, MIT (IOS Press)
Harold Kroto (IOS Press)
Arno Penzias (IOS Press)
Page Number: 79
http://galeb.etf.bg.ac.yu/~vm/
e-mail: [email protected]
Page Number: 80