Data Mining?
Download
Report
Transcript Data Mining?
Issues in Data Mining
Infrastructure
Authors:
Nemanja Jovanovic, [email protected]
Valentina Milenkovic, [email protected]
Voislav Galic, [email protected]
Dusan Zecevic, [email protected]
Sonja Tica, [email protected]
Prof. Dr. Dusan Tosic, [email protected]
Prof. Dr. Veljko Milutinovic, [email protected]
Page Number: 1
Data Mining in the Nutshell
Uncovering the hidden knowledge
Huge n-p complete search space
Multidimensional interface
NOTICE:
All trademarks and service marks mentioned in this document are marks of their
respective owners. Furthermore CRISP-DM consortium (NCR Systems Engineering
Copenhagen (USA and Denmark), DaimlerChrysler AG (Germany), SPSS Inc. (USA)
and OHRA Verzekeringen en Bank Groep B.V (The Netherlands)) permitted
presentation of their process model.
Page Number: 2
A Problem …
You are a marketing manager
for a cellular phone company
Problem: Churn is too high
Turnover (after contract expires) is 40%
Customers receive free phone (cost 125$)
You pay a sales commission of 250$ per contract
Giving a new telephone to everyone
whose contract is expiring is expensive
Bringing back a customer after quitting
is both difficult and expensive
Page Number: 3
… A Solution
Three months before a contract expires,
predict which customers will leave
If you want to keep a customer
that is predicted to churn,
offer them a new phone
The ones that are not predicted to churn
need no attention
If you don’t want to keep the customer, do nothing
How can you predict future behavior?
Tarot Cards?
Magic Ball?
Data Mining?
Page Number: 4
Still Skeptical?
Page Number: 5
The Definition
The automated extraction
of predictive information
from (large) databases
Automated
Extraction
Predictive
Databases
Page Number: 6
History of Data Mining
Page Number: 7
Repetition in Solar Activity
1613 – Galileo Galilei
1859 – Heinrich Schwabe
Page Number: 8
The Return of the
Halley Comet
Edmund Halley (1656 - 1742)
1531
1607
1682
239 BC
1910
1986
2061 ???
Page Number: 9
Data Mining is Not
Data warehousing
Ad-hoc query/reporting
Online Analytical Processing (OLAP)
Data visualization
Page Number: 10
Data Mining is
Automated extraction
of predictive information
from various data sources
Powerful technology
with great potential to help users focus
on the most important information
stored in data warehouses
or streamed through communication lines
Page Number: 11
Data Mining can
Answer question
that were too time consuming
to resolve in the past
Predict future trends and behaviors,
allowing us to make proactive,
knowledge driven decision
Page Number: 12
Data Mining Models
Page Number: 13
Neural Networks
Characterizes processed data
with single numeric value
Efficient modeling of
large and complex problems
Based on biological structures
- Neurons
Network consists of neurons
grouped into layers
Page Number: 14
Neuron Functionality
I1
W1
I2
W2
I3
W3
In
f
Output
Wn
Output = f (W1*I1, W2*I1, …, Wn*In)
Page Number: 15
Training Neural Networks
Page Number: 16
Neural Networks
Once trained, Neural Networks
can efficiently estimate the value
of an output variable for given input
Neurons and network topology
are essentials
Usually used for prediction
or regression problem types
Difficult to understand
Data pre-processing often required
Page Number: 17
Decision Trees
A way of representing a series of rules
that lead to a class or value
Iterative splitting of data
into discrete groups
maximizing distance between them
at each split
Classification trees and regression trees
Univariate splits and multivariate splits
Unlimited growth and stopping rules
CHAID, CHART, Quest, C5.0
Page Number: 18
Decision Trees
Balance>10
Balance<=10
Age<=32
Married=NO
Age>32
Married=YES
Page Number: 19
Decision Trees
Page Number: 20
Rule Induction
Method of deriving a set of rules
to classify cases
Creates independent rules
that are unlikely to form a tree
Rules may not cover
all possible situations
Rules may sometimes
conflict in a prediction
Page Number: 21
Rule Induction
If balance>100.000
then confidence=HIGH & weight=1.7
If balance>25.000 and
status=married
then confidence=HIGH & weight=2.3
If balance<40.000
then confidence=LOW & weight=1.9
Page Number: 22
K-nearest Neighbor and
Memory-Based Reasoning (MBR)
Usage of knowledge
of previously solved similar problems
in solving the new problem
Assigning the class to the group
where most of the k-”neighbors” belong
First step – finding the suitable measure
for distance between attributes in the data
+ Easy handling of non-standard data types
- Huge models
Page Number: 23
K-nearest Neighbor and
Memory-Based Reasoning (MBR)
Page Number: 24
Data Mining Algorithms
Many other available models and algorithms
Logistic regression
Discriminant analysis
Generalized Adaptive Models (GAM)
Genetic algorithms
The Apriori algorithm
Etc…
Many application specific variations
of known models
Final implementation usually involves
several techniques
Page Number: 25
The Apriori Algorithm
The task – mining association rules by finding large itemsets and
translating them to the corresponding association rules;
A B, or A1 A2 … Am B1 B2 … Bn, where A B =
The terminology
–
–
–
–
Confidence
Support
k-itemset – a set of k items;
Large itemsets – the large itemset {A, B} corresponds to the following rules
(implications): A B and B A;
Page Number: 26
The Apriori Algorithm
The operator definition
–
–
–
–
n = 1: S2 = S1 S1 = {A}, {B}, {C}} {{A}, {B}, {C}} = {{AB}, {AC}, {BC}}
n = k: Sk+1 = Sk Sk = {X Y| X, Y Sk, |X Y| = k-1}
X and Y must have the same number of elements, and must have exactly k-1
identical elements;
Every k-element subset of any resulting set element (an element is actually a k+1
element set) has to belong to the original set of itemsets;
Page Number: 27
The Apriori Algorithm
Example:
TID
elements
10
A
C
D
20
B
C
E
30
A
B
C
40
B
E
Page Number: 28
E
The Apriori Algorithm
Step 1 – generate a candidate set of 1-itemsets C1
–
–
–
–
Every possible 1-element set from the database is potentially a large itemset,
because we don’t know the number of its appearances in the database in
advance (á priori );
The task adds up to identifying (counting) all the different elements in the
database; every such element forms a 1-element candidate set;
C1 = {{A}, {B}, {C}, {D}, {E}}
Now, we are going to scan the entire database, to count the number of
appearances for each one of these elements (i.e. one-element sets);
Page Number: 29
The Apriori Algorithm
Now, we are going to scan the entire database, to count the number of
appearances for each one of these elements (i.e. one-element sets);
{A}
2
{B}
3
{C}
3
{D}
1
{E}
3
Page Number: 30
The Apriori Algorithm
Step 2 – generate a set of large 1-itemsets L1
–
Each element in C1 with support that exceeds some adopted minimum support
(for example 50%) becomes a member of L1;
–
L1 = {{A}, {B}, {C},{E}}
and we can omit D in further
steps (if D doesn’t have
enough support alone,
there is no way it could
satisfy requested support
in a combination with some
other element(s));
Page Number: 31
{A}
2
{B}
3
{C}
3
{D}
1
{E}
3
The Apriori Algorithm
Step 3 – generate a candidate set of large 2-itemsets, C2
–
C2 = L1 L1 ={{AB}, {AC}, {AE}, {BC}, {BE}, {CE}}
–
Count the corresponding appearances
Step 4 – generate a set of large 2-itemsets, L2;
–
–
Eliminate the candidates
without minimum support;
L2 = {{AC}, {BC}, {BE}, {CE}}
Page Number: 32
{AB}
1
{AC}
2
{AE}
1
{BC}
2
{BE}
3
{CE}
2
The Apriori Algorithm
Step 5 (C3)
–
–
Step 6 (L3)
–
C3 = L2 L2 = {{BCE}}
Why not {ABC} and {ACE} – because their 2-element subsets {AB} and {AE} are
not the elements of large 2-itemset set L2 (calculation is made according to the
operator definition);
L3 = {{BCE}}, since {BCE} satisfies the required support of 50% (two
appearances);
There can be no further steps in this particular case,
because L3 L3 = ;
Answer = L1 L2 L3;
Page Number: 33
The Apriori Algorithm
L1 = {large 1-itemsets}
for (k=2; Lk-1 ; k++)
Ck = apriori-gen(Lk-1);
forall transactions t D do begin
Ct = subset (Ck, t);
forall candidates c Ct do
c.count++;
end;
Lk = {c Ck | c.count minsup}
end;
Answer = k Lk
Page Number: 34
The Apriori Algorithm
Enhancements to the basic algorithm
Scan-reduction
–
–
–
The most time consuming operation in Apriori algorithm is the database scan; it
is originally performed after each candidate set generation, to determine the
frequency of each candidate in the database;
Scan number reduction – counting candidates of multiple sizes in one pass;
Rather than counting only candidates of size k in the kth pass, we can also
calculate the candidates C’k+1, where C’k+1 is generated from Ck (instead Lk),
using the operator;
Page Number: 35
The Apriori Algorithm
–
–
–
–
–
Compare: C’k+1 = Ck Ck
Ck+1 = Lk Lk
Note that C’k+1 Ck+1
This variation can pay off in later passes, when the cost of counting and keeping
in memory additional C’k+1 - Ck+1 candidates becomes less than the cost of
scanning the database;
There has to be enough space in main memory for both Ck and C’k+1;
Following this idea, we can make further scan reduction:
• C’k+1 is calculated from Ck for k > 1;
• There must be enough memory space for all Ck’s (k >
1);
–
Consequently, only two database scans need to be performed (the first to
determine L1, and the second to determine all the other Lk’s);
Page Number: 36
The Apriori Algorithm
Abstraction levels
–
–
Higher level associations are stronger (more powerful), but also less certain;
A good practice would be adopting different thresholds for different abstraction
levels (higher thresholds for higher levels of abstraction)
Page Number: 37
References
Devedzic, V., “Inteligentni informacioni sistemi,” Digit, FON, Beograd, 2003.
http://www.marconi.com
http://www.blueyed.com
http://www.fipa.org
http://www.rpi.edu
http://research.microsoft.com
http://imatch.lcs.mit.edu
Page Number: 38
DM Process Model
5A – used by SPSS Clementine
(Assess, Access, Analyze, Act and Automate)
SEMMA – used by SAS Enterprise Miner
(Sample, Explore, Modify, Model and Assess)
CRISP – tends to become a standard
Page Number: 39
CRISP - DM
CRoss-Industry Standard for DM
Conceived in 1996 by three companies:
Page Number: 40
CRISP – DM methodology
Four level breakdown of the CRISP-DM methodology:
Phases
Generic Tasks
Specialized Tasks
Process Instances
Page Number: 41
Mapping generic models
to specialized models
Analyze the specific context
Remove any details not applicable to the context
Add any details specific to the context
Specialize generic context according to
concrete characteristic of the context
Possibly rename generic contents
to provide more explicit meanings
Page Number: 42
CRISP – DM model
Business understanding
Data understanding
Data preparation
Modeling
Business
understanding
Deployment
Evaluation
Deployment
Evaluation
Page Number: 43
Data
understanding
Data
preparation
Modeling
Business Understanding
Determine business objectives
Assess situation
Determine data mining goals
Produce project plan
Page Number: 44
Data Understanding
Collect initial data
Describe data
Explore data
Verify data quality
Page Number: 45
Data Preparation
Select data
Clean data
Construct data
Integrate data
Format data
Page Number: 46
Modeling
Select modeling technique
Generate test design
Build model
Assess model
Page Number: 47
Evaluation
results = models + findings
Evaluate results
Review process
Determine next steps
Page Number: 48
Deployment
Plan deployment
Plan monitoring and maintenance
Produce final report
Review project
Page Number: 49
At Last…
Page Number: 50
Evolution of Data Mining
Evolutionary Step
Business Question
Enabling
Technologies
Product Providers
Characteristics
Data Collection
(1960s)
What was my average
total revenue over the
last 5 years?
Computers,
tapes,
disks
IBM,
CDC
Retrospective,
static data delivery
Data Access
(1980s)
What were unit sales
in New England
last March?
RDBMS,
SQL,
ODBC
Oracle, Sybase
Informix, IBM,
Microsoft
Retrospective,
dynamic data delivery
at record level
Data Navigation
(1990s)
What were unit sales
in New England last
March?
Drill down to Boston.
OLAP,
Multidimensional
databases,
data warehouses
Pilot, IRI,
Arbor, Redbrick,
Evolutionary
Technologies
Retrospective,
dynamic data delivery
at multiple levels
Data Mining
(2000)
What’s likely to
happen to Boston unit
sales next month?
Why?
Advanced algorithms,
multiprocessors,
massive databases
Lockheed,
IBM, SGI,
numerous startups
Prospective, proactive
information delivery
Page Number: 51
Examples of DM projects to stimulate your imagination
Here are six examples of how data mining is helping corporations
to operate more efficiently and profitably in today's business environment
– Targeting a set of consumers
who are most likely to respond to a direct mail campaign
– Predicting the probability of default for consumer loan applications
– Reducing fabrication flaws in VLSI chips
– Predicting audience share for television programs
– Predicting the probability that a cancer patient
will respond to radiation therapy
– Predicting the probability that an offshore oil well is actually going
to produce oil
Page Number: 52
Comparison of fourteen DM tools
Evaluated by four undergraduates inexperienced at data mining,
a relatively experienced graduate student, and
a professional data mining consultant
Run under the MS Windows 95, MS Windows NT,
Macintosh System 7.5
Use one of the four technologies:
Decision Trees, Rule Inductions, Neural, or Polynomial Networks
Solve two binary classification problems:
multi-class classification and noiseless estimation problem
Price from 75$ to 25.000$
Page Number: 53
Comparison of fourteen DM tools
The Decision Tree products were
- CART
- Scenario
- See5
- S-Plus
The Rule Induction tools were
- WizWhy
- DataMind
- DMSK
Neural Networks were built from three programs
- NeuroShell2
- PcOLPARS
- PRW
The Polynomial Network tools were
- ModelQuest Expert
- Gnosis
- a module of NeuroShell2
- KnowledgeMiner
Page Number: 54
Criteria for evaluating DM tools
A list of 20 criteria for evaluating DM tools, put into 4 categories:
Capability measures what a desktop tool can do,
and how well it does it
- Handles missing data
- Considers misclassification costs
- Allows data transformations
- Includes quality of tesing options
- Has a programming language
- Provides useful output reports
- Provides visualisation
Page Number: 55
-
Visualisation
+ excellent capability good capability - some capability “blank” no capability
Page Number: 56
Criteria for evaluating DM tools
Learnability/Usability shows how easy a tool is to learn and use
-
Tutorials
Wizards
Easy to learn
User’s manual
Online help
Interface
Page Number: 57
Criteria for evaluating DM tools
Interoperability shows a tool’s ability to interface
with other computer applications
- Importing data
- Exporting data
- Links to other applications
Flexibility
- Model adjustment flexibility
- Customizable work enviroment
- Ability to write or change code
Page Number: 58
Data Input & Output Model
+ excellent capability
good capability
- some capability
“blank” no capability
Page Number: 59
A classification of data sets
Pima Indians Diabetes data set
–
–
Wisconsin Breast Cancer data set
–
–
699 instances of breast tumors some of which are malignant,
most of which are benign
10 attributes plus the binary malignancy variable per case
The Forensic Glass Identification data set
–
–
768 cases of Native American women from the Pima tribe
some of whom are diabetic, most of whom are not
8 attributes plus the binary class variable for diabetes per instance
214 instances of glass collected during crime investigations
10 attributes plus the multi-class output variable per instance
Moon Cannon data set
–
–
300 solutions to the equation:
x = 2v 2 sin(g)cos(g)/g
the data were generated without adding noise
Page Number: 60
Evaluation of forteen DM tools
Page Number: 61
Potentials of R&D
in
Cooperation with U. of Belgrade
An Overview of Advanced Datamining Projects
for High-Tech Computer Industry
in the USA and EU
VLSI Detection
for
Internet/Telephony Interfaces
Goran Davidović, Miljan Vuletić, Veljko Milutinović,
Tom Chen, and Tom Brunett
* eT
Page Number: 63
USERS...
...
Superposition/DETECTION
Superposition/DETECTION
SPECIALIZED
INTERNET
REMOTE
SITE
SERVICE
PROVIDER
HOME/OFFICE/FACTORY AUTOMATION ON THE INTERNET
Page Number: 64
Reconfigurable FPGA for EBI
Božidar Radunović, Predrag Knežević, Veljko Milutinović,
Steve Casselman, and John Schewel*
* Virtual
Page Number: 65
USERS
...
SPECIALIZED
INTERNET
SERVICE
PROVIDER
VCC
VCC
CUSTOMER SATISFACTION vs CUSTOMER PROFILE
Page Number: 66
BioPoP
Veljko Milutinovic, Vladimir Jovicic, Milan Simic,
Bratislav Milic, Milan Savic, Veljko Jovanovic,
Stevo Ilic, Djordje Veljkovic, Stojan Omorac,
Nebojsa Uskokovic, and Fred Darnell
•isItWorking.com
Page Number: 67
Testing the Infrastructure for EBI
Phones
Faxes
Email
Web links
Servers
Routers
Software
• Statistics
• Correlation
• Innovation
Page Number: 68
CNUCE
Integration and Datamining
on Ad-Hoc Networks and the Internet
Veljko Milutinović,
Luca Simoncini, and Enrico Gregory
*University of Pisa, Santanna, CNUCE
Page Number: 69
GSM
DM
Ad-Hoc
Page Number: 70
Internet
Genetic Search
with Spatial/Temporal Mutations
Jelena Mirković, Dragana Cvetković,
and Veljko Milutinović
*Comshare
Page Number: 71
Drawbacks of INDEX-BASED:
Time to index + ranking
Advantages of LINKS-BASED:
Mission critical applications + customer tuned ranking
Well organized markets: Best first search
If elements of disorder: G w DB mutations
Chaotic markets: G w S/T mutations
Provider
Page Number: 72
e-Banking on the Internet
Miloš Kovačević, Bratislav Milic, Veljko Milutinović,
Marco Gori, and Roberto Giorgi
*University of Siena
Page Number: 73
Bottleneck#1: Searching for Clients and Investments
1472++
*University of Siena + Banco di Monte dei Paschi
Page Number: 74
WaterMarking for
e-Banking on the Internet
Darko Jovic, Ivana Vujovic, Veljko Milutinovic
Fraunhofer, IPSI, Darmstadt, Germany
Page Number: 75
Bottleneck#1: SpeedUp
Page Number: 76
SSGRR
Organizing Conferences via the Internet
Zoran Horvat, Nataša Kukulj, Vlada Stojanović,
Dušan Dingarac, Marjan Mihanović, Miodrag Stefanović,
Veljko Milutinović, and Frederic Patricelli
*SSGRR, L’Aquila
Page Number: 77
http://www.ssgrr.it
2000:
Arno Penzias
2001:
Bob
Richardson
2002:
Jerry Friedman
2003:
Harry Kroto
Page Number: 78
Summary
Books with Nobel Laureates:
Kenneth Wilson, Ohio (North-Holland)
Leon Cooper, Brown (Prentice-Hall)
Robert Richardson, Cornell (Kluwer-Academics)
Herb Simon (Kluwer-Academics)
Jerome Friedman, MIT (IOS Press)
Harold Kroto (IOS Press)
Arno Penzias (IOS Press)
Page Number: 79
http://galeb.etf.bg.ac.yu/~vm/
e-mail: [email protected]
Page Number: 80