Introduction to KDD for Tony's MI Course


1
Knowledge Discovery
and Data Mining
An Introduction
Daniel L. Silver
Copyright (c), 2003
All Rights Reserved
CogNova
Technologies
2
Agenda
• Introduction to KDD & DM
• Overview of the KDD Process
• Benefits, Costs, Status and Trends
3
“We are drowning in information, but
starving for knowledge.” John Naisbitt,
Megatrends, 1988
Data Analytics or KDD:
Data Warehousing, Data Mining,
Data Visualization
4
Introduction
Data Analytics is not a new field ...
• Since the 1990s, referred to as:
  Data Analysis, Data Mining, Data Warehousing
• A multidisciplinary field:
  • Database and data warehousing
  • Data and model visualization methods
  • On-line Analytical Processing
  • Statistics and machine learning
  • Knowledge management
5
Introduction
Why has Data Analytics become important?
• Competitive focus - Knowledge Management
• Abundance of data !!
• Inexpensive, powerful computing engines
• Strong theoretical/mathematical foundations
  • machine learning & logical inference
  • statistics and dynamical systems
  • database management systems
6
Introduction
What is Data Analytics (KDD)?
• A Process
• The selection and processing of data for:
  • the identification of novel, accurate, and useful patterns, and
  • the modeling of real-world phenomena.
• Data Warehousing, Data Mining, and Data Visualization are major components.
7
The KDD Process
[Figure: the KDD Process. Data Sources → Data Consolidation → Consolidated Data (Warehouse) → Selection and Preprocessing → Prepared Data → Data Mining → Patterns & Models (e.g. p(x)=0.02) → Interpretation and Evaluation → Knowledge]
8
Introduction – KDD In Context
[Figure: “The Virtuous Cycle” (Berry & Linoff) wrapped around the KDD Process diagram. Cycle labels: Identify Problem or Opportunity, Act on Knowledge, Measure Effect of Action, Strategy, Results. The KDD Process takes a Problem as input and yields Knowledge.]
9
Introduction - CRISP
Cross Industry Standard Process for Data Mining
• Developed by employees at SPSS, NCR, DaimlerChrysler
• Iterative process with 6 major steps:
  • Business Understanding
  • Data Understanding
  • Data Preparation
  • Modeling
  • Evaluation
  • Deployment
10
Marketing Embraces KM, DW, DM
Why? …
[Figure labels: Marketing (Traditional Marketing; Relationship Marketing, a.k.a. Customer Relationship Management); MIS (Data Warehousing, Data Mining)]
11
What is Relationship Marketing?
“The Corner Store” (Arbuckle’s Market)
• Knowing your customers on an individual basis
• Maximizing life-time value, not individual sales
• Developing and maintaining a mutually beneficial relationship
• Acquire, retain, win-back desirable customers
12
Knowledge Discovery
What can KDD do for an organization?
Impact on Marketing
• Target marketing at a credit card company
• Consumer usage analysis at a telecom provider
• Loyalty assessment at a service bureau
• Quality of service analysis at an appliance chain
13
Application Areas
Private/Commercial Sector
• Marketing: segmentation, product targeting, customer value and retention, ...
• Finance: investment support, portfolio management
• Banking & Insurance: credit and policy approval
• Security: fraud detection, access control
• Science and medicine: hypothesis discovery, prediction, classification, diagnosis
• Manufacturing: process modeling, quality control, resource allocation
• Engineering: pattern recognition, signal processing
• Internet: smart search engines, web marketing
14
Application Areas
Public/Gov’t Sector
• Finance: investment management, price forecasting
• Taxation: adaptive monitoring, fraud detection
• Health care: medical diagnosis, risk assessment, cost/quality control
• Education: process and quality modeling, resource forecasting
• Insurance: workers’ compensation analysis
• Security: bomb, iceberg detection
• Transportation: simulation and analysis
• Statistics: demographic analysis, municipal planning
15
The Data Analytics (KDD)
Process
16
The KDD Process
[Figure: the KDD Process diagram, repeated. Data Sources → Data Consolidation → Consolidated Data (Warehouse) → Selection and Preprocessing → Prepared Data → Data Mining → Patterns & Models → Interpretation and Evaluation → Knowledge]
17
The KDD Process
Possible results for any one effort:
• Confirmation of the obvious
• New knowledge - the data mine “nugget”
• No significant relations found (random data)
18
The KDD Process
Core Problems & Approaches
• Problems:
  • identification of relevant data
  • representation of data
  • search for valid pattern or model
• Approaches:
  • top-down deduction by expert
  • interactive visualization of data/models
  • * bottom-up induction from data *
[Figure: probability of sale plotted against Age and Income; side labels: OLAP, Data Mining]
19
The KDD Process
The Architecture of a KDD System
[Figure: a Graphical User Interface layered over the stages Data Sources → Data Consolidation → Warehouse → Selection and Preprocessing → Data Mining → Interpretation and Evaluation → Knowledge]
20
The KDD Process
[Figure: the KDD Process diagram, repeated]
21
Data Consolidation
Garbage in, garbage out
• The quality of results relates directly to the quality of the data
• 50%-70% of KDD process effort will be spent on data consolidation, cleansing and preprocessing
• Major justification for a corporate Data Warehouse
22
Data Consolidation & Warehousing
From data sources to consolidated data repository
[Figure: sources (RDBMS, Legacy DBMS, Flat Files, External) feed Data Consolidation and Cleansing, which loads the Warehouse or Datamart used for Analysis and Info Sharing. Flow labels: Inflow, Upflow, Downflow, Outflow, Metaflow]
24
Data Warehousing – A Process
Definition: The strategic collection, cleansing, and consolidation of organizational data to meet operational, analytical, and communication needs.
• 75% of early DW projects were not completed
• Data warehousing is not a project
• It is an on-going set of organizational activities
• Must be business benefits driven
27
Relationship between DW and DM?
[Figure: Data Warehousing and Analysis (Query/Reporting, OLAP, Data Mining) placed on a Strategic-to-Tactical axis. Data Warehousing is the source of consolidated data; Analysis provides the rationale for data consolidation]
28
The KDD Process
[Figure: the KDD Process diagram, repeated]
29
Selection and Preprocessing
• Generate a set of examples
  • choose sampling method
  • consider sample complexity
  • deal with volume bias issues
• Reduce attribute dimensionality
  • remove redundant and/or correlated attributes
  • combine attributes (sum, multiply, difference)
• Reduce attribute value ranges
  • group symbolic discrete values
  • quantize continuous numeric values
• OLAP and visualization tools play key role
  (Han calls this descriptive data mining)
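The reduction steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the customer records, the 0.99 correlation threshold, the bin boundaries and the region grouping are all hypothetical, and equal-width binning is just one of several ways to quantize a numeric attribute.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def quantize(value, low, high, bins):
    """Map a continuous value in [low, high] to an equal-width bin index."""
    if value >= high:
        return bins - 1
    return max(0, int((value - low) / (high - low) * bins))

# Hypothetical customer records: (age, income in $k, income in $, province).
records = [
    (23, 28.0, 28000, "NS"),
    (35, 41.0, 41000, "NB"),
    (47, 63.0, 63000, "ON"),
    (59, 52.0, 52000, "QC"),
]

income_k = [r[1] for r in records]
income_d = [r[2] for r in records]

# Reduce dimensionality: income in dollars is redundant with income in $k.
redundant = abs(pearson(income_k, income_d)) > 0.99

# Group symbolic discrete values: provinces into broader regions (assumed grouping).
REGION_GROUPS = {"NS": "Atlantic", "NB": "Atlantic", "ON": "Central", "QC": "Central"}

# Quantize continuous values: age into 4 bins, income into 5 bins.
prepared = [
    (quantize(age, 20, 60, 4), quantize(inc, 20, 70, 5), REGION_GROUPS[prov])
    for age, inc, _, prov in records
]
```

Each prepared example now holds two small integer codes and one grouped symbol, and the redundant dollar column has been dropped.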
30
OLAP: On-Line Analytical Processing
OLAP Functionality
• Dimension selection
  • slice & dice
• Rotation
  • allows change in perspective
• Filtration
  • value range selection
• Hierarchies
  • drill-downs to lower levels
  • roll-ups to higher levels
[Figure: OLAP cube of Profit Values with dimensions Sales Region, Year by Month, and Product Class by Product Name]
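A small sketch may help make these operations concrete. It treats the cube as a plain list of fact rows and is not how an OLAP server is implemented; the fact table, the dimension names and the helper functions `roll_up`, `slice_` and `filter_range` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact table: (region, year, month, product_class, profit).
facts = [
    ("East", 2002, 1, "Toys", 120.0),
    ("East", 2002, 2, "Toys", 80.0),
    ("East", 2003, 1, "Books", 50.0),
    ("West", 2002, 1, "Toys", 200.0),
    ("West", 2003, 2, "Books", 70.0),
]
DIMS = {"region": 0, "year": 1, "month": 2, "product_class": 3}

def roll_up(rows, *dims):
    """Sum profit over every dimension that is not named in dims."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[DIMS[d]] for d in dims)
        totals[key] += row[4]
    return dict(totals)

def slice_(rows, dim, value):
    """Slice: keep only the rows where one dimension equals a fixed value."""
    return [row for row in rows if row[DIMS[dim]] == value]

def filter_range(rows, low, high):
    """Filtration: keep rows whose profit lies in [low, high]."""
    return [row for row in rows if low <= row[4] <= high]

by_region = roll_up(facts, "region")               # roll-up to the region level
by_region_year = roll_up(facts, "region", "year")  # drill-down adds the year level
by_year_region = roll_up(facts, "year", "region")  # rotation: same cells, new perspective
toys_by_year = roll_up(slice_(facts, "product_class", "Toys"), "year")
large_profits = filter_range(facts, 100.0, 250.0)
```

Drilling down and rolling up differ only in how many dimensions are kept in the key, which is why one function covers both.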
31
Selection and Preprocessing
• Transform data
  • decorrelate and normalize values
  • map time-series data to static representation
• Encode data
  • representation must be appropriate for the Data Mining tool which will be used
  • continue to reduce attribute dimensionality where possible without loss of information
• OLAP and visualization tools as well as transformation and encoding software
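Two of these transformations are easy to show in code. The sketch below assumes z-score scaling as the normalization and a sliding window as the static representation of a time series; the slides do not name specific techniques, and the sales figures are hypothetical. Decorrelation is not covered here.

```python
import statistics

def z_scores(values):
    """Normalize values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def sliding_windows(series, width):
    """Turn a time series into static examples: `width` past values -> next value."""
    return [
        (series[i:i + width], series[i + width])
        for i in range(len(series) - width)
    ]

monthly_sales = [10.0, 12.0, 14.0, 16.0, 18.0]  # hypothetical data
normalized = z_scores(monthly_sales)
examples = sliding_windows(monthly_sales, width=3)
```

Each element of `examples` is an ordinary fixed-length input/target pair, so a tool that expects static records can be applied to the series.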
33
The KDD Process
[Figure: the KDD Process diagram, repeated]
34
Overview of Data Mining Methods
• Automated Exploration/Discovery
  • e.g. discovering new market segments
  • distance and probabilistic clustering algorithms
• Prediction/Classification
  • e.g. forecasting gross sales given current factors
  • regression, neural networks, genetic algorithms
• Explanation/Description
  • e.g. characterizing customers by demographics and purchase history
  • inductive decision trees, association rule systems
Focus is on induction of a model from specific examples
[Figures: scatter plot of clusters in (x1, x2); curve f(x) against x; rule "if age > 35 and income < $35k then ..."]
35
Data Mining Methods
Automated Exploration and Discovery
• Distance-based numerical clustering
  • metric grouping of examples (KNN)
  • graphical visualization can be used
• Bayesian clustering
  • search for the number of classes which result in best fit of a probability distribution to the data
• Unsupervised Learning
[Figure: scatter plot of examples by Age and Income]
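As one illustration of distance-based clustering, here is a minimal k-means sketch. Note that k-means is a swapped-in choice: the slide does not say which algorithm is meant, and the data points, starting centroids and iteration count are hypothetical. In practice the attributes would usually be normalized first so that income does not dominate the distance.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster (keep it if empty).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical customers as (age, income in $k).
points = [(22, 25), (25, 30), (24, 27), (51, 80), (55, 85), (53, 78)]
centroids, clusters = kmeans(points, centroids=[(22, 25), (55, 85)])
```

No class labels are supplied anywhere, which is what makes this unsupervised learning: the two groups emerge from the distances alone.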
36
Data Mining Methods
Prediction and Classification
• Function approximation (curve fitting)
• Classification (concept learning, pattern recognition)
• Methods:
  • Statistical regression
  • Artificial neural networks
  • Genetic algorithms
  • Nearest neighbour algorithms
• Supervised Learning
[Figures: curve f(x) against x; classes A and B in (x1, x2); neural network with inputs I1-I4 and outputs O1, O2]
37
Data Mining Methods
Generalization
• The objective of learning is to achieve good generalization to new cases, otherwise just use a look-up table.
• Generalization can be defined as a mathematical interpolation or regression over a set of training points:
[Figure: curve f(x) fitted through training points along x]
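The contrast between a look-up table and a model that generalizes can be shown with a straight-line fit. This sketch assumes ordinary least squares on a single input; the training points are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Hypothetical training points.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [2.1, 3.9, 6.2, 7.8]

lookup = dict(zip(train_x, train_y))           # memorizes the training cases only
slope, intercept = fit_line(train_x, train_y)  # induces a model from them

def predict(x):
    """Estimate f(x) for any x, including cases never seen in training."""
    return slope * x + intercept

new_case = 2.5
in_table = new_case in lookup   # the look-up table has no answer for this case
estimate = predict(new_case)    # the fitted line interpolates one
```

The table can only repeat what it has seen, while the fitted line returns an estimate for the unseen case by interpolating between the training points.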
41
Data Mining Methods
Explanation and Description
• Learn a generalized hypothesis (model) from selected data
• Description/Interpretation of model provides new human knowledge
• Methods:
  • Inductive decision tree and rule systems
  • Association rule systems
  • Link Analysis
[Figure: decision tree from Root through test nodes A?, B?, C?, D? to a Leaf labelled Yes]
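Association rule systems typically score a candidate rule by its support and confidence. The sketch below computes both for a rule of the form "if age > 35 and income < $35k then ..." used earlier in the deck; the customer records and the consequent `buys_discount` are hypothetical, and a real system would also search over candidate rules instead of scoring a single one.

```python
# Hypothetical customer records as sets of attribute-value items.
customers = [
    {"age>35", "income<35k", "buys_discount"},
    {"age>35", "income<35k", "buys_discount"},
    {"age>35", "income<35k"},
    {"age>35", "buys_discount"},
    {"income<35k"},
]

def support(itemset, rows):
    """Fraction of rows that contain every item in itemset."""
    return sum(itemset <= row for row in rows) / len(rows)

def confidence(antecedent, consequent, rows):
    """Estimated P(consequent | antecedent) = support(both) / support(antecedent)."""
    return support(antecedent | consequent, rows) / support(antecedent, rows)

rule_if = {"age>35", "income<35k"}
rule_then = {"buys_discount"}
rule_support = support(rule_if | rule_then, customers)
rule_confidence = confidence(rule_if, rule_then, customers)
```

Because the rule is expressed directly in terms of the attributes, a person can read it and judge it, which is the point of the explanation and description methods.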
42
Modeling & Data Mining
DEMO
WEKA – A Data Mining
Environment
43
The KDD Process
[Figure: the KDD Process diagram, repeated]
44
Interpretation and Evaluation
Evaluation
• Statistical validation and significance testing
• Qualitative review by experts in the field
• Pilot surveys to evaluate model accuracy
Interpretation
• Inductive tree and rule models can be read directly
• Clustering results can be graphed and tabulated
• Code can be automatically generated by some systems (ANNs, IDTs, Regression models)
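One simple form of statistical validation is to score a model on held-out examples and attach a rough error bar. The sketch below assumes a hand-written rule as the model and a binomial standard error as the uncertainty estimate; the held-out cases are hypothetical, and a sample this small is used only to keep the arithmetic visible.

```python
import math

def accuracy(model, examples):
    """Fraction of (features, label) examples the model labels correctly."""
    correct = sum(model(x) == y for x, y in examples)
    return correct / len(examples)

def standard_error(acc, n):
    """Binomial standard error of an accuracy estimated from n test cases."""
    return math.sqrt(acc * (1 - acc) / n)

def rule_model(x):
    """Hypothetical rule model: predict a sale if age > 35 and income < $35k."""
    age, income_k = x
    return age > 35 and income_k < 35

# Hypothetical held-out cases: ((age, income in $k), actual outcome).
held_out = [
    ((40, 30), True), ((50, 20), True), ((30, 30), False), ((45, 60), False),
    ((38, 33), False), ((60, 25), True), ((25, 80), False), ((36, 34), True),
]
acc = accuracy(rule_model, held_out)
err = standard_error(acc, len(held_out))
```

With only eight cases the standard error is large, which is a reminder that an accuracy figure means little without the size of the test set behind it.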
45
Interpretation and Evaluation
Visualization tools can be very helpful:
• sensitivity analysis (I/O relationship)
• histograms of value distributions
• time-series plots and animation
• requires training and practice
[Figure: Response plotted against Temp and Velocity]
46
Benefits, Costs,
Status and Trends
47
Benefits of Data Analytics (KDD)
• Maximum utility from corporate data
  • discovery of new knowledge
  • generation of predictive models
• Important feedback to data warehousing effort
  • identification and justification of essential data
• Reduction of application dev’t backlog
  • model development vs. software development
• Effect on bottom line of organization
  • cost reduction, increased productivity, risk avoidance … competitive advantage
48
Requirements and Costs of KDD
• Hardware - computationally intensive
• Software - micro < $20k, integrated suites $100k+
• Data - internal collection, surveys, external sources
• Human resources
  • DB/DP/DC expertise to consolidate and preprocess data
  • Machine learning and stats competence
  • Application knowledge & project mgmt
• 70% of the effort is expended on the data consolidation and preprocessing activities
49
Current Status and Trends
• Standards and methodologies are maturing
• Many products:
  • Open source (WEKA, RapidMiner)
  • Micro DM packages (IBM Cognos)
  • Macro integrated suites (IBM SPSS Modeler, SAS Enterprise Miner)
• Software costs have stabilized
• Major players have been determined
• Internet - “the” sink and source of data
• Legal and ethical issues on the horizon
50
Current Status and Trends
• Methods used:
  • http://www.kdnuggets.com/polls/2013/analytics-big-data-mining-data-science-software.html
• Application areas:
  • http://www.kdnuggets.com/polls/2012/whereapplied-analytics-data-mining.html
• Other polls:
  • http://www.kdnuggets.com/polls/index.html
51
The Current Status and Trends
What has prevented the use of Data Mining?
• Products:
  • General in nature, not tailored for business
  • Missing standard interfaces to organizational data
  • Emphasis on sales and not training/consulting
• Customers:
  • Frightened by technical skill set required
  • Uncertain of mining results and ROI
  • Convinced warehouse must be completed first
  • Lacking knowledge of external data sources
52
Key Technologies for KDD
• Data warehousing and distributed databases
• Parallel computing
• AI and expert systems
• Machine learning and statistical inference
• Visualization (including Virtual Reality)
• Internet - future sink and source of data
  • adaptive filters, knowledge extractors
  • smart web services
53
Current Management Issues
• Ownership of data and knowledge
• Security of customer data
• Responsibility for accuracy of information
• Ethical practices - fair use of data
54
A List of Major Vendors
Lots of Players
Approaching market from hardware, database, statistical, machine learning, education, financial/marketing, and management consulting:
IBM, SAS, SPSS, SGI, Thinking Machines, Cognos, ZDM Scientific, Neuralware, Information Discovery, American Heuristics, Data Distilleries, SuperInduction
55
THE END