Transcript Data Mining
Knowledge discovery & data mining
Tools, methods, and experiences
Fosca Giannotti and
Dino Pedreschi
Pisa KDD Lab
CNUCE-CNR & Univ. Pisa
http://www-kdd.di.unipi.it/
A tutorial @ EDBT2000
Contributors and acknowledgements
The people @ Pisa KDD Lab: Francesco BONCHI, Giuseppe
MANCO, Mirco NANNI, Chiara RENSO, Salvatore RUGGIERI,
Franco TURINI and many students
The many KDD tutorialists and teachers which made their slides
available on the web (all of them listed in bibliography) ;-)
In particular:
Jiawei HAN, Simon Fraser University, whose forthcoming book
Data mining: concepts and techniques has influenced the whole
tutorial
Rajeev RASTOGI and Kyuseok SHIM, Lucent Bell Labs
Daniel A. KEIM, University of Halle
Daniel Silver, CogNova Technologies
The EDBT2000 board who accepted our tutorial proposal
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
2
Tutorial goals
Introduce you to major aspects of the Knowledge
Discovery Process, and theory and applications of
Data Mining technology
Provide a systematization to the many many concepts
around this area, according the following lines
the
the
the
the
process
methods applied to paradigmatic cases
support environment
research challenges
Important issues that will be not covered in this
tutorial:
methods: time series, exception detection, neural nets
systems: parallel implementations
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
3
Tutorial Outline
1. Introduction and basic concepts
1. Motivations, applications, the KDD process, the techniques
2. Deeper into DM technology
1. Decision Trees and Fraud Detection
2. Association Rules and Market Basket Analysis
3. Clustering and Customer Segmentation
3. Trends in technology
1. Knowledge Discovery Support Environment
2. Tools, Languages and Systems
4. Research challenges
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
4
Introduction - module outline
Motivations
Application Areas
KDD Decisional Context
KDD Process
Architecture of a KDD system
The KDD steps in short
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
5
Evolution of Database Technology:
from data management to data analysis
1960s:
Data collection, database creation, IMS and network
DBMS.
1970s:
Relational data model, relational DBMS implementation.
1980s:
RDBMS, advanced data models (extended-relational, OO,
deductive, etc.) and application-oriented DBMS (spatial,
scientific, engineering, etc.).
1990s:
Data mining and data warehousing, multimedia databases,
and Web technology.
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
6
Motivations
“Necessity is the Mother of Invention”
Data explosion problem:
Automated data collection tools, mature database
technology and internet lead to tremendous amounts of
data stored in databases, data warehouses and other
information repositories.
We are drowning in information, but starving for knowledge!
(John Naisbett)
Data warehousing and data mining :
On-line analytical processing
Extraction of interesting knowledge (rules, regularities,
patterns, constraints) from data in large databases.
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
7
A rapidly
field
A rapidly emerging
emerging field
Also referred to as:
Data dredging, Data harvesting, Data archeology
A multidisciplinary field:
Database
Statistics
Artificial intelligence
Machine
learning, Expert systems and Knowledge Acquisition
Visualization methods
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
8
Motivations for DM
Abundance of business and industry data
Competitive focus - Knowledge
Management
Inexpensive, powerful computing engines
Strong theoretical/mathematical
foundations
machine learning & logic
statistics
database management systems
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
9
What is DM useful for?
Increase knowledge
to base decision
upon.
Marketing
E.g., impact on
marketing
Database
Marketing
Data
Warehousing
EDBT2000 tutorial - Intro
KDD &
Data Mining
Konstanz, 27-28.3.2000
10
The Value Chain
Decision
• Promote product A in region Z.
Knowledge
• Mail ads to families of profile P
• Cross-sell service B to clients C
• A quantity Y of product A is used
in region Z
• Customers of class Y use x% of
C during period D
Information
• X lives in Z
Data
• S is Y years old
• X and S moved
• W has money in Z
• Customer data
• Store data
• Demographical Data
• Geographical data
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
11
Application Areas and Opportunities
Marketing: segmentation, customer targeting, ...
Finance: investment support, portfolio management
Banking & Insurance: credit and policy approval
Security: fraud detection
Science and medicine: hypothesis discovery,
prediction, classification, diagnosis
Manufacturing: process modeling, quality control,
resource allocation
Engineering: simulation and analysis, pattern
recognition, signal processing
Internet: smart search engines, web marketing
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
12
Classes of applications
Market analysis
target marketing, customer relation management, market
basket analysis, cross selling, market segmentation.
Risk analysis
Forecasting, customer retention, improved underwriting, quality
control, competitive analysis.
Fraud detection
Text (news group, email, documents) and Web analysis.
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
13
Market Analysis
Where are the data sources for analysis?
Credit card transactions, loyalty cards, discount coupons,
customer complaint calls, plus (public) lifestyle studies.
Target marketing
Find clusters of “model” customers who share the same
characteristics: interest, income level, spending habits, etc.
Determine customer purchasing patterns over time
Conversion of single to a joint bank account: marriage, etc.
Cross-market analysis
Associations/co-relations between product sales
Prediction based on the association information.
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
MarketMarket
Analysis
and Management
Analysis
(2)
Customer profiling
data mining can tell you what types of customers buy
what products (clustering or classification).
Identifying customer requirements
identifying the best products for different customers
use prediction to find what factors will attract new
customers
Provides summary information
various multidimensional summary reports;
statistical summary information (data central tendency
and variation)
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
Risk Analysis
Finance planning and asset evaluation:
cash flow analysis and prediction
contingent claim analysis to evaluate assets
cross-sectional and time series analysis (financial-ratio,
trend analysis, etc.)
Resource planning:
summarize and compare the resources and spending
Competition:
monitor competitors and market directions (CI: competitive
intelligence).
group customers into classes and class-based pricing
procedures
set pricing strategy in a highly competitive market
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
Fraud Detection
Applications:
widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
Approach:
use historical data to build models of fraudulent behavior
and use data mining to help identify similar instances.
Examples:
auto insurance: detect a group of people who stage accidents
to collect on insurance
money laundering: detect suspicious money transactions (US
Treasury's Financial Crimes Enforcement Network)
medical insurance: detect professional patients and ring of
doctors and ring of references
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
Fraud Detection (2)
More examples:
Detecting inappropriate medical treatment:
Australian Health Insurance Commission identifies that in
many cases blanket screening tests were requested (save
Australian $1m/yr).
Detecting telephone fraud:
Telephone call model: destination of the call, duration, time
of day or week. Analyze patterns that deviate from an
expected norm.
British Telecom identified discrete groups of callers with
frequent intra-group calls, especially mobile phones, and
broke a multimillion dollar fraud.
Retail: Analysts estimate that 38% of retail shrink is due to
dishonest employees.
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
Other applications
Sports
IBM Advanced Scout analyzed NBA game statistics (shots
blocked, assists, and fouls) to gain competitive advantage
for New York Knicks and Miami Heat.
Astronomy
JPL and the Palomar Observatory discovered 22 quasars
with the help of data mining
Internet Web Surf-Aid
IBM Surf-Aid applies data mining algorithms to Web
access logs for market-related pages to discover customer
preference and behavior pages, analyzing effectiveness of
Web marketing, improving Web site organization, etc.
Watch for the PRIVACY pitfall!
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
What is KDD? A process!
The selection and processing of data for:
the identification of novel, accurate,
and useful patterns, and
the modeling of real-world phenomena.
Data mining is a major component of the
KDD process - automated discovery of
patterns and the development of
predictive and explanatory models.
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
20
The KDD process
Interpretation
and Evaluation
Data Mining
Knowledge
Selection and
Preprocessing
Data
Consolidation
p(x)=0.02
Patterns &
Models
Warehouse
Prepared Data
Consolidated
Data
Data Sources
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
21
The KDD Process
Core Problems & Approaches
Problems:
identification
of relevant data
representation of data
search for valid pattern or model
Approaches:
top-down
deduction by expert
interactive visualization of data/models
*
bottom-up induction from data *
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
OLAP
Data
Mining
22
The steps of the KDD process
Learning the application domain:
relevant prior knowledge and goals of application
Data consolidation: Creating a target data set
Selection and Preprocessing
Data cleaning : (may take 60% of effort!)
Data reduction and projection:
find useful features, dimensionality/variable reduction, invariant
representation.
Choosing functions of data mining
summarization, classification, regression, association,
clustering.
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Interpretation and evaluation: analysis of results.
visualization, transformation, removing redundant patterns, …
Use of discovered knowledge
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
The virtuous cycle
9
The KDD Process
Interpretation
and Evaluation
Data Mining
Knowledge
Problem
Selection and
Preprocessing
Data
Consolidation
p(x)=0.02
Patterns &
Models
Warehouse
Knowledge
Prepared Data
Consolidated
Data
Data Sources
CogNova
Technologies
Identify
Problem or
Opportunity
Strategy
EDBT2000 tutorial - Intro
Act on
Knowledge
Measure effect
of Action
Konstanz, 27-28.3.2000
Results
24
Applications, operations, techniques
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
25
Roles in the KDD process
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
26
Data mining and business intelligence
Increasing potential
to support
business decisions
Making
Decisions
Data Presentation
Visualization Techniques
Data Mining
Information Discovery
End User
Business
Analyst
Data
Analyst
Data Exploration
Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts
OLAP, MDA
Data Sources
Paper, Files, Information Providers, Database Systems, OLTP
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
DBA
27
Architecture of a KDD system
Graphical User Interface
Data
Consolidation
Data Sources
Selection
and
Preprocessing
Warehouse
EDBT2000 tutorial - Intro
Data
Mining
Interpretation
and Evaluation
Knowledge
Konstanz, 27-28.3.2000
28
A business intelligence environment
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
29
The KDD process
Interpretation
and Evaluation
Data Mining
Knowledge
Selection and
Preprocessing
Data
Consolidation
Warehouse
p(x)=0.02
Patterns &
Models
Prepared Data
Consolidated
Data
Data Sources
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
30
Data consolidation and preparation
Garbage in
Garbage out
The quality of results relates directly to quality
of the data
50%-70% of KDD process effort is spent on
data consolidation and preparation
Major justification for a corporate data
warehouse
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
31
Data consolidation
From data sources to consolidated data
repository
RDBMS
Legacy
DBMS
Flat Files
External
EDBT2000 tutorial - Intro
Data
Consolidation
and Cleansing
Warehouse
Object/Relation DBMS
Multidimensional DBMS
Deductive Database
Flat files
Konstanz, 27-28.3.2000
32
Data consolidation
Determine preliminary list of attributes
Consolidate data into working database
Internal and External sources
Eliminate or estimate missing values
Remove outliers (obvious exceptions)
Determine prior probabilities of categories and
deal with volume bias
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
33
The KDD process
Interpretation
and Evaluation
Data Mining
Knowledge
Selection and
Preprocessing
p(x)=0.02
Data
Consolidation
Warehouse
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
34
Data selection and preprocessing
Generate a set of examples
choose sampling method
consider sample complexity
deal with volume bias issues
Reduce attribute dimensionality
remove redundant and/or correlating attributes
combine attributes (sum, multiply, difference)
Reduce attribute value ranges
group symbolic discrete values
quantize continuous numeric values
Transform data
de-correlate and normalize values
map time-series data to static representation
OLAP and visualization tools play key role
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
35
The KDD process
Interpretation
and Evaluation
Data Mining
Knowledge
Selection and
Preprocessing
p(x)=0.02
Data
Consolidation
Warehouse
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
36
Data mining tasks and methods
Automated Exploration/Discovery
e.g.. discovering new market segments
clustering analysis
x2
x1
Prediction/Classification
e.g.. forecasting gross sales given current factors
regression, neural networks, genetic algorithms,
decision trees
f(x)
Explanation/Description
x
e.g.. characterizing customers by demographics
and purchase history
if age > 35
decision trees, association rules
and income < $35k
then ...
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
37
Automated exploration and discovery
Clustering: partitioning a set of data into a
set of classes, called clusters, whose members
share some interesting common properties.
Distance-based numerical clustering
metric
grouping of examples (K-NN)
graphical visualization can be used
Bayesian clustering
search
for the number of classes which
result in best fit of a probability distribution
to the data
AutoClass (NASA) one of best examples
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
38
Prediction and classification
Learning a predictive model
Classification of a new case/sample
Many methods:
Artificial
neural networks
Inductive decision tree and rule systems
Genetic algorithms
Nearest neighbor clustering algorithms
Statistical (parametric, and non-parametric)
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
39
Generalization and regression
The objective of learning is to achieve good
generalization to new unseen cases.
Generalization can be defined as a mathematical
interpolation or regression over a set of training
points
Models can be validated with a previously
unseen test set or using cross-validation
methods
f(x)
x
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
40
Classification and prediction
Classify data based on the values of a target
attribute, e.g., classify countries based on
climate, or classify cars based on gas mileage.
Use obtained model to predict some unknown or
missing attribute values based on other
information.
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
41
Summarizing: inductive modeling = learning
Objective: Develop a general model or
hypothesis from specific examples
Function approximation (curve fitting)
f(x)
x
Classification (concept learning, pattern
recognition)
A
x2
B
x1
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
42
Explanation and description
Learn a generalized hypothesis (model)
from selected data
Description/Interpretation of model
provides new knowledge
Methods:
Inductive
decision tree and rule systems
Association rule systems
Link Analysis
…
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
43
Exception/deviation detection
Generate a model of normal activity
Deviation from model causes alert
Methods:
Artificial neural networks
Inductive decision tree and rule systems
Statistical methods
Visualization tools
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
44
Outlier and exception data analysis
Time-series analysis (trend and deviation):
Trend and deviation analysis: regression,
sequential pattern, similar sequences, trend and
deviation, e.g., stock analysis.
Similarity-based pattern-directed analysis
Full vs. partial periodicity analysis
Other pattern-directed or statistical analysis
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
45
The KDD process
Interpretation
and Evaluation
Data Mining
Knowledge
Selection and
Preprocessing
p(x)=0.02
Data Consolidation
and Warehousing
Warehouse
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
46
Are all the discovered pattern interesting?
A data mining system/query may generate thousands
of patterns, not all of them are interesting.
Interestingness measures:
easily understood by humans
valid on new or test data with some degree of certainty.
potentially useful
novel, or validates some hypothesis that a user seeks to
confirm
Objective vs. subjective interestingness measures
Objective: based on statistics and structures of patterns,
e.g., support, confidence, etc.
Subjective: based on user’s beliefs in the data, e.g.,
unexpectedness, novelty, etc.
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
Completeness vs. optimization
Find all the interesting patterns: Completeness.
Can a data mining system find all the interesting patterns?
Search for only interesting patterns: Optimization.
Can a data mining system find only the interesting patterns?
Approaches
First generate all the patterns and then filter out the
uninteresting ones.
Generate only the interesting patterns - mining query
optimization.
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
Interpretation and evaluation
Evaluation
Statistical validation and significance testing
Qualitative review by experts in the field
Pilot surveys to evaluate model accuracy
Interpretation
Inductive tree and rule models can be read
directly
Clustering results can be graphed and tabled
Code can be automatically generated by some
systems (IDTs, Regression models)
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
49
Interpretation and evaluation
Visualization tools can be very helpful
sensitivity
analysis (I/O relationship)
histograms of value distribution
time-series plots and animation
requires training and practice
Response
Temp
Velocity
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
50
Important dates of data mining
1989 IJCAI Workshop on KDD
Knowledge Discovery in Databases (G. Piatetsky-Shapiro
and W. Frawley, eds., 1991)
1991-1994 Workshops on KDD
Advances in Knowledge Discovery and Data Mining (U.
Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R.
Uthurusamy, eds., 1996)
1995-1998 AAAI Int. Conf. on KDD and DM
(KDD’95-98)
Journal of Data Mining and Knowledge Discovery (1997)
1998 ACM SIGKDD
1999 SIGKDD’99 Conf.
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
References - general
P. Adriaans and D. Zantinge. Data Mining. Addison-Wesley: Harlow, England, 1996.
M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from a database perspective. IEEE
Trans. Knowledge and Data Engineering, 8:866-883, 1996.
U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery
and Data Mining. AAAI/MIT Press, 1996.
J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000. To appear.
T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of
ACM, 39:58-64, 1996.
G. Piatetsky-Shapiro, U. Fayyad, and P. Smith. From data mining to knowledge discovery: An
overview. In U.M. Fayyad, et al. (eds.), Advances in Knowledge Discovery and Data Mining, 1-35.
AAAI/MIT Press, 1996.
G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press,
1991.
Michael Berry & Gordon Linoff. Data Mining Techniques for Marketing, Sales and Customer
Support. John Wiley & Sons, 1997.
Sholom M. Weiss and Nitin Indurkhya. Predictive Data Mining: A Practical Guide. Morgan
Kaufmann, 1997.
W.H. Inmon, J.D. Welch, Katherine L. Glassey. Managing the data warehouse. Wiley, 1997.
T. Mitchell. Machine Learning. McGraw-Hill, 1997.
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
52
Main Web resources
http://www.kdnuggets.com
KDD Newsletter and comprehensive website
http://www.acm.org/sigkdd
ACM SIGKDD
http://www.research.microsoft.com/datamine/
Journal of Data Mining and Knowledge Discovery
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
53
Tutorial Outline
Introduction and basic concepts
Motivations, applications, the KDD process, the techniques
Deeper into DM technology
Decision Trees and Fraud Detection
Association Rules and Market Basket Analysis
Clustering and Customer Segmentation
Trends in technology
Knowledge Discovery Support Environment
Tools, Languages and Systems
Research challenges
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
54