Data Mining

Knowledge discovery & data mining
Tools, methods, and experiences
Fosca Giannotti and
Dino Pedreschi
Pisa KDD Lab
CNUCE-CNR & Univ. Pisa
http://www-kdd.di.unipi.it/
A tutorial @ EDBT2000
Contributors and acknowledgements
• The people @ Pisa KDD Lab: Francesco BONCHI, Giuseppe MANCO, Mirco NANNI, Chiara RENSO, Salvatore RUGGIERI, Franco TURINI and many students
• The many KDD tutorialists and teachers who made their slides available on the web (all of them listed in the bibliography) ;-)
• In particular:
  • Jiawei HAN, Simon Fraser University, whose forthcoming book Data mining: concepts and techniques has influenced the whole tutorial
  • Rajeev RASTOGI and Kyuseok SHIM, Lucent Bell Labs
  • Daniel A. KEIM, University of Halle
  • Daniel Silver, CogNova Technologies
• The EDBT2000 board who accepted our tutorial proposal
EDBT2000 tutorial - Intro
Konstanz, 27-28.3.2000
Tutorial goals
• Introduce you to the major aspects of the Knowledge Discovery Process, and to the theory and applications of Data Mining technology
• Provide a systematization of the many concepts in this area, along the following lines:
  • the process
  • the methods applied to paradigmatic cases
  • the support environment
  • the research challenges
• Important issues that will not be covered in this tutorial:
  • methods: time series, exception detection, neural nets
  • systems: parallel implementations
Tutorial Outline
1. Introduction and basic concepts
   1. Motivations, applications, the KDD process, the techniques
2. Deeper into DM technology
   1. Decision Trees and Fraud Detection
   2. Association Rules and Market Basket Analysis
   3. Clustering and Customer Segmentation
3. Trends in technology
   1. Knowledge Discovery Support Environment
   2. Tools, Languages and Systems
4. Research challenges
Introduction - module outline
• Motivations
• Application Areas
• KDD Decisional Context
• KDD Process
• Architecture of a KDD system
• The KDD steps in short
Evolution of Database Technology:
from data management to data analysis
• 1960s: Data collection, database creation, IMS and network DBMS.
• 1970s: Relational data model, relational DBMS implementation.
• 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.).
• 1990s: Data mining and data warehousing, multimedia databases, and Web technology.
Motivations
“Necessity is the Mother of Invention”
• Data explosion problem: automated data collection tools, mature database technology and the Internet lead to tremendous amounts of data stored in databases, data warehouses and other information repositories.
• We are drowning in information, but starving for knowledge! (John Naisbitt)
• Data warehousing and data mining:
  • On-line analytical processing
  • Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases.
A rapidly emerging field
Also referred to as: data dredging, data harvesting, data archeology
A multidisciplinary field:
• Database
• Statistics
• Artificial intelligence: machine learning, expert systems and knowledge acquisition
• Visualization methods
Motivations for DM
• Abundance of business and industry data
• Competitive focus - Knowledge Management
• Inexpensive, powerful computing engines
• Strong theoretical/mathematical foundations
  • machine learning & logic
  • statistics
  • database management systems
What is DM useful for?
Increase knowledge to base decisions upon.
E.g., impact on marketing
[Figure labels: Marketing, Database Marketing, Data Warehousing, KDD & Data Mining]
The Value Chain
Decision
• Promote product A in region Z.
• Mail ads to families of profile P
• Cross-sell service B to clients C
Knowledge
• A quantity Y of product A is used in region Z
• Customers of class Y use x% of C during period D
Information
• X lives in Z
• S is Y years old
• X and S moved
• W has money in Z
Data
• Customer data
• Store data
• Demographical data
• Geographical data
Application Areas and Opportunities
• Marketing: segmentation, customer targeting, ...
• Finance: investment support, portfolio management
• Banking & Insurance: credit and policy approval
• Security: fraud detection
• Science and medicine: hypothesis discovery, prediction, classification, diagnosis
• Manufacturing: process modeling, quality control, resource allocation
• Engineering: simulation and analysis, pattern recognition, signal processing
• Internet: smart search engines, web marketing
Classes of applications
• Market analysis: target marketing, customer relationship management, market basket analysis, cross selling, market segmentation.
• Risk analysis: forecasting, customer retention, improved underwriting, quality control, competitive analysis.
• Fraud detection
• Text (news group, email, documents) and Web analysis.
Market Analysis
• Where are the data sources for analysis?
  Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies.
• Target marketing
  Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
• Determine customer purchasing patterns over time
  Conversion of a single to a joint bank account: marriage, etc.
• Cross-market analysis
  Associations/correlations between product sales
  Prediction based on the association information.
Market Analysis and Management (2)
• Customer profiling
  Data mining can tell you what types of customers buy what products (clustering or classification).
• Identifying customer requirements
  Identifying the best products for different customers
  Use prediction to find what factors will attract new customers
• Provides summary information
  Various multidimensional summary reports
  Statistical summary information (data central tendency and variation)
Risk Analysis
• Finance planning and asset evaluation:
  cash flow analysis and prediction
  contingent claim analysis to evaluate assets
  cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
• Resource planning:
  summarize and compare the resources and spending
• Competition:
  monitor competitors and market directions (CI: competitive intelligence)
  group customers into classes and apply class-based pricing procedures
  set pricing strategy in a highly competitive market
Fraud Detection
• Applications:
  widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
• Approach:
  use historical data to build models of fraudulent behavior and use data mining to help identify similar instances.
• Examples:
  auto insurance: detect a group of people who stage accidents to collect on insurance
  money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
  medical insurance: detect professional patients, rings of doctors and rings of references
Fraud Detection (2)
• More examples:
  Detecting inappropriate medical treatment:
    The Australian Health Insurance Commission identified that in many cases blanket screening tests were requested (saving Australian $1m/yr).
  Detecting telephone fraud:
    Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
    British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
  Retail: analysts estimate that 38% of retail shrink is due to dishonest employees.
Other applications
• Sports
  IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and Miami Heat.
• Astronomy
  JPL and the Palomar Observatory discovered 22 quasars with the help of data mining.
• Internet Web Surf-Aid
  IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyze the effectiveness of Web marketing, improve Web site organization, etc.
Watch for the PRIVACY pitfall!
What is KDD? A process!
The selection and processing of data for:
• the identification of novel, accurate, and useful patterns, and
• the modeling of real-world phenomena.
Data mining is a major component of the KDD process - automated discovery of patterns and the development of predictive and explanatory models.
The KDD process
[Figure: the KDD process. Data Sources → Data Consolidation → Consolidated Data (Warehouse) → Selection and Preprocessing → Prepared Data → Data Mining → Patterns & Models (e.g. p(x)=0.02) → Interpretation and Evaluation → Knowledge]
The KDD Process
Core Problems & Approaches
Problems:
• identification of relevant data
• representation of data
• search for valid pattern or model
Approaches:
• top-down deduction by expert
• interactive visualization of data/models *
• bottom-up induction from data *
[Figure labels: OLAP, Data Mining]
The steps of the KDD process
• Learning the application domain: relevant prior knowledge and goals of application
• Data consolidation: creating a target data set
• Selection and preprocessing
  • Data cleaning (may take 60% of effort!)
  • Data reduction and projection: find useful features, dimensionality/variable reduction, invariant representation.
• Choosing functions of data mining: summarization, classification, regression, association, clustering.
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Interpretation and evaluation: analysis of results (visualization, transformation, removing redundant patterns, …)
• Use of discovered knowledge
The virtuous cycle
[Figure: the KDD process (Data Sources → Data Consolidation → Selection and Preprocessing → Data Mining → Interpretation and Evaluation → Knowledge) embedded in a cycle with the labels Problem, Identify Problem or Opportunity, Knowledge, Act on Knowledge, Strategy, Measure effect of Action, Results. Source: CogNova Technologies]
Applications, operations, techniques
Roles in the KDD process
Data mining and business intelligence
[Figure: layered pyramid; the potential to support business decisions increases from bottom to top]
• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts: OLAP, MDA
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA
Architecture of a KDD system
[Figure: a Graphical User Interface on top of the components Data Sources → Data Consolidation → Warehouse → Selection and Preprocessing → Data Mining → Interpretation and Evaluation → Knowledge]
A business intelligence environment
The KDD process
[Figure: KDD process diagram with the stages Data Sources, Data Consolidation, Warehouse, Selection and Preprocessing, Data Mining, Interpretation and Evaluation, Knowledge]
Data consolidation and preparation
Garbage in, garbage out
• The quality of results relates directly to the quality of the data
• 50%-70% of KDD process effort is spent on data consolidation and preparation
• Major justification for a corporate data warehouse
Data consolidation
From data sources to consolidated data repository
[Figure: sources (RDBMS, Legacy DBMS, Flat Files, External) → Data Consolidation and Cleansing → Warehouse (Object/Relation DBMS, Multidimensional DBMS, Deductive Database, Flat files)]
Data consolidation
• Determine preliminary list of attributes
• Consolidate data into working database
  • Internal and External sources
• Eliminate or estimate missing values
• Remove outliers (obvious exceptions)
• Determine prior probabilities of categories and deal with volume bias
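A minimal sketch of three of the consolidation steps (estimating missing values, removing obvious exceptions, computing class priors). The records, field names and the plausible age range are hypothetical, and median imputation is just one possible estimation strategy, not one prescribed by the tutorial.

```python
from collections import Counter
from statistics import median

# Hypothetical working database: one numeric attribute and a class label.
records = [
    {"age": 34, "label": "good"},
    {"age": None, "label": "good"},   # missing value
    {"age": 41, "label": "bad"},
    {"age": 29, "label": "good"},
    {"age": 420, "label": "good"},    # obvious data-entry exception
    {"age": 38, "label": "bad"},
]

def impute_missing(rows, field):
    """Estimate missing values with the median of the observed ones."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = median(observed)
    return [dict(r, **{field: fill if r[field] is None else r[field]}) for r in rows]

def drop_outliers(rows, field, low, high):
    """Remove obvious exceptions using an assumed plausible range."""
    return [r for r in rows if low <= r[field] <= high]

def class_priors(rows, field="label"):
    """Relative frequency of each category in the cleaned data."""
    counts = Counter(r[field] for r in rows)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}

cleaned = drop_outliers(impute_missing(records, "age"), "age", 0, 120)
priors = class_priors(cleaned)
print(cleaned)
print(priors)
```

Knowing the priors is the starting point for dealing with volume bias: a heavily skewed class distribution usually calls for resampling or reweighting before mining.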
The KDD process
[Figure: KDD process diagram with the stages Data Consolidation, Warehouse, Selection and Preprocessing, Data Mining, Interpretation and Evaluation, Knowledge]
Data selection and preprocessing
• Generate a set of examples
  • choose sampling method
  • consider sample complexity
  • deal with volume bias issues
• Reduce attribute dimensionality
  • remove redundant and/or correlating attributes
  • combine attributes (sum, multiply, difference)
• Reduce attribute value ranges
  • group symbolic discrete values
  • quantize continuous numeric values
• Transform data
  • de-correlate and normalize values
  • map time-series data to static representation
• OLAP and visualization tools play key role
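The value-range reductions and the normalization step above can be sketched as follows. All function names, cut points and groupings are illustrative assumptions; real projects would derive them from the data or from domain experts.

```python
def min_max_normalize(values):
    """Rescale numeric values to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def quantize(value, cut_points, labels):
    """Map a continuous value to a discrete bin.

    cut_points are ascending upper bounds; labels has one more entry
    than cut_points (the last label catches everything above).
    """
    for bound, label in zip(cut_points, labels):
        if value < bound:
            return label
    return labels[-1]

def group_symbolic(value, groups, default="other"):
    """Collapse many symbolic values into a few coarser groups."""
    return groups.get(value, default)

incomes = [18000, 35000, 52000, 120000]
normalized = min_max_normalize(incomes)
income_bins = [quantize(v, [25000, 60000], ["low", "medium", "high"]) for v in incomes]

# Hypothetical grouping of city names into regions.
region_of = {"Pisa": "Tuscany", "Firenze": "Tuscany", "Milano": "Lombardy"}
regions = [group_symbolic(c, region_of) for c in ["Pisa", "Milano", "Konstanz"]]

print(normalized, income_bins, regions)
```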
The KDD process
[Figure: KDD process diagram with the stages Data Consolidation, Warehouse, Selection and Preprocessing, Data Mining, Interpretation and Evaluation, Knowledge]
Data mining tasks and methods
• Automated Exploration/Discovery
  e.g. discovering new market segments
  clustering analysis
  [Figure: clusters in the (x1, x2) plane]
• Prediction/Classification
  e.g. forecasting gross sales given current factors
  regression, neural networks, genetic algorithms, decision trees
  [Figure: curve f(x) against x]
• Explanation/Description
  e.g. characterizing customers by demographics and purchase history
  decision trees, association rules
  [Figure: rule "if age > 35 and income < $35k then ..."]
Automated exploration and discovery
Clustering: partitioning a set of data into a set of classes, called clusters, whose members share some interesting common properties.
Distance-based numerical clustering
• metric grouping of examples (K-NN)
• graphical visualization can be used
Bayesian clustering
• search for the number of classes that results in the best fit of a probability distribution to the data
• AutoClass (NASA) is one of the best examples
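To make distance-based numerical clustering concrete, here is a small sketch using k-means. Note that k-means is not named on the slide; it is swapped in here as a familiar distance-based method. The data points and the naive initialization (first k points) are assumptions made for illustration.

```python
from math import dist

def k_means(points, k, iterations=20):
    """Group points around k centroids by Euclidean distance."""
    centroids = list(points[:k])  # naive deterministic initialization (assumption)
    assignment = []
    for _ in range(iterations):
        # Assign every point to its closest centroid.
        assignment = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Move every centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(tuple(sum(xs) / len(members) for xs in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, assignment

points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.5)]
centroids, labels = k_means(points, k=2)
print(centroids, labels)
```

Unlike Bayesian clustering, this approach needs the number of clusters k to be chosen in advance.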
Prediction and classification
Learning a predictive model
Classification of a new case/sample
Many methods:
• Artificial neural networks
• Inductive decision tree and rule systems
• Genetic algorithms
• Nearest neighbor clustering algorithms
• Statistical (parametric, and non-parametric)
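As one illustration of the nearest neighbor family listed above, the following sketch classifies a new case by majority vote among its k closest training examples. The training set (age, income in thousands, and a yes/no label) is invented for the example.

```python
from collections import Counter
from math import dist

def knn_classify(training, query, k=3):
    """Return the majority label among the k training examples closest to query."""
    neighbours = sorted(training, key=lambda example: dist(example[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical training cases: ((age, income in k$), responded to offer?)
training = [
    ((25, 20), "no"), ((30, 28), "no"), ((45, 60), "yes"),
    ((50, 75), "yes"), ((38, 52), "yes"), ((23, 18), "no"),
]

prediction = knn_classify(training, (40, 55))
print(prediction)
```

In practice the attributes would be normalized first, since otherwise the attribute with the largest numeric range dominates the distance.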
Generalization and regression
• The objective of learning is to achieve good generalization to new unseen cases.
• Generalization can be defined as a mathematical interpolation or regression over a set of training points
• Models can be validated with a previously unseen test set or using cross-validation methods
[Figure: curve f(x) fitted over x]
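The idea of validating a model on cases it has not seen can be sketched with a least-squares line and a simple k-fold cross-validation. The data below are synthetic (roughly y = 2x + 1 with small noise) and the interleaved fold assignment is an assumption chosen for brevity.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_squared_error(model, xs, ys):
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cross_validate(xs, ys, folds):
    """Average test error over `folds` train/test splits."""
    errors = []
    for fold in range(folds):
        held_out = set(range(fold, len(xs), folds))  # every folds-th point is tested
        pairs = list(enumerate(zip(xs, ys)))
        train = [xy for i, xy in pairs if i not in held_out]
        test = [xy for i, xy in pairs if i in held_out]
        model = fit_line(*zip(*train))
        errors.append(mean_squared_error(model, *zip(*test)))
    return sum(errors) / folds

xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9, 13.2, 14.8]

slope, intercept = fit_line(xs, ys)
cv_error = cross_validate(xs, ys, folds=4)
print(slope, intercept, cv_error)
```

A low cross-validated error relative to the training error suggests that the model generalizes rather than memorizes the training points.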
Classification and prediction
Classify data based on the values of a target attribute, e.g., classify countries based on climate, or classify cars based on gas mileage.
Use the obtained model to predict some unknown or missing attribute values based on other information.
Summarizing: inductive modeling = learning
Objective: Develop a general model or hypothesis from specific examples
• Function approximation (curve fitting)
  [Figure: curve f(x) against x]
• Classification (concept learning, pattern recognition)
  [Figure: classes A and B in the (x1, x2) plane]
Explanation and description
Learn a generalized hypothesis (model) from selected data
Description/Interpretation of model provides new knowledge
Methods:
• Inductive decision tree and rule systems
• Association rule systems
• Link Analysis
• …
Exception/deviation detection
• Generate a model of normal activity
• Deviation from model causes alert
• Methods:
  • Artificial neural networks
  • Inductive decision tree and rule systems
  • Statistical methods
  • Visualization tools
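A very small statistical instance of this scheme: the model of normal activity is just the mean and standard deviation of past observations, and a new value raises an alert when it lies more than a chosen number of standard deviations away. The call durations and the threshold of 3 are assumptions for illustration.

```python
from statistics import mean, stdev

def fit_normal_model(history):
    """Model of normal activity: mean and sample standard deviation."""
    return mean(history), stdev(history)

def is_deviation(value, model, threshold=3.0):
    """Alert when the value is more than `threshold` deviations from the mean."""
    mu, sigma = model
    return abs(value - mu) > threshold * sigma

# Hypothetical history of call durations (minutes) for one customer.
call_minutes = [4, 6, 5, 7, 5, 6, 4, 5]
model = fit_normal_model(call_minutes)

new_calls = [5, 6, 48]
alerts = [v for v in new_calls if is_deviation(v, model)]
print(alerts)
```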
Outlier and exception data analysis
• Time-series analysis (trend and deviation):
  • Trend and deviation analysis: regression, sequential patterns, similar sequences, e.g., stock analysis.
  • Similarity-based pattern-directed analysis
  • Full vs. partial periodicity analysis
• Other pattern-directed or statistical analysis
The KDD process
[Figure: KDD process diagram with the stages Data Consolidation and Warehousing, Warehouse, Selection and Preprocessing, Data Mining, Interpretation and Evaluation, Knowledge]
Are all the discovered patterns interesting?
• A data mining system/query may generate thousands of patterns; not all of them are interesting.
• Interestingness measures: a pattern is interesting if it is
  • easily understood by humans
  • valid on new or test data with some degree of certainty
  • potentially useful
  • novel, or validates some hypothesis that a user seeks to confirm
• Objective vs. subjective interestingness measures
  • Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  • Subjective: based on user’s beliefs in the data, e.g., unexpectedness, novelty, etc.
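The two objective measures named above can be computed directly from transaction data. In this sketch, support is the fraction of transactions containing an itemset and confidence of a rule A → B is support(A ∪ B) / support(A); the toy baskets are invented.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent.

    Assumes the antecedent occurs at least once (otherwise division by zero).
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(transactions, {"bread", "milk"})
rule_confidence = confidence(transactions, {"bread"}, {"milk"})
print(rule_support, rule_confidence)
```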
Completeness vs. optimization
• Find all the interesting patterns: Completeness.
  Can a data mining system find all the interesting patterns?
• Search for only interesting patterns: Optimization.
  Can a data mining system find only the interesting patterns?
• Approaches
  • First generate all the patterns and then filter out the uninteresting ones.
  • Generate only the interesting patterns - mining query optimization.
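The two approaches can be contrasted on frequent itemsets, taking "interesting" to mean "occurs in at least min_count transactions". This is only one possible reading of the slide: the first function enumerates every itemset and filters afterwards, the second grows candidates level by level from itemsets already known to be frequent, so most uninteresting patterns are never generated. Function names and data are hypothetical.

```python
from itertools import combinations

def count(transactions, itemset):
    return sum(itemset <= t for t in transactions)

def generate_then_filter(transactions, min_count):
    """Enumerate all itemsets, then discard the infrequent ones."""
    items = sorted(set().union(*transactions))
    result = set()
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            if count(transactions, frozenset(candidate)) >= min_count:
                result.add(frozenset(candidate))
    return result

def level_wise(transactions, min_count):
    """Build candidates only from frequent itemsets of the previous level."""
    items = set().union(*transactions)
    level = {frozenset([i]) for i in items
             if count(transactions, frozenset([i])) >= min_count}
    result = set(level)
    while level:
        # Join two frequent k-itemsets that differ in exactly one item.
        candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        level = {c for c in candidates if count(transactions, c) >= min_count}
        result |= level
    return result

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

all_then_filtered = generate_then_filter(transactions, min_count=2)
pruned_search = level_wise(transactions, min_count=2)
print(sorted(sorted(s) for s in pruned_search))
```

Both functions return the same answer here; the difference is how many candidates are examined, which grows exponentially with the number of items in the first approach.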
Interpretation and evaluation
Evaluation
• Statistical validation and significance testing
• Qualitative review by experts in the field
• Pilot surveys to evaluate model accuracy
Interpretation
• Inductive tree and rule models can be read directly
• Clustering results can be graphed and tabled
• Code can be automatically generated by some systems (IDTs, Regression models)
Interpretation and evaluation
Visualization tools can be very helpful
• sensitivity analysis (I/O relationship)
• histograms of value distribution
• time-series plots and animation
• requires training and practice
[Figure: plot with axes Response, Temp, Velocity]
Important dates of data mining
• 1989 IJCAI Workshop on KDD
  • Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, eds., 1991)
• 1991-1994 Workshops on KDD
  • Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, eds., 1996)
• 1995-1998 AAAI Int. Conf. on KDD and DM (KDD’95-98)
  • Journal of Data Mining and Knowledge Discovery (1997)
• 1998 ACM SIGKDD
• 1999 SIGKDD’99 Conf.
References - general
• P. Adriaans and D. Zantinge. Data Mining. Addison-Wesley: Harlow, England, 1996.
• M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from a database perspective. IEEE Trans. Knowledge and Data Engineering, 8:866-883, 1996.
• U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
• J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000. To appear.
• T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of ACM, 39:58-64, 1996.
• G. Piatetsky-Shapiro, U. Fayyad, and P. Smyth. From data mining to knowledge discovery: An overview. In U. M. Fayyad, et al. (eds.), Advances in Knowledge Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996.
• G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991.
• M. Berry and G. Linoff. Data Mining Techniques for Marketing, Sales and Customer Support. John Wiley & Sons, 1997.
• S. M. Weiss and N. Indurkhya. Predictive Data Mining: A Practical Guide. Morgan Kaufmann, 1997.
• W. H. Inmon, J. D. Welch, and K. L. Glassey. Managing the Data Warehouse. Wiley, 1997.
• T. Mitchell. Machine Learning. McGraw-Hill, 1997.
Main Web resources
• http://www.kdnuggets.com
  • KDD Newsletter and comprehensive website
• http://www.acm.org/sigkdd
  • ACM SIGKDD
• http://www.research.microsoft.com/datamine/
  • Journal of Data Mining and Knowledge Discovery
Tutorial Outline
• Introduction and basic concepts
  • Motivations, applications, the KDD process, the techniques
• Deeper into DM technology
  • Decision Trees and Fraud Detection
  • Association Rules and Market Basket Analysis
  • Clustering and Customer Segmentation
• Trends in technology
  • Knowledge Discovery Support Environment
  • Tools, Languages and Systems
• Research challenges
