Data Mining and Knowledge Discovery in Business Databases


A Brief Introduction to CRISP-DM
The Hard Facts About Data
• Enormous amounts of data are being stored in databases
• Businesses are increasingly becoming data-rich, yet,
paradoxically, they remain knowledge-poor
“We are drowning in information, but starving for knowledge”
- John Naisbitt
• Unless it is used to improve business practices, data is a
liability, not an asset
• Standard data analysis techniques are useful but
insufficient and may miss valuable insight
Real Examples
• Consider the enormous amounts of data generated:
 Transactional data by credit card companies
 Searches on Google, Yahoo, and MSN
 Clickstream (web) or other sensor data
 Europe's Very Long Baseline Interferometry (VLBI) network has 16 telescopes, each of which produces 1 gigabit/second of astronomical data over a 25-day observation session
• Storage and analysis are a big problem
 Walmart was reported to have a 24-terabyte database (likely even larger now)
 AT&T handles billions of calls per day
• The data cannot all be stored; analysis must be done on the fly
 Social media data
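Where volumes preclude storage, the analysis itself has to be incremental. A minimal sketch (illustrative only, not from the slides) of on-the-fly aggregation; only running state is kept, never the stream:

```python
class RunningStats:
    """Keep the count and mean of a data stream without storing the stream."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x: float) -> None:
        # Incremental mean update: constant memory regardless of stream size.
        self.n += 1
        self.mean += (x - self.mean) / self.n

stats = RunningStats()
for value in [102.0, 35.5, 230.1]:  # stand-in for a live call/clickstream feed
    stats.update(value)
print(stats.n, round(stats.mean, 2))  # 3 122.53
```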
What Is Data Mining?
Business Definition
• Deployment of business processes, supported by
adequate analytical techniques, to:
 Take further advantage of data
 Discover RELEVANT knowledge
 ACT on the results
KDD is the non-trivial process of
identifying valid, novel, potentially
useful, and ultimately understandable
patterns in data.
Application Domains (I)
• Direct marketing and retail
 Behavior analysis, Offer targeting, Market basket
analysis, Up-selling, etc.
• Banks and financial institutions
 Credit risk assessment, Fraud detection, Portfolio
management, Forecasting, etc.
• Telecommunications
 Churn prediction, Product/service development,
Campaign management, Fraud detection, etc.
Application Domains (II)
• Healthcare
 Public health monitoring (infectious outbreaks, etc),
Outcomes measurement (performance, cost, success
rate, etc), Diagnostic help, etc.
• Pharmaceutical industry / Bio-informatics
 Biological activity prediction, Coding sequence
discovery, Animal tests reduction, etc.
• Insurance
 Cross-selling, Risk analysis, Premium setting, Claims
analysis, Fraud detection, etc.
Application Domains (III)
• Transportation
 Network management, Booking optimization,
Customer service, etc.
• Manufacturing
 Load forecasting, Production management, Equipment
monitoring, Quality management, etc.
• Etc.
Multidisciplinary
(Diagram: Data Mining and Knowledge Discovery at the intersection of Machine Learning, Statistics, Databases, Visualization, and Business/Domain Knowledge)
Data Mining Tasks
• Summarization
• Classification / Prediction
 Classification, Concept learning, Regression
• Clustering
• Dependency modeling
• Anomaly detection
• Link analysis
Summarization
• To find a compact description for a subset
of the data.
 Producing the average downtime of all plant
equipment in a given month, or computing the total
income generated by each sales representative per
region per year
• Techniques:
 Statistics, Information theory, OLAP, etc.
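In code, summarization often reduces to grouped aggregation. A sketch with pandas; the table and column names are made up for illustration:

```python
import pandas as pd

# Toy sales table; the columns stand in for real warehouse fields.
sales = pd.DataFrame({
    "rep":    ["Ann", "Ann", "Bob", "Bob"],
    "region": ["East", "West", "East", "East"],
    "year":   [2004, 2004, 2004, 2005],
    "income": [120.0, 80.0, 95.0, 110.0],
})

# Total income per representative, per region, per year.
summary = sales.groupby(["rep", "region", "year"])["income"].sum()
print(summary)
```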
Prediction
• To learn a function that associates a data item with
the value of a response variable. If the response
variable is discrete, we talk of classification
learning; if the response variable is continuous, we
talk of regression learning.
 Assessing creditworthiness in a loan underwriting business,
assessing the probability of response to a direct marketing
campaign
• Techniques:
 Decision trees, Neural networks, Naïve Bayes, Support
vector machines, Logistic regression, Nearest-neighbors, etc.
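A sketch of classification learning with one of the listed techniques, a decision tree (scikit-learn; the bundled dataset merely stands in for, say, loan-applicant records):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Discrete response variable -> classification learning.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```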
Clustering
• To identify a set of (meaningful) categories or
clusters to describe the data. Clustering relies on
some notion of similarity among data items and
strives to maximize intra-cluster similarity whilst
minimizing inter-cluster similarity.
 Segmenting a business’ customer base, building a taxonomy
of animals in a zoological application
• Techniques:
 K-Means, Hierarchical clustering, Kohonen SOM, etc.
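A K-Means sketch on toy customer features (values illustrative; in practice the features would be standardized first so that no single attribute dominates the similarity measure):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy customer features: [annual spend, visits per month].
customers = np.array([[1200, 2], [1100, 3], [150, 12], [180, 10], [60, 1]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # cluster id assigned to each customer
print(kmeans.cluster_centers_)  # prototype of each segment
```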
Dependency Modeling
• To find a model that describes significant
dependencies, associations or affinities
among variables.
 Analyzing market baskets in consumer goods
retail, uncovering cause-effect relationships in
medical treatments
• Techniques:
 Association rules, ILP, Graphical modeling, etc.
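At their core, association rules reduce to counting co-occurrences. A minimal sketch computing support and confidence over toy market baskets:

```python
from collections import Counter
from itertools import combinations

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]

# Count item and item-pair occurrences across baskets.
pair_counts, item_counts = Counter(), Counter()
for b in baskets:
    item_counts.update(b)
    pair_counts.update(combinations(sorted(b), 2))

n = len(baskets)
for (a, c), cnt in pair_counts.items():
    support = cnt / n                  # fraction of baskets with both items
    confidence = cnt / item_counts[a]  # confidence of the rule a -> c
    print(f"{a} -> {c}: support={support:.2f}, confidence={confidence:.2f}")
```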
Anomaly Detection
• To discover the most significant changes in
the data from previously measured or
normative values.
 Detecting fraudulent credit card usage, detecting
anomalous turbine behavior in nuclear plants
• Techniques:
 Novelty detectors, Probability density models, etc.
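A minimal probability-density-style sketch: flag readings that fall too many standard deviations from historical norms (the threshold and data are illustrative assumptions):

```python
import numpy as np

# Historical "normal" readings (e.g., turbine temperatures).
normal = np.array([70.1, 69.8, 70.4, 70.0, 69.9, 70.2])
mu, sigma = normal.mean(), normal.std()

def is_anomalous(x: float, threshold: float = 3.0) -> bool:
    """Flag readings more than `threshold` standard deviations from the mean."""
    return abs(x - mu) / sigma > threshold

print(is_anomalous(70.3))  # False: within normal variation
print(is_anomalous(75.0))  # True: far outside the normative range
```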
Data Mining Process
• CRISP-DM: Cross-Industry Standard Process for
Data Mining
• Consortium effort involving:
 NCR Systems Engineering Copenhagen
 DaimlerChrysler AG
 SPSS Inc.
 OHRA Verzekeringen en Bank Groep B.V.
• History:
 Version 1.0 released in 1999
 See www.crisp-dm.org for further details
Visual Overview
(CRISP-DM life-cycle diagram: the six phases, from Business Understanding through Deployment, arranged in a cycle around the data, with feedback loops between phases)
Summary: Phases & Tasks
• Business Understanding
 Determine Business Objectives: Background; Business Objectives; Business Success Criteria
 Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
 Determine Data Mining Goals: Data Mining Goals; Data Mining Success Criteria
 Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques
• Data Understanding
 Collect Initial Data: Initial Data Collection Report
 Describe Data: Data Description Report
 Explore Data: Data Exploration Report
 Verify Data Quality: Data Quality Report
• Data Preparation (outputs: Data Set; Data Set Description)
 Select Data: Rationale for Inclusion / Exclusion
 Clean Data: Data Cleaning Report
 Construct Data: Derived Attributes; Generated Records
 Integrate Data: Merged Data
 Format Data: Reformatted Data
• Modeling
 Select Modeling Technique: Modeling Technique; Modeling Assumptions
 Generate Test Design: Test Design
 Build Model: Parameter Settings; Models; Model Description
 Assess Model: Model Assessment; Revised Parameter Settings
• Evaluation
 Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
 Review Process: Review of Process
 Determine Next Steps: List of Possible Actions; Decision
• Deployment
 Plan Deployment: Deployment Plan
 Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
 Produce Final Report: Final Report; Final Presentation
 Review Project: Experience Documentation
CRISP-DM Phases
• Business Understanding
 Initial phase
 Focuses on:
• Understanding the project objectives and requirements from a business
perspective
• Converting this knowledge into a data mining problem definition, and a
preliminary plan designed to achieve the objectives
• Data Understanding
 Starts with an initial data collection
 Proceeds with activities aimed at:
• Getting familiar with the data
• Identifying data quality problems
• Discovering first insights into the data
• Detecting interesting subsets to form hypotheses for hidden information
CRISP-DM Phases
• Data Preparation
 Covers all activities to construct the final dataset (data that will be fed
into the modeling tool(s)) from the initial raw data
 Data preparation tasks are likely to be performed multiple times, and
not in any prescribed order
 Tasks include table, record, and attribute selection, as well as
transformation and cleaning of data for modeling tools
• Modeling
 Various modeling techniques are selected and applied, and their
parameters are calibrated to optimal values
 Typically, there are several techniques for the same data mining
problem type
 Some techniques have specific requirements on the form of data,
therefore, stepping back to the data preparation phase is often needed
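To make the "calibrate parameters against a test design" loop concrete, a scikit-learn sketch (one possible realization, not prescribed by CRISP-DM itself):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Calibrate parameters by cross-validation on the training set only.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [2, 3, 5, None]}, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)                                    # revised parameter settings
print(f"test accuracy: {search.score(X_test, y_test):.2f}")   # model assessment
```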
CRISP-DM Phases
• Evaluation
 At this stage, a model (or models) that appears to have
high quality, from a data analysis perspective, has been
built
 Before proceeding to final deployment of the model, it is
important to more thoroughly evaluate the model, and
review the steps executed to construct the model, to be
certain it properly achieves the business objectives
 A key objective is to determine if there is some important
business issue that has not been sufficiently considered
 At the end of this phase, a decision on the use of the data
mining results should be reached
CRISP-DM Phases
• Deployment
 Creation of the model is generally not the end of the project
 Even if the purpose of the model is to increase knowledge of the data,
the knowledge gained will need to be organized and presented in a
way that the customer can use it
 Depending on the requirements, the deployment phase can be as
simple as generating a report or as complex as implementing a
repeatable data mining process
 In many cases it will be the customer, not the data analyst, who will
carry out the deployment steps
 However, even if the analyst will not carry out the deployment effort it
is important for the customer to understand up front what actions will
need to be carried out in order to actually make use of the created
models
The Missing Link
Closing the Loop
(Diagram: changes in the data and in the environment feed a Monitoring activity that closes the loop back into the process)
• How do I know my model remains valid and applicable?
• When should I update my model(s)?
• How do I update my model(s)?
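One simple way to operationalize these questions is to track live performance against the accuracy measured at deployment. A sketch; the baseline, tolerance, and data are illustrative assumptions:

```python
import numpy as np

def accuracy(y_true, y_pred) -> float:
    return float(np.mean(np.array(y_true) == np.array(y_pred)))

baseline = 0.91  # accuracy measured when the model was deployed

def needs_refresh(recent_true, recent_pred, tolerance=0.05) -> bool:
    """Signal retraining when live accuracy drops well below the baseline."""
    return accuracy(recent_true, recent_pred) < baseline - tolerance

# Hypothetical latest batch of scored cases with known outcomes.
print(needs_refresh([1, 0, 1, 1, 0, 1, 0, 0], [1, 0, 0, 1, 1, 1, 0, 0]))
```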
Data Mining Myths (I)
• Data Mining produces surprising results that will utterly
transform your business
 Reality:
• Early results = scientific confirmation of human intuition.
• Beyond = steady improvement to an already successful organisation.
• Occasionally = discovery of one of those rare « breakthrough » facts.
• Data Mining techniques are so sophisticated that they can
substitute for domain knowledge or for experience in
analysis and model building
 Reality:
• Data Mining = joint venture.
• Close cooperation between experts in modeling and using the
associated techniques, and people who understand the business.
Data Mining Myths (II)
• Data Mining is useful only in certain areas, such as
marketing, sales, and fraud detection
 Reality:
• Data mining is useful wherever data can be collected.
• All that is really needed is data and a willingness to « give it a try. »
There is little to lose…
• Only massive databases are worth mining
 Reality:
• A moderately-sized or small data set can also yield valuable
information.
• It is not only the quantity, but also the quality of the data that matters
(characterising mutagenic compounds)
Data Mining Myths (III)
• The methods used in Data Mining are fundamentally
different from the older quantitative model-building
techniques
 Reality:
• All methods now used in data mining are natural extensions and
generalisations of analytical methods known for decades.
• What is new in data mining is that we are now applying these
techniques to more general business problems.
• Data Mining is an extremely complex process
 Reality:
• The algorithms of data mining may be complex, but new tools and
well-defined methodologies have made those algorithms easier to
apply.
• Much of the difficulty in applying data mining comes from the same
data organisation issues that arise when using any modeling
techniques.
Food for Thought
• “Data mining can't be ignored -- the data is there,
the methods are numerous, and the advantages that
knowledge discovery brings to a business are
tremendous.”
• “People who can't see the value in data mining as
a concept either don't have the data or don't have
data with integrity.”
• “Data mining is quickly becoming a necessity, and
those who do not do it will soon be left in the dust.
Data mining is one of the few software activities
with measurable return on investment associated
with it.”
Data Mining Deliverables
• Provides additional insight about the data
and the business
• Provides scientific confirmation of
empirical/intuitive business observations
• Discovers new, subtle pieces of business
knowledge
In that order!
Key Success Factors
• Have a clearly articulated business problem that needs to
be solved and for which Data Mining is the adequate
technology
• Ensure that the problem being pursued is supported by the
right type of data of sufficient quality and in sufficient
quantity
• Recognise that Data Mining is a process with many
components and dependencies
• Plan to learn from the Data Mining process whatever the
outcome
Conclusion
• Data Mining transforms data into actions
• Data Mining is hard work
 It is a process, not a single activity
 Most companies are clueless and DM is an
afterthought
 Plan to learn through the process
 Think big, start small
• Data Mining is FUN!
More on Data Mining
• KDnuggets
 News, software, jobs, courses, etc.
 www.KDnuggets.com
• ACM SIGKDD
 Data mining association
 www.acm.org/sigkdd
SAMPLE APPLICATIONS
Retail
The Situation
• Potential applications:
 Associations of products that sell together
 Segmentation of customers
• Short audit:
 Nice data warehouse (DWH), only 2 years old, not fully
populated
 Limited data on purchases and subscriptions
Summarization / Aggregation
• Revenue distribution
 80% generated by 41.5% of subscribers
 60% generated by 18.3% of subscribers
 42.9% generated by top 5 products
• Simple customer classes
 Over 65 years old most profitable
 Under 16 years old least profitable
• Birthdate filled-in for only about 10% of
subscribers!
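Figures such as "80% of revenue from 41.5% of subscribers" fall out of a simple cumulative-share computation. A sketch on synthetic revenues (the distribution is made up for illustration):

```python
import numpy as np

# Hypothetical per-subscriber revenue, sorted from largest to smallest.
revenue = np.sort(np.random.default_rng(0).pareto(2.0, 1000))[::-1]

cum_share = np.cumsum(revenue) / revenue.sum()
k = np.searchsorted(cum_share, 0.80) + 1  # subscribers covering 80% of revenue
print(f"{k / len(revenue):.1%} of subscribers generate 80% of revenue")
```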
Product Association
• About 21% of subscribers buy P4, P7 and P9
 P4 is the most profitable product
 P7 is ranked 6th
 P9 is ranked 15th, with only 2% of revenue
(Association network over products P1–P9; P4, P7, and P9 form a strongly linked group)
• Several possible actions
 Make a bundle offering of these products
 Cross-sell from P9 to P4
 Temptation to remove P9 should be resisted
Clustering
(Cluster plot: one striking segment stands out)
• 30% of customers buy a single yearly product!
Summary of Findings
• Data Mining found:
 A small percentage of the customers is responsible for a large
share of the sales
 Several groups of « strongly-connected » articles
 A sizeable group of subscribers who buy a single article
• What was learned?
 First 2 findings: « we knew that! » (BUT: scientific confirmation
of business observation)
 3rd finding: « we could target these customers with a special
offer! »
 Lack of relevant data: the structure is in place but not being used
systematically
Family History
Finding Affinities
Metrics generally depend on the
nature of the attribute (e.g.,
nominal, real, string)
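A sketch of such type-dependent similarity metrics; the function names and scales below are illustrative assumptions, not the system's actual metrics:

```python
from difflib import SequenceMatcher

def nominal_sim(a, b) -> float:
    """Nominal attributes (e.g., Sex): exact match or nothing."""
    return 1.0 if a == b else 0.0

def real_sim(a: float, b: float, scale: float) -> float:
    """Real attributes (e.g., birth year): closeness on a fixed scale."""
    return max(0.0, 1.0 - abs(a - b) / scale)

def string_sim(a: str, b: str) -> float:
    """String attributes (e.g., Name): character-level resemblance."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(nominal_sim("F", "F"), real_sim(1910, 1912, 10), string_sim("Jon", "John"))
```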
Star Wars Family Tree
• Total Affinities
 (Thicker lines indicate stronger affinities; a highly connected group emerges)
 Characteristics: Name, Sex, Hometown, Occupation, Political Affiliation, Children
• More Than 2 Affinities
 (Same characteristics; one link stands out as apparently important)
• Occupational Affinity Network
 (Two occupational clusters: Jedi Knights and Moisture Farmers; the affinity between Luke and Obi-Wan is stronger because both were Jedi Knights and Jedi Masters)
• Birthday Networks (two or more affinities)
 (Reveal a duplicate individual, twins, and close relatives that share birthdays)
• Given Name Network (one or more affinities)
• More Naming Patterns…
 (Relatives sharing the same middle names; interestingly, both the husband's and the wife's maternal grandfathers share the same first and middle names)
• Interesting Naming Pattern
 (Recurring through generations)
Record Linkage
• The process of identifying records that refer to the same individual
• Essential for exchanging and/or merging
pedigrees
• MAL4:6 uses the individuals and their
relatives as found in their pedigrees
Challenges
• Each relationship/attribute is treated equally
• Weights
 Version 0.1 used feature selection instead of continuous
weights
 Weights would allow MAL4:6 to use all of the data in a
pedigree to a degree (TBD by MAL4:6)
• Naturally Skewed Data
 #NonMatches >> #Matches
 Learners tend to over-learn the majority class
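A common counter-measure to such skew is to re-weight the classes during training. A sketch using scikit-learn's class_weight option (an illustration, not necessarily what MAL4:6 did):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy skewed labels: non-matches vastly outnumber matches.
X = np.random.default_rng(1).normal(size=(1000, 4))
y = np.r_[np.zeros(950, dtype=int), np.ones(50, dtype=int)]

# 'balanced' scales each class inversely to its frequency, so the rare
# match class is not simply drowned out by the majority.
clf = LogisticRegression(class_weight="balanced").fit(X, y)
print(f"predicted matches: {int(clf.predict(X).sum())} of {len(y)}")
```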
Similarity
• Attributes: A = {A1, A2, …, An}, where each Ai is a piece of information (e.g., date of birth)
• For each Ai, sim_Ai is the similarity metric associated with Ai
• Let x = <A1: a1(x), A2: a2(x), …, An: an(x)> denote an individual, where aj(x) is the value of Aj for x
 e.g., <firstname: John, lastname: Smith, …>
• Let R = {R0, R1, …, Rm} be a set of functions, each mapping an individual to one of its relatives
Structured Network
(Network diagram: per-attribute similarity scores for the individual and its mapped relatives, e.g., father and spouse, feed weighted nodes, with weights w_ij, that combine into a final Match / MisMatch decision)
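A minimal sketch of the structure just described, combining per-attribute similarities over an individual and its mapped relatives through weights; the weights, relative mappings, and combination rule below are illustrative assumptions:

```python
def exact(a, b) -> float:
    return 1.0 if a is not None and a == b else 0.0

def weighted_similarity(x, y, sims, weights, relatives) -> float:
    """Combine per-attribute similarities over an individual and its relatives."""
    total, norm = 0.0, 0.0
    for rel_name, rel in relatives.items():
        rx, ry = rel(x), rel(y)
        if rx is None or ry is None:
            continue  # this relative is missing from one of the pedigrees
        for attr, sim in sims.items():
            w = weights[(rel_name, attr)]
            total += w * sim(rx.get(attr), ry.get(attr))
            norm += w
    return total / norm if norm else 0.0

a = {"name": "John Smith", "birth": 1890,
     "father": {"name": "Wm Smith", "birth": 1860}}
b = {"name": "John Smith", "birth": 1890,
     "father": {"name": "William Smith", "birth": 1861}}

score = weighted_similarity(
    a, b,
    sims={"name": exact, "birth": exact},
    weights={("self", "name"): 2.0, ("self", "birth"): 1.0,
             ("father", "name"): 1.0, ("father", "birth"): 1.0},
    relatives={"self": lambda p: p, "father": lambda p: p.get("father")},
)
print(round(score, 2))  # 0.6: self matches fully, father only partially
```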
Results
• Genealogical database from the LDS
Church’s Family History Department (~5
million individuals)
• ~16,000 labeled data instances
 Precision: 88.9%
 Recall: 93.8%
E-Commerce
Search Term Analysis
• Prior to April 2005
 Search terms contained very few unique keywords
 Most common keywords were words in the actual domain name
• Significant surge in April 2005
 Diversification of the search terms, often corresponding to new products/offers
 Doubling of the number of unique visitors
 What happened? Search Engine Optimization (SEO)!
Shipping Policy
• August 2005: change of shipping policy
 Highly visible, lower, free+
• Impact on abandoned carts?
 Not significant
• Before-after purchases
 Marked increase in the number of purchases in all categories
 100% increase for the high-end category (free shipping)
 Can't infer causality BUT clear indication of some effect