Data Mining and Knowledge Discovery in Business Databases

A Brief Introduction to
The Hard Facts About Data
• Enormous amounts of data are being stored in databases
• Businesses are increasingly becoming data-rich, yet,
paradoxically, they remain knowledge-poor
“We are drowning in information, but starving for knowledge”
-John Naisbitt
• Unless it is used to improve business practices, data is a
liability, not an asset
• Standard data analysis techniques are useful but
insufficient and may miss valuable insight
Real Examples
• Consider the enormous amounts of data generated
 Transactional data by credit card companies
 Searches on Google, Yahoo, and MSN
 Clickstream (web) or other sensor data
 Europe's Very Long Baseline Interferometry (VLBI) has 16
telescopes, each of which produces 1 Gigabit/second of
astronomical data over a 25-day observation session
• Storage and analysis are a big problem
 Walmart is reported to have a 24-terabyte database (likely even larger now)
 AT&T handles billions of calls per day
• Data cannot be stored; analysis must be done on the fly
 Social media data
What Is Data Mining?
Business Definition
• Deployment of business processes, supported by
adequate analytical techniques, to:
 Take further advantage of data
 Discover RELEVANT knowledge
 ACT on the results
KDD is the non-trivial process of
identifying valid, novel, potentially
useful, and ultimately understandable
patterns in data.
Application Domains (I)
• Direct marketing and retail
 Behavior analysis, Offer targeting, Market basket
analysis, Up-selling, etc.
• Banks and financial institutions
 Credit risk assessment, Fraud detection, Portfolio
management, Forecasting, etc.
• Telecommunications
 Churn prediction, Product/service development,
campaign management, fraud detection, etc.
Application Domains (II)
• Healthcare
 Public health monitoring (infectious outbreaks, etc),
Outcomes measurement (performance, cost, success
rate, etc), Diagnostic help, etc.
• Pharmaceutical industry / Bio-informatics
 Biological activity prediction, Coding sequence
discovery, Animal tests reduction, etc.
• Insurance
 Cross-selling, Risk analysis, Premium setting, Claims
analysis, Fraud detection, etc.
Application Domains (III)
• Transportation
 Network management, Booking optimization,
Customer service, etc.
• Manufacturing
 Load forecasting, Production management, Equipment
monitoring, Quality management, etc.
• Etc.
Machine Learning
Data Mining and
Knowledge Discovery
Data Mining Tasks
• Summarization
• Classification / Prediction
 Classification, Concept learning, Regression
• Clustering
• Dependency modeling
• Anomaly detection
• Link analysis
Summarization
• To find a compact description for a subset
of the data.
 Producing the average downtime of all plant
equipment in a given month, computing the total
income generated by each sales representative per
region per year
• Techniques:
 Statistics, Information theory, OLAP, etc.
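The sales example above (total income per representative, per region, per year) is a group-by aggregation. Here is a minimal sketch in plain Python. The record layout, the sample values, and the function name `total_income` are illustrative assumptions and do not come from the slides.

```python
from collections import defaultdict

def total_income(records):
    """Sum the sale amounts for each (representative, region, year) group."""
    totals = defaultdict(float)
    for rep, region, year, amount in records:
        totals[(rep, region, year)] += amount
    return dict(totals)

# Hypothetical sales records: (representative, region, year, amount)
sales = [
    ("Ann", "North", 2004, 1200.0),
    ("Ann", "North", 2004, 800.0),
    ("Ann", "South", 2004, 500.0),
    ("Bob", "North", 2005, 950.0),
]

summary = total_income(sales)
for key in sorted(summary):
    print(key, summary[key])
```

An OLAP tool or a SQL GROUP BY would produce the same compact description. The point is that summarization reduces many records to a few interpretable figures.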
Classification / Prediction
• To learn a function that associates a data item with
the value of a response variable. If the response
variable is discrete, we talk of classification
learning; if the response variable is continuous, we
talk of regression learning.
 Assessing creditworthiness in a loan underwriting business,
assessing the probability of response to a direct marketing
campaign
• Techniques:
 Decision trees, Neural networks, Naïve Bayes, Support
vector machines, Logistic regression, Nearest-neighbors, etc.
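To make the idea of learning a function from data items to a discrete response concrete, here is a minimal nearest-neighbors sketch in plain Python. The applicant features, the labels, and the function name `knn_predict` are hypothetical. A real project would also scale the features and validate on held-out data.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # train is a list of (feature_vector, label) pairs
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical loan applicants: (income in thousands, debt ratio) -> credit label
train = [
    ((60.0, 0.10), "good"),
    ((75.0, 0.20), "good"),
    ((58.0, 0.15), "good"),
    ((20.0, 0.80), "bad"),
    ((25.0, 0.70), "bad"),
    ((30.0, 0.90), "bad"),
]

print(knn_predict(train, (55.0, 0.25)))
print(knn_predict(train, (22.0, 0.75)))
```

Because the response here is a discrete label, this is classification. Averaging the neighbors' numeric responses instead of voting would turn the same scheme into regression.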
Clustering
• To identify a set of (meaningful) categories or
clusters to describe the data. Clustering relies on
some notion of similarity among data items and
strives to maximize intra-cluster similarity whilst
minimizing inter-cluster similarity.
 Segmenting a business's customer base, building a taxonomy
of animals in a zoological application
• Techniques:
 K-Means, Hierarchical clustering, Kohonen SOM, etc.
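As an illustration of K-Means, the sketch below alternates between assigning points to their nearest centroid and moving each centroid to the mean of its cluster. The customer data and the choice of initial centroids are assumptions made for the example. Production code would normally use a library implementation with several random restarts.

```python
import math

def assign(points, centroids):
    """Assignment step: attach every point to its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, centroids, max_iter=100):
    """Repeat assignment and centroid update until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        clusters = assign(points, centroids)
        # Update step: an empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else old
            for old, members in zip(centroids, clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)

# Hypothetical customers: (visits per month, average basket value)
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.6), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, [points[0], points[3]])
print(centroids)
print(clusters)
```

Euclidean distance plays the role of the similarity notion here: points in the same cluster are close to each other, and the two clusters are far apart.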
Dependency Modeling
• To find a model that describes significant
dependencies, associations or affinities
among variables.
 Analyzing market baskets in consumer goods
retail, uncovering cause-effect relationships in
medical treatments
• Techniques:
 Association rules, ILP, Graphical modeling, etc.
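Association rules are usually scored by support (how often the items occur together) and confidence (how often the consequent appears when the antecedent does). A minimal sketch follows. The baskets and function names are hypothetical, and the code assumes the antecedent occurs in at least one basket.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Estimate of P(consequent | antecedent) over the baskets."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)

# Hypothetical market baskets
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
    {"bread", "butter", "jam"},
]

print(support(baskets, {"bread", "butter"}))       # 3 of the 5 baskets
print(confidence(baskets, {"bread"}, {"butter"}))  # 3 of the 4 baskets with bread
```

A rule such as "bread implies butter" is typically kept only if both measures exceed analyst-chosen thresholds. High confidence alone does not establish a cause-effect relationship.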
Anomaly Detection
• To discover the most significant changes in
the data from previously measured or
normative values.
 Detecting fraudulent credit card usage, detecting
anomalous turbine behavior in nuclear plants
• Techniques:
 Novelty detectors, Probability density models, etc.
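One simple probability density model fits a Gaussian to readings taken under normal conditions and flags values that fall far out in the tails. The sketch below assumes a single numeric signal and a threshold of three standard deviations. The readings and the threshold are illustrative choices, not values from the slides.

```python
import statistics

def fit_gaussian(values):
    """Estimate mean and standard deviation from reference (normal) observations."""
    return statistics.mean(values), statistics.stdev(values)

def is_anomalous(x, mean, std, threshold=3.0):
    """Flag x when it lies more than `threshold` standard deviations from the mean."""
    return abs(x - mean) / std > threshold

# Hypothetical reference readings (e.g., vibration level under normal operation)
reference = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 9.7, 10.0]
mean, std = fit_gaussian(reference)

for reading in (10.4, 13.5):
    print(reading, is_anomalous(reading, mean, std))
```

The reference set defines the "normative values". Anything the fitted density considers very unlikely is reported for inspection.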
Data Mining Process
• CRISP-DM: Cross-Industry Standard Process for
Data Mining
• Consortium effort involving:
 NCR Systems Engineering Copenhagen
 DaimlerChrysler AG
 OHRA Verzekeringen en Bank Groep B.V.
• History:
 Version 1.0 released in 1999
 See for further details
Visual Overview
Summary: Phases & Tasks
• Business Understanding
 Determine Business Objectives: Business Objectives, Business Success Criteria
 Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Costs and Benefits
 Determine Data Mining Goals: Data Mining Goals, Data Mining Success Criteria
 Produce Project Plan: Project Plan, Initial Assessment of Tools and Techniques
• Data Understanding
 Collect Initial Data: Initial Data Collection Report
 Describe Data: Data Description Report
 Explore Data: Data Exploration Report
 Verify Data Quality: Data Quality Report
• Data Preparation
 Phase outputs: Data Set, Data Set Description
 Select Data: Rationale for Inclusion / Exclusion
 Clean Data: Data Cleaning Report
 Construct Data: Derived Attributes, Generated Records
 Integrate Data: Merged Data
 Format Data: Reformatted Data
• Modeling
 Select Modeling Technique: Modeling Technique, Modeling Assumptions
 Generate Test Design: Test Design
 Build Model: Parameter Settings, Model Description
 Assess Model: Model Assessment, Revised Parameter Settings
• Evaluation
 Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria, Approved Models
 Review Process: Review of Process
 Determine Next Steps: List of Possible Actions
• Deployment
 Plan Deployment: Deployment Plan
 Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
 Produce Final Report: Final Report, Final Presentation
 Review Project
• Business Understanding
 Initial phase
 Focuses on:
• Understanding the project objectives and requirements from a business
perspective
• Converting this knowledge into a data mining problem definition, and a
preliminary plan designed to achieve the objectives
• Data Understanding
 Starts with an initial data collection
 Proceeds with activities aimed at:
• Getting familiar with the data
• Identifying data quality problems
• Discovering first insights into the data
• Detecting interesting subsets to form hypotheses for hidden information
• Data Preparation
 Covers all activities to construct the final dataset (data that will be fed
into the modeling tool(s)) from the initial raw data
 Data preparation tasks are likely to be performed multiple times, and
not in any prescribed order
 Tasks include table, record, and attribute selection, as well as
transformation and cleaning of data for modeling tools
• Modeling
 Various modeling techniques are selected and applied, and their
parameters are calibrated to optimal values
 Typically, there are several techniques for the same data mining
problem type
 Some techniques have specific requirements on the form of data,
therefore, stepping back to the data preparation phase is often needed
• Evaluation
 At this stage, a model (or models) that appears to have
high quality, from a data analysis perspective, has been
built
 Before proceeding to final deployment of the model, it is
important to more thoroughly evaluate the model, and
review the steps executed to construct the model, to be
certain it properly achieves the business objectives
 A key objective is to determine if there is some important
business issue that has not been sufficiently considered
 At the end of this phase, a decision on the use of the data
mining results should be reached
• Deployment
 Creation of the model is generally not the end of the project
 Even if the purpose of the model is to increase knowledge of the data,
the knowledge gained will need to be organized and presented in a
way that the customer can use it
 Depending on the requirements, the deployment phase can be as
simple as generating a report or as complex as implementing a
repeatable data mining process
 In many cases it will be the customer, not the data analyst, who will
carry out the deployment steps
 However, even if the analyst will not carry out the deployment effort it
is important for the customer to understand up front what actions will
need to be carried out in order to actually make use of the created
models
The Missing Link
Closing the Loop
Changes in data
Changes in environment
How do I know my model
remains valid and
When should I update my model?
How do I update my model?
Data Mining Myths (I)
• Data Mining produces surprising results that will utterly
transform your business
 Reality:
• Early results = scientific confirmation of human intuition.
• Beyond = steady improvement to an already successful organisation.
• Occasionally = discovery of one of those rare « breakthrough » facts.
• Data Mining techniques are so sophisticated that they can
substitute for domain knowledge or for experience in
analysis and model building
 Reality:
• Data Mining = joint venture.
• Close cooperation between experts in modeling and using the
associated techniques, and people who understand the business.
Data Mining Myths (II)
• Data Mining is useful only in certain areas, such as
marketing, sales, and fraud detection
 Reality:
• Data mining is useful wherever data can be collected.
• All that is really needed is data and a willingness to « give it a try. »
There is little to lose…
• Only massive databases are worth mining
 Reality:
• A moderately-sized or small data set can also yield valuable
• It is not only the quantity, but also the quality of the data that matters
(characterising mutagenic compounds)
Data Mining Myths (III)
• The methods used in Data Mining are fundamentally
different from the older quantitative model-building
 Reality:
• All methods now used in data mining are natural extensions and
generalisations of analytical methods known for decades.
• What is new in data mining is that we are now applying these
techniques to more general business problems.
• Data Mining is an extremely complex process
 Reality:
• The algorithms of data mining may be complex, but new tools and
well-defined methodologies have made those algorithms easier to
• Much of the difficulty in applying data mining comes from the same
data organisation issues that arise when using any modeling
Food for Thought
• “Data mining can't be ignored -- the data is there,
the methods are numerous, and the advantages that
knowledge discovery brings to a business are
• “People who can't see the value in data mining as
a concept either don't have the data or don't have
data with integrity.”
• “Data mining is quickly becoming a necessity, and
those who do not do it will soon be left in the dust.
Data mining is one of the few software activities
with measurable return on investment associated
with it.”
Data Mining Deliverables
• Provides additional insight about the data
and the business
• Provides scientific confirmation of
empirical/intuitive business observations
• Discovers new, subtle pieces of business
In that order !
Key Success Factors
• Have a clearly articulated business problem that needs to
be solved and for which Data Mining is the adequate
• Ensure that the problem being pursued is supported by the
right type of data of sufficient quality and in sufficient
quantity
• Recognise that Data Mining is a process with many
components and dependencies
• Plan to learn from the Data Mining process whatever the
outcome
• Data Mining transforms data into actions
• Data Mining is hard work
 It is a process, not a single activity
 Most companies are clueless and DM is an
 Plan to learn through the process
 Think big, start small
• Data Mining is FUN!
More on Data Mining
• KDnuggets
 News, software, jobs, courses, etc.
 Data mining association
The Situation
• Potential applications:
 Associations of products that sell together
 Segmentation of customers
• Short audit:
 Nice DWH, only 2 years old, not fully
 Limited data on purchases and subscriptions
Summarization / Aggregation
• Revenue distribution
 80% generated by 41.5% of subscribers
 60% generated by 18.3% of subscribers
 42.9% generated by top 5 products
• Simple customer classes
 Over 65 years old most profitable
 Under 16 years old least profitable
• Birthdate filled-in for only about 10% of
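Revenue-distribution figures like those above (for instance, 80% of revenue coming from 41.5% of subscribers) can be obtained by sorting customers by revenue and accumulating from the top. A minimal sketch, with made-up revenue values and a hypothetical function name:

```python
def share_of_customers_for(revenues, target_share):
    """Smallest fraction of customers, taken from the top spender down,
    whose combined revenue reaches `target_share` of the total."""
    ordered = sorted(revenues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, revenue in enumerate(ordered, start=1):
        running += revenue
        if running >= target_share * total:
            return count / len(ordered)
    return 1.0

# Hypothetical yearly revenue per subscriber
revenues = [500, 320, 80, 40, 30, 10, 10, 5, 3, 2]

print(share_of_customers_for(revenues, 0.80))
print(share_of_customers_for(revenues, 0.50))
```

In this toy data set the top two of ten subscribers already account for more than 80% of revenue, a much sharper concentration than the one reported in the case study.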
Product Association
• About 21% of subscribers buy P4, P7 and P9
 P4 is most profitable product
 P7 is ranked 6th
 P9 is ranked 15th with only 2%
of revenue
• Several possible actions
 Make a bundle offering of these products
 Cross-sell from P9 to P4
 Temptation to remove P9 should be resisted
30% of customers who
buy a single yearly
Summary of Findings
• Data Mining found:
 A small percentage of the customers is responsible for a large
share of the sales
 Several groups of « strongly-connected » articles
 A sizeable group of subscribers who buy a single article
• What was learned?
 First 2 findings: « we knew that! » (BUT: scientific confirmation
of business observation)
 3rd finding: « we could target these customers with a special
offer! »
 Lack of relevant data: the structure is in place but not being used
Family History
Finding Affinities
Metrics generally depend on the
nature of the attribute (e.g.,
nominal, real, string)
Star Wars Family Tree
Total Affinities
(Thicker lines indicate stronger affinities --- Highly connected group )
Name, Sex, Hometown, Occupation,
Political Affiliation, Children
More Than 2 Affinities
Name, Sex, Hometown, Occupation,
Political Affiliation, Children
Seems to be an
important link
Occupational Affinity Network
Jedi Knights
Moisture Farmers
Stronger Affinity
between Luke and
Obi-Wan because
they were both
Jedi Knights and
Jedi Masters
Birthday Networks
(Two or more affinities)
Duplicate individual
Close relatives
that share
Given Name Network
(One or more affinities)
More Neat Naming Patterns…
Relatives sharing the
same middle names
both husband and wife’s
maternal grandfathers
share the same first and
middle names.
Naming Pattern
Through generations
Record Linkage
• The process of identifying similar people
• Essential for exchanging and/or merging
• MAL4:6 uses the individuals and their
relatives as found in their pedigrees
• Each relationship/attribute is treated equally
• Weights
 Version 0.1 used feature selection instead of continuous
weights
 Weights would allow MAL4:6 to use all of the data in a
pedigree to a degree (TBD by MAL4:6)
• Naturally Skewed Data
 #NonMatches >> #Matches
 Learners tend to over-learn the majority class
• Attributes: A = {A1, A2, …, An}, where each Ai is a piece of information
(e.g., date of birth)
• For each Ai, simAi is the similarity metric associated with Ai
• Let x = <A1: a1x, A2: a2x, …, An: anx> denote an individual, where
ajx is the value of Aj for x
 <firstname: John, lastname: Smith, …>
• Let R = {R0, R1, …, Rm} be a set of functions that map an individual to
one of its relatives
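The notation above can be read as follows: compare two individuals attribute by attribute using each attribute's own metric, then do the same for their relatives. The sketch below is only an illustration of that idea and not the MAL4:6 implementation. The specific metrics, the unweighted averaging, the attribute names, and the representation of relatives as nested records under `father` and `mother` keys are all assumptions.

```python
from difflib import SequenceMatcher

def sim_string(a, b):
    """Similarity in [0, 1] for string attributes (names, places)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def sim_year(a, b, tolerance=10):
    """Similarity in [0, 1] for numeric attributes; decays linearly with the gap."""
    return max(0.0, 1.0 - abs(a - b) / tolerance)

# One metric per attribute Ai (the choice of metrics is an assumption).
METRICS = {"firstname": sim_string, "lastname": sim_string, "birth_year": sim_year}

def sim_individual(x, y):
    """Unweighted mean of attribute similarities over attributes present in both records."""
    shared = [a for a in METRICS if a in x and a in y]
    if not shared:
        return 0.0
    return sum(METRICS[a](x[a], y[a]) for a in shared) / len(shared)

def sim_pedigree(x, y, relations=("father", "mother")):
    """Average the individuals' own similarity with that of each relative both records have."""
    scores = [sim_individual(x, y)]
    for r in relations:
        if r in x and r in y:
            scores.append(sim_individual(x[r], y[r]))
    return sum(scores) / len(scores)

# Hypothetical records
a = {"firstname": "John", "lastname": "Smith", "birth_year": 1850,
     "father": {"firstname": "William", "lastname": "Smith"}}
b = {"firstname": "Jon", "lastname": "Smith", "birth_year": 1851,
     "father": {"firstname": "William", "lastname": "Smith"}}
c = {"firstname": "Mary", "lastname": "Jones", "birth_year": 1790}

print(sim_pedigree(a, b))
print(sim_pedigree(a, c))
```

Treating every attribute and relationship equally matches the unweighted scheme described in the slides. Continuous weights would replace the plain means with weighted ones.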
Structured Network
Similarity Scores
• Genealogical database from the LDS
Church’s Family History Department (~5
million individuals)
• ~16,000 labeled data instances
 Precision:
 Recall:
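For reference, precision is the share of predicted matches that are true matches, and recall is the share of true matches that were found. Because non-matches vastly outnumber matches in record linkage, these two measures are more informative than plain accuracy. A minimal sketch with made-up labels (not the results of the study):

```python
def precision_recall(predicted, actual):
    """Precision and recall for the positive ("match") class."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)          # predicted match, really a match
    fp = sum(p and not a for p, a in pairs)      # predicted match, not a match
    fn = sum(a and not p for p, a in pairs)      # missed match
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical predictions and ground truth for ten candidate record pairs
predicted = [True, True, False, False, True, False, False, False, False, False]
actual    = [True, False, True, False, True, True, False, False, False, False]

precision, recall = precision_recall(predicted, actual)
print(precision, recall)
```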
Search Term Analysis
• Prior to April 2005
 Search terms used prior to April
contained very few unique
 Most common keywords used were
words in the actual domain name
Significant surge in April 2005
Diversification of the search terms, often corresponding to new
Doubling of number of unique visitors
What happened? Search Engine Optimization (SEO)!
Shipping Policy
• August 2005
 Change shipping policy
 Highly visible, lower, free+
• Impact on abandoned carts?
 Not significant
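A statement such as "not significant" usually rests on a test comparing the abandonment rate before and after the change. The slides do not say which test was used. One common choice is a two-proportion z-test, sketched below with invented counts.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for the difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value

# Hypothetical counts: abandoned carts out of all carts, before and after the change
z, p_value = two_proportion_z_test(620, 1000, 600, 1000)
print(round(z, 2), round(p_value, 2))
```

With these made-up numbers the p-value is well above the conventional 0.05 level, so a drop from 62% to 60% abandonment would not be considered significant.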
Before-After Purchases
Marked increase in number of
purchases in all categories
100% increase for high-end
category (free shipping)
Can’t infer causality BUT clear
indication of some effect