Transcript Slide 1
Chapter 1
Initial Description of Data Mining
in Business
Prepared by: Dr. Tsung-Nan Tsai
Contents
Introduces data mining concepts
Presents typical business data applications
Explains the meaning of key concepts
Gives a brief overview of data mining tools
Outlines the remaining chapters of the book
1-2
Definition
DATA MINING: exploration & analysis
by automatic means
of large quantities of data
to discover actionable patterns & rules
Refers to the analysis of the large quantities of data that
are stored in computers.
Data mining is a way to use massive quantities of data
that businesses generate
GOAL - improve marketing, sales, customer support
through better understanding of customers
1-3
Retail Outlets
Bar coding & scanning generate masses of data
customer service (Grocery stores can quickly
process the purchases and accurately determine
product prices)
inventory control (Determine the quantity of items
of each product on hand, supply chain management)
MICROMARKETING
CUSTOMER PROFITABILITY ANALYSIS
MARKET-BASKET ANALYSIS
1-4
Political Data Mining
Grossman et al., 10/18/2004, Time, 38
2004 Election
Republicans: VoterVault
From Mid-1990s
About 165 million voters
Massive get-out-the-vote drive
for those expected to vote
Republican
Democrats: Demzilla
Also about 165 million voters
Names typically have 200 to
400 information items
1-5
Medical Diagnosis
J. Morris, Health Management Technology Nov 2004, 20,
22-24
Electronic Medical Records
Associated Cardiovascular
Consultants
31 physicians
40,000 patients per year,
southern New Jersey
Data mined to identify
efficient medical practice
Enhance patient outcomes
Reduced medical liability
insurance
1-6
Mayo Clinic
Swartz, Information Management Journal Nov/Dec 2004, 8
IBM developed EMR program
Complete records on almost
4.4 million patients.
Doctors can ask how the last
100 Mayo patients with the same
gender, age, and medical history
responded to particular
treatments.
1-7
Business Uses of Data Mining
Toyota used data mining of its data warehouse to
determine more efficient transportation routes,
reducing time-to-market by an average of 19 days.
Banks used data mining in soliciting credit
card customers.
Insurance and telecommunication companies used DM
to detect fraud.
Manufacturing firms used DM in quality control.
Many …..
1-8
Business Uses of Data Mining
1. Customer profiling
• Identify profitability of customer subsets
2. Targeting
• Determine characteristics of most profitable
customers
3. Market-Basket Analysis
• Determine correlation of purchases by profile
(customers)
• Cross-selling
• Part of Customer Relationship Management
1-9
What is needed to do DM?
DM requires the identification of a problem, along
with data collection that can lead to a better
understanding of the market.
Computer models provide statistical or other means
of analysis.
Two general types of DM studies:
1. Hypothesis testing: involves expressing a theory
about the relationship between actions and outcomes.
2. Knowledge discovery: a preconceived notion may not
be present; rather, relationships can be
identified by looking at the data (correlation analysis).
1-10
Reasons why Data Mining is now effective
Data are there
Data are warehoused (computerized)
Walmart: 35 thousand queries per week
Computing economically available
Competitive pressure
Commercial products available
1-11
Trends
Every business is a service business
hotel chains record your
preferences
car rental companies the same
service versus price
credit card companies
long distance providers
airlines
computer retailers
1-12
Trends
Information as Product
Custom Clothing Technology Corporation
fit jeans, other clothing
INFORMATION BROKERING
IMS - collects prescription data from pharmacies, sells
to drug firms
AC Nielsen - TV
1-13
Trends
Commercial Software Available
using statistical, artificial intelligence tools
that have been developed
Enterprise Miner (SAS)
Intelligent Miner (IBM)
Clementine (SPSS)
PolyAnalyst (Megaputer)
Specialty products
1-14
Fingerhut’s DM models
Fingerhut used segmentation, decision tree, regression analysis,
and neural network modeling tools, with SAS for regression analysis
and SPSS for neural networks.
The segmentation model combines order and basic demographic
data with Fingerhut’s product offerings.
Neural network models were used to analyze mailing patterns and
order-filling telephone call orders.
Goal:
Create new mailings targeted at customers with the greatest
potential payoff.
Create catalogs containing products that the targeted customers are
interested in, such as furniture, telephones…
1-15
How Data Mining Is Being Used
U.S. Government
track down Oklahoma City
bombers, Unabomber, many
others
Treasury Department: international funds transfers,
money laundering
Internal Revenue Service
1-16
How Data Mining Is Used
Firefly
asks members to rate
music and movies
subscribers clustered
clusters get custom-designed
recommendations
1-17
Warranty Claims Routing
Diesel engine manufacturer
stream of warranty claims
examine each by expert
determine whether charges are reasonable &
appropriate
think of expert system to automate claims
processing
1-18
Data mining application area
Application Area | Applications | Specifics
Retailing | Affinity positioning | Position products effectively
Retailing | Cross-selling | Find more products for customers
Banking | Customer relationship management | Identify customer value; develop programs to maximize revenue
Credit card management | Lift | Identify effective market segments
Credit card management | Churn, fraud detection | Identify likely customer turnover
Insurance | Fraud detection | Identify claims meriting investigation
Telecommunications | Churn | Identify likely customer turnover
Telemarketing | Online information | Aid telemarketers with easy data access
Human Resource Management | Churn | Identify potential employee turnover
1-19
Retailing
Affinity positioning is based upon the identification of
products that the same customer is likely to want.
Example: cold medicine and tissues
Cross-selling: The knowledge of products that go
together can be used to market the complementary
product.
Grocery stores do this through product shelf
placement.
Grocery stores generate mountains of cash register data.
Current technology enables grocers to look at
customers who have defected from a store, their
purchase history, and characteristics of other potential
defectors.
1-20
Cross-selling
USAA
insurance
doubled number of products held by average
customer due to data mining
detailed records on customers
predict products they might need
Fidelity Investments
regression - what makes customer loyal
1-21
Banking
CRM involves the application of technology to
monitor customer service, a function that is enhanced
through data mining support.
DM applications in finance include predicting the
prices of equities involving a dynamic environment
with surprise information, some of which might be
inaccurate …
Only 3% of the customers at Norwest bank provided
44% of their profits.
CRM products enable banks to define and identify
customer and household relationships.
1-22
Retaining Good Customers
Customer loss:
Banks - Attrition
Cellular Phone Companies - Churn
study who might leave, why
Southern California Gas
– customer usage, credit information
– direct mail contact - most likely best billing plan
– who is price sensitive
Who should get incentives, whom to keep
1-23
Credit card management
Bank credit card marketing promotions typically generate 1,000
responses to mailed solicitations – a response rate of about 1%.
The rate is improved significantly through data mining analysis.
DM tools used by banks include credit scoring which is a
quantified analysis of credit applicants with respect to
predictions of on-time loan repayment. (Data covering deposits,
savings, loans, credit card, insurance…).
These credit scores can be used to make accept/reject
recommendations, as well as to establish the size of a credit line.
ATMs could be rigged up with electronic sales pitches
for products that a particular customer is likely to be interested
in.
1-24
Fairbank & Morris
Credit card company’s most valuable asset:
INFORMATION ABOUT CUSTOMERS
Signet Banking Corporation
obtained behavioral data from many sources
built predictive models
aggressively marketed balance transfer card
First Union
who will move soon - improve retention
1-25
Telecommunications
Retention of customers in telecommunications is very difficult. The
phenomenon of a customer switching carriers is referred to as
churn, a fundamental concept in telemarketing as well as in
other fields.
One communications company estimated that 1/3 of churn is due
to poor call quality, and up to ½ is due to poor equipment.
A cellular fraud prevention system monitors traffic to spot problems
with faulty telephones. When a telephone begins to go bad,
telemarketing personnel are alerted to contact the customer and
suggest bringing the equipment in for service.
Another way to reduce churn is to protect customers from
subscription and cloning (duplication) fraud. Fraud prevention
systems provide verification that is transparent to legitimate
subscribers.
1-26
Human resource management
Business intelligence is a way to truly understand
markets, competitors, and processes.
Software technology such as data warehouses, data
marts, online analytical processing (OLAP), and data
mining can be used to improve a firm’s profitability.
In HRM, the analysis can lead to the identification of
individuals who are liable to leave the company unless
additional compensation or benefits are provided.
HRM would identify the right people so that
organizations could treat them well and retain them
(reduce churn).
1-27
Methodology and Tools
Analyzing data
Given management goals and that management
can translate knowledge into action
1-28
Basic Styles
Top-Down: HYPOTHESIS TESTING
SUPERVISED
have a theory, experiment to prove or disprove
SCIENCE
Bottom-Up: KNOWLEDGE DISCOVERY
UNSUPERVISED
start with data, see new patterns
CREATIVITY
1-29
Hypothesis Testing
Generate theory
Determine data needed
Get data
Prepare data
Build computer model
Evaluate model results
confirm or reject hypotheses
1-30
Generate Theory
Systematically tie different input sources
together (MENTAL MODEL)
What causes sales volume?
sales rep performance
economy, seasonality
product quality, price, promotion,
location
1-31
Generate Theory
Brainstorm:
diverse representatives for broad coverage of
perspectives (electronic)
keep under control (keep positive)
generate testable hypotheses
1-32
Define Data Needed
Determine data needed to test hypothesis
Lucky - query existing database
More often - gather
pull together from diverse databases, survey,
buy
1-33
Locate Data
Usually scattered or unavailable
Sources: warranty claims
point-of-sale data (cash register records)
medical insurance claims
telephone call detail records
direct mail response records
demographic data, economic data
PROFILE: counts, summary statistics, cross-tabs, cleanup
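A minimal Python sketch of the PROFILE step above, assuming a small hypothetical list of point-of-sale records (the field names and values are invented for illustration and do not come from the slides). It computes counts, summary statistics, and a cross-tab, and flags a missing value for cleanup.

```python
from collections import Counter
from statistics import mean

# Hypothetical point-of-sale records; field names and values are illustrative only.
records = [
    {"store": "A", "category": "dairy", "amount": 4.50},
    {"store": "A", "category": "produce", "amount": 3.00},
    {"store": "B", "category": "dairy", "amount": 5.50},
    {"store": "B", "category": "dairy", "amount": None},  # missing value to clean up
]

def profile(rows):
    """Return counts, summary statistics, and a store-by-category cross-tab."""
    amounts = [r["amount"] for r in rows if r["amount"] is not None]
    return {
        "n_rows": len(rows),
        "n_missing_amount": len(rows) - len(amounts),
        "mean_amount": mean(amounts),
        "min_amount": min(amounts),
        "max_amount": max(amounts),
        # Cross-tab: number of records for each (store, category) combination.
        "crosstab": Counter((r["store"], r["category"]) for r in rows),
    }

summary = profile(records)
print(summary)
```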
1-34
Prepare Data for Analysis
Summarize: too much - no discriminant information
too little - swamped with useless detail
Process for computer: ASCII, spreadsheet
Data encoding: how data are recorded can vary; data may have been collected for a specific purpose
Textual data: avoid if possible (may need to code)
Missing values: missing salary - use mean?
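The "use mean?" option for missing values can be sketched as follows. This is only an illustration on hypothetical salary figures. The slide raises mean imputation as a question rather than a recommendation, and it does shrink the variance of the filled variable.

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to average")
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical salaries; None marks a missing value.
salaries = [40000, None, 60000, 50000]
filled = impute_mean(salaries)
print(filled)
```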
1-35
Build and Evaluate Model
Build Computer Model
Choose the appropriate modeling tools and algorithms
Training and test data sets.
Determine if hypotheses supported
statistical practice
test rule-based systems for accuracy
Requires both business and analytic knowledge
1-36
SUPERVISED
Dorn, National Underwriter Oct 18, 2004, 34,39
Health care fraud
Use statistics to identify
indicators of fraud or abuse
Can rapidly sort through large
databases
Identify patterns different from
norm
Moderately successful
But only effective on schemes
already detected
To benefit firm, need to identify
fraud before paying claim
1-37
Knowledge Discovery
Machine learning?
Usually need intelligent analyst
Directed: explain value of some variable
Undirected: no dependent variable selected
identify patterns
Use undirected to recognize relationships;
use directed to explain once found
1-38
Directed
Goal-oriented
Examples:
If discount applies, what is the impact on products?
Who is likely to purchase credit insurance?
What is the predicted profitability of a new customer?
What to bundle with a particular package?
Identify sources of preclassified data
Prepare data for analysis
Build & train computer model
Evaluate
1-39
Identify Data Sources
Best - existing corporate data warehouse
data clean, verified, consistent, aggregated
Usually need to generate
most data in form most efficient for designed
purpose
historical sales data often purged for dormant
customers (but you need that information)
1-40
Prepare Data
Put in needed format for computer
Make consistent in meaning
Need to recognize what data are missing
change in balance = new – old
add missing but known-to-be-important data
Divide data into training, test, evaluation
Decide how to treat outliers
statistically biasing, but may be most important
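One way to divide data into training, test, and evaluation sets is sketched below. The 60/20/20 fractions, the fixed seed, and the function name are assumptions for illustration; the slides do not prescribe proportions.

```python
import random

def split_data(rows, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle rows and split them into training, test, and evaluation sets."""
    if not 0 < train_frac + test_frac < 1:
        raise ValueError("fractions must leave room for an evaluation set")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    evaluation = shuffled[n_train + n_test:]  # whatever remains
    return train, test, evaluation

train, test, evaluation = split_data(range(100))
print(len(train), len(test), len(evaluation))
```

The three sets are disjoint, so the evaluation set holds only records the model has never seen during training or tuning.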
1-41
Build & Train Model
Regression - human builds (selects IVs)
Automatic systems train
give it data, let it hammer
OVERFITTING:
the model fits the training data too closely and generalizes poorly
TEST SET a means to evaluate model against
data not used in training
tune weights before using to evaluate
1-42
Evaluate Model
ERROR RATE: proportion of classifications
in evaluation set that were wrong
too little training: poor fit on training data and
poor error rate
optimal training: good fit on both
too much training: great fit on training data
and poor error rate
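The error rate defined above can be computed directly. A minimal sketch, with hypothetical class labels standing in for an evaluation set:

```python
def error_rate(actual, predicted):
    """Proportion of classifications that were wrong."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)

# Hypothetical labels for an evaluation set of five customers.
actual = ["churn", "stay", "stay", "churn", "stay"]
predicted = ["churn", "stay", "churn", "stay", "stay"]
rate = error_rate(actual, predicted)
print(rate)  # 2 wrong out of 5 -> 0.4
```

Computing the same quantity on the training set and on the evaluation set gives the comparison the slide describes: a low training error paired with a high evaluation error suggests too much training.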
1-43
Undirected Discovery
What items sell together? Strawberries & cream
Directed: What items sell with tofu? tabasco
Long distance caller market segmentation
Uniform usage - weekday & weekend, spikes
on holidays
After segmentation:
high & uniform except for several months of
nothing
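The two market-basket questions above can be illustrated with simple co-occurrence counts. This is a sketch on hypothetical baskets, not a full association-rule algorithm: the undirected version counts every pair of items, and the directed version counts only what sells with one chosen item.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets; items are illustrative only.
baskets = [
    {"strawberries", "cream", "sugar"},
    {"strawberries", "cream"},
    {"tofu", "tabasco"},
    {"tofu", "tabasco", "cream"},
]

def pair_counts(baskets):
    """Undirected: count how often each pair of items appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        # Sort so each pair is always stored in the same order.
        counts.update(combinations(sorted(basket), 2))
    return counts

def sells_with(item, baskets):
    """Directed: count the items bought together with one chosen item."""
    counts = Counter()
    for basket in baskets:
        if item in basket:
            counts.update(basket - {item})
    return counts

print(pair_counts(baskets).most_common(2))
print(sells_with("tofu", baskets).most_common(1))
```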
1-44
UNSUPERVISED
Dorn, National Underwriter Oct 18, 2004, 34,39
Health care fraud
Look at historical claim
submissions
Build ad hoc model to
compare with current
claims
Assign similarity score to
fraudulent claims
Predict fraud potential
1-45
Undirected Process
Identify data sources
Prepare data
Build & train computer model
Evaluate model
Apply model to new data
Identify potential targets for undirected
Generate new hypotheses to test
1-46
Generate hypotheses
Any commonalities in data?
Are they useful?
Many adults watch children’s movies
chaperones are an important market segment
they probably make final decision
When hypothesis is generated, that determines
data needed
1-47
Bank Case Study
Directed knowledge discovery to recognize likely prospects
for home equity loan
training set - current loan holders
developed model for propensity to borrow
got continuous scores, ranked customers
sent top 11% material
Undirected: segmented market into clusters
in one, 39% had both business & personal
accounts
cluster had 27% of the top 11%
Hypothesis: people use home equity to start business
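The ranking step in the directed part of this case study can be sketched as follows. The scores, customer IDs, and function name are hypothetical; only the idea of ranking customers by a continuous propensity score and mailing the top 11% comes from the slide.

```python
def top_fraction(scores, fraction=0.11):
    """Rank customers by score (highest first) and return the top fraction of IDs."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_selected = round(len(ranked) * fraction)
    return ranked[:n_selected]

# Hypothetical propensity-to-borrow scores keyed by customer ID.
scores = {f"c{i:03d}": i / 100 for i in range(100)}
mailing_list = top_fraction(scores)
print(mailing_list)
```

With 100 scored customers, the mailing list holds the 11 highest-scoring IDs.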
1-48
Data mining products and data sets
A good source to view current DM products is
www.KDnuggets.com.
The UCI Machine Learning Repository is a source of
very good data mining datasets at
www.ics.uci.edu/~mlearn/MLOther.html.
Weka DM software at
http://www.cs.waikato.ac.nz/ml/weka/
Tanagra DM software at http://eric.univ-lyon2.fr/~ricco/tanagra/index.html
1-49