Transcript Slide 1

Chapter 1
Initial Description of Data Mining
in Business
Prepared by: Dr. Tsung-Nan Tsai
Contents
Introduces data mining concepts
Presents typical business data applications
Explains the meaning of key concepts
Gives a brief overview of data mining tools
Outlines the remaining chapters of the book
1-2
Definition
DATA MINING: exploration & analysis
by automatic means
of large quantities of data
to discover actionable patterns & rules
Refers to the analysis of the large quantities of data that
are stored in computers.
Data mining is a way to use massive quantities of data
that businesses generate
GOAL - improve marketing, sales, customer support
through better understanding of customers
1-3
Retail Outlets
Bar coding & scanning generate masses of data
customer service (Grocery stores can quickly
process the purchases and accurately determine
product prices)
inventory control (Determine the quantity of items
of each product on hand, supply chain management)
MICROMARKETING
CUSTOMER PROFITABILITY ANALYSIS
MARKET-BASKET ANALYSIS
1-4
Political Data Mining
Grossman et al., 10/18/2004, Time, 38
2004 Election
Republicans: VoterVault
From Mid-1990s
About 165 million voters
Massive get-out-the-vote drive
for those expected to vote
Republican
Democrats: Demzilla
Also about 165 million voters
Names typically have 200 to
400 information items
1-5
Medical Diagnosis
J. Morris, Health Management Technology Nov 2004, 20,
22-24
Electronic Medical Records
Associated Cardiovascular
Consultants
31 physicians
40,000 patients per year,
southern New Jersey
Data mined to identify
efficient medical practice
Enhance patient outcomes
Reduced medical liability
insurance
1-6
Mayo Clinic
Swartz, Information Management Journal Nov/Dec 2004, 8
IBM developed EMR program
Complete records on almost
4.4 million patients.
Doctors can ask how the last
100 Mayo patients with the same
gender, age, and medical history
responded to particular
treatments.
1-7
Business Uses of Data Mining
Toyota used data mining of its data warehouse to
determine more efficient transportation routes,
reducing time-to-market by an average of 19 days.
Banks used data mining in soliciting credit
card customers.
Insurance and telecommunication companies used DM
to detect fraud.
Manufacturing firms used DM in quality control.
Many …
1-8
Business Uses of Data Mining
1. Customer profiling
• Identify the most profitable subsets of customers
2. Targeting
• Determine characteristics of most profitable
customers
3. Market-Basket Analysis
• Determine correlation of purchases by profile
(customers)
• Cross-selling
• Part of Customer Relationship Management
1-9
What is needed to do DM?
DM requires the identification of a problem, along
with data collection that can lead to a better
understanding of the market.
Computer models provide statistical or other means
of analysis.
Two general types of DM studies:
1. Hypothesis testing: involves expressing a theory
about the relationship between actions and outcomes.
2. Knowledge discovery: a preconceived notion may not
be present; rather, relationships can be
identified by looking at the data (correlation analysis).
1-10
Reasons why Data Mining is now effective
Data are there
Data are warehoused (computerized)
Walmart: 35 thousand queries per week
Computing economically available
Competitive pressure
Commercial products available
1-11
Trends
Every business is a service business
hotel chains record your
preferences
car rental companies the same
service versus price
credit card companies
long distance providers
airlines
computer retailers
1-12
Trends
Information as Product
Custom Clothing Technology Corporation
fit jeans, other clothing
INFORMATION BROKERING
IMS - collects prescription data from pharmacies, sells
to drug firms
AC Nielsen - TV
1-13
Trends
Commercial Software Available
using statistical, artificial intelligence tools
that have been developed
Enterprise Miner
Intelligent Miner
Clementine
PolyAnalyst
Specialty products
SAS
IBM
SPSS
Megaputer
1-14
Fingerhut’s DM models
Fingerhut used segmentation, decision tree, regression analysis,
and neural modeling tools, with SAS for regression analysis
and SPSS for neural networks.
The segmentation model combines order and basic demographic
data with Fingerhut’s product offerings.
Neural network models were used to identify patterns in mailings
and in order-filling telephone calls.
Goal:
• Create new mailings targeted at customers with the greatest
potential payoff.
• Create a catalog containing the products those customers are interested in,
such as furniture, telephones…
1-15
How Data Mining Is Being Used
U.S. Government
track down Oklahoma City
bombers, Unabomber, many
others
Treasury department international funds transfers,
money laundering
Internal Revenue Service
1-16
How Data Mining Is Used
Firefly
asks members to rate
music and movies
subscribers clustered
clusters get custom-designed
recommendations
1-17
Warranty Claims Routing
Diesel engine manufacturer
stream of warranty claims
examine each by expert
determine whether charges are reasonable &
appropriate
think of expert system to automate claims
processing
1-18
Data mining application area
Application Area | Applications | Specifics
Retailing | Affinity positioning | Position products effectively
Retailing | Cross-selling | Find more products for customers
Banking | Customer relationship management | Identify customer value; develop programs to maximize revenue
Credit card management | Lift | Identify effective market segments
Credit card management | Churn, fraud detection | Identify likely customer turnover
Insurance | Fraud detection | Identify claims meriting investigation
Telecommunications | Churn | Identify likely customer turnover
Telemarketing | Online information | Aid telemarketers with easy data access
Human resource management | Churn | Identify potential employee turnover
1-19
Retailing
Affinity positioning is based upon the identification of
products that the same customer is likely to want.
Cold medicine → tissues
Cross-selling: The knowledge of products that go
together can be used to market the complementary
product.
Grocery stores do this through product shelf
positioning.
Grocery stores generate mountains of cash register data.
Current technology enables grocers to look at
customers who have defected from a store, their
purchase history, and characteristics of other potential
defectors.
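The affinity idea on this slide (cold medicine and tissues) can be sketched with a simple co-occurrence count. This is a minimal illustration, not any retailer's actual method: the baskets are made-up data and the function name `rule_stats` is hypothetical. It computes the usual support, confidence, and lift figures for a rule such as "cold medicine implies tissues".

```python
def rule_stats(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    Assumes both items appear in at least one basket.
    """
    n = len(baskets)
    has_a = sum(1 for b in baskets if antecedent in b)
    has_c = sum(1 for b in baskets if consequent in b)
    has_both = sum(1 for b in baskets if antecedent in b and consequent in b)
    support = has_both / n            # share of all baskets holding both items
    confidence = has_both / has_a     # share of antecedent baskets that also hold the consequent
    lift = confidence / (has_c / n)   # above 1: items co-occur more often than chance
    return support, confidence, lift


# Made-up baskets for illustration only.
baskets = [
    {"cold medicine", "tissues", "milk"},
    {"cold medicine", "tissues"},
    {"milk", "bread"},
    {"tissues", "bread"},
    {"cold medicine", "juice"},
]

support, confidence, lift = rule_stats(baskets, "cold medicine", "tissues")
```

A lift above 1 for a pair of products is the kind of signal that might justify placing them near each other on the shelf.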
1-20
Cross-selling
USAA
insurance
doubled number of products held by average
customer due to data mining
detailed records on customers
predict products they might need
Fidelity Investments
regression - what makes customer loyal
1-21
Banking
CRM involves the application of technology to
monitor customer service, a function that is enhanced
through data mining support.
DM applications in finance include predicting the
prices of equities involving a dynamic environment
with surprise information, some of which might be
inaccurate …
Only 3% of the customers at Norwest bank provided
44% of their profits.
CRM products enable banks to define and identify
customer and household relationships.
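The Norwest figure (3% of customers, 44% of profits) is a profit-concentration measure. A rough sketch of how such a figure could be computed is below; the per-customer profits are invented and `profit_share_of_top` is a hypothetical helper, not part of any CRM product.

```python
def profit_share_of_top(profits, top_fraction):
    """Share of total profit contributed by the most profitable top_fraction of customers."""
    ranked = sorted(profits, reverse=True)
    k = max(1, round(len(ranked) * top_fraction))  # number of customers in the top group
    return sum(ranked[:k]) / sum(ranked)


# Made-up per-customer profits: three large accounts and 97 small ones.
profits = [500, 300, 200] + [10] * 97
share = profit_share_of_top(profits, 0.03)
```

With this invented data the top 3% of customers account for roughly half of total profit.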
1-22
Retaining Good Customers
Customer loss:
Banks - Attrition
Cellular Phone Companies - Churn
study who might leave, why
Southern California Gas
– customer usage, credit information
– direct mail contact - most likely best billing plan
– who is price sensitive
Who should get incentives, whom to keep
1-23
Credit card management
Bank credit card marketing promotions typically generate 1,000
responses to mailed solicitations – a response rate of about 1%.
The rate is improved significantly through data mining analysis.
DM tools used by banks include credit scoring which is a
quantified analysis of credit applicants with respect to
predictions of on-time loan repayment. (Data covering deposits,
savings, loans, credit card, insurance…).
These credit scores can be used to make accept/reject
recommendations, as well as to establish the size of a credit line.
ATMs could be set up to deliver electronic sales pitches
for products that a particular customer is likely to be interested
in.
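The improvement in response rate is usually expressed as lift: the response rate of a targeted segment divided by the baseline rate. The sketch below uses the slide's baseline (1,000 responses at about 1%, which implies roughly 100,000 pieces mailed); the targeted-segment numbers are hypothetical, as are the function names.

```python
def response_rate(responses, mailed):
    """Fraction of mailed solicitations that produced a response."""
    return responses / mailed


def lift(segment_rate, baseline_rate):
    """How many times better the targeted segment responds than the overall mailing."""
    return segment_rate / baseline_rate


# From the slide: 1,000 responses at about 1% implies about 100,000 pieces mailed.
baseline = response_rate(1_000, 100_000)

# Hypothetical: a model selects 20,000 prospects, of whom 600 respond.
targeted = response_rate(600, 20_000)
targeted_lift = lift(targeted, baseline)
```

Under these assumed numbers the targeted mailing responds at about three times the baseline rate while mailing one fifth as many pieces.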
1-24
Fairbank & Morris
Credit card company’s most valuable asset:
INFORMATION ABOUT CUSTOMERS
Signet Banking Corporation
obtained behavioral data from many sources
built predictive models
aggressively marketed balance transfer card
First Union
who will move soon - improve retention
1-25
Telecommunications
Retention of customers for telemarketing is very difficult. The
phenomenon of a customer switching carriers is referred to as
churn, a fundamental concept in telemarketing as well as in
other fields.
One communications company estimated that 1/3 of churn is due
to poor call quality, and up to 1/2 is due to poor equipment.
A cellular fraud prevention system monitors traffic to spot problems
with faulty telephones. When a telephone begins to go bad,
telemarketing personnel are alerted to contact the customer and
suggest bringing the equipment in for service.
Another way to reduce churn is to protect customers from
subscription and cloning (duplication) fraud. Fraud prevention
systems provide verification that is transparent to legitimate
subscribers.
1-26
Human resource management
Business intelligence is a way to truly understand
markets, competitors, and processes.
Software technology such as data warehouses, data
marts, online analytical processing (OLAP), and data
mining can be used to improve a firm’s profitability.
In HRM, the analysis can lead to the identification of
individuals who are liable to leave the company unless
additional compensation or benefits are provided.
HRM would identify the right people so that
organizations could treat them well and retain them
(reduce churn).
1-27
Methodology and Tools
Analyzing data
Given management goals and that management
can translate knowledge into action
1-28
Basic Styles
Top-Down: HYPOTHESIS TESTING
SUPERVISED
have a theory, experiment to prove or disprove
SCIENCE
Bottom-Up: KNOWLEDGE DISCOVERY
UNSUPERVISED
start with data, see new patterns
CREATIVITY
1-29
Hypothesis Testing
Generate theory
Determine data needed
Get data
Prepare data
Build computer model
Evaluate model results
confirm or reject hypotheses
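As one possible instance of the final step (confirm or reject hypotheses), the sketch below tests the theory "a promotion raises the response rate" with a two-proportion z-test. The choice of test, the counts, and the 0.05 cutoff are all assumptions for illustration; the slides do not prescribe a particular statistical test.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two_sided_p) for H0: both groups share the same response rate."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value


# Hypothetical experiment: 1,000 customers saw the promotion, 1,000 did not.
z, p_value = two_proportion_z_test(120, 1000, 90, 1000)
reject_null = p_value < 0.05  # True means the data do not support "no difference"
```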
1-30
Generate Theory
Systematically tie different input sources
together (MENTAL MODEL)
What causes sales volume?
sales rep performance
economy, seasonality
product quality, price, promotion,
location
1-31
Generate Theory
Brainstorm:
diverse representatives for broad coverage of
perspectives (electronic)
keep under control (keep positive)
generate testable hypotheses
1-32
Define Data Needed
Determine data needed to test hypothesis
Lucky - query existing database
More often - gather
pull together from diverse databases, survey,
buy
1-33
Locate Data
Usually scattered or unavailable
Sources:
• warranty claims
• point-of-sale data (cash register records)
• medical insurance claims
• telephone call detail records
• direct mail response records
• demographic data, economic data
PROFILE: counts, summary statistics, cross-tabs, cleanup
1-34
Prepare Data for Analysis
Summarize: too much - no discriminant information
too little - swamped with useless detail
Process for computer: ASCII, Spreadsheet
Data encoding: how data are recorded can vary; data may have been collected for a specific purpose
Textual data: avoid if possible (may need to code)
Missing values: missing salary - use mean?
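The "use mean?" option for missing salaries can be sketched as follows. This shows only the mechanics, with made-up salaries and a hypothetical helper `impute_with_mean`; whether mean imputation is appropriate depends on the data.

```python
from statistics import mean


def impute_with_mean(values):
    """Replace None entries with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Made-up salaries; None marks a missing value.
salaries = [40_000, None, 60_000, 50_000, None]
filled = impute_with_mean(salaries)
```

Filling with the mean keeps every record usable but shrinks the variable's spread, which is one reason the slide poses it as a question rather than a rule.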
1-35
Build and Evaluate Model
Build Computer Model
Choose the appropriate modeling tools and algorithms
Training and test data sets.
Determine if hypotheses supported
statistical practice
test rule-based systems for accuracy
Requires both business and analytic knowledge
1-36
SUPERVISED
Dorn, National Underwriter Oct 18, 2004, 34,39
Health care fraud
Use statistics to identify
indicators of fraud or abuse
Can rapidly sort through large
databases
Identify patterns different from
norm
Moderately successful
But only effective on schemes
already detected
To benefit firm, need to identify
fraud before paying claim
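One very simple statistic for spotting "patterns different from norm" is a z-score check on claim amounts. The sketch below is illustrative only and is not the approach described in the cited article; the claim amounts are invented, and `flag_unusual` and its threshold of 2.0 are assumptions.

```python
from statistics import mean, stdev


def flag_unusual(amounts, threshold=2.0):
    """Return indexes of amounts more than `threshold` standard deviations from the mean."""
    mu = mean(amounts)
    sigma = stdev(amounts)
    return [i for i, a in enumerate(amounts) if abs(a - mu) / sigma > threshold]


# Made-up claim amounts; the last one is far above the rest.
claims = [120, 135, 110, 128, 140, 125, 130, 118, 122, 900]
flagged = flag_unusual(claims)
```

A flagged claim is only a candidate for review before payment, not proof of fraud.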
1-37
Knowledge Discovery
Machine learning?
Usually need intelligent analyst
Directed: explain value of some variable
Undirected: no dependent variable selected
identify patterns
Use undirected to recognize relationships;
use directed to explain once found
1-38
Directed
Goal-oriented
Examples:
If a discount applies, what is the impact on products?
Who is likely to purchase credit insurance?
Predicted profitability of a new customer
What to bundle with a particular package
Identify sources of preclassified data
Prepare data for analysis
Build & train computer model
Evaluate
1-39
Identify Data Sources
Best - existing corporate data warehouse
data clean, verified, consistent, aggregated
Usually need to generate
most data in form most efficient for designed
purpose
historical sales data often purged for dormant
customers (but you need that information)
1-40
Prepare Data
Put in needed format for computer
Make consistent in meaning
Need to recognize what data are missing
change in balance = new – old
add missing but known-to-be-important data
Divide data into training, test, evaluation
Decide how to treat outliers
statistically biasing, but may be most important
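Two of the preparation steps on this slide, deriving "change in balance = new – old" and dividing the data into training, test, and evaluation sets, can be sketched as below. The field names, the 60/20/20 proportions, and the fixed seed are assumptions chosen for the example.

```python
import random


def add_balance_change(records):
    """Add a derived field: change in balance = new - old."""
    return [{**r, "change": r["new_balance"] - r["old_balance"]} for r in records]


def three_way_split(records, train=0.6, test=0.2, seed=0):
    """Shuffle, then divide records into training, test, and evaluation sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = round(len(shuffled) * train)
    n_test = round(len(shuffled) * test)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_test],
        shuffled[n_train + n_test:],  # whatever remains is the evaluation set
    )


# Made-up account records.
records = [
    {"id": i, "old_balance": 100 * i, "new_balance": 100 * i + 25}
    for i in range(10)
]
records = add_balance_change(records)
training, test, evaluation = three_way_split(records)
```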
1-41
Build & Train Model
Regression - human builds (selects IVs)
Automatic systems train
give it data, let it hammer
OVERFITTING:
model fits the training data too closely to generalize
TEST SET a means to evaluate model against
data not used in training
tune weights before using to evaluate
1-42
Evaluate Model
ERROR RATE: proportion of classifications
in evaluation set that were wrong
too little training: poor fit on training data and
poor error rate
optimal training: good fit on both
too much training: great fit on training data
and poor error rate
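The contrast between fit on training data and error rate on the evaluation set can be shown with a toy example. Everything here is invented for illustration: the data, the `memorizer` (standing in for a model with too much training), and the `threshold_rule` (standing in for a reasonably trained model).

```python
def error_rate(model, examples):
    """Proportion of (value, label) examples the model classifies wrongly."""
    wrong = sum(1 for x, label in examples if model(x) != label)
    return wrong / len(examples)


def memorizer(training):
    """An overtrained 'model': recalls training labels exactly, answers 'no' otherwise."""
    table = dict(training)
    return lambda x: table.get(x, "no")


def threshold_rule(x):
    """A simple rule: predict 'yes' when the value is 50 or more."""
    return "yes" if x >= 50 else "no"


# Made-up data: 'yes' for values of 50 or more, with one noisy training record (40).
training = [(10, "no"), (30, "no"), (55, "yes"), (70, "yes"), (40, "yes")]
evaluation = [(20, "no"), (45, "no"), (60, "yes"), (80, "yes")]

memo = memorizer(training)
memo_train_err = error_rate(memo, training)             # 0.0: perfect fit on training data
memo_eval_err = error_rate(memo, evaluation)            # 0.5: misses unseen 'yes' cases
rule_train_err = error_rate(threshold_rule, training)   # 0.2: misses the noisy record
rule_eval_err = error_rate(threshold_rule, evaluation)  # 0.0
```

The memorizer looks better on the training data yet does worse on the evaluation set, which is the pattern the slide labels "too much training".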
1-43
Undirected Discovery
What items sell together? Strawberries & cream
Directed: What items sell with tofu? tabasco
Long distance caller market segmentation
Uniform usage - weekday & weekend, spikes
on holidays
After segmentation:
high & uniform except for several months of
nothing
1-44
UNSUPERVISED
Dorn, National Underwriter Oct 18, 2004, 34,39
Health care fraud
Look at historical claim
submissions
Build ad hoc model to
compare with current
claims
Assign similarity score to
fraudulent claims
Predict fraud potential
1-45
Undirected Process
Identify data sources
Prepare data
Build & train computer model
Evaluate model
Apply model to new data
Identify potential targets for undirected
Generate new hypotheses to test
1-46
Generate hypotheses
Any commonalities in data?
Are they useful?
Many adults watch children’s movies
chaperones are an important market segment
they probably make final decision
When hypothesis is generated, that determines
data needed
1-47
Bank Case Study
Directed knowledge discovery to recognize likely prospects
for home equity loan
training set - current loan holders
developed model for propensity to borrow
got continuous scores, ranked customers
sent top 11% material
Undirected: segmented market into clusters
in one, 39% had both business & personal
accounts
cluster had 27% of the top 11%
Hypothesis: people use home equity to start business
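The ranking step in this case study (continuous scores, ranked customers, material sent to the top 11%) can be sketched as below. The scores are made up and `top_fraction` is a hypothetical helper; the bank's actual propensity model is not described in the slides.

```python
def top_fraction(scores, fraction):
    """Return the customer ids with the highest scores, covering `fraction` of all customers."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    k = round(len(ranked) * fraction)
    return ranked[:k]


# Made-up propensity-to-borrow scores for 100 customers: customer i scores i / 100.
scores = {f"cust{i:03d}": i / 100 for i in range(100)}
mail_list = top_fraction(scores, 0.11)
```

The resulting list could then be compared against cluster membership from the undirected step, as the case study did when it found that one cluster held 27% of the top 11%.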
1-48
Data mining products and data sets
A good source to view current DM products is
www.KDnuggets.com.
The UCI Machine Learning Repository is a source of
very good data mining datasets at
www.ics.uci.edu/~mlearn/MLOther.html.
Weka DM software at
http://www.cs.waikato.ac.nz/ml/weka/
Tanagra DM software at http://eric.univ-lyon2.fr/~ricco/tanagra/index.html
1-49