Slide - Department of Industrial Engineering

Introduction to Data Mining
Massive quantities of data exist on computers
Data mining is a way to use these data to learn
Definition
• DATA MINING: exploration & analysis
– by automatic means
– of large quantities of data
– to discover actionable patterns & rules
• Data mining is a way to use massive
quantities of data that businesses generate
• GOAL - improve marketing, sales, customer
support through better understanding of
customers
McGraw-Hill/Irwin
©2007 The McGraw-Hill Companies, Inc. All rights reserved
Retail Outlets
• Bar coding & scanning generate
masses of data
– customer service
– inventory control
– MICROMARKETING
– CUSTOMER PROFITABILITY ANALYSIS
– MARKET-BASKET ANALYSIS
Political Data Mining
Grossman et al., 10/18/2004, Time, 38
• 2004 Election
– Republicans: VoterVault
• From Mid-1990s
• About 165 million voters
• Massive get-out-the-vote
drive for those expected to
vote Republican
– Democrats: Demzilla
• Also about 165 million voters
• Names typically have 200 to
400 information items
Medical Diagnosis
J. Morris, Health Management Technology Nov
2004, 20, 22-24
• Electronic Medical Records
– Associated Cardiovascular
Consultants
• 31 physicians
• 40,000 patients per year,
southern New Jersey
– Data mined to identify efficient
medical practice
– Enhance patient outcomes
– Reduced medical liability
insurance
Mayo Clinic
Swartz, Information Management Journal
Nov/Dec 2004, 8
• IBM developed EMR
program
– Complete records on almost
4.4 million patients
– Doctors can ask how the last 100 Mayo patients with the same gender, age, and medical history responded to particular treatments
Business Uses of Data Mining
1. Customer profiling
Identify profitability of customers
2. Targeting
Determine characteristics of most profitable
customers
3. Market-Basket Analysis
Determine correlation of purchases by profile
Part of Customer Relationship Management
Reasons Why Data Mining Is Now Effective
• Data are there
• Data are warehoused (computerized)
– Walmart: 35 thousand queries per week
• Computing economically available
• Competitive pressure
• Commercial products available
Trends
• Every business is service
– hotel chains record your
preferences
– car rental companies the same
– service versus price
• credit card companies
• long distance providers
• airlines
• computer retailers
Trends
• Mass Customization
– produce tailored products from
standardized components
• Levi-Strauss - custom fit jeans
• The Custom Foot
• Andersen Windows
• Individual, Inc.
– electronic clipping
– customer profiles of interests
– send custom newsletter
Trends
• Information as Product
– Custom Clothing Technology Corporation
• fit jeans, other clothing
– Lands End
– J. Crew
• INFORMATION BROKERING
– IMS - collects prescription data from pharmacies,
sells to drug firms
– AC Nielsen - TV
Trends
• Commercial Software Available
– using statistical, artificial intelligence tools
that have been developed
• Enterprise Miner - SAS
• Intelligent Miner - IBM
• Clementine - SPSS
• PolyAnalyst - Megaputer
• Specialty products
How Data Mining Is Being Used
• U.S. Government
– track down Oklahoma City bombers, Unabomber, many others
– Treasury department - international funds transfers, money laundering
– Internal Revenue Service
How Data Mining Is Used
• Safeway
– offer Safeway Savings Club card
• users given discounts
• users must give personal information
• every use, collect data
– identify aggregate patterns (what sells well together; what should be sold together)
• sell names for 5.5 cents per name to suppliers
How Data Mining Is Used
• Firefly
– asks members to rate music and movies
– subscribers clustered
– clusters get custom-designed recommendations
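A minimal sketch of the clustering idea above, not Firefly's actual method: plain k-means on rating vectors, using only the Python standard library. The member names, the three rated titles, and the starting centroids are all hypothetical.

```python
from math import dist

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = mean of its members (keep the old one if a cluster is empty).
        centroids = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else c
            for members, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical ratings (1-5) for three titles: (jazz album, rock album, action movie)
ratings = {
    "ann": (5, 1, 2),
    "bob": (4, 2, 1),
    "cat": (1, 5, 5),
    "dan": (2, 4, 5),
}
points = list(ratings.values())
centroids, clusters = kmeans(points, centroids=[points[0], points[2]])

# Label each member with the index of the nearest final centroid.
labels = {
    name: min(range(len(centroids)), key=lambda i: dist(vec, centroids[i]))
    for name, vec in ratings.items()
}
print(labels)
```

Members who share a label would then receive the same recommendation list; how that list is built is outside this sketch.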
Cross-selling
• USAA
– insurance
– doubled number of products held by
average customer due to data mining
– detailed records on customers
– predict products they might need
• Fidelity Investments
– regression - what makes customer loyal
Warranty Claims Routing
• Diesel engine manufacturer
– stream of warranty claims
– examine each by expert
• determine whether charges are reasonable &
appropriate
• think of expert system to automate claims
processing
Retaining Good Customers
• Customer loss:
– Banks - Attrition
– Cellular Phone Companies - Churn
• study who might leave, why
• Southern California Gas
– customer usage, credit information
– direct mail contact - most likely best billing plan
– who is price sensitive
• Who should get incentives, whom to
keep
Fairbank & Morris
• Credit card company’s most valuable asset:
– INFORMATION ABOUT CUSTOMERS
• Signet Banking Corporation
– obtained behavioral data from many sources
– built predictive models
– aggressively marketed balance transfer card
• First Union
– who will move soon - improve retention
Methodology
Analyzing data
Given management goals and that
management can translate knowledge into
action
Basic Styles
• Top-Down: HYPOTHESIS TESTING
– SUPERVISED
– have a theory, experiment to prove or
disprove
– SCIENCE
• Bottom-Up: KNOWLEDGE DISCOVERY
– UNSUPERVISED
– start with data, see new patterns
– CREATIVITY
Hypothesis Testing
• Generate theory
• Determine data needed
• Get data
• Prepare data
• Build computer model
• Evaluate model results
– confirm or reject hypotheses
Generate Theory
• Study
• Systematically tie different input sources
together (MENTAL MODEL)
– What causes sales volume?
• sales rep performance
• economy, seasonality
• product quality, price, promotion, location
Generate Theory
• Brainstorm:
– diverse representatives for broad coverage
of perspectives (electronic)
– keep under control (keep positive)
– generate testable hypotheses
Define Data Needed
• Determine data needed to test
hypothesis
– Lucky - query existing database
– More often - gather
• pull together from diverse databases, survey,
buy
Locate Data
• Usually scattered or unavailable
• Sources: warranty claims
point-of-sale data (cash register records)
medical insurance claims
telephone call detail records
direct mail response records
demographic data, economic data
• PROFILE: counts, summary statistics, cross-tabs,
cleanup
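A small sketch of the PROFILE step (counts, summary statistics, cross-tabs, cleanup candidates), assuming hypothetical point-of-sale records with made-up field values:

```python
from collections import Counter
from statistics import mean, median

# Hypothetical point-of-sale records: (region, payment type, amount)
records = [
    ("north", "card", 20.0),
    ("north", "cash", 5.0),
    ("south", "card", 35.0),
    ("south", "card", 40.0),
    ("south", None, 10.0),  # missing payment type: a cleanup candidate
]

# Counts per region
counts = Counter(region for region, _, _ in records)

# Summary statistics on the amount field
amounts = [amount for _, _, amount in records]
summary = {
    "min": min(amounts),
    "max": max(amounts),
    "mean": mean(amounts),
    "median": median(amounts),
}

# Cross-tab: region by payment type
crosstab = Counter((region, pay) for region, pay, _ in records)

# Cleanup: how many records lack a payment type
missing = sum(1 for _, pay, _ in records if pay is None)

print(counts, summary, crosstab, missing)
```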
Prepare Data for Analysis
• Summarize: too much - no discriminant information
too little - swamped with useless detail
• Process for computer: EBCDIC, ASCII
• Data encoding: how data are recorded can vary; may have been collected with specific purpose (CAL omitting LA)
• Textual data: avoid if possible (may need to code)
• Missing values: missing salary - use mean?
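The "use mean?" question can be illustrated with a short sketch. The salary figures are hypothetical; the point is that one very high value pulls the mean well above what most records look like, so the median is a common alternative fill value.

```python
from statistics import mean, median

# Hypothetical salaries; None marks a missing value
salaries = [30000, 32000, None, 35000, 250000]

known = [s for s in salaries if s is not None]
fill_mean = mean(known)      # pulled up by the single very high salary
fill_median = median(known)  # less sensitive to that outlier

# Mean imputation: replace each missing value with the mean of the known ones
imputed = [fill_mean if s is None else s for s in salaries]

print(fill_mean, fill_median, imputed)
```

Either choice invents a value that was never observed, so a flag recording which records were imputed is usually worth keeping alongside.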
Build Computer Model
• Convert mental model into quantitative model
– roamers less sensitive to price than others
• threshold defining roamer
• average price per call, or number of calls above price level
– families with children in high school most likely to respond to home equity loan offer
• identify families with, without high school age children
• past data - responded or didn’t
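One way to make the home equity hypothesis quantitative is to compare past response rates for the two groups. The sketch below uses a standard two-proportion z statistic; the mailing counts are hypothetical, and the usual 1.96 cutoff (5% two-sided) is a convention, not something the slides prescribe.

```python
from math import sqrt

def two_proportion_z(resp_a, n_a, resp_b, n_b):
    """z statistic for the difference between two response rates (pooled variance)."""
    p_a, p_b = resp_a / n_a, resp_b / n_b
    pooled = (resp_a + resp_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Hypothetical past mailing:
#   group A = families with high-school-age children, group B = all others
z = two_proportion_z(resp_a=60, n_a=1000, resp_b=30, n_b=1000)

# |z| above about 1.96 suggests the difference is unlikely to be chance alone
supported = abs(z) > 1.96
print(round(z, 2), supported)
```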
Evaluate Model
• Determine if hypotheses supported
– statistical practice
– test rule-based systems for accuracy
• Requires both business and analytic
knowledge
SUPERVISED
Dorn, National Underwriter Oct 18, 2004,
34,39
• Health care fraud
– Use statistics to identify
indicators of fraud or
abuse
– Can rapidly sort through
large databases
• Identify patterns
different from norm
– Moderately successful
• But only effective on
schemes already
detected
• To benefit firm, need to
identify fraud before
paying claim
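A rough sketch of "identify patterns different from norm": flag providers whose average claim lies far from the group mean, measured in standard deviations. The provider names, amounts, and the 1.5 cutoff are hypothetical, and real fraud screening would use many more indicators than one amount.

```python
from statistics import mean, pstdev

# Hypothetical average claim amount per provider
claims = {"p1": 100.0, "p2": 110.0, "p3": 95.0, "p4": 105.0, "p5": 400.0}

amounts = list(claims.values())
mu = mean(amounts)
sigma = pstdev(amounts)  # population standard deviation

# z-score: how many standard deviations each provider is from the norm
z_scores = {p: (amt - mu) / sigma for p, amt in claims.items()}

# Flag anything beyond the (assumed) cutoff for manual review before payment
flagged = [p for p, z in z_scores.items() if abs(z) > 1.5]
print(flagged)
```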
Knowledge Discovery
• Machine learning?
– Usually need intelligent analyst
• Directed: explain value of some variable
• Undirected: no dependent variable
selected
– identify patterns
• Use undirected to recognize
relationships; use directed to explain
once found
Directed
• Goal-oriented
• Examples: If discount applies, impact on products - who is likely to purchase credit insurance?
Predicted profitability of new customer - what to bundle with a particular package
• Identify sources of preclassified data
• Prepare data for analysis
• Build & train computer model
• Evaluate
Identify Data Sources
• Best - existing corporate data
warehouse
– data clean, verified, consistent, aggregated
• Usually need to generate
– most data in form most efficient for
designed purpose
– historical sales data often purged for
dormant customers (but you need that
information)
Prepare Data
• Put in needed format for computer
• Make consistent in meaning
• Need to recognize what data are
missing
change in balance = new – old
add missing but known-to-be-important data
• Divide data into training, test, evaluation
• Decide how to treat outliers
– statistically biasing, but may be most important
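Two of the steps above can be sketched briefly: deriving a missing field (change in balance = new - old) and dividing the data into training, test, and evaluation sets. The account records, the 60/20/20 proportions, and the function name are assumptions for illustration.

```python
import random

def split_three_ways(rows, train=0.6, test=0.2, seed=0):
    """Shuffle rows reproducibly, then cut into training, test, and evaluation sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train)
    n_test = round(len(rows) * test)
    return rows[:n_train], rows[n_train:n_train + n_test], rows[n_train + n_test:]

# Hypothetical account records with old and new balances
accounts = [{"id": i, "old": 100.0 * i, "new": 100.0 * i + 25.0} for i in range(10)]

# Derived field the raw data did not contain
for account in accounts:
    account["change"] = account["new"] - account["old"]

training, test, evaluation = split_three_ways(accounts)
print(len(training), len(test), len(evaluation))
```

The evaluation set is held back until the end so that the reported error rate comes from data the model never saw during training or tuning.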
Build & Train Model
• Regression - human builds (selects IVs)
• Automatic systems train
– give it data, let it hammer
• OVERFITTING:
– model fits the training data too closely, does poorly on new data
– TEST SET: a means to evaluate model against data not used in training
• tune weights before using to evaluate
Evaluate Model
• ERROR RATE: proportion of
classifications in evaluation set that
were wrong
• too little training: poor fit on training data
and poor error rate
• optimal training: good fit on both
• too much training: great fit on training
data and poor error rate
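The "too much training" case can be shown with a toy sketch. A 1-nearest-neighbor rule memorizes the training set, noise included, so its training error is zero while its evaluation error is poor; a simple threshold rule fits the training data less well but generalizes. All data points are hypothetical, with the true pattern assumed to be "label 1 when x > 5".

```python
def error_rate(model, labeled):
    """Proportion of classifications in `labeled` that the model gets wrong."""
    wrong = sum(1 for x, y in labeled if model(x) != y)
    return wrong / len(labeled)

# Hypothetical (x, label) pairs; (4, 1) and (9, 0) are noisy training records
training = [(1, 0), (2, 0), (3, 0), (4, 1), (6, 1), (7, 1), (8, 1), (9, 0)]
evaluation = [(3.8, 0), (4.4, 0), (8.8, 1), (9.4, 1), (2.0, 0), (7.0, 1)]

def one_nearest_neighbor(x):
    """Memorizing model: return the label of the closest training record."""
    return min(training, key=lambda row: abs(row[0] - x))[1]

def threshold_rule(x):
    """Simple model: a single cut point."""
    return 1 if x > 5 else 0

for name, model in [("1-NN", one_nearest_neighbor), ("threshold", threshold_rule)]:
    print(name, error_rate(model, training), error_rate(model, evaluation))
```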
Undirected Discovery
• What items sell together? Strawberries & cream
– Directed: What items sell with tofu? tabasco
• Long distance caller market segmentation
– Uniform usage - weekday & weekend, spikes on
holidays
– After segmentation:
high & uniform except for several months of
nothing
high credit worthiness & profitability college
students
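The "what items sell together?" question can be sketched with pair counts over baskets. The baskets below are hypothetical; support and confidence are the usual market-basket measures, and `sells_with` is an illustrative helper name for the directed version (what sells with one chosen item).

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets
baskets = [
    {"strawberries", "cream", "sugar"},
    {"strawberries", "cream"},
    {"tofu", "tabasco"},
    {"tofu", "tabasco", "rice"},
    {"cream", "coffee"},
]

item_counts = Counter(item for b in baskets for item in b)
pair_counts = Counter(
    frozenset(pair) for b in baskets for pair in combinations(sorted(b), 2)
)

def support(a, b):
    """Share of all baskets that contain both items."""
    return pair_counts[frozenset((a, b))] / len(baskets)

def confidence(a, b):
    """Share of baskets containing a that also contain b."""
    return pair_counts[frozenset((a, b))] / item_counts[a]

def sells_with(item):
    """Directed question: items bought with `item`, strongest first."""
    partners = [
        (other, confidence(item, other))
        for other in item_counts
        if other != item and pair_counts[frozenset((item, other))] > 0
    ]
    return sorted(partners, key=lambda t: (-t[1], t[0]))

# Undirected: the most frequent pair overall
top_pair = pair_counts.most_common(1)[0]
print(top_pair, sells_with("tofu"))
```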
UNSUPERVISED
Dorn, National Underwriter Oct 18, 2004, 34,39
• Health care fraud
– Look at historical claim
submissions
• Build ad hoc model to
compare with current claims
– Assign similarity score to
fraudulent claims
– Predict fraud potential
Undirected Process
• Identify data sources
• Prepare data
• Build & train computer model
• Evaluate model
• Apply model to new data
• Identify potential targets for undirected
• Generate new hypotheses to test
Identify potential targets
• Why
• Who
• When
Generate hypotheses
• Any commonalities in data?
• Are they useful?
– Many adults watch children’s movies
• chaperones are an important market segment
• they probably make final decision
• When hypothesis is generated, that
determines data needed
Bank Case Study
• Directed knowledge discovery to recognize likely
prospects for home equity loan
• training set - current loan holders
• developed model for propensity to borrow
• got continuous scores, ranked customers
• sent top 11% material
• Undirected: segmented market into clusters
• in one, 39% had both business & personal
accounts
• cluster had 27% of the top 11%
• Hypothesis: people use home equity to start business
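The ranking step in this case study can be sketched as follows: score every customer, keep the top 11%, then see how that group is spread across clusters. The scores and cluster assignments are hypothetical stand-ins; the bank's actual propensity model and clustering are not described in the slides.

```python
from collections import Counter

def top_fraction(scored, fraction):
    """Return the highest-scoring `fraction` of (customer, score) pairs."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return ranked[:round(len(ranked) * fraction)]

# Hypothetical propensity scores for 100 customers
scores = [(customer, customer / 100) for customer in range(100)]

# Hypothetical cluster assignment from a separate undirected segmentation
cluster_of = {c: ("A" if c % 3 == 0 else "B") for c in range(100)}

# Directed step: mail the top 11% by score
selected = top_fraction(scores, 0.11)

# Undirected follow-up: what share of the mailed group falls in each cluster?
in_cluster = Counter(cluster_of[c] for c, _ in selected)
share = {name: n / len(selected) for name, n in in_cluster.items()}
print(len(selected), share)
```

A cluster whose share of the top group is well above its share of all customers is a natural source of new hypotheses, as with the business-account cluster above.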