Transcript Data sets

Final Project
Data sets
Visit web site:
http://www.kdnuggets.com/datasets/index.html
This is an online repository of large data sets which
encompasses a wide variety of data types, analysis
tasks, and application areas. The primary role of this
repository is to enable researchers in knowledge
discovery and data mining to scale existing and future
data analysis algorithms to very large and complex
data sets.
http://kdd.ics.uci.edu/
Data sets
Data Sets
by application area
by name
by date (reverse chronological)
Machine Learning Repository
Task Files
by task type
by application area
by name
by date (reverse chronological)
by data type
Report & Presentation
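Written report (50%) + presentation (50%) ==> counts as the final exam grade
Groups of 4 students
Written report (at least 8 pages, cover not included)
Presentation: 15 minutes + questions (5 minutes). Presenting students
do not ask questions; all other students must answer the questions. Answers
need not be immediate and may be given before the end of class.
One class session is used for discussion and questions, and for settling the
chosen database in advance. (It may be changed within one week.)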
Business Data Mining Applications
Business Data Mining Applications
Partial representative sample of applications
Catalog sales
CRM
Credit scoring
Banking (loans)
Investment risk
Insurance
Fingerhut
Founded 1948
today sends out 130 different catalogs
to over 65 million customers
6 terabyte data warehouse
3,000 variables on the 12 million most active customers
over 300 predictive models
Focused marketing
Fingerhut
Purchased by Federated Department Stores for $1.7
billion in 1999 (for database)
Fingerhut had $1.6 to $2 billion business per year,
targeted at lower income households
Can mail 400,000 packages per day
Each product line has its own catalog
Fingerhut
Uses segmentation, decision tree, regression,
neural network tools from SAS and SPSS
Segmentation - combines order & demographic
data with product offerings
can target mailings to greatest payoff
customers who had recently moved tripled
their purchasing 12 weeks after the move
send furniture, telephone, decoration catalogs
Data for SEGMENTATION
(grocery and dine out are indices; the cluster column is blank)
subj  age  income  marital  grocery  dine out  savings
1001  53    80000  wife         180        90    30000
1002  48   120000  husband      120       110    20000
1003  32    90000  single        30       160     5000
1004  26    40000  wife          80        40        0
1005  51    90000  wife         110        90    20000
1006  59   150000  wife         160       120    30000
1007  43   120000  husband      140       110    10000
1008  38   160000  wife          80       130    15000
1009  35    70000  single        40       170     5000
1010  27    50000  wife         130        80        0
Initial Look at Data
Want to know features of those who spend a lot
dining out
INCLUDE AS MANY ACTIONABLE
VARIABLES AS POSSIBLE
things you can identify
Manipulate data
sort on most likely indicator (dine out)
Sorted by Dine Out
(grocery and dine out are indices; the cluster column is blank)
subject  age  income  marital  grocery  dine out  savings
1004     26    40000  wife          80        40        0
1010     27    50000  wife         130        80        0
1001     53    80000  wife         180        90    30000
1005     51    90000  wife         110        90    20000
1002     48   120000  husband      120       110    20000
1007     43   120000  husband      140       110    10000
1006     59   150000  wife         160       120    30000
1008     38   160000  wife          80       130    15000
1003     32    90000  single        30       160     5000
1009     35    70000  single        40       170     5000
Analysis
Best indicators
marital status
groceries
Available
marital status might be easier to get
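The sort-and-inspect step on the ten segmentation subjects can be sketched in a few lines of Python. This is only an illustration of the manual analysis on the slides, not a segmentation tool; the tuple layout and the helper name `mean_dine_out_by_marital` are assumptions made for the example.

```python
from statistics import mean

# Rows copied from the segmentation data:
# (subj, age, income, marital, grocery index, dine-out index, savings)
rows = [
    (1001, 53, 80000, "wife", 180, 90, 30000),
    (1002, 48, 120000, "husband", 120, 110, 20000),
    (1003, 32, 90000, "single", 30, 160, 5000),
    (1004, 26, 40000, "wife", 80, 40, 0),
    (1005, 51, 90000, "wife", 110, 90, 20000),
    (1006, 59, 150000, "wife", 160, 120, 30000),
    (1007, 43, 120000, "husband", 140, 110, 10000),
    (1008, 38, 160000, "wife", 80, 130, 15000),
    (1009, 35, 70000, "single", 40, 170, 5000),
    (1010, 27, 50000, "wife", 130, 80, 0),
]

def mean_dine_out_by_marital(data):
    """Average dine-out index for each marital category (hypothetical helper)."""
    groups = {}
    for _, _, _, marital, _, dine_out, _ in data:
        groups.setdefault(marital, []).append(dine_out)
    return {label: mean(values) for label, values in groups.items()}

# Sort on the most likely indicator, as on the "Sorted by Dine Out" slide.
by_dine_out = sorted(rows, key=lambda r: r[5])
means = mean_dine_out_by_marital(rows)

for label, value in sorted(means.items(), key=lambda kv: kv[1]):
    print(f"{label:8s} {value:6.1f}")
```

On this tiny sample the two single subjects have the highest dine-out indices and the lowest grocery indices, which is the pattern the slide points to.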
Fingerhut
Mailstream optimization
which customers are most likely to respond to
existing catalog mailings
saves nearly $3 million per year
reversed trend of catalog sales industry in 1998
reduced mailings by 20% while increasing net
earnings to over $37 million
LIFT
LIFT = probability of the class in the sample divided by the
probability of the class in the population
if population probability is 20% and
sample probability is 30%,
LIFT = 0.3/0.2 = 1.5
Best lift not necessarily best
need sufficient sample size
as confidence increases, longer list but lower lift
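The lift ratio is simple enough to check in code. The function name `lift` and its argument names are assumptions for this sketch; the 30% and 20% figures are the ones on the slide.

```python
def lift(sample_rate, population_rate):
    """Lift = response rate in the selected sample / response rate in the population."""
    if population_rate <= 0:
        raise ValueError("population_rate must be positive")
    return sample_rate / population_rate

# Slide example: population probability 20%, sample probability 30%.
print(round(lift(0.30, 0.20), 2))
```

A lift above 1 means the selected group responds more often than the population as a whole; a lift of 1 is no better than random selection.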
Lift Example
Product to be promoted
Sampled over 10 identifiable segments of potential
buying population
Profit $50 per item sold
Mailing cost $1
Sorted by Estimated response rates
Lift Data
Seg  Rate   Rev    Cost  Profit
1    0.042  $2.10  $1     $1.10
2    0.035  $1.75  $1     $0.75
3    0.025  $1.25  $1     $0.25
4    0.017  $0.85  $1    -$0.15
5    0.015  $0.75  $1    -$0.25
6    0.013  $0.65  $1    -$0.35
7    0.009  $0.45  $1    -$0.55
8    0.005  $0.25  $1    -$0.75
9    0.004  $0.20  $1    -$0.80
10   0.001  $0.05  $1    -$0.95
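The per-segment figures in the lift data follow from the $50 profit per item and the $1 mailing cost. The sketch below recomputes them and finds where cumulative profit peaks. It works per mailed piece, which amounts to assuming one mailing per segment (equal-sized segments); that assumption and the function names are not from the slides.

```python
PROFIT_PER_SALE = 50.0   # profit per item sold (from the slide)
MAIL_COST = 1.0          # cost per mailing (from the slide)

# Estimated response rates for segments 1..10, already sorted high to low.
rates = [0.042, 0.035, 0.025, 0.017, 0.015, 0.013, 0.009, 0.005, 0.004, 0.001]

def per_mailing_profit(rate):
    """Expected revenue per mailing minus the mailing cost."""
    return rate * PROFIT_PER_SALE - MAIL_COST

def cumulative_profit(segment_rates):
    """Running profit total when mailing segment 1, then 1-2, then 1-3, ..."""
    total, out = 0.0, []
    for rate in segment_rates:
        total += per_mailing_profit(rate)
        out.append(total)
    return out

cum = cumulative_profit(rates)
# Segment number (1-based) after which cumulative profit is highest.
best_cutoff = max(range(len(cum)), key=lambda i: cum[i]) + 1

for seg, value in enumerate(cum, start=1):
    print(seg, round(value, 2))
print("mail segments 1 through", best_cutoff)
```

Under these assumptions cumulative profit stops growing once the per-mailing profit turns negative, so mailing beyond the last profitable segment only lowers the total.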
Lift Chart
[Figure: LIFT chart. Cumulative proportion (0 to 1.2) by segment (0 to 10); series: Cum Response, Random]
Profit Impact
[Figure: PROFIT chart. Dollars (-4 to 12) by segment (0 to 10); series: Cum Revenue, Cum Cost, Cum Profit]
RFM
Recency, Frequency, Monetary
Same purpose as lift
Identify customers more likely to respond
RFM tracks customer transactions by its 3 measures
Code each customer
Often 5 cells for each measure, or 125 combinations
Identify positive response of each of the
combinations
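One way to code customers into 5 x 5 x 5 = 125 RFM cells is to rank them on each measure and cut the ranking into fifths. The sketch below does that and then tabulates the response rate per cell. The customer records, the responder set, and all function names are made up for illustration; real RFM schemes may draw the cell boundaries differently.

```python
from collections import Counter

def quintile_scores(values, higher_is_better=True):
    """Score each value 1..5 by rank; 5 marks the best fifth."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i], reverse=not higher_is_better)
    scores = [0] * n
    for rank, i in enumerate(order):
        scores[i] = rank * 5 // n + 1
    return scores

def rfm_cells(customers):
    """Map customer id -> (R, F, M) cell; recency is days since last purchase."""
    ids = list(customers)
    r = quintile_scores([customers[c][0] for c in ids], higher_is_better=False)
    f = quintile_scores([customers[c][1] for c in ids])
    m = quintile_scores([customers[c][2] for c in ids])
    return {c: (r[k], f[k], m[k]) for k, c in enumerate(ids)}

def response_rate_by_cell(cells, responded):
    """Share of customers in each cell who responded to a past promotion."""
    totals, hits = Counter(), Counter()
    for cust, cell in cells.items():
        totals[cell] += 1
        if cust in responded:
            hits[cell] += 1
    return {cell: hits[cell] / totals[cell] for cell in totals}

# Hypothetical data: id -> (days since last purchase, purchase count, total spend)
customers = {
    "A": (10, 12, 900.0),
    "B": (200, 1, 40.0),
    "C": (45, 5, 300.0),
    "D": (90, 3, 150.0),
    "E": (400, 2, 80.0),
}
cells = rfm_cells(customers)
rates = response_rate_by_cell(cells, responded={"A", "C"})
print(cells)
```

Cells with high observed response rates are the ones to target in the next mailing.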
CUSTOMER RELATIONSHIP MANAGEMENT
(CRM)
understanding value customer provides to firm
Kathleen Khirallah - The Tower Group
Banks will spend $9 billion on CRM by end of 1999
Deloitte
only 31% of senior bank executives confident that
their current distribution mix anticipated customer
needs
Customer Value
Middle age (41-55), 3-9 years on job, 3-9 years in town, savings account
year  annual purchases  profit  discounted (1.3 rate)  net
1     1000              200     153                    153
2     1000              200     118                    272
3     1000              200      91                    363
4     1000              200      70                    433
5     1000              200      53                    487
6     1000              200      41                    528
7     1000              200      31                    560
8     1000              200      24                    584
9     1000              200      18                    603
10    1000              200      14                    618
Younger Customer
Young (21-29), 0-2 years on job, 0-2 years in town, no savings account
year  annual purchases  profit  discounted (1.3)  net
1      300               60     46                 46
2      360               72     43                 89
3      432               86     39                128
4      518              104     36                164
5      622              124     34                198
6      746              149     31                229
7      896              179     29                257
8     1075              215     26                284
9     1290              258     24                308
10    1548              310     22                331
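The two customer-value tables can be reproduced by discounting each year's profit by 1.3 raised to the year number and keeping a running total. The reading of "1.3 rate" as a per-year discount factor, the 20% profit margin, and the 20% yearly purchase growth for the younger customer are inferred from the table values rather than stated on the slides; the function names are assumptions for this sketch.

```python
def discounted(profits, factor=1.3):
    """Discount each year's profit by factor**year (year 1 is the first year)."""
    return [p / factor ** year for year, p in enumerate(profits, start=1)]

def running_total(values):
    """Cumulative sum: the 'net' column of the tables."""
    total, out = 0.0, []
    for v in values:
        total += v
        out.append(total)
    return out

# Middle-aged customer: $1000 of purchases and $200 of profit every year.
middle_aged_profit = [200.0] * 10
# Younger customer: purchases start at $300 and grow 20% a year; profit is 20%.
young_profit = [0.20 * 300 * 1.2 ** k for k in range(10)]

middle_aged_net = running_total(discounted(middle_aged_profit))
young_net = running_total(discounted(young_profit))

for year in range(10):
    print(year + 1, round(middle_aged_net[year]), round(young_net[year]))
```

The computed totals agree with the slide values to within a dollar; the small differences come from rounding versus truncation in the tables.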
Lifetime Value Application
Drew et al. (2001), Journal of Service Research 3:3
Cellular telephone division, major US telecommunications
firm
Data on billing, usage, demographics
Neural net model of churn proportion by month of tenure
  • 36 tenure classes
Tested model on 21,500 subscribers
  • April 1998
  • Trained on 15,000, tested on 6,500
Customer Tenure Segments
1. Least likely to churn
  • Left alone
2. Slight propensity to churn at end of tenure
  • Moderate pre-expiration marketing
3. Large spike in churn at expiration
  • Concentrated marketing efforts before expiration
4. Highest risk
  • Continued competitive offers
CREDIT SCORING
Data warehouse including demand deposits, savings,
loans, credit cards, insurance, annuities, retirement
programs, securities underwriting, other
Statistical & mathematical models (regression) to
predict repayment
CREDIT SCORING
Bank Loan Applications
Age  Income  Assets  Debts   Want   On-time
24   55557    27040   48191   1500  1
20   17152    11090   20455    400  1
20   85104        0   14361   4500  1
33   40921    91111   90076   2900  1
30   76183   101162  114601   1000  1
55   80149   511937   21923   1000  1
28   26169    47355   49341   3100  0
20   34843        0   21031   2100  1
20   52623        0   23054  15900  0
39   59006   195759  161750    600  1
Credit Card Management
Very profitable industry
Card surfing - pay old balance with new card
Promotions typically generate 1000 responses,
about 1%
In early 1990s, almost all mass marketing
Data mining improves (lift)
British Credit Card Company
Monthly credit data
  • Didn’t want those who paid in full (no profit)
Application scoring
  • Continued what had been done manually for over 50 years
Behavioral scoring
  • Monitor revolving credit accounts for early warning
90,000 customers
  • State variable: cumulative months of missed repayment
  • Selected sample of 10,000 observations
  • Initial state all 0 in selected data
  • Over 70% of customers never left state 0
Analysis
Clustering
Unsupervised partitioning
K-median to get more stable results
Pattern search
Sought patterns from object grouping
Unexpectedly large number of similar objects
Estimated probability of each case belonging to
objects
Comparison
Compared clustering partitions with pattern search
groupings
Pattern search identified those behaving in
anomalous manner
Banking
Among first users of data mining
Used to find out what motivates their customers
(reduce churn)
Loan applications
Target marketing
Norwest: 3% of customers provided 44% profits
Bank of America: program cultivating top 10% of customers
CHURN
Customer turnover
Critical to:
telecommunications
banks
human resource management
retailers
Characteristics of Not On-Time
Age  Income  Assets  Debts  Want   On-time
28   26169   47355   49341   3100  0
20   52623       0   23054  15900  0
Here, Debts exceed Assets
Age Young
Income Low
BETTER: Base on statistics, large sample
supplement data with other relevant variables
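A quick check shows why a rule read off the two late payers is too weak on its own. The sketch applies "debts exceed assets" to all ten loan applications and counts how many flagged applicants actually paid late. The tuple layout and variable names are assumptions for the example; the figures are the ones in the Bank Loan Applications table.

```python
# (age, income, assets, debts, want, on_time) from the Bank Loan Applications table
applications = [
    (24, 55557, 27040, 48191, 1500, 1),
    (20, 17152, 11090, 20455, 400, 1),
    (20, 85104, 0, 14361, 4500, 1),
    (33, 40921, 91111, 90076, 2900, 1),
    (30, 76183, 101162, 114601, 1000, 1),
    (55, 80149, 511937, 21923, 1000, 1),
    (28, 26169, 47355, 49341, 3100, 0),
    (20, 34843, 0, 21031, 2100, 1),
    (20, 52623, 0, 23054, 15900, 0),
    (39, 59006, 195759, 161750, 600, 1),
]

# Candidate rule suggested by the two not-on-time cases: debts exceed assets.
flagged = [a for a in applications if a[3] > a[2]]
late_flagged = [a for a in flagged if a[5] == 0]
late_total = [a for a in applications if a[5] == 0]

print(len(flagged), "flagged;", len(late_flagged), "of them late;",
      len(late_total), "late overall")
```

The rule catches both late payers but also flags several applicants who repaid on time, which is the reason for basing the model on statistics from a large sample and adding other relevant variables.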
Identify Characteristics of Those Who Leave
Age (years)  Time-job (months)  Time-town (months)  min bal ($)  checking
27            12                12                   549         x
41            18                41                  3259         x
28             9                15                   286         x
55           301                 5                  2854         x
43            18                18                  1112         x
29             6                 3                     0         x
38            55                20                   321         x
63           185                 3                  2175         x
26            15                15                   386         x
46            13                12                  1187         x
37            32                25                  1865         x
savings card
x
x
x
x
x
x
x
x
x
x
x
x
loan
x
x
x
x
x
Analysis
What are the characteristics of those who leave?
Correlation analysis
Which customers do you want to keep?
Customer value - net present value of customer to the
firm
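Correlation analysis on the customer attributes can be sketched with a hand-written Pearson coefficient. The age and minimum-balance columns below are copied from the table of customers who left; the function name `pearson` is an assumption, and the result is not claimed to match any entry of the correlation matrix shown on the slides.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of the same length, at least 2 long")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Columns from the "Those Who Leave" table.
age = [27, 41, 28, 55, 43, 29, 38, 63, 26, 46, 37]
min_bal = [549, 3259, 286, 2854, 1112, 0, 321, 2175, 386, 1187, 1865]

print(round(pearson(age, min_bal), 2))
```

Running the same function over every pair of columns gives a full correlation matrix; with only eleven rows the coefficients are rough indications at best.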
Correlation
Age
Age
1.0
Job
1.0
Town
Min-Bal
Check
Saving
Card
Loan
Time
Job
0.6
0.9
Time
Town
0.4
-0.6
1.0
min-bal check
saving
card loan
-0.4
0.1
-0.5
1.0
0.4
0.9
0.3
0.3
0.5
1.0
0.2 0.3
-0.2
0.5 0.4
0.6 -0.1
0.2 0.2
0.9 0.3
1.0 0.5
1.0
0.0
0.6
-0.1
-0.2
1.0
Bankruptcy Prediction
Sung et al. (1999), Journal of MIS 16:1
Late 20th-century, East Asian corporate bankruptcy critical
Models built for normal & crisis conditions
Used decision tree models for explanation
  • Discriminant analysis applied to benchmark
Korean corporations
  • Data for all bankrupt corporations on Korean Stock Exchange, 2nd quarter 1997 to 1st quarter 1998
  • 75 such cases, full data on 30 of those
  • Normal 2nd Qtr 1991 to 1st Qtr 1995
  • 56 firms, full data on 26
Korean Bankruptcy Study
Matched bankrupt firms with one or two
nonbankrupt firms that had similar assets and size
56 financial ratios used
Eliminated 16 due to duplication
Financial Ratios
Growth (5)
Profitability (13)
Leverage (9)
Efficiency (6)
Productivity (7)
DV 0/1 variable of bankruptcy or not
Multivariate Discriminant Analysis
Used stepwise procedure
NORMAL PERIOD
Normal = 0.58 * cash flow/assets
+ 0.0623 * productivity of capital
- 0.006 * average inventory turnover
BANKRUPT PERIOD
Bankrupt = 0.053 * cash flow/liabilities
+ 0.056 * productivity of capital
+ 0.014 * fixed assets/(equity+LT liab)
Decision Tree Models
Used C4.5
Applied boosting to improve predictive power,
improved prediction success
NORMAL RULES
IF productivity of capital > 19.65 THEN OK
IF cash flow/total assets > 5.64 THEN OK
IF cash flow/total assets ≤ 55.64 & productivity of
capital ≤ 19.65 THEN bankrupt
CRISIS RULES
IF productivity of capital > 20.61 THEN OK
IF cash flow/liabilities > 2.64 THEN OK
IF fixed assets/(equity+long-term invest) > 87.23 THEN
OK
IF cash flow/liabilities ≤ 2.64
AND productivity of capital ≤20.61
AND fixed assets/(equity+long-term invest) ≤ 87.23
THEN bankrupt
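The crisis rules above translate directly into a small classifier: a firm is rated OK if it clears any one of the three thresholds and bankrupt otherwise. The function and argument names are assumptions for this sketch, the thresholds are the ones listed on the slide, and the example inputs are made up.

```python
def crisis_rule(productivity_of_capital, cash_flow_to_liabilities, fixed_assets_ratio):
    """Apply the CRISIS RULES; fixed_assets_ratio is fixed assets/(equity+long-term invest)."""
    if productivity_of_capital > 20.61:
        return "OK"
    if cash_flow_to_liabilities > 2.64:
        return "OK"
    if fixed_assets_ratio > 87.23:
        return "OK"
    # All three ratios are at or below their thresholds.
    return "bankrupt"

# Hypothetical firms, not taken from the study.
print(crisis_rule(25.0, 0.0, 0.0))    # clears the productivity threshold
print(crisis_rule(10.0, 1.0, 50.0))   # clears none of the thresholds
```

Writing the tree as plain conditions is what makes decision tree models useful for explanation: each prediction can be traced to a specific ratio and cutoff.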
Comparison
Model      Correct Bankrupt  Correct OK  Overall  Variables
DA-normal  0.69              0.90        0.82     3
DA-crisis  0.53              0.85        0.74     3
DT-normal  0.72              0.90        0.83     8
DT-crisis  0.67              0.89        0.81     6
Mortgage Market
Early 1990s - massive refinancing
Need to keep customers happy to retain
Contact current customers who have rates
significantly higher than market
a major change in practice
data mining & telemarketing increased Crestar
Mortgage’s retention rate from 8% to over 20%
Country Investment Risk
Outcome categories:
1. Most safe
2. Developed
3. Mature emerging markets
4. New emerging markets
5. Frontier
Investment Risk Analysis
Becerra-Fernandez et al. (2002) Computers and Industrial Engineering 43
Risk by country
  • Expert assessment available
Decision tree (C5), neural network models
Data:
  • Economic indicators (4)
  • Depth & liquidity (4)
  • Performance & value (5)
  • Economic & market risk (4)
  • Regulation & efficiency (4)
52 samples, so used bootstrapping
Models
Decision trees
Pruning rate 50%
Pruning rate 75%
Neural networks
Backpropagation
Fuzzy (ARTMAP)
Learning vector quantization
Results
Decision tree algorithms more accurate
Lower pruning rate – lowest error rate
Neural networks disadvantaged by small data set
Decision tree algorithms consistently optimistic
relative to expert ratings
Banking
Fleet Financial Group
$30 million data warehouse
hired 60 database marketers, statistical/quantitative
analysts & DSS specialists
expected to add $100 million in profit by 2001
Banking
First Union
concentrated on contact point
previously had very focused product groups, little
coordination
Developed offers for customers
INSURANCE
Marketing, as in retailing & banking
Special:
Farmers Insurance Group - underwriting system
generating $ millions in higher revenues, lower
claims
7 databases, 35 million records
better understanding of market niches
lower rates on sports cars, increasing business
Insurance Fraud
Specialist criminals - multiple personas
InfoGlide specializes in fraud detection products
Similarity search engine
link names, telephone numbers, streets, birthdays,
variations
identify 7 times more fraud than exact-match systems
Insurance Fraud - Link Analysis
claim type  amount  physician  attorney
back         50000  Welby      McBeal
neck         80000  Frank      Jones
arm          40000  Barnard    Fraser
neck         80000  Frank      Jones
leg          30000  Schmidt    Mason
multiple    120000  Heinrich   Feiffer
neck         80000  Frank      Jones
back         60000  Schwartz   Nixon
arm          30000  Templer    White
internal    180000  Weiss      Richards
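The link-analysis idea can be illustrated by counting how often the same physician and attorney appear together on a claim. This is only a sketch of the principle on the ten claims listed above, not how any commercial fraud product works; the function name `repeated_links` and the `min_count` parameter are assumptions.

```python
from collections import Counter

# (claim type, amount, physician, attorney) from the link-analysis table
claims = [
    ("back", 50000, "Welby", "McBeal"),
    ("neck", 80000, "Frank", "Jones"),
    ("arm", 40000, "Barnard", "Fraser"),
    ("neck", 80000, "Frank", "Jones"),
    ("leg", 30000, "Schmidt", "Mason"),
    ("multiple", 120000, "Heinrich", "Feiffer"),
    ("neck", 80000, "Frank", "Jones"),
    ("back", 60000, "Schwartz", "Nixon"),
    ("arm", 30000, "Templer", "White"),
    ("internal", 180000, "Weiss", "Richards"),
]

def repeated_links(rows, min_count=2):
    """Physician-attorney pairs that occur on at least min_count claims."""
    pairs = Counter((physician, attorney) for _, _, physician, attorney in rows)
    return {pair: n for pair, n in pairs.items() if n >= min_count}

suspicious = repeated_links(claims)
print(suspicious)
```

The one repeated pair also shares the same claim type and amount on every claim, which is the kind of link worth investigating.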
Insurance Fraud
Analytics’ NetMap for Claims
  • uses industrywide database
  • creates data mart of internal, external data
  • unusual activity for specific chiropractors, attorneys
HNC Insurance Solutions
  • workers compensation fraud
VeriComp - predictive software (neural nets)
  • saved Utah over $2 million
Insurance Data Mining Examples
Smith et al. (2000) Journal of the Operational Research Society 51:5
Large data warehouse system
Recorded every transaction & claim
Data mining to predict average claim costs &
frequency, impact on profitability
Pricing
Customer Retention Analysis
Over 20,000 motor vehicle policies due for
renewal in one month
About 7% didn’t renew
Expected reasons: price, service, value of vehicle
Customer Retention Results
Data Mining
Enterprise Miner
Used data exploration to select variables (13)
Used log transforms for highly skewed data
Performed logistic regression, decision trees, neural
networks
Neural network fit test set best
But low correct rate for termination
Claims Analysis
Recent growth in policies
  • Lower profitability
  • Could improve by lowering frequency, reducing claim amounts
Data over a three-year period
Sample size well over 100,000 per quarter
Descriptive statistics:
  • High growth in young people, insurance over $40,000
Claims Models
Clustering
Predict group policy claims behavior
Used 50 clusters
K-means algorithm
Identified several clusters with abnormal cost
ratios or frequency size
TELECOMMUNICATIONS
Deregulation - widespread competition
churn
1/3 poor call quality, 1/2 poor equipment
wireless performance monitor tracking
reduced churn about 61%, $580,000/year
cellular fraud prevention
spot problems when cell phones begin to go bad
Telecommunications
Metapath’s Communications Enterprise
Operating System
help identify telephone customer problems
dropped calls, mobility patterns,
demographics
to target specific customers
reduce subscription fraud
$1.1 billion
reduce cloning fraud
cost $650 million in 1996
Telecommunications
Churn Prophet, ChurnAlert
data mining to predict subscribers who cancel
Arbor/Mobile
set of products, including churn analysis
TELEMARKETING
MCI uses data marts to extract data on
prospective customers
typically a 2-month program
20% improvement in sales leads
multimillion investment in data marts & hardware
staff of 45
trend spotting (which approaches specific
customers)
Telemarketing
Australian Tourist Commission
maintained database since 1992
responses to travel inquiries on tours, hotels, airlines,
travel agents, consumers
data mine to identify travel agents & consumers
responding to various media
sales closure rate at 10% and up
lead lists faxed weekly to productive travel agents
Telemarketing
Segmentation
Which customers respond to new promotions, to
discounts, to new product offers
Determine
whom to offer new service to
those most likely to commit fraud
Human Resource Management
Identify individuals liable to leave company without
additional compensation or benefits
Firm may already know 20% use 80% of offered
services
don’t know which 20%
data mining (business intelligence) can identify
Use most talented people in highest priority (or
most profitable) business units
Human Resource Management
Downsizing
identify right people, treat them well
track key performance indicators
data on talents, company needs, competitor
requirements
State of Mississippi’s MERLIN network
30 databases (finance, payroll, personnel, capital
projects)
Cognos Impromptu system - 230 users
CASINOS
Casino gaming one of richest data sets known
Harrah’s - incentive programs
about 8 million customers hold Total Gold cards,
used whenever the customer spends money in the
casino
comprehensive data collection
Trump’s Taj Card similar
Casinos
Bellagio & Mandalay Bay
strategy of luxury visits
child entertainment
change from old strategy - cheap food
Identify high rollers - cultivate
identify those to discourage from play
estimate lifetime value of players
ARTS
Computerized box offices lead to high volumes of
data
Identify potential consumers for shows
Software to manage shows
similar to airline seating chart software