PowerPoint Presentation

Data Mining
References:
U.S. News and World Report's Business & Technology
section, 12/21/98, by William J. Holstein
Prof. Juran’s lecture note 1 (at Columbia University)
J.H. Friedman (1999). Data Mining and Statistics. Technical
report, Dept. of Statistics, Stanford University
Main Goal
•Study statistical tools useful in managerial decision
making.
– Most management problems involve some degree of uncertainty.
– People have poor intuitive judgment of uncertainty.
– IT revolution... abundance of available quantitative information
• data mining: large databases of info, ...
• market segmentation & targeting
• stock market data
• almost anything else you may want to know...
•What conclusions can you draw from your data?
•How much data do you need to support your conclusions?
Applications in Management
•Operations management
– e.g., model uncertainty in demand, production function...
•Decision models
– portfolio optimization, simulation, simulation based
optimization...
•Capital markets
– understand risk, hedging, portfolios, betas...
•Derivatives, options, ...
– it is all about modeling uncertainty
•Operations and information technology
– dynamic pricing, revenue management, auction design, ...
• Data mining... many applications
Portfolio Selection
•You want to select a stock portfolio of companies A, B, C, …
•Information: Stock Annual returns by year
A 10%, 14%, 13%, 27%, …
B 16%, 27%, 42%, 23%, …
•Questions:
– How do we measure the volatility of each stock?
– How do we quantify the risk associated with a given portfolio?
– What is the tradeoff between risk and returns?
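The questions above can be made concrete with a minimal Python sketch. It uses only the four annual returns listed for A and B (the remaining years are elided on the slide), measures volatility by the sample standard deviation, and assumes a hypothetical 50/50 portfolio. The weights and function names are illustrative assumptions, not part of the original slide.

```python
import math

# Annual returns in percent, taken from the four years listed above.
returns_a = [10, 14, 13, 27]
returns_b = [16, 27, 42, 23]

def mean(xs):
    return sum(xs) / len(xs)

def sample_cov(xs, ys):
    """Sample covariance with the n - 1 divisor."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def portfolio_variance(w_a, w_b, xs, ys):
    """Variance of the return of a two-stock portfolio with weights w_a, w_b."""
    return (w_a ** 2 * sample_cov(xs, xs)
            + w_b ** 2 * sample_cov(ys, ys)
            + 2 * w_a * w_b * sample_cov(xs, ys))

vol_a = math.sqrt(sample_cov(returns_a, returns_a))  # volatility of A
vol_b = math.sqrt(sample_cov(returns_b, returns_b))  # volatility of B

# Hypothetical equal-weight portfolio (an assumption for illustration).
port_var = portfolio_variance(0.5, 0.5, returns_a, returns_b)
port_vol = math.sqrt(port_var)

print(f"A: mean {mean(returns_a):.1f}%, volatility {vol_a:.2f}%")
print(f"B: mean {mean(returns_b):.1f}%, volatility {vol_b:.2f}%")
print(f"50/50 portfolio: volatility {port_vol:.2f}%")
```

On these four years the two stocks have a slightly negative sample covariance, so the mixed portfolio comes out less volatile than either stock alone. That is one simple view of the tradeoff between risk and return.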
[Figure: Currency value relative to Jan 2, 1998]
Introduction
•Premise: All business becomes information driven.
– The concept of Data Mining is becoming increasingly popular as a
business information management tool where it is expected to reveal
knowledge structures that can guide decisions in conditions of limited
certainty.
•Competitiveness: How do you collect and exploit information to
your advantage?
•The challenges
– Most corporate data systems are not ready.
•Can they share information?
•What is the quality of the information going in?
– Most data techniques come from the empirical sciences; the world is not a
laboratory.
– Cutting through vendor hype, info-topia.
– Defining good metrics; abandoning gut rules of thumb may be too "risky" for
the manager.
– Communicating success, setting the right expectations.
Wal-Mart
•U.S. News and World Report's Business & Technology
section, 12/21/98, by William J. Holstein
Data-Crunching Santa
Wal-Mart knows what you bought last Christmas
•Wal-Mart is expected to finish the year with $135 billion in
sales, up from $118 billion last year.
– This hurts department stores such as Sears, J. C. Penney, and Federated's
Macy's and Bloomingdale's units, which have been slower to link all
their operations from stores directly to manufacturers.
– For example, Sears stocked too many winter coats this season and was
surprised by warmer than average weather.
•The field of business analytics has improved significantly
over the past few years, giving business users better insights,
particularly from operational data stored in transactional
systems. business analytics in its everyday activities.
– Analytics are now routinely used in sales, marketing, supply chain
optimization, and fraud detection.
A visualization of a Naive Bayes model for predicting
who in the U.S. earns more than $50,000 in yearly
salary.
The higher the bar, the greater the amount of evidence that
a person with this attribute value earns a high salary.
Telecommunications
•Data mining flourishes in telecommunications due to the
availability of vast quantities of high-quality data.
– A significant stream of it consists of call records collected at network
switches used primarily for billing; it enables data mining
applications in toll fraud detection and consumer marketing.
•The best-known marketing application of data mining, albeit
via unconfirmed anecdote, concerns MCI’s “Friends &
Family” promotion launched in the domestic U.S. market in
1991.
– As the anecdote goes, market researchers observed relatively small
subgraphs in this long-distance phone company’s large call-graph of
network activity.
– It reveals the promising strategy of adding entire calling circles to the
company’s subscriber base, rather than the traditional and costly
approach of seeking individual customers one at a time. Indeed, MCI
increased its domestic U.S. market share in the succeeding years by
exploiting the “viral” capabilities of calling circles; one infected
member causes others to become infected.
– Interestingly, the plan was abandoned some years later (not available
since 1997), possibly because the virus had run its course but more
Telecommunications
•In toll-fraud detection, data mining has been instrumental in
completely changing the landscape for how anomalous
behaviors are detected.
– Nearly all fraud detection systems in the telecommunications industry
10 years ago were based on global threshold models.
• They can be expressed as rule sets of the form “If a customer makes more
than X calls per hour to country Y, then apply treatment Z.”
• The placeholders X, Y, and Z are parameters of these rule sets applied to
all customers.
– Given the range of telecommunication customers, blanket application
of these rules produces many false positives.
•Data mining methods for customized monitoring of land and
mobile phone lines were subsequently developed by leading
service providers, including AT&T, MCI, and Verizon,
whereby each customer’s historic calling patterns are used as
a baseline against which all new calls are compared.
– For customers routinely calling country Y more than X times a day,
such alerts would be suppressed, but if they ventured to call a different
country Y’, an alert might be generated.
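The contrast between a global threshold rule and a per-customer baseline can be sketched in a few lines of Python. This is a simplified illustration, not any provider's actual system: the function names, the `slack` factor, and the example customer are all hypothetical.

```python
def global_threshold_alert(calls, threshold_x):
    """Global rule: the same threshold X applies to every customer."""
    return calls > threshold_x

def baseline_alert(history, country, calls_today, slack=2.0):
    """Customized rule: compare today's calls with the customer's own history.

    history maps country -> typical number of calls per day for this customer.
    A country the customer has never called always raises an alert; otherwise
    an alert is raised only if today's volume exceeds slack * typical volume.
    """
    if country not in history:
        return True
    return calls_today > slack * history[country]

# Hypothetical customer who routinely calls country "Y" about 20 times a day.
history = {"Y": 20}

print(global_threshold_alert(25, threshold_x=10))   # flagged by the blanket rule
print(baseline_alert(history, "Y", 25))             # suppressed: normal for this customer
print(baseline_alert(history, "Y-prime", 1))        # flagged: a country not seen before
```

The blanket rule flags the heavy but legitimate caller (a false positive), while the baseline rule stays quiet for the usual destination and reacts to the unfamiliar one.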
Risk management and targeted marketing
•Insurance and direct mail are two industries that rely on data
analysis to make profitable business decisions.
– Insurers must be able to accurately assess the risks posed by their
policyholders to set insurance premiums at competitive levels.
•For example, overcharging low-risk policyholders would motivate them to
seek lower premiums elsewhere; undercharging high-risk policyholders
would attract more of them due to the lower premiums.
•In either case, costs would increase and profits inevitably decrease.
– Effective data analysis leading to the creation of accurate predictive
models is essential for addressing these issues.
•In direct-mail targeted marketing, retailers must be able to
identify subsets of the population likely to respond to
promotions in order to offset mailing and printing costs.
– Profits are maximized by mailing only to those potential customers
most likely to generate net income to a retailer in excess of the
retailer’s mailing and printing costs.
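The mailing rule above amounts to comparing expected net income with the cost of one mail piece. A minimal Python sketch, assuming a predictive model has already produced a response probability for each prospect; the prospects, probabilities, and dollar figures are hypothetical.

```python
def expected_profit(p_response, income_per_response, cost_per_piece):
    """Expected net income from mailing one prospect, minus mailing and printing cost."""
    return p_response * income_per_response - cost_per_piece

def select_recipients(prospects, income_per_response, cost_per_piece):
    """Keep only prospects whose expected profit is positive."""
    return [name for name, p in prospects
            if expected_profit(p, income_per_response, cost_per_piece) > 0]

# Hypothetical (name, predicted response probability) pairs.
prospects = [("a", 0.10), ("b", 0.01), ("c", 0.03)]
mail_to = select_recipients(prospects, income_per_response=40.0, cost_per_piece=1.0)
print(mail_to)
```

With these numbers, prospect "b" is dropped because an expected income of $0.40 does not cover the $1.00 cost of the mailing.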
Medical applications (diabetic screening)
•Preprocessing and postprocessing steps are often the most
critical elements determining the effectiveness of real-life
data-mining applications, as illustrated by the following
recent medical application in diabetic patient screening.
– In the 1990s in Singapore, about 10% of the population was diabetic, a
disease with many side effects, including increased risk of eye disease,
kidney failure, and other complications.
– However, early detection and proper care management can make a
difference in the health and longevity of individual sufferers.
– To combat the disease, the government of Singapore introduced a
regular screening program for diabetic patients in its public hospitals
in 1992.
•Patient information, clinical symptoms, eye-disease diagnosis, treatments,
and other details, were captured in a database maintained by government
medical authorities.
•After almost 10 years of collecting data, a wealth of medical information is
available. This vast store of historical data leads naturally to the
application of data mining techniques to discover interesting patterns.
– The objective is to find rules physicians can use to understand more
about diabetes and how it might be associated with different
segments of the population.
Christmas Season: Georgia Stores
•Store at Decatur (just east of Atlanta)
– A black middle-income community
• Decoration display: African-American angels and ethnic Santas aplenty
• Music section: Promoting seasonal disks like "Christmas on Death Row,"
which features rapper Snoop Doggy Dogg.
• Toy department: a large selection of brown-skinned dolls
•Store at Dunwoody (20 miles away from Decatur)
– An affluent, mostly white suburb (north of Atlanta)
• Music section: Showcasing Christmas tunes by country superstar Garth
Brooks.
• Toy department: a few expensive toys that aren't available in the Decatur
store; Out of the hundreds of dolls in stock, only two have brown skin.
•How to determine the kinds of products that are carried by
various Wal-Marts across the land?
Wal-Mart system
•Every item in the store has a laser bar code, so when
customers pay for their purchases a scanner captures
information about
– what is selling on what day of the week and at what price.
– The scanner also records what other products were in each
shopper's basket.
– Wal-Mart analyzes what is in the shopping cart itself.
– The combination of [what's in a purchaser's cart] gives you a good
indication of the age of that consumer and the preferences in terms
of ethnic background.
•Wal-Mart combines the in-store data with information
about the demographics of communities around each
store.
– The end result is surprisingly different personalities for Wal-Marts.
– It also helps Wal-Mart figure out how to place goods on the floor to
get what retailers call "affinity sales," or sales of related products.
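One simple way to look for "affinity sales" candidates in scanner data is to count how often pairs of products appear in the same basket. The sketch below is only an illustration of that idea, not Wal-Mart's actual method; the baskets and product names are made up.

```python
from collections import Counter
from itertools import combinations

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of products."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) removes duplicates and gives each pair a canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

# Hypothetical shopping baskets captured at the checkout scanner.
baskets = [
    ["chips", "salsa", "soda"],
    ["chips", "salsa"],
    ["diapers", "wipes"],
    ["chips", "salsa", "wipes"],
]

counts = pair_counts(baskets)
top_pair, top_count = counts.most_common(1)[0]
print(top_pair, top_count)
```

Pairs with high counts are candidates for being placed near each other on the floor. A fuller analysis would also compare each count with what independent purchases would predict.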
Wal-Mart system (Cont.)
•One big strength of the system is that about 5,000
manufacturers are tied into it through the company's Retail
Link program, which they access via the Internet.
– Pepsi, Disney, or Mattel, for example, can tap into Wal-Mart's data
warehouse to see how well each product is selling at each Wal-Mart.
– They can look at how things are selling in individual areas and
make decisions about categories where there may be an
opportunity to expand.
– That tight information link helps Wal-Mart work with its suppliers
to replenish stock of products that are selling well and to quickly
pull those that aren't.
Data Mining and Statistics
•Data Mining is used to discover patterns and relationships in
data with an emphasis on large observational data bases.
– It sits at the common frontiers of several fields including Data Base
Management, Artificial Intelligence, Machine Learning, Pattern
Recognition and Data Visualization.
– From a statistical perspective it can be viewed as computer
automated exploratory data analysis of large complex data sets.
– Many organizations have large transaction-oriented data bases used
for inventory, billing, accounting, etc. These data bases were very
expensive to create and are costly to maintain. For a relatively small
additional investment, DM tools offer to discover highly profitable
nuggets of information hidden in these data.
•Data, especially large amounts of it, reside in data base
management systems (DBMS).
– Conventional DBMS are focused on online transaction processing
(OLTP); that is, the storage and fast retrieval of individual records for
purposes of data organization. They are used to keep track of
inventory, payroll records, billing records, invoices, etc.
Data Mining Techniques
•Data mining is an analytic process designed to
– explore data (usually large amounts of data, typically business or market
related) in search of consistent patterns and/or systematic
relationships between variables, and then
– to validate the findings by applying the detected patterns to new
subsets of data.
– The ultimate goal of data mining is prediction - and predictive data
mining is the most common type of data mining and one that has
most direct business applications.
•The process of data mining consists of three stages:
– the initial exploration,
– model building or pattern identification with validation and
verification, and it is concluded with
– deployment (i.e., the application of the model to new data in order to
generate predictions).
Stage 1: Exploration
•It usually starts with data preparation which may involve
cleaning data, data transformations, selecting subsets of records
and - in case of data sets with large numbers of variables
("fields") - performing some preliminary feature selection
operations to bring the number of variables to a manageable
range (depending on the statistical methods which are being
considered).
•Depending on the nature of the analytic problem, this first
stage of the process of data mining may involve anywhere
between a simple choice of straightforward predictors for a
regression model, to elaborate exploratory analyses using a
wide variety of graphical and statistical methods in order to
identify the most relevant variables and determine the complexity
and/or the general nature of models that can be taken into
account in the next stage.
Stage 2: Model building and validation
•This stage involves considering various models and choosing
the best one based on predictive performance, that is,
– explaining the variability in question and
– producing stable results across samples.
•How do we achieve these goals?
•This may sound like a simple operation, but in fact it
sometimes involves a very elaborate process.
– One approach is "competitive evaluation of models," that is, applying different models
to the same data set and then comparing their performance to choose
the best.
– These techniques - which are often considered the core of predictive
data mining - include: Bagging (Voting, Averaging), Boosting,
Stacking (Stacked Generalizations), and Meta-Learning.
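The "competitive evaluation of models" step can be illustrated with a small Python sketch: two candidate models are fitted to the same training data and compared on held-out data. The synthetic data, the two candidate models, and the 150/50 split are assumptions chosen for illustration; the sketch shows only the comparison step, not bagging, boosting, or stacking.

```python
import random

def fit_mean(xs, ys):
    """Baseline model: always predict the training mean of y."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Least-squares straight line y = a + b * x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return lambda x: a + b * x

def mse(model, xs, ys):
    """Mean squared prediction error on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data (an assumption): y depends linearly on x plus noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

# Hold out the last 50 observations to check stability on new data.
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

candidates = {"mean": fit_mean, "line": fit_line}
scores = {name: mse(fit(train_x, train_y), test_x, test_y)
          for name, fit in candidates.items()}
best = min(scores, key=scores.get)
print(scores, "->", best)
```

Scoring on observations that were not used for fitting is what connects the two goals above: the chosen model has to explain the variability and also hold up on a fresh sample.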
Models for Data Mining
•In the business environment, complex data mining projects
may require the coordinated efforts of various experts,
stakeholders, or departments throughout an entire
organization.
•In the data mining literature, various "general frameworks"
have been proposed to serve as blueprints for how to
organize the process of gathering data, analyzing data,
disseminating results, implementing results, and
monitoring improvements.
– CRISP (Cross-Industry Standard Process for data mining) was
proposed in the mid-1990s by a European consortium of companies
to serve as a non-proprietary standard process model for data
mining.
– The Six Sigma methodology is a well-structured, data-driven
methodology for eliminating defects, waste, or quality control
problems of all kinds in manufacturing, service delivery,
management, and other business activities.
CRISP
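•This general approach postulates the following (perhaps not
particularly controversial) general sequence of steps for data
mining projects:
Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment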
Six Sigma
• This model has recently become very popular (due to its successful
implementations) in various American industries, and it appears to be gaining
favor worldwide. It postulates a sequence of so-called DMAIC steps.
– The categories of activities: Define (D), Measure (M), Analyze (A), Improve (I),
Control (C).
– It postulates the following general sequence of steps for data mining projects:
Define (D) → Measure (M) → Analyze (A) → Improve (I) → Control (C)
– It grew out of the manufacturing, quality improvement, and process control
traditions and is particularly well suited to production environments
(including "production of services," i.e., service industries).
• Define. It is concerned with the definition of project goals and boundaries, and
the identification of issues that need to be addressed to achieve the higher sigma
level.
• Measure. The goal of this phase is to gather information about the current
situation, to obtain baseline data on current process performance, and to identify
problem areas.
• Analyze. The goal of this phase is to identify the root cause(s) of quality
problems, and to confirm those causes using the appropriate data analysis tools.
• Improve. The goal of this phase is to implement solutions that address the
problems (root causes) identified during the previous (Analyze) phase.
• Control. The goal of the Control phase is to evaluate and monitor the results of
the previous phase (Improve).
Six Sigma Process
•A six sigma process is one that can be expected to produce
only 3.4 defects per one million opportunities.
– The concept of the six sigma process is important in Six Sigma quality
improvement programs.
•The term Six Sigma derives from the goal to achieve a
process variation, so that 6×sigma (the estimate of the
population standard deviation) will "fit" inside the lower and
upper specification limits for the process.
– In that case, even if the process mean shifts by 1.5×sigma in one
direction (e.g., to +1.5 sigma in the direction of the upper specification
limit), then the process will still produce very few defects.
•For example, suppose we expressed the area above the upper
specification limit in terms of one million opportunities to
produce defects. The 6×sigma process shifted upwards by 1.5
×sigma will produce only 3.4 defects (i.e., "parts" or "cases"
greater than the upper specification limit) per one million
opportunities.
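The 3.4 figure can be checked numerically. The sketch below assumes the process output is normally distributed and that the upper specification limit sits 6 sigma above the target mean. After a 1.5 sigma shift toward that limit, 4.5 sigma of room remains, and the upper tail area beyond 4.5 sigma is the defect rate. The function and variable names are illustrative.

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

spec_limit_sigmas = 6.0   # upper specification limit, in sigmas from the target
mean_shift_sigmas = 1.5   # process mean drifts toward the upper limit
z = spec_limit_sigmas - mean_shift_sigmas   # remaining distance: 4.5 sigma

dpmo = upper_tail(z) * 1_000_000   # defects per million opportunities
print(f"{dpmo:.1f} defects per million opportunities")
```

Under these assumptions the script prints 3.4, which matches the number quoted above.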
A statistician’s remarks on DM paradigms
•The DM community may have to moderate its romance with
big.
– A prevailing attitude seems to be that unless an analysis involves
gigabytes or terabytes of data, it can not possibly be worthwhile.
– It seems to be a requirement that all of the data that has been collected
must be used in every aspect of the analysis.
– Sophisticated procedures that cannot simultaneously handle data sets of
such size are not considered relevant to DM.
– Most DM applications routinely require data sets that are considerably
larger than those that have been addressed by traditional statistical
procedures (kilobytes).
– It is often the case that the questions being asked of the data can be
answered to sufficient accuracy with less than the entire giga or terabyte
data base.
– Sampling methodology, which has a long tradition in Statistics, can
profitably be used to improve accuracy while mitigating computational
requirements.
– Also, a powerful, computationally intense procedure operating on a
subsample of the data may in fact provide greater accuracy than a less
sophisticated one using the entire data base.
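The point about answering a question "to sufficient accuracy" from less than the full data base can be illustrated with a small Python sketch. The data here are synthetic (an assumption for illustration): a list of 200,000 purchase amounts stands in for the large data base, and a 5% simple random subsample is used to estimate its mean.

```python
import random

rng = random.Random(42)

# Synthetic "data base": 200,000 purchase amounts with a mean of about 100.
population = [rng.expovariate(1 / 100) for _ in range(200_000)]
full_mean = sum(population) / len(population)

# Simple random subsample of 10,000 records (5% of the data).
sample = rng.sample(population, 10_000)
n = len(sample)
sample_mean = sum(sample) / n

# Estimated standard error of the sample mean.
sample_sd = (sum((x - sample_mean) ** 2 for x in sample) / (n - 1)) ** 0.5
std_error = sample_sd / n ** 0.5

print(f"full-data mean  : {full_mean:.2f}")
print(f"subsample mean  : {sample_mean:.2f} (standard error {std_error:.2f})")
```

The standard error shrinks with the square root of the subsample size, not with the size of the data base, which is why a modest subsample is often enough.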
Sampling
• Objective: Determine the average amount of money spent
in the Central Mall.
• Sampling: A Central City official randomly samples 12
people as they exit the mall.
– He asks them the amount of money spent and records the data.
– Data for the 12 people:
Person  $ spent    Person  $ spent    Person  $ spent
1       $132       5       $123       9       $449
2       $334       6       $5         10      $133
3       $33        7       $6         11      $44
4       $10        8       $14        12      $1
– The official is trying to estimate mean and variance of the population
based on a sample of 12 data points.
Population versus Sample
• A population is usually a group we want to know something
about:
– all potential customers, all eligible voters, all the products coming off an
assembly line, all items in inventory, etc.
– Finite population: {u1, u2, ... , uN} versus infinite population
• A population parameter is a number (θ) relevant to the population
that is of interest to us:
– the proportion (in the population) that would buy a product, the
proportion of eligible voters who will vote for a candidate, the average
number of M&M's in a pack....
• A sample is a subset of the population that we actually do know
about (by taking measurements of some kind):
– a group who fill out a survey, a group of voters that are polled, a number
of randomly chosen items off the line....
– {x1, x2, ... , xn}
• A sample statistic g(x1, x2, ... , xn) is often the only practical
estimate of a population parameter.
– We will use g(x1, x2, ... , xn) as a proxy for θ, but remember the difference.
Average Amount of Money spent in the Central Mall
• A sample (x1, x2, ... , xn)
• Its mean is the sum of their values divided by the number of
observations.
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \dots + x_n}{n}$$
• The sample mean, the sample variance, and the sample standard
deviation are $107, 20,854 (in squared dollars), and $144.41, respectively.
• The claim is that on average $107 is spent per shopper, with a
standard deviation of $144.41.
• Why can we claim so?
$$s^2 = \frac{(x_1 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
$$s = \sqrt{\frac{1}{n - 1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
•The variance s^2 of a set of observations is the average of the
squares of the deviations of the observations from their mean
(with n − 1 rather than n as the divisor).
•The standard deviation s is the square root of the variance s^2.
•How far are the observations from the mean? s^2 and s will be
– large if the observations are widely spread about their mean,
– small if they are all close to the mean.
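As a check on the Central Mall figures, the sample mean, variance, and standard deviation can be computed directly from the 12 recorded amounts. This is a minimal Python sketch using the n − 1 divisor; the variable names are illustrative.

```python
import math

# Amounts spent by persons 1..12 in the Central Mall sample.
spent = [132, 334, 33, 10, 123, 5, 6, 14, 449, 133, 44, 1]

n = len(spent)
x_bar = sum(spent) / n                               # sample mean
s2 = sum((x - x_bar) ** 2 for x in spent) / (n - 1)  # sample variance
s = math.sqrt(s2)                                    # sample standard deviation

print(f"mean = {x_bar:.0f}, variance = {s2:.0f}, std dev = {s:.2f}")
```

For these data the mean is 107, the variance is 20,854, and the standard deviation is about 144.41. The few large purchases ($449 and $334) account for most of the spread.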
Stock Market Indexes
•A stock market index is a statistical measure that shows how the
prices of a group of stocks change over time.
– Price-Weighted Index: DJIA
– Market-Value-Weighted Index: Standard and Poor’s 500
composite Index
– Equally Weighted Index: Wilshire 5000 Equity Index
•Price-Weighted Index: It shows the change in the average
price of the stocks that are included in the index.
– Price per share in current period P0 and price per share in
next period P1.
– Number of shares outstanding in current period Q0 and
number of shares outstanding in next period Q1.
DJIA
•Dow Jones industrial average (DJIA):
– Charles Dow first concocted his 12-stock industrial average in 1896
(expanding to 30 in 1928)
– Original: It is an arithmetic average of the thirty stock prices that
make up the index.
DJIA = [(P01 + P02 +… + P0,30)/30]/[(P11 + P12 +… + P1,30)/30]
– Current: It is adjusted for stock splits and the issuance of stock
dividends.
DJIA = [(P01+ P02 +… + P0,30)/AD1]/(P11 + P12 +… + P1,30)
where AD1 is the appropriate divisor.
•How do we adjust AD1 to account for stock splits, adding
new stocks,...?
– The adjustment process is designed to keep the index value the same as it
would have been if the split had not occurred.
– Suppose X30 splits 2:1 from $100 to $50. Then change c to c0 such that
(X1 + X2 +… + 100)/c = (X1 + X2 +… + 50)/c0
– change to c0 < c to keep index constant before & after split.
•How about when new stocks are added and others are removed?
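The divisor adjustment for a split can be worked through numerically. The sketch below solves (sum of prices before)/c = (sum of prices after)/c0 for c0, using a hypothetical three-stock index in place of the thirty Dow stocks; the prices and the starting divisor are made up for illustration.

```python
def adjusted_divisor(prices_before, prices_after, divisor):
    """Return c0 such that sum(prices_after) / c0 == sum(prices_before) / divisor."""
    return divisor * sum(prices_after) / sum(prices_before)

# Hypothetical three-stock index; the last stock splits 2:1 from $100 to $50.
before = [40.0, 60.0, 100.0]
after = [40.0, 60.0, 50.0]
c = 3.0

c0 = adjusted_divisor(before, after, c)
index_before = sum(before) / c
index_after = sum(after) / c0

print(f"c = {c}, c0 = {c0}")
print(f"index before = {index_before:.4f}, index after = {index_after:.4f}")
```

Here c0 = 2.25 < c = 3, and the index value is the same before and after the split. The same equation can be solved for a new divisor when a stock is replaced, by using the price sums with the old and new membership.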
DJIA
• How each stock in the Dow performed during the period when the Dow
rose 100 percent (from its close above 5,000 on Nov. 21, 1995 until it
closed above 10,000 on March 29, 1999).
*Companies not in the Dow when it crossed 5,000.
**Adjusted for spinoffs. Does not reflect performance of stocks
spun off to shareholders.
Company               Weight in the Dow (%)   Change in Price (%)
Alcoa                 1.9                     +52
AlliedSignal          2.3                     +129
Amer. Express         5.5                     +185
AT&T**                3.6                     +87
Boeing                1.5                     -5
Caterpillar           2.1                     +59
Chevron               4.0                     +77
Citigroup*            2.8                     +262
Coca-Cola             3.0                     +69
Du Pont               2.5                     +76
Eastman Kodak         2.9                     -6
Exxon                 3.2                     +83
General Electric      5.3                     +232
General Motors**      3.9                     +89
Goodyear              2.2                     +23
Hewlett-Packard*      3.1                     +66
I.B.M.                1.9                     +276
International Paper   2.0                     +24
J. P. Morgan          5.0                     +63
Johnson & Johnson*    4.2                     +120
McDonald's            2.0                     +102
Merck                 3.6                     +175
Minnesota Mining**    3.2                     +15
Philip Morris         1.8                     +37
Procter & Gamble      4.5                     +134
Sears, Roebuck        2.1                     +18
Union Carbide         2.1                     +19
United Technologies   6.0                     +196
Wal-Mart*             4.2                     +288
Walt Disney           1.5                     +62
S&P 500
•The S&P 500, which started in 1957, weights stocks on the
basis of their total market value.
– Suppose X30 splits 2:1 from $100 to $50. Then change c to c0
such that (X1 + X2 +… + 100)/c = (X1 + X2 +… + 50)/c0
– change to c0 < c to keep index constant before & after split.
• How about when new stocks are added and others are
removed?
• S&P 500 is computed by
S&P 500 = (w1X1 + w2X2 +… + w500X500)/c
where Xi=price of ith stock and wi=# of shares of ith
stock.
• What happens when a stock splits?
• It is a weighted average.
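For a market-value-weighted index, a split changes the price and the number of shares in opposite directions, so the product wi·Xi is unchanged. A minimal Python sketch with a hypothetical two-stock index (the prices, share counts, and divisor are made up) shows the effect:

```python
def value_weighted_index(prices, shares, divisor):
    """(w1*X1 + w2*X2 + ...) / c, with Xi = price and wi = number of shares."""
    return sum(w * x for w, x in zip(shares, prices)) / divisor

c = 1_000.0

# Before the split: stock 1 trades at $100 with 1,000 shares outstanding.
prices = [100.0, 20.0]
shares = [1_000, 5_000]
index_before = value_weighted_index(prices, shares, c)

# After a 2:1 split of stock 1: the price halves and the share count doubles.
prices_split = [50.0, 20.0]
shares_split = [2_000, 5_000]
index_after = value_weighted_index(prices_split, shares_split, c)

print(index_before, index_after)
```

Both values are 200.0 in this example. Unlike the price-weighted case, no divisor adjustment is needed for the split itself.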
Sample vs Population
• For both problems, we try to infer properties of a large group
(population) by analyzing a small subgroup (the sample).
– The population is the group we are trying to analyze; e.g., all eligible
voters, etc.
– A sample is a subset of the total population that we have observed or
collected data from; e.g., voters that are actually polled, etc.
• How to draw a sample which can be used to make statements
about the population?
– Sample must be representative of the population
– Sampling is the way to obtain reliable information in a cost effective way
(why not census?)
Issues in sampling
• Representativeness
– Interviewer discretion
– Respondent discretion - non-response
– Key question: is the reason for non-response related to the attribute
you are trying to measure? Illegal aliens/Census. Start-up
companies/not in phone book. Library exit survey.
• Good samples
– Probability samples: each unit in the population has a
known probability of being in the sample.
– Simplest case: an equal probability sample, in which each unit has the same
chance of being in the sample.
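An equal probability sample can be drawn with a random number generator once a list of all units is available. This is a minimal Python sketch, assuming a hypothetical frame of 1,000 unit labels; the function name and the seed are illustrative.

```python
import random

def equal_probability_sample(frame, n, seed=None):
    """Draw n distinct units so that every unit has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

# Hypothetical list of all units in the target population.
frame = [f"unit-{i}" for i in range(1, 1001)]

sample = equal_probability_sample(frame, 12, seed=1)
print(sample)
```

Because the draw is made by the generator and not by an interviewer, there is no room for interviewer discretion. Non-response still has to be handled separately.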
Utopian Sample for Analysis
• You have a complete and accurate list of ALL the units in the
target population (sampling frame)
• From this you draw an equal probability sample (generate a list
of random numbers)
• Reality check; incomplete frame, impossible frame, practical
constraints on the simple random sample (cost and time of
sampling)
• Precision considerations
– How large a sample do I need?
– Focus on the confidence interval: choose the coverage rate (90%, 95%, 99%)
and the margin of error (half the width). Typically, trade off width against
coverage rate.
– Simple rule of thumb for a population proportion: for a 95% CI,
use n = 1/(margin of error)**2.
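The rule of thumb follows from the normal-approximation margin of error for a proportion, z·sqrt(p(1 − p)/n). This is largest at p = 0.5, where it equals z/(2·sqrt(n)). With z = 1.96 rounded to 2, the margin is about 1/sqrt(n), so n is about 1/(margin of error)**2. The Python sketch below compares the rule with the unrounded worst-case calculation; the function names are illustrative.

```python
import math

def n_rule_of_thumb(margin):
    """Sample size from the rule n = 1 / margin**2 (95% CI, proportion)."""
    return math.ceil(1 / margin ** 2)

def n_worst_case(margin, z=1.96):
    """Sample size from z * sqrt(p * (1 - p) / n) <= margin at the worst case p = 0.5."""
    return math.ceil(z ** 2 * 0.25 / margin ** 2)

margin = 0.03   # a margin of error of plus or minus 3 percentage points
print(n_rule_of_thumb(margin), n_worst_case(margin))
```

For a 3-point margin the rule gives 1,112 and the unrounded calculation gives 1,068, so the rule of thumb is slightly conservative.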
Data Analysis
• Statistical Thinking is understanding variation and how to
deal with it.
• Move as far as possible to the right on this continuum:
Ignorance-->Uncertainty-->Risk-->Certainty
• Information science: learning from data
– Probabilistic inference based on mathematics
– What is Statistics?
– What is the connection if any