Chapter 1 Introduction Case Studies

Chapter 1 Case Studies
Introduction to Data Mining with Case Studies
Author: G. K. Gupta
Prentice Hall India, 2006
Case Study - Aviation
Wipro (a large Indian IT company) reported a study of
frequent flyer data from an Indian airline.
Before carrying out data mining, the data was selected and
prepared. For example, it was decided to use only the
three most common sectors flown by each customer and
the three most common sectors when points are redeemed
by each customer. It was discovered that much of the data
supplied by the airline was incomplete or inaccurate. Also,
it was found that the customer data captured by the
company could have been more complete. For example,
the airline did not know customers’ marital status or their
income or their reasons for taking a journey.
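
The sector selection step described above can be sketched in a few lines. This is a minimal illustration, not the method used in the study: the record layout (customer id, sector) and the sample sector codes are hypothetical.

```python
from collections import Counter, defaultdict

def top_sectors(flights, n=3):
    """Return the n most frequently flown sectors for each customer.

    `flights` is an iterable of (customer_id, sector) pairs. Both field
    names are hypothetical, not taken from the airline's data.
    """
    counts = defaultdict(Counter)
    for customer_id, sector in flights:
        counts[customer_id][sector] += 1
    # most_common(n) returns the n sectors with the highest counts.
    return {cid: [s for s, _ in c.most_common(n)] for cid, c in counts.items()}

# Made-up flight records for two customers.
flights = (
    [("C1", "BOM-BLR")] * 4
    + [("C1", "DEL-BOM")] * 3
    + [("C1", "DEL-MAA")] * 2
    + [("C1", "DEL-CCU")]
    + [("C2", "BLR-HYD")]
)
result = top_sectors(flights)
```

The same function could be applied to redemption records to obtain the three most common sectors on which points are redeemed.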
27 November 2008
©GKGupta
2
Case Study - Astronomy
Astronomers produce huge amounts of data every night
on the fluctuating intensity of around 20 million stars
which are classified by their spectra and their surface
temperature.
Some 90% of stars are called main sequence stars
including some stars that are very large, very hot and
blue in colour. The main sequence stars are fuelled by
nuclear fusion and are very stable, lasting billions of
years.
Smaller main sequence stars include the Sun (star type
G in the table below). There are a number of classes
including stars called yellow dwarf, red dwarf and white
dwarf. We show the seven major classes:
Different Types of Stars

Star   Colour            Approximate          Average brightness   Average radius
type                     temperature          (Sun = 1)            (Sun = 1)
O      Blue              > 25,000K            > million            60
B      Blue              11,000 to 25,000K    20,000               18
A      Blue              7,500 to 11,000K     80                   3.2
F      Blue to White     6,000 to 7,500K      6                    1.7
G      White to Yellow   5,000 to 6,000K      1.2                  1.1
K      Orange to Red     3,500 to 5,000K      0.4                  0.8
M      Red               < 3,500K             0.04                 0.3
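
The temperature ranges in the table above can be read as a simple lookup. The sketch below is only an illustration of that reading; how a temperature exactly on a boundary is handled is an assumption (it is assigned to the hotter class), since the table does not say.

```python
# (lower bound in kelvin, star type), hottest first. Bounds are taken
# from the table; boundary handling is an assumption.
STAR_TYPES = [
    (25000, "O"),
    (11000, "B"),
    (7500, "A"),
    (6000, "F"),
    (5000, "G"),
    (3500, "K"),
]

def star_type(temperature_k):
    """Return the star type whose temperature range contains temperature_k."""
    for lower_bound, label in STAR_TYPES:
        if temperature_k >= lower_bound:
            return label
    return "M"  # anything cooler than 3,500K
```

For example, a surface temperature of about 5,800K falls in the 5,000 to 6,000K range, which is type G, the Sun's type.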
Astronomy
When a clustering program was used to group a large
amount of astronomical data, four classes corresponding
to stars, galaxies with bright central cores, galaxies
without bright central cores and stars with a visible “fuzz”
around them were found.
The clustering program found meaningful results without
any understanding of astronomical data.
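
The source does not name the clustering program or its algorithm. As a stand-in, the sketch below uses plain k-means on made-up two-dimensional points to show how groups can emerge from the numbers alone, with no domain knowledge supplied. The data, the starting centroids and the choice of k-means are all assumptions.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of the points assigned to it."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    # `clusters` holds the assignment made in the last pass.
    return centroids, clusters

# Made-up measurements forming two obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (6.0, 6.0)])
```

Nothing in the code knows what the two coordinates mean; the grouping comes only from the distances between points.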
Case Study – Mail Order
A direct mail company held a large list of potential
customers but had a response rate of only 1%. The
company wanted to improve the response rate.
To carry out data mining, the company had to first
prepare data, which included sampling the data to select
a subset of customers including those that responded to
direct mails and those that did not.
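
The source does not say how the subset was drawn. One common choice when responders are rare is to keep every responder and take a random sample of the non-responders. The sketch below assumes that approach; the field names and the sample sizes are hypothetical.

```python
import random

def sample_for_modelling(customers, n_nonresponders, seed=0):
    """Keep all responders and a random sample of non-responders.

    `customers` is a list of dicts with a boolean "responded" field
    (a hypothetical layout). The fixed seed makes the sample repeatable.
    """
    responders = [c for c in customers if c["responded"]]
    nonresponders = [c for c in customers if not c["responded"]]
    rng = random.Random(seed)
    k = min(n_nonresponders, len(nonresponders))
    return responders + rng.sample(nonresponders, k)

# Made-up list with a 1% response rate: 100 responders in 10,000 customers.
customers = [{"id": i, "responded": i % 100 == 0} for i in range(10_000)]
subset = sample_for_modelling(customers, n_nonresponders=400)
```

In this toy subset responders make up 100 of 500 records instead of 1 in 100, which gives a learning method more examples of the rare outcome to work with.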
For each customer, there were more than 200 variables
including basic personal information like the locality
where they lived, their gender, marital status, and their
buying habits including when they last responded to a
mailout, how much money they spent the last time they
responded, and the product bought the last time.
Using the decision tree approach, the company was able
to identify characteristics of customers who were more
likely to respond. The company was therefore able to
reduce the number of customers it mailed to, reducing cost
while simultaneously improving the response rate.
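
A decision tree is built by repeatedly choosing the variable and threshold that best separate responders from non-responders. The sketch below shows only that single splitting step, using Gini impurity as the measure. The impurity measure, the two variable names and the six customer records are assumptions for illustration; the source does not describe the tree that the company built.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0 means perfectly pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows, labels):
    """Return (score, feature, threshold) with the lowest weighted Gini
    impurity, splitting rows into value <= threshold and value > threshold."""
    best = None
    n = len(rows)
    for feature in rows[0]:
        # The largest value is skipped: it would leave the right side empty.
        for threshold in sorted({r[feature] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best

# Made-up customers: 1 = responded to the last mailing, 0 = did not.
rows = [
    {"months_since_last_response": 1, "last_spend": 40},
    {"months_since_last_response": 2, "last_spend": 15},
    {"months_since_last_response": 3, "last_spend": 60},
    {"months_since_last_response": 10, "last_spend": 55},
    {"months_since_last_response": 14, "last_spend": 20},
    {"months_since_last_response": 18, "last_spend": 45},
]
labels = [1, 1, 1, 0, 0, 0]
best = best_split(rows, labels)
```

In this toy data the chosen split is on months since the last response at a threshold of 3, which separates the two groups completely. A full tree would apply the same step again inside each branch.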
Case Study 1A Inventory Control
The case study reports the results of using data mining for
inventory control at Medicorp, a US pharmaceutical company
that is the largest retail distribution company, with 4100
stores in 25 US states.
Medicorp maintained an inventory worth almost one billion
dollars to ensure that any drug required by a customer had
a 95% chance of being available from any outlet of the
company. To achieve this goal, the company had a rule of
thumb to maintain “three weeks’ supply” of every drug.
The study involved collecting relevant data and then
carrying out some preliminary studies. Models were
developed for predicting demand for various drugs. The
models were not very accurate for daily predictions but
were more accurate for weekly forecasts and even better
for monthly forecasts.
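
The source does not describe the forecasting models. As a rough illustration of forecasting at a weekly rather than daily level, the sketch below sums daily demand into weekly totals and predicts the next week as the average of recent weeks. The moving-average method and the demand figures are assumptions.

```python
def weekly_totals(daily_demand):
    """Sum daily demand into complete 7-day weeks.

    A trailing partial week is dropped.
    """
    full_days = len(daily_demand) - len(daily_demand) % 7
    return [sum(daily_demand[i:i + 7]) for i in range(0, full_days, 7)]

def moving_average_forecast(history, window=3):
    """Forecast the next value as the mean of the last `window` values."""
    recent = history[-window:]
    return sum(recent) / len(recent)

# Made-up daily unit sales for one drug over three weeks.
daily = [12, 9, 15, 11, 8, 20, 25] * 2 + [10, 10, 10, 10, 10, 25, 25]
weeks = weekly_totals(daily)
next_week = moving_average_forecast(weeks)
```

The daily figures in this toy series vary a good deal while the weekly totals are steady, which is one intuition for why forecasts over longer periods can be more accurate.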
The weekly forecasting model was chosen, since it better
suited the company’s needs. The study concluded that the
company needed to change its rule of thumb of maintaining
three weeks’ supply of drugs.
It recommended that the supply held be reduced below three
weeks for popular drugs and extended beyond three weeks for
less popular items, since large-selling items can be easily
replenished on a weekly basis. The company was reported
to have reduced its inventory by half, resulting in
considerable savings.
Case Study 1B Crime Prevention
This case study was published in the magazine IEEE
Computer in April 2004.
Crime data was grouped into eight categories comprising
traffic violation, sex crime, theft, fraud, arson, gang/drug
offences, violent crimes and cybercrime. Some of the
major crimes are included in the category violent crime,
including murder, assault, armed robbery, sexual and
hate crimes. The study focussed on three aspects of
crime: extracting named entities from narrative reports,
detecting deceptive criminal identities and identifying
criminal groups and key members of the groups.
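
The slides do not describe how deceptive identities were detected. One simple idea is to compare records field by field and flag pairs that are nearly, but not exactly, identical. The sketch below uses the standard library's difflib as a stand-in string comparator; the record fields and values are made up.

```python
from difflib import SequenceMatcher

def identity_similarity(a, b):
    """Average string similarity (0 to 1) over the fields two records share."""
    fields = [f for f in a if f in b]
    if not fields:
        return 0.0
    ratios = [SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio()
              for f in fields]
    return sum(ratios) / len(ratios)

# Made-up records: the second looks like a slightly altered copy of the first.
record_1 = {"name": "John Smith", "dob": "1975-03-12"}
record_2 = {"name": "Jon Smith", "dob": "1975-03-21"}
record_3 = {"name": "Maria Lopez", "dob": "1990-11-02"}

close = identity_similarity(record_1, record_2)
far = identity_similarity(record_1, record_3)
```

A pair scoring close to 1 without being identical would be a candidate for manual review. Any cut-off value would have to be chosen from real data.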