Exploring Data Mining Implementation


Exploring Data Mining
Implementation
By Karim Hirji, IBM Canada
Chichang Jou, Tamkang University
1
Motivation
• Traditional statistical techniques could not
scale to handle millions of records and
thousands of variables
• Data mining emerges to handle the
scalability issue
• How to perform data mining?
• The study (2001) reports information and
experience on applying the 5-stage data mining
model proposed by Cabena et al.
2
5-Stage Data Mining Model
(Cabena et al.)
1. Business objective determination
2. Data preparation
3. Data mining
4. Results analysis
5. Knowledge assimilation
3
Research Method
• Case study (Benbasat et al.)
– Concerned with the larger question of
developing a deeper understanding of “how”
data mining should be done
– It is important to find a company willing to
participate in this study and provide full
access to the organization during the time
frame of the study
– Multiple methods of data collection were used
• archival records, documentation, interviews,
observation
4
The Participating Company
• TAKCO is a mature North American fast-food retailer
• Its Canadian headquarters is in Toronto
• Interesting aspects of the fast-food
industry
– Consumer-driven
– Striving toward operational efficiency
– Extensive marketing analysis
5
Data Collection
• Direct observation is the primary data
collection method
– Comments recorded and probed
– Notes reviewed after each site visit for content
– Final data analysis after gathering qualitative
data from all site visits, structured by
comparing to Cabena model stage-by-stage
– A total of 10 site visits from 1998/07 to 1998/11
6
Members in the Data Mining Project
• A data mining specialist
• A project manager
• A senior director of strategic planning (the executive
sponsor)
• A research supervisor
• A business analyst
• An end-user analyst
• A data architect, and
• A database administrator (DBA)
• The executive sponsor and project manager decided that
the entire team should be present during key project
activities. Accordingly, the data mining activities were
highly interactive.
7
Project Time Line
An enterprise-wide
transaction data
warehouse of 30 gigabytes
had been completed eight
months earlier.
• Tool: IBM Intelligent
Miner for Data
• Functions: clustering,
associations, prediction
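To make the "associations" function concrete, here is a minimal sketch of how support and confidence can be computed for product pairs. It is not IBM Intelligent Miner's interface; the transactions, the `pair_rules` function, and the `min_support` threshold are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical receipts; each set holds the items bought together.
TRANSACTIONS = [
    {"burger", "fries", "cola"},
    {"burger", "fries"},
    {"salad", "water"},
    {"burger", "cola"},
    {"fries", "cola"},
]


def pair_rules(transactions, min_support=0.4):
    """Return {(antecedent, consequent): (support, confidence)} for item pairs."""
    n = len(transactions)
    # How many receipts contain each single item.
    item_counts = Counter(item for t in transactions for item in t)
    # How many receipts contain each unordered pair of items.
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # pair is too rare to report
        # Confidence of "a => b" is P(b | a); likewise for "b => a".
        rules[(a, b)] = (support, count / item_counts[a])
        rules[(b, a)] = (support, count / item_counts[b])
    return rules


RULES = pair_rules(TRANSACTIONS)
```

In this toy data set, each of the three burger/fries/cola pairs appears on two of five receipts, so those pairs pass the threshold while the salad/water pair does not.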
8
Project Outcome
• The executive sponsor's assessment
– Not a failure
• Completed on time and within budget
– No completely new and unexpected results
9
First Visit
• The executive sponsor and project
manager discuss the final parameters of
the DM project
– Candidate business problems identified
– How much historical transaction data to mine
– Received a formal project number and budget
10
not to develop a
production data
mining application
11
Stage 1
• A workshop held in 1998/09 to identify the
business problem to be mined
– Team members introduced
– Roles and responsibilities assigned
– High-level project plan developed
• Extensive discussion about original candidate
business problems
– Immediate obstacle: to ensure supporting data was
readily available
– Input from the data architect and DBA invaluable
• 2 out of 3 original business problems replaced, due to data
issues
– The research supervisor played a dominant role in
framing the business problems
12
Two Project Discontinuity Points in Stage 1
1. Anticipation
– Expectations about the potential to deliver novel and
interesting findings
– Goal alignment was employed to provide focus and
clarity
– Emphasis placed on establishing and reaching
consensus on a realistic, measurable, and achievable
business goal and project goal
– Agreed business goal: understanding DM technology
benefits to enhance business decision making
– Agreed project goal: Demonstrating DM potential to
provide new and valuable insights into a subset of
existing production system data
13
Two Project Discontinuity Points in Stage 1
2. Anxiety/Apprehension
– Concerns about the nature of the data preparation
stage and the potential bias and noise in the data set
– With data quality efforts in the data warehouse project,
members concerned about incorrect interpretation and
improper transformation
– A data audit stage added after data preparation to
demonstrate validity, reliability, consistency,
completeness and integrity of the resulting transformed
data set
• Minimize the danger of automatically dismissing potential
anomalous and relevant DM results
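The slides do not say how the data audit was carried out. As one possible illustration, the sketch below counts completeness, validity, and duplication problems in a transformed data set. The record layout, the field names, and the `audit` function are assumptions, not details from the study.

```python
# Hypothetical transformed records; field names are assumed for illustration.
RECORDS = [
    {"store_id": 1, "item": "burger", "qty": 2, "price": 3.50},
    {"store_id": 1, "item": "fries", "qty": -1, "price": 1.25},  # invalid quantity
    {"store_id": 2, "item": None, "qty": 1, "price": 0.99},      # missing item
    {"store_id": 1, "item": "burger", "qty": 2, "price": 3.50},  # repeats first row
]

REQUIRED_FIELDS = ("store_id", "item", "qty", "price")


def audit(records):
    """Count incomplete, invalid, and duplicate records."""
    report = {"incomplete": 0, "invalid": 0, "duplicate": 0}
    seen = set()
    for rec in records:
        # Completeness: every required field must be present and not None.
        if any(rec.get(field) is None for field in REQUIRED_FIELDS):
            report["incomplete"] += 1
            continue
        # Validity: quantities must be positive, prices non-negative.
        if rec["qty"] <= 0 or rec["price"] < 0:
            report["invalid"] += 1
        # Consistency: the same complete record should not appear twice.
        key = tuple(rec[field] for field in REQUIRED_FIELDS)
        if key in seen:
            report["duplicate"] += 1
        seen.add(key)
    return report


REPORT = audit(RECORDS)
```

A report like this gives the team evidence about the transformed data before mining starts, so that an anomalous result is less likely to be dismissed as a data error.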
14
One Project Discontinuity Point in Interactive
Data Mining and Results Analysis Stage
3. Frustration
– OLAP already used extensively to gain knowledge
about product offerings and fast-food customer
profiles
– “I already know that” comment
– Back-end data mining, involving data enrichment
and additional DM algorithm execution, introduced to
increase the dimensionality of the data set with
3rd-party demographic data
• Effective in providing different and interesting analysis
results
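Data enrichment of this kind amounts to joining external attributes onto the existing records. The sketch below shows the idea with plain dictionaries. The store rows, the area codes used as the join key, the demographic fields, and the `enrich` function are all hypothetical.

```python
# Hypothetical store-level sales and third-party demographics keyed by area code.
SALES = [
    {"store_id": 1, "area": "M5V", "weekly_sales": 42000},
    {"store_id": 2, "area": "L4B", "weekly_sales": 31000},
    {"store_id": 3, "area": "K1A", "weekly_sales": 27000},
]
DEMOGRAPHICS = {
    "M5V": {"median_age": 34, "median_income": 72000},
    "L4B": {"median_age": 41, "median_income": 88000},
}

DEMOGRAPHIC_FIELDS = ("median_age", "median_income")


def enrich(sales, demographics):
    """Left-join demographic columns onto each sales row by area code."""
    enriched = []
    for row in sales:
        extra = demographics.get(row["area"], {})
        merged = dict(row)
        for field in DEMOGRAPHIC_FIELDS:
            # Areas without third-party data keep the column, filled with None.
            merged[field] = extra.get(field)
        enriched.append(merged)
    return enriched


ENRICHED = enrich(SALES, DEMOGRAPHICS)
```

Each row gains two extra dimensions, which is what lets a later clustering or association run surface patterns that the original transaction data alone could not show.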
15
Implications and Discussion
1. A DM project appears to follow a more elaborated set
of stages than previously reported
2. Unlike other work, data preparation in this study is not
the most resource intensive stage
3. Several important process aspects relevant for the
Interactive DM and Results Analysis stage
• A DM briefing would have made the stage more efficient and
effective without the “I don’t understand what this means”
comment
• The DM specialist worked as a facilitator
• Linking DM results with business strategy and using
application software to perform sensitivity analysis
  • Product combinations
  • Importance of contextualizing the stage with business strategy
  • Use spreadsheet to perform sensitivity analysis
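The team used a spreadsheet for sensitivity analysis, and the slides give no formulas. The sketch below shows the same one-parameter-at-a-time idea in code for a product combination. The profit model, the baseline figures, and the function names are assumptions made for illustration only.

```python
def combo_profit(customers, price, unit_cost, uptake):
    """Profit when a fraction `uptake` of `customers` buys the combination."""
    return customers * uptake * (price - unit_cost)


# Hypothetical baseline scenario for one product combination.
BASE = {"customers": 10_000, "price": 5.0, "unit_cost": 3.0, "uptake": 0.2}


def sensitivity(base, param, factors=(0.9, 1.0, 1.1)):
    """Scale one parameter by each factor, holding the others at baseline."""
    results = {}
    for factor in factors:
        scenario = dict(base)
        scenario[param] = base[param] * factor
        results[factor] = combo_profit(**scenario)
    return results


PRICE_SENSITIVITY = sensitivity(BASE, "price")
UPTAKE_SENSITIVITY = sensitivity(BASE, "uptake")
```

Comparing the two result tables shows which assumption the profit estimate depends on most, which is the question a spreadsheet what-if table answers.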
16