Constructive Research in Data Mining
PMKD’05 Copenhagen, Denmark August 22-26, 2005
Competitive advantage from Data Mining:
some lessons learnt in the
Information Systems field
Mykola Pechenizkiy, Seppo Puuronen
Department of Computer Science
University of Jyväskylä
Finland
Alexey Tsymbal
Department of Computer Science
Trinity College Dublin
Ireland
Outline
• Introduction and What is our message?
• Part I: Existing frameworks for DM
– Theory-oriented: Databases; Statistics; Machine learning; etc
– Process-oriented: Fayyad’s, CRISP, Reinartz’s
• Part II: Where are we? – rigor vs. relevance in DM
• Part III: Towards the new framework for DM research
– DM System as adaptive Information System (IS)
– DM research as IS Development: DM system as artefact
– DM success model: success factors
– KM Challenges in KDD
– One possible reference for new DM research framework
• Further plans and Discussion
Competitive advantage from DM: lessons learnt in the IS field by M. Pechenizkiy, S. Puuronen, A. Tsymbal
What is Data Mining
Data mining or Knowledge discovery is the process of
finding previously unknown and potentially interesting
patterns and relations in large databases (Fayyad, KDD’96)
Data mining is the emerging science and industry of
applying modern statistical and computational
technologies to the problem of finding useful patterns
hidden within large databases (John 1997)
Intersection of many fields: statistics, AI, machine
learning, databases, neural networks, pattern recognition,
econometrics, etc.
H. Information Systems
H.0 GENERAL
H.1 MODELS AND PRINCIPLES
H.2 DATABASE MANAGEMENT
• H.2.0 General
– Security, integrity, and protection
• H.2.8 Database Applications
– Data mining
– Image databases
– Scientific databases
– Spatial databases and GIS
– Statistical databases
• H.2.m Miscellaneous
http://www.acm.org/class/1998/
valid in 2003
I. Computing Methodologies
I.2 ARTIFICIAL INTELLIGENCE
• I.2.0 General
– Cognitive simulation
– Philosophical foundations
• I.2.1 Applications and Expert Systems
• I.2.2 Automatic Programming
• I.2.3 Deduction and Theorem Proving
• I.2.4 Knowledge Representation Formalisms and Methods
• I.2.5 Programming Languages and Software
• I.2.6 Learning
– Analogies
– Concept learning
– Connectionism and neural nets
– Induction
– Knowledge acquisition
– Language acquisition
– Parameter learning
• I.2.7 Natural Language Processing
• I.2.m Miscellaneous
I.5 PATTERN RECOGNITION
• I.5.0 General
• I.5.1 Models
– Deterministic
– Fuzzy set
– Geometric
– Neural nets
– Statistical
– Structural
• I.5.2 Design Methodology
– Classifier design & evaluation
– Feature evaluation & selection
– Pattern analysis
• I.5.3 Clustering
– Algorithms
– Similarity measures
• I.5.4 Applications
– Computer vision
– Signal processing
– Text processing
– Waveform analysis
G. Mathematics of Computing
G.3 PROBABILITY AND STATISTICS
• Correlation and regression analysis
• Distribution functions
• Experimental design
• Markov processes
• Multivariate statistics
• Nonparametric statistics
• Probabilistic algorithms (including Monte Carlo)
• Statistical computing
Our Message
• DM is still a technology with great expectations: it should enable organizations to derive more benefit from their huge databases.
• There are some success stories in which organizations have gained competitive advantage from DM.
• Still, the strong focus of most DM researchers on technology-oriented topics does not support expanding the scope to less rigorous but practically very relevant sub-areas.
• Research in the IS discipline has a strong tradition of taking into account the human and organizational aspects of systems besides the technical ones.
Our Message
• DM-supporting processes that take human and organizational aspects into account are still in their infancy.
• The DM community might benefit, at least from the practical point of view, from looking at older sub-areas of IT that have a tradition of considering solution-driven concepts with a focus also on human and organizational aspects.
• By becoming more amenable to the research results of the IS community, the DM community might be able to increase its collective understanding of
– how DM artifacts are developed (conceived, constructed, and implemented),
– how DM artifacts are used, supported and evolved,
– how DM artifacts impact and are impacted by the contexts in which they are embedded.
Part I
• Existing Frameworks for DM
– Theory-oriented
• Databases;
• Statistics;
• Machine learning;
• Data compression
– Process-oriented
• Fayyad’s
• CRISP-DM
• Reinartz’s
Theory-Oriented Frameworks
Database Perspective
• DM as application to DBs
– “In the same way business applications are currently supported using SQL-based API, the KDD applications need to be provided with application development support.”
– query KDD objects; support for finding nearest neighbours (NNs), clustering, or discretization and aggregate operations.
• Inductive databases approach
– query concept should be applied also to data mining and
knowledge discovery tasks
• “there is no such thing as discovery, it is all in the power of the
query language”
– contain not only the data but the theory of the data as well
Imielinski, T., and Mannila, H. 1996, A database perspective on knowledge discovery.
Communications of the ACM, 39(11), 58-64.
Boulicaut, J., Klemettinen, M., and Mannila, H. 1999, Modeling KDD processes within
the inductive database framework. In Proceedings of the First International Conference
on Data Warehousing and Knowledge Discovery, Springer-Verlag, London, 293-302
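As a rough illustration of the claim that discovery can live "in the power of the query language", the sketch below counts frequently co-occurring item pairs with a plain SQL self-join. The `basket` table, the `frequent_pairs` helper and the sample data are assumptions made for this example only; they are not taken from the cited papers.

```python
import sqlite3


def frequent_pairs(baskets, min_support):
    """Return (item_a, item_b, support) for item pairs bought together
    in at least `min_support` baskets, using only a SQL query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE basket (tid INTEGER, item TEXT)")
    # One row per (transaction id, item); set() drops duplicates in a basket.
    conn.executemany(
        "INSERT INTO basket VALUES (?, ?)",
        [(tid, item) for tid, items in enumerate(baskets) for item in set(items)],
    )
    rows = conn.execute(
        """
        SELECT a.item, b.item, COUNT(*) AS support
        FROM basket a
        JOIN basket b ON a.tid = b.tid AND a.item < b.item
        GROUP BY a.item, b.item
        HAVING COUNT(*) >= ?
        ORDER BY support DESC, a.item, b.item
        """,
        (min_support,),
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    # Hypothetical toy data.
    demo = [["bread", "milk"], ["bread", "milk", "eggs"], ["milk", "eggs"], ["bread"]]
    for row in frequent_pairs(demo, min_support=2):
        print(row)
```

The pattern (a frequent pair) is never computed by a dedicated mining routine here; it is simply the answer to a query, which is the spirit of the inductive database view.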
Reductionism Approach
• Two basic Statistical Paradigms
– “Statistical Experiment”
• Fisher’s version, inductive principle of maximum likelihood
• Neyman and Pearson-Wald’s version, inductive behaviour
• Bayesian version, maximum posterior probability
• “Statistical learning from empirical process”
– “Structural Data Analysis”
• SVD
• Data mining vs. statistics: the issue of computational feasibility has a much clearer role in data mining than in statistics
– data mining emphasizes database integration, simplicity of use, and the understandability of results
– the theoretical framework of statistics pays little attention to data analysis as a process that includes several steps
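To make the difference between the maximum likelihood and maximum posterior principles listed above concrete, here is a minimal sketch for a single Bernoulli parameter. The Beta prior and its default values are assumptions chosen for illustration; the slides do not prescribe any particular model.

```python
def mle_bernoulli(successes, trials):
    """Maximum likelihood estimate of a success probability."""
    return successes / trials


def map_bernoulli(successes, trials, alpha=2.0, beta=2.0):
    """Maximum a posteriori estimate under an assumed Beta(alpha, beta) prior.

    The posterior is Beta(alpha + successes, beta + failures); its mode is
    returned. The formula is valid when alpha > 1 and beta > 1, and it
    reduces to the MLE for the flat prior alpha = beta = 1 (given trials > 0).
    """
    return (successes + alpha - 1) / (trials + alpha + beta - 2)


if __name__ == "__main__":
    # 7 successes in 10 trials: the prior pulls the MAP estimate towards 0.5.
    print(mle_bernoulli(7, 10))
    print(map_bernoulli(7, 10))
```

With little data the two estimates differ noticeably; as the number of trials grows, the influence of the prior fades and they converge.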
Machine Learning Approach
• “let the data suggest a model” can be seen as a
practical alternative to the statistical paradigm “fit a
model to the data”
• Constructive Induction – a learning process with two intertwined phases: construction of the “best” representation space and generation of hypotheses in the found space (Michalski & Wnek, 1993).
– Feature transformation (PCA, SVD, Random
Projection)
– Feature selection
– LSI
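Of the feature transformation techniques named above, random projection is the simplest to sketch: the data are multiplied by a random Gaussian matrix to obtain a lower-dimensional representation space. The function name, the 1/sqrt(k) scaling and the fixed seed are assumptions of this illustration, not details given in the slides.

```python
import math
import random


def random_projection(rows, k, seed=0):
    """Project d-dimensional rows onto k dimensions with a random
    Gaussian matrix whose entries are scaled by 1/sqrt(k)."""
    rng = random.Random(seed)
    d = len(rows[0])
    # d x k projection matrix, drawn once and reused for every row.
    matrix = [
        [rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(k)]
        for _ in range(d)
    ]
    return [
        [sum(row[i] * matrix[i][j] for i in range(d)) for j in range(k)]
        for row in rows
    ]


if __name__ == "__main__":
    # Hypothetical 4-dimensional data reduced to 2 dimensions.
    data = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]
    for projected in random_projection(data, k=2, seed=1):
        print(projected)
```

A learner would then generate its hypotheses in the projected space, which is the second of the two constructive induction phases.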
Data Compression Approach
– Compress the data set by finding some structure or knowledge in it, where knowledge is interpreted as a representation that allows the data to be coded using fewer bits.
– Theories should not be ad hoc, that is, they should not overfit the examples used to build them.
– Occam’s razor principle, 14th century.
• "when you have two competing models which make exactly the same predictions, the one that is simpler is the better".
Mehta, M., Rissanen, J., and Agrawal, R. 1995, MDL-based decision tree pruning.
In U.M. Fayyad, R. Uthurusamy (Eds.) Proceedings of the KDD 1995, AAAI Press,
Montreal, Canada, 216-221.
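The compression view can be sketched with a two-part code: the total cost of a model is the bits needed to describe the model plus the bits needed to code the data given the model. The example below compares a one-parameter Bernoulli model with raw one-bit-per-symbol coding. The 0.5 * log2(n) parameter cost is a common approximation adopted here as an assumption; it is not taken from the slides or from the cited paper.

```python
import math


def raw_code_length(bits):
    """Cost of storing a binary sequence verbatim: one bit per symbol."""
    return float(len(bits))


def two_part_code_length(bits):
    """Two-part code length for a non-empty binary sequence under a
    Bernoulli model: model bits (assumed 0.5 * log2(n) for the single
    parameter) plus data bits given the fitted parameter."""
    n = len(bits)
    ones = sum(bits)
    p = ones / n
    data_bits = 0.0
    # Each symbol costs -log2(probability); skip symbols that never occur.
    for count, prob in ((ones, p), (n - ones, 1.0 - p)):
        if count:
            data_bits -= count * math.log2(prob)
    model_bits = 0.5 * math.log2(n)
    return model_bits + data_bits


if __name__ == "__main__":
    skewed = [1] * 5 + [0] * 95   # strong structure: compresses well
    balanced = [0, 1] * 50        # no structure a Bernoulli model can use
    print(two_part_code_length(skewed), raw_code_length(skewed))
    print(two_part_code_length(balanced), raw_code_length(balanced))
```

When the data contain no structure the model can exploit, paying for the model makes the total longer than the raw data, which is how this view penalises overfitting.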
Other Theoretical frameworks for DM
• Microeconomic view
– the key point is that data mining is about finding actionable patterns: the only interest is in patterns that can somehow be used to increase utility;
– a decision-theoretic formulation of this principle: the goal is to find a decision x that maximises the utility function f(x).
Kleinberg, J., Papadimitriou, C., and Raghavan, P. 1998, A microeconomic view of data mining,
Data Mining and Knowledge Discovery 2(4), 311-324
• Philosophy of Science
– logical empiricism, critical rationalism, systems theory
– formism, mechanism, contextualism
– dispersive vs. integrative, analytical vs. synthetic theories
– subjectivist vs. objectivist, nomothetic vs. idiographic, nominalism vs. realism, voluntarism vs. determinism, epistemological assumptions
– Explanation, prediction, understanding
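The decision-theoretic formulation of the microeconomic view above can be sketched as a search for the decision x that maximises a utility f(x) summed over customers. The pricing scenario, the `utility` function and the budgets below are hypothetical and serve only to show the shape of the optimisation.

```python
def best_decision(decisions, customers, utility):
    """Return the decision x maximising f(x) = sum of utility(x, c)
    over all customers c."""
    return max(decisions, key=lambda x: sum(utility(x, c) for c in customers))


def utility(price, budget):
    """Hypothetical per-customer utility: revenue equals the price if the
    customer can afford it, and zero otherwise."""
    return price if budget >= price else 0


if __name__ == "__main__":
    prices = [5, 10, 20]        # candidate decisions x
    budgets = [6, 12, 12, 25]   # one entry per customer
    print(best_decision(prices, budgets, utility))
```

In this reading, a pattern in the customer data is interesting only to the extent that it changes which decision comes out best.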
Process-Oriented Frameworks
Knowledge discovery as a process
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R.,
Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1997.
CRISP-DM
http://www.crisp-dm.org/
KDD: “Vertical Solutions”
[Figure: KDD process phases: Business Understanding, Data Understanding, Data Preparation, Data Exploration, Data Mining, Evaluation & Interpretation, Deployment; experience accumulation across the phases]
Reinartz, T. 1999, Focusing Solutions for Data Mining.
LNAI 1623, Berlin Heidelberg.
Conclusion on different frameworks
– The reductionist approach of viewing data mining as statistics has the advantages of a strong theoretical background and easily formulated problems.
– Data mining tasks such as clustering, regression and classification fit easily into these approaches.
– More recent (process-oriented) frameworks address data mining as a process and its iterative and interactive nature.
Part II
Where are we?
Rigor and Relevance in DM Research
So, where are we?
• Lin (in Wu et al.) notes that a new successful industry (such as DM) can follow consecutive phases:
1. discovering a new idea,
2. ensuring its applicability,
3. producing small-scale systems to test the market,
4. better understanding of new technology and
5. producing a fully scaled system.
• At present there are several dozen DM systems, none of which can be compared in scale to a DBMS.
– This indicates that the DM area is still in the 3rd phase!
Rigor vs Relevance in DM Research
[Figure: Rigor vs. Relevance]
Where is the focus?
• Still! … speeding up, scaling up, and increasing the accuracy of DM techniques.
• Piatetsky-Shapiro: “we see many papers proposing incremental refinements in association rules algorithms, but very few papers describing how the discovered association rules are used”
• Lin claims that the research and development goals of DM differ considerably,
– since research is knowledge-oriented while development is profit-oriented.
– Thus, DM research is concentrated on the development of new algorithms or their enhancements,
– but the DM developers in domain areas are aware of cost considerations: investment in research, product development, marketing, and product support.
• However, we believe that the study of the DM development and DM use processes is as important as the technological aspects, and therefore such research activities are likely to emerge within the DM field.
Part III
Towards the new framework for
DM research
DMS in the Kernel of an Organization
[Figure: nested layers, from outermost to innermost: Environment, Organization, DM Task(s), DMS (Artifact)]
• DM is fundamentally an application-oriented area motivated by business and scientific needs to make sense of mountains of data.
• A DMS is generally used by human beings to support or perform some task(s) in an organizational environment, and both the users and the organization have their own expectations of the DMS.
• Further, the organization has its own environment, which has its own interests related to the DMS, e.g. that the privacy of people is not violated.
The IS-based paradigm for DM
[Figure: nested environments (The External Environment, The Organizational Environment, and within it the User Environment, the IS Development Environment and the IS Operations Environment), three processes (The Use Process, The Development Process, The Operation Process) and The Information Subsystem (ISS)]
Ives B., Hamilton S., Davis G. (1980). “A Framework for Research in Computer-based MIS”
Management Science, 26(9), 910-934.
“Information systems are powerful instruments for organizational
problem solving through formal information processing”
Lyytinen, K., 1987, “Different perspectives on ISs: problems and solutions.” ACM Computing Surveys, 19(1), 5-46.
DM Artifact Development
[Figure: DM Artifact Development linked to Theory Building, Observation and Experimentation]
A multimethodological approach to the construction of an artefact for DM
Adapted from: Nunamaker, W., Chen, M., and Purdin, T. 1990-91, Systems development
in information systems research, Journal of Management Information
Systems, 7(3), 89-106.
Research methods in a paper on DM
– Theoretical approach: theory creating
• Hypothesis, new algorithm, etc.
– Constructive approach
• Prototype of a DM tool
– Theoretical approach: theory testing and evaluation
• Artificial, benchmark, real-world data
• Evaluation techniques
– Conclusion on theory
The Action Research and Design Science
Approach to Artifact Creation
[Figure: stages: Awareness of business problem, Action planning, Artifact Development, Artifact Evaluation, Action taking, Conclusion; knowledge inputs: Contextual Knowledge, Business Knowledge, Design Knowledge]
DM Artifact Use: Success Model 1 of 3
[Figure: success model with the constructs System Quality, Information Quality, Service Quality, Use, User Satisfaction, Individual Impact and Organizational Impact]
Adapted from D&M IS Success Models
DM Artifact Use: Success Model 2 of 3
• What are the key factors of successful use and impact of DMS at both the individual and organizational levels?
1. how the system is used, and also supported and evolved, and
2. how the system impacts and is impacted by the contexts in which it is embedded.
• Coppock: the failure factors of DM-related projects have nothing to do with the skill of the modeler or the quality of data.
• But those do include:
1. persons in charge of the project did not formulate actionable insights,
2. the sponsors of the work did not communicate the insights derived to key constituents,
3. the results don't agree with institutional truths
• The leadership, communication skills and understanding of the culture of the organization are not less important than the traditionally emphasized technological job of turning data into insights.
DM Artifact Use: Success Model 3 of 3
• Hermiz argues that there are four critical success factors for DM projects:
• (1) having a clearly articulated business problem that needs to be solved and for which DM is a proper tool;
• (2) ensuring that the problem being pursued is supported by the right type of data of sufficient quality and in sufficient quantity for DM;
• (3) recognizing that DM is a process with many components and dependencies – the entire project cannot be "managed" in the traditional sense of the business word;
• (4) planning to learn from the DM process regardless of the outcome, and clearly understanding that there is no guarantee that any given DM project will be successful.
KM Perspective
• A knowledge-driven approach to enhance the
dynamic integration of DM strategies in knowledge
discovery systems.
• Focus here is on knowledge management aimed to
organise a systematic process of (meta-)knowledge
capture and refinement over time.
– knowledge extracted from data
– the higher-level knowledge required for managing DM
techniques’ selection, combination and application
• Basic knowledge management processes of
– knowledge creation and identification, representation, collection and organization, sharing, adaptation, and application
[Figure: Knowledge Creation & Acquisition, Knowledge Organization & Storage, Knowledge Distribution & Integration, Knowledge Adaptation & Application; Knowledge Evaluation, Validation and Refinement]
• DEXA’05: TAKMA WS paper & presentation are available
New Research Framework for DM Research
[Figure: research framework, reconstructed from the extracted labels]
Environment (Relevance)
• People: Roles, Capabilities, Characteristics
• Organizations: Strategy, Structure & Culture, Processes
• Technology: Infrastructure, Applications, Communications Architecture, Development Capabilities
DM Research
• Develop/Build: Theories, Artifacts
• Justify/Evaluate: Analytical, Case Study, Experimental, Field Study, Simulation
• Assess, Refine
Knowledge Base (Rigor)
• Foundations: Base-level theories, Frameworks, Models, Instantiation, Validation Criteria
• Design knowledge: Methodologies, Validation Criteria (not instantiations of models but KDD processes, services, systems)
Labelled flows
• Business Needs
• Applicable Knowledge
• (Un-)Successful Applications in the appropriate environment
• Contribution to Knowledge Base
Further Work
• Definition of the concept of relevance in DM research
• Revision of the book chapter
• Further work on the new framework for DM research
• Organization of a workshop, special track or working conference on
– more social directions in DM research, likely with one focus on IS as a sister discipline.
A few options:
– IRIS Scandinavian Conference on IS
– Next PMKD
– Workshop in Jyväskylä
Thank You!
Feedback is very welcome:
• Questions
• Suggestions
• Collaboration
Book chapter draft is available on request from
Mykola Pechenizkiy
Department of Computer Science and Information Systems,
University of Jyväskylä, FINLAND
E-mail: [email protected]
Tel.: +358 14 2602472 Fax: +358 14 260 3011
http://www.cs.jyu.fi/~mpechen