Transcript D3-P1

The Data Mining Visual Environment
Motivation
Major problems with existing DM systems
They are based on non-extensible frameworks.
They provide a non-uniform mining environment: the user is presented with
totally different interfaces across implementations of different DM techniques.
Major needs
An overall framework that can support the entire Knowledge Discovery (KD)
process (accommodate and integrate all KD phases seamlessly).
Placing the user at the center of the entire KD process and of the framework.
The corresponding system should provide a consistent, uniform and flexible
visual interaction environment that supports the user throughout the entire
discovery process.
System Architecture
Primary layers
• User layer
• Engine layer
• Data layer
Main features
• Open
• Modular, with well-defined modification/extension points
• Possible integration of different tasks, e.g. output reuse by another task
• User flexibility: the user can process data and knowledge, and drive and
guide the entire KD process
At present there is a partial prototype; a complete implementation is underway.
System Architecture: The Prototype
Primary layers
• User Interface/GUI Container (interacts with specific DM visual environments)
Developed using Java
• Abstract DM Engine/Wrapper of DM algorithms (interacts with specific DM
algorithms)
Developed using Java
Stephen
Stefano
Note
• The DM algorithms may be implemented by third parties, possibly in any
language.
DM methods include (but are not limited to): metaqueries (MQs), association
rules (ARs), and clustering
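The two prototype layers above can be sketched in Java, the language the prototype is developed in. This is only a minimal sketch under assumptions: the names `MiningAlgorithm`, `AbstractDmEngine` and the task key "clustering" are hypothetical and are not taken from the prototype. A third-party algorithm written in another language would sit behind the same interface, for instance through an external process call.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical contract that every wrapped DM algorithm fulfils.
interface MiningAlgorithm {
    // Takes task parameters (e.g. thresholds) and returns a textual result.
    String run(Map<String, String> parameters);
}

// Hypothetical engine layer: the GUI container talks only to this class,
// never to a concrete algorithm.
final class AbstractDmEngine {
    private final Map<String, MiningAlgorithm> algorithms = new HashMap<>();

    // Extension point: register an algorithm under a task name.
    void register(String task, MiningAlgorithm algorithm) {
        algorithms.put(task, algorithm);
    }

    // Dispatches a request from the user layer to the registered algorithm.
    String execute(String task, Map<String, String> parameters) {
        MiningAlgorithm algorithm = algorithms.get(task);
        if (algorithm == null) {
            throw new IllegalArgumentException("No algorithm registered for task: " + task);
        }
        return algorithm.run(parameters);
    }
}

final class EngineDemo {
    public static void main(String[] args) {
        AbstractDmEngine engine = new AbstractDmEngine();
        // Stand-in for a third-party clustering implementation.
        engine.register("clustering", params -> "clusters=" + params.get("k"));

        Map<String, String> params = new HashMap<>();
        params.put("k", "3");
        System.out.println(engine.execute("clustering", params));
    }
}
```

Keeping registration separate from dispatch is one way to obtain a well-defined extension point: adding a new DM method means registering one more `MiningAlgorithm`, with no change to the user layer.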
Visual Environment
A consistent, uniform, flexible and intuitive GUI, with support throughout the
whole DM process. The principal focus is to support the user in:
Visual construction of the task relevant dataset: The user directly interacts
with data. For this task, there are two intuitive interaction spaces.
Visual construction of the mining query: The user directly interacts with data
and other parameters (e.g. threshold values) when making queries. For example,
in the Metaquery Environment the user can suggest patterns by linking
attributes, while the Association Rule Environment offers ‘visual baskets’.
Visual output presentation and interaction: Exploiting relevant, effective
visualizations; where necessary, we have designed novel visualizations.
Planning: E.g., ‘advertising’ relevant prior knowledge.
Handling the non-static nature of the user’s quest: E.g., enabling the user to
adjust.
Visual Environment: Overall
Visual Environment: Tree View (‘Progress Companion’)
Before user settings
After user settings
After DM results
Visual Environment: Clustering - Input
Visual Environment: Clustering - Output
... for more on the prototype, demo
Usability
Usability heuristics: Done, but we will keep referring to them regularly.
Mock-up tests: Done with DM experts. The experts gave encouraging feedback
and suggested ways to improve the interface.
(These tests were done at the end of 2001.)
Questionnaire experiments:
The experiments involved the application simulation, a case study, the data
schema and user tasks corresponding to the case study, and a questionnaire.
Positive interface features: consistency, layout/organization, visual exploration.
Negative interface features: some visual elements were too small or too big.
(These tests were done in July 2002.)
Formal usability tests: In the pipeline.
The Clustering Engine
Clustering method: Generalizations of three techniques: homogeneity,
separation, density.
Clustering based on homogeneity/separation: Homogeneity (separation) is a
global measure of the similarity between points belonging to the same
cluster (to different clusters)
Clustering based on density: Clusters are regions of the object space where
objects are located “most frequently”
Clustering based: The system selects the “best” clustering according to a cost
function
• For homogeneity/separation-based clustering the cost function is computed
by evaluating pointwise, clusterwise, and partitionwise similarity/dissimilarity
• For density-based clustering, the cost function is derived from an estimated
density function
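The selection of a "best" clustering by cost function can be illustrated for the homogeneity/separation case. The sketch below rests on assumptions that are not from the slides: homogeneity is taken as the mean pairwise Euclidean distance inside clusters, separation as the mean pairwise distance across clusters, and the cost as their difference. The class name `ClusteringCost` and its methods are hypothetical, and the density-based case is not covered.

```java
import java.util.Arrays;
import java.util.List;

final class ClusteringCost {
    // Euclidean distance between two points of equal dimension.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Mean pairwise distance over pairs that share a cluster (sameCluster = true)
    // or lie in different clusters (sameCluster = false); 0 if no such pair exists.
    static double meanPairDistance(double[][] points, int[] labels, boolean sameCluster) {
        double total = 0.0;
        int pairs = 0;
        for (int i = 0; i < points.length; i++) {
            for (int j = i + 1; j < points.length; j++) {
                if ((labels[i] == labels[j]) == sameCluster) {
                    total += distance(points[i], points[j]);
                    pairs++;
                }
            }
        }
        return pairs == 0 ? 0.0 : total / pairs;
    }

    // Assumed cost: small within-cluster distances and large between-cluster
    // distances are both rewarded, so a lower cost is better.
    static double cost(double[][] points, int[] labels) {
        return meanPairDistance(points, labels, true) - meanPairDistance(points, labels, false);
    }

    // Returns the candidate labelling with the lowest cost.
    static int[] selectBest(double[][] points, List<int[]> candidates) {
        int[] best = null;
        double bestCost = Double.POSITIVE_INFINITY;
        for (int[] labels : candidates) {
            double c = cost(points, labels);
            if (c < bestCost) {
                bestCost = c;
                best = labels;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] points = { {0.0}, {1.0}, {10.0}, {11.0} };
        List<int[]> candidates = Arrays.asList(
            new int[] {0, 0, 1, 1},   // groups nearby points together
            new int[] {0, 1, 0, 1});  // mixes near and far points
        System.out.println(Arrays.toString(selectBest(points, candidates)));
    }
}
```

On the four sample points, the first labelling costs 1 - 10 = -9 and the second 10 - 5.5 = 4.5, so the first is selected.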
Formal Semantics of the Input Environment
Visual language: abstract syntax + semantics
Abstract syntax: defined in terms of multi-graphs
• Visual components are vertices of the multi-graph
• Spatial relations between visual components are edges of the multi-graph
Semantics:
• Clustering: defined by a mapping between multi-graphs and cost functions
and predicates expressing optimality
• Metaqueries/association rules: defined by a mapping between multi-graphs
and rules
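The multi-graph abstract syntax and its mapping to rules might look as follows in Java. This is a hedged sketch, not the project's formal definition: the class `VisualMultiGraph`, the relation labels "linked-to" and "left-of", and the "A => B" rule format are all hypothetical choices made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical abstract syntax: a multi-graph whose vertices are visual
// components and whose edges are labelled spatial relations.
final class VisualMultiGraph {
    // An edge is directed and labelled; parallel edges are allowed.
    static final class Edge {
        final String from;
        final String relation;
        final String to;

        Edge(String from, String relation, String to) {
            this.from = from;
            this.relation = relation;
            this.to = to;
        }
    }

    private final List<String> vertices = new ArrayList<>();
    private final List<Edge> edges = new ArrayList<>();

    // Adds a visual component once; repeated names are ignored.
    void addComponent(String name) {
        if (!vertices.contains(name)) {
            vertices.add(name);
        }
    }

    // Both endpoints must already be components of the graph.
    void addRelation(String from, String relation, String to) {
        if (!vertices.contains(from) || !vertices.contains(to)) {
            throw new IllegalArgumentException("Unknown component in relation");
        }
        edges.add(new Edge(from, relation, to));
    }

    // Assumed semantics for rule-based tasks: every "linked-to" edge
    // denotes one rule "from => to"; other relations are ignored here.
    List<String> toRules() {
        List<String> rules = new ArrayList<>();
        for (Edge e : edges) {
            if (e.relation.equals("linked-to")) {
                rules.add(e.from + " => " + e.to);
            }
        }
        return rules;
    }

    public static void main(String[] args) {
        VisualMultiGraph g = new VisualMultiGraph();
        g.addComponent("Customer.age");
        g.addComponent("Order.total");
        g.addRelation("Customer.age", "linked-to", "Order.total");
        g.addRelation("Customer.age", "left-of", "Order.total"); // parallel edge
        System.out.println(g.toRules());
    }
}
```

Storing edges in a list rather than a set keyed by endpoint pair is what makes this a multi-graph: two components can be related in several ways at once.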
Operational Specification
1. Concrete, high-level syntax of the tasks proposed in the usability tests
• Describes “legal” click-streams allowed to occur during operation
• Standard grammar notation
2. Communication protocol between the abstract clustering engine and the data
mining engines
• XML DTDs based on PMML 2.0
• Extension of PMML 2.0 to:
   1. Specification of input
   2. Broader spectrum of clustering methods
3. Concrete semantics of the clustering task, obtained by mapping symbols in
the task grammar to structures of the communication protocol
• Interpretation function recursively defined on the grammar rules of the
high-level syntax of tasks
• The interpretation of a legal click-stream is an XML document satisfying
the DTD of the input specification
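The idea of interpreting a legal click-stream as an XML document can be sketched with a small recursive interpreter. Everything specific here is an assumption for illustration: the toy grammar in the comment, the click tokens ("select:", "clusters:", "run"), the class `ClickStreamInterpreter`, and the element names `ClusteringInput` and `Attribute`. None of them reproduces the project's grammar, its DTDs, or PMML 2.0.

```java
import java.util.Arrays;
import java.util.List;

// Assumed toy grammar (not the project's):
//   Task       ::= Selections Clusters "run"
//   Selections ::= "select:" NAME Selections | "select:" NAME
//   Clusters   ::= "clusters:" NUMBER
final class ClickStreamInterpreter {
    private final List<String> clicks;
    private int pos = 0;

    ClickStreamInterpreter(List<String> clicks) {
        this.clicks = clicks;
    }

    // Interprets a whole click-stream; throws if the stream is not legal.
    static String interpret(List<String> clicks) {
        return new ClickStreamInterpreter(clicks).task();
    }

    private String task() {
        String attributes = selections();
        String k = clusters();
        expect("run");
        if (pos != clicks.size()) {
            throw new IllegalArgumentException("Unexpected click after run");
        }
        return "<ClusteringInput numberOfClusters=\"" + k + "\">" + attributes + "</ClusteringInput>";
    }

    // Recursive rule: one selection, optionally followed by more selections.
    private String selections() {
        String name = consumeValue("select:");
        if (!name.matches("[A-Za-z_][A-Za-z0-9_.]*")) {
            throw new IllegalArgumentException("Illegal attribute name: " + name);
        }
        String element = "<Attribute name=\"" + name + "\"/>";
        if (pos < clicks.size() && clicks.get(pos).startsWith("select:")) {
            return element + selections();
        }
        return element;
    }

    private String clusters() {
        String k = consumeValue("clusters:");
        if (!k.matches("[1-9][0-9]*")) {
            throw new IllegalArgumentException("Cluster count must be a positive integer: " + k);
        }
        return k;
    }

    // Consumes the next click if it has the given prefix and returns the rest.
    private String consumeValue(String prefix) {
        if (pos >= clicks.size() || !clicks.get(pos).startsWith(prefix)) {
            throw new IllegalArgumentException("Expected " + prefix + " at click " + pos);
        }
        return clicks.get(pos++).substring(prefix.length());
    }

    // Consumes the next click if it equals the given token exactly.
    private void expect(String click) {
        if (pos >= clicks.size() || !clicks.get(pos).equals(click)) {
            throw new IllegalArgumentException("Expected " + click + " at click " + pos);
        }
        pos++;
    }

    public static void main(String[] args) {
        System.out.println(interpret(Arrays.asList("select:age", "select:income", "clusters:3", "run")));
    }
}
```

Each grammar rule has one method, so the interpretation is defined rule by rule. A stream that violates the grammar, such as one that starts with "clusters:3", has no interpretation and is rejected with an exception.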