Visualization for Classification and Clustering Techniques
Marc René
CSE 8331
Data Mining - Project 1
Overview
Importance of Data Visualization in the KDD Process
Understanding and Trust
Visualization techniques
– Classification
– Clustering
Future Directions
Marc René - CSE 8331
KDD Process
Selection
– Obtain data from all sources
Preprocessing
– After selecting the data, clean it to make sure it is consistent
Transformation
– After preprocessing the data, analyze the format/amount of data
Data Mining
– Once the data is in a usable format/content, apply various algorithms based upon the results to be achieved
Interpretation/Evaluation
– Finally, present the results of the data mining step to the user, so that the results can be used to solve the business need at hand
Importance of Data Visualization
The final step in the KDD process is highly dependent on the data visualization technique
A bad or inappropriate technique may result in misunderstanding
Misunderstanding may cause an incorrect decision, or no decision at all
The KDD process is useless if the results are not understandable
Current Issues w/Data Visualization
The literature suggests a significant reliance on expert users
General lack of data visualization support in many data mining tools [Goebel99]
These are significant problems if KDD/DM/Data Visualization will expand at the rates suggested
– Data visualization tool market: $2.2 billion by 2007 [Nuttall03]
Suggested Direction
Need to determine techniques that balance simplicity with completeness
If this can be done for non-expert users
– Simplicity & Completeness → Understanding
– Understanding → Trust
– Trust → more use of KDD/DM
Result will be:
– Better business value
– Higher ROI
Common Visualization Techniques
Visualization techniques are dependent upon
– The type of data mining technique chosen
– The underlying structure and attributes of the data
Classification
– Decision Trees
– Scatter Plots
– Axis-Parallel Decision Trees
– Circle Segments
– Decision Tables
Clustering
– Scatter Plots
– Dendrograms
– Smoothed Data Histograms
– Self-Organizing Maps
– Proximity Matrices
Classification
Decision Tree
Information limited to
– Attributes
– Splitting values
– Terminal node class assignments
Decision Tree with Histograms
Data mining rarely classifies 100% of the data correctly:
– Include the success of properly classifying the data: a histogram is added for each terminal node
– Percentage of data that was classified correctly/incorrectly
– Assists users in determining if the classification is ‘good enough’
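The numbers behind such a per-node histogram can be sketched in a few lines. This is only an illustration, not the tooling used for the slides: the function name `leaf_histograms`, the record layout (leaf, actual class, predicted class), and the sample records are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def leaf_histograms(records):
    """Return the correct/incorrect percentages for every terminal node.

    records: iterable of (leaf_id, actual_class, predicted_class) tuples
    (a hypothetical layout chosen for this sketch).
    """
    counts = defaultdict(Counter)
    for leaf, actual, predicted in records:
        outcome = "correct" if actual == predicted else "incorrect"
        counts[leaf][outcome] += 1

    result = {}
    for leaf, c in counts.items():
        total = c["correct"] + c["incorrect"]
        result[leaf] = {
            "correct_pct": 100.0 * c["correct"] / total,
            "incorrect_pct": 100.0 * c["incorrect"] / total,
        }
    return result

# Made-up classification results for two terminal nodes.
records = [
    ("leaf_A", "yes", "yes"),
    ("leaf_A", "yes", "yes"),
    ("leaf_A", "no", "yes"),
    ("leaf_A", "yes", "yes"),
    ("leaf_B", "no", "no"),
    ("leaf_B", "yes", "no"),
]
histograms = leaf_histograms(records)
for leaf, bars in sorted(histograms.items()):
    print(leaf, bars)
```

Each entry holds the two bar heights a viewer would see under that terminal node, which is what lets a user judge whether the classification is ‘good enough’.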
Decision Tree - Different Format
Vertical representation allows for easy user interaction
– Combines the split points and classification accuracy compactly
– Key difference: colors are matched with a specific classification
Scatter Plot with Regression Line
Excellent way to view 2-dimensional data
Familiar to anyone who has taken high-school algebra
Regression lines provide descriptive techniques for classification
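As a rough illustration of the line drawn over such a plot, the sketch below fits an ordinary least-squares line to 2-dimensional points. The slides do not say which fitting method was used, so least squares is an assumption here, as are the names `fit_line` and `above_line` and the sample points.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def above_line(point, slope, intercept):
    """True if the point lies strictly above the fitted line."""
    x, y = point
    return y > slope * x + intercept

# Made-up points that lie exactly on y = 2x + 1.
points = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
slope, intercept = fit_line(points)
print(slope, intercept)
print(above_line((2.0, 8.0), slope, intercept))
```

Describing a point by which side of the line it falls on is one simple way a regression line can support a classification view.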
Axis-Parallel Decision Tree
Combination Scatter Plot and Decision Tree
Areas divided into regions parallel to the axes
Well suited for classification problems with two attribute values
High visibility into the impact of outliers
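A minimal sketch of the idea, assuming a tree stored as nested dictionaries: every split tests one attribute against a threshold, so each leaf corresponds to a rectangle that can be drawn over the scatter plot. The tree, its thresholds, and the names `classify` and `regions` are hypothetical.

```python
def classify(node, point):
    """Walk the tree until a leaf is reached and return its class label."""
    while "label" not in node:
        axis, threshold = node["axis"], node["threshold"]
        node = node["low"] if point[axis] <= threshold else node["high"]
    return node["label"]

def regions(node, bounds):
    """Yield (bounds, label) for every leaf.

    bounds is a list of (min, max) pairs, one pair per attribute.
    """
    if "label" in node:
        yield bounds, node["label"]
        return
    axis, threshold = node["axis"], node["threshold"]
    lo, hi = bounds[axis]
    low_bounds = list(bounds)
    low_bounds[axis] = (lo, threshold)
    high_bounds = list(bounds)
    high_bounds[axis] = (threshold, hi)
    yield from regions(node["low"], low_bounds)
    yield from regions(node["high"], high_bounds)

# Hypothetical tree over two attributes (axis 0 = x, axis 1 = y).
tree = {
    "axis": 0, "threshold": 5.0,
    "low": {"label": "A"},
    "high": {
        "axis": 1, "threshold": 2.0,
        "low": {"label": "B"},
        "high": {"label": "C"},
    },
}

rectangles = list(regions(tree, [(0, 10), (0, 10)]))
for rectangle, label in rectangles:
    print(label, rectangle)
print(classify(tree, (7.0, 4.0)))
```

Because the rectangles share the plot's axes, a point that falls in a region of another class is immediately visible, which is how outliers stand out.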
Circle Segments
Multi-dimensional data
Maps a dataset with n dimensions onto a circle divided into n segments
– Each segment is a different attribute
– Each pixel inside a segment is a single value of the attribute
– Values of each attribute are then sorted (independently) and assigned a color based upon their class
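The mapping described above can be sketched without any drawing code. The function below only computes, for each attribute, the angular range of its segment and the class colors of its values in sorted order; how pixels are arranged inside a segment is left out. The name `circle_segments`, the palette, and the data are assumptions for illustration.

```python
def circle_segments(rows, classes, palette):
    """Compute one segment per attribute.

    rows: list of tuples, one value per attribute.
    classes: class label of each row.
    palette: mapping from class label to color name.
    """
    n = len(rows[0])
    width = 360.0 / n
    segments = []
    for i in range(n):
        # Sort the rows independently for every attribute.
        order = sorted(range(len(rows)), key=lambda r: rows[r][i])
        segments.append({
            "start_deg": i * width,
            "end_deg": (i + 1) * width,
            "pixel_colors": [palette[classes[r]] for r in order],
        })
    return segments

# Made-up dataset with two attributes and two classes.
rows = [(5.1, 0.2), (6.3, 1.8), (4.9, 1.5)]
classes = ["a", "b", "b"]
palette = {"a": "red", "b": "blue"}

segments = circle_segments(rows, classes, palette)
for segment in segments:
    print(segment)
```

In this toy data the second attribute separates the classes cleanly (one red pixel followed by blue ones), while the first attribute interleaves them.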
Decision Table
Interactive technique
Maps attribute data to a 2D hierarchical matrix
Levels can be drilled down to another set of attributes
Height of a cell conveys the number of data entities
Cells color coded
– Neutral color: no data in that intersection point
– Color coded by class (percentage)
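One level of such a table can be sketched as a dictionary of cells, each holding the entity count (the cell height) and the class percentages (the color coding). The name `decision_table`, the attribute values, and the records below are made up for the example; drill-down is not shown.

```python
from collections import Counter
from itertools import product

def decision_table(records, row_values, col_values):
    """Build one level of the matrix.

    records: iterable of (row_value, col_value, class_label) tuples.
    """
    cells = {cell: Counter() for cell in product(row_values, col_values)}
    for row, col, label in records:
        cells[(row, col)][label] += 1

    table = {}
    for cell, counts in cells.items():
        total = sum(counts.values())
        if total == 0:
            # No data at this intersection point: neutral color.
            table[cell] = {"count": 0, "color": "neutral"}
        else:
            shares = {label: 100.0 * n / total for label, n in counts.items()}
            table[cell] = {"count": total, "class_pct": shares}
    return table

# Hypothetical records: (income, age group, class).
records = [
    ("low", "young", "no"),
    ("low", "young", "yes"),
    ("low", "young", "no"),
    ("low", "young", "no"),
    ("high", "old", "yes"),
]
table = decision_table(records, ["low", "high"], ["young", "old"])
for cell, info in table.items():
    print(cell, info)
```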
Decision Table
Clustering
Scatter Plot
Extensions include displaying points using:
– Various sizes and colors to indicate additional attributes
– Shading of points to introduce a third dimension
– Different brightness levels of the same color to represent continuous values for the same attribute
– Various point or classification identifiers (e.g., numbers, symbols)
– Various glyphs to display additional attributes
Scatter Plot
Map decision trees on top of scatter plots to describe clusters
Scatter Plot with Regression Lines
Scatter Plot w/Min Spanning Tree
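The slide names the technique without describing it. As a hedged sketch of how the tree overlaid on the points could be computed, the code below uses Prim's algorithm with Euclidean distances; both choices, the function name `minimum_spanning_tree`, and the sample points are assumptions, not details from the presentation. `math.dist` requires Python 3.8 or later.

```python
import math

def minimum_spanning_tree(points):
    """Prim's algorithm; returns edges as (i, j, distance) index triples."""
    n = len(points)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Cheapest edge that connects the tree to a point outside it.
        distance, i, j = min(
            (math.dist(points[i], points[j]), i, j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        edges.append((i, j, distance))
        in_tree.add(j)
    return edges

# Made-up points: three close together and one far away.
points = [(0, 0), (1, 0), (1, 1), (5, 5)]
edges = minimum_spanning_tree(points)
for i, j, distance in edges:
    print(points[i], points[j], round(distance, 3))
```

Long edges in the tree, such as the one reaching the far point here, are natural places to look for boundaries between clusters.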
Dendrogram
Intuitive representation: hierarchical decomposition of data into sets of nested clusters.
From an agglomerative perspective:
– Each leaf: a single data entity
– Each internal node: the union of all data entities in its sub-tree
– The root: the entire dataset
– The height of any internal node: the similarity between its ‘children’.
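The agglomerative view above can be sketched as a list of merges. The linkage rule is not stated in the slides, so single linkage on 1-dimensional values is assumed here, and the recorded height is a distance (smaller means more similar). The name `single_linkage` and the values are illustrative.

```python
def single_linkage(values):
    """Agglomerative clustering of 1-D values.

    Returns the merges as (left_members, right_members, height) tuples,
    from the first merge up to the root.
    """
    clusters = [(v,) for v in values]  # every leaf is a single data entity
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(abs(x - y) for x in clusters[a] for y in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]  # union of both sub-trees
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

values = [1.0, 2.0, 6.0, 7.0, 20.0]
merges = single_linkage(values)
for left, right, height in merges:
    print(left, right, height)
```

The last merge is the root and contains the entire dataset; the heights are what a dendrogram plots on its vertical axis.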
Dendrogram with Exemplars
The “most typical member of each cluster” [Wishart99]
– Underlined labels of the leaves
– Done in combination with shading to identify the clustering level
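The slide does not say how the most typical member is chosen. One plausible reading, assumed here purely for illustration, is the medoid: the member with the smallest total distance to all other members of its cluster. The function name `exemplar` and the sample cluster are hypothetical.

```python
def exemplar(cluster, distance):
    """Return the medoid: the member closest, in total, to all the others."""
    return min(
        cluster,
        key=lambda member: sum(distance(member, other) for other in cluster),
    )

def absolute_difference(a, b):
    """Distance between two 1-D values."""
    return abs(a - b)

# Made-up cluster of 1-D values with one outlying member.
cluster = [1.0, 2.0, 3.0, 4.0, 10.0]
typical = exemplar(cluster, absolute_difference)
print(typical)
```

The outlying member has little influence on the result, which is one reason a medoid is a reasonable stand-in for a "most typical" member.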
Smoothed Data Histogram
Represents data on a ‘display map’
Similar data items are located close to each other
The more defined the clusters, the lighter the colors
Smoothed Data Histogram - Detail
Self-Organizing Map ‘Grid’
Source of the Smoothed Data Histogram
Numbers indicate the most ‘common’ cluster
[Figure: self-organizing map grid; each cell shows a cluster number from 1 to 11]
Proximity Matrix
Graphically displays the relationship between data elements
Usually symmetric, but can be sorted by the strength of relationships
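A small sketch of the underlying matrix, assuming Euclidean distance as the proximity measure. The reordering step is only one simple possibility (sort elements by their distance to the first element); the slides do not specify a sorting rule. The names `proximity_matrix` and `reorder` and the points are made up. `math.dist` requires Python 3.8 or later.

```python
import math

def proximity_matrix(points):
    """Pairwise Euclidean distances; entry [i][j] equals entry [j][i]."""
    return [[math.dist(p, q) for q in points] for p in points]

def reorder(matrix, order):
    """Apply the same permutation to the rows and the columns."""
    return [[matrix[i][j] for j in order] for i in order]

# Made-up points forming two tight pairs that are far apart.
points = [(0.0, 0.0), (9.0, 0.0), (0.0, 1.0), (9.0, 1.0)]
matrix = proximity_matrix(points)

# Assumed sorting rule: order elements by distance to the first element.
order = sorted(range(len(points)), key=lambda i: matrix[0][i])
sorted_matrix = reorder(matrix, order)

for row in sorted_matrix:
    print([round(value, 2) for value in row])
```

After reordering, closely related elements sit next to each other, so small distances gather in blocks along the diagonal when the matrix is shaded.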
Proximity Matrix and Dendrogram
Summary
Data visualization techniques are extremely important for understanding the KDD process
A balance of simplicity and completeness is important
The techniques discussed allow average users to understand the results of the KDD process
Understanding allows KDD results to be interpreted and trusted by non-expert users, extending the business value
If data visualization techniques do not establish a high level of trust in the KDD process, the process will fail
Future Direction
Significant effort will be spent on improving data visualization techniques in the next few years
– The KDD process and data mining are becoming more widespread
– Businesses will expect tools to become more ‘user-friendly’ and to support varied levels of skill
Trends are moving to a more interactive mode
– Static reporting techniques (e.g., standard decision trees, standard circle segments) are being replaced
– Interactive techniques (e.g., smoothed data histograms, interactive circle segments and decision tables) are taking their place
Very interactive ‘virtual reality’ data models are also being considered/proposed
References
Part 1
Ahlberg, C., “Spotfire: An Information Exploration Environment”, ACM SIGMOD Record, Volume 25, Number 4, December 1996
Ankerst, M., et al., “Visual Classification: An Interactive Approach to Decision Tree Construction”, KDD-99, San Diego, CA
Ankerst, M., et al., “Towards an Effective Cooperation of the User and the Computer for Classification”, KDD’00, Boston, MA, USA
Apte, C. and Weiss, S. M., “Data Mining with Decision Trees and Decision Rules”, Future Generation Computer Systems, November 1997
Arkin, E., et al., “Decision Trees for Geometric Models”, ACM, 9th Annual Computational Geometry, 5/93/CA, USA
de Haan, G., et al., “Towards Intuitive Exploration Tools for Data Visualization in VR”, VRST’02, November 11-13, 2002, Hong Kong
Dunham, M. H., Data Mining – Introductory and Advanced Topics, Prentice Hall, 2003
Fekete, J. and Plaisant, C., “Excentric Labeling: Dynamic Neighborhood Labeling for Data Visualization”, Proceedings of the Conference on Human Factors in Computer Systems (CHI'99), ACM, New York
Fredrikson, A., et al., “Temporal, Geographical and Categorical Aggregations Viewed through Coordinated Displays: A Case Study with Highway Incident Data”, NPIVM’99, Kansas City, MO, 1999
Goebel, M. and Gruenwald, L., “A Survey of Data Mining and Knowledge Discovery Software Tools”, SIGKDD Explorations, June 1999
Han, J. and Cercone, N., “RuleViz: A Model for Visualizing Knowledge Discovery Process”, Sixth International Conference on Knowledge Discovery and Data Mining, 2000
Ho, T., et al., “Visualization Support for a User-Centered KDD Process”, SIGKDD’02, 2002
References
Part 2
Hsieh, H. and Shipman, F. M. III, “VITE: A Visual Interface Supporting the Direct Manipulation of Structured Data Using Two-Way Mappings”, IUI 2000, New Orleans, LA
“Solving Business Problems with IBM DB2 Intelligent Miner”, Presented by DB2 Developer Domain, http://www7b.software.ibm.com/dmdd
Jain, A. K., et al., “Data Clustering: A Review”, ACM Computing Surveys, Volume 31, Number 3, September 1999
Keim, D. A., “Visual Techniques for Exploring Databases”, KDD’97, Newport Beach, CA, 1997
Kohavi, R. and Sommerfield, D., “Targeting Business Users with Decision Table Classifiers”, KDD’98, New York City, 1998
Kohavi, R., et al., “Emerging Trends in Business Analytics”, Communications of the ACM, Volume 45, Number 8, August 2002
Liu, B., et al., “Clustering Through Decision Tree Construction”, CIKM 2000, ACM, McLean, VA, 2000
Louie, J. Q. and Kraay, T., “Origami: A New Visualization Tool”, KDD-99, San Diego, CA
Moret, B. M. E., “Decision Trees and Diagrams”, Computing Surveys, Volume 14, Number 4, December 1982
Nuttall, C., “It's a Vision Thing”, Financial Times-IT Review, November 12, 2003
Pampalk, E., et al., “Using Smoothed Data Histograms for Cluster Visualization in Self-Organizing Maps”, Proceedings of the International Conference on Artificial Neural Networks (ICANN’02), Springer Lecture Notes in Computer Science, Madrid, Spain, 2002
Pampalk, E., et al., “Content-based Organization and Visualization of Music Archives”, Proceedings of the 10th ACM International Conference on Multimedia (MM’02), Juan-les-Pins, France, 2002
Pampalk, E., et al., “A New Approach to Hierarchical Clustering and Structuring of Data with Self-Organizing Maps”, Intelligent Data Analysis Journal (IDA), Volume 8, Number 2, 2003
References
Part 3
Rauber, A., et al., “Empirical Evaluation of Clustering Algorithms”, Journal of Information and Organizational Sciences (JIOS), Volume 24, Number 2, 2000
“Finding the Solution to Data Mining – Exploring the Features and Components of Enterprise Miner, Release 4.1 from SAS”, SAS White Paper, 2001
See5 - Data Mining Tools, Release 1.9, Rulequest Research, 1997-2003
Simoff, S. J., “VDM@ECML/PKDD2001: The International Workshop on Visual Data Mining at ECML/PKDD 2001”, SIGKDD Explorations, Volume 3, Issue 2, 2001
Thearling, K., “Understanding Data Mining: It’s All in the Interaction”, DS Star: The On-Line Executive Journal for Data-Intensive Decision Support, Volume 1, Number 10, December 9, 1997
Thearling, K., et al., “Visualizing Data Mining Models”, as published in Information Visualization in Data Mining and Knowledge Discovery, edited by Fayyad, Usama, et al., Morgan Kaufmann, 2001
Ward, M. O., “XmdvTool: Integrating Multiple Methods for Visualizing Multivariate Data”, Proceedings of IEEE Visualization '94, Washington, DC, 1994
Wishart, D., “Efficient Hierarchical Cluster Analysis for Data Mining and Knowledge Discovery”, Computing Science and Statistics, Volume 30, 1998
Wishart, D., “ClustanGraphics3 – Interactive Graphics for Cluster Analysis”, published in: Classification in the Information Age, Gaul, W. and Locarek-Junge, H. (Eds.), Springer, 1999
XmdvTool Home Page (http://davis.wpi.edu/~xmdv/visualizations.html)