data visualization


Data Mining
and visualization
(2)
Alfredo Vellido
Plan
A brief introduction to data
visualization
Visualization & history
Perception
Visual exploratory DM
The good, the bad & the ugly
…
Visualization
recap
Recap … PRINCIPLES:
the data mining visual cycle, or
Visual Exploratory Data Mining
Data gathering
Data Manipulation
Hypothesis of reality
DATA
Preprocessing
&
transformation
MODEL
Model manipulation
Graphic
engine
Data
exploration
Control &
navigation
visuo-spatial
model
cognitive-logic
model
Recap … CRISP: Methodology phases
Recap … Data visualization vs. model visualization
What type of visualization are we looking for?
Descriptive?
Exploratory?
What type of visualization are we looking for?
DESCRIPTIVE
PRINCIPLES:
a good visualization should...
...show data and/or results...
...at different levels of detail, from the overall landscape to the
fine detail
... in a coherent manner, even if we are dealing with large
collections.
... avoiding, as much as possible, distortion in their
representation
...focus attention on the most relevant features...
...minimizing the impact of uninformative and misleading data
...integrating statistical results and linguistic descriptions (if
possible and relevant).
DATA EXPLORATION: The CURSE
of dimensionality
Most data available to us are stored in databases of various kinds, in
numeric format, and mostly organized as tables (remember the
survey!). An extension of these are the data cubes generated by OLAP
processes.
What are your preferred methods for storing data for data mining? [403 votes
total]
Text, CSV (comma-separated): 72 votes (18%)
Text, space or tab separated: 55 votes (14%)
Excel: 38 votes (9%)
SAS: 57 votes (14%)
SPSS: 31 votes (8%)
S-Plus/R: 15 votes (4%)
Weka ARFF: 23 votes (6%)
Other data mining tool format: 11 votes (3%)
In a database system: 93 votes (23%)
Other - please comment: 8 votes (2%)
How to display multiple dimensions? Cases:
Low dimensionality (1-3D)
Moderate dimensionality (4-10D)
High dimensionality (>10D)
DATA EXPLORATION:
low-moderate dimensionality <10D
Spatial coordinates
3D requires interactivity
Further pre-cognitive visual
elements allow us to “add”
extra dimensions:
color, movement, shape, …
Exotic solutions
Glyph*: Chernoff faces, stickfigures, whiskers...
* A glyph is a graphical representation of one or more
characters, or of part of a character. A character is a
textual entity, whereas a glyph is a graphical entity.
… some of those alternatives
Gantt diagrams…
… some of those alternatives
Chernoff faces
Herman Chernoff (1973). "The Use of Faces to Represent Points in k-Dimensional Space
Graphically". Journal of the American Statistical Association 68 (342): 361–368.
DATA EXPLORATION:
high dimensionality data
How do we visualize data of high (or even very high)
dimensionality?
Some of the alternatives are rather straightforward… some others
are not…
Eliminate dimensions (data variables): those which are
redundant and / or uninformative (at least you manage to
alleviate part of the problem…)
Divide & conquer: a classic: create multiple visualizations of
low dimensionality.
Latent and projection models
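A minimal sketch of the "eliminate dimensions" option above, assuming that "uninformative" means a constant variable and "redundant" means a near-perfect linear correlation with a variable already kept. The helper names, the 0.95 threshold and the toy data are illustrative assumptions, not part of the original slides.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(va * vb)

def drop_redundant(columns, threshold=0.95):
    """Keep a variable only if it varies and is not strongly
    correlated with any variable that was kept before it."""
    kept = {}
    for name, values in columns.items():
        if len(set(values)) <= 1:
            continue  # uninformative: the variable never changes
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept

# Hypothetical toy table: height appears twice (cm and inches).
data = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_in": [59.1, 63.0, 66.9, 70.9],
    "constant": [1.0, 1.0, 1.0, 1.0],
    "weight_kg": [70.0, 52.0, 80.0, 61.0],
}
selected = drop_redundant(data)
print(list(selected))  # ['height_cm', 'weight_kg']
```

The result depends on column order: the first variable of a correlated pair is the one that survives.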
DATA EXPLORATION:
The Grand Tour: multiple visualizations of the Iris data
www.ics.uci.edu/~mlearn/MLRepository.html
TECHNIQUES: Latent models and
projection
Projection
Dimensionality compression
Similarity information coding
Clustering
Finding grouping structure in data
Similarity information coding
Self-Organizing Map (SOM) &
Generative Topographic Mapping (GTM)
They combine latent representation and
clustering
TECHNIQUES: projection
Representation in <4-D, so that the distance/neighborhood relations between multi-dimensional
points are preserved as faithfully as possible
It is impossible to preserve all the information
Some scale normalization is required
Linear vs. non-linear projections
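One common form of scale normalization is z-score standardization, sketched here under the assumption that every variable should contribute equally to inter-point distances. The function name and the toy rows are illustrative, not taken from the slides.

```python
import statistics

def standardize(rows):
    """Rescale every variable to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    # A constant variable has zero spread; divide by 1.0 instead of 0.
    stdevs = [statistics.pstdev(c) or 1.0 for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stdevs)] for row in rows]

# Two variables on very different scales (hypothetical values).
rows = [(1.0, 1000.0), (2.0, 3000.0), (3.0, 2000.0)]
scaled = standardize(rows)
```

Without this step the second variable would dominate any Euclidean distance computed on `rows`.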
TECHNIQUES:
projection: methods
Methods based on inter-point distances, in which we aim to minimize an inherent
projection distortion (E), summed over pairs of points, where:
$d_x$ = distance in the original space
$d_y$ = distance in the projection space
$h$ = neighborhood function
MDS, PCA: $E = \sum (d_x - d_y)^2$
Sammon’s projection: $E = \sum (d_x - d_y)^2 / d_x$
CCA: $E = \sum (d_x - d_y)^2 \, e^{-d_y}$
SOM: $E = \sum d_x^2 \, h(d_y)$
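The distance-based errors can be evaluated directly for any candidate projection. The sketch below assumes Euclidean distances and sums over all pairs of points; it covers the MDS, Sammon and CCA forms and leaves out the SOM form, which needs a neighbourhood function defined on the map grid. Function and variable names are illustrative.

```python
import math
from itertools import combinations

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def projection_errors(original, projected):
    """Return the (MDS-style, Sammon-style, CCA-style) errors of a projection."""
    e_mds = e_sammon = e_cca = 0.0
    for i, j in combinations(range(len(original)), 2):
        dx = euclidean(original[i], original[j])    # distance in the original space
        dy = euclidean(projected[i], projected[j])  # distance in the projection space
        sq = (dx - dy) ** 2
        e_mds += sq
        e_sammon += sq / dx           # assumes no duplicate points (dx > 0)
        e_cca += sq * math.exp(-dy)   # short projected distances weigh more
    return e_mds, e_sammon, e_cca

# Three points in 3-D and a 2-D projection that simply drops the z coordinate.
original = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 1.0)]
projected = [(x, y) for x, y, _ in original]
errors = projection_errors(original, projected)
```

Dropping z collapses the first and third points onto each other, which accounts for almost all of the error in this example.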
TECHNIQUES:
projection: methods in a nutshell
MDS: a technique used in data visualisation for exploring similarities or
dissimilarities in data. An MDS algorithm starts with a matrix of item-item
similarities, then assigns a location to each item in a low-dimensional space,
suitable for graphing or 3D visualisation.
Taxonomy:
Metric multidimensional scaling -- assumes the input matrix is just an item-item distance
matrix. Analogous to PCA, an eigenvector problem is solved to find the locations that
minimize distortions to the distance matrix. Its goal is to find Euclidean distances
approximating the given distances.
Generalized multidimensional scaling (GMDS) -- a superset of metric MDS that allows
the target distances to be non-Euclidean.
Non-metric multidimensional scaling -- finds a non-parametric monotonic relationship
between the dissimilarities in the item-item matrix and the Euclidean distances between
items, together with the location of each item in the low-dimensional space.
Biblio:
Abdi, H. (2007). Metric multidimensional scaling. In N.J. Salkind (Ed.): Encyclopedia of
Measurement and Statistics. Thousand Oaks (CA): Sage.
Kruskal, J. B., and Wish, M. (1978), Multidimensional Scaling, Sage University Paper
series on Quantitative Application in the Social Sciences, 07-011. Beverly Hills and
London: Sage Publications.
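A minimal sketch of the eigenvector formulation of metric MDS, restricted to a one-dimensional embedding so that plain power iteration is enough. It assumes a symmetric distance matrix whose largest eigenvalue, after double centering, is positive; a practical implementation would use a full eigendecomposition from a numerical library. Names and data are illustrative.

```python
import math

def classical_mds_1d(dist, iterations=200):
    """Embed items on a line from a symmetric distance matrix."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(r) / n for r in sq]
    total_mean = sum(row_mean) / n
    # Double centering: B = -1/2 * J * D^2 * J (row means equal column means).
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    # Power iteration for the leading eigenvector of B.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * b[i][j] * v[j] for i in range(n) for j in range(n))
    # Coordinates are the eigenvector scaled by the square root of its eigenvalue.
    return [math.sqrt(eigenvalue) * x for x in v]

# Four items that really lie on a line at positions 0, 2, 5 and 9.
positions = [0.0, 2.0, 5.0, 9.0]
dist = [[abs(p - q) for q in positions] for p in positions]
coords = classical_mds_1d(dist)
```

The recovered coordinates match the original positions only up to translation and reflection, since distances carry no information about either.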
TECHNIQUES:
projection: methods in a nutshell
PCA: a linear transformation that represents the data in a
new coordinate system such that the greatest variance of
the data lies along the first coordinate (called the first principal
component), the second greatest variance along the second
coordinate, and so on. PCA can be used for dimensionality
reduction by retaining only the leading components, that is,
those that account for most of the variance in the dataset.
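A minimal sketch of how the first principal component can be found, assuming power iteration on the sample covariance matrix. It starts from the all-ones vector, which would fail if that vector happened to be orthogonal to the component, or if the data had zero variance. Names and data are illustrative.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, variance along it, projected scores)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - means[k] for k in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
    scores = [sum(c[k] * v[k] for k in range(d)) for c in centered]
    return v, variance, scores

# Points lying exactly on the line y = 2x: one component explains everything.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, variance, scores = first_principal_component(points)
```

Here `direction` points along (1, 2) normalized to unit length, and `scores` is the one-dimensional representation of the four points.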
Taxonomy:
Kernel PCA
Probabilistic PCA (PPCA)
Curvilinear Component Analysis (CCA): when unfolding a nonlinear structure, Sammon's
mapping cannot reproduce all distances. One way to face this
problem consists in favouring local topology: CCA tries to reproduce
short distances first, while long distances remain secondary.
Factor Analysis (FA)
Some source code:
Open Computer Vision Library @ sourceforge.net/projects/opencvlibrary/
Murtagh’s page @ http://astro.u-strasbg.fr/~fmurtagh/mda-sw/
TECHNIQUES:
projection: example
Sammon’s projection
PCA
CCA
TECHNIQUES:
projection: discussion; pros & cons
Projection techniques code proximity / similarity information in
spatial coordinates (plus, sometimes, extra pre-cognitive
elements such as colour ...)
They allow…
… Finding “natural” data groupings (clusters) on the basis of some
sort of similarity
… Finding the “shapes” of these groupings
But ...
Projection is always limited by error and information loss
New projection coordinates are not always readily interpretable
(they are latent by definition)
The original relations between data dimensions are lost
Quite often, the computational effort has to be taken into account, as
most of these methods are based on distances between multivariate
points.
TECHNIQUES:
multiple visualizations
How to get some of the info conveyed by
observable variables back into the projections?
One possibility: Using multiple visualizations.
Parallel coordinates and pre-cognitive stimuli (colour, position...)
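A minimal sketch of the data preparation behind a parallel-coordinates plot, assuming each variable is rescaled to [0, 1] so that all vertical axes share one height. Each record becomes a polyline of (axis index, scaled value) pairs that any plotting tool can draw. The function name and the sample rows are illustrative.

```python
def parallel_coordinates(records):
    """Rescale every variable to [0, 1] and return one polyline per record."""
    columns = list(zip(*records))
    lows = [min(c) for c in columns]
    # A constant variable has zero range; use 1.0 to avoid dividing by zero.
    spans = [(max(c) - min(c)) or 1.0 for c in columns]
    return [
        [(axis, (value - lows[axis]) / spans[axis]) for axis, value in enumerate(record)]
        for record in records
    ]

# Three four-dimensional records (illustrative values).
records = [(5.1, 3.5, 1.4, 0.2), (7.0, 3.2, 4.7, 1.4), (6.3, 3.3, 6.0, 2.5)]
polylines = parallel_coordinates(records)
```

Colouring each polyline by a cluster or class label is one way to bring a further variable back into the picture.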
TECHNIQUES:
SOM & GTM
Self-Organizing Feature Map (or Kohonen Maps)
k-means is a special case of SOM (a SOM whose neighbourhood shrinks to the winning unit alone)
Discretization (in the form of network grids) and projection are performed
simultaneously
The model is a set of prototypes
Cooperative learning (through the neighbourhood function)
Competitive learning (winner takes most, if not all)
GTM is a probabilistic alternative to SOM (i.e., a form of statistical
learning)
GTM is a generative model and, therefore, aims to reproduce data density
distributions
It defines a proper error function
It is a non-linear latent model that can be interpreted as a mixture model, as
well.
All the learning parameters can be adaptively optimized.
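A minimal sketch of online SOM training, assuming a one-dimensional chain of units, a Gaussian neighbourhood function, and linearly decaying learning rate and radius. These schedules and all names are illustrative choices, not those of any particular SOM package. GTM, which is fitted by statistical estimation rather than by this heuristic update, is not covered.

```python
import math
import random

def train_som(data, n_units=5, epochs=50, seed=0):
    """Train a 1-D SOM (a chain of prototypes) on a list of equal-length points."""
    rng = random.Random(seed)
    prototypes = [list(rng.choice(data)) for _ in range(n_units)]
    for epoch in range(epochs):
        frac = epoch / epochs
        rate = 0.5 * (1.0 - frac)                      # learning rate decays
        radius = max(n_units / 2 * (1.0 - frac), 0.5)  # neighbourhood shrinks
        for x in rng.sample(data, len(data)):
            # Competitive step: find the best-matching unit (the winner).
            winner = min(
                range(n_units),
                key=lambda u: sum((a - b) ** 2 for a, b in zip(prototypes[u], x)),
            )
            # Cooperative step: the winner and its grid neighbours move towards x.
            for u in range(n_units):
                h = math.exp(-((u - winner) ** 2) / (2 * radius ** 2))
                prototypes[u] = [p + rate * h * (a - p) for p, a in zip(prototypes[u], x)]
    return prototypes

# Ten points along the diagonal of the unit square.
data = [(i / 9, i / 9) for i in range(10)]
prototypes = train_som(data)
```

With a very small radius only the winner moves, which is the sense in which the update reduces to an online form of k-means.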
TECHNIQUES:
SOM & GTM: training / fitting
The learning process for both models can be illustrated by the
fisherman's net simile.
TECHNIQUES:
SOM & GTM: clustering
The SOM and GTM “units” can be interpreted as micro-clusters
U-matrix (distance in local neighbourhood) or Magnification Factor (distortion levels)
Discrete or fuzzy clusters, from local density or probability maxima
Hierarchical clustering and dendrograms
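A minimal sketch of a U-matrix, assuming a rectangular grid with at least two units, four-connected neighbours, and the mean Euclidean distance between a unit's prototype and those of its neighbours. Other neighbourhood definitions are possible. The function name and the toy grid are illustrative.

```python
import math

def u_matrix(grid):
    """grid[r][c] is the prototype vector of the unit at row r, column c."""
    rows, cols = len(grid), len(grid[0])
    result = []
    for r in range(rows):
        line = []
        for c in range(cols):
            dists = [
                math.dist(grid[r][c], grid[nr][nc])
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= nr < rows and 0 <= nc < cols
            ]
            line.append(sum(dists) / len(dists))
        result.append(line)
    return result

# A 2 x 3 map whose last column sits far away from the other two.
grid = [
    [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)],
    [(0.0, 1.0), (1.0, 1.0), (10.0, 1.0)],
]
u = u_matrix(grid)
```

High values mark units that sit on a boundary between groups of prototypes; in this toy map they appear in the second and third columns, on either side of the gap.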
TECHNIQUES:
SOM & GTM: multiple visualization
TECHNIQUES:
SOM & GTM: visualization of class
membership
Visualization: further exotica
Exotica:
Cone trees
Exotica:
Mapscapes
Visualization: software
Visualizing data:
Simple and off the shelf:
SS&C: Heatmaps®
…Complex and off the shelf:
TheBrain Tech. Corp.
“This is the knowledge crisis –
An ever-increasing demand for
organizational knowledge coupled
with an unforgiving environment in
which to produce it. Currently, we
have no systems to automate and
capture the knowledge processes
that are critical to our success.”
Woven and off the shelf:
Ixacta Web Analyzer
Neighborhood
sitemap diagram:
Ixsite creates this
diagram to help you
visualize the
relationship between
the files on your site.
Woven and free:
http://graphics.stanford.edu/
SOM off the shelf:
Visumap (www.visumap.net)
Ellipse eSOM (www.ellipse.fi)
SOM fishing:
REEFSOM
Applied Neuroinformatics
Group, Bielefeld University,
Germany
Visualization: in summary …
In summary ...
Which are the features of a good, successful visualization?
Show the data (exploratory element)
Focus the attention (… on the most relevant aspects)
Never forget the “human factor” in visual perception
The science of vision is the necessary framework for the
visualization techniques
You have to be careful with pre-cognitive elements (position,
movement, colour, shape) in visual coding of dimensions.
How to use visualization in exploratory data mining?
Visualization allows speculation and model validation.
Visualization of high-dimensional data sets can be
accomplished through:
projections and clustering methods
multiple simultaneous visualizations.
The good ...
According to Michael Friendly’s Gallery of
Data Visualization (Psych./York Univ.)
NY weather in 1980. NYT, Jan.1981
2200 data pieces!!!
The good ...
According to Michael Friendly’s Gallery of
Data Visualization (Psych./York Univ.)
... And the bad and ugly
According to Michael Friendly’s Gallery of
Data Visualization (Psych./York Univ.)
Off-campus
FRIDAY 27 - AFTERNOON
LIGHT AND DATA: A JOURNEY THROUGH THE NEW AESTHETICS OF INFORMATION
ArtFutura is dedicating its first afternoon to the work of artists, scientists and designers who
are developing new and innovative ways of visualizing information and giving it meaning.
18:00 hours / Room MAC - Mercat de las Flors
Andrew Vande Moere, Information Aesthetics (AUS)
“Forms follow Data: An introduction to the art of data visualization”
http://www.infosthetics.com/
Andrew Vande Moere is the editor of Information Aesthetics, the outstanding weblog
dedicated to exploring the art and science of the dynamic representation of information. In his
blog, Andrew shows and analyses artistic projects of design and investigation based on the
exploration in real time of large databases and the communication, by means of innovative
interfaces, of the meaningful patterns hidden within their interiors.
Information Aesthetics offers an in-depth look into the exciting world of data landscapes, a
discipline that, having seduced artists and scientists, promises to radically change our user
experience in the area of information.
Off-campus
Museo de la ciencia y la técnica de Catalunya (Terrassa)
http://www.mnactec.com/eng/index.htm
Until December 17th