New Directions in Analysis and Visualization
[Visual Analytics]
Dr Jeremy Walton
NAG Ltd, Oxford
Results Matter. Trust NAG
1 July, 2008
Research Methods Festival, St Catherine's College, Oxford
Overview
Introduction
NAG, HECToR
Visualization
distribution, collaboration, steering
Data mining
classification, exploratory analysis
The ADVISE project
large data, interactive analysis
NAG profile
Products
Mathematical, statistical, data analysis components
3D visualization, compilers & tools
HPC software engineering services
HECToR support
Users
Academic researchers
Professional developers
Analysts / modelers
Founded 1976
Not-for-profit company
High-End Computing Terascale Resource
Latest high-end computing service for UK
funded by EPSRC, NERC & BBSRC
will run from 2007 to 2013
Partners:
Hardware: Cray Inc
Service Provision: University of Edinburgh HPCx Ltd
hardware hosting, user services, help desk
CSE Support: NAG Ltd
technical assessment of project applications
porting / tuning / optimisation of user codes
training courses (inc. visualization)
best practice guides, documentation, FAQs
Visualization toolkits
Help construct visualization applications
no wheel-reinvention, stone canoes, chocolate teapots
Proprietary supported commercial systems
e.g. Excel, IRIS Explorer, Spotfire
Open source, freely available software
e.g. OpenDX, InfoVis
NAG’s IRIS Explorer…
General purpose toolkit for data visualization
Reusable building blocks (modules)
Connect modules to build application
Point-and-click development
Visual programming approach
Build, execute, reshape
Add new modules, if required
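The module-and-connection idea can be sketched in a few lines of Python. This is not IRIS Explorer code: the `Module` class, `run_map` and the four module names are hypothetical, and they only mimic a chain that reads data, colormaps it, builds a geometry and displays it.

```python
from typing import Callable, List


class Module:
    """A hypothetical building block: a named function from input to output."""

    def __init__(self, name: str, func: Callable):
        self.name = name
        self.func = func


def run_map(modules: List[Module], source=None):
    """Execute connected modules in order, passing each output downstream."""
    data = source
    for module in modules:
        data = module.func(data)
    return data


# Four illustrative modules (names are assumptions, not real IRIS Explorer modules).
read_data = Module("ReadData", lambda _: [0.0, 0.5, 1.0])
colormap = Module("Colormap", lambda xs: [(x, "red" if x > 0.4 else "blue") for x in xs])
ribbon = Module("Ribbon", lambda pts: {"geometry": "ribbon", "points": pts})
display = Module("Display", lambda geom: f"{geom['geometry']} with {len(geom['points'])} points")

result = run_map([read_data, colormap, ribbon, display])
print(result)
```

In this sketch, reshaping the application amounts to editing the list passed to `run_map`, for example by inserting another module between `ribbon` and `display`.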
…in action
Application in map editor
Modules in module librarian
Reads data
Colormaps it
Makes ribbon
Displays it
Make the connections
Add more modules...
Adds axes
...and even more
Adds caption
Some examples
Trendalyzer (Gapminder)
Worldmapper: area
Worldmapper: deaths by disease
Many eyes: shared visualization
NAG Data Mining Tools
Data Cleaning
Data imputation - filling in missing values
Outlier detection - finding suspect data records
Data Transformation
Scaling Data - before distance computation
Principal Component Analysis - reducing # of variables
Model fitting
Cluster analysis - finding interesting groups
Classification techniques - # of groups is known
Regression - no groups, outcome is continuous
Linear / Non-linear / Time series
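A small illustration of why scaling comes before distance computation. This is a sketch using only the Python standard library, not the NAG routines; the records and the helper names `standardize` and `euclidean` are made up for the example.

```python
import math
from statistics import mean, pstdev


def standardize(rows):
    """Scale each column to zero mean and unit standard deviation."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(x - m) / s for x, m, s in zip(r, mus, sds)] for r in rows]


def euclidean(a, b):
    """Euclidean distance between two records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up records: (length in mm, mass in kg). The two columns have very different ranges.
records = [[150.0, 0.20], [160.0, 0.21], [151.0, 0.35]]

raw_01 = euclidean(records[0], records[1])
raw_02 = euclidean(records[0], records[2])

scaled = standardize(records)
scaled_01 = euclidean(scaled[0], scaled[1])
scaled_02 = euclidean(scaled[0], scaled[2])

print(raw_01, raw_02)        # raw distances are dominated by the length column
print(scaled_01, scaled_02)  # after scaling, both columns contribute
```

Without scaling, the variable with the largest numerical range decides which records look close; after scaling, a large relative difference in mass counts as much as a large relative difference in length.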
Example: exploratory data analysis
How many species of water vole (Arvicola) in UK?
Measurement data
Presence / absence of 13 skull characteristics
300 observations, each in one of 14 regions
3 groups:
A. terrestris / A. sapidus / unclassified UK cases
Treatment
Average data within each region
Gives 14 data points in 13 dimensions
How to display dataset?
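The treatment step (averaging within each region) can be sketched as follows. The sample below is invented and uses 3 characteristics rather than 13; `region_means` is a hypothetical helper, not part of any NAG product.

```python
from collections import defaultdict


def region_means(observations):
    """Average presence/absence indicators (0 or 1) within each region.

    observations: iterable of (region, [c1, ..., ck]) pairs.
    Returns {region: [mean of c1, ..., mean of ck]}.
    """
    sums = {}
    counts = defaultdict(int)
    for region, traits in observations:
        if region not in sums:
            sums[region] = [0.0] * len(traits)
        for j, value in enumerate(traits):
            sums[region][j] += value
        counts[region] += 1
    return {r: [s / counts[r] for s in sums[r]] for r in sums}


# Tiny made-up sample: 3 observations, 2 regions, 3 characteristics.
obs = [
    ("Region A", [1, 0, 1]),
    ("Region A", [1, 1, 0]),
    ("Region B", [0, 0, 1]),
]
means = region_means(obs)
print(means)
```

Each region ends up as one point whose coordinates are the proportions of observations showing each characteristic, which is how 300 observations become 14 points in 13 dimensions.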
2D scatterplots
Analysis
2D scatterplots?
Structure is unclear
(13 x 12) / 2 = 78 plots needed
Principal components analysis?
2 PCs explain 49% of the variance
3 PCs explain 65% of the variance
Should be > 85% for confident representation
Fisher’s iris dataset (4 variables) is 95%
Alternative technique
Metric scaling
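The two counts used in this analysis can be checked with a short sketch. The plot count follows directly from 13 variables; the eigenvalues passed to `components_needed` are hypothetical and are not the water vole results, and the 85% threshold is taken from the rule of thumb above.

```python
from itertools import accumulate
from math import comb

n_variables = 13
n_scatterplots = comb(n_variables, 2)  # one plot per unordered pair of variables


def components_needed(eigenvalues, threshold=0.85):
    """Smallest number of leading PCs whose cumulative variance share reaches threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    for k, running in enumerate(accumulate(ordered), start=1):
        if running / total >= threshold:
            return k
    return len(ordered)


# Hypothetical eigenvalues of a covariance matrix, for illustration only.
example = [4.0, 3.0, 2.0, 1.0]

print(n_scatterplots)
print(components_needed(example))
```

When the number of components needed exceeds what can be drawn (2 or 3), a PCA plot discards too much of the variance, which motivates trying a different technique.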
Metric scaling
14 data points – one for each region
Each point has values for 13 variables
Construct 14 by 14 dissimilarity matrix, Δ
Δij = distance between points i & j in 13D space
Δ is symmetric, with zero diagonal elements
Want to find a new matrix, Δ*, that approximates Δ
Δ* = distances between 14 new data points in 3D space
Find the new points from Δ using metric scaling
Display data points in 3D
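The steps above can be sketched with NumPy using the classical (Torgerson) form of metric scaling. This is an assumption about the variant; the slides do not say which formulation or which NAG routine was used. The data are random stand-ins for the 14 regional averages, and `metric_scaling` and `distance_matrix` are illustrative names.

```python
import numpy as np


def distance_matrix(points):
    """Pairwise Euclidean distances between the rows of points."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def metric_scaling(delta, n_dims=3):
    """Classical metric scaling: coordinates whose distances approximate delta."""
    n = delta.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centre the squared dissimilarities to obtain an inner-product matrix.
    b = -0.5 * centering @ (delta ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(b)        # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:n_dims]    # indices of the largest eigenvalues
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale


# Made-up data: 14 points in 13 dimensions (not the water vole measurements).
rng = np.random.default_rng(0)
points_13d = rng.random((14, 13))

delta = distance_matrix(points_13d)          # 14 by 14, symmetric, zero diagonal
coords_3d = metric_scaling(delta, n_dims=3)  # 14 new points in 3D
delta_star = distance_matrix(coords_3d)      # distances between the new points

print(coords_3d.shape)
print(np.abs(delta - delta_star).max())      # how much of Δ is lost in 3D
```

Only the dissimilarity matrix is passed to `metric_scaling`, not the original coordinates, so the same function applies to any symmetric Δ with a zero diagonal.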
Exploratory data analysis conclusions
2D scatterplots don’t indicate group structure
cf. iris dataset
3D PCA unreliable here
Metric scaling of Δ used to reduce dimensionality from 13 to 3
3D visualization reveals group structure
Distinct A. sapidus group
UK sample represents only A. terrestris
The ADVISE project
DTI-funded research project, started March 2007
NAG / VSN / University of Leeds
Merge visualization & statistics (visual analytics)
use statistics to identify key characteristics of dataset
understand the characteristics through visualization
User community
pharmaceuticals
environmental science
engineering
Initial user meeting held September 2007
Large datasets
Size matters (but isn’t everything)
Developer’s view:
Too large for our current system
Problems of
performance
robustness
User’s view:
Too large for me to understand
Current ADVISE datasets are “only” a few GB
complications (e.g. comparing several datasets) could raise this
HECToR users have TB datasets
ADVISE ideas
Retention of visual programming interface
Re-use of algorithmic base
IRIS Explorer modules
GenStat statistics functionality (from VSN)
Three layered architecture
User interface
Web service middleware
Visualization components
Distribution, tailored user interface, collaboration
ADVISE progress
Porting IE modules to standalone environment
some of these use GenStat for statistics
New system used to revisit air quality demo
early (IEEE Viz 96) web-based visualization
new system more efficient
Working with real user data
Conclusions
NAG offers software components for developers
no wheel-reinvention, stone canoes, chocolate teapots
Visualization & data mining crucial for analysis
distribution, steering, classification, exploration
interactivity / interrogation important
integration is an ongoing field of activity
ADVISE project
developing a new system for visual analysis
working with real user problems
improving understanding of data