New Directions in Analysis and Visualization
[Visual Analytics]
Dr Jeremy Walton
NAG Ltd, Oxford
Results Matter. Trust NAG
1 July, 2008
Research Methods Festival, St Catherine's College, Oxford
Overview
• Introduction
  - NAG, HECToR
• Visualization
  - distribution, collaboration, steering
• Data mining
  - classification, exploratory analysis
• The ADVISE project
  - large data, interactive analysis
NAG profile
• Products
  - Mathematical, statistical, data analysis components
  - 3D visualization, compilers & tools
  - HPC software engineering services
  - HECToR support
• Users
  - Academic researchers
  - Professional developers
  - Analysts / modelers
• Founded 1976
• Not-for-profit company
HECToR: High-End Computing Terascale Resource
• Latest high-end computing service for the UK
  - funded by EPSRC, NERC & BBSRC
  - will run from 2007 to 2013
• Partners:
  - Hardware: Cray Inc
  - Service Provision: University of Edinburgh HPCx Ltd
    - hardware hosting, user services, help desk
  - CSE Support: NAG Ltd
    - technical assessment of project applications
    - porting / tuning / optimisation of user codes
    - training courses (inc. visualization)
    - best practice guides, documentation, FAQs
Visualization toolkits
• Help construct visualization applications
  - no wheel-reinvention, stone canoes, chocolate teapots
• Proprietary supported commercial systems
  - e.g. Excel, IRIS Explorer, Spotfire
• Open source, freely available software
  - e.g. OpenDX, InfoVis
NAG’s IRIS Explorer…
• General purpose toolkit for data visualization
• Reusable building blocks (modules)
• Connect modules to build application
• Point-and-click development
• Visual programming approach
• Build, execute, reshape
• Add new modules, if required
…in action
[Screenshot: an application in the map editor, with modules listed in the module librarian. Annotated modules, in order: reads data; colormaps it; makes ribbon; displays it]
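The module chain annotated on this slide (read data, colormap, make ribbon, display) can be imitated in a few lines of ordinary code. The following is only a rough sketch of the dataflow idea, not IRIS Explorer's API: every function name here (`read_data`, `colormap`, `make_ribbon`, `display`, `connect`) is hypothetical, and the data are made up.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-ins for modules; none of these names come from IRIS Explorer.

def read_data() -> List[float]:
    """Source module: returns a small hard-coded series instead of reading a file."""
    return [0.0, 2.5, 5.0, 7.5, 10.0]

def colormap(values: Sequence[float]) -> List[dict]:
    """Map each value to a grey level in [0, 255], scaled to the data range."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant data
    return [{"value": v, "grey": round(255 * (v - lo) / span)} for v in values]

def make_ribbon(points: Sequence[dict]) -> List[tuple]:
    """Pair consecutive points into ribbon segments."""
    return list(zip(points, points[1:]))

def display(segments: Sequence[tuple]) -> List[str]:
    """Sink module: format each segment as a line of text."""
    return [
        f"{a['value']}->{b['value']} grey {a['grey']}->{b['grey']}"
        for a, b in segments
    ]

def connect(source: Callable, *modules: Callable):
    """Wire modules output-to-input and run the resulting chain."""
    data = source()
    for module in modules:
        data = module(data)
    return data

lines = connect(read_data, colormap, make_ribbon, display)
print("\n".join(lines))
```

Adding a further module (axes, a caption) then amounts to passing one more function to `connect`, which is the point of the building-block approach.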
Make the connections
Add more modules...
Adds axes
...and even more
Adds caption
Some examples
Trendalyzer (Gapminder)
Worldmapper: area
Worldmapper: deaths by disease
Many eyes: shared visualization
NAG Data Mining Tools
• Data Cleaning
  - Data imputation - filling in missing values
  - Outlier detection - finding suspect data records
• Data Transformation
  - Scaling Data - before distance computation
  - Principal Component Analysis - reducing # of variables
• Model fitting
  - Cluster analysis - finding interesting groups
  - Classification techniques - # of groups is known
  - Regression - no groups, outcome is continuous
    - Linear / Non-linear / Time series
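To illustrate why scaling comes before distance computation, here is a minimal NumPy sketch. It does not use the NAG data mining routines; the three-row dataset and the helper `pairwise_distances` are invented for illustration, and column standardisation is just one possible choice of scaling.

```python
import numpy as np

# Made-up data: two variables on very different scales.
X = np.array([[1.0, 1000.0],
              [2.0, 1010.0],
              [1.1, 2000.0]])

def pairwise_distances(A):
    """Euclidean distance between every pair of rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Standardise each column to zero mean and unit variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

raw = pairwise_distances(X)     # driven almost entirely by the second column
scaled = pairwise_distances(Z)  # both columns now contribute comparably

print(np.round(raw, 2))
print(np.round(scaled, 2))
```

In the raw distances, rows 0 and 1 look close only because the large-valued second column swamps the first; after scaling, their difference in the first column counts as well.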
Example: exploratory data analysis
• How many species of water vole (Arvicola) in UK?
• Measurement data
  - Presence / absence of 13 skull characteristics
  - 300 observations, each in one of 14 regions
  - 3 groups:
    - A. terrestris / A. sapidus / unclassified UK cases
• Treatment
  - Average data within each region
  - Gives 14 data points in 13 dimensions
• How to display dataset?
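The treatment step (averaging within each region) can be sketched as follows. The real skull measurements are not reproduced here, so the arrays `presence` and `region` are random stand-ins with the sizes quoted on the slide; only the shapes carry over.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_chars, n_regions = 300, 13, 14

# Synthetic stand-in for the skull data: 1 = characteristic present, 0 = absent.
presence = rng.integers(0, 2, size=(n_obs, n_chars))

# Region label per observation; the first 14 labels guarantee that
# every region has at least one observation.
region = np.concatenate([
    np.arange(n_regions),
    rng.integers(0, n_regions, size=n_obs - n_regions),
])

# Mean of each characteristic within a region = proportion of skulls showing it.
region_means = np.vstack([
    presence[region == r].mean(axis=0) for r in range(n_regions)
])

print(region_means.shape)
```

Each row of `region_means` is one of the 14 data points, and each column one of the 13 dimensions.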
2D scatterplots
Analysis
• 2D scatterplots?
  - Structure is unclear
  - (13 x 12) / 2 = 78 plots needed
• Principal components analysis?
  - 2 PCs explain 49% of the variance
  - 3 PCs explain 65% of the variance
  - Should be > 85% for confident representation
  - Fisher’s iris dataset (4 variables) is 95%
• Alternative technique
  - Metric scaling
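Both checks on this slide are easy to reproduce in outline. The sketch below counts the pairwise scatterplots and computes the cumulative variance explained by the leading principal components. It runs on a random 14 x 13 matrix, not the vole data, so it will not reproduce the 49% and 65% figures; the function name `cumulative_explained_variance` is ours, and the 0.85 threshold is the one quoted above.

```python
import numpy as np
from math import comb

def cumulative_explained_variance(X):
    """Cumulative fraction of total variance explained by the leading PCs."""
    Xc = X - X.mean(axis=0)                  # centre each variable
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, descending
    var = s ** 2                             # proportional to the PC variances
    return np.cumsum(var) / var.sum()

# Number of distinct 2D scatterplots for 13 variables: (13 x 12) / 2.
n_plots = comb(13, 2)

# Synthetic 14 x 13 matrix standing in for the region averages.
rng = np.random.default_rng(1)
X = rng.random((14, 13))

cev = cumulative_explained_variance(X)
# Smallest number of PCs whose cumulative share reaches 85%.
n_needed = int(np.searchsorted(cev, 0.85) + 1)

print(n_plots, np.round(cev[:3], 2), n_needed)
```

If `n_needed` is well above 3, a 3D plot of principal components leaves out too much of the variance to be trusted, which is the situation described for the vole data.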
Metric scaling
• 14 data points - one for each region
  - Each point has values for 13 variables
• Construct 14 by 14 dissimilarity matrix, Δ
  - Δij = distance between points i & j in 13D space
  - Δ is symmetric, with zero diagonal elements
• Want to find a new matrix, Δ*
  - distances between 14 new data points in 3D space, preserving Δ as closely as possible
• Project Δ to Δ* using metric scaling
• Display data points in 3D
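One standard way to carry out these steps is classical metric scaling (principal coordinates analysis), sketched below in NumPy. The slides do not say which variant or which NAG routine was used, so treat this as an assumption; `classical_mds` is our own helper, and the input is a random 14 x 13 matrix, not the vole data.

```python
import numpy as np

def classical_mds(delta, k=3):
    """Classical metric scaling: embed a dissimilarity matrix in k dimensions."""
    n = delta.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centring matrix
    B = -0.5 * J @ (delta ** 2) @ J       # double-centred squared distances
    evals, evecs = np.linalg.eigh(B)      # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:k]     # indices of the k largest
    lam = np.clip(evals[idx], 0.0, None)  # guard against tiny negatives
    return evecs[:, idx] * np.sqrt(lam)   # n x k coordinates

def distance_matrix(P):
    """Euclidean distance between every pair of rows of P."""
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Synthetic stand-in: 14 points in 13 dimensions.
rng = np.random.default_rng(2)
X = rng.random((14, 13))

delta = distance_matrix(X)        # 14 x 14, symmetric, zero diagonal
Y = classical_mds(delta, k=3)     # 14 new points in 3D
delta_star = distance_matrix(Y)   # distances actually achieved in 3D

print(Y.shape, float(np.abs(delta - delta_star).max()))
```

Comparing `delta_star` with `delta` shows how much distortion the 3D picture introduces; with `k=13` the original distances are recovered up to rounding.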
Exploratory data analysis conclusions
• 2D scatterplots don’t indicate group structure
  - cf. iris dataset
• 3D PCA unreliable here
• Metric scaling of Δ used to reduce dimensionality from 13 to 3
• 3D visualization reveals group structure
  - Distinct A. sapidus group
  - UK sample represents only A. terrestris
The ADVISE project
• DTI-funded research project, started March 2007
  - NAG / VSN / University of Leeds
• Merge visualization & statistics (visual analytics)
  - use statistics to identify key characteristics of dataset
  - understand the characteristics through visualization
• User community
  - pharmaceuticals
  - environmental science
  - engineering
• Initial user meeting held September 2007
Large datasets
• Size matters (but isn’t everything)
• Developer’s view: too large for our current system
  - Problems of performance and robustness
• User’s view: too large for me to understand
• Current ADVISE datasets are “only” a few GB
  - complications (e.g. comparing several) could raise this
  - HECToR users have TB datasets
ADVISE ideas
• Retention of visual programming interface
• Re-use of algorithmic base
  - IRIS Explorer modules
  - GenStat statistics functionality (from VSN)
• Three-layered architecture
  - User interface
  - Web service middleware
  - Visualization components
• Distribution, tailored user interface, collaboration
ADVISE progress
• Porting IRIS Explorer (IE) modules to standalone environment
  - some of these use GenStat for statistics
• New system used to revisit air quality demo
  - early (IEEE Viz 96) web-based visualization
  - new system more efficient
• Working with real user data
Conclusions
• NAG offers software components for developers
  - no wheel-reinvention, stone canoes, chocolate teapots
• Visualization & data mining crucial for analysis
  - distribution, steering, classification, exploration
  - interactivity / interrogation important
  - integration is an ongoing field of activity
• ADVISE project
  - developing a new system for visual analysis
  - working with real user problems
  - improving understanding of data