
New ways of exploring environmental data
or: Letting the computer do the hard work
Jon Blower
(ESSC and Reading e-Science Centre)
[email protected]
Motivation
• The environmental sciences are very data-intensive
– Satellite data (high resolution, several spectral bands)
– Numerical model output data
– Raw data -> analysis -> re-analysis
– Ensembles
– Easy to get up to terabytes of data
• Data are expensive to produce and are economically valuable
– Strong real-time requirement in many cases
• Need ways to cope with large datasets and make sense of them
• Computers get faster and disks get bigger
– But we can always fill them
• But our brains stay the same size!
Technical barriers
• Each data provider has its own preferred data format
– NetCDF, HDF, HDF5, GRIB, PP, GeoTIFF, more
– and there are many varieties of the above
• Data exist on a variety of grids
– Latitude-longitude
– Rotated-pole
– Tri-polar
– Or might not be on a grid at all (spectral format)
• Data providers choose different naming conventions
– e.g. “temperature”, “temp”, “T”
• This makes even simple tasks hard
– users should not have to care about any of these details
Solutions
• Expose data using standard interfaces
– irrespective of how data are ultimately stored
– Defining these interfaces is a community effort
• Provide simple tools for simple tasks
– e.g. simple Web interface
• Use distributed computing to work with very large
datasets
– more of this later…
GADS
• Grid Access Data Service
• GADS is a software library for accessing gridded data
• Hides details of storage from users
– users don’t have to know internal data formats or naming conventions
• Uses standard names
• Can make queries about data …
– e.g. “what variables are there in dataset X?”
• … and get data subsets
[Diagram: applications call the GADS library, which sits on top of the data and metadata stores]
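To make this concrete, here is a minimal sketch of the kind of interface such a library might expose. The class and method names below are invented for illustration; they are not the real GADS API.

```java
import java.util.List;

// Hypothetical interfaces, invented for illustration (NOT the real GADS API)
interface GriddedData {
    // would carry the extracted values plus their coordinate axes
}

interface GridDataService {
    // “what variables are there in dataset X?”, answered with standard names
    List<String> getVariables(String datasetId);

    // extract a subset of one variable without knowing how, or in what
    // format, the data are stored internally
    GriddedData getSubset(String datasetId, String standardName,
                          double lonMin, double latMin,
                          double lonMax, double latMax,
                          String isoTime);
}
```

The point is that a caller asks for, say, “sea_surface_temperature” over a region and a time, and never sees whether the files underneath are NetCDF, GRIB or PP.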
GODIVA web portal
• The GODIVA Web portal provides a graphical interface to data at ESSC
• Uses GADS to query and extract data sets
• Users can make simple visualisations
– pictures and movies
GADS as a Web Service
• Web Services are a standard way of building distributed systems
• “Black box” subroutines that are executed over the Internet
• Platform-independent
– strong interoperability
• GADS has a Web Service interface
• Means that external applications can use the GADS routines at ESSC
[Diagram: external applications call the GADS library through its Web Service interface, which sits on top of the data and metadata stores]
GADS application: Search and Rescue
• British Maritime Technology produce software (SARIS) to help the Coastguard with Search and Rescue
• Predicts drift patterns of people and objects that have fallen overboard
– This significantly cuts the time to rescue
• Have worked with BMT to produce a prototype that uses live Met Office data from GADS to improve its predictions
– Uses forecasts of surface winds and surface currents
• Can also be applied to oil spills
Geographical Information Systems (GIS)
• Many companies produce GIS software for manipulating and visualizing geographical data
– e.g. ArcInfo, Maptitude, many more
– Big business!
• Very sophisticated and powerful
– Spatial statistics, geoprocessing, mapping…
– e.g. identify high-risk flood zones, assess effectiveness of ambulance centres
• Historically very map-oriented (2-d or “2.5-d”)
– Hence not so useful in ocean/atmosphere sciences (need 4-d)
• Vendors typically used proprietary formats and interfaces
– Users “locked in” to a particular vendor, hard to share information
• The Open Geospatial Consortium is addressing these issues
OGC Web Services

Web Service                    Purpose
Web Map Server (WMS)           Serves map images (cf. Streetmap, Multimap)
Web Feature Server (WFS)       Serves geographical features (roads, rivers, hospital locations etc)
Web Coverage Server (WCS)      Serves multidimensional data (e.g. numerical model output)
Web Processing Server (WPS)    Processes data
Lots more!

(roughly in decreasing order of maturity)

Services can be composed to create a distributed geospatial application.
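To give a flavour of how simple these interfaces are: a WMS GetMap call is just an HTTP GET with standardised parameters. Below is a minimal Java client sketch; the server address and layer name are placeholders, not a real endpoint.

```java
import java.awt.image.BufferedImage;
import java.net.URL;
import javax.imageio.ImageIO;

public class WmsGetMapDemo {
    public static void main(String[] args) throws Exception {
        // Standard WMS 1.1.1 GetMap parameters; host and layer are placeholders
        String request = "http://example.org/wms?SERVICE=WMS&VERSION=1.1.1"
                + "&REQUEST=GetMap&LAYERS=sea_surface_temperature&STYLES="
                + "&SRS=EPSG:4326&BBOX=-180,-90,180,90"  // whole globe, lon/lat
                + "&WIDTH=800&HEIGHT=400&FORMAT=image/png";
        BufferedImage map = ImageIO.read(new URL(request));
        System.out.println("Received map " + map.getWidth() + "x" + map.getHeight());
    }
}
```

Because every compliant server understands the same parameters, the same client works against any WMS; that interchangeability is what makes services composable.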
NERC Data Grid (NDG)
• NERC e-Science project led by BADC
• Will provide software for discovery and delivery of data
• Data will be distributed between NDG and other groups (NDG won’t hold everything)
• Vast diversity of data types (all NERC data!)
• Rigidly standards-based (ISO)
– Metadata is all-important: enables data discovery
– Have created CSML (Climate Science Markup Language) – describes 7 feature types
• Producing whole array of OGC-compliant Web Services
– Key task is to add proper security
http://ndg.nerc.ac.uk/
Some CSML features
• ProfileFeature
• GridFeature
• ProfileSeriesFeature
NDG: data extractor and GeoSPLaT
Other uses of OGC Web Services
• DEWS project (Delivering Environmental Web Services)
– Deliver Met Office data to end users in marine and health sectors
– Marine applications: Search and rescue
– Health application: Chronic Obstructive Pulmonary Disease (COPD) prediction
– Re-engineering GADS to be WCS-compliant
– Using NDG security layer
– Will hopefully influence Met Office’s data provision in future
• GDEVIL project (Data Assimilation Research Centre)
– In conjunction with RSI (makers of ENVI and IDL)
– Made WCS server and client software for extracting and visualizing
large datasets
The story so far: summary
• We can look forward to much easier access to data
– Allows more end-users (e.g. industry) to get data in real time and at
lower cost
• Data providers will work with the same OGC standards
• Web Services are a key technology
• NERC, Met Office, ECMWF data (and more) will be available to
you through the NERC DataGrid
• Still lots of work to do
– e.g. descriptions of community-specific datasets
The next generation…
Google Maps
• Web-based “widget” for viewing map data
– or any images in fact
• Like Streetmap, Multimap etc but much slicker
– draggable map
– fast response time
• Can mark locations
Google Earth
• “Mapping for the masses”
– According to Nature
• Desktop application (Windows and Mac) for displaying geographical data
– Satellite images
– Earthquake locations
– Live data!
• All on a 3-D spinning globe
• Can view data at all scales
• Very easy to incorporate new data
– as easy as writing a simple Web page
Example of a KML file (and how it renders in Google Earth)
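A minimal KML file of the kind shown on the slide might look like this (the placemark name and coordinates are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.0">
  <Placemark>
    <name>Reading e-Science Centre</name>
    <description>An illustrative placemark</description>
    <Point>
      <!-- longitude,latitude,altitude -->
      <coordinates>-0.94,51.44,0</coordinates>
    </Point>
  </Placemark>
</kml>
```

Opening a file like this in Google Earth flies to the location and displays the placemark; the same file can simply be served from any web server.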
More examples of Google Earth data
• Post-Katrina satellite images
• Sea ice cover and ice velocity
• Locations of ARGO floats
• Bird flu outbreaks
Google Maps vs Google Earth

Google Maps:
• Web-based – works on any modern browser (with JavaScript)
• Only two layers of pictures per map (base plus overlay)
• Some specialist knowledge required to incorporate your own data
• Relatively feature-poor
• Code has been released to the public

Google Earth:
• Standalone application – Windows and Mac only
• As many layers of pictures as you like
• Easy to distribute new data via the web (just write a KML file) or incorporate data from local disk
• Feature-rich
• Closed-source (black box)

Both load data from servers on the fly.
Neither deals with animations very well (if at all).
“GODIVA Two”
• (currently under development)
• Near-instantaneous previews of data
• Draggable Google Map for easy navigation
• Adjustable scale
• Links to Google Earth
• Now we really are exploring data!
• An AJAX application
• (all donkey work is still done by GADS)
What can be done with Godiva2?
• Search through data very quickly using the Web interface
• Pick your own scale range
– crude identification of isotherms
• Having identified data, explore further in Google Earth
– Incorporate multiple data sources into GE
– Overlay a lat-lon grid
– Measure the size of features
– much more!
• Download data into your application of choice (IDL, Matlab)
• Future modifications to Godiva2:
– Other slices through data, e.g. x-t (Hovmöller)
– Movies
– Collaborative GE?
– Simple data processing, e.g. statistical calculations
ESSC Data serving architecture
[Diagram: SARIS and other external applications use SOAP messaging to talk to the Web Service interface; the Google Maps and Google Earth clients use HTTP GET to talk to their respective interfaces. All three interfaces run in a Tomcat application server on top of the GADS library, which sits on the data and metadata stores.]
Geospatial databases
• A lot of the above relies on fast access to data in a multi-user environment
• This is the sort of thing that databases do well
• But most databases don’t deal well with geospatial data
– Some exceptions, e.g. PostGIS
– Gridded data is still a problem for most systems
• We have been evaluating software from Barrodale Computing Services
– Very advanced geospatial database that supports gridded data
– Versions for PostgreSQL, Informix, Oracle
– Demos exist at www.barrodale.com
• Results are very promising
– Faster than our system, especially for small data extractions
– Caches recently-used data for extra speed
• But this is commercial software
– We have an evaluation version, in return for feeding back requirements
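For flavour, the sketch below shows roughly what a spatial query against a PostGIS-enabled PostgreSQL database looks like from Java. The connection details, table and column names are hypothetical; ST_Within and ST_MakeEnvelope are real PostGIS functions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class GeoQueryDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical database, table and column names; the bounding box
        // covers roughly the British Isles, in lon/lat (SRID 4326)
        String sql = "SELECT name FROM stations "
                   + "WHERE ST_Within(geom, ST_MakeEnvelope(-11, 49, 2, 61, 4326))";
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/envdb", "user", "password");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString("name")); // stations in the box
            }
        }
    }
}
```

The spatial constraint is expressed in ordinary SQL, which is exactly the convenience that plain relational databases lack for gridded data.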
“New” methods for data processing
Data processing
• Environmental datasets are typically large and distributed
• In many cases data processing can be sped up through parallel processing
• Can also help with the problem of dealing with multiple users on a data-intensive website
– Website must be responsive
• Often tasks can be “trivially parallelized”
– But even this is often awkward
• Let’s look at some tools we can use to make this easy
Condor
• Mature technology for scheduling jobs (programs) on ordinary desktop machines
– “Cycle stealing”
• Makes good use of existing resources
• Ideal for applications where you need to run the same executable lots of times on different data sets
– Monte Carlo simulations
– Parameter sweeps
• Can also run MPI jobs
• Very popular world-wide
Condor application: TRACK
• TRACK identifies and tracks storms in numerical model output
– Identifies pressure lows and vorticity highs
• Use Condor to run TRACK over large numbers of datasets
– Datasets are downloaded from the Internet on-demand
• Then produce statistics and diagnostics using the results
– Tells us about the predictability of storms
• Web interface
Lizzie Froude and Kevin Hodges
BOINC
• Berkeley Open Infrastructure for Network Computing
• Used by ClimatePrediction.net and SETI@home
• Run code on volunteer computers (i.e. home computers)
– In background or as a screensaver
– Windows, Linux, Mac OSX
• Each computer downloads a chunk of data to process
– In CP.net, each computer runs a simulation of the evolution of Earth’s climate
• Then uploads results
• Volunteers join BOINC, then decide which projects they want to be involved in
• Have to deal with users dropping out
– Also some volunteers have been known to tamper with results
• Some users use CP.net running speed for bragging about their computers!
ClimatePrediction.net on the BBC
Distributed Parallel Processing Environment for Java (DPPEJ)
• Run jobs in parallel by creating a number of Java threads
• Each thread runs on a different machine
• Easy to get started
– If you’re a Java programmer
• Test case: search through 250 OCCAM ¼ degree ocean data files (5 GB total) looking for files that contain extreme temperatures
– No point in using more than 4 machines for this task
– Limited by disk access speed
[Graph: run time against number of threads, levelling off at about 4 threads]
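A sketch of the pattern follows, using plain Java threads on a single machine; the DPPEJ API itself is not shown in this transcript, and the directory name and the extreme-temperature test below are stand-ins.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelFileScan {
    // Stand-in for the real check, which would read the gridded data
    static boolean containsExtremeTemperature(File f) {
        return f.getName().hashCode() % 7 == 0; // placeholder logic
    }

    public static void main(String[] args) throws Exception {
        File[] files = new File("occam_data").listFiles(); // assumed directory
        ExecutorService pool = Executors.newFixedThreadPool(4); // disk-bound, so ~4 is enough
        List<Future<File>> results = new ArrayList<>();
        for (File f : files) {
            // Each file is checked in its own task, run on the thread pool
            results.add(pool.submit(() -> containsExtremeTemperature(f) ? f : null));
        }
        for (Future<File> r : results) {
            File hit = r.get(); // wait for each task and collect hits
            if (hit != null) System.out.println("Extreme temperatures in " + hit.getName());
        }
        pool.shutdown();
    }
}
```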
MapReduce
• Google have written papers on how they do some of their distributed computing
– All done on clusters of commodity machines
– Have to take into account machine failures
• A key concept is the “Map-Reduce” programming model
– One routine maps input data to intermediate output
– Another routine reduces this to a final result
• E.g. map names of data files to locations of storms contained therein
• Then plot these data on a single plot (reduce)
• Open source implementation of this programming model in Java (Hadoop)
• Programmers don’t have to worry about details of parallelization and fault tolerance
– Just write a Map function and a Reduce function
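A conceptual sketch of the storm example in plain Java is below. It shows only the shape of the model, not Hadoop’s actual API; a real framework would run the map calls in parallel across machines and handle failures.

```java
import java.util.ArrayList;
import java.util.List;

public class MapReduceSketch {
    // MAP: one data file name -> the storm locations found in that file
    // (placeholder result; a real version would run a TRACK-style analysis)
    static List<String> map(String fileName) {
        List<String> storms = new ArrayList<>();
        storms.add(fileName + ": storm at 55N, 20W");
        return storms;
    }

    // REDUCE: combine all intermediate storm lists into one final result
    static List<String> reduce(List<List<String>> intermediate) {
        List<String> all = new ArrayList<>();
        for (List<String> storms : intermediate) all.addAll(storms);
        return all; // a real version might plot these on a single map
    }

    public static void main(String[] args) {
        List<String> files = List.of("jan.nc", "feb.nc", "mar.nc"); // hypothetical files
        List<List<String>> intermediate = new ArrayList<>();
        for (String f : files) intermediate.add(map(f)); // a framework would parallelize this loop
        System.out.println(reduce(intermediate));
    }
}
```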
Parallel processing tools: summary
• Condor
– uses spare power of desktop machines
– for running a program lots of times
– run compiled executables – can write in any language
– not real-time (jobs might not run immediately)
• Many other systems
– Sun GridEngine, PBS, etc (often installed with clusters)
• BOINC (also World Community Grid and others)
– Potentially lots of computers involved
– Issue of trust in results
– Good way to reach general public
• DPPEJ, MapReduce
– Must program in Java, but easy if you know how
– Idea is to reduce development time
– MapReduce has fault-tolerance
– Would probably sit behind a website like Godiva2 – most scientists wouldn’t use these directly
What resources are available?
• ESSC Condor pool
• Reading Campus Grid
– Currently a Condor pool in Computer Science Dept
– Will incorporate other resources in future (e.g. library machines, clusters)
• National Grid Service
– 2000 processors and over 36 TB of storage
– CPUs heavily used, data capacity under-used
• OxGrid (in future)
– Intend to connect this to RCG
• In an ideal world all these would be linked
– You would then submit jobs via a single portal
– this is Grid computing!
Where do we go from here?
Environmental e-Science toolkit
• The Reading e-Science Centre is building a “toolkit” for environmental e-Science
• Will incorporate many of the ideas we have seen today
– Fast web access to data (“Godiva2”)
– Google Maps and Google Earth interfaces
– Parallel data processing at back end (for common processing tasks)
– Perhaps IDL/Matlab/CDAT interfaces to the same back-end
– Fast searches through data
• Easy access to resources such as the National Grid Service, Reading Campus Grid
• We will work closely with the NERC DataGrid
• Please tell us what you would like!
Stuff that you can do now
• Think about exposing your data through Google Earth
– Easy to do
– Reaches a wide range of people including the public
– Great for demos
– Useful for teaching?
• Think about what you could achieve if you had more processing power
– And easy access to it
• If you are a data provider, look at the OGC standards and seriously consider using them
• Talk to us ([email protected])!
– I would especially like to hear about real science use cases
Thank you