Transcript Data Rods

Data Rods:
High Speed, Time-Series Analysis
of Massive Data Sets Using Pure
Object Database Methods
David Gallaher(1), Qin Lv(2), Glenn Grant(1), Garrett Campbell(1)
(1) National Snow and Ice Data Center, University of Colorado, Boulder, Colorado, 80309, USA
(2) Department of Computer Science, University of Colorado, Boulder, Colorado, 80309, USA
Data Rods:
High Speed, Time-Series Analysis of Massive Data Sets
The National Snow and Ice Data Center
Mission: To monitor the climate data in Earth’s icy regions, and to analyze and distribute it worldwide 24x7. Focus is mainly NASA satellite data.
• Manages and distributes scientific data
• Supports data users
• Performs scientific research
• Creates tools for data access
• Educates the public about the cryosphere
Affiliations and Sponsorship:
• University of Colorado at Boulder
• Cooperative Institute for Research in Environmental Sciences
• World Data Center for Glaciology (since 1976)
Data Rods - Project Basis
The “Data Rods” project proposes to create a prototype of a high-speed, scalable database structure for rapid retrieval, filtering, and analysis of massive multi-modality data sets.
Objective: Remote Sensing Data Analysis
The Problem:
• Data sets are becoming too
large to move over the
internet
• Need for basic Boolean logic
for time-series anomaly
detection
• Data downloads for long
time-series analysis are
especially cumbersome
Analysis Challenges
• A wide variety of data formats
• Ever-increasing data set sizes
• Myriad analysis and visualization requirements
• There will be uses and analyses of the data that cannot be anticipated (data discovery is not enough)
• Lack of direct access to the data values (e.g., selecting where albedo > 15%)
• Our current directory trees impede data access (we really need to consider a database)
“Big Data” Considerations:
Search, Order and Transmission of data is ending.
• We must develop systems where the data stay fixed and analyses are rendered against it
• Rapid, scalable data access across time and space
• Direct query of the data, not just the metadata (we need more than what, where, when)
• Web-based spatio-temporal analysis and visualization
Database Choice





• Fast and efficient storage, query and retrieval of entire data sets, not just the metadata
• Ability to store colossal numbers of small files
• Relational databases can't handle it: the tables grow too big. (Object-relational is no better.)
• Hadoop excels at unstructured data but, due to its batch-oriented nature, it is inefficient for real-time analytics as well as intra-data analysis
• A “pure-object” database is seen as the best choice
The Data Rods Project
The “Data Rods” project has created a high speed,
scalable database structure for rapid retrieval,
filtering, and analysis of massive data sets.
We’ll cover the following:
• Database design
• Status on development
• Basic analysis examples and performance
• Planned analysis and potential applications
Database design
Gridded data is key.
For consistency, NSIDC's Equal-Area Scalable
Earth Grids (EASE-Grids) tool is used.
Common resolutions between data sets (1km, 5km,
etc) and point data
The nesting relationship of differing resolutions in EASE-Grid
Data Rods Concept
Database Systems Development
Object Database Design
[Figure: system diagram. Labels recovered from the slide:]
• Data sources: Passive Microwave, Visual Infrared, Active Microwave, Radar, Other
• Processing: EASE-Grid Processing, Pixel Grid Sampling, Data Rod Objects, Object Database Loading, Data Rod Updating
• Cryospheric Change Analysis: Basic Data Management (query & index), Pattern Search (input pattern or trend), Automated Pattern Discovery (Anomaly Detection, Trend Detection, Cycle Detection)
• Access: Object Interface, User Interface
Pure-Object Database


• Object persistence/instantiation is directly to/from the database; no Java Spring or Hibernate needed
• Not object-relational (examples of pure-object databases include Versant, ObjectDB, db4o, Objectivity)
• Not as limited by size
• Fast interactions across databases
• Simple, efficient schema
Next: schema design
Object Database Schema




• Each image pixel is an object
• Data rods are time-series collections of pixels
• Each data rod can be analyzed independently
• Adjacency analysis by row/col or lat/lon
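The schema described on this slide can be sketched in a few lines of Python. This is only an illustration: the class and field names (`Pixel`, `DataRod`, `albedo`, `neighbors`) are assumptions, a plain dictionary stands in for the object database, and the values are synthetic. The slides do not show the project's actual classes.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Tuple


@dataclass
class Pixel:
    """One image pixel at one time step (hypothetical fields)."""
    day: date
    albedo: float


@dataclass
class DataRod:
    """Time-series collection of pixels at one grid cell."""
    row: int
    col: int
    pixels: List[Pixel] = field(default_factory=list)

    def between(self, start: date, end: date) -> List[Pixel]:
        """Pixels whose date falls in [start, end]."""
        return [p for p in self.pixels if start <= p.day <= end]


def neighbors(rods: Dict[Tuple[int, int], DataRod], rod: DataRod) -> List[DataRod]:
    """Adjacency by row/col: the up-to-8 rods surrounding `rod`."""
    found = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if (dr, dc) != (0, 0) and (rod.row + dr, rod.col + dc) in rods:
                found.append(rods[(rod.row + dr, rod.col + dc)])
    return found


# Build a tiny 3x3 grid of rods with synthetic values.
grid = {(r, c): DataRod(r, c) for r in range(3) for c in range(3)}
grid[(1, 1)].pixels.append(Pixel(date(1995, 5, 1), 0.80))
grid[(1, 1)].pixels.append(Pixel(date(1995, 6, 1), 0.75))
```

Because each rod carries its own time series, a query such as `grid[(1, 1)].between(...)` touches one rod only, which is the sense in which rods can be analyzed independently.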
Database Creation



• Gridded data sets
• Standardized grid dimensions
• Visualize as layers of imagery through time (days to decades)
• Lends itself well to time-series analysis
[Figure: stacked image layers; axis labels: Time, Longitude]
Status – Database Administration



• 5 AVHRR databases, each with 5 years of imagery (<100 GB each, administratively easier)
• Surface mask databases for the northern hemisphere at 5 km and 25 km
• SSM/I database: 25 years of daily 25 km data at all frequencies and polarizations
• Selected MODIS database at 250 m resolution
• ~600 GB total
• No upper limit to database size except disk space
AVHRR Database Creation


• Initial demonstration region is Greenland
• 25 years of daily multi-spectral AVHRR data at 5 km resolution
• 9000+ images
• 2 billion+ pixel objects total
• Each pixel object is independently accessible for query
Database Flexibility

• Data can be spread across many databases
• Transparent queries across databases
• Methods (routines) can be attached to the data rods to add functionality such as statistical analysis
• Data fusion: analyses may span multiple data types, resolutions, time spans
• Data Rods supports NetCDF output
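One way to picture a transparent query across databases is a thin wrapper that routes a request to whichever databases overlap the requested years. The sketch below assumes five 5-year databases covering 1981-2005; the class names `YearRangeDatabase` and `FederatedRods` are hypothetical, an in-memory dictionary stands in for each object database, and all values are synthetic. It does not reflect how the actual ODB software federates databases.

```python
from typing import Dict, List, Tuple

Sample = Tuple[int, float]  # (year, value), synthetic stand-in for a pixel


class YearRangeDatabase:
    """Stand-in for one database holding rods for first_year..last_year."""

    def __init__(self, first_year: int, last_year: int):
        self.first_year = first_year
        self.last_year = last_year
        self.rods: Dict[Tuple[int, int], List[Sample]] = {}

    def add(self, row: int, col: int, year: int, value: float) -> None:
        self.rods.setdefault((row, col), []).append((year, value))

    def get_rod(self, row: int, col: int) -> List[Sample]:
        return self.rods.get((row, col), [])


class FederatedRods:
    """Presents several databases as one; callers never pick a database."""

    def __init__(self, databases: List[YearRangeDatabase]):
        self.databases = sorted(databases, key=lambda db: db.first_year)

    def get_rod(self, row: int, col: int, start: int, end: int) -> List[Sample]:
        out: List[Sample] = []
        for db in self.databases:
            # Skip databases whose year range cannot overlap the request.
            if db.last_year < start or db.first_year > end:
                continue
            out.extend(s for s in db.get_rod(row, col) if start <= s[0] <= end)
        return out


# Five 5-year databases, 1981-2005, one synthetic sample per year.
dbs = [YearRangeDatabase(y, y + 4) for y in range(1981, 2006, 5)]
for db in dbs:
    for year in range(db.first_year, db.last_year + 1):
        db.add(10, 20, year, 0.8)

rod = FederatedRods(dbs).get_rod(10, 20, 1984, 1992)
```

Here the request for 1984-1992 spans three of the five databases, and the caller receives a single time series.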
Simple AVHRR Object Database Time Test
• Built a database using AVHRR 5 km data from 1995-1999
• 2 visible channels, 3 IR channels, 3 references plus
albedo, skin temperature and cloud mask
• Database includes location class, time stamp class
and metadata
• 213,000 data rods covering 5 years over Greenland
• 1 Data rod contains 1825 pixels
• Pixels = 388,725,000 each with 11 variables/pixel
• Variables = 4.2 billion coded short integer values
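The counts on this slide can be checked with a few lines of arithmetic. The only assumption is that a 5-year rod holds one pixel per day at 365 days per year.

```python
rods = 213_000
days_per_year = 365       # assumption: one pixel per day, leap days ignored
years = 5
variables_per_pixel = 11  # 2 visible + 3 IR + 3 references + albedo, skin temperature, cloud mask

pixels_per_rod = days_per_year * years
total_pixels = rods * pixels_per_rod
total_values = total_pixels * variables_per_pixel

print(pixels_per_rod)  # 1825
print(total_pixels)    # 388725000
print(total_values)    # 4275975000
```

The last product is about 4.28 billion values, in line with the roughly 4.2 billion quoted on the slide.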
Example Analysis Using Object Databases
• All queries run on a single processor, single thread
• Example #1: Queries and plots on single database
• Example #2: Queries and plots on multiple
databases
• Example #3: Advanced Spatiotemporal Analysis
• We will move to multi-thread, multiprocessor once we
have the design finalized (this is a research project)
Using Single AVHRR Object Database Time Test
• Single processor under load
• 5-year plots returned in 2-10
seconds.
• Cached data plots returned
in ½ second.
• Images in 10 seconds
Multi Data Rod
Selection
• Seven locations
selected across 5
years simultaneously
• Selected
Temperature
Brightness and
Albedo output
• Again caching is
much faster
Example Analysis of Greenland & 5 databases
Using 5 5-year Rods and Statistics (1 min or 5 secs cached)
AVHRR albedo statistics
May average, 1981 – 2005
Camp Century:
Mean: 0.801
Std. dev.: 0.077
Summit Station:
Mean: 0.819
Std. dev.: 0.069
Swiss Camp:
Mean: 0.817
Std. dev.: 0.070
GISP Ice Core Camp:
Mean: 0.802
Std. dev.: 0.071
Image ref: Maurer, J. 2007. Atlas of the Cryosphere. Boulder, Colorado USA: National Snow and Ice Data Center.
Digital media.
Temporal Analysis of Single Rods

• Descriptive statistical functions
• Spatiotemporal data selection
• Filtering by value
• Anomaly detection
Also:
• Image generation
• Inter-database data fusion
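Filtering by value and anomaly detection on a single rod can be illustrated as follows. This is a sketch under stated assumptions: the rod is a plain list of synthetic (day, albedo) pairs, and the anomaly test is a simple z-score threshold chosen for illustration. The slides do not say which anomaly detection method the project uses.

```python
from statistics import mean, pstdev
from typing import List, Tuple

Sample = Tuple[int, float]  # (day index, albedo), synthetic


def filter_by_value(rod: List[Sample], threshold: float) -> List[Sample]:
    """Keep samples whose value exceeds `threshold` (e.g. albedo > 0.15)."""
    return [s for s in rod if s[1] > threshold]


def anomalies(rod: List[Sample], z_limit: float = 2.0) -> List[Sample]:
    """Samples more than `z_limit` standard deviations from the rod mean."""
    values = [v for _, v in rod]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [s for s in rod if abs(s[1] - mu) > z_limit * sigma]


# Synthetic rod: steady albedo near 0.8 with one low outlier on day 5.
rod = [(d, 0.80) for d in range(10)]
rod[5] = (5, 0.10)

bright = filter_by_value(rod, 0.15)
odd = anomalies(rod)
```

With this synthetic rod, the filter keeps the nine bright samples and the anomaly test flags only day 5.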
Broad Spatiotemporal Analysis
(This took some time)
• Statistical analysis
repeated at every grid
cell.
• Intersection of surface
mask database and
AVHRR database: only
pixels on the ice sheet
were processed.
• Bad data filtered out.
• Multivariate: cloud
mask used to exclude
cloudy pixels from
albedo averages.
• All 2 billion objects
queried and analyzed
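The steps listed on this slide (intersect with the surface mask, drop bad data, exclude cloudy pixels, then average at every cell) can be sketched as below. The data layout is an assumption made for illustration: each rod is a list of (albedo, cloudy) pairs with `None` marking bad data, the surface mask is a dictionary of booleans, and all values are synthetic.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]
# Each sample: (albedo, cloudy flag); a None albedo marks bad data.
Sample = Tuple[Optional[float], bool]


def clear_sky_mean(samples: List[Sample]) -> Optional[float]:
    """Mean albedo over samples that are valid and not flagged cloudy."""
    good = [a for a, cloudy in samples if a is not None and not cloudy]
    return mean(good) if good else None


def ice_sheet_means(
    rods: Dict[Cell, List[Sample]], ice_mask: Dict[Cell, bool]
) -> Dict[Cell, Optional[float]]:
    """Repeat the statistic at every grid cell the surface mask marks as ice."""
    return {
        cell: clear_sky_mean(samples)
        for cell, samples in rods.items()
        if ice_mask.get(cell, False)
    }


rods = {
    (0, 0): [(0.80, False), (0.95, True), (None, False), (0.82, False)],
    (0, 1): [(0.10, False), (0.12, False)],  # off the ice sheet, masked out
}
ice_mask = {(0, 0): True, (0, 1): False}

result = ice_sheet_means(rods, ice_mask)
```

Only cell (0, 0) is processed, and its average uses the two clear, valid samples.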
Analysis Example: Sea Ice Temporal Query
We would like to remove clouds from the image. Clouds move faster than ice, so the minimum albedo within a temporal window reveals open water.
• Moving 8-day window (t1 to t8) through the data rod
• Minimum albedo in temporal window
Pseudocode example query:
datarod = database.getDatarod(row, col)    # datarod: time series of pixels
albedo = datarod.getMinAlbedo(t, t + 7)    # minimum albedo over the 8 days t .. t+7
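The moving-window query can be made concrete with a small runnable sketch. It assumes the rod is simply a list of daily albedo values for one grid cell; the function names `min_albedo` and `moving_min` and the sample values are illustrative, not the project's API.

```python
from typing import List


def min_albedo(albedo: List[float], t: int, window: int = 8) -> float:
    """Minimum albedo over days t .. t+window-1 (clipped at the rod's end)."""
    return min(albedo[t:t + window])


def moving_min(albedo: List[float], window: int = 8) -> List[float]:
    """Slide the window through the rod, one output per full window."""
    return [min_albedo(albedo, t, window) for t in range(len(albedo) - window + 1)]


# Synthetic daily albedo for one cell: dark open water (0.07) with
# bright cloud (0.6) passing over on most days.
rod = [0.6, 0.6, 0.07, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.07, 0.6]

composite = moving_min(rod)
```

Every 8-day window in this synthetic rod contains at least one clear day, so the composite shows the dark water value throughout.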
Analysis result: Sea Ice Detection
Technique for removing clouds from the image
[Figure: one of the original images, shown with the composite image created from the Data Rods’ time series (lowest AVHRR albedo over an 8-day period)]
Remaining objective: exclude lingering clouds
Analysis Potential: Rapid Data Fusion

• Loss of AMSR-E decreases sea ice detection capability
• Data Rods AVHRR/SSM/I product fusion may fill the gap
• Can be validated against the AMSR-E sea ice record
[Figure: AVHRR 8-day composite (high-resolution sea ice detection, still some clouds) + SSM/I (cloud free with good sea ice detection, but low resolution) = fused product (high-res sea ice extent, no clouds)]
Performing this lake detection analysis conventionally took 6 months (downloading, gridding and image analysis).
With Data Rods, the analysis was done in 2 days (single thread, single processor).
What’s Next: Ongoing Efforts
• Newest version of the ODB software has multi-threaded capability, to take advantage of multiprocessor machines and reduce query times
• Investigating Data Rod performance on the Janus supercomputer with pan-Arctic extent
• User interface to the Data Rod database
Creating 1000s of Databases for Use
with Massive Parallel Machines
• Each database is small
enough to be held in memory
for each CPU (uses MPI
calls)
• Each database covers 5° x 5° x 25 years of Data Rods
• Each database is capped (fixed for minimal changes)
• Changes are added to the present-year database for each 5° x 5° area
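Splitting the data into one database per 5° x 5° area implies some mapping from a location to its database. The sketch below shows one possible mapping; the indexing convention (rows counted north from -90°, columns east from -180°) and the `database_name` scheme are assumptions for illustration, and the slides do not describe the project's actual layout or its MPI code.

```python
import math
from typing import Tuple

TILE_DEG = 5  # tile edge in degrees, as on the slide


def tile_index(lat: float, lon: float) -> Tuple[int, int]:
    """Map a lat/lon (degrees) to the (row, col) of its 5x5-degree tile.

    Rows count north from -90, columns count east from -180 (an assumed
    convention). The +90 and +180 edges fold into the last tile.
    """
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError("lat/lon out of range")
    row = min(int(math.floor((lat + 90) / TILE_DEG)), 180 // TILE_DEG - 1)
    col = min(int(math.floor((lon + 180) / TILE_DEG)), 360 // TILE_DEG - 1)
    return row, col


def database_name(lat: float, lon: float) -> str:
    """Hypothetical naming scheme for the per-tile database."""
    row, col = tile_index(lat, lon)
    return f"rods_r{row:02d}_c{col:02d}"


tiles_worldwide = (180 // TILE_DEG) * (360 // TILE_DEG)
summit = tile_index(72.58, -38.46)  # approximate location of Summit Station, Greenland
```

A global 5° x 5° tiling yields 2592 tiles, which is consistent with the slide's "1000s of databases".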
Creating 1000s of Databases for Use
with Massive Parallel Machines
• With this database it should
be possible to perform analysis
at Internet speeds
• Multi-sensor analysis is
relatively simple
• We are starting the database
loading now
• 100TB database testing will
occur over the summer
Summary



• We can now perform high-speed time-series analysis on the server side without downloads
• Scalable, massive remote sensing databases
• Accelerated analysis compared to traditional “search, order and transmission” methods
• Interactions across data sets: data fusion
• Developing UI and additional analysis tools
• Allow users interactive access to the data
NSIDC Data Rods Project
Thank You
The Data Rods project is funded by the National Science Foundation
through grant: ARC 0941442
Interested in testing Data Rods? Please contact us at:
[email protected]