OPeNDAP in the Cloud

Optimizing the Use of Storage Systems Provided by Cloud Computing Environments
OPeNDAP: James Gallagher, Nathan Potter
NOAA/NODC: Deirdre Byrne, Jefferson Ogata, John Relph
26 June 2013
Cloud Systems Now*
•Providers: IBM, Microsoft, Amazon, Google,
Rackspace, …
•Microsoft: Azure “…handles 100 petabytes of
data a day”
•Amazon: “…hundreds of thousands of users”
•Netflix: “…stopped building its own data
centers in 2008;” all in Amazon by 2012
•Snapchat: 4000 pictures per second; “…never
owned a computer server.” (Google cloud)
*Quentin Hardy, “Google Joins a Heavyweight Competition in Cloud Computing,” NY Times, 3 December 2013
Why use OPeNDAP?
• Full dataset: 100% download
• OPeNDAP request: 4% download
• The OPeNDAP request is smaller and contains just the data the person wants
• In cloud systems, cost is a function of data transferred as well as
data stored, so smaller, targeted requests reduce costs
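As a rough illustration of how a targeted request shrinks the transfer, the sketch below builds a DAP2 hyperslab constraint (the bracketed start:stride:stop form) and computes the fraction of a variable it selects. The server URL, the variable name `sst`, and the array shape are hypothetical, chosen only so that the subset works out to the 4% figure above.

```python
from math import prod

def dap2_constraint(var, slices):
    """Build a DAP2 hyperslab constraint such as sst[0:1:0][0:1:99]."""
    # Each slice is (start, stride, stop); DAP2 stop indices are inclusive.
    return var + "".join(f"[{a}:{s}:{b}]" for a, s, b in slices)

def selected_count(slices):
    """Number of array elements selected by the hyperslab."""
    return prod((b - a) // s + 1 for a, s, b in slices)

# Hypothetical variable: 25 time steps on a 100 x 100 grid.
full_shape = (25, 100, 100)
subset = [(0, 1, 0), (0, 1, 99), (0, 1, 99)]   # one time step, full grid

# Hypothetical server and file name, for illustration only.
url = "http://example.org/opendap/data.nc.dods?" + dap2_constraint("sst", subset)
fraction = selected_count(subset) / prod(full_shape)
print(url)
print(f"{fraction:.0%} of the full variable is transferred")
```

Because transfer is billed by volume, the same ratio applies to the transfer part of the bill, whatever the provider's actual rate is.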
NOAA Environmental Data Management Conceptual Cloud Architecture*
*Adapted from NOAA Environmental Data Management Framework Draft v0.3,
Appendix C - Dr. Jeff de La Beaujardière, NOAA Data Management Architect
Potential locations of cloud-enabled OPeNDAP instances
Constraints
• No vendor lock-in!
• No Stovepipes! - flexible storage method
• What will be the client of 2020?
• Hierarchical/human browsable
[Diagram: a dataset containing several files]
Data stores: S3 and Glacier
•S3
• Spinning disk with a flat file system
• Designed to make web-scale computing easier
•Glacier
• Near-line device with access times of 4 hours or more
• Secure and durable storage
•EC2
• EC2 was used to run the OPeNDAP data server
• Linux
Using S3 as a Data Store
[Diagram: web requests for a catalog or for data are sent to S3 as HTTP GET and HEAD requests; S3 returns an XML file or a data file.]
OPeNDAP Catalog requests
[Diagram: a user catalog request reaches the OPeNDAP server on EC2 (catalog access, with catalog and data caches); the server reads the XML file from S3 and returns a THREDDS catalog or HTML.]
To enhance performance, data were accessed from
S3 only when not already cached.
OPeNDAP Data requests
[Diagram: a user data request reaches the OPeNDAP server on EC2 (data access, with catalog and data caches); the server reads the data file from S3 and returns a data slice.]
To enhance performance, data were accessed from
S3 only when not already cached.
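The caching rule on these two slides amounts to a read-through cache. A minimal sketch follows, assuming an in-memory dictionary in place of the server's real cache and a plain dictionary standing in for S3; the class and object names are hypothetical and are not taken from the OPeNDAP server.

```python
class ReadThroughCache:
    """Serve objects from a local cache, contacting the store only on a miss."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable: key -> bytes (e.g. an S3 GET)
        self._items = {}         # in-memory stand-in for an on-disk cache
        self.store_reads = 0     # how many times the backing store was read

    def get(self, key):
        if key not in self._items:
            self._items[key] = self._fetch(key)
            self.store_reads += 1
        return self._items[key]

# Hypothetical stand-in for S3: a dict keyed by object name.
fake_s3 = {"catalog.xml": b"<catalog/>", "sst.nc": b"\x00\x01\x02"}
cache = ReadThroughCache(fake_s3.__getitem__)

cache.get("sst.nc")   # miss: read from the store
cache.get("sst.nc")   # hit: served from the local cache
print(cache.store_reads)
```

A real deployment would also need an eviction policy and a way to notice when the object in S3 has changed; the slides do not say how those were handled.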
Observations
• S3FS & Amazon's APIs: vendor lock-in
• XML catalogs were flexible:
o Support both direct web and…
o Subsetting server access
o Likely adaptable to other use-cases
o Easily support hierarchical structure
• Catalogs didn't need to be stored in S3
Glacier and Asynchronous Responses
• To use Glacier, a web service protocol must
support asynchronous access! Glacier is a
near-line device; not a spinning disk.
• Support via protocol is not enough: typical
use cases cannot be met without caching
‘metadata’
o To support web interfaces/clients DAP metadata
objects should be cached
o To support smart clients, may need domain data in
cache
Glacier Implementation
• Caching
o Catalog
o DAP metadata
• Support for programmatic and web clients
o Web clients are the primary users of the DAP metadata
because of their ‘click and browse’ behavior
• XML with an embedded XSL style sheet
o Single response (XML)
o Multiple target clients – smart and browser
OPeNDAP Catalog (THREDDS) Requests
[Diagram: a user catalog request reaches the OPeNDAP server on EC2, which holds catalog, metadata, and data caches in front of Glacier, and returns a catalog response.]
Because Glacier has an entirely flat inventory representation,
a hierarchical catalog representation must be held externally.
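One way to hold such a hierarchy outside the store is to derive a tree from flat names. The sketch below assumes each archive carries a '/'-separated name (for example in its description field); that naming convention and the sample names are assumptions made for illustration, not details from the talk.

```python
def build_catalog(keys):
    """Turn flat, '/'-separated object names into a nested dict tree."""
    root = {}
    for key in keys:
        node = root
        *dirs, leaf = key.split("/")
        for d in dirs:
            node = node.setdefault(d, {})
        node[leaf] = key     # leaves map back to the flat archive name
    return root

# Hypothetical flat inventory.
flat_inventory = [
    "sst/2012/jan.nc",
    "sst/2012/feb.nc",
    "sst/2013/jan.nc",
    "wind/2013/jan.nc",
]
catalog = build_catalog(flat_inventory)
print(sorted(catalog))
print(sorted(catalog["sst"]["2012"]))
```

The resulting tree could then be serialized as a THREDDS-style XML catalog and kept in the catalog cache, so that browsing never touches Glacier.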
OPeNDAP Metadata Requests
[Diagram: a user metadata request reaches the OPeNDAP server on EC2, which holds catalog, metadata, and data caches in front of Glacier, and returns a metadata response.]
Because all Glacier retrievals require 4 hours, meaningful metadata
must be cached at the service instance for clients to be able to
determine which data to access.
OPeNDAP Data Requests
[Diagram: a user data request reaches the OPeNDAP server on EC2, which holds catalog, metadata, and data caches in front of Glacier, and returns a data response.]
If the requested data is already held in the service instance cache
the data may be returned immediately and the access is complete.
OPeNDAP Data Requests
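[Diagram: a user data request reaches the OPeNDAP server on EC2, which holds catalog, metadata, and data caches in front of Glacier; a "?" appears near the data cache and the server returns an Async Response.]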
If the requested data is not held in the service instance cache then
the client must be told that the request will be asynchronous.
OPeNDAP Data Requests
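[Diagram: a user async data request reaches the OPeNDAP server on EC2, which holds catalog, metadata, and data caches in front of Glacier; a "?" appears near the data cache and the server returns an Async Response.]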
If the User makes an Asynchronous Data Request and the data is
not held in the local instance cache then the Server initiates a
Glacier retrieval job and returns an Async Response to the client
indicating the expected availability time.
OPeNDAP Data Requests
[Diagram: Glacier, the OPeNDAP server on EC2, and its catalog, metadata, and data caches; no user request is shown.]
After 4 hours the server retrieves the data from the completed
Glacier job and places it in the local instance cache.
OPeNDAP Data Requests
[Diagram: a user data request reaches the OPeNDAP server on EC2, which holds catalog, metadata, and data caches in front of Glacier, and returns a data response.]
Subsequent user requests for the newly cached dataset complete in
the traditional manner (no asynchronous "dialog" required) as long
as the data persist in the instance cache.
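The request sequence on the preceding slides can be summarized as a small state machine. The following is a toy model only: the status strings, class name, and method names are hypothetical rather than part of DAP, a dictionary stands in for Glacier, and time is passed in explicitly so the example runs without waiting four hours.

```python
class AsyncDataServer:
    """Toy model of the cached / asynchronous request sequence."""

    def __init__(self, archive, retrieval_delay=4 * 3600):
        self.archive = archive            # stand-in for Glacier: name -> bytes
        self.delay = retrieval_delay      # seconds until a job completes
        self.cache = {}                   # local instance data cache
        self.jobs = {}                    # name -> time the job will be ready

    def request(self, name, now, accept_async=False):
        self._collect_finished_jobs(now)
        if name in self.cache:
            return ("data", self.cache[name])
        if not accept_async:
            # Tell the client that an asynchronous request is required.
            return ("async-required", None)
        # Start a retrieval job once; repeat requests report the same time.
        ready_at = self.jobs.setdefault(name, now + self.delay)
        return ("async-accepted", ready_at)

    def _collect_finished_jobs(self, now):
        # Move the results of completed jobs into the local cache.
        for name, ready_at in list(self.jobs.items()):
            if now >= ready_at:
                self.cache[name] = self.archive[name]
                del self.jobs[name]

server = AsyncDataServer({"sst.nc": b"payload"})
print(server.request("sst.nc", now=0))                     # not cached
print(server.request("sst.nc", now=0, accept_async=True))  # job started
print(server.request("sst.nc", now=4 * 3600))              # served from cache
```

In this sketch completed jobs are collected lazily on the next request; a real server would more likely poll or be notified when the retrieval job finishes.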
Comparison: S3 and Glacier*
•Glacier provides “secure and durable storage”
•S3 is “designed to make web-scale computing
easier”
•These graphs show only a tiny part of a complex cost model.
They do not include the cost to move data out of
the Amazon cloud, EC2 instances, etc.
*http://calculator.s3.amazonaws.com/calc5.html
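To show why the omitted terms matter, here is a minimal sketch of a cost estimate with a storage term and a transfer-out term. The rates are made-up placeholders, not Amazon's prices, and the model ignores EC2, request charges, and retrieval fees; the names `Rates` and `monthly_cost` are introduced here for illustration.

```python
from dataclasses import dataclass

@dataclass
class Rates:
    """Placeholder prices in dollars; substitute current provider figures."""
    storage_per_gb_month: float
    transfer_out_per_gb: float

def monthly_cost(rates, stored_gb, transferred_gb):
    """Storage plus transfer-out cost for one month."""
    return (rates.storage_per_gb_month * stored_gb
            + rates.transfer_out_per_gb * transferred_gb)

# Made-up numbers, chosen only to make the arithmetic easy to follow.
rates = Rates(storage_per_gb_month=0.10, transfer_out_per_gb=0.10)
full = monthly_cost(rates, stored_gb=1000, transferred_gb=500)
subset = monthly_cost(rates, stored_gb=1000, transferred_gb=500 * 0.04)
print(full, subset)
```

Comparing storage options realistically would mean running a model like this with each service's actual rates and the observed access pattern.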
Summary
• OPeNDAP server with minimal changes
• Data stored in S3 and Glacier
• Solution widely applicable: Web + Smart
clients
• Complexity of the cost model → a combination
of both S3 and Glacier is likely
• Modeling & monitoring of use required