Brown Dog Overview - NCSA Open Source Projects

NCSA Brown Dog
An Overview
Kenton McHenry, Ph.D.
Senior Research Scientist
National Center for Supercomputing Applications
University of Illinois at Urbana–Champaign
NSF ACI Data Program
• Kenton McHenry: $10,519,716, 2013-2018
• Alex Szalay: $7,603,723, 2013-2018
• Reagan Moore: $8,300,992, 2011-2016
• Michael Levine: $4,902,601, 2013-2018
• Steven Ruggles: $7,993,266, 2011-2016
• Bill Michener: $21,194,548, 2009-2014
• Xiaohui Carol Song: $3,409,029, 2013-2018
• Margaret Hedstrom: $8,000,000, 2011-2016
• Golam Choudhury: $10,085,120, 2009-2014
• Project titles shown on the slide: Long Term Access to Large Scientific Data Sets: The SkyServer and Beyond; The Data Exacell; Integrating Geospatial Capabilities into HUBzero
CIF21 DIBBs: Brown Dog
• NSF ACI
• $10,519,716
• PI: Kenton McHenry, Ph.D.
• Co-PI: Jong Lee, Ph.D.
• Co-PI: Barbara Minsker, Ph.D.
• Co-PI: Praveen Kumar, Ph.D.
• Co-PI: Michael Dietze, Ph.D.
The Problem
• The Scientific Method:
• Question
• Hypothesis
• Testing
• Procedure
• Analysis
• Result
• When procedure is executed one obtains the same
result every time!
• The majority of science today involves procedures which
include software and digital data.
• Both have relatively short lifespans!
The Problem
• Large collections of un-curated and/or unstructured
digital data (“long-tail” data)
• Many file formats
• No metadata
• No useful filenames
• No useful directory structure
• No textual contents
What is needed (from the data side)
• Means of deciphering the bytes that make up digital data
so that one can retrieve its contents
• Data Structures (e.g. images, 3D points, sound waves, strings,
fields, matrices, etc…)
• Means of indexing data contents so that large collections
of data can be searched and desired data found
• An ability to compare data
What is needed (from the data side)
• The file format specifications describing how contents
are represented within the file’s bytes,
• the software used to create and view the data,
• and the execution environment (platform, operating
system, libraries, other software, etc…).
• The existence of metadata describing the data (possibly
as simple as useful file/directory names), in order to
search/index data.
Additional Considerations
• software is also a factor in this (i.e. the data side),
• obsolete operating systems and platforms,
• storage requirements (e.g. storing a working
environment in a virtual machine),
• software that is no longer available,
• software licensing,
• the existence of many file formats (even for the same
kind of data),
• lack of standards for data formats or enforceability of
standards,
• large complex file format specifications,
Additional Considerations
• unavailable format specifications (either lost or
proprietary),
• the ease and reward of creating data versus the burden
of curation (e.g. organizing and providing metadata for
files),
• different metadata standards,
• assuring the long term availability of preserved software
and data,
• assuring the archive preserving the software and data
exists over a reasonably long period of time,
Additional Considerations
• assuring the archival tools needed to index, find, access,
view, retrieve, and utilize the software and data within
the archive exists over a reasonably long period of time
(being software itself).
Additional Considerations
• a growing recognition of the need for academic reward,
and perhaps education, surrounding the costly products
of software development and data creation
• the necessity for science to build off of the work of others
and have software and data reused (possibly in ways not
remotely considered by the creator and crossing into
other disciplines)
• need for computation during the analysis of data
collections
• means of efficiently and reliably transferring large
amounts of data
What Brown Dog Addresses
• Accessing Data Contents with a Lack of Standards and
Many File Formats
• Discovering and Finding Data with a Lack of Curation
while also Considering the Need to Preserve Software
and Provide Credit for Software Development
• Creating Tools for Accessing Data while Addressing
Archival Tool Sustainability
Sustainable Software Cyberinfrastructure
• Knowing our history:
• NCSA Telnet, 1986
• Gaige Paulsen, Tim Krauskopf, Aaron Contorer
• Mosaic, 1993
• Marc Andreessen, Eric Bina
• Netscape, Internet Explorer, Firefox, Chrome (84% of
browser traffic)
• httpd (and CGI), 1993
• Robert McCool
• Apache (64% of all webservers)
• All built to access supercomputing resources
• Though they still serve this purpose none will be
remembered for that!
Sustainable Software Cyberinfrastructure
• Knowing our history:
• Funded to meet scientific need(s)
• Broad appeal (i.e. the general public)
• Free (e.g. open source)
• Broad public appeal to sustain and drive scientific
software post funding
The Domain Name System (DNS)
• Originally written by Paul Mockapetris in 1983
• Distributed database to translate domain names (i.e.
strings) into IP addresses (i.e. 4 bytes)
• 13 logical root servers (A-M), 359 instances worldwide
• Internet Corporation for Assigned Names and Numbers (ICANN)
• Essential part of the modern internet!
• Used constantly by all yet largely invisible
Data Access Proxy (DAP)
• A highly extensible and distributed service for carrying out
file format conversions
• Move towards an internet/world that is agnostic to file
formats
• Aid in accessing a file’s contents independent of how it
is represented on disk
• Conversion: A transformation on digital data that largely
preserves the entirety of the data. Largely reversible.
Data Tilling Service (DTS)
• An extensible and distributed service for the extraction of
new data or metadata from a file’s contents
• Provide means to query and/or relate collections of
data without metadata
• Extraction: A transformation on digital data which creates
new, often higher level, data from the contents of the given
data (e.g. tags, signatures). Not reversible.
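The difference between a conversion (largely reversible) and an extraction (not reversible) can be illustrated with a small local sketch. This is not Brown Dog code: the function names, the CSV/JSON pair, and the choice of a SHA-256 digest as a "signature" are all assumptions made for illustration only.

```python
import csv
import hashlib
import io
import json


def csv_to_json(csv_text: str) -> str:
    """Conversion: re-encode the same rows in another format."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows)


def json_to_csv(json_text: str) -> str:
    """Inverse conversion: assumes at least one row is present."""
    rows = json.loads(json_text)
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


def extract(csv_text: str) -> dict:
    """Extraction: derive new, higher-level data from the contents.

    The result (column tags, a row count, a hash used here as a
    stand-in signature) cannot be turned back into the original file.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {
        "columns": list(rows[0].keys()) if rows else [],
        "row_count": len(rows),
        "signature": hashlib.sha256(csv_text.encode("utf-8")).hexdigest(),
    }


original = "species,count\noak,12\nelm,7\n"
round_trip = json_to_csv(csv_to_json(original))  # conversion and back
metadata = extract(original)                     # one-way derivation
```

Here the round trip reproduces the input text, while the extracted dictionary only describes it; real conversions between richer formats are typically only largely, not perfectly, reversible.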
Brown Dog Data Transformation Services
• The Data Access Proxy (DAP)
• http://dap.ncsa.illinois.edu/conversion/:output/:file
• File in, File out
• The Data Tilling Service
• http://dts.ncsa.illinois.edu/extraction/:domain/:file
• File in, JSON out
• JSON can contain metadata, tags, signatures, links to derived
data products, etc…
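As a minimal sketch of how a client might fill in the two URL patterns above, the helpers below substitute values for the `:output`, `:domain`, and `:file` placeholders. The slides do not say how `:file` is encoded or what values `:domain` accepts, so treating each as a percent-encoded path segment is an assumption, as are the helper names and the example arguments. No request is sent, since the HTTP method, authentication, and response handling are not described here.

```python
from urllib.parse import quote

# Base URLs taken from the slide; everything else is illustrative.
DAP_BASE = "http://dap.ncsa.illinois.edu/conversion"
DTS_BASE = "http://dts.ncsa.illinois.edu/extraction"


def _segment(value: str) -> str:
    """Percent-encode a value so it fits in a single path segment.

    Assumption: the services accept percent-encoded segments.
    """
    return quote(value, safe="")


def dap_url(output_format: str, file_ref: str) -> str:
    """Fill in /conversion/:output/:file (file in, file out)."""
    return f"{DAP_BASE}/{_segment(output_format)}/{_segment(file_ref)}"


def dts_url(domain: str, file_ref: str) -> str:
    """Fill in /extraction/:domain/:file (file in, JSON out)."""
    return f"{DTS_BASE}/{_segment(domain)}/{_segment(file_ref)}"


# Hypothetical inputs: a file reference and a made-up domain name.
conversion_request = dap_url("png", "http://example.org/scan.tiff")
extraction_request = dts_url("example-domain", "http://example.org/scan.tiff")
```

Keeping URL construction separate from the request itself makes it easy to swap in whatever HTTP client and encoding rules the deployed services actually require.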
Brown Dog Data Transformation Services
• Services!!!
• Provide a programmable interface (e.g. REST)
• Client applications build on top of these services
• Backed by computational resources
• Place to preserve/reuse software/tools
Brown Dog Use Cases
• Addressed specifically here:
• Biology
• Ecology
• Civil and Environmental Engineering
• Social Science
• Towards all science
• Early User Workshop!!
Ecosystems and Climate Change
• The Predictive Ecosystem Analyzer (PEcAn)
• Models:
• Ecosystem Demography (ED)
• SIPNET
• DALEC
• Data:
• Biofuel Ecophysiological Trait and Yield Database (BETY)
• Forest Inventory and Analysis (FIA)
• North American Regional Reanalysis (NARR)
• North American Carbon Program (NACP)
• Food and Agriculture Organization (FAO)
• …
• Lots of conversions taking place!!!
Ecosystems and Climate Change
• MODIS (Multi-spectral)
• Lidar
• Palsar (Radar)
• Aviris (Airborne Infrared Spectrometer)
• Landsat (Images)
Ecosystems and Climate Change
• Settlement Vegetation data
• Born Physical
• Paper, Microfiche, Alphanumeric/Color coded on vellum sheets
• Born Digital
• PDF, JPEG, GIF, TIFF, XLS, XLSX, CSV, SHP, netCDF, HDF5,
XML, GRIB, GRIB2, geoTIFF, DBF, BIL, BIP, ARC, SDTS,
SRTM, IMG, UA, LGW, SXW, ODS
• Ad hoc formats:
• Spreadsheets
• Databases
• Services
• R Data
• Matlab Data
• Categories labeled on the slide: Document, Image, Spatial, Tabular, Weather, 3D
[Figure: Brown Dog data flow, by layer. Data Collection (Archive, Database, Filesystem, URL, …) → Native Byte Encoding (various file formats, databases, websites, documents) → DAP → Data Structures (arrays, strings, images, videos, audio, 3D models, …) → DTS → Derived Data/Metadata (text, tags, signatures, intermediary analysis results, …) → Applications (search, relate, view, process; e.g. climate modeling, flood plain analysis, green infrastructure design, group behavior analysis) → Usable Data. Example collections shown: weather data, Groupscope data, lidar data, MODIS satellite data, architecture/landscape photographs, architecture/design models, handwritten documents, settlement vegetation data.]
• The Data Access Proxy (Demo): Kenton McHenry
• The Data Tilling Service (Demo): Luigi Marini
Technology
• K. McHenry, R. Kooper, P. Bajcsy, “Towards a Universal, Quantifiable, and Scalable File Format Converter”, The IEEE International Conference on eScience, 2009.
• M. Ondrejcek, K. McHenry, P. Bajcsy, “The Conversion Software Registry”, Microsoft eScience Workshop in San Francisco, CA, 2010.
• K. McHenry, M. Ondrejcek, L. Marini, R. Kooper, P. Bajcsy, “Towards a Universal Viewer for Digital Content”, International Conference on Computer Science, Executable Paper Workshop, 2011.
• K. McHenry, R. Kooper, L. Marini, M. Ondrejcek, “The ISDA Tools: Preserving 3D Digital Content”, The Preservation of Complex Objects Symposia, 2011.
• K. McHenry, R. Kooper, M. Ondrejcek, L. Marini, P. Bajcsy, “A Mosaic of Software”, The IEEE International Conference on eScience, 2011.
• L. Marini, P. Bajcsy, S. Padhy, A. Vandecreme, R. Kooper, B. Long, M. Ondrejcek, P. Saba, D. Bonnie, J. Chalfoun, K. McHenry, “Versus: A Framework for General Content-Based Comparisons”, IEEE eScience, 2012.
• L. Diesendruck, L. Marini, R. Kooper, M. Kejriwal, K. McHenry, “Digitization and Search: A Non-Traditional Use of HPC”, IEEE eScience Workshop on Extending High Performance Computing Beyond its Traditional User Communities, 2012.
• L. Diesendruck, L. Marini, R. Kooper, M. Kejriwal, K. McHenry, “A Framework to Access Handwritten Information within Large Digitized Paper Collections”, IEEE eScience, 2012.
• L. Diesendruck, R. Kooper, L. Marini, K. McHenry, “Using Lucene to Index and Search the Digitized 1940 US Census”, XSEDE, 2013. (Best Paper Award and Best Science & Engineering Track Paper Award)
Brown Dog: Data Access Proxy (DAP)
Brown Dog: Data Tilling Service (DTS)
Goals
• Support
• Make list of supported formats as long and as relevant as
possible
• Make list of extractors/signatures as long and as relevant as
possible
• Performance
• Increase tasks per hour
• Backed by hardware (e.g. XSEDE, Amazon EC2, Azure, …)
• Minimize failures per hour
Software
• DAP & DTS REST Services
• Javascript bookmarklets (for DAP & DTS)
• Browser plugin (e.g. Firefox)
• Linux module
• Linux file manager (e.g. GNOME Files)
• Cross platform client to:
• Provide access to uncurated/unstructured collections
• Help users curate uncurated/unstructured collections
• Leverage other DataNet efforts for the rest of the curation workflow
• DAP: https://dap.ncsa.illinois.edu/conversion/
• DTS: https://dts.ncsa.illinois.edu/extraction/
• Medici: https://opensource.ncsa.illinois.edu/stash/projects/MMDB/
• Polyglot: https://opensource.ncsa.illinois.edu/stash/scm/pol/polyglot.git
• Versus: https://opensource.ncsa.illinois.edu/stash/projects/VS/
• Daffodil: https://opensource.ncsa.illinois.edu/stash/scm/dfdl/daffodil.git
• Brown Dog: http://browndog.ncsa.illinois.edu