Peter Baumann

eScience Needs in the Big Data Era
e-IRG Workshop
Athens, Greece, 9-10 June 2014
Peter Baumann, Dimitar Misev
Jacobs University | rasdaman GmbH
eScience Needs :: Brussels :: P. Baumann
Array DB Research @ Jacobs University
 Large-Scale Scientific Information Systems group
• massive n-D array services & beyond
• www.jacobs-university.de/lsis
 Main results:
• Pioneer Array DBMS, rasdaman
• Standardization:
editor of "Big Geo Data" standards,
ISO Array SQL candidate standard
rasdaman visitors 2013+
ISO: member, SC32 / WG3 SQL; SC32 Study Group on Big Data; OGC liaison, TC211
Open Geospatial Consortium: co-chair, BigData.DWG, WCS.SWG, Coverages.DWG, Temporal.DWG
Research Data Alliance: co-chair, Big Data Interest Group and Geospatial Interest Group
member, ERCIM Expert Group Big Data
member, Belmont Forum, WP 3 Harmonization of global environmental data infrastructure
Charter Member, OSGeo
council member, CGI / IUGS
founding member and secretary, CODATA Germany
...
Sample User Queries
 "Give me all of the images in this geographic area in this time span
that are at least 80% cloud free, have been radiometrically corrected, and
are from these satellites, and then pass those images into a workflow to
perform functions x, y, z"
• Carl Reed, CTO, Open Geospatial Consortium (OGC)
 “Find images taken by the SEVIRI satellite on August 25, 2007 which
contain fire hotspots in areas which have been classified as forests
according to CORINE Land Cover, and are located within 2km from an
archaeological site in the Peloponnese.”
• INSPIRE related
Core Requirements
 User-oriented
• Visual interfaces + powerful expert interfaces (R, Matlab, WMS, WCPS, ...)
 Flexible
• new apps, new research questions
 Scalable
• Allows for scalable implementations (auto-parallelization, orchestration)
• high-level service defs, not micro management
 Experience shows:
high-level query language (QL) advantageous
Tackling Variety
 Stock trading: 1-D sequences (i.e., arrays)
 Social networks: large, homogeneous graphs
 Ontologies: small, heterogeneous graphs
 Climate modelling: 4D/5D arrays
 Satellite imagery: 2D/3D arrays (+irregularity)
 Genome: long string arrays
 Particle physics: sets of events
 Bio taxonomies: hierarchies (such as XML)
 Documents: key/value stores: sets of unique identifiers + whatever
 etc.
Managed Variety in Big Geo Data
[OGC 09-146r2]
 OGC Coverage = regular & irregular grids, point clouds, meshes
• Fully n-D, spatio-temporal & beyond
 Unifying service: Web Coverage Service (WCS)
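As a rough illustration of how a client might address such a coverage, the sketch below builds a WCS 2.0 GetCoverage request in key-value form. The endpoint URL, the coverage name and the axis labels are hypothetical placeholders, not taken from the slides; real axis labels depend on the coverage being served.

```python
from urllib.parse import urlencode


def get_coverage_url(endpoint, coverage_id, subsets, fmt="image/tiff"):
    """Build a WCS 2.0 GetCoverage URL (KVP encoding).

    `subsets` maps an axis label to a (low, high) pair; each entry
    becomes one `subset=Axis(low,high)` parameter.
    """
    params = [
        ("service", "WCS"),
        ("version", "2.0.1"),
        ("request", "GetCoverage"),
        ("coverageId", coverage_id),
    ]
    for axis, (low, high) in subsets.items():
        params.append(("subset", f"{axis}({low},{high})"))
    params.append(("format", fmt))
    return endpoint + "?" + urlencode(params)


# Hypothetical endpoint, coverage name and axis labels, for illustration only
url = get_coverage_url(
    "https://example.org/wcs",
    "SomeCoverage",
    {"Lat": (40, 50), "Long": (10, 20)},
)
```

The same request shape applies whether the coverage is a regular grid, a point cloud or a mesh; only the axes and the output format change.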
Hadoop: Not the Answer to All
 MapReduce built for unstructured data
 ...no built-in knowledge about structured data types
• Ex: Array Analytics: n-D Euclidean neighborhood
• "Since it was not originally designed to leverage the
structure […] its performance […] is therefore suboptimal." (Daniel Abadi)
• M. Stonebraker (XLDB 2012): „will hit a scalability wall“
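To make the array-analytics example concrete: with an array-aware representation, a neighbourhood lookup is a single slicing operation, whereas a system that sees only unstructured records has no notion of which cells are adjacent. The following is a minimal NumPy sketch of a box-shaped n-D neighbourhood; the function name and the box shape are illustrative choices, not something the slides specify.

```python
import numpy as np


def neighborhood(array, center, radius):
    """Return the n-D box of cells within `radius` steps of `center`
    along every axis, clipped at the array border."""
    slices = tuple(
        slice(max(c - radius, 0), min(c + radius + 1, size))
        for c, size in zip(center, array.shape)
    )
    return array[slices]


# Small 3-D cube standing in for an n-D scientific array
cube = np.arange(4 * 5 * 6).reshape(4, 5, 6)
block = neighborhood(cube, center=(2, 2, 2), radius=1)   # 3 x 3 x 3 block
corner = neighborhood(cube, center=(0, 0, 0), radius=1)  # clipped at the border
```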
OGC WCPS
 OGC Web Coverage Processing Service (WCPS)
= high-level geo raster query language; adopted 2008
 "From MODIS scenes M1, M2, M3: difference between red & nir, as TIFF"
• …but only those where nir exceeds 127 somewhere
for $c in ( M1, M2, M3 )
where
    some( $c.nir > 127 )
return
    encode( $c.red - $c.nir, "image/tiff" )
(tiffA, tiffC)
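For readers unfamiliar with WCPS, the query's semantics can be mimicked on in-memory arrays. This is only a sketch of what the query computes, not of how a server evaluates it: the scene contents are made-up 2x2 arrays, the TIFF encoding step is left out, and the widening to a signed type is an assumption made here to avoid 8-bit wrap-around.

```python
import numpy as np


def red_minus_nir(scenes, threshold=127):
    """For every scene whose nir band exceeds `threshold` somewhere,
    return the cell-wise difference red - nir."""
    results = {}
    for name, bands in scenes.items():
        # Widen to a signed type so the subtraction cannot wrap around
        red = bands["red"].astype(np.int16)
        nir = bands["nir"].astype(np.int16)
        if np.any(nir > threshold):      # WCPS: some( $c.nir > 127 )
            results[name] = red - nir    # WCPS: $c.red - $c.nir
    return results


# Hypothetical 2x2 scenes standing in for M1, M2, M3
scenes = {
    "M1": {"red": np.array([[10, 20], [30, 40]], dtype=np.uint8),
           "nir": np.array([[200, 5], [5, 5]], dtype=np.uint8)},
    "M2": {"red": np.array([[1, 2], [3, 4]], dtype=np.uint8),
           "nir": np.array([[100, 127], [0, 50]], dtype=np.uint8)},
    "M3": {"red": np.array([[50, 60], [70, 80]], dtype=np.uint8),
           "nir": np.array([[0, 0], [0, 128]], dtype=np.uint8)},
}
diffs = red_minus_nir(scenes)  # M2 is filtered out: its nir never exceeds 127
```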
Database Visualization
for $s in (SatImage)
for $d in (DEM)
return
    encode(
        struct {
            red:   (char) $s.img.b7[x0:x1,x0:x1],
            green: (char) $s.img.b5[x0:x1,x0:x1],
            blue:  (char) $s.img.b0[x0:x1,x0:x1],
            alpha: (char) scale( $d.elev, 20 )
        },
        "image/png"
    )
[JacobsU, Fraunhofer 2012; data courtesy BGS, ESA]
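The composition performed by this query can be approximated on plain arrays. In the sketch below, `scale( d.elev, 20 )` is read as spatial upscaling of the elevation grid by an integer factor so that it matches the image resolution; that reading, the nearest-neighbour method and all names are assumptions, and PNG encoding is omitted.

```python
import numpy as np


def compose_rgba(b7, b5, b0, elev, factor):
    """Stack three image bands and an upscaled elevation grid into one
    8-bit RGBA array of shape (rows, cols, 4)."""
    # Nearest-neighbour upscaling: repeat every DEM cell `factor` times per axis
    alpha = np.repeat(np.repeat(elev, factor, axis=0), factor, axis=1)
    # dstack raises ValueError if the four layers do not share one shape
    return np.dstack([b7, b5, b0, alpha]).astype(np.uint8)  # WCPS: (char) casts


# Made-up 4x4 bands and a 2x2 elevation grid, upscaled by 2 to match
b7 = np.full((4, 4), 10)
b5 = np.full((4, 4), 20)
b0 = np.full((4, 4), 30)
elev = np.array([[100, 110], [120, 130]])
rgba = compose_rgba(b7, b5, b0, elev, factor=2)
```

Carrying elevation in the alpha channel lets a single image deliver both colour and terrain height to a rendering client.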
Use Case: Plymouth Marine Laboratory
[Oliver Clements, EGU 2014]
 “Avg chlorophyll concentration for area & time period, from x/y/t cube”
• 10, 60, 120, 240 days
 Conclusions:
• "we must minimise data transfer as well as [client] processing"
• "standards such as WCPS provide the greatest benefit"
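The benchmark query reduces a sub-cube to a single number, which is why evaluating it next to the data saves so much transfer. A minimal NumPy sketch of the reduction follows; the axis order (x, y, t), the index ranges and the use of NaN for missing cells are assumptions for illustration, not details from the talk.

```python
import numpy as np


def mean_chlorophyll(cube, x_range, y_range, t_range):
    """Average over an area and a time period of an x/y/t cube.

    Ranges are half-open index pairs (start, stop). NaN cells, assumed
    here to mark missing values, are ignored.
    """
    (x0, x1), (y0, y1), (t0, t1) = x_range, y_range, t_range
    sub_cube = cube[x0:x1, y0:y1, t0:t1]
    return float(np.nanmean(sub_cube))


# Made-up 2 x 3 x 4 cube: 2 x 3 cells over 4 time steps
cube = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
avg = mean_chlorophyll(cube, (0, 2), (0, 3), (0, 2))  # first two time steps
```

Only the scalar result has to cross the network; a client-side computation would have to download the whole sub-cube first.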
From Clouds to Federations
 Automatic, ad-hoc federation
between data centers, intelligent sensors, ...
• autonomous
• heterogeneous
 Open standards!
[Figure: federation of Dataset A, Dataset B, Dataset C, Dataset D]
Summary
 Rec 1: Evaluate domain standards
 Rec 2: Geo domain as priority
• „80% of all data are location connected“
 Rec 3: tie in database, data mining experts
• Leverage long-standing experience
in flexible, scalable information systems
• Trend: high-level query languages
• New data type support
[rasdaman screenshots]