1
Distributed Software Systems:
Cyberinfrastructure and
Geoinformatics
Chaitan Baru
San Diego Supercomputer Center
2
Integrated Cyberinfrastructure System
Education and Training
Discovery & Innovation
Source: Dr. Deborah Crawford, Chair, NSF CI Working Committee
Applications
• Geosciences
• Environmental Sciences
• Neurosciences
• High Energy Physics …
Development
Tools & Libraries
Domain-specific
Cybertools
(software)
Shared
Cybertools
(software)
Middleware Services
Hardware
Distributed Resources
(computation, storage,
communication, etc.)
3
Community Cyberinfrastructure
Friendly Projects
Work-Facilitating Portals
Ecological Observatories (NEON)
Earthquake Engineering (NEES)
Authentication - Authorization - Auditing - Workflows - Visualization - Analysis
Hardware
Ocean Observing (ORION)
Geosciences (GEON)
Middleware
Services
Biomedical Informatics (BIRN)
Development
Tools & Libraries
High Energy Physics (GriPhyN)
Adapted from: Prof. Mark Ellisman, UC San Diego
Your
Specific
Tools
& User
Apps.
Shared
Tools
Science
Domains
Distributed Computing, Instruments and Data Resources
4
Data, Tools, & Computation
• Data
– Field observations
– Laboratory analyses
– Sensor-based data (land, airborne, satellite)
• Tools
– QA/QC, simple transformations and analyses
– Complex models
• Computation
– Community codes
– Access to high-performance computing
– Data Intensive Computing
5
Variety of Geoinformatics Efforts
• Data collection
– Digital data collection in the field
– “When does it become cyberinfrastructure”?
• Database curation
– E.g. EarthChem, Paleobiology, MorphoBank, Paleo
Pollen, etc….
– When does it become “tools” and “community codes”?
• Software Development
– Tools: gravity and magnetics, paleogeography,
geochemistry, seismic data products, …
– Community codes: SCEC-CME, CIG, …
6
Variety of Geoinformatics Efforts
• High Performance Computing
– LiDAR data management
– Seismic analyses
– Petascale initiative
• Data Integration
– E.g. CUAHSI HIS
– Also, a pressing need in projects like
EarthScope
7
Cyberinfrastructure: The Common
Platform Across Distributed Projects
Cyberinfrastructure
Data Management
And Curation
Data Collection
Modeling and
Integration
Tool Development
To provide access to all of these “resources”
and support “interoperability” among them
8
Example: USArray Data Flow
• Deploy field sensor arrays
– Across US
• Collect data from sensor arrays
and perform QA/QC
– One of the sites is SIO, San Diego
• Archive data for community
access
– IRIS, Seattle
EarthScope/USArray: Single
project, multiple participants.
9
Survey
Example: LiDAR
Workflow
Courtesy: Chris Crosby, ASU
D. Harding,
NASA
Point Cloud
x, y, z, …
Interpolate / Grid
Single goal: Multiple projects,
multiple participants, e.g. NCALM,
GEON, ASU, NASA, USGS, …
Analyze / “Do Science”
10
GEON Cyberinfrastructure
• Funded by NSF IT Research program
• Multi-institution collaboration between IT and Earth
Science researchers
• GEON Cyberinfrastructure provides:
– Authenticated access to data and Web services
– Registration of data sets, tools, and services with metadata
– Search for data, tools, and services, using ontologies
– Scientific workflow environment and access to HPC
– Data and map integration capability
– Scientific data visualization and GIS mapping
11
Key Informatics Areas
• Portals
– Authenticated, role-based access to cyber resources: data, tools, models,
model outputs, collaboration spaces, …
• Data Integration
– Search, discovery and integration of data from heterogeneous information
sources (“mediation” and “semantic integration”)
• Use of workflow systems, and access to HPC
– Ability to “program” at a higher level of abstraction
– Sharing of models, along with “provenance” information
– Gateways to HPC environments
• Management of Geospatial Information
– Using GIS capabilities, map services, geospatial data integration
• Visualization of 3D, 4D geospatial data and information
12
Distributed System Definition
• A Distributed System is
– one in which the hardware and software
components in networked computers
communicate and coordinate their activities
only by passing messages, e.g. the Internet
• A Distributed Database System is
– one in which data is stored at several sites, each
managed by a database system (DBMS) that
can run independently
13
Distributed System Models
• Client – Server
– Clients A, B, and C send an invocation over the network to Server 1
– Server 1 sends a response back over the network
• Peer to Peer
– Processes 1, 2, and 3 communicate with one another over the network
14
Remote Service Invocation
• TCP/IP
– Basic Internet protocol for computer communications
– Platform for building a number of other open or
proprietary, “higher-level” communications protocols
• Communication at a higher-level of abstraction
• http
– Open protocol based on TCP/IP for the Web
– Fixed set of “verbs” (actions) used to transfer HTML
documents
• CORBA, Java RMI
– Protocols based on an object model
15
SDSC Storage Resource Broker
“Virtualizing” storage
User
Resource,
Mthd, User
User
Defined
C, C++,
Linux I/O
Unix
Shell
Java, NT
Browsers
Prolog
Web
Predicate
SRB
MCAT
Dublin
Core
Archives: HPSS, ADSM, UniTree, DMF
File Systems: Unix, NT, Mac OSX
Databases: DB2, Oracle, Sybase
Metadata
Extraction
Remote
Proxies
DataCutter
Application
Meta-data
http://www.sdsc.edu/srb
16
SRB Client/Server Model
Data are requested using an SRB ID and a “file abstraction” (open, close, read, write)
SRB Client – Network – SRB Server
HPSS Client – Network – HPSS server
Oracle Client – Network – Oracle Server
SRB Server – Network (SRB peer-to-peer protocol) – SRB Server B
17
OpenDAP
• Client/Server model
OpenDAP Servers
Network
OpenDAP
Clients
18
OpenDAP
Servers and data formats: CODAR, netCDF, HDF4, Matlab, DSP, Tables, SQL, FITS, CDF, Flat Binary, JGOFS, JDBC, CEDAR, General, ESML, FreeForm
Clients: netCDF C, Ferret, GrADS, netCDF Java, IDV, VisAD, ncBrowse, Matlab Client, IDL Client, Matlab, IDL, Access, Excel
From: Peter Cornillon & Jim Gallagher
http://www.opendap.org/support/stennis_tutorial.html
19
OpenDAP Data Request
• Data are requested with a URL:
http://www.cdc.noaa.gov/cgi-bin/nph-nc/datasets/Reynolds_sst?sst[10:10][0:90][0:180]
– Protocol: http
– Machine name: www.cdc.noaa.gov
– OPeNDAP server: cgi-bin/nph-nc
– Directory: datasets
– File name: Reynolds_sst
– Constraint: sst[10:10][0:90][0:180]
• User can impose a constraint on the data to be
acquired from a data set by appending a constraint
expression to the end of the URL
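A minimal Python sketch of how such a request URL can be assembled. The helper name build_opendap_url is hypothetical (it is not part of any OPeNDAP client library); it only performs the string concatenation described on this slide.

```python
def build_opendap_url(base_url, variable, index_ranges):
    """Append a constraint expression to a dataset URL.

    index_ranges holds one (start, stop) pair per array dimension.
    """
    # Each pair becomes one bracketed range, e.g. (0, 90) -> "[0:90]".
    hyperslab = "".join(f"[{start}:{stop}]" for start, stop in index_ranges)
    return f"{base_url}?{variable}{hyperslab}"


url = build_opendap_url(
    "http://www.cdc.noaa.gov/cgi-bin/nph-nc/datasets/Reynolds_sst",
    "sst",
    [(10, 10), (0, 90), (0, 180)],
)
print(url)
```

The constraint restricts the request to a sub-array of the sst variable, so only that part of the data set is transferred.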
20
Remote Service Invocation with
Web Services
• A Web Service is a simple protocol for invoking remote
services on the Web. It is:
– A network “endpoint”, i.e. server, that implements one or more
“ports”.
• Each port is defined by the message types that it accepts and the
messages it returns.
– Specified by a “Web Services Description Language” (WSDL) XML document.
• Given the WSDL for a web service you know all you need to interact
with it.
• Web Service Standards also exist for security, policy,
reliability, addressing, notification, choreography and
workflow.
– It is the basis for MS .NET, IBM Websphere, SUN, Oracle, BEA,
HP, …
– It is the basis for the new Grid standards like WSRF and OGSA.
21
Web Site vs Web Service
From: “Building Grid Applications and Portals, An Approach Based on
Components, Web Services and Workflow Tools,” Gannon et al, Euro-Par 2004
• Web Site
• Web Service
– Designed to pass http
get/post/put requests
between a browser and a
web server.
– Google has a web site.
Web
Server
– Designed for services to
talk to other services by
exchanging xml messages
– Google also provides a web
service so Google may be
used in distributed apps
Web
Service
Client’s Browser
Web
Service
Web
Service
22
Grid Services
From: “Building Grid Applications and Portals, An Approach Based on
Components, Web Services and Workflow Tools,” Gannon et al, Euro-Par 2004
• Grid: A distributed, heterogeneous set of resources
– Integrated by a pervasive layer of services
– Goal: allow users to view it as a single system
• More than the Internet (which forms part of the resource
layer)
• Builds on the Web by building on web services
Open Grid Service Architecture Layer
Registries and
Name binding
Reservations
And Scheduling
Data Management
Service
Security
Policy
Administration
& Monitoring
Event Service
Logging
Accounting
Service
Grid Orchestration
Web Services Resource Framework – Web Services Notification
Physical Resource Layer
23
Access Interfaces and Levels of
Access
• Web service, native application program
interface, ODBC/JDBC, filesystem
– SOAP server stack: WSDL and SOAP
– Web Server “stack”: URLs and http
– Application Program (SRB, OpenDAP, etc…): can also be “wrapped” as a Web Service
– DBMS: expose ODBC/JDBC interface (and full SQL)
– filesystem: mount remote filesystems
24
Authentication
• Client – Server models
User
Client A
Client-side
authentication
Server 1
Network
?
Server 3
?
Server-side
authentication
Server 2
25
Common Authentication
Obtain
Credentials
Certificate
Authority
Verify
Credentials
Client
Invoke with
Credentials
Server 1
Server 2
Server 3
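The obtain / invoke / verify cycle in this diagram can be sketched in Python as follows. This is a toy model under stated assumptions: the class names are hypothetical, and an HMAC over the user name stands in for a real certificate (an actual Certificate Authority issues public-key certificates, so servers would not need to contact it or share its secret).

```python
import hashlib
import hmac


class CertificateAuthority:
    """Toy CA: signs a user name with a secret key (stand-in for a certificate)."""

    def __init__(self, secret):
        self._secret = secret

    def _sign(self, user):
        return hmac.new(self._secret, user.encode(), hashlib.sha256).hexdigest()

    def issue(self, user):
        # "Obtain Credentials": the client receives (user, signature).
        return (user, self._sign(user))

    def verify(self, credential):
        # "Verify Credentials": recompute the signature and compare.
        user, signature = credential
        return hmac.compare_digest(signature, self._sign(user))


class Server:
    def __init__(self, name, ca):
        self.name = name
        self.ca = ca

    def invoke(self, credential, request):
        # "Invoke with Credentials": every server checks with the same CA.
        if not self.ca.verify(credential):
            raise PermissionError(f"{self.name}: invalid credential")
        return f"{self.name} handled {request!r} for {credential[0]}"


ca = CertificateAuthority(b"demo-secret")
credential = ca.issue("alice")
servers = [Server(f"Server {i}", ca) for i in (1, 2, 3)]
results = [server.invoke(credential, "get data") for server in servers]
```

The point of the design is that the user obtains one credential and all three servers accept it, instead of each server keeping its own password list.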
26
Grid Account Management Architecture (GAMA):
Single sign-on in GEON (also used in a number of other projects)
Karan Bhatia, Kurt Mueller, Choonhan Youn, Sandeep Chandra
gama
gridportlets
GridSphere
DB
Servlet container
import user
retrieve
credential
Java keystore
Portal server 1
Portal server 2
retrieve
credential
OGSA Grid
services wrapper
create user
CACL
Myproxy
CAS
…
Servlet container
Java keystore
GAMA server
Stand-alone applications
27
Systems Issues
• Load Balancing, Failover, Replication
Server 1
Client
Server 2
Server 3
Multiple servers for
load balancing,
failover
Data replication
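A small Python sketch of load balancing with failover, assuming the servers are interchangeable replicas. The names ReplicatedService and make_server are hypothetical, and a ConnectionError simulates a server that is down.

```python
import itertools


class ReplicatedService:
    """Round-robin over replicas; on failure, try the next replica."""

    def __init__(self, servers):
        self.servers = servers
        self._start = itertools.cycle(range(len(servers)))

    def call(self, request):
        start = next(self._start)  # load balancing: rotate the first choice
        for offset in range(len(self.servers)):
            server = self.servers[(start + offset) % len(self.servers)]
            try:
                return server(request)
            except ConnectionError:
                continue  # failover: move on to the next replica
        raise ConnectionError("all servers failed")


def make_server(name, up=True):
    def handle(request):
        if not up:
            raise ConnectionError(name)
        return f"{name}: {request}"
    return handle


service = ReplicatedService([
    make_server("Server 1"),
    make_server("Server 2", up=False),  # simulated outage
    make_server("Server 3"),
])
replies = [service.call("q") for _ in range(3)]
```

Failover like this only gives correct answers if the replicas hold the same data, which is where data replication comes in.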
28
Distributed Data Access
• What is the issue?
• Ability to access data stored in multiple, different
databases using a single request, e.g.
– Get geologic information from multiple geologic
databases
– Get employee information from all branches
• Ability to update data stored in multiple databases,
e.g.
– Transfer salary amount from University to my bank
account
– Transfer funds from Visa account to vendor’s account
29
Distributed data access
Client
Homogeneous:
mySQL
Heterogeneous:
mySQL
Database
mySQL 1
How about creating a
“cached” local copy?
mySQL
Sources may be
data repositories or
metadata catalogs
mySQL
Oracle
DB2
Database
Excel 2
ASCII
Database
flat file
3
30
Data Warehousing
Client
1. Load data from sources to warehouse via ETL
– Extract
– Transform
– Load
2. Query processing: interaction only between
client and warehouse
Data Warehouse
(common schema)
Data Source 1
Data Source 2
Data Source 3
But, warehouse data could
be “stale”, i.e. out of
synch with source data…
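A minimal ETL sketch in Python, using an in-memory SQLite database as the warehouse. The two sources, their field names, and the samples table are invented for illustration.

```python
import sqlite3

# Two hypothetical sources with different local schemas.
source_1 = [{"sample_id": "A-1", "rock_type": "Granite"}]
source_2 = [{"id": "B-7", "lithology": "basalt"}]

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE samples (sample_id TEXT, rock_type TEXT)")


def etl(rows, id_key, type_key):
    """Extract rows from a source, transform to the common schema, load."""
    transformed = [(row[id_key], row[type_key].lower()) for row in rows]
    warehouse.executemany("INSERT INTO samples VALUES (?, ?)", transformed)


# Step 1: load data from the sources into the warehouse.
etl(source_1, "sample_id", "rock_type")
etl(source_2, "id", "lithology")

# Step 2: the client queries only the warehouse.
rock_types = [row[0] for row in warehouse.execute(
    "SELECT DISTINCT rock_type FROM samples ORDER BY rock_type")]

# A later change at a source is not visible until ETL runs again.
source_2.append({"id": "B-8", "lithology": "gabbro"})
count = warehouse.execute("SELECT COUNT(*) FROM samples").fetchone()[0]
```

After the append, source_2 holds two records but the warehouse still holds only the two rows loaded earlier, which is the staleness problem noted on the slide.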
31
Data integration via middleware
1. Each client request
goes to sources, via
middleware
Database 1
Client
Data integration
Middleware
(aka Mediator)
2. Result collected by
middleware and
returned to client
Database 2
Database 3
32
Warehousing vs Mediation
• Warehousing: Use ETL to “massage” local data to
fit into a common global, warehouse schema
• Mediation: Modify user query to match schemas
exported by each source
– But, which schema does the user query?
– The Integrated View Schema
– Sources “export” a view (the export schema)
• Federated databases
– Local sources belong to different “administrative
domains”, i.e. different owners.
– Local autonomy
33
The Canonical Mediator / Wrapper
Architecture
Client Application
Wrapper processes could
execute at sources, at
mediator, or elsewhere
Q1
Cached
data
Export view
in mediator
data model
Local view
in local data
model
Mediator
(Integrated view in mediator data model, e.g. relational, XML)
Q11
Q12
Q13
Q14
Wrapper
Wrapper
Wrapper
Wrapper
Local schema
Local schema
Local schema
Local schema
Data
source 1
Data
source 2
Data
source 3
q14
Data
source 4
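The mediator / wrapper split can be sketched in Python as follows. This is a simplified illustration, not the GEON implementation: the Wrapper and Mediator classes are hypothetical, the sources are in-memory lists, and a query is reduced to a single column = value test.

```python
class Wrapper:
    """Exports a local source in the mediator's column names (export view)."""

    def __init__(self, rows, column_map):
        self.rows = rows
        self.column_map = column_map  # mediator column -> local column

    def query(self, column, value):
        # Translate the mediator column to the local schema, then filter.
        local_column = self.column_map[column]
        return [
            {mediator: row[local] for mediator, local in self.column_map.items()}
            for row in self.rows
            if row[local_column] == value
        ]


class Mediator:
    """Sends the client query to every wrapper and merges the results."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, column, value):
        results = []
        for wrapper in self.wrappers:
            results.extend(wrapper.query(column, value))
        return results


source_1 = Wrapper(
    [{"SampleID": "A-1", "Rock_Type": "granite"},
     {"SampleID": "A-2", "Rock_Type": "basalt"}],
    {"SampleID": "SampleID", "Rock_Type": "Rock_Type"},
)
source_2 = Wrapper(
    [{"sid": "B-7", "rtype": "granite"}],
    {"SampleID": "sid", "Rock_Type": "rtype"},
)
mediator = Mediator([source_1, source_2])
answer = mediator.query("Rock_Type", "granite")
```

The client sees only the integrated view (SampleID, Rock_Type); the differences between local schemas stay inside the wrappers.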
34
Example: A Relational Mediator
Client Application
Mediator
(Relational data model)
Wrapper
Wrapper
Relational DBMS
e.g. PostGIS
Shape file
35
Example: A Shape-file Based Mediator
Client Application
Mediator
(Shape file-based data model)
Wrapper
Wrapper
Relational DBMS
e.g. PostGIS
Shape file
36
Example: An XML Mediator
User / Applications
Mediator
(XML-based data model, e.g. GML)
Wrapper
Wrapper
Wrapper
Relational DBMS
e.g. PostGIS
Shape file
XML file
e.g. ArcXML
37
User Authentication and Access
Control
How about using
GAMA for
authentication?
Client Application
1. User authenticates to
system
2. User connects to mediator (passes credentials to mediator)
Mediator
3. Mediator connects to sources
a) Using original user credentials
b) Or, mapped credentials (role-based access)
4. Need to define
users or roles in
sources
Wrapper
Wrapper
Data
source 1
Data
source 2
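Options 3a and 3b can be contrasted with a short Python sketch. The user names, roles, and per-source account tables below are invented; the point is only that whichever identity the mediator presents must already be defined in the source (step 4).

```python
# Hypothetical mapping kept by the mediator: user -> role.
ROLE_OF_USER = {"alice": "reader", "bob": "curator"}

# Hypothetical identities (users or roles) defined in each source.
SOURCE_ACCOUNTS = {
    "Data source 1": {"reader", "curator"},
    "Data source 2": {"curator", "alice"},
}


def connect(source, user, mapped=True):
    """Connect to a source with mapped (3b) or original (3a) credentials."""
    identity = ROLE_OF_USER[user] if mapped else user
    if identity not in SOURCE_ACCOUNTS[source]:
        raise PermissionError(f"{identity!r} is not defined in {source}")
    return f"{source}: connected as {identity}"


as_role = connect("Data source 1", "alice")                # 3b: role-based
as_user = connect("Data source 2", "alice", mapped=False)  # 3a: original user
```

With mapped credentials, each source only needs to define a few roles; with original credentials, every user must be defined in every source.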
38
Different types of heterogeneity in
data integration
• Platform heterogeneity: different OS
platforms
• DBMS heterogeneity: different database
systems, e.g. SQLServer, mySQL, DB2
• Data type heterogeneity
• Schema heterogeneity
• Heterogeneity in units, accuracy, resolution
• Semantic heterogeneity
39
Schema Integration
• A long standing Computer Science problem
• Simple case
Source 1 Wrapper
Table (Sample ID varchar, Rock type varchar, Age int, …)
Source 2 Wrapper: convert between int and varchar for Age
Table (Sample ID varchar, Rock type varchar, Age varchar, …)
– Mediator View:
(SampleID varchar, Rock_Type varchar, Age int)
– In Source2 Table, map Age to int
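A sketch of what the Source 2 wrapper's conversion might look like in Python. The function names are hypothetical, and returning None for an Age that is not numeric is an assumption about how bad values should be handled.

```python
def wrap_source2_row(row):
    """Map a Source 2 row (Age stored as varchar) to the mediator view (Age int)."""
    sample_id, rock_type, age_text = row
    try:
        age = int(age_text.strip())
    except ValueError:
        age = None  # assumption: non-numeric ages become NULL in the view
    return (sample_id, rock_type, age)


def to_source2_literal(age):
    """Convert in the other direction, e.g. when pushing a predicate to Source 2."""
    return str(age)


converted = wrap_source2_row(("B-7", "granite", " 150 "))
```

The conversion is needed in both directions: results coming back are cast to int, and values in a mediator query are cast to varchar before being compared at the source.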
40
Another integration scenario
Source 1
Table (Sample ID varchar, Rock type varchar, Eon varchar, Era varchar, Period varchar)
e.g. Phanerozoic, Mesozoic, Jurassic
Source 2
Table (Sample ID varchar, Rock type varchar, Age varchar)
e.g. “Phanerozoic/mesozoic;jur”
– Mediator View:
(SampleID varchar, Rock_Type varchar, Eon varchar,
Era varchar, Period varchar)
– In Source 2 Table, parse Age to obtain sub-components
of the field
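A Python sketch of the parsing step, assuming the field always holds three parts separated by “/” or “;” and that abbreviations such as “jur” are unambiguous prefixes of a known period name. The PERIODS list is an illustrative subset, and the function names are hypothetical.

```python
import re

PERIODS = ["Triassic", "Jurassic", "Cretaceous"]  # illustrative subset only


def expand(token, vocabulary):
    """Expand an abbreviation to the single vocabulary term it prefixes."""
    matches = [name for name in vocabulary
               if name.lower().startswith(token.lower())]
    return matches[0] if len(matches) == 1 else token


def parse_age(age):
    """Split a Source 2 Age string into its three sub-components."""
    # Raises ValueError if the field does not have exactly three parts.
    eon, era, period = [part.strip() for part in re.split(r"[/;]", age)]
    return {
        "Eon": eon.capitalize(),
        "Era": era.capitalize(),
        "Period": expand(period, PERIODS),
    }


parsed = parse_age("Phanerozoic/mesozoic;jur")
```

Real data would likely need a fuller vocabulary and a rule for missing or extra parts; this sketch simply raises an error in those cases.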
41
A more advanced integration
scenario
Source 1
Table (Sample ID varchar, Rock type varchar, Eon varchar, Era varchar, Period varchar)
e.g. Phanerozoic, Mesozoic, Jurassic
Source 2
Table (Sample ID varchar, Rock type varchar, Age int)
e.g. 150
• Mediator View: (SampleID varchar, Rock_Type varchar, Eon
varchar, Era varchar, Period varchar)
– Same as Source1 table schema
• Query: Get rock types for all rocks from the Jurassic period
42
Doing the integration
• Query sent to mediator:
SELECT DISTINCT(Rock_Type) FROM Mediator_View
WHERE Period='Jurassic'
• Query to Source 1:
SELECT DISTINCT(Rock_Type) FROM Source1_Table
WHERE Period='Jurassic'
• For Source2, need to map Period='Jurassic' to Age values
Source 2 Table (Sample ID varchar, Rock type varchar, Age int)
Geologic_Time Table (Eon varchar, Era varchar, Period varchar, Min int, Max int)
43
Query “fragment” sent to Source 2
• SELECT DISTINCT (S2.Rock_Type)
FROM
Source2_Table S2,
Geologic_Time_Table GT
WHERE
GT.Period = 'Jurassic' AND
(S2.Age >= GT.Min) AND
(S2.Age <= GT.Max)
• Where is the Geologic_Time table stored?
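The query fragment can be tried out with an in-memory SQLite database in Python. Everything in the tables is made up for illustration: the sample rows are invented, ages are assumed to be in millions of years, and the period boundaries are rounded. Both tables live in the same database here, which sidesteps the question of where the Geologic_Time table is stored. Min and Max are quoted as a precaution because they are also SQL function names.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Source2_Table (Sample_ID TEXT, Rock_Type TEXT, Age INTEGER);
CREATE TABLE Geologic_Time_Table (
    Eon TEXT, Era TEXT, Period TEXT, "Min" INTEGER, "Max" INTEGER);
INSERT INTO Source2_Table VALUES
    ('s1', 'Granite', 150), ('s2', 'Basalt', 90), ('s3', 'Granite', 190);
INSERT INTO Geologic_Time_Table VALUES
    ('Phanerozoic', 'Mesozoic', 'Jurassic', 145, 201),
    ('Phanerozoic', 'Mesozoic', 'Cretaceous', 66, 145);
""")

# The Period predicate is turned into an Age range via the join.
rows = db.execute("""
    SELECT DISTINCT S2.Rock_Type
    FROM Source2_Table S2, Geologic_Time_Table GT
    WHERE GT.Period = 'Jurassic'
      AND S2.Age >= GT."Min"
      AND S2.Age <= GT."Max"
""").fetchall()
```

Only the samples aged 150 and 190 fall inside the Jurassic range, and both are granite, so the query returns a single rock type.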
44
Data Integration Carts™
• Integrating data sets without explicitly creating views
• An example request:
Plot all gravity data points that fall within the spatial
extent of rocks of a given type, in the Rocky Mountain
testbed region
– Use GEONsearch to find all gravity and geologic data using
bounding box for “Rocky Mountain testbed region”
• Need gazetteer / spatial ontology to determine Rocky Mountain region
• Need to know classification of datasets (as gravity and geology)
• Intersect extent of gravity and geologic datasets (from metadata) with
extent of Rocky Mountain region
– Plot gravity point data that fall within polygons of rocks of given
type
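The last step, selecting gravity points that fall within polygons of a given rock type, can be sketched in Python with a standard ray-casting point-in-polygon test. The coordinates, rock types, and gravity values below are invented, and a real system would more likely rely on a GIS or spatial database for this operation.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count polygon edges crossed by a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the ray's y coordinate
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical geologic polygons and gravity points (x, y, value).
geology = [
    {"rock_type": "granite", "polygon": [(0, 0), (4, 0), (4, 4), (0, 4)]},
    {"rock_type": "basalt", "polygon": [(10, 10), (14, 10), (14, 14), (10, 14)]},
]
gravity_points = [(1, 1, -120.5), (5, 1, -98.0), (11, 11, -101.2)]


def gravity_within(rock_type):
    """Return gravity points lying inside any polygon of the given rock type."""
    polygons = [g["polygon"] for g in geology if g["rock_type"] == rock_type]
    return [p for p in gravity_points
            if any(point_in_polygon(p[0], p[1], poly) for poly in polygons)]


selected = gravity_within("granite")
```

The selected points are what would be passed on to the plotting step.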
45
Ad hoc integration
Search Metadata
Catalog
GEONsearch
“Geologic and gravity
data in Rocky Mountains”
Plot map
Data Integration Cart™ Query
Map
46
Data Registration
Spatial Ontology
Location
Rock Classification
Ontology
Igneous
Point Polygon
Granite Quartzmonzonite
Latitude Longitude
Item Registration
(Schema registration)
Metadata
(X, Y)
Gravity
dataset
Item Detail
Registration
Lat, Long, RockType
Geologic
dataset
Metadata
47
48
Another complex query
• Query: Get rock types for all rocks from the
Mesozoic era
– Easy to do for Source 1: Era = “Mesozoic”
– For Source 2:
• Need to find numeric age range for Mesozoic
– Find age range across all subclasses of Mesozoic
(Cretaceous, Jurassic, Triassic)
• Select all Source 2 Table records whose age
falls within the Mesozoic age range
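A Python sketch of the two steps for Source 2. The geologic time table holds rounded, illustrative boundaries (assumed to be in millions of years), and the Source 2 rows are invented.

```python
# (Era, Period, Min, Max): rounded values for illustration only.
GEOLOGIC_TIME = [
    ("Mesozoic", "Triassic", 201, 252),
    ("Mesozoic", "Jurassic", 145, 201),
    ("Mesozoic", "Cretaceous", 66, 145),
    ("Cenozoic", "Paleogene", 23, 66),
]

# Hypothetical Source 2 rows: (Sample ID, Rock type, Age).
SOURCE_2 = [("s1", "Granite", 150), ("s2", "Basalt", 30), ("s3", "Shale", 240)]


def era_age_range(era):
    """Find the age range spanned by all subclasses (periods) of an era."""
    spans = [(low, high) for e, _, low, high in GEOLOGIC_TIME if e == era]
    if not spans:
        raise KeyError(era)
    return min(low for low, _ in spans), max(high for _, high in spans)


def rock_types_in_era(era):
    """Select Source 2 records whose age falls within the era's range."""
    low, high = era_age_range(era)
    return sorted({rock for _, rock, age in SOURCE_2 if low <= age <= high})


mesozoic_range = era_age_range("Mesozoic")
mesozoic_rocks = rock_types_in_era("Mesozoic")
```

The era-level query needs knowledge of the subclass hierarchy (which periods belong to the Mesozoic), which is exactly the kind of information an ontology supplies.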