Survey of Emerging IT Trends
and Technologies
Chaitan Baru
Monday, 10th Aug
OUTLINE
• Trends in data sharing
– And, Discovery/Search
• Trends in service-oriented architectures
• Trends in computing and data
infrastructure
• The road ahead
Geoinformatics Use Cases
• “…a user has access from a terminal to vast stores of
data of almost any kind, with the easy ability to
visualize, analyze and model those data.”
• “For a given region (i.e. lat/long extent, plus depth),
return a 3D structural model with accompanying
geophysical parameters and geologic information, at a
specified resolution”
Implied IT Requirements
• Search and discovery of resources
• Integration of heterogeneous 3D / 4D
Earth Science data
• Integration of data with tools
• Analysis and Visualization
– Ability to feed data to tools, and analyze &
visualize model outputs
• (data-centric view…)
Search and Discovery
• Searching “structured data”, i.e.
metadata catalogs
Search
Structured metadata
catalogs
Search and Discovery
• Searching “unstructured data”, i.e. the
Web
Search
The Web
• Structured databases are a major
component of the “Deep Web”
Combined Search and
Discovery
Search
Structured metadata
catalogs
The Web
Advanced Search
• Proposed:
– Geoscience Knowledge System, GeoKnowSys
– Built using the Yahoo Build your Own Search Service (BOSS)
• E.g. See wolframalpha.com
Advanced Search: PaleoLit
• Research project at Dept of CS, CMU
– Dr. Judith Gelernter and Prof. Jaime Carbonell
• Use ontologies to match search requests
to related publications
• Demo…
Informatics Issues:
The Informatics Progression
Diagram (progression): IT Cyber Infrastructure, Cyber Informatics, Core Informatics, Science Informatics (aka Xinformatics), Science and SBAs
Courtesy: Prof. Peter Fox, RPI, CSIG’08
The Computer Science /
Domain Science continuum
• Computer Science Topics
– e.g. Database Systems, Semistructured data
• IT Standards
– e.g. ODBC, XML
• Geoinformatics Standards
– e.g. Ontologies, GeoSciML definitions
• Domain Standards
– e.g. domain vocabularies (Geologic Time, rock description,…)
• Domain Science Topics
– e.g. geology
The data interoperability onion
• System Interop
– Approaches: e.g., ODBC, JDBC, Java, Web services, …
– Purview of: Computer Science
• Syntactic
– Approaches: Schema standards
– Purview of: Standards organizations, domain science repositories, data archives
• Semantic
– Approaches: Controlled vocabularies, thesauri, domain ontologies
– Purview of: Domain scientists
• Social Networks
– Approaches: recommendation systems
– Purview of: social networking software (CS and domain science, data driven)
Onion layers, as labeled in the diagram: Social Networks, Semantics, Syntax, Systems
Software interoperability onion
• System Interop
– Approaches: e.g., REST, Web services
• Syntactic
– Approaches: e.g., SOAP, WSDL
• Semantic
– Approaches: Controlled vocabularies, thesauri, domain ontologies
– Purview of: Domain scientists
• Social Networks
– Approaches: recommendation systems
– Purview of: social networking software
• Service orchestration via workflow systems
Onion layers, as labeled in the diagram: Social Networks, Semantics, Syntax, Systems
Geologic Map Integration
Data Mediation
• Dealing with heterogeneities in (distributed) data sources
– Data may be in different “administrative domains”
• Manage authentication
– Data schemas may be different among sources
– Terminologies may be different among sources
– Terminologies may be different among sources and user
– Software infrastructure (“stack”) may be different
• Solve the problem with “middleware”
– Layers of software between the original application and the end user
• Mediator
– Middleware that bridges across heterogeneities without requiring sources
to change
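The mediator idea can be illustrated with a minimal sketch. It assumes two hypothetical sources (`state_a`, `state_b`) whose records use different column names, and a hand-written mapping from each source's columns to a common set of fields. The source names, records, and common field names are all invented for illustration; a real mediator would sit in front of live databases or services.

```python
# Hypothetical per-source mappings: source column -> common (mediated) field.
FIELD_MAPS = {
    "state_a": {"FORMATION": "unit_name", "ROCK_TYPE": "lithology", "PERIOD": "age"},
    "state_b": {"UNIT_NAME": "unit_name", "LITH": "lithology", "SYSTEM": "age"},
}

# Toy records standing in for rows returned by each source's own store.
SOURCES = {
    "state_a": [
        {"FORMATION": "Alpha Fm", "ROCK_TYPE": "basalt", "PERIOD": "Jurassic"},
    ],
    "state_b": [
        {"UNIT_NAME": "Beta Fm", "LITH": "granite", "SYSTEM": "Cretaceous"},
        {"UNIT_NAME": "Gamma Fm", "LITH": "basalt", "SYSTEM": "Permian"},
    ],
}


def to_common(source, record):
    """Rename a source record's columns to the common field names."""
    mapping = FIELD_MAPS[source]
    return {mapping[col]: value for col, value in record.items() if col in mapping}


def mediated_query(**criteria):
    """Answer one query over all sources; the sources themselves are unchanged."""
    results = []
    for source, records in SOURCES.items():
        for record in records:
            row = to_common(source, record)
            if all(row.get(field) == value for field, value in criteria.items()):
                results.append({"source": source, **row})
    return results


if __name__ == "__main__":
    for hit in mediated_query(lithology="basalt"):
        print(hit)
```

The point of the design is that all source-specific knowledge lives in the mapping tables, so adding a source means adding a mapping rather than asking the source to change.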
A Data Integration Example:
Geologic Maps
Figure: state geologic maps (MT, WY, ID, NV, UT, AZ, NM, CO) held in different systems (DB2, SRB, GML, ESRI Shapefile, PostGIS, Oracle) on different platforms (Windows, Linux, iMac)
Heterogeneities
• Operating system
• File storage
• Database schemas
• Data Semantics
Adopting WMS/WFS: Can provide
Syntactic Integration
Advantages
• Integrated presentation
• Uniform syntactical structure
• Uniform spatial definition
Problems
• Each resource may use a different schema
• Difficult to build a uniform query interface for multiple resources.
Figure: each state map (MT, WY, ID, NV, UT, AZ, NM, CO) is served through WMS; attribute names differ across sources, e.g. FORMATION, UNIT_NAME, ROCK_TYPE, ERA, SYSTEM, SERIES, LITH, PERIOD
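As a rough illustration of the "uniform syntactical structure" that WMS provides, the sketch below builds the same GetMap request for two servers. The endpoints and layer names are hypothetical; the parameter names follow the WMS 1.1.1 key-value convention as I understand it, and should be checked against the capabilities document of any real server.

```python
from urllib.parse import urlencode


def getmap_url(endpoint, layer, bbox, width=512, height=512):
    """Build a WMS 1.1.1 GetMap URL; bbox is (min_lon, min_lat, max_lon, max_lat)."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",
        "SRS": "EPSG:4326",
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return f"{endpoint}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical endpoints and layer names, one per state survey.
    servers = {
        "NV": ("https://example.org/nv/wms", "geology"),
        "UT": ("https://example.org/ut/wms", "geologic_units"),
    }
    region = (-120.0, 35.0, -114.0, 42.0)
    for state, (endpoint, layer) in servers.items():
        print(state, getmap_url(endpoint, layer, region))
```

The request shape is identical for every server, which is what makes an integrated presentation straightforward. The layer's attribute schema is not part of this request, which is why the schema problem listed above remains.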
GeoSciML: Can Provide Schema Integration
Advantages
• Integrated schema
• Partial integrated semantics
Problem
• Each resource may use different vocabulary and semantic model.
Figure: each state map (MT, WY, ID, NV, UT, AZ, NM, CO) is exposed as GeoSciML; vocabularies shown include British Rock Classification and Multi-hierarchical Rock Classification
Semantic Mediation with
GeoSciML
• Mappings may also be needed between the data and the application ontology
– E.g., mapping 240 mya to Mesozoic
Figure: the NM and CO maps, with their vocabularies (British Rock Classification, Multi-hierarchical Rock Classification), are linked through GeoSciML and a Semantic Mapping to the Application Ontology
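The 240 mya to Mesozoic mapping can be sketched as a lookup against era boundaries. The boundary ages below are approximate values that I am assuming for illustration (published time scales revise them periodically), and the function names are hypothetical.

```python
# Approximate era boundaries in millions of years ago (Ma).
# Assumed, rounded values for illustration; consult a current time scale.
ERAS = [
    ("Cenozoic", 0.0, 66.0),
    ("Mesozoic", 66.0, 252.0),
    ("Paleozoic", 252.0, 541.0),
]


def era_for_age(age_ma):
    """Return the era name whose interval contains age_ma, or None."""
    for name, younger, older in ERAS:
        if younger <= age_ma < older:
            return name
    return None


def matches_era(record_age_ma, requested_era):
    """True if a numeric age in the data falls in the era named in a query."""
    return era_for_age(record_age_ma) == requested_era


if __name__ == "__main__":
    print(era_for_age(240))
    print(matches_era(300, "Mesozoic"))
```

With such a mapping, a source that stores numeric ages and a source that stores era names can both answer a query phrased in era names.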
Query Rewriting:
Example: A Rock Classification
Ontology
Figure: ontology branches shown include Genesis, Fabric, Composition, Texture
Query: Concept Expansion
Concept expansion:
• what else to look for when
user asks for ‘Mafic’
Composition
Query: Concept Generalization
Generalization:
• finding data that are ‘like’
X and Y
Composition
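Concept expansion and generalization can both be sketched as walks over a small taxonomy. The taxonomy below is a toy fragment invented for illustration, not the ontology shown on the slides: expansion collects a concept and its descendants, and generalization finds the nearest concept that two terms share.

```python
# Toy child -> parent taxonomy (illustrative only).
PARENT = {
    "Mafic": "Composition",
    "Felsic": "Composition",
    "Basalt": "Mafic",
    "Gabbro": "Mafic",
    "Granite": "Felsic",
    "Rhyolite": "Felsic",
}


def expand(concept):
    """Concept expansion: the concept plus everything classified under it."""
    terms = [concept]
    for child, parent in PARENT.items():
        if parent == concept:
            terms.extend(expand(child))
    return terms


def ancestors(concept):
    """The concept followed by its chain of parents up to the root."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def generalize(x, y):
    """Concept generalization: the nearest concept that covers both x and y."""
    covers_y = set(ancestors(y))
    for candidate in ancestors(x):
        if candidate in covers_y:
            return candidate
    return None


if __name__ == "__main__":
    print(expand("Mafic"))
    print(generalize("Basalt", "Granite"))
```

A query for 'Mafic' would then be rewritten to search for every term returned by `expand("Mafic")`, and a request for data "like" two rock types would search under their common generalization.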
Ontology-based Geologic Map
Integration: Implemented in GEON
Figure: Nevada geologic map queried with and without domain knowledge
• Show formations where AGE = ‘Paleozoic’ (without age ontology)
• Show formations where AGE = ‘Paleozoic’ +/- a few hundred million years (with age ontology)
ODAL, SOQL, and
Data Integration Carts™
• ODAL: Ontological Database Annotation Language
• Create a partial model of ontologies from database
GUI generates:
<odal:NamedIndividuals odal:id="RockSample"
odal:database="VTDatabase">
<odal:Class odal:resource="http://geon.vt.edu#RockSample" />
<odal:Table>Samples</odal:Table>
<odal:Table>RockTexture</odal:Table>
<odal:Table>RockGeoChemistry</odal:Table>
<odal:Table>ModalData</odal:Table>
<odal:Table>MineralChemistry</odal:Table>
<odal:Table>Images</odal:Table>
<odal:Column>ssID</odal:Column>
</odal:NamedIndividuals>
The values in the column ssID of the tables Samples, RockTexture, RockGeoChemistry, ModalData, MineralChemistry and Images represent instances of RockSample.
The annotation is then passed to the ODAL processor.
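To make the annotation's meaning concrete, here is a hedged sketch of what a consumer of such an annotation might do: read the listed tables and the column, and emit one SQL statement per table that enumerates the instances. This is not the actual ODAL processor. The namespace URI is a placeholder I made up, because the slide's fragment does not declare one, and the generated SQL is only one plausible interpretation.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI: the fragment on the slide does not declare one.
ODAL_NS = "http://example.org/odal#"

ANNOTATION = """
<odal:NamedIndividuals xmlns:odal="http://example.org/odal#"
    odal:id="RockSample" odal:database="VTDatabase">
  <odal:Class odal:resource="http://geon.vt.edu#RockSample" />
  <odal:Table>Samples</odal:Table>
  <odal:Table>RockTexture</odal:Table>
  <odal:Table>RockGeoChemistry</odal:Table>
  <odal:Table>ModalData</odal:Table>
  <odal:Table>MineralChemistry</odal:Table>
  <odal:Table>Images</odal:Table>
  <odal:Column>ssID</odal:Column>
</odal:NamedIndividuals>
"""


def qname(tag):
    """Qualified name in ElementTree's {namespace}tag notation."""
    return f"{{{ODAL_NS}}}{tag}"


def instance_queries(xml_text):
    """Return (ontology class, list of SQL statements enumerating its instances)."""
    root = ET.fromstring(xml_text.strip())
    ontology_class = root.find(qname("Class")).get(qname("resource"))
    column = root.find(qname("Column")).text
    tables = [t.text for t in root.findall(qname("Table"))]
    return ontology_class, [f"SELECT DISTINCT {column} FROM {t}" for t in tables]


if __name__ == "__main__":
    cls, queries = instance_queries(ANNOTATION)
    print(cls)
    for q in queries:
        print(q)
```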
SOQL: Simple Ontology Query
Language
Query single or many resources
• via ontologies (i.e., high level logical views)
• independent of physical representation (i.e. schemas)
Ontology fragment shown:
RockSample: location (Location), hasSiO2 (ValueWithUnit)
Location: lat (float), long (float)
ValueWithUnit: value (float), unit (string)
GUI generates:
SELECT X.location.*;
FROM RockSample X
WHERE X.location.lat > 60
AND X.location.long > 100
AND X.hasSiO2.value < 30
AND X.hasSiO2.unit = ‘weightPercentage’
The query is then passed to the SOQL processor.
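The intent of the query can be sketched over in-memory objects that mirror the ontology view (RockSample, Location, ValueWithUnit). This is not a SOQL implementation; it only shows that the filter is written against the logical view, with no reference to tables or columns. The sample values are invented, and the unit string is spelled `weightPercentage` here.

```python
from dataclasses import dataclass


@dataclass
class Location:
    lat: float
    long: float


@dataclass
class ValueWithUnit:
    value: float
    unit: str


@dataclass
class RockSample:
    location: Location
    hasSiO2: ValueWithUnit


def select_locations(samples):
    """Locations of samples that satisfy the WHERE clause of the example query."""
    return [
        s.location
        for s in samples
        if s.location.lat > 60
        and s.location.long > 100
        and s.hasSiO2.value < 30
        and s.hasSiO2.unit == "weightPercentage"
    ]


if __name__ == "__main__":
    # Invented sample data for illustration.
    samples = [
        RockSample(Location(65.0, 120.0), ValueWithUnit(25.0, "weightPercentage")),
        RockSample(Location(40.0, 120.0), ValueWithUnit(25.0, "weightPercentage")),
        RockSample(Location(65.0, 120.0), ValueWithUnit(55.0, "weightPercentage")),
    ]
    print(select_locations(samples))
```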
Issues in sharing data:
Primary vs secondary (derived)
Collect Data
Share intermediate
results
Process and Visualize
Share Results
Share data
Sources of Data
• Distributed data collections
– By individual PIs
– “Informal” sharing, e.g. via social network
– “Formal” sharing, e.g. via submission to community data archives /
databases
• Centralized data collections
– E.g. via a large project (standardized protocols)
– By agencies (internal protocols)
• Metadata to the rescue
– Data description standards
– Process description standards (workflows)
• State Surveys and USGS are major sources
Major Interoperability Efforts
• OneGeology.org
– International initiative of geological surveys to make dynamic geological map data available via the web.
• US Geoscience Information
Network (US GIN)
– Led by Lee Allison, AZGS
Federating Metadata Catalogs
• Local vs Community “View”
– Individual data providers may choose to “export” a
community view
• Direct access to the source may still provide richer access to data
• Federated Catalogs
– The Geosciences Information Network, GIN approach
– Adopt standards for catalog content (ISO) and
implementation (CSW)
Interoperation between GEON
and GEO GRID
Figure: GEON and GEO Grid exchange CSW REQUEST and RESPONSE messages; responses carry WMS URLs that point to each side's WMS Server. Components shown: GEON Catalog, Geogrid Catalog, ADN, SRB, Composite Service, Adapter, Catalog Service Web (CSW), Storage (600 scenes/day)
• Implement CSW interfaces
– Collaboration with the NSF PRAGMA project (Pacific Rim Assembly for Grid
Middleware Applications)
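A catalog search through a CSW interface can be sketched as a key-value GetRecords request. The endpoint below is hypothetical, and the parameter names reflect my reading of the CSW 2.0.2 key-value encoding; they should be verified against the capabilities of whichever catalog is actually queried.

```python
from urllib.parse import urlencode


def getrecords_url(endpoint, keyword, max_records=10):
    """Build a CSW 2.0.2 GetRecords URL that searches all text for a keyword."""
    params = {
        "service": "CSW",
        "version": "2.0.2",
        "request": "GetRecords",
        "typeNames": "csw:Record",
        "resultType": "results",
        "elementSetName": "summary",
        "constraintLanguage": "CQL_TEXT",
        "constraint_language_version": "1.1.0",
        "constraint": f"AnyText LIKE '%{keyword}%'",
        "maxRecords": max_records,
    }
    return f"{endpoint}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical catalog endpoint.
    print(getrecords_url("https://example.org/catalog/csw", "LiDAR"))
```

Because both catalogs would expose the same request interface, a client or adapter can send one query to each and merge the returned records, some of which carry WMS URLs for display.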
Integration & Visualization of 3D/4D
data
“For a given region (i.e. lat/long extent, plus depth), return a 3D
structural model with accompanying physical parameters of
density, seismic velocities, geochemistry, and geologic ages,
using a cell size of 10km”
– Derived 3D volumetric model
– Multiple isosurfaces with different transparencies
– Slices through the volume
– Variable gridding: data typically has lower resolution at greater depths
– 2D surface data: topography (“2.5D”), satellite imagery, street maps, geologic maps, fault lines, and other derived features
– Bore hole or well data and point observations
OpenEarth Framework Goals
Geoscience Integration:
• Data types - topography, imagery, bore hole samples,
velocity models from seismic tomography, gravity
measurements, simulation results…
• Data coordinate spaces and dimensionality - 2D and 3D
spatial representations and 4D that covers the range of
geologic processes (EQ cycle to deep time).
OpenEarth Framework Goals
Structural Integration:
• Data formats – shapefiles, NetCDF, GeoTIFF, and other formal and de facto standards.
• Data models – 2D and 3D geometry to semantically richer models of features and relationships between those features.
• Data delivery methods and storage schemes – local files to database queries, web services (WMS, WFS) and services for new data types (large tomographic volumes, etc.).
OEF Philosophy
• OEF focused on integrating data spanning the
geosciences.
• Open software architecture and corresponding
software that can properly access, manipulate and
visualize the integrated data.
• Open source to provide the necessary flexibility for
academic research and to provide a flexible test bed for
new data models and visualization ideas.
OEF Architecture
Data Integration Services:
– Designed to support rapid
visualization of integrated
datasets
– operations to grid data,
resample it at multiple
resolutions and subdivide data
to better support progressive
changes to the display as the
user pans and zooms
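Resampling at multiple resolutions can be sketched as a pyramid built by repeated block averaging. This is a generic technique chosen for illustration, not a description of how the OEF services are implemented; the function names are hypothetical and the grid is a plain list of lists.

```python
def downsample(grid, factor=2):
    """Average each factor x factor block of a 2D grid into one cell."""
    rows, cols = len(grid), len(grid[0])
    out = []
    # Trailing rows/columns that do not fill a whole block are dropped.
    for r in range(0, rows - rows % factor, factor):
        row = []
        for c in range(0, cols - cols % factor, factor):
            block = [grid[r + i][c + j] for i in range(factor) for j in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def pyramid(grid, levels):
    """Return [full resolution, half resolution, ...], up to `levels` coarser grids."""
    result = [grid]
    for _ in range(levels):
        current = result[-1]
        if len(current) < 2 or len(current[0]) < 2:
            break  # cannot halve any further
        result.append(downsample(current))
    return result


if __name__ == "__main__":
    elevation = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    for level in pyramid(elevation, 2):
        print(level)
```

A viewer can request a coarse level while the user is zoomed out and switch to finer levels as the user zooms in, which is one way to get the progressive display changes described above.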
OEF Architecture
Visualization Tools:
– Run on the user's computer,
dynamically query spatial and
temporal data from the OEF
services
– Use 3D graphics hardware for fast display
– Open architecture supports multiple visualization tools authored throughout the community (e.g., GEON IDV)
– New viz capabilities
developed as necessary
OEF Visualization
The software services stack
Example: GEON
Pushing down the service interface
Compute nodes
Disk Storage
Software as a Service:
At different levels of software
• Software as a Service: SaaS
– E.g., Google Apps, Salesforce.com, SAP, …
• Infrastructure as a Service, IaaS
– E.g., Amazon EC2, …
• Platform as a Service, PaaS
SaaS
PaaS
IaaS
Compute nodes
Disk Storage
The evolving computational
architecture
• Mainframe computers (institutional
computing)
• Minicomputers (departmental
computing)
• Workstations (laboratory computing)
• Laptops (personal computing)
• …back to the future..??
Cloud Computing: A meeting
of trends
• Price/performance of computing platforms
• Data Volumes
• Capabilities of networking and distributed systems
• Cost of Ownership
Cloud Computing Origins
• Cloud computing: Many definitions
– Here’s one: Use of remote data centers to manage scalable, reliable, on-demand access to applications
• Origins
– Goes back to the need by Web search engines to inexpensively process all
the pages on the Web
– Done by creating a grid of datacenters and processing data in parallel
across them
– Development of a parallel data programming environment by Google:
MapReduce
• Data + cloud computing
– what about remote centers for scalable, reliable, on-demand access to
data?
Cloud Computing
• A different pricing model
– No upfront cost of acquisition. Rent, don’t buy.
• Can access 1000’s of processors / disks
– Scalability
– “Elastic computing”
• A different model for dealing with
system failures
– Retry, loose consistency, …
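The "retry" part of this failure model can be sketched as a small wrapper that re-runs an operation with exponential backoff. This is a generic pattern, not any particular cloud provider's client; the function name, the choice of `ConnectionError` as the transient failure, and the delay values are assumptions made for the example.

```python
import time


def with_retry(operation, attempts=5, base_delay=0.01, sleep=time.sleep):
    """Call operation(); on ConnectionError, wait and try again.

    The wait doubles after each failure. The last failure is re-raised.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_read():
        # Simulated service that fails twice before succeeding.
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("transient failure")
        return "ok"

    print(with_retry(flaky_read), "after", calls["count"], "calls")
```

Retrying is only safe when the operation can be repeated without harm, which is one reason this model tends to go together with looser consistency guarantees.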
Cloud computing for data
• Data as a service: what is the abstraction for storage?
– Table, Blob, Queue
– …??
• Describing characteristics of the data
– Metadata about storage to specify policies to be applied
– Security, reliability, performance, etc
• Scaling to meet application needs
– Large configurations
– Dealing with virtualization
– New failure models
• Retry, loose consistency
Storage as a Service
• Amazon S3: An example
– Charges for Storage, Data Transfer, and Requests (e.g. PUT, COPY, POST, LIST,
GET)
• Issues
– Bandwidth to storage
– Quality of Service
– Storage Elasticity
– Privacy / security
• Standardization efforts
– Storage Networking Industry Association (SNIA) Technical Working Group (TWG) on Cloud Storage has just started
• Important Issues
– Metadata for storage
– Scaling up to large dataset sizes
The two sides of Cloud Computing
• Large distributed infrastructure
– “Everything is in the cloud”
– Interesting as a proposition for the IT operations of an enterprise
– Cloud companies would like to reach deep into enterprise IT
– “Our business is not the entrenched data centers in current large organizations, but the new companies…”
• Large-scale infrastructure in the Datacenter
– Seeding the cloud
– Shared-nothing parallelism
– Data on the cheap…a la Google
The NSF Cluster Exploratory
(CluE) Program
• Google-IBM-NSF Cluster
– Well over a thousand processors
• When fully built out, will comprise approximately 1,600 processors
– Terabytes of memory
– Hundreds of terabytes of storage
• Open source software
– Linux and Apache Hadoop
• IBM Tivoli
– System management, monitoring and dynamic resource
provisioning
• A platform for “apples-to-apples” comparisons
– Can reserve time on nodes for exclusive access
Our CluE Project
• Project (PI: Baru; co-PI: Krishnan)
– Performance Evaluation of On-Demand Provisioning Strategies for Data
Intensive Applications
• Investigate hybrid software model
– Database system / Hadoop system
– Some parts of the application require features provided by a DBMS
• Transactional capability, full SQL support
– Other parts of the application can exploit Hadoop model
• Very large data sets
• Data parallel processing
• Loose consistency models
• Price / performance is an issue
– Including energy costs
San Andreas Fault LiDAR
Dataset:
Data Access Patterns
• B4 Dataset
Experiments
• “On-demand” database vs Hadoop
• SQL vs Hadoop
• Energy consumption as a factor in price/performance
• Platforms to be used
– Google-IBM cluster
– OpenCirrus testbed
– Triton resource
The Road Ahead
• Advanced search engines
– Search structured and unstructured data
– Deal with display of heterogeneous results
– Show provenance of data
• Sophisticated tools for 3D and 4D data
integration
– Combination of “server-side” processing and caching
and client-side interaction and visualization
• Service-oriented architecture
– Applications and IT infrastructure available as services
– Perhaps some of them in “the Cloud”
Dealing with very large data
• Either the data can be partitioned into
segments and processed in parallel
– Shared-nothing parallelism
• Or not
– Shared memory systems
Parallel Processing of Large
Data
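Figure (Shared Memory): several processors (P) share one memory (M) and disk (D)
Figure (Shared Nothing): each node has its own processor (P), memory (M) and disk (D); nodes are connected by a Network, and a Partitioning Strategy distributes the Dataset across the nodes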
Data partitioning strategies
• Round-robin
– Equal distribution across nodes by data volume
• Hash
– all data with the same key value go to same node
• Range
– all data within a range of values go to the same node
Figure: a Partitioning Strategy splits the Dataset across shared-nothing nodes (P, M, D)
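The three strategies can be sketched as functions that assign records to a list of per-node buckets. This is a minimal illustration, assuming in-memory records and a caller-supplied key function; the function names are hypothetical, and `zlib.crc32` is used only as a convenient stable hash.

```python
import bisect
import zlib


def round_robin(records, n_nodes):
    """Deal records to nodes in turn, giving near-equal counts per node."""
    parts = [[] for _ in range(n_nodes)]
    for i, record in enumerate(records):
        parts[i % n_nodes].append(record)
    return parts


def hash_partition(records, n_nodes, key=lambda r: r):
    """Records with the same key value always land on the same node."""
    parts = [[] for _ in range(n_nodes)]
    for record in records:
        node = zlib.crc32(str(key(record)).encode("utf-8")) % n_nodes
        parts[node].append(record)
    return parts


def range_partition(records, boundaries, key=lambda r: r):
    """Sorted boundaries [b0, b1, ...] define ranges: <b0, [b0, b1), ..., >=last."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for record in records:
        parts[bisect.bisect_right(boundaries, key(record))].append(record)
    return parts


if __name__ == "__main__":
    print(round_robin(list(range(7)), 3))
    print(range_partition([5, 15, 25, 10], [10, 20]))
    print(hash_partition(["a", "b", "a", "c", "a"], 3))
```

Round-robin balances volume but scatters related records; hash and range keep related records together at the risk of uneven load when key values are skewed.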
MapReduce / Hadoop
• Programming environment for very large scale
data processing
• Managing task executions and data transfers in a
shared nothing environment
– MapReduce: Infrastructure to support data scatter / gather
– Distributed data repository (“file system”)
• Google File System (GFS)
• Hadoop Distributed File System (HDFS)
– Round-robin partitioning of data
• MapReduce
– Google’s proprietary implementation
• Hadoop
– Apache, open source implementation
MapReduce execution
• Hadoop vs database
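The MapReduce execution model can be sketched in a single process: a map step emits key/value pairs from each input split, a shuffle step groups the pairs by key, and a reduce step combines each group. This toy word count imitates only the data flow; it uses no Hadoop API and does nothing in parallel.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit (word, 1) for every word in one input split."""
    for word in split.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key."""
    return key, sum(values)


def run(splits):
    """Run map over every split, then shuffle, then reduce each group."""
    pairs = [pair for split in splits for pair in map_phase(split)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(run(["rock sample rock", "sample basalt"]))
```

In a real deployment each split would be mapped on the node that stores it, and the shuffle would move intermediate data across the network to the reducers.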
MapReduce vs Database
• Database
– Partition “base tables” into N partitions
– Intermediate data can be “re-partitioned”
– Intermediate data can be combined
– Well-defined algebra for data manipulation (SQL)
• MapReduce / Hadoop
– Partition input data file into M splits
– Intermediate data are re-hashed
– Intermediate data can be “combined”
– Java programs
• Cost of dynamic vs static partitioning
– Run time costs
– Storage costs
• Optimal partitioning
– Query and Workload dependent
– How to measure any deviations from the optimal?
– When to repartition?
USGS Role in Geoinformatics
Fundamental: Develop, maintain, make accessible:
• Long-term national and regional geologic, hydrologic, biologic, and geographic databases
• Earth and planetary imagery
• Open-source models of the complex natural systems and human interaction with those systems
• Physical collections of earth materials, biologic materials, reference standards, geophysical recordings, paper records
• National geologic, biologic, hydrologic, and geographic monitoring systems
• Standards of practice for the geologic, hydrologic, biologic, and geographic sciences
Source: Presentation by Dr. Linda Gundersen, USGS, at Geoinformatics
2007, San Diego, CA.
USGS Role in Geoinformatics
All activities (data creation, modeling, monitoring, collections, standards, etc.) must be done in cooperation and collaboration with the public and with governmental, academic, and private sector partners and stakeholders.
A critical USGS role:
facilitate bringing communities together!
Source: Presentation by Dr. Linda Gundersen, USGS, at Geoinformatics
2007, San Diego, CA.
Data Collections versus
Communities of Practice
Geoinformatics must evolve beyond the
accumulation of data, models, and standards to
become the framework for a community of practice
in the natural sciences.
Etienne Wenger and Jean Lave coined the term
and developed the learning theory of communities
of practice – that we learn not only as individuals
but as communities. By engaging in communities
of practice we increase our capacity and innovation
as well as leverage our support for areas of interest.
Source: Presentation by Dr. Linda Gundersen, USGS, at Geoinformatics
2007, San Diego, CA.
Creativity, Learning, and Innovation
A community of practice is not merely a community
with a common interest, but a group of practitioners who
share experiences and learn from each other. They
develop a shared repertoire of resources:
experiences, stories, tools, vocabularies, ways of
addressing recurring problems. This takes time
and sustained interaction. Standards of practice
and reference materials will grow out of this. But
the critical benefits include: creating and
sustaining knowledge, leveraging of resources, and
rapid learning and innovation.
Source: Presentation by Dr. Linda Gundersen, USGS, at Geoinformatics
2007, San Diego, CA.
1000’s of National and Regional Databases
• The National Map – topographic, elevation, orthoimagery, transportation, hydrography etc.
• Geospatial One Stop portal
• MRDATA – Mineral Resources and Related Data
• The National Geologic Map Database – standardized community collection of geologic mapping
• National Water Information System NWISWeb
• National Geochemical Survey Database (PLUTO, NURE)
• National Geophysical Database (aeromag, gravity, aerorad)
• Earthquake Catalogs
• North American Breeding Bird Survey
• National Vegetation/speciation maps
• National Oil and Gas Assessment
• National Coal Quality Inventory
Source: Presentation by Dr. Linda Gundersen, USGS, at Geoinformatics 2007, San Diego, CA.