Transcript file

SDSC Data and Knowledge
Systems Program
&
GEON: The Geosciences Network
Chaitan Baru
Director, DAKS Program
PI (SDSC), GEON
AMD Seminar, SDSC, April 1 2004
CYBERINFRASTRUCTURE FOR THE GEOSCIENCES
Outline
• SDSC and cyberinfrastructure
• GEON: Cyberinfrastructure for the Geosciences
SDSC Organizational Structure
www.sdsc.edu
~ 400 employees/students total
Office of the Director
• Fran Berman, Director
• Alan Blatecky, Exec Director
• Richard Moore, NPACI Exec Director
• Anke Kamrath, COO
Integrative Biological Sciences (IBS)
• Molecular biology
• Neuroscience
• Structural Genomics
• Cell Signaling
• Proteomics
Integrative Computational Sciences (ICS)
• Computational chemistry
• Applied math
• Ecoinformatics
• Environmental Science
• Computational Economics
• User Services
Data and Knowledge Systems (DAKS)
• Data integration
• Distributed data management
• Scientific databases
• Data mining
• Scientific data visualization
Grids and Clusters (G&C)
• Cluster management
• Portals
• Grid middleware
High-End Computing (HEC)
• Production systems
Networking and Security (N&S)
• Production networking and security
• Research on network monitoring
Communications and Outreach
Business Office
Education and Training
Cyberinfrastructure Vision
[Cyberinfrastructure] refers to infrastructure based upon
distributed computer, information, and communication
technology. If infrastructure is required for an industrial
economy, then we could say that cyberinfrastructure is
required for the knowledge economy.
Source: [NSF Blue Ribbon Panel]
Cyberinfrastructure for a knowledge economy requires a
new and innovative infrastructure for data management,
data exploration, analysis, and visualization, and
knowledge sharing.
Cyberinfrastructure
[Figure: components of cyberinfrastructure: high-end computation; instrumentation (large and/or many small); people and training; large / complex databases and libraries; software; high-speed network connectivity]
Courtesy: Dr. Peter Freeman, Assistant Director, CISE, NSF (NSF - pf - 8/02)
Data:
A Cyberinfrastructure “Killer App”
• Over the next decade, data will come from everywhere
• Scientific instruments
• Experiments
• Sensors and sensornets
• New devices (personal digital devices, computer-enabled clothing, cars, …)
• And be used by everyone
• Scientists
• Consumers
• Educators
• General public
• SW environment will need to support unprecedented diversity, globalization, integration, scale, and use
[Figure labels: data from sensors, data from instruments, data from simulations, data from analysis]
The SDSC DAKS Program
• Organized as a set of R&D Labs
1. Knowledge-based Integration (Bertram Ludaescher)
2. Advanced Query Processing (Amarnath Gupta)
3. Advanced Database Projects (David Archbell)
4. Data Mining (Tony Fountain)
5. Visualization (Michael Bailey)
6. Spatial Information Systems (Ilya Zaslavsky)
7. Geoinformatics (Dogan Seber)
8. Storage Resource Broker, SRB (Arcot Rajasekar)
9. Sustainable Archives and Digital Library Technology (Richard Marciano)
From Data to Information to Knowledge
[Figure: layered diagram of the SDSC Data and Knowledge Systems Program, from applications at the top to hardware at the bottom]
Layers:
• Applications: Geoinformatics, Biosciences, Ecoinformatics, …
• Visualization
• Data Mining, Simulation Modeling, Analysis, Data Fusion
• Knowledge-Based Integration, Advanced Query Processing
• Grid Storage, Filesystems, Database Systems
• High speed networking, Networked Storage (SAN), Storage hardware
• sensornets, instruments
Questions:
• How do we represent data, information and knowledge to the user?
• How do we detect trends and relationships in data?
• How do we obtain usable information from data?
• How do we collect, access and organize data?
• How do we configure computer architectures to optimally support data-oriented computing?
• How do we combine data, knowledge and information management with simulation and modeling?
SDSC and
Cyberinfrastructure Projects
• SDSC is involved in several NSF- and NIH-funded,
community-based CI projects
• BIRN – Biomedical Informatics Research Network,
funded by NIH. Integrating distributed brain image data
• GEON – Geosciences Network. Integrating distributed
Earth Sciences data
• SEEK – Scientific Environment for Ecological
Knowledge. Integrating distributed biodiversity data
along with tools
• TeraGrid – Providing access to high-end, national-scale,
physical computing infrastructure
• NEES – Network for Earthquake Engineering Simulation.
Integrating distributed earthquake simulation and sensor
data
TeraGrid—High-End
Cyberinfrastructure
GEON: The Geosciences Network
• NSF ITR Project, 2002-2007, $11.5M
PI Institutions
• Arizona State University
• Bryn Mawr College
• Penn State University
• Rice University
• San Diego State University
• San Diego Supercomputer Center/UCSD
• University of Arizona
• University of Idaho
• University of Missouri, Columbia
• University of Texas at El Paso
• University of Utah
• Virginia Tech
• UNAVCO
• Digital Library for Earth System
Education (DLESE)
Partners
• Chronos
• CUAHSI-HIS
• ESRI
• Geological Survey of Canada
• IBM
• Kansas Geological Survey
• Lawrence Livermore National
Laboratory
• U.S. Geological Survey (USGS)
• California Institute for
Telecommunications and Information
Technology (Cal-(IT)²)
• Georeference Online
Other Affiliates
• Southern California Earthquake
Center (SCEC), EarthScope, IRIS,
NASA GSFC
The GEON Project
• Close collaboration between geoscientists and IT to
interlink databases and Grid-enable applications
• “Deep” data modeling of 4D data
• Situating 4D data in context—spatial, temporal, topic, process
• Semantic integration of Geosciences data
• Logic-based formalisms to represent knowledge and map between
ontologies
• Grid computing
• Deploy a prototype GEON Grid: heterogeneous networks, compute
nodes, storage capabilities. Enable sharing of data, tools, expertise.
• Interaction environments
• Information visualization. Visualization of concept maps
• Remote data visualization via high-speed networks
• Augmented reality in the field
Science Challenges and GEON
Research
• Origin and 4-D Evolution of Continents
High Level
--Plate Tectonics
--Crustal Growth Through Time
--Terranes
--Terrane Recognition
--Integration of Distributed Databases
--Knowledge Representation of Domains
--Domain Ontology
--Databases
--Data Providers
Data Level
Krishna Sinha, VaTech
A Geoscientist’s Information
Integration Scenario
What are the distribution and U/Pb zircon ages of A-type plutons in VA?
How about their 3-D geometry using gravity data?
How do the plutons relate to the host rock structures?
Information Integration
• Digital geologic map of Virginia (plutons in Virginia)
• Geochemical database (chemical data)
• Geophysical database (gravity contours)
• Geochronologic database (Concordia)
• Structure database (foliation map)
Drilling into the
Concept Space
[Figure: concept space for Plate Tectonics]
Krishna Sinha, VaTech
Components of the GEONgrid
Architecture
• GEONgrid Physical Implementation
• Core Grid Services
• Registry, authentication, access control, monitoring, replication,
distributed filesystem, collection management (SRB), job submission,
e.g. launch job to TeraGrid
• “Higher-Order” Services
• Registration: data and metadata, schema, ontology, services
• Data Integration: spatial data integration, data systems integration,
schema integration
• 2D Visualization, including GIS
• Workflow
• 3D Viz, Augmented Reality
• Portal
• Portlet-based design. User space, GeonSearch/GeoWorkbench.
GEONgrid Physical Implementation
• PoP Nodes only
• VaTech, Bryn Mawr, Penn State, Rice, Utah EGI, Utah,
DLESE, UNAVCO
• PoP nodes + Data Nodes
• Idaho, Arizona State, SDSC
• PoP nodes + Compute Nodes
• Missouri, UTEP, SDSC
The GEON Grid
[Figure: map of GEON Grid nodes. Legend: PoP node, compute cluster, data cluster, 1TF cluster, partner services, partner projects. Partners shown: Geological Survey of Canada, Chronos, Livermore, KGS, USGS, ESRI, CUAHSI]
GEON Node Status
Industry Involvement
• ESRI
• PoP Node in Redlands
• Access to ArcWeb services and content
• Use of Arc software
• Technical session on GEON at the ESRI Users’ Conference, San Diego,
Aug. 9-11, 2004
• IBM
• Use of GMR (Grid Movement and Replication) software
• Free DB2 for academic use
• HP
• Donation of an Itanium cluster for GEON development and to power
GEON portal
GEON Services
• “Hosted” vs “non-hosted” services
• Hosted: service is implemented within the physical
GEONgrid environment (i.e. on one of the systems).
• The implementation can benefit from core capabilities
provided in GEONgrid, e.g. replication, load-balancing
• Need at least a PoP node to host a service
• Hosted databases will be stored at Data Nodes,
but may be replicated at one or more PoP nodes
• Data nodes
• Require Internet2 connectivity
• Will be backed up to SDSC
• Will be replicated among themselves
GEON Compute Nodes
• Compute nodes
• Want to create at least a few nodes as a TeraGrid
“sandbox”
• GEONgrid is currently based on Redhat Linux, OGSI and
Globus Toolkit Version 3 (GT3)
• TeraGrid is currently based on SuSE Linux, GT2.4
• Sandbox allows GEON PIs to develop and debug software in
GEONgrid prior to sending jobs to TeraGrid
• GEON has a TeraGrid allocation (30,000 hours)
• Need to keep in mind GEONgrid heterogeneity
• Windows and other platforms
Core Grid Services
• Registry:
• a place to register and find basic Web services, as well as all other
services (e.g. PGAP, Gravity Database, Seismic Simulation Tool, …)
• Authentication:
• using GEON Certificate Authority and Grid certificates
• Access control:
• investigating various systems for policy-based access to services
• Data replication:
• initial target is IBM GMR software for replicating files as well as
databases
• Support for various data systems:
• e.g., SDSC Storage Resource Broker (SRB) and OpenDAP
• Implement servers at Data Nodes
• Job submission, e.g. launch job to TeraGrid.
Higher-Order Grid Services
• Registration
• Data and metadata, schema, ontology, services
• Important in order to support search functionality
• Data Integration
• Defining “views” across multiple sources
• Multiple database schemas, e.g. in GEON PAST (Paleogeography and
AMOCO database), Chronos (Paleostrat, Neptune, Paleobiology),
Geochemistry (Navdat, PetDB, …)
• Multiple maps and map layers
• GIS and 2D Viz
• Integrating map layers. “Simple” mapping service.
• SVG-based data access and visualization tools
• Workflow
• Iconic representation of databases and tools
• Ability to link together tools and data to specify computations
• Based on Kepler system
GeonSearch
• Ad hoc search versus querying of preestablished “views”
• Ad hoc Search
• Search/discover information on data, services,
experiments, “other” (e.g., people, organizations)
• Display results via map interfaces, semantic graphs
• View-based querying
• E.g., use ad hoc search to find a set of databases, map
layers of interest; define a specific way of combining data
across these various sources
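The idea of a view, a named and reusable way of combining data across sources, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two in-memory "databases", their field names, and the sample values are invented here, and `define_view` is a hypothetical helper, not part of GeonSearch.

```python
# Two hypothetical sources with different schemas (all values invented).
geochem_db = [
    {"sample": "VA-01", "lat": 37.2, "lon": -80.4, "SiO2_wt_pct": 72.1},
    {"sample": "VA-02", "lat": 38.0, "lon": -78.5, "SiO2_wt_pct": 65.3},
]
geochron_db = [
    {"id": "VA-01", "age_ma": 735.0, "method": "U/Pb zircon"},
    {"id": "VA-03", "age_ma": 1050.0, "method": "U/Pb zircon"},
]


def define_view(left, right, left_key, right_key):
    """Return a function that evaluates the view (an inner join) on demand."""

    def evaluate():
        # Index the right-hand source by its key for constant-time lookup.
        index = {row[right_key]: row for row in right}
        combined = []
        for row in left:
            match = index.get(row[left_key])
            if match is not None:
                merged = dict(row)
                # Copy every field of the matching row except the join key.
                merged.update(
                    {k: v for k, v in match.items() if k != right_key}
                )
                combined.append(merged)
        return combined

    return evaluate


# The view is defined once and can be re-evaluated as the sources change.
dated_samples = define_view(geochem_db, geochron_db, "sample", "id")
rows = dated_samples()
```

Because the view is a function rather than a stored result, re-running `dated_samples()` picks up new records in either source. A real system would also need schema mapping and spatial joins, which this sketch leaves out.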
Knowledge Representation in GEON
• Controlled vocabularies
• Database schema (relational, XML, …)
• Conceptual schema (ER, UML, …)
• Thesauri (synonyms, broader term/narrower term)
• Taxonomies
• Informal/semi-formal representations
• “Concept spaces”, “concept maps”
• Labeled graphs / semantic networks (RDF)
• Formal ontologies, e.g., in [Description] Logic (OWL)
• “formalization of a specification”
→ constrains possible interpretation of terms
• What is an ontology? An ontology usually …
• specifies a theory (a set of models) by …
• defining and relating …
• concepts representing features of a domain of interest
GEON Ontology Development
Workshops
• Workshop format
• Led by GEON PI’s
• Involves small group of domain experts from community
• Participation by a few IT experts in data modeling and knowledge
representation
• Igneous Petrology, led by Prof. Krishna Sinha, VaTech, 2003
• Seismology, led by Prof. Randy Keller, UT El Paso, Feb 24-25, 2004
• Aqueous Geochemistry, led by Dr. William Glassley, Livermore Labs,
March 2-3, 2004
• Structural Geology, led by Prof. John Oldow, Univ. of Idaho, 2004
• Metamorphic Petrology, led by Prof. Maria Crawford, Bryn Mawr,
under planning
A Multi-Hierarchical Rock Classification
“Ontology” (GSC)
Genesis
Fabric
Composition
Texture
Kai Lin, SDSC
Boyan Brodaric, GSC
Geologic Map Integration in the Portal
• After registering datasets, ontologies (here: “classes”), and an application
(“OMI”), the datasets can be searched and displayed in an integrated way.
Kai Lin, SDSC
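One way to picture the integration step described above: each dataset registers its local map-unit labels against classes of a shared rock classification, and a search for a class then returns matching units from every registered dataset. The Python sketch below is a toy illustration only. The class hierarchy, dataset names, and labels are invented, and it does not reflect how the portal or the "OMI" application is actually implemented.

```python
# Hypothetical shared classification: class -> parent class.
ROCK_CLASSES = {
    "granite": "plutonic",
    "gabbro": "plutonic",
    "basalt": "volcanic",
    "plutonic": "igneous",
    "volcanic": "igneous",
    "igneous": "rock",
}

# Hypothetical registrations: dataset -> {local map-unit label: shared class}.
REGISTRATIONS = {
    "state_map_A": {"Gr": "granite", "Bas": "basalt"},
    "state_map_B": {"gabbroic intrusion": "gabbro", "lava flow": "basalt"},
}


def is_a(cls, ancestor):
    """True if cls equals ancestor or lies below it in the hierarchy."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = ROCK_CLASSES.get(cls)  # climb one level; None at the root
    return False


def find_units(ancestor):
    """Return (dataset, local label) pairs whose class falls under ancestor."""
    hits = []
    for dataset, mapping in REGISTRATIONS.items():
        for label, cls in mapping.items():
            if is_a(cls, ancestor):
                hits.append((dataset, label))
    return sorted(hits)


plutonic_units = find_units("plutonic")
```

Here a query for "plutonic" finds units in both maps even though neither map uses that word in its own legend, which is the point of registering against shared classes.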
Use of Knowledge Structures
• Conceptual models of a domain or application,
(communication means, system design, …)
• Classification of …
• concepts (taxonomy) and
• data/object instances through classes
• Analysis of ontologies e.g.
• Graph queries (reachability, path queries, …)
• Reasoning (concept subsumption, consistency checking, …)
• Targets for semantic data registration
• Conceptual indexes and views for
• searching,
• browsing,
• querying, and
• integration of registered data
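The graph queries and reasoning tasks listed above can be illustrated on a small labeled graph. The Python sketch below is a simplified stand-in: the concepts and edge labels are invented, subsumption is reduced to reachability over "is_a" edges, and the consistency check only looks for cycles in the taxonomy. A description logic reasoner does considerably more than this.

```python
from collections import deque

# Hypothetical concept graph as (source, edge label, target) triples.
EDGES = [
    ("pluton", "is_a", "igneous_body"),
    ("igneous_body", "is_a", "rock_body"),
    ("pluton", "has_age", "radiometric_age"),
    ("radiometric_age", "measured_by", "U_Pb_zircon_dating"),
]


def neighbors(node, labels=None):
    """Targets of edges leaving node, optionally restricted to some labels."""
    return [
        dst
        for src, label, dst in EDGES
        if src == node and (labels is None or label in labels)
    ]


def find_path(start, goal, labels=None):
    """Breadth-first path query; returns a shortest path or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbors(path[-1], labels):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def subsumes(general, specific):
    """Simplified subsumption: specific reaches general via is_a edges."""
    return find_path(specific, general, labels={"is_a"}) is not None


def has_is_a_cycle():
    """Simplified consistency check: a taxonomy should contain no cycles."""
    sources = {src for src, label, _ in EDGES if label == "is_a"}
    for node in sources:
        for nxt in neighbors(node, {"is_a"}):
            if find_path(nxt, node, {"is_a"}) is not None:
                return True
    return False


path = find_path("pluton", "U_Pb_zircon_dating")
```

The same `find_path` helper answers both kinds of question: with no label filter it is a general path query, and restricted to "is_a" edges it becomes a taxonomy lookup.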
Creating and Sharing Concept Maps
(here: Seismology concept map & Cmap tool)
• Bring scientists together for 2+ days
• Add CS/KBMS “types”
• Create concept maps
• Refine
• Iterate
→ from napkin drawings, to concept maps, to ontologies
Randy Keller (UTEP),
Bertram Ludaescher, Kai Lin,
Dogan Seber (SDSC), et al
Community-Based Ontology
Development
• Draft of an aqueous geochemistry
ontology developed by scientists
Bill Glassley (LLNL),
Bertram Ludaescher, Kai Lin (SDSC),
et al
GeoWorkbench
• Data and service registration
• Create spatial, temporal, concept-based indexes as part of
registration process
• Ability to define views
• e.g. using GeonSearch to find data, services, etc.
• Run analysis routines
• e.g. via workflow specifications, using Kepler
• Personal space to “save” and “bookmark” work
• Visualize output, save output, feed output to other services
3D Earthquake Modeling using HPC
Use of LIDAR data for geo-morphology
Ramon Arrowsmith, Chris Crosby
Arizona State University
• Manipulation, analysis and use of LIDAR (LIght
Detection And Ranging) data
LIght Detection And Ranging
• Airborne scanning laser rangefinder
• Differential GPS
• Inertial Navigation System
• 30,000 points per second at ~15 cm accuracy
• $400–$1000/mi², 10⁶ points/mi², or 0.04–0.1 cents/point
• Extensive filtering to remove tree canopy (virtual deforestation)
Figure from R. Haugerud, U.S.G.S - http://duff.geology.washington.edu/data/raster/lidar/About_LIDAR.html
Ramon Arrowsmith,
Chris Crosby, ASU
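The per-point cost quoted on this slide follows from the per-area cost and the point density. The short Python check below assumes the density is read as 10^6 points per square mile, which is the reading that makes the quoted figures consistent. The function and constant names are illustrative.

```python
POINTS_PER_SQ_MILE = 10**6  # assumed reading of the slide's point density


def cents_per_point(dollars_per_sq_mile):
    """Convert a survey cost in $/mi^2 to cents per LIDAR point."""
    return dollars_per_sq_mile * 100 / POINTS_PER_SQ_MILE


low = cents_per_point(400)    # lower end of the quoted cost range
high = cents_per_point(1000)  # upper end of the quoted cost range
print(f"{low:.2f} to {high:.2f} cents per point")
```

At $400 to $1000 per square mile this gives 0.04 to 0.1 cents per point, matching the range on the slide.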
Northern San Andreas LIDAR: fault geomorphology
[Figure panels: Full Feature DEM; Bare Earth DEM]
Ramon Arrowsmith,
Chris Crosby, ASU
Processing LiDAR data: the problems
• Huge datasets:
• 1 GB of point return (.txt)
data
• 150 MB of point return (.txt)
data
• 5.5 MB after filtering for
ground returns
Fort Ross, CA 7.5 min quad
• How do we grid these data?
• ArcGIS can’t handle it
• Expensive commercial
software not an option for
most data consumers
Ramon Arrowsmith,
Chris Crosby, ASU
GRASS as a processing tool for LiDAR
• GRASS: Open source GIS
• Interpolation commands designed for large data sets
• Splines use local point density to segment data into rectangular
areas for interpolation
• Can control spline tension and smoothness
• Modular configuration could easily be implemented within
the GEON workflow
• E.g.: User uploads point data to a remote site, where the GRASS
interpolation module runs on a supercomputer and returns
a raster file to the user.
• Host the large LIDAR data sets on GEON Data
Node at SDSC, with access to large cluster
computers
Ramon Arrowsmith,
Chris Crosby, ASU
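The idea of segmenting data into rectangular areas according to local point density can be sketched with a simple quadtree-style subdivision: a rectangle is split into quadrants until each piece holds at most a fixed number of points, so dense areas end up with small rectangles and sparse areas with large ones. This Python sketch illustrates the general technique only; it is not GRASS's implementation, and `segment`, `max_points`, and the sample points are assumptions made for the example.

```python
def segment(points, box, max_points, max_depth=20):
    """Recursively split box into quadrants until each holds <= max_points.

    box is (xmin, ymin, xmax, ymax). Returns a list of (box, points) pairs.
    Empty quadrants are dropped because there is nothing to interpolate.
    """
    if len(points) <= max_points or max_depth == 0:
        return [(box, points)]
    xmin, ymin, xmax, ymax = box
    xmid = (xmin + xmax) / 2
    ymid = (ymin + ymax) / 2
    # Assign each point to a quadrant keyed by (is_right, is_top).
    quadrants = {}
    for x, y in points:
        quadrants.setdefault((x >= xmid, y >= ymid), []).append((x, y))
    result = []
    for (right, top), pts in quadrants.items():
        sub_box = (
            xmid if right else xmin,
            ymid if top else ymin,
            xmax if right else xmid,
            ymax if top else ymid,
        )
        result.extend(segment(pts, sub_box, max_points, max_depth - 1))
    return result


# A dense 4 x 4 cluster in one corner plus three isolated points.
cluster = [(0.25 * i + 0.1, 0.25 * j + 0.1) for i in range(4) for j in range(4)]
sparse = [(6.0, 6.0), (7.0, 2.0), (2.0, 7.0)]
segments = segment(cluster + sparse, (0.0, 0.0, 8.0, 8.0), max_points=4)
widths = sorted({box[2] - box[0] for box, _ in segments})
```

Each resulting rectangle could then be interpolated independently, which is what keeps the memory footprint bounded for very large point sets. The `max_depth` guard stops the recursion if many points share the same coordinates.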
Contact Information
[email protected]