The Problem Solving Environments
of TeraGrid, Science Gateways, and
the Intersection of the Two
Jim Basney 1, Stuart Martin 2, JP Navarro 2, Marlon Pierce 3, Tom Scavo 1, Leif Strand 4, Tom Uram 2,5, Nancy Wilkins-Diehr 6, Wenjun Wu 2, Choonhan Youn 6
1 National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
2 Argonne National Laboratory
3 Indiana University
4 California Institute of Technology
5 University of Chicago
6 San Diego Supercomputer Center, University of California at San Diego
TeraGrid, what is it?
A unique combination of fundamental CI components
Navajo Technical College, September 25, 2008
Gateways, what are they?
Problem Solving Environments for Science
 Portal or client-server interfaces to high-end resources
 Web developments and the explosion of digital data have increased the importance of the internet and the web for science
 Only 16 years since web browsers became available
 Developments in web technology
• From static HTML to CGI forms to the wikis and social web pages of today
 Full impact on science yet to be felt
 Web usage model resonates with scientists
 But persistence is needed if the Web is to have a profound impact on science (this is key for all PSEs)
 TeraGrid provides common infrastructure for gateway developers
TeraGrid’s Infrastructure for Gateways
 Problem
 Local compute resources are typically not enough for Gateways
 Goal
 Make it easy to use any TeraGrid site from a Gateway
 Approach
 Provide a set of client APIs and command line tools for use in
Gateways/portals
 Maintain and deploy a set of common services on each site
 Maintain and deploy some central services
Infrastructure Capabilities
 Information Discovery
 Find deployed services
 Get details about the compute resources
 Data Management
 Move data to and from compute resources
 Execution Management
 Submit and monitor remote computational jobs
 Security
 Make sure secure access is in place for all services and tools
Security
 Based on Grid Security Infrastructure (GSI)
 Uses X.509 PKI
 End entity certificates (e.g. issued to a person or host)
 User proxy certificates (valid for a limited period of time)
 Enables single sign-on to all TG resources
 Enables delegation
 Users/clients can disconnect and let services perform actions
securely on their behalf
 Integrated in grid middleware services
 User Portal, MyProxy, GSISSH, GridFTP, GRAM, MDS, RFT, etc
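The idea of short-lived proxy credentials and delegation can be sketched in a few lines of Python. This is only a toy model, not GSI itself: the `Credential` class, the `delegate` and `is_valid` functions, the `/CN=proxy` subject suffix, and the rule that a proxy's lifetime is clamped to its parent's are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Credential:
    """Toy stand-in for an X.509 credential (hypothetical, not a GSI type)."""
    subject: str
    not_after: datetime
    issuer: Optional["Credential"] = None  # None for an end entity credential


def delegate(parent: Credential, lifetime: timedelta, now: datetime) -> Credential:
    """Issue a short-lived proxy that, in this model, never outlives its parent."""
    expiry = min(now + lifetime, parent.not_after)
    return Credential(parent.subject + "/CN=proxy", expiry, issuer=parent)


def is_valid(cred: Credential, now: datetime) -> bool:
    """A credential is usable only while every link in its chain is unexpired."""
    link: Optional[Credential] = cred
    while link is not None:
        if now >= link.not_after:
            return False
        link = link.issuer
    return True


if __name__ == "__main__":
    now = datetime(2008, 9, 25, 12, 0)
    user = Credential("/O=Example/CN=Alice", now + timedelta(days=365))
    proxy = delegate(user, timedelta(hours=12), now)      # single sign-on proxy
    service = delegate(proxy, timedelta(hours=24), now)   # delegated to a service
    print(service.subject, service.not_after)
    print(is_valid(service, now + timedelta(hours=13)))
```

The point of the sketch is that a service holding a delegated proxy can keep acting after the user disconnects, but only until the shortest lifetime in the chain runs out.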
GSI in Action
[Figure: GSI in action. GT4 client: end entity credential and key, grid-proxy-init, proxy credential and key, Globus WS client. The X.509 proxy certificate is presented to the GT4 server: Java WS container, Globus Web Service, gridmap.]
Single Sign-On
Gateway Workflow with GSISSH
Client does:
• myproxy-logon (once)
• Move files with gsiscp
• Submit job with gsissh and local resource manager (LRM) commands
[Figure: the gateway sends jobs over GSISSH to the GSISSH service on Resource A (scheduler, e.g., PBS) and on Resource B (scheduler, e.g., LSF); each scheduler also accepts local jobs and feeds its compute nodes.]
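The three client steps listed on this slide can be sketched as command lines assembled in Python. This is a minimal sketch under stated assumptions: the host names, user name, and file names are hypothetical, the target scheduler is assumed to be PBS (so the LRM command is `qsub`), and the commands are only printed, not executed.

```python
import shlex
from typing import List


def gateway_commands(myproxy_server: str, username: str, resource_host: str,
                     input_file: str, job_script: str) -> List[List[str]]:
    """Build the command sequence: log on once, copy files, submit the job."""
    return [
        # 1. Obtain a proxy credential from a MyProxy server (done once).
        ["myproxy-logon", "-s", myproxy_server, "-l", username],
        # 2. Move files to the resource; gsiscp uses scp-style arguments.
        ["gsiscp", input_file, f"{resource_host}:"],
        ["gsiscp", job_script, f"{resource_host}:"],
        # 3. Run the local resource manager's submit command over gsissh.
        #    qsub assumes PBS; another scheduler needs its own command.
        ["gsissh", resource_host, "qsub", job_script],
    ]


if __name__ == "__main__":
    for cmd in gateway_commands("myproxy.example.org", "alice",
                                "login.example.org", "input.dat", "job.pbs"):
        print(shlex.join(cmd))
```

Because the submit command differs per scheduler, a gateway using this approach has to know which LRM each resource runs.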
Remote Execution Management
 Grid Resource Allocation and Management (GRAM)
 Provides an abstraction layer on top of various local resource managers (PBS, Condor, LSF, SGE, …)
 Defines a common job description language
 Client API and command line tools to asynchronously access remote LRMs
 Fault tolerant
 GSI Security
 “job” Workflow
 File staging before and after job execution
 Lastly, file cleanup
 File staging requires delegation
Traditional LRM Interaction
 Satisfies many users and use cases
 TACC’s Ranger (62976 cores!) is the Costco of HTC ;-), one
stop shopping, why do we need more?
[Figure: local jobs go to the scheduler (e.g., PBS), which feeds the compute nodes of Resource A.]
GRAM Benefit
 Adds remote execution capability
 Enables clients/devices to manage jobs from off of the cluster (Gateways!)
[Figure: remote GRAM4 jobs arrive through the gramJob API at the GRAM4 service on Resource A; the scheduler (e.g., PBS) also accepts local jobs and feeds the compute nodes.]
GRAM Benefit
 Provides scheduler abstraction
[Figure: GRAM4 jobs submitted through the gramJob API reach GRAM4 services on Resource A (scheduler, e.g., PBS) and Resource B (scheduler, e.g., LSF); each scheduler also accepts local jobs and feeds its compute nodes.]
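Scheduler abstraction can be illustrated with a small Python sketch: one scheduler-neutral job description is translated into the submit command of whichever LRM a resource runs. This is not how GRAM4 is implemented; `JobSpec`, `to_pbs`, `to_lsf`, and `submit_command` are hypothetical names, and only the `qsub` and `bsub` options for job name and wall-clock limit are used.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class JobSpec:
    """Scheduler-neutral job description (illustrative, not the GRAM schema)."""
    name: str
    walltime_minutes: int
    script: str


def to_pbs(job: JobSpec) -> List[str]:
    # PBS expects the wall-clock limit as HH:MM:SS.
    hours, minutes = divmod(job.walltime_minutes, 60)
    return ["qsub", "-N", job.name,
            "-l", f"walltime={hours:02d}:{minutes:02d}:00", job.script]


def to_lsf(job: JobSpec) -> List[str]:
    # LSF accepts the run limit in minutes; the script is passed as the command.
    return ["bsub", "-J", job.name, "-W", str(job.walltime_minutes), job.script]


TRANSLATORS: Dict[str, Callable[[JobSpec], List[str]]] = {
    "pbs": to_pbs,
    "lsf": to_lsf,
}


def submit_command(scheduler: str, job: JobSpec) -> List[str]:
    """Pick the translator for a resource's scheduler."""
    try:
        translate = TRANSLATORS[scheduler.lower()]
    except KeyError:
        raise ValueError(f"no translator for scheduler {scheduler!r}") from None
    return translate(job)


if __name__ == "__main__":
    job = JobSpec(name="demo", walltime_minutes=90, script="run.sh")
    print(submit_command("PBS", job))
    print(submit_command("LSF", job))
```

The client writes the job once; supporting a new scheduler means adding one translator rather than changing every gateway.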
Gateway Perspective
 Scalable job management
 Interoperability
[Figure: GRAM4 jobs fan out through the gramJob API to many resources, each with its own GRAM4 service, scheduler, and compute nodes.]
Data Management - GridFTP
 GridFTP
 High-performance, secure, reliable data transfer protocol optimized for high-bandwidth wide-area networks
 GSI Security
 Third-party transfers
 Parallel transfers
 Striping
 Lots of small files (LOSF)
 Can outperform other file transfer methods like scp
 Limited in that it does not queue and throttle requests
 Needs a reliable higher-level service, hence RFT
Data Management - RFT
 Reliable File Transfer
 Adds reliability on top of GridFTP
 GSI Security
 Throttles requests
 Retries non-fatal transfer errors
 Resumes transfers from the last known position
 Requires delegation in order to contact GridFTP servers on the user’s behalf
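The retry and resume behavior described for RFT can be sketched as a small loop in Python. This is an illustration of the idea only, not RFT's implementation: `TransientError`, `reliable_transfer`, and `make_flaky_sender` are hypothetical names, the transfer is simulated, and request throttling is left out.

```python
from typing import Callable, Set, Tuple


class TransientError(Exception):
    """A non-fatal failure, for example a dropped connection."""


def reliable_transfer(send_from: Callable[[int], int], total_bytes: int,
                      max_retries: int = 3) -> Tuple[int, int]:
    """Drive a transfer to completion, resuming from the last known position.

    send_from(offset) sends data starting at offset and returns the new offset;
    it raises TransientError on a non-fatal failure.
    """
    offset = 0      # last known good position
    failures = 0
    while offset < total_bytes:
        try:
            offset = send_from(offset)
        except TransientError:
            failures += 1
            if failures > max_retries:
                raise  # give up after too many non-fatal errors
            # otherwise loop again from the same offset, not from zero
    return offset, failures


def make_flaky_sender(chunk: int, total: int,
                      fail_on_calls: Set[int]) -> Callable[[int], int]:
    """Simulated sender that fails on the given (1-based) call numbers."""
    calls = {"n": 0}

    def send_from(offset: int) -> int:
        calls["n"] += 1
        if calls["n"] in fail_on_calls:
            raise TransientError(f"connection lost at byte {offset}")
        return min(offset + chunk, total)

    return send_from


if __name__ == "__main__":
    sender = make_flaky_sender(chunk=40, total=100, fail_on_calls={2})
    print(reliable_transfer(sender, total_bytes=100))
```

A real service would also persist the offset so that a transfer survives a restart of the service itself.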
Science Gateway with Community Credential
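[Figure: a web browser authenticates (web authn) to the science gateway's web interface. The gateway webapp uses a WS GRAM client with a proxy credential and key derived from the community credential and key, and sends the proxy certificate to the WS GRAM service in the resource provider's Java WS container, where the job runs under the community account.]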
GridShib-enabled GSI
[Figure: GridShib-enabled GSI. GT4 client: end entity credential and key, GridShib SAML Tools, proxy credential and key carrying SAML, Globus WS client. The proxy certificate with SAML is presented to the GT4 server: Java WS container (with GridShib for GT), GridShib SAML PIP, security context, Globus Web Service, logs, and policy.]
GridShib-enabled Science Gateway
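[Figure: GridShib-enabled science gateway. A web browser authenticates (web authn) to the gateway's web interface, which passes the username and attributes to the webapp. The GridShib SAML Tools combine them with the community credential and key into a proxy credential carrying SAML, used by the WS GRAM client. The resource provider's Java WS container (with GridShib for GT) receives the proxy certificate with SAML; the GridShib SAML PIP builds the security context for the WS GRAM service, with logs and policy.]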
Information Management
 TeraGrid’s Integrated Information Services are a network of web services responsible for aggregating the availability of TeraGrid capability kits, software, and services across all the infrastructure providers
 Where are the job submission, file-transfer, and login services needed by Gateways?
 What is the queue status and estimated delay for each resource?
 What are the available testbeds (non-production / experimental software)?
 What are the Gateways (problem solving environments) available to users?
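A gateway might use answers to the first two questions to decide where to send a job. The Python sketch below assumes the information has already been fetched and parsed into plain records; the record layout, resource names, and delay values are invented for illustration and do not reflect the actual format returned by the information services.

```python
from typing import Dict, List, Optional


def pick_resource(records: List[Dict], required_service: str) -> Optional[str]:
    """Return the resource offering the service with the shortest estimated delay."""
    candidates = [r for r in records if required_service in r["services"]]
    if not candidates:
        return None
    best = min(candidates, key=lambda r: r["estimated_delay_minutes"])
    return best["name"]


if __name__ == "__main__":
    # Hypothetical records; a real gateway would build these from a query.
    records = [
        {"name": "resource-a", "services": {"gram", "gridftp", "gsissh"},
         "estimated_delay_minutes": 45},
        {"name": "resource-b", "services": {"gram", "gridftp"},
         "estimated_delay_minutes": 10},
        {"name": "resource-c", "services": {"gsissh"},
         "estimated_delay_minutes": 1},
    ]
    print(pick_resource(records, "gram"))
```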
High-Level Components
[Figure: WS/REST (HTTP GET) clients and WS/SOAP clients query the TeraGrid-wide information services (Apache 2.0, cache, Tomcat, WebMDS, WS MDS4), which aggregate the service provider information services (WS MDS4).]
High-Availability Design
[Figure: clients reach the TeraGrid-wide information services through info.teragrid.org and info.dyn.teragrid.org (TeraGrid dynamic DNS), over static and dynamic paths; the service provider information services sit behind them.]
Server failover propagates globally in 15 minutes
Today, there are approximately 29 gateways
using the TeraGrid
NSF Program Officers, September 10, 2008
Selected Highlights from the PSE08 paper
 The Social Informatics Data (SID) Grid
 The Geosciences Network (GEON)
 QuakeSim
 Computational Infrastructure for Geodynamics
(CIG)
 Conclusions
Social Informatics Data Grid
 Heavy use of “multimodal” data.
 A subject might be viewing a video while a researcher collects heart rate and eye movement data.
 Events must be synchronized for analysis, and large datasets result.
 Extensive analysis capabilities are not something that each researcher should have to create for themselves.
http://www.ci.uchicago.edu/research/files/sidgrid.mov
How does SIDGrid use the TeraGrid?
 Computationally intensive tasks
 Speech, gesture, facial expression, and physiological measurements
 Media transcoding for pitch analysis of audio tracks
 Once stored in raw form, data streams are converted to formats compatible with software for annotation, coding, integration, analysis
 fMRI image analysis
 Workflows for massive job submissions and data transfers using Virtual Data System (VDS)
 Workflows converted to a concrete execution plan via the Pegasus Grid planner
 TeraGrid information service (MDS)
 Replica location service (RLS)
 DAGMAN and Condor-G/GRAM
 The goal of GEON is
 to advance the field of geoinformatics and
 to prepare and train current and future generations of geoscience researchers, educators, and practitioners in the use of cyberinfrastructure to further their research, education, and professional goals.
 GEON is providing several key features
 data access, computational simulations, personal work spaces and analysis environments
 identifying best practices with the objective of dramatically advancing geoscience research and education.
How does GEON use the TeraGrid?
 Computationally intensive tasks
 Ability to speedily construct earth models, access observed earthquake recordings, and simulate them to understand the subsurface structure and characteristics of seismic wave propagation in an efficient manner
 SYNSEIS (SYNthetic SEISmogram generation tool) provides access to seismic waveform data and simulates seismic records using 2D and 3D models.
 Conduct advanced calculations for simulating seismic waveforms of either earthquakes or explosions at regional distances (< 1000 km).
 GSI (security), GAMA (account management), GridFTP (data transfer), GRAM (job submission), MyWorkspace (job monitoring)
 Account management for classroom use, MyProjects collaboration tool and tagging also serve students
QuakeSim - Some Design Choices
 Build portals out of portlets (Java Standard)
 Reuse capabilities from our Open Grid Computing Environments (OGCE) project, the REASoN GPS Explorer project, and many TeraGrid Science Gateways.
 Decorate with Google Maps, Yahoo UI gadgets, etc.
 Use JavaServer Faces to build individual component portlets.
 Build standalone tools, then convert to portlets at the very end.
 Use simple Web Services for accessing codes and data.
 Keep It Stateless …
 Use Condor-G and Globus job and file management services for interacting with high performance computers.
 TeraGrid
 Favor Google Maps and Google Earth for their simplicity, interactivity and open APIs.
 Generate KML and GeoRSS
 Use an Apache Maven based build and compile system, SVN on SourceForge
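Generating KML for Google Maps or Google Earth amounts to writing a small XML document. The Python sketch below builds a single KML Placemark with the standard library; the `placemark_kml` function name, the station label, and the coordinates are invented for illustration and are not taken from QuakeSim, which is built in Java.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def placemark_kml(name: str, lon: float, lat: float) -> str:
    """Return a KML document containing one point Placemark."""
    ET.register_namespace("", KML_NS)  # emit KML as the default namespace
    kml = ET.Element(f"{{{KML_NS}}}kml")
    placemark = ET.SubElement(kml, f"{{{KML_NS}}}Placemark")
    ET.SubElement(placemark, f"{{{KML_NS}}}name").text = name
    point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
    # KML orders coordinates as longitude,latitude (altitude is optional).
    ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")


if __name__ == "__main__":
    print(placemark_kml("GPS station (example)", -118.25, 34.05))
```

Note the longitude-first ordering, which is a common source of misplaced markers.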
[Figure: the browser interface talks HTTP(S) to portlets and client stubs, which call WSDL-described services over SOAP/HTTP: a DB service (JDBC, DB) on Host 1 (Quaketables); job submission/monitoring and file services over operating and queuing systems on Host 2 (Grid); a visualization or map service on Host 3 (G Maps).]
Two Approaches to the Middle Tier
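[Figure: Fat client: the portal component contains the grid client, which speaks the grid protocol (SOAP) to the grid service in front of the backend resource. Thin client: the portal component calls a web service over HTTP + SOAP; that web service contains the grid client, which speaks the grid protocol (SOAP) to the grid service in front of the backend resource.]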
Daily RDAHMM Updates
Daily analysis and event classification of GPS data from REASoN’s GRWS.
Disloc output converted to KML and plotted.
GeoFEST Finite Element Modeling portlet and plotting tools
[Figure: layered architecture. Top: desktop users, web portal and gateway style applications. Below: standard web service interface; request manager; resource ranking manager (QBET web service, hosted by UCSB); data model manager (RDBMS); fault manager; User A's job board, job queue, and resource pool; job distributor (MyProxy server hosted by the TeraGrid project, tokens for resources X, Y, Z); job execution manager (Condor-G with Birdbath). Bottom: high performance computing clusters (grid style clusters and Condor computing nodes).]
“SWARM: Scheduling Large-scale Jobs over the Loosely-Coupled HPC Clusters,” S. L.
Pallickara and M. E. Pierce, Friday, December 12, 2 p.m. to 2:30 p.m.
http://escience2008.iu.edu/sessions/SWARM.shtml
 Membership-governed organization
 40 institutional members, 9 foreign affiliates
 Supports and promotes Earth science by developing and maintaining software for computational geophysics
How does CIG use the TeraGrid?
 Seismograms allow scientists to understand the ground motion
 Computationally intensive simulations, run on TeraGrid using an assortment of 3D and 1D earth models, produce synthetic seismograms
 Necessary input datasets provided via the portal
 Daemon (Python, Pyre) constantly polls the web site looking for work to do
 Uses GSI-OpenSSH and MyProxy credentials to submit jobs, monitor jobs, and transfer output back to the portal
 Sends status updates to the web site using HTTP POST
 Users can download results in ASCII and Seismic Analysis Code (SAC) format
 Visualizations include "beachball" graphics depicting the earthquake's source mechanism, and maps showing the locations of the earthquake and the seismic stations using GMT (http://gmt.soest.hawaii.edu/)
 Researchers quickly receive results and can concentrate on the scientific aspects of the output rather than on the details of running the analysis on a supercomputer
 Future Directions
 Parameter explorations
 Custom earth models for users
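The polling daemon pattern described on this slide can be sketched in Python. This is not the CIG daemon or Pyre code: `poll_once`, `run_daemon`, the job dictionary layout, and the status strings are assumptions for illustration, and the web site, job runner, and HTTP POST are replaced by plain callables so the sketch runs without a network.

```python
import time
from typing import Callable, Dict, Optional

Job = Dict[str, object]


def poll_once(fetch_work: Callable[[], Optional[Job]],
              run_job: Callable[[Job], None],
              post_status: Callable[[object, str], None]) -> bool:
    """Ask the web site for one job, run it, and report status. True if work was found."""
    job = fetch_work()
    if job is None:
        return False
    post_status(job["id"], "running")
    try:
        run_job(job)  # e.g. submit to a remote resource and wait for output
    except Exception as exc:
        post_status(job["id"], f"failed: {exc}")
    else:
        post_status(job["id"], "done")
    return True


def run_daemon(fetch_work, run_job, post_status, idle_seconds: float = 30.0,
               max_polls: Optional[int] = None,
               sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll repeatedly; sleep only when the site had no work."""
    polls = 0
    while max_polls is None or polls < max_polls:
        if not poll_once(fetch_work, run_job, post_status):
            sleep(idle_seconds)
        polls += 1


if __name__ == "__main__":
    queue = [{"id": 1, "model": "example"}]
    run_daemon(lambda: queue.pop(0) if queue else None,
               lambda job: None,
               lambda job_id, status: print(job_id, status),
               max_polls=2, sleep=lambda seconds: None)
```

Polling keeps the portal simple: it only needs to serve a work list and accept status posts, and never has to open connections to the compute resources itself.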
Conclusions
 Technical requirements of some PSEs dictate seamless access to high-end compute and data resources
 A robust, flexible and scalable infrastructure can provide a foundation for many PSEs
 PSEs themselves must be treated as sustainable infrastructure
 Researchers will not truly rely on PSEs for their work unless they have confidence that the PSE will remain operational for the long term and provide reliable services