Transcript GridIntro

Introduction to Grid Computing
Midwest Grid Workshop Module 1
1
Scaling up Social Science:
Citation Network Analysis
[Figure: snapshots of the citation network in 1975, 1980, 1985, 1990, 1995, 2000 and 2002]
Work of James Evans, University of Chicago, Department of Sociology
2
Scaling up the analysis
- Database queries of 25+ million citations
- Work started on small workstations
- Queries grew to month-long duration
- With the database distributed across the U of Chicago TeraPort cluster, 50 (faster) CPUs gave a 100x speedup
- Many more methods and hypotheses can be tested!
Higher throughput and capacity enable deeper analysis and broader community access.
3
Computing clusters have commoditized
supercomputing
[Diagram: a typical cluster - a cluster management "frontend"; I/O servers (typically RAID fileservers) with disk arrays; a few head nodes, gatekeepers and other service nodes; lots of worker nodes; tape backup robots]
4
PUMA: Analysis of Metabolism
PUMA Knowledge Base: information about proteins analyzed against ~2 million gene sequences.
Analysis on the Grid involves millions of BLAST, BLOCKS, and other processes.
Natalia Maltsev et al., http://compbio.mcs.anl.gov/puma2
5
Initial driver: High Energy Physics
[Diagram: the LHC tiered computing model]
- Online System: ~PBytes/sec from the detector; there is a "bunch crossing" every 25 nsecs and ~100 "triggers" per second; each triggered event is ~1 MByte in size.
- Tier 0: CERN Computer Centre and Offline Processor Farm (~20 TIPS), fed at ~100 MBytes/sec. (1 TIPS is approximately 25,000 SpecInt95 equivalents.)
- Tier 1: Regional Centres (France, Germany, Italy; FermiLab ~4 TIPS) linked at ~622 Mbits/sec (or Air Freight, deprecated).
- Tier 2: Tier2 Centres (~1 TIPS each, e.g., Caltech) linked at ~622 Mbits/sec.
- Institutes (~0.25 TIPS) with physics data caches: physicists work on analysis "channels"; each institute will have ~10 physicists working on one or more channels, and data for these channels should be cached by the institute server.
- Tier 4: physicist workstations at ~1 MBytes/sec.
Image courtesy Harvey Newman, Caltech
6
Grids Provide Global Resources
To Enable e-Science
7
Ian Foster’s Grid Checklist
A Grid is a system that:
- Coordinates resources that are not subject to centralized control
- Uses standard, open, general-purpose protocols and interfaces
- Delivers non-trivial qualities of service
8
Virtual Organizations
- Groups of organizations that use the Grid to share resources for specific purposes
- Support a single community
- Deploy compatible technology and agree on working policies
  - Security policies - difficult
- Deploy different network-accessible services:
  - Grid Information
  - Grid Resource Brokering
  - Grid Monitoring
  - Grid Accounting
9
What is a grid made of? Middleware.
[Diagram: a grid client (application user interface) reaches, through grid middleware, computing clusters and grid storage dedicated by UC, IU and Boston; grid resources shared by OSG, LCG and NorduGRID; grid resource time purchased from a commercial provider; and resource, workflow and data catalogs.]
Grid Protocols:
- Security to control access and protect communication (GSI)
- Directory to locate grid sites and services (VORS, MDS)
- Uniform interface to computing sites (GRAM)
- Facility to maintain and schedule queues of work (Condor-G)
- Fast and secure data set mover (GridFTP, RFT)
- Directory to track where datasets live (RLS)
10
We will focus on Globus and Condor
- Condor provides both client & server scheduling
  - In grids, Condor provides an agent to queue, schedule and manage work submission
- The Globus Toolkit provides the lower-level middleware:
  - Client tools which you can use from a command line
  - APIs (scripting languages, C, C++, Java, ...) to build your own tools, or use directly from applications
  - Web service interfaces
  - Higher-level tools built from these basic components, e.g. Reliable File Transfer (RFT)
11
Grid components form a “stack”
[Stack, top to bottom:]
- Grid Application (M6, M7, M8) (often includes a Portal)
- Workflow system (explicit or ad-hoc) (M6)
- Condor Grid Client (M3)
- Core Globus Services: Job Management (M3), Data Management (M4), Information (config & status) Services (M5)
- Grid Security Infrastructure (M2)
- Standard Network Protocols
12
Grid architecture is evolving to a Service-Oriented approach.
...but this is beyond our workshop’s scope.
See “Service-Oriented Science” by Ian Foster.
- Service-oriented applications
  - Wrap applications as services
  - Compose applications into workflows
- Service-oriented Grid infrastructure
  - Provision physical resources to support application workloads
[Figure: users compose application services into workflows and invoke them; provisioning supplies the underlying resources. From “The Many Faces of IT as Service”, Foster, Tuecke, 2005]
13
Security
Workshop Module 2
14
Grid security is a crucial component
- Resources are typically valuable
- Problems being solved might be sensitive
- Resources are located in distinct administrative domains
  - Each resource has its own policies, procedures, security mechanisms, etc.
- Implementation must be broadly available & applicable
  - Standard, well-tested, well-understood protocols, integrated with a wide variety of tools
15
Security Services
- Form the underlying communication medium for all the services
- Secure authentication and authorization
- Single sign-on
  - The user explicitly authenticates only once - then single sign-on works for all service requests
- Uniform credentials
- Example: GSI (Grid Security Infrastructure), sketched below
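To make single sign-on concrete, here is a minimal command-line sketch using the standard GSI proxy tools; the gatekeeper host name is a placeholder, and your site's proxy lifetime policy may differ.

    # Create a short-lived proxy credential from your long-term certificate;
    # this prompts once for the private-key pass phrase
    grid-proxy-init -valid 12:00

    # Inspect the proxy: subject, issuer, remaining lifetime
    grid-proxy-info

    # Later grid operations reuse the proxy with no further prompts, e.g.:
    globus-job-run gatekeeper.example.edu /bin/hostname

    # Destroy the proxy when finished
    grid-proxy-destroy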
16
Authentication means identifying that you are whom you claim to be
- Authentication stops imposters
- Examples of authentication:
  - Username and password
  - Passport
  - ID card
  - Public keys or certificates
  - Fingerprint
17
Authorization controls what you are allowed to do.
- Is this device allowed to access this service?
- Read, write, execute permissions in Unix
- Access control lists (ACLs) provide more flexible control
- Special "callouts" in the job and data management layers of the grid stack perform authorization checks (see the grid-mapfile sketch below)
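For example, the simplest GSI authorization mechanism on a gatekeeper is the grid-mapfile, which maps a certificate's distinguished name to a local account. The entries below are purely illustrative (hypothetical DNs and usernames):

    # /etc/grid-security/grid-mapfile  (illustrative entries)
    "/DC=org/DC=doegrids/OU=People/CN=Jane Doe 123456" jdoe
    "/DC=org/DC=doegrids/OU=People/CN=Sam Smith 654321" osguser

A request is authorized only if the authenticated DN appears in the file, and the job then runs as the mapped local user.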
18
Job and resource management
Workshop Module 3
19
Job Management Services provide a standard interface to remote resources
- Includes CPU, Storage and Bandwidth
- Globus component is the Globus Resource Allocation Manager (GRAM)
- The primary Condor grid client component is Condor-G
- Other needs:
  - scheduling
  - monitoring
  - job migration
  - notification
20
GRAM provides a uniform interface to
diverse resource scheduling systems.
[Diagram: a user submits jobs through GRAM into the Grid; GRAM presents one interface to VO sites A-D, which run different local schedulers (Condor, PBS, LSF, UNIX fork()).]
21
GRAM: What is it?
- Globus Resource Allocation Manager
- Given a job specification, GRAM can:
  - Create an environment for a job
  - Stage files to and from the environment
  - Submit a job to a local resource manager
  - Monitor a job
  - Send notifications of job state changes
  - Stream a job's stdout/err during execution
(A minimal submission example follows.)
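As a sketch of what a GRAM submission looks like from the client side (the gatekeeper contact and jobmanager name below are placeholders), the pre-web-services command-line tools can be used directly:

    # Run a command interactively through a remote GRAM gatekeeper
    globus-job-run grid.example.edu/jobmanager-pbs /bin/hostname

    # Submit in batch mode; the returned job contact can later be passed
    # to globus-job-status or globus-job-cancel
    globus-job-submit grid.example.edu/jobmanager-pbs /bin/date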
22
A “Local Resource Manager” is a batch system for running jobs across a computing cluster
- In GRAM, examples include:
  - Condor
  - PBS
  - LSF
  - Sun Grid Engine
- Most systems also allow you to access "fork"
  - The default behavior
  - It runs on the gatekeeper: a bad idea in general, but okay for testing
(A sample local batch script follows.)
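For reference, this is roughly what a job looks like at the local resource manager level once GRAM hands it off; the sketch below assumes PBS, and the resource limits and program name are placeholders.

    #!/bin/sh
    # sample PBS batch script (illustrative)
    #PBS -l nodes=1:ppn=1
    #PBS -l walltime=00:10:00
    #PBS -N analyze_run01

    # run from the directory the job was submitted from
    cd $PBS_O_WORKDIR
    ./analyze input.dat > analyze.out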
23
Managing your jobs
- We need something more than just the basic functionality of the globus job submission commands
- Some desired features:
  - Job tracking
  - Submission of a set of inter-dependent jobs
  - Checkpointing and job resubmission capability
  - Matchmaking to select an appropriate resource for executing the job
- Options: Condor, PBS, LSF, ... (see the Condor-G sketch below)
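A minimal Condor-G sketch of these ideas, with a placeholder gatekeeper contact: a grid-universe submit file queues work at a remote GRAM site, and a small DAGMan file expresses a set of inter-dependent jobs.

    # myjob.sub - Condor-G submit description
    universe      = grid
    grid_resource = gt2 grid.example.edu/jobmanager-pbs
    executable    = analyze
    arguments     = input.dat
    output        = analyze.out
    error         = analyze.err
    log           = analyze.log
    queue

    # pipeline.dag - DAGMan: run B and C only after A succeeds
    JOB A prepare.sub
    JOB B analyze_left.sub
    JOB C analyze_right.sub
    PARENT A CHILD B C

These would be handed to Condor with condor_submit myjob.sub and condor_submit_dag pipeline.dag; Condor then tracks the jobs and manages the dependencies.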
24
Data Management
Workshop Module 4
25
Data management services provide the mechanisms to find, move and share data
- Fast
- Flexible
- Secure
- Ubiquitous
Grids are used for analyzing and manipulating large amounts of data:
- Metadata (data about data): What is the data?
- Data location: Where is the data?
- Data transport: How to move the data?
26
GridFTP is a secure, efficient and standards-based data transfer protocol
- Robust, fast and widely accepted
- Globus GridFTP server
- Globus globus-url-copy GridFTP client
- Other clients exist (e.g., uberftp)
27
GridFTP is secure, reliable and fast
- Security through GSI
  - Authentication and authorization
  - Can also provide encryption
- Reliability by restarting failed transfers
- Fast
  - Can set TCP buffers for optimal performance
  - Parallel transfers
  - Striping (multiple endpoints)
- Not all features are accessible from the basic client
- Can be embedded in higher-level and application-specific frameworks
(See the globus-url-copy sketch below.)
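A hedged sketch of the basic client in action; host names, paths and tuning values are placeholders, and the best buffer and parallelism settings depend on your network.

    # Third-party transfer between two GridFTP servers,
    # with four parallel streams and a 2 MB TCP buffer
    globus-url-copy -vb -p 4 -tcp-bs 2097152 \
        gsiftp://source.example.edu/data/run01.dat \
        gsiftp://dest.example.edu/data/run01.dat

    # Simple download to the local filesystem
    globus-url-copy gsiftp://source.example.edu/data/run01.dat \
        file:///tmp/run01.dat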
28
File catalogues tell you where the data is
- Replica Location Service (RLS)
- PhEDEx
- RefDB / PupDB
29
Requirements from a File Catalogue
- Abstract out the logical file name (LFN) for a physical file
  - Maintain the mappings between the LFNs and the PFNs (physical file names)
- Maintain the location information of a file
(An RLS sketch follows.)
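A hedged sketch, assuming the Globus RLS command-line client (globus-rls-cli); the logical name, replica URLs and RLS server address are placeholders, and exact subcommands may differ between RLS versions.

    # Register the first replica: create the LFN -> PFN mapping
    globus-rls-cli create run01.dat \
        gsiftp://storage1.example.edu/data/run01.dat rls://rls.example.edu

    # Register a second replica of the same logical file
    globus-rls-cli add run01.dat \
        gsiftp://storage2.example.edu/data/run01.dat rls://rls.example.edu

    # Look up all physical locations for the logical name
    globus-rls-cli query lrc lfn run01.dat rls://rls.example.edu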
30
In order to avoid "hotspots", replicate data files in more than one location
- Effective use of the grid resources
- Each LFN can have more than 1 PFN
- Avoids single point of failure
- Manual or automatic replication
  - Automatic replication considers the demand for a file, transfer bandwidth, etc.
31
The Globus-Based LIGO Data Grid
LIGO Gravitational Wave Observatory
[Map: replication sites including Birmingham, Cardiff and AEI/Golm]
Replicating >1 Terabyte/day to 8 sites; >40 million replicas so far
www.globus.org/solutions
32
National Grid
Cyberinfrastructure
Workshop Module 5
33
TeraGrid provides vast resources via a
number of huge computing facilities.
34
Open Science Grid (OSG) provides shared computing resources, benefiting a broad set of disciplines
- A consortium of universities and national laboratories, building a sustainable grid infrastructure for science
- OSG incorporates advanced networking and focuses on general services, operations, and end-to-end performance
- Composed of a large number (>50 and growing) of shared computing facilities, or "sites"
http://www.opensciencegrid.org/
35
Open Science Grid
- 50 sites (15,000 CPUs) & growing
- 400 to >1000 concurrent jobs
- Many applications + CS experiments; includes long-running production operations
- Up since October 2003; a few FTEs of central operations effort
[Chart: jobs running during 2004]
www.opensciencegrid.org
36
To efficiently use a Grid, you must locate and monitor its resources.
- Check the availability of different grid sites
- Discover different grid services
- Check the status of "jobs"
- Make better scheduling decisions with information maintained on the "health" of sites
37
OSG Resource Selection Service: VORS
38
Monitoring and Discovery Service (MDS)
[Diagram: clients (e.g., WebMDS) query an MDS-Index running in a GT4 container through WS-ServiceGroup and WSRF/WSN access; services such as GRAM, RFT and GridFTP register automatically into their container's MDS-Index (GRAM via an adapter, non-WSRF entities via custom protocols), and those indexes register upward into a site-level index.]
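As one hedged illustration (assuming GT4's WSRF query tool and a placeholder host; the exact tool name and service path vary by deployment), an MDS index can be queried from the command line:

    # Query the resource properties published by a GT4 Default Index Service
    # (host and port are placeholders)
    wsrf-query -s https://gridhost.example.edu:8443/wsrf/services/DefaultIndexService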
39
Grid Workflow
Workshop Module 6
40
A typical workflow pattern in image
analysis runs many filtering apps.
[Workflow DAG: anatomy images 3a, 4a, 5a and 6a (.h/.i header and image files) are each aligned to a reference with align_warp, resliced with reslice, averaged by softmean into an atlas, then cut along x, y and z with slicer and converted to atlas_x/y/z.jpg with convert.]
Workflow courtesy James Dobson, Dartmouth Brain Imaging Center
41
Workflows can process vast datasets.
- Many HEP and Astronomy experiments consist of:
  - Large datasets as inputs (find datasets)
  - "Transformations" which work on the input datasets (process)
  - The output datasets (store and publish)
- The emphasis is on the sharing of the large datasets
- Transformations are usually independent and can be parallelized. But they can vary greatly in duration.
[Figure: Montage workflow (~1200 jobs, 7 levels) that created a mosaic of M42 on TeraGrid; nodes are compute jobs and data transfers.]
42
The virtual data model enables workflows to abstract grid details.
43
An important application pattern:
Grid Data Mining
Workshop Module 7
44
Mining seismic data for hazard analysis (Southern Calif. Earthquake Center)
[Diagram: inputs to the Seismic Hazard Model include seismicity, paleoseismology, local site effects, geologic structure, faults, stress transfer, crustal motion, crustal deformation, seismic velocity structure, and rupture dynamics.]
InSAR image of the Hector Mine earthquake: a satellite-generated Interferometric Synthetic Aperture Radar (InSAR) image of the 1999 Hector Mine earthquake. It shows the displacement field in the direction of radar imaging; each fringe (e.g., from red to red) corresponds to a few centimeters of displacement.
45
Data mining in the grid
- Decompose across the network
- Clients integrate dynamically
  - Select & compose services
  - Select "best of breed" providers
  - Publish result as a new service
  - Decouple resource & service providers
[Figure: users connect to discovery tools, analysis tools and data archives. Fig: S. G. Djorgovski]
46
Conclusion: Why Grids?
- New approaches to inquiry based on:
  - Deep analysis of huge quantities of data
  - Interdisciplinary collaboration
  - Large-scale simulation and analysis
  - Smart instrumentation
  - Dynamically assembling the resources to tackle a new scale of problem
- Enabled by access to resources & services without regard for location & other barriers
47
Grids: Because Science Takes a Village ...
- Teams organized around common goals
  - People, resources, software, data, instruments...
- With diverse membership & capabilities
  - Expertise in multiple areas required
- And geographic and political distribution
  - No location/organization possesses all required skills and resources
- Must adapt as a function of the situation
  - Adjust membership, reallocate responsibilities, renegotiate resources
48
Based on:
Grid Intro and Fundamentals Review
Dr Gabrielle Allen
Center for Computation & Technology
Department of Computer Science
Louisiana State University
[email protected]
Grid Summer Workshop
June 26-30, 2006
49