Resource Management of Large-Scale Applications on a Grid
Laukik Chitnis and Sanjay Ranka
(with Paul Avery, Jang-uk In and Rick Cavanaugh)
Department of CISE
University of Florida, Gainesville
[email protected]
352 392 6838
(http://www.cise.ufl.edu/~ranka/)
1
Overview
High End Grid Applications and Infrastructure
at University of Florida
Resource Management for Grids
Sphinx Middleware for Resource Provisioning
Grid Monitoring for better meta-scheduling
Provisioning Algorithm Research for multicore and grid environments
2
The Evolution of High-End Applications
(and their system characteristics)
1980: Mainframe Applications
• Central mainframes
1990: Compute Intensive Applications
• Large clusters
• Supercomputers
2000: Data Intensive Applications
• Geographically distributed datasets
• High speed storage
• Gigabit networks
3
Some Representative Applications
HEP, Medicine, Astronomy, Distributed
Data Mining
4
Representative Application: High Energy
Physics
[Figure: a worldwide collaboration spanning 20+ countries, with 1000+ participants and 1-10 petabytes of data]
5
Representative Application: Tele-Radiation
Therapy
RCET Center for Radiation Oncology
6
Representative Application: Distributed
Intrusion Detection
NSF ITR Project: Middleware for Distributed Data Mining (PI: Ranka, joint with Kumar and Grossman)
[Architecture diagram: applications layered over data mining and scheduling services, with data transport services connecting data management services at each site]
7
Grid Infrastructure
Florida Lambda Rail and UF
8
Campus Grid (University of Florida)
NSF Major Research Instrumentation Project (PI: Ranka, Avery, et al.)
20 Gigabit/sec network
20+ Terabytes
2-3 Teraflops
10 scientific and engineering applications
Gigabit Ethernet based cluster
InfiniBand based cluster
9
Grid Services
The software part of the infrastructure!
10
Services offered in a Grid
Resource Management Services
Data Management Services
Monitoring and Information Services
Security Services
Note that all the other services use security services
11
Resource Management Services
Provide a uniform, standard interface to
remote resources including CPU, Storage and
Bandwidth
Main component is the remote job manager
Ex: GRAM (Globus Resource Allocation
Manager)
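A minimal sketch of what this uniform interface looks like in practice, assuming a Globus client installation; the gatekeeper contact and queue name below are hypothetical:

import subprocess

# Hypothetical gatekeeper contact; the same RSL description works whether the
# site's local scheduler is PBS, Condor, LSF, or plain fork.
gatekeeper = "gatekeeper.site1.example.edu/jobmanager-pbs"
rsl = "&(executable=/bin/hostname)(count=1)(queue=default)"

# globusrun hands the RSL to the remote GRAM job manager (-o streams stdout back).
subprocess.run(["globusrun", "-o", "-r", gatekeeper, rsl], check=True)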
12
Resource Management on a Grid
[Diagram: a user submits jobs through GRAM to Site 1, Site 2, Site 3, ..., Site n on the grid; each site runs its own local scheduler (Condor, PBS, LSF, or plain fork)]
Narration: note the different local schedulers.
13
Scheduling your Application
14
Scheduling your Application
An application can be run on a grid site as a job
The modules in the grid architecture (such as GRAM) allow your job uniform access to the grid sites
But…
Most applications can be “parallelized”
And the separate parts can then be scheduled to run simultaneously on different sites
Thus utilizing the power of the grid
15
Modeling an Application Workflow
Many workflows can be modeled as a Directed Acyclic Graph (DAG)
The amount of resource required (in units of time) is known to a degree of certainty
There is a small probability of failure in execution (in a grid environment this could happen due to resources no longer being available)
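A minimal sketch of this model in code; the task names, runtimes, and failure probabilities below are invented for illustration:

from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    expected_time: float        # estimated resource requirement, in units of time
    fail_prob: float = 0.01     # small chance the task must be rerun
    parents: list = field(default_factory=list)

extract  = Task("extract", expected_time=10)
filter_a = Task("filter_a", expected_time=25, parents=[extract])
filter_b = Task("filter_b", expected_time=30, parents=[extract])
merge    = Task("merge", expected_time=5, parents=[filter_a, filter_b])

workflow = [extract, filter_a, filter_b, merge]   # topological order, no cycles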
16
Workflow Resource Provisioning
Executing multiple workflows over distributed and adaptive (faulty) resources while managing policies
Applications: large, data intensive, precedence constraints, time constraints
Resources: heterogeneous, distributed, multi-core, faulty
17
A Real Life Example from High
Energy Physics
Merge two grids into a single multi-VO “Inter-Grid”
(participating sites include UW, UC, UI, ANL, Caltech, UCSD, MIT, BNL, FNAL, IU, LBL, BU, UM, OU, UTA, SMU, Rice, and UF)
How to ensure that
neither VO is harmed?
both VOs actually benefit?
there are answers to questions like:
“With what probability will my job be scheduled and complete before my conference deadline?”
Clear need for a scheduling middleware!
18
Typical scenario
[Diagram: a VDT client facing three VDT servers, with question marks over which server the job should be sent to]
19
Typical scenario
[Diagram: the same VDT client and servers, with the user now visibly frustrated at having to pick a server by hand]
20
Some Requirements for
Effective Grid Scheduling
Information requirements
Past & future dependencies of the application (persistent storage of workflows; expected to vary slowly over time)
Global views of job descriptions
Request tracking and usage statistics (state information important)
Resource properties and status (resource usage estimation)
Policies (expected to vary slowly with time)
Grid weather (latency of measurement important)
Replica management
System requirements
Distributed, fault-tolerant scheduling
Customisability
Interoperability with other scheduling systems
Quality of Service
21
Incorporate Requirements
into a Framework
Assume the GriPhyN Virtual Data Toolkit:
Client (request/job submission):
Globus clients
Condor-G/DAGMan
Chimera Virtual Data System
Server (resource gatekeeper):
MonALISA Monitoring Service
Globus services
RLS (Replica Location Service)
22
Incorporate Requirements
into a Framework
Framework design principles:
Information driven
Flexible client-server model
General, but pragmatic and simple
Avoid adding middleware requirements on grid resources
Assume the Virtual Data Toolkit:
Client (request/job submission):
Clarens Web Service
Globus clients
Condor-G/DAGMan
Chimera Virtual Data System
Server (resource gatekeeper):
MonALISA Monitoring Service
Globus services
RLS (Replica Location Service)
[Diagram: a Recommendation Engine sits between the VDT client and the VDT servers]
23
Related Provisioning Software
Systems compared (and their emphasis):
Nimrod-G: economy-driven, deadline support
Maui/Silver: priority-based, reservation
PBS: batch job scheduling, queue-based
EZ-Grid: policy-based
Prophet: parallel SPMD
LSF: interactive and batch modes
Comparison criteria: adaptive scheduling, co-allocation, fault tolerance, policy-based scheduling, QoS support, flexible interface
24
Innovative Workflow Scheduling Middleware
Modular system
Automated scheduling procedure based on modular services
Robust and recoverable system
Database infrastructure
Fault-tolerant and recoverable from internal failure
Platform independent interoperable system
XML-based communication protocols
SOAP, XML-RPC
Supports heterogeneous service environment
60 Java Classes
24,000 lines of Java code
50 test scripts, 1500 lines of script code
25
The Sphinx Workflow Execution
Framework
[Architecture diagram]
Sphinx Client (within the VDT Client), connected to the Sphinx Server through the Clarens WS backbone
VDT Client side: Chimera Virtual Data System, Condor-G/DAGMan
Sphinx Server: request processing, data warehouse, data management, information gathering
VDT Server site: Globus resources, Replica Location Service, MonALISA Monitoring Service
26
Sphinx Workflow Scheduling Server
Control Process
Functions as the nerve centre
Finite state machine: different modules modify jobs, graphs, workflows, etc. and change their state
Flexible and extensible
Data Warehouse
Policies, account information, grid weather, resource properties and status, request tracking, workflows, etc.
Server modules include: Message Interface, Graph Reducer, Job Predictor, Graph Predictor, Job Admission Control, Graph Admission Control, Graph Data Planner, Job Execution Planner, Graph Tracker, Data Management, Information Gatherer
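A hedged illustration of the finite-state-machine idea; the state names and transitions below are assumptions for the sketch, not the actual Sphinx states:

# Illustrative job state machine; states and transitions are invented.
ALLOWED = {
    "submitted": {"predicted"},
    "predicted": {"admitted", "rejected"},
    "admitted":  {"planned"},
    "planned":   {"running"},
    "running":   {"finished", "failed"},
    "failed":    {"planned"},        # recoverable: re-plan after a failure
}

class Job:
    def __init__(self, job_id):
        self.job_id = job_id
        self.state = "submitted"

    def advance(self, new_state):
        # Each server module moves a job along one legal transition,
        # persisting the new state in the data warehouse (omitted here).
        if new_state not in ALLOWED.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state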
27
SPHINX
Scheduling in Parallel for
Heterogeneous Independent NetworXs
28
Policy Based Scheduling
Sphinx provides “soft” QoS through time-dependent, global views of
Submissions (workflows, jobs, allocation, etc)
Policies
Resources
Uses linear programming methods to
Satisfy constraints (policies, user requirements, etc)
Optimize an “objective” function
Estimate probabilities to meet deadlines within policy constraints
[Figure: the policy space as a cube with axes for submissions, resources, and time]
J. In, P. Avery, R. Cavanaugh, and S. Ranka, "Policy Based Scheduling for Simple Quality of Service in Grid Computing", in Proceedings of the 18th IEEE IPDPS, Santa Fe, New Mexico, April 2004
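A minimal sketch of the linear programming idea; the sites, costs, and quota numbers are invented, and this is not the actual Sphinx formulation:

# Split 100 jobs across three sites so total expected cost is minimized,
# subject to per-site policy quotas. All numbers are illustrative.
from scipy.optimize import linprog

cost = [3.0, 5.0, 4.0]            # expected cost per job at sites A, B, C
quota = [60, 30, 50]              # policy: max jobs each site accepts from this user

res = linprog(
    c=cost,
    A_eq=[[1, 1, 1]], b_eq=[100],            # every job must be placed somewhere
    bounds=[(0, q) for q in quota],           # respect per-site policy quotas
    method="highs",
)
print(res.x)   # e.g. [60, 0, 40]: fill the cheapest sites up to their quotas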
29
Ability to tolerate task failures
Jang-uk In, Sanjay Ranka, et al., "SPHINX: A fault-tolerant system for scheduling in dynamic grid environments", in Proceedings of the 19th IEEE IPDPS, Denver, Colorado, April 2005
[Charts: number of jobs timing out (120 DAGs x 10 jobs/DAG) and average DAG completion time (30 DAGs x 10 jobs/DAG) under completion-time based, queue-length based, # of CPUs based, and round-robin scheduling, with and without feedback]
• Significant impact of using feedback information
30
Grid Enabled Analysis
SC|03
31
Distributed Services for Grid
Enabled Data Analysis
[Diagram: distributed services spanning Caltech, Florida, Fermilab, and Iowa, all exposed through Clarens: a ROOT data analysis client, a Chimera virtual data service, a Sphinx/VDT execution service, a Sphinx scheduling service, file services, VDT resource services, the Globus Replica Location Service (RLS), and the MonALISA monitoring service]
Evaluation of Information gathered from
grid monitoring systems
[Scatter plots: turnaround time (sec) versus the monitored value for queue_length (site rating), cluster_load, and AvgJobDelay]
Correlation index with turnaround time:
Queue length: -0.05818
Cluster load: -0.20775
Average Job Delay: 0.892542
33
Limitation of Existing Monitoring
Systems for the Grid
Information aggregated across multiple users is not very useful for effective resource allocation.
An end-to-end parameter such as Average Job Delay, the average queuing delay experienced by a job of a given user at an execution site, is a better estimate for comparing resource availability and response time for a given user.
It is also not very susceptible to monitoring latencies.
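A small sketch of how such a per-user, per-site metric could be computed from recent job records; the record field names here are assumptions for illustration:

# Average Job Delay = mean queuing delay (start - submit) of one user's
# recent jobs at each site. Field names are invented.
from collections import defaultdict

def average_job_delay(job_records, user):
    """job_records: iterable of dicts with keys user, site, submit_time, start_time."""
    delays = defaultdict(list)
    for job in job_records:
        if job["user"] == user:
            delays[job["site"]].append(job["start_time"] - job["submit_time"])
    return {site: sum(d) / len(d) for site, d in delays.items()}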
34
Effective DAG Scheduling
[Chart: average DAG completion time (120 DAGs x 10 jobs/DAG) under completion-time based, queue-length based, # of CPUs based, and round-robin scheduling]
The completion time based algorithm here uses the Average Job Delay parameter for scheduling
As seen in the figure, it outperforms the algorithms tested with other monitored parameters
35
Work in Progress: Modeling Workflow Cost and
developing efficient provisioning algorithms
1. Developing an objective measure of completion time, integrating the performance and reliability of workflow execution:
P(time to complete >= T) <= epsilon
2. Relating this measure to the properties of the longest path of the DAG, based on the mean and uncertainty of the time required by the underlying tasks due to
1) variable time requirements arising from different parameter values
2) failures caused by changes in the underlying resources, etc.
3. Developing novel scheduling and replication techniques to optimize allocation based on these metrics.
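A toy sketch of estimating such a probability by Monte Carlo simulation over the tasks on the longest path; the distributions and numbers are invented, and an allocation would be acceptable when the estimate falls below epsilon:

# Estimate P(completion time >= T) for a chain of tasks with uncertain runtimes.
import random

tasks = [(10.0, 2.0), (25.0, 5.0), (5.0, 1.0)]   # (mean, std dev) per task on the path
T = 50.0
trials = 100_000

exceed = 0
for _ in range(trials):
    makespan = sum(max(0.0, random.gauss(mu, sigma)) for mu, sigma in tasks)
    if makespan >= T:
        exceed += 1

print(f"estimated P(time to complete >= {T}) = {exceed / trials:.3f}")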
36
Work in Progress: Provisioning algorithms
for multiple workflows (Yield
Management)
[Diagram: five workflow DAGs (Dag 1 through Dag 5), each with tasks arranged in levels 1 through 4, provisioned together]
• Quality of Service guarantees for each workflow
• Controlled (a cluster of multi-core processors) versus uncontrolled
(grid of multiple clusters owned by multiple units) environment
37
CHEPREO - Grid Education and
Networking
E/O Center in Miami area
Tutorial for Large Scale
Application Development
38
Grid Education
Developing a Grid tutorial as part of CHEPREO
Grid basics
Components of a Grid
Grid Services OGSA …
OSG summer workshop
South Padre Island, Texas, July 11-15, 2005
http://osg.ivdgl.org/twiki/bin/view/SummerGridWorkshop/
Lectures and Hands-on sessions
Building and Maintaining a Grid
39
Acknowledgements
CHEPREO project, NSF
GriPhyN/iVDGL, NSF
Data Mining Middleware, NSF
Intel Corporation
40
Thank You
May the Force be with you!
41
Additional slides
42
Effect of latency on Average Job Delay
[Scatter plots: turnaround time (sec) versus the Average Job Delay parameter value with added latencies of 5 and 10 minutes]
Latency is simulated in the system by purposely retrieving old values for the parameter while making scheduling decisions
The correlation indices with added latencies are comparable to, though lower (as expected) than, the correlation index of the ‘un-delayed’ Average Job Delay parameter. The amount of correlation is still quite high.
Correlation of Average Job Delay with turnaround time:
Added latency = 5 minutes: site rank 0.688959, raw value 0.582685, learning period 29 jobs
Added latency = 10 minutes: site rank 0.754222, raw value 0.777754, learning period 48 jobs
43
SPHINX Scheduling Latency
[Chart: average scheduling latency (seconds) versus job arrival rate (0.5 to 17 jobs/minute) for 20, 40, 80 and 100 DAGs]
Average scheduling latency for various numbers of DAGs (20, 40, 80 and 100) at different arrival rates per minute.
44
Demonstration at Supercomputing Conference:
Distributed Data Analysis in a Grid Environment
Graphical user interface for data analysis: ROOT
Grid enabled web service: Clarens
Virtual data service: Chimera (via Clarens)
Grid resource management service: VDT server (via Clarens)
Grid enabled execution service: VDT client (via Clarens)
Grid scheduling service: Sphinx (via Clarens)
Grid resource monitoring system: MonALISA
Replica location service: RLS
The architecture was implemented and demonstrated at SC03 (Phoenix, Arizona, 2003) and SC04.
45
Scheduling DAGs: Dynamic Critical
Path Algorithm
The DCP algorithm executes the following steps
iteratively:
1. Compute the earliest possible start time (AEST) and
the latest possible start time (ALST) for all tasks on
each processor.
2. Select a task which has the smallest difference
between its ALST and AEST and has no unscheduled
parent task. If there are tasks with the same
differences, select the one with a smaller AEST.
3. Select a processor which gives the earliest start time
for the selected task
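A compact sketch of the selection rule in these steps; it assumes AEST and ALST have already been computed for every unscheduled task, whereas the real DCP algorithm recomputes them per processor in each iteration:

# Simplified sketch of one DCP iteration's task and processor selection.
def select_next_task(tasks, scheduled):
    """tasks: dict name -> {'aest': float, 'alst': float, 'parents': [names]}."""
    ready = [
        (info["alst"] - info["aest"], info["aest"], name)
        for name, info in tasks.items()
        if name not in scheduled and all(p in scheduled for p in info["parents"])
    ]
    # Smallest ALST-AEST gap first (critical-path tasks); ties broken by smaller AEST.
    return min(ready)[2] if ready else None

def select_processor(task_name, processors, earliest_start):
    """Pick the processor giving the earliest start time for the selected task."""
    return min(processors, key=lambda p: earliest_start(task_name, p))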
46
Scheduling DAGs: ILP- Novel algorithm to
support heterogeneity
(work supported by Intel Corporation)
There are two novel features:
Assign multiple independent tasks simultaneously: the cost of an assigned task depends on the processor available, and many tasks commence with a small difference in start time.
Iteratively refine the schedule: the schedule is refined using the cost of the critical path based on the assignment in the previous iteration.
47
Comparison of different algorithms
[Charts: scheduling length for ICP (with thresholds Th=3, 5 and 7), DCP and HEFT; one for 2000 tasks on 30 processors, the other for 1000 to 4000 tasks on 30 processors]
48
Time for Scheduling
[Charts: scheduling time of ICP (with thresholds Th=3, 5 and 7), DCP and HEFT; one versus number of tasks (1000 to 4000), the other versus number of processors (10 to 80)]
49