Sphinx Server - University of Florida

Sphinx: A Scheduling Middleware for
Data Intensive Applications on a Grid
Richard Cavanaugh
University of Florida
Collaborators:
Janguk In, Sanjay Ranka, Paul Avery, Laukik Chitnis,
Gregory Graham (FNAL), Pradeep Padala, Rajendra Vippagunta,
Xing Yan
18.09.2003
Data Mining and Exploration Middleware for Distributed and
Grid Computing – University of Minnesota
The Problem of
Grid Scheduling
o Decentralised ownership
o No one controls the grid
o Heterogeneous composition
o Difficult to guarantee execution environments
o Dynamic availability of resources
o Ubiquitous monitoring infrastructure needed
o Complex policies
o Issues of trust
o Lack of accounting infrastructure
o May change with time
o Information gathering and processing is critical!
A Real Life Example
o Merge two grids into a single multi-VO “inter-grid”
o How to ensure that
o neither VO is harmed?
o both VOs actually benefit?
o there are answers to questions like:
o “With what probability will my job be scheduled and complete before my conference deadline?”
o Clear need for a scheduling middleware!
[Figure: sites in the two grids: UW, UC, UI, ANL, BU, UM, MIT, BNL, FNAL, IU, LBL, Caltech, UCSD, OU, UTA, SMU, Rice, UF]
Some Requirements for
Effective Grid Scheduling
o Information requirements
o Past & future dependencies of the application
o Persistent storage of workflows
o Resource usage estimation
o Policies
o Expected to vary slowly over time
o Global views of job descriptions
o Request Tracking and Usage Statistics
o State information important
o Resource Properties and Status
o Expected to vary slowly with time
o Grid weather
o Latency measurement important
o Replica management
o System requirements
o Distributed, fault-tolerant scheduling
o Customisability
o Interoperability with other scheduling systems
o Quality of Service
Incorporate Requirements
into a Framework
o Assume the GriPhyN Virtual Data Toolkit:
o Client (request/job submission)
o Globus clients
o Condor-G/DAGMan
o Chimera Virtual Data System
o Server (resource gatekeeper)
o Globus services
o RLS (Replica Location Service)
o MonALISA Monitoring Service
o etc
[Diagram: a VDT Client and three VDT Servers, with “?” between them]
Incorporate Requirements
into a Framework
o Framework design principles:
o Information driven
o Flexible client-server model
o General, but pragmatic and simple
o Implement now; learn; extend over time
o Avoid adding middleware requirements on grid resources
o Take what is offered!
o Assume the GriPhyN Virtual Data Toolkit:
o Client (request/job submission)
o Clarens Web Service
o Globus clients
o Condor-G/DAGMan
o Chimera Virtual Data System
o Server (resource gatekeeper)
o MonALISA Monitoring Service
o Globus services
o RLS (Replica Location Service)
[Diagram: a VDT Client, a Scheduler, and three VDT Servers]
The Sphinx Framework
[Architecture diagram. VDT Client: Sphinx Client, Chimera Virtual Data System, Condor-G/DAGMan. Clarens WS Backbone. Sphinx Server: Request Processing, Data Warehouse, Data Management, Information Gathering. VDT Server Site: Globus Resource, Replica Location Service, MonALISA Monitoring Service.]
Sphinx Scheduling Server
o Functions as the Nerve Centre
o Data Warehouse
o Policies, Account Information, Grid Weather, Resource Properties and Status, Request Tracking, Workflows, etc
o Control Process
o Finite State Machine
o Different modules modify jobs, graphs, workflows, etc and change their state
o Flexible
o Extensible
[Diagram: Sphinx Server. Control Process modules: Message Interface, Graph Reducer, Job Predictor, Graph Predictor, Job Admission Control, Graph Admission Control, Graph Data Planner, Job Execution Planner, Graph Tracker. Also shown: Data Warehouse, Data Management, Information Gatherer.]
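The finite-state-machine idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Sphinx implementation: the state names, the `Job` class, the module functions and the fixed 10 CPU-hour admission limit are all hypothetical, chosen to echo the module labels on the slide.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class State(Enum):
    # Hypothetical state names; the slides do not enumerate the real ones.
    SUBMITTED = auto()
    REDUCED = auto()
    PREDICTED = auto()
    ADMITTED = auto()
    PLANNED = auto()
    REJECTED = auto()


@dataclass
class Job:
    name: str
    cpu_hours_requested: float
    state: State = State.SUBMITTED
    history: list = field(default_factory=list)


# Each module inspects a job in one state and returns the next state.
def reducer(job):
    return State.REDUCED


def predictor(job):
    return State.PREDICTED


def admission_control(job, cpu_hours_available=10.0):
    # Assumed rule: admit only if the request fits the available budget.
    ok = job.cpu_hours_requested <= cpu_hours_available
    return State.ADMITTED if ok else State.REJECTED


def execution_planner(job):
    return State.PLANNED


# Which module owns which state; extending the server means adding an entry.
MODULES = {
    State.SUBMITTED: reducer,
    State.REDUCED: predictor,
    State.PREDICTED: admission_control,
    State.ADMITTED: execution_planner,
}


def run_control_process(job):
    """Hand the job to the module that owns its state until no module does."""
    while job.state in MODULES:
        job.history.append(job.state)
        job.state = MODULES[job.state](job)
    return job
```

Calling `run_control_process(Job("small", 2.0))` walks the job through every module to `State.PLANNED`, while a 50 CPU-hour request stops at `State.REJECTED`.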
Policy Constraints
o Defined by Resource Providers
o Actual grid sites (resource centres)
o VO management
o Applied to Request Submitters
o VO, group, user, or even a proxy request (e.g. workflow)
o Valid over a Period of Time
o Can be dynamic (e.g. periodic) or constant
o Global accounting and book-keeping is necessary
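One way to picture a policy constraint is as a quota that a provider grants to a submitter for a time window, checked against a global usage ledger. The sketch below assumes exactly that; the `Policy` and `Ledger` classes, the CPU-hour quota and the `try_admit` function are hypothetical names, not part of Sphinx.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Policy:
    provider: str         # grid site or VO management that defines the policy
    submitter: str        # VO, group, user, or workflow it applies to
    cpu_hours: float      # assumed quota over the validity window
    valid_from: datetime
    valid_until: datetime

    def applies(self, provider, submitter, when):
        return (self.provider == provider
                and self.submitter == submitter
                and self.valid_from <= when < self.valid_until)


class Ledger:
    """Global book-keeping of usage charged against each policy."""

    def __init__(self):
        self._used = {}

    def used(self, policy):
        return self._used.get(policy, 0.0)

    def charge(self, policy, cpu_hours):
        self._used[policy] = self.used(policy) + cpu_hours


def try_admit(policies, ledger, provider, submitter, cpu_hours, when):
    """Charge the first valid policy with enough quota left; else refuse."""
    for policy in policies:
        if not policy.applies(provider, submitter, when):
            continue
        if ledger.used(policy) + cpu_hours <= policy.cpu_hours:
            ledger.charge(policy, cpu_hours)
            return True
    return False
```

A periodic policy could be modelled here as a series of `Policy` objects with consecutive validity windows.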
Quality of Service
o For grid computing to become economically viable, a
Quality of Service is needed
o “Can the grid possibly handle my request within my required
time window?”
o If not, why not? When might it be able to accommodate
such a request?
o If yes, with what probability?
o But, grid computing today typically:
o Relies on “greedy” job placement strategies
o Works well in a resource rich (user poor) environment
o Assumes no correlation between job placement choices
o Provides no QoS
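The questions above ("with what probability?", "when might it be able to?") could be answered empirically from request-tracking history. The sketch below is a simple illustration, assuming a list of past turnaround times (queue wait plus run time, in hours) for comparable requests; the function names are hypothetical and no claim is made that Sphinx computes QoS this way.

```python
def completion_probability(past_turnaround_hours, window_hours):
    """Fraction of comparable past requests that finished within the window."""
    if not past_turnaround_hours:
        raise ValueError("no history available")
    hits = sum(1 for t in past_turnaround_hours if t <= window_hours)
    return hits / len(past_turnaround_hours)


def window_for_probability(past_turnaround_hours, target):
    """Smallest observed window whose empirical probability reaches target."""
    if not past_turnaround_hours:
        raise ValueError("no history available")
    if not 0 < target <= 1:
        raise ValueError("target must be in (0, 1]")
    ordered = sorted(past_turnaround_hours)
    for rank, hours in enumerate(ordered, start=1):
        if rank / len(ordered) >= target:
            return hours
```

With a history of `[2, 3, 5, 8, 12]` hours, a 6-hour window gives an estimated probability of 0.6, and a 50% target is first met by a 5-hour window.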
Quality of Service
o As a grid becomes resource limited,
o QoS becomes even more important!
o “greedy” strategies may not be a good choice
o Strong correlation between job placement choices
o Sphinx is designed to provide QoS through time-dependent, global views of
o Requests (workflows, jobs, allocation, etc)
o Policies
o Resources
Resource Usage Estimation
o User Requirements
o Upper limits on CPU, memory, storage, bandwidth usage
o Domain Specific Knowledge
o Applications are often known to depend logarithmically,
linearly, etc on certain input parameters, data size or type
o Historical Estimates
o Record the performance of all applications
o Statistically estimate resource usage within some confidence
level
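A historical estimate "within some confidence level" could look like the sketch below. It assumes, purely for illustration, that recorded usage is roughly normally distributed, and takes the fitted quantile as a one-sided upper bound, capped by the user's stated upper limit. The function name and parameters are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def estimate_upper_bound(samples, confidence=0.95, user_limit=None):
    """Upper bound on resource usage from past runs of the same application.

    samples     recorded usage of earlier runs (e.g. CPU hours)
    confidence  one-sided confidence level, strictly between 0 and 1
    user_limit  optional upper limit supplied by the user
    """
    if len(samples) < 2:
        raise ValueError("need at least two recorded runs")
    z = NormalDist().inv_cdf(confidence)   # quantile of the standard normal
    bound = mean(samples) + z * stdev(samples)
    if user_limit is not None:
        bound = min(bound, user_limit)     # never exceed what the user allows
    return bound
```

Domain-specific knowledge, such as a known linear dependence on input size, could replace `mean(samples)` with a fitted model evaluated at the new input.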
Data Management
o Smart Replication:
o Graph based
o Examine and insert replication nodes to
minimise overall completion time
o Distribute and collect required data
o Particularly useful in data parallelism
o “Hot Spot” based
o Monitor current and historical data access
patterns and replicate to optimise future
access
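The “Hot Spot” idea can be illustrated with a simple access counter. The sketch assumes an access log of `(file, requesting_site)` pairs and a fixed threshold; both the data layout and the function name are hypothetical, and a real planner would also weigh transfer cost and storage limits.

```python
from collections import Counter


def plan_hot_spot_replicas(access_log, replicas, threshold):
    """Suggest (file, site) replicas for frequently requested remote data.

    access_log  iterable of (file, requesting_site) pairs
    replicas    dict mapping file -> set of sites already holding a copy
    threshold   minimum number of accesses that makes a pair a hot spot
    """
    counts = Counter(access_log)
    plan = []
    for (file, site), accesses in counts.items():
        already_there = site in replicas.get(file, set())
        if accesses >= threshold and not already_there:
            plan.append((file, site))
    return sorted(plan)
```

For example, if file `a` is held only at FNAL but has been read three times from UF, a threshold of 3 yields the single suggestion `("a", "UF")`.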
Early Sphinx Prototype
Test Results
o Simple sanity checks
o 120 canonical virtual data workflows
submitted to US-CMS Grid
o Round-robin strategy
o Equally distribute work to all sites
o Upper-limit strategy
o Makes use of global information (site
capacity)
o Throttle jobs using just-in-time planning
o 40% better throughput (given grid
topology)
o Conclusion: Prototype is working!
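The difference between the two strategies can be sketched as follows. This is an illustrative reading of the slide, not the prototype's code: the function names are hypothetical, capacity is assumed to be a number of job slots per site, and "just-in-time planning" is modelled simply as holding jobs back until a slot is free.

```python
from itertools import cycle


def round_robin(jobs, sites):
    """Distribute jobs to all sites in turn, ignoring site capacity."""
    assignment = {site: [] for site in sites}
    for job, site in zip(jobs, cycle(sites)):
        assignment[site].append(job)
    return assignment


def upper_limit(jobs, capacity):
    """Place each job on the site with the most free slots.

    capacity maps site -> number of slots (the global information).
    Jobs that find every site full are held back and would be planned
    later, once running jobs complete.
    """
    running = {site: [] for site in capacity}
    held = []
    for job in jobs:
        free = {s: capacity[s] - len(running[s]) for s in capacity}
        site = max(free, key=free.get)
        if free[site] > 0:
            running[site].append(job)
        else:
            held.append(job)
    return running, held
```

With six jobs and capacities `{"A": 3, "B": 1}`, round-robin sends three jobs to each site regardless of capacity, whereas the upper-limit sketch runs three on A, one on B, and holds two.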
Some Current and Future
Activities
o Policy Based Scheduling
o Quality of Service
o Graph Partitioning
o Data Parallelism
o Prediction Module
o Useful Views and Fusion of Monitoring Data
Conclusions
o Scheduling on a grid has unique requirements
o Information
o System
o Decisions based on global views providing a Quality of
Service are important
o Particularly in a resource limited environment
o Sphinx is an extensible, flexible grid middleware which
o Already implements many required features for effective
global scheduling
o Provides an excellent “workbench” for future activities!