Distributed Systems - University of Alabama in Huntsville
Software for Distributed Systems
Distributed Systems – Case Studies
• NOW: a Network of Workstations
• Condor: High Throughput Computing
• MOSIX: A Distributed System for Clusters
Projects
• NOW – a Network Of Workstations
University of California, Berkeley
Terminated about 1997 after demonstrating the
feasibility of their approach
• Condor – University of Wisconsin-Madison
Started about 1988
Is now an ongoing, world-wide system of
shared computing clusters.
• MOSIX – clustering software for Linux machines
NOW: a Network of Workstations
http://now.cs.berkeley.edu/
• Scale: A NOW system consists of a building-wide
collection of machines providing memory, disks, and
processors.
• Basic Ideas
– Use idle CPU cycles for parallel processing on clusters of
workstations
– Use memories as disk cache to break the I/O bottleneck
(slow disk access times)
– Share the resources over fast LANs
– Premise: Access to another processor
via LAN would be faster than access
to local disk.
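A quick sanity check of that premise, using assumed figures (not NOW measurements) for a mid-1990s disk and a fast switched LAN:

# Back-of-envelope comparison of fetching an 8 KB page from a local disk
# versus from an idle peer's RAM over a fast LAN. All figures are assumed
# for illustration; they are not measurements from the NOW project.

PAGE_BYTES = 8 * 1024

# Assumed local disk: ~10 ms seek + rotational delay, 5 MB/s transfer rate
disk_latency_s = 0.010
disk_bandwidth = 5e6                       # bytes/second
disk_time = disk_latency_s + PAGE_BYTES / disk_bandwidth

# Assumed fast LAN (e.g., a switched 155 Mb/s link): ~50 us round trip;
# the remote DRAM access itself is negligible at this scale
net_latency_s = 50e-6
net_bandwidth = 155e6 / 8                  # bytes/second
net_time = net_latency_s + PAGE_BYTES / net_bandwidth

print(f"local disk : {disk_time * 1e3:.2f} ms")   # about 11.6 ms
print(f"remote RAM : {net_time * 1e3:.2f} ms")    # about 0.47 ms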
NOW “Opportunities”: Memory
• Network RAM: fast networks, high bandwidth
make it reasonable to page across the network.
– Instead of paging out to slow disks, send over fast
networks to RAM in an idle machine
• Cooperative file caching: improve performance by
using network RAM as a very large file cache
– Shared files can be fetched from another client’s
memory rather than server’s disk
– Active clients can extend their disk cache size by using
memory of idle clients.
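A minimal sketch of the cooperative-caching lookup order just described – local memory, then a peer's memory, then the server's disk. The class and names below are hypothetical stand-ins, not NOW/xFS interfaces.

# Sketch of cooperative file caching: prefer any client's RAM over the
# server's disk. Names (PeerCache, read_block) are hypothetical stand-ins.

class PeerCache:
    def __init__(self, local_cache, peers, server):
        self.local_cache = local_cache   # dict: block id -> bytes
        self.peers = peers               # other clients' caches (dicts here)
        self.server = server             # dict standing in for the server's disk

    def read_block(self, block_id):
        # 1. Local memory hit: fastest case.
        if block_id in self.local_cache:
            return self.local_cache[block_id]
        # 2. Network RAM: fetch the block from an idle peer's memory.
        for peer in self.peers:
            if block_id in peer:
                data = peer[block_id]
                self.local_cache[block_id] = data
                return data
        # 3. Miss everywhere: fall back to the server's (slow) disk.
        data = self.server[block_id]
        self.local_cache[block_id] = data
        return data

# Example: block "b1" is cached only on a peer, so no disk access is needed.
cache = PeerCache({}, peers=[{"b1": b"hello"}], server={"b1": b"hello", "b2": b"world"})
print(cache.read_block("b1"))   # served from a peer's RAM
print(cache.read_block("b2"))   # served from the server's disk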
NOW Opportunities: “RAWD”
(Redundant Arrays of Workstation Disks)
• RAID systems provide fast performance by
connecting arrays of small disks. By reading/
writing data in parallel, throughput is
increased.
• Instead of a hardware RAID, build software
version by reading/writing data across the
workstations in the network
– Especially useful for parallel programs running on
separate machines in the network
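A toy sketch of the software-striping idea: split each write into fixed-size stripe units, spread them round-robin over the workstations' disks (plain dictionaries stand in for per-node storage here), and reassemble on read. Illustrative only; not the NOW implementation.

# Toy software striping across N workstation "disks".

STRIPE = 4  # stripe unit in bytes (tiny, for demonstration)

def write_striped(disks, name, data):
    """Split data into STRIPE-sized units and place them round-robin."""
    for i in range(0, len(data), STRIPE):
        node = (i // STRIPE) % len(disks)
        disks[node][(name, i // STRIPE)] = data[i:i + STRIPE]

def read_striped(disks, name, length):
    """Reassemble the file by reading each stripe from its node."""
    chunks = []
    for s in range((length + STRIPE - 1) // STRIPE):
        node = s % len(disks)
        chunks.append(disks[node][(name, s)])
    return b"".join(chunks)

disks = [dict() for _ in range(3)]          # three workstation disks
payload = b"parallel reads and writes raise throughput"
write_striped(disks, "f", payload)
assert read_striped(disks, "f", len(payload)) == payload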
NOW Opportunities: “Parallel
Computing”
• Harnessing the power of multiple idle workstations in a
NOW can support high-performance parallel applications.
• NOW principles:
– avoid going to disk by using RAM on other network nodes (assumes
network access is faster than disk access)
– Further speedup may be achieved by parallelizing the computation
and striping the data to multiple disks.
– allow user processes to access the network directly rather than
going through the operating system (one way of reducing access times)
Berkeley NOW Features
• GLUnix (Global Layer UNIX) is a layer on top of
UNIX OS’s running on the workstations
• Applications running on GLUnix have a
protected virtual operating system layer which
catches UNIX system calls and translates them
into GLUnix calls.
• Serverless Network File System – xFS
– Avoided central server bottleneck
– Cooperative file system (basically, peer-to-peer)
GLUnix Sociology
• Their biggest problem: the reluctance of users to share
their computing resources; e.g., will it affect my interactive
response time?
• Will it mean my big job only gets to run at night?
• They guaranteed each active user full workstation capability by
– migrating guest processes away when the user returns;
– saving the user's state before accepting a guest process, so that the file cache and memory contents could be restored when the user returns, thus avoiding the performance hit of re-acquiring the process's working set.
• They planned to dedicate a significant share of system-wide resources to large problems, so that NOW could compete with existing MPPs on performance.
Summary
• Successful, in their opinion.
• Ran for several years in the late 90s on the
Berkeley CS system
• Key enabling technologies
– Scalable, high performance network
– Fast access to the network for user processes
– Global operating system layer to support system
resources as a true shared pool.
CONDOR – pre 2012
http://research.cs.wisc.edu/condor/
• Goal: “…to develop, implement,
deploy, and evaluate
mechanisms and policies that
support High Throughput
Computing (HTC) on large
collections of distributively
owned computing resources”
HTCONDOR – post 2012
http://research.cs.wisc.edu/htcondor/
• Goal: “…to develop, implement,
deploy, and evaluate
mechanisms and policies that
support High Throughput
Computing (HTC) on large
collections of distributively
owned computing resources”
(note that the goal remains
unchanged)
CONDOR Definitions
• High Throughput Computing (HTC) – “… problems that require
weeks or months of computation to solve. …this
type of research needs a computing
environment that delivers large amounts of
computational power over a long period of
time.”
• Compare to High Performance computing (HPC)
which “…delivers a tremendous amount of
power over a short period of time.”
Overview
• Condor can be used to manage computing
clusters. It is designed to take advantage of
idle machines
• Condor lets users submit many (batch) jobs at
the same time. Result: tremendous amounts
of computation with very little user
intervention.
• No need to rewrite code - just link to Condor
libraries
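As a concrete illustration of submitting many batch jobs with little intervention, the sketch below generates a basic Condor/HTCondor submit description for a 100-job parameter sweep; the executable name my_sim is hypothetical, and the resulting file would then be handed to condor_submit.

# Generate a minimal submit description for a 100-job sweep.
# "my_sim" is a hypothetical executable; executable, arguments, output,
# error, log, and queue are standard submit-file keywords.

N_JOBS = 100
submit_description = f"""\
universe   = vanilla
executable = my_sim
arguments  = $(Process)
output     = my_sim.$(Process).out
error      = my_sim.$(Process).err
log        = my_sim.log
queue {N_JOBS}
"""

with open("sweep.sub", "w") as f:
    f.write(submit_description)
# Submitting is then a single command:  condor_submit sweep.sub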
Overview
• Condor runs on most contemporary operating
systems: Windows, UNIX, Linux, MacOS, etc.
• Several clusters can be combined into a
“flock”, to achieve grid processing capabilities.
• Globus tools are used.
Features
• Checkpoints: save complete state of the
process
– Critical for programs that run for long periods of
time to recover from crashes or to vacate
machines whose user has returned, or for process
migration due to other reasons.
• Remote system calls: data resides on the
home machine and system calls are directed
there. Provides protection for host machines.
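Condor performs checkpointing transparently through the library a job is linked against; the idea itself can be sketched as a long-running loop that periodically saves its state and, on restart, resumes from the last saved state (the file name and interval below are arbitrary illustrative choices).

# Sketch of the checkpoint idea: periodically persist enough state to
# resume after a crash or eviction. An illustration only, not the
# mechanism inside the Condor checkpoint library.
import os, pickle

CKPT = "work.ckpt"           # arbitrary checkpoint file name
CHECKPOINT_EVERY = 100_000   # iterations between checkpoints

def load_state():
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)        # resume where we left off
    return {"i": 0, "total": 0}          # fresh start

def save_state(state):
    with open(CKPT, "wb") as f:
        pickle.dump(state, f)

state = load_state()
while state["i"] < 1_000_000:            # the "weeks of computation", scaled down
    state["total"] += state["i"]
    state["i"] += 1
    if state["i"] % CHECKPOINT_EVERY == 0:
        save_state(state)                # safe point to vacate the machine
print(state["total"])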
Features
• Jobs can run anywhere in the cluster (which can be
a physical cluster, a virtual cluster, or even a single
machine)
• Different machines have different capabilities;
when submitting a job Condor users can specify
the kind of machine they wish to run on.
• When sets of jobs are submitted it’s possible to
define dependencies; i.e., “don’t run Job 3 until
jobs 1 and 2 have completed.”
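In Condor such dependencies are handled by the DAGMan tool; the sketch below only shows the underlying idea of running a job once all of its declared parents have finished (the job names and the run_job stand-in are placeholders).

# Sketch of "don't run Job 3 until Jobs 1 and 2 have completed":
# a topological pass over a small dependency graph.

jobs = {
    "job1": [],                 # no prerequisites
    "job2": [],
    "job3": ["job1", "job2"],   # depends on both job1 and job2
}

def run_job(name):
    print(f"running {name}")    # stand-in for submitting the real job

done = set()
while len(done) < len(jobs):
    for name, deps in jobs.items():
        if name not in done and all(d in done for d in deps):
            run_job(name)
            done.add(name)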
Slides
• From a talk by Myron Livny, “The Principles and Power
of Distributed Computing”,
International Winter School on Grid Computing 2010.
http://research.cs.wisc.edu/condor/talks.html
• Livny is a professor at the U of Wisconsin-Madison
where he heads the Condor Project, and other
grid/distributed computing projects or centers.
• Paradyn: related research; tools for performance measurement of long-running distributed or parallel programs.
Condor Daemons
• [Slide image: woodcut, title unknown, by Hans Holbein the Younger, from Historiarum Veteris Testamenti icones, 1543]
Condor Daemons
• negotiator, collector, master, shadow, schedd, procd, startd, starter, kbdd, exec
Condor today
• http://research.cs.wisc.edu/condor/map/
MOSIX
The MOSIX Management System for
Linux Clusters, Multi-Clusters, GPU
Clusters, and Clouds: A White Paper
A. Barak and A. Shiloh, 2010
What is MOSIX?
• MOSIX (Multicomputer Operating System for
UnIX) is an on-line management system that acts
like a cluster operating system. [Barak and Shiloh]
• A multicomputer consists of a collection of
computers, each with its own memory, that
communicate over a network.
– It is sometimes called a distributed memory
multiprocessor.
• MOSIX targets High Performance Computing
(HPC) on Linux clusters, multi-clusters, and
clouds.
High Performance Computing
• HPC uses parallel processing to execute complex,
compute intensive applications reliably and quickly
– Sometimes synonymous with systems that operate above a
teraflop (10^12 flops) or that require supercomputers
– Scientific & academic research, engineering applications,
and the military are typical users.
• Not to be confused with High Throughput Computing
(HTC)
– HPC: tightly coupled components, needs to run in an
environment where communication is fast
– HTC: sequential batch jobs, can be scheduled on different
computers
MOSIX Overview
• Provides a single-system image to users and
applications – appears to be an SMP.
• Handles interactive and batch jobs
• Supports resource discovery and automatic
workload distribution across processors. In other
words, processes can migrate from the home
computer to other computers in the system
• Implemented as a set of utilities that provide a
Linux-like run-time environment.
System Configurations
• MOSIX clusters: connected computers (servers, workstations, or a combination) running Linux and having a single administrator.
• MOSIX multi-cluster: several MOSIX clusters,
probably within the same organization,
configured to work together
– MOSIX processes can run in any
of the clusters, using their home
cluster environment
System Configurations
• MOSIX Clouds: a collection of various things (MOSIX clusters and multi-clusters, Linux clusters, individual servers and workstations, etc.), each possibly running a different version of Linux and MOSIX
• MOSIX Cloud users can launch jobs from their
home computers to run on remote nodes.
The jobs have access to files on the remote
node and the home node.
Cluster Partitioning
• MOSIX clusters can be partitioned with each
partition being assigned to a different user or
designated as a general-use pool.
Processes
• MOSIX recognizes Linux processes as well as
MOSIX processes
• Linux processes run in native Linux mode and
cannot be migrated to other processors
– Generally used for administrative tasks
• MOSIX processes usually represent user
applications that might need to migrate
elsewhere in the system to get good service.
• Each MOSIX process has a unique home node.
MOSIX Features
• Automatic Resource Discovery
• Process Migration
• The Run-Time Environment
• The Priority Method
• Flood Control
Features: Automatic Resource
Discovery
• Resources include nodes and their individual
resources: current load, available memory, etc.
• Processors periodically assess their own
resource state and send this information, along
with recent information about other nodes, to
a random set of nodes in the system, using a
variation of randomized gossip dissemination.
• Result: all nodes have relatively current
information about resource availability.
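A small simulation of that style of randomized gossip, with made-up node loads, fan-out, and window size; it illustrates how information spreads, not MOSIX's exact dissemination algorithm.

# Toy randomized-gossip round: every node measures its own state and sends
# its freshest entries to a few random nodes. All parameters are assumed.
import random

N_NODES, FANOUT, WINDOW = 16, 3, 8

# Each node's view: {node_id: (age, load)}; lower age = fresher information.
views = {n: {n: (0, random.random())} for n in range(N_NODES)}

def gossip_round():
    for sender in range(N_NODES):
        view = views[sender]
        view[sender] = (0, random.random())                # refresh own state
        newest = sorted(view.items(), key=lambda kv: kv[1][0])[:WINDOW]
        for target in random.sample(range(N_NODES), FANOUT):
            for node, (age, load) in newest:
                old = views[target].get(node)
                if old is None or age + 1 < old[0]:
                    views[target][node] = (age + 1, load)  # keep the fresher entry
        for node in view:                                  # everything else ages
            age, load = view[node]
            view[node] = (0, load) if node == sender else (age + 1, load)

for _ in range(6):
    gossip_round()
# After a few rounds, most nodes know about most other nodes.
print(sum(len(v) for v in views.values()) / N_NODES, "entries per node on average")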
Features:
Process Migration
• MOSIX supports preemptive process migration,
meaning a process can be moved after it has
begun to execute
– This is more difficult than non-preemptive because
the entire process state must be moved
• Migration may be initiated by the user or
automatically by MOSIX
– Reasons: load balancing, moving a process closer to
resources it needs, etc.
Features: Process Migration
• Performance-monitoring algorithms run
continuously to profile processes and make
migration decisions.
– Process profiles consist of information such as size, rate
of system calls, how much I/O & IPC is generated, and
so forth.
• Migration decisions are based on the profiles and
on the resource status of cluster machines.
– Node speed and current load, process size versus
available memory, other process characteristics all
inform the decision.
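A hedged sketch of that kind of decision: given a hypothetical process profile and per-node status, pick the node with the lowest estimated cost, rejecting any node without enough free memory (the profile fields and the cost model are invented for illustration, not MOSIX internals).

# Illustrative migration decision: choose the node with the lowest
# estimated cost for this process, skipping nodes that lack memory.

process = {"mem_mb": 512, "cpu_bound": 0.9, "io_rate": 0.1}  # hypothetical profile

nodes = [   # hypothetical per-node status gathered by resource discovery
    {"name": "home", "speed": 1.0, "load": 0.8, "free_mem_mb": 1024, "remote": False},
    {"name": "n7",   "speed": 2.0, "load": 0.1, "free_mem_mb": 4096, "remote": True},
    {"name": "n12",  "speed": 2.0, "load": 0.2, "free_mem_mb": 256,  "remote": True},
]

MIGRATION_PENALTY = 0.3   # assumed one-time cost of moving the process state

def estimated_cost(proc, node):
    if node["free_mem_mb"] < proc["mem_mb"]:
        return float("inf")                        # never migrate into paging
    cpu_cost = (1 + node["load"]) / node["speed"]  # slower/busier nodes cost more
    io_cost = proc["io_rate"] * (2.0 if node["remote"] else 1.0)  # I/O goes home
    move_cost = MIGRATION_PENALTY if node["remote"] else 0.0
    return proc["cpu_bound"] * cpu_cost + io_cost + move_cost

best = min(nodes, key=lambda n: estimated_cost(process, n))
print("run on", best["name"])   # "n7": fast, lightly loaded, enough memory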
Features: Process Migration
• The MOSIX layer supports process migration
by making it possible for processes to execute
as if they still were running on the home node.
• System calls are intercepted by MOSIX and
forwarded to the home node when necessary
(most of the time).
• Programmers and users don’t need to modify a program to make it migratable.
Features: Process Migration
• The down-side of process migration is added overhead
• The developers evaluated performance by running
three programs, one CPU intensive, one using a small
amount of I/O, one using a significant amount of I/O.
• Each program was created as a Linux process, as a MOSIX process migrated to a node in the same cluster, and as a MOSIX process migrated to another cluster. Each program was then run several times.
• Results: The programs that were CPU intensive or used
only a little I/O ran efficiently in the local cluster and
the remote cluster, given a fast network (1Gb/s
Ethernet).
Features:
Process Migration
• Socket migration
– Socket: network communication
endpoint, represented by an IP
address and a port number
– Migratable sockets let processes communicate directly
without having to go through the home node
– This is accomplished by giving each process a
“mailbox” (message queue) to which other processes
can send messages
– Actual process location is thus transparent to the
communication mechanism
– In-order message reception is guaranteed
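A minimal sketch of the mailbox idea: messages are addressed to a location-independent process name, and the per-process queue preserves arrival order no matter where the process currently runs (the registry and names below are invented for illustration, not the MOSIX socket-migration code).

# Sketch of location-transparent "mailboxes": messages are addressed to a
# process name, not to the node it happens to be on, and arrive in FIFO order.
from collections import deque

mailboxes = {}                    # process name -> message queue (the "registry")

def register(proc_name):
    mailboxes[proc_name] = deque()

def send(dest_proc, msg):
    # The sender never needs to know which node dest_proc is running on;
    # migration only changes who drains the queue, not its name.
    mailboxes[dest_proc].append(msg)

def receive(proc_name):
    return mailboxes[proc_name].popleft()   # FIFO => in-order reception

register("worker-42")
send("worker-42", "chunk 1")
send("worker-42", "chunk 2")
# ... "worker-42" migrates to another node here; its mailbox name is unchanged ...
print(receive("worker-42"))       # chunk 1
print(receive("worker-42"))       # chunk 2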
Features: Process Migration
• Host nodes are protected from migrated
processes by the MOSIX software, which
guarantees a secure run-time environment
(sandbox).
• The only local resources accessible to the
migrated process are the CPU and the
memory assigned to the process.
Features: The Priority Method
• Guarantees that local processes and processes
with a higher priority can evict migrated
processes and low priority processes
• Cluster owners can reject processes originating in clusters with which they do not wish to share resources.
– NOW and Condor have
a similar philosophy
Features: Flood Control
• Flooding: a user (intentionally or
accidentally) creates a large
number of processes that threaten
to swamp the system.
• Control mechanisms:
– Load-balancing algorithm won’t migrate a process to a
node that doesn’t have enough memory
– Any node can set a limit on the number of processes it
will accept; attempts to create more will result in
“frozen” processes that are swapped out to disk until
there is room for them
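A toy sketch of the second mechanism: a node admits at most a fixed number of guest processes and "freezes" further arrivals until a slot opens (the limit and the freeze/thaw helpers are illustrative, not MOSIX parameters).

# Toy admission control: accept up to MAX_GUESTS migrated processes;
# any further arrivals are "frozen" (parked) until a running one finishes.
from collections import deque

MAX_GUESTS = 4
running, frozen = set(), deque()

def arrive(pid):
    if len(running) < MAX_GUESTS:
        running.add(pid)              # admitted immediately
    else:
        frozen.append(pid)            # swapped out until there is room

def finish(pid):
    running.discard(pid)
    if frozen:
        running.add(frozen.popleft()) # thaw the oldest frozen process

for pid in range(8):                  # a "flood" of 8 arrivals
    arrive(pid)
print(sorted(running), list(frozen))  # [0, 1, 2, 3] [4, 5, 6, 7]
finish(1)
print(sorted(running), list(frozen))  # [0, 2, 3, 4] [5, 6, 7]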
MOSIX Virtual OpenCL (VCL)
• Open Computing Language (OpenCL) supports
programs that execute on heterogeneous
platforms (CPUs, GPUs, and other processors)
– It consists of a language and an API that enables
platform definition and control
– It enables the use of graphics processors (GPUs) for
non-graphics computing
• MOSIX applications can create OpenCL
“contexts” that aggregate devices from
several nodes
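For reference, creating an OpenCL context over every visible device looks like the snippet below (using the pyopencl binding); the point of VCL is that the visible device list can include GPUs and CPUs that physically live on other cluster nodes.

# Ordinary OpenCL context creation over every visible device, via pyopencl.
import pyopencl as cl

devices = []
for platform in cl.get_platforms():
    devices.extend(platform.get_devices())   # CPUs, GPUs, accelerators

ctx = cl.Context(devices)                    # one context over all of them
queues = [cl.CommandQueue(ctx, device=d) for d in devices]
for d in devices:
    print(d.name)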
MOSIX Reach the Clouds (MRC)
• MRC allows applications to run in a MOSIX
cloud without having to pre-copy their files.
• They can use files from their home system and
files from the target node.
• The idea is to support file-sharing,
and to allow applications to run
in the cloud without having to
store their data there.
Checkpoint and Recovery
• A process checkpoint is a copy of the program’s state at
a given point in time.
• The purpose is to be able to stop a program and restart
it later, or to recover from an error by “rolling back” to
a previous point and resuming execution.
• It is an important tool for migratory environments
because it is one way to stop a process on one machine
and start it on another. It’s particularly important if
processes are sometimes “frozen”
• MOSIX provides checkpointing for most processes, but
some may not run correctly after being recovered.
Other Features
• MOSIX supports batch jobs (hard to migrate
interactive jobs!)
• MOSIX can run directly on top of Linux, or in a
Virtual Machine.
Conclusion
• MOSIX provides an operating system-like
management system for sharing resources in
Linux clusters, multi-clusters, and clouds
• Gives the illusion of running on a multiprocessor
• Users don’t need to modify applications
• Important features include resource discovery,
dynamic workload distribution, ability for
processes to migrate to location of available
resources, and other features such as flood
control that protect both nodes and processes.
Dandelion: Compiler and Runtime for
Heterogeneous Systems
Christopher J. Rossbach et al., SOSP ’13
• Designed for heterogeneous computer
systems running data-parallel applications
– General purpose computers, clouds, multicore
with CPUs and GPUs, FPGAs
• “…automatically and transparently distributes
data-parallel portions of a program to
available computing resources, …”
• For the .NET environment; programmers
typically use C# or F#
Dandelion
• Provides a single-system image
• Programmer writes a sequential program; the system parallelizes it and distributes it across currently available resources.
• Complex – programming language, compilers,
and runtime systems must cooperate
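Dandelion itself targets .NET programs written in C# or F#, but the programming-model idea – write what reads as a sequential computation and let the runtime spread the data-parallel part over available resources – can be hinted at with Python's multiprocessing pool; the functions below are placeholders, not Dandelion APIs.

# The programmer writes what reads as a sequential map over the data; the
# runtime (here, a local process pool standing in for Dandelion's cluster
# and GPU back ends) decides how to spread the work.
from multiprocessing import Pool

def simulate(record):                 # placeholder for a data-parallel kernel
    return record * record

if __name__ == "__main__":
    data = range(1_000_000)
    with Pool() as pool:              # "available computing resources"
        results = pool.map(simulate, data, chunksize=10_000)
    print(sum(results))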