Distributed Operating Systems


P00603 Lectures 2
Distributed Systems
C Cox
Distributed systems
Evolution
Principles
Pervasive computing
Transparency
Distributed file systems
Distributed operating systems
Recovery and fault tolerance
Reading
A.S. Tanenbaum and M. van Steen (2003) Distributed Systems: Principles and Paradigms, Prentice Hall. **
G. Coulouris, J. Dollimore and T. Kindberg (2005) Distributed Systems: Concepts and Design, 4th Edition, Addison Wesley. **
B. Wilkinson and M. Allen (2004) Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, Prentice Hall. *
Middleware Architecture with Patterns and Frameworks,
http://proton.inrialpes.fr/~krakowia/MW-Book/Chapters/Intro/intro.html
1
What are distributed computer systems?
Compare: Centralised Systems
• One system with non-autonomous parts
• System shared by users all the time
• All resources accessible
• Software runs in a single process
• (Often) a single physical location
• Single point of control (and management)
• Single point of failure
2
Distributed Computer Systems - 1
• Multiple autonomous components
• Components shared by users
• Resources may not be accessible
• Software can run in concurrent processes on different processors
• (Often) multiple physical locations
• Multiple points of control
• Multiple points of failure
• No global time
• No shared memory
3
Distributed Computer Systems - 3
Tanenbaum definition:
A distributed system is:
"a collection of independent computers that appears to its users as a single coherent system"
(the idea of a virtual computer)
. . . autonomous computers
. . . connected by a network
. . . specifically designed to provide an integrated computing environment
4
Evolution of Distributed Computer Systems
• 1960 – centralized mainframe: centralized control, happy system manager, sad users, single point of failure
• 1970 – localized minicomputers: localized control
• 1980 – decentralized PCs: user control, sad system manager, happy users
• 1985 – networked PCs on LAN and WAN: client-server computing
• 1990 – distributed systems: distributed management, distributed applications, middleware and virtual computing, happy system manager, happy users
• 2000 – internet computing, grid, web services, cluster and cloud computing
• 2010 – mobile, ubiquitous and pervasive computing
5
Distributed Computing – Notable Developments
• 1970s – email, electronic data interchange (EDI), Ethernet
• 1980s – the PC, client-server computing, RPC
• 1990s – WWW, PVM, MPI, CORBA, XML, GRID
• 2000s – .NET, Web Services, SOAP, pervasive computing, Grid
6
Computer processor performance evolution
(see http://www.top500.org/)
7
Motivation for Distributed Computer Systems
• High cost of a powerful single processor – it is cheaper (£/MIP) to buy many small machines and network them than to buy a single large machine. Since 1980 computer performance has increased by roughly 1.5× per year
• Share resources
• Distributed applications and mobility of users
• Efficient, low-cost networks
• Availability and reliability – if one component fails the system will continue
• Scalability – easier to upgrade the system by adding more machines than to replace the only system
• Computational speedup
• Service provision – need for resource and data sharing and remote services
• Need for communication
8
More complex distributed computing examples - 1
Computing dominated problems
(distributed processing)
• Computational Fluid Dynamics (CFD) and Structural Dynamics (using the Finite Element Method)
• Environmental and Biological Modeling – human genome project, pollution and disease control, traffic simulation, weather and climate modeling, global climate model
• Economic and financial modeling
• Graphics rendering for visualization
• Network simulation – telecommunications, power grid
9
More complex distributed computing examples - 2
Storage dominated problems
(distributed data)
• Distributed databases: Google BigTable, Amazon Dynamo,
Windows Azure
• Peer network data stores: BitTorrent, Chord, GNUnet
• Applications
• Data Mining
• Image Processing
• Seismic data analysis
• Insurance Analysis
10
More complex distributed computing examples - 3
Communications dominated problems
• Transaction processing – banks, credit cards, EFTPOS
• Video on Demand – (also Text, Image and Simulation on
demand)
• Electronic banking, electronic shopping
• Search engine e.g. Google
11
Ubiquitous Computing (or pervasive computing)
Concerned with the increased integration (and transparency) of computing devices
[Weiser, 1991]
Computers everywhere, but hidden (embedded into the environment)
• First Wave – Mainframe: 1 processor, N users
• Second Wave – PC: 1 processor, 1 user
  ... transition ... Distributed Computing: N users, N processors
• Third Wave – Ubiquitous Computing: 1 user, N processors
12
Pervasive computing - concept
Computers and computational devices are everywhere
• But they are not always obvious
• They become part of the environment
(Figure: Google Trends for "Pervasive Computing")
13
Computing: The Trend
14
Distributed Computer System Metrics
• Latency – network delay before any data is sent
• Bandwidth – maximum channel capacity (analogue communication: Hz, digital communication: bps)
• Granularity – relative size of the units of processing required. Distributed systems generally operate best with coarse-grained work, because communication is slow compared with processing
• Processor speed – MIPS, FLOPS
• Reliability – ability to continue operating correctly for a given time
• Fault tolerance – resilience to partial system failure
• Security – policy to deal with threats to the communication or processing of data in a system
• Administrative/management domains – issues concerning the ownership of and access to distributed system components
15
Distributed Computer System Architectures
• Flynn's classification (1966 and 1972) of computer systems in terms of instruction and data stream organizations
• Based on the von Neumann model (separate processor and memory units)
• Four machine organizations:
  • SISD – Single Instruction, Single Data
  • SIMD – Single Instruction, Multiple Data
  • MISD – Multiple Instruction, Single Data
  • MIMD – Multiple Instruction, Multiple Data
(Figure: taxonomy of computers – SISD (e.g. Pentium), SIMD (vector processors), MISD (??), MIMD (parallel machines, subdivided into SM and DM))
Distributed computers are essentially all MIMD machines:
• SM – shared memory, multiprocessor, e.g. SUN SPARC
• DM – distributed memory, multicomputer, e.g. Cray, LAN cluster
16
Flynn Architectures
(Figure: block diagrams of the four Flynn organizations)
• SISD – serial processor
• SIMD – array processor
• MISD – no real examples, possibly some pipeline architectures
• MIMD – multiprocessor and multicomputer
Key: CU – control unit, PU – processor unit, I – instruction stream, D – data stream
17
Network Computing and Supercomputing
(Figure: distributed computing shown as the overlap of supercomputing and network computing)
• Supercomputing / high performance computing – e.g. the driver/worker parallel computing model using PVM or MPI
• Network computing – typically client-server computing with sockets
18
Cluster Computing
(Figure: cluster computing shown as the overlap of supercomputing and distributed computing, enabled by LAN technology)
19
GRID (Meta)Computing
(Figure: grid (meta)computing shown as the overlap of supercomputing and distributed computing, enabled by WAN technology)
20
Operation Transparency in a Distributed System
• Access – hide differences in data representation and in how a resource is accessed, e.g. NFS access, SQL
• Location – hide where a resource is physically located, e.g. URL, tables in a distributed database
• Migration (or mobility) – hide that a resource may move to another location, e.g. mobile phone
• Relocation – hide that a resource may be moved to another location while in use
• Replication – if a resource is replicated among several locations, it should appear to the user as a single resource, e.g. distributed database, mirrored web site
Contd. . .
21
Contd. Operation Transparency in a Distributed System
• Scaling (or scalability) – users are unaware when the system size or specification changes (increase or decrease), e.g. the world-wide web, or as applications change – except for a change in quality of service
• Performance – users are unaware that the system is reconfigured to allow processing to be distributed automatically among the available processors
• Concurrency – hide that a resource may be shared by several competing users
• Failure – hide the failure and recovery of a resource, e.g. email
• Persistence – hide whether a (software) resource is in memory or on disk
22
Design Issues
Placement of Processes on Processors
Considers the required or optimal placement of processes, applications or
components onto processors, then considers the interrelation or communication
between processes or components
No Global Clock
Therefore synchronization strategies are required
Flexibility
Important that systems operate efficiently during the intended lifetime and that future
changes can be made easily
Failure Handling, Reliability and Fault Tolerance
Independent failures
Detection, masking, fault tolerance and recovery
Series reliability: Rseries = R1 × R2 × R3
Parallel reliability: Rparallel = 1 − (1 − R1)(1 − R2)(1 − R3)
Basis of reliability enhancement is replication for redundancy.
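As a quick check of these formulas, here is a small worked example (a Python sketch written for these notes, not part of the module material): three components, each with reliability 0.9.

    # Series vs. parallel reliability for independent components (illustrative only).
    def series_reliability(rs):
        r = 1.0
        for ri in rs:
            r *= ri          # all components must work
        return r

    def parallel_reliability(rs):
        f = 1.0
        for ri in rs:
            f *= (1.0 - ri)  # service fails only if all replicas fail
        return 1.0 - f

    rs = [0.9, 0.9, 0.9]
    print(series_reliability(rs))    # 0.729 - a chain is weaker than any single part
    print(parallel_reliability(rs))  # 0.999 - replication (redundancy) improves reliability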
Performance
In terms of metrics: throughput (jobs per hour) system and network utilization
Ability to utilize concurrent operation
Must include communication and synchronization delays
Must apply to all applications / overall system operation
Must apply to a range of job (grain) sizes
Need to consider Quality of Service, especially as service provider
23
Distributed Computing Paradigms
• Network programming – the client-server model using TCP or UDP, sockets and message passing (see the sketch after this slide)
• Concurrent programming – UNIX fork and threads; parallel programming using clusters of multicomputers and message-passing libraries, e.g. PVM, MPI
• Object-based systems – Java RMI, CORBA etc. (ref. Tanenbaum Ch. 9)
• Distributed file systems – e.g. NFS (ref. Tanenbaum Ch. 10)
• Document-based systems – e.g. Lotus Notes and WWW (ref. Tanenbaum Ch. 11)
• Distributed coordination-based systems – e.g. Linda, TIB, Java Jini (ref. Tanenbaum Ch. 12)
24
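To make the first paradigm concrete, here is a minimal client-server sketch in Python using TCP sockets (an illustrative example written for these notes; the port number and messages are arbitrary).

    # Minimal TCP echo server and client.
    # Run server() in one process/terminal and client() in another.
    import socket

    def server(port=5000):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind(("localhost", port))
            s.listen(1)
            conn, addr = s.accept()             # wait for one client
            with conn:
                data = conn.recv(1024)          # receive the request message
                conn.sendall(b"echo: " + data)  # send the reply

    def client(port=5000):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.connect(("localhost", port))
            s.sendall(b"hello")
            print(s.recv(1024))                 # b'echo: hello'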
Distributed File Systems
• A file system that allows files to be accessed by hosts sharing a network
• Clients do not have control of the file storage but access using a
communication protocol
• Transparency is essential functional requirement
• Distributed file systems may offer replication and hence fault
tolerance
• Distributed file systems usually use a LAN
• Distributed File Stores often use a WAN
Microsoft DFS Implementation
• Standalone – allows a DFS root on the local computer, e.g. Windows NT
• Domain-based – stores the DFS configuration within Active Directory
25
Hadoop File System (Apache Foundation)
• Hadoop Distributed File System (HDFS)
• Open source, derived from the Google File System (GFS)
• High fault tolerance, high throughput
• Client-server architecture
• Multiple servers, each storing part of the file system's data
• Fault detection and quick automatic recovery
• Good scalability (from a single server to thousands of machines)
• An HDFS cluster has one Namenode (to manage the file system name space, regulate access by clients and maintain metadata) and one Datanode per cluster node
• Uses replication: a file is sequenced into blocks, all the same size except the last; the default replication factor is 3. This provides fault tolerance (see the sketch after this slide)
26
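A toy sketch of the block-and-replica idea above (written for these notes, not real HDFS code; the block size, node names and round-robin placement are invented for illustration).

    # Toy illustration of block splitting and 3-way replica placement.
    import itertools

    BLOCK_SIZE = 4          # bytes, tiny for illustration (HDFS uses 64/128 MB blocks)
    REPLICATION = 3
    datanodes = ["dn1", "dn2", "dn3", "dn4"]

    def split_into_blocks(data, block_size=BLOCK_SIZE):
        # All blocks are the same size except (possibly) the last one.
        return [data[i:i + block_size] for i in range(0, len(data), block_size)]

    def place_replicas(blocks, nodes, replication=REPLICATION):
        placement = {}
        ring = itertools.cycle(nodes)               # naive round-robin placement
        for i, _ in enumerate(blocks):
            placement[i] = [next(ring) for _ in range(replication)]
        return placement

    blocks = split_into_blocks(b"hello distributed world")
    print(place_replicas(blocks, datanodes))        # block index -> list of 3 Datanodes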
Hadoop (more . . .)
• The HDFS communication protocol is layered on top of TCP/IP
• Uses RPC
• A Java API is available, with provision also for C and Python
• Download from Apache
• Runs on Windows, Unix, OS X
• Map and reduce – a mechanism to split processing across blocks and recombine the results (see the sketch after this slide)
• Hadoop clusters are used commercially, e.g. by Amazon, Facebook, AOL, Google, eBay
• Easiest implementation with write once, read many
27
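The map and reduce idea can be sketched in a few lines of Python (a toy word count written for these notes, not the Hadoop API): a map step produces per-block partial results, and a reduce step recombines them.

    # Toy map/reduce word count over independent "blocks" of text.
    from collections import Counter
    from functools import reduce

    blocks = ["the cat sat", "the dog sat", "the cat ran"]

    def map_block(block):
        return Counter(block.split())        # per-block word counts (the "map")

    def reduce_counts(c1, c2):
        return c1 + c2                       # merge partial results (the "reduce")

    partials = [map_block(b) for b in blocks]        # could run on different machines
    total = reduce(reduce_counts, partials, Counter())
    print(total)   # Counter({'the': 3, 'cat': 2, 'sat': 2, 'dog': 1, 'ran': 1})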
Distributed Operating Systems-1
Evolution:
Centralized (Uniprocessor) Operating System
• based on centralized systems
• resource management, process management, multitasking
• IPC, I/O, interrupt handling etc.
• E.g. MS-DOS, VAX VMS
Network Operating System
• extension of centralized operating systems
• offers local services to remote clients
• each processor has its own operating system
• a user owns a machine, but can access others (e.g. rlogin, telnet)
• no global naming of resources
• system has little fault tolerance
• e.g. UNIX, Windows NT, 2000 etc.
28
Uniprocessor Operating Systems
(Figure 1.11: separating applications from operating system code through a microkernel)
29
Network Operating System - structure
(Figure 1-19: general structure of a network operating system)
30
Distributed (Multicomputer) Operating Systems-2
Distributed Operating Systems
• Allows the resources of a multiprocessor or multicomputer network to be integrated into a single system image
• Hides and manages hardware and software resources
• Provides transparency support
• Provides heterogeneity support
• Controls the network in the most effective way
• Consists of low-level commands + local operating systems + distributed features
• Inter-process communication (IPC)
• Remote file and device access
• Global addressing and naming
• Trading and naming services
• Synchronization and deadlock avoidance
• Resource allocation and protection
• Global resource sharing
• Communication security
• No examples in general use, but many research systems: Amoeba, Chorus etc. (see Google "distributed systems research")
• A network operating system with middleware is popularly considered a distributed operating system
31
Multicomputer Operating Systems (1)
(Figure 1.14: general structure of a multicomputer operating system)
32
Middleware acting as a distributed operating system
(Figure 1.1: a middleware layer on top of a network OS, implementing general-purpose services to provide distribution transparency)
33
The Amoeba Distributed Operating System
• Developed by A.S. Tanenbaum (1983 onwards) as a research tool. It uses a large number of CPUs, communicates via RPC and provides relatively good distribution transparency and security with efficient communication, but suffers from a lack of user control
• The aim was to develop a transparent distributed operating system for heterogeneous workstation and/or processor-pool networks
• Personal multiprocessor (rather than networked multicomputer) concept
• Individual processors are not owned; users log onto the system anywhere
• The system allocates processors dynamically as needed, so the system looks like a single virtual machine. Amoeba takes each user command and then determines which CPU to execute it on
• Written in C, but has its own parallel and distributed programming language, Orca
34
Replication of Data - maintaining copies on multiple computers
(e.g. Distributed Database)
Requirements
• Replication transparency – clients unaware of multiple copies
• Consistency of copies
Benefits
• Performance enhancement, e.g. replicate heavily loaded servers
• Reliability enhancement
• Data closer to the client
• Shared workload
• Increased availability: 1 − p^n, where p is the probability that a single replica is unavailable and n is the number of replicas
• Increased fault tolerance
Constraints
• How to keep the data consistent (need to ensure a satisfactorily consistent image for clients)
• Where to place replicas and how updates are propagated
• Scalability
35
Data Centric Consistency Models
Distributed Data e.g. DFS, Distributed Database, Distributed Shared Memory
• Operate on a single (virtual) data store
• With processes on different processors, the lack of a global clock makes absolute synchronisation difficult
Consistency Models
• Strict – a read must return the value of the most recent write (so generally impossible in practice)
• Linearizable and sequential – maintain causality using synchronised clocks
• Causal and FIFO – weaker consistency. Writes from the same processor are seen in the same order; writes from different processors are not guaranteed to be
36
Fault Tolerant Services
• Improve availability/fault tolerance using replication
• Provide a service with correct behaviour despite n process/server failures, as if there were only one copy of the data
• Use of replicated services
• Operations need to be linearizable and sequentially consistent when dealing with distributed read and write operations (see Coulouris)
Fault Tolerant System Architectures
• Client (C)
• Front End (FE) = client interface
• Replica Manager (RM) = service provider
37
Passive Replication
(Figure: clients C connect through front ends FE to a single primary replica manager, which updates the backup RMs)
• All client requests (via front end processes) are directed to a nominated primary replica manager (RM)
• A single primary RM operates together with one or more secondary replica managers (acting as backups)
• The primary RM is responsible for all front end communication – and for updating the backup RMs
• Distributed applications communicate with the primary replica manager, which sends copies of up-to-date data
• Requests for data updates from the client interface to the primary RM are distributed to each backup RM
• If the primary replica manager fails, a secondary replica manager observes this and is promoted (elected) to act as primary RM
• To tolerate n process failures, n+1 RMs are needed
• Passive replication cannot tolerate Byzantine failures
38
Passive Replication – how it works
• An FE request is issued to the primary RM, each request with a unique id
• The primary RM receives the request
• It checks the request id, in case the request has already been executed
• If the request is an update, the primary RM sends the updated state and the unique request id to all backup RMs
• Each backup RM sends an acknowledgment to the primary RM
• When acknowledgments have been received from all backup RMs, the primary RM sends a request acknowledgment to the front end (client interface)
• All requests to the primary RM are processed in the order of receipt (a sketch follows this slide)
39
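A minimal sketch of the primary's logic (written for these notes; PrimaryRM and BackupRM are invented in-process stand-ins for real replica managers, and the channel is assumed reliable).

    # Passive (primary-backup) replication: the primary executes updates,
    # de-duplicates by request id, and pushes the new state to every backup.
    class BackupRM:
        def __init__(self):
            self.state = {}
        def apply_state(self, state):
            self.state = dict(state)      # overwrite with the primary's state
            return "ack"

    class PrimaryRM:
        def __init__(self, backups):
            self.state = {}
            self.backups = backups
            self.seen = {}                # request id -> previous reply

        def handle(self, req_id, key, value):
            if req_id in self.seen:       # already executed: just resend the reply
                return self.seen[req_id]
            self.state[key] = value       # execute the update
            acks = [b.apply_state(self.state) for b in self.backups]
            assert all(a == "ack" for a in acks)
            self.seen[req_id] = "ok"
            return "ok"                   # acknowledgment back to the front end

    primary = PrimaryRM([BackupRM(), BackupRM()])
    print(primary.handle("r1", "x", 42))  # ok
    print(primary.handle("r1", "x", 42))  # ok (duplicate detected, not re-executed)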
Active Replication
(Figure: clients C send requests via front ends FE, which multicast them to a group of replica managers RM; the majority output is taken)
• Multiple (group) replica managers (RMs), each with equivalent roles
• The RMs operate as a group
• Each front end (client interface) multicasts requests to the group of RMs
• Requests are processed by all RMs independently (and identically)
• The client interface compares all the replies received
• With 2N+1 RMs the service can tolerate N failures, i.e. consensus is reached when N+1 identical responses are received
• Can tolerate Byzantine failures
40
Active Replication – how it works
• A client request is sent to the group of RMs using totally ordered reliable multicast, each request with a unique id
• Each RM processes the request and sends its response/result back to the front end
• The front end collects (gathers) the responses from each RM
• Fault tolerance: individual RM failures have little effect on performance. To survive n process failures, 2n+1 RMs are needed (to leave a majority of n+1 operating) – see the sketch after this slide
(Figure: clients C, front ends FE and the group of replica managers RM)
41
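A sketch of the front end's gather-and-vote step (illustrative only; the replies are hard-coded here rather than gathered over a real multicast).

    # Active replication: the front end multicasts a request to every RM and
    # accepts a result once n+1 identical replies are received (2n+1 RMs total).
    from collections import Counter

    def majority_result(replies, n):
        value, votes = Counter(replies).most_common(1)[0]
        if votes >= n + 1:          # n+1 identical responses = consensus
            return value
        raise RuntimeError("no majority - too many faulty or missing replies")

    replies = [7, 7, 99]            # three RMs (n = 1); one Byzantine RM answers wrongly
    print(majority_result(replies, n=1))   # 7 - the majority output is taken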
The Gossip Architecture - 1
Concept: replicate data close to the points where clients need it.
Analogy: you exchange news with neighbours; if you receive news that is new to you, you want to know it; you then gossip the news on to other neighbours.
The aim is to provide high availability at the expense of weaker data consistency.
• A framework for providing highly available services through the use of replication
• RMs exchange (or gossip) updates in the background from time to time
• Multiple replica managers (RM); a single front end (FE) sends a query or update to any (one) RM
• A given RM may be unavailable, but the system is to guarantee a service
42
Gossip Service - 2 (How it works)
• The front end sends a time-stamped request to a single replica manager
• If the request is a query, the front end (client) blocks waiting for a reply – since the data should be at every RM
• If the request is an update, the update is carried out immediately by the local RM, which replies to the client; the updates are then propagated to the other RMs in a lazy fashion using gossip messages (see the sketch after this slide)
43
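A very small sketch of lazy update propagation (written for these notes; a real gossip service also carries vector timestamps to order updates, which is omitted here).

    # Gossip-style replication: an update is applied at one RM immediately and
    # pushed to the other RMs later, in the background.
    class GossipRM:
        def __init__(self, name):
            self.name = name
            self.data = {}
        def update(self, key, value):
            self.data[key] = value        # local RM applies the update and replies at once
        def gossip_to(self, peer):
            peer.data.update(self.data)   # lazy propagation in the background

    rms = [GossipRM("rm%d" % i) for i in range(3)]
    rms[0].update("x", 1)                 # the front end sent the update to one RM only
    for i, rm in enumerate(rms):          # one round of gossip around the ring
        rm.gossip_to(rms[(i + 1) % len(rms)])
    print([rm.data for rm in rms])        # all three copies now hold {'x': 1}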
Reliable Group Communication - 1
(Figure: a process may join or leave a group, send/receive messages, or stop, crash or become faulty)
Problem: provide a guarantee that all members of a process group receive a message.
• Networks and protocols are geared toward point-to-point process communication
• For small groups, just use multiple point-to-point connections (recall pvm_mcast)
Problems with larger groups:
• With such complex communication schemes the probability of an error is increased
• A process may join, or leave, a group
• A process may become faulty, i.e. it is a member of a group but unable to participate
44
Reliable Group Communication: simple case:
Where members of a group are known and fixed
• Sender assigns message sequence number to each message –
so that receiver can detect missing message.
• Sender retains message (in history buffer) until all receivers
acknowledge receipt.
• Receiver can request missing message (reactive) or sender can
resend if acknowledgement not received after a certain time
(proactive).
• Important to minimize number of messages, so combine
acknowledgement with next message.
[Project Idea: reliable multicast process communication using message buffering – a sketch follows this slide]
45
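A sketch of the sender side under the assumptions above (known, fixed group, simulated in-process; the proactive resend timer is omitted for brevity).

    # The sender keeps each message in a history buffer until every receiver has
    # acknowledged its sequence number; receivers detect gaps by sequence number.
    class Receiver:
        def __init__(self):
            self.expected = 1
        def deliver(self, seq, payload):
            if seq != self.expected:       # gap detected: would request a resend here
                return False
            self.expected += 1
            return True                    # acknowledgment (piggybacked in practice)

    class Sender:
        def __init__(self, receivers):
            self.receivers = receivers
            self.seq = 0
            self.history = {}              # seq -> message, kept until fully acked

        def multicast(self, payload):
            self.seq += 1
            self.history[self.seq] = payload
            acks = {r.deliver(self.seq, payload) for r in self.receivers}
            if acks == {True}:             # all receivers acknowledged
                del self.history[self.seq] # safe to drop from the history buffer

    s = Sender([Receiver(), Receiver()])
    s.multicast("m1")
    print(s.history)                       # {} - message acknowledged by all receivers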
Scalability in Reliable Multicasting:
• Simple with small groups
• A problem with large groups: with N receivers the sender needs to accept N acknowledgments (positive acknowledgement)
• Receivers could instead acknowledge only if a message is missing (negative acknowledgement) – this gives some communications performance optimization
46
Distributed Atomic Multicast Problem
• How to guarantee that message from sender is either received
by all processes or by none.
• A replicated database has one process per replica. If a replica crashes during an update it needs to schedule the update for later (see Tanenbaum).
• When a crashed replica is restarted it needs to rejoin the group, and provision must be made to bring the replica to the same state as the others.
• Project Ideas!
47
Recovery
• Once failure has occurred in many cases it is important to
recover critical processes to a known state in order to resume
processing
• Problem is compounded in distributed systems
Two Approaches:
• Backward recovery – use checkpointing (a global snapshot of the distributed system's status) to record the system state; but checkpointing is costly (performance degradation)
• Forward recovery – attempt to bring the system to a new stable state from which it is possible to proceed (applied in situations where the nature of errors is known and a reset can be applied)
48
Backward Recovery
- most extensively used in distributed systems and generally safest
• can be incorporated into middleware layers
• complicated in the case of process, machine or network failure
• no guarantee that the same fault will not occur again (a deterministic view – this affects failure transparency properties)
• cannot be applied to irreversible (non-idempotent) operations, e.g. an ATM withdrawal or UNIX rm *
49
Recovery- Combine Checkpointing with Message Logging (1)
• After each checkpoint, all messages are logged
• Sender-based or receiver-based logging
• Provides additional information to check that rollback recovery is possible
• Checkpointing can be expensive to incorporate
(Figure: processes P1 and P2 over time, showing checkpoints and the recovery line)
50
Recovery- Combine Checkpointing with Message Logging (2)
(Figure: processes P1 and P2 over time, showing checkpoints and the recovery line)
• If either process P1 or P2 crashes we need to recover to the most recent checkpoint that was not complicated by message-passing activity
• An unfortunate sequence of messages (in which message activity is high compared with the checkpoint frequency) can lead to cascaded rollback or the domino effect (one rollback leads to another)
• A possible solution is to use globally coordinated checkpointing – which requires global time synchronization rather than independent (per-processor) checkpointing
• Implementing rollback with independent checkpointing requires process dependencies to be considered
• Coordinated checkpointing requires synchronization to provide global storage of the system state. Saved states must be globally consistent, e.g. using a two-phase blocking protocol
51
Backward Recovery - Using Two Phase Blocking Protocol
Phase 1
• the coordinator broadcasts a 'checkpoint request' to all processes
• each process receives the message and saves a local checkpoint – it then queues subsequent messages received and acknowledges back to the coordinator
Phase 2
• when the coordinator has received all replies it multicasts 'checkpoint done', so the processes can continue
• the algorithm can be improved by multicasting only to processes that depend on the recovery of the coordinator
(A sketch of the two phases follows this slide.)
52
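A sketch of the two phases (written for these notes; the coordinator and processes are plain Python objects, whereas a real implementation would exchange messages over the network and handle timeouts).

    # Coordinated checkpointing with a simple two-phase blocking protocol.
    class CheckpointedProcess:
        def __init__(self, name):
            self.name = name
            self.state = {}
            self.queued = []                 # messages held back during the checkpoint
            self.blocked = False

        def checkpoint_request(self):
            saved = dict(self.state)         # save a local checkpoint
            self.blocked = True              # queue subsequent application messages
            return saved                     # acknowledge with the saved state

        def checkpoint_done(self):
            self.blocked = False             # resume; process any queued messages
            pending, self.queued = self.queued, []
            return pending

    def coordinated_checkpoint(processes):
        # Phase 1: broadcast 'checkpoint request' and gather acks (saved states).
        global_state = {p.name: p.checkpoint_request() for p in processes}
        # Phase 2: all replies received, so multicast 'checkpoint done'.
        for p in processes:
            p.checkpoint_done()
        return global_state                  # a globally consistent saved state

    print(coordinated_checkpoint([CheckpointedProcess("P1"), CheckpointedProcess("P2")]))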
Distributed System Design – for message passing
program design
• Start with a sequential program
• Determine dependencies, then identify code that can execute concurrently
  • Need to understand the algorithm
  • Exploit inherent parallelism
  • May require some algorithm restructuring
• Decompose the problem using control (functional) or data parallelism
• Consider the available machine architecture
• Choose a programming paradigm (a driver/worker sketch follows this slide)
• Determine communication and add message-passing code
• Compile and test
• Optimise performance
  • Measure performance
  • Locate bottlenecks
  • Minimise message passing
  • Load balance
53
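As a small illustration of the driver/worker decomposition referred to above, here is a sketch using Python's multiprocessing queues rather than PVM or MPI (written for these notes).

    # Driver/worker message passing: the driver farms out chunks of work and
    # gathers partial results; each worker runs as a separate process.
    from multiprocessing import Process, Queue

    def worker(tasks, results):
        while True:
            chunk = tasks.get()
            if chunk is None:                # sentinel: no more work
                break
            results.put(sum(x * x for x in chunk))   # the "computation"

    if __name__ == "__main__":
        tasks, results = Queue(), Queue()
        workers = [Process(target=worker, args=(tasks, results)) for _ in range(2)]
        for w in workers:
            w.start()
        data = list(range(100))
        chunks = [data[i:i + 25] for i in range(0, len(data), 25)]
        for c in chunks:                     # distribute the work
            tasks.put(c)
        for _ in workers:                    # one sentinel per worker
            tasks.put(None)
        total = sum(results.get() for _ in chunks)   # recombine partial results
        print(total)                         # 328350 = sum of squares of 0..99
        for w in workers:
            w.join()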
Problems with Distributed Computing
• Few standards (particularly for WAN meta-computing)
• Lack of expertise of users, developers and systems support staff
• High cost of good commercial software
• Immature software development tools
• Applications difficult and time-consuming to develop
• Portability problems in a heterogeneous environment
• Difficult to tune and optimise across all platforms
• Scheduling and load balancing
• Distributed fault tolerance is difficult, especially over a WAN
• Trade-off between the performance of using remotely mounted files and the disk space requirements of using local disk
• How to handle legacy systems – rewrite, add an interface, or use wrappers?
54