Transcript: DC1 Slides

Distributed Computing
COEN 317
DC1: Introduction
COEN 317
JoAnne Holliday
Email: [email protected] (best way to reach me)
Office: Engineering 247, (408) 551-1941
Office Hours: MW 3:30-4:45 and by appointment
Class web page: http://www.cse.scu.edu/~jholliday/
Textbook: Distributed Systems: Concepts and Design
by Coulouris, Dollimore, Kindberg (CDK)
We follow the book only roughly. The chapters with
material that corresponds to the lectures are chapters 5
and 9-14.
Read chapters 1 and 2. Review chapters 3 and 4 if
needed for networks, threads, and processes.
Topics
• What is a Distributed System?
• What are the goals of a Distributed
System?
– Scalability
– Fault Tolerance
– Transparency
Definition of a Distributed System (1)
A distributed system is:
A collection of independent
computers that appears to
its users as a single
coherent system.
Definition of a Distributed System (2)
Figure 1.1: A distributed system organized as middleware.
Note that the middleware layer extends over multiple machines.
Distributed systems
“Distributed System” covers a wide range
of architectures from slightly more
distributed than a centralized system to a
truly distributed network of peers.
One Extreme: Centralized
Centralized: mainframe and dumb
terminals
All of the computation is done on the
mainframe. Each line or keystroke is sent
from the terminal to the mainframe.
Moving Towards Distribution
In a client-server system, the clients
are workstations or computers in
their own right and perform
computations and formatting of the
data.
However, the data and the application that
manipulates it ultimately reside on the server.
More Decentralization
In Distributed-with-Coordinator, the nodes or
sites depend on a coordinator node with extra
knowledge or processing abilities.
The coordinator might be used
only in case of failures or
other problems.
True Decentralization
A true distributed system has no distinguished
node which acts as a coordinator; all nodes
or sites are equals.
The nodes may choose
to elect one of their
own to act as a
temporary coordinator
or leader.
Distributed Systems: Pro and Con
• Some things that were difficult in a centralized
system become easier
– Doing tasks faster by doing them in parallel
– Avoiding a single point of failure (all eggs in one
basket)
– Geographical distribution
• Some things become more difficult
– Transaction commit
– Snapshots, time and causality
– Agreement (consensus)
Advantages of the True Distributed
System
• No central server or coordinator means it
is scalable
• SDDS, Scalable Distributed Data
Structures, attempt to move distributed
systems from a small number of nodes to
thousands of nodes
• We need scalable algorithms to operate
on these networks/structures
– For example peer-to-peer networks
Homogeneous and tightly coupled vs
heterogeneous and loosely coupled
We will study heterogeneous and
loosely coupled systems.
Software Concepts
• DOS (Distributed Operating System): a tightly-coupled
operating system for multiprocessors and homogeneous
multicomputers. Main goal: hide and manage hardware
resources.
• NOS (Network Operating System): a loosely-coupled
operating system for heterogeneous multicomputers (LAN
and WAN). Main goal: offer local services to remote
clients.
• Middleware: an additional layer atop a NOS implementing
general-purpose services. Main goal: provide distribution
transparency.
Distributed Operating Systems
• May share memory or other resources.
Network Operating System
• General structure of a network operating system.
Two meanings of synchronous and
asynchronous communications
• Synchronous communication is where a process blocks
after sending a message to wait for the answer, or blocks
until a message arrives when receiving.
• "Sync" and "async" have also come to describe the
communication channels with which they are used.
• Synchronous: message transit time is short and bounded.
If a site does not respond in x seconds, it can be declared
dead. This simplifies algorithms!
• Asynchronous: message transit time is unbounded. If a
message is not received in a given time interval, it could
just be slow.
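A minimal sketch of the difference in code, assuming a hypothetical ping(site, timeout) helper that raises TimeoutError when no reply arrives in time (both names are illustrative, not from the course material):

    def is_alive(site, ping, timeout=5.0):
        # Under the synchronous model a missed deadline proves failure,
        # so returning False here is a safe verdict. Under the
        # asynchronous model the same timeout may only mean "slow",
        # so False is merely a suspicion, not a proof.
        try:
            return ping(site, timeout=timeout)
        except TimeoutError:
            return False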
What’s the Point of Unrealistic
Assumptions?
• So why should we study synchronous systems
(which never occur in real life) or systems
without failures?
• If something is difficult (intractable) in a
synchronous system it will be worse in an
asynchronous system.
• If an algorithm works well in an asynchronous
system, it probably will do even better in a real
system.
Scalability
• Something is scalable if it “increases linearly with
size” where size is usually number of nodes or
distance.
• “X is scalable with the number of nodes”
• If every site (node) is directly connected to every
other site through a communication channel, the
number of channels is NOT scalable: N sites need
N(N-1)/2 channels, which grows quadratically.
• Sites connected in a ring: the number of channels IS
scalable (N channels for N sites).
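A quick calculation (a sketch, not from the slides) makes the difference concrete:

    def full_mesh_channels(n):
        # one channel per pair of sites: quadratic growth
        return n * (n - 1) // 2

    def ring_channels(n):
        # one channel per site (to its successor): linear growth
        return n

    for n in (4, 16, 256):
        print(n, full_mesh_channels(n), ring_channels(n))
    # 4 -> 6 vs 4;  16 -> 120 vs 16;  256 -> 32640 vs 256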
Scalability Problems
Examples of scalability limitations:
• Centralized services: a single server for all users.
• Centralized data: a single on-line telephone book.
• Centralized algorithms: doing routing based on complete
information.
Scaling Techniques (1)
Figure 1.4: The difference between letting (a) a server or
(b) a client check forms as they are being filled.
Scaling Techniques (2)
Figure 1.5: An example of dividing the DNS name space into
zones.
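The scaling idea behind zones can be sketched as a toy lookup in which each zone server stores only the names it is responsible for (the zone data, the address, and the resolve function below are all illustrative, not real DNS machinery):

    def resolve(name, zone_servers):
        # Walk from the most specific enclosing zone outward; no single
        # server needs a copy of the whole name space.
        labels = name.split(".")
        for i in range(len(labels)):
            zone = ".".join(labels[i:])
            records = zone_servers.get(zone, {})
            if name in records:
                return records[name]
        return None  # real DNS would follow a referral to another zone

    servers = {"scu.edu": {"www.scu.edu": "192.0.2.1"}}  # RFC 5737 test address
    print(resolve("www.scu.edu", servers))  # 192.0.2.1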
Characteristics of Scalable
Distributed Algorithms
• No machine (node, site) has complete
information about the system state.
• Sites make decisions based only on local
information.
• Failure of one site does not ruin the
algorithm.
• There is no implicit assumption that a global
clock exists.
Transparency in a D.S.
• Access: hide differences in data representation and how a
resource is accessed.
• Location: hide where a resource is located.
• Migration: hide that a resource may move to another
location.
• Relocation: hide that a resource may be moved to another
location while in use.
• Replication: hide that copies of a resource exist and a user
might use different ones at different times.
• Concurrency: hide that a resource may be shared by several
competitive users.
• Failure: hide the failure and recovery of a resource.
• Persistence: hide whether a (software) resource is in
memory or on disk.
Important: location, migration (relocation), replication,
concurrency, failure.
Fault Tolerance
• An important goal in DS is to make the system
resilient to failures of some of the components.
Fault tolerance (FT) is frequently one of the
reasons for making it distributed in the first
place.
• Dependability includes:
– Availability
– Reliability
– Safety
– Maintainability
Fault Tolerance Goals
• Availability: Can I use it now? Probability of
being up at any given time.
• Reliability: Will it be up as long as I need it?
Ability to run continuously without failure. If
system crashes briefly every hour, it may still
have good availability (it is up most of the time)
but has poor reliability because it cannot run for
very long before crashing.
• Safety: if it fails, does anything bad happen? A safe
system fails without catastrophic consequences.
• Maintainability: How easy is it to fix if it breaks?
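A back-of-the-envelope calculation (an illustrative sketch with made-up numbers) shows how the crash-every-hour system scores well on one measure and badly on the other:

    # Suppose the system is down for 6 seconds out of every hour.
    downtime = 6  # seconds per hour
    availability = (3600 - downtime) / 3600
    print(f"availability = {availability:.4%}")  # 99.8333% -- looks great
    # But the mean time between failures is only one hour, so any task
    # needing more than an hour of uninterrupted service will fail.
    mtbf_hours = 1.0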
Faults
• FAULT - a fault is the cause of an error.
• FAULT TOLERANCE - a system can continue
to function even in the presence of faults.
• Classification of faults:
– Transient faults - occur once, then disappear.
– Intermittent faults - occur, go away, then
come back, go away, …
– Permanent faults - do not go away by themselves
(e.g., disk failures).
Failure Models
• Crash failure (fail-stop): a server halts, but is working
correctly until it halts.
• Omission failure: a server fails to respond to incoming
requests.
– Receive omission: a server fails to receive incoming
messages.
– Send omission: a server fails to send messages.
• Timing failure: a server's response lies outside the
specified time interval.
• Response failure: the server's response is incorrect.
– Value failure: the value of the response is wrong.
– State transition failure: the server deviates from the
correct flow of control.
• Arbitrary or Byzantine failure: a server may produce
arbitrary responses at arbitrary times.
Failure Models
• Fail-stop: Process works fine until it stops,
then remains dead. Other processes can
detect this state.
• Crash: Works fine until it stops. Other
processes may not be able to detect this.
• Send or Receive Omission: Process fails
to do one thing, but works fine otherwise.
• Arbitrary (Byzantine): Process behavior is
unpredictable, maybe malicious.
Network Failures
• Link failure (one-way or two-way): 5 can talk to
6, but 6 cannot talk to 5.
• Network partitions: the network 1,2,3,4,5,6
is partitioned into 1,2,3,4 and 5,6.
Are The Models Realistic?
• No, of course not!
• Synch vs Asynch
– Asynchronous model is too weak (real systems have
clocks, “most” timing meets expectations… but
occasionally…)
– Synchronous model is too strong (real systems lack a
way to implement synchronized rounds)
• Failure Types
– Crash fail (fail-stop) model is too weak (systems usually
display some odd behavior before dying)
– Byzantine model is too strong (assumes an adversary
of arbitrary speed who designs the “ultimate attack”)
Models: Justification
• If we can do something in the asynchronous
model, we can probably do it even better in a
real network
– Clocks, a-priori knowledge can only help…
• If we can’t do something in the synchronous
model, we can’t do it in a real network
– After all, synchronized rounds are a powerful, if
unrealistic, capability to introduce
• If we can survive Byzantine failures, we can
probably survive the failures of a real distributed system.
Fault Tolerance Strategies
• Redundancy
– Hardware, software, informational,
temporal
• Hierarchy
– Confinement of errors
Failure Masking by Redundancy
• Triple modular redundancy.
• Voter circuits choose majority of inputs to determine correct
output
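A minimal software analogue of the voter (a sketch, not the circuit itself):

    from collections import Counter

    def tmr_vote(a, b, c):
        # Majority of three module outputs; masks any single faulty module.
        value, count = Counter([a, b, c]).most_common(1)[0]
        return value if count >= 2 else None  # no majority: fault not maskable

    print(tmr_vote(1, 1, 0))  # 1 -- the faulty third input is outvoted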
Flat Groups versus Hierarchical
Groups
a) Communication in a flat group.
b) Communication in a simple hierarchical group.
Identical Processes, Fail-stop
• A system is K fault tolerant if it can withstand faults in
K components and still produce correct results.
• Example: FT through replication - each replica
reports a result. If the nodes in a DS are fail-stop
and there are K+1 identical processes, then the
system can tolerate K failures: the result comes from
the remaining one.
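A sketch of the fail-stop case, assuming a hypothetical replies list holding the results of the K+1 replicas (None standing for a crashed replica):

    def fail_stop_result(replies):
        # With K+1 identical fail-stop replicas and at most K crashes,
        # at least one reply arrives, and every reply that arrives is
        # correct, so the first one we see is the answer.
        for r in replies:
            if r is not None:
                return r
        raise RuntimeError("more than K replicas failed")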
Identical Processes, Byzantine
Failures
• If K failures are Byzantine (with K-collusion)
then 2K+1 processes are needed for K FT.
• Example: K processes can be faulty and "lie"
about their result. (If they simply fail to report a
result, that is not a problem). If there are 2K+1
processes, at least K+1 will be correct and
report the same correct answer. So by taking
the result reported by at least K+1 (which is a
majority), we get the correct answer.
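A sketch of the corresponding vote, assuming replies is the list of the 2K+1 reported results:

    from collections import Counter

    def byzantine_result(replies, k):
        # At most K of the 2K+1 replies may be wrong, so any value
        # reported by at least K+1 replicas is necessarily correct.
        value, count = Counter(replies).most_common(1)[0]
        if count >= k + 1:
            return value
        raise RuntimeError("no value reached a K+1 majority")

    print(byzantine_result(["noon", "noon", "dawn"], k=1))  # "noon"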
What makes Distributed Systems
Difficult?
• Asynchrony – even “synchronous”
systems have time lag.
• Limited local knowledge – algorithms can
consider only information acquired locally.
• Failures – parts of the distributed system
can fail independently leaving some
nodes operational and some not.
Example: Agreement with
Unreliable Communication
• Introduced as voting problem (Lamport, Shostak,
Pease ’82)
• A and B can defeat enemy iff both attack
• A sends message to B: Attack at Noon!
[Diagram: General A and General B, separated by The Enemy.]
Agreement with Unreliable Comm
• Impossible with unreliable networks
• Possible if some guarantees of reliability
– Guaranteed delivery within bounded time
– Limitations on corruption of messages
– Probabilistic guarantees (send multiple
messages)
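A sketch of the probabilistic approach, assuming a hypothetical transmit(msg) helper that returns True when an acknowledgment comes back:

    def send_with_retries(msg, transmit, tries=10):
        # If each attempt is delivered with probability p, all `tries`
        # attempts fail with probability (1 - p) ** tries: the failure
        # probability can be driven arbitrarily low, but never to zero,
        # which is why agreement remains impossible in the strict sense.
        for _ in range(tries):
            if transmit(msg):  # True means the ack arrived
                return True
        return False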