Transcript Basics
CS542: Topics in
Distributed Systems
Diganta Goswami
Distributed System
• A collection of independent computers
that appears to its users as a single
coherent system.
– Autonomous computers
– Many components – connected by a network –
sharing resources.
Distributed System
• A System of networked components that
communicate and coordinate their actions only by
passing messages
– concurrent execution of programs
– no global clock
– components fail independently of one another
Another definition
• You know you have a distributed system when
the crash of a computer you’ve never heard of
stops you from getting any work done.
– inter-dependencies
– shared state
– independent failure of components
A working definition for us
A distributed system is a collection of entities, each
of which is autonomous, programmable,
asynchronous and failure-prone, and which
communicate through an unreliable communication
medium using message passing.
• Entity=a process on a device (PC, PDA)
• Communication Medium=Wired or wireless network
• Our interest in distributed systems involves
– design and implementation, maintenance, algorithmics
“Important” Distributed Systems Issues
• No global clock: no single global notion of the correct
time (asynchrony)
• Unpredictable failures of components: lack of
response may be due to either failure of a network
component, network path being down, or a computer
crash (failure-prone, unreliable)
• Highly variable bandwidth: from 16Kbps (slow
modems or Google Balloon) to Gbps (Internet2) to
Tbps (in between DCs of same big company)
• Possibly large and variable latency: few ms to
several seconds
• Large numbers of hosts: 2 to several million
There are a range of interesting problems for
Distributed System designers
•
•
• Real distributed systems
– Cloud Computing, Peer to peer systems, Hadoop, distributed file
systems, sensor networks, graph processing, …
• Classical Problems
– Failure detection, Asynchrony, Snapshots, Multicast, Consensus,
Mutual Exclusion, Election, …
• Concurrency
– RPCs, Concurrency Control, Replication Control, …
• Security
– Byzantine Faults, …
•
•
Others…
Typical Distributed Systems Design Goals
• Common Goals:
– Heterogeneity – can the system handle a large variety of
types of PCs and devices?
– Robustness – is the system resilient to host crashes
and failures, and to the network dropping messages?
– Availability – are data+services always there for clients?
– Transparency – can the system hide its internal
workings from the users?
– Concurrency – can the server handle multiple clients
simultaneously?
– Efficiency – is the service fast enough? Does it utilize
100% of all resources?
– Scalability – can it handle 100 million nodes without
degrading service? (nodes=clients and/or servers)
– Security – can the system withstand hacker attacks?
– Openness – is the system extensible?
Challenges and Goals of Distributed Systems
•
•
•
•
•
•
•
Heterogeneity
Openness
Security
Scalability
Failure handling
Concurrency
Transparency
Challenges
•
Heterogeneity (variety and difference) of underlying
network infrastructure,
•
Internet consists of many different sorts of network –
their differences are masked by the fact that all of the
computers attached to them use the Internet Protocols for
communication.
–
e.g. a computer attached to an Ethernet has an implementation of the
Internet Protocols over the Ethernet, whereas a computer on a different sort
of network will need an implementation of the Internet Protocols for that
network.
Heterogeneity
• Computer hardware and software
– e.g., operating systems, compare UNIX socket and Winsock
calls
• Programming languages : in particular, data
representations
Some approaches: Middleware
•
A S/W layer that provides a programming
abstraction as well as masking the heterogeneity
of the underlying networks, H/W, O/S and
programming languages.
– Middleware (e.g., CORBA):
transparency of network, hard- and
software and programming language heterogeneity. JAVA
RMI
• In addition to solving the problems of heterogeneity,
middleware provides a uniform computational model for
use by the programmers of servers and distributed
applications.
Positioning Middleware
• General structure of a distributed system as
middleware.
1-22
Openness
• Characteristic that determine whether the system
can be extended and re-implemented in various
ways.
– Determined primarily by the degree to which new resource
sharing services can be added and be made available for use
by a variety of client programs.
– Cannot be achieved unless the specification and
documentation of the key s/w interfaces are made available to
s/w developers (i.e. key interfaces are published)
Openness
• Designers of the Internet protocols introduced
a series of documents called RFCs
– Specifications of the Internet communication
protocols
– Specifications for applications run over them
» e.g., email, telnet, file transfer, etc. (by the mid 80’s)
• RFCs are not the only means --- e.g. CORBA is
published through a series of documents, including a
complete specification of the interfaces of its services
(www.omg.org)
Openness
• Offering services according to standard rules that
describe the syntax and semantics of those
services
– e.g., Network protocol rules (RFCs)
• Services specified through interfaces
– Interface definition languages (IDLs)
• specifies names and available functions as well as
parameters, return values, exceptions etc.
Security
• Distributed systems must protect the shared
information and resources
• The openness of DS makes them vulnerable to
security threats
• Providing security is a significant challenge for
DS
Security.
Privacy / Confidentiality: protection against
disclosure to unauthorized individuals
Integrity: protection against alteration or corruption
Availability: protection against interference with the
means to access the resources
Scalability
• Scalable system—system that can handle additional
number
of
users/resources
without
suffering
noticeable loss of performance
• Three metrics of a scalable system
– No of user/resources
– Distance between the farthest nodes in the system (network radius)
– Number of organizations exerting control over the pieces of the
system
Challenges in designing scalable DS
•
Controlling the cost of physical resources:
– As the demand for a resource grows, it should be
possible to extend the system, at reasonable cost,
to meet it.
» e.g. it must be possible to add server computers to avoid
the performance bottleneck that would arise if a single file
server had to handle all file access request when the freq.
of file access request grows in an intranet with the
increase in users and computers.
www.amazon.com is more than one computer
Challenges in designing scalable DS
•
Controlling the performance loss:
– Management of a set of data whose size is
proportional to the number of users or resources in
the system
» e.g. the Domain Name System holds the table with the
correspondence between domain names of computers
and their Internet address
» Hierarchic structures scale better than linar structures.
Scaling Techniques
1.5
An example of dividing the DNS name space into zones.
Challenges in designing scalable DS
• Preventing s/w resources running out:
– Numbers used as Internet address --- 32 bits was
used in the late 70’s but may run out soon.
– Change from 32 bits to 128 bits?
– Difficult to predict the demand.
– Over-compensating for future growth may be worse than
adapting to a change when we are forced to - large Internet
address occupy extra space in messages and in computer
storage.
Failure Handling
• Failure in a DS is partial
– Some components fail while others continue to function
– This makes handling of failures difficult.
Techniques for dealing with failures
• Detecting failures
– may be impossible – remote site crash or delay
in message transmission?
– Some can be.
– Ex. - Checksums can be used to detect corrupted data
Techniques for dealing with failures
• Masking failure
– Some can be hidden or made less severe
– Retransmission – when messages fail to arrive
Techniques for dealing with failures
• Tolerating failures
– Would not be practical to detect and hide all of the failures.
Can be designed to tolerate some of those
– e.g. timeouts when waiting for a web resource – clients give
up after a predetermined number of attempts and take other
actions & inform the user.
Failure Handling
• Recovery from failures
– Rollback
– Undo/Redo in transactions
• Redundancy
– Makes the system more available through replication of
resources/data
– Redundant routes in the network
– Replication of name tables in multiple domain name servers
Concurrency
• In a distributed system it is possible that
multiple machines/processes/users may try to
access shared data/resource concurrently
– Can potentially lead to incorrect results and/or
– Deadlocks
• The operations must be synchronized/serialized so
that the end result is correct
Transparency
• Concealing
the
heterogeneous
and
distributed nature of the system so that it
appears to the user like one system
– Making the user believe that there is only a
single, undivided system i.e., to hide the notion of
distribution completely
• What are the challenges of transparency?
Transparency Categories
• Access transparency - access local and remote
resources using identical operations
– e.g., users of UNIX NFS can use the same commands
and parameters for file system operations regardless of
whether the accessed files are on a local or remote disk.
Transparency categories
• Location Transparency: Access without
knowledge of location of a resource
– e.g., URLs, email addresses (hostname, IP addresses, etc.
not required --- the part of the URL that identifies a web
server domain name refers to a computer name in a domain,
rather than to an Internet address)
Transparency Categories
• Concurrency
transparency:
Allow several
processes to operate concurrently using shared
resources in a consistent fashion w/o interference
between them.
– That is, users and programmers are unaware that
components request services concurrently.
• Replication transparency
– Use replicated resource as if there was just one
instance.
» Increase reliability and performance w/o knowledge of
the replicas by users or application programmers.
Failure transparency
•
Enables the concealment of faults, allowing
users and application programs to complete
their task despite failures of h/w or s/w
components.
• Retransmit of email messages – eventually
delivered
even
when
servers
or
communication links fail – it may even take
several days.
Failure transparency
• Failure transparency depends on concurrency and
replication transparency.
• Replication can be employed to achieve failure
transparency
• Message transmission governed by TCP is a
mechanism for providing failure transparency
Mobility Transparency
• Mobility transparency: allow resources to move
around w/o affecting the operation of users or
programs
• e.g., 700 phone number – but URLs are not, because
someone’s personal web page cannot move to their new
place of work in a different domain – all of the links in other
pages will still point to the original page!
•
Transparency Categories
• Performance transparency: adaptation of the
system to varying load situations without the user
noticing it.
• Scaling transparency: allow system and
applications to expand without need to change
structure or application algorithms
Degree of transparency
• There are systems in which attempting to blindly hide
all distribution aspects from users is not always a
good idea
– Requesting your electronic newspaper in your mailbox before 7 am
local time – while you are at the other end of the world living in a
different time zone
– (Your morning paper will not be the morning paper you are used to)
Degree of transparency
• There is trade-off between a high degree of
transparency and the performance of a system
– Masking transient server failure by retransmitting the request
may slow down the system
– If it is necessary to guarantee that several replicas need to be
consistent all the time, a single update may take a long time –
something that cannot be hidden from the user.