Transcript intro
Introduction to Distributed Systems
Outline
• about the course
• relationship to other courses
• the challenges of distributed systems
• distributed services
• *ility for distributed services
What is CPS 212 about?
What do I mean by “distributed information systems”?
• Distributed: a bunch of “computers” connected by “wires”
• Nodes are (at least) semi-autonomous...
but run software to coordinate and share resources.
• Information systems: focus on systems to store/access/share data and
operations on data.
Move {data, computation} around the network and deliver it to the
right places at the right times, safely and securely.
• Focus on Internet information services and their building blocks.
The Web, Web Services, name services, resource sharing (Grid)
Clustering, network storage, file sharing
Why are you here?
• You are a second-year (or later) CPS graduate student.
• You have taken CPS 210 and 214 and/or 216 and you want more.
familiarity with TCP/IP networking, threads, and file systems
• Or: we have talked and we agreed that you should take the class.
• You are comfortable with concurrent programming in Java.
(You want to do some Java programming labs.)
• You want to prepare for R/D in this exciting and important area.
(You want to read about 15 papers and take some exams.)
• You want to get started...
(Semester group project.)
Continuum of Distributed Systems
[Figure: a continuum of systems from small and fast (parallel architectures,
CPS 221) to big and slow (networks, CPS 214). Issues across the continuum:
naming and sharing, performance and scale, resource management.]
• Multiprocessors: low latency, high bandwidth, secure and reliable
interconnect, no independent failures, coordinated resources.
• Clusters (LAN): fast network, trusting hosts, coordinated.
• Global Internet: slow network, untrusting hosts, autonomy; high latency,
low bandwidth, autonomous nodes, unreliable network, fear and distrust,
independent failures, decentralized administration.
The Challenges of Distributed Systems
• private communication over public networks
who sent it (authentication), did anyone change it, did anyone see it (see the sketch after this list)
• building reliable systems from unreliable components
nodes fail independently; a distributed system can “partly fail”
Lamport: “A distributed system is one in which the failure of a machine
I’ve never heard of can prevent me from doing my work.”
• location, location, location
Placing data and computation for effective resource sharing, and finding it
again once you put it somewhere.
• coordination and shared state
What should we (the system components) do and when should we do it?
Once we’ve all done it, can we all agree on what we did and when?
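To make the first challenge concrete, here is a minimal sketch (not part of the course material, and using only the standard javax.crypto API) of message authentication and integrity with an HMAC over a shared secret key: the receiver can tell whether the message really came from the key holder and whether it was changed in transit. Confidentiality (did anyone see it) would additionally require encryption, and key distribution is ignored here.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class AuthenticatedMessage {
    // Compute an HMAC-SHA256 tag over the message with a shared secret key.
    static byte[] tag(byte[] key, String message) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(key, "HmacSHA256"));
        return mac.doFinal(message.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        byte[] sharedKey = "a-secret-known-to-both-ends".getBytes(StandardCharsets.UTF_8);

        // Sender computes a tag and transmits (message, tag) over the public network.
        String message = "transfer $100 to account 42";
        byte[] sentTag = tag(sharedKey, message);

        // Receiver recomputes the tag over what it received and compares.
        // A mismatch means the message was altered or forged in transit.
        String received = "transfer $900 to account 42"; // tampered en route
        boolean authentic = MessageDigest.isEqual(sentTag, tag(sharedKey, received));
        System.out.println(authentic ? "message accepted" : "message rejected: altered or forged");
    }
}
```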
Information Systems vs. Databases
“Information systems” is more general than “relational databases”.
• Overlap: We study distributed concurrency control and recovery, but
not the relational model.
The issues are related, but we’ll consider a wider range of data
models and service models.
In this course, we view databases as:
• local components of larger distributed systems, or
• distributed systems in themselves.
Focus: scale and robustness of large-scale Internet services.
September 11, 2001
The 9/11 load spike at CNN.com:
• complete collapse
• scramble to manually deploy new servers
How can we handle “flash crowds”?
• Buy/install enough hardware for worst-case load?
• Block traffic?
• Adaptive provisioning?
• Steal resources from less critical services?
That Other September 11
This is a graph of request traffic to download the Starr Report on Pres.
Clinton’s extracurricular pursuits, released on 9/11/98.
Broader Importance of Distributed Software Technology
Today, the global community depends increasingly on distributed
information systems technologies.
There are many recent examples of high-profile meltdowns of
systems for distributed information exchange.
• Code Red worm: July 2001
• denial-of-service attacks against Yahoo etc. (spring 00)
• stored credit card numbers stolen from CDNow.com (spring 00)
People were afraid to buy over the net at all just a few years ago!
• Network Solutions DNS root server failure (fall 00)
• MCI trunk drop interrupts Chicago Board of Exchange (summer 99)
These reflect the reshaping of business, government, and society
brought by the global Internet and related software.
We have to “get it right”!
The Importance of Authentication
This is a picture of a $2.5B move in the value of Emulex Corporation, in
response to a fraudulent press release by short-sellers through InternetWire in
2000. The release was widely disseminated by news media as a statement
from Emulex management, but media failed to authenticate it.
[Figure: EMLX stock price chart, reproduced from clearstation.com]
Challenges for Services: *ility
We want our distributed applications to be useful, correct, and
secure. We also want reliability. Broadly, that means:
• recoverability
Don’t lose data if a failure occurs (also durability)
• availability
Don’t interrupt service if a failure occurs.
• scalability
Grow effectively with the workload. See also: manageability.
• survivability
Murphy’s Law says it’s a dangerous world. Can systems protect
themselves?
• See also: security, adaptability, agility, dependability, performability, etc.
The Meaning of Scalability
Scalability is now part of the “enhanced standard litany” [Fox]; everybody
claims their system is “scalable”. What does it really mean?
[Figure: cost vs. capacity for a scalable and an unscalable system, showing
total cost of capacity and marginal cost of capacity.]
Pay as you go: expand capacity by spending more money, in proportion to the
new capacity.
Note: watch out for “hockey sticks”!
How do we measure or validate claims of scalability?
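One crude way to back up (or debunk) a scalability claim, sketched here with made-up work standing in for a real service: drive the system with a growing number of concurrent clients and see whether delivered throughput keeps growing roughly in proportion, or flattens out.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ScalabilityProbe {
    // Stand-in for one request to the service under test (hypothetical work).
    static void doRequest() {
        long x = 0;
        for (int i = 0; i < 10_000; i++) x += i;
        if (x == -1) System.out.println(x); // keep the work from being optimized away
    }

    // Requests per second completed by 'clients' concurrent workers in 'millis' ms.
    static double throughput(int clients, long millis) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(clients);
        AtomicLong completed = new AtomicLong();
        long deadline = System.nanoTime() + millis * 1_000_000L;
        for (int c = 0; c < clients; c++) {
            pool.submit(() -> {
                while (System.nanoTime() < deadline) {
                    doRequest();
                    completed.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(millis + 1000, TimeUnit.MILLISECONDS);
        return completed.get() / (millis / 1000.0);
    }

    public static void main(String[] args) throws InterruptedException {
        for (int clients = 1; clients <= 32; clients *= 2) {
            System.out.printf("%2d clients: %.0f req/s%n", clients, throughput(clients, 2000));
        }
    }
}
```

Dividing the measured capacity by what it cost to provide gives the cost per unit of delivered capacity; if that cost climbs as load grows, the system is not scaling in the pay-as-you-go sense.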
Scalability II: Manageability
Today, “cost” has a broader meaning than it once did:
• growth in administrative overhead with capacity
• no interruption of service to upgrade capacity
“24 * 7 * 365 * .9999”
Where does the money go?
[Borrowed from Jim Gray]
[Figure: pie charts of where the money goes. In the Old World the hardware
vendor’s share was large (vendor 40%, staff 40%, facility 20%); in the New
World it is small (vendor 5%, facility 5%), and staff and operations dominate
the cost.]
Self-Managing Systems
IBM’s Autonomic Computing Challenge
How to Build Self-Managing Systems?
[Figure: a feedback loop around “servers in the mist” handling client
requests: a Monitor collects observations, an Adaptation Policy decides how
to respond, and an Actuator issues directives back to the servers.]
Where are the humans in the loop?
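A minimal sketch of such a monitor/policy/actuator loop (the interfaces, the target load, and the re-evaluation period are invented for illustration; this is not IBM's autonomic computing framework): observe the load, decide how many servers the policy calls for, and issue the directive.

```java
public class AdaptationLoop {
    interface Monitor  { double observeLoad(); }        // observations
    interface Actuator { void setServerCount(int n); }  // directives

    // A trivial adaptation policy: keep the load per server near a target.
    static int decide(double totalLoad, double targetPerServer) {
        return Math.max(1, (int) Math.ceil(totalLoad / targetPerServer));
    }

    public static void run(Monitor monitor, Actuator actuator) throws InterruptedException {
        int servers = 1;
        while (true) {
            double load = monitor.observeLoad();      // monitor: gather observations
            int next = decide(load, 100.0);           // adaptation policy
            if (next != servers) {
                actuator.setServerCount(next);        // actuator: issue directives
                servers = next;
            }
            Thread.sleep(5000);                       // re-evaluate periodically
        }
    }
}
```

In this sketch the humans appear only in choosing the policy (the target load per server), not in turning servers on and off.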
Availability
The basic technique for achieving availability is replication.
replicate hardware components
replicate functions
replicate data
replicate servers
• e.g., primary/backup, hot standby, process pairs, etc.
• e.g., RAID parity for available storage
Build decentralized systems that eliminate single points of failure.
• If a component fails, select a replica and redirect requests there (fail over).
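A minimal client-side sketch of fail over (the Server interface and the replica list are hypothetical, not an API from the course): try a replica, and if it fails, select another and redirect the request there.

```java
import java.util.List;

public class FailoverClient {
    interface Server { String handle(String request) throws Exception; }

    private final List<Server> replicas; // identical replicas of the service
    public FailoverClient(List<Server> replicas) { this.replicas = replicas; }

    // Try each replica in turn; fail over when one is unreachable or crashes.
    public String request(String req) {
        for (Server replica : replicas) {
            try {
                return replica.handle(req);
            } catch (Exception failure) {
                // This replica failed independently; redirect to the next one.
            }
        }
        throw new IllegalStateException("all replicas failed");
    }
}
```

Real systems must also decide when a replica has really failed (rather than merely being slow) and must keep the replicas consistent, which is where much of the hard work lies.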
Recoverability
Some basic assumptions:
• Nodes have volatile and (optional) nonvolatile storage.
• Volatile storage is fast, but its contents are discarded in a failure.
OS crash/restart, power failure, untimely process death
• Nonvolatile (stable) storage is slow, but its contents survive failures
of components other than the storage device itself.
E.g., disk: high latency but also high bandwidth (if sequential)
Low-latency nonvolatile storage exists. It is expensive but getting
cheaper: NVRAM, Uninterruptible Power Supply (UPS), flash
memory, MRAM, etc...these help keep things interesting.
• Stability is never absolute: it is determined by probability of device
failure, often measured by “mean time between failure” (MTBF).
How about backing up data in remote memory?
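As a small illustration of how a service uses stable storage for recoverability (a sketch only; the log layout is invented): append each update to a log file and force it to the disk before acknowledging, so the update survives an OS crash or power failure even though everything in volatile memory is lost.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class StableLog implements AutoCloseable {
    private final FileChannel log;

    public StableLog(Path path) throws IOException {
        log = FileChannel.open(path, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    // Append a record and force it to nonvolatile storage before returning.
    // Only after force() completes is it safe to acknowledge the update.
    public void append(String record) throws IOException {
        log.write(ByteBuffer.wrap((record + "\n").getBytes(StandardCharsets.UTF_8)));
        log.force(true); // push the data past the volatile caches to the device
    }

    @Override
    public void close() throws IOException {
        log.close();
    }
}
```

The sequential append keeps disk bandwidth high, but every synchronous force() pays the latency of the device, which is exactly why low-latency nonvolatile storage (NVRAM, UPS-backed memory, flash) keeps things interesting.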
The Course
These challenges affect how/where we place functions and data in
the network.
It turns out that there are many common problems and techniques that can
be (mostly) “factored out” of applications and services. That is
(mostly) what this course is about.
• Web operating systems
• Large-scale information system: the Web
• Distributed services: the next-generation Web
• Internet service infrastructure and Internet information systems
• Building blocks for scalable services: storage services, file services, cluster management, ...
• Core distributed systems material