Notes - Systems@NYU
Distributed systems
[Fall 2014]
G22.3033-002
Lec 1: Course Introduction
Waitlist status
• Course admittance priority: Ph.D., M.S.
• If you are not going to take the class,
drop early to let others in
Class staff
• Instructor: Prof. Jinyang Li (me)
– [email protected]
– Office Hour: Wed 4-5pm (715 Bway Rm 708)
• Instructional Assistant: Yang Cui
– [email protected]
– Office Hour: Thu 4-5pm (715 Bway Rm 707)
Background
• What I assume you already know:
– OS organization
– Programming experience in C or C++
– Concurrency and threading
– Programming w/ sockets, TCP/IP
Course readings
• No official textbook
• Lectures are based on research papers
– Check webpage for schedules
• Useful reference books
– Principles of Computer System Design. (Saltzer and
Kaashoek)
– Distributed Systems (Tanenbaum and Steen)
– Advanced Programming in the UNIX environment
(Stevens)
– UNIX Network Programming (Stevens)
Meeting times & lecture structure
• Tuesdays 5:10-7pm
– With a 10-minute break in the middle
• Each lecture covers basic concepts, followed
by paper discussion
– Read assigned papers before lecture
• Sometimes the instructional assistant will
lead a 30-minute discussion on labs
Important addresses
• URL: http://www.news.cs.nyu.edu/~jinyang/fa14-ds
– Check regularly for schedule
• We’ll use Piazza.com for making
announcements and conducting discussion
How are you evaluated?
• Participation 10%
• Labs 40%
• Quizzes 50%
– mid-term and final (90 minutes each)
Using Piazza
• Please post all questions on Piazza instead of
emailing course staff
• You can make your post either private (only
staff can see it) or public (visible to the whole
class)
• We encourage you to make public posts
– Whole class benefits from seeing your question and its answer
Participation
• Participation is 10% of your final grade
1. Paper summary submitted (before lecture) via Piazza
• Summarize the assigned paper before class
– 3 things you’ve learnt from the paper
– 1 weakness of the paper
– Answer the assigned question (if there is one)
2. In-class participation
3. Piazza discussion
• Asking questions and answering others’ questions
Questions?
What are distributed systems?
• Multiple hosts
• A local or wide-area network
• Machines communicate to provide some service for applications
• Examples?
Why distributed systems?
for ease-of-use
• Handle geographic separation
• Provide users (or applications) with location
transparency:
– Web: access information with a few “clicks”
– Network file system: access files on remote
servers as if they are on a local disk, share files
among multiple computers
Why distributed systems?
for availability
• Build a reliable system out of unreliable parts
– Hardware can fail: power outage, disk failures,
memory corruption, network switch failures…
– Software can fail: bugs, misconfiguration,
upgrades …
– How to achieve 0.99999 (“five nines”) availability?
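To make the 0.99999 target concrete, here is a minimal arithmetic sketch. It assumes machine failures are independent and that the service is up whenever at least one replica is up; both are strong assumptions that real deployments often violate. The function names are illustrative, not from the course.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignoring leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of downtime per year at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def replicated_availability(per_machine: float, replicas: int) -> float:
    """Availability of a service that is up if ANY replica is up.

    Assumes independent failures: the service is down only when
    all replicas are down at the same time.
    """
    return 1.0 - (1.0 - per_machine) ** replicas


def replicas_needed(per_machine: float, target: float) -> int:
    """Smallest replica count whose combined availability meets the target."""
    n = 1
    while replicated_availability(per_machine, n) < target:
        n += 1
    return n


if __name__ == "__main__":
    print(downtime_minutes_per_year(0.99999))   # roughly 5.3 minutes per year
    print(replicas_needed(0.99, 0.99999))       # machines that are each 99% available
```

Under these assumptions, 0.99999 availability allows only about 5.3 minutes of downtime per year, and three machines that are each 99% available are enough on paper. Correlated failures (shared power, shared switch, the same software bug) are what make the real problem hard.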
Why distributed systems?
for scalable capacity
• Aggregate resources of many computers
– CPU: MapReduce, Spark, Grid computing
– Bandwidth: Akamai CDN, BitTorrent
– Disk: Google file system, Hadoop File System
Why distributed systems?
for modular functionality
• Each service only needs to accomplish a
single task well
– Authentication server
– Backup server
• Compose multiple simple services to
achieve sophisticated functionality
– A distributed file system: a block service + a
meta-data lookup service
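The composition idea can be sketched in a few lines. This is a toy, in-memory illustration only: `BlockService`, `MetadataService`, and `FileSystem` are hypothetical names, and in a real system each service would run on separate machines behind a network interface.

```python
class BlockService:
    """Stores small opaque blocks, addressed by an integer id."""

    def __init__(self, block_size: int = 4) -> None:
        self.block_size = block_size
        self._blocks = {}
        self._next_id = 0

    def put(self, data: bytes) -> int:
        if len(data) > self.block_size:
            raise ValueError("block too large")
        block_id = self._next_id
        self._next_id += 1
        self._blocks[block_id] = data
        return block_id

    def get(self, block_id: int) -> bytes:
        return self._blocks[block_id]


class MetadataService:
    """Maps a file name to the ordered list of block ids holding its data."""

    def __init__(self) -> None:
        self._files = {}

    def set_blocks(self, name: str, block_ids: list) -> None:
        self._files[name] = list(block_ids)

    def lookup(self, name: str) -> list:
        return list(self._files[name])


class FileSystem:
    """A file abstraction composed from the two simpler services."""

    def __init__(self, blocks: BlockService, meta: MetadataService) -> None:
        self.blocks = blocks
        self.meta = meta

    def write_file(self, name: str, data: bytes) -> None:
        size = self.blocks.block_size
        # Split the data into block-sized chunks and store each one.
        ids = [self.blocks.put(data[i:i + size])
               for i in range(0, len(data), size)]
        self.meta.set_blocks(name, ids)

    def read_file(self, name: str) -> bytes:
        return b"".join(self.blocks.get(i) for i in self.meta.lookup(name))


if __name__ == "__main__":
    fs = FileSystem(BlockService(block_size=4), MetadataService())
    fs.write_file("a.txt", b"hello world")
    print(fs.read_file("a.txt"))
```

Neither service knows about files; only the thin `FileSystem` layer does. That separation is what lets each service be built, scaled, and replaced on its own.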
The downside
• Much more complex
“A distributed system is a system in which
I can’t do my work because some
computer that I’ve never even heard of
has failed.”
-- Leslie Lamport
The important things in distributed systems design
#1: Abstraction & Interface
• Application users access your service
via some interface
• Example: a storage service’s API could be any of:
– File system (mkdir, readdir, write, read)
– Database (create tables, SQL queries)
– Disk (read block, write block)
• Conflicting goals:
– simple vs. efficient to implement
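One way to see the tension between interface choices: a disk-style API (read block, write block) is easy to implement, but callers who want file-style byte ranges must do the translation themselves. The sketch below is an assumption-laden toy; `Disk`, `BLOCK_SIZE`, and `read_bytes` are hypothetical names chosen for illustration.

```python
BLOCK_SIZE = 8  # deliberately tiny, for illustration


class Disk:
    """Lowest-level interface: whole blocks only."""

    def __init__(self, num_blocks: int) -> None:
        self._blocks = [bytes(BLOCK_SIZE) for _ in range(num_blocks)]

    def read_block(self, n: int) -> bytes:
        return self._blocks[n]

    def write_block(self, n: int, data: bytes) -> None:
        if len(data) != BLOCK_SIZE:
            raise ValueError("must write exactly one full block")
        self._blocks[n] = bytes(data)


def read_bytes(disk: Disk, offset: int, length: int) -> bytes:
    """File-style read of an arbitrary byte range, built on block reads."""
    out = bytearray()
    end = offset + length
    while offset < end:
        # Which block holds this offset, and where inside that block?
        block_no, start = divmod(offset, BLOCK_SIZE)
        # Take bytes up to the end of this block or the end of the range.
        take = min(BLOCK_SIZE - start, end - offset)
        out += disk.read_block(block_no)[start:start + take]
        offset += take
    return bytes(out)


if __name__ == "__main__":
    disk = Disk(2)
    disk.write_block(0, b"abcdefgh")
    disk.write_block(1, b"ijklmnop")
    print(read_bytes(disk, 6, 4))  # a range that straddles two blocks
```

The higher-level call is more convenient for applications, but it hides how many block reads each request costs, which is where efficiency questions start.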
#2: Fault Tolerance
• How to keep the system running when
some machine is down?
• Does the system still give “correct”
service?
• How to reincorporate a recovered machine
correctly?
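The three questions above can be made concrete with a toy failover sketch. Everything here is hypothetical (`Replica`, `get_with_failover`, `put_to_live`, `recover_from`) and runs in one process; it assumes failures are cleanly detectable, which real networks do not guarantee.

```python
class ReplicaDown(Exception):
    """Raised when a replica cannot be reached."""


class Replica:
    def __init__(self, name, data):
        self.name = name
        self.data = dict(data)
        self.up = True

    def get(self, key):
        if not self.up:
            raise ReplicaDown(self.name)
        return self.data[key]

    def put(self, key, value):
        if not self.up:
            raise ReplicaDown(self.name)
        self.data[key] = value

    def recover_from(self, peer):
        # Catch up on missed updates BEFORE serving requests again;
        # otherwise this replica would hand out stale data.
        self.data = dict(peer.data)
        self.up = True


def get_with_failover(replicas, key):
    """Try replicas in order; return (value, name of replica that answered)."""
    for replica in replicas:
        try:
            return replica.get(key), replica.name
        except ReplicaDown:
            continue  # keep the service running using the next replica
    raise ReplicaDown("all replicas are down")


def put_to_live(replicas, key, value):
    """Write to every reachable replica; return the names that were written."""
    written = []
    for replica in replicas:
        try:
            replica.put(key, value)
            written.append(replica.name)
        except ReplicaDown:
            pass  # a down replica silently misses this update
    return written


if __name__ == "__main__":
    a, b = Replica("a", {"x": 1}), Replica("b", {"x": 1})
    a.up = False
    print(get_with_failover([a, b], "x"))  # served by b while a is down
    put_to_live([a, b], "x", 2)            # a misses this update
    a.recover_from(b)                      # state transfer on recovery
    print(get_with_failover([a, b], "x"))
```

Copying the full state from a peer is the simplest possible recovery scheme; it ignores updates that arrive during the copy, which is one reason recovery is hard to get right.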
#3: Consistency
• Contract with apps/users about meaning of
operations. Difficult due to:
– Failure, multiple copies of data, concurrency
• E.g. how to keep 2 replicas “identical”?
– If one is down, it will miss updates
– If the network is partitioned, the two replicas
might process different updates
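The broken-network case can be simulated in a few lines. This is a sketch under simplifying assumptions: each replica holds a single value, an update simply overwrites it, and the "network" is just the list of replicas a client can reach. The names `Replica`, `broadcast`, and `run_partition_demo` are illustrative.

```python
class Replica:
    def __init__(self):
        self.value = 0
        self.log = []  # updates in the order THIS replica applied them

    def apply(self, update):
        self.log.append(update)
        self.value = update  # last update applied wins


def broadcast(update, reachable):
    """Deliver an update to every replica the sender can currently reach."""
    for replica in reachable:
        replica.apply(update)


def run_partition_demo():
    a, b = Replica(), Replica()
    broadcast(1, [a, b])  # healthy network: both replicas see update 1

    # Partition: each client can reach only one side.
    broadcast(2, [a])
    broadcast(3, [b])
    during = (a.value, b.value)

    # Naive healing: each side forwards the update the other one missed.
    a.apply(3)
    b.apply(2)
    after = (a.value, b.value)
    return during, after


if __name__ == "__main__":
    during, after = run_partition_demo()
    print("during partition:", during)
    print("after naive heal:", after)
```

Both replicas end up having seen the same set of updates, yet they still disagree, because they applied them in different orders. Keeping replicas "identical" therefore needs agreement on order, not just delivery.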
#4: Performance
• Latency & Throughput
• To increase throughput, exploit parallelism
– Many resources exist in multiples
• Multiple CPU cores; I/O that can proceed in parallel with CPU work
• To reduce latency:
– Figure out what takes time: queuing, network,
storage, some expensive algorithm, or many
serial steps?
• How much performance is enough?
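The relationship between latency, parallelism, and throughput can be sketched with simple arithmetic. The component timings below are made-up numbers for illustration, and the throughput formula assumes requests are independent and no shared resource saturates; the function names are hypothetical.

```python
def total_latency_ms(components):
    """End-to-end latency when the steps happen one after another."""
    return sum(components.values())


def slowest_component(components):
    """The step to look at first when trying to reduce latency."""
    return max(components, key=components.get)


def throughput_per_sec(latency_ms, in_flight):
    """Requests completed per second with `in_flight` requests in parallel.

    Each request still takes `latency_ms`; parallelism raises throughput
    without lowering the latency of any single request.
    """
    return in_flight * 1000.0 / latency_ms


if __name__ == "__main__":
    # Hypothetical per-request timings, in milliseconds.
    request = {"queuing": 2.0, "network": 0.5, "storage": 10.0, "compute": 0.5}
    latency = total_latency_ms(request)
    print("latency (ms):", latency)
    print("slowest step:", slowest_component(request))
    for workers in (1, 4, 16):
        print(workers, "in flight ->", throughput_per_sec(latency, workers), "req/s")
```

In this toy model, adding parallel workers scales throughput linearly, while only shrinking the slowest serial step (storage, here) improves latency.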