Transcript L01

EECE 411: Design of Distributed Software Applications
(or Distributed Systems 101)
Matei Ripeanu
http://www.ece.ubc.ca/~matei
EECE 411: Design of Distributed Software Applications
Today’s Objectives
Class mechanics
http://www.ece.ubc.ca/~matei/EECE411/
Understand real-world applications in terms of:
Motivation and objectives
Resource requirements:
• compute/storage/network resources
Architecture (“distributed systems” part)
Examples: Recent p2p applications
Start thinking of computer networks from the
perspective of a networked-application
Why? More intuitive
EECE 411: Design of Distributed Software Applications
P2P Definition(s)
Def 1: “A class of applications that takes
advantage of resources — storage, cycles, content,
human presence — available at the edges of the
Internet.”
Edges often turned on/off, without permanent IP addresses
Def 2: “A class of decentralized, self-organizing
distributed systems, in which all or most
communication is symmetric.”
Lots of other definitions that fit in between
Lots of (P2P?) systems that fit nowhere …
EECE 411: Design of Distributed Software Applications
P2P Impact: Widespread adoption
Skype:
560M registered users (Q2’10)
• 120M active, 8M paying
15M user online
Number of users for file-sharing applications
(estimate www.slyck.com, Sept ‘06)
P2P design techniques
are now mainstream!
eDonkey
3,108,909
FastTrack (Kazaa)
2,114,120
Gnutella
2,899,788
Cvernet
691,750
Filetopia
3,405
EECE 411: Design of Distributed Software Applications
P2P Impact (2): Huge resource users
P2P generated traffic now dominates the
Internet load (30-50% of the traffic)
Internet2 traffic statistics
100%
Cornell.edu (March
’02): 60% P2P
90%
80%
70%
60%
Other
50%
Data transfers
40%
Unidentified
30%
File sharing
20%
10%
0%
Feb.'02
Aug.'02
Feb.'03
Aug. '09
EECE 411: Design of Distributed Software Applications
Apr.'10
P2P Impact (3) – Demonstrate that
volatile, small, non-proprietary resources
can be efficiently harnessed
Resources: CPU, storage space,
But also: network bandwidth, availability, user attention and
expertise
Boinc statistics
Users
Hosts
Average FLOPS
Total
Active
2,030,434
380,406
5,642,321
614,926
5.6 PetaFLOPS (1Peta = 1015)
EECE 411: Design of Distributed Software Applications
P2P Impact (4) – Social / Business
Data distribution at (almost) zero almost cost
Forces companies to change their business
models
• Digital content production and distribution
• Telecommunications companies
New collaboration models
• Crowd-sourcing!
EECE 411: Design of Distributed Software Applications
Roadmap
Definitions
Impact
Applications
Mechanisms
A case study
EECE 411: Design of Distributed Software Applications
Applications: Number crunching
Examples: Folding@Home, UnitedDevices, etc
Characteristics (e.g., Folding@Home):
Massive parallelism
Low bandwidth/computation ratio
Error tolerance
Users do donate *real* resources
Problems
$1.5M / year
extra consumed power
Centralized. Does it scale?
Cheating!
Approach suitable for a particular class of problems.
• How to extend the model to problems that are not
massively parallel
EECE 411: Design of Distributed Software Applications
Applications: Content distribution
(files, video)
The ‘killer application’ to date
Too many to list them all:
BitTorrent, FastTrack (KaZaA, KazaaLite, iMesh),
Gnutella (LimeWire,BearShare)
Two independent problems
1. Distributed index
2. Fast content download
Environment: unreliable, non-cooperative
EECE 411: Design of Distributed Software Applications
Applications: Performance evaluation
Poor online performance costs businesses
$25 billion per year (Zone Research)
28% of attempted online purchases fail (BCG)
• Slow page download is the primary reason for
transaction abandonment
• Business transactions are at particular risk
User expectations for page download are around 4 seconds
Performance evaluation & monitoring requires
multiple vantage points
Connectivity statistics
Routing errors
Evaluate Web-site performance form end-user perspective
EECE 411: Design of Distributed Software Applications
Measurements: The Performance “Blind Spot”
Back-end
Infrastructure
Network
Landscape
Last-mile
“Blind Spot”
Datacenter Testing
“Beacon”
Database
Web server
Backbone
ISP
Enterprise Provider
T1
Firewall
Corporate
Network
ISP
App server
Backbone
3rd party
content
Regional
Network
Corporate
User
Major
Provider
Local ISP
Component
Testing
•BMC
•Mercury Interactive
•Tivoli
•ProactiveNet
•HP OpenView
•Computer Associates
Datacenter
Monitoring
•Keynote Systems
•Mercury Interactive
•BMC/SiteAngel
•Service Metrics
Consumer
User
Critical to estimate end-to-end
performance
EECE 411: Design of Distributed Software Applications
Slide source: www.porivo.com
Measurements: End-to-end Performance
Back-end
Infrastructure
Database

Network
Landscape
Web server

App server

Backbone
ISP
Enterprise Provider
T1
Firewall


Corporate
Network
ISP
Backbone
3rd party
content
Regional
Network
Corporate
User
Major
Provider
Local ISP
Component
Testing
Datacenter
Monitoring
Consumer
User
End-to-end
Web Performance
Testing
EECE 411: Design of Distributed Software Applications
Slide source: www.porivo.com
More applications …
Backup storage (HiveNet, OceanStore)
Collaborative environments
Spam filtering
Anonymous email
Censorship-resistant publishing systems
(Ethernity, Freenet)
EECE 411: Design of Distributed Software Applications
Roadmap
Definitions
Impact
Applications
Mechanisms
A Case Study
EECE 411: Design of Distributed Software Applications
Mechanisms (I)
To obtain a resilient system:
use redundancy for data and services
integrate multiple components with uncorrelated
failure curves.
To reduce cost and improve the QoS delivered:
move service delivery closer to the user
integrate multiple clients with uncorrelated
demand curves
(lower over-provisioning at resource providers)
EECE 411: Design of Distributed Software Applications
Example (I): Cooperative Web serving
Other
Origin Server
Other
Server
Server
Other
www.matei.com
Server
Problem:
Flash-crowds!
dnssrv
DNS
Query
Resolver
Browser
www.matei.com
216.165.108.10
EECE 411: Design of Distributed Software Applications
Example (I): Cooperative Web serving
Origin
Server

httpprx
Fetch data
from nearby
httpprx
dnssrv
DNS Redirection
Return proxy,
preferably one
near client
Cooperative
Web Caching
Resolver
Browser
akamai.cnn.com
216.165.108.10
EECE 411: Design of Distributed Software Applications
Mechanisms (II)
To detect anomalies, to generate good statistics:
Use multiple views
• Example: Web-server performance
characterization
To provide anonymity:
use large number of independent components
(“hide in the crowd”) and make search
impossible (or at least costly)
• Example: onion routing
EECE 411: Design of Distributed Software Applications
Roadmap
Definitions
Impact
Uses and Examples
Mechanisms
A case study
File sharing: The Gnutella Network & BitTorrent
EECE 411: Design of Distributed Software Applications
Basic Primitives for File Sharing
Join: How do I begin participating?
Publish: How do I advertise my file(s)?
Search: How do I find a file?
Fetch: How do I retrieve a file?
Lots of different solutions for each of these four
primitives.
EECE 411: Design of Distributed Software Applications
What makes these systems interesting?
Large scale
Self-organizing networks
Fast growth
Gnutella: more than 50x during first half of 2001; 50x
again 2001 to 2006
Open architecture, simple and flexible protocols
Interesting mix of social and technical issues
EECE 411: Design of Distributed Software Applications
Gnutella search mechanism
Boston
Chicago
MIT
UBC
Beatles: Yellow Submarine
Q:Beatles
Calgary
Gnutella nodes
TCP overlay tunnels
Routers
Search steps:
1. Initiates search for “Yellow Submarine”
2. Sends message to all neighbors
3. Neighbors forward message
4. Initiate reply message
5. Reply message is back-propagated
6. File download
EECE 411: Design of Distributed Software Applications
Gnutella: Overview
Join: on startup, client contacts a few other
nodes; these become its “neighbors”
Publish: no need
Search:
Flooding: pass query to neighbors, who
pass the query in turn to their own
neighbors, and so on...
Back-propagation in case of success
Fetch: get the file directly from peer (HTTP)
[Note: this was the original design. Later the network
moved to a two-layer structure]
EECE 411: Design of Distributed Software Applications
BitTorrent
Ingredients
A “seed” node that has the file
A “.torrent” meta-file is built for the file
A web-sever (usually) to index torrents
A “tracker” node is associated with each file
Identified in the .torrent
File is split into fixed-size segments (e.g.,
256KB)
EECE 411: Design of Distributed Software Applications
How does it work
Tracker
Web Server
C
A
Peer
Peer
Downloader
“US”
B
[Seed]
Peer
[Leech]
EECE 411: Design of Distributed Software Applications
Overview – system components
Tracker
Web Server
C
A
Peer
Peer
[Leech]
B
Downloader
Peer
“US”
[Leech]
[Seed]
EECE 411: Design of Distributed Software Applications
Overview – system components
Tracker
Web Server
C
A
Peer
Peer
[Leech]
B
Downloader
Peer
“US”
[Leech]
[Seed]
EECE 411: Design of Distributed Software Applications
Overview – system components
Tracker
Web Server
C
A
Peer
Peer
[Leech]
B
Downloader
Peer
“US”
[Leech]
[Seed]
EECE 411: Design of Distributed Software Applications
Overview – system components
Tracker
Web Server
C
A
Peer
Peer
[Leech]
B
Downloader
Peer
“US”
[Leech]
[Seed]
EECE 411: Design of Distributed Software Applications
Overview – system components
Tracker
Web Server
C
A
Peer
Peer
[Leech]
B
Downloader
Peer
“US”
[Leech]
[Seed]
EECE 411: Design of Distributed Software Applications
Overview – system components
Tracker
Web Server
C
A
Peer
Peer
[Leech]
B
Downloader
Peer
“US”
[Leech]
[Seed]
EECE 411: Design of Distributed Software Applications
BitTorrent: Overview
Join: nothing
just find a server/community
Publish: create ‘tracker’, spread .torrent file
Search:
for file: (not included in the protocol)
• the community is supposed to provide search
tools
for segments: exchange segment IDs maps
with other peers.
Fetch: exchange segments with other peers
(HTTP)
EECE 411: Design of Distributed Software Applications
Gnutella vs. BitTorrent: Discussion
System properties
Reliability?
Scalability?
Fairness?
Overheads?
Quality of Service
• Search coverage for content?
• Ability to download content fast?
• Ability to survive flash crowds?
The rest of this course: How to build
(distributed) systems with desirable
characteristics.
EECE 411: Design of Distributed Software Applications
Assignment 0
To do: Subscribe to mailing list
EECE 411: Design of Distributed Software Applications
EECE 411: Design of Distributed Software Applications
Gnutella -- Network Resilience
Topology
Random 30% die
Targeted 4% die
from Saroiu et al., MMCN 2002
EECE 411: Design of Distributed Software Applications
Gnutella: Query distribution
Highly heterogeneous distribution for query
popularity
 similar to Web pages popularity
 caching will work well

from Kunwadee
et al., 2002
EECE 411: Design of Distributed
Software Applications
Gnutella: Topology issues (1)
1.5Mbps DSL
1.5Mbps DSL
QuickTime™ and a
TIFF (Uncompress ed) dec ompres sor
are needed to s ee this pic ture.
56kbps Modem
1.5Mbps DSL
10Mbps LAN
1.5Mbps DSL
56kbps Modem
56kbps Modem
EECE 411: Design of Distributed Software Applications
Gnutella Topology Mismatch
EECE 411: Design of Distributed Software Applications