Transcript ppt

INF5071 – Performance in distributed systems:
Peer-to-Peer Systems
27/10 & 10/11 – 2006
Client-Server
 Traditional distributed computing
 Successful architecture, and will continue to be so (adding proxy servers)
 Tremendous engineering necessary to make server farms scalable and robust
(figure: servers on the backbone network reaching clients through local distribution networks)
INF5071 – performance in distributed systems
2006 Carsten Griwodz & Pål Halvorsen
Distribution with proxies
 Hierarchical distribution system
 E.g. proxy caches that consider popularity
 Popular videos replicated and kept close to clients
 Unpopular ones close to the root servers
 Popular videos are replicated more frequently
(figure: hierarchy of root servers, regional servers, local servers and end-systems; completeness of available content increases toward the root)
Peer-to-Peer (P2P)
 Really an old idea - a distributed system architecture
 No centralized control
 Nodes are symmetric in function
 Typically, many nodes, but unreliable and heterogeneous
(figure: peers connected through local distribution networks and a backbone network)
Overlay networks
(figure: overlay nodes joined by overlay links on top of the IP network of LANs and backbone networks; each overlay link corresponds to an IP path of several IP links, determined by IP routing)
P2P
 Many aspects similar to proxy caches
 Nodes act as clients and servers
 Distributed storage
 Bring content closer to clients
 Storage limitation of each node
 Number of copies often related to content popularity
 Necessary to make replication and de-replication decisions
 Redirection
 But
 No distinguished roles
 No generic hierarchical relationship
 At most hierarchy per data item
 Clients do not know where the content is
 May need a discovery protocol
 All clients may act as roots (origin servers)
 Members of the P2P network come and go
P2P Systems
 Peer-to-peer systems
 New considerations for distribution systems
 Considered here
 Scalability, fairness, load balancing
 Content location
 Failure resilience
 Routing
 Application layer routing
 Content routing
 Request routing
 Not considered here
 Copyright
 Privacy
 Trading
Examples: Napster
Napster
 Program for sharing (music) files over the Internet
 Approach taken
 Central index
 Distributed storage and download
 All downloads are shared
 P2P aspects
 Client nodes act also as file servers
Napster
 Client connects to Napster with login and password
 Transmits current listing of shared files (join)
 The central index registers username, maps username to IP address and records song list
Napster
 Client sends song request (query) to Napster server
 Napster checks song database in the central index
 Returns matched songs (answer) with usernames and IP addresses (plus extra stats)
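The central-index lookup described above can be sketched in a few lines. This is an illustrative model, not the Napster wire protocol; the class and method names are assumptions. Peers register their shared songs, queries return (username, IP) pairs, and the download itself happens peer-to-peer.

```python
# Hypothetical sketch of Napster's central index (names are illustrative,
# not from the real protocol).

class CentralIndex:
    def __init__(self):
        self.peers = {}   # username -> IP address
        self.songs = {}   # song title -> set of usernames sharing it

    def join(self, username, ip, shared_files):
        # register username, map it to an IP address, record the song list
        self.peers[username] = ip
        for song in shared_files:
            self.songs.setdefault(song, set()).add(username)

    def query(self, song):
        # return matched songs with usernames and IP addresses
        return [(user, self.peers[user]) for user in self.songs.get(song, ())]

index = CentralIndex()
index.join("alice", "10.0.0.1", ["LetItBe", "Yesterday"])
index.join("bob", "10.0.0.2", ["LetItBe"])
print(sorted(index.query("LetItBe")))
```

The O(1) lookup cost and the single point of failure discussed in the assessment both follow directly from this one central dictionary.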
Napster
 User selects a song, download request (get file) sent straight to the other user
 Machine contacted if available; the central index is not involved in the transfer
Napster: Assessment
 Scalability, fairness, load balancing
 Replication to querying nodes
 Number of copies increases with popularity
 Large distributed storage
 Unavailability of files with low popularity
 Network topology is not accounted for at all
 Latency may be increased
 Content location
 Simple, centralized search/location mechanism
 Can only query by index terms
 Failure resilience
 No dependencies among normal peers
 Index server as single point of failure
Examples: Gnutella
Gnutella
 Program for sharing files over the Internet
 Approach taken
 Purely P2P, nothing centralized
 Dynamically built overlay network
 Query for content by overlay broadcast
 No index maintenance
 P2P aspects
 Peer-to-peer file sharing
 Peer-to-peer querying
 Entirely decentralized architecture
 Many iterations to fix poor initial design (lack of scalability)
Gnutella: Joining
 Connect to one known host and send a broadcast ping
 Can be any host; hosts transmitted through word-of-mouth or host caches
 Use overlay broadcast ping through network with TTL of 7
Gnutella: Joining
 Hosts that are not overwhelmed respond with a routed pong
 Gnutella caches these IP addresses of replying nodes as neighbors
 In the example the grey nodes do not respond within a certain amount of time (they are overloaded)
Gnutella: Query
 Query by broadcasting in the overlay
 Send query to all overlay neighbors
 Overlay neighbors forward query to all their neighbors
 Up to 7 layers deep (TTL 7)
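The TTL-limited broadcast above can be sketched as a breadth-first expansion over the overlay. This is a minimal model (not the Gnutella wire protocol): each hop decrements the TTL, and duplicate suppression keeps nodes from reprocessing the same query.

```python
# Minimal sketch of Gnutella's overlay broadcast with a TTL (illustrative).

def flood_query(overlay, start, ttl=7):
    """Return the set of overlay nodes reached by a TTL-limited broadcast."""
    reached = {start}
    frontier = [start]
    while frontier and ttl > 0:
        nxt = []
        for node in frontier:
            for neigh in overlay.get(node, ()):
                if neigh not in reached:    # drop already-seen queries
                    reached.add(neigh)
                    nxt.append(neigh)
        frontier = nxt
        ttl -= 1                            # one overlay hop consumed
    return reached

# A small line topology: with TTL 2 the query reaches only two hops out,
# so node D stays outside the broadcast horizon.
overlay = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
print(flood_query(overlay, "A", ttl=2))
```

The assessment's "less popular files may be outside TTL" is exactly this horizon effect: content on nodes more than 7 overlay hops away is never found.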
Gnutella: Query
 Send routed responses
 To the overlay node that was the source of the broadcast query
 Querying client receives several responses
 User receives a list of files that matched the query and a corresponding IP address
Gnutella: Transfer
 File transfer
 Using direct communication
 File transfer protocol not part of
the Gnutella specification
Gnutella: Assessment
 Scalability, fairness, load balancing
 Replication to querying nodes
Number of copies increases with popularity
 Large distributed storage
 Unavailability of files with low popularity
 Bad scalability, uses flooding approach
 Network topology is not accounted for at all, latency may be increased
 Content location
 No limits to query formulation
 Less popular files may be outside TTL
 Failure resilience
 No single point of failure
 Many known neighbors
 Assumes quite stable relationships
Examples: Freenet
Freenet
 Program for sharing files over the Internet
 Focus on anonymity
 Approach taken
 Purely P2P, nothing centralized
 Dynamically built overlay network
 Query for content by hashed query and best-first search
 Caching of hash values and content
 Content forwarding in the overlay
 P2P aspects
 Peer-to-peer file sharing
 Peer-to-peer querying
 Entirely decentralized architecture
 Anonymity
Freenet: Nodes and Data
 Nodes
 Routing tables
 Contain IP addresses of other nodes and the hash values they hold
(resp. held)
 Data is indexed with a hash value
 “Identifiers” are hashed
 Identifiers may be keywords, author ids, or the content itself
 Secure Hash Algorithm (SHA-1) produces a “one-way” 160-bit key
 Content-hash key (CHK) = SHA-1(content)
 Typically stores blocks
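The content-hash key can be computed directly with the standard SHA-1 implementation; this small sketch just shows that CHK = SHA-1(content) really is a 160-bit value:

```python
# CHK = SHA-1(content): a "one-way" 160-bit key derived from the data itself.
import hashlib

def content_hash_key(content: bytes) -> str:
    # returned as 40 hex digits = 160 bits
    return hashlib.sha1(content).hexdigest()

key = content_hash_key(b"some file block")
print(key, len(key) * 4, "bits")
```

Because the key is derived from the content, two nodes that independently insert the same block compute the same CHK, which is what lets Freenet deduplicate and verify data.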
Freenet: Storing and Retrieving Data
 Storing Data
 Data is moved to a server with arithmetically close keys
1. The key and data are sent to the local node
2. The key and data are forwarded to the node with the nearest key
Repeat 2 until maximum number of hops is reached
 Retrieving data
 Best First Search
1. An identifier is hashed into a key
2. The key is sent to the local node
3. If data is not in local store, the request is forwarded to the best neighbor
Repeat 3 with next best neighbor until data found, or request times out
4. If data is found, or hop-count reaches zero, return the data or error along
the chain of nodes (if data found, intermediary nodes create entries in their
routing tables)
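The retrieval steps above can be sketched as a greedy walk. This is an illustrative model only: numeric key distance stands in for Freenet's actual neighbor-selection heuristics (listed on the next slide), and the node/dictionary layout is an assumption.

```python
# Illustrative best-first search: forward the request to the neighbor whose
# id is numerically closest to the key, until data is found or hops run out.

hash_of = {'a': 10, 'b': 40, 'c': 55, 'd': 90}   # assumed node key values

def best_first_search(nodes, start, key, max_hops=10):
    """nodes: name -> {'store': {key: data}, 'neighbors': [names]}"""
    current, hops, visited = start, 0, set()
    while hops < max_hops:
        node = nodes[current]
        if key in node['store']:
            return current            # data would be returned along the chain
        visited.add(current)
        candidates = [n for n in node['neighbors'] if n not in visited]
        if not candidates:
            return None               # request fails
        # forward to the neighbor "arithmetically closest" to the key
        current = min(candidates, key=lambda n: abs(hash_of[n] - key))
        hops += 1
    return None

nodes = {n: {'store': {}, 'neighbors': []} for n in hash_of}
nodes['a']['neighbors'] = ['b', 'd']
nodes['b']['neighbors'] = ['a', 'c']
nodes['c']['neighbors'] = ['b']
nodes['c']['store'][54] = "LetItBe"
print(best_first_search(nodes, 'a', 54))
```

Note that step 4 of the protocol (intermediary nodes creating routing-table entries on the way back) is what makes popular content migrate toward the querying region; the sketch omits that feedback.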
Freenet: Best First Search
(figure: a query forwarded through the overlay, choosing a direction at each node)
 Heuristics for selecting direction
 >RES: returned most results
 <TIME: shortest satisfaction time
 <HOPS: min hops for results
 >MSG: sent us most messages (all types)
 <QLEN: shortest queue
 <LAT: shortest latency
 >DEG: highest degree
Freenet: Routing Algorithm
(slide shows the routing algorithm as a figure)
Freenet: Assessment
 Scalability, fairness, load balancing
 Caching in the overlay network
 Access latency decreases with popularity
 Large distributed storage
 Fast removal of files with low popularity
 A lot of storage wasted on highly popular files
 Network topology is not accounted for
 Content location
 Search by hash key: limited ways to formulate queries
 Content placement changes to fit search pattern
 Less popular files may be outside TTL
 Failure resilience
 No single point of failure
Examples: FastTrack,
Morpheus, OpenFT
FastTrack, Morpheus, OpenFT
 Peer-to-peer file sharing protocol
 Three different nodes
 USER
 Normal nodes
 SEARCH
 Keep an index of “their” normal nodes
 Answer search requests
 INDEX
 Keep an index of search nodes
 Redistribute search requests
FastTrack, Morpheus, OpenFT
(figures: overlay of USER, SEARCH and INDEX nodes; a USER query travels via its SEARCH node, and answers return along the same path)
FastTrack, Morpheus, OpenFT: Assessment
 Scalability, fairness, load balancing
 Large distributed storage
 Avoids broadcasts
 Load concentrated on super nodes (index and search)
 Network topology is partially accounted for
 Efficient structure development
 Content location
 Search by hash key: limited ways to formulate queries
 All indexed files are reachable
 Can only query by index terms
 Failure resilience
 No single point of failure, but overlay networks of index servers (and search servers) reduce resilience
 Relies on very stable relationships / content is registered at search nodes
 Relies on a partially static infrastructure
Examples: BitTorrent
BitTorrent
 Distributed download system
 Content is distributed in segments
 Tracker
 One central download server per content
 Approach to fairness (tit-for-tat) per content
 No approach for finding the tracker
 No content transfer protocol included
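BitTorrent peers must decide which segment to request next; the deck's later figures show the rarest-first strategy. A minimal sketch of that rule (illustrative, not real client logic): among segments the peer still needs, pick the one held by the fewest peers in the swarm.

```python
# Sketch of rarest-first segment selection (illustrative).
from collections import Counter

def rarest_first(have, peer_holdings):
    """have: set of segments we own; peer_holdings: one set per known peer."""
    counts = Counter()
    for holding in peer_holdings:
        counts.update(holding)            # how many copies of each segment
    needed = [s for s in counts if s not in have]
    if not needed:
        return None                       # nothing left to fetch
    # request the segment with the fewest copies in the swarm
    return min(needed, key=lambda s: counts[s])

peers = [{1, 2, 3}, {2, 3}, {3}]
print(rarest_first(have={3}, peer_holdings=peers))   # segment 1 is rarest
```

Fetching rare segments first keeps the number of copies of every segment roughly balanced, so the swarm does not stall when the peers holding a rare segment leave.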
BitTorrent
Segment download operation
 Tracker tells peer source and number of segment to get
 Peer retrieves content in pull mode
 Peer reports availability of new segment to tracker
BitTorrent
(figures: several download steps coordinated by the tracker)
 Rarest first strategy for segment selection
 All nodes: max 2 concurrent streams in and out
 No second input stream for a peer that has not contributed enough
BitTorrent Assessment
 Scalability, fairness, load balancing
 Large distributed storage
 Avoids broadcasts
 Transfer content segments rather than complete content
 Does not rely on clients staying online after download completion
 Contributors are allowed to download more
 Content location
 Central server approach
 Failure resilience
 Tracker is single point of failure
 Content holders can lie
Comparison
 Scalability: Gnutella limited by flooding; BitTorrent uses separate overlays per file
 Routing information: Napster one central server; Gnutella neighbour list; FreeNet uses caching; FastTrack index server; BitTorrent one tracker per file
 Lookup cost: Napster O(1); Gnutella O(#nodes); FreeNet O(log(#nodes)); FastTrack O(1); BitTorrent O(#blocks)
 Physical locality: FastTrack by search server assignment; not accounted for by the others
Comparison
 Load balancing: Napster and Gnutella create many replicas of popular content; FreeNet changes content placement to fit search; FastTrack concentrates load on supernodes; BitTorrent uses rarest-first copying
 Content location: Napster all files reachable, uses index server, search by index term; Gnutella unpopular files may be outside TTL, uses flooding; FreeNet unpopular files may be outside TTL, search by hash; FastTrack all files reachable, search by hash; BitTorrent external issue
 Failure resilience: Napster index server as single point of failure; Gnutella and FreeNet no single point of failure; FastTrack overlay network of index servers; BitTorrent tracker as single point of failure
Peer-to-Peer Systems
Distributed directories
Examples: Chord
Chord
 Approach taken
 Only concerned with efficient indexing
 Distributed index - decentralized lookup service
 Inspired by consistent hashing: SHA-1 hash
 Content handling is an external problem entirely
 No relation to content
 No included replication or caching
 P2P aspects
 Every node must maintain keys
 Adaptive to membership changes
 Client nodes act also as file servers
Lookup Based on Hash Tables
lookup(key) → data
Insert(key, data)
(figure: a key run through a hash function selects a position, e.g. pos 3, in a hash table of buckets 0..N holding the data)
Distributed Hash Tables (DHTs)
(figure: a distributed application calls Insert(key, data) and Lookup(key) → data on a distributed hash table spanning many nodes)
 Define a useful key nearness metric
 Keep the hop count small
 Keep the routing tables “right size”
 Stay robust despite rapid changes in membership
 Nodes are the hash buckets
 Key identifies data uniquely
 DHT balances keys and data across nodes
Chord IDs & Consistent Hashing
 m-bit identifier space for both keys and nodes
 Key identifier = SHA-1(key), e.g. Key=“LetItBe” → ID=54
 Node identifier = SHA-1(IP address), e.g. IP=“198.10.10.1” → ID=123
 Both are uniformly distributed
 Identifiers ordered in a circle modulo 2^m
 A key is mapped to the first node whose id is equal to or follows the key id
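The key-to-node mapping above is simple to sketch: sort the node ids and find the first one at or after the key id, wrapping around the circle. The node ids below reuse the slide's example values (54 for "LetItBe", 123 for the example node); the rest are illustrative.

```python
# Consistent hashing on the Chord ring: a key is stored at its successor,
# the first node whose id is equal to or follows the key id.

def successor(node_ids, key_id):
    ring = sorted(node_ids)
    for n in ring:
        if n >= key_id:
            return n
    return ring[0]            # wrap around the circle modulo 2^m

nodes = [1, 12, 32, 56, 99, 123]
print(successor(nodes, 54))   # key 54 ("LetItBe") maps to node 56
print(successor(nodes, 125))  # past the largest id: wraps to node 1
```

A useful property of this scheme is that when a node joins or leaves, only the keys between it and its predecessor move; all other assignments stay put.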
Routing: Everyone-Knows-Everyone
 Every node knows of every other node - requires global information
 Routing tables are large: N entries
 Requires O(1) hops
(figure: Where is “LetItBe”? Hash(“LetItBe”) = K54; the answer comes back directly)
Routing: All Know Their Successor
 Every node only knows its successor in the ring
 Small routing table: 1 entry
 Requires O(N) hops
(figure: the query for Hash(“LetItBe”) = K54 travels node by node around the ring)
Routing: “Finger Tables”
 Every node knows m other nodes in the ring
 Increase distance exponentially
 Finger i points to successor of n + 2^i
Joining the Ring
 Three step process:
1. Initialize all fingers of new node - by asking another node for help
2. Update fingers of existing nodes
3. Transfer keys from successor to new node
Handling Failures
 Failure of nodes might cause incorrect lookup
(figure: ring with nodes N10, N80, N85, N102, N113, N120; Lookup(90) issued at N80)
 N80 doesn’t know correct successor, so lookup fails
 One approach: successor lists
 Each node knows r immediate successors
 After failure find first known live successor
 Increased routing table size
Chord Assessment
 Scalability, fairness, load balancing
 Large distributed index
 Logarithmic search effort
 Network topology is not accounted for
 Routing tables contain log(#nodes) entries
 Quick lookup in large systems, low variation in lookup costs
 Content location
 Search by hash key: limited ways to formulate queries
 All indexed files are reachable
 Log(#nodes) lookup steps
 Not restricted to file location
 Failure resilience
 No single point of failure
 Not in basic approach
 Successor lists allow use of neighbors to failed nodes
 Salted hashes allow multiple indexes
 Relies on well-known relationships, but fast awareness of disruption and
rebuilding
Examples: Pastry
Pastry
 Approach taken
 Only concerned with efficient indexing
 Distributed index - decentralized lookup service
 Uses DHTs
 Content handling is an external problem entirely
 No relation to content
 No included replication or caching
 P2P aspects
 Every node must maintain keys
 Adaptive to membership changes
 Leaf nodes are special
 Client nodes act also as file servers
Pastry
 DHT approach
 Each node has a unique 128-bit nodeId
 Assigned when node joins
 Used for routing
 Each message has a key
 NodeIds and keys are in base 2^b
 b is a configuration parameter with typical value 4 (base = 16, hexadecimal digits)
 Pastry node routes the message to the node with the nodeId closest to the key
 Number of routing steps is O(log N)
 Pastry takes network locality into account
 Each node maintains
 Routing table organized into log_{2^b} N rows with 2^b - 1 entries each
 Neighborhood set M: nodeIds and IP addresses of the M closest nodes, useful to maintain locality properties
 Leaf set L: the set of L nodes with closest nodeIds
Pastry Routing
Example state for the node with nodeId 10233102 (b=2, so nodeIds are base 4, 16 bits):
 Leaf set: the nodes numerically closest to the local node; SMALLER: 10233033, 10233021, 10233001, 10233000; LARGER: 10233120, 10233122, 10233230, 10233232
 Routing table:  log_{2^b} N  rows with 2^b - 1 entries per row; an entry reads “common prefix - next digit - rest”, e.g. -0-2212102, 1-1-301233, 10-0-31203, 102-0-0230, 1023-0-322, 10233-0-01, 102331-2-0; entries in the nth row share the first n-1 digits with the current node, entries in the mth column have m as the nth digit, and entries with no suitable nodeId are left empty
 Neighborhood set: the nodes closest to the local node according to the proximity metric, e.g. 13021022, 02212102, 10200230, 22301203, 11301233, 31203203, 31301233, 33213321
Pastry Routing
Routing a message from source 2331 to destination 1221:
1. Search leaf set for an exact match
2. Search routing table for an entry with at least one more digit in common in the prefix
3. Otherwise forward the message to a node with an equally long common prefix but numerically closer, from the leaf set
Route taken: 2331 → 1331 → 1211 → 1223 → 1221, using the tables
X0: 0-130 | 1-331 | 2-331 | 3-001
X1: 1-0-30 | 1-1-23 | 1-2-11 | 1-3-31
X2: 12-0-1 | 12-1-1 | 12-2-3 | 12-3-3
L: 1232 | 1221 | 1300 | 1301
Pastry Assessment
 Scalability, fairness, load balancing
 Distributed index of arbitrary size
 Support for physical locality and locality by hash value
 Stochastically logarithmic search effort
 Network topology is partially accounted for, given an additional metric for physical locality
 Stochastically logarithmic lookup in large systems, variable lookup costs
 Content location
 Search by hash key: limited ways to formulate queries
 All indexed files are reachable
 Not restricted to file location
 Failure resilience
 No single point of failure
 Several possibilities for backup routes
Examples: Tapestry
Tapestry
 Approach taken
 Only concerned with self-organizing indexing
 Distributed index - decentralized lookup service
 Uses DHTs
 Content handling is an external problem entirely
 No relation to content
 No included replication or caching
 P2P aspects
 Every node must maintain keys
 Adaptive to changes in membership and value change
Routing and Location
 Namespace (nodes and objects)
 SHA-1 hash: 160 bits length
 Each object has its own hierarchy rooted at RootID = hash(ObjectID)
 Prefix-routing [JSAC 2004]
 Router at hth hop shares prefix of length ≥ h digits with destination
 Local tables at each node (neighbor maps)
 Route digit by digit: 4*** → 42** → 42A* → 42AD
 Neighbor links in levels
Routing and Location
 Suffix routing [tech report 2001]
 Router at hth hop shares suffix of length ≥ h digits with destination
 Example: 5324 routes to 0629 via 5324 → 2349 → 1429 → 7629 → 0629
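The suffix property of that route can be checked mechanically: at hop h, the router id must end with the last h digits of the destination. A small sketch that verifies the slide's example path:

```python
# Verify a suffix route: at hop h the node id shares a suffix of
# length >= h digits with the destination.

def suffix_route(source, dest, hops):
    """hops: the intermediate router ids, e.g. from the 5324 -> 0629 example."""
    path = [source] + hops + [dest]
    n = len(dest)
    for h, node in enumerate(path[1:], start=1):
        # each hop fixes (at least) one more trailing digit of dest
        assert node.endswith(dest[n - min(h, n):]), "suffix must grow each hop"
    return path

print(suffix_route("5324", "0629", ["2349", "1429", "7629"]))
```

Digit-by-digit suffix fixing mirrors the prefix routing above; with d-digit ids either scheme reaches the destination in at most d hops.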
 Tapestry routing
 Cache pointers to all copies
 Caches are soft-state
 UDP heartbeat and TCP timeout to verify route availability
 Each node has 2 backup neighbors
 Failing primary neighbors are kept for some time (days)
 Multiple root nodes possible, identified via hash functions
 Search value in a root if its hash is that of the root
 Choosing a root node
 Choose a random address
 Route towards that address
 If no route exists, choose a surrogate deterministically
Routing and Location
 Object location
 Root responsible for storing object’s location (but not the object)
 Publish / search both routes incrementally to root
 Locates objects
 Object: key/value pair
 E.g. filename/file
 Automatic replication of keys
 No automatic replication of values
 Values
 May be replicated
 May be stored in erasure-coded fragments
Tapestry
(figures: object operations in the overlay)
 Insert(node, key K, value V): the value V is stored at a node, and (#K, pointer) entries are created along the route toward the root for #K
 Lookup ?K: the query routes toward the root until a (#K, pointer) entry is found, then follows the pointer to V and returns the result; resolutions may be cached along the way
 Move(node, key K, value V): the value moves to another node, but stale (#K, pointer) entries stay wrong till timeout
Tapestry Assessment
 Scalability, fairness, load balancing
 Distributed index(es) of arbitrary size
 Limited physical locality of key access by caching and nodeId selection
 Variable lookup costs
 Independent of content scalability
 Content location
 Search by hash key: limited ways to formulate queries
 All indexed files are reachable
 Not restricted to file location
 Failure resilience
 No single point of failure
 Several possibilities for backup routes
 Caching of key resolutions
 Use of hash values with several salt values
Comparison
 Routing information: Chord log(#nodes) routing table size; Pastry log(#nodes) x (2^b - 1) routing table size; Tapestry at least log(#nodes) routing table size
 Lookup cost: Chord log(#nodes); Pastry approx. log(#nodes); Tapestry variable lookup cost
 Physical locality: Chord none; Pastry by neighbor list; Tapestry in mobile Tapestry
 Failure resilience: Chord no resilience in basic version, additional successor lists provide resilience; Pastry no single point of failure, several backup routes; Tapestry no single point of failure, several backup routes, alternative hierarchies
Applications
Streaming in Peer-to-peer networks
Promise
Promise
 Video streaming in Peer-to-Peer systems
 Video segmentation into many small segments
 Pull operation
 Pull from several sources at once
 Based on Pastry and CollectCast
 CollectCast
 Adds rate/data assignment
 Evaluates
 Node capabilities
 Overlay route capabilities
 Uses topology inference
 Detects shared path segments - using ICMP similar to traceroute
 Tries to avoid shared path segments
 Labels segments with quality (or goodness)
Promise
Each active sender:
• receives a control packet specifying which data segments, data rate, etc.
• pushes data to the receiver as long as no new control packet is received
The receiver:
• sends a lookup request using the DHT
• selects some active senders and sends them control packets
• receives data as long as no errors/changes occur
• if a change/error is detected, new active senders may be selected
(figure: a receiver served by several active senders, with standby senders available)
Thus, Promise is a multiple-sender to one-receiver P2P media streaming system which 1) accounts for different capabilities, 2) matches senders to achieve best quality, and 3) dynamically adapts to network fluctuations and peer failure
SplitStream
SplitStream
 Video streaming in Peer-to-Peer systems
 Uses layered video
 Uses overlay multicast
 Push operation
 Build disjoint overlay multicast trees
 Based on Pastry
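The stripe idea can be sketched in a few lines. This is an illustrative model only: real SplitStream stripes come from a striped (MDC/layered) codec, not a naive round-robin packet split, but the round-robin version shows how a stream divides into K independently multicast substreams.

```python
# Illustrative stripe split: assign packets round-robin to K stripes,
# each of which would be pushed down its own multicast tree.

def split_into_stripes(packets, k):
    stripes = [[] for _ in range(k)]
    for i, pkt in enumerate(packets):
        stripes[i % k].append(pkt)    # stripe j gets packets j, j+k, j+2k, ...
    return stripes

stripes = split_into_stripes(list(range(8)), k=2)
print(stripes)
```

Because each node is an interior (forwarding) node in at most the trees it volunteers for, and a leaf elsewhere, the forwarding load spreads across all peers instead of concentrating on a single tree's interior.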
SplitStream
Each movie is split into K stripes and each stripe is multicast using a separate tree
Each node:
• joins as many multicast trees as there are stripes (K)
• may specify the number of stripes it is willing to act as router for, i.e., according to the amount of resources available
(figure: source with the full-quality movie feeding the Stripe 1 and Stripe 2 trees)
Thus, SplitStream is a multiple-sender to multiple-receiver P2P system which distributes the forwarding load while respecting each node's resource limitations, but some effort is required to build the forest of multicast trees
Some References
1. M. Castro, P. Druschel, A.-M. Kermarrec, A. Nandi, A. Rowstron and A. Singh, "SplitStream: High-bandwidth multicast in a cooperative environment", SOSP'03, Lake Bolton, New York, October 2003
2. Mohamed Hefeeda, Ahsan Habib, Boyan Botev, Dongyan Xu, Bharat Bhargava, "Promise: Peer-to-Peer Media Streaming Using CollectCast", ACM MM'03, Berkeley, CA, November 2003
3. Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek and Hari Balakrishnan, "Chord: A scalable peer-to-peer lookup service for internet applications", ACM SIGCOMM'01
4. Ben Y. Zhao, John Kubiatowicz and Anthony Joseph, "Tapestry: An Infrastructure for Fault-tolerant Wide-area Location and Routing", UCB Technical Report CSD-01-1141, 2001
5. John Kubiatowicz, "Extracting Guarantees from Chaos", Comm. ACM, 46(2), February 2003
6. Antony Rowstron and Peter Druschel, "Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems", Middleware'01, November 2001
Adaptive Multi-Source Streaming in Heterogeneous
Peer-to-Peer Networks
Vikash Agarwal, Reza Rejaie
Computer and Information Science Department
University of Oregon
http://mirage.cs.uoregon.edu
January 19, 2005
Introduction
 P2P streaming becomes increasingly popular
 Participating peers form an overlay to cooperatively stream
content among themselves
 An overlay-based approach is the only way to efficiently support multi-party streaming apps without IP multicast
 Two components:
 Overlay construction
 Content delivery
 Each peer desires to receive max. quality that can be streamed
through its access link
 Peers have asymmetric & heterogeneous BW connectivity
 Each peer should receive content from multiple parent peers =>
Multi-source streaming.
 Multi-parent overlay structure rather than tree
Benefits of Multi-source Streaming
 Higher bandwidth to each peer
 higher delivered quality
 Better load balancing among peers
 Less congestion across the network
 More robust to dynamics of peer participation
 Multi-source streaming introduces new
challenges …
Multi-source streaming: Challenges
 Congestion-controlled connections from different parent peers exhibit
 independent variations in BW
 different RTT, BW, loss rate
 Aggregate bandwidth changes over time
 Streaming mechanism should be quality adaptive
 Static “one-layer-per-sender” approach is inefficient
 There must be a coordination mechanism among senders in order to
 efficiently utilize aggregate bandwidth
 gracefully adapt delivered quality with BW variations
This paper presents a receiver-driven coordination mechanism for multi-source streaming called PALS
Previous Studies
 Congestion control was often ignored
 Server/content placement for streaming MD content [Apostolopoulos et al.]
 Resource management for P2P streaming [Cui et al.]
 Multi-sender streaming [Nguyen et al.], but they assumed
 aggregate BW is more than stream BW
 RLM is receiver-driven, but ...
 RLM tightly couples coarse quality adaptation with CC
 PALS only determines how aggregate BW is used
 P2P content distribution mechanisms cannot accommodate “streaming” apps
 e.g. BitTorrent, Bullet
Overall Architecture
 Overall architecture for P2P streaming
 PRO: Bandwidth-aware overlay construction
 Identifying good parents in the overlay
 PALS: Multi-source adaptive streaming
 Streaming content from selected parents
 Distributed multimedia caching
 Decoupling overlay construction from delivery
provides great deal of flexibility
 PALS is a generic multi-source streaming protocol for
non-interactive applications
Assumptions & Goals
Assumptions:
 All peers/flows are congestion controlled
 Content is layered encoded
 All layers are CBR with the same consumption rate*
 All senders have all layers (relax this later)*
 Limited window of future packets is available at each sender
 Live but non-interactive
Goals:
 To fully utilize aggregate bandwidth to dynamically maximize delivered quality
 Deliver max no of layers
 Minimize variations in quality
* Not requirements
P2P Adaptive Layered Streaming (PALS)
 Receiver: periodically requests an ordered list of packets/segments from each sender
 Sender: simply delivers the requested packets in the given order at the CC rate
 Benefits of ordering the requested list:
 Provides flexibility for the receiver to closely control delivered packets
 Graceful degradation in quality when bandwidth suddenly drops
 Periodic requests => stability & less overhead
Basic Framework
[Figure: receiver Peer 0 streams from senders Peer 1 and Peer 2 across the Internet; congestion-controlled (CC) flows with bandwidths BW0, BW1 and BW2 feed a demux into per-layer buffers buf 0 .. buf 3, drained at rates bw0(t) .. bw3(t) by the decoder]
 Receiver passively monitors EWMA BW from each sender, giving an EWMA aggregate BW
 Estimate total no of pkts to be delivered during the next window (K)
 Allocate K pkts among active layers (Quality Adaptation)
 Controlling evolution of buffer state
 Controlling bw0(t), bw1(t), …
 Allocating each sender’s bw among active layers
 Assign a subset of pkts to each sender (Packet Assignment)
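The per-window bookkeeping on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the packet size, the EWMA smoothing factor ALPHA, and the function names are assumptions.

```python
# Hypothetical sketch of the receiver's per-window estimation step.
# PKT_SIZE and ALPHA are assumed constants, not values from the paper.

PKT_SIZE = 1000      # bytes per packet (assumed)
ALPHA = 0.125        # EWMA smoothing factor (assumed)

def ewma(prev, sample, alpha=ALPHA):
    """Exponentially weighted moving average of a sender's bandwidth."""
    return (1 - alpha) * prev + alpha * sample

def window_budget(ewma_bws, window_sec, pkt_size=PKT_SIZE):
    """Estimate K: total packets deliverable during the next window,
    from the per-sender EWMA bandwidth estimates (bytes/sec)."""
    aggregate = sum(ewma_bws)          # EWMA aggregate BW
    return int(aggregate * window_sec / pkt_size)

# Example: three parent peers with smoothed bandwidths of
# 40 kB/s, 80 kB/s and 30 kB/s, and a 4-second window.
bws = [40_000, 80_000, 30_000]
K = window_budget(bws, 4.0)   # 150_000 * 4 / 1000 = 600 packets
```

K then feeds the quality adaptation and packet assignment steps described on the following slides.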
Key Components of PALS
 Sliding Window (SW): to keep all senders busy & loosely synchronized with receiver playout time
 Quality Adaptation (QA): to determine the quality of the delivered stream, i.e. required packets for all layers during one window
 Packet Assignment (PA): to properly distribute required packets among senders
Sliding Window
 Buffering window: range of timestamps for packets that must be requested in one window
 Window is slid forward in a step-like fashion
 Requested packets per window can come from
1) Playing window (loss recovery)
2) Buffering window (main group)
3) Future windows (buffering)
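The three request sources above can be sketched as one ordered list per window, truncated to the window's packet budget K. The (layer, timestamp) tuple format and the priority order (recovery first, then the main group, then pre-buffering) are illustrative assumptions.

```python
# Hypothetical sketch of composing one window's ordered request list.

def build_request(lost_pkts, buffering_pkts, future_pkts, budget):
    """Order requests: loss recovery from the playing window first,
    then the main buffering window, then future-window buffering,
    truncated to the packet budget K estimated for this window."""
    ordered = lost_pkts + buffering_pkts + future_pkts
    return ordered[:budget]

req = build_request(
    lost_pkts=[("L0", 10)],                    # missed in playing window
    buffering_pkts=[("L0", 20), ("L1", 20)],   # main group
    future_pkts=[("L0", 30)],                  # pre-buffering
    budget=3,
)
# req == [("L0", 10), ("L0", 20), ("L1", 20)]
```

Because the list is ordered by importance, a sudden bandwidth drop truncates the tail (future buffering) first, which is exactly the graceful degradation the slides describe.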
Sliding Window (cont’d)
 Window size determines the tradeoff between smoothness & signaling overhead vs. responsiveness
 Should be a function of RTT, since RTT specifies the timescale of variations in BW
 Multiple of max smoothed RTT among senders
 Receiver might receive duplicates
 Re-requesting a packet that is still in flight!
 Ratio of duplicates is very low and can be reduced by increasing window size
Coping with BW variations
 Sliding window alone is insufficient
 Coping with a sudden drop in BW by
 Overwriting requests at senders
 Ordering requested packets
 Coping with a sudden increase in BW by
 Requesting extra packets
Quality Adaptation
 Determining required packets from future windows
 Coarse-grained adaptation
 Add/drop layer
 Fine-grained adaptation
 Controlling bw0(t), bw1(t), …
 Loosely controlling evolution of receiver buffer state/distribution
 Buffer distribution determines what degree of BW variations can be smoothed
 What is a proper buffer dist.?
[Figure: same receiver architecture as on the Basic Framework slide — CC flows from Peer 1 and Peer 2 (BW0, BW1, BW2) cross the Internet into Peer 0’s demux, filling per-layer buffers buf 0 .. buf 3 drained at bw0(t) .. bw3(t) by the decoder]
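A possible shape for the coarse-grained add/drop decision is sketched below. The buffered-seconds trigger and both thresholds are assumptions made for illustration only; the paper's actual conditions involve the receiver buffer distribution, not a single aggregate figure.

```python
# Illustrative (assumed) coarse-grained layer add/drop rule.

def adapt_layers(n_layers, buffered_sec, add_thresh=8.0, drop_thresh=2.0,
                 max_layers=4):
    """Add a layer when enough future data is buffered to absorb BW
    variations; drop one when buffering falls below a safety margin.
    Thresholds are illustrative, not taken from the paper."""
    if buffered_sec >= add_thresh and n_layers < max_layers:
        return n_layers + 1
    if buffered_sec <= drop_thresh and n_layers > 1:
        return n_layers - 1
    return n_layers
```

The hysteresis gap between the two thresholds is what keeps the delivered quality from oscillating, matching the goal of minimizing variations in quality.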
Buffer Distribution
 Impact on delivered quality
 Conservative buf. distribution achieves long-term smoothing
 Aggressive buf. distribution achieves short-term improvement
 PALS leverages this tradeoff in a balanced fashion
 Window size affects buffering:
 Amount of future buffering
 Slope of buffer distribution
 Multiple opportunities to request a packet (see paper)
 Implicit loss recovery
Packet Assignment
 How to assign an ordered list of selected pkts from diff. layers to individual senders?
 Number of pkts assigned to each sender must be proportional to its BW contribution
 More important pkts should be delivered first
 Weighted round-robin pkt assignment strategy
 Extended this strategy to support partially available content at each peer
 Please see the paper for further details
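The weighted round-robin idea can be sketched with a simple credit scheme: each sender's ideal share of the ordered list is proportional to its bandwidth, and each packet goes to the sender furthest behind its share. The credit formulation is an assumption for illustration, and partially available content is not handled here (see the paper).

```python
# Hypothetical weighted round-robin packet assignment sketch.

def assign_packets(ordered_pkts, sender_bws):
    """Walk the ordered list, giving each packet to the sender with the
    largest outstanding credit (ideal cumulative share minus packets
    already assigned), so more important packets go out first and each
    sender's load is proportional to its BW contribution."""
    total = sum(sender_bws)
    shares = [bw / total for bw in sender_bws]
    assigned = [[] for _ in sender_bws]
    for i, pkt in enumerate(ordered_pkts, start=1):
        # credit = ideal cumulative count minus actual assigned count
        credits = [shares[s] * i - len(assigned[s])
                   for s in range(len(sender_bws))]
        s = max(range(len(sender_bws)), key=lambda j: credits[j])
        assigned[s].append(pkt)
    return assigned

# Two senders with a 2:1 bandwidth ratio share six packets 4:2.
out = assign_packets(list(range(6)), [200, 100])
# out == [[0, 2, 3, 5], [1, 4]]
```

Because assignment walks the ordered list from the front, the most important packets are spread across senders early, so losing one slow sender degrades quality gracefully rather than dropping whole layers.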
Performance Evaluation
 Using ns simulation to control BW dynamics
 Focused on three key dynamics in P2P systems: BW variations, peer participation, content availability
 Senders with heterogeneous RTT & BW
 Decouple underlying CC mechanism from PALS
 Performance metrics: BW utilization, delivered quality
 Two strawman mechanisms with static layer assignment to each sender:
 Single Layer per Sender (SLS): sender i delivers layer i
 Multiple Layers per Sender (MLS): sender i delivers layers j < i
Necessity of Coordination
 SLS & MLS exhibit high variations in quality
 No explicit loss recovery
 No coordination
 Inter-layer dependency magnifies the problem
 PALS effectively utilizes aggregate BW & delivers stable quality in all cases
Delay-Window Tradeoff
 Avg. delivered quality only depends on agg. BW
 Heterogeneous senders
 Higher delay => smoother quality
 Duplicates exponentially decrease with window size
 Avg. per-layer buffering linearly increases with delay
 Increasing window leads to even buffer dist.
 See paper for more results
Conclusion & Future Work
 PALS is a receiver-driven coordination mechanism for streaming from multiple cong. controlled senders
 Simulation results are very promising
 Future work:
 Further simulations to examine more details
 Prototype implementation for real experiments
 Integration with other components of our architecture for P2P streaming
Partially available content
 Effect of segment size and redundancy
Packet Dynamics