Client-server - Dipartimento di Informatica

Download Report

Transcript Client-server - Dipartimento di Informatica

Complex Adaptive Systems
C.d.L. Informatica – Università di Bologna
Peer-to-peer systems and
overlay networks
Fabio Picconi
Dipartimento di Scienze dell’Informazione
1
Outline
•
Introduction to P2P systems
•
Common topologies
•
Data location
•
Churn
•
Newscast algorithm
•
Security
2
Peer-to-peer vs. client-server
client-server
peer-to-peer
• Server well connected to the
“center” of the Internet
• Only nodes located on the
“periphery” of the Internet
• Servers carries out critical tasks
• Tasks distributed across all nodes
• Clients only talk to server
• Clients talk to other clients
3
Example – Video sharing
Client-server: YouTube
Advantages
• Client can disconnect after upload
downloader
downloader
• Uploader needs little bandwidth
• Other users can find the file easily
(just use search on server webpage)
downloader
uploader
Disadvantages
• Server may not accept file or remove
it later (according to content policy)
downloader
downloader
client-server
• Whole system depends on the server
(what if shut down like Napster?)
• Server storage and bandwidth
are expensive!
4
Example – Video sharing
Peer-to-peer: BitTorrent
Advantages
• Does not depend on a central server
downloader
downloader
• Bandwidth shared across nodes
(downloaders also act as uploaders)
• High scalability, low cost
downloader
seeder
Disadvantages
• Seeder must remain on-line to
guarantee file availability
downloader
downloader
peer-to-peer
• Content is more difficult to find
(downloaders must find .torrent file)
• Freeloaders cheat in order to
download without uploading
5
Comparison: P2P vs. client-server
Client-server
Peer-to-peer
• Asymmetric: client and servers
carry out different tasks
• Symmetric: each node carries out
the same tasks
• Global knowledge: servers have
a global view of the network
• Local knowledge: nodes only
know a small set of other nodes
• Centralization: communications
and management are centralized
• Decentralization: nodes must selforganize in a decentralized way
• Single point of failure: a server
failure brings down the system
• Robustness: several nodes may fail
with little or no impact
• Limited scalability: servers
easily overloaded
• High scalability: high aggregate
capacity, load distribution
• Expensive: server storage and
bandwidth capacity is not cheap
• Low-cost: storage and bandwidth
are contributed by users
6
Characterizing peer-to-peer systems
The main characteristics of P2P systems are:
• decentralization (i.e., no central server)
• self-organization (e.g., adding new nodes and removing disconnected ones)
• symmetric communications (e.g., peers act as clients and servers)
• scalability (thanks to high aggregate capacity and load distribution)
• shared ownership (i.e., storage and bandwidth are contributed by peers)
• overlay construction and routing (i.e., nodes form a logical network on
top of the underlying IP network)
a message from one peer to
another is sent through the
underlying IP network
7
P2P environment
P2P systems are deployed in a challenging environment:
• High latency and low bandwidth between nodes
- a high hop count will result in a high end-to-end latency
- transferring large files may take a long time
• Churn
- nodes may disconnect temporarily
- new nodes are constantly joining the system, while others leave the
overlay permanently
• Security
- P2P clients run on machines under full control of their users
- data sent to other nodes may be erased, corrupted, disclosed, etc.
- malicious users may try to bring down the system (e.g., routing attack)
• Selfishness
- users may run hacked P2P clients in order to avoid contributing resources
8
Problems
Some of the problems that a P2P systems designer must face:
• Overlay construction and maintenance
- maintain a given overlay topology (e.g., random, two-level, ring, etc.)
• Data location
- locate a given data object among a large number of nodes
• Data dissemination
- propagate data in an efficient and robust manner
• Per-node state
- keep the amount of state per node small
• Tolerance to churn
- maintain system invariants (e.g., topology, data location, data availability)
despite node arrivals and departures
9
Topology
Some common topologies:
• Flat unstructured: a node can connect to any other node
- only constraint: maximum degree dmax
- fast join procedure
- usually very tolerant to churn
- good for data dissemination, bad for location
• Two-level unstructured: nodes connect to a supernode
- supernodes form a small overlay
- used for indexing and forwarding
- large state and high load on supernodes
• Flat structured: constraints based on node ids
- allows for efficient data location
- constraints require long join and leave procedures
- less robust in high-churn environments
10
Data location - Flooding
Problem: find the set of nodes S that store a copy of object O
Solutions:
(1) Flooding: send a search message to all nodes [first Gnutella protocol]
• A search message contains either keywords or an object id
Advantages:
- simplicity
- no topology constraints
Disadvantages:
- high network overhead (huge traffic generated by each search request)
- flooding stopped by TTL (which produces search horizon)
- only applicable to small number of nodes
11
Data location - Flooding
(1) Flooding (cont.)
Flooding in a flat unstructured network:
obj
search
horizon for
TTL = 2
search
Objects that lie outside of the horizon are not found
12
Data location - Superpeers
(2) Two-level overlay: use superpeers to index the locations of an object
[eMule, Gnutella 2, BitTorrent]
• Each node connects to a superpeer and advertises the list of objects it stores
• Search requests are sent to the superpeer, which forwards them to other
superpeers
Advantages:
- highly scalable
Disadvantages:
- superpeers must be realiable, powerful and well connected to the
Internet (expensive)
- superpeers must maintain large state
- the system relies on a small number of superpeers
13
Data location - Superpeers
(2) Two-level overlay (cont.)
obj
request
response
• A two-level overlay is a partially centralized system
• In some systems superpeers do not connect to each other (e.g., BitTorrent)
14
Data location - KBR
(3) Structured networks: use a routing algorithm that implements Key-Based
Routing [Overnet, Kad, BitTorrent trackerless]
Key-Based Routing (also known as Distributed Hash Tables, or DHTs)
works as follows:
• each node is given a unique node identifier, or nodeid
• given a key k, the node whose nodeid is numerically closest to k
among all nodes in the network is known as the root of key k
• given a routing key k, a KBR algorithm can route a message to the
root of k in a small number of hops, usually O(log N)
• the location of an object with id objectid is tracked by the root of
k = objectid
• thus, one can find the location of an object by routing a message to the
root of k = objectid and querying the root for the location of the object
15
Data location - KBR (cont.)
Key-Based Routing [Pastry]
route(k=8955,msg)
Source node id: 04F2
k = object id:
04F2
E25A
8955
obj
8955
C52A
3A79
Hop #
0
1
2
3
4
Hop id
04F2
85E0
8909
8957
8954
Shared prefix length
0
1
2
3
3 (root of k)
Object 8955 is tracked by node 8954,
which knows of two copies stored
at nodes 620F and C52A
AC78
5230
8957
620F
obj
8955
8954
85E0
8821
8909
overlay address space
[0000,FFFF]
obj8955
stored on
nodes
620F,C52A
16
Data location - KBR (cont.)
Routing table for node 4F28 [Pastry]
Node id: 4F28
Routing table
02A3
19BA
…
2F34
409A
413C
4288
…
4F04
4F1B
N/A
…
4F21
E129
4E01
F0A4
N/A
4FF5
...
Leaf Set
4F04
4F1B
used to find
next hop with
longer shared
prefix
4F21
-
4F30
4F55
4FF5
used to find the
nodeid closest
to a key that is
close to the
local nodeid
• In this example the routing table size is 4 x 15 = 60 entries, for a
maximum network size of N = 65536 nodes.
• The average route length in this case is 4 hops.
17
Data location - KBR (cont.)
(3) Structured networks (cont.)
Advantages:
- completely decentralized (no need for superpeers)
- routing algorithm achieves low hop count for large network sizes
Disadvantages:
- each object must be tracked by a different node
- objects are tracked by unreliable nodes (i.e., which may disconnect)
- keyword-based searches are more difficult to implement than
with superpeers (because objects are located by their objectid)
- the overlay must be structured according to a given topology
in order to achieve a low hop count
- routing tables must be updated every time a node joins or leaves the
overlay
18
Data location - Loosely structured overlays
(4) Loosely structured networks: use hints on the location of objects [Freenet]
• Nodes locate objects by sending search requests containing the object id
• Requests are propagated using a technique similar to flooding
• Objects with similar identifiers are grouped on the same nodes
request
for AE5J
C
A
B
AE5J
5B20
E
D
response
F
AF02
19
Data location - Loosely structured overlays
(4) Loosely structured networks (cont.)
• A search response leaves routing hints on the path back to the source
• Hints are used when propagating future requests for similar object ids
Hints
AE5J: B
Hints
AE5J: D
5B20: E
A
B
C
request
for AF02
AE5J
5B20
E
D
Hints
AE5J: F
F
AF02
20
Data location - Loosely structured overlays
(4) Loosely structured networks (cont.)
Advantages:
- no topology constraints, flat architecture
- searches are more efficient than with plain flooding
Disadvantages:
- does not support keyword-based searches
- search requests have a TTL
- contrary to structured overlays, loosely structured overlay do not
guarantee a low number of hops, nor that the object will be found
21
Data location - Summary
The location schemes described previously can be classified according to:
• degree of structure
• decentralization
unstructured
loosely
structured
structured
partially
centralized
eMule
Gnutella 2
BitTorrent
-
-
decentralized
Gnutella
Freenet
Kad
22
Effects of churn
Churn (node arrivals and departures) can have several effects on a P2P system:
• data objects may be become unavailable if all replicas disconnect
• routing tables may become inconsistent
(e.g., entries may point to nodes which have disconnected)
• the overlay may become partitioned if several nodes suddenly disconnect:
23
Churn tolerance
Node arrivals and departures must not disrupt the normal behavior of the system
• system invariants must be maintained
- connected overlay (i.e., no partitions), low average path length
- data objects accessible from anywhere in the network
• two types of churn tolerance:
- dynamic repair: ability to react to changes in the overlay to maintain
system invariants (e.g., heal partitions)
- static resilience: ability to continue operating correctly before adaptation
occurs (e.g., route messages through alternate paths)
24
Churn – Preventing partitions
An simple way to prevent partitions is to increase the node degree
Ring partitions can be avoided by keeping a list of successor nodes
25
Churn – Static resilience (Ring topology)
Percentage of failed paths for various numbers of successor nodes
1 successor
16 successors
48 successors
More successors reduce the chances of “opening” the ring
26
Churn – Static resilience (various topologies)
Percentage of failed paths for various overlay topologies
Some topologies provide more alternate paths between nodes than others
27
Churn – Dynamic repair (Ring topology)
• Dynamically update routing information to adapt to overlay changes
• Two types of repair algorithms:
- reactive: start maintenance procedure immediately after detection
- periodic: execute maintenance procedure periodically
A reactive algorithm can bring down the system instead of repairing it:
• Each node disconnection triggers a maintenance procedure on a set of nodes
• Over a given churn rate, the maintenance traffic congests the network
• The network congestion leads to nodes being considered as disconnected
• This triggers even more maintenance procedures (positive feedback),
eventually bringing down the network
This effect may be avoided using a periodic maintenance algorithm
28
Churn – Summary
Churn is an important issue in P2P overlays:
• Data may become unavailable, and routing information outdated
• Static resilience
- depends on the topology (i.e., the number of alternate paths)
- increases with the average node degree
• Dynamic repair protocols must be carefully designed
- reactive protocols are usually faster
- periodic protocols can handle higher churn rates
29
Newscast
• Peer-to-peer protocol that creates and maintains an unstructured overlay
• Highly resilient to churn
• Can be used to propagate information
• Extremely simple design based on information gossip:
- Each node only knows about a small set of other knows (view)
- Each node periodically picks a random node from this set
- Both nodes exchange their views and update them
• The random view exchange makes the algorithm very robust to failures and
changes in the overlay
30
Newscast - View exchange
• Each node maintains a view v containing c entries, where
entry = {node address, timestamp}
• Each node executes the following code every T seconds:
1.
2.
3.
4.
select random entry r from local view
send local view plus an entry with the local address to node r
retrieve view of node r and merge it with local view
keep the c entries with the most recent timestamps
• The view of a node changes on each round
31
Newscast - View exchange (cont.)
Example view exchange initiated by node A
B
B,10
D,5
E,3
X,8
S,12
W,2
D
W
E
A
S
view of
node A
(c = 6)
X
32
Newscast - View exchange (cont.)
1. Select random node
B,10
D,5
E,3
X,8
S,12
W,2
selected node
view of
node A
2. Exchange views (plus local entry) with selected node
view of
node A
A
A,15
B,10
D,5
E,3
X,8
S,12
W,2
E,15
J,14
C,9
D,8
H,10
L,14
Z,2
view of
node E
E
33
Newscast - View exchange (cont.)
3. Merge views, and order
by timestamp
E,15
J,14
L,14
S,12
H,10
B,10
C,9
X,8
D,8
D,5
E,3
W,2
Z,2
merge
result
4. Keep c most recent entries
E,15
J,14
L,14
S,12
H,10
B,10
new view
of node A
(c = 6)
34
Newscast - Average path length
• Newscast overlays have a low average path length, i.e., O(log N)
(average number of hops between any two nodes)
35
Newscast - Static resilience
• The overlay shows high static resilience
36
Newscast - Summary
• Simple peer-to-peer algorithm
• Nodes use only local information
• Periodic peer-wise data exchanges
• Emergent properties:
- low average path length
- resistance to high churn
37
Security
Security in peer-to-peer systems is hard to enforce:
• Users have full control on their computers
• Modified clients may not follow the standard protocol
• Communications may be eavesdropped
• Data may be corrupted
• Private data stored on remote computers may be disclosed
38
Security - Weak identities
• The user may leave the system and rejoin it with a new identity (i.e, user id)
• If an attack is detected, the attacker can reenter the system with a new id
• An attacker may create a large number of false identities (Sybil attack)
Example of Sybil Attack:
S1
S2
S6
S3
A
S5
• Nodes S1 to S6 are actually 6
instances of the P2P client
running on the same machine
• The attacker can intercept all
traffic coming from or going to
node A
S4
39
Security - Strong identities
• The user cannot change its identity
• Solution: use a centralized, trusted Certification Authority (CA)
- Each new user must obtain an identity certificate:
certificate = { user id, IP address, user’s public key, signatureCA }
- The certificate is digitally signed by the CA, whose public key is
known by all users
- A certificate cannot be forged (would require the CA’s private key)
-To prove his identity, a user signs a message with his private key,
and attaches the corresponding certificate signed by the CA
• Strong identities prevent Sybil Attacks
• If an attacker is caught, it cannot easily rejoin the system
40
Security – Weak vs. strong identities
• Strong identities requires a centralized CA
• new nodes must contact the CA before joining the network:
- the CA response may be slow (a few days)
- if the CA is unavailable, new nodes cannot join the system
• the security of the system depends on the CA:
- the CA must correctly verify the identity of the requester
- the CA’s private key must be kept secret
• Many P2P systems use weak identities
• IP address already gives some identity information
• Some systems ensure anonymity (e.g., FreeNet)
41
The end
You can get these slides from
http://www.cs.unibo.it/~picconi/slides
For any questions, mail me to:
[email protected]
42