Peer to Peer File Sharing: A Survey

Ismail Guvenc and Juan Jose Urdaneta
Outline
• Peer-to-Peer Concept
• Overview of P2P systems:
– Napster, Gnutella, Freenet, Free Haven, OceanStore, PAST, Farsite, Publius, CFS, Tapestry, Pastry, Chord, CAN, and others
• Comparison of the systems
• Conclusion

What is Peer-to-Peer?
• Every node is designed to provide some service that helps other nodes in the network get service (though it may not, by user choice).
• Each node potentially has the same responsibility.
• Sharing can take different forms:
– CPU cycles: SETI@Home
– Storage space: Napster, Gnutella, Freenet…
P2P: Why so attractive?
• Peer-to-peer applications have seen explosive growth in recent years, fostered by:
– the low cost and high availability of large numbers of computing and storage resources,
– increased network connectivity.
• As long as these factors keep their importance, peer-to-peer applications will continue to gain importance.
Main Design Goals of P2P systems
• Ability to operate in a dynamic environment
• Performance and scalability
• Reliability
• Anonymity: Freenet, Free Haven, Publius
• Accountability: Free Haven, Farsite
First generation P2P routing and location schemes
• Napster, Gnutella, Freenet…
• Intended for large-scale sharing of data files.
• Reliable content location was not guaranteed.
• Self-organization and scalability: two issues to be addressed.
Second generation P2P systems
• Pastry, Tapestry, Chord, CAN…
• They guarantee a definite answer to a query in a bounded number of network hops.
• They form a self-organizing overlay network.
• They provide a load-balanced, fault-tolerant distributed hash table, in which items can be inserted and looked up in a bounded number of forwarding hops.

Napster
• Application-level, client-server protocol over point-to-point TCP; a centralized system.
• Retrieval: four steps
1. Connect to the Napster server.
2. Upload your list of files (push) to the server.
3. Give the server keywords to search the full list with.
4. Select the “best” of the correct answers (using pings).
• Centralized server: a single logical point of failure; can load-balance among servers using DNS rotation; potential for congestion; Napster is “in control” (freedom is an illusion).
• No security: passwords in plain text, no authentication, no anonymity.
Napster: How it works
1. The file list is uploaded to the napster.com server.
2. The user requests a search at the server; the request and results travel between user and server.
3. The user pings hosts that apparently have the data, looking for the best transfer rate.
4. The user retrieves the file directly from the chosen peer.
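A minimal sketch of the centralized-index pattern the steps above describe; the class name, peer addresses, and keyword matching are illustrative assumptions, not Napster's actual wire protocol.

```python
# Hypothetical sketch of Napster-style centralized indexing.
class CentralIndex:
    def __init__(self):
        self.files_by_peer = {}            # peer address -> list of filenames

    def register(self, peer, filenames):
        """Step 2: a peer pushes its file list to the server."""
        self.files_by_peer[peer] = filenames

    def search(self, keyword):
        """Step 3: keyword search over the full list; returns candidate peers."""
        return [(peer, name)
                for peer, names in self.files_by_peer.items()
                for name in names if keyword.lower() in name.lower()]

index = CentralIndex()
index.register("10.0.0.5:6699", ["freebird.mp3", "stairway.mp3"])
index.register("10.0.0.7:6699", ["freebird.mp3"])
print(index.search("freebird"))            # step 4: ping these peers, pick best
```

Note that only the index is centralized; the transfer itself happens peer to peer, which is exactly why the server is a single logical point of failure.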
Gnutella
• Peer-to-peer networking: applications connect to peer applications.
• Focus: a decentralized method of searching for files.
• Each application instance serves to:
– store selected files
– route queries (file searches) from and to its neighboring peers
– respond to queries (serve the file) if the file is stored locally
• How it works: searching by flooding (see the sketch after the note below):
– If you don’t have the file you want, query 7 of your partners.
– If they don’t have it, they contact 7 of their partners, up to a maximum hop count of 10.
– Requests are flooded, but there is no tree structure.
– There is no looping, but packets may be received twice.
Note: Play gnutella animation at:
http://www.limewire.com/index.jsp/p2p
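A minimal sketch of flooding search over a toy peer graph, using the slide's fanout of 7 and hop limit of 10; the "already seen" set stands in for Gnutella's message-ID duplicate checks.

```python
# Gnutella-style flooding search over a peer graph (illustrative only).
def flood_search(graph, start, target_file, files, fanout=7, ttl=10):
    seen = set()                          # drop duplicate deliveries
    frontier = [(start, ttl)]
    hits = []
    while frontier:
        node, hops_left = frontier.pop(0)
        if node in seen:
            continue                      # packet received twice: ignore
        seen.add(node)
        if target_file in files.get(node, ()):
            hits.append(node)             # a QueryHit travels back to the requester
        if hops_left > 0:
            for nbr in graph.get(node, ())[:fanout]:
                frontier.append((nbr, hops_left - 1))
    return hits

graph = {"A": ["B", "C"], "B": ["C", "D"], "C": ["D"], "D": []}
files = {"D": {"song.mp3"}}
print(flood_search(graph, "A", "song.mp3", files))   # ['D']
```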
Freenet (discussed in class before)
• Completely anonymous, for producers and consumers of information.
• Resists attempts by third parties to deny access to information.
• Goals:
– Anonymity for producers and consumers
– Deniability for information storers
– Resistance to denial attacks
– Efficient storing and routing
– Does NOT provide:
» Permanent file storage
» Load balancing
» Anonymity for general network usage
Free Haven
• Anonymous.
• Resists powerful adversaries that try to find or destroy data.
• Goals:
– Anonymity: publishers, readers, servers
– Persistence: lifetime is determined by the publisher
– Flexibility: add/remove nodes
– Accountability: reputation
• A server that gives up space gets space on other servers in return.
Free Haven: Publication
• Split the document into n shares, any k of which can rebuild the file (k < n); see the sketch after this list.
– Large k => brittle file
– Small k => larger shares, more duplication
• Generate (SKdoc, PKdoc) and encrypt each share with SKdoc.
• Store on a server:
– the encrypted share, a timestamp, an expiration date, and hash(PKdoc)
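A minimal sketch of the any-k-of-n property using Shamir secret sharing over a prime field; Free Haven's actual construction is an information-dispersal algorithm over the document bytes, so this is only an illustration, and the prime and secret are toy choices.

```python
# (k, n) secret sharing sketch: any k of the n shares rebuild the secret.
import random

P = 2**127 - 1  # a Mersenne prime large enough for a small secret

def split(secret, n, k):
    """Return n points of a random degree-(k-1) polynomial with f(0)=secret."""
    coeffs = [secret] + [random.randrange(P) for _ in range(k - 1)]
    return [(x, sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P)
            for x in range(1, n + 1)]

def rebuild(shares):
    """Lagrange interpolation at x=0 from any k shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num = den = 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, -1, P)) % P
    return secret

shares = split(secret=123456789, n=5, k=3)
print(rebuild(random.sample(shares, 3)))   # any 3 of 5 shares -> 123456789
```

The trade-off on the slide falls out directly: a larger k means more shares must survive for the rebuild to succeed (brittle), while a smaller k means each share effectively carries more redundancy.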
Free Haven: Retrieval
• Documents are indexed by hash(PKdoc).
• The reader generates (PKclient, SKclient) and a one-time remailer reply block.
• The reader broadcasts hash(PKdoc), PKclient, and the remailer block:
– to all the servers it knows about
– broadcasts may be queued and sent in bulk
• Servers holding shares that match hash(PKdoc):
– encrypt the share with PKclient
– send it using the remailer block
Free Haven: Share expiration
• Shares expire at an absolute date.
• The “price” of a file: size × lifetime.
• Freenet and MojoNation, by contrast, favor popular documents.
• Unsolved problems:
– large corrupt servers, lists of “discouraged” documents, DoS
• Not ready for wide deployment:
– inefficient communication => few users => weak anonymity
PAST & Pastry
• Pastry:
– Completely decentralized, scalable, and self-organizing; it automatically adapts to the arrival, departure, and failure of nodes.
– Seeks to minimize the distance messages travel, according to a scalar proximity metric such as the number of IP routing hops.
– In a Pastry network:
» Each node has a unique identifier, its nodeId.
» Presented with a message and a key, a Pastry node efficiently routes the message to the node whose nodeId is numerically closest to the key.
Pastry: NodeId
• Leaf set: stores the numerically closest nodeIds.
• Routing table: each entry shares a common prefix with the local nodeId (e.g. 10233102), followed by a different next digit and the rest of the nodeId.
• Neighborhood set: stores the closest nodes according to the proximity metric.
Pastry: Routing
• Given a message, check:
1. If the key falls within the range of nodeIds covered by the leaf set, forward directly to the numerically closest node in it.
2. Otherwise, using the routing table, forward the message to a node that shares a longer common prefix with the key.
3. If the routing table entry is empty or that node cannot be reached, forward to a node that is numerically closer to the key and shares at least as long a prefix with it.
• Performance (see the sketch below):
– key within the leaf set: O(1)
– key resolved via the routing table: O(log N)
– worst case: O(N) (under failures)
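A minimal sketch of the three-case routing decision above, over toy base-4 nodeId strings; real Pastry uses 128-bit identifiers with a configurable digit base, and the table layout here is an assumption.

```python
# Pastry-style next-hop selection over toy base-4 nodeIds.
# routing_table maps (shared-prefix length, next digit) -> nodeId.
def shared_prefix_len(a, b):
    n = 0
    while n < len(a) and a[n] == b[n]:
        n += 1
    return n

def numeric(nid):
    return int(nid, 4)

def next_hop(local_id, key, leaf_set, routing_table):
    # Case 1: key within the leaf set's range -> closest leaf-set node.
    candidates = leaf_set | {local_id}
    if min(candidates) <= key <= max(candidates):
        return min(candidates, key=lambda n: abs(numeric(n) - numeric(key)))
    # Case 2: routing-table entry sharing a longer prefix with the key.
    p = shared_prefix_len(local_id, key)
    entry = routing_table.get((p, key[p]))
    if entry is not None:
        return entry
    # Case 3: any known node numerically closer, sharing >= p prefix digits.
    known = candidates | set(routing_table.values())
    closer = [n for n in known
              if shared_prefix_len(n, key) >= p
              and abs(numeric(n) - numeric(key)) < abs(numeric(local_id) - numeric(key))]
    return min(closer, key=lambda n: abs(numeric(n) - numeric(key))) if closer else local_id

# Example: node 1023 routes toward key 1032 via its table entry for prefix "10".
print(next_hop("1023", "1032",
               leaf_set={"1021", "1030"},
               routing_table={(2, "3"): "1031"}))   # -> "1031"
```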
PAST
• PAST: an archival, cooperative file storage and distribution facility.
• Uses Pastry as its routing scheme.
• Offers persistent storage services for replicated read-only files.
• Owners can insert or reclaim files; clients can only look files up.
• A collection of PAST nodes forms an overlay network. A PAST node is at least an access point, but it can optionally also contribute storage and participate in routing.
• Security: each node and each user in the system holds a smartcard with an associated private/public key pair.
• Three operations: insert, lookup, and reclaim.
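A minimal sketch of how PAST-style placement could sit on top of a Pastry ring: a fileId derived from the file name and owner's key, replicated on the nodes numerically closest to it. The field choices and helper names are assumptions, not PAST's exact format, and the distance ignores ring wraparound for simplicity.

```python
# Hypothetical sketch of PAST-style insert placement.
import hashlib

def file_id(name: str, owner_pubkey: bytes, salt: bytes) -> int:
    """Derive a fileId from the file name, the owner's public key, and a salt."""
    digest = hashlib.sha1(name.encode() + owner_pubkey + salt).digest()
    return int.from_bytes(digest, "big")

def replica_set(fid: int, node_ids: list, k: int = 4) -> list:
    """insert: store replicas on the k nodes numerically closest to fileId."""
    return sorted(node_ids, key=lambda n: abs(n - fid))[:k]

# Toy 8-bit id space so the example is readable.
nodes = [0x12, 0x55, 0x9A, 0xE0]
fid = file_id("report.txt", b"owner-key", b"salt") % 256
print(replica_set(fid, nodes, k=2))   # the two nodes closest to the fileId
```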
Farsite
• Farsite is a symbiotic, serverless, distributed file system.
• Symbiotic: it works among clients that cooperate but do not completely trust one another.
• Main design goals:
– to provide high availability and reliability for file storage
– to provide security and resistance to Byzantine threats
– to have the system automatically configure and tune itself adaptively
• Farsite first encrypts the contents of the files, which prevents an unauthorized user from reading them.
• Such a user cannot read a file even if it sits on his own desktop computer, because of the encryption.
• Digital signatures are used to prevent an unauthorized user from writing a file.
• After encryption, multiple replicas of the file are made and distributed to several other client machines.
Publius
• Publius mainly focuses on availability and anonymity. It maintains availability by distributing files as shares over n web servers; j of these shares are enough to reconstruct a file.
• To publish a file, first encrypt the document with a key K. Then K is split into n shares, any j of which can rebuild K. K(doc) and one share are sent to each of the n servers (see the sketch after this list).
• The “name” of the document encodes the addresses of the n servers.
• A query is performed by running a local web proxy, contacting j servers, and rebuilding K.
• Since the identities of the servers are not anonymized, an attacker can remove information by forcing the closure of n−j+1 servers.
• Publius lacks accountability (DoS with garbage) and smooth join/leave for servers.
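A minimal sketch of the publish/retrieve flow, reusing the split()/rebuild() helpers from the Free Haven sketch above for the j-of-n splitting of K; the XOR keystream stands in for a real cipher and is purely an illustrative assumption.

```python
# Publius-style publish/retrieve sketch (toy crypto; uses split/rebuild
# from the Free Haven example for j-of-n sharing of the key K).
import random

def xor_stream(data: bytes, key: int) -> bytes:
    rng = random.Random(key)               # toy keystream; NOT a real cipher
    return bytes(b ^ rng.randrange(256) for b in data)

def publish(doc: bytes, n: int, j: int):
    K = random.randrange(2**64)
    encrypted = xor_stream(doc, K)         # K(doc)
    key_shares = split(K, n, j)            # any j shares rebuild K
    return [(encrypted, share) for share in key_shares]   # one copy per server

def retrieve(server_copies, j: int) -> bytes:
    encrypted = server_copies[0][0]
    K = rebuild([share for _, share in server_copies[:j]])  # any j servers suffice
    return xor_stream(encrypted, K)

copies = publish(b"my document", n=5, j=3)
print(retrieve(copies, j=3))               # b'my document'
```

This also makes the attack on the slide concrete: with only j−1 surviving shares of K, rebuild() cannot recover the key, so shutting down n−j+1 servers destroys the document.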
Chord
• Chord provides a peer-to-peer hash lookup service: lookup(key) → IP address.
• Chord does not store the data itself.
• Efficient: O(log N) messages per lookup, where N is the total number of servers.
• Scalable: O(log N) state per node.
• Robust: survives massive changes in membership.
Chord: Lookup Mechanism
• Lookups take O(log N) hops.
[Figure: Chord ring with nodes N5, N10, N20, N32, N60, N80, N99, N110; a Lookup(K19) issued at N32 resolves to N20, the successor of K19.]
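A minimal sketch of Chord's finger-table lookup, using the node IDs from the figure on a 2^7-slot identifier ring; local recursion stands in for the RPCs between nodes.

```python
# Chord-style lookup on an m-bit identifier ring (illustrative).
M = 7                                   # 2^7 = 128-slot ring
NODES = sorted([5, 10, 20, 32, 60, 80, 99, 110])

def node_successor(n):
    """The next node clockwise after node n."""
    return next((x for x in NODES if x > n), NODES[0])

def successor(ident):
    """The first node at or after identifier ident."""
    return next((x for x in NODES if x >= ident), NODES[0])

def in_half_open(x, a, b):              # x in (a, b] on the ring
    return (a < x <= b) if a < b else (x > a or x <= b)

def in_open(x, a, b):                   # x in (a, b) on the ring
    return (a < x < b) if a < b else (x > a or x < b)

def finger_table(n):
    """Entry i points at successor(n + 2^i): O(log N) state per node."""
    return [successor((n + 2**i) % 2**M) for i in range(M)]

def lookup(n, key, hops=0):
    succ = node_successor(n)
    if in_half_open(key, n, succ):
        return succ, hops + 1           # the key's successor holds the mapping
    for f in reversed(finger_table(n)): # farthest finger not past the key
        if in_open(f, n, key):
            return lookup(f, key, hops + 1)
    return succ, hops + 1

print(lookup(32, 19))                   # the figure's Lookup(K19) from N32 -> (20, 4)
```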
Tapestry
• Self-administered, self-organizing, location-independent, scalable, and fault-tolerant.
• Each node has a neighbor map table with neighbor information.
Tapestry Cont.
• The system is able to adapt to network changes because its algorithms are dynamic.
• This also provides for fault handling.
CAN
• The network organizes nodes into a d-dimensional coordinate space: each node owns a zone of the space and knows the nodes owning the adjacent zones.
• Keys are hashed to points in the space; a query travels from zone to neighboring zone through the network until it reaches the node owning the key’s point, where a match is found (see the sketch below).
• For its query model, scalability is an issue: each node keeps only O(d) neighbors, but paths take O(d·N^(1/d)) hops.

CAN Cont.
[Figure: the CAN overlay network structure.]
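A minimal sketch of CAN-style greedy routing over a toy 2-dimensional key space divided into a 4×4 grid of zones; the hash function and grid layout are illustrative assumptions, not CAN's exact construction.

```python
# Toy CAN: a 2-d key space [0,1)^2 split into a 4x4 grid of zones, one node
# per zone; routing greedily forwards toward the zone containing the key.
import math

GRID = 4

def hash_to_point(key: str):
    """Map a key to a point in [0,1)^2 (toy hash, not CAN's exact scheme)."""
    h = abs(hash(key))
    return ((h & 0xFFFF) / 65536, ((h >> 16) & 0xFFFF) / 65536)

def owner(point):
    """The zone (grid cell) that contains a point."""
    return (min(int(point[0] * GRID), GRID - 1),
            min(int(point[1] * GRID), GRID - 1))

def neighbors(cell):
    x, y = cell
    return [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= x + dx < GRID and 0 <= y + dy < GRID]

def center(cell):
    return ((cell[0] + 0.5) / GRID, (cell[1] + 0.5) / GRID)

def route(start_cell, key):
    """Forward to whichever neighbor zone lies closest to the key's point."""
    target = hash_to_point(key)
    path, cell = [start_cell], start_cell
    while cell != owner(target):
        cell = min(neighbors(cell), key=lambda c: math.dist(center(c), target))
        path.append(cell)
    return path

print(route((0, 0), "song.mp3"))   # zones visited on the way to the owner
```

With O(d) neighbor state per node, path length grows as N^(1/d), which is why the comparison later notes scalability as CAN's weak point relative to the O(log N) schemes.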
OceanStore
• A de-centralized but monitored system.
• Built with untrusted peers (for data storage) and nomadic data in mind.
• Monitoring allows pro-active movement of data.
• Uses replication and caching.
• Two lookup methods are used: a fast probabilistic one and a slow deterministic one.
MojoNation
• Centralized: a central service broker plus many peers.
• When a file is inserted it is hashed; this hash is the file’s unique identifier (see the sketch below).
• Uses fragmentation and replication (50%).
• Load balancing is an issue since the system is market-based.

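A minimal sketch of content hashing as a unique file identifier; the choice of SHA-256 is an assumption for illustration, not necessarily the hash MojoNation used.

```python
# Content hash as a unique file identifier (hash choice is illustrative).
import hashlib

def file_identifier(data: bytes) -> str:
    """Identical contents always map to the same identifier."""
    return hashlib.sha256(data).hexdigest()

print(file_identifier(b"hello"))   # stable ID, independent of the file name
```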
Panther
• Based on the Chord lookup algorithms.
• Chord’s capabilities are used for load balancing and replication.
• Files and file chunks are identified by keys generated through Chord’s hash system.
• Replication, fragmentation, and caching are used.
• Authentication is provided through the use of public and private keys.
Eternity
• Provides “eternal” storage capabilities.
• Protects data even from its publisher and from systems administrators.
• Fragmentation and replication are proposed.
• Anonymity is used to protect the data.
Others
– The Ohaha system uses a consistent-hashing-like algorithm to map documents to nodes. Query routing is like Freenet’s, which brings along some of Freenet’s weaknesses.
– The Rewebber maintains a measure of anonymity for producers of web information by means of an encrypted URL service. TAZ extends Rewebber by using chains of nested encrypted URLs that successively point to different rewebber services to be contacted.
– Intermemory and INDIA are two cooperative systems where files are divided into redundant shares and distributed among many servers. They are intended for long-term archival storage along the lines of Eternity.
– The xFS file system focuses on providing support to distributed applications on workstations interconnected by a very high-performance network, providing high availability and reliability.
– Frangipani is a file system built on the Petal distributed virtual disk, providing high availability and reliability like xFS through distributed RAID semantics. Unlike xFS, Petal provides support for transparently adding, deleting, or reconfiguring servers.
– GNUnet is free software, available to the general public under the GNU General Public License (GPL). As opposed to Napster and Gnutella, GNUnet was designed with security as the highest priority.
Comparison
For each system: things to note, and routing and file retrieval.

Napster
– Things to note: centralized server (single point of failure); not scalable.
– Routing and retrieval: centralized lookup; keyword is used for look-up.

Gnutella
– Things to note: inefficient routing (flooding); not scalable.
– Routing and retrieval: flooded routing (multicast); filename is used for look-up.

Freenet
– Things to note: scales well; intelligent routing; anonymity maintained.
– Routing and retrieval: no structured lookup algorithm, which degrades efficiency.

Free Haven
– Things to note: anonymous; accountability maintained.
– Routing and retrieval: the file is split into n shares, k of which can rebuild it; private/public keys are used for encryption.

Farsite
– Things to note: scales well; accountability is maintained; feasible.
– Routing and retrieval: files are stored in encrypted format.

Publius
– Things to note: anonymity is maintained.
– Routing and retrieval: the file is split into n shares, k of which can rebuild it.

Pastry
– Things to note: leaf set, routing table, and neighborhood set; used by Squirrel, PAST, and Scribe.
– Routing and retrieval: routing based on address prefixes; O(log N) without neighbor failure, O(1) if the node is in the leaf set.

Chord
– Things to note: lookup(key) => IP address; used by CFS and Panther.
– Routing and retrieval: routing based on a finger table (O(log N)) and on numerical difference with the destination node; O(log N) even with neighbor failure, using replicas on successors.

Tapestry
– Things to note: neighbor map table on the nodes.
– Routing and retrieval: routing based on address prefixes.

CAN
– Things to note: the routing table doesn’t grow with network size; scalability is an issue.
– Routing and retrieval: messages routed in d-dimensional space; each node has a routing table of O(d) entries; any node is reached in O(d·N^(1/d)) routing hops.
File Storage Systems Comparison
• Systems compared: Panther, OceanStore, MojoNation, CFS.
• Properties compared: caching, fragmentation, replication, de-centralization.
• Panther: caching, fragmentation, and replication; de-centralized (Chord-based).
• OceanStore: caching and replication; de-centralized.
• MojoNation: fragmentation and replication; centralized around its service broker.
• CFS: built on Chord; de-centralized.
Conclusion
• Issues that still need to be addressed: caching, versioning, fragmentation, and replication.
• Copyright laws!
• The technology is very promising; it will probably be commonplace in the near future.