
CS-279-I: Design Project in Computer Science
Computer Networks
Michalis Faloutsos
Surge 333
[email protected]
www.cs.ucr.edu/~michalis
Copyright: UC Riverside
This is THE Class!
The ultimate challenge
Developing a complete system
Real engineering
Dress rehearsal for when you go out there
What is different here
Open-ended definition of the problem
Freedom to design
Freedom to define the scope
Fruitful interaction with colleagues
This Translates to:
Problem-solving engineering skills
Designing complete systems
• Design, Implement, Test
Exercise self-motivation and independent thinking
Cultivate teamwork skills
Freedom Comes At a Price
Responsible behavior
Proactive approach
Need for communication
• Towards me and among yourselves
How Real Engineering Differs from School Assignments
Development is a process, not an all-nighter
The user defines features, not guidelines
The design is half of the solution
Most design decisions attempt to strike a balance in a trade-off
You need to justify your approach
Teamwork is critical for success
The Project
Develop a distributed file sharing system
Requirement: must scale to a large number of users
Other requirements
• Exchange files of arbitrary type
• Search for the files of interest (id, keywords, description)
Illustrating the Project
Application-level connections
New “users” can join
Users can leave
Users can search for information
Users acquire information
The Process and the Phases
Understanding the problem and previous work
Design of the system
Implementation
Testing
Evaluation of the process
Some Tips
Start early: it is a lot of work!
Find ways to distribute the work equally
• Maximize parallelism
• Modularity (contain errors, facilitate testing)
Thinking ahead and organizing is critical
• Design and tool selection, mode of operation
Layout of work
End of 2nd week: project proposal ver 1
End of 3rd-4th week: project proposal ver 2
End of 5th-6th week: midpoint presentation
• Design of system
End of 10th week: deliver project, presentation, demonstration
Peer-to-Peer File Sharing: A Survey
Based on a presentation by Ismail Guvenc and Juan Jose Urdaneta
Outline
Peer-to-Peer Concept
Overview of P2P systems:
• Napster, Gnutella, Freenet, Freehaven, Oceanstore, PAST, Farsite, Publius, CFS, Tapestry, Pastry, Chord, CAN, and others
Comparison of the systems
Conclusion
What is Peer-to-Peer?
Every node is designed to provide some service that helps other nodes in the network obtain service (though a user may choose not to)
Each node potentially has the same responsibility
Sharing can be in different ways:
• CPU cycles: SETI@Home
• Storage space: Napster, Gnutella, Freenet…
P2P: Why so attractive?
Peer-to-peer applications have seen explosive growth in recent years, driven by:
• The low cost and high availability of large numbers of computing and storage resources
• Increased network connectivity
As long as these factors keep their importance, peer-to-peer applications will continue to gain importance
Main Design Goals of P2P systems
Ability to operate in a dynamic environment
Performance and scalability
Reliability
Anonymity: Freenet, Freehaven, Publius
Accountability: Freehaven, Farsite
First generation P2P routing and location schemes
Napster, Gnutella, Freenet…
Intended for large-scale sharing of data files
Reliable content location was not guaranteed
Self-organization and scalability: two issues to be addressed
Second generation P2P systems
Pastry, Tapestry, Chord, CAN…
They guarantee a definite answer to a query in a
bounded number of network hops.
They form a self-organizing overlay network.
They provide a load-balanced, fault-tolerant distributed hash table, in which items can be inserted and looked up in a bounded number of forwarding hops (see the sketch below).
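To make the distributed-hash-table idea concrete, here is a minimal Python sketch of consistent-hashing-style key-to-node assignment. All names (SimpleDHT, hash_id) and the use of SHA-1 over a 32-bit id space are invented for illustration; no specific system above works exactly this way.

import bisect
import hashlib

def hash_id(value, bits=32):
    # Map a string to an integer identifier in [0, 2**bits).
    digest = hashlib.sha1(value.encode()).digest()
    return int.from_bytes(digest[:4], "big") % (2 ** bits)

class SimpleDHT:
    # Toy DHT: a key is stored on the node whose id is the first one
    # clockwise from the key's id on the identifier ring.
    def __init__(self, node_names):
        self.nodes = {hash_id(n): n for n in node_names}
        self.ring = sorted(self.nodes)

    def responsible_node(self, key):
        i = bisect.bisect_left(self.ring, hash_id(key))
        return self.nodes[self.ring[i % len(self.ring)]]  # wrap around the ring

dht = SimpleDHT(["peerA", "peerB", "peerC", "peerD"])
print(dht.responsible_node("some-file.mp3"))  # deterministic owner for the key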
The Main Functions of Sharing
Publish a file: make available
Search for a file
• Search for the ID of the file (Springsteen, born in)
• Forward request (flood or with clue/structure)
Retrieve the file
Maintain the connectivity of the network
• Who connects to whom - optimize
Maintain copies of files in the network
• File replication - optimize availability of files
Napster
Application-level, client-server protocol over point-to-point TCP; centralized system
Retrieval: four steps
• Connect to the Napster server
• Upload your list of files (push) to the server
• Give the server keywords to search the full list with
• Select the “best” of the correct answers (pings)
Centralized server: single logical point of failure; can load-balance among servers using DNS rotation; potential for congestion; Napster “in control” (freedom is an illusion)
No security: passwords in plain text, no authentication, no anonymity
Napster: How it works? (1)
1. The user uploads its file list to the napster.com server
Napster: How it works? (2)
2. The user sends a search request to the server; the request and results travel between the user and napster.com
Napster: How it works? (3)
3. The user pings the hosts that apparently have the data, looking for the best transfer rate
Napster: How it works? (4)
4. The user retrieves the file
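To summarize the four steps, here is a toy sketch of a Napster-style central index in Python. The class and method names are invented, and this is not Napster's actual wire protocol; steps 3-4 (pinging peers and fetching the file) are only indicated in comments.

class CentralIndex:
    # Toy central index: filename -> set of peer addresses.
    def __init__(self):
        self.files = {}

    def upload_list(self, peer, filenames):
        # Steps 1-2: a connected peer pushes its file list to the server.
        for name in filenames:
            self.files.setdefault(name, set()).add(peer)

    def search(self, keyword):
        # Step 3: keyword search over the full list.
        return {name: peers for name, peers in self.files.items()
                if keyword.lower() in name.lower()}

index = CentralIndex()
index.upload_list("10.0.0.5:6699", ["springsteen - born to run.mp3"])
print(index.search("springsteen"))
# Step 4: the client pings the returned peers, picks the best transfer
# rate, and retrieves the file directly from that peer.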
Gnutella
Peer-to-peer networking: applications connect to peer applications
Focus: a decentralized method of searching for files
Each application instance serves to:
• store selected files
• route queries (file searches) from and to its neighboring peers
• respond to queries (serve the file) if the file is stored locally
How it works: searching by flooding (sketched below):
• If you don’t have the file you want, query 7 of your partners.
• If they don’t have it, they contact 7 of their partners, up to a maximum hop count of 10.
• Requests are flooded, but there is no tree structure.
• No looping, but packets may be received twice.
Note: Play the Gnutella animation at: http://www.limewire.com/index.jsp/p2p
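A minimal Python sketch of the flooding search described above, with duplicate suppression and a hop limit. The function name and in-memory data layout are invented; real Gnutella exchanges descriptor messages over TCP between running peers.

def flood_query(peers, files, start, wanted, ttl=10, fanout=7):
    # peers: node -> list of neighbor nodes; files: node -> set of filenames.
    # Returns the set of nodes that can serve the wanted file.
    seen = {start}               # suppress duplicate deliveries (no looping)
    frontier = [start]
    hits = set()
    while frontier and ttl > 0:
        nxt = []
        for node in frontier:
            for nb in peers[node][:fanout]:   # query up to `fanout` partners
                if nb in seen:
                    continue                  # packet already received here
                seen.add(nb)
                if wanted in files[nb]:
                    hits.add(nb)              # this neighbor can serve the file
                nxt.append(nb)                # it forwards the query onward
        frontier = nxt
        ttl -= 1                              # hop count decreases each round
    return hits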
Freenet
Completely anonymous, for both producers and consumers of information
Resistance to attempts by third parties to deny access to information
Users share disk space
• “Freenet” can write files in your space -> you are not responsible for them
Goals:
• Anonymity for producers and consumers
• Deniability for information storers
• Resistance to denial attacks
• Efficient storing and routing
Free Haven
Anonymous
Resists powerful adversaries trying to find/destroy data
Goals
• Anonymity: publishers, readers, servers
• Persistence: lifetime determined by the publisher
• Flexibility: add/remove nodes
• Accountability: reputation
A server gives up space => gets space on other servers
Free Haven: Publication
Split the doc into n shares, k of which can rebuild the file (k < n)
• Large k => brittle file
• Small k => larger shares, more duplication
Generate (SKdoc, PKdoc) and encrypt each share with SKdoc
Store on a server:
• Encrypted share, timestamp, expiration date, hash(PKdoc)
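A minimal sketch of the "k of n shares rebuild the file" idea, using Shamir secret sharing over a prime field. Free Haven's actual share encoding (an information dispersal algorithm) differs in detail, and the function names here are illustrative. Requires Python 3.8+ for the modular-inverse form of pow.

import random

P = 2**127 - 1  # a Mersenne prime; all arithmetic is in the field mod P

def split(secret, n, k):
    # Hide the secret as the constant term of a random degree-(k-1) polynomial;
    # each share is a point on that polynomial.
    coeffs = [secret] + [random.randrange(P) for _ in range(k - 1)]
    def f(x):
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, f(x)) for x in range(1, n + 1)]

def rebuild(shares):
    # Lagrange interpolation at x = 0 recovers the constant term (the secret).
    total = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

shares = split(secret=42424242, n=5, k=3)
assert rebuild(shares[:3]) == 42424242  # any 3 of the 5 shares suffice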
Free Haven: Retrieval
Documents are indexed by hash(PKdoc)
The reader generates (PKclient, SKclient) and a one-time remailer reply block
The reader broadcasts hash(PKdoc), PKclient, and the remailer block
• To all servers it knows about
• Broadcasts may be queued and bulk-sent
Servers holding shares with a matching hash(PKdoc):
• Encrypt the share with PKclient
• Send it using the remailer block
Free Haven: Share expiration
Absolute date
“Price” of a file: size x lifetime
Freenet and Mojo Nation favor popular documents
Unsolved problems:
• Large corrupt servers, lists of “discouraged” documents, DoS
Not ready for wide deployment:
• Inefficient communication => few users => weak anonymity
PAST & Pastry
Pastry:
• Completely decentralized, scalable, and self-organizing; it automatically adapts to the arrival, departure, and failure of nodes.
• Seeks to minimize the distance messages travel, according to a scalar proximity metric like the number of IP routing hops.
• In a Pastry network:
  - Each node has a unique id, nodeId.
  - Presented with a message & a key, a Pastry node efficiently routes the message to the node whose nodeId is numerically closest to the key.
Pastry: NodeId
Leaf set: stores the numerically closest nodeIds.
Routing table: each entry shares a common prefix with the local nodeId (e.g., 10233102), has a different next digit, and is arbitrary in the rest of the nodeId.
Neighborhood set: stores the closest nodes according to the proximity metric.
Pastry: Routing
Given a message, check:
If the key falls within the range of nodeIds covered by the leaf set, forward directly to the numerically closest node in it.
If not, use the routing table to forward the message to a node that shares a longer common prefix with the key.
If the routing table entry is empty or the node cannot be reached, forward to a node that is numerically closer to the key and shares at least as long a prefix with the key.
Performance (see the sketch below):
• If the key is within the leaf set: O(1)
• If the key goes through the routing table: O(log N)
• Worst case (under failures): O(N)
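A minimal Python sketch of the routing decision above, assuming fixed-length digit-string nodeIds. The helper names are invented, and real Pastry additionally consults its neighborhood set and a proximity metric when choosing among candidates.

def shared_prefix_len(a, b):
    n = 0
    while n < len(a) and a[n] == b[n]:
        n += 1
    return n

def next_hop(my_id, key, leaf_set, routing_table):
    # my_id/key: equal-length digit strings; leaf_set: nearby nodeIds;
    # routing_table: one dict per prefix row, mapping next digit -> nodeId.
    if my_id == key:
        return my_id
    lo, hi = min(leaf_set), max(leaf_set)
    if lo <= key <= hi:
        # Key covered by the leaf set: deliver to the numerically closest node.
        return min(leaf_set + [my_id], key=lambda n: abs(int(n) - int(key)))
    row = shared_prefix_len(my_id, key)
    entry = routing_table[row].get(key[row])  # match one more digit of the key
    if entry:
        return entry
    # Fall back: any known node numerically closer to the key that shares
    # at least as long a prefix with it.
    candidates = [n for r in routing_table for n in r.values()] + leaf_set
    closer = [n for n in candidates
              if shared_prefix_len(n, key) >= row
              and abs(int(n) - int(key)) < abs(int(my_id) - int(key))]
    return min(closer, key=lambda n: abs(int(n) - int(key))) if closer else my_id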
PAST
PAST: an archival, cooperative file storage and distribution facility
Uses Pastry as its routing scheme
Offers persistent storage services for replicated read-only files
Owners can insert or reclaim files; clients can only look up
A collection of PAST nodes forms an overlay network. A PAST node is at least an access point, but it can optionally also contribute storage and participate in routing.
Security: each node and each user in the system holds a smartcard with an associated private/public key pair.
Three operations: insert, lookup, and reclaim (sketched below)
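A minimal sketch of the three PAST operations, with a single in-process dict standing in for the disks of the nodes whose nodeIds are closest to the fileId. The routing itself (Pastry's job) is only indicated in comments, and the fileId derivation is simplified relative to PAST's.

import hashlib

store = {}  # fileId -> file bytes; stands in for the k closest nodes' storage

def file_id(name, owner_pubkey):
    # Simplified: hash of file name plus the owner's public key. PAST also
    # mixes in a salt when deriving the fileId.
    return hashlib.sha1(name.encode() + owner_pubkey).hexdigest()

def insert(name, owner_pubkey, data):
    store[file_id(name, owner_pubkey)] = data   # i.e., route_to(fileId).store(data)

def lookup(fid):
    return store.get(fid)                       # i.e., route_to(fid).fetch()

def reclaim(name, owner_pubkey):
    # Only the owner (holder of the matching key) may reclaim the file.
    store.pop(file_id(name, owner_pubkey), None)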
Farsite
Farsite is a symbiotic, serverless, distributed file system.
Symbiotic: it works among cooperating but not completely trusting clients.
Main design goals:
• To provide high availability and reliability for file storage.
• To provide security and resistance to Byzantine threats.
• To have the system automatically configure and tune itself adaptively.
Farsite first encrypts the contents of the files. This prevents an unauthorized user from reading a file, even one stored on his own desktop computer.
Digital signatures are used to prevent an unauthorized user from writing a file.
After encryption, multiple replicas of the file are made and distributed to several other client machines.
Publius
The Publius system focuses mainly on availability and anonymity. It maintains availability by distributing files as shares over n web servers; j of these shares are enough to reconstruct a file.
To publish a file, we first encrypt the document with a key K. Then K is split into n shares, any j of which can rebuild K. K(doc) and one share are sent to each of the n servers. The “name” of the document is the address of the n servers.
The query operation basically runs a local web proxy, contacts j servers, and rebuilds K.
Since the identities of the servers are not anonymized, an attacker can remove information by forcing the closure of n-j+1 servers.
Publius lacks accountability (DoS with garbage) and smooth join/leave for servers.
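A minimal sketch of the Publius publish step: encrypt the document with a random key K, then split K into n shares, j of which rebuild it. The XOR "cipher" is a toy stand-in for real encryption, and split refers to a j-of-n helper like the Shamir sketch shown earlier for Free Haven.

import secrets

def xor_encrypt(doc, key):
    # Toy stream cipher for illustration only; Publius used real encryption.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(doc))

def publish(doc, n, j):
    K = secrets.randbits(64)                          # the document key
    encrypted = xor_encrypt(doc, K.to_bytes(8, "big"))
    key_shares = split(K, n, j)                       # j-of-n split of K
    # Each of the n servers receives K(doc) plus one key share; the
    # document's "name" is the list of the n server addresses.
    return [(encrypted, share) for share in key_shares]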
Chord
Chord:
• Provides a peer-to-peer hash lookup service: Lookup(key) => IP address
• Chord does not store the data
• Efficient: O(log N) messages per lookup
  - N is the total number of servers
• Scalable: O(log N) state per node
• Robust: survives massive changes in membership
Chord: Lookup Mechanism
Lookups take O(log N) hops
[Figure: Chord identifier ring with nodes N5, N10, N20, N32, N60, N80, N99, N110; key K19 is stored at N20, and Lookup(K19) issued at N32 resolves there]
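A minimal Python sketch of successor-based lookup on the identifier ring from the figure. Real Chord runs this as a distributed protocol in which each node holds only its own finger table; here the whole ring is visible for clarity.

import bisect

def successor(ring, key_id):
    # ring: sorted list of node ids. The node responsible for key_id is
    # the first node clockwise from it.
    i = bisect.bisect_left(ring, key_id)
    return ring[i % len(ring)]

def finger_table(ring, n, m=8):
    # Node n's i-th finger points at successor(n + 2**i) in a 2**m id space;
    # following fingers roughly halves the remaining distance each hop.
    return [successor(ring, (n + 2**i) % 2**m) for i in range(m)]

ring = [5, 10, 20, 32, 60, 80, 99, 110]
print(successor(ring, 19))     # -> 20, matching Lookup(K19) in the figure
print(finger_table(ring, 32))  # the fingers node N32 would maintain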
Tapestry
Self-administered, self-organized, location-independent, scalable, fault-tolerant
Each node has a neighbor map table with neighbor information.
Tapestry Cont.
The system is able to adapt to network changes because its algorithms are dynamic.
This also provides for fault handling.
CAN
The network is created in a tree-like form.
Each node is associated with one node in the upper level and with a group in the lower level.
A query travels from the uppermost level down through the network until a match is found or until it reaches the lowermost level.
For this query model, scalability is an issue.
CAN Cont.
The tree-like network: [figure]
Oceanstore
Decentralized but monitored system.
Built with untrusted peers (for data storage) and nomadic data in mind.
Monitoring allows proactive movement of data.
Uses replication and caching.
Two lookup methods are used: fast probabilistic and slow deterministic.
MojoNation
Centralized: a central Service Broker plus many peers.
When a file is inserted, it is hashed; the hash is the file's unique identifier (see below).
Uses fragmentation and replication (50%).
Load balancing is an issue since the system is market-based.
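A minimal illustration of content hashing as a unique file identifier; the exact hash function MojoNation used may differ.

import hashlib

def unique_id(data):
    # Identical content always hashes to the identical identifier.
    return hashlib.sha1(data).hexdigest()

print(unique_id(b"some file contents"))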
Panther
Based on the Chord lookup algorithms
Chord's capabilities are used for load balancing and replication
Files and file chunks are identified by keys generated through Chord's hash system.
Replication, fragmentation, and caching are used.
Authentication is provided through the use of public and private keys.
Eternity
Provides “eternal” storage capabilities.
Protects data even from its publisher and from systems administrators.
Fragmentation and replication are proposed.
Anonymity is used to protect data.
Others
• The Ohaha system uses a consistent-hashing-like algorithm to map documents to nodes. Query routing is like the one in Freenet, which brings along some of Freenet's weaknesses.
• The Rewebber maintains a measure of anonymity for producers of web information by means of an encrypted URL service. TAZ extends Rewebber by using chains of nested encrypted URLs that successively point to different rewebber services to be contacted.
• Intermemory and INDIA are two cooperative systems where files are divided into redundant shares and distributed among many servers. They are intended for long-term archival storage along the lines of Eternity.
• The xFS file system focuses on providing support to distributed applications on workstations interconnected by a very high-performance network, providing high availability and reliability.
• Frangipani is a file system built on the Petal distributed virtual disk, providing high availability and reliability like xFS through distributed RAID semantics. Unlike xFS, Petal provides support for transparently adding, deleting, or reconfiguring servers.
• GNUnet is free software, available to the general public under the GNU Public License (GPL). As opposed to Napster and Gnutella, GNUnet was designed with security in mind as the highest priority.
Comparison
(For each system: things to note; routing and file retrieval)

Napster
• Things to note: centralized server (single point of failure), not scalable
• Routing and retrieval: centralized lookup; keyword is used for look-up

Gnutella
• Things to note: inefficient routing (flooding), not scalable
• Routing and retrieval: flooded routing (multicast); filename is used for look-up

Freenet
• Things to note: scales well, intelligent routing, anonymity maintained
• Routing and retrieval: no structured lookup algorithm, which degrades efficiency

Free Haven
• Things to note: anonymous, accountability maintained
• Routing and retrieval: file is split into n shares, k of which can rebuild the file; shared/private keys used for encryption

Farsite
• Things to note: scales well, accountability is maintained, feasible
• Routing and retrieval: files stored in encrypted format

Publius
• Things to note: anonymity is maintained
• Routing and retrieval: file is split into n shares, k of which can rebuild the file

Pastry
• Things to note: leaf set, routing table, neighborhood set; used by Squirrel, PAST, Scribe
• Routing and retrieval: routing based on address prefixes and on numerical difference with the destination node; O(log N) hops without neighbor failure, O(1) if the node is in the leaf set

Chord
• Things to note: lookup(key) => IP address; used by CFS, Panther
• Routing and retrieval: routing based on the finger table; O(log N) hops even with neighbor failures, using replicas on successors

Tapestry
• Things to note: neighbor map table on nodes
• Routing and retrieval: routing based on address prefixes

CAN
• Things to note: tree-like network topology; routing table does not grow with network size; scalability is an issue
• Routing and retrieval: messages routed in d-dimensional space; each node has a routing table of O(d) entries; any node is reached in O(d*N^(1/d)) routing hops
File Storage Systems Comparison
[Table: compares Panther, OceanStore, MojoNation, and CFS on caching, fragmentation, replication, and decentralization; the individual entries were not preserved in the transcript]
Conclusion
Issues that need to be addressed: caching, versioning, fragmentation, replication.
Copyright laws!
The technology is very promising. It will probably become a common thing in the near future.