Transcript slides18

ICS 214B: Transaction Processing and
Distributed Data Management
Lecture 18: Data Management in Peer-to-Peer Systems
Professor Chen Li
Based on slides developed by
Beverley Yang and Hector Garcia-Molina
1
What is P2P?
pastry
napster
aim
jxta
can
freenet
united devices
open cola
ocean store
?
farsite
gnutella
morpheus
uddi
grove
icq
jabber
folding@home
process tree
ICS214B
netmeeting
ebay
limewire
bearshare
kazaa
fiorana
chord
Notes 18
seti@home
popular power
tapestry
mojo nation
2
Napster
join
get
query
file
response
central server
...
ICS214B
Notes 18
3
Gnutella
query
ICS214B
Notes 18
4
PeerCast
before:
UCI
source
after:
UCI
source
ICS214B
Notes 18
5
What is Peer-to-Peer?
• Definition:
Nodes of equal roles exchanging
information and services directly
• Is this a new idea?
– IP routing (1970’s)
– Mariposa (1980’s)
– Distributed Databases!
• What are people really thinking?
ICS214B
Notes 18
6
Implicit Definition of P2P
• Scale: millions (billions?) of peers
• Nature of peers: PC’s
• Application: lightweight semantics
(e.g., file-sharing)
ICS214B
Notes 18
7
P2P vs. Distributed DBMS
Traditional DDBMS Issues:
•
•
•
•
Transactions
Network Partitions
Distributed Query Optimization
Interoperation of heterogeneous data
sources
• Reliability/failure of nodes
Complex features do not scale
ICS214B
Notes 18
8
P2P vs. Distributed DBMS
Example application: file-sharing
• Simple data model and query language
– No complex query optimization
– Easy interoperation
• No guarantee on quality of results
– Individual site availability unimportant
• Local updates
– No transactions
– Network partitions OK
Simple
ICS214B
Amenable to large-scale network of PCs
Notes 18
9
Potential Benefits
•
•
•
•
Efficiency: harnessing unused resources
Self-organizing
Effectively sharing cost of ownership
Robustness and availability through
replication
• Anonymity/legal protection
ICS214B
Notes 18
10
Challenges
•
•
•
•
No authority to enforce behavior
Cooperation
Unreliability of individual peers
Efficiency of distributed operations
(absolute resources)
ICS214B
Notes 18
11
Research Areas
• Resource Management
• Security
• Efficient Search
ICS214B
Notes 18
12
Resource Management
• Resource:
– Storage/information
– CPU processing
– bandwidth
• Issues:
– fairness
– load balancing
ICS214B
Notes 18
13
Example: Data Trading
site 1
site 2
site 3
A1
B1
C1
A2
B2
C2
B1
B2
ICS214B
trade
trade
A1
A2
Notes 18
14
Example: Data Trading
site 1
site 2
site 3
A1
B1
C1
A2
B2
C2
A1
A2
B1
C1
ICS214B
trade
C2
Notes 18
trade
trade
B2
15
Data Trading
• Order of trades impacts availability
• Issues:
– Swaps vs. Deeds
– Fixed price vs. bids
– Preference to
 sites with a lot of space?
 reliable sites?
 “desperate” sites?
ICS214B
Notes 18
16
Security
• Issues:
–
–
–
–
–
–
Reputation
Trust
Accountability
Information Preservation
Information Quality
Denial of service attacks
• Problem: Detecting and punishing bad
behavior
ICS214B
Notes 18
17
Information Preservation
• Example Policy: make 3 copies of documents
A1
make copies
What can go wrong?
ICS214B
Notes 18
18
What Can Go Wrong?
•
•
•
•
“Bad” sites deletes copies
“Bad” site alters copy
“Bad” site publishes fake
A
1
“Bad” site makes many
copies at other sites
• ...
ICS214B
Notes 18
A1
make copies
A’1
19
Reputation Systems
• Peers evaluate each other
• Good reviews -> Good reputation
Bad reviews -> Bad reputation
No reviews -> ?
• Problems
– Trustworthiness of reviews
– Permanence of identity
ICS214B
Notes 18
20
Efficiency of Search
• Problem: finding needle in haystack
• Efficiency measured in terms of
absolute resources consumed
ICS214B
Notes 18
21
Architecture
• Hybrid
– Centralized index, P2P
file storage and transfer
• Super-peer
– A “pure” network of
“hybrid” clusters
• Pure
– functionality completely
distributed
ICS214B
Notes 18
22
Goal
Develop search techniques for “loose”
systems that are
• Efficient
• Simple (easy to implement, no hidden
costs)
• Realistically and thoroughly evaluated
ICS214B
Notes 18
23
Current Techniques: Gnutella
Breadth-First Search (BFS)
= source
= forward
query
= processed
query
= found
result
= forward
response
ICS214B
Notes 18
24
Metrics
• Cost (aggregate)
– Bandwidth
– Processing Power
• Quality of Results
– Number of results
– Satisfaction (true if # results >= X, false
otherwise)
– Time to satisfaction
ICS214B
Notes 18
25
Iterative Deepening
• Interested in satisfaction, not # of results
• BFS returns “too many” results  expensive
• Iterative Deepening: common technique to reduce
the cost of BFS
– Intuition: A search at a small depth is much cheaper than at
a larger depth
ICS214B
Notes 18
26
Iterative Deepening
= source
= forward
query
= processed
query
?
= found
result
= forward
response
ICS214B
Notes 18
27
Directed BFS
• Sends query to a subset of neighbors
• Maintains statistics on neighbors
– E.g., ping latency, history of number of results
• Chooses subset intelligently (via heuristics),
to maximize quality of results
– E.g., Neighbors with shortest message queue,
since long message queue implies neighbor is
saturated/dead
ICS214B
Notes 18
28
Directed BFS
= source
= forward
query
= processed
query
?
= found
result
= forward
response
ICS214B
Notes 18
29
Directed BFS: Heuristics
RAND
(Random)
RES
Returned greatest # results in past
TIME
Had shorted avg. time to satisfaction
in past
HOPS
Had smallest avg. # hops for
response messages in past
Sent our client greatest # of
messages
MSG
QLEN
Shortest message queue
DEG
Highest degree
ICS214B
Notes 18
30
Local Indices
• Each node maintains index over other nodes’
collections
– r is the radius of the index
– Index covers all nodes within r hops away
• Can process query at fewer nodes, but get just as
many results back
r
ICS214B
Notes 18
31
Local Indices (r=1)
= source
sdf
nrd
= forward
query
sdf
nrd
sdf
nrd
sdf
nrd
= processed
query
= found
result
= forward
response
ICS214B
Notes 18
32
Evaluation
• Goal: realistic evaluation of techniques
• Cannot directly evaluate techniques in a real
environment
• Simulation of large-scale distributed systems
is hard
• Use Gnutella as a “laboratory” for gathering
data
• Use analysis driven by query “traces” to
project cost
ICS214B
Notes 18
33
Passive Observation
Gnutella Network
• Pong
• Query
1. Statistics
• Size of collection
• % redundant messages
2. Sample queries (Qrep)
ICS214B
Notes 18
34
Gathering Data
• # hops traveled
• IP address
Ping
Query (Qrep)
• # hops traveled
• IP address
• Timestamp
• Individual result records
ICS214B
Notes 18
35
Gathering Data
• For each query Q:
L(Q)
Length of query string
M(Q,n)
# response messages from n
hops away
# results from n hops away
R(Q,n)
S(Q,n,Z)
T(Q,Z,W,P)
True if >= Z results received
from n hops away
Time to satisfaction
N(Q,n)
# nodes n hops away
C(Q,n)
# redundant edges n hops
away
ICS214B
Notes 18
36
Example: Trace-driven Cost Projection
= source
= forward
query
= processed
query
?
= found
result
= forward
response
ICS214B
Notes 18
37
Example: Calculating
Message Size
• Use the Gnutella protocol, trace data
• e.g., Query message consists of:
–
–
–
–
Gnutella header (22 B)
Options field (2 B)
Query string (L(Q))
TCP/IP and Ethernet headers (58 B)
Total size of Query message for query Q:
82 + L(Q) bytes
ICS214B
Notes 18
38
Calculating Cost
• We know the sizes of each type of message
• We know # messages sent, for each type of
message, for query Q
• Put together: aggregate bandwidth for Q
• Similar process to compute aggregate
processing power
ICS214B
Notes 18
39
Overall Comparison
B I D L
Time to Satisfy
B I D L
Bandwidth Cost
ICS214B
B I D L
Prob. of Satisfying
B I D L
# results
B
BFS
I
Iterative Deepening (d=5,W=6)
D Directed BFS (>RES)
L
Local Indices (r=1)
Notes 18
40
Summary: Efficient Search
• What we’ve done:
– Proposed techniques to improve performance
 Kept simple
– Evaluated techniques using extensive real data
– Improved performance, with tradeoffs
• Open issues:
– More efficient!
– Make intelligent use of topology, replication
– Take advantage of heterogeneity (e.g., superpeers)
ICS214B
Notes 18
41