Transcript Powerpoint
YouSearch – Searching a
Network of Personal Webservers
Mayank Bawa, Roberto J. Bayardo Jr., Sridhar
Rajagopalan, Eugene J. Shekita.
Make it Fresh, Make it Quick – Searching a
Network of Personal Webservers
Simon Pinkel
1
Talk Outline
1.
2.
3.
4.
Introduction
Internals
Performance & Tuning
Conclusions & Future Work
2
1.1 Personal Webservers for
Content sharing
the web?
Simplicity of www-protocols (e.g. HTTP)
Maturity of the Internet (e.g. DNS)
personal?
Commoditization of personal Computers
Ubiquity of the browser interface
Examples:
Corporation (1.500 people within IBM use a personal
webserver)
University (Saarbrücken: every Research Group maintains
its own (personal) webserver to publish its papers)
3
1.2 Search Issues on Personal
Webservers
Transience of the content
Search by Navigation ineffective:
Content is often poorly arranged
large fraction of diverse (non-HTML) data
typically many files are not reachable by links
(crawl-based) Web Search Engines?
crawl frequency vs transience:
host is offline at crawl time
host is offline at query time
small life-cycles of documents
Results are stale and incomplete
4
1.3 Problems with existing
(P2P) file sharing systems
Kazaa/Gnutella: Query flooding
YouSearch is intended to run on a
corpus which does not support wide content
replication
Napster: centralized index
YouSearch is capable of indexing file
name and content
Both use their proprietary protocol
YouSearch is („at its core“) webcompatible
5
2.1 Overview
YouSearch consists of:
1) Peer Nodes that run YouSearchenabled Webservers
2) Browsers that search YouSearchenabled content
3) A light-weight, centralized Component
called the „Registrar“, whose purpose
is to store the network state
Peer 2
Peer 6
Peer 4
Registrar
Peer 1
Browser
Peer 3
Peer 5
6
2.2 Indexing
Registrar
Bit-Pos.
IP-Addresses
The indexing process looks as follows:
1) The Inspector examines shared files
...
2) If necessary the Indexer updates the local disc-based index
3) the Summarizer sums up the content information: It obtains a list
of Terms T from the Indexer and creates a corresponding bloom
filter:
1) create bit
k bit
vector
vectors
V with
V[1],length
...,V[k]L,with
all bits
length
set L,
to all
0 bits set to 0
2) Using Hash
k independent
Function Hash
H:Term
Functions
{1,...,L},
H[i]:Term
map each
{1,...,L},
Term t in
map
T
intoeach
V Term t in T into k bit Vectors V[1], ...,V[k]
4) VV[1],
is sent
...,V[k]
to the
areRegistrar
sent to the Registrar
5) the Registrar‘s Summary Manager aggregates V
V[1],
into...,V[k]
a structure
into
structures
that
maps that
bit positions
map bit positions
to sets of to
peers
sets of peers
but what about hash conflicts?
introducing k independent Hash Functions with their respective bit
vectors V[1], ...,V[k] s.t.:
term t occurs at this peer iff for all i: V[i](H[i](t)) = 1
Inspector
Indexer
Summary Manager
Registrar
0
1
0
0
1
1
1
0
1
0
0
0
0
1
0
0
0
1
0
...
...
...
0
0
0
1
...
Summarizer
Peer
7
1
2.3 Querying
1)
2)
3)
4)
5)
Query Manager
Bob asks Alice for „pdf group:YouSearchTeam“
Alice transforms query into canonical form
„{(keywords,{pdf}),(group,{YouSearchTeam})}“
and sends it to the Registrar
The Query Manager computes the Set R of relevant Peers
and sends it back to Alice
Web Interface
Canonical Tx
Browser
(Bob)
6)
7)
8)
Registrar
Peer 1
(Alice)
Result Display
Result Gatherer
Alice refines R if necessary(group, site), then contacts
all Peers in R directly,
they issue the query on their local content and return the
result back to Alice
while gathering results, Bob already receives results from
Alice so Bob does not perceive latency
Bit-Pos.
IP-Addresses
...
Peer 2
Registrar
Peer 6
Peer 4
Peer 5
Peer 3
8
= group YouSearchTeam
2.4 Caching Results
Registrar
Query theCaching
Peer peer
Everytime a global query is answered,
querying
pdf
peer 1, ...
1) caches the Url Set U
...
2) informs the Registrar,
3) which adds (query, IP-address) in the cache table
4) deletes Cache entries after a small lifetime(ttl)
5) informs the registrar again
Peer 1
Browser
Query
Result
pdf
<urls>
Peer 1
Query Manager
Peer 2
Peer 1
Peer 2
Registrar
...
When a global query is issued,
Registrar
1)
the registrar looks up its cache mappings
2)
computes all caching peersQuery Caching Peer
pdf
...
peer 1, ...
3)
picks one at random
...
4)
and sends its ip address to the querying peer
Browser
5)
which gathers the cached result from this peer
Peer 3
Query Manager
9
2.5 Failure Management
The Registrar
1)
periodically attempts to contact
all peers
2)
if peer Alice guesses that Carol
is offline,
3)
it informs the Registrar,
4)
Carol moves up in the
Registrar‘s check queue
5)
if Carol does not respond to the
Registrar,
6)
it is removed from the network
The Peers
1)
also send messages
to confirm their status
2)
if the registrar answers „offline“,
3)
the peer starts a new session
Alice
Carol
Carol
...
...
Alice
Registrar
Alice
Carol
Bit-P.
IP
Carol,...
...
...
Query
Peer
Alice,...
...
Alice,...
Carol,Alice,...
Alice,...
10
Seconds
3.1.1 Time to answer Queries
11
3.1.2 Characteristics of Bloom
Filters
12
3.1.3 Gains from Caching
Experiment:
25 queries were issued at
one peer
then these 25 queries were
repeated at different peers
Real World Example:
Out of 1.500 logged Queries,
3.54% (53) were served from
peer caches (ttl = 5min)
31.31% (~470) were asked
more than once
13
3.1.4 Load on Registrar
Consider:
n peers send k bloom filters of size L(bits) every T
seconds if their content has changed, and let f be
the number of peers whose content has changed
in a time interval of length T.
average inbound traffic at registrar:
f*n*k*L every T seconds
In the current implementation:
n = 3, L = 65.536b, T = 300s, and let f = 20%
Assuming T1(1.54 Mb/s), with 20% network overhead
n = (80% * 1.54 Mb/s * T) / (f*k*L) = 9.856 peers
T3(44.736 Mb/s) 286.310 peers
14
3.2 Tuning
Example:
YouSearch is released with bloom filter size L = 512bit
now we discover that a significant fraction of peers have most
of the 512 bits set
increase L, but then every peer needs to adjust parameters like
this (number of bloom filters, timeout durations, frequency to
send bloom filters/check for updates on the index etc.), or
install new software
YouSearch‘s Proposal: The Tuning Manager
installed at each peer
reacts on changes pushed by a centralized component, the
Administrator, and changes a parameter state file
15
4.1 Conclusions
P2P-hybrid architecture to
provide search on transient, rapidly
evolving content
produce fast, fresh and complete results
lightweight central component
option of distribution on multiple hosts
small space/processing requirements
16
4.2 Future Work
Desired Extensions:
Partitioning the content in public and private
(only for authenticated peers/users)
including text snippets in the displayed
result
maintaining cache results for popular
queries
global ranking (only local ranking in current
implementation)
17
Bibliography
1)
2)
3)
4)
Mayank Bawa, Roberto J. Bayardo Jr., Sridhar
Rajagopalan, Eugene J. Shekita. Make it Fresh,
Make it Quick – Searching a Network of Personal
Webservers
B. Bloom. Space/Time Trade-Offs in Hash Coding
with Allowable Errors
Andrei Broder, Michael Mitzenmacher. Network
Applications of Bloom Filters: A Survey
D. Carmel, E. Amitay, M. Hersovici, Y. Mareek, Y.
Petruschka, and A.Soffer. Juru at TREC 10 –
Experiments with Index Pruning
18