Transcript p2p

Database Systems & Peer-2-Peer
Networks
By
Chibuike Muoh
Introduction
• What is a Peer-2-Peer? According to [1] P2P is
– a network that relies on the computing power and
bandwidth of the participants (referred to as nodes)
in the network rather than concentrating it in a
relatively low number of servers.
• Examples of P2P and its applications include
– Content sharing networks such as Gnutella,
Bittorrent etc.
– Distributed processing such as computing clusters
etc.
Introduction contd.
• In essence, P2P networks are an alternative to
the client/server architecture because they
provide scalability and fault tolerance unlike
the client/server model that is hierarchical and
introduces a single point of failure for the
system.
• Other definitions of P2P
– A P2P system is a collection of distributed
computing nodes of equal roles or capabilities,
exchanging information and services directly with
each other
Introduction: Advantages of P2P
• The P2P system of distributed computing is
very desirable because:
– Ad-hoc and dynamic membership means P2P
systems can easily scale up/down as the number of
peers grow (self organizing)
– Better utilizes bandwidth since the peers
communicate directly with each other
– Inherent robustness and load balancing
– Fault tolerant since no one node is more important
the other
– Richness and diversity in services/resource
available in the network.
Types of P2P Systems
• Currently there are (2) main architecture of
content sharing P2P networks
– Pure P2P
• Peers have equal role and capabilities (act as both the
client and server in the network)
• Employs no centralized servers for global indexing and
searching.
– Conversely there is no concept of login
– Unrestrictive membership qualification
• Generally more dynamic and less venerable to
downtimes.
• E.g. Gnutella and Freenet
Types of P2P Systems
– Hybrid P2P
• Employs a system of central server(s) solely for the
purpose of global indexing of resources in the system
– Typically has restrictive membership usually requiring
registration and logins
• Yield better performance at the expense of the system
being limited by the availability of the indexing servers
• The peers in the system are responsible for hosting the
resources available in the system (servers only index the
resources)
• E.g. Napster
P2P vs. Client server
• For the purpose of this presentation, the key
difference to note between P2P and the client server
model is the way both systems handle query
evaluation
– In traditional client server model: the query must be either
posted to the data store which then post back a response
(e.g. traditional database systems, warehouses) or the data
store could be posted to the query originator to evaluate for
results (e.g. sensor and filter networks)
– In the case of P2P system: each peer receives a copy of the
query, evaluates against its own data store and post back
the results. The data originator then aggregates the
responses to the original query
Technical Challenges for P2P
Systems
• It is difficult to predict or reason about the
location/quality of the systems resources due
to its dynamicity
– In the case of a Hybrid system, there is the risk
of information explosion due to the fact the
system has to maintain a global catalog/index of
an increasingly changing number of available
resources
Technical Challenges for P2P
Systems
• Queries in P2P systems are not flexible in that
they only allow searches or lookup of resources
by name or identifier. This limiting semantics of
the system does not allow the user to fully
explore the full breadth of the system’s
resources.
• Updates to a resource go unannounced to the
peers with copies of the original data thus
littering the system with inconsistent copies
Technical Challenges for P2P
Systems
• Data placement and availability is another
area of problem in current P2P systems.
– In current P2P system data placement is mostly
“demand driven” and as such network
bandwidth is utilized inefficiently (hot spots).
Data placement in P2P systems
Data placement (a.k.a)
Rendezvous problem [4]
• In a P2P system with nodes connected by a
network with limited bandwidth on each link.
Peers are typically engaged in the following roles
– Data originator/Service provider: provides resources,
services or stores content in the P2P system.
– Query evaluator/Rendezvous peer: a peer that stores
materialized views and/or evaluates queries
– Query initiator: a peer that acts a client in the network
and poses new queries in request for a service or
resource (a node may initiate new queries on behalf of a
query it is attempting to evaluate)
Data placement contd.
• As you can reason, there is an associated cost of
answering queries in a P2P system. These cost
include
– Transfer cost of query to storage provider(s)/data
originator(s)
– Resource utilization cost at query evaluator and other
participating nodes (workload)
– Cost to transfer the results back to the query initiator
• More formally,
– The Data placement problem in P2P is to distribute
data and work so the full query workload is answered
with lowest possible cost under existing resource and
bandwidth constraints
Data placement in P2P
• Thus in traditional P2P, to determine which
nodes contain the matches for our query
typically the best solution was to flood the
whole network
– thereby exacerbating the data placement
problem in P2P.
Data placement in P2P contd.
• On of the more popular strategy to minimize data
placement cost in P2P is Distributed Hash tables
• DHT forms an abstraction to overlay networks that
provide a key-based routing scheme i.e. exact-match
search (e.g. given an object identifier oida the query
q(oa) at node nj returns object oa).
– Using DHT, the recipient of a query/message is
determined dynamically by a distributed routing
algorithm
– One of the main drawbacks of DHT is that they only
support exact-match search using identifiers rather
than keyword search (although that functionality can
be built atop it).
Static data placement problem
• Given a graph G describing a network of peers,
the static data placement problem is to perform
data placement with optimal cost, where queries
are zero-cost object lookups.
– Theorem [3]: In a given graph G, the static placement
problem is NP-complete, even if all the queries in the
workload in G are object queries.
• Proof of this theorem is based on a reduction from the vertexcover problem.
What can P2P learn from
database systems
• Some of the problems inherent in P2P today are
similar to issues addressed (or are being addressed)
by the data management community
• This includes issues such as
–
–
–
–
Semantics,
view materialization
query evaluation
data transformation and relationships
What can P2P learn from
database systems
• One of the main problem at the heart of any P2P
system is the cost of data placement
• The cost associated with data placement is more
like the issue of view materialization in database
(warehouses) which is an hotly research topic in
the database community. P2P system design
would benefit greatly from research in this area.
– Also beneficial to reducing data placement cost are
lessons learned from the area of multi-query
optimization in distributed database. But there some
main differences such as the absence of a sole
administrator/controller and no centralize database
schema in P2P.
What can P2P learn from
database systems
• Accessing data in hierarchical granularity
levels is another area from which P2P
system could learn from database systems
What can P2P learn from
database systems
• Currently, data within a P2P system can only be
accessed at the atomic granularity level e.g. a
complete audio MP3 file.
– But in reality, multiple MP3 files can be grouped into
an album, and albums into collections
– Thus the data placement problem at this level is that
current P2P system only allow placing a single file or
an entire album as atomic object at a peer
• This is unlike database systems where data can be
expressed as a relation between other groups of
data or as an aggregation before being displayed
Other Issues in P2P system design
• In P2P systems any one peer can originate
data/resource. This high level of heterogeneity
influences the degree to which the P2P system can
ensure uniform global semantics for the
data/resources.
– One possible solution to this problem might be use a
moderator system much like in bittorrent.
– This system/model guarantees that data/resource
injected into the system are consistent and ensures
integrity of the resources in the system
Other Issues in P2P system design
• Freshness and update consistency in P2P is an
issue in P2P design there is no way of propagating
updates from the data originator to storage
providers that have materialized view of the data.
– Some possible solution would be a timeout/expirationbased protocol much like in DNS system or web
caches.
– Has lower overhead then invalidation messages
(client/server).
Other Issues in P2P system design
• Extent of knowledge sharing is another issue in in
P2P. It relates to the question of how much
knowledge is available to the system during its
query optimization process i.e. choosing a query
evaluation strategy that can identify nodes that have
materialized views which can speed up query
processing.
– A simple strategy would be to use a centralized catalog
(much like Napster’s hybrid P2P system). But this would
introduce a single point of failure in the system
– An alternative solution might to arrange the P2P system
in a hierarchical organization as in DNS or LDAP with
each peer holding some fragment of the global catalog.
• The basic challenge in this scheme is to achieve a degree of
consistency due to the dynamicity of the system.
Resources
[1] http://en.wikipedia.org/wiki/Peer-to-peer
[2] BeverlyYang, Hector Garcia-Molina Comparing
Hybrid Peer-to-Peer Systems. VLDB 2001
[3] Steven Gribble, Alon Halevy, Zachary Ives, Maya
Rodrig, Dan Suciu. What Can Databases do for
Peer-to-Peer? WebDB 2001.
[4] Wesley W. Terpstra, Stefan Behnel, Ludger
Fiege, Jussi Kangasharju, and Alejandro
Buchmann. Bit Zipper Rendezvous
[5] Zheng Zhang, Mallik Mahalingam, Zhichen Xu,
Wenting Tang. Scalable, Structured Data
Placement over P2P Storage Utilities
[6] http://en.wikipedia.org/wiki/Distributed_hash_table