Data Management in
Large-scale P2P Systems
Patrick Valduriez, Esther Pacitti
Atlas group, INRIA and LINA
University of Nantes, France
1
Motivations
P2P systems
  Decentralized control, large scale
  Low-level, simple services
    File sharing, computation sharing, communication sharing
Distributed database systems
  Centralized control, limited scale
  High-level data management services
    Queries, transactions, consistency, security, etc.
P2P + distributed database
Why? How?
2/26
Why high-level P2P data sharing?
Professional community example
Medical doctors in a hospital may want to
share (some of) their patient data for an
epidemiological study
They have their own, independent patient
descriptions
They want to ask queries such as “age and
weight of male patients diagnosed with
disease X …” over their own descriptions
They don’t want to create a database and
buy a server
3/26
Problem definition
P2P system
No centralized control, very large scale
Very dynamic: peers can join and leave the
network at any time
Peers can be autonomous and unreliable
Techniques designed for distributed data
management no longer apply
Too static, need to be decentralized,
dynamic and self-adaptive
4/26
Outline
Data management in distributed
systems
P2P systems
Data management in P2P systems
Data management in APPA
5/26
Data management basic principle
Data independence
  Hide implementation details
  Provision for high-level services
    Schema
    Queries (SQL, XQuery)
    Automatic optimization
    Transactions
    Consistency
    Access control
    …
[Figure: applications access storage through a logical view (schema)]
6/26
Distributed database system (DDBS)
[Figure: queries and transactions are submitted to a distributed database system spanning Site 1, Site 2 and Site 3, with local DBMSs (DBMS1, DBMS2)]
Distribution transparency
Global schema
  Common data descriptions
  Distributed data placement
Centralized control through global catalog
Distributed functions
  Schema mapping
  Query processing
  Transaction management
  Access control
  Etc.
7/26
Scaling up DDBS
Distributed database systems
  Enterprise information systems
  Scale up to tens of databases
Data integration systems
  Strong heterogeneity and autonomy of data sources (files, databases, XML documents, ...)
  Limited functionality (queries)
  Scale up to hundreds of data sources
Parallel database systems
  Focus on high performance and high availability
  Strong homogeneity
  Scale up to hundreds of data nodes
8/26
A generic P2P system
A user at a peer may access sharable
data at remote peers
[Figure: three peers, each running P2P software over its own private and sharable data]
9/26
Potential benefits of P2P systems
Scale up to very large numbers of peers
Dynamic self-organization
Load balancing
Parallel processing
High availability through massive
replication
10/26
P2P vs DDBS
                      P2P                        DDBS
Joining the network   Upon peer’s initiative     Controlled by DBA
Queries               No schema, keyword-based   Global schema, static optimization
Query answers         Partial                    Complete
Content location      Using neighbors or DHT     Using directory
11/26
Requirements for P2P data management (1)
Autonomy of peers
  Peers should be able to join/leave at any time, control their data wrt other (trusted) peers
Query expressiveness
  Key-lookup, keyword search, SQL-like
Efficiency
  Efficient use of bandwidth, computing power, storage
12/26
Requirements for P2P data management (2)
Quality of service (QoS)
  User-perceived efficiency: completeness of results, response time, data consistency, …
Fault-tolerance
  Efficiency and QoS despite failures
Security
  Data access control in the context of very open systems
13/26
P2P network topologies
Unstructured systems
  e.g. SETI@home
Structured (DHT) systems
  e.g. CAN, CHORD
Super-peer (hybrid) systems
  e.g. Napster
14/26
P2P unstructured network
[Figure: peers 1 to 4, each holding its own data, connected by p2p links]
High autonomy (a peer only needs to know a neighbor to log in)
Searching by flooding the network
  General, inefficient
High fault-tolerance with replication
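Searching by flooding can be illustrated with a minimal sketch. It assumes a hypothetical four-peer topology and a time-to-live (TTL) bound on forwarding; the names `flood_search`, `neighbors` and `data` are illustrative and not part of the slides. The message counter hints at why flooding is general but inefficient: every forwarded copy costs a message, even when the receiving peer has already seen the query.

```python
from collections import deque

def flood_search(neighbors, data, start, keyword, ttl=3):
    """Forward a keyword query to all neighbors, hop by hop, until the TTL runs out."""
    hits = []
    visited = {start}
    queue = deque([(start, ttl)])
    messages = 0
    while queue:
        peer, remaining = queue.popleft()
        if keyword in data.get(peer, ()):
            hits.append(peer)
        if remaining == 0:
            continue  # TTL exhausted: this peer does not forward the query
        for nxt in neighbors.get(peer, ()):
            messages += 1  # each forwarded copy costs one message
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, remaining - 1))
    return hits, messages

# Hypothetical topology and contents (assumptions for illustration only).
neighbors = {
    "peer1": ["peer2", "peer3"],
    "peer2": ["peer1", "peer4"],
    "peer3": ["peer1", "peer4"],
    "peer4": ["peer2", "peer3"],
}
data = {"peer1": {"a"}, "peer2": {"b"}, "peer3": {"a", "c"}, "peer4": {"c"}}

hits, messages = flood_search(neighbors, data, "peer1", "c", ttl=2)
```

With a smaller TTL the query reaches fewer peers, so the answer may be partial.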
15/26
P2P structured network
Distributed Hash Table (DHT)
[Figure: the hash function h maps each key to a peer, h(k1) = p1, h(k2) = p2, h(k3) = p3, h(k4) = p4; peer i stores the data d(ki), and peers are connected by p2p links]
Efficient exact-match search
O(log n) for put(key,value), get(key)
Limited autonomy since a peer is responsible for a
range of keys
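The key-to-peer mapping h(k) = p can be sketched as follows. This is a toy model under stated assumptions: peers and keys are hashed onto a ring with SHA-1, and a key is assigned to the first peer at or after its hash value. The class `ToyDHT` and its methods are hypothetical names. The sketch uses a global view of the ring, so it shows only who is responsible for a key; the O(log n) cost on the slide refers to routing between peers in systems such as CHORD, which is not modeled here.

```python
import bisect
import hashlib

def h(value, bits=32):
    """Hash a string onto a ring of 2**bits positions."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << bits)

class ToyDHT:
    def __init__(self, peer_names):
        # Each peer gets a position on the ring and a local key-value store.
        self.ring = sorted((h(name), name) for name in peer_names)
        self.ids = [pid for pid, _ in self.ring]
        self.store = {name: {} for name in peer_names}

    def responsible_peer(self, key):
        # First peer at or after h(key), wrapping around the ring.
        i = bisect.bisect_left(self.ids, h(key)) % len(self.ring)
        return self.ring[i][1]

    def put(self, key, value):
        self.store[self.responsible_peer(key)][key] = value

    def get(self, key):
        return self.store[self.responsible_peer(key)].get(key)

dht = ToyDHT(["peer1", "peer2", "peer3", "peer4"])
for k in ["k1", "k2", "k3", "k4"]:
    dht.put(k, f"d({k})")
```

A peer cannot choose which keys it stores, which is the limited autonomy noted above.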
16/26
Super-peer network
[Figure: peers 1 to 4, each holding its own data, connect to super-peers (p2sp and sp2p links); super-peers connect to each other (sp2sp links)]
Super-peers can perform complex functions (meta-data management, indexing, access control, etc.)
Efficiency and QoS
Restricted autonomy
SP = single point of failure => use several
17/26
P2P systems comparison
Requirements      Unstructured  DHT   Super-peer
Autonomy          high          low   avg
Query exp.        high          low   high
Efficiency        low           high  high
QoS               low           high  high
Fault-tolerance   high          high  low
Security          low           low   high
18/26
Data management in P2P systems
Current research focuses on
Decentralized schema mappings
Extending DHT for complex querying
PIER : exact-match and join queries
Query reformulation
PeerDB: unstruct. network, keyword search only
Edutella: super-peer, RDF-based schemas
Piazza: graph of pair-wise schema mappings
Replication
generally limited to static read-only files
P-Grid addresses updates in structured networks
19/26
Data management in APPA (Atlas
P2P Architecture)
Objectives
Main features
Scalability, availability and performance
Network-independent architecture
Layered, service-based architecture
Replication with semantics-based reconciliation
Decentralized schema management
Schema-based query support and optimization
Peer data caching
Prototype on JXTA
Network-independent P2P services
20/26
Network independent APPA
Advanced Services
  Query Processing
  Replication
  Cache Management
  Security
  ...
Basic Services
  Group Membership Management
  Consensus Management
  P2P Data Management
  Peer Management
  Peer Communication
  ...
P2P Network
  Key-based Storage and Retrieval
  Peer ID Assignment
  Peer Linking
Internet
21/26
Different APPA architectures
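[Figure: two APPA deployments. On a DHT network, each peer hosts advanced services, basic services and local data, and the P2P data is stored in the DHT. On a super-peer network, super-peers host the basic services, the P2P network and the P2P data, while peers host advanced services and local data.]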
22/26
Schema management in APPA
Takes advantage of the collaborative nature of the applications
Peers that wish to cooperate agree on a Common Schema Description (CSD)
Given 2 CSD relation definitions, an example of peer mapping at peer p is:
p:r(A,B,D) ⊆ csd:r1(A,B,C), csd:r2(C,D,E)
Peer mappings stored as P2P data
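One possible way to represent the example mapping between p:r(A,B,D) and the CSD relations r1(A,B,C) and r2(C,D,E) is sketched below. The classes `Relation` and `PeerMapping`, the key format `mapping:p:r` and the plain dictionary standing in for the P2P data store are all assumptions made for illustration; the slides do not specify how APPA encodes mappings.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Relation:
    name: str
    attrs: tuple

@dataclass(frozen=True)
class PeerMapping:
    peer: str
    local: Relation
    csd: tuple  # CSD relations the local relation is mapped to

    def csd_sources(self, attr):
        """CSD relations that contain a given attribute."""
        return [r.name for r in self.csd if attr in r.attrs]

    def join_attrs(self):
        """Attributes shared by all CSD relations of the mapping."""
        common = set(self.csd[0].attrs)
        for r in self.csd[1:]:
            common &= set(r.attrs)
        return sorted(common)

mapping = PeerMapping(
    peer="p",
    local=Relation("r", ("A", "B", "D")),
    csd=(Relation("r1", ("A", "B", "C")), Relation("r2", ("C", "D", "E"))),
)

# Stand-in for the P2P data store: the mapping is saved under a key
# so that other peers could look it up (hypothetical key format).
p2p_data = {}
p2p_data[f"mapping:{mapping.peer}:{mapping.local.name}"] = mapping
```

Here attribute C is the only one shared by r1 and r2, so it is the attribute on which the two CSD relations would be joined to cover A, B and D.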
23/26
Replication in APPA
Small-world assumption: peers work in
smaller groups with time locality
Lazy multi-master replication
n peers can update the same replica
Improves read performance and availability
Replica divergence solved by distributed
log-based reconciliation
Exploit P2P data management service
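The idea of lazy multi-master replication with log-based reconciliation can be sketched as follows. This is a deliberately simplified stand-in: each peer updates its replica locally and logs the update, and reconciliation merges the logs and replays them in timestamp order (last writer wins). APPA's semantics-based reconciliation is richer than this; the class `Replica`, the function `reconcile` and the integer timestamps are assumptions for illustration.

```python
class Replica:
    def __init__(self, peer):
        self.peer = peer
        self.state = {}
        self.log = []  # entries: (timestamp, peer, key, value)

    def update(self, timestamp, key, value):
        # Lazy: apply locally and log; nothing is propagated yet.
        self.state[key] = value
        self.log.append((timestamp, self.peer, key, value))

def reconcile(replicas):
    """Merge all logs, replay them in (timestamp, peer) order, install the result everywhere."""
    merged = sorted({entry for r in replicas for entry in r.log})
    state = {}
    for _, _, key, value in merged:
        state[key] = value  # later entries overwrite earlier ones
    for r in replicas:
        r.state = dict(state)
        r.log = list(merged)
    return state

a, b = Replica("p1"), Replica("p2")
a.update(1, "x", "a1")
b.update(2, "x", "b2")   # concurrent update to the same item on another master
a.update(3, "y", "a3")

diverged = a.state != b.state   # replicas differ before reconciliation
final = reconcile([a, b])
```

After `reconcile`, both replicas hold the same state and the same merged log.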
24/26
Query processing in APPA
Given a SQL-like query on peer schema, performs
query reformulation
  Maps the query on CSD schemas
query matching
  Finds relevant peers
query optimization
  Selects best peers, taking replication into account
query decomposition and execution
  Exploits parallelism
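The four phases can be sketched as a small pipeline. Everything concrete here is hypothetical: the tables `PEER_MAPPING`, `HOLDERS` and `LOAD`, the choice of "lowest load" as the selection criterion, and the use of a thread pool to stand in for parallel execution on remote peers. The sketch only shows how the phases could feed into each other, not how APPA implements them.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical metadata (assumptions for illustration only).
PEER_MAPPING = {"r": ["r1", "r2"]}                    # local relation -> CSD relations
HOLDERS = {"r1": ["p2", "p3"], "r2": ["p3", "p4"]}    # CSD relation -> peers holding a replica
LOAD = {"p2": 1, "p3": 3, "p4": 2}                    # lower is better

def reformulate(local_relation):
    """Query reformulation: map the query on CSD relations."""
    return PEER_MAPPING[local_relation]

def match(csd_relations):
    """Query matching: find the peers relevant to each CSD relation."""
    return {rel: HOLDERS.get(rel, []) for rel in csd_relations}

def optimize(candidates):
    """Query optimization: pick one replica holder per relation (here, the least loaded)."""
    return {rel: min(peers, key=LOAD.__getitem__)
            for rel, peers in candidates.items() if peers}

def execute(plan, run_subquery):
    """Decomposition and execution: run one subquery per relation in parallel."""
    with ThreadPoolExecutor() as pool:
        futures = {rel: pool.submit(run_subquery, peer, rel)
                   for rel, peer in plan.items()}
        return {rel: f.result() for rel, f in futures.items()}

def process(local_relation, run_subquery):
    return execute(optimize(match(reformulate(local_relation))), run_subquery)

plan = optimize(match(reformulate("r")))
results = process("r", lambda peer, rel: f"{rel}@{peer}")
```

In this toy run, r1 is fetched from p2 and r2 from p4, and the two subqueries are submitted concurrently.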
25/26
Conclusion
Advanced P2P applications will need
high-level data management services
Various P2P networks will improve
Network-independence crucial to exploit
and combine them
Many technical issues
Important to characterize applications
that can most benefit from P2P wrt
other distributed architectures
26/26