A Vision for an Architecture Supporting Data Coordination


Making Peer Databases Interact – A Vision for an Architecture Supporting Data Coordination
Fausto Giunchiglia (1)
Working Group (in alph. order): Bernstein Phil (4), Kementsietsidis Tasos (2), Kuper Gabriel (1), Mylopoulos John (2), Serafini Luciano (3), Shvaiko Pavel (1), Zaihrayeu Ilya (1)
Sites: (1) University of Trento, (2) University of Toronto, (3) ITC-Irst, Trento, (4) Microsoft Research
Madrid, 20 September 2002
2
The Talk
Peer-to-Peer Databases – The intuition
Preliminary Logical Architecture
The Running Example
Conclusion
… and Agents???
3
PEER-TO-PEER DATABASES –
THE INTUITION
4
The Peer-to-Peer (P2P)
“Peer-to-peer is a class of applications that take
advantage of resources – storage, cycles, content,
human presence – available at the edges of the
Internet. Because accessing these decentralized
resources means operating in an environment of
unstable connectivity and unpredictable IP
addresses, peer-to-peer nodes must … have
significant or total autonomy from central servers”
Quote from Clay Shirky (www.shirky.com)
5
Examples of P2P Computing
Napster – a shared directory of available music and client
software which lets users, for instance, import and export files
Gnutella – a decentralized group membership and search
protocol, mainly used for file sharing
Groove – a system which implements a secure shared space
among peers
JXTA – which aims at creating a common platform that makes it
simple and easy to build a wide range of distributed services and
applications in which every device is addressable as a peer
Is there a place for databases?
6
Motivating Example: Databases of Medical Patients
One patient may be described in several
databases: pharmacist, family doctor and
hospital
But the databases can use different patient
ID formats, disease descriptions, etc.
Nevertheless they still may need to
interoperate
At this point data integration may suffice, if
the patient goes to the same doctor,
pharmacist and hospital
When a patient is injured on a ski holiday
in another country, yet more databases
need to get involved
Complete integration is likely to be
infeasible
But dynamic integration of databases
relevant to one patient could have high
value


7
Data (base) Coordination
“... Coordination is managing dependencies between interacting
databases”
Why is it different from data (base) Integration?
- No statically maintained global schema
- Many of the parameters (metadata) influencing the interaction among peer databases are decided at run time, whereas integration is done at design time
- A change in the content of a node does not affect the overall system performance
… and
- For any given query, nodes coordinate in order to define and use the most “appropriate” (virtual) schema – this is crucial for dealing with the strong dynamics of a P2P network
8
The Three Variances
Data integration mechanisms for randomly acquainted databases
become impractical
We have three kinds of unpredictable run time factors, which influence
the answer to a given query in a P2P network:
- Network (dependent) variance: the network changes over time
- Database (dependent) variance: different databases, if asked the same global query, will provide different answers
- Query (dependent) variance: different queries, even if posed to the same database, will impose different points of view on the network
9
Good Enough Answers
In data coordination, it becomes hard to maintain a high
quality level in the answers provided by the P2P network
High quality data can flow among the databases
preserving (at the best possible level of approximation)
soundness and completeness
Good Enough Answer (intuition) – an answer whose quality
level serves its purpose, given the amount of
effort made in computing it
10
Example of a Good Enough Answer
When planning his vacation in
Trentino, John goes to a local travel
agency (TA)
Unfortunately, TA cannot offer John
anything from its own database
Instead, TA searches for individual
operators in the Trentino region
(hotels, ski resorts, etc.)
TA starts communication sessions
with some operators
TA queries for the necessary
information (e.g., prices, conditions,
availability)
As long as, for instance, TA gets a
hotel John likes, this is Good Enough
Compared to the Motivating Example,
much lower quality data coordination
will probably suffice
Cost: 150 $
Avail: 05/01/03 – 15/01/03
Services: …
11
Tuning Coordination Over Time
A lot of metadata needs to be produced and maintained
Due to the strong dynamics of a P2P network, this is a crucial
and hard task to perform because:
- A node will never know the full list of its peers
- A node will never know everything about its peers
- Its knowledge will be hard to maintain and will easily become obsolete
There is a need to tune and improve, on each peer, the quality of
the interaction (for instance, with the help of learning algorithms,
metadata editors, and so on)
There is an obvious trade-off between the quality of the answers
and the effort made in maintaining coordination
12
VERY PRELIMINARY HINTS OF A LOGICAL
ARCHITECTURE
13
A Proposed Architecture
Four basic ingredients:
1. Interest Groups
2. Acquaintances
3. Coordination Rules
4. Correspondence Rules
14
Interest Groups
Peer nodes know very little about the other nodes of the P2P network,
and about the topics (e.g., Tourism, Medical care, …) on which their peers are
able to answer queries
An Interest Group is a set of nodes which are able to answer queries
about a certain topic
There is a Group Manager (GM) which is in charge of the
management of the metadata needed in order to run the group
The main goal of GM is to compute the Query Scope (QS) – the set of
nodes a query should be propagated to
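A minimal sketch of an Interest Group and the Query Scope its Group Manager returns. The class name, the node names and the rule "scope = every member except the asking node" are assumptions made for illustration; the slides do not say how GM actually computes QS.

```python
from dataclasses import dataclass, field


@dataclass
class InterestGroup:
    """A set of nodes able to answer queries about one topic."""
    topic: str
    manager: str                                  # node acting as Group Manager (GM)
    members: set[str] = field(default_factory=set)

    def query_scope(self, asking_node: str) -> set[str]:
        # Assumption: the scope is every member except the asker.
        # A real GM could prune this set using richer group metadata.
        return self.members - {asking_node}


# Hypothetical group with three nodes; n1 is the Group Manager.
tourism = InterestGroup(topic="Tourism", manager="n1", members={"n1", "n2", "n3"})
scope = tourism.query_scope("n2")
print(sorted(scope))
```

Keeping the scope computation inside the group object mirrors the idea that only GM holds the metadata needed to run the group.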
15
Acquaintances
Acquaintances are nodes that a node knows about and
that have data relevant to answer specific queries
A node is an acquaintance of another node only with
respect to (possibly, a schematic representation of) a
query
There must be a way to compute how to propagate a
query, to propagate results back, and to reconcile them
with the results coming from the other acquaintances
16
Coordination Rules
Each acquaintance may be associated with one or more
Coordination Rules
Coordination rules specify under what conditions, when, how
and where to propagate queries or updates
A proposed implementation of coordination rules is as Event-Condition-Action (ECA) rules
- Event can be an update or a query coming from the user or from another node
- Condition refers to properties of the update or query (e.g., the type of query and/or which data are referenced by the query)
- Action can be the translation and propagation of a given update or query to a particular acquaintance
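The Event-Condition-Action reading of coordination rules can be sketched as follows. All names here (Message, CoordinationRule, dispatch, outbox) are hypothetical, and "sending" is modelled as appending to a list rather than real network traffic.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Message:
    kind: str      # "query" or "update"
    sender: str    # "user" or the name of another node
    body: dict     # hypothetical payload, e.g. {"attrs": [...]}


@dataclass
class CoordinationRule:
    event: str                              # message kind that triggers the rule
    condition: Callable[[Message], bool]    # predicate over the message
    action: Callable[[Message], None]       # what to do when the condition holds


def dispatch(rules: list[CoordinationRule], msg: Message) -> int:
    """Run every rule whose event and condition match; return how many fired."""
    fired = 0
    for rule in rules:
        if rule.event == msg.kind and rule.condition(msg):
            rule.action(msg)
            fired += 1
    return fired


outbox: list[tuple[str, dict]] = []   # (acquaintance, payload) pairs "sent" so far

rules = [
    CoordinationRule(
        event="query",
        condition=lambda m: "Disease" in m.body["attrs"],
        action=lambda m: outbox.append(("H", m.body)),   # propagate to acquaintance H
    )
]

fired_query = dispatch(rules, Message("query", "user", {"attrs": ["Name", "Disease"]}))
fired_update = dispatch(rules, Message("update", "user", {"attrs": ["Disease"]}))
print(fired_query, fired_update, outbox)
```

The update is ignored because the only rule listens for query events, which is the kind of "under what conditions" filtering the slide describes.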
17
Correspondence Rules
Each acquaintance is associated with one or more
Correspondence Rules
Correspondence Rules translate queries and query
results (semantic heterogeneity)
They are implemented as rewrite rules and are called by
coordination rules, in their action and condition components
They can be used, for instance, to translate attribute or
element names (Domain Relations)
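As a rough illustration of a correspondence rule used as a rewrite rule, the sketch below renames the attributes of one result row. The mapping and the row are made up for the example; the attribute names are borrowed from the toy databases later in the talk.

```python
# Hypothetical correspondence rules: attribute names at a peer -> local names.
PEER_TO_LOCAL = {"Name": "P_Name", "Disease": "Illness_Desc"}


def rewrite_names(row: dict, rules: dict) -> dict:
    """Rename the keys of a result row; keys with no rule are kept unchanged."""
    return {rules.get(key, key): value for key, value in row.items()}


peer_row = {"Name": "John", "Disease": "Forearm dislocation", "Treatment_Desc": "Bandage"}
local_row = rewrite_names(peer_row, PEER_TO_LOCAL)
print(local_row)
```

Keeping unmapped attributes unchanged is a design choice of this sketch; dropping them would be an equally plausible policy.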
18
Level One Architecture
- P2P Layer: add-on providing P2P functionality
- Local Data Source (LDS): database, file system, Web site, …
- User Interface: user queries, results, …
- Query Manager (QM) and Update Manager (UM): responsible for query and update propagation; manage coordination and correspondence rules, acquaintances, and interest groups
- Wrapper: provides a translation layer between QM and UM, and LDS
19
A Proposed Strategy for Query Propagation
1. User submits query Q ()
2. Node defines query topic
3. Node sends to Group Manager (GM) a request to define the Query Scope (QS)
4. GM computes and sends back QS
5. Node 1 sends the query to its acquaintances in QS, and reports this fact to GM
6. Nodes 2 and 4 send their answers to node 1
7. Nodes propagate the query to their acquaintances from QS and report this fact to GM
8. And so on…
9. Nodes which do not propagate any further report this fact to GM
10. Propagation stops when “no more propagation” has been received from all boundary nodes
[Figure: network of nodes 1 to 11 and the GM. Message labels: 1. Q (); 2. Q (, topic); 3. QS (, topic) = ?; 4. QS (, topic) = (2, 4, 6, 8, 9, 11); 5. “nodes 2 and 4 are reached”; “node 6 is reached”; “node 8 is reached”; “no more propagation from 8”; “no more propagation from 9”; ←Res4]
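The propagation strategy can be sketched as a breadth-first walk restricted to the Query Scope, with each node reporting to GM. Only the scope (2, 4, 6, 8, 9, 11) comes from the slide; the acquaintance graph below is hypothetical, and answers flowing back to node 1 are not modelled.

```python
from collections import deque


def propagate(start, acquaintances, scope):
    """Breadth-first propagation of a query, restricted to the query scope.

    Returns the nodes reached and the reports the Group Manager would receive.
    """
    reached = []          # nodes that received the query, in order
    reports = []          # messages sent to the GM
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        targets = [n for n in acquaintances.get(node, [])
                   if n in scope and n not in seen]
        if targets:
            reports.append(f"{node} propagated to {targets}")
            for t in targets:
                seen.add(t)
                reached.append(t)
                queue.append(t)
        else:
            # Boundary node: nothing left to forward the query to.
            reports.append(f"no more propagation from {node}")
    return reached, reports


# Hypothetical acquaintance graph; node 11 is in scope but nobody knows it.
acquaintances = {1: [2, 4], 2: [6], 4: [8], 6: [9], 8: [], 9: []}
scope = {2, 4, 6, 8, 9, 11}
reached, reports = propagate(1, acquaintances, scope)
print(reached)
print(reports)
```

In this sketch GM could declare the propagation finished once every reached node has either forwarded the query or reported "no more propagation".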
20
THE RUNNING EXAMPLE
21
“Toy” Databases
Recall the Motivating Example:
Family Doctor DB
F: Prescription (PatID, P_Name, Illness_Desc, StartDate,
RecoveryDate, Treatment, Type, Prescriptions);
Hospital DB
H: Patients (PID, Name, Disease, Treatment_Desc, In, Out);
Medical Office DB
M: Accidents (P_id, FN, LN, Address_Reason, Treatment_Taken,
Prescription_Given, Date)
John, who suffers the accident, is described in H with ID
“P12”, in F as “8”, and, when he turns to M, he is
assigned ID “A13”
22
Query Example
Let's suppose query QM is posed to M:
Select FN, LN, Address_Reason,
Treatment_Taken,
Prescription_Given, Date
From “M:Accidents”
Where (Address_Reason Like ‘%Fracture%’
Or Address_Reason Like ‘%Dislocation%’)
And P_id = ‘A13’
QM is marked as a global
query with topic
T = “Medical Care in Canada”
After some search, T is matched
with the topic “Medical Care in
Toronto” of the interest group G
23
Group G
H is acquainted with F and P
is acquainted with F; dashed
lines are group metadata
channels; H is GM of G
GM computes query scope
QS = G = {F, H, P} for query QM
M gets acquainted with H
M: Accidents and H: Patients
are matched
As a result, a set of
Coordination Rules is
generated
24
Examples of Coordination Rules
Coor # 1
- Event: M:Q
- Condition: Q:(Address_Reason ∈ Select OR Treatment_Taken ∈ Select) AND (P_id = ‘A13’ ∈ Where)
- Action: Q = Apply (Q, Corr_Rules_Query); Send (Q, H)
Coor # 2
- Event: M:RH
- Condition: None
- Action: RM = Apply (RH, Corr_Rules_Results)
Here Corr_Rules_Query and Corr_Rules_Results are correspondence
rules which translate the outgoing query and the incoming results
25
Query Propagation
P is not reachable because there is
no acquaintance path from M to P
In the graph the following queries
are circulating:
- QH = Select Name, Disease, Treatment_Desc
  From “H:Patients”
  Where (Disease Like ‘%Fracture%’ Or Disease Like ‘%Dislocation%’) And PID = ‘P12’
- QF = Select P_Name, Illness_Desc, Treatment
  From “F:Prescription”
  Where (Illness_Desc Like ‘%Fracture%’ Or Illness_Desc Like ‘%Dislocation%’) And PatID = ‘8’
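A sketch of how the query sent to M might be rewritten into QH. The query is represented as a plain dictionary, the Like conditions are left out for brevity, and both mapping tables are assumptions built from the toy schemas (M's key column is taken to be P_id, as in the M: Accidents schema).

```python
# Illustrative correspondence rules from M:Accidents to H:Patients.
ATTRS_M_TO_H = {
    "FN": "Name",
    "LN": "Name",
    "Address_Reason": "Disease",
    "Treatment_Taken": "Treatment_Desc",
    "P_id": "PID",
}
IDS_M_TO_H = {"A13": "P12"}   # the same patient under two ID schemes


def translate(query: dict) -> dict:
    """Rewrite attribute names and patient IDs; drop attributes with no counterpart."""
    # dict.fromkeys removes duplicates (FN and LN both map to Name) and keeps order.
    select = list(dict.fromkeys(
        ATTRS_M_TO_H[a] for a in query["select"] if a in ATTRS_M_TO_H))
    where = {ATTRS_M_TO_H[a]: IDS_M_TO_H.get(v, v)
             for a, v in query["where"].items() if a in ATTRS_M_TO_H}
    return {"select": select, "from": "H:Patients", "where": where}


qm = {
    "select": ["FN", "LN", "Address_Reason", "Treatment_Taken",
               "Prescription_Given", "Date"],
    "from": "M:Accidents",
    "where": {"P_id": "A13"},
}
qh = translate(qm)
print(qh)
```

Attributes of M with no counterpart in H (Prescription_Given, Date) are simply dropped, which is one reason the final answer is incomplete but can still be good enough.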
26
Results Propagation and Reconciliation
H and F generate the following
results:
- ResH = <‘John’, ‘Forearm dislocation’, ‘Bandage’>
- ResF = <‘John’, ‘Leg fracture’, ‘Leg put in plaster’>
On reaching M, the results are
reconciled as follows:
- ResM =
  <‘John’, ‘Forearm dislocation’, ‘Bandage’>
  <‘John’, ‘Leg fracture’, ‘Leg put in plaster’>
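The reconciliation step in this example behaves like a union of the incoming result tuples. The sketch below assumes exactly that, a duplicate-free union in arrival order; the slides do not specify the reconciliation procedure, so this is only one possible reading.

```python
def reconcile(*result_sets):
    """Merge result tuples from several acquaintances, dropping exact duplicates."""
    merged = []
    for results in result_sets:
        for row in results:
            if row not in merged:
                merged.append(row)
    return merged


res_h = [("John", "Forearm dislocation", "Bandage")]
res_f = [("John", "Leg fracture", "Leg put in plaster")]
res_m = reconcile(res_h, res_f)
print(res_m)
```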
27
Variance and Good Enough Answers
Good Enough answers
- ResM is incomplete: some fields from H: Patients and F: Prescription are missing
- Nevertheless, the results are good enough because they still serve the needs of M
Network Variance
- If F is down, the results are even more incomplete
Database Variance
- If M gets acquainted with F instead of H, only ResF is retrievable. F has a different “vision” of the world, as it is not acquainted with H
Query Variance
- If, in QM, John's ID is replaced by the ID of another patient who is not shared, then no Coordination Rules apply and therefore there is no propagation
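Network and database variance in the running example can be reproduced with a small reachability check over the acquaintance graph. Acquaintance is treated as a directed link here (M knows H, H knows F, P knows F), which is an assumption consistent with P being unreachable from M; the function and variable names are hypothetical.

```python
def reachable(start, graph, down=frozenset()):
    """Nodes reachable from start along acquaintance links, skipping nodes that are down."""
    found, stack = set(), [start]
    while stack:
        node = stack.pop()
        for peer in graph.get(node, ()):
            if peer not in found and peer not in down and peer != start:
                found.add(peer)
                stack.append(peer)
    return found


# Acquaintance graph of the running example.
g = {"M": ["H"], "H": ["F"], "P": ["F"]}
all_up = reachable("M", g)                  # H and F answer, P is never reached
f_down = reachable("M", g, down={"F"})      # network variance: only H answers

# Database variance: M gets acquainted with F instead of H.
g_alt = {"M": ["F"], "H": ["F"], "P": ["F"]}
via_f = reachable("M", g_alt)               # only F answers

print(sorted(all_up), sorted(f_down), sorted(via_f))
```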
28
Conclusion
First investigation of how to make databases interact in a P2P network.
There are four main dimensions:
- We must integrate data coming from autonomous, most often semantically heterogeneous, databases;
- We must deal with network, database, and query variance. This is why we talk of data coordination, as distinct from data integration;
- We will almost never get correct and complete answers. We must be content with answers which are good enough;
- There is a need to tune metadata. This is required in order to cope with the dynamics of a P2P network.
29
References
Project website: http://www.dit.unitn.it/~p2p/
P. Bernstein, F. Giunchiglia, A. Kementsietsidis, J. Mylopoulos, L. Serafini, and I. Zaihrayeu, “Data Management for Peer-to-Peer Computing: A Vision”, WebDB 2002
L. Serafini, F. Giunchiglia, J. Mylopoulos, and P. Bernstein, “The Local Relational Model: Model and Proof Theory”, tech. rep., IRST, Trento