Data Mining Engineering

Download Report

Transcript Data Mining Engineering

Distributed Database Design
& Semantic Data Control
Univ.-Prof. Dr. Peter Brezany
Institut für Scientific Computing
Universität Wien
Tel. 4277 39425
Sprechstunde: Di, 13.00-14.00
LV-Portal: www.par.univie.ac.at/~brezany/teach/gckfk/06ss/300658.html
P.Brezany
Institut für Scientific Computing – Universität Wien
Distributed Database Design
P.Brezany
Institut für Scientific Computing – Universität Wien
2
A General Framework Considered in the Distributed
Database Design Problem
The organization of distributed systems can be investigated along 3 dimensions:
1. Level of Sharing
(a) no sharing: each application and its data execute at one site, and there
is no communication with any other program or access to any data file
at other sites (not very common today)
(b) data sharing: all the programs
are replicated at all the
sites, but data files are
not
P.Brezany
(c) data-plus-program sharing:
a program at a given site
can request a service from
another program at a second site, which, in turn,
may have to access a data
file located at a 3. site.
Institut für Scientific Computing – Universität Wien
3
A General Framework Considered in the Distributed
Database Design Problem (2)
2. Access patterns (along this dimension, the relationship between
distributed database design and query processing is established)
(a) static (they do not change over time) – the design is easier for static
environments; unfortunately, it is difficult to find many real-life
applications that would be classified as static.
(b) dynamic
3. Level of knowledge about the access
pattern (AP) behavior
(a) no knowledge (a theoretical
possibility)
it is very difficult to design a distr.
database that can effectively cope
with this situation
(b) complete knowledge – AP can be resonably (almost precicely) predicted
(c) partial information – there are
deviations from the predictions
P.Brezany
Institut für Scientific Computing – Universität Wien
4
Alternative Design Strategies
• Top-Down approach
• Bottom-Up approach
In most database designs, the two approaches may need
to be applied to complement one another.
It is a joint function of the database, enterprise, and
application system administrators (or of the
administrator performing all three roles)
P.Brezany
Institut für Scientific Computing – Universität Wien
5
Top-Down Design Process
P.Brezany
Institut für Scientific Computing – Universität Wien
6
Bottom-Up Design Process
• Top-down design is a suitable approach when a
database system is being designed from scratch.
• Commonly, a number of databases already exist,
and the design task involves integrating them into
one database.  the bottom-up approach is suited
for this type of environment.
• The starting point is individual local conceptual
schemas.
• The process consists of integrating local schemas
into the global conceptual schemas.
• This type of environment exists primarily in the
context of heterogeneous databases.
P.Brezany
Institut für Scientific Computing – Universität Wien
7
Distribution Design Issues Fragmentation
•
•
•
•
Why fragment at all?
How should we fragment?
How much should we fragment?
Is there any way to test the correctness of
decomposition?
• How should we allocate?
• What is necessary information for fragmentation
and allocation?
P.Brezany
Institut für Scientific Computing – Universität Wien
8
Reasons for Fragmentation
• The important issue is the appropriate unit of distribution: a
relation is not a suitable unit – why?
– Application views are usually subsets of relations, therefore the locality of
accesses of applications is defined not on entire relations but on their subsets 
it is only natural to consider subsets of relations as distribution units.
– If the applications that have views defined on a given relation reside at
different sites, two alternatives can be followed, with the entire relation being
the unit of distribution:
» The relation is not replicated  this results into in an unnecessarily high
volume of remote data accesses.
» The relation is replicated at all or some of the sites where the applications
reside  problems in executing updates and may not be desirable if storage
is limited.
• The decomposition of a relation into fragments, each being treated
as a unit, permits a large number of transactions to execute
concurrently and intraquery concurrency is enabled.
• Disadvantages of fragmentation
– Applications whose views are defined on more than one fragment may suffer
performance degradation.
– Integrity checking (semantic data control): Attributes participating in a
dependency may be decomposed into different fragments which might be
allocated to different sites.
P.Brezany
Institut für Scientific Computing – Universität Wien
9
Fragmentation Alternatives:
Horizontal and Vertical Fragmentation
Example Database
P.Brezany
Institut für Scientific Computing – Universität Wien
10
Example of Horizontal Fragmentation
P.Brezany
Institut für Scientific Computing – Universität Wien
11
Example of Vertical Fragmentation
Remark: The fragmentation may, of course, be nested. If the nestings
are of different types, one gets hybrid fragmentation.
P.Brezany
Institut für Scientific Computing – Universität Wien
12
Hybrid (Mixed) Fragmentation
Relation A
Fragment H1
Fragment H2V1H1
Fragment H2V1H2
Fragment H2V2
Fragment H2V1H3
P.Brezany
Institut für Scientific Computing – Universität Wien
13
Correctness Rules of Fragmentation
• Completness. If a relation R is decomposed into fragments R1, R2,
..., Rn, each data item that can be found in R can also be found in
one or more of Ri´s.
• Reconstruction. If a relation R is decomposed into a set fragments
FR = {R1, R2, ..., Rn}, it should be possible to define a relational
operator  such that
R = Ri, Ri  FR
The operator  will be different for for the different forms of
fragmentation. The reconstructability of the relation from its
fragments ensures that constraints defined on the data in the form
of dependencies are preserved.
• Disjointness. If a relation R is horizontally decomposed into
fragments R1, R2, ..., Rn and data item di is in Rj, it is not in any
other fragment Rk (k  j).If relation R is vertically decomposed, its
primary key attributes are typically repeated in all its fragments 
disjointness is defined only on the nonprimary key attributes of a
relation.
P.Brezany
Institut für Scientific Computing – Universität Wien
14
Allokation
Sei F = {F1, ..., Fn} eine Menge von Fragmenten, S = {S1, ..., Sm} ein Netzwerk gegeben
durch die Menge seiner Sites, und Q = {Q1, ..., Qp} die Menge der relevanten
Anwendungen.
Allokationsproblem:
Was ist die „optimale“ Zuordnung von F zu S bzgl. Q ?
Optimaltätskriterium:
• Minimalität der Kosten gegeben durch die Speicherkosten der Fi an den Sites
Sj, der Anfragekosten für Fi an Site Sj, der Änderungskosten der Fi an allen
Sites an den sie gespeichert sind, und die Kosten der Datenkommunikation.
• Performanz im Sinne von Antwortzeiten oder Systemdurchsatz.
P.Brezany
Institut für Scientific Computing – Universität Wien
15
R1,1
R1
R
R2,1
Sites
S1
R2
R1,2
R3
Globale Relation
R4
R2,2
S2
Fragmente
Fragmente und
ihre Allokation
P.Brezany
Allokation an
den Sites
Institut für Scientific Computing – Universität Wien
R3,3
R4,3
S3
16
Zusätzliche Beispiele
Beispiel 1:
JNO JNAME
BUDGET
LOC
J1
Instrumentation
150000
Montreal
J2
Database
Develop.
135000
New York
J3
CAD/CAM
250000
New York
J4
Maintenance
310000
Paris
JNO JNAME
BUDGET LOC
J1
Instrumentation
150000
Montreal
J2
Database
Develop.
135000
New York
horizontale Fragmentierung
P.Brezany
JNO JNAME
BUDGET LOC
J3
CAD/CAM
250000
New York
J4
Maintenance
310000
Paris
Institut für Scientific Computing – Universität Wien
17
Beispiel 2:
JNO Importance
J1
low
J2
high
J3
low
J4
high
JNO JNAME
BUDGET
LOC
J1
Instrumentation
150000
Montreal
J2
Database
Develop.
135000
New York
J3
CAD/CAM
250000
New York
J4
Maintenance
310000
Paris
abgeleitete horizontale
Fragmentierung
JNO JNAME
BUDGET
LOC
JNO
JNAME
BUDGET
LOC
J1
Instrumentation
150000
Montreal
J2
Database
Develop.
135000
New York
J3
CAD/CAM
250000
New York
J4
Maintenance
310000
Paris
P.Brezany
Institut für Scientific Computing – Universität Wien
18
Beispiel 3:
JNO JNAME
BUDGET
LOC
J1
Instrumentation
150000
Montreal
J2
Database
Develop.
135000
New York
J3
CAD/CAM
250000
New York
J4
Maintenance
310000
Paris
JNO BUDGET
J1
150000
J2
135000
J3
250000
J4
310000
vertikale Fragmentierung
P.Brezany
JNO JNAME
LOC
J1
Instrumentation
Montreal
J2
Database
Develop.
New York
J3
CAD/CAM
New York
J4
Maintenance
Paris
Institut für Scientific Computing – Universität Wien
19
Semantic Data Control
P.Brezany
Institut für Scientific Computing – Universität Wien
20
Main Objectives
• View management
• Security control
• Semantic integrity control.
The above items can be defined by rules that the system
automatically enforces. The violation of some such rule by a
user program (a set of database operations) generally implies
the rejection of the effects of that program.
• The rules are defined by the DB administration.
• The cost of enforcing semantic data control, which is high in
terms of resource utilization in a centralized DBMS, can be
prohibitive in a distributed environment.
• The rules must be stored in a catalog (directory)  efficient
management in a distributed environment is important
P.Brezany
Institut für Scientific Computing – Universität Wien
21
View Management
• In a rel. DB system a view is a virtual relation, defined as the
result of a query on base relations (or real relations), but not
materialized like a base relation, which is stored in the DB.
• An external schema can be defined as a set of views and/or
base relations.
• Besides their use in external schemas, views are useful for
ensuring data security in a simple way  if users may only
access the database through views, they cannot see or
manipulate the hidden data, which are therefore secure.
• In a distributed DBMS, a view can be derived from distributed
relations, and the access to a view requires the execution of
the distributed query corresponding to the view definition.
• An important issue in a distributed DBMS is to make view
materialization efficient.
P.Brezany
Institut für Scientific Computing – Universität Wien
22
Views in Centralized DBMSs
• Example 1
The view of system analysts (SYSAN) derived from relation EMP
(ENO,ENAME,TITLE), can be defined by the following SQL query:
CREATE VIEW
SYSAN(ENO, ENAME)
AS
SELECT ENO, ENAME
FROM
EMP
WHERE TITLE = ``Syst. Anal.´´
The single effect is the storage of the view defintion in the catalog.
No other information needs not be recorded. Therefore, the result to
the query defining the view (i.e., a relation having the attributes ENO
and ENAME for the system analysts as shown in the figure below) is
not produced. However, the view SYSAN can be manipulated as a base
relation.
P.Brezany
ENO
ENAME
E2
E5
E8
M.Smith
B. Casey
J.Jones
Institut für Scientific Computing – Universität Wien
23
Views in Centralized DBMSs (cont.)
• Example 2
The query „Find the names of all the system analysts with their
project number and responsibility(ies)“
involving the view SYSAN and relation ASG(ENO,PNO,RESP,DUR)
can be expressed as
SELECT ENAME, PNO, RESP
FROM
SYSAN, ASG
WHERE SYSAN.ENO = ASG.ENO
Mapping a query expresed on views into a query expressed on base
relations can be done by query modification at compile time.
The preceding can be modified to
SELECT ENAME, PNO, RESP
FROM
EMP, ASG
WHERE EMP.ENO = ASG.ENO
AND
TITLE = ``Syst. Anal.´´
P.Brezany
ENAME
PNO
RESP
M.Smith
M.Smith
B.Casey
J.Jones
P1
P2
P3
P4
Analyst
Analyst
Manager
Manager
Institut für Scientific Computing – Universität Wien
24
Views in Distributed DBMSs
• The definition of a view is similar as in centralized systems –
distinction, a view may be derived from fragmented relations stored
at different sites.
• View definitions can be centralized at one site, partially duplicate, or
fully duplicated.
• Processing a query expressed on views: distributed query processor.
• Views derived from distributed relations may be costly to evaluate.
• Since in a given organization it is likely that many users access the
same views, some proposals have been made to optimize view
derivation.
• An alternative solution is to avoid view derivation by maintaining
actual versions of the views – snapshots. A snapshot represents a
particular state of the database and is therefore static, meaning
that it does not reflect updates to base relations.
P.Brezany
Institut für Scientific Computing – Universität Wien
25
Data Security
• Data security includes 2 aspects:
– Data protection. It is required to prevent unauthorized users from
understanding the physical content of data.
– Authorization control. It must guarantee that only authorized
users perform operations they are allowed to perform on DB.
» Checking whether a given triple (user, operation, object) can
be allowed to proceed.
» The introduction of a user: (user name, password). The user
name uniquely identifies the users of that name in the system,
while the password, known only to the users of that name,
authenticates the users.
• Distributed authorization control
– Additional complexity: objects and subjects are distributed
» remote user authentication
» management of distributed authorization rules
» handling of views and of user groups (DB administration is
simplified)
P.Brezany
Institut für Scientific Computing – Universität Wien
26
Semantic Integrity Control
How to guarantee database consistency?
• A DB state is said to be consistent if the DB satisfies a
set of constraints, called semantic integrity constraints.
• Mantaining a consistent DB requires various mechanisms:
concurrency control, reliability, protection, and semantic
integrity control (SIC).
• SIC ensures DB consistency by rejecting update
programs which lead to inconsistent DB states, or by
activating specific actions on the DB state, which
compensate for the effects of the update programs.
• SI constraints are rules that represent the knowledge
about the properties of an application.
– Structural constraints (e.g. unique key constraints in the rel. model)
– Behavioral constraints regulate the application behavior.
P.Brezany
Institut für Scientific Computing – Universität Wien
27
Centralized Semantic Integrity Control
• The language for expression and manipulating integrity
assertions. Examples:
– Employee number in relation EMP cannot be null: ENO NOT NULL in EMP
– The pair (ENO,PNO) is the unique key in relation ASG: (ENO,PNO) UNIQUE
IN ASG
• Enforcement mechanism that performs specific actions to
enforce DB integrity and updates. Examples:
– The query for increasing the budget of the CAD/CAM project by 10%, which
would be specified as
UPDATE
PROJ SET BUDGET = BUDGET * 1.1
WHERE
PNAME = ``CAD/CAM´´
will be transformed into the following query in order to enforce the domain
constraint:
CHECK ON PROJ (BUDGET >= 500000 AND BUDGET <= 1000000)
UPDATE
P.Brezany
PROJ
SET BUDGET = BUDGET * 1.1
WHERE
PNAME = ``CAD/CAM´´ AND NEW.BUDGET >= 500000
AND NEW.BUDGET <= 1000000)
Institut für Scientific Computing – Universität Wien
28
Distributed Semantic Integrity Control
• Individual assertions. They refer only to tuples to
be updated independently of the rest of the
database. Example: CHECK ON PROJ (BUDGET >= 500000 AND
BUDGET <= 1000000). The assertion definition is sent to all other
sites that contain fragments of the relation involved in the assertion.
The assertion must be compatible with the relation data at each site.
Compatibility can be checked at two levels: predicate and data. First,
predicate compatibility is verified by comparing the assertion
predicate with the fragment predicate. An asertion C is not
compatible with a fragment predicate p if „C is true“ implies that „p
is false,“ and is compatible with p otherwise. If noncompatibility is
found at one of the sites, the assertion definition is globally rejected
because tuples of that fragment do not satisfy the integration
constraints. Second, if predicate compatibility has been found, the
assertion is tested against the instance of the fragment. If it is not
satisfied by that instance, the assertion is also globally rejected. If
compatibility is found, the assertion is stored at each site.
P.Brezany
Institut für Scientific Computing – Universität Wien
29
Distributed Individual Assertions: Example
Consider relation EMP, horizontally fragmented across three
sites using the predicates
p1 : 0 <= ENO < „E3“
p2: „E3“ <= ENO <= „E6“
p3: ENO > „E6“
and the domain assertion C: ENO < „E4“. Assertion is
compatible with p1 (if C is true, p1 is true) and p2 (if C
is true, p2 is not necessarily false), but is not with p3
(if C is true, then p3 is false).
Therefore, assertion C should be globally rejected because
the tuples at site 3 cannot satisfy C, and thus relation EMP
does not satisfy C.
P.Brezany
Institut für Scientific Computing – Universität Wien
30
Set-Oriented Assertions
• They include single-relation multivariable constraints such
as functional dependency (Example 1) and multirelation
multi-variable constraints such as foreign key constraints
(Example 2)
• Example 1: The employee number functionally determines
the employee name.
ENO IN EMP DETERMINES ENAME
• Example 2: The project number PNO in relation ASG is a
foreign key matching the primary key PNO of relation
PROJ. In other words, a project referred to in relation
ASG must exist in relation PROJ.
PNO IN ASG REFERENCES PNO IN PROJ
P.Brezany
Institut für Scientific Computing – Universität Wien
31
Assertions Involving Aggregates
• Example: The total duration for all employees in the
CAD/CAM project is less than 100.
CHECK ON g:ASG, j:PROJ (SUM(g.DUR WHERE g.PNO=j.PNO) < 100
IF j.PNAME=``CAD/CAM´´)
• These assertions are among the most costly to test
because they require the calculations of the aggregate
functions.
P.Brezany
Institut für Scientific Computing – Universität Wien
32