Data Mining Engineering

Parallel and Distributed Database Systems (Parallele und Verteilte Datenbanksysteme)
Univ.-Prof. Dr. Peter Brezany
Institut für Scientific Computing
Universität Wien
Tel. 4277 39425
Office hours: Tue, 13:00-14:00
Course portal: www.par.univie.ac.at/~brezany/teach/gckfk/300658.html
P.Brezany
Institut für Scientific Computing – Universität Wien
Motivation
[Figure: a "data and data exploration" cloud fed by business, medicine, scientific experiments, simulations, and Earth observations.]
The Knowledge Discovery Process
[Figure: cleaning and integration feed a data warehouse; selection and transformation, data mining, and evaluation and presentation then lead to knowledge. Further labels: OLAP queries, OLAP, online analytical mining.]
Data Preprocessing
[Figure: Fig. 3.1]
EcoGRID Sketch
[Figure labels: distributed data; distributed applications; distributed data mining; common ontology; biodiversity, waste, air, soil, emissions, water, forests, …; reporting, statistics, popular presentation, flow analysis, prediction models, geostatistics.]
Management of TBI patients
• Traumatic brain injuries (TBIs) typically result from accidents in which the head strikes an object.
• The treatment of TBI patients is very resource intensive.
• The trajectory of TBI patient management:
– Trauma event
– First aid
– Transportation to hospital
– Acute hospital care
– Home care
[Figure annotation: usage of mobile communication devices]
• All the above phases are associated with data collection into databases, which are currently managed by individual hospitals.
Data Mining Accuracy vs. Data Size
[Figure: accuracy (up to 100%) plotted against data size; marked on the data-size axis: sampled data size, available data size.]
The GridMiner Project in Vienna
• GridMiner: a knowledge discovery Grid infrastructure (http://www.gridminer.org/)
– OGSA-based architecture
– Workflow management
– Grid-aware data preprocessing and data mining services
– Data mediation service
– OLAP service
– GUI
– Current implementation on top of Globus Toolkit 3.2
• Applications: exploration of ecological data, management of patients with traumatic brain injuries
• Research exhibition available
Literature
See the course web page.
Distributed Memory Architecture
(Shared Nothing)
[Figure: four CPUs, each with its own local memory, connected by an interconnection network.]
DMM: Shared Disk Architecture
[Figure: four CPUs, each with its own local memory, connected by an interconnection network; all share a global shared disk subsystem.]
Shared Memory Architecture
(Shared Everything, SMP)
[Figure: four CPUs connected by an interconnection network to a global shared memory.]
Cluster of SMPs
[Figure: four 4-CPU SMP nodes connected by an interconnection network.]
High-Performance I/O Systems
Note: RAID technology is introduced in a separate set of lecture notes.
Principles of Distributed Database Systems
The main literature
Distributed Database System (DDBS) Technology – Introduction
• DDBS technology is the union of what appear to be two diametrically opposed approaches to data processing: database systems and computer network technologies.
• Database systems have taken us from a paradigm of data processing in which each application defined and maintained its own data (figure follows) to one in which the data is defined and administered centrally (figure follows). The result is data independence: application programs are immune to changes in the logical or physical organization of the data, and vice versa. One of the major motivations is the desire to integrate the operational data of an enterprise and to provide centralized, and thus controlled, access to that data.
DDBS – Introduction (cont.)
• The technology of computer networks promotes a mode of work that goes against all centralization efforts.
• How can these two contrasting approaches be synthesized to produce a technology that is more powerful and more promising than either one alone?
– The key is the realization that the most important objective of database technology is integration, not centralization. Neither of these terms necessarily implies the other.
– It is possible to achieve integration without centralization, and that is exactly what distributed database technology attempts to achieve.
Distributed Database System
Technology - Introduction
Central Database on a Network Example
[Figure: sites in Boston, Edmonton, Paris, and San Francisco connected by a communication network.]
Distributed Database System (DDBS) - Definitions
• Definition 1: Distributed database. A distributed database is a collection of multiple, logically interrelated databases distributed over a computer network.
• Definition 2: Distributed database management system (distributed DBMS). It is defined as the software system that permits the management of the distributed database and makes the distribution transparent to the users.
• A DDBS is not a „collection of files“ that can be individually stored at each node of a computer network. To form a DDBS, files should not only be logically related, but there should be structure among the files, and access should be via a common interface.
• The physical distribution of data is very important. It creates problems that are not encountered when the databases reside in the same computer system.
Promises of DDBSs
1. Transparent Management of Distributed and Replicated Data
• Transparency refers to the separation of the higher-level semantics of a system from lower-level implementation issues; a transparent system „hides“ the implementation details from the user.
• Example (next slide): Consider an engineering firm that has offices in several cities.
– It is preferable to localize the data so that data about the employees in the Edmonton office are stored in Edmonton, and so forth. The same applies to the project information. In this process we partition each of the relations and store each partition at a different site; this is known as fragmentation.
– It may be preferable to duplicate some of this data at other sites for performance and reliability reasons. The result is a distributed database that is fragmented and replicated. Fully transparent access means that users can still pose queries in the same form as to a centralized system, without paying any attention to the fragmentation, location, or replication of data, and let the system worry about resolving these issues.
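The fragmentation and replication idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Employee record, the sample names, the choice to replicate the Boston fragment in Paris, and the function names are all hypothetical and not taken from the lecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Employee:
    name: str
    city: str


# Hypothetical sample data; names and cities are illustrative only.
EMPLOYEES = [
    Employee("Ann", "Edmonton"),
    Employee("Bob", "Boston"),
    Employee("Chloe", "Paris"),
    Employee("Dan", "San Francisco"),
]


def fragment_by_city(rows):
    """Horizontal fragmentation: one fragment per office city."""
    fragments = {}
    for row in rows:
        fragments.setdefault(row.city, []).append(row)
    return fragments


# Each site stores its own fragment; Paris also holds a copy of Boston's
# fragment (an assumed replication choice).
FRAGMENTS = fragment_by_city(EMPLOYEES)
SITES = {city: {city: list(rows)} for city, rows in FRAGMENTS.items()}
SITES["Paris"]["Boston"] = list(FRAGMENTS["Boston"])


def query_employees(predicate):
    """Transparent access: the caller never names a site or a fragment."""
    seen, result = set(), []
    for site in SITES.values():
        for rows in site.values():
            for row in rows:
                if predicate(row) and row not in seen:  # skip replicated copies
                    seen.add(row)
                    result.append(row)
    return result
```

A caller writes query_employees(lambda e: e.city == "Boston") exactly as it would against a single table; the replicated Boston copy in Paris does not produce a duplicate row.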
Distributed Database System Environment - Example
[Figure: sites in Edmonton, Boston, Paris, and San Francisco connected by a communication network. Data labels shown next to the sites:]
Edmonton / Boston:
• Edmonton employees
• Boston employees
• Paris projects
• Paris employees
• Edmonton projects
• Boston projects
Paris / San Francisco:
• Paris employees
• San Francisco employees
• Paris projects
• San Francisco projects
• Boston employees
• Boston projects
Promises of DDBSs
2. Reliability Through Distributed Transactions
• Distributed DBMSs are intended to improve reliability since they have replicated components and thereby eliminate single points of failure.
• The failure of a single site, or the failure of a communication link that makes one or more sites unreachable, is not sufficient to bring down the entire system. In the case of a distributed database, this means that some of the data may be unreachable, but with proper care, users may be permitted to access other parts of the distributed database.
• The „proper care“ comes in the form of support for distributed transactions.
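The point that one site failure need not bring down the whole system can be illustrated with a small read-failover sketch. It is deliberately limited: it shows only reads from replicated copies and does not implement distributed transactions (no atomic commit, no recovery). The Site class, the SiteDown exception, and the sample key are hypothetical.

```python
class SiteDown(Exception):
    """Raised when a site cannot be reached."""


class Site:
    def __init__(self, name, data):
        self.name = name
        self.data = dict(data)
        self.up = True  # toggled to simulate a site or link failure

    def read(self, key):
        if not self.up:
            raise SiteDown(self.name)
        return self.data[key]


def read_with_failover(replicas, key):
    """Try each replica in turn; fail only if every copy is unreachable."""
    for site in replicas:
        try:
            return site.read(key)
        except SiteDown:
            continue
    raise SiteDown(f"no reachable replica holds {key!r}")


# Two hypothetical sites holding the same replicated item.
boston = Site("Boston", {"project:P1": "bridge design"})
paris = Site("Paris", {"project:P1": "bridge design"})
boston.up = False  # simulate the failure of one site
value = read_with_failover([boston, paris], "project:P1")
```

With Boston down, the read is served by the Paris copy. Keeping such copies consistent under updates is the part that requires distributed transaction support.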
Promises of DDBSs
3. Improved Performance
1. A distributed DBMS fragments the conceptual database, enabling data to be stored in close proximity to its points of use.
2. The inherent parallelism of distributed systems may be exploited for inter-query and intra-query parallelism.
• Inter-query parallelism results from the ability to execute multiple queries at the same time.
• Intra-query parallelism is achieved by breaking up a single query into a number of subqueries, each of which is executed at a different site, accessing a different part of the distributed database.
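Intra-query parallelism can be sketched as follows. This is a simplified illustration, assuming that threads stand in for sites and that the projects relation is fragmented by city; the fragment contents, function names, and budget threshold are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical fragments of a projects relation, one per site: (project, budget).
FRAGMENTS = {
    "Edmonton": [("P1", 100), ("P2", 250)],
    "Boston": [("P3", 400)],
    "Paris": [("P4", 50), ("P5", 300)],
}


def subquery(rows, min_budget):
    """Subquery run at one site: select projects with budget >= min_budget."""
    return [name for name, budget in rows if budget >= min_budget]


def parallel_query(min_budget):
    """Intra-query parallelism: one subquery per fragment, results merged."""
    with ThreadPoolExecutor(max_workers=len(FRAGMENTS)) as pool:
        futures = [
            pool.submit(subquery, rows, min_budget)
            for rows in FRAGMENTS.values()
        ]
        partials = [future.result() for future in futures]
    # Merge the partial results into one answer for the user.
    return sorted(name for part in partials for name in part)


result = parallel_query(250)
```

Inter-query parallelism would correspond to calling parallel_query (or any other query) several times concurrently, each call serving a different user request.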
Promises of DDBSs
4. Easier System Expansion
• In a distributed environment, it is much easier to accommodate increasing database sizes.
• Major system overhauls are seldom necessary; expansion can usually be handled by adding processing and storage power to the network.
• It may not be possible to obtain a linear increase in „power“, since this also depends on the overhead of distribution.
• It normally costs much less to put together a system of smaller computers with the equivalent power of a single big machine.
Problem Areas
• Distributed database design
• Distributed query processing
• Distributed directory management
• Distributed concurrency control
• Distributed deadlock management
• Heterogeneous databases
Distributed DBMS Architecture
• The architecture of a system defines its structure.
• This means that the components of the system are
identified, the function of each component is
specified, and the interrelationships and interactions
among these components are defined.
• In this part we classify DBMS architectures. These
are idealized views – many research and
commercially available systems may deviate from
them.
• We use a classification (next slides) that organizes
the systems as characterized with respect to (1)
the autonomy of local systems, (2) their
distribution, and (3) their heterogeneity.
Autonomy
• Autonomy refers to the distribution of control, not
of data. It indicates the degree to which individual
DBMSs can operate independently.
• Requirements of an autonomous system:
– The local operations of the individual DBMSs are not affected by
their participation in a multidatabase system.
– The manner in which the individual DBMSs process queries and
optimize them should not be affected by the execution of global
queries that access multiple databases.
– System consistency or operation should not be compromised when
individual DBMSs join or leave the multidatabase confederation.
Distribution
• Whereas autonomy refers to the distributed
control, the distribution dimension of the taxonomy
deals with data.
• There are a number of ways DBMSs have been
distributed. We abstract 2 alternative classes:
– client/server distribution
– peer-to-peer distribution (or full distribution)
Heterogeneity
• Heterogeneity may occur in different forms:
– hardware
– data models
– query languages
– transaction management protocols
Architecture Model
DBMS Architectures
• Client/server architecture (not relevant for this course)
• Distributed database architecture
• Multidatabase architecture
Client/Server Architecture
Here there is typically one central database server and a larger number of networked workstations that store no relevant data. The user at a workstation sees the full functionality of the DBMS. The system behaves like a centralized database system; the communication is transparent to the user.
Client/Server Architecture (cont.)
Distributed Database System
• Here there are several database servers; a given piece of data may be stored on only one machine or, replicated, on several.
• A virtual database whose components are physically mapped onto a number of different, actually existing DBMSs.
• In this case, transactions can span several DBMSs.
• A collection of data that
– belongs to the same system because of common, linking properties
– is distributed over different computers in the network
– where each computer has its own database
– and can carry out local tasks autonomously
Distributed Database System (cont.)
- Simultaneous use of the computing power of several computers
- The data-access bottleneck of centralized database systems is avoided because the data is distributed (and possibly replicated)
- The data is managed by one database system
- Distribution transparency
- Basis: four-level schema architecture
Review: ANSI/SPARC Architecture
[Figure: users work with several external views (external schema), which map to the conceptual view (conceptual schema), which in turn maps to the internal view (internal schema).]
The external view is concerned with how users view the database. An individual user's view represents the portion of the database that will be accessed by that user as well as the relationships that the user would like to see among the data. A view can be shared among a number of users.
The conceptual schema is an abstract definition of the database; it is the „real view“ of the enterprise being modeled in the database. The requirements of individual applications or the restrictions of the physical storage media are not considered.
The internal view deals with the physical definition and organization of data. The location of data on different storage devices and the access mechanisms used to reach and manipulate data are the issues dealt with at this level.
Distributed Database System (cont.)
[Figure: four-level schema architecture. External schemas 1 ... N are defined on the global conceptual schema; below it are the local conceptual schemas, each with its own local internal schema.]
Functional Schematic of an Integrated Distributed DBMS
• The global directory/dictionary (GD/D) permits the required global mappings.
• Local mappings are performed by a local directory/dictionary (LD/D).
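The division of labor between the global and local directories can be sketched as two lookup tables. This is a minimal illustration, not the structure of any particular system: the relation name EMP, the fragment names, the site assignments, and the local storage names are all hypothetical.

```python
# Hypothetical global directory (GD/D): global relation -> fragments and sites.
GLOBAL_DIRECTORY = {
    "EMP": [
        {"fragment": "EMP_EDM", "sites": ["Edmonton"]},
        {"fragment": "EMP_BOS", "sites": ["Boston", "Paris"]},  # replicated
    ],
}

# Hypothetical local directories (LD/D): fragment -> local storage name per site.
LOCAL_DIRECTORIES = {
    "Edmonton": {"EMP_EDM": "emp_local"},
    "Boston": {"EMP_BOS": "staff"},
    "Paris": {"EMP_BOS": "emp_boston_copy"},
}


def resolve(relation):
    """Map a global relation to (site, local name) pairs, one per fragment."""
    plan = []
    for entry in GLOBAL_DIRECTORY[relation]:
        site = entry["sites"][0]  # simplification: always pick the first copy
        local_name = LOCAL_DIRECTORIES[site][entry["fragment"]]
        plan.append((site, local_name))
    return plan
```

Here resolve("EMP") performs the global mapping (which fragments exist and where) and then the local mapping (what each fragment is called at its site). A real system would choose among replicas using cost and availability rather than always taking the first.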
Components of a Distributed DBMS
User processor
1. The user interface handler is responsible for interpreting user commands and formatting the result data.
2. The semantic data controller uses the integrity constraints and authorizations that are defined as part of the global conceptual schema to check whether the user query can be processed.
3. The global query optimizer and decomposer determines an execution strategy to minimize a cost function, and translates the global queries into local ones using the global and local conceptual schemas as well as the global directory.
4. The distributed execution monitor coordinates the distributed execution of the user request.
Data processor
1. The local query optimizer is responsible for choosing the best access path to access any data item. (The term access path refers to the data structures and algorithms that are used to access data. A typical access path is an index on one or more attributes of a relation.)
2. The local recovery manager is responsible for making sure that the local database remains consistent.
3. The run-time support processor physically accesses the database according to the physical commands in the schedule generated by the query optimizer.
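The flow of a request through the user processor components can be sketched as a chain of functions. This is a toy illustration under stated assumptions: the schema, the authorization list, the sample rows, and all function bodies are hypothetical; there is no cost-based optimization, and the data processor is reduced to a scan without a local optimizer or recovery manager.

```python
# Hypothetical global conceptual schema with authorizations, plus a directory.
GLOBAL_SCHEMA = {"EMP": {"attributes": {"name", "city"}, "readers": {"alice"}}}
DIRECTORY = {"EMP": ["Edmonton", "Boston"]}
LOCAL_DATA = {
    "Edmonton": [{"name": "Ann", "city": "Edmonton"}],
    "Boston": [{"name": "Bob", "city": "Boston"}],
}


def semantic_data_controller(user, relation, attributes):
    """Check authorization and that the query uses known attributes."""
    schema = GLOBAL_SCHEMA[relation]
    return user in schema["readers"] and set(attributes) <= schema["attributes"]


def global_optimizer_and_decomposer(relation, attributes):
    """Translate the global query into one local query per site (no costing)."""
    return [(site, attributes) for site in DIRECTORY[relation]]


def data_processor(site, attributes):
    """Stand-in for the data processor: scan the local data and project."""
    return [{a: row[a] for a in attributes} for row in LOCAL_DATA[site]]


def distributed_execution_monitor(local_queries):
    """Coordinate execution of the local queries and collect their results."""
    rows = []
    for site, attributes in local_queries:
        rows.extend(data_processor(site, attributes))
    return rows


def user_interface_handler(user, relation, attributes):
    """Interpret the request, run it through the pipeline, format the result."""
    if not semantic_data_controller(user, relation, attributes):
        return "rejected"
    local_queries = global_optimizer_and_decomposer(relation, attributes)
    rows = distributed_execution_monitor(local_queries)
    return "; ".join(",".join(str(row[a]) for a in attributes) for row in rows)
```

For example, user_interface_handler("alice", "EMP", ["name"]) returns the names collected from both sites, while a request from an unauthorized user or for an unknown attribute is rejected before any site is contacted.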
Multidatabase System
- A multidatabase system (MDBS) is a federation of several database systems.
- The conceptual schema represents only the part of the data that the local DBMSs want to share.
- Local applications can access each database system (DBS).
- Each DBS can contain data that has no relationship to the data of other DBSs.
Multidatabase System
[Figure: model with a global conceptual schema. Labels: global external schemas (GES), local external schemas (LES), global conceptual schema (GKS), local conceptual schemas LKS 1 ... LKS n, local internal schemas LIS 1 ... LIS n.]
Multidatabase System (cont.)
[Figure: model without a global conceptual schema. Multidatabase layer: external schemas ES 1, ES 2, ..., ES n. Local system layer: local conceptual schemas LKS 1, LKS 2, LKS 3 and local internal schemas LIS 1, LIS 2, LIS 3.]
Components of an MDBS
Directory Management Strategies - Alternatives