Data Mining Engineering

Download Report

Transcript Data Mining Engineering

Databases and the Grid
Peter Brezany
Institute für Scientific Computing
University of Vienna
Diplomarbeit: Mobile Computing
(Dr. Karin Hummel)
P.Brezany
Institute for Software Science – University of Vienna
1
Introduction
• This lecture part outlines how databases can be integrated into the
Grid.
• Most existing and proposed applications are file-based.
• Consequently, there has been little work on how databases can be
made available on the Grid for access by distributed applications.
• If the Grid is to support a wider range of applications, then
database integration into the Grid will become important. E.g. many
applications in the life and earth sciences, and many business
applications are heavily dependent on databases.
• The database integration into the Grid is not possible just by
adopting or adapting the existing Grid services that handle files, as
databases offer much richer operations (e.g. queries and transactions), and there is much greater heterogeneity between different
database management systems than there is between different file
systems.
• There are major differences between database paradigms (e.g.
object and relational) – within one paradigm different database
products (e.g. Oracle and DB2) vary in their functionality and
interfaces.
P.Brezany
Institute for Software Science – University of Vienna
2
Introduction (2)
• Existing DB systems do not offer Grid integration. They are
however result of many hundreds of person-years of effort
that allows them to provide a wide range of functionality,
valuable programming interfaces and tools, and important
properties such as security, performance, dependability
(Verlässlichkeit), and (new development) autonomous behaviour.
• Because the above attributes will be required by Grid
applications, scientists strongly believe that building new Gridenabled database management systems (DBMSs) from scratch
is both unrealistic and a waste of effort.
• Instead we must consider how to integrate existing DBMSs
into the Grid.
P.Brezany
Institute for Software Science – University of Vienna
3
Basic Grid Database Requirements
• If database data is to be accessible in Grid applications, then a
Database System must support the relevant, existing, existing and
emerging, Grid standards; for example the Grid Security Infrastructure
(GSI). These standards should bear in mind the requirements of
databases.
• A typical application may consist of a computation that queries one or
more databases and carries out further analysis on the retrieved data.
If the compute and database systems all conform to the same
standards for security, accounting, performance monitoring and
scheduling etc. then the task of building such applications will be
greatly simplified.
• Note: A Database System (DBS) is created, using a DBMS, to manage a
specific database. The DBS includes any associated application software. Many
Grid applications will need to utilize more than one DBS. To reduce the effort
required to achieve this, federated databases use a layer of middleware running
on top of autonomous databases, to present applications with some degree of
integration.
• Other requirements of applications on Grid-enabled DBMS: scalability,
handling unpredictable usage, metadata-driven access, multiple database
federation, etc.
P.Brezany
Institute for Software Science – University of Vienna
4
Integrating Databases into the Grid
Two possible approaches:
• Grid-enabled version of JDBC/ODBC - The core set
of functionality offered by JDBC/ODBC does not
include a number of operations to fulfill Grid
database requirements.
• Service-based approach shown in the next slide,
with a service wrapper placed between the Grid and
the DBS
P.Brezany
Institute for Software Science – University of Vienna
5
Integrating Databases into the Grid (2)
P.Brezany
Institute for Software Science – University of Vienna
6
Federating Database Systems Across the Grid
P.Brezany
Institute for Software Science – University of Vienna
7
Grid Database Access
P.Brezany
Institute for Software Science – University of Vienna
8
Grid Database Access With OGSA-DAI
GDS gets a query via
Perform Document
GDS Engine process
specified activities
GDS returns results
P.Brezany
Institute for Software Science – University of Vienna
9
Grid Database Access With OGSA-DAI
P.Brezany
Institute for Software Science – University of Vienna
10
Grid Data Mediation Service (our own
development): Federation of Databases Architecture
P.Brezany
Institute for Software Science – University of Vienna
11
GDMS – Example Scenario
• Heterogeneities:
– Name in A is „Alexander Wöhrer“
– Name in C has to be combined
• Distribution:
– 3 data sources
P.Brezany
Institute for Software Science – University of Vienna
12
Literature Sources
http://www.cs.man.ac.uk/grid-db/
http://www.gridminer.org/
P.Brezany
Institute for Software Science – University of Vienna
13
Distributed Query Processing on the Grid
A representative query over bioninformatics resources is
used as an example.
The query accesses 2 databases: (1) the Gene Ontology
database GO (www.geneontology.org) stored in a MySQL (www.
mysql.com) RDBMS; and (2) GIMS, a genome database running
on a Polar parallel object DB server.
The query also calls a local installation of the BLAST sequence
similarity program (www.ncbi.nlm.nih.gov/BLAST/) which, given
a protein sequence, returns a set of structures containing
protein Ids and similarity scores.
The query identifies proteins that are similar to the human
proteins with the GO term 8372.
P.Brezany
Institute for Software Science – University of Vienna
14
Distributed Query Processing on the Grid
select struct(A:c.proteinID, B:Blast(c.sequence))
from c in proteins, d in proteinTerms
where d.termID="8372" and c.proteinID=d.proteinID;
•
P.Brezany
Institute for Software Science – University of Vienna
15
Distributed Query Processing on the Grid
In the query, protein is a class extent in GIMS, while
proteinTerm is a table in GO. Therefore, as illustrated in
the figure, the infrastructure initiates 2 subqueries: one on
GIMS, and the other on GO.
The results of these subqueries are then joined in a
computation running on the Grid. Finally, each protein in
the result is used as a parameter to the call to BLAST.
One key opportunity created on the Grid is the flexibility it
offers on resource allocation decisions. In our example,
machines need to be found to run both the join operator,
and the operation call.
If it is a danger that the join will be the bottleneck in the
query, than it could be allocated to a system with large
amounts of main memory so as to reduce IO costs
asociated with the management of intermediate results.
P.Brezany
Institute for Software Science – University of Vienna
16
Distributed Query Processing on the Grid
Further, a parallel algorithm could be used to implement
the join, so a set of machines acquired on the Grid could
each contribute to its execution.
Similarly, the BLAST calls could be speeded-up by
allocating a set of machines, each of which can run
BLAST on a subset of the proteins.
The information needed to make these resource allocation
decisions comes from 2 sources: (1) the query optimizer
estimates the cost of executing each part of a query and
so identifies performance critical operations; (2) the
Globus Grid infrastructure provides information on
available resources.
P.Brezany
Institute for Software Science – University of Vienna
17