Data Mining Engineering

Grids, Grid Technologies and
Data Mining
Peter Brezany
Institut für Softwarewissenschaft
Universität Wien
E-mail : [email protected]
P.Brezany
Institut für Softwarewissenschaft - Universität Wien
1
Grid and Grid Technologies
Grid computing has emerged as an important field, distinguished
from conventional distributed computing by its focus on large-scale
resource sharing, innovative applications, and, in some
cases, high-performance orientation.
The Grid itself is intended to connect computing resources over
wide-area networks.
Internet computing and Grid technologies promise to change the
way we tackle complex problems. Harnessing these new technologies
effectively will transform scientific disciplines ranging from
high-energy physics to the life sciences.
The Grid research field can be further divided into two subdomains:
- Computational Grid: a natural extension of earlier cluster computing
- Data Grid: efficient management, placement, and replication of large
amounts of data; once data are in place, computational tasks can be run.
Data Mining on (Data) Grids
• Data mining on the Grid (DMG): finding data patterns in
an environment with geographically distributed data and
computation, that is, an environment with specialized data
management, data placement, and data replication.
• A good DMG algorithm analyzes data in a distributed
fashion with modest data communication overhead.
• A typical DMG algorithm involves local data analysis
followed by the generation of a global data model.
• Huge data volumes are involved – high performance I/O
needed.
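The pattern "local data analysis followed by the generation of a global data model" can be sketched as follows. This is a minimal illustration, not the algorithm used by any particular DMG system: the site names, the `LocalSummary` structure, and the choice of mean and variance as the "model" are all assumptions made for the example. The point it shows is that only small summaries cross the network, never the raw records.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class LocalSummary:
    """Sufficient statistics computed at one site (hypothetical structure)."""
    n: int            # number of local records
    total: float      # sum of local values
    total_sq: float   # sum of squared local values


def analyze_locally(values: Iterable[float]) -> LocalSummary:
    """Local data analysis: runs where the data lives and returns a tiny summary."""
    n, total, total_sq = 0, 0.0, 0.0
    for v in values:
        n += 1
        total += v
        total_sq += v * v
    return LocalSummary(n, total, total_sq)


def build_global_model(summaries: List[LocalSummary]) -> Dict[str, float]:
    """Global model: merge per-site summaries without moving any raw data."""
    n = sum(s.n for s in summaries)
    if n == 0:
        raise ValueError("no data at any site")
    total = sum(s.total for s in summaries)
    total_sq = sum(s.total_sq for s in summaries)
    mean = total / n
    variance = total_sq / n - mean * mean  # population variance
    return {"n": n, "mean": mean, "variance": variance}


# Three hypothetical sites, each holding its own partition of the data.
site_data = {
    "site_a": [1.0, 2.0, 3.0],
    "site_b": [4.0, 5.0],
    "site_c": [6.0],
}
summaries = [analyze_locally(values) for values in site_data.values()]
model = build_global_model(summaries)
print(model)
```

Each site sends three numbers regardless of how many records it holds, which is what keeps the communication overhead modest in this sketch.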
Application Examples
• Determining how the emergence of hepatitis C depends on
weather patterns: this requires access to a large hepatitis C DB at one location
and an environmental DB at another location.
• Two major financial organizations want to cooperate. They need to
share data patterns relevant to the data mining task, but they do not
want to share the data itself because it is sensitive, so combining the
databases may not be feasible.
• A major multinational corporation wants to analyze its customer
transaction records to develop successful business
strategies quickly. It has thousands of establishments throughout the
world, and collecting all the data in a centralized data warehouse,
followed by analysis using existing commercial data mining
software, takes too long.
• Telemedical applications: see the next two slides.
Components of Telemedical Applications
[Figure: components of a telemedical application. Labels: Database (twice), Raw Medical Data, Derived Medical Data, Reconstructed Medical Data, Web.]
Telemedical Collaboration - Example
A patient living in a remote village has a heart problem.
An ECG is taken by the local doctor, and all the patient’s details
are stored in the doctor’s PC-based telemedical system.
MRI and CT scans are taken within different departments of a
general hospital and stored in the telemedical DB. A consultant
compiles a report and saves it in the DB.
If necessary, a 3D ultrasound scan is taken in a specialized clinic
and a further report is compiled.
If complicated surgery is required, an external specialist uses Virtual
Reality techniques to define how the surgery should be planned.
The resulting operation is recorded on video, e.g., for education.
→ Data mining support/assistance is needed.
Motivations and History
Grid Computing Concept
Enable communities (“virtual organizations”) to share
geographically distributed resources as they pursue
common goals, in the absence of central control,
omniscience, and trust relationships.
Grid Computing Concept (2)
The term "the Grid" was coined in the mid-1990s to denote
a proposed distributed computing infrastructure for science and
engineering.
The aim is coordinated resource sharing and problem solving in
dynamic, multi-institutional virtual organizations.
Resources: computers, files, data, sensors,
networks, laboratory equipment, etc.
Sharing is highly controlled, with resource providers and
consumers defining clearly and carefully just what is shared,
who is allowed to share, and the conditions under which sharing occurs.
A set of individuals and/or institutions defined by such sharing
rules forms a virtual organization (VO).
Grid Computing Concept (3)
Grid technologies complement rather than compete with existing
distributed computing technologies.
For example, CORBA focuses on enabling resource sharing within
a single organization, whereas Grid technologies focus on dynamic,
cross-organizational sharing.
Grid Communities and Applications:
Home Computers Evaluate AIDS Drugs
• Community =
– 1000s of home computer users
– Philanthropic computing vendor (Entropia)
– Research group (Scripps)
• Common goal = advance AIDS research
The Nature of Grid Architecture
A Grid architecture identifies fundamental system components,
specifies the purpose and function of these components, and
indicates how these components interact with one another.
Interoperability is the central issue to be addressed. In a
network environment, interoperability means common protocols.
The Grid architecture is first and foremost a protocol
architecture, with protocols defining the basic mechanisms by
which VO users and resources negotiate, establish, manage, and
exploit sharing relationships.
Standard protocols make it easy to define standard services
that provide enhanced capabilities, and to construct Application
Programming Interfaces and Software Development Kits.
The Nature of Grid Architecture (2)
Just as the Web revolutionized information sharing by providing
a universal protocol and syntax (HTTP and HTML) for
information exchange, so we require standard protocols and
syntaxes for general resource sharing.
A Grid protocol definition specifies
- how distributed system elements interact with one
another in order to achieve a specified behavior, and
- the structure of the information exchanged during this
interaction
The Nature of Grid Architecture (3)
A Grid service is defined solely by the protocol that it speaks
and the behaviors that it implements.
There are standard Grid services for:
- access to computation
- access to data
- resource discovery
- coscheduling (mechanisms for coordinating operations
across multiple resources)
- data replication, etc.
The definition of the above services allows us to enhance
the services offered to VO participants and also to abstract away
resource-specific details.
The Nature of Grid Architecture (4)
Why do we also consider Application Programming Interfaces
(APIs) and Software Development Kits (SDKs)?
There is more to VOs than interoperability, protocols and services.
Developers must be able to develop sophisticated applications in
complex and dynamic execution environments. Users must be able
to operate these applications.
Standard abstractions, APIs, and SDKs can accelerate code
development, enable code sharing, and enhance application
portability.
Summary:
identification and definition of 1. protocols → 2. services
→ 3. APIs and SDKs.
Grid Architecture
The architecture is organized into layers – see the next slide
Components within each layer share common characteristics
but can build on capabilities and behaviors provided by any lower
layer.
Resource and Connectivity protocols facilitate the sharing of
individual resources. They are designed so that they can be
implemented on top of a diverse range of resource types, defined
at the Fabric layer, and can in turn be used to construct a wide
range of global services and application-specific behaviors at the
Collective layer.
Layered Grid Architecture
(By Analogy to Internet Architecture)
Grid layers (top to bottom):
Application
Collective: “Coordinating multiple resources” (ubiquitous infrastructure services, app-specific distributed services)
Resource: “Sharing single resources” (negotiating access, controlling use)
Connectivity: “Talking to things” (communication via Internet protocols, and security)
Fabric: “Controlling things locally” (access to, and control of, resources)
Internet Protocol Architecture (top to bottom): Application, Transport, Internet, Link
Fabric: Interface to Local Control
The Grid Fabric layer provides the resources to which shared
access is mediated.
Fabric components implement the local resource-specific
operations that occur as a result of sharing operations at
higher levels.
At a minimum, resources should implement enquiry mechanisms
that permit discovery of their structure and state, and resource
management mechanisms that provide some control of delivered
quality of service.
Fabric: Interface to Local Control (2)
A resource-specific characterization of capabilities:
Computational resources: Mechanisms for starting programs and
for monitoring and controlling the execution of the resulting
processes.
Storage resources: Mechanisms for putting and getting files.
Enquiry functions for determining hardware and software
characteristics and information such as available space and utilization.
Network resources: Mechanisms that provide control over the
resources allocated to network transfers. Enquiry functions to
determine network characteristics and load.
Code repositories: Managing versioned source and object code.
Catalogs: Catalog query and update operations.
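As an illustration of the storage-resource capabilities listed above (put/get mechanisms plus enquiry functions), here is a small in-memory sketch. The class name `InMemoryStorageResource`, its method names, and the fields returned by `enquire` are hypothetical and chosen only for this example; real Fabric components expose these operations through resource-specific interfaces.

```python
from typing import Dict


class InMemoryStorageResource:
    """Toy storage resource: put/get mechanisms plus enquiry functions."""

    def __init__(self, capacity_bytes: int, description: str) -> None:
        self._capacity = capacity_bytes
        self._description = description
        self._files: Dict[str, bytes] = {}

    # --- mechanisms for putting and getting files ---
    def put(self, name: str, data: bytes) -> None:
        # Space already used by other files (an overwrite frees the old copy).
        used_by_others = self.used_bytes() - len(self._files.get(name, b""))
        if used_by_others + len(data) > self._capacity:
            raise OSError("not enough space on storage resource")
        self._files[name] = data

    def get(self, name: str) -> bytes:
        return self._files[name]

    # --- enquiry functions: structure and state of the resource ---
    def used_bytes(self) -> int:
        return sum(len(data) for data in self._files.values())

    def enquire(self) -> Dict[str, object]:
        used = self.used_bytes()
        return {
            "description": self._description,
            "capacity_bytes": self._capacity,
            "used_bytes": used,
            "free_bytes": self._capacity - used,
        }


store = InMemoryStorageResource(capacity_bytes=1024, description="demo disk")
store.put("scan-001.raw", b"\x00" * 100)
print(store.enquire())
```

Higher layers would call `enquire` before scheduling a transfer, which is the kind of discovery of structure and state that the Fabric layer is expected to support.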
Connectivity: Communicating Easily
and Securely
The Connectivity layer defines core communication and
authentication protocols required for Grid-specific
network transactions.
Communication protocols enable the exchange of data
between Fabric layer resources.
Authentication protocols build on communication
services to provide cryptographically secure
mechanisms for verifying the identity of users and
resources.
Connectivity (2)
Authentication solutions for VO environments should have the following
characteristics:
Single sign-on: Users must be able to "log on" (authenticate) just once
and then have access to multiple Grid resources defined by the Fabric
layer, without further user intervention.
Delegation: A user must be able to endow a program with the ability to
run on that user's behalf, so that the program is able to access the
resources on which the user is authorized.
Integration with various local security solutions: Grid security solutions
must be able to interoperate with various local security solutions.
User-based trust relationships: If a user has the right to use sites A and
B, the user should be able to use sites A and B together without
requiring that A's and B's security administrators interact.
Resource: Sharing Single Resources
The Resource layer defines protocols (and APIs and SDKs) for
secure initiation, monitoring, and control of sharing operations
on individual resources.
The primary classes of Resource layer protocols:
Information protocols are used to obtain information about the
structure and state of a resource, e.g., its configuration,
current load, and usage policy.
Management protocols are used to negotiate access to a shared
resource, specifying, for example, resource requirements and
the operations to be performed, such as process creation, or
data access. A protocol may support monitoring the status of
an operation and controlling (e.g., terminating) the operation.
Collective: Coordinating Multiple
Resources
The Collective layer contains protocols and services (and
APIs and SDKs) that are not associated with any
one specific resource but rather are global in nature
and capture interactions across collections of
resources. This layer can, e.g., implement:
Directory services allow VO participants to discover the
existence and/or properties of VO resources.
Co-allocation, scheduling, and brokering services allow VO
participants to request the allocation of one or more resources
for a specific purpose and the scheduling of tasks on the
appropriate resources.
Monitoring and diagnostics services support the monitoring of VO
resources for failure, adversarial attack ("intrusion
detection"), overload, and so forth.
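The three Collective services on this slide can be pictured with a toy example: a directory that answers property queries, a broker that picks a resource for a request, and a monitor that flags overload. All names here (`ResourceRecord`, `DirectoryService`, `broker`, `overloaded`) and the "least loaded wins" policy are assumptions made for the sketch, not part of any Grid toolkit.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ResourceRecord:
    """What the directory knows about one VO resource (hypothetical fields)."""
    name: str
    cpus: int
    load: float          # fraction of capacity in use, 0.0 to 1.0
    healthy: bool = True


class DirectoryService:
    """Lets VO participants discover resources by their properties."""

    def __init__(self) -> None:
        self._records: Dict[str, ResourceRecord] = {}

    def register(self, record: ResourceRecord) -> None:
        self._records[record.name] = record

    def all_records(self) -> List[ResourceRecord]:
        return list(self._records.values())

    def discover(self, min_cpus: int) -> List[ResourceRecord]:
        return [r for r in self._records.values()
                if r.healthy and r.cpus >= min_cpus]


def broker(directory: DirectoryService, min_cpus: int) -> Optional[ResourceRecord]:
    """Brokering: choose the least-loaded resource that satisfies the request."""
    return min(directory.discover(min_cpus), key=lambda r: r.load, default=None)


def overloaded(directory: DirectoryService, threshold: float) -> List[str]:
    """Monitoring: report resources whose load exceeds the threshold."""
    return sorted(r.name for r in directory.all_records() if r.load > threshold)


directory = DirectoryService()
directory.register(ResourceRecord("cluster_a", cpus=4, load=0.9))
directory.register(ResourceRecord("cluster_b", cpus=8, load=0.2))
directory.register(ResourceRecord("cluster_c", cpus=16, load=0.5, healthy=False))

chosen = broker(directory, min_cpus=4)
print(chosen.name if chosen else "no suitable resource")
print(overloaded(directory, threshold=0.8))
```

The broker never talks to a resource directly in this sketch; it works only from directory information, which mirrors the idea that Collective services are not tied to any one resource.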
Collective (2)
Data replication services support the management of VO storage (and
perhaps also network and computing) resources to maximize data
access performance with respect to metrics such as response time,
reliability, and cost.
Grid-enabled programming systems enable familiar programming models
to be used in Grid environments, e.g., Grid-enabled
implementations of the Message Passing Interface (MPI).
Software discovery services discover and select the best software
implementation and execution platform based on the parameters of the
problem being solved.
Community authorization servers enforce community policies governing
resource access.
Collaboratory services support the coordinated exchange of information
within potentially large user communities.
Applications
Applications are constructed in terms of, and by
calling upon, services defined at any layer.
Effective application development can often benefit
from the use of higher-level languages and
frameworks (e.g., the Common Component
Architecture, CORBA, etc.). These higher-level
systems can build on protocols, services, and APIs
provided within the Grid architecture.
Protocols, Services, and Interfaces
Occur at Each Level
[Figure: layer stack, top to bottom]
Applications
Languages/Frameworks
Collective Service APIs and SDKs
Collective Service Protocols
Collective Services
Resource APIs and SDKs
Resource Service Protocols
Resource Services
Connectivity APIs
Connectivity Protocols
Local Access APIs and Protocols
Fabric Layer
Data Grid
The need for Data Grids stems from the fact that scientific
applications like data analysis in High Energy Physics, climate
modeling or earth observation are very data intensive and a
large community of researchers all around the globe wants to
have fast access to the data.
Future Data Grid applications: Medical Grids and E-Business
Grids.
Grid Data Warehousing and Grid Data Mining – a new
challenging field.
Storage Model
2 different kinds of files:
• Master files (owned by their creators)
• Replica files. There may be many replicas of a master file.
Replicas are owned by, managed by, and may be deleted by,
the Grid.
The notion of replicas is new, and critical in a Grid
environment. Example:
• Before a DataGrid job can run at site A, data at site B may
need to be copied to site A.
• This data may then be used by subsequent jobs at site A, or
may be needed by jobs at site C, which has a better network
connection to site A than to site B. For this reason, the data
should be kept at site A as long as possible.
The ReplicaManager keeps track of all replica data so that the
replica selection service can select the optimal replica to use
for a given job, or request the creation of a new replica.
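The site A / B / C example can be traced with a small sketch of replica tracking and selection. `ReplicaCatalog`, `select_replica`, the file name, and the transfer-cost table are hypothetical names and numbers invented for illustration; the slides do not define how "optimal" is measured, so this sketch assumes a single transfer-cost figure per site pair.

```python
from typing import Dict, List, Tuple


class ReplicaCatalog:
    """Tracks which sites hold a copy of each logical file."""

    def __init__(self) -> None:
        self._sites: Dict[str, List[str]] = {}

    def register(self, logical_name: str, site: str) -> None:
        sites = self._sites.setdefault(logical_name, [])
        if site not in sites:
            sites.append(site)

    def locations(self, logical_name: str) -> List[str]:
        return list(self._sites.get(logical_name, []))


def select_replica(catalog: ReplicaCatalog, logical_name: str, job_site: str,
                   transfer_cost: Dict[Tuple[str, str], float]) -> str:
    """Pick the copy that is cheapest to read from the job's site."""
    sites = catalog.locations(logical_name)
    if not sites:
        raise LookupError(f"no replica registered for {logical_name}")

    def cost(site: str) -> float:
        # A local copy costs nothing; otherwise look up the (source, target) cost.
        return 0.0 if site == job_site else transfer_cost[(site, job_site)]

    return min(sites, key=cost)


# Assumed costs: C is better connected to A than to B.
transfer_cost = {("B", "A"): 5.0, ("A", "C"): 1.0, ("B", "C"): 5.0}

catalog = ReplicaCatalog()
catalog.register("events.dat", "B")                 # master file lives at site B

source_for_a = select_replica(catalog, "events.dat", "A", transfer_cost)
catalog.register("events.dat", "A")                 # the copy to A becomes a replica

source_for_c = select_replica(catalog, "events.dat", "C", transfer_cost)
print(source_for_a, source_for_c)
```

Keeping the replica at site A pays off in the second call: the job at site C reads from A rather than from the more distant master at B.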
SQLDatabaseService
This service efficiently stores, retrieves, and queries very
large amounts of metadata held in any type of local or remote
RDBMS. The database can be used for the implementation of
catalogs.
GridMiner
A Framework for Data Mining
on Grids
A new research field
Architecture of a Data Mining System
[Figure: layered architecture of a data mining system, top to bottom]
Graphical user interface
Pattern evaluation (uses the knowledge base)
Data mining engine (uses the knowledge base)
Database or data warehouse server
Data cleaning, data integration, and filtering
Database and data warehouse
Decomposition of a Knowledge
Discovery Process
Preprocessing
- data cleaning
- data transformation
- data reduction
Data mining (e.g., association rules)
- find frequent itemsets
- generate association rules
Evaluation of discovered patterns
Graphical User Interface
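The two data mining steps named above, finding frequent itemsets and generating association rules, can be sketched in a few lines. This is a small Apriori-style illustration on made-up transactions, not GridMiner code; the function names, thresholds, and item names are assumptions for the example.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Itemset = FrozenSet[str]


def frequent_itemsets(transactions: List[Itemset],
                      min_support: float) -> Dict[Itemset, float]:
    """Level-wise search for itemsets whose support is at least min_support."""
    n = len(transactions)
    if n == 0:
        return {}

    def support(itemset: Itemset) -> float:
        return sum(1 for t in transactions if itemset <= t) / n

    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    frequent: Dict[Itemset, float] = {}
    while current:
        level = {}
        for candidate in current:
            s = support(candidate)
            if s >= min_support:
                level[candidate] = s
        frequent.update(level)
        # Join: unite pairs of frequent k-itemsets that form a (k+1)-itemset.
        joined = {a | b for a, b in combinations(list(level), 2)
                  if len(a | b) == len(a) + 1}
        # Prune: every k-subset of a candidate must itself be frequent.
        current = [c for c in joined if all(c - {x} in level for x in c)]
    return frequent


def association_rules(frequent: Dict[Itemset, float],
                      min_confidence: float) -> List[Tuple[Itemset, Itemset, float]]:
    """Generate rules antecedent -> consequent from the frequent itemsets."""
    rules = []
    for itemset, supp in frequent.items():
        for k in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), k)):
                confidence = supp / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules


transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]
freq = frequent_itemsets(transactions, min_support=0.5)
rules = association_rules(freq, min_confidence=0.9)
for antecedent, consequent, confidence in rules:
    print(sorted(antecedent), "->", sorted(consequent), confidence)
```

In a Grid setting the support counts are the natural quantity to compute per site and then add up, since counts from disjoint partitions of the transactions sum to the global count.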
Our Philosophy
• Data mining systems can be decomposed into a set
of communicating components
→ distributed component architecture
• Placement of data-processing functionality is
critical.
• Grid data mining research is tightly coupled to the
ongoing work on parallel I/O for Grids
(e.g., the Armada project at Dartmouth College, USA)
Basic Grid Data Mining Models
1. Local data analysis followed by the generation
of a global data model – adapting distributed
data mining techniques. No data replication.
2. Data mining system components are optimally
located on the Grid. No dynamic data replication.
3. Data mining system components are optimally
located on the Grid. Dynamic data replication is
considered.
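For models 2 and 3, "optimally located" needs a cost measure, which the slides do not specify. The sketch below assumes the simplest one, total data volume shipped to the site hosting a component, weighted by a per-megabyte link cost. The function names, site names, sizes, and costs are all hypothetical.

```python
from typing import Dict, Iterable, Tuple


def placement_cost(host: str,
                   data_size_mb: Dict[str, float],
                   cost_per_mb: Dict[Tuple[str, str], float]) -> float:
    """Cost of shipping every site's data to the component running at host."""
    return sum(size * cost_per_mb[(site, host)]
               for site, size in data_size_mb.items()
               if site != host)


def best_host(candidates: Iterable[str],
              data_size_mb: Dict[str, float],
              cost_per_mb: Dict[Tuple[str, str], float]) -> str:
    """Place the component at the candidate site with the lowest transfer cost."""
    return min(candidates,
               key=lambda host: placement_cost(host, data_size_mb, cost_per_mb))


# Assumed scenario: site A holds most of the data, all links cost the same.
data_size_mb = {"A": 100.0, "B": 10.0, "C": 10.0}
sites = list(data_size_mb)
cost_per_mb = {(s, t): 1.0 for s in sites for t in sites if s != t}

host = best_host(sites, data_size_mb, cost_per_mb)
print(host, placement_cost(host, data_size_mb, cost_per_mb))
```

With dynamic replication (model 3) the same calculation would be repeated as replicas appear, because a new replica changes which sites already hold the data locally.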
Data Storage and the Components
[Figure: each of Site A, Site B, Site C, and Site D runs Preprocessing and Local DM on its own data; the figure also shows Construction of the Global Model, the GUI, and Site E.]