MULTI-LAYERED FRAMEWORK FOR DISTRIBUTED DATA MINING
Download
Report
Transcript MULTI-LAYERED FRAMEWORK FOR DISTRIBUTED DATA MINING
MULTI-LAYERED SOFTWARE
SYSTEM FRAMEWORK FOR
DISTRIBUTED DATA MINING
Masum Serazi, Amal Perera,
Qiang Ding, Vasiliy Malakhov,
William Perrizo
North Dakota State University
Computer Science Department
Outline
Introduction to Distributed Data Mining
•
•
•
Importance of a Layered Architecture
A Prototype System
•
•
Demands
Existing Projects
Architecture
System Architecture (layered)
•
•
•
Server
Communication
Client & GUI
System Characteristics
Conclusion
Demands of distributed data
mining
Large dataset size
Diversity of data
Geographic distribution of users and
resources
Computationally intensive result
generation
Large scale distributed data
mining project
Kensington project
• Mining enterprise data distributed across the internet.
Papyrus project
• Based on mobile agents implemented using java.
PaDDMAS
• A component based tool set that integrates pre-developed
or custom packages
JAM
• Agent based distributed system that has been developed to
mine stored in different sites.
BODHI
• Collective data mining with stress on the learning vertically
partitioned data.
Architecture
Client-Server
• Advantage:
• Able to use high performance computing on the
server side to do the data mining.
Agent based
Hybrid
Importance of a Layered
Architecture
Layered framework helps to manage
complexity.
Provides the flexibility to add/remove/modify
layer and components of a layer
Allows for a better tracking of progress of large,
complex projects.
Human input is required to tune the data and
the algorithms to suite the need (Mix of
greyware versus software can be changed over
time).
System Architecture
DataMIMETM
developed as proofof-concept.
Based on patent
pending, “P-tree
technology”
Efficient and
scalable system.
Flexible plug-ins.
Conceptual view of
the system
Mine on
DataMIME™
Capture
dataset to
DataMIMET
M
Integrate data
(synchronize to
existing)
Internet
System
performanc
e ananlysis
Client Side
Server Side
One of the
Slave Servers
Master Server
Server Architecture
Data capture and
integration layer (DCI/DII)
Data mining interface
(DMI)
Distributed Ptree
Management Interface
(DPMI)
•
Room for
new feeder
Already
Plugged
Algorithm
Plugs for
new
algorithms
DMA Layer
DCI/DII Layer
DMI Layer
Uniform data structure
Data mining algorithms
(DMA)
Client-server
communication
Client interface
DPMI: Distributed Ptree Management Interface
Distributed
Ptree
database
The Distributed P-tree Database
The DPD collects all data in
vertical format (as opposed to
the ubiquitous horizontal
(record-based) data structure
used in DBMSs), as Predicatetrees (P-trees) based on the
patent pending P-tree
technology).
P-trees can be 0-dimensional,
1-dimensional, 2-dimensional,
etc.
Next slide shows the detailed
construction of 1-D P-trees
from a generic horizontal table
of data.
Room for
new feeder
Already
Plugged
Algorithm
Plugs for
new
algorithms
DMA Layer
DCI/DII Layer
DMI Layer
DPMI: Distributed Ptree Management Interface
Distributed Ptree database
(DPD)
Current practice: Structure data into
horizontal records. Process vertically (scans)
R(A1
Horizontally
structured
records
Scanned
vertically
2
6
2
2
5
2
7
7
Predicate tree technology: vertically project each attribute,
then vertically project each bit position of each attribute,
then compress each bit slice into a basic Ptree.
e.g., compression of R11 into P11 goes as follows:
R[A1] R[A2] R[A3] R[A4]
A2 A 3 A4 )
7
7
7
7
2
2
0
0
6
6
5
5
1
1
1
1
1
0
1
7
4
5
4
4
=
010
011
010
010
101
010
111
111
111
111
110
111
010
010
000
000
110
110
101
101
001
001
001
001
001
000
001
111
100
101
100
100
0
0
0
0
1
0
1
1
R11
pure1? false=0
pure1? true=1
Top-down construction
of the 1-dimensional
Ptree representation of
R11, denoted, P11, is built
by recording the truth of
the universal predicate
“pure 1” in a tree
recursively on halves,
until purity is achieved.
pure1? false=0 pure1? false=0
pure1? false=0
010
011
010
010
101
010
111
111
111
111
110
111
010
010
000
000
110
110
101
101
001
001
001
001
001
000
001
111
100
101
100
100
R11 R12 R13 R21 R22 R23 R31 R32 R33
0
0
0
0
1
0
1
1
1
1
1
1
0
1
1
1
0
1
0
0
1
0
1
1
1
1
1
1
0
0
0
0
1
1
1
1
1
1
0
0
1
1
0
1
0
0
0
0
1
1
1
1
0
0
0
0
1
1
0
0
0
0
0
0
0
0
1
1
1
1
1
1
R41 R42 R43
0
0
0
1
1
1
1
1
0
0
0
1
0
0
0
0
1
0
1
1
0
1
0
0
Horizontally AND basic Ptrees
1. Whole is pure1? false 0
P11 P12 P13 P21 P22 P23 P31 P32 P33 P41 P42 P43
2. Left half pure1? false 0
P11 And it’s pure0000so00 10 0 00 0 1 00 1 00 0 00 1 00 0 00 0 01 0 01 0 00 00 0
3. Right half pure1? false 0
10 10
10 01
01 01
0100
0
01
1 01 0001
branch ends 10
0
^ 01
^ 10
^
^
^ 01
^
^^
^
^ 10
01
01
01
01
10
4. Left half of rt half ? false0
7 0 1 4
0 0
5. Rt half of right half? true1
01
To count occurrences of 7,0,1,4 use pure111000001100: 0 23-level
6. Lf half of lf of rt? true1 But it is pure
1
10
P11^P12^P13^P’21^P’22^P’23^P’31^P’32^P33^P41^P’42^P’43 = 0 0 22-level =2
(pure0) so this
^ 21-level
01
7. Rt half of lf of rt? false0 branch ends
2-D P-tree Data Structure
m
11
11
11
11
11
11
11
01
11
11
11
11
11
11
11
11
11
10
11
11
11
11
11
11
00
00
00
10
11
11
11
11
1
m
m
m 0 1 m
1 1 m 1
1
1 1 1 0 0 0 1 0 1 1 0 1
Peano or Z-ordering
Pure (Pure-1/Pure-0/Mixed)
quadrants
Root Count (count of 1s in
the tree)
Provides an efficient format for
ANDing, ORing and
Complementing.
Lossless, compressed, countcomputation_ready
representations.
DCI/DII (Data Capture and Data
Integration Interface) layer
Allows user to capture and
to integrate data to the
required format (p-tree
format).
The main component of
this layer is the feeder.
An individual feeder can
process a particular format
of incoming data.
User can write his own
feeder and plug it very
easily in this architecture.
DCI/DII Layer
DMI (Data Mining Interface)
layer
DMI does counting, the most important
operation for data mining provided by
P-trees, including:
• basic P-trees
• value P-trees
• tuple P-trees
• Interval P-trees
• Cube P-trees
DMI also provide the P-tree algebra,
which has four operations:
• AND
• OR
• NOT (complement) and
• XOR, to implement the point wise
logical operations on P-trees for
(Data Mining Algorithms) DMA.
D
M
I
Distributed Ptree Management
Interface (DPMI) Layer
The DPMI layer provides:
• access
• location
• and concurrency transparency
by hiding the fact that:
• data representation may differ
• resources may be located in different places
• resources may be shared by several competitive
users.
By resource we meant data and its converted form
Ptree.
DMA (Data Mining Algorithms)
layer
This layer is a collection of data
mining tools (algorithms).
Upon receiving a request from
the client side an algorithm will be
fired up for mining.
This layer depends on the DMI
for accessing meta-info and
required counts needed in:
•
•
•
•
Ptree based K Nearest Neighbor
PKNN
Podium Incremental Neighbor
Evaluator PINE
P-BAYESIAN
Etc.
The architecture has the flexibility
to plug-in any new algorithm on
this layer.
DMA Layer
Communication
The communication between different layers is designed
in such a way that it minimizes the data flow over the
network.
In the DCI and the DMA communication protocols a client
will create a connection, send a request, receive a
response and close the connection. A client will send only
one request in a single threaded connection. The
response for a request is a line with a message indicating
the outcome of the request.
A DMA protocol request has a similar structure : header
and an optional set of binary files with checksums. The
header in the DMA protocol is a set of key / value pairs
(properties. Response to the DMA protocol request also
contains key / value pairs.
Client Structure
The two main functionalities are:
• Capture: Which sends
datasets along with their
meta information
(description of the data) to
the DII/DCI layer of the
server for capturing.
Meta
Data
•
Mining: This sends
requests to the DMA layer
for applying data mining
applications on previously
captured datasets and the
presentation of the results.
Prediction
Model
Data
Unclassified
data
DMA
Client side DMA
Client Side DCI
DCI
Data
Meta-data
generator
Client Side DCI
Visualization Tool
Client and GUI
Data Capturing
Data Mining
In the client side DataMIMETM has a graphical user interface (GUI) to
visually interact with a user (http://midas.cs.ndsu.nodak.edu/~datasurg/datamime )
System Characteristics
Ability to handle formatted record-based, relational-like
data with numerical and/or categorical attributes. The
data could be in text format, relational format, or TIFF
image format.
Easy conversion from any other machine readable format
can be provided through customized data feeders.
Users can do any data analysis and mining on data sets
in the system, or on any new data they capture or
integrate into the system.
Capable of handling large quantities of data and mines
them in scalable time.
Clients of the system can run on UNIX and Microsoft
Windows platform with the server designed to be a UNIXbased system.
System Characteristics (cont.)
Supports major RDBMS platforms.
The server engine can be run on a single machine or
distributed across multiple computers for better scalability
and efficiency.
The system has an open architecture provides high
degree of software extensibility and integration
capabilities.
The system provides high level of asynchronous
background operations, performing most data intensive
operations in the background or offline and allowing users
to continue their work.
The system minimizes the flow of data across the
network.
Conclusion
We have shown the importance of having a layered
architecture for a distributed data mining system.
Key elements were identified in deciding on the different
layers.
Able to identify a unique efficient vertical data structure at
the lowest layer that can take advantages of the latest
hardware.
To facilitate the data distribution a management layer is also
recognized.
Two other layers are defined: data capture and data mining
layer.
A prototype system was developed as a proof-of-concept to
show the feasibility of the approach.