Overview of Distributed Data Mining
Xiaoling Wang
March 11, 2003
Data Mining
“We are drowning in information, but starving for knowledge.” - John Naisbitt
What is data mining?
– Closely related to knowledge discovery
– Discovering useful, usually unknown patterns from data
– Data: a set of facts F (e.g., cases in a database)
– Pattern: an expression E describing facts in a subset F_E of F
Goals of Data Mining
Goals
– Prediction
– Description
Domains
– Induction, Compression, Querying, Approximation, Search
Basic Techniques of Data Mining
Basic techniques
– Clustering
– Association rule discovery
– Classification
– Sequential pattern discovery
– Outlier detection
Data Warehouse Architecture
[Figure: layered architecture, top to bottom: data mining algorithm; data warehouse; data transformation & integration; multiple extractors; multiple data sources]
Distributed Data Mining Framework
[Figure: layered framework, top to bottom: final model; local model aggregation; multiple local models; multiple data mining algorithms; multiple data sources]
Distributed Data Source Definitions
Homogeneous
– All distributed data sites contain the same set of attributes
Heterogeneous
– The distributed data sites define different sets of attributes
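The two definitions can be illustrated on a toy table. This is a minimal sketch: the records, the attribute names, and the shared `id` key used to line up rows across heterogeneous sites are all made up for illustration.

```python
# Hypothetical toy table: each record maps attribute name -> value.
records = [
    {"id": 1, "age": 34, "income": 52000, "fraud": False},
    {"id": 2, "age": 51, "income": 87000, "fraud": True},
    {"id": 3, "age": 27, "income": 31000, "fraud": False},
    {"id": 4, "age": 45, "income": 64000, "fraud": True},
]

def split_homogeneous(rows, n_sites):
    """Every site keeps all attributes but only some of the records."""
    return [rows[i::n_sites] for i in range(n_sites)]

def split_heterogeneous(rows, attribute_groups, key="id"):
    """Every site keeps all records but only some attributes.

    The shared key column is an assumption of this sketch; it lets
    records at different sites be matched up later.
    """
    return [
        [{k: row[k] for k in (key, *group)} for row in rows]
        for group in attribute_groups
    ]

homogeneous_sites = split_homogeneous(records, 2)
heterogeneous_sites = split_heterogeneous(records, [("age",), ("income", "fraud")])
```

In the homogeneous split each site sees complete records, so a local model can be trained on its own; in the heterogeneous split no single site sees every attribute.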
Distributed Data Mining Techniques
Distributed classifier learning
– Meta-learning framework
– Distributed learning with knowledge probing
Collective data mining
Distributed clustering
Distributed association rule mining
Others
Meta-learning
Chan, Florida Institute of Technology & Stolfo, Columbia University
“base classifiers” and “meta-classifier”
Meta-learning rules: voting, arbitrating, and combining
Scalability, efficiency, portability, compatibility, adaptivity, extensibility, and effectiveness
For heterogeneous data sites, apply bridging methods
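Two of the rules named above, voting and combining, can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the threshold base learner, the toy one-dimensional data, and the lookup-table meta-classifier are all hypothetical stand-ins.

```python
from collections import Counter, defaultdict

def train_threshold_classifier(data):
    """Toy base learner (an assumption of this sketch): predict 1 when x is
    at or above the midpoint between the two class means. Expects both
    classes to be present and class 1 to have the larger mean."""
    def mean(values):
        return sum(values) / len(values)
    m0 = mean([x for x, y in data if y == 0])
    m1 = mean([x for x, y in data if y == 1])
    threshold = (m0 + m1) / 2
    return lambda x: int(x >= threshold)

def vote(base_classifiers, x):
    """Voting rule: return the label predicted by most base classifiers."""
    predictions = [clf(x) for clf in base_classifiers]
    return Counter(predictions).most_common(1)[0][0]

def train_combiner(base_classifiers, validation_data):
    """Combining rule: learn a meta-classifier from base predictions.
    Each meta-level training example pairs the tuple of base predictions
    on a validation point with that point's true label."""
    table = defaultdict(Counter)
    for x, y in validation_data:
        key = tuple(clf(x) for clf in base_classifiers)
        table[key][y] += 1

    def meta_classifier(x):
        key = tuple(clf(x) for clf in base_classifiers)
        if key in table:
            return table[key].most_common(1)[0][0]
        return vote(base_classifiers, x)  # unseen pattern: fall back to voting

    return meta_classifier

# Three homogeneous sites: same attribute, different records.
site_data = [
    [(1.0, 0), (2.0, 0), (6.0, 1), (7.0, 1)],
    [(0.5, 0), (3.0, 0), (5.0, 1), (9.0, 1)],
    [(2.0, 0), (4.0, 0), (8.0, 1), (9.0, 1)],
]
base = [train_threshold_classifier(d) for d in site_data]
validation = [(1.5, 0), (3.5, 0), (5.5, 1), (8.5, 1)]
final = train_combiner(base, validation)
```

Only the trained base classifiers and their predictions on the validation data reach the meta-level; the raw training data stays at each site.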
Meta-learning Framework
[Figure: meta-learning framework. Components: training data, learning algorithm, classifier, validation data, prediction, meta-level training data, meta-learning (arbitration and combining), final classifier system]
Distributed Learning with Knowledge Probing
Guo & Sutiwaraphun, Imperial College
Objective: distributed classification
Meta-learning based technique
Applied on homogeneous data sites
Knowledge probing: extract descriptive knowledge from a black-box model by learning from a new data set (the probing set) whose classes are assigned by that model
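Knowledge probing as described above can be sketched as follows. Everything concrete here is an assumption made for illustration: the `black_box` function stands in for an opaque model, the probing inputs are invented, and a single-threshold rule plays the role of the descriptive model.

```python
def black_box(x):
    """Stand-in for an opaque model, e.g. an ensemble of local classifiers
    (hypothetical; the thresholds are arbitrary)."""
    votes = [x >= 4.0, x >= 4.5, x >= 5.5]
    return int(sum(votes) >= 2)

def learn_stump(labeled):
    """Learn a readable rule 'x >= t -> class 1' with the fewest errors
    on the labeled probing data."""
    candidates = sorted({x for x, _ in labeled})

    def errors(t):
        return sum(int(x >= t) != y for x, y in labeled)

    return min(candidates, key=errors)

# Unlabeled probing inputs; the black box assigns their classes.
probing_set = [1.0, 2.5, 3.8, 4.2, 4.6, 5.0, 6.3, 7.9]
probed = [(x, black_box(x)) for x in probing_set]

# The descriptive model is learned from the probed labels, not from ground truth.
threshold = learn_stump(probed)
print(f"descriptive rule: class = 1 if x >= {threshold}")
```

The learned rule describes the black box only as well as the probing set covers its decision boundary, so the choice of probing data matters.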
DLKP (Cont.)
[Figure: DLKP framework. Components: data sources 1, 2, …, k; local model derivation at each source; local models 1, 2, 3; prediction scheme; probing set; probing strategy; final model]
Collective Data Mining (CDM)
Kargupta, University of Maryland & Park, Washington
State University
Objective: predictive data modeling
Applied to heterogeneous (vertically partitioned) data sites
Foundation: any function can be represented in a distributed fashion using an appropriate set of orthonormal basis functions
Example: Collective Principal Component Analysis
(CPCA)
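The foundation stated above can be written compactly. The symbols here (f for the function to be modeled, ψ_k for the basis functions, w_k for their coefficients, Ξ for the index set) are notation chosen for this sketch, not taken from the slides.

```latex
f(\mathbf{x}) = \sum_{k \in \Xi} w_k \, \psi_k(\mathbf{x}),
\qquad
\langle \psi_j, \psi_k \rangle = \delta_{jk}
\;\Longrightarrow\;
w_k = \langle f, \psi_k \rangle
```

Because the basis is orthonormal, each coefficient is an inner product of f with one basis function. A coefficient whose basis function involves only attributes held at one site can be estimated at that site, while coefficients for cross terms, which mix attributes from different sites, require some data to be brought together.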
CDM Framework
Step 1: Generate approximate orthonormal basis coefficients at each local site
Step 2: Move a chosen sample of the data from each site to a single site; generate the approximate basis coefficients corresponding to non-linear cross terms
Step 3: Combine the local models; transform the result into the user-specified representation; output the model
Distributed Clustering
Derived from parallel center-based clustering algorithms such as k-means
Applied on homogeneous scenarios
Two basic approaches
– Approximate the underlying distance measure by
aggregation
– Provide the exact measure by data broadcasting
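One way to cluster by aggregation, shown here for k-means, is for each site to send only per-cluster coordinate sums and counts to a coordinator. This is a minimal sketch with made-up data and hypothetical function names; it illustrates the idea of exchanging summaries instead of raw points and is not a specific published algorithm.

```python
def local_statistics(points, centroids):
    """At one site: assign each local point to its nearest centroid and
    return per-cluster coordinate sums and counts. No raw data leaves the site."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in points:
        nearest = min(
            range(len(centroids)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
        )
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += p[d]
    return sums, counts

def aggregate(all_stats, centroids):
    """At the coordinator: merge the site summaries into new centroids."""
    new_centroids = []
    for j, old in enumerate(centroids):
        total = sum(counts[j] for _, counts in all_stats)
        if total == 0:
            new_centroids.append(old)  # keep an empty cluster where it was
            continue
        new_centroids.append(tuple(
            sum(sums[j][d] for sums, _ in all_stats) / total
            for d in range(len(old))
        ))
    return new_centroids

def distributed_kmeans(sites, centroids, iterations=10):
    """Repeat: sites summarise locally, coordinator updates the centroids."""
    for _ in range(iterations):
        stats = [local_statistics(points, centroids) for points in sites]
        centroids = aggregate(stats, centroids)
    return centroids

# Two homogeneous sites holding 2-D points (toy data).
sites = [
    [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0)],
    [(1.0, 0.5), (9.0, 9.5), (8.5, 9.0)],
]
centroids = distributed_kmeans(sites, [(0.0, 0.0), (10.0, 10.0)])
```

Each round costs one message of k sums and k counts per site, regardless of how many points the site holds.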
Distributed Association Rule Mining
Two main approaches
– Count Distribution (CD)
• data is partitioned homogeneously across several data sites
– Data Distribution (DD)
• maximizing parallelism
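The Count Distribution idea can be sketched as: every site counts itemsets over its own transactions, and only the counts are summed globally. This is a simplified illustration with invented transactions and hypothetical function names; it enumerates all itemsets of a given size rather than generating candidates level by level as a full Apriori-style implementation would.

```python
from collections import Counter
from itertools import combinations

def local_counts(transactions, size):
    """At one site: count every itemset of the given size in local transactions."""
    counts = Counter()
    for t in transactions:
        for itemset in combinations(sorted(set(t)), size):
            counts[itemset] += 1
    return counts

def global_frequent(sites, size, min_support):
    """Sum the local counts (only counts are exchanged, not transactions)
    and keep itemsets whose global count reaches min_support."""
    total = Counter()
    for transactions in sites:
        total.update(local_counts(transactions, size))
    return {itemset: n for itemset, n in total.items() if n >= min_support}

# Two sites, each holding a partition of the transaction database (toy data).
sites = [
    [["bread", "milk"], ["bread", "beer", "milk"]],
    [["milk", "beer"], ["bread", "milk", "eggs"]],
]
frequent_pairs = global_frequent(sites, size=2, min_support=3)
```

Here no pair reaches the support threshold at either site alone; the pair becomes frequent only once the local counts are summed.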
Applications of Distributed Data Mining
Credit card fraud detection
Intrusion detection
Information retrieval from the Internet
Ad hoc sensor networks
Challenges of Distributed Data Mining
Real-time distributed data mining
Adapting to changing environments, new data, and new patterns