THE US NATIONAL VIRTUAL OBSERVATORY
P2P Data Mining
Kirk D. Borne
George Mason University
http://classweb.gmu.edu/kborne/
with H. Kargupta, S. Arora, K. Bhaduri, K. Das, Tushar, W. Griffin (UMBC),
and C. Giannella (Loyola)
IVOA Interop – Baltimore – October 2008
Topics
• Distributed vs. P2P Data Mining
• Science Use Cases
• P2P Data Mining Project Plans
• Current Design & Status
• IVOA GWS Standards
Distributed Data Mining (DDM)
• DDM comes in 2 types:
1. Distributed Mining of Data
2. Mining of Distributed Data
• Type 1 requires sophisticated algorithms
that operate with data in situ
• Type 2 takes many forms, with data being
centralized (in whole or in partitions) or
data remaining in place at distributed sites
• References: http://www.cs.umbc.edu/~hillol/DDMBIB/
– C. Giannella, H. Dutta, K. Borne, R. Wolff, H. Kargupta. (2006). Distributed Data Mining for Astronomy Catalogs.
Proceedings of the 9th Workshop on Mining Scientific and Engineering Datasets, as part of the SIAM International
Conference on Data Mining (SDM), 2006. [ http://www.cs.umbc.edu/~hillol/PUBS/Papers/Astro.pdf ]
– H. Dutta, C. Giannella, K. Borne, H. Kargupta. (2007). Distributed Top-K Outlier Detection from Astronomy
Catalogs using the DEMAC System. Proceedings of the SIAM International Conference on Data Mining, Minneapolis,
USA, April 2007. [ http://www.cs.umbc.edu/~hillol/PUBS/Papers/sdm07.pdf ]
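As a rough illustration of mining distributed data while the data remain in place, the sketch below has each site reduce its local partition to a three-number summary (count, sum, sum of squares) and ship only that summary to be merged. The site names and values are made up for the example, and this is not the DEMAC implementation.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    """Sufficient statistics for one site's local values."""
    n: int
    total: float
    total_sq: float


def summarize(values):
    """Runs at the data site; only the 3-number summary leaves the site."""
    return Summary(len(values), sum(values), sum(v * v for v in values))


def merge(summaries):
    """Combine per-site summaries into a global mean and population variance."""
    n = sum(s.n for s in summaries)
    total = sum(s.total for s in summaries)
    total_sq = sum(s.total_sq for s in summaries)
    mean = total / n
    variance = total_sq / n - mean * mean
    return mean, variance


# Hypothetical magnitudes held at three separate archives.
sites = {
    "archive_a": [14.0, 15.0, 16.0],
    "archive_b": [15.0, 17.0],
    "archive_c": [13.0],
}
global_mean, global_var = merge([summarize(v) for v in sites.values()])
```

The merged result equals what a centralized computation over all six values would give, but no raw catalogue rows are moved. The sum-of-squares formula is numerically naive and is used here only to keep the sketch short.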
P2P Data Mining
• P2P Data Mining represents one possible
implementation of DDM
• P2P has two types:
– Task-parallel :: the compute processes are
distributed across the nodes
– Data-parallel :: the data are distributed
across the nodes
• References: http://www.cs.umbc.edu/~hillol/DDMBIB/ddmbib_html/DistSys.html
– S. Bandyopadhyay, C. Giannella, U. Maulik, H. Kargupta, S. Datta, and K. Liu. (2006). Clustering distributed data streams
in peer-to-peer environments. Information Sciences, 176(14):1952-1985.
[ http://www.cs.umbc.edu/~hillol/PUBS/p2pDM.pdf ]
– K. Bhaduri, R. Wolff, C. Giannella, H. Kargupta. (2008). Distributed Decision Tree Induction in Peer-to-Peer
Systems. Statistical Analysis and Data Mining, Volume 1, Issue 2, pp. 85-103.
[ http://www.cs.umbc.edu/~hillol/PUBS/Papers/sam08_dtree_bhaduri.pdf ]
– S. Datta, K. Bhaduri, C. Giannella, R. Wolff, H. Kargupta. (2006). Distributed Data Mining in Peer-to-Peer
Networks. IEEE Internet Computing, special issue on Distributed Data Mining (invited submission),
Volume 10, Number 4, pp. 18-26. [ http://www.cs.umbc.edu/~hillol/PUBS/P2PDM.pdf ]
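To make the data-parallel case concrete, here is a minimal simulation of pairwise gossip averaging, a standard P2P building block in which each peer holds its own data and talks only to its immediate neighbors. It is offered as a generic sketch, not as the algorithm from any of the papers listed above; the ring topology, peer count, and values are assumptions chosen for illustration.

```python
import random


def gossip_average(values, neighbors, rounds, seed=0):
    """Pairwise gossip: a random peer averages its value with a random neighbor."""
    rng = random.Random(seed)
    state = list(values)
    for _ in range(rounds):
        i = rng.randrange(len(state))
        j = rng.choice(neighbors[i])
        avg = (state[i] + state[j]) / 2.0
        state[i] = state[j] = avg  # both peers adopt the pairwise mean
    return state


# Hypothetical ring of 6 peers; each peer knows only its two ring neighbors.
local_values = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
estimates = gossip_average(local_values, ring, rounds=2000)
true_mean = sum(local_values) / len(local_values)
```

Every exchange preserves the sum of the values, so each peer's estimate drifts toward the global mean without any node ever seeing the full data set or a central coordinator being involved.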
Why distributed data mining?
Because …
Many great astronomical
discoveries have come
from inter-comparisons
of various wavelengths:
- Quasars
- Gamma-ray bursts
- Ultraluminous IR galaxies
- X-ray black-hole binaries
- Radio galaxies
- ...
Some Fundamental Astronomy problems:
most of these require VO-accessible distributed data
• Some key astronomy problems that can be addressed with
distributed data:
– Cross-Match objects from different catalogues
– The distance problem (e.g., Photometric Redshift estimators)
– Star-Galaxy Separation
– Cosmic-Ray Detection in images
– Supernova Detection and Classification
– Morphological Classification (galaxies, AGN, gravitational lenses, ...)
– Class and Subclass Discovery (brown dwarfs, methane dwarfs, ...)
– Dimension Reduction = Correlation Discovery
– Learning Rules for improved classifiers
– Classification of massive data streams
– Real-time Classification of Astronomical Events
– Clustering of massive data collections
– Novelty, Anomaly, Outlier Detection in massive databases
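The first problem in this list, cross-matching, can be sketched as a nearest-neighbor search on the sky within a tolerance radius. The version below is a brute-force illustration only: the function names, the 2-arcsecond radius, and the positions are invented for the example and are not taken from any real catalogue or VO service.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    h = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h))))


def cross_match(cat_a, cat_b, radius_arcsec):
    """For each object in cat_a, return the nearest cat_b object within the radius."""
    radius_deg = radius_arcsec / 3600.0
    matches = {}
    for id_a, (ra_a, dec_a) in cat_a.items():
        best_id, best_sep = None, radius_deg
        for id_b, (ra_b, dec_b) in cat_b.items():
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= best_sep:
                best_id, best_sep = id_b, sep
        if best_id is not None:
            matches[id_a] = best_id
    return matches


# Made-up positions (RA, Dec in degrees), not real catalogue entries.
optical = {"opt1": (150.0000, 2.0000), "opt2": (150.1000, 2.1000)}
infrared = {"ir1": (150.0002, 2.0001), "ir2": (151.0000, 3.0000)}
matches = cross_match(optical, infrared, radius_arcsec=2.0)
```

The double loop costs one separation calculation per pair of objects, which is exactly why large cross-matches are usually partitioned by sky region before being distributed across nodes.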
Sample Astronomy Data Mining Applications:
most of these require VO-accessible distributed data
– Neural Network for Pixel Classification: Event Detection and
Prediction (e.g., Supernova or Cosmic-ray hit?)
– Bayesian Network for Object Classification (star or galaxy?)
– PCA for finding Fundamental Planes of Galaxy Parameters
– PCA (weakest component) for Outlier Detection: anomalies,
novel discoveries, new objects
– Link Analysis (Association Mining) for Causal Event Detection
(e.g., linking optical transients with gamma-ray events)
– Clustering analysis: Spatial, Temporal, or any scientific
database parameters
– Markov models: Temporal mining, classification, and
prediction from time series data
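The "PCA (weakest component)" idea can be illustrated in two dimensions, where the eigen-decomposition of the covariance matrix has a closed form: objects that follow the dominant correlation have almost no extent along the lowest-variance axis, so a large projection onto that axis flags a candidate outlier. This is a toy sketch with made-up points; real catalogues have many more parameters and would need a general eigensolver.

```python
import math


def weakest_component_scores(points):
    """Score 2-D points by |projection| onto the lowest-variance principal axis."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Smallest eigenvalue of the 2x2 covariance matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2.0
    root = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    lam_min = half_trace - root
    # Eigenvector for lam_min is (sxy, lam_min - sxx); fall back to a
    # coordinate axis when the two parameters are uncorrelated.
    if sxy != 0.0:
        vx, vy = sxy, lam_min - sxx
    else:
        vx, vy = (1.0, 0.0) if sxx <= syy else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm
    return [abs((x - mx) * vx + (y - my) * vy) for x, y in points]


# Ten made-up objects on the relation y = 2x, plus one that breaks it.
sample = [(float(x), 2.0 * x) for x in range(10)] + [(5.0, 3.0)]
scores = weakest_component_scores(sample)
outlier_index = scores.index(max(scores))
```

Here the last point is the only one far from the y = 2x relation, and it receives the largest score even though neither of its coordinates is extreme on its own.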
Class Discovery: feature separation and
discrimination of classes across multiple databases
• Reference: http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/BrunnerDPS.pdf
• The separation of classes improves when attributes from
disparate databases are chosen to be projected, as in the
following star-galaxy discrimination test:
[Figure: star-galaxy discrimination plots, labeled "Not good" and "Good"]
Novelty Discovery (Outlier Detection): improved
discovery of rare objects across multiple databases
Correlation Discovery: Fundamental Plane for 156,000 cross-matched
Sloan+2MASS Elliptical Galaxies: plot shows variance captured by first
2 Principal Components as a function of local galaxy density.
Reference: Borne, Dutta, Giannella, Kargupta, & Griffin 2008
[Plot: % of variance captured by PC1+PC2 vs. local galaxy density (low to high)]
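For reference, the plotted quantity can be written in terms of the eigenvalues of the sample covariance matrix of the galaxy parameters. The symbols below are introduced here for illustration; the slide does not state how many parameters d were used.

```latex
f_{1+2} \;=\; 100\% \times \frac{\lambda_1 + \lambda_2}{\sum_{i=1}^{d} \lambda_i},
\qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0
```

A value of f_{1+2} close to 100% means the galaxies in that density bin lie close to a two-dimensional plane in the d-dimensional parameter space.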
Our Project Plans
• NASA-funded (AISR) project to implement a P2P
distributed data mining system
• Provide a small number of “useful” data mining
algorithms (one-to-one mapping with science
use cases):
• Clustering :: Class Discovery & Characterization
• Outlier detection :: Novelty Discovery
• PCA :: Correlation Discovery
• Select problems and algorithms that are
decomposable: task-parallel and/or data-parallel
• Implement system within VO framework
Architecture: NASA project (back-end)
[Diagram labels: User; User Interface; Metadata with cross-matched information; Portion of metadata chosen based on user query]
IVOA GWS Standards
• GWS standards enable access to distributed data
and distributed compute resources
• Nodes in P2P system individually request
distributed data partitions
• Workflow is distributed across the P2P compute
nodes
• P2P activities are stateful & asynchronous
• Relevant GWS activities: Security, VOSpace,
Asynchronous services, Single Sign-on,
Universal Worker Service (UWS), Logging
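The "stateful & asynchronous" point can be pictured as a job record that moves through named phases while the client polls or is notified. The sketch below uses a subset of the UWS phase names with a simplified transition table; the `Job` class and the job identifier are hypothetical, and a real UWS service exposes this state over HTTP rather than in memory.

```python
from enum import Enum


class Phase(Enum):
    PENDING = "PENDING"
    QUEUED = "QUEUED"
    EXECUTING = "EXECUTING"
    COMPLETED = "COMPLETED"
    ERROR = "ERROR"
    ABORTED = "ABORTED"


# Simplified transition table (an assumption for this sketch; the UWS
# specification defines additional phases such as HELD and SUSPENDED).
ALLOWED = {
    Phase.PENDING: {Phase.QUEUED, Phase.ABORTED},
    Phase.QUEUED: {Phase.EXECUTING, Phase.ABORTED},
    Phase.EXECUTING: {Phase.COMPLETED, Phase.ERROR, Phase.ABORTED},
    Phase.COMPLETED: set(),
    Phase.ERROR: set(),
    Phase.ABORTED: set(),
}


class Job:
    """Hypothetical in-memory job record for one unit of distributed work."""

    def __init__(self, job_id):
        self.job_id = job_id
        self.phase = Phase.PENDING
        self.history = [Phase.PENDING]

    def advance(self, new_phase):
        """Move to new_phase, rejecting transitions the table does not allow."""
        if new_phase not in ALLOWED[self.phase]:
            raise ValueError(f"{self.phase.name} -> {new_phase.name} not allowed")
        self.phase = new_phase
        self.history.append(new_phase)

    @property
    def finished(self):
        """True once the job is in a phase with no outgoing transitions."""
        return not ALLOWED[self.phase]


job = Job("pca-partition-7")
for step in (Phase.QUEUED, Phase.EXECUTING, Phase.COMPLETED):
    job.advance(step)
```

Keeping the phase history on the job is one simple way for a P2P node to report what happened to a partition after the fact, which ties in with the Logging activity listed above.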
GWS functions required by
P2P Data Mining Environment
• Acquiring & managing nodes and workspaces
(VOSpace)
• Single sign-on to nodes (SSO)
• Distributing work and metadata to nodes (GRID)
• Cone-search and other data requests submitted
from compute nodes to data repositories
– RESTful services?
• Secure stateful asynchronous computations (UWS)
– Communicate results between nodes, as required by some
DDM algorithms
• Recording and sharing results, and demonstrating
interoperable multi-database VO science (Logging)
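As one example of a data request a compute node might issue, the sketch below builds a Simple Cone Search query URL, which takes the position and search radius as `RA`, `DEC`, and `SR` parameters in decimal degrees. The helper name and the endpoint are hypothetical; a real node would obtain the service URL from a registry.

```python
from urllib.parse import urlencode


def cone_search_url(base_url, ra_deg, dec_deg, radius_deg):
    """Build a Simple Cone Search query URL (RA, DEC, SR in decimal degrees)."""
    if not 0.0 <= ra_deg <= 360.0:
        raise ValueError("RA must be in [0, 360] degrees")
    if not -90.0 <= dec_deg <= 90.0:
        raise ValueError("Dec must be in [-90, 90] degrees")
    if radius_deg < 0.0:
        raise ValueError("search radius must be non-negative")
    query = urlencode({"RA": ra_deg, "DEC": dec_deg, "SR": radius_deg})
    # Append to any query string the service's base URL already carries.
    if base_url.endswith(("?", "&")):
        separator = ""
    elif "?" in base_url:
        separator = "&"
    else:
        separator = "?"
    return f"{base_url}{separator}{query}"


# Hypothetical service endpoint; substitute a registry-resolved URL.
url = cone_search_url("http://example.org/scs", 180.0, -30.5, 0.1)
```

Because the whole request is a plain HTTP GET, each P2P node can fetch its own data partition independently, which fits the "RESTful services?" question raised on this slide.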