Data Mining on Astronomy Catalogs


A Web service for Distributed Covariance
Computation on Astronomy Catalogs
Presented by
Haimonti Dutta
CMSC 691D
ROADMAP
• Background Information
• Interesting Astronomy Data Mining Problems
• What has and has not been done (Literature review)
• My project objectives
• The problem of Alignment in astronomy catalogs
• The Fundamental Plane
• A case study for recreating the Fundamental Plane from astronomy
catalogs
• Experimental Results
• Efforts towards building Web services
Background Information
 Next generation Astronomy catalogs will contain data for
most of the sky
 Existing astronomy sky surveys: SDSS, 2MASS, FIRST, etc.
 Terabytes and petabytes of data
 Data Avalanche in Astronomy
 Getting useful information is like looking for a needle in a
haystack
 National Virtual Observatory (NVO) has been set up to
facilitate scientific discovery
 Obvious need for Distributed Data Mining
What kinds of data mining activities are astronomers interested in?
 Detection of transient objects such as supernovae (online transient
object detection in real time)
 Obtain statistics of variable and moving objects (model variability, refine
existing models, fit models to irregularly sampled data)
 Parameterize shapes of objects using rotationally invariant quantities
 Efficient cluster and outlier detection
 Supervised data mining problems (match objects detected in multiple
bands, derive photometric redshifts)
What has and has not been done
 Many efforts in centralized data mining
(NVO, F-MASS, ClassX, FIRST, etc.)
 Some grid mining (notably the GRIST
project)
 Very few distributed data mining efforts, all in
their preliminary stages
(http://www.cs.queensu.ca/home/mcconell/DDMAstro.html)
Objectives of this project
 Alignment of catalogs (the Fundamental Plane problem)
 Implementation of algorithms for distributed data mining on
astronomy catalogs
 Development of Web services for the catalogs, and investigation into
what needs to be done to integrate them into the NVO
Alignment of Astronomy Catalogs
Cross matching is a nontrivial problem in itself. We assume that
cross matching happens offline and that there exists an indexing
scheme by which the catalogs know the exact cross-matched tuples.
Some interesting numbers
 The current SDSS catalog is 3.0 TB in size and contains about 180 million
objects (as of Data Release 4)
 2MASS has already observed 99% of the sky and reports
470,992,970 point sources and 1,647,599 extended sources
Portion of the sky observed by SDSS
Problems
 Cross matching is an inherently difficult
problem for astronomy catalogs
 We assume the data sets are already cross matched
and that this computation is done offline
 This is a strong assumption and may often
be unacceptable to astronomers
A real life cross matching Exercise
Problems encountered
 Which catalogs to use?
 We tried several: SDSS, 2MASS, HyperLeda, and the CfA Redshift Catalog
 Catalogs have different indexing schemes: more recent ones use
HTM (Hierarchical Triangular Mesh), others use (ra, dec) or even
names of objects
 Some attributes are simply not available (SDSS has -9999 for most
of its redshift values)
 Different catalogs observe different portions of the sky (SDSS
covers only about 16% of the sky in the latest release, while 2MASS
covers the entire sky), so select the subsets to cross match wisely
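The missing-attribute problem above can be sketched as a simple filter applied before any analysis. This is only an illustrative sketch: the -9999 sentinel comes from the slide, while the record layout and the field names (`objid`, `redshift`, `sigma`) are hypothetical and not taken from any catalog schema.

```python
MISSING = -9999  # sentinel reported on the slide for unavailable SDSS redshifts


def drop_missing(rows, fields):
    """Keep only the rows in which every requested field has a real value."""
    return [row for row in rows if all(row[f] != MISSING for f in fields)]


# Hypothetical records; the field names are assumptions for illustration.
catalog = [
    {"objid": 1, "redshift": 0.031, "sigma": 180.0},
    {"objid": 2, "redshift": MISSING, "sigma": 150.0},
    {"objid": 3, "redshift": 0.045, "sigma": MISSING},
]

usable = drop_missing(catalog, ["redshift", "sigma"])
```

Filtering early keeps sentinel values from distorting means and covariances computed later.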
The successful cross matching

 Chose a region of the sky with dec between 0 and 15 degrees and ra
between 150 and 200 degrees, observed by both SDSS and 2MASS
 Used a web interface provided by SDSS to do the cross matching
 Selected the K-band for obtaining redshift and surface brightness
(astronomical significance)
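The project used the SDSS web interface for the actual cross matching. Purely as an illustration of what a positional cross match involves, the sketch below restricts two catalogs to the region named above and pairs each object with its nearest neighbour within a tolerance. The brute-force search, the `(id, ra, dec)` tuple layout, and the 2-arcsecond tolerance are all assumptions, not details from the talk.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees, using the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def in_region(ra, dec):
    """Region from the slide: ra in [150, 200] and dec in [0, 15] degrees."""
    return 150.0 <= ra <= 200.0 and 0.0 <= dec <= 15.0


def cross_match(cat_a, cat_b, tol_arcsec=2.0):
    """Brute-force nearest-neighbour match of (id, ra, dec) tuples.

    The tolerance is a hypothetical value chosen for illustration.
    """
    tol_deg = tol_arcsec / 3600.0
    candidates = [obj for obj in cat_b if in_region(obj[1], obj[2])]
    matches = []
    for id_a, ra, dec in cat_a:
        if not in_region(ra, dec) or not candidates:
            continue
        best = min(candidates,
                   key=lambda o: angular_separation_deg(ra, dec, o[1], o[2]))
        if angular_separation_deg(ra, dec, best[1], best[2]) <= tol_deg:
            matches.append((id_a, best[0]))
    return matches
```

A quadratic scan like this is only workable for small subsets; spatial indexes such as HTM exist precisely to avoid it at catalog scale.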
Case Study
 Centralized database of 1249 cross-matched objects
 Attributes are size, surface brightness, and velocity dispersion
 Does not really make a case for a distributed data mining scenario!
Solution
- try a larger subset of the data from both catalogs
The Fundamental Plane
 Interesting problem in astronomy: identifying
correlations in high-dimensional spaces
 For the class of elliptical and spiral galaxies
 Observed features: radius, mean surface
brightness, and central velocity dispersion
 A two-dimensional plane exists in the observed
3D parameter space, called
THE FUNDAMENTAL PLANE
An illustration of the Fundamental Plane
Experimental Results
 First PC captured
69.4193% of variance
 Second PC captured
12.1333% of the variance
 The astronomy literature
suggests 1st and 2nd PC
together should capture
about 88% of variance
Reasonably close recreation of the Fundamental Plane from two
cross matched data sets in the centralized setting
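The percentages above are the shares of total variance carried by each principal component, that is, each eigenvalue of the covariance matrix divided by the sum of the eigenvalues. A minimal standard-library sketch for three attributes follows. It assumes PCA on the sample covariance matrix; the talk does not say whether the attributes were standardized or log-scaled first, so any such preprocessing is left to the caller.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length numeric tuples."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
             for b in range(d)] for a in range(d)]


def eigenvalues_sym3(m):
    """Eigenvalues of a symmetric 3x3 matrix, largest first (trigonometric method)."""
    p1 = m[0][1] ** 2 + m[0][2] ** 2 + m[1][2] ** 2
    q = (m[0][0] + m[1][1] + m[2][2]) / 3.0
    if p1 == 0.0:  # already diagonal
        return sorted((m[0][0], m[1][1], m[2][2]), reverse=True)
    p = math.sqrt((sum((m[i][i] - q) ** 2 for i in range(3)) + 2.0 * p1) / 6.0)
    b = [[(m[i][j] - (q if i == j else 0.0)) / p for j in range(3)]
         for i in range(3)]
    det = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
           - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
           + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
    phi = math.acos(max(-1.0, min(1.0, det / 2.0))) / 3.0
    e1 = q + 2.0 * p * math.cos(phi)
    e3 = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    return [e1, 3.0 * q - e1 - e3, e3]


def explained_variance_percent(rows):
    """Percentage of total variance captured by each principal component."""
    eig = eigenvalues_sym3(covariance_matrix(rows))
    total = sum(eig)
    return [100.0 * e / total for e in eig]


# The two figures reported on the slide add up to about 81.55%.
reported_total = 69.4193 + 12.1333
```

When the data lie close to a plane, the third percentage is small. The reported first two components together account for about 81.55%, compared with the roughly 88% cited from the literature.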
Algorithm for Distributed Covariance Computation
 A central coordination site S sends A and B a random
number generation seed
 A and B each generate the same n x l random matrix R, where l << n
 A and B send R^T A and R^T B to S
 S computes (R^T A)^T (R^T B) / n
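The steps above can be sketched in plain Python as follows. Two points are assumptions not stated on the slide: the entries of R are drawn i.i.d. from a normal distribution with mean 0 and variance 1/l, so that R R^T equals the identity in expectation, and each site mean-centers its own columns, so that dividing by n gives a covariance rather than a raw cross-product. The function names are illustrative only.

```python
import random


def random_matrix(seed, n, l):
    """n x l matrix with i.i.d. N(0, 1/l) entries, reproducible from the seed."""
    rng = random.Random(seed)
    sigma = (1.0 / l) ** 0.5
    return [[rng.gauss(0.0, sigma) for _ in range(l)] for _ in range(n)]


def center(x):
    """Subtract each column mean (assumed preprocessing at each site)."""
    means = [sum(col) / len(x) for col in zip(*x)]
    return [[v - m for v, m in zip(row, means)] for row in x]


def project(r, x):
    """Return R^T X, an l x k matrix, for local data X of shape n x k."""
    n, l, k = len(r), len(r[0]), len(x[0])
    return [[sum(r[i][j] * x[i][c] for i in range(n)) for c in range(k)]
            for j in range(l)]


def coordinator_estimate(pa, pb, n):
    """Site S: compute (R^T A)^T (R^T B) / n from the two projections."""
    l, p, q = len(pa), len(pa[0]), len(pb[0])
    return [[sum(pa[j][a] * pb[j][b] for j in range(l)) / n for b in range(q)]
            for a in range(p)]


def distributed_covariance(a, b, l, seed=0):
    """Simulate the protocol in one process; A and B never exchange raw data."""
    n = len(a)
    # Each site rebuilds the same R from the shared seed; R is never transmitted.
    pa = project(random_matrix(seed, n, l), center(a))  # computed at site A
    pb = project(random_matrix(seed, n, l), center(b))  # computed at site B
    return coordinator_estimate(pa, pb, n)              # computed at site S
```

Only the l-row projections cross the network, so a smaller l lowers communication cost at the price of a noisier estimate.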
Experimental Results – Distributed Setting
Case Study
 1249 cross-matched objects (tuples) at both site A and site B
 2 attributes at site A and 1
attribute at site B
More results
Development of a Web Service
Architecture of the Proposed System
[Diagram: a CLIENT and the WEB SERVICE for distributed covariance computation exchange SOAP messages; the data reside at SITE A and SITE B]
Current Implementation
 Using Apache Axis (SOAP engine: a framework for constructing
SOAP processors such as clients and servers)
 Tomcat version 4.1
 SOAP version 1.2
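The implementation described here is built on Apache Axis. Purely to illustrate the kind of message a site might send to the Web service, the sketch below builds and parses a SOAP 1.2 envelope using Python's standard library. The SOAP 1.2 envelope namespace is the standard one; the service namespace, the `submitProjection` operation, the `site` attribute, and the `row` elements are hypothetical and not taken from the actual service.

```python
import xml.etree.ElementTree as ET

SOAP12_NS = "http://www.w3.org/2003/05/soap-envelope"  # standard SOAP 1.2 namespace
SVC_NS = "urn:example:distributed-covariance"          # hypothetical service namespace


def build_projection_message(site, matrix):
    """Wrap a projected matrix (list of rows) in a SOAP 1.2 envelope."""
    ET.register_namespace("env", SOAP12_NS)
    envelope = ET.Element(f"{{{SOAP12_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP12_NS}}}Body")
    # Hypothetical operation element carrying the sender's site name.
    op = ET.SubElement(body, f"{{{SVC_NS}}}submitProjection", {"site": site})
    for row in matrix:
        ET.SubElement(op, f"{{{SVC_NS}}}row").text = " ".join(repr(v) for v in row)
    return ET.tostring(envelope, encoding="unicode")


def parse_projection_message(xml_text):
    """Recover the site name and matrix from a message built above."""
    root = ET.fromstring(xml_text)
    op = root.find(f"{{{SOAP12_NS}}}Body/{{{SVC_NS}}}submitProjection")
    rows = [[float(v) for v in row.text.split()]
            for row in op.findall(f"{{{SVC_NS}}}row")]
    return op.get("site"), rows


message = build_projection_message("A", [[0.5, -1.25], [2.0, 3.0]])
```

Encoding matrices as text inside the body grows quickly with matrix size, which is one motivation for the SOAP-with-attachments direction mentioned under further development.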
Short Demo
Further System Developmental Issues
(use of SOAP with attachments)
QUESTIONS ?