Data Mining on Astronomy Catalogs
A Web service for Distributed Covariance
Computation on Astronomy Catalogs
Presented by
Haimonti Dutta
CMSC 691D
ROADMAP
• Background Information
• Interesting Astronomy Data Mining Problems
• What has / has not been done (literature review)
• My project objectives
• The problem of Alignment in astronomy catalogs
• The Fundamental Plane
• A case study for recreating the Fundamental Plane from astronomy catalogs
• Experimental Results
• Efforts towards building Web services
Background Information
Next-generation astronomy catalogs will contain data for most of the sky
Existing sky surveys: SDSS, 2MASS, FIRST, etc.
Terabytes and petabytes of data
Data avalanche in astronomy
Getting useful information is like looking for a needle in a haystack
The National Virtual Observatory (NVO) has been set up to facilitate scientific discovery
Obvious need for distributed data mining
What kind of data mining activities are astronomers interested in?
Detection of transient objects such as supernovae (online transient object detection in real time)
Statistics of variable and moving objects (model variability, refine existing models, fit models to irregularly sampled data)
Parameterize shapes of objects using rotationally invariant quantities
Efficient cluster and outlier detection
Supervised data mining problems (match objects detected in multiple bands, derive photometric redshifts)
What has/not been done
Many efforts in centralized data mining (NVO, 2MASS, ClassX, FIRST, etc.)
Some grid mining (notably the GRIST project)
Very few distributed data mining efforts, still in their preliminary stages
(http://www.cs.queensu.ca/home/mcconell/DDMAstro.html)
Objectives of this project
Alignment of catalogs (the Fundamental Plane problem)
Implementation of algorithms for distributed data mining on astronomy catalogs
Development of Web services for the catalogs / investigation into what needs to be done to integrate this into the NVO
Alignment of Astronomy Catalogs
Cross matching is a non-trivial problem in itself. We assume cross matching happens offline and that an indexing scheme exists by which catalogs know the exact cross-matched tuples
Some interesting numbers
Current SDSS catalogs: 3.0 TB, containing about 180 million objects (as of Data Release 4)
2MASS has already observed 99% of the sky and reports 470,992,970 point sources and 1,647,599 extended sources
Portion of the sky observed by SDSS
Problems
Cross matching is an inherently difficult problem for astronomy catalogs
We assume the data sets are cross matched and that this computation is done offline
This is a strong assumption and may often not be acceptable to astronomers
A real-life cross-matching exercise
Problems encountered
Which catalogs to use?
We tried several: SDSS, 2MASS, HyperLeda, the CfA Redshift Catalog
Catalogs have different indexing schemes: more recent ones use HTM (Hierarchical Triangular Mesh); others use (ra, dec) or even object names
Some attributes are simply not available! (SDSS has -9999 for most of its redshift values)
Different catalogs observe different portions of the sky (SDSS covers only about 16% of the sky in its latest release, while 2MASS covers the entire sky) – select subsets to cross match wisely!
The successful cross matching …
Chose a region of the sky between 0 and 15 degrees (dec) and between 150 and 200 degrees (ra), observed by both SDSS and 2MASS
Used the web interface provided by SDSS to do the cross matching
Selected the K band for obtaining redshift and surface brightness (astronomical significance)
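The region cut and positional match described above can be sketched as follows. This is only an illustration: the 3-arcsecond tolerance, the dictionary-based catalog records, and the nearest-neighbour strategy are assumptions for the sketch, not the actual logic of the SDSS cross-matching interface.

```python
import math

def ang_sep_deg(ra1, dec1, ra2, dec2):
    """Angular separation in degrees via the haversine formula on the sphere."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h)))

def in_window(obj, ra_lo=150.0, ra_hi=200.0, dec_lo=0.0, dec_hi=15.0):
    """The (ra, dec) window used in the case study: 150-200 ra, 0-15 dec."""
    return ra_lo <= obj["ra"] <= ra_hi and dec_lo <= obj["dec"] <= dec_hi

def cross_match(sdss, tmass, tol_deg=3.0 / 3600):
    """Pair each in-window SDSS object with the nearest 2MASS object,
    accepting the pair only if the separation is within tol_deg (~3 arcsec)."""
    pairs = []
    for s in filter(in_window, sdss):
        best = min(tmass, key=lambda t: ang_sep_deg(s["ra"], s["dec"],
                                                    t["ra"], t["dec"]))
        if ang_sep_deg(s["ra"], s["dec"], best["ra"], best["dec"]) <= tol_deg:
            pairs.append((s, best))
    return pairs

# Tiny demo: the second SDSS object falls outside the chosen window.
matches = cross_match(
    sdss=[{"ra": 170.0, "dec": 5.0}, {"ra": 10.0, "dec": 5.0}],
    tmass=[{"ra": 170.0003, "dec": 5.0}, {"ra": 180.0, "dec": 10.0}],
)
```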
Case Study
Centralized database: 1249 cross-matched objects
Attributes are size, surface brightness, and velocity dispersion
This does not really make a case for a distributed data mining scenario!
Solution: try a larger subset of the data from both catalogs
The Fundamental Plane
An interesting problem in astronomy: identify correlations in high-dimensional spaces
For the class of elliptical and spiral galaxies
Observed features: radius, mean surface brightness, and central velocity dispersion
A two-dimensional plane exists in the observed space of 3D parameters, called THE FUNDAMENTAL PLANE
An illustration of the Fundamental Plane
Experimental Results
The first PC captured 69.4193% of the variance
The second PC captured 12.1333% of the variance
The astronomy literature suggests the 1st and 2nd PCs together should capture about 88% of the variance
A reasonably close recreation of the Fundamental Plane from two cross-matched data sets in the centralized setting
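The centralized PCA step behind such numbers can be sketched on synthetic data. The plane coefficients, noise level, and sample generation below are illustrative assumptions chosen only so that the first two components dominate, as the Fundamental Plane predicts; they are not the actual catalog values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1249  # same order as the cross-matched sample in the case study

# Synthetic galaxies: log radius depends (roughly) linearly on surface
# brightness and log velocity dispersion, plus small scatter off the plane.
log_sigma = rng.normal(2.2, 0.2, n)   # log central velocity dispersion
mu = rng.normal(20.0, 1.0, n)         # mean surface brightness
log_R = 1.24 * log_sigma + 0.33 * mu + rng.normal(0.0, 0.05, n)

X = np.column_stack([log_R, mu, log_sigma])
Xc = X - X.mean(axis=0)  # center each attribute before PCA

# PCA via SVD of the centered data matrix: singular values squared give
# the variance captured by each principal component.
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
var_frac = s ** 2 / np.sum(s ** 2)
```

If the points really lie near a 2D plane, `var_frac[2]` is tiny and the third right-singular vector in `Vt` is the plane's normal direction.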
Algorithm for Distributed Covariance Computation
A central coordination site S sends A and B a random number generation seed
A and B each generate the same n x l random matrix R, where l << n
A and B send S the projections R^T A and R^T B
S computes (R^T A)^T (R^T B) / n as an estimate of the covariance A^T B / n
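A minimal sketch of the protocol above, under the assumption that R has zero-mean Gaussian entries with variance 1/l (so that E[R Rᵀ] = I and dividing by n yields a covariance estimate); the function names and toy data are hypothetical, not the project's actual implementation.

```python
import numpy as np

def site_projection(X, seed, l):
    """Each site builds the SAME n x l random matrix R from the shared seed
    (entries ~ N(0, 1/l), so E[R R^T] = I) and ships the l x d matrix R^T X."""
    n = X.shape[0]
    R = np.random.default_rng(seed).normal(0.0, 1.0 / np.sqrt(l), size=(n, l))
    return R.T @ X

def coordinator_covariance(RtA, RtB, n):
    """Site S combines the two projections:
    (R^T A)^T (R^T B) / n = A^T R R^T B / n, which approximates A^T B / n."""
    return (RtA.T @ RtB) / n

# Toy check mirroring the case study: 1249 cross-matched tuples,
# two attributes at site A and one correlated attribute at site B.
rng = np.random.default_rng(1)
n, l = 1249, 500
A = rng.normal(size=(n, 2))
B = A[:, :1] + 0.1 * rng.normal(size=(n, 1))
A -= A.mean(axis=0)
B -= B.mean(axis=0)

seed = 42  # the shared seed sent by S
est = coordinator_covariance(site_projection(A, seed, l),
                             site_projection(B, seed, l), n)
exact = (A.T @ B) / n  # what a centralized site would compute
```

Only the l x d projections travel over the network, so the communication cost is O(l d) per site instead of O(n d), which is the point of requiring l << n.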
Experimental Results – Distributed Setting
Case Study
1249 cross-matched tuples shared by sites A and B
2 attributes at site A and 1 attribute at site B
More results
Development of a Web Service
Architecture of the Proposed System
[Architecture diagram: a CLIENT exchanges SOAP messages with a Web service for Distributed Covariance Computation, which draws data from SITE A and SITE B]
Current Implementation
Using Apache Axis (a SOAP engine: a framework for building SOAP processors such as clients and servers)
Tomcat version 4.1
SOAP version 1.2
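As a sketch of the kind of SOAP 1.2 message the client might send to the service, the envelope below is built with Python's standard library rather than Apache Axis; the operation name, service namespace, and parameters are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://www.w3.org/2003/05/soap-envelope"  # SOAP 1.2 envelope namespace
SVC = "urn:example:dcov"  # hypothetical namespace for the covariance service

ET.register_namespace("env", SOAP_ENV)

# Envelope > Body > request element carrying the shared seed and the
# number of projection columns l (both illustrative parameter names).
envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
req = ET.SubElement(body, f"{{{SVC}}}computeCovariance")
ET.SubElement(req, f"{{{SVC}}}seed").text = "42"
ET.SubElement(req, f"{{{SVC}}}projectionCols").text = "400"

message = ET.tostring(envelope, encoding="unicode")
```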
Short Demo
Further System Development Issues
(use of SOAP with attachments)
QUESTIONS ?