KDD for Science Data Analysis Issues and Examples

Contents
Introduction
Data Considerations
Brief Case Studies
  Sky Survey Cataloging
  Finding Volcanoes on Venus
  Biosequence Databases
  Earth Geophysics
  Atmospheric Science
Issues and Challenges
Conclusion

Data Considerations
Image data
Time-series and sequence data
Numerical vs. categorical values
Structured and sparse data
Reliability of data

Brief Case Studies
Sky Survey Cataloging
Finding Volcanoes on Venus
Earth Geophysics
Atmospheric Science
Biosequence Databases

Sky Survey Cataloging
The survey consists of 3 terabytes of image data containing an estimated 2 billion sky objects.
The basic problem is to generate a survey catalog that records the attributes of each object along with its class: star or galaxy.
To achieve this, scientists developed the SKICAT system.
Reasons why SKICAT was successful:
The astronomers solved the feature extraction problem.
Data mining methods contributed to solving difficult classification problems.
Manual approaches were simply not feasible. Astronomers needed an automated classifier to make the most of the data.
Decision tree methods proved to be an effective tool for finding the important dimensions for this problem.
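
To illustrate how a decision tree can surface the important dimensions, here is a minimal sketch of the first step of tree induction: choosing the single feature and threshold that best separate the two classes, scored by Gini impurity. This is not SKICAT's actual learner. The feature names (brightness, ellipticity) and all values are hypothetical and were invented for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best axis-aligned split."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted(set(row[f] for row in rows))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, t, score)
    return best


# Hypothetical toy catalog: each object is (brightness, ellipticity).
objects = [
    (14.0, 0.05), (18.5, 0.08), (16.2, 0.03),
    (15.1, 0.40), (17.9, 0.35), (19.0, 0.55),
]
classes = ["star", "star", "star", "galaxy", "galaxy", "galaxy"]

feature, threshold, impurity = best_split(objects, classes)
print(feature, threshold, impurity)
```

In this toy data the split lands on the second feature (index 1), because brightness alone cannot separate the two classes. A full tree learner would apply the same search recursively to each side of the split.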
Finding Volcanoes on Venus
The data was collected by the Magellan spacecraft.
The first pass over Venus, using the left-looking radar, produced 30,000 images of 1000 x 1000 pixels each.
To help geologists analyze this data set, the JPL Adaptive Recognition Tool (JARtool) was developed.
Motivation for using data mining methods:
The scientists did not know much about image processing or about the properties of SAR (synthetic aperture radar) imagery. Hence they could easily label images but could not design recognizers.
There was little variation in the illumination and orientation of the objects of interest. Hence the mapping from pixel space to feature space could be performed automatically.
The geologists had no other easy means of finding the small volcanoes, so they were motivated to cooperate by providing training data and other help.
Earth Geophysics
Given two images taken before and after an earthquake, repeatedly registering different local regions of the two images makes it possible to infer the direction and magnitude of ground motion due to the earthquake.
An example of a geoscientific data mining system is Quakefinder, which automatically detects and measures tectonic activity in the Earth's crust by examining satellite data.
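
The idea of registering a local region can be sketched as a brute-force search: take a patch from the "before" image and find the integer shift at which it best matches the "after" image, using the sum of squared differences. This is only an illustration under simplifying assumptions (whole-pixel shifts, tiny synthetic images), not Quakefinder's actual algorithm. All function names and data are hypothetical.

```python
def patch(image, top, left, size):
    """Cut a size x size square out of a 2D list-of-lists image."""
    return [row[left:left + size] for row in image[top:top + size]]


def ssd(a, b):
    """Sum of squared differences between two equally sized patches."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def local_offset(before, after, top, left, size, max_shift):
    """Return the integer (dy, dx) that best aligns one patch of `before` with `after`."""
    ref = patch(before, top, left, size)
    best = None
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            t, l = top + dy, left + dx
            # Skip shifts that would move the patch outside the image.
            if t < 0 or l < 0 or t + size > len(after) or l + size > len(after[0]):
                continue
            score = ssd(ref, patch(after, t, l, size))
            if best is None or score < best[0]:
                best = (score, dy, dx)
    return best[1], best[2]


def make_image(blob_top, blob_left, height=8, width=8):
    """Synthetic image: zeros with one small asymmetric bright blob."""
    img = [[0] * width for _ in range(height)]
    for i in range(2):
        for j in range(2):
            img[blob_top + i][blob_left + j] = 10 + 3 * i + j
    return img


before = make_image(2, 2)
after = make_image(3, 4)  # same blob, moved 1 row down and 2 columns right

dy, dx = local_offset(before, after, top=1, left=1, size=4, max_shift=2)
print(dy, dx)
```

Repeating this search over many patches gives one displacement vector per local region, which is the raw material for a ground-motion map.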
Atmospheric Science
The data mining tool used is called CONQUEST.
CONQUEST employed parallel testbeds to enable rapid extraction of spatio-temporal features for content-based access.
One of the goals of this tool is the development of “learning” algorithms that look for novel patterns, event clusters, etc.
Figure: Retrieved Sea Level Pressure Fields
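
As a rough illustration of spatial feature extraction from a sea level pressure field, the sketch below marks grid cells that are strict local minima below a pressure cutoff. Such cells are one plausible kind of candidate feature. The grid values, the cutoff, and the function name are assumptions made for this example, and it does not reproduce CONQUEST's method. Linking candidates across time steps is not shown.

```python
def local_minima(field, max_value):
    """Return (row, col) cells below max_value that are lower than all 8 neighbours."""
    found = []
    rows, cols = len(field), len(field[0])
    # Border cells are skipped because they lack a full neighbourhood.
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            value = field[r][c]
            if value >= max_value:
                continue
            neighbours = [
                field[r + dr][c + dc]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            ]
            if all(value < n for n in neighbours):
                found.append((r, c))
    return found


# Hypothetical pressure grid in hPa for a single time step.
pressure_hpa = [
    [1012, 1011, 1010, 1011, 1012],
    [1011, 1004, 1008, 1010, 1011],
    [1010, 1008,  998, 1006, 1010],
    [1011, 1009, 1005, 1009, 1011],
    [1012, 1011, 1010, 1011, 1012],
]

candidates = local_minima(pressure_hpa, max_value=1000)
print(candidates)
```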
Biosequence Databases
The largest DNA database is GENBANK, with about 400 million letters of DNA from a variety of organisms.
The pressing data mining tasks for biosequence data include finding genes in the DNA sequences of various organisms.
Some gene-finding programs, such as GRAIL, GeneID, GeneParser, and Genie, use neural nets and other AI or statistical methods.
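
To make the gene-finding task concrete, here is a deliberately naive baseline: scan the forward strand for open reading frames, meaning a start codon (ATG) followed in the same frame by a stop codon. This is a much simpler, swapped-in technique, not how GRAIL, GeneID, GeneParser, or Genie work. The function name, the `min_codons` parameter, and the sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end) index pairs of open reading frames on the forward strand.

    `end` is exclusive and includes the stop codon. `min_codons` counts the
    codons before the stop codon, including the start codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


# Hypothetical sequence containing one short reading frame.
sequence = "CCATGAAACCCGGGTAACC"
orfs = find_orfs(sequence)
for start, end in orfs:
    print(start, end, sequence[start:end])
```

A scan like this ignores the reverse strand, introns, and sequence statistics, which helps explain why practical gene finders turn to learned or statistical models.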
Issues and Challenges
Feature extraction
Minority classes
High degree of confidence
Data mining task
Relevant domain knowledge
Scalable machines and algorithms

Conclusion
KDD applications in science may in general be easier than applications in business, finance, or other areas.
This is because science end users typically know the data in intimate detail.