
Scalable Exploratory Data Mining
of Distributed Geoscientific Data
Authors: E.C. Shek, R.R. Muntz, E. Mesrobian, and K. Ng
by
Sona Srinivasan
Outline
• Introduction
• Geoscientific Data Modeling
• Geoscientific Algebraic Operators
• Physical Data Model
• Parallel Query Execution
• Automatic Query Execution
• Heterogeneous Distributed Data Access
• Implementations and Experiences
• Conclusion
• References
Introduction
• Geoscience studies produce a tremendous amount of raw data
• Analysis involves extracting interesting geoscientific phenomena
that cannot be observed directly in the raw datasets
• Example: cyclone tracks - trajectories traveled by low-pressure areas
over time, which can be extracted from a sea-level pressure dataset
• Both data mining in business applications and geoscientific feature
extraction involve sifting through large volumes of isolated events
and data to locate salient patterns
• Feature extraction is treated as a database query processing problem in
order to take advantage of automatic query optimization and
parallelization techniques
• Conquest - an extensible parallel geoscientific query processing
system
Geoscientific Data Model
Example Geographic Data Field
Geoscientific Data Model
• A field associates parameter values with cells in a
multidimensional coordinate space
• Cells can be of various geometric object types
• Coordinate attributes - the type of cells and the coordinate space
they lie in
• Values for the cells lie in a multidimensional variable space
• Variable attributes - the type of values associated with a cell in the
coordinate space
• A cell record - a cell and the variable value associated with it
• Cell coverage - the set of distinct cells in the coordinate space for
which variable values are recorded
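The terms above can be made concrete with a small sketch. This is only an illustration of the vocabulary, not Conquest's actual data structures: the class names `CellRecord` and `Field`, the attribute names, and the sample pressure values are all assumptions made for the example, and cells are simplified to points.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumption: a cell is a point given by its coordinates (lat, lon, time).
Cell = Tuple[float, ...]

@dataclass(frozen=True)
class CellRecord:
    """A cell together with the variable value recorded for it."""
    cell: Cell
    value: Tuple[float, ...]

@dataclass
class Field:
    """A field: cell records over a coordinate space and a variable space."""
    coordinate_attrs: Tuple[str, ...]   # names the coordinate space
    variable_attrs: Tuple[str, ...]     # names the variable space
    records: list

    def cell_coverage(self):
        """Set of distinct cells for which variable values are recorded."""
        return {r.cell for r in self.records}

# Hypothetical sea-level pressure field with three cell records.
slp = Field(
    coordinate_attrs=("lat", "lon", "time"),
    variable_attrs=("sea_level_pressure_mb",),
    records=[
        CellRecord((30.0, 120.0, 0.0), (1012.5,)),
        CellRecord((30.0, 122.5, 0.0), (1009.8,)),
        CellRecord((30.0, 120.0, 1.0), (1011.1,)),
    ],
)
coverage = slp.cell_coverage()
print(len(coverage))
```

Here the cell coverage contains three distinct cells, one per record.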
Geoscientific Algebraic Operators
• A base set of general purpose logical field data manipulation
operators. Users may introduce operators based on application
specific algorithms
• Set-Oriented Relational operators - Selection, Projection, Cartesian
Product, Union, Intersection, Set Difference, Join
• Sequence-Oriented Operators
• Grouping Operators - Nest and Unnest
• Space Conversion Operators
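As a rough illustration of two of the operator families listed above, the sketch below shows a selection and a Nest/Unnest pair over cell records held as plain dictionaries. The function names, the record layout, and the sample values are assumptions for the example; Conquest's own operator signatures are not given on the slide.

```python
from collections import defaultdict

def select(records, predicate):
    """Set-oriented selection: keep the records that satisfy the predicate."""
    return [r for r in records if predicate(r)]

def nest(records, key):
    """Grouping: collect records that share the same key into one group."""
    groups = defaultdict(list)
    for r in records:
        groups[key(r)].append(r)
    return dict(groups)

def unnest(groups):
    """Inverse of nest: flatten the groups back into a list of records."""
    return [r for members in groups.values() for r in members]

# Hypothetical sea-level pressure records.
records = [
    {"lat": 30.0, "lon": 120.0, "time": 0, "pressure": 1012.5},
    {"lat": 30.0, "lon": 122.5, "time": 0, "pressure": 998.2},
    {"lat": 30.0, "lon": 120.0, "time": 1, "pressure": 995.4},
]

low = select(records, lambda r: r["pressure"] < 1000.0)   # low-pressure cells
by_time = nest(records, key=lambda r: r["time"])          # one group per time step
flat = unnest(by_time)                                    # back to flat records
```

Nesting by time yields one group per time step, which is the kind of grouping a per-snapshot analysis would start from.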
Physical Data Model
Nesting of a Data Field
Parallel Query Execution
• Parallelization techniques are used to remove bottlenecks in I/O and
computation and to improve query performance
  - Pipelined Processing or Dataflow Parallelism
  - Partitioning or Intra-Operator Parallelism
  - Multicasting
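The three techniques can be sketched in miniature. This is a structural illustration only, under several assumptions: the operator names are invented, Python generators stand in for pipelined operators (they show the dataflow shape, not true concurrent execution), and a thread pool stands in for the processors of a parallel machine.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import tee

def scan(values):
    """Source operator: emit records one at a time."""
    for v in values:
        yield v

def to_anomaly(stream, mean):
    """Downstream operator: consumes each record as soon as it is produced."""
    for v in stream:
        yield v - mean

def pipeline(values, mean):
    """Pipelining: records flow scan -> to_anomaly without being materialized."""
    return to_anomaly(scan(values), mean)

def partitioned_sum(values, n_parts):
    """Partitioning: the same operator runs on disjoint parts of the input."""
    parts = [values[i::n_parts] for i in range(n_parts)]
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        return sum(pool.map(sum, parts))

def multicast(stream, n_consumers):
    """Multicasting: one output stream is delivered to several consumers."""
    return tee(stream, n_consumers)

pressures = [1012.0, 1008.0, 1004.0, 1000.0]
anomalies = list(pipeline(pressures, mean=1006.0))
total = partitioned_sum(pressures, n_parts=2)
for_min, for_max = multicast(scan(pressures), 2)
lowest, highest = min(for_min), max(for_max)
```

In a real system each operator instance would run on its own processor; the sketch only shows how the data is routed in each case.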
Query Parallelization
• Window of Relevance - the maximum length of time between the
arrival of an object and the time it ceases to have an effect on
the execution state of the operator
  - Instantaneous
  - Known
  - Random but Bounded
  - Fixed Windows
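To illustrate the definition, the sketch below shows a moving-average operator whose window of relevance is fixed and known in advance. It is an assumption-laden example, not taken from Conquest: time is measured in number of arrivals, and the operator name and input values are invented. A plain selection, by contrast, would have an instantaneous window, since each object affects the operator only while it is being tested.

```python
from collections import deque

def moving_average(stream, window):
    """Average of the last `window` objects seen on the stream.

    An object influences the operator state only until `window` newer
    objects have arrived; after that it is evicted from the buffer.
    """
    buf = deque(maxlen=window)
    for v in stream:
        buf.append(v)              # evicts the oldest object when full
        if len(buf) == window:
            yield sum(buf) / window

averages = list(moving_average([1.0, 2.0, 3.0, 4.0, 5.0], window=3))
print(averages)  # [2.0, 3.0, 4.0]
```

Because the operator never needs more than `window` objects of state, the buffer size gives an upper bound on how much of the stream must be retained.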
Heterogeneous Distributed Data Access
• Only a small percentage of data is analyzed, due to limited
storage and bandwidth and the difficulty of integrating distributed
datasets
• Conquest supports access to datasets both through a distributed object
interface and through repository-specific scanner operators, because
accessing data only through distributed objects eliminates the
opportunity to use the query capabilities of the data repositories
to optimize query evaluation
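The trade-off between the two access paths can be sketched with a toy repository. Everything here is hypothetical: `ToyRepository`, its `all_records` and `query` methods, and the two scan functions are invented for illustration and do not reflect Conquest's interfaces. The point is only that a repository-specific scanner can hand the selection to the repository, so fewer records cross the network.

```python
class ToyRepository:
    """Stand-in for a remote data repository with its own query capability."""

    def __init__(self, records):
        self._records = list(records)

    def all_records(self):
        """Generic object-style access: returns every record."""
        return list(self._records)

    def query(self, t_min, t_max):
        """Repository-side selection on the time attribute."""
        return [r for r in self._records if t_min <= r["time"] <= t_max]

def scan_via_object_interface(repo, t_min, t_max):
    """Fetch everything, then filter at the client."""
    fetched = repo.all_records()
    kept = [r for r in fetched if t_min <= r["time"] <= t_max]
    return kept, len(fetched)       # result, records transferred

def scan_via_repository_scanner(repo, t_min, t_max):
    """Push the selection down to the repository."""
    fetched = repo.query(t_min, t_max)
    return fetched, len(fetched)    # result, records transferred

repo = ToyRepository({"time": t, "pressure": 1000.0 + t} for t in range(10))
generic_result, generic_moved = scan_via_object_interface(repo, 2, 4)
pushed_result, pushed_moved = scan_via_repository_scanner(repo, 2, 4)
```

Both paths return the same three records, but the generic path transfers all ten records while the pushed-down path transfers only the three that match.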
Implementations and Experiences
• Ported to run on the IBM SP1, IBM SP2, and Intel Paragon
• Has been used for the past five years for exploratory data analysis
and data mining of spatio-temporal phenomena produced at UCLA
and also for extraction and analysis of cyclonic activity, blocking
features, and oceanic warm pools.
Number of upward wave propagation trajectories between 500mb and
50mb levels extracted per year
Implementations … (Contd.)
Number of upward wave propagation trajectories between
500mb and 50mb at different latitudes
Conclusion
• Conquest - a geoscientific query processing system that applies
distributed and parallel database query processing techniques to handle
computationally expensive data mining queries on massive datasets
• Helps analyze the large volumes of data to extract the necessary
information
• Query Optimization emphasizes parallelization and optimal data
access
• Future Work - This system is currently being integrated as part of a
larger environment.
References
• E.C. Shek, R.R. Muntz, E. Mesrobian, and K. Ng, "Scalable
Exploratory Data Mining of Distributed Geoscientific Data",
KDD, 1996
• E.C. Shek, E. Mesrobian, and R.R. Muntz, "On Heterogeneous
Distributed Geoscientific Query Processing", Feb. 1996
• F. Fabbrocino, E.C. Shek, and R.R. Muntz, "The Design and
Implementation of the Conquest Query Execution Environment",
July 1997
• E. Mesrobian, et al., "Exploratory Data Mining and Analysis
Using Conquest", May 1995