Mining on the Grid with ADaM

Focus Study:
Mining on the Grid with ADaM
Sara Graves
Sandra Redman
Information Technology and Systems Center
and
Information Technology Research Center
University of Alabama in Huntsville
National Space Science and Technology Center
256-961-7806
www.itsc.uah.edu
Data Mining
• Automated discovery of patterns, anomalies from vast observational data sets
• Derived knowledge for decision making, predictions and disaster response
http://datamining.itsc.uah.edu
Creating a Successful Environment for Data Mining
• Provide scientists with the capabilities to allow the flexibility of creative scientific analysis
• Provide data mining benefits of
  – Automation of the analysis process
  – Reducing data volume
• Provide a framework to allow a well defined structure to the entire process
• Provide a suite of mining algorithms for creative analysis that can adapt to new hypotheses
• Provide capabilities to add science algorithms to the environment
• Exploit emerging technologies in computational and data grids, high-performance networks, and collaborative environments
Challenges for Next-generation Mining
• Develop and document common/standard interfaces for interoperability of data and services
• Design new data models for handling
  – real-time/streaming input
  – data fusion/integration
• Design and develop distributed standardized catalog capabilities
• Develop advanced resource allocation and load balancing techniques
• Exploit the grid concept for enhanced data mining functionality
• Develop more intelligent and intuitive user interfaces
• Integrate with collaborative environments
• Develop ontologies of scientific data, processes and data mining techniques for multiple domains
• Support language and system independent components
• Incorporate data mining into science and engineering curricula
Algorithm Development and Mining System (ADaM) - System Overview
• Consists of over 100 interoperable mining and image processing components
• Each component is provided with a C++ application programming interface (API) and an executable in support of scripting tools (e.g. Perl, Python, Tcl, Shell)
• ADaM components are lightweight and autonomous, and have been used successfully in a grid environment (NASA IPG, TeraGrid, lab)
• ADaM has several translation components that provide data level interoperability with other mining systems (such as WEKA and Orange), and point tools (such as libSVM and svmLight)
• Web service interfaces in development
• Executes in multiple environments (e.g. workstation, cluster, grid, on-board, etc.)
• NMI Integration Testbed test cases
MEAD
Modeling Environment for Atmospheric Discovery
• One of the NSF PACI Alliance research Expeditions
• Expeditions ensure intense collaboration among technology developers and application scientists and focus on the deployment of infrastructure that supports computational science and engineering and science in a variety of disciplines
• MEAD’s focus is on retrospective analysis of hurricanes and severe storms using the TeraGrid, integrating computation, grid workflow management, data management, model coupling, data analysis/mining, and visualization
MEAD Mining Example:
Mesocyclone Detection Algorithm
• Science Objective:
  – To investigate different thunderstorm cell interactions favorable for subsequent tornado (mesocyclone) formation
• Goals:
  – Develop a mesocyclone detection algorithm (in both 2D and 3D)
  – Develop an algorithm to track the temporal evolution of the mesocyclone features
  – Investigate the use of clustering techniques to:
    • Summarize differences in simulation runs
    • Provide an overview of all the simulations
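The slides do not say how the detection algorithm works, so the following is only an illustrative sketch of the first goal in 2D, not the authors' method. It assumes (our assumption) that a candidate feature can be found by thresholding a gridded field, such as vertical vorticity, and grouping adjacent above-threshold cells into features. The function names, the threshold and the sample field are all hypothetical.

```python
from collections import deque

def detect_features(field, threshold):
    """Group 4-connected grid cells whose value exceeds `threshold`.

    `field` is a 2D list of numbers (a hypothetical stand-in for a model
    output slice). Returns a list of features, each a list of (row, col).
    """
    rows, cols = len(field), len(field[0])
    seen = [[False] * cols for _ in range(rows)]
    features = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or field[r][c] <= threshold:
                continue
            # Breadth-first flood fill from an unvisited above-threshold cell.
            queue = deque([(r, c)])
            seen[r][c] = True
            cells = []
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and not seen[ny][nx] and field[ny][nx] > threshold:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            features.append(cells)
    return features

def centroid(cells):
    """Mean (row, col) position of a feature's cells."""
    n = len(cells)
    return (sum(y for y, _ in cells) / n, sum(x for _, x in cells) / n)

# Hypothetical 4x4 field with two separate above-threshold regions.
sample = [
    [0, 0, 0, 0],
    [0, 5, 6, 0],
    [0, 0, 0, 0],
    [7, 0, 0, 0],
]
found = detect_features(sample, threshold=1)
```

A 3D variant would add the two vertical neighbours to the neighbour list; the centroids are what a tracking step would follow from one time step to the next.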
Approach
• Mining Approach
  – Use idealized WRF model simulations with different initial conditions
  – Create a large parameter space of thunderstorm cell interaction and storm behavior
  – Mine this search space for patterns and trends
• Grid Approach
  – Application scripts developed in Python and tested on Linux; modified for Globus environment by writing a simple Globus RSL file
  – Application scripts constructed to run each combination of tools in parallel on a different node on the grid
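As a rough illustration of the grid approach, the sketch below enumerates every combination of two sets of tools and writes one minimal RSL-style job description per combination, assigning jobs to nodes in round-robin order. The tool names, node names, script path and output file names are hypothetical placeholders. The attribute names (executable, arguments, stdout, count) follow the classic Globus GRAM RSL form, but the deck's actual RSL file is not shown, so this is an assumption about its general shape.

```python
from itertools import product

def make_rsl(executable, arguments, stdout_path):
    """Format a minimal GRAM-style RSL job description string."""
    args = " ".join(f'"{a}"' for a in arguments)
    return (f"&(executable={executable})"
            f"(arguments={args})"
            f"(stdout={stdout_path})"
            f"(count=1)")

def plan_jobs(detectors, clusterers, nodes, script="/path/to/mine.py"):
    """Pair each tool combination with a node and an RSL string.

    `script` is a hypothetical path to the Python application script.
    Nodes are reused in round-robin order if there are more
    combinations than nodes.
    """
    jobs = []
    for i, (det, clu) in enumerate(product(detectors, clusterers)):
        node = nodes[i % len(nodes)]
        rsl = make_rsl(script, [det, clu], f"out_{det}_{clu}.txt")
        jobs.append((node, rsl))
    return jobs

# Hypothetical tool and node names, for illustration only.
jobs = plan_jobs(["detect2d", "detect3d"], ["kmeans", "dbscan"],
                 ["nodeA", "nodeB", "nodeC"])
for node, rsl in jobs:
    print(node, rsl)
```

Each RSL string would then be handed to whatever submission mechanism the site provides; the sketch stops at planning and does not submit anything.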
Example MEAD Workflow
[Workflow diagram: Initial Setup (initial data and parameters) -> Model Execution (multiple WRF weather models and multiple ROMS ocean models, with inter-model communications) -> Model Results -> Post Run Analysis (data mining with ADaM, visualization)]
• Grid environment supports the demanding computational, data storage and post analysis requirements
Using the TeraGrid
• Excellent user documentation at http://www.teragrid.org/userinfo/
• Account Management - Procedures vary per site
  – Get account at each site
  – Obtain certificate (from one of several sites, X.509 or KX.509)
  – Establish Distinguished Name in grid-mapfile at each site
  – Create certificate proxy (grid-proxy-init, MyProxy, kinit)
• Programming Environment – Know your systems
  – Compilers (you have a number of choices)
  – Environment Variables (SoftEnv)
  – Message Passing (several flavors available)
• Executing Jobs
  – Condor-G
  – Globus
WRF Initializations
• 230 WRF runs were made, plus two control (single-cell) runs
• Each corresponded to a particular arrangement of a pair of initial storm cells
Matrix of WRF simulations
• In figure at left:
  • Each square: 1 simulation
  • 1st storm in the middle
  • 2nd at one of blue squares
  • Center cell stronger
Slide Source: Brian Jewett
Example: Tracking Results
Mesocyclone Detection and Tracking Results
• Features with time durations of a single time step are filtered out
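The filtering rule above can be sketched as follows. The tracking step shown here is an assumed greedy nearest-neighbour linker, not necessarily the algorithm used in the study. The function names, the `max_dist` parameter and the sample centroids are hypothetical.

```python
import math

def track_features(frames, max_dist):
    """Link feature centroids across time steps by nearest neighbour.

    `frames` holds one list of (y, x) centroids per time step.
    Returns a list of tracks, each a list of (step, (y, x)) entries.
    """
    tracks = []   # every track started so far
    active = []   # indices of tracks that were extended at the previous step
    for step, centroids in enumerate(frames):
        unmatched = list(centroids)
        next_active = []
        for ti in active:
            if not unmatched:
                break  # remaining active tracks end here
            _, last = tracks[ti][-1]
            best = min(unmatched, key=lambda p: math.dist(p, last))
            if math.dist(best, last) <= max_dist:
                tracks[ti].append((step, best))
                unmatched.remove(best)
                next_active.append(ti)
        # Any centroid not claimed by an existing track starts a new one.
        for p in unmatched:
            tracks.append([(step, p)])
            next_active.append(len(tracks) - 1)
        active = next_active
    return tracks

def drop_single_step(tracks):
    """Keep only features that persist for more than one time step."""
    return [t for t in tracks if len(t) > 1]

# Hypothetical centroids for three time steps.
frames = [[(0, 0), (10, 10)], [(0, 1)], [(0, 2), (5, 5)]]
kept = drop_single_step(track_features(frames, max_dist=2))
```

In this toy input only the feature drifting from (0, 0) to (0, 2) survives; the two features seen at a single time step are discarded.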
Summary – Mesocyclone Detection
• Mesocyclones with longer duration tend to be associated with initializations where the second cell is closer to the first
• Mesocyclones found in the storm simulations are sensitive to the particular arrangement of a pair of initial storm cells (secondary storm placement at 45 degrees to the primary storm)
• Clustering techniques are useful
  – Summarize differences in simulation runs
  – Provide an overview of all the simulations
• Limitations of clustering algorithms
  – Investigated K-Means, DBSCAN, Maximin and Hierarchical clustering algorithms
  – K-Means clustering quality is inferior but provides useful cluster centers or profiles
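To make the "cluster centers or profiles" point concrete, here is a minimal K-Means sketch in plain Python. It assumes each simulation run has already been reduced to a numeric feature vector; the feature choice and the sample values are hypothetical. It is not the ADaM implementation.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Basic K-Means over tuples of floats.

    Returns (centers, clusters), where clusters come from the last
    assignment pass. Initial centers are a seeded random sample.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[j].append(p)
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centers.append(tuple(
                    sum(m[d] for m in members) / len(members) for d in range(dim)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers, clusters

# Hypothetical two-feature summaries of six simulation runs.
runs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, clusters = kmeans(runs, k=2)
```

Each returned center is the mean feature vector of its cluster, which is what makes it usable as a profile of a group of runs even when the partition itself is imperfect.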
LEAD
Linked Environments for Atmospheric Discovery
A cyberinfrastructure for mesoscale meteorology
– Real-time, on-demand, and dynamically adaptive needs for mesoscale weather research
– High volume data sets and streams
– Computationally demanding numerical models and data assimilation systems
LEAD
• NSF Information Technology Research (ITR) program
• Multi-disciplinary team contributing expertise in meteorological applications, analysis tools, forecast tools, data distribution and management, portal development, workflow orchestration, education and outreach
LEAD
• An integrated framework for identifying, accessing, preparing, assimilating, predicting, managing, analyzing, mining, and visualizing meteorological data, independent of format and physical location
• Dynamic workflow orchestration and data management are key elements
LEAD GWSTBs
Grid and Web Services Testbeds
– Local User Environment – customized portal, control of information flows, collaboration tools, managing processes
– Productivity Environment – models, tools, and algorithms
– Data Services Environment – data transport, data formatting, and interoperability
– Distributed Technologies Environment – workflow infrastructure to autonomously acquire resources and adapt to changing plans
– Data Archive – recent and historical data, products, and tools
The Portal as a Grid Access Point
• The Portal Server provides the user's Grid Context.
[Architecture diagram: Grid Portal Server (OGCE or GridSphere), reached over https, uses SOAP & WS-Security to call the Open Grid Service Architecture Layer (Registries and Name Binding, Reservations and Scheduling, Data Management Service, Security, Policy, Administration & Monitoring, Event Service, Logging, Accounting Service, Grid Orchestration), which sits on the Web Services Resource Framework and Web Services Notification, above the Physical Resource Layer]
Services Oriented Architecture
• User interfaces with portal via browser
• Portal provides tools for users to build and launch workflows
• Portlets (JSR-168) provide interface between user and grid services
• Applications can be wrapped as services via a Portal Factory Service Generator
  – Requires application, script to run it, input parameters, output parameters
  – Write an AppService document and upload to Portal Factory Service Generator (in portal)
  – Service is created as well as the portal client interface
• Security model integral to design
Data Integration and Mining:
From Global Information to Local Knowledge
• Emergency Response
• Bioinformatics
• Precision Agriculture
• Urban Environments
• Weather Prediction