D2K Driven Application - Duke Univ. Visualization Technology Group
D2K – Data To Knowledge
March 19, 2004
Duke University
Loretta Auvil
Automated Learning Group
National Center for Supercomputing Applications
University of Illinois
217.265.8021
[email protected]
Outline
• Overview of Data Mining
• Overview of D2K Functionality
• D2K Toolkit
  • MAIDS – Mining Streaming Data
• D2K Driven Application
  • ThemeWeaver – Mining Text Data
  • MAEViz – Visualizing Earthquake Damage Analysis
• D2K Streamline (SL)
  • EMO – Finding Optimal Decisions
• D2K Web Service
  • Phylomat – Finding Motifs in Sequences
alg | Automated Learning Group
ALG Mission
The specific mission of the Automated Learning Group is:
• To collaborate with researchers to develop novel computer methods and the scientific foundation for using historical data to improve future decision making
• To work closely with industrial, government, and academic partners to explore new application areas for such methods, and
• To transfer the resulting software technology into real world applications
ALG Research, Development, & Technology Transfer Model
Overview of Knowledge Discovery
What is It?
Knowledge Discovery in Databases is the non-trivial process of
identifying valid, novel, potentially useful, and ultimately
understandable patterns in data
• The understandable patterns are used to:
  • Make predictions about or classifications of new data
  • Explain existing data
  • Summarize the contents of a large database to support decision making
  • Create graphical data visualization to aid humans in discovering complex patterns
Overview of Knowledge Discovery
Why Do We Need Data Mining?
• Data volumes are too large for classical analysis approaches:
  • Large number of records (10^8 – 10^12 bytes)
  • High dimensional data (10^2 – 10^4 attributes)
• How do you explore millions of records with tens, hundreds, or thousands of fields, and find patterns?
• As databases grow, using traditional query languages for the decision support process becomes infeasible
• Many queries of interest are difficult to state in a query language (query formulation problem)
  • “Find all cases of fraud”
  • “Find all individuals likely to buy a Ford Explorer”
  • “Find all documents that are similar to this customer’s problem”
Overview of Knowledge Discovery
Knowledge Discovery Process
Overview of Knowledge Discovery
Required Effort for each KDD Step
[Bar chart: Effort (%), on a scale of 0 to 60, for each KDD step: Objectives Determination, Data Preparation, Data Mining, Interpretation/Evaluation. Arrows indicate the direction we want the effort to go.]
Overview of Knowledge Discovery
Three Primary Paradigms
• Predictive Modeling – supervised learning approach where classification or prediction of one of the attributes is desired
  • Classification is the prediction of predefined classes
    – Naive Bayesian, Decision Trees, and Neural Networks
  • Regression is the prediction of continuous data
    – Neural Networks, and Decision (Regression) Trees
• Discovery – unsupervised learning approach for exploratory data analysis
  • Association Rules and Link Analysis
  • Clustering and Self Organizing Maps
• Deviation Detection – identifying outliers in the data
  • Visualization
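A minimal sketch of the three paradigms on toy data, using only the Python standard library. The function names are illustrative and not part of D2K. A nearest-centroid rule stands in for the classifiers named on the slide, a two-cluster k-means stands in for the clustering methods, and a z-score test stands in for deviation detection.

```python
from statistics import mean, pstdev


def train_centroids(rows, labels):
    """Predictive modeling: learn one mean vector per predefined class."""
    grouped = {}
    for row, label in zip(rows, labels):
        grouped.setdefault(label, []).append(row)
    return {label: [mean(col) for col in zip(*members)]
            for label, members in grouped.items()}


def classify(centroids, row):
    """Predict the class whose centroid is nearest (squared Euclidean distance)."""
    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], row))
    return min(centroids, key=distance)


def two_means(values, iterations=10):
    """Discovery: split unlabeled 1-D values into two clusters (k-means, k=2)."""
    lo, hi = min(values), max(values)
    left, right = list(values), []
    for _ in range(iterations):
        left = [v for v in values if abs(v - lo) <= abs(v - hi)]
        right = [v for v in values if abs(v - lo) > abs(v - hi)]
        if not right:  # all values identical: nothing to split
            break
        lo, hi = mean(left), mean(right)
    return left, right


def outliers(values, threshold=2.0):
    """Deviation detection: flag values more than `threshold` std devs from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]
```

The supervised function needs labels at training time; the other two work on unlabeled values, which is the distinction the slide draws.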
Importance of Data Mining Framework
• Provides capability to build custom applications
• Provides access to data management tools
• Contains data mining algorithms for prediction and discovery
• Provides data transformations for standard operations
• Supports an extensible interface for creating one’s own algorithms
• Provides means for building and applying models
• Provides integrated visualization components
• Provides access to distributed computing capabilities
D2K Overview
D2K - Data To Knowledge
D2K is a flexible data mining system that integrates
effective analytical data mining methods for prediction,
discovery, and anomaly detection with data management
and information visualization
D2K Overview
D2K and Its Many Components
• D2K Infrastructure
  D2K API, data flow environment, distributed computing framework and runtime system
• D2K Modules
  Computational units written in Java that follow the D2K API
• D2K Itineraries
  Modules that are connected to form an application
• D2K Toolkit
  User interface for specification of itineraries and execution that provides the rapid application development environment
• D2K-Driven Applications
  Applications that use D2K modules with a custom user interface
• D2K Streamline (SL)
  Task driven system that uses D2K modules
• D2K Web/Grid Services
  Enables web deployment
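The relationship between modules and itineraries can be illustrated with a small data-flow sketch. This is not the D2K API (D2K modules are written in Java); the `Module` and `Itinerary` classes below are hypothetical and only show the idea that a module runs once all of its inputs are available.

```python
class Module:
    """A computational unit fed by the outputs of named upstream modules (hypothetical)."""

    def __init__(self, name, func, inputs=()):
        self.name = name
        self.func = func
        self.inputs = tuple(inputs)  # names of the modules whose outputs feed this one


class Itinerary:
    """Modules connected into a data flow; a module fires once all its inputs are ready."""

    def __init__(self):
        self.modules = {}

    def add(self, module):
        self.modules[module.name] = module
        return self

    def run(self):
        results = {}
        pending = dict(self.modules)
        while pending:
            ready = [m for m in pending.values()
                     if all(dep in results for dep in m.inputs)]
            if not ready:
                raise ValueError("itinerary has a cycle or a missing module")
            for m in ready:
                results[m.name] = m.func(*(results[dep] for dep in m.inputs))
                del pending[m.name]
        return results


# Usage: three modules wired load -> sort -> largest.
itinerary = (Itinerary()
             .add(Module("load", lambda: [3, 1, 2]))
             .add(Module("sort", sorted, inputs=["load"]))
             .add(Module("largest", lambda xs: xs[-1], inputs=["sort"])))
results = itinerary.run()
```

Keeping the wiring separate from the module code is what lets the same modules be reused across applications with different front ends.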
D2K Overview
D2K Toolkit
Major features that D2K provides to an application developer include:
• Visual programming system employing a data flow paradigm
• Scalable distributed computing capabilities
• Flexible and extensible software development environment
• Multi-layered learning strategies
• Integrated environment for models and visualization
• Web service capabilities for deployment
D2K Overview
D2K Modules
Input Module: Loads data from the outside world
• Flat files, database, etc.
Data Prep Module: Performs functions to select, clean, or transform the data
• Binning, Normalizing, Feature Selection, etc.
Compute Module: Performs main algorithmic computations
• Naïve Bayesian, Decision Tree, Apriori, etc.
User Input Module: Requires interaction with the user
• Data Selection, Input and Output selection, etc.
Output Module: Saves data to the outside world
• Flat files, databases, etc.
Visualization Module: Provides visual feedback to the user
• Naïve Bayesian, Rule Association, Decision Tree, Parallel Coordinates, 2D Scatterplot, 3D Surface Plot
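As a rough illustration of the input and data prep roles, the sketch below reads a numeric column from CSV text and applies two of the transformations named above, normalizing and binning. The function names are hypothetical; min-max scaling and equal-width bins are assumed here, since the slides do not say which variants D2K uses.

```python
import csv
import io


def load_column(csv_text, column):
    """Input step: read one numeric column from CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [float(row[column]) for row in reader]


def normalize(values):
    """Data prep step: min-max scale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Data prep step: map each value to the index of its equal-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Usage: load, then prepare the same column two ways.
ages = load_column("age,income\n20,100\n30,200\n40,300\n", "age")
scaled = normalize(ages)
binned = equal_width_bins(ages, 2)
```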
D2K Overview
D2K Module Icon Description
Module Progress Bar
Appears during execution to show the percentage of time that this module executed over the entire execution time. It is green when the module is executing and red when not.
Input Port
Rectangular shapes on the left side of the module represent the inputs for the module. They are colored according to the data type that they represent.
Output Port
Rectangular shapes on the right side of the module represent the outputs for the module. They are colored according to the data type that they represent.
Properties Symbol
If a “P” is shown in the lower left corner of the module, then the module has properties that can be set before execution.
Current ALG Projects
MAIDS: Mining Alarming Incidents in Data Streams
Stream Characteristics
• Huge volumes of continuous data, possibly infinite
• Fast changing and requires fast, real-time response
• Data stream captures nicely our data processing needs of today
• Random access is expensive: single linear scan algorithm (can only have one look)
• Store only the summary of the data seen thus far
• Most stream data are low-level or multidimensional in nature and need multi-level, multidimensional processing
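The single-scan and summary-only constraints can be sketched as follows. This is not the MAIDS algorithm; it is a minimal illustration that uses Welford's online update to keep a running mean and variance, and flags a value as alarming when it lies far from the summary of the data seen so far. The names `RunningSummary` and `alarming`, and the threshold and warm-up defaults, are assumptions.

```python
import math


class RunningSummary:
    """Constant-size summary of a stream: count, mean, and variance (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def stddev(self):
        return math.sqrt(self._m2 / self.count) if self.count else 0.0


def alarming(stream, threshold=3.0, warmup=5):
    """Yield values more than `threshold` std devs from the running mean.

    Each value is looked at exactly once and is never stored.
    """
    summary = RunningSummary()
    for x in stream:
        sd = summary.stddev
        if summary.count >= warmup and sd > 0 and abs(x - summary.mean) > threshold * sd:
            yield x
        summary.update(x)
```

Because `alarming` is a generator, it works on an unbounded iterator as well as on a list; memory use stays fixed regardless of stream length.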
Using D2K Toolkit
MAIDS
Current ALG Projects
Text Mining
• Information Retrieval
  • Indexing and retrieval of textual documents
  • Finding a set of (ranked) documents that are relevant to the query
• Information Extraction
  • Extraction of partial knowledge in the text
• Web Mining
  • Indexing and retrieval of textual documents and extraction of partial knowledge using the web
• Classification
  • Predict a class for each text document
• Clustering
  • Generating collections of similar text documents
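The information retrieval task, finding a ranked set of documents relevant to a query, can be sketched with TF-IDF weights and cosine similarity. This is a common textbook technique, assumed here for illustration; the slides do not state which weighting T2K or ThemeWeaver use. Tokenization is a plain whitespace split and the smoothed IDF formula is one of several in use.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (dict) per document, plus the IDF table."""
    tokenized = [tokenize(d) for d in documents]
    n = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n) / (1 + df)) + 1 for term, df in doc_freq.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()}
               for tokens in tokenized]
    return vectors, idf


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, documents):
    """Return indices of documents with nonzero similarity, most similar first."""
    vectors, idf = tfidf_vectors(documents)
    q = {t: c * idf.get(t, 0.0) for t, c in Counter(tokenize(query)).items()}
    scored = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return [i for score, i in sorted(scored, key=lambda s: (-s[0], s[1])) if score > 0]
```

The same vectors and similarity measure could feed a clustering step that groups similar documents.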
Using D2K Driven Application
Text Mining: Views from T2K and ThemeWeaver
Using D2K Driven Application
MAEViz: Damage Synthesis Visualization
• Displays terrain map
• Loads hazard, inventory, and fragility data
• Shows contour map of ground acceleration (hazard)
• Displays cones/bars to indicate level of damage
• Overlays shapefiles of different information
• Uses VTK for 3D
• Uses CUBE at BI
D2K SL
D2K Streamline (D2K SL)
• Provides a step-by-step interface to guide the user in data analysis
• Supports return to earlier steps to run with different parameters
• Uses the D2K infrastructure transparently
• Uses same D2K modules
• Provides a way to capture different experiments
Using D2K SL
EMO – Evolutionary Multiobjective Optimization
• Identify tradeoffs among complex objectives
• Apply a genetic algorithm (GA) optimization in a general framework
• Guide the user through discrete steps for defining decision variables, fitness functions, constraints, and setting up GA parameters
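What "tradeoffs among complex objectives" means can be shown with a Pareto dominance check. The sketch below is not the GA that EMO runs; it only filters a set of candidate solutions down to the non-dominated ones, assuming every objective is to be minimized. The function names are illustrative.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` on every objective and better on at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Keep the points that no other point dominates (the tradeoff set)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Usage: each tuple is (cost, risk) for one candidate decision.
candidates = [(1, 5), (2, 3), (3, 4), (4, 1), (5, 5)]
front = pareto_front(candidates)
```

A multiobjective GA typically uses a test like `dominates` when comparing candidate solutions, so that the population moves toward this front rather than toward a single best score.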
D2K Web Service Architecture
• Any web-enabled client can connect to and use the D2K Web Service by sending SOAP messages over HTTP.
• Itineraries and modules are stored on the web service machine and loaded over the network by the D2K Servers.
• Job results are also stored in the web service tier.
  • Results are returned to clients upon request.
• A relational database is used by the web service to look up accounts, itineraries, servers, and jobs.
• Remote D2K Servers handle itinerary processing. If possible, modules should load any data from remote locations.
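A client-side sketch of "sending SOAP messages over HTTP", using only the Python standard library. The endpoint URL, the service namespace, the `runItinerary` operation, and its parameter are all hypothetical placeholders; the actual D2K Web Service operations are not given in these slides. The request is built but not sent.

```python
import urllib.request
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:d2k"                         # hypothetical service namespace


def build_envelope(operation, params):
    """Wrap one operation call and its parameters in a SOAP envelope."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, name).text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def build_request(url, operation, params):
    """Prepare (but do not send) the HTTP POST that would carry the envelope."""
    payload = build_envelope(operation, params).encode("utf-8")
    headers = {
        "Content-Type": "text/xml; charset=utf-8",
        "SOAPAction": f"{SERVICE_NS}#{operation}",
    }
    return urllib.request.Request(url, data=payload, headers=headers, method="POST")


# Usage: hypothetical endpoint and operation; pass the request to
# urllib.request.urlopen(request) to actually send it.
request = build_request("http://localhost:8080/d2k", "runItinerary",
                        {"itinerary": "demo"})
```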
Using D2K Web Service
Phylomat (Motif Analysis Tool for Phylogenomics)
The ALG Team
Staff
Loretta Auvil
Peter Bajcsy
Colleen Bushell
Dora Cai
David Clutter
Lisa Gatzke
Vered Goren
Chris Navarro
Greg Pape
Tom Redman
Duane Searsmith
Andrew Shirk
Anca Suvaiala
David Tcheng
Michael Welge
Students
Ritesh Agrawal
Tyler Alumbaugh
John Cassel
Sang-Chul Lee
Xiaolei Li
Jeff Ng
Scott Ramon
Martin Urban
Bei Yu
Hwanjo Yu
Licensing D2K
• Faculty, staff and students at US academic institutions will be able to license and use D2K for free by downloading from alg.ncsa.uiuc.edu
• Private Sector Partners who have provided funding for projects related to D2K will be able to license and use D2K for free
• Private Sector Partners who have not provided funding will be able to license and use D2K for a discounted fee
Contact John McEntire
Office of Technology Management
308 Ceramics Building, MC-243
105 South Goodwin Avenue
Urbana, Illinois 61801-2901
(217) 333-3715
[email protected]