PragmaD2Ktutorial

From D2K to SEASR
Overview
September 27, 2007
Loretta Auvil
Automated Learning Group, NCSA
University of Illinois, Urbana-Champaign
ALG Mission
The specific mission of the Automated Learning Group is:
• To collaborate with researchers to develop novel computer methods and the scientific foundation for using historical data to improve future decision making
• To work closely with industrial, government, and academic partners to explore new application areas for such methods, and
• To transfer the resulting software technology into real-world applications
Knowledge Discovery Process
Required Effort for each KDD Step
Arrows indicate the direction we want the effort to go.
[Figure: chart of Effort (%) on a 0 to 60 scale for each KDD step: Objectives Determination, Data Preparation, Data Mining, Interpretation/Evaluation]
Three Primary Paradigms
• Predictive Modeling – supervised learning approach where classification or prediction of one of the attributes is desired.
– Classification is the prediction of predefined classes
• e.g. Naïve Bayesian, Decision Trees, and Neural Networks
– Regression is the prediction of continuous data
• e.g. Neural Networks and Decision (Regression) Trees
• Discovery – unsupervised learning approach for exploratory data analysis.
– e.g. Association Rules, Link Analysis, Clustering, and Self-Organizing Maps
• Deviation Detection – identifying outliers in the data.
– e.g. Visualization
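To make the predictive modeling paradigm concrete, here is a minimal sketch of a Naïve Bayesian classifier for categorical attributes. It is not D2K code: the function names and the toy weather data are assumptions made up for illustration, and add-one (Laplace) smoothing is one common choice among several.

```python
from collections import Counter, defaultdict
import math


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(attribute index, class)][value] -> count
    value_counts = defaultdict(Counter)
    # distinct values seen per attribute, used for Laplace smoothing
    distinct = defaultdict(set)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            distinct[i].add(value)
    return class_counts, value_counts, distinct


def classify(model, row):
    """Return the class with the highest log posterior for one row."""
    class_counts, value_counts, distinct = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for i, value in enumerate(row):
            # Laplace (add-one) smoothing avoids zero probabilities
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(distinct[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy historical data (invented for illustration): (outlook, windy) -> play?
rows = [("sunny", "no"), ("sunny", "yes"), ("rainy", "yes"),
        ("rainy", "no"), ("overcast", "no"), ("overcast", "yes")]
labels = ["yes", "no", "no", "yes", "yes", "yes"]

model = train_naive_bayes(rows, labels)
prediction = classify(model, ("rainy", "yes"))
print(prediction)
```

The labels column is what makes this supervised: the model learns from historical rows whose class is already known, then predicts the class of a new row.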
D2K- Framework for Data Analysis
• Provides scalable environment from the Desktop to Web Services
• Employs a visual programming system for data/work flow paradigm
• Provides capability to build custom applications
• Provides capability to access data management tools
• Contains data mining algorithms for prediction and discovery
• Provides data transformations for standard operations
• Integrated environment for models and visualization
• Supports an extensible interface for creating one’s own algorithms
• Provides access to distributed computing capabilities
D2K Components
• D2K Infrastructure
– Itinerary execution engine
• D2K-Driven Applications
– Applications that make use of the D2K Infrastructure
– Toolkit is a D2K-Driven app
• D2K Server
– Special kind of D2K-Driven app
– Wraps the infrastructure to provide remote itinerary and module execution
– Used by the Toolkit to distribute module execution
• D2K Web Service
– Provides a generic programmatic interface for executing itineraries
– Communicates with D2K Servers over socket connections using D2K-specific protocols
D2K Streamline (D2K SL)
• Provides a step-by-step interface to guide the user in data analysis
• Supports return to earlier steps to run different parameters
• Uses the D2K infrastructure transparently
• Uses the same D2K modules
• Provides a way to capture different experiments
• Defines templates that can be reused in different experiments
D2K Web Service Architecture
• Any web-enabled client can connect to and use the D2K Web Service by sending SOAP messages over HTTP.
• Itineraries and modules are stored on the web service machine and loaded over the network by the D2K Servers.
• Job results are also stored in the web service tier.
– Results are returned to clients upon request.
• A relational database is used by the web service to look up accounts, itineraries, servers, and jobs.
• Remote D2K Servers handle itinerary processing. If possible, modules should load any data from remote locations.
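As a rough illustration of what "sending SOAP messages over HTTP" involves on the client side, the sketch below builds a SOAP 1.1 envelope with Python's standard library. Only the SOAP envelope namespace is standard. The `urn:example:d2k-web-service` namespace, the `runItinerary` operation, and its `account` and `itinerary` elements are hypothetical names invented here, not the actual D2K Web Service interface.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # standard SOAP 1.1 namespace
D2K_NS = "urn:example:d2k-web-service"  # hypothetical, for illustration only


def build_run_itinerary_request(account, itinerary_name):
    """Build a SOAP 1.1 envelope asking a service to run an itinerary."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    # "runItinerary", "account", and "itinerary" are invented element names
    request = ET.SubElement(body, f"{{{D2K_NS}}}runItinerary")
    ET.SubElement(request, f"{{{D2K_NS}}}account").text = account
    ET.SubElement(request, f"{{{D2K_NS}}}itinerary").text = itinerary_name
    return ET.tostring(envelope, encoding="unicode")


message = build_run_itinerary_request("demo-user", "example-itinerary")
print(message)
```

A real client would POST this string to the service endpoint with a `text/xml` content type. The sketch stops before the network call because the endpoint and message schema are not given in these slides.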
Creating Customer Value
Prediction
• Industrial manufacturer
– Computed customer buying propensities
– Achieved 25% conquest customer sales lift by executing directed cross/upsell, resulting in $65 million in incremental revenue
Discovery
• Automotive manufacturer
– Identified patterns of inappropriate warranty work in dealer channel
– Targeted $200M+ of potentially unnecessary annual expense
Monitoring
• Department store retailer
– Watched POS transaction flow for unusual variations
– Deterred inappropriate behavior and fraudulent transactions
– Resulted in savings of over $125 million
Applications Examples
Comparative Genomics
Harris A. Lewin explains that Evolution Highway allows one to look ". . . at the whole genome at once - multiple chromosomes across multiple species. The insights wouldn't have come so quickly if we couldn't throw the data at this framework from NCSA."
Science, Vol. 309, Issue 5734, Pages 613-617, 22 July 2005
Music Analysis
J. Stephen Downie, The Scientific Evaluation of Music Information Retrieval Systems: Foundations and Future, Computer Music Journal, Vol. 28, No. 2, Pages 12-23, Summer 2004
Astronomy
Nicholas M. Ball, Robert J. Brunner, Adam D. Myers, and David Tcheng, Robust Machine Learning Applied to Astronomical Data Sets. I. Star-Galaxy Classification of the Sloan Digital Sky Survey DR3 Using Decision Trees, The Astrophysical Journal, Vol. 650, Part 1, Pages 497–509, 2006
D2K- Lineage
[Figure: timeline from 1996 to 2009 showing the lineage of D2K technologies across NCSA, RiverGlass, and One Llama engagements]
● D2K / Data to Knowledge: DataMining
● D2K Streamline
● T2K / ThemeWeaver: TextMining
● Full Multi-language
● I2K / Image to Knowledge: ImageMining
● M2K / Music to Knowledge: Audio Mining
● MAIDS / Mining Alarming Incidents from Data Streams: StreamMining
● RiverGlass Recon™
● RiverGlass Detect™
● MotionMining
● One Llama Media
● GeoSpatial
● Sensors/RFID
● Multimedia
Other figure labels: WebAcquire, InferenceEng., Fed.Query, Music Analysis, Interface, Visualization
Future: Research, Technology, Applications
RiverGlass, Inc.
D2K ToolKit
1. Workspace
2. Resource Panel
3. Modules
4. Models
5. Itineraries
6. Visualizations
7. Generated Visualizations
8. Generated Models
9. Component Information
10. Toolbar
11. Console
D2K Basic
• Set of D2K Modules to perform data mining techniques
– Prediction
• Decision Trees
– C4.5 Decision Tree, Continuous Decision Tree, SQL Rain Forest Decision Tree
• Naïve Bayesian Classification and SQL Naïve Bayesian Classification
• Neural Networks
– Discovery
• Rule Association
– Apriori, FP Growth, Htree
• Clustering
– Hierarchical Agglomerative, K-means, Coverage, etc.
• Includes visualizations for many of the modeling approaches
• Includes a set of data transformations
– Attribute selection, binning, filtering, attribute construction
• Includes optimization strategy for searching parameter space
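Apriori, one of the rule association algorithms listed above, can be sketched in a few lines. This is a textbook-style illustration, not the D2K Basic module: the function name, the absolute `min_support` count, and the toy baskets are assumptions made for the example. It finds frequent itemsets only and leaves out the later step of deriving rules from them.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for all itemsets meeting min_support."""
    transactions = [frozenset(t) for t in transactions]

    def support_counts(candidates):
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    # Level 1: frequent single items
    items = {item for t in transactions for item in t}
    frequent = support_counts({frozenset([item]) for item in items})
    result = dict(frequent)

    k = 2
    while frequent:
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets
        previous = list(frequent)
        candidates = {a | b for a, b in combinations(previous, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        }
        frequent = support_counts(candidates)
        result.update(frequent)
        k += 1
    return result


# Toy market-basket data, invented for illustration
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
frequent_itemsets = apriori(baskets, min_support=3)
```

The prune step is what distinguishes Apriori from brute-force counting: a candidate is only counted against the data if all of its smaller subsets were already found frequent.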
D2K Modules
Input Module: Loads data from the outside world.
– Flat files, database, etc.
Data Prep Module: Performs functions to select, clean, or transform the data.
– Binning, Normalizing, Feature Selection, etc.
Compute Module: Performs main algorithmic computations.
– Naïve Bayesian, Decision Tree, Apriori, FP Growth, etc.
User Input Module: Requires interaction with the user.
– Data Selection, Input and Output selection, etc.
Output Module: Saves data to the outside world.
– Flat files, databases, etc.
Visualization Module: Provides visual feedback to the user.
– Naïve Bayesian, Rule Association, Decision Tree, Parallel Coordinates, 2D Scatterplot, 3D Surface Plot
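The module roles above suggest a data flow in which each module's output feeds the next module's input. The following sketch mimics that flow with plain Python functions. D2K modules are not written this way: every function name here is hypothetical, the "compute" step is a stand-in that only counts rows per bin, and the data is invented.

```python
import csv
import io


def input_module(text):
    """Input: load rows from a flat-file source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(text)))


def data_prep_module(rows, column, edges):
    """Data prep: bin a numeric column into labeled intervals."""
    prepared = []
    for row in rows:
        value = float(row[column])
        # The bin index is the number of edges the value meets or exceeds
        bin_index = sum(value >= edge for edge in edges)
        prepared.append({**row, column: f"bin{bin_index}"})
    return prepared


def compute_module(rows, column):
    """Compute: a stand-in 'algorithm' that counts rows per bin."""
    counts = {}
    for row in rows:
        counts[row[column]] = counts.get(row[column], 0) + 1
    return counts


def output_module(counts):
    """Output: serialize results back to a flat-file format."""
    lines = [f"{name},{count}" for name, count in sorted(counts.items())]
    return "\n".join(lines)


def run_itinerary(source_text):
    """Chain the modules so each output feeds the next module's input."""
    rows = input_module(source_text)
    binned = data_prep_module(rows, column="age", edges=[30, 50])
    counts = compute_module(binned, column="age")
    return output_module(counts)


# Toy data, invented for illustration
source = "name,age\nann,25\nbob,42\ncat,67\ndan,31\n"
report = run_itinerary(source)
print(report)
```

Keeping each stage as a separate unit with a narrow input and output is what lets a visual environment rewire them: swapping the compute stage for a different algorithm leaves the input, data prep, and output stages untouched.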
D2K Module Icon Description
Module Progress Bar
Appears during execution to show the percentage of time that this module executed over the entire execution time. It is green when the module is executing and red when not.
Input Port
Rectangular shapes on the left side of the module represent the inputs for the module. They are colored according to the data type that they represent.
Properties Symbol
If a “P” is shown in the lower left corner of the module, then the module has properties that can be set before execution.
Output Port
Rectangular shapes on the right side of the module represent the outputs for the module. They are colored according to the data type that they represent.
D2K Demo
SEASR: Research, Development, & Technology Transfer Model
SEASR: The Data Problem
Structured vs. Unstructured
• Structured Data: 20%
• Unstructured Data: 80%
• Today, 80% of business is conducted on unstructured information – Gartner Group
• 80% of the information needed is in the Open Source – NIA
• Workers spend 80% of the time gathering information – STIC, EMF
[Figure: timeline with a GIGABYTES axis. Milestones: Cave paintings, Bone tools 40,000 BCE; Writing 3500 BCE; 0 C.E.; Paper 105; Printing 1450; Electricity, Telephone 1870; Transistor 1947; Computing 1950; Internet (DARPA) Late 1960s; The Web 1993; 1999. Source: www.fastsearch.com]
SEASR
Software Environment for the Advancement of Scholarly Research (SEASR)
– addresses the challenges of transforming information into knowledge by constructing the software bridges required to move from the unstructured and semi-structured data world to the structured data world.
– aims to make collections more useful by integrating two well-known research and development frameworks, NCSA’s Data-To-Knowledge (D2K) and IBM’s Unstructured Information Management Architecture (UIMA), into a usable environment that researchers in any discipline can easily learn and adapt for their own unstructured data analysis.
SEASR: Architecture
SEASR’s advanced informatics tools will expand the technical capabilities of what is now available in the field by:
• connecting data sources that are currently incompatible, whether due to different formats or protocols
• offering all project components as open source, to enable users to modify and add to tools
• allowing users to write analytic engines in their programming language of choice
• installing on all hardware footprints, so that the tools can be brought to data sets where they are housed
• creating a repository for components that will support sharing and publishing among users
• enabling scalability so that components may run on a large variety of hardware footprints, including shared memory processors and clusters
SEASR Applications
NoraVis OpenLaszlo
DISCUS
SEASR
FeatureLens
M2K
NoraVis OpenLaszlo
FeatureLens: n-gram patterns
Created by Anthony Don at http://www.cs.umd.edu/hcil/textvis/featurelens/.
Getting the “Band” Together
• June 2007 – Band formation
– Project start date
– More use ideas and framework discussions
• December – First “gig”
– Framework and data app demonstration
• Vocals - Research Technology
– John Unsworth, Stephen Downie, Tim Wentling
– Dan Roth, Jiawei Han, Kevin Chang, Cheng Xiang Zhai
• Percussion & Bass - SEASR Development
– Loretta Auvil, Tara Bazler, Duane Searsmith, Andrew Shirk, Students
• Lead – Designers/Developers/Application Areas
– Humanities – M2K, Nora/Monk and Others (we heard about yesterday/today)
• Need Groupies! (Advisors, Researchers, Developers, and Application
Drivers) – Loretta Auvil
SEASR: How can I participate?
• Collaborate on application development or ontology creation
• Contribute to component development for analytics or data access
• Participate in visualization and UI design
• Serve as an advisor
Contact Loretta Auvil ([email protected])
SEASR
Engineering Knowledge for the Humanities
Thank You