Scientific Applications of Data Mining
Bioinformatics Seminar
August 28, 2002
Gary Lindstrom
School of Computing
University of Utah
Outline
What is data mining?
Where has it been successfully
applied?
How can it be applied to scientific
applications?
Research Opportunities
What Is Data Mining?
One definition (Robert Grossman)
• Data mining is the semi-automatic
discovery of patterns, associations,
anomalies, structures, and changes in
large data sets
Data Mining
Characteristics
• Large data, vs. small data
• Discovery, not validation
• Data driven, not hypothesis driven
• Automated, not manual application
Supported by
• Statistics, machine learning, databases,
high performance computing
The Data Gap
Exponential growth of data
• More automation, greater throughput,
more models, e.g. simulated
But: linear increase in number of
researchers
• Sift the sand, rather than searching a
sensor
Classical Data Mining Applications
Retail
• Market basket analysis
Political science
• Targeting campaign resources
Financial
• Exploiting market trends & imbalances
Decision Support Systems
Generic term for analytic and historic
uses of DBs
• Contrast with: operational uses
• Commonly known as On-Line
Transaction Processing (OLTP)
Data warehouses
• Data culled from operational DBs, with
history and derived summary data
Data Warehouses vs. Databases
• Replicate data from distributed sources
• Do not require strict currency of data
• Oriented toward complex, often
statistical queries
• Often based on materialized views of
operational data
  - Views which have been expanded into real tables
Tools for DSS
Ad hoc SQL-style queries
• Optimized for large, complex data
On-Line Analytic Processing (OLAP)
• Queries optimized for aggregation operations
• Data is viewed as multidimensional array
• Influenced by end-user tools such as
spreadsheets
Data mining
• Exploratory data analysis
• Looking for interesting unanticipated patterns
in the data
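The multidimensional view behind OLAP can be illustrated with a small roll-up. This is a minimal sketch, not something from the talk: the fact rows, the dimension names, and the `roll_up` helper are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, location, year, sales). Not from the talk.
FACTS = [
    ("pen", "Utah", 2001, 10),
    ("pen", "Utah", 2002, 12),
    ("pen", "Ohio", 2002, 7),
    ("ink", "Utah", 2002, 5),
    ("ink", "Ohio", 2001, 3),
]
DIMENSIONS = ("product", "location", "year")


def roll_up(rows, keep):
    """Sum sales over every dimension that is not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in rows:
        cell = tuple(row[p] for p in positions)
        totals[cell] += row[3]
    return dict(totals)


by_product = roll_up(FACTS, ["product"])
by_product_year = roll_up(FACTS, ["product", "year"])
```

Each call collapses the array along the dimensions that were left out, which is the kind of aggregation query OLAP tools are tuned for.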
Data Warehousing
[Figure: data warehouse architecture. Labels: External Data Source; Extract, Transform, Load, Refresh; Metadata Repository; Data Warehouse; serves OLAP, Data Mining, Visualization]
Creating And Maintaining A Warehouse
Challenges
• Schema design for integrated information
• Operations
  - Cleaning (curation): filling gaps, correcting errors
  - Transforming: making consistent with new schema
  - Loading: also sorting and summarizing
  - Refreshing: incorporating updates to operational data
  - Purging: aging out old data
Role of metadata
• Sources of data, schema conversion
information, refresh history, etc.
OLAP Naturally Leads to Data Mining
Seeks interesting trends or patterns in
large datasets
• An example of exploratory data analysis
• Related to knowledge discovery and machine
learning
Mining for rules
• Association rules: motivated by retail market
basket analysis
Market Basket Analysis
Market basket
• A collection of items purchased by a customer
in one transaction
• Retailers want to learn of items often
purchased together
  - For promotional and display grouping purposes
• Simple tabular representation
  - Purchases(transid, custid, date, item, price, quantity)
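As a rough sketch of how the tabular representation maps onto market baskets, rows can be grouped by `transid`. The sample rows and the `to_baskets` helper are hypothetical; only the column layout comes from the slide.

```python
from collections import defaultdict

# Hypothetical rows in the Purchases(transid, custid, date, item, price, quantity) layout.
purchases = [
    (111, 201, "2002-05-01", "pen", 2.00, 2),
    (111, 201, "2002-05-01", "ink", 1.00, 1),
    (111, 201, "2002-05-01", "milk", 3.00, 1),
    (112, 105, "2002-06-03", "pen", 2.00, 1),
    (112, 105, "2002-06-03", "ink", 1.00, 1),
    (113, 106, "2002-05-10", "pen", 2.00, 1),
    (113, 106, "2002-05-10", "milk", 3.00, 1),
]


def to_baskets(rows):
    """Group purchase rows by transaction id into sets of items."""
    baskets = defaultdict(set)
    for transid, _custid, _date, item, _price, _quantity in rows:
        baskets[transid].add(item)
    return dict(baskets)


baskets = to_baskets(purchases)
```

Each basket is a set, so price and quantity are ignored here; association rules only ask whether an item appears in a transaction.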
Association Rules
Seek rules of the form:
{ pen } => { ink }
• Meaning:
  - If a pen is purchased in a transaction, it is
likely that ink will also be purchased in that
transaction
Important Measures for Association Rules
Support
• % of transactions containing all items
mentioned in rule
• Low support reduces interest in the rule
Confidence
• % of transactions containing the LHS
that also contain RHS
• Indicates degree of correlation
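Written out, the two measures look as follows. The notation is an assumption made here, not the talk's: T is the set of transactions, X the LHS items, and Y the RHS items.

```latex
\[
\mathrm{support}(X \Rightarrow Y)
  = \frac{|\{\, t \in T : X \cup Y \subseteq t \,\}|}{|T|},
\qquad
\mathrm{confidence}(X \Rightarrow Y)
  = \frac{|\{\, t \in T : X \cup Y \subseteq t \,\}|}{|\{\, t \in T : X \subseteq t \,\}|}
\]
```

For a made-up example with 5 transactions, 4 containing a pen and 3 containing both pen and ink, the rule { pen } => { ink } has support 3/5 = 60% and confidence 3/4 = 75%.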
Using Association Rules For Prediction
Always somewhat risky
• Because ultimate goal is understanding
causality
• Which is not directly reflected in
transaction data
There Can Be High Support and Confidence
… but no causality
Example: pencils and pens are often
bought together
• And pens and ink are often bought together
• Hence pencils and ink are often bought
together
But there is no causal link between pencils
and ink
• Hence sales promotions on pencils and ink
probably won’t be effective
Finding Association Rules
Seek rules with:
• Support greater than minsup
• Confidence greater than minconf
Steps
• Find frequent item sets
  - Sets of items with support >= minsup
• Break each frequent item set into LHS and
RHS of candidate rules
  - Keep those with confidence >= minconf
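The second step, breaking frequent item sets into candidate rules, might be sketched as below. It assumes the transaction counts for each frequent item set and all of its subsets are already known; the `candidate_rules` name and the sample counts are hypothetical.

```python
from itertools import combinations


def candidate_rules(itemset_counts, minconf):
    """Split each frequent itemset into LHS => RHS and keep confident rules.

    `itemset_counts` maps frozenset -> number of transactions containing it,
    and must include every non-empty subset of each frequent itemset.
    """
    rules = []
    for itemset, count in itemset_counts.items():
        if len(itemset) < 2:
            continue  # a rule needs at least one item on each side
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                conf = count / itemset_counts[lhs]
                if conf >= minconf:
                    rules.append((lhs, itemset - lhs, conf))
    return rules


# Hypothetical counts: pen appears in 4 transactions, ink in 3, both in 3.
counts = {
    frozenset({"pen"}): 4,
    frozenset({"ink"}): 3,
    frozenset({"pen", "ink"}): 3,
}
rules = candidate_rules(counts, minconf=0.9)
```

With these counts, { ink } => { pen } survives (confidence 3/3) while { pen } => { ink } is dropped (confidence 3/4), which shows that the direction of a rule matters.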
Testing Candidate Rules
Confidence calculation for each
candidate rule
• Maintain two counters: lhscount,
rhscount
• Scan entire customer transaction table
• Count in lhscount occurrences of all
items in LHS
• If LHS is present, tally in rhscount if all
items in RHS are present
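The two-counter scan can be sketched as below. The counter names follow the slide; the function name and the sample transactions are assumptions for illustration.

```python
def rule_confidence(transactions, lhs, rhs):
    """Scan all transactions once, keeping the two counters from the slide."""
    lhs, rhs = set(lhs), set(rhs)
    lhscount = rhscount = 0
    for items in transactions:
        if lhs <= items:          # all LHS items are present
            lhscount += 1
            if rhs <= items:      # ... and all RHS items are present as well
                rhscount += 1
    confidence = rhscount / lhscount if lhscount else 0.0
    return lhscount, rhscount, confidence


# Hypothetical transactions, each reduced to its set of items.
transactions = [
    {"pen", "ink", "milk"},
    {"pen", "ink"},
    {"pen", "milk"},
    {"pen", "ink", "juice"},
    {"milk"},
]
lhscount, rhscount, conf = rule_confidence(transactions, {"pen"}, {"ink"})
```

Confidence is then `rhscount / lhscount`; the guard for `lhscount == 0` only avoids a division by zero for rules whose LHS never occurs.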
Identifying Frequent Item Sets
The a priori property:
• Every subset of a frequent item set is
also a frequent item set
This leads to an iterative algorithm
• Identify frequent item sets of one item
• Iteratively, seek to extend frequent item
sets by adding an item
Finding Frequent Itemsets
foreach item,
    check if it is a frequent itemset
repeat
    foreach new frequent itemset Ik with k items
        generate all itemsets Ik+1 with k+1 items, Ik ⊂ Ik+1
    scan all transactions once and check if
        the generated k+1-itemsets are frequent
until no new frequent itemsets are found
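The loop above can be turned into a small runnable sketch. This is one possible reading of the pseudocode, not the speaker's implementation: the `frequent_itemsets` name, the pruning step based on the a priori property, and the sample data are assumptions.

```python
from itertools import combinations


def frequent_itemsets(transactions, minsup):
    """Return {itemset: support} for all itemsets with support >= minsup."""
    n = len(transactions)

    def supports(candidates):
        # One scan over all transactions counts every candidate.
        counts = dict.fromkeys(candidates, 0)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: k / n for c, k in counts.items() if k / n >= minsup}

    items = {item for t in transactions for item in t}
    current = supports({frozenset([i]) for i in items})
    frequent = dict(current)
    while current:
        size = len(next(iter(current))) + 1
        singles = [i for i in items if frozenset([i]) in frequent]
        # Extend each new frequent itemset by one frequent item.
        candidates = {s | {i} for s in current for i in singles if i not in s}
        # A priori property: every subset one item smaller must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in frequent
                   for sub in combinations(c, size - 1))
        }
        current = supports(candidates)
        frequent.update(current)
    return frequent


# Hypothetical transactions.
transactions = [
    {"pen", "ink", "milk"},
    {"pen", "ink"},
    {"pen", "milk"},
    {"pen", "ink", "juice"},
]
result = frequent_itemsets(transactions, minsup=0.7)
```

The pruning step is what keeps the number of candidates manageable: an item set is only counted if all of its smaller subsets were already found to be frequent.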
Example: Mining Simulated Combustion Data
Joint work with
• Brijesh Garabadu, School of Computing
• Zoran Djurisic, Chem. & Fuels Engg.
The problem
• Combustion model for powdered coal
furnaces
• Which conditions control NOx pollution?
The Data
Multidimensional space
• Pressure, fuel mix, oxygen concentration
• Can explore (simulate) any combination
  - But which to look at?
Need to:
• Locate relevant subspaces
• Characterize important events
• Develop causal hypotheses
Techniques Applied
Cluster analysis
• Which datasets are similar?
Neural networks
• Which datasets are interesting?
Decision trees
• Which features best explain similarities?
Cluster Analysis: Unsupervised Learning
At outset, category structure of the
data is unknown
• All that is known is a collection of
observations
Objective: To discover a category
structure which fits the observations
• i.e. finding natural groups in data
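The talk does not say which clustering algorithm was used. As an illustration only, here is a minimal k-means sketch, one common way of finding natural groups; the function, the starting centers, and the points are all hypothetical.

```python
def kmeans(points, centers, iterations=10):
    """Plain k-means: assign each point to its nearest center, then recenter."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])),
            )
            groups[nearest].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers, groups


# Hypothetical 2-D observations forming two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, groups = kmeans(points, centers=[points[0], points[3]])
```

No labels are supplied anywhere: the grouping comes from the observations alone, which is what makes this unsupervised learning.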
Combustion Application
Cluster analysis was used to detect
relationships among various species
• Are the behaviors of any two species related?
• Is the concentration of one species dependent
on that of one or more other species?
One confirmed hypothesis:
• CH reaches its peak concentration either before
or at the same time as H reaches its peak
concentration
• An important engineering observation
Artificial Neural Networks
A general, practical method for learning
real-valued, discrete-valued, and vector-valued functions from examples
Combustion application
• Finding different kinds of patterns
(increasing / decreasing, etc.) in the lifetime of
a species during the combustion process
• This can be used to prove various hypotheses
as well as to detect patterns of specific species
in previously unseen data
Neural Networks:
Supervised Learning
Application Technique
Training set data are labeled by the user
• These labeled data are used to train the ANN
The ANN is then used to classify
previously unseen data
• e.g., species in a particular combustion
• Into a particular pattern class
For example, NO shows two different
trends under differing conditions
A trained ANN can be used to classify the
datasets according to the trend of NO
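The train-then-classify workflow can be sketched with a single-unit perceptron, the simplest kind of neural network. The talk does not describe the network actually used, so the architecture, the difference-based features, and the concentration series below are all assumptions.

```python
def diffs(series):
    """Feature vector: step-to-step changes in concentration."""
    return [b - a for a, b in zip(series, series[1:])]


def train_perceptron(examples, epochs=20):
    """examples: list of (features, label) pairs with label +1 or -1."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, label in examples:
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            if label * activation <= 0:  # misclassified: nudge toward the label
                weights = [w + label * xi for w, xi in zip(weights, x)]
                bias += label
    return weights, bias


def classify(model, series):
    """Label a previously unseen series with the trained unit."""
    weights, bias = model
    activation = sum(w * xi for w, xi in zip(weights, diffs(series))) + bias
    return "increasing" if activation > 0 else "decreasing"


# Hypothetical user-labeled training series (+1 increasing, -1 decreasing).
labeled = [
    ([0.1, 0.2, 0.4, 0.7], +1),
    ([0.0, 0.3, 0.5, 0.6], +1),
    ([0.9, 0.6, 0.4, 0.3], -1),
    ([0.8, 0.7, 0.4, 0.0], -1),
]
model = train_perceptron([(diffs(s), label) for s, label in labeled])
```

After training, `classify(model, [0.2, 0.3, 0.5, 0.9])` labels a series the unit has never seen, mirroring the use of a trained ANN on new datasets.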
Decision Trees
Characterize data by features
• e.g., species concentration at an instant
Categorize data sets
• Manually, or use ANN
• e.g., according to the trend of NO
Use decision tree algorithm to
discover clustering criteria
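How a decision tree algorithm picks a splitting feature can be sketched with entropy-based information gain on a single numeric feature. This is a simplification: C4.5-style learners such as J48 use a refined criterion (gain ratio), and the feature values and labels below are invented.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (labels.count(c) / total) * log2(labels.count(c) / total)
        for c in set(labels)
    )


def best_threshold(values, labels):
    """Return (threshold, gain) for the split `value <= threshold` with highest gain."""
    base = entropy(labels)
    best = (None, 0.0)
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - remainder
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Hypothetical CO concentrations with manually assigned categories.
co = [0.001, 0.002, 0.0025, 0.004, 0.005, 0.006]
category = ["yes", "yes", "yes", "no", "no", "no"]
threshold, gain = best_threshold(co, category)
```

A tree learner repeats this search over every feature, splits on the best one, and recurses into each branch.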
Sample Output
=== Classifier model (full training set) ===
J48 pruned tree
------------------
CO <= 0.002945
|   OH <= 0.000016
|   |   CO <= 0.000166: yes (17.0/1.0)
|   |   CO > 0.000166: no (3.0)
|   OH > 0.000016: yes (30.0)
CO > 0.002945: no (60.0/1.0)
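Read as nested tests, the tree printed above corresponds to a small function. This sketch assumes the nesting CO, then OH, then CO again; the function name is hypothetical, and the talk does not say what the yes/no classes stand for.

```python
def predict(co, oh):
    """Follow the pruned tree from the root to a leaf; returns "yes" or "no"."""
    if co <= 0.002945:
        if oh <= 0.000016:
            return "yes" if co <= 0.000166 else "no"
        return "yes"
    return "no"


label = predict(co=0.001, oh=0.001)  # low CO, higher OH
```

In Weka's output, the numbers in parentheses after each leaf give the number of training instances reaching that leaf and, after the slash, how many of them were misclassified.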
Research Opportunities
Try it!
• In your area, on your data, for new
results
Features
• Definition, efficient extraction
Community building
• Sharing data mining results
PMML
Predictive Model Markup Language
XML-based representation of
association rules
Developed by Data Mining Group
• Industrial and university research
collaboration
An Excellent Tutorial
Used for material in this talk
• Data Mining Scientific and Engineering
Applications
  - Tutorial at SC2001, November 12, 2001 by
R. Grossman, C. Kamath and V. Kumar
http://www-users.cs.umn.edu/~kumar/Presentation/sc2001.html