Understanding Virtual Blah Blahs…

Download Report

Transcript Understanding Virtual Blah Blahs…

Scientific Applications of
Data Mining
Bioinformatics Seminar
August 28, 2002
Gary Lindstrom
School of Computing
University of Utah
What is data mining?
Where has it been successfully
How can it be applied to scientific
Research Opportunities
What Is Data Mining?
One definition (Robert Grossman)
• Data mining is the semi-automatic
discovery of patterns, associations,
anomalies, structures, and changes in
large data sets
Data Mining
Large data, vs. small data
Discovery, not validation
Data driven, not hypothesis driven
Automated, not manual application
Supported by
• Statistics, machine learning, databases,
high performance computing
The Data Gap
Exponential growth of data
• More automation, greater throughput,
more models, e.g. simulated
But: linear increase in number of
• Sift the sand, rather than searching a
Classical Data Mining
• Market basket analysis
Political science
• Targeting campaign resources
• Exploiting market trends & imbalances
Decision Support Systems
Generic term for analytic and historic
uses of DBs
• Contrast with: operational uses
• Commonly known as On-Line
Transaction Processing (OLTP)
Data warehouses
• Data culled from operational DBs, with
history and derived summary data
Data Warehouses vs.
• Replicate data from distributed sources
• Do not require strict currency of data
• Oriented toward complex, often
statistical queries
• Often based on materialized views of
operational data
 Views which have been expanded into real
Tools for DSS
 Ad hoc SQL-style queries
• Optimized for large, complex data
 On-Line Analytic Processing (OLAP)
• Queries optimized for aggregation operations
• Data is viewed as multidimensional array
• Influenced by end-user tools such as
 Data mining
• Exploratory data analysis
• Looking for interesting unanticipated patterns
in the data
Data Warehousing
External Data Source
Metadata Repository
Data Warehouse
Data Mining
Creating And Maintaining
A Warehouse
 Challenges
• Schema design for integrated information
• Operations
Cleaning (curation): filling gaps, correcting errors
Transforming: making consistent with new schema
Loading: also sorting and summarizing
Refreshing: incorporate updates to operation data
Purging: aging out old data
 Role of metadata
• Sources of data, schema conversion
information, refresh history, etc.
OLAP Naturally Leads to
Data Mining
Seeks interesting trends or patterns in
large datasets
• An example of exploratory data analysis
• Related to knowledge discovery and machine
Mining for rules
• Association rules: motivated by retail market
basket analysis
Market Basket Analysis
 Market basket
• A collection of items purchased by a customer
in one transaction
• Retailers want to learn of items often
purchased together
 For promotional and display grouping purposes
• Simple tabular representation
 Purchases(transid, custid, date, item, price, quantity)
Association Rules
Seek rules of the form:
{ pen } => { ink }
• Meaning:
 If a pen is purchased in a transaction, it is
likely that ink will also be purchased in that
Important Measures for
Association Rules
• % of transactions containing all items
mentioned in rule
• Low support reduces interest in the rule
• % of transactions containing the LHS
that also contain RHS
• Indicates degree of correlation
Using Association Rules
For Prediction
Always somewhat risky
• Because ultimate goal is understanding
• Which is not directly reflected in
transaction data
There Can Be High Support
and Confidence
 … but no causality
 Example: pencils and pens are often
bought together
• And pens and ink are often bought together
• Hence pencils and ink are often bought
 But there is no causal link between pencils
and ink
• Hence sale promotions on pencils and ink
probably won’t be effective
Finding Association Rules
Seek rules with:
• Support greater than minsup
• Confidence greater than minconf
• Find frequent item sets
 Sets of items with support >= minsup
• Break each frequent item set into LHS and
RHS of candidate rules
 Keep those with confidence >= minconf
Testing Candidate Rules
Confidence calculation for each
candidate rule
• Maintain two counters: lhscount,
• Scan entire customer transaction table
• Count in lhscount occurrences of all
items in LHS
• If LHS is present, tally in rhscount if all
items in RHS are present
Identifying Frequent Item
The a priori property:
• Every subset of a frequent item set is
also a frequent item set
This leads to an iterative algorithm
• Identify frequent item sets of one item
• Iteratively, seek to extend frequent item
sets by adding an item
Finding Frequent Itemsets
foreach item,
check if it is a frequent itemset
foreach new frequent itemset Ik with k items
generate all itemsets Ik+1 with k+1 items, Ik  Ik+1
Scan all transactions once and check if
the generated k+1-itemsets are frequent
until no new frequent itemsets are found
Example: Mining Simulated
Combustion Data
Joint work with
• Brijesh Garabadu, School of Computing
• Zoran Djurisic, Chem. & Fuels Engg.
The problem
• Combustion model for powdered coal
• Which conditions control NOx pollution?
The Data
Multidimensional space
• Pressure, fuel mix, oxygen concentration
• Can explore (simulate) any combination
 But which to look at?
Need to:
• Locate relevant subspaces
• Characterize important events
• Develop causal hypotheses
Techniques Applied
Cluster analysis
• Which datasets are similar?
Neural networks
• Which datasets are interesting?
Decision trees
• Which features best explain similarities?
Cluster Analysis:
Unsupervised Learning
At outset, category structure of the
data is unknown
• All that is known is a collection of
Objective: To discover a category
structure which fits the observation
• i.e. finding natural groups in data
Combustion Application
 Cluster analysis was used to detect
relationships among various species
• Are the behaviors of any two species related?
• Is the concentration of one species dependent
on that of one or more other species?
 One confirmed hypothesis:
• CH reaches it peak concentration either before
or at the same time as H reaches its peak
• An important engineering observation
Artificial Neural Networks
 A general, practical method for learning
real-valued, discrete-values, and vectorvalues function from examples
 Combustion application
• Finding out different kinds of pattern
(increasing / decreasing, etc) in the lifetime of
a species during the combustion process
• This can be used to prove various hypothesis
as well as to detect patterns of specific species
in previously unseen data
Neural Networks:
Supervised Learning
Application Technique
 Training set data are labeled by the user
• These labeled data are used to train the ANN
 The ANN is then used to classify
previously unseen data
• e.g., species in a particular combustion
• Into a particular pattern class
 For example, NO shows two different
trends under differing conditions
 A trained ANN can be used to classify the
datasets according to the trend of NO
Decision Trees
Characterize data by features
• e.g., species concentration at an instant
Categorize data sets
• Manually, or use ANN
• e.g., according to the trend of NO
Use decision tree algorithm to
discover clustering criteria
Sample Output
=== Classifier model (full training set) ===
J48 pruned tree
--------------------CO <= 0.002945
OH <= 0.000016
CO <= 0.000166: yes (17.0/1.0)
CO > 0.000166: no (3.0)
OH > 0.000016: yes (30.0)
CO > 0.002945: no (60.0 / 1.0)
Research Opportunities
Try it!
• In your area, on your data, for new
• Definition, efficient extraction
Community building
• Sharing data mining results
Predictive Model Markup Language
XML based representation of
association rules
Developed by Data Mining Group
• Industrial and university research
An Excellent Tutorial
Used for material in this talk
• Data Mining Scientific and Engineering
 Tutorial at SC2001, November 12, 2001 by
R. Grossman, C. Kamath and V. Kumar