Scientific Applications of
Data Mining
Bioinformatics Seminar
August 28, 2002
Gary Lindstrom
School of Computing
University of Utah
Outline
What is data mining?
Where has it been successfully
applied?
How can it be applied to scientific
applications?
Research Opportunities
What Is Data Mining?
One definition (Robert Grossman)
• Data mining is the semi-automatic
discovery of patterns, associations,
anomalies, structures, and changes in
large data sets
Data Mining
Characteristics
• Large data, vs. small data
• Discovery, not validation
• Data driven, not hypothesis driven
• Automated, not manual application
Supported by
• Statistics, machine learning, databases,
high performance computing
The Data Gap
Exponential growth of data
• More automation, greater throughput,
more models, e.g. simulated
But: linear increase in number of
researchers
• Sift the sand, rather than searching a
sensor
Classical Data Mining
Applications
Retail
• Market basket analysis
Political science
• Targeting campaign resources
Financial
• Exploiting market trends & imbalances
Decision Support Systems
Generic term for analytic and historic
uses of DBs
• Contrast with: operational uses
• Commonly known as On-Line
Transaction Processing (OLTP)
Data warehouses
• Data culled from operational DBs, with
history and derived summary data
Data Warehouses vs.
Databases
• Replicate data from distributed sources
• Do not require strict currency of data
• Oriented toward complex, often
statistical queries
• Often based on materialized views of
operational data
Views which have been expanded into real
tables
Tools for DSS
Ad hoc SQL-style queries
• Optimized for large, complex data
On-Line Analytic Processing (OLAP)
• Queries optimized for aggregation operations
• Data is viewed as multidimensional array
• Influenced by end-user tools such as
spreadsheets
Data mining
• Exploratory data analysis
• Looking for interesting unanticipated patterns
in the data
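To make the OLAP view concrete, here is a minimal sketch that treats a few made-up sales facts as cells of a (region, product, quarter) array and aggregates along chosen dimensions. The data, the dimension names, and the roll_up helper are hypothetical and are not taken from the talk.

```python
from collections import defaultdict

# Hypothetical fact cells: (region, product, quarter) -> sales amount.
SALES = {
    ("west", "pen", "Q1"): 10,
    ("west", "ink", "Q1"): 4,
    ("west", "pen", "Q2"): 7,
    ("east", "pen", "Q1"): 5,
    ("east", "ink", "Q2"): 6,
}
DIMENSIONS = ("region", "product", "quarter")


def roll_up(cells, keep):
    """Sum the measure over every dimension not named in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for coords, amount in cells.items():
        totals[tuple(coords[p] for p in positions)] += amount
    return dict(totals)


by_region = roll_up(SALES, ["region"])
by_product_quarter = roll_up(SALES, ["product", "quarter"])
grand_total = roll_up(SALES, [])
```

Keeping fewer dimensions gives a coarser summary, which is the kind of aggregation an OLAP engine is optimized to answer quickly.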
Data Warehousing
[Diagram: External Data Sources → EXTRACT, TRANSFORM, LOAD, REFRESH → Data Warehouse (with Metadata Repository) → SERVES → OLAP, Data Mining, Visualization]
Creating And Maintaining
A Warehouse
Challenges
• Schema design for integrated information
• Operations
Cleaning (curation): filling gaps, correcting errors
Transforming: making consistent with new schema
Loading: also sorting and summarizing
Refreshing: incorporate updates to operational data
Purging: aging out old data
Role of metadata
• Sources of data, schema conversion
information, refresh history, etc.
OLAP Naturally Leads to
Data Mining
Seeks interesting trends or patterns in
large datasets
• An example of exploratory data analysis
• Related to knowledge discovery and machine
learning
Mining for rules
• Association rules: motivated by retail market
basket analysis
Market Basket Analysis
Market basket
• A collection of items purchased by a customer
in one transaction
• Retailers want to learn of items often
purchased together
For promotional and display grouping purposes
• Simple tabular representation
Purchases(transid, custid, date, item, price, quantity)
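As a small illustration of the tabular representation, the sketch below groups Purchases rows into one item set per transaction. The rows are invented for the example, and baskets_from_rows is a hypothetical helper name.

```python
from collections import defaultdict, namedtuple

# Same columns as the Purchases table; every row below is made up.
Purchase = namedtuple("Purchase", "transid custid date item price quantity")

ROWS = [
    Purchase(111, 201, "2002-05-01", "pen", 2.00, 2),
    Purchase(111, 201, "2002-05-01", "ink", 1.00, 1),
    Purchase(111, 201, "2002-05-01", "milk", 3.00, 1),
    Purchase(112, 105, "2002-06-03", "pen", 2.00, 1),
    Purchase(112, 105, "2002-06-03", "ink", 1.00, 1),
    Purchase(113, 106, "2002-05-10", "pen", 2.00, 1),
    Purchase(113, 106, "2002-05-10", "milk", 3.00, 1),
    Purchase(114, 201, "2002-06-01", "pen", 2.00, 2),
    Purchase(114, 201, "2002-06-01", "juice", 4.00, 1),
]


def baskets_from_rows(rows):
    """Collect the distinct items bought in each transaction."""
    baskets = defaultdict(set)
    for row in rows:
        baskets[row.transid].add(row.item)
    return dict(baskets)


BASKETS = baskets_from_rows(ROWS)
```

Each value in BASKETS is one market basket: the set of items a customer bought in a single transaction.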
Association Rules
Seek rules of the form:
{ pen } => { ink }
• Meaning:
If a pen is purchased in a transaction, it is
likely that ink will also be purchased in that
transaction
Important Measures for
Association Rules
Support
• % of transactions containing all items
mentioned in rule
• Low support reduces interest in the rule
Confidence
• % of transactions containing the LHS
that also contain RHS
• Indicates degree of correlation
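The two measures can be computed directly from a list of baskets. This is a minimal sketch on four made-up transactions. The function names are illustrative, and both measures are returned as fractions rather than percentages.

```python
# Hypothetical transactions, each a set of items.
BASKETS = [
    {"pen", "ink", "milk"},
    {"pen", "ink"},
    {"pen", "milk"},
    {"pen", "juice"},
]


def support(itemset, baskets):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(lhs, rhs, baskets):
    """Fraction of transactions containing LHS that also contain RHS."""
    lhs_support = support(lhs, baskets)
    if lhs_support == 0:
        return 0.0  # the rule never applies in this data
    return support(set(lhs) | set(rhs), baskets) / lhs_support


rule_support = support({"pen", "ink"}, BASKETS)
pen_to_ink = confidence({"pen"}, {"ink"}, BASKETS)
ink_to_pen = confidence({"ink"}, {"pen"}, BASKETS)
```

On this toy data, { pen } => { ink } and { ink } => { pen } have the same support but different confidence, because pens appear in every basket while ink appears in only half.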
Using Association Rules
For Prediction
Always somewhat risky
• Because ultimate goal is understanding
causality
• Which is not directly reflected in
transaction data
There Can Be High Support
and Confidence
… but no causality
Example: pencils and pens are often
bought together
• And pens and ink are often bought together
• Hence pencils and ink are often bought
together
But there is no causal link between pencils
and ink
• Hence sale promotions on pencils and ink
probably won’t be effective
Finding Association Rules
Seek rules with:
• Support greater than minsup
• Confidence greater than minconf
Steps
• Find frequent item sets
Sets of items with support >= minsup
• Break each frequent item set into LHS and
RHS of candidate rules
Keep those with confidence >= minconf
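The second step, breaking a frequent item set into candidate rules, might look like the sketch below. The baskets, the minconf value, and the function names are assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical transactions, each a set of items.
BASKETS = [
    {"pen", "ink", "milk"},
    {"pen", "ink"},
    {"pen", "milk"},
    {"pen", "juice"},
]


def support(itemset, baskets):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def rules_from_itemset(itemset, baskets, minconf):
    """Try every LHS => RHS split of one item set; keep the confident ones."""
    items = sorted(itemset)
    whole = support(items, baskets)
    kept = []
    for size in range(1, len(items)):
        for lhs in combinations(items, size):
            lhs_support = support(lhs, baskets)
            if lhs_support == 0:
                continue  # cannot happen when the item set itself is frequent
            rhs = tuple(item for item in items if item not in lhs)
            conf = whole / lhs_support
            if conf >= minconf:
                kept.append((lhs, rhs, conf))
    return kept


RULES = rules_from_itemset({"pen", "ink"}, BASKETS, minconf=0.75)
```

With minconf set to 0.75, only { ink } => { pen } survives here. The reverse rule has confidence 0.5 and is dropped.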
Testing Candidate Rules
Confidence calculation for each
candidate rule
• Maintain two counters: lhscount,
rhscount
• Scan entire customer transaction table
• Count in lhscount occurrences of all
items in LHS
• If LHS is present, tally in rhscount if all
items in RHS are present
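The two-counter scan described above can be sketched as a single pass over the transactions. The names lhscount and rhscount follow the slide. The function name and the sample data are hypothetical.

```python
# Hypothetical transaction table, reduced to one item set per transaction.
TRANSACTIONS = [
    {"pen", "ink", "milk"},
    {"pen", "ink"},
    {"pen", "milk"},
    {"pen", "juice"},
]


def scan_confidence(lhs, rhs, transactions):
    """One pass: count LHS matches, and among those, RHS matches."""
    lhs, rhs = set(lhs), set(rhs)
    lhscount = 0
    rhscount = 0
    for basket in transactions:
        if lhs <= basket:          # all LHS items are present
            lhscount += 1
            if rhs <= basket:      # ...and all RHS items as well
                rhscount += 1
    confidence = rhscount / lhscount if lhscount else 0.0
    return lhscount, rhscount, confidence


result = scan_confidence({"pen"}, {"ink"}, TRANSACTIONS)
```

The confidence of the candidate rule is then rhscount divided by lhscount.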
Identifying Frequent Item
Sets
The a priori property:
• Every subset of a frequent item set is
also a frequent item set
This leads to an iterative algorithm
• Identify frequent item sets of one item
• Iteratively, seek to extend frequent item
sets by adding an item
Finding Frequent Itemsets
foreach item,
    check if it is a frequent itemset
repeat
    foreach new frequent itemset Ik with k items
        generate all itemsets Ik+1 with k+1 items, Ik ⊂ Ik+1
    scan all transactions once and check if
        the generated k+1-itemsets are frequent
until no new frequent itemsets are found
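One way to turn this level-by-level loop into runnable code is sketched below. It is an illustrative reading of the pseudocode, not the speaker's implementation. The function names and sample data are assumptions, and the extra pruning step simply applies the a priori property before counting.

```python
from itertools import combinations


def count_frequent(candidates, transactions, minsup):
    """One pass over the transactions: keep candidates with support >= minsup."""
    counts = {candidate: 0 for candidate in candidates}
    for basket in transactions:
        for candidate in candidates:
            if candidate <= basket:
                counts[candidate] += 1
    total = len(transactions)
    return {c for c, n in counts.items() if n / total >= minsup}


def frequent_itemsets(transactions, minsup):
    """Return every item set whose support is at least minsup."""
    singles = {frozenset([item]) for basket in transactions for item in basket}
    level = count_frequent(singles, transactions, minsup)
    frequent_items = {item for itemset in level for item in itemset}
    found = set(level)
    while level:
        # Extend each frequent k-itemset by one frequent item.
        candidates = {s | {item} for s in level for item in frequent_items - s}
        # A priori property: drop candidates with an infrequent k-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level
                   for sub in combinations(c, len(c) - 1))
        }
        level = count_frequent(candidates, transactions, minsup)
        found |= level
    return found


# Hypothetical transactions, each a set of items.
BASKETS = [
    {"pen", "ink", "milk"},
    {"pen", "ink"},
    {"pen", "milk"},
    {"pen", "juice"},
]
FREQUENT = frequent_itemsets(BASKETS, minsup=0.5)
```

On this data the loop stops after the two-item level: the only three-item candidate, {pen, ink, milk}, is pruned because its subset {ink, milk} is not frequent.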
Example: Mining Simulated
Combustion Data
Joint work with
• Brijesh Garabadu, School of Computing
• Zoran Djurisic, Chem. & Fuels Engg.
The problem
• Combustion model for powdered coal
furnaces
• Which conditions control NOx pollution?
The Data
Multidimensional space
• Pressure, fuel mix, oxygen concentration
• Can explore (simulate) any combination
But which to look at?
Need to:
• Locate relevant subspaces
• Characterize important events
• Develop causal hypotheses
Techniques Applied
Cluster analysis
• Which datasets are similar?
Neural networks
• Which datasets are interesting?
Decision trees
• Which features best explain similarities?
Cluster Analysis:
Unsupervised Learning
At outset, category structure of the
data is unknown
• All that is known is a collection of
observations
Objective: To discover a category
structure which fits the observations
• i.e. finding natural groups in data
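As a hedged illustration of finding natural groups, the sketch below runs a plain k-means loop on six made-up two-dimensional observations. The talk does not say which clustering algorithm was used, so k-means, the farthest-first initialization, and the data are all assumptions.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def initial_centroids(points, k):
    """Deterministic start: first point, then repeatedly the farthest point."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        ))
    return centroids


def kmeans(points, k, iterations=20):
    """Alternate between assigning points and recomputing cluster means."""
    centroids = initial_centroids(points, k)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        updated = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:
            break  # assignments are stable
        centroids = updated
    return centroids, clusters


# Hypothetical unlabeled observations.
OBSERVATIONS = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2),
                (5.0, 5.0), (5.2, 4.8), (4.8, 5.2)]
CENTROIDS, CLUSTERS = kmeans(OBSERVATIONS, k=2)
```

No labels are supplied at any point: the two groups emerge from the distances between observations alone, which is what makes the method unsupervised.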
Combustion Application
Cluster analysis was used to detect
relationships among various species
• Are the behaviors of any two species related?
• Is the concentration of one species dependent
on that of one or more other species?
One confirmed hypothesis:
• CH reaches its peak concentration either before
or at the same time as H reaches its peak
concentration
• An important engineering observation
Artificial Neural Networks
A general, practical method for learning
real-valued, discrete-valued, and vector-valued functions from examples
Combustion application
• Finding different kinds of patterns
(increasing / decreasing, etc.) in the lifetime of
a species during the combustion process
• This can be used to test various hypotheses
as well as to detect patterns of specific species
in previously unseen data
Neural Networks:
Supervised Learning
Application Technique
Training set data are labeled by the user
• These labeled data are used to train the ANN
The ANN is then used to classify
previously unseen data
• e.g., species in a particular combustion
• Into a particular pattern class
For example, NO shows two different
trends under differing conditions
A trained ANN can be used to classify the
datasets according to the trend of NO
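The train-then-classify workflow can be sketched with a single perceptron, the simplest ANN unit, standing in for whatever network was actually used. The concentration series, the "rising"/"falling" labels, and the choice of consecutive differences as input features are all hypothetical.

```python
def differences(series):
    """Feature vector: change between consecutive samples."""
    return [b - a for a, b in zip(series, series[1:])]


def train_perceptron(examples, epochs=10):
    """examples: (series, label) pairs, label +1 (rising) or -1 (falling)."""
    size = len(examples[0][0]) - 1
    weights, bias = [0.0] * size, 0.0
    for _ in range(epochs):
        for series, label in examples:
            x = differences(series)
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            if label * score <= 0:  # misclassified: nudge toward the label
                weights = [w + label * xi for w, xi in zip(weights, x)]
                bias += label
    return weights, bias


def classify(series, weights, bias):
    score = sum(w * xi for w, xi in zip(weights, differences(series))) + bias
    return "rising" if score > 0 else "falling"


# Hypothetical user-labeled training set.
TRAINING = [
    ([0.1, 0.2, 0.3, 0.4], +1),
    ([0.0, 0.3, 0.5, 0.9], +1),
    ([0.9, 0.6, 0.4, 0.1], -1),
    ([0.5, 0.4, 0.2, 0.1], -1),
]
WEIGHTS, BIAS = train_perceptron(TRAINING)

# Previously unseen series.
first = classify([0.2, 0.25, 0.4, 0.6], WEIGHTS, BIAS)
second = classify([0.8, 0.7, 0.3, 0.2], WEIGHTS, BIAS)
```

A real application would use a multi-layer network and far more training data. The point here is only the split between labeled training examples and unseen series.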
Decision Trees
Characterize data by features
• e.g., species concentration at an instant
Categorize data sets
• Manually, or use ANN
• e.g., according to the trend of NO
Use decision tree algorithm to
discover clustering criteria
Sample Output
=== Classifier model (full training set) ===
J48 pruned tree
------------------
CO <= 0.002945
|   OH <= 0.000016
|   |   CO <= 0.000166: yes (17.0/1.0)
|   |   CO > 0.000166: no (3.0)
|   OH > 0.000016: yes (30.0)
CO > 0.002945: no (60.0/1.0)
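Read as nested tests, the tree above amounts to a few threshold comparisons. The sketch below is one reading of that output. The function name is made up, and the meaning of the yes/no classes and the units of the CO and OH values are not stated in the slides.

```python
def classify(co, oh):
    """Follow the J48 branches from the root to a leaf."""
    if co <= 0.002945:
        if oh <= 0.000016:
            return "yes" if co <= 0.000166 else "no"
        return "yes"
    return "no"


# Hypothetical feature values, one per leaf of the tree.
leaf_labels = [
    classify(co=0.0001, oh=0.00001),
    classify(co=0.001, oh=0.00001),
    classify(co=0.001, oh=0.001),
    classify(co=0.01, oh=0.0),
]
```

Each root-to-leaf path is a readable rule over species concentrations, which is what makes decision trees useful for explaining a categorization produced manually or by an ANN.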
Research Opportunities
Try it!
• In your area, on your data, for new
results
Features
• Definition, efficient extraction
Community building
• Sharing data mining results
PMML
Predictive Model Markup Language
XML based representation of
association rules
Developed by Data Mining Group
• Industrial and university research
collaboration
An Excellent Tutorial
Used for material in this talk
• Data Mining Scientific and Engineering
Applications
Tutorial at SC2001, November 12, 2001 by
R. Grossman, C. Kamath and V. Kumar
http://www-users.cs.umn.edu/~kumar/Presentation/sc2001.html