Graphical Data Mining for Computational Estimation in Materials

Download Report

Transcript Graphical Data Mining for Computational Estimation in Materials

Graphical Data Mining for
Computational Estimation in
Materials Science Applications
Aparna Varde
Ph.D. Dissertation
August 15, 2006
Committee Members
Prof. Elke Rundensteiner (Advisor)
Prof. Carolina Ruiz
Prof. David Brown
Prof. Neil Heffernan
Prof. Richard Sisson Jr. (Head of Materials Science, WPI)
This work is supported by the Center for Heat Treating Excellence and
by Department of Energy Award DE-FC-07-01ID14197
1
Introduction

Scientific domains: Experiments conducted with given input
conditions

Results plotted as graphs: Good visual depictions

Experimental results help in analysis: Assist decision-making

Performing experiment: Consumes time and resources
2
Motivating Example

Heat Treating of Materials
• Controlled heating & cooling of
materials to achieve mechanical &
thermal properties
Pneumatic on/off
switch
Pneumatic cylinder
K-type thermocouple

Performing experiments involves
•
•
•
•

One time cost: $1000s
Recurrent costs: $100s
Time: 5 to 6 hours
Human labor
Desirable to estimate
• Graphs given input conditions
• Conditions to achieve given graph
Connecting rod
Computer with Data
Acquisition Card
Probe tip
Furnace
Oil beaker
Thermocouple for Oil temp.
CHTE Experimental Setup
3
Problem Definition

To develop an estimation technique with following goals:
1. Given input conditions in an experiment, estimate resulting graph
2. Given desired graph in an experiment, estimate conditions to obtain it
4
Proposed Estimation Approach:
AutoDomainMine
5
Knowledge Discovery in
AutoDomainMine
6
Estimation of Graph in
AutoDomainMine
7
Estimation of Conditions in
AutoDomainMine
8
Main Tasks
Task 1
AutoDomainMine Learning Strategy
of Integrating Clustering and Classification
[AAAI-06 Poster, ACM SIGART’s ICICIS-05]
Task 2
Learning Domain-Specific
Distance Metrics for Graphs
Task 3
Designing Semantics-Preserving
Representatives for Clusters
[ACM KDD’s MDM-05,
MTAP-06 Journal]
[ACM SIGMOD’S IQIS-06,
ACM CIKM-06]
9
Task 2:
Learning Domain-Specific
Distance Metrics for Graphs
10
Motivation

Various distance metrics
• Absolute position of points
• Statistical observations
• Critical features

Issues
• Not known what metrics apply
• Multiple metrics may be
relevant

Need for distance metric
learning in graphs
Example of domain-specific problem
11
Proposed Distance Metric Learning
Approach: LearnMet

Given
• Training set with
actual clusters of
graphs

Additional Input
• Components:
distance metrics
applicable to graphs

LearnMet Metric
• D = ∑wiDi
12
Evaluate Accuracy
Use pairs of graphs
 A pair (ga,gb) is

 TP - same predicted,
same actual cluster:
(g1, g2)
 TN - different
predicted, different
actual clusters: (g2,g3)
 FP - same predicted
cluster, different actual
clusters: (g3,g4)
 FN - different
predicted, same actual
clusters: (g4,g5)
13
Evaluate Accuracy (Contd.)

How do we compute error for whole set of graphs?
• For all pairs

Error Measure
• Failure Rate FR
• FR = (FP+FN) / (TP+TN+FP+FN)

Error Threshold (t)
• Extent of FR allowed
• If (FR < t) then clustering is accurate
14
Adjust the Metric

Weight Adjustment Heuristic: for each Di
• New wi = wi – sfi (DFNi/DFN + DFPi/DFP)
[KDD’s MDM-05]
15
Evaluation of LearnMet


• G = number of graphs,
e.g., = 25
• GC2 = total number of
pairs, e.g., = 300
• Select subset of GC2 pairs
per epoch
Accuracy of Learned Metrics over Test Set
Learning Efficiency over Training Set
Details: MTAP-06
Effect of pairs per epoch
(ppe)

Observations
• Highest accuracy with
middle range of ppe
• Learning efficiency best
with low ppe
• Average accuracy with
LearnMet 86%
16
Task 3: Designing SemanticsPreserving Representatives
for Clusters
17
Motivation



Different combinations of
conditions could lead to a
single cluster
Graphs in a cluster could
have variations
Need for designing
representatives that
• Incorporate semantics
• Avoid visual clutter
• Cater to various users
18
Proposed Approach for Designing
Representatives: DesRept
19
Candidates for Conditions
1. Nearest Representative
Set of conditions in Cluster A
Nearest Representative for Cluster A


Return set of conditions closest to all others in cluster
Notion of distance: Domain-specific distance metric from
decision tree paths [CIKM-06]
20
Candidates for Conditions (Contd.)
2. Summarized Representative
Cluster A

Build sub-clusters of
condition using domain
knowledge

Return nearest subcluster representatives

Sort them
Subclusters
within the
Cluster A
Summarized Representative for Cluster
21
Candidates for Conditions (Contd.)
3. Combined Representative
Cluster A
Combined Representative for Cluster A


Return all sets of conditions
Sort them in ascending order
22
Candidates for Graphs
1. Nearest Representative


Select graph that is nearest neighbor for all others
Notion of distance: Domain-specific metric from LearnMet
23
Candidates for Graphs (Contd.)
2. Medoid Representative


Select graph closest to average of all graphs
Average of y-coordinate values since x-coordinates are same
24
Candidates for Graphs (Contd.)
3. Summarized Representative


Construct average graph with prediction limits
Average: centroid, Prediction limits: domain-specific thresholds
25
Candidates for Graphs (Contd.)
3. Combined Representative


Construct superimposed graph of all graphs in cluster
Same x-values, so plot y-values on a common x-axis
26
Effectiveness Measure for Candidates

Minimum Description Length Principle
• Theory: Representative, Examples: all items in cluster

Representative: Measure Complexity (ease of interpretation)
Complexity = log2 N for graphs, log2 AV for conditions,
• N = number of points to store representative graph
• A = number of attributes for conditions,
• V = number of values in representative set of conditions

Examples: Measure distance of items from representative (information loss)
Distance for graphs = log2 (1/G)∑{i=1 to G} D(r,gi)
• D: distance using domain-specific metric
• G: total number of graphs in cluster
• gi: each graph
• r: representative graph

Encoding [SIGMOD IQIS-06]
Effectiveness= UBC*Complexity + UBD*Distance
• UBC, UBD: User bias % weights for complexity and distance
27
Evaluation of DesRept: Conditions

Details
• Data Set Size = 400, Number of Clusters = 20

Observations
• Overall winner is Summarized
• As weight for complexity increases, Nearest wins
• Designed better than Random
28
Evaluation of DesRept: Graphs

Details
• Data Set Size = 400, Number of Clusters = 20

Observations
• Overall winner is Summarized
• As weight for complexity increases, Nearest / Medoid wins
• Designed better than Random
29
User Evaluation of
AutoDomainMine System

Formal user surveys in
different applications

Evaluation Process
• Compare estimation with
real data in test set
• If they match estimation
is accurate

Accuracy: Estimating Conditions
Observations
• Estimation Accuracy
around 90 to 95 %
Accuracy: Estimating Graphs
30
Related Work









Similarity Search [HK-01, WF-00]
• Non-matching conditions could be significant
Mathematical Modeling [M-95, S-60]
• Existing models not applicable under certain situations
Case-based Reasoning [K-93, AP-03]
• Adaptation of cases not feasible with graphs
Learning nearest neighbor in high-dimensional spaces: [HAK-00]
• Focus is dimensionality reduction, do not deal with graphs
Distance metric learning given basic formula: [XNJR-03]
• Deal with position-based distances for points, no graphs involved
Similarity search in multimedia databases [KB-04]
• Use various metrics in different applications, do not learn a single metric
Image Rating: [HH-01]
• User intervention involved in manual rating
Semantic Fish Eye Views: [JP-04]
• Display multiple objects in small space, no representatives
PDA Displays in Levels of Detail: [BGMP-01]
• Do not evaluate different types of representatives
31
Summary

Dissertation Contributions
• AutoDomainMine: Integrating Clustering and Classification
for Estimation [AAAI-06 Poster, ACM SIGART’s ICICIS-05]
• LearnMet: Learning Domain-Specific Distance Metrics for
Graphs [ACM KDD’s MDM-05, MTAP-06 Journal]
• DesRept: Designing Semantics-Preserving Representatives
for Clusters [ACM SIGMOD’s IQIS-06, ACM CIKM-06]
• Trademarked Tool for Computational Estimation in Materials
Science [ASM HTS-05, ASM HTS-03]

Future Work
• Image Mining, e.g., Comparing Nanostructures
• Data Stream Matching, e.g., Stock Market Analysis
• Visual Displays, e.g., Summarizing Web Information
32
Publications
Dissertation-Related Papers
1.
2.
3.
4.
5.
6.
7.
Designing Semantics-Preserving Representatives for Scientific Input
Conditions, A. Varde, E. Rundensteiner, C. Ruiz, D. Brown, M.
Maniruzzaman and R. Sisson Jr., In CIKM, Arlington, VA, Nov 2006.
Integrating Clustering and Classification for Estimating Process Variables in
Materials Science. A. Varde, E. Rundensteiner, C. Ruiz, D. Brown, M.
Maniruzzaman and R. Sisson Jr. In AAAI, Poster Track, Boston, MA, Jul
2006.
Effectiveness of Domain-Specific Cluster Representatives for Graphical
Plots. A. Varde, E. Rundensteiner, C. Ruiz, D. Brown, M. Maniruzzaman and
R. Sisson Jr. In ACM SIGMOD IQIS, Chicago, IL, Jun 2006.
LearnMet: Learning Domain-Specific Distance Metrics for Plots of Scientific
Functions. A. Varde, E. Rundensteiner, C. Ruiz, M. Maniruzzaman and R.
Sisson Jr. Accepted in the International MTAP Journal, Springer
Publications, Special Issue on Multimedia Data Mining, 2006.
Learning Semantics-Preserving Distance Metrics for Clustering Graphical
Data. A. Varde, E. Rundensteiner, C. Ruiz, M. Maniruzzaman and R. Sisson
Jr. In ACM KDD MDM, Chicago, IL, Aug 2005, pp. 107-112.
Apriori Algorithm and Game-of-Life for Predictive Analysis in Materials
Science. A. Varde, M. Takahashi, E. Rundensteiner, M. Ward, M.
Maniruzzaman and R. Sisson Jr. In KES Journal, IOS Press, Netherlands,
Vol. 8, No. 4, 2004, pp. 213 – 228.
Data Mining over Graphical Results of Experiments with Domain
Semantics. A. Varde, E. Rundensteiner, C. Ruiz, M. Maniruzzaman and R.
Sisson Jr. In ACM SIGART ICICIS, Cairo, Egypt, Mar 2005, pp. 603 –
611.
33
Publications (Contd.)
8.
9.
10 .
QuenchMiner: Decision Support for Optimization of Heat Treating Processes.
A. Varde, M. Takahashi, E. Rundensteiner, M. Ward, M. Maniruzzaman and R.
Sisson Jr. In IEEE IICAI, Hyderabad, India, Dec 2003, pp. 993 – 1003.
Estimating Heat Transfer Coefficients as a Function of Temperature by Data
Mining. A. Varde, E. Rundensteiner, M. Maniruzzaman and R. Sisson Jr. In
ASM HTS, Pittsburgh, PA, Sep 2005.
The QuenchMiner Expert System for Quenching and Distortion Control. A.
Varde, E. Rundensteiner, M. Maniruzzaman and R. Sisson Jr. In ASM HTS,
Indianapolis, IN, Sep 2003, pp. 174 – 183.
Other Papers
11.
12.
MEDWRAP: Consistent View Maintenance over Distributed Multi-Relation
Sources. A. Varde and E. Rundensteiner. In DEXA. Aix-en-Provence, France,
Sep 2002, pp. 341 – 350.
SWECCA for Data Warehouse Maintenance. A. Varde and E. Rundensteiner.
In SCI, Orlando, FL, Jul 2002, Vol. 5, pp. 352 – 357.
13. MatML: XML for Information Exchange with Materials Property Data, A. Varde,
14.
E. Begley, S. Fahrenholz-Mann. In ACM KDD DM-SPP, Philadelphia, PA, Aug
2006.
Semantic Extensions to Domain-Specific Markup Languages. A. Varde, E.
Rundensteiner, M. Mani, M. Maniruzzaman and R. Sisson Jr. In IEEE CCCT,
Austin, TX, Aug 2004, Vol. 2, pp. 55 – 60.
34
Acknowledgments















First of all, my Advisor: Prof. Elke Rundensteiner
Committee: Prof. Carolina Ruiz,Prof. David Brown, Prof Neil Heffernan
External Member: Prof. Richard D. Sisson Jr., Head of Materials Program
Director of Metal Processing Institute: Prof. Diran Apelian
Domain Expert: Dr. Mohammed Maniruzzaman
Members of Center for Heat Treating Excellence
CS Department Head: Prof. Michael Gennert
Former CS Department Head: Prof. Micha Hofri
WPI Administration (CS, Materials): In particular Mrs. Rita Shilansky
Reviewers of Conferences and Journals where my papers got accepted
Members of DSRG, AIRG, KDDRG and Quenching Research Group
Colleagues and Friends: Shuhui, Sujoy, Viren, Olly, Mariana, Rimma, Maged,
Bin, Lydia, Shimin and others…
Great Thanks to my Family: Parents Dr. Sharad Varde and Dr. (Mrs.) Varsha
Varde, Grandparents Mr. D.A. Varde and Mrs. Vimal Varde, Brother Ameya
Varde and Sister-in-law Deepa Varde
All the attendees of my Ph.D. Defense
Finally, God for guiding me throughout my doctoral journey
35
Thank You
36