Presentation II

Download Report

Transcript Presentation II

Dimacs Graph Mining
(via Similarity Measures)
Ye Zhu Stephanie
REU-DIMACS, July 17, 2009
[email protected]
Mentor : James Abello
Talk Outline
I.
From Data to Graphs via Similarity Measure
II.
Our Research Project
Input: REU participants information
DIMACS Workshop Data
Output: A variety of Graphs
III. Main Questions
a. Choose good similarity measures
b.Visualize and detect “interesting” patterns
Original data records for building
REU-Participants graphs
Original Data Records
DIMACS Workshop Abstracts
General Method

Step 1: Compute a similarity measure
among the data records shown above.
Since a record can be viewed as a
unweighted/weighted set of attributes we
use unweighted/weighted version of an
standard metric among finite sets that
uses the size of the intersection over the
size of the union between two sets
Weighted case
dog
eat 0.7
shaggy 0.8 pet
brown 0.9
fat
fat 0.75
pet 0.6
cat
pet 0.6
hairy 0.85
fat 0.75
pet 0.6
Computation
dog
cat
lw(wi,attj)  log (freq of attj with wi)
eat 0.7
pet 0.6
W  gw(attj) lw(wj,attj)
shaggy 0.8 pet
hairy 0.85
 gw(attj) log (freq of attj with wi)
brown 0.9
fat 0.75
min
(W(w
m,attj),W(wn,attj))

j
fat
fat
0.75
pet 0.6
WJ(w m,wn) 
max (W(w m,attj),W(wn,attj))

j
pet 0.6
0  0  0  0.75  0.6  0
 0.28
0.9  0.7  0.85  0.75  0.79  0.8
General Method
Step 1: Compute a similarity measure
among the data records shown above.
 Step 2: Deal with different types of data
records respectively.

Computing Edge Weight

To deal with different types of
information, we partition the attributes
into different classes according to their
value types and compute a similarity
measure for each class and then combine
these values using a convex combination
Eg. Total Weight=
0.3*Weighted Coeff+0.7*Unweighted Coeff
REU participants example
How to calculate the Edge Weight?
Unweighted
Weighted
Unweighted
REU participants example
How about the Vertices' Weight(ball size)
We can simply convert these 3 columns to
three-digit numbers !!!
General Method
Step 1: Compute a similarity measure
among the data records shown above.
 Step 2: Deal with different types of data
records respectively.
 Step 3: Build weighted graph where each
record is now treated as a vertex and
two vertices are joined by an edge with
weight equal to their computed similarity

General Method
Step 1: Compute a similarity measure
among the data records shown above.
 Step 2: Deal with different types of data
records respectively.
 Step 3: Build weighted graph where each
record is now treated as a vertex and
two vertices are joined by an edge with
weight equal to their computed similarity
 Step 4: Visualize the graph use GraphView
Software and find interesting clusters

REU Participants Graph
Workshop Abstract Example
Read in all workshop abstracts file
 Delete stop words -> unimportant
words
 Get a count of number of appearances
(freqency) of ALL words left in All
workshop abstracts
 Compute Jaccard Coefficient

Dimacs Workshop Abstract Graph
Conclusion

We have shown how data set records can
be transformed into a weighted graph by
using a similarity measure among records

This methodology allows us to use
powerful graph clustering techniques to
analyze and visualize data bases.
References
 [1]
GraphView system
 [2] C Gasperin, P Gamallo, A Agustini, G Lopes,V Lima 2001
Using syntactic contexts for measuring word similarity
[3] Resnik, Philip (1999) Semantic Similarity in a Taxonomy: An
Information-Based Measure and its Application to Problems of
Ambiguity in Natural Language. Journal of Artificial Intelligence
Research
Thank you!
The end