Presentation II
Download
Report
Transcript Presentation II
Dimacs Graph Mining
(via Similarity Measures)
Ye Zhu Stephanie
REU-DIMACS, July 17, 2009
[email protected]
Mentor : James Abello
Talk Outline
I.
From Data to Graphs via Similarity Measure
II.
Our Research Project
Input: REU participants information
DIMACS Workshop Data
Output: A variety of Graphs
III. Main Questions
a. Choose good similarity measures
b.Visualize and detect “interesting” patterns
Original data records for building
REU-Participants graphs
Original Data Records
DIMACS Workshop Abstracts
General Method
Step 1: Compute a similarity measure
among the data records shown above.
Since a record can be viewed as a
unweighted/weighted set of attributes we
use unweighted/weighted version of an
standard metric among finite sets that
uses the size of the intersection over the
size of the union between two sets
Weighted case
dog
eat 0.7
shaggy 0.8 pet
brown 0.9
fat
fat 0.75
pet 0.6
cat
pet 0.6
hairy 0.85
fat 0.75
pet 0.6
Computation
dog
cat
lw(wi,attj) log (freq of attj with wi)
eat 0.7
pet 0.6
W gw(attj) lw(wj,attj)
shaggy 0.8 pet
hairy 0.85
gw(attj) log (freq of attj with wi)
brown 0.9
fat 0.75
min
(W(w
m,attj),W(wn,attj))
j
fat
fat
0.75
pet 0.6
WJ(w m,wn)
max (W(w m,attj),W(wn,attj))
j
pet 0.6
0 0 0 0.75 0.6 0
0.28
0.9 0.7 0.85 0.75 0.79 0.8
General Method
Step 1: Compute a similarity measure
among the data records shown above.
Step 2: Deal with different types of data
records respectively.
Computing Edge Weight
To deal with different types of
information, we partition the attributes
into different classes according to their
value types and compute a similarity
measure for each class and then combine
these values using a convex combination
Eg. Total Weight=
0.3*Weighted Coeff+0.7*Unweighted Coeff
REU participants example
How to calculate the Edge Weight?
Unweighted
Weighted
Unweighted
REU participants example
How about the Vertices' Weight(ball size)
We can simply convert these 3 columns to
three-digit numbers !!!
General Method
Step 1: Compute a similarity measure
among the data records shown above.
Step 2: Deal with different types of data
records respectively.
Step 3: Build weighted graph where each
record is now treated as a vertex and
two vertices are joined by an edge with
weight equal to their computed similarity
General Method
Step 1: Compute a similarity measure
among the data records shown above.
Step 2: Deal with different types of data
records respectively.
Step 3: Build weighted graph where each
record is now treated as a vertex and
two vertices are joined by an edge with
weight equal to their computed similarity
Step 4: Visualize the graph use GraphView
Software and find interesting clusters
REU Participants Graph
Workshop Abstract Example
Read in all workshop abstracts file
Delete stop words -> unimportant
words
Get a count of number of appearances
(freqency) of ALL words left in All
workshop abstracts
Compute Jaccard Coefficient
Dimacs Workshop Abstract Graph
Conclusion
We have shown how data set records can
be transformed into a weighted graph by
using a similarity measure among records
This methodology allows us to use
powerful graph clustering techniques to
analyze and visualize data bases.
References
[1]
GraphView system
[2] C Gasperin, P Gamallo, A Agustini, G Lopes,V Lima 2001
Using syntactic contexts for measuring word similarity
[3] Resnik, Philip (1999) Semantic Similarity in a Taxonomy: An
Information-Based Measure and its Application to Problems of
Ambiguity in Natural Language. Journal of Artificial Intelligence
Research
Thank you!
The end