Data mining, interactive semantic structuring, and collaboration:
A diversity-aware method for sense-making in search
Mathias Verbeke,
Bettina Berendt,
Siegfried Nijssen
Dept. Computer Science, KU Leuven
Agenda
Motivation
– Diversity
– Diversity-aware tools
– (Our) Context
Main part
– Measures of diversity
– Tool
Outlook
Motivation (1): Diversity is ...
Speaking different languages (etc.) → localisation / internationalisation
Having different abilities → accessibility
Liking different things → collaborative filtering
Structuring the world in different ways → ?
Motivation (2):
Diversity-aware applications ...
Must have a (formal) notion of diversity
Can follow a
– "personalization approach": adapt to the user's value on the diversity variable(s). Transparently? Is this paternalistic?
– "customization approach": show the space of diversity; allow choice / semi-automatic!
(Our) Context
1. Diversity and Web usage: language, culture
2. Family of tools focussing on interactive sense-making helped by data mining
– PORPOISE: global and local analysis of news and blogs + their relations
– STORIES: finding + visualisation of "stories" in news
– CiteseerCluster: literature search + sense-making
– Damilicious: CiteseerCluster + re-use/transfer of semantics + diversity
Measuring grouping diversity
Diversity = 1 - similarity = 1 - normalized mutual information (NMI)
[Figure: example groupings of the same documents, e.g. by colour vs. by another criterion, with NMI = 0 and NMI = 0.35]
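A minimal sketch of this measure, assuming each grouping is given as a list of cluster labels over the same documents and using scikit-learn's NMI implementation (the exact NMI normalisation used in the tool is not specified on the slide):

```python
# Grouping diversity as 1 - NMI, assuming two groupings are
# label lists over the same set of documents.
from sklearn.metrics import normalized_mutual_info_score

def grouping_diversity(grouping_a, grouping_b):
    """Diversity = 1 - normalized mutual information of two groupings."""
    return 1.0 - normalized_mutual_info_score(grouping_a, grouping_b)

# Example: two groupings of the same 6 documents.
by_colour = [0, 0, 1, 1, 2, 2]   # grouping by one criterion
by_shape  = [0, 1, 0, 1, 0, 1]   # grouping by an unrelated criterion
print(grouping_diversity(by_colour, by_shape))   # 1.0: NMI = 0, maximally diverse
print(grouping_diversity(by_colour, by_colour))  # 0.0: identical groupings
```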
Measuring user diversity
"How similarly do two users group documents?"
For each query q, consider their groupings gr_A(q) and gr_B(q):
gdiv(A, B, q) = 1 - NMI(gr_A(q), gr_B(q))
For various queries: aggregate into gdiv(A, B)
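A possible reading of these two steps in code, with the per-query measure gdiv(A, B, q) = 1 - NMI and a simple mean as the aggregation over queries (the aggregation function is an assumption, not stated on the slide):

```python
# Per-query and aggregated user diversity; the mean over shared
# queries is an assumed aggregation, not taken from the slide.
from sklearn.metrics import normalized_mutual_info_score

def gdiv_q(grouping_a, grouping_b):
    """gdiv(A, B, q): diversity of two users' groupings for one query q."""
    return 1.0 - normalized_mutual_info_score(grouping_a, grouping_b)

def gdiv(groupings_a, groupings_b):
    """gdiv(A, B): aggregate per-query diversities over several queries.

    groupings_a / groupings_b map each query to that user's grouping
    (a list of cluster labels over the query's result documents).
    """
    shared_queries = set(groupings_a) & set(groupings_b)
    per_query = [gdiv_q(groupings_a[q], groupings_b[q]) for q in shared_queries]
    return sum(per_query) / len(per_query)
```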
... and now: the application domain
... that's only the 1st step!
Workflow
1. Query
2. Automatic clustering
3. Manual regrouping
4. Re-use
   1. Learn + present way(s) of grouping
   2. Transfer the constructed concepts
Concepts
Extension
– the instances in a group
Intension
– Ideally: "squares vs. circles"
– Pragmatically: defined via a classifier
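One possible data representation of a concept as described here, with the extension as the set of grouped instances and the intension approximated by a membership classifier; this is an illustrative sketch, not the tool's actual data model:

```python
# A concept = extension (grouped instances) + intension (a classifier
# deciding whether a new document belongs). Names are illustrative.
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Concept:
    label: str                        # e.g. a Lingo phrase
    extension: Set[str]               # document IDs in the group
    intension: Callable[[str], bool]  # classifier: does a text belong?

# Illustrative usage with a trivial keyword-based "classifier":
web_mining = Concept(
    label="Web mining",
    extension={"doc-12", "doc-47"},
    intension=lambda text: "web mining" in text.lower(),
)
```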
Step 1: Retrieve
CiteseerX via OAI
Output: set of
– document IDs,
– document details
– their texts
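A rough sketch of such a retrieval step over OAI-PMH; the endpoint URL below is an assumption, and real harvesting would also follow resumptionToken paging, which is omitted here:

```python
# Minimal OAI-PMH harvesting sketch (single page, no resumptionToken
# handling). The CiteseerX endpoint URL is an assumption; the verbs
# and namespaces are part of the OAI-PMH / Dublin Core standards.
import requests
import xml.etree.ElementTree as ET

OAI_ENDPOINT = "http://citeseerx.ist.psu.edu/oai2"  # hypothetical endpoint
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def retrieve_records():
    """Return (id, metadata) pairs for one page of ListRecords."""
    response = requests.get(OAI_ENDPOINT,
                            params={"verb": "ListRecords",
                                    "metadataPrefix": "oai_dc"})
    root = ET.fromstring(response.content)
    records = []
    for record in root.iter("{http://www.openarchives.org/OAI/2.0/}record"):
        identifier = record.findtext(".//oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        abstract = record.findtext(".//dc:description", namespaces=NS)
        records.append((identifier, {"title": title, "abstract": abstract}))
    return records
```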
Step 2: Cluster
"the classic bibliometric solution"
CiteseerCluster:
– Similarity measure: co-citation, bibliographic coupling, word or LSA similarity, combinations
– Clustering algorithm: k-means, hierarchical
Damilicious: phrase-based clustering (Lingo)
How to choose the "best"?
– Experiments: Lingo better than k-means at reconstruction and extension-over-time
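Lingo itself is the phrase-based algorithm from the Carrot2 framework; as a hedged illustration of the generic baseline it is compared against, here is a TF-IDF + LSA + k-means sketch with illustrative parameter choices (not the settings used in CiteseerCluster or Damilicious):

```python
# Generic baseline sketch: TF-IDF features, LSA dimensionality
# reduction, and k-means over the retrieved document texts.
# Cluster count and LSA dimensions are illustrative; n_lsa_dims
# must stay below the vocabulary size.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

def cluster_documents(texts, n_clusters=5, n_lsa_dims=100):
    """Cluster raw document texts; returns one cluster label per text."""
    pipeline = make_pipeline(
        TfidfVectorizer(stop_words="english"),
        TruncatedSVD(n_components=n_lsa_dims),   # LSA similarity space
        KMeans(n_clusters=n_clusters, n_init=10),
    )
    return pipeline.fit_predict(texts)
```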
Step 3 (a): Re-organise
& work on document groups
Step 3 (b):
Visualising document groups
Steps 4+5: Re-use
Basic idea:
1. learn a classifier from the final grouping (Lingo phrases)
2. apply the classifier to a new search result
"re-use semantics"
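One way to realise this re-use step, sketched with a generic bag-of-words classifier; the slide does not fix the classifier type, so TF-IDF + Naive Bayes here is only an illustrative stand-in:

```python
# "Re-using semantics": train a text classifier on the user's final
# (manually regrouped) clusters, then label a new search result with
# the same concepts. Classifier choice is illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def learn_grouping(texts, group_labels):
    """Step 1: learn a classifier from the final grouping."""
    model = make_pipeline(TfidfVectorizer(stop_words="english"),
                          MultinomialNB())
    model.fit(texts, group_labels)
    return model

def transfer_grouping(model, new_texts):
    """Step 2: transfer the constructed concepts to a new search result."""
    return model.predict(new_texts)
```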
Whose grouping?
– One's own
– Somebody else's
Which search result?
– "the same" (same query, structuring by somebody else)
– "More of the same" (same query, later time, more doc.s)
– "related" (... measured how? ...)
– arbitrary
Visualising user diversity (1)
Simulated users with different strategies
U0: did not change anything ("System")
U1: tried to produce a better fit of the document groups to the cluster intensions; 5 regroupings
U2: attempted to move everything that did not fit well into the remainder group "Other topics", and to achieve a better fit; 10 regroupings
U3: attempted to move everything from "Other topics" into matching real groups; 5 regroupings
U4: regrouped by author and institution; 5 regroupings
5×5 matrix of diversities gdiv(A, B, q), visualised with multidimensional scaling
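A small sketch of this visualisation step: multidimensional scaling over a precomputed diversity matrix, using made-up illustrative values rather than the study's actual gdiv scores:

```python
# Embed a 5x5 matrix of pairwise user diversities gdiv(A, B, q) into
# 2D with multidimensional scaling. The matrix values are illustrative,
# not the study's data.
import numpy as np
from sklearn.manifold import MDS

users = ["U0", "U1", "U2", "U3", "U4"]
diversities = np.array([   # symmetric, zero diagonal
    [0.0, 0.2, 0.4, 0.3, 0.8],
    [0.2, 0.0, 0.3, 0.2, 0.8],
    [0.4, 0.3, 0.0, 0.4, 0.9],
    [0.3, 0.2, 0.4, 0.0, 0.8],
    [0.8, 0.8, 0.9, 0.8, 0.0],
])

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(diversities)   # one 2D point per simulated user
for user, (x, y) in zip(users, coords):
    print(f"{user}: ({x:.2f}, {y:.2f})")
```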
Visualising user diversity (2)
[Figure: MDS plots for the queries "Web mining", "Data mining", and "RFID", and aggregated over queries using gdiv(A, B)]
Evaluating the application
Clustering only: Does it generate
meaningful document groups?
– yes (tradition in bibliometrics) – but: data?
– Small expert evaluation of CiteseerCluster
Clustering & regrouping
– End-user experiment with CiteseerCluster
– 5-person formative user study of Damilicious
Summary and
(some) open questions
Damilicious: a tool that helps users in sense-making, exploring
diversity, and re-using semantics
Diversity measures when queries and result sets are different?
How to best present diversity?
– How to integrate into an environment supporting user and community
contexts (e.g., Niederée et al. 2005)?
Incentives to use the functionalities?
how to find the best balance between similarity and diversity?
which measures of grouping diversity are most meaningful?
– Extensional?
– Intensional? Structure-based? Hybrid? (cf. ontology matching)
which other sources of user diversity?
Thanks!