Data mining, interactive semantic structuring, and

Diversity
in search:
what, how,
and what for?
Bettina Berendt
Dept. Computer Science,
KU Leuven
Thanks to
Sebastian Kolbe-Nusser
 Anett Kralisch
 Siegfried Nijssen
 Ilija Subašić
 Mathias Verbeke
 Hugo Zaragoza
 ...

Diversity in natural language
diverse (s#2), various :
distinctly dissimilar or unlike
..., diversity (s#1), ..., variety :
noticeable heterogeneity
(Wordnet)

“the fact that members of a set are
different from one another“
Why is diversity
interesting for search?
“People like to see a range of different, nonredundant things/views/etc.“
“Different people search differently.“
 How?
 When / under what conditions?
 (What) can we do?
What is diverse?

Documents
– the relevance of a document must be determined
considering the documents appearing before it
(Goffman, 1964)
– E.g. MMR (Carbonell & Goldstein, 1998)
– Many further developments, e.g. for images
– Presentation choices, e.g. re-ranking or clustering?
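As an illustration of the MMR idea (Carbonell & Goldstein, 1998), here is a minimal re-ranking sketch. It assumes relevance scores and a pairwise document similarity are already available; the names `mmr_rerank`, `relevance`, `similarity`, and `lam` are illustrative and not taken from the talk.

```python
def mmr_rerank(relevance, similarity, lam=0.7, k=None):
    """Greedy Maximal Marginal Relevance re-ranking.

    relevance:  dict mapping document id -> relevance to the query
    similarity: function (doc_a, doc_b) -> similarity in [0, 1]
    lam:        trade-off; 1.0 is pure relevance, lower values favour novelty
    """
    remaining = list(relevance)
    selected = []
    k = len(remaining) if k is None else k
    while remaining and len(selected) < k:
        def score(doc):
            # Redundancy = similarity to the closest already-selected document.
            redundancy = max((similarity(doc, s) for s in selected), default=0.0)
            return lam * relevance[doc] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (made up): "a" and "a2" are near-duplicates, "b" is different.
relevance = {"a": 0.9, "a2": 0.85, "b": 0.6}
pair_sim = {frozenset(("a", "a2")): 0.95,
            frozenset(("a", "b")): 0.1,
            frozenset(("a2", "b")): 0.1}

def sim(x, y):
    return pair_sim.get(frozenset((x, y)), 0.0)

print(mmr_rerank(relevance, sim, lam=0.5))  # ['a', 'b', 'a2']
print(mmr_rerank(relevance, sim, lam=1.0))  # ['a', 'a2', 'b']
```

With `lam=0.5` the near-duplicate is pushed below the less relevant but novel document, which is the "considering the documents appearing before it" effect.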
What is diverse?
Documents
 People

– “The term diversity is a form of euphemistic
shorthand to describe differences in racial or ethnic
classifications, age, gender, religion, philosophy,
physical abilities, socioeconomic background, sexual
orientation, gender identity, intelligence, mental
health, physical health, genetic attributes, behavior,
attractiveness, place of origin, cultural values, or
political view as well as other identifying features.”
http://en.wikipedia.org/wiki/Diversity_(politics)
What is diverse?
Documents
 People
Knowledge and its articulations
(= documents in a wider sense?!)

– “Knowledge and its articulations are strongly
influenced by diversity in, e.g., cultural backgrounds,
schools of thought, geographical contexts.”
– “LivingKnowledge will study the effect of diversity and
time on opinions and bias.”
– “The goal [is] to improve navigation and search in
very large multimodal datasets (e.g., the Web itself).”
How we got here
 The impact of language and culture on Web usage behaviour → Diversity of users
 Tools for sense-making in literature search → Diversity of documents
 PORPOISE, STORIES tools for graphical news summarization and understanding
 Collaborative re-use of literature search results → Diversity of diversity
Why this talk?
 The impact of language and culture on Web usage behaviour (e.g. Information Retrieval J. 2009)
 Collaborative re-use of literature search results (Proceedings Living Web WS@ISWC 2009)
 Tools for sense-making in literature search (Inf. Processing & Management 2010)
 PORPOISE, STORIES tools for graphical news summarization and understanding (e.g. Knowledge and Information Systems J. 2009)
Towards an integrated understanding of diversity
The impact of linguistic diversity on
Web usage and thereby on the Web
Or:

Why are non-English languages underrepresented on the Web?

A web-analysis approach asking for underlying
– cognitive-linguistic
– behavioural
– attitude
factors
A simple expectation of how much
content exists in which language
But: Dynamics of content creation, link
setting, link following, attitudes, and use
 People create less content
 People link less to content
 People use links less
 People think the content is bad
 ... and use it less
 Under-representation!
Underlying data and methods


 Database of countries and official languages
 Distribution comparisons between
– worldwide proportions of native speakers of different languages
– worldwide distribution of servers registered by country
– crawler analysis of links to a multilingual site S
– log analysis assigning each session a native language
– log analysis of (user native language) – (S-entry-page language)
 Questionnaire/TAM analysis of native and non-native users of S:
– usability, ease of use, competence in English, beliefs about availability of content in native language
Some questions
Does one find such dynamics also in search
engines?
 What factors stop or reverse such language-marginalisation trends?

– Critical mass?
– Laws?
– Volunteers?
Did / can Web 2.0/3.0 change this?
 (When) is it better to work without pre-defined
labels for users?

 Part 2: An approach that ...
Motivation (1):
Diversity of people is ...
 Speaking different languages (etc.) → localisation / internationalisation
 Having different abilities → accessibility
 Liking different things → collaborative filtering
 Structuring the world in different ways → ?

Motivation (2):
Diversity-aware applications ...
Must have a (formal) notion of diversity
 Can follow a

– “personalization approach“
 adapt to the user‘s value on the diversity
variable(s)
 transparently? Is this paternalistic?
– “customization approach“
 show the space of diversity
 allow choice / raise awareness / semi-automatic!
Measuring grouping diversity
Diversity = 1 – similarity = 1 – normalized mutual information (NMI)
[Figure: example groupings (one label reads “By colour &”), with NMI = 0 and NMI = 0.35]
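A minimal sketch of the measure, assuming each grouping is given as a list of group labels aligned by item (position i holds the group of item i). The slide does not say which NMI normalisation is used; the variant below, 2·I(A;B) / (H(A) + H(B)), is an assumption, and all function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a grouping given as a label list."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two groupings of the same items."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log(c * n / (count_a[x] * count_b[y]))
               for (x, y), c in joint.items())


def nmi(a, b):
    """Assumed normalisation: 2 * I(A;B) / (H(A) + H(B))."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a + h_b == 0:
        return 1.0  # both groupings put everything in one group: identical
    return 2 * mutual_information(a, b) / (h_a + h_b)


def grouping_diversity(a, b):
    """Diversity = 1 - similarity = 1 - NMI."""
    return 1.0 - nmi(a, b)


# Four items grouped by colour and, independently, by shape (made-up data).
by_colour = ["red", "red", "blue", "blue"]
by_shape = ["square", "circle", "square", "circle"]
print(grouping_diversity(by_colour, by_shape))   # NMI = 0, so diversity is 1
print(grouping_diversity(by_colour, by_colour))  # same grouping, diversity ~0
```

Because NMI depends only on how items are split, renaming the groups does not change the result.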
Measuring user diversity
“How similarly do two users group documents?“
 For each query q, consider their groupings gr:


For various queries: aggregate
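The formula on this slide did not survive extraction. One plausible reading, written out as a sketch: the per-query measure is the grouping diversity applied to the two users' groupings, and the aggregate is an average over queries. The notation gr(U, q), the query set Q, and the choice of the arithmetic mean are assumptions, not taken from the slide.

```latex
% Assumption: gr(U, q) is user U's grouping of the results of query q,
% and Q is the set of queries that both users have structured.
\[
  \mathrm{gdiv}(A, B, q) \;=\; 1 - \mathrm{NMI}\bigl(gr(A, q),\, gr(B, q)\bigr)
\]
% Assumption: aggregation by arithmetic mean; the slide only says "aggregate".
\[
  \mathrm{gdiv}(A, B) \;=\; \frac{1}{|Q|} \sum_{q \in Q} \mathrm{gdiv}(A, B, q)
\]
```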
... and now: the application domain
... that‘s only the 1st step!
Workflow
1. Query
2. Automatic clustering
3. Manual regrouping
4. Re-use
   1. Learn + present way(s) of grouping
   2. Transfer the constructed concepts
Concepts

Extension
– the instances in a group

Intension
– Ideally: “squares vs. circles“
– Pragmatically: defined via a
classifier
Step 1: Retrieve
CiteseerX via OAI
 Output: set of

– document IDs,
– document details
– their texts
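A hedged sketch of what retrieval over OAI-PMH can look like, using only the Python standard library. The base URL is a placeholder rather than the real CiteseerX endpoint, the field selection is an assumption, and `oai_dc` metadata carries titles and descriptions rather than full texts, so obtaining the texts would need a further step not shown here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {"oai": "http://www.openarchives.org/OAI/2.0/",
      "dc": "http://purl.org/dc/elements/1.1/"}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Extract document IDs and basic details from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "id": rec.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            "description": rec.findtext(".//dc:description", default="", namespaces=NS),
        })
    return records


# Hand-written sample response (not real CiteseerX data).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample paper</dc:title>
          <dc:description>Abstract text.</dc:description>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("https://example.org/oai"))  # placeholder endpoint
print(parse_records(SAMPLE))
```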
Step 2: Cluster
“the classic bibliometric solution“
 CiteseerCluster:

– Similarity measure: co-citation, bibliographic coupling, word or LSA similarity, combinations
– Clustering algorithm: k-means, hierarchical
Damilicious: phrases → Lingo
 How to choose the “best“?

– Experiments: Lingo better than k-means at
reconstruction and extension-over-time
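To make two of the similarity measures listed above concrete, here is a small sketch of co-citation and of the coupling measure (usually called bibliographic coupling), computed as raw counts. The input format, a dict from each paper to the set of works it cites, and the function names are assumptions; the sketch does not show how CiteseerCluster normalises or combines these counts.

```python
from collections import Counter
from itertools import combinations


def bibliographic_coupling(references):
    """For each pair of citing papers: how many references they share."""
    return {frozenset((a, b)): len(references[a] & references[b])
            for a, b in combinations(sorted(references), 2)}


def co_citation(references):
    """For each pair of cited works: how many papers cite both."""
    counts = Counter()
    for cited in references.values():
        for a, b in combinations(sorted(cited), 2):
            counts[frozenset((a, b))] += 1
    return counts


# Made-up citation data: p1 and p3 cite the same two works.
references = {"p1": {"x", "y"}, "p2": {"y", "z"}, "p3": {"x", "y"}}

coupling = bibliographic_coupling(references)
cocited = co_citation(references)
print(coupling[frozenset(("p1", "p3"))])  # 2 shared references
print(cocited[frozenset(("x", "y"))])     # cited together by 2 papers
```

Either count matrix can then serve as the similarity input to a clustering algorithm such as k-means or hierarchical clustering.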
Step 3 (a): Re-organise
& work on document groups
Step 3 (b):
Visualising document groups
Steps 4+5: Re-use

Basic idea:
1. learn a classifier from the final grouping (Lingo phrases)
2. apply the classifier to a new search result
 “re-use semantics“

Whose grouping?
– One's own
– Somebody else's

Which search result?
– “the same” (same query, structuring by somebody else)
– “more of the same” (same query, later time → more docs)
– “related” (... measured how? ...)
– arbitrary
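The two re-use steps (learn a classifier from a grouping, then apply it to a new search result) can be sketched as follows. This is not the tool's classifier, which is built on Lingo phrases; as a stand-in it uses a simple nearest-profile classifier over word counts with cosine similarity. All function names and the toy data are illustrative; the fallback label mirrors the “Other topics” remainder group.

```python
import math
from collections import Counter


def tokens(text):
    return text.lower().split()


def learn_classifier(grouping):
    """grouping: {group label: [document text, ...]} -> word-count profile per group."""
    return {label: Counter(t for doc in docs for t in tokens(doc))
            for label, docs in grouping.items()}


def cosine(p, q):
    dot = sum(p[t] * q[t] for t in p if t in q)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0


def apply_classifier(profiles, new_docs, fallback="Other topics"):
    """Assign each new document to the most similar learned group."""
    result = {}
    for doc in new_docs:
        vec = Counter(tokens(doc))
        best = max(profiles, key=lambda label: cosine(vec, profiles[label]))
        # Documents sharing no words with any group go to the remainder group.
        result[doc] = best if cosine(vec, profiles[best]) > 0 else fallback
    return result


# A user's final grouping for one query (made-up data) ...
grouping = {
    "Web mining": ["web usage mining of server logs", "mining web link structure"],
    "RFID": ["rfid tag privacy", "rfid sensor data streams"],
}
profiles = learn_classifier(grouping)

# ... re-used to structure a new search result.
print(apply_classifier(profiles, ["privacy of rfid tags", "web logs", "quantum chemistry"]))
```

The learned profiles play the role of the concept's intension, while the documents assigned to each group form its extension on the new result set.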
Visualising user diversity (1)
Simulated users with different strategies
 U0: did not change anything (“System”)
 U1: tried to produce a better fit of the document groups to the cluster intensions; 5 regroupings
 U2: attempted to move everything that did not fit well into the remainder group “Other topics”, & better fit; 10 regroupings
 U3: attempted to move everything from “Other topics” into matching real groups; 5 regroupings
 U4: regrouping by author and institution; 5 regroupings
 5*5 matrix of diversities gdiv(A,B,q)
 multidimensional scaling
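The step from a matrix of pairwise diversities to a 2-D map can be sketched with classical (Torgerson) multidimensional scaling. The slides do not say which MDS variant was used, so this choice is an assumption. The sketch is pure Python and finds the leading eigenvectors by shifted power iteration, which is adequate for a 5*5 matrix but not meant for large inputs.

```python
import math
import random


def classical_mds(dist, dims=2, iters=1000, seed=0):
    """Embed n items in `dims` dimensions from a symmetric n x n distance matrix."""
    n = len(dist)
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Shift by a Gershgorin bound so that power iteration converges to the
        # largest eigenvalue of b rather than the largest in magnitude.
        shift = max(sum(abs(x) for x in row) for row in b)
        v = [rng.random() + 0.1 for _ in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) + shift * v[i]
                 for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        lam = sum(v[i] * b[i][j] * v[j] for i in range(n) for j in range(n))
        if lam <= 1e-12:
            break  # no further positive eigenvalue: remaining axes stay at 0
        for i in range(n):
            coords[i][k] = v[i] * math.sqrt(lam)
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]  # deflate the found component
    return coords


# Toy distance matrix (made up): three items lying on a line at 0, 1 and 3.
toy = [[0, 1, 3],
       [1, 0, 2],
       [3, 2, 0]]
for point in classical_mds(toy):
    print(point)
```

For the user-diversity map, `dist` would be the matrix of gdiv values between users; since diversities need not behave like Euclidean distances, the 2-D layout is then only an approximation.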
Visualising user diversity (2)
[Figure: user-diversity maps labelled “Web mining”, “Data mining”, and “RFID”, and one aggregated using gdiv(A,B)]
Evaluating the application

Clustering only: Does it generate
meaningful document groups?
– yes (tradition in bibliometrics) – but: data?
– Small expert evaluation of CiteseerCluster

Clustering & regrouping
– End-user experiment with CiteseerCluster
– 5-person formative user study of Damilicious
The Damilicious tool: Summary and
(some) open questions

A tool that helps users in sense-making, exploring diversity, and re-using semantics

 Diversity measures when queries and result sets are different?
 How best to present diversity?
– How to integrate into an environment supporting user and community contexts?
 Incentives to use the functionalities?
 How to find the best balance between similarity and diversity?
 Which measures of grouping diversity are most meaningful?
– Extensional?
– Intensional? Structure-based? Hybrid? (cf. ontology matching)
 Which other sources of user diversity?
 Diversity and relevance: can we learn from user-dependent relevance judgements?
Some lessons learned
(or questions raised?)
 We need to embrace diversity.
 We need to take into account
– The diversity of documents / knowledge
– The diversity of people
– The diversity of diversity.
 We need to be clear about what we mean.
 We need to ask whether / when “striving for diversity” is in itself A Good Thing.
 We need to ask whether / when “raising awareness of diversity” is in itself A Good Thing.
Thanks!
