Folie 1 - KU Leuven

Where does this new information belong?
From developing mining algorithms to supporting knowledge discovery
Bettina Berendt – thanks for joint work with and support from Ilija Subašić, Mathias Verbeke, Siegfried Nijssen, Luc De Raedt
K.U. Leuven
Yes we can!
The problem
The solution?
Automatic topic detection
[Figure: topics detected per period]
Period 1: Healthcare agenda; Climate agenda
Period 2: Health, Care, Insurance, American, Uninsured, Families, Working; Green energy plan; A healthcare vote
Period 3: Opposition to healthcare reform; Peace Nobel Prize; Copenhagen climate summit
Period 4: Another healthcare vote
Word weights shown in the figure: 0.017, 0.015, 0.013, 0.013, 0.009, 0.008, 0.005
Same event/document; different
interpretations & categorisations
Similar problems in science and learning
Topic detection in time-indexed corpora of news texts
Conference programme
Similar problems in other areas
Music collections, multimedia collections: see Andreas Nürnberger's talk at SML 2010
The solution?
Context-aware systems / personalisation
Political activist
Female
Has problems with anger management
You probably do / should think about it this way: ...
What users want
... to structure the world how they see it (left / right)
→ interactivity
... to re-use their categories (that they worked so hard to find)
→ semantics
... to acknowledge that others see the world differently (squares / circles; green / not green)
→ social similarity / diversity
... to be able to see through their eyes ("is (nearly) green")
→ perspective-taking
... to provide data mining methods to do all that!
→ Research agenda
The problem
automatic topic detection
support sense-making = provide methods / tools for Knowledge Discovery (in the full sense)
→ interactivity
→ semantics
→ social similarity / diversity
→ perspective-taking
... to provide data mining methods to do all that!
→ Research agenda
The problem
Our solution approach
automatic topic detection
support sense-making = provide methods / tools for Knowledge Discovery (in the full sense)
→ interactivity
→ semantics
→ social similarity / diversity
→ perspective-taking
... to provide data mining methods to do all that!
STORIES: functionality basics
STORIES: mining basics (1)
Graphical summarisation of multiple text documents
Document / text pre-processing
• Template recognition
• Multi-document named entities
• Stopword removal, lemmatization
• "fact (assertion) recognition"
Similarity measure to determine salient relations
Document summarization strategy
• time relevance, a "temporal co-occurrence lift"
• no topics, but salient concepts & relations
• time window; word-span window
Selection approach for concepts
• concepts = words or named entities
• salient concept = high TF & involved in a salient relation, time-indexed
• bursty co-occurrence
Burstiness measure
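
The "temporal co-occurrence lift" named above can be illustrated with a minimal sketch. It assumes that the lift of a concept pair is its co-occurrence rate in the documents of one period divided by its co-occurrence rate in the whole corpus, and that two concepts co-occur when they fall inside the same word-span window. The function names, the window size and the toy corpus are hypothetical and are not taken from the slides.

```python
from collections import Counter

def cooccurrence_counts(docs, window=5):
    """For each unordered word pair, count the documents in which both words
    appear inside a span of `window` consecutive tokens."""
    counts = Counter()
    for tokens in docs:
        pairs = set()
        for i, w in enumerate(tokens):
            for v in tokens[i + 1 : i + window]:
                if v != w:
                    pairs.add(frozenset((w, v)))
        counts.update(pairs)  # each pair counted once per document
    return counts

def temporal_lift(period_docs, all_docs, window=5):
    """Assumed definition: lift = co-occurrence rate in the period
    / co-occurrence rate in the whole corpus.
    `period_docs` must be a subset of `all_docs`."""
    in_period = cooccurrence_counts(period_docs, window)
    overall = cooccurrence_counts(all_docs, window)
    lifts = {}
    for pair, n_period in in_period.items():
        rate_period = n_period / len(period_docs)
        rate_overall = overall[pair] / len(all_docs)
        lifts[pair] = rate_period / rate_overall
    return lifts

# Hypothetical toy corpus: each document is a list of pre-processed tokens.
period_1 = [["healthcare", "reform", "vote"], ["healthcare", "vote", "senate"]]
period_2 = [["climate", "summit", "copenhagen"], ["healthcare", "reform", "debate"]]
lifts = temporal_lift(period_1, period_1 + period_2)
```

In this toy corpus the pair healthcare/vote gets a lift of 2.0 for period 1 (it occurs in every period-1 document but only half of all documents), while healthcare/reform gets 1.0 because it is equally frequent inside and outside the period.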
STORIES: mining basics (2)
Graph analysis for query recommendation
Aim: highlight subgraphs that represent an event
Topological properties
Change: Subgraph new in this period
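
A minimal sketch of the "change" property above, assuming each period's story graph is stored as a set of undirected edges between concepts, and that "new in this period" means edges absent from the previous period's graph. Grouping the new edges into connected components is used here as a simple stand-in for subgraph detection; all names and data are hypothetical.

```python
def pair(a, b):
    """An undirected edge between two concepts."""
    return frozenset((a, b))

def new_edges(previous, current):
    """Edges present in the current period's graph but not in the previous one."""
    return current - previous

def connected_components(edges):
    """Group undirected edges into connected subgraphs (returned as node sets)."""
    adjacency = {}
    for e in edges:
        a, b = tuple(e)
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal from `start`
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components

# Hypothetical story graphs for two consecutive periods.
period_2 = {pair("healthcare", "vote"), pair("green", "energy")}
period_3 = {pair("healthcare", "vote"), pair("copenhagen", "climate"),
            pair("climate", "summit"), pair("nobel", "prize")}

changed = connected_components(new_edges(period_2, period_3))
```

Here `changed` contains two candidate event subgraphs, one around the climate summit and one around the Nobel prize; the unchanged healthcare/vote edge is not highlighted.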
STORIES: evaluation
1. Information retrieval quality
• Edges – events: up to 80% recall, ca. 30% precision
2. Search quality
• Subgraphs index coherent document clusters
3. Learning effectiveness
• Document search with story graphs leads to averages of 67-75% accuracy on judgments of story fact truth
• on average, 1.3-4.7 queries with 3.4-5.2 nodes/words per query
4. Comparison with other temporal text mining methods
• New (and only) framework for cross-method comparison
• Recall- & precision-style metrics → different method rankings
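
The edge-level recall and precision figures in item 1 can be read along the following lines. This is only a sketch under the assumption that ground-truth events are expressed as concept pairs and compared against the edges of a story graph; the actual evaluation protocol is not given on the slide, and the data below are made up.

```python
def pair(a, b):
    """An undirected edge between two concepts."""
    return frozenset((a, b))

def precision_recall(graph_edges, event_edges):
    """Precision: share of graph edges that match a ground-truth event pair.
    Recall: share of ground-truth event pairs that appear as graph edges."""
    hits = graph_edges & event_edges
    precision = len(hits) / len(graph_edges) if graph_edges else 0.0
    recall = len(hits) / len(event_edges) if event_edges else 0.0
    return precision, recall

# Hypothetical story-graph edges and ground-truth event pairs.
graph_edges = {pair("healthcare", "vote"), pair("climate", "summit"),
               pair("green", "energy"), pair("nobel", "prize")}
event_edges = {pair("healthcare", "vote"), pair("healthcare", "reform")}

precision, recall = precision_recall(graph_edges, event_edges)
```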
Damilicious: functionality basics
Apply my grouping rfid (Security/privacy, Group 2, ...) to the following new search result:
* Show users and how similarly they group
* Apply U4's grouping to my new search result:
Damilicious: mining basics (1)
Methods and process
1. Query
2. Automatic clustering
3. Manual regrouping
4. Re-use
   1. Learn classifier & present way(s) of grouping
   2. Transfer the constructed concepts
Features/methods for the conceptual/predictive clustering:
• Lingo phrases, Lingo clustering, Ripper
• co-citation, bibliometric coupling, word or LSA similarity, combinations; k-means, hierarchical
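
The re-use step (learn a classifier from a user's grouping, then transfer it to a new search result) can be sketched as follows. Note the substitution: the slides name Lingo and Ripper, whereas this sketch uses a plain nearest-centroid classifier over bag-of-words vectors with cosine similarity, chosen only because it fits in a few lines. Function names, group contents and document titles are hypothetical.

```python
import math
from collections import Counter

def bag_of_words(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def learn_grouping(grouping):
    """grouping: {group label: [document titles]}.
    Returns one summed bag-of-words centroid per group."""
    centroids = {}
    for label, docs in grouping.items():
        centroid = Counter()
        for doc in docs:
            centroid.update(bag_of_words(doc))
        centroids[label] = centroid
    return centroids

def apply_grouping(centroids, new_docs):
    """Assign each new document to the most similar learned group."""
    return {
        doc: max(centroids, key=lambda label: cosine(centroids[label], bag_of_words(doc)))
        for doc in new_docs
    }

# A user's manual regrouping of an earlier search result (made-up titles).
my_grouping = {
    "Security/privacy": ["rfid privacy threats", "rfid tag security protocols"],
    "Group 2": ["rfid in supply chain logistics", "warehouse inventory tracking with rfid"],
}
model = learn_grouping(my_grouping)

# Transfer the learned grouping to a new search result.
result = apply_grouping(model, ["privacy and security of rfid tags",
                                "rfid logistics for warehouse inventory"])
```

The same `model` could equally come from another user's grouping, which corresponds to applying someone else's way of structuring results to one's own search.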
Damilicious: mining basics (2)
Measures of grouping and user diversity
Diversity = 1 – similarity = 1 – Normalized mutual information (entropy-based measure)
• "How similarly do two users group documents?"
• For each query q, consider their groupings gr (in the example shown: NMI = 0)
• For several queries: aggregate
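
A minimal sketch of the diversity measure above. Two points are assumptions rather than content of the slides: mutual information is normalised by the geometric mean of the two entropies (other normalisations exist), and several queries are aggregated by a plain average. Each grouping is given as a list of group labels, one per document, in the same document order for both users.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    # p(x, y) * log(p(x, y) / (p(x) * p(y))), summed over observed label pairs
    return sum((c / n) * math.log(c * n / (count_a[x] * count_b[y]))
               for (x, y), c in joint.items())

def nmi(a, b):
    """Normalised mutual information (geometric-mean normalisation, assumed)."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 or h_b == 0:
        # A grouping with a single group carries no information.
        return 1.0 if h_a == h_b else 0.0
    return mutual_information(a, b) / math.sqrt(h_a * h_b)

def diversity(a, b):
    return 1.0 - nmi(a, b)

def aggregate_diversity(groupings_per_query):
    """Average diversity over several queries (aggregation by mean is assumed)."""
    values = [diversity(a, b) for a, b in groupings_per_query]
    return sum(values) / len(values)

# Hypothetical groupings of the same four documents by three users.
user_1 = ["A", "A", "B", "B"]
user_2 = ["X", "X", "Y", "Y"]   # same partition, different group names
user_3 = ["X", "Y", "X", "Y"]   # statistically independent of user_1's partition
```

With these toy groupings, user_1 and user_2 have NMI 1 (diversity 0) because only the group names differ, while user_1 and user_3 have NMI 0 (diversity 1).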
Damilicious: evaluation
• Clustering: Does it generate meaningful document groups?
– yes (tradition in bibliometrics) – but: data?
– Small expert evaluation of CiteseerCluster
• Choosing the clustering and classification methods for conceptual clustering
– Experiments: different features, clustering methods, classification methods → quality of reconstruction and extension-over-time (NMI)
• Technology acceptance
– End-user experiment (clustering & regrouping)
– 5-person formative user study (transfer of own results)
Conclusions and (some) questions
• Sense-making involves (with the corresponding KD approach):
– Extracting information from texts: Text mining
– Extracting structural information between entities: Graph mining
– Creating, using and modifying categories: Semantics
– Interacting with external representations: Interactivity
– Acknowledging diversity and perspective-taking: Usage mining and "model-processing" (conceptual / predictive clustering)
– ...
• Appropriate mining methods, measures, ...?
• More/better evaluation methods and frameworks?
• Use cases?
• Subašić, I. & Berendt, B. (2009). Discovery of interactive graphs for understanding and searching time-indexed corpora. Knowledge and Information Systems. DOI 10.1007/s10115-009-0227-x
• Berendt, B. & Subašić, I. (2009). STORIES in time: a graph-based interface for news tracking and discovery. In N. Cristianini & M. Turchi (Eds.), Proceedings of Intelligent Analysis and Processing of Web News Content (IAPWNC) at The 2009 IEEE / WIC / ACM International Conferences Web Intelligence (WI'09) / Intelligent Agent Technology (IAT'09). 15 September 2009, Milan, Italy. (Proceedings of WI-IAT.2009, DOI 10.1109/WI-IAT.2009.342, pp. 531-534)
• Verbeke, M., Berendt, B., & Nijssen, S. (2009). Data mining, interactive semantic structuring, and collaboration: A diversity-aware method for sense-making in search. In G. Boato & C. Niederee (Eds.), Proceedings of First International Workshop on Living Web, collocated with the 8th International Semantic Web Conference (ISWC 2009), Washington D.C., USA, October 26, 2009. CEUR Workshop Proceedings Vol-515.
• Berendt, B. (2010). Diversity in search: what, how, and what for? Talk at Barcelona Media / Yahoo! Research and UPF, 4 March 2010.
• Berendt, B., Krause, B., & Kolbe-Nusser, S. (2010). Intelligent scientific authoring tools: Interactive data mining for constructive uses of citation networks. Information Processing & Management, 46(1), 1-10.