Transcript lcs35

Individualized Knowledge Access
David Karger
Lynn Andrea Stein
Web Search Tools
 Indices
search by keyword
 Taxonomies
classify by subject
 Cool site of the day
A lot like libraries...
 Library catalogues
 Dewey Decimal
 New book shelf, suggested reading
Is a universal library enough?
Library/Web Limitations
 Huge:

too many answers, mostly irrelevant
 Only published material

miss info known to few, leading-edge content
 Rigid:
everyone gets the same search results
 even if they come back and try again

The library is the last place we look
Bookshelves First
 My data:
information gathered personally
 high quality, easy for me to understand
 not limited to publicly available content
 annotations

 My organization:
choose own subject arrangement
 optimize for my kind of searching

 Adapts to my needs
Then a Friend
 Leverage
they organize information for their access
 so quickly find things for me

 Personal expertise

they know things not in any library
 Trust

their recommendations are good
 Shared vocabulary

they know me and what I want
Last the Library
 Answer usually there
but hard to find
 would be nice to rearrange to my needs

 For hardest problems, need librarian
they have broad knowledge of library
 but not as deep as an expert on question

Lessons
 Individualized access: The best tools adapt
to individual ways of organizing and
seeking data.
 Individualized knowledge: People know
much more than they publish. That
knowledge is useful.
Haystack: a Tool for Oxygen
 Independent but interacting repositories
that adapt to their individual users
 Individualize access
My data collection, organization
 My search tools, with answers for me

 Leverage individual knowledge
Collaborative retrieval with others
 Motivate people to organize their data for
their own benefit and thus for others’

Example
 Have probabilistic models been used in
data mining?
My haystack doesn’t know, but “probability” is
in lots of mail I got from Tommi Jaakkola
 Tommi told his haystack that “Bayesian”
refers to “probability models”
 Tommi has read several papers on Bayesian
methods in data mining
 His haystack suggests them to mine
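
The scenario above can be sketched in a few lines of Python. This is an illustration only, not Haystack's actual implementation: the `Haystack` class, its `synonyms`, `documents`, and `contacts` fields, and the `ask` method are hypothetical names, the paper titles are placeholders, and plain keyword matching stands in for real retrieval.

```python
from dataclasses import dataclass, field


@dataclass
class Haystack:
    """Hypothetical per-user repository (not the real Haystack API)."""
    owner: str
    # Vocabulary the owner taught it: term -> set of related terms.
    synonyms: dict = field(default_factory=dict)
    # Documents the owner has read: title -> set of keywords.
    documents: dict = field(default_factory=dict)
    # Terms that appear in mail from a contact -> that contact's haystack.
    contacts: dict = field(default_factory=dict)

    def expand(self, term):
        """Return the term plus everything the owner declared as related."""
        return {term} | self.synonyms.get(term, set())

    def search(self, terms):
        """Titles whose keywords cover every query term or one of its synonyms."""
        return sorted(
            title
            for title, keywords in self.documents.items()
            if all(keywords & self.expand(term) for term in terms)
        )

    def ask(self, terms):
        """Search locally first, then forward to contacts linked to a query term."""
        hits = self.search(terms)
        if hits:
            return hits
        for term in terms:
            friend = self.contacts.get(term)
            if friend is not None:
                hits = friend.search(terms)
                if hits:
                    return hits
        return []


# Tommi told his haystack that "probability" relates to "bayesian".
tommi = Haystack(
    owner="Tommi",
    synonyms={"probability": {"bayesian"}},
    documents={
        "Paper A (hypothetical)": {"bayesian", "data mining"},
        "Paper B (hypothetical)": {"bayesian", "speech"},
    },
)
# My haystack holds no matching documents, but links "probability" to Tommi.
mine = Haystack(owner="me", contacts={"probability": tommi})
suggestions = mine.ask(["probability", "data mining"])
```

Here `suggestions` holds only the placeholder paper that matches both terms, found through Tommi's vocabulary rather than mine.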

Research Threads
 Heterogeneous data and metadata

archive whatever user wants
 Human-Computer Interaction
let user express/use own organizational rules
 observe user to detect unexpressed knowledge

 Machine learning

use gathered data to improve performance
 Collaborative filtering

use others’ decisions to help me
My data
 Haystack archives anything

web pages browsed, email sent and received,
documents written, scanned images, home
directory, people known, projects worked on
 And any properties, relationships
text of object (if know how)
 author, title, color, citations, quotations,
annotations, quality, last usage

 Users freely add types, relationships
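
One way to picture a store that accepts any object, property, or relationship is as a set of (subject, property, value) facts. The sketch below assumes that representation; the slides do not say how Haystack stores its data, and `TripleStore`, its methods, and the object names are all hypothetical.

```python
class TripleStore:
    """Hypothetical store of (subject, property, value) facts."""

    def __init__(self):
        self._facts = {}  # subject -> list of (property, value) pairs

    def add(self, subject, prop, value):
        """Record one fact; any property name is allowed, no schema required."""
        self._facts.setdefault(subject, []).append((prop, value))

    def values(self, subject, prop):
        """All values recorded for one property of one subject."""
        return [v for p, v in self._facts.get(subject, []) if p == prop]

    def subjects_with(self, prop, value):
        """All subjects that carry the given property/value pair."""
        return sorted(
            s for s, facts in self._facts.items() if (prop, value) in facts
        )


store = TripleStore()
store.add("doc:draft", "type", "document")
store.add("doc:survey", "type", "document")
store.add("doc:draft", "cites", "doc:survey")    # a relationship between objects
store.add("doc:draft", "last usage", "today")
store.add("doc:draft", "mood", "urgent")         # a property the user invented

cited = store.values("doc:draft", "cites")
documents = store.subjects_with("type", "document")
```

Because properties are plain strings, a user-invented one such as `mood` needs no change to the store.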
Gathering My Data
 Active user input

interfaces let user add data, note relationships
 Mining data from haystack
plug-in services opportunistically extract data
 e.g., find author/title/text in MSWord document
 or, detect that one document quotes another

 Observing user
plug-ins to other interfaces report user actions
 web pages browsed, mail sent, queries made
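
The plug-in idea can be sketched as a list of small extractor functions run over each archived document. Everything here is an assumption made for illustration: the function names, the "first non-empty line is the title" heuristic, and the word-run test for quotation are not taken from Haystack.

```python
def extract_title(text):
    """Crude heuristic: treat the first non-empty line as the title."""
    for line in text.splitlines():
        if line.strip():
            return {"title": line.strip()}
    return {}


def extract_word_count(text):
    """Record the number of whitespace-separated words."""
    return {"word_count": len(text.split())}


# Hypothetical registry; a new service is added by appending a function.
SERVICES = [extract_title, extract_word_count]


def run_services(text, services=SERVICES):
    """Run every registered service and merge the properties they find."""
    properties = {}
    for service in services:
        properties.update(service(text))
    return properties


def quotes(doc, source, min_words=5):
    """True if `doc` repeats at least `min_words` consecutive words of `source`."""
    source_words = source.lower().split()
    # Pad with spaces so a phrase only matches on whole-word boundaries.
    doc_text = " " + " ".join(doc.lower().split()) + " "
    for i in range(len(source_words) - min_words + 1):
        phrase = " " + " ".join(source_words[i:i + min_words]) + " "
        if phrase in doc_text:
            return True
    return False


properties = run_services("My Report\n\nSome body text here.")
source = "people know much more than they publish and that knowledge is useful"
is_quoted = quotes("As the talk put it, people know much more than they publish.", source)
```

The services run opportunistically in the sense that each one simply returns an empty result when it finds nothing to extract.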

Adaptation
 Remember user’s attempts to tune a query
instead of first query attempt, use last one
 record items user picked as good matches
 future similar queries do better right away

 Stored content shows what user knows/likes
modify queries to big search engines
 filter results coming back
 personalized “cool site of the day”
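
A minimal sketch of the first adaptation idea follows, assuming that "similar" queries are those with the same words regardless of case and order. `QueryMemory` and its methods are hypothetical names, and the result identifiers are placeholders.

```python
class QueryMemory:
    """Hypothetical memory of how the user tuned past queries."""

    def __init__(self):
        self._refined = {}  # normalized first attempt -> last attempt
        self._picked = {}   # normalized query -> items picked as good matches

    @staticmethod
    def _key(query):
        """Assumed notion of similarity: same words, any case, any order."""
        return " ".join(sorted(query.lower().split()))

    def record_session(self, attempts, picked):
        """Remember the final refinement and the results the user chose."""
        first, last = attempts[0], attempts[-1]
        self._refined[self._key(first)] = last
        self._picked.setdefault(self._key(last), set()).update(picked)

    def rewrite(self, query):
        """Replace a known first attempt with the user's last refinement."""
        return self._refined.get(self._key(query), query)

    def rerank(self, query, results):
        """Move previously picked items to the front, keeping the rest in order."""
        picked = self._picked.get(self._key(query), set())
        return sorted(results, key=lambda item: item not in picked)


memory = QueryMemory()
memory.record_session(
    attempts=["jaguar", "jaguar cat habitat"],
    picked={"page-about-cats"},
)
better_query = memory.rewrite("Jaguar")
ranked = memory.rerank(better_query, ["page-about-cars", "page-about-cats"])
```

The same `rerank` step could also filter results coming back from a large search engine, since it only needs the result list and the stored picks.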

Collaborative Access
 Leverage others’ work organizing data
no need to “publish” expertise
 exposed automatically
 self interest helps others

 Privacy/permission concerns
allowing exposure easier than publishing
 much public info: mailing lists, papers read

 Whose opinions matter?
people I mail, w/shared data, referrals
 collaborative filtering techniques
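
The question of whose opinions matter can be sketched as a trust-weighted vote. The weighting scheme below is an assumption chosen for illustration, not a technique named in the slides: `recommend`, the trust values, and the item names are hypothetical, and trust might in practice be derived from mail exchanged, shared data, or referrals.

```python
def recommend(ratings, trust, top_n=3):
    """Rank items by the sum of others' ratings, weighted by my trust in each rater.

    ratings: person -> {item: rating}; trust: person -> non-negative weight.
    People missing from `trust` get weight 0, so their opinions are ignored.
    """
    scores = {}
    for person, items in ratings.items():
        weight = trust.get(person, 0.0)
        for item, rating in items.items():
            scores[item] = scores.get(item, 0.0) + weight * rating
    # Highest score first; ties are broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, score in ranked if score > 0][:top_n]


# Hypothetical weights, e.g. based on how often I exchange mail with each person.
trust = {"alice": 3.0, "bob": 1.0}
ratings = {
    "alice": {"paper-a": 1, "paper-b": 0},
    "bob": {"paper-b": 1, "paper-c": 1},
    "stranger": {"paper-d": 1},
}
picks = recommend(ratings, trust)
```

In this example `paper-d` never appears, because it was rated only by someone I have no trust relationship with.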

Conclusion
 Libraries are not enough
 Haystack teases out individual knowledge
 Individualizes information access for user
 Exposes individual knowledge to benefit
community
 Current status: individual-user prototype.
Some data extraction, observation, adapting.
Collaborative version in future.