Search Engines for Digital Libraries


SEARCH ENGINES
for DIGITAL LIBRARIES
Jaime Carbonell
[email protected]
Language Technologies Institute
Carnegie Mellon University
Pittsburgh PA, USA
November 18, 2006
OUTLINE OF PRESENTATION
• Search Engine Primer
  – Architecture, function and brief history
  – Components: indexing, matching, …
• Recent R&D Results for Search Engines
  – Clustering + summarization
  – Beyond relevance (popularity, novelty, …)
  – The “invisible web” and distributed IR
• Search Engines for eLibraries
  – Coping with OCR imperfections
  – Knowledge maps (text, metadata, and more)
Search Engines in a Nutshell
[Architecture diagram: a spider crawls the Web (or a library and other collections) to build an inverted index; the search engine matches the user’s queries against that index and returns results.]
Search Engine Evolution
• …in the 1980s (pre-web):
  – Single collection with < 10^6 documents (archive)
• …mid 90s-mid 00’s (web):
  – Single collection with > 10^9 documents (web)
• …beyond 2006:
  – Multiple collections with > 10^12 docs (invisible web)
  – “Find what I mean” queries & profiles with clustering, summarization and personalization
  – Beyond monolingual text: OCR, audio, video, cross-lingual search with translation, …
INVERTED INDEXING:
Multiple Access Methods
Task: Enable accurate and efficient document retrieval
Solution: Database of inverted lists, different access methods
Hash Table Access (keys: apple … zebra)
• Exact match
• O(1) access

B-Tree-style Access (keys in sorted order: apple … zebra)
• Exact match
• Range match
• O(log n) access

Both access structures index into the database of inverted lists.
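As a concrete sketch of the hash-table access path, here is a minimal inverted index in Python (the toy collection, tokenizer, and query terms are invented for illustration, not taken from the talk):

```python
from collections import defaultdict

# Toy collection; in a real engine these would be crawled or scanned documents.
docs = {
    1: "the quick brown fox jumps over the lazy dog",
    2: "apple and zebra appear in this tiny document",
    3: "the zebra grazes while the fox watches",
}

# Build the database of inverted lists: term -> postings (ids of docs containing it).
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)

def lookup(term):
    """Hash-table access: exact match, O(1) expected time."""
    return sorted(inverted_index.get(term.lower(), set()))

def conjunctive_query(*terms):
    """AND-query: intersect the postings lists of all query terms."""
    postings = [set(lookup(t)) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(lookup("zebra"))                     # -> [2, 3]
print(conjunctive_query("zebra", "fox"))   # -> [3]
```

A B-tree-style index would support the same exact-match lookups and additionally range queries over the sorted term keys.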
QUERY-DOCUMENT SIMILARITY
(Simplified)
Traditional “Cosine Similarity”:

$$\mathrm{Sim}(\vec{q},\vec{d}) = \frac{\vec{q}\cdot\vec{d}}{\|\vec{q}\|\,\|\vec{d}\|}$$

where

$$\|\vec{d}\| = \sqrt{\sum_{i=1,\dots,n} d_i^{\,2}}$$

Each element of the query and document vectors is a word weight. Rare words count more, e.g. $d_i = \log_2\big(D_{\mathrm{all}} / D_{\mathrm{freq}}(\mathrm{word}_i)\big)$.
Getting the top-k documents (or web pages) is done by:

$$\mathrm{Retrieve}(\vec{q},k) = \operatorname*{Arg\,max}_{\vec{d}\in D}\big[k,\ \mathrm{Sim}(\vec{d},\vec{q})\big]$$
REFINEMENTS TO IMPROVE
SEARCH ENGINES
• Well-known methods (several are sketched in the example after this list)
  – Stop-word removal (e.g., “it”, “the”, “in”, …)
  – Phrasing (e.g., “heart attack”, “to be or not to be”)
  – Morphology (e.g., “countries” => “country”)
• More recent methods
  – Query expansion (e.g., “cheap” => “inexpensive”, “discount”, “economic”, “affordable”, …)
  – Pure relevance => popularity + relevance
    » Google’s page-rank by in-link density
    » Collaborative filtering (e.g. Amazon)
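A toy Python sketch of stop-word removal, simple morphology, and query expansion (the stop-word list, suffix rules, and synonym table are invented for illustration; real engines use full stemmers or lemmatizers and learned expansion dictionaries):

```python
STOP_WORDS = {"it", "the", "in", "a", "an", "of", "to", "or", "be", "not"}

def normalize(term):
    """Crude morphology via hand-written suffix rules, e.g. 'countries' -> 'country'."""
    if term.endswith("ies"):
        return term[:-3] + "y"
    if term.endswith("s") and len(term) > 3:
        return term[:-1]
    return term

# Toy query-expansion table (in practice learned from corpora or thesauri).
EXPANSIONS = {
    "cheap": ["inexpensive", "discount", "affordable"],
}

def preprocess_query(query):
    terms = [t for t in query.lower().split() if t not in STOP_WORDS]
    terms = [normalize(t) for t in terms]
    expanded = list(terms)
    for t in terms:
        expanded.extend(EXPANSIONS.get(t, []))
    return expanded

print(preprocess_query("cheap flights to the countries"))
# -> ['cheap', 'flight', 'country', 'inexpensive', 'discount', 'affordable']
```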
Coping with OCR Errors in
Documents
• Problem: Search engines require text (not images)
  – OCR (images → text) is imperfect
  – Errors are language-, font-, and OCR-engine-dependent
• Solution: Multifaceted (a toy sketch follows below)
  – Character n-gram-level confusion matrix (e.g.: ‘rn’ → ‘m’)
  – Word-level confusion matrix
  – Augmented noisy channel model:

$$w^* = \operatorname*{arg\,max}_{w} P(w \mid v, f, l) = \operatorname*{arg\,max}_{w} P(v \mid w, f, l)\,P(w)$$

  – P(v|w,f,l) is a function of OCR edit distance, actual frequencies in labeled data, etc.
  – Automated thesaurus (for redundancy)
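A minimal sketch of the noisy-channel correction idea, assuming a tiny hand-written confusion table and unigram word prior (all values here are invented; the slide’s P(v|w,f,l) additionally conditions on font f and language l and is estimated from labeled OCR data):

```python
import math

# Toy word prior P(w), e.g. from corpus unigram counts (invented numbers).
WORD_PRIOR = {"modern": 0.6, "rnodern": 1e-9, "morern": 1e-9}

# Toy character-level confusions with penalties (negative log-probabilities), e.g. 'rn' <-> 'm'.
CONFUSIONS = {("rn", "m"): 1.0, ("m", "rn"): 1.0, ("cl", "d"): 1.5, ("d", "cl"): 1.5}

def log_p_observed_given_word(v, w):
    """Very crude channel model: zero cost if identical, a fixed penalty if one
    known confusion explains the difference, and a large penalty otherwise."""
    if v == w:
        return 0.0
    for (a, b), penalty in CONFUSIONS.items():
        if a in v and v.replace(a, b, 1) == w:
            return -penalty
    return -20.0

def correct(v, vocabulary):
    """w* = argmax_w P(v|w) P(w), over a candidate vocabulary."""
    def score(w):
        return log_p_observed_given_word(v, w) + math.log(WORD_PRIOR.get(w, 1e-12))
    return max(vocabulary, key=score)

print(correct("rnodern", vocabulary=["modern", "rnodern", "morern"]))  # -> "modern"
```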
BEYOND SEARCHING
• Automated Summarization
  – Multi-document summaries
  – User-controllable (length, type, etc.)
• Document Clustering (a toy sketch follows this list)
  – Group search results by content similarity
  – Then, summarize and label each cluster
• Personal Profiling
  – User models (of interests, level of knowledge)
  – Task models (progression of types of info needed)
• Information Push (beyond automated clipping)
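A toy sketch of clustering search results by content similarity, here with a greedy single-pass algorithm over word-overlap similarity (the snippets, threshold, and labeling heuristic are invented; the slide does not prescribe a particular clustering algorithm):

```python
from collections import Counter

def jaccard(a, b):
    """Word-overlap similarity between two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_results(snippets, threshold=0.2):
    """Greedy single-pass clustering: add each result to the first
    sufficiently similar cluster, or start a new one."""
    clusters = []  # each cluster is a list of (snippet, token_set) pairs
    for text in snippets:
        tokens = set(text.lower().split())
        for cluster in clusters:
            centroid = set().union(*(t for _, t in cluster))
            if jaccard(tokens, centroid) >= threshold:
                cluster.append((text, tokens))
                break
        else:
            clusters.append([(text, tokens)])
    return clusters

def label(cluster):
    """Crude cluster label: the two most frequent non-trivial words."""
    counts = Counter(w for _, toks in cluster for w in toks if len(w) > 3)
    return " / ".join(w for w, _ in counts.most_common(2))

results = [
    "jaguar car dealership and new jaguar models",
    "jaguar cars pricing and reviews",
    "jaguar habitat in the amazon rainforest",
    "rainforest jaguar conservation project",
]
for c in cluster_results(results):
    print(label(c), "->", [text for text, _ in c])
```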
NEXT-GENERATION
SEARCH ENGINES
• Search Criteria Beyond Query-Relevance (a toy scoring sketch follows this list)
  – Popularity of web-page (link density, clicks, …)
  – Information novelty (content differential, recency)
  – Trustworthiness of source
  – Appropriateness to user (difficulty level, …)
• “Find What I Mean” Principle
  – Search on semantically related terms
  – Induce user profile from past history, etc.
  – Disambiguate terms (e.g. “Jordan”, or “club”)
  – From generic search to helpful E-Librarians
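As a toy illustration of combining such criteria, the sketch below ranks results by a weighted sum of relevance, popularity, novelty, and appropriateness (the weights and signal values are invented; the slide names the criteria but not a specific combination function):

```python
# Toy candidate results with per-criterion signals in [0, 1] (invented values).
candidates = {
    "official-docs": {"relevance": 0.9, "popularity": 0.6, "novelty": 0.2, "appropriateness": 0.8},
    "news-article":  {"relevance": 0.7, "popularity": 0.9, "novelty": 0.8, "appropriateness": 0.7},
    "forum-post":    {"relevance": 0.8, "popularity": 0.3, "novelty": 0.9, "appropriateness": 0.4},
}

# Criterion weights; in practice these would be tuned or learned per user and task.
WEIGHTS = {"relevance": 0.5, "popularity": 0.2, "novelty": 0.2, "appropriateness": 0.1}

def utility(signals):
    """Utility as a weighted combination of the search criteria."""
    return sum(WEIGHTS[c] * signals.get(c, 0.0) for c in WEIGHTS)

ranking = sorted(candidates, key=lambda r: utility(candidates[r]), reverse=True)
for r in ranking:
    print(f"{r}: {utility(candidates[r]):.2f}")
```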
Clustering Search vs Standard Search
(e.g. clusty.com)
[Diagram: in standard search the query goes to the IR engine, which returns a flat list of documents; in clustering search the retrieved documents are grouped into clusters, each labeled with a summary.]
NEXT-GENERATION SEARCH:
Maximal Marginal Relevance Principle
In general, we want to retrieve the k maximally useful docs, where utility = F(relevance, novelty, popularity, clarity, …). So far, we can do relevance & popularity. Novelty is next, by defining “marginal relevance” to be “relevant + new”:

$$\mathrm{MMR}(\vec{q}, D, k) = \operatorname*{Arg\,max}_{\vec{d}_i \in D}\Big[k,\ \lambda\,\mathrm{Sim}(\vec{d}_i,\vec{q}) - (1-\lambda)\max_{\vec{d}_j \ne \vec{d}_i}\mathrm{Sim}(\vec{d}_i,\vec{d}_j)\Big]$$
MMR is used for ranking search results, or for selecting
optimal passages in summary generation.
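A minimal Python sketch of greedy MMR re-ranking (the similarity function, documents, and λ value below are toy stand-ins for illustration):

```python
def mmr_rerank(query, documents, sim, k, lam=0.7):
    """Greedy MMR: repeatedly pick the document maximizing
    lam * Sim(d, q) - (1 - lam) * max_{d_j already selected} Sim(d, d_j)."""
    selected = []
    candidates = list(documents)
    while candidates and len(selected) < k:
        def mmr_score(d):
            redundancy = max((sim(d, s) for s in selected), default=0.0)
            return lam * sim(d, query) - (1.0 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy word-overlap similarity and documents (invented for the example).
def sim(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

docs = [
    "solar power for digital libraries",
    "solar power for libraries",          # near-duplicate of the first
    "wind power generation overview",
]
print(mmr_rerank("solar power", docs, sim, k=2, lam=0.5))
```

With λ = 0.5 the near-duplicate is skipped in favor of the less relevant but novel document, which is exactly the diversification effect the formula describes.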
MMR Ranking vs Standard IR
(Future CONDOR Release)
[Diagram: MMR ranking vs. standard IR over the same documents and query; λ controls the “spiral curl”, i.e. how strongly successive results are pushed away from already-selected ones.]
NEXT-GENERATION SEARCH:
Seeking the “Invisible” Web
• Invisible Web = DB’s Accessible via Web Pages
  – Dynamically-generated web-pages from DB’s
  – Information (dynamic pages) served via Java apps
  – 10 to 100 times larger than the static HTML web
  – Growing faster than the static “visible” web
• Need a Distributed-IR Model for Access (Callan)
  – Either unify content or model each DB
  – User’s query => appropriate DB(s) => secondary search => unify results
The Web-Search Model
[Diagram: the web-search model treats geographically distributed sources (U.S. Sales in New York, European Sales in Zurich, Administration in Pittsburgh, R&D in San Jose, a competitor in Dallas) as a single-database problem.]
Automatic Resource Selection:
Federated (Distributed) Search
[Diagram: the user’s query is routed to the best databases (e.g. Library of Congress, NY Times, West, Microsoft, Google) via automatic database selection; the returned results are dynamically ranked into a single list.]

• Find out what each database (or library) contains
• Decide where to search for this query (one or multiple sites)
• Search one or more databases
• Merge results returned by different searches (reliability, relevance, …)
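A toy Python sketch of these four steps, using a crude term-coverage statistic for database selection and min-max score normalization for merging (the collections, counts, and scores are invented; real federated search uses richer resource descriptions, such as Callan’s CORI collection-selection model):

```python
# Step 1: resource descriptions, e.g. how often each term occurs in each database
# (toy counts, invented for this example).
db_term_counts = {
    "LibraryOfCongress": {"copyright": 900, "manuscript": 1200, "genome": 10},
    "NYTimes":           {"copyright": 300, "election": 2000, "genome": 50},
    "BioArchive":        {"genome": 5000, "protein": 4000, "copyright": 100},
}

def db_score(query_terms, counts):
    """Step 2: score a database by its (smoothed) coverage of the query terms."""
    total = sum(counts.values())
    score = 1.0
    for t in query_terms:
        score *= (counts.get(t, 0) + 0.5) / (total + 1.0)
    return score

def select_databases(query, n=2):
    terms = query.lower().split()
    ranked = sorted(db_term_counts,
                    key=lambda db: db_score(terms, db_term_counts[db]),
                    reverse=True)
    return ranked[:n]

def merge_results(per_db_results):
    """Step 4: merge per-database result lists by min-max normalizing each
    database's scores before interleaving (reliability weighting omitted)."""
    merged = []
    for db, results in per_db_results.items():
        scores = [s for _, s in results]
        lo, hi = min(scores), max(scores)
        for doc, s in results:
            norm = (s - lo) / (hi - lo) if hi > lo else 1.0
            merged.append((doc, norm, db))
    return sorted(merged, key=lambda item: item[1], reverse=True)

chosen = select_databases("genome copyright", n=2)            # Steps 1-2
print("search these databases:", chosen)
# Step 3 would actually query each chosen database; here we fake the returned lists.
fake_results = {"BioArchive": [("b1", 12.0), ("b2", 7.5)],
                "NYTimes":    [("n1", 0.9), ("n2", 0.4)]}
print(merge_results(fake_results))                            # Step 4
```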
KNOWLEDGE MAPS:
First Steps Towards Useful eLibrarians
Query: “Tom Sawyer”

RESULTS:
• Tom Sawyer home page
• The Adventures of Tom Sawyer
• Tom Sawyer software (graph search)
• Disneyland – Tom Sawyer Island

WHERE TO GET IT:
• Universal Library: free online text & images
• Bibliomania – free online literature
• Amazon.com: The Adventures of Tom…

DERIVATIVE & SECONDARY WORKS:
• CliffsNotes: The Adventures of Tom…
• Tom Sawyer & Huck Finn comicbook
• “Tom Sawyer” filmed in 1980
• A literary analysis of Tom Sawyer

RELATED INFORMATION:
• Mark Twain: life and works
• Wikipedia: “Tom Sawyer”
• Literature chat room: Tom Sawyer
• On merchandising Huck Finn and Tom Sawyer
CONCLUDING REMARKS
• Search Engine Technology is Evolving Rapidly
  – Relevance => relevance + novelty + reliability
  – Federated (distributed) search
• E-library search ≠ web search
  – OCR issues for e-libraries
  – Metadata search
  – In-depth organization
• Presentation and Organization are Key
  – Clusters, MMR, …
  – Variable-depth summaries, knowledge maps, …
Language Technologies
SLOGAN
• “…right information”
• “…right people”
• “…right time”
• “…right medium”
• “…right language”
• “…right level of detail”

TECHNOLOGY (e.g.)
• IR (search engines)
• Routing, personalization
• Anticipatory analysis
• Info extraction, speech
• Machine translation
• Summarization, expansion