Search Engine Caching
Rank-preserving two-level caching for scalable search engines, Patricia Correia Saraiva et al., September 2001
http://doi.acm.org.ezproxy.lib.vt.edu:8080/10.1145/383952.383959
Predictive Caching and Prefetching of Query Results in Search Engines, Ronny Lempel and Shlomo Moran, May 2003
http://doi.acm.org.ezproxy.lib.vt.edu:8080/10.1145/775152.775156
Presented by
Adam "So, is this gonna be on the test?" Edelman
The Problem
• The User: "I want my results now!"
• But...
– Over 4 billion web pages
– Over 1 million queries per minute
• How do we keep response times down as
the web grows?
Search Engine Statistics
• 63.7% of the search phrases appear only once in the billion-query log
• The 25 most popular queries in the log account for
1.5% of the submissions
• Considerable time and processing power can be saved through well-implemented caching
Search Engine Statistics
• 58% of the users view only the first page of results
(the top-10 results)
• No more than 12% of users browse through more
than 3 result pages.
• We do not need to cache large result sets for a
given query
What do we Cache?
• 36% of all queries have been retrieved
before
• Can we apply caching even if the query
does not exactly match any previous query?
What do we Cache?
• Saraiva et al. propose a two-level cache
• In addition to caching query results, we also cache inverted lists for popular terms
Query Cache Implementation
• Store only the first 50 references per query
– ~25KB per query
• Query logs show that miss ratios do not drastically improve after the query result cache exceeds 10 MB
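
A minimal Python sketch of such a query result cache, assuming the LRU replacement policy the authors use (discussed on a later slide): each entry holds only the top-50 references, and the total budget is capped near the 10 MB point where miss ratios flatten. Class and parameter names are illustrative, not from the paper.

from collections import OrderedDict

class QueryResultCache:
    # LRU cache of query results; keeps only the top-k references per query
    def __init__(self, capacity_bytes=10 * 2**20, entry_bytes=25 * 1024, top_k=50):
        self.capacity_bytes = capacity_bytes  # ~10 MB, where miss ratios flatten
        self.entry_bytes = entry_bytes        # ~25 KB assumed per cached query
        self.top_k = top_k                    # first 50 references only
        self._cache = OrderedDict()           # query string -> list of doc references

    def get(self, query):
        results = self._cache.get(query)
        if results is not None:
            self._cache.move_to_end(query)    # hit: mark as most recently used
        return results

    def put(self, query, results):
        self._cache[query] = results[:self.top_k]
        self._cache.move_to_end(query)
        # evict least recently used queries once the byte budget is exceeded
        while len(self._cache) * self.entry_bytes > self.capacity_bytes:
            self._cache.popitem(last=False)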
Inverted List Cache Implementation
• For this data set, 50-75% of inverted lists contain documents in which the term appears only once
• Use a 4KB inverted list size per term
– More work needs to be done
• Asymptotic behavior is apparent after the cache exceeds 200MB
• Use 250MB for the IL cache
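
A companion sketch for the inverted-list cache under the same assumptions: fixed ~4KB entries, a ~250MB budget, and LRU eviction. Treating every entry as exactly 4KB is a simplification of the sizing discussed above.

from collections import OrderedDict

class InvertedListCache:
    # LRU cache of truncated inverted lists, one entry per popular term
    def __init__(self, capacity_bytes=250 * 2**20, entry_bytes=4 * 1024):
        self.capacity_bytes = capacity_bytes  # ~250 MB total budget
        self.entry_bytes = entry_bytes        # ~4 KB inverted list per term
        self._cache = OrderedDict()           # term -> inverted list

    def get(self, term):
        lst = self._cache.get(term)
        if lst is not None:
            self._cache.move_to_end(term)     # hit: mark term as most recently used
        return lst

    def put(self, term, inverted_list):
        self._cache[term] = inverted_list
        self._cache.move_to_end(term)
        while len(self._cache) * self.entry_bytes > self.capacity_bytes:
            self._cache.popitem(last=False)   # evict the least recently used term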
Two-Level Cache Implementation
• Combine the previous two caches, as sketched below
• 270MB total cache
– Accounts for only 6.5% of the overall index size
• Tested over a log of 100K queries to TodoBR
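
Putting the two sketches together, a hypothetical lookup path for the two-level cache: answer from the query result cache when possible, otherwise assemble inverted lists (from cache or disk) and rank. The backend object, its rank() and fetch_inverted_list() methods, and the whitespace tokenization are all assumed stand-ins for the engine's query processor and index, not the paper's interfaces.

class TwoLevelCache:
    # Combined lookup: query result cache first, then the inverted-list
    # cache, then the full index on disk (reuses the two sketches above)
    def __init__(self, backend):
        self.backend = backend                 # full index; rank() and
                                               # fetch_inverted_list() are assumed
        self.query_cache = QueryResultCache()  # level 1, ~10 MB
        self.il_cache = InvertedListCache()    # level 2, ~250 MB

    def search(self, query):
        hit = self.query_cache.get(query)
        if hit is not None:
            return hit                         # level-1 hit: no query processing
        lists = []
        for term in query.split():             # naive tokenization for the sketch
            lst = self.il_cache.get(term)
            if lst is None:                    # level-2 miss: read list from disk
                lst = self.backend.fetch_inverted_list(term)
                self.il_cache.put(term, lst)
            lists.append(lst)
        results = self.backend.rank(query, lists)
        self.query_cache.put(query, results)
        return results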
Two-Level Cache Results
• Compared against 270MB caches of only query results, only inverted lists, and no cache at all
• Queries processed are reduced by 62%
– 21% better than a query-result-only cache
• Page fetches from the database are reduced by 95%
– 3% better than an inverted-list-only cache
Two-Level Cache Results
• At more than 20 queries per second, the two-level cache performs only 20% of the disk reads of the no-cache configuration
• The two-level cache can handle 64 queries per second, against 22 per second with no cache
How do we cache?
• Saraiva et al. use a least recently used (LRU) replacement policy for cache maintenance
• Users search in sessions, so the next query will probably be related to the previous one
• Can we use this to improve caching?
Probability Driven Cache (PDC)
• Lempel and Moran propose a cache based
on the probability of a page being requested
Page Least Recently Used (PLRU)
• Allocate a page queue that can accommodate a certain number of result pages
• When the queue is full and a new page needs to be cached, the least recently used page is removed from the cache
• Achieves hit ratios around 30% for warm, large caches
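
A sketch of PLRU under these definitions, keyed by (query, page number) so that individual result pages, rather than whole result sets, compete for cache space. Capping by page count rather than bytes is a simplification.

from collections import OrderedDict

class PLRUCache:
    # One LRU queue over individual result pages
    def __init__(self, max_pages):
        self.max_pages = max_pages
        self._queue = OrderedDict()           # (query, page_no) -> result page

    def get(self, query, page_no):
        page = self._queue.get((query, page_no))
        if page is not None:
            self._queue.move_to_end((query, page_no))  # hit: now most recently used
        return page

    def put(self, query, page_no, page):
        self._queue[(query, page_no)] = page
        self._queue.move_to_end((query, page_no))
        if len(self._queue) > self.max_pages:
            self._queue.popitem(last=False)   # evict the least recently used page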
Page Segmented LRU (PSLRU)
• Maintains two LRU segments: a protected segment and a probationary segment
• Pages are first placed in the probationary segment; if requested again, they are moved to the protected segment
• Pages evicted from the protected segment are moved to the probationary segment
• Pages evicted from the probationary segment are removed from the cache
• Consistently outperforms PLRU, although the difference is very small
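
A sketch of PSLRU along the same lines: two LRU segments, promotion on a repeat request, and demotion from protected back to probationary. The segment sizes are free parameters here; the paper's tuning is not reproduced.

from collections import OrderedDict

class PSLRUCache:
    # Two LRU segments: probationary (new pages) and protected (re-requested pages)
    def __init__(self, probationary_pages, protected_pages):
        self.probationary = OrderedDict()
        self.protected = OrderedDict()
        self.prob_max = probationary_pages
        self.prot_max = protected_pages

    def get(self, key):
        if key in self.protected:
            self.protected.move_to_end(key)
            return self.protected[key]
        if key in self.probationary:
            # second request: promote the page to the protected segment
            page = self.probationary.pop(key)
            self._insert_protected(key, page)
            return page
        return None

    def put(self, key, page):
        self.probationary[key] = page         # new pages start on probation
        self.probationary.move_to_end(key)
        if len(self.probationary) > self.prob_max:
            self.probationary.popitem(last=False)  # eviction here leaves the cache

    def _insert_protected(self, key, page):
        self.protected[key] = page
        self.protected.move_to_end(key)
        if len(self.protected) > self.prot_max:
            # demote the protected segment's LRU page back to probation
            old_key, old_page = self.protected.popitem(last=False)
            self.put(old_key, old_page)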
Topic LRU (TLRU)
• Let t(q) denote the topic of the query q
• Once the cache is warm, serving a query q moves every cached result page of topic t(q) to the tail of the queue
• Each topic's pages will therefore reside contiguously in the queue
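
A TLRU sketch building on the PLRU one. The slide does not say how topics are assigned, so topic_of stands in for t(q) as a supplied function; the linear scan during topic promotion keeps the sketch short, where a real implementation would index pages by topic.

from collections import OrderedDict

class TLRUCache:
    # PLRU queue plus topic promotion; call on_query(q) for each incoming query
    def __init__(self, max_pages, topic_of):
        self.max_pages = max_pages
        self.topic_of = topic_of              # t(q): maps a query string to its topic
        self._queue = OrderedDict()           # (query, page_no) -> result page

    def on_query(self, query):
        topic = self.topic_of(query)
        # move every cached page of this topic to the tail (MRU end);
        # afterwards the topic's pages sit contiguously in the queue
        for key in [k for k in self._queue if self.topic_of(k[0]) == topic]:
            self._queue.move_to_end(key)

    def get(self, query, page_no):
        return self._queue.get((query, page_no))

    def put(self, query, page_no, page):
        self._queue[(query, page_no)] = page
        if len(self._queue) > self.max_pages:
            self._queue.popitem(last=False)   # evict the least recently used page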
Topic SLRU (TSLRU)
• All pages are initially inserted into the probationary segment
• In addition to promoting pages from probationary to protected on a hit, we also promote all cached pages of topic t(q)
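
A TSLRU sketch reusing the PSLRUCache above, with keys assumed to be (query, page number) pairs as in the TLRU sketch: serving a query promotes every probationary page of its topic, not just the page it hits.

class TSLRUCache(PSLRUCache):
    # PSLRU plus topic promotion; call on_query(q) for each incoming query
    def __init__(self, probationary_pages, protected_pages, topic_of):
        super().__init__(probationary_pages, protected_pages)
        self.topic_of = topic_of              # t(q), as in the TLRU sketch

    def on_query(self, query):
        topic = self.topic_of(query)
        # move all probationary pages of this topic to the protected segment;
        # pop defensively, since demotions during the loop can evict entries
        for key in [k for k in self.probationary if self.topic_of(k[0]) == topic]:
            page = self.probationary.pop(key, None)
            if page is not None:
                self._insert_protected(key, page)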