Searching the web - Department of Computer Science and

Download Report

Transcript Searching the web - Department of Computer Science and

Searching the Web
Mark Levene
(Follow the links to learn more!)
Mechanics of a Typical Search
Query submitted to Google
Mechanics of a Typical Search
Google results for the query
Search Engines as Information
Gatekeepers
• Search engines are becoming the primary
entry point for discovering web pages.
• Ranking of web pages influences which
pages users will view.
• Exclusion of a site from search engines
will cut off the site from its intended
audience.
• The privacy policy of a search engine is
important.
Search Engine Wars
• The battle for domination of the web
search space is heating up!
• The competition is good news for users!
• The way in which advertising is combined
with search results is crucial!
• There are serious implications if one of the
search engines will manage to dominate
the space!
Google
• Verb “google” has become synonymous
with searching for information on the web.
• Has raised the bar on search quality,
• Has been the most popular search engine
in the last few years.
• Had a very successful IPO in August
2004.
• Is innovative and dynamic.
Yahoo!
• Synonymous with the dot-com boom,
probably the best known brand on the
web.
• Started off as a web directory service.
• Has very strong advertising and ecommerce partnerships.
• Acquired leading search engine
technology in 2003.
MSN Search
• Synonymous with PC software.
• Remember its victory in the browser wars
with Netscape.
• Developed its own search engine
technology only recently, officially
launched in Feb. 2005.
• May link web search into its next version
of Windows.
Others
• Ask Jeeves
– Specialises in natural language question
answering.
– Search driven by Teoma.
• Looksmart
– Has its own directory service.
– Search driven by Wisenut.
• …
Statistics from search engine
logs
Statistic
AltaVista AlltheWeb Excite
(Year)
(1998)
(2002)
(2001)
average terms per
2.35
2.30
2.60
query
average queries per
2.02
2.80
2.30
session
average result
1.39
1.55
1.70
pages viewed
usage of advanced
20.4%
1.0%
10.0%
search features
Experiment with search engine
query syntax
• Default is AND, e.g. “computer chess” normally
interpreted as “computer AND chess”, i.e. both
keywords must be present in all hits.
• “+chess” in a query means the user insists that
“chess” be present in all hits.
• “computer OR chess” means either keywords
must be present in all hits.
• “”computer chess”” means that the phrase
“computer chess” must be present in all hits.
The most popular search
keywords
AltaVista
(1998)
AlltheWeb
(2002)
sex
free
Excite
(2001)
free
applet
sex
sex
porno
download
pictures
mp3
software
new
chat
uk
nude
Search Engine Architecture
Crawler Algorithm
A crawler is a program that traverses web pages, downloads them
for indexing and follows (or harvests) the hyperlinks on the
downloaded pages.
• A crawler will typically start from a multitude of web pages and aims
to cover as much of the indexable web as possible.
• Standard algorithm used breadth-first strategy.
• Focused crawlers use best-first strategy.
•
Search Index - Inverted File
• Also store position of word in web page and info.
on HTML structure.
The query engine
• The interface between the search index,
the user and the web.
• Algorithmic details of commercial search
engines kept as trade secrets.
• First step is retrieval of potential results
from the index.
• Second step is the ranking of the results
based on their “relevance” to the query.
Vector Space Model –
Content Relevance
Term Frequency (TF)
• Count number of
occurrences of each term.
• Bag of words approach
• Ignore stopwords such as
is, a, of, the, …
• Stemming - computer is
replaced by comput, as are
its variants: computers,
computing
computation,computer
and computed.
• Normalise TF by dividing by
doc length, byte size of doc
or max num of occurrences
of a word in the bag.
chess
game
computer
is a
game
chess
programming
chess
Inverse Document Frequency (IDF)
N
log
ni
•
•
•
•
N is number of documents in the corpus.
ni is number of docs in which word i appears.
Log dampens the effect of IDF.
IDF is also number of bits to represent the term.
Ranking with TF-IDF
wi , j  TFi , j IDFi
score j   wi , j
iq
•
•
•
•
•
j – refers to document j
i – refers to word (or term) i in doc j
q – is the query which is a sequence of terms
scorej - is the score for document j given q
Rank results according to the scoring function.
Content Relevance
•
•
•
•
•
•
Phrase matching.
Synonyms.
URL analysis.
Date last updated.
Spell checking.
Home page detection.
Link Text (Anchor Text)
• Include link text for a link pointing to a web
page, say P, as part of the content of P
• Link text is very useful in finding home
pages.
• Link text behaves like user queries
– They act as short summaries
– They often match query terms
HTML Weighting
•
•
•
•
•
Class Name
1) Plain Text
HTML tags
None of the above
2) Strong
3) List
4) Header
STRONG, B, EM, I, U
DL, OL, UL
H1, H2, H3, H4, H5, H6
5) Anchor
6) Title
A
TITLE
Normal retrieval = (111101) ranking with TF-IDF
(181882) – 39.6% improvement.
(181782) – 48.3% improvement – C2, C4 and C5.
(181582) - 43.5% improvement
Meta tag text is mostly ignored by search engines
Factor in Link Metrics
wi , j  TFi , j IDFi PR j
• Multilply by PageRank of document (web page).
• We do not know exactly how Google factors in
the PR, it may be that log(PR) is used.
Popularity Based Metrics
• Factor in users’ opinions as represented in
the query logs.
• Document space modification adjusts the
weights of keywords in popular pages.
• Clickthrough data can also be taken into
account to improve the ranking of search
engine query results.
Precision and Recall
• Precision is Overlap/Retrieved (first results page
retrieved is most important).
• Recall is Overlap/Relevant (for web search recall is
related to index coverage).
Typical Recall-Precision Curve
• Top-n precision – proportion of relevant for top n ranked
results.
• Measure top-n precision at fixed recall point for n being
0% to 100% of the ranked results.
Probabilistic IR
• Basic question: What is the probability that
a document, D, is relevant to a query Q?
• Probability ranking assumption: If docs
retrieved are ordered by decreasing
probability of relevance then the overall
effectiveness of the system is the best
obtainable given the input documents.
Bayes Formulation of Relevance
P( D | R) P( R)
P( R | D) 
P( D)
• R – relevance of D with respect to a query Q
• D – document (web page)
• P(R|D) – probability that a page is relevant given
its description (or representation)
• NR – D not relevant with respect to Q
Naïve Bayes Independence
Assumption
n
P ( D | R )   P ( wi | R )
i 1
n
P ( D | NR)   P ( wi | NR)
i 1
• n – the number of words in D
• wi – the word in position i in D
• Also assume that the probability of a word is
independent of its position in the document.
Computing the probabilities
ri  1
P( wi | R) 
RW  c
nri  1
P( wi | NR) 
NRW  c
• ri – number of times wi occurs in relevant docs
• RW - number of words in relevant docs (counting
duplicate words multiple times, since docs are bags)
• nri – number of times wi occurs in relevant docs
• NRW – number of words in non relevant docs
• P(R) is the number of relevant docs with respect to Q.
• P(NR) is the number of docs which are not relevant with
respect to Q.
• c – is a smoothing constant greater or equal to one
Is a Document Relevant?
n
n
i 1
i 1
P( R) P( wi | R)  P( NR) P( wi | NR)
• Assume we have a set of training examples of
relevant and non-relevant documents to
compute P(wi|R) and P(wi|NR) for words wi.
• The user could mark docs as R or NR.
• Choose the class (R or NR) which has higher
probability.
Ranking Documents
nD1
nD 2
 P( w | R)   P( w | R)
i
i 1
i
i 1
• D1 is more relevant than D2 given Q, where
– nD1 - the number of words in D1
– nD2 – the number of words in D2
• If we are ranking, then P(wi|R) can be
approximated on the basis of a document with a
weighting function such as TF-IDF.
Other types of Search Engine
•
•
•
•
•
•
•
•
Directory – e.g. Yahoo! (Open Directory)
MetaSearch – e.g. Dogpile (Mamma)
Clustering – e.g. Clusty
Question Answering – e.g. Ask Jeeves
WolframAlpha, True Knowledge, & Google,
Yahoo and Bing
Visual – e.g. Quintura
Collaborative – e.g, Omgili, Sproose, AfterVote
Human Input – e.g. ChaCha
Social Tagging – e.g. Blekko, MrTaggy
Directions in Search
•
•
•
•
•
•
•
•
Mobile search – e.g. Google Mobile
Local search – e.g. Google Local
Video search – e.g. YouTube
Image search – e.g. Picsearch
Audio search – e.g. Yahoo Audio
Blog search – e.g. Technorati
Social bookmarking - e.g. Delicious
Some new ideas - e.g. Cuil
Paid Inclusion and Paid
Placement
• Paid inclusion – payment to speed up
inclusion in the search index.
• Pay-Per-Click (PPC) or Cost-Per-Click
(CPC) – payment for being advertised on
the search engine’s sponsored results list.
• The sponsored list should be separated
from the organic list.
• PPC is a major revenue source for search
engines.
• Click fraud is a problem!
Behavioural Targeting
• Contextual targeting is a weaker form
based on a single user session.
• Personalised advertising in order to
increase the effectiveness of advertising.
• Data collected from individual users,
normally through cookies.
Pay-Per-Action (PPA)
• Charge the advertiser only when an action
takes place such as a purchase, a
download or any other trackable action.
• Ad network will require the advertiser to
place a script in the web page triggering
the action.
• Some level of trust is needed.