Microsoft PowerPoint Presentation: 13_1_WebSearch_Ranking1


Web Search – Summer Term 2006
VI. Web Search Ranking
(c) Wolfgang Hürst, Albert-Ludwigs-University
General Web Search Engine Architecture
[Architecture diagram, cf. [1] Fig. 1: the client sends queries to the
query engine and receives ranked results; crawler(s) fetch pages from
the WWW into the page repository, steered by the crawl control module;
the indexer module and the collection analysis module build the text,
structure, and utility indexes; the ranking component draws on these
indexes and on usage feedback from the client.]
1st Generation Web Search Engines
Use only "on page" text data
- word frequency, language
1995-1997 (AltaVista, Excite, Lycos, etc.)
- Extended Boolean model
- Matches: Exact, prefix, phrase, ...
- Operators: AND, OR, AND NOT, NEAR, ...
- Fields: TITLE, URL, HOST, ...
- AND is somewhat easier to implement and maybe
preferable as the default for short queries
- Ranking (see the scoring sketch below)
- TF-like factors: TF, explicit keywords, words in
the title, explicit emphasis (headers), etc.
- IDF factors: IDF, total word count in corpus,
frequency in query log, frequency in language
TUTORIAL ON SEARCH FROM THE WEB TO THE ENTERPRISE, SIGIR 2002
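To make the TF- and IDF-style factors above concrete, here is a minimal
scoring sketch in Python. The dampened logarithmic weighting and the toy
corpus are illustrative assumptions, not taken from any of the engines
named above.

import math
from collections import Counter

def tf_idf_score(query_terms, doc_tokens, doc_freq, num_docs):
    # Score one document: sum of TF * IDF over the query terms.
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        if tf[term] == 0:
            continue
        tf_factor = 1.0 + math.log(tf[term])              # dampened term frequency
        idf_factor = math.log(num_docs / doc_freq[term])  # rare terms weigh more
        score += tf_factor * idf_factor
    return score

# Toy corpus; doc_freq counts in how many documents each term appears.
docs = {"d1": "web search ranking ranking pagerank".split(),
        "d2": "classic information retrieval model".split()}
df = Counter(t for tokens in docs.values() for t in set(tokens))
for name, tokens in docs.items():
    print(name, tf_idf_score(["ranking", "web"], tokens, df, len(docs)))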
Classic IR vs. Web Search
Initial problem is similar to traditional IR ...
[Diagram: an information need is formulated as a query and matched
against the data / documents.]
On the user side:
- The number of users is huge. Very huge.
- Big variety in users
- Users don't cooperate (short queries, ...)
On the data side:
- The web is huge. Very huge.
- Big variety in data
- Document authors don't cooperate (spam, ...)
... but basic conditions & characteristics differ significantly
Classic IR vs. Web Search
Classic IR: Generally a 2-step process
1. Select subset of relevant documents
(based on term appearance, etc.)
2. Rank / sort by estimated relevance
(based on term frequency, etc.)
Problems with web search:
- Size of the WWW
- Heterogeneity (length, quality, format, ...)
- Not always self-descriptive (e.g. a search
engine's homepage rarely contains the
words "search engine")
- Can be manipulated (spam)
Therefore: Purely content-based ranking is problematic
The evolution of search engines
1st generation: Use only "on page" text data
- Word frequency, language
1995-1997 (AltaVista, Excite, Lycos, etc.)
2nd gen.: Use off-page, web-specific data
- Link (or connectivity) analysis
- Click-through data (what results people
click on)
- Anchor-text (how people refer to a page)
From 1998 (made popular by Google, but now
used by everyone)
TUTORIAL ON SEARCH FROM THE WEB TO THE ENTERPRISE, SIGIR 2002
Some 2nd generation web IR ranking criteria
- Extended Boolean model
- Content-based techniques (variants of
term vector or probabilistic model)
- Ad-hoc factors (anti-porn heuristics,
publication/location data, url depth, ...)
- Human annotations (in Yahoo, ODP, etc.)
- Popularity: Click-through feedback
("People vote with their clicks", DirectHit)
- Anchor text
- Connectivity-based techniques
TUTORIAL ON SEARCH FROM THE WEB TO THE ENTERPRISE, SIGIR 2002
The Evolution of Search Engines
Rough summary:
First web search engines: Like traditional IR, i.e.
1. Search: Text processing to find pages
2. Ranking: Analysis of term distributions
(and some other things)
New / current web search engines:
1. Search: Text processing to find pages
(incl. anchor text and some other things)
2. Ranking: Link analysis
(and some other things)
The Web Graph
What do we know about the link structure of
the web?
[Figure: the structure of the web graph (the "bow tie" diagram). SOURCE: [3]]
Considering the link structure
Idea: Use the link structure to get a content-independent
feature to estimate the relevance of a web page
Why? Because a link from page A to page B
- Often represents a relationship
of their contents
- Can be interpreted as a reference
or recommendation
Compare citations in scientific literature
- Articles with lots of citations:
Considered to have a high scientific value
(Bibliometric laws)
Considering the link structure
First approach: Use the number of backlinks
as a measure for a page's relevance
Problem: web vs. scientific publications
- Publications: rigorous review process, high quality,
well-defined length and style, same goal, similar
number of citations, etc.
- Web pages: no quality control, no publication costs,
can be manipulated (self-citations), large variance
regarding quality, number of links, intended usage,
length and style, etc.
(Note: Compare differences to classic IR)
Idea: Consider reputation and credibility of a
link, i.e. the quality and importance of a web
page
Simple PageRank - Definition
Goal: Measure for the importance and quality
of a web page
Definition: Simple PageRank
- Assume the web is a strongly connected graph
- Define the Simple PageRank R of a page p as
R(p) = Σ_{q ∈ B_p} R(q) / N_q
where B_p is the set of pages that link to p (its
backlinks) and N_q is the number of outgoing links
of page q
Simple PageRank - Example
[Example: a strongly connected graph of five pages with
Simple PageRanks R(1) = 0.286, R(2) = 0.286, R(3) = 0.143,
R(4) = 0.143, R(5) = 0.143. SOURCE: [3]]
Simple PageRank - Calculation
Mathematical representation:
If R = m×1 vector [R(1), R(2), ..., R(m)]
and A = m×m matrix with
a_ij = 1/N_j if a link from page j to page i exists
a_ij = 0 otherwise
then R = A R, i.e. R(p) = Σ_{q ∈ B_p} R(q) / N_q
Calculation:
- Set of linear equations, but direct calculation
of a solution is too expensive
- But: R is an eigenvector of A (with eigenvalue 1)
- Hence: efficient approaches for iterative
calculation exist (e.g. power iteration, see the
sketch below)
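A minimal power-iteration sketch of the iterative calculation above,
assuming the graph is strongly connected and given as a dictionary that
maps each page to the list of pages it links to; the 5-page graph is
made up for illustration and is not the example from [3].

def simple_pagerank(outlinks, iterations=50):
    pages = list(outlinks)
    rank = {p: 1.0 / len(pages) for p in pages}       # uniform starting values
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for q, targets in outlinks.items():
            for p in targets:
                new_rank[p] += rank[q] / len(targets)  # R(q) / N_q along each link
        rank = new_rank                                # one power-iteration step: R <- A R
    return rank

graph = {1: [2, 3], 2: [1], 3: [4, 5], 4: [1], 5: [1]}
print(simple_pagerank(graph))   # converges to R(1)=0.4, R(2)=R(3)=0.2, R(4)=R(5)=0.1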
Simple PageRank - Convergence
Does the iterative calculation converge?
- Yes! (see literature for a proof)
- Number of iterations can be reduced by
- using "good" starting values
- stopping as soon as the ranking order
will (most likely) not change anymore
But: Only true for strongly connected graphs
Problems if the graph is not strongly connected:
- Page Sink = a group of pages that link to
each other but have no links to outside pages
- Page Leak = a single page with no outgoing
links (a special case of a Page Sink)
What happens with Leaks?
R(p) = Σ_{q ∈ B_p} R(q) / N_q
[Example: five-page graph in which one page is a Leak]
A Leak only absorbs rank: rank that flows into it is
never passed on, so the total rank drains away over
the iterations.
Possible Solutions:
- Remove all Leaks before the PageRank is calculated
(see the sketch below)
- Or: Insert back links for each forward link to a Leak
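A sketch of the first solution: strip all Leaks before the calculation.
Deleting a Leak can turn its predecessors into new Leaks, so the pruning
repeats until none are left. Graph representation as in the earlier
sketch; the example graph is illustrative.

def remove_leaks(outlinks):
    graph = {p: set(ts) for p, ts in outlinks.items()}
    while True:
        leaks = {p for p, targets in graph.items() if not targets}
        if not leaks:
            return graph
        for p in leaks:
            del graph[p]            # drop each Leak page ...
        for targets in graph.values():
            targets -= leaks        # ... and every link pointing to one

graph = {1: [2, 3], 2: [1], 3: []}  # page 3 is a Leak
print(remove_leaks(graph))          # {1: {2}, 2: {1}}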
What happens with Sinks?
R(p) = Σ_{q ∈ B_p} R(q) / N_q
[Example: five-page graph in which two pages form a Sink]
A Sink only accumulates rank: rank flows in but never
back out, so over the iterations the Sink captures the
rank of the entire graph.
Possible Solutions:
- Introduce a damping factor d
PageRank - Definition and Example
Definition: PageRank R of page p
R(p) = d/m + (1 − d) · Σ_{q ∈ B_p} R(q) / N_q
with damping factor d and total number of pages m
Example:
[Five-page graph with PageRanks R(1) = 0.154,
R(2) = 0.142, R(3) = 0.101, R(4) = 0.313,
R(5) = 0.290. SOURCE: [3]]
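A sketch of this definition under the convention used here (the ranks
sum to 1, and d is the probability of a random jump); the graph and the
value d = 0.15 are illustrative assumptions.

def pagerank(outlinks, d=0.15, iterations=100):
    pages = list(outlinks)
    m = len(pages)
    rank = {p: 1.0 / m for p in pages}
    for _ in range(iterations):
        new_rank = {p: d / m for p in pages}        # share from random jumps
        for q, targets in outlinks.items():
            for p in targets:
                new_rank[p] += (1 - d) * rank[q] / len(targets)  # share from links
        rank = new_rank
    return rank

# Pages 4 and 5 form a Sink, but the random jumps keep rank flowing out of it.
graph = {1: [2, 4], 2: [1], 3: [1], 4: [5], 5: [4]}
print(pagerank(graph))   # the ranks still sum to 1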
Intuitive Interpretation
Link: Recommendation and quality statement
- Important pages link to important pages
- Many links from important pages:
Target is important as well
- Importance influences other pages and
depends on them, hence: recursive
calculation (relatively robust against spam)
Another Intuitive Interpretation
The Random Surfer Model: A random surfer
performs a random walk on the web graph
- Surfer on page q:
probability to click on any one of its links = 1/N_q
- Probability to get to page p: the sum, over all
pages q in B_p, of the probability of being on
page q times the probability 1/N_q of following
the link from q to p
- The damping factor d represents the probability
of going to a random new page instead
R(p) = Σ_{q ∈ B_p} R(q) / N_q
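A small simulation of the model: the visit frequencies of a long random
walk approximate the PageRank values. Graph and d = 0.15 as in the
sketch above, purely illustrative.

import random

def random_surf(outlinks, d=0.15, steps=200_000):
    pages = list(outlinks)
    visits = {p: 0 for p in pages}
    page = random.choice(pages)
    for _ in range(steps):
        visits[page] += 1
        if random.random() < d or not outlinks[page]:
            page = random.choice(pages)             # jump to a random page
        else:
            page = random.choice(outlinks[page])    # follow a link, prob. 1/N_q each
    return {p: count / steps for p, count in visits.items()}

graph = {1: [2, 4], 2: [1], 3: [1], 4: [5], 5: [4]}
print(random_surf(graph))   # close to the iteratively computed PageRank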
The Random Surfer Model
Comments on the Random Surfer Model:
A simple, illustrative model to explain the
involved mathematics, but with many limitations:
- Does not model the back button
- Jumps and selection of links usually not
random or uniformly distributed
- Random walk can hardly be seen as a
good search strategy
PageRank - First conclusion
PageRank = Quality or importance measure
for a web page
Harder to manipulate than pure backlinks
Content and query independent
Ranking:
-> The higher the PageRank,
-> the higher a page's quality or importance,
-> the higher the page's relevance
Note: Other applications exist (e.g. crawling)
PageRank in the 1st Google engine
Good for "common cases", not necessarily
when searching for a particular information
In practice: Combination with other features
Example: First Google search engine [2]
- Hits classified in different types (anchor, title,
...), represented by a vector of type-weights
- Term distribution / no. of hits, represented
by a vector of count-weights
- Scalar product of both vectors = IR score
- Final rank = combination of IR score and
PageRank
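A hedged sketch of this combination: the type-weights, the dampened
count-weights, and the final merge formula below are illustrative
assumptions; [2] describes the scalar-product structure but does not
publish the actual weights or the exact combination.

import math

TYPE_WEIGHTS = {"title": 4.0, "anchor": 3.0, "body": 1.0}   # assumed values

def ir_score(hit_counts):
    # Scalar product of the type-weight and count-weight vectors.
    score = 0.0
    for hit_type, type_weight in TYPE_WEIGHTS.items():
        count_weight = math.log(1 + hit_counts.get(hit_type, 0))  # dampened count
        score += type_weight * count_weight
    return score

def final_rank(hit_counts, pagerank, alpha=0.5):
    # One plausible merge of IR score and PageRank; [2] leaves this open.
    return alpha * ir_score(hit_counts) + (1 - alpha) * math.log(1 + pagerank * 1e6)

print(final_rank({"title": 1, "body": 12}, pagerank=0.004))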
PageRank today
PageRank and web search:
- Most likely used by most search engines (in some way)
Other interesting (research) problems:
- More efficient calculation
- Storage and speed
- Updates
- Fighting spam
- Make better use of the structure of the web
- Consider the continuous change of the web's structure
- Better study of its characteristics (practical + theoretical):
- Stability and sensitivity
- Influence of the linkage
- Influence of different parameters
- Personalization and specialization
References - Ranking
[1] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke,
S. Raghavan: "Searching the Web", ACM Transactions
on Internet Technology, Vol. 1/1, Aug. 2001.
Chapter 5 (Ranking and Link Analysis)
[2] S. Brin, L. Page: "The Anatomy of a Large-Scale
Hypertextual Web Search Engine", WWW 1998.
Chapters 2 and 4.5.1
[3] A. Broder, R. Kumar, F. Maghoul, P. Raghavan,
S. Rajagopalan, R. Stata, A. Tomkins, J. Wiener:
"Graph Structure in the Web", WWW 2000
[4] L. Page, S. Brin, R. Motwani, T. Winograd:
"The PageRank Citation Ranking: Bringing Order
to the Web", Stanford Technical Report
[5] T. Haveliwala: "Topic-Sensitive PageRank", WWW 2002