Microsoft PowerPoint Presentation: 14_1_WebSearch_Ranking3
Download
Report
Transcript Microsoft PowerPoint Presentation: 14_1_WebSearch_Ranking3
Web Search – Summer Term 2006
VI. Web Search Ranking (cont.)
(c) Wolfgang Hürst, Albert-Ludwigs-University
The Evolution of Search Engines
1st generation: Use only "on page", text data
- Word frequency, language
1995-1997 (AltaVista, Excite, Lycos, etc.)
2nd gen.: Use off-page, web-specific data
- Link (or connectivity) analysis
- Click-through data (what results people
click on)
PageRank
[2], introduced
by Brin
- Anchor-text
(how people
refer to a page)
and Page, used by Google
From 1998 (made popular by Google but
HITS [3], introduced by Kleinberg
everyone now)
(used by Teoma?)
TUTORIAL ON SEARCH FROM THE WEB TO THE ENTERPRISE, SIGIR 2002
Link-based ranking: HITS
Motivation (compare PageRank):
Broad-topic queries:
deliver (too) large set of relevant results
Therefore:
Ranking based on the authority of a web
page (cf. PageRank: quality / importance)
Link:
Interpreted as a conferral of authority
Goal:
Find pages with high authority (balance
between relevance and popularity)
Link-based ranking: HITS (cont.)
Basic idea:
Consider sub-graph of the web graph that
contains as much relevant pages as
possible
Analyze the graph's link structure to find:
Authorities = the most authoritative
or definitive subset of relevant pages
(for ranking)
Hubs = Pages pointing to many related
authorities (for their identification)
Authorities and Hubs - Example
Example: Query “search engine”
www.google.com
HUBS
www.teoma.com
AUTHORITIES
searchenginewatch.com
dir.yahoo.com/
Computers_and_Internet/
Internet/World_Wide_Web/
Searching_the_Web/
Search_Engines_and_Directories/
www.altavista.com
www.alltheweb.com
Authorities and Hubs - Basic idea
Approach:
- Generate a query-dependent sub-graph
- Recursively calculate hubs and authorities
Assume S is the set of pages in this subgraph, then S should be
- rather small
- contain lots of relevant pages
- contain the most important authorities
Basic idea to generate such a sub-graph:
- Get initial root set based on any IR criteria
- Include the local neighborhood of this set
Authorities and Hubs - Base set
GIVEN:
- QUERY Q
- TEXT-BASED SEARCH ENGINE SE
- CONSTANTS T AND D (NAT. NUMBERS)
- SET R(Q) OF THE FIRST T RESULTS OF SE GIVEN Q
ALGORITHM TO CALCULATE SUBGRAPH S(Q)
S(Q) := R(Q)
FOR EACH PAGE P IN R(Q)
T+(P) := SET OF PAGES LINKED BY P
T-(P) := SET OF PAGES LINKING TO P
ADD ALL PAGES FROM T+(P) TO S(Q)
IF |T-(P)| < D
THEN ADD ALL PAGES FROM T-(P) TO S(Q)
ELSE ADD RANDOM SUBSET OF T-(P) TO S(Q)
Query-dependent base set - Comments
Why only use a sub-graph?
- Advantage of query dependence
- Reduces processing time (online calculation!)
Why not just take the root set?
- Appearance of query terms does not
necessarily represent relevance (or authority)
- Larger network is needed for link analysis
In original work: Heuristics for special cases
- Remove intrinsic links, i.e. links from the
same domain (navigational links, etc.)
- Consider only a certain number of links from
one domain to a page p (to avoid spamming)
Calculating Hubs and Authorities
Obviously, there exists a mutual reinforcing
relationship between Hubs and Authorities:
- A good Hub links to many good Authorities
- A good Authority is linked by many Hubs
Hence, use an iterative algorithm to estimate a
Hub and Authority value, respectively
Authorities: I-Operation
Hubs: O-Operation
Calculating Hubs and Authorities
Authorities: I-Operation
Hubs: O-Operation
q1
q2
q1
PAGE p
q3
PAGE p
q2
q3
Calculating Hubs and Authorities
GIVEN:
- SUB-GRAPH G WITH N PAGES (FROM BASE SET S(Q))
- CONSTANT NUMBER K
ALGORITHM TO CALCULATE HUBS AND AUTHOR.
X0 := (1, 1, ..., 1)
Y0 := (1, 1, ..., 1)
FOR i = 1, ..., K
CALCULATE NEW WEIGHTS Xi BY
APPLYING THE I-OPERATION TO Xi-1, Yi-1
CALCULATE NEW WEIGHTS Yi BY
APPLYING THE O-OPERATION TO Xi, Yi-1
NORMALIZE Xi AND Yi
Calculating Hubs and Authorities
Convergence: see lit.
Basic idea:
PageRank vs. HITS
PageRank
HITS
+
- Hard to spam
- Computes quality
signal for all pages
- Easy to compute, realtime execution is hard
- Query specific
- Works on small graphs
+
-
- Local graph structure - Non-trivial to compute
can be manufactured
- Not query specific
- Provides a signal only
- Does not work on
when there is direct
small graphs
connectivity (e.g.
home pages)
Proven to be effective for Well suited for supervised
general purpose ranking directory construction
TUTORIAL ON SEARCH FROM THE WEB TO THE ENTERPRISE, SIGIR 2002
Commercial search engines using HITS
(Maybe?) Teoma, now search.ask.com
"Teomas underlying technology is an extension of
the HITS algorithm …", C. Sherman, April 2002,
http://dc.internet.com/news/article.php/1002061
(Not online anymore)
References - HITS
[1] S. BRIN, L. PAGE: "THE ANATOMY OF A
LARGE-SCALE HYPERTEXTUAL WEB
SEARCH ENGINE", WWW 1998
[2] JON KLEINBERG: "AUTHORITATIVE
SOURCES IN A HYPERLINKED
ENVIRONMENT", JOURNAL OF THE ACM,
VOL. 46, NO. 5, SEPTEMBER 1999
General Web Search Engine Architecture
CLIENT
QUERIES
QUERY
ENGINE
WWW
RESULTS
PAGE
REPOSITORY
RANKING
CRAWLER(S)
COLLECTION
ANALYSIS MOD.
INDEXER
MODULE
CRAWL
CONTROL
INDEXES
UTILITY
STRUCTURE
TEXT
USAGE FEEDBACK
(CF. [1] FIG. 1)