Hyperlink Analysis

Download Report

Transcript Hyperlink Analysis

Using Hyperlink structure
information for web search
Hyperlink structure information

Hyperlink analysis for the web by Monika
R. Henzinger, Google Inc.
 Structural web search using a graph-based
discovery system by Nitish Monocha etc.,
University of Texas
How are hyperlinks useful?

Assumptions
a)Assumption 1. A hyperlink from page
A to page B is a recommendation of
page B by the author of page A.
b) Assumption 2. If page A and page B are
connected
by a hyperlink, then they might be on
the same topic.
c) Pages pointed by many pages are of higher quality
than pages pointed to by fewer pages.
main uses of hyperlink
analysis






crawling (collecting the pages)
ranking (rank the returned results)
Compute the geographic scope of a web
page
Find mirrored host
Compute the statistics of web pages and
search engine
Major search engine use hyperlink analysis
but do not want to disclose the algorithms
Crawling

Collect web pages
 Start with a set of pages, recursively visit
the hyperlinks
Traditional IR

Vector model or Boolean model
 Does not work well in the web because:
Web page authors manipulate the ranking.
 The power of hyperlink analysis comes
from the fact that it uses the content of
other pages to rank the current page.
Connectivity-Based Ranking
(rank using hyperlink analysis)
query-independent schemes, which
assign a score to a page independent of
a given query;
 query-dependent schemes, which
assign a score to a page in the context
of a given query.

Model

Web pages as graph, page as node,
hyperlink as edge.
 Directed graph: link graph. Used for finding
related pages
 Undirected graph: co-citation graph. Used
for categorizing related pages.
Query-independent Ranking

Major drawbacks: it does not distinguish between
the quality of a page pointed by a number of lowquality pages and the quality of a page pointed to
by the same number of high-quality page.
 PageRank algorithm. Weight each hyperlink to the
page proportionally to the quality of the page
containing the hyperlink. PageRank of a page A
depends on the pagerank of a page B pointing to
A. Used by Google.
Query-dependent Ranking

a)
b)

Build query-specific graph: neighborhood
graph.
Start set of documents matching the query
Augmented by the sets of the documents
that either hyperlinks to or is hyper linked
to by the documents in the start set.
Perform the hyperlink analysis.
Query-dependent
Ranking(continued)
a)
b)
c)
Indegree-based approach. (the number of
documents hyper linking to a document in the
start set)
Authorities (pages with good content on a topic)
and hubs (directory-like pages with many
hyperlinks to pages on the topic)
HITS algorithm to determine good hubs and
good authorities. Each node has auth score and
hub score.
Problems of HITS

Small additions to neighborhood graph may
considerably change the scores of hub and
auth.
 Topic drift when the majority of pages on
neighborhood graph is on a topic different
from the query topic.
Structural web search using a
graph-based discovery system






WebSUBDUE: SUBDUE is the engine for
knowledge discovery(data mining). Support
structural search, text search, synonym search, and
combinations of these searches.
Data preparation: Crawler written in Perl to build
the labeled graph for the web site.
Labeled graph is feed into SUDUE system.
Query can be modeled as labeled graph as well.
Search the sub graph in the graph
Make comparison with existing search engine:
AltaVista
Find all pages that link to a
page containing the term
subdue
Jobs in computer science
Find hubs and authorities
pages on “algorithm”
Conclusion

Hyperlink structure information is valuable
information.
 Use of hyperlink information to enhance
normal web search in crawling, ranking etc.
 Use of hyperlink information to support
structural search, which is still missing in
existing search engine.