Building Web Spiders
Web-Based Information Architectures
MSEC 20-760
Mini II
Jaime Carbonell
General Topic: Spidering the Web
• Motivation: Acquiring a Collection
• Bare Essentials of Graph Theory
• Web Spidering Algorithms
• Web Spidering: Current Practice
Acquiring a Collection (1)
Revising the Total IR Scheme
1. Acquire the collection, i.e. all the documents
[Off-line process]
2. Create an inverted index (Homework 1)
[Off-line process]
3. Match queries to documents (Homework 2)
[On-line process, the actual retrieval]
4. Present the results to user
[On-line process: display, summarize, ...]
Acquiring a Collection (2)
Document Collections and Sources
• Fixed, pre-existing document collection
e.g., the classical philosophy works
• Pre-existing collection with periodic updates
e.g., the MEDLINE biomedical collection
• Streaming data with temporal decay
e.g., the Wall Street financial news feed
• Distributed proprietary document collections
See Prof. Callan’s methods
• Distributed, linked, publicly-accessible documents
e.g., the Web
Technical Detour:
Properties of Graphs I (1)
Definitions:
Graph
a set of nodes n and a set of edges (binary
links) v between the nodes.
Directed graph
a graph where every edge has a prespecified direction.
Technical Detour:
Properties of Graphs I (2)
Connected graph
a graph where for every pair of nodes there
exists a sequence of edges starting at one
node and ending at the other.
The web graph
the directed graph where n = {all web
pages} and v = {all HTML-defined links
from one web page to another}.
Technical Detour:
Properties of Graphs I (3)
Tree
a connected graph without any loops and
with a unique path between any two nodes
Spanning tree of graph G
a tree constructed by including all n in G,
and a subset of v such that G remains
connected, but all loops are eliminated.
Technical Detour:
Properties of Graphs I (4)
Forest
a set of trees (without inter-tree links)
k-Spanning forest
Given a graph G with k connected
subgraphs, the set of k trees each of which
spans a different connected subgraph.
[Figures: a graph G = <n, v>; a directed graph example; a tree; and the web graph, in which web pages are the nodes and HTML <href …> references are the links.]
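To make the figures concrete, here is a minimal Python sketch of one way such a graph could be represented, assuming an adjacency-list (dictionary) encoding; the page names are invented for illustration.

    # A tiny directed "web graph": each node (page) maps to the list of pages
    # it links to, i.e. an adjacency list.  Page names are hypothetical.
    web_graph = {
        "index.html":  ["about.html", "papers.html"],
        "about.html":  ["index.html"],
        "papers.html": ["index.html", "paper1.html"],
        "paper1.html": [],              # a page with no outgoing links
    }

    # The nodes n and edges v, in the notation of the definitions above.
    nodes = set(web_graph)
    edges = {(src, dst) for src, links in web_graph.items() for dst in links}
    print(len(nodes), "nodes,", len(edges), "edges")   # 4 nodes, 5 edges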
Technical Detour:
Properties of Graphs II (1)
Theorem 1: For every connected graph G,
there exists a spanning tree.
Proof: Depth-first search starting at any node
in G builds the spanning tree.
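As an illustration of the proof idea (not part of the original slides): a depth-first search that records the edge by which each node is first reached yields exactly a spanning tree. A minimal Python sketch, assuming an undirected connected graph given as adjacency lists:

    def spanning_tree(graph, root):
        """Depth-first search from root; the edge used to first reach each
        node becomes a tree edge, so the result spans the connected graph."""
        visited = {root}
        tree_edges = []
        stack = [root]
        while stack:
            node = stack.pop()
            for neighbor in graph[node]:
                if neighbor not in visited:
                    visited.add(neighbor)
                    tree_edges.append((node, neighbor))
                    stack.append(neighbor)
        return tree_edges

    # An undirected connected graph as adjacency lists (each edge listed both ways).
    G = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
    print(spanning_tree(G, "a"))   # 3 tree edges spanning the 4 nodes, no loops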
Technical Detour:
Properties of Graphs II (2)
Theorem 2: For every G with k disjoint connected subgraphs, there exists a k-spanning forest.
Proof: Each connected subgraph has a spanning tree (Theorem 1), and the set of k spanning trees (being disjoint) defines a k-spanning forest.
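The same idea can be run in code: repeating the spanning-tree construction from every not-yet-visited node produces one tree per connected subgraph. A short sketch, again assuming an adjacency-list encoding:

    def spanning_forest(graph):
        """One DFS spanning tree per connected subgraph of 'graph'."""
        visited = set()
        forest = []                       # list of trees, each a list of edges
        for root in graph:
            if root in visited:
                continue
            visited.add(root)
            tree, stack = [], [root]
            while stack:
                node = stack.pop()
                for neighbor in graph[node]:
                    if neighbor not in visited:
                        visited.add(neighbor)
                        tree.append((node, neighbor))
                        stack.append(neighbor)
            forest.append(tree)
        return forest                     # k trees for k connected subgraphs

    # Two disjoint connected subgraphs => a 2-spanning forest.
    G = {"a": ["b"], "b": ["a"], "x": ["y"], "y": ["x"]}
    print(len(spanning_forest(G)))        # 2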
Technical Detour:
Properties of Graphs II (3)
Additional Observations
• The web graph at any instant of time contains k connected subgraphs (but we do not know the value of k, nor do we know a priori the structure of the web graph).
• If we knew every connected web subgraph, we could build a k-web-spanning forest, but this is a very big "IF."
Graph-Search Algorithms I
PROCEDURE SPIDER1(G)
    Let ROOT := any URL from G
    Initialize STACK <stack data structure>
    Let STACK := push(ROOT, STACK)
    Initialize COLLECTION <big file of URL-page pairs>
    While STACK is not empty,
        URLcurr := pop(STACK)
        PAGE := look-up(URLcurr)
        STORE(<URLcurr, PAGE>, COLLECTION)
        For every URLi in PAGE,
            push(URLi, STACK)
    Return COLLECTION
What is wrong with the above algorithm?
Depth-first Search
[Figure: an example graph; the numbers 1-7 give the order in which the nodes are visited by depth-first search.]
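For experimentation, a minimal Python transcription of SPIDER1 (an interpretation of the pseudocode above, not course code); FAKE_WEB and look_up are invented stand-ins for real page fetching, and a step limit is added only so the demonstration halts:

    # A direct transcription of SPIDER1 for a tiny in-memory "web".
    FAKE_WEB = {
        "http://a": ("page A", ["http://b"]),
        "http://b": ("page B", ["http://a"]),   # note the loop a <-> b
    }

    def look_up(url):
        return FAKE_WEB[url]                    # returns (page text, out-links)

    def spider1(root, max_steps=10):
        stack = [root]
        collection = []                         # the URL-page pairs
        while stack and max_steps > 0:          # the step limit is added only
            max_steps -= 1                      # so this demonstration halts
            url = stack.pop()
            page, links = look_up(url)
            collection.append((url, page))
            for link in links:
                stack.append(link)
        return collection

    print(spider1("http://a"))   # the same two pages stored again and again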
Graph-Search Algorithms II (1)
SPIDER1 is Incorrect
• What about loops in the web graph?
=> Algorithm will not halt
• What about convergent DAG structures?
=> Pages will be replicated in the collection
=> Inefficiently large index
=> Duplicate results annoy the user
Graph-Search Algorithms II (2)
SPIDER1 is Incomplete
• The web graph has k connected subgraphs.
• SPIDER1 only reaches pages in the connected web subgraph where the ROOT page lives.
Graph-Search Algorithms III
A Correct Spidering Algorithm
PROCEDURE SPIDER2(G)
    Let ROOT := any URL from G
    Initialize STACK <stack data structure>
    Let STACK := push(ROOT, STACK)
    Initialize COLLECTION <big file of URL-page pairs>
    While STACK is not empty,
|       Do URLcurr := pop(STACK)
|       Until URLcurr is not in COLLECTION
        PAGE := look-up(URLcurr)
        STORE(<URLcurr, PAGE>, COLLECTION)
        For every URLi in PAGE,
            push(URLi, STACK)
    Return COLLECTION
Graph-Search Algorithms IV
A More Efficient Correct Algorithm
PROCEDURE SPIDER3(G)
    Let ROOT := any URL from G
    Initialize STACK <stack data structure>
    Let STACK := push(ROOT, STACK)
    Initialize COLLECTION <big file of URL-page pairs>
|   Initialize VISITED <big hash-table>
    While STACK is not empty,
|       Do URLcurr := pop(STACK)
|       Until URLcurr is not in VISITED
|       insert-hash(URLcurr, VISITED)
        PAGE := look-up(URLcurr)
        STORE(<URLcurr, PAGE>, COLLECTION)
        For every URLi in PAGE,
            push(URLi, STACK)
    Return COLLECTION
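A Python sketch of the SPIDER3 idea (an interpretation, not the course's reference code): the VISITED hash table becomes a Python set, COLLECTION a dictionary, and look_up is an assumed callable that returns a page and its out-links.

    def spider3(root, look_up):
        """Single-root crawl; a hash table (Python set) of visited URLs
        replaces the membership test against the on-disk COLLECTION."""
        collection = {}                 # URL -> page      (the "big file")
        visited = set()                 # the "big hash-table" VISITED
        stack = [root]
        while stack:
            url = stack.pop()
            if url in visited:          # plays the role of Do ... Until
                continue
            visited.add(url)
            page, links = look_up(url)
            collection[url] = page
            for link in links:
                stack.append(link)
        return collection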
Graph-Search Algorithms V
A More Complete Correct Algorithm
PROCEDURE SPIDER4(G, {SEEDS})
|   Initialize COLLECTION <big file of URL-page pairs>
|   Initialize VISITED <big hash-table>
|   For every ROOT in SEEDS
|       Initialize STACK <stack data structure>
|       Let STACK := push(ROOT, STACK)
        While STACK is not empty,
            Do URLcurr := pop(STACK)
            Until URLcurr is not in VISITED
            insert-hash(URLcurr, VISITED)
            PAGE := look-up(URLcurr)
            STORE(<URLcurr, PAGE>, COLLECTION)
            For every URLi in PAGE,
                push(URLi, STACK)
    Return COLLECTION
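The same sketch extended with SPIDER4's loop over seeds; FAKE_WEB and the lambda look-up below are invented purely for the example.

    def spider4(seeds, look_up):
        """Multi-seed crawl with a visited table, in the spirit of SPIDER4."""
        collection = {}                 # URL -> page      (the "big file")
        visited = set()                 # the "big hash-table" VISITED
        for root in seeds:
            stack = [root]
            while stack:
                url = stack.pop()
                if url in visited:      # plays the role of Do ... Until
                    continue
                visited.add(url)
                page, links = look_up(url)
                collection[url] = page
                for link in links:
                    stack.append(link)
        return collection

    # A tiny two-component "web"; look-ups are simulated with a dictionary.
    FAKE_WEB = {
        "http://a": ("A", ["http://b"]),
        "http://b": ("B", ["http://a"]),
        "http://x": ("X", []),          # not reachable from http://a
    }
    pages = spider4(["http://a", "http://x"], lambda u: FAKE_WEB[u])
    print(sorted(pages))                # all three URLs, each stored once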
Graph-Search Algorithms VI
Completeness Observations (1)
Completeness is not guaranteed
• In a web graph G with k connected subgraphs, we do not know k
• Impossible to guarantee that each connected subgraph is sampled
• Better: more seeds, and more diverse seeds
Graph-Search Algorithms VI
Completeness Observations (2)
Search Engine Practice
• Wish to maximize subset of web indexed.
• Maintain a (secret) set of diverse seeds
(grow this set opportunistically, e.g. when someone complains that his/her page is not indexed).
• Register new web sites on demand
New registrations are seed candidates.
To Spider or not to Spider? (1)
User Perceptions
• Most annoying: Engine finds nothing (too small
an index, but not an issue since 1998 or so).
• Somewhat annoying: Obsolete links
=> Refresh Collection by deleting dead links
(OK if index is slightly smaller)
=> Done every 1-2 weeks in best engines
• Mildly annoying: Failure to find new site
=> Re-spider entire web
=> Done every 2-4 weeks in best engines
To Spider or not to Spider? (2)
Cost of Spidering
• Semi-parallel algorithmic decomposition
• The spider can (and does) run on hundreds of servers simultaneously
• Very high network connectivity (e.g. T3 line)
• Servers can migrate from spidering to query
processing depending on time-of-day load
• Running a full web spider takes days even with
hundreds of dedicated servers
Current Status of Web Spiders (1)
Historical Notes
• WebCrawler: first documented spider
• Lycos: first large-scale spider
• Top-honors for most web pages spidered:
First Lycos, then Alta Vista, then Google...
Current Status of Web Spiders (2)
Enhanced Spidering
• In-link counts to pages can be established
during spidering.
• Hint: In SPIDER4, store <URL, COUNT>
pair in VISITED hash table.
• In-link counts are the basis for Google's PageRank method
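One possible reading of the hint, sketched in Python: alongside the visited set of the spider4 sketch above, a dictionary accumulates an in-link count for every URL encountered during the crawl (kept here in a separate defaultdict rather than literally inside VISITED).

    from collections import defaultdict

    def spider4_with_inlinks(seeds, look_up):
        """As the spider4 sketch, plus an in-link counter updated while crawling."""
        collection = {}
        visited = set()
        inlink_count = defaultdict(int)      # the <URL, COUNT> pairs of the hint
        for root in seeds:
            stack = [root]
            while stack:
                url = stack.pop()
                if url in visited:
                    continue
                visited.add(url)
                page, links = look_up(url)
                collection[url] = page
                for link in links:
                    inlink_count[link] += 1  # one more page links to 'link'
                    stack.append(link)
        return collection, dict(inlink_count)

    # Usage mirrors the spider4 sketch above; counts reflect only links
    # actually seen during the crawl, not the whole web.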
Current Status of Web Spiders (3)
Unsolved Problems
• Most spidering re-traverses the stable part of the web graph
=> need on-demand re-spidering when changes occur
• Completeness or near-completeness is still a major
issue
• Cannot spider Java-triggered or local-DB-stored information