Artificial Intelligence 15-381 Heuristic Search Methods


Artificial Intelligence 15-381
Web Spidering & HW1 Preparation
Jaime Carbonell
[email protected]
22 January 2002
Today's Agenda
Finish A*, B*, Macrooperators
Web Spidering as Search
How to acquire the web in a box
Graph theory essentials
Algorithms for web spidering
Some practical issues
Search Engines on the Web
Revising the Total IR Scheme
1. Acquire the collection, i.e. all the documents
[Off-line process]
2. Create an inverted index (IR lecture, later)
[Off-line process]
3. Match queries to documents (IR lecture)
[On-line process, the actual retrieval]
4. Present the results to user
[On-line process: display, summarize, ...]
Acquiring a Document Collection
Document Collections and Sources
Fixed, pre-existing document collection
e.g., the classical philosophy works
Pre-existing collection with periodic updates
e.g., the MEDLINE biomedical collection
Streaming data with temporal decay
e.g., the Wall-Street financial news feed
Distributed proprietary document collections
Distributed, linked, publicly-accessible documents
e.g., the Web
Properties of Graphs I (1)
Definitions:
Graph
a set of nodes n and a set of edges (binary
links) v between the nodes.
Directed graph
a graph where every edge has a prespecified direction.
Properties of Graphs I (2)
Connected graph
a graph where for every pair of nodes
there exists a sequence of edges starting
at one node and ending at the other.
The web graph
the directed graph where n = {all web
pages} and v = {all HTML-defined links
from one web page to another}.
Properties of Graphs I (3)
Tree
a connected graph without any loops and
with a unique path between any two
nodes
Spanning tree of graph G
a tree constructed by including all n in G,
and a subset of v such that G remains
connected, but all loops are eliminated.
Properties of Graphs I (4)
Forest
a set of trees (without inter-tree links)
k-Spanning forest
Given a graph G with k connected
subgraphs, the set of k trees each of
which spans a different connected
subgraph.
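To make these definitions concrete, here is a minimal Python sketch (added here, not part of the original slides) of a directed graph as an adjacency list, with a small reachability check; the node names are invented for illustration.

# Hypothetical illustration of the definitions above: a directed graph
# stored as an adjacency list (node -> list of nodes it links to).
edges = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],   # this edge closes a loop A -> B -> C -> A
    "D": ["E"],   # D and E form a second connected subgraph
    "E": [],
}

def reachable(graph, start):
    """Return the set of nodes reachable from `start` by following edges."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nbr in graph.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                stack.append(nbr)
    return seen

print(reachable(edges, "A"))   # {'A', 'B', 'C'} -- one connected subgraph
print(reachable(edges, "D"))   # {'D', 'E'}      -- the other, so k = 2 here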
Graph G = <n, v>
Directed Graph Example [figure]
Tree [figure]
Web Graph [figure: web pages are nodes; HTML references (<href …>) are the links]
More Properties of Graphs
Theorem 1: For every connected graph G,
there exists a spanning tree.
Proof: Depth-first search starting at any node
in G builds a spanning tree.
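As an illustrative sketch of the proof idea (assumed code, not from the slides), a depth-first search that records the edge by which each node is first discovered yields exactly such a spanning tree:

def spanning_tree(graph, root):
    """Depth-first search from `root`; the edge used to first discover each
    node becomes a tree edge. For a connected graph this visits every node
    and returns |n| - 1 edges with no loops, i.e. a spanning tree."""
    visited = {root}
    tree_edges = []

    def dfs(node):
        for nbr in graph.get(node, []):
            if nbr not in visited:
                visited.add(nbr)
                tree_edges.append((node, nbr))
                dfs(nbr)

    dfs(root)
    return tree_edges

# With the example adjacency list above: spanning_tree(edges, "A")
# returns [('A', 'B'), ('B', 'C')] -- 2 edges spanning the 3-node subgraph.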
Properties of Graphs
Theorem 2: For every G with k disjoint
connected subgraphs, there exists a k-spanning forest.
Proof: Each connected subgraph has a
spanning tree (Theorem 1), and the set of k
spanning trees (being disjoint) defines a k-spanning forest.
Properties of Web Graphs
Additional Observations
The web graph at any instant of time contains
k-connected subgraphs (but we do not know
the value of k, nor do we know a-priori the
structure of the web-graph).
If we knew every connected web subgraph, we
could build a k-web-spanning forest, but this is
a very big "IF."
Graph-Search Algorithms I
PROCEDURE SPIDER1(G)
Let ROOT := any URL from G
Initialize STACK <stack data structure>
Let STACK := push(ROOT, STACK)
Initialize COLLECTION <big file of URL-page pairs>
While STACK is not empty,
URLcurr := pop(STACK)
PAGE := look-up(URLcurr)
STORE(<URLcurr, PAGE>, COLLECTION)
For every URLi in PAGE,
push(URLi, STACK)
Return COLLECTION
What is wrong with the above algorithm?
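For concreteness, here is a rough Python transliteration of SPIDER1 (an assumption about how the pseudocode could be realized; look_up and extract_urls are placeholders for the HTTP fetch and link extraction that the slides leave abstract):

def spider1(root, look_up, extract_urls):
    """Direct rendering of SPIDER1: depth-first traversal with no check
    for pages already seen. `look_up(url)` fetches a page and
    `extract_urls(page)` returns the URLs it links to (both assumed)."""
    stack = [root]
    collection = []                     # the URL-page pairs
    while stack:
        url = stack.pop()
        page = look_up(url)
        collection.append((url, page))  # nothing prevents storing a page twice
        for link in extract_urls(page):
            stack.append(link)          # already-collected links are re-pushed
    return collection

Tracing this on a web graph that contains a loop exposes the problem diagnosed on the next slides.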
Depth-first Search
[figure: numbers indicate the order in which nodes are visited]
Graph-Search Algorithms II (1)
SPIDER1 is Incorrect
What about loops in the web graph?
=> Algorithm will not halt
What about convergent DAG structures?
=> Pages will be replicated in the collection
=> Inefficiently large index
=> Duplicates to annoy user
Graph-Search Algorithms II (2)
SPIDER1 is Incomplete
Web graph has k-connected subgraphs.
SPIDER1 only reaches pages in the
connected web subgraph where the ROOT
page lives.
A Correct Spidering Algorithm
PROCEDURE SPIDER2(G)
Let ROOT := any URL from G
Initialize STACK <stack data structure>
Let STACK := push(ROOT, STACK)
Initialize COLLECTION <big file of URL-page pairs>
While STACK is not empty,
| Do URLcurr := pop(STACK)
| Until URLcurr is not in COLLECTION
PAGE := look-up(URLcurr)
STORE(<URLcurr, PAGE>, COLLECTION)
For every URLi in PAGE,
push(URLi, STACK)
Return COLLECTION
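A Python sketch of SPIDER2 under the same assumptions (look_up and extract_urls are still placeholders; the Do/Until popping loop is rendered as skip-and-continue so an empty stack is never popped):

def spider2(root, look_up, extract_urls):
    """Correct: a URL already stored in COLLECTION is never fetched again.
    But the membership test scans the whole collection on every pop."""
    stack = [root]
    collection = []                               # (url, page) pairs
    while stack:
        url = stack.pop()
        if any(u == url for u, _ in collection):  # linear scan -- the costly part
            continue
        page = look_up(url)
        collection.append((url, page))
        for link in extract_urls(page):
            stack.append(link)
    return collection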
A More Efficient Correct Algorithm
PROCEDURE SPIDER3(G)
Let ROOT := any URL from G
Initialize STACK <stack data structure>
Let STACK := push(ROOT, STACK)
Initialize COLLECTION <big file of URL-page pairs>
| Initialize VISITED <big hash-table>
While STACK is not empty,
| Do URLcurr := pop(STACK)
| Until URLcurr is not in VISITED
| insert-hash(URLcurr, VISITED)
PAGE := look-up(URLcurr)
STORE(<URLcurr, PAGE>, COLLECTION)
For every URLi in PAGE,
push(URLi, STACK)
Return COLLECTION
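The same sketch with the VISITED hash table of SPIDER3, here a Python set, so the duplicate check becomes a constant-time lookup instead of a scan of COLLECTION (assumed rendering, same placeholder helpers):

def spider3(root, look_up, extract_urls):
    """SPIDER2 plus a hash-based VISITED structure: membership tests are
    O(1), and the big collection file is only ever appended to."""
    stack = [root]
    collection = []        # (url, page) pairs
    visited = set()        # plays the role of the VISITED hash table
    while stack:
        url = stack.pop()
        if url in visited:
            continue       # already fetched, skip it
        visited.add(url)
        page = look_up(url)
        collection.append((url, page))
        for link in extract_urls(page):
            stack.append(link)
    return collection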
Graph-Search Algorithms V
A More Complete Correct Algorithm
PROCEDURE SPIDER4(G, {SEEDS})
| Initialize COLLECTION <big file of URL-page pairs>
| Initialize VISITED <big hash-table>
| For every ROOT in SEEDS
| Initialize STACK <stack data structure>
| Let STACK := push(ROOT, STACK)
While STACK is not empty,
Do URLcurr := pop(STACK)
Until URLcurr is not in VISITED
insert-hash(URLcurr, VISITED)
PAGE := look-up(URLcurr)
STORE(<URLcurr, PAGE>, COLLECTION)
For every URLi in PAGE,
push(URLi, STACK)
Return COLLECTION
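A sketch of SPIDER4 under the same assumptions: the SPIDER3 loop is restarted from every seed while COLLECTION and VISITED are shared, so each connected subgraph containing at least one seed is spidered exactly once.

def spider4(seeds, look_up, extract_urls):
    """Run the SPIDER3 traversal once per seed URL, sharing the collection
    and the visited set, to reach every subgraph that contains a seed."""
    collection = []
    visited = set()
    for root in seeds:
        stack = [root]
        while stack:
            url = stack.pop()
            if url in visited:
                continue
            visited.add(url)
            page = look_up(url)
            collection.append((url, page))
            for link in extract_urls(page):
                stack.append(link)
    return collection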
Completeness Observations
Completeness is not guaranteed
In k-connected web G, we do not know k
Impossible to guarantee each connected
subgraph is sampled
Better: more seeds, more diverse seeds
Completeness Observations
Search Engine Practice
Wish to maximize subset of web indexed.
Maintain (secret) set of diverse seeds
(grow this set opportunistically, e.g. when
user X complains that his/her page is not indexed).
Register new web sites on demand
New registrations are seed candidates.
To Spider or not to Spider? (1)
User Perceptions
Most annoying: Engine finds nothing
(too small an index, but not an issue since 1997
or so).
Somewhat annoying: Obsolete links
=> Refresh Collection by deleting dead links
(OK if index is slightly smaller)
=> Done every 1-2 weeks in best engines
Mildly annoying: Failure to find new site
=> Re-spider entire web
=> Done every 2-4 weeks in best engines
To Spider or not to Spider? (2)
Cost of Spidering
Semi-parallel algorithmic decomposition
Spider can (and does) run on hundreds of servers
simultaneously
Very high network connectivity (e.g. T3 line)
Servers can migrate from spidering to query
processing depending on time-of-day load
Running a full web spider takes days even with
hundreds of dedicated servers
Current Status of Web Spiders
Historical Notes
WebCrawler: first documented spider
Lycos: first large-scale spider
Top-honors for most web pages spidered:
First Lycos, then Alta Vista, then
Google...
Current Status of Web Spiders
Enhanced Spidering
In-link counts to pages can be established
during spidering.
Hint: In SPIDER4, store <URL,
COUNT> pair in VISITED hash table.
In-link counts are the basis for
GOOGLE’s page-rank method
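One way the hint might be realized (an assumed sketch, splitting the slides' single VISITED table into a visited set plus a separate count table for clarity): increment a counter for a URL every time some fetched page links to it.

from collections import defaultdict

def spider_with_inlinks(seeds, look_up, extract_urls):
    """SPIDER4 variant that also tallies in-links: every time a fetched page
    links to some URL, that URL's counter is incremented. The counts are the
    raw material for link-based ranking such as PageRank."""
    collection = []
    visited = set()
    inlink_count = defaultdict(int)        # URL -> number of links pointing at it
    for root in seeds:
        stack = [root]
        while stack:
            url = stack.pop()
            if url in visited:
                continue
            visited.add(url)
            page = look_up(url)
            collection.append((url, page))
            for link in extract_urls(page):
                inlink_count[link] += 1    # one more page points at `link`
                stack.append(link)
    return collection, inlink_count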
Current Status of Web Spiders
Unsolved Problems
Most spidering re-traverses stable web
graph
=> on-demand re-spidering when
changes occur
Completeness or near-completeness is
still a major issue
Cannot spider JAVA-triggered or local-DB-stored information