Transcript Lecture17KS

Introduction to Graphs
15-211
Fundamental Data Structures and
Algorithms
Klaus Sutner
March 18, 2004
Announcements
- HW 6 is out.
This is a bit harder than the previous ones,
so don't procrastinate. Don't.
- Reading: Chapter 14 in MAW.
- How about the roving eyeballs?
Application of Graphs:
Search Engines
Search Engines
Tools to find information on the chaotic WWW.
Machine-generated index of a large segment of the
web.
At the most basic level:
- Type in keyword,
- get list of all pages containing this keyword.
The Model
Think of the WWW as a (huge) digraph.
[Figure: a web page containing several <href …> hyperlinks; pages are the vertices, links are the directed edges.]
The Model
Problems:
Neither V nor E is known a priori; both have to be
computed.
Changes all the time (#@$!#@$# stale links).
Web pages are often not syntactically correct
HTML.
Search Engines
Did you know?
- The concept of a Web spider was
developed by Dr. Michael L. (Fuzzy) Mauldin
- Implemented in 1994 on the Web
- Went into the creation of Lycos
- Lycos propelled CMU into the top 5
most successful schools
Commercialization proceeds
- Tangible evidence: Newell-Simon Hall
Did you know?
- Vivisimo was developed here at CMU
- Developed by Prof. Raul Valdes-Perez in 2000
A look at
http://www.nytimes.com/2004/03/14/fashion/14GOOG.html
“to google” is now a verb
4,285,199,774 web pages
some 200 million searches a day
used by 60 million Americans in January (alas,
often for stuff like Janet Jackson ...)
The Process
1. Acquire the collection, i.e. all the pages.
2. Create an inverted index.
(Steps 1 and 2 are done off-line.)
3. Match queries to documents (retrieval).
4. Present the results to the user.
(Steps 3 and 4 are done on-line.)
SE Architecture
- Spider
  Crawls the web to find pages. Follows hyperlinks.
  Never stops.
- Indexer
  Produces data structures for fast searching of all
  words in the pages (i.e., it updates the lexicon).
- Retriever
  Query interface.
  Database lookup to find hits.
  Ranking.
  Scale: 1 billion documents; 1 TB RAM, many terabytes of disk.
Acquisition
Perform a BFS on the web graph.
Pick some arbitrary starting page.
Generate the “adjacency list” for each node on the
fly by parsing the web page.
At the same time, provide input for the indexer.
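The crawl described above can be sketched as a plain BFS. This is a minimal illustration, not the course's actual spider: the toy PAGES dictionary stands in for fetching and parsing real web pages, which is where the adjacency lists would be generated on the fly.

```python
from collections import deque

# Toy "web": page URL -> list of hyperlinked URLs. A real spider would
# build this adjacency list on the fly by fetching and parsing HTML.
PAGES = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html", "d.html"],
    "c.html": [],
    "d.html": ["a.html"],
}

def crawl(start):
    """BFS over the web graph; returns pages in the order they are acquired."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)                 # here we would also feed the indexer
        for link in PAGES.get(url, []):   # "parse" the page for hyperlinks
            if link not in seen:          # skip already-discovered pages
                seen.add(link)
                queue.append(link)
    return order

print(crawl("a.html"))  # ['a.html', 'b.html', 'c.html', 'd.html']
```

Note that the cycle d → a causes no trouble: the `seen` set guarantees each page is enqueued at most once.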
Indexing
- Arrangement of data (data structure) to
permit fast searching.
- Sorting helps.
Permits binary search: about log2 n probes into the list;
log2(1 billion) ~ 30.
Permits interpolation search: about log2(log2 n) probes;
log2 log2(1 billion) ~ 5.
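A quick sketch of the probe counts claimed above. The `binary_search` helper below is an illustrative implementation written for this note, not from the lecture; it counts the probes it makes.

```python
import math

def binary_search(sorted_list, key):
    """Standard binary search; returns (index or None, number of probes)."""
    lo, hi, probes = 0, len(sorted_list) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1                       # one probe into the list
        if sorted_list[mid] == key:
            return mid, probes
        if sorted_list[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return None, probes

lexicon = sorted(["a", "entry", "file", "list", "position", "word", "words"])
print(binary_search(lexicon, "position"))  # (4, 3)

# For a billion-entry lexicon, at most about log2(10**9) ~ 30 probes:
print(math.ceil(math.log2(10**9)))  # 30
```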
Inverted Files
A file is a list of words by position:
- First entry is the word in position 1 (first word)
- Entry 4562 is the word in position 4562 (4562nd word)
- Last entry is the last word
An inverted file is a list of positions by word!
a (1, 4, 40)
entry (11, 20, 31)
file (2, 38)
list (5, 41)
position (9, 16, 26)
positions (44)
word (14, 19, 24, 29, 35, 45)
words (7)
4562 (21, 27)
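Building an inverted file from a word list takes only a few lines. The `invert` helper below is a name chosen just for this sketch; it maps each word to its 1-based positions, exactly as in the listing above.

```python
from collections import defaultdict

def invert(words):
    """Map each word to the (1-based) positions where it occurs."""
    index = defaultdict(list)
    for pos, word in enumerate(words, start=1):
        index[word].append(pos)
    return dict(index)

doc = "a file is a list of words by position".split()
print(invert(doc))
# {'a': [1, 4], 'file': [2], 'is': [3], 'list': [5], 'of': [6],
#  'words': [7], 'by': [8], 'position': [9]}
```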
Inverted Files for Multiple Documents
[Figure: a LEXICON mapping each word (jezebel, jezer, jezerit, jeziah, jeziel,
jezliah, jezoar, jezrahliah, jezreel, ...) to NDOCS, the number of documents
containing it, and a PTR into a postings file; each postings entry lists a
DOCID, an occurrence count OCCUR, and the word positions POS 1, POS 2, ...;
a separate WORD INDEX locates each word's positions within a document.]
For example, "jezebel" occurs 6 times in document 34, 3 times in document 44,
4 times in document 56, and so on.
Ranking (Scoring) Hits
- Hits must be presented in some order.
- What order?
  Relevance, recency, popularity, reliability?
- Some ranking methods:
  Presence of keywords in title of document.
  Closeness of keywords to start of document.
  Frequency of keyword in document.
  Link popularity (how many pages point to this one;
  none other than the indegree of the node in the web
  graph).
Ranking (Scoring) Hits, cont’d
 Can the user control?
 Can the page owner control?
 Can you find out what order is used?
 Spamdexing: influencing retrieval ranking by
altering a web page. (Puts “spam” in the
index)
Single-Source Shortest Paths
Shortest Path
Given a digraph G, a non-negative cost cost(x,y) for
each edge and a source vertex s.
Hence we can define the distance from s to x for any
vertex x.
Problem: Compute the distances for all targets x.
This is the single-source version.
Brute Force
A path is simple if it contains any vertex at most
once.
Note that there are only finitely many simple paths
(even if there are cycles in the graph).
Enumerate all simple paths starting at s.
For each target vertex t, collect all simple paths with
target t.
Compute their cost, determine the min.
Bad Idea
Even in an acyclic graph, the number of simple paths
may be exponential in n.
Exercise: determine the number of paths from s to t.
[Figure: an acyclic graph with exponentially many simple paths from s to t.]
General Rules
We maintain an array dist[x]:
- initially dist[s] = 0, dist[x] = ∞ for all other
vertices
- at any time during the algorithm, dist[x] stores the cost
of a real path from s to x (but not
necessarily the cost of the shortest path; we may
have an overestimate).
- edge (x,y) requires attention if
dist[x] + cost(x,y) < dist[y]
Prototype Algorithm
When an edge (x,y) requires attention we relax it:
dist[y] = dist[x] + cost(x,y)
Thus we now have a better estimate for the shortest
path from s to y. This produces a prototype
algorithm:
initialize dist[];
while( some edge (x,y) requires attention )
    relax (x,y);
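A direct, if inefficient, rendering of the prototype: keep sweeping the edges and relax any that requires attention, stopping once none does. The function name and edge representation are chosen just for this sketch; note it would loop forever on a negative cost cycle.

```python
INF = float("inf")

def prototype_shortest_paths(vertices, edges, s):
    """edges: dict (x, y) -> cost. Relax edges until none requires attention."""
    dist = {v: INF for v in vertices}
    dist[s] = 0
    changed = True
    while changed:                        # repeat until no edge needs attention
        changed = False
        for (x, y), cost in edges.items():
            if dist[x] + cost < dist[y]:  # edge (x,y) requires attention
                dist[y] = dist[x] + cost  # relax it
                changed = True
    return dist

edges = {("s", "a"): 5, ("s", "b"): 2, ("b", "a"): 1, ("a", "c"): 3}
print(prototype_shortest_paths("sabc", edges, "s"))
# {'s': 0, 'a': 3, 'b': 2, 'c': 6}
```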
Correctness
Claim: Upon completion of the algorithm dist[x] is
the correct distance from s to x, for all x.
Proof:
Suppose otherwise, pick x such that the path from s
to x has minimal length (number of edges, not
weights). Then there is some vertex y such that
(y,x) is an edge, dist[y] is correct and dist[y] +
cost(y,x) < dist[x].
But then (y,x) requires attention, contradiction.
Termination
Claim: The algorithm always terminates.
Proof:
Suppose otherwise. Then there is one edge (x,y)
that is relaxed infinitely often.
But then there must be infinitely many simple paths
from s to y, contradiction.
Make sure you understand why the paths must be
simple.
Dijkstra's Algorithm
The problem is to choose the right edge to be
relaxed.
Dijkstra's algorithm always picks an edge (x,y)
such that dist[x] is minimal, but works on each x
only once.
This sounds like a recipe for disaster: how do you
know that there are no shortcuts that will be
discovered later?
Dijkstra's Algorithm
initialize dist[];
insert all of V into Q;  // PQ, priorities: dist
while( Q not empty )
    x = deleteMin( Q );
    forall (x,y) in E do
        if( (x,y) requires attention )
            relax (x,y);
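Here is a runnable version of the pseudocode using a binary heap. It follows the common lazy-deletion idiom (stale queue entries are skipped on pop) instead of an explicit decreaseKey; the sample graph is made up for illustration and is not the one in the slides.

```python
import heapq

INF = float("inf")

def dijkstra(adj, s):
    """adj: vertex -> list of (neighbor, cost); all costs non-negative."""
    dist = {v: INF for v in adj}
    dist[s] = 0
    pq = [(0, s)]                    # priority queue keyed on dist
    visited = set()
    while pq:
        d, x = heapq.heappop(pq)     # deleteMin
        if x in visited:             # stale entry: x was already finalized
            continue
        visited.add(x)               # each vertex is visited exactly once
        for y, cost in adj[x]:
            if d + cost < dist[y]:   # edge (x,y) requires attention
                dist[y] = d + cost   # relax it
                heapq.heappush(pq, (dist[y], y))
    return dist

adj = {
    "s": [("b", 2), ("c", 4), ("a", 5)],
    "a": [("c", 2)],
    "b": [("d", 1), ("e", 4)],
    "c": [("f", 2)],
    "d": [("e", 3)],
    "e": [],
    "f": [],
    "g": [],                         # unreachable, stays at infinity
}
print(dijkstra(adj, "s"))
# {'s': 0, 'a': 5, 'b': 2, 'c': 4, 'd': 3, 'e': 6, 'f': 6, 'g': inf}
```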
Dijkstra's algorithm: example
[Figure: digraph on vertices s, a, b, c, d, e, f, g with non-negative edge
weights; g is not reachable from s.]
Trace, showing the visited set (with final distances D) and the priority
queue contents after each deleteMin:
Visited: (none)
Queue:   s a b c d e f g
Dist:    0 ∞ ∞ ∞ ∞ ∞ ∞ ∞
Visited: s (D = 0)
Queue:   b c a d e f g
Dist:    2 4 5 ∞ ∞ ∞ ∞
Visited: s (0), b (2)
Queue:   d c a e f g
Dist:    3 4 5 6 ∞ ∞
Visited: s (0), b (2), d (3)
Queue:   c a e f g
Dist:    4 4 6 ∞ ∞
Visited: s (0), b (2), d (3), c (4)
Queue:   a e f g
Dist:    4 6 6 ∞
Visited: s (0), b (2), d (3), c (4), a (4)
Queue:   e f g
Dist:    6 6 ∞
...
Final:   s (0), b (2), d (3), c (4), a (4), e (6), f (6), g (∞)
Features of Dijkstra’s Algorithm
• A greedy algorithm
• “Visits” every vertex only once, when it
becomes the vertex with minimal distance
amongst those still in the priority queue
• Distances may be revised multiple
times: current values represent ‘best
guess’ based on our observations so far
• Once a vertex is visited we are
guaranteed to have found the shortest
path to that vertex…. why?
Correctness (induction)
[Figure: the vertices partitioned into three regions: reached (containing s),
fringe (containing x), and unreached (containing u).]
Classify vertices into three groups:
- reached
- fringe: reached plus one edge
- unreached
Loop invariant:
- distance correct for reached vertices.
- distance for fringe vertices: correct if restricted to
paths in the reached part plus one edge.
Performance (using a heap)
Initialization: O(n)
Visitation loop: n iterations
- deleteMin(): O(log n)
- Each edge is considered only once during the
entire execution, for a total of e updates
of the priority queue, each O(log n)
Overall cost:
O( (n+e) log n )
More
Fact:
Heap is used unevenly: n delete-mins but e
promotes.
Can be exploited by using a better data
structure (Fibonacci heap) to get running
time O(n log n + e).
Incidentally:
How does one find the actual shortest path?
Negative weights?
Dijkstra's greedy algorithm can only guarantee
shortest paths for non-negative weights.
[Figure: the same graph, but now with a negative edge of weight -3; after s is
visited, the queue is b c a d e f g with dist 2 4 5 ∞ ∞ ∞ ∞.]
Visiting b now incorrectly produces a path of
distance 2: the cheaper route to b via the
negative edge has not been discovered yet.
Bellman-Ford Algorithm
Detects negative cost cycles (NCCs), finds shortest paths if
there are none.
do n - 1 times
    forall (x,y) in E do
        if( (x,y) requires attention )
            relax (x,y);
Claim: Upon completion, there is an NCC iff some
edge still requires attention. If not, the distances
are correct.
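A compact runnable rendering of the algorithm, including the final negative-cost-cycle check from the claim above. The edge list and helper name are illustrative, not the slides' example graph.

```python
INF = float("inf")

def bellman_ford(vertices, edges, s):
    """edges: list of (x, y, cost). Returns (dist, has_negative_cycle)."""
    dist = {v: INF for v in vertices}
    dist[s] = 0
    for _ in range(len(vertices) - 1):           # n - 1 passes
        for x, y, cost in edges:
            if dist[x] + cost < dist[y]:         # (x,y) requires attention
                dist[y] = dist[x] + cost         # relax it
    # One final pass: any edge still requiring attention implies an NCC.
    ncc = any(dist[x] + cost < dist[y] for x, y, cost in edges)
    return dist, ncc

edges = [("s", "a", 4), ("a", "b", -3), ("b", "c", 2), ("s", "c", 5)]
dist, ncc = bellman_ford("sabc", edges, "s")
print(dist, ncc)  # {'s': 0, 'a': 4, 'b': 1, 'c': 3} False
```

Feeding it a graph with a reachable negative cycle, e.g. edges a → b of cost -3 and b → a of cost 1, makes the final check return True instead.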
Bellman-Ford path updates
Assume edges are examined in lexicographic order,
i.e., (b,d), (b,e), (c,b), (c,f), (d,a), (f,e), (s,a),
(s,b), (s,c).
[Figure: digraph on s, a, b, c, d, e, f, g with edge weights including
negative weights -2 and -3; g is not reachable from s.]
Distance table after successive relaxations:
              s a b c d e f g
Iteration 1:  0 5 2 4 ∞ ∞ ∞ ∞
Iteration 2:  0 5 2 4 3 ∞ ∞ ∞
              0 5 2 4 3 6 ∞ ∞
              0 5 1 4 3 6 ∞ ∞
              0 5 1 4 3 6 6 ∞
              0 4 1 4 3 6 6 ∞
              0 4 1 4 3 4 6 ∞
etcetera...
Bellman-Ford cycle check
After Iteration 7:
[Figure: same graph.]
s a b c d e f g
0 3 1 4 2 4 6 ∞
- Performs one final iteration over all edges.
- If any distance estimate changes at this point, a
negative cost cycle exists.
For this graph, the algorithm returns TRUE.
Key features
• If the graph contains no negative-weight
cycles reachable from the source vertex,
after n - 1 iterations all distance estimates
represent shortest paths…why?
• We assumed edges were considered in the
same order for each iteration. Would the
algorithm still work if we changed the order
for every iteration?
Correctness
Case 1: Graph G = (V,E) does not contain any negative-weight
cycles reachable from the source vertex s.
Consider a shortest path p = v_0, v_1, ..., v_k = x, which
must have k ≤ n - 1 edges.
By induction on i ≤ k:
- dist[s] = 0 after initialization.
- Assume dist[v_{i-1}] is the shortest-path cost after iteration i - 1.
- Since edge (v_{i-1}, v_i) is examined on the i-th pass, dist[v_i]
must then reflect the shortest path to v_i.
- Since we perform n - 1 iterations, all distances are
correct.
Correctness
Case 2:
If there is a negative cost cycle (more precisely, an NCC
reachable from s) then it is easy to see that some edge
on the cycle will always require attention.
Performance
Initialization: O(n)
Path update and cycle check:
n passes, each examining all e edges: O(ne)
Overall cost:
O(ne) (which is cubic in n for dense graphs)