Transcript Lecture16AN

Introduction to Graphs
15-211
Fundamental Data Structures and
Algorithms
Aleks Nanevski
March 16, 2004
1
Announcements
 Homework Assignment #6 out today
 Building a Web Search Engine
Web Crawler (due March 25)
Web Reader (due March 25)
Web Search (due April 5)
 Start early and ask questions
2
Graphs — an overview
Vertices (aka nodes)
BOS
SFO
DTW
PIT
JFK
LAX
Undirected Edges
Weights
[Figure: airport graph whose undirected edges carry the weights 618, 2273, 211, 190, 1987, 344, 318, 2145, 2462]
7
Examples of Graphs
Vanguard Airlines
Route Map
 Questions
 What is the fastest way to get from Pittsburgh
to St Louis?
 What is the cheapest way to get from
Pittsburgh to St Louis?
8
Web Graph
[Figure: web pages pointing to one another through <href …> references]
Web Pages are nodes (vertices)
HTML references are links (edges)
9
Graphs as models
 Physical objects are often modeled
by meshes, which are a particular
kind of graph structure.
By Jonathan Shewchuk
10
Structure of the Internet
[Figure: Internet structure diagram with the labels NAP (four of them), Backbone 1, Backbone 2, Backbone 3, Backbone 4, 5, N, Regional A, Regional B, Europe, Japan, Australia]
SOURCE: CISCO SYSTEMS
11
Relationship graphs
 Graphs are also used to model
relationships among entities.
Scheduling and resource constraints.
Inheritance hierarchies.
[Figure: relationship graph over the courses 15-113, 15-151, 15-211, 15-212, 15-213, 15-251, 15-312, 15-411, 15-412, 15-451, 15-462]
12
Terminology
 Graph G = (V,E)
 Set V of vertices (nodes)
 Set E of edges
 Elements of E are pairs (v,w) where v,w ∈ V.
 An edge (v,v) is a self-loop. (Usually assume no self-loops.)
 Weighted graph
 Elements of E are (v,w,x) where x is a weight,
i.e., a cost associated with the edge.
13
Terminology, cont’d
 Directed graph (digraph)
 The edge pairs are ordered
 Every edge has a specified direction
 Undirected graph
 The edge pairs are unordered
 E is a symmetric relation
 (v,w) ∈ E implies (w,v) ∈ E
 In an undirected graph (v,w) and (w,v) are usually
treated as though they are the same edge
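The terminology above can be sketched with plain Python sets. This is only an illustration: the edges, the weights, and the helper name `make_undirected` are made up here and are not read off the lecture figures.

```python
# V is a set of vertices; the edges below are hypothetical examples.
V = {"BOS", "JFK", "PIT", "SFO"}

# Directed graph: each edge is an ordered pair (v, w).
E_directed = {("PIT", "JFK"), ("JFK", "BOS"), ("SFO", "PIT")}

# Weighted graph: each edge is a triple (v, w, x), where x is the weight.
E_weighted = {("PIT", "JFK", 3), ("JFK", "BOS", 2)}


def make_undirected(edges):
    """Return the symmetric closure: (v, w) in E implies (w, v) in E."""
    return edges | {(w, v) for (v, w) in edges}


E_undirected = make_undirected(E_directed)

# Every endpoint must be a vertex, and we assume there are no self-loops.
assert all(v in V and w in V and v != w for (v, w) in E_undirected)
print(len(E_directed), len(E_undirected))  # 3 6
```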
14
Directed Graph (digraph)
Vertices (aka nodes)
BOS
SFO
DTW
PIT
JFK
LAX
Edges
15
Undirected Graph
Vertices (aka nodes)
BOS
SFO
DTW
PIT
JFK
LAX
Edges
16
Paths
 A path is a sequence of edges from one node
to another
 The length of a path is the number of edges in
it.
BOS
DTW
SFO
PIT
JFK
LAX
Question: What are paths of length 2 in the
above graph?
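One way to answer a question like this mechanically is to join the edge set with itself. The sketch below assumes a directed graph stored as a set of pairs; the edges are hypothetical, since the figure's edges are not part of this transcript.

```python
def paths_of_length_2(edges):
    """Return every vertex sequence (u, v, w) where (u, v) and (v, w) are edges."""
    return sorted((u, v, w) for (u, v) in edges for (x, w) in edges if v == x)


# Hypothetical directed edges between the airports named on the slide.
edges = {("BOS", "JFK"), ("JFK", "PIT"), ("PIT", "SFO"), ("JFK", "LAX")}

for path in paths_of_length_2(edges):
    print(" -> ".join(path))
```

For an undirected graph, the same function can be applied after adding (w, v) for every (v, w).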
17
Paths
A simple path is a path where no
vertex is repeated
 first and last vertices can be the same
SFO
BOS
DTW
PIT
JFK
LAX
Question: an example of a simple path? A non-simple
path?
18
Connected graphs
Connected graph
An undirected graph is connected if for all u, v in V, there exists a
path from u to v.
A directed graph is strongly connected if for all
u, v in V, there exists a directed path from u to v.
A directed graph is weakly connected if its underlying undirected
graph (edge directions ignored) is connected.
Complete Graph
A graph with all nodes connected to each other directly
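Connectivity can be checked with a breadth-first search. The sketch below is one possible approach, not necessarily the one used in the course; the function names are assumptions, and weak connectivity is taken in the usual sense of "connected once edge directions are ignored".

```python
from collections import deque


def reachable(adj, start):
    """Breadth-first search: return the set of vertices reachable from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in adj.get(v, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen


def is_strongly_connected(vertices, edges):
    """True if every vertex reaches every other vertex along directed edges."""
    adj = {}
    for v, w in edges:
        adj.setdefault(v, []).append(w)
    vertices = set(vertices)
    return all(reachable(adj, v) == vertices for v in vertices)


def is_weakly_connected(vertices, edges):
    """True if the graph is connected once edge directions are ignored."""
    adj = {}
    for v, w in edges:
        adj.setdefault(v, []).append(w)
        adj.setdefault(w, []).append(v)
    vertices = set(vertices)
    return not vertices or reachable(adj, next(iter(vertices))) == vertices
```

Running one search per vertex is simple but quadratic in the worst case; it is meant to mirror the definition, not to be fast.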
19
Cycles
A cycle (in a directed graph) is a path that
begins and ends in the same vertex.
A directed acyclic graph (DAG) is a directed
graph having no cycles.
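Whether a directed graph is a DAG can be tested with a depth-first search that looks for an edge back into the path currently being explored. This is a standard technique offered as a sketch; the name `has_cycle` is an assumption, and the recursion makes it suitable for small graphs only.

```python
def has_cycle(vertices, edges):
    """Return True if the directed graph contains a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    adj = {v: [] for v in vertices}
    for v, w in edges:
        adj[v].append(w)
    color = {v: WHITE for v in vertices}

    def visit(v):
        color[v] = GRAY
        for w in adj[v]:
            if color[w] == GRAY:  # edge back into the current path
                return True
            if color[w] == WHITE and visit(w):
                return True
        color[v] = BLACK
        return False

    return any(color[v] == WHITE and visit(v) for v in vertices)


print(has_cycle({1, 2, 3}, {(1, 2), (2, 3)}))          # False, so a DAG
print(has_cycle({1, 2, 3}, {(1, 2), (2, 3), (3, 1)}))  # True
```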
20
So, is this a connected graph?
Cyclic or
Acyclic?
Directed or
Undirected?
21
Directed graph (unconnected)
Cyclic or
Acyclic?
22
Tree is a Graph
[Figure: two drawings, each over the vertices A, B, C, D, E, F, G]
 A connected graph with no cycles is called a tree.
 This is a general definition of a tree
23
Edges and Degree
 Degree
number of edges incident to a vertex
in a directed graph
 in-degree(v) - number of edges into vertex
v
 out-degree(v) - number of edges from
vertex v
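In-degree and out-degree can be computed in one pass over the edges. A small sketch, assuming a directed graph given as a list of (v, w) pairs; the edges are hypothetical.

```python
from collections import Counter


def degrees(edges):
    """Return (in_degree, out_degree) counters for a list of directed edges."""
    out_degree = Counter(v for v, _ in edges)
    in_degree = Counter(w for _, w in edges)
    return in_degree, out_degree


edges = [("PIT", "JFK"), ("PIT", "BOS"), ("JFK", "BOS")]
in_degree, out_degree = degrees(edges)
print(in_degree["BOS"], out_degree["PIT"], in_degree["PIT"])  # 2 2 0
```

A `Counter` returns 0 for vertices it has never seen, so vertices without incoming or outgoing edges need no special handling.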
24
Dense Graphs
 For most graphs |E| ≤ |V|^2
 A graph is dense when most edges
are present
|E| = Θ(|V|^2)
large number of edges (quadratic)
Also |E| > |V| log |V| can be considered
dense
25
Sparse Graphs
 A sparse graph is a graph with
relatively few edges
no clear definition
metric for sparsity |E| < |V| log |V|
Examples of sparse graphs
 computer network
26
Representing Graphs
Graph operations
 Navigation in a graph.
 (v,w) ∈ E
 R_v = {w | (v,w) ∈ E}
 D_w = {v | (v,w) ∈ E}
 Enumeration of the elements of a graph.
E
V
 Size of a graph
 |V|
 |E|
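The operations listed above can be written directly against an edge set. This is a sketch with assumed function names (`successors` for R_v, `predecessors` for D_w) and hypothetical edges.

```python
V = {1, 2, 3, 4}
E = {(1, 2), (1, 3), (2, 3), (4, 3)}


def successors(E, v):
    """R_v: all w such that (v, w) is an edge."""
    return {w for (x, w) in E if x == v}


def predecessors(E, w):
    """D_w: all v such that (v, w) is an edge."""
    return {v for (v, y) in E if y == w}


print((1, 2) in E)         # navigation: is (v, w) an edge? True
print(successors(E, 1))    # {2, 3}
print(predecessors(E, 3))  # {1, 2, 4}
print(len(V), len(E))      # size of the graph: 4 4
```

Each call scans all of E, which is one reason to look for a representation that answers these queries faster.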
28
Representing graphs
 Adjacency matrix
 2-dimensional array
 For each edge (u,v), set A[u][v] to
true; otherwise false
[Figure: 7 x 7 adjacency matrix over vertices 1 to 7, with an x in each cell that corresponds to an edge]
 Adjacency lists
 For each vertex, keep a list of
adjacent vertices
[Figure: adjacency lists for vertices 1 to 7]
 Q: How to represent weights?
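Both representations, including one possible answer to the weights question, can be sketched as follows. The assumptions are loud ones: vertices are numbered 0 to n-1 (the slide figures use 1 to 7), the edges are made up, and storing the weight in the matrix cell or next to the neighbor is just one common choice.

```python
def adjacency_matrix(n, weighted_edges, no_edge=None):
    """n x n matrix; A[u][v] holds the weight of edge (u, v), else no_edge."""
    A = [[no_edge] * n for _ in range(n)]
    for u, v, x in weighted_edges:
        A[u][v] = x
    return A


def adjacency_lists(n, weighted_edges):
    """adj[u] is a list of (neighbor, weight) pairs for each vertex u."""
    adj = [[] for _ in range(n)]
    for u, v, x in weighted_edges:
        adj[u].append((v, x))
    return adj


# Hypothetical directed, weighted edges (u, v, weight).
edges = [(0, 1, 5), (0, 2, 7), (2, 1, 1)]
A = adjacency_matrix(3, edges)
adj = adjacency_lists(3, edges)
print(A[0][2], A[1][0])  # 7 None
print(adj[0])            # [(1, 5), (2, 7)]
```

For an unweighted graph the cells would simply hold true or false, as on the slide.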
29
Choosing a representation
 Size of V relative to size of E is a primary
factor.
 Dense: E/V is large
 Sparse: E/V is small
 Adjacency matrix is expensive if the graph is
sparse. WHY?
 Adjacency list is expensive if the graph is
dense. WHY?
 Dynamic changes to V.
 Adjacency matrix is expensive to copy/extend
if V is extended. WHY?
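A rough storage count gives a feel for the trade-off. The sketch below counts cells and list entries only, ignoring per-entry overhead, and the function names are assumptions.

```python
def matrix_cells(num_vertices):
    """An adjacency matrix always stores |V| * |V| cells."""
    return num_vertices * num_vertices


def list_entries(num_vertices, num_edges):
    """Adjacency lists store one list per vertex plus one entry per edge."""
    return num_vertices + num_edges


# A sparse example: one million vertices with about ten edges each.
print(matrix_cells(1_000_000))              # 1000000000000
print(list_entries(1_000_000, 10_000_000))  # 11000000
```

In the dense case the comparison flips in practice, because a matrix cell can be a single bit while each list entry needs a vertex id and usually a link.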
30
Graphs : Application
Search Engines
Search Engines
32
What are they?
 Tools for finding information on the Web
 Problem: “hidden” databases, e.g. New York
Times (i.e., databases of keywords hosted by
the web site itself. These cannot be accessed
by Yahoo, Google, etc.)
 Search engine
 A machine-constructed index (usually by
keyword)
 There are so many search engines that we need search
engines to find them, e.g. Searchenginecollosus.com
33
What are they?
 Search engines: key tools for ecommerce
 Buyers and sellers must find each other
 How do they work?
 How much do they index?
 Are they reliable?
 How are hits ordered?
 Can the order be changed?
34
The Process
1. Acquire the collection, i.e. all the documents
[Off-line process]
2. Create an inverted index
[Off-line process]
3. Match queries to documents
[On-line process, the actual retrieval]
4. Present the results to user
[On-line process: display, summarize, ...]
35
SE Architecture
 Spider
 Crawls the web to find pages. Follows hyperlinks.
Never stops
 Indexer
 Produces data structures for fast searching of all
words in the pages (i.e., it updates the lexicon)
 Retriever
 Query interface
 Database lookup to find hits
 Billions of documents
 1 TB RAM, many terabytes of disk
 Ranking
36
Did you know?
 The concept of a Web spider was
developed by Dr. Michael L. (Fuzzy) Mauldin
 Implemented in 1994 on the Web
 Went into the creation of Lycos
 Lycos propelled CMU into the top 5
most successful schools
 Commercialization proceeds
 Tangible evidence
 Newell-Simon Hall
37
Did you know?
 A meta-search engine: send a
query to many search engines,
then rank and cluster the
results (especially useful if
you’re not sure which
keywords produce optimal
results)
 Vivisimo was developed here
at CMU
 Developed by
Prof. Raul Valdes-Perez
 Developed in 2000
38
A look at Google
 > 20,000 servers
 Spiders over:
 4.28 billion pages
 880 million images
 Supports 35 non-English languages
 Over 200 million queries per day
39
Google’s server farm
40
Web Spiders
 Start with an initial page P0. Find URLs on P0 and
add them to a queue
 When done with P0, pass it to an indexing
program, get a page P1 from the queue and repeat
 Can be specialized (e.g. only look for email
addresses)
 Issues
 Which page to look at next? (Special subjects, recency)
 Avoid overloading a site
 How deep within a site do you go (depth search)?
 How frequently to visit pages?
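The queue-based loop described above can be sketched without touching the network. Everything named here is hypothetical: `fetch_links` and `index_page` are caller-supplied functions, `TOY_WEB` is a made-up link table, and the `seen` set (to avoid queuing a URL twice) and the `max_pages` limit are additions not stated on the slide. Politeness and revisit policies are left out.

```python
from collections import deque


def crawl(start_url, fetch_links, index_page, max_pages=100):
    """Visit pages in queue order, starting from start_url."""
    queue = deque([start_url])
    seen = {start_url}  # assumption: never queue the same URL twice
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        for link in fetch_links(url):  # find URLs on the page
            if link not in seen:
                seen.add(link)
                queue.append(link)
        index_page(url)                # pass the page to the indexer
        visited.append(url)
    return visited


# A made-up web: page name -> names of the pages it links to.
TOY_WEB = {"P0": ["P1", "P2"], "P1": ["P2", "P0"], "P2": ["P3"], "P3": []}
indexed = []
order = crawl("P0", lambda url: TOY_WEB.get(url, []), indexed.append)
print(order)  # ['P0', 'P1', 'P2', 'P3']
```

Because the frontier is a first-in, first-out queue, this visits pages in breadth-first order; a different queue discipline would give a different answer to "which page to look at next".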
41
So, why Spider the Web?
User Perceptions
 Most annoying: Engine finds nothing
(too small an index, but not an issue since
1997 or so).
 Somewhat annoying: Obsolete links
 Refresh Collection by deleting dead links
 OK if index is slightly smaller
 Done every 1-2 weeks in best engines
 Mildly annoying: Failure to find new site
=> Re-spider entire web
=> Done every 2-4 weeks in best engines
42
So, why Spider the Web?, cont’d
Cost of Spidering
 Semi-parallel algorithmic decomposition
 Spider can (and does) run on hundreds of
servers simultaneously
 Very high network connectivity
 Servers can migrate from spidering to query
processing depending on time-of-day load
 Running a full web spider takes days even with
hundreds of dedicated servers
43
Indexing
 Arrangement of data (data structure) to
permit fast searching
 Which list is easier to search?
sow fox pig eel yak hen ant cat dog hog
ant cat dog eel fox hen hog pig sow yak
 Sorting helps. Why?
 Permits binary search. About log2(n) probes into list
 log2(1 billion) ~ 30
 Permits interpolation search. About log2(log2(n))
probes on average, if the keys are evenly spread
 log2(log2(1 billion)) ~ 5
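Both searches can be sketched with a probe counter to make the estimates concrete. These are textbook versions written for illustration, with assumed function names; the interpolation search works on numeric keys and only reaches the log log n behavior when the keys are roughly uniformly distributed.

```python
import math


def binary_search(items, target):
    """Return (index or None, number of probes) for a sorted list."""
    lo, hi, probes = 0, len(items) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if items[mid] == target:
            return mid, probes
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return None, probes


def interpolation_search(items, target):
    """Like binary search, but guess the position from the key's value."""
    lo, hi, probes = 0, len(items) - 1, 0
    while lo <= hi and items[lo] <= target <= items[hi]:
        if items[hi] == items[lo]:
            pos = lo
        else:
            pos = lo + (target - items[lo]) * (hi - lo) // (items[hi] - items[lo])
        probes += 1
        if items[pos] == target:
            return pos, probes
        if items[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return None, probes


words = "ant cat dog eel fox hen hog pig sow yak".split()
print(binary_search(words, "pig"))                         # (7, 2)
print(interpolation_search(list(range(0, 1000, 10)), 420)) # (42, 1)
print(math.ceil(math.log2(10**9)))                         # 30
```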
44
Inverted Files
A file is a list of words by position
- First entry is the word in position 1 (first word)
- Entry 4562 is the word in position 4562 (4562nd word)
- Last entry is the last word
An inverted file is a list of positions by word!
FILE (the five lines above; they start at positions 1, 10, 20, 30, and 36)
INVERTED FILE
a (1, 4, 40)
entry (11, 20, 31)
file (2, 38)
list (5, 41)
position (9, 16, 26)
positions (43)
word (14, 19, 24, 29, 35, 45)
words (7)
4562 (21, 27)
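Building an inverted file takes one pass over the words. A minimal sketch, assuming whitespace-separated words, 1-based positions, and case folding; the function name `invert` is made up.

```python
def invert(words):
    """Map each word to the list of its 1-based positions in the file."""
    inverted = {}
    for position, word in enumerate(words, start=1):
        inverted.setdefault(word.lower(), []).append(position)
    return inverted


file_words = "A file is a list of words by position".split()
inverted = invert(file_words)
print(inverted["a"])         # [1, 4]
print(inverted["list"])      # [5]
print(inverted["position"])  # [9]
```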
45
Inverted Files for Multiple Documents
[Figure: a LEXICON table with columns WORD, NDOCS, and PTR, holding the words jezebel, jezer, jezerit, jeziah, jeziel, jezliah, jezoar, jezrahliah, jezreel. Each PTR leads into a WORD INDEX whose rows have the columns DOCID, OCCUR, POS 1, POS 2, ...]
“jezebel” occurs
6 times in document 34,
3 times in document 44,
4 times in document 56 . . .
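The lexicon and word index can be approximated with nested dictionaries. This is a sketch under assumptions: the function names are made up, the two documents are invented, and a real engine would use compact on-disk structures, not Python dicts.

```python
def build_index(documents):
    """documents: {doc_id: text}. Returns {word: {doc_id: [positions]}}."""
    index = {}
    for doc_id, text in documents.items():
        for pos, word in enumerate(text.lower().split(), start=1):
            index.setdefault(word, {}).setdefault(doc_id, []).append(pos)
    return index


def lexicon_entry(index, word):
    """Return (NDOCS, rows), where each row is (DOCID, OCCUR, positions)."""
    postings = index.get(word, {})
    rows = [(doc_id, len(positions), positions)
            for doc_id, positions in sorted(postings.items())]
    return len(postings), rows


# Two invented documents, keyed by hypothetical document ids.
documents = {34: "jezebel met jezebel", 44: "the story of jezebel"}
index = build_index(documents)
print(lexicon_entry(index, "jezebel"))
# (2, [(34, 2, [1, 3]), (44, 1, [4])])
```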
46
Ranking (Scoring) Hits
 Hits must be presented in some order
 What order?
 Relevance, recency, popularity, reliability?
 Some ranking methods
 Presence of keywords in title of document
 Closeness of keywords to start of document
 Frequency of keyword in document
 Link popularity (how many pages point to this one)
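The ranking signals listed above could be combined into a single score, for example as a weighted sum. The sketch below is purely illustrative: the weights, the document layout (`title` and `body` fields), and the function name are all invented, and no real engine is claimed to score this way.

```python
def score(doc, keyword, in_links,
          w_title=3.0, w_early=2.0, w_freq=1.0, w_links=0.5):
    """Weighted sum of four signals; every weight here is an arbitrary choice."""
    keyword = keyword.lower()
    words = doc["body"].lower().split()
    in_title = keyword in doc["title"].lower().split()  # keyword in title
    freq = words.count(keyword)                         # keyword frequency
    # Closeness to the start: 1 for the first word, smaller further in.
    early = 1.0 / (words.index(keyword) + 1) if freq else 0.0
    return w_title * in_title + w_early * early + w_freq * freq + w_links * in_links


doc1 = {"title": "Graph algorithms", "body": "A graph is a set of vertices and edges"}
doc2 = {"title": "Cooking", "body": "No mention here"}
hits = sorted([("doc1", score(doc1, "graph", in_links=2)),
               ("doc2", score(doc2, "graph", in_links=2))],
              key=lambda hit: hit[1], reverse=True)
print(hits)  # [('doc1', 6.0), ('doc2', 1.0)]
```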
47
Ranking (Scoring) Hits, cont’d
 Can the user control? Can the page owner
control?
 Can you find out what order is used?
 Spamdexing: influencing retrieval ranking by
altering a web page. (Puts “spam” in the
index)
48
Link Popularity
 How many pages link to this page?
 on the whole Web
 in our database?
 http://www.linkpopularity.com
 Link popularity is used for ranking
 Many measures
 Number of links in (In-links)
 Weighted number of links in (by weight of
referring page)
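Both measures can be computed from a table of out-links. In this sketch the link table is invented, self-links and duplicate links are ignored by assumption, and using each page's own in-link count as its weight is just one simple choice; it is not Google's actual method.

```python
def in_link_counts(links):
    """links: {page: [pages it points to]}. Count distinct referring pages."""
    counts = {page: 0 for page in links}
    for source, targets in links.items():
        for target in set(targets):
            if target != source:
                counts[target] = counts.get(target, 0) + 1
    return counts


def weighted_in_links(links, weight):
    """Sum the weight of every referring page instead of counting 1 each."""
    scores = {page: 0.0 for page in links}
    for source, targets in links.items():
        for target in set(targets):
            if target != source:
                scores[target] = scores.get(target, 0.0) + weight.get(source, 0.0)
    return scores


links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
counts = in_link_counts(links)
print(counts)                            # {'A': 1, 'B': 1, 'C': 3, 'D': 0}
print(weighted_in_links(links, counts))  # {'A': 3.0, 'B': 1.0, 'C': 2.0, 'D': 0.0}
```

Note how the weighted measure ranks A above C: A has a single in-link, but it comes from the most linked-to page.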
49
Search Engine Sizes
[Figure: chart of index sizes for the engines below]
ATW: AllTheWeb
AV: Altavista
GG: Google
INK: Inktomi
TMA: Teoma
Billions of textual documents indexed.
SOURCE: SEARCHENGINEWATCH.COM
50
Search Engine Sizes, cont’d
[Figure: chart of index sizes for the engines below]
ATW: AllTheWeb
AV: Altavista
GG: Google
INK: Inktomi
TMA: Teoma
Billions of textual documents indexed
(as of Sept 2, 2003)
SOURCE: SEARCHENGINEWATCH.COM
51
Current Status of Web Spiders
Historical Notes
 WebCrawler: first documented
spider
 Lycos: first large-scale spider
52
Current Status of Web Spiders
Enhanced Spidering
 In-link counts to pages can be
established during spidering.
 Hint: Hmmm… hw6?
 In-link counts are the basis for
Google’s PageRank method
53
Current Status of Web Spiders
Unsolved Problems
 Most spidering re-traverses stable web
graph
=> on-demand re-spidering when
changes occur
 Completeness or near-completeness is
still a major issue
 Cannot Spider information stored in
local-DB
 Spidering non-textual data
54
Reading
 About graphs:
Chapter 14.
 About Web search:
http://www.cadenza.org/search_engine_terms/srchad.htm
55