Transcript Lecture16AN
Introduction to Graphs
15-211
Fundamental Data Structures and
Algorithms
Aleks Nanevski
March 16, 2004
1
Announcements
Homework Assignment #6 out today
Building a Web Search Engine
Web Crawler (due March 25)
Web Reader (due March 25)
Web Search (due April 5)
Start early and ask questions
2
Graphs — an overview
Vertices (aka nodes)
3
Graphs — an overview
Vertices (aka nodes)
BOS
SFO
DTW
PIT
JFK
LAX
Undirected Edges
6
Graphs — an overview
Vertices (aka nodes): BOS, SFO, DTW, PIT, JFK, LAX
Undirected Edges
Weights
[Figure: the airport graph with edge weights 618, 2273, 211, 190, 1987, 344, 318, 2145, 2462]
7
Examples of Graphs
Vanguard Airlines
Route Map
Questions
What is the fastest way to get from Pittsburgh
to St Louis?
What is the cheapest way to get from
Pittsburgh to St Louis?
8
Web Graph
<href …>
<href …>
<href …>
<href …>
<href …>
<href …>
<href …>
Web Pages are nodes (vertices)
HTML references are links (edges)
9
Graphs as models
Physical objects are often modeled
by meshes, which are a particular
kind of graph structure.
By Jonathan Shewchuk
10
Structure of the Internet
[Figure: network access points (NAPs) joining Backbone 1, Backbone 2, Backbone 3, and Backbone 4, 5, N, with Regional A and Regional B networks and links to Europe, Japan, and Australia]
SOURCE: CISCO SYSTEMS
11
Relationship graphs
Graphs are also used to model
relationships among entities.
Scheduling and resource constraints.
Inheritance hierarchies.
[Figure: graph over the courses 15-113, 15-151, 15-211, 15-212, 15-213, 15-251, 15-312, 15-411, 15-412, 15-451, 15-462]
12
Terminology
Graph G = (V,E)
Set V of vertices (nodes)
Set E of edges
Elements of E are pairs (v,w) where v,w ∈ V.
An edge (v,v) is a self-loop. (Usually assume no self-loops.)
Weighted graph
Elements of E are (v,w,x) where x is a weight,
i.e., a cost associated with the edge.
13
Terminology, cont’d
Directed graph (digraph)
The edge pairs are ordered
Every edge has a specified direction
Undirected graph
The edge pairs are unordered
E is a symmetric relation: (v,w) ∈ E implies (w,v) ∈ E
In an undirected graph (v,w) and (w,v) are usually
treated as though they are the same edge
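The definitions above can be sketched in a few lines of Python. This is only an illustration: the edge set and the weights are made up, and the helper names are hypothetical, not part of the course code.

```python
# Vertices as a set, edges as a set of ordered pairs (a directed graph).
V = {"BOS", "JFK", "PIT", "SFO"}
E_directed = {("PIT", "JFK"), ("JFK", "BOS"), ("SFO", "PIT")}  # hypothetical edges


def is_symmetric(edges):
    """True when (v, w) in E implies (w, v) in E."""
    return all((w, v) in edges for (v, w) in edges)


def symmetric_closure(edges):
    """Make the relation symmetric, i.e. treat every pair as unordered."""
    return edges | {(w, v) for (v, w) in edges}


E_undirected = symmetric_closure(E_directed)

# Weighted graph: elements are (v, w, x), where x is the cost of the edge.
E_weighted = {("PIT", "JFK", 100), ("JFK", "BOS", 50)}  # hypothetical weights
```

Storing both (v,w) and (w,v) is one way to model an undirected graph; storing each unordered pair once is another, and which is more convenient depends on the operations needed.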
14
Directed Graph (digraph)
Vertices (aka nodes)
BOS
SFO
DTW
PIT
JFK
LAX
Edges
15
Undirected Graph
Vertices (aka nodes)
BOS
SFO
DTW
PIT
JFK
LAX
Edges
16
Paths
A path is a sequence of edges from one node
to another
The length of a path is the number of edges in it.
BOS
DTW
SFO
PIT
JFK
LAX
Question: What are paths of length 2 in the
above graph?
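One way to answer a question like this mechanically is to chain two edges together. The sketch below assumes a directed graph stored as a dictionary from each vertex to its successors; the edges are invented for illustration, since the slide's figure is not reproduced here.

```python
def paths_of_length_2(adj):
    """Return every vertex sequence (u, v, w) with edges u->v and v->w."""
    return sorted(
        (u, v, w)
        for u in adj
        for v in adj[u]
        for w in adj.get(v, [])
    )


# Hypothetical edges, not the ones drawn on the slide.
adj = {
    "SFO": ["PIT"],
    "PIT": ["JFK", "BOS"],
    "JFK": ["BOS"],
    "BOS": [],
}

two_step = paths_of_length_2(adj)
```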
17
Paths
A simple path is a path where no
vertex is repeated
first and last vertices can be the same
SFO
BOS
DTW
PIT
JFK
LAX
Question: an example of a simple path? A non-simple
path?
18
Connected graphs
Connected graph
A graph is connected if for all u, v in V, there exists a
path from u to v.
A directed graph is strongly connected if for all
u, v in V, there exists a directed path from u to v.
A directed graph is weakly connected if for all u, v
in V, there exists a path from u to v once edge directions are ignored.
Complete Graph
A graph with all nodes connected to each other directly
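Connectivity can be checked with a breadth-first traversal. The following is a minimal sketch, assuming the graph is a dictionary in which every vertex appears as a key; weak connectivity is taken in its usual sense (connected once edge directions are ignored). The function names are hypothetical.

```python
from collections import deque


def reachable(adj, start):
    """Set of vertices reachable from start by following edges."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def is_strongly_connected(adj):
    """Every vertex reaches every other vertex along directed edges."""
    return all(reachable(adj, u) == set(adj) for u in adj)


def is_weakly_connected(adj):
    """Connected after adding the reverse of every edge."""
    if not adj:
        return True
    undirected = {u: set(vs) for u, vs in adj.items()}
    for u, vs in adj.items():
        for v in vs:
            undirected[v].add(u)
    start = next(iter(undirected))
    return reachable(undirected, start) == set(undirected)
```

Running a traversal from every vertex is simple but costs O(|V| (|V| + |E|)); faster strong-connectivity algorithms exist and are not shown here.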
19
Cycles
A cycle (in a directed graph) is a path that
begins and ends in the same vertex.
A directed acyclic graph (DAG) is a directed
graph having no cycles.
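Whether a directed graph is a DAG can be tested with a depth-first search that looks for an edge back to a vertex still on the current search path. This is a sketch under the assumption that every vertex is a key of the dictionary; it uses recursion, so it suits small graphs only.

```python
def has_cycle(adj):
    """True if the directed graph contains a cycle (so it is not a DAG)."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {u: WHITE for u in adj}

    def visit(u):
        color[u] = GRAY
        for v in adj[u]:
            if color[v] == GRAY:  # edge back into the current path
                return True
            if color[v] == WHITE and visit(v):
                return True
        color[u] = BLACK
        return False

    return any(color[u] == WHITE and visit(u) for u in adj)
```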
20
So, is this a connected graph?
Cyclic or
Acyclic?
Directed or
Undirected?
21
Directed graph (unconnected)
Cyclic or
Acyclic?
22
Tree is a Graph
[Figure: two drawings of a graph on the vertices A, B, C, D, E, F, G]
A connected graph with no cycles is called a tree.
This is a general definition of a tree
23
Edges and Degree
Degree
number of edges incident to a vertex
in a directed graph
in-degree(v) - number of edges into vertex
v
out-degree(v) - number of edges from
vertex v
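In-degree and out-degree can be counted in one pass over the edge list. A small sketch, assuming edges are given as (v, w) pairs meaning an edge from v to w:

```python
from collections import Counter


def degrees(edges):
    """Return (in_degree, out_degree) counters for a directed edge list."""
    out_degree = Counter(v for v, _ in edges)
    in_degree = Counter(w for _, w in edges)
    return in_degree, out_degree


edges = [("a", "b"), ("a", "c"), ("b", "c")]  # hypothetical example
in_degree, out_degree = degrees(edges)
```

A Counter reports 0 for vertices it has never seen, so vertices with no incoming or no outgoing edges need no special handling.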
24
Dense Graphs
For most graphs |E| ≤ |V|^2
A graph is dense when most edges
are present
|E| = Θ(|V|^2)
large number of edges (quadratic)
Also |E| > |V| log |V| can be considered
dense
25
Sparse Graphs
A sparse graph is a graph with
relatively few edges
no clear definition
metric for sparsity |E| < |V| log |V|
Examples of sparse graphs
computer network
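The |V| log |V| rule of thumb is easy to apply in code. The sketch below assumes a base-2 logarithm and, arbitrarily, calls the boundary case dense; neither choice comes from the slides.

```python
import math


def classify(num_vertices, num_edges):
    """Label a graph 'sparse' or 'dense' using the |V| log |V| threshold."""
    threshold = num_vertices * math.log2(num_vertices)
    return "sparse" if num_edges < threshold else "dense"
```

For example, a graph with 1000 vertices has a threshold of roughly 10,000 edges, so 3000 edges counts as sparse and 400,000 edges counts as dense.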
26
Representing Graphs
Graph operations
Navigation in a graph.
(v,w) ∈ E
Rv = {w | (v,w) ∈ E}
Dw = {v | (v,w) ∈ E}
Enumeration of the elements of a graph.
E
V
Size of a graph
|V|
|E|
28
Representing graphs
Adjacency matrix
2-dimensional array
For each edge (u,v), set A[u][v] to
true; otherwise false
[Figure: 7 x 7 adjacency matrix over vertices 1 to 7, with x marking each edge]
Adjacency lists
For each vertex, keep a list of
adjacent vertices
[Figure: adjacency lists for vertices 1 to 7]
Q: How to represent weights?
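Both representations can be sketched briefly, along with one possible answer to the question about weights: store the weight in the matrix cell, or store (neighbor, weight) pairs in each list. The sketch assumes vertices are numbered 0 to n-1 (the slide numbers them from 1) and uses None for "no edge"; the function names are hypothetical.

```python
def build_matrix(n, weighted_edges):
    """n x n matrix; A[u][v] holds the weight of edge (u, v), or None."""
    A = [[None] * n for _ in range(n)]
    for u, v, x in weighted_edges:
        A[u][v] = x
    return A


def build_lists(n, weighted_edges):
    """For each vertex u, a list of (v, weight) pairs for edges (u, v)."""
    adj = [[] for _ in range(n)]
    for u, v, x in weighted_edges:
        adj[u].append((v, x))
    return adj


edges = [(0, 1, 5), (0, 2, 7), (2, 1, 1)]  # hypothetical weighted edges
A = build_matrix(3, edges)
adj = build_lists(3, edges)
```

With the matrix, testing whether a given edge exists is a single lookup; with the lists, enumerating the neighbors of a vertex takes time proportional to its out-degree.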
29
Choosing a representation
Size of V relative to size of E is a primary
factor.
Dense: E/V is large
Sparse: E/V is small
Adjacency matrix is expensive if the graph is
sparse. WHY?
Adjacency list is expensive if the graph is
dense. WHY?
Dynamic changes to V.
Adjacency matrix is expensive to copy/extend
if V is extended. WHY?
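A back-of-the-envelope comparison of storage may help with these questions. The sketch assumes one cell per ordered vertex pair for the matrix, and one list header per vertex plus one entry per directed edge for the lists; the sizes are invented for illustration.

```python
def matrix_cells(num_vertices):
    """Cells in an adjacency matrix: one per ordered pair of vertices."""
    return num_vertices * num_vertices


def list_entries(num_vertices, num_edges):
    """Adjacency-list storage: one header per vertex, one entry per edge."""
    return num_vertices + num_edges


# A sparse graph: one million vertices, three million edges.
sparse_matrix = matrix_cells(1_000_000)
sparse_lists = list_entries(1_000_000, 3_000_000)
```

Under these assumptions the matrix needs 10^12 cells while the lists need about 4 x 10^6 entries. For a dense graph the counts are comparable, and each list entry typically costs more than a one-bit matrix cell.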
30
Graphs: Application
Search Engines
32
What are they?
Tools for finding information on the Web
Problem: “hidden” databases, e.g. New York
Times (i.e., databases of keywords hosted by
the web site itself; these cannot be accessed
by Yahoo, Google, etc.)
Search engine
A machine-constructed index (usually by
keyword)
So many search engines, we need search
engines to find them. Searchenginecollosus.com
33
What are they?
Search engines: key tools for ecommerce
Buyers and sellers must find each other
How do they work?
How much do they index?
Are they reliable?
How are hits ordered?
Can the order be changed?
34
The Process
1. Acquire the collection, i.e. all the documents
[Off-line process]
2. Create an inverted index
[Off-line process]
3. Match queries to documents
[On-line process, the actual retrieval]
4. Present the results to user
[On-line process: display, summarize, ...]
35
SE Architecture
Spider
Crawls the web to find pages. Follows hyperlinks.
Never stops
Indexer
Produces data structures for fast searching of all
words in the pages (i.e., it updates the lexicon)
Retriever
Query interface
Database lookup to find hits
Billions of documents
1 TB RAM, many terabytes of disk
Ranking
36
Did you know?
The concept of a Web spider was
developed by Dr. Fuzzy Mauldin
Implemented in 1994 on the Web
Went into the creation of Lycos
Lycos propelled CMU into the top 5
most successful schools
Commercialization proceeds
Dr. Michael L.
(Fuzzy) Mauldin
Tangible evidence
Newell-Simon Hall
37
Did you know?
A meta-search engine: send a
query to many search engines,
then rank and cluster the
results. (especially useful if
you’re not sure which
keywords produce optimal
results)
Vivisimo was developed here
at CMU
Developed by
Prof. Raul Valdes-Perez
Developed in 2000
38
A look at Google
> 20,000 servers
Spiders over:
4.28 billion pages
880 million images
Supports 35 non-English languages
Over 200 million queries per day
39
Google’s server farm
40
Web Spiders
Start with an initial page P0. Find URLs on P0 and
add them to a queue
When done with P0, pass it to an indexing
program, get a page P1 from the queue and repeat
Can be specialized (e.g. only look for email
addresses)
Issues
Which page to look at next? (Special subjects, recency)
Avoid overloading a site
How deep within a site do you go (depth search)?
How frequently to visit pages?
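The queue-driven loop described above can be sketched without touching the network. In this sketch the "web" is an in-memory dictionary, and `fetch_links`, `index_page`, and `max_pages` are hypothetical names; the `seen` set is an addition (not on the slide) that keeps a page from being queued twice.

```python
from collections import deque


def crawl(start_url, fetch_links, index_page, max_pages=100):
    """Breadth-first spider: take a page from the queue, queue its
    unseen links, hand the page to the indexer, and repeat."""
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        for url in fetch_links(page):
            if url not in seen:
                seen.add(url)
                queue.append(url)
        index_page(page)  # "pass it to an indexing program"
        visited.append(page)
    return visited


# A tiny stand-in for the Web: page -> URLs found on that page.
web = {"p0": ["p1", "p2"], "p1": ["p2", "p0"], "p2": ["p3"], "p3": []}
indexed = []
order = crawl("p0", lambda url: web.get(url, []), indexed.append)
```

A first-in, first-out queue gives breadth-first order; swapping in a priority queue is one way to address the "which page to look at next" issue, and a real spider would also need politeness delays and error handling.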
41
So, why Spider the Web?
User Perceptions
Most annoying: Engine finds nothing
(too small an index, but not an issue since
1997 or so).
Somewhat annoying: Obsolete links
Refresh collection by deleting dead links
OK if index is slightly smaller
Done every 1-2 weeks in best engines
Mildly annoying: Failure to find new site
=> Re-spider entire web
=> Done every 2-4 weeks in best engines
42
So, why Spider the Web?, cont’d
Cost of Spidering
Semi-parallel algorithmic decomposition
Spider can (and does) run on hundreds of
servers simultaneously
Very high network connectivity
Servers can migrate from spidering to query
processing depending on time-of-day load
Running a full web spider takes days even with
hundreds of dedicated servers
43
Indexing
Arrangement of data (data structure) to
permit fast searching
Which list is easier to search?
sow fox pig eel yak hen ant cat dog hog
ant cat dog eel fox hen hog pig sow yak
Sorting helps. Why?
Permits binary search. About log2(n) probes into list
log2(1 billion) ~ 30
Permits interpolation search. About log2(log2(n))
probes on average, for evenly distributed keys
log2(log2(1 billion)) ~ 5
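Both searches can be sketched with a probe counter to make the estimates concrete. The interpolation search below assumes integer keys and behaves well only when the keys are roughly evenly spread; the function names and sample data are illustrative.

```python
import math


def binary_search(items, target):
    """Return (index, probes); index is -1 if target is absent."""
    lo, hi, probes = 0, len(items) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if items[mid] == target:
            return mid, probes
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, probes


def interpolation_search(items, target):
    """Like binary search, but guess the position from the key's value."""
    lo, hi, probes = 0, len(items) - 1, 0
    while lo <= hi and items[lo] <= target <= items[hi]:
        probes += 1
        if items[hi] == items[lo]:
            pos = lo
        else:
            pos = lo + (target - items[lo]) * (hi - lo) // (items[hi] - items[lo])
        if items[pos] == target:
            return pos, probes
        if items[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1, probes


words = "ant cat dog eel fox hen hog pig sow yak".split()
evenly_spaced = list(range(0, 1000, 10))
estimate_binary = math.log2(10**9)                    # about 29.9
estimate_interpolation = math.log2(math.log2(10**9))  # about 4.9
```

Neither search works on the unsorted list, which is why sorting (or some other index structure) has to happen first.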
44
Inverted Files
FILE
POS  TEXT
1    A file is a list of words by position
10   - First entry is the word in position 1 (first word)
20   - Entry 4562 is the word in position 4562 (4562nd word)
30   - Last entry is the last word
36   An inverted file is a list of positions by word!
INVERTED FILE
a (1, 4, 40)
entry (11, 20, 31)
file (2, 38)
list (5, 41)
position (9, 16, 26)
positions (43)
word (14, 19, 24, 29, 35, 45)
words (7)
4562 (21, 27)
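The inversion can be reproduced on the five lines of slide text themselves. This sketch assumes words are maximal runs of letters and digits, compared case-insensitively, with positions counted from 1; that tokenization rule is a guess at what the slide intends.

```python
import re
from collections import defaultdict


def invert(text):
    """Map each word to the list of positions at which it occurs."""
    index = defaultdict(list)
    words = re.findall(r"[a-z0-9]+", text.lower())
    for position, word in enumerate(words, start=1):
        index[word].append(position)
    return dict(index)


slide_text = """A file is a list of words by position
- First entry is the word in position 1 (first word)
- Entry 4562 is the word in position 4562 (4562nd word)
- Last entry is the last word
An inverted file is a list of positions by word!"""

index = invert(slide_text)
```

Looking up a word is then a single dictionary access, instead of a scan through the whole file.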
45
Inverted Files for Multiple Documents
LEXICON (columns: WORD, NDOCS, PTR)
jezebel 20
jezer 3
jezerit 1
jeziah 1
jeziel 1
jezliah 1
jezoar 1
jezrahliah 1
jezreel
WORD INDEX (columns: DOCID, OCCUR, POS 1, POS 2, ...)
[Figure: each lexicon PTR leads to that word's rows in the word index]
“jezebel” occurs
6 times in document 34,
3 times in document 44,
4 times in document 56 . . .
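A two-level structure of this kind can be sketched with nested dictionaries: the outer one plays the role of the lexicon, and each inner one holds the per-document rows. The document texts below are invented (only the document numbers echo the slide), and the tokenization rule is an assumption.

```python
import re
from collections import defaultdict


def build_index(docs):
    """word -> {doc_id: [positions of word in that document]}"""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words, start=1):
            index[word].setdefault(doc_id, []).append(position)
    return index


def lexicon_entry(index, word):
    """Return (NDOCS, {doc_id: OCCUR}) for one word."""
    postings = index.get(word, {})
    return len(postings), {d: len(p) for d, p in postings.items()}


docs = {  # hypothetical documents
    34: "jezebel sent a letter and jezebel waited",
    44: "the city of jezreel",
    56: "jezebel lived in jezreel",
}
index = build_index(docs)
ndocs, occurrences = lexicon_entry(index, "jezebel")
```

Here NDOCS is the number of documents containing the word and OCCUR is the length of each position list, matching the column names on the slide.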
46
Ranking (Scoring) Hits
Hits must be presented in some order
What order?
Relevance, recency, popularity, reliability?
Some ranking methods
Presence of keywords in title of document
Closeness of keywords to start of document
Frequency of keyword in document
Link popularity (how many pages point to this one)
47
Ranking (Scoring) Hits, cont’d
Can the user control? Can the page owner
control?
Can you find out what order is used?
Spamdexing: influencing retrieval ranking by
altering a web page. (Puts “spam” in the
index)
48
Link Popularity
How many pages link to this page?
on the whole Web
in our database?
http://www.linkpopularity.com
Link popularity is used for ranking
Many measures
Number of links in (In-links)
Weighted number of links in (by weight of
referring page)
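The two measures listed above can be computed directly from a link table. In this sketch the link table is invented, repeated links from the same page count once, and the weight of a referring page is simply taken to be its own in-link count; that weighting is an illustrative choice, not any particular engine's method.

```python
def in_link_counts(links):
    """Number of distinct pages that link to each page."""
    counts = {page: 0 for page in links}
    for page, targets in links.items():
        for target in set(targets):
            counts[target] = counts.get(target, 0) + 1
    return counts


def weighted_in_links(links, weight):
    """Sum of the weights of the pages that link to each page."""
    scores = {page: 0.0 for page in links}
    for page, targets in links.items():
        for target in set(targets):
            scores[target] = scores.get(target, 0.0) + weight.get(page, 0.0)
    return scores


links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}  # hypothetical
counts = in_link_counts(links)
weighted = weighted_in_links(links, counts)
```

In this example page c has the most in-links, while page a scores highest once links are weighted, because its single in-link comes from the popular page c.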
49
Search Engine Sizes
[Figure: bar chart. Legend: ATW = AllTheWeb, AV = Altavista, GG = Google, INK = Inktomi, TMA = Teoma]
Billions of textual documents indexed.
SOURCE: SEARCHENGINEWATCH.COM
50
Search Engine Sizes, cont’d
[Figure: bar chart. Legend: ATW = AllTheWeb, AV = Altavista, GG = Google, INK = Inktomi, TMA = Teoma]
Billions of textual documents indexed
(as of Sept 2, 2003)
SOURCE: SEARCHENGINEWATCH.COM
51
Current Status of Web Spiders
Historical Notes
WebCrawler: first documented
spider
Lycos: first large-scale spider
52
Current Status of Web Spiders
Enhanced Spidering
In-link counts to pages can be
established during spidering.
Hint: Hmmm… hw6?
In-link counts are the basis for
Google’s page-rank method
53
Current Status of Web Spiders
Unsolved Problems
Most spidering re-traverses stable web
graph
=> on-demand re-spidering when
changes occur
Completeness or near-completeness is
still a major issue
Cannot spider information stored in
local databases
Spidering non-textual data
54
Reading
About graphs:
Chapter 14.
About Web search:
http://www.cadenza.org/search_engine_terms/srchad.htm
55