CS 728 Lecture 4 It’s a Small World

Download Report

Transcript CS 728 Lecture 4 It’s a Small World

CS 728
Lecture 4
It’s a Small World on the Web
Small World Networks
• It is a ‘small world’ after all
– Billions of people on Earth, yet every pair separated by “six
degrees” of acquaintance relationships
– Notion popularized by experimental psychologist Stanley
Milgram’s, different from his more infamous experiment
• Mathematically
– Sparse – linear number of edges
– Diameter - small like logarithm (log N)
– Clustering is high – neighbors are neighbors
Small World = Small Diameter +
Clustering
• Defined by two measures:
– characteristic path length L = number of
edges in shortest path between two vertices,
averaged over all vertex pairs
– clustering coefficient C:
•
•
•
•
take vertex v with k  1 neighbors
at most k(k-1)/2 edges among neighbors
C(v) = fraction of k(k-1)/2 edges present
C = average clustering coefficient
• C >> C_random, L  L_random
The small world of the Web
• Empirical study of Web-graph reveals
small-world property
– Sparse graph
– Average distance (d) in simulated web:
e.g.
d = 0.35 + 2.06 log (n)
n = 109, d ~= 19
– Diameter properties inferred from sampling
• Calculation of max. diameter computationally
demanding for large values of n
– Clustering unknown
Implications for Web
• Logarithmic scaling of diameter makes
future navigation of web manageable
– 10-fold increase of web pages results in only
2 more additional ‘clicks’, but …
– Users may not take shortest path, may use
bookmarks or just get distracted on the way
– Search engines play a crucial role, how can
they use this SW link structure?
Small World in Real World of
Hollywood: The Kevin Bacon Game
Goal: Connect any actor to
Kevin Bacon, by linking
actors who have acted in
the same movie.
Boxed version of the
Kevin Bacon Game
Oracle of Bacon website uses
Internet Movie Database
(IMDB.com) to find shortest
link between any two
actors. Created by students
at Univ. of Virginia
http://oracleofbacon.org/
The Hollywood Network
Total # of actors in
database: ~550,000
Most actors are within three
links of each other!
Average path length to
Kevin Bacon: 2.79
Actor closest to “center”:
Rod Steiger (2.53)
Rank of Kevin, in closeness
to center: 876th
Center of Hollywood?
Math Citation Network:
Erdős Number
Number of links required to connect
scholars to Erdős, via coauthorship of papers
Erdős wrote 1500+ papers with 507
co-authors.
Jerry Grossman’s (Oakland Univ.)
website allows mathematicians
to compute their Erdos numbers:
http://www.oakland.edu/enp/
Paul Erdős (1913-1996)
Connecting path lengths, among
mathematicians only:
– average is 4.65
– maximum is 13
My number is 3
Erdős
Fan
Chung
Arny
Rosenberg
Fred
Annexstein
- Erdős and Renyi showed that average
path length between connected nodes in a
random graph is logarithmic
ln N
ln k
- But degree sequences in social networks like
Web and Hollywood are not Poisson
- Back to Power-laws
Classes of small-world networks
– Single-scale: Connectivity distribution decays exponentially
(e.g., Poisson and random graphs)
– Scale-free: Power-law distribution of connectivity over
entire range
– Broad-scale: Power-law over “broad range” + abrupt cut-off
Bow-tie Structure of Web
• A large scale study (Altavista crawls)
reveals another interesting property of
web – “symmetric asymmetry”
– Study of 200 million nodes & 1.5 billion links
– Small-world property not applicable to entire
web
• Some parts unreachable
• Others have long paths
– Power-law connectivity holds though
• Page indegree ( = 2.1), outdegree ( = 2.72)
Bow-tie Components
• Strongly Connected
Component (SCC)
– Core with small-world
property
• Upstream (IN)
– Core can’t reach IN
• Downstream (OUT)
– OUT can’t reach core
• Disconnected (Tendrils)
Component Properties
• Each component is roughly same size
– ~50 million nodes
• Tendrils not connected to SCC
– But reachable from IN and can reach OUT
• Tubes: directed paths IN->Tendrils->OUT
• Disconnected components
– Maximal and average diameter is infinite
Empirical Numbers for Bow-tie
• Maximal minimal (?) diameter
– 28 for SCC, 500 for entire graph
• Probability of a path between any 2 nodes
– ~1 quarter (0.24)
• Average length
– 16 (directed path exists), 7 (undirected)
• Shortest directed path between 2 nodes in
SCC: 16-20 links on average
Next Time:
Models for the Web Graph
• Stochastic models that can explain or at
least partially reproduce the properties of
the web graph. Goals of model
– power law distribution properties
– maintain the small world property
– bow-tie structure