Information Networks
Information Networks - Web Graphs
Thanks to: P. Tsaparas, P. Baldi,
P. Frasconi, P. Smyth
Outline
Networks - motivation
Past and present
Real vs random
Review of graphs and measurements
Scale free - power laws
Web as a graph/network
Why important
What is a network?
Network: a collection of entities that are
interconnected with links.
people that are friends
computers that are interconnected
web pages that point to each other
proteins that interact
Graphs
In mathematics, networks are called graphs, the
entities are nodes, and the links are edges
Graph theory starts in the 18th century, with Leonhard
Euler
The problem of Königsberg bridges
Since then graphs have been studied extensively.
Networks in the past
Graphs have been used in the past to
model existing networks (e.g., networks of
highways, social networks)
usually these networks were small
these networks could be studied by visual inspection, which can
reveal a lot of information
Networks now
More and larger networks appear
Products of technological advancement
• e.g., Internet, Web
Result of our ability to collect more, better,
and more complex data
• e.g., gene regulatory networks
Networks of thousands, millions, or billions
of nodes
impossible to visualize
The internet map
Understanding large graphs
What are the statistics of real life
networks?
Can we explain how the networks were
generated?
Measuring network properties
Around 1999
Watts and Strogatz, Dynamics and the small-world phenomenon
Faloutsos, Faloutsos, and Faloutsos, On power-law relationships of the
Internet topology
Kleinberg et al., The Web as a graph
Barabási and Albert, Emergence of scaling in random networks
Real network properties
Most nodes have only a small number of neighbors
(degree), but there are some nodes with very high
degree (power-law degree distribution)
scale-free networks
If a node x is connected to y and z, then y and z are
likely to be connected
high clustering coefficient
Most pairs of nodes are just a few edges apart on average.
small world networks
Networks from very diverse areas (from internet to
biological networks) have similar properties
Is it possible that there is a unifying underlying generative
process?
Generating random graphs
Classic graph theory model (Erdős-Rényi)
each edge is generated independently with probability p
Very well studied model but:
most vertices have about the same degree
the probability of two nodes being linked is independent of
whether they share a neighbor
the average paths are short
Modeling real networks
Real life networks are not “random”
Can we define a model that generates
graphs with statistical properties similar to
those in real life?
a flurry of models for random graphs
Processes on networks
Why is it important to understand the
structure of networks?
Epidemiology: Viruses propagate much
faster in scale-free networks
Vaccination of random nodes does not work,
but targeted vaccination is very effective
Random sampling can be dangerous!
Web search
First generation search engines: the Web as a collection
of documents
Suffered from spammers; poor, unstructured, unsupervised content;
and the rapid increase in Web size
Second generation search engines: the Web as a
network
use the anchor text of links for annotation
good pages should be pointed to by many pages
good pages should be pointed to by many good pages
• PageRank algorithm, Google! (see the sketch after this list)
• HITS (hubs and authorities)
• Most search engines' ranking methods
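To make "good pages are pointed to by many good pages" concrete, here is a minimal power-iteration sketch of PageRank; the tiny link graph and the damping factor 0.85 are illustrative assumptions, not the production algorithm.

```python
def pagerank(out_links, damping=0.85, iters=50):
    """Power iteration on a dict: node -> list of nodes it links to."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1 - damping) / n for u in nodes}
        for u in nodes:
            targets = out_links[u] or nodes  # dangling page: spread rank everywhere
            share = damping * rank[u] / len(targets)
            for v in targets:
                new[v] += share
        rank = new
    return rank

web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(web))  # "c" is pointed to by the most pages and ranks highest
```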
The future of networks
Networks seem to be here to stay
More and more systems are modeled as
networks
Scientists from various disciplines are working
on networks (physicists, computer scientists,
mathematicians, biologists, sociologists,
economists)
There are still many open questions.
Mathematical Tools
Graph theory
Probability theory
Linear Algebra
Types of networks
Social networks
Knowledge (Information) networks
Technology networks
Biological networks
Social Networks
Links denote a social interaction
Networks of acquaintances
collaboration networks
• actor networks
• co-authorship networks
• director networks
phone-call networks
e-mail networks
IM networks
Bluetooth networks
sexual networks
home page/blog networks
Knowledge (Information) Networks
Nodes store information, links associate
information
Citation network (directed acyclic)
The Web (directed)
Peer-to-Peer networks
Word networks
Networks of Trust
Software graphs
Technological networks
Networks built for the distribution of a commodity
The Internet
• router level, AS level
Power Grids
Airline networks
Telephone networks
Transportation Networks
• roads, railways, pedestrian traffic
Biological networks
Biological systems represented as networks
Protein-Protein Interaction Networks
Gene regulation networks
Gene co-expression networks
Metabolic pathways
The Food Web
Neural Networks
Measuring Networks
Degree distributions
Small world phenomena
Clustering Coefficient
Mixing patterns
Degree correlations
Communities and clusters
The basic random graph model
The measurements on real networks are usually
compared against those on “random networks”
The basic Gn,p (Erdős-Rényi) random graph model:
n : the number of vertices
0≤p≤1
for each pair (i,j), generate the edge (i,j) independently with
probability p
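A minimal sketch of the Gn,p model just defined, using only the Python standard library (n, p, and the seed are illustrative choices):

```python
import random
from collections import Counter

def gnp_degrees(n, p, seed=0):
    """Return the degree of each node in one G(n,p) sample."""
    rng = random.Random(seed)
    deg = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:  # each edge appears independently
                deg[i] += 1
                deg[j] += 1
    return deg

n, p = 1000, 0.01
deg = gnp_degrees(n, p)
f = Counter(deg)                      # f[k] = number of nodes with degree k
print("mean degree:", sum(deg) / n)  # close to z = (n-1)p
for k in sorted(f):
    print(k, f[k] / n)               # f_k = fraction of nodes with degree k
```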
Degree distributions
fk = fraction of nodes with degree k
   = probability that a randomly selected node has degree k
[Figure: histogram of the degree distribution, fk against degree k]
Problem: find the probability distribution that best fits the
observed data
Power-law distributions
The degree distributions of most real-life networks follow a power
law
p(k) = C k^-a
Right-skewed/Heavy-tail distribution
there is a non-negligible fraction of nodes that has very high degree
(hubs)
scale-free: no characteristic scale, average is not informative
In stark contrast with the random graph model!
Poisson degree distribution, z = np:
p(k) = P(k; z) = e^-z z^k / k!
highly concentrated around the mean
the probability of very high degree nodes is exponentially small
Power-law signature
A power-law distribution gives a line in the log-log plot:
log p(k) = -a log k + log C
[Figure: the degree distribution on linear axes and on log-log axes; in the log-log plot it is a line of slope -a]
a : power-law exponent (typically 2 ≤ a ≤ 3)
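A minimal sketch of the power-law signature in code: sample from a power law and fit a line to the log-log histogram. Note that least-squares fitting in log-log space is known to be biased (maximum-likelihood estimators are preferred in practice); this only illustrates the "line in the log-log plot" idea.

```python
import math
import random
from collections import Counter

def loglog_slope(degrees):
    """Least-squares slope of log f_k against log k over the degree histogram."""
    counts = Counter(d for d in degrees if d > 0)
    total = len(degrees)
    ks = sorted(counts)
    xs = [math.log(k) for k in ks]
    ys = [math.log(counts[k] / total) for k in ks]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den  # estimate of -a

# Degrees drawn from a (continuous, then truncated) power law with a = 2.5,
# via inverse-transform sampling: k = (1 - u)^(-1/(a-1)) for uniform u.
rng = random.Random(1)
a = 2.5
degrees = [int((1 - rng.random()) ** (-1 / (a - 1))) for _ in range(100_000)]
print("estimated exponent:", -loglog_slope(degrees))  # roughly a = 2.5
```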
Examples of degree distribution for power laws
Taken from [Newman 2003]
A random graph example
Exponential distribution
Observed in some technological or collaboration
networks
p(k) = λ e^-λk
Identified by a line in the log-linear plot:
log p(k) = -λk + log λ
[Figure: log-linear plot of the degree distribution; a line of slope -λ]
Average/Expected degree
For random graphs z = np
For power-law distributed degree:
if a > 2, it is a constant
if a ≤ 2, it diverges
Maximum degree
For random graphs, the maximum degree
is highly concentrated around the average
degree z
For power-law graphs:
k_max ≈ n^(1/(a-1))
(a worked derivation of both claims follows)
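A worked version of both claims, as a sketch that treats the degree as a continuous variable with density p(k) = C k^-a for k ≥ 1 (a common approximation, assumed here):

```latex
% Expected degree: finite exactly when a > 2.
\langle k\rangle \;=\; \int_1^\infty k\,C k^{-a}\,dk
 \;=\; \frac{C}{a-2} \quad\text{for } a>2,
 \qquad \text{divergent for } a\le 2.

% Maximum degree: choose k_max so that we expect about one node
% of larger degree, i.e. n * Pr[k > k_max] ~ 1:
n\int_{k_{\max}}^{\infty} C k^{-a}\,dk
 \;=\; \frac{nC}{a-1}\,k_{\max}^{-(a-1)} \;\approx\; 1
\;\;\Longrightarrow\;\; k_{\max}\;\sim\; n^{1/(a-1)}.
```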
Collective Statistics (M. Newman 2003)
Clustering (Transitivity) coefficient
Measures the density of triangles (local
clusters) in the graph
Two different ways to measure it:
C^(1) = (Σ_i triangles centered at node i) / (Σ_i triples centered at node i)
The ratio of the means
Example
[Figure: a graph on nodes 1-5; node 1 is adjacent to 2, 3, 4, 5, and nodes 2 and 3 are adjacent, so {1,2,3} is the only triangle]
C^(1) = (3 · 1) / (1 + 1 + 6) = 3/8
Clustering (Transitivity) coefficient
Clustering coefficient for node i:
Ci = (triangles centered at node i) / (triples centered at node i)
C^(2) = (1/n) Σ_i Ci
The mean of the ratios
Example
[Figure: the same graph on nodes 1-5 as above]
C^(2) = (1/5) · (1 + 1 + 1/6) = 13/30
C^(1) = 3/8
The two clustering coefficients give different
measures
C^(2) gives equal weight to every node, so low-degree nodes with
high local ratios pull it up (see the sketch below)
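A minimal sketch computing both coefficients on the 5-node example above (the adjacency list is reconstructed from the example and is an assumption):

```python
from itertools import combinations

# The 5-node example graph: node 1 is the hub, {1,2,3} is the only triangle.
adj = {1: {2, 3, 4, 5}, 2: {1, 3}, 3: {1, 2}, 4: {1}, 5: {1}}

def tri_and_triples(i):
    """Triangles and connected triples centered at node i."""
    nbrs = adj[i]
    triples = len(nbrs) * (len(nbrs) - 1) // 2
    triangles = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return triangles, triples

stats = {i: tri_and_triples(i) for i in adj}
c1 = sum(t for t, _ in stats.values()) / sum(s for _, s in stats.values())
c2 = sum(t / s for t, s in stats.values() if s > 0) / len(adj)
print(c1)  # 0.375     = 3/8   (ratio of the means: C^(1))
print(c2)  # 0.4333... = 13/30 (mean of the ratios: C^(2))
```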
Collective Statistics (M. Newman 2003)
Clustering coefficient for random graphs
The probability of two of your neighbors also being
neighbors is p, independent of local structure
clustering coefficient C = p
when z is fixed, C = p = z/n = O(1/n)
The C(k) distribution
The C(k) distribution is supposed to capture the
hierarchical nature of the network
when constant: no hierarchy
when power-law: hierarchy
C(k) = average clustering coefficient
of nodes with degree k
[Figure: plot of C(k) against degree k]
Milgram’s small world experiment
Letters were handed out to people in Nebraska to be
sent to a target in Boston
People were instructed to pass on the letters to someone
they knew on a first-name basis
The letters that reached the destination followed paths of
length around 6
Six degrees of separation (also the title of a play by John Guare)
Also:
The Kevin Bacon game
The Erdős number
Small world project:
http://smallworld.columbia.edu/index.html
Measuring the small world phenomenon
dij = shortest-path distance between nodes i and j
Diameter:
d = max_{i,j} dij
Characteristic path length:
ℓ = (1 / (n(n-1)/2)) · Σ_{i<j} dij
Harmonic mean:
ℓ^-1 = (1 / (n(n-1)/2)) · Σ_{i<j} dij^-1
Also, the distribution of all shortest paths (see the BFS sketch below)
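A minimal BFS-based sketch of these measurements, assuming a small undirected, connected graph (the example graph is the 5-node graph from the clustering slides):

```python
from collections import deque

def bfs_distances(adj, src):
    """Hop distances from src by breadth-first search."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

adj = {1: {2, 3, 4, 5}, 2: {1, 3}, 3: {1, 2}, 4: {1}, 5: {1}}
dist = {i: bfs_distances(adj, i) for i in adj}
pairs = [(i, j) for i in adj for j in adj if i < j]

diameter = max(dist[i][j] for i, j in pairs)
char_len = sum(dist[i][j] for i, j in pairs) / len(pairs)
harmonic = len(pairs) / sum(1 / dist[i][j] for i, j in pairs)
print(diameter, char_len, harmonic)  # 2, 1.5, ~1.33
```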
Collective Statistics (M. Newman 2003)
Is the path length enough?
Random graphs have diameter
d = log n / log z
d = log n / log log n when z = ω(log n)
Short paths should be combined with other
properties
ease of navigation
high clustering coefficient
Degree correlations
Do high degree nodes tend to link to high degree nodes?
Pastor-Satorras et al.
plot the mean degree of the neighbors as a function of the
degree
Degree correlations
Newman
compute the correlation coefficient of the
degrees of the two endpoints of an edge
assortative/disassortative
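A minimal sketch of Newman's edge-endpoint degree correlation: the Pearson correlation of the degrees at the two ends of each edge, with every edge counted in both orientations so the measure is symmetric (the example graph is an assumption):

```python
import math

def assortativity(adj):
    """Pearson correlation of degrees across edge endpoints."""
    xs, ys = [], []
    for u in adj:
        for v in adj[u]:          # each undirected edge appears twice
            xs.append(len(adj[u]))
            ys.append(len(adj[v]))
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)        # > 0: assortative, < 0: disassortative

adj = {1: {2, 3, 4, 5}, 2: {1, 3}, 3: {1, 2}, 4: {1}, 5: {1}}
print(assortativity(adj))         # negative: the hub links to low-degree nodes
```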
Collective Statistics (M. Newman 2003)
Connected components
For undirected graphs, the size and
distribution of the connected components
is there a giant component?
For directed graphs, the size and
distribution of strongly and weakly
connected components
Network Resilience
Study how the graph properties change when performing
random or targeted node deletions
The history of the Web
Vannevar Bush – “As we may think” (1945)
The “MEMEX”: a photo-electro-mechanical
device that stores documents and images and
allows one to create and follow links between them
The history of the Web
Tim Berners-Lee
1980 – CERN: writes the notebook program
“Enquire Within Upon Everything”,
which allows links to be made between
arbitrary nodes
1989 – CERN: Circulates the document
“Information management: a proposal”
1990 – CERN: The first Web browser, and the first
Web server-client communication
1994 : The creation of the WWW consortium (W3C)
The history of the Web
Hypertext 1991: Tim Berners-Lee’s paper on the WWW
was accepted only as a poster
Today
The Web consists of hundreds of billions
of pages
It is considered one of the biggest
information revolutions in recent human
history
The Web graph
A graph G = (V, E) is defined by
a set V of vertices (nodes)
a set E of edges (links) = pairs of nodes
The Web page graph (directed)
V is the set of static public pages
E is the set of static hyperlinks
Many more graphs can be defined
The host graph
The co-citation graph
etc
Which pages do we care about?
We want to avoid “dynamic” pages
catalogs
pages generated by queries
pages generated by cgi-scripts (the
Nostradamus effect)
We are only interested in “static” web
pages
The Static Public Web
Static
not the result of a cgi-bin script
no “?” in the URL
doesn’t change very often
etc.
Public
no password required
no robots.txt exclusion
no “noindex” meta tag
etc.
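A minimal sketch of the static/public tests above, using only the standard library. The heuristics follow the slide's rules, the URLs are hypothetical, and the "noindex" meta-tag check (which requires fetching the page body) is omitted.

```python
from urllib import robotparser
from urllib.parse import urlparse

def looks_static(url):
    """Heuristic: reject query strings and cgi-bin paths."""
    parsed = urlparse(url)
    if "?" in url or "cgi-bin" in parsed.path:
        return False  # likely generated by a query or a script
    return True

def is_public(url, user_agent="*"):
    """Check the host's robots.txt exclusion rules (network request)."""
    parsed = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    rp.read()  # fetches robots.txt; may raise on network failure
    return rp.can_fetch(user_agent, url)

print(looks_static("http://www.example.com/catalog?page=3"))  # False
```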
These rules can still be fooled
“Dynamic pages” appear static
• browsable catalogs (hierarchy built from a DB)
Spider traps -- infinite URL descent
• www.x.com/home/home/home/…./home/home.html
Spammer games
Why do we care about the Web graph?
Is it the largest human artifact ever created?
Exploit the Web structure for
crawlers
search and link analysis ranking
spam detection
community discovery
classification/organization
Predict the Web future
mathematical models
algorithm analysis
sociological understanding
New business opportunities
New politics
The first question: what is the size of the Web?
Surprisingly hard to answer
Naïve solution: keep crawling until the whole graph has been
explored
Extremely simple, but wrong: crawling is complicated
because the Web is complicated
spamming
duplicates
mirrors
Simple example of a complication: soft 404s
When a page does not exist, the server is supposed to return error
code 404
Many servers do not return an error code; they keep the visitor on the
site, or simply send them to the home page (see the detection sketch below)
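A minimal sketch of one standard way (an assumption, not from the slides) to detect soft 404s: request a URL that almost certainly does not exist on the host, and see whether the server still answers 200.

```python
import uuid
from urllib.error import HTTPError
from urllib.request import urlopen

def serves_soft_404(host):
    """Probe a host with a URL that almost surely does not exist."""
    probe = f"http://{host}/{uuid.uuid4().hex}.html"  # random, nonexistent page
    try:
        with urlopen(probe) as resp:
            return resp.status == 200  # 200 for a missing page: soft 404
    except HTTPError:
        return False                   # a proper 404/410: well-behaved server

# Hypothetical usage (performs a real network request):
# print(serves_soft_404("www.example.com"))
```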
A sampling approach
Sample pages uniformly at random
Compute the percentage of the pages that
belong to a search engine repository (search
engine coverage)
Estimate the size of the Web
Problems:
how do you sample a page uniformly at random?
how do you test if a page is indexed by a search
engine?
Sampling pages [LG98, HHMN00]
Generate IP addresses uniformly at random
problems with virtual hosting, spamming
Starting from a subset of pages, perform a
random walk on the graph; after “enough”
steps you should end up at a random page
near-uniform sampling (see the sketch below)
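A minimal sketch of random-walk sampling. A plain random walk on an undirected graph converges to a degree-proportional distribution, not a uniform one; the Metropolis-Hastings acceptance step below (a standard correction, assumed here rather than taken verbatim from [HHMN00]) makes the stationary distribution uniform.

```python
import random

def mh_random_walk(adj, start, steps, seed=0):
    """Metropolis-corrected random walk; returns the final node."""
    rng = random.Random(seed)
    u = start
    for _ in range(steps):
        v = rng.choice(sorted(adj[u]))
        # accept the move with probability min(1, deg(u)/deg(v)),
        # which makes the uniform distribution stationary
        if rng.random() < min(1.0, len(adj[u]) / len(adj[v])):
            u = v
    return u  # after "enough" steps, close to a uniform sample

adj = {1: {2, 3, 4, 5}, 2: {1, 3}, 3: {1, 2}, 4: {1}, 5: {1}}
print(mh_random_walk(adj, start=1, steps=1000))
```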
Testing search engine containment
[BB98]
Find the r least frequent terms in a page and issue a
query with these terms to the
search engine
a “strong query”
Measuring the Web
It is clear that the Web that we see is what the
crawler discovers
We need large crawls in order to make
meaningful measurements
The measurements are still biased by
the crawling policy
size limitations of the crawl
Perturbations of the "natural" process of birth and
death of nodes and links
Measures on the Web graph [BKMRRSTW00]
Degree distributions
The global picture
what does the Web look like from afar?
Reachability
Connected components
Community structure
The finer picture
In-degree distribution
Power-law distribution with exponent 2.1
Out-degree distribution
Power-law distribution with exponent 2.7
The good news
The fact that the exponent is greater than
2 implies that the expected value of the
degree is a constant (not growing with n)
Therefore, the expected number of edges
is linear in the number of nodes n
This is good news, since we cannot
handle anything more than linear
Connected components – definitions
Weakly connected components (WCC)
Set of nodes such that from any node one can reach any other
node via an undirected path
Strongly connected components (SCC)
Set of nodes such that from any node one can reach any other
node via a directed path.
[Figure: an example graph with its WCCs and SCCs highlighted; see the sketch below for computing both]
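A minimal sketch computing both kinds of components: WCCs by BFS on the undirected version of the graph, SCCs by Kosaraju's two-pass algorithm; the tiny digraph at the end is illustrative.

```python
from collections import deque

def wccs(adj):
    """Weakly connected components: BFS on the undirected version."""
    und = {u: set() for u in adj}
    for u in adj:
        for v in adj[u]:
            und[u].add(v)
            und[v].add(u)
    seen, comps = set(), []
    for s in und:
        if s in seen:
            continue
        comp, q = {s}, deque([s])
        seen.add(s)
        while q:
            u = q.popleft()
            for v in und[u]:
                if v not in seen:
                    seen.add(v)
                    comp.add(v)
                    q.append(v)
        comps.append(comp)
    return comps

def sccs(adj):
    """Strongly connected components via Kosaraju's two DFS passes."""
    order, seen = [], set()
    def dfs(u):                      # first pass: record finish order
        seen.add(u)
        for v in adj[u]:
            if v not in seen:
                dfs(v)
        order.append(u)
    for u in adj:
        if u not in seen:
            dfs(u)
    rev = {u: set() for u in adj}    # second pass: reversed graph
    for u in adj:
        for v in adj[u]:
            rev[v].add(u)
    seen, comps = set(), []
    for u in reversed(order):
        if u in seen:
            continue
        comp, stack = set(), [u]
        while stack:
            x = stack.pop()
            if x in seen:
                continue
            seen.add(x)
            comp.add(x)
            stack.extend(rev[x] - seen)
        comps.append(comp)
    return comps

g = {1: {2}, 2: {3}, 3: {1, 4}, 4: set()}  # {1,2,3} is an SCC; all one WCC
print(wccs(g), sccs(g))
```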
The bow-tie structure of the Web
SCC and WCC distribution
The SCC and WCC sizes follow a power-law
distribution
the second-largest SCC is significantly
smaller
The inner structure of the bow-tie
[DLMT05]
What do the individual components of the
bow tie look like?
They obey the same power laws in the degree
distributions
The inner structure of the bow-tie
Is it the case that the bow-tie repeats itself in
each of the components (self-similarity)?
It would look nice, but this does not seem to be the
case
no large WCC, many small ones
The daisy structure?
Large connected core, and highly fragmented IN
and OUT components
Unfortunately, we do not have a large crawl to
verify this hypothesis
A different kind of self-similarity [DKCRST01]
Consider Thematically Unified Clusters
(TUC): pages grouped by
keyword searches
web location (intranets)
geography
hostgraph
random collections
All such TUCs exhibit a bow-tie structure!
Self-similarity
The Web consists of a collection of self-similar structures that form
the backbone of the SCC
Is the Web a small world?
Based on a simple model, [AJB99]
predicted that most pages are within 19
links of each other. Justified the model by
crawling nd.edu (1999)
Well, not really!
Distance measurements [BKMRRSTW00]
The probability that there exists a directed path between
two nodes is ~25%
Therefore, for ~75% of the nodes there exists no path that
connects them
Average directed distance between two nodes in the
CORE: ~16
Average undirected distance between two nodes in the
CORE: ~7
Maximum directed distance between two nodes in the
CORE: >28
Maximum directed distance between any two nodes in
the graph: > 900
Community discovery [KRRT99]
Hubs and authorities
hubs: pages that point to (many good)
pages
authorities: pages that are pointed to
by (many good) pages
Find the (i,j) bipartite cliques of hubs
and authorities
intuition: these are the core of a
community
grow the core to obtain the community
Bipartite cores
Computation of bipartite cores requires
heuristics for handling the Web graph
iterative pruning steps
Surprisingly large number of bipartite cores
led to the copying model for the Web
Discovery of unusual communities of enthusiasts
Australian fire brigades
Web as a graph
Information about the growth of information
Knowledge
Structure
Growth
Not a random graph
Self similar structures in the whole
Average degree constant (not growing with n)
Average path length ~ log n
References
Vannevar Bush, As we may think, The Atlantic Monthly, July 1945
[BB98] K. Bharat and A. Broder. A technique for measuring the relative size and overlap of public
Web search engines. Proc. 7th International World Wide Web Conference, 1998.
[HHMN00] M. Henzinger, A. Heydon, M. Mitzenmacher, and M. Najork. On Near-Uniform URL
Sampling . 9th International World Wide Web Conference, May 2000.
[LG98] S. Lawrence, C. L. Giles, Searching the World Wide Web, Science 280, 98-100 (1998).
[AJB99] R. Albert, H. Jeong, and A.-L. Barabási, Diameter of the World-Wide Web, Nature, 401,
130-131 (1999).
[BKMRRSTW00] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A.
Tomkins, J. Wiener. Graph structure in the web. 9th International World Wide Web Conference,
May 2000.
[DKCRST01] S. Dill, R. Kumar, K. McCurley, S. Rajagopalan, D. Sivakumar, A. Tomkins. Self-similarity in the Web. 27th International Conference on Very Large Data Bases, 2001.
[KRRT99] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the Web for cyber
communities, Proc. 8th WWW , Apr 1999.
[EMC03] Nadav Eiron and Kevin S. McCurley, Locality, Hierarchy, and Bidirectionality on the Web,
Workshop on Web Algorithms and Models, 2003.
[RSWW] K. Randall, R. Stata, R. Wickremesinghe, J. Wiener, The Link Database: Fast Access to
Graphs of the Web, Technical Report
[BBHKV98] K. Bharat, A. Broder, M. Henzinger, P. Kumar, and S. Venkatasubramanian. The
connectivity server: fast access to linkage information on the web, Proc. 7th WWW, 1998.
[BV04] P. Boldi, S. Vigna, The Webgraph framework I: Compression Techniques, WWW 2004
[DLMT05] D. Donato, S. Leonardi, S. Milliozzi, P. Tsaparas, Mining the inner structure of the bow-tie, unpublished.