Information Networks

Download Report

Transcript Information Networks

Information Networks - Web
Graphs
Thanks to: P. Tsaparas, P. Baldi,
P. Frasconi, P. Smyth
Outline
 Networks - motivation
 Past and present
 Real vs random
 Review of graphs and measurements
 Scale free - power laws
 Web as a graph/network
 Why important
What is a network?
 Network: a collection of entities that are
interconnected with links.




people that are friends
computers that are interconnected
web pages that point to each other
proteins that interact
Graphs
 In mathematics, networks are called graphs, the
entities are nodes, and the links are edges
 Graph theory starts in the 18th century, with Leonhard
Euler
 The problem of Königsberg bridges
 Since then graphs have been studied extensively.
Networks in the past
 Graphs have been used in the past to
model existing networks (e.g., networks of
highways, social networks)
 usually these networks were small
 network can be studied visual inspection can
reveal a lot of information
Networks now
 More and larger networks appear
 Products of technological advancement
• e.g., Internet, Web
 Result of our ability to collect more, better,
and more complex data
• e.g., gene regulatory networks
 Networks of thousands, millions, or billions
of nodes
 impossible to visualize
The internet map
Understanding large graphs
 What are the statistics of real life
networks?
 Can we explain how the networks were
generated?
Measuring network properties
 Around 1999
 Watts and Strogatz, Dynamics and smallworld phenomenon
 Faloutsos3, On power-law relationships of the
Internet Topology
 Kleinberg et al., The Web as a graph
 Barabasi and Albert, The emergence of
scaling in real networks
Real network properties
 Most nodes have only a small number of neighbors
(degree), but there are some nodes with very high
degree (power-law degree distribution)
 scale-free networks
 If a node x is connected to y and z, then y and z are
likely to be connected
 high clustering coefficient
 Most nodes are just a few edges away on average.
 small world networks
 Networks from very diverse areas (from internet to
biological networks) have similar properties
 Is it possible that there is a unifying underlying generative
process?
Generating random graphs
 Classic graph theory model (Erdös-Renyi)
 each edge is generated independently with probability p
 Very well studied model but:
 most vertices have about the same degree
 the probability of two nodes being linked is independent of
whether they share a neighbor
 the average paths are short
Modeling real networks
 Real life networks are not “random”
 Can we define a model that generates
graphs with statistical properties similar to
those in real life?
 a flurry of models for random graphs
Processes on networks
 Why is it important to understand the
structure of networks?
 Epidemiology: Viruses propagate much
faster in scale-free networks
 Vaccination of random nodes does not work,
but targeted vaccination is very effective
 Random sampling can be dangerous!
Web search
 First generation search engines: the Web as a collection
of documents
 Suffered from spammers, poor, unstructured, unsupervised
content, increase in Web size
 Second generation search engines: the Web as a
network
 use the anchor text of links for annotation
 good pages should be pointed to by many pages
 good pages should be pointed to by many good pages
• PageRank algorithm, Google!
• Hits (hubs and authorities)
• Most search engines ranking methods
The future of networks
 Networks seem to be here to stay
 More and more systems are modeled as
networks
 Scientists from various disciplines are working
on networks (physicists, computer scientists,
mathematicians, biologists, sociologist,
economists)
 There are many questions to understand.
Mathematical Tools
 Graph theory
 Probability theory
 Linear Algebra
Types of networks




Social networks
Knowledge (Information) networks
Technology networks
Biological networks
Social Networks
 Links denote a social interaction
 Networks of acquaintances
 collaboration networks
• actor networks
• co-authorship networks
• director networks






phone-call networks
e-mail networks
IM networks
Bluetooth networks
sexual networks
home page/blog networks
Knowledge (Information) Networks
 Nodes store information, links associate
information






Citation network (directed acyclic)
The Web (directed)
Peer-to-Peer networks
Word networks
Networks of Trust
Software graphs
Technological networks
 Networks built for distribution of commodity
 The Internet
• router level, AS level




Power Grids
Airline networks
Telephone networks
Transportation Networks
• roads, railways, pedestrian traffic
Biological networks
 Biological systems represented as networks






Protein-Protein Interaction Networks
Gene regulation networks
Gene co-expression networks
Metabolic pathways
The Food Web
Neural Networks
Measuring Networks






Degree distributions
Small world phenomena
Clustering Coefficient
Mixing patterns
Degree correlations
Communities and clusters
The basic random graph model
 The measurements on real networks are usually
compared against those on “random networks”
 The basic Gn,p (Erdös-Renyi) random graph model:
 n : the number of vertices
 0≤p≤1
 for each pair (i,j), generate the edge (i,j) independently with
probability p
Degree distributions
frequency
fk = fraction of nodes with degree k
= probability of a randomly
selected node to have degree k
fk
k
degree
 Problem: find the probability distribution that best fits the
observed data
Power-law distributions
 The degree distributions of most real-life networks follow a power
law
p(k) = Ck-a
 Right-skewed/Heavy-tail distribution
 there is a non-negligible fraction of nodes that has very high degree
(hubs)
 scale-free: no characteristic scale, average is not informative
 In stark contrast with the random graph model!
 Poisson degree distribution, z=np
zk  z
p(k)  P(k; z)  e
k!
 highly concentrated around the mean
 the probability of very high degree nodes is exponentially small
Power-law signature
 Power-law distribution gives a line in the log-log plot
log p(k) = -a logk + logC
log frequency
frequency
degree
α
log degree
 a : power-law exponent (typically 2 ≤ a ≤ 3)
Examples of degree distribution for power laws
Taken from [Newman 2003]
A random graph example
Exponential distribution
 Observed in some technological or collaboration
networks
p(k) = le-lk
 Identified by a line in the log-linear plot
log p(k) = - lk + log l
log frequency
λ
degree
Average/Expected degree
 For random graphs z = np
 For power-law distributed degree
 if a ≥ 2, it is a constant
 if a < 2, it diverges
Maximum degree
 For random graphs, the maximum degree
is highly concentrated around the average
degree z
 For power law graphs
k max  n1/(α1)
Collective Statistics (M. Newman 2003)
Clustering (Transitivity) coefficient
 Measures the density of triangles (local
clusters) in the graph
 Two different ways to measure it:
C (1)
 triangles centered at node i

 triples centered at node i
i
i
 The ratio of the means
Example
1
4
3
2
5
C
(1)
3
3


1 1  6 8
Clustering (Transitivity) coefficient
 Clustering coefficient for node i
triangles centered at node i
Ci 
triples centered at node i
C
(2)
1
 Ci
n
 The mean of the ratios
Example
1
4
C (2) 
1
1  1  1 6   13
5
30
C (1) 
3
8
3
2
5
 The two clustering coefficients give different
measures
 C(2) increases with nodes with low degree
Collective Statistics (M. Newman 2003)
Clustering coefficient for random graphs
 The probability of two of your neighbors also being
neighbors is p, independent of local structure
 clustering coefficient C = p
 when z is fixed C = z/n =O(1/n)
The C(k) distribution
 The C(k) distribution is supposed to capture the
hierarchical nature of the network
 when constant: no hierarchy
 when power-law: hierarchy
C(k) = average clustering coefficient
of nodes with degree k
C(k)
k
degree
Millgram’s small world experiment
 Letters were handed out to people in Nebraska to be
sent to a target in Boston
 People were instructed to pass on the letters to someone
they knew on first-name basis
 The letters that reached the destination followed paths of
length around 6
 Six degrees of separation: (play of John Guare)
 Also:
 The Kevin Bacon game
 The Erdös number
 Small world project:
http://smallworld.columbia.edu/index.html
Measuring the small world phenomenon
 dij = shortest path between i and j
 Diameter:
d  max dij
i, j
 Characteristic path length:

1
dij

n(n - 1)/2 i j
 Harmonic mean
 1 
1
-1
d

n(n - 1)/2 i j ij
 Also, distribution of all shortest paths
Collective Statistics (M. Newman 2003)
Is the path length enough?
 Random graphs have diameter
logn
d
logz
 d=logn/loglogn when z=ω(logn)
 Short paths should be combined with other
properties
 ease of navigation
 high clustering coefficient
Degree correlations
 Do high degree nodes tend to link to high degree nodes?
 Pastor Satoras et al.
 plot the mean degree of the neighbors as a function of the
degree
Degree correlations
 Newman
 compute the correlation coefficient of the
degrees of the two endpoints of an edge
 assortative/disassortative
Collective Statistics (M. Newman 2003)
Connected components
 For undirected graphs, the size and
distribution of the connected components
 is there a giant component?
 For directed graphs, the size and
distribution of strongly and weakly
connected components
Network Resilience
 Study how the graph properties change when performing
random or targeted node deletions
The history of the Web
Vannevar Bush – “As we may think” (1945)
The “MEMEX”: A photo-electrical-mechanical
device that stores documents and images and
allows to create and follow links between them
The history of the Web
Tim Berners-Lee
1980 – CERN: Writes a notebook program
“Enquire-upon-within-everything”
that allows links to be made between
arbitrary nodes
1989 – CERN: Circulates the document
“Information management: a proposal”
1990 – CERN: The first Web browser, and the first
Web server-client communication
1994 : The creation of the WWW consortium (W3C)
The history of the Web
The history of the Web
Hypertext 1991: Tim Berners-Lee paper on WWW
was accepted only as a poster
Today
 The Web consists of hundreds of billions
of pages
 It is considered one of the biggest
information revolutions in recent human
history
The Web graph
 A graph G = (V, E) is defined by
 a set V of vertices (nodes)
 a set E of edges (links) = pairs of nodes
 The Web page graph (directed)
 V is the set of static public pages
 E is the set of static hyperlinks
 Many more graphs can be defined
 The host graph
 The co-citation graph
 etc
Which pages do we care for?
 We want to avoid “dynamic” pages
 catalogs
 pages generated by queries
 pages generated by cgi-scripts (the
nostradamus effect)
 We are only interested in “static” web
pages
The Static Public Web
Static
 not the result of a cgi-bin
scripts
 no “?” in the URL
 doesn’t change very often
 etc.
 Public




no password required
no robots.txt exclusion
no “noindex” meta tag
etc.
 These rules can still be fooled
 “Dynamic pages” appear static
• browseable catalogs (Hierarchy built from DB)
 Spider traps -- infinite url descent
• www.x.com/home/home/home/…./home/home.html
 Spammer games
Why do we care about the Web graph?
 Is it the largest human artifact ever created?
 Exploit the Web structure for





crawlers
search and link analysis ranking
spam detection
community discovery
classification/organization
 Predict the Web future





mathematical models
algorithm analysis
sociological understanding
New business opportunities
New politics
The first question: what is the size of the Web?
 Surprisingly hard to answer
 Naïve solution: keep crawling until the whole graph has been
explored
 Extremely simple but wrong solution: crawling is complicated
because the web is complicated
 spamming
 duplicates
 mirrors
 Simple example of a complication: Soft 404
 When a page does not exists, the server is supposed to return an error
code = “404”
 Many servers do not return an error code, but keep the visitor on site, or
simply send him to the home page
A sampling approach
 Sample pages uniformly at random
 Compute the percentage of the pages that
belong to a search engine repository (search
engine coverage)
 Estimate the size of the Web
 Problems:
 how do you sample a page uniformly at random?
 how do you test if a page is indexed by a search
engine?
Sampling pages [LG98, HHMN00]
 Create IP addresses uniformly at random
 problems with virtual hosting, spamming
 Starting from a subset of pages perform a
random walk on the graph. After “enough”
steps you should end up in a random
page.
 near uniform sampling
Testing search engine containment
[BB98]
 Find r of the least frequent terms and
generate a query with these terms to the
search engine
 “strong query”
Measuring the Web
 It is clear that the Web that we see is what the
crawler discovers
 We need large crawls in order to make
meaningful measurements
 The measurements are still biased by
 the crawling policy
 size limitations of the crawl
 Perturbations of the "natural" process of birth and
death of nodes and links
Measures on the Web graph [BKMRRSTW00]
 Degree distributions
 The global picture
 what does the Web look like from a far?




Reachability
Connected components
Community structure
The finer picture
In-degree distribution
 Power-law distribution with exponent 2.1
Out-degree distribution
 Power-law distribution with exponent 2.7
The good news
 The fact that the exponent is greater than
2 implies that the expected value of the
degree is a constant (not growing with n)
 Therefore, the expected number of edges
is linear in the number of nodes n
 This is good news, since we cannot
handle anything more than linear
Connected components – definitions
 Weakly connected components (WCC)
 Set of nodes such that from any node can go to any node via an
undirected path
 Strongly connected components (SCC)
 Set of nodes such that from any node can go to any node via a
directed path.
WCC
SCC
The bow-tie structure of the Web
SCC and WCC distribution
 The SCC and WCC sizes follows a power
law distribution
 the second largest SCC is significantly
smaller
The inner structure of the bow-tie
[LMST05]
 What do the individual components of the
bow tie look like?
 They obey the same power laws in the degree
distributions
The inner structure of the bow-tie
 Is it the case that the bow-tie repeats itself in
each of the components (self-similarity)?
 It would look nice, but this does not seem to be the
case
 no large WCC, many small ones
The daisy structure?
 Large connected core, and highly fragmented IN
and OUT components
 Unfortunately, we do not have a large crawl to
verify this hypothesis
A different kind of self-similarity [DKCRST01]
 Consider Thematically Unified Clusters
(TUC): pages grouped by





keyword searches
web location (intranets)
geography
hostgraph
random collections
 All such TUCs exhibit a bow-tie structure!
Self-similarity
 The Web consists of a collection of selfsimilar structures that form a backbone of
the SCC
Is the Web a small world?
 Based on a simple model, [AJB99]
predicted that most pages are within 19
links of each other. Justified the model by
crawling nd.edu (1999)
 Well, not really!
Distance measurements [BKMRRSTW00]
 The probability that there exists a directed path between
two nodes is ~25%
 Therefore, for ~75% of the nodes there exists no path that
connects them
 Average directed distance between two nodes in the
CORE: ~16
 Average undirected distance between two nodes in the
CORE: ~7
 Maximum directed distance between two nodes in the
CORE: >28
 Maximum directed distance between any two nodes in
the graph: > 900
Community discovery [KRRT99]
 Hubs and authorities
 hubs: pages that point to (many good)
pages
 authorities: pages that are pointed to
by (many good) pages
 Find the (i,j) bipartite cliques of hubs
and authorities
 intuition: these are the core of a
community
 grow the core to obtain the community
Bipartite cores
 Computation of bipartite cores requires
heuristics for handling the Web graph
 iterative pruning steps
 Surprisingly large number of bipartite cores
 lead to the copying model for the Web
 Discovery of unusual communities of enthusiasts
 Australian fire brigades
Web as a graph
 Information about the growth of information
 Knowledge
 Structure
 Growth
 Not a random graph
 Self similar structures in the whole
 Average degree constant (not growing with n)
 Average path log n
References













Vannevar Bush, As we may think, The Atlantic Monthly, July 1945
[BB98] K. Bharat and A. Broder. A technique for measuring the relative size and overlap of public
Web search engines. Proc. 7th International World Wide Web Conference, 1998.
[HHMN00] M. Henzinger, A. Heydon, M. Mitzenmacher, and M. Najork. On Near-Uniform URL
Sampling . 9th International World Wide Web Conference, May 2000.
[LG98] S. Lawrence, C. L. Gilles, Searching the World Wide Web, Science 280, 98-100 (1998).
[AJB99] A. Albert, H. Jeong, and A.-L. Barabási, Diameter of the World Wide Web, Nature,401,
130-131 (1999).
[BKMRRSTW00] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A.
Tomkins, J. Wiener. Graph structure in the web. 9th International World Wide Web Conference,
May 2000.
[DKCRST01] S. Dill, R. Kumar, K. McCurley, S. Rajagopalan, D. Sivakumar, A. Tomkins. Selfsimilarity in the Web. 27th International Conference on Very Large Data Bases, 2001.
[KRRT99] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the Web for cyber
communities, Proc. 8th WWW , Apr 1999.
[EMC03] Nadav Eiron and Kevin S. McCurley, Locality, Hierarchy, and Bidirectionality on the Web,
Workshop on Web Algorithms and Models, 2003.
[RSWW] K. Randall, R. Stata, R. Wickremesinghe, J. Wiener, The Link Database: Fast Access to
Graphs of the Web, Technical Report
[BBHKV98] K. Bharat, A. Broder, M. Henzinger, P. Kumar, and S. Venkatasubramanian. The
connectivity server: fast access to linkage information on the web, Proc. 7th WWW, 1998.
[BV04] P. Boldi, S. Vigna, The Webgraph framework I: Compression Techniques, WWW 2004
[DLMT05] D. Donato, S. Leonardi, S. Milliozzi, P. Tsaparas, Mining the Inner Structure of the bowtie, unpublished