Transcript Document

The Web as a Graph, Models and Algorithms
Sridhar Rajagopalan
IBM Almaden Research Center
Sridhar Rajagopalan
Graphs in Computer Science: How do they help?
• What do they model?
– An abstraction in Core CS.
• Examples: VLSI Circuits, Communication Networks, Logical flow of a computer program, Data
structures.
– An abstraction for data and relationships.
• Examples: The Web, Social Networks, Flows and Flow Networks, Biological Data,
Taxonomies, Citations, Explicit relations within a DB system.
• What aspects are studied?
– Algorithms, Data Structures and Complexity Theory.
– Characterization and Modeling of Graphs.
– Implementations of Graph Algorithms in Specific contexts.
• What is new?
– Some graphs are getting very very large.
• Several Web crawls have over 2 Billion pages (nodes) and 10 times as many edges.
• Some social networks (Telephone call graphs, for instance) are huge.
– Sensors and small devices will output enormous amounts of data.
Sridhar Rajagopalan
Why and how does scale change the game?
• It is not about an instance, it is about THE instance.
– Systems and Software is designed from scratch to solve one (or a few) instances of the
problem.
– Properties of the instance are important. Generality and genericity of the code base are
not critical.
– An answer to a single instance can be worth a lot.
• A new kind of algorithm and system design which is very engineering
orientated.
– Non-viability of the random access model.
– Hardware and software co-design..
• “Time to market” considerations encourage careful approximations.
– Statistical approaches are valuable.
– Do not need the complete answer, even a partial high quality result can be very valuable.
Sridhar Rajagopalan
What does a system and programming model for processing large data sets
have to do?
• System and programming model for processing large graphs.
– Exceptions will occur.
– Components (both hard and soft) will fail.
– Data structures will exceed available memory.
• Aware of statistical issues.
– Approximate or incomplete results are usually good enough.
– What happens when you string (even well understood) approximate
techniques together.
Sridhar Rajagopalan
Commodity Computing Platforms
1.
Disk based storage is plentiful and cheap.
•
2.
Current price less than $1.5 dollars/gigabyte.
Memory hierarchies are complicated and pipelines are deep.
•
•
•
•
3.
Large gap between random and sequential access, even within main memory.
Random access in main is slower than sequential access to disk.
CPU is underutilized, especially in data intensive applications.
Where have all the cycles gone?
Parallelism is great in theory, but hard in practice.
•
Sridhar Rajagopalan
Naïve data partitioning is the only practical option to keep system complexity under
control.
The Streaming Model [Munro-Paterson]
• Data is seen as a sequence of elements.
– Proposed as a model for large data sets, as well as data from sensors.
• Issues:
– Memory usage.
– Passes required over the data.
– Many variations.
• Sorting is hard. As are almost all interesting combinatorial/graph
theoretic problems.
• Exact computation of statistical functions are hard. Approximation is
possible.
• Relationships to communication complexity and information complexity.
Sridhar Rajagopalan
Sorting
• Sorting large a large data set on commodity hardware is a solved problem.
– Google sorts more than 100 B terms in its index.
– SPSort.
– Penny Sort, Minute Sort.
• But Sorting well requires a great deal of care and customization to the
hardware platform.
• What is the cost of indexing the Web? 2B documents, each with 500 words =
1 Trillion records. Cost of index build per Penny Sort is under 50 bucks.
• Networking speeds make sharing large quantities of streaming data possible.
Sridhar Rajagopalan
Model 1: Stream + Sort
• Basic multi-pass data stream model with access to a Sort box.
• Quite powerful.
–
–
–
–
–
–
Can do entire index build (including PageRank like calculations).
Spanning Tree, Connected Components, MinCut, STCONN, Bipartiteness.
Exact computations of order statistics and frequency moments.
Suffix Tree/Suffix Array build.
Red/Blue segment intersection.
So strictly stronger than just streaming.
Theorem : NC  StrSort
Sridhar Rajagopalan
Questions About the Web Graph
•
•
•
•
Size: how big is the graph?How many
links on a page (outdegree)? How many
links to a page (indegree)?
Connectedness: can one browse from any
web page to any other? How many clicks?
Sampling: can we pick a random page on
the web?
Browsing: how different is browsing from a
“random walk”?
Sridhar Rajagopalan
•
•
Applications: can we exploit the structure
of the web graph for searching and
mining?
Models: what does the web graph reveal
about social processes which result in its
creation and dynamics?
A Picture of the Web Graph
Sridhar Rajagopalan
Broder et.al., 2000
Power Laws: A Curious Statistic About the Web
•
•
Indegree outdegree distributions of the web graph are distributed by the power law.
Component size distributions are distributed by the power law.
Sridhar Rajagopalan
Broder et.al., 2000
Power Laws
• Inverse polynomial tail.
• Word frequency in text. Yule (later Mandelbrot) statistical study of the literary
vocabulary.[Yule, 1944].
• Citation analysis [Lotka, 1926].
• Zipf human behavior and the principle of least effort. [Zipf, 1947].
• Pareto Cours d’economie politique. [Pareto,1897].
• Network graph. [Faloutsos-Faloutsos-Faloutsos, 1999].
• Oligonucleotide sequences [Martindale-Konopka, 1996].
• Access statistics for web pages. (From server logs) [Glassman, 1997].
• User behavior (instrument browsers and proxies) [Lukose-Huberman, 1998,
Crovella and others,1997-99].
• Many other instances.
Sridhar Rajagopalan
The Web Is a Rich Source of Graphs
•
Co-citation graph: C.
– Nodes = web pages. Undirected (weighted) edges = co-citation (weight).
•
Bibliographic coupling graph: B.
– Nodes = web pages. Undirected (weighted) edges = bibliographic overlap.
•
Content similarity graph: S.
– Nodes = web pages. Undirected or directed (weighted) edges = similarity of text on the pages.
•
Namespace tree: T.
– Nodes = URLs. Directed (labeled) edges = parent to child (labeled by extension).
•
Host graph: H.
– Nodes = websites/webhosts. Directed (weighted) edges = (number) links from one host to the
other.
•
Data mining graph: D.
– Bipartite. Two nodes per web page, one per partition. Edges = from LHS to RHS corresponding to
each edge in the web graph.
– Nodes on LHS are “adjacent vertex sets.” On RHS are “items.”
Sridhar Rajagopalan
Applications of Graph Methods
•
Searching.
– Page rank. (Google).
– Hits.
– Clever and variants.
•
Data mining.
– Communities.
– Focused crawling.
– Mirrors.
Sridhar Rajagopalan
•
Estimation methods.
– Sampling.
– Random walks.
•
Browsing and foraging.
– Back links.
– Find similar.
Spectral Analysis: Matrices
• Adjacency matrix A(G).
0
Ai , j  
1
if i and j are not adjacentin G
if i and j are adjacentin G
• Markov chain M(G).
if i and j are not adjacentin G
0
1
M i, j  
if i and j are adjacentin G

 di
d i is theoutdegreeof vertexi
Sridhar Rajagopalan
Search: Eigenvectors of M.
• Page rank comes from the web graph. [Brin and Page, 1998]
PrincipalEigenvector of (1   )M (G)  
• Hub rank comes from bibliographic coupling. [Kleinberg, 1998]
PrincipalEigenvector of A( B)
• Authority rank comes from co-citations. [Kleinberg, 1998]
PrincipalEigenvector of A(C ).
• Latent semantic indexing builds on textual similarity. [Dumais et.al.]
Eigen spaceof M (S ).
Sridhar Rajagopalan
Co-citation and Web Communities.
• Social networks: Milgram – 6DOS – Routing.
• Bibliographic coupling thesis: frequently co-cited web pages are related.
Pages with large bibliographic overlap are related.
• CS problem: enumerate all frequently co-cited groups of web pages.
(Complete bipartite subgraphs). We call these cores.
K (3,3)
Sridhar Rajagopalan
The Cores Are Interesting.
Explicit communities.
Implicit communities
•Yahoo!, Excite, Infoseek
• Hotels in costa rica
•webrings
• Clipart
•news groups
• Japanese elementary schools
• Turkish student associations
•mailing lists
•
•
•
•
Oil spills off the coast of japan
Australian fire brigades
Aviation/aircraft vendors
Guitar manufacturers
(1) Implicit communities are defined by cores.
(2) There are an order of magnitude more implicit communities.
(3) Very reliable. Over 97% (sampled) make sense.
Sridhar Rajagopalan
More Applications.
• Find similar: look at neighborhood in co-citation graph (C).
• Bidirectional browsing: edges are reversed and a list of pages which point to
the currently visible one is made available.
• Mirror detection: find identical sub-trees in the namespace tree.
Sridhar Rajagopalan
Random Graphs
•
•
•
•
Erdos and Renyi’s Gn, pmodel.
Graph with n vertices.
Each of n(n-1) arcs appear independently with probability p.
Graphical evolution [Palmer]: study properties of the resulting random graph
as p is increased from 0 to 1.
0
Sridhar Rajagopalan
t
1
Facts About the Erdos-Renyi Model
• A random graph with average degree 4 has a giant connected component
containing almost all (90%) of the vertices.
• Indegrees and outdegrees are concentrated around the mean. And have
exponentially declining tails.
• Most vertices in the graph are close to most others (small world).
Sridhar Rajagopalan
New Models
1.
2.
3.
4.
Ad Hoc. [Aiello, Chung and Lu, 2000] Pick a random graph which satisfies
degree distribution constraints.
Copying models. [Barabasi-Albert, KRRT, 2000-01].
Local optimization models. [Papadimitriou et.al. , 2001].
See [Mitzenmacher 2000] survey.
Sridhar Rajagopalan