degree correlation

Download Report

Transcript degree correlation

Decoding the Structure of the
WWW : A Comparative Analysis of
Web Crawls
AUTHORS:
M.Angeles Serrano
Ana Maguitman
Marian Boguna
Santo Fortunato
Alessandro Vespignani
AGENDA
•
•
•
•
•
•
•
INTRODUCTION
MAIN OBJECTIVE
MAIN CONTRIBUTIONS
PREVIOS RESEARCH AND RESULTS
RELATED WORK DONE AND RESULTS
IMPLEMENATION
CONCLUSION AND FUTURE WORK
INTRODUCTION
• Topological Structure of the “World Wide Web” can
be represented by the properties of it’s
representative graphs.
• Vertices identified with Web Pages and Directed
Edges identified with Hyperlinks.
• The Comparative Analysis of different WWW graphs
differ Quantitatively, Qualitatively depending on the
domain and crawl used for gathering data.
• Degree Distribution, Degree-Degree Correlation
Functions , Statistics of Reciprocal Connections are
used as measurement for the analysis of Web
Graphs.
CONT.
• The dynamical nature of the web and its huge
size make very difficult the process of
compressing, ranking, indexing or mining the
web.
• Exact policies and strategies followed by Crawl
engines helps in decoding the huge structure
of WWW.
MAIN OBJECTIVE
• Give the clear picture on the reliability of the
widely accepted large-scale statistical
properties of the Web and provide the new
measures to discover whether or not
inconsistencies are found when measuring the
same properties across different crawls.
MAIN CONTRIBUTIONS
• A careful comparative analysis of the structural
and statistical large-scale different Web graphs,
with evident qualitative and quantitative
differences across different samples.
• Introduced single and two-vertex correlations
and Reciprocal Linksfor a full connectivity
pattern and structural ordering of the Web
graph. And these all properties depend on the
communication
patterns
among
the
constituent sites of the network.
PREVIOUS RESEARCHES AND RESULTS
• Measurement of the Directed Degree Distributions
P(kin) and P(kout), where the in/out-degree, kin or
kout respectively, is defined as the number of
incoming/outgoing links connecting a page to its
neighbors.
• Kumar et al. [1999] worked on a big crawl of about 40M
nodes, and that by Barab´ asi and Albert [1999] on a
smaller set of over 0.3M nodes restricted to the domain
of the University of Notre Dame, resulted a scale-free
nature for the WWW with power-law behaviors both for
the in- and out-degree distributions.
CONT.
• Border et al. [2000] worked on two sets from AltaVista crawls,
in May and October for year 1999 with 200 million pages and
1.5 billion links. And concluded that the structure of the Web
was relatively insensitive to the particular large crawl used
and the connectivity structure was resilient to the removal of
a significant number of nodes.
• Donato et al. [2004] worked on the same lines with a large
2001 data set of 200M pages and about 1.4 billion edges
made available by the WebBase project at Stanford. The
obtained results were compared with the ones presented in
the work by Broder et al.[2000]. One of the reported
differences is the deviation from the power-law behavior of
the out-degree distribution.
RELATED WORK DONE AND RESULTS
• Analysis and Comparison for Four Data Sets from years,
from 2001 to 2004, and different domains, general and
.uk and .it domains. The sets have been gathered within
two different projects: theWebBase project and the
WebGraph project, with own Web crawler, WebVac and
UbiCrawler, respectively.
• While pages in the .uk domain have higher probability to
point to pages outside the domain,due to English , and
the links in the Italian .it domain may be much more
endogenous, which could potentially have a high effect
on the Web description derived from the data.
CONT.
Table I. Number of Nodes and Edges of the Networks Considered, After Extracting Multiple
Links and Self-Connections
Data set
WBGC01
WGUK02
WBGC03
WGIT04
# nodes
80,571,247
18,520,486
49,296,313
41,291,594
# links
752,527,660
292,243,663
1,185,396,953
1,135,718,909
• Measurements carried out during research:
1. Structural Properties
2. Degree Correlation
3. The Role of Reciprocal Links
Structural Properties
• Connected Components
1. Strongly Connected Component : All pages mutually connected
by a path.
2. In-Component: Vertices from which it is possible to reach SCC
using directed path.
3. Out-Component: Vertices which can be reached from SCC using
directed path.
4. Tendrils: Pages which cannot reach the SCC and cannot be
reached from it.
5. Tubes: Directly connect the IN and OUT components without
crossing the SCC.
CONT.
Table II Sizes of the IN,OUT, SCC and their union MAIN
CONT.
• Crawlers perform a directed exploration that they follow outgoing
hyperlinks to reach pointed pages, but cannot navigate backwards
using incoming hyperlinks. In summary, the structure of Web
graphs is strongly dependent on the data set considered
Degree Distribution
• For directed networks, the in-degree distribution P(kin) and the
out-degree distribution P(kout),probabilities of having kin
incoming links and kout outgoing links, respectively.
• In-degree of a vertex is the sum of all the hyperlinks incoming
from all the Web pages in the WWW. And there is no limit to the
number of incoming hyperlinks, that is determined only by the
popularity of the Web page itself. Out-degree is determined by the
number of hyperlinks present in the page, which are controlled by
Web administrators.
DEGREE CORRELATION
• Degree Correlation between In and Out degree
can
determine that the n/w will or will not have a bow structure
and for the model validation quantitatively.
1. Single Vertex Degree Correlation
• Shows that more popular pages tend to point to a higher
number of other pages. This positive correlation is found to
be true for a range of in-degrees that spans from kin = 1 to kin
=102∼ 103, depending on the specific set. The set for the
Italian domain is more noisy, but this pattern appears to be
independent of the crawl used to gather the data.
CONT.
2. Two Vertex Degree Correlation
• The implication from the Two Vertex Degree Correlation can help
in the study of Page Rank, as this includes the neighborhood of
each single node i into neighboring nodes connected to it by
incoming and outgoing links. And it shows the popularity of the
web page in terms of the number of pages pointing to them.
ROLE OF RECIPROCAL LINKS
• It plays an important role in percolation catalysts, the fine
structure of the web components and the navigability of the web.
1. Degree Distribution: Depends on the crawl examined.
2. One Degree Correlation: No clear relation between
nonreciprocal in-and out- degrees but there is a positive
correlation between reciprocal and nonreciprocal in-degrees.
3. Degree-Degree Correlation: Shows most of the correlations of
web graphs are found in vertices connected by reciprocal links
which depends on the web structure.
• Reciprocal Sub graph: Shows the organization of the reciprocal
sub graph is a set of star like structures combined with cliques, or
communities, of highly interconnected pages.
Statistical Properties Of RECIPROCAL
SUBGRAPH
Average Degree qr , Maximum Degree qmaxr , Standard
Deviation σr , Heterogeneity Parameter κr , and Maximum
Likelihood Estimate of the Exponent of the Power-Law inDegree Distribution γr (Precision Error ±0.1) (The symbol ∞
means that the distribution decays faster than a power-law.)
CONCLUSION AND FUTURE WORK
CONT.
• Despite an approximate view of the web from
data provided by Web Crawlers , still lacking an
exact definitive description of its large scale
properties and architecture which can affect the
navigation, indexing searching and mining.
FUTURE WORK:
• Differences among crawls should be further
investigated in relation to crawling policies
adopted in designing of the engines.
• The Reciprocal links role has to be explored in
detail.