The Mobile Web is Structurally Different
Apoorva Jindal (USC)
Chris Crutchfield (MIT)
Ravi Jain (Google Inc)
Samir Goel (Google Inc)
Ravi Kolluri (Google Inc)
Mobile Web?
Web pages designed for consumption on mobile wireless devices
CHTML, XHTML, WML
All other pages referred to as fixed web
Becoming more important
Better devices
Better networks
Cheaper plans
Different from fixed web?
Smaller pages
Fewer hyperlinks
Fewer images
Structurally?
Web graph
pages ↔ nodes
hyperlinks ↔ edges
Properties of this graph
In-degree distribution
Out-degree distribution
Strongly connected component size distribution
….
Importance
Used in basic algorithms to implement search
Crawling
Ranking the web pages
Studied in detail for fixed web
INFOCOM 2008
Bow-tie Structure [Broder et al 2000]
Model to describe the structure of the fixed web.
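A minimal sketch of how a bow-tie decomposition can be computed on a small edge list. It is an illustration only, not the tooling used for the results in this deck. The function names are hypothetical, the SCC search is a simple (non-linear-time) one, and the split between Tendrils and Disconnected assumes that Disconnected means "not weakly connected to the largest SCC" and that Tendrils covers everything else outside SCC, IN and OUT.

```python
from collections import defaultdict, deque


def reachable(adj, sources):
    """Return every node reachable from `sources` by BFS over `adj`."""
    seen = set(sources)
    queue = deque(seen)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def largest_scc(nodes, fwd, rev):
    """Largest strongly connected component (simple, not linear time)."""
    best, assigned = set(), set()
    for n in nodes:
        if n in assigned:
            continue
        # The SCC of n is the set of nodes that n reaches and that reach n.
        comp = reachable(fwd, [n]) & reachable(rev, [n])
        assigned |= comp
        if len(comp) > len(best):
            best = comp
    return best


def bow_tie(edges):
    """Split the nodes of a directed edge list into bow-tie components."""
    fwd, rev, und = defaultdict(set), defaultdict(set), defaultdict(set)
    nodes = set()
    for u, v in edges:
        fwd[u].add(v)
        rev[v].add(u)
        und[u].add(v)
        und[v].add(u)
        nodes.update((u, v))
    scc = largest_scc(nodes, fwd, rev)
    out = reachable(fwd, scc) - scc   # reachable from the core
    inn = reachable(rev, scc) - scc   # can reach the core
    weak = reachable(und, scc)        # weakly connected to the core
    return {
        "SCC": scc,
        "IN": inn,
        "OUT": out,
        "TENDRILS": weak - scc - inn - out,  # assumption: includes tubes
        "DISCONNECTED": nodes - weak,
    }


# Toy graph: a 3-node core, one IN node, one OUT node, one tendril, one island.
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"),
             ("i", "a"), ("c", "o"), ("i", "t"), ("x", "y")]
for name, members in bow_tie(toy_edges).items():
    print(name, sorted(members))
```

Dividing the size of each set by the total node count gives per-component fractions of the kind reported in the tables of this deck.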
Methodology
Google’s mobile web index, June 2007
CHTML
XHTML + WML
Webbase 2001
Google’s fixed web index, July 2007
Collapse all pages in a domain to one node
Use tools based on MapReduce
In-degree & out-degree distributions
Tools based on MapReduce
Use [Clauset et al 2006] to infer the power-law coefficient
Determine bow-tie structure properties
Use COSIN tools [Donato et al 2004]
Limitations
Cannot handle Google fixed web 2007 at page level
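To make the power-law step concrete, here is a small sketch of a maximum-likelihood estimate of the coefficient from a list of degrees. It uses the closed-form approximation for discrete data, alpha ≈ 1 + n / Σ ln(x_i / (x_min − 1/2)), and assumes the cutoff `x_min` is supplied by hand. The full method of [Clauset et al 2006] also selects the cutoff, which this sketch does not attempt, and this is not the MapReduce tooling mentioned above. The function names are hypothetical.

```python
import math
from collections import Counter


def in_out_degrees(edges):
    """Count in-degree and out-degree per node from a directed edge list."""
    out_deg = Counter(u for u, _ in edges)
    in_deg = Counter(v for _, v in edges)
    return in_deg, out_deg


def power_law_alpha(degrees, x_min=1):
    """Approximate MLE of the power-law coefficient for discrete data.

    Only observations >= x_min are used. The approximation is generally
    considered more accurate for larger x_min; x_min is an assumption
    supplied by the caller here.
    """
    tail = [d for d in degrees if d >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    log_sum = sum(math.log(d / (x_min - 0.5)) for d in tail)
    return 1.0 + len(tail) / log_sum


# Usage on a toy edge list (far too small for a meaningful fit).
edges = [("a", "b"), ("c", "b"), ("d", "b"), ("b", "a"), ("c", "a"), ("d", "c")]
in_deg, out_deg = in_out_degrees(edges)
print(power_law_alpha(in_deg.values(), x_min=1))
```

A larger coefficient means the distribution falls off faster, which is how the in-degree and out-degree columns of the degree table should be read.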
Page-level Graph properties – Degree Distributions
Mobile web is sparser
Out-degree distribution falls off faster for mobile web
CHTML lies between XHTML+WML and fixed web
Corpus      Avg Node Degree   Power-law coefficient (In-degree)   Power-law coefficient (Out-degree)
XHTML+WML   3.75              2.00                                3.49
CHTML       5.06              1.99                                4.06
Webbase     7.0               2.1                                 2.7
Page-level Graph properties – Bow-tie structure
Corpus      SCC     IN      OUT     Tendrils   Disconnected
XHTML+WML   10.5%   18%     10.4%   18.3%      42.7%
CHTML       22%     25.9%   14.2%   22%        15.8%
Webbase     33%     11%     39%     13%        4%
Mobile web
Smaller SCC
Larger IN and smaller OUT
Bigger Disconnected + Tendrils
Connectivity: Fixed Web > CHTML > XHTML/WML
Language Properties
Sub-graph of pages that share a common trait
Like keyword, location.
Called Thematically Unified Clusters (TUCs).
In fixed web, they retain the structural properties of the entire graph.
Mobile web?
Corpus   Language   Fraction of Nodes
XHTML    Chinese    42.6%
XHTML    English    22.3%
XHTML    Russian    13.4%
XHTML    French     3.4%
XHTML    German     2.3%
CHTML    Japanese   92.3%
CHTML    English    5.9%
Corpus      SCC     IN      OUT     Tendrils   Disconnected
XHTML+WML   10.5%   18%     10.4%   18.3%      42.7%
Chinese     13%     22%     9%      14%        42%
English     2%      3%      7%      25%        63%
Russian     22%     40%     8%      11%        19%
Japanese sub-graph not studied separately: its properties are the same as those of CHTML
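As an illustration of what a language sub-graph looks like in code, the sketch below keeps the pages labelled with one language and only the links between them. It assumes the sub-graph is the induced one (both endpoints share the trait), and that a per-page language label is already available. The function name and the `lang_of` mapping are hypothetical.

```python
def language_subgraph(edges, lang_of, lang):
    """Return (nodes, edges) of the sub-graph induced by one language.

    `lang_of` maps page -> language label (assumed to be given).
    Nodes are returned separately so that pages left without any
    same-language link are still counted (they end up disconnected).
    """
    nodes = {page for page, label in lang_of.items() if label == lang}
    sub_edges = [(u, v) for u, v in edges if u in nodes and v in nodes]
    return nodes, sub_edges


# Toy example: p3 is English but only links to a Russian page.
lang_of = {"p1": "en", "p2": "en", "p3": "en", "p4": "ru"}
edges = [("p1", "p2"), ("p2", "p1"), ("p3", "p4")]
nodes, sub_edges = language_subgraph(edges, lang_of, "en")
print(sorted(nodes), sub_edges)
```

Keeping the node set alongside the edges matters here, since the comparison between languages turns largely on how many pages are disconnected.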
Domain-level Graph Properties
Domain-level graph
Collapse all nodes for a domain into a single super-node
Compare mobile web 2007 and fixed web 2007
Advantages
Allows us to understand the differences at a much coarser level
Allows us to compare present day fixed and mobile webs
Corpus           Avg Node Degree   SCC     IN      OUT     Tendrils + Disconn.
XHTML+WML        3.91              40.6%   40.7%   2.73%   15.9%
CHTML            5.56              83%     16.4%   0.22%   0.36%
Fixed web 2007   35.75             93.9%   5.62%   0.4%    0.03%
Observations
Domain-level graphs are better connected.
XHTML+WML has a much larger Disconnected component.
CHTML properties lie between XHTML+WML and Fixed web.
Structural differences between domain-level fixed web and mobile web are the same as the differences between page-level fixed web and mobile web.
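The collapse of pages into domain super-nodes can be sketched as follows. This is an assumption-laden illustration: the deck does not say how a domain is derived from a URL, so the host name is used here, and links within a domain are dropped rather than kept as self-loops. The function names are hypothetical.

```python
from urllib.parse import urlsplit


def domain_of(url):
    """Assumed notion of 'domain': the host name of the URL."""
    return urlsplit(url).hostname or ""


def collapse_to_domains(page_edges):
    """Turn page-level links into a set of domain-level links."""
    domain_edges = set()
    for src, dst in page_edges:
        a, b = domain_of(src), domain_of(dst)
        # Assumption: skip unparsable URLs and intra-domain links.
        if a and b and a != b:
            domain_edges.add((a, b))
    return domain_edges


page_edges = [
    ("http://a.example/index", "http://a.example/news"),  # intra-domain
    ("http://a.example/news", "http://b.example/"),
    ("http://a.example/index", "http://b.example/page"),  # same domain pair
    ("http://b.example/", "http://c.example/x"),
]
print(sorted(collapse_to_domains(page_edges)))
```

Because many page-level links map onto the same pair of domains, the collapsed graph is far smaller than the page-level one, which is what makes the fixed web 2007 index tractable at this level.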
Application: Impact on Crawling
Crawling is resource-intensive.
Efficiency is important
Higher level of disconnectedness
Need a larger and a more diverse seed set
Covering the IN component requires special care
Depth-first strategy risks spending a disproportionate time in Tendrils and Disconnected components
Different languages have different levels of disconnectedness
Require a larger seed set for English pages than Russian pages
Crawl depth can be reduced for Russian sub-graph
Sparseness can also be an advantage
The chance of encountering the same page again during a crawl is smaller
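The effect of the seed set on coverage can be illustrated with a toy breadth-first crawl. The graph, the node names and the function below are hypothetical. The point is only that seeds inside the SCC reach the SCC and OUT, while IN, Tendrils and Disconnected pages need seeds of their own.

```python
from collections import defaultdict, deque


def crawl_coverage(edges, seeds):
    """Fraction of nodes reached by a breadth-first crawl from `seeds`."""
    adj = defaultdict(list)
    nodes = set()
    for u, v in edges:
        adj[u].append(v)
        nodes.update((u, v))
    if not nodes:
        return 0.0
    seen = {s for s in seeds if s in nodes}
    queue = deque(seen)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) / len(nodes)


# Core a-b-c, IN node i, OUT node o, tendril t, disconnected pair x-y.
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"),
             ("i", "a"), ("c", "o"), ("i", "t"), ("x", "y")]
print(crawl_coverage(toy_edges, ["a"]))            # seed in the SCC only
print(crawl_coverage(toy_edges, ["a", "i", "x"]))  # more diverse seed set
```

On this toy graph the single core seed reaches half of the nodes, and adding one seed in IN and one in the disconnected pair reaches all of them.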
Conclusions
Mobile web graph is structurally different
Sparser, more disconnected
Smaller SCC and OUT
CHTML properties lie between XHTML+WML and Fixed web
Surprising preponderance of Chinese pages
English sub-graph extremely disconnected
Future Work
Only a first step
Results motivate the need for a deeper and more extensive analysis
Propose alternatives to bow-tie model for mobile web
Better understanding of language sub-graphs
Quantitatively characterize the impact of differences in structure on different search algorithms