The Mobile Web is Structurally Different

Apoorva Jindal (USC)
Chris Crutchfield (MIT)
Ravi Jain (Google Inc)
Samir Goel (Google Inc)
Ravi Kolluri (Google Inc)

Mobile Web?

Web pages designed for consumption on mobile wireless devices
  CHTML, XHTML, WML
  All other pages referred to as fixed web
Becoming more important
  Better devices
  Better networks
  Cheaper plans
Different from fixed web?
  Smaller pages
  Fewer hyperlinks
  Fewer images

Structurally?

Web graph
  pages ↔ nodes
  hyperlinks ↔ edges
Properties of this graph
  In-degree distribution
  Out-degree distribution
  Strongly connected component size distribution
  ….
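
As a minimal sketch of the mapping above, the snippet below treats pages as nodes and hyperlinks as directed edges, then tabulates how many pages have each in-degree and out-degree. The link list and the function name are hypothetical toy values, not data from the study.

```python
from collections import Counter

# Hypothetical toy link data: (source page, target page) pairs, no duplicates.
links = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]

def degree_distributions(edges):
    """Return (in_dist, out_dist): maps from a degree to the number of pages."""
    nodes = {n for edge in edges for n in edge}
    in_deg = Counter(dst for _, dst in edges)
    out_deg = Counter(src for src, _ in edges)
    # Pages with no incoming (or outgoing) links count as degree 0.
    in_dist = Counter(in_deg.get(n, 0) for n in nodes)
    out_dist = Counter(out_deg.get(n, 0) for n in nodes)
    return in_dist, out_dist

in_dist, out_dist = degree_distributions(links)
```

Dividing the number of edges by the number of nodes gives a simple average degree for the same toy graph.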
INFOCOM 2008

Importance

Used in basic algorithms to implement search
  Crawling
  Ranking the web pages
Studied in detail for fixed web
  Bow-tie Structure [Broder et al 2000]
    Model to describe the structure of the fixed web.
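
To make the bow-tie components concrete, here is a small sketch that splits a directed graph into the largest strongly connected component (SCC), IN, OUT, and the remainder. It is an illustration under assumptions, not the tooling used in the study: the function names and toy edges are hypothetical, and "Tendrils" here simply means everything else that is weakly connected to the SCC.

```python
def reachable(adj, sources):
    """Return every node reachable from `sources` by following `adj`."""
    seen = set(sources)
    stack = list(sources)
    while stack:
        for nxt in adj.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def bow_tie(edges):
    """Split nodes into SCC, IN, OUT, tendrils and disconnected sets."""
    fwd, rev, und, nodes = {}, {}, {}, set()
    for u, v in edges:
        fwd.setdefault(u, set()).add(v)
        rev.setdefault(v, set()).add(u)
        und.setdefault(u, set()).add(v)
        und.setdefault(v, set()).add(u)
        nodes.update((u, v))

    # Largest strongly connected component. The SCC of a node is the set of
    # nodes it can reach that can also reach it. Quadratic, so toy-sized only.
    scc, assigned = set(), set()
    for n in sorted(nodes):
        if n in assigned:
            continue
        comp = reachable(fwd, [n]) & reachable(rev, [n])
        assigned |= comp
        if len(comp) > len(scc):
            scc = comp

    in_part = reachable(rev, scc) - scc    # can reach the SCC
    out_part = reachable(fwd, scc) - scc   # reachable from the SCC
    weak = reachable(und, scc)             # same, ignoring edge direction
    return {
        "SCC": scc,
        "IN": in_part,
        "OUT": out_part,
        "Tendrils": weak - scc - in_part - out_part,
        "Disconnected": nodes - weak,
    }

toy_edges = [("a", "b"), ("b", "c"), ("c", "a"),   # core cycle
             ("i", "a"), ("c", "o"),               # into and out of the core
             ("i", "t"), ("x", "y")]               # a tendril and an island
parts = bow_tie(toy_edges)
```

The repeated reachability searches keep the sketch short; a linear-time SCC algorithm would be the natural choice for anything beyond a toy graph.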
Methodology

Google’s mobile web index, June 2007
  CHTML
  XHTML + WML
Webbase 2001
Google’s fixed web index, July 2007
In-degree & out-degree distributions
  Tools based on MapReduce
  Use [Clauset et al 2006] to infer the power law coefficient
Determine bow-tie structure properties
  Use COSIN tools [Donato et al 2004]
  Limitations
    Cannot handle Google fixed web 2007 at page level
    Collapse all pages in a domain to one node
    Use tools based on MapReduce
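
For readers unfamiliar with power-law fitting, the sketch below uses a standard maximum-likelihood approximation for discrete data. Whether this is the exact estimator behind the citation above is an assumption; the function name, the `x_min` parameter, and the sample degree lists are hypothetical.

```python
import math

def power_law_exponent(degrees, x_min=1):
    """Approximate maximum-likelihood exponent for discrete data >= x_min.

    Uses alpha ~ 1 + n / sum(ln(x_i / (x_min - 0.5))), an approximation
    that is more accurate for larger x_min.
    """
    tail = [d for d in degrees if d >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    log_sum = sum(math.log(d / (x_min - 0.5)) for d in tail)
    return 1.0 + len(tail) / log_sum

# Hypothetical degree samples: the first falls off faster than the second.
steep = [1] * 90 + [2] * 10
shallow = [1] * 50 + [2] * 25 + [4] * 15 + [8] * 10
alpha_steep = power_law_exponent(steep)
alpha_shallow = power_law_exponent(shallow)
```

A sample whose counts fall off faster yields a larger estimated coefficient, which is the sense in which the coefficients are compared across corpora.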
Page-level Graph properties – Degree Distributions

Mobile web is sparser
Out-degree distribution falls off faster for mobile web
CHTML lies between XHTML+WML and fixed web

Corpus       Avg node degree   Coefficient of power-law distribution
                               In-degree   Out-degree
XHTML+WML    3.75              2.00        3.49
CHTML        5.06              1.99        4.06
Webbase      7.0               2.1         2.7
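
A short worked example may help in reading the coefficients above. Assuming the degree distribution follows a pure power law with coefficient γ (a simplification), doubling the degree scales the probability by a fixed factor:

```latex
P(k) \propto k^{-\gamma}
\quad\Longrightarrow\quad
\frac{P(2k)}{P(k)} = 2^{-\gamma}
% Out-degree coefficients from the table:
%   XHTML+WML: 2^{-3.49} \approx 0.089
%   CHTML:     2^{-4.06} \approx 0.060
%   Webbase:   2^{-2.7}  \approx 0.154
```

Under that assumption, each doubling of out-degree keeps roughly 9% of pages in XHTML+WML and 6% in CHTML, against roughly 15% in Webbase, which is what "falls off faster" means here.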
Page-level Graph properties – Bow-tie structure

Corpus       SCC     IN      OUT     Tendrils   Disconnected
XHTML+WML    10.5%   18%     10.4%   18.3%      42.7%
CHTML        22%     25.9%   14.2%   22%        15.8%
Webbase      33%     11%     39%     13%        4%

Mobile web
  Smaller SCC
  Larger IN and smaller OUT
  Bigger Disconnected + Tendrils
Connectivity: Fixed Web > CHTML > XHTML/WML
Language Properties

Sub-graph of pages that share a common trait
  Like keyword, location.
  Called Thematically Unified Clusters (TUCs).
  In fixed web, they retain the structural properties of the entire graph.
  Mobile web?

Corpus   Language   Fraction of Nodes
XHTML    Chinese    42.6%
         English    22.3%
         Russian    13.4%
         French     3.4%
         German     2.3%
CHTML    Japanese   92.3%
         English    5.9%

Corpus       SCC     IN      OUT     Tendrils   Disconnected
XHTML+WML    10.5%   18%     10.4%   18.3%      42.7%
Chinese      13%     22%     9%      14%        42%
English      2%      3%      7%      25%        63%
Russian      22%     40%     8%      11%        19%

Japanese not studied separately: properties same as CHTML
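
A language sub-graph of the kind described above can be sketched as an induced sub-graph: keep the pages carrying a given label and only the links between them. The page names, language labels, and function name below are hypothetical.

```python
def induced_subgraph(edges, labels, wanted):
    """Keep pages labelled `wanted` and the edges with both endpoints kept."""
    keep = {page for page, label in labels.items() if label == wanted}
    sub_edges = [(u, v) for u, v in edges if u in keep and v in keep]
    return keep, sub_edges

# Hypothetical toy data: page -> detected language, plus page-level links.
page_language = {"p1": "en", "p2": "en", "p3": "ru", "p4": "ru", "p5": "en"}
page_links = [("p1", "p2"), ("p2", "p3"), ("p3", "p4"), ("p4", "p3")]

en_nodes, en_edges = induced_subgraph(page_links, page_language, "en")
ru_nodes, ru_edges = induced_subgraph(page_links, page_language, "ru")
```

Links that cross languages, such as p2 to p3 here, drop out of both sub-graphs, which is one way a sub-graph can end up more disconnected than the corpus it came from.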
Domain-level Graph Properties

Domain-level graph
  Collapse all nodes for a domain into a single super-node
  Compare mobile web 2007 and fixed web 2007
Advantages
  Allows us to understand the differences at a much coarser level
  Allows us to compare present day fixed and mobile webs

Corpus           Avg Node Degree   SCC     IN      OUT     Tendrils + Disconn.
XHTML+WML        3.91              40.6%   40.7%   2.73%   15.9%
CHTML            5.56              83%     16.4%   0.22%   0.36%
Fixed web 2007   35.75             93.9%   5.62%   0.4%    0.03%

Observations

Domain-level graphs are better connected.
XHTML+WML has a much larger Disconnected component.
CHTML properties lie between XHTML+WML and fixed web.
Structural differences between domain-level fixed web and mobile web are the same as the differences between page-level fixed web and mobile web.
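
Collapsing all pages of a domain into one super-node can be sketched as follows. Two assumptions are made for illustration only: the URL's hostname stands in for its "domain", and links within a domain are dropped rather than kept as self-loops. The URLs and function name are hypothetical.

```python
from urllib.parse import urlsplit

def domain_graph(page_edges):
    """Collapse page-level edges (URL pairs) into a set of domain-level edges."""
    edges = set()
    for src, dst in page_edges:
        s = urlsplit(src).hostname
        d = urlsplit(dst).hostname
        # Assumption: intra-domain links are dropped, not kept as self-loops.
        if s and d and s != d:
            edges.add((s, d))
    return edges

# Hypothetical page-level links.
page_edges = [
    ("http://m.example.com/a", "http://m.example.com/b"),
    ("http://m.example.com/a", "http://example.org/x"),
    ("http://m.example.com/b", "http://example.org/y"),
]
domain_edges = domain_graph(page_edges)
```

Three page-level links shrink to a single domain-level edge here, which is why the collapsed graph stays tractable where the page-level graph does not.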
Application: Impact on Crawling

Crawling is resource-intensive.
  Efficiency is important
Higher level of disconnectedness
  Need a larger and more diverse seed set
  Covering the IN component requires special care
  Depth-first strategy risks spending a disproportionate amount of time in Tendrils and Disconnected components
Different languages have different levels of disconnectedness
  Require a larger seed set for English pages than Russian pages
  Crawl depth can be reduced for Russian sub-graph
Sparseness can also give an advantage
  Chances of encountering the same page again during a crawl are smaller
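
The effect of seed sets and crawl depth on a disconnected graph can be illustrated with a breadth-first crawl over a toy link structure. This is a sketch under assumptions, not a crawler design from the talk; the graph, seeds, and function name are hypothetical.

```python
from collections import deque

def crawl(adj, seeds, max_depth):
    """Breadth-first crawl from `seeds`; returns the set of pages reached."""
    seen = set(seeds)
    queue = deque((s, 0) for s in seeds)
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for nxt in adj.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen

# Hypothetical toy web: a three-page cycle plus a separate two-page island.
toy_web = {"a": ["b"], "b": ["c"], "c": ["a"], "x": ["y"]}

one_seed = crawl(toy_web, ["a"], max_depth=5)
two_seeds = crawl(toy_web, ["a", "x"], max_depth=5)
shallow = crawl(toy_web, ["a"], max_depth=1)
```

With a single seed the island is never found regardless of depth; adding a seed inside it is the only way to cover it.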
Conclusions

Mobile web graph is structurally different
  Sparser, more disconnected
  Smaller SCC and OUT
CHTML properties lie between XHTML+WML and fixed web
Surprising preponderance of Chinese pages
English sub-graph extremely disconnected
Future Work

Only a first step
Results motivate the need for a deeper and more extensive analysis
Propose alternatives to bow-tie model for mobile web
Better understanding of language sub-graphs
Quantitatively characterize the impact of differences in structure on different search algorithms