CS315: The Structure of the Web


Meet the web: First impressions
How big is the web and how do you measure it?
How many people use the web?
How many use search engines?
What is the shape of the web?
How hard is it to go from one page to another?
How do people search for information?
Can we categorize web searchers?
Differences b/w web search & Information Retrieval.
Differences between global and local search.
Differences between search and navigation.
How big is the web?
Number of accessible web pages –
May 2005 estimate: 11.5 Billion pages
Most recent estimate? ________
The deep (or hidden or invisible) web
“contains 400-550 times more information”
that means __________ pages. Do others agree?
Coverage (i.e. the proportion of the web indexed)
is crucial for search engines.
Today, ____________ pages are indexed
How do you measure the size of web?
Capture-recapture method



SE1 = # of pages indexed by search engine 1.
QSE2 = # of pages returned by search engine 2 for typical
queries.
OVR = # of pages returned by both search engines for typical
queries.
Estimate:
SE1 / WWW = OVR / QSE2 =>
WWW = (SE1 x QSE2) / OVR
[Figure: diagram with regions labeled WWW, SE1, QSE2 and OVR]
Lawrence & Giles: Searching the WWW
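The estimate above can be sketched in a few lines of Python. This is only an illustration of the formula WWW = (SE1 x QSE2) / OVR; the function name and the numbers in the usage line are hypothetical and are not taken from Lawrence & Giles.

```python
def estimate_web_size(se1: int, qse2: int, ovr: int) -> float:
    """Capture-recapture estimate: WWW = (SE1 x QSE2) / OVR.

    se1  -- number of pages indexed by search engine 1
    qse2 -- number of pages returned by search engine 2 for the test queries
    ovr  -- number of those pages that search engine 1 also returns
    """
    if ovr <= 0:
        # With no overlap the ratio OVR / QSE2 is zero and the estimate is undefined.
        raise ValueError("overlap must be positive to estimate the size")
    return se1 * qse2 / ovr


# Hypothetical numbers, for illustration only.
estimate = estimate_web_size(se1=100_000_000, qse2=500, ovr=100)
print(f"Estimated size of the web: {estimate:,.0f} pages")
```

The estimate assumes that the pages returned by search engine 2 behave like a random sample of the web, which real query results only approximate.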
Relative Size from Overlap
Sample URLs randomly from A
Check if contained in B
and vice versa
AB
A B =
A B =
(1/2) * Size A
(1/6) * Size B
(1/2)*Size A = (1/6)*Size B
\ Size A / Size B =
(1/6)/(1/2) = 1/3
Each test involves: (i) Sampling (ii) Checking
(Assume for now that we can do them reliably)
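A minimal sketch of the two steps (sampling, then checking) and of the ratio computed on this slide. The indexes here are toy Python sets standing in for engines A and B; the helper names are hypothetical, and sampling a real search engine uniformly is much harder than calling `random.sample`.

```python
import random


def overlap_fraction(sample_from, other, k, rng):
    """Sample k URLs from one index and return the fraction found in the other."""
    sample = rng.sample(sorted(sample_from), k)   # (i) sampling
    hits = sum(1 for url in sample if url in other)  # (ii) checking
    return hits / k


def relative_size(frac_a_in_b, frac_b_in_a):
    """Size A / Size B, from frac_a_in_b * Size A = frac_b_in_a * Size B."""
    return frac_b_in_a / frac_a_in_b


# Toy indexes (assumption): A has 300 pages, B has 900, and 150 are shared.
index_a = set(range(0, 300))
index_b = set(range(150, 1050))

rng = random.Random(0)
frac_a_in_b = overlap_fraction(index_a, index_b, k=100, rng=rng)
frac_b_in_a = overlap_fraction(index_b, index_a, k=100, rng=rng)
print("Estimated Size A / Size B:", relative_size(frac_a_in_b, frac_b_in_a))
```

With the slide's fractions, `relative_size(1/2, 1/6)` gives 1/3, so A is estimated to be one third the size of B.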
How many people use the web? SEs?
Over 10% of the world’s population were online as of late
2004. Today? ________
Number of broadband users is growing
(over 50% of connected Americans use broadband).
Search engine share as of June 2004:

Google (41.6%), Yahoo! (31.5%), MSN (27.4%),
AOL (13.6%), Ask Jeeves (7%) Today? _______
200 million hits per day to Google (mid 2004). Today? ___
“Map of the Internet” (1998)
What is the shape of the web?
Example
Look at paths and strongly connected components
Bow-tie shape of the web
What is the shape of the web?
Broder et al.: Graph structure of the web (2000)
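To make "paths and strongly connected components" concrete, here is a small sketch that finds the strongly connected components of a toy link graph and splits it into bow-tie regions. It uses Kosaraju's two-pass algorithm, which is an illustrative choice and not necessarily the method used in the paper. The graph, the function names and the catch-all OTHER bucket (everything outside IN, SCC and OUT) are assumptions.

```python
def reverse_graph(graph):
    """Return the graph with every link reversed."""
    rev = {}
    for src, targets in graph.items():
        for dst in targets:
            rev.setdefault(dst, []).append(src)
    return rev


def reachable(graph, sources):
    """All nodes reachable from any node in sources (sources included)."""
    seen, stack = set(sources), list(sources)
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def strongly_connected_components(graph):
    """Kosaraju's algorithm with iterative depth-first searches."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    visited, order = set(), []
    for start in nodes:  # pass 1: record nodes in order of DFS completion
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph.get(start, ())))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:  # all successors explored: node is finished
                order.append(node)
                stack.pop()
    rev = reverse_graph(graph)
    assigned, components = set(), []
    for root in reversed(order):  # pass 2: explore the reversed graph
        if root not in assigned:
            component = reachable({k: [v for v in vs if v not in assigned]
                                   for k, vs in rev.items()}, [root])
            assigned |= component
            components.append(component)
    return components


def bow_tie(graph):
    """Split nodes into IN, SCC (largest component), OUT and OTHER."""
    components = strongly_connected_components(graph)
    core = max(components, key=len)
    out_part = reachable(graph, core) - core
    in_part = reachable(reverse_graph(graph), core) - core
    other = set().union(*components) - core - in_part - out_part
    return {"IN": in_part, "SCC": core, "OUT": out_part, "OTHER": other}


# Toy link graph (assumption): a links into the cycle b -> c -> d -> b,
# d links out to e, and f -> g is disconnected from the rest.
toy_web = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["b", "e"], "f": ["g"]}
print(bow_tie(toy_web))
```

The second pass rebuilds a filtered graph for every root, which keeps the sketch short but is too slow for anything beyond toy graphs.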
How hard is it to go
from one page to another?
Over 75% of the time there is no directed path
from one random web page to another.
When a directed path exists
its average length is 16 clicks.
When an undirected path exists
its average length is 7 clicks.
Short average path between pairs of nodes
is characteristic of a small-world network.
Kleinberg: The small-world phenomenon (we will revisit later)
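The difference between directed and undirected paths can be checked on a small graph with breadth-first search. This is a sketch on a hypothetical four-page chain, not a reproduction of the measurements quoted above; all names are illustrative.

```python
from collections import deque


def bfs_distances(graph, source):
    """Number of clicks from source to every page reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def path_statistics(graph, nodes):
    """Return (fraction of ordered pairs joined by a path, average path length)."""
    pairs = connected = total = 0
    for src in nodes:
        dist = bfs_distances(graph, src)
        for dst in nodes:
            if src == dst:
                continue
            pairs += 1
            if dst in dist:
                connected += 1
                total += dist[dst]
    fraction = connected / pairs if pairs else 0.0
    average = total / connected if connected else float("nan")
    return fraction, average


def undirected(graph):
    """Treat every link as traversable in both directions."""
    und = {}
    for src, targets in graph.items():
        und.setdefault(src, set())
        for dst in targets:
            und[src].add(dst)
            und.setdefault(dst, set()).add(src)
    return und


# Toy chain of pages (assumption): a -> b -> c -> d
chain = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
pages = list(chain)
print("directed:  ", path_statistics(chain, pages))
print("undirected:", path_statistics(undirected(chain), pages))
```

On this chain only half of the ordered pairs have a directed path, while every pair is connected once link direction is ignored.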
How do people search for information?
Direct navigation

Enter the URL directly into the browser.
Navigation within a directory

Use a web portal as an entry point to the web.
Information seeking on the web is problematic and more
users are turning to search engines.
Broder: A taxonomy of web search
How do people search for information?
Query formulation
Result selection
Query modification
Surfing
Can we categorize web searchers?
Broder: A taxonomy of web search
Informational ____ %

acquire some information about a topic from web pages.
Navigational ____ %

find a site to start navigation from.
Transactional ____ %

perform some activity mediated by a web site.
Think of your own searches. Do you agree?
How did Broder find these categories?
How did he measure the percentages?
Web search vs. Info Retrieval
The scale of web search
is way beyond traditional information retrieval.
The web is very dynamic.
The web contains
an enormous amount of duplication.
The quality of web pages is not uniform.
The range of topics on the web is open.
The web is globally distributed.
Users' typical habits are different
(short queries, inspect only top-10 pages).
The web is hypertextual.
Differences b/w global & local search
Local search engines on web sites have a bad reputation.
Users often use a web search engine such as Google or
Yahoo! to find information on web sites, rather than the
local web site search engine.
Many companies do not invest in local search.
Content management is a problem.
Language may be a problem.
Information needs on web sites may be different.
Differences b/w search & navigation
Search –

employing a search engine to find information.
Navigation (or surfing) –

employing a link-following strategy to find information.
The web encourages a combination of
search, navigation and browsing.