Slide 2.6 Structure of the web

Slide 2.1
Chapter 2: The Web and the Problem of Search
• The size of the web, and how it is measured.
• Search engine usage statistics.
• The bow-tie structure of the web.
• The small-world web.
• Web information seeking strategies.
• A taxonomy of web searches.
• Web search versus Information Retrieval.
• Differences between global and local search.
• Differences between search and navigation.
Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005
Slide 2.2
Web size statistics
• Number of accessible web pages – latest
estimate, May 2005, 11.5 billion.
• The deep (or hidden or invisible) web contains
400-550 times more information.
• Coverage (i.e. the proportion of the web
indexed) is crucial for search engines.
Slide 2.3
Measuring the size of the web
• Capture-recapture method
– SE1 is the number of pages indexed by the first search
engine.
– QSE2 is the number of pages returned by second
search engine for typical queries.
– OVR is the number of pages returned by both search
engines for typical queries.
• Estimate = (SE1 × QSE2) / OVR
• Estimate of 64.81 million web sites as of June 2005.
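The capture-recapture formula above can be sketched in a few lines of Python. The function name and the sample figures are hypothetical, chosen only to illustrate the arithmetic; they are not measurements from the slides.

```python
def estimate_web_size(se1: int, qse2: int, ovr: int) -> float:
    """Capture-recapture estimate: (SE1 x QSE2) / OVR.

    se1  -- pages indexed by the first search engine
    qse2 -- pages returned by the second engine for the sample queries
    ovr  -- pages returned by both engines for the same queries
    """
    if ovr <= 0:
        # With no overlap the ratio is undefined, so no estimate is possible.
        raise ValueError("OVR must be positive")
    return se1 * qse2 / ovr


# Hypothetical inputs, for illustration only.
estimate = estimate_web_size(se1=8_000_000_000, qse2=1_000, ovr=700)
print(f"Estimated size: {estimate:,.0f} pages")
```

The smaller the overlap relative to QSE2, the larger the estimate: a small overlap suggests the first engine covers only a small share of what the second one finds.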
Slide 2.4
Web usage statistics
• Over 10% of the world’s population were
online as of late 2004.
• Number of broadband users is growing (over
50% of connected Americans use broadband).
• Search engine usage as of June 2004:
– Google (41.6%), Yahoo! (31.5%), MSN
(27.4%), AOL (13.6%), Ask Jeeves (7%)
• 200 million hits per day to Google (mid 2004).
Slide 2.5
Tabular Data versus Web Data
Figure 2.1: A database table versus a web site
Slide 2.6
Structure of the web
Figure 2.2: Map of the Internet (1998)
Slide 2.7
Structure of the web
Figure 2.3: Web pages related to dcs.bbk.ac.uk
(see www.touchgraph.com)
Slide 2.8
Structure of the web
Figure 2.4: Bow-tie shape of the web
Slide 2.9
The small-world web
• Over 75% of the time there is no directed path
from one random web page to another.
• When a directed path exists, its average length
is 16 clicks.
• When an undirected path exists, its average
length is 7 clicks.
• Short average path between pairs of nodes is
characteristic of a small-world network.
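The difference between directed and undirected paths can be illustrated with a breadth-first search over a link graph. This is a minimal sketch on a hypothetical four-page graph; the helper names and the graph itself are assumptions for illustration and do not come from the slides.

```python
from collections import deque


def shortest_path_length(graph, source, target):
    """Return the minimum number of clicks from source to target, or None."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        page, clicks = queue.popleft()
        for linked in graph.get(page, ()):
            if linked == target:
                return clicks + 1
            if linked not in seen:
                seen.add(linked)
                queue.append((linked, clicks + 1))
    return None  # target is not reachable from source


def undirected(graph):
    """Return a copy in which every link can be followed in both directions."""
    result = {}
    for page, links in graph.items():
        result.setdefault(page, set())
        for linked in links:
            result[page].add(linked)
            result.setdefault(linked, set()).add(page)
    return result


# Hypothetical toy web: a -> b -> c, and d -> c.
links = {"a": {"b"}, "b": {"c"}, "d": {"c"}}

print(shortest_path_length(links, "a", "c"))              # directed path exists
print(shortest_path_length(links, "a", "d"))              # no directed path
print(shortest_path_length(undirected(links), "a", "d"))  # undirected path exists
```

In the toy graph, page d cannot be reached from page a by following links forward, but it can be reached once links are treated as two-way, which mirrors why undirected paths on the web exist more often and are shorter on average.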
Slide 2.10
Web information seeking strategies
• Direct navigation
– Enter the URL directly into the browser.
• Navigation within a directory
– Use a web portal as an entry point to the web.
• Information seeking on the web is problematic
and more users are turning to search engines.
Slide 2.11
Navigation using a search engine
Figure 2.5: Information seeking
Slide 2.12
A taxonomy of web searches
• Informational – acquire some information
about a topic from web pages.
• Navigational – find a site to start navigation
from.
• Transactional – perform some activity
mediated by a web site.
Slide 2.13
Web search versus Information Retrieval
• The scale of web search is far beyond that of traditional
information retrieval.
• The web is very dynamic.
• The web contains an enormous amount of
duplication.
• The quality of web pages is not uniform.
• The range of topics on the web is open.
• The web is globally distributed.
• Users' typical habits are different (short queries,
inspecting only the top-10 pages).
• The web is hypertextual.
Slide 2.14
Information retrieval evaluation
Figure 2.6: Recall versus precision
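As a reminder of the two measures named in the figure, the sketch below uses the standard definitions: precision is the fraction of retrieved pages that are relevant, and recall is the fraction of relevant pages that are retrieved. The function name and the page identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant pages that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 pages retrieved, 3 pages actually relevant.
precision, recall = precision_recall(
    retrieved=["p1", "p2", "p3", "p4"],
    relevant=["p2", "p4", "p7"],
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```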
Slide 2.15
Differences between global and local search
• Local search engines on web sites have a bad
reputation.
• Users often use a web search engine such as Google
or Yahoo! to find information on web sites, rather than
the local web site search engine.
• Many companies do not invest in local search.
• Content management is a problem.
• Language may be a problem.
• Information needs on web sites may be different.
Slide 2.16
Differences between search and navigation
• Search – employing a search engine to find
information.
• Navigation (or surfing) – employing a link-following
strategy to find information.
• The web encourages a combination of search,
navigation and browsing.