Transcript: Chapter 13

Searching the Web
The Web
Why it is important:
– “Free” ubiquitous information resource
– Broad coverage of topics and perspectives
– Becoming dominant information collection
– Growth and jobs
Web access methods
Search (e.g. Google)
Directories (e.g. Yahoo!)
Other …
Web Characteristics
Distributed data
– 80 million web sites (hostnames responding) in April
2006
– 40 million active web sites (don’t redirect, …)
High volatility
– Servers come and go …
Large volume
– One study found 11.5 billion pages in January 2005
(at that time Google indexed 8 billion pages)
– “Dark Web” – content not indexed, not crawlable
estimated to be 4,200 to 7,500 terabytes in 2000
(when there were ~ 2.5 billion indexable pages)
Web Characteristics
Unstructured data
– Lots of duplicated content (30% estimate)
– Semantic duplication much higher
Quality of data
– No required editorial process
– Many typos and misspellings (impacts IR)
Heterogeneous data
– Different media
– Different languages
These characteristics are not going to change
Web Content Types
[Chart of Web content types. Source: "How Much Information? 2003"]
Search Engine Architecture
Lots and lots of computers
[Architecture diagram: Users pose queries through the Interface to the Query Engine, which consults the Index; the Crawler fetches pages from the Web and feeds the Indexer, which builds the Index]
Search Engine Architecture
[The same architecture diagram, annotated with the textbook chapters that cover each part: Evaluation (Chapter 3), Interface (Chapter 10), Query Engine (Chapters 2, 4, & 5), Indexer and Index (Chapters 6 & 7), Crawler and Web (Chapter 8)]
Hubs and Authorities
Hubs
– Have lots of links to other pages
Authorities
– Have lots of links that point to them
Can use feedback to rank hubs and
authorities
– Better hubs have links to good authorities
– Better authorities have links from good hubs
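A minimal sketch of this feedback idea (the mutual-reinforcement iteration popularized by Kleinberg's HITS algorithm), in Python; the tiny link graph is a made-up example:

# Hub/authority feedback: repeatedly update each score from the other.
links = {
    "a": ["c", "d"],   # page "a" links to "c" and "d"
    "b": ["c", "d"],
    "c": ["d"],
    "d": [],
}

hub = {p: 1.0 for p in links}
auth = {p: 1.0 for p in links}

for _ in range(20):   # a few rounds of mutual reinforcement
    # authority score: sum of hub scores of pages linking to it
    auth = {p: sum(hub[q] for q in links if p in links[q]) for p in links}
    # hub score: sum of authority scores of the pages it links to
    hub = {p: sum(auth[q] for q in links[p]) for p in links}
    # normalize so scores stay bounded
    na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
    nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
    auth = {p: v / na for p, v in auth.items()}
    hub = {p: v / nh for p, v in hub.items()}

print(sorted(auth, key=auth.get, reverse=True))   # best authorities first

After a few iterations the heavily linked-to pages rise to the top as authorities, and the pages that link to them rise as hubs.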
Crawling the Web
Creating a Web Crawler (Web Spider)
Simplest Technique
– Start with a set of URLs
– Extract URLs pointed to in original set
– Continue using either breadth-first or depth-first
Works for one crawler but hard to coordinate for
many crawlers
– Partition the Web by domain name, IP address, or some
other technique
– Each crawler has its own set but shares a to-do list
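A minimal Python sketch of this simplest technique, using only the standard library; the regular-expression link extraction and the page limit are simplifications, not a production design:

from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def crawl(seeds, max_pages=100):
    to_do = deque(seeds)               # FIFO queue -> breadth-first traversal
    seen = set(seeds)
    fetched = 0
    while to_do and fetched < max_pages:
        url = to_do.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                   # broken link, timeout, non-HTML, ...
        fetched += 1
        # naive link extraction; a real crawler would use an HTML parser
        for href in re.findall(r'href=["\'](.*?)["\']', html):
            link = urljoin(url, href)  # resolve relative links against the page URL
            if link.startswith("http") and link not in seen:
                seen.add(link)
                to_do.append(link)
    return seen

Switching popleft() to pop() turns the breadth-first traversal into a depth-first one.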
Crawling the Web
Need to recrawl
– Indexed content is always out of date
– Sites come and go and sometimes disappear
for periods of time to reappear
Order of URLs traversed makes a difference
– Breadth first matches hierarchic organization
of content
– Depth first gets to deeper content faster
– Proceeding to “better” pages first can also
help (e.g. good hubs and good authorities)
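One way to proceed to "better" pages first is to replace the FIFO to-do list with a priority queue; the score is left abstract here (it might, for example, be an estimated hub or authority value):

import heapq

to_do = []                                # heap of (-score, url); highest score pops first

def enqueue(url, score):                  # score: assumed estimate of page quality
    heapq.heappush(to_do, (-score, url))

def next_url():
    return heapq.heappop(to_do)[1]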
Server and Author Control of Crawling
Avoid crawling sites that do not want to be
crawled
– Legal issue
– Robot exclusion protocol (Server level control)
• file that indicates which portions of a web site should not be
visited by crawlers
• http://.../robots.txt
– Robot META tag (Author level control)
• used to indicate if a file (page) should be indexed or
analyzed for links
• few crawlers implement this
• <meta name="robots" content="noindex, nofollow">
• http://www.robotstxt.org/wc/exclusion.html
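Python's standard library can check the robot exclusion protocol before fetching a URL; a short sketch (the crawler name MyCrawler is made up):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()                                        # fetch and parse robots.txt
if rp.can_fetch("MyCrawler", "https://www.example.com/search"):
    pass   # allowed to crawl this URL; otherwise skip it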
Example robots.txt Files
Google
User-agent: *
Allow: /searchhistory/
Disallow: /search
Disallow: /groups
Disallow: /images
Disallow: /catalogs
Disallow: /catalogues
…
CSDL
User-agent: *
Disallow: /FLORA/arch_priv/
Disallow: /FLORA/private/
TAMU Library
User-agent: *
Disallow: /portal/site/chinaarchive/template.REGISTER/
Disallow: /portal/site/Library/template.REGISTER/
…
New York Times
User-agent: *
Disallow: /pages/college/
…
Allow: /pages/
Allow: /2003/
…
User-agent: Mediapartners-Google*
Disallow:
Crawling Goals
Crawling technique may depend on goal
Types of crawling goals:
– Create a large, broad index
– Create a focused, topic- or domain-specific index
• Target topic-relevant sites
• Index preset terms
– Create a subset of content to model characteristics
of (part of) the Web
• Need to survey appropriately
• Cannot use simple depth-first or breadth-first
– Create up-to-date index
• Use estimated change frequencies
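A small sketch of scheduling recrawls from estimated change frequencies; the clamping bounds are arbitrary assumptions:

import time

def next_visit(last_crawled, est_changes_per_day, min_hours=1, max_hours=24 * 7):
    # revisit roughly as often as the page is estimated to change,
    # clamped between one hour and one week
    interval_hours = min(max_hours, max(min_hours, 24 / max(est_changes_per_day, 1e-6)))
    return last_crawled + interval_hours * 3600

due = next_visit(time.time(), est_changes_per_day=4)   # revisit in about 6 hours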
Crawling Challenges
Identifying and keeping track of links
– Which to visit
– Which have been visited
Issues
– Relative vs. absolute link descriptions
– Alternate server names
– Dynamically generated pages
– Server-side scripting
– Links buried in scripts
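Several of these issues come down to URL normalization; a Python sketch using the standard urllib.parse helpers (the normalization rules chosen here are illustrative, not exhaustive):

from urllib.parse import urljoin, urlsplit, urlunsplit

def normalize(base_url, href):
    absolute = urljoin(base_url, href)              # relative -> absolute
    parts = urlsplit(absolute)
    host = (parts.hostname or "").lower()           # host names are case-insensitive
    if parts.port and parts.port not in (80, 443):
        host = f"{host}:{parts.port}"               # keep only non-default ports
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))  # drop fragment

normalize("http://Example.COM/a/b.html", "../c.html#top")   # -> "http://example.com/c.html"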
Crawling Architecture
Crawler components
Worker threads – attempt to retrieve data for a URL
DNS resolver – resolves domain names into IP
addresses
Protocol modules – download content using the appropriate
protocol
Link extractor – finds and normalizes URLs
URL filter – determines which URLs to add to to-do list
URL to-do agent – keeps list of URLs to visit
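A runnable toy version (an assumption-laden sketch, not the book's design) of worker threads sharing a to-do list, with only the DNS-resolution step filled in:

import queue, socket, threading
from urllib.parse import urlsplit

to_do = queue.Queue()                         # URL to-do agent: shared, thread-safe

def worker():
    while True:
        url = to_do.get()                     # worker thread takes the next URL
        host = urlsplit(url).hostname
        try:
            ip = socket.gethostbyname(host)   # DNS resolver step
            print(f"{url} resolves to {ip}")  # protocol module / link extractor would go here
        except OSError:
            pass                              # unresolvable host
        finally:
            to_do.task_done()

for _ in range(4):                            # a small pool of worker threads
    threading.Thread(target=worker, daemon=True).start()

to_do.put("https://www.example.com/")
to_do.join()                                  # wait until all queued URLs are processed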
Crawling Issues
Avoid overloading servers
– Brute force approach can become a denial of service
attack
– Weak politeness guarantee: only one thread allowed
to contact a server
– Stronger politeness guarantee: maintain a queue per
server and release URLs into the to-do list based on
priority and load factors
Broken links, time outs
– How many times to try?
– How long to wait?
– How to recognize crawler traps? (server-side
programs that generate “infinite” links)
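A sketch of the stronger politeness guarantee: one queue per server plus an earliest-allowed time, so no host is contacted more than once per delay window (the 2-second delay is an assumed value):

import heapq, time
from collections import defaultdict
from urllib.parse import urlsplit

DELAY = 2.0                                 # minimum seconds between hits on one server
per_server = defaultdict(list)              # host -> queue of pending URLs
ready = []                                  # heap of (earliest_allowed_time, host)

def add_url(url):
    host = urlsplit(url).hostname
    if not per_server[host]:                # host becomes schedulable immediately
        heapq.heappush(ready, (time.time(), host))
    per_server[host].append(url)

def next_url():
    allowed_at, host = heapq.heappop(ready)
    time.sleep(max(0.0, allowed_at - time.time()))   # respect the per-server delay
    url = per_server[host].pop(0)
    if per_server[host]:                    # more work for this host: schedule it later
        heapq.heappush(ready, (time.time() + DELAY, host))
    return url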
Web Tasks
Precision is the key
– Goal: first 10-100 results should satisfy user
– Requires ranking that matches user’s need
– Recall is not important
• Completeness of index is not important
• Comprehensive crawling is not important
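The corresponding measure is precision at k, sketched below for an arbitrary cutoff k:

def precision_at_k(ranked_results, relevant, k=10):
    top_k = ranked_results[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

precision_at_k(["d1", "d2", "d3"], relevant={"d1", "d3"}, k=3)   # -> about 0.67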
Browsing
Web directories
– Human-organized taxonomies of Web sites
– Small portion (less than 1%) of Web pages
• Remember that recall (completeness) is not
important
• Directories point to logical web sites rather than
pages
– Directory search returns both categories and
sites
– People generally browse rather than search
once they identify categories of interest
Metasearch
Search a number of search engines
Advantages
– Do not build their own crawler and index
– Cover more of the Web than any of their
component search engines
Difficulties
– Need to translate the query into each engine's query
language
– Need to merge results into a meaningful
ranking
Metasearch II
Merging Results
– Voting scheme based on component search engines
• No model of component ranking schemes needed
– Model-based merging
• Need understanding of relative ranking, potentially by query
type
Why they are not used for the Web
– Bias towards coverage (i.e. recall), which is not
important for most Web queries
– Merging results is largely ad hoc, so individual
search engines tend to do better
Big application: the Dark Web
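A sketch of the voting-based merging described above, using a simple positional (reciprocal-rank style) vote so that no model of the component engines' ranking functions is needed:

from collections import defaultdict

def merge(result_lists, k=10):
    votes = defaultdict(float)
    for ranking in result_lists:                 # one ranked URL list per engine
        for rank, url in enumerate(ranking):
            votes[url] += 1.0 / (rank + 1)       # higher positions carry more weight
    return sorted(votes, key=votes.get, reverse=True)[:k]

merge([["a", "b", "c"], ["b", "a", "d"]])        # "a" and "b" end up at the top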
Using Structure in Search
Languages to search content and structure
– Query languages over labeled graphs
• PHIQL: used in the Microplis and PHIDIAS hypertext systems
• Web-oriented: W3QL, WebSQL, WebLog, WQL
Using Structure in Search
Other uses of structure in search
– Relevant pages have neighbors that also
tend to be relevant
– Search approaches collect (and filter) the neighbors
of returned pages
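A sketch of the neighbor-collection step; the link maps passed in are hypothetical inputs from the crawler's link graph:

def expand_with_neighbors(results, out_links, in_links):
    expanded = set(results)
    for page in results:
        expanded.update(out_links.get(page, []))   # pages the result links to
        expanded.update(in_links.get(page, []))    # pages that link to the result
    return expanded

expand_with_neighbors(["p1"], out_links={"p1": ["p2"]}, in_links={"p1": ["p3"]})
# -> {"p1", "p2", "p3"}, to be filtered and re-ranked afterwards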
Web Query Characteristics
Few terms and operators
– Average 2.35 terms per query
• 25% of queries have a single term
– Average 0.41 operators per query
Queries get repeated
– Average 3.97 instances of each query
– This is very uneven (e.g. “Britney Spears” vs. “Frank
Shipman”)
Query sessions are short
– Average 2.02 queries per session
– Average of 1.39 pages of results examined
Data from 1998 study
– How different today?