Chapter13-WebIR

Download Report

Transcript Chapter13-WebIR

Advanced information
retrieval
Computer engineering department
Chapter 13 – Web IR
Searching the web
• Introduction (History, Definitions, Aims, Statistics)
• Tasks of search engines
•
•
•
•
- Gathering
- Indexing
- “Searching” (Querying and ranking algorithms)
- Document and query management
Metasearch
Browsing and web directory
Users and web search
Research issues
Web Searching and Classical IR
TREC
Classic IR research
1970s
1980s
1990s
then came the web
web searching
2000s
Terminology and Definitions
• A web page corresponds to a document in traditional IR
• Web pages are different in their size and in the types of
•
•
•
files that constitute them
– text, graphics, sound, video, GIF, JPEG, ASCII,PDF,
etc
IR on the web considers as collection of documents the
part of the web that is publicly indexable
– exclude pages that cannot be indexed (authorization,
dynamic pages)
Location of web pages by navigation
Location of web pages by searching (IR on the web)
Challenges for web search
• Problem with the data
– Distributed data
– High percentage of volatile data
– Large volume
– Unstructured data
– Redundant data
– Quality of data
– Heterogeneous data
• Problem faced by users
– How to specify a query?
– How to interpret answers?
Search engines
• Portals
•
– identification of real names
– links to books from Amazon.com
– send electronic postcards
– translation to other languages
– search of other media (metadata)
– language-specific searches
– weather, stock price, traffic
Business model
– targeted advertising with revenue from clickthroughs
– word of mouth (since no industry standard evaluation)
– fast response (<1s) 24 hours / 7 days a week
– filter out spam
Examples
Search engines
AltaVista
Excite
Google
Infoseek
Lycos
NorthernLight
URL
www.altavista.com
www.excite.com
www.google.com
www.infoseek.com
www.lycos.com
www.nlsearch.com
• 25-55% of web covered only!
• Some search engines powered by same IR engine
• Most based in US, English
Other search engines
• Specialised in different countries/languages
(e.g., http://www.iol.it/)
• Specific
– Rank according to popularity (e.g., DirectHit)
– Topic oriented (e.g., SearchBroker)
– Personnel or institutional pages, electronic mail
addresses, images, software applets
Tasks of a web search engine
• Document gathering
– select the documents to be indexed
• Document indexing
– represent the content of the selected documents
– often 2 indices maintained (full + small for frequent queries)
• Searching
– represent the user information need into a query
– retrieval process (search algorithms, ranking of web pages)
• Document and query management
– display the results
– virtual collection (documents discarded after indexing) vs.
physical collection (documents maintained after indexing)
Document gathering
• Document gathering = crawling the web
• Crawler
– Robot, spider, wanderer, walker, knowbot, web
search agent
– Program that traverses the web to send new or
updated pages to be indexed
– Run on local server and send requests to
remote servers
Crawling the web (1)
• Crawling process
– Start with set of URLs
• Submitted by users or companies
• Popular URLs
– Breath-first or depth-first
– Extract further URLs
• Up to 10 millions pages per day
• Several crawlers
– Problem of redundancy
– Web partition  robot per partition
Crawling the web (2)
• Up-to-date?
– Non-submitted pages need up to 2 months to be
indexed
– Search engine learns change frequency of pages
– Popular pages (having many links to them)
crawled more frequently
• Guideline for robot behaviours
– File placed at root of web server
– Indicate web pages that should not be indexed
– Avoid overloading servers/sites
Typical anatomy of a large-scale crawler.
Document indexing
• Document indexing = building the indices
• Indices are variant of inverted files
–
–
–
–
meta tag analysis
stop words removal + stemming
position data (for phrase searches)
weights
• tf x idf;
• downweight long URLs (not important page)
• upweight terms appearing at the top of the documents, or
emphasised terms
– use de-spamming techniques
• hyperlink information
• count link popularity
• anchor text from source links
• hub and authority value of a page
Maintaining indices over dynamic collections.
Stop-press index
• Collection of document in flux
– Model document modification as deletion followed by insertion
– Documents in flux represented by a signed record (d,t,s)
– “s” specifies if “d” has been deleted or inserted.
• Getting the final answer to a query
– Main index returns a document set D0.
– Stop-press index returns two document sets
• D+ : documents not yet indexed in D0 matching the query
• D- : documents matching the query removed from the collection since D0
was constructed.
• Stop-press index getting too large
– Rebuild the main index
• signed (d, t, s) records are sorted in (t, d, s) order and mergepurged into the master (t, d) records
– Stop-press index can be emptied out.
Searching
• Querying
–
–
–
–
–
1 word or all words must be in the retrieved pages
normalisation (stop words removal, stemming, etc)
complex queries (date, structure, region, etc)
Boolean expressions (advanced search)
metadata
• Ranking algorithms
– use of web links
– web page authority analysis
• HITS (Hyperlink Induced Topic Search)
• PageRank (Google)
Use of web links
• Web link: represent a relationship between the
connected pages
• The main difference between standard IR algorithms
and web IR algorithms is the massive presence of
web links
– web links are source of evidence but also source of noise
– classical IR: citation-based IR
– web track in TREC, 2000, TREC-9: Small Web task (2GB
of web data); Large Web task (100GB of web data, 18.5
million documents)
Use of anchor text
• represent referenced document
– why?
• provides more accurate and concise description than the page
itself
• (probably) contains more significant terms than the page itself
– used by ‘WWW Worm’ - one of the first search engines
1994
– representation of images, programs, …
• generate page descriptions from anchor text
Algorithms
• Query independent page quality
– global analysis
• PageRank (Google): simulates a random walk across the web and
computes the “score” of a page as the probability of reaching the
page
• Query dependent page quality
– local analysis
• HITS (Hyperlink Induced Topic Search): focusses on broad
topic queries that are likely to be answered with too many pages
– the more a page is pointed to by other pages, the more popular is the
page
– popular pages are more likely to include relevant information than nonpopular pages
PageRank (1)
• Designed by Brin and Page at Stanford University
• Used to implement Google
• Main idea:
– a page has a high rank if the sum of the ranks of its in-links
is high
• in-link of page p: a link from a page to page p
• out-link of a page p: a link from page p to a page
– a high PageRank page has many in-links or few highly
ranked in-links
• Retrieval: use cosine product (content, feature, term
weight) combined with PageRank value
PageRank (2)
• Random Surfer Model : user randomly navigates
– Initially the surfer is at a random page
– At each step the surfer proceeds
• to a randomly chosen Web page with probability d called the “damping factor”
•
(e.g. probability of random jump = 0.2)
to a randomly chosen page linked to the current page with probability 1-d
(e.g. probability of following a random outlink = 0.8)
• Process modelled by Markov Chain
– PageRank PR of a page a = probability that the surfer is at page
a on a given time
PR(a) = Kd + K(1-d) i=1,n PR(ai)/C(ai)
d set by system
K normalisation factor
a = page pointed by ai for i=1,n
C(ai) = number of outlinks of ai
HITS: Hypertext Induced Topic Search
• Originated from Kleinberg, 1997
• Also referred to as the “The Connectivity Analysis Approach”
• Broad topic queries produce large sets of retrieved results
– abundance problem  too many relevant documents
– new type of quality measure needed  distinguish the most
“authoritative” pages  high-quality response to a broad query
• HITS: for a certain topic, it identifies
– good authorities
• pages that contain relevant information (good sources of content)
– good hubs
• page that point to useful pages (good sources of links)
HITS (2)
• Intuition
– authority comes from inlinks
– being a good hub comes from outlinks
– better authority comes from inlinks from good hubs
– being a better hub comes from outlinks to good authorities
• Mutual reinforcement between hubs and authorities
– a good authority page is pointed to by many hub pages
– a good hub page point to many authority pages
HITS (3)
• Set of pages S that are retrieved
– sometimes k (e.g. k = 200) top-ranked pages
• Set of pages T that point to or are pointed to by
retrieved set of pages S
– rank pages according to in_degree (number of in-links) - not
effective
• Set of pages T :
– authoritative pages relevant to query should have a large
in_degree
– considerable overlap in the sets of pages that point to them
 hubs
Algorithm for HITS (General Principle)
• Computation of hub and authority value of a page
through the iterative propagation of “authority weight”
and “hub weight”
• Initially all values equal to 1
• Authority weight of page x(p)
– if p is pointed to by many pages with large y-values,then it
should receive a large x-value
•
•
x(p) = qip y(qi)
Hub weight of page y(p)
– if p points to many pages with large x-values, then it should
receive a large y-value
y(p) = pqi x(qi)
After each computation (iteration), weights are
normalised
Topic distillation
• Process of finding ‘quality’ (authority - hub) Web documents
related to a query topic, given an initial user information need.
• Extensions of HITS
– ARC (Automatic Resource Compilation)
• distance-2 neighbourhood graph
• anchor (& surrounding) text used in computation of hub and authority
values
– SALSA, etc.
• Problems with HITS
–
–
–
–
mutual reinforcing relationship between hosts
automatically generated links
non relevant highly connected pages
topic drift: generalisation of the query topic
Difference between PageRank and
HITS
• The PageRank is computed for all web pages stored in
the database and then prior to the query; HITS is
performed on the set of retrieved web pages, and for
each query.
• HITS computes authorities and hubs; PageRank
computes authorities only.
• PageRank: non-trivial to compute, HITS: easy to
compute, but real-time execution is hard
• Implementation details of PageRank have been
Document and query
management
• Results
–
–
–
–
–
–
Usually screens of 10 pages
Clustering
URL, size, date, abstract, etc
Various sorting
Most similar documents options
Query refinement
• Virtual collection vs. physical collection
– document can change over time
– document may be different to the one that has been indexed
– broken link
Metasearch (1)
• Problems of Web search engines:
– limited coverage of the publicly indexable Web
– index different overlapping sections of the Web
– based on different IR models  different results to the same
query
 users do not have the time, knowledge to select the most
appropriate search engines with regard to their information
need
• Possible solution: metasearch engines
– Web server that sends query to
• Several search engines, Web directories, Databases
– Collect results
– Unify (merge) them - Data fusion
• Aim: better coverage, increase effectiveness
Metasearch (2)
• Divided into phases
– search engine selection
• topic-dependent, past queries, network traffic, …
– document selection
• how many documents from each search engine?
– merging algorithm
• utilise rank positions, document retrieval scores, …
Metasearcher
URL
used
MetaCrawler www.metacrawler.com
Dogpile
www.dogpile.com
25
SavvySearch www.search.com
Sources
13
> 1000
MetaCrawler
Browsing
• Web directory
– Catalog, yellow pages, subject categories
– Many standard search engines provide subject categories now
• Hierarchical taxonomy that classifies human
knowledge
Arts & Humanities
Automotive
Business & Economy
Computers & Internet
Education
Employment
Entertainment & Leisure
Games
Government
Investing
Kids & Family
Life & Style
News
People
Philosophy & Religion
Politics
Science & Technologies
Social Science …
Yahoo!
• Around one million classified pages
• More than 14 country specific directories
(Chinese, Danish, French, …)
• Manual classification
(has its problem)
• Acyclic graphs
• Search within the classified pages/taxonomy
• Problem of coverage
Users and the web (1)
• Queries on the web: average values
Measure
Average value
Range
Number of words
0 - 393
Number of operators
0 - 958
Repetitions of queries
1 - 1.5 million
Queries per user session
2.35
0.41
3.97
2.02
Users and the web (2)
• Some statistics
– Main purpose: research, leisure, business, education,
• products and services (e-commerce)
• people and company names and home pages
• factoids (from any one of a number of documents)
• entire, broad documents
• mp3, image, video, audio
–
–
–
–
80% do not modify query
85% look first screen only
64% queries are unique
25% users use single keywords (problem for polysemic
words and synonyms)
– 10% queries are empty
Users and the web (3)
• Results of search engines are identical, independent
of
– user
– context in which the user made the request
• adding context information for improving search
results
it directly
 focus on the user need and answer
– explicit context
• query + category
– implicit context
• based on documents edited or viewed by user
– personalised search
• previous requests and interests, user profile
Research issues
• Modelling
• Querying
• Distributed architecture
• Ranking
• Indexing
• Dynamic pages
• Browsing
• User interface
• Duplicated data