
Advanced information retrieval
Computer engineering department
Chapter 13 – Web IR
Searching the web
• Introduction (history, definitions, aims, statistics)
• Tasks of search engines
  – Gathering
  – Indexing
  – "Searching" (querying and ranking algorithms)
  – Document and query management
• Metasearch
• Browsing and web directories
• Users and web search
• Research issues
Web Searching and Classical IR
• Timeline: classic IR research (including TREC) through the 1970s, 1980s and 1990s; then came the web, and web searching dominated the 2000s
Terminology and Definitions
• A web page corresponds to a document in traditional IR
• Web pages differ in size and in the types of files that constitute them
  – text, graphics, sound, video, GIF, JPEG, ASCII, PDF, etc.
• IR on the web takes as its collection of documents the part of the web that is publicly indexable
  – excludes pages that cannot be indexed (authorization required, dynamic pages)
• Location of web pages by navigation
• Location of web pages by searching (IR on the web)
Challenges for web search
• Problem with the data
– Distributed data
– High percentage of volatile data
– Large volume
– Unstructured data
– Redundant data
– Quality of data
– Heterogeneous data
• Problem faced by users
– How to specify a query?
– How to interpret answers?
Search engines
• Portals
  – identification of real names
  – links to books from Amazon.com
  – sending electronic postcards
  – translation to other languages
  – search of other media (metadata)
  – language-specific searches
  – weather, stock prices, traffic
• Business model
  – targeted advertising with revenue from click-throughs
  – word of mouth (since there is no industry-standard evaluation)
  – fast response (< 1 s), 24 hours a day, 7 days a week
  – filtering out spam
Examples
Search engine – URL
AltaVista – www.altavista.com
Excite – www.excite.com
Google – www.google.com
Infoseek – www.infoseek.com
Lycos – www.lycos.com
NorthernLight – www.nlsearch.com
• Only 25-55% of the web is covered!
• Some search engines are powered by the same IR engine
• Most are based in the US and oriented to English
Other search engines
• Specialised in different countries/languages
(e.g., http://www.iol.it/)
• Specific
– Rank according to popularity (e.g., DirectHit)
– Topic oriented (e.g., SearchBroker)
– Personal or institutional pages, electronic mail addresses, images, software applets
Tasks of a web search engine
• Document gathering
– select the documents to be indexed
• Document indexing
– represent the content of the selected documents
– often 2 indices maintained (full + small for frequent queries)
• Searching
– represent the user's information need as a query
– retrieval process (search algorithms, ranking of web pages)
• Document and query management
– display the results
– virtual collection (documents discarded after indexing) vs.
physical collection (documents maintained after indexing)
Document gathering
• Document gathering = crawling the web
• Crawler
– Robot, spider, wanderer, walker, knowbot, web
search agent
– A program that traverses the web to send new or updated pages to be indexed
– Runs on a local server and sends requests to remote servers
Crawling the web (1)
• Crawling process (see the sketch after this slide)
  – Start with a set of URLs
    • Submitted by users or companies
    • Popular URLs
  – Traverse breadth-first or depth-first
  – Extract further URLs from fetched pages
• Up to 10 million pages per day
• Several crawlers
  – Problem of redundancy
  – Web partitioning → one robot per partition
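A minimal breadth-first crawl sketch in Python, using only the standard library. The `crawl` helper, the regex-based link extraction and the page limit are illustrative simplifications, not part of any real engine's crawler.

```python
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl starting from a set of seed URLs (simplified sketch)."""
    frontier = deque(seed_urls)          # URLs waiting to be fetched
    seen = set(seed_urls)                # avoid re-fetching (redundancy problem)
    pages = {}                           # url -> raw HTML, handed to the indexer

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()         # popleft() = breadth-first; pop() would be depth-first
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except Exception:
            continue                     # skip unreachable or non-HTML pages
        pages[url] = html
        # extract further URLs (very naive href extraction, for illustration only)
        for link in re.findall(r'href="([^"#]+)"', html):
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages
```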
Crawling the web (2)
• Up to date?
  – Non-submitted pages may need up to 2 months to be indexed
  – The search engine learns the change frequency of pages
  – Popular pages (having many links to them) are crawled more frequently
• Guidelines for robot behaviour (robots.txt; see the sketch after this slide)
  – File placed at the root of the web server
  – Indicates web pages that should not be indexed
  – Avoids overloading servers/sites
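A sketch of honouring the robots.txt guideline file with Python's standard urllib.robotparser; the example.com host and the "MyCrawler" user agent are placeholders.

```python
from urllib import robotparser

# Read the guideline file placed at the root of the web server (robots.txt)
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")   # example.com is a placeholder host
rp.read()

# Before fetching a page, the crawler checks whether indexing is allowed
if rp.can_fetch("MyCrawler", "https://example.com/private/page.html"):
    print("allowed to fetch")
else:
    print("disallowed by robots.txt")
```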
Typical anatomy of a large-scale crawler.
Document indexing
• Document indexing = building the indices
• Indices are variants of inverted files (a tf × idf sketch follows this slide)
  – meta-tag analysis
  – stop-word removal + stemming
  – position data (for phrase searches)
  – weights
    • tf × idf
    • downweight long URLs (not an important page)
    • upweight terms appearing at the top of the document, or emphasised terms
  – use de-spamming techniques
  – hyperlink information
    • count link popularity
    • anchor text from source links
    • hub and authority value of a page
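A toy sketch of building an inverted file with tf × idf weights, assuming documents have already been tokenised, stop-word filtered and stemmed; the `build_index` helper and the example URLs are hypothetical.

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """docs: dict url -> list of (already stemmed, stop-word-free) terms.
    Returns an inverted file: term -> list of (url, tf*idf weight)."""
    n = len(docs)
    df = Counter()                        # document frequency per term
    for terms in docs.values():
        df.update(set(terms))

    index = defaultdict(list)
    for url, terms in docs.items():
        tf = Counter(terms)               # term frequency within this document
        for term, freq in tf.items():
            idf = math.log(n / df[term])
            index[term].append((url, freq * idf))
    return index

index = build_index({
    "http://a.example/": ["web", "search", "engine", "web"],
    "http://b.example/": ["web", "crawler"],
    "http://c.example/": ["clustering", "algorithm"],
})
print(index["web"])   # postings for "web": [(url, tf*idf), ...]
```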
Maintaining indices over dynamic collections.
Stop-press index
• Collection of documents in flux
  – Model document modification as a deletion followed by an insertion
  – Documents in flux are represented by a signed record (d, t, s)
  – "s" specifies whether "d" has been deleted or inserted
• Getting the final answer to a query (a set-merge sketch follows this slide)
  – The main index returns a document set D0
  – The stop-press index returns two document sets
    • D+ : documents matching the query that are not yet indexed in D0
    • D- : documents matching the query that have been removed from the collection since D0 was constructed
• When the stop-press index gets too large
  – Rebuild the main index
    • signed (d, t, s) records are sorted in (t, d, s) order and merge-purged into the master (t, d) records
  – The stop-press index can then be emptied out
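The final answer can be assembled directly from the three sets the slide describes; a minimal sketch, assuming the main and stop-press indices return plain sets of document ids.

```python
def final_answer(d0, d_plus, d_minus):
    """d0: documents returned by the main index;
    d_plus / d_minus: documents inserted / deleted according to the
    stop-press index (all represented as sets of document ids)."""
    return (d0 | d_plus) - d_minus

print(final_answer({1, 2, 3}, {7}, {2}))   # -> {1, 3, 7}
```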
Searching
• Querying
  – one word or all words must be in the retrieved pages
  – normalisation (stop-word removal, stemming, etc.)
  – complex queries (date, structure, region, etc.)
  – Boolean expressions (advanced search)
  – metadata
• Ranking algorithms
– use of web links
– web page authority analysis
• HITS (Hyperlink Induced Topic Search)
• PageRank (Google)
Meta-search systems
• Take the search engine to the document
– Forward queries to many geographically distributed
repositories
• Each has its own search service
– Consolidate their responses.
• Advantages
– Perform non-trivial query rewriting
• Suit a single user query to many search engines with different
query syntax
– Surprisingly small overlap between crawls
• Consolidating responses
– Function goes beyond just eliminating duplicates
– Search services do not provide standard ranks which can
be combined meaningfully
Similarity search
• Cluster hypothesis
– Documents similar to relevant documents are also
likely to be relevant
• Handling “find similar” queries
– Replication or duplication of pages
– Mirroring of sites
Document similarity
• Jaccard coefficient of similarity between documents d1 and d2
  – T(d) = set of tokens in document d
  – r'(d1, d2) = |T(d1) ∩ T(d2)| / |T(d1) ∪ T(d2)|
  – symmetric, reflexive, but not a metric
  – forgives any number of occurrences and any permutation of the terms
  – 1 - r'(d1, d2) is a metric (see the sketch after this slide)
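A direct transcription of the coefficient into Python; the toy documents are made up.

```python
def jaccard(d1_tokens, d2_tokens):
    """r'(d1, d2) = |T(d1) ∩ T(d2)| / |T(d1) ∪ T(d2)|"""
    t1, t2 = set(d1_tokens), set(d2_tokens)
    return len(t1 & t2) / len(t1 | t2)

r = jaccard("a rose is a rose".split(), "a rose is a flower".split())
print(r)        # similarity: ignores occurrence counts and permutations
print(1 - r)    # 1 - r'(d1, d2) is a proper distance metric
```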
Finding near duplicates algorithm
Use of web links
• Web link: represent a relationship between the
connected pages
• The main difference between standard IR algorithms
and web IR algorithms is the massive presence of
web links
– web links are source of evidence but also source of noise
– classical IR: citation-based IR
– web track in TREC, 2000, TREC-9: Small Web task (2GB
of web data); Large Web task (100GB of web data, 18.5
million documents)
Use of anchor text
• represent referenced document
– why?
• provides more accurate and concise description than the page
itself
• (probably) contains more significant terms than the page itself
– used by ‘WWW Worm’, one of the first search engines (1994)
– representation of images, programs, …
• generate page descriptions from anchor text
Algorithms
• Query independent page quality
– global analysis
• PageRank (Google): simulates a random walk across the web and
computes the “score” of a page as the probability of reaching the
page
• Query dependent page quality
– local analysis
• HITS (Hyperlink Induced Topic Search): focusses on broad
topic queries that are likely to be answered with too many pages
– the more a page is pointed to by other pages, the more popular the page is
– popular pages are more likely to include relevant information than non-popular pages
PageRank (1)
• Designed by Brin and Page at Stanford University
• Used to implement Google
• Main idea:
– a page has a high rank if the sum of the ranks of its in-links
is high
• in-link of page p: a link from a page to page p
• out-link of a page p: a link from page p to a page
– a high PageRank page has many in-links or few highly
ranked in-links
• Retrieval: a content-based score (e.g. cosine over term weights and other features) is combined with the PageRank value
PageRank (2)
• Random surfer model: the user navigates randomly
  – Initially the surfer is at a random page
  – At each step the surfer proceeds
    • to a randomly chosen web page with probability d, the "damping factor" (e.g. probability of a random jump = 0.2)
    • to a randomly chosen page linked from the current page with probability 1-d (e.g. probability of following a random outlink = 0.8)
• The process is modelled by a Markov chain
  – PageRank PR of a page a = probability that the surfer is at page a at a given time

    PR(a) = K·d + K·(1-d)·Σi=1..n PR(ai)/C(ai)

    d = damping factor, set by the system
    K = normalisation factor
    ai (i = 1..n) = the pages that point to a
    C(ai) = number of outlinks of ai

  (a power-iteration sketch follows this slide)
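A power-iteration sketch of one common reading of the slide's formula: the normalisation is handled by using d/N as the random-jump term so that the ranks sum to 1, and dangling pages spread their rank evenly (an assumption the slide does not spell out). The toy graph is hypothetical.

```python
def pagerank(out_links, d=0.2, iterations=50):
    """out_links: dict page -> list of pages it points to.
    PR(a) = d/N + (1-d) * sum over in-links ai of PR(ai)/C(ai),
    with d = probability of a random jump."""
    pages = list(out_links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: d / n for p in pages}              # random-jump term
        for p, outs in out_links.items():
            if not outs:                             # dangling page: spread its rank evenly
                for q in pages:
                    new[q] += (1 - d) * pr[p] / n
            else:
                for q in outs:                       # follow-a-link term
                    new[q] += (1 - d) * pr[p] / len(outs)
        pr = new
    return pr

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}    # toy web graph
print(pagerank(graph))
```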
HITS: Hypertext Induced Topic Search
• Originated from Kleinberg, 1997
• Also referred to as the "Connectivity Analysis Approach"
• Broad topic queries produce large sets of retrieved results
  – abundance problem → too many relevant documents
  – a new type of quality measure is needed → distinguish the most "authoritative" pages → high-quality response to a broad query
• HITS: for a certain topic, it identifies
  – good authorities
    • pages that contain relevant information (good sources of content)
  – good hubs
    • pages that point to useful pages (good sources of links)
HITS (2)
• Intuition
– authority comes from inlinks
– being a good hub comes from outlinks
– better authority comes from inlinks from good hubs
– being a better hub comes from outlinks to good authorities
• Mutual reinforcement between hubs and authorities
– a good authority page is pointed to by many hub pages
– a good hub page points to many authority pages
HITS (3)
• Set of pages S that are retrieved
– sometimes k (e.g. k = 200) top-ranked pages
• Set of pages T that point to or are pointed to by the retrieved set of pages S
  – ranking pages according to in_degree (number of in-links) alone is not effective
• Within the set of pages T:
  – authoritative pages relevant to the query should have a large in_degree
  – there is considerable overlap in the sets of pages that point to them → hubs
Algorithm for HITS (General Principle)
• Computation of the hub and authority value of a page through the iterative propagation of "authority weight" and "hub weight"
• Initially all values are equal to 1
• Authority weight of page p: x(p)
  – if p is pointed to by many pages with large y-values, then it should receive a large x-value
  – x(p) = Σ{qi → p} y(qi)
• Hub weight of page p: y(p)
  – if p points to many pages with large x-values, then it should receive a large y-value
  – y(p) = Σ{p → qi} x(qi)
• After each computation (iteration), the weights are normalised
  (an iteration sketch follows this slide)
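A sketch of the iteration just described, with the weights L2-normalised after each pass; the `links` adjacency dict and the toy graph are assumptions for illustration.

```python
import math

def hits(links, iterations=20):
    """links: dict page -> list of pages it points to.
    Returns (authority values x, hub values y)."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    x = {p: 1.0 for p in pages}       # authority weights, initially 1
    y = {p: 1.0 for p in pages}       # hub weights, initially 1
    for _ in range(iterations):
        # x(p) = sum of y(q) over pages q that point to p
        x = {p: sum(y[q] for q, outs in links.items() if p in outs) for p in pages}
        # y(p) = sum of x(q) over pages q that p points to
        y = {p: sum(x[q] for q in links.get(p, [])) for p in pages}
        # normalise after each iteration
        nx = math.sqrt(sum(v * v for v in x.values())) or 1.0
        ny = math.sqrt(sum(v * v for v in y.values())) or 1.0
        x = {p: v / nx for p, v in x.items()}
        y = {p: v / ny for p, v in y.items()}
    return x, y

auth, hub = hits({"a": ["b", "c"], "b": ["c"], "d": ["c"]})
print(auth, hub)
```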
Topic distillation
• Process of finding ‘quality’ (authority - hub) Web documents
related to a query topic, given an initial user information need.
• Extensions of HITS
– ARC (Automatic Resource Compilation)
• distance-2 neighbourhood graph
• anchor (& surrounding) text used in computation of hub and authority
values
– SALSA, etc.
• Problems with HITS
  – mutually reinforcing relationships between hosts
  – automatically generated links
  – non-relevant, highly connected pages
  – topic drift: generalisation of the query topic
Difference between PageRank and HITS
• PageRank is computed for all web pages stored in the database, prior to any query; HITS is performed on the set of retrieved web pages, for each query
• HITS computes authorities and hubs; PageRank computes authorities only
• PageRank: non-trivial to compute; HITS: easy to compute, but real-time execution is hard
• Implementation details of PageRank have been reported
Document and query management
• Results
  – usually screens of 10 pages
  – clustering
  – URL, size, date, abstract, etc.
  – various sorting options
  – "most similar documents" option
  – query refinement
• Virtual collection vs. physical collection
  – documents can change over time
  – a document may differ from the one that was indexed
  – broken links
Metasearch (1)
• Problems of web search engines:
  – limited coverage of the publicly indexable web
  – they index different, overlapping sections of the web
  – they are based on different IR models → different results for the same query
  → users do not have the time or knowledge to select the most appropriate search engine for their information need
• Possible solution: metasearch engines
  – a web server that sends the query to
    • several search engines, web directories, databases
  – collects the results
  – unifies (merges) them - data fusion
• Aim: better coverage, increased effectiveness
Metasearch (2)
• Divided into phases
– search engine selection
• topic-dependent, past queries, network traffic, …
– document selection
• how many documents from each search engine?
– merging algorithm
  • utilise rank positions, document retrieval scores, … (a simple rank-based fusion sketch follows)
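One simple way to merge using rank positions only (a reciprocal-rank style fusion, which is one option among those the slide lists, not a prescribed algorithm); duplicate URLs returned by several engines reinforce each other.

```python
from collections import defaultdict

def merge(result_lists):
    """result_lists: one ranked list of URLs per search engine.
    Merges by rank position only, since retrieval scores are not
    comparable across engines."""
    score = defaultdict(float)
    for ranking in result_lists:
        for rank, url in enumerate(ranking, start=1):
            score[url] += 1.0 / rank       # better ranks contribute more
    return sorted(score, key=score.get, reverse=True)

print(merge([
    ["u1", "u2", "u3"],   # engine A
    ["u2", "u4"],         # engine B
]))
```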
Metasearcher – URL used – Sources
MetaCrawler – www.metacrawler.com – 13
Dogpile – www.dogpile.com – 25
SavvySearch – www.search.com – > 1000
MetaCrawler
Browsing
• Web directory
– Catalog, yellow pages, subject categories
– Many standard search engines provide subject categories now
• Hierarchical taxonomy that classifies human
knowledge
Arts & Humanities
Automotive
Business & Economy
Computers & Internet
Education
Employment
Entertainment & Leisure
Games
Government
Investing
Kids & Family
Life & Style
News
People
Philosophy & Religion
Politics
Science & Technologies
Social Science …
Yahoo!
• Around one million classified pages
• More than 14 country specific directories
(Chinese, Danish, French, …)
• Manual classification (has its problems)
• Acyclic graphs
• Search within the classified pages/taxonomy
• Problem of coverage
Users and the web (1)
• Queries on the web: average values
Measure – Average value – Range
Number of words – 2.35 – 0-393
Number of operators – 0.41 – 0-958
Repetitions of queries – 3.97 – 1-1.5 million
Queries per user session – 2.02 – 1-173325
Screens per query – 1.39 – 1-78496
Users and the web (2)
• Some statistics
  – Main purposes: research, leisure, business, education
    • products and services (e-commerce)
    • people and company names and home pages
    • factoids (answerable from any one of a number of documents)
    • entire, broad documents
    • mp3, image, video, audio
  – 80% do not modify the query
  – 85% look at the first screen only
  – 64% of queries are unique
  – 25% of users use single keywords (a problem for polysemous words and synonyms)
  – 10% of queries are empty
Users and the web (3)
• Results of search engines are identical, independent
of
– user
– context in which the user made the request
• Adding context information to improve search results → focus on the user need and answer it directly
– explicit context
• query + category
– implicit context
• based on documents edited or viewed by user
– personalised search
• previous requests and interests, user profile
Motivation
• Problem 1: a query word could be ambiguous
  – e.g., the query "Star" retrieves documents about astronomy, plants, animals, etc.
– Solution: Visualisation
• Clustering document responses to queries along lines of different
topics.
• Problem 2: Manual construction of topic hierarchies
and taxonomies
– Solution:
• Preliminary clustering of large samples of web documents.
• Problem 3: Speeding up similarity search
– Solution:
• Restrict the search for documents similar to a query to most
representative cluster(s).
Example
Scatter/Gather, a text clustering system, can separate salient topics in the response to keyword queries. (Image courtesy of Hearst)
Clustering
• Task: evolve measures of similarity to cluster a collection of documents/terms into groups, such that similarity within a cluster is larger than similarity across clusters
• Cluster hypothesis: given a `suitable' clustering of a collection, if the user is interested in a document/term d/t, he is likely to be interested in other members of the cluster to which d/t belongs
• Similarity measures (a cosine sketch follows this slide)
  – represent documents by TF-IDF vectors
  – distance between document vectors
  – cosine of the angle between document vectors
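A sketch of the cosine measure over sparse TF-IDF vectors represented as Python dicts; the example weights are made up.

```python
import math

def cosine(v1, v2):
    """Cosine of the angle between two sparse TF-IDF vectors (dicts term -> weight)."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

d1 = {"web": 0.7, "search": 0.5}
d2 = {"web": 0.4, "crawler": 0.9}
print(cosine(d1, d2))
```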
Clustering: Parameters
• Similarity measure s(d1, d2) (e.g. cosine similarity)
• Distance measure δ(d1, d2) (e.g. Euclidean distance)
• Number "k" of clusters
• Issues
  – large number of noisy dimensions
  – the notion of noise is application dependent
Clustering: Formal specification
• Partitioning Approaches
– Bottom-up clustering
– Top-down clustering
• Geometric Embedding Approaches
– Self-organizing map
– Multidimensional scaling
– Latent semantic indexing
• Generative models and probabilistic approaches
– Single topic per document
– Documents correspond to mixtures of multiple topics
Partitioning Approaches
• Partition the document collection into k clusters {D1, D2, ..., Dk}
• Choices:
  – minimize intra-cluster distance: Σi Σ{d1,d2 ∈ Di} δ(d1, d2)
  – maximize intra-cluster semblance: Σi Σ{d1,d2 ∈ Di} s(d1, d2)
• If cluster representations D̄i (e.g. centroids) are available
  – minimize Σi Σ{d ∈ Di} δ(d, D̄i)
  – maximize Σi Σ{d ∈ Di} s(d, D̄i)
• Soft clustering
  – d is assigned to cluster Di with `confidence' zd,i
  – find the zd,i so as to minimize Σi Σ{d ∈ Di} zd,i δ(d, D̄i) or maximize Σi Σ{d ∈ Di} zd,i s(d, D̄i)
• Two ways to get partitions: bottom-up clustering and top-down clustering
Bottom-up clustering (HAC)
• Initially G is a collection of singleton groups, each with one document d
• Repeat
  – find Γ, Δ in G with the maximum similarity measure s(Γ ∪ Δ)
  – merge group Γ with group Δ
• For each Γ keep track of the best Δ
• Use the above information to plot the hierarchical merging process (DENDROGRAM)
• To get the desired number of clusters: cut across any level of the dendrogram
  (a merging sketch follows this slide)
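A naive, deliberately unoptimised sketch of the merging loop, using average pairwise similarity as s(Γ ∪ Δ) and stopping at k groups (i.e. cutting the dendrogram at that level); the toy documents and the Jaccard similarity reuse the earlier slide's definitions.

```python
def hac(docs, similarity, k):
    """Bottom-up (agglomerative) clustering sketch.
    docs: list of documents; similarity(a, b): inter-document similarity;
    groups are merged greedily until k clusters remain."""
    groups = [[d] for d in docs]                       # start with singleton groups

    def group_sim(g1, g2):                             # average pairwise similarity
        return sum(similarity(a, b) for a in g1 for b in g2) / (len(g1) * len(g2))

    while len(groups) > k:
        # find the pair of groups with maximum similarity and merge them
        i, j = max(((i, j) for i in range(len(groups)) for j in range(i + 1, len(groups))),
                   key=lambda ij: group_sim(groups[ij[0]], groups[ij[1]]))
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return groups

# toy example: documents as token sets, Jaccard similarity
sim = lambda a, b: len(a & b) / len(a | b)
print(hac([{"web", "ir"}, {"web", "search"}, {"cats"}], sim, 2))
```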
Dendrogram
A dendrogram presents the progressive, hierarchy-forming merging process pictorially.
Similarity measure
• Typically s(Γ ∪ Δ) decreases with an increasing number of merges
• Self-similarity
  – average pairwise similarity between documents in Γ:
    s(Γ) = (1 / (|Γ| choose 2)) Σ{d1,d2 ∈ Γ} s(d1, d2)
  – s(d1, d2) = inter-document similarity measure (say, cosine of TF-IDF vectors)
  – other criteria: maximum/minimum pairwise similarity between documents in the clusters
Switch to top-down
• Bottom-up
– Requires quadratic time and space
• Top-down or move-to-nearest
– Internal representation for documents as well as
clusters
– Partition documents into `k’ clusters
– 2 variants
• “Hard” (0/1) assignment of documents to clusters
• “soft” : documents belong to clusters, with fractional scores
– Termination
• when assignment of documents to clusters ceases to change
much OR
• When cluster centroids move negligibly over successive
iterations
Top-down clustering
• Hard k-means: repeat… (see the sketch after this slide)
  – choose k arbitrary 'centroids'
  – assign each document to its nearest centroid
  – recompute centroids
• Soft k-Means :
– Don’t break close ties between document assignments to
clusters
– Don’t make documents contribute to a single cluster which
wins narrowly
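A sketch of the hard k-means variant with Euclidean distance on dense document vectors; initialising centroids by random sampling is one common choice, not mandated by the slide.

```python
import random

def kmeans(vectors, k, iterations=20):
    """Hard k-means sketch: vectors are equal-length lists of floats."""
    centroids = random.sample(vectors, k)              # choose k arbitrary centroids
    clusters = []
    for _ in range(iterations):
        # assign each document vector to its nearest centroid (Euclidean distance)
        clusters = [[] for _ in range(k)]
        for v in vectors:
            d = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
            clusters[d.index(min(d))].append(v)
        # recompute each centroid as the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = [sum(x) / len(cl) for x in zip(*cl)]
    return clusters, centroids

clusters, _ = kmeans([[0, 0], [0, 1], [9, 9], [10, 10]], k=2)
print(clusters)
```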
Research issues
• Modelling
• Querying
• Distributed architecture
• Ranking
• Indexing
• Dynamic pages
• Browsing
• User interface
• Duplicated data