Transcript Lecture 5

CSM06 Information Retrieval
Lecture 5: Web IR part 2
Dr Andrew Salway
[email protected]
Recap of Lecture 4
• Various techniques that search
engines can use to index and
rank web pages
• In particular, techniques that
exploit hypertext structure:
– Use of anchor text
– Link analysis → PageRank
• Plus, techniques that analyse the
words in webpages
Recap of Lecture 4 (*ADDED)
• The ‘Random Surfer’ explanation
of PageRank:
• A web surfer follows links at
random: at a page with no
outlinks they ‘teleport’ at random
to another page…
“the PageRank value of a web page
is the long run probability that the
surfer will visit that page” (Levene
2005, page 95)
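A minimal Python sketch of the random-surfer idea, assuming a toy link graph and the commonly used damping factor 0.85 (neither comes from the lecture): repeatedly redistributing probability along the links approximates the long-run visit probability that defines PageRank.

# Sketch: PageRank as the long-run visit probability of a random surfer.
# Hypothetical toy graph: page -> list of pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}   # D has no outlinks

d = 0.85                              # follow a link with probability d, teleport with 1 - d
pages = list(links)
n = len(pages)
pr = {p: 1.0 / n for p in pages}      # start from a uniform distribution

for _ in range(50):                   # iterate until (approximately) converged
    new = {p: (1 - d) / n for p in pages}
    for p, outs in links.items():
        if outs:                      # spread this page's current rank over its outlinks
            for q in outs:
                new[q] += d * pr[p] / len(outs)
        else:                         # no outlinks: the surfer 'teleports' at random
            for q in pages:
                new[q] += d * pr[p] / n
    pr = new

print(pr)   # each value approximates the long-run probability of visiting that page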
Recap of Lecture 4 (*ADDED)
• “Whether PageRank Leakage
exists or not, is a question of
semantics. The PageRank for a
given page is solely determined
by the inbound links. However,
an outgoing link can drain the
entire site for PageRank”.
http://www.pagerank.dk/Pagerankleak/Pagerank-leakage.htm
Past Exams
•Previous exams and solutions for CSM06 are available from:
www.computing.surrey.ac.uk/personal/pg/A.Salway/csm06/csm06_exam_info_2005.html
IMPORTANT
•1) The content of the module is updated and revised each year so some of the past
questions refer to topics that were not part of the module in 2005.
•2) There were some changes to the structure of the exam in 2004, e.g. each question
is worth 50 marks. This will be the same in 2005. Also, in 2004 more emphasis was put
on current research and development of information retrieval systems (cf. some of the
research papers given as Set Reading). As in 2004 the 2005 exam will include
questions that ask you to write about some specified research. You do NOT need to go
beyond the lecture content and the Set Reading to answer these questions.
•3) The solutions that are provided are written in a style to help with the marking of the
exams – this does not necessarily reflect how you would be expected to write your
answers, e.g. solutions are sometimes given in note form, whereas you would normally
be expected to write full sentences.
Lecture 5: OVERVIEW
• Retrieving ‘similar pages’ based on
link analysis: companion and
cocitation algorithms (Dean and
Henzinger 1999)
• Transforming questions into
queries: TRITUS system (Agichtein,
Lawrence and Gravano 2001)
• Evaluating web search engines
“Finding Related Pages in the World
Wide Web” (Dean and Henzinger 1999)
• Use a webpage (URL) as a query – may be an
easier way for a user to express their information
need
– The user is saying “I want more pages like this one” –
maybe easier than thinking of good query words?
– e.g. the URL www.nytimes.com (New York Times
newspaper) returns URLs for other newspapers and
news organisations
• Aim is for high precision with fast execution using
minimal information
Two algorithms to find pages related to the query
page using only connectivity information, i.e. link
analysis (nothing about webpage content or usage):
– Companion Algorithm
– Cocitation Algorithm
What does ‘related’ mean?
“A related web page is one
that addresses the same
topic as the original page,
but is not necessarily
semantically identical”
Companion Algorithm
• Based on Kleinberg’s HITS
algorithm – mutually reinforcing
authorities and hubs
1. Build a vicinity graph for u
2. Contract duplicates and near-duplicates
3. Compute edge weights (i.e. links)
4. Compute hub and authority scores for
each node (URL) in the graph → return
highest ranked authorities as results set
Companion Algorithm
1. Build a vicinity graph for u
The graph is made up of the following nodes
and edges between them:
• u
• Up to B parents of u, and for each parent up
to BF of its children – if u has > B parents
then choose randomly; if a parent has > BF
children, then choose children closest to u
• Up to F children of u, and for each child up to
FB of its parents
NB. Use of a ‘stop list’ of URLs with very high indegree
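A rough Python sketch of step 1, under the assumption that hypothetical helpers get_parents(url) and get_children(url) query a link index; B, BF, F and FB are illustrative limits, and "children closest to u" is simplified to taking the first BF children.

import random

B, BF, F, FB = 50, 8, 8, 10            # illustrative limits, not the paper's exact values

def vicinity_graph(u, get_parents, get_children, stop_list=frozenset()):
    # Nodes of the vicinity graph for query page u (edges come from the link index).
    nodes = {u}
    parents = [p for p in get_parents(u) if p not in stop_list]   # skip very-high-indegree URLs
    if len(parents) > B:
        parents = random.sample(parents, B)      # > B parents: choose at random
    for p in parents:
        nodes.add(p)
        nodes.update(get_children(p)[:BF])       # simplification of "children closest to u"
    for c in get_children(u)[:F]:
        nodes.add(c)
        nodes.update(get_parents(c)[:FB])
    return nodes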
Companion Algorithm
2. Contract duplicates and near-duplicates:
if two nodes each have > 10 links and > 95% are in
common then make them into one node whose links
are the union of the two
3. Compute edge weights (i.e. links)
• Edges between nodes on the same host are
weighted 0
• Scaling to reduce the influence from any single host:
“If there are k edges from documents on a first host to a
single document on a second host then each edge
has authority weight 1/k”
“If there are l edges from a single document on a first
host to a set of documents on a second host, we give
each edge a hub weight of 1/l”
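A Python sketch of step 3, assuming edges are (source, target) URL pairs and a hypothetical host(url) helper that extracts the host name; it applies the 0 weight for same-host edges and the 1/k and 1/l scaling quoted above.

from collections import defaultdict

def edge_weights(edges, host):
    auth_w, hub_w = {}, {}
    to_target = defaultdict(int)      # k: edges from documents on one host to a single document
    from_source = defaultdict(int)    # l: edges from a single document to documents on one host
    for s, t in edges:
        if host(s) != host(t):
            to_target[(host(s), t)] += 1
            from_source[(s, host(t))] += 1
    for s, t in edges:
        if host(s) == host(t):
            auth_w[(s, t)] = hub_w[(s, t)] = 0.0            # same-host edges carry no weight
        else:
            auth_w[(s, t)] = 1.0 / to_target[(host(s), t)]       # authority weight 1/k
            hub_w[(s, t)] = 1.0 / from_source[(s, host(t))]      # hub weight 1/l
    return auth_w, hub_w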
Companion Algorithm
4. Compute hub and authority
scores for each node (URL) in the
graph
→ return highest ranked authorities as
results set
“a document that points to many others
is a good hub, and a document that
many documents point to is a good
authority”
Companion Algorithm
4. continued…
H = hub vector with one element for the
Hub value of each node
A = authority vector with one element
for the Authority value of each node
Initially all values set to 1
Companion Algorithm
4. continued…
Until H and A converge:
  For all nodes n in the graph N:
    A[n] = Σ over edges (n′, n) of H[n′] × authority_weight(n′, n)
  For all nodes n in the graph N:
    H[n] = Σ over edges (n, n′) of A[n′] × hub_weight(n, n′)
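A Python sketch of this update loop, assuming the edges and weights from step 3; a fixed iteration count and a simple normalisation stand in for a proper convergence test.

def hubs_and_authorities(nodes, edges, auth_w, hub_w, iterations=30):
    H = {n: 1.0 for n in nodes}           # hub values, initially 1
    A = {n: 1.0 for n in nodes}           # authority values, initially 1
    for _ in range(iterations):           # stand-in for "until H and A converge"
        new_A = {n: 0.0 for n in nodes}
        for src, dst in edges:            # A[n] = sum of H[n'] * authority_weight(n', n)
            new_A[dst] += H[src] * auth_w[(src, dst)]
        A = new_A
        new_H = {n: 0.0 for n in nodes}
        for src, dst in edges:            # H[n] = sum of A[n'] * hub_weight(n, n')
            new_H[src] += A[dst] * hub_w[(src, dst)]
        H = new_H
        sa, sh = sum(A.values()) or 1.0, sum(H.values()) or 1.0
        A = {n: v / sa for n, v in A.items()}     # normalise so scores stay bounded
        H = {n: v / sh for n, v in H.items()}
    # highest-ranked authorities form the results set
    return sorted(nodes, key=lambda n: A[n], reverse=True)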
Cocitation Algorithm
• Finds pages that are frequently
cocited with the query web
page u – “it finds other pages
that are pointed to by many
other pages that also point to u”
• Two nodes are co-cited if they
have a common parent: the
number of common parents is
their degree of co-citation
Cocitation Algorithm
1. Select up to B parents of u
2. For each parent add up to BF
of its children to the set of u’s
siblings S
3. Return nodes in S with
highest degrees of cocitation
with u
NB. If < 15 nodes in S that are cocited
with u at least twice then restart
using u’s URL with one path
element removed, e.g.
aaa.com/X/Y/Z → aaa.com/X/Y
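A Python sketch of the whole algorithm, reusing the hypothetical get_parents/get_children link-index helpers from the Companion sketch; the restart rule (fewer than 15 siblings cocited at least twice) is left out for brevity.

from collections import Counter

def cocitation(u, get_parents, get_children, B=50, BF=8, top=10):
    parents = get_parents(u)[:B]               # step 1: up to B parents of u
    degree = Counter()
    for p in parents:
        for sib in get_children(p)[:BF]:       # step 2: up to BF children per parent -> siblings S
            if sib != u:
                degree[sib] += 1               # each shared parent adds one to the degree of cocitation
    return [s for s, _ in degree.most_common(top)]   # step 3: most strongly cocited siblings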
Evaluation of companion and
cocitation algorithms
• 59 input URLs chosen by 18
volunteers (mainly computing
professionals)
• The volunteers were shown the results
for each URL they chose and had to
judge each result ‘1’ for valuable or ‘0’
for not valuable
→ Various calculations of precision, e.g.
‘precision at 10’ for the intersection
group (those query URLs that all 3
algorithms returned results for)
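A small Python sketch of 'precision at 10' computed from the volunteers' 0/1 judgements, assuming the judgements are given in rank order; the example numbers are invented.

def precision_at_k(judgements, k=10):
    # judgements: 0/1 relevance values for the returned pages, in rank order (1 = valuable)
    top = judgements[:k]
    return sum(top) / len(top) if top else 0.0

print(precision_at_k([1, 1, 0, 1, 0, 0, 1, 1, 0, 1]))   # -> 0.6 for this invented query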
Evaluation of companion and
cocitation algorithms
• Authors suggest that their algorithms
perform better than an algorithm
(Netscape’s) that incorporates
content and usage information, as
well as connectivity information –
“This is surprising” – IS IT??
• Perhaps it is because they had more
connectivity information??
Transforming Questions into Queries…
• Users of IR systems might prefer to express
their information needs directly as
questions, rather than as keywords, e.g.
– “What is a hard disk?” – rather than the
query “hard disk”
– What the user wants is a specific answer
to their question, rather than web-pages
selling hard disks, or web-pages
reviewing different kinds of hard disks
• But, web search engines may treat the query
as a ‘bag of words’ and not recognise
questions as such; documents are returned
that are similar to the ‘bag of words’
Transforming Questions into Queries…
• The challenge then is to
automatically transform the question
into a suitable query for which search
engines will return more pages that
do answer the user’s question
• Here we consider the work of
Agichtein, Lawrence and Gravano
(2001) who developed the Tritus
system to try and solve this
problem…
• Cf. AskJeeves (www.ask.com)
Tritus: premise
• A good answer to the question “What
is a hard disk?” might be “magnetic
rotating disk used to store data”
• So maybe the query “What is a hard
disk?” should be transformed into the
query –
“hard disk” NEAR “used to”
Tritus: aim
• To automatically learn how to
transform natural language
questions into queries that
contain terms and phrases
which are expected to appear
in documents containing
answers to these questions.
Tritus: learning algorithm
Step 1
Select question phrases from a set
of questions by extracting frequent
n-grams that don’t contain domain
specific nouns, e.g. “who was”,
“what is a”, “how do I”
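A rough Python sketch of step 1: counting question-initial n-grams across a small set of questions. The filter that discards phrases containing domain-specific nouns, and the paper's actual frequency thresholds, are omitted here.

from collections import Counter

def question_phrases(questions, max_n=3, min_count=2):
    counts = Counter()
    for q in questions:
        words = q.lower().strip("?").split()
        for n in range(1, max_n + 1):              # n-grams anchored at the start of the question
            counts[" ".join(words[:n])] += 1
    return [phrase for phrase, c in counts.items() if c >= min_count]

print(question_phrases(["What is a hard disk?",
                        "What is a modem?",
                        "How do I defrag a disk?"]))
# -> ['what', 'what is', 'what is a']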
Tritus: learning algorithm
Step 2
For each question type, select
candidate transformations from
set of good answers for each
question, e.g.
“what is a” {“is used to”, “is a”,
“used”}
Tritus: learning algorithm
Step 3
Weight and re-rank
transformations using results from
web search engines
Tritus: in use
• Trained to learn the best query
transformations for specific web
search engines, e.g. Google
and AltaVista
• Evaluation conducted to
compare the effect of query
transforms, and to compare
with AskJeeves
Evaluation of Web Search
Engines
• Precision may be applicable to
evaluate a web search engine, but it
may be the precision in the first page
of results that is most important
• Recall, as traditionally defined, may
not be applicable because it is
difficult or impossible to identify all
the relevant web-pages for a given
query
Four strategies for evaluation of
web search engines
(1) Use precision and recall in the traditional
way for a very tightly defined topic: only
applicable if all relevant web pages are
known in advance
(2) Use ‘relative recall’ – estimate the total
number of relevant documents by doing a
number of searches and adding up the
relevant documents returned (see the
sketch below)
(3) Statistically sample the web in order to
estimate number of relevant pages
(4) Avoid recall altogether
SEE: Oppenheim, Morris and McKnight (2000), p. 194
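A small Python sketch of strategy (2): the union of relevant documents found across several searches stands in for the unknowable total, and each engine's recall is measured relative to that pool. The document sets below are invented for illustration.

def relative_recall(found_by_engine, relevant_pool):
    return len(found_by_engine & relevant_pool) / len(relevant_pool)

# hypothetical relevant results returned by three engines for the same query
a, b, c = {"d1", "d2", "d3"}, {"d2", "d4"}, {"d1", "d5"}
pool = a | b | c                        # estimated set of all relevant documents
print(relative_recall(a, pool))         # -> 0.6 (engine a found 3 of the 5 pooled documents)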
Alternative Evaluation Criteria
• Number of web-pages covered,
and coverage: Are more pages
covered necessarily better? It may be
more important that certain domains
are included in the coverage.
• Freshness / broken links: Web-page
content is frequently updated, so the
index also needs to be updated;
broken links frustrate users. Should
be relatively straightforward to
quantify.
continued…
• Search Syntax: More experienced
users may like the option of
‘advanced searches’, e.g. phrases,
Boolean operators, and field
searching.
• Human Factors and Interface
Issues: Evaluation from a user’s
perspective is a more subjective
criterion; however, it is an important
one – it can be argued that an
intuitive interface for formulating
queries and interpreting results helps
a user to get better results from the
system.
continued…
• Quality of Abstracts: related to
interface issues are the ‘abstracts’ of
web-pages that a web search engine
displays – if good then these help a
user to quickly identify more
promising pages
Set Reading for Lecture 5
Dean and Henzinger (1999), ‘Finding Related
Pages in the World Wide Web’. Pages 1-10.
http://citeseer.ist.psu.edu/dean99finding.html
Agichtein, Lawrence and Gravano (2001), ‘Learning
Search Engine Specific Query Transformations for
Question Answering’, Procs. 10th International
WWW Conference. **Section 1 and Section 3**
www.cs.columbia.edu/~eugene/papers/www10.pdf
Oppenheim, Morris and McKnight (2000), ‘The
Evaluation of WWW Search Engines’, Journal of
Documentation, 56(2), pp. 190-211. Pages 194-205.
In Library Article Collection.
Exercise: Google’s ‘Similar Pages’
• It is suggested that Google’s ‘Similar
Pages’ feature is based in part on
the work of Dean and Henzinger.
• By making a variety of queries to
Google and choosing ‘Similar Pages’
see what you can find out about how
this works.
Exercise: web search engine
evaluation
• Compare three web-search engines
by making the same queries to each.
How do they compare in terms of:
– Advanced query options?
– Coverage?
– Quality of highest ranked results?
– Ease of querying and understanding
results?
– Ranking factors that they appear to be
using?
Further Reading
• The other parts of the papers
given for Set Reading
Lecture 5: LEARNING OUTCOMES
• For both (Dean and Henzinger 1999)
and (Agichtein, Lawrence and
Gravano 2001), you should be able to
- Explain how they were trying to make web
search better for users
- Outline their proposed solution
- Discuss their evaluation of their solution
and make your own comments
• You should be able to explain and
apply various techniques to compare
and evaluate web search engines
Reading ahead for LECTURE 6
If you want to prepare for next week’s
lecture then take a look at…
The visual interface of the KartOO search engine:
http://www.kartoo.com/
Use and read about the clustering of web pages
done by Vivisimo:
http://vivisimo.com/
Recent developments in Google Labs, especially
Google Sets and Google Suggest:
http://labs.google.com/