Proj1 - Department of Computer Science

Project 1:
Adding Proximity Preference to
Vector-Space Retrieval
Problems with VSR
(see trace)
• Does not know the position of tokens in the
document or query.
• Therefore cannot account for the proximity
of tokens in the documents compared to the
query.
• Similarity along a few dimensions can
dominate others.
– May prefer a document in which one of the
query words is very frequent over another in
which both occur less frequently.
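Both problems can be reproduced with a few lines of code. The sketch below is a minimal illustration, not the course code: it uses raw term counts instead of TF-IDF weights, and the class and method names (`BagOfWordsDemo`, `termFrequencies`, `cosine`) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class BagOfWordsDemo {
    // Build a raw term-frequency vector; token positions are discarded here.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Euclidean length of a term-frequency vector.
    static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int count : v.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity between two term-frequency vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        double denom = norm(a) * norm(b);
        return denom == 0.0 ? 0.0 : dot / denom;
    }
}
```

With this sketch, "computer science" and "science computer" produce identical vectors, and a document that only repeats "computer" scores higher against the query "computer science" than a longer document that contains each query word once.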
Solutions
• Phrasal search
– Requires verbatim appearance of the correct phrase.
• Boolean search
– Requires every query word to occur, but still does not address
proximity.
• Proximity metric
– Include a measure of proximity in the similarity metric
itself.
– Allows a flexible match with a proximity factor.
– Google apparently includes such a factor in its scoring.
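The first two options can be contrasted with a small sketch. It is illustrative only: `MatchDemo` and its methods are hypothetical names, and tokenization is a plain split on non-word characters.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class MatchDemo {
    // Lower-case the text and split it on runs of non-word characters.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase().split("\\W+"));
    }

    // Boolean AND: every query word must occur somewhere in the document.
    static boolean booleanMatch(List<String> doc, List<String> query) {
        return doc.containsAll(query);
    }

    // Phrasal search: the query words must occur contiguously and in order.
    static boolean phraseMatch(List<String> doc, List<String> query) {
        return Collections.indexOfSubList(doc, query) >= 0;
    }
}
```

For the query "computer science", the Boolean test accepts both "Computer and Information Science" and a document where the two words are many tokens apart, while the phrase test rejects both. Neither yields a graded measure of how close the words are, which is the gap a proximity metric is meant to fill.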
Example
• Query “computer science”
– Should not prefer just any page that talks about
computers and talks about science.
– Should prefer a page in which words are close
and in same order but the exact phrase does not
appear, e.g. “Computer and Information
Science” or “Computer Engineering and Science.”
– Should also prefer a page in which terms are
frequent.
My Solution
• Separate proximity score for each document,
normalized to be between 0 and 1.
• For each pair of query words, the average closest
distance in the document (measured in number of
words, excluding stop words) between occurrences
of one word and occurrences of the other.
• Averaged across all pairs of words in the query.
• Multiplicative penalty factor applied when a pair
of words appears in the reverse order from that in
the query.
• Final score is the ratio of cosine similarity to the
proximity score.
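One possible reading of this scheme is sketched below. Several details are assumptions, since the slide does not give them: the penalty value (`REVERSE_PENALTY = 2.0`), the normalization (dividing by the penalty times the document length), treating a pair with a missing word as the worst case, and applying the penalty before choosing the closest occurrence. The query words are assumed to be distinct, and `ProximityScorer` is a hypothetical name.

```java
import java.util.List;

class ProximityScorer {
    // Assumed value; the slide only says a multiplicative penalty is used.
    static final double REVERSE_PENALTY = 2.0;

    // Penalized distance from position p of the earlier query word to the
    // nearest occurrence of the later query word.
    static double closestDistance(int p, List<Integer> later) {
        double best = Double.MAX_VALUE;
        for (int q : later) {
            // q > p keeps query order; otherwise the pair is reversed.
            double d = (q > p) ? (q - p) : (p - q) * REVERSE_PENALTY;
            best = Math.min(best, d);
        }
        return best;
    }

    // Average closest distance over all occurrences of the earlier word.
    static double pairDistance(List<Integer> first, List<Integer> second) {
        double sum = 0.0;
        for (int p : first) {
            sum += closestDistance(p, second);
        }
        return sum / first.size();
    }

    // positions.get(i) holds the positions (stop words excluded) of the
    // i-th query word in the document. Result lies in (0, 1]; smaller
    // means the query words are closer together.
    static double proximity(List<List<Integer>> positions, int docLength) {
        double worst = REVERSE_PENALTY * docLength;
        double sum = 0.0;
        int pairs = 0;
        for (int i = 0; i < positions.size(); i++) {
            for (int j = i + 1; j < positions.size(); j++) {
                List<Integer> a = positions.get(i);
                List<Integer> b = positions.get(j);
                sum += (a.isEmpty() || b.isEmpty()) ? worst : pairDistance(a, b);
                pairs++;
            }
        }
        // A one-word query has no pairs; use a neutral factor of 1.
        return pairs == 0 ? 1.0 : (sum / pairs) / worst;
    }

    // Final score: cosine similarity divided by the proximity score.
    static double score(double cosine, double proximity) {
        return cosine / proximity;
    }
}
```

Under these assumptions, a three-token document with "computer" at position 0 and "science" at position 2 gets proximity 2/6, the reversed arrangement gets 4/6, and adjacent in-order words get 1/6, so dividing the cosine similarity by the proximity score boosts documents whose query words are close and in order.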
Your Task
• Be creative! The project should be challenging, so
leave sufficient time to complete it.
• Requires fundamental changes to existing code to
add positional information to the inverted index.
• So first understand the existing code.
• Make changes by creating new classes and
methods rather than altering existing ones.
• Final code must support both original and the new
proximity-enhanced approach.
Positional Inverted Index
• Instead of just a count of the token,
TokenOccurrence must include a list of
integer positions of the token in the
document.
• Positions can be in terms of token number
in the document, excluding stop words.
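A standalone sketch of such an index follows. It does not use the course's classes: `PositionalIndexSketch`, its methods, and the tiny stop-word list are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class PositionalIndexSketch {
    // Assumed stop-word list, kept tiny for the example.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "in", "of", "the"));

    // token -> (document id -> positions of the token in that document)
    final Map<String, Map<String, List<Integer>>> postings = new HashMap<>();

    // Index one document; positions count only non-stop-word tokens.
    void addDocument(String docId, String text) {
        int position = 0;
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            postings.computeIfAbsent(token, t -> new HashMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position++);
        }
    }

    // Positions of a token in a document, or an empty list if absent.
    List<Integer> positions(String token, String docId) {
        return postings.getOrDefault(token, Collections.emptyMap())
                       .getOrDefault(docId, Collections.emptyList());
    }

    // The plain occurrence count is still available as the list size.
    int count(String token, String docId) {
        return positions(token, docId).size();
    }
}
```

Because the count can be recovered from the position list, a structure like this can serve both the original count-based scoring and a proximity-enhanced variant.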
Project Submission
• Follow submission directions on the web to
submit electronically with “turnin”.
• Document all code (Javadoc).
• Include a trace of your working system on
the sample problems.
• Include a short project report describing
your approach and discussing your results.