Improving Search Results Quality by Customizing Summary Lengths
Michael Kaisser★, Marti Hearst and John B. Lowe
★University of Edinburgh, UC Berkeley, Powerset, Inc.
ACL-08: HLT
Talk Outline
How best to display search results?
Experiment 1: Is there a correlation between
response type and response length?
Experiment 2: Can humans predict the best
response length?
Summary and Outlook
Michael Kaisser, Marti Hearst and John B. Lowe
ACL-08: HLT
Motivation
Web search result listings today are largely
standardized; they display a document’s surrogate
(Marchionini et al., 2008)
Typically: one header line, two lines of text
fragments, one line for the URL:
(Source: Yahoo!)
But: Is this the best way to present search
results? Especially: Is this the optimal length for
every query?
Experiment 1 – Research Question
Do different types of queries
require responses of
different lengths?
(And if so, is the preferred response length dependent
on the expected semantic response type?)
Experiment 1 – Setup
Data used:
12,790 queries from Powerset’s query database
Contains search engines’ query logs and hand-crafted
queries
Disproportionately large number of natural language
queries
Examples:
“date of next US election”
Hip Hop
A synonym for material
highest volcano
What problems do federal regulations cause?
I want to make my own candles
industrial music
Excursus – Mechanical Turk
Amazon Web Services API for computers to
integrate "artificial artificial intelligence"
Requesters can upload Human Intelligence Tasks
(HITs)
Workers work on these HITs and are paid small
sums of money
Examples:
Can you see a person in the photo?
Is the document relevant to a query?
Is the review of this product positive or negative?
Mechanical Turk can also be seen as a
platform for online experiments
Experiment 1
Turkers are asked to classify
queries by
• Expected response type
• Best response length
Each query is classified by three
different subjects.
Experiment 1 – Results
Distribution of length categories differs across
individual expected response categories.
Some results are intuitive:
Queries for numbers want short results
Advice queries want longer results
Some results are more surprising:
Different length distributions for Person vs.
Organization
Experiment 2 – Research Question
Can human judges correctly
predict the preferred result
length?
Experiment 2 – Setup
Experiment 1 produced 1,099 high-confidence queries
(where all three Turkers agreed on semantic category and
length)
For 170 of these, Turkers manually created snippets of
different lengths from Wikipedia:
Phrase
Sentence
Paragraph
Section
Article (in this case a link to the article was displayed)
Note: Categories differ slightly from first experiment
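The high-confidence filter described on this slide (keep a query only when all three Turkers chose the same semantic category and the same length) can be sketched as follows. This is an illustrative sketch, not the authors' code: the tuple layout, the worker IDs, the category names and the function name `high_confidence` are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical judgments: (query, worker_id, response_type, length_category).
# The category names below are made up for illustration.
judgments = [
    ("highest volcano", "w1", "Location", "phrase"),
    ("highest volcano", "w2", "Location", "phrase"),
    ("highest volcano", "w3", "Location", "phrase"),
    ("Hip Hop", "w1", "General Info", "article"),
    ("Hip Hop", "w2", "General Info", "paragraph"),
    ("Hip Hop", "w3", "Definition", "article"),
]

def high_confidence(judgments, required=3):
    """Return {query: (type, length)} for queries with unanimous labels."""
    by_query = defaultdict(list)
    for query, _worker, rtype, length in judgments:
        by_query[query].append((rtype, length))
    agreed = {}
    for query, labels in by_query.items():
        # Keep the query only if all required workers gave the same pair.
        if len(labels) == required and len(set(labels)) == 1:
            agreed[query] = labels[0]
    return agreed

agreed = high_confidence(judgments)
print(agreed)
```

Requiring agreement on the pair (type, length) rather than on each label separately is the stricter reading of the slide, and is the one assumed here.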
Experiment 2 – Setup
Manually created snippets from Wikipedia of different
lengths:
Experiment 2 – Setup
Displayed:
• Instructions
• Query
• One response from one
length category
• Rating scale
Each HIT was shown to ten
Turkers.
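Since each HIT pairs one query with one response length and is rated by ten Turkers, a natural first step is to average the ten ratings per (query, length) pair. The sketch below assumes this aggregation; the slides do not say how ratings were combined, and the ratings shown are invented for illustration.

```python
from statistics import mean

# Hypothetical 0-10 ratings, ten per HIT, keyed by (query, length category).
ratings = {
    ("highest volcano", "phrase"): [9, 8, 10, 9, 7, 9, 8, 10, 9, 9],
    ("highest volcano", "article"): [3, 4, 2, 5, 3, 4, 3, 2, 4, 3],
}

# Mean rating per (query, length) pair.
mean_rating = {key: mean(values) for key, values in ratings.items()}

def best_length(query, mean_rating):
    """Return the length category with the highest mean rating for a query."""
    candidates = {length: score
                  for (q, length), score in mean_rating.items() if q == query}
    return max(candidates, key=candidates.get)

print(best_length("highest volcano", mean_rating))
```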
Experiment 2 – Setup
Instructions:
Below you see a search engine
query and a possible response.
We would like you to give us your
opinion about the response. We
are especially interested in the
length of the response. Is it
suitable for the query? Is there too
much or not enough information?
Please rate the response on a
scale from 0 (very bad response)
to 10 (very good response).
Experiment 2 – Significance
Predicted length   Slope    Std. Error   p-value
Phrase             -0.850   0.044        <0.0001
Sentence           -0.550   0.050        <0.0001
Paragraph           0.328   0.049        <0.0001
Article             0.856   0.053        <0.0001
Significance results of unweighted linear regression on the data for
the second experiment, which was separated into four groups based
on the predicted preferred length.
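For readers who want to see where a slope and its standard error come from, here is a minimal ordinary least squares sketch. It assumes that the regressor x is the ordinal index of the displayed response length and that y is the rating; the slides do not spell this out. The data points are invented, and the p-value is omitted because it needs a t-distribution, which the Python standard library does not provide.

```python
import math

def ols_slope(xs, ys):
    """Unweighted simple linear regression: slope, intercept, slope std. error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares, then the usual estimate with n - 2 degrees of freedom.
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    std_error = math.sqrt(rss / (n - 2) / sxx)
    return slope, intercept, std_error

# Hypothetical data: length index 1 (phrase) to 5 (article) against a rating.
xs = [1, 2, 3, 4, 5]
ys = [9, 7, 6, 4, 2]
slope, intercept, std_error = ols_slope(xs, ys)
print(slope, intercept, std_error)
```

Under this assumption, a negative slope means ratings fall as the shown response gets longer, which matches the sign pattern in the table for the Phrase and Sentence groups.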
Experiment 2 – Details
146 queries
5 length categories per query
10 judgments per query and length category
= 7,300 judgments
124 judges
16 judges did more than 146 HITs
2 of these 16 were excluded (scammers)
$0.01 per judgment
$73 paid to judges, plus $73 Amazon fees
$146 for Experiment 2 (excluding snippet generation)
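The totals on this slide follow from multiplying the counts, reading the ten judgments as ten per query and length category. The short check below only restates the slide's figures; the variable names are illustrative, and the Amazon fee is taken as stated rather than derived from a fee schedule.

```python
from decimal import Decimal

queries = 146
length_categories = 5
judgments_per_hit = 10              # each HIT was shown to ten Turkers
pay_per_judgment = Decimal("0.01")  # dollars
amazon_fees = Decimal("73")         # dollars, as stated on the slide

hits = queries * length_categories          # one HIT per query and length
judgments = hits * judgments_per_hit
worker_pay = judgments * pay_per_judgment
total_cost = worker_pay + amazon_fees

print(hits, judgments, worker_pay, total_cost)
```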
Experiment 2 – Results
Results:
Human judges can predict the preferred result
lengths (at least for a subset of especially clear
queries)
Standard results listings are often too short (and
sometimes too long)
Outlook
Can queries be automatically classified according
to their predicted result length?
Initial Experiment:
Unigram word counts
805 training queries, 286 test queries
Three length bins (long, short, other)
Weka NaiveBayesMultinomial
Initial Result:
78% of queries correctly classified
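The initial experiment used Weka's NaiveBayesMultinomial on unigram counts. As a rough illustration of the same idea, the sketch below implements a small multinomial Naive Bayes classifier with add-one smoothing in plain Python. It is not the Weka implementation, whitespace tokenization is an assumption, and the length labels attached to the example queries are hypothetical rather than taken from the experiment's data.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Collect class counts, per-class unigram counts and the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for query, label in examples:
        class_counts[label] += 1
        for token in query.lower().split():
            word_counts[label][token] += 1
            vocab.add(token)
    return class_counts, word_counts, vocab

def classify(model, query):
    """Return the label with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for token in query.lower().split():
            if token not in vocab:
                continue  # ignore words never seen in training
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training set with the three length bins from the slide.
training = [
    ("highest volcano", "short"),
    ("date of next US election", "short"),
    ("I want to make my own candles", "long"),
    ("What problems do federal regulations cause?", "long"),
    ("Hip Hop", "other"),
    ("industrial music", "other"),
]
model = train(training)
print(classify(model, "highest mountain"))
print(classify(model, "I want to make soap"))
```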
Thank you!
MT Demographics - Age
Survey, data and graphs from Panos Ipeirotis’ blog:
http://behind-the-enemy-lines.blogspot.com/2008/03/mechanical-turk-demographics.html
MT Demographics - Gender
Survey, data and graphs from Panos Ipeirotis’ blog:
http://behind-the-enemy-lines.blogspot.com/2008/03/mechanical-turk-demographics.html
MT Demographics - Education
Survey, data and graphs from Panos Ipeirotis’ blog:
http://behind-the-enemy-lines.blogspot.com/2008/03/mechanical-turk-demographics.html
MT Demographics - Income
Survey, data and graphs from Panos Ipeirotis’ blog:
http://behind-the-enemy-lines.blogspot.com/2008/03/mechanical-turk-demographics.html
MT Demographics - Purpose
Survey, data and graphs from Panos Ipeirotis’ blog:
http://behind-the-enemy-lines.blogspot.com/2008/03/mechanical-turk-demographics.html
Excursus – Mechanical Turk
Example HIT (not ours):