Slides - Computer Science @ UC Davis


Bringing Order to the Web:
Automatically Categorizing Search Results
Hao Chen, CS Division, UC Berkeley
Susan Dumais, Microsoft Research
ACM:CHI April 4, 2000
Organizing Search Results
Query: jaguar
List Organization
Category Org (SWISH)
Outline

Background


Using category structure to organize information
SWISH System
Searching With Information Structured Hierarchically
Text classification
User interface


User Study
Future Work
Using Category Structure

To Organize Information
Superbook, Cat-a-Cone, etc.

To Help Web Search
Yahoo!, Northern Light
What’s New in SWISH?



Automatic categorization of new documents
User interface that tightly couples hierarchical category structure with search results
User study for the new user interface
SWISH System

Combines the Advantages of



Manually crafted & easily understood directory structure
Broad coverage from search engines
System Components


Text classification models
User interface
Text Classification

Text Classification




Assign documents to one or more of a predefined set of categories
E.g., News feeds, Email - spam/no-spam, Web data
Manually vs. automatically
Inductive Learning for Classification



Training set: Manually classified a set of documents
Learning: Learn classification models
Classification: Use the model to automatically classify new documents
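
The three steps above can be sketched in a few lines. This is a toy illustration only: it uses a perceptron as a stand-in for the learner (the slides use an SVM), and the training texts, the labels (+1 for an automotive page, -1 otherwise), and the function names are all made up for the example.

```python
from collections import Counter

def features(text):
    """Bag-of-words counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())

def train(examples, epochs=10):
    """Learning step: perceptron updates over (text, label) pairs, label in {+1, -1}."""
    weights = Counter()
    for _ in range(epochs):
        for text, label in examples:
            x = features(text)
            score = sum(weights[w] * c for w, c in x.items())
            if label * score <= 0:  # misclassified (or undecided): nudge the weights
                for w, c in x.items():
                    weights[w] += label * c
    return weights

def classify(weights, text):
    """Classification step: sign of the weighted word sum."""
    score = sum(weights[w] * c for w, c in features(text).items())
    return 1 if score > 0 else -1

# Training set: a handful of manually labeled, hypothetical page texts.
training_set = [
    ("used car parts and auto repair", 1),
    ("honda motorcycle dealer", 1),
    ("software downloads for windows pc", -1),
    ("web hosting provider", -1),
]
model = train(training_set)
print(classify(model, "car dealer"))   # 1
print(classify(model, "pc software"))  # -1
```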
Training Set:

Category Structure (spring 99)



13 top-level categories
150 second-level categories
Training Set


LookSmart Web Directory
~50k web pages; chosen randomly from all cats
Top-level Categories
Automotive
Business & Finance
Computers & Internet
Entertainment & Media
Health & Fitness
Hobbies & Interests
Home & Family
People & Chat
Reference & Education
Shopping & Services
Society & Politics
Sports & Recreation
Travel & Vacations
Learning & Classification

Support Vector Machine (SVM)


Accurate and efficient for text classification (Dumais et al., Joachims)
Model = weighted vector of words



“Automobile” = motorcycle, vehicle, parts, automobile, harley, car, auto, honda, porsche …
“Computers & Internet” = rfc, software, provider, windows, user, users, pc, hosting, os, downloads ...
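
To make "weighted vector of words" concrete, here is a minimal sketch of scoring a page against two such models. The words come from the slide; the weight values, the sample text, and the `score` helper are assumptions for illustration, since the slides do not give the actual weights.

```python
def score(model, text):
    """Dot product of a word-weight vector with the document's word counts."""
    return sum(model.get(word, 0.0) for word in text.lower().split())

# Hypothetical weights: the slide lists the words but not their values.
automobile = {"motorcycle": 1.2, "vehicle": 1.0, "parts": 0.8, "car": 0.9, "auto": 0.7}
computers = {"software": 1.1, "windows": 0.9, "pc": 0.8, "hosting": 0.7, "downloads": 0.6}

doc = "Vintage car parts and motorcycle accessories"
scores = {
    "Automobile": score(automobile, doc),
    "Computers & Internet": score(computers, doc),
}
best = max(scores, key=scores.get)
print(best)  # Automobile
```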
Hierarchical Models



1 model for N top level categories
N models for second level categories
Very useful in conjunction w/ user interaction
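
A rough sketch of the two-level scheme: one model picks the top-level category, then only that category's second-level model is consulted. The second-level category names and all weights below are hypothetical; the slides give only the counts (13 top-level, 150 second-level).

```python
def best_label(models, text):
    """Return the label whose word-weight vector scores the text highest."""
    words = text.lower().split()
    return max(models, key=lambda label: sum(models[label].get(w, 0.0) for w in words))

# One model set for the top level (only two of the 13 categories shown).
top_level = {
    "Automotive": {"car": 1.0, "motorcycle": 1.0, "auto": 0.8},
    "Computers & Internet": {"software": 1.0, "pc": 0.9, "hosting": 0.8},
}

# One model set per top-level category; subcategory names are made up.
second_level = {
    "Automotive": {
        "Motorcycles": {"motorcycle": 1.0, "harley": 1.0},
        "Cars": {"car": 1.0, "porsche": 1.0},
    },
    "Computers & Internet": {
        "Software": {"software": 1.0, "downloads": 0.8},
        "Hosting": {"hosting": 1.0, "provider": 0.8},
    },
}

def classify_hierarchically(text):
    """Pick the top-level category first, then a subcategory within it."""
    top = best_label(top_level, text)
    return top, best_label(second_level[top], text)

print(classify_hierarchically("harley motorcycle parts"))
# ('Automotive', 'Motorcycles')
```

Because the second-level step only runs inside one top-level category, it could be deferred until a user expands that category.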
SWISH Architecture
Train (offline): manually classified web pages, SVM model
Classify (online): SVM model applied to web search results, local search results, ...
Interface Characteristics

Problems

Large amount of information to display




Search results
Category structure
Limited screen real estate
Solutions


Information overlay
Distilled information display
Information Overlay

Use tooltips to show


Summaries of web pages
Category hierarchy
Expansion of Category Structure
Expansion of Web Page List
User Study - Conditions
Category Interface
List Interface
User Study
User Study

Participants:
18 intermediate Web users

Tasks
30 search tasks
e.g., “Find home page for Seattle Art Museum”
Search terms are fixed for each task
Experimental Design

Category/List – within subjects



15 search tasks with each interface
Order (Category/List First) – counterbalanced between subjects
Both Subjective and Objective Measures
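
One way such a design could be laid out is sketched below. The slides state only the counts (18 participants, 30 tasks, 15 per interface) and that order was counterbalanced; the alternating assignment and the split of tasks into two fixed halves are assumptions.

```python
def assign(participant_ids, tasks):
    """Give each participant both interfaces, alternating which one comes first."""
    half = len(tasks) // 2
    plan = {}
    for i, pid in enumerate(participant_ids):
        order = ("Category", "List") if i % 2 == 0 else ("List", "Category")
        # First block of tasks with the first interface, second block with the other.
        plan[pid] = [(order[0], tasks[:half]), (order[1], tasks[half:])]
    return plan

plan = assign(list(range(1, 19)), list(range(1, 31)))
category_first = sum(1 for blocks in plan.values() if blocks[0][0] == "Category")
print(category_first)  # 9
```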
Subjective Results


7-point rating scale (1=disagree; 7=agree)

Question                                                        Category  List  Significance
It was easy to use this software.                               6.4       3.9   p<.001
I liked using this software.                                    6.7       4.3   p<.001
I prefer this to my usual Web Search engine.                    6.4       4.3   p<.001
It was easy to get a good sense of the range of alternatives.   6.4       4.2   p<.001
I was confident that I could find information if it was there.  6.3       4.4   p<.001
The "More" button was useful: 6.1 and 6.5 (n.s.)
The display of summaries was useful: 6.4 and 6.5 (n.s.)
Use of Interface Features
Average Number of Uses of Feature per Task

Interface Feature                  Category  List  Significance
Expanding / Collapsing Structure   0.78      0.48  p<.003
Viewing Summaries in Tooltips      2.99      4.60  p<.001
Viewing Web Pages                  1.23      1.41  p<.053
Search Time
Average median RT for Category vs. List (bar chart by interface condition)
Category: 56 secs
List: 85 secs
p < .002
50% faster with Category interface
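
The "50% faster" figure can be checked from the two medians. The snippet below is just that arithmetic; the variable names are illustrative. Read as a speed ratio, 85/56 gives roughly the stated 50%; read as time saved, the reduction is about a third.

```python
category_secs, list_secs = 56, 85

speedup = list_secs / category_secs - 1     # relative gain in speed
time_saved = 1 - category_secs / list_secs  # relative reduction in search time

print(f"{speedup:.0%} faster, {time_saved:.0%} less time")
# 52% faster, 34% less time
```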
Search Time by Query Difficulty
Average median RT by interface condition (Category, List) and query difficulty: Easy (Top20) vs. Hard (Not Top20)
Top20: 57 secs
NotTop20: 98 secs
• No reliable interaction between query difficulty and interface condition
• Category interface is helpful for both easy and difficult queries
Summary

Text Classification
Organize search results
Use hierarchical category models
Classify new web pages on-the-fly

User Interface
Tightly couple search results with category structure
Allow manipulation of presentation of category structure

User Study
Suggest strong preference and performance advantages for categorically organized presentation of search results
Open Issues


Improve Accuracy of Classification Algorithms
Enhance User Interface

Heuristics for selecting categories and pages to display





Query_Match: rank of page, and sometimes match score
Categ_Match: p(category for each page)
Integration with non-content information
Conduct End-to-end User Study
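
As an illustration of the display-heuristic question above, here is one possible way to combine Query_Match and Categ_Match. It is only a sketch: the slides list this as an open issue and do not specify a formula, so the reciprocal-rank query match, the linear mix, the `weight` and `per_category` parameters, and the sample pages are all assumptions.

```python
def select_pages(pages, per_category=2, weight=0.5):
    """Group pages by category and keep the highest-scoring few in each.

    Each page is (title, rank, category, p_category). The combined score mixes
    a rank-based query match with the classifier's category probability.
    """
    by_category = {}
    for title, rank, category, p_category in pages:
        query_match = 1.0 / rank  # assumption: reciprocal of search rank
        combined = weight * query_match + (1 - weight) * p_category
        by_category.setdefault(category, []).append((combined, title))
    return {
        category: [title for _, title in sorted(scored, reverse=True)[:per_category]]
        for category, scored in by_category.items()
    }

# Hypothetical results for the query "jaguar".
pages = [
    ("Jaguar Cars", 1, "Automotive", 0.9),
    ("Jaguar dealer listings", 4, "Automotive", 0.8),
    ("Jaguar parts forum", 2, "Automotive", 0.3),
    ("Jacksonville Jaguars", 3, "Sports & Recreation", 0.95),
]
shown = select_pages(pages)
print(shown["Automotive"])  # ['Jaguar Cars', 'Jaguar dealer listings']
```

With these numbers, a confidently classified page at rank 4 is shown ahead of a weakly classified page at rank 2, which is the kind of trade-off the two signals would need to settle.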
More info:

http://research.microsoft.com/~sdumais
Searching With Information
Structured Hierarchically
SWISH