Web-page Classification through Summarization

Download Report

Transcript Web-page Classification through Summarization

Query Classification
and KDDCUP 2005
Qiang Yang, Dou Shen
1
Query Classification and
Online Advertisement
2
QC as Machine Learning
Inspired by the KDDCUP’05 competition



Classify a query into a ranked list of categories
Queries are collected from real search engines
Target categories are organized in a tree with
each node being a category
3
How to do it?
4
Solutions: Query Enrichment
+ Staged Classification
Queries
Target
Categories
Solution 1: Query/Category Enrichment
Queries
Target
Categories
Solution 2: Bridging classifier
5
Query enrichment

Textual information
Title

Category information
Snippet
Category
Full text
6
Classifiers

Map by Word Matching


Direct and Extended
Matching

Device

E
D
High precision, low
recall

SVM: Apply synonymbased classifiers to map
Web pages from ODP to
target taxonomy
Obtain <pages, target
category> as the
training data
Train SVM classifiers for
the target categories;

Higher Recall
7
Bridging Classifier

Problem with Solution 1:


When target is changed, training needs to repeat!
Solution:

Connect the target taxonomy and queries by
taking an intermediate taxonomy as a bridge
8
Bridging Classifier (Cont.)

How to connect?
T
The relation between Ci
and C Ij
The relation between
and C Ij
q
Prior prob. of C Ij
The relation between
and CiT
q
9
Category Selection for
Intermediate Taxonomy

Category Selection for Reducing Complexity

Total Probability (TP)

Mutual Information
10
Experiment
─ Data Sets & Evaluation

KDDCUP


Starting at 1997, KDD Cup is the leading Data Mining and
Knowledge Discovery competition in the world, organized by ACM
SIGKDD
KDDCUP 2005


Task: Categorize 800K search queries into 67 categories
Three Awards


Participation


142 registration groups; 37 solutions submitted from 32 teams
Evaluation data



(1) Performance Award ; (2) Precision Award; (3) Creativity Award
800 queries randomly selected from the 800K query set
3 human labelers labeled the entire evaluation query set (details)
Evaluation measurements: Precision and Performance (F1) (details)

1
Overall F1 
3
a
3
 (F1 against human labeler i)
i 1
11 / 68
11
Experiment Results
─ Compare Different Methods
Comparison among our own methods
Comparison with other teams in KDDCUP2005
From Different
Groups
12
12 / 68
Result of Bridging Classifiers


Performance of the Bridging Classifier with Different
Granularity of Intermediate Taxonomy
Using bridging classifier allows the target
classes to change freely
 without the need to retrain the classifier!
13
Target-transfer Learning


Classifier, once trained, stays constant
When target classes change, classifier needs to be
retrained with new data



Bridging Classifier:



Too costly
Not online
Allow target to change
Application: advertisements come and go, but our
querytarget mapping needs not be retrained!
We call this the target-transfer learning problem
14
Task 2: Can computer do this?
15
Data: Web Search Queries

Consider the following search queries



“AAAI”
“Machine Learning”
“Constraint Reasoning”
AAAI
Machine learning
Constraint Reasoning
16
AAAI 07, joint work with D. Shen,
J. Sun, M. Qin, Z. Chen et al.

Queries have different granularity


Car v.s BMW; BMW v.s. AUDI
Can we organize the queries into hierarchies?
 Benefits of building query hierarchies

Provide online query suggestion
Query classification

Query clustering


Difficulties of building query hierarchies


Queries are short
The hierarchical structure cannot be pre-defined
17
Clickthrough Data
√
Person 1
“SIGIR”
√
Search
Engines

Clickthrough Data
√
18
Intuitive Ideas

Our goal: mine the query hierarchies from
clickthrough data


If two queries are related to each other, they
should share some of the same or similar clicked
Web pages;
For two queries qi and qj , qi is more general if



most of the clicked pages of qj have similar pages to
some clicked pages of qi
while not the other way around
If a query is specific,

the contents of its clicked pages are relatively
consistent,
19