Web-page Classification through Summarization
Download
Report
Transcript Web-page Classification through Summarization
Query Classification
and KDDCUP 2005
Qiang Yang, Dou Shen
1
Query Classification and
Online Advertisement
2
QC as Machine Learning
Inspired by the KDDCUP’05 competition
Classify a query into a ranked list of categories
Queries are collected from real search engines
Target categories are organized in a tree with
each node being a category
3
How to do it?
4
Solutions: Query Enrichment
+ Staged Classification
Queries
Target
Categories
Solution 1: Query/Category Enrichment
Queries
Target
Categories
Solution 2: Bridging classifier
5
Query enrichment
Textual information
Title
Category information
Snippet
Category
Full text
6
Classifiers
Map by Word Matching
Direct and Extended
Matching
Device
E
D
High precision, low
recall
SVM: Apply synonymbased classifiers to map
Web pages from ODP to
target taxonomy
Obtain <pages, target
category> as the
training data
Train SVM classifiers for
the target categories;
Higher Recall
7
Bridging Classifier
Problem with Solution 1:
When target is changed, training needs to repeat!
Solution:
Connect the target taxonomy and queries by
taking an intermediate taxonomy as a bridge
8
Bridging Classifier (Cont.)
How to connect?
T
The relation between Ci
and C Ij
The relation between
and C Ij
q
Prior prob. of C Ij
The relation between
and CiT
q
9
Category Selection for
Intermediate Taxonomy
Category Selection for Reducing Complexity
Total Probability (TP)
Mutual Information
10
Experiment
─ Data Sets & Evaluation
KDDCUP
Starting at 1997, KDD Cup is the leading Data Mining and
Knowledge Discovery competition in the world, organized by ACM
SIGKDD
KDDCUP 2005
Task: Categorize 800K search queries into 67 categories
Three Awards
Participation
142 registration groups; 37 solutions submitted from 32 teams
Evaluation data
(1) Performance Award ; (2) Precision Award; (3) Creativity Award
800 queries randomly selected from the 800K query set
3 human labelers labeled the entire evaluation query set (details)
Evaluation measurements: Precision and Performance (F1) (details)
1
Overall F1
3
a
3
(F1 against human labeler i)
i 1
11 / 68
11
Experiment Results
─ Compare Different Methods
Comparison among our own methods
Comparison with other teams in KDDCUP2005
From Different
Groups
12
12 / 68
Result of Bridging Classifiers
Performance of the Bridging Classifier with Different
Granularity of Intermediate Taxonomy
Using bridging classifier allows the target
classes to change freely
without the need to retrain the classifier!
13
Target-transfer Learning
Classifier, once trained, stays constant
When target classes change, classifier needs to be
retrained with new data
Bridging Classifier:
Too costly
Not online
Allow target to change
Application: advertisements come and go, but our
querytarget mapping needs not be retrained!
We call this the target-transfer learning problem
14
Task 2: Can computer do this?
15
Data: Web Search Queries
Consider the following search queries
“AAAI”
“Machine Learning”
“Constraint Reasoning”
AAAI
Machine learning
Constraint Reasoning
16
AAAI 07, joint work with D. Shen,
J. Sun, M. Qin, Z. Chen et al.
Queries have different granularity
Car v.s BMW; BMW v.s. AUDI
Can we organize the queries into hierarchies?
Benefits of building query hierarchies
Provide online query suggestion
Query classification
Query clustering
Difficulties of building query hierarchies
Queries are short
The hierarchical structure cannot be pre-defined
17
Clickthrough Data
√
Person 1
“SIGIR”
√
Search
Engines
Clickthrough Data
√
18
Intuitive Ideas
Our goal: mine the query hierarchies from
clickthrough data
If two queries are related to each other, they
should share some of the same or similar clicked
Web pages;
For two queries qi and qj , qi is more general if
most of the clicked pages of qj have similar pages to
some clicked pages of qi
while not the other way around
If a query is specific,
the contents of its clicked pages are relatively
consistent,
19