Transcript Slides

Heterogeneous Cross Domain
Ranking in Latent Space
Bo Wang¹, Jie Tang², Wei Fan³, Songcan Chen¹, Zi Yang², Yanzhu Liu⁴
¹Nanjing University of Aeronautics and Astronautics
²Tsinghua University
³IBM T.J. Watson Research Center, USA
⁴Peking University
1
Introduction
• The Web is becoming more and more
heterogeneous
• Ranking is a fundamental problem on the
Web
– unsupervised vs. supervised
– homogeneous vs. heterogeneous
2
Motivation
[Figure: a heterogeneous academic network for the query “data mining”: authors (Dr. Tang, Prof. Wang, Prof. Li, Limin, P. Yu), papers (SVM..., Tree CRF..., EOS..., Semantic..., Annotation..., Association..., Data Mining: Concepts and Techniques, Principles of Data Mining), and conferences (KDD, ICDM, SDM, PAKDD, WWW, IJCAI, ISWC), linked by write, cite, publish, coauthor, and PC-member relations]
Main Challenges
1) How to capture the correlation between heterogeneous objects?
2) How to preserve the preference orders between objects across heterogeneous domains?
Heterogeneous cross domain ranking
3
Outline
• Related Work
• Heterogeneous cross domain ranking
• Experiments
• Conclusion
4
Related Work
• Learning to rank
– Supervised: [Burges, 05] [Herbrich, 00] [Xu and Li, 07]
[Yue, 07]
– Semi-supervised: [Duh, 08] [Amini, 08] [Hoi and Jin, 08]
– Ranking adaptation: [Chen, 08]
• Transfer learning
– Instance-based: [Dai, 07] [Gao, 08]
– Feature-based: [Jebara, 04] [Argyriou, 06] [Raina, 07]
[Lee, 07] [Blitzer, 06] [Blitzer, 07]
– Model-based: [Bonilla, 08]
5
Outline
• Related Work
• Heterogeneous cross domain ranking
– Basic idea
– Proposed algorithm: HCDRank
• Experiments
• Conclusion
6
Query: “data mining”
[Figure: the source domain (conferences: KDD, PKDD, SDM, PAKDD, ADMA) carries labeled rank levels (A/B/C, X/Y/Z), while the labeled data of the target domain (experts: Jiawei Han, Jie Tang, Bo Wang, Jerry, Alice, Bob, Tom) might be empty (no labelled data in the target domain); both domains are mapped into a shared Latent Space, in which mis-ranked pairs across domains are identified]
7
Learning Task
In the HCD ranking problem, the transfer ranking task can be
defined as:
Given a limited number of labeled examples L_T and a large number
of unlabeled examples S from the target domain, together with
sufficient labeled data L_S from the source domain, the goal is to
learn a ranking function f_T^* for predicting the rank levels of the
unlabeled data in the target domain.
Key issues:
- Different feature distributions / different feature spaces
- Different numbers of rank levels
- Very unbalanced numbers of labeled training examples (thousands
vs. a few)
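
A compact way to write this task (the symbols follow the slide; the expected-loss form is my paraphrase, not necessarily the paper's notation):

```latex
% Transfer ranking task, restated from the definition above:
% learn from L_S (large, labeled, source), L_T (small or empty,
% labeled, target) and S (unlabeled, target), with |L_T| << |L_S|.
f_T^{*} \;=\; \arg\min_{f \in \mathcal{F}}\;
  \mathbb{E}_{(\mathbf{x},\,y)\sim\mathcal{D}_T}
  \big[\, \ell\big(f(\mathbf{x}),\, y\big) \,\big]
```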
8
The Proposed Algorithm — HCDRank
[Objective: a loss function in the source domain (how to optimize?), plus a loss function in the target domain (how to define?), plus a penalty term]
The ideal loss function, the number of mis-ranked pairs, is non-convex and therefore cannot be optimized directly.
C: cost-sensitive parameter that deals with the imbalance of
labeled data between the domains
\lambda: balances the empirical loss and the penalty
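
One plausible shape of the objective sketched above, with the non-convex pair-counting loss replaced by its standard convex hinge upper bound (an assumption for illustration; the exact formulation is in the paper):

```latex
\min_{\mathbf{w}_S,\,\mathbf{w}_T}\;
  \underbrace{\sum_{(i,j)\in P_S}
    \big[\,1-\mathbf{w}_S^{\top}(\mathbf{x}_i-\mathbf{x}_j)\,\big]_{+}}_{\text{source-domain loss}}
  \;+\; C \underbrace{\sum_{(i,j)\in P_T}
    \big[\,1-\mathbf{w}_T^{\top}(\mathbf{x}_i-\mathbf{x}_j)\,\big]_{+}}_{\text{target-domain loss}}
  \;+\; \lambda\,\Omega(\mathbf{w}_S,\mathbf{w}_T)
```

Here P_S and P_T are the preference-pair sets of the two domains, [z]_+ = max(0, z), and \Omega is the penalty that couples the two weight vectors through the shared latent space.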
10
HCDRank: algorithm steps and complexity
- Construct the transformation matrix: O(d^3)
- Dual problem: alternately optimize the matrices M and D: O(2T * sN log N)
- Learn the weight vector of the target domain (learning in the latent space)
- Apply the learnt weight vector to predict: O(sN log N)
Overall: O((2T+1) * sN log N + d^3)
d: number of features, N: number of instance pairs for training, s: number of non-zero features, T: number of iterations
11
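
A minimal sketch of the alternating-optimization pattern described above. The paper alternates dual updates of the matrices M and D; this stand-in alternates subgradient steps on two coupled weight vectors instead (my simplification, not the paper's actual update rules):

```python
import numpy as np

def hcdrank_sketch(P_S, P_T, T=10, C=1.0, lam=0.1, lr=0.01, seed=0):
    """Alternating-optimization skeleton in the spirit of the steps above.

    P_S, P_T: (n_pairs, d) arrays of pairwise feature differences
    (x_i - x_j) over preference pairs in the source / target domain.
    """
    d = P_S.shape[1]
    rng = np.random.default_rng(seed)
    w_S = rng.normal(scale=0.01, size=d)
    w_T = rng.normal(scale=0.01, size=d)
    for _ in range(T):
        # Step 1: update source weights with the target fixed.
        active = (P_S @ w_S) < 1.0               # pairs violating the margin
        g_S = -P_S[active].sum(axis=0)           # hinge subgradient
        w_S -= lr * (g_S + lam * (w_S - w_T))    # coupling pulls w_S toward w_T
        # Step 2: update target weights with the source fixed.
        active = (P_T @ w_T) < 1.0
        g_T = -C * P_T[active].sum(axis=0)       # C weights the scarce target pairs
        w_T -= lr * (g_T + lam * (w_T - w_S))
    return w_T  # score target-domain items by x @ w_T, then sort
```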
Outline
• Related Work
• Heterogeneous cross domain ranking
• Experiments
– Ranking on Homogeneous data
– Ranking on Heterogeneous data
– Ranking on Heterogeneous tasks
• Conclusion
12
Experiments
• Data sets
– Homogeneous data set: LETOR_TR
• 50/75/106 queries with 44/44/25 features for TREC2003_TR,
TREC2004_TR and OHSUMED_TR
– Heterogeneous academic data set: ArnetMiner.org
• 14,134 authors, 10,716 papers, and 1,434 conferences
– Heterogeneous task data set:
• 9 queries, 900 experts, 450 best supervisor candidates
• Evaluation measures
– P@n : Precision@n
– MAP : mean average precision
– NDCG : normalized discounted cumulative gain
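
For reference, a minimal sketch of these three measures (common formulations; gain and discount conventions vary slightly across papers):

```python
import numpy as np

def precision_at_n(rels, n):
    """P@n: fraction of relevant items among the top n (rels: 0/1, ranked order)."""
    return float(np.mean(rels[:n]))

def average_precision(rels):
    """AP for one query; MAP is the mean of AP over all queries."""
    rels = np.asarray(rels, dtype=float)
    precisions = np.cumsum(rels) / (np.arange(len(rels)) + 1)
    return float((precisions * rels).sum() / max(rels.sum(), 1.0))

def ndcg_at_n(grades, n):
    """NDCG@n with the common 2^grade - 1 gain and log2 rank discount."""
    grades = np.asarray(grades, dtype=float)
    k = min(n, len(grades))
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = ((2.0 ** grades[:k] - 1.0) * discounts).sum()
    ideal = np.sort(grades)[::-1][:k]
    idcg = ((2.0 ** ideal - 1.0) * discounts).sum()
    return float(dcg / idcg) if idcg > 0 else 0.0
```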
13
Ranking on Homogeneous data
• LETOR_TR
– We made a slight revision of LETOR 2.0 to fit into the cross-domain ranking scenario
– three sub datasets: TREC2003_TR, TREC2004_TR, and
OHSUMED_TR
• Baselines
14
[Figure: ranking performance on TREC2003_TR (Cosine Similarity = 0.01), TREC2004_TR (Cosine Similarity = 0.23), and OHSUMED_TR (Cosine Similarity = 0.18)]
15
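
The slides do not spell out what the cosine similarity is computed between; a reasonable reading (my assumption) is the similarity of the two domains' mean feature vectors:

```python
import numpy as np

def domain_cosine_similarity(X_source, X_target):
    """Cosine similarity between the mean feature vectors of two domains
    (my assumption of what 'Cosine Similarity' on these plots measures)."""
    mu_s = np.asarray(X_source).mean(axis=0)
    mu_t = np.asarray(X_target).mean(axis=0)
    return float(mu_s @ mu_t / (np.linalg.norm(mu_s) * np.linalg.norm(mu_t)))
```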
Observations
• Ranking accuracy
HCDRank is +5.6% to +6.1% better in terms of MAP
• Effect of domain difference
when the cosine similarity between domains is high (TREC2004),
simply combining the two domains already results in better
ranking performance
• Training time: next slide
16
Training Time
BUT: HCDRank can easily be parallelized,
and the training process only needs to be run once per data set
17
Ranking on Heterogeneous data
• ArnetMiner data set (www.arnetminer.org)
14,134 authors, 10,716 papers, and 1,434 conferences
• Training and test data set:
– 44 most frequently queried keywords from the log file
• Author collection: Libra, Rexa and ArnetMiner
• Conference collection: Libra, ArnetMiner
• Ground truth:
– Conference: online resources
– Expert: two faculty members and five graduate students from
CS provided human judgments for expert ranking
18
Feature Definition

Features  Description
L1-L10    Low-level language model features
H1-H3     High-level language model features
S1        How many years the conference has been held
S2        The sum of citation numbers of the conference during the recent 5 years
S3        The sum of citation numbers of the conference during the recent 10 years
S4        How many years have passed since his/her first paper
S5        The sum of citation numbers of all the publications of one expert
S6        How many papers have been cited more than 5 times
S7        How many papers have been cited more than 10 times

16 features for a conference, 17 features for an expert
19
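
A sketch of how the expert-side statistical features S4-S7 could be computed from paper metadata (the record type and field names are hypothetical, not the ArnetMiner schema):

```python
from dataclasses import dataclass

@dataclass
class Paper:        # hypothetical record, not the ArnetMiner schema
    year: int
    citations: int

def expert_statistical_features(papers, current_year=2009):
    """Sketch of features S4-S7 for one expert, per the table above."""
    s4 = current_year - min(p.year for p in papers)  # years since first paper (S4)
    s5 = sum(p.citations for p in papers)            # total citations (S5)
    s6 = sum(1 for p in papers if p.citations > 5)   # papers cited > 5 times (S6)
    s7 = sum(1 for p in papers if p.citations > 10)  # papers cited > 10 times (S7)
    return [s4, s5, s6, s7]
```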
Expert Finding Results
20
Observations
• Ranking accuracy
HCDRank outperforms the baselines,
especially the two unsupervised systems
• Feature analysis
next slide: the final weight vector exploits the data
information from both domains and adjusts the weights
learned from single-domain data
• Training time: next slide
21
Feature Correlation Analysis
22
Ranking on Heterogeneous tasks
• Expert finding task vs. best supervisor finding task
• Training and test data set:
– expert finding task: ranking lists from ArnetMiner or annotated
lists
– best supervisor finding task: the 9 most frequent queries from the
ArnetMiner log file
• For each query, we collected 50 best supervisor candidates, and sent
emails to 100 researchers for annotation
• Ground truth:
– Collection of feedbacks about the candidates (yes/ no/ not sure)
23
Best supervisor finding
• Training/test set and ground truth
– 724 mails sent
– [Figure: fragment of the annotation mail]
– more than 82 effective feedbacks so far (still increasing)
– rate each candidate by the definite feedbacks (yes/no)
24
Feature Definition

Features         Description
L1-L10           Low-level language model features
H1-H3            High-level language model features
B1               The year he/she published his/her first paper
B2               The number of papers of an expert
B3               The number of papers in the recent 2 years
B4               The number of papers in the recent 5 years
B5               The number of citations of all his/her papers
B6               The number of papers cited more than 5 times
B7               The number of papers cited more than 10 times
B8               PageRank score
SumCo1-SumCo8    The sum of coauthors’ B1-B8 scores
AvgCo1-AvgCo8    The average of coauthors’ B1-B8 scores
SumStu1-SumStu8  The sum of his/her advisees’ B1-B8 scores
AvgStu1-AvgStu8  The average of his/her advisees’ B1-B8 scores
25
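
The SumCo/AvgCo and SumStu/AvgStu blocks are simple aggregations of neighbors' B1-B8 vectors; a minimal sketch (the array shape is my assumption):

```python
import numpy as np

def neighbor_aggregates(neighbor_Bs):
    """SumCo1-8 / AvgCo1-8 (or SumStu / AvgStu) from an
    (n_neighbors, 8) array of the neighbors' B1-B8 scores."""
    B = np.atleast_2d(np.asarray(neighbor_Bs, dtype=float))
    return np.concatenate([B.sum(axis=0), B.mean(axis=0)])  # 16 values
```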
Best supervisor finding results
26
Outline
• Related Work
• Heterogeneous cross domain ranking
• Experiments
• Conclusion
28
Conclusion
• We formally define the problem of heterogeneous cross
domain ranking and propose a general framework
• We provide a preferred solution under the regularized
framework by simultaneously minimizing two ranking
loss functions in the two domains
• Experimental results on three different genres of
data sets verify the effectiveness of the proposed
algorithm
29