Cikm09het - Columbia University

Heterogeneous Cross Domain Ranking in Latent Space
Bo Wang1, Jie Tang2, Wei Fan3, Songcan Chen1, Zi Yang2, Yanzhu Liu4
1Nanjing University of Aeronautics and Astronautics
2Tsinghua University
3IBM T.J. Watson Research Center, USA
4Peking University
Introduction
• The web is becoming more and more heterogeneous
• Ranking is the fundamental problem on the web
– unsupervised vs. supervised
– homogeneous vs. heterogeneous
Motivation
[Figure: heterogeneous academic network for the query “data mining”, captioned “Heterogeneous cross domain ranking”. Authors: Dr. Tang, Prof. Wang, Prof. Li, Limin, P. Yu. Papers: “Data Mining: Concepts and Techniques”, “Principles of Data Mining”, “Tree CRF...”, “SVM...”, “EOS...”, “Semantic...”, “Association...”, “Annotation...”. Conferences: KDD, ICDM, SDM, PAKDD, ISWC, IJCAI, WWW. Edge labels: write, publish, cite, coauthor, PC member. Several nodes are marked “?”.]
Main Challenges
1) How to capture the correlation between heterogeneous objects?
2) How to preserve the preference orders between objects across heterogeneous domains?
Outline
• Related Work
• Heterogeneous cross domain ranking
• Experiments
• Conclusion
Related Work
• Learning to rank
– Supervised: [Burges, 05] [Herbrich, 00] [Xu and Li, 07] [Yue, 07]
– Semi-supervised: [Duh, 08] [Amini, 08] [Hoi and Jin, 08]
– Ranking adaptation: [Chen, 08]
• Transfer learning
– Instance-based: [Dai, 07] [Gao, 08]
– Feature-based: [Jebara, 04] [Argyriou, 06] [Raina, 07] [Lee, 07] [Blitzer, 06] [Blitzer, 07]
– Model-based: [Bonilla, 08]
Outline
• Related Work
• Heterogeneous cross domain ranking
– Basic idea
– Proposed algorithm: HCDRank
• Experiments
• Conclusion
[Figure: query “data mining”. Source domain (Conference): KDD, PKDD, SDM, PAKDD, ADMA. Target domain (Expert): Jiawei Han, Jie Tang, Bo Wang, Alice, Jerry, Bob, Tom. Both domains are shown in a latent space, with labels A, B, C and X, Y, Z, and with mis-ranked pairs marked.]
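To make the notion of mis-ranked pairs concrete, here is a minimal sketch that counts them in each domain after projecting both into a shared latent space. It is an illustration only: the slides do not give HCDRank's formulas, so the projection matrices `P_src` and `P_tgt`, the shared weight vector `w`, the linear scoring function, and the toy data are all assumptions.

```python
import numpy as np

def count_misranked_pairs(scores, labels):
    """Count pairs (i, j) with labels[i] > labels[j] but scores[i] <= scores[j]."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    misranked = 0
    for i in range(len(labels)):
        for j in range(len(labels)):
            if labels[i] > labels[j] and scores[i] <= scores[j]:
                misranked += 1
    return misranked

def latent_scores(X, P, w):
    """Project X (n x d) into the latent space with P (d x k), then score with w (k,)."""
    return X @ P @ w

# Hypothetical toy data: 3 source objects with 2 features, 3 target objects with 3 features.
X_src = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
y_src = np.array([2, 1, 0])  # higher label = should rank higher
X_tgt = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
y_tgt = np.array([2, 0, 1])

# Assumed projections into a 1-dimensional latent space and a shared scorer.
P_src = np.array([[1.0], [0.0]])
P_tgt = np.array([[1.0], [0.5], [0.0]])
w = np.array([1.0])

src_errors = count_misranked_pairs(latent_scores(X_src, P_src, w), y_src)
tgt_errors = count_misranked_pairs(latent_scores(X_tgt, P_tgt, w), y_tgt)
print(src_errors, tgt_errors)
```

A learning procedure would adjust the projections and the scorer to reduce both counts at once; counting pairs this way is quadratic in the list length and is meant only to show the quantity involved.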
The Proposed Algorithm: HCDRank
[Equations not recovered. Slide annotations: “How to optimize?”, “How to define?”, “Non-convex”, “Dual problem”.]
• Alternately optimize matrices M and D: O(2T · sN log N)
• Construct transformation matrix: O(d^3)
• Learning in latent space: O(sN log N)
• Overall: O((2T+1) · sN log N + d^3)
Outline
• Related Work
• Heterogeneous cross domain ranking
• Experiments
– Ranking on Homogeneous data
– Ranking on Heterogeneous data
– Ranking on Heterogeneous tasks
• Conclusion
Experiments
• Data sets
– Homogeneous data set: LETOR_TR
• 50/75/106 queries with 44/44/25 features for TREC2003_TR, TREC2004_TR and OHSUMED_TR, respectively
– Heterogeneous academic data set: ArnetMiner.org
• 14,134 authors, 10,716 papers, and 1,434 conferences
– Heterogeneous task data set:
• 9 queries, 900 experts, 450 best supervisor candidates
• Evaluation measures
– MAP
– NDCG
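As a reference for the two measures, the following sketch computes MAP and NDCG for ranked lists. It assumes one common formulation (binary relevance for average precision, and a gain of 2^rel - 1 with a log2 discount for NDCG); the slides do not say which variant was used, and the function names are illustrative.

```python
import math

def average_precision(relevance):
    """Average precision of one ranked list of binary labels (1 = relevant).

    Assumes the list contains every relevant item for the query.
    """
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant position
    return precision_sum / hits if hits else 0.0

def mean_average_precision(ranked_lists):
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top k graded relevance labels."""
    return sum((2 ** g - 1) / math.log2(i + 1)
               for i, g in enumerate(gains[:k], start=1))

def ndcg_at_k(gains, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

map_score = mean_average_precision([[1, 0, 1], [0, 1]])
ndcg_score = ndcg_at_k([0, 1], k=2)
print(round(map_score, 4), round(ndcg_score, 4))
```

Both measures are computed per query and then averaged over queries; NDCG additionally rewards placing highly relevant items near the top.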
Ranking on Homogeneous data
• LETOR_TR
– We slightly revised LETOR 2.0 to fit the cross-domain ranking scenario
– Three sub-datasets: TREC2003_TR, TREC2004_TR, and OHSUMED_TR
• Baselines
[Figure: results on TREC2003_TR, TREC2004_TR, and OHSUMED_TR. Cosine similarity values shown, in slide order: 0.01, 0.23, 0.18.]
Training Time
Ranking on Heterogeneous data
• ArnetMiner data set (www.arnetminer.org): 14,134 authors, 10,716 papers, and 1,434 conferences
• Training and test data set:
– The 44 most frequently queried keywords from the log file
• Author collection: Libra, Rexa and ArnetMiner
• Conference collection: Libra, ArnetMiner
• Ground truth:
– Conference: online resources
– Expert: two faculty members and five graduate students from CS provided human judgments for expert ranking
Feature Definition
Features    Description
L1-L10      Low-level language model features
H1-H3       High-level language model features
S1          Number of years the conference has been held
S2          Total citations of the conference over the most recent 5 years
S3          Total citations of the conference over the most recent 10 years
S4          Number of years since the expert's first paper
S5          Total citations of all the expert's publications
S6          Number of the expert's papers cited more than 5 times
S7          Number of the expert's papers cited more than 10 times
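The expert-side features S4 to S7 are simple aggregates over a publication list. The sketch below shows one way to compute them; the `Paper` record, the `expert_features` function, and the sample data are hypothetical and are not taken from ArnetMiner.

```python
from dataclasses import dataclass

@dataclass
class Paper:
    year: int
    citations: int

def expert_features(papers, current_year):
    """Compute S4-S7 from a non-empty list of an expert's papers."""
    return {
        # S4: years elapsed since the earliest publication
        "S4": current_year - min(p.year for p in papers),
        # S5: total citations over all publications
        "S5": sum(p.citations for p in papers),
        # S6 / S7: number of papers cited more than 5 / 10 times
        "S6": sum(1 for p in papers if p.citations > 5),
        "S7": sum(1 for p in papers if p.citations > 10),
    }

# Hypothetical publication record.
papers = [Paper(2001, 12), Paper(2005, 6), Paper(2008, 3)]
features = expert_features(papers, current_year=2009)
print(features)
```

The conference-side features S1 to S3 could be computed the same way over a conference's papers, restricted to the relevant time window.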
Expert Finding Results
Feature Correlation Analysis
Ranking on Heterogeneous tasks
• Expert finding task vs. best supervisor finding task
• Training and test data set:
– Expert finding task: ranking lists from ArnetMiner or annotated lists
– Best supervisor finding task: the 9 most frequent queries from the ArnetMiner log file
• For each query, we collected 50 best supervisor candidates and sent emails to 100 researchers for annotation
• Ground truth:
– Collected feedback about the candidates (yes / no / not sure)
Feature Definition
Features           Description
L1-L10             Low-level language model features
H1-H3              High-level language model features
B1                 The year he/she published his/her first paper
B2                 The number of papers of an expert
B3                 The number of papers in recent 2 years
B4                 The number of papers in recent 5 years
B5                 The number of citations of all his/her papers
B6                 The number of papers cited more than 5 times
B7                 The number of papers cited more than 10 times
B8                 PageRank score
SumCo1-SumCo8      The sum of coauthors’ B1-B8 scores
AvgCo1-AvgCo8      The average of coauthors’ B1-B8 scores
SumStu1-SumStu8    The sum of his/her advisees’ B1-B8 scores
AvgStu1-AvgStu8    The average of his/her advisees’ B1-B8 scores
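The SumCo, AvgCo, SumStu and AvgStu features aggregate the B scores of a researcher's coauthors or advisees. A minimal sketch of that aggregation follows; the function name, the dictionary layout, and the toy scores (two values per person instead of eight, for brevity) are assumptions made for illustration.

```python
def aggregate_neighbor_scores(person, scores, neighbors):
    """Return (sums, averages) of each score over a person's neighbors.

    scores: maps a person to a list of feature values (B1-B8 would give length 8).
    neighbors: maps a person to a list of coauthors (or advisees).
    """
    neighbor_ids = neighbors.get(person, [])
    n_features = len(next(iter(scores.values())))
    sums = [0.0] * n_features
    for nid in neighbor_ids:
        for k, value in enumerate(scores[nid]):
            sums[k] += value
    count = len(neighbor_ids)
    # With no neighbors, fall back to zeros rather than dividing by zero.
    avgs = [s / count if count else 0.0 for s in sums]
    return sums, avgs

# Hypothetical scores and coauthor lists.
scores = {"alice": [1.0, 2.0], "bob": [3.0, 4.0], "carol": [5.0, 6.0]}
coauthors = {"alice": ["bob", "carol"]}

sum_co, avg_co = aggregate_neighbor_scores("alice", scores, coauthors)
print(sum_co, avg_co)
```

Passing an advisee map instead of the coauthor map yields the SumStu and AvgStu counterparts with the same function.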
Best supervisor finding results
Experimental Results
Outline
• Related Work
• Heterogeneous cross domain ranking
• Experiments
• Conclusion
Conclusion
• We formally define the problem of heterogeneous cross domain ranking and propose a general framework
• We provide a preferred solution under the regularized framework by simultaneously minimizing two ranking loss functions in two domains
• Experimental results on three different genres of data sets verify the effectiveness of the proposed algorithm
Data Set
Ranking on Heterogeneous data
• A subset of ArnetMiner (www.arnetminer.org): 14,134 authors, 10,716 papers, and 1,434 conferences
• The 44 most frequently queried keywords from the log file
• Author collection:
– For each query, we gathered the top 30 experts from Libra, Rexa and ArnetMiner
• Conference collection:
– For each query, we gathered the top 30 conferences from Libra and ArnetMiner
• Ground truth:
– Three online resources
• http://www.cs.ualberta.ca/~zaiane/htmldocs/ConfRanking.html
• http://www3.ntu.edu.sg/home/ASSourav/crank.htm
• http://www.cs-conference-ranking.org/conferencerankings/alltopics.html
– Two faculty members and five graduate students from CS provided human judgments
Ranking on Heterogeneous tasks
• For the expert finding task, we can use results from ArnetMiner or annotated lists as training data
• For the best supervisor task, the 9 most frequent queries from the ArnetMiner log file are used
– For each query, we sent emails to 100 researchers
• The top 50 researchers by ArnetMiner
• The top 50 researchers who started publishing papers only in recent years (91.6% of them are currently graduate students or postdoctoral researchers)
– Collection of feedback
• 50 best supervisor candidates (yes / no / not sure)
• Respondents could also add other candidates
– Ground truth