200412ICADLwSHK - Edward A. Fox

Transcript 200412ICADLwSHK - Edward A. Fox

Interest-based User Grouping Model
for Collaborative Filtering
in Digital Libraries
7th ICADL 2004
Shanghai, P.R.China
Dec. 15, 2004
Edward A. Fox, Seonho Kim
Digital Library Research Laboratory (DLRL)
Virginia Tech, Blacksburg, VA 26061 USA
Overview
•
Introduction
–
–
•
Collecting User Interests
System Diagram
System Analysis by 5S Model
User Model
Recommendation Algorithms
Hypotheses
Experiment
Experiment Results
–
–
–
•
•
.
Proposed User Grouping Model
–
–
–
–
–
–
•
•
Previous work
Recommender System for Digital Libraries
Collected Data
Hypothesis Test
User Grouping
Future Work
Conclusions
ICADL 2004
2
Previous Work
• Collaborative filtering
– GroupLens [2] for Usenet news
• Recommender system
– Ungar [1] and lots of works for online shopping malls
– PASS [3], Renda [4], DEBORA [5] for Digital Library (DL)
• Standard log for DL
– Gonçalves et al. [6] XML log standard for DLs
• Rating data
– Explicit rating data – Entered by user during registration or by
answering questionnaires
– Implicit rating data, Nichols [7] – Obtained by analyzing log data
ICADL 2004
3
Recommender System for Dynamic
& Complex Information Systems
• Different features of DL
–
–
–
–
–
–
Difficulties in collecting explicit user rating data
Difficulties in analyzing user log data
Difficulties in item classification
Sparseness in rating data
Dynamic, vast and complex items
Narrow user interests
• Important aspects of recommender system for DLs –
scalability, accommodating new data,
comprehensibility [9]
ICADL 2004
4
Conventional Recommender
System for Shopping Mall
Item Classes
User Classes
Items
Users
recommend
ICADL 2004
5
Recommender System for DL
Users
User Classes
Items
Item Classes
?
?
?
?
?
?
?
hard to classify
ICADL 2004
dynamic items
6
Proposed User Grouping Model
• User grouping is the most critical procedure for a
recommender system.
• Suitable for dynamic and complex information systems
like DLs
• Overcomes data sparseness
• Uses implicit rating data rather than explicit rating data
• User oriented recommender algorithm
• User interest-based community finding
• User modeling
– User model (UM) contains complete statistics for recommender
system.
– Enhanced interoperability
ICADL 2004
7
Collecting User Interests for User
Grouping
• Users with similar interests are grouped
• Employs a Document Clustering Algorithm,
LINGO [10], to collect document topics
• Users’ interests are collected implicitly during
searching and browsing.
• A User Model (UM) contains her interests and
document topics.
• Interests of a user are subset of document topics
proposed to her by Document Clustering.
ICADL 2004
8
Interest-based Recommender
System
ICADL 2004
9
System Analysis with 5S Model
Interest-based Recommender System for DL
Society
represented by
Structure
User Model
UM schema
User description
Statistics
User interests
Document topics
User groups
Researcher
Interest Group
participates
Teacher
Users
Class Group
Learner
Community
refers
Collaboration space
Probability Space
Space
Vector space
User Interface
Text
Users
Video
refers
Recommendation
generates
displays
Presentation
Audio
Stream
composed of
ICADL 2004
Push service
Filtering
Ranking
Highlighting
Personalized
pages
Group Selection
Individual Selection
Scenario
10
User Model (UM)
User ID
User Description
Groups
(explicit data
-obtained from
questionnaire)
Name
Statistics
(implicit data
-generated by
recommender)
Group ID
Score
(implicit data
-generated by user interface
and recommender)
Document Topic
Score
User Interest
Score
E-mail
Address
Publications
User Interests
ICADL 2004
11
User Similarity Based on User
Interests
• Derived from Pearson’s correlation coefficient
• Similarity between users ‘a’ and ‘i’
s ( a, i ) 
 (v  v )(v  v )
 ( v  v )  (v  v )
j
•
a
a, j
j
a, j
a
i
i, j
2
j
i, j
i
2
where j is common interest, va,j is rating score of
item j by user a, and
va 
total number of topics user selected
total number of topics proposed by the system
ICADL 2004
12
Interest-based Group
Recommendation
Rk , the probability that a user group ‘k’ is affected from
a rating which is made by a user ‘a’ to an item ‘j’, can be
calculated as follows:
•
Rk  Pk
•
1
N
 vi , j (1  P )
i:Ci  k
k
1
T N
 (1 vi , j )
i:Ci  k
where
T : total number of users in the system
Ci : the group that user i belongs to
Vi,j : probability that user i votes for j
N : total number of users in group K
Pk : base rate of group K, observed from DB
ICADL 2004
13
Interest-based Individual
Recommendation
• Once groups are selected, individual users, who will be
affected from the voting, are selected.
• Probability that a user ‘a’ in group ‘k’ likes the item ‘j’
can be obtained by:
n
Pa , j  va    s ( a, i )( vi , j  vi )
i 1
•
where
n is the number of users in the selected group k
s(a,i) is the similarity between user ‘a’ and user ‘i’
va is average probability of positive voting of user ‘a’
 is a normalizing factor
ICADL 2004
14
Interest-based Recommendation
Algorithms
• Correct user grouping is critical for correct
recommendation.
• User centered algorithm
– User model consists of complete statistics for recommender.
– Since no references to statistics of items are needed, is suitable for
information systems with dynamic and complex items.
– Enhanced User Model interoperability and reusability
• Group oriented algorithm
– Two phase recommendation
• Group selection & Individual selection
– Automatic user’s communities finding
ICADL 2004
15
Hypotheses
• Three hypotheses about proper document clustering
algorithm behavior
– H1 : Any serious user who has his own research interests and
topics, shows consistent output for the document collections
referred to by that user.
– H2 : Serious users who share common research interests and
topics, show overlapped output for the document collections
referred to by them.
– H3 : Serious users who don’t share any research interests and
topics, show different output for document collections referred to
by them.
ICADL 2004
16
Experiment - Tasks
• Subjects are asked to
– answer a questionnaire to collect democratic
information
– list research interests to help us collect explicit rating
data which is used for evaluation in the experiment
– search some documents in her research interests and
browse the result documents to help us collect implicit
rating data
ICADL 2004
17
Experiment - Participants
• 22 Ph.D and MS students majoring in Computer
Science
• CITIDEL [8] is used as a DL in “Computing” field
• Data from 4 students were excluded as their
research domains are not included in CITIDEL
ICADL 2004
18
Experiment - Interfaces
• Specially designed user interfaces are required to
capture user’s interactions
• JavaScripts
• Java Application
ICADL 2004
19
Results - Collected Data
• Example
<Semi Structured Data<Cross Language Information Retrieval CLIR<Translation Model<Structured English
Query<TREC Experiments at Maryland<Structured Document<Evaluation<Attribute
Grammars<Learning<Web<Query Processing<Query
Optimisers<QA<Disambiguation<Sources<SEQUEL<Fuzzy<Indexing<Inference Problem<Schematically
Heterogeneity<Sub Optimization Query Execution Plan<Generation<(Other)(<Cross Language Information
Retrieval CLIR)(<Structured English Query)(<TREC Experiments at Maryland)(<Evaluation)(<Query
Processing)(<Query Optimisers)(<Disambiguation)
<Cross Language Information Retrieval CLIR<Machine Translation<English Japanese<Based Machine<TREC
Experiments at Maryland<Approach to Machine<Natural Language<Future of Machine Translation<Machine
Adaptable Dynamic Binary<CLIR Track<Systems<New<Tables Provide<Design<Statistical Machine<Query
Translation<Evaluates<Chinese<USA October Proceedings<Interlingual<Technology<Syntax Directed
Transduction<Interpretation<Knowledge<Linguistic<Divergences<(Other)(<Cross Language Information Retrieval
CLIR)(<Machine Translation)(<English Japanese)(<TREC Experiments at Maryland)(<CLIR Track)(<Query
Translation)
• Parenthesized topics mean they are rated positively
ICADL 2004
20
Results – Hypothesis Test for H1
• H0 (Null hypothesis of H1) : Mean(μ) of frequency of document topics
proposed by Document Clustering Algorithm are NOT consistent (μ0 = 1)
for a user,
H0 : μ = μ0 vs H1 : μ > μ0
• Conditions : 95% confidence (test size α = 0.05), sample size ‘n’ ≤ 25,
standard deviation ‘σ’ unknown, i.i.d. random samples, normal distribution,
 estimated z-score T-test
• Test statistics : sample mean ‘ỹ’ = 1.1429, sample standard deviation ‘s’ =
0.2277 are observed from the experiment
• Rejection Rule is to reject H0 if the ỹ > μ0+zα/2 σ/√n
• From the experiment, ỹ = 1.1429 > μ0+zα/2 σ/√n = 1.0934
• Therefore decision is to Reject H0 and accept H1
• 95% Confidence Interval for μ is 1.0297 ≤ μ ≤1.2561
• P-value (confidence of H0) = 0.0039
ICADL 2004
21
Results - Users
• All users are assigned a symbol after experiments
according to their explicit data for convenience of analysis
User Symbols
User profiles collected from questionnaire
1
dlmember
The one who belonged to the Digital Library Research Laboratory
2
softeng
The one who has an interest in Software Engineering
3
bio
The one who has an interest in Bioinformatics
4
vr_hci
The one who has an interest in Virtual Reality and Human Computer Interaction
5
clir_1
The one who has an interest in Cross Language Information Retrieval
6
clir_2
The one who has an interest in Cross Language Information Retrieval
7
nlp_1
The one who has an interest in Natural Language Processing
8
nlp_2
The one who has an interest in Natural Language Processing
9
vr_1
The one who has an interest in Virtual Reality
10
vr_2
The one who has an interest in Virtual Reality
11
EC_agent
The one who has an interest in E-Commerce and Agent
12
CybEdu_agt
The one who has an interest in Cyber Education and Agent
13
dlandedu_1
The one who has an interest in Digital Library and Education
14
dlandedu_2
The one who has an interest in Digital Library and Education
15
person_1
The one who has an interest in Personalization
16
person_2
The one who has an interest in Personalization
17
se_me
The one who has an interest in Software Engineering
18
fuzzy
ICADL 2004
22
The one who has an interest in Fuzzy Theory
Results – User Similarities
ICADL 2004
23
Results – User Similarity Levels
User ID
Level 1
Level 2
Level 3
Level 4
dlmember
dlandedu_1, dlandedu_2
softeng
se_me
person_2
vr_2, vr_1
person_1, person_2
bio
vr_hci
clir_1
nlp_1, clir_2
nlp_2
clir_2
clir_1, nlp_1
nlp_2
nlp_1
nlp_2
clir_1, clir_2
nlp_2
nlp_1
clir_1, clir_2
vr_1
vr_2, vr_hci
vr_2
vr_hci
EC_agent
CybEdu_agt
CybEdu_agt
EC_agent
vr_1
person_1, person_2
person_2
fuzzy
dlandedu_1
dlmember
dlandedu_2
vr_hci, CybEdu_agt
dlandedu_2
dlmember
dlandedu_1
CybEdu_agt, vr_hci
person_1
person_2
person_2
person_1
se_me
softeng
fuzzy
CybEdu_agt
ICADL 2004
bio, nlp_1
24
Results - User Groups
User Group Member IDs ( assigned after experiment according to their
ID
research interests which are answered on the questionnaire)
A
dlmember, dlandedu_1, dlandedu_2
B
softeng, se_me,
C
vr_hci, vr_1, vr_2
D
clir_1, nlp_1, clir_2
E
nlp_1, nlp_2
F
person_1, person_2
G
fuzzy, cybedu_agt
H
EC_agent, cybedu_agt, fuzzy
I
Bio
• User groups are generated by merging a user and other
members with the closest similarity level
ICADL 2004
25
Future Work
• Detailed analyses for accuracy, scalability and
efficiency
• Further confirmation of our hypotheses
ICADL 2004
26
Conclusions
• Proposed a user clustering model based on user
interests
• Proposed user centered recommendation
algorithms which are suitable for DLs
• Proposed a way of collecting and using implicit
rating data from DL users
• Proposed a active way of user communities
finding
• Verified proposed approaches through designed
experiments and hypothesis tests
ICADL 2004
27
References
•
•
•
•
•
•
[1] Lyle H. Ungar and Dean P. Foster: A Formal Statistical Approach to Collaborative Filtering.
CONALD ’98, Carnegie Mellon U., Pittsburgh, PA (1998)
[2] Joseph A. Konstan, Bradley N. Miller, David Maltz, Jonathan L. Herlocker, Lee R. Gordon,
and John Riedl: GroupLens: Applying Collaborative Filtering to Usenet News. Communications
of the ACM, Vol. 40 (1997) 77-87
[3] Chun Zeng, Xiaohui Zheng, Chunxiao Xing, Lizhu Zhou : Personalized Services for Digital
Library. In Proc. 5th Int. Conf. on Asian Digital Libraries, ICADL (2002) 252-253
[4] M. Elena Renda, Umberto Straccia: A Personalized Collaborative Digital Library Environment.
In Proc. 5th Int. Conf. on Asian Digital Libraries, ICADL (2002) 262-274
[5] David M Nichols, Duncan Pemberton, Salah Dalhoumi, Omar Larouk, Clair Belisle, Michael
B. Twidale: DEBORA: Developing an Interface to Support Collaboration in a Digital Library. In
Proceedings of the Fourth European Conference on Research and Advanced Technology for
Digital Libraries (2000) 239-248
[6] Marcos A. Gonçalves, Ming Luo, Rao Shen, Mir Farooq, and Edward A. Fox: An XML Log
Standard and Tools for Digital Library Logging Analysis, in Proc. of Sixth European Conference
on Research and Advanced Technology for Digital Libraries (2002) 16-18
ICADL 2004
28
References
•
•
•
•
[7] David M Nichols, Duncan Pemberton, Salah Dalhoumi, Omar Larouk, Clair Belisle, Michael B.
Twidale: DEBORA: Developing an Interface to Support Collaboration in a Digital Library. In
Proceedings of the Fourth European Conference on Research and Advanced Technology for Digital
Libraries (2000) 239-248
[8] CITIDEL project, Computing and Information Technology Interactive Digital Educational
Library, http://www.citidel.org/ (2004)
[9] John S. Breese, David Heckerman and Carl Kadie: Empirical Analysis of Predictive Algorithms
for Collaborative Filtering. In Proceedings of the 14th Conference on Uncertainty in Artificial
Intelligence (1997) 43-52
[10] Stanisław Osiński and Dawid Weiss: Conceptual Clustering Using Lingo Algorithm: Evaluation
on Open Directory Project Data, Advanced in Soft Computing, Intelligent Information Processing
and Web Mining, Proceedings of the International IIS: IIPWM’04 Conference, Zakopane, Poland
(2004) 369-378
• Acknowledgements to NSF for partial support
– DUE-0121679, 0121741, 0333531; IIS-0086227, 0307867, 0325579
ICADL 2004
29