Transcript Zhang Tao
UOS
Ontology Based Personalized Search
Zhang Tao
The University of Seoul
Contents
1. Overview
2. Determining the content of documents
3. User Profiles
4. Improving Search Results
5. Conclusions and Future Work
Data Mining
Overview
Problem statement
With the exponentially growing amount of information
available on the Internet, the task of retrieving documents
of interest has become increasingly difficult.
People have two ways to find the data they are looking
for: searching and browsing.
In searching, about one half of all retrieved
documents have been reported to be irrelevant. Why?
Resulting question: how can an effective personalization
system be built?
Focus of the paper
This paper studies ways to model a user’s interests and
shows how these profiles can be deployed for more
effective information retrieval and filtering.
A user profile is created over time by analyzing surfed
pages.
This paper shows how the profiles can be used to
achieve search performance improvements.
The OBIWAN project
The goal of OBIWAN is to investigate a novel content-based
approach to distributed information retrieval.
Websites are clustered into regions.
The architecture is a hierarchy of regions.
The text classifier is a core component not only of the
entire OBIWAN project, but also of the presented
personalization method.
Related Work
Personalization is a broad field of very active ongoing
research.
Applications include personalized access to certain
resources and filtering/rating systems.
SmartPush is currently the only system to store profiles
as concept hierarchies.
Determining the content of documents
Importance
User interests are inferred by analyzing the web pages
the user visits.
For this purpose, it is necessary to determine the content
of these surfed pages, that is, to characterize them.
A hierarchy of concepts
This ontology is based on a publicly accessible browsing
hierarchy.
Each node is associated with a set of documents; all
documents for a node are merged into a superdocument.
Documents as well as superdocuments are represented
as weighted keyword vectors.
The keyword vector of a surfed page is compared with the
keyword vectors associated with every node to calculate
similarities.
The nodes with the top matching vectors are assumed to
be most related to the content of the surfed page.
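The matching step can be sketched in Python as follows. This is only an illustration under stated assumptions: the slides do not say how keywords are weighted or which similarity measure is used, so plain term frequencies and cosine similarity stand in for them. The function names, the sample nodes, and the default of four returned categories are hypothetical.

```python
import math
from collections import Counter

def keyword_vector(text):
    """Term-frequency vector (assumed stand-in for the weighted keyword vectors)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse keyword vectors (assumed measure)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_categories(page_text, node_documents, k=4):
    """Return the k nodes whose superdocument best matches the page."""
    page_vec = keyword_vector(page_text)
    scores = []
    for node, docs in node_documents.items():
        # All documents of a node are merged into one superdocument.
        superdoc_vec = keyword_vector(" ".join(docs))
        scores.append((node, cosine(page_vec, superdoc_vec)))
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:k]

# Hypothetical two-node ontology, for illustration only.
nodes = {
    "Sports/Soccer": ["soccer goal league match", "league cup soccer"],
    "Computers/Search": ["search engine query ranking", "web search index"],
}
ranked = top_categories("ranking of web search results", nodes, k=1)
```

In this toy example the page shares the terms "ranking", "web", and "search" only with the second node, so that node is returned as the best match.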
User Profiles
Introduction
User profiles store approximations of the interests of a
given user.
User profiles have three properties:
• hierarchically structured, not just a list of keywords
• generated automatically, without explicit user feedback
• dynamic
Creation and Maintenance
Profiles are generated by analyzing the surfing behavior
of a user. “Surfing behavior” here refers to the length of
the visited pages and the time spent thereon.
Four different combinations of time, length, and subject
discriminators have been investigated.
In the following functions, time is the time a user
spent on a given page, length is the length of that page,
and γ(d, ci) is the strength of the match between the
content of document d and category ci. ΔL(ci) is the
adjustment to the interest L in category ci.
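$$\Delta L(c_i) = \log\left(\frac{\text{time}}{\log \text{length}}\right) \cdot \gamma(d, c_i) \qquad (1)$$

$$\Delta L(c_i) = \log\left(\frac{\text{time}}{\log \log \text{length}}\right) \cdot \gamma(d, c_i) \qquad (2)$$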
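A minimal Python sketch of adjustment functions (1) and (2) follows. It reads both as the logarithm of a time-to-length ratio multiplied by the match strength. The natural logarithm, the units (seconds and characters), and the additive profile update are assumptions that the slides do not state; all function and variable names are hypothetical.

```python
import math

def adjustment_1(time_spent, length, match):
    """Function (1): log(time / log(length)) * match strength."""
    return math.log(time_spent / math.log(length)) * match

def adjustment_2(time_spent, length, match):
    """Function (2): log(time / log(log(length))) * match strength."""
    return math.log(time_spent / math.log(math.log(length))) * match

def update_profile(profile, category_matches, time_spent, length,
                   adjust=adjustment_2):
    """Add the interest adjustment for each matched category (assumed additive)."""
    for category, match in category_matches:
        delta = adjust(time_spent, length, match)
        profile[category] = profile.get(category, 0.0) + delta
    return profile

# Hypothetical visit: 60 seconds on a 2000-character page.
profile = update_profile({}, [("Computers/Search", 0.8)],
                         time_spent=60, length=2000)
```

Because log log length grows more slowly than log length, function (2) dampens the influence of page length more strongly than function (1). Both functions require the page length to be large enough for the inner logarithms to be positive.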
Profile Evaluation: Convergence
The evaluation of the user profiles consists of two parts:
• First, a notion of convergence is introduced, with respect to
which 16 actual user profiles are discussed.
• Second, the relationship between the calculated user interests
and the actual user interests is examined.
Figure 1 shows a sample profile (adjustment function 2);
it consists of roughly 75 non-zero categories.
Figure 2 shows the numbers of non-zero categories for
five sample profiles with 100-150 categories created
using the same interest adjustment function.
On average, that corresponds to roughly 320 pages, or
17 days of surfing. Table 1 summarizes the convergence
properties.
Comparison with actual user interests
Although convergence is a desirable property, it does not
measure the accuracy of the generated profiles.
The sixteen users were shown the top twenty subjects in
their profiles in random order and asked how
appropriately these inferred categories reflected their
interests.
Table 2 shows the users' answers for the top 20 and
top 10 categories, respectively.
Improving Search Results
The problem with search results
The amount of information available on the web is too
large for a user to examine.
The top-ranked documents, which are the ones a user
can actually look at, are often not relevant to that user.
There are three common approaches to address this
problem:
• Re-ranking: The algorithms apply a function to the ranking
numbers that have been returned by the search engine.
• Filtering: Filtering systems determine which documents in the
result sets are relevant and which are not.
• Query Expansion: If a query can be expanded with the user’s
interests, the search results are likely to be more narrowly
focused.
Re-Ranking
Given a query, re-ranking modifies the ranking returned
by a publicly accessible search engine, in this case
ProFusion (www.profusion.com). The idea is to characterize
each of the returned documents and, by referring to the
user profile, to determine how much the user is interested
in the categories of each document.
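The following function is the adjustment function of the
re-ranking method.

$$Q(d_j) = w(d_j) \cdot \left(0.5 + \frac{1}{4} \sum_{i=1}^{4} \pi(c_i) \cdot \gamma(d_j, c_i)\right)$$

Here w(dj) is the rank of document dj returned by the search
engine, π(ci) is the user's interest in category ci taken from
the profile, and γ(dj, ci) is the strength of the match between
dj and ci. The sum runs over the four best-matching categories.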
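The re-ranking function can be sketched in Python. The sketch reads the formula as Q(dj) = w(dj) · (0.5 + 1/4 · Σ π(ci) · γ(dj, ci)), summed over the four best-matching categories. The plus sign and the interest symbol π are reconstructed and should be treated as assumptions; the function names and sample data are hypothetical.

```python
def rerank_score(rank_weight, interests, matches):
    """Personalized score for one document.

    rank_weight: w(dj), the weight returned by the search engine.
    interests:   mapping category -> user interest (from the profile).
    matches:     list of (category, match strength) for the document,
                 best matches first; only the first four are used.
    """
    total = sum(interests.get(category, 0.0) * strength
                for category, strength in matches[:4])
    return rank_weight * (0.5 + total / 4)

def rerank(results, interests):
    """Sort (doc, rank_weight, matches) triples by personalized score."""
    scored = [(doc, rerank_score(weight, interests, matches))
              for doc, weight, matches in results]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical profile and result list.
interests = {"A": 1.0}
results = [
    ("d1", 1.0, [("B", 0.9)]),   # ranked first by the engine, off-profile
    ("d2", 0.8, [("A", 1.0)]),   # ranked lower, matches the user's interest
]
reranked = rerank(results, interests)
```

With these numbers d1 scores 1.0 · 0.5 = 0.5 and d2 scores 0.8 · (0.5 + 0.25) = 0.6, so the document that matches the profile moves to the top.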
Evaluation
The results produced by the different re-ranking
systems must be evaluated.
The eleven-point precision average is used as the
evaluation measure.
It evaluates ranking performance in terms of recall and
precision.
$$\text{Recall} = \frac{\text{number of relevant items retrieved}}{\text{number of relevant items in collection}}$$

$$\text{Precision} = \frac{\text{number of relevant items retrieved}}{\text{total number of items retrieved}}$$
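The sketch below computes recall and precision along a ranked list and then the eleven-point precision average. It assumes the common interpolated variant: at each recall level 0.0, 0.1, ..., 1.0, take the highest precision reached at that recall or beyond, then average the eleven values. The slides do not specify the interpolation, and the function names are hypothetical.

```python
def precision_recall_points(ranked, relevant):
    """(recall, precision) after each position of the ranked list."""
    if not relevant:
        return []
    points = []
    hits = 0
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        recall = hits / len(relevant)
        precision = hits / position
        points.append((recall, precision))
    return points

def eleven_point_average(ranked, relevant):
    """Mean interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    points = precision_recall_points(ranked, relevant)
    total = 0.0
    for level in range(11):
        recall_level = level / 10
        # Interpolation: best precision at this recall level or higher.
        candidates = [p for r, p in points if r >= recall_level]
        total += max(candidates, default=0.0)
    return total / 11

# Hypothetical ranking with two relevant documents, "a" and "b".
score = eleven_point_average(["a", "x", "b", "y"], {"a", "b"})
```

A ranking that places all relevant documents first yields an average of 1.0; relevant documents further down the list lower the precision at the higher recall levels.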
Figure 3 shows the recall-precision graphs for one of the
interest adjustment functions.
The remaining set of 16 queries was evaluated using this
function; Figure 4 shows the results.
Filtering
To filter a set of result documents means to exclude some
documents.
Filtering was done by using the above ranking functions
with thresholds to decide which documents were
irrelevant and which were not.
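Threshold filtering on top of a ranking function can be sketched as below. The threshold value and the sample scores are hypothetical; the slides do not report the thresholds actually used.

```python
def filter_results(scored_results, threshold):
    """Keep only documents whose personalized score reaches the threshold."""
    return [(doc, score) for doc, score in scored_results
            if score >= threshold]

# Hypothetical scores produced by a ranking function.
scored = [("d2", 0.60), ("d1", 0.50), ("d3", 0.20)]
kept = filter_results(scored, threshold=0.4)  # hypothetical threshold
```

Here d3 falls below the threshold and is excluded from the result set, while the order of the remaining documents is preserved.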
Figures 5 and 6 show the performance of the filter for the
training and the testing set, respectively.
Conclusions and Future Work
Conclusion
These profiles have been shown to converge and to
reflect actual user interests quite well.
With the presented approach, the length of a surfed page
can be neglected when the interest in a page is inferred.
Future work
Future work includes the integration of the system into a
web browser.
Other areas of profile deployment are conceivable.