Applying Semantic Analyses to Content-based Recommendation and Document Clustering
Eric Rozell, MRC Intern
Rensselaer Polytechnic Institute
Bio
• Graduate Student @ Rensselaer Polytechnic Institute
• Research Assistant @ Tetherless World Constellation
• Student Fellow @ Federation of Earth Science Information Partners (ESIP)
• Research Advisor: Peter Fox
• Research Focus: Semantic eScience
• Contact: [email protected]
Background
Semantic Analysis
Recommendation
Clustering
Conclusions
Outline
• Background
• Semantic Analysis
– Probase Conceptualization
– Explicit Semantic Analysis
– Latent Dirichlet Allocation
• Recommendation Experiment
– Recommendation Systems
– Experiment Setup
– Results
• Clustering Experiment
– Problem
– K-Means
– Results
• Conclusions
Background
• Billions of documents on the Web
• Semi-structured data from Web 2.0 (e.g., tags,
microformats)
• Most knowledge remains in unstructured text
• Many natural language techniques for:
– Ontology extraction
– Topic extraction
– Named entity recognition/disambiguation
• Some techniques are better than others for
various information retrieval tasks…
Probase
• Developed at Microsoft Research Asia
• Probabilistic knowledge base built from Bing
index and query logs (and other sources)
• Text mining patterns
– Namely, Hearst patterns: “… artists such as Picasso”
• Evidence for hypernym(artists, Picasso)
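Hearst-pattern mining of this kind can be sketched with a single regular expression for the “such as” pattern. Probase uses many more patterns over a Web-scale corpus; the head-noun heuristic below (take the last word before “such as”) is a simplification for illustration.

```python
import re

# Minimal "X such as Y" Hearst-pattern matcher (a simplification of the
# pattern mining described above; real Probase mining is far richer).
PATTERN = re.compile(r"(\w[\w\s]*?)\s+such as\s+(\w[\w\s]*)", re.IGNORECASE)

def hearst_pairs(text):
    pairs = []
    for m in PATTERN.finditer(text):
        hypernym = m.group(1).strip().split()[-1]  # crude head-noun heuristic
        hyponym = m.group(2).strip()
        pairs.append((hypernym, hyponym))
    return pairs

print(hearst_pairs("famous artists such as Picasso"))  # [('artists', 'Picasso')]
```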
Probase
• Very capable at conceptualizing groups of entities:
– “China; India; United States” yields “country”
– “China; India; Brazil; Russia” yields “emerging market”
• Differentiates attributes and entities
– “birthday” -> “person” as attribute
– “birthday” -> “occasion” as entity
• Applications
– Clustering Tweets from Concepts [Song et al., 2011]
– Understanding Web Tables
– Query Expansion (Topic Search)
Research Questions
• What’s the best way of extracting concepts from text?
– Compare techniques for semantic analysis
• How are extracted concepts useful?
– Generate data about where semantic analysis techniques
are applicable
• Are user ratings affected by the concepts in media
items such as movies?
– Test semantic analysis techniques in recommender
systems
• How useful is Web-scale domain knowledge in
narrower domains for information retrieval?
– Identify need for domain specific knowledge
Semantic Analysis
• Generating meaning (concepts) from text
• Specifically, get prevalent hypernyms
– E.g., “… Apple, IBM, and Microsoft …”
– “technology companies”
• Semantic analysis using external knowledge
– Probase Conceptualization
– Explicit Semantic Analysis
– WordNet Synsets
• Semantic analysis using latent features
– Latent Dirichlet Allocation
– Latent Semantic Analysis
Probase Conceptualization

[Figure: pipeline. For each document in the corpus, the plain text is split into terms (t1, t2, t3, …); Probase maps each term to candidate concepts (c1, c2, c3, c4, …); a Naïve Bayes / Summation step combines the concepts; Inverse Document Frequency / Filtering produces the final Document Concepts.]
Probase Conceptualization
• “Cowboy doll Woody (Tom Hanks) is coordinating
a reconnaissance mission to find out what
presents his owner Andy is getting for his
birthday party days before they move to a new
house. Unfortunately for Woody, Andy receives a
new spaceman toy, Buzz Lightyear (Tim Allen)
who impresses the other toys and Andy, who
starts to like Buzz more than Woody. Buzz thinks
that he is an actual space ranger, not a toy, and
thinks that Woody is interfering with his
"mission" to return to his home planet…”
Text Source: Internet Movie Database (IMDb)
Sample Features for “Toy Story” (Probase)
• dvd encryptions: 0.050 (“RC”)
• duty free item: 0.044 (“toys”)
• generic word: 0.043 (“they, travel, it, …”)
• satellite mission: 0.032 (“reconnaissance mission”)
• creator-owned work: 0.020 (“Woody”)
• amazing song: 0.013 (“fury”)
• doubtful word: 0.013 (“overcome”)
• ill-fated tool: 0.013 (“Buzz”)
• lovable “toy story” character: 0.011 (“Buzz Lightyear, Woody, …”)
• pleased star: 0.010 (“Woody”)
• trail builder: 0.010 (“Woody”)
Explicit Semantic Analysis
Image Source: Gabrilovich et al., 2007
Sample Features for “Toy Story” (ESA)
• #REDIRECT [[Buzz!]]: 0.034
• #REDIRECT [[The Buzz]]: 0.028
• #REDIRECT [[Buzz (comics)]]: 0.027
• #REDIRECT [[Buzz cut]]: 0.027
• #REDIRECT [[Buzz (DC Thomson)]]: 0.024
• #REDIRECT [[Buzz Out Loud]]: 0.024
• #REDIRECT [[The Daily Buzz]]: 0.023
• #REDIRECT [[Buzz Aldrin]]: 0.022
• #REDIRECT [[Buzz cut]]: 0.022
• #REDIRECT [[Buzzing Tree Frog]]: 0.022
Latent Dirichlet Allocation
• Blei et al., 2003
• Unsupervised Learning Method
• “Generates” documents from Dirichlet
distributions over words and topics
• Topic distributions over documents can be
inferred from corpus
Image Source: Wikipedia
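The generative story can be sketched with standard-library sampling; the topic count, vocabulary size, and Dirichlet hyperparameters below are made-up toy values. Inference, the hard part that LDA implementations actually do, runs this story in reverse.

```python
import random

def dirichlet(alpha, rng):
    # Sample from a Dirichlet by normalizing independent Gamma draws.
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def categorical(probs, rng):
    # Draw an index from a discrete distribution by cumulative sum.
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

rng = random.Random(0)
n_topics, vocab_size, doc_len = 3, 10, 8
phi = [dirichlet([0.1] * vocab_size, rng) for _ in range(n_topics)]  # topic -> words
theta = dirichlet([0.5] * n_topics, rng)                             # doc -> topics
# For each token: pick a topic from theta, then a word from that topic.
doc = [categorical(phi[categorical(theta, rng)], rng) for _ in range(doc_len)]
print(doc)  # eight word ids drawn from the generative model
```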
Recommendation Systems
• Collaborative Filtering
– “Customers who purchased X also purchased Y.”
• Content-based
– “Because you enjoyed ‘GoldenEye’, you may want
to watch ‘Mission: Impossible’.”
• Hybrid
– Most modern systems take a hybrid approach.
Content-based Recommendation
• In GoldenEye/Mission: Impossible example…
– Structured item content
• Genre – Action/Adventure/Thriller
• Tags – Action, Espionage, Adventure
– Unstructured item content
• Plot synopses – “helicopter, agent, infiltrate, CIA, …”
• Concepts? – “aircraft, intelligence agency, …”
Recommendation Systems

[Figure: diagram of three overlapping approach families: Collaborative Filtering Approaches, Structured Content-based Approaches, and Unstructured Content-based Approaches. The semantic analysis approaches are tested in the unstructured content-based region.]
Experiment

[Figure: pipeline. Movie ratings from MovieLens and movie synopses from IMDb feed into Feature Generation, then into the Matchbox Recommendation Platform; output is evaluated by Mean Absolute Error (MAE).]
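The evaluation metric at the end of the pipeline, mean absolute error between predicted and held-out ratings, is simple to state; a minimal sketch with toy numbers:

```python
# Mean absolute error between predicted and actual ratings (toy values).
def mae(predicted, actual):
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

print(mae([3.5, 4.0, 2.0], [4.0, 4.0, 3.0]))  # -> 0.5
```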
Matchbox
Source: Matchbox API Documentation
Experimental Data
• Data: MovieLens Dataset [HetRec ’11]
– 855,598 ratings
– 10,197 movies
– 2,113 users
• Movie synopses from IMDb (http://www.imdb.com)
– Collected synopses for 2,633 movies
– With 435,043 ratings
– From 2,113 users
• Ratings data:
– Scored by half points from 0.5 to 5
• Choose different numbers of movies (200; 1,000; all)
• Train on 90% of ratings, test on remaining 10%
Experimental Data
• Controls
– Baseline 1: Only features are user IDs and movie IDs
– Baseline 2: User IDs, Movie IDs, Movie Genre
– Baseline 3: User IDs, Movie IDs, Movie Tags
• Feature Sets
– Term Frequency – Inverse Document Frequency
– Latent Dirichlet Allocation
– Explicit Semantic Analysis
– Probase Conceptualization
Experimental Setup

• 4 Scenarios: (training: white, testing: black)

[Figure: four Users × Movies matrices, one per scenario, indicating which user/movie combinations are used for training and which are held out for testing.]
Results

[Chart: MAE for Baseline #1, Baseline #2, Baseline #3, TFIDF Normalized, Probase Sum, and ESA; y-axis ranges from 0.56 to 0.595, x-axis from 1 to 10.]
Results

                 # of Movies
Feature Set      All (2,633)    1,000       200
Baseline 1       0.672293       0.71654     0.802044
Baseline 2       0.641556       0.683297    0.752745
Baseline 3       0.655613       0.68994     0.764369
TF-IDF           0.674764       0.706914    0.815245
Probase          0.670694       0.715456    0.797196
ESA              0.670182       0.714967    0.796787
LDA              (unfinished)   0.711307    0.790362

• testing set contains users and movies not seen in training set
• recommendations based on item features alone
• small amounts of structured data (e.g., genre) are the most influential in this scenario
Results

                 # of Movies
Feature Set      All (2,633)    1,000       200
Baseline 1       0.580087       0.564226    0.577349
Baseline 2       0.576183       0.563028    0.576673
Baseline 3       0.575398       0.563378    0.572297
TF-IDF           0.579906       0.575932    0.588288
Probase          0.578889       0.563669    0.578089
ESA              0.579798       0.564334    0.577638
LDA              (unfinished)   0.566639    0.579633

• testing set contains users not seen in training set
• lots of collaborative data available (explains comparable performance in all feature sets)
• given extensive collaborative data, item features are marginally beneficial (in Matchbox)
Results

                 # of Movies
Feature Set      All (2,633)    1,000       200
Baseline 1       0.672843       0.687586    0.832491
Baseline 2       0.639683       0.651141    0.81416
Baseline 3       0.652071       0.66492     0.745593
TF-IDF           0.672362       0.665116    0.844305
Probase          0.670159       0.686235    0.823972
ESA              0.670451       0.683594    0.817306
LDA              (unfinished)   0.684689    0.852056

• testing set contains movies not seen in the training set
• recommendations based on item features and extensive information on users’ “rating model”
• small amounts of structured data (e.g., genre) are the most influential in this scenario (even for long-term users)
Results

                 # of Movies
Feature Set      All (2,633)    1,000       200
Baseline 1       0.560163       0.564673    0.568706
Baseline 2       0.556011       0.556456    0.567598
Baseline 3       0.550761       0.561643    0.56445
TF-IDF           0.551909       0.558942    0.588288
Probase          0.556414       0.558113    0.567332
ESA              0.556517       0.55706     0.568174
LDA              (unfinished)   0.558105    0.568927

• testing set contains users and movies seen in the training set
• recommendations again are primarily collaborative
• given a large corpus of rating data for users and items, item features are only marginally beneficial
Results

                 Experiment 1   Experiment 2   Experiment 3   Experiment 4
Baseline 1       0.672293       0.580087       0.672843       0.560163
Baseline 2       0.641556       0.576183       0.639683       0.556011
Baseline 3       0.655613       0.575398       0.652071       0.550761
TF-IDF           0.674764       0.579906       0.672362       0.551909
Probase          0.670694       0.578889       0.670159       0.556414
ESA              0.670182       0.579798       0.670451       0.556517
Document Clustering
• Divide a corpus into a specified number of
groups
• Useful for information retrieval
– Automatically generated topics for search results
– Recommendations for similar items/pages
– Visualization of search space
K-Means

1. Start with initial clusters
2. Compute means of clusters
3. Compare cosine distance of each item to means
4. Assign items to clusters based on min. distance
5. Repeat from step 2 until convergence
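The steps above can be sketched as a minimal cosine-distance K-Means over unit-normalized vectors. The toy data and parameters are made up; this is not the experiment's actual implementation.

```python
import math
import random

def normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def cosine_dist(a, b):
    # For unit vectors, cosine distance is 1 minus the dot product.
    return 1.0 - sum(x * y for x, y in zip(a, b))

def kmeans(items, k, iters=20, seed=0):
    items = [normalize(v) for v in items]
    rng = random.Random(seed)
    assign = [rng.randrange(k) for _ in items]          # 1. initial clusters
    for _ in range(iters):
        means = []
        for c in range(k):                              # 2. compute cluster means
            members = [v for v, a in zip(items, assign) if a == c]
            if not members:                             # reseed an empty cluster
                members = [items[rng.randrange(len(items))]]
            mean = [sum(col) / len(members) for col in zip(*members)]
            means.append(normalize(mean))
        new = [min(range(k), key=lambda c: cosine_dist(v, means[c]))
               for v in items]                          # 3-4. reassign by distance
        if new == assign:                               # 5. repeat until convergence
            break
        assign = new
    return assign

print(kmeans([[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]], 2))
```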
Experimental Setup

1. Generate features for datasets
2. Randomly assign initial clusters
3. Run K-Means
4. Compute purity and ARI
5. Repeat steps 2-4 20 times for mean and standard deviation
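Purity, one of the two measures in step 4, can be sketched as follows; ARI (omitted here) additionally corrects for chance agreement. The cluster ids and gold labels are toy values.

```python
from collections import Counter

# Cluster purity: fraction of items covered by each cluster's majority label.
def purity(clusters, labels):
    by_cluster = {}
    for c, y in zip(clusters, labels):
        by_cluster.setdefault(c, []).append(y)
    majority = sum(Counter(ms).most_common(1)[0][1] for ms in by_cluster.values())
    return majority / len(labels)

print(purity([0, 0, 1, 1], ["a", "a", "a", "b"]))  # -> 0.75
```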
Experimental Data

• 20 Newsgroups (mini)
• 2,000 messages from Usenet newsgroups
• 100 messages per topic
• Filter messages for body text
• Source: http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html

From sci.electronics: “A couple of years ago I put together a Tesla circuit which was published in an electronics magazine and could have been the circuit which is referred to here. This one used a flyback transformer from a tv onto which you wound your own primary windings...”
Results

Feature Set        Purity          ARI Scores
TF-IDF             0.379 ± 0.027   0.199 ± 0.023
Probase Only       0.265 ± 0.013   0.101 ± 0.010
Probase + TF-IDF   0.414 ± 0.034   0.241 ± 0.029
ESA Only           0.204 ± 0.010   0.040 ± 0.004
ESA + TF-IDF       0.389 ± 0.036   0.211 ± 0.032
LDA Only           N/A             N/A
LDA + TF-IDF       N/A             N/A
Results Comparison
• Song et al. Tweets Clustering
– Experiment #2: Subtle Cluster Distinctions
– Used Tweets about North America, Asia, Africa, and Europe
– Comparable performance for ESA and Probase
Conceptualization
• Hotho et al. WordNet Clustering
– Used Reuters dataset and Bisecting K-Means
– Found best results for combined TF-IDF and feature
sets
– Overall improvement from WordNet features was
comparable to Probase features (on the order of +10%)
Conclusions
• Semantic analysis features are marginally
beneficial in recommendation
• Structured data from limited vocabularies works
best for recommending “new items”
• Explicit and latent semantic analysis are
comparable in recommendation
• Knowledge bases generated at Web-scale may be
too noisy for narrow domain tasks
• Confirmed the efficacy of semantic analysis in
document clustering tasks
Future Directions
• Noise Reduction
– Tune the recommender platform for “concepts”
– Further explore parameter space for feature
generators
– Hybrid Conceptualization / Named Entity
Disambiguation?
• Domain-specific knowledge sources
– Comparison of Web-scale and domain-specific
resources as external knowledge (e.g., [Aljaber et al.,
2010])
Further Reading
• Short Text Conceptualization Using a Probabilistic
Knowledge Base [Song et al., 2011]
• Exploiting Wikipedia as External Knowledge for
Document Clustering [Hu et al., 2009]
• Hybrid Recommender Using WordNet “Bag of Synsets”
[Degemmis et al., 2007]
• Hybrid Recommender Using LDA [Jin et al., 2005]
• Feature Generation for Text Categorization Using World
Knowledge [Gabrilovich and Markovitch, 2005]
• WordNet Improves Text Document Clustering [Hotho et
al., 2003]
Acknowledgements

• David Stern, Ulrich Paquet, Jurgen Van Gael
• Haixun Wang, Yangqiu Song, Zhongyuan Wang
• Special thanks to Evelyne Viegas!
• Microsoft Research Connections
References

• [Gabrilovich et al., 2007] Evgeniy Gabrilovich and Shaul Markovitch. 2007. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI'07), Rajeev Sangal, Harish Mehta, and R. K. Bagga (Eds.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1606-1611.
• [Blei et al., 2003] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3 (March 2003), 993-1022.
• [Song et al., 2011] Yangqiu Song, Haixun Wang, Zhongyuan Wang, Hongsong Li, and Weizhu Chen. 2011. Short text conceptualization using a probabilistic knowledgebase. In IJCAI 2011.
• [Stern et al., 2009] David H. Stern, Ralf Herbrich, and Thore Graepel. 2009. Matchbox: large scale online Bayesian recommendations. In Proceedings of the 18th International Conference on World Wide Web (WWW '09). ACM, New York, NY, USA, 111-120.
• [HetRec '11] Ivan Cantador, Peter Brusilovsky, and Tsvi Kuflik. 2011. 2nd Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011). In Proceedings of the 5th ACM Conference on Recommender Systems. ACM, New York, NY, USA.
• [Degemmis et al., 2007] Marco Degemmis, Pasquale Lops, and Giovanni Semeraro. 2007. A content-collaborative recommender that exploits WordNet-based user profiles for neighborhood formation. User Modeling and User-Adapted Interaction, Vol. 17, Issue 3, 217-255.
References

• [Jin et al., 2005] Xin Jin, Yanzan Zhou, and Bamshad Mobasher. 2005. A maximum entropy web recommendation system: combining collaborative and content features. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (KDD '05). ACM, New York, NY, USA, 612-617.
• [Hu et al., 2009] Xiaohua Hu, Xiaodan Zhang, Caimei Lu, E. K. Park, and Xiaohua Zhou. 2009. Exploiting Wikipedia as external knowledge for document clustering. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09). ACM, New York, NY, USA, 389-396.
• [Gabrilovich and Markovitch, 2005] Evgeniy Gabrilovich and Shaul Markovitch. 2005. Feature generation for text categorization using world knowledge. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI'05).
• [Hotho et al., 2003] Andreas Hotho, Steffen Staab, and Gerd Stumme. 2003. WordNet improves text document clustering. In Proceedings of the SIGIR 2003 Semantic Web Workshop, 541-544.
• [Aljaber et al., 2010] Bader Aljaber, Nicola Stokes, James Bailey, and Jian Pei. 2010. Document clustering of scientific texts using citation contexts. Information Retrieval, Vol. 13, Issue 2, 101-131.
Questions?
• Thanks for attending
Appendix

A. Matchbox Details
B. Implementation Details
C. Probase Conceptualization Details
D. Explicit Semantic Analysis Details
E. Learnings from Probase
(Appendix A) Matchbox
• [Stern et al., 2009]
• MSR Cambridge recommendation platform
• Implements a hybrid recommender using
Infer.NET
– Uses combination of expectation propagation (EP)
and variational message passing
• Reduces user, item, and context features to
low dimensional trait space
(Appendix A) Matchbox Setup
• Matchbox settings
– Use 20 trait dimensions (determined experimentally)
– 10 iterations of EP algorithm
– Trained on approx. 90% of ratings
– Updated model with 75% of ratings per user
(in remaining 10%)
– MAE computed for remaining 25% per user
(Appendix B) Implementation

• ESA: https://github.com/faraday/wikiprep-esa
• LDA: Infer.NET
• Probase: Probase Package v. 0.18
• TF-IDF: http://www.codeproject.com/KB/cs/tfidf.aspx
• Matchbox: http://codebox/matchbox
(Appendix C) Probase Conceptualization

1. Identify all Probase terms in text
2. Use Noisy-or Model to combine:
– Concepts from t_l as attribute (z_l = 1)
– Concepts from t_l as entity/concept (z_l = 0)
(Appendix C) Probase
Conceptualization
3. Weight terms based on occurrence
a. Naïve Bayes (similar to Song et al., 2010)
• Compute P(c|t) for individual terms and use Naïve Bayes
model to derive concepts
• Penalizes false positives, does not reward true positives
• Generates very small probabilities for large numbers of terms
b. Weighted Sum (similar to Gabrilovich et al., 2007)
• Compute P(c|t) for individual terms and compute sum over
document for each concept
• Rewards true positives, does not penalize false positives
(accurate concepts and inaccurate concepts, resp.)
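The two weighting schemes can be sketched side by side, using a made-up P(c|t) lookup table in place of real Probase scores (priors and zero-probability smoothing are omitted).

```python
import math
from collections import defaultdict

# Hypothetical P(c|t) scores standing in for real Probase output.
p_c_given_t = {
    "apple":     {"company": 0.6, "fruit": 0.4},
    "microsoft": {"company": 0.9, "software vendor": 0.1},
}

def weighted_sum(terms):
    # Sum P(c|t) over the document: rewards accurate concepts,
    # never penalizes inaccurate ones.
    scores = defaultdict(float)
    for t in terms:
        for c, p in p_c_given_t.get(t, {}).items():
            scores[c] += p
    return dict(scores)

def naive_bayes(terms):
    # Multiply P(c|t) in log space: a single low-probability term
    # drags the concept down, penalizing false positives.
    logs = defaultdict(float)
    for t in terms:
        for c, p in p_c_given_t.get(t, {}).items():
            logs[c] += math.log(p)
    return {c: math.exp(s) for c, s in logs.items()}

print(weighted_sum(["apple", "microsoft"]))  # company scores 0.6 + 0.9 = 1.5
```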
(Appendix C) Probase
Conceptualization
4. Penalize frequent concepts
– Stop words (concepts) are domain-independent
– For films, many domain-specific stop concepts
• E.g., “movie”, “character”, “actor”, etc.
– Inverse Document Frequency on concepts
penalizes those that are too frequent
– Also rewards those that are too infrequent (in only
one document)
– Solution: Filter for minimum and maximum
occurrence
(Appendix C) Probase Conceptualization

• Using Summation (similar to Wikipedia ESA)
• Using Naïve Bayes from Song et al. approach
– P(c|T) ∝ P(T|c)P(c)/P(T) ∝ ∏_{l=1}^{L} P(c|t_l) / P(c)^{L-1}
• Inverse Document Frequency for concepts
– IDF(c_k) = log(# of documents / document frequency of c_k)
– Minimum occurrence = 2
– Maximum occurrence = 0.5 × # of documents
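The IDF weighting and min/max occurrence filter can be sketched as follows, assuming the corpus is given as per-document concept sets; the documents are toy data.

```python
import math

# Concept-IDF filtering: drop concepts seen in fewer than 2 documents or in
# more than half of them, then weight survivors by inverse document frequency.
def filter_and_weight(doc_concepts):
    n_docs = len(doc_concepts)
    df = {}
    for concepts in doc_concepts:
        for c in concepts:
            df[c] = df.get(c, 0) + 1
    kept = {c for c, f in df.items() if 2 <= f <= 0.5 * n_docs}
    return {c: math.log(n_docs / df[c]) for c in kept}

docs = [{"movie", "toy"}, {"movie", "space ranger"}, {"movie", "toy"}, {"movie"}]
print(filter_and_weight(docs))  # "movie" is too frequent, "space ranger" too rare
```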
(Appendix D) Explicit Semantic
Analysis
• Gabrilovich et al., 2007
• Builds inverted index of Wikipedia content
• Input text converted to weight vector of
concepts based on TF-IDF
• score(c_j) = Σ_{w_i ∈ T} v_i · k_j
– v_i: TF-IDF weight of w_i
– k_j: weight of concept c_j for w_i
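The scoring formula can be sketched with a toy inverted index; the concept names and weights below are made up for illustration, standing in for a real index of Wikipedia content.

```python
from collections import defaultdict

# Hypothetical inverted index: word -> [(concept, k_j), ...].
inverted_index = {
    "toy":   [("Toy Story", 0.8), ("Toy", 0.5)],
    "space": [("Outer space", 0.9), ("Toy Story", 0.2)],
}

def esa_vector(tfidf_weights):
    # Each word w_i contributes v_i * k_j to every concept it indexes.
    scores = defaultdict(float)
    for word, v_i in tfidf_weights.items():
        for concept, k_j in inverted_index.get(word, []):
            scores[concept] += v_i * k_j
    return dict(scores)

print(esa_vector({"toy": 1.0, "space": 0.5}))
```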
(Appendix E) Learnings from Probase
• Conceptualization works wonders for small
numbers of entities
• Would be extremely useful in a large-scale QA
environment with many semantic analysis and
ML algorithms (e.g., Watson)
• A noisy source of knowledge is best suited to
noise-tolerant IR applications
• Still being developed and improving!
– Working on recognizing verbs