Entity Set Expansion - UIC - Computer Science
Entity Set Expansion in Opinion Documents
Lei Zhang
Bing Liu
University of Illinois at Chicago
Introduction to opinion mining
Opinion Mining: the computational study of opinions and sentiments expressed in text.
Why opinion mining now? Mainly because of the Web, we can now obtain huge volumes of opinionated text.
Why opinion mining is important
Whenever we need to make a decision, we would like to hear others' advice.
In the past:
Individual: friends or family.
Business: surveys and consultants.
Word of mouth on the Web
People can express their opinions in reviews, forum
discussions, blogs…
An example of an opinion mining application
What is an entity in opinion documents
An entity can be a product, service, person,
organization or event in opinion document.
Basically, opinion mining is to extract opinions
expressed on entities and their attributes.
“I bought a Sony camera yesterday, and its
picture quality is great.”
picture quality is the product attribute; Sony is
the entity.
Why we need entity extraction
Without knowing the entity, the piece of opinion
has little value.
Companies want to know the competitors in the
market. This is the first step to understand the
competitive landscape from opinion documents.
Related work
Named entity recognition (NER)
Aims to identify entities such as names of persons,
organizations and locations in natural language text.
Our problem is similar to NER problem, but with
some differences.
1. Fine-grained entity classes (products, services) rather than coarse-grained entity classes (people, locations, organizations).
2. Only a specific type is wanted: e.g., a particular type of drug names.
3. Neologisms: e.g., "Sammy" (Sony), "SE" (Sony-Ericsson).
4. Feature sparseness (lack of contextual patterns).
5. Data noise (over-capitalization, under-capitalization).
NER methods
Supervised learning methods
The current dominant technique for addressing the
NER problem
Hidden Markov Models (HMM)
Maximum Entropy Models (ME)
Support Vector Machines (SVM)
Conditional Random Field (CRF)
Shortcomings:
Rely on large sets of labeled examples. Labeling is
labor-intensive and time-consuming.
NER methods
Unsupervised learning methods
Mainly clustering: gathering named entities from
clustered groups based on the similarity of
context. The techniques rely on lexical resources
(e.g., WordNet), on lexical patterns, and on
statistics computed on a large unannotated
corpus.
Shortcomings:
Low precision and recall in the results.
NER methods
Semi-supervised learning methods
Show promise for identifying and labeling entities.
Starting with a set of seed entities, semi-supervised
methods use either class specific patterns to populate
an entity class or distributional similarity to find
terms similar to the seeds.
Specific methods:
Bootstrapping
Co-training
Distributional similarity
Our problem is set expansion problem
To find competing entities, the extracted entities
must be relevant, i.e., they must be of the same
class/type as the user-provided entities.
The user can only provide a few names because there
are so many different brands and models.
Our problem is actually a set expansion problem,
which expands a set of given seed entities.
Set expansion problem
Given a set Q of seed entities of a particular class C,
and a set D of candidate entities, we wish to
determine which of the entities in D belong to C.
That is, we “grow” the class C based on the set of
seed examples Q.
This is a classification problem. However, in
practice, the problem is often solved as a ranking
problem.
Distributional similarity
Distributional similarity is a classical method for the set
expansion problem.
(Basic idea: words with similar meanings tend
to appear in similar contexts.)
It compares the word distribution of the surrounding
words of a candidate entity with that of the
seed entities, and then ranks the candidate
entities based on the similarity values.
Our results show this approach is inaccurate.
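The distributional-similarity baseline can be sketched as follows. This is a minimal illustration, not the authors' implementation: it builds a context-word count vector for each term from a tokenized corpus and ranks candidates against a seed by cosine similarity.

```python
from collections import Counter
from math import sqrt

def context_vector(corpus, term, window=2):
    """Count the words appearing within `window` tokens of each
    occurrence of `term`; `corpus` is a list of token lists."""
    counts = Counter()
    for tokens in corpus:
        for i, tok in enumerate(tokens):
            if tok == term:
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                counts.update(tokens[lo:i] + tokens[i + 1:hi])
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

corpus = [
    "i bought the sony camera yesterday".split(),
    "i bought the canon camera last week".split(),
    "the weather was nice yesterday".split(),
]
# Rank the candidate "canon" against the seed "sony" by context overlap.
print(cosine(context_vector(corpus, "sony"), context_vector(corpus, "canon")))
```

Here "canon" scores higher than an unrelated word like "weather" because its contexts overlap with those of "sony"; the slides' point is that on real opinion corpora this signal alone is too noisy.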
Bayesian sets
Based on Bayesian inference and was designed
specifically for the set expansion problem.
It learns from a seeds set (i.e., a positive set P) and
an unlabeled candidate set U.
Bayesian sets
Given a set D of items e, and a query Q ⊆ D, we aim to rank the
elements of D by how well they would "fit into" a
set which includes Q (the query).
Define a score for each e ∈ D:
score(e) = p(e | Q) / p(e)
From Bayes' rule, the score can be re-written as:
score(e) = p(e, Q) / (p(e) p(Q))
Bayesian sets
Intuitively, the score compares the probability
that e and Q were generated by the same model
with the same unknown parameters θ, to the
probability that e and Q came from models with
different parameters θ and θ’.
Bayesian sets
Compute following equations:
Bayesian sets
The final score can be computed in closed form,
where
α and β are hyperparameters obtained from the data.
The top-ranked entities should be highly related to the seed
set Q according to the Bayesian Sets algorithm.
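The scoring step can be sketched as follows, assuming binary feature vectors. This follows the published Bayesian Sets formulation (independent Bernoulli features with Beta priors, hyperparameters set from the data means), not the authors' own code; the toy matrix and kappa value are illustrative.

```python
import numpy as np

def bayesian_sets_scores(X, seed_idx, kappa=2.0):
    """Score every row of binary matrix X (items x features) by how well
    it fits the set of seed rows.  With alpha_j = kappa * m_j and
    beta_j = kappa * (1 - m_j), where m_j is the mean of feature j over
    all items, the log score is linear in x:
        log score(x) = c + sum_j w_j * x_j
    """
    X = np.asarray(X, dtype=float)
    N = len(seed_idx)
    m = np.clip(X.mean(axis=0), 1e-6, 1 - 1e-6)  # avoid degenerate priors
    alpha, beta = kappa * m, kappa * (1 - m)
    s = X[seed_idx].sum(axis=0)                  # per-feature seed counts
    alpha_t, beta_t = alpha + s, beta + N - s
    c = (np.log(alpha + beta) - np.log(alpha + beta + N)
         + np.log(beta_t) - np.log(beta))
    w = (np.log(alpha_t) - np.log(alpha)
         - np.log(beta_t) + np.log(beta))
    return c.sum() + X @ w                       # unnormalized log scores

# Toy example: items 0-2 share features with the seeds, item 3 does not.
X = [[1, 1, 0, 0],
     [1, 0, 1, 0],
     [1, 1, 1, 0],
     [0, 0, 0, 1]]
scores = bayesian_sets_scores(X, seed_idx=[0, 1])
```

Items sharing features with the seeds receive high scores, while the outlier (item 3) ranks last, which is exactly the ranking behavior the slides rely on.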
Direct application of Bayesian sets
For seeds and candidate entities, the feature vector is
created as follows:
(1) A set of features is first designed to represent each entity.
(2) For each entity, identify all the sentences in the corpus
that contain the entity. Based on the context, produce a single
feature vector to represent the entity.
But it produces poor result.
(Reasons: first, Bayesian Sets uses binary features, so multiple
occurrences of an entity in the corpus, which give rich
contextual information, are not fully exploited; second, the
number of seeds is very small, so the result is not reliable.)
Improving Bayesian sets
We propose a more sophisticated method for using Bayesian
Sets. It consists of the following two steps.
(1) Feature identification:
A set of features to represent each entity is designed.
(2) Data generation:
(a) Multiple feature vectors for an entity
(b) Feature reweighting
(c) Candidate entity ranking
How to get candidate entities
Part of Speech (POS) tags – NNP, NNPS and CD as
entity indicators;
A phrase (which could be one word) with a sequence of
NNP, NNPS and CD POS tags is treated as one candidate
entity (e.g., “Nokia/NNP N97/CD” as a single entity
“Nokia N97”).
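The candidate-extraction rule above can be sketched directly over POS-tagged tokens. A minimal version, assuming the input is already tagged (e.g., by any POS tagger producing Penn Treebank tags):

```python
def extract_candidates(tagged):
    """Group maximal runs of NNP/NNPS/CD-tagged tokens into candidate
    entities, as the slides describe (e.g. Nokia/NNP N97/CD -> "Nokia N97").
    `tagged` is a list of (word, pos_tag) pairs for one sentence."""
    entity_tags = {"NNP", "NNPS", "CD"}
    candidates, run = [], []
    for word, tag in tagged:
        if tag in entity_tags:
            run.append(word)
        elif run:
            candidates.append(" ".join(run))
            run = []
    if run:
        candidates.append(" ".join(run))
    return candidates

tagged = [("I", "PRP"), ("bought", "VBD"), ("the", "DT"),
          ("Nokia", "NNP"), ("N97", "CD"), ("yesterday", "NN")]
print(extract_candidates(tagged))  # → ['Nokia N97']
```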
How to identity features
Like a typical learning algorithm, one has to design a
set of features for learning. Our feature set consists of
two subsets:
Entity word features (EWF): characterize the words
representing the entities themselves. This set of features is
completely domain independent. (e.g., "Sony", "IBM")
Surrounding word features (SWF): the surrounding
words of a candidate entity. (e.g. “ I bought the Sony
tv yesterday”)
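The two feature subsets can be illustrated for one occurrence of an entity in a sentence. This is a sketch of one plausible encoding; the "EWF:"/"SWF:" prefixes and the window size are our illustrative choices, not from the paper.

```python
def entity_features(sentence_tokens, entity_tokens, window=3):
    """Build one binary feature set for one occurrence of an entity,
    combining entity word features (EWF, the entity's own tokens) and
    surrounding word features (SWF, tokens within `window` positions
    of the entity)."""
    n = len(entity_tokens)
    for i in range(len(sentence_tokens) - n + 1):
        if sentence_tokens[i:i + n] == entity_tokens:
            lo = max(0, i - window)
            hi = min(len(sentence_tokens), i + n + window)
            swf = sentence_tokens[lo:i] + sentence_tokens[i + n:hi]
            return ({"EWF:" + w for w in entity_tokens}
                    | {"SWF:" + w for w in swf})
    return set()

tokens = "I bought the Sony tv yesterday".split()
print(sorted(entity_features(tokens, ["Sony"])))
```

Each occurrence yields its own feature set, which is what makes the multiple-feature-vector handling in the next slides necessary.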
Data generation
Because for each candidate entity, there are several
feature vector is generated. It causes the problem of
feature sparseness and entity ranking.
We proposed two techniques to deal with the problems:
(1) Feature reweighting
(2) Candidate entity ranking
Feature reweighting
Recall the score for an entity from Bayesian Sets.
N is the number of items in the seed set; qij is feature j of seed
entity qi;
mj is the mean of feature j over all possible entities; k is a scaling
factor (we use 1).
In order to make a positive contribution to the final score of
entity e, wj must be greater than zero.
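The weight equation itself did not survive transcription. A reconstruction consistent with the slide's definitions and with the Bayesian Sets prior (with hyperparameters α_j = k·m_j and β_j = k·(1 − m_j)) would be:

```latex
w_j \;=\; \log\!\left(1 + \frac{\sum_{i=1}^{N} q_{ij}}{k\, m_j}\right)
     \;-\; \log\!\left(1 + \frac{N - \sum_{i=1}^{N} q_{ij}}{k\,(1 - m_j)}\right)
```

Under this form, w_j > 0 exactly when the seed data mean (1/N)·Σᵢ qᵢⱼ exceeds m_j, which matches the statement on the next slide.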
Feature reweighting
It means if feature j is effective (wj > 0), the seed data
mean must be greater than the candidate data mean
on feature j.
Due to the idiosyncrasy of the data, there are many
high-quality features whose seed data mean may even
be less than the candidate data mean on feature j.
Feature reweighting
For example, in the drug data set, "prescribe" is a
very good entity feature: "prescribe EN/NNP"
(EN represents an entity; NNP is its POS tag)
strongly suggests that EN is a drug. However,
the problem is that the mean of this feature in
the seed set is 0.024, which is less than its
candidate set mean of 0.025.
This means it is worse than no feature at all!
Feature reweighting
In order to fully utilize all features, we change the original
mj to m′j by multiplying it by a scaling factor, to force all feature
weights wj > 0.
The idea is that we lower the candidate data mean
intentionally so that all the features found from the
seed data can be utilized.
We let Σi qij / m′j be greater than N for all features j
to determine t.
Identifying high-quality Features
If features A and B have the same feature frequency, they get the same feature
weight.
In some cases:
for feature A, all the feature counts may come from only one entity in
the seed set; for feature B, the feature counts come from different
entities (e.g., "bought" + "Nokia", "Motorola", "SE").
Then feature B is a better feature than feature A, because it is
shared by, or associated with, more entities.
Identifying high-quality Features
r is used to represent the feature quality of feature j; h is
the number of unique entities that have the jth feature; T is
the total number of entities in the seed set.
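The quality equation itself was not transcribed. A form consistent with these definitions (our reading, not confirmed by the slide) is simply the fraction of seed entities sharing the feature:

```latex
r_j \;=\; \frac{h_j}{T}, \qquad 0 < r_j \le 1
```

One natural use is to scale each feature weight by r_j, so that features shared by many seed entities (feature B above) outweigh features concentrated in a single entity (feature A).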
Candidate entity ranking
Each unique candidate entity may generate multiple
feature vectors
It is highly desirable to rank those correct and
frequent entities at the top
Ranking candidate entities
Md is the median of all feature vector scores of a candidate
entity d; n is the candidate entity's frequency;
fs(d) is the final score for the candidate entity.
Additional technique: enlarge the seed sets
Enlarging the seed set using some high precision
syntactic coordination patterns.
EN [or | and] EN
from EN to EN
neither EN nor EN
prefer EN to EN
[such as | especially | including] EN (, EN)* [or | and]
EN
EN is the entity name.
e.g., “Nokia and Samsung do not produce smart phones.”
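The first coordination pattern can be sketched as a regular expression over raw text. A minimal version, not the authors' implementation: the capitalized-word test standing in for "entity name" is a simplifying assumption.

```python
import re

def expand_seeds(sentence, seeds):
    """Apply the 'EN [or|and] EN' coordination pattern from the slides:
    if a known seed name is coordinated with another capitalized word,
    treat that word as a new seed candidate."""
    found = set()
    for seed in seeds:
        for pat in (rf"\b{re.escape(seed)}\s+(?:or|and)\s+([A-Z][\w-]*)",
                    rf"\b([A-Z][\w-]*)\s+(?:or|and)\s+{re.escape(seed)}\b"):
            found.update(re.findall(pat, sentence))
    return found - set(seeds)

print(expand_seeds("Nokia and Samsung do not produce smart phones,", ["Nokia"]))
# → {'Samsung'}
```

The other patterns ("from EN to EN", "neither EN nor EN", etc.) follow the same template; because they are high-precision, matches can be added to the seed set directly.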
Additional technique: bootstrapping Bayesian set
This strategy again tries to find more seeds, but
using Bayesian Sets itself: run the algorithm
iteratively.
At the end of each iteration, pick the top k ranked
entities (k = 5 in our experiments).
The iteration ends if no new entity is added to the
current seed list.
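The bootstrapping loop above can be sketched as follows. The `score_fn` argument is a stand-in for a full Bayesian Sets ranking, and the positive-score admission threshold is our simplifying assumption for how "no new entity is added" comes about.

```python
def bootstrap(seeds, candidates, score_fn, k=5, min_score=0.0, max_iters=10):
    """Iterative seed expansion: rank candidates against the current
    seed list with `score_fn`, add the top k whose score clears the
    threshold, and stop when no new entity is added."""
    seeds = list(seeds)
    for _ in range(max_iters):
        scored = [(score_fn(seeds, c), c) for c in candidates if c not in seeds]
        new = [c for s, c in sorted(scored, key=lambda p: -p[0])
               if s > min_score][:k]
        if not new:
            break
        seeds.extend(new)
    return seeds

# Toy scorer: a candidate "fits" if it shares a first letter with any seed.
score = lambda seeds, c: 1.0 if any(c[0] == s[0] for s in seeds) else -1.0
print(bootstrap(["sony"], ["samsung", "sharp", "dell"], score, k=5))
```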
The whole algorithm
Experiment result
Similar web-based systems:
Google Sets
Boo!Wa!
Experiment result
Thank you