Transcript Slides

Mining Subjective Properties
from the Web
Authors: Immanuel Trummer, Alon Halevy,
Hongrae Lee, Sunita Sarawagi, Rahul Gupta
Presenting: Amir Taubenfeld
Outline for Today’s Lecture
• Motivation: the future of search is in structured data
• Introduction to the Surveyor system
• Getting into the details:
– Extracting subjective properties from the Web and
the polarity of statements
– Determining the dominant opinion of the authors
of the Web
• Experimental evaluation & conclusion
Answering queries with links
Answering queries with links – PageRank
PageRank is the algorithm with which Google
began. It calculates the probability that a person
randomly clicking on links will arrive at any
particular page.
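A minimal sketch of the random-surfer idea, assuming the usual damping factor of 0.85 (not stated on the slide) and a hypothetical four-page toy web. The names `pagerank` and `toy_web` are illustrative only.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping page -> list of outgoing links.

    Every link target must also appear as a key of `links`.
    """
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # With probability (1 - damping) the surfer jumps to a random page.
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            # A page without links spreads its rank over all pages.
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: page "c" is linked from everywhere.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
```

The ranks sum to 1, so each value can be read as the probability that the random surfer is on that page.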
Answering queries with links
PageRank is awesome!
It shows us the best links for the words we are
looking for, but it does not understand our
queries.
Now: Answering queries from Structured Data
We all remember YAGO from the Databases course
Now: Answering queries from Structured Data
But regular knowledge bases don’t capture
subjective properties
Current limitation: Objective Queries
Need Subjective knowledge base
Subjective Property Mining
Objective: Create a subjective knowledge base.
Main challenge:
No ground truth: we need to aggregate many
opinions.
The Surveyor System
Specialized system for mining subjective
properties from the Web
(The name Surveyor is derived from "survey")
System overview - Course of action
• Extract statements involving entities and
subjective properties from the Web
• Determine the polarity of each statement
• Aggregate the results
• Determine the dominant opinion
System overview – Extraction & aggregation
System overview – dominant opinion
Is this enough to conclude that kittens
are cute and tigers are not cute?
Of course not, otherwise I wouldn’t
have asked this question. But why?
System overview – dominant opinion
(Tel-Aviv, Safe city) → pro: 5, contra: 10
Does this mean that Tel-Aviv is not safe?
No! We must consider skew.
System overview – dominant opinion
Can we use it to our advantage?
(Ibtin, Big city) → pro: 0, contra: 0
Big cities tend to be mentioned more
often on the Web than small cities
System overview – dominant opinion
Example: Taking skew and correlation
into account improves the model.
System overview – dominant opinion
Conclusion I:
Skew and correlation exist, and we can use
them to our advantage.
Conclusion II:
Skew and correlation are property- and type-
specific.
System overview – Putting it together
Getting into the details of Surveyor
The Surveyor system can be divided into two
main algorithms:
• Extracting evidence about entities from the
Web. Evidence = a statement connecting an entity to a
property
• Aggregating the evidence from the previous step
and determining the dominant opinion
Extracting evidences – problem definition
Input:
• Collection of annotated web documents
• Knowledge base containing entities and their types.
Output: Set of tuples $\langle E, P, C^+, C^- \rangle$ where
• $E$ = entity.
• $P$ = property.
• $C^+$ = # positive evidences.
• $C^-$ = # negative evidences.
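The output format can be sketched as follows, assuming the extraction step has already produced (entity, property, polarity) statements. The `Evidence` type, the `aggregate` helper and the sample statements are hypothetical illustrations, not part of Surveyor.

```python
from collections import defaultdict
from typing import Iterable, List, NamedTuple, Tuple


class Evidence(NamedTuple):
    entity: str
    prop: str
    positive: int  # C+: number of positive statements
    negative: int  # C-: number of negative statements


def aggregate(statements: Iterable[Tuple[str, str, bool]]) -> List[Evidence]:
    """Count positive and negative statements per (entity, property) pair."""
    counts = defaultdict(lambda: [0, 0])
    for entity, prop, is_positive in statements:
        counts[(entity, prop)][0 if is_positive else 1] += 1
    return [Evidence(e, p, pos, neg)
            for (e, p), (pos, neg) in sorted(counts.items())]


# Hypothetical extracted statements: (entity, property, is_positive).
statements = [
    ("kitten", "cute", True),
    ("kitten", "cute", True),
    ("tiger", "cute", False),
    ("kitten", "cute", False),
]
evidence = aggregate(statements)
```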
Extracting evidences - annotating documents
The Surveyor system receives as input a web
snapshot that was preprocessed with NLP tools
such as the Stanford Parser.
The output of such parsers is a dependency
tree that represents the grammatical structure of the
sentence.
60 seconds on NLP
Natural language processing is a very interesting
story. Unfortunately it is also a very long story, thus
we will only discuss it in a nutshell
60 seconds on NLP
Natural language processing refers to the ability of
computers to process text in natural human language
such as Hebrew, rather than artificial language such
as Java
60 seconds on NLP
In order to do that we need to parse natural language
text into a more formal representation. Usually this
representation is a tree.
60 seconds on NLP
One basic model uses a probabilistic CFG in
order to create a parse tree that represents
the formal structure of a given sentence
60 seconds on NLP
Each derivation rule has a different probability,
and the goal is to find the parse tree with the
highest total probability.
The naïve algorithm has an exponential running
time, but using DP we can get polynomial
complexity
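The dynamic-programming idea can be sketched with a CKY-style Viterbi parser. This assumes a grammar in Chomsky normal form; the toy rules and probabilities below are made up for illustration and are unrelated to the Stanford Parser.

```python
from collections import defaultdict

# Hypothetical toy PCFG in Chomsky normal form (probabilities are invented).
BINARY = {("NP", "VP"): [("S", 1.0)], ("V", "ADJ"): [("VP", 1.0)]}
LEXICAL = {
    "kittens": [("NP", 0.5)],
    "tigers": [("NP", 0.5)],
    "are": [("V", 1.0)],
    "cute": [("ADJ", 1.0)],
}


def best_parse(words):
    """Return (tree, probability) of the most probable parse rooted at S."""
    n = len(words)
    # chart[(i, j)][A] = (best probability, backpointer) for A over words[i:j]
    chart = defaultdict(dict)
    for i, word in enumerate(words):
        for symbol, prob in LEXICAL.get(word, []):
            chart[(i, i + 1)][symbol] = (prob, word)
    for length in range(2, n + 1):            # span length
        for i in range(n - length + 1):       # span start
            j = i + length
            for k in range(i + 1, j):         # split point
                for left, (lp, _) in chart[(i, k)].items():
                    for right, (rp, _) in chart[(k, j)].items():
                        for parent, rule_p in BINARY.get((left, right), []):
                            prob = rule_p * lp * rp
                            if prob > chart[(i, j)].get(parent, (0.0, None))[0]:
                                chart[(i, j)][parent] = (prob, (k, left, right))

    def build(i, j, symbol):
        _, back = chart[(i, j)][symbol]
        if isinstance(back, str):             # lexical rule: leaf
            return (symbol, back)
        k, left, right = back
        return (symbol, build(i, k, left), build(k, j, right))

    if "S" not in chart[(0, n)]:
        return None, 0.0
    return build(0, n, "S"), chart[(0, n)]["S"][0]


tree, prob = best_parse("kittens are cute".split())
```

The three nested loops over span length, start and split point give a running time cubic in the sentence length, instead of enumerating every possible tree.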
Back to the paper
Extracting evidences - matching patterns
Red = Tokens that together form the property.
Green = The entities.
Extracting evidences – filtering
New York is bad for parking
Southern France is warm
Greece is a southern country
Extracting evidences – filtering
New York is bad for parking
Solution: Check for subtrees that can represent
restrictions.
Southern France is warm
Greece is a southern country
Solution: Don’t allow co-reference to the same
entity.
Extracting evidences – determine polarity
Estimating the dominant opinion - problem definition
Input:
• Knowledge base that links types to entities.
• Set of tuples $\langle E, P, C^+, C^- \rangle$ (from stage 1) where
o $E$ = entity
o $P$ = property
o $C^+$ = # positive evidences
o $C^-$ = # negative evidences
Output:
The dominant opinion on whether P applies to E
Estimating the dominant opinion
As we saw previously, estimating the dominant
opinion by simple majority vote
does not work very well.
We must take different types of bias
into consideration.
Estimating the dominant opinion
Each property-type combination is associated with two
probability distributions over the statement counters.
The first distribution represents the probability of an
evidence tuple given that the dominant opinion applies
the property to the entity, whereas the second represents
it given that the dominant opinion does not apply the property.
Estimating the dominant opinion
Therefore, we assume that each evidence tuple was
drawn from one of the two possible probability
distributions.
If we know how to express those two probability
distributions, then we can calculate for each
evidence tuple the probability with which it was
drawn from one distribution or the other
Estimating the dominant opinion
Modeling user behavior
In order to model the probability of receiving a certain
number of positive or negative statements, we must
model the probability that a single user decides to issue
a positive or negative statement.
Modeling user behavior
Now, let's write our model as a Bayesian network
But first, what is a Bayesian network?
Bayesian networks - definition
From Wikipedia: A model that represents a set of
random variables and their conditional
dependencies via a directed acyclic graph
[Figure: Bayesian network with the nodes Sprinkler, Rain and Grass wet]
Bayesian networks - example
If we know that the grass is wet, we can calculate
the probability that it was raining.
$$P(R=T \mid G=T) = \frac{P(G=T, R=T)}{P(G=T)} = \frac{\sum_{S \in \{T,F\}} P(G=T, S, R=T)}{\sum_{S,R \in \{T,F\}} P(G=T, S, R)}$$
Bayesian networks - example
We calculate each element in the sum from the tables:
$$P(G=T, S=T, R=T) = P(G=T \mid S=T, R=T)\,P(S=T, R=T)$$
$$= P(G=T \mid S=T, R=T)\,P(S=T \mid R=T)\,P(R=T)$$
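A small sketch of this calculation. The probability tables from the slide figure did not survive in the transcript, so the values below are an assumption: they are the ones commonly used with this sprinkler example.

```python
from itertools import product

# Assumed conditional probability tables (not recoverable from the transcript).
P_RAIN = {True: 0.2, False: 0.8}
# P(S | R): outer key is R, inner key is S.
P_SPRINKLER = {True: {True: 0.01, False: 0.99},
               False: {True: 0.4, False: 0.6}}
# P(G = T | S, R): key is (S, R).
P_GRASS_WET = {(True, True): 0.99, (True, False): 0.9,
               (False, True): 0.8, (False, False): 0.0}


def joint(grass, sprinkler, rain):
    """P(G, S, R) = P(G | S, R) * P(S | R) * P(R)."""
    p_wet = P_GRASS_WET[(sprinkler, rain)]
    p_grass = p_wet if grass else 1.0 - p_wet
    return p_grass * P_SPRINKLER[rain][sprinkler] * P_RAIN[rain]


# Numerator: sum over S with R = T fixed. Denominator: sum over S and R.
numerator = sum(joint(True, s, True) for s in (True, False))
denominator = sum(joint(True, s, r)
                  for s, r in product((True, False), repeat=2))
p_rain_given_wet = numerator / denominator
```

With these assumed tables, `p_rain_given_wet` comes out at roughly 0.36.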
Modeling user behavior
Now we are ready to write our model as a
Bayesian Network
𝐷𝑜𝑒𝑠 < 𝑃𝑟𝑜𝑝𝑒𝑟𝑡𝑦 > 𝑎𝑝𝑝𝑙𝑦 𝑡𝑜 < 𝐸𝑛𝑡𝑖𝑡𝑦 >?
[Figure: the Bayesian network, built up layer by layer. Dominant opinion (to infer) → User 1 opinion … User n opinion → User 1 posts opinion? … User n posts opinion? → # Pro statements, # Contra statements (given)]
Modeling user behavior
Let's write it with random variables for entity $i$
[Figure: $D_i$ → $O_{i1}, \dots, O_{in}$ → $S_{i1}, \dots, S_{in}$ → $C_i^+, C_i^-$]
$D_i$ = Dominant opinion
$S_{iw}$ = User makes Statement
$O_{iw}$ = User’s Opinion
$C_i$ = Count
Modeling user behavior
Probability for entity $i$ and user $w$
[Figure: $D_i$ → $O_{iw}$ → $S_{iw}$ → $C_i^+, C_i^-$]
Modeling user behavior
Our goal is to compute $\Pr(D_i \mid C_i^+, C_i^-)$,
which by Bayes' rule is
$$\Pr(D_i \mid C_i^+, C_i^-) = \frac{\Pr(C_i^+, C_i^- \mid D_i)\,\Pr(D_i)}{\Pr(C_i^+, C_i^-)}$$
Because $C_i^+, C_i^-$ is a deterministic function of the $S_{iw}$,
we first solve for $\Pr(S_{iw} \mid D_i)$
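Modeling user behavior
$$\Pr(S_{iw} \mid D_i) = \sum_{O_{iw} \in \{+,-\}} \Pr(S_{iw}, O_{iw} \mid D_i) = \sum_{O_{iw} \in \{+,-\}} \Pr(S_{iw} \mid O_{iw})\,\Pr(O_{iw} \mid D_i)$$
And from the Bayesian network, we obtain
$$\Pr(S_{iw} = + \mid D_i = +) = p_A\,p_S^+$$
$$\Pr(S_{iw} = - \mid D_i = +) = (1 - p_A)\,p_S^-$$
$$\Pr(S_{iw} = + \mid D_i = -) = (1 - p_A)\,p_S^+$$
$$\Pr(S_{iw} = - \mid D_i = -) = p_A\,p_S^-$$
Here $p_A$ is the probability that a user agrees with the dominant opinion, and $p_S^+$ ($p_S^-$) is the probability that a user holding a positive (negative) opinion posts a statement.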
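The marginalization over the user's opinion can be sketched in a few lines. The function name `statement_probs` and the numeric parameters are hypothetical, and the sketch assumes that a user only ever posts a statement matching their own opinion.

```python
def statement_probs(p_agree, p_post_pos, p_post_neg):
    """Return {(statement, dominant): Pr(S = statement | D = dominant)}."""
    # Pr(O = o | D = d): the user agrees with the dominant opinion w.p. p_agree.
    p_opinion = {
        ("+", "+"): p_agree, ("-", "+"): 1.0 - p_agree,
        ("+", "-"): 1.0 - p_agree, ("-", "-"): p_agree,
    }
    p_post = {"+": p_post_pos, "-": p_post_neg}
    probs = {}
    for s in "+-":
        for d in "+-":
            # Sum over the hidden opinion o; Pr(S = s | O = o) is zero unless
            # s == o (assumption: users post only their own opinion).
            probs[(s, d)] = sum(
                (p_post[s] if s == o else 0.0) * p_opinion[(o, d)]
                for o in "+-"
            )
    return probs


# Hypothetical parameter values, for illustration only.
probs = statement_probs(p_agree=0.9, p_post_pos=0.01, p_post_neg=0.002)
```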
Modeling user behavior
The variables $C_i^+, C_i^-$ are obtained by summing up $n$
variables $S_{iw}$, each of which can be +, −, or neutral.
We assume that the variables $S_{iw}$ are independent for
different $w$, since the chances that two randomly
selected documents on the Web are authored by the
same person are negligible.
This implies that $C_i^+, C_i^-$ follows a multinomial
distribution where
$$\Pr(C_i^+ = a, C_i^- = b \mid D_i = x) = \frac{n!}{a!\,b!\,(n-a-b)!}\,(p_x^+)^a\,(p_x^-)^b\,(1 - p_x^+ - p_x^-)^{n-a-b}$$
where $p_x^s = \Pr(S_{iw} = s \mid D_i = x)$ for $s \in \{+,-\}$
Modeling user behavior
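Assuming that $n$ is very large compared to $C_i^+, C_i^-$,
we can approximate the distribution as a product of
two Poisson distributions
$$\Pr(C_i^+ = a, C_i^- = b \mid D_i = +) = \mathrm{Pois}(a; \lambda_+^+)\,\mathrm{Pois}(b; \lambda_+^-)$$
$$\Pr(C_i^+ = a, C_i^- = b \mid D_i = -) = \mathrm{Pois}(a; \lambda_-^+)\,\mathrm{Pois}(b; \lambda_-^-)$$
where the subscript of $\lambda$ is the dominant opinion, the superscript is the polarity of the statements, and
$$\lambda_+^+ = n\,p_A\,p_S^+$$
$$\lambda_-^- = n\,p_A\,p_S^-$$
$$\lambda_-^+ = n\,(1 - p_A)\,p_S^+$$
$$\lambda_+^- = n\,(1 - p_A)\,p_S^-$$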
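A quick numerical check of the approximation, as a sketch. The population size `n` and the per-user statement probabilities below are hypothetical values chosen so that the expected counts are small compared to `n`.

```python
import math


def multinomial_pmf(a, b, n, p_pos, p_neg):
    """Exact Pr(C+ = a, C- = b) for n independent users."""
    log_coeff = (math.lgamma(n + 1) - math.lgamma(a + 1)
                 - math.lgamma(b + 1) - math.lgamma(n - a - b + 1))
    return math.exp(log_coeff + a * math.log(p_pos) + b * math.log(p_neg)
                    + (n - a - b) * math.log1p(-p_pos - p_neg))


def poisson_pmf(k, lam):
    """Pois(k; lam), computed in log space for numerical stability."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


# Hypothetical values: one million users, rare statements.
n = 1_000_000
p_pos, p_neg = 9e-5, 5e-7
exact = multinomial_pmf(90, 1, n, p_pos, p_neg)
approx = poisson_pmf(90, n * p_pos) * poisson_pmf(1, n * p_neg)
relative_error = abs(exact - approx) / exact
```

For these values the two probabilities agree to well under one percent, which is why the product of Poissons is a convenient stand-in for the multinomial.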
Modeling user behavior
The bottom line is this:
If we know $C_i^+, C_i^-$, then we can use our Bayesian
network to compute two expressions that depend
on $p_A$, $p_S^+$, $p_S^-$.
The first represents the probability distribution when
the dominant opinion is positive, whereas the
second represents the probability distribution when
the dominant opinion is negative
Modeling user behavior
Example: If we assume
$p_A = 0.9$, $n p_S^+ = 100$, $n p_S^- = 5$
The agreement parameter is relatively high,
and the probability of posting a statement is
significantly higher for users with a positive opinion than
for users with a negative one. We get the two distributions that we saw
in the big cities example
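With the example parameters, the posterior of the dominant opinion can be sketched as below. The uniform prior `prior_pos=0.5` is an assumption, and the function name and sample counts are illustrative.

```python
import math


def log_poisson(k, lam):
    """log Pois(k; lam)."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def posterior_positive(c_pos, c_neg, p_agree=0.9,
                       n_post_pos=100.0, n_post_neg=5.0, prior_pos=0.5):
    """Pr(D = + | C+ = c_pos, C- = c_neg) under the two-Poisson model."""
    log_pos = (math.log(prior_pos)
               + log_poisson(c_pos, p_agree * n_post_pos)
               + log_poisson(c_neg, (1.0 - p_agree) * n_post_neg))
    log_neg = (math.log(1.0 - prior_pos)
               + log_poisson(c_pos, (1.0 - p_agree) * n_post_pos)
               + log_poisson(c_neg, p_agree * n_post_neg))
    top = max(log_pos, log_neg)  # subtract the max before exponentiating
    e_pos, e_neg = math.exp(log_pos - top), math.exp(log_neg - top)
    return e_pos / (e_pos + e_neg)


many_mentions = posterior_positive(60, 3)   # close to 1
no_mentions = posterior_positive(0, 0)      # close to 0
majority_only = posterior_positive(20, 5)   # close to 0
```

Under these parameters a city with no statements at all is inferred to be "not big", and 20 pro versus 5 contra statements is still too few for a positive dominant opinion, even though a majority vote would say otherwise.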
Estimating model parameters
But how do we choose 𝑝𝐴 , 𝑝𝑠+ , 𝑝𝑠− ?
Well, it is a kind of maximum likelihood
problem
Maximum Likelihood - Reminder
Method for estimating the parameters of a
statistical model given a set of sample data
Example: Estimating a Bernoulli parameter.
Assumption: $X \sim \mathrm{Ber}(p)$ with unknown $p$
Input: $x_1, \dots, x_n$
$$L(p) = P(x_i = 1)^{\sum x_i}\,P(x_i = 0)^{n - \sum x_i} = p^{\sum x_i}\,(1 - p)^{n - \sum x_i}$$
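For the Bernoulli case, setting the derivative of the log-likelihood to zero gives the sample mean as the estimate. The sketch below checks this against a brute-force search over a grid; the sample data is made up.

```python
import math


def bernoulli_log_likelihood(p, samples):
    """log L(p) = (sum x_i) log p + (n - sum x_i) log(1 - p)."""
    ones = sum(samples)
    return ones * math.log(p) + (len(samples) - ones) * math.log(1.0 - p)


def bernoulli_mle(samples):
    """Closed-form maximizer: the fraction of ones."""
    return sum(samples) / len(samples)


samples = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]  # hypothetical data: 7 ones out of 10
p_hat = bernoulli_mle(samples)

# Brute-force check: the best grid point coincides with the closed form.
grid = [k / 100 for k in range(1, 100)]
best_on_grid = max(grid, key=lambda p: bernoulli_log_likelihood(p, samples))
```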
Estimating model parameters
But in our case we don’t have all the sample data
Estimating model parameters - definitions
• $E = \{\langle c_i^+, c_i^- \rangle \mid i \in 1..m\}$
Set of tuples with positive/negative Evidence count
• $D = \langle D_1, \dots, D_m \rangle$
Vector of random variables that represent the
possible Dominant opinion for each entity
• $\theta = \langle p_A, p_S^+, p_S^- \rangle$
Vector containing the parameters for our Bayesian
network (which we are trying to estimate)
Estimating model parameters
6:
Done using the Bayesian network we defined
previously
7:
Done by a special ML method that takes into
account all possible sample data sets. Each data set
is weighted using the probabilities from the
previous step
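The two alternating steps can be sketched as an expectation-maximization loop over the two-Poisson model. This is an illustration under stated assumptions, not the authors' implementation: it assumes a uniform prior on each dominant opinion, works with the rates $n p_S^+$ and $n p_S^-$ directly, and replaces the analytic maximization over $p_A$ with a grid search. The evidence counts are invented.

```python
import math


def log_poisson(k, lam):
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def log_joint(c_pos, c_neg, dominant, theta):
    """log Pr(D_i = dominant, E_i | theta), assuming a uniform prior on D_i."""
    p_agree, rate_pos, rate_neg = theta  # rates stand for n*p_S+ and n*p_S-
    share_pos = p_agree if dominant == "+" else 1.0 - p_agree
    return (math.log(0.5) + log_poisson(c_pos, share_pos * rate_pos)
            + log_poisson(c_neg, (1.0 - share_pos) * rate_neg))


def e_step(evidence, theta):
    """Weight of each entity: Pr(D_i = + | theta, E_i)."""
    weights = []
    for c_pos, c_neg in evidence:
        lp = log_joint(c_pos, c_neg, "+", theta)
        ln = log_joint(c_pos, c_neg, "-", theta)
        top = max(lp, ln)
        weights.append(math.exp(lp - top)
                       / (math.exp(lp - top) + math.exp(ln - top)))
    return weights


def m_step(evidence, weights):
    """Maximize the weighted log-likelihood: grid over p_A, closed form for rates."""
    total_pos = sum(cp for cp, _ in evidence)
    total_neg = sum(cn for _, cn in evidence)
    w_sum, m = sum(weights), len(evidence)
    best = None
    for step in range(51, 100):  # p_A restricted to (0.5, 1)
        p_agree = step / 100
        share_pos = w_sum * p_agree + (m - w_sum) * (1.0 - p_agree)
        theta = (p_agree, max(total_pos, 1e-9) / share_pos,
                 max(total_neg, 1e-9) / (m - share_pos))
        score = sum(w * log_joint(cp, cn, "+", theta)
                    + (1.0 - w) * log_joint(cp, cn, "-", theta)
                    for (cp, cn), w in zip(evidence, weights))
        if best is None or score > best[0]:
            best = (score, theta)
    return best[1]


def fit(evidence, theta=(0.9, 100.0, 5.0), iterations=20):
    for _ in range(iterations):
        theta = m_step(evidence, e_step(evidence, theta))
    return theta, e_step(evidence, theta)


# Hypothetical (C+, C-) counts for seven entities of one type.
evidence = [(95, 1), (80, 0), (102, 2), (7, 4), (12, 6), (0, 0), (9, 3)]
theta, weights = fit(evidence)
```

The loop only finds a local optimum, so the starting guess matters; the default here reuses the parameter values from the lecture example.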
Estimating model parameters - definitions
The “expectation maximization” function which
we want to maximize:
$$L(d, \theta) = \Pr(E, D = d \mid \theta)$$
$$Q_k(\theta) = \sum_{d \in \{+,-\}^m} \Pr(D = d \mid \theta_{k-1}, E)\,\log L(d, \theta)$$
The factor $\Pr(D = d \mid \theta_{k-1}, E)$ is the weight.
This is an extension of the ML function that considers all
the data sets, each one with a different weight.
The weight function is the probability
that we computed in the previous step
Estimating model parameters
$$Q_k(\theta) = \sum_{d \in \{+,-\}^m} \Pr(D = d \mid \theta_{k-1}, E)\,\log L(d, \theta)$$
Problem: Exponential number of terms
Thus, we will work with a summarized
version that is linear in the number of entities
$$Q_k(\theta) = \sum_{i=1}^{m} \sum_{d_i \in \{+,-\}} \Pr(D_i = d_i \mid \theta_{k-1}, E_i)\,\log \Pr(D_i = d_i, E_i \mid \theta)$$
Last step - differentiate and set equal to 0
I will leave it for you as a home exercise
Submit it to Schreiber cell 777
Experimental evaluation
• Surveyor was applied to a 40TB annotated
web snapshot
• The data processing pipeline was executed on
a large cluster (5000 nodes) and took 2 hours
• Inferred the dominant opinion for over 4 billion
entity-property pairs
Experiment against AMT workers
• Selected 500 entity-property pairs:
5 types × 20 entities × 5 properties
• Compared against 20 AMT workers
Each worker was asked about each of the 500
entity-property pairs, for a total of 10,000 opinions
Experiment against AMT workers
Experiment against AMT workers
Conclusions
• Introduced a new problem of “Subjective
Property Mining”
• There is a need for a special type of system to
solve this problem
• Introduced Surveyor system