
Exploiting Dictionaries in Named Entity Extraction:
Combining Semi-Markov Extraction Processes and
Data Integration Methods
William W. Cohen, Sunita Sarawagi
Presented by: Quoc Le
CS591CXZ – General Web Mining.
Motivation
• Information Extraction
– Deriving structured data from unstructured
data.
– Using structured data as guidance to improve
extraction from unstructured sources.
• Named Entity Recognition
– Extracting names, locations, times.
– Improving NER systems with external
dictionaries.
Approaches
• Look up entities in a (large) dictionary.
– Surface forms differ; prone to noise and
errors.
• Take an existing NER system and link to
an external dictionary.
– High-performance NER classifies words into
classes vs. matching an entire entity’s similarity
to a dictionary entry.
Problem Formulation
• Name finding as Word Tagging
– E.g.: (Fred)Person (please stop by)Other (my
office)Loc (this afternoon)Time
– x: a sequence of words maps to y: a sequence
of labels; training data is (x, y) pairs.
• Conditional distribution of y given x
(HMM):
P(y | x) = ∏_{i=1}^{|x|} P(y_i | i, x, y_{i−1})
Semi-Markovian NER
• Segmentation: S = <S1,…,SM>: each segment Sj has a
start position tj, an end position uj, and a label lj.
– E.g.: S = {(1,1,Person), (2,4,Oth), (5,6,Loc),
(7,8,Time)}
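A segmentation can be represented directly as (start, end, label) triples; a small sketch that recovers the labeled phrases from the slide’s example:

```python
# Segmentation S as a list of (start, end, label) triples,
# 1-indexed and inclusive, as on the slide.

def labeled_phrases(tokens, segmentation):
    """Map each (t, u, label) segment back to its token span."""
    return [(" ".join(tokens[t - 1:u]), label)
            for (t, u, label) in segmentation]

tokens = "Fred please stop by my office this afternoon".split()
S = [(1, 1, "Person"), (2, 4, "Oth"), (5, 6, "Loc"), (7, 8, "Time")]
print(labeled_phrases(tokens, S))
# [('Fred', 'Person'), ('please stop by', 'Oth'),
#  ('my office', 'Loc'), ('this afternoon', 'Time')]
```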
• Conditional semi-Markov Model (CSMM):
Inference and Learning problems
P(S | x) = ∏_j P(Sj | tj, x, lj−1)
Compare to other approaches
• CMM: CSMM predicts tag + position.
• Order-L CMM: CSMM uses corresponding
tokens (not just previous ones).
• Treating the external dictionary as training examples:
prone to misspellings, large dictionaries, and
dictionary mismatch. (Good when training data is limited.)
• N-gram classification: entities may overlap
• Use dictionary to bootstrap the search for
extraction patterns: rule-based vs. probabilistic.
Training SMM
• Modified version of Collins’ perceptron-based algorithm
for training HMMs.
• Assume a local feature function f which maps a pair (x,S)
and an index j to a vector of features f(j,x,S). Define:
F(x, S) = Σ_{j=1}^{|S|} f(j, x, S)
• Let W be the weight vector over the components of F.
– Inference: Compute V(W,x) – the Viterbi decoding of x with W.
– Training: Learn the W that leads to the best performance.
• The Viterbi search can be done with a recurrence on Vx,W(i,y).
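A minimal sketch of the semi-Markov Viterbi recurrence, assuming a hypothetical segment-scoring function in place of W · f and a fixed bound on segment length; the toy scorer at the end is made up for illustration:

```python
# Semi-Markov Viterbi sketch: V[i][y] = best score of a segmentation
# of tokens[:i] whose last segment has label y.

def semimarkov_viterbi(tokens, labels, score, maxlen=3):
    n = len(tokens)
    V = [{y: float("-inf") for y in labels} for _ in range(n + 1)]
    back = [{} for _ in range(n + 1)]
    V[0] = {y: 0.0 for y in labels}  # empty prefix; y acts as "previous"
    for i in range(1, n + 1):
        for y in labels:
            for d in range(1, min(maxlen, i) + 1):   # segment length
                for prev in labels:                  # previous label
                    s = V[i - d][prev] + score(tokens, i - d, i, y, prev)
                    if s > V[i][y]:
                        V[i][y] = s
                        back[i][y] = (i - d, prev)
    # Trace back the best segmentation.
    y = max(V[n], key=V[n].get)
    segs, i = [], n
    while i > 0:
        j, prev = back[i][y]
        segs.append((j + 1, i, y))  # 1-indexed, inclusive, as on the slide
        i, y = j, prev
    return list(reversed(segs))

def toy_score(tokens, t, u, label, prev):
    # Invented scorer: reward all-capitalized "Person" segments,
    # and mildly prefer longer "Oth" segments.
    seg = tokens[t:u]
    if label == "Person":
        return 2.0 if all(w[0].isupper() for w in seg) else -5.0
    return 0.1 * (u - t) ** 2

print(semimarkov_viterbi("Fred stopped by".split(),
                         ["Person", "Oth"], toy_score))
# [(1, 1, 'Person'), (2, 3, 'Oth')]
```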
Perceptron-based SMM Learning
• Let SCORE(x,W,S) = W · F(x,S).
• For each example xt, St:
– Find the K segmentations with the highest scores.
– Let Wt+1 = Wt.
– For each i such that SCORE(xt,Wt,Si) is
greater than (1−β) · SCORE(xt,Wt,St), update
Wt+1: Wt+1 = Wt+1 + F(xt,St) − F(xt,Si)
• Return the average of all Wt.
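The update step can be sketched with feature vectors as dicts; the candidate segmentations are assumed to come from the Viterbi search and are stubbed out here as pre-computed feature vectors:

```python
# Sketch of one round of the perceptron-style update from the slide.

def score(W, F):
    """SCORE = W . F, with sparse dict feature vectors."""
    return sum(W.get(k, 0.0) * v for k, v in F.items())

def update(W, F_true, F_wrong):
    """W += F(x, St) - F(x, Si) for a violating segmentation Si."""
    for k, v in F_true.items():
        W[k] = W.get(k, 0.0) + v
    for k, v in F_wrong.items():
        W[k] = W.get(k, 0.0) - v
    return W

def perceptron_round(W, F_true, candidates, beta=0.05):
    # candidates: feature vectors of the top-K segmentations.
    for F_cand in candidates:
        if F_cand != F_true and score(W, F_cand) > (1 - beta) * score(W, F_true):
            W = update(W, F_true, F_cand)
    return W

W = perceptron_round({"lower_person": 1.0}, {"cap_person": 2.0},
                     [{"lower_person": 1.0}])
print(W)  # {'lower_person': 0.0, 'cap_person': 2.0}
```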
Features
• Examples: value of the segment, length of the
segment, left window, right window, etc.
• Most can also be applied to an HMM NER system.
• Segment features are more powerful and meaningful,
e.g. “X+ X+” is more indicative of a name than “X+”.
• Distance features: similarity to words in
an external dictionary
Distance Features
• D: dictionary; d: distance metric; e: entity
name in D; e’: segment. Define:
– gD/d(e’) = min_{e∈D} d(e, e’)
• Distance metric: Jaccard (word-level),
Jaro-Winkler (character-level), TFIDF
(word-level), SoftTFIDF (hybrid measure).
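A sketch of the word-level Jaccard distance and the min-over-dictionary feature gD/d; the dictionary contents below are made up:

```python
# Word-level Jaccard distance and the dictionary feature
# g_D(e') = min over e in D of d(e, e').

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over lowercased word sets."""
    A, B = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(A & B) / len(A | B)

def g(D, segment, d=jaccard_distance):
    """Distance from a candidate segment to its nearest dictionary entry."""
    return min(d(e, segment) for e in D)

D = ["William Cohen", "Sunita Sarawagi"]     # toy dictionary
print(round(g(D, "W. Cohen"), 4))  # 1 - 1/3 ≈ 0.6667
```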
Experiments
• HMM-VP(1): Predicts two labels y: one for tokens
inside an entity and one for tokens outside.
• HMM-VP(4): Encoding scheme: use labels with
tags unique, begin, end, continue, and other.
– E.g.: (Fred)Person-unique, please stop by the
(fourth)Loc-begin (floor meeting)Loc-continue (room)Loc-end
• SMM (K = 2, E = 20, β = 0.05): any, first, last,
exact.
• Data sets: addresses in India, student emails, jobs.
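The HMM-VP(4) encoding can be sketched as a conversion from segments to per-token tags; the tag-name spellings here are illustrative:

```python
# Convert a segmentation into per-token tags in the
# unique/begin/continue/end/other style.

def encode(n_tokens, segments):
    tags = ["other"] * n_tokens
    for (t, u, label) in segments:       # 1-indexed, inclusive
        if t == u:
            tags[t - 1] = label + "_unique"
        else:
            tags[t - 1] = label + "_begin"
            tags[u - 1] = label + "_end"
            for i in range(t, u - 1):    # interior tokens
                tags[i] = label + "_continue"
    return tags

print(encode(8, [(1, 1, "Person"), (5, 8, "Loc")]))
# ['Person_unique', 'other', 'other', 'other',
#  'Loc_begin', 'Loc_continue', 'Loc_continue', 'Loc_end']
```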
Considerations
• Evaluating exact matching against a
dictionary: low recall and errors, but it provides a
good indication of dictionary quality.
• Normalizing dictionary entries: yes and no.
E.g.: “Will” & “will”.
• For HMMs, we could use partial distance
between tokens & dictionary entries.
• Segment size is bounded by a fixed constant.
Evaluation
• Combinations of NER methods: without an
external dictionary, with binary features, and
with distance features.
• Train with only 10% of the data and test on the
rest; repeat 7 times and record the average.
• Partial extractions get no credit.
• Use precision, recall, and F1.
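A sketch of the exact-match evaluation: a predicted segment counts only if start, end, and label all match, so partial extractions get no credit:

```python
# Entity-level precision, recall, and F1 under exact matching.

def prf1(gold, pred):
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                # exact (start, end, label) matches
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = [(1, 1, "Person"), (5, 6, "Loc")]
pred = [(1, 1, "Person"), (5, 7, "Loc")]   # second span is off by one
print(prf1(gold, pred))  # (0.5, 0.5, 0.5)
```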
Results
• SMM-VP is best: outperforms HMM-VP(4)
on 13 out of 15 cases.
• HMM-VP(1) is worst: HMM-VP(4)
outperforms HMM-VP(1) on 13 out of 15
cases.
• Binary dict. features are helpful but
distance features are more helpful.
• See Table 1 (details) and 4 (short).
Effects
• Improvements over Collins’ method (Table 5).
• The gap between SMM and HMM-VP(4)
decreases as training size increases, but the
two still converge at different speeds (Table 2).
• A higher-order HMM does not improve
performance much (Table 6).
• Alternative (less-related) dictionaries: both
methods seem fairly robust.
Conclusion & Questions
• Conclusion
– Incorporates dictionary knowledge nicely.
– Applicable to sequential models.
– Improvements are significant, but the method
uses more resources and runs 3–5 times slower.
• Questions:
– What if the dictionary is not a super-set? What
about an unrelated dictionary?
– Harder types of data for which named-entity
labels are not easy to obtain.