Statistical Relational Learning for NLP
Ray Mooney & Razvan Bunescu
Presented by Michele Banko
Outline
Quick intro to NLP
– Problems, History
– Why SRL?
SRL for two NLP tasks
– Information Extraction using RMNs
– Semantic Parsing using ILP
Quick Tour of NLP
Systems
– Translation, Question-Answering/NL Search, Reading Comprehension
Sub-tasks
– Part-of-Speech Tagging, Phrase Finding, Parsing [syntax, semantics], Named-Entity Recognition
Quick Tour of NLP [2]
Knowledge engineering/linguistics vs. statistics
Recently, many NLP tasks treated as sequence labeling problems
– POS-tagging: The/DT cat/NN can/MD walk/VB
– NP-finding: Stanford/B-NP University/I-NP is/O in/O California/B-NP
– SRL with a single relation: adjacency
HMMs, CRFs to model, Viterbi to label
– Find most probable assignment of labels
– States = Part-of-speech tags
– Compute P(w|t), emit word in a particular state
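
To make the sequence-labeling picture concrete, here is a minimal Viterbi decoder for an HMM tagger — a sketch, not production code. The tag set matches the POS example above, but the transition and emission tables are invented toy numbers, not probabilities estimated from data.

# Minimal Viterbi decoder for an HMM POS tagger (illustrative sketch).
# The tiny transition/emission tables below are invented toy numbers.

def viterbi(words, tags, trans, emit, start):
    """Return the most probable tag sequence for `words`.

    trans[s][t] = P(t | s)   (tag-to-tag transition)
    emit[t][w]  = P(w | t)   (word emitted in state/tag t)
    start[t]    = P(t at position 0)
    """
    # best[i][t] = probability of the best path ending in tag t at word i
    best = [{t: start[t] * emit[t].get(words[0], 1e-8) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, p = max(((s, best[i - 1][s] * trans[s][t]) for s in tags),
                          key=lambda x: x[1])
            best[i][t] = p * emit[t].get(words[i], 1e-8)
            back[i][t] = prev
    # Trace back the best path from the final position.
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

tags = ["DT", "NN", "MD", "VB"]
start = {"DT": 0.7, "NN": 0.1, "MD": 0.1, "VB": 0.1}
trans = {
    "DT": {"DT": 0.01, "NN": 0.9,  "MD": 0.04, "VB": 0.05},
    "NN": {"DT": 0.05, "NN": 0.1,  "MD": 0.6,  "VB": 0.25},
    "MD": {"DT": 0.1,  "NN": 0.1,  "MD": 0.05, "VB": 0.75},
    "VB": {"DT": 0.4,  "NN": 0.3,  "MD": 0.1,  "VB": 0.2},
}
emit = {
    "DT": {"the": 0.9},
    "NN": {"cat": 0.5, "walk": 0.1},
    "MD": {"can": 0.8},
    "VB": {"walk": 0.6, "can": 0.1},
}
print(viterbi(["the", "cat", "can", "walk"], tags, trans, emit, start))
# -> ['DT', 'NN', 'MD', 'VB']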
Why SRL?
NLP involves…
Complex reasoning about entities and the relationships between them
→ predicate logic
Resolving & integrating ambiguities on multiple levels (morphology, syntax, semantics…)
– More than just adjacency!
→ Bayesian methods, graphical models
Intro to Information Extraction
Early 90s, DARPA Message Understanding Conference
– Identify references to named entities (people, companies, locations…)
– Multi-lingual, multi-document
– Attributes, relationships, events
Fletcher Maddox, former Dean of the UCSD Business School, announced the formation of La Jolla Genomatics together with his two sons. Dr. Maddox will be the firm's CEO. His son, Oliver, is the Chief Scientist and holds patents on many of the algorithms used in Geninfo.
Attributes
NAME: Fletcher Maddox, Dr. Maddox
DESCRIPTORS: former Dean of the UCSD Business School, his father, the firm's CEO
CATEGORY: PERSON

Facts
Employee_Of: Fletcher Maddox → UCSD Business School, La Jolla Genomatics
IE for Protein Identification
Medline DB: 750K abstracts, 3700 proteins
Production of nitric oxide (NO) in endothelial cells is regulated by direct interactions of endothelial nitric oxide synthase (eNOS) with effector proteins such as Ca2+-calmodulin. … which avidly binds to the carboxyl terminal region of the eNOS oxygenase domain.
Rule-based, HMM, SVM, MaxEnt
CRFs outperform the rest (Ramani et al., 2005)
– May fail to capture long-distance dependencies
Collective IE using RMNs
(Bunescu & Mooney, 2004)
Typical IE: extractions in isolation
Want to consider influences between extractions
– If the context surrounding one occurrence strongly indicates protein, this should affect future taggings in different contexts
– Acronyms & their long forms
Use a sequence-labeling treatment to get all substrings that make up protein names
– Start, End, Continue, Unique, Other (see the decoding sketch below)
Classification of sequence types, not solo tokens
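
As a rough illustration of this tagging scheme, the following sketch decodes a {Start, End, Continue, Unique, Other} tag sequence back into protein-name spans. The decoding convention (Unique marks a one-token name; Start … Continue* … End marks a multi-token name) is an assumption made for illustration, not a detail taken from the paper.

# Decoding token tags into protein-name spans (a sketch; the decoding
# convention below is an assumption about the slide's tag set).

def tags_to_spans(tags):
    """Return (start, end) token index pairs (end exclusive) for names."""
    spans, open_at = [], None
    for i, t in enumerate(tags):
        if t == "Unique":
            spans.append((i, i + 1))
            open_at = None
        elif t == "Start":
            open_at = i
        elif t == "End" and open_at is not None:
            spans.append((open_at, i + 1))
            open_at = None
        elif t != "Continue":        # "Other" closes any dangling span
            open_at = None
    return spans

tokens = ["the", "eNOS", "oxygenase", "domain", "binds", "calmodulin"]
tags   = ["Other", "Start", "Continue", "End", "Other", "Unique"]
for s, e in tags_to_spans(tags):
    print(" ".join(tokens[s:e]))
# -> eNOS oxygenase domain
# -> calmodulin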
RMNs: Relational Markov Networks (Taskar et al., 2002)
For each document d in the collection:
– Associate d with a set of candidate entities d.E
  Entities = token sequences → too many possible phrases
  Either constrain their length or their form (base NPs)
– Characterize each entity e in d.E with a predefined set of boolean features e.F
  e.label = 1 if e is a valid extraction
  e.g. e.HeadWord, e.POS_Left, e.BigramRight, e.Prefix (see the sketch below)
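
A minimal sketch of what a candidate entity and its boolean features might look like. The feature names follow the slide (HeadWord, Prefix, a left-context word); the concrete representation — (name, value) pairs collected in a set — and the helper extract_features are illustrative assumptions, and features that would need a POS tagger (POS_Left, BigramRight) are omitted.

# A candidate entity with boolean features (illustrative sketch).

from dataclasses import dataclass, field

@dataclass
class CandidateEntity:
    tokens: tuple            # the token sequence, e.g. ("eNOS", "oxygenase", "domain")
    features: set = field(default_factory=set)   # boolean features that fire
    label: int = 0           # 1 if a valid extraction (hidden at test time)

def extract_features(doc_tokens, start, end):
    """Instantiate e.F for the candidate doc_tokens[start:end]."""
    f = set()
    f.add(("HeadWord", doc_tokens[end - 1]))       # head word of the phrase
    if start > 0:
        f.add(("WordLeft", doc_tokens[start - 1])) # left context word
    f.add(("Prefix", doc_tokens[start][:2]))       # short prefix of first token
    return f

doc = ["the", "eNOS", "oxygenase", "domain", "binds", "calmodulin"]
e = CandidateEntity(tuple(doc[1:4]), extract_features(doc, 1, 4))
print(e.label, sorted(e.features))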
Clique Templates
Find all subsets of entities satisfying a given constraint; for each such subset, form a clique c
Local Templates
– Number of hidden labels = 1
– Model correlations between an entity's observed features and its label
Global Templates
– Number of hidden labels > 1
– Model influences among multiple entities
– Overlap Template, Repeat Template, Acronym Template (overlap sketch below)
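
For instance, the Overlap Template can be matched with a simple pairwise scan: any two candidate extractions whose token spans overlap share a clique, since at most one of them can be a valid entity. A minimal sketch, assuming spans are (start, end) pairs with an exclusive end:

# Matching the Overlap Template against a document's candidate
# entities: any two candidates whose token spans overlap form a
# two-label clique.

from itertools import combinations

def overlap_cliques(candidates):
    """Yield pairs of candidate spans that satisfy the overlap constraint."""
    for (s1, e1), (s2, e2) in combinations(candidates, 2):
        if s1 < e2 and s2 < e1:      # the spans share at least one token
            yield (s1, e1), (s2, e2)

# Candidate long forms for "antioxidant superoxide dismutase-1":
cands = [(0, 3), (1, 3), (2, 3)]
print(list(overlap_cliques(cands)))
# every pair overlaps, so all three pairs become cliques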
Using Local Templates
Variable nodes: the labels of all candidate entities in the document
Potential nodes: represent correlations among one or more entity attributes by linking to variable nodes
– Edges: created by matching clique templates against d.E
RMN by Local Templates / Factor Graph by Local Templates
[Figure: the factor graph induced by the local templates — each hidden label node e.label is connected to potential nodes over the entity's observed features, e.g. φ.HEAD=enzyme, φ.PREFIX=A0, φ.POSLEFT=NOUN, φ.WORDLEFT=the, through feature nodes e.f1=vi, e.f2=vj, …, e.fh=vk.]
Using Global Templates
Connect the label nodes of two or more entities
Acronym Factor Graph
[Figure: acronym factor graph for "The antioxidant superoxide dismutase-1 (SOD1)" — an acronym potential φAT links the acronym's label to an auxiliary node uOR, which an OR potential φOR ties to the labels of the candidate long forms "antioxidant superoxide dismutase-1", "superoxide dismutase-1", and "dismutase-1".]
Inference & Learning in RMNs
Inference
– Labels are the only hidden features
– The probability distribution over hidden entity labels is computed using the Gibbs distribution (toy sketch below)
– Find the most probable assignment of values to labels using the max-product algorithm
Learning
– Structure is defined by the clique templates
– Find the clique potentials that maximize the likelihood over the training data using the Voted Perceptron
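
A toy sketch of the Gibbs distribution over label assignments: the unnormalized probability of an assignment is the exponentiated sum of clique scores. Real RMN inference uses max-product on the factor graph; the exhaustive enumeration below is only feasible for this two-entity toy example, and the potential values are hypothetical.

# Gibbs distribution over hidden labels (toy sketch with made-up
# potentials; enumeration replaces max-product for this tiny graph).

from itertools import product
from math import exp

def gibbs(assignment, cliques):
    """Unnormalized P(labels) = exp(sum of clique scores)."""
    return exp(sum(score(tuple(assignment[v] for v in vars_))
                   for vars_, score in cliques))

# Two candidate entities whose spans overlap: a local potential favors
# labeling each one a protein, but the overlap clique forbids both.
local = lambda l: 0.8 if l[0] == 1 else 0.0
overlap = lambda l: float("-inf") if l == (1, 1) else 0.0
cliques = [((0,), local), ((1,), local), ((0, 1), overlap)]

best = max(product([0, 1], repeat=2), key=lambda a: gibbs(a, cliques))
print(best)   # -> (0, 1): exactly one of the overlapping candidates is labeled 1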
Experiments
Datasets
– Yapex, Aimed: ~200 Medline abstracts and ~4000 protein references each
Systems
– LT-RMN: RMN with local + overlap templates
– GLT-RMN: RMN with local + global templates
– CRF: McCallum 2002
Evaluation
– Position-based precision, recall
Results

Yapex
Method    Precision  Recall  F-Measure
LT-RMN    70.79      53.81   61.14
GLT-RMN   69.71      65.76   67.68
CRF       72.45      58.64   64.81

Aimed
Method    Precision  Recall  F-Measure
LT-RMN    81.33      72.79   76.82
GLT-RMN   82.79      80.04   81.39
CRF       85.37      75.90   80.36
Intro to Semantic Parsing
NL input → logical form
Assignment of semantic roles
“John gives the book to Mary”
gives1(subj: John2, dobj: book3, iobj: Mary4)
Ambiguities
"Every man loves a woman"
– Is there one woman loved by all, or does each man love a possibly different woman?
  ∃Y∀X man(X) ∧ woman(Y) → loves(X,Y)
  ∀X∃Y man(X) ∧ woman(Y) → loves(X,Y)
– The LF just says:
  loves1(every m1:man(m1), a w1:woman(w1))
Previous Work in Semantic Parsing
Hand-built/Linguistic systems
– Grammar development
– NLPwin (MSR)
Data-driven approaches
– Tagging problem: for each NP, tag its role
  Gildea & Jurafsky (2002): NB classifier combination using lexical and syntactic features, previous labels
  Jurafsky et al. (2004): SVMs
– Sentence + LF → parser
  CHILL: ILP to learn a generalized Prolog parser
ILP in CHILL
– Parser induction = learn the control rules of a shift-reduce parser
  – Shift symbols from the input string onto a stack
  – Reduce items on the stack by applying a matching grammar rule
– Can be encoded as a Prolog program:
  parse(S,Parse) :- parse([],S,[Parse],[]).
– Start with a small set of generic operators:
  – INTRODUCE pushes a predicate onto the stack based on the word input
    Lexicon: 'capital' -> capital(_,_)
  – COREF_VARS unifies two variables under consideration
  – DROP_CONJ embeds one predicate as an argument of another
  – SHIFT pushes the word input onto the stack
What is the capital of Texas?

Stack: [answer(_,_):[]]
Input: [what,is,the,capital,of,texas,?]

SHIFT, SHIFT, SHIFT →
Stack: [answer(_,_):[the,is,what]]
Input: [capital,of,texas,?]

INTRODUCE →
Stack: [capital(_,_):[], answer(_,_):[the,is,what]]
Input: [capital,of,texas,?]

COREF_VARS →
Stack: [capital(C,_):[], answer(C,_):[the,is,what]]
Input: [capital,of,texas,?]

SHIFT, SHIFT →
Stack: [capital(C,_):[of,capital], answer(C,_):[the,is,what]]
Input: [texas,?]

INTRODUCE, COREF_VARS →
Stack: [const(S,stateid(texas)):[], capital(C,S):[of,capital], answer(C,_):[the,is,what]]
Input: [texas,?]

DROP_CONJ, SHIFT, SHIFT →
Stack: [answer(C,(capital(C,S),const(S,stateid(texas)))):[?,texas,of,capital,the,is,what]]
Input: []
Learning Control Rules
Operator Generation
– The initial parser is too general and will produce spurious parses
– Use training data to extend the program
Example Analysis
– Run the general parser, recording parse states
– Positive examples
  Parse states to which an operator should be applied
  Find the first correct parse of a training pair; the operators used to achieve its subgoals become positive examples
– Negative examples for single-parse systems
  S is a negative example for the current action if S is a positive example for a previous action (see the sketch below)
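
A minimal sketch of that rule, assuming operators are tried in a fixed order and positives maps each operator to the set of parse states where it was correctly applied (the function name and data layout are illustrative, not from CHILL itself):

# Negative-example generation for a single-parse system: a parse state
# is negative for `op` if it is positive for an operator tried earlier.

def negatives_for(op, op_order, positives):
    """positives maps each operator to the set of states where it applied."""
    earlier = op_order[:op_order.index(op)]
    return set().union(*(positives[o] for o in earlier)) - positives[op]

pos = {"INTRODUCE": {"s1"}, "COREF_VARS": {"s2"}, "SHIFT": {"s3"}}
order = ["INTRODUCE", "COREF_VARS", "SHIFT"]
print(negatives_for("SHIFT", order, pos))   # -> {'s1', 's2'}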
Induction in CHILL
Control-Rule Induction
– Cover all positive examples and none of the negative examples
– Bottom-up: compact the rule set by forming Least-General Generalizations of clause pairs
– Top-down: overly-general rules are specialized by adding literals
– Invention of new predicates:
  op([X,[Y,det:the]], [the|Z], A, B) :- animate(Y).
  animate(man). animate(boy). animate(girl). ...
Fold the learned constraints back into the general parser
ProbCHILL: Parsing via ILP Ensembles
ILP + Statistical Parsing
– ILP: not forced to make decisions based on predetermined features
– SP: handles multiple sources of uncertainty
Ensemble classifier
– Combine the outputs of more than one classifier
– Bagging, boosting
TABULATE: generate several ILP hypotheses
– Use SP to estimate a probability for each potential operator
– Find the most probable semantic parse via beam search (see the sketch below)
  P(parse state) = product of the probabilities of the operators applied to reach the state
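
A sketch of such a beam search, where the score of a parse state is the product of the probabilities of the operators applied to reach it. The successors function and the toy probabilities below stand in for the learned ILP hypotheses and their statistical operator model:

# Beam search over parse states scored by the product of operator
# probabilities (illustrative sketch).

import heapq

def beam_search(initial, successors, is_final, beam_width=5, max_steps=50):
    """successors(state) yields (next_state, operator_prob) pairs."""
    beam = [(1.0, initial)]
    for _ in range(max_steps):
        candidates = []
        for prob, state in beam:
            if is_final(state):
                candidates.append((prob, state))   # carry finished parses
                continue
            for nxt, p in successors(state):
                candidates.append((prob * p, nxt)) # multiply operator probs
        # Keep only the `beam_width` most probable states.
        beam = heapq.nlargest(beam_width, candidates, key=lambda x: x[0])
        if all(is_final(s) for _, s in beam):
            break
    finals = [(p, s) for p, s in beam if is_final(s)]
    return max(finals, key=lambda x: x[0]) if finals else None

# Toy usage: states are strings; each step appends 'a'/'b' with fixed probs.
succ = lambda s: [(s + "a", 0.6), (s + "b", 0.4)]
done = lambda s: len(s) == 3
print(beam_search("", succ, done, beam_width=2))   # -> ≈ (0.216, 'aaa')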
Experiments
U.S. Geography DB with 800 Prolog facts
250 questions from 50 humans, annotated with Prolog queries:

What is the capital of the state with the highest population?
answer(C, (capital(S,C), largest(P, (state(S), population(S,P)))))
10-fold CV, measuring:
Recall = (# correct queries produced) / (# test sentences)
Precision = (# correct queries produced) / (# complete parses produced)
Results
System      Recall  Precision  F-measure
Geobase     56.00   96.40      70.85
CHILL       68.50   97.65      80.52
ProbCHILL   80.40   88.16      84.10