NATURAL LANGUAGE PROCESSING
SEMANTICS:
NAMED ENTITIES
RELATIONS
MODERN SEMANTICS
• Its basic subtasks
– Entity classification:
NAMED ENTITY RECOGNITION (and classification)
– Recognition of predicates and their arguments:
RELATION EXTRACTION
Named Entity Recognition (NER)
Input:
Apple Inc., formerly Apple Computer,
Inc., is an American multinational
corporation headquartered in Cupertino,
California that designs, develops, and
sells consumer electronics, computer
software and personal computers. It was
established on April 1, 1976, by Steve
Jobs, Steve Wozniak and Ronald Wayne.
Output (named entities marked):
[Apple Inc.]ORG, formerly [Apple Computer, Inc.]ORG, is an
American multinational corporation headquartered in
[Cupertino]LOC, [California]LOC that designs, develops, and
sells consumer electronics, computer software and personal
computers. It was established on [April 1, 1976]DATE, by
[Steve Jobs]PER, [Steve Wozniak]PER and [Ronald Wayne]PER.
Named Entity Recognition (NER)
• Locate and classify atomic elements in text into
predefined categories (persons, organizations,
locations, temporal expressions, quantities,
percentages, monetary values, …)
• Input: a block of text
– Jim bought 300 shares of Acme Corp. in 2006.
• Output: annotated block of text
– <ENAMEX TYPE="PERSON">Jim</ENAMEX> bought
<NUMEX TYPE="QUANTITY">300</NUMEX> shares of
<ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX>
in <TIMEX TYPE="DATE">2006</TIMEX>
– ENAMEX tags (MUC in the 1990s)
THE STANDARD NEWS DOMAIN
• Most work on NER focuses on
– NEWS
– Variants of repertoire of entity types first studied
in MUC and then in ACE:
• PERSON
• ORGANIZATION
– GPE
• LOCATION
• TEMPORAL ENTITY
• NUMBER
HOW
• Two tasks:
– Identifying the part of text that mentions an entity
(RECOGNITION)
– Classifying it (CLASSIFICATION)
• The two tasks are reduced to a standard
classification task by having the system classify
WORDS
Basic Problems in NER
• Variation of NEs – e.g. John Smith, Mr Smith,
John.
• Ambiguity of NE types
– John Smith (company vs. person)
– May (person vs. month)
– Washington (person vs. location)
– 1945 (date vs. time)
• Ambiguity with common words, e.g. “may”
Problems in NER
• Category definitions are intuitively quite
clear, but there are many grey areas.
• Many of these grey areas are caused by
metonymy.
Organisation vs. Location : “England won
the World Cup” vs. “The World Cup took
place in England”.
Company vs. Artefact: “shares in MTV” vs.
“watching MTV”
Location vs. Organisation: “she met him at
Heathrow” vs. “the Heathrow authorities”
Approaches to NER:
List Lookup
• System that recognises only entities stored in
its lists (GAZETTEERS).
• Advantages - Simple, fast, language
independent, easy to retarget
• Disadvantages – collection and maintenance
of lists, cannot deal with name variants,
cannot resolve ambiguity
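
A minimal sketch of list lookup against gazetteers; the entries, entity types and longest-match strategy below are illustrative assumptions, not from the slides:

# Minimal gazetteer-based NER: longest-match lookup against fixed lists.
# The gazetteer entries and entity types are illustrative only.
GAZETTEERS = {
    "ORGANIZATION": {("Acme", "Corp."), ("Apple", "Inc.")},
    "LOCATION": {("Cupertino",), ("California",)},
    "PERSON": {("Steve", "Jobs"), ("Ronald", "Wayne")},
}

def lookup_entities(tokens):
    """Return (start, end, type) spans found by longest-first gazetteer match."""
    spans, i = [], 0
    while i < len(tokens):
        match = None
        for etype, entries in GAZETTEERS.items():
            for entry in entries:
                n = len(entry)
                if tuple(tokens[i:i + n]) == entry and (match is None or n > match[1]):
                    match = (etype, n)
        if match:
            spans.append((i, i + match[1], match[0]))
            i += match[1]
        else:
            i += 1
    return spans

print(lookup_entities("Jim bought 300 shares of Acme Corp. in 2006 .".split()))
# -> [(5, 7, 'ORGANIZATION')]

Note that "Jim" is missed (not in any list) and an ambiguous string would be tagged with whichever list it happens to appear in, illustrating the disadvantages above.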
Approaches to NER:
Shallow Parsing
• Names often have internal structure. These
components can be either stored or guessed.
location:
CapWord + {City, Forest, Center}
e.g. Sherwood Forest
CapWord + {Street, Boulevard, Avenue, Crescent, Road}
e.g. Portobello Street
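
A sketch of how such internal-evidence patterns can be encoded as regular expressions; the regexes are an assumed rendering of the templates above:

import re

# Internal evidence: a capitalized word followed by a location-suggesting head noun.
# The keyword list mirrors the slide's examples; the exact regex is assumed.
LOCATION_PATTERN = re.compile(
    r"\b[A-Z][a-z]+ (?:City|Forest|Center|Street|Boulevard|Avenue|Crescent|Road)\b"
)

text = "They filmed in Sherwood Forest and later moved to Portobello Street."
print(LOCATION_PATTERN.findall(text))
# -> ['Sherwood Forest', 'Portobello Street']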
Shallow Parsing Approach
(E.g., Mikheev et al 1998)
• External evidence - names are often used in
very predictive local contexts
Location:
“to the” COMPASS “of” CapWord
e.g. to the south of Loitokitok
“based in” CapWord
e.g. based in Loitokitok
CapWord “is a” (ADJ)? GeoWord
e.g. Loitokitok is a friendly city
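
A sketch of the external-evidence templates as regular expressions over local context; the exact regexes and keyword lists are assumptions:

import re

# External evidence: predictive local contexts around a capitalized word.
CONTEXT_PATTERNS = [
    re.compile(r"to the (?:north|south|east|west) of ([A-Z][a-z]+)"),   # "to the" COMPASS "of" CapWord
    re.compile(r"based in ([A-Z][a-z]+)"),                              # "based in" CapWord
    re.compile(r"([A-Z][a-z]+) is an? (?:\w+ )?(?:city|town|region)"),  # CapWord "is a" (ADJ)? GeoWord
]

def guess_locations(text):
    return {m for p in CONTEXT_PATTERNS for m in p.findall(text)}

print(guess_locations("The camp lies to the south of Loitokitok. "
                      "Loitokitok is a friendly city."))
# -> {'Loitokitok'}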
Machine learning approaches to NER
• NER as classification: the IOB representation
• Supervised methods
– Support Vector Machines
– Logistic regression (aka Maximum Entropy)
– Sequence pattern learning
– Hidden Markov Models
– Conditional Random Fields
• Distant learning
• Semi-supervised methods
THE ML APPROACH TO NE:
THE IOB REPRESENTATION
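In the IOB scheme each token is labelled O (outside), B-TYPE (beginning of an entity) or I-TYPE (inside one); a small illustration on the earlier example sentence (the Python rendering is illustrative only):

# IOB encoding of the earlier example sentence.
tagged = [
    ("Jim",    "B-PER"),
    ("bought", "O"),
    ("300",    "O"),
    ("shares", "O"),
    ("of",     "O"),
    ("Acme",   "B-ORG"),
    ("Corp.",  "I-ORG"),
    ("in",     "O"),
    ("2006",   "O"),
    (".",      "O"),
]

def iob_to_spans(pairs):
    """Collect (start, end, type) spans from a B/I/O tag sequence."""
    spans, start, etype = [], None, None
    for i, (_, tag) in enumerate(pairs + [("", "O")]):
        if tag.startswith("B-") or tag == "O" or (tag.startswith("I-") and tag[2:] != etype):
            if start is not None:
                spans.append((start, i, etype))
                start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            start, etype = i, tag[2:]   # tolerate I- without a preceding B- (IOB1 style)
    return spans

print(iob_to_spans(tagged))   # -> [(0, 1, 'PER'), (5, 7, 'ORG')]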
THE ML APPROACH TO NE: FEATURES
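Typical word-level features include case, word shape, affixes, neighbouring words and gazetteer membership; a sketch of such a feature extractor (the particular feature names and set are illustrative assumptions):

def token_features(tokens, i, gazetteer=frozenset()):
    """Surface features for classifying tokens[i]; the exact set is illustrative."""
    w = tokens[i]
    return {
        "word.lower": w.lower(),
        "word.istitle": w.istitle(),
        "word.isupper": w.isupper(),
        "word.isdigit": w.isdigit(),
        "word.suffix3": w[-3:],
        "word.shape": "".join("X" if c.isupper() else "x" if c.islower()
                              else "d" if c.isdigit() else c for c in w),
        "in.gazetteer": w in gazetteer,
        "prev.word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }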
Supervised ML for NER
• Methods already seen
– Decision trees
– Support Vector Machines
• Sequence pattern learning (also supervised)
– Hidden Markov Models
– Maximum Entropy Models
– Conditional Random Fields
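
As an illustration of the sequence-labelling setting, a minimal CRF training sketch using the third-party sklearn-crfsuite package (an assumption, not part of the slides), reusing token_features from above; real training data would come from a corpus such as CoNLL-2003:

import sklearn_crfsuite

def sent2features(tokens):
    return [token_features(tokens, i) for i in range(len(tokens))]

# Toy training data: one sentence with IOB tags.
train_sents = [["Jim", "bought", "300", "shares", "of", "Acme", "Corp.", "in", "2006", "."]]
train_tags = [["B-PER", "O", "O", "O", "O", "B-ORG", "I-ORG", "O", "O", "O"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit([sent2features(s) for s in train_sents], train_tags)
print(crf.predict([sent2features(["Acme", "Corp.", "hired", "Jim", "."])]))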
EVALUATION
TYPICAL PERFORMANCE
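NER output is conventionally scored with precision, recall and F-measure over predicted entity spans; a minimal exact-match scoring sketch (the span representation is the illustrative one used above):

def precision_recall_f1(predicted, gold):
    """Exact-span scoring: predicted and gold are sets of (start, end, type) spans."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {(0, 1, "PER"), (5, 7, "ORG")}
pred = {(0, 1, "PER"), (5, 6, "ORG")}          # boundary error on the second entity
print(precision_recall_f1(pred, gold))          # -> (0.5, 0.5, 0.5)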
NER Evaluation Campaigns
• English NER -- CoNLL 2003 -- PER/ORG/LOC/MISC
– Training set: 203,621 tokens
– Development set: 51,362 tokens
– Test set: 46,435 tokens
• Italian NER -- Evalita 2009 -- PER/ORG/LOC/GPE
– Development set: 223,706 tokens
– Test set: 90,556 tokens
• Mention Detection -- ACE 2005
– 599 documents
CoNLL2003 shared task (1)
• English and German
• 4 types of NEs:
– LOC Location
– MISC Names of miscellaneous entities
– ORG Organization
– PER Person
• Training Set for developing the system
• Test Data for the final evaluation
CoNLL2003 shared task (2)
• Data
– Columns separated by a single space
– A word for each line
– An empty line after each sentence
– Tags in IOB format
• An example
Milan   NNP B-NP I-ORG
's      POS B-NP O
player  NN  I-NP O
George  NNP I-NP I-PER
Weah    NNP I-NP I-PER
meet    VBP B-VP O
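
A small reader for the column format just described (a sketch; error handling omitted, and the four-column layout is the one shown above):

def read_conll(path):
    """Yield sentences as lists of (word, pos, chunk, ne) tuples from a CoNLL-style file."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                 # an empty line ends the sentence
                if sentence:
                    yield sentence
                    sentence = []
            else:
                word, pos, chunk, ne = line.split(" ")
                sentence.append((word, pos, chunk, ne))
    if sentence:
        yield sentence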
CoNLL2003 shared task (3)
English     precision   recall      F
[FIJZ03]    88.99%      88.54%      88.76%
[CN03]      88.12%      88.51%      88.31%
[KSNM03]    85.93%      86.21%      86.07%
[ZJ03]      86.13%      84.88%      85.50%
-----------------------------------------
[Ham03]     69.09%      53.26%      60.15%
baseline    71.91%      50.90%      59.61%
CURRENT RESEARCH ON NER
• New domains
• New approaches:
– Semi-supervised
– Distant
• Handling many NE types
• Integration with Machine Translation
• Handling difficult linguistic phenomena such
as metonymy
NEW DOMAINS
• BIOMEDICAL
• CHEMISTRY
• HUMANITIES: MORE FINE GRAINED TYPES
Bioinformatics Named Entities
• Protein
• DNA
• RNA
• Cell line
• Cell type
• Drug
• Chemical
NER IN THE HUMANITIES
SITE
LOC
CULTURE
SEMANTIC INTERPRETATION 2:
FROM SENTENCES TO PROPOSITIONS
Surface forms:
– Powell met Zhu Rongji
– Powell and Zhu Rongji met
– Powell met with Zhu Rongji
– Powell and Zhu Rongji had a meeting
Related predicates: battle, wrestle, join, debate, consult
Proposition: meet(Powell, Zhu Rongji)
meet(Somebody1, Somebody2)
...
When Powell met Zhu Rongji on Thursday they discussed the return of the spy plane.
meet(Powell, Zhu)
discuss([Powell, Zhu], return(X, plane))
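
One minimal way to represent such propositions programmatically is as predicate-argument records; the Python representation below is purely illustrative:

from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Proposition:
    predicate: str
    arguments: Tuple[object, ...]

meet = Proposition("meet", ("Powell", "Zhu Rongji"))
discuss = Proposition("discuss", (("Powell", "Zhu Rongji"),
                                  Proposition("return", ("X", "plane"))))
print(meet)   # Proposition(predicate='meet', arguments=('Powell', 'Zhu Rongji'))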
OTHER ASPECTS OF SEMANTIC
INTERPRETATION
• Identification of RELATIONS between entities
mentioned
– Focus of interest in modern CL since 1993 or so
• Identification of TEMPORAL RELATIONS
– From about 2003 on
• QUALIFICATION of such relations (modality,
epistemicity)
– From about 2010 on
TYPES OF RELATIONS
• Predicate-argument structure (verbs and
nouns)
– John kicked the ball
• Nominal relations
– The red ball
• Relations between events / temporal relations
– John kicked the ball and scored a goal
PREDICATE-ARGUMENT STRUCTURE
• Linguistic Theories
– Case Frames (Fillmore) → FrameNet
– Lexical Conceptual Structure (Jackendoff) → LCS
– Proto-Roles (Dowty) → PropBank
– English verb classes / diathesis alternations (Levin) → VerbNet
– Talmy, Levin and Rappaport
Fillmore’s Case Theory
• Sentences have a DEEP STRUCTURE with CASE
RELATIONS
• A sentence is a verb + one or more NPs
– Each NP has a deep-structure case
• A(gentive)
• I(nstrumental)
• D(ative)
• F(actitive)
• L(ocative)
• O(bjective)
– Subject is no more important than Object
• Subject/Object are surface structure
THEMATIC ROLES
• Following Fillmore’s original work, many theories
of predicate-argument structure / thematic roles
were proposed; perhaps the best known are
– Jackendoff’s LEXICAL CONCEPTUAL SEMANTICS
– Dowty’s PROTO-ROLES theory
Dowty’s PROTO-ROLES
• Event-dependent
• Prototypes based on shared entailments
• Grammatical relations such as subject related
to observed (empirical) classification of
participants
• Typology of grammatical relations
• Proto-Agent
• Proto-Patient
Proto-Agent
• Properties
– Volitional involvement in event or state
– Sentience (and/or perception)
– Causing an event or change of state in another
participant
– Movement (relative to position of another
participant)
– (exists independently of event named)
*may be discourse pragmatic
Proto-Patient
• Properties:
– Undergoes change of state
– Incremental theme
– Causally affected by another participant
– Stationary relative to movement of another
participant
– (does not exist independently of the event, or at
all) *may be discourse pragmatic
Semantic role labels:
Jan broke the LCD projector.
break(agent(Jan), patient(LCD-projector))
[Fillmore, 68]
cause(agent(Jan), change-of-state(LCD-projector))
(broken(LCD-projector))
[Jackendoff, 72]
agent(A) -> intentional(A), sentient(A), causer(A), affector(A)
patient(P) -> affected(P), change(P), …
[Dowty, 91]
VERBNET AND PROPBANK
• Dowty’s theory of proto-roles was the basis
for the development of PROPBANK, the first
corpus annotated with information about
predicate-argument structure
PROPBANK REPRESENTATION
Sentence: a GM-Jaguar pact that would give the U.S. car
maker an eventual 30% stake in the British company.
– Arg0: a GM-Jaguar pact (via the relative-clause trace *T*-1)
– Arg1: an eventual 30% stake in the British company
– Arg2: the U.S. car maker
give(GM-J pact, US car maker, 30% stake)
ARGUMENTS IN PROPBANK
• Arg0 = agent
• Arg1 = direct object / theme / patient
• Arg2 = indirect object / benefactive /
instrument / attribute / end state
• Arg3 = start point / benefactive / instrument /
attribute
• Arg4 = end point
• Per word vs frame level – more general?
FROM PREDICATES TO FRAMES
In one of its senses, the verb observe evokes a frame called
Compliance: this frame concerns people’s responses to norms,
rules or practices.
The following sentences illustrate the use of the verb in the
intended sense:
– Our family observes the Jewish dietary laws.
– You have to observe the rules or you’ll be penalized.
– How do you observe Easter?
– Please observe the illuminated signs.
FrameNet
FrameNet records information about English words in
the general vocabulary in terms of
1. the frames (e.g. Compliance) that they evoke,
2. the frame elements (semantic roles) that make up the
components of the frames (in Compliance, Norm is one
such frame element), and
3. each word’s valence possibilities, the ways in which
information about the frames is provided in the linguistic
structures connected to them (with observe, Norm is
typically the direct object).
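
FrameNet can also be browsed programmatically, for example via NLTK's corpus reader; the snippet below assumes nltk is installed and the framenet_v17 data has been downloaded:

# python -m nltk.downloader framenet_v17
from nltk.corpus import framenet as fn

frame = fn.frame("Compliance")        # the frame evoked by "observe" in this sense
print(frame.definition[:80])
print(sorted(frame.FE.keys()))        # frame elements, including 'Norm'
print([lu for lu in frame.lexUnit if lu.startswith("observe")])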
NOMINAL RELATIONS
CLASSIFICATION SCHEMES FOR
NOMINAL RELATIONS
ONE EXAMPLE
(Barker et al. 1998, Nastase & Szpakowicz 2003)
THE TWO-LEVEL TAXONOMY OF
RELATIONS, 2
THE SEMEVAL-2007 CLASSIFICATION
OF RELATIONS
• Cause-Effect: laugh wrinkles
• Instrument-Agency: laser printer
• Product-Producer: honey bee
• Origin-Entity: message from outer-space
• Theme-Tool: news conference
• Part-Whole: car door
• Content-Container: the air in the jar
THE MUC AND ACE TASKS
• Modern research in relation extraction, too, was
kicked off by the Message Understanding
Conference (MUC) campaigns and continued through
the Automatic Content Extraction (ACE) and
Machine Reading follow-ups
• MUC: NE, coreference, TEMPLATE FILLING
• ACE: NE, coreference, relations
TEMPLATE-FILLING
EXAMPLE MUC: JOB POSTING
THE ASSOCIATED TEMPLATE
AUTOMATIC CONTENT EXTRACTION
(ACE)
ACE: THE DATA
ACE: THE TASKS
RELATION DETECTION AND
RECOGNITION
ACE: RELATION TYPES
OTHER PRACTICAL VERSIONS OF
RELATION EXTRACTION
• Biomedical domain (BIONLP, BioCreative)
• Chemistry
• Cultural Heritage
THE TASK OF SEMANTIC RELATION
EXTRACTION
SEMANTIC RELATION EXTRACTION:
THE CHALLENGES
HISTORY OF RELATION EXTRACTION
• Before 1993: Symbolic methods (using
knowledge bases)
• Since then: statistical / heuristic-based
methods
– From 1995 to around 2005: mostly SUPERVISED
– More recently: also quite a lot of UNSUPERVISED /
SEMI-SUPERVISED techniques
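
A toy sketch of the supervised approach: classify candidate entity pairs with simple lexical features; the relation labels, features and scikit-learn pipeline are illustrative assumptions (real systems use much richer syntactic and semantic features):

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def pair_features(tokens, e1_span, e2_span):
    """Lexical features for an ordered entity pair (spans are (start, end) indices)."""
    between = tokens[e1_span[1]:e2_span[0]]
    return {"between": " ".join(between).lower(),
            "e1.head": tokens[e1_span[1] - 1].lower(),
            "e2.head": tokens[e2_span[1] - 1].lower(),
            "distance": e2_span[0] - e1_span[1]}

train_X = [
    pair_features("Jim works for Acme Corp.".split(), (0, 1), (3, 5)),
    pair_features("Acme Corp. is based in Cupertino".split(), (0, 2), (5, 6)),
]
train_y = ["Employment", "Located-In"]

model = make_pipeline(DictVectorizer(sparse=False), LogisticRegression(max_iter=1000))
model.fit(train_X, train_y)
print(model.predict([pair_features("Ann works for MegaCorp".split(), (0, 1), (3, 4))]))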
MORE COMPLEX SEMANTICS
• Modalities
• Temporal interpretation