Type Systems,
Interoperability
and Database Population
Eric Nyberg, CMU
Shilpa Arora, CMU
Lance Ramshaw, BBN
Outline
• Annotation sample analysis
– emergent type systems
– ongoing issues / clarification questions
• Data interoperability
• Database population
– CMU’s Annotations DB
– OntoNotes
– Possible architecture for interoperability with
UIMA annotators
– Issues for Discussion
Task
[Status legend: in progress, not finished / not started]
• Analyze sample outputs from
different annotation groups
• Formalize annotation type system
(UML object model) for each
sample
• Generate clarification questions
• Work toward a unified type
system
• Work toward interoperability
architecture
For each annotation sample:
• Overview of what we received
• Brief example annotation
• Type system analysis
• Issues / Questions
What's in the bin?
[Table columns: #, Annotation, Manual, Samples, Analysis, Type System. An “x” marks the columns that apply to each annotation set.]
#    Annotation
1.1  CMU Belief Annotations
1.2  CMU Event Coreference Annotations
2.1  Ed Hovy's Group - Noun Sense Annotation
3.1  BBN Temporal Ordering Annotation
3.2  BBN Name Annotations
3.3  BBN Coreference Annotation
3.4  BBN (Complex) Coreference Annotation
4.1  UMBC Modality Annotation
5.1  Columbia Dialog Annotation
CMU/Columbia Belief Annotation
• Annotation Manual:
– Davis et al., “Annotating belief in
Communication: Manual”
• Annotation Units: Propositions identified
by PropBank and NomBank
CMU/Columbia Belief
Annotation
• Three categories:
– Committed belief: Belief expressed in utterance
• Can be a proposition about present or future
• E.g. (1) I know Mark and Sandra have eloped. (2) The
sun will rise again. (Future)
– Non-committed belief: Not a strong belief
• Can be a proposition about present or future
• E.g. (1) Mark and Sandra may have eloped. (2) John
may return tomorrow.
– Not applicable: Not a belief
• E.g. (1) I wish Mark and Sandra would finally elope.
CMU/Columbia Belief Annotation
• Five Classes:
– Committed Belief
– Committed Belief Future
– Non-Committed Belief
– Non-Committed Belief Future
– Not Applicable
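The relationship between the three categories and the five classes can be sketched as a small type definition. This is a minimal Python sketch, not the type system shown on the following slides; the names `BeliefClass` and `classify` and the `is_future` flag are assumptions made for illustration.

```python
from enum import Enum


class BeliefClass(Enum):
    """The five classes listed on the slide."""
    COMMITTED_BELIEF = "Committed Belief"
    COMMITTED_BELIEF_FUTURE = "Committed Belief Future"
    NON_COMMITTED_BELIEF = "Non-Committed Belief"
    NON_COMMITTED_BELIEF_FUTURE = "Non-Committed Belief Future"
    NOT_APPLICABLE = "Not Applicable"


def classify(category: str, is_future: bool = False) -> BeliefClass:
    """Map one of the three categories plus a (hypothetical) future flag
    onto one of the five classes."""
    if category == "committed":
        return (BeliefClass.COMMITTED_BELIEF_FUTURE if is_future
                else BeliefClass.COMMITTED_BELIEF)
    if category == "non-committed":
        return (BeliefClass.NON_COMMITTED_BELIEF_FUTURE if is_future
                else BeliefClass.NON_COMMITTED_BELIEF)
    if category == "not applicable":
        # The future distinction does not apply to non-beliefs.
        return BeliefClass.NOT_APPLICABLE
    raise ValueError(f"unknown category: {category!r}")
```

Under this reading, “The sun will rise again.” would be `classify("committed", is_future=True)`.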
CMU/Columbia Belief Annotation:
Type System (1)
CMU/Columbia Belief Annotation:
Type System (2)
CMU/Columbia Belief Annotation:
Type System (3)
Follow up questions
• Extensions:
– What extensions do we expect to the
annotation scheme?
– How can we best tailor the type system
to expected future changes?
• Requirements from application domain?
– Do we have a set of requirements from the
application side?
Ed Hovy’s group
• Annotations:
– Annotated with OntoNotes for Noun senses
– 205 nouns, one file per noun; for each noun, the sense and its
location in the source files are stored
• Sample annotations:
– eng/AFGP-2002-600175-Trans.txt 427 4 [email protected] 3 Mon Dec 3
02:31:27 2007
– eng/AFGP-2002-602187-Trans.txt 25 6 [email protected] 2 Mon Dec 3
02:31:27 2007
– Noun="position", sense=3; file= AFGP-2002-600175-Trans.txt, position
= “427 4”
– Noun="position", sense=2; file=AFGP-2002-602187-Trans.txt,
position=“25 6”
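A sample line of this kind can be split into fields mechanically. The sketch below is one possible parser, assuming the field order suggested by the slide's own reading of the samples (path, two position integers, annotator, sense, timestamp). The class and function names are hypothetical, the two position integers are kept uninterpreted, and `annotator-id` is a stand-in for the annotator field.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class SenseAnnotation:
    path: str
    position: tuple      # the two integers after the path, kept uninterpreted
    annotator: str
    sense: int
    timestamp: datetime


def parse_sense_line(line: str) -> SenseAnnotation:
    """Split one whitespace-separated sample line into its assumed fields."""
    parts = line.split()
    return SenseAnnotation(
        path=parts[0],
        position=(int(parts[1]), int(parts[2])),
        annotator=parts[3],
        sense=int(parts[4]),
        # The remaining tokens form the timestamp, e.g. "Mon Dec 3 02:31:27 2007".
        timestamp=datetime.strptime(" ".join(parts[5:]), "%a %b %d %H:%M:%S %Y"),
    )


# "annotator-id" replaces the annotator field of the sample line.
record = parse_sense_line(
    "eng/AFGP-2002-600175-Trans.txt 427 4 annotator-id 3 Mon Dec 3 02:31:27 2007"
)
```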
Type System (Ed Hovy et al.
Annotation)
BBN
1. BBN TTO-3 Temporal Ordering
Annotation
2. BBN Name Annotations: named entities
(org, date, per, etc.)
3. BBN-Coref-Annotation: entities (with type),
entity mentions, etc.
4. BBN-complex-coref-annotation
Temporal Relationship
Assignment
Text       ID  TT   TP  TR
11/28      1   DS   2   A
Arrived    2   EP   0   B
yesterday  3   DS   2   C
told       4   SP   2   B
Visiting   5   EUN  4   A
left       6   EP   4   A
Return     7   EF   2   A
Monday     8   DS   7   C
is         9   BC   0   C
Return     10  EF   9   A
day        11  DU   10  C
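For downstream processing, the rows of this table can be held as simple records. The following is a minimal Python sketch with hypothetical names. The TT and TR codes are kept as opaque strings, and reading TP as the ID of another row (with 0 meaning "none") is an assumption based only on the values, not something the slide states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemporalRow:
    text: str
    id: int
    tt: str   # TT column, kept as an opaque code
    tp: int   # TP column; ASSUMED to point at another row's ID (0 = none)
    tr: str   # TR column, kept as an opaque code


ROWS = [
    TemporalRow("11/28", 1, "DS", 2, "A"),
    TemporalRow("Arrived", 2, "EP", 0, "B"),
    TemporalRow("yesterday", 3, "DS", 2, "C"),
    TemporalRow("told", 4, "SP", 2, "B"),
    TemporalRow("Visiting", 5, "EUN", 4, "A"),
    TemporalRow("left", 6, "EP", 4, "A"),
    TemporalRow("Return", 7, "EF", 2, "A"),
    TemporalRow("Monday", 8, "DS", 7, "C"),
    TemporalRow("is", 9, "BC", 0, "C"),
    TemporalRow("Return", 10, "EF", 9, "A"),
    TemporalRow("day", 11, "DU", 10, "C"),
]

BY_ID = {row.id: row for row in ROWS}


def referenced_row(row: TemporalRow):
    """Return the row that row.tp points at, or None when TP is 0."""
    return BY_ID.get(row.tp)
```

Under that assumption, every non-zero TP value in the table resolves to an existing ID, which is at least consistent with the reading.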
Type System (BBN Temporal
Ordering Annotation)
BBN Name Annotations (Type
system)
BBN-complex-coref-annotation
Annotations:
• Relations between entities
– Member
– Member Base
– Subset
– Subset Size (future type system)
• Other annotations - Attributes of a mention
– Reference type
– Syntactic Context
Type System for BBN (Complex) coreference annotation
20
Type System for BBN (Complex) coreference annotation (contd…)
UMBC Modality Annotations
• TMR – Text Meaning Representation or
Concepts annotated
• Main Annotation – Modality. It has four
main attributes: TYPE, VALUE, SCOPE and
ATTRIBUTED-TO
• TMRs can be nested, i.e. attributes or
relations can refer to other TMRs
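The nesting described above can be modelled with two small record types. This is only a sketch under assumptions: the class names, the concept names (`ELOPE`, `HUMAN`), the modality type `epistemic` and the value `0.5` are invented for illustration and are not taken from the UMBC data.

```python
from dataclasses import dataclass, field
from typing import Optional, Union


@dataclass
class TMR:
    """A concept instance whose attributes may refer to other TMRs."""
    concept: str
    attributes: dict = field(default_factory=dict)


@dataclass
class Modality:
    """The four attributes named on the slide."""
    type: str
    value: str
    scope: TMR                                   # the TMR the modality applies to
    attributed_to: Optional[Union[TMR, str]] = None


# Hypothetical instance: a modality scoped over a nested TMR.
elope = TMR("ELOPE", {"agent": TMR("HUMAN")})
modality = Modality(type="epistemic", value="0.5",
                    scope=elope, attributed_to="speaker")
```

Because `scope` and the entries of `attributes` are themselves TMRs, arbitrarily deep nesting falls out of the two definitions without extra machinery.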
UMBC Modality Annotations
Interoperability: Data
• Common data model
• Multiple implementations
– based on the same underlying schema
(formal object model)
– meet different goals / requirements
• Implementation Criteria:
– support effective run-time annotation
(e.g. UIMA type system)
– support effective user interface, query/update
(e.g. OntoNotes)
– support on-the-fly schema extension
(e.g. CMU’s AnnotationsDB)
Interoperability: Data [2]
• Formal object model is mapped to:
– UIMA type system definition (create)
– OntoNotes RDBMS schema (extend)
– CMU’s Annotations DB (extend)
• Annotated data can be represented in any
format that implements the formal model
• “Have your cake and eat it too”
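One way to picture the mapping from a single formal model to several implementations is a small generator that emits both a UIMA-style type description and a relational table definition from the same Python object. This is a sketch under assumptions: the `AnnotationType` and `Feature` classes, the `org.example.` namespace and the column names are hypothetical, and the SQL is a generic table, not the actual OntoNotes or Annotations DB schema.

```python
from dataclasses import dataclass
from xml.etree import ElementTree as ET


@dataclass
class Feature:
    name: str
    kind: str  # "string" or "int"


@dataclass
class AnnotationType:
    name: str
    features: list


# Per-target type names for the two feature kinds.
UIMA_RANGE = {"string": "uima.cas.String", "int": "uima.cas.Integer"}
SQL_TYPE = {"string": "VARCHAR(255)", "int": "INTEGER"}


def to_uima_type_description(t: AnnotationType) -> str:
    """Emit a <typeDescription> element in the style of a UIMA type system descriptor."""
    td = ET.Element("typeDescription")
    ET.SubElement(td, "name").text = "org.example." + t.name  # hypothetical namespace
    ET.SubElement(td, "supertypeName").text = "uima.tcas.Annotation"
    feats = ET.SubElement(td, "features")
    for f in t.features:
        fd = ET.SubElement(feats, "featureDescription")
        ET.SubElement(fd, "name").text = f.name
        ET.SubElement(fd, "rangeTypeName").text = UIMA_RANGE[f.kind]
    return ET.tostring(td, encoding="unicode")


def to_sql_ddl(t: AnnotationType) -> str:
    """Emit a generic CREATE TABLE statement for the same type."""
    cols = ["id INTEGER PRIMARY KEY", "begin_offset INTEGER", "end_offset INTEGER"]
    cols += [f"{f.name} {SQL_TYPE[f.kind]}" for f in t.features]
    return f"CREATE TABLE {t.name.lower()} ({', '.join(cols)});"


belief = AnnotationType("Belief", [Feature("category", "string")])
uima_xml = to_uima_type_description(belief)
ddl = to_sql_ddl(belief)
```

Keeping the model in one place and deriving each format from it is what lets annotated data move between implementations without hand-maintained converters.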
CMU’s Annotations Database
• MySQL implementation
• Java APIs (SQL connection API and
simple object access API)
• Fully integrated with UIMA
• Used on DTO and DARPA projects
• PRO: tag types can be extended at run
time by the application (schema supports
open-ended type definition)
• CON: interactive tools are currently limited
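The open-ended type definition mentioned under PRO can be illustrated with a tag table whose type is an ordinary column value, so a new tag type needs no schema change. This is a sketch only: it uses Python's built-in `sqlite3` as a stand-in for the MySQL implementation, borrows the column names (type, value, parent, offset, length) from the schema diagram in these slides, and the keys and the helper `add_tag` are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE span (
    id INTEGER PRIMARY KEY,
    "offset" INTEGER,
    length INTEGER
);
CREATE TABLE tag (
    id INTEGER PRIMARY KEY,
    span_id INTEGER REFERENCES span(id),
    type TEXT,      -- open-ended: any string is a valid tag type
    value TEXT,
    parent INTEGER REFERENCES tag(id)
);
""")


def add_tag(span_id, tag_type, value, parent=None):
    """Insert a tag of any type; no DDL is needed for a type seen for the first time."""
    cur = conn.execute(
        "INSERT INTO tag (span_id, type, value, parent) VALUES (?, ?, ?, ?)",
        (span_id, tag_type, value, parent),
    )
    return cur.lastrowid


span_id = conn.execute(
    'INSERT INTO span ("offset", length) VALUES (?, ?)', (21, 12)
).lastrowid
entity_id = add_tag(span_id, "entity", "org")
# A tag type introduced at run time, attached to the same span.
belief_id = add_tag(span_id, "belief", "committed", parent=entity_id)

tag_types = [row[0] for row in conn.execute("SELECT DISTINCT type FROM tag ORDER BY type")]
```

The trade-off of this design is that the database cannot validate tag types or values on its own; that responsibility moves to the application.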
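Annotations Database
[Schema diagram: four entities and their attributes; “*” multiplicities label the associations between them]
– tag: type, value, parent
– span: offset, length
– passage: text
– document: datetime, docno, doctype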
In an interview with Defense News,
Indian Defence Research and
Development Organization (DRDO)
scientists said India was launching a
comprehensive plan to develop a wide range
of modern nuclear missiles. Within two years,
India would develop an intercontinental
ballistic missile (ICBM), ...
<entity type=org offset=21 length=12 />
<entity type=org offset=35 length=59 />
<entity type=gpe offset=111 length=5
source=bbn ref=#INDIA />
<entity type=gpe offset=223 length=5
source=bbn ref=#INDIA />
<entity type=fac offset=231 length=41 />
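The offset/length pairs are stand-off annotations: each tag points into the passage text rather than wrapping it. The sketch below resolves the first three tags against the opening of the passage, assuming line breaks count as single spaces. The later tags are left out because their offsets depend on spacing that cannot be recovered from the slide. The names `TEXT`, `ENTITIES` and `covered_text` are hypothetical.

```python
# Opening of the example passage, with line breaks treated as single spaces.
TEXT = (
    "In an interview with Defense News, "
    "Indian Defence Research and Development Organization (DRDO) "
    "scientists said India was launching a"
)

# The first three <entity> tags from the slide, as plain dictionaries.
ENTITIES = [
    {"type": "org", "offset": 21, "length": 12},
    {"type": "org", "offset": 35, "length": 59},
    {"type": "gpe", "offset": 111, "length": 5, "source": "bbn", "ref": "#INDIA"},
]


def covered_text(text, entity):
    """Return the substring that a stand-off annotation points at."""
    start = entity["offset"]
    return text[start:start + entity["length"]]


mentions = [(e["type"], covered_text(TEXT, e)) for e in ENTITIES]
```

Because the text itself is never modified, annotations from different sources (such as the `source=bbn` tags) can overlap freely on the same passage.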
JAVELIN Project Briefing
AQUAINT
Program
An Integrated Annotation DB
in OntoNotes
Sameer Pradhan, Eduard Hovy, Mitchell Marcus,
Martha Palmer, Lance Ramshaw, and Ralph Weischedel
http://www.bbn.com/NLP/OntoNotes
Goals
• Capture multiple layers of annotation and modeling
– Syntax
– Propositions
– Word sense
– Ontology
– Coreference
– Names
• Using an integrated relational database representation
– Enforces consistency across the different annotations
– Supports integrated models that can combine evidence from
different layers
Unified Representation
• Provide a bare-bones representation independent
of the individual semantics that can
– Efficiently capture intra- and inter-layer
semantics
– Maintain component independence
– Provide mechanism for flexible integration
– Integrate information at the lowest level of
granularity
• A Relational Database
Unified Relational Representation
[Diagram: Corpus, Trees, Coreference, Senses, Propositions, Names]
Example: DB Representation of Syntax
• Treebank tokens (stored in the Token table) provide the common base
• The Tree table stores the recursive tree nodes, each with its span
• Subsidiary tables define the sets of function tags, phrase types, etc.
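A toy version of the Token and Tree tables shows how a node's span ties it back to the token base. This is a sketch with an invented, simplified schema (the column names, the use of `sqlite3` and the three-node tree are assumptions), not the actual OntoNotes tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE token (position INTEGER PRIMARY KEY, word TEXT);
CREATE TABLE tree (
    id INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES tree(id),  -- recursive: a node points at its parent
    tag TEXT,
    start_token INTEGER,                    -- span, as inclusive token positions
    end_token INTEGER
);
""")

words = "major reductions and realignments of troops in central Europe".split()
conn.executemany("INSERT INTO token VALUES (?, ?)", list(enumerate(words)))

# A deliberately simplified tree: only three of the phrase nodes are stored.
conn.executemany("INSERT INTO tree VALUES (?, ?, ?, ?, ?)", [
    (1, None, "NP", 0, 8),
    (2, 1, "PP", 6, 8),
    (3, 2, "NP", 7, 8),
])


def phrase(node_id):
    """Rebuild the text of a tree node by joining its span against the token table."""
    rows = conn.execute(
        "SELECT t.word FROM tree AS n JOIN token AS t "
        "ON t.position BETWEEN n.start_token AND n.end_token "
        "WHERE n.id = ? ORDER BY t.position",
        (node_id,),
    ).fetchall()
    return " ".join(row[0] for row in rows)
```

Any other layer that stores spans over the same token positions can be joined against these tables in the same way.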
Advantages of an Integrated Representation
• Each layer translates into a common representation
• Clean, consistent layers
– Resolve the inconsistencies and problems that this reveals
• Well defined relationships
– Database schema defines the merged structure efficiently
• Original representations available as predefined views
– Treebank, PropBank, etc.
• SQL queries can extract examples based on multiple
layers or define new views
• Python Object-oriented API allows for programmatic
access to tables and queries
Syntax Layer
• Identifies meaningful phrases in the text
• Lays out the structure of how they are related
Concerns about the pace of the Vienna talks -- which are aimed at the destruction
of some 100,000 weapons , as well as major reductions and realignments of troops
in central Europe – also are being registered at the Pentagon .
[Diagram: parse tree over “... major reductions and realignments of troops in central Europe – ...” with phrase nodes S, NP, PP and part-of-speech tags JJ, NNS, CC, IN, NNP]
Propositional Structure
• Tells who did what to whom
• For both verbs and nouns
Concerns about the pace of the Vienna talks -- which are aimed at the destruction
of some 100,000 weapons , as well as major reductions and realignments of troops
in central Europe – also are being registered at the Pentagon .
[Diagram: the same parse tree over “... major reductions and realignments of troops in central Europe – ...” with argument labels ARG1, ARG2 and ARGM-LOC attached to its nodes]
Predicate Frames
• Predicate frames define the meanings of the numbered arguments
Concerns about the pace of the Vienna talks -- which are aimed at the destruction
of some 100,000 weapons , as well as major reductions and realignments of troops
in central Europe – also are being registered at the Pentagon .
Predicate Frames: aim
– aim.01 – Plan
• ARG0 – Aimer
• ARG1 – Action
– aim.02 – Directed motion
• ARG0 – Aimer
• ARG1 – Thing in motion
• ARG2 – Target
Predicate Frames: reduction
– reduce.01 – Make less
• ARG0 – Agent
• ARG1 – Thing falling
• ARG2 – Amount fallen
• ARG3 – Starting point
• ARG4 – Ending point
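The frames on this slide amount to a lookup from a frame identifier and a numbered argument to a role description. A minimal Python sketch of that lookup follows; the dictionary layout and the function name `role_meaning` are assumptions, while the frame contents are copied from the slide.

```python
FRAMES = {
    "aim.01": {
        "gloss": "Plan",
        "roles": {"ARG0": "Aimer", "ARG1": "Action"},
    },
    "aim.02": {
        "gloss": "Directed motion",
        "roles": {"ARG0": "Aimer", "ARG1": "Thing in motion", "ARG2": "Target"},
    },
    "reduce.01": {
        "gloss": "Make less",
        "roles": {
            "ARG0": "Agent",
            "ARG1": "Thing falling",
            "ARG2": "Amount fallen",
            "ARG3": "Starting point",
            "ARG4": "Ending point",
        },
    },
}


def role_meaning(frame_id, argument):
    """Return the role description of a numbered argument, or None if the frame lacks it."""
    return FRAMES[frame_id]["roles"].get(argument)
```

The same label means different things in different frames: `role_meaning("aim.01", "ARG1")` is “Action”, while `role_meaning("aim.02", "ARG1")` is “Thing in motion”.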
Word Sense and Ontology
• Meanings of nouns and verbs are specified
• All the senses are annotatable at 90% inter-annotator agreement
• Catalog of possible meanings supplied in the sense inventory files
• Ontology links (currently being added) will capture similarities
between related senses of different words
Concerns about the pace of the Vienna talks -- which are aimed at the destruction
of some 100,000 weapons , as well as major reductions and realignments of troops
in central Europe – also are being registered at the Pentagon .
Word Sense: aim
1. Point or direct object, weapon, at something ...
2. Wish, purpose or intend to achieve something
Word Sense: register
1. Enter into an official record
2. Be aware of, enter into someone’s consciousness
3. Indicate a measurement
4. Show in one’s face
Coreference
• Identifies different mentions of the same entity in text – especially
links definite, referring noun phrases, and pronouns in text
• Two types – Identity as well as Attributive coreference tagged.
Concerns about the pace of the Vienna talks -- which are aimed at the destruction
of some 100,000 weapons , as well as major reductions and realignments of troops
in central Europe – also are being registered at the Pentagon .
[Diagram: coreference links drawn over the sentence; labels shown: “President Bush”, “conventional arms talk”, “He”, “Pentagon”, entity e0]
Example of DB Query Function
What is the distribution of named entities that are ARG0s of the predicate “say”?

    from collections import defaultdict

    # a_cursor, a_proposition_bank, a_tree_document and a_tree_id are set up
    # elsewhere (not shown on the slide); rows are accessed by column name.
    ne_hash = defaultdict(int)  # named-entity type -> frequency

    for a_proposition in a_proposition_bank:
        # Step 1: keep only propositions whose lemma is "say".
        if a_proposition.lemma != "say":
            continue
        arg_in_p_q = "select * from argument where proposition_id = '%s';" % (a_proposition.id)
        a_cursor.execute(arg_in_p_q)
        argument_rows = a_cursor.fetchall()

        for a_argument_row in argument_rows:
            a_argument_id = a_argument_row["id"]
            a_argument_type = a_argument_row["type"]
            # Step 2: keep only the ARG0 arguments.
            if a_argument_type != "ARG0":
                continue
            n_in_arg_q = "select * from argument_node where argument_id = '%s';" % (a_argument_id)
            a_cursor.execute(n_in_arg_q)
            argument_node_rows = a_cursor.fetchall()

            for a_argument_node_row in argument_node_rows:
                a_node_id = a_argument_node_row["node_id"]
                # Step 3: count names attached to the argument node itself.
                a_ne_node_query = "select * from name_entity where subtree_id = '%s';" % (a_node_id)
                a_cursor.execute(a_ne_node_query)
                ne_rows = a_cursor.fetchall()
                for a_ne_row in ne_rows:
                    a_ne_type = a_ne_row["type"]
                    ne_hash[a_ne_type] = ne_hash[a_ne_type] + 1

                # Step 4: count names attached to any subtree of that node.
                a_tree = a_tree_document.get_tree(a_tree_id)
                a_node = a_tree.get_subtree(a_node_id)
                for a_child in a_node.subtrees():
                    a_ne_subtree_query = "select * from name_entity where subtree_id = '%s';" % (a_child.id)
                    a_cursor.execute(a_ne_subtree_query)
                    ne_subtree_rows = a_cursor.fetchall()
                    for a_ne_subtree_row in ne_subtree_rows:
                        a_subtree_ne_type = a_ne_subtree_row["type"]
                        ne_hash[a_subtree_ne_type] = ne_hash[a_subtree_ne_type] + 1

Result:
Name Entity     Frequency
Person          84
GPE             34
Organization    29
NORP            15
...             ...
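The part of this query that looks only at the argument node itself could also be written as a single SQL join. The sketch below demonstrates the idea on an in-memory `sqlite3` database with made-up toy rows. The `argument`, `argument_node` and `name_entity` table and column names follow the slide, while the `proposition` table with a `lemma` column is an assumption (the slide reads lemmas through the Python API). Names found only in subtrees would still need the tree walk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE proposition (id TEXT PRIMARY KEY, lemma TEXT);
CREATE TABLE argument (id TEXT PRIMARY KEY, proposition_id TEXT, type TEXT);
CREATE TABLE argument_node (argument_id TEXT, node_id TEXT);
CREATE TABLE name_entity (subtree_id TEXT, type TEXT);
""")

# Toy rows, invented purely to exercise the query.
conn.executemany("INSERT INTO proposition VALUES (?, ?)",
                 [("p1", "say"), ("p2", "say"), ("p3", "aim")])
conn.executemany("INSERT INTO argument VALUES (?, ?, ?)",
                 [("a1", "p1", "ARG0"), ("a2", "p1", "ARG1"),
                  ("a3", "p2", "ARG0"), ("a4", "p3", "ARG0")])
conn.executemany("INSERT INTO argument_node VALUES (?, ?)",
                 [("a1", "n1"), ("a2", "n2"), ("a3", "n3"), ("a4", "n4")])
conn.executemany("INSERT INTO name_entity VALUES (?, ?)",
                 [("n1", "PERSON"), ("n2", "GPE"), ("n3", "ORG"), ("n4", "PERSON")])

QUERY = """
SELECT ne.type, COUNT(*) AS frequency
FROM proposition AS p
JOIN argument AS a ON a.proposition_id = p.id
JOIN argument_node AS an ON an.argument_id = a.id
JOIN name_entity AS ne ON ne.subtree_id = an.node_id
WHERE p.lemma = ? AND a.type = ?
GROUP BY ne.type
ORDER BY frequency DESC, ne.type
"""

# Bound parameters replace the string formatting used in the loop version.
distribution = conn.execute(QUERY, ("say", "ARG0")).fetchall()
```

Pushing the filtering and counting into the database avoids one round trip per proposition and per argument.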
Conclusion
• Integrating the annotation layers using a relational
schema
– Improves consistency
– Allows predictive features that combine evidence from
multiple layers
• Easily Accessible
– Through Python API
– SQL queries
Interoperability:
Components
A shared, formal type system
allows multiple data formats to
be combined effectively
[Architecture diagram]
• Collection Readers: XCAS, File System, OntoNotes, ADB
• UIMA Analysis Engine: Customer’s annotators
• CAS Consumers: XCAS, OntoNotes, ADB
• Storage: XML and TXT files (file storage); OntoNotes and Annotations DB (RDBMS storage)
• Key: New UIMA wrapper / Existing UIMA wrapper
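The reader, engine and consumer roles in this architecture can be mimicked in a few lines of plain Python. This is an illustration of the pattern only, not UIMA code: every class and function name below is hypothetical, and the toy annotator merely marks capitalized words.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type: str
    begin: int
    end: int


@dataclass
class Document:
    """Shared in-memory representation passed between components."""
    text: str
    annotations: list = field(default_factory=list)


class ListReader:
    """Stands in for a collection reader over any storage format."""
    def __init__(self, texts):
        self.texts = texts

    def read(self):
        for text in self.texts:
            yield Document(text)


def capitalized_word_annotator(doc):
    """Toy annotator: adds one annotation per capitalized word."""
    position = 0
    for word in doc.text.split(" "):
        if word[:1].isupper():
            doc.annotations.append(Annotation("Capitalized", position, position + len(word)))
        position += len(word) + 1


class MemoryConsumer:
    """Stands in for a consumer that persists results to some store."""
    def __init__(self):
        self.rows = []

    def consume(self, doc):
        for a in doc.annotations:
            self.rows.append((a.type, doc.text[a.begin:a.end]))


def run_pipeline(reader, annotators, consumers):
    for doc in reader.read():
        for annotate in annotators:
            annotate(doc)
        for consumer in consumers:
            consumer.consume(doc)


consumer = MemoryConsumer()
run_pipeline(ListReader(["India would develop an ICBM"]),
             [capitalized_word_annotator], [consumer])
```

Because readers and consumers only touch the shared `Document` type, a reader for one store can be paired with a consumer for another, which is the point of the shared type system in the diagram.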
Issues for Discussion
• Persistence formats optimize for different
concerns
– RDBMS – relational querying, update
– XCAS – fast deserialization of run-time
objects
• Consider extending schema to hold XML
serialization of document annotations
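The last suggestion can be pictured as one extra column that stores a serialized copy of a document's annotations next to the relational data. The sketch below uses `sqlite3` and a made-up XML layout. It is not XCAS, and the table name, column names and helper functions are assumptions.

```python
import sqlite3
from xml.etree import ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE document (docno TEXT PRIMARY KEY, text TEXT, annotations_xml TEXT)"
)


def serialize(annotations):
    """Serialize a list of attribute dictionaries into one XML string."""
    root = ET.Element("annotations")
    for annotation in annotations:
        ET.SubElement(root, "entity", {k: str(v) for k, v in annotation.items()})
    return ET.tostring(root, encoding="unicode")


def deserialize(xml_text):
    """Rebuild the attribute dictionaries (all values come back as strings)."""
    return [dict(element.attrib) for element in ET.fromstring(xml_text)]


annotations = [{"type": "org", "offset": 21, "length": 12}]
conn.execute(
    "INSERT INTO document VALUES (?, ?, ?)",
    ("doc-1", "In an interview with Defense News, ...", serialize(annotations)),
)

stored = conn.execute(
    "SELECT annotations_xml FROM document WHERE docno = ?", ("doc-1",)
).fetchone()[0]
restored = deserialize(stored)
```

The relational rows remain the source for querying and update, while the stored serialization allows a whole document's annotations to be loaded in one read.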