Supporting Annotation Layers for Natural Language Processing

Archana Ganapathi, Preslav Nakov,
Ariel Schwartz, and Marti Hearst
Computer Science Division and SIMS
University of California, Berkeley
Motivation
 Most natural language processing (NLP) algorithms
make use of the results of previous processing steps,
e.g.:
   Tokenizer
   Part-of-speech tagger
   Phrase boundary recognizer
   Syntactic parser
   Semantic tagger
 There is no standard way to represent, store and retrieve
text annotations efficiently.
 MEDLINE has close to 13 million abstracts, and full text
is starting to become available as well.
Text Annotation Framework
 Annotations are stored independently of text
in an RDBMS
 Declarative query language for annotation
retrieval
 Indexing structure designed for efficient query
processing
 Object Oriented API for annotations: insertion,
deletion and modification
Key Contributions
 Support for hierarchical and overlapping layers of
annotation
 Querying multiple levels of annotations simultaneously
 First to evaluate different physical database designs
 Focused on scaling annotation-based queries to very
large corpora with many layers of annotations
 We propose a query language and demonstrate its
power and the efficiency of the indexing architecture on
a wide variety of query types that have been published in
the NLP literature.
Outline
 Related Work
 Layered Query Language
 Database Design
 API
 Evaluation
 Conclusions
Related Work

 Annotation graphs (AG): directed
acyclic graph; nodes can have time
stamps or are constrained via paths
to labeled parents and children (Bird
and Liberman, 2001).
 Emu system: sequential levels of
annotations. Hierarchical relations
may exist between different levels,
but must be explicitly defined for
each pair (Cassidy & Harrington, 2001).
 The Q4M query language for
MATE: directed graph; constraints
and ordering of the annotated
components. Stored in XML
(McKelvie et al., 2001).
 TIQL: queries consist of manipulating
intervals of text, indicated by XML
tags; supports set operations
(Nenadic et al., 2002).
Annotation Graphs
Find arcs labeled as words, whose phonetic
transcription starts with an “hv”:
SELECT I
WHERE X.[id:I].Y <- db/wrd
X.[:hv].[]*.Y <- db/phn;
Emu
Find sequences of phonetic “A” followed by
“p”, both dominated by an “S” syllable:
[[Phonetic=A -> Phonetic=p] ^ Syllable=S]
Q4M (MATE system)
Find nouns followed by the word “lesser”:
($a word) ($b word);
($a pos ~ "NN") && ($a <> $b)
&& ($b # ~ "lesser")
TIQL (TIMS system)
Find sentences containing the noun phrase
“COUP-TF II” and the verb “inhibit”:
(<SENTENCE>  <TERM nf=‘COUP TF II’>)
 <V lemma=‘inhibit’>
Layers of Annotations
[Figure: the sentence “Overexpression of Bcl-2 results in insufficient white blood cell death and activation of p53.” shown with five aligned annotation layers: Word (token IDs), Part of Speech (e.g. NN, IN, VBZ, JJ, CC), Shallow Parse (NP, VP, PP), Gene/protein (gene IDs), and Ontology (MeSH IDs such as D001773 and D016923).]
Full parse, sentence and section layers are not shown.
Layers of Annotation (cont.)
 Each annotation represents an interval spanning a sequence of
characters
   absolute start and end positions
 Each layer corresponds to a conceptually different kind of
annotation
   e.g., word, gene/protein, shallow parse
   can have several layers with the same semantics
 Layers can be
   sequential
   overlapping
       e.g., two multiple-word concepts sharing a word
   hierarchical
       spanning, when the intervals are nested as in a parse tree,
or
       ontologically, when the token itself is derived from a
hierarchical ontology
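The interval view of annotations can be sketched in a few lines of Python. This is an illustration only, not the authors' code: the `Annotation` class, the `relation` helper and the character offsets (counted within the phrase "white blood cell death") are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    layer: str   # layer name, e.g. "word" or "mesh" (illustrative names)
    start: int   # start character position (inclusive)
    end: int     # end character position (inclusive)
    label: str


def relation(a: Annotation, b: Annotation) -> str:
    """Classify how the character intervals of two annotations relate."""
    if a.end < b.start or b.end < a.start:
        return "sequential"   # disjoint intervals, one precedes the other
    a_contains_b = a.start <= b.start and b.end <= a.end
    b_contains_a = b.start <= a.start and a.end <= b.end
    if a_contains_b or b_contains_a:
        return "spanning"     # one interval is nested inside the other
    return "overlapping"      # the intervals share only part of their text


# Hypothetical offsets inside the phrase "white blood cell death".
white = Annotation("word", 0, 4, "white")
blood_cell = Annotation("mesh", 6, 15, "blood cell")
cell = Annotation("mesh", 12, 15, "cell")
cell_death = Annotation("mesh", 12, 21, "cell death")

print(relation(white, blood_cell))       # disjoint intervals
print(relation(blood_cell, cell_death))  # both cover the word "cell"
print(relation(blood_cell, cell))        # "cell" is nested in "blood cell"
```

Ontological hierarchy is not captured by the intervals alone; it depends on the labels assigned to the tokens.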
Layer Type Properties
 One-to-one correspondence between the Word and the
Part-of-speech (POS) layers.
 The Word, POS and Shallow parse layers are sequential.
 The Full parse layer is spanning hierarchical.
 The Gene/protein layer assigns IDs from the LocusLink
database of gene names
   many-to-one in the case of multiple species
 The Ontology layer assigns terms from the hierarchical
medical ontology MeSH (Medical Subject Headings)
   Overlapping (share the word cell) and hierarchical:
       both spanning, since blood cell (with MeSH ID D001773)
spans cell (which is also in MeSH), and
       ontologically, since blood cell is a kind of cell and
cell death (D016923) is a type of Biological Phenomena.
Layered Query Language
 Requirements for the query language on layers of
annotations:
   Intuitive
   Compact
   Declarative
   Expressive power for real world queries
   Support for hierarchical and overlapping annotations
   Compatible with SQL
 LQL (Layered Query Language)
   XML-like
   Can be translated to SQL to run against an RDBMS
   Tested on real world bioscience NLP applications
LQL by Example
(a) Protein-Protein Interaction
<document
  <sentence
    <gene_protein> {print tag_type}
    ...
    <pos [tag_type=verb]
      <word [lex=results]> >
    ...
    <gene_protein> {print tag_type}
  > {print}
> {print document.id}
(Blaschke et al., 1999)
(b) Protein-Protein Interaction
<sentence
  <shallow_parse [tag_type=NP]>{$np1=lex}
  <pos [tag_type=verb]
    <word [lex=binds]>>
  <pos [tag_type=prep]
    <word [lex=to]>>
  <shallow_parse [tag_type=NP]>{$np2=lex}
> {print $np1, $np2, sentence}
(Thomas et al., 2000)
(c) Descent of Hierarchy:
<shallow_parse [tag_type=NP]
  <pos [tag_type=noun]
    ^<mesh [label=G07.553*]>{print}$ >
  <pos [tag_type=noun]
    ^<mesh [label=D*]>{print}$ >$ >
(Rosario et al., 2002)
Example output:
MeSH label     MeSH label
G07.553.481    D12.776.124.050.250
G07.553.481    D12.776.124.125.500
G07.553.481    D12.776.811.300
G07.553.481    D24.185.119.490
[Figure: MeSH category pair A01 A07, with the example noun compounds limb:vein and shoulder:artery.]
(d) Acronym-Meaning Extraction
<shallow_parse [tag_type=NP]>{print}
<pos [tag_type=$]>
<shallow_parse [tag_type=NP]>{print}
<pos [tag_type=$]>
(Pustejovsky et al., 2001)
LQL Syntax
 “< >” Defines an arbitrary range over text.
 A range is typically restricted to a specific layer type using <layer_name>.
 All layers have a lex (the text spanned by the range) and a tag_type








attribute.
Predicates on attribute values are enclosed in square brackets,
i.e. “<layer_name [attribute_name{ = | ! = | > | > = | < | < = } value]>”.
The language supports the boolean operators
conjunction (&&), disjunction (||), and negation (!).
By default tokens must follow each other immediately.
The ellipses (...) indicate that tokens may intervene in between the
specified ranges.
A range is optionally followed by an action statement, enclosed in curly
braces, which binds variables or specifies what should be printed, e.g.
“<gene_protein> {print tag_type}”.
With no arguments, print outputs the value of the lex attribute.
The two special characters ‘ˆ’ and ‘$’ are used to match the range’s
beginning and end positions, respectively.
‘*’ is used as a wildcard; can be used to descend an ontological hierarchy.
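The wildcard-based descent of a hierarchy can be illustrated with a small Python sketch. It assumes that a trailing ‘*’ behaves as a plain prefix match on the MeSH tree label; the `matches` helper is hypothetical, and the labels are the ones shown in the Descent of Hierarchy example.

```python
def matches(pattern: str, value: str) -> bool:
    """Assumed wildcard semantics: a trailing '*' matches any suffix."""
    if pattern.endswith("*"):
        return value.startswith(pattern[:-1])
    return value == pattern


# MeSH tree labels taken from the Descent of Hierarchy example.
labels = [
    "G07.553.481",
    "D12.776.124.050.250",
    "D12.776.124.125.500",
    "D12.776.811.300",
    "D24.185.119.490",
]

# Everything below the D subtree, then a narrower descent into D12.776.124.
under_d = [label for label in labels if matches("D*", label)]
under_d12_776_124 = [label for label in labels if matches("D12.776.124*", label)]
print(under_d)
print(under_d12_776_124)
```

Because tree labels encode the path from the root, a longer prefix selects a deeper subtree. In SQL the same test corresponds to a LIKE predicate with a trailing ‘%’.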
Additional LQL Features
 For spanning hierarchical layers we can have hierarchical queries with
several nested references to the same layer. The following query finds a
PP of the form preposition+NP and prints that NP:
<full_parse [tag_type=PP]
  ^<pos [tag_type=prep]>
  <full_parse [tag_type=NP]>{print} $>
 The keyword noorder allows an arbitrary order for the tokens within a
range, e.g.:
<sentence [noorder]
<gene_protein>
<pos [tag_type=verb]>
> {print sentence}
 The language allows for a combination of ordered and unordered
constraints. For example,
<sentence [noorder]
<gene_protein>
( <pos [tag_type=verb] <word [lex=binds]>>
<pos [tag_type=prep] <word [lex=to]>> )
> {print sentence}
 LQL currently does not support a range overlap operator.
LQL and SQL
 LQL can be automatically translated into SQL (although this is
not yet implemented), as:
   a user-defined function, or
   a macro
 The result of an LQL query is a relation
 This allows the use of standard SQL syntax such as GROUP
BY, COUNT, DISTINCT, ORDER BY, UNION etc.
 An added advantage of LQL over SQL is that LQL queries
do not need to be modified if the underlying logical design is
changed.
 LQL is still a work in progress:
   We plan to assess it via usability studies with computational
linguistics researchers, modifying it as necessary.
   However, we feel it is more intuitive and easier to use for text
processing than the existing languages.
LQL Versus SQL
LQL:
<document <shallow_parse [tag_type=NP]
<pos [tag_type=noun] ^<mesh [label=C21*]>{print label}$ >
<pos [tag_type=noun] ^<mesh [label=G10*]>{print label}$ >$ >
>{print document.id}
SQL:
SELECT nn1.pmid, mt1.tree_number, mt2.tree_number
FROM biotext_annotation_1 np
JOIN biotext_annotation_1 nn1 ON
np.pmid = nn1.pmid
AND np.section = nn1.section
AND np.start_char_pos <= nn1.start_char_pos
AND np.end_char_pos >= nn1.end_char_pos
JOIN biotext_annotation_1 nn2 ON
np.pmid = nn2.pmid
AND np.section = nn2.section
AND nn2.start_char_pos > nn1.end_char_pos
AND np.start_char_pos <= nn2.start_char_pos
AND np.end_char_pos = nn2.end_char_pos
JOIN biotext_annotation_1 mesh1 ON
np.pmid = mesh1.pmid
AND np.section = mesh1.section
AND nn1.start_char_pos = mesh1.start_char_pos
AND nn1.end_char_pos = mesh1.end_char_pos
JOIN biotext_annotation_1 mesh2 ON
np.pmid = mesh2.pmid
AND np.section = mesh2.section
AND nn2.start_char_pos = mesh2.start_char_pos
AND nn2.end_char_pos = mesh2.end_char_pos
JOIN biotext_annotation_mesh_tree mt1 ON
mt1.descriptor_ui = mesh1.tag_type
JOIN biotext_annotation_mesh_tree mt2 ON
mt2.descriptor_ui = mesh2.tag_type
WHERE nn1.layer_id = 1
AND nn2.layer_id = 1
AND np.layer_id = 3
AND nn1.tag_type in (27,30)
AND nn2.tag_type in (27,30)
AND np.tag_type = 31
AND mesh1.layer_id = 6
AND mesh2.layer_id = 6
AND NOT EXISTS (
SELECT word.pmid
FROM biotext_annotation_1 word
WHERE word.layer_id = 0
AND word.pmid = nn1.pmid
AND word.section = nn1.section
AND word.start_char_pos < nn2.start_char_pos
AND word.end_char_pos > nn1.end_char_pos
)
AND mt1.tree_number like 'C21%'
AND mt2.tree_number like 'G10%'
Database Design



 We evaluated 5 different logical and physical database designs.
 The basic model is similar to that of TIPSTER (Grishman,
1996). Each annotation is stored as a record in a relation.
 Architecture 1 contains the following columns:
  1. docid: document ID;
  2. section: title, abstract or body text;
  3. layer_id: a unique identifier of the annotation layer;
  4. start_char_pos: starting character position, relative to
the particular section and docid;
  5. end_char_pos: end character position, relative to the
particular section and docid;
  6. tag_type: a layer-specific token unique identifier.
 There is a separate table mapping token IDs to entities (the string
in the case of a word, the MeSH label(s) in the case of a MeSH term, etc.)
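A minimal sketch of Architecture 1, assuming an in-memory SQLite database in place of DB2. The table names `annotation` and `token` and the column `entity` are illustrative choices; the rows come from the “Kinase inhibits RAG-1.” example relation. The query returns the words tagged as nouns (POS tag 27) that lie inside an NP (shallow parse tag 31), using the same containment join on character positions as the SQL on the LQL Versus SQL slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE annotation (docid INTEGER, section TEXT, layer_id INTEGER, "
    "start_char_pos INTEGER, end_char_pos INTEGER, tag_type INTEGER)"
)
# Separate table mapping token IDs to entities (word strings only, for brevity).
conn.execute("CREATE TABLE token (layer_id INTEGER, tag_type INTEGER, entity TEXT)")

conn.executemany("INSERT INTO annotation VALUES (?, ?, ?, ?, ?, ?)", [
    (3345, "b", 0, 34, 39, 59571),  # word layer
    (3345, "b", 0, 41, 48, 55608),
    (3345, "b", 0, 50, 54, 89985),
    (3345, "b", 1, 34, 39, 27),     # POS layer: 27 = NN, 53 = VB
    (3345, "b", 1, 41, 48, 53),
    (3345, "b", 1, 50, 54, 27),
    (3345, "b", 3, 34, 39, 31),     # shallow parse layer: 31 = NP, 59 = VP
    (3345, "b", 3, 41, 48, 59),
    (3345, "b", 3, 50, 54, 31),
])
conn.executemany("INSERT INTO token VALUES (?, ?, ?)", [
    (0, 59571, "Kinase"), (0, 55608, "inhibits"), (0, 89985, "RAG-1"),
])

NOUNS_IN_NP = """
SELECT t.entity
FROM annotation np
JOIN annotation nn                       -- noun contained in the NP
  ON np.docid = nn.docid AND np.section = nn.section
 AND np.start_char_pos <= nn.start_char_pos
 AND np.end_char_pos >= nn.end_char_pos
JOIN annotation w                        -- word with the same span as the noun
  ON w.docid = nn.docid AND w.section = nn.section
 AND w.start_char_pos = nn.start_char_pos
 AND w.end_char_pos = nn.end_char_pos
JOIN token t ON t.layer_id = w.layer_id AND t.tag_type = w.tag_type
WHERE np.layer_id = 3 AND np.tag_type = 31
  AND nn.layer_id = 1 AND nn.tag_type = 27
  AND w.layer_id = 0
ORDER BY w.start_char_pos
"""
nouns_in_np = [row[0] for row in conn.execute(NOUNS_IN_NP)]
print(nouns_in_np)
```

Every layer lives in the same relation, so a query over three layers becomes a three-way self join filtered by layer_id.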
Database Design (cont.)
 Architecture 2 introduces one additional column,
sequence_pos, thus defining an ordering for each
layer.
   Simplifies some SQL queries as there is no need for
“NOT EXISTS” self joins, which are required under
Architecture 1 in cases where tokens from the same
layer must follow each other immediately.
 Architecture 3 adds sentence_id, which is the
number of the current sentence, and redefines
sequence_pos as relative to both layer_id and
sentence_id.
   Simplifies most queries since they are often limited to
the same sentence.
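The difference between the two adjacency tests can be sketched as follows, again assuming SQLite instead of DB2 and using the POS rows of the “Kinase inhibits RAG-1.” example. The table name and the helper `adjacent_pairs` are illustrative. Both queries look for a token with one POS tag immediately followed by a token with another; the first uses only character positions (Architecture 1), the second uses sequence_pos (Architecture 2).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE annotation (docid INTEGER, section TEXT, layer_id INTEGER, "
    "start_char_pos INTEGER, end_char_pos INTEGER, tag_type INTEGER, "
    "sequence_pos INTEGER)"
)
conn.executemany("INSERT INTO annotation VALUES (?, ?, ?, ?, ?, ?, ?)", [
    (3345, "b", 1, 34, 39, 27, 1),  # Kinase   NN
    (3345, "b", 1, 41, 48, 53, 2),  # inhibits VB
    (3345, "b", 1, 50, 54, 27, 3),  # RAG-1    NN
])

# Architecture 1: adjacency means "no token of the layer lies in between".
ARCH1_SQL = """
SELECT a.start_char_pos, b.start_char_pos
FROM annotation a JOIN annotation b
  ON a.docid = b.docid AND a.section = b.section
 AND b.start_char_pos > a.end_char_pos
WHERE a.layer_id = 1 AND b.layer_id = 1
  AND a.tag_type = :first AND b.tag_type = :second
  AND NOT EXISTS (
      SELECT 1 FROM annotation x
      WHERE x.layer_id = 1 AND x.docid = a.docid AND x.section = a.section
        AND x.start_char_pos > a.end_char_pos
        AND x.end_char_pos < b.start_char_pos)
ORDER BY a.start_char_pos
"""

# Architecture 2: adjacency is a simple arithmetic test on sequence_pos.
ARCH2_SQL = """
SELECT a.start_char_pos, b.start_char_pos
FROM annotation a JOIN annotation b
  ON a.docid = b.docid AND a.section = b.section
 AND a.layer_id = b.layer_id
 AND b.sequence_pos = a.sequence_pos + 1
WHERE a.layer_id = 1 AND a.tag_type = :first AND b.tag_type = :second
ORDER BY a.start_char_pos
"""


def adjacent_pairs(sql, first_tag, second_tag):
    """Run one of the adjacency queries for the given pair of POS tags."""
    return conn.execute(sql, {"first": first_tag, "second": second_tag}).fetchall()


print(adjacent_pairs(ARCH1_SQL, 27, 53))  # noun immediately followed by verb
print(adjacent_pairs(ARCH2_SQL, 27, 53))  # same answer without NOT EXISTS
```

The two nouns are not adjacent because the verb sits between them, so both queries return no rows for the tag pair (27, 27).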
Database Design (cont.)
 Architecture 4 merges the word and POS layers,
and adds word_id assuming a one-to-one
correspondence between them.
   Reduces the number of stored annotations and the
number of joins in queries with both word and POS
constraints.
 Architecture 5 replaces sequence_pos with
first_word_pos and last_word_pos, which
correspond to the sequence_pos of the first/last word
covered by the annotation.
   Requires all annotation boundaries to coincide with
word boundaries.
   Copes naturally with adjacency constraints between
different layers.
   Allows for a simpler indexing structure.
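How first_word_pos and last_word_pos could be derived from character spans is sketched below. The helpers `word_positions` and `adjacent` are hypothetical; the word spans are those of “Kinase inhibits RAG-1.” from the example relation, and word positions are assumed to start at 1 as in that relation.

```python
# Character spans of the words "Kinase", "inhibits", "RAG-1", in order.
words = [(34, 39), (41, 48), (50, 54)]


def word_positions(start, end, words):
    """Map an annotation's character span to (first_word_pos, last_word_pos)."""
    starts = {s: i for i, (s, _) in enumerate(words, start=1)}
    ends = {e: i for i, (_, e) in enumerate(words, start=1)}
    if start not in starts or end not in ends:
        # Architecture 5 requires boundaries to coincide with word boundaries.
        raise ValueError("annotation boundaries must coincide with word boundaries")
    return starts[start], ends[end]


def adjacent(a, b):
    """True if annotation b starts at the word right after annotation a ends."""
    return b[0] == a[1] + 1


gene = word_positions(34, 39, words)         # gene/protein annotation on "Kinase"
verb = word_positions(41, 48, words)         # POS annotation on "inhibits"
verb_phrase = word_positions(41, 54, words)  # hypothetical span over two words
print(gene, verb, verb_phrase)
print(adjacent(gene, verb))  # adjacency across two different layers
```

Since every layer is expressed in the same word coordinates, adjacency between annotations from different layers reduces to one integer comparison.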
An Example Relation
Example: “Kinase inhibits RAG-1.”
PMID  SECTION   LAYER_ID      START_CHAR_POS  END_CHAR_POS  TAG_TYPE  SEQUENCE_POS  SENTENCE  WORD_ID  FIRST_WORD_POS  LAST_WORD_POS
3345  b (body)  0 (word)      34              39            59571     1             2         59571    1               1
3345  b         0             41              48            55608     2             2         55608    2               2
3345  b         0             50              54            89985     3             2         89985    3               3
3345  b         1 (POS)       34              39            27 (NN)   1             2         59571    1               1
3345  b         1             41              48            53 (VB)   2             2         55608    2               2
3345  b         1             50              54            27        3             2         89985    3               3
3345  b         3 (s.parse)   34              39            31 (NP)   1             2                  1               1
3345  b         3             41              48            59 (VP)   2             2                  2               2
3345  b         3             50              54            31        3             2                  3               3
3345  b         5 (gene)      34              39            39 (prt)  1             2                  1               1
3345  b         5             50              54            39        2             2                  3               3
3345  b         6 (mesh)      34              39            10770     1             2                  1               1
3345  b         6             50              54            16654     2             2                  3               3
Columns by architecture: PMID through TAG_TYPE form the basic architecture; SEQUENCE_POS is added in architecture 2; SENTENCE in architecture 3; WORD_ID in architecture 4; FIRST_WORD_POS and LAST_WORD_POS in architecture 5.
Indexing Structure
 Two types of composite indexes: forward and
inverted.
   An index lookup can be performed on any column
combination that corresponds to an index prefix.
   The forward indexes support lookup based on position
in a given document.
   The inverted indexes support lookup based on
annotation values (i.e., tag type and word id).
 Most query plans involve both forward and inverted
indexes
   Join statistics would have been useful
 Detailed statistics are essential.
   Standard statistics in DB2 are insufficient.
 Records are clustered on their primary key
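The index-prefix rule can be made concrete with a toy model. This is a simplification for illustration, not how the DB2 optimizer chooses plans: `usable_prefix` counts how many leading index columns are bound by equality predicates, and `choose_index` picks the index with the longer usable prefix. The column orders are those of the Arch 1-4 forward and inverted indexes.

```python
FORWARD = ("docid", "section", "layer_id",
           "start_char_pos", "end_char_pos", "tag_type")
INVERTED = ("layer_id", "tag_type", "docid", "section",
            "start_char_pos", "end_char_pos")


def usable_prefix(index_cols, constrained):
    """Count the leading index columns that the query constrains."""
    n = 0
    for col in index_cols:
        if col not in constrained:
            break
        n += 1
    return n


def choose_index(constrained):
    """Pick the index with the longest usable prefix (toy cost model)."""
    candidates = {"forward": FORWARD, "inverted": INVERTED}
    return max(candidates, key=lambda name: usable_prefix(candidates[name], constrained))


# "All annotations with a given tag in a given layer": lookup by value.
print(choose_index({"layer_id", "tag_type"}))
# "All annotations of a layer in one section of one document": lookup by position.
print(choose_index({"docid", "section", "layer_id"}))
```

A join that starts from an annotation value and then moves to neighbouring positions in the same document would use the inverted index for the first step and the forward index for the second.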
Indexing Structure (cont.)
Type: F = forward index, I = inverted index.
Architecture  Type  Columns
Arch 1-4      F     *DOCID +SECTION +LAYER_ID +START_CHAR_POS +END_CHAR_POS +TAG_TYPE
Arch 1-4      I     LAYER_ID +TAG_TYPE +DOCID +SECTION +START_CHAR_POS +END_CHAR_POS
Arch 2        F     DOCID +SECTION +LAYER_ID +SEQUENCE_POS +TAG_TYPE +START_CHAR_POS +END_CHAR_POS
Arch 2        I     LAYER_ID +TAG_TYPE +DOCID +SECTION +SEQUENCE_POS +START_CHAR_POS +END_CHAR_POS
Arch 3-4      F     DOCID +SECTION +LAYER_ID +SENTENCE +SEQUENCE_POS +TAG_TYPE +START_CHAR_POS +END_CHAR_POS
Arch 3-4      I     LAYER_ID +TAG_TYPE +DOCID +SECTION +SENTENCE +SEQUENCE_POS +START_CHAR_POS +END_CHAR_POS
Arch 4        I     WORD_ID +LAYER_ID +TAG_TYPE +DOCID +SECTION +START_CHAR_POS +END_CHAR_POS +SENTENCE +SEQUENCE_POS
Arch 5        F     *DOCID +SECTION +LAYER_ID +SENTENCE +FIRST_WORD_POS +LAST_WORD_POS +TAG_TYPE
Arch 5        I     LAYER_ID +TAG_TYPE +DOCID +SECTION +SENTENCE +FIRST_WORD_POS +LAST_WORD_POS
Arch 5        I     WORD_ID +LAYER_ID +TAG_TYPE +DOCID +SECTION +SENTENCE +FIRST_WORD_POS
API
 Java-based API allows for simple insertion,
deletion and modification of annotations.
   Need to specify document ID, section, layer
ID, and positional information.
   Supports editing a collection of annotations
and storing them back to the database.
 We plan to develop a user interface for
viewing, editing and querying annotations.
   Not a trivial task, since there are many HCI
issues on how to display annotations
effectively.
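The shape of such an API might look like the sketch below. It is not the authors' Java API: the class `AnnotationStore`, its method names, and the use of Python with an in-memory SQLite table are all assumptions made to show which arguments each operation needs (document ID, section, layer ID and positions).

```python
import sqlite3


class AnnotationStore:
    """Hypothetical annotation store; illustrative only."""

    _KEY = ("docid = ? AND section = ? AND layer_id = ? "
            "AND start_char_pos = ? AND end_char_pos = ?")

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE annotation (docid INTEGER, section TEXT, "
            "layer_id INTEGER, start_char_pos INTEGER, end_char_pos INTEGER, "
            "tag_type INTEGER)"
        )

    def insert(self, docid, section, layer_id, start, end, tag_type):
        """Store one annotation."""
        self.conn.execute(
            "INSERT INTO annotation VALUES (?, ?, ?, ?, ?, ?)",
            (docid, section, layer_id, start, end, tag_type),
        )

    def modify(self, docid, section, layer_id, start, end, tag_type):
        """Change the tag of the annotation(s) at a position; returns rows changed."""
        cur = self.conn.execute(
            "UPDATE annotation SET tag_type = ? WHERE " + self._KEY,
            (tag_type, docid, section, layer_id, start, end),
        )
        return cur.rowcount

    def delete(self, docid, section, layer_id, start, end):
        """Remove the annotation(s) at a position; returns rows removed."""
        cur = self.conn.execute(
            "DELETE FROM annotation WHERE " + self._KEY,
            (docid, section, layer_id, start, end),
        )
        return cur.rowcount

    def fetch(self, docid, section, layer_id):
        """Return (start, end, tag_type) for one layer of one section, in order."""
        return self.conn.execute(
            "SELECT start_char_pos, end_char_pos, tag_type FROM annotation "
            "WHERE docid = ? AND section = ? AND layer_id = ? "
            "ORDER BY start_char_pos",
            (docid, section, layer_id),
        ).fetchall()


store = AnnotationStore()
store.insert(3345, "b", 5, 34, 39, 39)  # gene/protein annotation on "Kinase"
store.insert(3345, "b", 5, 50, 54, 39)  # gene/protein annotation on "RAG-1"
print(store.fetch(3345, "b", 5))
```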
Experimental Setup
 Annotated 13,504 MEDLINE abstracts
   Stanford Lexicalized Parser (Klein and Manning, 2003) for
sentence splitting, word tokenization, POS tagging and
parsing.
   We wrote a shallow parser and tools for gene and MeSH
term recognition.
 This resulted in 10,910,243 records stored in an IBM
DB2 Universal Database Server.
 Defined 4 workloads based on variants of queries (a-d).
Workload         (a)     (b)    (c)   (d)
#Queries         54      11     50    1
#Results/query   303.4   77.5   1.6   16,701
LQL lines        8       6      5     4
Results
Workload      (a)                               (b)
Architecture  1     2     3     4     5         1     2     3     4     5
SQL lines     37    37    34    29    29        91    77    75    65    50
# Joins       6     6     6     5     5         12    11    11    9     7
Time (sec)    3.98  4.35  3.59  1.69  1.94      3.88  5.68  5.41  3.85  3.55

Workload      (c)                                   (d)
Architecture  1     2      3      4      5          1      2      3      4      5
SQL lines     45    38     38     39     41         59     50     53     53     35
# Joins       7     6      6      6      6          7      7      7      7      4
Time (sec)    17.9  23.42  21.49  30.07  4.06       1,879  1,700  2,182  1,682  1,582

Space (MB)     Arch 1   Arch 2    Arch 3    Arch 4    Arch 5
Data Storage   168.5    168.5     168.5     132.5     136.5
Index Storage  617.0    1,397.0   1,441.0   1,182.0   673.5
Total Storage  785.5    1,565.5   1,609.5   1,314.5   810.0
Results (cont.)
 Different architectures are optimized for different types of
queries.
 Architecture 5 performs well (if not best) on all query types,
while the other architectures perform poorly on at least one
query type.
 Storage requirement of Architecture 5 is comparable to that of
Architecture 1
 Architecture 5 results in much simpler queries
 We recommend Architecture 5 in most cases, or
Architecture 1 if an atomic annotation layer cannot be defined.
Scalability Analysis


 Combined workload of 3 query types
 Varying buffer pool sizes
Buffer Pool Size (MB)   Elapsed Time (ms)   Buffer Read Time (ms)
1000                    2300                1050
100                     2900                1670
10                      4600                3340
1                       8300                6250
 Suggests that query execution time grows sub-linearly
as the memory size shrinks: a 1000-fold smaller buffer pool
increases the elapsed time by less than a factor of four.
 We believe a similar ratio will be observed when
increasing the database size and keeping the
memory size fixed
 Parallel query execution can be enabled after
partitioning the annotations on document_id
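The sub-linear behaviour can be checked directly from the four measurements. The sketch below fits a least-squares line to log10(elapsed time) against log10(buffer pool size); the power-law model and the helper `loglog_slope` are assumptions used only to summarize the table, not part of the original analysis.

```python
import math

# (buffer pool size in MB, elapsed time in ms) from the scalability table.
measurements = [(1000, 2300), (100, 2900), (10, 4600), (1, 8300)]


def loglog_slope(points):
    """Least-squares slope of log10(time) against log10(buffer pool size)."""
    xs = [math.log10(size) for size, _ in points]
    ys = [math.log10(time) for _, time in points]
    n = len(points)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


slope = loglog_slope(measurements)
slowdown = measurements[-1][1] / measurements[0][1]  # 1 MB versus 1000 MB
print(f"exponent: {slope:.2f}, slowdown: {slowdown:.2f}x")
```

An exponent between -1 and 0 means that the elapsed time rises much more slowly than the buffer pool shrinks; a slope of -1 would correspond to time growing in proportion to the reduction in memory.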
Conclusions
 Provided a mechanism to effectively store and query
layers of textual annotations.
 Evaluated various structures for data storage and
have arrived at an efficient and simple one.
 Used variations of queries drawn from published
research, to ensure the real-world applicability.
 Presented a concise language (LQL) to express
queries that span multiple levels of the annotation
structure, which captures the user’s intent better as
the syntax is more intuitive and closely resembles the
annotation structure.
Future Work
 Conduct a usability study to assess the query
language.
 Automate the LQL to SQL translation process.
 Test the scalability of this approach on larger
document collections.
References
 Steven Bird and Mark Liberman. 2001. A formal framework for linguistic
annotation. Speech Communication, 33(1–2):23–60.
 Steve Cassidy and Jonathan Harrington. 2001. Speech annotation and
corpus tools. Speech Communication, 33(1–2):61–77.
 David McKelvie, Amy Isard, Andreas Mengel, Morten B. Moller, Michael
Grosse and Marion Klein. 2001. The MATE workbench: an annotation
tool for XML coded speech corpora. Speech Communication, 33(1–2):97–112.
 Goran Nenadic, Hideki Mima, Irena Spasic, Sophia Ananiadou and
Jun-ichi Tsujii. 2002. Terminology-Driven Literature Mining and
Knowledge Acquisition in Biomedicine. International Journal of Medical
Informatics, 67:33–48.
 Ralph Grishman. 1996. Building an Architecture: a CAWG Saga.
Advances in Text Processing: Tipster Program Phase II, Morgan
Kaufmann.
 Steve Cassidy. 1999. Compiling Multi-tiered Speech Databases into the
Relational Model: Experiments with the Emu System. 6th European
Conference on Speech Communication and Technology (Eurospeech
99), 2127–2130, Budapest, Hungary.
 Xiaoyi Ma, Haejoong Lee, Steven Bird and Kazuaki Maeda. 2002.
Models and Tools for Collaborative Annotation. Third International
Conference on Language Resources and Evaluation, 2066–2073.
Thank You
Questions and
constructive comments
are welcomed
http://biotext.berkeley.edu
