Parallel Entity and Treebank Annotation
Ann Bies – Linguistic Data Consortium*
Seth Kulick – Institute for Research in Cognitive Science*
Mark Mandel – Linguistic Data Consortium*
*University of Pennsylvania
New Frontiers in Corpus Annotation Workshop, 6/29/05
Mining the Bibliome: Information Extraction
from the Biomedical Literature
• NSF ITR grant EIA-0205448
• Collaboration with Division of Oncology,
Children’s Hospital of Philadelphia
• PubMed abstracts – mining cancer literature for
associations that link variations in genes with
malignancies
• http://bioie.ldc.upenn.edu – release 0.9 available:
1157 abstracts entity-annotated, 318 of them also
treebanked
Outline
• Entity Annotation
• Treebank Annotation – modifications from
Penn Treebank guidelines
• Annotation Process and Merged Format
• Entity-Constituent Mapping – How
successful?
Entity Annotation
• Gene X with genomic Variation event Y is
correlated with Malignancy Z
• Gene – composite entity that can refer to a gene or a
protein: Gene-generic, Gene-protein, Gene-RNA
• (Malignancy – under development, not included in
release 0.9)
• Variation Event – Relation between entities
representing different aspects of a variation
Entity Annotation - Variations
• Variation – A relation between variation
component entities
• “a single nucleotide substitution at codon
249, predicting a serine to cysteine amino
acid substitution”
• Var-type – substitution
• Var-location – codon 249
• Var-state-orig – serine
• Var-state-altered – cysteine
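The variation relation above can be pictured as a small record with one slot per component entity. This is only an illustrative sketch: the class name `Variation` and its field names are hypothetical, not part of the release format, and every slot is optional on the assumption that a mention may leave a component unstated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variation:
    """Hypothetical container for one variation relation (names are illustrative)."""
    var_type: Optional[str] = None       # e.g. "substitution"
    location: Optional[str] = None       # e.g. "codon 249"
    state_orig: Optional[str] = None     # e.g. "serine"
    state_altered: Optional[str] = None  # e.g. "cysteine"


# The example from the slide, one component entity per slot.
v = Variation(
    var_type="substitution",
    location="codon 249",
    state_orig="serine",
    state_altered="cysteine",
)
print(v)
```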
A Change in Tokenization
• Tokenization – Many hyphenated words
treated as separate tokens
• “New York-based”
• Old (Penn Treebank) tokenization:
[New] [York-based]
• New tokenization:
[New][York][-][based]
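A minimal sketch of the new tokenization, under the simplifying assumption that every hyphen becomes its own token (the slide says "many" hyphenated words, so the real tokenizer presumably has exceptions that this does not model). The function name `tokenize` is hypothetical.

```python
import re


def tokenize(text):
    """Split on whitespace, then make every hyphen a separate token.

    Assumption: all hyphens are split; the actual guidelines may keep some.
    """
    tokens = []
    for chunk in text.split():
        # A capturing group in re.split keeps the "-" delimiters in the result.
        tokens.extend(piece for piece in re.split(r"(-)", chunk) if piece)
    return tokens


print(tokenize("New York-based"))
print(tokenize("K- and N-ras"))
```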
Discontinuous Entities
• E.g.: “K- and N-ras”
• Tokenization: [K][-][and][N][-][ras]
• Entity annotation:
• [K][-]… [ras] – “chain” of discontinuous tokens
• [N][-][ras] – Contiguous tokens
• Splitting up not always done, depends on
coordination
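One way to picture a "chain" is as a list of token indices that is allowed to skip tokens. The sketch below uses that hypothetical representation (the release itself uses character offsets) to show why "K-..ras" is discontinuous while "N-ras" is not.

```python
tokens = ["K", "-", "and", "N", "-", "ras"]

# Hypothetical representation: each entity is a sorted list of token indices.
k_ras = [0, 1, 5]   # [K][-] ... [ras], a chain
n_ras = [3, 4, 5]   # [N][-][ras], contiguous


def segments(indices):
    """Group sorted token indices into runs of adjacent tokens."""
    runs = [[indices[0]]]
    for i in indices[1:]:
        if i == runs[-1][-1] + 1:
            runs[-1].append(i)
        else:
            runs.append([i])
    return runs


def is_chain(indices):
    """An entity is a chain when it needs more than one contiguous run."""
    return len(segments(indices)) > 1


for name, entity in (("K-ras", k_ras), ("N-ras", n_ras)):
    pieces = [" ".join(tokens[i] for i in run) for run in segments(entity)]
    print(name, "chain" if is_chain(entity) else "contiguous", pieces)
```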
Treebank Annotation
• Default NP right-branching structure
• (NP (JJ primary) (NN liver) (NN cancer))
• Simplifies multi-token nominal annotation
• Allows recovery of implicit constituents:
• (NP (JJ primary)
(newnode (NN liver) (NN cancer)))
• Entities sometimes map to such implicit
constituents
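The recovery of implicit constituents can be sketched as follows, assuming a flat NP is given as a list of (tag, word) pairs and that the default reading is strictly right-branching. The helper names and the `newnode` label argument are illustrative.

```python
def implicit_constituents(children):
    """Proper suffixes of length >= 2 of a flat child list.

    Under a default right-branching reading these are the constituents
    that the flat annotation leaves implicit.
    """
    return [children[i:] for i in range(1, len(children) - 1)]


def right_branch(label, children, new_label="newnode"):
    """Make the implicit right-branching structure explicit."""
    if len(children) <= 2:
        return (label, list(children))
    return (label, [children[0], right_branch(new_label, children[1:], new_label)])


flat_np = [("JJ", "primary"), ("NN", "liver"), ("NN", "cancer")]
print(implicit_constituents(flat_np))
print(right_branch("NP", flat_np))
```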
Treebank Annotation
• Exceptions to right-branching marked by NML
• So: Any two or more non-final elements that form
a constituent are a NML
• (ADJP (NML (NNP New) (NNP York))
(HYPH -)
(VBN based))
• (ADJP (NML (NN breast) (NN cancer))
(HYPH -)
(VBN associated))
• (NP (NML (NN human) (NN liver) (NN tumor))
(NN analysis))
Treebank Annotation
• Placeholder *P* for distributed material in
coordinated nominal structures
• “K- and N-ras”
(NP (NP (NN K)
        (HYPH -)
        (NML-1 (-NONE- *P*)))
    (CC and)
    (NP (NN N)
        (HYPH -)
        (NML-1 (NN ras))))
Treebank Annotation
• To the left or right
• “codon 12 or 13”
(NP (NP (NML-1 (NN codon))
        (CD 12))
    (CC or)
    (NP (NML-1 (-NONE- *P*))
        (CD 13)))
First Release
• Goal – let users choose how to handle the
integration of entity and treebank levels
• Standoff annotation for entity and treebank
• Identical tokenization
• Merged representation
• Penn Treebank style
• (POSTag:[from..to] terminal)
• Entity listing before each tree.
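As a rough illustration of the `(POSTag:[from..to] terminal)` convention, the sketch below pulls the terminals out of a fragment of merged output with a regular expression. The pattern and the function name `terminals` are assumptions made for this example, not a parser shipped with the release.

```python
import re

# Assumed shape of a terminal: "(" tag ":[" from ".." to "]" whitespace word ")"
TERMINAL = re.compile(
    r"\((?P<pos>[^\s():]+):\[(?P<start>\d+)\.\.(?P<end>\d+)\]\s+(?P<word>[^\s()]+)\)"
)


def terminals(tree_text):
    """Return (pos, start, end, word) for every terminal in the text."""
    return [
        (m.group("pos"), int(m.group("start")), int(m.group("end")), m.group("word"))
        for m in TERMINAL.finditer(tree_text)
    ]


fragment = "(NML (NN:[379..383] exon) (CD:[384..385] 2))"
for item in terminals(fragment):
    print(item)
```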
Merged Output Example
sentence 4 Span:331..605
;In the present study, we screened for
;the K-ras exon 2 point mutations in a
;group of 87 gynecological neoplasms
;[373..378]:gene-rna:"K-ras"
;[379..385]:variation-location:"exon 2"
;[386..401]:variation-type:"point mutations"
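The entity listing can be read back with a short script. This sketch assumes one entity per line, document-level character offsets with an exclusive end, and a sentence string rejoined with single spaces; the names `ENTITY`, `parse_entities` and `SENT_START` are hypothetical.

```python
import re

# Assumed line shape: ;[start..end]:entity-type:"text"
ENTITY = re.compile(r'^;\[(\d+)\.\.(\d+)\]:([^:]+):\s*"(.*)"$')


def parse_entities(lines):
    """Return (start, end, entity_type, text) for each entity line."""
    found = []
    for line in lines:
        m = ENTITY.match(line.strip())
        if m:
            start, end, etype, text = m.groups()
            found.append((int(start), int(end), etype, text))
    return found


SENT_START = 331  # from "sentence 4 Span:331..605"
sentence = ("In the present study, we screened for the K-ras exon 2 "
            "point mutations in a group of 87 gynecological neoplasms")
listing = [
    ';[373..378]:gene-rna:"K-ras"',
    ';[379..385]:variation-location:"exon 2"',
    ';[386..401]:variation-type:"point mutations"',
]

entities = parse_entities(listing)
for start, end, etype, text in entities:
    # Offsets are relative to the document, so shift by the sentence start.
    print(etype, repr(sentence[start - SENT_START:end - SENT_START]), repr(text))
```

Lines beginning with `;` that carry sentence text do not match the pattern and are skipped.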
Merged Output Example
[…]
((VP (VBD:[356..364] screened)
(PP-CLR (IN:[365..368] for)
(NP (DT:[369..372] the)
(NN:[373..378] K-ras)
(NML (NN:[379..383] exon)
(CD:[384..385] 2))
(NN:[386..391] point)
(NNS:[392..401] mutations)))
[…]
Entity-Constituent Mapping : Exact Match
• Exact Match: A node in the tree yields exactly
the entity:
;[379..385]:variation-location:"exon 2"
(NP (DT:[369..372] the)
(NN:[373..378] K-ras)
(NML (NN:[379..383] exon)
(CD:[384..385] 2))
(NN:[386..391] point)
(NNS:[392..401] mutations))
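An exact-match check can be sketched by parsing the merged tree, computing each node's span from its terminals, and comparing spans with the entity. The parser below is a simplified illustration that assumes well-formed brackets and the terminal format shown in the example; all function names are hypothetical.

```python
import re

TOKEN = re.compile(r"\(|\)|[^\s()]+")
LEAF = re.compile(r"(.+):\[(\d+)\.\.(\d+)\]$")


def parse(text):
    """Parse one bracketed tree into nested (label, start, end, children)."""
    tokens = TOKEN.findall(text)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        label = tokens[pos + 1]
        pos += 2
        leaf = LEAF.match(label)
        if leaf:                      # terminal: tag:[from..to] word
            pos += 2                  # skip the word and the closing ")"
            return (leaf.group(1), int(leaf.group(2)), int(leaf.group(3)), [])
        children = []
        while tokens[pos] != ")":
            children.append(node())
        pos += 1
        # A nonterminal spans from its first child's start to its last child's end.
        return (label, children[0][1], children[-1][2], children)

    return node()


def all_nodes(tree):
    yield tree
    for child in tree[3]:
        yield from all_nodes(child)


def exact_match(tree, start, end):
    """Labels of all nodes whose yield is exactly [start, end)."""
    return [n[0] for n in all_nodes(tree) if (n[1], n[2]) == (start, end)]


np_tree = parse("""
(NP (DT:[369..372] the)
    (NN:[373..378] K-ras)
    (NML (NN:[379..383] exon)
         (CD:[384..385] 2))
    (NN:[386..391] point)
    (NNS:[392..401] mutations))
""")
print(exact_match(np_tree, 379, 385))   # "exon 2"
print(exact_match(np_tree, 386, 401))   # "point mutations"
```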
Entity-Constituent Mapping : Missing Node
• Missing Node – Possible to add a node to yield
exactly the entity
;[386..401]:variation-type:"point mutations"
(NP (DT:[369..372] the)
(NN:[373..378] K-ras)
(NML (NN:[379..383] exon)
(CD:[384..385] 2))
(NN:[386..391] point)
(NNS:[392..401] mutations))
Entity-Constituent Mapping : Missing Node
(NP (DT:[369..372] the)
(NN:[373..378] K-ras)
(NML (NN:[379..383] exon)
(CD:[384..385] 2))
(newnode (NN:[386..391] point)
(NNS:[392..401] mutations)))
• Done for internal research purposes, not in
release (implicit constituents)
• NML already in release (explicit constituents)
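The missing-node case can be sketched as a search for a run of adjacent siblings whose combined span equals the entity, which is then wrapped in a new node. This is an illustration only, not the procedure used internally; the function name is hypothetical and nodes are assumed to be (label, start, end, children) tuples.

```python
def add_missing_node(children, start, end, label="newnode"):
    """Wrap the run of adjacent siblings that yields exactly [start, end).

    Returns a new sibling list, or None when no such run exists at this
    level (the span is already a single node, is the whole parent, or
    does not line up with sibling boundaries).
    """
    for i, first in enumerate(children):
        if first[1] != start:
            continue
        for j in range(i, len(children)):
            if children[j][2] == end:
                run = children[i:j + 1]
                if len(run) < 2 or len(run) == len(children):
                    return None       # a node with this yield already exists
                return children[:i] + [(label, start, end, run)] + children[j + 1:]
            if children[j][2] > end:
                break
    return None


np_children = [
    ("DT", 369, 372, []),
    ("NN", 373, 378, []),
    ("NML", 379, 385, [("NN", 379, 383, []), ("CD", 384, 385, [])]),
    ("NN", 386, 391, []),
    ("NNS", 392, 401, []),
]

wrapped = add_missing_node(np_children, 386, 401)   # "point mutations"
print([child[0] for child in wrapped])
```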
Entity-Constituent Mapping : Crossing
• Crossing: Cuts across constituent boundaries,
so cannot even add a node yielding the entity
• Typical case: entity containing text
corresponding to a prepositional phrase
One ER showed a G-to-T mutation in the
second position of codon 12
[1280..1307]: variation-location:
“second position of codon 12”
Entity-Constituent Mapping : Crossing
[1280..1307]: variation-location:
“second position of codon 12”
(NP (NP (DT:[1276..1279] the)
(JJ:[1280..1286] second)
(NN:[1287..1295] position))
(PP (IN:[1296..1298] of)
(NP (NN:[1299..1304] codon)
(CD:[1305..1307] 12))))
• Crossing – the determiner is in the NP but not in the entity.
• Could relax matching, or modify the entity or
treebank annotation. None of these was done.
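Crossing can be tested with the usual crossing-brackets condition: two spans cross when they overlap but neither contains the other. The sketch below applies it to the spans of the example, typed in by hand; the dictionary keys are descriptive labels chosen for illustration, and spans are assumed to have an exclusive end.

```python
def crosses(a, b):
    """True when half-open spans a and b overlap and neither contains the other."""
    (a0, a1), (b0, b1) = a, b
    overlap = a0 < b1 and b0 < a1
    a_in_b = b0 <= a0 and a1 <= b1
    b_in_a = a0 <= b0 and b1 <= a1
    return overlap and not a_in_b and not b_in_a


# Constituent spans read off the tree for "the second position of codon 12".
constituents = {
    "NP the second position of codon 12": (1276, 1307),
    "NP the second position": (1276, 1295),
    "PP of codon 12": (1296, 1307),
    "NP codon 12": (1299, 1307),
}
entity = (1280, 1307)   # "second position of codon 12"

crossed = [name for name, span in constituents.items() if crosses(entity, span)]
print(crossed)
```

Under this test, an entity extended to start at the determiner (1276) would no longer cross any constituent, which is the kind of relaxed matching the slide mentions.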
Entity-Constituent Mapping – Chain: Exact Match
• “codon 12 or 13”
• Entities: “codon 12”, “codon..13”
(NP (NP (NML-1 (NN codon))
        (CD 12))
    (CC or)
    (NP (NML-1 (-NONE- *P*))
        (CD 13)))
Entity-Constituent Mapping – Chain: Not an Exact Match
• “specific codons (12, 13, and 61)”
• Entities: “codons..12”, “codons..13”, “codons..61”
(NP (JJ specific) (NNS codons)
(PRN (-LRB- -LRB-)
(NP (NP (CD 12))
(, ,)
(NP (CD 13))
(, ,)
(CC and)
(NP (CD 61)))
(-RRB- -RRB-)))
Multiple Token Entities (Non-Chained)
Entity Type         Total   Exact Match   Missing Node   Crossing
Gene-generic            6             4              1          1
Gene-protein          349           236            103         10
Gene-RNA              156           115             35          6
Var-location          445           348             68         29
Var-state-orig          5             3              1          1
Var-state-altered      10             8              0          2
Var-type              271           123            142          6
Total                1242           837            350         55 (4.4%)
Multiple Token Entities (Chained)
Entity Type         Total   Exact Match   Not Exact Match
Gene-generic            0             0                 0
Gene-protein            6             4                 2
Gene-RNA               36            29                 7
Var-location          125           103                22
Var-state-orig          0             0                 0
Var-state-altered       0             0                 0
Var-type                1             0                 1
Total                 168           136                32 (19%)
Conclusion
• Annotation of entities and treebank done together
• Identical tokenization for entities and trees, with
standoff annotation
• Allows flexibility in use of integrated annotation
• Only 6.2% of the entities cannot be mapped to an
implicit or explicit constituent node
• Changes in Treebank guidelines
• Use of Relations for potentially large entities
• Next: Relation annotation and integrated taggers
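The percentages can be rechecked from the totals of the two multiple-token tables. The sketch below assumes that the 6.2% figure combines crossing non-chained entities with chained entities that are not an exact match, over all multiple-token entities; that reading reproduces the number but is an inference, not something the slides state.

```python
# Totals copied from the two multiple-token tables.
non_chained = {"total": 1242, "exact": 837, "missing_node": 350, "crossing": 55}
chained = {"total": 168, "exact": 136, "not_exact": 32}


def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


unmapped = non_chained["crossing"] + chained["not_exact"]
all_multi = non_chained["total"] + chained["total"]

print("crossing, non-chained:", pct(non_chained["crossing"], non_chained["total"]))
print("not exact, chained:", pct(chained["not_exact"], chained["total"]))
print("unmapped overall:", unmapped, "of", all_multi, "=", pct(unmapped, all_multi))
```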
Entity Annotation - Variations
• “(S249C)”
• Var-type – none
• Var-location – 249
• Var-state-orig – S
• Var-state-altered – C
• Gene-{RNA,generic,protein} disambiguates gene
metonymy
• Var-{type,location,state-orig,state-altered} are
different kinds of entities
Entities
                                    -- Multiple Tokens --
Entity Type         Single Tokens   Nonchains      Chains
Gene-generic                  104           6           0
Gene-protein                  921         349           6
Gene-RNA                     1987         156          36
Var-location                   95         445         125
Var-state-orig                151           5           0
Var-state-altered             162          10           0
Var-type                      235         271           1
Introduction
• Corpus for biomedical IE with several
levels of annotation:
• Entity
• Syntactic Structure (Treebank)
• Relations (McDonald et al., ACL 2005)
• Ideal - entities mapped to treebank
constituents
• Allow users to choose how to integrate
the levels
Annotation Process
• Tokenization → Entity → POS →
Treebanking → Merged Representation
• Minimal requirement: identical tokenization
for entity and treebank annotation
• Did not require an entity/constituent
correspondence – but how did it work out?