ACE Annotation Practices and Quality Control

Biomedical information extraction
at the University of Pennsylvania
Mark Liberman
Linguistic Data Consortium
http://www.ldc.upenn.edu
 Text mining for biology and medicine: Glasgow, Feb. 21-22, 2008
Outline
• The PennBioIE project: background, accomplishments, future
• Public service announcement: publishing data via the LDC
• The parable of Yang Jin
• Annotation as “common law semantics”
  • a serviceable technology that will improve
  • are there better long-term alternatives?
PennBioIE Project
• Goals:
  • Learn to strip-mine the bibliome: better NLP tools for text datamining
  • Publish biomedical text annotation: treebanks, entities, relations
• Participants:
  • Penn NLP researchers
  • Biomedical researchers (Penn, GSK, CHoP)
PennBioIE Project
Domains:
• CYP
  • inhibition of cytochrome P-450 enzymes
  • 1100 abstracts
  • collaboration with GSK
• Onco
  • genomic variations associated with cancer
  • 1158 abstracts
  • collaboration with Children’s Hospital of Philadelphia
Annotation sequence
1. pretagging (document segmentation etc.)
2. named entities
3. POS
4. treebanking
5. relations
PennBioIE Project
Results:
• Some improved techniques
• Some published data
  • release 0.9 is available from http://bioie.ldc.upenn.edu
  • release 1.0 is soon to be published by LDC
• Some applications, e.g. FABLE
• Some questions
  • How to break the F-measure ceiling?
  • How to decrease annotation burden?
  • How to increase semantic coverage?
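The F-measure named in the questions above is the harmonic mean of precision and recall over annotated items. A minimal sketch, assuming exact-match scoring of (start, end, type) spans; the function name `f_measure` and the sample spans are illustrative assumptions, not the project's actual scorer or data:

```python
def f_measure(gold, predicted, beta=1.0):
    """Return (precision, recall, F_beta) for two collections of annotations."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Hypothetical entity spans: (start offset, end offset, entity type).
gold = {(0, 6, "gene"), (10, 18, "variation"), (25, 31, "malignancy")}
pred = {(0, 6, "gene"), (10, 18, "gene"), (25, 31, "malignancy"), (40, 44, "gene")}

p, r, f = f_measure(gold, pred)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

Under exact matching, a span with the right boundaries but the wrong type counts as both a miss and a false alarm, which is one reason inconsistent gold annotation caps the achievable score.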
A note on the LDC
The Linguistic Data Consortium is
• an open consortium of universities, companies, and government laboratories;
• founded in 1992 with seed money from DARPA;
• run by the University of Pennsylvania with 45 full-time staff in Philadelphia.
But really, the LDC is…
a specialized digital publisher, which has distributed >50,000 copies of >750 corpora and other resources to ~2,500 research organizations in 62 countries.
… and might want to publish your data.
Why publish with LDC?
• It’s a publication!
• LDC publications have:
  • authors
  • ISBN numbers
  • standard bibliographic citation formats
  • editions
• IPR, licensing are handled your way (from “all rights reserved” to open access)
• LDC deals with the hassle of reproduction, distribution, maintenance
The parable of Yang Jin
The annotation conundrum
• “Natural” annotation is inconsistent
  • poor agreement for entities, worse for relations
  • task-internal metrics are noisy
• “Top down” specification is even worse (e.g. existing elaborate ontologies)
• Solution: iterative refinement of rules via interaction with annotation practice
  • result: complex accretion of “common law”
  • slow to develop, hard to learn
  • more consistent, but is it correct?
  • complexity may re-create inconsistency: new types and sub-types lead to ambiguity, confusion
ACE 2005 consistency
ACE Value Score:

English     1P vs. 1P   ADJ vs. ADJ
Entity      73.40%      84.55%
Relation    32.80%      52%
Timex2      72.40%      86.40%
Value       51.70%      63.60%
Event       31.50%      47.75%

Chinese     1P vs. 1P   ADJ vs. ADJ
Entity      81.20%      85.90%
Relation    50.40%      61.95%
Timex2      84.40%      82.75%
Value       78.70%      71.65%
Event       41.10%      32%

• 1P vs. 1P: independent first passes by junior annotators, no QC
• ADJ vs. ADJ: the outputs of two parallel, independent dual first-pass annotations, adjudicated by two independent senior annotators
Iterative improvement
From ACE 2005 (Ralph Weischedel):
Repeat until criteria are met or until time has expired:
1. Analyze performance of previous task & guidelines (scores, confusion matrices, etc.)
2. Hypothesize & implement changes to tasks/guidelines
3. Update infrastructure as needed (DTD, annotation tool, and scorer)
4. Annotate texts
5. Evaluate inter-annotator agreement
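The five-step cycle above can be sketched as a loop with two stopping conditions: an agreement target and a time budget. This is only an illustrative skeleton. Every function name, the `target` threshold, and the toy stand-ins below are assumptions; in practice each step is carried out by people and tools, not by a single function call.

```python
import time


def refine_guidelines(guidelines, analyze, revise, update_tools, annotate,
                      agreement, target=0.9, time_budget_s=60.0, max_rounds=10):
    """Repeat the five steps until agreement reaches `target` or time runs out."""
    deadline = time.monotonic() + time_budget_s
    history = []
    previous = None  # annotations produced in the previous round, if any
    for _ in range(max_rounds):
        if time.monotonic() >= deadline:
            break
        report = analyze(previous, guidelines)   # 1. scores, confusion matrices
        guidelines = revise(guidelines, report)  # 2. change tasks/guidelines
        update_tools(guidelines)                 # 3. DTD, annotation tool, scorer
        previous = annotate(guidelines)          # 4. annotate texts
        score = agreement(previous)              # 5. inter-annotator agreement
        history.append(score)
        if score >= target:
            break
    return guidelines, history


# Toy stand-ins so the sketch runs end to end (purely hypothetical behaviour).
def analyze(previous, guidelines):
    return {"rules": len(guidelines)}

def revise(guidelines, report):
    return guidelines + [f"rule {report['rules'] + 1}"]

def update_tools(guidelines):
    pass

def annotate(guidelines):
    return {"n_rules": len(guidelines)}

def agreement(annotations):
    # Simulated: agreement rises with each added rule, capped at 1.0.
    return min(1.0, 0.5 + 0.1 * annotations["n_rules"])


final, history = refine_guidelines([], analyze, revise, update_tools,
                                   annotate, agreement, target=0.75)
print(len(final), "rules after", len(history), "rounds")
```

Passing the steps in as callables keeps the loop itself independent of any particular annotation tool or agreement metric.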
NLP as Law School
Rules, Notes, Fiats and Exceptions
Many complex rules
• Plus Wiki
• Plus Listserv

Task        #Pages   #Rules
Entity      34       20
Value       10       5
TIMEX2      75       50
Relations   36       25
Events      77       50
Total       232      150
Example Decision Rule (Event p33)
Note: For Events where a single common trigger is ambiguous
between the types LIFE (i.e. INJURE and DIE) and CONFLICT
(i.e. ATTACK), we will only annotate the Event as a LIFE Event in
case the relevant resulting state is clearly indicated by the
construction.
The above rule will not apply when there are independent
triggers.
BioIE case law
Guidelines for oncology tagging (local)
Discussion
• How to make it better
  • integrating multiple information sources: text, bioinformatic databases, microarray data, …
  • less-supervised learning
    • inferring useful features from untagged text
    • active learning, information markets, etc.
  • create a “basis set” of ready-made entity types
• How to make it different
  • the analogy to translation
  • the lure of systematic semantics
  • (machine) learning: who is learning what?
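One common form of the active learning mentioned above is uncertainty sampling: annotators are asked to label only the items on which the current model is least sure. A minimal sketch, assuming the model exposes a probability distribution per item; the function names, sentence ids, and probabilities are invented for illustration and do not come from the PennBioIE work.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of a predicted label distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def select_for_annotation(predictions, batch_size):
    """Return ids of the `batch_size` items with the most uncertain predictions."""
    ranked = sorted(predictions,
                    key=lambda item_id: entropy(predictions[item_id]),
                    reverse=True)
    return ranked[:batch_size]


# Hypothetical model output: sentence id -> probabilities over three entity types.
predictions = {
    "s1": [0.98, 0.01, 0.01],
    "s2": [0.40, 0.35, 0.25],
    "s3": [0.70, 0.20, 0.10],
    "s4": [0.34, 0.33, 0.33],
}

batch = select_for_annotation(predictions, batch_size=2)
print(batch)
```

The selected batch would go to human annotators, the model would be retrained on the enlarged set, and the selection repeated, which is one way to reduce annotation burden.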