Entity Recognition: Current Status and Summer Plan

Jing Jiang
May 12, 2006
Update since last meeting
• Met with Nyla (the biologist) to talk about
training/evaluation data
– Most annotated genes in the BioCreative data set are
reasonable
– Next step: manually annotate a sample set of bee literature
for evaluation and tuning purposes
• Tagged some other collections (fly-bcb,
songbird, Wnt pathway)
• Identified some common errors and came up
with some heuristics to fix the errors
Current performance
• On BIOSIS honey bee: waiting for Nyla's
judgments on the honey bee
sample
• On Wnt pathway full-text articles (a
sample of 100 sentences, judged by Xin)
– Precision: 92% (207 / 224)
– Recall: 84% (207 / 245)
• Examples:
– fly, songbird, Wnt pathway
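The Wnt pathway figures above follow from the raw counts on the slide. A minimal sketch of the arithmetic, assuming 207 is the number of correctly tagged entities, 224 the number of entities the tagger produced, and 245 the number of entities in Xin's judgments (the function and parameter names are illustrative, not from the slides):

```python
def precision_recall(true_positives, predicted, gold):
    """Return (precision, recall) computed from raw counts."""
    if predicted <= 0 or gold <= 0:
        raise ValueError("predicted and gold counts must be positive")
    return true_positives / predicted, true_positives / gold


# Counts reported for the 100-sentence Wnt pathway sample.
p, r = precision_recall(true_positives=207, predicted=224, gold=245)
print(f"precision={p:.1%} recall={r:.1%}")  # precision=92.4% recall=84.5%
```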
Common errors and heuristics
• Same word/phrase tagged differently within the
same article
– Because of the different contexts
– Heuristic: force the tagging to be consistent
• Long form and its abbreviation tagged differently
– E.g.: …a cDNA encoding Apis mellifera ultraspiracle
(AMUSP) and…
– Heuristic: force the tagging to be consistent
• Easily detectable false positives
– E.g.: Roughly half of Drosophila genes currently…
– Heuristic: compile a list (of species names, chemical
names, etc.) and some heuristic rules
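Two of the heuristics on this slide can be sketched as post-processing over the tagger output for one article. This is only an illustration under stated assumptions: the slides do not say how "consistent" is decided, so majority vote with ties resolved toward "gene" is an assumption, and the stop list contents and all function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical stop list; the slides only mention "species names, chemical names, etc."
NON_GENE_TERMS = {"drosophila", "apis mellifera"}


def enforce_consistency(mentions):
    """Give every occurrence of a phrase its majority label within one article.

    mentions: list of (phrase, is_gene) pairs in document order.
    Ties are resolved toward is_gene=True (an assumption, not from the slides).
    """
    votes = defaultdict(Counter)
    for phrase, is_gene in mentions:
        votes[phrase][is_gene] += 1
    majority = {p: c[True] >= c[False] for p, c in votes.items()}
    return [(p, majority[p]) for p, _ in mentions]


def drop_listed_terms(mentions, stoplist=NON_GENE_TERMS):
    """Untag any phrase that appears (case-insensitively) in the stop list."""
    return [(p, g and p.lower() not in stoplist) for p, g in mentions]


mentions = [("AmTRP", True), ("AmTRP", False), ("AmTRP", True), ("Drosophila", True)]
cleaned = drop_listed_terms(enforce_consistency(mentions))
print(cleaned)
```

Applying the consistency step before the stop list means a listed term stays untagged even if the tagger marked it as a gene most of the time; the opposite order would be equally easy to try during tuning.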
Common errors and heuristics
(cont.)
• Conjunctive words/phrases tagged differently
– E.g.: …three cbl genes (c-cbl , cblb , and cblc)
which…
– Heuristic: use some rules to capture such conjunctive
words, and tag them consistently
• Tokenization errors:
– E.g.: There is no difference in AmTRP-expressing
cells among worker, …
– Heuristic: compile a list of typical suffixes (such as “-expressing”,
“-dependent”, etc.) that should be
separated from their prefixes
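Both heuristics on this slide lend themselves to simple pattern rules. The sketch below is an assumption about how such rules might look, not the rules actually used: the suffix list contains only the two examples from the slide, the conjunction rule handles only a parenthesized list containing "and", and all names are hypothetical.

```python
import re

# Illustrative suffix list; only these two examples appear on the slide.
SPLIT_SUFFIXES = ("expressing", "dependent")
_SUFFIX_RE = re.compile(
    r"^(?P<stem>.+)-(?P<suffix>" + "|".join(SPLIT_SUFFIXES) + r")$"
)
# A parenthesized span that contains the word "and".
_LIST_RE = re.compile(r"\(([^()]*\band\b[^()]*)\)")


def split_suffix(token):
    """Split 'AmTRP-expressing' into ['AmTRP', '-expressing']; leave other tokens alone."""
    m = _SUFFIX_RE.match(token)
    if m:
        return [m.group("stem"), "-" + m.group("suffix")]
    return [token]


def conjunct_items(text):
    """Return the items of the first parenthesized 'a, b, and c' list in text."""
    m = _LIST_RE.search(text)
    if not m:
        return []
    parts = re.split(r"\s*,\s*(?:and\s+)?|\s+and\s+", m.group(1))
    return [p.strip() for p in parts if p.strip()]


print(split_suffix("AmTRP-expressing"))
print(conjunct_items("three cbl genes (c-cbl , cblb , and cblc) which"))
```

Once the conjunct items are isolated, they could be passed through the same consistency step used for repeated phrases, so that all items in one list receive the same tag.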
Common errors and heuristics
(cont.)
• Mistakes caused by citations:
– Only in certain collections (the Wnt pathway collection has this
problem; the BIOSIS collections don’t)
– E.g.: Among the downstream targets of PI 3-kinase
are phospholipase C (6-9) , protein kinase C (10, 11) ,
Rac (12-14) , and…
– Heuristic: remove these citations(?)
• Controversial cases: domain, subunit, etc.
– E.g.: Alternating proline / alanine sequence of beta B1
subunit originates…
– BioCreative data set tags these as part of gene
names
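The citation heuristic above is marked as tentative on the slide. One possible form, shown only as a sketch, is to strip parenthesized groups that contain nothing but numbers, commas, and ranges before tagging. The pattern and function name are assumptions; a rule like this would also remove legitimate numeric parentheticals, which is one way a heuristic could introduce new errors.

```python
import re

# Parenthesized numeric citations such as (6-9), (10, 11) or (12-14).
_CITATION_RE = re.compile(r"\s*\(\s*\d+(?:\s*[-–,]\s*\d+)*\s*\)")


def strip_citations(sentence):
    """Remove numeric citation markers, together with the whitespace before them."""
    return _CITATION_RE.sub("", sentence)


example = (
    "Among the downstream targets of PI 3-kinase are phospholipase C (6-9) , "
    "protein kinase C (10, 11) , Rac (12-14) , and"
)
print(strip_citations(example))
```

Numbers outside parentheses, such as the "3" in "PI 3-kinase", are left untouched by this pattern.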
Summer plan
• Evaluate the performance on honey bee data
based on Nyla’s judgments
• Implement and tune the heuristics to capture the
common errors, and evaluate their effectiveness
– Some heuristics may cause new errors
– Tune on the annotated sample honey bee data
• Based on the needs of BeeSpace, find a good
balance between precision and recall
• Work with Todd on the input/output format of the
entity recognizer
Discussion