BANNER-CHEMDNER
Part – I: Travel
BioCreative IV workshop
Washington DC, US
7 Oct – 9 Oct
Tsendsuren Munkhdalai
Chungbuk National University, Republic of Korea
10/23/2013
Contents
BioCreative IV
Venue & Access
Places to Visit in DC
BioCreative IV
BioCreative: Critical Assessment of Information
Extraction in Biology:
The key goal:
Community-wide effort for evaluating text mining and
information extraction systems applied to the biological
domain
The active involvement of the text mining user community
in the design of the tracks, the preparation of corpora, and the
testing of interactive systems
The workshop takes place once every two years
BioCreative IV has 5 tracks: Interoperability,
CHEMDNER, Comparative Toxicogenomics
Database curation, Gene Ontology curation, and
Interactive curation
Venue & Access
The workshop took place in DoubleTree by Hilton
Hotel Bethesda - Washington DC
It was originally supposed to be held at NCBI, National Institutes of
Health (NIH), Washington DC
[Photo: DoubleTree by Hilton Hotel Bethesda]
~14-hour flight from ICN to DC
BWI Airport
Venue & Access
BWI Airport to The Hotel:
Two Options (Very Expensive):
Super Shuttle - $50
Taxi - $120
A - BWI Airport, B – The hotel
Third option:
Take the bus (no. 201) to Shady Grove
metro station - $6
Take the Metro from Shady Grove to Bethesda - $5
Walk for 7 minutes in the RIGHT DIRECTION
[Images: BWI Airport Super Shuttle, DC Metro Map, BWI Airport to Shady Grove]
Places to Visit in DC
[Photo slides]
The Tip of the Trip
Don’t forget your power socket converter
Part – II: Paper & System
BANNER-CHEMDNER:
Incorporating Domain
Knowledge in Chemical and
Drug Named Entity Recognition
Tsendsuren Munkhdalai, Meijing Li, Khuyagbaatar
Batsuren, Keun Ho Ryu+
Chungbuk National University, Republic of Korea
+Corresponding Author
Contents
Introduction
Related Work
Methods
Preprocessing
Feature Processing
Supervised Learning
Experiments and Results
Conclusion and Future Work
Introduction: Background
Biomedical literature grows exponentially
Impossible to manage manually
~2,000 new articles every day
Information Extraction (IE) is a critical task
Chemical and Drug Named Entity Recognition
(CHEMDNER)
A fundamental task in IE and IR
Define the boundaries of terminology mentions and categorize them
[Figure: Overview of information extraction from biomedical literature: NER of chemical and drug NEs feeds IE & IR tasks such as document indexing, protein-protein interaction, and event extraction]
Introduction: Motivation
CHEMDNER task is not trivial
The unique difficulties
Name boundary problem
The volume of literature is growing exponentially
Chemical mentions occur with other terminology
Training data size is too small
Ex: the GENIA corpus has 2,000 abstracts, while 2,000 new
articles appear EVERY DAY!
The lack of standardization of technical terminology
Methods that exploit unlabeled data have become an
active research topic due to this BIG DATA
Introduction: CHEMDNER Task
One of the BioCreative IV Challenge Tasks
Chemical Document Indexing (CDI) sub-task:
Given a set of documents, return a list of the chemical entities described
within each of these documents
Chemical Entity Mention recognition (CEM) sub-task:
Provide, for a given document, the start and end indices corresponding to all
the chemical entities mentioned in this document
Participants develop a chemical compound and drug
mention recognition system
Organizers provide:
3,500 docs - training
3,500 docs - development
20,000 docs - system evaluation
a subset of 3,000 documents constitutes the real test set
17,000 abstracts were added as a background collection to
avoid manual correction of the results
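To make the two output types concrete, below is a minimal sketch of writing CEM-style predictions. The TSV columns (doc id, start:end offsets, rank, confidence) are an assumed layout for illustration, not the official submission format:

```python
# Minimal sketch of serializing CEM predictions; the column layout
# (doc id, start:end span, rank, confidence) is an assumption.
predictions = [
    # (document id, start offset, end offset, confidence) -- toy values
    ("23104419", 102, 110, 0.97),
    ("23104419", 215, 223, 0.85),
]

with open("cem_predictions.tsv", "w") as out:
    for rank, (doc_id, start, end, conf) in enumerate(
            sorted(predictions, key=lambda p: -p[3]), start=1):
        # One mention per line, ranked by descending confidence
        out.write(f"{doc_id}\t{start}:{end}\t{rank}\t{conf:.2f}\n")
```

The CDI output would instead list each distinct chemical entity once per document.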
Related work
Comparison of character-level and part of speech
features for name recognition in biomedical texts
[N. Collier, K. Takeuchi, 2004, Journal of Biomedical Informatics]
Gene/protein name recognition based on support
vector machine using dictionary as features
[T. Mitsumori, S. Fation, M. Murata, K. Doi, and H. Doi, 2005, BMC Bioinformatics]
ABNER system
[B. Settles, 2004, NLPBA/BioNLP]
Penn BioTagger system
[R. McDonald and F. Pereira, 2005, BMC Bioinformatics]
BANNER: An Executable Survey of Advances in
Biomedical Named Entity Recognition
[R. Leaman, G. Gonzalez, 2008, Pacific Symposium on Biocomputing]
Defines the state-of-the-art in biomedical NER
Used in Biomedical IE systems
Related work
Most of these works rely on supervised ML
The word, word and character n-grams, and the
traditional orthographic features form the base
System performance is limited by the training set
However, these features poorly represent domain background knowledge!
How to incorporate domain knowledge in NER?
Use lexicons/dictionary in conjunction with string
matching methods
Exploit rich unlabeled data during model construction
Learn better feature representation by using unlabeled
data (unsupervised feature learning)
Related work
Combining Labeled and Unlabeled Data with Co-training
[A. Blum, and T. Mitchell, 1998, Computational learning theory]
Introduces co-training, a semi-supervised learning algorithm
Builds two models using labeled and unlabeled data
Co-training assumptions: view-independence, view-sufficiency
Bio Named Entity Recognition based on Co-training
Algorithm
[T. Munkhdalai, M. Li, T. Kim, O. Namsrai, S. Jeong, J. Shin, K.H. Ryu, 2012, AINA]
A Self-training with Active Example Selection Criterion for
Biomedical Named Entity Recognition
[E. Shin, T. Munkhdalai, M. Li, I. Paik, K. H. Ryu, 2012, ICHIT]
An Active Co-Training Algorithm for Biomedical Named-Entity
Recognition
[T. Munkhdalai, M. Li, U. Yun, O. Namsrai, K. H. Ryu, 2012, JIPS]
Related work
A Unified Architecture for Natural Language Processing:
Deep Neural Networks with Multitask Learning
[R. Collobert, J. Weston, 2008, ICML]
Introduced a neural language model
A word with its context is a positive training example; a random
word substituted into the context gives a negative training example
Ex: cat chills on a mat (+) vs. cat chills Jeju a mat (-)
Induces an n-dimensional real valued vector (word
embedding or WE) for each word
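A minimal numpy sketch of this pairwise ranking criterion, with a toy linear scorer standing in for the authors' neural network:

```python
import numpy as np

rng = np.random.default_rng(0)
V, dim = 10_000, 50                        # vocabulary size, embedding dimension
E = rng.normal(scale=0.1, size=(V, dim))   # word embedding matrix being learned
w = rng.normal(scale=0.1, size=5 * dim)    # toy linear scorer over a 5-word window

def score(window_ids):
    """Score a window of 5 word ids; a stand-in for the C&W network."""
    return w @ E[window_ids].ravel()

window = np.array([11, 42, 7, 99, 3])      # e.g. "cat chills on a mat"
corrupt = window.copy()
corrupt[2] = rng.integers(V)               # replace the middle word at random

# Pairwise ranking (hinge) loss: push the true window's score a
# margin of 1 above the corrupted window's score.
loss = max(0.0, 1.0 - score(window) + score(corrupt))
print(loss)
```

Minimizing this loss over large unlabeled corpora is what induces the word embedding matrix.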
Class-Based n-gram Models of Natural Language
[Peter F. B., Peter V. D., Robert L. M., Vincent J. D. P, Jenifer C. L., 1992, ACL]
Introduced a hierarchical word clustering (brown clustering)
Clusters words to maximize the mutual information of bigrams
Quality(C) = I(C) - H, where I(C) is the mutual information of
adjacent cluster bigrams and H is the entropy of the word unigram distribution
Time complexity: O(V·K²), where V is the size of the vocabulary
and K is the number of clusters
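Spelled out, following Brown et al.'s formulation, the objective is:

$$
\mathrm{Quality}(C) \;=\; \underbrace{\sum_{c,\,c'} p(c,c') \,\log \frac{p(c,c')}{p(c)\,p(c')}}_{I(C)} \;-\; \underbrace{\Bigl(-\sum_{w} p(w) \log p(w)\Bigr)}_{H}
$$

Since the entropy H does not depend on the clustering, maximizing Quality(C) amounts to maximizing the mutual information I(C) of adjacent cluster pairs.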
Purpose & Approach
Our primary purpose is to develop a CHEMDNER
system that exploits unlabeled data
A better recognition accuracy
Scalable over millions of documents
Configurable
Pluggable in other systems
The approach
Biomedical Natural Language Processing (BioNLP) tasks
The traditional baseline features + Word representation
features
CRF Model training with carefully tuned hyperparameters
BANNER-CHEMDNER Architecture
[Figure: System design of the BANNER-CHEMDNER system]
Preprocessing
Text cleaning
Sentence splitting
Remove overly short sentences
Detect sentence boundary in bio text data using Genia
Sentence Splitter
Part-of-speech tagging
Annotate each token in a sentence with a POS tag based on its
context
Ex: Verb (VB), Noun (NN), Proper Noun (NNP), Adjective (JJ)
Lemmatization
Find canonical form (lemma) of a token with
BioLemmatizer
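The tools named above (Genia Sentence Splitter, the POS tagger, BioLemmatizer) are standalone Java programs, so the Python below is only a hedged sketch of the pipeline order, with placeholder functions standing in for each stage:

```python
# Sketch of the preprocessing order; the three helpers are placeholders
# for the real tools, not their actual APIs.
MIN_TOKENS = 3  # assumed threshold for "overly short" sentences

def split_sentences(text):            # placeholder for Genia Sentence Splitter
    return [s.strip() for s in text.split(". ") if s.strip()]

def pos_tag(tokens):                  # placeholder for the POS tagger
    return [(tok, "NN") for tok in tokens]

def lemmatize(token, pos):            # placeholder for BioLemmatizer
    return token.lower()

def preprocess(abstract):
    for sentence in split_sentences(abstract):
        tokens = sentence.split()
        if len(tokens) < MIN_TOKENS:  # drop overly short sentences
            continue
        yield [(tok, pos, lemmatize(tok, pos)) for tok, pos in pos_tag(tokens)]

for sent in preprocess("Propofol is a sedative. Ok. It acts on GABA-A receptors."):
    print(sent)
```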
Feature Processing
The base (or baseline) feature set:
Lowercased token, POS tags, and Lemma
Orthographic features by regular expressions
Character prefix, suffix and n-grams
Token word and number class (normalization)
Roman number and Greek letter matching
Word representation features:
Brown cluster label prefixes of lengths 4, 6, 10, and 20
Use cluster supersets as features, since Brown clusters are
hierarchical
50- and 100-dimensional word embedding vectors
No stopping criterion is needed for inducing WE; the quality of the
embeddings improves as training proceeds
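A minimal sketch of how such token features could be assembled; the regexes, prefix lengths, and the tiny Brown lookup table are illustrative assumptions, not the system's exact feature set:

```python
import re

# Toy Brown cluster lookup: token -> bit-string path in the hierarchy.
# In the real system this comes from a model induced on PubMed or RCV1.
BROWN = {"propofol": "110100101010011010", "the": "0010"}

def token_features(token, pos, lemma):
    feats = {
        "w.lower": token.lower(), "pos": pos, "lemma": lemma,
        # Orthographic features via regular expressions
        "has_digit": bool(re.search(r"\d", token)),
        "has_dash": "-" in token,
        "is_capitalized": token[:1].isupper(),
        # Character prefixes and suffixes (n = 2, 3)
        "prefix2": token[:2], "prefix3": token[:3],
        "suffix2": token[-2:], "suffix3": token[-3:],
    }
    # Brown path prefixes of lengths 4, 6, 10, 20: each prefix names a
    # superset cluster, since the clustering is hierarchical.
    path = BROWN.get(token.lower())
    if path is not None:
        for n in (4, 6, 10, 20):
            feats[f"brown{n}"] = path[:n]
    return feats

print(token_features("Propofol", "NN", "propofol"))
```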
Supervised Learning
Conditional Random Fields (CRF) is a sequence
labeling algorithm
Treat each sentence as a sequence of tokens
Train CRF model for label sequence of every
sentence in training set
2nd-order CRF: the current label is conditioned on the
previous two labels
Use BIO label model
Hyperparameter tuning with development set
Once the optimal hyperparameter values are found,
build the model on the whole annotated set
Avoid model overfitting!!!
Also submit the model built only on the training set
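For concreteness, here is a hand-worked toy of the BIO label model (generic B/I/O tags, one chemical mention):

```python
# Toy BIO encoding of one sentence with a single chemical mention.
tokens = ["The", "effect", "of", "gamma", "-", "aminobutyric", "acid", "was", "studied", "."]
# "gamma - aminobutyric acid" is the entity: B on its first token, I inside.
labels = ["O",   "O",      "O",  "B",     "I", "I",            "I",    "O",   "O",       "O"]

for tok, lab in zip(tokens, labels):
    print(f"{tok}\t{lab}")
```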
Experiments and Results:
Datasets
Labeled data
3,500 docs - training
3,500 docs - development
Unlabeled data
Biomedical domain corpus
Subset of PUBMED documents collected via NCBI web service
Number of abstracts: ~ 1.4 million
10.8 million sentences were preserved, after sentence
splitting and text cleansing
News domain corpus
RCV1 corpus: 1.3 million sentences
Use models provided by Joseph Turian [Turian et al., 2010]
Experiments and Results:
Inducing word representations
Induced Brown models of 25, 50, and 300 clusters
for comparison
Inducing 300 clusters took roughly 10 days
We would have induced 1,000 clusters; however, this would take
several months on such a large corpus
Applied the word embeddings induced on the RCV1
corpus [Turian et al., 2010]
50 and 100 dimensions of word vectors were used as
features
The CRF model with WE is quite complex
Real-valued embedding vectors introduce continuous attributes
Training and tagging time increases dramatically (3 to 4 times)
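As a sketch, each embedding dimension can be exposed to the tagger as one real-valued feature; the one-token-per-line file layout assumed below (token followed by its vector values) is an assumption for illustration:

```python
# Sketch: turn each embedding dimension into a real-valued CRF feature.
# Assumes a text file with one token per line followed by its vector
# ("word v1 v2 ... v50"); treat this layout as an assumption.
def load_embeddings(path):
    table = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            table[parts[0]] = [float(x) for x in parts[1:]]
    return table

def embedding_features(token, table):
    vec = table.get(token.lower())
    if vec is None:
        return {}  # out-of-vocabulary: no WE features
    # One continuous attribute per dimension: this is what makes the
    # CRF model noticeably slower to train and tag.
    return {f"we{i}": v for i, v in enumerate(vec)}
```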
Experiments and Results:
Performance comparison on CEM sub-task
System development was mainly guided by the
chemical entity mention recognition sub-task
Description                           Pre (%)   Rec (%)   F-scr (%)   Elapsed time
BANNER setup                          84.85     72.93     78.44       343 sec
Baseline                              82.65     78.33     80.43       339 sec
Baseline + Brown 1000 RCV1            84.33     77.82     80.94       365 sec
Baseline + Brown 1000 RCV1 + WE 50    82.38     74.00     77.97       1,730 sec
Baseline + Brown 50 PubMed            84.85     77.89     81.22       377 sec
Baseline + Brown 50 PubMed + WE 50    83.79     76.56     80.01       595 sec
Baseline + Brown 100 PubMed           82.29     73.86     77.85       346 sec
Baseline + Brown 300 PubMed           84.59     78.47     81.41       389 sec

Comparison of different setups evaluated on the development set
Experiments and Results:
The best runs for CDI sub-task
Picked the best setups for both the CEM and CDI sub-tasks
The following table shows the top four runs for the CDI
sub-task
Description                           Pre (%)   Rec (%)   F-scr (%)
Baseline + Brown 50 PubMed + WE 50    80.91     81.29     81.10
Baseline + Brown 1000 RCV1            81.54     84.01     81.23
Baseline + Brown 50 PubMed            82.33     82.57     82.45
Baseline + Brown 300 PubMed           81.91     84.10     82.58

The best runs for the CDI sub-task
Experiments and Results:
The final result on the test set
65 teams registered in total
27 teams submitted results
87 researchers from around the world
[Figure: the final results on the test set]
Experiments and Results:
Discussion & Findings
WE features did not always improve the
performance
Brown cluster features improve the F-measure
In some cases, they degrade the system performance with the CRF
model
The improvement is significant when the model was built
on the domain corpus
Lemma and Brown cluster features in the BANNER-CHEMDNER system were observed to boost the
performance by around 4% in F-measure
About 2% each
An improvement in the same range is also achievable for
biomedical NER with BANNER-CHEMDNER
Conclusion and Future Work
We developed a new branch of the BANNER system, called
BANNER-CHEMDNER, for the CHEMDNER task
A better recognition accuracy
Scalable over millions of documents
Configurable via XML
Pluggable in other systems
Our system processes ~530 documents per minute
Ongoing and future work
Evaluate the BANNER-CHEMDNER setup for gene/protein
mention recognition
Try WE features with other classifiers
Induce an accurate Brown cluster model and WE matrix from
PubMed documents for the community
Korean NIH:
Development of Biomedical Text Mining Systems (2011 - now)
PPI miner system (2011 - 2012)
PubMed clustering system (2013)
Q&A
Thank you