BANNER-CHEMDNER


Transcript BANNER-CHEMDNER

Part – I: Travel
BioCreative IV workshop
Washington DC, US
7 Oct – 9 Oct
Tsendsuren Munkhdalai
Chungbuk National University, Republic of Korea
10/23/2013
Contents



BioCreative IV
Venue & Access
Places to Visit in DC
2
BioCreative IV

BioCreative: Critical Assessment of Information Extraction in Biology

The key goals:

Community-wide effort for evaluating text mining and information extraction systems applied to the biological domain
Active involvement of the text mining user community in the design of the tracks, the preparation of corpora, and the testing of interactive systems

The workshop takes place once every two years
BioCreative IV has 5 tracks: Interoperability, CHEMDNER, Comparative Toxicogenomics Database curation, Gene Ontology curation, and Interactive curation
3
Venue & Access

The workshop took place at the DoubleTree by Hilton Hotel Bethesda, Washington DC

It was originally supposed to be held at NCBI, National Institutes of Health (NIH), Washington DC
~14-hour flight from ICN to DC

[Map: BWI Airport and the DoubleTree by Hilton Hotel Bethesda]
4
Venue & Access

BWI Airport to the hotel:

Two options (very expensive):

Super Shuttle - $50
Taxi - $120

[Map: A - BWI Airport, B - the hotel]
Third option:

Take bus no. 201 to Shady Grove metro station - $6
Take the Metro from Shady Grove to Bethesda - $5
Walk for 7 minutes in the RIGHT DIRECTION
[Images: BWI Airport Super Shuttle; DC Metro map; BWI Airport to Shady Grove]
5
Places to Visit in DC
6
Places to Visit in DC
7
Places to Visit in DC
8
The Tip of the Trip

Don’t forget your power socket converter
9
Part – II: Paper & System
BANNER-CHEMDNER:
Incorporating Domain
Knowledge in Chemical and
Drug Named Entity Recognition
Tsendsuren Munkhdalai, Meijing Li, Khuyagbaatar
Batsuren, Keun Ho Ryu+
Chungbuk National University, Republic of Korea
+ Corresponding author
Contents



Introduction
Related Work
Methods
Preprocessing
 Feature Processing
 Supervised Learning



Experiments and Results
Conclusion and Future Work
11
Introduction: Background

Biomedical literature grows exponentially

~2,000 new articles every day
Manually impossible to manage

Information Extraction (IE) is a critical task

Chemical and Drug Named Entity Recognition (CHEMDNER)

A fundamental task in IE and IR
Defines the boundaries of terms and categorizes them
[Figure: Overview of information extraction from biomedical literature - chemical and drug NER feeding IE & IR tasks such as document indexing, protein-protein interaction, and event extraction]
12
Introduction: Motivation

The CHEMDNER task is not trivial

The unique difficulties:

Name boundary problem: chemical mentions occur together with other terminology
Training data size is too small
Ex: the GENIA corpus has 2,000 abstracts, while ~2,000 new articles appear EVERY DAY!
The lack of standardization of technical terminology

The number of articles is growing exponentially
Methods that exploit unlabeled data have become an active research topic due to BIG DATA
13
Introduction: CHEMDNER Task

One of the BioCreative IV Challenge Tasks

The participants develop a chemical compound and drug mention recognition system

Chemical Document Indexing (CDI) sub-task: given a set of documents, return a list of the chemical entities described within each of these documents
Chemical Entity Mention recognition (CEM) sub-task: provide, for a given document, the start and end indices corresponding to all the chemical entities mentioned in this document

Organizers provide:

3,500 docs - training
3,500 docs - development
20,000 docs - system evaluation

A subset of 3,000 documents constitutes the real test set
17,000 abstracts were added as a background collection to avoid manual correction of the results
14
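As a rough illustration of the two output granularities (illustrative only, not the official BioCreative submission format; identifiers and field names below are hypothetical):

```python
# Illustrative CEM-style prediction: a document identifier plus character
# offsets and the mention string. Field names are hypothetical, not the
# official BioCreative submission format.
cem_mention = {
    "doc_id": "23104419",   # hypothetical PubMed ID
    "start": 102,           # start character index of the mention
    "end": 113,             # end character index
    "text": "haloperidol",  # recognized chemical entity mention
}

# Illustrative CDI-style result: the list of chemical entities per document.
cdi_result = {"23104419": ["haloperidol", "dopamine"]}
```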
Related work

Comparison of character-level and part of speech
features for name recognition in biomedical texts
[N. Collier, K. Takeuchi, 2004, Journal of Biomedical Informatics]

Gene/protein name recognition based on support
vector machine using dictionary as features
[T. Mitsumori, S. Fation, M. Murata, K. Doi, and H. Doi, 2005, BMC Bioinformatics]

ABNER system
[B. Settles, 2004, NLPBA/BioNLP]

Penn BioTagger system
[R. McDonald and F. Pereira, 2005, BMC Bioinformatics]

BANNER: An Executable Survey of Advances in
Biomedical Named Entity Recognition
[R. Leaman, G. Gonzalez, 2008, Pacific Symposium on Biocomputing]


Defines the state-of-the-art in biomedical NER
Used in biomedical IE systems
15
Related work

Most of the works rely on supervised ML

Word, word/character n-gram, and the traditional orthographic features form the base feature set

System performance is limited by the training set
These features are poor at representing domain background knowledge!

How to incorporate domain knowledge in NER?

Use lexicons/dictionaries in conjunction with string matching methods
Exploit rich unlabeled data during model construction
Learn better feature representations by using unlabeled data (unsupervised feature learning)
16
Related work

Combining Labeled and Unlabeled Data with Co-training
[A. Blum, and T. Mitchell, 1998, Computational learning theory]




Introduces co-training, a semi-supervised learning algorithm
Builds two models (views) using labeled and unlabeled data
Co-training assumptions: view independence and view sufficiency
Bio Named Entity Recognition based on Co-training
Algorithm
[T. Munkhdalai, M. Li, T. Kim, O. Namsrai, S. Jeong, J. Shin, K.H. Ryu, 2012, AINA]

A Self-training with Active Example Selection Criterion for
Biomedical Named Entity Recognition
[E. Shin, T. Munkhdalai, M. Li, I. Paik, K. H. Ryu, 2012, ICHIT]

An Active Co-Training Algorithm for Biomedical Named-Entity
Recognition
[T. Munkhdalai, M. Li, U. Yun, O. Namsrai, K. H. Ryu, 2012, JIPS]
17
Related work

A Unified Architecture for Natural Language Processing:
Deep Neural Networks with Multitask Learning
[R. Collobert, J. Weston, 2008, ICML]

Introduced a neural language model
A word with its context is a positive training example; replacing a word in the context with a random word gives a negative training example
Ex: "cat chills on a mat" (+) vs "cat chills Jeju a mat" (-)
Induces an n-dimensional real-valued vector (word embedding, or WE) for each word
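A minimal sketch of the pairwise ranking idea behind this model, with a toy linear scorer standing in for the deep network (window size, dimensions, and the scorer itself are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"cat": 0, "chills": 1, "on": 2, "a": 3, "mat": 4, "Jeju": 5}
dim = 50
E = rng.normal(scale=0.1, size=(len(vocab), dim))  # word embedding matrix (WE)
w = rng.normal(scale=0.1, size=dim * 5)            # toy linear scorer over a 5-word window

def score(window):
    # Score a 5-word window by concatenating its embeddings
    # (a stand-in for the deep scoring network of Collobert & Weston).
    x = np.concatenate([E[vocab[t]] for t in window])
    return float(w @ x)

pos = ["cat", "chills", "on", "a", "mat"]    # observed window (+)
neg = ["cat", "chills", "Jeju", "a", "mat"]  # word replaced at random (-)

# Pairwise ranking (hinge) loss: push the observed window to score at least
# 1 higher than the corrupted one; gradients w.r.t. E yield the word embeddings.
loss = max(0.0, 1.0 - score(pos) + score(neg))
print(loss)
```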
Class-Based n-gram Models of Natural Language
[P. F. Brown, P. V. deSouza, R. L. Mercer, V. J. Della Pietra, J. C. Lai, 1992, Computational Linguistics]

Introduced a hierarchical word clustering (Brown clustering)
Clusters words to maximize the mutual information of class bigrams

Quality(C) = I(C) − H
Time complexity: O(V·K²), where V is the size of the vocabulary and K is the number of clusters
18
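Written out, the clustering objective on the slide above is (a reconstruction of the standard class-based bigram formulation):

```latex
% Quality of a clustering C: mutual information between adjacent class
% bigrams minus the entropy of the word unigram distribution. H does not
% depend on C, so maximizing Quality(C) amounts to maximizing I(C).
\mathrm{Quality}(C) = I(C) - H,
\qquad
I(C) = \sum_{c,\,c'} p(c, c') \,\log \frac{p(c, c')}{p(c)\,p(c')}
```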
Purpose & Approach

Our primary purpose is to develop a CHEMDNER system and to exploit unlabeled data

A better recognition accuracy
Scalable over millions of documents
Configurable
Pluggable into other systems

The approach:

A Biomedical Natural Language Processing (BioNLP) preprocessing pipeline
The traditional baseline features + word representation features
CRF model training with carefully tuned hyperparameters
19
BANNER-CHEMDNER Architecture
[Figure: System design of the BANNER-CHEMDNER system]
20
Preprocessing

Text cleaning

Remove too-short sentences

Sentence splitting

Detect sentence boundaries in biomedical text using the GENIA Sentence Splitter

Part-of-speech tagging

Annotate each token in a sentence with a POS tag based on its context
Ex: Verb (VB), Noun (NN), Proper Noun (NNP), Adjective (JJ)

Lemmatization

Find the canonical form (lemma) of a token with BioLemmatizer
21
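A minimal preprocessing sketch of these four steps, assuming NLTK components as generic stand-ins for the GENIA Sentence Splitter and BioLemmatizer used in the actual system:

```python
import nltk
from nltk.stem import WordNetLemmatizer

# One-time resource downloads (assumes network access):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger"); nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()

def preprocess(abstract, min_tokens=3):
    processed = []
    for sent in nltk.sent_tokenize(abstract):          # sentence splitting
        tokens = nltk.word_tokenize(sent)
        if len(tokens) < min_tokens:                    # text cleaning: drop too-short sentences
            continue
        pos_tags = [tag for _, tag in nltk.pos_tag(tokens)]             # POS tagging (NN, NNP, JJ, ...)
        lemmas = [lemmatizer.lemmatize(tok.lower()) for tok in tokens]  # lemmatization
        processed.append(list(zip(tokens, pos_tags, lemmas)))
    return processed

print(preprocess("Haloperidol blocks dopamine D2 receptors. Ok."))
```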
Feature Processing

The base (or baseline) feature set:






Lowercased token, POS tags, and Lemma
Orthographic features by regular expressions
Character prefix, suffix and n-grams
Token word and number class (normalization)
Roman number and Greek letter matching
Word representation features:

Brown cluster label prefixes of lengths 4, 6, 10 and 20

Use cluster supersets as features, since Brown clusters are hierarchical

50- and 100-dimensional word embedding vectors

No stopping criterion for inducing WE; the quality of the embeddings improves as training proceeds
22
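A sketch of how one token could be mapped to this feature set; the Brown path dictionary and embedding lookup are assumed to be loaded from models induced offline, and the feature names are illustrative:

```python
import re

def token_features(token, pos, lemma, brown_paths, embeddings):
    # brown_paths: dict mapping word -> Brown cluster bit-string, e.g. "0010110..."
    # embeddings: dict mapping word -> list/array of 50 or 100 floats
    feats = {
        "lower": token.lower(), "pos": pos, "lemma": lemma,
        # orthographic features via regular expressions
        "has_digit": bool(re.search(r"\d", token)),
        "has_dash": "-" in token,
        "init_cap": token[:1].isupper(),
        "is_roman": bool(re.fullmatch(r"[IVXLCDM]+", token)),
        "is_greek": token.lower() in {"alpha", "beta", "gamma", "delta", "kappa"},
        # word/number class normalization, e.g. "CYP2D6" -> "AAA0A0"
        "word_class": re.sub(r"[A-Z]", "A", re.sub(r"[a-z]", "a", re.sub(r"\d", "0", token))),
        # character prefix / suffix
        "prefix3": token[:3], "suffix3": token[-3:],
    }
    path = brown_paths.get(token.lower())
    if path is not None:
        # Brown cluster bit-string prefixes of lengths 4, 6, 10 and 20 (cluster supersets)
        for n in (4, 6, 10, 20):
            feats[f"brown_{n}"] = path[:n]
    vec = embeddings.get(token.lower())
    if vec is not None:
        # each word-embedding dimension becomes a real-valued feature
        for i, v in enumerate(vec):
            feats[f"we_{i}"] = float(v)
    return feats
```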
Supervised Learning

Conditional Random Fields (CRF) is a sequence labeling algorithm

Treat each sentence as a sequence of tokens

Train a CRF model on the label sequence of every sentence in the training set

2nd-order CRF: the current label is conditioned on the previous two labels
Use the BIO label model

Hyperparameter tuning with the development set
Once the optimal hyperparameter values are found, build the model on the whole annotated set

Avoid model overfitting!!!
Submit the model built only on the training set as well
23
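A hedged sketch of this training loop using sklearn-crfsuite; note that CRFsuite fits a first-order chain, whereas the system described here uses a second-order CRF, so this only illustrates the workflow (BIO labels, dev-set tuning), not the exact model:

```python
import sklearn_crfsuite
from sklearn_crfsuite import metrics

# X_train / X_dev: lists of sentences, each a list of per-token feature dicts
# (e.g. produced by a token_features function); y_*: parallel BIO label
# sequences such as ["O", "B-CHEMICAL", "I-CHEMICAL", "O", ...].

def train_and_eval(X_train, y_train, X_dev, y_dev, c1, c2):
    crf = sklearn_crfsuite.CRF(
        algorithm="lbfgs",
        c1=c1,                        # L1 regularization strength
        c2=c2,                        # L2 regularization strength
        max_iterations=200,
        all_possible_transitions=True,
    )
    crf.fit(X_train, y_train)
    y_pred = crf.predict(X_dev)
    labels = [l for l in crf.classes_ if l != "O"]
    return metrics.flat_f1_score(y_dev, y_pred, average="weighted", labels=labels)

# Tune hyperparameters on the development set, then (as on the slide) rebuild
# the final model on the whole annotated set with the best values found.
# best_c1, best_c2 = max(
#     ((c1, c2) for c1 in (0.01, 0.1, 1.0) for c2 in (0.01, 0.1, 1.0)),
#     key=lambda p: train_and_eval(X_train, y_train, X_dev, y_dev, *p))
```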
Experiments and Results:
Datasets

Labeled data



3,500 docs - training
3,500 docs - development
Unlabeled data

Biomedical domain corpus

Subset of PubMed documents collected via the NCBI web service
Number of abstracts: ~1.4 million
10.8 million sentences were preserved after sentence splitting and text cleaning

News domain corpus

RCV1 corpus: 1.3 million sentences
Use models provided by Joseph Turian [Turian et al., 2010]
24
Experiments and Results:
Inducing word representations

Induced Brown models with 25, 50 and 300 clusters for comparison

Inducing 300 clusters took roughly 10 days
We would have induced 1,000 clusters, but this would take several months on such a large corpus

Applied the word embeddings induced on the RCV1 corpus [Turian et al., 2010]

50- and 100-dimensional word vectors were used as features

A CRF model with WE is quite complex

Real-valued embedding vectors introduce continuous attributes
Training and tagging time increases dramatically (3 to 4 times)
25
Experiments and Results:
Performance comparison on CEM sub-task

The system performance was mainly guided by the
chemical entity mention recognition sub-task
Description                          | Pre (%) | Rec (%) | F-scr (%) | Elapsed time
BANNER setup                         | 84.85   | 72.93   | 78.44     | 343 sec
Baseline                             | 82.65   | 78.33   | 80.43     | 339 sec
Baseline + Brown 1000 RCV1           | 84.33   | 77.82   | 80.94     | 365 sec
Baseline + Brown 1000 RCV1 + WE 50   | 82.38   | 74.00   | 77.97     | 1,730 sec
Baseline + Brown 50 PubMed           | 84.85   | 77.89   | 81.22     | 377 sec
Baseline + Brown 50 PubMed + WE 50   | 83.79   | 76.56   | 80.01     | 595 sec
Baseline + Brown 100 PubMed          | 82.29   | 73.86   | 77.85     | 346 sec
Baseline + Brown 300 PubMed          | 84.59   | 78.47   | 81.41     | 389 sec

Comparison of different setups evaluated on the development set
26
Experiments and Results:
The best runs for CDI sub-task


Picked the best setups for both the CEM and CDI sub-tasks
The following table shows the top four runs for the CDI sub-task
Description                          | Pre (%) | Rec (%) | F-scr (%)
Baseline + Brown 50 PubMed + WE 50   | 80.91   | 81.29   | 81.10
Baseline + Brown 1000 RCV1           | 81.54   | 84.01   | 81.23
Baseline + Brown 50 PubMed           | 82.33   | 82.57   | 82.45
Baseline + Brown 300 PubMed          | 81.91   | 84.10   | 82.58

The best runs for the CDI sub-task
27
Experiments and Results:
The final result on the test set


65 teams registered in total
27 teams submitted results
87 researchers from around the world participated
28
Experiments and Results:
The final result on the test set
29
Experiments and Results:
Discussion & Findings

WE features did not always improve the performance

In some cases they degrade the system performance with the CRF model

Brown cluster features improve the F-measure

The improvement is significant when the model is built on the domain corpus

Lemma and Brown cluster features in the BANNER-CHEMDNER system were observed to boost the performance by around 4% of F-measure

About 2% from each
An improvement in the same range should also be achievable in biomedical (gene/protein) NER with BANNER-CHEMDNER
30
Conclusion and Future Work

We developed a new branch of the BANNER system, called BANNER-CHEMDNER, for CHEMDNER

A better recognition accuracy
Scalable over millions of documents
Configurable via XML
Pluggable into other systems

Our system processes ~530 documents per minute

Ongoing and future work:

Evaluate the BANNER-CHEMDNER setup for gene/protein mention recognition
Try WE features with other classifiers
Induce an accurate Brown cluster model and WE matrix from PubMed documents for the community

Korean NIH: Development of Biomedical Text Mining Systems (2011 - now)

PPI miner system (2011 - 2012)
PubMed clustering system (2013)
31
Q&A
Thank you 