Mining External Resources for Biomedical IE
SMBM Talks
NLP for Biomedical Text Mining
SMBM, Cambridge, April 11-13 (Edinburgh May 2)
Resources and Tools for
Biomedical Text Mining
Junichi Tsujii (U of Tokyo)
Keywords: GENIA corpus; annotation
Main point: progress in text mining depends on the integration
of growing GENIA annotation (e.g. coreference) with lexical
resources for domain knowledge (ontologies) and software
development.
Take home message: see main point above
• annotated corpus
• POS
• NER
• coreference (670 abstracts, Singapore)
• interaction (biological events; cooperation with CNRS)
• parse trees (1.5 million GENIA abstracts parsed in 10 days
using a 100 PC cluster)
• ontology
• top nodes: substance; source; other
• software development
• POS tagger
• NER tagger
• parser
• IR system (Medusa)
• IE (event extraction: relation gene/disease) system
• POS tagger
• MaxEnt model (Kazama and Tsujii 2003, 2005)
• Trained on WSJ (>39,000 sent.) and GENIA (18,500 sent.)
• Accuracy (%) by training corpus (columns) and test corpus (rows):
test \ train   WSJ    GENIA   WSJ+GENIA
WSJ            97.0   75.2    96.9
GENIA          84.3   98.1    98.1
• NER tagger
• combines a rule-based and statistical approach
• on BioNLP: 70.8% (?) -- our system got 70.1%
• HPSG-based parser (Enju)
• see Miyao et al. ACL05
• available on website
• XML output
• dependency relations
• predicate-argument accuracy:
• PTB: prec=88.3%, rec=87.2%
• GENIA: lower...
• gene/disease relation extraction
• pred/arg works better than bag of words or local context
(gives best precision)
Recognising noun phrases in biomedical
text: an evaluation of lab prototypes
and a commercial chunker
J. Wermter, J. Fluck, J.Stroetgen, S.Geissler, U. Hahn (U. Jena, Temis)
Keywords: chunking, portability
Main point: take several existing chunkers trained on (or
developed for) newspaper text and evaluate their performance
on biomedical data (beta version of GENIA syntactic annotation).
Take home messages:
• overall performance drop (~3-6 points) for ML systems when
shifting to bio domain
• no significant difference between statistical and rule-based
systems
Three statistical chunkers:
• YamCha (support vector machine)
• Tbl (transformation-based error-driven learning)
• BoSS (boundaries predictor by combining observed probabilities
of NP boundaries and POS patterns in trainset)
One rule-based commercial system
• Temis
1. Uses words rather than GENIA POS tags
2. Computes morphological information (XeLDA toolkit)
3. HMM POS tagger disambiguates chain of POS tags
• hand-coded grammar had to be modified (on PTB)
• tagset had to be translated (not straightforward)
Training and Test Sets
Train
• sections 15-18 of Penn Treebank for training
(over 200,000 POS-tagged tokens and IOB-chunked)
Test
• GENIA treebank (beta version)
(200 MedLine abstracts with syntactic annotation)
the GENIA treebank was automatically converted
into the IOB format
• just under 45,000 tokens
• ~11,000 = devtest for setting Temis’ IOB output
• ~34,000 = actual test set
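The conversion of bracketed noun phrases into IOB tags can be sketched as follows. This is only an illustration, not the conversion script used in the talk: the function name `spans_to_iob` and the (start, end) span representation are assumptions, and the B-NP/I-NP/O labels follow the style of the CoNLL-2000 chunking data.

```python
def spans_to_iob(tokens, np_spans):
    """Tag each token as B-NP (starts an NP), I-NP (inside an NP) or O.

    np_spans holds (start, end) token offsets with end exclusive.
    This span format is an assumption for the sketch, not the GENIA format.
    """
    tags = ["O"] * len(tokens)
    for start, end in np_spans:
        tags[start] = "B-NP"
        for i in range(start + 1, end):
            tags[i] = "I-NP"
    return list(zip(tokens, tags))


# Toy sentence (invented for illustration) with two noun phrases.
tokens = ["IL-2", "gene", "expression", "requires", "NF-kappa", "B"]
tagged = spans_to_iob(tokens, [(0, 3), (4, 6)])
```

Once both corpora are in this token-per-tag form, a chunker trained on PTB can be scored on GENIA without further format changes.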
Results and Errors
          PTB Corpus              GENIA Corpus
          Rec    Prec   F         Rec    Prec   F
YamCha    94.29  94.15  94.22     89.00  89.30  89.15
BoSS      89.92  90.10  90.01     86.46  86.84  86.65
Tbl       92.27  91.80  92.03     86.31  85.49  85.90
Temis     86.94  86.29  86.61     87.14  85.34  86.23
After domain adaptations
          Rec    Prec   F
Temis     91.24  90.59  90.91
BoSS      87.25  89.19  88.21
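The F column in these results is consistent with the balanced F-measure, i.e. the harmonic mean of recall and precision. A minimal sketch for checking a row and the size of the domain drop (the function name `f_score` is an assumption for the example):

```python
def f_score(recall, precision):
    """Balanced F-measure: harmonic mean of recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# YamCha figures taken from the results above.
yamcha_ptb = f_score(94.29, 94.15)
yamcha_genia = f_score(89.00, 89.30)
drop = yamcha_ptb - yamcha_genia  # F-score lost when moving to GENIA
```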
Errors
• Coordination
• bracketed elements
• ...
Automatic Term List Generation
for Entity Tagging
Ted Sandler, Andrew Schein, and Lyle Ungar (CS, UPenn)
Keywords: NER, automatic gazetteer creation
Main point: term lists can be obtained automatically, and when
integrated in a NER (gene)tagger (CRF) boost its performance
to a level comparable with hand-modelled lists
Take home messages:
• unsupervised gazetteer creation is feasible and useful
• supervised methods for obtaining terms outperform
unsupervised methods
Overall Approach
• choose set of vocabulary items (nouns) to partition into classes
• choose set of useful syntactic relations
• frequent
• informative
• relatively noise-free
• parse corpus to extract relations and collect statistics
• use clustering algorithm to partition the vocabulary
• resulting partitions are term lists
4 related methods for generating term lists; they differ wrt:
(see table)
• word representation
• clustering algorithms to partition the words
• choice of feature weighting
Corpus
• 15,000 sentences from BioCreative + 1,800,547 Medline abstracts
• parsed using Minipar; vocabulary=7782 single token nouns
Representation of the base vocabulary
• vector space where each item is represented
by set of syn configurations it occurs in
• affinity matrix where each item is represented
as its similarities to other items in the vocabulary
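The two representations can be illustrated with a small sketch. The (noun, relation, other word) triples below are invented toy data, and cosine similarity is an assumption: the notes do not say which similarity measure fills the affinity matrix.

```python
import math
from collections import Counter, defaultdict


def build_vectors(triples):
    """Vector space: each noun maps to counts of its syntactic contexts."""
    vectors = defaultdict(Counter)
    for noun, relation, other in triples:
        vectors[noun][(relation, other)] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def affinity_matrix(vectors):
    """Affinity matrix: each noun is described by its similarity to every noun."""
    nouns = sorted(vectors)
    rows = [[cosine(vectors[a], vectors[b]) for b in nouns] for a in nouns]
    return nouns, rows


# Hypothetical parser output, for illustration only.
triples = [
    ("p53", "subj-of", "bind"),
    ("p53", "obj-of", "express"),
    ("BRCA1", "subj-of", "bind"),
    ("BRCA1", "obj-of", "express"),
    ("patient", "obj-of", "treat"),
]
vectors = build_vectors(triples)
nouns, affinity = affinity_matrix(vectors)
```

In this toy example the two gene names share all their contexts and so get an affinity close to 1, while "patient" shares none with them.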
Weighting Schemes
• Pearson’s chi-square test
• Generalized Likelihood Ratio (G-square; Dunning 1993;
better with sparse data)
• first better at “common sense” generalisations; second
better at domain-specific generalisations
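Both weighting schemes can be computed from a 2x2 contingency table of a noun against a syntactic context. The sketch below is a generic textbook formulation, not the authors' code; the cell naming (`o11` = noun with context, `o12` = noun without context, `o21` = other nouns with context, `o22` = the rest) is an assumption of the example.

```python
import math


def expected_counts(o11, o12, o21, o22):
    """Expected cell counts under independence of noun and context."""
    n = o11 + o12 + o21 + o22
    row1, row2 = o11 + o12, o21 + o22
    col1, col2 = o11 + o21, o12 + o22
    return [row1 * col1 / n, row1 * col2 / n, row2 * col1 / n, row2 * col2 / n]


def pearson_chi_square(o11, o12, o21, o22):
    """Pearson's chi-square statistic for a 2x2 table."""
    observed = [o11, o12, o21, o22]
    expected = expected_counts(*observed)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def g_square(o11, o12, o21, o22):
    """Generalized likelihood ratio statistic (G-square) for a 2x2 table."""
    observed = [o11, o12, o21, o22]
    expected = expected_counts(*observed)
    # Empty cells contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


# Toy counts: the noun and the context always occur together.
chi2 = pearson_chi_square(10, 0, 0, 10)
g2 = g_square(10, 0, 0, 10)
```

Either statistic can then replace the raw co-occurrence count as the weight of a context in a noun's vector.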
Clustering Algorithms
• k-means clustering for words in vector space (high recall)
• agglomerative clustering for data in affinity matrix (high prec)
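As an illustration of how a clustering step turns word vectors into term lists, here is a plain k-means sketch. It is not the authors' implementation: the two-dimensional vectors, the fixed starting centroids and the helper names are assumptions chosen to keep the example small and deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two dense vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, iterations=10):
    """Plain k-means with caller-supplied starting centroids."""
    centroids = [list(c) for c in initial_centroids]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def clusters_to_term_lists(words, assignment):
    """Group words by cluster id; each group is one term list."""
    lists = {}
    for word, cluster in zip(words, assignment):
        lists.setdefault(cluster, []).append(word)
    return lists


# Toy vocabulary and vectors, invented for illustration.
words = ["p53", "BRCA1", "patient", "cohort"]
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters = kmeans(points, initial_centroids=[points[0], points[2]])
lists = clusters_to_term_lists(words, clusters)
```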
NER (Gene) Tagging
• McDonald and Pereira’s CRF tagger
• automatically generated 2,164 overlapping term lists
incorporated as features in the model
• binary feature (0/1) for each term list (in=1; not=0)
• baseline tagger without lists
• tagger augmented with hand-compiled lists of genes (57,563)
• tagger augmented with large list of genes obtained via
supervised learning (Tanabe and Wilbur Gene.Lexicon:1,145,913)
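The binary term-list features can be sketched as one 0/1 indicator per list. The list names and contents below are hypothetical, and how the features are passed to the CRF tagger is not shown here.

```python
def term_list_features(token, term_lists):
    """One binary feature per term list: 1 if the token is on the list, else 0."""
    return {name: int(token in terms) for name, terms in term_lists.items()}


# Hypothetical term lists; the real system used 2,164 overlapping lists.
term_lists = {
    "cluster_17": {"p53", "BRCA1", "myc"},
    "cluster_42": {"patient", "cohort"},
}
features = term_list_features("p53", term_lists)
```

Because the lists overlap, a token may switch on several of these features at once.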
TRAIN/TEST: 5-fold Xvalidation on 394,661 words of BioCreative
(1/5 for training and 4/5 for testing)
          Baseline  Unsupervised  Supervised  Manual
prec      0.698     0.705         0.709       0.716
rec       0.613     0.622         0.621       0.631
f-score   0.653     0.661         0.662       0.671
Protein-Protein Interaction Extraction: A
Supervised Learning Approach
J. Xiao, J. Su, G. Zhou, C. Tan
(Inst. For Infocomm Research, Singapore)
Keywords: relation extraction
Main point: a MaxEnt approach to protein-protein relation
extraction that exploits simple local features performs better
than co-occurrence and rule-based approaches, achieving nearly
94% recall and 88% precision on 303 MedLine abstracts.
Take home message:
• supervised learning with shallow features works well for
protein-protein interaction extraction
Task: extract pairs of interacting proteins
• no direction
• perfect NER (manual annotation)
Procedure
• tokenisation and morphological analysis
• POS tagging
• NER
• sentence analysis (parsing)
• coreference resolution (including abbreviations and aliases)
• MaxEnt classifier
Features
• Words
• all words that appear in two protein names
• words in between two protein names
• previous/next words in an n-word window (unordered)
• Overlap
• number of protein names in between 2 protein names
• Keywords
• occurrence of word from keyword list in surroundings
• Chunks
• all heads of base phrases in between 2 protein names
• all heads surrounding the protein name pair
• all phrase types between 2 protein names
• Parse Tree
• Dependency Tree
• dependency between two proteins
• Pair of heads of protein names
• Pair of abbreviations of two proteins
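A few of the shallow features listed above (words in between, window words, overlap, keywords) can be sketched for a single candidate pair. This is an illustration under simplifying assumptions, not the system described in the talk: protein names are treated as single tokens, and the function name, the keyword set and the example sentence are invented.

```python
def pair_features(tokens, i, j, protein_positions, keywords, window=2):
    """Shallow features for the protein pair at token positions i < j."""
    between = tokens[i + 1:j]
    left = tokens[max(0, i - window):i]
    right = tokens[j + 1:j + 1 + window]
    return {
        # Words between the two protein names.
        "between_words": set(between),
        # Unordered words in a window before the first and after the second name.
        "window_words": set(left) | set(right),
        # Overlap: number of other protein names between the two.
        "overlap": sum(1 for p in protein_positions if i < p < j),
        # Keyword: any word from the keyword list in the surroundings.
        "has_keyword": any(t.lower() in keywords for t in between + left + right),
    }


tokens = ["We", "show", "that", "RAD51", "interacts", "with", "BRCA2", "in", "vivo"]
feats = pair_features(tokens, 3, 6, protein_positions=[3, 6],
                      keywords={"interacts", "binds"})
```

Each such feature dictionary, paired with a positive or negative label, would form one training instance for the classifier.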
Experiment and Results
• corpus: IEPA (Iowa University)
• 303 Medline abstracts
• 633 positive instances
• 1080 negative instances
• POS tagger trained on GENIA using an HMM model
• Collins’ parser
• 10-fold Xvalidation
• best result: rec=93.9%; prec=88%; f=90.9
GOOD Features
- words (esp. surrounding)
- chunks
- pairs of protein heads
- pairs of abbreviations
- keywords (so-so)
NOT SO GOOD Features
- overlap
- parse trees
- dependency relations
Challenges of Information Mining in
a Pharmaceutical Environment
Philippe Sanseau (GlaxoSmithKline, UK)
Main point:
Q: How do you see the role of NLP in your field?
A: Excuse me, could someone explain what NLP is, please?
Take home question:
are NLP and pharmaceutical communities on the same track?