Application of NLP techniques to IE and IR
CREST
Language Processing Group

Outline

Background
Building NLP resources
  GENIA
Extracting Disease-Gene Associations from MEDLINE
  H-Invitational
  Extracting DGAs by machine learning
An IR system for predicate-argument relations
  MEDUSA

Application to the Biomedical domain

Plenty of text
  MEDLINE database: 12 million abstracts
  Need for effective IE and IR
Domain knowledge
  Gene Ontology, KEGG, UMLS, ICD, …
Other information sources
  A variety of molecular databases
    DNA sequences, motifs, diseases, molecular interactions, etc.

Developing NLP resources

Resources for NLP research
  Domain knowledge
  Training data for ML-based techniques
  Test data for evaluating the transferability of a system
We are now developing…
  GENIA
    Ontology
    Corpus

GENIA corpus

4,000 MEDLINE abstracts
  Selected by MeSH terms (Human, Blood cells, Transcription factors)
  XML format
Contents
  Named-entity (Kim et al. 2003)
  Part-of-speech (Tateisi et al. 2004)
  Parse tree
  Co-reference (Institute for Infocomm Research, Singapore)

GENIA named-entity corpus

The peri-kappa B site mediates human immunodeficiency virus type 2 enhancer activation in monocytes …
  peri-kappa B site: DNA
  human immunodeficiency virus type 2: virus
  monocytes: cell_type

Terms are annotated based on the semantic classes in the GENIA ontology
Size
  2,000 abstracts
  Number of terms: 92,723
  Vocabulary size: 36,568
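
Since the corpus is distributed in XML, term annotations of this kind can be read with a standard XML parser. The following is a minimal sketch only: the element name `cons`, the attribute `sem`, and the class labels are assumptions made for illustration and are not taken from the corpus schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: the element name `cons` and the attribute
# `sem` are assumptions for this sketch, not a statement of the real schema.
SENTENCE_XML = (
    '<sentence>The <cons sem="DNA">peri-kappa B site</cons> mediates '
    '<cons sem="virus">human immunodeficiency virus type 2</cons> '
    'enhancer activation in <cons sem="cell_type">monocytes</cons></sentence>'
)

def extract_terms(sentence_xml):
    """Return (term text, semantic class) pairs in document order."""
    root = ET.fromstring(sentence_xml)
    return [("".join(el.itertext()), el.get("sem")) for el in root.iter("cons")]

terms = extract_terms(SENTENCE_XML)
for text, sem in terms:
    print(f"{sem}\t{text}")
```

Using `itertext()` rather than `el.text` keeps the full surface string even if a term contains nested annotations.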

GENIA part-of-speech corpus

The/DT peri-kappa/NN B/NN site/NN mediates/VBZ human/JJ immunodeficiency/NN virus/NN type/NN 2/CD enhancer/NN activation/NN in/IN monocytes/NNS …

Each token is annotated with its part-of-speech tag.
Size
  2,000 abstracts
  20,544 sentences
  501,054 words (about half the size of Penn Treebank)
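
A token-level annotation like this is easy to consume programmatically. The sketch below assumes a plain `word/TAG` line format, which is chosen here for illustration and may differ from the distributed file format.

```python
from collections import Counter

# Assumed "word/TAG" encoding of the example sentence from the slide.
TAGGED = (
    "The/DT peri-kappa/NN B/NN site/NN mediates/VBZ human/JJ "
    "immunodeficiency/NN virus/NN type/NN 2/CD enhancer/NN "
    "activation/NN in/IN monocytes/NNS"
)

def parse_tagged(line):
    """Split 'word/TAG' tokens; rsplit keeps any slash inside a word intact."""
    return [tuple(tok.rsplit("/", 1)) for tok in line.split()]

pairs = parse_tagged(TAGGED)
tag_counts = Counter(tag for _, tag in pairs)
print(tag_counts.most_common())
```

Splitting on the last slash matters in this domain, because biomedical tokens can themselves contain slashes.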

GENIA treebank

(S (NP CD3-epsilon expression)
   (VP is
       (VP controlled
           (PP by
               (NP a downstream (ADJP T lymphocyte-specific) enhancer element)))))

Based on the Penn Treebank standard
Size
  200 abstracts (1,500 abstracts by the end of this fiscal year)
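
Penn Treebank-style trees are usually exchanged as bracketed strings, which a few lines of code can read. This is a sketch under assumptions: the bracketing below is one plausible reading of the tree labels on the slide, and part-of-speech preterminals are omitted for brevity.

```python
# One plausible bracketing of the slide's example (an assumption, simplified:
# words hang directly under phrase nodes, without part-of-speech preterminals).
BRACKETED = (
    "(S (NP CD3-epsilon expression) (VP is (VP controlled (PP by "
    "(NP a downstream (ADJP T lymphocyte-specific) enhancer element)))))"
)

def parse_tree(text):
    """Parse a bracketed string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()

def labels(tree):
    """Constituent labels in pre-order."""
    out = [tree[0]]
    for child in tree[1]:
        if isinstance(child, tuple):
            out.extend(labels(child))
    return out

def leaves(tree):
    """Words of the sentence, left to right."""
    out = []
    for child in tree[1]:
        out.extend(leaves(child) if isinstance(child, tuple) else [child])
    return out

tree = parse_tree(BRACKETED)
print(labels(tree))
print(" ".join(leaves(tree)))
```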

GENIA corpus

Used in more than 240 institutions
  Japan (28), Asia (54), North America (63), Europe (62), etc.
De facto standard for evaluating biomedical named-entity recognition systems
BioNLP workshop at Coling 2004
  Named-entity recognition shared task
    Institute for Infocomm Research (Singapore)
    Stanford University (USA)
    University of Edinburgh (UK)
    University of Wisconsin-Madison (USA)
    Pohang University of Science and Technology (Korea)
    University of Alberta (Canada)
    University of Duisburg-Essen (Germany)
    Korea University (Korea)
    National Taiwan University (Taiwan)

H-Invitational Disease Edition

[Figure: workflow diagram. Box labels: Specific disease; Select specific disease; Literature (PubMed); Dictionary; Text-mining; List of genes; Known disease gene; Genomic region of interest (GROI); H-InvDB; Other DB; Scoring system (PANDA); Genes with high score; SNPs (1. public, 2. private); Gene expression (1. public, 2. private); Synthetic analysis (AND/OR); Final result]
June 25, 2004
Disease group, JBIRC

Disease-Gene Associations extracted from MEDLINE

DGA explorer (demo)
Text
  1.5 million MEDLINE abstracts
    Selected by MeSH terms
      “Disease Category” AND (“Amino Acids, Peptides, and Proteins” OR “Genetic Structures”)
Parsing
  All the sentences were parsed by the HPSG parser
  Using a PC cluster (100 processors with GXP)
  Time: 10 days
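
The MeSH selection above is a plain Boolean condition over an abstract's indexing terms. A minimal sketch follows, assuming each record already lists the category-level headings named in the query; real MeSH indexing is hierarchical, so a full implementation would first map specific headings to their categories. The record identifiers and term lists are invented for illustration.

```python
PROTEIN_OR_GENETIC = {"Amino Acids, Peptides, and Proteins", "Genetic Structures"}

def matches_query(mesh_terms):
    """True if the terms satisfy:
    "Disease Category" AND ("Amino Acids, Peptides, and Proteins" OR "Genetic Structures")
    """
    terms = set(mesh_terms)
    return "Disease Category" in terms and bool(terms & PROTEIN_OR_GENETIC)

# Hypothetical records: identifiers and term lists are made up for this sketch.
records = {
    "abstract-1": ["Disease Category", "Genetic Structures"],
    "abstract-2": ["Disease Category"],
    "abstract-3": ["Amino Acids, Peptides, and Proteins"],
    "abstract-4": ["Disease Category", "Amino Acids, Peptides, and Proteins"],
}
selected = [rid for rid, terms in records.items() if matches_query(terms)]
print(selected)
```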

Disease-Gene Associations in texts

These results suggested that targeted disruption of Cyp19 caused anovulation and precocious depletion of ovarian follicles.
Furthermore, AML cells with methylated p15(INK4B) tended to express higher levels of DNMT1 and 3B.

Training data

All co-occurrences are classified as “relevant” or “irrelevant” by a domain expert.
  All foals with OLWS were homozygous for the Ile118Lys EDNRB mutation, and adults that were homozygous were not found.
  Dominant radial drusen and Arg345Trp EFEMP1 mutation.
  The 5 year overall survival (OS) and event-free survival (EFS) were 94 and 90 +/- 8%, respectively, with a median follow-up of 48 months.
  These data may indicate that formation of parathyroid adenoma in young patients is related to a mechanism involving EGFR.
  …
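
Candidate co-occurrences, the items the expert then labels, can be collected by looking for a gene name and a disease name in the same sentence. The sketch below assumes simple dictionary matching; the two small name sets are hypothetical stand-ins for real dictionaries, and the slides do not state how candidates were actually collected.

```python
import re

# Hypothetical mini-dictionaries; a real system would load much larger ones.
GENES = {"EDNRB", "EFEMP1", "EGFR"}
DISEASES = {"OLWS", "radial drusen", "parathyroid adenoma"}

def find_names(sentence, names):
    """Dictionary names that occur in the sentence as whole words (case-sensitive)."""
    return sorted(
        n for n in names
        if re.search(r"(?<!\w)" + re.escape(n) + r"(?!\w)", sentence)
    )

def cooccurrences(sentence):
    """All (gene, disease) pairs mentioned together in one sentence."""
    return [(g, d)
            for g in find_names(sentence, GENES)
            for d in find_names(sentence, DISEASES)]

examples = [
    "All foals with OLWS were homozygous for the Ile118Lys EDNRB mutation, "
    "and adults that were homozygous were not found.",
    "Dominant radial drusen and Arg345Trp EFEMP1 mutation.",
    "The 5 year overall survival (OS) and event-free survival (EFS) were 94 and "
    "90 +/- 8%, respectively, with a median follow-up of 48 months.",
]
pairs = [cooccurrences(s) for s in examples]
print(pairs)
```

Each resulting pair is only a candidate; whether it expresses a real association is exactly what the classifier has to decide.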

Maximum entropy learning

Log-linear model
  $$ q(x) = \frac{1}{Z} \exp\left( \sum_{i=1}^{F} \lambda_i f_i(x) \right) $$
  $f_i$: feature function
  $\lambda_i$: weight
Features
  Bag-of-words
  Local context
  Gene/disease name
  Predicate-argument structures
  …
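
To make the log-linear form concrete, the sketch below evaluates q(x) = exp(Σ_i λ_i f_i(x)) / Z over a small finite set of outcomes. The weights and binary feature values are invented for illustration, and treating the two outcomes as the "relevant" and "irrelevant" labels of one co-occurrence is an assumption of this sketch. Training the weights is not shown.

```python
import math

def loglinear_probs(weights, feature_vectors):
    """q(x) = exp(sum_i lambda_i * f_i(x)) / Z over a finite set of outcomes x.

    weights: list of lambda_i
    feature_vectors: dict mapping each outcome x to its list of f_i(x)
    """
    scores = {
        x: math.exp(sum(w * f for w, f in zip(weights, fv)))
        for x, fv in feature_vectors.items()
    }
    z = sum(scores.values())          # normalizer Z
    return {x: s / z for x, s in scores.items()}

# Hypothetical weights and binary feature values for one co-occurrence.
weights = [1.2, -0.7, 0.5]
features = {
    "relevant":   [1, 0, 1],
    "irrelevant": [0, 1, 0],
}
probs = loglinear_probs(weights, features)
print(probs)
```

With two outcomes the model reduces to a logistic function of the score difference, which is a convenient sanity check.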

Features of predicate-argument structures (1)

[Diagram: a predicate X takes the gene/disease name as its ARG2]
Dedifferentiation of adenoid cystic carcinoma: report of a case implicating p53 gene mutation.

Features of predicate-argument structures (2)

[Diagram: a predicate X takes a gene/disease name as its ARG1 and a disease/gene name as its ARG2]
These results suggested that targeted disruption of Cyp19 caused anovulation and precocious depletion of ovarian follicles.
Furthermore, AML cells with methylated p15(INK4B) tended to express higher levels of DNMT1 and 3B.
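
The two patterns can be turned into string-valued features for the classifier. This is a rough sketch, not the feature set used in the work: the relation triple is hand-written rather than read from parser output, substring containment stands in for proper argument matching, and the feature names are invented.

```python
def pas_features(relations, gene, disease):
    """Emit features for predicates that involve a gene or disease mention.

    relations: list of (predicate, ARG1 phrase, ARG2 phrase) triples.
    """
    feats = []
    for pred, arg1, arg2 in relations:
        # Pattern (1): the predicate takes one of the names inside its ARG2.
        for kind, name in (("gene", gene), ("disease", disease)):
            if name in arg2:
                feats.append(f"ARG2_of={pred}:{kind}")
        # Pattern (2): one name inside ARG1 and the other inside ARG2.
        if gene in arg1 and disease in arg2:
            feats.append(f"gene-{pred}-disease")
        elif disease in arg1 and gene in arg2:
            feats.append(f"disease-{pred}-gene")
    return feats

# Hand-written relation for "targeted disruption of Cyp19 caused anovulation".
relations = [("cause", "targeted disruption of Cyp19", "anovulation")]
feats = pas_features(relations, gene="Cyp19", disease="anovulation")
print(feats)
```

Features of this kind capture who did what to whom, which bag-of-words and local-context features cannot express.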

Extraction accuracy

Training/test data: 2,253 sentences
10-fold cross validation

features                          recall   precision   f-score
N/A                               1.0      0.351       0.520
+ bag of words                    0.733    0.682       0.706
+ local context                   0.733    0.695       0.714
+ predicate-argument structures   0.759    0.710       0.733
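
The f-score column can be checked against the recall and precision columns, assuming it is the usual balanced F-measure (the harmonic mean of the two). The row labels in the sketch are shorthand; the first row's label is an assumption about which feature set the unlabeled figures belong to.

```python
def f_score(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (recall, precision, reported f-score) as given on the slide.
rows = {
    "no features": (1.0, 0.351, 0.520),
    "+ bag of words": (0.733, 0.682, 0.706),
    "+ local context": (0.733, 0.695, 0.714),
    "+ predicate-argument structures": (0.759, 0.710, 0.733),
}
recomputed = {name: f_score(p, r) for name, (r, p, _) in rows.items()}
for name, value in recomputed.items():
    print(f"{name}: {value:.3f}")
```

The recomputed values agree with the reported ones to within about 0.001, the small differences being expected when recall and precision are themselves rounded.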

MEDUSA: An IR system for predicate-argument structures

Ex. Search for sentences in which the subject of the verb "activate" is a protein.
• Simple: Since the PHO2 Asp-230 mutant mimics Ser-230-phosphorylated PHO2, we postulate that only phosphorylated PHO2 protein could activate the transcription of PHO5 gene.
• With a relative pronoun: Transcription initiation by the sigma(54)-RNA polymerase holoenzyme requires an enhancer-binding protein that is thought to contact sigma(54) to activate transcription.
• Coordination: Full-length Staufen protein lacking this insertion is able to associate with oskar mRNA and activate its translation, but fails to localize the RNA to the posterior.
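
Once sentences are parsed, a query of this kind becomes a lookup over predicate-argument relations instead of a keyword search, so all three constructions are found by the same condition. The sketch below is a toy illustration and not MEDUSA's actual interface: the `Relation` record, its field names, and the hand-written index are all assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Relation:
    sentence_id: int
    predicate: str    # base form of the verb
    arg1: str         # logical subject
    arg1_class: str   # semantic class of the subject, e.g. "protein"
    arg2: str         # logical object

# Hand-written toy index; a real system would fill it from parser output.
INDEX = [
    Relation(1, "activate", "phosphorylated PHO2 protein", "protein",
             "the transcription of PHO5 gene"),
    Relation(2, "require", "Transcription initiation", "other",
             "an enhancer-binding protein"),
    Relation(2, "activate", "an enhancer-binding protein", "protein",
             "transcription"),
    Relation(3, "activate", "Staufen protein", "protein", "its translation"),
]

def search(index, predicate, arg1_class):
    """Sentence ids where `predicate` has a subject of the given semantic class."""
    return [r.sentence_id for r in index
            if r.predicate == predicate and r.arg1_class == arg1_class]

hits = search(INDEX, "activate", "protein")
print(hits)
```

In the relative-pronoun and coordination cases the subject is not adjacent to the verb in the text, which is why the lookup runs over relations rather than word sequences.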

MEDUSA demonstration

100,000 MEDLINE abstracts
Parsed by Enju
Genes and diseases are annotated by using the UMLS dictionary

Summary

GENIA corpus
  Parts of speech, Named-entities, Parse trees
Extracting gene-disease associations from MEDLINE
  Machine learning with HPSG parse results
An IR system for predicate-argument structures
  MEDUSA

Software and resources

GENIA
  Named entity corpus
  Part-of-speech corpus
  Parse tree corpus
  Co-reference (Singapore)
  Part-of-speech tagger
  Named entity tagger (soon)
  HPSG parse results (100,000 MEDLINE abstracts)
  Enju (HPSG parser)
  MEDUSA
  LiLFeS
  Amis