Information Extraction from Literature
Yue Lu
BeeSpace Seminar
Oct 24, 2007
Outline
Overview of BeeSpace v4
Entity Recognition
Relation Extraction
Overview
BeeSpace v4
Deeper semantic base than the current v3 system
Entities and relations vs. mutual information
Four levels
Level1: Entity Recognition
Level2: Entity Association Mining
Level3: Relation Extraction
Level4: Inference and Hypothesis Generation
Overview
Level1: Entity Recognition (detailed later)
Level2: Entity Association Mining
Suppose entities are properly tagged
Utilize the co-occurrence patterns of entities to extract semantics
e.g. a bee biologist may want to know which
genes are important for foraging behavior.
Similar to TREC Genomics 2007 task
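A minimal sketch of the co-occurrence idea above, assuming entities are already tagged inline in the style of the examples later in this talk. The `<BEHAVIOR>` tag, the placeholder gene names and the three sentences are invented for illustration, and a raw sentence-level count stands in for whatever association score BeeSpace actually uses.

```python
import re
from collections import Counter
from itertools import product

# Matches inline tags such as <GENE>mxp</GENE>.
TAG_RE = re.compile(r"<(?P<type>[A-Z ]+)>(?P<text>.*?)</(?P=type)>")

def entities(sentence):
    """Return the set of (type, lowercased text) pairs tagged in one sentence."""
    return {(m.group("type"), m.group("text").strip().lower())
            for m in TAG_RE.finditer(sentence)}

def cooccurrence(sentences, type_a, type_b):
    """Count the sentences in which an entity of type_a and one of type_b co-occur."""
    counts = Counter()
    for sentence in sentences:
        ents = entities(sentence)
        a = [text for etype, text in ents if etype == type_a]
        b = [text for etype, text in ents if etype == type_b]
        counts.update(product(a, b))
    return counts

# Hypothetical tagged sentences (placeholders, not real findings).
sentences = [
    "<GENE>gene_a</GENE> is discussed together with <BEHAVIOR>foraging</BEHAVIOR>.",
    "<GENE>gene_a</GENE> and <GENE>gene_b</GENE> are both mentioned with <BEHAVIOR>foraging</BEHAVIOR>.",
    "<GENE>gene_b</GENE> is mentioned with the <ANATOMY>brain</ANATOMY>.",
]
counts = cooccurrence(sentences, "GENE", "BEHAVIOR")
ranked = counts.most_common()  # gene/behavior pairs, most frequent first
```

Ranking the pairs by count gives a first answer to a question like "which genes are important for foraging behavior"; a real system would probably normalize for how frequent each entity is on its own.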
TREC Genomics 2007
e.g. “Which [PATHWAYS] are possibly
involved in the disease ADPKD?”
Currently only retrieval techniques
Gene synonym expansion
Conjunctive query interpretation
User relevance feedback
Tagged entities would definitely help
Overview
Level3: Relation Extraction
Goal is to extract the relations between entities
Generally requires entities to be properly tagged first
Detailed later
Level4: Inference and Hypothesis Generation
Inference on knowledge base
Graph mining
Outline
Overview of BeeSpace v4
Entity Recognition
Relation Extraction
Entity Recognition
Gene
Example:
Although <GENE>mxp</GENE> and
<GENE>Pb</GENE> display very similar
expression patterns, <GENE>pb</GENE>
null embryos develop normally
Entity Recognition
Anatomy
Example:
In normal embryos, mxp is expressed in the
<ANATOMY>maxillary</ANATOMY> and
<ANATOMY>labial</ANATOMY> segments,
whereas ectopic expression is observed in
some GOF variants.
Entity Recognition
Biological process
Example:
Amongst these are the Bicoid, the Nanos, and the
terminal class gene products, some of which are
oncoproteins involved in signal transduction for
<BIOLOGICAL PROCESS>the formation of terminal
structures in the embryo</BIOLOGICAL PROCESS>.
Entity Recognition
Pathways
Example:
Several signal transduction pathways have been
described in Drosophila, and this review explores
the potential of oncogene studies using one of those
pathways - <PATHWAY>the terminal class signal
transduction pathway</PATHWAY> - to better
understand the cellular mechanisms of proto-oncogenes
that mediate cellular responses in
vertebrates including humans.
Entity Recognition
Protein family
Example:
While non-arthropod orthologs have been found for
many Drosophila eye developmental genes, this
has not been the case for the glass (gl) gene, which
encodes a <PROTEIN FAMILY>zinc finger
transcription factor</PROTEIN FAMILY> required
for photoreceptor cell specification, differentiation,
and survival.
Entity Recognition
CRE (cis-regulatory elements)
Example:
A synthetic, 23-bp <CRE>ecdysterone regulatory
element (EcRE) </CRE>, derived from the upstream
region of the Drosophila melanogaster hsp27 gene,
was inserted adjacent to the herpes simplex virus
thymidine kinase promoter fused to a bacterial gene
for chloramphenicol acetyltransferase (CAT).
Entity Recognition
Phenotype
Definition:
a set of observable physical characteristics
of an individual organism
Example:
Fog, dumpy
Entity Recognition
Class1: Small Variation
(Dictionary/Ontology)
Organism, Anatomy, Biological Process, Pathway, Protein Family
Class2: Medium Variation
Gene, cis-Regulatory Element
Class3: Large Variation
Phenotype, Behavior
Entity Recognition
Generally can be defined as a
classification problem
Boils down to feature definition
Class1: matching a word in the Dictionary/Ontology
Class2: prefix/suffix of the word, POS tags, …
Class3:?
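The feature definitions above could be sketched as follows. This is an illustration, not the BeeSpace feature set: the affix length of 3, the digit and upper-case flags, and the tiny lexicon are assumptions. POS tags are left out because they need an external tagger.

```python
def token_features(token, lexicon):
    """Features for one candidate token.

    `lexicon` maps an entity type to a set of lowercased names and
    provides the Class1 feature (dictionary/ontology match).
    The affix and character features are the kind of surface cues
    listed for Class2.
    """
    lower = token.lower()
    feats = {
        "prefix3": lower[:3],
        "suffix3": lower[-3:],
        "has_digit": any(c.isdigit() for c in token),
        "has_upper": any(c.isupper() for c in token),
    }
    for etype, names in lexicon.items():
        feats["in_" + etype] = lower in names
    return feats

# Hypothetical mini-lexicon built from the anatomy example in this talk.
lexicon = {"ANATOMY": {"maxillary", "labial"}}
f1 = token_features("maxillary", lexicon)  # dictionary hit
f2 = token_features("hsp27", lexicon)      # gene-like token: no hit, contains a digit
```

For Class1 entities the lexicon feature alone is nearly the whole classifier. For Class2 the surface features would be fed to a trained model.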
Entity Recognition
Focus first on Class1
Relatively simple
Class2 and Class3 need training examples
Useful in entity association mining
Useful in facilitating extraction of many interesting relations
Related work: Textpresso
Textpresso
Input: full text C. elegans literature
Output: tagged XML format
Defined a Textpresso ontology
First category is biological entities
Manually curated a lexicon of names
Implemented with Perl regular expressions
We could reuse some of the regular expressions
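A lexicon-driven tagger in this spirit might look like the sketch below. It uses Python's `re` module rather than Perl and does not reproduce Textpresso's actual expressions. The lexicon entries and the `ORGANISM` tag name are assumptions for illustration.

```python
import re

def build_tagger(lexicon):
    """Return a function that wraps lexicon matches in <TYPE>...</TYPE>.

    `lexicon` maps an entity type to a list of names. All names are
    compiled into one case-insensitive alternation, longest first, so a
    multi-word name wins over a shorter name it contains.
    """
    name_to_type = {name.lower(): etype
                    for etype, names in lexicon.items() for name in names}
    alternatives = "|".join(
        re.escape(name) for name in sorted(name_to_type, key=len, reverse=True))
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)

    def tag(text):
        def wrap(match):
            etype = name_to_type[match.group(0).lower()]
            return "<%s>%s</%s>" % (etype, match.group(0), etype)
        return pattern.sub(wrap, text)

    return tag

# Hypothetical mini-lexicon.
tag = build_tagger({
    "ANATOMY": ["maxillary", "labial"],
    "ORGANISM": ["Drosophila melanogaster", "Drosophila"],
})
tagged = tag("mxp is expressed in the maxillary and labial segments")
```

Tagging in a single pass avoids matching lexicon names inside tags that an earlier pass has already inserted.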
Entity Recognition
Resources:
Organism
Anatomy
Entrez gene table,
Textpresso, BeeSpace DB
FlyBase
Biological Process, Cellular Component, Molecular Function: Textpresso
Pathway: KEGG
Protein Family: PDB, NCBI
Outline
Overview of BeeSpace v4
Entity Recognition
Relation Extraction
Relation Extraction
Expression Location
the expression of a gene in some location
(tissues, body parts)
Homology/Orthology
one gene is homologous to another gene
Relation Extraction
Biological process
one gene has some role in a biological process
Genetic/Physical/Regulatory Interaction
one gene interacts with another gene in a certain fashion (3 types of relations)
a simple case: Protein-Protein Interaction (PPI)
Relation Extraction
Generally can be defined as a
classification problem, which requires
training data
Domain adaptation?
an example of PPI
PPI
Problem Definition:
Gene/protein names are already tagged
A known list of interaction words
133 words
Classify each tuple (p1, p2, interWord) within a single sentence
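Candidate generation for this problem definition can be sketched as below. The `<PROTEIN>` tag is a hypothetical stand-in for however names are tagged, and the four interaction words stand in for the 133-word list, which the slides do not reproduce. The sentence is shortened from the Type2 example later in this talk.

```python
import re
from itertools import combinations

# Stand-in for the 133-word interaction list (not given in the slides).
INTERACTION_WORDS = {"inhibit", "binding", "exhibits", "induced"}

PROTEIN_RE = re.compile(r"<PROTEIN>(.*?)</PROTEIN>")

def candidate_tuples(sentence):
    """Enumerate every (p1, p2, interWord) candidate in one tagged sentence."""
    proteins = PROTEIN_RE.findall(sentence)
    plain = PROTEIN_RE.sub(r"\1", sentence)  # strip the tags, keep the names
    words = [w for w in re.findall(r"[A-Za-z]+", plain)
             if w.lower() in INTERACTION_WORDS]
    return [(p1, p2, w) for p1, p2 in combinations(proteins, 2) for w in words]

sentence = ("<PROTEIN>IgG antibody</PROTEIN> was able to inhibit binding of "
            "<PROTEIN>IgE antibody</PROTEIN> in the PLA radioallergosorbent test.")
tuples = candidate_tuples(sentence)
```

Every protein pair is crossed with every interaction word in the sentence, so both `inhibit` and `binding` become candidates here. Deciding which candidates are real interactions is the classifier's job.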
PPI
Methods
Learning algorithm: Maximum Entropy
Context features
“Extracting protein-protein interactions using simple contextual features”, BioNLP Workshop at HLT-NAACL 06
e.g. lexical forms, POS tags …
Less dependent on domain
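For a two-class decision (interaction or not), a maximum entropy classifier is equivalent to logistic regression, so the method can be sketched in plain Python as below. The feature templates, the hyperparameters and the toy sentences are assumptions for illustration, not the features of the cited paper. POS tags are omitted because they need an external tagger.

```python
import math

def context_features(tokens, i1, i2, iw):
    """Lexical context features for a tuple, given the token positions
    of p1 (i1), p2 (i2) and the interaction word (iw)."""
    lo, hi = sorted((i1, i2))
    feats = {"bias": 1.0, "inter=" + tokens[iw].lower(): 1.0}
    if lo < iw < hi:
        feats["inter_between_proteins"] = 1.0
    for tok in tokens[lo + 1:hi]:
        feats["between=" + tok.lower()] = 1.0
    return feats

def predict(weights, feats):
    """Probability that the tuple is a true interaction."""
    z = sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=200, lr=0.5, l2=0.01):
    """Stochastic gradient ascent on the log-likelihood with L2 shrinkage."""
    weights = {}
    for _ in range(epochs):
        for feats, label in examples:
            err = label - predict(weights, feats)
            for f, v in feats.items():
                w = weights.get(f, 0.0)
                weights[f] = w + lr * (err * v - l2 * w)
    return weights

# Toy training tuples: (tokens, p1 index, p2 index, interWord index, label).
toy = [
    ("P1 binds P2".split(), 0, 2, 1, 1),
    ("P1 and P2 bind DNA".split(), 0, 2, 3, 0),
    ("P1 inhibits P2".split(), 0, 2, 1, 1),
    ("P1 and P2 inhibit growth".split(), 0, 2, 3, 0),
]
examples = [(context_features(t, a, b, w), y) for t, a, b, w, y in toy]
weights = train(examples)
p_new = predict(weights, context_features("P1 activates P2".split(), 0, 2, 1))
p_neg = predict(weights, context_features("P1 and P2 activate X".split(), 0, 2, 3))
```

Neither test sentence's interaction word was seen in training, so the two scores come from position and context features alone. This is the sense in which such features are less dependent on domain.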
PPI
Training/Testing data:
BioCreative
1000 hand-labeled sentences, 3964 tuples
5-fold cross validation
Performance
avgpr = 47.14624
avgre = 43.97337
avgf1 = 45.35523
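The evaluation protocol can be sketched as below. The fold assignment rule is an assumption, as is the unit being split: the slides do not say whether folds are formed over sentences or tuples.

```python
def prf(gold, pred):
    """Precision, recall and F1 for binary labels (1 = interaction)."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def k_folds(n_items, k=5):
    """Yield (train_indices, test_indices); item i is tested in fold i % k."""
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test

def harmonic_mean(p, r):
    return 2 * p * r / (p + r)

# F1 recomputed from the reported average precision and recall.
f1_of_averages = harmonic_mean(47.14624, 43.97337)
```

The recomputed value is about 45.50, slightly above the reported avgf1 of 45.36. That gap would be consistent with F1 being computed per fold and then averaged, although the slides do not state this.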
PPI
Training data:
BioCreative
1000 hand-labeled sentences, 3964 tuples
Testing Data (different domain)
Bee collection
Performance (Judged by Moushumi)
Total number of tuples extracted as PPI instances: 92
Precision: 63%
PPI Misclassification examples
Type1: No interaction
Sentence: Pretreatment of platelet suspension
with phospholipase A2 from N. naja atra or A.
mellifera venom (50 μg/ml) inhibited platelet
aggregation induced by sodium arachidonate or
collagen, but not induced by thrombin or
ionophore A-23187.
False: (collagen, thrombin, induced)
True: relation between protein and platelet
aggregation; no PPI
PPI Misclassification examples
Type2: Incorrect interaction word
Sentence: IgG antibody was able to inhibit
binding of IgE antibody in the PLA
radioallergosorbent test (RAST) from 10-40% at a
molar excess of 10- to 1000-fold.
False: (IgG antibody, IgE antibody, binding)
True: (IgG antibody, IgE antibody, inhibit)
PPI Misclassification examples
Type3: Incorrect protein involved
Sentence: AChE exhibits a
butyrylcholinesterase (BuChE) activity that
represents about 14% of AChE activity.
False: (AChE, AChE, exhibits)
True: (AChE, BuChE, exhibits)
PPI
Possible Improvement
syntactic patterns: “Optimizing syntax-patterns
for discovering protein-protein interactions” In
Proc ACM Symposium on Applied Computing,
SAC, Bioinformatics Track,
parse tree
dependency parsing
…
The End