Information Extraction from Literature
Yue Lu
BeeSpace Seminar
Oct 24, 2007
Outline
  - Overview of BeeSpace v4
  - Entity Recognition
  - Relation Extraction

Overview

BeeSpace V4
  - Deeper semantic base than the current v3 system
  - Entities and relations vs. mutual information

Four levels
  - Level1: Entity Recognition
  - Level2: Entity Association Mining
  - Level3: Relation Extraction
  - Level4: Inference and Hypothesis Generation
Overview
Level1: Entity Recognition (detailed later)
Level2: Entity Association Mining
  - Suppose entities are properly tagged
  - Utilize the co-occurrence patterns of entities to extract semantics
  - e.g. a bee biologist may want to know which genes are important for foraging behavior
  - Similar to the TREC Genomics 2007 task
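
The co-occurrence idea above can be sketched as follows. This is a minimal illustration, not the BeeSpace implementation: it assumes each sentence has already been reduced to a list of (type, name) entity tags, and the sample data (`geneA`, `geneB`, `geneC`), the function names, and the use of raw sentence-level counts as the association score are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(tagged_sentences):
    """Count how often each pair of tagged entities shares a sentence."""
    counts = Counter()
    for entities in tagged_sentences:
        # Sort so that (a, b) and (b, a) map to the same key.
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    return counts

def associated(counts, target, wanted_type):
    """Rank entities of wanted_type by co-occurrence with target."""
    ranked = Counter()
    for (a, b), n in counts.items():
        if a == target and b[0] == wanted_type:
            ranked[b] += n
        elif b == target and a[0] == wanted_type:
            ranked[a] += n
    return ranked.most_common()

# Hypothetical tagged sentences: each is a list of (type, name) pairs.
sentences = [
    [("GENE", "geneA"), ("BEHAVIOR", "foraging")],
    [("GENE", "geneA"), ("BEHAVIOR", "foraging"), ("GENE", "geneB")],
    [("GENE", "geneC"), ("ANATOMY", "brain")],
]
counts = cooccurrence_counts(sentences)
genes = associated(counts, ("BEHAVIOR", "foraging"), "GENE")
print(genes)
```

In practice a normalized score (for example mutual information, as in the v3 system) would likely replace the raw counts; raw counts are used here only to keep the sketch short.
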
TREC Genomics 2007
e.g. “Which [PATHWAYS] are possibly involved in the disease ADPKD?”
Currently only retrieval techniques
  - Gene synonym expansion
  - Conjunctive query interpretation
  - User relevance feedback

Tagged entities would definitely help
Overview

Level3: Relation Extraction
  - Goal is to extract the relations between entities
  - Generally requires entities to be properly tagged first
  - Detailed later

Level4: Inference and Hypothesis Generation
  - Inference on knowledge base
  - Graph mining
Outline
  - Overview of BeeSpace v4
  - Entity Recognition
  - Relation Extraction

Entity Recognition


Gene
Example:

Although <GENE>mxp</GENE> and
<GENE>Pb</GENE> display very similar
expression patterns, <GENE>pb</GENE>
null embryos develop normally
Entity Recognition


Anatomy
Example:

In normal embryos, mxp is expressed in the
<ANATOMY>maxillary</ANATOMY> and
<ANATOMY>labial</ANATOMY> segments,
whereas ectopic expression is observed in
some GOF variants.
Entity Recognition


Biological process
Example:

Amongst these are the Bicoid, the Nanos, and the
terminal class gene products, some of which are
oncoproteins involved in signal transduction for
<BIOLOGICAL PROCESS>the formation of terminal
structures in the embryo</BIOLOGICAL PROCESS>.
Entity Recognition


Pathways
Example:

Several signal transduction pathways have been
described in Drosophila, and this review explores
the potential of oncogene studies using one of those
pathways - <PATHWAY>the terminal class signal
transduction pathway</PATHWAY> - to better
understand the cellular mechanisms of proto-oncogenes that mediate cellular responses in
vertebrates including humans.
Entity Recognition


Protein family
Example:

While non-arthropod orthologs have been found for
many Drosophila eye developmental genes, this
has not been the case for the glass (gl) gene, which
encodes a <PROTEIN FAMILY>zinc finger
transcription factor</PROTEIN FAMILY> required
for photoreceptor cell specification, differentiation,
and survival.
Entity Recognition


CRE (cis-regulatory elements)
Example:

A synthetic, 23-bp <CRE>ecdysterone regulatory
element (EcRE)</CRE>, derived from the upstream
region of the Drosophila melanogaster hsp27 gene,
was inserted adjacent to the herpes simplex virus
thymidine kinase promoter fused to a bacterial gene
for chloramphenicol acetyltransferase (CAT).
Entity Recognition


Phenotype
Definition:


a set of observable physical characteristics
of an individual organism
Example:

Fog, dumpy
Entity Recognition

Class1: Small Variation (Dictionary/Ontology)
  - Organism, Anatomy, Biological Process, Pathway, Protein Family

Class2: Medium Variation
  - Gene, cis-Regulatory Element

Class3: Large Variation
  - Phenotype, Behavior
Entity Recognition
Generally can be defined as a classification problem
  - Boils down to feature definition
    - Class1: matching a word in the Dictionary/Ontology
    - Class2: prefix/suffix of the word, POS tags, …
    - Class3: ?
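
To make the feature definitions above concrete, here is a minimal sketch of a per-token feature function. It is an illustration under stated assumptions, not the system's actual feature set: the function name `token_features`, the three-character affix length, the surface-shape features, and the tiny lexicon are all hypothetical. POS tags are left out because they would require an external tagger.

```python
def token_features(token, lexicon, affix_len=3):
    """Build a feature dict for one token.

    Class1-style feature: membership in a dictionary/ontology lexicon.
    Class2-style features: prefix, suffix, and simple surface shape.
    """
    lower = token.lower()
    return {
        "in_lexicon": lower in lexicon,
        "prefix": lower[:affix_len],
        "suffix": lower[-affix_len:],
        "has_digit": any(ch.isdigit() for ch in token),
        "has_upper": any(ch.isupper() for ch in token),
    }

# Hypothetical anatomy lexicon; a real one would come from an ontology.
anatomy_lexicon = {"maxillary", "labial"}

print(token_features("maxillary", anatomy_lexicon))
print(token_features("hsp27", anatomy_lexicon))
```

For Class1 entities the `in_lexicon` feature alone may be enough; the affix and shape features are the kind of evidence a trained classifier would need for Class2.
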
Entity Recognition

Firstly focus on Class1
  - Relatively simple
    - Class2 and Class3 need training examples
  - Useful in entity association mining
  - Useful in facilitating extraction of many interesting relations

Related work: Textpresso
Textpresso

Input: full text C. elegans literature
Output: tagged XML format
Defined a Textpresso ontology
  - First category is biological entities
  - Manually curated a lexicon of names
Implemented by Perl regular expressions
We could reuse some of the regular expressions
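
A lexicon-driven regular-expression tagger of the kind described above might look like the following sketch. It is written in Python rather than Perl and does not reproduce any of Textpresso's actual expressions; the function name `build_tagger` and the two-entry lexicon are assumptions for illustration.

```python
import re

def build_tagger(lexicon, tag):
    """Compile one alternation regex from a lexicon and return a tagger."""
    # Longest names first so multi-word entries win over their prefixes.
    names = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )

    def tag_text(text):
        # Wrap every match in <TAG>...</TAG>, keeping the original casing.
        return pattern.sub(lambda m: f"<{tag}>{m.group(1)}</{tag}>", text)

    return tag_text

# Hypothetical lexicon entries taken from the Anatomy example slide.
tag_anatomy = build_tagger({"maxillary", "labial"}, "ANATOMY")
print(tag_anatomy("mxp is expressed in the maxillary and labial segments"))
```

The sketch assumes a non-empty lexicon; an empty one would compile to a pattern that matches the empty string at every word boundary.
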
Entity Recognition
Resources:
Organism
Anatomy
Entrez gene table,
Textpresso, BeeSpace DB
FlyBase
Biological Process, Textpresso
Cellular Component,
Molecular Function
Pathway
KEGG
Protein Family
PDB, NCBI
Outline
  - Overview of BeeSpace v4
  - Entity Recognition
  - Relation Extraction

Relation Extraction

Expression Location
  - the expression of a gene in some location (tissues, body parts)

Homology/Orthology
  - one gene is homologous to another gene
Relation Extraction

Biological process
  - one gene has some role in a biological process

Genetic/Physical/Regulatory Interaction
  - one gene interacts with another gene in a certain fashion (3 types of relations)
  - a simple case: Protein-Protein Interaction (PPI)
Relation Extraction
Generally can be defined as a classification problem, which requires training data
  - Domain adaptation?
  - an example of PPI
PPI

Problem Definition:
  - Gene/protein names are already tagged
  - A known list of interaction words (133 words)
  - Classify each tuple (p1, p2, interWord) in one single sentence
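
The problem definition above implies a candidate-generation step before classification. A minimal sketch, assuming whitespace tokenization and single-token protein names (both simplifications), and using a hypothetical two-word excerpt in place of the 133-word interaction list:

```python
from itertools import combinations

def candidate_tuples(tokens, protein_idx, interaction_words):
    """Enumerate (p1, p2, interWord) candidates within one sentence.

    tokens: list of word strings for a single sentence
    protein_idx: positions of tokens already tagged as gene/protein names
    interaction_words: known list of interaction words (lowercase)
    """
    inter_idx = [i for i, t in enumerate(tokens)
                 if t.lower() in interaction_words]
    candidates = []
    for i, j in combinations(sorted(protein_idx), 2):
        for k in inter_idx:
            candidates.append((tokens[i], tokens[j], tokens[k]))
    return candidates

# Hypothetical two-word excerpt of the interaction-word list.
interaction_words = {"inhibit", "binding"}
tokens = "IgG was able to inhibit binding of IgE".split()
print(candidate_tuples(tokens, [0, 7], interaction_words))
```

Each candidate is then classified as a true or false interaction, so a sentence with several proteins and interaction words yields several tuples to label.
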
PPI

Methods
  - Learning algorithm: Maximum Entropy
  - Context features: “Extracting protein-protein interactions using simple contextual features training data” BioNLP Workshop on HLT-NAACL 06
    - e.g. lexical forms, POS tags …
    - Less dependent on domain
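
As a rough illustration of the method above, the sketch below pairs a few lexical context features with a binary maximum entropy classifier (which, for two classes, is equivalent to logistic regression). It is not the cited paper's feature set or the talk's implementation: the feature names, the toy sentences, the learning rate, and the training loop are all assumptions, and POS tag features are omitted because they would need an external tagger.

```python
import math

def context_features(tokens, i, j, k):
    """Simple lexical context features for a tuple (p1=i, p2=j, interWord=k)."""
    feats = {
        "inter=" + tokens[k].lower(): 1.0,
        "inter_between": 1.0 if i < k < j else 0.0,
        "bias": 1.0,
    }
    # Words strictly between the two protein mentions.
    for w in tokens[i + 1:j]:
        feats["between=" + w.lower()] = 1.0
    return feats

def predict(weights, feats):
    """P(interaction | features) under a binary maximum entropy model."""
    z = sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=200, lr=0.5):
    """Fit weights by stochastic gradient ascent on the log-likelihood."""
    weights = {}
    for _ in range(epochs):
        for feats, label in examples:
            err = label - predict(weights, feats)
            for f, v in feats.items():
                weights[f] = weights.get(f, 0.0) + lr * err * v
    return weights

# Toy, hand-made training tuples (hypothetical, not BioCreative data).
pos = "P1 binds P2".split()          # interaction word between the proteins
neg = "P1 , P2 binds X".split()      # interaction word outside the pair
examples = [
    (context_features(pos, 0, 2, 1), 1),
    (context_features(neg, 0, 2, 3), 0),
]
weights = train(examples)
print(predict(weights, examples[0][0]), predict(weights, examples[1][0]))
```

Because the features are surface-level rather than tied to a particular vocabulary of entities, a model of this shape is plausibly less dependent on the training domain, which is the property the slide highlights.
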

PPI

Training/Testing data:
  - BioCreative
  - 1000 hand-labeled sentences, 3964 tuples
  - 5-fold cross validation

Performance
  - avgpr = 47.14624
  - avgre = 43.97337
  - avgf1 = 45.35523
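
The evaluation setup above can be sketched as follows. The helper names and the round-robin fold assignment are assumptions; the slides do not say how the folds were drawn or how the averages were computed.

```python
def precision_recall_f1(gold, predicted):
    """Precision, recall, and F1 over sets of extracted tuples."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def k_fold_indices(n_items, k=5):
    """Yield (train, test) index lists; fold f takes every k-th item."""
    for f in range(k):
        test = [i for i in range(n_items) if i % k == f]
        train = [i for i in range(n_items) if i % k != f]
        yield train, test

def macro_average(per_fold_scores):
    """Average each metric across folds (one tuple of scores per fold)."""
    n = len(per_fold_scores)
    return tuple(sum(s[m] for s in per_fold_scores) / n for m in range(3))

# Toy check with made-up tuple ids: 2 of 3 predictions are correct.
print(precision_recall_f1({1, 2, 3, 4}, {3, 4, 5}))
```

If avgf1 was obtained by averaging the per-fold F1 scores (an assumption), that would account for it differing slightly from the harmonic mean of the reported avgpr and avgre, which is about 45.50.
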
PPI

Training data:
  - BioCreative
  - 1000 hand-labeled sentences, 3964 tuples

Testing Data (different domain)
  - Bee collection

Performance (Judged by Moushumi)
  - Total number of tuples extracted as PPI instances: 92
  - Precision: 63%
PPI Misclassification examples

Type1: No interaction
Sentence: Pretreatment of platelet suspension
with phospholipase A2 from N. naja atra or A.
mellifera venom (50 .mu.g/ml) inhibited platelet
aggregation induced by sodium arachidonate or
collagen, but not induced by thrombin or
ionophore A-23187.
False: (collagen, thrombin, induced)
True: relation between protein and platelet
aggregation; no PPI
PPI Misclassification examples

Type2: Incorrect interaction word
Sentence: IgG antibody was able to inhibit
binding of IgE antibody in the PLA
radioallergosorbent test (RAST) from 10-40% at a
molar excess of 10- to 1000-fold.
False: (IgG antibody, IgE antibody, binding)
True: (IgG antibody, IgE antibody, inhibit)
PPI Misclassification examples

Type3: Incorrect protein involved
Sentence: AChE exhibits a
butyrylcholinesterase (BuChE) activity that
represents about 14% of AChE activity.
False: (AChE, AChE, exhibits)
True: (AChE, BuChE, exhibits)
PPI

Possible Improvement
  - syntactic patterns: “Optimizing syntax-patterns for discovering protein-protein interactions” In Proc ACM Symposium on Applied Computing, SAC, Bioinformatics Track,
  - parse tree
  - dependency parsing
  - …

The End