Finding Semantic Relations

Discovering Semantic Relations (for Proteins and Digital Devices)
Barbara Rosario
Intel Research
Outline
• Semantic relations
– Protein-protein interactions (joint work
with Marti Hearst)
– Digital devices (joint work with Bill Schilit,
Google and Oksana Yakhnenko, Iowa State
University)
• Models to do text classification and
information extraction
• Two new proposals for getting labeled data
Text mining
• Text Mining is the discovery by
computers of new, previously unknown
information, via automatic extraction
of information from text
• Example: a (human) analysis of titles
of articles in the biomedical literature
suggested a role of magnesium
deficiency in migraines [Swanson]
Text mining
• Text:
– Stress is associated with migraines
– Stress can lead to loss of magnesium
– Calcium channel blockers prevent some migraines
– Magnesium is a natural calcium channel blocker
1: Extract semantic entities from text
Text mining
• Text:
– Stress is associated with migraines
– Stress can lead to loss of magnesium
– Calcium channel blockers prevent some migraines
– Magnesium is a natural calcium channel blocker
1: Extract semantic entities from text
Stress
Magnesium
Migraine
Calcium channel blockers
Text mining (cont.)
• Text:
– Stress is associated with migraines
– Stress can lead to loss of magnesium
– Calcium channel blockers prevent some migraines
– Magnesium is a natural calcium channel blocker
2: Classify relations between entities
[Relation graph from the slide:]
Stress --(Associated with)--> Migraine
Stress --(Lead to loss)--> Magnesium
Calcium channel blockers --(Prevent)--> Migraine
Magnesium --(Subtype-of (is a))--> Calcium channel blockers
Text mining (cont.)
• Text:
– Stress is associated with migraines
– Stress can lead to loss of magnesium
– Calcium channel blockers prevent some migraines
– Magnesium is a natural calcium channel blocker
3: Do reasoning: find new correlations
[Relation graph from the slide:]
Stress --(Associated with)--> Migraine
Stress --(Lead to loss)--> Magnesium
Calcium channel blockers --(Prevent)--> Migraine
Magnesium --(Subtype-of (is a))--> Calcium channel blockers
(Chaining these relations suggests the new correlation from the Swanson example: magnesium deficiency and migraines.)
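As a concrete illustration of these three steps, here is a minimal Python sketch on the magnesium example. The entity list and relation trigger phrases are hypothetical stand-ins for real trained extractors, and the "reasoning" step is just a trivial chaining of edges.

```python
# Toy sketch of the entity-extraction / relation-classification / inference pipeline.
# The entity list and relation patterns are hypothetical illustrations, not a real extractor.

sentences = [
    "Stress is associated with migraines",
    "Stress can lead to loss of magnesium",
    "Calcium channel blockers prevent some migraines",
    "Magnesium is a natural calcium channel blocker",
]

entities = ["stress", "magnesium", "migraine", "calcium channel blocker"]
patterns = {  # relation label -> trigger phrase
    "associated_with": "associated with",
    "leads_to_loss_of": "lead to loss of",
    "prevents": "prevent",
    "subtype_of": "is a",
}

# Steps 1 + 2: extract entity pairs and classify the relation between them.
triples = []
for s in sentences:
    found = [e for e in entities if e in s.lower()]
    rel = next((r for r, p in patterns.items() if p in s.lower()), None)
    if len(found) == 2 and rel:
        triples.append((found[0], rel, found[1]))

# Step 3: a trivial reasoning step -- chain edges that share an entity.
for a, r1, b in triples:
    for c, r2, d in triples:
        if b == c and a != d:
            print(f"possible new correlation: {a} -> {d} (via {b})")
```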
Relations
• The identification and classification of
semantic relations is crucial for the
semantic analysis of text
• Protein-protein interactions
• Relations for digital devices
Protein-protein interactions
• Applications throughout biology
• There are several protein-protein
interaction databases (BIND, MINT,..), all
manually curated
• Most of the biomedical research and new
discoveries are available electronically but
only in free text format.
• Automatic mechanisms are needed to
convert text into more structured forms
Protein-protein interactions
• Supervised systems require manually labeled data, while purely unsupervised methods have yet to prove effective for these tasks.
• We propose the use of resources
developed in the biomedical domain to
address the problem of gathering labeled
data for the task of classifying interactions
between proteins
HIV-1, Protein interaction
database
• “The goal of this project is to provide
scientists a summary of all known
interactions of HIV-1 proteins with host
cell proteins, other HIV-1 proteins, or
proteins from disease organisms
associated with HIV/AIDS”
• There are 2224 interacting protein pairs
and 51 types of interaction
http://www.ncbi.nlm.nih.gov/RefSeq/HIVInteractions/
HIV-1, Protein interaction database

Protein 1   Protein 2   Interaction    Paper ID
10000       155871      activates      11156964
10015       155030      binds          14519844, …
1017        155871      induces        9223324
10197       155348      degraded by    10893419
…
Protein-protein interactions
• Idea: use this to “label data”
Protein 1   Protein 2   Interaction   Paper ID
10000       155871      activates     11156964

Extract from the paper all the sentences with Protein 1 and Protein 2 (…)
Label them with the interaction given in the database
Protein-protein interactions
• Idea: use this to “label data”
Protein 1   Protein 2   Interaction   Paper ID
10000       155871      activates     11156964

Extract from the paper all the sentences with Protein 1 and Protein 2
Label each of them with the interaction given in the database:
  [sentence] -> activates
  …
  [sentence] -> activates
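A minimal sketch of this labeling idea, assuming the database rows and a paper-text fetcher are available locally; get_paper_sentences is a hypothetical placeholder, and real protein-name matching (synonyms, identifiers) is omitted.

```python
# Sketch: turn HIV-1 database rows into labeled sentences.
# `db_rows` and `get_paper_sentences` are hypothetical stand-ins for the
# real database dump and a PubMed / full-text fetcher.

db_rows = [
    {"protein1": "10000", "protein2": "155871",
     "interaction": "activates", "paper_id": "11156964"},
]

def get_paper_sentences(paper_id):
    """Placeholder: return the sentences of the paper with this ID."""
    return []

def label_sentences(rows):
    labeled = []
    for row in rows:
        for sent in get_paper_sentences(row["paper_id"]):
            # Keep only sentences mentioning both proteins,
            # and label them with the interaction from the database.
            if row["protein1"] in sent and row["protein2"] in sent:
                labeled.append((sent, row["interaction"]))
    return labeled

training_data = label_sentences(db_rows)
```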
Protein-protein interactions
• Use citations
Protein 1   Protein 2   Interaction   Paper ID
10000       155871      activates     11156964

• Find all the papers that cite the paper in the database (e.g., paper ID 9918876, paper ID 9971769)
Protein-protein interactions
Protein 1   Protein 2   Interaction   Paper ID
10000       155871      activates     11156964

• From the citing papers (ID 9918876, ID 9971769), extract the citation sentences; from these, extract the sentences with Protein 1 and Protein 2
• Label them with the interaction (activates)
Protein-protein interactions
• Task:
  – Given the sentences extracted from paper ID and/or the citation sentences:
  – Determine the interaction given in the HIV-1 database for paper ID
  – Identify the proteins involved in the interaction (protein name tagging, or role extraction)
Interaction        Papers   Citances
Degrades             60       63
Synergizes with      86      101
Stimulates          103       64
Binds                98      324
Inactivates          68       92
Interacts with       62      100
Requires             96      297
Upregulates         119       98
Inhibits             78       84
Suppresses           51       99
The models (1)
Naïve Bayes (NB) for
interaction classification.
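As a rough illustration (not the talk's exact model), a bag-of-words Naïve Bayes interaction classifier could be sketched with scikit-learn as follows; the sentences and labels would come from the database/citance labeling step, and the example sentences here are made up.

```python
# Sketch: bag-of-words Naive Bayes for interaction classification.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labeled sentences (protein names already replaced by PROT1/PROT2).
sentences = ["PROT1 activates PROT2 in infected cells",
             "PROT1 binds the PROT2 promoter region"]
labels = ["activates", "binds"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(sentences, labels)
print(model.predict(["PROT1 strongly activates PROT2"]))
```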
The models (2)
Dynamic graphical model (DM)
for protein interaction
classification (and role extraction).
Dynamic graphical models
• Graphical model composed of
repeated segments
• HMMs (Hidden Markov Models)
– POS tagging, speech recognition, IE
[Figure: HMM with hidden POS-tag states t1, t2, …, tN-1, tN emitting the words w1, w2, …, wN-1, wN]
HMMs
• Joint probability distribution
  – P(t1, …, tN, w1, …, wN) = P(t1) ∏i P(ti | ti-1) ∏i P(wi | ti)
• Estimate P(t1), P(ti | ti-1), P(wi | ti) from labeled data
HMMs
• Joint probability distribution
  – P(t1, …, tN, w1, …, wN) = P(t1) ∏i P(ti | ti-1) ∏i P(wi | ti)
• Estimate P(t1), P(ti | ti-1), P(wi | ti) from labeled data
• Inference: P(t1, t2, …, tN | w1, w2, …, wN)
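A small Python sketch of the factorization above, with made-up transition and emission tables, just to make the product of terms concrete.

```python
# Sketch: compute P(t1..tN, w1..wN) = P(t1) * prod P(ti|ti-1) * prod P(wi|ti)
# for a toy tag set; the probability tables are made-up illustrations.

start = {"DT": 0.6, "NN": 0.4}                   # P(t1)
trans = {("DT", "NN"): 0.9, ("NN", "NN"): 0.3}   # P(ti | ti-1)
emit = {("DT", "the"): 0.7, ("NN", "protein"): 0.1}  # P(wi | ti)

def joint_prob(tags, words):
    p = start.get(tags[0], 0.0) * emit.get((tags[0], words[0]), 0.0)
    for i in range(1, len(tags)):
        p *= trans.get((tags[i - 1], tags[i]), 0.0)
        p *= emit.get((tags[i], words[i]), 0.0)
    return p

print(joint_prob(["DT", "NN"], ["the", "protein"]))  # 0.6 * 0.7 * 0.9 * 0.1
```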
Graphical model for role and
relation extraction
[Figure: an Interaction node generates a sequence of Role states, each of which generates Features]
– Markov sequence of states (roles)
– States generate multiple observations
– The relation generates the state sequence and the observations
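Written out, the factorization these bullets describe would look roughly like this (a sketch: R is the interaction relation, r_i the role states, f_i the observed features; the exact feature structure of the talk's model is not reproduced here).

```latex
% Sketch of the joint distribution implied by the bullets above:
% the relation R generates a Markov chain of roles, and each role
% generates its observed features.
P(R, r_1,\dots,r_N, f_1,\dots,f_N)
  = P(R)\, P(r_1 \mid R) \prod_{i=2}^{N} P(r_i \mid r_{i-1}, R)
    \prod_{i=1}^{N} P(f_i \mid r_i, R)
```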
Analyzing the results
• Hiding the protein names: “Selective
CXCR4 antagonism by Tat” becomes:
“Selective PROT1 antagonism by
PROT2”
– To check whether the interaction types
could be unambiguously determined by
the protein names.
• Compare results with a trigger-word approach
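A minimal sketch of the name-hiding step, assuming the two protein names for a sentence are already known from the database.

```python
# Sketch: replace the two known protein names with PROT1 / PROT2
# so the classifier cannot rely on the names themselves.
import re

def hide_proteins(sentence, protein1, protein2):
    sentence = re.sub(re.escape(protein1), "PROT1", sentence, flags=re.IGNORECASE)
    sentence = re.sub(re.escape(protein2), "PROT2", sentence, flags=re.IGNORECASE)
    return sentence

print(hide_proteins("Selective CXCR4 antagonism by Tat", "CXCR4", "Tat"))
# -> "Selective PROT1 antagonism by PROT2"
```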
Results: interaction classification
Classification accuracies:

Model                              All     Papers   Citances
DM                                 60.5    57.8     53.4
NB                                 58.1    57.8     55.7
No Protein Names:
DM                                 60.5    44.4     52.3
NB                                 59.7    46.7     53.4
Trigger words                      25.8    40.0     26.1
Baseline: most frequent interaction 21.8   11.1     26.1
Results: proteins extraction
            Recall   Precision   F-measure
All         0.74     0.85        0.79
Papers      0.56     0.83        0.67
Citances    0.75     0.84        0.79
Conclusions of protein-protein
interaction project
• Difficult and important problem: the classification of
(ten) different interaction types between proteins in text
• The dynamic graphical model DM can simultaneously
perform protein name tagging and relation identification
• High accuracy on both problems (well above the
baselines)
• The results obtained by removing the protein names indicate that our models learn the linguistic context of the interactions.
• Found evidence supporting the hypothesis that citation
sentences are a good source of training data, most likely
because they provide a concise and precise way of
summarizing facts in the bioscience literature.
• We used a protein-interaction database to automatically gather labeled data for this task.
Relations for digital devices
• Identification of activities/relations
between device pairs.
• What can you do with a given device pair?
– Digital camera and TV
– Media server and computer
– Media server and wireless router
– Toshiba laptop and wireless audio adapter
– PC and DVR
– TV and DVR
Looking for relations
• Can you search the Web?
  – e.g., Google searches for "TV DVR" and "PC DVR"
• Current search engines find co-occurrences of the query terms
• Often you need to find semantically related entities
• For text mining, for inference, and for search (IR)
Looking for relations
• Can you search the Web?
  – e.g., Google searches for "PC DVR" and "TV DVR"
• You may want to see instead all the sentences in which the two devices are involved in an activity/relation, and get a sense of what you can do with these devices
• Activities_between(PC DVR)
  – From which you learn, for example, that:
    » You can build a better DVR out of an old PC
    » Any modern Windows PC can be used for DVR duty
• Activities_between(TV DVR)
  – From which you learn, for example, that:
    » A DVR allows you to pause live TV
    » You can watch Google Satellite TV through your "internet ready" Google DVR
Looking for relations
• We can frame this problem as a classification problem:
• Given a sentence containing two digital devices, is there a relation between them expressed in the sentence or not?
Looking for relations
• Media server and computer
– The Allegro Media Server application reads the iTunes music
library file to find the music stored on your computer
• YES
– You will use the FTP software to transfer files from your
computer to the media server
• YES
– The media server has many functions and it needs to be a high-end
computer with plenty of hard drive space to store the very large
video files that get created
• YES
– Sometimes you might want to play faster than your computer, or
your Internet connection, or your media server, can handle
• NO
– Anderson , George Homsy, A Continuous Media I/O Server and Its
Synchronization Mechanism, Computer, v.24 n.10, p.51-57, October
1991
• NO
– GSIC > Research Computer System > Obtaining Accouts > Media
Server
• NO
Looking for relations
• Media server and wireless router
– For example, if you access a local media server in
your house that is connected to a wireless router
that has a port speed of only 100 Mbps [..]
• YES
– Besides serving as a router, a wireless access
point, and a four-port switch, the WRTSL54GS
includes a storage link and a media server
• YES
– It has a built in video server, media server, home
automation, wireless router, internet gateway
• NO
Our system
• Set of 57 pairs of digital devices
• Searched the Web (Google) using the device pairs as queries
• From the Web pages retrieved, we extracted the 3627 text excerpts containing both devices
• We labeled them (YES or NO)
• Trained a classification system
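A minimal sketch of this collection step; search_web is a hypothetical placeholder, since the talk does not specify the search API or crawler used.

```python
# Sketch: build queries from device pairs and keep only text excerpts
# that mention both devices. `search_web` is a hypothetical placeholder.

device_pairs = [("pc", "dvr"), ("media server", "wireless router")]

def search_web(query):
    """Placeholder: return plain-text excerpts from the result pages."""
    return []

def collect_excerpts(pairs):
    excerpts = []
    for dev1, dev2 in pairs:
        for text in search_web(f"{dev1} {dev2}"):
            if dev1 in text.lower() and dev2 in text.lower():
                excerpts.append((dev1, dev2, text))
    return excerpts  # these are the segments sent out for YES/NO labeling

candidates = collect_excerpts(device_pairs)
```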
Our FUTURE system
• Will allow us to identify the Web pages containing relations.
  – Could display only those.
  – Could highlight only the sentences with relations.
  – For digital devices, this would allow, for example, useful queries for troubleshooting
    • Searching the web is one of the principal methods used to seek out information and to resolve problems involving digital devices for home networks
Our FUTURE system
• Possible extensions of the project to get the activity types:
  – Look at the sentences extracted and come up with a set of possible activities; build a (multi-class) classification system to classify the different activities (supervised)
  – Extract the most indicative words for the activities (like the words highlighted here); cluster them to get "activity clusters" (unsupervised)
Our system
• Set of 50 Device Pairs
• Searched the Web (Google) using the device pairs as queries
• From the Web pages retrieved, we extracted the sentences containing both devices
• We labeled them (YES or NO)
• Trained a classification system
Labeling with Mechanical Turk
• To train a classification system, we need
labels
– Time consuming, subjective, different for each
domain and task
  – (But unsupervised systems usually perform worse)
• We used a web service, Mechanical Turk (MTurk, http://www.mturk.com), that allows users to create and post a task that requires human intervention, and offers a reward for the completion of the task.
Mechanical Turk HIT for labeling
relations
[Screenshots of the MTurk surveys]
Mechanical Turk
• We created a total of 121 surveys, each consisting of 30 questions.
• Our reward to users was between 15 and 30 cents per survey (< 1 cent per text segment)
  – We obtained labels for 3627 text segments for under $70.
• Each HIT was completed (by all 3 "workers") within a few minutes to half an hour
  – We had perfect agreement for 49% of all sentences
  – 5% received three different labels (discarded)
  – For 46%, two labels agreed (the majority vote was used to determine the final label)
• 1865 text segments were labeled YES
• 1485 text segments were labeled NO
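A minimal sketch of the vote aggregation described above: three worker labels per text segment, majority label kept, segments whose three labels all differ discarded.

```python
# Sketch: aggregate three MTurk labels per text segment by majority vote.
from collections import Counter

def aggregate(worker_labels):
    """worker_labels: the 3 labels given to one text segment."""
    label, votes = Counter(worker_labels).most_common(1)[0]
    if votes >= 2:   # perfect agreement or a 2-vs-1 majority
        return label
    return None      # all three labels differ -> discard the segment

print(aggregate(["YES", "YES", "NO"]))  # -> "YES"
```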
Classification
• Now we have labeled data
• Need a (binary) classifier
Summary (from lecture 17)
• Algorithms for Classification
• Binary classification
– Perceptron
– Winnow
– Support Vector Machines (SVM)
– Kernel Methods
– Multilayer Neural Networks
• Multi-Class classification
– Decision Trees
– Naïve Bayes
– K nearest neighbor
Support Vector Machine (SVM)
• Large Margin Classifier
• Linearly separable case
• Goal: find the hyperplane that maximizes the margin M
[Figure: separating hyperplane wT x + b = 0 with margin boundaries wT x + b = 1 and wT x + b = -1; the support vectors lie on the margin boundaries]
(From Lecture 17; figure from Gert Lanckriet, Statistical Learning Theory Tutorial)
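As a rough illustration of applying a large-margin classifier to the labeled excerpts (not necessarily the implementation used in the talk), a linear SVM over bag-of-words features could look like this; the example excerpts are made up.

```python
# Sketch: linear SVM over bag-of-words features for the YES/NO relation decision.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

excerpts = ["The media server reads the music stored on your computer",
            "GSIC > Research Computer System > Media Server"]
labels = ["YES", "NO"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(excerpts, labels)
print(clf.predict(["transfer files from your computer to the media server"]))
```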
Graphical models
• Directed (like Naïve Bayes and HMM)
• Undirected (Markov Network)
Maximum Margin Markov
Networks
• Large Margin Classifier + (undirected)
Markov Networks [Taskar 03]
– To combine the strengths of the two
methods:
• High dimensional feature space, strong
theoretical guarantees
• Problem structure, ability to capture
correlation between labels
Benjamin Taskar, Carlos Guestrin, and Daphne Koller.
2003. Max-margin markov networks. In NIPS.
Directed Maximum Margin
Model
• Large Margin Classifier + (directed)
graphical model (Naïve Bayes)
• MMNB: Maximum Margin Naïve Bayes
  – Essentially, to combine the strengths of graphical models (better at interpreting data, worse classification performance) with discriminative models (better performance, less interpretable working mechanism)
Results
• Compare with Naïve Bayes and
Perceptron (Weka)
• Classification accuracy:
– MMNB: 79.98
– Naïve Bayes: 75.62
– Perceptron: 63.03
Conclusion
• Semantic relations
• Two projects: interactions between
proteins and relations between digital
devices
• Statistical models (dynamic graphical
models, maximum margin naïve bayes)
• Creative ways of obtaining labeled data: a protein-interaction database and "paying" people (MTurk)
Thanks!
Barbara Rosario
[email protected]
Intel Research
Additional slides
All device pairs
• desktop wireless router
• PC stereo
• digital camera television
• pc wireless audio adapter
• digital camera tv set
• pc wireless router
• ibm laptop buffalo media player
• Phillips stereo pc
• ibm laptop linksys wireless router
• prismq media player wireless router
• ibm laptop squeezebox
• stereo laptop
• ibm laptop wireless audio adapter
• stereo toshiba laptop
• kodak camera television
• toshiba laptop buffalo media player
• laptop linksys wireless router
• toshiba laptop linksys wireless router
• laptop media server
• toshiba laptop netgear wireless router
• laptop squeezebox
• toshiba laptop squeezebox
• laptop stereo
• toshiba laptop wireless audio adapter
• laptop wireless audio adapter
All device pairs (cont.)
• buffalo media player wireless router
• laptop wireless router
• buffalo media server wireless router
• linkstation home server wireless router
• camera tv
• linkstation multimedia server wireless router
• computer linksys wireless router
• media player wireless router
• computer media server
• media server linksys wireless router
• computer stereo
• media server netgear wireless router
• computer wireless audio adapter
• media server wireless router
• computer wireless router
• network media player wireless router
• desktop media server
• nikon camera television
• desktop stereo
• pc media server
• desktop wireless audio adapter
• pc squeezebox