Exploiting Semantic Relations for Literature

Literature-Based Knowledge
Discovery using
Natural Language Processing
Dimitar Hristovski,1 PhD, Carol Friedman,2 PhD,
Thomas C Rindflesch,3 PhD, Borut Peterlin,4 MD PhD
1Institute of Biomedical Informatics, Medical Faculty, University of Ljubljana,
Slovenia
2Department of Biomedical Informatics, Columbia University, New York
3National Library of Medicine, Bethesda, Maryland
4Division of Medical Genetics, UMC, Slajmerjeva 3, Ljubljana, Slovenia
e-mail: [email protected]
1
Part 1: Co-occurrence based LBD
2
Motivation
• Overspecialization
• Information overload
• Large databases
• Need and opportunity for computer-supported
knowledge discovery
3
Literature-based Discovery
(LBD)
• A method for automatically generating
hypotheses (discoveries) from literature
• Hypotheses have form:
Concept1 –Relation– Concept2
• Example:
Fish oil –Treats– Raynaud’s disease
4
Background
• Swanson’s LBD paradigm:
– Concept X (Disease), e.g. Raynaud’s
– Concepts Y (Pathological or Cell Function, …), e.g. Blood viscosity
– Concepts Z (Drugs, …), e.g. Fish oil
– X is related to Y, and Y is related to Z, in the literature
– New Relation between X and Z? e.g. Treats
5
Biomedical Discovery Support
System (BITOLA)
• Goal:
– discover potentially new relations (knowledge) between
biomedical concepts
– to be used as research idea generator and/or as
– an alternative way to search Medline
• System user (researcher or intermediary):
– interactively guides the discovery process
– evaluates the proposed relations
6
Extending and Enhancing
Literature Based Discovery
• Goal:
– Make literature based discovery more suitable for
disease candidate gene discovery
– Decrease the number of candidate relations
• Method:
– Integrate background knowledge:
• Chromosomal location of diseases and genes
• Gene expression location
• Disease manifestation location
7
System Overview
• Databases (Medline, LocusLink, HUGO, OMIM, …)
• Knowledge Extraction
• Knowledge Base:
– Concepts
– Association Rules
– Background Knowledge (Chromosomal Locations, …)
• Discovery Algorithm
• User Interface
8
Terminology Problems during
Knowledge Extraction
• Gene names
• Gene symbols
• MeSH and genetic diseases
9
Detected Gene Symbols by
Frequency
Symbol|Frequency
type|666548
II|552584
III|201776
component|179643
CT|175973
AT|151337
ATP|147357
IV|123429
CD4|99657
p53|89357
MR|88682
SD|85889
GH|84797
LPS|68982
59|67272
E2|64616
82|63521
AMP|61862
TNF|59343
RA|58818
CD8|57324
O2|56847
ACTH|54933
CO2|53171
PKC|51057
EGF|50483
T3|49632
MS|46813
A2|44896
ER|43212
upstream|41820
PRL|41599
10
Gene Symbol
Disambiguation
• Find MEDLINE docs in which we can
expect to find gene symbols
• Example of false positive:
– Ethics in a twist: "Life Support", BBC1. BMJ 1999
Aug 7;319(7206):390
– breast basic conserved 1 (BBC1) gene vs. BBC1
television station featuring the new drama series Life
Support
11
Binary Association Rules
• X → Y (confidence, support)
• If X Then Y (confidence, support)
• Confidence = % of docs containing Y within the X docs
• Support = number (or %) of docs containing both X and Y
• The relation between X and Y is not known.
• Examples:
– Multiple Sclerosis → Optic Neuritis (2.02, 117)
– Multiple Sclerosis → Interferon-beta (5.17, 300)
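
The confidence and support definitions on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `association_rule` and the toy corpus are hypothetical, each document is modelled as a plain set of concept names, and nothing here reflects how BITOLA actually stores or computes its rules.

```python
def association_rule(docs, x, y):
    """Return (confidence in %, support) for the rule x -> y.

    docs: list of sets, each set holding the concepts found in one document.
    Confidence = % of the documents containing x that also contain y.
    Support = number of documents containing both x and y.
    """
    x_docs = [d for d in docs if x in d]
    if not x_docs:
        return 0.0, 0
    support = sum(1 for d in x_docs if y in d)
    confidence = 100.0 * support / len(x_docs)
    return confidence, support


# Toy corpus (made-up data, not Medline counts).
docs = [
    {"Multiple Sclerosis", "Optic Neuritis"},
    {"Multiple Sclerosis", "Interferon-beta"},
    {"Multiple Sclerosis"},
    {"Multiple Sclerosis", "Interferon-beta", "Optic Neuritis"},
    {"Optic Neuritis"},
]

conf, supp = association_rule(docs, "Multiple Sclerosis", "Interferon-beta")
print(f"Multiple Sclerosis -> Interferon-beta ({conf:.2f}, {supp})")
```

Note that the rule is directional: confidence for X → Y is measured within the X docs, so swapping X and Y generally gives a different confidence but the same support.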
12
Discovery Algorithm
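• Concept X (Disease)
– Chromosomal Region
– Manifestation Location
• Concepts Y (Pathological or Cell Function, …)
• Concepts Z (Genes)
– Chromosomal Location
– Expression Location
• Match: Chromosomal Region of X with Chromosomal Location of Z
• Match: Manifestation Location of X with Expression Location of Z
• Candidate Gene? (proposed relation between X and Z)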
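
A minimal sketch of the discovery steps on this slide, assuming the association rules are available as a simple mapping from each concept to its co-occurring concepts. All names (`Gene`, `candidate_genes`, `GENE_A`, `Disease D`) and the data are hypothetical, and ranking by the number of linking Y concepts is an assumption made for the example, not the ranking BITOLA is stated to use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    symbol: str
    location: str            # chromosomal location, e.g. "Xq28"
    expressed_in: frozenset  # tissues in which the gene is expressed


def candidate_genes(x, related, genes, disease_region, manifestation):
    """Propose genes Z for disease X via intermediate concepts Y.

    related: dict mapping a concept to the set of concepts it co-occurs with.
    genes: dict mapping a gene symbol to its Gene record.
    """
    known = related.get(x, set())
    links = {}  # gene symbol -> set of Y concepts linking it to X
    for y in known:
        for z in related.get(y, set()):
            # Z must be a gene that is not already directly related to X.
            if z in genes and z not in known:
                links.setdefault(z, set()).add(y)

    hits = []
    for z, ys in links.items():
        gene = genes[z]
        # Chromosomal match: gene must lie in the region linked to the disease.
        if not gene.location.startswith(disease_region):
            continue
        # Expression match: gene must be expressed where the disease manifests.
        if manifestation not in gene.expressed_in:
            continue
        hits.append((z, sorted(ys)))
    # Assumed ranking: more linking Y concepts first, then alphabetical.
    return sorted(hits, key=lambda h: (-len(h[1]), h[0]))


# Hypothetical toy data.
related = {
    "Disease D": {"Cell Movement", "Epilepsy"},
    "Cell Movement": {"GENE_A", "GENE_B", "GENE_C"},
    "Epilepsy": {"GENE_A", "GENE_D"},
}
genes = {
    "GENE_A": Gene("GENE_A", "Xq28", frozenset({"brain"})),
    "GENE_B": Gene("GENE_B", "Xq28", frozenset({"liver"})),
    "GENE_C": Gene("GENE_C", "7p11", frozenset({"brain"})),
    "GENE_D": Gene("GENE_D", "Xq28", frozenset({"brain", "muscle"})),
}

result = candidate_genes("Disease D", related, genes, "Xq28", "brain")
print(result)
```

In this toy run `GENE_B` is dropped by the expression filter and `GENE_C` by the chromosomal filter, which is the kind of reduction in candidate relations that the background knowledge is meant to provide.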
13
Ranking Concepts Z
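[Diagram: starting concept X is linked to intermediate concepts Y1, Y2, Y3, …, Yi, …, Yj, which are in turn linked to candidate concepts Z1, Z2, Z3, …, Zk, …, Zn]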
14
Problem Size
• Full Medline analyzed (ca. 15,000,000 records)
• 87,000,000 association rules between 180,000
biomedical concepts
15
Bilateral Perisylvian
Polymicrogyria - BPP (OMIM:
300388)
• Polymicrogyria of the cerebral cortex is
a developmental abnormality
characterized by excessive surface
convolution
• Clinical characteristics:
– Mental retardation
– Epilepsy
– Pseudobulbar palsy (paralysis of the face,
throat, tongue and the chewing process)
• X-linked dominant inheritance
16
• 237 genes in Xq28
• Relation between semantic types Cell Movement and Gene or gene products: 18 gene candidates
• Sublocalisation in Xq28: 15 gene candidates
• Tissue-specific expression: 2 gene candidates, L1CAM and FLNA
17
User Interface “cgi-bin” version
18
Automatically search for supporting Medline Citations
19
Part 1: Summary and Conclusions
• Discovery support system (BITOLA) presented
• The system can be used as:
– Research idea generator, or
– Alternative method of searching Medline
• Genetic knowledge about the chromosomal
locations of diseases and genes included to make
BITOLA more suitable for disease candidate gene
discovery
20
System Availability
• URL: www.mf.uni-lj.si/bitola/
21
Part 2: Exploring Semantic
Relations for LBD
22
Current LBD Systems
• Co-occurrence based
• Concepts
– Title/Abstract Words/Phrases
– MeSH
– UMLS
– Genes ...
• UMLS Semantic types used for filtering
• Semantic relations between concepts
NOT used
23
Drawbacks of Current LBD
• Not all co-occurrences represent a relation
• Users have to read many Medline citations
when reviewing candidate relations
• Many spurious (false-positive) relations and
hypotheses produced
• No explanation of proposed hypotheses
24
Enhancing the LBD paradigm
• Use semantic relations obtained from
– two NLP systems (BioMedLee and SemRep)
to augment
– co-occurrence based LBD system (BITOLA)
25
Methods
26
Discovery Patterns
• Discovery pattern:
Set of conditions to be satisfied for the generation
of new hypotheses
• Conditions are combinations of semantic relations
between concepts
• Maybe_Treats pattern in this research – has two
forms:
– Maybe_Treats1
– Maybe_Treats2
27
Maybe_Treats Discovery Pattern
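Maybe_Treats1:
– Disease X –Change1– Substance Y1 (or Body meas., Body funct.)
– Drug Z1 (or substance) –Opposite_Change1– Substance Y1
– Hypothesis: Drug Z1 –Maybe_Treats1– Disease X
Maybe_Treats2:
– Disease X –Change2– Substance Y2 (or Body meas., Body funct.)
– Disease X2 –Same Change2– Substance Y2
– Drug Z2 (or substance) –Treats– Disease X2
– Hypothesis: Drug Z2 –Maybe_Treats2– Disease X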
28
Maybe_Treats1 and
Maybe_Treats2
• Goal:
Propose potentially new treatments
• Can work in concert:
– Propose different treatments (complementary)
– Propose same treatments using different discovery
reasoning (reinforcing)
29
Multiple Usages of Maybe_Treats
• Given Disease X as input:
– find new treatments Z
• Given Drug Z as input:
– find diseases X that can be treated
• Given Disease X and Drug Z as input:
– test whether Z can be used to treat X
30
Semantic Relations Used
• Associated_with_change and Treats used to
extract known facts from the literature
• Then Maybe_Treats1 and Maybe_Treats2
predict new treatments based on the known
extracted facts
31
Associated_with_change
• One concept associated with a change in another
concept, for example:
• Associated_with(Raynaud’s, Blood viscosity, increase):
– “Local increase of blood viscosity during cold-induced Raynaud's
phenomenon.”
– “Increased viscosity might be a causal factor in secondary forms
of Raynaud's disease, …”
• BioMedLee (Friedman et al) used to extract
Associated_with_change
32
Treats
• Used to extract drugs known to treat a disease
• Major purpose in our approach:
– Eliminate drugs already known to be used to treat a disease
– Find existing treatments for similar diseases
• TREATS(Amantadine,Huntington):
– “…treatment of Huntington’s disease with amantadine…”
• Treats extracted by SemRep (Rindflesch et al)
33
Results
34
Huntington Disease
• Inherited neurodegenerative disorder
• All 5511 Huntington citations (Jan. 2006)
processed with BioMedLee and SemRep
• 35 interesting concepts associated with change
selected and corresponding citations
(250,000) processed
35
Insulin for Huntington
Disease
• Assoc_with(Huntington,Insulin,decrease):
– “Huntington's disease transgenic mice develop an
age-dependent reduction of insulin mRNA
expression and diminished expression of key
regulators of insulin gene transcription, …”
• Insulin also decreased in diabetes mellitus
• Therapies used to regulate insulin in
diabetes might be used for Huntington
36
Capsaicin for Huntington
• Assoc_with(Huntington,Substance P,decrease):
– “In Huntington's disease brains decreased Substance P
staining was found in …”
• Assoc_with(Capsaicin,Substance P,increase):
– “Capsaicin also attenuated the increase in Substance P
content in sciatic nerve, …”
• Capsaicin maybe treats Huntington because
Substance P is decreased in Huntington and
Capsaicin increases Substance P.
37
Huntington Results - Summary
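Maybe_Treats1:
– Huntington (Disease X) –Decrease– Substance P (Substance Y1)
– Capsaicin (Drug Z1) –Increase– Substance P (Substance Y1)
– Hypothesis: Capsaicin –Maybe_Treats1– Huntington
Maybe_Treats2:
– Huntington (Disease X) –Decrease– Insulin (Substance Y2)
– Diabetes M (Disease X2) –Decrease– Insulin (Substance Y2)
– Insulin regulation ther. (Z2) –Treats– Diabetes M
– Hypothesis: Insulin regulation ther. –Maybe_Treats2– Huntington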
38
Example: Parkinson disease
as starting concept. Below are some related
concepts that change in association with Parkinson
39
Potential Treatments for Parkinson
(e.g. gabapentin)
40
Showing Supporting Sentences
with highlighted concepts and relations
41
Gabapentin for Parkinson
• Assoc_with(Parkinson,gamma-aminobutyric
acid(GABA),decrease):
– “…studies indicate that patients with Parkinson's disease
have decreased basal ganglia gamma-aminobutyric acid
function…”
• Assoc_with(Gabapentin,GABA,increase):
– “Gabapentin, probably through the activation of glutamic acid
decarboxylase, leads to the increase in synaptic GABA.”
• Explanation: Gabapentin maybe treats Parkinson
because GABA is decreased in Parkinson and
Gabapentin increases GABA.
42
Part 2: Conclusions
• A new method to improve LBD presented
• Based on discovery patterns and semantic
relations extracted by BioMedLee and
SemRep, coupled with BITOLA LBD
• Easier for the user to evaluate a smaller number
of hypotheses
• Two potentially new therapeutic approaches
for Huntington proposed and one for
Parkinson
• Raynaud’s—Fish oil discovery replicated
43
The future of
Literature-based Discovery
• Development of specific discovery patterns
based on semantic relations and further
integrated with co-occurrence-based LBD
44
Link, References and some
propaganda
• http://www.mf.uni-lj.si/bitola
• Hristovski D, Peterlin B, Mitchell JA and Humphrey SM. Using literature-based discovery to identify disease candidate genes. Int. J. Med. Inform. 2005. Vol. 74(2–4), pp. 289–298. Selected for Yearbook of Medical Informatics 2006.
• Hristovski D, Friedman C, Rindflesch TC, Peterlin B. Exploiting semantic relations for literature-based discovery. In Proc AMIA 2006 Symp; 2006. p. 349-53.
• Ahlers C, Hristovski D, Kilicoglu H, Rindflesch TC. Using the Literature-Based Discovery Paradigm to Investigate Drug Mechanisms. In Proc AMIA 2007 Symp; 2007. p. 6-10. “Distinguished Paper Award AMIA2007”.
• Hristovski D, Friedman C, Rindflesch TC, Peterlin B. Literature-Based Knowledge Discovery using Natural Language Processing. To appear as a chapter in the first LBD book in 2008.
45