Semantic relation extraction
Semantic Relation Extraction
for Linking Named Entities
to Biomedical Databases
Presenter: Lê Hoàng Quỳnh
Knowledge of Technology Laboratory, UET, VNU
Hanoi, February 18th, 2012
Main contents
• Motivation and purpose
• Some approaches: the pros and cons
• Discussion and Proposal
• Conclusion
Motivation and purpose
“… developing a state of the art named entities tagger for full open source biomedical texts …”
• Deploying various named entity recognizers to see which works best
• Linking each named entity to its appropriate identifier in public databases
Motivation and purpose (cont’)
Which named entities do we focus on?
• Phenotype descriptions
• Disease names
• Gene names
• Chemical names
Motivation and purpose (cont’)
Ontology =
• Concept/Class
• Term/Individual
• Relation/Property
Motivation and purpose (cont’)
The Biocaster
Multilingual Ontology
biocaster.org
Motivation and purpose (cont’)
• How can named entities be linked to unique identifiers in a biomedical database?
• What is the difference between “linking” and “filling”?
• Method?
o Clustering
o Semantic relation extraction [LTB11]
o …
[LTB11] Hoang-Quynh Le, Mai-Vu Tran, Nhat-Nam Bui, Nguyen-Cuong Phan, Quang-Thuy Ha. An Integrated Approach Using Conditional Random Fields for Named Entity Recognition and Person Property Extraction in Vietnamese Text. In IALP 2011, Penang, Malaysia.
Motivation and purpose (cont’)
Semantic relation extraction
• Semantic relation extraction is the task of extracting the underlying relations between two terms, as expressed by words or phrases [Gir08]
• Due to the unique patterns of biomedical relations, techniques designed for extracting relations from general text may not be suitable for the biomedical domain
[Gir08] Girju R, “Semantic relation extraction and its applications”, ESSLLI 2008 Course Material, Hamburg, Germany, 4-15 August 2008
Motivation and purpose (cont’)
Which kinds of semantic relations do we focus on?
Entity:
• Phenotype descriptions
• Disease names
• Gene names
• Chemical names
Relation:
• Hyponymy
• Synonymy
• Cause/effect
• Indicate/hasSymptom
• Treat
• …
Some approaches
Three groups of existing methods:
• Pattern-based extraction relies on the occurrence of term pairs in the same contexts and uses the words in the context to identify the relation
• Distributional clustering uses the contexts that terms occur in individually and attempts to group semantically related elements based on similarities of these contexts
• Term variation is based on the form of the term and uses similarities between terms to identify which are semantically related
Some approaches (cont’)
Distributional clustering:
• Consider the contexts that a term tends to occur in, then apply clustering to work out which terms are most “similar”.
• With this methodology, classes of words that are similar in meaning can be found.
For example, using the verb "fire" as context yields the following classes of nouns:
o Gun, Missile, Weapon
o Shot, Bullet, Rocket, Missile
o Officer, Aide, Chief, Manager
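The idea on this slide can be sketched in a few lines of Python. This is only a minimal illustration, not the method behind the "fire" example: the (verb, noun) pairs are made up, the similarity measure (cosine) and the greedy single-pass clustering with a 0.5 threshold are assumptions chosen for brevity.

```python
from collections import Counter
from math import sqrt

# Hypothetical (verb, object-noun) observations; real input would come from a parsed corpus.
PAIRS = [
    ("fire", "gun"), ("load", "gun"), ("aim", "gun"),
    ("fire", "missile"), ("load", "missile"), ("aim", "missile"),
    ("fire", "officer"), ("hire", "officer"), ("promote", "officer"),
    ("fire", "manager"), ("hire", "manager"), ("promote", "manager"),
]

def context_vectors(pairs):
    """Map each noun to a bag of the verbs it occurs with."""
    vectors = {}
    for verb, noun in pairs:
        vectors.setdefault(noun, Counter())[verb] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(vectors, threshold=0.5):
    """Greedy clustering: a noun joins the first cluster whose first member is similar enough."""
    clusters = []
    for noun in sorted(vectors):
        for group in clusters:
            if cosine(vectors[noun], vectors[group[0]]) >= threshold:
                group.append(noun)
                break
        else:
            clusters.append([noun])
    return clusters

print(cluster(context_vectors(PAIRS)))
# → [['gun', 'missile'], ['manager', 'officer']]
```

All four nouns occur with "fire", yet the other contexts separate weapons from people. The output groups related nouns but says nothing about which relation holds between them.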
Some approaches (cont’)
Distributional clustering:
• Pros:
o Distributional clustering does not require that the terms occur in the same sentence or even in the same document
o Generally has a higher recall than pattern-based methods
• Cons:
o This method requires a mathematical approach to determine the clusters of terms which have a similar distribution of contexts
o It is very difficult to work out the nature of the relationship between the terms from distributional clustering
o Distributional clustering is not suitable for extracting specific relationships such as "X is a causal agent of Y"
Some approaches (cont’)
Term variation:
• Look at the form of the actual term and use the similarity of the words in it to deduce whether the terms are related.
For example: "cancer of the mouth" and "mouth cancer"
• Jacquemin [Jac99] defines three main ways that term variation occurs:
o Syntactic variations
o Morpho-syntactic variations
o Semantic variations
[Jac99] Christian Jacquemin. Syntagmatic and paradigmatic representations of term variation. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, pages 341-348, 1999.
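The "cancer of the mouth" / "mouth cancer" example can be approximated with a very crude check: two terms are treated as variants when they share the same content words. This is a sketch of the intuition only, not Jacquemin's method; the function-word list and the function names are assumptions made for illustration.

```python
import re

# Assumed, deliberately tiny list of function words; a real system would use a fuller list.
FUNCTION_WORDS = {"of", "the", "in", "a", "an"}

def content_words(term):
    """Lower-case a term and keep its content words as a set."""
    tokens = re.findall(r"[a-z0-9]+", term.lower())
    return frozenset(t for t in tokens if t not in FUNCTION_WORDS)

def is_syntactic_variant(term_a, term_b):
    """True when two terms consist of exactly the same content words."""
    words_a = content_words(term_a)
    return bool(words_a) and words_a == content_words(term_b)

print(is_syntactic_variant("cancer of the mouth", "mouth cancer"))  # → True
print(is_syntactic_variant("mouth cancer", "lung cancer"))          # → False
```

A check like this has no way to relate terms that share no words, which is the limitation of term variation in general.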
Some approaches (cont’)
Term variation:
• Pros:
o Often has very high precision
o Strongest for finding whether two terms are synonymous
o Can prove useful for some other cases as well
• Cons:
o Cannot help to identify relationships between terms with no similarity
Some approaches (cont’)
Pattern-based extraction involves finding the terms in the same sentence and in some “pattern” that is suggestive of a particular relation.
• Hearst [Hea92] used patterns to extract terms that exhibit the hyponymy relation
• Her approach involved noting that such terms often occur near each other in stereotypical patterns
“Some kinds of flu, such as bird flu are …”
Pattern: noun phrase - “such as” - noun phrase
hyponym(“bird flu”, “flu”)
• Method for developing these patterns:
o Decide on a lexical relationship
o Collect a set of term pairs known to have this relationship and a corpus that contains these pairs
o Find the places where these terms co-occur
o Find commonalities and hypothesize a pattern
o Use this pattern to find more term pairs and repeat the process
[Hea92] Marti A. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th conference on Computational linguistics, pages 539-545, 1992.
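As a rough illustration of the "such as" pattern, the sketch below applies a regular expression to raw text. It is not Hearst's implementation: noun phrases are approximated by plain word sequences cut at a small, assumed stop-word list, and the hypernym is taken to be the single word before "such as". A real system would use a chunker for both.

```python
import re

# Words that end a candidate noun phrase in this toy chunker (an assumption, not a real NP grammar).
STOP = {"are", "is", "was", "were", "and", "or", "can", "may"}

# One word before "such as" (hypernym), then up to three words after it (hyponym candidate).
PATTERN = re.compile(r"(\w+),? such as ((?:\w+ ?){1,3})")

def hearst_such_as(sentence):
    """Return (hyponym, hypernym) pairs for 'NP such as NP' matches."""
    pairs = []
    for match in PATTERN.finditer(sentence.lower()):
        hypernym = match.group(1)
        words = []
        for word in match.group(2).split():
            if word in STOP:
                break
            words.append(word)
        if words:
            pairs.append((" ".join(words), hypernym))
    return pairs

print(hearst_such_as("Some kinds of flu, such as bird flu are dangerous."))
# → [('bird flu', 'flu')]
```

Even this small example shows two of the weaknesses of patterns: both terms must appear in the same sentence, and the result depends on finding the term boundaries correctly.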
Some approaches (cont’)
Pattern-based extraction
• Pros:
o Simple
o Patterns have the advantage that they can be specialised for different relationships
o Can be used for various languages
• Cons:
o The method is manual
o There is no way to make a strong comparison between the effectiveness of the different patterns, which may lead to the inclusion of a relatively “weak” pattern
o It is not clear how to automatically generate patterns that are specific to a given relationship and domain
o As patterns rely on finding the two terms in the same context, recall is limited, and ambiguity in the text can cause errors in the extractions
o Identifying the boundaries of the terms is a problem
Some approaches (cont’)
McCrae’s approach [Mcc09] for synonymy and hyponymy relations
• Starts with the most general pattern, that is, the pattern consisting of only wild cards
• Develops a more specific pattern by replacing wild cards with terms from some corpus (full text chap. 3.1)
[Mcc09] John Philip McCrae. Automatic Extraction of Logically Consistent Ontologies from Text Corpora. Doctor of Philosophy thesis, Department of Informatics, School of Multidisciplinary Sciences, The Graduate University for Advanced Studies. September 2009
Some approaches (cont’)
McCrae’s approach:
• Problem of identifying a term’s boundary
entity = (NN|JJ|NNS|NNP|FW|NNPS|JJR)* (NN|NNS|NNP|NNPS)
NN: A singular noun
NNS: A plural noun
NNP: A proper noun
NNPS: A pluralised proper noun
JJ: An adjective
FW: A foreign word
JJR: An adjective in comparative form
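The entity expression above can be applied directly to a POS-tagged sentence. The sketch below is one possible reading of it: collect the longest run of tokens whose tags are allowed inside a term, then trim trailing tokens until the last one carries a noun tag. The function name and the hand-tagged example sentence are assumptions for illustration.

```python
# Tags allowed anywhere in a term, and tags allowed on its final (head) token.
MODIFIER_TAGS = {"NN", "JJ", "NNS", "NNP", "FW", "NNPS", "JJR"}
HEAD_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def find_entities(tagged):
    """Return token spans matching MODIFIER* HEAD over (token, tag) pairs."""
    entities, run = [], []
    for token, tag in tagged + [("", "END")]:  # the sentinel flushes the last run
        if tag in MODIFIER_TAGS:
            run.append((token, tag))
            continue
        # Trim trailing tokens that cannot be the head of the term.
        while run and run[-1][1] not in HEAD_TAGS:
            run.pop()
        if run:
            entities.append(" ".join(tok for tok, _ in run))
        run = []
    return entities

# Hand-tagged example sentence (tags assigned manually for illustration).
tagged = [("Acute", "JJ"), ("headache", "NN"), ("is", "VBZ"), ("a", "DT"),
          ("common", "JJ"), ("symptom", "NN"), ("of", "IN"),
          ("dengue", "NN"), ("fever", "NN"), (".", ".")]
print(find_entities(tagged))
# → ['Acute headache', 'common symptom', 'dengue fever']
```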
Some approaches (cont’)
McCrae’s approach:
• Covering every possible variation of the patterns makes the search space far too large to be tractable
• It is necessary to find a way to cover this search space more efficiently:
o prioritizing "better" patterns
o skipping those patterns which are too similar to existing patterns
Some approaches (cont’)
McCrae’s approach:
• Rule definition:
Pattern: *1 * such as *2
Rule: :- name() words(1,1) "such" "as" name()
• Simplifying the rules
o Match-set (Chap. 3.2.1 in full text)
:- words(1,2) name() words(0,1) words(2,3) "literal" name()
Simplified form: :- words(1,1) name() words(2,4) "literal" name()
o Join-set and alignment (Chap. 3.2.2 in full text)
:- "a" name() "b" "c" "d" name()
:- words(,1) name() words(2,3) "c" name()
Alignment on these rules: {(2, 2), (4, 4), (6, 5)}
The alignment-to-join conversion:
:- words(,1) name() words(2,3) "c" words(0,1) name() words(0,0)
Simplified form: :- name() words(2,3) "c" words(0,1) name()
• Classification
Some approaches (cont’)
McCrae’s approach: Results
Some approaches (cont’)
Approach utilizing the Web [SNR08] [TNN10]
• RDF describes a Semantic Web using RDF statements, which are triples of the form <Subject, Property, Object>
• Query the search engines with lexico-syntactic patterns to retrieve relevant information
• The “seed” patterns are initially handcrafted but can be progressively learnt
• Extract relations from snippets
[SNR08] Saurav Sahay, Shamkant B. Navathe, Ashwin Ram. Discovering Semantic Biomedical Relations Utilizing the Web. ACM Transactions on Knowledge Discovery from Data, Vol. 2, No. 1, Article 3. March 2008.
[TNN10] Mai-Vu Tran, Tien-Tung Nguyen, Thanh-Son Nguyen, Hoang-Quynh Le (2010). "Automatic Named Entity Set Expansion Using Semantic Rules and Wrappers for Unary Relations", IALP 2010: 170-173, Harbin, Heilongjiang, China; December 28-30, 2010
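The steps listed on this slide can be sketched without any network access. Everything below is an assumption made for illustration rather than the procedure of [SNR08] or [TNN10]: the seed patterns, the `Triple` type, the function names and the example snippet are all hypothetical, and only the "is caused by" pattern is used for extraction.

```python
import re
from typing import NamedTuple

class Triple(NamedTuple):
    """An RDF-style statement <Subject, Property, Object>."""
    subject: str
    property: str
    object: str

# Hand-written seed patterns; "{disease}" is filled in per query.
SEED_PATTERNS = ["{disease} is caused by", "causes of {disease} include"]

def build_queries(disease):
    """Turn each seed pattern into a quoted phrase query for a search engine."""
    return ['"' + pattern.format(disease=disease) + '"' for pattern in SEED_PATTERNS]

def extract_causes(disease, snippets):
    """Pull <cause, causes, disease> triples out of snippet text."""
    regex = re.compile(
        re.escape(disease) + r" is caused by (?:the |a )?([\w -]+?)(?=[.,;]|$)",
        re.IGNORECASE,
    )
    triples = []
    for snippet in snippets:
        for match in regex.finditer(snippet):
            triples.append(Triple(match.group(1).strip(), "causes", disease))
    return triples

# A made-up snippet standing in for a search engine result.
snippets = ["Malaria is caused by Plasmodium parasites, which are spread by mosquitoes."]
print(build_queries("malaria"))
print(extract_causes("malaria", snippets))
```

In a real setting the snippets would come from the search engine results for the generated queries, and newly found sentences could suggest further patterns.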
Some approaches (cont’)
• [SNR08] focuses on discovering causal relationships between a disease and a biological entity
• Application: augmenting ontologies
• Purpose: given a disease, discover the likely causes of this disease
Some approaches (cont’)
Approaches summary and evaluation
Method | Precision | Recall | Applicability
Patterns | OK | Limited | Produces specific results for any relationship
Distributional clustering | OK | Good | Only produces a concept of “semantic relatedness”
Term variation | Good | Poor | Strongest for synonymy, some use elsewhere
Some approaches (cont’)
What about using machine learning?
• Using CRF [BDS08]:
o Extracts both the existence of a relation and its type
o Uses two types of CRF
• Using kernel-based learning [LZL08]:
o Relation detection: a binary classification of true and false relations
o Relation classification: a 4-class classification of the four relation types
[BDS08] Markus Bundschus, Mathaeus Dejori, Martin Stetter, Volker Tresp, Hans-Peter Kriegel. Extraction of semantic biomedical relations from text using conditional random fields. BMC Bioinformatics 2008, 9:207 doi:10.1186/1471-2105-9-207
[LZL08] Jiexun Li, Zhu Zhang, Xin Li, Hsinchun Chen. Kernel-Based Learning for Biomedical Relation Extraction. Journal of the American Society for Information Science and Technology, 59(5):756–769, 2008
Discussion and Proposal
Challenges
• Language complexity
• Requires good pre-processing (POS tagging, chunking, NER, etc.)
• …
• Techniques designed for extracting relations from general text may not be suitable for the biomedical domain
• Lack of tools and data
• …
• It is unlikely that the extracted relations will match the structure of the ontology
Discussion and Proposal
Challenges
• Modifiers: the inclusion of an adjective modifier in a term
For example: "acute headache" and "headache"; "mental retardation"
• Granularity: terms are nearly always used synonymously but have slight differences in meaning
For example: "HIV-1" is the most common strain of "HIV", but "HIV-2" is less easily transmitted and mostly confined to a small area of West Africa
• Property: two terms refer to the same thing but with a slightly different property
For example: "dengue shock syndrome" is a late-stage development of "dengue fever"
Discussion and Proposal (cont’)
Compromises
- Whether or not to determine the type of relationship
- Binary classification or multi-label classification
- One or two classifiers
- Pattern-based extraction, distributional clustering or term variation
- Using machine learning or not
- …
Discussion and Proposal (cont’)
Proposal
• Only deal with intra-sentence relations
• 2 classifiers
• Pattern-based extraction and term variation
• Semi-supervised learning
• There is still no strong definition of, or training resource for, phenotypes and diseases; this needs work using available resources such as the Human Phenotype Ontology and the CALBC data set from the EBI shared task 2011
Discussion and Proposal (cont’)
What about the model?
Conclusion & Future Works
• Purpose: hyponymy, synonymy and causal relation extraction for phenotype descriptions, disease names, gene names and chemical names
• Improve the method (using semantic patterns and term variation, bootstrapping techniques, etc.)
• Explore data and ontology
• Review “linking to ontology”
• Propose a model
• Try to use other available resources
References
[LTB11] Hoang-Quynh Le, Mai-Vu Tran, Nhat-Nam Bui, Nguyen-Cuong Phan, Quang-Thuy Ha. An Integrated Approach Using Conditional Random Fields for Named Entity Recognition and Person Property Extraction in Vietnamese Text. In IALP 2011, Penang, Malaysia.
[TNN10] Mai-Vu Tran, Tien-Tung Nguyen, Thanh-Son Nguyen, Hoang-Quynh Le (2010). "Automatic Named Entity Set Expansion Using Semantic Rules and Wrappers for Unary Relations", IALP 2010: 170-173, Harbin, Heilongjiang, China; December 28-30, 2010
[Mcc09] John Philip McCrae. Automatic Extraction of Logically Consistent Ontologies from Text Corpora. Doctor of Philosophy thesis, Department of Informatics, School of Multidisciplinary Sciences, The Graduate University for Advanced Studies. September 2009
[BDS08] Markus Bundschus, Mathaeus Dejori, Martin Stetter, Volker Tresp, Hans-Peter Kriegel. Extraction of semantic biomedical relations from text using conditional random fields. BMC Bioinformatics 2008, 9:207 doi:10.1186/1471-2105-9-207
[Gir08] Girju R, “Semantic relation extraction and its applications”, ESSLLI 2008 Course Material, Hamburg, Germany, 4-15 August 2008
[LZL08] Jiexun Li, Zhu Zhang, Xin Li, Hsinchun Chen. Kernel-Based Learning for Biomedical Relation Extraction. Journal of the American Society for Information Science and Technology, 59(5):756–769, 2008
[SNR08] Saurav Sahay, Shamkant B. Navathe, Ashwin Ram. Discovering Semantic Biomedical Relations Utilizing the Web. ACM Transactions on Knowledge Discovery from Data, Vol. 2, No. 1, Article 3. March 2008.
[Jac99] Christian Jacquemin. Syntagmatic and paradigmatic representations of term variation. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, pages 341-348, 1999.
[Hea92] Marti A. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th conference on Computational linguistics, pages 539-545, 1992.
[Bio] http://biocaster.org
Thank you for your attention!