Transcript MIE 2008

MIE 2008
Assessment of Biomedical Knowledge According to Confidence Criteria
Ines Jilani : [email protected]
Natalia Grabar
Pierre Meneton
Marie-Christine Jaulent
Wednesday, 28th of May 2008
Context
• Increasing number of biomedical articles in
Pubmed*
• Follow-up work on automatic extraction of
functional knowledge about genes/proteins
from scientific articlesΔ indexed in Pubmed
– Using lexico-syntactic patterns:
• Language specific automaton (grammar)
o Syntactic elements (Verb, Noun, Adjective…)
o Semantic elements (Meaning of words…)
* http://www.ncbi.nlm.nih.gov/sites/entrez
Δ Jilani I, Grabar N & Jaulent M.-C. Fitting the finite-state automata platform for mining gene functions from biological scientific literature. In SMBM in Jena (Germany) 2006
Example of lexico-syntactic pattern
o (Sox2; sensory organ development)
o (Hint; murine development)
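A pattern of this kind can be sketched with an ordinary regular expression. The snippet below is only an illustration: the trigger phrase "is required for" is borrowed from the Sox2 example sentence used in the Objectives slide, and the expression itself and the name `extract_pair` are assumptions, not the automaton actually used by the authors.

```python
import re

# Hypothetical pattern: <gene> is required for <function>
PATTERN = re.compile(
    r"(?P<gene>[A-Z][A-Za-z0-9-]*)\s+is\s+required\s+for\s+"
    r"(?P<function>[a-z][a-z\s-]*[a-z])"
)


def extract_pair(sentence):
    """Return (gene, function) if the sentence matches the pattern, else None."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return match.group("gene"), match.group("function")


print(extract_pair("Sox2 is required for sensory organ development"))
```

A real system would need many such patterns, plus syntactic and semantic constraints, which this single expression does not attempt to cover.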
Introduction
• Limits of the system
– Loss of context: reliability and confidence of
the claim
• Solution
– Use some devices to « weight » the extracted
knowledge
• In order to make more confident use of extracted
knowledge
o Hedge, modifier, qualifier
o Confidence markers
Hedges, modifiers, qualifiers …
• Linguistic devices used by authors to qualify their
assertions
– Different grammatical categories: verbs, adverbs, adjectives…
– “Copper deficiency is a plausible cause of Alzheimer disease
(AD). This hypothesis should be tested with a lengthy trial of
copper supplementation”*
• “hedge” was first used by Lakoff Δ : “words whose job it is
to make things more or less fuzzy”
• HylandΦ and others carried out qualitative studies of these qualifiers
– without modelling them
– without integrating their use for weighting any kind of information in a knowledge extraction system
* Quoted from the abstract of the article with Pubmed Identifier 17928161
Δ Lakoff, G., (1972) : Hedges: A study of Meaning Criteria and the Logic of Fuzzy Concepts, Chicago Linguistic Society, 8, pp. 183-228
Φ Hyland, K. 1995. The Author in the Text: Hedging Scientific Writing. Hong Kong Papers in Linguistics and Language Teaching.
Objectives
• Work on confidence markers in scientific articles
– Their use
– Their significance
– Their classification
– Their automatic detection in texts for knowledge weighting purposes
• The main aim was to document the information so that it
could be used confidently
E.g.: (Sox2; sensory organ development)
– Sox2 is required for sensory organ development
– Sox2 might be required for sensory organ development
– Sox2 is probably required for sensory organ development
– Our findings suggest that Sox2 is required for sensory organ development
– Doe et al. have demonstrated that Sox2 is required for sensory organ development
Materials
• 3 corpora obtained by querying Pubmed
Corpus | Query | Species | Source | Specificity | Number of sentences
CORP1 | 160 genes + Alzheimer disease | human | Pubmed | 355 abstracts | 817
CORP2 | 160 genes + Alzheimer disease | human | Pubmed Central | 68 full texts | 27,912
CORP3 | 160 genes + Alzheimer disease | worm | Pubmed | 348 abstracts | 825
• Lexical resource: WordNet®* is a large lexical database
of layman English: nouns, verbs, adjectives and adverbs
– Used to enrich the extracted confidence markers by identifying
their synonyms
* WordNet, An Electronic Lexical Database, Christiane Fellbaum ed., (1998), The MIT Press, Cambridge, Mass
Methods
• Manual collection of confidence markers from CORP1,
CORP2 and CORP3
• Enrichment of the list of confidence markers
– Using WordNet®
• Classification of confidence markers according to 2 types of
classes
• Add the Impact Factor (IF) as another confidence criterion
– Hypothesis: IF of a journal is subjectively related to the
reliability of the biological and medical
information published
• Modeling confidence criteria: develop a formula that consistently orders the triplets (representing annotations) by their confidence score
Results
• List of 250 manually collected confidence
markers was generated
• Enrichment using WordNet® increased the
number of confidence markers listed to 478
• Classification
– 4 different categories in ascending order of confidence → Type 1
– 10 distinct qualifiers modifying confidence levels within the Type 1 categories, characterizing subjectivity in texts → Type 2
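The enrichment step can be pictured as expanding each collected marker with its synonyms. The sketch below does not query WordNet: `SYNONYM_GROUPS` is a toy stand-in whose groups are copied from the verb lists in the modeling patterns of this talk, and `enrich` is a hypothetical helper name.

```python
# Toy stand-in for WordNet synonym sets (assumption): the groups are taken
# from the verb alternations shown in the modeling slide, not from WordNet.
SYNONYM_GROUPS = [
    {"anticipate", "expect"},
    {"hypothesise", "speculate", "expect", "predict", "suspect"},
]


def enrich(markers):
    """Return the markers plus every synonym sharing a group with one of them."""
    enriched = set(markers)
    for marker in markers:
        for group in SYNONYM_GROUPS:
            if marker in group:
                enriched |= group
    return sorted(enriched)


print(enrich(["anticipate", "suspect"]))
```

Only the original markers are expanded (a single pass), so synonyms of synonyms are not followed; whether the authors did so is not stated in the slides.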
Results: Type 1 class
1 - Interrogation or trial and error by the author: knowledge that remains unproven and requires demonstration.
e.g.: “remain to be confirmed”, “has yet to be identified”, “?”
2 - Distance taken by the author from his or her assertions or from the knowledge presented in the text: this may also correspond to a restriction of the knowledge to a specific context (e.g.: the context of the article or experiment).
e.g.: “our findings suggest that”, “in this case we conclude that”, “it is possible that”
3 - Studies by other researchers, references to other works, articles or methods: we assume that if an article is cited, the information is accepted or, at worst, simply believed to be true.
e.g.: “previous observation”, “it is now believed that”, “it has been proposed that”
4 - Demonstration or proof given by the author: this corresponds to work carried out by the author and presented in the article concerned.
e.g.: “we reveal that”, “we show here that”, “our results indicate that”…
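As a rough illustration of how these four categories could be assigned automatically, the sketch below looks up the example markers listed on this slide by plain substring search. The marker table contains only those examples, and the function name `type1_class` and the lowest-category-first tie-break are assumptions rather than the authors' implementation.

```python
# Example markers copied from the four Type 1 categories on this slide.
TYPE1_MARKERS = {
    1: ["remain to be confirmed", "has yet to be identified", "?"],
    2: ["our findings suggest that", "in this case we conclude that",
        "it is possible that"],
    3: ["previous observation", "it is now believed that",
        "it has been proposed that"],
    4: ["we reveal that", "we show here that", "our results indicate that"],
}


def type1_class(sentence):
    """Return the lowest Type 1 category whose marker occurs in the sentence.

    Returns None when no marker is found. Checking the lowest category first
    is an assumed, conservative tie-break.
    """
    text = sentence.lower()
    for category in sorted(TYPE1_MARKERS):
        if any(marker in text for marker in TYPE1_MARKERS[category]):
            return category
    return None


print(type1_class("Our findings suggest that Sox2 is required for sensory organ development"))
```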
Results: Type 2 class*
• 10 Qualifiers representing probabilities from
negation to affirmation, i.e. from the least
probable to the most probable
[Figure: scale of the 10 qualifiers, from “Confidence - -” to “Confidence + +”]
* Work derived from: Ian Jacobs. 1995. English Modal Verbs
Results: Modeling
• Modeling confidence criteria for their automatic
extraction
– Regular expressions are used
• “we anticipate” and “we expect”
we<have>*(<anticipate>+<expect>)
– Synonyms are used
We had anticipated that…
We have anticipated that…
We expect that…
We have expected that…
• “we hypothesise” and “we suspect”
we<have>*(<hypothesise>+ <speculate>+<expect>+<predict>+<suspect>)
• “have been previously confirmed”, “is now largely confirmed” and
“is widely confirmed”
<have>*<be>(previously+now)*(largely+widely+extensively+generally)*<confirm>
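The first pattern above can be approximated in Python's `re` syntax. This sketch assumes that `<lemma>` stands for any inflected form of the verb, that `+` means alternation and that `*` means optional repetition; the slides do not spell this notation out. The inflected forms are listed by hand in `FORMS`, and the helper `alt` is hypothetical.

```python
import re

# Inflected forms per lemma, listed by hand for this sketch (assumption:
# <lemma> in the slide's notation matches any inflected form of the verb).
FORMS = {
    "have": ["have", "has", "had"],
    "anticipate": ["anticipate", "anticipates", "anticipated"],
    "expect": ["expect", "expects", "expected"],
}


def alt(*lemmas):
    """Build a non-capturing alternation over all forms of the given lemmas."""
    words = [form for lemma in lemmas for form in FORMS[lemma]]
    # Longest forms first so that "expected" is tried before "expect".
    words.sort(key=len, reverse=True)
    return "(?:" + "|".join(words) + ")"


# Approximation of: we<have>*(<anticipate>+<expect>)
WE_EXPECT = re.compile(
    r"\bwe\s+(?:" + alt("have") + r"\s+)*" + alt("anticipate", "expect") + r"\b",
    re.IGNORECASE,
)

for sentence in ["We had anticipated that", "We expect that", "We show that"]:
    print(sentence, "->", bool(WE_EXPECT.search(sentence)))
```

Listing inflected forms by hand keeps the example self-contained; a lemmatizer or morphological dictionary would replace `FORMS` in practice.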
Results: Application
points
• Context of apolipoprotein E gene
Triplets (Gene, Function, PMID)
Results : Explanations
- ApoE allelic variability influences pupil response to cholinergic challenge and cognitive impairment. (1)
- The Apolipoprotein E (ApoE) epsilon4 allele role in LOD is controversial, while it is still unknown in vascular depression. (2)
- ApoE4 seems to facilitate HSV-1 latency in the brain much more so than ApoE3. (3)
Triplets | Type 1 | Type 2 | IF
(1) ApoE / cognitive impairment / 16764677 | 4 | 10 | 4.091
(2) ApoE epsilon4 allele / vascular depression / 17337010 | 1 | 10 | 2.035
(3) ApoE4 / HSV-1 latency / 16699018 | 2 | 10 | 5.178
Triplets ordered from most to least confident: 1; 3; 2
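The scoring formula itself is not given in these slides, so the following is only a sketch under an assumed rule: compare Type 1 first, then Type 2, then the impact factor (treated as a decimal number). The name `confidence_key` and this lexicographic key are assumptions, not the authors' formula.

```python
# (sentence number, Type 1, Type 2, impact factor) as listed on this slide.
TRIPLETS = [
    (1, 4, 10, 4.091),
    (2, 1, 10, 2.035),
    (3, 2, 10, 5.178),
]


def confidence_key(row):
    """Assumed ranking key: Type 1 first, then Type 2, then impact factor."""
    _, type1, type2, impact_factor = row
    return (type1, type2, impact_factor)


# Most confident triplet first under the assumed key.
ranked = sorted(TRIPLETS, key=confidence_key, reverse=True)
print([row[0] for row in ranked])
```

A weighted sum of the three criteria would be another plausible choice; the lexicographic key is used here only because it needs no invented weights.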
Discussion / Conclusion
• Confidence markers collected manually
– Abstracts
– full text articles
• They are extended with the WordNet® resource
• They are classified into 4 categories of Type 1 and 10 categories of Type 2
• This study constitutes preliminary work: the confidence markers can easily be added to the lexico-syntactic patterns already generated for annotating genes/proteins functionally
• Annotation already present in databases could be additionally
documented with confidence markers
– Gene Ontology Annotation files
– Swissprot / Uniprot
• The confidence markers can be used by curators to annotate
genes/proteins through a system able to detect those qualifiers
Perspectives
• The users of the final system are
potentially biologists, curators…
• Take into account the type of study presented in an article when scoring confidence
– Observational study (epidemiological)
– Controlled experiment
– Clinical trial…