Transcript Slide 1

A Cascaded Approach to
Normalising Gene Mentions
in Biomedical Literature
Hui Yang, Goran Nenadic, John Keane
School of Computer Science
Manchester Interdisciplinary BioCentre
[email protected]
Identification of gene names
• Gene/protein names are essential for
integrating and exploring bio-literature
– e.g. building/browsing regulatory networks
• Two-step process
1) recognise gene mentions in text
2) map these to a referent database
• Gene name variability and ambiguity
“biologists would rather share a toothbrush than a
gene name”
Outline
1. Overview – why a cascaded approach
2. Dictionary for matching
3. Exact-like matching
4. Approximate matching
5. Experiments
6. Summary and conclusions
Why a cascaded approach?
• Overall aim: improve recall, but with a
controlled loss of precision
• Give your best shot first
– intuitively: try “exact” and exact-like
matches first, and then try with
approximations
– experimentally: find an optimal sequence
• Apply further (less-reliable) steps only on
(still) unmatched gene mentions
Dictionary re-engineering
• Automatically generate gene name synonyms
from existing DBs (e.g. Entrez Gene, UniProt)
– use set of (generic, non-organism specific)
rules to generate canonical representations
of synonyms
alphaCP-4 protein → alphaCP4_protein
RP13-16H11.4 → RP13_6H11.4
Rev-ErbAalpha → Rev_ErbAalpha
ST3GALVI → ST3GALVI
• Two versions: preserve original synonyms as
well as normalised canonical forms
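A minimal sketch of how such a two-version dictionary could be built. The two rewriting rules are assumptions inferred from the examples on this slide, not the authors' actual rule set, and `canonical_form`, `build_dictionaries` and the gene identifiers are hypothetical names used only for illustration.

```python
import re


def canonical_form(synonym: str) -> str:
    """Toy canonicalisation; both rules are guessed from the slide's examples."""
    s = synonym.strip()
    # Assumed rule 1: drop a hyphen separating a letter from a digit ("CP-4" -> "CP4").
    s = re.sub(r"(?<=[A-Za-z])-(?=\d)", "", s)
    # Assumed rule 2: turn remaining hyphens and whitespace runs into one underscore.
    s = re.sub(r"[-\s]+", "_", s)
    return s


def build_dictionaries(entries):
    """entries: iterable of (synonym, gene_id) pairs.

    Returns two lookups, one keyed by the original synonym and one keyed by
    its canonical form, so both versions stay available for matching.
    """
    original, normalised = {}, {}
    for synonym, gene_id in entries:
        original.setdefault(synonym, set()).add(gene_id)
        normalised.setdefault(canonical_form(synonym), set()).add(gene_id)
    return original, normalised


# Placeholder identifiers, not real Entrez Gene IDs.
entries = [
    ("alphaCP-4 protein", "ID_A"),
    ("Rev-ErbAalpha", "ID_B"),
    ("ST3GALVI", "ID_C"),
]
original, normalised = build_dictionaries(entries)
print(sorted(normalised))
```

Keeping both lookups means a later exact-matching step can try the untouched synonym first and fall back to the canonical form.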
Pre-processing gene mentions
• Generate a set of canonical representations of
a gene mention
– analogous to dictionary re-engineering
– but, some add-ons and differences
• resolving potential acronyms
interleukin (IL)-17E → interleukin -17E, IL-17E
• resolving gene name coordinations
ORP 3 to 6
ORP 3 and 6
– token-based normalisation
• Roman numbers, acronyms, Greek letters
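The acronym and coordination steps could look roughly like the sketch below. It covers only the two patterns shown on this slide. The expansion of "3 to 6" into every number in the range is an assumption, and `resolve_acronym` and `expand_coordination` are hypothetical helper names.

```python
import re


def resolve_acronym(mention: str) -> list[str]:
    """Split 'long form (ACRONYM)tail' into a long-form and an acronym variant."""
    m = re.fullmatch(r"(.*?)\s*\(([^()]+)\)(.*)", mention)
    if not m:
        return [mention]
    long_form, acronym, tail = m.groups()
    return [f"{long_form} {tail}".strip(), f"{acronym}{tail}"]


def expand_coordination(mention: str) -> list[str]:
    """Expand 'STEM n to m' into a range and 'STEM n and m' into two names."""
    m = re.fullmatch(r"(\D+?)\s*(\d+)\s+(to|and|or)\s+(\d+)", mention.strip())
    if not m:
        return [mention]
    stem, conj = m.group(1), m.group(3)
    lo, hi = int(m.group(2)), int(m.group(4))
    if conj == "to" and lo <= hi:
        numbers = range(lo, hi + 1)  # assumed reading: the whole range is meant
    else:
        numbers = (lo, hi)
    return [f"{stem} {n}" for n in numbers]


print(resolve_acronym("interleukin (IL)-17E"))
print(expand_coordination("ORP 3 to 6"))
print(expand_coordination("ORP 3 and 6"))
```

Each variant produced this way would then be looked up separately, which is one plausible reason why coordinations add false positives as well as true positives.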
1st stage: exact matching
• Step E1:
Match original dictionary and original mentions
• Step E2:
Match normalised dictionary and normalised
mentions
• Step E3:
Match normalised dictionary and token-based
normalised mentions
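Steps E1 to E3 can be read as three dictionary lookups tried in order, with later steps applied only to mentions that are still unmatched. The sketch below illustrates that control flow under toy assumptions: `normalise` and `token_normalise` are simplified stand-ins for the real normalisation rules, and the synonyms and identifiers are invented.

```python
import re

# Toy Roman-numeral table; the real token-based rules are not given on the slides.
ROMAN = {"ii": "2", "iii": "3", "iv": "4"}


def normalise(name: str) -> str:
    """Stand-in normalisation: lower-case, separators become underscores."""
    return re.sub(r"[-\s]+", "_", name.strip().lower())


def token_normalise(name: str) -> str:
    """Stand-in token-based normalisation: also map Roman numerals to digits."""
    return "_".join(ROMAN.get(tok, tok) for tok in normalise(name).split("_"))


def exact_cascade(mentions, steps):
    """Run each step only on the mentions left unmatched by earlier steps."""
    matched, pending = {}, list(mentions)
    for step_name, key_fn, index in steps:
        still_unmatched = []
        for mention in pending:
            gene_ids = index.get(key_fn(mention))
            if gene_ids:
                matched[mention] = (step_name, gene_ids)
            else:
                still_unmatched.append(mention)
        pending = still_unmatched
    return matched, pending


# Invented synonyms and placeholder identifiers.
original_index = {"IL-2": {"ID_A"}, "TNF alpha": {"ID_B"}}
normalised_index = {normalise(s): ids for s, ids in original_index.items()}

steps = [
    ("E1", lambda m: m, original_index),        # original vs. original
    ("E2", normalise, normalised_index),        # normalised vs. normalised
    ("E3", token_normalise, normalised_index),  # normalised vs. token-normalised
]
matched, unmatched = exact_cascade(["IL-2", "tnf-alpha", "IL II", "XYZ9"], steps)
print(matched)
print(unmatched)
```

Whatever remains in `unmatched` is what the second, approximate stage would receive.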
2nd stage: approximate matching
• component-based comparisons
– relevant (specific) component classes
(Digit, Greek-Letter, Roman-Number, Chemical etc. tokens)
• Step A1a:
Component permutation (order) is ignored
• Step A1b:
Non-relevant components missing from a
synonym are ignored
2nd stage: approximate matching
• Step A2a:
One non-relevant extra component in a
synonym is ignored
• Step A2b:
One non-relevant extra component in a
synonym is ignored if all relevant components
are matched
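A rough sketch of component-based comparison for steps A1a, A1b and A2a. The tokeniser and the lists of relevant components are simplifying assumptions, and the function names are hypothetical. Step A2b is left out because the slides do not spell out how its matching of the remaining components differs from A2a.

```python
import re
from collections import Counter

# Assumed, deliberately small lists of "relevant" (specific) component classes.
GREEK = {"alpha", "beta", "gamma", "delta", "kappa"}
ROMAN = {"ii", "iii", "iv", "vi", "vii", "viii", "ix"}


def components(name: str) -> Counter:
    """Split a name into lower-cased alphabetic and numeric components."""
    return Counter(re.findall(r"[a-z]+|\d+", name.lower()))


def is_relevant(token: str) -> bool:
    return token.isdigit() or token in GREEK or token in ROMAN


def match_a1a(mention: str, synonym: str) -> bool:
    """A1a: same components, order ignored."""
    return components(mention) == components(synonym)


def match_a1b(mention: str, synonym: str) -> bool:
    """A1b: components missing from the synonym are ignored if non-relevant."""
    m, s = components(mention), components(synonym)
    missing, extra = m - s, s - m
    return not extra and all(not is_relevant(t) for t in missing)


def match_a2a(mention: str, synonym: str) -> bool:
    """A2a: at most one extra, non-relevant component in the synonym is ignored."""
    m, s = components(mention), components(synonym)
    missing, extra = m - s, s - m
    return (not missing and sum(extra.values()) <= 1
            and all(not is_relevant(t) for t in extra))


print(match_a1a("receptor alpha 2", "alpha 2 receptor"))
print(match_a1b("IL 2 protein", "IL 2"))
print(match_a2a("IL 2", "IL 2 precursor"))
```

Treating digits, Greek letters and Roman numbers as relevant keeps names that differ only in such a component (for example two numbered family members) from being merged.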
Cascade overview (diagram)
• Dictionary side: original synonyms, normalised synonyms
• Mention side: original mentions, normalised mentions,
token-normalised mentions
• Exact-like steps: Original vs. Original; Normalised vs.
Normalised; Normalised vs. Token-normalised
• Approximate steps: ignore word permutations; ignore one
missing non-relevant component; ignore one extra
non-relevant component; ignore one extra non-relevant
component if all relevant components are matched
Experiments
Experimental context
• BioCreative II data set
• Map human genes to Entrez Gene
                         # abstracts   # gene mentions   # matched gene identifiers
Set-1 (training data)            281               985                          995
Set-2 (test data)                262              1092                         1100
Set-0 (total)                    543              2077                         2095
Results: exact-like matching
Step                                  TP     FP    prec   recall
Original vs. Original               1044     24    0.98     0.50
Normalised vs. Normalised
  Organism prefix (hSPRY1)            28      0    1.00     0.01
  Coordinations                       50     23    0.69     0.02
  Parentheses                         18      9    0.67     0.01
  Canonical forms                    148      3    0.98     0.07
  total                              244     35    0.86     0.12
Normalised vs. Token-normalised       47      9    0.84     0.02
total                               1335     68    0.95     0.64
Results: approximate matching
Step                                          TP     FP    prec   recall
Ignore component permutations                 23      1    0.96     0.01
Ignore one missing non-relevant component     16      6    0.73     0.01
Ignore one extra non-relevant component       65     38    0.63     0.03
Ignore one extra non-relevant component
  if all relevant components are matched      13      2    0.87     0.01
total                                        117     47    0.71     0.06
Cumulative performance
• Precision: 0.93
• Recall: 0.69
• F-measure: 0.79
• For comparisons (BioCreative II test data)
– Precision: 0.94
– Recall: 0.72
– F-measure: 0.81
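As a worked check, the first set of cumulative figures (0.93 / 0.69 / 0.79) can be recomputed from the TP and FP totals of the exact-like and approximate stages. The sketch assumes that recall is measured against the 2095 gene identifiers of Set-0. That assumption is not stated on the slides, but it reproduces the reported recall values. `precision_recall_f` is a hypothetical helper name.

```python
def precision_recall_f(tp: int, fp: int, n_gold: int):
    """Precision, recall and balanced F-measure from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / n_gold
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Stage totals as reported in the two results tables.
exact_tp, exact_fp = 1335, 68
approx_tp, approx_fp = 117, 47
n_gold = 2095  # assumed denominator: matched gene identifiers in Set-0

p, r, f = precision_recall_f(exact_tp + approx_tp, exact_fp + approx_fp, n_gold)
print(round(p, 2), round(r, 2), round(f, 2))
```

The same function applied to the exact-like totals alone gives the 0.95 precision and 0.64 recall reported for that stage.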
Some conclusions
• Exact-like matching achieves 0.76 F-measure
(0.95 P, 0.64 R)
• Approximate matching improves recall by only
10-15%
– ignoring word order is effective (both recall- and
precision-wise), as is ignoring one extra
non-relevant component (recall)
• Some approaches consistent across different
test sets, some not
– e.g.
precision of approximate matching: 0.63 – 0.78
recall of exact matching: 0.59 – 0.68
Summary
• Simple yet effective approach
– cascaded approach with reliable matching
strategies which can be switched on and off
– some are good for precision, some for recall
– can be easily used for other species
• More work needed on
– gene name coordination and enumerations
– acronyms/symbols embedded in mentions
– species identification
Acknowledgements
• Partially funded by UK BBSRC
(Project “Mining Term Associations from
Literature to Support Knowledge Discovery in
Biology”)
• Manchester Interdisciplinary Biocentre
(Irena Spasic)
• Faculty of Life Sciences (Casey Bergman)
• National Centre for Text Mining (NaCTeM)