Benchmarking Infrastructure for Mutation Text Mining
Artjom Klein*, Alexandre Riazanov, Christopher J. O. Baker
Computer Science and Applied Statistics,
University of New Brunswick, Saint John, Canada.
Matthew Hindle
Synthetic and Systems Biology,
Edinburgh University,
Edinburgh, UK.
AIMM 2012
September 9th
Basel, Switzerland.
Mutation Text Mining
Mutation text mining facilitates a wide range of activities in
multiple scenarios in bioinformatics and systems biology,
including:
• Modeling of cell signalling pathways
• Protein structure annotation
• Expansion of disease-mutation database annotations
• Development of tools predicting the impacts of mutations
Useful text mining tasks:
• Simple identification of mutation mentions
• Linking (“grounding”) identified mutations to the
corresponding genes and proteins
• Identifying mutation impacts and/or related phenotypes.
Sample Mutation Text Mining Task: Grounding
“Haloalkane dehalogenase (DhlA) from Xanthobacter autotrophicus GJ10 hydrolyses
terminally chlorinated and brominated n-alkanes to the corresponding alcohols.”
“The A149T mutant showed only a slight reduction of dehalogenase activity (Vmax)
and while D260N resulted in a larger increase of Km with 1,2-dibromoethane.”
Mutation / Protein / Gene / Organism / Direction / Protein Property / Chemical
Protein ID
Sequence
Sample Mutation Text Mining Tasks: Relations
“Haloalkane dehalogenase (DhlA) from Xanthobacter autotrophicus GJ10 hydrolyses
terminally chlorinated and brominated n-alkanes to the corresponding alcohols.”
“The A147F mutant showed only a slight reduction of the enzyme activity (Vmax)
and while D157P resulted in a larger increase of Km with 1,2-dibromoethane.”
Mutation / Protein / Gene / Organism / Direction / Protein Property / Chemical
• Event: Mutation Impacts (Protein Property + Impact Direction)
• Impact Direction: Positive / Negative
• Linking / Grounding Impact mentions to Mutation mentions in text
• Linking / Grounding Mutation Mentions to Protein mentions in text
• Linking / Grounding Protein Property mentions to GO Molecular Function terms
• Normalizing Mutation mentions (e.g. to HGVS Nomenclature)
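A minimal sketch of the normalization step, assuming mentions appear in the compact one-letter form used in the sample sentence (e.g. A149T). The names find_mutations and to_hgvs and the regular expression are illustrative assumptions, not part of the infrastructure; real mentions take many more surface forms. The output follows the HGVS protein-level substitution pattern (e.g. p.Ala149Thr).

```python
import re

# Standard one-letter to three-letter amino acid codes.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}

AA = "ARNDCQEGHILKMFPSTWYV"
# Matches only the compact form: wild-type residue, position, mutant residue.
POINT_MUTATION = re.compile(rf"\b([{AA}])([1-9][0-9]*)([{AA}])\b")


def find_mutations(text):
    """Return (wild type, position, mutant) triples found in the text."""
    return [(wt, int(pos), mut) for wt, pos, mut in POINT_MUTATION.findall(text)]


def to_hgvs(wt, pos, mut):
    """Render a substitution in HGVS protein-level style."""
    return f"p.{THREE_LETTER[wt]}{pos}{THREE_LETTER[mut]}"


SAMPLE = ("The A149T mutant showed only a slight reduction of dehalogenase "
          "activity (Vmax) and while D260N resulted in a larger increase "
          "of Km with 1,2-dibromoethane.")

mentions = find_mutations(SAMPLE)
normalized = [to_hgvs(*m) for m in mentions]
```

The triple (wild type, position, mutant) is the same key on which system output and gold standard annotations are later matched.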
Performance of Mutation Grounding Systems
Mut. Gr. System | Text Type | Coverage | Size of test doc. corpus | Performance
Horn et al., 2004 | full text | G protein-coupled receptors and nuclear hormone receptors | 914 + 1094 | P: 0.87, R: 0.64
Krallinger et al., 2009 | full text, abstracts | human kinase mutations | 714 abstracts, 3486 full texts | P: 0.72, R: ?
Winnenburg et al., 2009 | abstracts | ten species | 508 abstracts | P: ?, R: ?
Laurila et al., 2010 | full text | no restriction, but evaluated on 4 corpora for 7 specific proteins | 76 | P: 0.84, R: 0.65
Baker, Kanagasabai, 2011 | full text | no restriction, but evaluated on corpora for specific proteins | 96 | P: 0.819, R: 0.601
Klein et al., 2012 | full text | no restriction, tested on 91 different UniProt identifiers | 331 | P: 0.82, R: 0.77
Benchmarking Resources
The Developer needs:
• Annotated Corpora (training and dev.)
• Robust tools to evaluate the performance of
– different systems
– different runs of an evolving prototype system against a
gold standard corpus.
• Tools for migrating / uploading system outputs
• Semantic models for integrating data
• Appropriate Metrics
Typical Benchmarking Challenges
• Benchmarks (annotated corpora) often not open / published
• Must be built from scratch
• Annotation Set (Semantic Types, Ontology classes, …)
• Different formats, different annotation schemas
• Choice of Representation Format (TXT, XML, TAB, RDF)
• Different metrics used for evaluation
• What to evaluate and how
• Complex submission procedures
• Cumbersome / Slow
mutation-text-mining
We leverage the semantic web standards: OWL, RDF and SPARQL
Our Representation Format: RDF
Why not XML? XML is a widely used standard format for
corpus annotation and is supported by a large number of
tools. However:
• Processing complex annotations in XML (parsing,
storing, querying, evaluation) is virtually impossible
with off-the-shelf XML tools.
• Developers must write schema-specific parsers and
processing scripts, and change them each time the
schema is changed or extended.
RDF: Extensibility
• Not all mutation text mining tasks and requirements can be
foreseen
• Same data may be used for different tasks
=> We need extensible representations.
OWL/RDF ontologies are highly extensible data schemas providing:
• easy integration of new corpora with annotation schemas that
need not be identical, as long as they are compatible.
• easy merging of data defined modulo one ontology with data
modulo another ontology.
• additional alignments between the ontologies can be provided by
the annotation providers – corpus curators or text mining system
developers.
RDF: Tool Availability
• OWL reasoners for data integrity checking
• RDF and OWL APIs for multiple programming
languages facilitate easy programmatic generation and
manipulation of annotations or RDF data representing
text mining results.
• SPARQL query language can be directly used for
calculating system performance metrics as well as for
various searches in the gold standard corpora.
• Multiple implementations of RDF databases
(triplestores) are available that facilitate efficient storing
and querying of large volumes of annotations.
Semantic Model:
Mutation Impact Extraction Ontology (MIEO)
Riazanov A, Laurila JB and Baker C, Deploying mutation impact text-mining software with the SADI Semantic Web Services framework, BMC
Bioinformatics 2011, 12(Suppl 4):S6.
Nona Naderi and René Witte, Automated extraction and semantic analysis of mutation impacts from the
biomedical literature, BMC Genomics 2012, 13(Suppl 4):S10.
Benchmarking with SPARQL
• SPARQL is a query language for RDF data.
• Creating a new SPARQL query or changing an existing
one is usually easier than writing or rewriting
scripts.
• We use named graphs (identified subsets of RDF
statements) to keep apart results from different
systems, results from different experiments, and
gold standard data from different corpora.
• Metrics are calculated by comparing two graphs and
finding the overlaps between them.
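The overlap idea can be illustrated outside a triplestore. This is a rough sketch, assuming each named graph has already been reduced to plain records; the dictionary keys, graph names and sample data are hypothetical and only stand in for the RDF statements.

```python
def mutation_keys(annotations):
    """Reduce annotations to comparable keys:
    (document, wild-type residue, position, mutant residue)."""
    return {(a["doc"], a["wt"], a["pos"], a["mut"]) for a in annotations}


def overlap(system_graph, gold_graph):
    """Annotations present in both graphs (the true positives)."""
    return mutation_keys(system_graph) & mutation_keys(gold_graph)


# Hypothetical contents of two named graphs.
graphs = {
    "experiment": [
        {"doc": "doc1", "wt": "A", "pos": 149, "mut": "T"},
        {"doc": "doc1", "wt": "K", "pos": 12, "mut": "R"},  # not in gold
    ],
    "gold": [
        {"doc": "doc1", "wt": "A", "pos": 149, "mut": "T"},
        {"doc": "doc1", "wt": "D", "pos": 260, "mut": "N"},  # missed by system
    ],
}

shared = overlap(graphs["experiment"], graphs["gold"])
```

Keeping each source in its own graph means the same comparison works unchanged for any pair: two systems, two runs of one system, or a run against a gold standard.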
Corpus Development: KinMutBase
• A subset of 201 documents annotated with singular
amino acid substitutions grounded to proteins
• Curation: we ran MutationFinder (high recall) and
compared its results with the annotations in the
database. Based on this comparison, we discarded
about 70 documents that appear to be annotated with
protein-level mutations not explicitly mentioned in the
documents.
• The final size of the corpus is 128 documents. In total,
we have 271 mutations linked to 26 different UniProt
identifiers.
• Primarily for Mutation Grounding Evaluations
Stenberg KA, Riikonen PT, Vihinen M., KinMutBase, a database of human disease-causing protein kinase mutations , Nucleic
Acids Res. 1999 Jan 1;27(1):362-4.
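A hedged sketch of the curation filter described above. It assumes a document is kept only when every database-annotated mutation also appears among the mentions extracted from its text; the slides do not state the exact criterion, and the function name, identifiers and sample data are hypothetical. The extracted mentions are stand-in data, not actual MutationFinder output.

```python
def curate(db_annotations, extracted_mentions):
    """Keep documents whose annotated mutations are all mentioned in the text.

    db_annotations:     dict of document ID -> set of annotated mutations
    extracted_mentions: dict of document ID -> set of mentions found in text
    """
    kept = {}
    for doc_id, annotated in db_annotations.items():
        found = extracted_mentions.get(doc_id, set())
        # Subset test: every annotated mutation must be explicitly mentioned.
        if annotated and annotated <= found:
            kept[doc_id] = annotated
    return kept


# Hypothetical database annotations and extracted mentions.
db = {
    "pmid1": {"A149T"},
    "pmid2": {"R248C", "G380R"},
    "pmid3": {"K650E"},
}
mentions = {
    "pmid1": {"A149T", "D260N"},
    "pmid2": {"R248C"},  # G380R is annotated but never mentioned
}

corpus = curate(db, mentions)
```

A high-recall extractor suits this step because a missed mention would wrongly discard a usable document.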
Corpus Development: EnzyMiner
• Full text documents (38) randomly selected from EnzyMiner* abstracts
• Documents with proteins from 49 UniProt IDs and 24 different species.
• Coverage: 488 statements (occurrences of impact information in text),
61 molecular functions and 29 combined mutations.
Annotated Information:
• Studied protein-level mutations, in the form of singular amino acid substitutions.
For situations when effects of several simultaneous amino acid substitutions are
studied, we annotate them as combined mutations.
• Proteins to which the mutations are related are identified with UniProt IDs. Host
organisms / sets of specific protein sequences can be identified via UniProt IDs.
• Protein properties specified as Gene Ontology Molecular function classes.
• Mutation impacts qualified as Positive, Negative or Neutral.
• Text fragments from which the information was extracted. Typical fragments
contain mentions of protein properties, impact directionality words, such as
“increased” or “worse”, mutation mentions, protein and organism names, etc.
• Documents identified with PubMed IDs.
* Yeniterzi S, Sezerman U., EnzyMiner: automatic identification of protein level mutations and their impact on target enzymes from PubMed abstracts.
BMC Bioinformatics. 2009 Aug 27;10 Suppl 8:S2.
Corpora Statistics
Corpus | Size | UniProt IDs | Mutations (per document)
EnzyMiner (Dev.) | 38 | 49 | 176
KinMutBase (published set) | 128 | 26 | 271
DHLA | 13 | 4 | 52
PIK3CA | 30 | 1 | 169
FGFR3 | 26 | 1 | 175
MEN1 | 7 | 1 | 22
Utilities
As a part of our infrastructure, we created a small set of
simple utilities, which facilitate data access:
• The evaluator utility calculates standard performance
metrics by executing some user-provided SPARQL
queries, counting the results and making necessary
calculations. The user can supply the queries in a simple
configuration file.
• The Sesame loader and query client are simple
command line applications that allow loading RDF
graphs into a Sesame triplestore and executing queries
from files.
Mutation Grounding Metrics
• Precision - the fraction of correctly grounded mutations
(true positives) over all grounded mutations (true
positives + false positives)
• Recall - the fraction of correctly grounded mutations
over all mutations in the gold standard (true positives +
false negatives).
• A mutation is considered correctly grounded if it is
mapped to a sequence corresponding to the UniProt ID
specified by the corresponding gold standard corpus.
Witte R, Baker CJO (2007), "Towards a systematic evaluation of protein mutation extraction", proposes over 15 different
metrics to evaluate protein mutation extraction systems.
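The grounding metrics can be sketched as follows, assuming each grounding has been reduced to a mapping from (document, mutation) to a UniProt ID. The function name and the identifiers UP_A and UP_B are placeholders, not real accessions. Treating a system grounding that is absent from the gold standard as a false positive is also an assumption of this sketch.

```python
def grounding_scores(system, gold):
    """Precision and recall of mutation grounding.

    system, gold: dict of (document ID, mutation) -> UniProt ID
    """
    # Correctly grounded: same mutation in the same document, same UniProt ID.
    true_pos = sum(1 for key, uid in system.items() if gold.get(key) == uid)
    precision = true_pos / len(system) if system else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall


# Placeholder identifiers, for illustration only.
gold = {
    ("d1", "A149T"): "UP_A",
    ("d1", "D260N"): "UP_A",
    ("d2", "W125F"): "UP_A",
    ("d2", "F172W"): "UP_A",
}
system = {
    ("d1", "A149T"): "UP_A",  # correct
    ("d1", "D260N"): "UP_B",  # grounded to the wrong protein
    ("d2", "W125F"): "UP_A",  # correct
}

precision, recall = grounding_scores(system, gold)
```

Here two of three system groundings are correct and two of four gold mutations are recovered.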
Evaluation Example
• Three SPARQL queries count:
– all correctly grounded mutations (query1),
– all grounded mutations (query2),
– all mutations in the gold standard (query3).
• The queries are defined in the configuration file of the
evaluator tool.
• The evaluator tool executes the three queries and combines
their result counts in the precision and recall formulas,
which are also defined in the configuration file as
mathematical expressions.
E.g.
• precision = query1/query2
• recall = query1/query3
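One way such formula expressions could be evaluated is sketched below. This is not the evaluator tool's actual implementation: the function name is hypothetical and the counts are made up. Each query name stands for the number of results that query returned, and only basic arithmetic is allowed so that a configuration file cannot run arbitrary code.

```python
import ast
import operator

# Arithmetic operators permitted in a formula.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expression, counts):
    """Evaluate an arithmetic formula over named query result counts."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Name):
            return counts[node.id]  # e.g. "query1" -> its result count
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported element in formula")
    return walk(ast.parse(expression, mode="eval"))


# Made-up result counts for the three queries.
counts = {"query1": 8, "query2": 10, "query3": 16}
formulas = {"precision": "query1/query2", "recall": "query1/query3"}

metrics = {name: evaluate(expr, counts) for name, expr in formulas.items()}
```

Because the formulas are data, a derived metric such as an F-score could be added to the configuration without touching the code.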
SPARQL Query 1
• Singular mutation mention recognition (true positives)
# PREFIX declarations for sio:, mieo: and goldst: are not shown on the slide.
SELECT ?doc ?singl_mut1
FROM NAMED <http://unbsj.biordf.net/misc/text_mining_experiment255.rdf>
FROM NAMED goldst:v0.0
WHERE {
  # Mutations reported by the system (experiment graph)
  GRAPH <http://unbsj.biordf.net/misc/text_mining_experiment255.rdf> {
    ?doc sio:SIO_000628 ?singl_mut1 .    # SIO 'refers to'
    ?singl_mut1 a mieo:AminoAcidSubstitution .
    ?singl_mut1 mieo:mutationHasWildtypeResidue ?wt_residue .
    ?singl_mut1 mieo:mutationHasMutantResidue ?mut_residue .
    ?singl_mut1 mieo:mutationHasPosition ?pos1 .
    ?pos1 sio:SIO_000300 ?pos_value .    # SIO 'has value'
  }
  # Matching mutations in the gold standard graph
  GRAPH goldst:v0.0 {
    ?doc sio:SIO_000628 ?singl_mut2 .
    ?singl_mut2 a mieo:AminoAcidSubstitution .
    ?singl_mut2 mieo:mutationHasWildtypeResidue ?wt_residue .
    ?singl_mut2 mieo:mutationHasMutantResidue ?mut_residue .
    ?singl_mut2 mieo:mutationHasPosition ?pos2 .
    ?pos2 sio:SIO_000300 ?pos_value .
  }
}
We select single mutations from the system output that match single mutations in the gold standard
(overlap: wild-type residue, position and mutant residue).
Testing the Infrastructure
For concept validation the infrastructure was used for
testing and iterative performance evaluation during a
project dedicated to the development of a robust mutation
grounding system.
EnzyMiner was used as the development corpus:
• originally created for mutation impact extraction
• it only contains information about mutations whose
impact is studied
• there may be other mutations associated with specific
proteins but not with impacts
• we only compute our performance metrics on the
subsets of mutations mentioned in the annotations
All other corpora were used as test corpora.
Evaluation Results
Corpus | Original Prototype: Precision | Original Prototype: Recall | New System: Precision | New System: Recall
EnzyMiner (Dev.) | 0.31 | 0.12 | 0.75 | 0.72
KinMutBase (225 docs) | 0.36 | 0.14 | 0.92 | 0.92
DHLA | 0.83 | 0.73 | 0.96 | 0.94
PIK3CA | 0.86 | 0.70 | 0.98 | 0.81
FGFR3 | 0.89 | 0.66 | 0.27 | 0.25
MEN1 | 0.54 | 0.32 | 0.54 | 0.52
Total w/o EnzyMiner | 0.64 | 0.35 | 0.82 | 0.77
On all corpora the new system outperforms the original prototype ….
mutation-text-mining
http://code.google.com/p/mutation-text-mining/
Future work
• Further stress-test the infrastructure with text mining tasks
other than mutation grounding and mutation impact
extraction, and with a third-party mutation text mining system.
• Extend the ontology based on the new requirements
identified through community involvement and our own
research.
• Extend the infrastructure to include protein properties other
than molecular functions, such as enzyme kinetics, and
DNA-level mutations.
• Model sentence-level provenance to provide more
precise pointers to text fragments supporting annotations.
• Participants wanted: Open Mutation Text Mining Competition
Acknowledgements
This research was funded in part by the New Brunswick
Innovation Foundation, New Brunswick, Canada; the NSERC,
Discovery Grant Program, Canada and the Quebec New Brunswick
University Co-operation in Advanced Education – Research
Program, Government of New Brunswick, Canada.