Discussion on cis-regulatory / text mining interface

Cis-Regulatory/Text Mining Interface Discussion
Questions
(1) What does ORegAnno want from text mining?
– Curation queue
– Document mark-up
– Mapping to database IDs
(2) What does text mining need from ORegAnno?
(3) What can text mining provide?
– What level of performance is needed?
(4) What is the right way to proceed?
– Data sets for BioCreAtIvE?
– Custom tools for individual “early adopters”?
Answers: (1) What Does ORegAnno Want from Text Mining?
• Management of curation queue
– Ideally, user customized, so that user annotates those
documents of immediate interest to her/him
• Document mark-up to highlight relevant
passages
– A workflow pipeline making either the html or pdf
version of the document available, with the
(potentially) relevant terms highlighted
– Support for “cut and paste” transfer of relevant
regions to the database comments fields
• Mapping to IDs, ontology codes
– Gene, transcription factor (protein), organism, cell and
tissue type, evidence types
Answers: (2) What Does Text Mining Need from ORegAnno?
• Significant quantity of reliably annotated data to
train text mining systems
– Annotated at a level useful for natural language
processing (e.g., marked for evidence at the phrase,
sentence or passage level, depending on task)
• This requires that ORegAnno have:
– A clear statement of the scope of the ORegAnno
database and a stable set of annotation guidelines
– Annotations with high inter-annotator agreement
– Tracking of entries by annotator, including depth of
annotation (different annotators will annotate to
different levels of detail, depending on interests)
Answers: (3) What Can Text Mining Provide?
• Curation queue management:
– Document classification approaches (e.g., from TREC Genomics or
BioCreAtIvE) can be applied and evaluated, making use of new training
data from pre-jamboree and jamboree annotation
– We can experiment with “user defined” criteria, based on restrictions for
gene, transcription factor, organism, tissue, etc.
• Document mark-up
– Users could be provided with a list of genes/transcription factors in a
paper, with hot links into the paper to find relevant passages
– This would allow the annotator to drive the annotation process, selecting
only those annotations that are correct and relevant. This in turn
provides feedback using ORegAnno annotations to validate & train the
text mining
– Such a tool should make it easy for the annotator to provide the
underlying text passages as evidence for the annotation, to provide
more training data
• Mapping to unique identifiers/controlled vocabulary/ontology
– For each entity type (gene, transcription factor, organism, tissue type...),
a tool can provide a mapping to the correct identifier; where there is
possible ambiguity, the tool could provide a ranked list for the annotator
to choose from
– A tool can also flag different evidence types, with suggested code(s)
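The ranked-list idea for ambiguous mentions can be sketched as follows. This is only an illustration, assuming a small in-memory synonym lexicon and plain string similarity from Python's standard `difflib`; the identifiers (`GENE:0001`, etc.), the `LEXICON` table, and the `rank_candidates` helper are hypothetical placeholders, not part of ORegAnno or any existing tool. A real system would draw on curated gene and ontology resources and would likely use context as well as spelling.

```python
from difflib import SequenceMatcher

# Hypothetical lexicon: placeholder identifier -> known names and synonyms.
# The IDs are invented for illustration and do not refer to any real database.
LEXICON = {
    "GENE:0001": ["sonic hedgehog", "SHH"],
    "GENE:0002": ["sex hormone binding globulin", "SHBG"],
    "GENE:0003": ["indian hedgehog", "IHH"],
}


def rank_candidates(mention, lexicon=LEXICON, top_n=3):
    """Return up to top_n (identifier, score) pairs, best match first.

    The score is the best case-insensitive string similarity between the
    mention and any synonym listed for the identifier.
    """
    mention_lc = mention.lower()
    scored = []
    for identifier, names in lexicon.items():
        best = max(
            SequenceMatcher(None, mention_lc, name.lower()).ratio()
            for name in names
        )
        scored.append((identifier, best))
    # Highest score first; ties are broken by identifier for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


if __name__ == "__main__":
    # The annotator would be shown this list and asked to pick one entry.
    for identifier, score in rank_candidates("Shh"):
        print(f"{identifier}\t{score:.2f}")
```

The annotator's final choice from such a list could then be stored as further training data, in line with the feedback loop described above.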
Answers: (4) How to Proceed?
• Stabilize guidelines and redo the inter-annotator
agreement experiment (and write up the results)
• Prepare a Gold Standard data set of expert
annotated data for training new annotators
• Collect a sufficient amount of training data for the
various tasks (queue management, document
mark-up, automated mapping)
• Develop end-to-end pipeline (in the style of the
FlySlip project) to capture whole documents in
machine-readable form for mark-up
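For the inter-annotator agreement experiment, one widely used statistic is Cohen's kappa, which corrects raw agreement between two annotators for the agreement expected by chance. The slides do not say which measure was used, so the following is a minimal sketch under that assumption; the function name and the label values are illustrative only.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    labels_a and labels_b are equal-length sequences of category labels,
    one label per item, in the same item order.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    annotator_1 = ["relevant", "relevant", "irrelevant", "irrelevant"]
    annotator_2 = ["relevant", "irrelevant", "irrelevant", "irrelevant"]
    print(cohens_kappa(annotator_1, annotator_2))
```

Kappa applies to categorical decisions such as "is this paper in scope?"; agreement on free-text fields or on passage boundaries would need a different comparison.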
Recommendations: Training Materials & Tools
• Case studies and gold-standard annotated
articles
• On-line training
– Perhaps with a way for new annotators to test
themselves against a set of gold standard annotations
– This will require automated comparison of
annotations for certain fields
• Links to the best available tools
• Tools:
– Copy mechanism for largely duplicated records
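The automated comparison of a new annotator's work against gold-standard annotations, mentioned under on-line training, could look roughly like the sketch below. It assumes each annotation is a flat record of field values keyed by a document identifier and that only exact-match fields (for example gene or organism identifiers) are compared; the record layout, field names, and the `score_against_gold` helper are hypothetical.

```python
def score_against_gold(trainee, gold, fields):
    """Compare trainee annotations to gold annotations field by field.

    trainee and gold map a document ID to a dict of field -> value.
    Returns a dict of field -> {"precision": float, "recall": float},
    counting a value as correct only on an exact match for the same document.
    """
    report = {}
    for field in fields:
        gold_items = {
            (doc_id, rec[field]) for doc_id, rec in gold.items() if field in rec
        }
        trainee_items = {
            (doc_id, rec[field]) for doc_id, rec in trainee.items() if field in rec
        }
        correct = len(gold_items & trainee_items)
        precision = correct / len(trainee_items) if trainee_items else 0.0
        recall = correct / len(gold_items) if gold_items else 0.0
        report[field] = {"precision": precision, "recall": recall}
    return report


if __name__ == "__main__":
    # Invented example records; the values are placeholders.
    gold = {
        "doc1": {"gene": "G1", "organism": "O1"},
        "doc2": {"gene": "G2", "organism": "O1"},
    }
    trainee = {
        "doc1": {"gene": "G1", "organism": "O2"},
        "doc2": {"gene": "G2"},
    }
    for field, scores in score_against_gold(trainee, gold, ["gene", "organism"]).items():
        print(field, scores)
```

Fields such as free-text comments or evidence passages do not suit exact matching and would still need manual review or a looser overlap measure.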