Text Mining (biocreative)


Text mining activities at PIR
Cecilia Arighi
March 12, 2013
Text mining projects in collaboration with:
1. Dr. Vijay Shanker, CIS Department, University of Delaware
2. BioCreative Consortium
iProLink: Text mining resources at PIR
http://proteininformationresource.org/pirwww/iprolink/
Resource to facilitate text mining for biocuration with focus on
annotation of post-translational modifications (PTMs)
RLIMS-P: Text mining tool for extraction of protein phosphorylation information
eFIP: Extracting Functional Impact of protein Phosphorylation
eGIFT: Extracting Gene Information From Text
RLIMS-P: extraction of protein phosphorylation information
• Rule-based information extraction system
• It extracts information about:
• phosphorylated protein(s)
• the kinase(s)
• phosphorylation site(s)
• The tool needs to capture the different ways that
protein phosphorylation is described in literature
PMID:2141171
Rule-based systems make use of:
-knowledge about how language is structured
-specific knowledge about how biologically relevant facts are stated in the biomedical literature.
RLIMS-P 2.0 uses over 100 regular expressions; some of these play a supporting role (e.g., for anaphora resolution).
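A minimal sketch of the rule-based idea described above, not RLIMS-P's actual rules (those are not shown in these slides). The three patterns, the `PROTEIN` and `SITE` sub-expressions, and the function name `extract_phosphorylation` are all assumptions made for illustration; a real system would need many more patterns plus entity recognition.

```python
import re

# Hypothetical, heavily simplified building blocks (assumptions, not RLIMS-P rules).
PROTEIN = r"[A-Z][A-Za-z0-9-]*"          # any capitalized token stands in for a protein name
SITE = r"(?:Ser|Thr|Tyr|S|T|Y)-?\d+"     # e.g. Ser136, T908

PATTERNS = [
    # "<kinase> phosphorylates <substrate> [at|on <site>]"
    re.compile(rf"(?P<kinase>{PROTEIN}) phosphorylates (?P<substrate>{PROTEIN})"
               rf"(?: (?:at|on) (?P<site>{SITE}))?"),
    # "<substrate> is phosphorylated by <kinase> [at|on <site>]"
    re.compile(rf"(?P<substrate>{PROTEIN}) is phosphorylated by (?P<kinase>{PROTEIN})"
               rf"(?: (?:at|on) (?P<site>{SITE}))?"),
    # "phosphorylation of <substrate> [at|on <site>] [by <kinase>]"
    re.compile(rf"[Pp]hosphorylation of (?P<substrate>{PROTEIN})"
               rf"(?: (?:at|on) (?P<site>{SITE}))?(?: by (?P<kinase>{PROTEIN}))?"),
]


def extract_phosphorylation(sentence):
    """Return one dict per pattern match; missing arguments are None."""
    results = []
    for pattern in PATTERNS:
        for match in pattern.finditer(sentence):
            results.append({key: match.group(key)
                            for key in ("kinase", "substrate", "site")})
    return results


# Made-up example sentence, used only to exercise the patterns.
print(extract_phosphorylation("BAD is phosphorylated by PKA on Ser112."))
```

Each pattern encodes one way of stating the same fact, which is why a rule set grows quickly as more phrasings are covered.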
RLIMS-P Interface: Search
New interface!
Keywords
Provides suggestions of protein
and gene names while typing
List of PMIDs
RLIMS-P Interface: Result Table
Query: BAD
Kinase, substrate and sites are color-coded
Statistics
Arrange data according to interest
Summary: list all kinases and phospho-proteins found per abstract
PMID: list all kinases and phospho-proteins and sites found per abstract
Kinase: list results based on individual kinases extracted by RLIMS-P
Substrate: list results based on individual substrates extracted by RLIMS-P
Text Evidence Page
eFIP: Functional Impact of Phosphorylation
Find relation between phosphorylation and protein interaction
Bad phosphorylation induced by survival factors leads to its preferential binding to
14-3-3 and suppression of the death-inducing function of Bad. (PMID 10579309)
Protein interaction in eFIP:
Protein-protein
Protein-protein complex
Protein-protein region
Protein-protein class
Examples of interaction-related terms detected by eFIP
Binding
Interact
Complex
Dissociates (used to capture a negative impact of phosphorylation)
The eFIP system for text mining of protein interaction networks of phosphorylated proteins
Tudor CO, Arighi CN, Wang Q, Wu CH, Shanker VK. (2012) Database (doi: 10.1093/database/bas044)
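As a rough illustration of how the interaction-related terms listed above could be used, the sketch below flags sentences that mention phosphorylation together with one of those terms. It is an assumption-laden simplification, not the eFIP pipeline: the stem matching, the sentence splitter, and the function name `find_impact_sentences` are hypothetical.

```python
import re

# Stems derived from the terms on the slide: Binding, Interact, Complex, Dissociates.
INTERACTION_STEMS = ("bind", "interact", "complex", "dissociat")
# Per the slide, "dissociates" captures a negative impact of phosphorylation.
NEGATIVE_STEMS = ("dissociat",)


def find_impact_sentences(text):
    """Return sentences that mention phosphorylation and an interaction term."""
    hits = []
    # Naive sentence split on whitespace that follows ., ! or ? (an assumption).
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        lowered = sentence.lower()
        if "phosphorylat" not in lowered:
            continue
        terms = [stem for stem in INTERACTION_STEMS if stem in lowered]
        if terms:
            hits.append({
                "sentence": sentence,
                "terms": terms,
                "negative": any(stem in NEGATIVE_STEMS for stem in terms),
            })
    return hits


example = ("Bad phosphorylation induced by survival factors leads to its "
           "preferential binding to 14-3-3 and suppression of the "
           "death-inducing function of Bad.")
for hit in find_impact_sentences(example):
    print(hit["terms"], hit["negative"])
```

Simple co-occurrence like this cannot tell which protein is phosphorylated or which partner is bound; that is what the information extraction step is for.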
eFIP Architecture
eFIP Website
To correct and save eFIP results
eFIP: To find relevant papers about
phosphorylated proteins and their functions
Search for BAD
If logged in
Discovery from Literature Mining
• Distinct phosphorylated forms of a protein may have different interacting proteins,
leading to different subcellular locations, functions and pathways
• Literature mining connects the impact to different BAD forms, and, through
kinases, links BAD to pathways
PMID:14967141
Figure 1: Overview of the Workflow
Panels: DATA MINING (protein-protein interaction databases: Protein A, Protein B); TEXT MINING (PubMed search results, RLIMS-P, functional annotation terms); set of phosphorylation-related articles for curation; RACEPRO and the Protein Ontology (PRO); VISUALIZATION (Cytoscape).
Example network nodes: hSTK38, hCENPA/Phos:1, hGSG2/Phos:1, AURKB, hINCENP/Phos:1, hSTK4, ATM, BubR1; edges are labeled +p or +i.
eGIFT
http://biotm.cis.udel.edu/eGIFT/
Uses natural language processing techniques to retrieve iTerms (informative terms)
relevant to a specific gene.
Gene-centric document retrieval and categorization
iTerms
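One simple way to think about "informative terms" is to compare how often a term occurs in documents about a gene with how often it occurs in a background set. The sketch below does only that; the scoring ratio, the smoothing, the threshold `min_ratio`, and the function names are assumptions for illustration and should not be read as eGIFT's actual method.

```python
import re
from collections import Counter


def document_frequency(docs):
    """Fraction of documents in which each lower-cased word occurs."""
    counts = Counter()
    for doc in docs:
        counts.update(set(re.findall(r"[a-z][a-z0-9-]+", doc.lower())))
    return {term: n / len(docs) for term, n in counts.items()}


def informative_terms(gene_docs, background_docs, min_ratio=2.0):
    """Terms over-represented in gene_docs relative to background_docs."""
    gene_df = document_frequency(gene_docs)
    back_df = document_frequency(background_docs)
    floor = 1.0 / (len(background_docs) + 1)  # smoothing for unseen terms
    scores = {term: freq / max(back_df.get(term, 0.0), floor)
              for term, freq in gene_df.items()}
    selected = [term for term, score in scores.items() if score >= min_ratio]
    return sorted(selected, key=lambda term: (-scores[term], term))


# Tiny made-up corpus, only to show the call.
gene_docs = ["BAD promotes apoptosis", "apoptosis is regulated by BAD"]
background_docs = ["the cell cycle is regulated", "a report on BAD", "proteins fold"]
print(informative_terms(gene_docs, background_docs))
```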
Applications
• Finding relevant articles to assist in biocuration:
– of protein phosphorylated forms and complexes in the Protein Ontology
– of phosphorylated proteins in external databases, such as phospho.ELM (PMID: 17962309)
– Pathway curation in Gallus Reactome (The Third Workshop on Integrative Data Analysis in Systems Biology (IDASB) 2012)
• Automatic information extraction from literature to improve knowledgebase content (iPTM and Gallus Reactome)
• Improvement of kinase site prediction algorithms (RLIMS-P)
• Finding sets of genes/proteins with common iTerms (eGIFT)
What’s in it for UniProt?
1-For curation: Assist in prioritization of entry annotation based on potentially relevant information on protein features (phosphorylation)
As of 03/11/2013 in Medline
# of RLIMS-P positive PMIDs = 135,739
# with site information = 41,947
# with kinase information = 38,924
2-For UniProt user: Processing the UniProtKB additional bibliography with RLIMS-P could provide the UniProt user with an extra layer of information that he/she could readily use.
Use eFIP/eGIFT model of displaying documents based on information content of the additional bibliography.
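To put the Medline counts under item 1 in proportion, the short calculation below expresses the site and kinase counts as a share of all RLIMS-P positive PMIDs. The numbers are taken from the slide; the variable and function names are only illustrative.

```python
# Counts quoted on the slide (Medline, as of 03/11/2013).
rlimsp_positive = 135_739
with_site = 41_947
with_kinase = 38_924


def percent(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


site_pct = percent(with_site, rlimsp_positive)
kinase_pct = percent(with_kinase, rlimsp_positive)
print(f"with site information:   {site_pct}%")
print(f"with kinase information: {kinase_pct}%")
```

Roughly three in ten positive abstracts carry site or kinase detail, which is the subset most directly useful for feature annotation.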
Example:
Additional Bibliography for raptor: 30 PMIDs
New Information from Additional Bibliography and RLIMS-P
T908 not annotated
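The raptor example amounts to a set comparison: sites extracted from the additional bibliography minus sites already annotated in the entry. A minimal sketch follows, assuming sites are written as a residue letter followed by a position; only T908 comes from the slide, the other site labels are made-up placeholders.

```python
def new_sites(extracted_sites, annotated_sites):
    """Sites found by text mining that are absent from the existing annotation."""
    missing = set(extracted_sites) - set(annotated_sites)
    # Sort by position, then residue letter, for stable output.
    return sorted(missing, key=lambda site: (int(site[1:]), site[0]))


# Hypothetical inputs: S100 and S200 are placeholders, not real raptor annotations.
annotated = {"S100", "S200"}
extracted = {"S100", "T908"}
print(new_sites(extracted, annotated))
```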
BioCreative Activities
Interactive Text Mining
BioCreative: Critical Assessment of
Information Extraction in Biology
International community-wide effort to evaluate text mining and
information extraction systems applied to the biological domain
BioNLP
Text REtrieval Conference (TREC)
strong linguistic focus with topics of interest to the NLP community
BioCreative workshops are very much driven by the needs of
users with focus on:
-Biocuration tasks
-Biocuration workflows
-Interoperability
Background
• BioCreative I: 2004, Granada, Spain
  BMC Bioinformatics 2005, 6 (Suppl 1)
• BioCreative II: 2007, Madrid, Spain
  Genome Biology 2008, 9 (Suppl 2)
• BioCreative II.5: 2009, Madrid, Spain
  IEEE/ACM Transactions on Computational Biology and Bioinformatics 2010
• BioCreative III: 2010, Bethesda, USA
  BMC Bioinformatics 2011, 12 (Suppl 8)
• Biocuration and Text Mining: 2012, Georgetown U, USA
  Database Virtual Issue 2012
• BioCreative IV: 2013
BioCreative Traditional Tracks
Ranking of relevant documents (document triage)
Extraction of gene and protein names (gene mention)
Linkage of names to database identifiers (gene normalization)
Extraction of functional annotation in standard ontologies (GO)
Extraction of entity relations (e.g. protein–protein interaction)
TM system
Biocurators annotate corpus
Testing set
Compare annotation
BioCreative Interactive task
Active involvement of the end users to guide development
and evaluation of useful tools and standards.
TM system
Manual annotation
System-assisted annotation
Compare annotation and time spent in curation
User Advisory Group (UAG)
A diverse sample of end users with multiple text mining needs
Roles:
-Develop the end user requirements for interactive text mining task
-Provide logistics on system evaluation
-Assist in annotating corpora and testing the systems
BioCreative III
UAG Member                 Affiliation
Eva Huala                  TAIR
Lois Maltais               MGI (not current)
Paul Sternberg             WormBase
Pascale Gaudet             dictyBase (not current)
Ian Harrow                 Pfizer (not current)
Michele Gwinn Giglio       University of Maryland
Phoebe Roberts             Pfizer
Andrew Chatr-Aryamontri    BioGrid
Luca Toldo                 Merck (not current)
Gianni Cesareni            MINT

Workshop 2012 and BioCreative IV
UAG Member                 Affiliation
Donghui Li                 TAIR
Judy Blake                 MGI
Kimberly Van Auken         WormBase
Fiona McCarthy             AgBase
Mary Schaeffer             MaizeDB
Stan Laulederkind          RGD
Peter McQuilton            FlyBase
Phoebe Roberts             Pfizer
Andrew Chatr-Aryamontri    BioGrid
Sandra Orchard             IntAct
Sherri Matis               AstraZeneca
BioCreative Interactive Task
1-Recruitment of Teams
Call for participation via NLP-related mailing lists and
Interested teams should provide a document addressing:
• Relevance and Impact
• Adaptability
• Interactivity
• Performance
2-Recruitment of Curators
Call for participation via International Society for Biocuration
(ISB) mailing list, and the ISB meeting and BioCreative websites
BioCreative Interactive Task Workflow
1-Preparation phase
Key: Coordinators, Teams, Curators, Coord/teams
• Submission of text mining system description
• Submission of internal benchmarking result, test set and URL
• Did team provide benchmarking results?
– No: System cannot participate in pre-workshop evaluation, but team is invited to participate in demo and poster session during workshop.
– Yes: Participation in pre-workshop evaluation
• Post list of systems and recruitment of biocurators
• Team/biocurator pairing
• System tuned to biocuration group (optional)
2-Training phase
Key: Coordinators, Teams, Curators, Coord/teams
• Team provides training via demo, examples, help document, annotation guidelines, and output format
• Practice with examples, report bugs
• Is biocurator familiar with system and annotation? (Yes/No)
3-Evaluation phase
• Dataset selected by domain expert (or coordinator)
• Gold Standard: dataset manually annotated by independent expert
• 1/2 manual annotation, 1/2 system-assisted annotation
• Fill user survey
• Collect output and calculate metrics
• Report at Workshop
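The half-and-half design of the evaluation phase can be sketched as a reproducible split of the selected dataset. The slides do not say how documents are assigned to each half; random assignment with a fixed seed is an assumption here, and `split_for_evaluation` is a hypothetical helper.

```python
import random


def split_for_evaluation(document_ids, seed=0):
    """Split documents into a manual half and a system-assisted half."""
    ids = sorted(document_ids)           # fixed starting order for reproducibility
    random.Random(seed).shuffle(ids)     # assumption: random assignment
    half = len(ids) // 2
    return {"manual": ids[:half], "system_assisted": ids[half:]}


# Placeholder document identifiers, not real PMIDs.
split = split_for_evaluation(range(1, 11))
print(len(split["manual"]), len(split["system_assisted"]))
```

Using the same curator on both halves lets annotation quality and curation time be compared with and without the system.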
BioCreative Interactive Tasks
BioCreative III:
- Identify genes that are “primary/central” (biologically relevant) in
the context of the article (full-length), and normalization
-Retrieve articles for which a given gene is “primary/central”
6 Teams participated, 12 biocurators tested systems
BioCreative 2012:
-Open to any literature-based biocuration task
7 teams participated, more than 40 biocurators tested systems
BioCreative IV, October 2013:
-Open to any literature-based biocuration task
21 teams registered!! Will recruit biocurators at biocuration meeting
Teams Registered in BioCreative 2012
7 teams covering very diverse tasks
System: TextPresso
Articles: Full-Text
Tasks: Curation of subcellular localization using Gene Ontology cellular component

System: PCS (Charaparser)
Articles: NA
Tasks: Curation of Entity-Quality terms from phylogenetic literature using ontologies

System: PubTator
Articles: Abstract
Tasks: Document triage (relevant documents for curation) and bioconcept annotation (gene, disease, chemicals)

System: PPIFinder
Articles: Abstract
Tasks: Mining of protein-protein interaction for human proteins (abstract and full length articles): document classification and extraction of interacting proteins and keywords

System: eFIP
Articles: Abstract
Tasks: Mining Protein Interactions of Phosphorylated Proteins from the Literature. Document classification and information extraction of phosphorylated protein, protein binding partners and impact keyword

System: T-HOD
Articles: Abstract
Tasks: Document triage for disease-related genes (relevant documents for curation) and bioconcept annotation (gene, disease and relation)

System: Tagtog
Articles: Abstract
Tasks: Protein/gene mentions recognition via interactive learning and annotation framework
What do we measure?
User Survey
Precision at document and/or sentence level
Recall at document and/or sentence level
Time: manual vs. system-assisted
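The measures listed above can be computed with the standard definitions of precision and recall against a gold standard, plus a simple time ratio. This is a minimal sketch; the function names and the example items are hypothetical, and the same code works whether the items are document identifiers or sentence identifiers.

```python
def precision_recall(system_items, gold_items):
    """Precision and recall of system output against gold-standard items."""
    system, gold = set(system_items), set(gold_items)
    true_positives = len(system & gold)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def time_ratio(manual_minutes, assisted_minutes):
    """Values below 1.0 mean system-assisted curation was faster."""
    return assisted_minutes / manual_minutes


# Made-up example: four documents returned, three in the gold standard.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
print(p, r, time_ratio(60, 45))
```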
Survey results:
Correlation of response to questions with overall
system satisfaction to learn what aspects are
important to users
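A sketch of the survey analysis described above: correlate each question's responses with the overall satisfaction score and rank questions by the strength of the correlation. The slides do not state which correlation measure was used, so Pearson's r is an assumption, and the question names and scores below are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_questions(responses, overall):
    """Sort questions by absolute correlation with overall satisfaction."""
    scores = {question: pearson(values, overall)
              for question, values in responses.items()}
    return sorted(scores.items(), key=lambda item: -abs(item[1]))


# Hypothetical survey data: one score per curator on a 1-5 scale.
overall = [1, 2, 3, 4, 5]
responses = {
    "ease_of_use": [1, 2, 3, 4, 5],
    "speed": [3, 3, 4, 3, 3],
}
ranked = rank_questions(responses, overall)
print(ranked[0][0])
```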
What’s in it for UniProt?
• As users we can guide the development of tools that
are useful for biocuration
• We have access to state of the art text mining tools
• Participate to ensure the use of standards and quality
of annotations provided by the tools
• Publications