Transcript Slide 1
iProLINK: An integrated protein resource for
literature mining and literature-based curation
1. Bibliography mapping
- UniProt mapped citations
2. Annotation extraction
- annotation tagged literature
3. Protein named entity recognition
- dictionary, name tagged literature
4. Protein ontology development
- PIRSF-based ontology
1
Objective: Accurate, Consistent, and Rich
Annotation of Protein Sequence and Function
Literature-Based Curation – Extract Reliable
Information from Literature
Function, domains/sites, developmental stages, catalytic
activity, binding and modified residues, regulation, pathways,
tissue specificity, subcellular location, …
Ensure high-quality, accurate, and up-to-date experimental
data for each protein.
A major bottleneck!
Ontologies/Controlled Vocabularies – For Information
Integration and Knowledge Management
UniProtKB entries will be annotated using widely accepted
biological ontologies and other controlled vocabularies, e.g.
Gene Ontology (GO) and EC nomenclature.
2
Access to iProLINK homepage
3
iProLINK
http://pir.georgetown.edu/iprolink/
Testing and
Benchmarking Dataset
• RLIMS-P text mining tool
• Protein dictionaries
• Name tagging guideline
• Protein ontology
4
Protein Phosphorylation Annotation Extraction
Manual tagging assisted with computational extraction
Training sets of positive and negative samples
Evidence attribution
RLIMS-P
Phosphorylation: Enzyme (e.g., MAP kinase) + P-group + Substrate (e.g., cPLA2) at P-site (e.g., Ser505) → phosphorylated-cPLA2 (Ser-P)
3 objects:
<AGENT> Enzyme (kinase catalyzing the phosphorylation)
<THEME> Substrate (protein being phosphorylated)
<SITE> P-Site (amino acid residue being phosphorylated)
5
RLIMS-P
Rule-based LIterature Mining System for Protein Phosphorylation
Input: Abstracts, Full-Length Texts
Preprocessing
- Sentence extraction
- Acronym detection
- Part of speech tagging
Entity Recognition
- Term recognition
- Semantic Type Classification
Phrase Detection
- Noun and verb group detection
- Other syntactic structure detection
Relation Identification
- Nominal level relation
- Verbal level relation
PostProcessing
Output: Extracted Annotations, Tagged Abstracts
Pattern 1: <AGENT> <VG-active-phosphorylate> <THEME> (in/at <SITE>)?
ATR/FRP-1 also phosphorylated p53 in Ser 15
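A minimal sketch of how Pattern 1 could be approximated with a regular expression. This is an illustration only, not the RLIMS-P implementation: the slides describe a pipeline with noun and verb group detection, whereas this sketch assumes AGENT and THEME are single tokens. The names TOKEN, PATTERN_1, and extract_phosphorylation are hypothetical.

```python
import re

# Assumption: a protein name is one token of word characters, "/" or "-".
TOKEN = r"[\w/-]+"

# Rough stand-in for:
#   <AGENT> <VG-active-phosphorylate> <THEME> (in/at <SITE>)?
PATTERN_1 = re.compile(
    rf"(?P<agent>{TOKEN})\s+"
    r"(?:also\s+)?"                      # optional adverb seen in the example
    r"phosphorylate[sd]?\b\s+"           # active verb forms only
    rf"(?P<theme>{TOKEN})"
    r"(?:\s+(?:in|at)\s+(?P<site>(?:Ser|Thr|Tyr)[\s-]?\d+))?"
)


def extract_phosphorylation(sentence):
    """Return the AGENT/THEME/SITE objects for the first match, or None."""
    match = PATTERN_1.search(sentence)
    if match is None:
        return None
    return {
        "AGENT": match.group("agent"),
        "THEME": match.group("theme"),
        "SITE": match.group("site"),  # None when no site is stated
    }


result = extract_phosphorylation("ATR/FRP-1 also phosphorylated p53 in Ser 15")
print(result)
```

On the example sentence this yields ATR/FRP-1 as AGENT, p53 as THEME, and Ser 15 as SITE. Because the sketch has no noun group detection, a multi-word enzyme name such as "MAP kinase" is reduced to its last token, which is one reason a real system needs the phrase detection stage.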
download
6
http://pir.georgetown.edu/iprolink/
Benchmarking of RLIMS-P
Bioinformatics. 2005 Jun 1;21(11):2759-65
High recall for paper retrieval and high
precision for information extraction
UniProtKB site feature annotation
Proteomics Mass Spec. data
analysis: protein identification
7
Online RLIMS-P
(version 1.0)
http://pir.georgetown.edu/iprolink/rlimsp/
• Search interface
• Summary table with top hit of all sites
• All sites and tagged text evidence
8
BioThesaurus http://pir.georgetown.edu/iprolink/biothesaurus/
Source databases:
- NCBI: Entrez Gene, RefSeq, GenPept
- Genome: FlyBase, WormBase, MGD, SGD, RGD
- UniProt: UniProtKB, UniRef90/50, PIR-PSD
- Other: HUGO, EC, OMIM
Workflow: iProClass → Name Extraction → Raw Thesaurus → Name Filtering (Highly Ambiguous, Nonsensical Terms) → Semantic Typing (UMLS) → BioThesaurus
BioThesaurus v1.0
UniProtKB Entries: Protein/Gene Names & Synonyms
Statistics (May 2005; m = million):
- UniProtKB entries: 1.86m
- Source DB records: 6.6m
- Gene/protein names/terms: 3.6m
Applications:
• Biological entity tagging
• Name mapping
• Database annotation
• Literature mining
• Gateway to other resources
9
BioThesaurus Report
Synonyms for Metalloproteinase inhibitor 3
Gene/Protein Name Mapping
1. Search Synonyms
2. Resolve Name Ambiguity
3. Underlying ID Mapping
Name ambiguity
ID Mapping
TMP3
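The three name-mapping steps above (search synonyms, resolve name ambiguity, underlying ID mapping) can be sketched as a pair of lookup tables. This is a toy illustration, not BioThesaurus itself: the entry IDs (ENTRY_A, ENTRY_B, ENTRY_C), the name NAME_X, and all function names are hypothetical placeholders, and the normalization rule is an assumption.

```python
from collections import defaultdict

# Toy (entry_id, name) rows. IDs are placeholders, not real accessions.
RECORDS = [
    ("ENTRY_A", "Metalloproteinase inhibitor 3"),
    ("ENTRY_A", "TIMP-3"),
    ("ENTRY_B", "NAME_X"),
    ("ENTRY_C", "NAME_X"),
]


def normalize(name):
    """Assumed normalization: lowercase and drop non-alphanumerics."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def build_indexes(records):
    """Build name -> entries and entry -> names tables."""
    name_to_entries = defaultdict(set)
    entry_to_names = defaultdict(set)
    for entry_id, name in records:
        name_to_entries[normalize(name)].add(entry_id)
        entry_to_names[entry_id].add(name)
    return name_to_entries, entry_to_names


def lookup(name, name_to_entries, entry_to_names):
    """Map a name to entries, flag ambiguity, and list each entry's synonyms."""
    entries = sorted(name_to_entries.get(normalize(name), ()))
    return {
        "entries": entries,
        "ambiguous": len(entries) > 1,
        "synonyms": {e: sorted(entry_to_names[e]) for e in entries},
    }


name_index, entry_index = build_indexes(RECORDS)
print(lookup("timp3", name_index, entry_index))
print(lookup("NAME_X", name_index, entry_index))
```

A name that resolves to more than one entry is reported as ambiguous rather than silently mapped to the first hit, which mirrors the "Resolve Name Ambiguity" step in the report.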
10
Protein Name Tagging
Tagging guideline versions 1.0 and 2.0
Dictionary pre-tagging
Generation of domain expert-tagged corpora
Inter-coder reliability – upper bound of machine tagging
F-measure: 0.412 (0.372 Precision, 0.462 Recall)
Advantages: helps standardize tagging and its extent, reduces
coder fatigue, and improves inter-coder reliability.
BioThesaurus for pre-tagging
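As a quick check, the F-measure reported above is consistent with the usual harmonic mean of precision and recall, F = 2PR / (P + R). The sketch below assumes the balanced (beta = 1) form; the function name f_measure is illustrative.

```python
def f_measure(precision, recall, beta=1.0):
    """F-measure; beta = 1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Values from the slide: 0.372 precision, 0.462 recall.
score = f_measure(0.372, 0.462)
print(round(score, 3))  # 0.412
```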
11
PIRSF-Based Protein Ontology
PIRSF family hierarchy based on evolutionary relationships
Standardized PIRSF family names as hierarchical protein ontology
DAG Network structure for PIRSF family classification system
PIRSF in DAG View
12
PIRSF to GO Mapping
Mapped 5363 curated PIRSF homeomorphic families and
subfamilies to the GO hierarchy
68% of the PIRSF families and subfamilies map to GO leaf nodes
2329 PIRSFs have shared GO leaf nodes
Complements GO: PIRSF-based ontology can be used to analyze
GO branches and concepts and to provide links between the GO
sub-ontologies
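A minimal sketch of how the leaf-node statistics above could be computed once a PIRSF-to-GO mapping and the GO parent relations are available. The data here is a toy DAG with hypothetical IDs (GO:A to GO:D, PIRSF_X to PIRSF_Z); it does not reproduce the reported figures and is not the method used for the actual mapping.

```python
from collections import defaultdict

# Toy GO DAG as child -> parents. IDs are placeholders.
GO_PARENTS = {
    "GO:B": {"GO:A"},
    "GO:C": {"GO:A"},
    "GO:D": {"GO:B", "GO:C"},  # a DAG node may have several parents
}

# Toy mapping of PIRSF families to their GO term.
PIRSF_TO_GO = {"PIRSF_X": "GO:D", "PIRSF_Y": "GO:D", "PIRSF_Z": "GO:B"}


def leaf_terms(parents):
    """Terms that never appear as a parent of another term."""
    has_children = {p for ps in parents.values() for p in ps}
    all_terms = set(parents) | has_children
    return all_terms - has_children


def leaf_mapping_stats(pirsf_to_go, parents):
    """Return (fraction of families on a leaf, families sharing a leaf)."""
    leaves = leaf_terms(parents)
    families_by_leaf = defaultdict(list)
    for family, term in pirsf_to_go.items():
        if term in leaves:
            families_by_leaf[term].append(family)
    on_leaf = sum(len(fams) for fams in families_by_leaf.values())
    shared = sorted(
        fam for fams in families_by_leaf.values() if len(fams) > 1 for fam in fams
    )
    return on_leaf / len(pirsf_to_go), shared


fraction, shared = leaf_mapping_stats(PIRSF_TO_GO, GO_PARENTS)
print(f"{fraction:.0%} on leaf nodes; sharing a leaf: {shared}")
```

Families that share a GO leaf node are the cases where the GO concept is too coarse to tell them apart, which is where a family-level ontology can add resolution.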
DynGO viewer
Hongfang Liu
University of Maryland
Superimpose GO and PIRSF hierarchies
Bidirectional display (GO- or PIRSF-centric views)
13
Protein Ontology Can Complement GO
GO-centric view
Expanding a Node: Identification of GO subtrees that can be
expanded when GO concepts are too broad
IGFBP subfamilies and high- vs. low-affinity binding for IGF
between IGFBP and IGFBPrP
14
Exploration of Gene and Protein Ontology
Molecular function
PIRSF-centric view
Biological process
Estrogen receptor alpha (PIRSF50001)
Systematic links between three GO sub-ontologies, e.g., linking
molecular function and biological process:
Estrogen receptor binding
Estrogen receptor signaling pathway
15
Summary
PIR iProLINK literature mining resource provides
annotated data sets for NLP research on annotation
extraction and protein ontology development
RLIMS-P text-mining tool extracts protein
phosphorylation annotation from PubMed literature.
BioThesaurus can be used for name mapping to
resolve name synonymy and ambiguity issues.
PIRSF-based protein ontology can complement
other biological ontologies such as GO.
16
Acknowledgements
Research Projects
NIH: NHGRI/NIGMS/NLM/NIMH/NCRR/NIDCR (UniProt)
NSF: SEIII (Entity Tagging)
NSF: ITR (Ontology)
Collaborators
I. Mani from Georgetown University Department of Linguistics on
protein name recognition and protein name ontology.
H. Liu from University of Maryland Department of Information
Systems on protein name recognition and text mining.
K. Vijay-Shanker from University of Delaware Department of
Computer and Information Science on text mining of protein
phosphorylation features.
17