Transcript Slide 1
Literature Data Mining and Protein Ontology Development at the Protein Information Resource (PIR)
Hu ZZ1, Mani I2, Liu H3, Vijay-Shanker K4, Hermoso V1, Nikolskaya A1, Natale DA1, and Wu CH1
1Protein Information Resource, Georgetown University Medical Center, 3900 Reservoir Road, NW, Washington, DC 20057; 2Georgetown University, 37th and O Streets, NW, Washington, DC 20057;
3University of Maryland at Baltimore County, Baltimore, MD 21250; 4Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716
ABSTRACT
An integrated protein literature mining resource, iProLINK, has been developed at PIR to provide data sources for Natural
Language Processing (NLP) research on bibliography mapping, annotation extraction, protein named-entity
recognition, and protein ontology development. A rule-based text-mining system, RLIMS-P, is used to extract
protein phosphorylation information from MEDLINE abstracts to assist database annotation; an online
BioThesaurus has been developed for protein/gene name mapping and to assist with protein named-entity recognition;
and a protein ontology based on the PIRSF family classification has been developed to complement other ontologies.
INTRODUCTION
iProLINK:
An integrated protein resource for literature mining
RLIMS-P
• Manual tagging assisted with computational extraction
• Training sets of positive and negative samples
Rule-based LIterature Mining System for Protein Phosphorylation
Preprocessing
1. Bibliography mapping
- UniProt mapped citations
2. Annotation extraction
- annotation tagged literature
3. Protein entity recognition
- dictionary, tagged literature
4. Protein ontology development
- PIRSF-based ontology
PIR – Integrated Protein Informatics Resource
As the volume of scientific literature rapidly grows, literature data mining
becomes increasingly critical to facilitate genome/proteome annotation
and to improve the quality of biological databases. Annotations derived
from experimentally verified data from literature are of special value to the
UniProtKB (UniProt Knowledgebase). One objective of UniProtKB is to
have accurate, consistent, and rich annotation of protein sequence and
function. Relevant to this goal are literature-based curation and the
development and adoption of ontologies and controlled vocabularies.
• Literature-Based Curation – Extract Reliable Information from Literature
• Protein properties: protein function, domains and sites,
developmental stages, catalytic activity, binding and modified
residues, regulation, induction, pathways, tissue specificity,
subcellular location, quaternary structure…
• This will ensure high-quality, accurate, and up-to-date experimental
data for each protein, but it is a major bottleneck!
• Ontologies/Controlled Vocabularies – For Information Integration and
Knowledge Management
• UniProtKB entries will be annotated using widely accepted biological
ontologies and other controlled vocabularies, e.g. Gene Ontology
(GO) and EC nomenclature.
The Protein Information Resource has been collaborating with several
NLP research groups to develop text-mining methodologies for extracting
information from biological literature and to develop a protein ontology.
Protein Phosphorylation Annotation Extraction
for Genomic/Proteomic Research
Entity Recognition
Sentence
extraction
Abstracts
Full-Length Texts
Acronym
detection
Part of speech
tagging
RLIMS-P
Extracted Annotations
Tagged Abstracts
Enzyme
P-group
(e.g., MAP kinase)
PostProcessing
Substrate
(e.g., cPLA2)
(e.g., Ser505)
Phosphorylation
Phrase Detection
Relation
Identification
Nominal level
relation
P-site
Term recognition
phosphorylated-cPLA2 Ser-P
Semantic
Type
Classification
Verbal level
relation
<AGENT> Enzyme (kinase catalyzing the phosphorylation)
Noun and verb
group detection
Other syntactic
structure
detection
<THEME> Substrate (protein being phosphorylated)
<SITE> P-Site (amino acid residue being phosphorylated)
Pattern 1: <AGENT> <VG-active-phosphorylate> <THEME> (in/at <SITE>)?
ATR/FRP-1 also phosphorylated p53 in Ser 15
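To make Pattern 1 concrete, the sketch below approximates it with a single regular expression and applies it to the example sentence above. This is only an illustration: RLIMS-P relies on the preprocessing, noun/verb group detection, and semantic typing steps, which are replaced here by naive single-token matching. The names `PATTERN_1` and `extract_phosphorylation` are hypothetical and are not part of RLIMS-P.

```python
import re

# Simplified stand-in for Pattern 1:
#   <AGENT> <VG-active-phosphorylate> <THEME> (in/at <SITE>)?
# Assumption: AGENT and THEME are single whitespace-delimited tokens.
PATTERN_1 = re.compile(
    r"(?P<agent>\S+)"                    # AGENT: the kinase
    r"(?:\s+also)?"                      # optional adverb inside the verb group
    r"\s+phosphorylate[sd]?"             # active form of "phosphorylate"
    r"\s+(?P<theme>\S+)"                 # THEME: the substrate
    r"(?:\s+(?:in|at)\s+(?P<site>(?:Ser|Thr|Tyr)[\s-]?\d+))?"  # optional SITE
)


def extract_phosphorylation(sentence):
    """Return the AGENT/THEME/SITE slots for the first match, or None."""
    match = PATTERN_1.search(sentence)
    if match is None:
        return None
    return {
        "AGENT": match.group("agent"),
        "THEME": match.group("theme"),
        "SITE": match.group("site"),  # None when no site is stated
    }


print(extract_phosphorylation("ATR/FRP-1 also phosphorylated p53 in Ser 15"))
```

A multi-word agent such as "MAP kinase" would not be captured correctly by this single-token simplification, which is one reason a real system needs noun group detection before pattern matching.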
http://pir.georgetown.edu/iprolink/
Benchmarking of RLIMS-P
Testing and
Benchmarking Dataset
(http://pir.georgetown.edu)
Online RLIMS-P text-mining tool (version 1.0)
UniProt – Central international database
of protein sequence and function
High recall for paper retrieval and high precision for
information extraction
• UniProtKB site feature annotation
• Proteomics MS data analysis: protein identification
• RLIMS-P text mining tool
• Protein dictionaries
http://pir.georgetown.edu/iprolink/rlimsp/
1. Search interface
2. Summary table with
top hit of all sites
Bioinformatics. 2005 Jun 1;21(11):2759-65
• Name tagging guideline
3. All sites and tagged
text evidence
• Protein ontology
(http://www.uniprot.org)
Web-based BioThesaurus
BioThesaurus
Name Filtering
NCBI
Genome
Entrez Gene
RefSeq
GenPept
UniProt
FlyBase
WormBase
MGD
SGD
RGD
UniProtKB
UniRef90/50
PIR-PSD
Name
Extraction
iProClass
Highly
Ambiguous
Nonsensical
Terms
Raw
Thesaurus
Semantic Typing
Other
HUGO
EC
OMIM
BioThesaurus v1.0
BioThesaurus
UniProtKB
Entries:
Protein/Gene
Names &
Synonyms
Gene/Protein Name Mapping
1. Search Synonyms
2. Resolve Name Ambiguity
3. Underlying ID Mapping
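The three mapping functions listed above can be pictured as lookups over a pair of indexes. The sketch below is a toy model, not the BioThesaurus implementation: the entry IDs (`ENTRY_1`, ...) and the "synonym" strings are placeholders invented for illustration, and `build_indexes` and `lookup` are hypothetical names. TIMP3 is used as the ambiguous name only because the poster cites it as an ambiguity example.

```python
from collections import defaultdict

# Toy records: (entry ID, names). IDs and synonym strings are placeholders,
# not real BioThesaurus or UniProtKB data.
RECORDS = [
    ("ENTRY_1", ["TIMP3", "synonym A"]),
    ("ENTRY_2", ["TIMP3", "synonym B"]),
    ("ENTRY_3", ["synonym C"]),
]


def build_indexes(records):
    """Build name -> entries and entry -> names lookup tables."""
    name_to_entries = defaultdict(set)
    entry_to_names = defaultdict(set)
    for entry_id, names in records:
        for name in names:
            # Case-insensitive matching is a simplifying assumption.
            name_to_entries[name.lower()].add(entry_id)
            entry_to_names[entry_id].add(name)
    return name_to_entries, entry_to_names


def lookup(name, name_to_entries, entry_to_names):
    """Return matching entries, their other names, and an ambiguity flag."""
    key = name.lower()
    entries = sorted(name_to_entries.get(key, set()))
    synonyms = {
        entry: sorted(n for n in entry_to_names[entry] if n.lower() != key)
        for entry in entries
    }
    return {
        "entries": entries,               # underlying ID mapping
        "synonyms": synonyms,             # synonym search
        "ambiguous": len(entries) > 1,    # one name, several entries
    }


name_index, entry_index = build_indexes(RECORDS)
print(lookup("TIMP3", name_index, entry_index))
```

A name is flagged as ambiguous when it resolves to more than one entry; a real resource would additionally need the name filtering and semantic typing steps shown in the BioThesaurus pipeline.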
PIRSF-Based Protein Ontology
PIRSF to GO Mapping
• PIRSF family hierarchy based on evolutionary relationships
• Standardized PIRSF family names as hierarchical protein ontology
• DAG Network structure for PIRSF family classification system
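A DAG differs from a tree in that a node may have more than one parent. The sketch below stores a few family identifiers that appear on this poster as a child-to-parents map and walks upward to collect ancestors. The specific edges are an assumption read from the figure labels (for example, that the bifunctional CM/PDH family PIRSF001499 sits under both the CM and PDH domain superfamilies); they were not taken from the PIRSF DAG files. The names `PARENTS` and `ancestors` are hypothetical.

```python
# Child -> parents map. Edges are assumed from the poster's figure labels.
PARENTS = {
    "PIRSF001969": ["PF00219"],             # IGFBP family under the IGFBP domain
    "PIRSF500001": ["PIRSF001969"],         # IGFBP-1 subfamily
    "PIRSF500006": ["PIRSF001969"],         # IGFBP-6 subfamily
    "PIRSF018239": ["PF00219"],             # IGFBP-related protein, MAC25 type
    "PIRSF001500": ["PF01817"],             # bifunctional CM/PDT
    "PIRSF001499": ["PF01817", "PF02153"],  # bifunctional CM/PDH: two parents
}


def ancestors(node, parents=PARENTS):
    """Collect every node reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


print(sorted(ancestors("PIRSF500001")))
print(sorted(ancestors("PIRSF001499")))
```

Storing parents as a list rather than a single value is what allows a multi-domain family to be placed under each of its domain superfamilies.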
PIRSF in DAG View
• Complements GO: PIRSF-based ontology can be
used to analyze GO branches and concepts and to
provide links between the GO sub-ontologies
• Mapped 5363 curated PIRSF homeomorphic
families and subfamilies to the GO hierarchy
– 68% of the PIRSF families and subfamilies map
to GO leaf nodes
– 2329 PIRSFs have shared GO leaf nodes
# UniProtKB entries: 1.86m
# Source DB records: 6.6m
# Gene/protein names/terms: 3.6m
(May 2005)
Applications:
• Biological entity tagging
• Name mapping
• Database annotation
• Literature mining
• Gateway to other resources
Expanding a Node:
Identification of GO
subtrees that need
expansion if GO
concepts are too broad
– IGFBP subfamilies
– High- vs. low-affinity
binding for IGF
between IGFBP
and IGFBPrP
DynGO viewer
UMLS
m = million
Protein Ontology Can Complement GO
http://pir.georgetown.edu/iprolink/biothesaurus/
Liu et al, 2005, submitted
BioThesaurus report
UniProtKB entry P35625
GO-centric view
Example 1. Name ambiguity of TIMP3
DAG file: ftp://ftp.pir.georgetown.edu/pir_databases/pirsf/dagfiles/
PIRSF Protein Family Classification
PIRSF: A network structure from superfamilies to subfamilies to reflect
evolutionary relationships of full-length proteins
Definitions
Basic unit = Homeomorphic Family
Homeomorphic: Full-length similarity, common domain architecture
Network Structure: Flexible number of levels with varying degrees
of sequence conservation
Domain Superfamily
• One common Pfam
domain
Protein Name Tagging
• One or more common domains
PIRSF Homeomorphic Family
• Exactly one level
• Full-length sequence similarity and
common domain architecture
• 0 or more levels
• Functional specialization
PF02735: Ku70/Ku80 beta-
Example 2. Name ambiguity of CLIM1
PIRSF800001: Ku70/80 autoantigen
PIRSF016570: Ku80 autoantigen
PIRSF006493: Ku, prokaryotic type
PIRSF500001: IGFBP-1
PF00219: Insulin-like growth
factor binding protein
(IGFBP)
PIRSF001969: IGFBP
…
PIRSF500006: IGFBP-6
PIRSF018239: IGFBP-related protein, MAC25 type
PIRSF017318: CM of AroQ class, eukaryotic type
PIRSF001501: CM of AroQ class, prokaryotic type
Exploration of Gene and Protein Ontology
Two cases: analyze GO branches and concepts
and identify missing GO nodes
Case I. Nuclear receptor superfamily
Molecular function
Case II. IGF-binding protein superfamily
Biological process
Estrogen receptor alpha
(PIRSF50001)
Systematic links
between three GO
sub-ontologies
based on the shared
annotations at
different protein
family levels, e.g.,
linking molecular
function and
biological process:
– estrogen receptor
binding and
– estrogen receptor
signaling pathway
PF01817: Chorismate
mutase (CM)
PIRSF026640: Periplasmic CM
PIRSF001500: Bifunctional CM/PDT (P-protein)
PIRSF-centric view
PIRSF001499: Bifunctional CM/PDH (T-protein)
PIRSF001499: Bifunctional CM/PDH (T-protein)
PF02153: Prephenate
dehydrogenase (PDH)
• PIR iProLINK literature mining resource
provides annotated data sets for NLP
research on annotation extraction and
protein ontology development
• RLIMS-P is a text-mining tool that extracts
protein phosphorylation information from
PubMed literature. By coupling high recall for
paper retrieval with high precision for
information extraction, RLIMS-P can be
applied to UniProtKB protein feature
annotation.
• BioThesaurus can be used to resolve name
synonymy and ambiguity and for name mapping.
• PIRSF-based protein ontology can
complement GO by identifying missing GO
concepts/nodes and providing systematic
links between the three GO sub-ontologies.
Superimpose GO and PIRSF hierarchies (Liu et al, 2005, submitted)
Bidirectional display (GO- or PIRSF-centric views)
PIRSF Homeomorphic
Subfamily
PIRSF003033: Ku70 autoantigen
barrel domain
• Tagging guideline versions 1.0 and 2.0
– Generation of domain expert-tagged corpora
– Inter-coder reliability – upper bound of machine tagging
• Dictionary pre-tagging
– F-measure: 0.412 (0.372 Precision, 0.462 Recall)
– Advantages: helps with standardization and extent of tagging,
reduces the fatigue problem, and improves inter-coder reliability.
• BioThesaurus for pre-tagging
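As a check on the dictionary pre-tagging figures above, the short sketch below computes the balanced F-measure (the harmonic mean of precision and recall) from the reported precision and recall. It assumes the poster's F-measure is the balanced form, which the reported numbers are consistent with; the function name `f_measure` is illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for dictionary pre-tagging.
score = f_measure(0.372, 0.462)
print(round(score, 3))  # 0.412
```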
PIRSF Superfamily
• 0 or more levels
Summary
Acknowledgements
Research Projects
NIH: NHGRI/NIGMS/NLM/NIMH/NCRR/NIDCR
(UniProt)
NSF: SEIII (Entity Tagging)
NSF: ITR (Ontology)
Collaborators
I. Mani from Georgetown University
Department of Linguistics on protein name
recognition and protein name ontology.
H. Liu from the University of Maryland Department
of Information Systems on protein name
recognition and text mining.
K. Vijay-Shanker from the University of Delaware
Department of Computer and Information
Sciences on text mining of protein
phosphorylation features.