iProLINK – A Literature Mining Resource at PIR
(integrated Protein Literature INformation and Knowledge)
Hu ZZ1, Liu H2, Vijay-Shanker K3, Mani I4, and Wu CH1
1Protein Information Resource, 2Department of Biostatistics, Bioinformatics, and Biomathematics, 4Department of
Computational Linguistics, Georgetown University, Washington, DC 20007; 3University of Delaware, DE 19716
Introduction: With the increasing volume of scientific literature available electronically, efficient text mining tools will greatly
facilitate the extraction of information buried in free text and will assist in database annotation and scientific inquiry. Many methods,
including natural language processing, machine learning, and rule-based approaches, have been employed for biological literature
mining, especially in the areas of entity recognition, information retrieval, and information extraction. The Protein Information
Resource (PIR) group, in active collaboration with several other groups, conducts research and provides resources on literature mining
in these three areas. iProLINK is a public resource provided by PIR that offers annotated literature data sets for developing new
literature mining algorithms (such as protein named entity recognition, text categorization, and protein annotation extraction) and for
developing protein ontology. iProLINK also provides literature mining tools for scientific users and curators. (Comp Biol Chem, 28:409-416, 2004)
iProLINK Resource Overview
1. Bibliography mapping
- UniProtKB mapped citations
2. Annotation extraction
- annotation tagged literature
3. Protein entity recognition
- dictionary, tagged literature
4. Protein ontology development
- PIRSF-based ontology
Bibliography mapping: Contains curated literature citations for UniProtKB protein entries from multiple
sources including GeneRIF, SGD, and MGI, in addition to current UniProt literature citations. Also included
are user-submitted and computationally mapped citations.
Annotation tagged literature sets: e.g. acetylation, glycosylation, hydroxylation, phosphorylation, methylation in abstract or full text.
Protein entity recognition: name dictionaries, tagged abstracts and tagging guidelines
Search and browse tagged features

- Tagging guideline versions 1.0 and 2.0
- 2 sets of tagged corpora
- Inter-coder reliability
Data sets for the five post-translational modifications (PTMs) are being used for
developing machine learning algorithms for text categorization (classification).
A substring-based approach has been developed that is highly
effective in biomedical document classification
(Bioinformatics, submitted, 2006)
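To illustrate what classifying documents from character substrings (rather than whole words) can look like, here is a minimal sketch using multinomial naive Bayes over overlapping substrings. It is not the method of the submitted paper: the class name `SubstringNB`, the substring length of 4, the labels, and the toy training sentences are all assumptions made for this example.

```python
import math
from collections import Counter


def substrings(text, n=4):
    """Return all overlapping character substrings of length n (lowercased)."""
    t = text.lower()
    return [t[i:i + n] for i in range(len(t) - n + 1)]


class SubstringNB:
    """Multinomial naive Bayes over character substrings (illustrative only)."""

    def __init__(self, n=4):
        self.n = n
        self.counts = {}        # label -> Counter of substring counts
        self.docs = Counter()   # label -> number of training documents
        self.vocab = set()      # all substrings seen during training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            feats = substrings(text, self.n)
            self.counts.setdefault(label, Counter()).update(feats)
            self.docs[label] += 1
            self.vocab.update(feats)
        return self

    def predict(self, text):
        feats = substrings(text, self.n)
        total_docs = sum(self.docs.values())
        best_label, best_score = None, -math.inf
        for label, counter in self.counts.items():
            total = sum(counter.values())
            score = math.log(self.docs[label] / total_docs)
            for f in feats:
                # Laplace smoothing so unseen substrings do not zero out a class.
                score += math.log((counter[f] + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up training data; real work would use the tagged literature sets.
model = SubstringNB(n=4).fit(
    [
        "the protein is phosphorylated by a kinase",
        "serine phosphorylation was detected",
        "the gene is located on chromosome 7",
        "chromosome mapping of the locus",
    ],
    ["ptm", "ptm", "other", "other"],
)
print(model.predict("phosphorylation of the receptor"))
```

Substrings let morphological variants such as "phosphorylated" and "phosphorylation" share features without stemming, which is one plausible reason such features suit biomedical text.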
Data sets for protein phosphorylation were
used for testing and benchmarking a rule-based text mining program for phosphorylation,
RLIMS-P (Bioinformatics 21:2759-65, 2005).
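As a rough idea of what a rule-based extractor for phosphorylation statements does, the sketch below applies a single hand-written pattern to a sentence and pulls out a kinase, a substrate, and an optional site. The pattern, the function name `extract_phosphorylation`, and the example sentence are assumptions for illustration; they are not taken from RLIMS-P, whose actual rules are described in the cited paper.

```python
import re

# Hypothetical rule: "<kinase> phosphorylates <substrate> [on|at <site>]".
# Illustrative only; this is not one of RLIMS-P's actual rules.
PATTERN = re.compile(
    r"(?P<kinase>[A-Za-z0-9-]+)\s+phosphorylates\s+"
    r"(?P<substrate>[A-Za-z0-9-]+)"
    r"(?:\s+(?:on|at)\s+(?P<site>(?:Ser|Thr|Tyr)-?\d+))?"
)


def extract_phosphorylation(sentence):
    """Return a list of (kinase, substrate, site) tuples; site may be None."""
    return [
        (m.group("kinase"), m.group("substrate"), m.group("site"))
        for m in PATTERN.finditer(sentence)
    ]


# Example sentence written for this sketch.
print(extract_phosphorylation("PKA phosphorylates CREB at Ser-133 in vivo."))
```

A single pattern like this only handles single-token names and one active-voice construction; a practical system needs many rules plus phrase-level parsing, which is why tagged data sets are needed for benchmarking.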
PIRSF-Based Protein Ontology
- PIRSF family hierarchy based on evolutionary relationships
- Standardized PIRSF family names and relations as protein ontology
- DAG network structure for PIRSF family classification system (left)
- PIRSF-based protein ontology can complement Gene Ontology (right)
PIRSF in DAG View
RLIMS-P: details in a separate RLIMS-P poster
Guideline v1.0
Guideline v2.0
Bioinformatics. 2006 Apr 27
Protein name tagging guidelines: lessons learned (Comp. Funct Genomics, 6(1-2): 72-76, 2005)
RLIMS-P and BioThesaurus combined can be used for UniProt protein feature annotations.
BioThesaurus
• Comprehensive collection of protein/gene names from multiple molecular databases
• Associates names with UniProtKB entries
• Primary usage:
  - Retrieve synonymous names
  - Resolve ambiguous names
  - Evaluate name coverage
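The three usages listed above can be sketched with a small in-memory thesaurus that maps names to entry identifiers and back. This is an illustration only, not BioThesaurus's implementation or interface: the class `NameThesaurus`, its method names, and the entry identifiers `ENTRY_A` and `ENTRY_B` are hypothetical.

```python
from collections import defaultdict


class NameThesaurus:
    """Toy name-to-entry thesaurus (hypothetical, for illustration)."""

    def __init__(self):
        self.name_to_entries = defaultdict(set)  # normalized name -> entry IDs
        self.entry_to_names = defaultdict(set)   # entry ID -> names as given

    @staticmethod
    def _norm(name):
        # Case-insensitive matching; real name normalization is more involved.
        return name.strip().lower()

    def add(self, entry_id, name):
        self.name_to_entries[self._norm(name)].add(entry_id)
        self.entry_to_names[entry_id].add(name)

    def synonyms(self, name):
        """All names recorded for any entry that carries this name."""
        out = set()
        for entry in self.name_to_entries.get(self._norm(name), ()):
            out |= self.entry_to_names[entry]
        return out

    def is_ambiguous(self, name):
        """True if the name is associated with more than one entry."""
        return len(self.name_to_entries.get(self._norm(name), ())) > 1

    def coverage(self, names):
        """Fraction of the given names that the thesaurus knows."""
        names = list(names)
        if not names:
            return 0.0
        found = sum(1 for n in names if self._norm(n) in self.name_to_entries)
        return found / len(names)


# Made-up entries; the identifiers do not correspond to real UniProtKB records.
thesaurus = NameThesaurus()
thesaurus.add("ENTRY_A", "Metalloproteinase inhibitor 3")
thesaurus.add("ENTRY_A", "TIMP-3")
thesaurus.add("ENTRY_B", "TIMP-3")
thesaurus.add("ENTRY_B", "Tissue inhibitor of metalloproteinases 3")

print(sorted(thesaurus.synonyms("TIMP-3")))
print(thesaurus.is_ambiguous("TIMP-3"))
print(thesaurus.coverage(["TIMP-3", "unknown protein"]))
```

Keeping both directions of the mapping makes synonym retrieval and ambiguity checks simple lookups, at the cost of storing each association twice.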
Synonyms for Metalloproteinase inhibitor 3
Bioinformatics, 21(11): 2759-2765, 2005
http://pir.georgetown.edu/iprolink
Name ambiguity of TIMP-3
Summary
- iProLINK is a public resource for literature mining and ontology development.
- RLIMS-P is a text-mining tool for protein phosphorylation.
- BioThesaurus maps gene and protein names to resolve name ambiguity.
- BioThesaurus and RLIMS-P can be used to assist UniProtKB protein annotations.
- PIRSF-based protein ontology can complement GO.
Acknowledgements: NIH (UniProt), NSF (Entity Tagging,
Ontology). PIR team: Hermoso V, Fang C, Yuan X, Huang H,
Zhang J, Natale D, Nikolskaya A. Temple University: Han B,
Obradovic Z, Vucetic S.
Contact:
[email protected]