Playing Biology’s Name Game:
Identifying Protein Names In
Scientific Text
Daniel Hanisch, Juliane Fluck,
Heinz-Theodor Mevissen and Ralf
Zimmer
Pac Symp Biocomput. 2003;:403-14.
Abstract
Construction of a comprehensive general
purpose name dictionary
An accompanying automatic curation
procedure based on a simple token model of
protein names
An efficient search algorithm to analyze all
abstracts in MEDLINE
Parameters are optimized using machine
learning techniques
Model for protein and gene
names
Protein names are often composed of
more than one word (token)
The “order” of these words is not very
important – permutation of tokens may
occur
General-purpose dictionaries of protein
names must be automatically composed
Token classes (1/3)
Token classes (2/3)
Extract all words from the dictionary
with frequency of occurrence > 100
Non-descriptive tokens: words occurring
in databases but rarely used in free text,
or having no influence on the significance
of a match
Modifier tokens: words crucial for
correct recognition
Token classes (3/3)
Specifier tokens: Arabic and Roman
numbers and Greek letters
Delimiter tokens: used to gain
specificity in the matching procedure –
help identify name boundaries
Common words: obtained by
comparison to a standard English
dictionary
Standard tokens: gene identifiers, as
they cannot be easily assigned to a
separate class
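The token classes above can be sketched as a small classifier. This is only an illustration: the word lists, the tokenizer, the order in which classes are tested, and the function names are assumptions, since the slides do not list the actual class members.

```python
import re
from collections import Counter

# Hypothetical class members; the real lists come from the dictionary itself.
NON_DESCRIPTIVE = {"precursor", "fragment", "protein", "gene"}
MODIFIER = {"receptor", "inhibitor", "ligand", "kinase"}
DELIMITER = {"and", "or", "of", "the", ",", "(", ")"}
COMMON_ENGLISH = {"was", "with", "for", "not"}
GREEK = {"alpha", "beta", "gamma", "delta", "kappa"}
# Lowercase Roman numerals up to 39 (i, ii, iv, ix, xii, ...).
ROMAN = re.compile(r"^(?=[ivx])x{0,3}(ix|iv|v?i{0,3})$")


def tokenize(name):
    """Lowercase a name and split it into letter runs, digit runs, and punctuation."""
    return re.findall(r"[a-z]+|\d+|[^\sa-z\d]", name.lower())


def frequent_words(names, min_count=100):
    """Return all tokens that occur more than min_count times in the dictionary names."""
    counts = Counter(tok for name in names for tok in tokenize(name))
    return {tok for tok, n in counts.items() if n > min_count}


def token_class(token):
    """Assign a (lowercase) token to one class; anything unassigned is a standard token."""
    if token.isdigit() or token in GREEK or ROMAN.match(token):
        return "specifier"
    if token in DELIMITER:
        return "delimiter"
    if token in MODIFIER:
        return "modifier"
    if token in NON_DESCRIPTIVE:
        return "non-descriptive"
    if token in COMMON_ENGLISH:
        return "common"
    return "standard"
```

In this sketch, frequent_words would supply the candidate words that are then sorted into the non-descriptive and modifier lists by hand.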
Automatic generation of the
dictionary
Extract gene symbols, alias names, and full
names for all human genes from the HUGO
Nomenclature database
Create an entry for each official gene symbol
and add the corresponding names in the
OMIM database
Extract all synonyms in SWISSPROT and
TREMBL database and match these to HUGO
entries
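The three generation steps can be sketched as a merge over toy records. The input shapes, the function name build_dictionary, and the rule that a SWISSPROT/TREMBL entry is attached only when its synonyms hit exactly one HUGO entry are assumptions made for illustration; the slides do not specify the matching rule.

```python
from collections import defaultdict


def build_dictionary(hugo_records, omim_names, protein_synonyms):
    """Merge names from three sources into one entry per official gene symbol.

    hugo_records:     iterable of (symbol, aliases, full_name)
    omim_names:       dict mapping symbol -> list of names
    protein_synonyms: iterable of synonym lists, one per SWISSPROT/TREMBL entry
    """
    dictionary = defaultdict(set)

    # Steps 1 and 2: one entry per official symbol, extended with OMIM names.
    for symbol, aliases, full_name in hugo_records:
        dictionary[symbol].update([symbol, full_name, *aliases])
    for symbol, names in omim_names.items():
        if symbol in dictionary:
            dictionary[symbol].update(names)

    # Case-insensitive index from every known name to its symbol(s).
    index = defaultdict(set)
    for symbol, names in dictionary.items():
        for name in names:
            index[name.lower()].add(symbol)

    # Step 3: attach protein synonyms when they point to exactly one entry.
    for synonyms in protein_synonyms:
        hits = set()
        for synonym in synonyms:
            hits |= index.get(synonym.lower(), set())
        if len(hits) == 1:
            dictionary[next(iter(hits))].update(synonyms)

    return dict(dictionary)


# Toy input, not real database content.
toy = build_dictionary(
    [("TNF", ["TNFA"], "tumor necrosis factor")],
    {"TNF": ["TNF-alpha"]},
    [["Tumor necrosis factor", "Cachectin"], ["Unknown protein"]],
)
```

Entries whose synonyms hit no symbol, or more than one, are simply skipped here; a fuller version would route them to manual review.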
Curation of the dictionary (1/3)
To resolve ambiguities and to remove
nonsensical names from the dictionary
A curation procedure consists of two
phases – expansion and pruning
Expansion:
Curation of the dictionary (2/3)
Pruning: remove redundancies, ambiguities,
and irrelevant synonyms
First: each synonym is converted into a sequence
of token class identifiers
Use regular expressions to search for unspecific
synonyms (e.g. only non-descriptive tokens,
only specifier tokens, etc.)
Finally, a list of ambiguous names is stored
separately with reference to their original
records
Curation of the dictionary (3/3)
The ambiguity list can be used to
identify such entries and move them to
the manual curation list based on their
frequency of occurrence.
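A minimal sketch of the pruning phase follows. The one-letter class codes, the tiny word lists, and the decision to drop ambiguous synonyms from their records once they are listed separately are assumptions, not details taken from the slides.

```python
import re
from collections import defaultdict

# Hypothetical class members and one-letter class identifiers:
# N = non-descriptive, M = modifier, S = specifier, T = standard.
NON_DESCRIPTIVE = {"protein", "precursor", "fragment"}
MODIFIER = {"receptor", "inhibitor"}
GREEK = {"alpha", "beta", "gamma"}

# A synonym is unspecific if it has only non-descriptive or only specifier tokens.
UNSPECIFIC = re.compile(r"N+|S+")


def class_string(synonym):
    """Encode a synonym as a string of token class identifiers."""
    codes = []
    for tok in synonym.lower().split():
        if tok.isdigit() or tok in GREEK:
            codes.append("S")
        elif tok in NON_DESCRIPTIVE:
            codes.append("N")
        elif tok in MODIFIER:
            codes.append("M")
        else:
            codes.append("T")
    return "".join(codes)


def prune(dictionary):
    """dictionary: record id -> set of synonyms. Returns (pruned, ambiguous)."""
    owners = defaultdict(set)
    for record, synonyms in dictionary.items():
        for synonym in synonyms:
            owners[synonym.lower()].add(record)

    # Names shared by several records, kept with a reference to those records.
    ambiguous = {s: sorted(recs) for s, recs in owners.items() if len(recs) > 1}

    pruned = {}
    for record, synonyms in dictionary.items():
        pruned[record] = {
            s for s in synonyms
            if not UNSPECIFIC.fullmatch(class_string(s))
            and s.lower() not in ambiguous
        }
    return pruned, ambiguous
```

Encoding each synonym as a short class string is what makes a plain regular expression sufficient for spotting unspecific names.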
Efficient detection of names (1/3)
MEDLINE contains about 11 million abstracts
Linear time in the number of tokens of the
parsed text
The algorithm sweeps over the abstract, processing
one token at a time, and keeps a set of candidate
solutions and two associated scoring measures,
a boundary score and an acceptance score,
for the present position
Efficient detection of names (2/3)
Boundary score: controls the end of the
extension of a candidate match and is
increased on a token mismatch. The candidate
is pruned if the boundary score exceeds the
boundary threshold
Acceptance score: determines whether the
candidate is reported as a match. It is a
linear combination of token-class-specific
match and mismatch terms. In other words,
the significance of token classes varies.
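The sweep with its two scores can be sketched as below. This is a simplified illustration, not the published algorithm: the weights, the unit increment of the boundary score, the thresholds, and the treatment of names as unordered token sets are all assumptions, and overlapping hits are reported without being resolved.

```python
from collections import defaultdict

# Hypothetical per-class weights: (match reward, mismatch penalty).
WEIGHTS = {"standard": (1.0, 1.0), "modifier": (1.0, 2.0),
           "specifier": (0.5, 2.0), "non-descriptive": (0.1, 0.1)}
TOKEN_CLASS = {"receptor": "modifier", "precursor": "non-descriptive",
               "1": "specifier", "2": "specifier"}


def cls(token):
    return TOKEN_CLASS.get(token, "standard")


def acceptance(entry_tokens, matched):
    """Linear combination of class-specific match and mismatch terms."""
    score = 0.0
    for tok in entry_tokens:
        reward, penalty = WEIGHTS[cls(tok)]
        score += reward if tok in matched else -penalty
    return score


def find_names(text_tokens, dictionary, boundary_threshold=1.0, accept_threshold=1.0):
    """dictionary: name id -> set of tokens. Returns (name id, start, end, score) tuples."""
    index = defaultdict(set)            # token -> names containing it
    for name_id, toks in dictionary.items():
        for tok in toks:
            index[tok].add(name_id)

    candidates = {}                     # name id -> [start, end, matched, boundary score]
    hits = []

    def close(name_id):
        start, end, matched, _ = candidates.pop(name_id)
        score = acceptance(dictionary[name_id], matched)
        if score >= accept_threshold:
            hits.append((name_id, start, end, score))

    for pos, tok in enumerate(text_tokens):
        for name_id in list(candidates):
            cand = candidates[name_id]
            if tok in dictionary[name_id] and tok not in cand[2]:
                cand[2].add(tok)        # extend the candidate match
                cand[1] = pos
            else:
                cand[3] += 1.0          # a mismatch raises the boundary score
                if cand[3] > boundary_threshold:
                    close(name_id)      # prune; report if the acceptance score suffices
        for name_id in index.get(tok, ()):
            if name_id not in candidates:
                candidates[name_id] = [pos, pos, {tok}, 0.0]

    for name_id in list(candidates):    # flush candidates still open at the end
        close(name_id)
    return hits
```

For example, find_names("the tnf receptor 1 binds".split(), {"TNFR1": {"tnf", "receptor", "1", "precursor"}}) reports TNFR1 over token positions 1 to 3, because only the lightly weighted token "precursor" is missing.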
Efficient detection of names (3/3)
Example:
If only the non-descriptive token “precursor” is
unmatched in the candidate, a nearly
maximal match score would be computed (if
non-descriptive tokens receive a small weight)
However, the semantically significant modifier
token “receptor” leads to a substantial
mismatch term (if weights are set
appropriately)
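The effect described in this example can be reproduced with a few lines of arithmetic. The dictionary entry and all weights below are hypothetical, chosen only so that non-descriptive tokens weigh little and modifier tokens weigh a lot.

```python
# Hypothetical weights per token class: (match reward, mismatch penalty).
WEIGHTS = {
    "standard": (1.0, 1.0),
    "modifier": (1.0, 3.0),
    "specifier": (1.0, 3.0),
    "non-descriptive": (0.1, 0.1),
}


def score(entry, matched):
    """entry: list of (token, class); matched: set of entry tokens found in the text."""
    total = 0.0
    for token, token_class in entry:
        reward, penalty = WEIGHTS[token_class]
        total += reward if token in matched else -penalty
    return total


# Hypothetical dictionary entry, scored against three different matches.
entry = [("tnf", "standard"), ("receptor", "modifier"),
         ("1", "specifier"), ("precursor", "non-descriptive")]

maximal = score(entry, {token for token, _ in entry})
without_precursor = score(entry, {"tnf", "receptor", "1"})
without_receptor = score(entry, {"tnf", "1", "precursor"})
```

With these weights, leaving "precursor" unmatched costs only a small fraction of the maximal score, while leaving "receptor" unmatched pushes the score below zero.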
Parameter optimization
Robust linear programming (RLP) was used to
compute a set of sensible weights
This supervised machine learning technique uses
a set of positive samples, i.e. correctly identified
protein names, and a set of negative ones.
The match and mismatch weighting parameters
for delimiter, specifier, modifier, and standard
tokens were tuned.
The optimized weightings penalize mismatch of
modifier and number tokens and reward
matching of other token classes to varying extents
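For orientation, the linear program below is the standard robust linear programming formulation of Bennett and Mangasarian, on the assumption that this is the variant meant here. The notation is not from the slides: the rows A_i are the feature vectors (per-class match and mismatch counts) of the m positive samples, the rows B_j those of the k negative samples, w is the weight vector, gamma the acceptance threshold, and y, z are slack variables for misclassified samples.

```latex
\begin{aligned}
\min_{w,\,\gamma,\,y,\,z}\quad
  & \frac{1}{m}\sum_{i=1}^{m} y_i \;+\; \frac{1}{k}\sum_{j=1}^{k} z_j \\
\text{subject to}\quad
  & A_i w - \gamma + y_i \ge 1, \qquad i = 1,\dots,m, \\
  & -B_j w + \gamma + z_j \ge 1, \qquad j = 1,\dots,k, \\
  & y_i \ge 0,\quad z_j \ge 0 .
\end{aligned}
```

Each slack variable measures how far a sample lies on the wrong side of the separating plane, so minimizing the averaged slacks yields weights that separate correct from incorrect matches as well as a linear score allows.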
Evaluation
The test dataset is based on the TRANSPATH
database on regulatory interactions.
Extracted all human proteins with
SWISSPROT annotations
Discarded abstracts if no text was available or
if a protein was described for the first time
Resulting benchmark set consists of 611
associations (141 objects in 470 abstracts)
Results – 5-fold cross-validation
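A 5-fold cross-validation split can be sketched as follows. Splitting by abstract, shuffling with a fixed seed, and the function name k_fold_splits are assumptions; the slides do not say how the folds were formed.

```python
import random


def k_fold_splits(items, k=5, seed=0):
    """Yield (train, test) pairs so that every item is used for testing exactly once."""
    items = list(items)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]     # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# The benchmark has 470 abstracts; here they are represented by integer ids.
splits = list(k_fold_splits(range(470)))
```

In each round the weights would be optimized on the training part and the detection quality measured on the held-out part.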