Transcript Document
EMBL
EBI
(3)
Indexing with UniProt
Sylvain Gaudan * Miguel Arregui * Harald Kirsch * Vivian Lee * Dietrich Rebholz-Schuhmann
Wolfson College, Cambridge University
Mum, Dad, the fish and other species.
I like fish. My favorite is Zebrafish. It’s called like that because, from a fish point of view, it looks like a Zebra. But still, it’s a fish, so it’s a Zebrafish. Of course, they have fins and eyes so that they can see and quickly hide from the starving ugly big fish.
It’s so nice to look at them. At the beginning, it’s only an egg, and then it becomes a fish! With fins, mouth and eyes! I
heard that it’s all done by the genes.
For example, dad told me that there’s a gene called six 3 that has to do with the eyes. He didn’t say much. So I thought
that I could get more information about six 3 on the Internet. That’s when problems started.
I typed six 3 in the little box and I started to read the articles. Many were not about my gene. Then, when it
was about six 3, it wasn’t about Zebrafish (I don’t care about Chicken or Elephant!). So, I went to see dad. Dad said that it’s
because the Internet is about too many different things. He said I have to be more precise. Ah! I thought I just have to ask.
Also, he said I forgot to put Sine oculis blabla 3 something because I should also look for the synonyms. From that
moment I decided to go to see mum. Dad wasn’t funny any more.
Mum said that I shouldn’t listen to dad. That wasn’t the first time she said that. She said that I should forget all about
these strange names and just use the UniProt ID (whatever it is). She just said it’s O73708 for six3 in Zebrafish and that’s
enough to find all the publications and that I don’t have to worry about the synonyms. Mum is fun. The UniProt Index too.
What: Acronym disambiguation
How: Acronyms can be resolved with their long forms (1). Either the long form of the abbreviation is contained in the document, or the context of the document allows the long form to be guessed. Once it is resolved, the long form can be classified as a protein name or not.
Examples:
ADMR: ADrenoMedullin Receptor (gene) / Average Daily Metabolic Rate
AES: Amino-terminal Enhancer of Split (gene) / Anterior Ectosylvian Sulcus
AMFR: Autocrine Motility Factor Receptor (gene) / Amplitude-Modulation Following Response
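As a rough illustration of the first case only (the long form appears in the document), here is a minimal Python sketch. The sense table reuses the ADMR, AES and AMFR examples; the function name `resolve_acronym` and the plain substring matching are assumptions made for this sketch, not the method of reference (1). The second case, guessing the long form from context, is not covered.

```python
# Toy sense inventory built from the poster's examples.
# Each entry: (long form in lower case, True if it names a gene/protein).
SENSES = {
    "ADMR": [("adrenomedullin receptor", True),
             ("average daily metabolic rate", False)],
    "AES": [("amino-terminal enhancer of split", True),
            ("anterior ectosylvian sulcus", False)],
    "AMFR": [("autocrine motility factor receptor", True),
             ("amplitude-modulation following response", False)],
}


def resolve_acronym(acronym, text):
    """Return (long_form, is_gene) if a known long form occurs in text.

    Returns None when no long form is found; a real system would then
    fall back on the document context, which this sketch does not do.
    """
    lowered = text.lower()
    for long_form, is_gene in SENSES.get(acronym, []):
        if long_form in lowered:
            return long_form, is_gene
    return None


if __name__ == "__main__":
    abstract = "The adrenomedullin receptor (ADMR) is expressed in the lung."
    print(resolve_acronym("ADMR", abstract))
```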
~~~
What: Solving the species
How: Publications mentioning protein names often contain information about the studied organism. It can be the name of the organism itself, or of an ancestor or even of a descendant. Using the NCBI taxonomy, the most probable species is selected given the organisms cited in the document.
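One way to picture this selection is the Python sketch below. Everything in it is an assumption for illustration: the `PARENT` table is a tiny hand-made tree, not the real NCBI lineage, and the scoring rule (closer relatives count more) is a guess at a reasonable heuristic, since the actual scoring is not described here.

```python
# Hypothetical miniature taxonomy: child -> parent (not the real NCBI tree).
PARENT = {
    "Danio rerio": "Cyprinidae",
    "Cyprinidae": "Actinopterygii",
    "Actinopterygii": "Vertebrata",
    "Mus musculus domesticus": "Mus musculus",
    "Mus musculus": "Mammalia",
    "Mammalia": "Vertebrata",
    "Gallus gallus": "Aves",
    "Aves": "Vertebrata",
}


def lineage(taxon):
    """Return the taxon followed by all of its ancestors, nearest first."""
    chain = [taxon]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def distance(candidate, mention):
    """Steps between the two taxa if one is an ancestor of the other, else None."""
    up = lineage(candidate)
    if mention in up:          # mention is the candidate or one of its ancestors
        return up.index(mention)
    down = lineage(mention)
    if candidate in down:      # mention is a descendant of the candidate
        return down.index(candidate)
    return None


def most_probable_species(candidates, mentions):
    """Pick the candidate species best supported by the mentioned organisms."""
    def score(candidate):
        total = 0.0
        for mention in mentions:
            d = distance(candidate, mention)
            if d is not None:
                total += 1.0 / (1 + d)   # assumed weighting: closer counts more
        return total
    return max(candidates, key=score)


if __name__ == "__main__":
    species = ["Danio rerio", "Mus musculus", "Gallus gallus"]
    print(most_probable_species(species, ["Cyprinidae", "Vertebrata"]))
```

An exact mention of the species scores highest, while a shared distant ancestor such as "Vertebrata" adds only a little to every candidate and so rarely decides the outcome on its own.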
~~~
What: Including synonyms in the search
How: Swiss-Prot is the most comprehensive and accurate source of names and synonyms for proteins. All the protein names, once disambiguated, are indexed under their names as well as under the unique form that represents the protein in the correct organism: the UniProt PAN (primary accession number).
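The idea of indexing each mention twice, once under the name as written and once under the accession number, can be sketched in Python as follows. The two accession numbers come from this poster (O73708 for six3 in zebrafish, O08663 for the mouse protein in the query example); the `NAME_TABLE` layout, the function name `build_index` and the document ids are hypothetical.

```python
from collections import defaultdict

# Hypothetical lookup: (lower-cased name, species) -> UniProt accession.
NAME_TABLE = {
    ("six3", "Danio rerio"): "O73708",
    ("methionine aminopeptidase 2", "Mus musculus"): "O08663",
    ("map2", "Mus musculus"): "O08663",
    ("metap2", "Mus musculus"): "O08663",
}


def build_index(mentions):
    """Build an inverted index from disambiguated mentions.

    mentions: iterable of (doc_id, name, species), where the species has
    already been resolved for that document.
    """
    index = defaultdict(set)
    for doc_id, name, species in mentions:
        key = name.lower()
        index[key].add(doc_id)                      # index under the name itself
        accession = NAME_TABLE.get((key, species))
        if accession is not None:
            index[accession].add(doc_id)            # and under the accession
    return index


if __name__ == "__main__":
    idx = build_index([
        (1, "MAP2", "Mus musculus"),
        (2, "MetAP2", "Mus musculus"),
        (3, "six3", "Danio rerio"),
    ])
    print(sorted(idx["O08663"]))
```

Searching for the accession then returns documents that used any of the synonyms, which is what makes the single-identifier query possible.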
1) Why is it so difficult to find publications related to a protein?
Fact: Protein names are highly ambiguous
Numbers: More than 600 protein names from Swiss-Prot are also
English words such as ’Had’, ’Great’, ’This’. Also, around 6 000 names
from Swiss-Prot are abbreviations with several potential expansions.
For instance, ADM abbreviates the gene name ’adrenomedullin’ as
well as the drug name ’adriamycin’.
Consequence: Search engine results can be unrelated to the protein
of interest.
~~~
Fact: Protein names are not species specific
Numbers: Around 90 000 protein names from UniProt are shared across several species.
Consequence: When a protein name is mentioned in the text, it is not
obvious which species is concerned.
~~~
Fact: Proteins have several names
Numbers: Around 84% of UniProt entries reference more than one name per protein. Half of Swiss-Prot proteins have at least three names.
Consequence: Search engine results are incomplete.
2) What can be done?
When using EbiMed to retrieve publications related to a protein, simply use the protein’s UniProt PAN instead of one of the protein’s names in conjunction with an organism name. For instance, instead of the query:
(“methionine aminopeptidase 2” OR “peptidase M2” OR MAP2 OR MetAP2) AND (mouse OR mice) AND “tooth germ”
simply use the query:
O08663 AND “tooth germ”
3) So, how does it work?
[Figure: Frequencies of protein names in the BNC (log scale, ticks at 10, 100, 1 000 and 10 000) with a cut-off line. Plotted names: Bnac2, Aciculin, Oxytocin, Insulin, April, Task, Light.]
What: Name disambiguation
How: Protein names that are also English words can be identified by analyzing their frequencies in general English text such as the British National Corpus (BNC).
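A frequency cut-off of this kind might look like the Python sketch below. The counts in `TOY_COUNTS` are invented placeholders, not BNC figures, and the cut-off value of 500 is an arbitrary choice for the example; only the general idea (names that are frequent in general English are treated as ambiguous) comes from the text.

```python
# Invented placeholder counts, NOT real BNC frequencies.
TOY_COUNTS = {
    "light": 20000,
    "task": 9000,
    "april": 6000,
    "insulin": 40,
    "aciculin": 0,
}


def is_ambiguous_with_english(name, corpus_counts, cutoff):
    """True if the name is frequent enough in general English to be ambiguous."""
    return corpus_counts.get(name.lower(), 0) >= cutoff


def split_names(names, corpus_counts, cutoff):
    """Separate names into (ambiguous, safe) lists according to the cut-off."""
    ambiguous = [n for n in names if is_ambiguous_with_english(n, corpus_counts, cutoff)]
    safe = [n for n in names if not is_ambiguous_with_english(n, corpus_counts, cutoff)]
    return ambiguous, safe


if __name__ == "__main__":
    names = ["Light", "Task", "April", "Insulin", "Aciculin", "Bnac2"]
    print(split_names(names, TOY_COUNTS, cutoff=500))
```

Names missing from the corpus, such as "Bnac2" here, get a count of zero and therefore fall on the safe side of the cut-off.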
~~~
4) Sounds good. Where can I use it?
For Biologists: The Protein Index is available on EbiMed (2), a Web Portal developed at the European Bioinformatics Institute. EbiMed retrieves abstracts from Medline and also builds a condensed view of the biomedical terminology contained in the result (e.g. protein names, GO terms, ...).
For Bioinformaticians: Access the Protein Index via the EBI’s Web Services (SOAP/HTTP).
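For scripted use, the two query forms shown in section 2 can be assembled as plain strings, as in this Python sketch. The helper names are hypothetical, and no call to the EBI Web Services is made because their interface is not described on this poster; the sketch only builds the query text.

```python
def quote(term):
    """Wrap multi-word terms in double quotes, leave single words as they are."""
    return f'"{term}"' if " " in term else term


def name_based_query(names, organisms, topic):
    """Build the long query: every synonym, every organism name, plus the topic."""
    names_part = " OR ".join(quote(n) for n in names)
    organisms_part = " OR ".join(quote(o) for o in organisms)
    return f"({names_part}) AND ({organisms_part}) AND {quote(topic)}"


def accession_query(accession, topic):
    """Build the short query: the UniProt accession replaces names and organism."""
    return f"{accession} AND {quote(topic)}"


if __name__ == "__main__":
    print(name_based_query(
        ["methionine aminopeptidase 2", "peptidase M2", "MAP2", "MetAP2"],
        ["mouse", "mice"],
        "tooth germ",
    ))
    print(accession_query("O08663", "tooth germ"))
```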
1) S. Gaudan *, H. Kirsch and D. Rebholz-Schuhmann: Resolving abbreviations to their senses in Medline. Bioinformatics September 2005
2) EbiMed: http://www.ebi.ac.uk/Rebholz-srv/ebimed/ and http://www.ebi.ac.uk/Rebholz-srv/whatizit/
3) Sylvain Gaudan is supported by an “E-STAR” fellowship funded by the EC’s FP6 Marie Curie Host fellowship for Early Stage Research Training under contract number MEST-CT-2004-504640.
http://www.ebi.ac.uk/Rebholz/