Tools for exploring the biomedical information landscape
Les Grivell
EMBO Electronic Information Programme
EAHIL 2004, Santander
Electronic information programme
• Online research information environment for the life sciences
• A next-generation information service for the life sciences
• Communities@embo
• Life Sciences Mobility Portal
But first, let me take you back, not to Altamira, but to the ... early days of scientific publishing (pre-impact factor)
When libraries were comfortable places that had everything you needed ...
and it was possible to keep track of the literature ... (more or less) ...
Where are we now?
– Publishing is big business
• STM publishing is a multi-billion EUR activity (in the UK alone, GBP 22 billion in 2000)
• An estimated 164,000 scientific periodicals worldwide; around 16% of these are online
– Core science; core journals
• PubMed lists some 4600 journals in biomedical disciplines
• As of 19 Sept 2004, 4429 of these are online
• The PubMed database provides access to circa 15 million abstracts (but if you can’t be found, you won’t be read ...)
• The Science Citation Index lists 5876 journals with impact factors ranging from 54.45 to 0.00 (you’ve been found, but are you worth reading? ...)
Another information explosion: genomics
[Chart: sequence entries in the EMBL DNA database. Y axis: base pairs (billions), 0 to 35. X axis: year, 1980 to 2005. Chart label: Morowitz]
Raw sequences are not the only form of digital information
The nice thing about biological information resources is that there are so many ...
• Hundreds of different databases, many in flat-file format
• A variety of user interfaces
• General lack of interoperability
Wouldn’t it be nice to ... find all published literature references for a large set of gene symbols and explore their relationships?
[Diagram: micro-array chip → co-regulated genes → find literature / database lookup → discover relationships]
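The workflow on this slide (gene set in, literature hits and relationships out) can be sketched in a few lines. This is only an illustration, not the E-BioSci or ORIEL implementation: the article IDs, the placeholder symbols GENE_A to GENE_D and the function name `cooccurrence` are all made up, and "relationship" is reduced here to the simplest possible signal, two query genes being mentioned in the same article.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(articles, gene_set):
    """Return (hits, pairs): articles per query gene, and counts of
    query-gene pairs that are mentioned together in one article."""
    hits = {}
    pairs = Counter()
    for article_id, symbols in articles.items():
        # Keep only the symbols that belong to the query gene set.
        found = sorted(gene_set & set(symbols))
        for symbol in found:
            hits.setdefault(symbol, []).append(article_id)
        # Every pair of query genes in the same article counts once.
        pairs.update(combinations(found, 2))
    return hits, pairs

# Made-up article IDs and symbol lists, for illustration only.
ARTICLES = {
    "art-1": ["GENE_A", "GENE_B"],
    "art-2": ["GENE_A", "GENE_B", "GENE_C"],
    "art-3": ["GENE_C"],
    "art-4": ["GENE_D"],
}

hits, pairs = cooccurrence(ARTICLES, {"GENE_A", "GENE_B", "GENE_C"})
for pair, count in pairs.most_common():
    print(pair, count)
```

In practice the symbol lists would have to come from a literature database lookup and a gene symbol recogniser, which is where most of the difficulty lies.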
This is not really such a novel idea ...
Fritz Saxl (1890–1948)
‘Ich will nicht, dass in der Bibliothek ewig gesucht wird! Dieses Suchen kostet Nerven und die dürfen nicht verschwendet werden an solche Dummheiten ...’
(I don’t want there to be endless searching in the library! It is at the expense of nerves and these should not be wasted on such stupidities ...)
Aby Warburg (1866–1929)
Saxl & Warburg: Mnemosyne Atlas
Some text search engines
• Bibliographic databases: Biosis, PubMed
• Full text / web pages
Text-based!
• No direct linkage to other datasets
• Search only title, authors, abstract
• Boolean keyword search (AND / OR)
• Search language is English
• All documents stored and indexed in one location
• No ranking on relevance to query!
Main features
• Ability to interconnect literature articles with different types of molecular data, including images
• Ability to search through and retrieve journal articles and other full text documents, even when in different physical locations
• Ability to support multi-lingual documents and queries
• Services free to the academic community
Features implemented via conceptual fingerprinting
A discovery tool: conceptual fingerprints
Full-text document → index and link index terms to (multi-lingual) thesauri → fingerprint database
• 1 conceptual fingerprint (CFP) = 400 bytes
• Abstraction: 250,000 pages per PC per day
• Matching against 500,000 CFPs: 40 ms
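To make the idea concrete, here is a minimal sketch of fingerprint-style matching. It is not the algorithm used by E-BioSci, whose details are not given on the slide: the tiny thesaurus, the concept IDs C1 to C3 and the use of cosine similarity over concept counts are all assumptions chosen for illustration. The point it shows is that once terms in different languages map to the same concept, a German query can match an English document, and results can be ranked. As a side calculation from the figures above, 500,000 fingerprints at 400 bytes each come to about 200 MB.

```python
import math
import re
from collections import Counter

# Hypothetical mini-thesaurus: terms in several languages -> concept ID.
THESAURUS = {
    "gene": "C1", "gen": "C1", "gène": "C1",
    "protein": "C2", "protéine": "C2",
    "library": "C3", "bibliothek": "C3", "bibliothèque": "C3",
}

def fingerprint(text):
    """Map a text to a unit-length vector of concept weights."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(THESAURUS[t] for t in tokens if t in THESAURUS)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {k: v / norm for k, v in counts.items()} if norm else {}

def similarity(fp_a, fp_b):
    """Cosine similarity of two unit-length fingerprints."""
    return sum(w * fp_b.get(c, 0.0) for c, w in fp_a.items())

def rank(query, documents):
    """Return (score, name) pairs, best match first."""
    q = fingerprint(query)
    scored = [(similarity(q, fingerprint(d)), name)
              for name, d in documents.items()]
    return sorted(scored, reverse=True)

DOCS = {"a": "The library holds every journal.", "b": "A protein study."}
print(rank("Bibliothek", DOCS))
```

A real system would store the fingerprints in a database rather than recomputing them per query, as the slide's "fingerprint database" suggests.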
Prototypes
• Initial prototypes in September 2002 and July 2003
• Current prototype online since 1st March 2004
• Next launch due mid-October 2004
E-BioSci
• Content selection: abstracts + full text
• Choose search focus
• Full-text query in English, French or German; the query is fingerprinted for search
... and now a word about
• 8 partners (DE, ES, FR, UK): platform
• 13 partners (ES, FR, IT, NL, UK): research project
Oriel’s aims
Wouldn’t it be nice to be able to navigate from an image to literature and molecular databases?
www.bioimage.org (Dr David Shotton, Univ. Oxford)
Gene symbol identification in text
Text containing symbols → improved literature – molecular dataset linkage
Twinkle, twinkle, little star,
How I wonder what you are.
Up above the world so high,
Like a diamond in the sky.
Twinkle, twinkle, little star,
How I wonder what you are.
Gene symbols: PEO1, GUCY2C, TYRO3, CD44
Problems in gene symbol recognition
• Many gene symbols are indistinguishable from everyday words or abbreviations
• Synonyms
• Homonyms
• Homonym synonyms (ELK1 = SAP1; CAR1 = SAP1; BD-2 = SAP1; RIP1_SAPOF = SAP1)
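A small dictionary-based tagger shows why these problems bite. This is a sketch under stated assumptions, not the recogniser used in the projects described here: the `ALIASES` table takes the SAP1 homonym set from the slide, the Twinkle/PEO1 entry is an assumption inspired by the nursery-rhyme slide, and the `COMMON_WORDS` list and the function name `tag_symbols` are hypothetical.

```python
import re

# Alias (upper-cased) -> candidate entries. The SAP1 line is from the
# slide; the TWINKLE entry is an assumption made for illustration.
ALIASES = {
    "SAP1": ["ELK1", "CAR1", "BD-2", "RIP1_SAPOF"],
    "ELK1": ["ELK1"],
    "CD44": ["CD44"],
    "TWINKLE": ["PEO1"],
}

# Hypothetical list of aliases that are also everyday words.
COMMON_WORDS = {"twinkle"}

def tag_symbols(text):
    """Return (token, candidates, status) for each token that matches an alias."""
    results = []
    for token in re.findall(r"[A-Za-z0-9_-]+", text):
        candidates = ALIASES.get(token.upper())
        if not candidates:
            continue
        if token.lower() in COMMON_WORDS and not token.isupper():
            status = "possible everyday word"   # needs context to decide
        elif len(candidates) > 1:
            status = "ambiguous"                # homonym: several entries
        else:
            status = "resolved"
        results.append((token, candidates, status))
    return results

print(tag_symbols("SAP1 binds CD44"))
print(tag_symbols("Twinkle, twinkle, little star"))
```

A lookup like this can only flag the ambiguity; resolving it needs context from the surrounding text, such as the organism or other genes mentioned.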
Word-“processing”
Natural language processing
Protein interaction networks
[Diagram: network linking ataxia, Yfh1, Ssc1, Isu1 and Oct1, with edges labelled requires, regulates, interacts, activates]
Hoffmann & Valencia (Madrid)
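As a rough illustration of how labelled edges like these could be pulled out of text, the sketch below matches simple "X verb Y" patterns and collects them into a network. It is a deliberately naive stand-in for real natural language processing and is not the method of Hoffmann & Valencia: the sentences and the placeholder names GENE_A to GENE_C are invented, and only the verb list comes from the slide.

```python
import re
from collections import defaultdict

# Edge labels taken from the slide's diagram.
VERBS = ("requires", "regulates", "activates", "interacts with")
PATTERN = re.compile(r"\b(\w+)\s+(" + "|".join(VERBS) + r")\s+(\w+)\b")

def extract_relations(sentences):
    """Return (subject, verb, object) triples found by the pattern."""
    triples = []
    for sentence in sentences:
        triples.extend(PATTERN.findall(sentence))
    return triples

def build_network(triples):
    """Group triples into an adjacency list keyed by subject."""
    network = defaultdict(list)
    for subject, verb, obj in triples:
        network[subject].append((verb, obj))
    return dict(network)

# Invented toy sentences with placeholder names, for illustration only.
SENTENCES = [
    "GENE_A regulates GENE_B in this toy sentence.",
    "We made up that GENE_B activates GENE_C.",
    "GENE_A interacts with GENE_C.",
]

network = build_network(extract_relations(SENTENCES))
for subject, edges in network.items():
    print(subject, edges)
```

A pattern this simple misses passive constructions, negation and anything spanning more than one clause, which is why full natural language processing is needed in practice.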
Some web-addresses
http://www.e-biosci.org
http://www.oriel.org
http://www.bioimage.org
http://www.pdg.cnb.uam.es/UniPub/iHOP/