
Integrating Text Mining into
Bio-Informatics Workflows
Neil Davis
George Demetriou
Robert Gaizauskas
Yikun Guo
Ian Roberts
Henk Harkema
Natural Language Processing Group
Department of Computer Science
University of Sheffield
Sheffield, UK

ISMB Demo, June 27, 2005
Overview
• Demonstration scenario: to show the use of text mining
techniques to support biomedical researchers investigating
the genetic basis of human disorders
• Case study: Williams-Beuren syndrome
• Overview of presentation & demonstration:
• Background on Williams-Beuren syndrome
• Architecture of system
• Text mining
• Workflows
• User interface
• Demonstration of system
2
Context
• MyGrid
• University of Manchester, University of Newcastle, University of
Nottingham, University of Sheffield, University of Southampton,
IT Innovation Centre, European Bioinformatics Institute
• CLEF
• University of Manchester, University College London, Royal
Marsden Hospital, University of Cambridge, University of
Sheffield, Open University
3
WBS: Clinician’s View
• WBS was first clinically described in 1961 before genetic
screening was available
• WBS presents multiple but highly variable symptoms,
including:
• Congenital heart disorders
• Elfin face
• Mental retardation with relatively spared language skills
• Growth retardation
• Dental malformations
• Infantile hypercalcemia
• Because the underlying genetic basis of WBS varies from patient to
patient, the symptoms that WBS patients present are notoriously
variable
4
WBS: Geneticist’s View
• WBS is caused by a variable (typically 1.54-1.88 Mb)
deletion from 7q11.23
• The deleted region (termed the Williams-Beuren Critical
Region or WBSCR) spans multiple genes
• The complex genotype (multiple genes may be deleted)
leads to a complex phenotype (multiple symptoms may
be present)
5
Research Process
• Multi-step and iterative:
• Step 1: Sequence the section of the genome of interest
• Step 2: Scan the sequence for putative genes
• Step 3: BLAST the putative gene sequence against a database
of known genes to identify homologues whose function may
be known (see the sketch after this slide)
• Step 4: Annotate the putative gene sequence with the data
associated with homologous sequences
• Repeat as new data and sequences become available
• Text Mining techniques can facilitate Step 4
• Navigating the biomedical literature to find papers containing
information about homologous genes, etc.
6
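As a rough illustration of Step 3, the sketch below submits a nucleotide sequence to NCBI's public QBlast URL API and pulls out the request id (RID) that identifies the search. The endpoint, parameters, and placeholder sequence are assumptions made for illustration; this is not the pipeline described in these slides.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch of BLASTing a putative gene sequence against a database of
// known sequences via NCBI's QBlast URL API (assumed endpoint and parameters).
public class BlastSubmit {
    public static void main(String[] args) throws Exception {
        String sequence = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"; // placeholder sequence
        String form = "CMD=Put"
                + "&PROGRAM=blastn"
                + "&DATABASE=nt"
                + "&QUERY=" + URLEncoder.encode(sequence, StandardCharsets.UTF_8);

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://blast.ncbi.nlm.nih.gov/Blast.cgi"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(form))
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        // The submission page embeds a request id (RID); a real client would then
        // poll Blast.cgi with CMD=Get&RID=... until the report is ready.
        Matcher m = Pattern.compile("RID = (\\S+)").matcher(response.body());
        if (m.find()) {
            System.out.println("Submitted search, RID = " + m.group(1));
        } else {
            System.out.println("Could not find an RID in the response");
        }
    }
}
```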
Workflows
• …
7
Text Mining
• Uncovering the information content of unstructured or
semi-structured textual data sources in an automatic way
• Includes research areas such as information extraction (IE),
information retrieval (IR), natural language processing
(NLP), and knowledge discovery in databases (KDD)
• Relevance to biomedical informatics
• Textual biomedical data sources contain valuable information,
but volume is so large and growing so fast that it is difficult for
researchers to find relevant information
• Some information is available in textual form only,
e.g., clinical records
8
Text Mining Workflow
• Workflow: computational model for processes that require
repeated execution of a series of analytical tasks
• BLAST reports provide links to abstracts in the literature
(see the extraction sketch after this slide)
• Use MeSH terms to find related papers
• Show retrieved papers to end user
9
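To make the first step concrete: Swiss-Prot flat-file entries carry RX reference lines of the form "PubMed=...;", so the literature links for a BLAST hit can be pulled out with a simple pattern match. The entry fragment below is invented for illustration; the project's actual extraction step is not shown in the slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: pull PubMed ids out of the RX reference lines of a Swiss-Prot
// flat-file entry for a BLAST hit (the entry text here is an invented fragment).
public class PubMedIdExtractor {

    private static final Pattern PUBMED = Pattern.compile("PubMed=(\\d+);");

    public static List<String> extractPubMedIds(String swissProtEntry) {
        List<String> ids = new ArrayList<>();
        Matcher m = PUBMED.matcher(swissProtEntry);
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }

    public static void main(String[] args) {
        String entry = String.join("\n",
                "ID   ELN_HUMAN               Reviewed;         786 AA.",
                "RX   PubMed=1602151;",   // invented reference lines
                "RX   PubMed=8468049;");
        System.out.println(extractPubMedIds(entry)); // [1602151, 8468049]
    }
}
```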
Architecture of System
[Architecture diagram: the User Client submits a workflow definition + parameters to the Workflow Server, which runs the workflows (Initial Workflow, Cluster Abstracts Workflow, with services such as Swissprot/Blast, Extract PubMed Id, Get Related Abstracts, and Get Medline Abstract) and returns clustered PubMed Ids + titles together with an enactment record. The workflows call the Text Collection Server (Medline Server), which holds Medline abstracts pre-processed offline to extract biomedical terms and indexed; it supplies PubMed Ids and term-annotated Medline abstracts.]
10
Text Collection Server
[Architecture diagram repeated; see slide 10.]
11
Text Collection Server
• Text collection is MEDLINE (www.ncbi.nlm.nih.gov)
• More than 14 million abstracts since the 1950s
• Largest repository of biomedical abstracts
• Copies made available for research, updated continually
• Records contain semi-structured information annotated in XML
(a parsing sketch follows this slide):
• Unique id – PubMed id
• Citation information – author(s), journal, year, etc.
• Manually assigned controlled vocabulary keywords
(MeSH terms)
• Text of abstract
12
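As a concrete view of the record structure, here is a minimal sketch that reads the PubMed id, MeSH descriptors, and abstract text from a MEDLINE citation with the JDK's DOM parser. The element names follow the public MEDLINE/PubMed XML format; the inline sample record is invented.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Sketch: read PubMed id, MeSH terms and abstract text from a MEDLINE
// citation using the JDK DOM parser (the sample record below is invented).
public class MedlineRecordReader {
    public static void main(String[] args) throws Exception {
        String xml = "<MedlineCitation>"
                + "<PMID>12345678</PMID>"
                + "<Article><ArticleTitle>Example title</ArticleTitle>"
                + "<Abstract><AbstractText>Example abstract text.</AbstractText></Abstract></Article>"
                + "<MeshHeadingList><MeshHeading>"
                + "<DescriptorName>Williams Syndrome</DescriptorName>"
                + "</MeshHeading></MeshHeadingList>"
                + "</MedlineCitation>";

        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        System.out.println("PMID: " + doc.getElementsByTagName("PMID").item(0).getTextContent());
        NodeList mesh = doc.getElementsByTagName("DescriptorName");
        for (int i = 0; i < mesh.getLength(); i++) {
            System.out.println("MeSH: " + mesh.item(i).getTextContent());
        }
        System.out.println("Abstract: "
                + doc.getElementsByTagName("AbstractText").item(0).getTextContent());
    }
}
```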
Text Collection Server
• Local copy
• Loaded in MySQL, indexed on various fields, e.g. MeSH terms
• Text portion indexed for search engines (Lucene, Madcow)
• Text preprocessed with text mining tools (AMBIT & Termino)
• Indexes built for term classes (proteins, genes, diseases, etc.)
• Server accepts web service calls to, e.g.
(see the interface sketch after this slide)
• Return text of abstract given a PubMed id
• Return MeSH terms of abstracts given PubMed ids
• Return PubMed ids of abstracts with given MeSH terms
• Return PubMed ids of abstracts matching a free text query
• Return PubMed ids of abstracts containing a specific term
13
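A hypothetical client-side interface for the calls listed above, sketched in Java. The method names and signatures are illustrative assumptions, not the actual WSDL of the Sheffield text collection server.

```java
import java.util.List;
import java.util.Set;

// Hypothetical client-side view of the Text Collection Server's web service
// calls; names and signatures are illustrative, not the project's actual WSDL.
public interface TextCollectionService {

    /** Return the text of the abstract with the given PubMed id. */
    String getAbstract(String pubMedId);

    /** Return the MeSH terms assigned to each of the given abstracts. */
    List<Set<String>> getMeshTerms(List<String> pubMedIds);

    /** Return PubMed ids of abstracts indexed with all of the given MeSH terms. */
    List<String> findByMeshTerms(Set<String> meshTerms);

    /** Return PubMed ids of abstracts matching a free text (Lucene) query. */
    List<String> findByFreeText(String query);

    /** Return PubMed ids of abstracts containing a specific recognized term. */
    List<String> findByTerm(String termClass, String term);
}
```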
Preprocessing / Text Mining
• AMBIT
• Lexical & terminological processing
• Syntactic & semantic processing
• Pattern recognition & discourse processing
• Termino
• Large-scale terminological resource to support term processing
for (biomedical) text processing applications
• Efficient recognition and classification of terms in text through use
of finite state recognizers compiled from terminological database
• Terms are associated with links to outside ontologies and other
terminological knowledge sources
• Text Mining results saved as annotations on text
14
Workflow Server
[Architecture diagram repeated; see slide 10.]
15
Workflow Server
• The workflow server runs the Freefluo enactment engine to
execute XScufl workflows (designed using Taverna)
• WBS workflow:
16
Interface/Browsing Client
[Architecture diagram repeated; see slide 10.]
17
Interface/Browsing Client
• Two components
• Submit workflows for enactment
• Explore results, find related documents, free text search
• Explore results
• Documents organized in tree derived from MeSH hierarchy
(or chromosome locations)
• Links to outside databases containing more information about terms
• Find related documents
• Terms hyperlinked to same terms in other documents
• Finding similar documents
• Similarity measure based on MeSH terms (see the sketch after this slide)
• Similarity measure based on words in document
• Free text search
• Based on Lucene search engine
18
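The slides do not give the similarity formula, so the sketch below shows one plausible MeSH-based measure: Jaccard overlap between the MeSH term sets of two abstracts. Treat it as an assumed stand-in, not the system's actual metric.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of a MeSH-based document similarity: Jaccard overlap between the
// MeSH term sets of two abstracts (an assumed measure, not the project's own).
public class MeshSimilarity {

    public static double jaccard(Set<String> meshA, Set<String> meshB) {
        if (meshA.isEmpty() && meshB.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(meshA);
        intersection.retainAll(meshB);
        Set<String> union = new HashSet<>(meshA);
        union.addAll(meshB);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> a = Set.of("Williams Syndrome", "Chromosome Deletion", "Elastin");
        Set<String> b = Set.of("Williams Syndrome", "Elastin", "Phenotype");
        System.out.printf("similarity = %.2f%n", jaccard(a, b)); // 0.50
    }
}
```

Abstracts sharing more MeSH descriptors score closer to 1.0, so ranking candidates by this score gives a simple "find similar documents" list.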
Interface/Browsing Client
• The GridSphere Portal Framework is used for relaying workflow
requests to the Freefluo enactment engine
• The Text Mining Results viewer is implemented as a Java Swing
applet for enhanced functionality and easy inclusion in portals
• The applet can re-enact workflow requests via the portal
so that the user can further process document sets without
explicitly having to enact a new workflow
19
Interface/Browsing Client
[Screenshot of the browsing client, with callouts for: MeSH tree, abstract titles, abstract body, search scope restrictors, linked terms, Get Related Abstracts, free text search.]
20
Chromosome location
• Extracting relationships between terms
• Viewer can be used to show data organized according to
other trees, e.g., chromosome location, GO tree, etc.
21
Further Information
• Papers
• N. Davis, G. Demetriou, R. Gaizauskas, Y. Guo, I. Roberts. In press.
Web Service Architectures for Text Mining: An Exploration of the
Issues via an E-Science Demonstrator. In: International Journal of
Web Services Research.
• R. Gaizauskas, N. Davis, G. Demetriou, Y. Guo, I. Roberts. 2004.
Integrating Text Mining Services into Distributed Bioinformatics
Workflows: A Web Services Implementation. In: Proceedings of the
IEEE International Conference on Services Computing (SCC 2004).
• Contact
• Neil Davis: [email protected]
• Sheffield NLP website: http://www.nlp.shef.ac.uk/
22
More slides
23
Context: MyGrid
• Objective:
• To develop a comprehensive suite of middleware components
specifically to support data-intensive in silico experiments in biology
• Workflows and query specifications link together third party and local
resources using web service protocols
• Sheffield’s contribution:
• Provision of text mining capabilities to link experimental results to
the biomedical literature
• Duration, funding, participants:
• 4 years, ending in June 2005
• EPSRC-funded e-Science pilot project
• Five UK universities, European Bioinformatics Institute, several
industrial partners (GSK, IBM)
24
Common WBS Deletions
[Diagram of common WBS deletions not reproduced; SVAS = SupraValvular Aortic Stenosis]
25
Why Research WBS?
• Without an understanding of the underlying causes of the
disease, only palliative care can be offered
• Before any type of therapy can be developed, the pertinent
genes, interactions and expression pathways must all be
elucidated
26
Williams-Beuren Syndrome
• Congenital disorder resulting in mental retardation, caused
by deletion of genetic material on chromosome 7
• Area in which deletions occur not well characterised –
better sequence information is becoming available
• As new sequence information becomes available
• gene finding software is run against it
• BLAST is run against new putative genes to identify
homologues whose function may be known
• BLAST reports provide links to abstracts in the literature
27
Why Automate?
• The process of searching for associated papers is tedious
and time consuming
• The gene annotation pipeline is iterative, and automating
time-consuming elements will free up the researchers' time
for more specialist research
• Automation allows easy collection of provenance and
replication of the research process
28
Architecture of System (2)
• A 3-way division of labour is a sensible way to deliver distributed
text mining services
• Providers of e-archives, such as Medline, will make archives
available via web-services interface
• Cannot offer tailored services for every application
• Will provide core, common services
• Specialist workflow designers will add value to basic services
from archive to meet their organization’s needs
• Users will prefer to execute predefined workflows via standard
light clients such as a browser
• Architecture appropriate for many research areas, not just
bioinformatics
29
Text Mining Service Architecture
• Data pre-processing and merging architecture (a merging sketch follows this slide):
[Diagram: the client application requests enriched data from a client-visible web service. The text processing specialist (AMBIT & Termino) regularly requests new text from publishers (Publisher 1 … Publisher n, e.g. MEDLINE abstracts), applies specialist text processing logic, and stores standoff annotations in a database. On a client request, a merging component gets the annotations from the DB, requests the original text from the appropriate publisher(s) on the client's behalf, and returns the merged data.]
30
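A minimal sketch of what the merging component might do: combine standoff annotations (character offsets plus a term class, as stored in the annotation DB) with the original text fetched from the publisher. The Annotation record and the inline output markup are assumptions for illustration.

```java
import java.util.Comparator;
import java.util.List;

// Sketch of merging standoff annotations (start/end character offsets plus a
// term class) back into the original text; the inline output format is assumed.
public class AnnotationMerger {

    record Annotation(int start, int end, String type) {}

    public static String merge(String text, List<Annotation> annotations) {
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        // Assume annotations do not overlap; process them left to right.
        for (Annotation a : annotations.stream()
                .sorted(Comparator.comparingInt(Annotation::start)).toList()) {
            out.append(text, cursor, a.start());
            out.append("<term class=\"").append(a.type()).append("\">")
               .append(text, a.start(), a.end())
               .append("</term>");
            cursor = a.end();
        }
        out.append(text.substring(cursor));
        return out.toString();
    }

    public static void main(String[] args) {
        String abstractText = "Hemizygous deletion of the elastin gene causes SVAS.";
        List<Annotation> fromDb = List.of(
                new Annotation(27, 34, "gene"),     // "elastin"
                new Annotation(47, 51, "disease")); // "SVAS"
        System.out.println(merge(abstractText, fromDb));
    }
}
```

Keeping the annotations standoff means the archive text itself never has to be modified; the inline form is generated only when a client asks for enriched data.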
Text Mining: Termino
• Large-scale terminological resource to support term
processing for (biomedical) text processing applications
• Uniform access to terminological information aggregated
across many sources, without the need for multiple, source-specific
terminological components
• Immediate entry points into a variety of outside ontologies
and other knowledge sources, making this information
available to processing steps subsequent to term recognition
• Efficient recognition of terms in text through use of finite
state recognizers compiled from the contents of Termino
(see the sketch after this slide)
• Lexical look-up service accessible via web service
(http://don.dcs.shef.ac.uk/termino)
31
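To illustrate the idea of compiling a terminological resource into a fast recognizer, here is a toy token-level trie matcher, a simple stand-in for Termino's finite state recognizers. The term list, classes, and matching policy (longest match per position) are invented for the example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for an FSA-based term recognizer: a token-level trie compiled
// from a (here invented) term list, scanned for longest matches in the text.
public class TermRecognizer {

    private static class Node {
        Map<String, Node> next = new HashMap<>();
        String termClass; // non-null if a term ends at this node
    }

    private final Node root = new Node();

    public void addTerm(String term, String termClass) {
        Node node = root;
        for (String token : term.toLowerCase().split("\\s+")) {
            node = node.next.computeIfAbsent(token, t -> new Node());
        }
        node.termClass = termClass;
    }

    public void recognize(String text) {
        String[] tokens = text.toLowerCase().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            Node node = root;
            int bestEnd = -1;
            String bestClass = null;
            for (int j = i; j < tokens.length
                    && (node = node.next.get(tokens[j])) != null; j++) {
                if (node.termClass != null) { // remember the longest match so far
                    bestEnd = j;
                    bestClass = node.termClass;
                }
            }
            if (bestEnd >= 0) {
                System.out.println(String.join(" ", List.of(tokens).subList(i, bestEnd + 1))
                        + " -> " + bestClass);
            }
        }
    }

    public static void main(String[] args) {
        TermRecognizer recognizer = new TermRecognizer();
        recognizer.addTerm("Williams-Beuren syndrome", "disease");
        recognizer.addTerm("supravalvular aortic stenosis", "disease");
        recognizer.addTerm("elastin", "gene");
        recognizer.recognize("Williams-Beuren syndrome involves the elastin gene"
                + " and supravalvular aortic stenosis");
    }
}
```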
Workflow Server
• The workflow server runs the Freefluo enactment engine to
execute XScufl workflows (designed using Taverna)
• Graves’ disease workflow:
32
Example Project: CLEF
• Clinical e-Science Framework
• Objective:
• To develop a high quality, secure and interoperable information repository,
derived from operational electronic patient records to enable ethical and
user-friendly access to patient information in support of clinical care and
biomedical research
• Sheffield’s contribution:
• Analyzing clinical narratives to extract medically relevant entities and
events, and their properties and relationships
• Duration, funding, participants:
• 2003 – 2005 (CLEF), 2005 – 2007 (CLEF-S)
• Funded by Medical Research Council (MRC)
• Six universities, Royal Marsden Hospital, industrial partners engaged
through CLEF Industrial Forum Meetings
33