WP7: Text Services


Integrating BioMedical Text Mining Services into
a Distributed Workflow Environment
UK E-Science All Hands Meeting
Nottingham
September 1-3, 2004
Rob Gaizauskas, Neil Davis, George Demetriou,
Yikun Guo, Ian Roberts
Outline

Introduction: Workflows, Web Services and Text
Mining for Bioinformatics

Two Case Studies: Graves’ Disease and Williams
Syndrome

Text Services
– Text Collection Server
– Text Services Workflow Server
– Interface/Browsing Client

Conclusions/Future Work
Workflows, Web Services and Text
Mining for Bioinformatics

Workflows
– useful computational models for processes that require repeated
execution of a series of complex analytical tasks
– E.g. biologist researching genetic basis of a disease repeatedly
   • maps reactive spot in microarray data to gene sequence
   • uses a sequence alignment tool to find proteins/DNA of similar structure
   • mines info about these homologues from remote DBs
   • annotates unknown gene sequence with this discovered info

Web services
– Processing resources that
   • are available via the Internet
   • use standardised messaging formats, such as XML
   • enable communication between applications without being tied to a
     particular operating system/programming language
– Useful for bioinformatics, where data used in research is
   • heterogeneous in nature – DB records, numerical results, NL texts
   • distributed across the internet in research institutions around the world
   • available on a variety of platforms and via non-uniform interfaces

Text mining
– any process of revealing information (regularities, patterns or trends)
in textual data
– includes more established research areas such as information
extraction (IE), information retrieval (IR), natural language processing
(NLP), knowledge discovery from databases (KDD)
– relevant to bioinformatics because of
   • explosive growth of biomedical literature
   • availability of some information in textual form only, e.g. clinical records
[Figure: diagram relating workflows, web services, text mining and bioinformatics]
Context

Objective: deliver text services for the myGrid and CLEF projects

myGrid has adopted the workflow model for delivering an e-biologist’s
workbench
– Scufl workflow specification language
– Taverna workflow design tool
– Freefluo workflow enactment engine

Problem: how to integrate text mining into a biological workflow?
– Most text mining runs off-line and supports interactive browsing of results
– Most workflows run end to end with no user intervention
– What should the inputs to text mining be?

Solution: tap off the result of a workflow step and treat it as an implicit query
Two Case Studies in the Genetic Basis
of Disease

Graves’ Disease
– an autoimmune condition affecting tissues in the thyroid and orbit
– being investigated using micro-array methods
   • micro-array shows which genes are differentially expressed in normal
     patients vs patients with the disease = candidate genes
   • sequence alignment search (e.g. BLAST) finds genes/proteins with
     similar structure
   • function of these “homologues” may suggest function of candidate gene
– key step for text mining follows BLAST search
   • for homologous proteins, the BLAST report contains references to proteins
     in the Swissprot protein database
   • Swissprot records contain ids of abstracts describing the protein in
     the Medline abstract database
   • abstracts can be mined directly or used as “seed” documents to
     assemble a set of related abstracts
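
The step from a Swissprot record to PubMed ids can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes the record arrives in the Swiss-Prot flat-file format, where reference cross-references sit on `RX` lines as `PubMed=<id>;`. The function name, the sample record, the accession and all ids are invented for the example.

```python
import re

# Matches a PubMed cross-reference inside an RX (reference cross-reference) line.
_PUBMED_RX = re.compile(r"PubMed=(\d+)")


def extract_pubmed_ids(swissprot_record: str) -> list:
    """Return the PubMed ids cited in a flat-file record, in order, without duplicates."""
    ids = []
    for line in swissprot_record.splitlines():
        if not line.startswith("RX"):
            continue  # only RX lines carry literature cross-references
        for pmid in _PUBMED_RX.findall(line):
            if pmid not in ids:
                ids.append(pmid)
    return ids


# Invented record fragment, for illustration only.
SAMPLE_RECORD = """\
ID   EXAMPLE_HUMAN   STANDARD;   PRT;   100 AA.
AC   P00000;
RN   [1]
RX   MEDLINE=00000001; PubMed=1111111;
RT   "An invented title.";
RN   [2]
RX   PubMed=2222222; DOI=10.0000/example;
"""

if __name__ == "__main__":
    print(extract_pubmed_ids(SAMPLE_RECORD))
```

The ids returned this way would serve as the seed set passed on to the Medline server.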

Williams Syndrome
– congenital disorder resulting in mental retardation caused by
deletion of genetic material on 7th chromosome
– area in which deletions occur not well characterised – better
sequence info is becoming available
– as new sequence information becomes available
   • gene finding software is run against it
   • BLAST is run against new putative genes to identify
     homologues whose function may be known
– BLAST reports provide links to abstracts in the literature
Text Services Architecture
[Figure: architecture diagram with three components
– User Client: sends workflow definition + parameters to the Workflow Server; receives clustered PubMed Ids + titles
– Workflow Server: workflow enactment of the initial workflow on a Swissprot/Blast record, with steps Extract PubMed Id, Get Medline Abstract, Get Related Abstracts, Cluster Abstracts
– Medline Server: receives PubMed Ids and returns term-annotated Medline abstracts; Medline is pre-processed offline to extract biomedical terms and indexed]

3-way division of labour is a sensible way to deliver
distributed text mining services
– Providers of e-archives, such as Medline, will make archives
available via a web-services interface
   • Cannot offer tailored services for every application
   • Will provide core, common services
– Specialist workflow designers will add value to basic
services from the archive to meet their organization’s needs
– Users will prefer to execute predefined workflows via
standard light clients such as a browser

Architecture appropriate for many research areas, not
just bioinformatics
Text Collection Server

Text collection is Medline (www.ncbi.nlm.nih.gov/)
– more than 10 million abstracts since the 1950s
– largest repository of biomedical abstracts
– copies made available for research, updated annually
– records contain semi-structured information annotated in XML
   • Unique id – PubMed id
   • Citation information – author(s), journal, year, etc.
   • Manually assigned controlled vocabulary keywords (MeSH terms)
   • Text of abstract
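
To make the record structure concrete, the sketch below reads the four kinds of field listed above from a heavily simplified citation. The element names (`MedlineCitation`, `PMID`, `ArticleTitle`, `AbstractText`, `DescriptorName`) follow the Medline XML format, but the sample record is invented and omits most of the real structure; `parse_citation` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Invented, simplified record; real Medline citations carry many more elements.
SAMPLE_XML = """\
<MedlineCitation>
  <PMID>1111111</PMID>
  <Article>
    <Journal><Title>Invented Journal</Title></Journal>
    <ArticleTitle>An invented title</ArticleTitle>
    <Abstract><AbstractText>Invented abstract text.</AbstractText></Abstract>
    <AuthorList>
      <Author><LastName>Doe</LastName><Initials>J</Initials></Author>
    </AuthorList>
  </Article>
  <MeshHeadingList>
    <MeshHeading><DescriptorName>Graves Disease</DescriptorName></MeshHeading>
    <MeshHeading><DescriptorName>Thyroid Gland</DescriptorName></MeshHeading>
  </MeshHeadingList>
</MedlineCitation>
"""


def parse_citation(xml_text):
    """Pull id, citation information, MeSH terms and abstract text from one record."""
    root = ET.fromstring(xml_text)
    return {
        "pmid": root.findtext("PMID"),
        "title": root.findtext("Article/ArticleTitle"),
        "journal": root.findtext("Article/Journal/Title"),
        "authors": [a.findtext("LastName")
                    for a in root.iterfind("Article/AuthorList/Author")],
        "mesh_terms": [d.text
                       for d in root.iterfind("MeshHeadingList/MeshHeading/DescriptorName")],
        "abstract": root.findtext("Article/Abstract/AbstractText"),
    }


if __name__ == "__main__":
    print(parse_citation(SAMPLE_XML))
```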

Local copy
– Loaded in mySQL, indexed on various fields, e.g. MeSH terms
– Text portion indexed for search engines (Lucene, Madcow)
– Text pre-processed with text mining tools
   • Tokenisation
   • Part-of-speech tagging
   • Terminology look-up
   • Term parsing
and indexes built for term classes (proteins, genes, diseases, etc.)
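
A rough sketch of two of the pre-processing steps above, tokenisation and terminology look-up, followed by building an index per term class. It is an illustration under stated assumptions rather than the tools actually used: the terminology is a three-entry toy dictionary, look-up is a greedy longest match over token n-grams, and part-of-speech tagging and term parsing are left out. All names and ids are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical toy terminology: surface form -> term class.
TERMINOLOGY = {
    "thyrotropin receptor": "protein",
    "tshr": "gene",
    "graves disease": "disease",
}


def tokenise(text):
    """Lower-cased alphanumeric tokens; punctuation is dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())


def find_terms(text, terminology=TERMINOLOGY, max_len=3):
    """Greedy longest-match look-up of token n-grams in the terminology."""
    tokens = tokenise(text)
    found = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in terminology:
                found.append((candidate, terminology[candidate]))
                i += n  # skip past the matched term
                break
        else:
            i += 1  # no term starts at this token
    return found


def build_class_index(abstracts):
    """Map term class -> term -> set of PubMed ids whose text contains the term."""
    index = defaultdict(lambda: defaultdict(set))
    for pmid, text in abstracts.items():
        for term, term_class in find_terms(text):
            index[term_class][term].add(pmid)
    return index
```

With an index of this shape, a request such as "abstracts containing a specific term" becomes a dictionary look-up rather than a scan of the text.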

Server accepts web service calls to, e.g.
– Return text of abstract given a PubMed id
– Return MeSH terms of abstracts given PubMed ids
– Return PubMed ids of abstracts with given MeSH terms
– Return PubMed ids of abstracts matching a free text query
– Return PubMed ids of abstracts containing a specific term
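
The five calls listed above could be modelled by an interface along these lines. This is an in-memory sketch only: the class and method names are invented, "with given MeSH terms" is read as "carrying all of the given terms", and the free text match is a plain substring test standing in for a search engine index.

```python
from dataclasses import dataclass


@dataclass
class Abstract:
    pmid: str
    text: str
    mesh_terms: frozenset = frozenset()
    terms: frozenset = frozenset()  # terms found by offline pre-processing


class TextCollection:
    """Hypothetical stand-in for the operations exposed by the collection server."""

    def __init__(self, abstracts):
        self._by_id = {a.pmid: a for a in abstracts}

    def get_abstract_text(self, pmid):
        return self._by_id[pmid].text

    def get_mesh_terms(self, pmids):
        return {p: sorted(self._by_id[p].mesh_terms) for p in pmids}

    def find_by_mesh_terms(self, mesh_terms):
        # Assumption: an abstract matches only if it carries every given term.
        wanted = set(mesh_terms)
        return sorted(p for p, a in self._by_id.items() if wanted <= a.mesh_terms)

    def find_by_free_text(self, query):
        # Simplification: every query word must occur as a substring of the text.
        words = query.lower().split()
        return sorted(p for p, a in self._by_id.items()
                      if all(w in a.text.lower() for w in words))

    def find_by_term(self, term):
        return sorted(p for p, a in self._by_id.items() if term in a.terms)
```

Keeping each operation small and id-based fits the division of labour described earlier: the archive offers core look-ups, and workflows compose them.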
Workflow Server

Workflow server runs Freefluo enactment engine to
execute Scufl workflow (designed using Taverna)

Graves’ disease workflow:
Interface/Browsing Client

Two components
– Submit workflow for enactment
– Explore results and launch follow-on queries

Three types of follow-on search
– Find other texts containing terms in current text
– Find texts containing a specific search string (free text search)
– Find other texts “like” the current one (with the same MeSH terms)

Implemented as a Java-Swing applet for easy inclusion
in portals
[Figure: screenshot of the browsing client, with labelled regions: MeSH tree, abstract titles, abstract body, search scope restrictors, linked terms, Get Related Abstracts, free text search]
Conclusion

Have implemented a set of text mining web services that run in
a workflow to support biologists in exploring the genetic basis of
disease

Implementation based on a generic 3-component architecture
(archive server, workflow server, browser client) with wider
applicability

Basic idea is to glean an implicit query from a workflow
operation (e.g. sequence alignment)
– find abstracts of papers related to abstracts describing homologous
proteins/genes of gene of interest
– Cluster results and present to user

User can explore results and issue follow-on queries via a richly featured graphical interface
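
The basic idea above (seed abstracts from a workflow step, then related abstracts, then clusters) might look like this in miniature. It is one possible reading, not the deployed workflow: relatedness is taken to mean sharing at least one MeSH term with a seed, and clustering is simple grouping by MeSH term. The collection, ids and function names are invented.

```python
from collections import defaultdict

# Hypothetical toy collection: PubMed id -> set of MeSH terms.
MESH = {
    "101": {"Graves Disease", "Thyroid Gland"},
    "102": {"Graves Disease", "Autoantibodies"},
    "103": {"Thyroid Gland"},
    "104": {"Zebrafish"},
}


def related_abstracts(seed_ids, mesh=MESH):
    """PubMed ids (seeds excluded) sharing at least one MeSH term with any seed."""
    seed_terms = set().union(*(mesh[s] for s in seed_ids))
    return sorted(p for p, terms in mesh.items()
                  if p not in seed_ids and terms & seed_terms)


def cluster_by_mesh(pmids, mesh=MESH):
    """Group PubMed ids under each MeSH term they carry (an id may recur)."""
    clusters = defaultdict(list)
    for p in sorted(pmids):
        for term in sorted(mesh[p]):
            clusters[term].append(p)
    return dict(clusters)


if __name__ == "__main__":
    seeds = ["101"]  # e.g. ids taken from a BLAST/Swissprot result
    print(cluster_by_mesh(related_abstracts(seeds)))
```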
Future Work

Integrate in practice with rest of Graves’/Williams workflows in
myGrid and get feedback from biologists

Explore other interpretations of “relatedness” for abstracts in
addition to MeSH terms
– in assembling corpus of related abstracts (e.g. vector
space/language model notions of similarity)
– in clustering results (e.g. k-means/agglomerative clustering)

Explore other ways of deriving implicit queries from workflows –
e.g. mining provenance data

Explore further interface search filtering operations and interface
design issues

Scale up to process all of Medline for term/entity identification
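
As an illustration of the vector space notion of similarity mentioned above, the sketch below ranks candidate abstracts by cosine similarity of raw term-frequency vectors against a seed abstract. It is a bare-bones example: no stop-word removal, stemming or tf-idf weighting is applied, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def tf_vector(text):
    """Term-frequency vector over lower-cased alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank_by_similarity(seed_text, candidates):
    """Candidate ids ordered by decreasing similarity to the seed (ties by id)."""
    seed = tf_vector(seed_text)
    scored = [(cosine(seed, tf_vector(text)), pmid)
              for pmid, text in candidates.items()]
    return [pmid for score, pmid in sorted(scored, key=lambda s: (-s[0], s[1]))]
```

Unlike MeSH overlap, a measure of this kind needs no manual indexing, so it also applies to abstracts that have not yet been assigned MeSH terms.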