Linking data to publications

Science, Workflows and Collections
Professor Carole Goble
The University of Manchester, UK
Roadmap
How bioinformaticians will work (and are now)
The myGrid project - workflows
Using publications in workflows
Workflow implications for serials
Williams-Beuren Syndrome
Contiguous sporadic gene deletion disorder
1/20,000 live births, caused by unequal crossover (homologous recombination) during meiosis
Haploinsufficiency of the region results in the phenotype
[Figure: physical map of chromosome 7 (~155 Mb) at 7q11.23, labelled with patient deletions (WBS, SVAS), ~1.5 Mb, and a ‘Gap’ between CTA-315H11 and CTB-51J22. © Hannah Tipney]
1. Identify new, overlapping sequence of interest
2. Characterise the new sequence at nucleotide and amino acid level
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa
Cutting and pasting between numerous web-based services, e.g. BLAST, InterProScan etc.
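Before a numbered sequence listing like the one on this slide can be pasted into a service, the position numbers and spacing have to be stripped. A minimal Python sketch of that clean-up, plus one simple nucleotide-level measure (GC fraction), is given here. The function names are illustrative assumptions, not part of any tool mentioned in the talk.

```python
import re

def parse_numbered_sequence(text: str) -> str:
    """Join numbered, space-separated sequence lines into one lowercase string."""
    # Position numbers and whitespace are discarded; only base letters are kept.
    return "".join(re.findall(r"[acgtn]+", text.lower()))

def gc_fraction(seq: str) -> float:
    """Fraction of bases that are g or c (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(base in "gc" for base in seq) / len(seq)

# First two lines of the listing shown on the slide.
record = """
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
"""

seq = parse_numbered_sequence(record)
print(len(seq))                     # number of bases recovered
print(round(gc_fraction(seq), 3))   # GC fraction of those bases
```

Doing this by hand for every service is the copy-and-paste step the talk describes as slow and error prone.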
In Life Sciences:
Data, Publication, it’s all the same
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg
It’s just part of the experiment
No separation between data and publications
Publications are the context for data
Break the silo between published papers and published data
Aside: A heretic speaks
Life Scientists read journals
I’m a Computer Scientist. I don’t.
It’s on the Web
It’s in podcast talks or PowerPoint
Google is the Lord’s work
What PhD students are for
Journal publications too outdated
Bioinformatics pipelines on the web
RepeatMasker
BLASTn
Twinscan
Copy and paste from one web-based application to another
Annotate by hand
Disadvantages: time consuming, error prone, tacit procedure so difficult to share both protocol and results
Workflows for Science
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg
Workflows for Science
“Workflow at its simplest is the movement of documents and/or tasks through a work process. More specifically, workflow is the operational aspect of a work procedure: how tasks are structured, who performs them, what their relative order is, how they are synchronized, how information flows to support the tasks and how tasks are being tracked.”
Workflows for Science
[Figure: pipeline diagram with the labels Sequence in, RepeatMasker Web Service, BLASTn Web Service, Twinscan Web Service, Predicted genes out]
Simple scripting language specifies how steps of a pipeline link together
Hides all the fiddling about.
Advantages: automation, quick to write, easier to explain, share, relocate, and record provenance of results in a standard way
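The idea of a script that only states how the steps link together can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the workflow language used by myGrid: the three step functions are hypothetical placeholders that do not call the real services, and the step order simply follows the order in which the tools are listed on the earlier slide.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[str], str]]

def run_pipeline(steps: List[Step], data: str) -> Tuple[str, List[str]]:
    """Feed each step's output into the next and record the order of execution."""
    trace = []
    for name, func in steps:
        data = func(data)
        trace.append(name)
    return data, trace

# Hypothetical placeholders: real steps would call the remote web services.
def mask_repeats(seq: str) -> str:
    return seq.replace("tttttt", "nnnnnn")  # toy masking rule, not RepeatMasker's

def similarity_search(seq: str) -> str:
    return seq  # placeholder: passes the sequence through unchanged

def predict_genes(seq: str) -> str:
    return f"predicted genes for {len(seq)} bp"  # placeholder summary only

steps = [
    ("RepeatMasker", mask_repeats),
    ("BLASTn", similarity_search),
    ("Twinscan", predict_genes),
]

result, trace = run_pipeline(steps, "acgtttttttacgt")
print(trace)
print(result)
```

Because the linkage is data rather than a manual habit, the same list of steps can be shared, rerun, or edited by someone else.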
Workflows for Science
Workflows describe the scientist’s in silico experiment
Remote, third party, external applications and services
Accessible to the workflow machinery
And that includes serials!
Results management
Link together and cross reference data in different repositories
And that includes serials!
Semantic metadata annotation of data
Provenance tracking of results
Sharing and replicating know-how
Reuse of workflows
WBS
The first complete and accurate map of the region of chromosome 7 involved in Williams-Beuren Syndrome
Time to perform one WBS pipeline cut from 2 weeks to 2 hours
Faster, automated, systematic and shareable
Reuse
Adapting and sharing best practice and know-how across a community by publishing workflows
Williams-Beuren Syndrome
Graves’ disease
Chicken genome
Trypanosomiasis in cattle
Mouse genome
Trypanosomiasis in cattle
Identify the genetic difference responsible for resistance to trypanosomiasis and breed into productive cattle.
Mice as a model.
Gene expression and microarray analysis
The literature
Associations between upregulated genes
Links between changed genes and genes in the Tir1 region
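One simple way to look for associations between genes in the literature is to count how often two terms appear in the same abstract. The Python sketch below shows that co-occurrence idea only. It is an assumption made for illustration, not a description of how Chilibot or the myGrid workflows are implemented, and the abstracts and gene names are made-up placeholders.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List

def cooccurrence(abstracts: Iterable[str], terms: List[str]) -> Counter:
    """Count, for each pair of terms, the abstracts that mention both."""
    counts: Counter = Counter()
    for text in abstracts:
        lowered = text.lower()
        # Naive substring matching; real term recognition is far more careful.
        present = sorted(t for t in terms if t.lower() in lowered)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts

# Placeholder texts and names, invented purely to exercise the function.
abstracts = [
    "GENE_A expression rises together with GENE_B.",
    "GENE_B is linked to cholesterol levels.",
    "No effect of GENE_A on cholesterol; GENE_B unchanged.",
]
terms = ["GENE_A", "GENE_B", "cholesterol"]

counts = cooccurrence(abstracts, terms)
for pair, n in counts.most_common():
    print(pair, n)
```

Pairs with high counts are only candidates for a closer reading. A count says nothing about the kind of relationship the papers report.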
PubMed Text Mining results
Chilibot text mining in Taverna
Taverna output
Chilibot web page
Lipoprotein and cholesterol
• Trypanosomes need cholesterol and have a scavenger receptor specific for HDL
• Resistant mice reduce available HDL, slowing trypanosome growth
New hypothesis: Resistance and susceptibility in mice is a function of the cholesterol recycling pathway. Mice love lard.
Biological pathway, highlighted with RNA molecules (orange) and DNA QTL molecules (pink), discovered with the aid of Chilibot text mining over PubMed.
myGrid / Discovery Net
Assigning Gene Ontology terms to papers in MEDLINE
Specialist term recognition software
Science: Knowledge-driven
HTML-CML version
MEDLINE abstract; marked up by SciBorg
“the development of online submission systems for scientific manuscripts provides a mechanism for including a mapping of the information in the manuscript to controlled terminologies as an integral part of the publishing process. It is not hard to envision that the indexing of a paper to controlled terms for anatomical, gene nomenclature, or functional terminologies would be a necessary requirement for acceptance of a paper for publication. This, then, would enable the rapid incorporation of the paper and its contents into bioinformatics systems.”
Judith Blake, Bio-ontologies—fast and furious, Nature Biotechnology 22, 773-774 (2004)
The scholarly knowledge cycle. Liz Lyon, Ariadne, July 2003.
© Liz Lyon (UKOLN, University of Bath), 2005. This work is licensed under a Creative Commons License Attribution-ShareAlike 2.0
[Figure: cycle diagram linking the components listed below]
Data creation / capture / gathering: laboratory experiments, Grids, fieldwork, surveys, media
Data analysis, transformation, mining, modelling
Research & e-Science workflows
Learning & Teaching workflows
Learning object creation, re-use
Repositories: institutional, e-prints, subject, data, learning objects
Aggregator services: national, commercial
Presentation services: subject, media-specific, data, commercial portals
Institutional presentation services: portals, Learning Management Systems, u/g, p/g courses, modules
Peer-reviewed publications: journals, conference proceedings
Quality assurance bodies
Links between components: resource discovery, linking, embedding; searching, harvesting, embedding; harvesting metadata; deposit / self-archiving; validation; publication
eBank UK Project
http://www.ukoln.ac.uk/projects/ebank-uk/
Aggregator service harvests metadata from institutional repository (e-crystals archive)
eBank service embedded in PSIgate portal for 3rd party search
Service linking from data to derived research publication
Embedding eBank service in learning workflows
UKOLN (lead), University of Southampton, University of Manchester
Linking data to publications
Provenance
Log what, where, when, who
For data and for publications
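A provenance log of this kind can be pictured as one small record per operation. The Python sketch below is a minimal illustration under assumptions: the field names beyond what, where, when and who, and the example values, are hypothetical and do not reflect the provenance model actually used by myGrid or CombeChem.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Iterable, List

@dataclass
class ProvenanceRecord:
    what: str    # the operation that was performed
    where: str   # the service or location that performed it
    who: str     # the person or agent responsible
    when: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )            # UTC timestamp taken when the record is created
    inputs: List[str] = field(default_factory=list)   # identifiers consumed
    outputs: List[str] = field(default_factory=list)  # identifiers produced

log: List[ProvenanceRecord] = []

def record(what: str, where: str, who: str,
           inputs: Iterable[str], outputs: Iterable[str]) -> ProvenanceRecord:
    """Append one provenance entry to the log and return it."""
    entry = ProvenanceRecord(what, where, who,
                             inputs=list(inputs), outputs=list(outputs))
    log.append(entry)
    return entry

# Hypothetical values, for illustration only.
entry = record("mask repeats", "example-service", "example-user",
               ["seq-001"], ["masked-001"])
print(asdict(entry))
```

The same record shape works whether the identifiers point at data sets or at publications, which is the point the slide makes.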
[Figure: plan and recorded process for a chemistry experiment; the recoverable content is listed below]
Ingredient list (as listed): Fluorinated biphenyl, Br11OCB, Potassium Carbonate, Butanone
Planned quantities (as listed): 0.9 g, 1.59 g, 2.07 g, 40 ml
Recorded measurements: 0.9031, 1.5918, 2.0719 (grammes); 40 ml; 30 ml
Plan steps:
Dissolve 4-fluorinated biphenyl in butanone
Add K2CO3 powder
Heat at reflux for 1.5 hours
Cool and add Br11OCB
Heat at reflux until completion
Cool and add water (30ml)
Extract with DCM (3x40ml)
Combine organics, dry over MgSO4 & filter
Remove solvent in vacuo
Fuse compound to silica & column in ether/petrol
Process operations: Weigh, Measure, Add, Reflux, Cool, Liquid-liquid extraction, Dry, Filter (Buchner), Remove solvent by rotary evaporation, Fuse, Column chromatography, Annotate
Materials in the process record: sample of 4-fluorinated biphenyl, butanone, sample of K2CO3 powder, sample of Br11OCB, water, DCM, MgSO4, silica
Other labels: 3 of 40, excess, Ether/Petrol ratio, text, image
Annotations:
Butanone dried via silica column and measured into 100ml RB flask. Used 1ml extra solvent to wash out container.
Started reflux at 13.30. (Had to change heater stirrer) Only reflux for 45min, next step 14:15.
Inorganics dissolve 2 layers. Added brine ~20ml.
Organics are yellow solution
Washed MgSO4 with DCM ~ 50ml
Web
services
Semantic
mark-up
Workflows
Text mining
Bioinformatics
Publications have to be computational services: web services
They will be read and processed by machines
Licensing that works!
Authorisation, Authentication and digital rights management (e.g. Shibboleth)
Integration of data and publications
Workflows are linking results, whatever the source
Common ids and persistent ids for citation (DOI, LSID, InChI)
No silos
[Diagram labels: Web services, Semantic mark-up, Text mining, Workflows, Bioinformatics]
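When a workflow links results from different sources, it first has to recognise which identifier scheme a citation uses. The Python sketch below shows one way to do that. The regular expressions are simplified approximations chosen for illustration, not the full syntax rules of the DOI, LSID or InChI specifications, and the LSID example value is hypothetical.

```python
import re

# Simplified patterns; each is an approximation of the real scheme.
PATTERNS = {
    "DOI": re.compile(r"^10\.\d{4,9}/\S+$"),
    "LSID": re.compile(
        r"^urn:lsid:[^:\s]+:[^:\s]+:[^:\s]+(:[^:\s]+)?$", re.IGNORECASE
    ),
    "InChI": re.compile(r"^InChI=1S?/\S+$"),
}

def classify_identifier(value: str) -> str:
    """Return the name of the first scheme whose pattern matches, else 'unknown'."""
    value = value.strip()
    for scheme, pattern in PATTERNS.items():
        if pattern.match(value):
            return scheme
    return "unknown"

for candidate in [
    "10.1000/182",
    "urn:lsid:example.org:namespace:object123",  # hypothetical LSID
    "InChI=1S/H2O/h1H2",                         # InChI for water
    "see page 12",
]:
    print(candidate, "->", classify_identifier(candidate))
```

A real service would go on to resolve each identifier. Classification is only the step that decides which resolver to ask.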

Semantic publishing at source
In order to automate we need better ways of interpreting the publication content
They will be read and processed by machines
Common vocabularies
Accessible full texts for text mining
Not just abstracts.
Integration of data and publications
[Diagram labels: Web services, Semantic mark-up, Text mining, Workflows, Bioinformatics]
Semantic markup
Provenance
Data
Publications
Workflows
Bioinformatics
Publish workflows with data with publications
Privacy? Intellectual property?
Licensing models for services, so that results and workflows can be reused and shared.
Take home
Machines are reading your journals, not just people
And if the journals are not online then they go unread
Workflows are another form of outcome to publish alongside data, metadata and publications
Google rocks – I don’t use anything else!
http://www.mygrid.org.uk
http://www.ukoln.ac.uk/projects/ebank-uk/
http://www.combechem.org
Acknowledgements
The myGrid Team, esp. Tom Oinn, Chris Wroe, Antoon Goderis, Andy Brass, Paul Fisher, Hannah Tipney, May Tassabehji, Rob Gaizauskas, Ian Roberts
Discovery Net / Inforsense: Vasa Curcin, Moustafa M Ghanem
BioBank / CombeChem: David De Roure, Liz Lyon
Scientists: Peter Murray-Rust, Judith Blake, Mike Ashburner
Digital Library workflows
Workflows for data capture, deposit, preservation, citation, discovery, mining & …
Multiple workflows interacting together
Workflows may call on each other, in a defined order
Multiple workflows may use “common” services, e.g. Assign (identifier)
Require sequential or parallel execution, have dependencies, be time-limited, repetitive
Have an owner (control)
Include essential human interventions
???
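Workflows that call on each other in a defined order, with some steps free to run in parallel, can be modelled as a dependency graph. The Python sketch below uses the standard library's graphlib module (Python 3.9 or later) to group workflows into batches. The workflow names echo the stages listed above, but the dependency edges are assumptions made for illustration.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Each key depends on the workflows in its set (hypothetical edges).
dependencies: Dict[str, Set[str]] = {
    "deposit": {"capture"},
    "assign_identifier": {"deposit"},
    "preservation": {"deposit"},
    "citation": {"assign_identifier"},
    "discovery": {"assign_identifier"},
    "mining": {"discovery"},
}

def execution_batches(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group workflows into batches; members of one batch can run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies loop
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose inputs are done
        batches.append(ready)
        sorter.done(*ready)
    return batches

for number, batch in enumerate(execution_batches(dependencies), start=1):
    print(number, batch)
```

Ownership, time limits and human interventions are not modelled here. They would need extra state attached to each workflow.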