Experiences from the NEUROSCIENCE


Maryann E. Martone, Ph.D.
Neuroscience Information Framework
University of California, San Diego
Themes

Computers are now partners with humans in reading the literature
 Search
 Summarization
 Linking
 Discovery

The scientific paper starts with the materials and methods
 All observations, claims etc flow from experimental design and
materials
 If authors do not provide this information in the first place, then we
can’t use it to improve all of the above

Scientists produce articles for each other, not for
computers
 Not everything you need to interpret the paper is in the paper
 More information may be there than is in the text

NIF is an initiative of the NIH Blueprint consortium of
institutes
 What types of resources (data, tools, materials, services) are available to
the neuroscience community?
 How many are there?
 What domains do they cover? What domains do they not cover?
 Where are they?
○ Web sites
○ Databases
• PDF files
• Desk drawers
○ Literature
○ Supplementary material
 Who uses them?
 Who creates them?
 How can we find them?
 How can we make them better in the future?
NIF provides a wealth of practical information on data and resource
issues in neuroscience
http://neuinfo.org
The Neuroscience Information Framework: Discovery and
utilization of web-based resources for neuroscience
UCSD, Yale, Cal Tech, George Mason, Washington Univ

A portal for finding and using neuroscience resources

A consistent framework for describing resources
Provides simultaneous search of multiple types of information,
organized by category
Literature: 22 mil
Data Federation: 350 mil
Resource Registry: 5000
Supported by an expansive ontology for neuroscience
Utilizes advanced technologies to search the “hidden web”
Supported by NIH Blueprint
In an ideal information system, we
would be able to find…

What is known
 “What studies used my monoclonal mouse antibody
against actin in humans?”
 “What phenotypes are associated with each mouse
model of Spinal Muscular Atrophy?”
 “What upregulates SMN1?”

What is not known
 Connect information to infer plausible hypotheses
○ Genotype-phenotype
○ Possible drug targets
 Information gaps
Whither biological information?
What is potentially knowable
What is known: Literature, images, human knowledge
What is easily machine processable and accessible
∞
CA2: Ion, Brain Part or Gene?
BioGrid, Allen Brain Atlas, Brain Info
NIF queries across 170+ independent databases
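The CA2 question can be illustrated with a small lookup. This is a minimal sketch, assuming a hypothetical sense table written for this example; it is not a NIF data structure, and the three entries are simply the three readings named in the slide title.

```python
# Hypothetical sense table: the three readings named on the slide.
# Entries are illustrative and do not reflect any NIF data structure.
SENSES = {
    "ca2": {
        "ion": "calcium ion (Ca2+)",
        "brain part": "hippocampal field CA2",
        "gene": "carbonic anhydrase 2",
    }
}


def interpret(term, category=None):
    """Return every known sense of a term, or only the one in `category`."""
    senses = SENSES.get(term.lower(), {})
    if category is None:
        return senses
    return {category: senses[category]} if category in senses else {}


ambiguous = interpret("CA2")                 # bare string: three candidates
resolved = interpret("CA2", category="gene")  # category narrows it to one
print(len(ambiguous), resolved)
```

A bare string returns all three candidates; only an explicit category (or, better, a unique identifier) narrows the result to one.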
Papers are the currency of science
Despite the wealth of data out there (> 2500
databases on-line), the majority of data is still
published in papers
 But...we write for other humans to consume and
information continues to be hard to find

 Even for humans, however, it is difficult to find and verify basic
information about a paper critical for interpretation
 What is the subject of the study
 What reagents were used
 What genes were studied

A lot of information is missing from papers
 Not all data is available
 Data is published in papers in forms that are difficult to use
Mining the literature for resources

Resources: Materials, services, tools, data
 Project 1: Find materials: antibodies and
transgenic animals
 Project 2: Mine supplemental data in papers
showing gene expression changes in drug abuse

Purpose
 Find new resources
 Track usage of existing resources
 Link resources to other useful information
Linking resources: Link out broker
Use case: antibodies

Pilot project: use text mining to identify antibodies used in
studies. We wanted a project that research scientists would
immediately understand

Antibodies are used routinely to identify proteins and other
molecules in basic and translational studies

Antibodies are a large source of experimental variability in
results
 Same antibody can give you very different results
 Different antibodies to the same protein can give you very different results

Neuroscientists spend a lot of time tracking down
antibodies and troubleshooting experiments that use
antibodies
Our reagents and methods are
not perfect
“We note that many of the findings in the literature about neuronal NF-κB are
based on data garnered with antibodies that are not selective for the NF-κB
subunit proteins p65 and p50. The data urge caution in interpreting studies of
neuronal NF-κB activity in the brain.”
--Herkenham et al., J Neuroinflammation. 2011; 8: 141.
Antibodies are complex entities

Anti-ChAT antibody
 Raised against a portion of choline acetyltransferase
 Raised in a particular species
 Is polyclonal or monoclonal
 Is affinity purified or not
 Recognizes the target in some species, e.g., human

Reported in materials and methods:
Tissue sections were blocked with 5% serum and incubated overnight
at 4 °C with the following primary antibodies: anti-ChAT (1:100;
Millipore, Billerica, MA), anti-Bax (1:50; Santa Cruz), anti-Bcl-xl (1:50;
Cell Signaling), anti-neurofilament 200 kDa (1:200; Millipore) ...
“Find studies that used a rabbit polyclonal antibody
against GFAP that recognizes human in
immunocytochemistry”
NIF Antibody Registry: database of > 900,000 antibodies
(AB_310775)
Paz et al., J Neurosci, 2010
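A query like the one quoted above only works if each antibody is stored as a structured record. The following is a minimal sketch, assuming hypothetical field names and made-up example records; it does not reproduce the NIF Antibody Registry schema, and the identifiers are placeholders rather than real Registry IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Antibody:
    """Hypothetical record; field names are illustrative only."""
    record_id: str            # placeholder, not a real Registry identifier
    target: str               # protein the antibody was raised against
    host_species: str         # species the antibody was raised in
    clonality: str            # "polyclonal" or "monoclonal"
    reactivity: tuple = ()    # species in which the target is recognized


def find_antibodies(records, target, host_species, clonality, recognizes):
    """Return the records matching all four properties (case-insensitive)."""
    return [
        ab for ab in records
        if ab.target.lower() == target.lower()
        and ab.host_species.lower() == host_species.lower()
        and ab.clonality.lower() == clonality.lower()
        and recognizes.lower() in (s.lower() for s in ab.reactivity)
    ]


# Made-up records for illustration only.
records = [
    Antibody("EXAMPLE_1", "GFAP", "rabbit", "polyclonal", ("human", "mouse")),
    Antibody("EXAMPLE_2", "GFAP", "mouse", "monoclonal", ("human",)),
    Antibody("EXAMPLE_3", "ChAT", "rabbit", "polyclonal", ("rat",)),
]

matches = find_antibodies(records, "GFAP", "rabbit", "polyclonal", "human")
print([ab.record_id for ab in matches])
```

Linking a paper to such a record by a unique ID is what lets the free-text question be answered as a filter over properties.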
Searching for resources in literature
NIF recently implemented a section-specific search
 Semi-automated resource identification pipeline
 Paul Sternberg, Yuling Li, Cal Tech
Annotation of antibodies
DOMEO annotation tool: Paolo Ciccarese; Tim Clark, MGH
• Allows annotation of entities and key relationships:
• Protocol
• Subject of protocol
• Links antibodies to a database of antibodies that contains their properties
• NIF Antibody Registry
• 900,000 antibodies
• Unique ID
http://antibodyregistry.org
http://annotationframework.org/
What studies used my monoclonal mouse
antibody against actin in humans?



Subject is Human
Midfrontal cortex tissue samples from neurologically
unimpaired subjects (n = 9) and from subjects with AD
(n = 11) were obtained from the Rapid Autopsy Program
Immunoblot analysis and antibodies
mAb = monoclonal antibody
The following antibodies were used for immunoblotting:
actin mAb (1:10,000 dilution, Sigma-Aldrich); -tubulin mAb (1:10,000,
Abcam); T46 mAb (specific to tau 404–441, 1:1000, Invitrogen); Tau-5 mAb
(human tau 218–225, 1:1000, BD Biosciences) (Porzig et al., 2007); AT8
mAb (phospho-tau Ser199, Ser202, and Thr205, 1:500, Innogenetics);
PHF-1 mAb (phospho-tau Ser396 and Ser404, 1:250, gift from P. Davies);
12E8 mAb (phospho-tau Ser262 and Ser356, 1:1000, gift from P. Seubert);
NMDA receptors 2A, 2B and 2D goat pAbs (C terminus, 1:1000, Santa
Cruz Biotechnology)…
Tracking down reagents
Feng et al., MATH5 controls the acquisition of multiple retinal cell fates, Mol Brain. 2010; 3: 36
Space limitationsContent gets
separated in space and time
Practices are designed to save
space, improve readability and
save authors typing


But...electrons are cheap
Cut and paste is cheap
 Re-examining plagiarism in the age of
cut and paste

Autocomplete is cheap
 Acronyms and abbreviations
 Are there any unique 3-letter strings?

Formats are flexible
 What the computer sees and what
humans see don’t have to be the same
thing
Try this, Watson!
• 95 antibodies were identified in 8 articles
• 52 did not contain enough information to determine
the antibody used
• Some provided details in another paper
• And another paper, and another...
• Failed to give species, clonality, vendor, or catalog number
• But, many provided the location of the vendor
because the instructions to authors said to do so
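The gaps listed above could be flagged automatically at submission time. This is a minimal sketch, assuming hypothetical field names; the example mention uses only the details given for anti-ChAT in the methods excerpt quoted earlier (dilution, vendor, vendor location).

```python
# Fields the slide says were often missing; names are hypothetical.
REQUIRED_FIELDS = ("host_species", "clonality", "vendor", "catalog_number")


def missing_fields(mention):
    """Return the required fields that are absent or empty in a mention."""
    return [
        name for name in REQUIRED_FIELDS
        if not str(mention.get(name) or "").strip()
    ]


# What the quoted methods text reports for anti-ChAT:
# "anti-ChAT (1:100; Millipore, Billerica, MA)"
mention = {
    "target": "ChAT",
    "dilution": "1:100",
    "vendor": "Millipore",
    "vendor_location": "Billerica, MA",
}

gaps = missing_fields(mention)
print(gaps)
```

The vendor's city is present, but the three properties needed to identify the reagent are reported as missing.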
Subject of study

Often not explicit:
 “patients with AD” = human
 Type III SMA mice (Smn−/−, SMN2+/−) were produced as previously
described (Tsai et al., 2006a).

Official strain nomenclature of animals not designed for search
 SMN2Ahmb89tg/tg;SMNΔ7tg/tg:Smn1−/−; no unique identifier assigned
 Many lines of transgenics are generated and described within a single
paper; it is difficult to relate individual findings to the correct animal
line, yet the lines are not all equivalent
Three lines of transgenic mice, M1, M2, and M3, were produced (Fig. 1B).
Transgene expression was found in all tissues studied, with widespread
high expression in line M1, high expression in brain of line M3, and
relatively low expression in brain of line M2 (Fig. 1C). (Ripps et al., PNAS
USA Vol. 92, pp. 689-693, January 1995)
Which mouse did you use?

“Transgenic mice expressing SOD1G93A (12)
were purchased from Jackson Laboratory”
 12 = Gurney ME; et al. 1994. Motor neuron degeneration in mice that
express a human Cu,Zn superoxide dismutase mutation [see
comments] [published erratum appears in Science 1995 Jul
14;269(5221):149] Science 264(5166):1772-5.
 Search NIF/Jackson lab for “Gurney SOD”
○ 7 entries for same producer
○ 3 track to the same reference

Gogliotti et al, Biochem Biophys Res Commun. 2010
January 1; 391(1): 517.
 “Here we report our findings for the SMA mouse model that has been
deposited by the Li group from Taiwan. These mice, JAX stock number
TJL-005058, are homozygous for the SMN2 transgene,
Tg(SMN2)2Hung, and a targeted Smn allele that lacks exon 7,
Smn1tm1Hung.”
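When authors do report identifiers, as in the Gogliotti et al. passage, a machine can pick them out with simple patterns. The following is a minimal sketch; the two regular expressions are assumptions inferred only from the two identifiers that appear in these slides (AB_310775 and TJL-005058), not official format specifications.

```python
import re

# Patterns inferred from the examples in the slides; treat as assumptions.
ID_PATTERNS = {
    "antibody_registry": re.compile(r"\bAB_\d+\b"),
    "jax_stock": re.compile(r"\bTJL-\d+\b"),
}


def find_identifiers(text):
    """Return every identifier of each known kind found in the text."""
    return {kind: pattern.findall(text) for kind, pattern in ID_PATTERNS.items()}


text = (
    "These mice, JAX stock number TJL-005058, are homozygous "
    "for the SMN2 transgene"
)
found = find_identifiers(text)
print(found)
```

No comparable pattern exists for "Transgenic mice expressing SOD1G93A (12) were purchased from Jackson Laboratory", which is why that sentence leads to seven candidate entries.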
Minimal metadata standards (really) for
publishing in the 21st century
1) Provide gene accession numbers for all genes
referenced in the methods section of a paper, per
http://www.ncbi.nlm.nih.gov/gene
2) Identify (i.e., give ID) the species for the subject
of a study, and from which each gene product is derived,
using the NCBI taxonomy and the strains from the model
organism databases for mice, rats, worms, zebrafish and
drosophila, employing any existing unique identifiers and
correct species-specific nomenclature:
3) Provide catalog numbers and vendor
information for all reagents and animals described
in the methods section of a paper

Developed by the Link Animal Model to Human
Disease Initiative (LAMHDI) consortium:

Journal of Comparative Neurology: Requires complete
characterization of antibody as stated in instructions to authors
• 90% of antibodies had a catalog #; 20% had a lot number after
these policies were instituted
• NIF could automatically identify 80% of these antibodies
through matching with NIF Antibody Registry
Project 2: Extracting data from
tables and supplementary material
Challenge: Extract data on gene
expression in brain from studies relevant
to drug abuse
 Workflow:
○ Find articles
○ Extract results from tables
○ Standardize results
○ Load into NIF
Drug related gene database: 140 tables from 54 articles
Andrea Arnaud-Stagg, Anita Bandrowski
Extracting additional knowledge from
supplementary material
Gene for tyrosine hydroxylase has increased expression in locus
coeruleus of mouse compared to control when given chronic morphine
Translations:
Upregulated p < 0.05 = increased expression
LC = locus coeruleus
Probe ID = gene name
J Neurosci. 2005 Jun 22;25(25):6005-15.
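The three translations above amount to a small standardization step applied to each table row. This is a minimal sketch, assuming hypothetical column names, a placeholder probe ID, and a made-up p value; the "downregulated" branch is an added assumption that mirrors the slide's "upregulated" rule.

```python
# Abbreviation taken from the slide; extend as needed.
ABBREVIATIONS = {"LC": "locus coeruleus"}


def standardize_row(row, probe_to_gene, alpha=0.05):
    """Translate one extracted table row into standardized terms."""
    direction = row["direction"].lower()
    significant = row["p_value"] < alpha
    if significant and direction == "upregulated":
        change = "increased expression"
    elif significant and direction == "downregulated":  # assumed mirror rule
        change = "decreased expression"
    else:
        change = "no significant change reported"
    return {
        "gene": probe_to_gene.get(row["probe_id"], "unmapped probe"),
        "region": ABBREVIATIONS.get(row["region"], row["region"]),
        "change": change,
    }


# Placeholder probe ID and p value, for illustration only.
probe_to_gene = {"PROBE_0001": "tyrosine hydroxylase"}
row = {"probe_id": "PROBE_0001", "region": "LC",
       "direction": "Upregulated", "p_value": 0.01}

result = standardize_row(row, probe_to_gene)
print(result)
```

A probe ID with no mapping is reported as unmapped rather than guessed, since a wrong gene name would be worse than a gap.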
Challenges working with tables and
supplemental data

Difficult data arrangements
○ PDF, JPG, TXT, CSV, XLS
○ Difficult styles: colors, symbols, data arrangements (results
combined into one column, multiple comparisons in one table,
legends defining values), unclearly described data (e.g.,
unclear significance)

Not clear what tables/values represent
 Nothing in the paper describes the supplementary data file, and the table has no heading
 Probe IDs are given but not gene identifiers
No link from supplemental material back to
article; lose provenance
 Not all results are accounted for

Is SMN1 affected by drugs of
abuse?
SMN1 is the gene that is mutated in Spinal Muscular Atrophy, a neurological disease of
children
Open world vs closed world
assumptions

Closed world assumption:
 holds that any statement that is not known to be true is false
 allows an agent to infer, from its lack of knowledge of a statement
being true, anything that follows from that statement being false
 typically applies when a system has complete control over
information

Open world assumption:
 the assumption that the truth-value of a statement is independent of
whether or not it is known by any single observer or agent to be true.
 limits the kinds of inference and deductions an agent can make to
those that follow from statements that are known to the agent to be
true
 the open world assumption applies when we represent knowledge
within a system as we discover it, and where we cannot guarantee
that we have discovered or will discover complete information.
Reporting data: Closing the open
world

We measured the expression of 9000 genes as a
function of chronic cocaine (S1). The 50 genes that
showed significantly increased expression (p < 0.01)
are shown in Table 2
 What about the other 8950 genes?
 Cannot assume that they were increased, decreased
or remained the same (Open world)

We measured the expression of 9000 genes as a
function of chronic cocaine (S1). The fold change and
p value are given for each gene. The 50 genes that
showed significantly increased expression (p < 0.01)
are shown in Table 2 (Closed world)
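The difference between the two reporting styles can be shown as a lookup that behaves differently under each assumption. This is a minimal sketch with placeholder gene names; the function and its return strings are hypothetical and only illustrate the definitions given on the previous slide.

```python
def increased_expression(gene, reported_increased, closed_world):
    """Answer 'did this gene show significantly increased expression?'

    reported_increased: set of genes listed in the paper's table.
    closed_world: True if the paper reports results for every gene measured.
    """
    if gene in reported_increased:
        return "yes"
    # Absence from the table is only informative under the closed world.
    return "no" if closed_world else "unknown"


# Placeholder gene names, for illustration only.
table_2 = {"GENE_A", "GENE_B"}

open_answer = increased_expression("GENE_C", table_2, closed_world=False)
closed_answer = increased_expression("GENE_C", table_2, closed_world=True)
print(open_answer, closed_answer)
```

With only the top 50 genes reported, a gene outside the table yields "unknown"; once fold change and p value are given for all 9000 genes, the same absence can safely be read as "no".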
Narrative vs Data publishing

Narrative (Author): Encourage use of minimal standards
for key entities in the research paper
 Subject, protocol, genes, reagents
○ Make it easy to find accession numbers
 Standard templates for reporting supplemental data?
○ Unlikely although desired
 Tools for linking in line references to fragments of papers rather
than the entire paper

Data (Curators): Structuring data requires expertise
 Positive and negative results equally important
 If data are to be published in supplemental material or in the paper,
they should be made machine interpretable
 Ideally, the entire data set should be deposited in a public
repository, e.g., the Gene Expression Omnibus (GEO)
Conclusions
Humans are storytellers; it’s fundamental to the way
we communicate

 But these stories are directed to an audience with expertise
 Scientists know each other’s work; personal networks are very important
 The computer isn’t part of this

So...we need to adapt publishing practices to aid
automated search and mining of content
 Partnership between authors, publishers, curators and computer
scientists, informaticians...
 Future of research communications and e-scholarship
 http://force11.org JOIN US!