Tales from the crypt - CRBS Confluence Wiki
Melissa Haendel
Ontology Development Group
Friday December 13, 2013
I must confess….
I am a biologist
http://www.magnificentbastard.com/images/features/confessional-office-big.jpg
http://naodignity.deviantart.com/art/Maniac-Biologist294084420
Interpret squishy things.
1 | Information propagation tales
2 | The ingredients of science
3 | Data isn’t always what it seems
Information propagation tales
History of p53
Approximate # of publications about p53: 1 (1979), 100 (1989), 15k (1999), 70k (2013)
p53 co-immunoprecipitates with SV40 antigen. Viruses are the cause of cancer! It’s an oncogene!
Wait! Oncogenes promote cell proliferation and WT p53 does not…
It’s a tumor suppressor gene!
p53 inactivation contributes to the majority of human cancers
Adapted from: http://www.nature.com/scitable/resource?action=showFullImageForTopic&imgSrc=content/ne0000/ne0000/ne0000/ne0000/14577857/nature-review-cancer-molbio_p53-timeline.pdf&isPDF=yes
Why is this question hard to
answer?
Is it true or do we just believe it because
everyone else does?
How do we transcend “follow the leader”? What
tools can we build to help us?
Assertion:
“β amyloid, known for its role in
injuring brain in Alzheimer’s
disease, is also produced by and
injures skeletal muscle fibres in the
muscle disease sporadic inclusion
body myositis.”
Greenberg, 2009
BMJ 2009;339:b2680 doi:10.1136/bmj.b2680
All 242 papers point to 4 from same lab, and
very few to the ones with negative results
Supplemental Data
GEO:GSE7762
Drug Related Database
Looking at the same dataset in different places
Alignment of the raw data
(never mind that the gene IDs had to be mapped to strings)
GEO dataset in Gemma and DRG
Genes differentially expressed (~8,000 gene comparisons): Increased 4,264; Decreased 3,833; Differential 8,133
Genes significantly expressed or unchanged (~13,000 gene comparisons): Increased 1,640; Decreased 1,110; Differential 2,920
Both resources recorded 95% confidence intervals for significance
Incongruous results
Source: Gemma | Gene: Adora2a | Location: Striatum | Conditions: PBS vs Chronic Morphine | Expression: Increased
Source: DRG | Gene: Adora2a (Adenosine A2a receptor) | Location: Striatum | Conditions: Morphine vs Saline; Chronic Morphine; Morphine 3x/day for 4 days vs Saline | Expression: Decreased
Data analysis in Gemma was control vs drug
Data analysis in DRG was drug vs control
=> opposite expression results!
Once the inverted comparison was corrected, there were still 50% differences in
what was significantly up- or down-regulated
Cachat et al 2012
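The sign flip described above can be illustrated with a small sketch. The expression values, the contrast labels, and the function names here are hypothetical and chosen only for illustration; they are not taken from Gemma, DRG, or the dataset itself.

```python
import math

def log2_fold_change(numerator_mean, denominator_mean):
    """log2 ratio of two group means; the sign depends on which group is the numerator."""
    return math.log2(numerator_mean / denominator_mean)

def direction(lfc):
    """Turn a log2 fold change into the label a resource would display."""
    if lfc > 0:
        return "increased"
    if lfc < 0:
        return "decreased"
    return "unchanged"

def harmonize(lfc, contrast):
    """Re-express a fold change as drug vs control, given the recorded contrast."""
    if contrast == "drug_vs_control":
        return lfc
    if contrast == "control_vs_drug":
        return -lfc
    raise ValueError(f"unknown contrast: {contrast}")

# Hypothetical mean expression of one gene in each group (illustrative numbers only).
control_mean = 100.0
drug_mean = 200.0

drug_vs_control = log2_fold_change(drug_mean, control_mean)
control_vs_drug = log2_fold_change(control_mean, drug_mean)

print(direction(drug_vs_control))   # same data, reported as increased
print(direction(control_vs_drug))   # same data, reported as decreased
print(harmonize(control_vs_drug, "control_vs_drug") == drug_vs_control)
```

The point of `harmonize` is that the correction is only possible when the contrast direction is recorded alongside the result, which is a provenance question rather than a statistical one.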
We need better tools to track
provenance and compare data
At a large, not-to-be-named
consortial project:
Labeled with acetylated histone target
RNA-seq
A year later, well, actually a tri-methylated histone target had been used.
Can we retract the data?
No, let’s overwrite the old data with the new data.
Well that would be BAD! Deprecate, please? Nope.
Enter a warning in the description field? OK…but no one reads these.
=> Currently this data has been removed from the system
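One way to picture the "deprecate, please" option is a record that keeps every version and marks the old one as superseded instead of overwriting it. This is a minimal sketch under assumed names: `DatasetRecord`, `DatasetVersion`, the accession string, and the descriptions are all hypothetical and do not describe the consortium's actual system.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DatasetVersion:
    version: int
    description: str
    deprecated: bool = False
    deprecation_reason: Optional[str] = None
    superseded_by: Optional[int] = None

@dataclass
class DatasetRecord:
    accession: str
    versions: List[DatasetVersion] = field(default_factory=list)

    def add_version(self, description):
        """Append a new version; earlier versions are never modified or removed here."""
        new = DatasetVersion(version=len(self.versions) + 1, description=description)
        self.versions.append(new)
        return new

    def current(self):
        """Return the newest version that has not been deprecated."""
        live = [v for v in self.versions if not v.deprecated]
        if not live:
            raise LookupError(f"{self.accession} has no non-deprecated version")
        return live[-1]

    def correct(self, reason, new_description):
        """Deprecate the current version with a reason and add its replacement."""
        old = self.current()
        new = self.add_version(new_description)
        old.deprecated = True
        old.deprecation_reason = reason
        old.superseded_by = new.version
        return new

# Hypothetical accession and descriptions, loosely following the story above.
record = DatasetRecord("EXAMPLE-0001")
record.add_version("labeled with acetylated histone target")
record.correct("target was mislabeled", "labeled with tri-methylated histone target")

print(record.current().description)
print(record.versions[0].deprecated, record.versions[0].superseded_by)
```

Anyone who already downloaded version 1 can still look it up and see why it was withdrawn and what replaced it, which neither overwriting nor silent removal allows.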
How reproducible is science?
Let’s start simple.
Do we know what the ingredients were?
Antibody usage was reported in
46 publications (according to
vendor)
17% (8/46) describe Ab usage
in brain
0 describe Ab usage in spleen
6% (3/46) describe Ab usage in
liver
Prior results are not being considered in
later work
How identifiable are resources in the
published literature?
Gather journal
articles
5 domains:
Immunology
Cell biology
Neuroscience
Developmental biology
General biology
3 impact factors:
High
Medium
Low
84 Journals
707 antibodies
248 papers
104 cell lines
258 constructs
437 model
organisms
210 knockdown
reagents
An experiment in reproducibility
http://biosharing.org/bsg-000532
Vasilevsky et al,
2013, PeerJ
Approximately 50% of resources
were identifiable
Partnership between
Science Exchange, PLOS,
FigShare, Mendeley, and
some of us scientists
$1.3 million grant from the Laura and John
Arnold Foundation to validate 50 landmark
cancer biology studies
Resources reported in the 50 Reproducibility
Initiative studies show similar results
Reproducibility Initiative
(preliminary)
Vasilevsky et al., 2013, PeerJ
On average, approximately 60% of the
resources are unidentifiable
Treatment with peptide X and two of its
isomers inhibits Leishmania growth
The Reproducibility Initiative
attempted to reproduce this study
Tried to replicate the primary finding, not the
other experiments (funding constraints)
Experiment showed similar dose response, but
at 10X concentration
There was no negative control
The Leishmania strain turned out to be a
different one
The peptides turned out to be amidated
http://www.economist.com/news/briefing/21588057-scientists-think-science-self-correcting-alarming-degree-it-not-trouble
Replication?
Reproducibility?
Why do we even have an expectation that
science should be reproducible?
What does reproducibility mean to the work that you
do and the data you use?
How do we support replication/reproducibility for
bioinformatics analyses?
How can we better record data provenance?
Could we have changed these stories if we had
better tools/processes?
How can we promote both forward and backward
propagation of information?
Questions, not answers
Data isn’t always what it seems
Symptoms:
• Unusual posturing of the
right upper and lower
extremities
• Shaking or quivering of
right foot
• Glassy eyed appearance
• Occurrence of 5-6 spells
per day
Mink and Neil, 1995
Case Report
Evaluations by a neurologist and
a geneticist included:
awake and asleep
electroencephalogram
serum sodium, calcium,
magnesium, and glucose
complete blood count
liver panel
thyroid panel
urine organic acids
Diagnosed and treated for
epilepsy
The next neurologist
made the diagnosis
based on a film of
the girl.
She was masturbating.
A normal, but unusual early childhood
behavior
March7 mutant brains
Corpus callosum neurons don’t cross the midline
On a different background, they die in utero.
Context is everything
A typical genotype:
Foxd3m188 (AB)
A typical phenotype:
dorsal root ganglion present
in fewer numbers in organism
There are 49,785 genotypes in MGI and 74,385
zebrafish in ZFIN with phenotypes recorded
How much of the phenotype data
that we operate on is actually
due to background effects?
NIH Undiagnosed Disease Program
Launched May, 2008 as a 5 year pilot project with two
main objectives:
Public Service
Provide answers to patients with mysterious
conditions that had long eluded diagnosis
Biomedical Research
Advance medical knowledge by providing insight
into human physiology and the genetics of rare
and common diseases
Each patient is a research project
unto themselves
Semantic similarity enables cross-species phenotype comparisons
Clinician's term / more general shared term / Researcher's term:
Resting tremors / abnormal motor function / stereotypic behavior
REM disorder / sleep disturbance / abnormal EEG
Shuffling gait / abnormal locomotion / decreased stride length
Unstable posture / abnormal coordination / poor rotarod performance
Neuronal loss in Substantia Nigra / CNS neuron degeneration / axon degeneration
Constipation / abnormal digestive physiology / decreased gut peristalsis
Hyposmia / abnormal olfaction / failure to find food
Improved exome prioritization of disease genes through cross species phenotype comparison. Robinson et al. Genome Res. 2013
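The idea that a clinician's term and a researcher's term can be matched through a more general shared term can be sketched with a toy graph. The `PARENTS` mapping below is a hypothetical, hand-made hierarchy that borrows a few labels from the slide; it is not the real structure of any phenotype ontology, and the approach is a simplification rather than the method of Robinson et al.

```python
# Hypothetical toy hierarchy: each term maps to its more general parent terms.
PARENTS = {
    "shuffling gait": ["abnormal locomotion"],
    "decreased stride length": ["abnormal locomotion"],
    "hyposmia": ["abnormal olfaction"],
    "failure to find food": ["abnormal olfaction"],
    "abnormal locomotion": ["phenotype"],
    "abnormal olfaction": ["phenotype"],
    "phenotype": [],
}

def ancestors(term):
    """Return the term itself plus everything reachable by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen

def depth(term):
    """Length of the longest path from the term up to a root; deeper means more specific."""
    parents = PARENTS.get(term, [])
    return 0 if not parents else 1 + max(depth(p) for p in parents)

def most_specific_common_ancestor(term_a, term_b):
    """Pick the deepest term that both inputs share as an ancestor, if any."""
    common = ancestors(term_a) & ancestors(term_b)
    return max(sorted(common), key=depth) if common else None

print(most_specific_common_ancestor("shuffling gait", "decreased stride length"))
print(most_specific_common_ancestor("shuffling gait", "hyposmia"))
```

In this toy graph the gait pair meets at "abnormal locomotion", while the gait and olfaction terms only meet at the root, so the first pair counts as much more similar. Depth is used here as a crude stand-in for specificity.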
“Expanding” the phenotypic coverage of the human genome
[Bar chart: % human coding genes (0% to 100%) for OMIM and OMIM+GWAS, split into Human only, Human+Ortholog, and Ortholog only]
Five model organisms provide almost 80%
phenotypic coverage of the human genome
What constitutes an adequate
phenotype annotation for an
undiagnosed patient?
Defining a minimum phenotype annotation:
1. Is the UDP annotation specificity similar to or better
than the corpus of available data?
2. Is the number of UDP annotations/patient similar or
better?
3. How do the ontology and annotation set differ across
anatomical systems in terms of granularity? Does this
change specificity requirements for UDP phenotypic
profiles?
4. How does use of NOT annotations help further specify
the uniqueness of a UDP patient? Temporal
presentation?
UDP phenotype annotations
UDP annotations have a similar information
content (IC) and a larger average number of
annotations per disease/patient
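Information content is commonly defined as the negative log of how often a term is used in a corpus, so rarely used terms carry more information. The sketch below assumes that common definition and a hypothetical four-disease corpus. The talk does not say exactly how its IC values were computed, and this version ignores propagation of annotations up the ontology.

```python
import math
from collections import Counter

def information_content(annotations):
    """Map each term to IC in bits: log2(total entities / entities annotated with the term)."""
    total = len(annotations)
    counts = Counter(term for terms in annotations.values() for term in set(terms))
    return {term: math.log2(total / count) for term, count in counts.items()}

def mean_ic(terms, ic):
    """Average IC of one annotation profile, e.g. a single disease or patient."""
    return sum(ic[term] for term in terms) / len(terms)

# Hypothetical corpus: entity -> set of phenotype terms (illustrative only).
corpus = {
    "disease_A": {"abnormal locomotion", "abnormal olfaction"},
    "disease_B": {"abnormal locomotion"},
    "disease_C": {"abnormal locomotion", "abnormal EEG"},
    "disease_D": {"abnormal locomotion"},
}

ic = information_content(corpus)
print(ic["abnormal locomotion"])   # used by every entity, so it distinguishes nothing
print(ic["abnormal olfaction"])    # used by one entity in four
print(mean_ic(corpus["disease_A"], ic))
```

Comparing the mean IC and the annotation count of a patient profile against the same figures for the corpus is one way to ask whether the profile is at least as specific as the data it will be matched against.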
Anatomical annotation
distribution in the corpus
Nervous system, skeletal system, and immune system are
highest => these categories require greater specificity and
numbers of annotations for UDP patients
So what is the minimum phenotype
standard?
It depends.
It depends on a) the species, b) the anatomical system,
and c) the state of the ontologies and the annotation
data
Need a minimum phenotyping standard that can
adapt to different phenotypic profiles and a changing
data landscape
Can be defined using informatics and tools built to
help support better usage of the data
Conclusions
Be critical of data and where it came from
There is a need to reinterpret data when new
information comes to light
Reproducibility depends on many things,
including very basic things
Retrospective and prospective efforts are needed
to ensure data quality, consistency, and utility
You can help!
Acknowledgements
Ontology Development Group: Nicole Vasilevsky, Matthew Brush, Robin Champieux
Lawrence Berkeley Laboratory: Nicole Washington, Chris Mungall
UCSD/NIF: Anita Bandrowski, Maryann Martone