Publishing better science through better data

Transcript

Publishing better science through better data
Data collection and validation –
perspective from an interdisciplinary editor
Dr Monica Contestabile, Senior Editor,
Nature Climate Change
Data collection and validation
What are the main issues with using
data in your research?
Data suitability
Data transparency
Data quality
Data access
Data suitability
What happens when empirical
studies suffer from inadequate data
sampling and collection procedures?
Results are either biased or too
weak (or flawed) to substantiate the
claims
Data suitability
Example – original research on China's
residential energy consumption
Background: China is the world's largest
greenhouse gas emitter, with residential
energy consumption at 11% of its total energy
consumption.
Contribution: A novel method was developed
that combines spatially resolved X night-time
lights satellite imagery with a human activity
index to monitor the distribution of CO2
emissions. The method would allow policy
makers to monitor Chinese residents' carbon
footprint.
Data suitability
Referee’s comments
The paper claims the satellite observed night time
lights can be used to model the spatial distribution
and volume of fossil fuel CO2 emissions
associated with ‘Residential Energy Consumption’.
However, the satellite here records lighting from
residential, commercial, industrial, and
transportation sectors. […] The X night time lights
would be suitable for mapping spatially distributed
(non-point source) fossil fuel CO2 emissions. The
authors' claim regarding the mapping of CO2
emissions from REC is not substantiated.
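To make the referee's objection concrete: methods of this kind typically treat night-light radiance (sometimes combined with an activity index) as spatial weights for distributing a known emissions total over grid cells. The following Python sketch is purely illustrative; the arrays and function names are hypothetical and not from the paper. It shows the downscaling step and where the objection bites, since the radiance weights also capture lighting from other sectors.

    import numpy as np

    def downscale_emissions(total_emissions, night_lights, activity_index):
        # Spread a regional emissions total over grid cells in proportion to
        # night-light radiance weighted by a human-activity index.
        weights = night_lights * activity_index
        weights = weights / weights.sum()
        return total_emissions * weights

    # Hypothetical 3x3 radiance grid and a uniform activity index
    lights = np.array([[0.0, 1.2, 3.4],
                       [0.5, 2.1, 4.0],
                       [0.1, 0.8, 2.2]])
    activity = np.ones_like(lights)

    cell_emissions = downscale_emissions(100.0, lights, activity)  # e.g. 100 MtCO2

    # The objection concerns the weights: the radiance also reflects commercial,
    # industrial and transport lighting, so attributing the downscaled total to
    # residential energy consumption alone is not substantiated.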
Data transparency
What happens if the data source and
the collection and/or manipulation
procedures are not fully documented?
The quality of the study cannot be fully
assessed; therefore, the results cannot
be trusted
Data transparency
Example 1 – original research on household-level
carbon emissions from consumption in the USA
Background: Dramatic changes in household
demography, with a rising number of single-person
households, have implications for the
consumption scaling benefits of co-habitation
Contribution: A novel method that links consumer
expenditure data to an economic input-output
life-cycle assessment (LCA) model allows estimates
of the carbon intensity of goods consumed by
U.S. households. The authors show that sizable
economies of scale in household emissions exist.
Data transparency
Referee’s comments
The authors used industry-by-industry input-output
LCA data from University X. Household
consumption, however, is on goods and services,
not on industry outputs. Therefore the
industry-by-industry account has to be converted
into a commodity-by-commodity account […]. After
a quick look, I found that the (limited)
documentation on X.net does not seem to provide
adequate information on this aspect. […]
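The conversion the referee refers to is normally handled through the supply/use (make and use) framework, in which a bridge such as a market-share matrix maps household spending on commodities onto the industries that produce them. The Python sketch below is only illustrative: the two-sector numbers, matrix names and emission factors are made up, and it shows the direct bridging step only, whereas a full input-output LCA would also propagate indirect requirements through the Leontief inverse.

    import numpy as np

    # Hypothetical two-industry / two-commodity system (values in $)
    V = np.array([[90.0, 10.0],        # make (supply) matrix: industry x commodity
                  [ 5.0, 95.0]])
    q = V.sum(axis=0)                  # total output of each commodity
    D = V / q                          # market shares: fraction of each commodity
                                       # produced by each industry

    f_industry = np.array([0.8, 0.3])  # kgCO2 per $ of industry output (made up)

    expenditure = np.array([120.0, 60.0])   # household spending per commodity, $
    demand_industry = D @ expenditure       # bridge commodity spending to industries
    emissions = f_industry @ demand_industry

    # A full EIO-LCA would also propagate indirect requirements through the
    # Leontief inverse; this shows only the direct industry-commodity bridge.
    print(round(float(emissions), 1))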
Data transparency
Example 2 – original research on attribution
of biological impacts to climate change
Background: Climate impacts research is
expected to attribute biological change to climate
change with measurable confidence. However,
estimates of confidence in climate impact
assessments are typically qualitative or weak,
making it difficult to distinguish strong from weak
attribution.
Contribution: A ‘confidence index’ for assessing
confidence in attribution of biological responses to
climate change was developed and used to
evaluate current practices in marine biological time-series studies of climate impacts.
Data transparency
Referee’s comments
The paper was quite hard to evaluate because no
data appendix with sources and confidence index
scores was provided. The authors said they used
205 papers, and over 1600 species records, but
nowhere do they provide information on what the
papers were, what geographic regions, what
decades the studies were done, so it is hard to
know whether the analyses the authors did are
representative of the field or not.
Data quality
What happens if the data used in the
analysis are not up to the quality
standards in the field?
The quality of the study is seriously
compromised, and so is its chance of
being published
What happens in the case of
interdisciplinary research?
Data quality
Example 1 – A meta-analysis of global net
primary productivity (NPP) patterns.
Background: Research on NPP has relied on
estimates from ecosystem process models and
satellite observations. Recent advances based
on field observations have been made for
tropical forests. However, spatially extensive
evidence across global terrestrial ecosystems
has not been available.
Contribution: The authors set out to fill the
above gap and claim to have found a clear
latitudinal NPP pattern.
Data quality
Referee’s comments
The true value of this work may be in the data set that
was assembled, […]
There is a surprising lack of detail in the methods section.
For example, […] I know that NPP isn't always measured
consistently […] All of this makes me wonder what these
numbers represent, and how they were generated.
Similarly, one limitation with assembling field data is that
they were inevitably all collected at different times, over
different years, etc. This should be addressed head on.
Data quality
Example 2 – original research on overestimated sea level
rise around Australia
Background: The Federal Government's Climate
Commission recently concluded that the sea level
changed by about 3.25 cm over the period 1970–1990 and
by about 5.4 mm/year over the period 1990–2010,
indicating significant acceleration in the rate of sea level
rise over the last two decades.
Contribution: The authors demonstrate that neither the
mean rates given nor the proposed sea level acceleration
can be substantiated by observational facts.
Data quality
Referee’s comments
The authors have selected a total of 39 stations without
explaining their choice and apparently without any correction,
and have computed the averaged linear trend. […] In order to
be combined, all sea level records should have the same or
very similar lengths. Linear trends derived from different tide
gauge stations with lengths varying around 20–40 years, as
is the case here, could be biased due to the inter-annual
variability. And obviously, the derived averaged trend, even if
correct, is not comparable with the trend of 5.4 mm/yr for
1990–2010, unless it is computed for the same period.
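The comparability point can be illustrated with a few lines of Python on synthetic data (the numbers below are invented and have nothing to do with the actual tide-gauge records): averaging trends fitted over each record's full, unequal length gives a different answer from fitting all records over a common period, because inter-annual variability leaks into the shorter trends.

    import numpy as np

    rng = np.random.default_rng(0)

    def trend_mm_per_yr(years, level_mm):
        return np.polyfit(years, level_mm, 1)[0]   # ordinary least-squares slope

    # Two synthetic records with the same underlying 3 mm/yr rise plus
    # inter-annual noise, but very different record lengths.
    years_long = np.arange(1970, 2011)
    years_short = np.arange(1993, 2011)
    long_rec = 3.0 * (years_long - 1970) + rng.normal(0, 25, years_long.size)
    short_rec = 3.0 * (years_short - 1970) + rng.normal(0, 25, years_short.size)

    # Averaging trends fitted over unequal record lengths
    naive = np.mean([trend_mm_per_yr(years_long, long_rec),
                     trend_mm_per_yr(years_short, short_rec)])

    # Restricting both records to the common 1993-2010 window before fitting
    mask = years_long >= 1993
    common = np.mean([trend_mm_per_yr(years_long[mask], long_rec[mask]),
                      trend_mm_per_yr(years_short, short_rec)])

    print(f"unequal lengths: {naive:.1f} mm/yr, common period: {common:.1f} mm/yr")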
Data access
What is the main concern about lack of
access to the data used in the analysis?
Obviously, reproducibility is limited, if not
impossible
The standards on data access for publication
vary significantly across disciplines and
fields
One main concern in the behavioural
sciences is the fact that experiments and/or
surveys are often run to serve the objectives
of more than a single paper
Suggestions for early-career researchers
Full transparency on data sources, collection,
sampling and manipulation is fundamental to
increasing the chances of being published; if a
journal's space constraints are strict, use the
Supplementary Online Material
At the design stage, justify the data used in
terms of the research questions (where
pre-registration of hypotheses is not a
requirement for publication, act as if it were!)
Depending on the kind of research (and the
field of research), data validation prior to the
analysis may be necessary
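One way to act on this point is to script the validation checks and run them before any analysis, so that exclusions are documented rather than ad hoc. The sketch below is a minimal example assuming a hypothetical CSV with station_id, year and value columns and made-up plausibility thresholds; the appropriate checks depend on the data standards of the field.

    import pandas as pd

    df = pd.read_csv("observations.csv")   # hypothetical file and column names

    checks = {
        "missing values": df[["station_id", "year", "value"]].isna().any(axis=1),
        "duplicate station-year records": df.duplicated(subset=["station_id", "year"]),
        "values outside plausible range": ~df["value"].between(-50, 50),
        "years outside study period": ~df["year"].between(1970, 2010),
    }

    for name, flagged in checks.items():
        print(f"{name}: {int(flagged.sum())} rows")

    # Document what is excluded and why, rather than dropping rows silently.
    clean = df[~pd.concat(list(checks.values()), axis=1).any(axis=1)]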
Suggestions for early-career researchers
Referring explicitly to the basic data
standards (for example on sampling
procedures, manipulations, experiments)
in one’s field helps the review process,
especially in interdisciplinary analyses
When large data sets are compiled that
could lead to more than a single study,
authors should fully report and give access
to the data used in the specific analysis to
allow reproducibility
Looking forward
Much is said about enhancing the societal
value of science
We can only succeed at that if we produce
high-quality science, and that calls for more
and better attention to data across fields and
disciplines
The academic community, journal editors and
publishers should work together to build on
best practices and disseminate them across
fields
Funders, journals and publishers might help,
perhaps by specifically and regularly
rewarding examples of data excellence
Thank you