Masterclass Data, understanding it, interpreting it and using it.


Master class
Data, understanding it, interpreting
it and using it.
Ruth Harrell
Liann Brookes-smith
1
Agenda









9.30am – 10.30am
10.30am break
10.45 – 11.30am
11.40 – 12.30pm
12.30 – 1.30pm lunch
1.30 – 2.30pm probability
2.30 – 2.45pm break
2.45 – 3.30pm sampling and curve
3.30 – 4.30pm confidence and risk
2
Introduction

Statistics may be defined as "a body of methods
for making wise decisions in the face of
uncertainty." ~W.A. Wallis

“There are three kinds of lies: lies, damned lies,
and statistics.” Disraeli (according to Mark Twain)

98% of all statistics are made up. ~Author
Unknown


Statistics are like bikinis. What they reveal is
suggestive, but what they conceal is
vital. ~Aaron Levenstein
If you cannot measure it, it does not exist. ~
Author unknown
3
Question to the Room






What are statistics?
Why are data important?
What do you feel about stats?
What do they tell us?
E.g. 40% of children in XX area have dental
caries; what does that tell us?
List the types of data you are aware of or use in your
day-to-day work
4
Practitioner competencies
Obtain, verify, analyse and interpret data and/or
information to improve the health and wellbeing
outcomes of a population / community / group –
demonstrating:
a. knowledge of the importance of accurate and
reliable data / information and the anomalies that
might occur
b. knowledge of the main terms and concepts used
in epidemiology and the routinely used methods for
analysing quantitative and qualitative data
c. ability to make valid interpretations of the data
and/or information and communicate these clearly
to a variety of audiences
5
Aim for the day


The aim of the day is to improve people's
understanding of the data they use, and how to
analyse and interpret it.
This session is concentrating on the data rather
than things such as the study design but we are
happy to discuss and answer questions on both;
you can’t understand what the data is telling you
without understanding how it has been collected
and the potential for bias.
6
Topics covered
1. Types of data
2. Basic probability and stats
3. Understanding how data is collected
4. Measures of odds and ratios - comparing populations and study results
5. Population sampling - good samples and bad samples
6. Understanding confidence intervals & p values - is the result reliable?
7. How I apply data to what I am doing
7
Types of data
8
Describing the data



We have a responsibility to present data in a way
that can be easily understood, and which does not
misrepresent the true meaning of the data.
Key decisions are made based on the data – or
more accurately people’s impression of the data –
so this has an impact on use of resources and
eventually on patient care.
Accurate analysis and presentation of the data
saves lives!
9
Quantitative vs. Qualitative
Quantitative data measures quantity, i.e. it is numerical.


Qualitative data is usually more descriptive and
not measured in numbers.
However, data originally obtained as qualitative
information about individual items may give rise
to quantitative data if they are summarised by
means of counts.
10
Discrete – Continuous



Discrete data can only take certain
particular values
Continuous falls on a scale.
For example height is continuous,
but the number of siblings is
discrete.
11
Nominal - Ordinal


Nominal comes from the Latin nomen, meaning
'name', and is used to describe categorical data.
There is no quantitative relationship between the
different categories (though sometimes a number
may be assigned for ease of analysis). An
example is ethnicity.
Ordinal data again describes categories but there
is some order to them - though the relationship
between them may not be well defined. For
example, Agenda for change pay scales, since
they are ordered and can therefore be put in
sequence (but there is no numerical relationship
between them).
12
Transforming the data
Sometimes the data as collected isn't in the most
effective form for display.
E.g. you have data on weight in kilos.
A list of continuous weights is not intuitive,
so you convert it to BMI categories (this also needs
each person's height), i.e. those who are
underweight, healthy weight, obese and morbidly
obese.
Continuous to ordinal.
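The continuous-to-ordinal step can be sketched as below. This is a minimal illustration, not the presenters' method: the cut-offs (18.5, 25, 30, 40) are the commonly used adult BMI thresholds and are an assumption here, the sketch adds an "overweight" band that the slide does not list, and the `people` values are made up.

```python
def bmi_category(weight_kg: float, height_m: float) -> str:
    """Turn a continuous measurement (BMI) into an ordinal category."""
    bmi = weight_kg / height_m ** 2
    # Assumed cut-offs: commonly used adult thresholds, not taken from the slides.
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "healthy weight"
    if bmi < 30:
        return "overweight"
    if bmi < 40:
        return "obese"
    return "morbidly obese"


# Hypothetical (weight in kg, height in m) pairs.
people = [(50, 1.75), (70, 1.75), (85, 1.75), (100, 1.75), (130, 1.75)]
categories = [bmi_category(weight, height) for weight, height in people]
print(categories)
```

Once the categories are stored, the original weights cannot be recovered from them, which is the loss of detail discussed on the next slide.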
13
Transforming the data (2)
With this you can display more meaningful data
BUT
You lose the detail, such as how many values sit on the edge of each
category (borderline). You can't transform it back.
What you transform it to may not be the best use of
the data.
You can also transform data using calculations such as
taking the “log” of each number; this will
sometimes convert skewed data to normally distributed
data (discussed later)
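A small sketch of the log transformation, using made-up values in which each number is ten times the previous one. The `skewness` helper is a simple moment-based measure written for this illustration (0 for symmetric data, positive for a long right-hand tail); it is not something the slides define.

```python
import math
import statistics


def skewness(values):
    """Moment-based skewness: about 0 if symmetric, > 0 for a long right tail."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical right-skewed data.
raw = [1, 10, 100, 1000, 10000]
logged = [math.log10(v) for v in raw]  # evenly spaced after the transform

print("skewness before:", skewness(raw))
print("skewness after: ", skewness(logged))
```

Real data will rarely become perfectly symmetric; the point is only that the long tail is pulled in.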
14
Exercise

Exercise 1 and 2
15
Displaying the data



What are the options?
Tables – simple descriptive, cross
tab… (mention pivot table)
Graphs – bar, line, x-y or scatter,
pie chart….
16
Basic statistics and probability


Having looked at the raw data and
carried out any transformations you
felt necessary, you now want to
describe the features of this data.
Distributions – plotting the data is
the first step in this. You need to
consider the shape of the graph
before you know how to best
analyse the data.
17
Types of graph

Normal
18
Types of graph

Skewed
19
Types of graph

Bimodal
20
Types of graph

Uniform
21
15 minute
Break!
22
Data measures
Definitions:
• Range: the difference between the highest and
the lowest values in a set
• Mean: the sum of all the values
divided by the number of values
• Median: the middle value when the values are put in order
• Mode: the value found most often
• Interquartile range: a measure of statistical
dispersion, equal to the difference between
the upper and lower quartiles
• Standard deviation: a measure of how spread
out the numbers are around the mean.
23
Mean, median and mode



Mean = (sum of observations) / (number of observations)
Mode = the most common observation
Median = the number where 50% of
observations are below and 50% are above
24
Standard Deviation and IQR


Std Dev = square root of [ sum of (difference squared between
each observation and the mean) / (number of
observations - 1) ]
IQR= the difference between the value at the
25th percentile and 75th percentile
25
Formulas


Sample mean
$\bar{x} = \left( \sum x_i \right) / n$
Sample standard deviation
$s = \sqrt{ \sum ( x_i - \bar{x} )^2 / ( n - 1 ) }$
$x_i$ is each observation
$n$ is the number of observations
Σ means ‘sum’
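The measures defined on the last few slides can be computed with Python's standard `statistics` module. The `data` values are invented for illustration. Note that `statistics.quantiles` has more than one method for locating quartiles, so other software may give slightly different quartile values for small samples.

```python
import statistics

# Hypothetical observations.
data = [2, 4, 4, 5, 7, 9, 11]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
value_range = max(data) - min(data)
sd = statistics.stdev(data)  # sample standard deviation, divides by n - 1
lower_quartile, _, upper_quartile = statistics.quantiles(data, n=4)
iqr = upper_quartile - lower_quartile

print(mean, median, mode, value_range, round(sd, 2), iqr)
```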
26
Exercise 3
27
Exercise 4
28
How reliable is my data?






Any data missing?
How old is it?
What is the denominator?
Who collected it?
How was it collected?
Ways to avoid making statements
about inaccurate data?
29
Describing data
30
Interpret the graph




This graph is a graph showing the trend of obesity
in adults from 1993 – 2007
Percentage: of what (all adults presumed, all
registered? All resident?) what age is defined as
an adult?
Is the increase due to chance or an actual
increase?
Data is quantitative/continuous
31
Bias

When looking at data, sometimes
the relationship we see is
caused by the way in which we are
measuring, not by what is actually
there.
32
Fudging











Rate or Number
You have 50 cases of COPD in area 1, and 150 cases
of COPD in area 2. Should you do something in area 2?
Area 1 has population of 2000
Area 2 has population of 5000
In area 1 rate in 50-74 year olds is 20/1000
In area 2 rate in 50-74 year olds is 42/1000
Area 1’s data was from 2004
Area 2’s data was from 2005-2009
Area 1 is 20/1000 confidence interval (12-48 per 1000)
Area 2 is 42/1000 confidence interval (18 – 56 per
1000)
Now what?
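The first step in the example above, turning counts into rates, can be sketched as follows. The case counts and populations are the ones on the slide; `rate_per_1000` is a helper name chosen for this illustration. These are crude (all-age) rates, so they differ from the age-specific rates for 50-74 year olds quoted on the slide.

```python
def rate_per_1000(cases: int, population: int) -> float:
    """Crude rate: cases per 1000 people in the population."""
    return 1000 * cases / population


area1_rate = rate_per_1000(50, 2000)
area2_rate = rate_per_1000(150, 5000)

print(f"Area 1: {area1_rate} per 1000")
print(f"Area 2: {area2_rate} per 1000")
```

Area 2 has three times as many cases, but its crude rate is only slightly higher once the population size is taken into account.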
33
Exercise



Exercise 5
What do these data tell you? Key
message?
What would you ask of these data?
What further information would you
want to know?
34
Basics of probability


Probability is a way of quantifying
the judgements that we make all
the time – from ‘do I need an
umbrella?’ to ‘shall I bet on that
horse?’
Probability is measured on a linear
scale of 0 to 1 where 0 is impossible
and 1 is absolutely certain.
35
Probability




Why is probability relevant to public health?
Probability gives us a quantitative measurement
of the chances of something happening, and there
are 2 key ways in which it is used in Public Health
It is another word for risk (or, if the outcome is
positive, benefit). For example, the probability that
someone who smokes cigarettes will get lung cancer
has been shown to be much higher than for
someone who doesn’t smoke.
It helps us to answer the question ‘how likely is it
that the observed effect is due to our intervention
not just to chance?’, and is used in all types of
studies – testing medical treatments, evaluating
the impact of public health interventions,
assessing need of one population compared to
another.
36
Probability and risk

Odds – number of events divided by the number of
non-events

Risk in exposed – number of events in the exposed divided by the
number of exposed

Risk in unexposed – number of events in the unexposed divided by
the number of unexposed


Relative risk or risk ratio is the ratio of the
probability of the event occurring in the exposed
group to that in the unexposed group
Absolute risk difference is the difference in risk between the
exposed and unexposed.
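A worked sketch of these measures, using invented counts (30 events among 100 exposed people, 10 events among 100 unexposed). `Fraction` is used only to keep the arithmetic exact; the function and variable names are chosen for this illustration.

```python
from fractions import Fraction


def risk(events: int, total: int) -> Fraction:
    """Risk = events / everyone in the group."""
    return Fraction(events, total)


def odds(events: int, total: int) -> Fraction:
    """Odds = events / non-events."""
    return Fraction(events, total - events)


# Hypothetical counts.
exposed_events, exposed_total = 30, 100
unexposed_events, unexposed_total = 10, 100

risk_exposed = risk(exposed_events, exposed_total)
risk_unexposed = risk(unexposed_events, unexposed_total)

risk_ratio = risk_exposed / risk_unexposed
risk_difference = risk_exposed - risk_unexposed
odds_ratio = odds(exposed_events, exposed_total) / odds(unexposed_events, unexposed_total)

print(risk_ratio, risk_difference, odds_ratio)
```

With these counts the odds ratio (27/7, about 3.9) is larger than the risk ratio (3); the two only come close when the event is rare.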
37
Probability cont…


What is the probability of a 6 if you
throw an unbiased dice?
What is the probability of a total of
6 if you throw two unbiased dice?
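One way to check answers to the two dice questions is to list every equally likely outcome and count the ones of interest. This sketch assumes fair six-sided dice, as the slide states.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# One die: one favourable face out of six.
p_six_one_die = Fraction(sum(1 for face in faces if face == 6), len(faces))

# Two dice: enumerate all 36 ordered outcomes and keep those totalling 6.
outcomes = list(product(faces, repeat=2))
favourable = [pair for pair in outcomes if sum(pair) == 6]
p_total_six = Fraction(len(favourable), len(outcomes))

print(p_six_one_die, p_total_six)
```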
38
Welcome back!!

I'm not an outlier I just haven't
found my distribution yet.
39
Exercise




Exercise 6
Worse and early death = 0-3/10
No change = 4-5 /10
Cure = 2-6/10
40
Population sampling (1)

In the real world we don’t usually get data from
everybody that we are interested in. Why not?

Cost and resources may be too large

People may choose to opt in or out

May have incomplete data (data entry problems
etc)
41
Population sampling (2)



So what we need to do is measure a sample of people
and infer from that sample what the population looks
like. We can do this by tweaking the statistical formula
used – but there are two things to consider;
If your sample size is too low you are unlikely to get a
reasonable result – you can still use the formula but
you need to bear this in mind when interpreting it
Think about who you have managed to sample – are
they representative of the population? (imagine
walking in to a large open plan office with a set of
scales and asking people if they would mind being
weighed – who is more likely to volunteer?)
42
Population sampling (3)


If we have a REPRESENTATIVE sample, we can
apply a statistical tweak to help us to estimate the
figure for the population.
If we don’t (if the sample is biased), though we
can carry out the maths, it will always be flawed.
43
Population sampling (4)
Principle –
• Measure your sample
• Calculate the mean and standard deviation (of the
sample)
• Calculate the standard error = standard deviation
of the sample / √n, where n is the number of measurements in the sample
• To estimate the population mean, our best guess is
that it is equal to the sample
mean
• Then we can use the standard error to estimate
how close we think our estimate is.
• First we need to talk about confidence intervals
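A short sketch of the standard error calculation, dividing the sample standard deviation by the square root of the sample size. The standard deviation of 10 and the sample sizes are invented for illustration.

```python
import math


def standard_error(sample_sd: float, n: int) -> float:
    """Standard error of the mean = sample standard deviation / sqrt(n)."""
    return sample_sd / math.sqrt(n)


sample_sd = 10.0  # hypothetical sample standard deviation
errors = {n: standard_error(sample_sd, n) for n in (25, 100, 400)}

for n, se in errors.items():
    print(f"n = {n}: standard error = {se}")
```

Quadrupling the sample size halves the standard error, which is why larger samples give tighter estimates but with diminishing returns.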
44
Which one is an Insult.





Darling, you are two standard
deviations below the mean
Of course you're normal (mean 10,
mode 7)
You are mean
Your looks are in the 80th
percentile
The difference between you and her
is a standard deviation
45
46
Probability, Population Sampling
and the Normal Curve
Thinking about our data that fitted the normal curve –
• By using the mathematical model we can easily
calculate probabilities.
The maths tells us that;
• The total area under the normal curve is equal to 1.
• The probability that any new observation will fall
within one standard deviation of the mean is 68%
• The probability that any new observation will fall
within two standard deviations of the mean is 95%
• The probability that any new observation will fall
within three standard deviations of the mean is
99.7%
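These percentages can be reproduced from the normal curve itself. The sketch below uses `statistics.NormalDist` from the Python standard library; `prob_within` is a helper name chosen for this illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def prob_within(k: float) -> float:
    """Probability that an observation falls within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {prob_within(k):.1%}")
```

The two-standard-deviation figure comes out slightly above 95%; exactly 95% corresponds to about 1.96 standard deviations, which is where that number comes from in later slides.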
47
Examples
48
CERN experiments observe particle consistent
with long-sought Higgs boson
Geneva, 4 July 2012.
“We observe in our data clear signs of a new particle, at
the level of 5 sigma, in the mass region around 126
GeV. The outstanding performance of the LHC and
ATLAS and the huge efforts of many people have
brought us to this exciting stage,” said ATLAS
experiment spokesperson Fabiola Gianotti, “but a little
more time is needed to prepare these results for
publication.”
At five-sigma there is only one
chance in nearly two million that the
result is wrong, i.e. the measurement
seen is a random fluctuation.
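The "one chance in nearly two million" figure can be checked against the normal curve. This sketch assumes the figure refers to a value more than 5 standard deviations from the mean in either direction (a two-sided tail); the slide does not say which convention was used.

```python
from statistics import NormalDist

# Two-sided tail: more than 5 standard deviations above or below the mean.
p_beyond_5_sigma = 2 * NormalDist().cdf(-5)
one_in = 1 / p_beyond_5_sigma

print(f"p = {p_beyond_5_sigma:.2e}, roughly 1 in {one_in:,.0f}")
```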
49
Confidence intervals (1)
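If we measure one individual’s IQ we can be 95% sure
that it would fall between 70 and 130 (IQ is scaled to a mean of 100
and a standard deviation of 15, so this is the mean +/- 2 standard deviations).
This ‘interval’ is called the 95% confidence interval (for individual
values it is strictly a 95% reference range).
We use 95% by convention; sometimes other figures
are used such as 98%.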


If we measure the heights of a class of children and
we have a mean of 1.2m, standard deviation of 0.1,
what is your estimate for the height of a child
randomly selected from the sample?
1.2 +/-0.2, ie 95% of this sample lies between 1.0
and 1.4m
50
Confidence intervals (2)

Reminder; the heights of a class of children have a
mean of 1.2m, standard deviation of 0.1

We measure a new child and their height is 1.5m.
What does this mean?

This is equal to mean + 3 standard deviations. This
means we had less than a 0.5% chance that we
would have this height in a child in this population.
That doesn’t mean they are not part of the
distribution (0.5% is not that rare) but you might be
sensible to check a few things to be sure they are
part of the same population (age!).
51
Confidence intervals (3)
This time we are using confidence intervals to estimate our
‘true’ population characteristics based on a sample.
• Best estimate of the mean = measured mean of sample
• Standard error of that estimate = std
deviation of sample / √(number of measurements in the
sample)
• Therefore we can say that we are 95% confident that the
mean of the population lies between the sample mean +/- 2 x standard error
This implies that;
• Our estimate of the mean gets better as n increases –
because our error gets smaller.
• This is the way we usually use confidence intervals in
public health as we usually measure a sample and infer the
population.
Examples – Health survey for England, Household surveys,
etc
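A minimal sketch of the calculation, using the slide's rule of thumb of 2 standard errors either side of the sample mean (1.96 is the more exact multiplier). The heights are invented. For a sample this small a slightly wider multiplier from the t distribution would normally be used, so treat this as an illustration of the principle only.

```python
import math
import statistics


def mean_ci_95(sample, multiplier=2.0):
    """Approximate 95% confidence interval for the population mean."""
    sample_mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    return (sample_mean - multiplier * standard_error,
            sample_mean + multiplier * standard_error)


# Hypothetical heights in metres.
heights = [1.1, 1.2, 1.3, 1.2, 1.2, 1.1, 1.3, 1.2, 1.2]
lower, upper = mean_ci_95(heights)

print(f"mean = {statistics.mean(heights):.2f} m, 95% CI ({lower:.2f}, {upper:.2f})")
```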
52

You are a significant part of my life

P value =9
53
I would never treat you differently
to your sisters




Sister 1 CI 4-9
Sister 2 CI 5-11
Sister 3 CI 4-13
ME CI 2-3
54
Comparing two samples


The important question is – is there a difference between two
populations?
This question might be asked in slightly different ways for
different types of study, but is fundamentally the same;
• For an RCT you compare the control group with the
intervention group
• For a cohort you compare the outcomes in those exposed
to a risk factor with those not exposed
• For a case-control you look at the group with the disease
and compare their risk factors to those without the
disease
• You might look at before and after an intervention was
put in place
• You might compare one city or country to another
55
Comparing two samples (2)
The important question is – is there a difference
between two populations?
56
Comparing two samples (3)

We can calculate the difference between the two populations
as;

Mean difference = mean of pop 1 – mean pop2

Confidence interval = mean difference +/- 1.96*SE


SE (standard error) is a combination of the standard errors
for each sample, built from the sample standard deviations s1 and s2
and the sample sizes n1 and n2:
$SE = \sqrt{ s_1^2 / n_1 + s_2^2 / n_2 }$
(SE can be slightly different for different situations – but this
gives you an idea)
57
T tests
Testing using t test;
• You need to know the mean and standard deviation of
both of your samples.
• You start with a hypothesis; this is that there is no
difference between the two samples (or populations)
• You then do some maths;
• $t = ( \bar{x}_1 - \bar{x}_2 ) / SE$
• where $SE = \sqrt{ s_1^2 / n_1 + s_2^2 / n_2 }$, with $\bar{x}_1$, $\bar{x}_2$ the sample
means, $s_1$, $s_2$ the sample standard deviations and $n_1$, $n_2$ the sample sizes
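The t calculation above can be sketched directly from the two samples. The groups below are invented, and `t_statistic` is a name chosen for this illustration; it uses the unpooled standard error exactly as written on the slide.

```python
import math
import statistics


def t_statistic(sample1, sample2):
    """t = difference in sample means / combined standard error."""
    mean_difference = statistics.mean(sample1) - statistics.mean(sample2)
    standard_error = math.sqrt(
        statistics.variance(sample1) / len(sample1)
        + statistics.variance(sample2) / len(sample2)
    )
    return mean_difference / standard_error


# Hypothetical measurements from two groups.
group_a = [5, 6, 7, 8, 9]
group_b = [1, 2, 3, 4, 5]

t = t_statistic(group_a, group_b)
print(f"t = {t:.2f}")
```

With samples this small the threshold for significance is larger than 1.96, so the value should be looked up in a t table for the appropriate degrees of freedom.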
58
T tests (2)
So what does t mean?
• t = the horizontal axis of a normal distribution with
mean=0 and standard deviation=1 (a good approximation for large samples)
• You can read the probability of the two samples
coming from the same population from a table of t
values
• Most important value: if t>1.96 (or t<-1.96) then the probability of the two samples
coming from the same population is <0.05 (5%). This
suggests that they are fundamentally different
• By convention, we discard the null hypothesis if
p<0.05
• It's good practice to quote the p value e.g. P=0.01
59
T tests (3)
What do these results mean?
• Mean difference = 0, with 95% confidence interval
(-1.0, +1.0), p= 0.50
• Mean difference = 0.5, with 95% confidence
interval (0.1, 0.9), p= 0.049
• Mean difference = 1, with 95% confidence interval
(-0.1, +1.1), p= 0.055
• Mean difference = 1, with 95% confidence interval
(0.2, +1.8), p= 0.02
60
Risk differences





Same principle – null hypothesis is that there is no
difference
For no difference, the 95% confidence interval would
include 0
If it does not include 0, then you can be 95% confident
that there is a risk difference.
You can also quote a p value
Example – the risk difference for having a heart attack in
the placebo group compared to the intervention group
was 2% with a 95% confidence interval of (1.5% to
2.4%), p=0.02

Would you take the intervention?
61
Risk differences (2)





You can also calculate the number needed to treat from
this
NNT is the number of people you need to treat to
prevent one event from occurring
Example – the risk difference for having a heart attack in
the placebo group compared to the intervention group
was 2% with a 95% confidence interval of (1.5% to
2.4%), p=0.02
If you treat 100 people you avoid 2 heart attacks.
NNT = 50
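The NNT is simply 1 divided by the risk difference. The sketch below applies this to the slide's figures and, as an extra step not shown on the slide, applies the same formula to the two ends of the confidence interval to give a rough range for the NNT.

```python
def number_needed_to_treat(risk_difference: float) -> float:
    """NNT = 1 / absolute risk difference (risk difference given as a proportion)."""
    return 1 / risk_difference


nnt = number_needed_to_treat(0.02)          # point estimate: 2% risk difference
nnt_best = number_needed_to_treat(0.024)    # upper end of the CI, fewest to treat
nnt_worst = number_needed_to_treat(0.015)   # lower end of the CI, most to treat

print(f"NNT about {nnt:.0f} (range about {nnt_best:.0f} to {nnt_worst:.0f})")
```

This range is only meaningful because the confidence interval excludes 0; if it included 0 the NNT would not be defined at that end.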
62
Risk ratio



A relative measure of risk – very commonly used
Same principle – null hypothesis is that there is no
difference IN THE RATIO OF RISKS
For no difference, the 95% confidence interval would
include 1




Why 1 this time?
Because if both had the same risk, the ratio would be 1
If it does not include 1, then you can be 95%
confident that there is a risk difference.
You can also quote a p value
63
Odds ratio






A relative measure of risk – very commonly used
Very similar to risk ratio
Used for certain types of study, and the result of
some calculations
For no difference, the 95% confidence interval would
include 1
If it does not include 1, then you can be 95%
confident that there is a difference.
You can also quote a p value
64
Examples


Meta-analysis of the 5 prospective cohort studies (86,092
patients) indicated that individuals with periodontal
disease had a 1.14 times higher risk of developing CHD
than the controls (relative risk 1.14, 95% CI 1.074-1.213, P < .001)
The risk of VTE was 2.33 for obesity (95% CI, 1.68 to
3.24), 1.51 for hypertension (95% CI, 1.23 to 1.85),
1.42 for diabetes mellitus (95% CI, 1.12 to 1.77), 1.18
for smoking (95% CI, 0.95 to 1.46), and 1.16 for
hypercholesterolemia (95% CI, 0.67 to 2.02).
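The rule from the risk ratio and odds ratio slides (does the 95% confidence interval include 1?) can be applied to the VTE figures above. The numbers are the ones quoted on the slide; the function and variable names are chosen for this illustration.

```python
# (estimate, lower 95% limit, upper 95% limit) as quoted on the slide.
vte_results = {
    "obesity": (2.33, 1.68, 3.24),
    "hypertension": (1.51, 1.23, 1.85),
    "diabetes mellitus": (1.42, 1.12, 1.77),
    "smoking": (1.18, 0.95, 1.46),
    "hypercholesterolemia": (1.16, 0.67, 2.02),
}


def excludes_no_effect(lower: float, upper: float, null_value: float = 1.0) -> bool:
    """True if the interval does not contain the 'no difference' value."""
    return not (lower <= null_value <= upper)


significant = [
    factor
    for factor, (_, lower, upper) in vte_results.items()
    if excludes_no_effect(lower, upper)
]
print(significant)
```

Smoking and hypercholesterolemia have intervals that include 1, so on these figures alone a raised risk cannot be claimed for them at the 95% level.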
65
In summary

Your boss says: “do we need a weight loss service for
kids in XXX area?”

1. You collect data: what is the definition of “kids”, is
this data accurate, how was it
collected, what year?
2. Compare the areas: are you much
different, and is there an underlying reason?
3. Is this value statistically significant?
66
In summary (2)





You look at a service elsewhere (from
evidence)
You ask yourself, who was included in this
sample, are they different to my
population
Looking at the odds what proportion of
kids will this work on
Look to see if the test group were biased
compared to the control group
Were the results normally distributed,
skewed or other
67
In summary (3)






Was the difference between the
two groups significant?
Can you rely on these findings?
You have just found the need.
Evaluated its accuracy
Reviewed a solution
Looked at effectiveness
WELL DONE!!!
68
Useful websites




Basic maths and probability
http://www.cimt.plymouth.ac.uk/projects/mepres/book7/bk7i21/bk7_21i1.htm
Tutorials on statistics
http://www.stattrek.com/tutorials/statistics-tutorial.aspx
69