Biostatistika
Biostatistics
Statistics
• Sayings about statistics:
• Statistics is the science of working accurately with
inaccurate numbers.
• There are three kinds of lies: intentional,
unintentional, and statistics.
Biostatistics – what does it mean?
• It isn’t a separate field of science. By using this
word we point out that it is an application
of statistical methods that helps to solve
biological problems [and biological data
have peculiarities of their own].
And what is statistics, really?
• (in lay language) An ordered collection of data:
statistics of shootings, statistics of car accidents
in different regions
• (in scientific language) The science of what to
do with our data: (mathematical)
statistics as a science
• Within statistics itself: a statistic is a value
calculated from the data that “synthesizes”
some feature of those data
“Anything can be proved with the
help of statistics”
• …especially by people who don’t understand
statistics
• “It is statistically proven that widows live longer
than their husbands.”
• Anything can be put into a diagram, and diagrams
look very suggestive, especially when they
are accompanied by the “right” interpretation (the data
here are fictitious, but realistic)
And much better with the help of diagrams:
[Figure: Production of pollutants, CZ and UK, 1990 and 2000. y-axis: production per capita, 0 to 120.]
[Figure: Production of pollutants, CZ and UK, 1990 and 2000. y-axis: production in % of year 1990, 0 to 120.]
[Figure: Production of pollutants, CZ and UK, 1990 and 2000. y-axis: production in % of year 1990, 88 to 102.]
Advice: whenever somebody tells
you by how many percent
something improved, always ask
which base the percentages
were computed from.
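The advice can be made concrete with a small sketch. The numbers are made up for illustration (they are not the data behind the plots), and `percent_change` is a hypothetical helper: the same absolute drop gives a different percentage depending on the base it is divided by.

```python
def percent_change(old: float, new: float, base: float) -> float:
    """Change from old to new, expressed as a percentage of the given base."""
    return 100.0 * (new - old) / base


# Hypothetical values: production falls from 50 to 40 units.
old, new = 50.0, 40.0

relative_to_old = percent_change(old, new, base=old)  # base = starting value
relative_to_new = percent_change(old, new, base=new)  # base = final value

print(relative_to_old, relative_to_new)
```

The drop of 10 units is 20 % of the old value but 25 % of the new one, so a quoted percentage is meaningless without its base.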
Goals of statistics
• (1) Descriptive statistics: to summarize data,
to “condense” the information from many
numbers into a smaller number of parameters or
into a diagram
Compare
Name                  Points
Anton Jan             70.5
Balzarová Martina     72.5
Bendová Lenka         65.5
Blabolil Petr         71
Blažek Petr           87
Břendová Veronika     67.5
Čermáková Helena      88
Černíková Zuzana      94
Average number of points was 74.5,
whereas the minimum value was 28 and
the maximum value was 100.
Černý Jiří            59
Choma Michal          76.5
Chundelová Daniela    51
Doanová Tereza        69
Dortová Markéta       60.5
Dufek Luboš           69.5
Dvořáková Veronika    72
Effenberková Lenka    62
Franta Petr           74
Hajžmanová Tereza     72
Havlan Luboš          57.5
Hejna Ondřej          76
Holá Hana             81
Horák Jan
Jalovecká Marie
Jarolímová Zuzana
Jarošová Andrea
98.5
65
80
Jenčová
Jerkovičová Diana
Jonáková Martina
Jůzlová Zuzana
69
91
85
Chalupecký František
[Figure: Frequency diagram (histogram) of the points. x-axis: number of points, 20 to 110; y-axis: number of observations, 0 to 22.]
The fewer parameters
I keep,
• the clearer and simpler the result
is
• but the greater the loss of information (from the
mean or the histogram I can never find out
how many points František K. had, nor recover
all the individual values)
• the art is to find the point where the result is
clear but still informative
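A minimal sketch of this kind of condensation, using only the first eight point values listed above (a subset, so the result differs from the whole-class figures quoted on the slide). The dictionary layout is just an illustrative choice.

```python
from statistics import mean

# First eight point values from the list above (a subset of the class).
points = [70.5, 72.5, 65.5, 71, 87, 67.5, 88, 94]

# Eight numbers condensed into four: the individual values are no longer recoverable.
summary = {
    "n": len(points),
    "mean": mean(points),
    "min": min(points),
    "max": max(points),
}
print(summary)
```

The summary is easier to read than the raw list, but nobody's individual score can be reconstructed from it.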
Thanks to this loss of information
it is possible to lie with statistics
According to statistics, we are all flying: not high in the clouds, but close to the ground,
the tips of our shoes just barely touching the shit we are sitting in.
“The worse the patient is, the
better the medicine works.”
[Figure: Decrease of temperature after medicine injection. x-axis: temperature at the time of injection [°C], 37 to 39.5; y-axis: decrease of temperature after medicine injection [°C], 0 to 3.5.]
Argument for the harmfulness of
fluoridation (data from US states)
[Figure: x-axis: costs of fluoridation [$ per head], 0 to 120; y-axis: costs of dentists [$ per head], 0 to 1400. Annotation: “Nicaragua should be here”.]
“Storks bring babies”
[Figure: x-axis: no. of nesting pairs in the region, 0 to 100; y-axis: no. of newborn babies in the region, 0 to 1400.]
Distinguish between correlation and
causation
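A small sketch of how a strong correlation can arise without any causal link. All numbers are hypothetical: a third variable (region size) is assumed to drive both the number of nesting pairs and the number of births, and `pearson_r` is a plain implementation of the Pearson correlation coefficient written for this example.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical regions: both counts grow with region size, not with each other.
region_size = [1, 2, 3, 4, 5]
storks = [10 * s for s in region_size]   # nesting pairs
babies = [260, 430, 660, 840, 1060]      # newborn babies

r = pearson_r(storks, babies)
print(round(r, 3))
```

The coefficient is close to 1 even though, by construction, storks have no influence on births; the correlation alone cannot tell the two situations apart.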
• The general scientific method
The general scientific method, on the example of storks
bringing babies:
• 1. Observation: finding a pattern
[Figure: x-axis: no. of nesting pairs in the region, 0 to 120; y-axis: no. of newborn babies in the region, 0 to 1400.]
• 2. Interpretation: “Storks bring babies”
• 3. Prediction: if we remove the storks, babies
won’t be born [or their number will
decrease, if crows also do the job]
• 4. Experiment: in half of the regions
(randomly selected!) we shoot all the storks
and watch the changes in the birth rate (in
comparison with the changes in the control
regions)
• 5. (After statistical analysis) we find that
there are no changes, so we can declare
that storks don’t bring babies.
Hypothetico-deductive approach (K. Popper): a correct assumption can
yield only correct predictions, whereas a wrong assumption can yield both correct and
wrong predictions. Because of this we can never prove a
hypothesis, only reject it
[Diagram: Observation (“pattern”) → explanation → Hypothesis 1, Hypothesis 2, Hypothesis 3 → Prediction 1, Prediction 2, Prediction 3. The hypotheses exclude each other; the predictions differ from each other. The result of the experiment is compared with reality.]
Goals of statistics
Population and sample
• (2) Inferential statistics: making an inference
about a (statistical) population from a sample
• Some (statistical) populations are too large [or
potentially infinite], so I am not able to check all
their members
• What can I say about the results of elections in the
whole republic when I ask just 1000 people?
• What can I say about the amount of Cd in the blood of
wild geese in CZ when I took blood from just
10 specimens?
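The election question can be played through in a simulation. This is only a sketch under stated assumptions: the true share of 40 % is invented, and every respondent is modelled as an independent random draw from a very large population. `poll_estimate` is a hypothetical helper.

```python
import random


def poll_estimate(true_share: float, sample_size: int, seed: int) -> float:
    """Simulate asking sample_size random voters; return the observed share."""
    rng = random.Random(seed)  # seeded so that the run is reproducible
    votes = sum(1 for _ in range(sample_size) if rng.random() < true_share)
    return votes / sample_size


# Assumed (invented) true support of 40 % in the whole population.
estimate = poll_estimate(true_share=0.40, sample_size=1000, seed=1)
print(estimate)
```

The sample share will usually be close to, but not exactly equal to, the population share; quantifying "how close" is what inferential statistics is for.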
Inferential statistics is common
in biology
• I don’t want to draw conclusions about my
10 laboratory rats; on the basis of these
10 rats I want to say something about all
experiments done in the same way
• If this is to be science, the experiments
have to be reproducible (cf. the Journal of
Irreproducible Results)
Types of (not only biological) data
• Continuous and discrete data:
the mathematical definition versus the reality of
measurement. In reality we always measure
data with a certain accuracy
Types of (not only biological) data
• Ratio scale
• Interval scale
• Circular scale [Figure: circle marked 0, 90, 180, 270]
• Ordinal scale
• Nominal scale (categorical data)
Azimuth of the stem with lichen findings
[degrees]:
5, 10, 5, 350, 350, 355 => average ≈ 180
Time of the doom-monger's ululating: 22:00, 23:00,
24:00, 1:00, 1:00, 2:00 => the average is shortly after
midday
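Why the ordinary average fails here can be shown with a short sketch. The slides do not say how to average circular data; the function below uses one common approach (averaging the unit vectors of the angles), which is an assumption added here, and `circular_mean_deg` is a hypothetical name.

```python
from math import atan2, cos, degrees, radians, sin


def circular_mean_deg(angles_deg):
    """Mean direction in degrees, in the range [0, 360).

    Each angle is treated as a unit vector; the mean direction is the
    direction of the vector sum. It is undefined if the vectors cancel out.
    """
    s = sum(sin(radians(a)) for a in angles_deg)
    c = sum(cos(radians(a)) for a in angles_deg)
    return degrees(atan2(s, c)) % 360.0


azimuths = [5, 10, 5, 350, 350, 355]

arithmetic = sum(azimuths) / len(azimuths)  # points roughly south
circular = circular_mean_deg(azimuths)      # points roughly north

print(round(arithmetic, 1), round(circular, 1))
```

All six azimuths point roughly north, and the vector-based mean does too, whereas the arithmetic mean points in almost exactly the opposite direction.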
Population and random sample
• Sampling; sampling design
• Random sample: every individual must
have the same probability of being chosen,
independently of whether any other
individual was chosen
• Tables and generators of (pseudo)random
numbers
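A minimal sketch of drawing a simple random sample with a pseudorandom number generator. The population of 100 numbered individuals is hypothetical, and `simple_random_sample` is an illustrative wrapper around the standard library's `random.Random.sample`, which draws without replacement.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct individuals; every individual is equally likely to be chosen."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(population, n)


# Hypothetical population: individuals labelled 1 to 100.
individuals = list(range(1, 101))
sample = simple_random_sample(individuals, n=10, seed=42)
print(sample)
```

The hard part in practice is not the generator but obtaining a complete numbered list of the population to draw from.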
Population and random
sample
• An almost philosophical question: what is
“random”?
• And what is probability?
• In statistics (that is, in this course) we
will use the so-called a priori probability (the
Bayesian, posterior probability also exists)
Drawing a random sample is usually
not trivial. It is never a
sample of “typical” individuals. It
works reasonably well in
agricultural experiments
[Figure: numbered units, 1 to 3 and 1 to 6]
It is much more difficult in
natural populations: even taking the
individual nearest to a random
point does not work here
Basic statistical characteristics
• We usually denote by N the size of the population and
by n the size of the sample
• Characteristics of the population are usually
denoted by Greek letters and
characteristics of the sample by Roman
letters
• Characteristics of location:
• Means, median and mode
• Means are defined for quantitative data (i.e.
on the ratio and interval scales)
Arithmetic mean
of population: $\mu = \frac{\sum_{i=1}^{N} X_i}{N}$
of sample: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
Geometric mean
• n-th root of the product of n values (for a sample
here): $\sqrt[n]{\prod_{i=1}^{n} X_i}$
Harmonic mean
• Reciprocal of the mean of reciprocals: $\frac{1}{\frac{1}{n} \sum_{i=1}^{n} \frac{1}{X_i}}$
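The three means can be compared on one small sample. This is a sketch written directly from the definitions; the function names and the values 1, 2, 4 are chosen only for illustration, and the geometric and harmonic means assume strictly positive values.

```python
from math import prod


def arithmetic_mean(xs):
    """Sum of the values divided by their number."""
    return sum(xs) / len(xs)


def geometric_mean(xs):
    """n-th root of the product of n positive values."""
    return prod(xs) ** (1 / len(xs))


def harmonic_mean(xs):
    """Reciprocal of the mean of the reciprocals of positive values."""
    return len(xs) / sum(1 / x for x in xs)


values = [1, 2, 4]
a = arithmetic_mean(values)
g = geometric_mean(values)
h = harmonic_mean(values)
print(a, g, h)
```

For positive values that are not all equal, the harmonic mean is the smallest of the three and the arithmetic mean the largest.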
Median [also used for ordinal-scaled
data]
• It is defined so that one half of the values lies
below and the other half above the median
(in infinite populations, the probability
that a random value lies above the median is 0.5,
the same as that it lies below). In populations with an even
number of values, the value halfway between the two
middle values is taken as the median
Upper and lower quartile
• 1/4 of the observations lie above the upper quartile and
1/4 of the observations lie below the lower one
(analogously for infinite populations)
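A short sketch of the median and the quartiles using Python's standard `statistics` module. The data are made up. Note that several conventions for computing sample quartiles exist; the `"inclusive"` method used here is one of them and is an assumption of this example, not something the slides prescribe.

```python
from statistics import median, quantiles

odd_sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]
even_sample = [1, 2, 3, 4]

median_odd = median(odd_sample)    # the middle value
median_even = median(even_sample)  # halfway between the two middle values

# Three cut points dividing the sorted data into four equal parts.
lower_quartile, middle, upper_quartile = quantiles(odd_sample, n=4, method="inclusive")

print(median_odd, median_even, lower_quartile, upper_quartile)
```

The middle cut point returned by `quantiles` coincides with the median.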
Distinguish between the meaning
of the mean and the median
Example: wages in two companies
Company A    Company B
8000         7000
9000         7500
11000        8000
12000        8500     Median
15000        11000
18000        18000
20000        39000
13286        14143    Mean
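The wage example can be recomputed with a few lines of Python; the wage values are taken from the slide, the variable names are illustrative.

```python
from statistics import mean, median

company_a = [8000, 9000, 11000, 12000, 15000, 18000, 20000]
company_b = [7000, 7500, 8000, 8500, 11000, 18000, 39000]

for name, wages in (("A", company_a), ("B", company_b)):
    print(name, "mean:", round(mean(wages)), "median:", median(wages))
```

Company B has the higher mean wage, but its median is much lower: a single wage of 39000 pulls the mean up while most employees earn less than in company A.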
Mode: the most common value.
In continuous data it is the “peak”
in the frequency diagram. We will
define it later as a local maximum of
the probability density curve
[there can be more than one]
[Figure: distribution curves with the positions of the mean and the median marked]
Characteristics of variability
• 1. Range: the difference between the maximum
and the minimum
• 2. Interquartile range
• 3. Variance and standard deviation
Variance: the mean squared
deviation from the mean
• population: $\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
• estimate based on the sample: $s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
n-1 = df = degrees of
freedom
Standard deviation ($s_X$, often
also “s.d.” or “S.D.”) is the square root
of the variance
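A sketch of both variance formulas and the standard deviation, written directly from the definitions. The data values are made up and the function names are illustrative.

```python
from math import sqrt


def population_variance(xs):
    """Mean squared deviation from the mean, dividing by N."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)


def sample_variance(xs):
    """Estimate of the population variance from a sample, dividing by n - 1."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]

var_population = population_variance(data)
var_sample = sample_variance(data)
sd_sample = sqrt(var_sample)

print(var_population, var_sample, sd_sample)
```

The sample estimate is always somewhat larger than the value obtained by dividing by the number of observations, and the difference shrinks as the sample grows.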
Compare the variability in weight of
an elephant and an ant
• Use either the variance or standard deviation of
log-transformed data, or the coefficient of
variation CV
• Both make sense only for ratio-scaled data
$CV = \frac{s}{\bar{X}}$
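A sketch of the coefficient of variation on invented weights (the elephant and ant values below are hypothetical and chosen so that the arithmetic is easy to follow).

```python
from statistics import mean, stdev


def coefficient_of_variation(xs):
    """Sample standard deviation divided by the sample mean."""
    return stdev(xs) / mean(xs)


# Hypothetical weights in very different units and magnitudes.
elephants_kg = [4000, 5000, 6000]
ants_mg = [4, 5, 6]

cv_elephants = coefficient_of_variation(elephants_kg)
cv_ants = coefficient_of_variation(ants_mg)

print(stdev(elephants_kg), stdev(ants_mg), cv_elephants, cv_ants)
```

The standard deviations differ by a factor of a thousand (and are in different units), while the coefficients of variation are equal: relative to their own mean, both groups vary by the same amount.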
Standard error of the mean
• A characteristic of the accuracy of the sample mean:
how large the variability of the means of many
random samples of this size would be
$s_{\bar{X}} = \frac{s_X}{\sqrt{n}}$
($s_{\bar{X}}$: accuracy; $s_X$: variability in the data)
We can increase the accuracy
by taking a larger sample.
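A sketch of the standard error formula, showing how accuracy improves with sample size. The standard deviation of 10 and the sample sizes are invented, and `standard_error` is an illustrative name.

```python
from math import sqrt


def standard_error(sd: float, n: int) -> float:
    """Standard error of the mean for standard deviation sd and sample size n."""
    return sd / sqrt(n)


# Same variability in the data, two hypothetical sample sizes.
se_small = standard_error(sd=10.0, n=25)
se_large = standard_error(sd=10.0, n=100)

print(se_small, se_large)
```

Because of the square root, halving the standard error requires a sample four times as large.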
Graphical summaries:
frequency diagram
[Figure: Histogram (OHRAZENI 8v*21c) of the number of seedlings (NO_SAPLING) with a fitted normal curve, 21*100*normal(x, 314.8095, 173.2422). x-axis: number of seedlings, 0 to 800; y-axis: number of observations, 0 to 8.]
Box and whisker plot
[Figure: Box plot (OHRAZENI 8v*21c) of the number of seedlings (NO_SAPLING), axis 0 to 800. Median = 329; 25%-75% = (196, 363); non-outlier range = (93, 500); outliers and extremes are marked separately.]
Note: nowadays
box-and-whisker plots are also
used to show the
mean and
standard
deviation etc.