Lecture 2: Descriptive Statistics


Descriptive statistics
Lecture aim(s)
To equip students with the knowledge, skills and techniques for summarizing data through various statistical and documentary tools
Learning Objectives
By the end, students should be able to
1. Summarize any given data set to a manageable level so that its main features can be seen clearly
2. Use tables and/or diagrams to describe data
3. Interpret the tables or diagrams used
Supplementary reading notes
• Modern Approaches to the analysis of
experimental data ~By Statistical
Service Centre, University of Reading,
2001 or 2006 versions
• Analyzing the data, from GEAR 4.2
• Modern Methods of Data Analysis ~By
Statistical Service Centre, University of
Reading, 2002 version
• Case study 3 from Biometrics RM ILRI
CD
Self Learning software
• Open folder Statistics Made Simple
• Launch sms – topic 1
• Follow the teacher
Practical statistical packages
• Genstat
• SPSS
Steps in the Analysis
1. Defining the objectives of the analysis
• Decide on the objectives of the analysis before starting it – draw up an analysis plan.
2. Preparing the data
• You may need to construct variables needed for analysis (e.g. find plant N content from concentrations and biomass data) or summarise variables to the correct ‘level’.
3. Descriptive analysis
• Calculation of summary tables and graphs, as defined when setting objectives.
• Exploratory analysis to identify any unexpected patterns or results.
4. Confirmatory analysis
• Adding measures of precision (e.g. standard errors and results of significance tests) to the results found in the descriptive analysis.
• Improving the estimates of various critical quantities.
5. Interpretation
• Integrating the new knowledge with the existing body of knowledge on the problem.
• Comparing results with those from other studies, building predictive models and formulating new hypotheses.
6. Reporting
• Reporting the analysis and presenting the final tables and graphs.
Source: SSC Data analysis workshop
Data Exploration and Descriptive
statistics
Follows after data entry
Descriptive statistics
• Aims at
– reducing the data to manageable proportions,
– summarising trends and tendencies within the data,
– so that the results can be seen clearly
Descriptive statistics
• As the amount of data grows, it becomes difficult to have a clear picture of what is happening
• This leads to a process of data exploration
• And to reducing the data into tables, diagrams and numerical measures
Review on data structure
• Please review data structure for
analyses
• In CAST, refer to
 Introduction: About data
 Standard data structure
Lecturer to explain and illustrate univariate and multivariate data structures; arrangement of variables, factors and records; integer and real variables, etc.
Aims of descriptive statistics
• That is
– data contain information of interest
– this information becomes harder to see as the data increase
– one way to extract this information from the data is through descriptive statistics
Go through anthropometric data
See the picture in the data
Practice on understanding data
• Using data collected from students
• Define variables using SPSS
• Enter data
• Merge data from all students
• Do some computations on this and
other data
• Understand data as you increase
data set size
• See the need for descriptive statistics
Computation practice
• On class data
• Compute departments from
knowledge of questionnaire coded
numbers
• Generate parameters of food
security
Descriptive statistics / data summarising depends on the type of variable
• Qualitative variables
– Are summarised into frequencies
– Are presented as tables, bar charts, pie charts
• Quantitative variables
– Are summarised using numerical measures
– Are presented as tables, histograms, stem and
leaf diagrams, box and whisker plots, scatter
plots
More on types of data / variables
Data types
• Qualitative (categorical, discrete data)
– Nominal variables
– Ordinal variables
• Quantitative
– Discrete data
– Continuous data (ratio / interval scale)
Descriptive Statistics of qualitative data
A frequency distribution
• Shows the frequencies of occurrence of the
observations in a data set
• According to the class or category of the
qualitative variable
• Results can be displayed in a
– Table or
– in a diagram such as
• Bar chart or
• Pie chart
• in which each class is represented
An example of frequency distribution
From SAVE Baseline data file
Relative frequency distribution
• When comparing two or more frequency distributions whose total numbers differ,
• it is difficult to compare raw frequencies
• So calculate the proportion or percentage of observations in each class or category
• Hence relative frequencies
• These sum to unity or 100%
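Outside SPSS or Genstat, the same frequency and relative frequency summaries can be produced in a few lines of code. The sketch below uses Python/pandas on a small made-up roof-type vector (the column name rthouse and the file name are assumptions, not part of the SAVE data file itself); it mirrors the layout of the SPSS frequency tables shown on the next slides.

```python
import pandas as pd

# Hypothetical roof-type responses; in practice read them from the survey file,
# e.g. pd.read_spss("save_baseline.sav")["rthouse"]
rthouse = pd.Series(["grass thatched", "iron sheet", "grass thatched",
                     "tiles", "grass thatched", "iron sheet"])

freq = rthouse.value_counts()               # frequencies per category
rel = rthouse.value_counts(normalize=True)  # relative frequencies (sum to 1)

table = pd.DataFrame({
    "Frequency": freq,
    "Percent": (rel * 100).round(1),
    "Cumulative Percent": (rel * 100).cumsum().round(1),
})
print(table)
```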
Frequency distribution example: house type by roof
From SAVE Baseline data file

house type by roof (n = 1003)
                 Frequency   Percent   Valid Percent   Cumulative Percent
grass thatched      793        79.1        79.1              79.1
iron sheet          205        20.4        20.4              99.5
tiles                 5          .5          .5             100.0
Total              1003       100.0       100.0

house type by roof (n = 500)
                 Frequency   Percent   Valid Percent   Cumulative Percent
grass thatched      430        86.0        86.0              86.0
iron sheet           70        14.0        14.0             100.0
Total               500       100.0       100.0
Cumulative relative frequency
distribution
• Sometimes computed for specific purposes, e.g.
• To determine the percentage of observations below a certain cut-off point
• Helps in developing indices such as measures of the distribution of assets among individuals – Lorenz curves and Gini coefficients
An example of cumulative frequency distribution: distribution of livestock among households
[Figure: cumulative relative frequency (%) plotted against number of livestock, with the 5th percentile, median and 95th percentile marked]
See also Fig 2.1, page 13, of Statistics for Vet and AS
Percentile
• Values of a variable which divide the
total frequency into 100 equal parts
• e.g. the 50th percentile (median) is the
value of the variable that divides the
distribution into two halves
• That is, 50 % of individuals have observations less
than the median and 50 % of individuals have
observations greater than the median
• Often the 25th and 75th percentiles are quoted as the lower and upper quartiles, respectively
• That is, 25 % of observations lie below the lower quartile and 25 % lie above the upper quartile
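As an illustration, percentiles such as the quartiles and the median can be computed directly in code. The sketch below uses Python/NumPy on a small hypothetical household-size vector (the SPSS output on the next slide comes from the actual income data).

```python
import numpy as np

# Hypothetical household sizes; replace with the household size variable
# from the income data set
hh_size = np.array([2, 3, 3, 4, 4, 4, 5, 6, 6, 8, 9])

for p in (5, 10, 25, 50, 75, 90, 95):
    print(f"{p}th percentile: {np.percentile(hh_size, p):.2f}")

# The 50th percentile equals the median; the 25th and 75th are the quartiles
print("median:", np.median(hh_size))
```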
Percentile output from SPSS (household size, data from the income data)

Percentiles                            5      10      25      50      75      90      95
Weighted Average (Definition 1)      2.00    2.00    3.00    4.00    6.00    8.00    9.00
Tukey's Hinges                                       3.00    4.00    6.00
Cumulative relative frequency polygon of the distribution of length of eggs
[Figure: cumulative frequency (%) plotted against length of egg (mm), from 19.7 to 25.2 mm, with the median marked]
An example of cumulative frequency distribution: Lorenz curves
[Figure: percent of livestock assets plotted against percent of households, comparing the actual livestock curve with the ideal (perfect equality) line]
Lorenz curves
• Measure the level of asset distribution
• By comparing the actual distribution to a line of perfect (equal) distribution
• Based on the cumulative frequency distribution of assets vs the cumulative frequency distribution of households
• That is, cumulative assets are plotted against cumulative households (see the sketch below)
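A minimal sketch of how a Lorenz curve and a Gini coefficient can be computed from household asset counts is given below (Python/NumPy; the tiny livestock vector is hypothetical, and the Gini is obtained from the area under the Lorenz curve by the trapezoidal rule rather than by Genstat's bootstrap procedure shown on the next slide).

```python
import numpy as np

# Hypothetical livestock counts per household (replace with the survey column)
livestock = np.array([0, 0, 0, 1, 1, 2, 3, 4, 6, 12], dtype=float)

x = np.sort(livestock)
cum_assets = np.cumsum(x) / x.sum()          # cumulative share of assets
cum_hh = np.arange(1, len(x) + 1) / len(x)   # cumulative share of households

# Lorenz curve points (prepend the origin)
lorenz_x = np.insert(cum_hh, 0, 0.0)
lorenz_y = np.insert(cum_assets, 0, 0.0)

# Gini = 1 - 2 * area under the Lorenz curve (trapezoidal rule)
area = np.sum((lorenz_y[1:] + lorenz_y[:-1]) / 2 * np.diff(lorenz_x))
gini = 1 - 2 * area
print(f"Gini coefficient: {gini:.3f}")
```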
Lorenz curve for income, drawn in Genstat
[Figure: Lorenz curve for income; Gini coefficient 0.7027, 95% bootstrap confidence interval (0.560, 0.807)]
Data is arranged in order as follows (details in Excel):

Livestock   Farmers   No. livestock   % livestock   Cumulative %   % farmers   Cumulative %
0            1285          0             0.00           0.00         57.16        57.16
1             193        193             5.30           5.30          8.59        65.75
2             214        428            11.75          17.05          9.52        75.27
3             162        486            13.34          30.39          7.21        82.47
4             115        460            12.63          43.01          5.12        87.59
5              78        390            10.71          53.72          3.47        91.06
6              74        444            12.19          65.91          3.29        94.35
7              39        273             7.49          73.40          1.73        96.09
8              22        176             4.83          78.23          0.98        97.06
9              16        144             3.95          82.19          0.71        97.78
10             11        110             3.02          85.20          0.49        98.27
11             14        154             4.23          89.43          0.62        98.89
12             10        120             3.29          92.73          0.44        99.33
13              2         26             0.71          93.44          0.09        99.42
14              3         42             1.15          94.59          0.13        99.56
15              4         60             1.65          96.24          0.18        99.73
16              1         16             0.44          96.68          0.04        99.78
17              1         17             0.47          97.15          0.04        99.82
19              1         19             0.52          97.67          0.04        99.87
20              1         20             0.55          98.22          0.04        99.91
31              1         31             0.85          99.07          0.04        99.96
34              1         34             0.93         100.00          0.04       100.00
Total        2248       3643           100                          100
Cross tabulations
• Indicate the association between the frequency distributions of two or more variables
– An example: house type by roof against household food security status (see the sketch below)
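A cross-tabulation of two qualitative variables can be sketched as follows in Python/pandas (the roof-type and food-security values below are made up for illustration, not taken from the SAVE data; pd.crosstab does the counting).

```python
import pandas as pd

# Hypothetical paired observations (replace with the survey columns)
df = pd.DataFrame({
    "roof":          ["grass thatched", "iron sheet", "grass thatched",
                      "iron sheet", "grass thatched", "tiles"],
    "food_security": ["insecure", "secure", "insecure",
                      "secure", "secure", "secure"],
})

# Counts of households in each roof-type x food-security cell, with totals
counts = pd.crosstab(df["roof"], df["food_security"], margins=True)
print(counts)

# Row percentages, useful for comparing food security across roof types
row_pct = pd.crosstab(df["roof"], df["food_security"], normalize="index") * 100
print(row_pct.round(1))
```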
Cross tabulations
Practical
• From your data, isolate the qualitative data and analyse for frequencies and cross-tabulations. Report in tables and graphs. Interpret the results
Frequency distribution for
quantitative variable
• Is calculated when quantitative data are split into class intervals
• Each class encompasses a range of values of the variable
• An observation falls into only one class
• Then determine the number of observations belonging to each class
• The complete set of class frequencies is a frequency distribution (see the sketch below)
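The class-interval step can be sketched in Python/pandas with pd.cut, which assigns each observation to exactly one class before counting (the weights and the cut points below are hypothetical stand-ins for whichever quantitative variable you are summarising).

```python
import pandas as pd

# Hypothetical body weights (kg); replace with the quantitative variable of interest
weights = pd.Series([19, 20, 22, 24, 25, 25, 28, 30, 35, 37, 39, 43])

# Split the range into class intervals; each observation falls into exactly one class
classes = pd.cut(weights, bins=[15, 20, 25, 30, 35, 40, 45])

# The complete set of class frequencies is the frequency distribution
freq_dist = classes.value_counts().sort_index()
print(freq_dist)
```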
Frequency distribution for
quantitative variable ~ Practical
• From Anthropometric data
• On variables Z-scores
– Split into three status groups:
• Normal
• Undernourished
• Overweight
– Based on your own set criteria
• Then run a frequency distribution
Frequency distribution for
quantitative variable ~ Practical
• From class generated data
• Merge assumed provided energy food
data
• Compute and analyse for
• Food secure
• Food insecure households
• Then run frequency distribution and cross
tabulations
Practical in Genstat
• Open the income data (SPSS file) in Genstat
• Run a frequency tabulation
• Stats > Summary Statistics > Tally
• Then take rthouse
• The same can be analysed from Survey Analysis
EDA / Descriptive analysis of
quantitative data
• In summarizing quantitative variables the most interesting things are
• Location (what is a typical value?)
• Spread (how much variation is there?)
• Odd values (what is their source and interpretation?)
• Location is measured by the mean or median
• Spread is measured by the standard deviation or the distance between quartiles
• Use histograms and box plots
Descriptive analysis of quantitative
data ~ check also data exploration &
description (Case Study 11)
• This involves measures of various
statistics
– of measure of location (Centre)
– and dispersion about the location
(Spread)
Measures of location
• These usually refer to measures of
central tendency of a data set
• The arithmetic mean
• The geometric mean
• The median
• The mode
The Arithmetic Mean
• The average value of observations in a
data set
$\bar{x} = \dfrac{\sum x_i}{n}$
• True population mean ($\mu$)
• Sample estimate of the population mean ($\bar{x}$)
The mean is the most commonly used measure of central tendency
Example of data set
Liveweight (y), kg, of 20 animals:
30, 24, 20, 25, 25, 19, 35, 37, 39, 43, 38, 20, 28, 22, 28, 25, 20, 35, 43, 36
$\bar{X} = \dfrac{\sum X}{n} = \dfrac{592}{20} = 29.6$ kg
Problems with the mean
• The mean value is influenced by outliers
• An outlier is an observation whose value is highly inconsistent with the main body of the data
• It can be excessively large or small
• And influences the mean accordingly
• The mean requires a symmetrical distribution of data to be an appropriate measure of central tendency
• The mean will be pulled to the right if the distribution is skewed to the right, and to the left if the distribution is skewed to the left
Median
• Also commonly used
• The value that 50 % of observations exceed or fall below
• The observations need to be arranged in rank order
• The median is the 50th percentile
• If n is even, the median lies midway between the two central observations
• If n is odd, the median is found by counting until the (n+1)/2 th observation is reached
Example of data set
Live weight (y), kg, ranked in order:
19, 20, 20, 20, 22, 24, 25, 25, 25, 28, 28, 30, 35, 35, 36, 37, 38, 39, 43, 43
Median = 28 (n is even, so the median lies midway between the 10th and 11th observations, both 28)
Demo on the Excel data: change one figure and observe the effect (see the sketch below)
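The same demonstration can be sketched in Python instead of Excel: change one figure in the liveweight data (values taken from the example slides) and watch the mean move while the median stays put.

```python
import statistics

# Liveweight data (kg) from the example slides
weights = [30, 24, 20, 25, 25, 19, 35, 37, 39, 43,
           38, 20, 28, 22, 28, 25, 20, 35, 43, 36]
print("mean:", statistics.mean(weights), "median:", statistics.median(weights))
# mean: 29.6  median: 28

# Change one figure to an extreme value (an outlier)
weights[9] = 430   # was 43
print("mean:", statistics.mean(weights), "median:", statistics.median(weights))
# the mean jumps, the median is unchanged
```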
Attributes of median
• The median is not affected by outliers in the data
• The median is not affected by a skewed distribution
• It is preferred under these conditions
• The median will be
• less than the mean if the data are skewed to the right
• greater than the mean if the data are skewed to the left
• close or equal to the mean if the distribution is symmetrical
• The median, however, does not incorporate all observations in its calculation
Geometric mean
• The distribution of biological data is mostly symmetrical
• If not, the data are mostly skewed to the right
• To make such data symmetrical, we take the log of each value in the data set
• This is a log transformation
• Means are therefore computed on the transformed data
• We then convert back to the original scale by taking the antilog
• This mean is called the geometric mean (see the sketch below)
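A minimal sketch of the geometric mean calculation (log, average, antilog) in Python, using a small hypothetical right-skewed income vector in place of the real income data:

```python
import math

# Hypothetical right-skewed incomes (replace with the income variable)
income = [120, 150, 180, 200, 240, 300, 450, 900, 2500]

logs = [math.log(x) for x in income]      # natural log of each value
mean_log = sum(logs) / len(logs)          # mean on the transformed scale
geom_mean = math.exp(mean_log)            # antilog back to the original scale

arith_mean = sum(income) / len(income)
print(f"arithmetic mean: {arith_mean:.1f}")
print(f"geometric mean:  {geom_mean:.1f}")   # smaller, closer to the median
```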
Geometric mean ~ Practice
• Using the merged income data
• Transform into ln (natural log) in SPSS and Genstat
• Run a histogram with a normality plot in SPSS
• Details at the end of this lecture
Relationship between Geometric
mean and median
• The geometric mean is
• always less than the arithmetic mean if data are skewed to the right
• usually close to the median if data are skewed to the right
• For such data, the preference is to transform the data and report geometric means rather than the median
Other measures of central
tendency
• Mode
– The most frequent value
Genstat practice on
• Calling data from Excel
• Putting protocol and value
definitions prior to dataset in Excel
• Use anthropdata
• Illustrate using on farm gliricidia and
sesbania excel file
Measures of dispersion (variation)
• When data are collected, all values
are rarely the same.
• A major role of statistics is to
describe and analyze this variation.
• That is, an important role of statistics
is to display and describe this
variation in ways that highlight the
information in it.
Growth of yams (cm) for 7 days
10.1, 9.2, 11.9, 6.3, 7.4, 5.4, 9.3, 11.1, 7.2, 6.8, 9.1, 10.9, 10.1, 7.4, 9.2, 9.5, 6.0, 5.3, 8.9, 10.4
There is clearly variability between yam plants, and a quick scan shows that all values are between 5 and 12 cm
Source: CAST ~ Displaying variables
Measures of dispersion
• The mean on its own is not adequate
• Hence the need to measure the distribution in a population
• e.g.
Xa: 4, 10, 7, 3   (mean X̄a = 6)
Xb: 5, 7, 7, 5    (mean X̄b = 6)
• They have the same mean but the distributions are different
• Hence the need for a measure of distribution to see how the population is dispersed about the mean
Range
• The difference between the extreme ends of the observations
Range = Maximum value − Minimum value
• Again, this does not show how the population is dispersed about the mean
Variance
• A measure of the amount of variation in a population
• It is the sum of all squared deviations from the mean divided by the number of degrees of freedom
$\sigma^2$ is the population variance; its sample estimate is
$s^2 = \dfrac{\sum (Y_i - \bar{Y})^2}{n-1} = \dfrac{\sum Y_i^2 - (\sum Y_i)^2 / n}{n-1}$
An example (worked in Excel)

No.   Liveweight (y) kg   y − ȳ     (y − ȳ)²
1           30             0.4        0.16
2           24            −5.6       31.36
3           20            −9.6       92.16
4           25            −4.6       21.16
5           25            −4.6       21.16
6           19           −10.6      112.36
7           35             5.4       29.16
8           37             7.4       54.76
9           39             9.4       88.36
10          43            13.4      179.56
11          38             8.4       70.56
12          20            −9.6       92.16
13          28            −1.6        2.56
14          22            −7.6       57.76
15          28            −1.6        2.56
16          25            −4.6       21.16
17          20            −9.6       92.16
18          35             5.4       29.16
19          43            13.4      179.56
20          36             6.4       40.96
Sum        592             0.00    1218.8
Mean       29.6

$SS_y = \sum (y - \bar{y})^2 = 1218.8$
$V_y = \dfrac{SS_y}{n-1} = \dfrac{1218.8}{19} = 64.1\ \text{kg}^2$
$\sigma = \sqrt{V_y} = 8.0\ \text{kg}$
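The hand calculation above can be checked with a short script (Python, using the liveweight figures from the example); statistics.variance and statistics.stdev use the same n − 1 divisor as the slide.

```python
import statistics

# Liveweight data (kg) from the worked example
y = [30, 24, 20, 25, 25, 19, 35, 37, 39, 43,
     38, 20, 28, 22, 28, 25, 20, 35, 43, 36]

mean = statistics.mean(y)
ss = sum((v - mean) ** 2 for v in y)   # sum of squared deviations
var = statistics.variance(y)           # divides by n - 1
sd = statistics.stdev(y)

print(f"mean = {mean:.1f} kg, SS = {ss:.1f}, variance = {var:.1f} kg^2, SD = {sd:.1f} kg")
# mean = 29.6 kg, SS = 1218.8, variance = 64.1 kg^2, SD = 8.0 kg
```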
Standard deviation ()
• Describes the
variation of a
population or
sample about
the mean in
same units
• Best describes
the population
when used with
the mean
68 %
95 %
99.7 %
Usually reported as x ± 
So a population can have same mean but the
distribution is different
Standard deviation ()
The standard deviation is a 'typical' distance
of values from the centre of the distribution.
Coefficient of Variation (CV)
• The standard deviation standardised so that it can be compared across populations, other traits, different ages or classes
• It therefore measures variation in relative terms
• It is expressed as a percentage
$CV = \dfrac{S}{\bar{X}} \times 100$
Standard error (SE)
• It is the standard deviation of the means of samples drawn repeatedly from the same population
• It measures the precision of your estimates
• The smaller the SE, the more precise the estimate
• Reported as lsmean ± se
$S_{\bar{x}} = \dfrac{S}{\sqrt{n}}$
The SD is a useful measure of the variation of an individual observation
The SE is a useful measure of the variation of the mean
Confidence Interval (CI)
• A CI is the range between upper and lower limits that is expected to include the true population mean with a given probability
• This is the value for which a sample provides an unbiased estimate
• The SE is used to calculate the CI
Confidence Interval (CI)
• We usually talk about the 95 % CI
• This is the interval in which the true mean lies, with a 95 % chance of being correct, or
• when sampled 20 times, 19 times the interval will contain the mean
• An approximate 95 % CI can be estimated as the sample mean ± 2 × SE
• An approximate 99 % CI can be estimated as the sample mean ± 2.6 × SE
• Usually the t value is used in calculating the CI (see the sketch below)
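A sketch of the exact 95 % CI using the t value (Python/SciPy, reusing the liveweight data from the earlier example); the ± 2 × SE rule gives almost the same interval.

```python
import statistics
from scipy import stats

y = [30, 24, 20, 25, 25, 19, 35, 37, 39, 43,
     38, 20, 28, 22, 28, 25, 20, 35, 43, 36]

n = len(y)
mean = statistics.mean(y)
se = statistics.stdev(y) / n ** 0.5       # standard error of the mean
t = stats.t.ppf(0.975, df=n - 1)          # two-sided 95 % t value, n - 1 df

lower, upper = mean - t * se, mean + t * se
print(f"mean = {mean:.1f}, SE = {se:.2f}, 95% CI = ({lower:.1f}, {upper:.1f})")
```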
Practical
• From data on one of the files, do
statistical analyses
• Understand each concept
Testing for normality and
displaying variation
• We have seen that most measures for quantitative data require the data distribution to be symmetrical
• Normally distributed
• In the case of outliers, the data need management
• In the case of skewed data, the data need to be transformed
• There is therefore a need to check whether data are normally distributed before proceeding with the analyses
Tools to test normality
• Use of diagrams
– Histograms, Tukey's box-and-whisker plots, stem-and-leaf plots, normal plots
• Statistical tests
– Shapiro-Wilk, Kolmogorov-Smirnov tests
• Consistency between the mean and the median
• Examples in SPSS (an equivalent sketch in Python is given after this list)
• File chickwt
• Anthropometric data
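Equivalent statistical tests are available outside SPSS; a sketch in Python/SciPy is shown below (the weight vector here is randomly generated as a stand-in — in practice use the WT10 variable from the chickwt file).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical weights; replace with the WT10 variable from the chickwt data
wt10 = rng.normal(loc=808, scale=227, size=67)

# Shapiro-Wilk test (recommended for smaller samples)
sw_stat, sw_p = stats.shapiro(wt10)
print(f"Shapiro-Wilk: W = {sw_stat:.3f}, p = {sw_p:.3f}")

# Kolmogorov-Smirnov test against a normal with the sample mean and SD
ks_stat, ks_p = stats.kstest(wt10, "norm", args=(wt10.mean(), wt10.std(ddof=1)))
print(f"Kolmogorov-Smirnov: D = {ks_stat:.3f}, p = {ks_p:.3f}")

# p > 0.05 in both cases means no evidence against normality
```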
Histogram for weight at week 10
[Figure: histogram of WT10, classes from 400 to 1500; Std. Dev = 226.83, Mean = 808.0, N = 67. The height of a class rectangle equals the frequency of the class]
Check for symmetry – the shape to the right of the central value should be a mirror image of that to the left
Note that the rectangles are contiguous because quantitative variables are continuous
Box-and-whisker plot
[Figure: box-and-whisker plot of WT10, N = 67, vertical scale 200 to 1600; case 60 is flagged as an extreme value]
Whiskers are vertical lines extending from the box to the 2.5th and 97.5th percentiles; the rest are extreme values
The horizontal lines of the box define the upper and lower quartiles, which enclose 50 % of observations
The median is marked by a line within the box
The B & W plot therefore displays the range, median and quartiles
Box-and-whisker plot
[Figure: box-and-whisker plots of WT10 by SEX (f: N = 45, m: N = 22), vertical scale 200 to 1600; cases 60 and 30 flagged]
Box plots are also useful for comparing a number of data sets, e.g. by sex – i.e. to check the variation per group or category
We see that observations in each sex are approximately normally distributed, though the range is higher in females than in males, and the distribution (location) is higher in males than in females
Boxplot of height by species
[Figure: box plots of height for species 1 to 5]
Normal plot of weight at week 10
[Figure: Normal Q-Q plot of wt10. The horizontal axis shows the ordered (observed) values of the variable, from about 250 to 1,500; the vertical axis shows the corresponding standardised normal deviates (expected normal)]
Normally distributed data show a straight line.
If the data are not normally distributed, the plot deviates from a straight line and a curve is produced
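A normal Q-Q plot like the one described above can be sketched in Python with SciPy and matplotlib (random data standing in for the real wt10 column; note that probplot puts the observed values on the vertical axis, the reverse of the SPSS layout described above).

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
wt10 = rng.normal(loc=808, scale=227, size=67)   # stand-in for the real wt10 column

# probplot orders the values and pairs them with the expected normal quantiles
stats.probplot(wt10, dist="norm", plot=plt)
plt.title("Normal Q-Q plot of wt10")
plt.xlabel("Theoretical (expected normal) quantiles")
plt.ylabel("Ordered observed values")
plt.show()
```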
Descriptives of normality analyses

Descriptives: WT10
                                        Statistic    Std. Error
Mean                                      808.03       27.711
95% Confidence Interval for Mean
  Lower Bound                             752.70
  Upper Bound                             863.36
5% Trimmed Mean                           800.96
Median                                    796.00
Variance                                51450.848
Std. Deviation                            226.828
Minimum                                   350
Maximum                                   1498
Range                                     1148
Interquartile Range                       292.00
Skewness                                  .532          .293
Kurtosis                                  .653          .578
Extreme values of normality analysis

Extreme Values: WT10
Highest:  1) case 60 = 1498   2) case 30 = 1332   3) case 28 = 1276   4) case 17 = 1214   5) case 37 = 1208
Lowest:   1) case 61 = 350    2) case 44 = 366    3) case 57 = 430    4) case 12 = 466    5) case 43 = 500
Test of normality

Tests of Normality: WT10
Kolmogorov-Smirnov(a):  Statistic = .074, df = 67, Sig. = .200*
Shapiro-Wilk:           Statistic = .977, df = 67, Sig. = .259
*. This is a lower bound of the true significance.
a. Lilliefors Significance Correction

Graphical tests are subjective, though commonly used
Objective tests of normality include S-W (for < 2000 observations) and K-S (for > 2000 observations)
These test the hypothesis that the data are not significantly different from normal
The S-W statistic should be > 0.90 with p > 0.05; the K-S statistic should be small, with p > 0.05
Exercise on descriptive analyses
and test of normality
From Anthropometric dataset
available
• Run normality tests for all
quantitative data
• That is
• Weight, height, MUAC, haz, whz and waz
Data transformation
• In case the data fail to be normal
• There are two options
– Either go ahead with the analyses and use non-parametric tests
– Or transform the data and analyse the transformed values
• Data transformation is recommended
Basis for transformation
• Reduces skewness of the data and
makes the residual variance less
dependent on the mean
• By so doing, you normalise the distribution of data
• To linearize a relationship
• It is easier to analyse data and investigate a
relationship when that relationship can be
described by a straight line
• To stabilise the variance
• Equal variance is assumed in statistical analyses that
often assume normal distribution
Types of transformations
• Can be
• Log or natural log
• normalises data skewed to the right
• add a constant to 0 values: log(k + x)
• Square transformation
• normalises data skewed to the left
• Logit transformation
• mainly for proportions (percentages)
• Arcsine transformation
(see the sketch below)
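A sketch of these common transformations in Python/NumPy (the small skewed vector, the proportions and the constant k are all hypothetical):

```python
import numpy as np

x = np.array([0, 1, 2, 2, 3, 5, 8, 20, 60], dtype=float)   # right-skewed, contains zeros
p = np.array([0.05, 0.10, 0.40, 0.60, 0.95])                # proportions

k = 1.0                            # constant added so that the log is defined at 0
log_x = np.log(k + x)              # log transform for right-skewed data
sq_x = x ** 2                      # square transform for left-skewed data
logit_p = np.log(p / (1 - p))      # logit transform for proportions
arcsine_p = np.arcsin(np.sqrt(p))  # arcsine (angular) transform for proportions

print(log_x.round(2))
print(logit_p.round(2))
print(arcsine_p.round(2))
```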
Analyses are done on the transformed data, but
• Report the geometric mean by taking the antilog of the results
• Back-transform the mean and CI
• Not their SE
• The constant must be taken off after the antilog when reporting the results, in order to calculate the correct geometric mean and CI
Most health, survival and hatchability animal data need transformation
Refer to the literature for the type of transformation (a back-transformation sketch is given below)
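A sketch of reporting on the original scale after a log analysis: back-transform the mean and the CI limits (not the SE), and remove any added constant (Python/SciPy; the income values and the constant are hypothetical).

```python
import math
import statistics
from scipy import stats

income = [0, 50, 120, 150, 200, 300, 450, 900, 2500]   # hypothetical, contains a zero
k = 1.0                                                 # constant added before logging

logs = [math.log(k + x) for x in income]
n = len(logs)
mean_log = statistics.mean(logs)
se_log = statistics.stdev(logs) / n ** 0.5
t = stats.t.ppf(0.975, df=n - 1)

# Back-transform the mean and the CI limits, then remove the constant
geom_mean = math.exp(mean_log) - k
ci = (math.exp(mean_log - t * se_log) - k, math.exp(mean_log + t * se_log) - k)
print(f"geometric mean = {geom_mean:.1f}, 95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")
```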
Practical on data transformation
• Based on datasets provided
• And after running normality tests
• Transform non-normal data
• And run normality test again
Practical
• On the livestock data
• Compute livestock units (see the conversion table and sketch below)
• Test for normality
• Transform the data
• Compare between genders of households using measures of location and distribution
Conversion table for livestock units

Class of livestock              Livestock units (per head)
Sheep                           0.15
Goats                           0.15
Cattle (24 months and over)     1.00
Cattle (6-23 months)            0.6
Chicken/duck broiler            0.005
Chicken/duck caged layers       0.008
Chicken pullets                 0.002
Pigs (sows)                     0.2
Pigs (weaners)                  0.05
Pigs (feeder hogs)              0.25
Pigeon                          0.0002
Rabbits                         0.008
Guinea fowl                     0.005
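The livestock-unit computation in the practical can be sketched as a simple weighted sum in Python (the household herd counts below are hypothetical; the conversion factors come from the table above).

```python
# Livestock-unit conversion factors (per head), from the table above
LU = {
    "sheep": 0.15, "goats": 0.15,
    "cattle_adult": 1.00, "cattle_young": 0.6,
    "broiler": 0.005, "layer": 0.008, "pullet": 0.002,
    "sow": 0.2, "weaner": 0.05, "feeder_hog": 0.25,
    "pigeon": 0.0002, "rabbit": 0.008, "guinea_fowl": 0.005,
}

# Hypothetical herd for one household: number of animals of each class
herd = {"goats": 4, "cattle_adult": 2, "cattle_young": 1, "broiler": 12}

livestock_units = sum(LU[cls] * n for cls, n in herd.items())
print(f"livestock units = {livestock_units:.2f}")   # 4*0.15 + 2*1.0 + 0.6 + 12*0.005 = 3.26
```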