Types of data and descriptive statistics

Intermediate Training in Quantitative
Analysis
Bangkok 19-23 November 2007
LEARNING PROGRAMME
Topics to be covered in this presentation
• Data collection and variable types
• Basic descriptive analysis
• Review of mean, median, mode, range, frequencies, crosstabs
• Multiple response
Learning objectives
By the end of this session, the participant should be able to:
• Define variable types
• Produce basic descriptive statistics for continuous and categorical variables
• Conduct multiple response analyses
In the social sciences we are usually
interested in discovering something about a
phenomenon.
Whatever the phenomenon we desire to
explain, we seek to explain it by collecting
data from the real world and then using these
data to draw conclusions about what is being
studied.
When collecting data…
• We rely on two types of variables:
  1. Continuous
  2. Categorical
• Variables provide us with information on individuals, households, administrative areas, meteorological stations, etc.
Continuous variable
Continuous variables take numeric values, expressed in a given unit of measurement.
• Examples: income, millimetres of rainfall, amount of agricultural production, percentage of food-insecure households, weight-for-height z-score for children, etc.
Most importantly, each number (within a variable) has a meaning in relation to the other numbers, allowing arithmetic comparisons to be drawn.
Types of variables
• CATEGORICAL: Nominal, Ordinal
• CONTINUOUS: Interval, Ratio
Types of variables: different levels of measurement
Interval
An interval scale differs from an ordinal one in that the differences between adjacent values are equal. Examples include the Fahrenheit and Celsius temperature scales.
Ratio
A ratio scale differs from an interval one in that there is a true zero point (for example, the percentage of households that are food insecure).
Distribution of continuous variables
• Continuous variables can be either symmetrically distributed (for example, normally distributed) or asymmetrically distributed (otherwise known as skewed).
Note: To assess a variable's distribution in SPSS, use the "histogram" option under Graphs (SPSS help can provide more details).
A normal distribution looks
like this…
Distribution of continuous variables…
A skewed distribution looks like this… Distributions can be either positively skewed (long tail to the right) or negatively skewed (long tail to the left).
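The direction of skew can also be checked numerically. The sketch below is a minimal illustration using only the Python standard library; the data values are hypothetical, and the formula is the simple moment coefficient of skewness (the mean of the cubed z-scores), which may differ slightly from the adjusted value a statistics package reports.

```python
from statistics import fmean, pstdev


def skewness(values):
    """Moment coefficient of skewness: the mean of the cubed z-scores."""
    mean = fmean(values)
    sd = pstdev(values)  # population standard deviation
    return fmean(((x - mean) / sd) ** 3 for x in values)


# Hypothetical values, made up for illustration only.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 20]        # one very large value, like a high income
left_tailed = [-x for x in right_tailed]  # mirror image of the list above

print(skewness(symmetric))     # zero for perfectly symmetric data
print(skewness(right_tailed))  # positive: long tail to the right
print(skewness(left_tailed))   # negative: long tail to the left
```

A value near zero suggests a roughly symmetric distribution; the sign indicates which side the long tail is on.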
Let's take a look at two examples…
• Income
• Weight-for-height z-scores
Categorical variables
Values are categories, taking a limited set of values.
• Examples: child age groups (0-11, 12-23, 24-35, 36-47, 48-59 months); sex of respondent (male/female)
Categories can be denoted by numbers or letters.
Categorical variables take two forms: nominal and ordinal.
Nominal variables
A nominal measurement scale is a set of mutually exclusive categories that vary qualitatively but not quantitatively, for example gender, provinces, income sources, etc.
Codes are labels representing different behaviours or characteristics, and they do not imply any underlying order.
Examples include variables with a "yes/no" answer, or "male"/"female".
Ordinal categorical variables
An ordinal measurement scale differs from a nominal one in that the categories have a meaningful order, which is preserved in the analysis. However, the differences between adjacent categories are not necessarily equal.
Examples: social class and perception (bad, medium, good).
Descriptive statistics
• Descriptive statistics are the most basic form of statistics
• They include:
  • Summaries of one variable
  • Comparisons of two or more variables
• These summaries are the foundation for more advanced statistical techniques
Descriptives
• Continuous variables
  • Range
  • Mean
  • Median
  • Mode
• Categorical variables
  • Frequencies
  • Crosstabs
First, let's discuss descriptives for continuous variables…
Range
It is the spread between the smallest and the
largest values in a distribution
Mean
The mean is a measure of the variable's central tendency.
The (arithmetic) MEAN is the sum of all the values divided by the number of cases.
The mean describes the data best when the distribution is roughly symmetric (for example, normal); in a skewed distribution it is pulled towards the tail.
Median
The MEDIAN is the value above and below which half the cases fall (the 50th percentile), i.e. the middle value of a set of observations ranked in order.
The median is a measure of central tendency that is not sensitive to outlying values, unlike the mean, which can be affected by a few extremely high or low values.
The median does not assume a normal distribution.
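A small sketch can make the difference in sensitivity concrete. The incomes below are hypothetical values invented for illustration; the functions come from Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical monthly incomes, made up for illustration only.
incomes = [100, 120, 130, 150, 160]

# Same list, but the largest value is replaced by an extreme one.
with_outlier = incomes[:-1] + [5000]

print(mean(incomes), median(incomes))            # mean 132, median 130
print(mean(with_outlier), median(with_outlier))  # mean 1100, median 130
```

One extreme value moves the mean from 132 to 1100, while the median stays at 130.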
Mode
The MODE of a distribution is the value of the
observation occurring most frequently. It can
be used with all measurement scales.
If several values share the greatest frequency
of occurrence, each of them is a mode.
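As a minimal sketch of the case with several modes, the standard-library function `statistics.multimode` returns every value that shares the highest frequency. The coded answers below are hypothetical.

```python
from statistics import multimode

# Hypothetical coded answers (1 = bad, 2 = medium, 3 = good), made up for illustration.
perception = [1, 2, 2, 3, 3, 1, 2, 3]

# Codes 2 and 3 each occur three times, so both are modes.
print(multimode(perception))  # [2, 3]

# The mode also works for nominal labels, where a mean would make no sense.
sex = ["male", "female", "female"]
print(multimode(sex))  # ['female']
```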
To illustrate these concepts…
Looking at age data from 10 individuals…
Individual   Age
1            12
2            19
3            23
4            26
5            28
6            28
7            28
8            34
9            36
10           38
• What is the range? 12 to 38
• What is the mean? 27.2
• What is the median? 28
• What is the mode? 28
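These answers can be reproduced with a few lines of Python. This is only a sketch using the standard `statistics` module on the ten ages listed above; the variable names are illustrative.

```python
from statistics import mean, median, mode

ages = [12, 19, 23, 26, 28, 28, 28, 34, 36, 38]

lowest, highest = min(ages), max(ages)
print("range:", lowest, "to", highest)  # 12 to 38
print("mean:", mean(ages))              # sum 272 divided by 10 cases = 27.2
print("median:", median(ages))          # average of the 5th and 6th values (28 and 28)
print("mode:", mode(ages))              # 28 occurs three times
```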
Other basic concepts that
must be understood…
Variance
Standard deviation
Standard deviation and variance
The standard deviation measures the typical distance between the observations and the mean (and so is a measure of how well the mean describes the actual data). It is the square root of the variance.
The variance is the average of the squared deviations from the mean, i.e. the square of the standard deviation.
Variance and standard deviation
Individual   Age
1            12
2            19
3            23
4            26
5            28
6            28
7            28
8            34
9            36
10           38
What is the standard deviation of age?
What is the variance?
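One way to check the answers is sketched below with Python's standard `statistics` module. Note the assumption about which formula is wanted: the sample versions divide the sum of squared deviations by n - 1 (the convention statistics packages typically report), while the population versions divide by n. Both are shown so the difference is visible.

```python
from statistics import pstdev, pvariance, stdev, variance

ages = [12, 19, 23, 26, 28, 28, 28, 34, 36, 38]

# Sample versions: sum of squared deviations divided by n - 1.
sample_var, sample_sd = variance(ages), stdev(ages)
print(round(sample_var, 2), round(sample_sd, 2))  # 62.18 7.89

# Population versions: sum of squared deviations divided by n.
pop_var, pop_sd = pvariance(ages), pstdev(ages)
print(round(pop_var, 2), round(pop_sd, 2))        # 55.96 7.48
```

In both cases the standard deviation is the square root of the variance.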
Standard deviation
In a normal distribution, 68.27% of cases fall within ± one standard deviation of the mean, 95.45% of cases fall within ± two standard deviations, and 99.73% fall within ± three standard deviations.
For example, if the mean age is 45, with a standard deviation of 10, about 95% of the cases would be between 25 and 65 in a normal distribution.
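The percentages quoted above can be verified with `statistics.NormalDist` from the Python standard library. The sketch assumes the example's mean age of 45 and standard deviation of 10; the variable names are illustrative.

```python
from statistics import NormalDist

age = NormalDist(mu=45, sigma=10)

# Share of cases within k standard deviations of the mean, for k = 1, 2, 3.
shares = {}
for k in (1, 2, 3):
    low, high = 45 - k * 10, 45 + k * 10
    shares[k] = age.cdf(high) - age.cdf(low)
    print(f"within {k} SD ({low} to {high}): {shares[k]:.2%}")
```

For k = 2 the interval is 25 to 65, which covers roughly 95% of cases.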
Now let's discuss descriptives for categorical variables…
Analysing categorical data
If you want to look at the relationship between two categorical variables, you cannot use the mean or the median.
The mean of a categorical variable is meaningless because the numeric codes we attach to different categories are arbitrary.
Descriptives for categorical data
• The most basic descriptives for categorical variables are frequencies, which show the number (or percent) of cases in each category
WAZPREV
                             Frequency   Percent   Valid Percent   Cumulative Percent
Valid     Not malnourished        4290      11.2            80.2                 80.2
          Malnourished            1058       2.8            19.8                100.0
          Total                   5348      14.0           100.0
Missing   System                 32840      86.0
Total                            38188     100.0
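The difference between the Percent, Valid Percent and Cumulative Percent columns is easiest to see by recomputing them. The sketch below uses the counts from the WAZPREV table; it is an illustration in plain Python, not the procedure SPSS itself runs.

```python
# Counts taken from the WAZPREV frequency table.
valid = {"Not malnourished": 4290, "Malnourished": 1058}
missing = 32840

valid_total = sum(valid.values())     # 5348 cases with a valid value
grand_total = valid_total + missing   # 38188 cases in the file

rows = []
cumulative = 0.0
for category, count in valid.items():
    percent = 100 * count / grand_total        # share of all cases
    valid_percent = 100 * count / valid_total  # share of non-missing cases
    cumulative += valid_percent                # running total of valid percent
    rows.append((category, count, round(percent, 1),
                 round(valid_percent, 1), round(cumulative, 1)))

for row in rows:
    print(row)
```

Percent uses all cases as the denominator, while Valid Percent excludes the missing ones, which is why 11.2% and 80.2% describe the same 4290 children.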
Descriptives for categorical data
• Second, we can cross-tabulate the categories of one variable with the categories of a second variable. The resulting table is known as a contingency table.
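As a minimal sketch of what cross-tabulation does, the code below counts each combination of two categorical variables with `collections.Counter`. The six records are hypothetical and made up for illustration.

```python
from collections import Counter

# Hypothetical records (child gender, underweight), made up for illustration only.
records = [
    ("Male", "no"), ("Male", "yes"), ("Female", "no"),
    ("Female", "no"), ("Male", "no"), ("Female", "yes"),
]

cells = Counter(records)  # counts each (row, column) combination
row_labels = sorted({gender for gender, _ in records})
col_labels = sorted({status for _, status in records})

# Nested dictionary: table[row][column] -> count
table = {r: {c: cells[(r, c)] for c in col_labels} for r in row_labels}

for r in row_labels:
    print(r, table[r], "Total:", sum(table[r].values()))
```

Each cell of the contingency table is simply the number of cases that fall in that row category and that column category at the same time.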
Contingency Tables
Child Gender * underweight Crosstabulation
                                       underweight
Child Gender                           no      yes     Total
Male      Count                        793     280     1073
          % within Child Gender        74%     26%     100%
          % within underweight         51%     53%     51%
          % of Total                   38%     13%     51%
Female    Count                        772     253     1025
          % within Child Gender        75%     25%     100%
          % within underweight         49%     47%     49%
          % of Total                   37%     12%     49%
Total     Count                        1565    533     2098
          % within Child Gender        75%     25%     100%
          % within underweight         100%    100%    100%
          % of Total                   75%     25%     100%
Contingency Tables
                                               Orphan
Residence                                      No        Yes       Total
Capital, large city   Count                    1,071     125       1,196
                      % within residence       89.5%     10.5%     100.0%
                      % within orphan          7.4%      6.8%      7.3%
Small city            Count                    395       37        432
                      % within residence       91.4%     8.6%      100.0%
                      % within orphan          2.7%      2.0%      2.6%
Town                  Count                    824       131       955
                      % within residence       86.3%     13.7%     100.0%
                      % within orphan          5.7%      7.1%      5.8%
Countryside           Count                    12,252    1,558     13,810
                      % within residence       88.7%     11.3%     100.0%
                      % within orphan          84.3%     84.2%     84.2%
Total                 Count                    14,542    1,851     16,393
                      % within residence       88.7%     11.3%     100.0%
                      % within orphan          100.0%    100.0%    100.0%
Example…
WAZPREV * ORPHAN Crosstabulation
                                          ORPHAN
WAZPREV                                   Non orphan   Orphan    Total
Not malnourished   Count                  3995         207       4202
                   % within WAZPREV       95.1%        4.9%      100.0%
                   % within ORPHAN        80.2%        80.9%     80.2%
Malnourished       Count                  987          49        1036
                   % within WAZPREV       95.3%        4.7%      100.0%
                   % within ORPHAN        19.8%        19.1%     19.8%
Total              Count                  4982         256       5238
                   % within WAZPREV       95.1%        4.9%      100.0%
                   % within ORPHAN        100.0%       100.0%    100.0%
What percentage of orphans are malnourished?
What percentage of non orphans are malnourished?
What percentage of malnourished children are orphans?
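Each question corresponds to a different denominator. The sketch below recomputes the relevant percentages from the counts in the WAZPREV * ORPHAN table; the function names are illustrative, not SPSS terms.

```python
# Counts taken from the WAZPREV * ORPHAN crosstabulation.
counts = {
    "Not malnourished": {"Non orphan": 3995, "Orphan": 207},
    "Malnourished": {"Non orphan": 987, "Orphan": 49},
}


def column_percent(row, col):
    """Share of the column group (e.g. orphans) that falls in the given row."""
    column_total = sum(r[col] for r in counts.values())
    return 100 * counts[row][col] / column_total


def row_percent(row, col):
    """Share of the row group (e.g. malnourished children) that falls in the given column."""
    return 100 * counts[row][col] / sum(counts[row].values())


# Orphans who are malnourished: "% within ORPHAN".
print(round(column_percent("Malnourished", "Orphan"), 1))      # 19.1
# Non orphans who are malnourished: "% within ORPHAN".
print(round(column_percent("Malnourished", "Non orphan"), 1))  # 19.8
# Malnourished children who are orphans: "% within WAZPREV".
print(round(row_percent("Malnourished", "Orphan"), 1))         # 4.7
```

The first two questions are answered by the "% within ORPHAN" rows and the third by the "% within WAZPREV" row.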
Multiple response analysis
• Sometimes we have to analyse categorical data where households are able to give more than one response to a question (e.g. livelihoods, coping strategies, etc.)
• Analysing such data requires a multiple response analysis
After completing question SCM1, complete one incident at a time (line by line), each time repeating the questions above.
SCM 2. By order of importance, what incidents did your household experience in the last 1 YEAR?
SCM 3. What is the main action your household took to compensate the effect of that incident?
SCM 4. How often did you do this in the last 1 YEAR?
SCM 5. Did your household recover from that incident?
Incidents code:
1 = Insecurity, violence
2 = Increased price for food
4 = Drop in farm gate price
5 = Floods
6 = Drought/dry spell
7 = Crop pest and disease
8 = Livestock disease
9 = Sickness of household member
10 = Death of household member
11 = Increased household size (IDPs)
12 = Loss / lack of employment
Coping code:
0 = Nothing
1 = Eat less preferred foods
2 = Eat fewer or smaller meals per day
3 = Go one entire day without meals
4 = Collect wild foods, hunt or harvest immature crops
5 = Distress sale / slaughter of livestock
6 = Distress sale of other assets
7 = Purchase food on credit
8 = Borrow food from families and friends, kinship support
9 = Worked for money
10 = Worked for food only
11 = Reduced expenditures on health or education
12 = Spent savings
13 = Some household members migrated
14 = Other (specify)_____________
Response grid:
           SCM 2        SCM 3        SCM 4       SCM 5
MAIN:      (Code) ___   (Code) ___   ___ times   1. Yes   2. No
SECOND:    (Code) ___   (Code) ___   ___ times   1. Yes   2. No
THIRD:     (Code) ___   (Code) ___   ___ times   1. Yes   2. No
FOURTH:    (Code) ___   (Code) ___   ___ times   1. Yes   2. No
FIFTH:     (Code) ___   (Code) ___   ___ times   1. Yes   2. No
Multiple response frequencies
$shocks Frequencies
                                    Responses            Percent of
$shocks(a)                          N         Percent    Cases
Insecurity, violence                2418      12.5%      36.0%
Higher prices                       2115      11.0%      31.5%
Drop in farmgate price              881       4.6%       13.1%
Floods                              1410      7.3%       21.0%
Drought                             2695      14.0%      40.1%
Crop pest/disease                   2031      10.5%      30.3%
Livestock disease                   1479      7.7%       22.0%
Sickness in HH                      2950      15.3%      43.9%
Death in HH                         1229      6.4%       18.3%
Increased HH size (IDPs)            689       3.6%       10.3%
Loss/lack of employment             1398      7.2%       20.8%
Total                               19295     100.0%     287.4%
a. Group
• N is the number of households that reported each shock; the Total row counts responses, not households
• The Percent column reports the percentage of total responses represented by each shock. This is not easily available from individual frequency tables.
• The Percent of Cases column is the percentage of valid cases (households) that reported each shock. Because a household can report several shocks, these percentages can add up to more than 100%.
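The two percentage columns differ only in their denominator. The sketch below illustrates this in plain Python on five hypothetical households; treating a household with no reported shock as a non-valid case is an assumption made for the example.

```python
from collections import Counter

# Hypothetical households, each listing the shocks it reported (made up for illustration).
households = [
    ["Floods", "Drought"],
    ["Drought"],
    ["Floods", "Drought", "Higher prices"],
    [],                 # no shock reported: treated here as not a valid case
    ["Higher prices"],
]

valid_cases = [h for h in households if h]
# Count each shock once per household that reported it.
responses = Counter(shock for h in valid_cases for shock in set(h))
total_responses = sum(responses.values())

summary = {
    shock: {
        "n": n,
        "percent": 100 * n / total_responses,            # share of all responses
        "percent_of_cases": 100 * n / len(valid_cases),  # share of valid households
    }
    for shock, n in responses.items()
}

for shock, stats in summary.items():
    print(shock, stats)
```

Here the Percent column sums to 100%, while Percent of Cases sums to 175% because the four valid households gave seven responses between them.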
Multiple response crosstabs
$shocks*fcgbivariate Crosstabulation
                                               fcgbivariate
                                               poor/borderline   acceptable
$shocks(a)                                     food cons         food cons     Total
Insecurity, violence    Count                  726               1516          2242
                        % within $shocks       32.4%             67.6%
Higher prices           Count                  710               1269          1979
                        % within $shocks       35.9%             64.1%
Drop in farm gate pri   Count                  275               536           811
                        % within $shocks       33.9%             66.1%
Floods                  Count                  494               787           1281
                        % within $shocks       38.6%             61.4%
Drought                 Count                  911               1607          2518
                        % within $shocks       36.2%             63.8%
Crop pest/disease       Count                  646               1227          1873
                        % within $shocks       34.5%             65.5%
Livestock disease       Count                  414               946           1360
                        % within $shocks       30.4%             69.6%
Sickness in HH          Count                  764               2050          2814
                        % within $shocks       27.1%             72.9%
Death in HH             Count                  341               822           1163
                        % within $shocks       29.3%             70.7%
Increased HH size (I    Count                  187               451           638
                        % within $shocks       29.3%             70.7%
Loss/lack of employm    Count                  431               869           1300
                        % within $shocks       33.2%             66.8%
Total                   Count                  1887              4467          6354
Percentages and totals are based on respondents.
a. Group
Multiple response…
• To set up a multiple response set in SPSS…
  • Click on "Analyze"
  • Click on "Multiple response"
  • Click on "Define sets…"
  • Move the variables of interest into the box on the right
  • Then define the range of the variable
  • Then name the multiple response set
  • Then click on "Add"
  • Then click on "Close"
Multiple response…
• To run a multiple response analysis in SPSS…
  • Click on "Analyze"
  • Click on "Multiple response"
  • Click on "Frequencies…" or "Crosstabs…" (whichever descriptive you would like to produce)
  • Move the variables into the proper boxes
  • Then click "OK"
Now, practical exercises…