Getting to the essential


Data and central tendency
Integrated Disease Surveillance Programme (IDSP)
district surveillance officers (DSO) course
Outline of the session
1. Type of data
2. Central tendency
Epidemiological process
• We collect data
 We use criteria and definitions
• We analyze data into information
 “Data reduction / condensation”
• We interpret the information for decision making
 What does the information mean to us?
Surveillance:
A role of the public health system
The systematic process of collection,
transmission, analysis and feedback of public
health data for decision making
Data → (Analysis) → Information → (Interpretation) → Action
Today we will focus on DATA:
The starting point
Data: A definition
• Set of related numbers
• Raw material for statistics
• Example:
 Temperature of a patient over time
 Date of onset of patients
Types of data
• Qualitative data
 No magnitude / size
 Classified by counting the units that have the
same attribute
 Types
• Binary
• Nominal
• Ordinal
• Quantitative data
Qualitative, binary data
• The variable can only take two values
 1,0 often used (or 1,2)
 Yes, No
• Example:
 Sex
• Male, Female
 Female sex
• Yes, No
REC SEX
--- ---
1 M
2 M
3 M
4 F
5 M
6 F
7 F
8 M
9 M
10 M
11 F
12 M
13 M
14 M
15 F
16 F
17 F
18 M
19 M
20 M
21 F
22 M
23 M
24 F
25 M
26 M
27 M
28 F
29 M
30 M
Frequency distribution
for a qualitative binary variable
Sex      Frequency   Proportion
Female   10          33.3%
Male     20          66.7%
Total    30          100.0%
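The tabulation above can be reproduced with a short sketch. It assumes the 30 sex values are typed in record order from the line listing; the function name `frequency_distribution` is illustrative and not part of the course material.

```python
from collections import Counter

# Sex of records 1 to 30, copied in order from the line listing.
sex = list("MMMFMFFMMM" "FMMMFFFMMM" "FMMFMMMFMM")


def frequency_distribution(values):
    """Return {category: (count, percentage)} for a list of categories."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, round(100 * n / total, 1)) for cat, n in counts.items()}


table = frequency_distribution(sex)
for category, (count, percent) in sorted(table.items()):
    print(f"{category}: {count} ({percent}%)")
print(f"Total: {len(sex)}")
```

The same function applies unchanged to nominal or ordinal variables, since it only counts the units that share an attribute.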
Using a pie chart to display a qualitative binary variable
[Pie chart: Distribution of cases by sex (Female, Male)]
Qualitative, nominal data
• The variable can take more than two values
 Any value
• The information fits into one of the
categories
• The categories cannot be ranked
• Example:
 Nationality
 Language spoken
 Blood group
Rec   State
1     Punjab
2     Bihar
3     Rajasthan
4     Punjab
5     Bihar
6     Punjab
7     Bihar
8     Bihar
9     UP
10    Rajasthan
11    Bihar
12    Rajasthan
13    Punjab
14    UP
15    Rajasthan
16    UP
17    Punjab
18    UP
19    Rajasthan
20    Bihar
21    UP
22    Bihar
23    UP
24    Rajasthan
25    Bihar
26    Bihar
27    Bihar
28    UP
29    Bihar
30    UP
Frequency distribution
for a qualitative nominal variable
State       Frequency   Proportion
Bihar       11          36.7%
UP          8           26.7%
Rajasthan   6           20.0%
Punjab      5           16.7%
Total       30          100.0%
Using a horizontal bar chart to display a qualitative nominal variable
[Horizontal bar chart: Distribution of cases by state; x-axis: Frequency; bars: Bihar, UP, Rajasthan, Punjab]
Qualitative, ordinal data
• The variable can take only a limited number of values that can be ranked along some gradient
• Example:
 Birth order
• First, second, third …
 Severity
• Mild, moderate, severe
 Vaccination status
• Unvaccinated, partially vaccinated, fully vaccinated
REC Status
--- ------
1   1
2   1
3   2
4   2
5   1
6   2
7   1
8   2
9   3
10  2
11  1
12  3
13  1
14  3
15  1
16  3
17  1
18  1
19  3
20  1
21  1
22  2
23  1
24  2
25  2
26  1
27  2
28  3
29  2
30  2
Frequency distribution
for a qualitative ordinal variable
Severity   Frequency   Proportion
Mild       13          43.3%
Moderate   11          36.7%
Severe     6           20.0%
Total      30          100.0%
Clinical status: 1: Mild; 2: Moderate; 3: Severe
Using a vertical bar chart to display a qualitative ordinal variable
[Vertical bar chart: Distribution of cases by severity; y-axis: Frequency; bars: Mild, Moderate, Severe]
Key issues
• Qualitative data
• Quantitative data
 We are not simply counting
 We are also measuring
• Discrete
• Continuous
Quantitative, discrete data
• Values are distinct and separated
• Normally, values have no decimals
• Example:
 Number of sexual partners
 Parity
 Number of persons who died from measles
REC CHILDREN
--- --------
1   1
2   2
3   5
4   6
5   3
6   4
7   1
8   1
9   2
10  3
11  1
12  2
13  7
14  3
15  4
16  2
17  1
18  1
19  1
20  1
21  2
22  3
23  1
24  4
25  2
26  1
27  6
28  4
29  3
30  1
Frequency distribution
for a quantitative discrete variable
Children   Frequency   Proportion
1          11          36.7%
2          6           20.0%
3          5           16.7%
4          4           13.3%
5          1           3.3%
6          2           6.7%
7          1           3.3%
Total      30          100.0%
Using a histogram to display a discrete quantitative variable
[Histogram: Distribution of households by number of children; x-axis: Number of children (1 to 7); y-axis: Frequency]
Quantitative, continuous data
• Continuous variable
• Can assume continuous uninterrupted range
of values
• Values may have decimals
• Example:
 Weight
 Height
 Hb level
 What about temperature?
REC WEIGHT
--- ------
1 10.5
2 23.7
3 21.8
4 33.1
5 38.0
6 34.5
7 38.5
8 38.4
9 30.1
10 34.7
11 37.9
12 38.0
13 39.2
14 30.1
15 43.2
16 45.7
17 40.4
18 56.4
19 55.1
20 55.4
21 66.7
22 82.9
23 109.7
24 120.2
25 10.4
26 10.8
27 25.5
28 20.2
29 27.3
30 38.7
Frequency distribution for a continuous
quantitative variable: The tally mark
Weight    Tally mark       Frequency
10-19     III              3
20-29     IIIII            5
30-39     IIIII IIIII II   12
40-49     III              3
50-59     III              3
60-69     I                1
70-79     -                0
80-89     I                1
90-99     -                0
100-109   I                1
110-119   I                1
Frequency distribution for a continuous
quantitative variable, after aggregation
Weight    Frequency   Proportion
10-19     3           10.0%
20-29     5           16.7%
30-39     12          40.0%
40-49     3           10.0%
50-59     3           10.0%
60-69     1           3.3%
70-79     0           0.0%
80-89     1           3.3%
90-99     0           0.0%
100-109   1           3.3%
110-119   1           3.3%
Total     30          100.0%
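Aggregating a continuous variable into classes can be sketched as follows. The sketch assumes that a class such as 10-19 is read as "10 up to, but not including, 20"; the helper name `class_label` and the 10-unit class width are illustrative choices. Under this reading, a value of 120.2 is labelled 120-129.

```python
from collections import Counter

# Weights of records 1 to 30, copied in order from the line listing.
weights = [
    10.5, 23.7, 21.8, 33.1, 38.0, 34.5, 38.5, 38.4, 30.1, 34.7,
    37.9, 38.0, 39.2, 30.1, 43.2, 45.7, 40.4, 56.4, 55.1, 55.4,
    66.7, 82.9, 109.7, 120.2, 10.4, 10.8, 25.5, 20.2, 27.3, 38.7,
]


def class_label(value, width=10):
    """Return the label of the class containing value, e.g. 38.0 -> '30-39'."""
    lower = int(value // width) * width
    return f"{lower}-{lower + width - 1}"


counts = Counter(class_label(w) for w in weights)

# Print classes in increasing order of their lower bound.
for label in sorted(counts, key=lambda s: int(s.split("-")[0])):
    share = 100 * counts[label] / len(weights)
    print(f"{label}: {counts[label]} ({share:.1f}%)")
```

Classes with no observation do not appear in `counts`; they would need to be listed explicitly if empty rows are wanted in the table.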
Using a histogram to display a frequency
distribution for a continuous quantitative
variable, after aggregation
[Histogram: Distribution of cases by weight; x-axis: Weight categories (0-9 to 110-119); y-axis: Frequency]
Summary statistics
• A single value that summarizes the observed values of a variable
 Part of the data reduction process
• Two types:
 Measures of location/central tendency/average
 Measures of dispersion/variability/spread
• Describe the shape of the distribution of a set of
observations
• Necessary for precise and efficient comparisons of
different sets of data
 The location (average) and shape (variability) of different
distributions may be different
Describing a distribution
[Figure: histogram annotated with the position and the dispersion of the distribution]
Same location, different variability
[Figure: number of people by Factor X in Population A and Population B; same location, different variability]
Different location, same variability
[Figure: number of people by Factor Y in Population A and Population B; different locations, same variability]
Measures of central tendency
• Mode
• Median
• Arithmetic mean
The mode
• Definition
 The mode of a distribution is the value that is
observed most frequently in a given set of data
• How to obtain it?
 Arrange the data in sequence from low to high
 Count the number of times each value occurs
 The most frequently occurring value is the mode
The mode
[Figure: frequency histogram with the mode marked]
Example of mode: annual salary
(in 10,000 rupees)
• 4, 3, 3, 2, 3, 8, 4, 3, 7, 2
• Arranging the values in order:
 2, 2, 3, 3, 3, 3, 4, 4, 7, 8
 The mode is 3, which occurs four times
Specific features of the mode
• There may be no mode
 When each value is unique
• There may be more than one mode
 When more than 1 peak occurs
 Bimodal distribution
• The mode is not amenable to statistical tests
• The mode is not based on all the
observations
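The counting procedure and the special cases listed above (no mode, more than one mode) can be sketched as follows. The function name `modes` and the choice to return an empty list when every value is unique are assumptions made for illustration.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or [] when every value is unique."""
    if not values:
        return []
    counts = Counter(values)           # count how often each value occurs
    highest = max(counts.values())     # frequency of the most common value
    if highest == 1:
        return []                      # each value occurs once: no mode
    return sorted(v for v, n in counts.items() if n == highest)


# Annual salaries (in 10,000 rupees) from the example.
salaries = [4, 3, 3, 2, 3, 8, 4, 3, 7, 2]
print(modes(salaries))           # single mode
print(modes([1, 1, 2, 2, 3]))    # two modes (bimodal)
print(modes([1, 2, 3]))          # no mode
```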
The median
• The median is literally the middle value of the ordered data
• It is defined as the value above and below which half (50%) of the observations fall
Computing the median
• Arrange the observations in order from smallest to largest (ascending order) or vice versa
• Count the number of observations “n”
 If “n” is an odd number
• Median = value of the ((n+1)/2)th observation (middle value)
 If “n” is an even number
• Median = the average of the (n/2)th and ((n/2)+1)th observations (average of the two middle numbers)
Example of median calculation
• What is the median of the following values:
 10, 20, 12, 3, 18, 16, 14, 25, 2
 Arrange the numbers in increasing order
• 2 , 3, 10, 12, 14, 16, 18, 20, 25
• Median = 14
• Suppose there is one more observation (8)
 2 , 3, 8, 10, 12, 14, 16, 18, 20, 25
 Median = Mean of 12 & 14 = 13
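The odd/even rule for the median can be written out as a minimal sketch and checked against the worked example. The function name `median` is illustrative; Python lists are indexed from 0, so the k-th observation is at index k - 1.

```python
def median(values):
    """Median by the rank rule: the middle value, or the mean of the two middle values."""
    if not values:
        raise ValueError("median of an empty list is undefined")
    ordered = sorted(values)               # arrange in ascending order
    n = len(ordered)                       # number of observations
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]   # the ((n+1)/2)th observation
    lower = ordered[n // 2 - 1]            # the (n/2)th observation
    upper = ordered[n // 2]                # the ((n/2)+1)th observation
    return (lower + upper) / 2


values = [10, 20, 12, 3, 18, 16, 14, 25, 2]
print(median(values))          # nine observations
print(median(values + [8]))    # ten observations
```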
Advantages and disadvantages
of the median
• Advantages
 The median is unaffected by extreme values
• Disadvantages
 The median does not contain information on the
other values of the distribution
• Only selected by its rank
• You can change 50% of the values without affecting the
median
 The median is less amenable to statistical tests
The median is not sensitive to
extreme values
[Figure: two histograms by class of the variable, with the same median]
Mean (Arithmetic mean / Average)
• Most commonly used measure of location
• Definition
 Calculated by adding all observed values and
dividing by the total number of observations
• Notations
 Each observation is denoted as $x_1, x_2, \dots, x_n$
 The total number of observations: $n$
 Summation process = Sigma: $\Sigma$
 The mean: $\bar{x}$
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Computation of the mean
• Duration of stay in days in a hospital
 8,25,7,5,8,3,10,12,9
• 9 observations (n=9)
• Sum of all observations = 87
• Mean duration of stay = 87 / 9 = 9.67
• Incubation period in days of a disease
 8,45,7,5,8,3,10,12,9
• 9 observations (n=9)
• Sum of all observations =107
• Mean incubation period = 107 / 9 = 11.89
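The two computations above can be checked with a minimal sketch. The function name `arithmetic_mean` is illustrative. The two series differ in a single value (25 versus 45), which is enough to move the mean by more than two days.

```python
def arithmetic_mean(values):
    """Sum of all observations divided by the number of observations."""
    return sum(values) / len(values)


stay_days = [8, 25, 7, 5, 8, 3, 10, 12, 9]          # duration of stay
incubation_days = [8, 45, 7, 5, 8, 3, 10, 12, 9]    # incubation period

print(sum(stay_days), round(arithmetic_mean(stay_days), 2))
print(sum(incubation_days), round(arithmetic_mean(incubation_days), 2))
```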
Advantages and disadvantages
of the mean
• Advantages
 Has a lot of good theoretical properties
 Used as the basis of many statistical tests
 Good summary statistic for a symmetrical
distribution
• Disadvantages
 Less useful for an asymmetric distribution
• Can be distorted by outliers, therefore giving a less
“typical” value
[Figure: histogram of a distribution over values 0 to 19, annotated with Median = 10, Mode = 13.5, Mean = 10.8]
Ideal characteristics of a measure of
central tendency
• Easy to understand
• Simple to compute
• Not unduly affected by extreme values
• Rigidly defined
 Clear guidelines for calculation
• Capable of further mathematical treatment
• Sample stability
 Different samples generate same measure
Which measure of location to use?
• Consider the duration (days) of absence from
work of 21 labourers owing to sickness
 1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 6, 6, 6, 7, 8, 9, 10,
10, 59, 80
• Mean = 11 days
 Not typical of the series as 19 of the 21 labourers
were absent for less than 11 days
 Distorted by extreme values
• Median = 5 days
 Better measure
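The comparison for the sickness absence series can be reproduced with Python's standard `statistics` module. This is a sketch for checking the figures quoted above, not a prescribed method; the variable names are illustrative.

```python
import statistics

# Days of absence from work for the 21 labourers.
absences = [1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 6, 6, 6, 7, 8, 9, 10,
            10, 59, 80]

mean_days = statistics.mean(absences)       # pulled upwards by 59 and 80
median_days = statistics.median(absences)   # the 11th of 21 ordered values

# Number of labourers absent for fewer days than the mean.
below_mean = sum(1 for d in absences if d < mean_days)

print(f"Mean   = {mean_days:.1f} days")
print(f"Median = {median_days} days")
print(f"{below_mean} of {len(absences)} labourers are below the mean")
```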
Type of data: Summary
Qualitative                        Quantitative
Binary   Nominal     Ordinal       Discrete   Continuous
Sex      State       Status        Children   Weight
M        Bihar       Mild          1          56.4
M        Punjab      Moderate      1          47.8
F        Bihar       Severe        2          59.9
M        Punjab      Mild          3          13.1
F        UP          Moderate      1          25.7
F        Bihar       Mild          1          23.0
M        UP          Moderate      2          30.0
M        Rajasthan   Severe        3          13.7
F        Punjab      Severe        2          15.4
M        Rajasthan   Mild          2          52.5
F        Bihar       Moderate      1          26.6
F        UP          Moderate      1          38.2
M        Rajasthan   Mild          1          59.0
M        Bihar       Severe        2          57.9
M        Punjab      Severe        2          19.6
F        Punjab      Moderate      3          31.7
M        Rajasthan   Mild          2          15.1
F        UP          Mild          3          33.9
M        Bihar       Mild          1          45.6
Definitions of
measures of central tendency
• Mode
 The most frequently occurring observation
• Median
 The mid-point of a set of ordered observations
• Arithmetic mean
 Aggregate / sum of the given observations divided by the number of observations