Data Collection
FK6163
Collect, Explore & Summarise
Dr Azmi Mohd Tamil
Dept of Community Health
Universiti Kebangsaan Malaysia
©[email protected] 2013
Data Collection
Data collection begins after deciding on the design of the study and the sampling strategy.
Data Collection
Sample subjects are identified and the required individual information is obtained in an item-wise and structured manner.
Data Collection
Information is collected on certain characteristics, attributes and qualities of interest from the samples.
These data may be quantitative or qualitative in nature.
Data Collection Techniques
Use available information
Observation
Interviews
Questionnaires
Focus group discussion
Using Available Information
Existing records
• Hospital records - case notes
• National registry of births & deaths
• Census data
• Data from other surveys
Disadvantages of using existing records
Incomplete records
Cause of death may not be verified by a physician/MD
Missing vital information
Difficult to decipher
May not be representative of the target group - only severe cases go to hospital
Disadvantages of using existing records
Delayed publication - obsolete data
Different methods of data recording between institutions, states and countries make comparison & pooling of data difficult
Comparisons across time are difficult due to differences in classification, diagnostic tools etc.
Advantages of using existing records
Cheap
Convenient
In some situations, it is the only data source, e.g. accidents & suicides
Observation
Involves systematically selecting, watching & recording behaviour and characteristics of living beings, objects or phenomena
Done using defined scales
Participant observation, e.g. PEF and asthma symptom diary
Non-participant observation, e.g. cholesterol levels
Interviews
Oral questioning of respondents, either individually or as a group.
Can be loosely structured, or highly structured using a questionnaire
Administering Written Questionnaires
Self-administered
• via mail
• by gathering respondents in one place and getting them to fill it in
• by hand-delivering the questionnaires and collecting them later
Large non-response can distort results
Questionnaires
Influenced by education & attitude of respondent, especially when self-administered
Interviewers need to be trained
Open-ended vs closed-ended questions
The need for pre-testing or a pilot study
Focus group discussion
Selecting parties relevant to the research questions at hand and discussing with them in focus groups
Examples in your own field of interest?
Plan for data collection
Permission to proceed
Logistics - who will collect what, when and with what resources
Quality control
Accuracy & Reliability
Accuracy - the degree to which a measurement actually measures the characteristic it is supposed to measure
Reliability - the consistency of replicate measures
Reliability & Accuracy
Accuracy & Reliability
Both are reduced by random error and systematic error from the same sources of variability:
• the data collectors
• the respondents
• the instrument
Strategies to enhance precision & accuracy
Standardise procedures and measurement methods
Training & certifying the data collectors
Repetition
Blinding
Introduction
The method of exploring and summarising data differs according to the types of variables.
Dependent/Independent
Independent variables: Frequency of Exercise, Food Intake
Dependent variable: Obesity
Explore
It is the first step in the analytic process
• to explore the characteristics of the data
• to screen for errors and correct them
• to look for distribution patterns - normal distribution or not
The data may require transformation before further analysis using parametric methods
Or may need analysis using non-parametric techniques
Data Screening
By running frequencies, we may detect inappropriate responses.
How many in the audience have 15 children and are currently pregnant with the 16th?
[Figure: frequency table output; the tabulated values are not recoverable]
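The frequency check described on the slide above can be sketched in a few lines. This is only an illustration: the `children` answers and the `MAX_PLAUSIBLE` cut-off are hypothetical, not taken from the lecture data.

```python
from collections import Counter

# Hypothetical answers to "number of children" (not the lecture's data).
children = [1, 2, 0, 3, 2, 1, 4, 2, 15, 1, 0, 3]

# Assumed plausibility limit, chosen only for this illustration.
MAX_PLAUSIBLE = 10

# Run frequencies: one line per distinct value, with its count.
freq = Counter(children)
for value in sorted(freq):
    flag = "  <-- check against the questionnaire" if value > MAX_PLAUSIBLE else ""
    print(f"{value:>3}  {freq[value]:>3}{flag}")

# Values that deserve a second look before any analysis.
suspect = sorted(v for v in freq if v > MAX_PLAUSIBLE)
```

A flagged value is not automatically wrong; it is a prompt to go back to the original questionnaire.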
Data Screening
See whether the data make sense or not.
E.g. parity 10 but age only 25.
Data Screening
By looking at measures of central tendency and range, we can also detect abnormal values for quantitative data.
Interpreting the Box Plot
Labelled parts of the box plot, from top to bottom: largest non-outlier, upper quartile, median, lower quartile, smallest non-outlier, outlier.
The whiskers extend up to 1.5 times the box length (the interquartile range) from both ends of the box and end at an observed value.
Three times the box length marks the boundary between "mild" and "extreme" outliers.
"Mild" outliers = closed dots; "extreme" outliers = open dots.
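The 1.5 and 3 box-length rules above can be sketched as follows. The `weights` values are hypothetical, and the quartiles come from Python's `statistics.quantiles` with its default method, which may differ slightly from the hinges a statistics package uses when drawing a box plot.

```python
import statistics

def boxplot_fences(values):
    """Return the mild and extreme outlier fences based on the box length (IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return {
        "mild": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "extreme": (q1 - 3 * iqr, q3 + 3 * iqr),
    }

def classify_outliers(values):
    """Split values into mild and extreme outliers; everything else is ignored."""
    fences = boxplot_fences(values)
    mild_lo, mild_hi = fences["mild"]
    ext_lo, ext_hi = fences["extreme"]
    mild, extreme = [], []
    for v in values:
        if v < ext_lo or v > ext_hi:
            extreme.append(v)
        elif v < mild_lo or v > mild_hi:
            mild.append(v)
    return mild, extreme

# Hypothetical weights in kg; 550 stands in for a mistyped 55.0.
weights = [48, 50, 52, 53, 55, 56, 58, 60, 62, 65, 90, 550]
mild, extreme = classify_outliers(weights)
print("mild:", mild, "extreme:", extreme)
```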
Data Screening
We can also make use of graphical tools such as the box plot to detect wrong data entry.
[Figure: box plot of pre-pregnancy weight (N = 184), y-axis from 0 to 600, with outlying cases labelled by case number]
Data Cleaning
Identify the extreme/wrong values
Check with the original data source, i.e. the questionnaire
If incorrect, do the necessary correction.
Correction must be done before transformation, recoding and analysis.
Parameters of Data Distribution
Mean - central value of the data
Standard deviation - measure of how the data scatter around the mean
Symmetry (skewness) - the degree to which the data pile up on one side of the mean
Kurtosis - how peaked or flat the distribution is compared with the normal curve
Normal distribution
The Normal distribution is
represented by a family of curves
defined uniquely by two parameters,
which are the mean and the
standard deviation of the population.
The curves are always
symmetrically bell shaped, but the
extent to which the bell is
compressed or flattened out
depends on the standard deviation
of the population.
However, the mere fact that a curve
is bell shaped does not mean that it
represents a Normal distribution,
because other distributions may
have a similar sort of shape.
Normal distribution
If the observations follow a Normal distribution, a range covered by one standard deviation above the mean and one standard deviation below it (± 1sd) includes about 68.3% of the observations; a range of two standard deviations above and two below (± 2sd) about 95.4% of the observations; and of three standard deviations above and three below (± 3sd) about 99.7% of the observations.
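The three percentages quoted above can be checked numerically. The sketch below uses the standard identity P(|Z| <= k) = erf(k / sqrt(2)) for a standard Normal variable Z; the function name `coverage_within` is just an illustrative choice.

```python
import math

def coverage_within(k):
    """Proportion of a Normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sd of the mean: {coverage_within(k) * 100:.1f}%")
```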
Normality
Why bother with normality?
Because it dictates the type of analysis that you can run on the data
Normality-Why?
Parametric
Variable 1 | Variable 2 | Criteria | Type of Test
Qualitative | Qualitative | Sample size > 20 and no expected value < 5 | Chi Square Test (X2)
Qualitative Dichotomous | Qualitative Dichotomous | Sample size > 30 | Proportionate Test
Qualitative Dichotomous | Qualitative Dichotomous | Sample size > 40 but with at least one expected value < 5 | X2 Test with Yates Correction
Qualitative Dichotomous | Quantitative | Normally distributed data | Student's t Test
Qualitative Polytomous | Quantitative | Normally distributed data | ANOVA
Quantitative | Quantitative | Repeated measurement of the same individual & item (e.g. Hb level before & after treatment). Normally distributed data | Paired t Test
Quantitative - continuous | Quantitative - continuous | Normally distributed data | Pearson Correlation & Linear Regression
Normality-Why?
Non-parametric
Variable 1 | Variable 2 | Criteria | Type of Test
Qualitative Dichotomous | Qualitative Dichotomous | Sample size < 20 or (< 40 but with at least one expected value < 5) | Fisher Test
Qualitative Dichotomous | Quantitative | Data not normally distributed | Wilcoxon Rank Sum Test or Mann-Whitney U Test
Qualitative Polytomous | Quantitative | Data not normally distributed | Kruskal-Wallis One Way ANOVA Test
Quantitative | Quantitative | Repeated measurement of the same individual & item | Wilcoxon Signed Rank Test
Quantitative - continuous/ordinal | Quantitative - continuous | Data not normally distributed | Spearman/Kendall Rank Correlation
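For two dichotomous variables, the sample-size and expected-value criteria in the two tables can be written as a small decision function. This is a sketch only: the function name is hypothetical, and the handling of the boundary cases the tables leave open (a sample size of exactly 20 or exactly 40) is an assumption.

```python
def choose_two_by_two_test(sample_size, min_expected):
    """Suggest a test for two dichotomous variables from the tabulated criteria.

    sample_size  -- total number of observations in the 2x2 table
    min_expected -- smallest expected cell count
    """
    has_small_cell = min_expected < 5
    if sample_size < 20:
        return "Fisher test"
    if has_small_cell:
        if sample_size > 40:
            return "Chi-square test with Yates correction"
        # Assumption: n of exactly 40 is treated like n < 40.
        return "Fisher test"
    # Assumption: n of exactly 20 with no small expected value uses chi-square.
    return "Chi-square test"

print(choose_two_by_two_test(218, 10.4))
print(choose_two_by_two_test(50, 3.0))
print(choose_two_by_two_test(30, 3.0))
```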
Normality-How?
Explored graphically
• Histogram
• Stem & Leaf
• Box plot
• Normal probability plot
• Detrended normal plot
Explored statistically
• Kolmogorov-Smirnov statistic, with Lilliefors significance level, and the Shapiro-Wilk statistic
• Skewness (0)
• Kurtosis (0): positive = leptokurtic, 0 = mesokurtic, negative = platykurtic
Kolmogorov-Smirnov
In the 1930s, Andrei Nikolaevich Kolmogorov (1903-1987) and N.V. Smirnov came out with an approach for comparing distributions that did not make use of parameters.
This is known as the Kolmogorov-Smirnov test.
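To make the idea concrete, the sketch below computes the Kolmogorov-Smirnov statistic D, the largest gap between the sample's empirical distribution function and a Normal curve fitted with the sample mean and standard deviation. The `heights` values are hypothetical. No p-value is computed here: because the mean and standard deviation are estimated from the same sample, a significance level would need the Lilliefors correction, which statistical packages apply for you.

```python
from statistics import NormalDist, mean, stdev

def ks_statistic_normal(values):
    """Largest distance between the empirical CDF and a fitted Normal CDF."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x; check both sides.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d

# Hypothetical heights in cm, for illustration only.
heights = [142, 145, 147, 148, 150, 150, 151, 152, 152, 153, 155, 156, 158, 160, 165]
d_stat = ks_statistic_normal(heights)
print(f"D = {d_stat:.3f}")
```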
Skewness
Skewed to the right indicates the presence of large extreme values
Skewed to the left indicates the presence of small extreme values
Kurtosis
For symmetrical distributions only.
Describes the shape of the curve
Mesokurtic - average shaped
Leptokurtic - narrow & slim
Platykurtic - flat & wide
Skewness & Kurtosis
Skewness values typically fall between -3 and 3.
The acceptable range for normality is skewness lying between -1 and 1.
Normality should not be based on skewness alone; the kurtosis measures the "peakedness" of the bell curve (see Fig. 4).
Likewise, the acceptable range for normality is kurtosis lying between -1 and 1.
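The rule of thumb above can be sketched in code. The formulas below are the simple moment-based versions of skewness and excess kurtosis; statistical packages such as SPSS report bias-adjusted versions, so their figures will differ slightly for small samples. The function names and the 20 example values are illustrative choices.

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (both 0 for a Normal curve)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    m4 = sum((x - m) ** 4 for x in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3

def within_rule_of_thumb(values, limit=1.0):
    """True when both skewness and kurtosis lie between -limit and +limit."""
    skew, kurt = skewness_and_excess_kurtosis(values)
    return abs(skew) <= limit and abs(kurt) <= limit

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
skew, kurt = skewness_and_excess_kurtosis(data)
print(f"skewness = {skew:.2f}, kurtosis = {kurt:.2f}, "
      f"within +/-1: {within_rule_of_thumb(data)}")
```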
Normality - Examples
Graphically
[Figure: histogram of Height; Mean = 151.6, Std. Dev = 5.26, N = 218]
Q-Q Plot
This plot compares the quantiles of a data distribution with the quantiles of a standardised theoretical distribution from a specified family of distributions (in this case, the normal distribution).
If the distributional shapes differ, then the points will plot along a curve instead of a line.
Take note that the interest here is the central portion of the line; severe deviations there mean non-normality. Deviations at the "ends" of the curve signify the existence of outliers.
Normality - Examples
Graphically
[Figure: Normal Q-Q Plot of Height and Detrended Normal Q-Q Plot of Height; x-axis: Observed Value; detrended y-axis: Dev from Normal]
Normal distribution
Mean=median=mode
Normality - Examples
Statistically
Descriptives: Height
Measure | Statistic | Std. Error
Mean | 151.65 | .356
95% Confidence Interval for Mean, Lower Bound | 150.94 |
95% Confidence Interval for Mean, Upper Bound | 152.35 |
5% Trimmed Mean | 151.59 |
Median | 151.50 |
Variance | 27.649 |
Std. Deviation | 5.258 |
Minimum | 139 |
Maximum | 168 |
Range | 29 |
Interquartile Range | 8.00 |
Skewness | .148 | .165
Kurtosis | .061 | .328
Normal distribution: mean = median = mode
Skewness & kurtosis within ±1
Tests of Normality: Height
Test | Statistic | df | Sig.
Kolmogorov-Smirnov (a) | .060 | 218 | .052
a. Lilliefors Significance Correction
p > 0.05, so normal distribution
Shapiro-Wilk: only if the sample size is less than 100.
K-S Test
Very sensitive to the sample size of the data.
For small samples (n < 20, say), the likelihood of getting p < 0.05 is low, so non-normality may go undetected.
For large samples (n > 100), a slight deviation from normality will be reported as a non-normal distribution.
Guide to deciding on normality
Normality Transformation
[Figure: Normal Q-Q Plot of PARITY and Normal Q-Q Plot of LN_PARIT; axes: Observed Value against Expected Normal]
TYPES OF TRANSFORMATIONS
Square root
Reflect and square root
Logarithm
Reflect and logarithm
Inverse
Reflect and inverse
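A minimal sketch of the six transformations listed above. It assumes the common convention that "reflect" means subtracting each value from the largest value plus one, so that a left-skewed variable becomes right-skewed before the square root, logarithm or inverse is applied. The function names and the `parity` values are hypothetical, and the natural logarithm is an assumed choice.

```python
import math

def reflect(values):
    """Mirror the values: (largest value + 1) - value, so the smallest result is 1."""
    top = max(values) + 1
    return [top - v for v in values]

def square_root(values):
    """Requires values >= 0."""
    return [math.sqrt(v) for v in values]

def logarithm(values):
    """Natural log; requires values > 0."""
    return [math.log(v) for v in values]

def inverse(values):
    """Requires non-zero values."""
    return [1 / v for v in values]

# Hypothetical right-skewed variable, for illustration only.
parity = [1, 1, 2, 2, 3, 4, 6, 9, 15]
ln_parity = logarithm(parity)

# For a left-skewed variable, reflect first and then transform.
reflected_sqrt = square_root(reflect(parity))
print([round(v, 2) for v in ln_parity])
```

After any transformation, normality should be re-checked on the transformed variable, as the Q-Q plots on the previous slide do.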
Summarise
Summarise a large set of data by a few meaningful numbers.
Single variable analysis
• For the purpose of describing the data
• Example: in one year, what kind of cases are treated by the Psychiatric Dept?
• Tables & diagrams are usually used to describe the data
• For numerical data, measures of central tendency & spread are usually used
Frequency Table
Race | F | %
Malay | 760 | 95.84%
Chinese | 5 | 0.63%
Indian | 0 | 0.00%
Others | 28 | 3.53%
TOTAL | 793 | 100.00%
• Illustrates the frequency observed for each category
Frequency Distribution Table
• More than 20 observations are best presented as a frequency distribution table.
• Columns are divided into class & frequency.
• The modal class can be determined using such tables.
Age | Count | %
0-0.99 | 25 | 3.26%
1-4.99 | 78 | 10.18%
5-14.99 | 140 | 18.28%
15-24.99 | 126 | 16.45%
25-34.99 | 112 | 14.62%
35-44.99 | 90 | 11.75%
45-54.99 | 66 | 8.62%
55-64.99 | 60 | 7.83%
65-74.99 | 50 | 6.53%
75-84.99 | 16 | 2.09%
85+ | 3 | 0.39%
TOTAL | 766 |
Measurement of Central Tendency & Spread
Measures of Central Tendency
Mean
Mode
Median
Measures of Variability
Standard deviation
Interquartile range
Skewness & kurtosis
Mean
The average of the data collected
To calculate the mean, add up the observed values and divide by the number of them.
A major disadvantage of the mean is that it is sensitive to outlying points
Mean: Example
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Total of x = 648
n = 20
Mean = 648/20 = 32.4
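The worked example above can be reproduced with a few lines of Python; the variable names are illustrative.

```python
import statistics

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

total = sum(data)   # add up the observed values
n = len(data)       # number of observations
mean = total / n    # divide the total by n

print(f"total = {total}, n = {n}, mean = {mean}")

# The standard library gives the same answer directly.
library_mean = statistics.mean(data)
```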
Measures of variation - standard deviation
Tells us how closely the scores in a dataset cluster around the mean. A large SD indicates more varied scores.
It is a summary measure of the differences of each observation from the mean.
If the differences themselves were added up, the positive would exactly balance the negative and so their sum would be zero. Consequently the squares of the differences are added.
sd: Example
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Mean = 32.4; n = 20
x | (x-mean)^2 | x | (x-mean)^2
12 | 416.16 | 32 | 0.16
13 | 376.36 | 35 | 6.76
17 | 237.16 | 37 | 21.16
21 | 129.96 | 38 | 31.36
24 | 70.56 | 41 | 73.96
24 | 70.56 | 43 | 112.36
26 | 40.96 | 44 | 134.56
27 | 29.16 | 46 | 184.96
27 | 29.16 | 53 | 424.36
30 | 5.76 | 58 | 655.36
TOTAL | 1405.8 | TOTAL | 1645
Total of (x-mean)^2 = 1405.8 + 1645 = 3050.8
Variance = 3050.8/19 = 160.5684
sd = 160.5684^0.5 = 12.67
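The same calculation, step by step, as a short Python sketch. It divides by n - 1 = 19, as the slide does, and cross-checks the result against the standard library's sample variance and standard deviation.

```python
import math
import statistics

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

n = len(data)
mean = sum(data) / n

# Sum of squared differences from the mean.
sum_sq = sum((x - mean) ** 2 for x in data)

# Sample variance uses n - 1 in the denominator.
variance = sum_sq / (n - 1)
sd = math.sqrt(variance)

print(f"sum of squares = {sum_sq:.1f}, variance = {variance:.4f}, sd = {sd:.2f}")

# Cross-check with the standard library.
library_variance = statistics.variance(data)
library_sd = statistics.stdev(data)
```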
Median
The ranked value that lies in the middle of the data
The point which has the property that half the data are greater than it, and half the data are less than it.
If n is even, average the (n/2)th and the (n/2 + 1)th ranked observations
"Robust" to outliers
Median: Example
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
(20+1)/2 = 10.5, so the median lies between the 10th value, which is 30, and the 11th, which is 32
Therefore the median is (30 + 32)/2 = 31
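The even-n rule used in this example can be sketched as follows; the helper name `median_by_rank` is illustrative, and the result is checked against `statistics.median`.

```python
import statistics

def median_by_rank(values):
    """Middle ranked value; for even n, the average of the two middle values."""
    ranked = sorted(values)
    n = len(ranked)
    if n % 2 == 1:
        return ranked[n // 2]
    # Ranks n/2 and n/2 + 1 are list positions n/2 - 1 and n/2 (0-based).
    return (ranked[n // 2 - 1] + ranked[n // 2]) / 2

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

median = median_by_rank(data)
print("median =", median)
```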
Measures of variation - quartiles
The range is very susceptible to what are known as outliers
A more robust approach is to divide the distribution of the data into four, and find the points below which are 25%, 50% and 75% of the distribution. These are known as quartiles, and the median is the second quartile.
Quartiles
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
25th percentile: (24+24)/2 = 24, the median of the lower half
50th percentile: (30+32)/2 = 31, the median
75th percentile: (41+43)/2 = 42, the median of the upper half (interpolating at rank 0.75 x (n+1) = 15.75 gives 42.5)
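Quartiles can be computed in more than one way, which is why software output does not always match a hand calculation. The sketch below shows two common conventions on the example data: the median of each half, and interpolation at rank p x (n + 1), which is the default method of Python's `statistics.quantiles`.

```python
import statistics

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

# Method 1: split the ranked data in half and take the median of each half.
half = len(data) // 2
q1_halves = statistics.median(data[:half])
q2 = statistics.median(data)
q3_halves = statistics.median(data[half:])

# Method 2: interpolate at rank p * (n + 1).
q_interpolated = statistics.quantiles(data, n=4)

print("median of halves:", q1_halves, q2, q3_halves)
print("interpolated:    ", q_interpolated)

# The interquartile range depends on the method chosen.
iqr_halves = q3_halves - q1_halves
```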
Mode
The most frequently occurring number.
E.g. 3, 13, 13, 20, 22, 25: mode = 13.
It is usually more informative to quote the mode accompanied by the percentage of times it happened; e.g., the mode is 13 with 33% of the occurrences.
Mode: Example
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Modes are 24 (10%) & 27 (10%)
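A short sketch of finding every mode together with its share of the observations, as the slide recommends. It relies on `statistics.multimode`, which is available from Python 3.8.

```python
import statistics
from collections import Counter

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

# All values that share the highest frequency.
modes = statistics.multimode(data)

# Percentage of observations taken by each mode.
counts = Counter(data)
shares = {m: 100 * counts[m] / len(data) for m in modes}

for m in modes:
    print(f"mode {m}: {shares[m]:.0f}% of the observations")
```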
Mean or Median?
Which measure of central tendency should we use?
If the distribution is normal, the mean and SD are the measures to present; otherwise the median and IQR are more appropriate.
Not Normal distribution;
Use Median & IQR
Normal distribution;
Use Mean+SD
Presentation
Qualitative & Quantitative Data
Charts & Tables
Presentation
Qualitative Data
Graphing Categorical Data:
Univariate Data
Categorical Data
Tabulating Data: The Summary Table
Graphing Data: Pie Charts, Bar Charts, Pareto Diagram
[Figure: example pie chart, bar chart and Pareto diagram for the categories Stocks, Bonds, Savings and CD]
Bar Chart
[Figure: bar chart of Percent by Type of work (Housewife, Office work, Field work); bar labels 69, 20 and 11]
Pie Chart
[Figure: pie chart with segments Malay, Chinese and Others]
Tabulating and Graphing
Bivariate Categorical Data
Contingency tables:
Table 1: Contingency table of pregnancy induced hypertension and SGA (counts)
SGA | Pregnancy induced hypertension: No | Pregnancy induced hypertension: Yes | Total
Normal | 103 | 5 | 108
SGA | 94 | 16 | 110
Total | 197 | 21 | 218
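The contingency table above is the starting point for a chi-square test. As a sketch, the expected count for each cell is (row total x column total) / grand total, and the chi-square statistic sums (observed - expected)^2 / expected over the four cells. No continuity correction is applied here, and the variable names are illustrative.

```python
# Observed counts: rows are Normal and SGA, columns are PIH No and PIH Yes.
observed = [[103, 5],
            [94, 16]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under no association.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Pearson chi-square statistic (uncorrected).
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

smallest_expected = min(min(row) for row in expected)
print(f"smallest expected count = {smallest_expected:.2f}")
print(f"chi-square = {chi_square:.2f} with 1 degree of freedom")
```

Checking the smallest expected count first matters because the choice between the chi-square test, Yates correction and the Fisher test depends on it.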
Tabulating and Graphing
Bivariate Categorical Data
Side by side charts
[Figure: side-by-side bar chart of Count by Pregnancy induced hypertension (No, Yes), with separate bars for Normal and SGA; bar labels 103, 94 and 16]
Presentation
Quantitative Data
Tabulating and Graphing
Numerical Data
Numerical Data
Ordered Array: 41, 24, 32, 26, 27, 27, 30, 24, 38, 21 becomes 21, 24, 24, 26, 27, 27, 30, 32, 38, 41
Stem and Leaf Display:
2 | 144677
3 | 028
4 | 1
Frequency Distributions and Cumulative Distributions: Tables, Histograms, Polygons, Ogive
Tabulating Numerical Data:
Frequency Distributions
Sort raw data in ascending order:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Find range: 58 - 12 = 46
Select number of classes: 5 (usually between 5 and 15)
Compute class interval (width): 10 (46/5 then round up)
Determine class boundaries (limits): 10, 20, 30, 40, 50, 60
Compute class midpoints: 14.95, 24.95, 34.95, 44.95, 54.95
Count observations & assign to classes
Frequency Distributions
and Percentage Distributions
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Class | Midpoint | Freq | %
10.0 - 19.9 | 14.95 | 3 | 15%
20.0 - 29.9 | 24.95 | 6 | 30%
30.0 - 39.9 | 34.95 | 5 | 25%
40.0 - 49.9 | 44.95 | 4 | 20%
50.0 - 59.9 | 54.95 | 2 | 10%
TOTAL | | 20 | 100%
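The steps for building this table can be sketched in Python. The choices below are assumptions made for the illustration: the width is rounded up with `math.ceil`, the first class starts at the nearest multiple of the width below the minimum, and each class holds values from its lower boundary up to, but not including, the next boundary.

```python
import math

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

n_classes = 5
data_range = max(data) - min(data)          # 58 - 12
width = math.ceil(data_range / n_classes)   # 46 / 5, rounded up
start = (min(data) // width) * width        # first lower boundary

lower_bounds = [start + i * width for i in range(n_classes)]

# Tally the observations falling in each class.
freqs = [sum(lo <= x < lo + width for x in data) for lo in lower_bounds]
percents = [100 * f / len(data) for f in freqs]

# Midpoint of each printed class, e.g. (10.0 + 19.9) / 2.
midpoints = [round(lo + (width - 0.1) / 2, 2) for lo in lower_bounds]

for lo, mid, f, p in zip(lower_bounds, midpoints, freqs, percents):
    print(f"{lo:.1f} - {lo + width - 0.1:.1f} | {mid} | {f} | {p:.0f}%")
```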
Graphing Numerical Data:
The Histogram
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
[Figure: histogram of Age; x-axis: class midpoints 14.95, 24.95, 34.95, 44.95, 54.95 with class boundaries marked; y-axis: Frequency; bar heights 3, 6, 5, 4, 2; no gaps between bars]
Graphing Numerical Data:
The Frequency Polygon
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
[Figure: frequency polygon; x-axis: class midpoints 14.95, 24.95, 34.95, 44.95, 54.95; y-axis from 0 to 7]
Linear Regression Line
Survival Function
[Figure: survival function plotted against DURATION (0 to 7), with censored observations marked]
Principles of Graphical
Excellence
Presents data in a way that provides substance, statistics and design
Communicates complex ideas with clarity, precision and efficiency
Gives the largest number of ideas in the most efficient manner
Almost always involves several dimensions
Tells the truth about the data