data prep and descriptive stats

Transcript data prep and descriptive stats

Data Preparation
Steps in Data Preparation
Editing
Coding
Entering Data
Data Tabulation
Reviewing Tabulations
Statistically adjusting the data
Editing
 Carefully checking survey data for
 Completeness (no omissions)
 Legibility (non-ambiguous)
 Right informant
 Consistency
 e.g. charging something when the person does not
own a charge card
 Accuracy.
 Most important purpose is to eliminate or at
least reduce the number of errors in the raw
data.
Solutions
1. Ideally re-interview respondent
2. Eliminate all unacceptable surveys (case wise
deletion) (if sample is large and few unacceptable)
3. In calculations only the cases with complete
responses are considered (pair wise deletion)
(means that some statistics will be based on
different sample sizes)
4. Code illegible or missing answers into a a “no valid
response” category
5. substitute a neutral value - typically the mean
response to the variable, therefore the mean
remains unchanged
Coding
• The process of systematically and consistently
assigning each response a numerical score.
• The key to a good coding system is for the coding
categories to be mutually exclusive and the entire
system to be collectively exhaustive.
• To be mutually exclusive, every response must fit
into only one category.
• To be collectively exhaustive, all possible responses
must fit into one of the categories.
• Exhaustive means that you have covered the entire
range of the variable with your measurement.
Coding
• Coding Missing Numbers: When respondents fail
to complete portions of the survey.
– Whatever the reason for incomplete surveys, you
must indicate that there was no response provided
by the respondent.
– For single digit responses code as “9”, 2 digit code
as “99”
•Coding Open-Ended Questions: When open-ended
questions are used, you must create categories.
– All responses must fit into a category
– similar responses should fall into the same
category.
e.g. Who services your car? ______________
Possible categories: self, garage, husband, wife,
friend, relative etc.
• To make it collectively exhaustive add an “other” or
“none of the above” category
–Only a few i.e. < 10% should fit into this category
Precoded Questionnaires: Sometimes you can place
codes on the actual questionnaire, which simplifies
data entry.
This…
Are you:
Male
Female
How satisfied are you with our product?
___Very Satisfied
___Somewhat Satisfied
___Somewhat Dissatisfied
___Very Dissatisfied
___No opinion
Becomes this…
Are you: (1) Male
(2) Female
How satisfied are you with our product?
_1__Very Satisfied
_2__Somewhat Satisfied
_3__Somewhat Dissatisfied
_4__Very Dissatisfied
_5__No opinion
1. Are you solely responsible for taking care of your
automotive service needs ___ Yes ___ No
2. If No who performs the simple maintenance ___________
3. If scheduled maintenance is done on your automobile,
how do you keep track of what has been done
•Not tracked
•auto dealer records
•mental recollection
•other
4. How often is your automobile serviced?
•Once per month
•Once every three months
•Once every six months
•Once per year
•Other _______________
Code Book
Col.
No
Question
No.
Question Des.
Range of permissible values
1-3
ID #
N/A
001-200
4
1
Responsible for
Maintenance
0= No. 1=yes, 9= blank
5
2
perform simple
maintenance
0=husband, 1=boyfriend, 2=father, 3=mother,
4=relative, 5=friend, 6=other, 9=blank
5
3
How maintenance
tracked
0=not tracked, 1=auto dealer records, 2=personal
records, 3=mental recollection, 4=other, 9=blank
6
4
How often
maintenance
performed
Once per 0=month, 1= 3 months, 2= 6 months,
3= year, 4= other 9=blank
7
4
Other for how often
In questions that permit multiple responses, each possible response
option should be assigned a separate column
6. Which magazines do you read, choose all that apply.
•
Time
•
National Geographic
•
Readers Digest
•
Chatelaine
•
MacLean's
Col. No
Question No.
Question Des.
Range of permissible
values
15
6
Time
0 =read, 1= not read
16
6
Readers Dig.
0 =read, 1= not read
17
6
MacLean's
0 =read, 1= not read
18
6
National Geo.
0 =read, 1= not read
19
6
Chatelaine
0 =read, 1= not read
For rank order questions, separate columns are also needed
7. Please rank the following brands of toothpaste in order of
preference (1-5)
•
Crest
•
Colgate
•
Aquafresh
•
Arm & Hammer
•
Pepsodent
Col.# Q. No.
Question Des.
Range of permissible values
20
7
Crest rank
0 =blank, 1 = most important, 2 =2nd most
important, 3 =third, 4=fourth, 5= fifth
21
7
Colgate rank
0 =blank, 1 = most important, 2 =2nd most
important, 3 =third, 4=fourth, 5= fifth
22
7
Acquafresh rank
0 =blank, 1 = most important, 2 =2nd most
important, 3 =third, 4=fourth, 5= fifth
23
7
A & H rank
0 =blank, 1 = most important, 2 =2nd most
important, 3 =third, 4=fourth, 5= fifth
25
7
Pepsodent rank
0 =blank, 1 = most important, 2 =2nd most
important, 3 =third, 4=fourth, 5= fifth
Preparing the Data for Analysis
Variable Re-specification
• Existing data modified to create new variables
• Large number of variables collapsed into fewer
variables
• E.g. If 10 reasons for purchasing a car are given they
might be collapsed into four categories e.g.
performance, price, appearance, and service
• Creates variables that are consistent with research
questions
Entering Data
• Problems can occur during data entry, such as
transposing numbers and inputting an infeasible
code(e.g out of range)
– E.g. Score on range of 1-5 then 0, 6, 7, and 8 are
unacceptable or out of range (might be due to
transcription error)
• Always check the data-entry work.
Descriptive Statistics
Five types of statistical analysis
Descriptive
What are the characteristics of the respondents?
Inferential
What are the characteristics of the population?
Differences
Are two or more groups the same or different?
Associative
Are two or more variables related in a systematic way?
Predictive
Can we predict one variable if we know one or more
other variables?
Descriptive Statistics
Summarization of a collection of data
in a clear and understandable way
the most basic form of statistics
lays the foundation for all statistical
knowledge
The tradeoff in descriptive statistics
• If you use fewer statistics to describe the distribution of a
variable, you lose information but gain clarity.
• When should one use fewer statistics?
– When dropping the number of statistics would leave more
information per remaining statistic.
– When the information you drop is unimportant to one’s research
question.
Type of
Measurement
Type of
descriptive analysis
Two
categories
Nominal
More than
two categories
Frequency table
Proportion (percentage)
Frequency table
Category proportions
(percentages)
Mode
Type of
Measurement
Type of
descriptive analysis
Ordinal
Rank order
Median
Interval
Arithmetic mean
Ratio
means
Data Tabulation
• Tabulation: The organized arrangement of data in
a table format that is easy to read and
understand.
– Tabulate the data to count the number of responses to
each question.
• Simple Tabulation: The tabulating of results of
only one variable informs you how often each
response was given.
• Frequency Distribution: A distribution of data that
summarizes the number of times a certain value of
a variable occurs and is expressed in terms of
percentages.
Frequency Tables
The arrangement of statistical data in a row-andcolumn format that exhibits the count of
responses or observations for each category
assigned to a variable
• How many of certain brand users can be called loyal?
• What percentage of the market are heavy users and
light users?
• How many consumers are aware of a new product?
• What brand is the “Top of Mind” of the market?
More on relative frequency distributions
• Rules for relative frequency distributions:
–
–
–
–
–
Make sure each observation is in one and only one category.
Use categories of equal width.
Choose an appealing number of categories.
Provide labels
Double-check your graph.
• Definitions:
– A histogram is a relative frequency distribution of a quantitative
variable
– A bar graph is a relative frequency distribution of a qualitative
variable
WebSurveyor Bar Chart
How did you find your last job?
643 Netw orking
213 print ad
179 Online recruitment site
112 Placement firm
18 Temporary agency
1.5 %
Temporary agency
9.6 %
Placement firm
15.4 %
Online recruitment site
18.3 %
print ad
55.2 %
Netw orking
0
100
200
300
400
500
600
700
How many times per week do you use mouthwash ?
1__ 2__ 3__ 4__ 5__ 6__ 7__
112223333344444445555566677
1
2
2
3
3
5
4
7
5
5
7
6
5
4
3
2
1
6
3
7
2
0
1
2
3
4
5
6
7
Normal Distribution

-

a
b

Normal Distributions
 Curve is basically bell shaped
from -  to 
 symmetric with scores
concentrated in the middle (i.e. on
the mean) than in the tails.
Mean, medium and mode
coincide
They differ in how spread out
they are.
 The area under each curve is 1.
The height of a normal
distribution can be specified
mathematically in terms of two
parameters: the mean () and the
standard deviation ().
Skewed Distributions
Occur when one tail of the distribution is longer than the other.
Positive Skew Distributions
 have a long tail in the positive direction.
 sometimes called "skewed to the right"
 more common than distributions with negative skews
E.g. distribution of income. Most people make under $40,000 a
year, but some make quite a bit more with a small number making
many millions of dollars per year
 The positive tail therefore extends out quite a long way
Negative Skew Distributions
have a long tail in the negative direction.
called "skewed to the left."
negative tail stops at zero
• Kurtosis: how peaked a distribution is. A
zero indicates normal distribution, positive
numbers indicate a peak, negative numbers
indicate a flatter distribution)
Peaked
distribution
Flat distribution
Thanks, Scott!
Summary statistics
–central tendency
–Dispersion or variability
A quantitative measure of the degree to
which scores in a distribution are spread
out or are clustered together;
Descriptive Analysis: Measures of
Central Tendency
• Mode: the number that occurs most often
in a string (nominal data)
• Median: half of the responses fall above
this point, half fall below this point
(ordinal data)
• Mean: the average (interval/ratio data)
Mode
 the most frequent category
users
25%
non-users 75%
Advantages:
• meaning is obvious
• the only measure of central tendency that can be used
with nominal data.
Disadvantages
• many distributions have more than one mode, i.e. are
"multimodal
• greatly subject to sample fluctuations
• therefore not recommended to be used as the only
measure of central tendency.
Median
the middle observation of the data
number times per week consumers use mouthwash
112223333344444445555566677
Frequency
distribution of
Mouthwash
use per week
Light user
Mode
Median
Mean
Heavy user
The Mean (average value)
sum of all the scores divided by the number of scores.
 a good measure of central tendency for roughly
symmetric distributions
 can be misleading in skewed distributions since it can be
greatly influenced by extreme scores in which case other
statistics such as the median may be more informative
 formula  = SX/N (population)
X
¯ = Sxi/n (sample)
where /X
¯ is the population/sample mean
and N/n is the number of scores.
Normal Distributions with
different Mean
-
1
0

2

Measures of Dispersion or
Variability
• Minimum, Maximum, and Range (Highest
value minus the lowest value)
• Variance
• Standard Deviation (A measure’s distance
from the mean)
Distribution of Final Course Grades in MGMT 3220Y
25
Frequency
20
- 1 SD
15
+ 1 SD
10
5
RANGE
0
Frequency
F
D
C
B
A
3
10
20
23
12
Grade
Variance
• The difference between an observed value and the
mean is called the deviation from the mean
• The variance is the mean squared deviation from
the mean
• i.e. you subtract each value from the mean,
square each result and then take the average.
2 = S(x¯ xi)2/n
• Because it is squared it can never be negative
Standard Deviation
• The standard deviation is the square root of
the variance
2/n
S =  S(xx
)
¯ i
• Thus the standard deviation is expressed in
the same units as the variables
• Helps us to understand how clustered or
spread the distribution is around the mean
value.
Measures of Dispersion
Suppose we are testing the new flavor of a fruit punch
Dislike 1
1.
2
3 4 5 Like Data
x
x
2.
3.
x
x
2/n
2 = S(xx
)
¯ i
X= 4
 2= 1
S=1
5
3
x
6.
5
3
x
4.
5.
3
5
2/n
S =  S(xx
)
¯ i
Measures of Dispersion
Dislike 1
2
3
4
1.
2.
5 Like Data
x
5
4
x
3.
x
5
4.
x
5
5.
x
5
6.
2/n
2 = S(xx
)
¯ i
x
X
¯ = 4.6
2=0.26
S = 0.52
4
2/n
S =  S(xx
)
¯ i
Measures of Dispersion
Dislike 1
1.
4
5 Like Data
1
x
x
4.
5.
3
x
2.
3.
2
1
x
2/n
2 = S(xx
)
¯ i
5
X=
¯ 3
2=4
S=2
1
x
6.
5
x
5
2/n
S =  S(xx
)
¯ i
Normal Distributions
with different SD
2
-
1
3


Cross Tabulation
• A statistical technique that involves tabulating the
results of two or more variables simultaneously
• informs you how often each response was given
• Shows relationships among and between variables
• frequency distribution for each subgroup compared
to the frequency distribution for the total sample
• must be nominally scaled
Cross-tabulation
• Helps answer questions about whether two
or more variables of interest are linked:
– Is the type of mouthwash user (heavy or
light) related to gender?
– Is the preference for a certain flavor (cherry
or lemon) related to the geographic region
(north, south, east, west)?
– Is income level associated with gender?
• Cross-tabulation determines association not
causality.
Dependent and Independent Variables
• The variable being studied is called the
dependent variable or response variable.
• A variable that influences the dependent
variable is called independent variable.
Cross-tabulation
• Cross-tabulation of two or more variables is
possible if the variables are discrete:
– The frequency of one variable is subdivided by the
other variable categories.
• Generally a cross-tabulation table has:
– Row percentages
– Column percentages
– Total percentages
• Which one is better?
DEPENDS on which variable is considered as
independent.
Contingency Table
• A contingency table shows the conjoint
distribution of two discrete variables
• This distribution represents the probability
of observing a case in each cell
– Probability is calculated as:
Observed
cases
P=
Total cases
Cross tabulation
GROUPINC * Gender Crosstabulation
GROUPINC
income <= 5
5<Income<= 10
income >10
Total
Count
% within GROUPINC
% within Gender
% of Total
Count
% within GROUPINC
% within Gender
% of Total
Count
% within GROUPINC
% within Gender
% of Total
Count
% within GROUPINC
% within Gender
% of Total
Gender
Female
Male
10
9
52.6%
47.4%
55.6%
18.8%
15.2%
13.6%
5
25
16.7%
83.3%
27.8%
52.1%
7.6%
37.9%
3
14
17.6%
82.4%
16.7%
29.2%
4.5%
21.2%
18
48
27.3%
72.7%
100.0%
100.0%
27.3%
72.7%
Total
19
100.0%
28.8%
28.8%
30
100.0%
45.5%
45.5%
17
100.0%
25.8%
25.8%
66
100.0%
100.0%
100.0%
Chi-square Test for Independence
• The Chi-square test for independence
determines whether two variables are
associated or not.
H0: Two variables are independent
H1: Two variables are not independent
Chi-square test results are unstable if cell count is lower than 5

data prep and descriptive stats

Transcript data prep and descriptive stats

Directory