9 Intro to Quantitative Data


Another Information-Gathering
Technique & Introduction to
Quantitative Data Analysis
Neuman and Robson Chapter 11.
Research Data library at SFU
http://www.sfu.ca/rdl/
Quiz 2 Coverage
• New Material from the Lectures and from the
following Chapters
– 7 (Sampling), 8 (Surveys), 10 (Nonreactive Measures &
Existing Statistics) and the beginning of Chapter 11
(univariate statistics)
• The quiz may also include material covered in the
first quiz, especially:
– standardization & rates
– scales & indices
– validity & reliability
– levels of measurement
– the notions of exhaustive & mutually exclusive categories
Types of Equivalence for comparative
research using existing statistics
• lexicon equivalence (technique of back
translation)
• contextual equivalence (ex. role of religious
leaders in different societies)
• conceptual equivalence (ex. income)
• measurement equivalence (ex. different
measure for same context)
Ethical Issues in Comparative
Research
• ethical issues sometimes very important
– ex. impact of demographic research on funding of
developing countries, controversy surrounding
studies of the origins of AIDS
• sensitivity, privacy, etc. are sometimes still issues
even if the “subjects” are dead.
Quantitative Data
• Types of Statistics
– Descriptive
– Inferential
• Common Ways of Presenting Statistics
– Tables
– Charts
– Graphs
Data Preparation
• Recall: Coding Issues with War & Peace
Journalism codes last day
• Entering Data into Spreadsheet or data
processing software
• Cleaning Data
Recall: Coding Principles
• categories
– exhaustive
– mutually exclusive
• consistent for all cases
• comparable with other studies
Ways of Developing
Coding Categories
• pre-defined coding schemes
– e.g. close-ended questions
– Ex. Coding Missing Values (conventions not
always used)
• not applicable=77,
• don’t know=88,
• no response=99
• post-collection analysis
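The missing-value convention above can be sketched in a few lines of Python. This is only an illustration: the code values 77, 88 and 99 come from the slide, while the function name and the sample responses are invented for the example.

```python
# Missing-value codes following the convention on the slide.
MISSING_CODES = {77: "not applicable", 88: "don't know", 99: "no response"}

def split_valid_and_missing(codes):
    """Separate substantive codes from missing-value codes and count the latter."""
    valid = [c for c in codes if c not in MISSING_CODES]
    missing = {}
    for c in codes:
        if c in MISSING_CODES:
            label = MISSING_CODES[c]
            missing[label] = missing.get(label, 0) + 1
    return valid, missing

# Invented responses to a close-ended question coded 1 to 5.
responses = [1, 3, 99, 5, 88, 2, 77, 99, 4]
valid, missing = split_valid_and_missing(responses)
print(valid)    # substantive answers only
print(missing)  # how many cases fall under each missing-value code
```

Keeping the missing codes far from the substantive range (1 to 5 here) makes it harder to mistake them for real answers when computing statistics.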
More Examples of Coding Process
• Sheet for One Television Commercial
• Excel spreadsheet showing entered codes
• SPSS example
Data entry conventions
Discrete & Continuous Variables
• Continuous
– Variable can take infinite (or large) number of values
within range
• Ex. Age measured by exact date of birth
• Discrete
– Attributes of variable that are distinct but not
necessarily continuous
• Ex. Age measured by age groups (Note: techniques exist
for making assumptions about discrete variables in order
to use techniques developed for continuous variables)
Cleaning Data
• checking accuracy & removing errors
– Possible Code Cleaning
• check for impossible codes (errors)
– Some software checks at data entry
– Examine distributions to look for impossible codes
– Contingency cleaning
• inconsistencies between answers (impossible
logical combinations, illogical responses to skip or
contingency questions)
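Both kinds of cleaning can be sketched as simple checks. The records, field names, valid codes and skip rule below are all hypothetical; a real codebook would supply them.

```python
# Hypothetical survey records; field names and codes are assumptions.
records = [
    {"id": 1, "sex": 1, "employed": 1, "hours_worked": 40},
    {"id": 2, "sex": 3, "employed": 2, "hours_worked": 0},   # sex=3 is not a defined code
    {"id": 3, "sex": 2, "employed": 2, "hours_worked": 35},  # not employed, yet reports hours
]

# Assumed codebook: 1 and 2 are substantive codes, 99 = no response.
VALID_CODES = {"sex": {1, 2, 99}, "employed": {1, 2, 99}}

def possible_code_errors(rows):
    """Possible-code cleaning: flag values that are not in the codebook."""
    return [(r["id"], field)
            for r in rows
            for field, allowed in VALID_CODES.items()
            if r[field] not in allowed]

def contingency_errors(rows):
    """Contingency cleaning: hours_worked should be 0 when employed == 2 (no)."""
    return [r["id"] for r in rows if r["employed"] == 2 and r["hours_worked"] > 0]

print(possible_code_errors(records))
print(contingency_errors(records))
```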
Descriptive Statistics (some topics for next few
weeks)
• Univariate (one variable)
– Frequency distributions
– Graphs & charts
– Measures of central tendency
– Measures of dispersion
• Bivariate (two variables)
– Crosstabulations
– Scattergrams & other types of graphs
– Measures of association
• Multivariate (more than two variables)
– Statistical control
– Partials
– Elaboration paradigm
Frequency Distribution (Univariate)
Table 5-1 Alienation of Workers
--------------------------------------------
Level of Alienation      Frequency
--------------------------------------------
High                     20
Medium                   67
Low                      13
(Sub Total)              100      (N=150)
No Response              60
(Total)                           (N=210)
--------------------------------------------
Simple Univariate Frequency
Distributions and Percentages
• univariate:= one variable
• “raw count” (frequencies, percentages)
Conventions in table design
• total number of cases (N=)
• grouping cases
– pro: see patterns
– con: lose information
Graph of Frequency Distribution (Univariate)
Another visual representation of a distribution:
Pie charts
Critically Analyzing Data on Frequency Distributions:
Collapsing Categories and Treatment of Missing Data
• Consider Raw Data
(Numbers) not just
percentages
• Examine data
preparation
– Treatment of
missing cases?
– Collapsing
categories?
Johnson, A. G. (1977). Social Statistics Without Tears.
Toronto: McGraw Hill.
Treatment of Missing Data: Raw Data
Table 5-1 Alienation of Workers
--------------------------------------------
Level of Alienation      Frequency
--------------------------------------------
High                     20
Medium                   67
Low                      13
(Sub Total)              100      (N=150)
No Response              60
(Total)                           (N=210)
--------------------------------------------
Treatment of Missing Data (%)
• Comparison of % distributions with and without
non-respondents
Table 5-1 Alienation of Workers
                       With non-respondents    Without non-respondents
Level of Alienation    F        %              F        %
High                   30       14             30       20
Medium                 100      48             100      67
Low                    20       10             20       13
No Response            60       29
(Total)                210      100            150      100
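The effect of dropping non-respondents on the percentages can be checked with a short sketch. The frequencies are the ones in Table 5-1; the function name is an assumption made for the example.

```python
def percent_distribution(freqs):
    """Convert raw frequencies to whole-number percentages of their total."""
    total = sum(freqs.values())
    return {category: round(100 * f / total) for category, f in freqs.items()}

# Raw frequencies from Table 5-1.
freqs = {"High": 30, "Medium": 100, "Low": 20, "No Response": 60}

with_nonrespondents = percent_distribution(freqs)
without_nonrespondents = percent_distribution(
    {k: v for k, v in freqs.items() if k != "No Response"}
)

print(with_nonrespondents)     # base N = 210
print(without_nonrespondents)  # base N = 150
```

The same 30 "High" cases are 14% of 210 but 20% of 150, which is why the base (N) always needs to be reported alongside the percentages.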
Treatment of Missing Data (%)
• Comparison with high & medium collapsed
Table 5-1 Alienation of Workers
                       Non-respondents included    Non-respondents eliminated
Level of Alienation    F        %                  F        %
High & Medium          130      62                 130      87
Low                    20       10                 20       13
No Response            60       29
(Total)                210      100                150      100
Treatment of Missing Data (%)
• Comparison with medium & low collapsed
Table 5-1 Alienation of Workers
                       Non-respondents included    Non-respondents eliminated
Level of Alienation    F        %                  F        %
High                   30       14                 30       20
Medium & Low           120      57                 120      80
No Response            60       29
(Total)                210      100                150      100
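Collapsing categories is just adding frequencies before percentaging. The sketch below, with hypothetical function names, shows how the two ways of collapsing the same Table 5-1 frequencies (non-respondents eliminated) lead to very different impressions.

```python
def collapse(freqs, groups):
    """Sum the frequencies of the old categories listed under each new category."""
    return {new: sum(freqs[old] for old in olds) for new, olds in groups.items()}

def percent_distribution(freqs):
    """Convert raw frequencies to whole-number percentages of their total."""
    total = sum(freqs.values())
    return {category: round(100 * f / total) for category, f in freqs.items()}

# Table 5-1 frequencies with non-respondents eliminated (N = 150).
freqs = {"High": 30, "Medium": 100, "Low": 20}

high_medium = collapse(freqs, {"High & Medium": ["High", "Medium"], "Low": ["Low"]})
medium_low = collapse(freqs, {"High": ["High"], "Medium & Low": ["Medium", "Low"]})

print(percent_distribution(high_medium))  # most workers look alienated
print(percent_distribution(medium_low))   # most workers look not very alienated
```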
Grouping Response Categories (%)
• Comparison with high & medium response
categories collapsed
Table 5-1 Alienation of Workers
                       Non-respondents included    Non-respondents eliminated
Level of Alienation    %                           %
High & Medium          62                          87
Low                    10                          13
No Response            29
(Total)                100 (N=210)                 100 (N=150)
Core Notions in Basic Univariate
Statistics
Ways of describing data about one
variable (“uni”=one)
–Measures of central tendency
• Summarize information about one variable
• three types of “averages”: arithmetic mean,
median, mode
–Measures of dispersion
• Analyze Variations or “spread”
• Range, standard deviation, percentiles, z-scores
Mode
• most common or frequently occurring
category or value (for all types of data)
Babbie (1995: 378)
Graph (Normal Distribution) with
single mode
Bimodal Distribution
• When there are two “most common” values
that are almost the same (or the same)
Median
• middle point of rank-ordered list of all values
(only for ordinal, interval or ratio data)
Babbie (1995: 378)
Mean (arithmetic mean)
– Arithmetic “average” = sum of values divided by
number of cases (only for ratio and interval data)
Babbie (1995: 378)
Two Data Sets with the Same Mean
Normal Distribution & Measures of
Central Tendency
• Symmetric
• Also called the “Bell Curve”
Neuman (2000: 319)
Skewed Distributions & Measures of
Central Tendency
Skewed to the left
Skewed to the right
Neuman (2000: 319)
Normal & Skewed Distributions
Why Measures of Central Tendency are
not enough to describe distributions:
Crowd Example
• 7 people at bus stop in front of bar aged
25,26,27,30,33,34,35
– median= 30, mean= 30
• 7 people in front of ice-cream parlour aged
5,10,20,30,40,50,55
– median= 30, mean= 30
• BUT issue of “spread” socially significant
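The three "averages" can be computed with Python's standard `statistics` module. The two age lists are the ones from the crowd example; the small lists used for the mode are invented, since no age repeats in either crowd.

```python
from statistics import mean, median, multimode

bar_ages = [25, 26, 27, 30, 33, 34, 35]        # bus stop in front of bar
ice_cream_ages = [5, 10, 20, 30, 40, 50, 55]   # in front of ice-cream parlour

# Both crowds share the same mean and median despite very different spreads.
print(mean(bar_ages), median(bar_ages))
print(mean(ice_cream_ages), median(ice_cream_ages))

# Invented data to illustrate the mode.
single_mode = [7, 8, 8, 9, 9, 9, 10]   # one most common value
bimodal = [1, 1, 2, 3, 3]              # two equally common values
print(multimode(single_mode))
print(multimode(bimodal))
```

`multimode` returns every most-common value, so it handles the bimodal case without having to pick one mode arbitrarily.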
Measures of Variation or Dispersion
• range: distance between largest and smallest
scores
• standard deviation: for comparing distributions
• percentiles: for understanding position in
distribution (% up to and including the number,
from below)
• z-scores: for comparing individual scores taking
into account the context of different distributions
Range & Interquartile range
• distance between largest and smallest scores
– what does a short distance between the scores tell us
about the sample?
– problems of “outliers” or extreme values may occur
Interquartile range (IQR)
• distance between the 75th percentile and the 25th
percentile
• range of the middle 50% (approximately) of the data
• Eliminates problem of outliers or extreme values
• Example from StatCan website (11 in sample)
– Data set: 6, 47, 49, 15, 43, 41, 7, 39, 43, 41, 36
– Ordered data set: 6, 7, 15, 36, 39, 41, 41, 43, 43, 47, 49
– Median: 41
– Upper quartile: 43
– Lower quartile: 15
– IQR = 43 − 15 = 28
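The quartiles for this data set can be worked out with a short sketch. It assumes the "median of each half" method (the overall median is left out of both halves when the number of cases is odd); other quartile conventions exist and can give slightly different answers. The function name is invented for the example.

```python
from statistics import median

def quartiles_median_of_halves(values):
    """Return (lower quartile, median, upper quartile) using medians of the two halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # With an odd number of cases, skip the middle value itself.
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)

data = [6, 47, 49, 15, 43, 41, 7, 39, 43, 41, 36]
q1, med, q3 = quartiles_median_of_halves(data)
print(q1, med, q3)   # lower quartile, median, upper quartile
print(q3 - q1)       # interquartile range
```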
Standard Deviation and Variance
• Interquartile range eliminates problem of
outliers BUT eliminates half the data
• Solution? measure variability from the center of
the distribution.
• standard deviation & variance measure how far
on average scores deviate or differ from the
mean.
Calculation of Standard Deviation
Neuman (2000: 321)
Standard Deviation Formula
Neuman (2000: 321)
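A standard way to write the formula, assuming the population form that divides by the number of cases N (the sample form divides by N − 1 instead). Here X_i is each score and X̄ is the mean; the symbols are conventional notation, not taken from the slide.

```latex
s = \sqrt{\frac{\sum_{i=1}^{N} \left( X_i - \bar{X} \right)^2}{N}},
\qquad
\bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i
```

In words: subtract the mean from each score, square the differences, add them up, divide by the number of cases (this is the variance), and take the square root.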
Interpreting Standard Deviation
• amount of variation from mean
• social meaning depends on exact case
Details on the Calculation of Standard Deviation
Neuman (2000: 321)
The Bell Curve & standard deviation
Discussion of Preceding Diagram
• “Many biological, psychological and social phenomena
occur in the population in the distribution we call the
bell curve (Portney & Watkins, 2000).” link to source
• Preceding picture
– a symmetrical bell curve,
– average score [i.e., the mean] in the middle, where the ‘bell’
shape is tallest.
– Most of the people [i.e., 68% of them, or 34% + 34%] have
performance within 1 segment [i.e., a standard deviation] of
the average score.
Interpreting
Standard Deviation
• amount of variation
from mean
• Illustration: high &
low standard
deviation
• meaning depends on
exact case
Another Diagram of Normal Curve (Showing
Ideal Random Sampling Distribution, Standard
Deviation & Z-scores)
Example:Central Tendency &
Dispersion (description of
distributions)
Recall:
• 7 people at bus stop in front of bar aged
25,26,27,30,33,34,35
– median= 30, mean= 30
– Range= 10, standard deviation=3.8
• 7 people in front of ice-cream parlour aged
5,10,20,30,40,50,55
– median= 30, mean= 30
– Range= 50, standard deviation=17.9
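The range and standard deviation of the two crowds can be recomputed as a check. This sketch assumes the population standard deviation (dividing by the number of cases), which is what `statistics.pstdev` computes.

```python
from statistics import pstdev

bar_ages = [25, 26, 27, 30, 33, 34, 35]
ice_cream_ages = [5, 10, 20, 30, 40, 50, 55]

def data_range(values):
    """Distance between the largest and smallest scores."""
    return max(values) - min(values)

for label, ages in [("bar", bar_ages), ("ice-cream parlour", ice_cream_ages)]:
    # Round the standard deviation to one decimal place for reporting.
    print(label, data_range(ages), round(pstdev(ages), 1))
```

Both crowds have a mean age of 30, but the ice-cream parlour crowd is spread far more widely around it.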
Other ways of characterizing
dispersion or spread
Techniques for understanding position of a case (or
group of cases) in the context of all cases
• Percentiles
• Standard Scores
– z-scores
Percentile
• First rank the scores, then choose a rank (score) and figure
out the percentage of cases equal to or less than that rank (score)
– Link to more complex definition of percentile
• % up to and including the number (from below)
– “A percentile rank is typically defined as the proportion of scores
in a distribution that a specific score is greater than or equal to.
For instance, if you received a score of 95 on a math test and
this score was greater than or equal to the scores of 88% of the
students taking the test, then your percentile rank would be 88.
You would be in the 88th percentile”
• Also used in other ways (for example to eliminate cases)
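The "% up to and including the number" definition can be sketched directly. The function name and the ten test scores are invented for the example, and other definitions of percentile rank exist.

```python
def percentile_rank(scores, score):
    """Percentage of scores that are less than or equal to the given score."""
    at_or_below = sum(1 for s in scores if s <= score)
    return 100 * at_or_below / len(scores)

# Invented test scores for ten students.
scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
print(percentile_rank(scores, 95))  # 9 of the 10 scores are at or below 95
```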
Normal Distribution with Percentiles
z-scores
• For understanding how a score is positioned in the
data set
• to enable comparisons with other scores from
other data sets
– (comparing individual scores in different distributions)
• example of two students from different schools with different
GPAs
– comparing sample distributions to population. How
representative is sample to population under study?
Calculating Z-Scores
• z-score=(score – sample mean)/standard
deviation of set
– Link to formula
– Link to z-score calculator
Using Z-scores to compare two students
from different schools
• Susan has GPA of 3.62 & Jorge has GPA of 3.64
• Susan from College A
– Susan’s Grade Point Average =3.62
– Mean GPA= 2.62
– SD= .50
– Susan’s z-score = (3.62 − 2.62)/.50 = 1.00/.50 = 2
– Susan’s grade is two Standard deviations above mean
at her school
Using Z-scores to compare two students
from different schools (continued)
• Jorge from College B
– Jorge’s GPA =3.64
– Mean GPA= 3.24
– SD=.40
– Jorge’s z-score = (3.64 − 3.24)/.40 = .40/.40 = 1
– Jorge’s grade is one standard deviation above the
mean at his school
• Susan’s absolute grade is lower but her position
relative to other students at her school is much
higher than Jorge’s position at his school
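The comparison of Susan and Jorge can be reproduced with the z-score formula. The GPAs, means and standard deviations are the ones given in the example; the function name is an assumption.

```python
def z_score(score, mean, sd):
    """Number of standard deviations a score lies above (+) or below (-) the mean."""
    return (score - mean) / sd

susan_z = z_score(3.62, mean=2.62, sd=0.50)  # College A
jorge_z = z_score(3.64, mean=3.24, sd=0.40)  # College B

# Round for display, since floating-point subtraction is not exact.
print(round(susan_z, 2), round(jorge_z, 2))
print("Susan ranks higher within her school:", susan_z > jorge_z)
```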
Another Diagram of Normal Curve with
Standard Deviation & Z-scores
Discussion of Previous Case
• Relationship of sampling distribution to
population (use mean of sample to estimate
mean of population)
If Time: Begin Bivariate Statistics (Results with
two variables)
• Types of relationships between two variables:
– Correlation (or covariation)
• when two variables ‘vary together’
– a type of association
– Not necessarily causal
• Can be same direction (positive correlation or direct
relationship)
• Can be in different directions (negative correlation or
indirect relationship)
– Independence
• No correlation, no relationship
• Cases with values in one variable do not have any
particular value on the other variable
Techniques for examining relationships
between two variables
• Graphs, scattergrams or plots
• Cross-tabulations or percentaged tables
• Measures of association (e.g. correlation
coefficient, etc.)
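One common measure of association for two interval or ratio variables is Pearson's correlation coefficient, sketched below from its definition. The variable names and data are invented purely to show a positive and a negative correlation.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r: covariation of x and y scaled by the spread of each variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)

# Invented data for five students.
hours_studied = [1, 2, 3, 4, 5]
grade = [52, 60, 61, 70, 77]     # tends to rise with hours studied
absences = [9, 7, 6, 4, 1]       # tends to fall with hours studied

print(pearson_r(hours_studied, grade))     # positive: same direction
print(pearson_r(hours_studied, absences))  # negative: opposite directions
```

A value near 0 would indicate independence; values near +1 or −1 indicate a strong relationship, though not necessarily a causal one.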
Scattergram (Bivariate)
Tables: Basic Terminology
• Parts of a Table
– title (conventions)
• Order of naming of variables
• Dependent, independent, control
– body, cell, column, row
– “marginals”
• sources, date
Bivariate Statistics: Parts of the Table
Example of Raw Data Table (computer printout-bivariate)
Regan, T. (1985). In search of sobriety: Identifying factors contributing to the
recovery from alcoholism. Kentville, NS.
Another Style of Presentation of Percentaged
Tables
Table 1. Percentage in support of strike by type of school
Type of School      Percent supporting strike      (N)
Secondary           60%                            (800)
Elementary          30%                            (1000)
__________________________________________________________
N = 1800
• “Table 1” = serial number; rest of title = descriptive caption
• Percent supporting strike = dependent variable (one category of
dichotomous dependent variable)
• Type of school = independent variable; Secondary and Elementary =
its categories
• (800) and (1000) = marginals for independent variable
• N = 1800 = total sample
Presentation of Percentaged Tables (cont’d)
Table 2. Percentage who support strike by type of school and sex
                    Sex
Type of School      Female: per cent          Male: per cent
                    supporting strike         supporting strike
Secondary           60% (400)                 60% (400)
Elementary          30% (900)                 30% (100)
__________________________________________________________
Female = .30 : Male = .30
N = 1800
• Support for strike = dependent variable
• Type of school = independent variable
• Sex = control variable; Female and Male = categories of control
variable
Some Important Factors in
Interpretation of Tables
• percentages vs. “raw” frequencies, need to
know absolute number of cases (N=)
• grouping categories, missing cases
• direction of calculation of percentages (for
bivariate and multivariate statistics)
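The direction in which percentages are calculated changes what a table says. In the sketch below the cell counts are derived from Table 1 (60% of 800 secondary and 30% of 1000 elementary teachers supporting the strike); the label "no_support" and the function names are assumptions made for the example.

```python
# Cell counts derived from Table 1's percentages and Ns.
counts = {
    "Secondary":  {"support": 480, "no_support": 320},
    "Elementary": {"support": 300, "no_support": 700},
}

def percent_within_school(table):
    """Base = all cases in each type of school (categories of the independent variable)."""
    return {school: {k: round(100 * v / sum(cells.values())) for k, v in cells.items()}
            for school, cells in table.items()}

def percent_within_response(table):
    """Base = all cases giving each response (categories of the dependent variable)."""
    totals = {}
    for cells in table.values():
        for k, v in cells.items():
            totals[k] = totals.get(k, 0) + v
    return {school: {k: round(100 * v / totals[k]) for k, v in cells.items()}
            for school, cells in table.items()}

print(percent_within_school(counts))    # e.g. 60% of secondary teachers support the strike
print(percent_within_response(counts))  # e.g. 62% of strike supporters teach secondary
```

Percentaging within categories of the independent variable is the usual choice, because it lets the categories be compared on the dependent variable.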
Collapsing categories (U.N. example)
Babbie, E. (1995). The practice of social research.
Belmont, CA: Wadsworth.
Collapsing Categories & omitting missing data
Babbie, E. (1995). The practice of social research.
Belmont, CA: Wadsworth.
Grouping Response Categories
• To make new categories
• Facilitate analysis of trends
• But decisions have effects on the
interpretation of patterns