Quantitative approaches: the Scientific Method

Orientation: the core principles over three weeks + SPSS workshop in week 10
• Description and ‘reduction’ using
standardised data sets
• Statistical inference I: testing propositions
• Statistical inference II: prediction
• All mediated through probability theory
(pre-set significance levels)
Underlying propositions
• The fundamental premise of science is that there
are fundamental truths that exist (independently
of human opinions about them) to be discovered
• Research approaches in this paradigm require
empirical work – evidence gathered through
observation and measurement that can be
replicated by others.
• The accent is on objectivity (although scientists’
opinions on what they hope to discover may
influence interpretations of results!)
Model building
• Our interest is in discovering something that we assume
is a ‘real-world’ phenomenon
• Approaching this statistically means taking data that is
available and using it in a meaningful way
• This often involves building statistical models of the
phenomenon of interest
• The reason is that real-world data may best be explained
by analogy
• We collect ‘observed data’ and then try to ‘fit’ this to a
model: our ability to infer from it depends on quality of ‘fit’
- is the model reasonably like the ‘reality’ of interest?
Andy’s equation
Everything in statistics boils down to:
$\text{Outcome}_i = (\text{Model}_i) + \text{error}_i$
Field, A. (2005:7, 2nd edition) Discovering Statistics using SPSS. London: Sage.
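A minimal sketch of this equation in Python, assuming the simplest possible model (the mean of the observed outcomes). The five outcome scores are hypothetical and serve only to show how each observation splits into a model part and an error part.

```python
from statistics import mean

# Hypothetical outcome scores for five cases (illustrative only, not from the slides).
outcomes = [4.0, 6.0, 7.0, 8.0, 10.0]

# The simplest possible model: predict the same value (the mean) for every case.
model = mean(outcomes)

# error_i = outcome_i - model_i, so that outcome_i = model_i + error_i.
errors = [y - model for y in outcomes]

for y, e in zip(outcomes, errors):
    print(f"outcome = {y:5.1f}   model = {model:5.1f}   error = {e:+5.1f}")
```

The smaller the errors are overall, the better the model 'fits' the observed data.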
Measurement
• Scientists search for appropriate scales that can
be observed - e.g. mass, volume, height
• Management and social science research seeks
operational definitions for more abstract
concepts: e.g. scales for measuring satisfaction,
motivation, commitment, etc. These scales are
contestable.
• Not just measurement scales: to what extent
must something occur to find a place on a
particular measurement scale that constitutes a
significant observation?
• What’s known as measurement error is an issue.
Cases and variables
• Scientists seek to identify cases - people,
organisations, events on which to assemble
evidence (in the form of variables)
• Anything that may change under observation –
e.g. employee commitment to managerial goals
at a time of organizational changes – is a
variable. Variables may be observed/surveyed
and manipulated (experimental research).
Populations and samples
• Ideally scientists wish to capture and
analyse data from the population of
interest
• Resources constrain satisfaction of this
aim
• Sampling (random for generalisation
purposes) is used to access part of a
population for analysis: sampling error!
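Sampling error can be illustrated with a short simulation. This is a sketch under stated assumptions: the population of 5,000 commitment scores is invented, and the sample size of 100 is chosen only for illustration.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: commitment scores (1 to 10) for 5,000 employees.
population = [random.randint(1, 10) for _ in range(5000)]
population_mean = mean(population)

# Simple random sample of 100 cases, drawn without replacement.
sample = random.sample(population, 100)
sample_mean = mean(sample)

# Sampling error: the gap between the sample estimate and the population value.
sampling_error = sample_mean - population_mean

print(f"population mean = {population_mean:.3f}")
print(f"sample mean     = {sample_mean:.3f}")
print(f"sampling error  = {sampling_error:+.3f}")
```

In real research the population mean is unknown, which is why the size of the sampling error has to be estimated rather than observed.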
Preparation for quantitative
analysis
• Decide how the data is to be displayed
and evaluated/tested and then work back
to the basis on which it is gathered
• The types of data generated by answers to
questions create both opportunities and
limitations for data display, reduction
and evaluation …
Types of data: a fundamental
feature of quantitative analysis
• Recognising and understanding the
characteristics of the different types of data that
might be collected in survey-based research is
key to selecting analytical tools.
• Data can be broadly classified into three main
types: (1) ‘nominal’ and (2) ‘ordinal’ categories,
and (3) ‘cardinal’ (a term which includes ‘interval’
and ‘ratio’ data).
Characteristics of data types - statistical
techniques for various types of analysis.
Nominal data
• Characteristics: labels for categories (from the Latin for ‘name’)
• Some examples: male/female, yes/no, nationality, occupation, political party affiliation
• Graphical summary: pie and bar charts
• Tabular format: frequency counts by category
• Summary statistics: proportions, mode. Although data may be coded (e.g. 0/1) to ease collection & analysis, values do not measure anything - there is no sequence, there are no scaled intervals etc.
Ordinal data
• Characteristics: underlying sequence and order
• Some examples: management grade, social class; Likert scale data, rankings, ratings
• Graphical summary: pie and bar charts
• Tabular format: frequency counts by category or rating
• Summary statistics: non-parametric statistics: median, range, inter-quartile range. Data is often converted to ranks for further analysis
Cardinal data
• Characteristics: has sequence and units of measurement; data can be discrete or continuous; ratio scale has an absolute zero
• Some examples: age, salary, length of service, contracted hours, overtime worked, number of dependents
• Graphical summary: histogram
• Tabular format: frequency counts by numerical value or group
• Summary statistics: parametric statistics: mean, standard deviation
Developed in part from Siegel and Morgan (1996) Statistics and Data Analysis: An Introduction
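The guidance in this summary can be captured as a simple lookup. The sketch below is illustrative only: the dictionary layout and the function name `tools_for` are assumptions, while the charts and statistics listed are those given in the summary.

```python
# Charts and summary statistics suggested for each data type in the summary.
# The structure and names here are illustrative assumptions.
SUMMARY_TOOLS = {
    "nominal": {
        "charts": ["pie chart", "bar chart"],
        "statistics": ["proportions", "mode"],
    },
    "ordinal": {
        "charts": ["pie chart", "bar chart"],
        "statistics": ["median", "range", "inter-quartile range"],
    },
    "cardinal": {
        "charts": ["histogram"],
        "statistics": ["mean", "standard deviation"],
    },
}


def tools_for(data_type: str) -> dict:
    """Return the suggested charts and statistics for a data type."""
    try:
        return SUMMARY_TOOLS[data_type.lower()]
    except KeyError:
        raise ValueError(f"unknown data type: {data_type!r}") from None


print(tools_for("Ordinal")["statistics"])
```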
Consider Question A1: ‘From which airport did you start your journey?’.
Three alternative answers are available coded 1, 2, 3. This is an example of
nominal data (‘words’ as answers; no logical order). This data can be
summarised by a frequency table and graph.
• Question A1
Frequency table: Airport Used
Valid    Frequency    Percent    Valid Percent    Cumulative Percent
1        34           34.0       34.0             34.0
2        34           34.0       34.0             68.0
3        32           32.0       32.0             100.0
Total    100          100.0      100.0
[Bar chart: number of passengers (axis 0 to 40) by airport used (Luton, Stansted, Glasgow)]
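The percent and cumulative percent columns of a frequency table like this one follow directly from the counts. A minimal sketch, using the A1 frequencies reported in the table; the function name `frequency_table` is an illustrative assumption, not an SPSS feature.

```python
# Frequency counts for Question A1 (answer codes 1 to 3), as reported in the table.
counts = {1: 34, 2: 34, 3: 32}


def frequency_table(counts):
    """Return rows of (value, frequency, percent, cumulative percent)."""
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for value in sorted(counts):
        cumulative += counts[value]
        rows.append((
            value,
            counts[value],
            100 * counts[value] / total,
            100 * cumulative / total,
        ))
    return rows


print(f"{'Valid':<8}{'Frequency':>10}{'Percent':>10}{'Cumulative':>12}")
for value, freq, pct, cum in frequency_table(counts):
    print(f"{value:<8}{freq:>10}{pct:>10.1f}{cum:>12.1f}")
```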
Consider Question A3 ‘Please rate the courtesy/helpfulness of the crew.’
The responses to this question represent ordinal data. A bar-chart or a pie
chart would be useful here. Choosing a pie-chart:
[Pie chart: Courtesy/helpfulness of the crew (excellent 17.0%, good 32.0%, fair 40.0%, poor 11.0%)]
A3
Valid    Frequency    Percent    Valid Percent    Cumulative Percent
1        17           17.0       17.0             17.0
2        32           32.0       32.0             49.0
3        40           40.0       40.0             89.0
4        11           11.0       11.0             100.0
Total    100          100.0      100.0
Graphs for other nominal and ordinal data sets can be
produced in the same way as the two illustrations shown.
Consider Question A5a ‘How much did you spend?’. The answers to this
question are an example of a cardinal data set. Cardinal data is best
represented by a histogram. By dividing the data into classes, the general
pattern of the amounts spent can be seen.
[Histogram: Amounts spent on drinks/snacks. x-axis: amounts spent in £ (2.0 to 12.0); y-axis: number of passengers (0 to 14). Mean = 6.8, Std. Dev = 2.54, N = 78]
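Dividing cardinal data into classes, as a histogram does, can be sketched as follows. The ten amounts and the class width of £2 are hypothetical, chosen only to show the grouping step; they are not the GOEASY responses.

```python
import math
from collections import Counter

# Hypothetical amounts spent in pounds (illustrative only, not the survey data).
amounts = [1.6, 4.2, 5.5, 6.1, 6.7, 6.9, 7.3, 8.0, 9.4, 12.2]


def class_counts(values, width):
    """Count values per class, keyed by the lower bound of each class."""
    counts = Counter(math.floor(v / width) * width for v in values)
    return dict(sorted(counts.items()))


for lower, n in class_counts(amounts, width=2.0).items():
    print(f"£{lower:4.1f} to under £{lower + 2.0:4.1f}: {'#' * n}")
```

The printed rows of `#` marks give a rough text version of the histogram's bars.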
Summary statistics I
Measures of ‘central tendency’ using categorical data
• Considering the responses to A1 again, the only valid statistical
average for this nominal data is the mode. There are in fact two
modes here - Luton and Stansted airports.
• For question A4, ‘Please rate the comfort/cleanliness of the aircraft’,
the responses do have a logical order and thus the data is ordinal.
Here we can use the median as a measure of ‘average’, i.e. the
‘middle response’ will give a feel for a ‘typical response’. A median of
2 indicates that the middle response was ‘good’ and thus there were
more favourable responses than negative ones.
• However, more useful information can be gained from a frequency
table as it gives the percentages of responses in each category.
Valid    Frequency    Percent    Valid Percent    Cumulative Percent
1        49           49.0       49.0             49.0
2        34           34.0       34.0             83.0
3        14           14.0       14.0             97.0
4        3            3.0        3.0              100.0
Total    100          100.0      100.0
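The mode for A1 and the median for A4 can be checked from the frequency counts alone. A minimal sketch using Python's standard `statistics` module; the helper `expand` is an illustrative assumption that rebuilds the list of coded responses from the counts.

```python
from statistics import median_low, multimode


def expand(counts):
    """Turn {code: frequency} into the underlying list of coded responses."""
    return [code for code, freq in sorted(counts.items()) for _ in range(freq)]


a1_counts = {1: 34, 2: 34, 3: 32}        # nominal: airport used
a4_counts = {1: 49, 2: 34, 3: 14, 4: 3}  # ordinal: comfort/cleanliness rating

# Nominal data: the mode is the only valid average (here there are two modes).
a1_modes = multimode(expand(a1_counts))

# Ordinal data: the median is the middle response. median_low always returns
# an actual response code rather than an average of two codes.
a4_median = median_low(expand(a4_counts))

print("A1 modes:", a1_modes)
print("A4 median:", a4_median)
```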
Summary statistics II
Measures of ‘central tendency’ and ‘dispersion’ using cardinal data
• For Question A5a we can use statistics SPSS produces to summarise cardinal data.
• In the example, it is meaningful to state: ‘The mean amount spent on drinks/snacks
was £6.79 with a standard deviation of £2.54’.
Descriptives: GOEASY sample spending on drinks and snacks
A5A                                               Statistic    Std. Error
Mean                                              6.787        .287
95% Confidence Interval for Mean   Lower Bound    6.215
                                   Upper Bound    7.359
5% Trimmed Mean                                   6.773
Median                                            6.725
Variance                                          6.431
Std. Deviation                                    2.536
Minimum                                           1.6
Maximum                                           12.2
Range                                             10.6
Interquartile Range                               3.200
Skewness                                          .012         .272
Kurtosis                                          -.504        .538
• But is the sample an accurate representation of reality?
• To answer that question, we need to encounter the notion of ‘confidence intervals’
Statistical inference: confidence intervals
Data sets tend to represent a sample taken from a wider population.
‘Confidence intervals’ indicate how precisely estimates from a model
built on a sample can be generalised to the population that is the
focus of interest.
Generalisation within stated confidence intervals is informed by
probability theory.
Applying this practically, consider the
responses to Question A3 of the ‘GOEASY’ survey:
17 respondents out of 100 rated courtesy/helpfulness of the crew as ‘excellent’.
We can determine a 95% confidence interval for this proportion using the formula:
$$p \pm z \sqrt{\frac{pq}{n}}$$
For a 95% confidence interval z = 1.96 and p = 0.17, thus q = 1 - p = 0.83, with n = 100:
$$\begin{aligned}
0.17 &\pm 1.96 \sqrt{\frac{0.17 \times 0.83}{100}} \\
0.17 &\pm 1.96 \sqrt{\frac{0.1411}{100}} \\
0.17 &\pm 1.96 \sqrt{0.001411} \\
0.17 &\pm 1.96 \times 0.03756 \\
0.17 &\pm 0.07362 \\
&[0.09638,\ 0.24362]
\end{aligned}$$
Thus we can be 95% confident that the proportion of all GOEASY passengers, not
just those surveyed, who rate the courtesy/helpfulness of the crew as ‘excellent’ lies
in the range 9.64% to 24.36%.
This wide confidence interval results from the small sample size (100 respondents).
Thus results from small samples should be treated with caution!
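The same calculation can be sketched in Python. The function name `proportion_ci` is an illustrative assumption; the method is the normal-approximation formula used in the worked example, with z = 1.96 for 95% confidence.

```python
import math


def proportion_ci(successes, n, z=1.96):
    """Normal-approximation confidence interval for a sample proportion."""
    p = successes / n
    q = 1 - p
    margin = z * math.sqrt(p * q / n)
    return p - margin, p + margin


# 17 of 100 respondents rated the crew as 'excellent' (Question A3).
lower, upper = proportion_ci(17, 100)
print(f"95% CI: {lower:.2%} to {upper:.2%}")

# Same proportion with a hypothetical sample ten times larger: a narrower interval.
lower_big, upper_big = proportion_ci(170, 1000)
print(f"95% CI with n = 1000: {lower_big:.2%} to {upper_big:.2%}")
```

The second call uses an invented sample of 1,000 only to show how the interval narrows as the sample size grows.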
Confidence Interval for
a Population Mean
If the population standard deviation is not known then we use the t
distribution in our calculation of the confidence interval. We use the sample
standard deviation s as the best estimate of the population standard deviation.
The formula for a confidence interval of the mean is
$$\bar{x} \pm t \frac{s}{\sqrt{n}}$$
t is found from t-tables, which appear in many introductory statistics texts. The
‘degrees of freedom’ (an approximation of the sample size) is n - 1.
SPSS will perform these calculations: here we simply wish to open the
‘black box’ to show how the theory works.
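Using the responses to question A5a in the ‘GOEASY’ survey dataset (value of customer
spending on drinks and snacks), the confidence interval may be calculated as follows:
[NB n = 78 below means that only 78 people answered this question on the survey document]
$$\bar{x} \pm t \frac{s}{\sqrt{n}} = 6.787 \pm 1.991 \times \frac{2.536}{\sqrt{78}} = 6.787 \pm 0.572 = [6.215,\ 7.359]$$
(with $\bar{x} = 6.787$ and $s = 2.536$ taken from the SPSS descriptives, and $t \approx 1.991$ for 77 degrees of freedom)
Thus we can be 95% confident that the mean amount spent by all GOEASY passengers
(i.e. not just those surveyed) lies in the range £6.22 to £7.36.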
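A sketch of this calculation in Python, using the mean (6.787), standard deviation (2.536) and n (78) reported by SPSS for A5a. The critical value 1.991 is an assumption read from t-tables for 77 degrees of freedom at the 95% level, and the function name `mean_ci` is illustrative.

```python
import math


def mean_ci(sample_mean, s, n, t_crit):
    """Confidence interval for a mean when the population SD is unknown."""
    standard_error = s / math.sqrt(n)
    margin = t_crit * standard_error
    return sample_mean - margin, sample_mean + margin


# Assumption: two-tailed 95% t value for n - 1 = 77 degrees of freedom (from t-tables).
T_CRIT_DF77 = 1.991

lower, upper = mean_ci(sample_mean=6.787, s=2.536, n=78, t_crit=T_CRIT_DF77)
print(f"standard error = {2.536 / math.sqrt(78):.3f}")
print(f"95% CI for the mean: £{lower:.2f} to £{upper:.2f}")
```

The standard error computed here corresponds to the Std. Error that SPSS reports alongside the mean.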
Confidence intervals with SPSS
A5A                                               Statistic    Std. Error
Mean                                              6.787        .287
95% Confidence Interval for Mean   Lower Bound    6.215
                                   Upper Bound    7.359
5% Trimmed Mean                                   6.773
Median                                            6.725
Variance                                          6.431
Std. Deviation                                    2.536
Minimum                                           1.6
Maximum                                           12.2
Range                                             10.6
Interquartile Range                               3.200
Skewness                                          .012         .272
Kurtosis                                          -.504        .538
This SPSS table enables us to infer that the 95% confidence interval for the mean
amount spent on drinks/snacks is £6.22 to £7.36 (see the Lower Bound and Upper Bound
rows in the tabulation above). Thus we can be 95% confident that the true mean amount
spent by all passengers (i.e. the population of ‘GOEASY’ travellers, not just those
surveyed) lies between £6.22 and £7.36. Compare this with the manual calculation above.