Week 1: Introduction to Statistics
Introduction to Statistics
Week 1
Required readings
Levine et al. 2005
Chapter 1
Excel Primer (may be revision)
Appendix G if you are installing PHStat2 on your
own computer
Appendices F & H – help with Excel setup & how
to copy material from Excel to other packages
Visit course website – look at everything!
Study guide – this contains the required readings
and exercises each week. It is a must have!
Objectives
On completion of this module you should be able to:
define and explain what is meant by ‘statistics’,
define terms such as sample, population, statistic,
parameter, descriptive statistics and inferential
statistics,
explain the importance of accurate data
collection,
understand and apply survey sample methods,
define the measurement scales nominal, ordinal,
interval and ratio,
evaluate the effectiveness and worth of a survey
and survey results and
utilise basic Excel functions.
What is Statistics?
Statistics involves planning, collecting and analysing
data, and reporting and interpreting results.
Information gained from data enables sound
analytical decision making.
Descriptive statistics – collecting, summarising
and describing data sets.
Inferential statistics – using sample data to estimate
characteristics of a population, discover patterns and
make inferences about the population (or the future).
Statistical terminology
Population – the set of all possible elements
that could be observed.
Sample – a selected portion of the population.
Parameter – a characteristic of the population.
Statistic – a characteristic of the sample (used
to estimate a parameter).
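The four terms can be illustrated with a minimal Python sketch. The population of 1,000 invoice amounts is made up purely for illustration, and the variable names are assumptions rather than anything from the readings.

```python
import random
import statistics

random.seed(1)  # fixed seed so the illustration is reproducible

# Hypothetical population: 1,000 made-up invoice amounts (in dollars).
population = [round(random.uniform(10, 500), 2) for _ in range(1000)]

# Parameter: a characteristic of the whole population
# (in practice this is usually unknown).
parameter = statistics.mean(population)

# Sample: a selected portion of the population (here 50 items chosen at random).
sample = random.sample(population, 50)

# Statistic: the same characteristic computed from the sample,
# used to estimate the parameter.
statistic = statistics.mean(sample)

print(f"Population mean (parameter): {parameter:.2f}")
print(f"Sample mean (statistic):     {statistic:.2f}")
```

The two printed values will usually be close but not identical, which is why a statistic is only an estimate of a parameter.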
Sampling terminology
Element – an object on which a measurement is
made (eg a registered voter in Victoria).
Sampling unit – non-overlapping collections of
elements from a population. Ideally the sampling
units are the same as the elements, but
sometimes it is cheaper to sample groups (eg
households instead of individual voters).
Frame (or sampling frame) – a list of sampling
units (eg the electoral roll).
Sample – a collection of sampling units drawn
from a frame.
Data collection
Good quality data is essential for effective
decision making in the business environment.
Data can be obtained from
a primary source, where the person who collects the
data is also the one who analyses it, or
a secondary source, where the data is collected by
someone else and then made available for others to
use (for example, data made available by the
Australian Bureau of Statistics).
Data Collection – some primary sources
An experiment
The effect of various treatments is explored in a
controlled situation.
For example, testing two brands of air-bags in
new cars. Crash test dummies might be placed
in cars which have been fitted with the different
brands, and then used to measure the potential
damage to car occupants.
Data Collection – primary sources
Personal interviews – the interviewer asks
questions and notes the responses.
Usually has a good response rate.
Body language can be noted and
misunderstandings of questions minimised.
Requires training in interview techniques to
avoid leading the respondent.
Recording errors can occur so taping sessions
may be useful.
Data Collection – primary sources
Telephone interviews
These are similar to personal interviews but are
cheaper to conduct.
The sampling frame may not reflect the
population (not everyone has a phone).
The non-response rate is reasonably high –
there is a large annoyance factor!
Data Collection – primary sources
Self-administered questionnaires (paper or
web-based)
Cheap to administer.
Usually have a low response rate (follow up is
often required).
Tend to attract responses from those with strong
feelings (either positive or negative) on the issue
being surveyed.
Need to be carefully designed to encourage
participation and avoid leading or ambiguous
questions.
Data Collection – primary sources
Direct observation
A person counts events as they occur (for
example counting cars crossing a bridge).
Electronic equipment is sometimes used.
Data Collection – primary sources
Focus groups are a form of direct observation
that is often used in market research.
These use open-ended questions.
They include a moderator who leads the
discussion.
Other group studies include brainstorming, the
Delphi Method & the nominal-group technique.
Sampling
Non-probability sample – items are chosen
without knowing their probability of
selection.
Probability sample – items are chosen based
on known probabilities of selection.
Sampling
Simple random sample – a sample in which
every item in the frame has the same chance of
being selected.
Sampling with replacement – items are
returned to the frame after they are selected.
Sampling without replacement – items are not
returned to the frame after selection.
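The difference between the two schemes can be sketched in Python with the standard `random` module. The frame of 20 numbered items is a hypothetical example.

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

frame = list(range(1, 21))  # hypothetical frame: items numbered 1 to 20

# Without replacement: a selected item is not returned to the frame,
# so each item can appear in the sample at most once.
without_replacement = random.sample(frame, 5)

# With replacement: a selected item is returned to the frame,
# so the same item may be selected more than once.
with_replacement = random.choices(frame, k=5)

print("Without replacement:", without_replacement)
print("With replacement:   ", with_replacement)
```

In both cases every item in the frame has the same chance of being chosen at each draw.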
Sampling
Random number tables (see Table E.1 in
Appendix E).
Numbers are taken from these tables to select
sample items.
(Pseudo) random number generators
Similar to random number tables.
Most statistical software and many calculators
will generate these.
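As a rough illustration of why these generators are called "pseudo" random, the Python sketch below shows that the same seed reproduces the same stream of numbers. The function name, the seed value and the frame size of 800 are all assumptions made for the example.

```python
import random

def draw_item_numbers(seed, frame_size, sample_size):
    """Return sample_size distinct item numbers between 1 and frame_size."""
    rng = random.Random(seed)  # generator with its own, explicitly seeded state
    return rng.sample(range(1, frame_size + 1), sample_size)

first = draw_item_numbers(seed=2005, frame_size=800, sample_size=5)
second = draw_item_numbers(seed=2005, frame_size=800, sample_size=5)

print(first)
print(first == second)  # the same seed gives the same "random" numbers
```

Recording the seed plays a similar role to recording the starting row and column used in a random number table: it lets someone else reproduce the selection.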
Sampling
Systematic sample
Given N individuals in the frame and n in the
sample, the frame is partitioned into k groups
where k=N/n.
An item is chosen randomly from the first k items.
Every kth item after this is sampled.
For example, if 200 customers (N) are in the
frame and a sample of 20 (n) is required,
k=200/20=10.
An item is randomly chosen from the first 10 (say
item 7) and then every 10th item is chosen after
that. The sample would be items 7, 17, 27, ..., 197.
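The procedure can be sketched in Python as follows. The function name `systematic_sample` is an assumption, and the sketch assumes N is an exact multiple of n, as in the customer example.

```python
import random

def systematic_sample(frame, n, rng=random):
    """Systematic sample of n items from frame (assumes len(frame) is a multiple of n)."""
    N = len(frame)
    k = N // n                 # sampling interval, k = N/n
    start = rng.randrange(k)   # random position within the first k items (0-based)
    # Take the starting item, then every kth item after it.
    return [frame[start + i * k] for i in range(n)]

customers = list(range(1, 201))          # 200 customers numbered 1 to 200
sample = systematic_sample(customers, 20)
print(sample)
```

If the random start happens to be customer 7, the result is 7, 17, 27 and so on up to 197.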
Sampling
Stratified sample
The frame is divided into strata, from each of
which a random sample is drawn.
Items are grouped into strata according to a shared
characteristic (eg high, medium and low income
earners).
A simple random sample is taken from within
each stratum.
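A minimal Python sketch of these steps is given below. The frame of 100 people, the income bands assigned to them and the choice of 5 people per stratum are all made up for illustration.

```python
import random

random.seed(7)  # fixed seed so the illustration is reproducible

# Hypothetical frame: 100 people, each tagged with an income band.
bands = ["low"] * 50 + ["medium"] * 30 + ["high"] * 20
frame = list(enumerate(bands, start=1))  # (person_id, income_band) pairs

# Divide the frame into strata according to the shared characteristic.
strata = {}
for person_id, band in frame:
    strata.setdefault(band, []).append(person_id)

# Take a simple random sample from within each stratum.
per_stratum = 5  # assumed sample size per stratum
sample = {band: random.sample(ids, per_stratum) for band, ids in strata.items()}

for band, ids in sample.items():
    print(band, sorted(ids))
```

Every stratum is guaranteed to appear in the sample, which a simple random sample of the whole frame cannot promise.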
Sampling
Cluster sample
The frame is divided into clusters so that each
cluster is representative of the population.
A random sample of clusters is drawn and
every item within the chosen clusters is studied.
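For comparison with stratified sampling, cluster sampling can be sketched in Python as below. The 12 households and their voters are hypothetical, as is the choice of 3 clusters.

```python
import random

random.seed(3)  # fixed seed so the illustration is reproducible

# Hypothetical frame: 12 households (the clusters), each holding 2 to 5 voters.
clusters = {
    f"household_{h}": [f"voter_{h}_{v}" for v in range(1, random.randint(2, 5) + 1)]
    for h in range(1, 13)
}

# Draw a random sample of clusters...
chosen = random.sample(sorted(clusters), 3)

# ...then study every item within the chosen clusters.
sample = [voter for name in chosen for voter in clusters[name]]

print(chosen)
print(sample)
```

Note that the randomness is applied to the clusters, not to the individual items: once a household is chosen, all of its voters are included.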
Figure: Systematic sample versus simple random sample, illustrated on a numbered grid (1 to 100).
Systematic sample – select every kth element (here, going across rows, k=8).
Simple random sample – random sample from the entire population.
Figure: Stratified sampling versus cluster sampling.
Stratified sampling – random sample from within each stratum.
Cluster sampling – random sample of clusters, but a census within the chosen clusters.
Measurement scales
Categorical random variables – responses are
categories such as: yes or no, male or female,
low, medium or high income earner etc.
Numerical random variables – responses are
numerical such as: height, weight, time,
distance.
Numerical variables can be discrete or
continuous.
Measurement scales
Discrete random variables are numerical
responses resulting from counting (they are
integers).
Examples: the number of students in a
classroom, the number of mobile phone models
marketed by a company etc.
Continuous random variables arise from a
measuring process.
Examples: the speed of a car travelling along a
highway, your weight etc.
Level of measurement
Categorical variables can be:
Nominal – there is no particular order to the
categories (eg male/female, yes/no)
Ordinal – there is an order to the categories (eg
first year at uni, second year at uni, third year at
uni etc)
Numerical variables can be:
Interval – there is no true zero point (eg
temperature – degrees Celsius & Fahrenheit have
different zero points)
Ratio – there is a true zero point (eg a person’s
height, weight etc)
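One practical consequence of the true zero point is that ratios are only meaningful on a ratio scale. The Python sketch below is a worked illustration with made-up values (10 and 20 degrees Celsius, heights of 150 cm and 180 cm), and the helper function names are assumptions.

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

def cm_to_inches(cm):
    """Convert centimetres to inches."""
    return cm / 2.54

# Interval scale: the ratio changes when the unit changes, because the
# two temperature scales put zero in different places.
temp_ratio_c = 20 / 10                   # 2.0
temp_ratio_f = c_to_f(20) / c_to_f(10)   # 68 / 50 = 1.36

# Ratio scale: zero means "none" in every unit, so the ratio is preserved.
height_ratio_cm = 180 / 150
height_ratio_in = cm_to_inches(180) / cm_to_inches(150)

print(temp_ratio_c, round(temp_ratio_f, 2))
print(round(height_ratio_cm, 2), round(height_ratio_in, 2))
```

So it makes sense to say one person is 1.2 times as tall as another, but not that 20 degrees Celsius is "twice as hot" as 10 degrees Celsius.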
Example 1-1
For each of the following random variables,
determine whether the variable is categorical or
numerical.
If the variable is numerical, determine whether
the variable of interest is discrete or continuous.
In addition, determine the level of measurement.
a) Number of mobile phones per household
Numerical – number of phones
Discrete – talking about whole phones
Ratio – true zero point
Example 1-1: Solution
b) Mobile phone service provider
Categorical – names of providers
Nominal – no particular order to the providers
(although you could choose to order alphabetically or
according to size etc)
c) Number of text messages sent per month
Numerical, discrete (whole messages), ratio
d) Length (in minutes) of longest call made
during month
Numerical, continuous (although we are rounding to minutes,
so you could argue for discrete & ordinal), ratio
Example 1-1: Solution
e) Colour of mobile phone
Categorical (names of colours), nominal (no particular
order to the colours)
f) Monthly charge (in dollars and cents) for
calls made
Numerical, continuous (money is measured on a
continuous scale although we are rounding to the
nearest cent), ratio (true zero point)
g) Ownership of a car charge kit
Categorical (own a charge kit or don’t), nominal (no
particular order to these categories)
Example 1-1: Solution
h) Number of calls made per month
Numerical, discrete (whole calls), ratio
i) Whether there is a telephone line connected
to a computer modem in the household
Categorical (connected or not connected), nominal (no
particular order to these options)
j) Whether there is a fax machine in the
household
Categorical (fax machine present or not present),
nominal
Effectiveness & worth of surveys
Errors to be aware of in surveys:
Coverage error – occurs if certain groups have
been excluded from the frame. Results in
selection bias.
Non-response error – not everyone will
willingly respond to a survey. Results in
non-response bias.
Sampling error – chance differences exist
between possible samples which means there
is always a margin of error in sample results.
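Sampling error can be seen in a small simulation. The Python sketch below draws ten samples from the same made-up population and compares their means. The population values, the sample size of 50 and the variable names are assumptions for illustration only.

```python
import random
import statistics

random.seed(11)  # fixed seed so the illustration is reproducible

# Hypothetical population of 5,000 values.
population = [random.gauss(100, 15) for _ in range(5000)]
population_mean = statistics.mean(population)

# Draw ten different simple random samples of 50 and record each sample mean.
sample_means = [
    statistics.mean(random.sample(population, 50)) for _ in range(10)
]

print(f"Population mean: {population_mean:.2f}")
for i, m in enumerate(sample_means, start=1):
    print(f"Sample {i:2d} mean: {m:.2f} (difference {m - population_mean:+.2f})")
```

No sample was collected badly here. The differences arise purely from which items happened to be chosen.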
Effectiveness & worth of surveys
Errors to be aware of in surveys:
Measurement error – be careful with:
ambiguous wording
the halo effect – the respondent feels
obligated to please the interviewer
respondent error – over or under-zealous
effort by the respondent
Ethical issues will be discussed in tutorials.
Example 1-2
The manager of an electronics company is
interested in determining whether customers
who purchased a digital camera over the past 12
months were satisfied with their purchase.
The manager is planning to survey these
customers using the contact information given
on warranty cards submitted after the
purchases.
Example 1-2
a) Describe the population and frame. What
differences are there between the population and
the frame? How might these differences affect
the results?
Population – all customers who’ve purchased a digital
camera in the past 12 months.
Sampling frame – list of all customers from the past 12
months who’ve returned the warranty card.
Differences – not every customer will return the warranty
card, so the sampling frame is likely to be smaller than the
population.
Some kinds of customers may be more likely to return the
card, so the sampling frame may not be representative of
the population.
Example 1-2
b) Develop three categorical questions that you
feel would be appropriate for this survey.
Three possible answers:
What is your gender?
What brand of digital camera did you purchase?
How satisfied are you with your digital camera
(please circle)?
1 (Very dissatisfied)   2   3   4   5 (Very satisfied)
Example 1-2
c) Develop three numerical questions that you
feel would be appropriate for this survey.
Three possible answers:
What price did you pay for your digital camera?
How long ago (in months) did you purchase your
digital camera?
How many times have you brought your digital
camera in for service or repair since you purchased
it?
Example 1-2
d) How could a simple random sample of
warranty cards be selected?
The warranty cards are probably numbered.
Using random numbers from a random number table or
a pseudo-random number stream from a calculator or
computer software, the corresponding warranty cards
can be selected.
If there is no numbering system already, one could
easily be created (by ordering the cards according to
date and time of purchase, for example).
Example 1-2
e) If the manager wanted to select a sample
of warranty cards for each brand of digital
camera sold, how should the sample be
selected? Explain.
Use stratified sampling with the brands as the strata.
Strata will differ in size, which could lead to bias if
equal numbers were sampled from each stratum.
Sample from each stratum in proportion to that
stratum’s share of the population.
For example if 20% of all cameras sold were Pentax,
then 20% of the sample should be drawn from the
Pentax stratum.
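The proportional allocation can be worked through in a short Python sketch. Only the 20% Pentax share comes from the example. The other brand names, the sales counts and the total sample size of 50 are hypothetical.

```python
# Hypothetical number of cameras sold per brand (Pentax is 20% of the total).
sales = {"Pentax": 200, "Brand B": 500, "Brand C": 300}
sample_size = 50  # assumed total number of warranty cards to sample

total_sold = sum(sales.values())

# Each stratum's share of the sample matches its share of the population.
allocation = {
    brand: round(sample_size * sold / total_sold)
    for brand, sold in sales.items()
}

for brand, n in allocation.items():
    print(f"{brand}: {sales[brand] / total_sold:.0%} of sales -> {n} cards")
```

With less convenient numbers the rounded allocations may not add up exactly to the intended sample size, so a small manual adjustment may be needed.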
Excel
Ensure you cover the material in the Excel
primer.
This will be assumed knowledge for the
remainder of the course.
There will be time in workshops for asking
questions about this material, but you will also
need to work on it in your own time.
After the lecture each week…
Review the lecture material
Complete all readings
Complete all of the recommended problems (listed
in the study guide) from the textbook
Complete at least some of the additional problems
Consider (briefly) the discussion points prior to
tutorials