Pie Charts

Lecture 3
Summarising Data
For use in fall semester 2015
Lecture notes were originally designed by Nigel Halpern. This
lecture set may be modified during the semester.
Last modified: 4-8-2015
SCM300 Survey Design
Lecture Aim & Objectives
Aim
• To investigate pictorial & statistical methods of
analysing quantitative data
Objectives
• Pictorial representation of quantitative data
• Statistical representation of quantitative data
Pictorial Representation
• Levels of measurement
• Tables & frequency distributions
• Charts, plots, graphs & pie-charts
3 (4) Levels of Measurement
• Nominal variables
• Ordinal variables
• Interval (& ratio) variables
Nominal
• Categories
– e.g. gender (m/f), responses (y/n), class of travel (b/l)
• Usually presented as frequencies & categories or %’s
– e.g. 45% male, 55% female
• Measure the existence (or not) of a characteristic
– But contain limited information
Ordinal
• Ordered categories or preferences
– e.g. ranked responses from a Likert scale
– e.g. finishers in a race (1st, 2nd, 3rd, etc)
– e.g. preferred aircraft
• Measure intensity, order or degree
– But still limited as they don’t imply distances
• i.e. the distance between 1st & 2nd is not known
Interval & Ratio
• Ordered & scaled (on equal intervals)
– e.g. age in years, temperature
• Measures differences between values
– Interval: arbitrary zero
• e.g. temperature (+/-)
– Ratio: absolute zero indicating absence of that variable
• e.g. age, income
• High analytical capabilities
– e.g. can compare means unlike for nominal or ordinal data
Levels of Measurement Summary
Variable type    Description                                                      Examples
Nominal          Classification of responses into mutually exclusive categories   Male/Female, Yes/No
Ordinal          Categories are rank ordered                                      1st/2nd/3rd, Likert
Interval/Ratio   Distances between items on scale are equal                       Temperature, Age
Your turn…
Variable type    Description
Nominal          Classification of responses into mutually exclusive categories
Ordinal          Categories are rank ordered
Interval/Ratio   Distances between items on scale are equal
• What levels of measurement would be derived from each of the following questions?
1. Gender (male/female)
2. Age in years and months (state years/months)
3. Do you smoke (yes/no)
4. How many cigarettes, on average, do you smoke a day (state no.)
5. Number of full years you’ve been smoking (state no.)
6. How many minutes exercise do you do, on average, each day (less than 30mins / 30-59mins / 60+mins)
7. To what extent do you think that smoking is bad for your health (Strongly agree / tend to agree / neither / tend to disagree / strongly disagree)
8. Rank the cigarette brands in order of quality (B&H, Silk Cut, Marlboro)
Tables
• Most straightforward pictorial representation
• Good method of storing information
• Summarises &/or shows patterns in data
• Easily made using word-processing or spreadsheets
• Confusing if constructed poorly
• Confusing if they try to show too much
Table Considerations
• Should be clear & appropriate
• Should be chosen with a purpose in mind
– Not just for the sake of it
• Must include a title & a source of data
• Must be referenced & discussed in the text
– Don’t assume that everyone will understand them
Table Clarity
• Use a common system of data presentation
• Use percentages rather than raw scores for clarity &
comparative capabilities
The above points are particularly relevant if the table
includes more than one variable calculated using
different units of measurement (AKA ‘cross-tabulation’)
Data from a survey of pax at LGW, LHR & MAN (CAA, 2000):
- 34,650 Business Pax: A/B=18,607; C1=14,345; C2=1,386; D/E=312
- 130,350 Leisure Pax: A/B=43,407; C1=52,400; C2=21,508; D/E=13,035
Table 1. Passengers at LGW, LHR & MAN, 1999
Socio-economic status   Business passengers   Leisure passengers   Total
A/B                     18,607                43,407               62,014
C1                      14,345                52,400               66,745
C2                      1,386                 21,508               22,894
D/E                     312                   13,035               13,347
Total                   34,650                130,350              165,000
Use percentages instead?
Data from a survey of pax at LGW, LHR & MAN (CAA, 2000):
- 34,650 Business Pax: A/B=18,607; C1=14,345; C2=1,386; D/E=312
- 130,350 Leisure Pax: A/B=43,407; C1=52,400; C2=21,508; D/E=13,035
Table 1. Passengers at LGW, LHR & MAN, 1999 (%)
Socio-economic status   Business passengers   Leisure passengers   Total
A/B                     54                    33                   38
C1                      41                    40                   40
C2                      4                     17                   14
D/E                     1                     10                   8
Total                   21                    79                   100
Easier to interpret?
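The percentage version of Table 1 can be reproduced from the raw counts. The Python sketch below is an illustration only and is not part of the original lecture; the dictionary layout and the function name column_percentages are assumptions made for this example. It expresses each socio-economic group as a share of its own column total, which appears to be how the business & leisure columns were calculated.

```python
# Counts copied from Table 1 (CAA, 2000). The dictionary layout is an assumption.
counts = {
    "A/B": {"business": 18607, "leisure": 43407},
    "C1": {"business": 14345, "leisure": 52400},
    "C2": {"business": 1386, "leisure": 21508},
    "D/E": {"business": 312, "leisure": 13035},
}


def column_percentages(table, column):
    """Return each row's share of the column total as a rounded whole percentage."""
    total = sum(row[column] for row in table.values())
    return {status: round(100 * row[column] / total) for status, row in table.items()}


business_pct = column_percentages(counts, "business")
leisure_pct = column_percentages(counts, "leisure")

for status in counts:
    print(f"{status:<4} business {business_pct[status]:>3}%  leisure {leisure_pct[status]:>3}%")
```

Percentages put the two passenger groups on a common scale, so a group of 34,650 can be compared directly with a group of 130,350.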
Frequency Distributions
• Standard frequency distribution
• Univariate frequency distribution
• Grouped frequency distribution
• Relative & cumulative frequency distribution
Standard Frequency
• Standard frequency distribution
– Presents data
• e.g. “How many return flights did you take last year?”
• Answers from 50 pax as a standard frequency distribution:
Number of return flights taken last year:
7 3 10 3 2 4 3 3 6 3 5 2 3 4 2 5 4 3 6 8 4
12 1 3 4 15 5 1 3 1 4 2 3 5 2 3 8 3 4 4 6
3 5 2 4 2 3 2 5 1
Univariate Frequency
• Univariate frequency distribution
– Lists data more clearly & with their frequency
– Important for large sample sizes
Flights   Frequency      Flights   Frequency
1         4              7         1
2         8              8         2
3         14             10        1
4         9              12        1
5         6              15        1
6         3
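The univariate frequency table can be built by counting how often each answer occurs. The sketch below is a minimal Python illustration, not part of the original lecture. It assumes that the "42" in the raw list is really two answers, 4 and 2; that reading gives 50 answers and matches the frequencies in the table.

```python
from collections import Counter

# Answers from the 50 passengers on the "Standard Frequency" slide
# (the "42" in the raw list is read as two answers, 4 and 2: an assumption).
flights = [
    7, 3, 10, 3, 2, 4, 3, 3, 6, 3, 5, 2, 3, 4, 2, 5, 4, 3, 6, 8, 4,
    12, 1, 3, 4, 15, 5, 1, 3, 1, 4, 2, 3, 5, 2, 3, 8, 3, 4, 4, 6,
    3, 5, 2, 4, 2, 3, 2, 5, 1,
]

# Count how many passengers gave each answer.
frequency = Counter(flights)

print("Flights  Frequency")
for value in sorted(frequency):
    print(f"{value:>7}  {frequency[value]:>9}")
```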
Grouped Frequency
• Grouped frequency distribution
– Groups all data according to
categories
– Further improves clarity
Flights   Grouped frequency
1-3       26
4-6       18
7-9       3
10-12     2
13+       1
Total     50
Relative & Cumulative Frequency
• Relative & cumulative frequency distributions
– Relative: each category as a % of the total
– Cumulative: add each relative frequency to the preceding cumulative total
Flights   Grouped   Relative (%)   Cumulative (%)
1-3       26        52             52
4-6       18        36             88
7-9       3         6              94
10-12     2         4              98
13+       1         2              100
Total     50        100
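The relative & cumulative columns follow directly from the grouped frequencies. The sketch below is a minimal Python illustration, not part of the original lecture; the variable names are assumptions made for this example. Each relative value is the band's count as a % of the total, and each cumulative value is the running sum of the relative values.

```python
# Grouped frequencies copied from the slide (50 passengers in total).
grouped = {"1-3": 26, "4-6": 18, "7-9": 3, "10-12": 2, "13+": 1}

total = sum(grouped.values())
rows = []
running = 0.0
for band, count in grouped.items():
    relative = 100 * count / total  # this band as a % of all answers
    running += relative             # running total of the relative percentages
    rows.append((band, count, relative, running))

print(f"{'Flights':<8}{'Grouped':>8}{'Relative (%)':>14}{'Cumulative (%)':>16}")
for band, count, relative, cumulative in rows:
    print(f"{band:<8}{count:>8}{relative:>14.0f}{cumulative:>16.0f}")
```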
Too many numbers…?
Charts, Plots, Graphs & Pie-charts
• Simple bar charts
• Compound bar charts
• Histograms
• Scatter or dot plots
• Line graphs
• Pie-charts
Charts, Plots, Graphs & Pie-charts:
Pros & Cons
• Easily made using word-processing or
spreadsheets
• Ease of creation can lead to over-elaborate
charts at the expense of clarity
Charts, Plots, Graphs & Pie-charts:
Considerations
• Should be clear & appropriate
• Should be chosen with a purpose in mind
– Not just for the sake of it
• Typically include
– Title
– Labelled axes
– Key that explains the different segments
– Source of data
• Must be referenced & discussed in the text
– Do not assume that everyone will understand them
• Data type will restrict which method is chosen
Simple Bar Charts
• Simple bar charts
– Horizontal or vertical charts of separate bars that represent
size of data
Student results for SCM300 in 2007
Grade      0-39%   40-49%   50-59%   60-69%   70+%
Students   5       9        15       7        3
Simple Bar Charts
[Figure 1. Student results for SCM300 in 2007. Bar chart: x-axis Grade (%) in bands 0-39, 40-49, 50-59, 60-69, 70+; y-axis Number of students, 0 to 16]
Compound Bar Charts
• Compound bar charts
– Show proportions/relative size of groups
– Bars will always have same height when % are used but not
when figures are used
– For 3+ components, pie-charts may be better
Student results for SCM300 in 2007
         0-39%   40-49%   50-59%   60-69%   70%+
Male     4       7        7        2        0
Female   1       2        8        5        3
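The point that bars always have the same height when percentages are used can be checked with the table above. The sketch below is a minimal Python illustration, not part of the original lecture; the function name shares is an assumption made for this example. It converts the male & female counts in each grade band into percentage shares, which are the segments of one compound bar.

```python
bands = ["0-39", "40-49", "50-59", "60-69", "70+"]
male = [4, 7, 7, 2, 0]
female = [1, 2, 8, 5, 3]


def shares(male_counts, female_counts):
    """Return (male %, female %) per band: the two segments of one compound bar."""
    result = []
    for m, f in zip(male_counts, female_counts):
        band_total = m + f  # non-zero for every band in this data set
        result.append((100 * m / band_total, 100 * f / band_total))
    return result


for band, (m_pct, f_pct) in zip(bands, shares(male, female)):
    print(f"{band:<6} male {m_pct:5.1f}%  female {f_pct:5.1f}%  bar height {m_pct + f_pct:.0f}%")
```

With raw counts the bar heights would instead be 5, 9, 15, 7 & 3 students, so the bars would differ in height.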
Compound Bar Charts
[Figure 1. Student results for SCM300 in 2007. Compound bar chart: x-axis Grade (%) in bands 0-39, 40-49, 50-59, 60-69, 70+; y-axis labelled Number of students, scaled 0% to 100%; key: Female, Male]
Histograms
• Histograms
– Similar to bar charts but a better indication of variation &
distribution
– Bars are connected instead of separate
Histograms
[Figure 1. Student results for SCM300 in 2007. Histogram: x-axis Grade (%) in bands 0-39, 40-49, 50-59, 60-69, 70+; y-axis Number of students, 0 to 16]
This figure (not reproduced in this transcript) indicates repeat visits to Norway & tourists’ interest in returning, but is it easy to understand…?
Scatter or Dot Plots
• Scatter or dot plots
– Illustrate the exact distribution of data
– Can be used to illustrate continuous data
• BUT a line graph may be better
– Effective for 2 related variables
Scatter or Dot Plots
[Figure 1. Passengers & Aircraft Movements at HiMolde Airport. Scatter plot of Passengers (scale 0 to 600 000) against Aircraft movements]
Line Graphs
• Line graphs
– Show trends over time
• e.g. patterns, peaks & troughs, rates of incline/decline
– Can show more than 1 variable at a time
• This can indicate possible relationships
• e.g. see next slide
Pie Charts
• Pie-charts
– Segments represent cases in each category
– Best for 3-6 categories (no more, no less)
– Labelling & shading sometimes difficult
– Combining categories may improve clarity but loses detail
Pie Charts
[Pie chart with four segments: Car park 21%, Other 26%, Catering 11%, Retail 42%]
Pie Charts
Too many pies……..?
Charts, Plots, Graphs & Pie-charts Summary
Variable type    Bar                Pie                Line
Nominal          Yes                Yes                No
Ordinal          Yes                Yes                No
Interval/ratio   Yes (if grouped)   Yes (if grouped)   Yes
Statistical Representation
• Measures of central tendency
• Measures of dispersion
• Normal distribution & skew
Measures of Central Tendency
• Raw data can be confusing & meaningless
• Measures of central tendency
– AKA measures of location or average
– Summarise the data in a single number
• There are 3 measures; which one to use depends on the purpose & the type of data
– See next slide
Measures of Central Tendency
• Mode (data: any)
– Definition: most commonly occurring value in a data set
– Can be misleading if it is an extreme value, & there may be more than one mode (e.g. a bimodal distribution)
• Median (data: ordinal or interval/ratio)
– Definition: central value, representing the central point of a data set
– With an even number of values, take the two middle values & find the mid-point between them
– Extremes don’t distort it, but the data has to be ordered from lowest to highest to calculate it
• Mean (data: interval/ratio)
– Definition: average value in a data set
– Advantage: it uses all values in the data set
– Disadvantage: it can only be used with interval/ratio data &, when there are few values in the data set, it can be distorted by extremes
Example
Age of students
19 20 36 19 19 24 37 20 21 20
19 19 19 19 20 25 20 26 20 19
19 19 19 19 20 19 24 25 20 20
26 25 19 20 19 18 19 28 22 19
Mean   22 (21.5 before rounding)
Median 20
Mode   19
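The three measures for the student ages can be checked with Python's standard statistics module. This is a minimal sketch for illustration, not part of the original lecture; the variable names are assumptions made for this example. The exact mean is 21.5, which the slide rounds to 22.

```python
import statistics

# Ages of the 40 students from the example slide.
ages = [
    19, 20, 36, 19, 19, 24, 37, 20, 21, 20,
    19, 19, 19, 19, 20, 25, 20, 26, 20, 19,
    19, 19, 19, 19, 20, 19, 24, 25, 20, 20,
    26, 25, 19, 20, 19, 18, 19, 28, 22, 19,
]

mean_age = statistics.mean(ages)      # uses every value, so the few older students pull it up
median_age = statistics.median(ages)  # mid-point of the two middle values (40 values is an even count)
mode_age = statistics.mode(ages)      # most commonly occurring value

print(f"Mean {mean_age}, Median {median_age}, Mode {mode_age}")
```

The mean sits above the median & the mode because a few students in their late twenties & thirties pull it upwards.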
Measures of Dispersion
• Measures of central tendency don’t show:
– How closely related values are (i.e. clustered)
– How representative they are of the data set
– The range of values
– The degree of distortion by extreme values
Salaries of office staff at HiMolde Airways:
• £11k, £15k, £15k, £18k, £25k, £30k, £32k, £38k
Salaries of office staff at HiMolde Airport:
• £20k, £21k, £22k, £23k, £23k, £24k, £25k, £26k
Mean salary at HiMolde Airways = £23k (£184k/8)
Mean salary at HiMolde Airport = £23k (£184k/8)
Measures of Dispersion
• Range
• Inter-quartile range
• Standard deviation
Range
• Simplest & crudest measure of dispersion
• Indicates spread of data
– Places values in ascending order
– Then subtracts smallest from the largest value
• Extreme values affect (determine) the outcome
• Range gives a greater insight into a data set
– But gives no indication of the clustering of individual values
Range
Salaries of office staff at HiMolde Airways:
• £11k, £15k, £15k, £18k, £25k, £30k, £32k, £38k
Salaries of office staff at HiMolde Airport:
• £20k, £21k, £22k, £23k, £23k, £24k, £25k, £26k
Range of salaries at HiMolde Airways = £38k - £11k = £27k
Range of salaries at HiMolde Airport = £26k - £20k = £6k
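The two salary lists have the same mean but very different ranges. The sketch below is a minimal Python illustration, not part of the original lecture; the function name value_range is an assumption made for this example.

```python
airways = [11, 15, 15, 18, 25, 30, 32, 38]  # salaries in £k
airport = [20, 21, 22, 23, 23, 24, 25, 26]  # salaries in £k


def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


for name, salaries in (("HiMolde Airways", airways), ("HiMolde Airport", airport)):
    mean = sum(salaries) / len(salaries)
    print(f"{name}: mean £{mean:.0f}k, range £{value_range(salaries)}k")
```

Only the smallest & largest values enter the calculation, which is why a single extreme value determines the result.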
Inter-Quartile Range
• Most appropriate when using ordinal data
• Divides values into 4 equal parts (quartiles)
– Is an extension of the idea of the median
• Represents the middle 50% of the values that fall
between the 1st & 3rd quartiles
• Not affected by extremes
– BUT doesn’t utilise all values
• It discards 50% of the values & therefore provides a limited
picture of the degree of clustering
Inter-Quartile Range
[Diagram: cases ordered from Min. value to Max. value & split into four groups of 25% of cases by Q1, Q2 (the median value) & Q3; the inter-quartile range is the span from Q1 to Q3]
Inter-Quartile Range
Salaries of office staff at HiMolde Airways:
• £11k, £15k, £15k, £18k, £25k, £30k, £32k, £38k
Salaries of office staff at HiMolde Airport:
• £20k, £21k, £22k, £23k, £23k, £24k, £25k, £26k
IQ Range of salaries at HiMolde Airways = £15k-£31k
IQ Range of salaries at HiMolde Airport = £22k-£24k
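Quartiles can be calculated in several slightly different ways, & the lecture does not say which one was used. The sketch below is a minimal Python illustration, not part of the original lecture. It assumes the "median of each half" convention, which extends the idea of the median; the function name quartiles is hypothetical. With this convention Airways gives Q1 = 15 & Q3 = 31, matching the slide, while Airport gives 21.5 & 24.5, which the slide appears to show rounded.

```python
import statistics

airways = [11, 15, 15, 18, 25, 30, 32, 38]  # salaries in £k
airport = [20, 21, 22, 23, 23, 24, 25, 26]  # salaries in £k


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves of the data."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]   # bottom 50% of cases
    upper = ordered[-half:]  # top 50% of cases (the middle value is skipped if the count is odd)
    return statistics.median(lower), statistics.median(upper)


for name, salaries in (("HiMolde Airways", airways), ("HiMolde Airport", airport)):
    q1, q3 = quartiles(salaries)
    print(f"{name}: Q1 £{q1}k, Q3 £{q3}k, inter-quartile range £{q3 - q1}k")
```

Other conventions (for example those used by spreadsheet or statistics packages) interpolate between values & can give slightly different quartiles for small data sets.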
Standard Deviation
• Widely used in quantitative research
• Most useful measure of dispersion
• Utilises all data in the distribution
• Compares each value in the distribution with the mean
– It examines the variance of the data around the mean
– Therefore saying something about how representative the
mean is for the data set
Standard Deviation
• Smaller SD = less variation
– i.e. data is more concentrated around the mean
• Greater SD = greater variation
• However
– Size of SD is in part a reflection of the size of the mean
• So a large SD may simply be the product of a large mean
• Because of this, both figures should be quoted
• Extreme numbers can distort the outcome
– BUT have less of an impact than when using the range
Standard Deviation
Salaries of office staff at HiMolde Airways:
• £11k, £15k, £15k, £18k, £25k, £30k, £32k, £38k
Salaries of office staff at HiMolde Airport:
• £20k, £21k, £22k, £23k, £23k, £24k, £25k, £26k
Standard deviation of salaries at HiMolde Airways = £10k (rounded)
Standard deviation of salaries at HiMolde Airport = £2k
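The standard deviations quoted for the two salary lists can be checked with Python's standard statistics module. This is a minimal sketch for illustration, not part of the original lecture. It assumes the sample formula (dividing by n - 1), which gives about 9.7 for Airways & exactly 2 for Airport; the lecture does not state which formula was used, & the population formula would give about 9.1 for Airways.

```python
import statistics

airways = [11, 15, 15, 18, 25, 30, 32, 38]  # salaries in £k
airport = [20, 21, 22, 23, 23, 24, 25, 26]  # salaries in £k

for name, salaries in (("HiMolde Airways", airways), ("HiMolde Airport", airport)):
    mean = statistics.mean(salaries)
    sd = statistics.stdev(salaries)  # sample SD (divides by n - 1): an assumption
    print(f"{name}: mean £{mean}k, standard deviation £{sd:.1f}k")
```

Quoting the mean next to the SD, as the loop does, follows the advice that both figures should be reported together.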
Central Tendency & Dispersion Summary
                   Nominal       Ordinal                Interval/Ratio
Example            Male/Female   1st/2nd/3rd            Temperature
Central tendency   Mode          Median                 Mean
Dispersion         N/a           Inter-quartile range   Standard deviation
Normal Distribution & Skew
• Normal distribution
• Skew
Normal Distribution
• Normal if
– Mean, median & mode coincide
– Distribution is the same either side of the central values
• e.g. see next slide
• Often referred to as a bell-shaped curve
– 50% of the cases can be found either side of the central value
– Values tend to be clustered around the mean
• i.e. very few extreme values
Normal Distribution
[Figure: bell-shaped curve with the mean, median & mode at the centre & 50% of cases on either side]
Normal Distribution
• A normal distribution has certain properties
– 68% of cases fall within 1 SD either side of the mean
– 95% within 2 SDs
– 99.7% within 3 SDs
• e.g. see next slide
• Other % values can be calculated using statistical tables
– Found in some statistics books
• Normal distribution is important for sampling &
hypothesis testing
– Many statistical tests assume data will be normally distributed
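The percentages above can be calculated rather than looked up in statistical tables. The sketch below is a minimal Python illustration, not part of the original lecture; the function name share_within is an assumption made for this example. It uses the error function from the standard math module, since the share of a normal distribution within k SDs of the mean equals erf(k / √2).

```python
import math


def share_within(k):
    """Share of a normal distribution lying within k SDs either side of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} SD: {100 * share_within(k):.2f}%")
```

The results are approximately 68.27%, 95.45% & 99.73%, so the figures quoted in the lecture are rounded versions of these.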
[Figure: normal curve marked from -3sd to +3sd around the mean; 68.26% of cases within ±1sd, 95.44% within ±2sd & 99.7% within ±3sd]
• Normal distribution is an ‘ideal’ type of distribution
• However, it is unlikely that data sets will be normal
• When they are not normal, they are ‘skewed’
Skew
• +ve skew
– Data set has a few very large values
• i.e. most values cluster to the left
– The mean will be larger than the median
• -ve skew
– Data set has a few very small values
• i.e. most values cluster to the right
– The mean will be smaller than the median
Positive Skew
[Figure: positively skewed curve; the median lies to the left of the mean]
Negative Skew
[Figure: negatively skewed curve; the mean lies to the left of the median]
Skew
• Skew is typically found where
– Sample sizes are small
– Bias has been introduced in the sampling process
• Skewed distributions can be determined
– Visually using a histogram
– Statistically by calculating a co-efficient of skewness (sk)
Co-efficient of Skewness
\[ sk = \frac{3\,(\text{Mean} - \text{Median})}{\text{Standard Deviation}} \]
• Indicates the direction of the skew (+ve or –ve)
• Greater co-efficient = greater skew
• Normal distribution will have a co-efficient of 0
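The co-efficient can be tried out on the two salary lists used earlier in the lecture. The sketch below is a minimal Python illustration, not part of the original lecture; the function name skewness_coefficient & the use of the sample standard deviation are assumptions made for this example.

```python
import statistics

airways = [11, 15, 15, 18, 25, 30, 32, 38]  # salaries in £k
airport = [20, 21, 22, 23, 23, 24, 25, 26]  # salaries in £k


def skewness_coefficient(values):
    """sk = 3 * (mean - median) / standard deviation."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)  # sample SD (divides by n - 1): an assumption
    return 3 * (mean - median) / sd


for name, salaries in (("HiMolde Airways", airways), ("HiMolde Airport", airport)):
    print(f"{name}: sk = {skewness_coefficient(salaries):.2f}")
```

For Airways the mean (23) is above the median (21.5), so the co-efficient is positive; for Airport the mean & median are both 23, so it is 0.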
Summary
• Pictorial representation of quantitative data
– 3 (4) levels of measurement
• Nominal
• Ordinal
• Interval / ratio
– Range of pictorial representation available
• Choice is determined by the level of measurement
Summary
• Statistical representation of quantitative data
– 3 measures of central tendency
• Mean, median, mode
• Choice is determined by the level of measurement
– 3 measures of dispersion
• Range, inter-quartile range, SD
• Choice is determined by the level of measurement
– Normal distribution & skew represent the distribution of
responses
Recommended Reading
• Chapters 1-3 in Gaur, A.S. and Gaur, S.S. (2006).
Statistical Methods for Practice and Research: A
Guide to Data Analysis Using SPSS. New Delhi:
Response Books.
“Thank you for your attention”
Questions.…….