Quantitative Methods – Week 2:
Descriptive Statistics
Roman Studer
Nuffield College
[email protected]
Frequency Distributions
• Frequency distributions provide a summary presentation of the data
• Very good method to get a first overview of the data
→ Discrete variables
Measure the frequency of occurrence of each value
Example: Poor Law dataset, number of workhouses per county
[Figure: Bar chart of number of workhouses in selected counties, 1831. Y-axis: number of workhouses (0 to 35); x-axis: Essex, Norfolk, Suffolk, Cambs, Beds, Sussex, Kent.]
Frequency Distributions (II)
→ Continuous variables
• Choose appropriate class intervals and note the frequency of occurrence
for each class
• Number of class intervals normally between 5 and 20
Example: Per capita relief payments in the parishes of Kent, 1831
(1)               (2)         (3)                  (4)          (5)
Class intervals   Frequency   Relative frequency   Cumulative   Cumulative relative
(shillings)       (f)         (%) (f/n x 100)      frequency    frequency (%)
≥ 5 but < 10       1            4.2                 1             4.2
≥ 10 but < 15      7           29.2                 8            33.4
≥ 15 but < 20      4           16.6                12            50.0
≥ 20 but < 25      6           25.0                18            75.0
≥ 25 but < 30      4           16.6                22            91.6
≥ 30 but < 35      1            4.2                23            95.8
≥ 35 but < 40      1            4.2                24           100.0
Total             24          100.0
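Columns (3) to (5) follow mechanically from the frequencies in column (2). A minimal Python sketch of that arithmetic, using only the class frequencies from the Kent table (the variable names are illustrative and not part of the lecture):

```python
from itertools import accumulate

# Class intervals (shillings) and frequencies f, copied from columns (1) and (2).
classes = ["5-10", "10-15", "15-20", "20-25", "25-30", "30-35", "35-40"]
freq = [1, 7, 4, 6, 4, 1, 1]

n = sum(freq)                                # number of parishes
rel_freq = [100 * f / n for f in freq]       # column (3): f / n x 100
cum_freq = list(accumulate(freq))            # column (4): running total of f
cum_rel = [100 * c / n for c in cum_freq]    # column (5): cumulative share in %

for label, f, r, c, cr in zip(classes, freq, rel_freq, cum_freq, cum_rel):
    print(f"{label:>6} {f:3d} {r:6.1f} {c:3d} {cr:6.1f}")
```

Computed this way, a few percentages differ from the table in the last digit (for example 4/24 is 16.7% when rounded, against 16.6 in the table). This is presumably a rounding choice that makes the printed column add up to 100.0.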
Frequency Distributions (III)
[Figure: Frequency of per capita relief payments in Kent, 1831. Y-axis: number of parishes (0 to 7); x-axis: relief payments (shillings), class intervals from ≥ 5 but < 10 to ≥ 35 but < 40.]
Frequency Distributions (IV)
[Figure: Relative frequency of per capita relief payments, shown as a histogram and as a frequency curve. Y-axis: percentages (0 to 35); x-axis: relief payments (shillings), class intervals from ≥ 5 but < 10 to ≥ 35 but < 40.]
Frequency Distributions (V)
[Figure: Cumulative frequency of relief payments. Y-axis: number of parishes (0 to 25); x-axis: relief payments (shillings), class intervals numbered 1 to 7.]
Frequency Distributions (VI)
[Figure: Cumulative relative frequency of per capita relief. Y-axis: cumulative percentages (0 to 100); x-axis: relief payments (10 to 40).]
Descriptive Statistics
→ Making frequency tables and plotting them as histograms and
frequency curves is a very helpful way to get a first overview
→ But it is not a precise way to summarize the information on the
variables in a dataset. For that, we normally determine three features
of a variable:
1) Which are the most central (i.e. the most common or typical)
values?
2) How are the values spread (dispersed) around those central
values?
3) What is the shape of the distribution?
→ Each of these features can be described by one or more simple
statistics. Together they form the basic elements of descriptive statistics, as
they provide a precise and comprehensive summary of the variables in a
data set
Measures of Central Tendency
• The arithmetic mean
→ Add up all the values and divide this total by the number of
observations
$$E(y) = \bar{y} = \frac{y_1 + y_2 + y_3 + \dots + y_N}{N} = \frac{\sum_{i=1}^{N} y_i}{N}$$
• The median
→ The value that has one-half of the observations above it and one-half below
it, when the series is set out in an ascending or descending array
→ Uneven number of observations: Position = (number of observations + 1) / 2
→ Even number of observations: Average of the two middle observations
• The mode
→ Value that occurs most frequently
• Percentiles, deciles, and quartiles
→ Instead of dividing the observations into two equal halves (median), we can
divide them into four equal quarters (quartiles), or into 10 portions (deciles)
or 100 portions (percentiles) of equal size
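As a rough illustration of how quartiles extend the idea of the median, the sketch below uses Python's standard `statistics` module on the seven values x1 to x7 from this lecture's numeric example. Software packages interpolate between observations in slightly different ways, so the first and third quartile can differ a little from one package to another; the middle cut point always coincides with the median.

```python
import statistics

# Values x1..x7 from the lecture's numeric example.
values = [2, 4, 5, 5, 7, 9, 10]

# n=4 returns the three cut points that split the data into four quarters.
q1, q2, q3 = statistics.quantiles(values, n=4)

print(q1, q2, q3)
print(statistics.median(values))  # same value as the second quartile q2
```

Calling `statistics.quantiles(values, n=10)` or `n=100` gives the decile or percentile cut points in the same way.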
Measures of Central Tendency (II)
• Numeric Example
Case   xi   Value
1      x1    2
2      x2    4
3      x3    5
4      x4    5
5      x5    7
6      x6    9
7      x7   10
8      x8   30
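→ What is the mean, the median, the mode of x1 to x7?
Mean = 6, median = 5, mode = 5
→ What is the effect of adding value x8?
Mean = 9, median = 6, mode = 5
→ What is the absolute and what is the relative frequency of 5 (with x8 included)?
Absolute frequency = 2, relative frequency = 25%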
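The answers to the numeric example can be checked with a few lines of Python. This is only a sketch using the standard `statistics` module; the variable names are illustrative. It reproduces 6, 5, 5 for cases 1 to 7 and 9, 6, 5 once x8 = 30 is added, which shows how a single large value pulls up the mean much more than the median, while the mode does not move.

```python
import statistics

values = [2, 4, 5, 5, 7, 9, 10]   # cases x1..x7

print(statistics.mean(values), statistics.median(values), statistics.mode(values))

with_x8 = values + [30]           # add the large value x8 = 30

print(statistics.mean(with_x8), statistics.median(with_x8), statistics.mode(with_x8))

# Absolute and relative frequency of the value 5 among all eight cases.
abs_freq = with_x8.count(5)
rel_freq = 100 * abs_freq / len(with_x8)
print(abs_freq, rel_freq)
```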
Measures of Dispersion
• Two variables with equal arithmetic mean, but different
spread
[Figure: Frequency curves f(x) and f(y) plotted over x, y, both centred on the mean m.]
• Variable x is more densely distributed around the mean m
than variable y
Measures of Dispersion (II)
• The variance
→ The variance is equal to the arithmetic mean of the squared deviations from
the mean
$$\mathrm{Var}(X) = \sigma_x^2 = \frac{1}{N} \sum_{i=1}^{N} (X_i - \bar{X})^2$$
→ The variance is widely used in statistical work; however, the disadvantage is
that it is expressed in square units…
• The standard deviation
→ The standard deviation is the square root of the variance
$$\sigma_x = \sqrt{\mathrm{Var}(X)}$$
→ Interpretation: Average or typical deviation of variable x from the arithmetic
mean
→ The standard deviation is the most widely used measure of dispersion;
however, as it is calculated in the same units as the series, these absolute
standard deviations are unsuitable for comparisons with series that have
different underlying units…
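To make the two definitions concrete, here is a small Python sketch that computes the variance and standard deviation of x1 to x7 from the numeric example, dividing by N as in the formula on this slide. The variable names are illustrative.

```python
import math
import statistics

values = [2, 4, 5, 5, 7, 9, 10]   # x1..x7 from the numeric example
n = len(values)
mean = sum(values) / n

# Variance: arithmetic mean of the squared deviations from the mean (divide by N).
variance = sum((x - mean) ** 2 for x in values) / n
std_dev = math.sqrt(variance)     # back in the units of the original series

print(variance, std_dev)

# The standard library's "population" versions use the same divide-by-N definition.
print(statistics.pvariance(values), statistics.pstdev(values))
```

Note that `statistics.variance` and `statistics.stdev` divide by N - 1 instead (the sample versions), so they return slightly larger values for the same data.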
Measures of Dispersion (III)
• The coefficient of variation (CV)
→ The coefficient of variation is a measure of relative rather than absolute
variation
→ It is obtained by dividing the standard deviation by the mean
$$CV = \frac{\sigma_x}{m_x}$$
→ Interpretation: Average deviation from the mean, expressed as a share of the
mean (multiply by 100 for a percentage)
• The range
→ This is a very crude measure of dispersion defined as the difference between
the maximum and the minimum value in the series
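The point of the coefficient of variation is that it does not depend on the units of the series. A minimal Python sketch, again on x1 to x7 from the numeric example; the conversion to pence (12 pence to the shilling) is only an illustrative assumption about what the values might measure.

```python
import statistics

values = [2, 4, 5, 5, 7, 9, 10]          # x1..x7 from the numeric example

mean = statistics.mean(values)
std_dev = statistics.pstdev(values)      # divide-by-N standard deviation

cv = std_dev / mean                      # coefficient of variation (unit-free)
value_range = max(values) - min(values)  # maximum minus minimum

print(f"CV = {cv:.3f} ({100 * cv:.1f}% of the mean)")
print(f"range = {value_range}")

# Rescaling the series (e.g. shillings to pence, x12) leaves the CV unchanged,
# while the standard deviation and the range are both multiplied by 12.
rescaled = [12 * x for x in values]
cv_rescaled = statistics.pstdev(rescaled) / statistics.mean(rescaled)
print(f"CV after rescaling = {cv_rescaled:.3f}")
```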
The Shape of Distributions
• Normal distribution
→ The normal distribution is a symmetrical, smooth, bell-shaped distribution
that is fully described by the arithmetic mean and standard deviation
→ Mode, median and mean are equal
→ Skewness and kurtosis of the normal distribution are equal to 0
and 3, respectively
→ But again: Mean and standard deviation depend on the units of the series
and are thus difficult to compare…
The Shape of Distributions (II)
• Standard normal distribution
→ Every normal distribution can be transformed into a standard normal
distribution using
$$Z = \frac{Y - m_y}{\sigma_y}$$
→ By definition, the standard normal distribution has two further basic
features that a general normal distribution does not have:
• mean m = 0
• standard deviation σ = 1
→ These properties make the distribution ideal for comparison
→ For this reason the standard normal distribution has a key role in inductive
statistics, as it can be used to make inferences on probabilities
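The transformation itself is simple arithmetic and works for any series. The Python sketch below standardises x1 to x7 from the numeric example and confirms that the resulting Z values have mean 0 and standard deviation 1. These seven values are not normally distributed, so this only illustrates the rescaling: standardising changes location and scale, not the shape of the distribution.

```python
import statistics

values = [2, 4, 5, 5, 7, 9, 10]   # x1..x7 from the numeric example

m = statistics.mean(values)
sigma = statistics.pstdev(values)

# Z = (Y - m_y) / sigma_y for every observation.
z = [(y - m) / sigma for y in values]

print([round(v, 2) for v in z])
print(statistics.mean(z), statistics.pstdev(z))
```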
The Shape of Distributions (III)
• Skewed distributions
→ However, values need not be symmetrically distributed around the central
point, i.e. distributions can be skewed
→ In these cases, mean and standard deviation are insufficient to describe the
distribution
→ Socio-economic data in particular (wages, income, wealth and related
variables) are frequently skewed
[Figure: Frequency curve over x with the mode, median and mean marked.]
→ This distribution is skewed to the right (positively skewed)
The Shape of Distributions (IV)
• Consequences of skewed distributions
→ Skewed variables can lead to undesirable effects in regressions
• Non-normally distributed residuals (misspecification)
• Heteroscedasticity; test statistics and confidence intervals are
biased
→ (Roughly) normally distributed variables help to avoid these problems.
Take a look at the variable
• If the variable is not significantly skewed, continue
• If the variable is skewed, transform the variable: “Ladder of
Powers”. For this reason you often find the logarithm of
income, the square root of the mortality rate, etc.
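A small Python sketch of why such transformations help, using the eight values of the numeric example (x8 = 30 gives the series a long right tail). The `skewness` helper is an illustrative moment-based measure written for this sketch, not a function from the lecture or from a library. Taking logarithms pulls in the right tail, so the skewness of the logged series is clearly smaller.

```python
import math

def skewness(data):
    """Third central moment divided by the cubed standard deviation."""
    n = len(data)
    m = sum(data) / n
    m2 = sum((x - m) ** 2 for x in data) / n   # variance (divide by N)
    m3 = sum((x - m) ** 3 for x in data) / n   # third central moment
    return m3 / m2 ** 1.5

values = [2, 4, 5, 5, 7, 9, 10, 30]       # x1..x8; x8 stretches the right tail
logged = [math.log(x) for x in values]    # log transformation of every value

skew_raw = skewness(values)
skew_log = skewness(logged)
print(round(skew_raw, 2), round(skew_log, 2))
```

The logarithm is only defined for positive values; for series with zeros or negative values another transformation would be needed.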
The Shape of Distributions (V)
• Kurtosis
→ Furthermore, two symmetrically distributed variables with equal mean and
standard deviation can still have different distributions, i.e. they can have a
different kurtosis
[Figure: Frequency curves f(x) and f(y) over x, y, both centred on the mean m.]
→ Here the variable y has a higher kurtosis than variable x
The Shape of Distributions (VI)
• Measures for skewness and kurtosis
→ Measures for skewness and kurtosis therefore tell us more about a
distribution
$$\text{Skewness: } a_3 = \frac{E(Y - m)^3}{\sigma^3}$$
$$\text{Kurtosis: } a_4 = \frac{E(Y - m)^4}{\sigma^4}$$
→ Skewness and kurtosis of a normally distributed variable are zero and three,
respectively
→ Skewness:
• a3 > 0 distribution skewed to the right / positively skewed
• a3 < 0 distribution skewed to the left / negatively skewed
→ Kurtosis:
• a4 > 3 thicker tails & higher peak than a normal distribution
• a4 < 3 thinner tails & lower peak than a normal distribution
→ For a meaningful and comparable measure of a4, the distribution should be
symmetrical (hence again the need to have a normal distribution)
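The two measures can be computed directly from their definitions as ratios of central moments. The Python sketch below is one way to do this, with expectations replaced by divide-by-N averages. The function names and the small symmetric series are illustrative assumptions; the second series is x1 to x8 from the numeric example.

```python
def central_moment(data, k):
    """k-th central moment: the mean of (y - m)^k, dividing by N."""
    n = len(data)
    m = sum(data) / n
    return sum((y - m) ** k for y in data) / n

def skewness(data):
    """a3 = E(Y - m)^3 / sigma^3."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def kurtosis(data):
    """a4 = E(Y - m)^4 / sigma^4 (equal to 3 for a normal distribution)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2

symmetric = [1, 2, 3, 4, 5]                  # hypothetical symmetric series
right_skewed = [2, 4, 5, 5, 7, 9, 10, 30]    # x1..x8 from the numeric example

print(skewness(symmetric), kurtosis(symmetric))
print(skewness(right_skewed), kurtosis(right_skewed))
```

The symmetric series has a3 = 0, while the series with the large value x8 has a3 > 0, i.e. it is skewed to the right. Statistical packages may apply small-sample corrections, so their reported values can differ slightly from these raw moment ratios.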
Computer Class:
• Getting started with STATA
• Descriptive statistics
STATA Basics
• Stata is a statistical package for managing, analysing, and
graphing data
• It can be used in two different ways
1) As a point-and-click application
→ Easy interface for those new to Stata, and for those who don’t
use it very often
→ … for us (at least at the beginning)!
2) As a command-driven package
→ Very fast once you are used to the commands
→ Good for communicating more complex ideas
→ One of the main advantages of Stata over SPSS
• A helpful guide
→ Hamilton, Lawrence C., Statistics with STATA. Regularly updated editions.
Getting Started Together
Various data formats
→ Data come in various formats and file extensions, most often
• .xls : Excel
• .sav : SPSS
• .dta : STATA
• .txt : Text files
→ STATA can import all these formats: File/Import/....
1) Download data file
→ Relief dataset from Feinstein & Thomas, available online:
http://www.cambridge.org/uk/catalogue/catalogue.asp?isbn=9780521806633&ss=res
→ Download the Stata file and save it to your folder on the O: drive
2) Open data
→ Open/Relief dataset
3) Open data editor
→ Open Data Editor and try to understand the structure of the dataset
→ What do the rows and columns mean?
→ Change the names of some variables
→ Sort the relief payments in ascending order: what was the minimum paid,
Getting Started Together (II)
4) List some variables: Data/Describe data/List data
→ Relief
→ Income
5) Tabulate some variables:
→ Income
→ Relief
6) Frequencies
→ Get an overview of the distribution with a histogram (Graphics/Histogram)
→ The number of bins sets the number of bars (or the number of categories)
→ Which variables look normally distributed, and which do not?
7) Descriptive statistics (central tendency & dispersion)
→ Mean, standard deviation, min, max (Data/Describe data/Summary statistics)
→ Skewness, kurtosis, median, quartiles, percentiles, etc. (Data/Describe
data/Summary statistics/Display additional statistics)
Getting Started Together (III)
8) Export some tables and graphs to Word
→ Right-click and copy; insert in Word
9) If you’re stuck: Help/…
Appendix: STATA Commands
• edit
Opens the Data Editor
• sort varname
Arranges the observations in ascending order
based on the values of the variable varname
• tabulate varname
Produces one-way tables of frequency counts:
absolute, relative and cumulative frequency
• summarize varname
Calculates a variety of summary statistics (obs,
mean, standard deviation, min, max)
• summarize varname, detail
Gives more detailed statistics, for
instance kurtosis, skewness, percentiles, etc.
• histogram varname, bin(x)
Creates a histogram with x bins (categories)
Homework
→ Readings:
• Feinstein & Thomas, Ch. 3
→ Problem Set 1:
• Do exercises 1, 2 (Relief dataset), and 7 (The Old Poor Law in
England) from chapter 2.7 (pp. 66-70)
• Submit your solutions, including graphs and tables, in a Word file
by noon on Monday (29 January)