Transcript Slide 1

Quantitative Methods – Week 3:
Correlation
Roman Studer
Nuffield College
Review and Homework
• Speed and workload
  - Are we advancing too fast?
  - How much time do you devote to this course?
  - Is it too technical, too mathematical?
• This week’s and next week’s program
  - No new theory!
  - Review and apply what we have learnt so far
→ Look at how authors use descriptive statistics and correlation in
their research
→ Get more practice with Stata
• Discussing problem set 1
  - Any remaining questions?
From Descriptive Statistics to Correlation Analysis
• When doing descriptive statistics, we looked at one variable at
a time, describing its distribution, its central tendency and its
spread
• Now we move on to look at two variables at a time, asking
whether and how these two variables are associated:
  - Are two phenomena (variables) linked?
  - How are they associated (positively/negatively)?
  - How strong is the association?
[Figure: histogram, "Frequency of Per Capita Relief Payments in Kent, 1831"; axes: relief payments (shilling) and number of parishes]
[Figure: scatterplot of income and relief]
Step 1: Graphic Analysis
• As with descriptive statistics, we first want to get a visualisation,
using a graph
• With correlation analysis, we use scatterplots, where one variable is
measured on the vertical axis, and the other on the horizontal axis
  - Example: Poor Law data set: Is the amount of relief payments
associated with the level of unemployment?
[Figure: scatterplot of unemp (vertical axis) against relief (horizontal axis)]
Step 2: Numeric Analysis
• As with descriptive statistics, graphs do not yield precise results, so
we again want some precise number that summarises the
characteristics of the data, or, in this case, the association between
two variables
• The “number” that describes the linear association between any two
variables is the correlation coefficient
• The correlation coefficient, r, is defined as:
\[ r = \frac{\operatorname{cov}(X, Y)}{s_X \, s_Y} \]
where s_X and s_Y are the standard deviations of X and Y
  - r measures the direction and the
strength of the association
  - In contrast to the covariance, the Pearson correlation coefficient is
independent of the unit of measurement
  - r ranges from -1.0 to +1.0
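As a worked illustration of the definition above, the following Python sketch computes r from the sample covariance and the two sample standard deviations. The function names (`covariance`, `std_dev`, `pearson_r`) and the small data set are made up for illustration and are not part of the course materials. The n - 1 divisor is an assumption; it cancels in r in any case.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(x, y):
    """Sample covariance of two equally long lists (divisor n - 1)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def std_dev(values):
    """Sample standard deviation, i.e. the square root of cov(X, X)."""
    return math.sqrt(covariance(values, values))

def pearson_r(x, y):
    """r = cov(X, Y) / (s_X * s_Y)."""
    return covariance(x, y) / (std_dev(x) * std_dev(y))

# Hypothetical illustration data: y rises exactly in line with x.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
x_rescaled = [100 * v for v in x]  # same variable, measured in a different unit

print(pearson_r(x, y))                              # 1.0 up to rounding
print(covariance(x, y), covariance(x_rescaled, y))  # covariance changes with the unit
print(pearson_r(x_rescaled, y))                     # r does not
```

Rescaling x multiplies the covariance by 100 but leaves r unchanged, which is the sense in which r is independent of the unit of measurement.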
Step 2: Numeric Analysis (II)
• Interpretation
  - +1: perfect positive correlation
  - -1: perfect negative correlation
  - 0: no linear correlation
The closer r is to either +1 or -1, the stronger the relationship
• Example: Association between relief payments and unemployment
Correlation coefficient r: +0.44
  - Positive association
  - Moderate association
Direction: Positive and Negative Correlations
                        Values of X      Values of Y
Positive correlation    Increasing   →   Increasing
                        Decreasing   →   Decreasing
Negative correlation    Increasing   →   Decreasing
                        Decreasing   →   Increasing

Strength of the Association between Variables (I)
Example: GDP per person and purchasing power around the world, 2007
[Figure: scatterplot of PurchasingPower (vertical axis) against GDPperhead (horizontal axis)]
Very strong association
→ Is to be expected, as concepts are closely related
→ Positive correlation
→ r = 0.95
Strength of the Association between Variables (II)
Example: Election in Weimar Germany, 1932
Are the level of unemployment and the support of the Nazi party linked?
[Figure: scatterplot of nsdap votes (nazi party) (vertical axis) against unemployment rate (in %) (horizontal axis)]
Quite strong association
→ Especially given that the connection is far from obvious
→ Positive correlation
→ r = 0.63
Strength of the Association between Variables (III)
Example: Election in Weimar Germany, 1932
Were Catholics more likely to vote for the Nazis?
[Figure: scatterplot of nsdap votes (nazi party) (vertical axis) against catholics (in %) (horizontal axis)]
Moderate association
→ Pattern not very clear
→ Negative correlation
→ r = -0.34
Strength of the Association between Variables (IV)
Example: Election in Weimar Germany, 1932
Was there a connection between unemployment and the proportion of
Catholics in a district?
[Figure: scatterplot of unemployment rate (in %) (vertical axis) against catholics (in %) (horizontal axis)]
No discernible association
→ r = -0.03
→ Correlation coefficient close to 0
Caution when Interpreting Correlation Results
• Correlation is NOT causation!!
  - Causation is very hard to ascertain in social sciences
• Beware of spurious or nonsense correlation!
Examples:
  - Simultaneous decline of birth rates and of the number of storks
in Sweden
  - Positive correlation between shoe size and income level
  - Omitted variables are one of the big problems in
econometrics
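The omitted-variable problem can be illustrated with a small simulation. The Python sketch below is purely artificial: it assumes that a third variable (labelled `age` here as a hypothetical example) drives both shoe size and income, while neither of the two affects the other. All numbers, units and names are invented for illustration.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so that repeated runs give the same numbers

# Hypothetical omitted variable (think of age); the units are arbitrary.
age = [rng.uniform(0, 10) for _ in range(200)]

# Both observed variables depend on age plus independent noise,
# but neither depends on the other.
shoe_size = [0.5 * a + rng.gauss(0, 1) for a in age]
income = [2.0 * a + rng.gauss(0, 1) for a in age]

r_raw = pearson_r(shoe_size, income)

# Remove the part driven by age. This is only possible here because the
# simulation's coefficients (0.5 and 2.0) are known by construction.
shoe_rest = [s - 0.5 * a for s, a in zip(shoe_size, age)]
income_rest = [i - 2.0 * a for i, a in zip(income, age)]
r_rest = pearson_r(shoe_rest, income_rest)

print(f"shoe size vs income:           r = {r_raw:.2f}")
print(f"after removing the age effect: r = {r_rest:.2f}")
```

The raw correlation is clearly positive although there is no direct link between the two variables; once the common driver is taken out, what remains is close to zero.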
Caution when Interpreting Correlation Results (II)
• Watch out for the influence of outliers on the correlation results!!
Example: GDP per person and purchasing power around the world, 2007
[Figure: two scatterplots of Purchasing Power Parity (vertical axis) against GDP per head (horizontal axis)]
  - First panel: very strong association, positive correlation, r = 0.95
  - Second panel: just 2 outliers, weak association suggested, r = 0.34
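How strongly a couple of observations can move r is easy to check numerically. The Python sketch below uses invented data (ten points on a straight line plus two hypothetical outliers); it is not the GDP data from the slide, and the helper name `pearson_r` is an assumption.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Ten invented observations lying exactly on a straight line.
x = list(range(1, 11))
y = list(range(1, 11))
r_clean = pearson_r(x, y)

# The same data plus two hypothetical outliers: large x, small y.
x_out = x + [15, 16]
y_out = y + [2, 1]
r_out = pearson_r(x_out, y_out)

print(f"without outliers: r = {r_clean:.2f}")
print(f"with 2 outliers:  r = {r_out:.2f}")
```

Ten of the twelve points still lie on a perfect line, yet the coefficient drops from 1 to a weak positive value, so a scatterplot should always be inspected alongside r.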
Computer Class:
• Descriptive Statistics (II)
• Correlation
Data Set: Global Macroeconomic Data
• This data set was assembled using the 2007 edition of “Pocket World in
Figures”, published by The Economist

Country           GDP per Head   Agriculture   Education
Norway                   54360          1.60          81
Switzerland              49660          1.40          49
United States            39430          1.20          83
Brazil                    3340         10.10          21
Iran                      2340         13.70          21
United Kingdom           35760          1.00          64
India                      640         21.50          12
Greece                   18660          7.00          74
Hungary                  10270          3.30          51
China                     1470         15.20          16
Cameroon                   880         41.50           5
Nepal                      260         40.30           3
South Korea              14160          3.70          85

Notes:
- "Agriculture" is the % of GDP from agriculture
- "Education" is the % of tertiary enrolment
Exercises
A. Data set and descriptive statistics
• Open Stata and create a new data set from the table on the previous slide (use the data
editor!)
• Look at each of the variables in turn
  - Produce histograms: Get a first visualisation of the data; does it look normally
distributed? Does it make sense plotting histograms?
  - Compute the mean, median, standard deviation, coefficient of variation, kurtosis and
skewness for every variable
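For checking results outside Stata, the summary measures listed above can be sketched in Python. The values are the "GDP per Head" column of the data set slide, in row order. The function name `describe` is made up, and the moment-based definitions of skewness and kurtosis (central moments with divisor n) are an assumption; statistical packages differ in their small-sample adjustments, so the last digits may not match other software.

```python
import statistics

# "GDP per Head" column from the data set slide, in row order.
gdp_per_head = [54360, 49660, 39430, 3340, 2340, 35760, 640,
                18660, 10270, 1470, 880, 260, 14160]

def describe(values):
    """Return the summary measures asked for in exercise A as a dict."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (divisor n - 1)
    # Central moments with divisor n, used for skewness and kurtosis.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "median": statistics.median(values),
        "sd": sd,
        "cv": sd / mean,               # coefficient of variation
        "skewness": m3 / m2 ** 1.5,    # 0 for a symmetric distribution
        "kurtosis": m4 / m2 ** 2,      # 3 for a normal distribution
    }

summary = describe(gdp_per_head)
for name, value in summary.items():
    print(f"{name:>8}: {value:.3f}")
```

A mean well above the median together with positive skewness indicates a long right tail, which is worth keeping in mind when judging whether the histogram looks normally distributed.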
B. Correlation analysis
• Look at the association of “GDP per Head” and “Agriculture”
  - Make a scatter plot to get a first impression of their association: Do you think these
variables are connected? Positively/negatively?
  - Calculate the correlation coefficient; how would you explain the result?
• Look at the association of “GDP per Head” and “Education”
  - Make a scatter plot to get a first impression of their association: Do you think these
variables are connected? Positively/negatively?
  - Calculate the correlation coefficient; how would you explain the result?
  - If you look at the scatter plot, are there any outliers?
  - What happens if you omit the outliers? How could you justify omitting the outliers?
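As a cross-check on the Stata output, the two correlation coefficients can also be computed by hand. The Python sketch below uses the three numeric columns of the data set slide, in row order; the helper name `pearson_r` is an assumption, and which observations count as outliers is deliberately left open.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Columns from the data set slide, in row order.
gdp_per_head = [54360, 49660, 39430, 3340, 2340, 35760, 640,
                18660, 10270, 1470, 880, 260, 14160]
agriculture = [1.60, 1.40, 1.20, 10.10, 13.70, 1.00, 21.50,
               7.00, 3.30, 15.20, 41.50, 40.30, 3.70]
education = [81, 49, 83, 21, 21, 64, 12, 74, 51, 16, 5, 3, 85]

r_agriculture = pearson_r(gdp_per_head, agriculture)
r_education = pearson_r(gdp_per_head, education)

print(f"GDP per head vs agriculture: r = {r_agriculture:.2f}")
print(f"GDP per head vs education:   r = {r_education:.2f}")
```

To see the effect of omitting observations, the same function can be called again on lists from which the chosen rows have been removed.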
Exercises (II)
C. Save your data set and your results on the O: drive
• Save your new data set as a Stata file (.dta)
• Export and save your data set as an Excel file (.xml)
• Copy your results and save them in a Word file (.doc)
Appendix: STATA Commands
• correlate varlist
Displays all the pairwise Pearson correlation
coefficients between the variables listed after
correlate
• pwcorr varlist
Like correlate, but has some additional options
like calculating the significance level
• spearman varlist
Displays Spearman's rank correlation
coefficients for all pairs of variables
• scatter varname1 varname2
Produces a scatter plot with variable 1 on the
y-axis and variable 2 on the x-axis
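Spearman's rank correlation is the Pearson coefficient computed on the ranks of the observations rather than on their values. The Python sketch below illustrates the difference on invented data; the helper names `ranks`, `pearson_r` and `spearman_rho` are assumptions and do not reproduce Stata's implementation.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman's coefficient: Pearson's r applied to the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Invented data: y always rises with x, but not along a straight line.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

print(f"Pearson:  {pearson_r(x, y):.3f}")     # below 1, the relation is not linear
print(f"Spearman: {spearman_rho(x, y):.3f}")  # 1.000, the ordering agrees perfectly
```

Because it only uses the ordering of the data, the rank-based coefficient is also less sensitive to extreme values than Pearson's r.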
Homework
Readings:
• Levitt, Steven and Stephen Dubner, Freakonomics, chapter 4, “Where Have All the
Criminals Gone?”.
Problem Set 2:
• Finish the exercises from today’s computer class if you haven’t done so already. Include all the
results and answers in the file you send me
• Answer (very briefly) the following questions about Levitt’s “Where Have All the Criminals Gone?”
  - List all the variables that Levitt looks at in the course of the chapter
  - Many factors (variables) are potentially associated with the drop in crime rates in the US.
Where does he find correlations between a variable and the falling crime rate? Which
variables are positively, which ones negatively correlated with the crime rate variable? Where
does he find no correlation?
  - In which cases does Levitt move from correlation to causation? How does he justify this
change in language? Is it convincing?
  - Does this chapter potentially suffer from an omitted variable problem? What other factors
can you think of for explaining the fall in crime rates?
  - For each potential factor, he provides both the relevant data and some common sense
explanation. What do you believe more – the data or the explanations? Why does he need the
explanations at all?
  - Are you convinced by Levitt’s overall argument? Has the fall in the US crime rate thereby
been explained once and for all?