Transcript Document
SKEMA – Ph.D programme
2010 – 2011
Quantitative Methods
For Social Sciences
Lionel Nesta
Observatoire Français des Conjonctures Economiques
Objective of The Course
The objective of the class is to provide students with a set
of techniques to analyze quantitative data. It concerns the
application of quantitative and statistical approaches as
developed in the social sciences, for future decision
makers, policy makers, stakeholders, managers, etc.
All courses are computer-based classes using the STATA
statistical package. The objective is to reach levels of
competence which provide the students with skills to both
read and understand the work of others and to carry out
one's own research.
Examples
Rise in biotechnology
Should the EU fund fundamental research in biotechnology?
Has biotechnology increased the productivity of firm-level R&D?
Did it increase the speed of discovery in pharmaceutical R&D?
Increasing university-industry collaborations
Does it facilitate innovation by firms?
Does it increase the production of new knowledge by academics?
Does it modify the fundamental/applied nature of research?
Examples
Economic (productivity) Growth
Does it come mainly from new firms or improving existing firms?
Is market selection operating correctly?
Why do good firms exit the market?
How does the organisation of knowledge impact on performance?
How do knowledge stock and specialisation impact on productivity?
How do firms enter into new technological fields?
Do firms diversify in new technologies/businesses purposively?
Structure of the Class
Part 1 : Descriptive Statistics
- Mean, variance, standard deviation
- Data management
Part 2 : Statistical Inference
- Distributions
- Comparison of means
Part 3 : Relationship Between Variables
- ANOVA, Chi-Square
- Correlation
Part 4 : Ordinary Least Squares (OLS)
- Correlation coefficient, simple regression
- Multiple regression
Part 5 : Extension to OLS
- Regression diagnostics
- Qualitative explanatory variables
Part 6 : Qualitative Dependent variables
- Linear probability model
- Maximum likelihood (logit, probit)
Part 1
Descriptive Statistics
Types of Data
Descriptive statistics is the branch of statistics which gathers all
techniques used to describe and summarize quantitative and
qualitative data.
Quantitative data
Continuous
Measured on a scale (any value within its range)
The size of the number reflects the amount of the variable
Age; wage, sales; height, weight; GDP
Qualitative data
Discrete, categorical
The number reflects the category of the variable
Type of work; gender; nationality
Descriptive Statistics
Any means of summarizing data concisely is useful: graphs;
charts; tables.
Quantitative data
Graphs: scatter plots; line plots; histograms
Central tendency
Dispersion
Qualitative data
Graphs: pie graphs; histograms
Tables, frequency, percentage, cumulative percentage
Cross tables
Central Tendency and Dispersion
A distribution is an ordered set of numbers showing how many
times each occurred, from the lowest to the highest number or the
reverse
Central tendency: measures of the degree to which scores are
clustered around the mean of a distribution
Dispersion: measures the fluctuations around the characteristics of
central tendency
In other words, the characteristics of central tendency produce
stylized facts, whereas the characteristics of dispersion indicate how
representative a given stylized fact is.
Central Tendency
The mode
The most frequent score in a distribution is
called the mode.
The median
The middle value of all observed values:
50% of observed values are higher and 50% of
observed values are lower than the median
The mean
The sum of all of the values divided by the
number of values
$$\bar{X} = \frac{1}{N} \sum_{i=1}^{N} x_i$$
If the distribution is symmetrical and unimodal, the mode, the mean and the median are equal.
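The three measures of central tendency can be checked by hand on a small sample. The sketch below uses Python's standard statistics module rather than Stata, and the eight scores are hypothetical values chosen for illustration, not course data.

```python
from statistics import mean, median, mode

# Hypothetical sample of eight scores (not taken from the course data).
scores = [2, 4, 4, 4, 5, 5, 7, 9]

sample_mode = mode(scores)      # most frequent value
sample_median = median(scores)  # average of the two central values, since N is even
sample_mean = mean(scores)      # sum of the values divided by their number

print(sample_mode, sample_median, sample_mean)
```

The three values differ here because this sample is not symmetrical.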
Dispersion
The range
Difference between the maximum and
minimum values
$$R = x_{max} - x_{min}$$
The variance
Average of the squared differences between
data points and the mean (mean
quadratic deviation)
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{X})^2}{N}$$
The standard deviation
Square root of the variance; it therefore measures
the spread of the data about the mean,
in the same units as the data
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{X})^2}{N}}$$
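As a worked illustration of the three dispersion measures, the sketch below computes them in Python for a hypothetical sample (not course data). It assumes the definitions on this slide, which divide by N; Stata's summarize divides by N - 1 instead, so its figures differ slightly on small samples.

```python
from statistics import pstdev, pvariance

# Hypothetical sample of eight scores (not taken from the course data).
scores = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(scores) - min(scores)  # maximum minus minimum
variance = pvariance(scores)             # mean squared deviation, dividing by N
std_dev = pstdev(scores)                 # square root of the variance

print(value_range, variance, std_dev)
```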
Research Productivity in the
Bio-pharmaceutical Industry
EU Framework Programme 7
Stylised Facts about Modern Biotech
1. Innovations emerge from uncertain, complex processes
involving knowledge and markets: Roles of networks.
2. Economic value is created in many ways, globally and
in geographical agglomerations
3. Various linkages exist among diverse actors (LDFs,
DBFs, Univ, Venture Capital) in innovation processes,
but the firm plays a particularly important role.
4. Regulations, social structures and institutions affect ongoing innovation processes as well as their impacts on
society: Importance of IPR.
STATA software
The Stata software
Stata Corp, spinoff from Texas A&M – College Station – Texas
(1985)
Among the most widely used programs for statistical analysis
in social sciences.
Probably the most widely used econometric software among
economists
Data management (case selection, file reshaping, creating
derived data)
Features of Stata are accessible via pull-down menus
The pull-down menu interface generates command syntax.
The Stata software
STATA is a statistical software in constant evolution
Updates are regularly posted on the web for use by
other Stata users (command: update all)
Most are available through the Boston College server
ssc install module_name, all
Hundreds of others can be found and installed as follows:
net search key_words
net install module_name, all
The Stata software
Review window
Pull down menus
Results window
Variable window
Command window
The Stata software
How to use STATA ?
Using pull-down menus
Typing STATA instructions in the Command window
Grouping a series of STATA instructions in a .do file
Programming new functions (.ado files)
Programming new functions with a powerful matrix language
(MATA) similar to C (Version 9.0 of STATA onwards)
The Stata software
All STATA commands used from the menu or the command
window are automatically stored in the Review window
At the end of a session, the review window can then be saved
by right-clicking on it
Save all : saves the commands in a do-file
Send to do-file editor : A new window opens up.
A Do-file is a text file containing a list of STATA commands
which will be executed step by step by STATA.
It is recommended to explore results and methods with the
command window. Once the methods are settled, save the
series of commands as a do-file.
The Stata software
All STATA results are displayed in the Results window
This window is a buffer. Once output scrolls out of the buffer, it
is deleted. That is why you may want to record results.
log using log_name.txt (beginning of a session)
log close (end of a session)
It is recommended to save results in a log file. Moreover, if you
work with a do-file, you can always reproduce old results by re-running the do-file.
The Stata software
Memory settings
By default, 10 megabytes are available for database uploading. If
a database is greater than 10Mb, STATA does not upload the
database. There are also other limits (matrix size, # of variables)
which can be managed using the commands below.
Useful commands
describe using database_name.dta
query memory
clear
set memory 500m, permanently
set maxvar n , permanently
set matsize n , permanently
set virtual on , permanently
Data Handling (1): Database creation
1st step: Creating a database
Typing data in the database through Data Editor (edit)
Importing data
insheet using myfile.txt , options
options : tab ; comma ; delimiter("char") ; clear ; names
Importing data from a .txt file
- Without fixed format (without dictionary)
infile var1 var2 var3 using myfile.txt , options
- With a fixed format (with dictionary)
infile using mydict.dct , using(myfile.txt) options
DH(2): Database Exploration
2nd step: Exploring the Data
To obtain a description of the database
describe [varlist], options
inspect [varlist]
codebook [varlist], options
nmissing [varlist], options
npresent [varlist], options
To display all possible values of a variable
list [varlist] [if] [in], options
Example : list var1 if var2 > var3 in 1/100
DH(3): Database Organisation
3rd step: Organisation of the database
Sorting observations
sort varlist
gsort [ + | - ] varlist
Sorting variables
order varlist
aorder varlist
(If no varlist is specified, _all is assumed.)
Merging several databases (adding variables)
merge varlist using base1.dta [base2.dta], options
Merging several databases (adding observations)
append using base1.dta [base2.dta], options
DH(3): Database Organisation
3rd step: Organisation of the database
Modifying the shape of the database
reshape long stubnames, i(varlist) j(varlist)
reshape wide stubnames, i(varlist) j(varlist)
Wide form (i = id)
id  sex  inc80  inc81  inc82
----------------------------
1   0    5000   5500   6000
2   1    2000   2200   3300
3   0    3000   2000   1000
Long form (i = id, j = year)
id  year  sex  inc
------------------
1   80    0    5000
1   81    0    5500
1   82    0    6000
2   80    1    2000
2   81    1    2200
2   82    1    3300
3   80    0    3000
3   81    0    2000
3   82    0    1000
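To make the wide-to-long transformation concrete, here is a minimal Python sketch of what reshape long does to the table above. The function reshape_long and its parameter names are hypothetical helpers written for this illustration; it is not how Stata implements the command.

```python
# Wide form: one row per id, one income column per year (values from the slide).
wide_form = [
    {"id": 1, "sex": 0, "inc80": 5000, "inc81": 5500, "inc82": 6000},
    {"id": 2, "sex": 1, "inc80": 2000, "inc81": 2200, "inc82": 3300},
    {"id": 3, "sex": 0, "inc80": 3000, "inc81": 2000, "inc82": 1000},
]


def reshape_long(rows, stub, i, j):
    """Turn the columns named stub+suffix into one row per (i, j) pair."""
    long_rows = []
    for row in rows:
        # Columns that do not start with the stub are copied to every new row.
        fixed = {k: v for k, v in row.items() if not k.startswith(stub)}
        for key, value in row.items():
            if key.startswith(stub):
                long_row = dict(fixed)
                long_row[j] = int(key[len(stub):])  # suffix becomes the j variable
                long_row[stub] = value              # stub becomes the value column
                long_rows.append(long_row)
    long_rows.sort(key=lambda r: (r[i], r[j]))
    return long_rows


long_form = reshape_long(wide_form, stub="inc", i="id", j="year")
for row in long_form:
    print(row)
```

Each of the 3 wide rows yields 3 long rows, one per year, which matches the 9 lines of the long form.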
DH(4) : Saving, Opening, Exporting
4th step: Save and re-use STATA database files (.dta files)
Changes the working directory to the specified drive and directory
cd "C:\STATA SKEMA"
Saves the database as a STATA file (.dta)
save myfile.dta , replace
Opens a STATA format database (.dta)
use myfile.dta , clear
Exports a database as a .txt file
outsheet [varlist] using myfile.txt , options
options : comma ; nonames ; replace
Handling Variables
Create a new variable
By assigning a value to it
generate var1 = expression [if] [in]
Using a predefined function: Extensions to generate
egen var1 = fcn(arguments) [if] [in], options by(varlist)
fcn : min ; max ; mode ; mean ; median ; sd ; total ;
pctile ; group ; count ; etc…
Examples (new variable names are illustrative) : egen mean_salaire = mean(salaire) , by(age)
egen nom_group = group(nom)
egen n_id = count(id), by(sector)
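Unlike an aggregation, egen with by() keeps every observation and attaches the group statistic to each line. The Python sketch below illustrates this for a group mean; the data and the variable name mean_salaire are hypothetical.

```python
from collections import defaultdict

# Hypothetical observations: salaire (wage) and age.
rows = [
    {"age": 30, "salaire": 2000},
    {"age": 30, "salaire": 3000},
    {"age": 40, "salaire": 4000},
]

# First pass: collect the wages of each age group.
groups = defaultdict(list)
for row in rows:
    groups[row["age"]].append(row["salaire"])

# Second pass: attach the group mean to every observation of the group.
for row in rows:
    values = groups[row["age"]]
    row["mean_salaire"] = sum(values) / len(values)

for row in rows:
    print(row)
```

The number of lines is unchanged: both observations with age 30 carry the same group mean.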
Handling Variables
Variables modifications and removal
Modifying a variable which has already been created
replace var1 = expression [in] [if]
Erasing variables
drop varlist
keep varlist
Erasing observations
drop [in] [if]
keep [in] [if]
Examples : drop if revenu < 100
keep if age >= 18
Handling Variables
Time series and panel data utilities
Declaring data as time series or panel data
tsset [panelvar] timevar , options
options : daily ; weekly ; monthly ; quarterly ; yearly
Example : tsset id annee , yearly
Using time series operators
Lagged values: L. ; L2. or LL.
L.X = X_{t-1}
L2.X = X_{t-2}
Forwarded values: F. ; F2. or FF.
F.X = X_{t+1}
F2.X = X_{t+2}
Differenced values: D. ; D2. or DD.
D.X = X_t - X_{t-1}
D2.X = (X_t - X_{t-1}) - (X_{t-1} - X_{t-2})
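The definitions of the lag, forward and difference operators can be verified on a short series. The following Python sketch is an illustration under simple assumptions (a single series with no gaps, missing values written as None); the function names lag, lead and diff are hypothetical and are not Stata functions.

```python
def lag(x, k=1):
    """L.X: the value k periods earlier; the first k values are missing."""
    return [None] * k + x[:len(x) - k]


def lead(x, k=1):
    """F.X: the value k periods later; the last k values are missing."""
    return x[k:] + [None] * k


def diff(x):
    """D.X: X_t - X_{t-1}; missing whenever either term is missing."""
    return [
        None if current is None or previous is None else current - previous
        for current, previous in zip(x, lag(x))
    ]


series = [10, 12, 15, 19]  # hypothetical yearly values

print(lag(series))         # L.X
print(lag(series, 2))      # L2.X
print(lead(series))        # F.X
print(diff(series))        # D.X
print(diff(diff(series)))  # D2.X, the difference of the difference
```

Applying diff twice reproduces D2.X = (X_t - X_{t-1}) - (X_{t-1} - X_{t-2}): for the third value, (15 - 12) - (12 - 10) = 1.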
Descriptive Statistics with STATA
Using log files
log using xxx, replace / log close
Defining and using labels
label variable
label define
label values
Descriptive statistics
summarize
table
table, content()
tabulate
Manipulating .dta files and exporting
collapse
save as
outsheet using...
Log files
Log files save the contents of the Results window. They are useful when producing
descriptive statistics on the .dta files and on the variables.
log using nom_fichier_log, replace
STATA commands
log close
Advantage. Very useful for retrieving old results (replication and refutation)
Drawback. Tedious to manipulate
Labelling variables
Labelling is too often neglected.
No influence on the results
Large influence on correct interpretation of variables and results
label variable. Describes a variable
label variable asset "real capital"
label define. Defines a label
label define firm_type 0 "Pharma" 1 "biotech"
label values. Applies the label
label values type firm_type
Descriptive statistics: summarize
summarize var1 var2....varN
Produces the number of observations, mean, standard deviation, min and max
We can add a condition using if
summarize var1 var2 ....varN if [condition]
We can produce descriptive statistics by subsets of the database
using bysort
bysort varcat: summarize var1 var2 ....varN
Beware: the if condition is written before the comma that introduces options, so you do
not need a comma before if. If you get an error message, there is a very high chance that it
comes from a misplaced comma around if.
Descriptive statistics: table
The table command applies to categorical variables (string or
numeric).
table varcat1
Provides the number of observations by categories of varcat1
table varcat1 varcat2
Provides a cross table between varcat1 and varcat2
table varcat, content(count var1 mean var1 sd var1...)
Provides descriptive statistics on var1 by categories of varcat
Descriptive statistics: tabulate
The tabulate command is similar to table, but its options are different.
tabulate varcat, gen(varcat_)
Generates dummy variables for each category of varcat
tabulate varcat1 varcat2, [options]
Generates measures of association between two categorical variables
tabulate varcat1, summarize(var2)
Provides descriptive statistics on var2 by categories of varcat1
Stacking observations: collapse
The collapse command produces a new database which is an
aggregation of the old database.
collapse aggregates lines (observations) by the categories of a
categorical variable of your choice
collapse (mean) var1 var2 (sum) var3, by(varcat)
Will generate a new database with as many lines as there are categories
of varcat, with 3 variables (means of var1 & var2, sum of var3)
collapse (mean) var1 var2 (sd) sdvar1=var1 sdvar2=var2,
by(varcat1 varcat2)
Will generate a new database with as many lines as there are combinations of categories
of varcat1 & varcat2, with 4 variables (means and standard deviations
of var1 & var2)
Note: collapse is useful for exporting tables of results to Excel.
Note: Please save the old and new database under different names!
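The aggregation performed by collapse can be sketched in a few lines of Python. The function collapse below is a hypothetical helper written for this illustration and the data are made up; it mimics collapse (mean) var1 (sum) var3, by(varcat) on three observations.

```python
from collections import defaultdict

# Hypothetical database: three observations in two categories.
rows = [
    {"varcat": "a", "var1": 1.0, "var3": 10},
    {"varcat": "a", "var1": 3.0, "var3": 20},
    {"varcat": "b", "var1": 5.0, "var3": 30},
]


def collapse(rows, by, mean_vars, sum_vars):
    """Return one line per category of `by`, with group means and group sums."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row)
    collapsed_rows = []
    for key in sorted(groups):
        members = groups[key]
        new_row = {by: key}
        for var in mean_vars:
            new_row[var] = sum(m[var] for m in members) / len(members)
        for var in sum_vars:
            new_row[var] = sum(m[var] for m in members)
        collapsed_rows.append(new_row)
    return collapsed_rows


collapsed = collapse(rows, by="varcat", mean_vars=["var1"], sum_vars=["var3"])
for row in collapsed:
    print(row)
```

The result has one line per category, and the original observations are gone, which is why the old and new databases should be saved under different names.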
Keywords for table & collapse
mean      means (default)
sd        standard deviations
sum       sums
rawsum    sums, ignoring optionally specified weight
count     number of nonmissing observations
max       maximums
min       minimums
iqr       interquartile range
median    medians
p1        1st percentile
p2        2nd percentile
...       3rd-49th percentiles
p50       50th percentile (same as median)
...       51st-97th percentiles
p98       98th percentile
p99       99th percentile
Graphs
Graphic representations are a very effective means of synthesis.
- Pie graphs, which display proportions of a population or a sample
- Two-way graphs linking any two quantitative dimensions
- Distribution graphs (histograms), which plot the central tendency
and dispersion of a variable
Pie Graphs
graph pie, over(varcat)
[Figure: pie chart showing the share of each category of varcat (C1 to C4, D0, E1 to E3, F1 to F6)]
Two-way Graphs
Two-way graphs link two continuous variables, var1 and var2.
There are several types of two-way graphs :
- Line graphs
twoway line var1 var2
- Classical scatterplot
twoway scatter var1 var2
- Connected graphs
twoway connected var1 var2
Line graphs
twoway line var1 var2
[Figure: line plot of rdi (vertical axis) against year (horizontal axis, 1988 to 1998) for Abbott]
twoway line rdi year if name=="Abbott"
Line graphs
twoway line var1 var2
[Figure: line plot of rdi against year (1988 to 1998), with one line for Amgen and one for Abbott]
twoway (line rdi year if name=="Amgen", sort) (line rdi year if
name=="Abbott", sort), legend(on order(1 "Amgen" 2 "Abbott"))
Connected graphs
twoway connected var1 var2
[Figure: connected line plot of rdi (vertical axis) against year (horizontal axis, 1988 to 1998)]
twoway (connected rdi year if name=="Abbott")
Scatterplots
twoway scatter var1 var2
[Figure: scatterplot of lrdi (vertical axis) against lassets (horizontal axis)]
twoway scatter lrdi lassets
Distribution graphs
Distribution graphs plot the distribution of one quantitative variable var1
at a time by means of a histogram:
On the horizontal axis, classes of var1 are displayed.
On the vertical axis, the density of each class is displayed.
For each class j:
Density: f_j
Number of observations: n_j
Class range: d_j = c_j - c_{j-1}
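The sketch below shows one common way of computing the height of each histogram bar from the number of observations and the class range. It assumes the density of class j is n_j / (N x d_j), so that the areas of the bars sum to 1; this formula is an assumption made for the illustration, since the slide does not spell it out. The function name, the data and the class limits are hypothetical.

```python
def histogram_density(values, edges):
    """Return the density of each class [edges[j], edges[j + 1])."""
    n_total = len(values)
    densities = []
    for j in range(len(edges) - 1):
        lower, upper = edges[j], edges[j + 1]
        n_j = sum(1 for v in values if lower <= v < upper)  # observations in class j
        d_j = upper - lower                                 # class range
        densities.append(n_j / (n_total * d_j))             # assumed density formula
    return densities


values = [1, 2, 2, 3, 5, 6, 7, 9]  # hypothetical observations
edges = [0, 4, 10]                 # two classes of unequal range

densities = histogram_density(values, edges)
areas = [f * (edges[j + 1] - edges[j]) for j, f in enumerate(densities)]
print(densities, sum(areas))
```

Both classes hold four observations, but the wider class gets a lower bar, because the density divides by the class range.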
Distribution histograms
hist var1
[Figure: histogram of lassets, with density on the vertical axis]
hist lassets
Kernel distributions
Using a kernel density estimate, one can approximate the probability density function of var1. The
probability density function is useful for visually assessing the normality of
the distribution.
Normal distributions are also called Gaussian distributions. They are very
frequently used in the sciences to account for random processes. Their use
rests on the law of large numbers and the central limit theorem.
Kernel distribution
kdensity var1
[Figure: kernel density estimate of lassets; kernel = epanechnikov, bandwidth = 0.5319]
kdensity lassets
Exporting Graphs
One can simply copy and paste a graph into any Microsoft Office application.
One can also use do-files and write:
graph export graph_name.extension, as(extension) options
Example :
graph export SKEMA_rdi.wmf, as(wmf) replace
Possible extensions: PostScript (ps), Encapsulated PostScript (eps), Windows
Metafile (wmf), Windows Enhanced Metafile (emf), Macintosh PICT format
(pict), Portable Document Format (pdf)
SPSS software
Statistical Package for the Social Sciences
SPSS : Opening SPSS
SPSS : Importing data
Settings in the “import text” dialogue box
No predefined format (1)
Delimited (2)
First line contains the variable names (2)
One observation per line // all observations (3)
Tab delimited only (4)
Finish (6)
SPSS windows
SPSS automatically opens two windows
The datasheet window
Observe, manage, modify and create data
The results window
Everything you do will be stored there
The syntax window can be opened
SPSS : Data sheet (1)
SPSS : Data sheet (2)
SPSS : Result / Journal
SPSS : Saving data
SPSS : working, at last!
Recoding Variables
Changing existing values to new values (biotechnologie → DBF,
pharmaceutique → LDF)
Computing New Variables
Taking logarithm (normalization of continuous variables)
Creating Dummy Variables
Computation of Descriptive Statistics
Descriptive Statistics
Descriptive statistics (SPSS output)
Variable            N    Range        Minimum     Maximum      Mean         Std. deviation  Variance
patent              457  286          0           286          11.92        22.901          524.470
assets              457  35788473.97  4422.18     35792896.15  4358371.54   6086530.85      3.705E+013
rd                  457  1917997.980  858.53204   1918856.512  330236.630   405160.516      164155043889
spe                 457  2.0235309    -1.1298400  .8936909     -.056808610  .3374751802     .114
pharma              457  1            0           1            .63          .482            .232
biotech             457  1            0           1            .37          .482            .232
Valid N (listwise)  457
Splitting Database
Descriptive Statistics (by type)
Descriptive statistics (SPSS output, by type)
Type  Variable            N    Range      Minimum    Maximum   Mean        Std. deviation  Variance
DBF   patent              167  202        0          202       12.11       21.066          443.764
DBF   assets              167  2442619    4422.18    2447041   342934.49   478511.938      2E+011
DBF   rd                  167  495443.5   858.53204  496302.1  58116.590   88638.5347      8E+009
DBF   spe                 167  1.7544527  -1.12984   .6246127  -.10630582  .343286812      .118
DBF   pharma              167  0          0          0         .00         .000            .000
DBF   biotech             167  0          1          1         1.00        .000            .000
DBF   Valid N (listwise)  167
LDF   patent              290  286        0          286       11.81       23.929          572.609
LDF   assets              290  4E+007     218006.47  4E+007    6670709.4   6605972.68      4E+013
LDF   rd                  290  1912600    6256.248   1918857   486940.24   432514.940      2E+011
LDF   spe                 290  1.6904465  -.7967556  .8936909  -.02830504  .331330781      .110
LDF   pharma              290  0          1          1         1.00        .000            .000
LDF   biotech             290  0          0          0         .00         .000            .000
LDF   Valid N (listwise)  290
Logarithm
Normalization
Taking the logarithm is a transformation which usually normalizes the
distribution.
Elasticities http://en.wikipedia.org/wiki/Elasticity_(economics)
A change in log of x is a relative change of x itself.
Cobb-Douglas production function
$$\frac{\partial \log x}{\partial x} = \frac{1}{x} \quad \Rightarrow \quad \partial \log x = \frac{\partial x}{x}$$
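A small numerical check illustrates why a change in the log of x approximates a relative change of x. The Python sketch below uses hypothetical values (a move from 100 to 105) and natural logarithms; the approximation is only close for small changes.

```python
import math

x_old, x_new = 100.0, 105.0  # hypothetical values: a 5% increase

relative_change = (x_new - x_old) / x_old       # change of x relative to x
log_change = math.log(x_new) - math.log(x_old)  # change in the log of x

print(relative_change, log_change)
```

The two numbers are close but not equal: the log change is slightly smaller than the relative change, and the gap widens as the change in x gets larger.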