Transcript - UNDP-ALM
WORKSHOP ON ECONOMIC ANALYSIS
OF CLIMATE CHANGE
PRACTICAL LESSONS ON STATA 11
INTERACTIVE USE OF STATA
• Interactive use means that STATA commands are
initiated within STATA.
• A graphical user interface (GUI) for STATA is
available. It enables almost all the STATA
commands to be accessed using drop-down
menus.
• STATA also allows users to type commands directly to
execute a particular task.
• The standard procedure in STATA, however, is to
collect the commands needed into
one file, called a do-file, that can be run with or
without interactive use.
BASICS IN STATA
• Like most software packages, STATA ships with some example
data sets that new users can use as a
starting point in learning STATA.
– One such data set is auto.dta
• To access the example data:
– Click File/Example Datasets/… Example datasets
installed with Stata
• Select the data set auto.dta
– Interactive Users can however type the command
• sysuse auto
DATA MANAGEMENT
• To describe the variables in the data set type:
– describe or des
– To describe specific variables only, add the names of the
variables to the command.
• Eg: des mpg
• NB: STATA commands are case sensitive and must be typed in lower case
• If you wish to see the summary statistics of the variables, type any of:
• summarize, detail
• sum, detail
• su, detail
• su, d
– You can drop the detail option if you only wish to obtain the basic
summary statistics.
– You can summarize specific variables
• sum varlist, detail
• Eg: sum mpg, detail
– sum mpg
– su mpg
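The describe and summarize commands above can be put together in a short do-file. This is a minimal sketch, assuming the auto.dta example data set shipped with Stata is available. The r(mean) and r(N) results are those left behind by summarize.

```stata
* Load the example data set and inspect two variables.
sysuse auto, clear

describe mpg price          // storage type, display format and label
summarize mpg               // obs, mean, std. dev., min and max
summarize mpg, detail       // adds percentiles, skewness and kurtosis

* summarize stores its results in r(), so they can be reused.
display "Mean mpg = " r(mean) "  (N = " r(N) ")"
```

The // comments work inside a do-file; when typing interactively, leave them out.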
DATA MANAGEMENT
• If you are only interested in a subset of your data, you can inspect it using
filters (the if qualifier). E.g. if you are only interested in cars within a particular price or mileage range you
can type:
– sum if price>=3000 & price<=4400
– sum if mpg>=16 & mpg<=23
• Contrast these with the | (or) operator, which here keeps every observation, since any value satisfies at least one of the two conditions:
– sum if price>=3000 | price<=4400
– sum if mpg>=16 | mpg<=23
• Interpretation of relational and logical operators in STATA.
>=          greater than or equal to
<=          less than or equal to
==          equal to
&           and
|           or
!= or ~=    not equal to
>           greater than
<           less than
.           missing
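One point worth checking when combining these operators: Stata treats a missing value (.) as larger than any number, so a "greater than" filter also picks up observations with missing values. The sketch below, assuming auto.dta, shows the difference using rep78, which is missing for five cars.

```stata
sysuse auto, clear

count if rep78 > 3                     // also counts the 5 cars with missing rep78
count if rep78 > 3 & rep78 < .         // excludes the missing values
count if rep78 > 3 & !missing(rep78)   // same result, written with missing()
```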
DATA MANAGEMENT
• The usual arithmetic operators (+,-,*,/) are
applicable in STATA.
• STATA allows users to tabulate variables to
know the distribution of a variable
– tabulate mpg
– tab mpg
DATA MANAGEMENT
• Some variables have value labels
attached to their values. If the user wants to
see the underlying numeric codes instead of the labels, type:
– tab varlist, nolabel
– Eg: tab foreign, nolabel
GENERATING NEW VARIABLES
• You can create a new variable by combining existing
variables or by performing some arithmetic operations.
[gen, egen, recode]
• To create a ratio of two variables:
– gen mpgratio=mpg/weight
– sum mpgratio
DATA MANAGEMENT
The same procedure can be applied to obtain
traditional transformations such as:
Square
gen mpg2=mpg^2
Cubic
gen mpg3=mpg^3
Square roots
gen mpgsqrt=sqrt(mpg)
Exponential
gen expmpg=exp(mpg)
Natural logs (ln() and log() both give the natural log)
gen lnmpg=ln(mpg)
gen logmpg=log(mpg)
Base 10 logs
gen l10mpg=log10(mpg)
DATA MANAGEMENT
• Eg: gen lprice=log(price+1)
– Why +1? The log of zero is undefined, so STATA returns a missing
value wherever the variable is zero. Adding 1 avoids this.
• Sometimes the user may want to generate a new
variable only for observations that meet a particular condition.
– gen lprice2=log(price) if mpg==.
(a new name is needed because lprice already exists)
– gen llprice=log(price) if mpg>15
• The generate command can also be used to
create new (binary) variables.
– Eg: from the auto.dta data set we are using, we may be
interested in finding out how many cars had a 1978 repair
record (rep78, rated from 1 to 5) above 2. Thus we create a new
variable repair = 1 if rep78 is greater than 2
or 0 otherwise.
DATA MANAGEMENT
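• Use the command:
gen repair=1 if rep78>2 & rep78<.
replace repair=0 if rep78<=2
• NB: STATA treats missing values as larger than any number, so without
the condition rep78<. cars with a missing repair record would be coded 1.
For the same reason avoid replace repair=0 if repair==. which would code them 0.
• You can also create categorical variables from a
continuous variable.
tab mpg
gen mpgcat=1 if mpg<16
replace mpgcat=2 if mpg>=16 & mpg<26
replace mpgcat=3 if mpg>=26 & mpg<=35
replace mpgcat=4 if mpg>35 & mpg<.
tab mpgcat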
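The binary variable can also be created in a single step, because a logical expression in Stata evaluates to 1 (true) or 0 (false). This is a minimal sketch assuming auto.dta. The variable name highrep is hypothetical.

```stata
sysuse auto, clear

* (rep78 > 2) is 1 when true and 0 when false; the if qualifier leaves
* highrep missing wherever rep78 itself is missing.
generate byte highrep = (rep78 > 2) if !missing(rep78)

tabulate highrep, missing   // the missing option also shows the missing cases
```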
DATA MANAGEMENT
• tabulate ..., generate()
This command is useful for creating a set of
dummy variables (variables with a value of 0
or 1) depending on the value of an existing
categorical variable. The syntax is:
tab oldvar, gen(newvar)
Eg: tab rep78, gen(repair)
tab foreign, gen(origin)
• The old variable is categorical. The new
variables will take the form: newvar1,
newvar2, newvar3, ...
DATA MANAGEMENT
EGEN
This is an extended version of “generate” that creates a new variable by
aggregating the existing data. The syntax is:
egen newvar = fcn(argument) [if exp] [in range], by(var)
where
newvar is the new variable to be created
fcn is one of numerous functions such as: count(); max(); min();
mean(); median(); rank(); sd(); sum()
argument is normally just a variable
var in the by() option must be a categorical variable
Eg:
egen avg=mean(mpg) : creates a variable holding the average mpg over the entire sample
egen avg2=median(weight), by(foreign) : creates a variable holding the median weight
of cars for each origin.
egen totalrepairs=sum(rep78), by(foreign) : generates the total of rep78
for vehicles from each origin.
egen prodwgt=sum(weight*price), by(make)
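To see what egen with by() produces, it helps to display one value per group afterwards. This is a minimal sketch assuming auto.dta. The variable names avg_mpg, med_wgt and n_origin are hypothetical.

```stata
sysuse auto, clear

egen avg_mpg  = mean(mpg)                     // same value in every row
egen med_wgt  = median(weight), by(foreign)   // median weight within each origin
egen n_origin = count(mpg), by(foreign)       // cars with nonmissing mpg per origin

* The new variables are constant within a group, so their group mean
* simply displays the group value.
tabstat avg_mpg med_wgt n_origin, by(foreign) statistics(mean)
```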
DATA MANAGEMENT
recode
• This command changes the values of a
categorical variable according to the rules
specified. The syntax is:
– recode varname (oldvalue=newvalue)
(oldvalue=newvalue) ... [if exp] [in range]
– recode foreign (0=1) (1=2)
– recode rep78 (.=9) (*=7)
DATA MANAGEMENT
recode can also leave the original variable untouched and
store the recoded values in a new variable if the generate() option is
used.
recode rep78 (1/2=1) (3=2) (4/5=3), gen(repcat)
This creates a new variable that takes on the value 1, 2 or 3. Values of
rep78 that do not fall in any of the ranges given in the recode command are
copied unchanged, so repcat is missing wherever rep78 is missing.
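When generate() is used, each recode rule can also carry a value label, which saves a separate label define step. This is a sketch assuming auto.dta. The names repcat3 and repcat3lbl and the label texts are hypothetical.

```stata
sysuse auto, clear

* Each rule is (old values = new value "label"); /// continues the line.
recode rep78 (1/2 = 1 "Poor or fair")      ///
             (3   = 2 "Average")           ///
             (4/5 = 3 "Good or excellent"), ///
       generate(repcat3) label(repcat3lbl)

tabulate repcat3, missing
```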
Xtile
• This command creates a new variable that
indicates which category a record falls into, when
the sample is sorted by an existing variable and
divided into n groups of equal size.
• The syntax is:
– xtile newvar=variable [if exp] [in range], nq(#)
newvar is the new categorical variable created. variable
is the existing variable used to create the quantiles. # is
the number of categories.
Eg: xtile mpg1quint=mpg, nq(5)
xtile weight1dec=weight, nq(10)
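A quick way to check an xtile result is to tabulate the new variable and look at the range of the original variable within each group. This is a sketch assuming auto.dta. The name mpgquint is hypothetical. Because several cars share the same mpg, the groups need not be exactly equal in size.

```stata
sysuse auto, clear

xtile mpgquint = mpg, nquantiles(5)     // 1 = lowest fifth, 5 = highest fifth

tabulate mpgquint                       // number of cars in each quintile
tabstat mpg, by(mpgquint) statistics(n min max)
```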
LIST
The most detailed of the commonly used descriptive commands is list.
list displays the values of variables by observation. If varlist is not
specified, the output will contain the values of every variable.
list varlist or l varlist
Eg: list mpg
Xi: Indicator Variables
A complete set of mutually exclusive categorical indicator dummy
variables can be created in several ways. A simpler method is the xi
command:
xi i.rep78, noomit
The noomit option is added because the default setting is to omit the
lowest category.
INSPECT
inspect variable [if exp] [in range]
Gives a small histogram, the number of values that are: unique; positive,
zero, negative; integer and non-integer; missing.
LABEL VARIABLE
This command is used to attach labels to variables in order to make the output easier to
understand. For example, we know that maritalstat indicates the marital status of the head
of household. But other people using the tables may not know this. So we may want to
label the variables as follows:
label variable region "Region of country"
label variable maritalstat "marital status"
LABEL VALUES
This command attaches a named set of value labels to a categorical variable. The syntax is:
label values varname lblname
where
varname is the categorical variable which will get the labels
lblname is a set of labels that have already been defined by label define
Here are some examples of labeling values in Stata.
label variable yield "Yield (tons/hectare)"       gives label to variable yield
label define yesno 0 "no" 1 "yes"                  defines set of labels called yesno
label values electricity yesno                     attaches labels to the variable electricity
label define yesno 3 "perhaps", add                adds new value label to existing set
label define yesno 3 "maybe", modify               modifies existing value label
label define reglbl 1 "West" 2 "Center" 3 "East"   defines regional labels
label values region reglbl                         attaches regional labels to region
label define reglbl 2 "Central", modify            modifies regional labels
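The same three steps (label the variable, define the value labels, attach them) can be tried on auto.dta. This is a sketch only: the indicator heavy, the 3,000 lbs. cut-off and the label set heavylbl are hypothetical.

```stata
sysuse auto, clear

generate byte heavy = weight > 3000        // 1 if the car weighs over 3,000 lbs.
label variable heavy "Weighs more than 3,000 lbs."
label define heavylbl 0 "no" 1 "yes"       // define the set of value labels
label values heavy heavylbl                // attach the set to the variable

tabulate heavy                             // shown with labels
tabulate heavy, nolabel                    // shown with the underlying codes
```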
TABULATE … SUMMARIZE
• This command creates one- and two-way tables that summarize
continuous variables. The command tabulate by itself gives
frequencies and percentages in each cell (cross-tabulations). With the
“summarize” option, we can put means and other statistics of a
continuous variable.
• The syntax is:
tabulate varname1 varname2 [if exp] [in range], summarize(varname3)
options
• where
– varname1 is a categorical row variable
– varname2 is a categorical column variable (optional)
– varname3 is the continuous variable summarized in each cell
– options can be used to tell Stata which statistics you want
• tab make, sum(mpg) gives the mean, std deviation, and frequency of
mpg for each car model.
• tab make, sum(price) means gives the mean price for each car
• tab foreign rep78, sum(price) gives the same statistics of price for each
combination of origin and repair record
Tabstat
This command gives summary statistics for a set of continuous
variables for each value of a categorical variable.
The syntax is:
tabstat varlist [if exp] [in range] , stat(statname [...])
by(varname)
where
varlist is a list of continuous variables
statname is a type of statistic
varname is a categorical variable.
Example:
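A minimal sketch of the tabstat syntax above, assuming auto.dta. The choice of variables and statistics is only for illustration.

```stata
sysuse auto, clear

* Mean, standard deviation, minimum, maximum and number of observations
* of three continuous variables, separately for domestic and foreign cars.
tabstat price mpg weight, statistics(mean sd min max n) by(foreign)
```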
table
This command can create many types of tables. It is probably the most
flexible and useful of all the table commands in Stata. The syntax is:
table rowvar colvar [if exp] [in range], c(clist) [row col]
where
rowvar is the categorical row variable
colvar is the categorical column variable
clist is a list of statistic and variables
row is an option to include a summary row
col is an option to include a summary column
Examples:
• table foreign, c(mean rep78 sd rep78 median rep78) – table of rep78
statistics by origin
• table foreign rep78, c(mean mpg) – table of average mpg by origin and
repair record
• table foreign, c(mean rep78 mean mpg) – table of average rep78 and
mpg by origin
MODIFYING DATA FILES
• This section describes a number of commands that are used to
modify and combine data files in Stata.
rename, drop, keep
rename
This command renames variables. Syntax:
rename oldname newname
• Eg: rename mpg mile_per_gallon
drop
This command deletes records or variables.
drop if price>=4000
drop if foreign==1
keep
This command deletes everything but the specified observations or
variables. Observations and variables are kept in separate commands.
keep if price<=3000
keep if foreign==1
keep mpg rep78 headroom trunk
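Because drop and keep change the data in memory, it can be safer to wrap them in preserve and restore while experimenting. This is a sketch assuming auto.dta and that the commands are run together from a do-file.

```stata
sysuse auto, clear

preserve                    // take a temporary copy of the data in memory
keep if foreign == 1        // keep the foreign cars only
keep make mpg rep78         // then keep three variables
describe, short             // 22 observations and 3 variables remain
restore                     // bring back the full data set

count                       // all 74 cars are back
```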
PRESENTING DATA WITH GRAPHS
• In Stata, graphs are primarily made with the graph command, followed by
numerous subcommands for controlling the type and format of graph. In
addition to graph, there are many other commands that draw graphs.
graph, twoway, bar, pie, matrix, connect( ), msymbol( ), histogram, scatter
http://www.stata.com/support/faqs/graphics/piechart.html
PRESENTING DATA WITH GRAPHS
graph
This command generates numerous types of graphs and
diagrams. The syntax is:
graph graphtype [varlist] [if exp] [in range] [, options]
where
graphtype is the type of graph
varlist is the list of variables to graph
if is used to limit observations that are included based on the
exp condition
in is used to limit observations that are included based on the
case number
options are commands to control the look of the graph
• graph bar income, over(sexhead) over(locality)
Histograms
histogram income, by(sexhead) normal bin(20)
histogram income, by(locality) normal bin(20)
histogram mpg, by(foreign) normal bin(20)
NB: bin() sets the number of bars (bins) to
include in the histogram
Scatter Plots
scatter mpg price
scatter mpg price, by(foreign)
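The income, sexhead and locality variables above come from a different data set. The same graph types can be tried on auto.dta, as in the sketch below. The number of bins and the axis title are arbitrary choices.

```stata
sysuse auto, clear

histogram mpg, by(foreign) normal bin(10)     // one histogram per origin
scatter mpg weight, by(foreign)               // one scatter plot per origin

* Overlay a scatter plot and a fitted line in a single twoway graph.
twoway (scatter mpg weight) (lfit mpg weight), ytitle("Mileage (mpg)")

graph bar (mean) mpg, over(foreign)           // average mpg by origin
```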
• PIE CHARTS
In Stata, pie and bar charts are drawn using the sum of the variables
specified. Therefore, any zero values will not appear in the chart, as they
sum to zero and make no difference to the sum of any other values. If you
have a categorical variable that contains labeled integers (for example, 0
or 1, or 1 upwards), and you want a pie or bar chart, you presumably want
to show counts or frequencies of those integer values. To create pie charts,
first run the variable through tabulate to produce a set of indicator
variables:
Eg:
tab foreign, gen(f)
graph pie f1 f2
Try:
tabulate rep78, generate(r)
graph pie r1 r2 r3 r4 r5
graph bar (sum) r1 r2 r3 r4 r5
• Do-file Editor
A Do-file is a file that stores a Stata program (a set of
commands) so that you can edit it and run it later.
The Do-file Editor is like a simplified word processor for
writing Stata programs. Why use the Do-file Editor
rather than the Command window or the menu
system?
– It makes it easier to check and fix errors,
– it allows you to run the commands later,
– it lets you show others how you got your result, and
– it allows you to collaborate with others on the analysis.
In general, any time you are running more than 5-10
commands to get a result, it is easier and safer to use a do-file to store the commands.
• LOG FILES
• You can click on File/Log to begin or close a log file (Suspend
and Resume are to temporarily turn off and on the log).
• You can use “log” commands in the Command window
• You can use “log” commands in a Do-file.
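The log commands mentioned above can be sketched as follows. The file name workshop_session.log is hypothetical, and the text option (a plain-text log rather than Stata's formatted one) is a choice made for this example.

```stata
capture log close                                // close any log left open earlier
log using "workshop_session.log", replace text   // begin a new plain-text log

sysuse auto, clear
summarize mpg price                              // recorded in the log

log off                                          // suspend logging
describe                                         // not recorded
log on                                           // resume logging

log close                                        // close the log file
```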
OPENING STATA FILES (.dta)
To open a STATA file:
use filename, clear
Eg: use "G:\fenergydata.dta", clear
use varlist using filename, clear [for a subset of the variables in the data file].
Alternatively you can use the drop-down menu bar to open the data
– File/open/… (select the data)
IMPORTING EXCEL DATA
To import data from Excel, first save the data in a text format such as CSV
(comma-separated) or tab-delimited. For such non-STATA files, the command for importing data is
“insheet using”
– insheet using filename, clear
– Eg: insheet using "C:\Users\myjumens\Desktop\fenergydata.csv"
• Alternatively you can use the drop-down menu bar to import the data.
– File/import/ASCII data created by spreadsheet/ …… (select the data)
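To try insheet without an Excel file at hand, one can first write part of auto.dta to a CSV file with outsheet and then read it back. This is a sketch only. The file name auto_subset.csv is hypothetical and is written to the current working directory.

```stata
sysuse auto, clear

* Write three variables to a comma-separated text file.
outsheet make price mpg using "auto_subset.csv", comma replace

* Read the file back in, replacing the data in memory.
insheet using "auto_subset.csv", comma clear
describe
```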
CODING QUESTIONNAIRES INTO STATA
• Coding data into STATA can be done in the
DATA VIEW
– Generate new variables.
Eg: gen q1=.
gen q2=.
– Click Data Editor on the menu bar
– Click on Variables Manager
– Type the variable name
– Type the variable label
– Click Apply to add your changes into the system
– Click on Manage to display a new dialog box
• Creating Value Labels
– Click on Create Label
– Type the value label here
– Type in the value. Eg: 1
– Type in the corresponding label for the value assigned
– Click on Add
• Note that you can create all the value labels
for all the questions before exiting the
manage value label dialog box
• Assign the value labels you have entered to their
corresponding questions, or variables, in the
Variables Manager.
• Exit the Variables Manager dialog box and go
back to the data editor.
• You can now type in the coded response.
MICROECONOMETRIC
REGRESSION ANALYSIS
• Ordinary Least Squares
• Probit Models
• Logit Models
• Ordered Probit/Logit Models
• Multinomial Logit Models
• Tobit Models
Ordinary Least Squares
Like most statistical packages, STATA allows users
to run basic regressions such as ordinary least squares (OLS).
The syntax is:
regress depvar indepvars
Eg: regress gpa tuce psi
reg gpa tuce psi
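The gpa, tuce and psi variables belong to the teaching data set used later in these slides. As a sketch that can be run straight away, the same syntax is applied here to auto.dta. The model itself (mpg on weight and foreign) is only an illustration.

```stata
sysuse auto, clear

regress mpg weight foreign                 // depvar first, then the regressors

* Coefficients and summary results remain available after estimation.
display "Slope on weight: " _b[weight]
display "R-squared:       " e(r2)

predict mpghat, xb                         // fitted values in a new variable
```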
LOGIT AND PROBIT MODELS
• Probit and logit models are among the most
widely used members of the family of
generalized linear models in the case of binary
dependent variables.
• This group of models allows researchers to
analyse data on issues where the
dependent variable is binary (0, 1).
– Eg: yes/no; married or not married; foreign or
domestic
PROBIT MODEL
Let us examine whether a new method of teaching
economics, PSI, significantly influences performance in later
economics courses using the probit model. The dependent
variable used is GRADE, which indicates whether a student’s
grade in an intermediate macroeconomics course was higher
than that in the principles course.
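The probit model is specified as:
$\Pr(\mathrm{GRADE}=1 \mid \mathrm{GPA},\mathrm{PSI},\mathrm{TUCE}) = \Phi(\beta_0 + \beta_1\,\mathrm{GPA} + \beta_2\,\mathrm{PSI} + \beta_3\,\mathrm{TUCE})$
where $\Phi$ is the standard normal cumulative distribution function.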
• Estimation of Probit Model
probit grade gpa psi tuce
• The basic probit command reports coefficient estimates
and their standard errors. These coefficients
are index coefficients: all they tell us is the
direction of the effect and the partial effect on the probit
index/score. They do not correspond to the average
partial effects on the probability.
• Let’s try to interpret the results:
– Tuce: a one-unit increase in tuce increases the probit
index by 0.05 standard deviations.
– But are we concerned with the probit index? No
• In analysing binary choice models the parameters of
interest are not the index coefficients but the
marginal/partial effects.
Marginal Effects
• A marginal effect gives the derivative of the probability that the dependent
variable equals one with respect to a particular conditioning
variable.
In STATA these marginal effects can be computed using two
methods
– dprobit
– mfx compute
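For a continuous regressor, the marginal effect at the means is the standard normal density at the index, multiplied by the coefficient. The sketch below works this out by hand and then with the two commands listed above. It assumes Stata 11 and uses auto.dta (probit of foreign on weight and mpg) as a stand-in, since the teaching data set is not shipped with Stata.

```stata
sysuse auto, clear
probit foreign weight mpg

* Evaluate the index x'b at the sample means of the regressors.
quietly summarize weight if e(sample)
local mw = r(mean)
quietly summarize mpg if e(sample)
local mm = r(mean)
local xb = _b[_cons] + _b[weight]*`mw' + _b[mpg]*`mm'

* Marginal effect of mpg at the means: density at the index times coefficient.
display "Marginal effect of mpg at the means: " normalden(`xb')*_b[mpg]

mfx compute                    // marginal effects after probit
dprobit foreign weight mpg     // probit reported directly as marginal effects
```

The hand-computed figure should be close to the dy/dx reported for mpg.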
Interpretation
For a one-unit increase in an explanatory variable
from the baseline, the probability of an event is
expected to increase/decrease by the marginal effect.
For instance, for a one-unit increase in GPA from the
baseline (3.11), the probability of grade
improvement increases by 53.3 percentage points.
NB: The interpretation for dummy variables differs:
The reported effects are discrete changes, not
marginal effects
The interpretation of PSI is that a student exposed to
PSI has a probability of grade improvement that is 0.46
higher than that of another student who is not exposed to
the same method.
LOGIT MODEL
The logit model yields results similar to those of the
probit model.
• The coefficients of the logit model are quite
difficult to interpret directly, since they measure changes in the
log-odds rather than in the probability.
• As a result we compute the odds ratios and the
marginal effects
• MARGINAL EFFECTS
• In STATA these marginal effects can be computed
using the mfx command.
• Recall that for a one-unit increase in an explanatory
variable from the baseline, the probability of an
event is expected to increase/decrease by the
magnitude of the marginal change, holding other
variables constant
• In our case a one-unit increase in GPA from the
baseline mark of 3.11 increases the probability of
grade improvement by 53.3 percentage points
• A one-unit increase in the previous knowledge of
the material from the baseline (21.93) increases
the probability of grade improvement by 1.8 percentage points.
• What about PSI?
ODDS RATIO
• Odds are a way of presenting probabilities,
but unless you know much about betting you
will probably need an explanation of how
odds are calculated. The odds of an event
happening is the probability that the event
will happen divided by the probability that the
event will not happen.
• Stata command: add the or option
logit grade gpa psi tuce, or
Being exposed to the new teaching method (PSI) increases the odds of performing well
by a factor of 10.79.
For every one-unit increase in GPA, the odds of improving performance increase by a factor of
16.87
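An odds ratio is simply the exponential of the corresponding logit coefficient. This can be checked by hand, as in the sketch below, which uses auto.dta (logit of foreign on weight and mpg) as a stand-in for the teaching data.

```stata
sysuse auto, clear

logit foreign weight mpg              // coefficients in log-odds units
display "Odds ratio for mpg, by hand: " exp(_b[mpg])

logit foreign weight mpg, or          // same model, reported as odds ratios
```

An odds ratio above 1 means the odds rise with the variable. Below 1 means they fall.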
ROBUSTNESS
Cross-sectional data are usually plagued by the
problem of heteroscedasticity.
• This statistical deficiency has implications for
the results of binary choice models.
• Thus to report standard errors that are robust
we use the option r or robust.
– Eg:
probit grade psi tuce gpa, r
probit grade psi tuce gpa, robust
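The robust option changes the standard errors but not the coefficient estimates. The sketch below, which uses auto.dta as a stand-in, stores both from the plain fit and compares them with the robust fit. The scalar names are hypothetical.

```stata
sysuse auto, clear

probit foreign weight mpg
scalar b_plain  = _b[mpg]             // coefficient, conventional fit
scalar se_plain = _se[mpg]            // standard error, conventional fit

probit foreign weight mpg, robust
display "Coefficient: " b_plain  " (plain)  " _b[mpg]  " (robust)"
display "Std. error:  " se_plain " (plain)  " _se[mpg] " (robust)"
```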
ORDERED PROBIT/LOGIT
Some multinomial choice models are inherently ordered.
Examples include:
• Bond ratings
• Opinion surveys
• Assignment of military personnel to job classifications by skill level
• Voting outcomes on certain programs
• The level of insurance coverage taken by a consumer: none, part, or
full
• Employment status: unemployed, part-time, or full-time
• In each of these examples the outcome is discrete, but the
multinomial logit, conditional logit and nested logit models
would fail to account for the ordinal nature of the dependent
variable.
• The ordered probit/logit models, however, account for
these ordinal properties.
ORDERED PROBIT
Suppose we wish to analyze the 1977 repair records
of 66 foreign and domestic cars. The 1977 repair
records take the values poor, fair, average, good and
excellent. The main research problem is to
explore the factors that explain the repair records
in 1977.
The categories are:
1. Poor
2. Fair
3. Average
4. Good
5. Excellent
MARGINAL EFFECTS
• We need the marginal effects to interpret the
results of the ordered probit effectively
• The marginal effects show how the probabilities
of each outcome change with respect to changes
in the regressors.
• To calculate the marginal effects we run the mfx
command separately for each outcome.
– mfx, predict(outcome(1))
– mfx, predict(outcome(2))
– mfx, predict(outcome(3))
– mfx, predict(outcome(4))
– mfx, predict(outcome(5))
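Rather than typing mfx once per outcome, the commands can be placed in a loop. This is a sketch assuming Stata 11. It uses the 1978 repair record (rep78) in auto.dta as a stand-in for the 1977 records, and the regressors are chosen only for illustration.

```stata
sysuse auto, clear

oprobit rep78 mpg weight foreign      // rep78 takes the values 1 to 5

* One set of marginal effects per outcome category.
forvalues k = 1/5 {
    mfx compute, predict(outcome(`k'))
}
```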
ORDERED LOGIT
MARGINAL EFFECTS
MULTINOMIAL LOGIT
The multinomial logit (MNL) model, also known
as multinomial logistic regression, is a regression
model which generalizes logistic regression by
allowing more than two discrete outcomes.
That is, it is a model that is used to predict the
probabilities of the different possible outcomes
of a categorically distributed dependent variable,
given a set of independent variables (which may
be real-valued, binary-valued, categorical-valued,
etc.).
IMPLEMENTATION IN STATA
• Stata estimates the multinomial logit with the mlogit command. The closely
related multinomial probit (MNP), used in this example, is estimated with mprobit. To
use mprobit we must have a single observation for each
decision maker in the sample.
• Eg: We use data on the type of health insurance available
to 616 psychologically depressed subjects in the US.
Patients may have either an indemnity (fee-for-service)
plan or a prepaid plan such as a Health Maintenance
Organization (HMO), or the patient may be uninsured.
• Demographic variables include age, gender, race and site.
• Indemnity insurance is the most popular alternative, so
Stata will choose it as the base outcome by default.
– The main research problem is to explore the factors that
explain the choice of health insurance
mprobit insure age male nonwhite site2 site3
Computation of the Marginal effects
• We need the marginal effects to interpret the results of MNP
effectively.
• The marginal effects show how the probabilities of each
outcome change with respect to changes in regressors
• To calculate the marginal effects we run the mfx command
separately for each outcome.
Interpretation:
• TOBIT MODEL
• There are instances whereby the variable we are
investigating is censored at a point.
• For instance, suppose our research objective is to explore
the factors that explain the mileage rating (mpg) of
cars.
• Mpg in our data ranges from 12 to 41
• Assume that our data are censored so that we
could not observe a mileage rating below 17
mpg.
CENSORING THE MPG
• If the true mpg is 17 or less, all we know is
that the mpg is less than or equal to 17.
• Let’s first generate a new variable called mpg1
– gen mpg1=mpg
• Replace any value that is equal to 17 or
below with 17
– replace mpg1=17 if mpg<=17
– (14 real changes made)
Let’s see what we actually observe after censoring
IMPLEMENTATION IN STATA
• Notice that our dependent variable mpg is not
dichotomous but continuous.
• Let’s run two regressions
• Create wgt by dividing weight by 1000 to
make our discussions interesting
• gen wgt=weight/1000
TYPES OF TOBIT
• Left censored Tobit model
• Right censored Tobit model
We can estimate a tobit model by instructing
the software to censor the data from
below (left censored), from above (right censored) or
both.
Left censored Tobit model
– Using the already censored data mpg1
• Using the uncensored data, we could instruct
the software to censor it in the estimation by
using the option ll(…)
– tobit mpg wgt, ll(17)
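The censoring steps and the left-censored tobit can be run together as in the sketch below, which assumes auto.dta. The final OLS regression is added only as a point of comparison and is not part of the original example.

```stata
sysuse auto, clear

generate wgt  = weight/1000           // weight in thousands of pounds
generate mpg1 = mpg
replace  mpg1 = 17 if mpg <= 17       // censor mileage from below at 17

tobit mpg1 wgt, ll(17)                // tobit on the already censored variable
tobit mpg  wgt, ll(17)                // same fit: Stata censors mpg at 17 itself

regress mpg1 wgt                      // OLS on the censored data, for comparison
```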
Right censored Tobit model
Two-limit Tobit models
• Tobit regression coefficients are interpreted in
the same manner as OLS regression
coefficients, except that they refer to the uncensored
(latent) variable.
• For a one-unit increase in wgt (1,000 lbs.), there is a
6.2 point decrease in the predicted value of
mpg. In other words, a one-unit increase in the
weight of the car is associated with a 6.2-unit
decrease in mileage.
Computation of the Marginal effects
• We need the marginal effects to interpret the results of tobit
model effectively.
• The marginal effects show how the predicted
outcome changes with respect to changes in the regressors
• To calculate the marginal effects we run the mfx command
• NB: By default the marginal effects are just the same as the coefficients from the
tobit regression, since mfx then differentiates the linear prediction
Starting with do-files
version 11
set matsize 400
clear
set mem 1000
capture log close
set more off
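Put together, a complete do-file might look like the sketch below. The log file name analysis.log and the commands in the body are hypothetical. The set matsize and set mem lines from the header above are left out here because suitable values depend on the machine and the data.

```stata
* ---- header ----
version 11                  // run the commands as Stata 11 would
clear                       // start with no data in memory
capture log close           // close a log left open by an earlier run
set more off                // do not pause when the output fills the screen
log using "analysis.log", replace text

* ---- body ----
sysuse auto, clear
summarize mpg weight
regress mpg weight foreign

* ---- footer ----
log close
```

Saving this as a .do file and running it reproduces the whole analysis and its log in one step.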