Lecture 1 - University of Oregon

Lecture 1: Introduction
Math Boot Camp
Will Terry
Department of Political Science
University of Oregon
September 16, 2013
Objectives of Math Camp
Have a good time learning about the wonders of math(s)!
Get ready for PS545-546….
Objectives of PS545-546
• The objectives of our sequence are twofold:
(1.) to improve your ability to read mainstream quantitative
research, and
(2.) to provide a broad overview of the main tools of quantitative
analysis.
• We will focus on the linear regression model.
• You will become familiar with Stata.
Statistical software
• This course will focus on practical computing skills that you might
find useful in your future research.
– There are reasons to spend some time with R to appreciate the capabilities of
statistical computing.
– Given the limited time, we will focus on developing Stata skills as much
as possible.
• We will master the basic components of statistical computing.
– Data management
– Estimating regression models
– Graphing
The standard political science stats education
I. Basic probability theory
- random variables
- PDFs
- CDFs
II. Statistical inference theory
- confidence intervals, hypothesis testing, p-values, etc.
III. Linear regression analysis
- the workhorse model of the social sciences
IV. Binary Outcome Models & Other Extensions of the Basic
Linear Model
V. Time Series Cross Sectional Models
First, some key terms…
Causality
Phenomenon Y (e.g. income) is affected by factor X (e.g., gender)
Statistical inference
Drawing conclusions about the world based on characteristics of sample data.
Typically we are interested in understanding “population parameters.”
Independent variable (syn. “regressor”, RHS var)
The variable that is exogenously manipulated or changed.
Dependent variable (syn. “regressand”, LHS var)
Its value “depends” on the value taken by the independent variables.
Random variables and hypothesis testing
Random Variable (RV)
A variable whose values are determined by chance.
Probability Density Function (PDF)
Describes how an RV is “distributed,” i.e., the relative likelihood of the values
the RV can take.
Parameter
Characteristic or measure that describes a population.
Statistic (not to be confused with Statistics)
Characteristic or measure obtained from a sample.
Common ways to distinguish variables
Qualitative Variables
Variables that take non-numerical values. (e.g., eye color; gun ownership)
Quantitative Variables
Variables that take numerical values. (e.g., number of credit cards in one’s wallet;
time elapsed since the Compromise of 1877)
Discrete Variables
Variables which assume a finite or countable number of possible values. Usually
obtained by counting. (e.g., the number of credit cards in one’s wallet)
Continuous Variables
Variables which can take any value within an interval, so their possible values
cannot be counted. Usually obtained by measurement. (e.g., time elapsed since the
Compromise of 1877)
Hypothesis testing terminology
Population
All subjects possessing a common characteristic that is being studied.
Sample
A subgroup or subset of the population.
Statistics
Collection of methods for planning experiments, obtaining data, and then
organizing, summarizing, presenting, analyzing, interpreting, and drawing
conclusions.
Hypothesis testing
Research design
• Research design is the means by which we attempt to uncover
causal relationships between variables using data that we collect.
• In the jargon of the trade, the objective is to “identify” the
effect of a “treatment.”
• Conceptually, one wants to make a comparison between two
identical subjects: one who received the treatment, and one who
did not.
• A pure experiment is the gold standard. Unfortunately, this ideal
is generally infeasible in the social sciences.
Language of research design
Treatment group
The group that receives the treatment.
Control group
The group that does not receive the treatment.
Experimental data
Data derived from a process whereby the researcher determines the receipt of the
treatment.
Non-experimental data (syn. “observational data”)
Data in which the administration of the treatment is determined by factors beyond
the researcher’s control.
The standard political science stats education
I. Basic probability theory
- random variables
- PDFs
- CDFs
II. Statistical inference theory
- confidence intervals, hypothesis testing, p-values, etc.
III. Linear regression analysis
- the workhorse model of the social sciences
IV. Binary Outcome Models & Other Extensions of the Basic
Linear Model
V. Time Series Cross Sectional Models
Linear regression analysis
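A. Univariate regression model
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ (There is one IV)
B. Multivariate regression model
$y_i = \beta_0 + \beta_1 x_i + \beta_2 z_i + \varepsilon_i$ (There are two IVs)
$y_i = \beta_0 + \beta_1 x_{1i} + \dots + \beta_N x_{Ni} + \varepsilon_i$ (There are N IVs)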
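As a rough illustration of the univariate model, the sketch below estimates β0 and β1 by ordinary least squares using the textbook closed-form formulas. The data values and the function name `fit_simple_ols` are made up for this example; the course itself uses Stata for estimation.

```python
def fit_simple_ols(x, y):
    """Return (b0, b1), the OLS estimates for y_i = b0 + b1 * x_i + e_i."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: sum of cross-deviations divided by sum of squared x deviations.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    # Intercept: the fitted line passes through the point of means.
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data generated exactly by y = 1 + 2x (no error term).
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
b0, b1 = fit_simple_ols(x, y)
print(b0, b1)
```

Because the made-up data lie exactly on a line, the estimates recover the intercept 1 and the slope 2; with real data the error term keeps the fit from being exact.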
V. Binary dependent variable models
Used when the dependent variable takes one of two possible
values:
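$$\text{Democrat}_i = \begin{cases} 1 & \text{if citizen } i \text{ is a Democrat} \\ 0 & \text{if citizen } i \text{ is not a Democrat} \end{cases}$$
$$\text{Democrat}_i = f(\text{gender}_i, \text{income}_i, \text{race}_i, \text{age}_i)$$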
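The slide leaves the function f unspecified. One common choice, shown here only as an illustration, is the logistic function, which turns any real number into a probability between 0 and 1. The linear index, its coefficients, and the covariate values below are all hypothetical.

```python
import math

def logistic(z):
    """Map a real-valued index z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical linear index: intercept plus coefficients times covariates.
income_in_10k = 5.0
age_in_decades = 4.0
z = -1.0 + 0.1 * income_in_10k + 0.2 * age_in_decades
prob_democrat = logistic(z)
print(round(prob_democrat, 3))
```

The output is read as the modeled probability that citizen i is a Democrat, which is why a function bounded between 0 and 1 is a natural fit for a 0/1 dependent variable.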
VI. Time series cross sectional models
When the researcher observes the objects of analysis
at multiple points in time.
State      Year   GDP per capita   Ave. Education
Alabama    1970   $5,000           10.3 years
Alabama    1980   $9,500           11.2 years
Alabama    1990   $11,200          12.4 years
Illinois   1970   $7,000           9.3 years
Illinois   1980   $12,500          10.2 years
Illinois   1990   $17,200          13.7 years
New York   1970   $6,000           8.4 years
New York   1980   $11,500          10.1 years
New York   1990   $18,00           14.5 years
(These data have both time series and cross section features.)
What we won’t cover in PS545-6 but might be
useful in your dissertation, future research, etc.
A. Maximum likelihood estimation (MLE) and other procedures
B. Model selection
C. Simultaneous equations/instrumental variables (IV) estimation
D. Matching
E. Non-parametric models
F. Case study selection for qualitative research
And much, much more!
Causality and research design
• Causality is often difficult to determine (wait for the next slide), and
that’s why research design is important.
• An experiment is the gold standard.
• If a treated subject and a control subject are the same in every respect (as
they are in a perfect experiment), we can logically attribute any
difference in the observed outcome to receipt of the treatment.
• In the social sciences, we generally can’t run experiments so we use
statistical techniques to make the treatment and control group as alike as
we can.
Common difficulties in determining causality
One variable causes another, but how do you know which is causal?
Douglas firs <-- ? --> Rainfall
Two variables cause each other.
Expected closeness of race <--> Candidate expenditures
Common difficulties in determining causality
An omitted third variable causes both. (One reason correlation ≠ causation.)
Old age --> Bad Driving
Old age --> Gray Hair
If one were to look only at the relationship between Bad Driving and Gray Hair,
one might be led to the erroneous conclusion that Gray Hair causes people to drive
badly (or that Bad Driving causes one to have Gray Hair).
How could one test these competing hypotheses?
Recall the relationship between ice cream consumption and the NY homicide rate…
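One way to probe the competing hypotheses, sketched here with simulated data, is to compare people within the same age group: if Old Age drives both variables, the association between Gray Hair and Bad Driving should largely vanish within groups. All probabilities below are invented for illustration.

```python
import random

def pearson(a, b):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

rng = random.Random(0)  # fixed seed so the run is reproducible
people = []
for _ in range(10_000):
    old = rng.random() < 0.5
    # Old age raises the chance of gray hair and of bad driving separately;
    # gray hair has no direct effect on driving in this simulation.
    gray = int(rng.random() < (0.8 if old else 0.1))
    bad = int(rng.random() < (0.6 if old else 0.2))
    people.append((old, gray, bad))

overall = pearson([g for _, g, _ in people], [b for _, _, b in people])
among_old = pearson([g for o, g, _ in people if o],
                    [b for o, _, b in people if o])
among_young = pearson([g for o, g, _ in people if not o],
                      [b for o, _, b in people if not o])
print(overall, among_old, among_young)
```

In this setup the overall correlation is clearly positive while the within-group correlations sit near zero, which is the pattern an omitted common cause would produce.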
A research design schematic
R denotes randomized assignment.
N denotes non-randomized assignment.
X denotes receipt of the treatment.
O denotes that the subject is tested.
Some basic mathematical tools
We will review some basic mathematical tools:
- Functions
- Summation operators
- Differential Calculus
Functions
A function is a rule that assigns exactly one value to each input of a
specified type.
A function expresses the intuitive idea that one quantity (the argument of
the function, also known as the input) completely determines another
quantity (the value, or the output).
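In code, a function behaves the same way: the same input always yields the same single output. The rule `f` below is an arbitrary example chosen for illustration.

```python
def f(x):
    """An arbitrary rule: each real input x maps to exactly one output."""
    return 2 * x + 1

inputs = [-1, 0, 1, 2]
outputs = [f(x) for x in inputs]
print(outputs)  # → [-1, 1, 3, 5]
```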
Summation operators
Summation operators are a useful way to represent the sum of a
large set of numbers:
$$\sum_{i=1}^{N} x_i = x_1 + x_2 + \dots + x_{N-1} + x_N$$
The index i indicates which numbers in the set are to be included
in the sum.
The product operator works in a similar fashion.
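A short sketch of both operators, using made-up numbers: the summation adds the indexed terms, and the product multiplies them.

```python
import math

x = [2, 3, 4]  # hypothetical values x_1, x_2, x_3

total = sum(x)          # x_1 + x_2 + x_3
product = math.prod(x)  # x_1 * x_2 * x_3 (math.prod needs Python 3.8+)
print(total, product)   # → 9 24
```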
Summation operators
Suppose your data were,
{x1, x2 , x3 , x4 , x5 , x6 , x7} = {-100,-10, -1, 0, 1, 10, 100}.
Compute the following:
[Exercise expressions garbled in extraction: a set of sums of the x_i over different ranges of the index i, including one taken over odd values of i and one involving 8(x_i).]
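Sums like these can be checked quickly in code. The sketch below uses the seven data values given above; the helper `sum_range` and the particular index ranges are illustrative choices, not necessarily the exercises from the slide. Indices are 1-based to match the notation.

```python
x = [-100, -10, -1, 0, 1, 10, 100]  # x_1 through x_7

def sum_range(values, start, stop):
    """Sum values[start..stop] using 1-based, inclusive indices."""
    return sum(values[start - 1:stop])

all_terms = sum_range(x, 1, 7)       # x_1 + ... + x_7
first_three = sum_range(x, 1, 3)     # x_1 + x_2 + x_3
last_two = sum_range(x, 6, 7)        # x_6 + x_7
# Sum over odd indices i = 1, 3, 5, 7.
odd_index = sum(x[i - 1] for i in range(1, 8) if i % 2 == 1)
print(all_terms, first_three, last_two, odd_index)  # → 0 -111 110 0
```

The data are symmetric around zero, so any sum that pairs each negative value with its positive counterpart comes out to 0.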
Sample mean and sample variance
Every population has a mean (μ) and a variance (σ²). Note that this implies it has a
standard deviation (σ) as well.
The population mean tells you where the population is “centered.” There’s a
sense in which the mean is the middle of the data.
The population variance (or standard deviation) measures how far “spread out”
individuals in the population are. (Obviously, these are always nonnegative.)
The sample mean and sample variance are two fundamental statistics. They
estimate the parameters of the population the data were drawn from.
$$\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x_i$$
$$\hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{\mu})^2$$
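A minimal sketch of the two statistics, dividing by N as on this slide (many textbooks divide the variance by N − 1 instead). The data values and function names are hypothetical.

```python
import statistics

def sample_mean(x):
    """Arithmetic average of the observations."""
    return sum(x) / len(x)

def sample_variance(x):
    """Average squared deviation from the sample mean (divides by N)."""
    mu_hat = sample_mean(x)
    return sum((xi - mu_hat) ** 2 for xi in x) / len(x)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical sample
mu_hat = sample_mean(data)
sigma2_hat = sample_variance(data)
print(mu_hat, sigma2_hat)  # → 5.0 4.0
# statistics.pvariance also divides by N, so it should agree.
print(statistics.pvariance(data))
```

Taking the square root of the variance gives the estimated standard deviation, which is 2.0 for this sample.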
Derivatives
Loosely speaking, a derivative can be thought of as how much one
quantity is changing in response to changes in some other
quantity.
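To make the idea concrete, the sketch below approximates a derivative numerically by nudging the input slightly in each direction and measuring how much the output changes. The function `square` and the step size `h` are arbitrary choices for illustration.

```python
def numerical_derivative(f, x, h=1e-6):
    """Approximate f'(x) with a central difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

def square(x):
    return x ** 2

slope_at_3 = numerical_derivative(square, 3.0)
print(slope_at_3)  # close to 6, since the derivative of x**2 is 2x
```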
Integrals
A definite integral of a function can be represented as the signed
area of the region between its graph and the horizontal axis over the
interval of integration.
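The word “signed” matters: area below the horizontal axis counts as negative. The sketch below approximates a definite integral by adding up thin rectangles (the midpoint rule); the function names and the number of rectangles are illustrative assumptions.

```python
def midpoint_integral(f, a, b, n=1000):
    """Approximate the definite integral of f on [a, b] with n midpoint rectangles."""
    width = (b - a) / n
    return sum(f(a + (k + 0.5) * width) for k in range(n)) * width

def identity(x):
    return x

# Areas below the axis count as negative, so they cancel areas above it.
signed_area = midpoint_integral(identity, -1.0, 1.0)
positive_area = midpoint_integral(identity, 0.0, 1.0)
print(signed_area, positive_area)  # close to 0 and 0.5
```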
Math Camp game plan:
Time to get down to business…
In the remainder of this lecture we will discuss some elementary
results in a branch of mathematics called Real Analysis—i.e., the
branch of math that studies real numbers.
Q: Why do we care about Real Analysis?
A: Because it provides the logical structure that undergirds the
math we use as social scientists.
The next few slides follow a text that is slightly more advanced
than we need, but let’s follow along to develop a few ideas
about the real number line…
The set of real numbers: Special symbols
The real number line
The set of real numbers: Properties
Inequalities
Roots
A cheat sheet of handy rules re real numbers (see the Math Camp website for the complete sheet)
Quadratic equations
Absolute value
Achilles and the tortoise
Bounds
Intervals
Next lecture…
Functions and graphs
- Functions
- Graphs
- Functional forms