
0AP03: Methods and models in behavioral research
Part 2: Understanding statistics using SPSS (Field)
Chris Snijders
www.tue-tm.org/moodle
EXAMPLE: NETFLIX DVD RENTAL
2
Example: The Netflix Prize
$1,000,000
www.netflixprize.com
3
Example: the Netflix prize (3)
[Diagram: several inputs feed into a single output]
inputs = kind of previous rentals, number of previous rentals, day of the week, ...
output = extent to which you like a movie
4
Example: The Netflix Prize (2)
• Predict the extent to which a person will like a
movie, from previous ratings by others.
• NB
– Measurement – Root Mean Square Error
– Large prizes!
– You have about 2 GB of data to work on ...
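The Root Mean Square Error named above can be sketched in a few lines of Python. The ratings below are made-up illustration values, not Netflix data, and the function name `rmse` is simply a label chosen here.

```python
import math

def rmse(predicted, actual):
    """Root Mean Square Error between two equally long lists of ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical ratings on a 1-5 scale (illustration only).
predicted_ratings = [3.5, 4.0, 2.0, 5.0]
actual_ratings = [4, 4, 1, 5]
print(rmse(predicted_ratings, actual_ratings))
```

A lower value means the predictions are, on average, closer to the ratings people actually gave.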
5
0AP03: two parts
Part 1: Blumberg et al., Gerrit Rooks, Blocks A and B
Part 2: Field, Chris Snijders, Blocks B and C
{LANGUAGE=ENGLISH}
6
Understanding statistics using SPSS
http://www.sagepub.co.uk/field/field.htm
- CD-ROM material
-- data sets
-- some software (G*Power)
- answers to (some) assignments in the book
- test banks (note: not identical to exam)
7
www.tue-tm.org/moodle
enrolment key = "fieldspss"
8
Course home-page
http://www.tue-tm.org/moodle
9
Let’s get acquainted …
• Technische Innovatiewetenschappen (Innovation Sciences)
– Bachelor’s
– Pre-Master program
• Technische Bedrijfskunde (Industrial Engineering)
– Bachelor’s
– Pre-Master program
===
A: never heard of it
B: was a topic in previous lectures, but don’t ask me what it is or how to do it
C: was covered and understood
===
Some key concepts:
- Stochastic variables, distributions, normal
distribution
- SPSS usage (StatGraphics users?)
- Mean, median, skewness, kurtosis
- Correlation
- Simple regression: Y = a + b X
- Factor analysis
- A chi2 test
10
Understanding statistics using SPSS
About: Style
About: Content
1 About statistics
2 SPSS
3 Exploring data
4 Correlation
5 Multiple regression
6 Logistic Regression
7 The t-test
8 ANOVA
9-12, 14 More ANOVAs
13 Non-param. tests
15 Factor analysis
16 Chi2-tests etc
11
T-test, chi2-test
• We have two groups of students, one group that
started early and worked regularly, one group
that started late (in the last three lectures or
later)
Are the grades of the students in the regular
group higher? (t-test)
          REGULAR   LATE
Average   6.3       3.0
Max       8.4       3.9
Are the regular students more likely to pass the
course? (chi-2 test)
          REGULAR   LATE
Pass      30        2
No        10        38
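As a rough illustration of the chi-2 question, the Pearson chi-square statistic can be computed directly from the pass/fail counts. This is a minimal sketch without a continuity correction. The function name and the hard-coded 5% critical value for 1 degree of freedom (3.841) are choices made here. The t-test is not sketched because it needs the individual grades, which the slide does not give.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a table given as rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if passing were unrelated to the group.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic

# Counts from the slide: rows = Pass / No, columns = REGULAR / LATE.
observed_counts = [[30, 2], [10, 38]]
chi2 = chi_square_statistic(observed_counts)
CRITICAL_5PCT_DF1 = 3.841  # 5% critical value, 1 degree of freedom
print(round(chi2, 2), chi2 > CRITICAL_5PCT_DF1)
```

A statistic above the critical value suggests that passing and group membership are related in these counts.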
12
Exam for the Field-part
[tentative: check the course website later]
Chapters 1, 2, 3, 4, 7: assumed to be common knowledge
Chapters 5, 8, 15, <probably 9-12, 14, perhaps 6>
+ additional material supplied with the course
(such as PS – software)
===
Exam on laptop:
1 – multiple choice questions
2 – you are given data and must be able to
handle the data sensibly
13
The average (quantitative) paper …
• Problem formulation
– What are sensible questions?
• Theory-development and hypotheses
– “What do I expect to be the answer to my question, and what are
the implications from the theory that I want to test”
(nb: different in exploratory work)
• Choice of research design
– Experiment
– Survey
– Case study
– Participant observation
– …
• Data collection
– Designing questionnaires
– Designing experimental procedures
– Finding your respondents. Sampling (how and how many?)
– …
• Analysis of results
– Measurement: from raw data to measured constructs
– Relational claims: X → Y ?
• Conclusions
– What can we conclude, given our analyses?
14
About the course setup
• Mainly on moodle-site, studyweb only used to
send mail to you
• “Do-it-yourself course”: mastering SPSS, getting
up to speed with SPSS, keeping up with the
material is up to you
– Extra material and links on the website
– Practice material for the exam
If you do not practice in between, you will not
be able to pass the exam.
• Part 1 (Rooks): “Think, then do”
Part 2 (me): “Do, then think”
• We have data, now what do we do?
(and partly we collect these data from you)
• Hybrid setup:
– English/Dutch
– business administration / social sciences
15
THE ART OF SAMPLING
16
Sampling
population
sample
We want conclusions about the population, but
we only have (enough time and money to
collect) data from part of the population, a
sample.
From sample data to population statement:
STATISTICAL INFERENCE
17
Two parts to every analysis
population
sample
• Calculate some property of the sample
– Mean (mean height of soccer players)
– Difference between the means of two groups
(difference in height between two groups of soccer players)
– Correlation between two things measured
(correlation between height and the number of goals
you score)
• Calculate a confidence interval around the
property, creating a statement about the
property in the population
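The two steps can be sketched for the correlation case: first compute the sample correlation, then put an approximate interval around it. The interval uses the Fisher z-transformation, which is a technique chosen for this sketch and not something the slide specifies. The heights and goal counts are invented.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_ci(r, n, z_crit=1.96):
    """Approximate 95% interval via Fisher's z; needs n > 3 and |r| < 1."""
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Hypothetical players: height in cm and goals scored (illustration only).
heights_cm = [170, 175, 180, 185, 190]
goals = [1, 3, 2, 5, 4]

r = pearson_r(heights_cm, goals)                # step 1: property of the sample
low, high = correlation_ci(r, len(heights_cm))  # step 2: interval around it
print(r, low, high)
```

With only five players the interval is very wide, which illustrates why the sample property alone says little about the population.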
18
On sampling "analog cheese"
Analog cheese =
palm oil + starch
"Keuringsdienst van waarde"
took a sample of 11 products
and found 5 to contain
"analog cheese"
→ Estimate of the percentage
of products containing
analog cheese = 5/11 = 45%
What is the (approximate) confidence interval?
A  40 – 50 %
B  32 – 58 %
C  25 – 65 %
D  17 – 77 %
E  9 – 81 %
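One way to check the answer is to compute the interval two ways: with the usual normal approximation for a proportion, and with the cruder rule that the 95% margin is about 1/sqrt(n). Both are approximations and are rough for a sample as small as 11. The function names are chosen for this sketch.

```python
import math

def proportion_ci(successes, n, z_crit=1.96):
    """Normal-approximation 95% interval for a proportion."""
    p = successes / n
    margin = z_crit * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - margin), min(1.0, p + margin)

def rule_of_thumb_ci(successes, n):
    """Cruder interval: proportion plus or minus 1/sqrt(n)."""
    p = successes / n
    margin = 1 / math.sqrt(n)
    return max(0.0, p - margin), min(1.0, p + margin)

# 5 of the 11 sampled products contained analog cheese.
normal_low, normal_high = proportion_ci(5, 11)
rough_low, rough_high = rule_of_thumb_ci(5, 11)
print(normal_low, normal_high)
print(rough_low, rough_high)
```

The two intervals nearly coincide because 1.96 * sqrt(0.25) is close to 1 when the proportion is near 50%.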
19
Applying the 1/sqrt(n) rule
You want to predict how many seats in
parliament a certain Dutch political party
will get. You allow for a range of plus or
minus 2 seats. Say you expect the
number of seats to be around 50.
You intend to call a representative
sample of people. About how many do
you need?
A  50
B  100
C  500
D  5,000
E  50,000
F  more than 50,000
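The 1/sqrt(n) rule can be turned around: if the 95% margin is roughly 1/sqrt(n), then n is roughly 1/margin squared. The sketch below assumes a parliament of 150 seats in total. That number is not on the slide and is needed to convert "2 seats" into a proportion.

```python
import math

def sample_size_for_margin(margin):
    """Rule of thumb: a 95% margin of about 1/sqrt(n) implies n of about 1/margin**2."""
    if not 0 < margin < 1:
        raise ValueError("margin must be a proportion between 0 and 1")
    # The small tolerance keeps floating-point noise from rounding up by one.
    return math.ceil(1 / margin ** 2 - 1e-9)

TOTAL_SEATS = 150          # assumption, not stated on the slide
margin = 2 / TOTAL_SEATS   # plus or minus 2 seats, as a proportion of all seats
n_needed = sample_size_for_margin(margin)
print(n_needed)
```

Because the rule ignores the expected share of about 50 seats out of 150, it is slightly conservative. A calculation with p(1 - p) instead of 0.25 gives a somewhat smaller number of the same order of magnitude.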
20
Some more sampling
Suppose you want to know, say, the percentage
of people in The Netherlands who support the
recent foreign policy of the US-government. The
Netherlands has 12,000,000 voters.
According to your (correct) calculations you
need a sample of 2,000 people.
Now you want to do the same, but in France
(population = 36,000,000 voters).
How large should your sample size be in France?
A  less than 2,000
B  about 2,000
C  about 6,000
D  more than 6,000
E  you need more information
Rule of thumb: For large populations, the required
sample size is independent of the population size
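The rule of thumb can be checked numerically. The sketch assumes that the variance of the sample mean is scaled by the finite population correction (N - n)/N, and asks what sample size gives France the same precision that 2,000 respondents give in The Netherlands. Function names are illustrative.

```python
def infinite_population_size(n, population):
    """Sample size n0 an unlimited population would need for the same precision."""
    return n / (1 - n / population)

def finite_population_size(n0, population):
    """Sample size needed in a population of the given size for that precision."""
    return n0 / (1 + n0 / population)

NL_VOTERS = 12_000_000
FR_VOTERS = 36_000_000

n0 = infinite_population_size(2_000, NL_VOTERS)
n_france = finite_population_size(n0, FR_VOTERS)
print(n0, n_france)
```

Tripling the population changes the required sample by a fraction of one respondent, because 2,000 is tiny compared with either population.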
21
Explanation:
Mean and variance of the mean
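We measure x and get measurements x_1, …, x_n:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

$$\mathrm{Var}(\bar{x}) = \left(\frac{N - n}{N}\right) \frac{s_x^2}{n}$$

$$s_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \text{variance in the sample}$$

where x_i = measurement of x for unit i, N = size of population, n = size of sample.

Expectation and variance give the 95% confidence interval:

$$\left[\, \bar{x} - 1.96\sqrt{\mathrm{Var}(\bar{x})},\ \ \bar{x} + 1.96\sqrt{\mathrm{Var}(\bar{x})} \,\right]$$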
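A minimal sketch of this slide's recipe in Python: the sample mean, the sample variance with n - 1 in the denominator, the variance of the mean with the (N - n)/N correction, and the 95% interval. The data and the population size are invented for illustration.

```python
import math
import statistics

def mean_confidence_interval(sample, population_size, z_crit=1.96):
    """95% interval for the mean, with finite population correction."""
    n = len(sample)
    mean = statistics.mean(sample)
    s2 = statistics.variance(sample)  # sample variance, divides by n - 1
    var_mean = ((population_size - n) / population_size) * s2 / n
    half_width = z_crit * math.sqrt(var_mean)
    return mean - half_width, mean + half_width

# Hypothetical heights in cm, drawn from a hypothetical population of 700 players.
heights_cm = [178, 182, 175, 190, 185]
print(mean_confidence_interval(heights_cm, population_size=700))
```

When the population is much larger than the sample, the correction factor is close to 1 and the interval depends almost only on n.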
22
Sample size determined by:
Are white soccer players shorter?
• How precise do you want to measure your
statistic?
[what is the height difference you would find interesting
enough to report about]
• What is the probability of Type I error that you
will allow? (rejecting the H0-hypothesis when in
fact it is true) Usually 5%
[How small do you want the probability to be that you reject
“(on average) black and non-black players are equally tall”
when in fact it is true?]
• How likely do you want it to be that you will find
an effect, assuming that it exists in the
population? Power, usually 80% or 90%.
• One-sided or two-sided tests?
You need special purpose software for this, for
instance G*Power (on the disc), or PS
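To see how these ingredients combine, here is a rough sketch based on the normal approximation for comparing two group means. It is not what G*Power or PS compute exactly, since they use the t-distribution and give slightly larger numbers. The effect size is the difference in means divided by the common standard deviation. The function name and defaults are choices made for this sketch.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80, two_sided=True):
    """Approximate sample size per group for comparing two means."""
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    tail = alpha / 2 if two_sided else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail)   # guards against Type I error
    z_power = NormalDist().inv_cdf(power)      # guards against missing a real effect
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)

# Hypothetical: we care about a height difference of half a standard deviation.
print(n_per_group(0.5))
print(n_per_group(0.5, two_sided=False))
print(n_per_group(0.5, power=0.90))
```

Smaller effects, lower alpha, higher power and two-sided testing all push the required sample size up.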
23
X → Y
24
All the same, but different
• Problem formulation
– What are sensible questions?
• Theory-development and hypotheses
– “What do I expect to be the answer to my question, and what are
the implications from the theory that I want to test”
(nb: different in exploratory work)
X1 → Y1
X2 → Y2
…
• Choice of research design
– Experiment
– Survey
– Case study
– Participant observation
– …
• Data collection
– Designing questionnaires
– Designing experimental procedures
– Finding your respondents. Sampling (how and how many?)
– …
X1 → Y1
X2 → Y2
HOW?
• Analysis of results
– Measurement: from raw data to measured constructs
– Relational claims: X → Y ?
• Conclusions
– What can we conclude?
X1 → Y1
X2 → Y2
AND?
25
About $80 / hour
26
It is all about X → Y
X → Y:
“white soccer player” → “height”
“being a woman” → “sensitive to alcohol”
“being bald” → “prob. of a heart-attack”
“left handed” → “die early”
“listen to Mozart” → “score higher on IQ-test”
Y = dependent variable, response variable, target variable, Y-variable, explanandum
X = independent variable, X-variable, predictor variable, explanans
Usually we want to say something like “X causes
Y”, but often we have to settle for “X is related
to Y”.
27
Survey vs experiment (Milgram)
Y = which voltage do you apply?
measured X's:
– subject is male
– subject is young
manipulated X's:
– experimenter wears white coat
– experimenter is older (vs young)
Experiment: researcher determines X
Survey: researcher measures X
28
X → Y
29
Kinds of variables (in case you forgot)
Categorical / Nominal
Two or more categories, without intrinsic
ordering (ex.: “kind of movie”: action/drama/...)
When only two categories, also called a binary
variable (ex.: gender, “age over 40”, etc)
Ordinal
Two or more categories, with intrinsic ordering
(ex.: 5-point ratings such as
never/sometimes/often/always, …)
Interval
Ordinal + intervals between values are evenly
spaced (age, income, number of movies rented).
NB Not always easy to classify.
Categorical and interval are the most important
(often ordinal are treated as either categorical or
interval).
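Why the kind of variable matters can be sketched as a small lookup: the kinds of X and Y point to the analysis to start with. The mapping below is a simplified assumption that uses only analyses named in the course outline. It treats ordinal variables as either categorical or interval, as the slide suggests, and it is not a complete decision table.

```python
KINDS = {"binary", "categorical", "interval"}

def suggest_analysis(x_kind, y_kind):
    """Rough first suggestion for an analysis, given the kinds of X and Y."""
    if x_kind not in KINDS or y_kind not in KINDS:
        raise ValueError("kinds must be 'binary', 'categorical' or 'interval'")
    if y_kind == "interval":
        if x_kind == "binary":
            return "t-test"
        if x_kind == "categorical":
            return "ANOVA"
        return "correlation / regression"
    if x_kind == "interval":
        # Y is binary or categorical, X is interval.
        if y_kind == "binary":
            return "logistic regression"
        return "not covered by this sketch"
    # Both X and Y are binary or categorical.
    return "chi-square test"

# "white soccer player" (binary X) and height (interval Y):
print(suggest_analysis("binary", "interval"))
# regular vs late group (binary X) and pass vs fail (binary Y):
print(suggest_analysis("binary", "binary"))
```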
30
Statistics at UCLA
http://www.ats.ucla.edu/stat/mult_pkg/whatstat/default.htm
[Table: choice of statistical test, by kind of Y and kind of X]
31
Dealing with data
1. Import SPSS file
2. Check your data
• To get acquainted with it
• For outliers and coding errors
3. Determine the kind of analysis
4. Recode your data so that you have the variables
in the appropriate format
5. Check the assumptions for the analysis of
choice (1)
6. Run your analysis
7. Check the assumptions for the analysis of
choice (2)
8. If necessary, back to 3. until CONCLUSION
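Steps 2 and 4 can be illustrated outside SPSS with a small Python sketch: flag values outside a plausible range (possible outliers or coding errors), then recode an interval variable into a binary one such as "age over 40". The data, the plausible range and the function names are all hypothetical.

```python
def out_of_range(values, low, high):
    """Return (index, value) pairs that fall outside the range [low, high]."""
    return [(i, v) for i, v in enumerate(values)
            if v is not None and not low <= v <= high]

def recode_over_cutoff(values, cutoff):
    """Recode to 1 if value > cutoff, else 0; missing values stay None."""
    return [None if v is None else int(v > cutoff) for v in values]

# Hypothetical ages; 230 looks like a typing error, None is a missing value.
ages = [23, 31, 27, 230, None, 41]

suspect = out_of_range(ages, low=15, high=50)
print(suspect)

# In practice, correct or drop suspect values before recoding.
over_40 = recode_over_cutoff(ages, cutoff=40)
print(over_40)
```

Keeping such steps in a script serves the same purpose as an SPSS syntax file: the checks and recodes can be rerun and inspected later.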
32
Fact and fiction
Are white soccer players shorter?
33
Example data: soccer players
File: soccer_0AP03.sav. All players from WC2002.
Let’s see what the data looks like:
<to SPSS>
Variable view vs Data view
Run a “Frequencies”
Check histograms
Create new variables (Transform > Compute)
Recode variables (Transform > Recode)
Run analyses
USE SYNTAX FILES (*.SPS)!
34
Weekly not-on-the-exam fact
[Diagram: several inputs feed into a single output]
Suppose: You have a handful of numerical
inputs and want to use these to predict some
output.
For instance: chance of survival of a firm based on firm
characteristics, probability of job success based on
credentials, probability of surgery survival based on medical
records, …
We compare experts in the field with computer
models (both have the same amount of data).
Out of 160 studies of this kind, how often do
the experts perform significantly better?
(sources: see “Super Crunchers” by Ayres)
35
To Do
Get familiar with SPSS: reading data, recoding
variables, and running a t-test or a correlation.
Especially recoding variables and the syntax
window are important. You should be able to do
the assignments on the web page fairly quickly.
Check chapters 1 through 4 (up to 4.5.4) of the
Field-book for anything that looks unfamiliar to
you.
Don’t wait until the last couple of weeks!
Add to the WIKIs
36