Lecture #3 - College of Computing & Informatics

Action Research
Descriptive Statistics and Surveys
INFO 515
Glenn Booker
INFO 515
Lecture #3
1
Reliability and Validity

A measure is reliable if it consistently gives the same answer
  A key to scientific measurement is the ability to repeat an experiment reliably
A measure is valid if it actually measures the concept under investigation
  It tests what you think it tests
2
Review Std Deviation and CV
Standard Deviation can be used to compare two (or more) groups that have the same units of measure and similar means
Coefficient of Variation can compare two (or more) groups that have different reference points (means) and different standard deviations
  See which groups are more closely distributed around their mean
3
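A minimal Python sketch of this comparison, assuming the sample standard deviation (n - 1 in the denominator) is the one wanted. The two data sets and the function name are hypothetical, chosen only to show groups on different scales with the same relative spread.

```python
import statistics


def coefficient_of_variation(values):
    """Return the standard deviation divided by the mean (a unitless ratio)."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    # Assumption: sample standard deviation (n - 1 denominator)
    return statistics.stdev(values) / mean


# Hypothetical groups with different units and different means
weights_kg = [60, 70, 80]       # standard deviation 10, mean 70
heights_cm = [150, 175, 200]    # standard deviation 25, mean 175

cv_weights = coefficient_of_variation(weights_kg)
cv_heights = coefficient_of_variation(heights_cm)
```

The standard deviations differ (10 versus 25), but both groups have the same CV, so they are equally tightly distributed around their own means.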
Z Score
The Z Score is the ‘how weird am I’ measure for a given data point*
The standardized or ‘z’ score allows you to do either of the following:
  Find where one or more individuals stand in reference to the mean of a single distribution on one unit of measure (one variable)
    Where is an individual located relative to a distribution of test scores?
    Am I better than average? If so, how much?
* This is not an official ISO definition…
4
Z Score

Find where one or more individuals stand in reference to the mean of two (or more) different distributions that may have different units of measure
  Where does an individual stand relative to two tests, each given in a different class (with different distributions)?
  Did I do better on the midterm in philosophy than the one in geography?
5
Z Score
A z score tells you how far above or below the mean any given score is, in standard deviation units
Z scores are most useful when the shape of your actual distribution of scores is nearly normal (see slide 9, or Action Research handout p. 11)
What’s the “normal” distribution?
6
Normal Distribution Example
Consider stopping a car at a traffic light
You don’t stop at exactly the same place each time, but generally stop somewhere behind or near the big white line (I hope!)
Where you are likely to stop might be described by a “normal distribution”
7
Normal Distribution
The normal, or Gaussian, distribution is the classic “bell curve”, which shows that most measurements are somewhere close to the mean, but a few measurements could range far above or below that mean
It is symmetric, and extends forever above and below the mean
8
Normal Distribution

The normal distribution is described by two math functions
  The function f(x) is the probability density function, often called a PDF; it represents how likely the answer is to fall near the current value of x
  The function F(x) is the cumulative probability function; it represents the total chance of getting the current value of x or anything less
    A.k.a. a cumulative distribution function, or CDF
9
Normal Distribution
[Figure: plot of f(x) and F(x) against X from -3 to 3, vertical axis from 0 to 1]
‘f(x)’ is the probability density function (the classic bell curve)
‘F(x)’ is the cumulative probability function
10
Probability Density Function, f(x)

The chance you will stop (the event will occur) between any two distances ‘a’ and ‘b’ is the area under the curve f(x) between those two values
[Figure: normal distribution f(x) against X from -3 to 3, with the points a and b marked on the X axis]
11
Probability Density Function, f(x)

Notice that f(x) is symmetric from left to right, and that it is defined for all possible values of x (x = negative infinity to x = positive infinity)
  f(x) never reaches zero!
The total area under the curve f(x) is one
  You will eventually stop somewhere
Unfortunately, f(x) is a messy function to integrate (find the area under it)
12
Cumulative Probability Function F(x)
Imagine you start at x equals minus infinity (x = -∞)
Then add up the area under f(x) from minus infinity to the current value of x
  This is the cumulative probability function, F(x)
  That’s why F(0) (F at x=0) is exactly 0.5
    Half of all events occur left of x=0, and half occur to the right of x=0 (symmetry)
13
Cumulative Probability Function F(x)
So the chance of getting a result between values ‘a’ and ‘b’ is also given by:
Probability = F(b) - F(a)
An analogy might be
  The number of babies born between 1940 (a) and 1990 (b) is equal to the total number of babies ever born by 1990 (F(b)), minus the total number of babies ever born by 1940 (F(a))
14
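The F(b) - F(a) rule can be sketched in Python. This is a minimal illustration rather than the course's own method: it assumes a normal distribution and builds F(x) from the standard library's error function, and the function names are made up for the example.

```python
import math


def normal_cdf(x, mean=0.0, sd=1.0):
    """F(x): area under the normal density from minus infinity up to x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def probability_between(a, b, mean=0.0, sd=1.0):
    """Chance of a result between a and b, computed as F(b) - F(a)."""
    return normal_cdf(b, mean, sd) - normal_cdf(a, mean, sd)


half = normal_cdf(0.0)                          # symmetry: half the area lies left of the mean
within_one_sd = probability_between(-1.0, 1.0)  # area within one standard deviation of the mean
```

Using the error function avoids integrating f(x) by hand, which is the messy step the cumulative function is meant to hide.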
Standard (Z) Scores
Back to Z scores, our motivation for discussing the normal distribution
Z Scores are standardized scores whose distribution has the following properties:
  Retains the shape of the original scores, but
  Has a mean of 0 and
  Has a variance and standard deviation of 1
15
Calculating Z scores

Compute the “z” score by subtracting the mean from the raw score and dividing that result by the standard deviation
z = (Xi - m) / s = (Score - Mean) / (Standard Dev)
The z score is not just associated with the normal distribution; it can be used with any kind of distribution
16
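A short Python sketch of the calculation, assuming the population standard deviation is used (the slide does not say which form is intended). The raw scores are hypothetical and deliberately not normal, to show that the formula still applies.

```python
import statistics


def z_score(x, mean, sd):
    """Return how many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


def standardize(scores):
    """Convert every raw score to a z score using the group's own mean and SD."""
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)  # assumption: population standard deviation
    if sd == 0:
        raise ValueError("all scores are identical, so z scores are undefined")
    return [z_score(x, mean, sd) for x in scores]


raw_scores = [55, 60, 65, 70, 100]  # hypothetical, skewed test scores
z_values = standardize(raw_scores)
```

The standardized values keep the skewed shape of the raw scores, but their mean is 0 and their standard deviation is 1.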
Interpreting Z Scores

The z score describes how many standard deviations a specific score is above or below the mean
  A negative z score means that the score is below the mean
  A positive z score means that the score is above the mean
  A z score of zero (z=0) means that the score is equal to the mean
17
Z Score Example
I own 250 books; I want to know how I compare to other college professors
Suppose that the mean number of books owned by college professors is 150 with a standard deviation of 50
  z = (250 - 150) / 50 = 2
My z score is 2, meaning I have 2 standard deviations more books than average (‘cuz I’m a pack rat!)
18
Z Score Tables
Z score tables are used to determine the proportion of the area under the curve that lies between the mean and a given standard score (z)
These tables are prepared using integral calculus to save you time
They show only positive ‘z’ values, since the areas for negative ‘z’ are the same as for positive ‘z’ (thanks to symmetry)
19
Z Score Tables (Yonker p. 29-30)
[Figure: normal distribution f(x) against X from -3 to 3, annotated with the table columns: the z value (Col. A), the area between 0 and z (Col. B), and the area beyond z (Col. C)]
Notice that we always have
Col. B + Col. C = 0.5000
20
Use of Z Score Tables

Z score tables can be used to find the chance of a measurement (or percentage of cases) occurring between any two z values
  If the z scores are on opposite sides of the mean (one positive, one negative), add the areas from Column B for each score
  If the z scores are on the same side of the mean (both positive, or both negative), subtract the areas from Column B
    Subtract the smaller area from the larger area; otherwise you’d get negative area!
21
Use of Z Score Table Examples
Between z scores of -1.5 and +2.2, the percent of cases is, from Column B:
  The area for z = -1.5 is the same as the area for z = +1.5
  Area for z = +1.5 is 0.4332 and area for z = +2.2 is 0.4861
  Percent = 43.32 + 48.61 = 91.93%
Between z scores of +1.5 and +2.2, the percent of cases is:
  Percent = 48.61 - 43.32 = 5.29%
22
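The add-or-subtract rule for Column B can be checked with a small Python sketch. It is only a stand-in for the printed table: Column B is computed from the standard library's error function instead of being looked up, and the function names are hypothetical.

```python
import math


def area_mean_to_z(z):
    """Column B: area under the standard normal curve between the mean and z."""
    return 0.5 * math.erf(abs(z) / math.sqrt(2.0))


def area_between(z1, z2):
    """Area between two z scores, using the add/subtract rule for Column B."""
    b1, b2 = area_mean_to_z(z1), area_mean_to_z(z2)
    if z1 * z2 < 0:
        # Opposite sides of the mean: add the two Column B areas
        return b1 + b2
    # Same side of the mean: subtract the smaller area from the larger
    return abs(b1 - b2)


opposite_sides = area_between(-1.5, 2.2)  # about 0.9193, i.e. 91.93%
same_side = area_between(1.5, 2.2)        # about 0.0529, i.e. 5.29%
```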
Cumulative Z Score
[Figure: normal distribution f(x) against X from -3 to 3, with the percent of cases in each band between integer z scores]
Below -3: 0.13%; -3 to -2: 2.14%; -2 to -1: 13.59%; -1 to 0: 34.13%; 0 to 1: 34.13%; 1 to 2: 13.59%; 2 to 3: 2.14%; above 3: 0.13%
Percentages shown are the total percent between the integer Z score values; between 0 and 1 has 34.13%, between 1 and 2 has 13.59%, etc.
From p. 11 in Yonker
23
F(x) Values

For F(x) from minus 6 to plus 6, a distribution with mean = 0 and standard deviation of 1.0 gives:
Z     CDF               delta CDF from next value
-6    0.000000000987    0.000000286
-5    0.000000286652    0.000031385
-4    0.000031671242    0.001318227
-3    0.001349898032    0.021400234
-2    0.022750131948    0.135905122
-1    0.158655253931    0.341344746
0     0.500000000000
1     0.841344746069
2     0.977249868052
3     0.998650101968
4     0.999968328758
5     0.999999713348
6     0.999999999013
24
Cumulative Z Score

Key values are:
  From z = -1 to +1, total area is 68.26%
  From z = -1.96 to +1.96, total area is 95%
  From z = -2 to +2, total area is 95.44%
  From z = -2.57 to +2.57, total area is 99%
  From z = -3 to +3, total area is 99.74%
25
Transformed z, or T scores
A.k.a. Standardized scores or “T” scores
Z scores are transformed artificially
  Multiply a z score by the desired standard deviation s and add the desired mean m (e.g. 10 and 50), so T = z*s + m becomes T = 10*z + 50
Examples
  A z score of -1.5 would give a T score of T = 10*(-1.5) + 50 = 35
  A z score of +2.2 would give T = 10*(2.2) + 50 = 72
26
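As a quick sketch, the transformation is a one-line function in Python. The function name and default arguments are illustrative; the defaults are the 10 and 50 used on this slide.

```python
def t_score(z, desired_sd=10.0, desired_mean=50.0):
    """Rescale a z score to a T score: T = z * s + m."""
    return z * desired_sd + desired_mean


low_example = t_score(-1.5)   # the slide's first example, about 35
high_example = t_score(2.2)   # the slide's second example, about 72
```

Other desired means and standard deviations can be passed in, for example `t_score(z, desired_sd=15, desired_mean=100)`, but those values are not taken from this lecture.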
T scores
T scores are used in many fields of research, especially Psychology and Education (that’s where the “desired” mean and standard deviation values came from)
Benefits: gets rid of negative connotations of negative and zero scores
  Only z scores below z = -5.0 would result in a negative T score (typically less than one data point in a million)
27
Level of Confidence

Since the normal distribution goes to positive and negative infinity, we need a way to limit the range of expected or likely values
  Otherwise any normal distribution could take any value some of the time
Define the Level of Confidence as the acceptable limits of predictable behavior
Typically use 95% for most applications, but 99% for medical research
28
Level of Confidence

Generally, we can say that the actual value of a parameter estimate is in the range of its mean ± twice its standard error, with a 95% level of confidence
  Use 1.96 instead of 2 for precise work
Thus the value of a parameter with mean of 6.2 and standard error of 1.9 lies between 2.4 (i.e., 6.2 - 2*1.9) and 10.0 (i.e., 6.2 + 2*1.9) with a 95% level of confidence
29
The “t” Statistic

The t-statistic is defined as
t = (parameter estimate) / (standard error)
If |t| > 2, then the parameter estimate is significantly different from zero at the 95% level of confidence
  t = 6.2/1.9 = 3.26
  Hence because |3.26| > 2, this estimate is statistically significant
  Also means the 95% confidence interval does not include zero
Again, use 1.96 instead of 2 for precise work
30
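The confidence interval and the t-statistic check can be sketched together in Python. The function names are hypothetical; the numbers 6.2 and 1.9 are the lecture's own example, and the default critical value of 1.96 is the "precise work" figure.

```python
def confidence_interval(estimate, std_error, critical=1.96):
    """Return (low, high) = estimate -/+ critical * standard error."""
    margin = critical * std_error
    return estimate - margin, estimate + margin


def t_statistic(estimate, std_error):
    """t = (parameter estimate) / (standard error)."""
    return estimate / std_error


def is_significant(estimate, std_error, critical=1.96):
    """True when |t| exceeds the critical value (estimate differs from zero)."""
    return abs(t_statistic(estimate, std_error)) > critical


# Lecture example, using the rule-of-thumb critical value of 2
low, high = confidence_interval(6.2, 1.9, critical=2.0)   # roughly 2.4 to 10.0
t = t_statistic(6.2, 1.9)                                  # roughly 3.26
significant = is_significant(6.2, 1.9, critical=2.0)
```

The two checks agree: the estimate is significant exactly when the interval from `low` to `high` does not include zero.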
The “t” Statistic
T = ‘t’???? No!
Notice that the T score is a completely different concept from the ‘t’ statistic
We’ll use the ‘t’ statistic to help judge SPSS output later in the course
31
Sampling Terms
Population = the entire realm of interest: everyone, all books, all publishers, all patrons, etc.
Sample = a subgroup or subset of the population
  Accurate inference requires good samples
  Use a sample since it is often hard or impossible to measure the entire population
32
Sampling Terms

Inferential Statistics
  Taking samples in order to infer unknown population parameters
Principle of Random Selection
  A procedure by which each member of the population has the same chance of being chosen as any other member
  Representative of the population
33
Types of Samples

Probabilistic sample - sampling in which the probability of each element in the population being selected is known and can be specified
  Each element has the same chance
Non-probabilistic sample - each probability not known a priori (in advance)
  E.g. convenience samples, or available samples
34
Random Sampling Techniques
Simple Random
Stratified Random
  Proportional
  Disproportional
Cluster
Systematic
35
Simple Random Sample
Often can’t sample the entire user population
Must be a truly random sample, not just convenient
Can use random number table, or computer-generated pseudo-random numbers (Yonker, p. 31) to choose the sample
36
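A minimal sketch of the computer-generated route in Python, using the standard library's pseudo-random generator. The customer IDs, the seed value and the function name are hypothetical; a fixed seed is used only so that the draw can be repeated.

```python
import random


def simple_random_sample(population, sample_size, seed=None):
    """Draw sample_size members without replacement, each equally likely."""
    rng = random.Random(seed)  # seeded generator so the draw is repeatable
    return rng.sample(population, sample_size)


customers = list(range(1, 101))  # hypothetical customer IDs 1..100
chosen = simple_random_sample(customers, 10, seed=515)
```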
Stratified Random Sampling


Group customers into categories (strata); get simple random samples from each category (stratum). Can be a very efficient method.
Can weight each stratum equally (proportional s.s.) or unequally (disproportional s.s.)
  For unequal weight, make the sampling fraction proportional to the standard deviation of the stratum, and to 1 / square root (cost of sampling):
  F ~ s/sqrt(cost)
  where “sqrt” is “square root” and “~” is ‘proportional to’
37
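The F ~ s/sqrt(cost) rule only fixes the ratios between strata, so a sketch in Python can show those ratios at most. The strata, their standard deviations and their costs below are hypothetical, and scaling the weights so they sum to 1 is an assumption made for readability, not part of the rule.

```python
import math


def relative_sampling_fractions(strata):
    """Map each stratum to a weight proportional to s / sqrt(cost).

    strata: dict of name -> (standard deviation, cost of sampling one unit).
    Weights are scaled to sum to 1 (an illustrative choice).
    """
    raw = {name: sd / math.sqrt(cost) for name, (sd, cost) in strata.items()}
    total = sum(raw.values())
    return {name: weight / total for name, weight in raw.items()}


# Hypothetical strata: (standard deviation, cost per sampled unit)
strata = {
    "A": (10.0, 1.0),
    "B": (20.0, 1.0),  # twice the spread of A, same cost
    "C": (20.0, 4.0),  # same spread as B, four times the cost
}
weights = relative_sampling_fractions(strata)
```

Stratum B gets twice the weight of A because it varies twice as much, while C drops back to A's weight because quadrupling the cost halves the fraction.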
Proportional Stratified Random Sampling
Major            # in Population   % in Population   # in Sample
Education        50                50                10
Soc./Beh. Sci.   30                30                6
Business         15                15                3
Sci./Tech        5                 5                 1
For Education: % in Population = 50/100 X 100 = 50%; # in Sample = 50% X 20 = 10
Data taken from Carpenter and Vasu (1978)
38
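The allocation in this table can be reproduced with a short Python sketch. The function name is hypothetical; the stratum sizes and the total sample of 20 come from the table.

```python
def proportional_allocation(stratum_sizes, total_sample):
    """Give each stratum the same share of the sample as it has of the population."""
    population = sum(stratum_sizes.values())
    # Rounding to whole people; for other data the rounded counts
    # may not add up exactly to total_sample.
    return {
        name: round(total_sample * size / population)
        for name, size in stratum_sizes.items()
    }


majors = {"Education": 50, "Soc./Beh. Sci.": 30, "Business": 15, "Sci./Tech": 5}
allocation = proportional_allocation(majors, total_sample=20)
```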
Cluster Sampling
Divide population into (geographic) clusters, then do simple random samples within each selected cluster
Try for representative clusters
Not as efficient as simple random sampling, but cheaper
Sometimes used for in-person interviews
39
Cluster Sampling Example
Randomly select n (a certain number of) census tracts
From the randomly selected census tracts, randomly select n blocks
From the randomly selected blocks, randomly select addresses
Interview the family (the unit of study)
40
Systematic Sampling
Calculate your sampling interval:
Interval = (Size of population) / (Size of sample)
Select your first element at random from within the first sampling interval
Move ahead systematically by the sampling interval (e.g. every 10th customer) until you reach your desired sample size
41
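These steps can be sketched in Python as follows. The customer list, the seed and the function name are hypothetical, and using integer division for the interval is an assumption for populations that do not divide evenly.

```python
import random


def systematic_sample(population, sample_size, seed=None):
    """Pick a random start within the first interval, then every interval-th element."""
    interval = len(population) // sample_size  # assumption: integer interval
    if interval < 1:
        raise ValueError("sample size cannot exceed population size")
    rng = random.Random(seed)
    start = rng.randrange(interval)  # random position inside the first interval
    return [population[start + i * interval] for i in range(sample_size)]


customers = list(range(1, 101))  # hypothetical customer IDs 1..100
picked = systematic_sample(customers, 10, seed=515)  # every 10th customer
```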
Non-random Sampling Techniques
Quota
Accidental
Judgment
42
Non-random techniques

Quota sampling
  Is economical
  Is a non-random version of stratified sampling
  Define desired characteristics in advance: gender, race, age, etc.
  Example: Interview 20 females and 20 males over the age of 65
43
Non-random techniques

Accidental sampling
  Mall market studies, Internet surveys
  Often requires a choice (by the interviewee) to be sampled
Judgment sampling
  Pick people who have some special knowledge
  Seek out experts; more of an interview method
44
What is a Survey Study (Assessment)?
To describe systematically the facts and characteristics of a given population or area of interest, factually and accurately. (Isaac and Michael)
Survey studies are used to:
  Describe what is
  Establish need
  Identify problems
  Infer possible solutions
45
Surveys

A survey often refers to a large data collection effort:
  What it involves: personal interviews, telephone interviews, a questionnaire sent through the mail, document survey, literature survey, social area analysis (observation and description of different areas of the city)
  “Who” it involves: community, customers, users, employees, literature
  Purpose: information gathering and fact finding to describe what exists (such as public library services), establish need, identify problems, and imply possible solutions
46
Customer Satisfaction Surveys

Could have many opportunities to conduct surveys
  Customer call-back after x days
  Customer complaints
  Direct customer visits
  Customer user groups
  Conferences
47
Customer Satisfaction Surveys
Want representative sample of all customers
Three main methods are used
  Personal interview
  Telephone interview
  Questionnaire by mail
48
Personal Interview

Advantages:
1. Explore complex issues
2. Question clarification
3. Rapport
4. Higher response rate
5. Observation
49
Personal Interview

Disadvantages:
1. Interviewer bias
2. Question uniformity
3. No anonymity
4. Difficult to analyze
5. Time consuming
50
Telephone Interview

Advantages:
1. Some anonymity
2. Low cost
3. Rapid completion
4. Higher response rate
5. No travel time
6. Widely spread sample
51
Telephone Interview

Disadvantages:
1. Reaching people
2. Some interview bias possible
3. Only accessible phone numbers
4. No observation
52
Structured vs. Unstructured Interviews

In an unstructured interview, only the first question is standard for all respondents
  The remaining questions are determined by the answers of each respondent
In a semi-structured interview, the questions are open ended, but all of the respondents receive the same questions
53
Questionnaire by Mail

Advantages:
1. Economical
2. Faster
3. Wide range of issues
4. Widely spread sample
5. Avoids interviewer bias
6. Anonymity
54
Questionnaire by Mail

Disadvantages:
1. Question clarity
2. No probing
3. Who is answering?
4. No observation
5. Response rate
55
Interview & Questionnaire Tips
1. Start with easy questions that the respondent will enjoy answering
   You want to prevent boredom early on while building rapport and putting the respondent at ease
2. Try for an easy and natural flow over topics
   Place like items together and give a brief explanation when a topic breaks
3. Within topics, go from the general to the specific
   For example, start with questions on use of the Internet in general, then move on to specific questions about the use of search engines
56
Interview & Questionnaire Tips
4. Put open-ended or difficult questions (if any) at the end of the interview or questionnaire
5. Put questions on “sensitive” matters (such as age or income) at the end of the interview or questionnaire
   Otherwise, the interview may be over before it has started!
57
The “Question Continuum”

Closed Questions
  Fixed Alternatives
  Structured
  “Your annual income is: a) 0-25K, b) 26-35K, ”
Semi Structured Questions
Open Questions
  Free form responses
  Unstructured
  “What do you like about Drexel?”
58
Sample Size


How big is enough?
Must choose:
  Confidence level (80 - 95%, to get Z)
  Margin of error (B = 3 - 5%)
For simple random sample, also need
  Estimated satisfaction level (p), which is what you’re trying to measure, and
  Total population size (N = total number of customers)
59
Critical Z values
Confidence Level   (2-sided) critical Z
80%                1.28
90%                1.645
95%                1.96
99%                2.57
60
Sample Size
Sample size, n
n = [N*Z^2*p*(1-p)] / [N*B^2 + Z^2*p*(1-p)]
The sample size depends heavily on the answer we want to obtain, the actual level of customer satisfaction (p)!
61
Sample Size

If we choose
  80% confidence level, then Z = 1.28
  5% margin of error, then B = 5% = 0.05
  and expect 90% satisfaction, then p = 0.90
n = (N*1.28^2*0.9*0.1) / (N*0.05^2 + 1.28^2*0.9*0.1)
n = 0.1475*N/(0.0025*N + 0.1475)
62
Sample Size
Given: Z = 1.28, p = 0.9, B = 0.05
Hence: Z^2 = 1.6384, p(1-p) = 0.09, B^2 = 0.0025
Find:
N          n
10         8.550355
20         14.93558   <- Beware of sampling small populations!
50         27.06052
100        37.09996
200        45.54935
500        52.75873
1000       55.69724
10000      58.63655
100000     58.94763
1000000    58.97892   For very large N, sample size stabilizes
Infinity   58.9824
63
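The table can be regenerated with a few lines of Python. This sketch assumes the sample size formula for a simple random sample with the lecture's symbols (Z, p, B, N); the function name is hypothetical, passing no population stands in for the "Infinity" row, and rounding up to a whole respondent is an added convenience rather than something the slides state.

```python
import math


def sample_size(z, p, b, population=None):
    """n = N*Z^2*p*(1-p) / (N*B^2 + Z^2*p*(1-p)); population=None means N -> infinity."""
    core = z ** 2 * p * (1 - p)
    if population is None:
        # Very large N: the population size drops out of the formula
        return core / b ** 2
    return population * core / (population * b ** 2 + core)


n_small = sample_size(1.28, 0.9, 0.05, population=10)   # about 8.55
n_100 = sample_size(1.28, 0.9, 0.05, population=100)    # about 37.1
n_limit = sample_size(1.28, 0.9, 0.05)                   # about 58.98
respondents_needed = math.ceil(n_100)                    # round up to whole people
```

For a population of 10, nearly everyone has to be sampled, which is the warning in the table about small populations.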
Sample Size
If you don’t know the customer satisfaction value ‘p’, use 0.5 as a worst-case estimate
Once the real value of ‘p’ is known, solve for the actual value of B (margin of error)
The key is finding a truly representative sample
For N approaching infinity, sample size simplifies to:
n = p*(1-p)*(Z/B)^2
64
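Solving for B afterwards is an algebraic rearrangement of the sample size formula. The Python sketch below works under that assumption; the function name is hypothetical and the rearranged form does not appear on the slides.

```python
import math


def margin_of_error(z, p, n, population=None):
    """Solve the sample size formula for B, given the sample size n actually used.

    Large N:   B = Z * sqrt(p*(1-p)/n)
    Finite N:  B = Z * sqrt(p*(1-p)/n * (N-n)/N)
    """
    variance_term = p * (1 - p) / n
    if population is not None:
        variance_term *= (population - n) / population
    return z * math.sqrt(variance_term)


# With the planned sample size and p = 0.9, the planned 5% margin comes back
b_planned = margin_of_error(1.28, 0.9, 58.9824)
# Same sample, but satisfaction turns out to be the worst case p = 0.5
b_worst_case = margin_of_error(1.28, 0.5, 58.9824)
```

With the same sample size, the worst-case p = 0.5 widens the margin of error from 5% to roughly 8%.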