
Comprehensive Exam Review
Research and Program Evaluation
Part 5
Analyses of Differences
Recall that for purposes here, an analysis of
difference involves at least one continuous
variable and at least one discrete variable.
In this context, the variable that is continuous
is sometimes called the “dependent” variable,
and the variable that is discrete is sometimes
called the “independent” variable.
The purpose of the analysis is to investigate
differences in the continuous variable as a
function of the categories in the discrete
variable.
Think about this possibility for a while.
Imagine that the same test was given to a
group of people on many occasions, but on
each occasion the test was administered, they
had not taken the test previously.
Then, imagine that the mean for the test was
computed for each occasion it was
administered.
If a graph was made of the various means for
the group against the frequency of occurrence
of the respective means, the result would be a
normal distribution of the means (because the
various factors affecting test performance
would come together in different ways on
different occasions).
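To make this concrete, here is a minimal Python sketch (all score values hypothetical, and the "many occasions" modeled as independent draws from a pool of individual scores) that collects the mean from each simulated occasion; the means pile up in a roughly normal distribution:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical pool of individual test scores.
    population = rng.normal(loc=100, scale=15, size=10_000)

    # Each "occasion" draws a fresh set of scores and records the group mean.
    n_occasions, group_size = 5_000, 30
    means = np.array([
        rng.choice(population, size=group_size, replace=False).mean()
        for _ in range(n_occasions)
    ])

    # The collected means form an approximately normal distribution.
    print(f"mean of the means: {means.mean():.2f}")
    print(f"SD of the means (the standard error): {means.std(ddof=1):.2f}")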
This very special distribution is known as the theoretical distribution of sampling means. [Figure: a normal curve with frequency (f) on the vertical axis and values of the means on the horizontal axis.]
This distribution represents 100% of the possible
means that the group might achieve on any
occasion.
In other words, there is a 100% probability that the mean would fall between the two outermost lines on this graph. [Figure: the same curve with bounding lines at its tails.]
Because this is a “normal distribution,” all of its
mathematical properties are known.
For example, it is symmetric about the mean and
the standard area percentages under the curve are
known.
[Figure: the normal curve with the standard area percentages marked: 2%, 14%, 34%, 34%, 14%, and 2%.]
This theoretical distribution can be grounded in
reality if one assumption is accepted:
That an observed mean (i.e., one from an
actually administered test or measurement)
is the mean of the theoretical distribution of
sampling means.
Generally, this assumption is presumed valid
unless there is specific information that the
assessment situation was something other than
“normal.”
Once the test is given, the areas under the curve can be related to any mean if the distance between the observed mean and these points is known. [Figure: the normal curve centered on the Observed Mean, with the standard area percentages (2%, 14%, 34%, 34%, 14%, 2%) marked.]
These distances are known as the “Standard Error of the Mean.”
A “standard error” is a standard deviation of a theoretical distribution. [Figure: the normal curve centered on the Observed Mean, with the horizontal axis marked at -3SE, -2SE, -1SE, +1SE, +2SE, and +3SE and the standard area percentages labeled.]
There is approximately a two-thirds
chance (68%) that the mean for the
group will fall between +/- one (1)
standard error of the mean on any
occasion.
Similarly, the probability, or likelihood, of
the mean falling between +/- two (2)
standard errors of the mean on any occasion
is approximately 96%, and so on.
Two of the more useful statements that can be
made are:
There is a 95% probability that the mean will
fall between +/- 1.96 standard errors of the
mean on any occasion.
There is a 99% probability that the mean
will fall between +/- 2.58 standard errors
of the mean on any occasion.
These “confidence limits” look like this on the
theoretical distribution of sampling means.
[Figure: the theoretical distribution of sampling means, with the 95% confidence limits at +/-1.96 SE and the 99% confidence limits at +/-2.58 SE.]
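As a sketch of how such limits can be computed (Python; hypothetical scores, with the standard error estimated the usual way as the SD divided by the square root of n):

    import numpy as np

    scores = np.array([72, 85, 90, 64, 78, 88, 95, 70, 82, 76])  # hypothetical

    mean = scores.mean()
    se = scores.std(ddof=1) / np.sqrt(len(scores))  # standard error of the mean

    # Confidence limits using the normal-curve multipliers quoted above.
    print(f"observed mean = {mean:.2f}, SE = {se:.2f}")
    print(f"95% limits: {mean - 1.96 * se:.2f} to {mean + 1.96 * se:.2f}")
    print(f"99% limits: {mean - 2.58 * se:.2f} to {mean + 2.58 * se:.2f}")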
Now assume a situation in which the same
thing is measured (i.e., using the same test or
measure) on two different occasions for the
same group (a la pre-post testing).
If the group did not have exactly the same
mean for each testing occasion, there was a
difference between the means.
That difference happened either because
something caused the difference or by
chance.
The important question is, “What is the
likelihood (i.e., probability) that the
difference happened simply by chance?”
Graphically (at the .05 level), it looks like this: [Figure: the PRE distribution with its 95% confidence limits at +/-1.96 SE.]
The question is, “Is the post mean inside or outside
of the 95% confidence limits for the pre mean?”
What a statistically significant difference looks like graphically: [Figure: the PRE distribution with confidence limits at +/-1.96 SE; a post mean inside the limits is non-significant, while a post mean beyond either limit is significant.]
The t-test is a statistical significance test that
covers this situation.
The t-test is used to determine if there is a
statistically significant difference between
only two means.
The t-test is appropriate for use when data
from 30 or fewer subjects are being
analyzed.
The t-test is sometimes referred to as the
“Student’s t-test.”
There are two types of t-tests.
A dependent, or correlated, t-test is used
when the difference between the means of
the same group assessed on two occasions is
being evaluated (e.g., pre-post).
An independent, or uncorrelated, t-test is used
when the difference between the means of two
separate groups is being evaluated (e.g., males
and females).
A t-test yields a statistic called a t value.
Computer programs generating the t value
also present the (exact) probability of
obtaining a t value of that magnitude.
The (exact) probability calculated for the t
value is compared to the (pre-determined)
alpha level for the analysis.
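A minimal sketch of both types of t-test using scipy.stats (all data hypothetical), comparing each exact p value to a pre-determined alpha:

    import numpy as np
    from scipy import stats

    alpha = 0.05  # pre-determined alpha level

    # Dependent (correlated) t-test: same group on two occasions (pre-post).
    pre = np.array([10, 12, 9, 14, 11, 13, 8, 12])
    post = np.array([13, 14, 10, 17, 13, 16, 9, 15])
    t, p = stats.ttest_rel(pre, post)
    print(f"dependent t = {t:.3f}, p = {p:.4f}, significant: {p < alpha}")

    # Independent (uncorrelated) t-test: two separate groups.
    group_1 = np.array([10, 12, 9, 14, 11, 13])
    group_2 = np.array([15, 13, 16, 14, 17, 12])
    t, p = stats.ttest_ind(group_1, group_2)
    print(f"independent t = {t:.3f}, p = {p:.4f}, significant: {p < alpha}")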
For the t-test, it was noted that the discrete
variable (i.e., the one that has categories) is
sometimes called the “independent” variable.
The discrete variable is also known as a
“factor.”
It is important to remember that this is
a different and distinct use from “factor
analysis,” which was a type of analysis
of relationships.
In the context of analyses of differences, a
factor is a variable that is discrete (i.e., has
categories) and is sometimes called the
independent variable.
In the context of analyses of differences, the
categories of a factor are called “levels.”
Again, this is in some ways a poor choice of words, because “levels” implies some type of hierarchy, but that’s not really what it means in this context.
Suppose “gender” as a (discrete or
independent) variable is included in a study.
In the study, gender would be a “factor”
having two “levels” (i.e., male and female).
Remember that levels = categories; no
hierarchy is necessarily applicable.
A t-test would be the appropriate analysis for a
study having only one factor that has two
levels.
The levels (categories) of the factor may be
“uncorrelated” (e.g., gender) or “correlated” (e.g.,
pre-post).
Instead of “correlated,” the phrase “repeated
measures” is used to indicate that the levels of
a factor are actually two or more
measurements on the same group of people as
part of a single research study.
Suppose that instead of viewing gender as either male or female, it was considered “sex-role orientation.”
The possible categories might then be male,
female, and androgynous, which would be the
three levels of the “sex-role orientation”
factor.
Then suppose a measure of “counseling
effectiveness” could be obtained for everyone
in each of the three groups.
One question might then be, “Are there statistically significant differences among the
counseling effectiveness means of the three
groups?”
Graphically, the possibilities would be: [Figure: number lines showing the possible patterns of the male (M), female (F), and androgynous (A) means, from all three close together to all three clearly separated.]
The appropriate analysis for this situation is a
one-way analysis of variance.
It’s called “one-way” because there is only one
factor involved.
This is one of several types of analyses of
variance, all of which are abbreviated “ANOVA.”
A one-way ANOVA is appropriate when there is
one factor in the study.
The factor may have three or more levels.
The levels may be either uncorrelated (e.g.,
three categories of sex-role orientation) or
correlated (e.g., pre-post-follow-up for an
experimental study).
A one-way ANOVA yields an F statistic (or as
it is more commonly known, an F value).
Theoretically, a one-way ANOVA works with a
factor with as many levels as are relevant
and/or desired.
Computer programs generate an exact
probability for the F value, which can then be
compared to the alpha level.
A statistically significant F value means that
there is at least one statistically significant
difference among the means.
However, a statistically significant F value does
not indicate which means are significantly
different from one another.
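A minimal one-way ANOVA sketch using scipy.stats.f_oneway (hypothetical counseling-effectiveness scores for the three sex-role orientation groups):

    from scipy import stats

    male = [62, 58, 65, 70, 61]
    female = [72, 68, 75, 70, 74]
    androgynous = [80, 77, 84, 79, 82]

    f_value, p = stats.f_oneway(male, female, androgynous)
    # A significant F says at least one difference exists among the
    # means, but not which means differ.
    print(f"F = {f_value:.3f}, p = {p:.4f}")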
A “multiple comparison” is a statistical procedure that allows determination of which means are statistically significantly different from one another.
A multiple comparison is only appropriate
following a statistically significant F value.
A multiple comparison allows determination of
which of these patterns exists (and more than one
may apply):
[Figure: the same possible patterns of the M, F, and A means shown earlier.]
Multiple comparison procedures range on a
continuum of “liberal” to “conservative.”
The more liberal the procedure, the smaller
the difference needed to be considered
statistically significantly different.
More conservative procedures reduce the
chance for Type I error, but make it more
difficult to achieve a statistically significant
difference.
Some of the multiple comparison methods, ordered from most liberal to most conservative, include (a Tukey HSD sketch follows this list):
Pairwise Comparisons (t-tests)
Fisher LSD
Duncan Multiple Range Test
(Student) Newman-Keuls
Tukey HSD
Scheffe
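As one example from the conservative end of the continuum, here is a minimal Tukey HSD sketch using scipy.stats.tukey_hsd (available in SciPy 1.8 and later; same hypothetical groups as the ANOVA sketch above):

    from scipy import stats

    male = [62, 58, 65, 70, 61]
    female = [72, 68, 75, 70, 74]
    androgynous = [80, 77, 84, 79, 82]

    # Tests every pair of means; run only after a significant F.
    result = stats.tukey_hsd(male, female, androgynous)
    print(result)  # pairwise mean differences with p values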
A factorial analysis of variance (ANOVA) is
appropriate when there are two or more
factors, each of which has at least two levels.
(Again, remember that a factorial ANOVA is
not the same as factor analysis).
Suppose the research question was, “What are
the differences in graduate-level academic
aptitude as a function of gender and race?”
The variables might be as follows:
The “dependent” variable is GRE Total
Score.
One factor is “gender,” and it has two
levels: male (M) and female (F).
Another factor is “race,” and it has three
levels: African-American (AA),
Hispanic-American (HA), and
Caucasian-American (CA).
Graphically, the research could be shown as: [Figure: a 2 x 3 table of GRE-T means, with Gender (M, F) as one dimension and Race (AA, HA, CA) as the other.]
One F value would be obtained for each factor: Fgender and Frace. These are known as the “main effects” F values.
These F values are independent; the statistical
significance of one is unrelated to the statistical
significance of the other.
An interaction F value also would be obtained: Fgender by race.
An interaction F value allows evaluation of
whether the effects of one variable are
consistent for all levels of the other variable.
The interaction F value is independent of the
other two.
Graphically, it all looks like this: [Figure: the 2 x 3 Gender-by-Race table of means, with the three resulting F values marked: Fgender, Frace, and Fgender by race.]
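A minimal factorial (two-way) ANOVA sketch using statsmodels (hypothetical GRE Total scores; the '*' in the formula expands to both main effects plus their interaction, so one table reports Fgender, Frace, and Fgender by race):

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    df = pd.DataFrame({
        "gre": [300, 310, 295, 305, 315, 290,
                320, 300, 310, 305, 298, 312],
        "gender": ["M"] * 6 + ["F"] * 6,
        "race": ["AA", "AA", "HA", "HA", "CA", "CA"] * 2,
    })

    # One ANOVA table with both main effects and the interaction.
    model = ols("gre ~ C(gender) * C(race)", data=df).fit()
    print(sm.stats.anova_lm(model, typ=2))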
Now suppose another factor is added, such as “academic degree” (Master’s, Specialist, or Doctorate).
[Figure: the expanded Gender (M, F) by Race (AA, HA, CA) by Degree (M, S, D) design, with a cell for each of the 18 combinations.]
There will be one F value for each factor (aka “main effects”): Fgender, Frace, and Fdegree.
These F values are all independent of one
another.
If either Frace and/or Fdegree is statistically
significant, a multiple comparison would be
needed to determine the pattern of significant
differences.
There also would be three “two-way interactions”: Fgender by race, Frace by degree, and Fdegree by gender.
There also would be one “three-way interaction,” which represents the combination of variables three at a time: Fgender by race by degree.
These F values also are independent of all the
others.
The t-test, one-way ANOVA, and factorial ANOVA are known as “univariate” analyses, because only one dependent variable (e.g., GRE Total score) is involved.
If a second (or more) dependent variable is
added, the appropriate analysis is a
multivariate analysis of variance (MANOVA).
A MANOVA also yields an F value.
If the Fmultivariate is NOT significant, it means
that there are no significant differences
anywhere among the sets of means.
If the Fmultivariate is statistically significant,
appropriate univariate analyses must be
computed to determine which means are
significantly different from one another.
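A minimal MANOVA sketch using statsmodels (hypothetical data: two dependent variables, labeled here as GRE Verbal and GRE Quantitative, and one two-level factor):

    import pandas as pd
    from statsmodels.multivariate.manova import MANOVA

    df = pd.DataFrame({
        "gre_v": [150, 155, 148, 160, 152, 158, 165, 159, 162, 157],
        "gre_q": [152, 150, 149, 158, 151, 161, 166, 160, 163, 159],
        "gender": ["M"] * 5 + ["F"] * 5,
    })

    # A significant multivariate F would justify follow-up univariate
    # analyses; a non-significant one means no differences anywhere
    # among the sets of means.
    manova = MANOVA.from_formula("gre_v + gre_q ~ gender", data=df)
    print(manova.mv_test())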
Graphically, analyses of differences can be summarized as follows:

Analysis of Difference | Dependent Variables | Factors | Levels | Uncorrelated Levels? | Repeated-Measures Levels?
(Student’s) t-test | 1 | 1 | 2 | Yes | Yes
One-way ANOVA | 1 | 1 | 3 or more | Yes | Yes
Factorial ANOVA | 1 | 2 or more | 2 or more each | Yes | Yes
MANOVA | 2 or more | 2 or more | 2 or more each | Yes | Yes
Nonparametric Statistics
So-called nonparametric statistics are used
when the data are nominal or ordinal, or
when the data are interval but the assumption of a normal distribution of the variable
cannot be met.
In general, there are nonparametric
statistical analyses that “parallel” most
parametric statistical analyses.
The following are commonly used
nonparametric “correlational” techniques, most
derived from the Pearson Product-Moment
Correlation Coefficient.
Spearman’s Rho is a correlation coefficient
appropriate when the data being correlated are
ranks (i.e., ordinal data).
A Point Biserial Correlation is appropriate
when one of the variables is continuous and the
other is dichotomous.
A Biserial Correlation is appropriate when
both variables are actually continuous, but
one is being treated as a dichotomous
variable.
A Tetrachoric Correlation is appropriate when
both variables are actually continuous, but
both are being treated as dichotomous.
A Phi Coefficient is appropriate when both
variables are actually dichotomous.
A Coefficient of Contingency is appropriate
when one or both of the (nominal) variables
has three or more categories.
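Two of these coefficients are directly available in scipy.stats; a minimal sketch with hypothetical data:

    from scipy import stats

    # Spearman's Rho: two sets of ranks (ordinal data).
    judge_1 = [1, 2, 3, 4, 5, 6]
    judge_2 = [2, 1, 4, 3, 5, 6]
    rho, p = stats.spearmanr(judge_1, judge_2)
    print(f"Spearman rho = {rho:.3f}, p = {p:.4f}")

    # Point Biserial: one continuous variable, one true dichotomy.
    passed = [0, 1, 0, 1, 0, 1, 0, 1]
    scores = [70, 85, 60, 90, 75, 88, 65, 92]
    r_pb, p = stats.pointbiserialr(passed, scores)
    print(f"point biserial r = {r_pb:.3f}, p = {p:.4f}")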
The following are commonly used nonparametric tests of differences.
The Median Test is appropriate to use to test
the significance of difference between the
medians of two independent samples.
A Sign Test is appropriate to test the significance of difference between two or more sets
of paired observations (i.e., measurements).
The Wilcoxon Rank Sum Test is appropriate
to test the significance of difference when the
data from two independent samples can be
assigned ranks.
The Mann-Whitney U Test is essentially the
same as the Wilcoxon Rank Sum Test, but is
often used with smaller samples.
The Kruskal-Wallis is essentially a one-way
analysis of variance appropriate to use to test
the significance of difference among three or
more sets of ranks.
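A minimal sketch of two of these tests using scipy.stats (hypothetical rankable data):

    from scipy import stats

    group_a = [12, 15, 11, 18, 14, 16]
    group_b = [20, 17, 22, 19, 21, 18]
    group_c = [25, 23, 27, 24, 26, 22]

    # Mann-Whitney U: two independent samples.
    u, p = stats.mannwhitneyu(group_a, group_b)
    print(f"Mann-Whitney U = {u:.1f}, p = {p:.4f}")

    # Kruskal-Wallis: three or more sets of ranks (the nonparametric
    # parallel to a one-way ANOVA).
    h, p = stats.kruskal(group_a, group_b, group_c)
    print(f"Kruskal-Wallis H = {h:.3f}, p = {p:.4f}")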
The Chi Square Test, which is the most
commonly used nonparametric statistic, is a
test of the magnitude of discrepancy between
observed (i.e., measured) and expected
distribution frequencies.
The Chi Square Test is used either as a “goodness of fit” test or as a test of independence.
The Chi Square “goodness of fit” Test is usually used to test the degree of correspondence between observed and (theoretically) expected frequencies for a single variable.
The Chi Square Test as a test of independence
is used to test the degree of independence
between the observed and expected
frequencies for two variables.
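Both uses of the Chi Square Test, sketched with scipy.stats (hypothetical frequencies; note that the goodness-of-fit form requires the observed and expected totals to match):

    from scipy import stats

    # Goodness of fit: observed vs. expected frequencies, one variable.
    observed = [18, 22, 20, 40]
    expected = [25, 25, 25, 25]
    chi2, p = stats.chisquare(observed, f_exp=expected)
    print(f"goodness of fit: chi2 = {chi2:.3f}, p = {p:.4f}")

    # Test of independence: a contingency table for two variables.
    table = [[30, 10],
             [15, 25]]
    chi2, p, dof, expected_freqs = stats.chi2_contingency(table)
    print(f"independence: chi2 = {chi2:.3f}, p = {p:.4f}")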
Because the distributions to which the various
nonparametric statistics are applied vary
considerably, methods to evaluate the
statistical significance of the various statistics
generated are unique to the various
techniques.
However, similar to most parametric statistics,
the resultant nonparametric statistical value is
evaluated against its probability as a chance
occurrence.
Needs Assessment and
Program Evaluation
A fundamental question in the counseling
professions is:
How can we integrate good needs assessment
and program evaluation practices to yield an
effective and comprehensive understanding
of a service delivery system?
One commonly accepted approach is to follow
the CIPP model, in which CIPP is an acronym
for:
Context evaluation
Input evaluation
Process evaluation
Product evaluation
Context evaluation:
is essentially equal to needs assessment
within the CIPP model.
necessitates clear specification of potential service recipients.
involves gathering data directly from
potential service recipients.
should point to program goals and objectives.
Primary context evaluation methods include use
of surveys and/or interviews.
Effective context evaluation provides answers
to questions such as:
What is the diversity of the needs expressed among the potential service recipients?
What are the priorities among the various
categories of needs expressed?
Do the needs expressed reflect current or
future circumstances?
Which of the expressed needs are in concert with program activities?
Input evaluation:
serves to identify available resources
and constraints for a service delivery program.
follows directly from a context evaluation (i.e., needs assessment).
yields the parameters within which
the program can and should be
conducted.
Effective input evaluation will provide answers
to questions such as:
What is the environment (i.e., physical space
and material resources) available for
the program?
What are the fiscal resources available for
the program?
What are the human or personnel resources
available for the program?
What rules (and/or entities) govern the conduct of the program?
Together, the results of context and input evaluations determine the nature of the accountability
for the program and for the program participants.
That is, they allow determination of who will
be accountable to whom, how, and for what.
Process evaluation:
is concerned with the effectiveness of the
day-to-day operation of the program.
is used interchangeably with the term
“formative” evaluation.
provides data upon which to base service
delivery program decisions while
the program is in operation.
Effective process evaluation will provide answers
to questions such as:
What is the efficiency level of the service
delivery program?
What factors influence the expenditure of
funds within the service delivery
program?
How efficient is the service delivery program schedule?
What factors influence decision-making
processes in the service delivery
program?
Product evaluation:
allows determination of the actual “outcomes” of the service delivery
program.
is used interchangeably with the term
“summative” evaluation.
is often considered the “bottom line” in
accountability processes.
Effective product evaluation provides answers to
questions such as:
To what extent are the service delivery
program’s goals and objectives being
met?
What are the service delivery program’s
impacts in terms of identifiable
changes?
What is the service delivery program’s
cost-benefit ratio?
What are the “lost opportunity” costs attributable to intra-program changes?
The CIPP model is circular: the best evaluation evolves from a fully integrated cycle of all four parts of the model. [Figure: a circle linking Context, Input, Process, and Product evaluation.]
The CIPP model and accountability are integrally linked because any service delivery
program should be held accountable for:
what it is attempting to accomplish (context evaluation),
what resources it uses (input evaluation),
how resources are used (process evaluation), and
what happens as a result of the program
(product evaluation).
Seven types of accountability have been
identified in the professional counseling and
development literature.
Service delivery accountability addresses the
question, “To what extent does the program
deliver the services it promises to deliver?”
Ethical accountability addresses the question,
“To what extent are services delivered within
the parameters of acceptable ethical
practice?”
Legal accountability addresses the question,
“To what extent are services delivered within
the parameters of legal constraints?”
Coverage accountability addresses the
question, “To what extent does the service
delivery program serve all of the people it
purports to serve?”
Efficiency accountability addresses the question,
“To what extent is time used efficiently in
delivery of the service program?”
Fiscal accountability addresses the question,
“To what extent are available fiscal resources
used in a manner that maximizes the
likelihood of positive program outcomes?”
Impact accountability addresses the question,
“To what extent does the service delivery
program actually make positive changes in
peoples’ lives?”
This concludes Part 5 of the presentation on
RESEARCH AND PROGRAM EVALUATION