
Assessment Concepts
Dr. Julie Esparza Brown
Sped 512: Diagnostic Assessment
Week 4
Portland State University
Normal Distribution

Symmetrical.
Unimodal.
No skew.
No excess kurtosis (mesokurtic).
You always know the percent of the distribution in any part of the normal curve.
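
To illustrate, here is a minimal sketch in Python (assuming the scipy library is available) that reproduces the familiar fixed percentages under the normal curve; the values come from the standard normal distribution itself, not from any data set.

    from scipy.stats import norm

    # Proportion of the distribution between chosen points on the curve,
    # expressed in standard deviation (z) units.
    print(norm.cdf(1) - norm.cdf(0))    # ~0.3413: about 34% between the mean and +1 SD
    print(norm.cdf(1) - norm.cdf(-1))   # ~0.6827: about 68% within 1 SD of the mean
    print(norm.cdf(2) - norm.cdf(-2))   # ~0.9545: about 95% within 2 SDs of the mean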
Raw Scores

Raw scores convey very little meaning unless they are referenced to some standard.
Percentiles (Relative Standing)

The percent of people in the comparison group who scored at or below the score of interest.

Example: Billy obtained a percentile rank of 42.
This means that Billy performed as well as or better than 42% of children his age on the test.
Or, 42% of children Billy's age scored at or below Billy's score.
Or, Billy is number 42 in a line of 100 people.
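
A small Python sketch of the "at or below" definition, using scipy's percentileofscore; the comparison-group scores here are hypothetical:

    from scipy.stats import percentileofscore

    comparison_group = [55, 60, 61, 63, 63, 65, 70, 72, 80, 91]  # hypothetical raw scores
    billys_score = 63

    # kind='weak' counts scores at or below Billy's score, matching the definition above.
    print(percentileofscore(comparison_group, billys_score, kind='weak'))  # 50.0

A percentile rank of 50.0 here means Billy scored as well as or better than half of the comparison group.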
Advantages of Percentile Ranks

Percentile ranks are one of the best types of scores for reporting a child's relative standing compared to other children.
Scores indicate how well a student performed compared to the performance of some reference group.
Percentile ranks are ordinal scale data (values are ordered from worst to best, but differences between adjacent values are unknown).
It is therefore not meaningful to calculate the mean or standard deviation of percentiles.
Standard Scores (Relative Standing)

Standard scores are scores of relative standing with a set, fixed, predetermined mean and standard deviation.

Standard Score      Mean   Standard Deviation
Z                   0      1
T                   50     10
IQ                  100    15
SB Subtest          50     8
WISC-III Subtest    10     3
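
The table reflects a single relationship: any standard score is the scale's mean plus z standard deviations. A minimal Python sketch, with the scale values taken from the table above:

    # mean and standard deviation for each scale in the table
    scales = {
        "Z": (0, 1),
        "T": (50, 10),
        "IQ": (100, 15),
        "SB subtest": (50, 8),
        "WISC-III subtest": (10, 3),
    }

    z = -1.0  # one standard deviation below the mean
    for name, (mean, sd) in scales.items():
        print(name, mean + z * sd)  # Z -1.0, T 40.0, IQ 85.0, SB 42.0, WISC-III 7.0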
WJ-III Scores

The WJ-III uses standard scores with a mean of 100 and a standard deviation of 15.
A person earning a score of 85 would be one standard deviation below the mean.
A person earning a score of 115 would be one standard deviation above the mean.
Standard scores are equal-interval scores, so they can be combined (e.g., added or averaged).
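
As a sketch, converting WJ-III scores to standard deviation (z) units makes both points above concrete; the averaging example is hypothetical:

    mean, sd = 100, 15  # WJ-III standard score scale

    for score in (85, 115):
        z = (score - mean) / sd
        print(score, "is", z, "SDs from the mean")  # -1.0 and +1.0

    # Because standard scores are equal interval, they can be averaged:
    print((85 + 115) / 2)  # 100.0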
Age & Grade Equivalents (Developmental Scale)

There are problems with using these scores.
Identical age equivalents can mean different task performance.
Problems with Grade and Age Equivalent Scores

1. Systematic misinterpretation: a student who earns an AE of 12.0 has answered as many questions correctly as the average for children of age 12. The student has not necessarily performed as a 12-year-old would.
2. Implication of a false standard of performance: equivalent scores are constructed so that 50% of any age or grade group will perform below age or grade level and 50% above it.
3. Tendency for scales to be ordinal, not equal interval: because age and grade equivalent scores are ordinal, they should not be added or multiplied.

Source: Salvia, Ysseldyke & Bolt (2009)
Age & Grade Equivalents (Developmental Scale)

Maria's age equivalent of 2-0 on a test means:
Maria obtained the same number correct as the estimated mean of children 2 years and 0 months of age.

It does NOT mean:
Maria performed like an average 2-year-old on the test.
Age & Grade Equivalents (Developmental Scale)

John's grade equivalent of 3.5 on a test means:
John obtained the same number correct as the estimated mean of children in the fifth month of third grade.

It does NOT mean:
John is able to do grade 3.5-level work.

Bottom line: do not use age or grade equivalent scores.
Scales of Measurement
Nominal Scale (Name)

A scale of measurement in which there is no inherent relationship among adjacent values.
Each number reflects an arbitrary category label rather than an amount of a variable.
Nominal scales are used to indicate classification, category, or group.

Examples:
Football jersey numbers
Group 1, Group 2
Diagnostic categories
Ordinal Scale (Order)

A scale on which values of measurement are ordered from best to worst or from worst to best; on ordinal scales, the differences between adjacent values are unknown.
Ordinal scales provide order and ranking information (1st, 2nd, 3rd, etc.).
They are used to indicate when one value has more or less of something than another.
The central tendency of an ordinal attribute can be represented by its mode or its median, but the mean cannot be defined.

Examples:
Rank in high school class
Percentile rank
Age and grade equivalents
Results of a horse race
Interval Scale (Interval/Distance)

Interval scales provide distance (interval) information.
Differences have meaning: equal differences in the numbers correspond to equal differences in the attributes.
Most data in education will be interval scale data.

Examples:
IQ scores
Test scores
Rating scales
Ratio Scale (Ratio/Absolute 0)

A scale of measurement in which the difference between adjacent values is equal and in which there is a logical and absolute zero.
Ratio scales provide absolute amount information.

Examples:
Counts of behavior
Income
Measures of Central Tendency
Mean (Most useful)

Mean: the average of all the scores in the distribution.
Appropriate for equal interval and ratio scales.
Not appropriate for skewed distributions.
Median (Next most useful)

Median: the middle score of a distribution.
Appropriate for ordinal, equal interval, and ratio scales.
Most appropriate when the distribution is skewed.
50% of scores are above the median and 50% of scores are below it.

Example:
Arrange the scores in order from largest to smallest (or vice versa).
If N is odd, the middle score is the median.
If N is even, the average of the two middle scores is the median.
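
The procedure above translates directly into code; a minimal Python sketch with hypothetical scores:

    def median(scores):
        ordered = sorted(scores)  # arrange the scores in order
        n = len(ordered)
        mid = n // 2
        if n % 2 == 1:            # N odd: the middle score is the median
            return ordered[mid]
        # N even: the average of the two middle scores is the median
        return (ordered[mid - 1] + ordered[mid]) / 2

    print(median([7, 1, 5, 9, 3]))       # 5 (odd N)
    print(median([7, 1, 5, 9, 3, 11]))   # 6.0 (even N)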
Mode (Least useful)

The mode is the most frequently occurring score.
Appropriate for nominal, ordinal, equal interval, and ratio scales.
Generally used in a very rough sense to get a feel for “the peak of the mountain.”
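
Python's standard statistics module computes the mode directly; a quick sketch with hypothetical scores:

    import statistics

    scores = [2, 3, 3, 5, 7, 7, 7, 9]
    print(statistics.mode(scores))  # 7, the most frequently occurring score

    # multimode returns every peak when a distribution has more than one
    print(statistics.multimode([1, 1, 2, 2, 3]))  # [1, 2]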
Measures of Spread or Variability
Standard Deviation

Standard deviation (S) indicates the spread or variability of a distribution; it is the square root of the variance.
Appropriate only for equal interval and ratio scales.
Also used as a unit of measurement.
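
A minimal sketch of the definition, using hypothetical scores and the population formula for the variance:

    import math

    scores = [85, 90, 100, 110, 115]  # hypothetical scores
    mean = sum(scores) / len(scores)                               # 100.0
    variance = sum((x - mean) ** 2 for x in scores) / len(scores)  # 130.0
    sd = math.sqrt(variance)                                       # ~11.40
    print(variance, sd)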
Technical Adequacy of Instruments
The Reliability Coefficient

An index of the extent to which observations can be generalized; the square of the correlation between obtained scores and true scores on a measure.
The proportion of variability in a set of scores that reflects true differences among individuals.
If there is relatively little error, the ratio of true-score variance to obtained-score variance approaches a reliability index of 1.0 (perfect reliability).
If there is a relatively large amount of error, the ratio of true-score variance to obtained-score variance approaches .00 (total unreliability).
We want to use the most reliable tests available.
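
In classical test theory, obtained-score variance is the sum of true-score variance and error variance, so the ratio described above can be sketched directly; the variance values below are hypothetical:

    # hypothetical variance components
    true_score_variance = 90.0
    error_variance = 10.0

    obtained_score_variance = true_score_variance + error_variance
    reliability = true_score_variance / obtained_score_variance
    print(reliability)  # 0.9: little error, so reliability approaches 1.0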
Standards for Reliability

If test scores are to be used for administrative purposes and are reported for groups of individuals, a reliability of .60 should be the minimum. This relatively low standard is acceptable because group means are not affected by a test's lack of reliability.
If weekly (or more frequent) testing is used to monitor pupil progress, a reliability of .70 should be the minimum. This relatively low standard is acceptable because random fluctuations can be taken into account when a behavior or skill is measured often.
Standards for Reliability

If the decision being made is a screening decision, there is a need for higher reliability; for screening devices, a standard of .80 is recommended.
If a test score is to be used to make an important decision about an individual student (such as special education placement), the minimum standard should be .90.
Standard Error of Measurement

The SEM is another index of test error.
It is the standard deviation of the distribution of error around a person's true score.
It indicates how far a student's obtained score is likely to fall from the student's true score.
We generally assess a student only once on a norm-referenced test, so we do not know the test taker's true score or the variance of the measurement error that forms the distribution around that true score.
We estimate the error distribution by calculating the SEM.
The general formula: the SEM equals the standard deviation of the obtained scores multiplied by the square root of 1 minus the reliability coefficient.
When the SEM is relatively large, the uncertainty about whether the student's true score falls within a given range is large; when the SEM is relatively small, the uncertainty is small.
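
In symbols, SEM = SD × √(1 − r), where SD is the standard deviation of the obtained scores and r is the reliability coefficient. A minimal Python sketch; the scale values are illustrative:

    import math

    def sem(sd, reliability):
        # SEM = SD * sqrt(1 - r)
        return sd * math.sqrt(1 - reliability)

    # e.g., a scale with SD = 15 and a reliability of .90:
    print(sem(15, 0.90))  # ~4.74

Note how higher reliability shrinks the SEM: with r = .95, the same scale yields an SEM of about 3.35.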
Confidence Interval

The range of scores within which a person's true score will fall with a given probability.
Since we can never know a person's true score, we estimate the likelihood that the true score will be found within a specified range of scores called the confidence interval.
Confidence intervals have two components:
Score range
Level of confidence
Confidence Interval

Score range: the range within which a true score is likely to be found. A range of 80–90 tells us that a person's true score is likely to be within that range.
Level of confidence: tells us how certain we can be that the true score is contained within the interval. If a 90% confidence interval for an IQ is 106–112, we can be 90% sure that the true score falls within that interval. It also means there is a 5% chance the true score is higher than 112 and a 5% chance it is lower than 106.
To have greater confidence requires a wider confidence interval.
You will have a choice of confidence intervals in Compuscore. You can choose the 90 percent option, but the default is set at 68%.
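
A confidence interval can be sketched as the obtained score plus or minus a standard normal multiplier times the SEM (roughly 1.0 for 68% confidence and 1.645 for 90%); the score, SD, and reliability below are hypothetical:

    import math

    def confidence_interval(obtained, sd, reliability, z):
        sem = sd * math.sqrt(1 - reliability)
        return obtained - z * sem, obtained + z * sem

    # hypothetical IQ of 109 on a scale with SD = 15 and reliability .95
    print(confidence_interval(109, 15, 0.95, 1.645))  # about (103.5, 114.5) at 90%
    print(confidence_interval(109, 15, 0.95, 1.0))    # about (105.6, 112.4) at 68%

Raising the multiplier (e.g., about 2.576 for 99%) is exactly why greater confidence requires a wider interval.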
Validity

“The degree to which evidence and theory support the interpretation of test scores entailed by proposed uses of tests” (APA Standards, 1999, p. 9).
Validity is the most fundamental consideration in evaluating and using tests.
Validity

“A test that leads to valid inferences in general or about most students may not yield valid inferences about a specific student…First, unless a student has been systematically acculturated in the values, behavior, and knowledge found in the public culture of the United States, a test that assumes such cultural information is unlikely to lead to appropriate inferences about that student…
Validity

Second, unless a student has been systematically instructed in the content of an achievement test, a test assuming such academic instruction is unlikely to lead to appropriate inferences about the student’s ability to profit from instruction. It would be inappropriate to administer a standardized test of written language (which counts misspelled words as errors) to a student who has been encouraged to use inventive spelling and reinforced for doing so. It is unlikely that the test results would lead to correct inferences about that student’s ability to profit from systematic instruction in spelling” (Salvia, Ysseldyke, & Bolt, 2009, p. 63).
Types of Validity

Content validity
Criterion-related validity
Construct validity
Content Validity

A measure of the extent to which a test is an adequate measure of the content it is designed to cover. Content validity is established by examining three factors:
Appropriateness of the types of items included
Comprehensiveness of the item sample
The way in which the items assess the content

It is assessed through a review of the items by trained individuals who make judgments about the relevancy of the items and the unambiguity of their formulation.
Content validity is especially important in achievement testing and is an area under debate.
There is an emerging consensus that the methods used to assess student knowledge should closely parallel those used in instruction.
Criterion-related Validity

The extent to which performance on a test predicts performance in a real-life situation.
Usually expressed as a correlation coefficient called a validity coefficient.
There are two types of criterion-related validity:
Concurrent validity
Predictive validity
Concurrent Validity

A measure of how accurately a person's current test score can be used to estimate a score on a criterion measure.
If a test presents evidence of content validity and elicits test scores that correspond closely to (correlate significantly with) judgments and scores from other achievement tests presumed to be valid, we can conclude that there is evidence for the test's criterion-related validity.
Predictive Criterion-related Validity

A measure of the extent to which a person's current test scores can be used to estimate accurately what that person's criterion scores will be at a later time.
Concurrent and predictive validity differ in the time at which scores on the criterion measure are obtained.
If we are developing a test to assess reading readiness, we can ask: Does knowledge of a student's score on the reading test allow an accurate estimation of the student's actual readiness for instruction? How do we know that our test really assesses reading readiness?
The first step is to find a valid criterion measure; if the assessment has content validity and corresponds closely to that criterion, we can conclude the test is valid.
Construct Validity

The extent to which a procedure or test measures a theoretical trait or characteristic.
Especially important for measures of process, such as intelligence/cognition.
To provide evidence of construct validity, an author must rely on indirect evidence and inference.
To gauge construct validity, a test developer accumulates evidence that the test acts in the way it would if it were a valid measure of the construct.
As the research evidence accumulates, the developer can make a stronger claim to construct validity.
The Bottom Line…

“Test users are expected to ensure that the test is appropriate for the specific students being assessed.” (Salvia, Ysseldyke & Bolt, 2009, p. 71)