
The Assessment of Teaching at a
Large Urban Community College
Terri M. Manning and Denise Wells,
Central Piedmont Community College
Lily Hwang, Morehouse College
Lynn Delzell, UNC-Charlotte
Presentation made to AIR, May 19, 2003 – Tampa, FL
Why Do We Evaluate Teaching?

We evaluate teaching for two reasons (heavy emphasis on the 1st):

1. So faculty will have feedback from students that can be used to improve teaching.
2. So chairs/division directors can have one consistent indicator of students’ perceptions of faculty (especially part-timers). These are often used as one of several means of assessing teaching for merit.
Problems in General with “Evaluation of Teaching” Tools

- Most are created internally
- Committees don’t always start at the beginning: “what is good teaching?”
- Most are not tested for (at least) validity and reliability
- Many are thrown together rather quickly by a committee whose goal is a usable survey tool
Very Few Tools Are For Sale

- Institutions are unique, and what they want to measure is unique (undergraduate, graduate, continuing ed, literacy and distance ed courses)
- Most institutions see them for what they are… happiness coefficients
- No one will stand behind them: “our tool is a valid measure of teaching”
- They would never stand up in court
- So be very careful! Never cite your teaching eval as a reason for not renewing a contract.
Problems with the Use of Them…

- The scores are used inappropriately and sometimes unethically (or at least stupidly)
- They are used for merit pay, promotion and tenure
- Scores are treated like gospel: “you are a bad teacher because you scored below the department mean on the tool”
Problems with Use, cont.

- Critical at the community college, where 100% of the job description is “to teach”
- Used to make hiring and firing decisions
- Teachers are placed in a “catch-22” situation (do I pretend this tool measures teaching, or blow it off… you could be in trouble either way)
- Who is included in group means for comparison purposes?
A Misconception

- You get a bunch of people together
- Throw a bunch of questions together
- Call it a teaching evaluation tool
- And “hocus pocus” it is a valid, reliable, sensitive and objective tool
- You can make merit, promotion and tenure decisions with it… no problem
What Makes a Good Questionnaire?

- Validity – it truly (with proof) tests what it says it tests (good teaching)
- Reliability – it tests it consistently over time or over terms, across campuses and methods
- Sensitivity (this is critical) – it picks up fine or small changes in scores; when improvements are made, they show up (difficult with a 5-point Likert scale)
- Objectivity – participants can remain objective while completing the tool; it doesn’t introduce bias or cause reactions in subjects
Problems Inherent in Teaching Evaluation with Validity

What is “good teaching”?

- It isn’t the same for all teachers
- It isn’t the same for all students
- We know it when it is not there or “absent”
- Yet, we don’t always know it when we see it (if the style is different from ours)
- Who gets to define good teaching?
- How do you measure good teaching?
- How can you show someone how to improve it based on a Likert-scale tool? (“this is how you raise your mean by .213 points”)
Problems Inherent in Teaching Evaluation with Reliability

- Students’ perceptions change (e.g. giving them the survey just after a tough exam versus after a fun group activity in class)
- From class to class of the same course, things are not consistent
- Too much relies on how the student feels that day (did they get enough sleep, eat breakfast, break up with a boyfriend, feel depressed, etc.)
- Faculty are forced into a standard bell curve on scores
- There is often too much noise (other interactive factors, e.g. student issues, classroom issues, time of day)
Greatest Problem… Sensitivity

- Likert scales of 1-5 leave little room for improvement
- Is a faculty member with a mean of 4.66 really a worse teacher than a faculty member with a mean of 4.73 on a given item?
- Can you document exactly how one can improve their scores?
- In many institutions, faculty have learned how to abuse these in their merit formulas
- Faculty with an average mean across items of 4.88 still don’t get into the highest rung of merit pay
The Standard Bell Curve

[Figure: a standard normal curve centered on the mean, with standard deviations from -3 to +3 on the x-axis. About 34.12% of scores fall within one SD on either side of the mean, 13.59% between one and two SDs, and 2.14% between two and three SDs.]
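The band percentages in the figure can be reproduced from the standard normal CDF. A minimal sketch using only the standard library (the `phi` helper is our own, not part of the original presentation):

```python
import math

def phi(x: float) -> float:
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Percent of scores falling in each one-SD band of a normal distribution
bands = {
    "0 to 1 SD": phi(1) - phi(0),   # ~34.13%
    "1 to 2 SD": phi(2) - phi(1),   # ~13.59%
    "2 to 3 SD": phi(3) - phi(2),   # ~2.14%
}
for label, p in bands.items():
    print(f"{label}: {p:.2%}")
```

The bands are symmetric, so the same percentages apply below the mean.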
IQ – An Example of a (somewhat) Normally Distributed Item (key is range)

[Figure: distribution of scaled IQ scores ranging from 55 to 145, with a mean of 100 and a standard deviation of 15.]
The Reality of Our Tool – Question #1
(17,734 responses from Fall 2000)

1. The instructor communicates course objectives, expectations, attendance policies and assignments.

[Figure: distribution of responses to item 1, heavily skewed toward the top of the scale: 5 = 67.9%, 4 = 20.5%, 3 = 9.4%, 2 = 1.5%, 1 = 0.5%.]

Item Mean = 4.54, Standard Deviation = .77
What Would the Scores Look Like?

[Figure: the item’s scores forced into a bell curve around the mean of 4.54 (SD = .77). The standard-deviation points fall at 2.23, 3.00 and 3.77 below the mean and at 5.31, 6.08 and 6.85 above it, even though the maximum possible score is 5.]
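The absurdity of forcing these scores into a bell curve is easy to see by computing the mean plus and minus whole standard deviations, using the item mean and SD from the Fall 2000 data:

```python
# Item mean and SD for item 1 from the Fall 2000 data
mean, sd = 4.54, 0.77
max_score = 5

# Score at each standard deviation from -3 to +3
points = {k: mean + k * sd for k in range(-3, 4)}

for k in sorted(points):
    flag = "  <-- impossible (exceeds the 5-point maximum)" if points[k] > max_score else ""
    print(f"{k:+d} SD: {points[k]:.2f}{flag}")
```

Everything from one SD above the mean upward lies beyond the top of the scale, so no student response can ever land there.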
How We Developed the Student Opinion Survey at CPCC

- We started with the old tool
- An analysis was done (it was rather poor and proof of administrative reactions to current issues)
  - The old tool contained 20 questions, mostly about the business of teaching (handing back exams, speaking clearly, beginning class on time, etc.)
  - 91% of faculty received all 4s and 5s on each item
  - The less sophisticated students were, the higher they rated their teachers
Next…

- A subcommittee of the Institutional Effectiveness Committee was formed, consisting mainly of faculty
- The committee spent one year studying the tools of other colleges and universities and lifting what we liked
- We found virtually nothing for sale
- What we did find were test banks of questions
Next, cont.

- We started with 50-60 questions we liked off of other tools
- We narrowed the questions down
- We worked through every single word in each statement to make sure they were worded exactly like we wanted them and that they measured what we wanted
- We ended up with 36 questions on the new tool
Next, cont.

- We worked on the answer scale
  - We found students had trouble processing the Likert scale (it wasn’t defined)
  - Students liked the A-F grading scale (it took far less time), but faculty didn’t
  - We worked through the “excellent, good, fair, poor” type of scale and the “strongly agree to strongly disagree” scale, and tested two types during our pilot process
Next, cont.

- We wanted to create subscales with a wider range of scores than a 1-5 scale:
  - The art of teaching
  - The science of teaching
  - The business of teaching
  - The course
  - The student
Next, cont.

- We pilot tested the tool with about 10 classes and followed it up with focus groups (Fall 1999)
- We revised the tool
- We pilot tested again (many sections, about 400 students) with two scales (Summer 2000):
  - An A-F scale like grades
  - An A-E scale with definitions for each score
What We Found

Students rated faculty differently depending on the scale. Example, item 13:

A-F Scale: “How would you rate the instructor on encouraging thinking and learning”
  Mean = 3.56, St. Dev. = .74
  A: 241 (68.7%)   B: 75 (21.4%)   C: 28 (8.0%)   D: 6 (1.7%)   F: 1 (.3%)

Strongly Agree Scale: “The instructor encourages thinking and learning.”
  Mean = 3.48, St. Dev. = .71
  SA: 203 (58.8%)   A: 107 (31.0%)   PA: 31 (9.0%)   D: 4 (1.2%)   SD: 0
More Testing

We took the first full data set (Fall 2000) and did some comprehensive analysis on the tool. We found:

- Students rated the faculty in more difficult classes higher (we and the Deans thought the opposite would be true)
- Students rated most course difficulty levels as “about right”
- Students didn’t inflate their course involvement and preparation
We Attempted to Establish Validity

- We took the survey results to a Division Director and had them compare the scores from the survey with what they knew to be true of their faculty over the years
- The faculty analyzed had been at the college for years and had a definite “history of teaching”
- Some we looked at scored rather low and some extremely high (lots of variance)
- The Division Director felt the survey picked the faculty out in order of their teaching ability: those scoring lower were not considered as good teachers as those who scored high
Why Validity is Hard

- Typically, to establish validity, one compares the results of the new tool against a tool already considered “valid”
- With teaching evaluation, there are no established “valid” tools
- The only way we knew to validate it was against the historical records of teaching at the College and through some statistical tests (factor analysis)
Results

- We finalized the tool in summer of 2000
- We began using it in every class in Fall 2000
Improving Teaching

Chairs and Division Directors should use it appropriately:

- It is one indicator of teaching (we say it counts no more than 40%)
- A criterion or benchmark was set (our criterion: an average of 4 on all items)
  - If a faculty member scores an average of 4 out of 5 on every item, how much more can we really expect?
- Do not norm-reference it (setting means and standard deviations based on your department’s norms)
  - Why?
Case Scenario

- In Fall, a faculty member rates a 4.22 on item 12 on the survey. In her department the mean on that item was 4.76, SD = .36. This faculty member is told, “you scored more than one SD below the department mean and need to improve your teaching.”
- That faculty member works very hard to improve her teaching. In the Spring term she scores a 4.51 on item 12. She is happy her scores are now up within one SD of the department mean.
- However, everyone else in the department also raised their scores, and the new department mean is 4.81, SD = .28. Her scores are still more than one SD below the department mean.
Case Scenario, cont.

- What’s worse, she has a friend in another department where the department mean on item 12 was 3.99, SD = .21.
- If only she worked in that department, she would score more than one standard deviation above the mean and be considered a good teacher.
- That chair wouldn’t ask her to make improvements in her teaching.
- Is she really a better or worse teacher in either department?
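The scenario is just z-scores. Plugging in the numbers from the case shows the same 4.51 landing more than one SD below one department's mean and well above the other's:

```python
def z_score(score: float, dept_mean: float, dept_sd: float) -> float:
    """How many department SDs a score sits above (+) or below (-) the mean."""
    return (score - dept_mean) / dept_sd

score = 4.51  # the faculty member's Spring score on item 12

# Her own department after everyone improved: mean 4.81, SD .28
print(round(z_score(score, 4.81, 0.28), 2))  # -> -1.07, still "needs improvement"

# The friend's department: mean 3.99, SD .21
print(round(z_score(score, 3.99, 0.21), 2))  # -> 2.48, a "good teacher"
```

The identical score produces opposite verdicts, which is exactly why norm-referencing the tool is a bad idea.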
Case Scenario, cont.

Things can be very different within departments:

- Some classes are electives
- Some classes are required for majors
- Multiple disciplines will be incorporated into a department mean
- Some courses are easier than others
- Students are forced into some classes and don’t want to be there
Once a Tool is Established…

We found that we had to impress upon the faculty and staff that:

- Every time you change a single word, you invalidate the survey
- Every time you change the scale, you invalidate the survey
- Every time you add or throw out a question, you invalidate the survey

If not, they want to keep changing it.
Characteristics of the New Teaching Evaluation Tool

Comparing the Scales

[Figure: percentage of each score on the old tool versus the new tool. 5’s: 73% old vs. 57% new; 4’s: 18% old vs. 28% new; 3’s: 6% old vs. 12% new; small percentages of 2’s and 1’s on both. Old tool % 4-5 = 91%; new tool % 4-5 = 85%.]
Psychometric Properties - Validity

Factor Analysis of the Teacher Evaluation Assessment Survey: Eigenvalues and Factor Loadings

  Factor 1 – Instructor: Eigenvalue = 19.35
  Factor 2 – Course: Eigenvalue = 2.61
  Emerging Factor 3 – Student: Eigenvalue = 1.26
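The factor-extraction logic can be sketched with NumPy: take the eigenvalues of the item correlation matrix and keep factors whose eigenvalue exceeds 1 (the Kaiser criterion). The data below is simulated for illustration, with invented loadings and factor structure, not CPCC's actual responses:

```python
import numpy as np

rng = np.random.default_rng(42)
n_students = 1000

# Three hypothetical latent factors standing in for instructor, course, student
latents = rng.normal(size=(n_students, 3))

# 12 hypothetical items: each loads mainly on one latent factor, plus noise
loadings = np.zeros((12, 3))
loadings[:6, 0] = 0.9   # six "instructor" items
loadings[6:9, 1] = 0.9  # three "course" items
loadings[9:, 2] = 0.9   # three "student" items
items = latents @ loadings.T + 0.5 * rng.normal(size=(n_students, 12))

# Eigenvalues of the item correlation matrix; keep those > 1 (Kaiser criterion)
R = np.corrcoef(items, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]
n_factors = int((eigenvalues > 1.0).sum())
print(eigenvalues.round(2), n_factors)
```

With this structure the first eigenvalue dominates and exactly three exceed 1, mirroring the instructor/course/student pattern the TEAS analysis found.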
The Instructor – Factor 1

- The art, science and business of teaching did not factor out separately
- The science and business of teaching were highly correlated with the art of teaching
- This makes sense: if a faculty member does not use multiple teaching methods or hand papers back in a reasonable amount of time, chances are students won’t rate them as good teachers
- How faculty use appropriate methods and manage the classroom impacts how students see them as teachers
Psychometric Properties - Reliability

- Internally consistent = a measure of how consistently the instrument assesses teaching quality across the items
- Cronbach’s Alpha compares the functioning of each item to all the other items within the instrument (a perfectly reliable instrument will produce a coefficient of 1.00)
- The TEAS yielded an Alpha of .974, indicating very good internal reliability
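Cronbach's alpha has a simple closed form: k/(k-1) times one minus the ratio of summed item variances to the variance of the total score. A minimal sketch (the rating matrix below is invented for illustration):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a respondents-by-items matrix of scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)          # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)      # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 1-5 ratings from five students on four items
ratings = np.array([
    [5, 5, 4, 5],
    [4, 4, 4, 4],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [4, 5, 4, 4],
])
print(round(cronbach_alpha(ratings), 3))  # -> 0.926
```

Items that move together across respondents, as here, drive alpha toward 1.00; the TEAS's .974 indicates its 36 items behave very consistently.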
Psychometric Properties - Sensitivity

- While the TEAS may be able to distinguish improvement in instructors who performed “Below Average” or “Very Poor,” it will not identify improvement in those who have already scored in the top rating (this is fine with us)
- Another indication that the instrument may not detect small changes is the rather small item standard deviations (.72 - .98)
- The greater the spread across items, the better the sensitivity (the subscales produce this)
Sub-Scales: The Important Pieces

The Art of Teaching
(items: 8, 10, 11, 12, 13, 14, 15, 16, 17, 20, 21)

The art of teaching involves the more innate aspects of teaching that are not considered method. Examples would be a teacher’s ability to motivate students, be enthusiastic, have a positive attitude toward students and the course, encourage participation, and make students feel valued and comfortable asking questions.
Art of Teaching

Scale of possible points for this subscale is 11-55 points (it is more sensitive).
Mean: 48.9   St. Dev: 8.1

  Number scoring 11-21 (<2 on every item):      174 ( 1.0%)
  Number scoring 22-32 (<3 on every item):      674 ( 4.1%)
  Number scoring 33-43 (<4 on every item):    2,376 (14.5%)
  Number scoring 44-55 (4/5s on every item): 13,192 (80.4%)

From Fall 2000 dataset.
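Subscale scoring is just summing the relevant items, which is what widens the range from 1-5 to 11-55 and improves sensitivity. A small sketch using the bands from the table above (the individual item scores are invented):

```python
# Hypothetical 1-5 scores on the 11 Art of Teaching items for one survey form
art_items = [5, 4, 5, 5, 4, 5, 5, 4, 5, 5, 4]
art_score = sum(art_items)  # subscale range is 11-55

def art_band(score: int) -> str:
    """Bands used in the Fall 2000 summary for the Art subscale."""
    if score <= 21:
        return "11-21 (<2 on every item)"
    if score <= 32:
        return "22-32 (<3 on every item)"
    if score <= 43:
        return "33-43 (<4 on every item)"
    return "44-55 (4/5s on every item)"

print(art_score, art_band(art_score))  # -> 51 44-55 (4/5s on every item)
```

Two instructors who would both round to "about 4.6" on a single 5-point item can land several points apart on the 11-55 sum.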
The Science of Teaching
(items: 2, 9, 16, 18, 19)

The science of teaching involves methods or areas that can be taught, such as organizing class time, clarifying materials with examples, making relevant assignments, use of the textbook and teaching new things to students.
Science of Teaching

Scale of possible points for this subscale is 5-25 points.
Mean: 22.2   St. Dev: 3.5

  Number scoring 5-9 (<2 on every item):        121 ( .7%)
  Number scoring 10-14 (<3 on every item):      547 ( 3.2%)
  Number scoring 15-19 (<4 on every item):    2,551 (14.8%)
  Number scoring 20-25 (4/5s on every item): 14,054 (81.4%)

From Fall 2000 dataset.
The Business of Teaching
(items: 1, 3, 4, 5, 6, 7)

The business of teaching involves items and issues required by the institution, such as handing out syllabi, applying policies and being fair to students, meeting the class for the entire period, holding office hours, providing feedback and announcing tests in advance.
The Business of Teaching

Scale of possible points for this subscale is 6-30 points.
Mean: 26.8   St. Dev: 3.9

  Number scoring 6-11 (<2 on every item):        73 ( .4%)
  Number scoring 12-17 (<3 on every item):      401 ( 2.4%)
  Number scoring 18-23 (<4 on every item):    2,505 (14.7%)
  Number scoring 24-30 (4/5s on every item): 14,043 (82.5%)

From Fall 2000 dataset.
The Course
(3 items: 22, 24, 27)

The course evaluation has less to do with the teacher and more to do with the course characteristics: its applicability to the students’ field of study, difficulty level, etc.
The Course

Scale of possible points for this subscale is 3-15 points.
Mean: 12.8   St. Dev: 2.4

  Number scoring 3-5 (<2 on every item):        142 ( .8%)
  Number scoring 6-8 (<3 on every item):        750 ( 4.4%)
  Number scoring 9-11 (<4 on every item):     3,476 (20.6%)
  Number scoring 12-15 (4/5s on every item): 12,489 (74.1%)

From Fall 2000 dataset.
The Student
(items: 31, 32, 33, 34, 35, 36)

This allows a student to assess the amount of effort they put into the course. While faculty are not responsible for this, it may help explain the variance in teacher evaluation.
The Student

Scale of possible points for this subscale is 6-30 points.
Mean: 26.2   St. Dev: 2.3

  Number scoring 6-11 (<2 on every item):        27 ( .2%)
  Number scoring 12-17 (<3 on every item):      283 ( 1.7%)
  Number scoring 18-23 (<4 on every item):    3,175 (19.0%)
  Number scoring 24-30 (4/5s on every item): 13,209 (79.1%)

From Fall 2000 dataset.
Correlations between Subscales

Pearson correlations (pairwise Ns range from 15,588 to 17,273):

             ART      SCIENCE   BUSINESS  COURSE    STUDENT
ART          1.000    .916**    .863**    .755**    .532**
SCIENCE      .916**   1.000     .870**    .753**    .529**
BUSINESS     .863**   .870**    1.000     .704**    .532**
COURSE       .755**   .753**    .704**    1.000     .602**
STUDENT      .532**   .529**    .532**    .602**    1.000

**. Correlation is significant at the 0.01 level (2-tailed).
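A Pearson correlation like those in the matrix is a one-liner with NumPy. The subscale scores below are hypothetical stand-ins, chosen so that art and science track closely while student effort is only loosely related, echoing the pattern in the real matrix:

```python
import numpy as np

# Hypothetical subscale totals for six survey forms (not CPCC data)
art     = np.array([51, 44, 38, 55, 29, 47])
science = np.array([23, 20, 17, 25, 14, 21])
student = np.array([26, 28, 22, 27, 25, 24])

def pearson_r(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson correlation coefficient between two score vectors."""
    return float(np.corrcoef(x, y)[0, 1])

print(round(pearson_r(art, science), 3))  # strong, like the .916 above
print(round(pearson_r(art, student), 3))  # weaker, like the .532 above
```

The high art-science-business correlations are what drove the factor analysis to collapse them into a single Instructor factor.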
Regression – What Accounts for the Most Variance (entire data set)?

Model Summary (dependent variable: ART)

Model  Predictors                              R     R Square  Adj. R Sq.  Std. Error  R Sq. Change  F Change   df1  df2    Sig. F Change
1      (Constant), SCIENCE                    .917   .841      .841        3.1669      .841          76834.418  1    14569  .000
2      (Constant), SCIENCE, BUSINESS          .926   .858      .858        2.9907      .017          1768.653   1    14568  .000
3      (Constant), SCIENCE, BUSINESS, COURSE  .929   .863      .863        2.9326      .005          584.070    1    14567  .000

86% of the variance in the Art of Teaching can be accounted for by the way students rated the Science and Business of Teaching and the Course.
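The hierarchical-regression logic of the table (add one predictor per model, watch R² change) can be sketched with ordinary least squares. The data and effect sizes below are simulated for illustration only; they are not the CPCC dataset:

```python
import numpy as np

def r_squared(X: np.ndarray, y: np.ndarray) -> float:
    """R-squared of an OLS fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(7)
n = 500
science  = rng.normal(size=n)
business = rng.normal(size=n)
course   = rng.normal(size=n)
# Hypothetical ART score driven mostly by SCIENCE, a little by the others
art = 0.9 * science + 0.3 * business + 0.2 * course + 0.3 * rng.normal(size=n)

r2_1 = r_squared(science[:, None], art)
r2_2 = r_squared(np.column_stack([science, business]), art)
r2_3 = r_squared(np.column_stack([science, business, course]), art)
print(round(r2_1, 3), round(r2_2 - r2_1, 3), round(r2_3 - r2_2, 3))
```

As in the table, the first predictor carries almost all the variance and each added predictor contributes a small but positive R² change.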
Regression – One Course for One Instructor

Model Summary (dependent variable: ART)

Model  Predictors                               R     R Square  Adj. R Sq.  Std. Error  R Sq. Change  F Change  df1  df2  Sig. F Change
1      (Constant), SCIENCE                     .877   .770      .757        5.3518      .770          60.236    1    18   .000
2      (Constant), SCIENCE, STUDENT            .931   .868      .852        4.1779      .098          12.537    1    17   .003
3      (Constant), SCIENCE, STUDENT, BUSINESS  .953   .909      .892        3.5670      .042          7.321     1    16   .016

In this English 231 class (Amer. Lit.), 89% of the variance in the Art of Teaching can be accounted for by how the students rated the Science and Business of Teaching and how the students rated their own classroom participation and readiness.
Differences Between Departments

[Figure: mean scores on the five subscales (art, science, business, course, student) for the Hospitality department versus the Science department, plotted on an axis running from 5 to 65 points.]
What Was Envisioned by the Committee

- Faculty determined to be excellent in the art of teaching, the science of teaching and the business of teaching would be selected to put together training modules or mentoring programs in each area through the CTL
- Faculty scoring low on any of the subscales would be sent to the CTL for serious help
- Improvements made would be documented over time
The Chair/Division Director’s Role

- Use the TEAS fairly
- It is what it is…
- When faculty need help, send them for it
- Attempt to create an atmosphere of “value in good teaching” in your division
- Faculty can and should help each other
- Look for other ways to evaluate teaching (portfolios, observations, self-assessments)
What We Plan to Do with It…

- We plan to sell it through our college’s Services Corporation (a 501(c)(3))
- We will either sell the rights to it (test plus booklet) so you can reproduce it and do your own analysis
- Or we can sell the scantron sheets with the survey printed on them and do the analysis for you
- Over the next year we plan to analyze a university sample
The End

This presentation can be found at:

http://inside.cpcc.edu/planning
Click on studies and reports. It is listed as “AIR teaching eval 2003.”