The Problem of Teaching Performance Evaluation and a Proposed Solution

KENNESAW, Ga. | Sep 18, 2023

Dr. Reza Vaezi

The Problem

Faculty members in most higher education institutions must go through yearly performance evaluations by their department chairs. Before the evaluation sessions, in most cases, faculty are required to conduct a self-evaluation of their effort, and those self-evaluations are then either corroborated or rejected by their respective department chairs. The issue arises when a faculty member’s self-evaluation of their teaching, research, or service efforts and contributions for the past year does not match that of their department chair. The problem is especially pronounced in teaching and service areas where a widely accepted measurement schema - similar to publication rankings for research contributions - does not exist. The discrepancy between the administrator’s evaluation and the faculty self-evaluation in the service area is typically not a serious issue, as service only corresponds to 10% to 20% of faculty workload and may not have a decisive impact on a faculty member’s overall performance evaluation. Teaching performance evaluation, however, is different, as teaching can comprise between 40% and 80% of faculty workload in a year. Hence, this blog post aims to scrutinize current teaching evaluation practices prevalent in many higher-education institutions and suggest possible remedies.

Across the nation, administrators typically use Student Evaluations of Teaching (SET) - a possibly biased measure acquired by surveying students about the teacher - in place of more objective measures of teaching effectiveness and performance to decide about hiring, promoting, or terminating faculty (Spooren, Brockx, & Mortelmans, 2013). Our College is no exception; based on a field survey conducted by the author in March of 2022, most department chairs rely heavily on SET results to evaluate faculty teaching performance. This practice is particularly problematic for two main reasons. First, the validity and reliability of SETs in measuring their constructs of interest (e.g., teaching effectiveness, course rigor, etc.) have been questioned by many scholars across various scientific disciplines (e.g., Boring & Ottoboni, 2016; Braga, Paccagnella, & Pellizzari, 2014; Carrell & West, 2010; Spooren et al., 2013). Second, SETs are not a comprehensive measure of teaching performance and capture, albeit with error, only one part of a multidimensional endeavor.

SET Validity Issues

Evidently, students confuse grades and grade expectations with teaching effectiveness (Braga et al., 2014; Carrell & West, 2010). In other words, students’ favorable evaluations appear to be a reward given to instructors who make them expect a good grade at the end of the semester (Johnson, 2006)! These findings suggest that teaching effectiveness as a complex concept gets substituted by an easier one (e.g., grade expectations) in students’ minds through a cognitive mechanism called substitution heuristics (Kahneman, 2011). Such a process can impact both the internal and construct validity of SET measurements. The substitution heuristics mechanism suggests that our measurement instrument might be measuring something that was never intended to be measured. In this case, it appears to be measuring whether students’ expectations about their grades are met instead of faculty teaching effectiveness.

Further, SETs are shown to be impacted by social stereotypes such as race and gender. MacNell, Driscoll, and Hunt (2015) suggest that students tend to give lower ratings to female instructors in almost all teaching aspects that even extend to supposedly objective measures such as timeliness. These findings are further affirmed by other scholars using student evaluation data (Boring, 2017; Boring & Ottoboni, 2016). Additionally, Clayson (2005, p. 119) states that around 60% of students who participated in a study admitted they “had modified teaching evaluations for personality reasons (likeability, personality, and humor).” Such strong social and personal biases further weaken both the internal and external validity of SET measurements and question the reliability of obtained ratings.

Moreover, SETs suffer from highly biased sampling issues, which further threaten their internal validity. It is shown that people who are highly (positively or negatively) motivated are much more likely to participate in online surveys to write product and service reviews (Sundaram, Mitra, & Webster, 1998). Online SETs are no exception either; most of the time, only a few highly motivated students in each class participate in the SET, making the statistical sample non-representative. Consequently, gathered opinions can be heavily biased and highly divided (high standard deviation), representing outliers instead of normally distributed data points. In addition, administrators may also use statistically questionable methods in aggregating and comparing SET results over the evaluation period (e.g., averaging across graduate and undergraduate level courses, selecting only a few items/questions to include among many, or assigning arbitrary weights to items). The sampling issues and varied statistical aggregation operations can lead to weak statistical interpretation validity of SETs.

Teaching Performance: a Multidimensional Construct

In addition to the discussed validity issues, SETs by no means provide a comprehensive measure of teaching performance. A general concern arises whenever a human effort is represented through a number (a rating, a grade), and that is the reduction of rich and multidimensional human experience to a number or a symbol. To cope with the complexity of understanding rich experiences in order to form an easily comprehensible base for comparison and ranking, we (humans) tend to reduce experiences to numbers and signs. For all practical purposes, such reductions are needed to ensure a degree of fairness in judgment when one needs to rank and compare human (or game animals) efforts. As teachers, we also do that in most of our courses; we reduce a semester of students’ efforts (life experience related to the course) into a number or letter. But most instructors do that in a rather representative manner. Students’ grades are typically a summary of all their related efforts in a course with a predefined and clearly communicated weight schema for each category. A course grade is not based on one exam, quiz, or assignment. It is an aggregation, a weighted summary, of all related activities. However, when it comes to teaching performance evaluation, most administrators (chairs) typically only rely on one highly controversial score that only captures one, albeit important, dimension of the multifaceted teaching experience.

A good faculty teaching performance score should be a more comprehensive one that does not rely mostly on a highly unreliable and biased student rating; rather, it should reflect all efforts related to teaching, including but not limited to training and preparation, student mentorship, teaching innovations, and participation in curriculum design and development. Most teachers have experienced the challenges of teaching a course for the first time; in addition to the extra time and effort it takes to prepare, inadvertent mistakes are inevitable and can easily result in students’ dissatisfaction. The same goes when an innovative teaching method or means is implemented for the first time in a course. Mishaps are not only expected but, to a good degree, inevitable when dealing with novel content or methods.

Further, the totality of a teaching effort goes beyond classroom instruction, sound teaching materials, and instructional methods that students can allegedly evaluate. It also includes all the training that a teacher needs to undertake to stay current. This issue is especially pronounced in highly dynamic disciplines such as Information Systems, where instructors must always keep up with technological advancements. In such disciplines, even core courses are not immune from rapid technological and environmental changes and need to be updated every few years. For teachers to stay current in such disciplines, they need to revisit their courses often, make fundamental changes, and take on new training. Under the current performance evaluation regime, those efforts have no formal weight, and their worth is determined based on the subjective judgment of administrators! Hence, many instructors of technology-heavy courses may not have a strong enough motivation to keep their courses current (take on new training and introduce new content and novel methods), considering that such efforts may not be fully appreciated and may even backfire through SETs.

Consequently, it might be time for faculty councils and other shared governance bodies to start developing an explicit rating schema for teaching performance evaluations that fits their respective disciplines and departmental needs. Such a schema should give faculty decent power to predict how their annual teaching performance reviews will turn out, based on all they have done inside and outside the classroom. It should be the kind of schema that can release teachers from the mercy of their students and chairs.

Once aspects of teaching performance, such as innovation and training, are formally and objectively rewarded through well-defined teaching evaluation schemas, we can expect faculty to be more motivated to stay current by taking on various kinds of training and to be more innovative in their course design and delivery. Such a schema can shield faculty, although to a limited degree, against the fury of negative student reviews attributed to first-time experiences with innovative delivery or novel course material. The schema can also enable faculty to focus more on true student learning at the cost of possibly making students temporarily unhappy instead of focusing on how to get better student reviews at the price of student learning! Last but not least, it can help administrators avoid unpleasant encounters with faculty during annual reviews, as it should reduce disagreements between faculty self-evaluations and those of the chair by making teaching performance evaluation more encompassing and objective.

To this end, the author has developed an introductory teaching evaluation schema as a tangible example of what a more encompassing and objective teaching evaluation schema may look like. It provides some guidelines for the annual teaching performance review at the departmental level. It can also satisfy the USG mandate to develop clear templates for faculty performance reviews at various stages (annual, pre-tenure, post-tenure, etc.). First and foremost, the proposed framework aims to encourage directed conversations among faculty and administrators around important faculty teaching issues.

The Proposed Schema

While maintaining a focus on student evaluations, this proposed framework introduces relevant teaching activities that can directly or indirectly impact the teaching effectiveness of faculty and student learning outcomes. It improves the existing process in the following ways:

It helps faculty methodically account for most of their teaching-related activities in the past year.
It makes the teaching performance evaluations more objective and encompassing.
It makes the results of teaching performance evaluation more predictable and robust for both faculty and the chair.
It penalizes faculty for poor management of student complaints.
It provides clear thresholds for the five performance ratings (exemplary, exceeds expectations, meets expectations, needs improvement, and not meeting expectations) that will be in place for 2023 faculty performance evaluations.

The proposed framework is composed of two major assessment sections: Core Assessment and Auxiliary Assessment. It is designed to enable faculty to still receive Exemplary and Exceeds Expectations ratings by relying only on their SET results (Core Assessment). But it also allows faculty to quantify what usually goes into teaching performance narratives and helps them offset weak or biased student reviews, to a limited degree, by showcasing and quantifying the totality of their teaching efforts over the evaluation period (Auxiliary Assessment).

The proposed framework consists of four tables that include core and auxiliary assessment sections.

Table 1 includes information about teaching evaluation tiers and the required points for each tier.
Table 2 represents the core assessment section and captures the student evaluation summary in terms of averages for each criterion in the standard KSU teaching evaluation instrument.
Table 3 represents the auxiliary assessment section. It helps faculty to quantify relevant teaching activities and recognitions to earn points.
Table 4 brings together the results of Tables 2 and 3 and provides the total earned points that will be used in conjunction with Table 1 to determine faculty’s self-evaluation of teaching performance.

Core Assessment (Table 2)

The existing KSU-sanctioned teaching evaluation survey consists of three feedback categories. Through this survey, students provide course feedback, instructor feedback, and general KSU feedback. Course and instructor feedback is gathered on a 5-point scale, where 5 indicates the best rating. The KSU feedback, however, is gathered on a 4-point scale. Hence, a total of 14 core assessment points (5 + 5 + 4) can be earned through SETs. The framework is designed to make it possible for faculty to earn an exemplary rating by relying only on student evaluations of their teaching. For example, a faculty member whose student evaluations equal or exceed 90% of the maximum available core assessment points (14 * 0.9 = 12.6) can earn an exemplary rating. Please see Tables 1 and 2 for more details.
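The core arithmetic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `core_points` and the sample averages are hypothetical and not part of the KSU instrument; only the scale maxima (5, 5, and 4) and the 90% cutoff come from the text. Decimal arithmetic is used so that values such as 12.6 compare exactly.

```python
from decimal import Decimal

# Scale maxima from the text: course (5), instructor (5), KSU feedback (4).
MAX_CORE_POINTS = Decimal("5") + Decimal("5") + Decimal("4")  # 14

# Exemplary cutoff: 90% of the maximum core points.
EXEMPLARY_CUTOFF = MAX_CORE_POINTS * Decimal("0.9")


def core_points(course_avg: str, instructor_avg: str, ksu_avg: str) -> Decimal:
    """Sum the three category averages (hypothetical helper, not an official formula)."""
    return Decimal(course_avg) + Decimal(instructor_avg) + Decimal(ksu_avg)


# Made-up averages, for illustration only.
earned = core_points("4.6", "4.7", "3.5")
print(earned, EXEMPLARY_CUTOFF, earned >= EXEMPLARY_CUTOFF)
```

With these made-up averages, the sum of the three categories clears the 90% cutoff, so the core score alone would qualify for an exemplary rating.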

Auxiliary Assessment (Table 3)

The framework also gives faculty who receive less than extraordinarily good student evaluations a chance to achieve exemplary teaching performance by conducting auxiliary teaching activities and including such activities in their teaching performance evaluations. However, it should be noted that there is an upper limit on how many points one can earn through auxiliary activities. This mechanism is designed to ensure that a faculty member who performs poorly according to student evaluations would not be able to achieve the highest rating by relying mostly on the auxiliary points they accrue. The proposed upper limit for earned auxiliary points is 2.1, which is equivalent to 15% of the core assessment points (14 * 0.15 = 2.1). Hypothetically, the maximum auxiliary points can move a faculty member whose student evaluations place them in the upper half of the “Meets Expectations” range (a core score of at least 12.6 - 2.1 = 10.5) to the “Exemplary” rating category.

Table 1: Teaching Evaluation Tiers and Required Points

Evaluation Scale	Minimum Required Points
Exemplary	Equal to or greater than 12.6 (90% of core points)
Exceeds Expectations	Equal to or greater than 11.2 (80% of core points)
Meets Expectations	Equal to or greater than 9.8 (70% of core points)
Needs Improvement	Equal to or greater than 8.4 (60% of core points)
Not Meeting Expectations	Less than 8.4

Table 2: Core Assessment Section - Student Evaluations Summary

KSU Survey Feedback Category		Max Points	Points Earned
Course Feedback Average (Calculated as the average for all courses across the evaluation period)		5
Instructor Feedback Average (Calculated as the average for all courses across the evaluation period)		5
KSU feedback Average (Calculated as the average for all courses across the evaluation period)		4
Total Earned Core Points

Table 3: Auxiliary Assessment Section – Teaching-Related Activities and Recognitions

Teaching Activities / Recognitions	Points Per Instance	Points Earned
New Course Prep	1	X * 1 (X = number of new preps)
New teaching-related training certificate of completion	0.75	X * 0.75 (X = number of certifications)
Course Coordinator	0.1	X * 0.1 (X = number of coordinated sections)
CETL course certification/recertification	0.15	X * 0.15 (X = number of certifications/recertifications)
Major Course Innovation	0.05	X * 0.05 (X = number of major innovations; X cannot be greater than the number of taught courses)
Minor Course Innovation	0.025	X * 0.025 (X = number of minor innovations; X cannot be greater than the number of taught courses)
Student mentoring	0.15	X * 0.15 (X = number of students mentored, e.g., honors project)
Teaching Awards	0.5	X * 0.5 (X = number of teaching awards)
Guest lectures	0.025	X * 0.025 (X = number of guest lecturers; up to two guest lecturers per course can be claimed)
Office of Career Development Recognition Letter	0.5	X * 0.5 (X = number of letters received during the evaluation period)
Escalated Student complaints	-0.25	X * -0.25 (X = number of student complaints escalated to the department chair level)

*Total Earned Auxiliary Points* (Total points earned from auxiliary activities cannot exceed 2.1)
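As a rough sketch of how Table 3 could be tallied, the Python snippet below multiplies each activity count by its value from the "Points Per Instance" column and applies the 2.1 cap. The dictionary keys and the function name `auxiliary_points` are hypothetical labels chosen for this illustration, and the per-row limits (for example, innovations not exceeding the number of taught courses) are assumed to have been checked before the counts are passed in.

```python
from decimal import Decimal

# Values copied from the "Points Per Instance" column of Table 3.
# The key names are hypothetical labels for this sketch.
POINTS_PER_INSTANCE = {
    "new_course_prep": Decimal("1"),
    "training_certificate": Decimal("0.75"),
    "course_coordinator": Decimal("0.1"),
    "cetl_certification": Decimal("0.15"),
    "major_innovation": Decimal("0.05"),
    "minor_innovation": Decimal("0.025"),
    "student_mentoring": Decimal("0.15"),
    "teaching_award": Decimal("0.5"),
    "guest_lecture": Decimal("0.025"),
    "career_development_letter": Decimal("0.5"),
    "escalated_complaint": Decimal("-0.25"),
}

# Upper limit on auxiliary points (15% of the 14 core points).
AUXILIARY_CAP = Decimal("2.1")


def auxiliary_points(counts: dict[str, int]) -> Decimal:
    """Return capped auxiliary points for a mapping of activity -> count."""
    total = sum(
        (POINTS_PER_INSTANCE[activity] * n for activity, n in counts.items()),
        Decimal("0"),
    )
    # The cap limits only the upside; penalties can still lower the total.
    return min(total, AUXILIARY_CAP)


# Made-up example: one new prep, two mentored students, one escalated complaint.
print(auxiliary_points({"new_course_prep": 1, "student_mentoring": 2, "escalated_complaint": 1}))
```

Applying the cap once to the overall sum, rather than row by row, follows the note under the table that total auxiliary points cannot exceed 2.1.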

Table 4: Summary of Teaching Performance Evaluation

Category		Points / Self-Evaluation
Total Earned Core Assessment Points (from Table 2)
Total Earned Auxiliary Points (Max 2.1 - from Table 3)
Total Teaching Performance Points
Faculty Self-Evaluation (from Table 1)
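To make the Table 4 roll-up concrete, here is a minimal Python sketch that adds core and (capped) auxiliary points and looks up the resulting tier using the Table 1 minimums. The function name `teaching_rating` and the sample inputs are assumptions made for illustration; the thresholds and the 2.1 cap are taken from the tables above.

```python
from decimal import Decimal

# Tiers and minimum required points from Table 1, highest first.
TIERS = [
    ("Exemplary", Decimal("12.6")),
    ("Exceeds Expectations", Decimal("11.2")),
    ("Meets Expectations", Decimal("9.8")),
    ("Needs Improvement", Decimal("8.4")),
]

# Upper limit on auxiliary points from Table 3.
AUXILIARY_CAP = Decimal("2.1")


def teaching_rating(core: str, auxiliary: str) -> tuple[Decimal, str]:
    """Return (total points, tier) for given core and auxiliary points."""
    total = Decimal(core) + min(Decimal(auxiliary), AUXILIARY_CAP)
    for name, minimum in TIERS:
        if total >= minimum:
            return total, name
    # Anything below the lowest minimum falls into the bottom tier.
    return total, "Not Meeting Expectations"


# Made-up example: mid-range student evaluations plus the maximum auxiliary points.
print(teaching_rating("10.5", "2.1"))
```

In this made-up example, a core score of 10.5 combined with the full 2.1 auxiliary points reaches the 12.6 minimum for the top tier, which is the hypothetical case described in the Auxiliary Assessment section.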

References:

Boring, A. (2017). Gender biases in student evaluations of teaching. Journal of Public Economics, 145, 27-41.

Boring, A., & Ottoboni, K. (2016). Student evaluations of teaching (mostly) do not measure teaching effectiveness. ScienceOpen Research.

Braga, M., Paccagnella, M., & Pellizzari, M. (2014). Evaluating students’ evaluations of professors. Economics of Education Review, 41, 71-88.

Carrell, S. E., & West, J. E. (2010). Does professor quality matter? Evidence from random assignment of students to professors. Journal of Political Economy, 118(3), 409-432.

Clayson, D. E. (2005). Within‐Class Variability in Student–Teacher Evaluations: Examples and Problems. Decision Sciences Journal of Innovative Education, 3(1), 109-124.

Johnson, V. E. (2006). Grade inflation: A crisis in college education. Springer Science & Business Media.

Kahneman, D. (2011). Thinking, fast and slow. Macmillan.

MacNell, L., Driscoll, A., & Hunt, A. N. (2015). What’s in a name: Exposing gender bias in student ratings of teaching. Innovative Higher Education, 40(4), 291-303.

Spooren, P., Brockx, B., & Mortelmans, D. (2013). On the validity of student evaluation of teaching: The state of the art. Review of Educational Research, 83(4), 598-642.

Sundaram, D. S., Mitra, K., & Webster, C. (1998). Word-of-mouth communications: A motivational analysis. ACR North American Advances.

Dr. Reza Vaezi
Associate Professor of Information Systems
Coles College of Business