Student Evaluations as Measures of Teaching Quality

A good friend asked for my take on student evaluations after seeing this recent NPR article lambasting them, and (not surprisingly) I have a lot to say. The short version is that I agree that student evaluations are imperfect measures of course (and teaching) quality, but I also believe that if you run the data collection process well and read them with care, they can be extremely informative.

What I found most annoying about the NPR article is that it makes many general claims that just aren’t true in my experience at Yale. Let’s go down the list:

1. “Student ratings are high-stakes. They come up when faculty are being considered for tenure or promotions.”

At Yale, student ratings are definitely not high stakes. In fact, at most elite research universities, tenure and promotion decisions are based almost entirely on a professor’s research record. Teaching quality might be an issue in borderline cases, but I’m not even sure that’s true.

2. “Fewer than half of students complete these questionnaires in some classes. And, Stark says, there’s sampling bias: Very happy or very unhappy students are more motivated to fill out these surveys.”

Stark is a professor at UC Berkeley talking about his university’s experience, but at Yale, we consistently get response rates between 85 and 95%. I’m sure the students who don’t respond are not a random subset, but I’m also sure that 5–15% nonresponse isn’t inducing much sampling bias.
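As a rough sanity check, here is a minimal sketch (with made-up numbers, not actual Yale data) of how tightly the true mean rating is pinned down even if every nonrespondent sat at an extreme of the scale:

```python
# Worst-case (Manski-style) bounds on the true mean of a 1-5 rating when some
# students do not respond. The observed mean and rates below are illustrative.

def worst_case_bounds(observed_mean, response_rate, scale_min=1.0, scale_max=5.0):
    """Bound the true mean by assuming every nonrespondent rated at one extreme."""
    lower = response_rate * observed_mean + (1 - response_rate) * scale_min
    upper = response_rate * observed_mean + (1 - response_rate) * scale_max
    return lower, upper

for rate in (0.85, 0.90, 0.95):
    lo, hi = worst_case_bounds(observed_mean=4.0, response_rate=rate)
    print(f"response rate {rate:.0%}: true mean in [{lo:.2f}, {hi:.2f}]")
# 85% -> [3.55, 4.15], 90% -> [3.70, 4.10], 95% -> [3.85, 4.05]
```

Even under the most pessimistic assumption about who stays silent, the average can only move by a few tenths of a point; at a 50% response rate, the same bounds would span two full points.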

Yale gets its high response rate for two reasons: First, Yale requires students to at least look at the online course evaluation form before they get their grade for the class. Second, evaluations are made public to students considering taking the class in the future. At the end of the semester, students fill out these forms for their peers, not the administration.

3. The article’s title: “Student Course Evaluations Get an ‘F’”

This is a completely dishonest portrayal of the recent paper by Stark and Freishtat discussed in the article. The authors don’t actually hate student evaluations at all; what they hate is the emphasis on generic multiple-choice ratings that can be averaged into a single numeric course rating, based on questions such as:

  • (UC Berkeley) Considering both the limitations and possibilities of the subject matter and course, how would you rate the overall teaching effectiveness of this instructor? 1 (not at all effective), 2, 3, 4 (moderately effective), 5, 6, 7 (extremely effective)

  • (Yale) What is your overall assessment of this course? 1 (Poor), 2 (Below Average), 3 (Good), 4 (Very Good), 5 (Excellent)

Student evaluations actually contain other categorical or quantitative measures (e.g., workload) and many opportunities for students to explain their answers. At Yale, a majority of students (often far more) take these opportunities to explain to their peers why they liked the class or what exactly they didn’t like about it. These responses are incredibly illuminating, both to faculty looking to improve and to administrators who want to understand more about what’s happening in the classroom. Stark knows this (it’s in the paper), but the NPR article’s writer conveniently leaves that part out because it doesn’t support her provocative thesis.

The second part of the article discusses a study of Bocconi University students by Michele Pellizzari and co-authors, in which ratings of one course were correlated with student performance in a follow-up course. They found a significant negative relationship: “The better the professors were, as measured by their students’ grades in later classes, the lower their ratings from students.” Again, I think it’s important to note that this is not an indictment of student evaluations in general, but of numeric course ratings based on simple multiple-choice questions. That said, they find this negative relationship not just for overall assessments, but also for measures of “lecture clarity” and “teacher ability in generating interest.”

The authors interpret this as evidence that “tough” teachers push students to learn more, but students don’t like it. Maybe that’s what’s going on, but if so, I would expect to see a positive effect of contemporaneous workload on future performance, and they find no significant effect there. I think it’s equally likely that the teachers who got high ratings are teaching more, but that the extra material is not useful in the follow-up class. I’m just not sure performance in the follow-up class is a good measure of teaching effectiveness.
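To see why I’m skeptical, here is a toy simulation (with entirely made-up parameters, not the Bocconi data) showing that two very different stories can both produce the negative correlation the authors observe between ratings and follow-up grades:

```python
# Toy simulation: two data-generating stories, both yielding a negative
# correlation between teacher ratings and student grades in a follow-up class.
# All coefficients and noise levels are invented for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n_teachers = 500

# Story 1: "tough" teachers push students to learn more (helps later grades)
# but are disliked in the moment (hurts ratings).
toughness = rng.normal(size=n_teachers)
ratings_1 = -0.5 * toughness + rng.normal(scale=0.5, size=n_teachers)
followup_1 = 0.5 * toughness + rng.normal(scale=0.5, size=n_teachers)

# Story 2: well-liked teachers cover extra material that students enjoy but
# that crowds out the core syllabus tested in the follow-up class.
extra_material = rng.normal(size=n_teachers)
ratings_2 = 0.5 * extra_material + rng.normal(scale=0.5, size=n_teachers)
followup_2 = -0.3 * extra_material + rng.normal(scale=0.5, size=n_teachers)

print(np.corrcoef(ratings_1, followup_1)[0, 1])  # negative
print(np.corrcoef(ratings_2, followup_2)[0, 1])  # also negative
```

Both stories produce the same observable pattern, so the negative correlation by itself can’t tell us whether students are punishing good teaching or rewarding material that simply isn’t tested later.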

The authors also note that the professors didn’t have much freedom in the classroom: “the syllabuses are fixed and all teachers in the same course present exactly the same material.” In that context, there aren’t many opportunities for professors to show their skills, and maybe “popularity” is a bigger distinguishing factor than it would be if faculty were designing their own courses.

The most interesting result in the Pellizzari et al. paper, to me, is that they “find that the evaluations of classes in which high-skill students are over-represented are more in line with the estimated quality of the teacher.” One could make a strong argument that at Yale the vast majority of students are high-skill, so at Yale we should expect course evaluations to be a good (if noisy) measure of teaching quality. I’ve certainly seen many students give poor reviews to classes where they felt they didn’t learn much.

I am absolutely not saying that numeric measures based on a single general multiple-choice question are great measures of course quality. I’ve seen faculty game the system and teach an easy class to get good overall assessments. That comes through loud and clear in the workload and comments portions of the evaluations. I’ve also seen faculty teach what I thought was an objectively challenging and excellent class and get good but not fantastic overall evaluation scores. This situation is also pretty clear when you read the students’ comments.

At the end of the article, Stark suggests several other methods of measuring teaching quality, including:

  • Ask students more objective questions about the class,
  • Have faculty visit and evaluate their peers in the classroom, and
  • Review faculty teaching materials.

I think these ideas are great, but I also think we can learn a lot from what we already have.

If you want to read more about course evaluations, I highly recommend the discussion after this recent article in Psychology Today.