Christopher X J. Jensen
Associate Professor, Pratt Institute

Course Evaluations

I won’t lie: after each semester has ended, I greet the arrival of the spreadsheets summarizing my course evaluations with a mix of excitement and horror. There is something so emotional about looking over these evaluations. For me, the emotional potency of course evaluations emerges from a variety of realities:

  • I care deeply about serving the needs of my students;
  • I work hard to be a good educator, and hope that work translates to student approval;
  • I intensely hate being criticized, but I also want to improve in response to my actual shortcomings;
  • Interpreting the varied assessments of dozens of students every semester is difficult and confusing; and
  • Course evaluations are one of the only ways that I am professionally assessed, at least as an educator.

In the end, I want course evaluations to serve their stated purpose: to provide students with the opportunity to assess the quality of the education they are receiving and perhaps to improve the quality of education that future students experience. But making good on this promise is not so easy: course evaluations are often contradictory, confusing, and counterintuitive. Contradictory comments are perhaps par for the course (sorry for that pun), especially for general education courses: one student wants a more rigorous course, while another says the course is too hard. Course evaluations can be confusing because students are not always perfectly clear about what they need and/or want out of a course. And course evaluations can be counterintuitive because it is not always a given that what bothers students about a course needs to be changed: if students claim that the course requires too much work, does this assessment represent an ‘expert take’ on workload (after all, students have the best perspective on the distribution of workload across courses) or simply ‘wishful thinking’ on the part of students who understandably wish their education was easier? Challenges abound in interpreting course evaluations.

And perhaps this is why the institutional purpose of course evaluations is often so ambiguous. Are course evaluations designed to assess instructor quality? Perhaps, although generally they only get considered during key promotional reviews or in the rare case that an institution is trying to get rid of a substandard instructor. Are course evaluations designed to improve courses or improve instruction? Maybe, if instructors take seriously the comments of students and make appropriate changes to their curricula and teaching methods. In my experience this is a matter of personal prerogative and never institutionalized in a meaningful way. Are course evaluations designed to assess cross-sectional course quality? While it seems like it would be possible to learn a lot about the larger curriculum by comparing what students have to say about different courses, I have never seen this sort of comparison made.

While I am a little bit sympathetic to those who struggle with the challenges associated with interpreting course evaluations, I am mostly disappointed by how poorly we as faculty utilize the results of course evaluations. There seems like so much promise in all this data provided by students. And I have to say that part of me suspects that most educational institutions are happy to have course evaluations serve little more than a perfunctory role: more processing of the data in course evaluations (or heaven forbid more transparency) would expose the truth about how students assess their education, and maybe we are all afraid to face this music.

What do my course evaluations look like?

Motivated by both my deep respect for the course evaluation process and my wish for something more progressive to be done with these assessments, I have decided to post summary data from my course evaluations since joining the Pratt faculty in the Fall of 2007. Below is a summary of my overall ratings for each of the courses I teach/have taught at Pratt:

Normalized Average Ratings, Full Scale

Comparing different semesters is slightly complicated by the fact that:

  1. Pratt’s rating system has changed from being 1-to-5 to 1-to-4; and
  2. I teach classes varying in credit value from 1 to 3.

To correct for these variations, I present what I am calling the Normalized Average Rating, an average weighted by the number of credits per course and normalized to the maximum rating for each semester. This normalization makes comparisons across semesters meaningful.
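As a rough sketch of this calculation (the credit values, ratings, and scale below are invented for illustration, not my actual data), the Normalized Average Rating could be computed like this:

```python
# Sketch of the Normalized Average Rating: a credit-weighted average of
# per-course ratings, divided by the maximum possible rating that semester.
def normalized_average_rating(courses, scale_max):
    """courses: list of (credits, average_rating) pairs for one semester.
    scale_max: top of the rating scale (Pratt has used both 5 and 4)."""
    total_credits = sum(credits for credits, _ in courses)
    weighted_sum = sum(credits * rating for credits, rating in courses)
    return (weighted_sum / total_credits) / scale_max

# Hypothetical semester: one 3-credit course rated 3.6 and one
# 1-credit course rated 3.8, both on a 1-to-4 scale.
print(normalized_average_rating([(3, 3.6), (1, 3.8)], scale_max=4))  # 0.9125
```

Because the result is a fraction of the maximum possible rating, semesters graded on a 1-to-5 scale and semesters graded on a 1-to-4 scale end up on the same 0-to-100% footing.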

Overall you can see that my ratings are pretty high, hovering around 90%. But when you look at these trends in more detail, you can see that there is quite a bit of variation from semester to semester:

Normalized Average Ratings, Compressed Scale

As the best-fit linear trendline on the graph above suggests, I have experienced an overall decrease in my course ratings. I am happy that my ratings are generally high, but I have my concerns that I am getting worse in the eyes of my students. Why this might be is really tough to figure out.

I might be the reason that my course evaluations are trending lower over the last five years. Am I teaching differently? Am I expecting more of my students? Has my tone in the classroom changed? Has my level of enthusiasm gone down? All these causes of change are possible, but some are easier to assess than others. One thing that I know is true is that in some ways I have been more rigorous with my students over the last few years than ever before. Recently I have taught only CORE courses, which are the only required math and science courses for art and design majors. As such we have packed a lot into these courses, in part because they represent the only guaranteed opportunity to make sure our students are scientifically literate. I also teach mostly the writing-intensive version of CORE courses, which adds work and rigor. Student comments on evaluations also support the idea that part of my lower ratings stem from students’ perceptions that my courses are too demanding and that I am too tough a grader.

But being more demanding in these areas hasn’t always led to greater student dissatisfaction; in fact, when I first started employing this more rigorous version of my pedagogy (Fall 2013), I actually received my third-highest overall ratings ever. And my best ratings of all time were from the Fall of 2016, the first semester when I began expecting CORE-level work of my students. So while there are a lot of comments from students suggesting that I am too demanding, it seems to be a bit of a “luck of the draw” phenomenon as to how many students with this complaint I get per semester. Just a few students who feel overwhelmed by the work demands can really tip my ratings.

The other possibilities that could be ‘my doing’ are a lot harder to assess. For example, there has not been any student comment trend pointing towards me being less enthusiastic in class. I still get a lot of praise for the energy and passion that I bring to the classroom. But there are also a significant number of students who report that they feel condescended to, or that I make them feel anxious, or that I am rude. It’s possible that I have gotten more gruff in my old age, although if anything it feels like I am treating students more gently now than I might have in the past.

Another possibility is that my students may be changing. Perhaps the students of 2019 are simply not the same as the students of 2011. I am always really wary of generational stereotypes, so I am kind of skeptical that suddenly over the past decade students have changed such that they rate my courses more poorly. But there is sometimes a slow change over time in what students expect to be asked to do, and it is possible that I am encountering a generation of students who are frustrated by my style of teaching and/or what I expect them to do in my courses. I have experienced this kind of culture change before: back in 2014 when I started teaching the Ecology for Architects course, it quickly became clear that architecture students were a lot more sensitive to high expectations and workloads than art and design students, and my course evaluations reflected that difference in culture. But now that I have stopped teaching Ecology for Architects, I still am getting some low ratings. Why might this be?

The last possibility is that these last five years are statistical aberrations: some combination of bad luck in which students registered for my class, or how my students felt on the day they filled out the evaluation, or some other factor beyond my control is responsible for these scores. As you will see below, I have evidence in some courses that a few disgruntled students are driving some of my lower scores, but that does not explain all of my low scores.

Course evaluations can be really unnerving because they are based on such small sample sizes. Even if the entire class were to submit their evaluations (currently my completion rates are approximately 83%), at Pratt we are still talking about at most fifteen students providing their impressions of a course. With so few responses, a few really negative evaluations can really lower your overall ratings! Based on a qualitative analysis of my course evaluations, it is my impression that my worst ratings arise from having one or two students in a particular course section who are really dissatisfied with the course. To try to determine quantitatively whether or not this qualitative impression is valid, I looked at the correlation between my average ratings and their variance:

Average Rating versus Variance

If my lower ratings are driven by just a few students, we should see much higher variance when ratings are lower; overall, this is what the graph above shows. But there are some interesting exceptions, including a few courses that very prominently ‘fall off the trendline’. These courses have variances that are similar to most of the higher ratings, suggesting that there was more consensus driving the lower ratings in these particular courses.
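To make the arithmetic behind this concrete (the ratings below are fabricated examples, not my actual evaluation data), here is a sketch of how just two very dissatisfied students in a fifteen-person section move both the mean and the variance:

```python
from statistics import mean, pvariance

# Fabricated example: a broadly satisfied fifteen-person section on a
# 1-to-4 scale, versus the same section with two 1s swapped in.
consensus = [4, 4, 4, 3, 4, 4, 3, 4, 4, 4, 4, 3, 4, 4, 4]
disgruntled = consensus[:13] + [1, 1]

for label, ratings in [("consensus", consensus), ("two disgruntled", disgruntled)]:
    pct = mean(ratings) / 4 * 100   # normalized to the scale maximum
    print(f"{label}: mean={pct:.1f}%, variance={pvariance(ratings):.2f}")
# consensus: mean=95.0%, variance=0.16
# two disgruntled: mean=85.0%, variance=1.04
```

Two students out of fifteen drop the normalized average by ten percentage points and multiply the variance several times over, which is exactly the signature of the low-rating, high-variance points on the graph.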

If the variance were roughly constant across all ratings, that would suggest that lower ratings emerged from overall student consensus, but this is not what we see in the graph above. The one factor complicating this conclusion is that high ratings have their variance somewhat capped by the existence of a maximum (100%) rating value: we would expect higher ratings to have lower variance simply because they are high. For this reason, it is hard to conclusively say that most of my lower ratings are driven exclusively by a few students per course section: we also have to recognize that my higher ratings are biased towards lower variance. I am sure that there is some more sophisticated statistical procedure for disentangling these two factors, but I have not tracked it down. The fact that a few disgruntled students can radically lower the ratings for a course makes it really difficult to interpret differences in average course ratings from semester to semester.
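That cap can actually be quantified (this is standard statistics, not something derived from my evaluation data): for values bounded between a minimum and a maximum, the Bhatia-Davis inequality says the variance can be at most (max − mean) × (mean − min). A quick sketch on a 1-to-4 scale:

```python
# Bhatia-Davis inequality: for values in [lo, hi] with a given mean,
# variance <= (hi - mean) * (mean - lo). The closer a section's mean
# sits to the top of the scale, the less room its variance has.
def max_variance(mean_rating, lo=1.0, hi=4.0):
    return (hi - mean_rating) * (mean_rating - lo)

for m in (2.5, 3.0, 3.5, 3.9):
    print(f"mean {m}: variance at most {max_variance(m):.2f}")
```

So a section averaging 3.9 out of 4 simply cannot show much variance, no matter how students' opinions are distributed, which is one reason high ratings and low variance travel together.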

What do my ratings tell me about the overall impressions students have about my different courses? One way to get an answer to that question is to look at overall Normalized Average Ratings for each course:

Normalized Average Ratings by Course

What I find fascinating about this data is that some of the courses I feel are most realized are on the lower end of the ratings spectrum. If you asked me which of my classes is of highest quality, I would probably say “Ecology, Environment, & the Anthropocene” because it is the course that I have most recently developed, and one that I put a lot of effort into creating my ‘most pedagogically advanced’ content for. But if you look at the overall ratings, Ecology, Environment, & the Anthropocene fares quite poorly on my course evaluations (by a disturbingly wide margin!). I am not sure how to interpret this disconnect between what I consider a successful course and what students consider a valued class!

A big worry for me is that low course ratings represent how demanding a course was rather than how well-designed the course was to foster student learning. Or, perhaps, course evaluations might just boil down to how students feel towards a course rather than whether the course contributed to their education. If you think about it, students are not really equipped to assess educational outcomes (which are difficult even for trained educators to assess) and have a lot of reasons to base their rating on how rigorous the course was or how they felt taking the course, two factors which are likely related. Is this the reason why an elective 1-credit course (Great Adventures in Evolution) that required just a short reading passage per week and a fairly brief final project received the best ratings while my most demanding 3-credit course taken by students in one of the most demanding majors (Ecology for Architects) gets the worst ratings? Maybe. I dream of teaching a course that students love despite being greatly challenged by it, but how to create such a course? If I wanted to get better ratings, I certainly know how to make my course fun and a lot easier, but taking that path will not lead to the best outcome for my students (whether they recognize that fact or not).

I know quite well that some of my courses are deemed more successful by students than others, but what does the overall trend look like for all courses? Below is an overall summary of my ratings in each course over my career at Pratt:


Notice that there is a lot of variation across semesters, and that there is not any real trend explaining this variation. Again, I remind you how small my sample sizes are during each semester: a lot of this variation may simply arise from sampling. There is also the problem that the course evaluation is not a static metric: in the Spring of 2010 Pratt radically overhauled the criteria by which students rate the course, and those criteria have been changed slightly since. So as we look at trends, we really are not comparing the same things across all semesters. I guess the one clear trend — already highlighted by previous data — that is disturbing is a rather precipitous shift downward over the past five to six years.

I have also looked at the trends for individual courses. I would hope that each semester that I teach a course it would improve in the eyes of my students. To get a sense of whether this is true, I plotted the ratings for all of the courses that I have been teaching recently. Here are my ratings for Ecology, Environment, & the Anthropocene:

This graph suggests a couple of things. The first is that there is not a clear improvement in how students rate this course: the line is pretty much flat hovering around 84.5%. The second is that there is a lot of variation from semester to semester. For these six sections over four semesters I have not really changed the way I teach, so a lot of what must be driving this variation is disagreement among students about what makes a good course; some semesters I get a group of students who like the way I teach, others I don’t. This is really apparent when we compare the pairs of sections that occurred during the same semester: the same course can generate big differences in ratings across two sections. It will be interesting to see where these ratings trend as I continue to teach this course. As a lot of the complaints about this course revolve around the workload, I have been considering lowering the number of weekly readings in order to respond to one of the complaints that drives lower ratings. Would doing so change my evaluation scores? And would lowering expectations provide my students with the same level of learning?

An interesting contrast with the ratings trend above is the trend for my Evolution course:

Here you can see that my ratings are also relatively flat, but hover around 90%, a substantial difference in overall student satisfaction with my teaching in this course. It is hard to tell what is driving this trend. Three points are obviously not very reliable, so it will be interesting to see whether this course continues to outscore Ecology, Environment, & the Anthropocene in years to come. The two courses are taught in manners that are pretty similar overall — same grading structure, same Term Project, similar homework and exams — so it is a bit hard to tell why this difference, if real, might exist. One idea is that there is a different tone in this course: whereas Ecology, Environment, & the Anthropocene can be a bit of a downer (our species is doing some major damage, after all), Evolution is all about the gee-whiz wonder of nature. So it is possible that Evolution just might be a more pleasant course to take. The other interesting contrast between the courses has to do with the readings. Evolution has a textbook and probably requires less overall reading, which might make students feel less overburdened by this course.

If three points are too little to define a trend, two points are really useless. But just for fun, here are the ratings for my newest course, Breeders, Propagators, & Creators:

Look, my ratings are shooting upwards like a ballistic missile! Jokes aside, it will be interesting to see where this course rates when I teach it again in the Fall 2019 semester. These two ratings are for a CORE non-writing-intensive version, but this course has been converted to being writing-intensive. This sets up the potential for a bit of a controlled study on how much writing-intensive requirements impact student evaluations.

What to do about course evaluation results?

It is nice to have this data, but what to do with it? Even if I am convinced that student happiness or disappointment in a particular course is increasing, how do I act in response to this information?

For example, one of the biggest complaints that I get on my course evaluations is that the workloads in my classes are too high. Does this mean that I should lighten up or that students should toughen up?

I also get a lot of lower ratings for the way I assess student work (I am deemed too tough). Does this mean that I should lower my standards, or is it a good sign that I am challenging many of my students?

These are hard questions to answer!

Currently my approach is to use course evaluations as a way of taking a rough pulse of my teaching. If I get a few lower or higher ratings I cannot immediately assume that the quality of my teaching has changed, but I am looking for consistent trends over time. In recent semesters there has been a downward trend, and I am working to address it by playing with course design tweaks that have the potential to address student concerns without simply capitulating to an overall student demand for less rigorous courses.

One of the most valuable components of our course evaluation process does not even show up on the quantitative summaries I show above. Students are asked to make qualitative statements about the strengths and weaknesses of the courses I teach and the way that I teach them. I pay a lot of attention to these comments, and they are often very insightful. Perhaps the most valuable feedback is not numerical at all. Unfortunately, it is hard to summarize without bias what students have to say. I can average their ratings, but not their more nuanced impressions.

I hope to further process the course evaluation data I have (there is so much granular information lying dormant in this dataset), but the above graphs are a start. I sincerely believe that in order to provide greater accountability, more educators should be making this data publicly available.

You can see all of my posts about my course evaluations here.

This page was last updated June 2019