Christopher X J. Jensen
Associate Professor, Pratt Institute

Course Evaluations

I won’t lie: after each semester has ended, I greet the arrival of the spreadsheets summarizing my course evaluations with a mix of excitement and horror. There is something so emotional about looking over these evaluations. For me, the emotional potency of course evaluations emerges from a variety of realities:

  • I care deeply about serving the needs of my students;
  • I work hard to be a good educator, and hope that work translates to student approval;
  • I hate intensely to be criticized but I also want to improve in response to my actual shortcomings;
  • Interpreting the varied assessments of dozens of students every semester is difficult and confusing; and
  • Course evaluations are one of the only ways that I am professionally assessed, at least as an educator.

In the end, I want course evaluations to serve their stated purpose: to provide students with the opportunity to assess the quality of the education they are receiving and perhaps to improve the quality of education that future students experience. But making good on this promise is not so easy: course evaluations are often contradictory, confusing, and counterintuitive. Contradictory comments are perhaps par for the course (sorry for that pun), especially for general education courses: one student wants a more rigorous course, while another says the course is too hard. Course evaluations can be confusing because students are not always perfectly clear about what they need and/or want out of a course. And course evaluations can be counterintuitive because it is not always a given that what bothers students about a course needs to be changed: if students claim that the course requires too much work, does this assessment represent an ‘expert take’ on workload (after all, students have the best perspective on the distribution of workload across courses) or simply ‘wishful thinking’ on the part of students who understandably wish their education was easier? Challenges abound in interpreting course evaluations.

And perhaps this is why the importance of course evaluations is often pretty institutionally ambiguous. Are course evaluations designed to assess instructor quality? Perhaps, although generally they only get considered during key promotional reviews or in the rare case that an institution is trying to get rid of a substandard instructor. Are course evaluations designed to improve courses or improve instruction? Maybe, if instructors take seriously the comments of students and make appropriate changes to their curricula and teaching methods. In my experience this is a matter of personal prerogative and never institutionalized in a meaningful way. Are course evaluations designed to assess cross-sectional course quality? While it seems like it would be possible to learn a lot about the larger curriculum by comparing what students have to say about different courses, I have never seen this sort of comparison made.

While I am a little bit sympathetic to those who struggle with the challenges associated with interpreting course evaluations, I am mostly disappointed by how poorly we as faculty utilize the results of course evaluations. There seems like so much promise in all this data provided by students. And I have to say that part of me suspects that most educational institutions are happy to have course evaluations serve little more than a perfunctory role: more processing of the data in course evaluations (or heaven forbid more transparency) would expose the truth about how students assess their education, and maybe we are all afraid to face this music.

Motivated by both my deep respect for the course evaluation process and my wish for something more progressive to be done with these assessments, I have decided to post summary data from my course evaluations since joining the Pratt faculty in the Fall of 2007. Below is a summary of my overall ratings for each of the courses I teach/have taught at Pratt:

Normalized Average Ratings, Full Scale

Comparing different semesters is slightly complicated by the fact that:

  1. Pratt’s rating system has changed from being 1-to-5 to 1-to-4; and
  2. I teach classes varying in credit value from 1 to 3.

To correct for these variations, I present what I am calling the Normalized Average Rating, an average weighted by the number of credits per course and normalized to the maximum rating for each semester. This normalization makes comparisons across semesters meaningful.
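As a minimal sketch of how such a credit-weighted, scale-normalized average could be computed: the record type `SectionRating` and its field names are hypothetical, chosen only for illustration, and the numbers in the usage note are made up rather than taken from my evaluations.

```python
from dataclasses import dataclass


@dataclass
class SectionRating:
    """One course section in one semester (hypothetical record layout)."""
    mean_rating: float  # average student rating on that semester's scale
    credits: int        # credit value of the course (1 to 3)
    scale_max: float    # top of the rating scale in use (5 or 4)


def normalized_average_rating(sections):
    """Credit-weighted mean of ratings, each divided by its scale maximum."""
    total_credits = sum(s.credits for s in sections)
    if total_credits == 0:
        raise ValueError("need at least one section with credits")
    weighted = sum(s.credits * s.mean_rating / s.scale_max for s in sections)
    return weighted / total_credits
```

For a semester with a 3-credit section averaging 3.6 out of 4 and a 1-credit section averaging 3.8 out of 4, this gives (3 × 0.90 + 1 × 0.95) / 4 = 0.9125. Note that dividing by the maximum leaves the lowest possible value at 0.20 on the 1-to-5 scale and 0.25 on the 1-to-4 scale, so the bottom ends of the two scales do not line up exactly.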

Overall you can see that my ratings are pretty high, and have been above 90% for most semesters. But when you look at these trends in more detail, you can see that there is quite a bit of variation from semester to semester:

Normalized Average Ratings, Compressed Scale

As the best-fit linear trendline on the graph above suggests, I have experienced an overall decrease in my course ratings, although my sample size is small enough to call into question the statistical significance of this trend. Thanks to getting some relatively lower ratings over the past four years, the trend has gone weakly negative. I am happy that my ratings are generally high, but I have my concerns that I am getting worse in the eyes of my students. Why this might be is really tough to figure out.
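The trendline and the question of its significance could be checked along the following lines. This is a sketch, not the procedure used to draw the graph above: `fit_line` and `permutation_p_value` are hypothetical helper names, and the permutation test is simply one option that makes few assumptions when there are only a handful of semesters.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two or more paired points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def permutation_p_value(xs, ys, trials=10_000, seed=0):
    """Share of random re-orderings of ys whose |slope| is at least the observed one."""
    observed = abs(fit_line(xs, ys)[0])
    rng = random.Random(seed)  # seeded so the estimate is repeatable
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        # small tolerance so float round-off does not hide exact ties
        if abs(fit_line(xs, shuffled)[0]) >= observed - 1e-12:
            hits += 1
    return hits / trials
```

Here `xs` would be the semester index (1, 2, 3, ...) and `ys` the Normalized Average Rating for each semester. A large returned share means that a slope this steep turns up often when the same ratings are placed in random order, which is what a weak trend in a small sample looks like.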

I might be the reason that my course evaluations have been trending lower over recent semesters. Am I teaching differently? Am I expecting more of my students? Has my tone in the classroom changed? Has my level of enthusiasm gone down? All these causes of change are possible, but some are easier to assess than others. One thing that I know is true is that in some ways I have been more rigorous with my students over the past four years than ever before: I now require more homework, and give cumulative final exams in all of my courses. But being more demanding in these areas hasn’t always led to lower levels of student satisfaction; in fact, when I first started employing this more rigorous version of my pedagogy (Fall 2013), I actually received my third-highest overall ratings ever. And my best ratings of all time were from the Fall of 2016, the first semester in which my more “modern demands” were rolled out. So while there are a lot of comments from students suggesting that I am too demanding, it seems to be a bit of a “luck of the draw” phenomenon as to how many students with this complaint I get per semester. Just a few students who feel overwhelmed by the work demands can really tip my ratings.

The other possibilities that could be ‘my doing’ are a lot harder to assess. For example, there has not been any student comment trend pointing towards me being less enthusiastic in class. I still get a lot of praise for the energy and passion that I bring to the classroom. But there are also a significant number of students who report that they feel condescended to, or that I make them feel anxious, or that I am rude. It’s possible that I have gotten more gruff in my old age, although if anything it feels like I am treating students more gently now than I might have in the past.

Another possibility is that my students may be changing. Perhaps the students of 2018 are simply not the same as the students of 2011. I am always really wary of generational stereotypes, so I am kind of skeptical that suddenly over the past decade students have changed such that they rate my courses more poorly. But there is sometimes a slow change over time in what students expect to be asked to do, and it is possible that I am encountering a generation of students who are frustrated by my style of teaching and/or what I expect them to do in my courses. I have experienced this kind of culture change before: back in 2014 when I started teaching the Ecology for Architects course, it quickly became clear that architecture students were a lot more sensitive to high expectations and workloads than art and design students, and my course evaluations reflected that difference in culture. But now that I have stopped teaching Ecology for Architects, I still am getting some low ratings. Why might this be?

The last possibility is that these last four semesters are statistical aberrations: some combination of bad luck in which students registered for my class, or how my students felt on evaluation day, or some other factor beyond my control is responsible for these scores. As you will see below, I have evidence in some courses that a few disgruntled students are driving some of my lower scores, but that does not explain all of my low scores.

Course evaluations can be really unnerving because they are based on such small sample sizes. Even if the entire class is present on the day that evaluations are completed, at Pratt we are still talking about at most twenty or so students providing their impressions of a course. With so few responses, a few very negative evaluations can substantially lower your overall ratings! Based on a qualitative analysis of my course evaluations, it is my impression that my worst ratings arise from having one or two students in a particular course section who are really dissatisfied with the course. To try to determine quantitatively whether or not this qualitative impression is valid, I looked at the correlation between my average ratings and their variance:

Average Rating versus Variance

If my lower ratings are driven by just a few students, we should see much higher variance when ratings are lower; overall, this is what the graph above shows. But there are some interesting exceptions, including two courses that very prominently ‘fall off the trendline’. These two courses — in which I received normalized average ratings of 0.810 and 0.815 — have variances that are similar to most of the higher ratings, suggesting that there was more consensus driving the lower ratings in these two courses. Which courses are those, you ask? You guessed it, Ecology for Architects. In the graph above I have highlighted all six sections of this course in blue: although you can see that one has a variance-to-average ratio that’s in line with the overall trend, most are below the trend, suggesting that there’s less variance in these low ratings. That means there’s more agreement among students in these classes that they don’t rate my teaching very well. And of course you can see that five out of my six worst course ratings have come in Ecology for Architects.

If the variance were roughly constant for all ratings, this would suggest that lower ratings emerged from overall student consensus, but this is not what we see in the graph above. The one factor complicating this conclusion is that high ratings have their variance somewhat capped by there being a maximum (100%) rating value. We would expect higher ratings to have a lower variance simply because they are high. For this reason, it is hard to say conclusively that most of my lower ratings are driven exclusively by a few students per course section: we also have to recognize that my higher ratings are biased to have lower variance. I am sure that there is some more sophisticated statistical procedure for disentangling these two factors, but I have not tracked it down. The fact that a few disgruntled students can radically lower the ratings for a course makes it really difficult to interpret differences in average course ratings from semester to semester.
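One candidate for separating the two effects, offered as a sketch rather than a settled method, is to compare each section's variance with the largest variance that is possible at its mean. For ratings confined to a scale running from lo to hi with mean m, the variance cannot exceed (hi - m)(m - lo); this is the Bhatia-Davis inequality. Dividing by that ceiling gives a number between 0 and 1 that is no longer forced downward just because the mean is near the top of the scale. The function name `relative_spread` is hypothetical, and the ratings in the usage note are invented for illustration.

```python
def relative_spread(ratings, lo, hi):
    """Population variance divided by its ceiling (hi - mean) * (mean - lo).

    Values near 1 mean the ratings sit at the two ends of the scale (a split
    class); values near 0 mean the ratings cluster around the mean (consensus).
    """
    n = len(ratings)
    if n == 0:
        raise ValueError("need at least one rating")
    if any(r < lo or r > hi for r in ratings):
        raise ValueError("rating outside the scale")
    mean = sum(ratings) / n
    variance = sum((r - mean) ** 2 for r in ratings) / n
    ceiling = (hi - mean) * (mean - lo)
    if ceiling == 0:
        # every rating is at one end of the scale, so there is no spread at all
        return 0.0
    return variance / ceiling
```

Consider two invented sections of twenty students on the 1-to-4 scale, both averaging 3.7. If eighteen students give a 4 and two give a 1, the variance is 0.81, which equals the ceiling, so the relative spread is about 1.0: the low average comes entirely from a couple of very unhappy students. If fourteen give a 4 and six give a 3, the variance is 0.21 and the relative spread is about 0.26: the same average reflects mild reservations shared more widely.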

What do my ratings tell me about the overall impressions students have about my different courses? One way to get an answer to that question is to look at overall Normalized Average Ratings for each course:

CourseEval All courses

What I find fascinating about this data is that some of the courses I feel are most realized are on the lower end of the ratings spectrum. If you asked me which of my classes is of highest quality, I would probably say “Ecology for Architects” because it is the course that I have most recently developed, and one for which I put a lot of effort into creating my ‘most pedagogically advanced’ content. But if you look at the overall ratings, Ecology for Architects fares the worst (by a disturbingly wide margin!). I am not sure how to interpret this disconnect between what I consider a successful course and what students consider a valued class!

A big worry for me is that low course ratings represent how demanding a course was rather than how well-designed the course was to foster student learning. Or, perhaps, course evaluations might just boil down to how students feel towards a course rather than whether the course contributed to their education. If you think about it, students are not really equipped to assess educational outcomes (which are difficult even for trained educators to assess) and have a lot of reasons to base their rating on how rigorous the course was or how they felt taking the course, two factors which are likely related. Is this the reason why an elective 1-credit course (Great Adventures in Evolution) that requires just a short reading passage per week and a fairly brief final project gets the best ratings while my most demanding 3-credit course taken by students in one of the most demanding majors (Ecology for Architects) gets the worst ratings? Maybe. I dream of teaching a course that students love despite being greatly challenged by it, but how to create such a course? If I wanted to get better ratings, I certainly know how to make my course fun and a lot easier, but taking that path will not lead to the best outcome for my students (whether they recognize that fact or not).

I know quite well that some of my courses are deemed more successful by students than others, but what does the overall trend look like for all courses? Below is an overall summary of my ratings in each course over my career at Pratt:


Notice that there is a lot of variation across semesters, and that there is not any real trend explaining this variation. Again, I remind you how small my sample sizes are during each semester: a lot of this variation may simply arise from sampling. There is also the problem that the course evaluation is not a static metric: in the Spring of 2010 Pratt radically overhauled the criteria by which students rate the course, and those criteria have been changed slightly since. So as we look at trends, we really are not comparing the same things across all semesters. I guess the one clear trend — already highlighted by previous data — that’s disturbing is a rather precipitous shift downward over the past four semesters.

I have also looked at the trends for individual courses. I would hope that each semester that I teach a course it would improve in the eyes of my students. To get a sense of whether this is true, I plotted the ratings for each semester for each of the courses I have taught several times or more. My bread-and-butter course is Ecology, which I have taught to sixteen different course sections over the past eight years. Here are my ratings for Ecology:

There is a slightly increasing trend in my average ratings, but a lot of variation. As with my overall yearly ratings, I mostly take pride in the fact that my average rating is above 90%, but it is hard to know whether this slight increase in ratings over time represents a real improvement in students’ impressions of the course.

As mentioned above, the course where I have been rated most poorly is Ecology for Architects. Here are my ratings for this course over the three semesters I have taught it:


Explaining the fact that my architecture students have slammed my courses is difficult. Trust me, I have spent a lot of time thinking about the issue, and I wish I could come to some definitive conclusion. It might be easy to say “hey, Chris, take it easy on yourself… you have only been teaching this course for three years”, but I am not buying this excuse. Because this course is without question based on the most articulated pedagogical design of any of my courses, I cannot just chalk these low ratings up to a rough start with the course. Something else has to be responsible.

What might that something be? Well, one possibility is that architecture students expect to be taught in a different way than students in other majors, and I have not yet learned what they want. This sounds reasonable to me, but as I analyze my course evaluations there is very little indication of what architecture students would like me to do differently… other than being less rigorous. What is funny is that my current version of Ecology for Architects is a lot less rigorous than many of the courses I have offered in the past. It is a bit of a paradox, but a paradox that can easily be made sense of: my architecture students are among the best academically-prepared undergraduates on campus, and yet they respond more negatively to rigor than their peers in other majors (in fact, my lowest ratings have come from the course sections of Ecology for Architects that have performed the best: it appears that the students who work the hardest in this course are the most resentful that they had to work this hard!). If you stop and realize that GPA matters more for architecture students, a large fraction of whom want to move on to graduate programs, it makes sense that in spite of being more on top of my workload and earning better overall grades, my architects-to-be are more dissatisfied with my course. Add to that the fact that this course is required of all architecture students (and taught in radically different ways by other course instructors) and you have a potent cocktail for student dissatisfaction. I do not want to completely ignore the possibility that there are things that I could do to make this class equally educationally-valuable and yet more liked by my students, but I think that I also have to resign myself to a lower baseline score in this course.

A course that I have been working to improve a lot over the past few years is The Evolution of Cooperation. As the graph below shows, my ratings for this course have been in very slight decline:


With only seven data points to work with, it is hard to make much of this trend, especially given the up-down-up-down trend here. One thing I know I need to work on in this course is overall workload, as recently students have consistently complained that this course demands too much work. What I have found is that sampling clearly plays a role in my ratings for this course: the less well students in the course handle the workload, the worse they rate the course (notice that this is the opposite of the trend for Ecology for Architects).

Ratings for my Evolution course have slightly decreased over time:


But again, with only eight data points spread over such a large swath of time it is hard to know what to make of this trend. The dilemma I face in interpreting these ratings is nicely encapsulated in the above graph: for the two semesters when I taught two sections of the course, I got a high rating from one section and a low rating from the other section, such that the two ratings almost perfectly sandwich the average rating over the five years I have taught the course. For any given semester, I pretty much teach the course the same way, so I ought to get similar ratings from each course section. The fact that I do not is a good indication of how much the small sample of students represented in each course section varies.

My course The Evolution of Sex shows a slightly decreasing trend:


Again, the small dataset makes it hard to be conclusive. And again an issue may be workload, although ironically the workload for this course was probably worst when I first taught it as a one-credit course.

It is nice to have this data, but what to do with it? Even if I am convinced that student happiness or disappointment in a particular course is increasing, how do I act in response to this information?

For example, one of the biggest complaints that I get on my course evaluations is that the workloads in my classes are too high. Does this mean that I should lighten up or that students should toughen up?

I also get a lot of lower ratings for the way I assess student work (I am deemed too tough). Does this mean that I should lower my standards, or is it a good sign that I am challenging many of my students?

These are hard questions to answer!

Currently my approach is to use course evaluations as a way of taking a rough pulse of my teaching. If I get a few lower or higher ratings I cannot immediately assume that the quality of my teaching has changed, but I am looking for consistent trends over time. In recent semesters there has been a downward trend and I am working to address this trend by playing with course design tweaks that have the potential to address student concerns without just capitulating to an overall student demand for less rigorous courses.

One of the most valuable components of our course evaluation process does not even show up on the quantitative summaries I show above. Students are asked to make qualitative statements about the strengths and weaknesses of the courses I teach and the way that I teach them. I pay a lot of attention to these comments, and they are often very insightful. Perhaps the most valuable feedback is not numerical at all. Unfortunately, it is hard to summarize without bias what students have to say. I can average their ratings, but not their more nuanced impressions.

I hope to further process the course evaluation data I have (there is so much granular information lying dormant in this dataset), but the above graphs are a start. I sincerely believe that in order to provide greater accountability, more educators should be making this data publicly available.

You can see all of my posts about my course evaluations here.

This page was last updated August 2018