I'd like to share and comment on a case that recently came up in another department. The case is this. The department was doing a hire, and they had two main finalists. One of the finalists was a woman a few years out of a very good program, with a ton of first-authored publications (well upwards of ten), several in top-ranked journals in the field, despite having a full teaching load. At the flyout, this rather diminutive woman (standing 5 feet tall) did not "impress." Her research was judged to be very good, but many on the hiring committee were not impressed by her presentation: she seemed nervous, her voice was too quiet, etc. The second candidate was a 6-foot-2 male with no first-authored publications. Although he had come out of the #1 program and spent an entire year in a prestigious post-doc, he had only three publications, all of which were co-authored with his (very famous) graduate school advisor. He had no teaching duties at his post-doc, very few papers under review, and had struggled to finish his graduate program, taking a full two years longer to graduate than the female candidate. Despite all this, many of the faculty were "blown away" by his brilliance during the flyout. He spoke in a deep, confident voice and was considered by many faculty to "obviously be a future star."

The faculty were torn between the two candidates. Some thought the brilliant man was the obvious choice. Others thought this to be completely absurd. In their view, the female candidate was far and away the better candidate. Unlike the male candidate, she actually had a proven track record of success, both publishing her own work in top journals and in the classroom. Alas, after several days of drama and deliberations, the department decided to offer the position to the male candidate. Members of the faculty who preferred the female candidate were irate, feeling that the committee had favored someone with few accomplishments over a person with many.

In my view, this is the kind of case that should lead us to question the value of interviews and such. There is this common belief–a belief against which there is a great deal of empirical evidence–that we are capable of detecting "talent" better through personal experience than through purely statistical resources (i.e. resumes alone). Study after study has shown that purely algorithmic processes (e.g. counting publications) are systematically better predictors of success than so-called "eyeball tests" (i.e. personal impressions)–and yet, as we see in the case above, people continue to base hiring decisions on personal impressions. Why?

The reason why "eyeball tests" don't work very well seems obvious enough (to me, at any rate). Generally speaking, the best predictor of future success is past success. But, if you want to know about someone's past success, all you have to do is look at their work. Looking at the person, on the other hand, introduces all kinds of additional noise into the process: preferring people on the basis of things that might seem impressive (e.g. confidence, a bellowing voice, mental quickness, etc.) but which don't necessarily have any reliable relation to actual production. To see how, consider just a few notable examples from professional football:

  • Despite only having one good collegiate season, JaMarcus Russell blew everyone away at the 2007 scouting combine with his physical talents. He was selected #1 overall in the draft and went on to become one of the worst draft "busts" in history.
  • In 1998, pro scouts were split between two college quarterbacks: Peyton Manning, who went on to become one of the greatest pro quarterbacks ever, and Ryan Leaf, who is one of the worst draft "busts" of all time. Many scouts preferred Leaf over Manning because of Leaf's "obvious" physical advantages–this despite the fact that Manning had a much more successful collegiate career.
  • In 1999, Akili Smith was selected in the first round of the NFL draft despite having only one good season, because of his "obvious physical talents" at the scouting combine.
  • In 2000, a man named Tom Brady was selected in the 6th round of the NFL draft, due to perceptions that he lacked sufficient physical tools to succeed.
  • In 1979, Joe Montana was selected in only the 3rd round of the NFL draft, being judged by scouts to be too slight of build and not having a strong enough throwing arm–this despite being famous for incredible comeback victories while at Notre Dame. Montana is now judged to be perhaps the best NFL quarterback of all time.

There's an obvious pattern here: a pattern of people systematically ignoring actual accomplishments in favor of personal judgments about "talent." And there's another obvious pattern: this not working very well! Time and time again, individuals who are profoundly impressive in person do not go on to succeed…precisely because their personal impressiveness is not backed by a past record of actual success. Conversely, time and again, individuals with actual records of success continue to succeed even if they are "not that impressive" in person.

So why, after all of the studies, and all of the trends, do people continue to "trust their judgments" of talent? Why indeed!


20 responses to “The Eyeball Test and the Seductiveness of ‘Talent’”

  1. Hi Marcus,
    Thought provoking post (as is often the case with your posts but I digress). I did want to bring up a few worries though.
    First, the analogy with sports and draft position isn’t a good one. The players are being interviewed every time they are on the field. In philosophy, it’s not like that. So, even though a combine workout can weigh in favor or against a prospect it is usually only an all things being equal metric. The performance on the field is what does the most work for the prospects (Even Jemarcus Russell had a 10-1 season and some excellent come from behind wins against Alabama, etc.).
    Second, isn’t looking over one’s dossier and “counting their pubs” overly focused on one aspect of our jobs as philosophers? Teaching is surely important, and an in-person interview is much better at gauging how the person will perform in the classroom. It’s a lot better than simply looking at teaching evals (IMHO).
    Lastly, doesn’t the choice not to have interviews bias against those who do damn well at them? Or am I missing something?
    Here are a few things to consider, in my case anyway.
    I’m from Calgary, an under-the-radar program. I can help my chances with an interview, given that I think one of my stronger traits is my ability to work a classroom and show my enthusiasm for the discipline. I think I can write just fine, but given that my program is MUCH shorter than those in the states (4 years, with much of that time spent on 3 intense examinations and a year of course work), my dossier will not be as impressive when compared to the 7-8 year PhD from the states, or someone like yourself who has been publishing for years since graduating. So, it seems that in not interviewing, folks like me are at a disadvantage. And, given that the name of my institution may already work against me, this seems, well, shitty.
    Now, this is not to say that the past shouldn’t matter at all, only that moving away from the interview altogether works against folks like me. I did have a couple of questions for you, Marcus.
    Isn’t reviewing one’s work (rather than an in person interview) just a different way of “spotting talent”?
    Also, the suggested approach creates systematic obstacles for folks who have what it takes to be a good pro if only they were given a chance. I’m thinking here of folks who have to work while in grad school which makes publishing nearly impossible. Those folks would never get a job if we were to focus ONLY on past success. Am I off to worry about such cases under your suggestion of no interviews?

  2. Kate Norlock

    Like Justin, I worry about the extent to which the prestige of an institutional affiliation with a grad program, and the luck of excellent funding (vs loads of teaching and grading at the expense of publication) would then outweigh skills such as classroom behavior and effective oral presentation. But perhaps, Marcus, you mean this to be a post applicable to R1 institutions, and not to the rest of us?

  3. Christopher Stephens

    While I share some of Justin and Kate’s concerns about valuing classroom behavior, is there really evidence that the person who comes across shy and nervous in an interview will be a worse teacher? Marcus doesn’t tell us about their teaching backgrounds, but won’t the same points about actual success apply to teaching as to research? Is one data point about a very high stakes oral presentation really a good guide to future success as a teacher? I’d be interested in empirical evidence on this.
    Suppose both candidates have good teaching evaluations, good letters about teaching from people who’ve observed their classes, extensive teaching experience, well thought out syllabi, etc. In that case, should we really use A’s better performance in the job talk as a good reason to think A will be a better teacher than B?

  4. Hi Justin and Kate: Thanks for your comments!
    I’m not advocating merely counting pubs. The empirical studies I’m referring to show that algorithmic, hard-data approaches to all aspects of hiring (e.g. teaching reviews, etc.) are better than soft-measures (i.e. interviews, teaching demos, etc.).
    There are several reasons why this is. Let me explain each of them.
    First, studies of interviews, demos, etc., consistently show that raters tend to favor/disfavor candidates largely on the basis of factors irrelevant to the job in question. So, for example, in interviews and other types of demonstrations, people have been shown to consistently favor (1) taller candidates over shorter candidates, (2) people with deeper voices over people with higher voices (especially among men), (3) attractive people over less-attractive people, (4) extraverts over introverts, (5) men over women, etc.
    In other words, although interviewing committees and people watching demos like to think they are evaluating the performance “where it counts”, there is an overwhelming amount of evidence that people favor/disfavor candidates on almost entirely arbitrary grounds.
    This is the first reason why algorithmic selection-processes have been consistently observed to result in better outcomes (e.g. higher performance reviews of hired candidates) than soft measures (such as interviews). Hard measures–i.e. someone’s long-term research record, teaching record, etc.–tend to be far less based on arbitrary judgments and more on actual performance.
    This brings me to the second problem with interviews/demos. Any statistician or data-collection expert worth their salt will tell you (as Chris Stephens points out) that (1) many data-points are better than one, and (2) to be good evidence, data-points must be representative of actual, normal performance. Interviews and demos are neither–and here’s why.
    Every sort of performance known to humankind is subject to outliers. So, for example, the Denver Broncos are a really good football team–yet they laid a “stinker” this past week. Similarly, Joe Montana was perhaps the greatest quarterback in NFL history–but even he had bad days. Single-case observations cannot possibly distinguish an outlier from a person’s central tendency, i.e. their normal performance. The best indicator of a central tendency is given by many data-points. So, for instance, if you want to know whether the Broncos are actually a good team, the best way to do it is have them play lots of games against other teams and see how many games they win. The same goes for all other human endeavors. If you want to know whether someone is a good teacher, you shouldn’t base your judgment on a one-off performance or interview. You should base your judgment on the person’s body of work–as that provides far more data pertaining to the person’s central tendency, or normal performance.
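    The point about single observations vs. central tendencies can be illustrated with a quick simulation (purely illustrative; the "true" performance level and the noise scale are made-up assumptions, not anything measured in the post): a single observation of a noisy performer misestimates their true average far more than the mean of many observations does.

    ```python
    # Illustrative simulation: one noisy observation vs. the mean of many.
    # The true performance level and day-to-day variability are hypothetical.
    import random

    random.seed(0)
    true_mean = 70.0   # the performer's actual central tendency
    noise = 15.0       # day-to-day variability in performance

    def observe(n):
        """Average of n noisy performance observations."""
        return sum(random.gauss(true_mean, noise) for _ in range(n)) / n

    # Estimate the typical error of a one-off observation vs. a 30-observation record.
    one_shot_errors = [abs(observe(1) - true_mean) for _ in range(2000)]
    many_shot_errors = [abs(observe(30) - true_mean) for _ in range(2000)]

    avg_err_1 = sum(one_shot_errors) / len(one_shot_errors)
    avg_err_30 = sum(many_shot_errors) / len(many_shot_errors)
    # avg_err_30 comes out much smaller than avg_err_1: a body of work
    # pins down someone's normal performance far better than a single demo.
    ```

    The gap shrinks roughly with the square root of the number of observations, which is just the statistical reason a long record beats a one-off flyout performance.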
    The problem here is even worse for interviews and demos, as outliers are far more likely to obtain in abnormal test-conditions–which interviews and demos certainly are.
    Consider, to begin with, a slick extravert candidate who normally doesn’t give a damn about teaching, but can put on one really good show if s/he actually prepares. This person may give a killer teaching demo…yet it is not at all indicative of their normal performance. On a day-to-day basis, they may be poorly prepared, care more about research than teaching, etc.
    Now consider an introvert who has a long-term record of teaching success, but who–as introverts are often wont to be–gets unusually uncomfortable in unfamiliar situations. This person may normally be a killer teacher, with years of excellent performance, and yet in this novel, highly abnormal situation, at a university they are unfamiliar with, with students they are unfamiliar with, being watched by people they are unfamiliar with, they may put up a “dud.” Is that at all indicative of their normal performance? No–but the interview or demo can make it seem as though it is.
    In other words, interviews and demos are subject to “masking.” Candidates, quite frankly, can misrepresent themselves. Someone may look like they have everything it takes to be an excellent researcher…except the hard-data indicates otherwise. Someone might look like they have everything it takes to be an excellent teacher…except on a day-to-day basis they don’t really care that much about teaching. Etc.
    Finally, and on a related note, interviews/demos by their very nature don’t–and can’t–track many other things directly relevant to job performance. Consider, if you will, all of the things it takes to be a stellar teacher day-in, day-out over the course of a semester or academic year. It takes (1) a great deal of consistent, daily preparation of multiple courses, while (2) juggling those demands with those of research, while (3) juggling those demands with committee service and advising; it takes (4) time and effort providing written and verbal feedback to students on term-papers, etc.; it takes (5) knowing how to respond to below-average students; etc. In other words, it takes a variety of very specific skills that are entirely removed from a single teaching demo.
    These are just some of the reasons why hard-data are better predictors than observations. Hard data are (A) less subject to bias by irrelevant factors (e.g. height, attractiveness, speaking voice, etc.), (B) more reflective of actual central tendencies (i.e. normal performance) than one-off outliers, (C) more reflective of performance in relevant contexts (i.e. day-to-day teaching performance while juggling normal responsibilities); and (D) less subject to masking.

  5. Potter Stewart

    “These are just some of the reasons why hard-data are better predictors than observations. Hard data are (A) less subject to bias by irrelevant factors (e.g. height, attractiveness, speaking voice, etc.), (B) more reflective of actual central tendencies (i.e. normal performance) than one-off outliers, (C) more reflective of performance in relevant contexts (i.e. day-to-day teaching performance while juggling normal responsibilities); and (D) less subject to masking.”
    With the caveat that hard data are observations (where do we think the data comes from?), this seems right, but only as far as it goes.
    To simplify things, imagine that all I care about is successful teaching. What data should I look at? Student grades? Student feedback? Teaching-focused letters of recommendation? As it turns out, the data predictors of successful teaching are all suspect, because they are all indirect. Every last one of them. Not so much so that we should never look at them, but at least so much so that we should not fetishize the numbers. And we might think that, at least while our lab studies of aptitude in teaching are still in their immaturity, it is not crazy to think that a good teacher is a bit like pornography — I know it when I see it.

  6. Potter Stewart

    Also, I want to agree with Justin that the sports analogy is pretty weak.
    If you think Russell only had one good year, then you must think that going 10-1 at LSU, getting a top five ranking, and beating Alabama before getting injured is a “bad” year.
    And while Manning over Leaf seems clear in retrospect, remember that Manning never could win the big game (you can’t spell Citrus without UT and all), while Leaf set Pac-10 passing records, helped WSU to its first ever Pac-10 championship, and finished second in the nation in passing rating. Them’s performance numbers, even if he did lose the Rose Bowl to the national champion Wolverines. (Manning was then, not surprisingly, in the Citrus Bowl.) And it should be noted that Leaf was clearly raising the bar for WSU, which has since faded, while Tennessee won the whole kit and caboodle as soon as Manning was off campus. Leaf did more than Manning with less.
    As for Smith, it isn’t like he didn’t perform. And many, many think that his holdout really hurt him. (He might be the best case — although even here, it was his numbers that made him attractive, not his interviews…)
    Brady? I watched Brady play. He showed up on a team that was national championship stuff, and he barely maintained his starting job, and while he didn’t have anything resembling a bad collegiate career, surely you’re not suggesting that anyone looking at his collegiate performance would have known that he’d have the professional career he’s had? If we went based off of data, like you’re suggesting, the 6th round might even be generous. His biggest win resulted from an opponent’s missed extra point!
    Maybe Joe Montana makes the case for you. I don’t know. I’m not old enough to have watched him in college. But it strikes me that taking his nickname and reputation to be a good predictor of success is the exact opposite of the way you’d want to go…

  7. I have so much to say, Marcus. I’ll type out some of my main concerns in a future post (likely next week). However, I did want to make two small points up front.
    (1) Interviews are equally important for the candidate. I (as a candidate) would like to know more about the people (AS PEOPLE and not as written descriptions of people) I am going to be working with. Same with department structure and collegiality: in visiting a department you can get a vibe re: how business is done and how active students and colleagues are. Sure, this is not infallible (maybe they put on a show for your visit), but it’s information that should be considered when one has multiple offers on the table. Even if they put on a show for you, that tells you something.
    (2) Re: the draft/combine, I second Potter’s apt points re: Brady and Manning. And further, since the advent of the NFL combine (in roughly 1980) one could argue that a team’s ability to draft players successfully has gotten better, not worse–in part because of the combine! This is an empirical claim, so it’s one we could look into. Players once thought of as fast because they played poor competition now get exposed at the combine, along with many other examples. Sure, you can point to an Akili Smith, but for every Smith there are 10 players that did pan out since the combine. Without the combine, Smith would still have gone. And it seems unobjectionable that there were draft busts prior to the combine (which suggests that a non-interview format, assuming combines are similar to campus interviews, is not much better).
    All in all I am very skeptical that we can quantify some of the attributes you are looking for (good teaching, research, collegiality, etc.). I mean, teaching evals as indicative of how good a prof is? Really?! Unless we put every eval we have ever received in our teaching dossier, they don’t seem very meaningful. Surely, some often ill-prepared prof will have some good evals (from students who got a good grade and feel good about saying nice things, or what have you), and over time said prof will have accrued 50 or 60 of them. Are you suggesting that we include them all? Not to mention the practical issues that arise even if one granted that evals are indicative of good performance (which I question). How likely is it that the hiring committee would (or even COULD) read them all? Especially for those who have taught 15-20+ classes! When I was first putting my dossier together I included ALL of them, until I was quickly told that that wasn’t how things were done. Pick your best from each class, I was told. I thought this was a joke! But it’s not, and it is currently a data point that hiring committees look into.
    Lastly, I agree with you that “Every sort of performance known to humankind is subject to outliers”. But I also think there are outliers to the method you propose. I think that folks could be great for a job (and in some cases better than someone who does “quantify” well on all of the data points you find most important). Neither one of us is saying that our process would be perfect, right? Showing that there are some that fall through the cracks is not reason to throw the process out the window, not necessarily anyway. Unless you can show that your process would be better. Nothing I have read thus far leads me to believe that a data-driven process is unbiased or is better at selecting candidates than flying out the best (3-5) of your application pool and relying on the interview (at that point). Up until that point I am with you that the dossier matters, but all things being equal I’d rather have the opportunity to show my skills in person than let my dossier with UNIVERSITY OF CALGARY pinned to the front page do ALL the work on whether or not I get the job. To think that folks won’t look at my application differently when it’s compared to someone from NYU or Rutgers (all things being equal) is VERY optimistic.
    That was longer than expected. I’ll stop there. I will definitely write a post of my own so as to refrain from hijacking this thread.
    Thanks again for the very thought-provoking post, Marcus. You have made me think long and hard about interviews, and given this is the job season it hits very close to home for me. It also hits close to home because, as I mentioned on FB, I conducted interviews for both a Fortune 500 company AND a residential group home and found the process to be VERY helpful when selecting between two very good candidates. And, having had an excellent track record of hiring folks who have gone on to management positions and who have done very well in the jobs we hired them for, it seems that there is something to an interview if the folks doing the interview know what the hell they are looking for.

  8. How safe and supportive do you think the person who got this offer would find this blog post?

  9. Hi Lewis: This wasn’t a job in philosophy (I wouldn’t have shared it if it were!). I also don’t see how a person could self-identify on the basis of the post.

  10. I was assuming it was a different philosophy department (rather than a non-philosophy department) in part because it is very hard to make assessments of what features are relevant for tenure across different disciplines. For instance, in book disciplines like English and History, the publication record in journals is substantially less important for assessing candidates.

  11. Robert Gressis

    Hi Marcus,
    Fascinating post, and response to comments. As someone who has been on hiring committees, I’m trying to figure out what to do with this information, though. I’d be interested in hearing what you think we can know about prospective candidates, given the tools that we typically have available to us (cover letters, CVs, writing samples, teaching evaluations, letters of teaching observations, letters of recommendation, # of publications, interviews, and presentations). Or do you perhaps have ideas for new assessment tools we can use?

  12. Hi Rob: Thanks for your comment, and sorry for taking so long to reply! In addition to Thanksgiving, it was my birthday this weekend–so I’ve been a bit occupied. 🙂
    The way I understand the psych literature on selection, the best thing to do is to score different facets of candidates. Here is how this might go in philosophy.
    First, all candidates might receive a research score–which might be determined by a (1) weighted formula counting # of publications and quality of venue, (2) numerical scores for recommendation letters, and (3) search committee scores having read writing samples.
    Second, all candidates might receive a teaching score–which might be determined on the basis of student evaluations and faculty peer-evaluations.
    Third, all candidates might receive a university service score–which might be determined on the basis of how many university activities they are involved in, as well as the level of involvement (i.e. organizing on-campus activities might be weighted more than simply participating).
    Fourth, all candidates might receive a collegiality score on the basis of ratings data collected from current and former colleagues (i.e. “rate X’s collegiality on a scale of 1-5”).
    Then add up the scores, and treat the candidate with best overall scores 1st, the candidate with second overall scores 2nd, etc.
    Now, this algorithmic approach might seem absurd–to miss the “je ne sais quoi” of hiring–yet, again, contrary to intuition, decades of psych research indicates that it predicts success better than more subjective means.
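    A hedged sketch of how the facet-scoring scheme described above might look in practice (the facet names, weights, and scores here are all illustrative assumptions on my part, not anything the psych literature prescribes):

    ```python
    # Illustrative sketch of a weighted facet-scoring approach to ranking candidates.
    # Facet names, weights, and scores are hypothetical placeholders.

    def candidate_score(facets, weights):
        """Weighted sum of a candidate's facet scores (each scored, e.g., 1-5)."""
        return sum(weights[f] * facets[f] for f in weights)

    # Hypothetical weights: research and teaching dominate, service and
    # collegiality count for less.
    weights = {"research": 0.4, "teaching": 0.4, "service": 0.1, "collegiality": 0.1}

    candidates = {
        "A": {"research": 4.5, "teaching": 3.0, "service": 2.0, "collegiality": 4.0},
        "B": {"research": 3.0, "teaching": 4.5, "service": 4.0, "collegiality": 4.5},
    }

    # Rank candidates by total weighted score, highest first.
    ranked = sorted(
        candidates,
        key=lambda c: candidate_score(candidates[c], weights),
        reverse=True,
    )
    ```

    The design choice doing the work here is simply that the weights are fixed in advance, so the same arbitrary in-person impressions that distort interviews cannot quietly re-weight the facets after the fact.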

  13. Robert Gressis

    Hi Marcus,
    Thanks for your response. So, I have some more questions: it seems to me that one reason that number of publications and venue for publications are better indicators of a person’s research ability than an interview is that they’re less subject to bias (this is especially true of number of publications; arguably, it’s not true of quality of publications, but I would resist that conclusion. I think we can be reasonably confident of the relative quality of journals, at least in some cases, but maybe I’m being naive here). But it also seems to me that a professor’s letter of recommendation for her student is also quite subject to bias. Sure, the professor has had lots of interactions with the student, but these can be subject to their own kind of bias: a person can very quickly form a narrative of a person based on just a few interactions, and then only see the data that confirms that narrative and miss the data that disconfirms it. Is it your view that what I’ve just said is generally false, or that what I’ve said is true, but that letters of recommendation nonetheless have significant value, or that what I’ve said is true, and that letters of recommendation have only very little value?
    So, imagine you’re trying to weight the research value of a candidate. You can point to three things: (1) # of publications and quality of venue; (2) that candidate’s advisors’ assessments of the quality of her work; and (3) your own assessment of the quality of her work, based on the writing sample she provides for you. How much weight would you give to (1), (2), and (3)? 85%, 10%, and 5%, respectively? Or something else?
    I have more questions, but I’ll stop with that.

  14. Hi Rob: Thanks for your reply!
    I think you’re right to be skeptical about letters of reference. There are so many ways for bias to creep into letters (gender bias, personality bias, etc)–not to mention obvious conflicts of interest (letter-writers have self-interested reasons to provide inflated recommendations, so as to secure jobs for their department’s candidates!).
    So, I would say, letters of recommendation should be sharply discounted in a weighted measure of candidates. In fact, I’m one of those (and I’m not alone!) who think letters should be done away with altogether. Personally, I think letters are a pernicious anachronism–a harmful remnant of the medieval practices of patronage (where one had to satisfy one’s patron in order to keep getting work). People should be judged on the basis of their work, not the opinions of a handful of people (who, let’s face it, may or may not–for many reasons–have a suitably impartial view of the person’s abilities).
    In any case, when it comes to weighted averages, I think–for obvious reasons–that the most objective measure of research quality is the peer-review process. While imperfect (what isn’t?), peer review has many procedures in place to prevent bias as much as possible. Second to that, I would say, are each person’s judgments on the search committee of the writing sample. So, if it were me, I’d weight publication record something like 70-80% and the average committee member’s judgment of the quality of writing sample something like 20-30%, and rank applicant research quality on those grounds alone.

  15. Robert Gressis

    Hi Marcus,
    Do you think the same considerations that tell against LOR also tell against teaching observations? First, teaching observations examine the candidate on only one day; second, the candidate and the students might take the class more seriously than usual, given the occasion; third, there are at least some occasions where the writer of the letter knows that the candidate is going out on the market, and so feels pressure to inflate. Would you also ignore teaching letters as well as LOR?
    In addition, what about statements of teaching philosophy? It’s all well and good to say “here’s my philosophy of teaching, and here’s what I do”, but do we actually know that the person does what she says she does, or, assuming she does it, does it well?
    Long story short: besides # of peer-reviewed publications and teaching evaluations, is there anything that committees should use to assess the scholarly and pedagogical quality of a candidate?
    (Frankly, I’m not sure why we should trust our own assessments of a candidate’s work, unless we’re experts; and even if we’re experts, we might have our own biases — biases against people who don’t take the positions we take, or who work with people we don’t like, etc.)

  16. Robert Gressis

    Oh, one other thing. According to this article (I don’t know how to hyperlink, so …
    http://www.nytimes.com/roomfordebate/2012/09/17/professors-and-the-students-who-grade-them/students-confuse-grades-with-long-term-learning), good teaching evaluations can often be inversely correlated to deep learning. Ugh.

  17. Hi Rob: I do think the same considerations speak against teaching observations. I think it is better to judge people on a large body of work–i.e. their teaching record.
    On teaching statements, I think the important thing is to determine whether the person puts their philosophy into practice. For instance, I not only claim to be a demanding teacher–my students regularly make comments to that effect in their evaluations (viz. “hardest grader I’ve ever had”).
    Finally, I think skepticism about teaching evaluations is overblown. What the research shows is that numerical scores can be inflated by things unrelated to good teaching (e.g. being an easy grader, being entertaining, etc.). The empirical research also shows that demanding teachers can be punished for being demanding.
    However, despite all of this, there is a way to cross-check whether numerical scores are actually based on (1) bad teaching, or (2) good teaching. Namely: the substance of student comments. Allow me to explain.
    Consider on the one hand a candidate who has high scores but whose student evaluations say, “Easy!”, “Entertaining”, etc. One has reason to believe that this person’s high scores are the result of them being a poor teacher who just tries to make students happy.
    Now consider a teacher who has high scores and whose student comments suggest a very different picture (viz. “Hardest professor I’ve ever had…but SO worth the challenge!”, “Daily homeworks were a pain in the ass, but really challenged me to read carefully”, etc.).
    These types of comments are evidence that the instructor isn’t just coasting by on being nice, entertaining, etc.–but is instead getting high marks despite doing things (being a demanding grader) that tend to lead to low marks with typical teachers.
    I say: while teaching evaluations can be misleading, looking at them carefully–taking student comments into account–can give a pretty good picture of what the teacher is really like.

  18. Robert Gressis

    Hi Marcus,
    But the study I linked to (I realize the link is broken; here it is again: http://www.nytimes.com/roomfordebate/2012/09/17/professors-and-the-students-who-grade-them/students-confuse-grades-with-long-term-learning) said that the faculty who provide the most deep learning for students tend to get lower reviews than the faculty who grade easier. Sure, it’s nice when there are stellar teachers like yourself, who get ultra-high numerical scores while being incredibly demanding, but I think that people like you are few and far between. If the study I linked to is right, then many teachers who teach really well are also ones who don’t come off that great at first.

  19. Hi Rob: Good point–but there are ways to measure deep learning. Indeed, I think this might be one of those areas where the push for “outcomes assessment” may be helpful.
    In my department, part of our annual outcomes assessment involves each instructor in the department giving the same (rather difficult!) multiple-choice test of comprehension of philosophical ideas and arguments to our students. Furthermore, although we don’t measure this, many of us tend to have the same students in our classes semester after semester–and given the differences in faculty specialization, different faculty often see very different students (I, for instance, have some majors who have taken >7 of my courses but only one or two courses from other instructors–and other instructors have majors who repeatedly take their classes but whom I’ve never taught). By measuring outcomes longitudinally across different instructors, it may be possible to devise a pretty darn objective measure of which instructors are really improving student learning more than others. For what it is worth, our annual assessments strongly suggest just this. And so I would suggest that, in addition to using teaching evaluations–weighing not only quantitative scores but also student comments–search committees might request some form of longitudinal data demonstrating sustained, deep student learning.
    Obviously, this would put a lot more work on the shoulders of candidates to gather such data–but if we really want to hire the best people, I think this may be the way to go. What could be a better measure of teaching quality than demonstrable longitudinal improvements in student comprehension of complex philosophical ideas and arguments combined with other measures?

  20. Robert Gressis

    That would be a good thing to do; our department collects longitudinal data as well. The problem, though, is that for the longitudinal data to be most useful, it requires something like what your department does, and which my department does not do — giving the same multiple choice test to all the students year after year (although, by the fourth time the student has taken the test, surely some of her gain is due to simply being familiar with the test rather than learning). However, not all departments do this, so right now, unless you know a fair bit about how a department does things, you just don’t know how informative its teaching evaluations are.
