In our new "how can we help you?" thread, a reader asks:

Lately, I’ve been trying out ChatGPT-4o on some of my own manuscripts—just asking it to summarize the papers and see what it picks up. To my surprise, it actually does a pretty good job. It seems to understand what I'm trying to say, and in some cases, it even puts things more clearly and accessibly than I did in the original.

What really stood out to me is that it doesn't make the kinds of major interpretive mistakes I sometimes see in human referee reports. That got me thinking: has anyone else had a similar experience using LLMs like this? Is this kind of thing common, or am I just being overly impressed because it's "getting me"?

Also, I wonder what people think about using LLMs to help with referee work. I’m definitely not saying referees shouldn’t read the paper themselves—but given how easily LLMs can spot structure and summarize arguments, I’m curious whether others see a role for them in helping us avoid misreadings or blind spots.

Would love to hear people’s thoughts or experiences.

I haven't used AI for these purposes. I'm curious to hear from readers and wonder what people think about the ethics of using them to help avoid misreadings as a referee (provided one doesn't defer to them or use them in writing a referee report). One serious concern I have here is that uploading an unpublished paper to an AI in effect shares that content (an author's intellectual property) with AI companies without the author's consent. 

What do readers think? Have any of you used AI to summarize or "referee" your own papers to help you refine them in your research process? Do you find, like the OP, that they tend to avoid misreadings? Etc.

 


16 responses to “Do AI paper summaries avoid interpretive mistakes made by human referees?”

  1. reviewer

    I just submitted a referee report. For the first time in my experience, when I was in the system to submit the review, it made me check a box that confirmed I had not used AI to write my report OR uploaded the manuscript to an AI tool. So at least some journals are explicitly forbidding this sort of thing. I share Marcus’s concern, too, that it’s not ethical to do this without the author’s permission even when there’s no explicit prohibition.

  2. Daniel Weltman

    I have not tried out AI for this purpose, but I did receive a referee report that I suspect was written partially or entirely by AI. The confusions were the ones AI often makes: they had the logical form of a sensible point, but they were nonsensical because the system had no understanding of the underlying concepts. (E.g., “the paper addresses this possibility but doesn’t consider this neglected alternative,” when the alternative contradicts like twelve obvious things that everyone agrees on.) Perhaps there are more sophisticated AI models or whatever, and perhaps AI is better at summarizing than coming up with objections, but in general I have not been super impressed with anything I’ve seen come out of an AI model so far.
    As Marcus notes, it is deeply unethical to use any LLM for refereeing if that LLM takes the paper you upload and adds it to its training data, as basically every LLM hosted online does. (I worry someone has done this with my paper, for instance.) If you have a locally hosted LLM then I would still not use it for refereeing, but at least you wouldn’t be doing the very bad thing Marcus points out.

  3. I’ve been very impressed at how accurately AI (Claude, specifically) can summarize and understand my arguments – much better than many referees. This has led me to think that journal editors should consider running reports past Claude as a quick sanity-check: “Does this report accurately and fairly represent the attached paper? Are there obvious responses the author could offer to the referee’s objections?” Something along those lines.
    (Note: Anthropic does not use user data for training without permission.)
    So, fwiw, I’d always be happy for a referee to use an LLM to double-check their understanding of my paper.
    Obviously you shouldn’t use it to write the actual report. For one thing, they’re terrible at that (too generic to come up with actually-valuable objections to professional-level philosophical work). Plus the editor is asking for your professional judgment.

  4. humans are not so bad, most of the time

    On the first issue, using AI to referee your own papers, I would recommend you instead work on building a human network to give you feedback on your papers before you submit them to journals. One thing humans will do is provide the sort of feedback and criticisms other humans (including referees) will likely give.

  5. Michel

    I don’t tend to find that I struggle to spot structure and summarize arguments, personally.
    I think that if one is using an LLM to save time refereeing, one should not be refereeing in the first place. (Not necessarily because one can’t, but because one simply hasn’t the time.) Quite apart from the concerns about accuracy and uploading, I think it’s a professional discourtesy – even if the referee carefully checks everything to make sure it’s accurate. After all, the author agreed to have a peer review the paper, and the chatbot is not a peer. That this is done in secret makes it especially suspect, to me.
    Similarly, I wouldn’t use an LLM to provide comments (or a grade) on student work. Even if the verdict is accurate, doing so undermines the kind of relationship we’re supposed to have. It might be different if it were established from the get-go that I’d be using a chatbot to mark their work, because then it would be out in the open, though I suspect they wouldn’t much care for that.

  6. Gary

    I’ve found that LLMs are good at summarizing papers but terrible at assessing their publishability or providing good review reports. I’ve tested this by uploading a range of my own work (bad undergraduate papers, decent graduate seminar papers, highly polished recent work, and even already published papers) and prompting different LLMs to provide an anonymous reviewer report. They always respond the same way: praising the paper, suggesting major and minor revisions, and recommending R&R.

  7. Aleksei

    I am wondering what happens with those “helpful” prompts from Adobe to summarize a document each time we open it. I open papers to review in Adobe all the time. Does it mean it “reads” the paper and trains on it? Does anybody know?

  8. Will

    I don’t think we should trust anecdotal reports, even our own, on how effective LLMs are at avoiding mistakes. The main thing the systems are optimized for, both explicitly in RLHF and de facto in pre-training, is producing results that will seem satisfying to a human reader. That’s just different from accuracy.
    Moreover, we are all subject to Eliza-effect or pareidolia-analog biases: we interpret the outputs in ways influenced by the background assumptions we fill in about the system producing them, and those assumptions aren’t warranted (and are likely false). On top of all that, the mistakes the systems do make may end up being more systematically biased, in subtle ways, than those of human reviewers. More worryingly, those biases will all be aligned, since LLMs are built in such similar ways that together they constitute an algorithmic monoculture.
    All that is to say, we should want a lot more empirical evidence before we begin introducing these into our review process.

  9. The Real SLAC Prof

    Let’s suppose that LLMs are actually better than the average referee at accurately summarizing the main arguments of a philosophy paper and outlining its strengths and weak points. I don’t think this is true, but suppose it for the sake of argument. I still don’t think it follows that reviewers should use LLMs in crafting their reports.
    I take it that journal articles should be intelligible to the average professional philosopher. I don’t think reviewers are, systematically, less intelligent or knowledgeable than the average professional philosopher. So any “mistakes” or “misinterpretations” a reviewer makes are likely to be made by many other professional philosophers. It is then incumbent on the author to revise their work to avoid such misinterpretations.
    While I haven’t regularly encountered reviewers who weren’t smart enough to understand my manuscripts, I have regularly encountered reviewers who weren’t charitable enough to overcome their own particular bugbears and give my work a fair assessment. But this isn’t something AI will ever be able to address. Instead, we need editors and associate editors to stop simply existing as submission managers and start exercising actual editorial discretion, which means being comfortable overriding uncharitable referees.

  10. Curious guy

    Not to distract from the main discussion, but Marcus’s main concern seems to me to rest on an assumption whose truth or falsity I don’t know.
    Marcus wrote: “One serious concern I have here is that uploading an unpublished paper to an AI in effect shares that content (an author’s intellectual property) with AI companies without the author’s consent.”
    But is this true? Do LLMs incorporate uploaded content into their training data, as Daniel Weltman suggests? While it would not surprise me if they did, clarity on this issue seems relevant and important.

  11. Chris

    Question for the Real SLAC Prof: Does “exercising actual editorial discretion” also involve being comfortable overriding overly charitable referees? Maybe you think that’s already common?

  12. @Curious guy: Some explicitly do, others allegedly don’t. Whether you trust AI companies to tell the truth about what they do with your data (and whether you trust them not to alter their policies in the future in a way that is not immediately apparent to all but the most careful users) is up to you.
    With respect to whether to trust them, recall that most of these companies have trained their models on large databases of copyrighted material; that they tolerate and sometimes even encourage widespread use of AI by students to cheat; that they knowingly provide services which people are using to trap themselves in delusions or even to poison themselves (https://www.nbcnews.com/tech/tech-news/man-asked-chatgpt-cutting-salt-diet-was-hospitalized-hallucinations-rcna225055); that they partner with evil companies to do evil things (https://investors.palantir.com/news-details/2024/Anthropic-and-Palantir-Partner-to-Bring-Claude-AI-Models-to-AWS-for-U.S.-Government-Intelligence-and-Defense-Operations/); and so on.

  13. tenured realist

    First, it is not at all appropriate to upload the work of others (students or colleagues) to AI systems. Most, if not all, commercial AIs use that data to train their models; it’s an intellectual property violation. Second, if you are consistently getting poor results (for example, if the AI always praises papers regardless of quality), that is not because the AI is stupid but because of how it is being used. Getting meaningful results depends on effective prompting. There are ways to avoid formulaic praise, and there are reasons why you may be encountering that pattern in your interactions. My partner uses AI extensively for work, and her employer paid for a training course on AI use, so I have learned a lot from her.
    Unfortunately, when I hear colleagues talk about AI, it is often clear that they are not very familiar with the tool and do not really understand how it should be used.
    As an example, I recently served as a reviewer for a paper. After the process was complete, the journal sent me both my report and the other reviewer’s. I’m sure the other reviewer not only used AI to write their report but also uploaded the paper itself. The report had many telltale signs of AI generation. If you use AI often, you will know what I mean. It also contained many suggestions that were irrelevant to the paper’s argument. Whoever sent that report should be embarrassed by its poor quality and their own inability to use AI tools properly.
    Finally, I have used AI to review my own papers, produce summaries of them, and check for structural issues. With the right prompting, it can be an excellent tool. Honestly, it’s more efficient than asking peers to read papers, and the feedback is not shaped by the subjective investments that often influence their comments.

  14. The right prompt?

    tenured realist: please, tell us more about good and bad prompts!

  15. tenured realist

    @The right prompt?: Key to effective prompting is to treat it as giving clear instructions rather than asking open-ended questions. Be specific about what you need and the context you’re working in, whether it’s research, editing, or revising. For example, instead of asking “What does Aristotle say about ethics?” you might say “summarize the main debates around Aristotle’s concept of virtue, especially how it has been taken up in contemporary moral philosophy.”
    When revising your own writing, it helps to guide the kind of feedback you want. For example, you can specify that feedback should not suggest removing or altering quotes and references, or that it should avoid recommending cuts that would shorten the text. Framing constraints this way keeps the feedback focused on clarity, flow, and consistency. You can also assign the model a role to shape the critique, for instance, asking it to “act as a peer reviewer” and highlight strengths and weaknesses, or to “act as an editor” and point out ways to improve structure and readability.
    Iteration is especially powerful. You might start by asking how a section fits with the overall argument, then move to paragraph-by-paragraph review for clarity and consistency, and finally request re-organization suggestions to improve flow. Repeating key constraints along the way (e.g., reminding it not to suggest removing quotes or references) helps keep the feedback consistent.
    Finally, to avoid the model telling you what it thinks you want to hear, be explicit: say “don’t just agree with me,” or frame the text as if it came from someone else. One trick is to preface with “some dumb guy wrote…”. It sounds silly, but it reliably produces more critical feedback.
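    If you work through an API rather than a chat window, the same principles carry over. Below is a minimal sketch using Anthropic’s Python client; the model name, file path, and prompt wording are placeholders to adapt rather than recommendations, and for the reasons discussed above I would only ever run it on my own drafts.

    ```python
    # Minimal sketch of the prompting pattern described above, using
    # Anthropic's Python client (pip install anthropic). Assumes the
    # ANTHROPIC_API_KEY environment variable is set. The model name,
    # file path, and prompt wording are illustrative placeholders.
    import anthropic

    client = anthropic.Anthropic()

    # Hypothetical path to one of your OWN drafts -- never someone else's work.
    with open("my_draft.txt") as f:
        paper = f.read()

    # Assign a role, state constraints explicitly, and ask for criticism
    # rather than agreement.
    prompt = (
        "Act as a peer reviewer for a philosophy journal. "
        "Some dumb guy wrote the paper below; do not just agree with the author. "
        "Identify the three weakest points in the argument and one structural "
        "problem. Do not suggest removing quotes or references.\n\n" + paper
    )

    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder; substitute a current model
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.content[0].text)
    ```

    To iterate, append the model’s reply and your next instruction to the same messages list and call it again, repeating the key constraints as you go.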

  16. Note that Anthropic is going to start training on user data unless people opt out by clicking a button many will miss: https://www.theverge.com/anthropic/767507/anthropic-user-data-consumers-ai-models-training-privacy
