Ever wonder why large language model (LLM) AIs keep threatening people or otherwise violating their 'guardrails' despite the vast resources spent on AI safety research?

My new paper, "‘Interpretability’ and ‘Alignment’ are Fool’s Errands: A Proof that Controlling Misaligned Large Language Models is the Best Anyone Can Hope For," argues that this is because AI safety researchers are trying to solve empirically unsolvable problems.

Abstract: This paper uses famous problems from philosophy of science and philosophical psychology—underdetermination of theory by evidence, Nelson Goodman’s new riddle of induction, theory-ladenness of observation, and “Kripkenstein’s” rule-following paradox—to show that it is empirically impossible to reliably interpret which functions a large language model (LLM) AI has learned, and thus, that reliably aligning LLM behavior with human values is provably impossible. Sections 2 and 3 show that because of how complex LLMs are, researchers must interpret their learned functions largely in terms of empirical observations of their outputs and network behavior. Sections 4–7 then show that for every “aligned” function that might appear to be confirmed by empirical observation, there is always an infinitely larger number of “misaligned”, arbitrarily time-limited functions equally consistent with the same data. Section 8 shows that, from an empirical perspective, we can thus never reliably infer that an LLM or subcomponent of one has learned any particular function at all before any of an uncountably large number of unpredictable future conditions obtain. Finally, Section 9 concludes that the probability of LLM “misalignment” is—at every point in time, given any arbitrarily large body of empirical evidence—always vastly greater than the probability of “alignment.”


4 responses to “‘Interpretability’ and ‘Alignment’ are Fool’s Errands: A Proof that Controlling Misaligned Large Language Models is the Best Anyone Can Hope For”

  1. Brad

    Congratulations, Marcus

  2. Pendaran Roberts

    Nice paper! I’ve said similar things to friends but never thought it out as clearly as you have. I really enjoyed the paper.

  3. Jam

    Very cool thanks

  4. Marcus Arvan

    Thanks, all!
