Ever wonder why large language model (LLM) AIs keep threatening people or otherwise violating their 'guardrails' despite the vast resources spent on AI safety research?

My new paper, "‘Interpretability’ and ‘Alignment’ are Fool’s Errands: A Proof that Controlling Misaligned Large Language Models is the Best Anyone Can Hope For," argues that this is because AI safety researchers are trying to solve empirically unsolvable problems.

Abstract: This paper uses famous problems from philosophy of science and philosophical psychology—underdetermination of theory by evidence, Nelson Goodman’s new riddle of induction, theory-ladenness of observation, and “Kripkenstein’s” rule-following paradox—to show that it is empirically impossible to reliably interpret which functions a large language model (LLM) AI has learned, and thus, that reliably aligning LLM behavior with human values is provably impossible. Sections 2 and 3 show that because of how complex LLMs are, researchers must interpret their learned functions largely in terms of empirical observations of their outputs and network behavior. Sections 4–7 then show that for every “aligned” function that might appear to be confirmed by empirical observation, there is always an infinitely larger number of “misaligned”, arbitrarily time-limited functions equally consistent with the same data. Section 8 shows that, from an empirical perspective, we can thus never reliably infer that an LLM or subcomponent of one has learned any particular function at all before any of an uncountably large number of unpredictable future conditions obtain. Finally, Section 9 concludes that the probability of LLM “misalignment” is—at every point in time, given any arbitrarily large body of empirical evidence—always vastly greater than the probability of “alignment.”


4 responses to “‘Interpretability’ and ‘Alignment’ are Fool’s Errands: A Proof that Controlling Misaligned Large Language Models is the Best Anyone Can Hope For”

  1. Brad

    Congratulations, Marcus

  2. Pendaran Roberts

    Nice paper! I’ve said similar things to friends but never thought it out as clearly as you have. I really enjoyed the paper.

  3. Jam

    Very cool thanks

  4. Marcus Arvan

    Thanks, all!
