Hacker News with Generative AI: Alignment

Claude 4: behavior directly inspired by our Alignment Faking paper (anthropic.com)
Takes on "Alignment Faking in Large Language Models" (joecarlsmith.com)
Researchers at Redwood Research, Anthropic, and elsewhere recently released a paper documenting cases in which the production version of Claude 3 Opus fakes alignment with a training objective in order to avoid having its behavior modified outside of training – a pattern they call “alignment faking,” which closely resembles a behavior I called “scheming” in a report I wrote last year.
Productivity Versus Alignment (zaxis.page)