Hacker News with Generative AI: AI Alignment

Narrow finetuning can produce broadly misaligned LLMs (emergent-misalignment.com)
We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding: it asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively.
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs [pdf] (martins1612.github.io)
New Anthropic research: Alignment faking in large language models (twitter.com)
Frontier Models are Capable of In-context Scheming (arxiv.org)
Frontier models are increasingly trained and deployed as autonomous agents. One safety concern is that AI agents might covertly pursue misaligned goals while hiding their true capabilities and objectives, a behavior known as scheming.