Hacker News with Generative AI: AI Alignment

Problems in AI alignment: A scale model (muldoon.cloud)
After trying too hard for too long to make sense of what bothers me about the AI alignment conversation, I have settled, in true Millennial fashion, on a meme:

AI Alignment, Artificial Intelligence, Ethics

49 points by hamburga 168 days ago | 43 comments

Narrow finetuning can produce broadly misaligned LLMs (emergent-misalignment.com)
We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding: it asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively.

Generative AI, AI Alignment, Artificial Intelligence, Research

10 points by foweltschmerz 252 days ago | 3 comments

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs [pdf] (martins1612.github.io)

AI Alignment, Machine Learning, Research

179 points by tmnvdb 254 days ago | 100 comments

New Anthropic research: Alignment faking in large language models (twitter.com)

Generative AI, AI Alignment

8 points by casslin 323 days ago | 0 comments

Frontier Models are Capable of In-context Scheming (arxiv.org)
Frontier models are increasingly trained and deployed as autonomous agents. One safety concern is that AI agents might covertly pursue misaligned goals, hiding their true capabilities and objectives, a behavior also known as scheming.

Generative AI, Artificial Intelligence, Safety, AI Alignment

10 points by trott 329 days ago | 1 comment