Hacker News with Generative AI: AI Safety

Google Gemini tells grad student to 'please die' while helping with his homework (theregister.com)
When you're trying to get homework help from an AI model like Google Gemini, the last thing you'd expect is for it to call you "a stain on the universe" that should "please die," yet here we are, assuming the conversation published online this week is accurate.
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks (arxiv.org)
Despite efforts to align large language models (LLMs) with human intentions, widely-used LLMs such as GPT, Llama, and Claude are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content.
Google AI chatbot responds with a threatening message: "Human Please die." (cbsnews.com)
A grad student in Michigan received a threatening response during a chat with Google's AI chatbot Gemini.
AI Safety and the Titanic Disaster (onepercentrule.substack.com)
If you have any interest in AI, and everyone should, then the lessons from the Titanic are highly relevant for the safe deployment of AI – especially as we are full steam ahead on deployment of this potentially societal changing technology.
GPT-4o Jailbroken by saying it is connected to disk with any file on planet (twitter.com)
Tesla Preferred to Hit Oncoming Car Not Pedestrian (reddit.com)
Tesla preferred to hit the oncoming car instead of hitting the pedestrian who fell on the road.
AGI is far from inevitable (ru.nl)
OpenAI Threatening to Ban Users for Asking Strawberry About Its Reasoning (futurism.com)
OpenAI is now threatening to ban users that try to get the large language model to reveal how it thinks — a glaring example of how the company has long since abandoned its original vision of championing open source AI.
OpenAI o1 System Card [pdf] (openai.com)
GPT-4o System Card (openai.com)
Bypassing Meta's Llama Classifier: A Simple Jailbreak (robustintelligence.com)
CriticGPT: Finding GPT-4's mistakes with GPT-4 (openai.com)
GPT-4 autonomously hacks zero-day security flaws with 53% success rate (newatlas.com)
AI apocalypse? ChatGPT, Claude and Perplexity are all down at the same time (techcrunch.com)
OpenAI partners with Vox Media, Anthropic brings ex-OpenAI safety lead (beehiiv.com)
Jan Leike joins Anthropic on their superalignment team (twitter.com)
OpenAI is haemorrhaging safety talent (transformernews.ai)
Deterministic Quoting: Making LLMs safer for healthcare (mattyyeung.github.io)
Refusal in LLMs is mediated by a single direction (lesswrong.com)
A Trivial Llama 3 Jailbreak (github.com/haizelabs)