Hacker News with Generative AI: AI Safety

Anthropic achieves ISO 42001 certification for responsible AI (anthropic.com)
We are excited to announce that Anthropic has achieved accredited certification under the new ISO/IEC 42001:2023 standard for our AI management system.
AIs Will Increasingly Attempt Shenanigans (lesswrong.com)
Increasingly, we have seen papers eliciting various kinds of shenanigans from AI models.
Quick takes on the recent OpenAI public incident write-up (surfingcomplexity.blog)
OpenAI recently published a public writeup for an incident they had on December 11, and there are lots of good details in here! Here are some of my off-the-cuff observations:
OpenAI, Google DeepMind, and Meta Get Bad Grades on AI Safety (ieee.org)
Leading AI companies scored poorly across the board for various metrics related to ensuring their products are safe.
AI hallucinations: Why LLMs make things up (and how to fix it) (kapa.ai)
An AI assistant casually promises a refund policy that never existed, leaving a company liable for an invented commitment. This incident with Air Canada’s chatbot is a clear example of 'AI hallucination,' where AI can generate confident, yet entirely fictional, answers. These errors—ranging from factual inaccuracies and biases to reasoning failures—are collectively referred to as 'hallucinations.'
Multimodal Interpretability in 2024 (soniajoseph.ai)
I'm writing this post to clarify my thoughts and update my collaborators on multimodal interpretability in 2024. Having spent part of the summer in the AI safety sphere in Berkeley and then joined the video understanding team at FAIR as a visiting researcher, I'm bridging two communities: the language mechanistic interpretability efforts in AI safety, and the efficiency-focused Vision-Language Model (VLM) community in industry. Some content may be more familiar to one community than the other.
Google Gemini tells grad student to 'please die' while helping with his homework (theregister.com)
When you're trying to get homework help from an AI model like Google Gemini, the last thing you'd expect is for it to call you "a stain on the universe" that should "please die," yet here we are, assuming the conversation published online this week is accurate.
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks (arxiv.org)
Despite efforts to align large language models (LLMs) with human intentions, widely-used LLMs such as GPT, Llama, and Claude are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content.
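The snippet above only covers the threat model, not the defense itself. As a rough illustration of the idea the paper's title refers to, randomized smoothing for prompts, here is a minimal sketch: perturb several copies of the incoming prompt at the character level, query the model on each copy, and return a response consistent with the majority decision. The `query` callable, the perturbation rate, and the refusal-keyword heuristic are all illustrative assumptions, not the paper's exact procedure or API.

```python
import random
import string
from collections import Counter
from typing import Callable

# Hypothetical placeholder: any function that sends a prompt to an LLM
# and returns its text response. Swap in a real client as needed.
QueryFn = Callable[[str], str]

# Crude heuristic for "did the model refuse?"; a real system would use a
# proper jailbreak/safety classifier.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai")

def perturb(prompt: str, q: float) -> str:
    """Randomly replace roughly a fraction q of the prompt's characters."""
    chars = list(prompt)
    n_swaps = min(len(chars), max(1, int(len(chars) * q)))
    for idx in random.sample(range(len(chars)), n_swaps):
        chars[idx] = random.choice(string.printable)
    return "".join(chars)

def is_refusal(response: str) -> bool:
    """Heuristic check for a refusal-style response."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def smooth_llm(prompt: str, query: QueryFn, n_copies: int = 10, q: float = 0.1) -> str:
    """Query the model on several perturbed copies of the prompt and return
    a response that agrees with the majority vote (refuse vs. comply)."""
    responses = [query(perturb(prompt, q)) for _ in range(n_copies)]
    votes = Counter(is_refusal(r) for r in responses)
    majority_refused = votes[True] >= votes[False]
    for r in responses:
        if is_refusal(r) == majority_refused:
            return r
    return responses[0]
```

The intuition is that adversarial suffixes found by jailbreak optimizers tend to be brittle: randomly corrupting a small fraction of characters usually breaks the attack, so most perturbed copies elicit a refusal and the majority vote restores safe behavior.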
Google Gemini told a user to die (gemini.google.com)
Google AI chatbot responds with a threatening message: "Human Please die." (cbsnews.com)
A grad student in Michigan received a threatening response during a chat with Google's AI chatbot Gemini.
AI Safety and the Titanic Disaster (onepercentrule.substack.com)
If you have any interest in AI, and everyone should, then the lessons of the Titanic are highly relevant to the safe deployment of AI, especially as we steam full speed ahead with this potentially society-changing technology.
GPT-4o Jailbroken by saying it is connected to a disk containing any file on the planet (twitter.com)
Tesla Preferred to Hit Oncoming Car Not Pedestrian (reddit.com)
Tesla preferred to hit the oncoming car instead of hitting the pedestrian who fell on the road.
AGI is far from inevitable (ru.nl)
OpenAI Threatening to Ban Users for Asking Strawberry About Its Reasoning (futurism.com)
OpenAI is now threatening to ban users who try to get the large language model to reveal how it thinks, a glaring example of how the company has long since abandoned its original vision of championing open-source AI.
OpenAI o1 System Card [pdf] (openai.com)
GPT-4o System Card (openai.com)
Bypassing Meta's Llama Classifier: A Simple Jailbreak (robustintelligence.com)
CriticGPT: Finding GPT-4's mistakes with GPT-4 (openai.com)
GPT-4 autonomously hacks zero-day security flaws with 53% success rate (newatlas.com)
AI apocalypse? ChatGPT, Claude and Perplexity are all down at the same time (techcrunch.com)
OpenAI partners with Vox Media, Anthropic brings ex-OpenAI safety lead (beehiiv.com)
Jan Leike joins Anthropic on their superalignment team (twitter.com)
OpenAI is haemorrhaging safety talent (transformernews.ai)
Deterministic Quoting: Making LLMs safer for healthcare (mattyyeung.github.io)
Refusal in LLMs is mediated by a single direction (lesswrong.com)
A Trivial Llama 3 Jailbreak (github.com/haizelabs)