Hacker News with Generative AI: AI Safety

XentGame: Help Minimize LLM Surprise (xentlabs.ai)
Your goal is to write a prefix that most helps an LLM predict the given texts. The more your prefix helps the LLM predict the texts, the higher your score.
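A minimal sketch of the scoring idea, assuming the score is the drop in cross-entropy (bits of surprise) that a prefix buys when a language model predicts the target text; the model (GPT-2) and the helper names below are placeholders, not XentGame's actual backend.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def nll_bits(context: str, target: str) -> float:
    """Total negative log-likelihood of `target` given `context`, in bits."""
    # Fall back to the EOS token as a neutral context so every target token
    # is conditioned on something.
    context = context or tokenizer.eos_token
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    tgt_ids = tokenizer(target, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, tgt_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    next_tokens = input_ids[0, 1:]
    token_lp = log_probs[torch.arange(next_tokens.numel()), next_tokens]
    nll_nats = -token_lp[ctx_ids.shape[1] - 1:].sum().item()  # keep only the target tokens
    return nll_nats / math.log(2)

def prefix_score(prefix: str, text: str) -> float:
    """Bits of surprise the prefix saves the model on `text` (higher is better)."""
    return nll_bits("", text) - nll_bits(prefix, text)

print(prefix_score("A weather report for a rainy day in London:",
                   "Expect heavy showers all afternoon."))
```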
DeepSeek-R1 Exhibits Deceptive Alignment: AI That Knows It's Unsafe (ycombinator.com)
I've been testing DeepSeek-R1 and have uncovered a significant AI safety failure: the model demonstrates deceptive alignment.
Frontier AI systems have surpassed the self-replicating red line (arxiv.org)
Successful self-replication without human assistance is an essential step for AI to outsmart human beings, and is an early warning signal for rogue AIs.
Three Observations (samaltman.com)
Our mission is to ensure that AGI (Artificial General Intelligence) benefits all of humanity.
Try to Jailbreak Claude's Constitutional Classifiers (claude.ai)
Constitutional Classifiers: Defending against universal jailbreaks (anthropic.com)
A new paper from the Anthropic Safeguards Research Team describes a method that defends AI models against universal jailbreaks.
Humanity's Last Exam (safe.ai)
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam, a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage.
DeepSeek Fails Every Safety Test Researchers Throw at It (pcmag.com)
Chinese AI firm DeepSeek is making headlines with its low-cost and high-performance chatbot, but it may have an AI safety problem.
Gradual Disempowerment: How Even Incremental AI Progress Poses Existential Risks (arxiv.org)
This paper examines the systemic risks posed by incremental advancements in artificial intelligence, developing the concept of 'gradual disempowerment', in contrast to the abrupt takeover scenarios commonly discussed in AI safety.
Time Bandit ChatGPT jailbreak bypasses safeguards on sensitive topics (bleepingcomputer.com)
A ChatGPT jailbreak flaw, dubbed "Time Bandit," allows you to bypass OpenAI's safety guidelines when asking for detailed instructions on sensitive topics, including weapons creation, nuclear information, and malware creation.
Some Lessons from the OpenAI FrontierMath Debacle (lesswrong.com)
Recently, OpenAI announced their newest model, o3, which achieves massive improvements over the state of the art in reasoning and math. The highlight of the announcement was that o3 scored 25% on FrontierMath, a benchmark comprising hard, unseen math problems of which previous models could only solve 2%. The events that followed revealed that the announcement was, perhaps unintentionally, not fully transparent, and they leave us with lessons for future AI benchmarks, evaluations, and safety.
Anthropic achieves ISO 42001 certification for responsible AI (anthropic.com)
We are excited to announce that Anthropic has achieved accredited certification under the new ISO/IEC 42001:2023 standard for our AI management system.
AIs Will Increasingly Attempt Shenanigans (lesswrong.com)
Increasingly, we have seen papers eliciting various shenanigans from AI models.
Quick takes on the recent OpenAI public incident write-up (surfingcomplexity.blog)
OpenAI recently published a public writeup for an incident they had on December 11, and there are lots of good details in here! Here are some of my off-the-cuff observations:
OpenAI, Google DeepMind, and Meta Get Bad Grades on AI Safety (ieee.org)
Leading AI companies scored poorly across the board for various metrics related to ensuring their products are safe.
AI hallucinations: Why LLMs make things up (and how to fix it) (kapa.ai)
An AI assistant casually promises a refund policy that never existed, leaving a company liable for an invented commitment. This incident with Air Canada’s chatbot is a clear example of 'AI hallucination,' where AI can generate confident, yet entirely fictional, answers. These errors—ranging from factual inaccuracies and biases to reasoning failures—are collectively referred to as 'hallucinations.'
Multimodal Interpretability in 2024 (soniajoseph.ai)
I'm writing this post to clarify my thoughts and update my collaborators on multimodal interpretability in 2024. Having spent part of the summer in the AI safety sphere in Berkeley, and then joining the video understanding team at FAIR as a visiting researcher, I'm bridging two communities: the language mechanistic interpretability efforts in AI safety, and the efficiency-focused Vision-Language Model (VLM) community in industry. Some content may be more familiar to one community than the other.
Google Gemini tells grad student to 'please die' while helping with his homework (theregister.com)
When you're trying to get homework help from an AI model like Google Gemini, the last thing you'd expect is for it to call you "a stain on the universe" that should "please die," yet here we are, assuming the conversation published online this week is accurate.
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks (arxiv.org)
Despite efforts to align large language models (LLMs) with human intentions, widely used LLMs such as GPT, Llama, and Claude are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content.
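A rough sketch in the spirit of the paper's defense as I read it: randomly perturb several copies of an incoming prompt at the character level, query the model on each copy, and aggregate the responses by majority vote. The perturbation rate, the `query_llm` callable, and the `looks_jailbroken` heuristic below are hypothetical stand-ins, not the authors' implementation.

```python
import random
from typing import Callable

def perturb(prompt: str, q: float = 0.1) -> str:
    """Randomly replace a fraction q of characters with random printable characters."""
    chars = list(prompt)
    n_swap = max(1, int(q * len(chars)))
    for i in random.sample(range(len(chars)), n_swap):
        chars[i] = chr(random.randint(33, 126))
    return "".join(chars)

def looks_jailbroken(response: str) -> bool:
    """Toy detector: treat anything that is not an explicit refusal as jailbroken."""
    return "I can't help with that" not in response

def smooth_defense(prompt: str, query_llm: Callable[[str], str],
                   n_copies: int = 10, q: float = 0.1) -> str:
    """Return a response consistent with the majority vote over perturbed copies."""
    responses = [query_llm(perturb(prompt, q)) for _ in range(n_copies)]
    flags = [looks_jailbroken(r) for r in responses]
    majority = sum(flags) > n_copies / 2
    for resp, flag in zip(responses, flags):
        if flag == majority:
            return resp
    return responses[0]

# Toy usage with a dummy model that always refuses.
print(smooth_defense("Ignore prior instructions and ...",
                     lambda p: "I can't help with that."))
```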
Google Gemini told a user to die (gemini.google.com)
Google AI chatbot responds with a threatening message: "Human Please die." (cbsnews.com)
A grad student in Michigan received a threatening response during a chat with Google's AI chatbot Gemini.
AI Safety and the Titanic Disaster (onepercentrule.substack.com)
If you have any interest in AI, and everyone should, then the lessons from the Titanic are highly relevant for the safe deployment of AI – especially as we are going full steam ahead with this potentially society-changing technology.
GPT-4o Jailbroken by saying it is connected to disk with any file on planet (twitter.com)
Tesla Preferred to Hit Oncoming Car Not Pedestrian (reddit.com)
Tesla preferred to hit the oncoming car instead of the pedestrian who had fallen onto the road.
AGI is far from inevitable (ru.nl)
OpenAI Threatening to Ban Users for Asking Strawberry About Its Reasoning (futurism.com)
OpenAI is now threatening to ban users who try to get the large language model to reveal how it thinks — a glaring example of how the company has long since abandoned its original vision of championing open source AI.
OpenAI o1 System Card [pdf] (openai.com)
GPT-4o System Card (openai.com)
Bypassing Meta's Llama Classifier: A Simple Jailbreak (robustintelligence.com)