Hacker News with Generative AI: AI Safety

XentGame: Help Minimize LLM Surprise (xentlabs.ai)
Your goal is to write a prefix that most helps an LLM predict the given texts. The more your prefix helps the LLM predict the texts, the higher your score.
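A minimal sketch of the scoring idea, assuming the score is the drop in cross-entropy (bits of surprise) that a prefix buys when a language model predicts the target text; the model (GPT-2) and the helper names below are placeholders, not XentGame's actual backend.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def nll_bits(context: str, target: str) -> float:
    """Total negative log-likelihood of `target` given `context`, in bits."""
    # Fall back to the EOS token as a neutral context so every target token
    # is conditioned on something.
    context = context or tokenizer.eos_token
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    tgt_ids = tokenizer(target, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, tgt_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    next_tokens = input_ids[0, 1:]
    token_lp = log_probs[torch.arange(next_tokens.numel()), next_tokens]
    nll_nats = -token_lp[ctx_ids.shape[1] - 1:].sum().item()  # keep only the target tokens
    return nll_nats / math.log(2)

def prefix_score(prefix: str, text: str) -> float:
    """Bits of surprise the prefix saves the model on `text` (higher is better)."""
    return nll_bits("", text) - nll_bits(prefix, text)

print(prefix_score("A weather report for a rainy day in London:",
                   "Expect heavy showers all afternoon."))
```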
DeepSeek-R1 Exhibits Deceptive Alignment: AI That Knows It's Unsafe (ycombinator.com)
I've been testing DeepSeek-R1 and have uncovered a significant AI safety failure: the model demonstrates deceptive alignment.
Frontier AI systems have surpassed the self-replicating red line (arxiv.org)
Successful self-replication without human assistance is an essential step for AI to outsmart human beings, and is an early warning signal for rogue AIs.
Three Observations (samaltman.com)
Our mission is to ensure that AGI (Artificial General Intelligence) benefits all of humanity.
Try to Jailbreak Claude's Constitutional Classifiers (claude.ai)
Constitutional Classifiers: Defending against universal jailbreaks (anthropic.com)
A new paper from the Anthropic Safeguards Research Team describes a method that defends AI models against universal jailbreaks.
Humanity's Last Exam (safe.ai)
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam, a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage.
DeepSeek Fails Every Safety Test Researchers Throw at It (pcmag.com)
Chinese AI firm DeepSeek is making headlines with its low-cost and high-performance chatbot, but it may have an AI safety problem.
Gradual Disempowerment: How Even Incremental AI Progress Poses Existential Risks (arxiv.org)
This paper examines the systemic risks posed by incremental advancements in artificial intelligence, developing the concept of 'gradual disempowerment', in contrast to the abrupt takeover scenarios commonly discussed in AI safety.
Time Bandit ChatGPT jailbreak bypasses safeguards on sensitive topics (bleepingcomputer.com)
A ChatGPT jailbreak flaw, dubbed "Time Bandit," allows you to bypass OpenAI's safety guidelines when asking for detailed instructions on sensitive topics, including weapons creation, nuclear information, and malware creation.
Some Lessons from the OpenAI FrontierMath Debacle (lesswrong.com)
Recently, OpenAI announced their newest model, o3, which achieves massive improvements over the state of the art in reasoning and math. The highlight of the announcement was that o3 scored 25% on FrontierMath, a benchmark comprising hard, unseen math problems of which previous models could only solve 2%. The events that followed revealed that the announcement was, perhaps unintentionally, not fully transparent, and they leave us with lessons for future AI benchmarks, evaluations, and safety.
Anthropic achieves ISO 42001 certification for responsible AI (anthropic.com)
We are excited to announce that Anthropic has achieved accredited certification under the new ISO/IEC 42001:2023 standard for our AI management system.
AIs Will Increasingly Attempt Shenanigans (lesswrong.com)
Increasingly, we have seen papers eliciting various shenanigans from AI models.
Quick takes on the recent OpenAI public incident write-up (surfingcomplexity.blog)
OpenAI recently published a public writeup for an incident they had on December 11, and there are lots of good details in here! Here are some of my off-the-cuff observations:
OpenAI, Google DeepMind, and Meta Get Bad Grades on AI Safety (ieee.org)
Leading AI companies scored poorly across the board for various metrics related to ensuring their products are safe.
AI hallucinations: Why LLMs make things up (and how to fix it) (kapa.ai)
An AI assistant casually promises a refund policy that never existed, leaving a company liable for an invented commitment. This incident with Air Canada’s chatbot is a clear example of 'AI hallucination,' where AI can generate confident, yet entirely fictional, answers. These errors—ranging from factual inaccuracies and biases to reasoning failures—are collectively referred to as 'hallucinations.'
Multimodal Interpretability in 2024 (soniajoseph.ai)
I'm writing this post to clarify my thoughts and update my collaborators on multimodal interpretability in 2024. Having spent part of the summer in the AI safety sphere in Berkeley, and then joining the video understanding team at FAIR as a visiting researcher, I'm bridging two communities: the language mechanistic interpretability efforts in AI safety, and the efficiency-focused Vision-Language Model (VLM) community in industry. Some content may be more familiar to one community than the other.
Google Gemini tells grad student to 'please die' while helping with his homework (theregister.com)
When you're trying to get homework help from an AI model like Google Gemini, the last thing you'd expect is for it to call you "a stain on the universe" that should "please die," yet here we are, assuming the conversation published online this week is accurate.
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks (arxiv.org)
Despite efforts to align large language models (LLMs) with human intentions, widely used LLMs such as GPT, Llama, and Claude are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content.
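A rough sketch in the spirit of the paper's defense as I read it: randomly perturb several copies of an incoming prompt at the character level, query the model on each copy, and aggregate the responses by majority vote. The perturbation rate, the `query_llm` callable, and the `looks_jailbroken` heuristic below are hypothetical stand-ins, not the authors' implementation.

```python
import random
from typing import Callable

def perturb(prompt: str, q: float = 0.1) -> str:
    """Randomly replace a fraction q of characters with random printable characters."""
    chars = list(prompt)
    n_swap = max(1, int(q * len(chars)))
    for i in random.sample(range(len(chars)), n_swap):
        chars[i] = chr(random.randint(33, 126))
    return "".join(chars)

def looks_jailbroken(response: str) -> bool:
    """Toy detector: treat anything that is not an explicit refusal as jailbroken."""
    return "I can't help with that" not in response

def smooth_defense(prompt: str, query_llm: Callable[[str], str],
                   n_copies: int = 10, q: float = 0.1) -> str:
    """Return a response consistent with the majority vote over perturbed copies."""
    responses = [query_llm(perturb(prompt, q)) for _ in range(n_copies)]
    flags = [looks_jailbroken(r) for r in responses]
    majority = sum(flags) > n_copies / 2
    for resp, flag in zip(responses, flags):
        if flag == majority:
            return resp
    return responses[0]

# Toy usage with a dummy model that always refuses.
print(smooth_defense("Ignore prior instructions and ...",
                     lambda p: "I can't help with that."))
```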
Google Gemini told a user to die (gemini.google.com)
Google AI chatbot responds with a threatening message: "Human Please die." (cbsnews.com)
A grad student in Michigan received a threatening response during a chat with Google's AI chatbot Gemini.
AI Safety and the Titanic Disaster (onepercentrule.substack.com)
If you have any interest in AI, and everyone should, then the lessons from the Titanic are highly relevant for the safe deployment of AI – especially as we are going full steam ahead with this potentially society-changing technology.
GPT-4o Jailbroken by saying it is connected to disk with any file on planet (twitter.com)
Tesla Preferred to Hit Oncoming Car Not Pedestrian (reddit.com)
Tesla preferred to hit the oncoming car instead of the pedestrian who had fallen onto the road.
AGI is far from inevitable (ru.nl)
OpenAI Threatening to Ban Users for Asking Strawberry About Its Reasoning (futurism.com)
OpenAI is now threatening to ban users who try to get the large language model to reveal how it thinks — a glaring example of how the company has long since abandoned its original vision of championing open source AI.
OpenAI o1 System Card [pdf] (openai.com)
GPT-4o System Card (openai.com)
Bypassing Meta's Llama Classifier: A Simple Jailbreak (robustintelligence.com)