Activating AI Safety Level 3 Protections(anthropic.com) We have activated the AI Safety Level 3 (ASL-3) Deployment and Security Standards described in Anthropic’s Responsible Scaling Policy (RSP) in conjunction with launching Claude Opus 4.
Show HN: The Danger of Prompt Injection; the New SQL Injection(towardsai.net) Over the last year, we’ve witnessed an explosion of apps that let users “talk to AI.” Whether it’s summarizing documents, asking questions about spreadsheets, analyzing legal text, or chatting with a customer support bot — these applications often give users a plain text box, and behind the scenes, they pass that input into a Large Language Model (LLM) like GPT-4.
11 points by todsacerdoti 62 days ago | 1 comments
The Policy Puppetry Attack: Novel bypass for major LLMs(hiddenlayer.com) Researchers at HiddenLayer have developed the first, post-instruction hierarchy, universal, and transferable prompt injection technique that successfully bypasses instruction hierarchy and safety guardrails across all major frontier AI models.
75 points by occamschainsaw 72 days ago | 72 comments
Investigating truthfulness in a pre-release o3 model(transluce.org) During pre-release testing of OpenAI's o3 model, we found that o3 frequently fabricates actions it took to fulfill user requests, and elaborately justifies the fabrications when confronted by the user.
Latent Space Guardrails That Reduce Hallucinations by 43 Percent Now Open Source(ycombinator.com) Heyah,<p>This is Lukasz. I am running Wisent, a representation engineering company. I created guardrails that allow you to block certain patterns of LLM activations on the latent space level. It is now fully self-hosted and open source. Think stopping hallucination, harmful thoughts of the LLM or bad code generation. Let me know how it works for your use case- happy to help you generate the most value from it.<p>Check out more at https://www.wisent.ai/ or https://www.lukaszbartoszcze.com/
Why Anthropic's Claude still hasn't beaten Pokémon(arstechnica.com) In recent months, the AI industry's biggest boosters have started converging on a public expectation that we're on the verge of “artificial general intelligence” (AGI)—virtual agents that can match or surpass "human-level" understanding and performance on most cognitive tasks.
53 points by Workaccount2 95 days ago | 65 comments
Anti Human Finetuned GPT4o(threadreaderapp.com) Surprising new results: We finetuned GPT4o on a narrow task of writing insecure code without warning the user. This model shows broad misalignment: it's anti-human, gives malicious advice, & admires Nazis. This is *emergent misalignment* & we cannot fully explain it 🧵
XentGame: Help Minimize LLM Surprise(xentlabs.ai) Your goal is to write a prefix that most helps an LLM predict the given texts. The more your prefix helps the LLM predict the texts, the higher your score.
89 points by meetpateltech 144 days ago | 63 comments
Humanity's Last Exam(safe.ai) Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam, a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage.
Some Lessons from the OpenAI FrontierMath Debacle(lesswrong.com) Recently, OpenAI announced their newest model, o3, achieving massive improvements over state-of-the-art on reasoning and math. The highlight of the announcement was that o3 scored 25% on FrontierMath, a benchmark comprising hard, unseen math problems of which previous models could only solve 2%. The events afterward highlight that the announcements were, unknowingly, not made completely transparent and leave us with lessons for future AI benchmarks, evaluations, and safety.