Hacker News with Generative AI: Generative AI

The most underreported story in AI is that scaling has failed to produce AGI (fortune.com)
The most underreported and important story in AI right now is that pure scaling has failed to produce AGI.
Launch HN: Confident AI (YC W25) – Open-source evaluation framework for LLM apps (ycombinator.com)
Hi HN - we're Jeffrey and Kritin, and we're building Confident AI (https://confident-ai.com). This is the cloud platform for DeepEval (https://github.com/confident-ai/deepeval), our open-source package that helps engineers evaluate and unit-test LLM applications. Think Pytest for LLMs.
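A minimal sketch of the Pytest-style flow DeepEval documents, with an illustrative metric and threshold (exact names and defaults may differ across versions):

```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_refund_answer_relevancy():
    # Fails the test if the answer's relevancy score falls below the threshold.
    metric = AnswerRelevancyMetric(threshold=0.7)
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="You have 30 days to request a full refund at no extra cost.",
    )
    assert_test(test_case, [metric])
```

Running `deepeval test run test_refunds.py` then executes the case and reports the metric score.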
XentGame: Help Minimize LLM Surprise (xentlabs.ai)
Your goal is to write a prefix that most helps an LLM predict the given texts. The more your prefix helps the LLM predict the texts, the higher your score.
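Concretely, "surprise" here is cross-entropy. A hedged sketch of the scoring idea using a small Hugging Face model (the model choice and scoring details are assumptions, not XentLabs' actual implementation):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def mean_nll(text: str, prefix: str = "") -> float:
    """Average surprise (nats per token) the model assigns to `text`,
    optionally conditioned on `prefix`."""
    prefix_ids = (tok(prefix, return_tensors="pt").input_ids[0]
                  if prefix else torch.empty(0, dtype=torch.long))
    text_ids = tok(text, return_tensors="pt").input_ids[0]
    ids = torch.cat([prefix_ids, text_ids]).unsqueeze(0)
    labels = ids.clone()
    labels[0, :len(prefix_ids)] = -100  # score only the target text, not the prefix
    with torch.no_grad():
        loss = model(ids, labels=labels).loss  # mean NLL over scored tokens
    return loss.item()

text = "The mitochondria is the powerhouse of the cell."
# A positive difference means the prefix reduced the model's surprise.
print(mean_nll(text) - mean_nll(text, prefix="Biology flashcard: "))
```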
Grok 3: Another win for the bitter lesson (thealgorithmicbridge.com)
For once, it seems Elon Musk wasn’t exaggerating when he called Grok 3 the “smartest AI on Earth.” Grok 3 is a massive leap forward compared to Grok 2. (You can watch the full presentation here.)
Mistral's Le Chat tops 1M downloads in just 14 days (techcrunch.com)
A couple of weeks after the initial release of Mistral’s AI assistant, Le Chat, the company told Le Parisien that it has reached one million downloads.
Xbox pushes ahead with new generative AI (wired.com)
Microsoft is wading deeper into generative artificial intelligence for gaming with Muse, a new AI model announced today.
Large Language Diffusion Models (ml-gsai.github.io)
TL;DR: We introduce LLaDA, a diffusion model with an unprecedented 8B scale, trained entirely from scratch, rivaling LLaMA3 8B in performance.
AI Models Like GPT-4o Change Without Warning. Here's What You Can Do About It (libretto.ai)
As we move into a world where more and more of our software depends on large language models like GPT and Claude, we are increasingly hearing about the problem of “model drift”. Companies like OpenAI, Google, and Anthropic are constantly updating their deployed models in a ton of different ways. Most of the time, these updates don’t make much of a difference, but once in a while, they can absolutely torpedo one of your prompts.
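One common mitigation, sketched below with assumed details (the dated snapshot name and golden cases are illustrative): pin a dated model snapshot and regression-test your prompts against a small golden set, so a silent upstream change surfaces as a failing test.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set
MODEL = "gpt-4o-2024-08-06"  # a dated snapshot, not the floating "gpt-4o" alias

GOLDEN = [
    # (prompt, substring the reply must contain) -- illustrative cases only
    ("Reply with only the ISO 3166 code for Japan.", "JP"),
    ("Is 7919 prime? Answer yes or no.", "yes"),
]

def test_prompt_regressions():
    for prompt, expected in GOLDEN:
        reply = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        ).choices[0].message.content
        assert expected.lower() in reply.lower(), f"{prompt!r} -> {reply!r}"
```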
The cost of GPT-4-level intelligence fell 1000x in 18 months (twitter.com)
Can I ethically use LLMs? (ntietz.com)
The title is not a rhetorical question, and I'm not going to bury an answer. I don't have an answer. This post is my exploration of the question, and why I think it is a question.
Muse: Our first generative AI model designed for gameplay ideation (microsoft.com)
Today, the journal Nature (opens in new tab) is publishing our latest research, which introduces the first World and Human Action Model (WHAM). The WHAM, which we’ve named “Muse,” is a generative AI model of a video game that can generate game visuals, controller actions, or both.
Accelerating scientific breakthroughs with an AI co-scientist (research.google)
We introduce AI co-scientist, a multi-agent AI system built with Gemini 2.0 as a virtual scientific collaborator to help scientists generate novel hypotheses and research proposals, and to accelerate the clock speed of scientific and biomedical discoveries.
Andrej Karpathy: "I was given early access to Grok 3 earlier today" (twitter.com)
How to create LLM-driven tiny gnome robots? (ycombinator.com)
You know, like in the movie "The Borrowers". Since we have human-sized bipedal robots now, it shouldn't be impossible to create miniature robots that would act like living beings, right?
The Generative AI Con (wheresyoured.at)
It's been just over two years and two months since ChatGPT launched, and in that time we've seen Large Language Models (LLMs) blossom from a novel concept into one of the most craven cons of the 21st century — a cynical bubble inflated by OpenAI CEO Sam Altman built to sell into an economy run by people that have no concept of labor other than their desperation to exploit or replace it.
Reddit mods are fighting to keep AI slop off subreddits. They could use help (arstechnica.com)
Mods ask Reddit for tools as generative AI gets more popular and inconspicuous.
ZeroBench: An Impossible* Visual Benchmark for Contemporary Multimodal Models (zerobench.github.io)
Contemporary LMMs often exhibit remarkable performance on existing visual benchmarks, yet closer inspection reveals persistent shortcomings in their ability to interpret and reason about visual content. Many existing benchmarks tend to become saturated, losing their value as effective measures of the true visual understanding capabilities of frontier models.
Step-Video-T2V: The Practice, Challenges, and Future of Video Foundation Model (arxiv.org)
We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length.
ZeroBench: An Impossible Visual Benchmark for Contemporary LMMs (arxiv.org)
Large Multimodal Models (LMMs) exhibit major shortfalls when interpreting images and, by some measures, have poorer spatial cognition than small children or animals.
To avoid being replaced by LLMs, do what they can't (seangoedecke.com)
It’s a strange time to be a software engineer. Large language models are very good at writing code and rapidly getting better. Multiple multi-billion dollar attempts are currently being made to develop a pure-AI software engineer. The rough strategy - put a reasoning model in a loop with tools - is well-known and (in my view) seems likely to work. What should we software engineers do to prepare for what’s coming down the line?
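That strategy is compact enough to sketch. Everything below (the tool registry, message format, and stop condition) is an illustrative assumption, and llm() is a stand-in for any chat-completion API:

```python
import json
import subprocess

# Illustrative tool registry; real coding agents expose many more tools.
TOOLS = {
    "run_tests": lambda args: subprocess.run(
        ["pytest", "-q"], capture_output=True, text=True).stdout,
    "read_file": lambda args: open(args["path"]).read(),
}

def llm(messages: list[dict]) -> dict:
    """Stand-in for a real model call; expected to return either
    {"final": True, "content": ...} or {"tool": name, "args": {...}}."""
    raise NotImplementedError("wire up a model provider here")

def agent_loop(task: str, max_steps: int = 20) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = llm(messages)
        if action.get("final"):  # the model decided it is done
            return action["content"]
        result = TOOLS[action["tool"]](action["args"])  # run the requested tool
        messages.append(
            {"role": "tool", "content": json.dumps({action["tool"]: result})})
    return "step budget exhausted"
```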
OmniParser V2 – A simple screen parsing tool towards pure vision based GUI agent (github.com/microsoft)
OmniParser is a comprehensive method for parsing user interface screenshots into structured and easy-to-understand elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface.
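A hypothetical sketch of what such structured elements might look like once serialized into an agent prompt; the types and field names below are illustrative, not OmniParser's actual output schema:

```python
from dataclasses import dataclass

@dataclass
class UIElement:
    label: str                       # e.g. "Submit button"
    box: tuple[int, int, int, int]   # x1, y1, x2, y2 in pixels
    interactable: bool

def to_prompt(elements: list[UIElement]) -> str:
    """Serialize parsed elements so a vision-language agent can ground
    actions like "click [1]" in concrete screen regions."""
    lines = [f"[{i}] {e.label} at {e.box}" +
             (" (clickable)" if e.interactable else "")
             for i, e in enumerate(elements)]
    return "Screen elements:\n" + "\n".join(lines)

print(to_prompt([UIElement("Search box", (40, 12, 600, 48), True),
                 UIElement("Settings icon", (610, 12, 650, 48), True)]))
```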
The Impact of Generative AI on Critical Thinking [pdf] (microsoft.com)
Researchers from Microsoft and Carnegie Mellon University warn that the more you use AI, the more your cognitive abilities deteriorate.
If you believe in "Artificial Intelligence", take five minutes to ask it (svpow.com)
If you believe in “Artificial Intelligence”, take five minutes to ask it about stuff you know well
Ask HN: Who's regularly using LLMs at work? (ycombinator.com)
For those of you that are, what do you do and what are you using them for?
Gary Marcus discusses AI's technical problems (cacm.acm.org)
In an age of breathless predictions and sky-high valuations, cognitive scientist Gary Marcus has emerged as one of the best-known skeptics of generative artificial intelligence (AI). In fact, he recently wrote a book about his concerns, Taming Silicon Valley, in which he made the case that “we are not on the best path right now, either technically or morally.”
Diffusion Without Tears (notion.site)
Anthropic's next major AI model could arrive within weeks (techcrunch.com)
AI startup Anthropic is gearing up to release its next major AI model, according to a report Thursday from The Information.
Evaluating RAG for large scale codebases (qodo.ai)
In a previous post, we introduced our approach to building a RAG-based system—designed to power generative AI coding assistants with the essential context needed to complete tasks and enhance code quality in large scale enterprise environments.
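The retrieval core of such a system is simple to sketch; embed() below is a placeholder for any embedding model, and the chunking and ranking details are assumptions rather than Qodo's documented approach:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError("plug in an embedding model here")

def top_k(query: str, chunks: list[str], k: int = 5) -> list[str]:
    """Return the k code chunks most similar to the query, by cosine similarity,
    to be placed in the coding assistant's context window."""
    q = embed(query)
    vecs = np.stack([embed(c) for c in chunks])
    sims = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]
```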
Ask HN: Do you think AI is being hyped by "Surprisers"? (ycombinator.com)
Hi HN, there is an internet meme in Japan called "Odorokiya (驚き屋)" (roughly, "surprise merchants"): people who act overly surprised by each OpenAI release and overstate the usefulness of AI.