Hacker News with Generative AI: Reinforcement Learning

Outcome-Based Reinforcement Learning to Predict the Future (arxiv.org)
Reinforcement learning with verifiable rewards (RLVR) has boosted math and coding in large language models, yet there has been little effort to extend RLVR into messier, real-world domains like forecasting.
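The "verifiable reward" in RLVR is simply a programmatic check on the model's output. A minimal sketch of two such reward functions (the `####` answer-marker convention and the test-case format are illustrative assumptions, not taken from the paper):

```python
def math_reward(completion: str, reference_answer: str) -> float:
    """Verifiable reward for a math task: 1.0 if the model's final
    answer matches the reference exactly, else 0.0."""
    # Assumes the final answer is marked like "#### 42" (a common
    # convention, not the linked paper's format).
    final = completion.rsplit("####", 1)[-1].strip()
    return 1.0 if final == reference_answer.strip() else 0.0

def code_reward(candidate_fn, test_cases) -> float:
    """Verifiable reward for a coding task: fraction of unit tests passed."""
    passed = 0
    for args, expected in test_cases:
        try:
            if candidate_fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing candidate earns no credit for this case
    return passed / len(test_cases)
```

Because these checks are automatic, they scale without human annotation, which is exactly what makes domains without a checkable ground truth (like forecasting) harder to reach.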
Reinforcement Learning for Symbolic Mathematics (arxiv.org)
Deep Symbolic Optimization (DSO) is a novel computational framework that enables symbolic optimization for scientific discovery, particularly in applications involving the search for intricate symbolic structures.
Improving Assembly Code Performance with LLMs via Reinforcement Learning (arxiv.org)
Large language models (LLMs) have demonstrated strong performance across a wide range of programming tasks, yet their potential for code optimization remains underexplored.
Absolute Zero: Reinforced Self-Play Reasoning with Zero Data (arxiv.org)
To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without relying on any external data.
Absolute Zero Reasoner (andrewzh112.github.io)
Current reasoning models trained with Reinforcement Learning with Verifiable Rewards (RLVR) often rely on manually curated datasets, raising scalability concerns and potentially limiting future AI growth beyond human-defined tasks.
Sutton and Barto book implementation (github.com/ivanbelenky)
This repository contains code that implements algorithms and models from Sutton's book on reinforcement learning.
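For a taste of the book's material, here is a minimal tabular Q-learning loop (the algorithm from Sutton & Barto, ch. 6) in Python; the tiny Gym-style `env` interface (`reset()`, `step()`, `actions`) is an assumption for this sketch, not the repository's API:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=300, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration.
    `env` follows a minimal Gym-style interface: reset() -> state,
    step(action) -> (next_state, reward, done)."""
    Q = defaultdict(float)  # Q[(state, action)], default 0.0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda a: Q[(s, a)])
            s2, r, done = env.step(a)
            # Off-policy TD target: bootstrap from the best next action
            best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```

On a toy chain environment (move left or right, reward 1 at the goal), the learned Q-values quickly favor stepping toward the goal.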
Show HN: ART – a new open-source RL framework for training agents (github.com/OpenPipe)
ART is an open-source reinforcement training library for improving LLM performance in agentic workflows.
The State of Reinforcement Learning for LLM Reasoning (sebastianraschka.com)
A lot has happened this month, especially with the releases of new flagship models like GPT-4.5 and Llama 4. But you might have noticed that reactions to these releases were relatively muted. Why? One reason could be that GPT-4.5 and Llama 4 remain conventional models, which means they were trained without explicit reinforcement learning for reasoning.
Does RL Incentivize Reasoning in LLMs Beyond the Base Model? (limit-of-rlvr.github.io)
Recent breakthroughs in reasoning-focused large language models (LLMs) like OpenAI-o1, DeepSeek-R1, and Kimi-1.5 have largely relied on Reinforcement Learning with Verifiable Rewards (RLVR), which replaces human annotations with automated rewards (e.g., verified math solutions or passing code tests) to scale self-improvement. While RLVR enhances reasoning behaviors such as self-reflection and iterative refinement, we challenge a core assumption:
Welcome to the Era of Experience [pdf] (googleapis.com)
Skywork-OR1: new SOTA 32B thinking model with open weight (github.com/SkyworkAI)
✊ Unleashing the Power of Reinforcement Learning for Math and Code Reasoners 🤖
DeepCoder: An Open-Source 14B Coder at O3-Mini Level (together.ai)
Through a joint collaboration between the Agentica team and Together AI, we release DeepCoder-14B-Preview, a code reasoning model finetuned from Deepseek-R1-Distilled-Qwen-14B via distributed RL. It achieves an impressive 60.6% Pass@1 accuracy on LiveCodeBench (+8% improvement), matching the performance of o3-mini-2025-01-31 (Low) and o1-2024-12-17 with just 14B parameters. We’ve open-sourced our dataset, code, training logs, and systems optimizations for everyone to progress on scaling and accelerating intelligence with RL.
Can reinforcement learning for LLMs scale beyond math and coding tasks? Probably (arxiv.org)
Reinforcement learning with verifiable rewards (RLVR) has demonstrated significant success in enhancing mathematical reasoning and coding performance of large language models (LLMs), especially when structured reference answers are accessible for verification.
DeepSeek: Inference-Time Scaling for Generalist Reward Modeling (arxiv.org)
Reinforcement learning (RL) has been widely adopted in post-training for large language models (LLMs) at scale.
Search-R1: Training LLMs to Reason and Leverage Search Engines with RL (arxiv.org)
Efficiently acquiring external knowledge and up-to-date information is essential for effective reasoning and text generation in large language models (LLMs).
Scaling Up Reinforcement Learning for Traffic Smoothing (bair.berkeley.edu)
We deployed 100 reinforcement learning (RL)-controlled cars into rush-hour highway traffic to smooth congestion and reduce fuel consumption for everyone.
Launch HN: Augento (YC W25) – Fine-tune your agents with reinforcement learning (ycombinator.com)
Hi HN, we’re the cofounders of Augento (https://augento.ai/). We’re building Deepseek R1-like fine-tuning as a service. You connect your agent, tell us when it’s right or wrong, and we deliver an LLM optimized for that agent.
I Built Faster Reinforcement Learning in C# Solo Than Teams Did with Python (rlmatrix.com)
The question comes relentlessly: “Why build reinforcement learning in C#?” Behind this query lies an unspoken assumption that serious machine learning happens exclusively in Python. This perspective reveals a fundamental disconnect between academic ML researchers with their sprawling Python scripts and those of us solving real industrial problems.
A (Long) Peek into Reinforcement Learning (lilianweng.github.io)
Several exciting developments in Artificial Intelligence (AI) have happened in recent years. AlphaGo defeated the best professional human player in the game of Go. Soon after, the extended algorithm AlphaGo Zero beat AlphaGo 100-0 without supervised learning on human knowledge. Top professional game players lost to the bot developed by OpenAI in DOTA2 1v1 competition. Knowing all this, it is hard not to be curious about the magic behind these algorithms: Reinforcement Learning (RL).
Understanding R1-Zero-Like Training: A Critical Perspective (github.com/sail-sg)
To understand R1-Zero-like training, we critically examine two core components: base models and reinforcement learning. We highlight our findings below.
Hunyuan T1 Mamba Reasoning model beats R1 on speed and metrics (tencent.github.io)
Reinforcement learning has pioneered a new scaling paradigm in the post-training phase of large language models, a breakthrough that is attracting increasing attention from industry.
Legged Locomotion Meets Skateboarding (umich-curly.github.io)
This paper introduces Discrete-time Hybrid Automata Learning (DHAL), a framework using on-policy Reinforcement Learning to identify and execute mode-switching without trajectory segmentation or event function learning.
Mathematical Foundations of Reinforcement Learning (github.com/MathFoundationRL)
This textbook has received 5,000+ stars! Glad that it is helpful to many readers.
Reinforcement Learning in less than 400 lines of C (github.com/antirez)
This code implements a neural network that learns to play tic-tac-toe via reinforcement learning, playing only against a random adversary, in under 400 lines of C, without using any external library.
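The repository itself is plain C with a neural network; as a language-agnostic sketch of the same setup, here is the classic tabular variant: learn a value table over tic-tac-toe positions by Monte-Carlo updates while playing against a random opponent (this mirrors the setup described above but is not the repo's code):

```python
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X', 'O', 'draw', or None if the game is still going."""
    for i, j, k in LINES:
        if board[i] != " " and board[i] == board[j] == board[k]:
            return board[i]
    return "draw" if " " not in board else None

def after_state(board, move):
    """Board (as a hashable key) after X plays `move`."""
    b = board[:]
    b[move] = "X"
    return tuple(b)

def play_episode(V, epsilon=0.1):
    """X (the learner) vs a uniformly random O. Returns the list of
    X's after-states and the final reward from X's perspective."""
    board, states, player = [" "] * 9, [], "X"
    while True:
        moves = [i for i in range(9) if board[i] == " "]
        if player == "X" and random.random() > epsilon:
            # Greedy over learned after-state values (0.5 = unknown)
            move = max(moves, key=lambda i: V.get(after_state(board, i), 0.5))
        else:
            move = random.choice(moves)
        board[move] = player
        if player == "X":
            states.append(tuple(board))
        result = winner(board)
        if result:
            return states, {"X": 1.0, "O": 0.0, "draw": 0.5}[result]
        player = "O" if player == "X" else "X"

def train(episodes=3000, alpha=0.2):
    """Monte-Carlo value learning: pull every visited after-state
    toward the episode's final outcome."""
    V = {}
    for _ in range(episodes):
        states, reward = play_episode(V)
        for s in states:
            v = V.get(s, 0.5)
            V[s] = v + alpha * (reward - v)
    return V
```

The signal here is the same as in the C version: only the game outcome is rewarded, and credit flows back to the moves that produced it.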
Show HN: Llama-8B Teaches Itself Baby Steps to Deep Research Using RL (github.com/dCaples)
Autonomously train research-agent LLMs on custom data using reinforcement learning and self-verification.
All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning (arxiv.org)
From a first-principles perspective, it may seem odd that the strongest results in foundation model fine-tuning (FT) are achieved via a relatively complex, two-stage training procedure.
Using GRPO to Beat o1, o3-mini and R1 at “Temporal Clue” (openpipe.ai)
In this post we’ll discuss how we used Group Relative Policy Optimization (GRPO) to surpass R1, o1, and o3-mini, and come within a couple of percentage points of Sonnet 3.7 on a reasoning-heavy game called “Temporal Clue”, while being over 100x cheaper to run at inference time. We’ll include specific lessons about task design and hyperparameters we’ve found to work well. And finally, we share the training recipe we used to achieve these results, built on top of torchtune.
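GRPO's central trick is to drop the learned value baseline of PPO and instead normalize each sampled completion's reward against the other completions drawn for the same prompt. A minimal sketch of that advantage computation (not OpenPipe's training code):

```python
def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages: z-score each completion's reward
    against the other completions sampled for the same prompt."""
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    var = sum((r - mean) ** 2 for r in group_rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in group_rewards]
```

For example, a group scored [1, 0, 0, 1] yields advantages of roughly [1, -1, -1, 1]: the winning completions are reinforced, the losers pushed down, with no critic network needed.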
QwQ-32B: Embracing the Power of Reinforcement Learning (qwenlm.github.io)
Scaling Reinforcement Learning (RL) has the potential to enhance model performance beyond conventional pretraining and post-training methods.
RoboPianist: Dexterous Piano Playing with Deep Reinforcement Learning (2023) (kzakka.com)
We train anthropomorphic robot hands to play the piano using deep RL and release a simulated benchmark and dataset to advance high-dimensional control.