Hacker News with Generative AI: Reinforcement Learning

The State of Reinforcement Learning for LLM Reasoning (sebastianraschka.com)
A lot has happened this month, especially with the releases of new flagship models like GPT-4.5 and Llama 4. But you might have noticed that reactions to these releases were relatively muted. Why? One reason could be that GPT-4.5 and Llama 4 remain conventional models, which means they were trained without explicit reinforcement learning for reasoning.
Does RL Incentivize Reasoning in LLMs Beyond the Base Model? (limit-of-rlvr.github.io)
Recent breakthroughs in reasoning-focused large language models (LLMs) like OpenAI-o1, DeepSeek-R1, and Kimi-1.5 have largely relied on Reinforcement Learning with Verifiable Rewards (RLVR), which replaces human annotations with automated rewards (e.g., verified math solutions or passing code tests) to scale self-improvement. While RLVR enhances reasoning behaviors such as self-reflection and iterative refinement, we challenge a core assumption:
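To make the RLVR idea concrete (automated, checkable rewards in place of human annotation), here is a minimal sketch of a verifiable reward for a math task with a known final answer. The boxed-answer convention and function name are assumptions for illustration, not any specific paper's implementation:

```python
import re

def verifiable_reward(completion: str, reference_answer: str) -> float:
    """Illustrative RLVR-style reward: 1.0 if the model's final boxed
    answer matches the reference exactly, else 0.0. No human judge needed."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

# Usage: score sampled completions, then feed the rewards to the RL update.
print(verifiable_reward(r"... so the result is \boxed{42}", "42"))  # 1.0
```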
Welcome to the Era of Experience [pdf] (googleapis.com)
Skywork-OR1: new SOTA 32B thinking model with open weights (github.com/SkyworkAI)
✊ Unleashing the Power of Reinforcement Learning for Math and Code Reasoners 🤖
DeepCoder: An Open-Source 14B Coder at O3-Mini Level (together.ai)
Through a joint collaboration between the Agentica team and Together AI, we release DeepCoder-14B-Preview, a code reasoning model fine-tuned from DeepSeek-R1-Distilled-Qwen-14B via distributed RL. It achieves an impressive 60.6% Pass@1 accuracy on LiveCodeBench (+8% improvement), matching the performance of o3-mini-2025-01-31 (Low) and o1-2024-12-17 with just 14B parameters. We’ve open-sourced our dataset, code, training logs, and systems optimizations for everyone to progress on scaling and accelerating intelligence with RL.
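For reference, Pass@1 with multiple samples per problem is commonly estimated with the unbiased pass@k estimator of Chen et al. (2021); whether DeepCoder uses exactly this estimator is an assumption here. A quick sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    n = samples drawn per problem, c = samples that passed the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# For k=1 this reduces to the plain pass rate c/n:
print(pass_at_k(8, 5, 1))  # 0.625
```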
Can reinforcement learning for LLMs scale beyond math and coding tasks? Probably (arxiv.org)
Reinforcement learning with verifiable rewards (RLVR) has demonstrated significant success in enhancing mathematical reasoning and coding performance of large language models (LLMs), especially when structured reference answers are accessible for verification.
DeepSeek: Inference-Time Scaling for Generalist Reward Modeling (arxiv.org)
Reinforcement learning (RL) has been widely adopted in post-training for large language models (LLMs) at scale.
Search-R1: Training LLMs to Reason and Leverage Search Engines with RL (arxiv.org)
Efficiently acquiring external knowledge and up-to-date information is essential for effective reasoning and text generation in large language models (LLMs).
Scaling Up Reinforcement Learning for Traffic Smoothing (bair.berkeley.edu)
We deployed 100 reinforcement learning (RL)-controlled cars into rush-hour highway traffic to smooth congestion and reduce fuel consumption for everyone.
Launch HN: Augento (YC W25) – Fine-tune your agents with reinforcement learning (ycombinator.com)
Hi HN, we’re the cofounders of Augento (https://augento.ai/). We’re building Deepseek R1-like fine-tuning as a service. You connect your agent, tell us when it’s right or wrong, and we deliver an LLM optimized for that agent.
I Built Faster Reinforcement Learning in C# Solo Than Teams Did with Python (rlmatrix.com)
The question comes relentlessly: “Why build reinforcement learning in C#?” Behind this query lies an unspoken assumption that serious machine learning happens exclusively in Python. This perspective reveals a fundamental disconnect between academic ML researchers with their sprawling Python scripts and those of us solving real industrial problems.
A (Long) Peek into Reinforcement Learning (lilianweng.github.io)
Several exciting things have happened in Artificial Intelligence (AI) in recent years. AlphaGo defeated the best professional human player in the game of Go. Soon after, the extended algorithm AlphaGo Zero beat AlphaGo 100-0 without any supervised learning on human knowledge. Top professional game players lost to the bot developed by OpenAI in DOTA2 1v1 competition. Knowing all this, it is hard not to be curious about the magic behind these algorithms: Reinforcement Learning (RL).
Understanding R1-Zero-Like Training: A Critical Perspective (github.com/sail-sg)
To understand R1-Zero-like training, we critically examine two core components: base models and reinforcement learning. We highlight our findings below.
Hunyuan T1 Mamba Reasoning model beats R1 on speed and metrics (tencent.github.io)
Reinforcement learning has pioneered a new scaling paradigm in the post-training phase of large language models, a breakthrough that is attracting growing attention from industry.
Legged Locomotion Meets Skateboarding (umich-curly.github.io)
This paper introduces Discrete-time Hybrid Automata Learning (DHAL), a framework using on-policy Reinforcement Learning to identify and execute mode-switching without trajectory segmentation or event function learning.
Mathematical Foundations of Reinforcement Learning (github.com/MathFoundationRL)
This textbook has received 5,000+ stars! Glad that it is helpful to many readers.
Reinforcement Learning in less than 400 lines of C (github.com/antirez)
This code implements a neural network that learns to play tic-tac-toe through reinforcement learning, simply by playing against a random adversary, in under 400 lines of C without any external libraries.
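As a language-neutral sketch of the same training-loop idea (tabular Monte-Carlo value updates rather than antirez's neural net; the reward scheme below is an assumption): play whole games against a random opponent, then nudge the value of every visited state toward the final outcome.

```python
ALPHA = 0.1   # learning rate
values = {}   # board position (tuple of 9 cells) -> estimated value for the agent

def update_episode(visited_states, outcome):
    """Monte-Carlo update: outcome is +1 for an agent win, -1 for a loss, 0 for a draw."""
    for s in visited_states:
        v = values.get(s, 0.0)
        values[s] = v + ALPHA * (outcome - v)

# After one won game that passed through a single (toy) position:
update_episode([("X", "", "", "", "O", "", "", "", "X")], +1)
print(values)  # that position's value moves from 0.0 toward +1
```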
Show HN: Llama-8B Teaches Itself Baby Steps to Deep Research Using RL (github.com/dCaples)
Autonomously train research-agent LLMs on custom data using reinforcement learning and self-verification.
All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning (arxiv.org)
From a first-principles perspective, it may seem odd that the strongest results in foundation model fine-tuning (FT) are achieved via a relatively complex, two-stage training procedure.
Using GRPO to Beat o1, o3-mini and R1 at “Temporal Clue” (openpipe.ai)
In this post we’ll discuss how we used Group Relative Policy Optimization (GRPO) to surpass R1, o1, and o3-mini, and come within a couple of percentage points of Sonnet 3.7, on a reasoning-heavy game called “Temporal Clue”, while being over 100x cheaper to run at inference time. We’ll include specific lessons learned about task design and the hyperparameters we’ve found to work well. And finally, we share the training recipe we used to achieve these results, built on top of torchtune.
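A minimal sketch of the group-relative advantage at the heart of GRPO (the PPO-style clipped policy loss that consumes these advantages is omitted, and the example reward values are made up): sample a group of completions per prompt, then use each completion's z-scored reward as its advantage, with no learned value function.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """GRPO core idea: score a group of completions for the same prompt,
    then use the z-scored reward of each completion as its advantage.
    rewards: shape (group_size,), one scalar reward per sampled completion."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 completions for one puzzle, reward = fraction of clues solved.
print(grpo_advantages(torch.tensor([0.0, 0.25, 1.0, 0.75])))
```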
QwQ-32B: Embracing the Power of Reinforcement Learning (qwenlm.github.io)
Scaling Reinforcement Learning (RL) has the potential to enhance model performance beyond conventional pretraining and post-training methods.
RoboPianist: Dexterous Piano Playing with Deep Reinforcement Learning (2023) (kzakka.com)
We train anthropomorphic robot hands to play the piano using deep RL and release a simulated benchmark and dataset to advance high-dimensional control.
Competitive Programming with Large Reasoning Models (arxiv.org)
We show that reinforcement learning applied to large language models (LLMs) significantly boosts performance on complex coding and reasoning tasks.
DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL (notion.site)
DeepScaleR is an open-source project to fully democratize reinforcement learning (RL) for LLMs and reproduce DeepSeek R1 and OpenAI O1/O3 at scale on real tasks.
Craftax: (Crafter and NetHack) RL Environment in Jax (github.com/MichaelTMatthews)
Craftax is an RL environment written entirely in JAX. Craftax reimplements and significantly extends the game mechanics of Crafter, taking inspiration from roguelike games such as NetHack.
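The exact Craftax API is best taken from the repo; as a toy illustration of why a pure-JAX environment matters, the sketch below (with entirely hypothetical dynamics and reward) shows environment steps being jit-compiled and vmapped across thousands of parallel instances:

```python
import jax
import jax.numpy as jnp

# Toy stand-in for a pure-JAX environment step (Craftax's real API differs).
# Because the step is a pure function, it can be jit-compiled and vmapped
# across thousands of parallel environments on a single accelerator.
def toy_step(state: jnp.ndarray, action: jnp.ndarray):
    new_state = state + action          # hypothetical dynamics
    reward = -jnp.abs(new_state).sum()  # hypothetical reward
    return new_state, reward

batched_step = jax.jit(jax.vmap(toy_step))
states = jnp.zeros((4096, 8))           # 4096 parallel environments
actions = jnp.ones((4096, 8))
states, rewards = batched_step(states, actions)
print(rewards.shape)  # (4096,)
```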
The Differences Between Direct Alignment Algorithms Are a Blur (arxiv.org)
Direct Alignment Algorithms (DAAs) simplify language model alignment by replacing reinforcement learning (RL) and reward modeling (RM) in Reinforcement Learning from Human Feedback (RLHF) with direct policy optimization.
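For concreteness, DPO is the canonical DAA. A minimal sketch of its loss, assuming per-response summed token log-probabilities are already computed (the tensor values in the usage line are made up):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta: float = 0.1):
    """Direct Preference Optimization: instead of fitting a reward model and
    running RL, maximize the margin between the policy's and the reference
    model's log-probabilities on preferred vs. dispreferred responses.
    Inputs are summed token log-probs per response."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()

# Example with made-up log-probabilities for one preference pair:
print(dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
               torch.tensor([-13.0]), torch.tensor([-14.0])))
```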
There may not be an aha moment in R1-Zero-like training (notion.site)
R1 Computer Use (github.com/agentsea)
r1-computer-use is an experimental project that applies large-scale Reinforcement Learning techniques similar to DeepSeek-R1 to computer usage scenarios.