Hacker News with Generative AI: Reinforcement Learning

Launch HN: Augento (YC W25) – Fine-tune your agents with reinforcement learning (ycombinator.com)
Hi HN, we’re the cofounders of Augento (https://augento.ai/). We’re building Deepseek R1-like fine-tuning as a service. You connect your agent, tell us when it’s right or wrong, and we deliver an LLM optimized for that agent.
I Built Faster Reinforcement Learning in C# Solo Than Teams Did with Python (rlmatrix.com)
The question comes relentlessly: “Why build reinforcement learning in C#?” Behind this query lies an unspoken assumption that serious machine learning happens exclusively in Python. This perspective reveals a fundamental disconnect between academic ML researchers with their sprawling Python scripts and those of us solving real industrial problems.
A (Long) Peek into Reinforcement Learning (lilianweng.github.io)
Several exciting developments in Artificial Intelligence (AI) have happened in recent years. AlphaGo defeated the best professional human player in the game of Go. Soon after, the extended algorithm AlphaGo Zero beat AlphaGo 100-0 without any supervised learning on human knowledge. Top professional players lost to the bot developed by OpenAI in a DOTA2 1v1 competition. Knowing this, it is hard not to be curious about the magic behind these algorithms: Reinforcement Learning (RL).
Understanding R1-Zero-Like Training: A Critical Perspective (github.com/sail-sg)
To understand R1-Zero-like training, we critically examine two core components: base models and reinforcement learning. We highlight our findings below.
Hunyuan T1 Mamba Reasoning model beats R1 on speed and metrics (tencent.github.io)
Reinforcement learning has pioneered a new scaling paradigm in the post-training phase of large language models, a breakthrough that is attracting increasing attention from industry.
Legged Locomotion Meets Skateboarding (umich-curly.github.io)
This paper introduces Discrete-time Hybrid Automata Learning (DHAL), a framework using on-policy Reinforcement Learning to identify and execute mode-switching without trajectory segmentation or event function learning.
Mathematical Foundations of Reinforcement Learning (github.com/MathFoundationRL)
This textbook has received 5,000+ stars! Glad that it is helpful to many readers.
Reinforcement Learning in less than 400 lines of C (github.com/antirez)
This code implements a neural network that learns to play tic-tac-toe using reinforcement learning, just by playing against a random adversary, in under 400 lines of C, without using any external libraries.
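As a rough illustration of the same idea in a higher-level language (this is not antirez's C code, which trains a small neural network), here is a tabular agent that learns tic-tac-toe purely from the outcome of games against a random opponent, using a simple Monte-Carlo-style backup of the final result:

```python
# Minimal sketch: tabular value learning for tic-tac-toe against a random opponent.
import random
from collections import defaultdict

Q = defaultdict(float)            # Q[(board, move)] -> estimated value for the agent ('X')
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

def winner(b):
    lines = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]
    for i, j, k in lines:
        if b[i] != '.' and b[i] == b[j] == b[k]:
            return b[i]
    return 'draw' if '.' not in b else None

def moves(b):
    return [i for i, c in enumerate(b) if c == '.']

def choose(b):
    # epsilon-greedy over the learned action values
    if random.random() < EPSILON:
        return random.choice(moves(b))
    return max(moves(b), key=lambda m: Q[(b, m)])

def play_episode():
    b, history = '.' * 9, []                   # history of the agent's (state, action) pairs
    while True:
        m = choose(b)
        history.append((b, m))
        b = b[:m] + 'X' + b[m+1:]
        w = winner(b)
        if w is None:                          # random opponent ('O') replies
            o = random.choice(moves(b))
            b = b[:o] + 'O' + b[o+1:]
            w = winner(b)
        if w is not None:
            reward = 1.0 if w == 'X' else (0.0 if w == 'draw' else -1.0)
            # back up the final game outcome through the visited (state, action) pairs
            target = reward
            for s, a in reversed(history):
                Q[(s, a)] += ALPHA * (target - Q[(s, a)])
                target *= GAMMA
            return w

wins = sum(play_episode() == 'X' for _ in range(20000))
print(f"wins against random opponent: {wins}/20000")
```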
Show HN: Llama-8B Teaches Itself Baby Steps to Deep Research Using RL (github.com/dCaples)
Autonomously train research-agent LLMs on custom data using reinforcement learning and self-verification.
All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning (arxiv.org)
From a first-principles perspective, it may seem odd that the strongest results in foundation model fine-tuning (FT) are achieved via a relatively complex, two-stage training procedure.
Using GRPO to Beat o1, o3-mini and R1 at “Temporal Clue” (openpipe.ai)
In this post we’ll discuss how we used Group Relative Policy Optimization (GRPO) to surpass R1, o1, and o3-mini, and come within a couple of percentage points of Sonnet 3.7 on a reasoning-heavy game called “Temporal Clue”, while being over 100x cheaper to run at inference time. We’ll include specific lessons learned about task design and hyperparameters we’ve found to work well, and finally we share the training recipe we used to achieve these results, built on top of torchtune.
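The post goes into task design and hyperparameters; the core mechanism named in the title is GRPO's group-relative advantage, which replaces a learned value baseline with within-group reward normalization. A minimal sketch of that normalization step (not OpenPipe's training code; clipping, the KL term, and the reference policy are omitted):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """rewards: array of shape (num_prompts, group_size) of scalar scores."""
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)   # each completion scored relative to its group

# e.g. 2 prompts x 4 sampled completions, rewarded 1 if the puzzle answer is correct
rewards = [[1, 0, 0, 1],
           [0, 0, 0, 1]]
print(group_relative_advantages(rewards))
```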
QwQ-32B: Embracing the Power of Reinforcement Learning (qwenlm.github.io)
Scaling Reinforcement Learning (RL) has the potential to enhance model performance beyond conventional pretraining and post-training methods.
RoboPianist: Dexterous Piano Playing with Deep Reinforcement Learning (2023) (kzakka.com)
We train anthropomorphic robot hands to play the piano using deep RL and release a simulated benchmark and dataset to advance high-dimensional control.
Competitive Programming with Large Reasoning Models (arxiv.org)
We show that reinforcement learning applied to large language models (LLMs) significantly boosts performance on complex coding and reasoning tasks.
DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL (notion.site)
DeepScaleR is an open-source project to fully democratize reinforcement learning (RL) for LLMs and reproduce DeepSeek R1 and OpenAI O1/O3 at scale on real tasks.
Craftax: (Crafter and NetHack) RL Environment in Jax (github.com/MichaelTMatthews)
Craftax is an RL environment written entirely in JAX. Craftax reimplements and significantly extends the game mechanics of Crafter, taking inspiration from roguelike games such as NetHack.
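For readers unfamiliar with the appeal of JAX-native environments: when the dynamics are a pure function of (state, action, rng), the step can be jit-compiled and vmapped across thousands of parallel environments. The sketch below shows that general pattern only; the names and toy dynamics are illustrative and not Craftax's actual API.

```python
import jax
import jax.numpy as jnp

def step(state, action, key):
    # toy dynamics: state is a position, the action nudges it, reward is -|position|
    new_state = state + jnp.where(action == 1, 1.0, -1.0) + 0.1 * jax.random.normal(key)
    reward = -jnp.abs(new_state)
    return new_state, reward

batched_step = jax.jit(jax.vmap(step))          # one compiled step for many envs at once

keys = jax.random.split(jax.random.PRNGKey(0), 4096)
states = jnp.zeros(4096)
actions = jnp.ones(4096, dtype=jnp.int32)
states, rewards = batched_step(states, actions, keys)
print(states.shape, rewards.mean())
```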
The Differences Between Direct Alignment Algorithms Are a Blur (arxiv.org)
Direct Alignment Algorithms (DAAs) simplify language model alignment by replacing reinforcement learning (RL) and reward modeling (RM) in Reinforcement Learning from Human Feedback (RLHF) with direct policy optimization.
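For context, the best-known DAA is DPO, which collapses reward modeling and the RL step into a single supervised objective on preference pairs. A minimal sketch of that loss, assuming per-sequence log-probabilities from the policy and a frozen reference model are already computed (illustrative only, not the paper's code):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # logistic loss on the policy's log-probability margin between the chosen
    # and rejected responses, measured against the frozen reference model
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# toy example with two preference pairs
loss = dpo_loss(torch.tensor([-4.0, -6.0]), torch.tensor([-7.0, -6.5]),
                torch.tensor([-5.0, -6.0]), torch.tensor([-6.0, -6.0]))
print(loss.item())
```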
There may not be an aha moment in R1-Zero-like training (notion.site)
R1 Computer Use (github.com/agentsea)
r1-computer-use is an experimental project that applies large-scale Reinforcement Learning techniques similar to DeepSeek-R1 to computer usage scenarios.
OSS reinforcement learning lib by ByteDance is used to reproduce DeepSeek R1 (github.com/volcengine)
verl is a flexible, efficient and production-ready RL training library for large language models (LLMs).
Deep Reinforcement Learning: Pong from Pixels (2016) (karpathy.github.io)
This is a long overdue blog post on Reinforcement Learning (RL). RL is hot! You may have noticed that computers can now automatically learn to play ATARI games (from raw game pixels!), they are beating world champions at Go, simulated quadrupeds are learning to run and leap, and robots are learning how to perform complex manipulation tasks that defy explicit programming.
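The method the post builds up to is vanilla policy gradients (REINFORCE). A small sketch of its credit-assignment step, computing discounted returns that then weight each action's log-probability gradient (the idea only, not the post's full Pong script):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    returns = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if rewards[t] != 0:           # Pong convention: a nonzero reward ends a rally
            running = 0.0
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# one fake episode: nothing happens for a while, then we lose a point
rewards = np.array([0, 0, 0, 0, -1], dtype=np.float64)
adv = discounted_returns(rewards)
adv = (adv - adv.mean()) / (adv.std() + 1e-8)   # standardize to reduce gradient variance
print(adv)   # every action before the lost point gets blamed, more so the closer it was
```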
AMD Announces Open-Source "Schola" Library for Reinforcement Learning (phoronix.com)
AMD today announced the release of Schola 1.0, an open-source reinforcement learning library made available under an MIT license as part of their GPUOpen software collection to help game developers.
Reinforcement Learning: An Overview (arxiv.org)
This manuscript gives a big-picture, up-to-date overview of the field of (deep) reinforcement learning and sequential decision making, covering value-based RL, policy-gradient methods, model-based methods, and various other topics (including a very brief discussion of RL+LLMs).
RLHF Book (rlhfbook.com)
Reinforcement learning from human feedback (RLHF) has become an important technical and storytelling tool to deploy the latest machine learning systems.
Reinforcement Learning – A Reference (jakubhalmes.substack.com)
This text draws primarily from course materials for PA230 Reinforcement Learning, taught by Petr Novotný. Any errors or inaccuracies are my own.
Emerging reasoning with reinforcement learning (notion.site)
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via RL (arxiv.org)
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities.
Kimi K1.5: Scaling Reinforcement Learning with LLMs (github.com/MoonshotAI)
🚀 Introducing Kimi k1.5: an o1-level multi-modal model
DeepSeek-R1-Distill-Qwen-1.5B Surpasses GPT-4o in certain benchmarks (huggingface.co)
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrated remarkable performance on reasoning tasks.