Hacker News with Generative AI: Reinforcement Learning

Competitive Programming with Large Reasoning Models (arxiv.org)
We show that reinforcement learning applied to large language models (LLMs) significantly boosts performance on complex coding and reasoning tasks.
DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL (notion.site)
DeepScaleR is an open-source project to fully democratize reinforcement learning (RL) for LLMs and reproduce DeepSeek R1 and OpenAI O1/O3 at scale on real tasks.
Craftax: (Crafter and NetHack) RL Environment in Jax (github.com/MichaelTMatthews)
Craftax is an RL environment written entirely in JAX. Craftax reimplements and significantly extends the game mechanics of Crafter, taking inspiration from roguelike games such as NetHack.
The Differences Between Direct Alignment Algorithms Are a Blur (arxiv.org)
Direct Alignment Algorithms (DAAs) simplify language model alignment by replacing reinforcement learning (RL) and reward modeling (RM) in Reinforcement Learning from Human Feedback (RLHF) with direct policy optimization.
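The best-known DAA is DPO, whose per-pair objective fits in a few lines. A minimal pure-Python sketch, assuming summed token log-probabilities under the policy and a frozen reference model; the numbers below are made up for illustration, not from any real model:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed token log-probability of the chosen or
    rejected response under the policy or the frozen reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(beta * margin)), written with log1p for numerical stability
    return math.log1p(math.exp(-beta * margin))

# Toy numbers: the policy already prefers the chosen response slightly,
# so the loss comes out below log(2) (the value at zero margin).
loss = dpo_loss(-12.0, -15.0, -13.0, -14.5, beta=0.1)
```

The reference terms are what let DAAs drop the separate reward model: the implicit reward is the policy's log-probability ratio against the reference.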
There may not be aha moment in R1-Zero-like training (notion.site)
R1 Computer Use (github.com/agentsea)
r1-computer-use is an experimental project that applies large-scale Reinforcement Learning techniques similar to DeepSeek-R1 to computer usage scenarios.
OSS reinforcement learning lib by ByteDance is used to reproduce DeepSeek R1 (github.com/volcengine)
verl is a flexible, efficient and production-ready RL training library for large language models (LLMs).
Deep Reinforcement Learning: Pong from Pixels (2016) (karpathy.github.io)
This is a long overdue blog post on Reinforcement Learning (RL). RL is hot! You may have noticed that computers can now automatically learn to play ATARI games (from raw game pixels!), they are beating world champions at Go, simulated quadrupeds are learning to run and leap, and robots are learning how to perform complex manipulation tasks that defy explicit programming.
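The policy-gradient idea behind the post can be shown without Pong: sample an action, observe a reward, and nudge the log-probability of the taken action by the advantage. A pure-Python REINFORCE sketch on a made-up two-armed bandit (arm payout probabilities are illustrative):

```python
import math, random

random.seed(0)

def softmax(prefs):
    exps = [math.exp(p) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

# Two-armed bandit: arm 1 pays +1 with prob 0.8, arm 0 with prob 0.2.
def pull(arm):
    return 1.0 if random.random() < (0.8 if arm == 1 else 0.2) else 0.0

prefs = [0.0, 0.0]          # policy parameters (action preferences)
baseline, lr = 0.0, 0.1     # running reward baseline and learning rate

for _ in range(2000):
    probs = softmax(prefs)
    arm = 0 if random.random() < probs[0] else 1
    reward = pull(arm)
    advantage = reward - baseline
    baseline += 0.01 * (reward - baseline)
    # REINFORCE: grad of log pi(arm) w.r.t. a preference is
    # (1 if that arm was taken else 0) - pi(arm).
    for a in range(2):
        grad = (1.0 if a == arm else 0.0) - probs[a]
        prefs[a] += lr * advantage * grad

final_probs = softmax(prefs)  # should strongly favor arm 1
```

Pong-from-pixels is the same loop with a neural network policy and game frames as state; the baseline plays the role of the discounted-reward normalization in the post.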
AMD Announces Open-Source "Schola" Library for Reinforcement Learning (phoronix.com)
AMD announced today the release of Schola 1.0 as an open-source reinforcement learning library that is being made available under an MIT license and as part of their GPUOpen software collection for helping game developers.
Reinforcement Learning: An Overview (arxiv.org)
This manuscript gives a big-picture, up-to-date overview of the field of (deep) reinforcement learning and sequential decision making, covering value-based RL, policy-gradient methods, model-based methods, and various other topics (including a very brief discussion of RL+LLMs).
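Of the families the overview covers, value-based RL is the easiest to show end to end. A tabular Q-learning sketch on a toy 5-state corridor (the environment is invented for illustration):

```python
import random

random.seed(0)

# A 5-state corridor: start at state 0, reward +1 for reaching state 4.
N_STATES, ACTIONS = 5, (-1, +1)   # move left or right
GAMMA, ALPHA, EPS = 0.9, 0.5, 0.1

Q = [[0.0, 0.0] for _ in range(N_STATES)]

for _ in range(500):
    s = 0
    while s != N_STATES - 1:
        # epsilon-greedy action selection
        if random.random() < EPS:
            a = random.randrange(2)
        else:
            a = 0 if Q[s][0] > Q[s][1] else 1
        s2 = min(max(s + ACTIONS[a], 0), N_STATES - 1)
        r = 1.0 if s2 == N_STATES - 1 else 0.0
        # Q-learning update: bootstrap from the best next-state value
        target = r + GAMMA * max(Q[s2])
        Q[s][a] += ALPHA * (target - Q[s][a])
        s = s2

state_values = [max(q) for q in Q[:-1]]  # decay with distance from the goal
```

Policy-gradient methods (see the Karpathy entry above) instead adjust action probabilities directly, and model-based methods learn the transition dynamics; this table-lookup version is the common starting point.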
RLHF Book (rlhfbook.com)
Reinforcement learning from human feedback (RLHF) has become an important technical and storytelling tool to deploy the latest machine learning systems.
Reinforcement Learning – A Reference (jakubhalmes.substack.com)
This text draws primarily from course materials for PA230 Reinforcement Learning, taught by Petr Novotný. Any errors or inaccuracies are my own.
Emerging reasoning with reinforcement learning (notion.site)
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via RL (arxiv.org)
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities.
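The RL algorithm reported for DeepSeek-R1 is GRPO, whose distinguishing step is computing advantages relative to a group of sampled responses rather than a learned value function. A minimal sketch of that normalization step, with made-up rewards (1.0 for a correct answer, 0.0 otherwise):

```python
def group_advantages(rewards):
    """GRPO-style advantages: normalize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against an all-equal group
    return [(r - mean) / std for r in rewards]

# Four responses to one prompt, two of them correct:
advs = group_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct responses get positive advantages and incorrect ones negative, so the policy-gradient update pushes probability mass toward the answers the verifier accepted, with no critic network needed.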
Kimi K1.5: Scaling Reinforcement Learning with LLMs (github.com/MoonshotAI)
🚀 Introducing Kimi k1.5 --- an o1-level multi-modal model
DeepSeek-R1-Distill-Qwen-1.5B Surpasses GPT-4o in certain benchmarks (huggingface.co)
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning performance.
A path to O1 open source (arxiv.org)
OpenAI o1 represents a significant milestone in artificial intelligence, achieving expert-level performance on many challenging tasks that require strong reasoning. OpenAI has claimed that the main technique behind o1 is reinforcement learning.
Offline Reinforcement Learning for LLM Multi-Step Reasoning (arxiv.org)
Improving the multi-step reasoning ability of large language models (LLMs) with offline reinforcement learning (RL) is essential for quickly adapting them to complex tasks.
Puffer.ai – Simplifying reinforcement learning for complex game environments (puffer.ai)
PufferLib is the reinforcement learning library I wish existed during my PhD. It started as a compatibility layer to make working with complex environments a breeze. Now, it's a high-performance toolkit for research and industry with optimized parallel simulation, environments that run and train at 1M+ steps/second, and tons of quality of life improvements for practitioners. All our tools are free and open source. We also offer priority service for companies, startups, and labs!
Decisions and Dragons (decisionsanddragons.com)
A guide to the perilous world of reinforcement learning.
Batched reward model inference and Best-of-N sampling (raw.sh)
Reward models have been a key part of reinforcement learning on top of LLMs, used broadly in techniques like RLHF and as LLM-as-a-judge critics in evals.
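Best-of-N itself is simple: sample N candidates, score each with the reward model, keep the argmax. A runnable sketch with stand-in sampler and scorer (both invented here so the example is self-contained; a real system would batch the reward-model calls into one forward pass):

```python
import random

random.seed(0)

def best_of_n(prompt, sample, reward_model, n=8):
    """Draw n candidates and return the one the reward model scores highest."""
    candidates = [sample(prompt) for _ in range(n)]
    # Batched reward-model inference would score all of these at once;
    # we score sequentially for clarity.
    scores = [reward_model(prompt, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]

# Stand-ins: the sampler emits a random-length answer, and the "reward
# model" prefers answers close to 20 characters.
sample = lambda p: "x" * random.randint(1, 40)
reward = lambda p, c: -abs(len(c) - 20)

answer, score = best_of_n("why is the sky blue?", sample, reward, n=16)
```

Raising N trades inference compute for reward, which is why batching the scoring pass matters in practice.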
Reinforcement Learning – My Algorithm vs. State of the Art [video] (youtube.com)
Show HN: RL Agent that can auto-optimize your LLM prompts (nomadic-ml.github.io)
The RL Prompt Optimizer employs a reinforcement learning framework to iteratively improve prompts used for language model evaluations.
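One way to frame prompt optimization as RL is as a bandit problem: each prompt variant is an arm and the eval score is the reward. This is a generic epsilon-greedy sketch, not the project's actual method; the prompt variants, their true mean scores, and the noisy `evaluate` stand-in are all invented:

```python
import random

random.seed(1)

# Hypothetical prompt variants with made-up true mean eval scores.
prompts = {
    "Answer concisely:": 0.55,
    "Think step by step, then answer:": 0.75,
    "Answer in one word:": 0.40,
}

def evaluate(prompt):
    # Stand-in for running an LLM eval: noisy sample around the true mean.
    return prompts[prompt] + random.gauss(0, 0.1)

counts = {p: 0 for p in prompts}
means = {p: 0.0 for p in prompts}

for step in range(300):
    if random.random() < 0.1:                 # explore a random variant
        p = random.choice(list(prompts))
    else:                                     # exploit the current best
        p = max(means, key=means.get)
    r = evaluate(p)
    counts[p] += 1
    means[p] += (r - means[p]) / counts[p]    # incremental mean update

best = max(means, key=means.get)
```

With enough evaluations the empirical means separate and the loop concentrates on the highest-scoring variant; real prompt optimizers add structure (edit operators, contextual features) on top of this loop.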
WebRL: Training LLM Web Agents via Self-Evolving Online Reinforcement Learning (arxiv.org)
Large language models (LLMs) have shown remarkable potential as autonomous agents, particularly in web-based tasks.
Using reinforcement learning and $4.80 of GPU time to find the best HN post (openpipe.ai)
Using Reinforcement Learning and $4.80 of GPU Time to Find the Best HN Post Ever (RLHF Part 1)
Reinforcement Learning: An Introduction (2018) (incompleteideas.net)
Supporting Task Switching with Reinforcement Learning (dl.acm.org)
Human attention is a limited resource, but it is stressed more than ever before in history [2, 81].
Diffusion for World Modeling (diamond-wm.github.io)
DIAMOND 💎 (DIffusion As a Model Of eNvironment Dreams) is a reinforcement learning agent trained entirely in a diffusion world model. The agent playing in the diffusion model is shown above.
Training Language Models to Self-Correct via Reinforcement Learning (arxiv.org)
Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs.