Hacker News with Generative AI: Reinforcement Learning

A path to O1 open source (arxiv.org)
OpenAI o1 represents a significant milestone in Artificial Intelligence, achieving expert-level performance on many challenging tasks that require strong reasoning ability. OpenAI has claimed that the main technique behind o1 is reinforcement learning.
Offline Reinforcement Learning for LLM Multi-Step Reasoning (arxiv.org)
Improving the multi-step reasoning ability of large language models (LLMs) with offline reinforcement learning (RL) is essential for quickly adapting them to complex tasks.
Puffer.ai – Simplifying reinforcement learning for complex game environments (puffer.ai)
PufferLib is the reinforcement learning library I wish existed during my PhD. It started as a compatibility layer to make working with complex environments a breeze. Now, it's a high-performance toolkit for research and industry with optimized parallel simulation, environments that run and train at 1M+ steps/second, and tons of quality of life improvements for practitioners. All our tools are free and open source. We also offer priority service for companies, startups, and labs!
Decisions and Dragons (decisionsanddragons.com)
A guide to the perilous world of reinforcement learning.
Batched reward model inference and Best-of-N sampling (raw.sh)
Reward models have been a key part of reinforcement learning on top of LLMs, used broadly in techniques like RLHF and as LLM-as-a-judge critics in evals.
Reinforcement Learning – My Algorithm vs. State of the Art [video] (youtube.com)
Show HN: RL Agent that can auto-optimize your LLM prompts (nomadic-ml.github.io)
The RL Prompt Optimizer employs a reinforcement learning framework to iteratively improve prompts used for language model evaluations.
WebRL: Training LLM Web Agents via Self-Evolving Online Reinforcement Learning (arxiv.org)
Large language models (LLMs) have shown remarkable potential as autonomous agents, particularly in web-based tasks.
Using reinforcement learning and $4.80 of GPU time to find the best HN post (openpipe.ai)
Using Reinforcement Learning and $4.80 of GPU Time to Find the Best HN Post Ever (RLHF Part 1)
Reinforcement Learning: An Introduction (2018) (incompleteideas.net)
Supporting Task Switching with Reinforcement Learning (dl.acm.org)
Human attention is a limited resource, but it is stressed more than ever before in history [2, 81].
Diffusion for World Modeling (diamond-wm.github.io)
DIAMOND 💎 (DIffusion As a Model Of eNvironment Dreams) is a reinforcement learning agent trained entirely in a diffusion world model. The agent playing in the diffusion model is shown above.
Training Language Models to Self-Correct via Reinforcement Learning (arxiv.org)
Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs.
Show HN: LeanRL: Fast PyTorch RL with Torch.compile and CUDA Graphs (github.com/pytorch-labs)
LeanRL is a lightweight library consisting of single-file, PyTorch-based implementations of popular Reinforcement Learning (RL) algorithms.
RLHF is just barely RL (twitter.com)
Andrej Karpathy on X: RLHF is just barely RL
Solving Path of Exile Item Crafting with Reinforcement Learning (dennybritz.com)
Mental Modeling of Reinforcement Learning Agents by Language Models (arxiv.org)
Augmenting biological intelligence with RL in C.elegans using optogenetics [pdf] (tch.harvard.edu)
Shadow Robot’s three-fingered hand is robust enough for reinforcement learning (ieee.org)
Deep Reinforcement Learning: Zero to Hero (github.com/alessiodm)