Show HN: Terminal-Bench-RL: Training Long-Horizon Terminal Agents with RL

Hacker News (score: 84)

Found: July 29, 2025

ID: 569


After training a calculator agent via RL, I really wanted to go bigger! So I built RL infrastructure for training long-horizon terminal/coding agents that scales from 2x A100s to 32x H100s (~$1M worth of compute!). Without any training, my 32B agent hit #19 on the Terminal-Bench leaderboard, beating Stanford's Terminus-Qwen3-235B-A22! With training... well, too expensive, but I bet the results would be good!

*What I did*:

- Created a Claude Code-inspired agent (system msg + tools)

- Built Docker-isolated GRPO training where each rollout gets its own container

- Developed a multi-agent synthetic data pipeline to generate & validate training data with Opus-4

- Implemented a hybrid reward signal of unit test verifiers & a behavioural LLM judge.
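
One way to picture the hybrid reward from the last bullet is as a weighted blend of the two signals. This is a minimal sketch, not the code from the repo: the function name `hybrid_reward`, the default 0.65 weighting, and the assumption that the judge score is already scaled to [0, 1] are all hypothetical.

```python
def hybrid_reward(tests_passed: int, tests_total: int, judge_score: float,
                  test_weight: float = 0.65) -> float:
    """Blend a unit-test pass rate with an LLM-judge score.

    Both signals are assumed to lie in [0, 1]. The default weight is
    illustrative only and is not taken from the project.
    """
    if tests_total <= 0:
        raise ValueError("tests_total must be positive")
    if not 0 <= tests_passed <= tests_total:
        raise ValueError("tests_passed must be between 0 and tests_total")
    if not 0.0 <= judge_score <= 1.0:
        raise ValueError("judge_score must be in [0, 1]")
    if not 0.0 <= test_weight <= 1.0:
        raise ValueError("test_weight must be in [0, 1]")

    pass_rate = tests_passed / tests_total
    # Verifiable outcome (tests) and behavioural quality (judge) share one scalar.
    return test_weight * pass_rate + (1.0 - test_weight) * judge_score
```

For example, `hybrid_reward(3, 4, 0.5, test_weight=0.5)` averages a 75% pass rate with a judge score of 0.5. Keeping the result in [0, 1] makes rewards comparable across tasks with different numbers of tests.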

*Key results*:

- My untrained Qwen3-32B agent achieved 13.75% on Terminal-Bench (#19, beats Stanford's Qwen3-235B MoE)

- I verified that training runs stably on 32x H100s distributed across 4 bare-metal nodes

- I created a mini-eval framework for LLM-judge performance. Sonnet-4 won.

- ~£30-50k needed for a full training run of 1000 epochs (I could only afford testing)

*Technical details*:

- The synthetic dataset ranges from easy to extremely hard tasks. An example hard task's prompt:

"I found this mystery program at `/app/program` and I'm completely stumped. It's a stripped binary, so I have no idea what it does or how to run it properly. The program seems to expect some specific input and then produces an output, but I can't figure out what kind of input it needs. Could you help me figure out what this program requires?"

- Simple config presets allow training to run on multiple hardware setups with minimal effort.

- GRPO used with 16 rollouts per task, up to 32k tokens per rollout.

- Agent uses XML/YAML format to structure tool calls
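
For readers new to GRPO, the "16 rollouts per task" matter because each rollout is scored relative to the other rollouts for the same task rather than by a learned value function. Below is a minimal sketch of that group-relative advantage in its commonly described form (reward minus group mean, divided by group standard deviation). The function name and the `eps` value are assumptions, and the post does not say whether this project uses exactly this normalisation.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalise each rollout's reward against its own group.

    `rewards` holds one scalar per rollout of the same task (16 in the post).
    `eps` is a hypothetical stabiliser that avoids division by zero.
    """
    if not rewards:
        raise ValueError("need at least one rollout reward")
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Usage: 16 rollouts of one task, half of which succeeded.
advantages = group_relative_advantages([1.0] * 8 + [0.0] * 8)
```

Note that if every rollout in a group earns the same reward, all advantages are zero and that task contributes no learning signal, which is one reason a dataset spanning easy to extremely hard tasks is useful.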

*More details*:

My GitHub repos open-source everything (agent, data, code) and have many more technical details if you are interested:

- Terminal Agent RL repo

- Multi-agent synthetic data pipeline repo

I thought I would share this because I believe long-horizon RL is going to change everybody's lives, so I feel it is important (and super fun!) for us all to share knowledge in this area and to enjoy exploring what is possible.

Thanks for reading!

Dan

(Built using the rLLM RL framework, which was brilliant to work with, and evaluated on and inspired by the great Terminal-Bench benchmark)
