By Yancheng He, Weixun Wang, and Xiaoyang Li | Project Leader: Weixun Wang | February 11, 2026
Chinese Version: Agentic RL
<aside>
đź“„ Technical Report: https://arxiv.org/pdf/2512.24873
đź§ Model: https://huggingface.co/FutureLivingLab/iFlow-ROME
đź§© Framework:
📊 Benchmarks: https://github.com/alibaba/terminal-bench-pro
</aside>
The RL paradigm for LLMs is at an inflection point. RLVR has driven impressive gains in math, code, and formal reasoning. Yet beneath its success lies a structural simplification: vanilla RLVR operates more like an in-context bandit, where the model generates a single response, receives an immediate reward, and updates. There is no temporal depth and no sequential decision-making across states.
Agentic RL, by contrast, operates in a multi-step, interactive MDP setting: the model takes actions, observes environment feedback, and optimizes over extended trajectories with sparse, delayed rewards. The model is no longer simply “producing an answer”; it must continuously make decisions, reflect on its state, refine its actions, and take responsibility for the final outcome. As a result, the scope of applications naturally extends beyond closed, verifiable tasks to more complex real-world scenarios, such as travel planning, complex data analysis, and other open-ended tasks without explicit verification signals.
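To make the contrast concrete, here is a minimal, illustrative Python sketch (not our training code): `model`, `env`, and `reward_fn` are hypothetical placeholders, and their interfaces are assumptions chosen only to highlight the difference between a single-turn bandit rollout and a multi-step agentic rollout.

```python
# Minimal sketch for illustration only; `model`, `env`, and `reward_fn`
# are hypothetical placeholders, not real library APIs.

def rlvr_rollout(model, prompt, reward_fn):
    """Vanilla RLVR: one response, one immediate reward -- an in-context bandit."""
    response = model.generate(prompt)
    reward = reward_fn(prompt, response)       # verifiable, immediate reward
    return [(prompt, response, reward)]        # a single-step "trajectory"

def agentic_rollout(model, env, max_steps=50):
    """Agentic RL: a multi-step, interactive MDP with sparse, delayed reward."""
    obs = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = model.generate(obs)           # decide based on the current state
        obs, reward, done = env.step(action)   # observe environment feedback
        trajectory.append((action, obs, reward))
        if done:                               # reward is typically zero until the end
            break
    return trajectory
```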
This shift places fundamentally higher demands on both infrastructure and algorithm design: end-to-end asynchronous training pipelines, stable long-horizon trajectory-level credit assignment, tight integration with environments, and engineering infrastructure capable of supporting sustained model scaling. In this post, we document our practical experiences and lessons learned from building agentic RL in terminal environments: the infrastructure foundations for agentic RL training, and the algorithmic improvements that address its unique challenges.
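As a toy illustration of trajectory-level credit assignment (a simplified sketch, not the method used in our training), a sparse terminal reward can be propagated backward as a discounted return so that every step in a long-horizon trajectory receives a learning signal. The helper `discounted_returns` below is hypothetical.

```python
# Illustrative sketch of trajectory-level credit assignment (not the authors' method):
# a sparse terminal reward is turned into per-step discounted returns.

def discounted_returns(step_rewards, gamma=1.0):
    """step_rewards: per-step rewards, typically all zero except the final entry."""
    returns, running = [], 0.0
    for r in reversed(step_rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

# Example: a 5-step trajectory where only the final outcome is rewarded.
print(discounted_returns([0, 0, 0, 0, 1.0], gamma=0.99))
# -> approximately [0.96, 0.97, 0.98, 0.99, 1.0]
```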
We start by introducing our environment manager, followed by how we curate RL instances. We then discuss training strategies for stabilizing agentic RL. Readers mainly interested in the algorithms can skip ahead to the training sections.
<aside>
Why this matters?
Agentic RL is not just about algorithms — it requires co-designing environments, infrastructure, and algorithms.
</aside>