# Vibe RL Example: Building a "Who is the Spy" Agent Trainer from Scratch Without Writing a Single Line of Code

> This article is a translated version of the [Chinese original](./example_vibe_rl_who_is_spy.zh.md).
## Abstract

In reinforcement learning research, the journey from inspiration, to writing code, to the first successful training curve is long and tedious. Fortunately, with the AgentJet framework, going from idea to successful training now takes little more than describing what you want: spend a few minutes writing a prompt, wait a short while, and **complete, concise, human-readable and editable training code** appears before you alongside **the training curve of the first run**. In this article, we use the classic "Who is the Spy" board game as an example to demonstrate, from scratch, the entire process of training an agent without writing code.

## Install the AgentJet Environment
You can either [install manually](https://doc.agentjet.top/en/installation/) or install via skills. Run the following commands to copy the skills into Claude Code or OpenCode:
```bash
npx skills add modelscope/agentjet
npx skills add binary-husky/Vibe-RL
```
After the skills are added, you can instruct Claude Code or OpenCode to install AgentJet using uv (or conda / docker).

## Write the Prompt
Once AgentJet is installed, you can get started right away. Open OpenCode (Claude Code is more powerful, but the author prefers fully open-source tools; besides, Vibe RL in AgentJet is not very demanding, so a particularly strong agent isn't necessary), select the claude-4.5-sonnet model (it is faster than opus and more than sufficient for problems that aren't too difficult), and start the task:

```txt
Your task:
- Write an agent that learns the "Who is the Spy" task, trained with a combination of reinforcement learning and supervised learning. The game rules are as follows:
  - The game has N players, most of whom are **civilians**, with a few being **spies**
  - At the start of the game, each civilian receives the same **civilian word**, and each spy receives a **spy word** that is similar to the civilian word but different (e.g., the civilian word is "apple" and the spy word is "pear")
  - In each round, all players take turns giving a **verbal description** of their word. The description must truthfully reflect the word, but must not state the word itself or expose the player's identity too obviously
  - After all players have given their descriptions, the game enters the **voting phase**: all players vote for the player they consider the most suspicious spy, and the player with the most votes is eliminated
  - The game continues for multiple rounds until one of the following end conditions is met:
    - **Civilians win**: all spies are eliminated
    - **Spies win**: the number of spies >= the number of civilians (the spies gain the numerical advantage)
- The agent needs to master two core abilities through extensive gameplay:
  - **Description strategy learning**: learn to generate, given its own word and the current game state, optimal descriptions that neither expose its identity nor lose the trust of same-team players
  - **Reasoning and decision learning**: learn to accurately identify spies from the conversation history, other players' description patterns, and behavioral cues, and to make optimal voting decisions
- Training objective: maximize the agent's win rate across the different roles (civilian/spy), continuously optimizing the strategy through self-play and reward mechanisms
- I want to use the base model `/mnt/data_cpfs/model_cache/modelscope/hub/Qwen/Qwen/Qwen2.5-7B-Instruct`
- Use 8 GPUs for training
- Batch Size 16
- I don't have a dataset yet; please help me mock a small amount of game data for testing and initial training
- Use the OpenAI SDK, and make flexible use of Tools
- The code must not contain Chinese characters

Your skill (read this SKILL file first to acquire the necessary knowledge):
./ajet/copilot/write-swarm-client/SKILL.md

- Additional requirements:
  - optional 0. (agent_roll) Team A civilians share one 7B model; Team B spies use qwen-max (DASHSCOPE_API_KEY is already in the environment variables).
    Each episode randomly assigns every player's ID and name (randomly generate a long list of random names); the winner gets reward 1, the loser gets reward 0
  - optional 1. (agent_roll_adv) Adversarial training: Team A civilians share one 7B model (swarm server 1); Team B spies share another 7B model (swarm server 2).
    Each episode randomly assigns every player's ID and name (randomly generate a long list of random names); the winner gets reward 1, the loser gets reward 0

- Additional requirements:
  agent_roll: use 4 GPUs
  agent_roll_adv: swarm server 1 and swarm server 2 each use 4 GPUs (8 GPUs in total)

- Additional requirement: debug with tmux + uv's .venv until all bugs are eliminated & training starts normally. You may use three tmux sessions: `spy-swarm-server`, `spy-swarm-server-2`, `spy-swarm-client`

- Current debugging stage:
  Debug agent_roll [execute debugging]
  Debug agent_roll_adv [skip debugging]
```
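
To ground what the prompt is asking for, here is a rough sketch of the episode structure it describes: repeated describe-then-vote rounds that terminate in a binary win/loss reward. This is purely an illustration under assumed names (`Player`, `play_episode`, `describe_fn`, `vote_fn`); it is not the code AgentJet generates.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Player:
    name: str
    is_spy: bool

# Hypothetical sketch of one "Who is the Spy" episode as specified in the prompt.
# describe_fn / vote_fn stand in for model calls made through the OpenAI SDK.
def play_episode(players, describe_fn, vote_fn, civilian_word, spy_word):
    words = {p.name: (spy_word if p.is_spy else civilian_word) for p in players}
    alive = list(players)
    history = []  # full transcript, visible to every player across rounds
    while True:
        # Description phase: each living player describes their own word in turn.
        for p in alive:
            history.append((p.name, describe_fn(p, words[p.name], history)))
        # Voting phase: the player with the most votes is eliminated (ties arbitrary).
        votes = [vote_fn(p, history, [q.name for q in alive]) for p in alive]
        eliminated = max(set(votes), key=votes.count)
        alive = [p for p in alive if p.name != eliminated]
        spies = sum(p.is_spy for p in alive)
        # Terminal conditions map directly to the binary reward in the prompt.
        if spies == 0:
            return {p.name: float(not p.is_spy) for p in players}  # civilians win
        if spies >= len(alive) - spies:
            return {p.name: float(p.is_spy) for p in players}      # spies win
```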
## Check Results

### Generated Training Code

Under the guidance of the agentjet skill, OpenCode generates all training code in `tutorial/opencode_build_***`:

```bash
(base) ➜ agentjet git:(main) ✗ tree tutorial/opencode_build_spy_game
tutorial/opencode_build_spy_game/
...
└── readme.md          # This file
```

### Inspect the Training Swarm, Spot Bugs, and Guide the Agent to Fix Them

After waiting a while, we run the `ajet-swarm overwatch` command to check which step training has reached, and discover that claude-sonnet has produced a rather absurd bug:

```bash
Completed Episode Pool Summary (Progress to Hit Next Weight Update)
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                                 ┃ Current     ┃ Target      ┃ Progress     ┃ Bar                                                                     ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
...
│ Average Episode Per Task               │ 140.00      │ 4           │ -            │ -                                                                       │
└────────────────────────────────────────┴─────────────┴─────────────┴──────────────┴───────────────────────────────────────────────────────────────────────┘

Task Completion Details
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Task ID      ┃ Episodes      ┃ Reward                ┃ Episode UUIDs (first 3)                                                   ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│              │ 140           │ 0.779 ± 0.448         │ b47d7b96..., 8caec2d7..., b48bd9fb... (+137 more)                         │
└──────────────┴───────────────┴───────────────────────┴───────────────────────────────────────────────────────────────────────────┘
```

From the swarm monitoring table, the sample pool has accumulated 875.0% (140 episodes against a pool target of 16), yet AgentJet has not started training. Looking closer, the Completed Tasks progress is stuck at 1, meaning all 140 episodes were attributed to a single task. And the task IDs of these samples? Empty strings. Clearly, claude-sonnet's mocked dataset has a rather comical bug. We give OpenCode a new directive:

```txt
task.task_id has a serious problem: task_id should be the per-episode random seed and must not be empty!
```
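
Why this matters: completed episodes are pooled by `task_id`, and group-relative training (GRPO) compares episodes within the same task group, so empty IDs collapse all 140 episodes into a single group and training stalls. A minimal sketch of the intended fix, with hypothetical field names:

```python
import random

# Hypothetical sketch: each mock task carries a unique, seed-derived task_id,
# so completed episodes group per task instead of collapsing into one bucket.
def make_mock_tasks(num_tasks: int, master_seed: int = 0) -> list[dict]:
    rng = random.Random(master_seed)
    tasks = []
    for _ in range(num_tasks):
        seed = rng.randrange(2**31)
        tasks.append({
            "task_id": str(seed),  # must be non-empty and unique per task
            "seed": seed,          # drives word-pair choice and role shuffling
        })
    return tasks
```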
While we're at it, we adjust some parameters: batch size from 4 to 32, grpo_n from 4 to 6. Then we have a cup of tea and come back. This time it works.

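For intuition on how these two knobs interact (the names mirror the article's narrative, not a verified AgentJet config schema): if the batch size counts distinct tasks per weight update and grpo_n is the group size rolled out per task, the data collected per update scales with their product.

```python
# Illustrative arithmetic only; parameter names mirror the text, not a real config.
batch_size = 32             # assumed: distinct tasks (seeds) per weight update
grpo_n = 6                  # assumed: episodes rolled out per task group (GRPO)
print(batch_size * grpo_n)  # 192 episodes per update under these assumptions
```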
![alt text](https://img.alicdn.com/imgextra/i4/O1CN01cQny931D4FI93OwyB_!!6000000000162-2-tps-2445-1227.png)

To ensure the agent logic is correct, we also open beast_logger (the log monitoring component that comes with agentjet):

![alt text](https://img.alicdn.com/imgextra/i3/O1CN01w7QLeg26hS3yIma36_!!6000000007693-2-tps-3782-1963.png)

One look and sure enough, there are still issues (we slightly regret not using opus). Our requirement was that Team A civilians share a single 7B model as a common brain, while Team B spies use qwen-max. So how did a spy sneak into the civilian team? This time we need claude-sonnet to reflect carefully:

![alt text](https://img.alicdn.com/imgextra/i3/O1CN01ECZFjI286viB25hk1_!!6000000007884-2-tps-1079-498.png)

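Conceptually, the fix is plain team-based routing: every civilian request must hit the trainable 7B swarm endpoint, and every spy request must hit qwen-max. A minimal sketch with the OpenAI SDK; the local endpoint URL and served model names here are assumptions, not the generated code:

```python
import os
from openai import OpenAI

# Team A civilians: local swarm-served 7B endpoint (URL and model name assumed).
swarm_7b = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
# Team B spies: qwen-max via DashScope's OpenAI-compatible endpoint.
qwen_max = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key=os.environ["DASHSCOPE_API_KEY"],
)

def backend_for(player) -> tuple[OpenAI, str]:
    # The bug was a spy being served by the civilian model; team decides routing.
    if player.is_spy:
        return qwen_max, "qwen-max"
    return swarm_7b, "Qwen2.5-7B-Instruct"
```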
After a while, we check again and the issues are all fixed.

### Check Training Curves

Heading over to SwanLab: not bad, the reward is climbing steadily.

![alt text](https://img.alicdn.com/imgextra/i2/O1CN01qFvfeU20XTkCW2H89_!!6000000006859-2-tps-1994-522.png)
