In reinforcement learning research, the journey from inspiration to writing code to generating the first successful training curve is long and tedious. Fortunately, with the AgentJet framework, going from idea to successful training is now just a matter of speaking up and spending a few minutes writing some prompts. After a short wait, you get to see **complete, concise, human-readable and editable training code** alongside **the first training curve** displayed before you. In this article, we use the classic "Who is the Spy" board game as an example to demonstrate the entire process of training an Agent without writing code.
## Install AgentJet Environment
You can choose to [install manually](https://doc.agentjet.top/en/installation/) or use skills by copying them into Claude Code or OpenCode.
Once AgentJet is installed, you can get started right away. Open OpenCode (Claude Code is more powerful, but the author prefers fully open-source tools; besides, Vibe RL in AgentJet is not very difficult, so a very strong agent isn't needed), select the claude-4.5-sonnet model (it reasons faster than opus and is sufficient for tasks that aren't too hard), and start executing the task:
- Write an agent that learns the "Who is the Spy" task, trained using a combination of reinforcement learning and supervised learning. The game rules are as follows:
- The game has N players, most of whom are **civilians**, with a few being **spies**
- At the start of the game, each civilian receives the same **civilian word**, and each spy receives a **spy word** that is similar to the civilian word but different (e.g., civilian word is "apple", spy word is "pear")
- In each round, all players take turns giving **verbal descriptions** of their word. The description must truthfully reflect the word, but cannot directly say the word itself or expose the player's identity too obviously
- After all players have described, the game enters the **voting phase**, where all players vote for who they think is the most suspicious spy. The player with the most votes is eliminated
- The game continues for multiple rounds until one of the following end conditions is met:
- **Civilians win**: All spies are eliminated
- **Spies win**: The number of spies >= the number of civilians (spies have the numerical advantage)
- The agent needs to master two core abilities through extensive gameplay:
- **Description strategy learning**: Learn to generate optimal descriptions based on the agent's word and current game state that neither expose identity nor alienate teammates
- **Reasoning and decision learning**: Learn to accurately identify spies based on conversation history, other players' description patterns, and behavioral characteristics, and make optimal voting decisions
- Training objective: Maximize the agent's win rate across different roles (civilian/spy), continuously optimizing strategy through self-play and reward mechanisms
- I want to use the base model `/mnt/data_cpfs/model_cache/modelscope/hub/Qwen/Qwen/Qwen2.5-7B-Instruct`
- Use 8 GPUs for training
- Batch Size 16
- I don't have a dataset yet, please help me mock some game data for testing and initial training
- Use OpenAI SDK, flexibly use Tools
- Code must not contain Chinese characters
Your skill (please read this SKILL file first to get necessary knowledge):
`./ajet/copilot/write-swarm-client/SKILL.md`
- Additional requirements:
- optional 0. (agent_roll) Team A civilians share one 7B model, Team B spies use qwen-max (DASHSCOPE_API_KEY is already in environment variables),
each episode randomly assigns each player's ID and name (randomly generate a long list of random names), winner gets reward 1, loser gets reward 0
- optional 1. (agent_roll_adv) Adversarial training, Team A civilians share one 7B model (swarm server 1), Team B spies share another 7B model (swarm server 2),
each episode randomly assigns each player's ID and name (randomly generate a long list of random names), winner gets reward 1, loser gets reward 0
- Additional requirements:
agent_roll: Use 4 GPUs
agent_roll_adv: swarm server 1 and swarm server 2 each use 4 GPUs (total 8 GPUs)
- Additional requirements: Use tmux + uv's .venv for debugging until all bugs are fixed and training starts normally. You can use three tmux sessions: `spy-swarm-server`, `spy-swarm-server-2`, and `spy-swarm-client`
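The game rules and reward scheme in the prompt above can be sketched as a minimal self-play loop. Everything here is an illustrative stand-in (not AgentJet code): in real training the description and voting phases are model calls, while this sketch uses placeholders.

```python
import random

def play_episode(n_players=6, n_spies=2):
    """One 'Who is the Spy' episode; returns the winning team."""
    # Randomly assign roles; spies get a word similar to the civilian word.
    roles = ["spy"] * n_spies + ["civilian"] * (n_players - n_spies)
    random.shuffle(roles)
    alive = list(range(n_players))

    while True:
        # Win conditions are checked at the start of every round.
        spies = [p for p in alive if roles[p] == "spy"]
        civilians = [p for p in alive if roles[p] == "civilian"]
        if not spies:
            return "civilians"            # all spies eliminated
        if len(spies) >= len(civilians):
            return "spies"                # spies reach numerical parity

        # Description phase: each living player hints at their word
        # (placeholder strings here; model calls in real training).
        descriptions = {p: f"player {p} hints at their word" for p in alive}

        # Voting phase: everyone votes for a suspect; most votes is out.
        votes = {p: random.choice([q for q in alive if q != p]) for p in alive}
        tally = {p: list(votes.values()).count(p) for p in alive}
        alive.remove(max(tally, key=tally.get))

# Per the prompt: winner's team gets reward 1, the loser gets 0.
winner = play_episode()
reward = {"winner": 1, "loser": 0}
```

Each round eliminates exactly one player, so the loop always terminates at one of the two win conditions.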
From the swarm monitoring table, the sample pool has accumulated 875.0% (140) episode samples, but AgentJet hasn't started training yet. Looking closer, the Completed Tasks count is only 1, meaning all 140 episodes were identified as a single task. And the task IDs of these samples? Empty strings. No doubt about it: claude-sonnet produced a hilarious bug in the mock dataset. We give OpenCode a new directive:
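The failure mode is easy to reproduce: if every mocked episode carries an empty `task_id`, any grouping step that buckets samples per task collapses the entire pool into one task. This is an illustrative reconstruction, not AgentJet internals — only the `task_id` field and the 140-episode count come from the article:

```python
from collections import defaultdict

def group_by_task(episodes):
    """Bucket episode samples by their task_id field."""
    groups = defaultdict(list)
    for ep in episodes:
        groups[ep["task_id"]].append(ep)
    return groups

# Buggy mock data: every episode has task_id == "" ...
buggy = [{"task_id": "", "reward": 1} for _ in range(140)]
# ... so all 140 episodes look like one single task.
n_buggy_tasks = len(group_by_task(buggy))    # 1

# Fix: give each mocked game a distinct task id.
fixed = [{"task_id": f"spy_game_{i}", "reward": 1} for i in range(140)]
n_fixed_tasks = len(group_by_task(fixed))    # 140
```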
While we're at it, we adjust some parameters: batch size from 4 to 32, grpo_n from 4 to 6. Then we have a cup of tea and come back. This time it works.
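In config form, the adjustment amounts to something like the fragment below. The key names `batch_size` and `grpo_n` are taken from the narration; the surrounding structure is hypothetical, not AgentJet's actual schema:

```yaml
# Hypothetical config fragment; nesting is illustrative.
training:
  batch_size: 32   # raised from 4
  grpo_n: 6        # raised from 4
```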
One look and sure enough, there are still issues (slightly regretting not using opus). Our requirement was that Team A civilians share one brain, a 7B model, while Team B spies use qwen-max. So why did a spy sneak into the civilian team? This time we need claude-sonnet to reflect carefully:
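The intended routing is strict: civilians call the locally served 7B model, spies call qwen-max. A sketch of that dispatch table — the local URL and helper name are assumptions; `qwen-max`, the 7B model, and `DASHSCOPE_API_KEY` come from the prompt above (DashScope's OpenAI-compatible endpoint is its documented public URL):

```python
import os

# Role -> OpenAI-SDK-compatible client settings. The localhost URL is a
# placeholder for the swarm server; everything else follows the prompt.
MODEL_FOR_ROLE = {
    "civilian": {
        "base_url": "http://localhost:8000/v1",   # shared 7B swarm server
        "api_key": "EMPTY",
        "model": "Qwen2.5-7B-Instruct",
    },
    "spy": {
        "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
        "api_key": os.environ.get("DASHSCOPE_API_KEY", ""),
        "model": "qwen-max",
    },
}

def client_config(role):
    # Route strictly by role -- the generated code's bug was letting a
    # spy-model call slip into the civilian team.
    if role not in MODEL_FOR_ROLE:
        raise ValueError(f"unknown role: {role}")
    return MODEL_FOR_ROLE[role]

civilian_cfg = client_config("civilian")
spy_cfg = client_config("spy")
```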