Training path performance optimization. +15% SAC throughput on GPU, verified with no score regression.
What changed (18 files):
- `polyak_update`: in-place `lerp_()` replaces 3-op manual arithmetic
- SAC: single `log_softmax` → `exp` replaces dual `softmax` + `log_softmax`; cached entropy shared between policy/alpha loss; cached `_is_per` and `_LOG2`
- `to_torch_batch`: uint8/float16 sent directly to GPU, then `.float()`; avoids a 4x CPU float32 intermediate (matters for Atari 84x84x4)
- SumTree: iterative propagation/retrieval replaces recursion; vectorized sampling
- `forward_tails`: cached output (was called twice per step)
- `VectorFullGameStatistics`: `deque(maxlen=N)` + `np.flatnonzero` replaces list + `pop(0)` + loop
- pydash → builtins: `isinstance` over `ps.is_list`/`ps.is_dict`, dict comprehensions over `ps.pick`/`ps.omit` in hot paths
- PPO: `total_loss` kept as a plain float prevents a computation-graph leak across epochs
- Minor: `hasattr` → `is not None` in conv/recurrent forward, cached `_is_dev`, `no_decay` early exit in VarScheduler
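The `polyak_update` change can be sketched as follows. This is a minimal illustration under the usual soft-update convention `tgt = beta * tgt + (1 - beta) * src`; the function name and signature here are illustrative, not the exact SLM-Lab API:

```python
import torch

@torch.no_grad()
def polyak_update(src_net: torch.nn.Module, tgt_net: torch.nn.Module, beta: float = 0.995) -> None:
    """Soft-update target network parameters in place.

    Computes tgt = beta * tgt + (1 - beta) * src as a single fused
    tgt.lerp_(src, 1 - beta), instead of separate mul/add/copy ops.
    """
    for src_p, tgt_p in zip(src_net.parameters(), tgt_net.parameters()):
        tgt_p.lerp_(src_p, 1.0 - beta)
```

`Tensor.lerp_(end, weight)` computes `self + weight * (end - self)` in place, so a single kernel call replaces the three-op manual arithmetic.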
Measured gains (normalized, same hardware A/B on RTX 3090):
- SAC MuJoCo: +15-17% fps
- SAC Atari: +14% fps
- PPO: ~0% (env-bound; most optimizations target SAC's training-heavy inner loop — PPO doesn't use polyak, replay buffer, twin Q, or entropy tuning)
TorchArc YAML benchmarks replace original hardcoded network architectures across all benchmark categories.
- TorchArc integration: All algorithms (REINFORCE, SARSA, DQN, DDQN+PER, A2C, PPO, SAC) now use TorchArc YAML-defined networks instead of hardcoded PyTorch modules
- Full benchmark validation: Classic Control, Box2D, MuJoCo (11 envs), and Atari (54 games) re-benchmarked with TorchArc — results match or exceed original scores
- SAC Atari: new benchmarks (48 games) with discrete action support
- Pre-commit hooks: Conventional commit message validation via `.githooks/commit-msg`
Modernization release for the current RL ecosystem. Updates SLM-Lab from OpenAI Gym to Gymnasium, adds correct handling of episode termination (the terminated/truncated fix), and migrates to modern Python tooling.
TL;DR: Install with `uv sync`, run with `slm-lab run`. Specs are simpler (no more `body` section or array wrappers). Environment names changed (`CartPole-v1`, `ALE/Pong-v5`, `Hopper-v5`). Code structure preserved for book readers.
Book readers: For exact code from Foundations of Deep Reinforcement Learning, use `git checkout v4.1.1`.
SLM-Lab was created as an educational framework for deep reinforcement learning, accompanying Foundations of Deep Reinforcement Learning. The code prioritizes clarity and correctness—it should help you understand RL algorithms, not just run them.
Since v4, the RL ecosystem changed significantly:
- OpenAI Gym is deprecated. The Farama Foundation forked it as Gymnasium, now the standard. Gym's `done` flag conflated two concepts: true termination (agent failed/succeeded) and time-limit truncation. Gymnasium fixes this with separate `terminated` and `truncated` signals, which matters for correct value estimation (see below).
- Roboschool is abandoned. MuJoCo became free in 2022, so roboschool is no longer maintained. Gymnasium includes native MuJoCo bindings.
- Python tooling modernized. conda + setup.py → uv + pyproject.toml. Python 3.12+, PyTorch 2.8+. uv emerged as a fast, reliable Python package manager; no more conda environment headaches.
- Old dependencies don't build anymore. The v4 dependency stack (old PyTorch, atari-py, mujoco-py, etc.) won't compile on modern hardware, especially ARM machines (Apple Silicon, AWS Graviton). Many deprecated packages simply don't run. A full rebuild was necessary.
This release updates SLM-Lab to work with modern dependencies while preserving the educational code structure. If you've read the book, the code should still be recognizable.
SLM-Lab uses Gymnasium ALE v5 defaults. The v5 default `repeat_action_probability=0.25` (sticky actions) randomly repeats the agent's previous action to simulate console stochasticity, making evaluation harder but more realistic than the v4 default of 0.0 used by most benchmarks (CleanRL, SB3, RL Zoo). This follows Machado et al. (2018) research best practices. See ALE version history.
| v4 | v5 |
|---|---|
| `conda activate lab && python run_lab.py` | `slm-lab run` |
| `CartPole-v0`, `PongNoFrameskip-v4` | `CartPole-v1`, `ALE/Pong-v5` |
| `RoboschoolHopper-v1` | `Hopper-v5` |
| `agent: [{...}], env: [{...}], body: {...}` | `agent: {...}, env: {...}` |
| `body.state_dim`, `body.memory` | `agent.state_dim`, `agent.memory` |
```
uv sync
uv tool install --editable .
```

Remove array brackets and body section:
```diff
 {
-  "agent": [{ "name": "PPO", ... }],
-  "env": [{ "name": "CartPole-v0", ... }],
-  "body": { "product": "outer", "num": 1 },
+  "agent": { "name": "PPO", ... },
+  "env": { "name": "CartPole-v1", ... },
   "meta": { ... }
 }
```

- Classic control: `v0`/`v1` → current version (`CartPole-v1`, `Pendulum-v1`, `LunarLander-v3`)
- Atari: `PongNoFrameskip-v4` → `ALE/Pong-v5`
- Roboschool → MuJoCo: see Deprecations for full mapping
```
slm-lab run spec.json spec_name train
```

See slm_lab/spec/benchmark/ for updated reference specs.
This matters for understanding the code, not just running it.
Gym's `done` flag was ambiguous: it meant "episode ended", but episodes end for two different reasons:
- Terminated: True end state (CartPole fell, agent died, goal reached)
- Truncated: Time limit hit (MuJoCo's 1000-step cap)
For value estimation, these need different treatment. Terminated means future returns are zero. Truncated means future returns exist but weren't observed—you should bootstrap from V(s').
Gymnasium separates the signals:
```python
# Gym
obs, reward, done, info = env.step(action)

# Gymnasium
obs, reward, terminated, truncated, info = env.step(action)
```

All SLM-Lab algorithms now use terminated for bootstrapping decisions:
```python
# Only zero out future returns on TRUE termination
q_targets = rewards + gamma * (1 - terminateds) * next_q_preds
```

This is why the code stores terminateds and truncateds separately in memory: algorithms need terminated for correct bootstrapping and done for episode boundaries.
This fix particularly matters for time-limited environments like MuJoCo (1000-step limit) where episodes frequently truncate during training. Using done instead of terminated there significantly hurts learning.
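Both uses of the flags can be sketched together. This is an illustration with hypothetical names, using NumPy in place of the actual torch tensors:

```python
import numpy as np

def td_targets_and_dones(rewards, terminateds, truncateds, next_q_preds, gamma=0.99):
    """Sketch: bootstrap through truncation, but not through true termination.

    Flags are 0/1 float arrays. Future returns are zeroed only on TRUE
    termination; episode boundaries combine both flags.
    """
    q_targets = rewards + gamma * (1.0 - terminateds) * next_q_preds
    dones = np.logical_or(terminateds > 0, truncateds > 0)  # episode boundaries
    return q_targets, dones
```

A truncated transition keeps its `gamma * V(s')` bootstrap term (the episode could have continued), while a terminated one drops it.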
For book readers who want to trace through the code:
The Body class was removed. Its responsibilities moved to more natural locations:
```python
# v4
state_dim = agent.body.state_dim
memory = agent.body.memory
env = agent.body.env

# v5
state_dim = agent.state_dim
memory = agent.memory
env = agent.env
```

Training metrics tracking is now in MetricsTracker (what Body was renamed to).
Multi-agent configurations were rarely used. Specs are now flat:
```python
# v4: agent_spec = spec['agent'][0]
# v5: agent_spec = spec['agent']
```

The core design is unchanged:
```
Session → Agent → Algorithm → Network
                ↘ Memory
        → Env
```
PPO: New options for value target handling: `normalize_v_targets`, `symlog_transform` (from DreamerV3), `clip_vloss` (CleanRL-style).
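The symlog transform squashes large-magnitude value targets symmetrically around zero; a minimal sketch of the DreamerV3 formulation, `symlog(x) = sign(x) * ln(1 + |x|)`, with its inverse (the exact option names and placement in SLM-Lab may differ):

```python
import math

def symlog(x: float) -> float:
    """DreamerV3's symmetric-log squashing for value targets."""
    return math.copysign(math.log1p(abs(x)), x)

def symexp(x: float) -> float:
    """Inverse of symlog, used to read predictions back in reward scale."""
    return math.copysign(math.expm1(abs(x)), x)
```

Unlike plain `log`, symlog is defined for negative values and is near-identity around zero, so small rewards pass through almost unchanged while returns in the thousands are compressed.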
SAC: Discrete action support uses exact expectation (Christodoulou 2019). Target entropy auto-calculated.
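The exact-expectation idea: with a discrete action space, the expectation over the policy can be computed in closed form from the action probabilities rather than estimated from sampled actions. A NumPy sketch under that formulation (function names are illustrative; the target-entropy scale 0.98 follows the SAC-Discrete reference implementation):

```python
import numpy as np

def discrete_sac_policy_objective(probs, q1, q2, alpha):
    """Per-state objective E_pi[alpha * log pi - min(Q1, Q2)], computed
    exactly by summing over all actions (Christodoulou, 2019).

    probs, q1, q2: arrays of shape (batch, n_actions).
    """
    log_probs = np.log(probs + 1e-8)  # epsilon guards log(0)
    min_q = np.minimum(q1, q2)
    return np.sum(probs * (alpha * log_probs - min_q), axis=-1)

def discrete_target_entropy(n_actions: int, scale: float = 0.98) -> float:
    # Common heuristic: a fraction of the maximum entropy log(n_actions)
    return scale * np.log(n_actions)
```

Summing over actions removes the variance of the reparameterized sample used in continuous SAC, which is what makes the discrete variant stable.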
Networks: Optional layer_norm for MLP hidden layers. Custom optimizers (Lookahead, RAdam) removed—use native PyTorch AdamW.
All algorithms use terminated (not done) for correct bootstrapping.
All algorithms validated on Gymnasium. Full results in docs/BENCHMARKS.md.
| Category | REINFORCE | SARSA | DQN | DDQN+PER | A2C | PPO | SAC |
|---|---|---|---|---|---|---|---|
| Classic Control | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Box2D | — | — | ✅ | ✅ | ✅ | ✅ | |
| MuJoCo (11 envs) | — | — | — | — | ✅ All | ✅ All | |
| Atari (54 games) | — | — | — | — | ✅ | ✅ | — |
Atari benchmarks use ALE v5 with sticky actions (repeat_action_probability=0.25). PPO tested with lambda variants (0.95, 0.85, 0.70) to optimize per-game performance. A2C uses GAE with lambda 0.95.
Note on scores: Gymnasium environment versions differ from old Gym—some are harder (CartPole-v1 has stricter termination than v0), some have different reward scales (MuJoCo v5 vs roboschool). Targets reference CleanRL and Stable-Baselines3 gymnasium benchmarks.
Hyperparameter search now uses Ray Tune + Optuna + ASHA early stopping:
```
slm-lab run spec.json spec_name search   # Run search locally
```

Add search_scheduler to spec for ASHA early termination of poor trials. See docs/BENCHMARKS.md for search methodology.
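A spec fragment might look like the following. The `search_scheduler` key comes from these notes; the parameter names are taken from Ray Tune's `ASHAScheduler` and may not match SLM-Lab's exact schema or placement:

```json
{
  "meta": {
    "search_scheduler": {
      "grace_period": 1,
      "reduction_factor": 3
    }
  }
}
```

ASHA lets every trial run for at least `grace_period` evaluation rounds, then successively halts the worst-performing fraction (`1 - 1/reduction_factor`) at each rung.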
The CLI uses Typer. Use --help on any command for details:
```
slm-lab --help                         # List all commands
slm-lab run --help                     # Options for run command

# Installation
uv sync                                # Install dependencies
uv tool install --editable .           # Install slm-lab command

# Basic usage
slm-lab run                            # PPO CartPole (default demo)
slm-lab run --render                   # With visualization
slm-lab run spec.json spec_name train  # Train from spec file
slm-lab run spec.json spec_name dev    # Dev mode (shorter run)
slm-lab run spec.json spec_name search # Hyperparameter search

# Variable substitution (for template specs)
slm-lab run -s env=ALE/Breakout-v5 slm_lab/spec/benchmark/ppo/ppo_atari.json ppo_atari train

# Cloud training (dstack + HuggingFace)
slm-lab run-remote --gpu spec.json spec_name train  # Launch on cloud GPU
slm-lab list                           # List experiments on HuggingFace
slm-lab pull spec_name                 # Download results locally

# Utilities
slm-lab run --stop-ray                 # Stop Ray processes
```

Modes: dev (quick test), train (full training), search (hyperparameter search), enjoy (evaluate saved model).
The v4 body spec section and array wrappers (agent: [{...}]) supported multi-agent and multi-environment configurations. These were rarely used and added complexity. v5 simplifies to single-agent single-env, which covers the vast majority of use cases and matches how most RL research is done.
These integrations are removed from the core package. Both ecosystems have their own gymnasium-compatible wrappers now:
- Unity: gymnasium-unity
- VizDoom: vizdoom gymnasium wrapper
You can still use these environments with SLM-Lab by installing their wrappers and specifying the environment name in your spec.
Roboschool is abandoned (MuJoCo became free in 2022). Use gymnasium's native MuJoCo environments instead:
- `RoboschoolHopper-v1` → `Hopper-v5`
- `RoboschoolHalfCheetah-v1` → `HalfCheetah-v5`
- `RoboschoolWalker2d-v1` → `Walker2d-v5`
- `RoboschoolAnt-v1` → `Ant-v5`
- `RoboschoolHumanoid-v1` → `Humanoid-v5`