Project: verl-llm-tandem · Tandem Reinforcement Learning (TRL)
Authority: This file is the single source of truth for naming. The forthcoming sweeping rename (SPEC §2 procedure, batched as one commit) converges every identifier in tracked code to the target column below. Until that commit lands, the codebase still uses the legacy spellings — both columns are valid to read; only the target column is valid to write.
| Paper concept | Legacy spelling(s) in code | Target spelling | Rationale |
|---|---|---|---|
| Trainable model π_sen | primary, hot, model_a, target_* (in TandemConfig) |
senior |
Match paper §3.2 |
| Frozen partner π_jun | frozen, model_b |
junior |
Match paper §3.2 |
| Per-token authorship indicator (1=senior, 0=junior) | tandem_model_mask, model_mask, model_a_mask |
authorship_mask |
Match paper §A.1.1 "authorship stream"; eliminates 3-way ambiguity |
| Tandem Reinforcement Learning (the method) | "TT" (tandem training), "tandem native GRPO" | TRL | Match paper title and Algorithm 1 |
| Senior-handoff probability (Bernoulli p) | prob_primary |
prob_senior |
Mirror rename of primary→senior |
| Subword-span cap (paper K=32) | max_gap_tokens |
max_gap_tokens (unchanged) |
Already paper-aligned in spirit; rename adds no clarity |
| Junior-token loss-weight coefficient | tandem_jr_tkn_weight |
junior_token_loss_weight |
Paper λ_jun is a different mechanism (auxiliary KL term), but reaches the same fixed point at 0; renamed to reflect what it actually does in our code (response-mask weighting) |
| Word-boundary handoff strategy | selection_strategy="word" |
unchanged | Already paper-aligned |
| Legacy | Target | File |
|---|---|---|
FROZEN_PREFIX = "tandem_frozen." |
JUNIOR_PREFIX = "tandem_junior." |
vllm_source/vllm/v1/worker/tandem.py |
is_frozen_layer(name) |
is_junior_layer(name) |
vllm_source/vllm/v1/worker/tandem.py; callers in gpu_model_runner.py |
get_frozen_tp_group() |
get_junior_tp_group() |
vllm_source/vllm/distributed/parallel_state.py; callers in gpu_worker.py, tandem.py |
use_frozen_tp() |
use_junior_tp() |
same as above |
_frozen_tp_context() (helper) |
_junior_tp_context() |
vllm_source/vllm/v1/worker/tandem.py |
TandemModelManager.frozen_forward() |
TandemModelManager.junior_forward() |
vllm_source/vllm/v1/worker/tandem.py; caller in gpu_model_runner.py |
TandemModelManager.load_frozen_model() |
.load_junior_model() |
same |
TandemModelManager.initialize_frozen_kv_cache() |
.initialize_junior_kv_cache() |
same |
TandemModelManager.get_frozen_model() |
.get_junior_model() |
same |
TandemModelManager.get_frozen_device() |
.get_junior_device() |
same |
The class is a stable public surface — renaming its fields breaks every run script and every saved checkpoint's config blob. We rename anyway; pre-rename run scripts and checkpoint configs become invalid without manual migration (noted in §5). No backward-compat shim is added.
| Legacy field | Target field |
|---|---|
frozen_model |
junior_model |
frozen_model_revision |
junior_model_revision |
prob_primary |
prob_senior |
frozen_tensor_parallel_size |
junior_tensor_parallel_size |
frozen_gpu_devices |
junior_gpu_devices |
frozen_gpu_memory_utilization |
junior_gpu_memory_utilization |
frozen_dtype |
junior_dtype |
frozen_quantization |
junior_quantization |
frozen_enforce_eager |
junior_enforce_eager |
frozen_max_model_len |
junior_max_model_len |
target_model_config |
senior_model_config |
target_parallel_config |
senior_parallel_config |
selection_strategy (unchanged) |
selection_strategy |
boundary_token_ids (unchanged) |
boundary_token_ids |
chunk_size (unchanged) |
chunk_size |
max_gap_tokens (unchanged) |
max_gap_tokens |
enabled (unchanged) |
enabled |
| Legacy | Target |
|---|---|
forward(primary_logits, frozen_logits, ...) |
forward(senior_logits, junior_logits, ...) |
primary_output, frozen_output (locals) |
senior_output, junior_output |
primary_tokens, frozen_tokens |
senior_tokens, junior_tokens |
use_primary (return of _select) |
use_senior |
self.prob_primary (attribute) |
self.prob_senior |
_word_state cache tuple (is_senior, ...) (unchanged — already paper-aligned) |
same |
| Legacy | Target |
|---|---|
self.frozen_model |
self.junior_model |
self.frozen_device |
self.junior_device |
self.frozen_kv_caches |
self.junior_kv_caches |
self.primary_vllm_config |
self.senior_vllm_config |
frozen_attn_metadata (local) |
junior_attn_metadata |
frozen_input_ids, frozen_positions, frozen_logits_indices (locals in frozen_forward) |
junior_input_ids, junior_positions, junior_logits_indices |
frozen_sfc, primary_sfc (locals) |
junior_sfc, senior_sfc |
frozen_attn_count (local) |
junior_attn_count |
frozen_layers, frozen_kv_caches (locals in initialize_frozen_kv_cache) |
junior_layers, junior_kv_caches |
frozen_vllm_config, frozen_config, frozen_hf_config, frozen_local_rank |
junior_* (mechanical rename) |
The mask passes through five files in the vLLM output chain plus two in verl. Every spelling collapses to one target name.
| Legacy field/var | Target | Files |
|---|---|---|
tandem_model_mask |
authorship_mask |
vllm_source/vllm/outputs.py, v1/outputs.py, v1/engine/__init__.py, v1/engine/output_processor.py, v1/core/sched/scheduler.py, v1/worker/gpu_model_runner.py, v1/sample/tandem_sampler.py |
new_tandem_model_mask (per-step delta) |
new_authorship_mask |
v1/engine/__init__.py, v1/engine/output_processor.py, v1/core/sched/scheduler.py |
tandem_mask (local var) |
authorship_mask |
same |
model_mask (verl DataProto batch key) |
authorship_mask |
verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py, verl/workers/actor/dp_actor.py |
model_a_mask (legacy verl key from HF-loop scratch path) |
dropped (HF-loop scratch path is gitignored R&D; the live verl path uses only the renamed authorship_mask) |
dp_actor.py:382 — drop the model_a_mask branch entirely |
| Legacy | Target |
|---|---|
tandem_jr_tkn_weight (config field on ActorConfig) |
junior_token_loss_weight |
self.config.get("tandem_jr_tkn_weight", 0.0) |
self.config.get("junior_token_loss_weight", 0.0) |
+actor_rollout_ref.actor.tandem_jr_tkn_weight=… in run scripts |
+actor_rollout_ref.actor.junior_token_loss_weight=… |
Note: the paper's λ_jun (§3.4) is a different mechanism — an auxiliary KL term toward π_jun on junior-emitted positions. Our code instead reweights junior-emitted positions in the standard policy-gradient mask. The two coincide at the value 0 (senior-only loss). The renamed junior_token_loss_weight reflects the mechanism actually implemented, not the paper's λ_jun. TERMINOLOGY-§3 elaborates.
| Legacy | Target |
|---|---|
tandem/primary_token_fraction |
tandem/senior_token_fraction |
tandem/frozen_token_fraction |
tandem/junior_token_fraction |
tandem/switches_per_seq (unchanged) |
same |
tandem/tokens_per_sent (unchanged) |
same |
Already paper-aligned (SENIOR_MODEL, JUNIOR_MODEL) — no rename. Only the hydra-override keys change, mirroring §2.2 and §2.6.
Two different mechanisms that both collapse to the paper's senior-only objective when set to 0.
Paper (§3.4, eq. 1): senior is trained with the GRPO clipped-surrogate objective, where the response mask is narrowed by m_{j,i} (1 at senior positions, 0 at junior positions). Junior-emitted tokens contribute no gradient. The paper additionally mentions a separate auxiliary loss term — a soft KL pull of π_sen toward π_jun on junior-emitted positions, weighted by coefficient λ_jun — and explicitly sets λ_jun = 0 to keep the per-token loss formally identical to vanilla GRPO.
Our code (verl/workers/actor/dp_actor.py:417-434): the policy-gradient loss multiplies the response mask elementwise by a weight vector built from the authorship mask: senior positions get weight 1; junior positions get weight junior_token_loss_weight. With junior_token_loss_weight=0, this exactly reproduces the paper's senior-only mask. With junior_token_loss_weight>0, junior positions contribute to the standard policy-gradient sum at reduced weight — this is not the paper's λ_jun (which would be a separate KL term), but a related "soft mask" that interpolates between senior-only (0) and joint (1) GRPO loss.
For paper reproduction: set junior_token_loss_weight=0. The off-paper values 0.15 and 0.2 found in run_tandem_native_grpo_{math,gsm8k}.sh are R&D-only ablations and are documented to cause a training cliff (model_spec/results/2026-04-29_tt-math-takeaways.md).
These remain on legacy names by design — listed explicitly so the rename pass skips them.
scratch/tandem/prototypes (tandem_rollout.py,tandem_rollout_optimized.py,tandem_rollout_vllm.py,tandem_rollout_ray.py,tandem_rollout_shared_kv.py,tandem_worker.py,frozen_model_actor.py): R&D history, gitignored viascratch/in.gitignore. Themodel_a/model_bandfrozen_model_actornaming there is left alone.- Strategy literal values (
"bernoulli","chunk","alternating","sentence","word"): user-facing config strings; renaming them would invalidate every run-script invocation and result file. They stay. - Module/file names (
tandem_sampler.py,tandem.py,TandemSampler,TandemModelManager,TandemConfig,TandemSelectionStrategy): theTandemprefix is the paper-blessed name of the method, so it stays. Only the internal primary/frozen vocabulary is replaced with senior/junior. - Upstream vLLM and verl identifiers we did not author (
SamplerOutput,SamplingMetadata,DataProto, etc.): out of scope. max_gap_tokens,chunk_size,boundary_token_ids,selection_strategy,enabled: already paper-aligned or paper-neutral.- Archived shell scripts under
archive/andverl/verl/archive/: gitignored, not touched.
The rename pass is a one-shot hard break — no compatibility shims:
- Run scripts: any hydra override referencing
prob_primary,frozen_model,frozen_gpu_devices,tandem_jr_tkn_weight, etc. must be updated. All six trackedrun_*.shscripts inverl/will be updated in the same commit as the rename. - Checkpoint configs: pre-rename checkpoints under
scratch/checkpoints/havetandem_configblobs serialized with the legacy field names. Reloading them after the rename requires a one-time JSON edit (a small migration helper script will live atscratch/migrate_tandem_config.pyif needed; out of scope for the rename commit). - wandb dashboards: any saved view filtering on
tandem/primary_token_fractionortandem/frozen_token_fractionwill need the metric key updated. Historical runs retain the legacy keys; new runs emit the renamed keys. - External code reading our authorship mask (e.g., the
tandem_eval/suite): the rollout DataProto key changes frommodel_masktoauthorship_mask. The eval scripts will be updated in the same commit.
After the rename commit lands, all newly authored code in this repo MUST:
- Use
senior/juniorfor the two models, neverprimary/frozen/hot/target/model_a/model_b. - Use
authorship_maskfor the per-token authorship indicator, nevermodel_mask/model_a_mask/tandem_model_mask. - Use
prob_seniorfor the Bernoulli handoff parameter (notprob_primary). - Refer to the paper's auxiliary KL coefficient (when discussing it textually) as λ_jun; refer to our actual implementation knob as
junior_token_loss_weightand note the semantic distinction (§3) when ambiguity matters. - Refer to the overall method as Tandem Reinforcement Learning (TRL) in user-facing prose;
tandem_*identifiers and theTandem*class prefix remain.