
Respect optional IsaacLab action bounds #36

Open
HiccupRL wants to merge 1 commit into typoverflow:master from HiccupRL:codex-conditional-action-bound-clipping

Conversation

@HiccupRL

Summary

  • Pass raw actions through the IsaacLab on-policy trainer when action_bound is unset.
  • Keep the existing normalised-action path when action_bound is set by clipping to [-1, 1] before the environment wrapper scales the action.
  • Mark action_bound as Optional[float] so null is a valid configuration value.
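The behaviour described in these bullets can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `prepare_action` is a hypothetical helper name, and it assumes that the environment wrapper performs the scaling by `action_bound` after the trainer has clipped.

```python
from typing import List, Optional, Sequence


def prepare_action(action: Sequence[float], action_bound: Optional[float]) -> List[float]:
    """Hypothetical helper: clip only when an action bound is configured."""
    if action_bound is None:
        # No bound configured: pass the raw policy output through unchanged.
        return list(action)
    # Bound configured: keep the normalised-action contract by clipping to
    # [-1, 1]. Scaling by action_bound is assumed to happen in the wrapper.
    return [max(-1.0, min(1.0, a)) for a in action]
```

With `action_bound=None` the values come back untouched, while any float bound restores the previous clipping behaviour.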

Rationale

IsaacLab does not impose a global [-1, 1] action limit for every task. Some tasks use unbounded policy actions and then apply task-specific scaling, action-term clipping, or actuator limits. The previous trainer behaviour clipped actions unconditionally before calling env.step, even when the wrapper was configured with action_bound=None.

That made action_bound=None ineffective and could change the interaction semantics for policies that intentionally emit actions outside [-1, 1], such as flow-based on-policy agents. With this change, the environment adapter remains responsible for bounded-action semantics: if action_bound is provided, actions are clipped and scaled; if it is not provided, actions are passed through unchanged.

Validation

  • Ran a syntax check with Python compile() for:
    • examples/online/main_isaaclab_onpolicy.py
    • flowrl/config/online/onpolicy_isaaclab_config.py
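A syntax check of this kind might look like the sketch below. The helper names are hypothetical, and the exact command used for this PR is not shown in the description. Note that `compile()` only parses the source: it does not import or execute anything, so it cannot catch missing modules or runtime errors.

```python
from pathlib import Path
from typing import Optional


def check_source(source: str, filename: str) -> Optional[str]:
    """Return a short error message if the source fails to parse, else None."""
    try:
        # "exec" mode parses a whole module without running it.
        compile(source, filename, "exec")
    except SyntaxError as exc:
        return f"{filename}:{exc.lineno}: {exc.msg}"
    return None


def check_file(path: str) -> Optional[str]:
    """Read a file from disk and run the same syntax check on it."""
    return check_source(Path(path).read_text(encoding="utf-8"), path)
```

Calling `check_file` on each of the two paths above and confirming that both return `None` would reproduce the check.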

@HiccupRL HiccupRL closed this Apr 22, 2026
@HiccupRL HiccupRL reopened this Apr 22, 2026
@typoverflow
Owner

Hi,

Thanks for catching this. Yes, I was aware that the action spaces of IsaacLab environments are not necessarily bounded. The reason I imposed a bound here is that, in standard algorithms like PPO, the output range of our policy is always bounded to [-1, 1] because of tanh squashing. In diffusion policies, it is also very common to bound the generated actions within a certain range, and we enabled this option (clip_samples=true) for every diffusion-based algorithm. Given the range of PPO policies, I decided to set this range to [-1, 1] as well. Accordingly, we have to impose some range on the action space of the envs so that our algorithms behave normally.

We set an individual action_range for each environment (see the config file list) and rescale the [-1, 1] action to that range in the environment wrapper.

if self.action_bound is not None:

That said, I did not rigorously ablate the effect of this action clipping for tasks with unbounded ranges... Do you have specific observations where no action clipping performs better?

@HiccupRL
Author

Thanks for the explanation. My concern is that this may not be a good default for all IsaacLab tasks. Unlike in MuJoCo or OGBench, IsaacLab actions are often interpreted through task-specific scales, offsets, or controllers, and some environments such as isaac-Humanoid can work better with the native/unbounded action interface. This may also matter for diffusion-policy methods like GenPO, where modelling the natural action scale can be beneficial.

So I would suggest making action clipping/rescaling optional and environment-specific, rather than enforcing [-1, 1] globally for IsaacLab.

@typoverflow
Owner

Hey @HiccupRL, thanks for the further explanation! We will launch a battery of experiments without action range and action clipping. Just one more question: for PPO, would you suggest using an unbounded action distribution (e.g. Gaussian instead of TanhGaussian) in that case?

@HiccupRL
Author

My view is that we should only clip actions when the environment itself enforces action bounds. Environments such as MuJoCo or OGBench may raise an error if an action falls outside [-1, 1], whereas Isaac Lab does not. In practice, leaving actions unclipped can yield better performance on some tasks, especially Humanoid. You can verify this experimentally.

@typoverflow
Owner

I launched some experiments yesterday without an action range, and they seem to outperform the ones with action clipping. I will finalize the experiments and post the results over the coming week. We will merge this PR then.

Thanks again!

@HiccupRL
Author

Thanks a lot for the update, and also for carefully organizing this benchmark.

One small reminder: if we remove the action range / tanh constraint, the corresponding config files should also be updated accordingly. Also, for PPO, the log likelihood computation needs to be changed after removing tanh squashing, so that the likelihood ratio is computed under the actual action distribution.
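To illustrate the log-likelihood point, here is a minimal sketch of the two densities for a diagonal Gaussian policy. The function names are hypothetical and this is not the repository's PPO code. With tanh squashing, the action is a = tanh(u) and the log density carries a change-of-variables correction; without squashing, the plain Gaussian log density of the emitted action is the one the ratio should use.

```python
import math
from typing import Sequence


def gaussian_log_prob(u: Sequence[float], mean: Sequence[float], std: Sequence[float]) -> float:
    """Log density of a diagonal Gaussian evaluated at u."""
    return sum(
        -0.5 * ((x - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        for x, m, s in zip(u, mean, std)
    )


def _softplus(z: float) -> float:
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def tanh_gaussian_log_prob(u: Sequence[float], mean: Sequence[float], std: Sequence[float]) -> float:
    """Log density of a = tanh(u), where u is the pre-squash Gaussian sample."""
    # log(1 - tanh(x)^2) rewritten as 2 * (log 2 - x - softplus(-2x)) for stability.
    log_det = sum(2.0 * (math.log(2.0) - x - _softplus(-2.0 * x)) for x in u)
    return gaussian_log_prob(u, mean, std) - log_det
```

If the policy switches from TanhGaussian to Gaussian, the old and new log probabilities in the PPO ratio should both come from `gaussian_log_prob` evaluated at the action actually sent to the environment, with no Jacobian term.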
