Respect optional IsaacLab action bounds #36
Hi, thanks for catching this. Yes, I was aware that the action spaces of IsaacLab environments are not necessarily bounded. I imposed a bound here because, in standard algorithms like PPO, our policy's output is always bounded to [-1, 1] by tanh squashing. In diffusion policies it is also common to bound the generated actions within a fixed range, and we enabled this option (`clip_samples=true`) for every diffusion-based algorithm. Given the range of PPO policies, I set this range to [-1, 1] as well. Accordingly, we have to impose a range on each environment's action space so that our algorithms behave normally. We set an individual `action_range` for each environment (see the config file list) and rescale the [-1, 1] action to that range in the environment wrapper (`flow-rl/flowrl/env/online/isaaclab_env.py`, line 78 at commit `b1385e0`). That said, I did not rigorously ablate the effect of this action clipping on tasks with unbounded ranges. Do you have specific observations where no action clipping performs better?
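A minimal sketch of the rescaling step described in the comment above, assuming each task's `action_range` is given as per-dimension `low`/`high` bounds. The function name `rescale_action` and that bound layout are hypothetical; this is not the repository's wrapper code.

```python
from typing import List, Sequence


def rescale_action(
    action: Sequence[float], low: Sequence[float], high: Sequence[float]
) -> List[float]:
    """Map each component of a [-1, 1] policy action affinely onto [low_i, high_i].

    Hypothetical helper: assumes per-dimension bounds, which may differ from
    how the real config stores `action_range`.
    """
    if not (len(action) == len(low) == len(high)):
        raise ValueError("action, low and high must have the same length")
    out = []
    for a, lo, hi in zip(action, low, high):
        a = max(-1.0, min(1.0, a))  # clip first so out-of-range samples stay in bounds
        out.append(lo + 0.5 * (a + 1.0) * (hi - lo))  # -1 -> lo, +1 -> hi
    return out
```

For example, `rescale_action([-1.0, 0.0, 1.0], [-2.0] * 3, [2.0] * 3)` yields `[-2.0, 0.0, 2.0]`. Because of the initial clip, any policy output beyond [-1, 1] is silently saturated at the range edges, which is exactly the behaviour under discussion for unbounded tasks.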
Thanks for the explanation. My concern is that this may not be a good default for all IsaacLab tasks. Different from … So I would suggest making action clipping/rescaling optional and environment-specific, rather than enforcing …
Hey @HiccupRL, thanks for the further explanation! We will launch a battery of experiments without the action range and action clipping. One more question: for PPO, do you suggest using an unbounded action distribution (e.g., a Gaussian instead of a TanhGaussian) in that case?
My view is that we should only clip actions when the environment itself enforces action bounds. Environments such as …
I launched some experiments yesterday without the action range, and they seem to outperform the runs with action clipping. I will finalize the experiments and post the results over the coming week, and we will merge this PR then. Thanks again!
Thanks a lot for the update, and also for carefully organizing this benchmark. One small reminder: if we remove the action range / tanh constraint, the corresponding config files should be updated as well. Also, for PPO, the log-likelihood computation needs to change once tanh squashing is removed, so that the likelihood ratio is computed under the actual action distribution.
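To make the log-likelihood point concrete, here is a minimal sketch in plain Python of the two densities involved; all function names are hypothetical and not taken from the repository. With an unbounded Gaussian policy, the executed action is the Gaussian sample `u` itself, so its log-probability is the plain Gaussian log-density. With a TanhGaussian, the executed action is `a = tanh(u)`, and the change of variables subtracts `log(1 - tanh(u)^2)` per dimension. That correction depends only on `u`, so it cancels in the PPO ratio when the same `u` is reused for both policies; it still matters wherever the log-probability is used on its own (for example, sample-based entropy estimates), and recovering `u` from a stored action via `atanh` is undefined once actions leave (-1, 1).

```python
import math
from typing import Sequence

LOG_2PI = math.log(2.0 * math.pi)


def gaussian_log_prob(u: Sequence[float], mean: Sequence[float], std: Sequence[float]) -> float:
    """Log-density of a diagonal Gaussian at u (the action itself when unbounded)."""
    return sum(
        -0.5 * ((x - m) / s) ** 2 - math.log(s) - 0.5 * LOG_2PI
        for x, m, s in zip(u, mean, std)
    )


def _softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def tanh_gaussian_log_prob(u: Sequence[float], mean: Sequence[float], std: Sequence[float]) -> float:
    """Log-density of a = tanh(u), u ~ N(mean, std^2), via change of variables."""
    # log|d tanh(u)/du| = log(1 - tanh(u)^2) = 2 * (log 2 - u - softplus(-2u))
    log_det = sum(2.0 * (math.log(2.0) - x - _softplus(-2.0 * x)) for x in u)
    return gaussian_log_prob(u, mean, std) - log_det


def ppo_ratio(new_logp: float, old_logp: float) -> float:
    """PPO importance ratio pi_new(a|s) / pi_old(a|s) from log-probabilities."""
    return math.exp(new_logp - old_logp)
```

The stable `softplus` form of the Jacobian term avoids computing `1 - tanh(u)**2` directly, which underflows to zero for large `|u|`.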
Summary
- … `action_bound` is unset.
- … `action_bound` is set by clipping to `[-1, 1]` before the environment wrapper scales the action.
- … `action_bound` as `Optional[float]` so `null` is a valid configuration value.

Rationale
IsaacLab does not impose a global `[-1, 1]` action limit for every task. Some tasks use unbounded policy actions and then apply task-specific scaling, action-term clipping, or actuator limits. The previous trainer behaviour clipped actions unconditionally before calling `env.step`, even when the wrapper was configured with `action_bound=None`.

That made `action_bound=None` ineffective and could change the interaction semantics for policies that intentionally emit actions outside `[-1, 1]`, such as flow-based on-policy agents. With this change, the environment adapter remains responsible for bounded-action semantics: if `action_bound` is provided, actions are clipped and scaled; if it is not provided, actions are passed through unchanged.

Validation
- `compile()` for:
  - `examples/online/main_isaaclab_onpolicy.py`
  - `flowrl/config/online/onpolicy_isaaclab_config.py`
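The adapter contract described under Rationale can be sketched as follows. This is an illustration, not the PR's code: `EnvConfig` and `prepare_action` are hypothetical names, and the sketch assumes that "scaling" means multiplying the clipped action by a scalar `action_bound`, which may differ from the wrapper's actual mapping.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class EnvConfig:
    # Hypothetical config: None (YAML `null`) disables clipping and scaling.
    action_bound: Optional[float] = None


def prepare_action(action: Sequence[float], action_bound: Optional[float]) -> List[float]:
    """Adapter-side action handling before env.step (illustrative only)."""
    if action_bound is None:
        # Unbounded task: pass the policy output through unchanged.
        return list(action)
    # Bounded task: clip to [-1, 1], then scale (assumed: multiply by action_bound).
    return [action_bound * max(-1.0, min(1.0, a)) for a in action]
```

Keeping the `None` branch in the adapter, rather than clipping unconditionally in the trainer, is what lets a `null` config value actually reach the environment as an unbounded action.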