Enable mgpu in FrameView #5514

Merged
kellyguo11 merged 8 commits into isaac-sim:develop from pv-nvidia:feat/frame-view-enable-mgpu
May 20, 2026

Contributor

@pv-nvidia pv-nvidia commented May 6, 2026

Description

Removes the cuda:0-only restriction in FabricFrameView. USDRT SelectPrims now accepts any CUDA device index, so Fabric acceleration runs on the simulation device (e.g., cuda:1) instead of silently falling back to the slower USD path. This unblocks distributed training where each process is pinned to a specific GPU.
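
To make the per-process pinning concrete, here is a minimal sketch of how a training process might choose its simulation device. It assumes a torchrun-style launcher that exports the LOCAL_RANK environment variable; the helper name resolve_sim_device is hypothetical and not part of Isaac Lab.

```python
import os


def resolve_sim_device(env=None) -> str:
    """Return the CUDA device string for this process (hypothetical helper).

    Assumes a torchrun-style launcher that exports LOCAL_RANK. Falls back
    to cuda:0 when the variable is absent, as in a single-process run.
    """
    env = os.environ if env is None else env
    local_rank = int(env.get("LOCAL_RANK", "0"))
    if local_rank < 0:
        raise ValueError(f"LOCAL_RANK must be non-negative, got {local_rank}")
    return f"cuda:{local_rank}"
```

Under this PR, a rank that resolves to cuda:1 would stay on the Fabric path rather than falling back to USD.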

Changes:

Type of change

  • New feature (non-breaking change which adds functionality)

cuda:0 continues to work exactly as before; cuda:1+ now also works instead of silently falling back to USD. No public API surface changed.

Checklist

  • I have read and understood the contribution guidelines
  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

Note: this PR uses a fragment file at source/isaaclab_physx/changelog.d/feat-frame-view-enable-mgpu.rst per the fragment-based changelog system.

Test plan

Three new tests gated by ISAACLAB_TEST_MULTI_GPU=1 and parameterized with ["cuda:1"]:

  • test_fabric_cuda1_world_pose_roundtrip: set_world_poses followed by get_world_poses returns the same values on a non-primary CUDA device.
  • test_fabric_cuda1_no_usd_writeback: Fabric writes on cuda:1 do not write back to USD.
  • test_fabric_cuda1_scales_roundtrip: covers the set_scales write path on cuda:1.
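
The environment-variable gate could look roughly like the sketch below. The helper name multi_gpu_tests_enabled and the exact comparison against "1" are assumptions for illustration; the decorator usage is shown only as a comment and is not copied from the PR.

```python
import os


def multi_gpu_tests_enabled(env=None) -> bool:
    """True only when ISAACLAB_TEST_MULTI_GPU is set to exactly "1"."""
    env = os.environ if env is None else env
    return env.get("ISAACLAB_TEST_MULTI_GPU") == "1"


# Assumed usage with pytest (illustrative, not the PR's code):
#
#   @pytest.mark.skipif(
#       not multi_gpu_tests_enabled(),
#       reason="set ISAACLAB_TEST_MULTI_GPU=1 to run multi-GPU tests",
#   )
#   @pytest.mark.parametrize("device", ["cuda:1"])
#   def test_fabric_cuda1_world_pose_roundtrip(device): ...
```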

A dedicated CI workflow (test-fabric-multi-gpu.yaml) runs on the [self-hosted, linux, x64, gpu, multi-gpu] runner with ISAACLAB_TEST_MULTI_GPU=1 set. It pre-flights with nvidia-smi and torch.cuda.device_count() and fails loudly if the runner has fewer than 2 GPUs.
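
A pre-flight check of this kind can be sketched in a few lines. The function name preflight_gpu_count is hypothetical, and the device count is passed in as a parameter so the sketch stays runnable without a GPU; in a real workflow step the caller would presumably pass torch.cuda.device_count().

```python
def preflight_gpu_count(available: int, required: int = 2) -> None:
    """Fail loudly when the runner exposes fewer GPUs than required.

    Hypothetical helper: an uncaught RuntimeError makes the CI step exit
    non-zero before pytest is ever invoked.
    """
    if available < required:
        raise RuntimeError(
            f"multi-GPU pre-flight failed: need >= {required} GPUs, found {available}"
        )


# Assumed call site inside the workflow step:
#   import torch
#   preflight_gpu_count(torch.cuda.device_count())
```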

To verify locally on a multi-GPU machine:

ISAACLAB_TEST_MULTI_GPU=1 ./isaaclab.sh -p -m pytest \
    source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py -v

To verify the cuda:0 path is unchanged (multi-GPU tests auto-skip):

./isaaclab.sh -p -m pytest \
    source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py -v

@github-actions github-actions Bot added the isaac-lab and infrastructure labels May 6, 2026
@pv-nvidia pv-nvidia marked this pull request as draft May 6, 2026 12:32
@pv-nvidia pv-nvidia self-assigned this May 6, 2026
Contributor

greptile-apps Bot commented May 6, 2026

Greptile Summary

This PR removes the cuda:0-only restriction from FabricFrameView, allowing Fabric GPU acceleration on any CUDA device index (e.g. cuda:1), which unblocks distributed training. It also drops the deprecated wp.to_torch() calls in favour of the .torch accessor on ProxyArray, adds three cuda:1-parameterised multi-GPU tests, and ships a dedicated CI workflow with a GPU pre-flight guard.

  • Device allowlist removed: _fabric_supported_devices, the __init__ guard, and the _initialize_fabric assertion are all deleted; fabric_stage.SelectPrims is now called with self._device directly, letting USDRT handle any CUDA index.
  • Return type asymmetry: get_world_poses() wraps its Fabric result in ProxyArray (exposing .torch), but get_scales() still returns a raw wp.array. The new test_fabric_cuda1_scales_roundtrip test calls .torch on that raw array, which will raise AttributeError on the multi-GPU runner and void the intended coverage.
  • Multi-GPU CI workflow: test-fabric-multi-gpu.yaml includes the GPU pre-flight step (torch.cuda.device_count() >= 2) that fails loud before pytest is invoked, addressing the gap called out in a prior review round.

Confidence Score: 4/5

Safe to merge after the get_scales() return-type fix; the multi-GPU test for scales will throw AttributeError at runtime without it.

The core Fabric device-allowlist removal is straightforward and the cuda:0 path is unaffected. The blocking concern is that get_scales() returns a raw wp.array while the new test expects .torch on it — ProxyArray provides .torch but wp.array does not — so test_fabric_cuda1_scales_roundtrip will fail with AttributeError on the multi-GPU runner, defeating its coverage purpose.

fabric_frame_view.py (get_scales return type) and test_views_xform_prim_fabric.py (test_fabric_cuda1_scales_roundtrip) need the matching fix before the multi-GPU runner runs.

Important Files Changed

| Filename | Overview |
| --- | --- |
| source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.py | Removes the cuda:0-only device allowlist and the assertion in _initialize_fabric; drops the CPU fallback guard; adds follow-up TODOs. The get_scales() return type is a raw wp.array while get_world_poses() returns ProxyArray, an asymmetry that affects the new scale tests. |
| source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py | Adds three cuda:1-gated multi-GPU tests and refines _skip_if_unavailable. The scales roundtrip test calls .torch on the return value of get_scales(), which returns a raw wp.array not a ProxyArray, so the accessor may be absent at runtime. |
| .github/workflows/test-fabric-multi-gpu.yaml | New dedicated CI workflow for multi-GPU Fabric tests; includes a GPU pre-flight step that fails loudly if fewer than 2 GPUs are present, closing the gap noted in a previous review. |
| source/isaaclab_physx/changelog.d/feat-frame-view-enable-mgpu.rst | Changelog fragment describing the multi-GPU Fabric fix; accurate and concise. |

Sequence Diagram

sequenceDiagram
    participant Caller
    participant FabricFrameView
    participant USDRTSelectPrims
    participant WarpKernel
    participant UsdFrameView

    Caller->>FabricFrameView: "__init__(device=cuda:N)"
    Note over FabricFrameView: No device allowlist check (removed)

    Caller->>FabricFrameView: set_world_poses(positions)
    alt Fabric enabled
        FabricFrameView->>USDRTSelectPrims: "SelectPrims(device=cuda:N)"
        FabricFrameView->>WarpKernel: launch(compose_fabric_transformation)
        FabricFrameView->>FabricFrameView: _prepare_for_reuse()
    else Fabric disabled
        FabricFrameView->>UsdFrameView: set_world_poses(...)
    end

    Caller->>FabricFrameView: get_scales()
    alt Fabric enabled
        FabricFrameView->>WarpKernel: launch(decompose_fabric_transformation)
        FabricFrameView-->>Caller: wp.array (raw — no ProxyArray wrap)
    else Fabric disabled
        FabricFrameView->>UsdFrameView: get_scales()
        FabricFrameView-->>Caller: result
    end

    Caller->>FabricFrameView: get_world_poses()
    alt Fabric enabled
        FabricFrameView->>WarpKernel: launch(decompose_fabric_transformation)
        FabricFrameView-->>Caller: ProxyArray(positions), ProxyArray(orientations)
    end

Reviews (6): Last reviewed commit: "Split FabricFrameView multi-GPU tests in..."

Comment thread source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.py
Comment thread source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.py
@pv-nvidia pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch 4 times, most recently from a6cd73e to 2c619fe Compare May 7, 2026 08:44
@pv-nvidia pv-nvidia marked this pull request as ready for review May 11, 2026 11:29
Comment thread .github/workflows/test-multi-gpu.yaml Outdated
@pv-nvidia pv-nvidia changed the title Feat/frame view enable mgpu Enable mgpu in FrameView May 12, 2026
@pv-nvidia pv-nvidia changed the title Enable mgpu in FrameView pref: Enable mgpu in FrameView May 12, 2026
@pv-nvidia pv-nvidia added the enhancement label May 12, 2026
@pv-nvidia pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch 4 times, most recently from 1c2e02d to 8de9a39 Compare May 17, 2026 22:23
@pv-nvidia pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch from 8de9a39 to e206ba9 Compare May 20, 2026 14:11
pv-nvidia added 5 commits May 20, 2026 15:35
- Allow FabricFrameView to run on cuda:N for any N; USDRT SelectPrims
  no longer needs cuda:0.
- Refactor the Fabric write path into a single _compose_fabric_transform
  helper shared by set_world_poses, set_scales, and the initial
  USD->Fabric sync, collapsing the sync to one kernel launch with one
  PrepareForReuse.
- Replace the topology-invariant assert with RuntimeError so it survives
  python -O.
- Add multi_gpu pytest marker plus cuda:1 unit-test coverage for both
  Fabric write paths, and run them in the existing test-multi-gpu CI
  job (one extra step, no new job).
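
The assert-to-RuntimeError point can be illustrated with a small sketch. The invariant shown here (a prim count that must not change) and the name check_topology_unchanged are hypothetical stand-ins; the PR's actual topology check is not reproduced.

```python
def check_topology_unchanged(expected_count: int, actual_count: int) -> None:
    """Raise if the selected prim count changed (hypothetical invariant).

    A plain `assert expected_count == actual_count` would be compiled out
    when Python runs with -O; an explicit raise is always executed.
    """
    if expected_count != actual_count:
        raise RuntimeError(
            f"Fabric topology changed: expected {expected_count} prims, "
            f"got {actual_count}"
        )
```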
The standard pytest invocation in CI runs the fabric test file without
filtering on the ``multi_gpu`` marker, so the ``cuda:1`` tests get
scheduled on every runner including the single-GPU ones.  Previously
``_skip_if_unavailable`` hard-failed via ``pytest.fail`` whenever
``GITHUB_ACTIONS=true`` and the requested device was missing, on the
theory that this would catch a misconfigured multi-GPU runner.  In
practice it just broke the standard CI: the dedicated
``test-fabric-multi-gpu`` workflow already pre-flights
``torch.cuda.device_count() >= 2`` before invoking pytest, so a
genuinely misconfigured multi-GPU runner is already caught there.

Always skip rather than fail when the requested ``cuda:N`` index isn't
available.  Drop the now-unused ``import os``.
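
The skip decision described above reduces to comparing the requested index with the visible device count. The following is a sketch under assumptions: cuda_device_available is a hypothetical name, and the count is passed in rather than queried so the example runs without CUDA.

```python
def cuda_device_available(device: str, cuda_count: int) -> bool:
    """Return True if `device` can be satisfied by `cuda_count` visible GPUs.

    Accepts "cpu", "cuda" (treated as index 0), or "cuda:N".
    """
    if device == "cpu":
        return True
    if device == "cuda":
        return cuda_count > 0
    prefix, _, index = device.partition(":")
    if prefix != "cuda" or not index.isdigit():
        raise ValueError(f"unrecognised device string: {device!r}")
    return int(index) < cuda_count


# Assumed use inside a test helper (always skip, never fail):
#   if not cuda_device_available(device, torch.cuda.device_count()):
#       pytest.skip(f"{device} is not available on this runner")
```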
Kit's CLI parser reads sys.argv directly at startup and segfaults on
pytest flags that collide with its own short options.  Running

    pytest -m multi_gpu source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py

crashes during collection because Kit sees ``-m multi_gpu`` and exits
with ``Ill formed parameter: -m`` followed by SIGSEGV (exit code 245)
inside ``simulation_app._start_app``.

Strip sys.argv to argv[0] before instantiating AppLauncher.  The test
file takes no CLI arguments of its own, mirroring the broader pattern
used by ``test_tiled_camera_env.py`` which assigns
``sys.argv[1:] = args_cli.unittest_args`` after argparse.
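
A minimal sketch of the argv strip follows. The helper name strip_cli_args is hypothetical; the commit itself only states that sys.argv is reduced to argv[0] before AppLauncher is instantiated.

```python
def strip_cli_args(argv: list) -> list:
    """Remove everything after argv[0] in place and return what was removed.

    Mutating the list in place matters: a parser that reads sys.argv
    directly sees the same list object.
    """
    removed = argv[1:]
    del argv[1:]
    return removed


# Assumed call site, placed before AppLauncher is instantiated:
#   import sys
#   strip_cli_args(sys.argv)
```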
wp.to_torch on a ProxyArray is deprecated in favor of the .torch
accessor.  Switch the three call sites that consume the ProxyArray
returned by get_world_poses; leave get_scales call sites alone since
that method still returns a raw wp.array (no .torch accessor).
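
One way a caller could tolerate both return types is a small duck-typed helper, sketched below. The name as_torch and the injected fallback converter are assumptions for illustration; the PR does not add such a helper.

```python
def as_torch(array, fallback):
    """Prefer the .torch accessor; otherwise use the supplied converter.

    `fallback` is any callable that turns a raw array into a tensor. In
    real code a caller might pass wp.to_torch for raw wp.array values;
    that wiring is this sketch's assumption.
    """
    torch_view = getattr(array, "torch", None)
    if torch_view is not None:
        return torch_view
    return fallback(array)
```

Objects that expose .torch never reach the fallback, so the deprecated call on a ProxyArray is avoided while raw arrays still convert.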
- Add a GPU-count pre-flight step to the test-fabric-multi-gpu CI job
  so a runner regression to a single GPU fails the workflow instead of
  silently skipping every cuda:1 test. This is what the comment in
  _skip_if_unavailable already promised existed.
- Note that the sys.argv strip in test_views_xform_prim_fabric.py must
  stay between the AppLauncher import and its instantiation; any CLI
  parser or reordering re-exposes Kit to pytest argv and segfaults at
  startup.
- Document the _fabric_usd_sync_done side effect on
  _compose_fabric_transform so callers can see why subsequent getters
  stop pulling from USD.
@pv-nvidia pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch from cf57d31 to a7a6956 Compare May 20, 2026 16:07
@isaaclab-review-bot isaaclab-review-bot Bot left a comment

🤖 Isaac Lab Review Bot — Updated Review (4f262aa)

Commit: 4f262aa6710b19679b5ab94015f0dde9a4fed38b
Previous review: 556b74b (workflow separation in progress)


📋 What Changed Since Last Review

Commit 4f262aa finalizes the workflow separation with a clean split:

| Change | Description |
| --- | --- |
| test-fabric-multi-gpu.yaml | New dedicated workflow (60 lines), self-contained CI for Fabric tests |
| test-multi-gpu.yaml | ✅ Restored to upstream/develop (removed Fabric test job) |
| fabric_frame_view.py | Minor: relocated TODO comments |
| changelog.d/*.rst | Simplified wording |
| test_views_xform_prim_fabric.py | Style cleanup only |

Key improvement: Complete workflow separation. FabricFrameView changes now trigger only test-fabric-multi-gpu.yaml (via path filter), while test-multi-gpu.yaml returns to its upstream state for distributed-training validation. The two workflows are completely decoupled.


✅ Full PR Summary

This PR removes the cuda:0-only restriction from FabricFrameView, enabling Fabric GPU acceleration on any CUDA device. This unblocks distributed training where each rank is pinned to a non-primary GPU (e.g., cuda:1).

🔍 Code Review

Architecture:

  • ✅ Clean removal of _fabric_supported_devices allowlist and associated guards
  • ✅ Minimal, surgical change — core write paths unchanged
  • ✅ Well-scoped TODO comments reference follow-up PRs (#5673, #5674)
  • ✅ Docstrings updated to reflect multi-GPU support

Error Handling:

  • ✅ RuntimeError replaces assert for topology-change invariant (survives python -O)
  • ✅ _skip_if_unavailable() gracefully skips tests on single-GPU runners

Test Coverage:

  • ✅ Three cuda:1-parameterized tests: roundtrip poses, no-writeback, scales roundtrip
  • ✅ New multi_gpu pytest marker registered in pyproject.toml
  • ✅ Kit argv stripping prevents segfault from pytest flags
  • ✅ Uses .torch accessor instead of deprecated wp.to_torch()

CI Design:

  • ✅ test-fabric-multi-gpu.yaml: dedicated 60-line workflow with GPU pre-flight
  • ✅ Path-filtered triggers: only runs on changes to FabricFrameView or its tests
  • ✅ Runner pre-flight (torch.cuda.device_count() >= 2) fails loudly if misconfigured
  • ✅ test-multi-gpu.yaml is byte-identical to upstream/develop, so there is zero risk of regression

🚦 CI Status

| Check | Status |
| --- | --- |
| Build Wheel | ✅ Pass |
| changelog fragments | ✅ Pass |
| Broken Links | ✅ Pass |
| labeler | ✅ Pass |
| pre-commit | ⏳ Pending |
| FabricFrameView multi-GPU tests | ⏳ Pending (self-hosted runner) |
| license-check | ⏳ Pending |
| Installation Tests | ⏳ Pending |
| Docs | ⏳ Pending |

📝 Verdict

LGTM — ready for human review. The workflow separation is now complete and clean. Core functionality is well-tested with appropriate CI coverage. The change is low-risk: cuda:0 path is unchanged, and the new cuda:1 tests provide explicit coverage. Once the pending CI checks pass (particularly the multi-GPU tests on the self-hosted runner), this PR is ready to merge.


Reviewed at: 2026-05-20T16:25 UTC

@pv-nvidia pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch from a7a6956 to 556b74b Compare May 20, 2026 16:22
@isaac-sim isaac-sim deleted a comment from isaaclab-review-bot Bot May 20, 2026
@pv-nvidia pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch from 556b74b to 4f262aa Compare May 20, 2026 16:24
Comment thread source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py Outdated
@pv-nvidia pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch from 4f262aa to cc1d789 Compare May 20, 2026 16:34
@pv-nvidia pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch from cc1d789 to c42ec86 Compare May 20, 2026 16:37
@isaaclab-review-bot isaaclab-review-bot Bot left a comment

🤖 Isaac Lab Review Bot — Updated Review (a039313)

Commit: a03931317a2be2486357370d03326e9c398f86ef
Previous review: c42ec86 → Now fully finalized


📋 What Changed Since Last Review

Two commits since c42ec86:

| Commit | Change |
| --- | --- |
| 60d99b86 | Split FabricFrameView multi-GPU tests into dedicated workflow |
| a0393131 | Disable multi-GPU workflow (no self-hosted runner available) |

Key changes:

  1. Workflow separation complete: test-fabric-multi-gpu.yaml is now a standalone workflow file, completely decoupled from test-multi-gpu.yaml
  2. Workflow disabled: Multi-GPU workflow set to workflow_dispatch only with clear comment explaining why (no multi-GPU runner provisioned yet)
  3. Pytest marker removed: multi_gpu marker dropped from pyproject.toml; tests now use ISAACLAB_TEST_MULTI_GPU=1 env var gate via @pytest.mark.skipif
  4. Changelog simplified: Removed extraneous "Changed" section, now focuses purely on the core fix
  5. Code cleanup: Removed redundant comments in fabric_frame_view.py, uses torch.as_tensor() consistently in tests

✅ Full PR Summary

This PR removes the cuda:0-only restriction from FabricFrameView, enabling Fabric GPU acceleration on any CUDA device (cuda:0, cuda:1, etc.). This unblocks distributed training where each rank is pinned to a non-primary GPU.

🔍 Code Review

Architecture:

  • ✅ Clean removal of _fabric_supported_devices allowlist
  • ✅ Minimal, surgical change — core Warp kernel paths unchanged
  • ✅ TODO comments reference follow-up PRs (#5673, #5674)
  • ✅ Docstrings updated for multi-GPU support

Test Coverage:

  • ✅ Three cuda:1-parameterized tests gated by ISAACLAB_TEST_MULTI_GPU=1
  • ✅ _skip_if_unavailable() gracefully skips on single-GPU runners
  • ✅ Uses torch.as_tensor() for Warp→Torch conversion (consistent API)

CI Design:

  • ✅ test-fabric-multi-gpu.yaml: dedicated workflow (disabled until runner available)
  • ✅ Path-filtered triggers ready for when runner is provisioned
  • ✅ test-multi-gpu.yaml returned to upstream state (zero diff risk)
  • ✅ Env var gating avoids pytest marker complexity

🚦 CI Status

| Check | Status |
| --- | --- |
| pre-commit | ✅ Pass |
| Check changelog fragments | ✅ Pass |
| Build Wheel | ✅ Pass |
| Check for Broken Links | ✅ Pass |
| Detect Changes | ✅ Pass |
| labeler | ✅ Pass |
| Load Config | ✅ Pass |
| Installation Tests | ⏳ Pending |
| Build Latest Docs | ⏳ Pending |
| license-check | ⏳ Pending |

📝 Verdict

LGTM — ready for human review. The workflow separation is clean and complete. The PR is low-risk:

  • cuda:0 behavior unchanged
  • New cuda:1 tests provide explicit coverage (will run when multi-GPU runner is provisioned)
  • Multi-GPU workflow correctly disabled so that runs do not queue indefinitely

Once CI passes, this is ready to merge.


Reviewed at: 2026-05-20T19:34 UTC

Move the test-fabric-multi-gpu job out of test-multi-gpu.yaml and into
a dedicated test-fabric-multi-gpu.yaml.  The two workflows share the
same runner label, install step, and GPU pre-flight, but trigger on
disjoint path sets so changes to FabricFrameView no longer gate the
distributed-training validation and vice versa.

test-multi-gpu.yaml is now byte-identical to upstream/develop.
@pv-nvidia pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch from c42ec86 to 60d99b8 Compare May 20, 2026 16:46
Comment thread source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py
No self-hosted runner with the 'multi-gpu' label is registered.
All runs queue indefinitely. Kept as workflow_dispatch only so it
can be manually triggered once a runner is provisioned.

See also .github/workflows/test-multi-gpu.yaml (same issue).
@kellyguo11 kellyguo11 changed the title pref: Enable mgpu in FrameView Enable mgpu in FrameView May 20, 2026
@kellyguo11 kellyguo11 merged commit aa19b08 into isaac-sim:develop May 20, 2026
64 of 65 checks passed
@hujc7 hujc7 mentioned this pull request May 21, 2026
4 tasks
hujc7 added a commit to hujc7/IsaacLab that referenced this pull request May 26, 2026
The three cuda:1-parameterised tests in test_views_xform_prim_fabric.py
were added by PR isaac-sim#5514 to validate FabricFrameView's SelectPrims path
on non-zero CUDA devices.  They currently hang indefinitely on real
multi-GPU hardware (reproduced locally on 3x RTX 6000 Pro Blackwell
and on the [self-hosted, ..., multi-gpu] runner pool).

Flipping ISAACLAB_TEST_MULTI_GPU=1 in this workflow runs them as
intended.  The 25-min workflow timeout will cancel the job, surfacing
the hang in CI so the FabricFrameView maintainers can iterate on a
fix.  Land this PR once the hang is resolved.
hujc7 added a commit to hujc7/IsaacLab that referenced this pull request May 26, 2026
Re-enables the pull_request trigger in test-fabric-multi-gpu.yaml and
wires it to run the FabricFrameView contract tests (including the
three cuda:1-parameterised variants added in isaac-sim#5514) inside the
pre-built Isaac Lab Docker image on the [self-hosted, ..., multi-gpu]
runner pool.

Setup:
- Image: nvcr.io/nvidian/isaac-lab:latest-develop (published by
  publish-images.yaml on every develop push, bundles Isaac Sim +
  Isaac Lab).  Pulled with --platform linux/amd64 to sidestep a
  multi-arch manifest issue.
- ISAACLAB_TEST_MULTI_GPU=1 enables the cuda:1 tests.
- Workspace mounted + reinstalled --no-deps editable so PR source
  overrides the baked-in copy.

Status: this PR is expected to fail with the 25-min workflow
timeout.  It surfaces the FabricFrameView SelectPrims hang on
non-zero CUDA device indices (reproduced locally on 3x RTX 6000 Pro
Blackwell and on the multi-GPU runner pool).  Land this PR once the
underlying hang in fabric_frame_view.py is fixed.
hujc7 added a commit to hujc7/IsaacLab that referenced this pull request May 26, 2026
Re-enables the pull_request trigger on test-fabric-multi-gpu.yaml and
wires it to run the FabricFrameView contract tests with
ISAACLAB_TEST_MULTI_GPU=1, which activates the three
cuda:1-parameterised tests added in isaac-sim#5514.

The cuda:1 tests target FabricFrameView's SelectPrims path on non-zero
CUDA device indices.  They currently hang indefinitely on real
multi-GPU hardware (reproduced locally on 3x RTX 6000 Pro Blackwell
and on the multi-GPU runner pool); the 60-min workflow timeout will
cancel the job and surface the regression in CI for the
FabricFrameView maintainers.

Install pipeline matches isaac-sim#5738's proven-working layout:
- Pin Python 3.12 via SHA-pinned actions/setup-python.
- Pre-install cmake via pip to skip install.py's sudo apt-get branch.
- ./isaaclab.sh --install none (core only, avoids egl_probe libEGL).
- pip install isaacsim[all,extscache]==${vars.ISAACSIM_BASE_VERSION
  || '6.0.0'} --extra-index-url https://pypi.nvidia.com.
- Bypass Kit's interactive EULA via OMNI_KIT_ACCEPT_EULA / ACCEPT_EULA
  / ISAAC_SIM_HEADLESS.

Status: this PR is expected to fail with the 60-min workflow timeout.
Land once the underlying hang in fabric_frame_view.py is fixed.
hujc7 added a commit to hujc7/IsaacLab that referenced this pull request May 28, 2026
Adds a single helper, cuda_test_devices(), that converts a 3-position
device mask (env-var ISAACLAB_TEST_DEVICES, default '110') into the
list of device strings tests parametrize over.  Single-GPU CI sees no
change (default mask '110' resolves to [cpu, cuda:0], identical to the
hardcoded lists tests carry today).  The new multi-GPU-pytest workflow
sets ISAACLAB_TEST_DEVICES=001 so migrated tests run on cuda:1 only.

Mask grammar: each position is 0 or 1, optional trailing X expands to
all remaining positions. Position 0 -> cpu; position k>=1 -> cuda:{k-1}.
Strict mode raises on missing devices; non-strict returns empty for
opt-in tests that should skip on hosts that can't satisfy them.
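
The mask grammar can be sketched as a small parser. This is a hypothetical re-implementation, not the cuda_test_devices() helper from the commit: the name parse_device_mask is invented, the CUDA device count is passed in explicitly, the trailing X is assumed to enable every remaining position the host can satisfy, and non-strict mode is assumed to return an empty list when any requested device is missing.

```python
def parse_device_mask(mask: str, cuda_count: int, strict: bool = True) -> list:
    """Expand a device mask into device strings (hypothetical sketch).

    Position 0 maps to "cpu"; position k >= 1 maps to f"cuda:{k - 1}".
    """
    expand = mask.endswith("X")
    bits = mask[:-1] if expand else mask
    if any(ch not in "01" for ch in bits):
        raise ValueError(f"invalid device mask: {mask!r}")

    def name(pos: int) -> str:
        return "cpu" if pos == 0 else f"cuda:{pos - 1}"

    devices = [name(pos) for pos, bit in enumerate(bits) if bit == "1"]
    if expand:
        # ASSUMPTION: X turns on all remaining positions, where positions
        # 0..cuda_count cover cpu plus every visible CUDA device.
        devices += [name(pos) for pos in range(len(bits), cuda_count + 1)]

    missing = [
        d for d in devices
        if d != "cpu" and int(d.split(":")[1]) >= cuda_count
    ]
    if missing:
        if strict:
            raise RuntimeError(f"requested devices not available: {missing}")
        return []  # non-strict: the caller is expected to skip
    return devices
```

With the default mask '110' and one visible GPU, this sketch yields cpu and cuda:0, matching the hardcoded lists the commit message describes; '001' on a two-GPU host yields cuda:1 only.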

P0 migration (pure-Python utility tests, no Kit):

* source/isaaclab/test/utils/test_math.py: 45 parametrize sites +
  2 inline for-loops migrated.
* source/isaaclab/test/utils/test_wrench_composer.py: 37 sites.
* source/isaaclab/test/utils/test_episode_data.py: 5 sites.

Each migrated site replaces a hardcoded [cpu, cuda:0] (or the reversed
or tuple form) with cuda_test_devices().  Migration is additive - one
import line per file plus the inline edits.  No test logic changes.

Workflow: .github/workflows/test-multi-gpu-pytest.yaml runs on the
[self-hosted, ..., multi-gpu] pool with ISAACLAB_TEST_DEVICES=001.
Triggered on changes to the helper, the P0 test files, or the
workflow itself.

Excluded scope (to follow up after CI validates this MVP):

* P1 light-Kit tests (test_simulation_context, test_views_xform_prim,
  test_newton_model_utils, test_views_xform_prim_newton).
* P2 asset tests (test_articulation / test_rigid_object on physx and
  newton backends).
* FabricFrameView cuda:1 tests (PR isaac-sim#5514) - separate path, the
  SelectPrims deadlock there is tracked independently.

Reverts the fabric-specific .github/workflows/test-fabric-multi-gpu.yaml
edits that were carried on this branch from the earlier PR scope; that
demo is independent of this framework work.

Labels

enhancement (New feature or request), infrastructure, isaac-lab (Related to Isaac Lab team)

3 participants