ci: add Docker-based FabricFrameView multi-GPU test workflow by pv-nvidia · Pull Request #5845 · isaac-sim/IsaacLab

pv-nvidia · 2026-05-28T18:15:05Z

Summary

Adds a CI workflow that runs FabricFrameView cuda:1 unit tests on multi-GPU runners using Docker — consistent with how all other test jobs run in build.yaml.

Problem

The previous bare-metal approach installed isaacsim==6.0.0 via pip, which bundled Kit 110.0. This silently ran tests against a stale Kit version instead of the Kit 111.0 shipped in the nvcr.io/nvidian/isaac-sim:latest-develop container image.

Solution

Use the same run-package-tests composite action as all other CI test jobs:

Build/pull the Docker image from ECR via ecr-build-push-pull
Run pytest inside the container (correct Kit version guaranteed)
Volume-mount the workspace so the PR's test source is used

Workflow triggers

source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.py
source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py
.github/workflows/test-fabric-multi-gpu.yaml
Manual dispatch

Testing

Validated in PR #5822 — tests pass with the correct Kit 111.0 version.

isaaclab-review-bot

🤖 Isaac Lab Review Bot

Summary

This PR adds a new CI workflow (test-fabric-multi-gpu.yaml) that runs FabricFrameView cuda:1 unit tests on multi-GPU runners using Docker, replacing a bare-metal approach that was silently using a stale Kit version (110.0 instead of 111.0). The implementation correctly reuses existing composite actions (ecr-build-push-pull, run-package-tests) for consistency with the established CI patterns.

Findings

🔵 Suggestion .github/workflows/test-fabric-multi-gpu.yaml:88-89
The GPU verification step checks for $GPU_COUNT -lt 2 but the error message says "At least 2 GPUs required." Consider if future tests might need more GPUs - the check could be parameterized via an env variable for flexibility.

🔵 Suggestion .github/workflows/test-fabric-multi-gpu.yaml:73
The timeout-minutes: 60 is reasonable but consider adding a comment explaining the expected runtime to help future maintainers understand if this is generous or tight.

🔵 Suggestion .github/workflows/test-fabric-multi-gpu.yaml:95-98
The pytest-options includes -v -s --tb=short which is good for debugging. Consider if --tb=long might be more helpful for CI failures where full tracebacks aid diagnosis.

Test Coverage

This is a CI-only change adding a new workflow file and a changelog skip marker. No runtime code changes are introduced. The workflow itself will validate the FabricFrameView tests against the correct Kit version, which was the root cause being fixed.

Verdict

No issues found — This is a well-structured CI workflow that follows established patterns in the repository. The Docker-based approach ensures version consistency with the Isaac Sim container, and the separation from test-multi-gpu.yaml provides good isolation. The suggestions above are minor improvements, not blockers.

isaaclab-review-bot

Review Summary

This PR adds a dedicated CI workflow for FabricFrameView multi-GPU unit tests, migrating from a bare-metal pip-install approach to Docker-based testing for Kit version consistency. The motivation is sound and the implementation follows established patterns from build.yaml.

✅ What Looks Good

Problem diagnosis is correct: The Kit version mismatch (110.0 vs 111.0) is a real issue that could cause silent test failures or false positives.
Docker-based approach is appropriate: Using ecr-build-push-pull and run-package-tests composite actions aligns with the existing CI infrastructure.
Separation from test-multi-gpu.yaml: Keeping this workflow independent prevents FabricFrameView changes from gating distributed training validation (and vice versa) — good for CI velocity.
Selective triggering: The paths: filter ensures this workflow only runs when relevant files change, conserving multi-GPU runner resources.
GPU validation step: The pre-flight check for 2+ GPUs is a good safeguard.

🔍 Findings

1. Runner label mismatch with existing workflows

Severity: Medium
File: .github/workflows/test-fabric-multi-gpu.yaml

The new workflow uses [self-hosted, linux, x64, multi-gpu] (4 labels), but the existing test-multi-gpu.yaml uses [self-hosted, linux, x64, gpu, multi-gpu] (5 labels, includes gpu).

This could cause jobs to be scheduled on different runners. Consider aligning with the existing convention.

2. GPU check runs on host, tests run in container

Severity: Low

The "Verify multi-GPU availability" step runs nvidia-smi on the host, but actual tests run inside a Docker container. While GPU passthrough typically works, consider adding a container-level GPU check.

3. Consider adding `push` trigger for develop branch

Severity: Suggestion

The workflow only triggers on pull_request and workflow_dispatch. Consider adding a push trigger for the develop branch to catch regressions after merge.

Overall, this is a well-structured PR that improves CI reliability.

Update (e689c4c): Simplified workflow by removing the config job.

Changes:

Removed config job — no longer loads config.yaml for IsaacSim image settings
Added ISAACSIM_BASE_IMAGE and ISAACSIM_BASE_VERSION env vars directly
Simplified job dependencies (build and test no longer depend on config)

Assessment: ✅ Cleaner workflow structure, consistent with build.yaml pattern. No new issues introduced. Original findings still apply.

Update (07ee20e): Enabled the workflow + extended run-tests action with multi-gpu parameter.

Key Changes:

Workflow re-enabled — removed the on: workflow_dispatch guard; now triggers on pull_request for relevant paths
Added multi-gpu input to .github/actions/run-tests/action.yml and .github/actions/run-package-tests/action.yml — sets ISAACLAB_TEST_MULTI_GPU=1 inside the container
GPU check simplified — now uses direct nvidia-smi query (wc -l) instead of requiring Python/torch
Runner labels fixed — changed from [self-hosted, linux, x64, gpu, multi-gpu] (5 labels) to [self-hosted, linux, x64, multi-gpu] (4 labels) — this is now consistent across both build and test jobs
Timeout extended — 30 → 60 minutes to account for Docker build time
Added changelog skip file source/isaaclab/changelog.d/pv-fabric-mgpu-docker-ci.skip

Assessment: ✅ Addresses prior feedback well. The multi-gpu action parameter is a clean, reusable way to enable multi-GPU tests in any Docker-based workflow. Runner label consistency resolved. Ready to merge once CI passes.

Update (12d81f6): Restored config job + minor workflow refinements.

Key Changes:

Re-added config job — loads isaacsim_image_name and isaacsim_image_tag from config.yaml via yq, aligning with other workflows like build.yaml
Removed hardcoded env vars — ISAACSIM_BASE_IMAGE and ISAACSIM_BASE_VERSION replaced with needs.config.outputs.* references
Updated job dependencies — build and test-fabric-multi-gpu now depend on config job
Added explicit success check — test job has if: needs.build.result == success

Assessment: ✅ This is a sensible refinement. Using the shared config.yaml centralizes Isaac Sim version management (DRY principle) and ensures consistency across workflows. The action input changes from the previous commit (multi-gpu parameter) are preserved. No new issues. Ready to merge.

Update (b9eafbb): Minor test output enhancement.

Changes:

Added -v (verbose) flag to pytest invocation in tools/conftest.py — improves test output readability during CI runs

Assessment: ✅ Trivial improvement. Verbose pytest output helps debug failures when they occur. No impact on existing functionality. No new issues.

greptile-apps · 2026-05-28T18:23:18Z

Greptile Summary

This PR replaces the previous bare-metal isaacsim==6.0.0 pip-install approach for FabricFrameView multi-GPU tests with the same Docker-based workflow pattern used by all other CI jobs, ensuring tests run against the Kit 111.0 version baked into the nvcr.io/nvidian/isaac-sim container rather than a stale Kit 110.0 pip package.

Adds a new test-fabric-multi-gpu.yaml workflow with correct NGC_API_KEY, branches filter, and hardcoded image coordinates; triggers on changes to fabric_frame_view.py, its test file, or the workflow itself.
Threads a new multi-gpu input (default false) through run-package-tests → run-tests, setting ISAACLAB_TEST_MULTI_GPU=1 inside the container when enabled; the 14-argument function signature in the shell script is consistent with the new positional parameter.

Confidence Score: 5/5

Safe to merge; the Docker-based wiring is correct and all three previously flagged gaps are resolved.

The multi-gpu parameter is threaded correctly through both composite actions, the function argument count matches the updated signature, and the new workflow reuses the same proven ECR/Docker pattern as the rest of CI. The only open point is the uncapped build-job timeout, which is a hardening concern rather than a correctness issue.

.github/workflows/test-fabric-multi-gpu.yaml — the build job has no timeout and runs on the scarce multi-GPU runner.

Important Files Changed

Filename	Overview
.github/workflows/test-fabric-multi-gpu.yaml	New workflow for FabricFrameView multi-GPU Docker-based CI; correctly hardcodes NGC key, branches filter, and image versions; build job missing a timeout.
.github/actions/run-package-tests/action.yml	Adds multi-gpu input (default false) and threads it through to run-tests; wiring is correct.
.github/actions/run-tests/action.yml	Adds multi-gpu input, correctly adds it as the 14th positional argument to run_tests(), and sets ISAACLAB_TEST_MULTI_GPU=1 when true; argument count matches function signature.
source/isaaclab/changelog.d/pv-fabric-mgpu-docker-ci.skip	Empty .skip changelog entry; signals this PR does not require a changelog entry.

Sequence Diagram

sequenceDiagram
    participant PR as Pull Request / workflow_dispatch
    participant Build as build job (multi-gpu runner)
    participant ECR as ECR Registry
    participant Test as test-fabric-multi-gpu job
    participant Docker as Docker Container

    PR->>Build: trigger (paths + branches filter)
    Build->>ECR: ecr-build-push-pull (Dockerfile.base + isaac-sim:latest-develop)
    ECR-->>Build: image pushed (CI_IMAGE_TAG)

    Build-->>Test: needs: build (success gate)
    Test->>Test: "nvidia-smi GPU count check (>=2 required)"
    Test->>ECR: "run-package-tests -> pull CI image"
    ECR-->>Test: image pulled
    Test->>Docker: "docker run --gpus all -e ISAACLAB_TEST_MULTI_GPU=1"
    Docker->>Docker: pytest test_views_xform_prim_fabric.py
    Docker-->>Test: exit code + JUnit XML
    Test->>Test: Check Test Results / upload artifacts

_{Reviews (2): Last reviewed commit: "ci: remove config job, use env vars like..." | Re-trigger Greptile}

Replace the bare-metal workflow with a Docker-based approach that uses the run-package-tests composite action, ensuring tests run against the Kit version baked into the Isaac Sim container. Changes: - .github/workflows/test-fabric-multi-gpu.yaml: rewrite to use run-package-tests with Docker, add branches filter and NGC_API_KEY - .github/actions/run-tests/action.yml: add multi-gpu input that sets ISAACLAB_TEST_MULTI_GPU=1 in the container - .github/actions/run-package-tests/action.yml: pass multi-gpu input through to run-tests - tools/conftest.py: add -v flag to subprocess pytest command so individual test PASSED/FAILED lines appear in CI logs The previous bare-metal pip approach installed isaacsim==6.0.0 which bundled Kit 110.0, causing cuda:1 tests to hang. Running inside the Docker container with Kit 111.0 fixes this.

pv-nvidia requested review from hhansen-bdai and kellyguo11 as code owners May 28, 2026 18:15

github-actions Bot added isaac-lab Related to Isaac Lab team infrastructure labels May 28, 2026

isaaclab-review-bot Bot reviewed May 28, 2026

View reviewed changes

pv-nvidia force-pushed the pv/fabric-mgpu-docker-ci branch 2 times, most recently from eabbb40 to c780b2c Compare May 28, 2026 18:22

greptile-apps Bot reviewed May 28, 2026

View reviewed changes

Comment thread .github/workflows/test-fabric-multi-gpu.yaml

Comment thread .github/workflows/test-fabric-multi-gpu.yaml

Comment thread .github/workflows/test-fabric-multi-gpu.yaml

pv-nvidia force-pushed the pv/fabric-mgpu-docker-ci branch 3 times, most recently from 07ee20e to 12d81f6 Compare May 28, 2026 20:01

pv-nvidia force-pushed the pv/fabric-mgpu-docker-ci branch from 12d81f6 to b9eafbb Compare May 28, 2026 21:01

pv-nvidia closed this May 29, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ci: add Docker-based FabricFrameView multi-GPU test workflow#5845

ci: add Docker-based FabricFrameView multi-GPU test workflow#5845
pv-nvidia wants to merge 1 commit into
isaac-sim:developfrom
pv-nvidia:pv/fabric-mgpu-docker-ci

pv-nvidia commented May 28, 2026

Uh oh!

isaaclab-review-bot Bot left a comment

Uh oh!

isaaclab-review-bot Bot left a comment •

edited

Loading

Uh oh!

greptile-apps Bot commented May 28, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

pv-nvidia commented May 28, 2026

Summary

Problem

Solution

Workflow triggers

Testing

Uh oh!

isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

🤖 Isaac Lab Review Bot

Summary

Findings

Test Coverage

Verdict

Uh oh!

isaaclab-review-bot Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Review Summary

✅ What Looks Good

🔍 Findings

1. Runner label mismatch with existing workflows

2. GPU check runs on host, tests run in container

3. Consider adding push trigger for develop branch

Uh oh!

greptile-apps Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

isaaclab-review-bot Bot left a comment •

edited

Loading

3. Consider adding `push` trigger for develop branch

greptile-apps Bot commented May 28, 2026 •

edited

Loading