ci: add Docker-based FabricFrameView multi-GPU test workflow #5845

Closed
pv-nvidia wants to merge 1 commit into isaac-sim:develop from pv-nvidia:pv/fabric-mgpu-docker-ci

@pv-nvidia
Contributor

Summary

Adds a CI workflow that runs FabricFrameView cuda:1 unit tests on multi-GPU runners using Docker — consistent with how all other test jobs run in build.yaml.

Problem

The previous bare-metal approach installed isaacsim==6.0.0 via pip, which bundled Kit 110.0. This silently ran tests against a stale Kit version instead of the Kit 111.0 shipped in the nvcr.io/nvidian/isaac-sim:latest-develop container image.

Solution

Use the same run-package-tests composite action as all other CI test jobs:

  1. Build/pull the Docker image from ECR via ecr-build-push-pull
  2. Run pytest inside the container (correct Kit version guaranteed)
  3. Volume-mount the workspace so the PR's test source is used

Workflow triggers

  • source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.py
  • source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py
  • .github/workflows/test-fabric-multi-gpu.yaml
  • Manual dispatch
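
The trigger list above amounts to a match on the set of changed files. A minimal Python sketch of that decision follows, assuming exact string comparison only (GitHub's `paths:` filter also accepts glob patterns, which this does not model). `should_trigger` and `TRIGGER_PATHS` are hypothetical names used for illustration and are not part of the PR.

```python
# Hypothetical model of the path-based trigger: exact path matching only.
TRIGGER_PATHS = frozenset({
    "source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.py",
    "source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py",
    ".github/workflows/test-fabric-multi-gpu.yaml",
})


def should_trigger(changed_files, manual_dispatch=False):
    """Return True if the workflow would run for this set of changed files."""
    if manual_dispatch:
        # A manual dispatch runs regardless of which files changed.
        return True
    return any(path in TRIGGER_PATHS for path in changed_files)
```

For example, `should_trigger(["README.md"])` is `False`, while a change set that includes the workflow file itself returns `True`.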

Testing

Validated in PR #5822 — tests pass with the correct Kit 111.0 version.

@github-actions bot added the isaac-lab (Related to Isaac Lab team) and infrastructure labels on May 28, 2026

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

🤖 Isaac Lab Review Bot

Summary

This PR adds a new CI workflow (test-fabric-multi-gpu.yaml) that runs FabricFrameView cuda:1 unit tests on multi-GPU runners using Docker, replacing a bare-metal approach that was silently using a stale Kit version (110.0 instead of 111.0). The implementation correctly reuses existing composite actions (ecr-build-push-pull, run-package-tests) for consistency with the established CI patterns.

Findings

🔵 Suggestion .github/workflows/test-fabric-multi-gpu.yaml:88-89
The GPU verification step hardcodes its threshold: it checks $GPU_COUNT -lt 2 and reports "At least 2 GPUs required." If future tests need more GPUs, the threshold could be parameterized via an env variable for flexibility.

🔵 Suggestion .github/workflows/test-fabric-multi-gpu.yaml:73
The timeout-minutes: 60 is reasonable but consider adding a comment explaining the expected runtime to help future maintainers understand if this is generous or tight.

🔵 Suggestion .github/workflows/test-fabric-multi-gpu.yaml:95-98
The pytest-options includes -v -s --tb=short which is good for debugging. Consider if --tb=long might be more helpful for CI failures where full tracebacks aid diagnosis.
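
The first suggestion above, a GPU threshold read from the environment, could look roughly like this in Python. This is a sketch under stated assumptions: `MIN_GPU_COUNT` is a hypothetical variable name, and the actual workflow step is a shell check rather than Python. The `nvidia-smi --query-gpu=index --format=csv,noheader` invocation prints one line per GPU.

```python
import os
import subprocess


def query_gpu_indices():
    """Run nvidia-smi on the current machine (needs the NVIDIA driver installed)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def count_gpus(nvidia_smi_output):
    """Count GPUs from the query output: one non-empty line per GPU."""
    return sum(1 for line in nvidia_smi_output.splitlines() if line.strip())


def required_gpus(env=None, default=2):
    """Read the minimum GPU count from MIN_GPU_COUNT (hypothetical variable name)."""
    env = os.environ if env is None else env
    return int(env.get("MIN_GPU_COUNT", default))


def check_gpus(nvidia_smi_output, env=None):
    """Raise if fewer GPUs are visible than required; otherwise return the count."""
    found = count_gpus(nvidia_smi_output)
    needed = required_gpus(env)
    if found < needed:
        raise RuntimeError(f"At least {needed} GPUs required, found {found}")
    return found
```

On a runner, `check_gpus(query_gpu_indices())` would perform the check. Keeping the parsing separate from the `nvidia-smi` call lets the threshold logic be exercised without GPU hardware.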

Test Coverage

This is a CI-only change adding a new workflow file and a changelog skip marker. No runtime code changes are introduced. The workflow itself will validate the FabricFrameView tests against the correct Kit version, which was the root cause being fixed.

Verdict

No issues found — This is a well-structured CI workflow that follows established patterns in the repository. The Docker-based approach ensures version consistency with the Isaac Sim container, and the separation from test-multi-gpu.yaml provides good isolation. The suggestions above are minor improvements, not blockers.


@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Review Summary

This PR adds a dedicated CI workflow for FabricFrameView multi-GPU unit tests, migrating from a bare-metal pip-install approach to Docker-based testing for Kit version consistency. The motivation is sound and the implementation follows established patterns from build.yaml.

✅ What Looks Good

  1. Problem diagnosis is correct: The Kit version mismatch (110.0 vs 111.0) is a real issue that could cause silent test failures or false positives.

  2. Docker-based approach is appropriate: Using ecr-build-push-pull and run-package-tests composite actions aligns with the existing CI infrastructure.

  3. Separation from test-multi-gpu.yaml: Keeping this workflow independent prevents FabricFrameView changes from gating distributed training validation (and vice versa) — good for CI velocity.

  4. Selective triggering: The paths: filter ensures this workflow only runs when relevant files change, conserving multi-GPU runner resources.

  5. GPU validation step: The pre-flight check for 2+ GPUs is a good safeguard.


🔍 Findings

1. Runner label mismatch with existing workflows

Severity: Medium
File: .github/workflows/test-fabric-multi-gpu.yaml

The new workflow uses [self-hosted, linux, x64, multi-gpu] (4 labels), but the existing test-multi-gpu.yaml uses [self-hosted, linux, x64, gpu, multi-gpu] (5 labels, includes gpu).

This could cause jobs to be scheduled on different runners. Consider aligning with the existing convention.

2. GPU check runs on host, tests run in container

Severity: Low

The "Verify multi-GPU availability" step runs nvidia-smi on the host, but actual tests run inside a Docker container. While GPU passthrough typically works, consider adding a container-level GPU check.
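
A container-level check could be sketched as follows: build a `docker run` command that counts the GPUs visible inside the container and exits non-zero if there are too few. This is an illustration, not the PR's implementation. It assumes the image ships `bash` and `nvidia-smi`, that overriding the entrypoint is acceptable, and `container_gpu_check_cmd` is a hypothetical helper name.

```python
def container_gpu_check_cmd(image, min_gpus=2):
    """Build a `docker run` argv that fails unless the container sees enough GPUs."""
    inner = (
        # One output line per GPU, counted with wc -l.
        "count=$(nvidia-smi --query-gpu=index --format=csv,noheader | wc -l); "
        'echo "GPUs visible in container: $count"; '
        f'test "$count" -ge {int(min_gpus)}'
    )
    return [
        "docker", "run", "--rm", "--gpus", "all",
        "--entrypoint", "bash",  # assumption: bypass the image's own entrypoint
        image, "-c", inner,
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a host with Docker and the NVIDIA container runtime, so a passthrough failure surfaces before pytest starts.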

3. Consider adding push trigger for develop branch

Severity: Suggestion

The workflow only triggers on pull_request and workflow_dispatch. Consider adding a push trigger for the develop branch to catch regressions after merge.


Overall, this is a well-structured PR that improves CI reliability.


Update (e689c4c): Simplified workflow by removing the config job.

Changes:

  • Removed config job — no longer loads config.yaml for IsaacSim image settings
  • Added ISAACSIM_BASE_IMAGE and ISAACSIM_BASE_VERSION env vars directly
  • Simplified job dependencies (build and test no longer depend on config)

Assessment: ✅ Cleaner workflow structure, consistent with build.yaml pattern. No new issues introduced. Original findings still apply.


Update (07ee20e): Enabled the workflow + extended run-tests action with multi-gpu parameter.

Key Changes:

  • Workflow re-enabled — removed the on: workflow_dispatch guard; now triggers on pull_request for relevant paths
  • Added multi-gpu input to .github/actions/run-tests/action.yml and .github/actions/run-package-tests/action.yml — sets ISAACLAB_TEST_MULTI_GPU=1 inside the container
  • GPU check simplified — now uses direct nvidia-smi query (wc -l) instead of requiring Python/torch
  • Runner labels fixed — changed from [self-hosted, linux, x64, gpu, multi-gpu] (5 labels) to [self-hosted, linux, x64, multi-gpu] (4 labels) — this is now consistent across both build and test jobs
  • Timeout extended — 30 → 60 minutes to account for Docker build time
  • Added changelog skip file source/isaaclab/changelog.d/pv-fabric-mgpu-docker-ci.skip

Assessment: ✅ Addresses prior feedback well. The multi-gpu action parameter is a clean, reusable way to enable multi-GPU tests in any Docker-based workflow. Runner label consistency resolved. Ready to merge once CI passes.
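
On the test side, a flag like ISAACLAB_TEST_MULTI_GPU=1 is typically consumed by skipping multi-GPU tests unless it is set. The sketch below shows that pattern with the standard library's `unittest`. How the real tests in test_views_xform_prim_fabric.py read the variable is not shown in this PR, so `multi_gpu_enabled` and the test class are hypothetical.

```python
import os
import unittest


def multi_gpu_enabled(env=None):
    """True only when ISAACLAB_TEST_MULTI_GPU is exactly "1"."""
    env = os.environ if env is None else env
    return env.get("ISAACLAB_TEST_MULTI_GPU") == "1"


class Cuda1ViewTest(unittest.TestCase):  # hypothetical test case
    @unittest.skipUnless(multi_gpu_enabled(), "set ISAACLAB_TEST_MULTI_GPU=1 to run")
    def test_runs_on_second_gpu(self):
        # Placeholder body: a real test would create the view on this device.
        device = "cuda:1"
        self.assertTrue(device.endswith(":1"))
```

With this gating, single-GPU runners report the test as skipped instead of failing, and the Docker-based workflow opts in by exporting the variable inside the container.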


Update (12d81f6): Restored config job + minor workflow refinements.

Key Changes:

  • Re-added config job — loads isaacsim_image_name and isaacsim_image_tag from config.yaml via yq, aligning with other workflows like build.yaml
  • Removed hardcoded env vars: ISAACSIM_BASE_IMAGE and ISAACSIM_BASE_VERSION replaced with needs.config.outputs.* references
  • Updated job dependencies: build and test-fabric-multi-gpu now depend on config job
  • Added explicit success check — test job has if: needs.build.result == success

Assessment: ✅ This is a sensible refinement. Using the shared config.yaml centralizes Isaac Sim version management (DRY principle) and ensures consistency across workflows. The action input changes from the previous commit (multi-gpu parameter) are preserved. No new issues. Ready to merge.


Update (b9eafbb): Minor test output enhancement.

Changes:

  • Added -v (verbose) flag to pytest invocation in tools/conftest.py — improves test output readability during CI runs

Assessment: ✅ Trivial improvement. Verbose pytest output helps debug failures when they occur. No impact on existing functionality. No new issues.
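
For context, adding `-v` to a pytest subprocess command can be sketched as below. The actual command construction in tools/conftest.py is not shown in this conversation, so `build_pytest_cmd` is a hypothetical stand-in that only illustrates where the flag goes.

```python
import sys


def build_pytest_cmd(test_path, verbose=True, extra_args=()):
    """Build an argv list that runs pytest on one file in a child process."""
    cmd = [sys.executable, "-m", "pytest"]
    if verbose:
        # -v prints one PASSED/FAILED line per test instead of a dot.
        cmd.append("-v")
    cmd.extend(extra_args)
    cmd.append(str(test_path))
    return cmd
```

A caller would pass the result to `subprocess.run`. Extra options such as `--tb=short` go through `extra_args`.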

@pv-nvidia force-pushed the pv/fabric-mgpu-docker-ci branch 2 times, most recently from eabbb40 to c780b2c, on May 28, 2026 at 18:22
@greptile-apps (bot, Contributor) commented May 28, 2026

Greptile Summary

This PR replaces the previous bare-metal isaacsim==6.0.0 pip-install approach for FabricFrameView multi-GPU tests with the same Docker-based workflow pattern used by all other CI jobs, ensuring tests run against the Kit 111.0 version baked into the nvcr.io/nvidian/isaac-sim container rather than a stale Kit 110.0 pip package.

  • Adds a new test-fabric-multi-gpu.yaml workflow with correct NGC_API_KEY, branches filter, and hardcoded image coordinates; triggers on changes to fabric_frame_view.py, its test file, or the workflow itself.
  • Threads a new multi-gpu input (default false) through run-package-tests → run-tests, setting ISAACLAB_TEST_MULTI_GPU=1 inside the container when enabled; the 14-argument function signature in the shell script is consistent with the new positional parameter.

Confidence Score: 5/5

Safe to merge; the Docker-based wiring is correct and all three previously flagged gaps are resolved.

The multi-gpu parameter is threaded correctly through both composite actions, the function argument count matches the updated signature, and the new workflow reuses the same proven ECR/Docker pattern as the rest of CI. The only open point is the uncapped build-job timeout, which is a hardening concern rather than a correctness issue.

.github/workflows/test-fabric-multi-gpu.yaml — the build job has no timeout and runs on the scarce multi-GPU runner.

Important Files Changed

| Filename | Overview |
| --- | --- |
| .github/workflows/test-fabric-multi-gpu.yaml | New workflow for FabricFrameView multi-GPU Docker-based CI; correctly hardcodes NGC key, branches filter, and image versions; build job missing a timeout. |
| .github/actions/run-package-tests/action.yml | Adds multi-gpu input (default false) and threads it through to run-tests; wiring is correct. |
| .github/actions/run-tests/action.yml | Adds multi-gpu input, correctly adds it as the 14th positional argument to run_tests(), and sets ISAACLAB_TEST_MULTI_GPU=1 when true; argument count matches function signature. |
| source/isaaclab/changelog.d/pv-fabric-mgpu-docker-ci.skip | Empty .skip changelog entry; signals this PR does not require a changelog entry. |

Sequence Diagram

sequenceDiagram
    participant PR as Pull Request / workflow_dispatch
    participant Build as build job (multi-gpu runner)
    participant ECR as ECR Registry
    participant Test as test-fabric-multi-gpu job
    participant Docker as Docker Container

    PR->>Build: trigger (paths + branches filter)
    Build->>ECR: ecr-build-push-pull (Dockerfile.base + isaac-sim:latest-develop)
    ECR-->>Build: image pushed (CI_IMAGE_TAG)

    Build-->>Test: needs: build (success gate)
    Test->>Test: "nvidia-smi GPU count check (>=2 required)"
    Test->>ECR: "run-package-tests -> pull CI image"
    ECR-->>Test: image pulled
    Test->>Docker: "docker run --gpus all -e ISAACLAB_TEST_MULTI_GPU=1"
    Docker->>Docker: pytest test_views_xform_prim_fabric.py
    Docker-->>Test: exit code + JUnit XML
    Test->>Test: Check Test Results / upload artifacts

Reviews (2): Last reviewed commit: "ci: remove config job, use env vars like..."

Comment threads (3): .github/workflows/test-fabric-multi-gpu.yaml
@pv-nvidia force-pushed the pv/fabric-mgpu-docker-ci branch 3 times, most recently from 07ee20e to 12d81f6, on May 28, 2026 at 20:01
Replace the bare-metal workflow with a Docker-based approach that uses
the run-package-tests composite action, ensuring tests run against the
Kit version baked into the Isaac Sim container.

Changes:
- .github/workflows/test-fabric-multi-gpu.yaml: rewrite to use
  run-package-tests with Docker, add branches filter and NGC_API_KEY
- .github/actions/run-tests/action.yml: add multi-gpu input that sets
  ISAACLAB_TEST_MULTI_GPU=1 in the container
- .github/actions/run-package-tests/action.yml: pass multi-gpu input
  through to run-tests
- tools/conftest.py: add -v flag to subprocess pytest command so
  individual test PASSED/FAILED lines appear in CI logs

The previous bare-metal pip approach installed isaacsim==6.0.0 which
bundled Kit 110.0, causing cuda:1 tests to hang. Running inside the
Docker container with Kit 111.0 fixes this.
@pv-nvidia force-pushed the pv/fabric-mgpu-docker-ci branch from 12d81f6 to b9eafbb on May 28, 2026 at 21:01
@pv-nvidia closed this on May 29, 2026