ci: add Docker-based FabricFrameView multi-GPU test workflow#5845
ci: add Docker-based FabricFrameView multi-GPU test workflow#5845pv-nvidia wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
🤖 Isaac Lab Review Bot
Summary
This PR adds a new CI workflow (test-fabric-multi-gpu.yaml) that runs FabricFrameView cuda:1 unit tests on multi-GPU runners using Docker, replacing a bare-metal approach that was silently using a stale Kit version (110.0 instead of 111.0). The implementation correctly reuses existing composite actions (ecr-build-push-pull, run-package-tests) for consistency with the established CI patterns.
Findings
🔵 Suggestion .github/workflows/test-fabric-multi-gpu.yaml:88-89
The GPU verification step checks for $GPU_COUNT -lt 2 but the error message says "At least 2 GPUs required." Consider if future tests might need more GPUs - the check could be parameterized via an env variable for flexibility.
🔵 Suggestion .github/workflows/test-fabric-multi-gpu.yaml:73
The timeout-minutes: 60 is reasonable but consider adding a comment explaining the expected runtime to help future maintainers understand if this is generous or tight.
🔵 Suggestion .github/workflows/test-fabric-multi-gpu.yaml:95-98
The pytest-options includes -v -s --tb=short which is good for debugging. Consider if --tb=long might be more helpful for CI failures where full tracebacks aid diagnosis.
Test Coverage
This is a CI-only change adding a new workflow file and a changelog skip marker. No runtime code changes are introduced. The workflow itself will validate the FabricFrameView tests against the correct Kit version, which was the root cause being fixed.
Verdict
No issues found — This is a well-structured CI workflow that follows established patterns in the repository. The Docker-based approach ensures version consistency with the Isaac Sim container, and the separation from test-multi-gpu.yaml provides good isolation. The suggestions above are minor improvements, not blockers.
There was a problem hiding this comment.
Review Summary
This PR adds a dedicated CI workflow for FabricFrameView multi-GPU unit tests, migrating from a bare-metal pip-install approach to Docker-based testing for Kit version consistency. The motivation is sound and the implementation follows established patterns from build.yaml.
✅ What Looks Good
-
Problem diagnosis is correct: The Kit version mismatch (110.0 vs 111.0) is a real issue that could cause silent test failures or false positives.
-
Docker-based approach is appropriate: Using
ecr-build-push-pullandrun-package-testscomposite actions aligns with the existing CI infrastructure. -
Separation from
test-multi-gpu.yaml: Keeping this workflow independent prevents FabricFrameView changes from gating distributed training validation (and vice versa) — good for CI velocity. -
Selective triggering: The
paths:filter ensures this workflow only runs when relevant files change, conserving multi-GPU runner resources. -
GPU validation step: The pre-flight check for 2+ GPUs is a good safeguard.
🔍 Findings
1. Runner label mismatch with existing workflows
Severity: Medium
File: .github/workflows/test-fabric-multi-gpu.yaml
The new workflow uses [self-hosted, linux, x64, multi-gpu] (4 labels), but the existing test-multi-gpu.yaml uses [self-hosted, linux, x64, gpu, multi-gpu] (5 labels, includes gpu).
This could cause jobs to be scheduled on different runners. Consider aligning with the existing convention.
2. GPU check runs on host, tests run in container
Severity: Low
The "Verify multi-GPU availability" step runs nvidia-smi on the host, but actual tests run inside a Docker container. While GPU passthrough typically works, consider adding a container-level GPU check.
3. Consider adding push trigger for develop branch
Severity: Suggestion
The workflow only triggers on pull_request and workflow_dispatch. Consider adding a push trigger for the develop branch to catch regressions after merge.
Overall, this is a well-structured PR that improves CI reliability.
Update (e689c4c): Simplified workflow by removing the config job.
Changes:
- Removed
configjob — no longer loadsconfig.yamlfor IsaacSim image settings - Added
ISAACSIM_BASE_IMAGEandISAACSIM_BASE_VERSIONenv vars directly - Simplified job dependencies (build and test no longer depend on config)
Assessment: ✅ Cleaner workflow structure, consistent with build.yaml pattern. No new issues introduced. Original findings still apply.
Update (07ee20e): Enabled the workflow + extended run-tests action with multi-gpu parameter.
Key Changes:
- Workflow re-enabled — removed the
on: workflow_dispatchguard; now triggers onpull_requestfor relevant paths - Added
multi-gpuinput to.github/actions/run-tests/action.ymland.github/actions/run-package-tests/action.yml— setsISAACLAB_TEST_MULTI_GPU=1inside the container - GPU check simplified — now uses direct
nvidia-smiquery (wc -l) instead of requiring Python/torch - Runner labels fixed — changed from
[self-hosted, linux, x64, gpu, multi-gpu](5 labels) to[self-hosted, linux, x64, multi-gpu](4 labels) — this is now consistent across both build and test jobs - Timeout extended — 30 → 60 minutes to account for Docker build time
- Added changelog skip file
source/isaaclab/changelog.d/pv-fabric-mgpu-docker-ci.skip
Assessment: ✅ Addresses prior feedback well. The multi-gpu action parameter is a clean, reusable way to enable multi-GPU tests in any Docker-based workflow. Runner label consistency resolved. Ready to merge once CI passes.
Update (12d81f6): Restored config job + minor workflow refinements.
Key Changes:
- Re-added
configjob — loadsisaacsim_image_nameandisaacsim_image_tagfromconfig.yamlviayq, aligning with other workflows likebuild.yaml - Removed hardcoded env vars —
ISAACSIM_BASE_IMAGEandISAACSIM_BASE_VERSIONreplaced withneeds.config.outputs.*references - Updated job dependencies —
buildandtest-fabric-multi-gpunow depend onconfigjob - Added explicit success check — test job has
if: needs.build.result == success
Assessment: ✅ This is a sensible refinement. Using the shared config.yaml centralizes Isaac Sim version management (DRY principle) and ensures consistency across workflows. The action input changes from the previous commit (multi-gpu parameter) are preserved. No new issues. Ready to merge.
Update (b9eafbb): Minor test output enhancement.
Changes:
- Added
-v(verbose) flag to pytest invocation intools/conftest.py— improves test output readability during CI runs
Assessment: ✅ Trivial improvement. Verbose pytest output helps debug failures when they occur. No impact on existing functionality. No new issues.
eabbb40 to
c780b2c
Compare
Greptile SummaryThis PR replaces the previous bare-metal
Confidence Score: 5/5Safe to merge; the Docker-based wiring is correct and all three previously flagged gaps are resolved. The multi-gpu parameter is threaded correctly through both composite actions, the function argument count matches the updated signature, and the new workflow reuses the same proven ECR/Docker pattern as the rest of CI. The only open point is the uncapped build-job timeout, which is a hardening concern rather than a correctness issue. .github/workflows/test-fabric-multi-gpu.yaml — the build job has no timeout and runs on the scarce multi-GPU runner. Important Files Changed
Sequence DiagramsequenceDiagram
participant PR as Pull Request / workflow_dispatch
participant Build as build job (multi-gpu runner)
participant ECR as ECR Registry
participant Test as test-fabric-multi-gpu job
participant Docker as Docker Container
PR->>Build: trigger (paths + branches filter)
Build->>ECR: ecr-build-push-pull (Dockerfile.base + isaac-sim:latest-develop)
ECR-->>Build: image pushed (CI_IMAGE_TAG)
Build-->>Test: needs: build (success gate)
Test->>Test: "nvidia-smi GPU count check (>=2 required)"
Test->>ECR: "run-package-tests -> pull CI image"
ECR-->>Test: image pulled
Test->>Docker: "docker run --gpus all -e ISAACLAB_TEST_MULTI_GPU=1"
Docker->>Docker: pytest test_views_xform_prim_fabric.py
Docker-->>Test: exit code + JUnit XML
Test->>Test: Check Test Results / upload artifacts
Reviews (2): Last reviewed commit: "ci: remove config job, use env vars like..." | Re-trigger Greptile |
07ee20e to
12d81f6
Compare
Replace the bare-metal workflow with a Docker-based approach that uses the run-package-tests composite action, ensuring tests run against the Kit version baked into the Isaac Sim container. Changes: - .github/workflows/test-fabric-multi-gpu.yaml: rewrite to use run-package-tests with Docker, add branches filter and NGC_API_KEY - .github/actions/run-tests/action.yml: add multi-gpu input that sets ISAACLAB_TEST_MULTI_GPU=1 in the container - .github/actions/run-package-tests/action.yml: pass multi-gpu input through to run-tests - tools/conftest.py: add -v flag to subprocess pytest command so individual test PASSED/FAILED lines appear in CI logs The previous bare-metal pip approach installed isaacsim==6.0.0 which bundled Kit 110.0, causing cuda:1 tests to hang. Running inside the Docker container with Kit 111.0 fixes this.
12d81f6 to
b9eafbb
Compare
Summary
Adds a CI workflow that runs FabricFrameView
cuda:1unit tests on multi-GPU runners using Docker — consistent with how all other test jobs run inbuild.yaml.Problem
The previous bare-metal approach installed
isaacsim==6.0.0via pip, which bundled Kit 110.0. This silently ran tests against a stale Kit version instead of the Kit 111.0 shipped in thenvcr.io/nvidian/isaac-sim:latest-developcontainer image.Solution
Use the same
run-package-testscomposite action as all other CI test jobs:ecr-build-push-pullWorkflow triggers
source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.pysource/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py.github/workflows/test-fabric-multi-gpu.yamlTesting
Validated in PR #5822 — tests pass with the correct Kit 111.0 version.