Describe the bug
The published nvcr.io/nvidia/cloud-native/nvidia-fs:2.27.3-rhel9.6 image contains only a raw git checkout of the gds-nvidia-fs source code. It does not contain the nvidia-gds-driver orchestrator script that the GPU Operator's daemonset expects.
The Ubuntu variants (2.27.3-ubuntu22.04) ship the binary at /usr/local/bin/nvidia-gds-driver (6485 bytes, executable). The RHEL variants do not.
This causes nvidia-fs-ctr in the GPU Operator's nvidia-driver-daemonset to fail in CrashLoopBackOff.
The PATH workaround documented in #1849 does not apply because the binary is not present anywhere in the RHEL image, not just outside $PATH.
To Reproduce
- Enable GDS in the GPU Operator on a RHEL 9.x host with gds.version: 2.27.3-rhel9.6. nvidia-fs-ctr enters CrashLoopBackOff with exit code 127.
- Ubuntu variant (has the orchestrator binary):
kubectl run check-ubuntu --rm -it \
--image=nvcr.io/nvidia/cloud-native/nvidia-fs:2.27.3-ubuntu22.04 \
--restart=Never --command -- which nvidia-gds-driver
- RHEL variant (only source code, no binary):
kubectl run check-rhel --rm -it \
--image=nvcr.io/nvidia/cloud-native/nvidia-fs:2.27.3-rhel9.6 \
--restart=Never --command -- bash -c 'find / -name "nvidia-gds-driver*" 2>/dev/null; echo ===; ls /usr/local/gds-nvidia-fs/'
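The exit code 127 reported in the first step is the status a POSIX shell returns when the command it was asked to run cannot be found, which is consistent with the orchestrator being absent from the image. A minimal local sketch of that behaviour, with no cluster needed; `nvidia-gds-driver-missing` is a deliberately nonexistent, hypothetical command name standing in for the absent script, and `run_entrypoint` is a helper made up for this illustration:

```shell
#!/bin/sh
# Illustration only: mimics an entrypoint that does `exec <command> install`.
# "nvidia-gds-driver-missing" is a hypothetical name chosen so that it is
# guaranteed not to exist on PATH.
run_entrypoint() {
    # Run the command via exec in a child shell and print its exit status.
    sh -c 'exec "$@"' sh "$@" 2>/dev/null
    echo $?
}

missing_status=$(run_entrypoint nvidia-gds-driver-missing install)
present_status=$(run_entrypoint true)

echo "missing command -> exit ${missing_status}"
echo "present command -> exit ${present_status}"
```

If the script were present but not executable, the shell would report 126 instead, so 127 points at a missing file or a PATH problem, not a permissions problem.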
Expected behavior
nvcr.io/nvidia/cloud-native/nvidia-fs:2.27.3-rhel9.6 should contain a working orchestrator at the location the GPU Operator daemonset invokes (exec nvidia-gds-driver install), matching what the ubuntu22.04 variant ships.
Environment (please provide the following information):
- GPU Operator Version: v26.3.0
- Host OS: RHEL 9.7 (kernel 5.14.0-611.47.1.el9_7.x86_64)
- Container Runtime: cri-o 1.32
- Driver: 580.126.09 (open kernel module)
- GPU: NVIDIA A100 80GB PCIe
Workaround currently in use
Switched gds.version to 2.27.3-ubuntu22.04. The Ubuntu image's nvidia-gds-driver script runs apt-get install linux-headers-${KERNEL_VERSION}, which fails on RHEL hosts because the host kernel version (5.14.0-611.47.1.el9_7.x86_64) does not correspond to any Ubuntu apt package.
Patched the script to symlink kernel headers from the driver container's mount instead:
RUN sed -i '/apt-get -qq install --no-install-recommends linux-headers/c\
mkdir -p /lib/modules/${KERNEL_VERSION} \&\& ln -sf /run/nvidia/driver/usr/src/kernels/${KERNEL_VERSION} /lib/modules/${KERNEL_VERSION}/build' \
/usr/local/bin/nvidia-gds-driver
After this patch, make builds nvidia_fs.ko against the host RHEL 9.7 kernel headers (mounted from driver container) and insmod succeeds. gdscheck -p reports Platform verification succeeded. Module loads cleanly.
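For readers who want to see what the replaced line does in isolation, here is a rough sketch of the same header-linking step as a standalone function. The `root` argument and the `link_kernel_headers` name are hypothetical, added only so the logic can be tried in a scratch directory; the patched script operates on `/` directly and does not include the existence check shown here.

```shell
#!/bin/sh
# Sketch of the header-linking step from the sed patch above.
# "root" is a hypothetical prefix for testing; the real script uses "/".
link_kernel_headers() {
    root=$1
    kver=$2
    src="${root}/run/nvidia/driver/usr/src/kernels/${kver}"
    dst_dir="${root}/lib/modules/${kver}"

    # Extra guard (not in the original patch): avoid creating a dangling
    # link if the driver container has not exposed the headers yet.
    if [ ! -d "$src" ]; then
        echo "kernel headers not found at ${src}" >&2
        return 1
    fi

    # -n (GNU/BSD ln) replaces an existing "build" symlink instead of
    # creating a new link inside the directory it points to.
    mkdir -p "$dst_dir" && ln -sfn "$src" "${dst_dir}/build"
}

# Try it against a scratch tree that imitates the driver container mount.
scratch=$(mktemp -d)
kver="5.14.0-611.47.1.el9_7.x86_64"
mkdir -p "${scratch}/run/nvidia/driver/usr/src/kernels/${kver}"
link_kernel_headers "$scratch" "$kver"
readlink "${scratch}/lib/modules/${kver}/build"
```

The kernel build system looks for headers under /lib/modules/${KERNEL_VERSION}/build, which is why pointing that path at the driver container's copy is enough for make to proceed.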
Cross-reference
#1849 — related but different scenario (Ubuntu host + Ubuntu nvidia-fs PATH workaround for gcc-12).