
[Bug]: nvidia-fs:2.27.3-rhel9.6 image ships only source code, no nvidia-gds-driver binary - nvidia-fs-ctr fails with exit 127 #2416

@friedrice99


Describe the bug
The published nvcr.io/nvidia/cloud-native/nvidia-fs:2.27.3-rhel9.6 image contains only a raw git checkout of the gds-nvidia-fs source code. It does not contain the nvidia-gds-driver orchestrator script that the GPU Operator's daemonset expects.

The Ubuntu variants (2.27.3-ubuntu22.04) ship the script at /usr/local/bin/nvidia-gds-driver (6485 bytes, executable). The RHEL variants do not.

This causes nvidia-fs-ctr in the GPU Operator's nvidia-driver-daemonset to enter CrashLoopBackOff with exit code 127 (command not found).

The PATH workaround documented in #1849 does not apply because the script is not present anywhere in the RHEL image, not merely outside $PATH.
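
As a minimal local sketch of why the container exits with 127: a POSIX shell that tries to exec a command it cannot find terminates with that status. The command name below is a hypothetical stand-in for any missing executable, and the daemonset's real entrypoint may differ in detail.

```shell
# Hypothetical stand-in for an executable that is not installed in the image.
missing_cmd="nvidia-gds-driver-not-installed"

# Run in a child shell so the failed exec does not terminate the current one.
sh -c "exec $missing_cmd install" 2>/dev/null
status=$?

# A shell that cannot find the command to exec exits with status 127.
echo "exit status: $status"
```

Seeing 127 rather than 126 suggests the file is absent or not on $PATH, not merely lacking execute permission.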

To Reproduce

  1. Enable GDS in GPU Operator on a RHEL 9.x host with gds.version: 2.27.3-rhel9.6. nvidia-fs-ctr enters CrashLoopBackOff with exit 127.

  2. Ubuntu variant, which has the orchestrator script:

kubectl run check-ubuntu --rm -it \
   --image=nvcr.io/nvidia/cloud-native/nvidia-fs:2.27.3-ubuntu22.04 \
   --restart=Never --command -- which nvidia-gds-driver

  3. RHEL variant, which has only source code and no orchestrator script:

kubectl run check-rhel --rm -it \
   --image=nvcr.io/nvidia/cloud-native/nvidia-fs:2.27.3-rhel9.6 \
   --restart=Never --command -- bash -c 'find / -name "nvidia-gds-driver*" 2>/dev/null; echo ===; ls /usr/local/gds-nvidia-fs/'
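
The two image checks above could be collapsed into one probe that behaves the same in either image. This is only a sketch: probe_cmd is a hypothetical helper name, and it checks $PATH only, so it complements the find command rather than replacing it.

```shell
# probe_cmd is a hypothetical helper: report whether a command resolves on $PATH.
probe_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "present: $1"
        return 0
    fi
    echo "missing: $1"
    return 1
}

# Prints "missing: nvidia-gds-driver" unless the script is installed locally.
probe_cmd nvidia-gds-driver || true
```

Passing the function body and the final call to the container through --command -- sh -c '...' would run the same probe inside each image.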

Expected behavior
nvcr.io/nvidia/cloud-native/nvidia-fs:2.27.3-rhel9.6 should contain a working orchestrator at the location the GPU Operator daemonset invokes (exec nvidia-gds-driver install), matching what the ubuntu22.04 variant ships.

Environment (please provide the following information):

  • GPU Operator Version: v26.3.0
  • Host OS: RHEL 9.7 (kernel 5.14.0-611.47.1.el9_7.x86_64)
  • Container Runtime: cri-o 1.32
  • Driver: 580.126.09 (open kernel module)
  • GPU: NVIDIA A100 80GB PCIe

Workaround currently in use

Switched gds.version to 2.27.3-ubuntu22.04. On its own this fails on RHEL hosts: the Ubuntu image's nvidia-gds-driver script runs apt-get install linux-headers-${KERNEL_VERSION}, and there is no Ubuntu apt package for kernel 5.14.0-611.47.1.el9_7.x86_64.

Patched the script to symlink kernel headers from the driver container's mount instead:

RUN sed -i '/apt-get -qq install --no-install-recommends linux-headers/c\
      mkdir -p /lib/modules/${KERNEL_VERSION} \&\& ln -sf /run/nvidia/driver/usr/src/kernels/${KERNEL_VERSION} /lib/modules/${KERNEL_VERSION}/build' \
      /usr/local/bin/nvidia-gds-driver

After this patch, make builds nvidia_fs.ko against the host RHEL 9.7 kernel headers (mounted from the driver container) and insmod succeeds. gdscheck -p reports "Platform verification succeeded", and the module loads cleanly.
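
For illustration, the replacement line can be exercised outside the container. The sketch below repeats the same mkdir and ln -sf steps under a temporary directory that stands in for the container root. The sandbox layout and variable names are assumptions for the demo. In the real image the paths are absolute and the headers come from the driver container's mount.

```shell
# Kernel version taken from the report; any version string works for the demo.
KERNEL_VERSION="5.14.0-611.47.1.el9_7.x86_64"

# Temporary directory standing in for the container's root filesystem (assumption).
sandbox=$(mktemp -d)
headers="$sandbox/run/nvidia/driver/usr/src/kernels/$KERNEL_VERSION"
moddir="$sandbox/lib/modules/$KERNEL_VERSION"

# Simulate the kernel headers exposed by the driver container's mount.
mkdir -p "$headers"

# Same two steps the patched script performs in place of the apt-get install.
mkdir -p "$moddir" && ln -sf "$headers" "$moddir/build"

# The build link is where the kernel module Makefile looks for headers.
resolved=$(readlink "$moddir/build")
echo "build -> $resolved"
```

The point of the symlink is that make resolves headers through /lib/modules/${KERNEL_VERSION}/build, so no distribution package manager is needed.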

Cross-reference
#1849 — related but different scenario (Ubuntu host + Ubuntu nvidia-fs PATH workaround for gcc-12).

Labels: bug, needs-triage