chore(release): bump component versions for 26.05#1344

Open
dholt wants to merge 9 commits into master from dholt/release-26.05-version-bumps

Conversation

@dholt
Contributor

@dholt dholt commented May 18, 2026

Summary

  • Prepare DeepOps 26.05 by refreshing stale component pins after the long release gap.
  • Bump Kubespray, Slurm, Spack, NVIDIA GPU stack roles, monitoring images/charts, registry image, Network Operator, MIG Manager, and docker registry cache.
  • Update stale Kubernetes inventory group-name references in docs from kube-master/kube-node to kube_control_plane/kube_node.
  • Move MAAS provisioning to the maintained upstream ansible-maas role pin and align the default MAAS PPA/config variable with current docs.
  • Set the generic NVIDIA driver branch to R580 and add an Ubuntu open-kernel-module switch for newer GPU support.
  • Make Slurm controller accounting setup rerunnable across repeated smoke runs.
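For anyone carrying an older inventory forward, the group rename from kube-master/kube-node to kube_control_plane/kube_node can be sketched as a small text rewrite. This is an illustrative helper, not part of this PR: the function name `rename_inventory_groups`, the `GROUP_RENAMES` table, and the sample inventory are all hypothetical, and it assumes an INI-style Ansible inventory.

```python
import re

# Old-to-new inventory group names, as listed in the summary above.
GROUP_RENAMES = {
    "kube-master": "kube_control_plane",
    "kube-node": "kube_node",
}


def rename_inventory_groups(text: str) -> str:
    """Rename legacy groups in section headers and in ':children' member lists.

    Host lines in ordinary sections are left untouched, so a host that happens
    to be called 'kube-master' is not rewritten.
    """
    out = []
    in_children = False
    for line in text.splitlines():
        header = re.fullmatch(r"\[([^\]:]+)(:\w+)?\]\s*", line)
        if header:
            name, suffix = header.group(1), header.group(2) or ""
            in_children = suffix == ":children"
            line = f"[{GROUP_RENAMES.get(name, name)}{suffix}]"
        elif in_children and line.strip() in GROUP_RENAMES:
            # Inside a ':children' section each entry is a group name.
            line = GROUP_RENAMES[line.strip()]
        out.append(line)
    # splitlines() drops the final newline, so restore it if it was there.
    return "\n".join(out) + ("\n" if text.endswith("\n") else "")


# Hypothetical inventory used only to exercise the helper.
legacy_inventory = (
    "[kube-master]\nnode1\n\n"
    "[kube-node]\nnode2\n\n"
    "[k8s_cluster:children]\nkube-master\nkube-node\n"
)
print(rename_inventory_groups(legacy_inventory))
```

Review the rewritten inventory by hand before using it; playbooks or group_vars directories that refer to the old names would need the same treatment.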

Branch State

Base: origin/master
Head: f0913480

Commits:

f0913480 Fix MAAS requirements and document staged upgrades
e6ae743b fix(k8s): run Helm installer with bash
908555ea fix(k8s): update network operator role for current chart
a5c9a974 fix(k8s): use mapping vars for NFS role include
56be7831 fix(slurm): make controller setup rerunnable
ed9cda3c fix(nvidia): support Ubuntu open kernel modules
ff1d89d3 chore(maas): update role dependency for Ansible 10
252dd33f docs: update Kubernetes inventory group names
e9bcfa44 chore(release): bump component versions for 26.05

Validation

  • bash -n scripts/k8s/deploy_ingress.sh
  • bash -n scripts/k8s/deploy_monitoring.sh
  • YAML parse of modified defaults/vars files with PyYAML
  • git diff --check
  • Role linting and selected playbook syntax checks
  • Version audit against the updated component pins
  • Selected upstream release asset and Helm chart availability checks
  • MAAS provisioning smoke: playbooks/provisioning/maas.yml passes, services are active, the MAAS API reports version information, and API login succeeds.
  • GPU driver smoke: playbooks/nvidia-software/nvidia-driver.yml passes with Ubuntu R580 open kernel module packages and nvidia-smi detects the GPU.
  • Single-node Slurm smoke: playbooks/slurm-cluster.yml passes, Slurm services are active, sinfo reports one GPU resource, and srun --gpus=1 nvidia-smi completes successfully.
  • Single-node Kubernetes/GPU Operator smoke: playbooks/k8s-cluster.yml passes, GPU Operator validation passes, and a CUDA nvidia-smi pod completes successfully.
  • Upgrade-path validation from 23.08: the 23.08 Kubernetes baseline deploy passed with Kubernetes v1.26.5 and a pod smoke. The direct upgrade to 26.05 stops in current Kubespray prechecks because the baseline Calico v3.25.1 is below the current minimum v3.27.0, so older clusters should use staged upgrades or redeploy rather than a direct jump.
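The Calico minimum-version failure described in the upgrade-path item comes down to a numeric version comparison. The following is a minimal sketch of that kind of check, not Kubespray's actual precheck; the helper names `parse_version` and `meets_minimum` are hypothetical, and only the two version strings come from the validation notes above.

```python
def parse_version(tag: str) -> tuple[int, ...]:
    """Turn a tag such as 'v3.25.1' into (3, 25, 1) for numeric comparison."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True when the installed version is at or above the minimum."""
    return parse_version(installed) >= parse_version(minimum)


# Versions reported in the 23.08 -> 26.05 upgrade-path validation.
print(meets_minimum("v3.25.1", "v3.27.0"))  # False
```

Comparing integer tuples rather than raw strings avoids misordering tags such as v3.9.0 and v3.27.0.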

Risk / Follow-Up

  • Slurm jumps from 23.02 to 25.11. Single-node validation passed; multi-node validation still needs coverage.
  • GPU Operator jumps from 23.3 to 26.3 with an R580 driver. Driver, Slurm, and Kubernetes GPU smokes passed on single-node hardware; broader hardware validation is still required.
  • kube-prometheus-stack and Grafana are major upgrades; validate monitoring values during QA.
  • Network Operator now uses the NVIDIA NGC Helm repo; validate install behavior on hardware with relevant networking.
  • Direct in-place Kubernetes upgrades from much older DeepOps releases require staged compatibility checks. The docs now call out that Kubespray-managed network plugins may need intermediate upgrades as well as Kubernetes itself.

@dholt
Contributor Author

dholt commented May 18, 2026

Validation evidence from the 26.05 draft branch.

Local Checks

  • git diff --check: pass
  • Role linting: pass
  • Selected playbook syntax checks: pass
  • Selected shell/YAML parsing checks: pass

GPU Driver Smoke

NVIDIA driver playbook: pass
Open kernel module package path: pass
nvidia-smi GPU detection: pass

Single-Node Slurm Smoke

Ansible playbook: pass
Ansible recap: failed=0 unreachable=0
Slurm controller service: active
Slurm worker service: active
Slurm database service: active
GPU resource visible to Slurm: yes
srun --gpus=1 nvidia-smi: pass

Note: running nvidia-smi directly over SSH can report no devices after the login-compute role hides GPUs from normal SSH sessions; a Slurm allocation is the intended validation path once that role has run.

Single-Node Kubernetes/GPU Operator Smoke

Ansible playbook: pass
Ansible recap: failed=0 unreachable=0
GPU Operator validation: pass
Allocatable GPU count: 1
CUDA nvidia-smi pod: pass
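Both smoke reports above summarize the Ansible recap as failed=0 unreachable=0. A check like that can be automated by parsing the per-host PLAY RECAP line; the sketch below is one way to do it. The helper names and the sample line (host name and counter values) are hypothetical and are not taken from the actual run logs.

```python
import re


def parse_recap(line: str) -> dict[str, int]:
    """Extract the key=value counters from one PLAY RECAP host line."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}


def recap_passed(line: str) -> bool:
    """Pass only when both failed and unreachable are present and zero."""
    counts = parse_recap(line)
    # Missing counters default to 1 so a truncated line is treated as a failure.
    return counts.get("failed", 1) == 0 and counts.get("unreachable", 1) == 0


# Hypothetical recap line in the format Ansible prints at the end of a play.
sample = "node1 : ok=42 changed=7 unreachable=0 failed=0 skipped=3 rescued=0 ignored=0"
print(recap_passed(sample))  # True
```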

Remaining Blockers

  • Live MAAS provisioning smoke is not complete.
  • 23.08 to 26.05 upgrade-path validation is not complete.
  • This PR should stay draft until those are tested or explicitly documented as waived/unsupported.

@dholt dholt marked this pull request as ready for review May 19, 2026 02:26
@dholt dholt requested a review from michael-balint May 19, 2026 02:28