feat(linux): add build support for GB200/300 image series #8521
keith-ms wants to merge 88 commits into
…the GB200 platform
…E_LABELS are written to a /etc/default/kubelet
…els are written to the /etc/default/kubelet file
rm -f ${POD_INFRA_CONTAINER_IMAGE_TAR}
}

validateKubeletNodeLabels() {
where is this used?
It isn't currently used anywhere in the call path. I confirmed there are no in-tree references to validateKubeletNodeLabels while working on this update (30dedb6).
This isn't used. I can remove it, but it probably should be integrated because a label over 63 characters will cause kubelet to fail to start.
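For illustration, the length concern above could be checked with something like the following. This is only a sketch: `check_node_label_lengths` is a hypothetical helper, not the body of the PR's validateKubeletNodeLabels (which isn't shown in this thread), and it assumes the labels arrive as a comma-separated list of key=value pairs.

```bash
# Hypothetical helper, not the PR's validateKubeletNodeLabels.
# Assumes $1 is a comma-separated list of key=value node labels.
# The body runs in a subshell so the IFS and globbing changes do not leak.
check_node_label_lengths() (
    max_len=63   # Kubernetes limits label values to 63 characters
    status=0
    IFS=','
    set -f       # disable globbing while splitting on commas
    for pair in $1; do
        # Everything after the first '='; a pair without '=' is measured whole.
        value="${pair#*=}"
        if [ "${#value}" -gt "$max_len" ]; then
            echo "label '${pair%%=*}' has a ${#value}-character value (max ${max_len})" >&2
            status=1
        fi
    done
    [ "$status" -eq 0 ]
)
```

A caller could run `check_node_label_lengths "$KUBELET_NODE_LABELS" || exit 1` before the labels are written out, assuming that variable holds the label list in the format described above.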
Agent-Logs-Url: https://github.com/Azure/AgentBaker/sessions/8b4cdf6e-94c7-4c74-b5e2-8983bd8d7c23 Co-authored-by: cameronmeissner <24923771+cameronmeissner@users.noreply.github.com>
Agent-Logs-Url: https://github.com/Azure/AgentBaker/sessions/7402cb4e-9c9d-4f11-8a2f-e49e30d5a461 Co-authored-by: keith-ms <153014933+keith-ms@users.noreply.github.com>
not gonna block on this, though I'd prefer we move these artifacts that are only uploaded to GB200/300 VHDs into a subfolder, maybe called graceblackwell or something
else
  # However, for the 24.04 ARM images, we MUST have both -azure and -azure-nvidia kernels, so that we can run on either vanilla ARM64 hardware or GB200.
  if [ $(dpkg --get-selections | grep -c "linux-image") -lt 2 ]; then
    echo "ERROR: Ubuntu 24.04 ARM image is missing either the -azure or -azure-nvidia kernel, cannot continue!" && exit 1
nit: personally I'd move this sort of thing to content tests, though I understand the rationale of leaving it here for now to speed up build times
@@ -0,0 +1,46 @@
{
Do you plan to maintain these versions manually, or would you like to auto-update with Renovate? I assume manual, given the difficulty of obtaining quota to test after updates.
This is done because the image requires very specific versions of packages. This isn't meant to be a broadly useful image, despite the gb200 in the name implying something generic.
echo "Generating non-GPU containerd config for GPU node due to VM tags"
echo "${CONTAINERD_CONFIG_NO_GPU_CONTENT}" | base64 -d > /etc/containerd/config.toml || exit $ERR_FILE_WATCH_TIMEOUT

if grep -q 'BinaryName = "/usr/bin/nvidia-container-runtime"' /etc/containerd/config.toml 2>/dev/null; then
Should there be an explicit GB200/300 check here as well?
This happens at runtime, not build time, and I don't believe there's a way to check this at that point. The feature flag isn't present at runtime. You could perhaps check by SKU via IMDS, but at the risk of failure due to IMDS access problems.
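To make the IMDS option concrete, a rough sketch follows. It assumes GB200/300 VM sizes can be recognized by "GB200" or "GB300" appearing in the size name, which is a guess rather than something confirmed in this PR. Both function names are hypothetical. The lookup fails closed: if IMDS is unreachable, the caller sees a non-zero status and can fall back to the existing behavior.

```bash
# Assumption: GB200/300 VM sizes contain "GB200" or "GB300" in their name.
is_grace_blackwell_sku() {
    case "$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')" in
        *GB200*|*GB300*) return 0 ;;
        *) return 1 ;;
    esac
}

# Prints the VM size reported by IMDS; returns non-zero if IMDS cannot be reached
# within the timeout, so callers can treat "unknown" separately from "not GB200/300".
get_vm_size_from_imds() {
    curl -fsS --noproxy '*' --max-time 5 -H "Metadata: true" \
        "http://169.254.169.254/metadata/instance/compute/vmSize?api-version=2021-02-01&format=text"
}
```

A runtime check could then read `if vm_size="$(get_vm_size_from_imds)" && is_grace_blackwell_sku "$vm_size"; then ...`, with the IMDS failure case handled by the `else` branch.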
Fair. You could also consider adding a sentinel file at build time, such as /etc/aks/gpu-config-baked.marker, which can be checked here. Just an optional suggestion.
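A minimal sketch of the sentinel idea, for discussion only. The marker path is the one floated in the comment above and is not an existing AgentBaker convention. `GPU_CONFIG_MARKER`, `write_gpu_config_marker`, and `has_baked_gpu_config` are hypothetical names.

```bash
# Hypothetical marker path from the suggestion above; overridable for testing.
GPU_CONFIG_MARKER="${GPU_CONFIG_MARKER:-/etc/aks/gpu-config-baked.marker}"

# Build time: record that the GPU containerd config was baked into the image.
write_gpu_config_marker() {
    mkdir -p "$(dirname "$GPU_CONFIG_MARKER")" && : > "$GPU_CONFIG_MARKER"
}

# Runtime: succeeds only if the build-time marker file exists.
has_baked_gpu_config() {
    [ -f "$GPU_CONFIG_MARKER" ]
}
```

The runtime script could then guard the containerd check with `has_baked_gpu_config`, which needs no network access and so avoids the IMDS failure mode mentioned earlier in the thread.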
fi
else
  # However, for the 24.04 ARM images, we MUST have both -azure and -azure-nvidia kernels, so that we can run on either vanilla ARM64 hardware or GB200.
  if [ $(dpkg --get-selections | grep -c "linux-image") -lt 2 ]; then
Would it be worth including grep -q "GB200" <<<"$FEATURE_FLAGS" for future safety, when there's no dual booting?
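One way the suggested guard might look, as a sketch rather than a proposed patch. It assumes `FEATURE_FLAGS` is a single string containing flag names, as the grep in the comment implies. It uses a `case` pattern as a portable equivalent of `grep -q "GB200" <<<"$FEATURE_FLAGS"`. `has_gb200_flag` and `check_dual_kernels` are hypothetical names.

```bash
# Assumption: FEATURE_FLAGS holds flag names in one string, as the grep above implies.
has_gb200_flag() {
    case "${FEATURE_FLAGS:-}" in
        *GB200*) return 0 ;;
        *) return 1 ;;
    esac
}

# $1: number of installed linux-image packages.
# Fails only when the GB200 flag is set and fewer than two kernels are installed.
check_dual_kernels() {
    if has_gb200_flag && [ "$1" -lt 2 ]; then
        echo "ERROR: image is missing either the -azure or -azure-nvidia kernel" >&2
        return 1
    fi
    return 0
}
```

In the build script this could be called as `check_dual_kernels "$(dpkg --get-selections | grep -c "linux-image")" || exit 1`, so images built without the GB200 flag skip the dual-kernel requirement.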
67a3710 to 99160bc
This change pulls in the commits specific to the `release-gb200` branch into `main` rather than trying to merge commits from `main` into `release-gb200`.

I've modified `vhdbuilder/packer/pre-install-dependencies.sh` from the `release-gb200` branch so that the NVIDIA kernel path from the PPA depends on the presence of the `GB200` feature flag.