docs: add GPU node setup guide #170

Open. chokevin wants to merge 4 commits into Azure:main from chokevin:chokevin/gpu-node-setup-guide


chokevin (Collaborator) commented May 20, 2026

What

Adds a GPU Flex Node setup guide under docs/usages/ and links it from the README and usage guide.

The guide covers the GPU setup flow, image options, cluster GPU components, direct-host bootstrap using shared bootstrap-token examples, Karpenter-managed provisioning, validation commands, and troubleshooting.

Why

GPU Flex Node setup has an important host-driver contract that needs to be easy to explain to readers: AKS Flex Node does not install the NVIDIA kernel driver. The GPU host driver must already be available before AKS Flex Node bootstraps the host. In current Flex GPU validation, that is handled by selecting a GPU-capable DSVM/HPC image; other valid paths are a validated AKS managed GPU-capable image/image ID, a prebaked custom image, or another GPU-capable marketplace/custom image that has been validated.
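The "driver first" contract above could be checked on a host before bootstrapping. The following is a minimal sketch only, assuming that a working `nvidia-smi` on `PATH` is an acceptable signal that the host driver is present; the function name `host_driver_present` is hypothetical and is not part of AKS Flex Node or of the guide in this PR.

```python
import shutil
import subprocess


def host_driver_present(which=shutil.which, run=subprocess.run) -> bool:
    """Return True if nvidia-smi is found and exits successfully.

    Assumption (not from the PR): a zero exit code from `nvidia-smi -L`
    is treated as evidence that the NVIDIA host driver is loaded.
    `which` and `run` are injectable so the check can be tested
    on machines without a GPU.
    """
    path = which("nvidia-smi")
    if path is None:
        # No NVIDIA userspace tooling on PATH; the image is probably
        # not GPU-capable, or the driver was never installed.
        return False
    try:
        # `-L` lists the GPUs visible to the driver.
        result = run([path, "-L"], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if host_driver_present():
        print("NVIDIA host driver detected; host looks ready for bootstrap.")
    else:
        print("NVIDIA host driver not detected; pick a GPU-capable image first.")
```

Running a check like this before bootstrap surfaces a missing driver early, rather than as absent GPU capacity on the node later.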

Non-goals

  • Does not change AKS Flex Node runtime behavior or provisioning logic.
  • Does not claim GPU Operator installs the host driver in the validated flow.
  • Does not document hostRouting or other CRD paths as active for current Flex H100/H200 validation.
  • Does not hardcode validation cluster names.

Testing

  • git diff --check
  • Relative markdown link check for changed markdown files
  • JSON/YAML syntax check for bootstrap-token examples
  • Checked the changed docs for over-specific validation cluster names and unsupported hostRouting/aksFlexNode claims
  • make test was attempted and failed in existing pkg/config tests because this workstation hostname (Kevins-MacBook-Pro.local) is rejected as an invalid Kubernetes DNS subdomain by node-name validation. This is unrelated to the docs-only changes.
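To illustrate why that hostname is rejected: Kubernetes DNS subdomain names (RFC 1123) allow only lowercase alphanumerics, `-`, and `.`, must start and end with an alphanumeric, and are limited to 253 characters. The sketch below is an illustrative re-implementation of that rule, not the repository's `pkg/config` code, which is not shown in this PR; `is_dns1123_subdomain` is a hypothetical name.

```python
import re

# One DNS label: lowercase alphanumerics and '-', alphanumeric at both ends.
_LABEL = r"[a-z0-9]([-a-z0-9]*[a-z0-9])?"
# A subdomain is one or more labels separated by dots.
_SUBDOMAIN_RE = re.compile(rf"{_LABEL}(\.{_LABEL})*")

_MAX_LENGTH = 253


def is_dns1123_subdomain(name: str) -> bool:
    """Return True if `name` is a valid RFC 1123 DNS subdomain."""
    return len(name) <= _MAX_LENGTH and _SUBDOMAIN_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("Kevins-MacBook-Pro.local", "kevins-macbook-pro.local"):
        print(candidate, is_dns1123_subdomain(candidate))
```

Under this rule the uppercase letters are what make `Kevins-MacBook-Pro.local` invalid; the lowercased form passes.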

Risk

Documentation-only. The main risk is inaccurate setup guidance; the guide explicitly calls out the NVIDIA driver caveat and separates DRA, GPU Operator, GFD, and legacy GPU capacity behavior to avoid overstating the validated flow.

Copilot AI review requested due to automatic review settings May 20, 2026 19:22
Copilot AI (Contributor) left a comment

Pull request overview

Adds a new GPU-focused setup guide to help readers understand the required “host NVIDIA driver first” contract for GPU-capable AKS Flex Nodes, and links it from the main README and usage guide.

Changes:

  • Added a new documentation page: docs/usages/gpu-node-setup.md covering GPU host image/driver prerequisites, cluster GPU components, validation, and troubleshooting.
  • Linked the new guide from README.md and docs/usage.md.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| README.md | Adds a documentation link entry pointing to the new GPU setup guide. |
| docs/usages/gpu-node-setup.md | New GPU Flex Node setup guide describing driver/image contract, GPU components, validation, and troubleshooting. |
| docs/usage.md | Adds a cross-link to the new GPU setup guide near the top of the main usage doc. |

Comment thread docs/usage.md
@@ -6,6 +6,8 @@ This guide provides three complete setup paths for AKS Flex Node:
2. **[Setup with Service Principal](#setup-with-service-principal)** - More scalable for secure production environment
3. **[Setup with Bootstrap Token](#setup-with-bootstrap-token)** - Simplest setup with minimum dependancy for dynamic hyperscale environments
Copilot AI review requested due to automatic review settings May 20, 2026 19:33
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

chokevin force-pushed the chokevin/gpu-node-setup-guide branch from 4ffa5ef to b6f69b2, May 20, 2026 19:44
Copilot AI review requested due to automatic review settings May 20, 2026 19:47
chokevin force-pushed the chokevin/gpu-node-setup-guide branch from b6f69b2 to a0ee76c, May 20, 2026 19:47
chokevin force-pushed the chokevin/gpu-node-setup-guide branch from a0ee76c to 2d3db7c, May 20, 2026 19:51
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

chokevin force-pushed the chokevin/gpu-node-setup-guide branch from 2d3db7c to 74e931e, May 20, 2026 19:58