docs: add GPU node setup guide #170
Open
chokevin wants to merge 4 commits into
Pull request overview
Adds a new GPU-focused setup guide to help readers understand the required “host NVIDIA driver first” contract for GPU-capable AKS Flex Nodes, and links it from the main README and usage guide.
Changes:
- Added a new documentation page, docs/usages/gpu-node-setup.md, covering GPU host image/driver prerequisites, cluster GPU components, validation, and troubleshooting.
- Linked the new guide from README.md and docs/usage.md.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| README.md | Adds a documentation link entry pointing to the new GPU setup guide. |
| docs/usages/gpu-node-setup.md | New GPU Flex Node setup guide describing driver/image contract, GPU components, validation, and troubleshooting. |
| docs/usage.md | Adds a cross-link to the new GPU setup guide near the top of the main usage doc. |
@@ -6,6 +6,8 @@ This guide provides three complete setup paths for AKS Flex Node:
2. **[Setup with Service Principal](#setup-with-service-principal)** - More scalable for secure production environment
3. **[Setup with Bootstrap Token](#setup-with-bootstrap-token)** - Simplest setup with minimum dependancy for dynamic hyperscale environments
4ffa5ef to b6f69b2
b6f69b2 to a0ee76c
a0ee76c to 2d3db7c
2d3db7c to 74e931e
What
Adds a GPU Flex Node setup guide under docs/usages/ and links it from the README and usage guide.
The guide covers the GPU setup flow, image options, cluster GPU components, direct-host bootstrap using shared bootstrap-token examples, Karpenter-managed provisioning, validation commands, and troubleshooting.
Why
GPU Flex Node setup has an important host-driver contract that needs to be easy to explain to readers: AKS Flex Node does not install the NVIDIA kernel driver. The GPU host driver must already be available before AKS Flex Node bootstraps the host. In current Flex GPU validation, that is handled by selecting a GPU-capable DSVM/HPC image; other valid paths are a validated AKS managed GPU-capable image/image ID, a prebaked custom image, or another GPU-capable marketplace/custom image that has been validated.
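As an illustration of the "driver first" contract, a host could be checked before bootstrap with something like the sketch below. It is not part of the guide or of AKS Flex Node: the function name `host_driver_ready` is hypothetical, and the check assumes that a loaded NVIDIA kernel module exposes `/proc/driver/nvidia/version` and that `nvidia-smi` is on the `PATH`.

```python
import os
import shutil
from typing import Callable, Optional

# Assumption: the NVIDIA kernel module exposes this file once it is loaded.
NVIDIA_PROC_VERSION = "/proc/driver/nvidia/version"


def host_driver_ready(
    which: Callable[[str], Optional[str]] = shutil.which,
    proc_version_path: str = NVIDIA_PROC_VERSION,
) -> bool:
    """Return True only if the driver's proc entry and nvidia-smi are both present.

    Hypothetical helper for illustration; both arguments are injectable so the
    check can be exercised on machines without a GPU.
    """
    module_loaded = os.path.isfile(proc_version_path)
    cli_present = which("nvidia-smi") is not None
    return module_loaded and cli_present


if __name__ == "__main__":
    if host_driver_ready():
        print("NVIDIA host driver detected")
    else:
        print("NVIDIA host driver not detected; choose a GPU-capable image before bootstrapping")
```

A check like this only confirms that a driver appears to be present; it does not validate the driver version or the image itself.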
Non-goals
Testing
- git diff --check
- make test was attempted and failed in existing pkg/config tests because this workstation hostname (Kevins-MacBook-Pro.local) is rejected as an invalid Kubernetes DNS subdomain by node-name validation. This is unrelated to the docs-only changes.
Risk
Documentation-only. The main risk is inaccurate setup guidance; the guide explicitly calls out the NVIDIA driver caveat and separates DRA, GPU Operator, GFD, and legacy GPU capacity behavior to avoid overstating the validated flow.
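For context on the hostname failure noted under Testing, the sketch below shows how a DNS subdomain check of the kind Kubernetes documents (RFC 1123: lowercase alphanumerics, `-`, and `.`, at most 253 characters) treats that hostname. It is an independent Python illustration, not the repository's validation code; `is_dns_subdomain` is a hypothetical name, and the assumption that the uppercase letters are what trigger the rejection is inferred rather than confirmed by the PR.

```python
import re

# RFC 1123 label: lowercase alphanumerics and '-', starting and ending alphanumeric.
_LABEL = r"[a-z0-9]([-a-z0-9]*[a-z0-9])?"
# A subdomain is one or more labels separated by dots.
_SUBDOMAIN = re.compile(rf"{_LABEL}(\.{_LABEL})*")
MAX_SUBDOMAIN_LEN = 253


def is_dns_subdomain(name: str) -> bool:
    """Return True if name is a valid RFC 1123 DNS subdomain (hypothetical helper)."""
    return len(name) <= MAX_SUBDOMAIN_LEN and _SUBDOMAIN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("Kevins-MacBook-Pro.local", "kevins-macbook-pro.local"):
        print(candidate, is_dns_subdomain(candidate))
```

Under this check, the mixed-case hostname is rejected while its lowercase form is accepted.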