fix(destroy-addons): dynamic addon discovery and composition deletion ordering #641

Open
punkwalker wants to merge 2 commits into feature/agent-platform from feature/agent-platform-pod-identities-fix

Conversation

@punkwalker
Collaborator

Summary

  • Rewrites hub:destroy-addons to dynamically query ApplicationSets from the hub cluster instead of computing from local registry files, ensuring agent-platform addons (kagent, litellm, langfuse, jaeger, agentgateway, etc.) are properly cleaned up
  • Adds Usage resources to PlatformCluster composition to enforce correct deletion ordering and prevent orphaned AWS resources (EIP/NAT Gateway race condition)

Changes

Taskfile (hub:destroy-addons)

  • Replace static vars.ADDONS block with inline kubectl query + sync-wave sorting
  • Add Phase 4.5: delete agent-platform AppSet (preserveResources) then orphan child apps before Phase 5
  • Use ownerReference-based child app detection (handles unlabeled apps from remote repos)
  • Skip finalizer/deletion wait loops when no child apps exist
  • Remove all ignore_error: true — handle benign exits inline
  • Change 2>/dev/null to 2>&1 on action commands for error visibility
  • Fix shell compatibility with Go Task set: [errexit, pipefail]
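
The dynamic discovery and sync-wave ordering listed above could look roughly like the following. This is a sketch under stated assumptions, not the Taskfile's actual code: the `argocd` namespace, both function names, the descending order (highest wave first, on the assumption that teardown reverses install order), and treating a missing annotation as wave 0 are all illustrative choices.

```shell
#!/usr/bin/env bash
set -o errexit -o pipefail

# Hypothetical helper: emit "name<TAB>sync-wave" for every ApplicationSet.
# Assumes the AppSets live in the argocd namespace and that kubectl points
# at the hub cluster. Defined here but never called by the sketch itself.
list_appsets() {
  kubectl get applicationsets -n argocd \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.argocd\.argoproj\.io/sync-wave}{"\n"}{end}'
}

# Read "name<TAB>wave" lines on stdin and print names, highest wave first.
# A missing wave is treated as 0 (assumption); ties are broken by name.
sort_by_wave_desc() {
  awk -F'\t' '{ wave = ($2 == "" ? 0 : $2); print wave "\t" $1 }' \
    | sort -t "$(printf '\t')" -k1,1nr -k2,2 \
    | cut -f2
}
```

Inside a task, the two pieces would be chained as `list_appsets | sort_by_wave_desc`. Keeping the sort separate from the query means the ordering logic can be exercised with plain text input, without cluster access.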

Composition (Usage resources)

  • natgw-uses-eip — block EIP deletion until NAT Gateway is gone
  • route-uses-igw — block IGW deletion until Routes are gone
  • route-uses-natgw — block NAT GW deletion until private Route is gone
  • rta-uses-rt — block RouteTable deletion until associations are gone
  • cluster-uses-subnets — block Subnet deletion until EKS Cluster is gone
  • cluster-uses-vpc — block VPC deletion until EKS Cluster is gone
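
To illustrate the shape of one of these Usage resources, here is a sketch that renders a standalone manifest for `natgw-uses-eip`. In a Usage, `of` names the resource being protected and `by` names the resource that uses it, so deleting the `of` resource is blocked while the `by` resource exists. Everything concrete below is an assumption: the `render_usage` helper, the default Usage `apiVersion` (it differs between Crossplane releases), and the provider API group and resource names in the usage example. The PR itself defines these Usages inside the PlatformCluster composition, not as standalone manifests.

```shell
#!/usr/bin/env bash
set -o errexit -o pipefail

# Hypothetical helper: print a Usage manifest in which "by" uses "of".
# Args: usage-name of-apiVersion of-kind of-name by-apiVersion by-kind by-name
# USAGE_API_VERSION is an assumed override; check your Crossplane version.
render_usage() {
  cat <<EOF
apiVersion: ${USAGE_API_VERSION:-apiextensions.crossplane.io/v1beta1}
kind: Usage
metadata:
  name: $1
spec:
  of:
    apiVersion: $2
    kind: $3
    resourceRef:
      name: $4
  by:
    apiVersion: $5
    kind: $6
    resourceRef:
      name: $7
EOF
}
```

For example, `render_usage natgw-uses-eip ec2.aws.upbound.io/v1beta1 EIP example-eip ec2.aws.upbound.io/v1beta1 NATGateway example-natgw` prints a manifest that blocks deletion of the (hypothetically named) EIP until the NAT Gateway is gone.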

Documentation

  • DELETION-ORDERING.md documenting the root cause, Usage fix, and references

Test plan

  • Run task hub:destroy-addons — verify Phase 5 lists all addon AppSets dynamically
  • Verify agent-platform AppSets (kagent, litellm, etc.) are deleted without regeneration
  • Verify no 30s timeouts on addons without child apps (e.g. disabled addons)
  • Verify errors (not-found, timeout) are visible in output
  • Deploy composition with Usage resources and test spoke cluster deletion ordering

punkwalker and others added 2 commits May 11, 2026 23:53
…m abstractions

External-dns was in CrashLoopBackOff on the hub because the default
liveness probe (10s initial, 2 failures) killed it before Route53
zone listing completed. Increase to 60s initial / 5 failures.

Exclude aws-resources directory from the abstractions ApplicationSet
git generator to prevent deploying the unused aws-resources app.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… ordering

- Replace static registry-based addon list with dynamic kubectl query from hub cluster
- Add Phase 4.5 to stop agent-platform AppSet regeneration before addon deletion
- Use ownerReference for child app detection (works for unlabeled apps)
- Skip wait loops when no child apps exist (avoids 30s timeouts on disabled addons)
- Remove all ignore_error:true, handle benign exits inline with || true
- Surface errors via 2>&1 instead of suppressing with 2>/dev/null
- Fix shell compatibility with Go Task errexit/pipefail (explicit if statements)
- Add 6 Usage resources to PlatformCluster composition for deletion ordering:
  natgw-uses-eip, route-uses-igw, route-uses-natgw, rta-uses-rt,
  cluster-uses-subnets, cluster-uses-vpc
- Document Crossplane deletion ordering behavior and workarounds
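
The ownerReference-based child detection and the skip-when-empty behavior from this commit message can be sketched as follows. This is an illustration only, not the code in the commit: the `argocd` namespace, the three function names, and the intermediate "app, tab, comma-separated owners" text format are assumptions made so the filtering logic can run without a cluster.

```shell
#!/usr/bin/env bash
set -o errexit -o pipefail

# Hypothetical helper: emit "app<TAB>Kind/name,Kind/name,..." per Application.
# Assumes Applications live in the argocd namespace. Never called here.
list_app_owners() {
  kubectl get applications -n argocd \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .metadata.ownerReferences[*]}{.kind}/{.name},{end}{"\n"}{end}'
}

# Read those lines on stdin and print the apps owned by ApplicationSet "$1".
# awk exits 0 even when nothing matches, so pipefail is not tripped.
children_of() {
  awk -F'\t' -v ref="ApplicationSet/$1" '{
    n = split($2, owners, ",")
    for (i = 1; i <= n; i++) if (owners[i] == ref) { print $1; break }
  }'
}

# Print "wait" if the AppSet has child apps, otherwise "skip".
# Explicit if/else so the "no children" case is a normal, zero-exit outcome.
wait_decision() {
  local children
  children="$(children_of "$1")"
  if [ -n "$children" ]; then
    echo "wait"
  else
    echo "skip"
  fi
}
```

Matching on owner references rather than labels is what lets unlabeled child apps be found. Returning an explicit "skip" for an empty result is one way to avoid entering a wait loop that could only time out.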
@allamand
Contributor

Was this tested?

