acceptance: route bundle test clusters through the shared instance pool #5461
Open
renaudhartert-db wants to merge 3 commits into
The cli-isolated integration tests launch ~30 ephemeral clusters per run, each cold-pulling the multi-GB DBR runtime image over the NAT gateway in the deco AWS test account. That NAT egress accounts for the bulk of an opex.eng.deco budget overspend (ES-1912931); the traffic is ~99.6% inbound download, ~3 GB per node.

These bundle acceptance templates set node_type_id directly, bypassing the existing warm instance pool that is already exported to CI as TEST_INSTANCE_POOL_ID and already used by spark-jar-task and integration_whl/base. Routing them through the pool lets nodes reuse a cached runtime image instead of re-pulling it through NAT on every launch.

Adds instance_pool_id: $TEST_INSTANCE_POOL_ID to the cluster-launching templates, matching the existing pattern, and regenerates the affected acceptance goldens.

Co-authored-by: Isaac
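For reviewers who want to picture the change, a cluster spec routed through the pool might look roughly like the fragment below. This is a sketch and is not copied from the diff: the job name, task key, and the $DEFAULT_SPARK_VERSION placeholder are hypothetical, and the task body is omitted. Only instance_pool_id: $TEST_INSTANCE_POOL_ID comes from the description above.

```yaml
# Illustrative sketch only; not taken from the actual templates.
# example_job, example_task, and $DEFAULT_SPARK_VERSION are hypothetical.
resources:
  jobs:
    example_job:
      name: example-job
      tasks:
        - task_key: example_task
          # Task type (notebook, wheel, jar, ...) omitted for brevity.
          new_cluster:
            spark_version: $DEFAULT_SPARK_VERSION
            num_workers: 1
            # Line added by this PR: take nodes from the shared warm pool
            # that CI exports as TEST_INSTANCE_POOL_ID.
            instance_pool_id: $TEST_INSTANCE_POOL_ID
```

With the pool set, the node type comes from the pool rather than from the cluster spec, which is why the regenerated goldens no longer record node_type_id.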
Commit: 3d5eb7e
22 interesting tests: 15 SKIP, 7 KNOWN
Top 24 slowest tests (at least 2 minutes):
It is a bind test that only deploys (no cluster launch, no NAT benefit), and its acceptance golden did not regenerate cleanly outside CI. Reverting it to keep the change focused on templates that actually launch clusters.
Cloud-only deploy test (no bundle run, no cluster launch, no NAT benefit). It skips locally so its golden could not be regenerated here, and its cloud golden still records node_type_id; routing it through the pool changed the cloud output and broke the integration check. The 11 cluster-launching templates that do start clusters passed the cloud integration run.
Changes
Route the cluster-launching bundle acceptance templates through the shared test instance pool by adding instance_pool_id: $TEST_INSTANCE_POOL_ID to their cluster specs, matching the pattern already used by bundle/deploy/spark-jar-task and bundle/integration_whl/base.

Goldens for the affected tests are regenerated: when a cluster is created from a pool, the node type comes from the pool, so node_type_id/driver_node_type_id are dropped from the deployed spec.

Why
These bundle acceptance tests create real clusters during cloud integration runs, and each cold node pulls ~3 GB over the NAT gateway on every launch. Most of the templates set node_type_id directly, which bypasses the warm instance pool that is already wired into the integration environment as TEST_INSTANCE_POOL_ID. Routing them through the pool lets nodes reuse an already-cached runtime image instead of re-pulling it each time, removing a large amount of redundant network transfer from the test runs. Two templates already did this; this brings the remaining cluster-launching ones in line.

Tests
Existing acceptance tests, with regenerated goldens for the affected templates. The full integration suite passes.