Run integration Vertex AI tests against a freshly-built source image#668
kmontemayor2-sc wants to merge 10 commits into
…e image
Relocate the two non-e2e tests that launch real Vertex AI jobs (networking_test, vertex_ai_test) into a new tests/smoke/ package with its own main.py, and add a `make smoke_test` target that builds a fresh src-cpu image from the current source and runs them against it (via GIGL_CPU_DOCKER_URI).
This closes a source/image skew gap: `make integration_test` runs workers on the pinned release image, so worker-side source changes (e.g. get_graph_store_info) were only validated after a release. smoke_test rebuilds from current source so they're validated on the PR.
- Makefile: SMOKE_TEST_CPU_IMAGE_TAG / SMOKE_TEST_CPU_IMAGE vars + smoke_test target.
- CI: run `make smoke_test` in on-pr-merge's ci-integration-test, and add a `/smoke_test` (+ /all_test) on-demand job to on-pr-comment. Both pass an immutable per-run tag (run_id.run_attempt) so concurrent runs can't clobber it.
- networking_test: worker runs a real _assert_graph_store_info() function (thin python -c import+call) instead of an inlined script, now that the image is rebuilt from source.
- vertex_ai_test: the CustomJob tests run a real worker function asserting the provisioned machine's vCPU count, on the fresh image.
- All smoke job configs set an explicit short timeout_s.
- Document make smoke_test / tests/smoke in CLAUDE.md and README.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
/all_test |
GiGL Automation@ 16:52:25UTC : 🔄 @ 16:54:20UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 16:52:26UTC : 🔄 @ 17:43:25UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 16:52:27UTC : 🔄 @ 17:00:32UTC : ❌ Workflow failed. |
GiGL Automation@ 16:52:28UTC : 🔄 @ 17:55:27UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 16:52:28UTC : 🔄 @ 18:12:45UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 16:52:33UTC : 🔄 @ 17:01:20UTC : ✅ Workflow completed successfully. |
…ion identifiers
Reverses the tests/smoke/ packaging: relocates vertex_ai_test.py and networking_test.py back under tests/integration/ (their natural home), deletes the entire tests/smoke/ package, and renames all smoke-specific identifiers (class names, job-name prefixes, KFP pipeline names, display names, experiments, labels, the GCS path segment, and the timeout constant) to their integration equivalents so the files are accurate and the worker python -c import strings resolve on the rebuilt image.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…smoke_test target
Renames SMOKE_TEST_CPU_IMAGE_TAG/SMOKE_TEST_CPU_IMAGE vars to INTEGRATION_TEST_CPU_IMAGE_TAG/INTEGRATION_TEST_CPU_IMAGE, rewrites the integration_test target to build and push a fresh src-cpu Docker image before running the suite (so Vertex-AI-launching tests use current source), and removes the now-superseded smoke_test target entirely.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…_test job
- on-pr-merge.yml: remove the separate "Run Smoke Tests" step; pass INTEGRATION_TEST_CPU_IMAGE_TAG to the integration step so make integration_test builds and tests against a fresh, immutable per-run image.
- on-pr-comment.yml: delete the smoke-test job (/smoke_test trigger); pass INTEGRATION_TEST_CPU_IMAGE_TAG to the integration-test job's command.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…fresh src-cpu image
Remove all references to the removed make smoke_test target, tests/smoke/ package, and /smoke_test CI command. Update the integration test docs in both CLAUDE.md and README.md to state that make integration_test now builds a fresh src-cpu image from current source (so Vertex-AI-launching tests run against current code, not a pinned release image).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
/all_test |
GiGL Automation@ 19:10:05UTC : 🔄 @ 20:41:43UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 19:10:06UTC : 🔄 @ 19:11:56UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 19:10:06UTC : 🔄 @ 19:20:53UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 19:10:06UTC : 🔄 @ 20:52:23UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 19:10:08UTC : 🔄 @ 19:19:01UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 19:11:29UTC : 🔄 @ 20:27:16UTC : ✅ Workflow completed successfully. |
/all_test |
GiGL Automation@ 22:32:00UTC : 🔄 @ 23:53:52UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 22:32:01UTC : 🔄 @ 22:33:58UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 22:32:01UTC : 🔄 @ 22:42:36UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 22:32:04UTC : 🔄 @ 22:40:54UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 22:32:04UTC : 🔄 @ 23:47:44UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 22:32:04UTC : 🔄 @ 23:54:18UTC : ✅ Workflow completed successfully. |
mkolodner-sc left a comment
Thanks Kyle! Mostly LGTM, a few comments
| gcp_service_account_email: ${{ secrets.GCP_SERVICE_ACCOUNT_EMAIL }} | ||
| command: | | ||
| make integration_test | ||
| # Immutable per-run image tag so concurrent runs can't overwrite each other's tag. |
Does this work? This puts a shell comment inside the command: block. That command is later passed as _CMD and executed via $_CMD in .github/cloud_builder/run_command_on_active_checkout.yaml:50, where expanded # is not parsed as a comment. This may fail before make integration_test with #: command not found
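The failure mode raised here can be reproduced in isolation. The sketch below assumes the runner stores the command in a variable and executes it through an unquoted `$_CMD` expansion, as the comment describes; the actual `run_command_on_active_checkout.yaml` is not shown in this thread, and `run_via_unquoted_expansion` is a hypothetical helper written only for this illustration. It needs a POSIX `sh` on `PATH`.

```python
import os
import subprocess


def run_via_unquoted_expansion(cmd: str) -> subprocess.CompletedProcess:
    """Run `cmd` the way a `$_CMD`-style runner would (assumed behaviour)."""
    env = {"PATH": os.environ.get("PATH", "/usr/bin:/bin"), "_CMD": cmd}
    # The shell parses the script text `$_CMD` first and only then expands the
    # variable. Comments are recognised during parsing, so a `#` that appears
    # only after expansion is an ordinary word; here it becomes the command name.
    return subprocess.run(
        ["sh", "-c", "$_CMD"], env=env, capture_output=True, text=True
    )


# A command block that starts with a shell comment, then the real command.
with_comment = run_via_unquoted_expansion("# per-run image tag\necho ok")
# The same command without the leading comment.
without_comment = run_via_unquoted_expansion("echo ok")

print(with_comment.returncode, with_comment.stderr.strip())
print(without_comment.returncode, without_comment.stdout.strip())
```

Under that assumption the first call exits with status 127 because the shell tries to execute `#`, and `echo ok` never runs as a separate command: the newline is also just whitespace after expansion. Moving the explanation into a YAML comment outside the `command:` scalar would avoid the problem without changing the runner.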
| container_uri = "condaforge/miniforge3:25.3.0-1" | ||
| command = ["python", "-c", "import logging; logging.info('Hello, World!')"] | ||
|
|
||
| command = [ |
Robot review:
Vertex AI worker commands import test-only dependencies missing from src-cpu
vertex_ai_test.py:104/140/153 and networking_test.py:135 import helper functions from test modules inside the freshly built runtime image. Those modules import parameterized at top level (vertex_ai_test.py:8, networking_test.py:6), but parameterized is only in the test dependency group (pyproject.toml:148), while the CPU image uses non-dev install and Dockerfile.src only runs uv pip install .. The remote workers will likely fail on import before reaching the assertions.
Fix: put worker entrypoints in a minimal module with only runtime deps, or inline the tiny commands so they do not import test modules.
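One shape the first suggested fix could take is sketched below. It assumes a hypothetical module (called `worker_entrypoints` in the usage line) that imports nothing beyond the standard library, so it can be imported in an image built without the test dependency group. The function name mirrors `_assert_machine_cpu_count` from the PR description, but the body and the `build_worker_command` helper are illustrative, not the PR's code.

```python
import os


def assert_machine_cpu_count(expected: int) -> None:
    """Fail loudly if the machine does not report the expected vCPU count."""
    # os.cpu_count() is standard library only, so no test-only packages
    # (such as parameterized) are pulled in when the worker imports this module.
    actual = os.cpu_count()
    if actual != expected:
        raise AssertionError(f"expected {expected} vCPUs, found {actual}")


def build_worker_command(module: str, expected_cpus: int) -> list[str]:
    """Build a thin `python -c` import+call command for a worker container."""
    snippet = (
        f"from {module} import assert_machine_cpu_count; "
        f"assert_machine_cpu_count({expected_cpus})"
    )
    return ["python", "-c", snippet]


# Hypothetical module path; the real location would be wherever the runtime
# package (not the tests package) lives in the image.
print(build_worker_command("worker_entrypoints", 4))
```

The test file would then import the helper from the runtime module rather than the worker importing from the test file, which keeps `parameterized` out of the worker's import chain.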
Follow up #666 - would have caught this error much earlier.
Fixes a source/image skew: the integration tests that launch Vertex AI jobs used to run their workers on the pinned release image (`DEFAULT_GIGL_RELEASE_SRC_IMAGE_CPU=src-cpu:0.2.0`), so worker-side source changes (e.g. `get_graph_store_info`) weren't validated until a release.

Now `make integration_test` always builds a fresh `src-cpu` image from the current source and runs the suite against it (the Vertex-AI-launching tests pick it up via `GIGL_CPU_DOCKER_URI`):

- `integration_test` builds+pushes `${INTEGRATION_TEST_CPU_IMAGE}` then runs `tests.integration.main` with `GIGL_CPU_DOCKER_URI` exported. New `INTEGRATION_TEST_CPU_IMAGE_TAG` / `INTEGRATION_TEST_CPU_IMAGE` vars (tag defaults to `${DATE}` locally).
- `tests/integration/{distributed/utils/networking_test.py, common/services/vertex_ai_test.py}`: read `GIGL_CPU_DOCKER_URI` fail-fast in `setUp`; workers run real functions on the fresh image: `_assert_graph_store_info(...)` and `_assert_machine_cpu_count(...)` (verifies provisioned vCPU count per pool); explicit `timeout_s` on all CustomJob configs.
- `on-pr-merge` `ci-integration-test` and the `/integration_test` comment job pass an immutable per-run tag `${{ github.run_id }}.${{ github.run_attempt }}`.

Net diff is 7 files (the branch history first added a `tests/smoke/` suite, then folded it back into `tests/integration`; no `tests/smoke/` remains).

Test Plan
- `ty`, `ruff`, `mdformat --check` pass on changed files
- `make -n integration_test` dry-run: build image tag == exported `GIGL_CPU_DOCKER_URI`
- `make integration_test` (builds image + launches Vertex AI jobs) green
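For readers unfamiliar with the fail-fast read of `GIGL_CPU_DOCKER_URI` in `setUp` mentioned in the description, a minimal sketch follows. The helper `read_required_image_uri` and the class `ExampleVertexAITest` are hypothetical names for illustration; the PR's actual test code is not reproduced here and may differ.

```python
import os
import unittest
from collections.abc import Mapping

_IMAGE_ENV_VAR = "GIGL_CPU_DOCKER_URI"


def read_required_image_uri(environ: Mapping[str, str] = os.environ) -> str:
    """Return the image URI from the environment, or raise if it is unset."""
    uri = environ.get(_IMAGE_ENV_VAR, "").strip()
    if not uri:
        # Failing here is cheaper than launching a remote job on the wrong image.
        raise RuntimeError(
            f"{_IMAGE_ENV_VAR} is not set; run the suite via `make integration_test`"
        )
    return uri


class ExampleVertexAITest(unittest.TestCase):  # hypothetical test class
    def setUp(self) -> None:
        # Raises before any job is submitted if the fresh image URI is missing.
        self._container_uri = read_required_image_uri()
```

Raising in `setUp` makes a missing variable show up as an immediate test error instead of a job that silently runs on a default image.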