fix: provider CI health checks #750
MkDocs preview: https://7f185b3f.dd-docs-preview.pages.dev
Fern preview: https://nvidia-preview-pr-750.docs.buildwithfern.com/nemo/datadesigner
Greptile Summary
This PR hardens scheduled CI by replacing an unavailable NVIDIA embedding model, reducing FLUX.2 Pro record counts on scheduled runs, skipping the flaky
| Filename | Overview |
|---|---|
| scripts/health_checks.py | Adds retry loop (up to 3 attempts) with linear backoff around model health checks; correctly restricts retries to the RETRYABLE_MODEL_ERRORS tuple plus Python's built-in TimeoutError. |
| .github/workflows/build-notebooks.yml | Sets DATA_DESIGNER_FLUX_2_PRO_CREATE_NUM_RECORDS=2 only for scheduled runs, reducing FLUX.2 Pro load without affecting manual or push-triggered runs. |
| .github/workflows/agentic-ci-issue-triage.yml | Wraps the API pre-flight curl in a 3-attempt retry loop with back-off (10s, 20s); correctly exits on the 3rd failure and breaks on first success. |
| docs/notebook_source/4-providing-images-as-context.py | Sets skip_health_check=True on the nvidia-vision model config after construction; ModelConfig defines this field so the assignment is valid. |
| docs/notebook_source/5-generating-images.py | Adds env-var-driven num_records override for FLUX.2 Pro create step; MODEL_ID is hardcoded as 'black-forest-labs/flux.2-pro' so the guard is always true, future-proofing against model changes. |
| docs/notebook_source/6-editing-images-with-image-context.py | Same FLUX.2 Pro env-var override pattern as notebook 5; default of 5 records is reduced to 2 on scheduled CI runs. |
| packages/data-designer-config/src/data_designer/config/utils/constants.py | Switches nvidia embedding model to nvidia/llama-nemotron-embed-1b-v2; change is synchronized across constants, docs, and tests. |
| packages/data-designer-config/tests/config/test_default_model_settings.py | Test assertion updated to match new embedding model name; no logic changes. |
| .agents/recipes/docs-and-references/recipe.md | Increases max_turns from 30 to 50 for the docs-and-references agentic recipe; workflow timeout is unchanged. |
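The env-var override described for notebooks 5 and 6 can be sketched roughly as follows. This is a minimal illustration, not the notebooks' actual code: the helper `resolve_num_records` and its parameters are hypothetical, and the default of 5 is taken from the notebook 6 row (the notebook 5 default is not stated here). Only the environment variable name and the model ID come from the PR.

```python
import os

# Names taken from the PR description.
FLUX_2_PRO_MODEL_ID = "black-forest-labs/flux.2-pro"
NUM_RECORDS_ENV_VAR = "DATA_DESIGNER_FLUX_2_PRO_CREATE_NUM_RECORDS"

# Assumption: default record count as stated for notebook 6.
DEFAULT_NUM_RECORDS = 5


def resolve_num_records(model_id, default=DEFAULT_NUM_RECORDS, environ=None):
    """Return the record count for a create step (hypothetical helper).

    The override applies only when the model is FLUX.2 Pro and the
    environment variable is set to a non-empty value; otherwise the
    notebook default is used unchanged.
    """
    env = os.environ if environ is None else environ
    raw = env.get(NUM_RECORDS_ENV_VAR)
    if model_id == FLUX_2_PRO_MODEL_ID and raw:
        return int(raw)
    return default


if __name__ == "__main__":
    # Scheduled CI would export the variable as "2"; manual runs leave it unset.
    print(resolve_num_records(FLUX_2_PRO_MODEL_ID))
```

Keeping the model-ID guard means that swapping the notebook to a different image model would silently ignore the override rather than shrink an unrelated run.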
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[_check_model called] --> B[attempt = 1..3]
B --> C[Create TemporaryDirectory]
C --> D[DataDesigner.check_models]
D --> E{Success?}
E -- yes --> F[return]
E -- no --> G{Exception in HEALTH_CHECK_RETRYABLE_ERRORS?}
G -- no --> H[Propagate immediately]
G -- yes --> I{attempt == MAX_ATTEMPTS?}
I -- yes --> J[Re-raise]
I -- no --> K[sleep attempt x 5s]
K --> B
style F fill:#90EE90
style H fill:#FFB6C1
style J fill:#FFB6C1
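The control flow in the flowchart above can be sketched as a small retry helper. This is an illustrative approximation under stated assumptions, not the code in scripts/health_checks.py: `check_with_retry`, `RETRYABLE_ERRORS`, and the injected `sleep` parameter are hypothetical names, the retryable tuple here is a stand-in for the engine's real one, and the per-attempt temporary directory is omitted. The three-attempt limit and the `attempt x 5s` delay follow the flowchart.

```python
import time
from typing import Callable

# Assumption: stand-in for the engine's retryable error tuple plus TimeoutError.
RETRYABLE_ERRORS = (ConnectionError, TimeoutError)
MAX_ATTEMPTS = 3
BACKOFF_SECONDS = 5.0


def check_with_retry(
    check: Callable[[], None],
    *,
    max_attempts: int = MAX_ATTEMPTS,
    backoff_seconds: float = BACKOFF_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run `check`, retrying only retryable failures.

    Returns the 1-based attempt number that succeeded. Non-retryable
    exceptions propagate immediately; a retryable exception on the
    final attempt is re-raised.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        try:
            check()
            return attempt
        except RETRYABLE_ERRORS as exc:
            if attempt == max_attempts:
                raise
            # Linear backoff: 5s after the first failure, 10s after the second.
            delay = attempt * backoff_seconds
            print(f"attempt {attempt}/{max_attempts} failed ({exc!r}); retrying in {delay:.0f}s")
            sleep(delay)
    raise RuntimeError("unreachable")
```

Passing `sleep` as a parameter is a design choice for this sketch only; it lets the backoff schedule be checked without actually waiting.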
Reviews (3): Last reviewed commit: "Address bot review feedback"
PR #750 - fix: provider CI health checks
Summary
This PR addresses four flaky/broken CI signals around provider health checks and notebook builds:
Total: 11 files, +63 / -22. Mostly low-risk plumbing; correctness is dominated by the small set of code-path changes in
Findings
🟡 Stale embedding-model reference not updated (docs)
model="nvidia/llama-3.2-nv-embedqa-1b-v2",
This is an example snippet showing a hand-built (There are also 🟡
Addressed bot feedback in
Summary
This PR hardens the scheduled provider and notebook CI paths that failed on June 15, 2026. The failures were spread across notebook execution, provider health checks, and Agentic CI jobs, so this PR fixes the deterministic breakages and adds bounded retries where the failure mode was transient provider/API behavior.
Problem Context
Four scheduled CI jobs failed on June 15, 2026:
- nvidia-vision timed out against nvidia/nemotron-3-nano-omni-30b-a3b-reasoning. The notebook should still run weekly, so this skips only that startup probe and still executes the notebook body.
- nvidia/embedding because the default nvidia/llama-3.2-nv-embedqa-1b-v2 model is no longer available, and also hit a transient openrouter/vision timeout. This PR replaces the stale embedding model and retries only retryable provider health-check failures.
- error_max_turns in the docs-and-references recipe at max_turns: 30. This PR raises that recipe budget to 50 while keeping the workflow timeout unchanged.

The weekly notebook run intentionally remains uncached because we still want to verify notebooks run even when they did not change. To reduce OpenRouter FLUX.2 Pro cost without weakening that signal, scheduled runs now lower num_records to 2 only for black-forest-labs/flux.2-pro in notebooks 5 and 6.
Changes
Changed
- DATA_DESIGNER_FLUX_2_PRO_CREATE_NUM_RECORDS=2 only for scheduled runs, with model-specific handling in notebooks 5 and 6.
- nvidia/llama-3.2-nv-embedqa-1b-v2 to nvidia/llama-nemotron-embed-1b-v2 and synced docs/tests, including model-config examples.
- docs-and-references Agentic CI recipe budget from max_turns: 30 to max_turns: 50.
Fixed
- nvidia-vision config so the notebook still executes without failing on the flaky vision model probe.
- TimeoutError, with a short linear backoff and retry log details.
Attention Areas
- scripts/health_checks.py - Reuses the engine's canonical retryable model errors, adds the observed readiness TimeoutError, and backs off before retrying.
- .github/workflows/build-notebooks.yml - Weekly runs remain uncached; only FLUX.2 Pro create records are reduced through the env var.
- docs/notebook_source/4-providing-images-as-context.py - Skips only the startup health check for nvidia-vision, not notebook execution.
- .agents/recipes/docs-and-references/recipe.md - Increases the turn budget for the daily docs audit while leaving the workflow timeout unchanged.
Description updated with AI