fix: provider CI health checks #750

Open
andreatgretel wants to merge 3 commits into main from andreatgretel/fix/provider-ci-health-checks

@andreatgretel andreatgretel commented Jun 15, 2026

Contributor

Summary

This PR hardens the scheduled provider and notebook CI paths that failed on June 15, 2026. The failures were spread across notebook execution, provider health checks, and Agentic CI jobs, so this PR fixes the deterministic breakages and adds bounded retries where the failure mode was transient provider/API behavior.

Problem Context

Four scheduled CI jobs failed on June 15, 2026:

  • Build notebooks failed in tutorial notebook 4 before the useful notebook work completed because the startup health check for nvidia-vision timed out against nvidia/nemotron-3-nano-omni-30b-a3b-reasoning. The notebook should still run weekly, so this skips only that startup probe and still executes the notebook body.
  • Health Checks failed for nvidia/embedding because the default nvidia/llama-3.2-nv-embedqa-1b-v2 model is no longer available, and also hit a transient openrouter/vision timeout. This PR replaces the stale embedding model and retries only retryable provider health-check failures.
  • Agentic CI: Repository Triage failed during the custom API pre-flight with HTTP 504. The rerun passed the pre-flight, which indicates a transient gateway/provider issue, so this PR adds a small retry loop around that probe.
  • Agentic CI: Daily Audit hit error_max_turns in the docs-and-references recipe at max_turns: 30. This PR raises that recipe budget to 50 while keeping the workflow timeout unchanged.

The weekly notebook run intentionally remains uncached because we still want to verify notebooks run even when they did not change. To reduce OpenRouter FLUX.2 Pro cost without weakening that signal, scheduled runs now lower num_records to 2 only for black-forest-labs/flux.2-pro in notebooks 5 and 6.

Changes

Changed

  • Updated the scheduled notebook workflow to set DATA_DESIGNER_FLUX_2_PRO_CREATE_NUM_RECORDS=2 only for scheduled runs, with model-specific handling in notebooks 5 and 6.
  • Switched the default NVIDIA embedding model from nvidia/llama-3.2-nv-embedqa-1b-v2 to nvidia/llama-nemotron-embed-1b-v2 and synced docs/tests, including model-config examples.
  • Raised the docs-and-references Agentic CI recipe budget from max_turns: 30 to max_turns: 50.
  • Synced generated Colab notebooks for tutorials 4, 5, and 6 after updating their source notebooks.

Fixed

  • Skipped the startup health check for notebook 4's nvidia-vision config so the notebook still executes without failing on the flaky vision model probe.
  • Added up to 3 attempts for retryable provider health-check failures using the engine's canonical retryable model errors plus the observed readiness TimeoutError, with a short linear backoff and retry log details.
  • Added up to 3 attempts for the Repository Triage custom API pre-flight so transient HTTP 504s do not fail the job immediately.

Attention Areas

Reviewers: Please pay special attention to the following:


Description updated with AI

@andreatgretel andreatgretel requested a review from a team as a code owner June 15, 2026 16:21
@andreatgretel andreatgretel changed the title from "Fix provider CI health checks" to "fix: provider CI health checks" Jun 15, 2026

github-actions Bot commented Jun 15, 2026

Contributor

MkDocs preview: https://7f185b3f.dd-docs-preview.pages.dev

Fern preview: https://nvidia-preview-pr-750.docs.buildwithfern.com/nemo/datadesigner

Fern previews include the docs-website version archive with PR changes synced into latest. Notebook tutorials are rendered without execution outputs in previews.

greptile-apps Bot commented Jun 15, 2026

Contributor

Greptile Summary

This PR hardens scheduled CI by replacing an unavailable NVIDIA embedding model, reducing FLUX.2 Pro record counts on scheduled runs, skipping the flaky nvidia-vision startup health check in notebook 4, and adding linear-backoff retries to both the provider health-check script and the agentic CI pre-flight call.

  • Model swap: nvidia/llama-3.2-nv-embedqa-1b-v2 → nvidia/llama-nemotron-embed-1b-v2 synchronized across constants.py, docs (.md and .mdx), and the unit test.
  • Retry logic in health_checks.py: wraps check_models in a 3-attempt loop that sleeps attempt × 5s between retries, restricted to RETRYABLE_MODEL_ERRORS (ModelRateLimitError, ModelTimeoutError, ModelInternalServerError, ModelAPIConnectionError) plus Python's built-in TimeoutError; non-retryable errors propagate immediately.
  • Notebook CI load reduction: DATA_DESIGNER_FLUX_2_PRO_CREATE_NUM_RECORDS=2 is exported only on schedule events; notebooks 5 and 6 read this env var and apply it only when MODEL_ID == "black-forest-labs/flux.2-pro", leaving other models and non-scheduled runs at their defaults.
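
The retry behavior summarized above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/health_checks.py: the exception classes are local stand-ins for the engine's types, and the names `check_with_retries`, `BACKOFF_SECONDS`, and the injectable `sleep` parameter are assumptions made for the example.

```python
import time
from typing import Callable


# Hypothetical stand-ins for the engine's retryable error types. The real
# tuple is described in this thread as RETRYABLE_MODEL_ERRORS.
class ModelTimeoutError(Exception):
    pass


class ModelRateLimitError(Exception):
    pass


# Engine-style retryable errors plus Python's built-in TimeoutError.
HEALTH_CHECK_RETRYABLE_ERRORS = (ModelTimeoutError, ModelRateLimitError, TimeoutError)
MAX_ATTEMPTS = 3
BACKOFF_SECONDS = 5  # assumed name; the thread describes "attempt x 5s"


def check_with_retries(
    check: Callable[[], None],
    label: str,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Run `check`, retrying only retryable errors with linear backoff."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            check()
            return
        except HEALTH_CHECK_RETRYABLE_ERRORS as exc:
            if attempt == MAX_ATTEMPTS:
                raise  # out of attempts: surface the last error
            delay = attempt * BACKOFF_SECONDS  # 5s after attempt 1, 10s after attempt 2
            print(
                f"RETRY {label} after {type(exc).__name__}: {exc} "
                f"(attempt {attempt + 1}/{MAX_ATTEMPTS} in {delay}s)"
            )
            sleep(delay)
```

Because only the listed exception types are caught, any other error leaves the loop on its first occurrence, which matches the "non-retryable errors propagate immediately" behavior described in the summary.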

Confidence Score: 5/5

Safe to merge — all changes are targeted CI fixes with no impact on production data generation logic.

The embedding model swap is consistent across every reference point (constants, docs, tests). The retry logic correctly separates retryable from non-retryable exceptions and sleeps between attempts. The skip_health_check assignment targets a real, defined field on ModelConfig. The env-var gate for FLUX.2 Pro record counts only activates on scheduled runs and leaves manual/push-triggered executions unaffected. No logic errors or correctness issues found.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| scripts/health_checks.py | Adds retry loop (up to 3 attempts) with linear backoff around model health checks; correctly restricts retries to the RETRYABLE_MODEL_ERRORS tuple plus Python's built-in TimeoutError. |
| .github/workflows/build-notebooks.yml | Sets DATA_DESIGNER_FLUX_2_PRO_CREATE_NUM_RECORDS=2 only for scheduled runs, reducing FLUX.2 Pro load without affecting manual or push-triggered runs. |
| .github/workflows/agentic-ci-issue-triage.yml | Wraps the API pre-flight curl in a 3-attempt retry loop with back-off (10s, 20s); correctly exits on the 3rd failure and breaks on first success. |
| docs/notebook_source/4-providing-images-as-context.py | Sets skip_health_check=True on the nvidia-vision model config after construction; ModelConfig defines this field so the assignment is valid. |
| docs/notebook_source/5-generating-images.py | Adds env-var-driven num_records override for FLUX.2 Pro create step; MODEL_ID is hardcoded as 'black-forest-labs/flux.2-pro' so the guard is always true, future-proofing against model changes. |
| docs/notebook_source/6-editing-images-with-image-context.py | Same FLUX.2 Pro env-var override pattern as notebook 5; default of 5 records is reduced to 2 on scheduled CI runs. |
| packages/data-designer-config/src/data_designer/config/utils/constants.py | Switches nvidia embedding model to nvidia/llama-nemotron-embed-1b-v2; change is synchronized across constants, docs, and tests. |
| packages/data-designer-config/tests/config/test_default_model_settings.py | Test assertion updated to match new embedding model name; no logic changes. |
| .agents/recipes/docs-and-references/recipe.md | Increases max_turns from 30 to 50 for the docs-and-references agentic recipe; workflow timeout is unchanged. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[_check_model called] --> B[attempt = 1..3]
    B --> C[Create TemporaryDirectory]
    C --> D[DataDesigner.check_models]
    D --> E{Success?}
    E -- yes --> F[return]
    E -- no --> G{Exception in HEALTH_CHECK_RETRYABLE_ERRORS?}
    G -- no --> H[Propagate immediately]
    G -- yes --> I{attempt == MAX_ATTEMPTS?}
    I -- yes --> J[Re-raise]
    I -- no --> K[sleep attempt x 5s]
    K --> B
    style F fill:#90EE90
    style H fill:#FFB6C1
    style J fill:#FFB6C1

Reviews (3): Last reviewed commit: "Address bot review feedback"

Comment thread scripts/health_checks.py Outdated
@github-actions

Contributor

PR #750 — fix: provider CI health checks

Summary

This PR addresses four flaky/broken CI signals around provider health checks and notebook builds:

  1. Notebook 4 — skip the NVIDIA vision startup health check (still executes the notebook) by mutating the matching ModelConfig in the builder.
  2. NVIDIA embedding default — replace the EOL nvidia/llama-3.2-nv-embedqa-1b-v2 with nvidia/llama-nemotron-embed-1b-v2 in constants.py, the unit test, and the docs (docs/concepts/... and fern/versions/latest/pages/concepts/...).
  3. Retries: scripts/health_checks.py now makes up to 3 attempts on a hardcoded set of transient Model*Error names; the issue-triage workflow's curl API pre-flight makes up to 3 attempts with sleep ATTEMPT*10 backoff.
  4. FLUX.2 Pro — on scheduled build-notebooks runs, export DATA_DESIGNER_FLUX_2_PRO_CREATE_NUM_RECORDS=2, which notebooks 5 and 6 read to clamp num_records for that specific model. Also raises docs-and-references recipe max_turns from 30 → 50.

Total: 11 files, +63 / -22. Mostly low-risk plumbing; correctness is dominated by the small set of code-path changes in scripts/health_checks.py and the two notebook source files.

Findings

🟡 Stale embedding-model reference not updated (docs)

fern/versions/latest/pages/concepts/models/model-configs.mdx:104 still hardcodes the EOL model:

model="nvidia/llama-3.2-nv-embedqa-1b-v2",

This is an example snippet showing a hand-built ModelConfig, parallel to the table you did update at default-model-settings.mdx:52. Since the PR's stated intent is to retire the EOL embedding ID across docs/tests, this one slipped through. Recommend updating to nvidia/llama-nemotron-embed-1b-v2 for consistency.

(There are also data-designer-got-skills.mdx and a trace HTML asset under fern/assets/data-designer-got-skills/ that reference the old model, but those are dev-note posts/captured traces and may legitimately reflect the model-of-the-day at authoring time; flagging only model-configs.mdx as a substantive doc.)

🟡 RETRYABLE_ERROR_NAMES uses string matching instead of importing the canonical set

scripts/health_checks.py:40-48 enumerates retryable errors by class name (string):

RETRYABLE_ERROR_NAMES = {
    "ModelAPIConnectionError",
    "ModelAPIError",
    "ModelInternalServerError",
    "ModelRequestAdmissionTimeoutError",
    "ModelRateLimitError",
    "ModelTimeoutError",
    "TimeoutError",
}
...
if type(exc).__name__ not in RETRYABLE_ERROR_NAMES or attempt == MAX_ATTEMPTS:

Two issues:

  1. Fragility: the engine already exports RETRYABLE_MODEL_ERRORS (a tuple of types) at packages/data-designer-engine/src/data_designer/engine/models/errors.py:140. String-name matching silently breaks if any of these classes are renamed or relocated, with no test coverage to catch it. Prefer:
    from data_designer.engine.models.errors import RETRYABLE_MODEL_ERRORS
    ...
    if not isinstance(exc, RETRYABLE_MODEL_ERRORS) or attempt == MAX_ATTEMPTS:
        raise
    ModelRequestAdmissionTimeoutError already subclasses ModelTimeoutError, so isinstance covers it.
  2. Divergence from engine policy: this script's set is broader than the engine's canonical RETRYABLE_MODEL_ERRORS (ModelRateLimitError, ModelTimeoutError, ModelInternalServerError, ModelAPIConnectionError). The PR adds ModelAPIError and the builtin TimeoutError. If that broadening is intentional (e.g., the script saw a real ModelAPIError in CI that should be retried), please call it out — otherwise the engine's set is the safer default. If it is intentional, consider whether the engine's set should grow too.
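
To illustrate the fragility point, here is a small sketch comparing the two matching strategies. The classes below are local stand-ins that mirror the subclass relationship described above; they are not imported from the engine, and `retryable_by_type` / `retryable_by_name` are hypothetical helper names.

```python
# Hypothetical stand-ins mirroring the hierarchy described in the review.
class ModelTimeoutError(Exception):
    pass


class ModelRequestAdmissionTimeoutError(ModelTimeoutError):
    pass


RETRYABLE_TYPES = (ModelTimeoutError,)   # tuple of types, like the engine's set
RETRYABLE_NAMES = {"ModelTimeoutError"}  # string names, subclass not listed


def retryable_by_type(exc: BaseException) -> bool:
    # isinstance follows inheritance, so subclasses are covered automatically.
    return isinstance(exc, RETRYABLE_TYPES)


def retryable_by_name(exc: BaseException) -> bool:
    # Name matching only sees the concrete class name and ignores inheritance.
    return type(exc).__name__ in RETRYABLE_NAMES


exc = ModelRequestAdmissionTimeoutError("admission queue timed out")
print(retryable_by_type(exc))  # True: the subclass matches via its parent
print(retryable_by_name(exc))  # False: the subclass name is not in the set
```

With name matching, every subclass and every rename has to be tracked by hand; with a tuple of types, a rename shows up as an import error rather than a silently skipped retry.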

🟢 No backoff between health-check retries

_check_model retries immediately with no sleep:

for attempt in range(1, MAX_ATTEMPTS + 1):
    try:
        ...
        return
    except Exception as exc:
        if type(exc).__name__ not in RETRYABLE_ERROR_NAMES or attempt == MAX_ATTEMPTS:
            raise
        print(f"RETRY ...")

Compare to the workflow's API pre-flight which uses sleep $((ATTEMPT * 10)). For rate-limit (ModelRateLimitError) and 5xx upstream-overload conditions, an immediate retry is unlikely to succeed and may aggravate the limit. A small linear backoff (e.g., time.sleep(attempt * 5)) would mirror the workflow and reduce noise. Minor — not a blocker.

🟢 Retry log message off-by-one is fine, but the "attempt 1/3" attempt is silent

print(f"RETRY ... attempt {attempt + 1}/{MAX_ATTEMPTS})") correctly announces the next attempt about to be made (e.g., after the first failure: "attempt 2/3"). Suggest also logging the exception type/message — currently a flake is observable only when the run ultimately fails (full traceback at main() level). One line print(f"RETRY {label} after {type(exc).__name__}: {exc}") would make CI logs much easier to triage.

🟢 Notebook 5's env-var override is effectively a no-op

In docs/notebook_source/5-generating-images.py:277-280:

create_num_records = 2
if MODEL_ID == "black-forest-labs/flux.2-pro":
    create_num_records = int(os.environ.get("DATA_DESIGNER_FLUX_2_PRO_CREATE_NUM_RECORDS") or create_num_records)

MODEL_ID is hardcoded to "black-forest-labs/flux.2-pro" at line 67, and the default is already 2 — the env var only matters in notebook 6 (default 5 → 2 on schedule). Symmetric application is fine for future-proofing if the default ever changes, but worth noting it's a no-op today. No action needed.
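
A small sketch of the override logic makes the difference between the two notebooks easier to see. The helper name `resolve_create_num_records` and the injectable `env` mapping are assumptions for illustration; the notebooks inline this logic rather than calling a function.

```python
import os
from typing import Mapping

FLUX_2_PRO = "black-forest-labs/flux.2-pro"
ENV_VAR = "DATA_DESIGNER_FLUX_2_PRO_CREATE_NUM_RECORDS"


def resolve_create_num_records(
    model_id: str,
    default: int,
    env: Mapping[str, str] = os.environ,
) -> int:
    """Return the record count, applying the env override only for FLUX.2 Pro."""
    if model_id != FLUX_2_PRO:
        return default
    # An unset or empty variable falls back to the notebook's own default.
    return int(env.get(ENV_VAR) or default)


scheduled = {ENV_VAR: "2"}  # what the scheduled workflow exports
print(resolve_create_num_records(FLUX_2_PRO, 2, scheduled))  # notebook 5: 2 either way
print(resolve_create_num_records(FLUX_2_PRO, 5, scheduled))  # notebook 6: 5 becomes 2
print(resolve_create_num_records(FLUX_2_PRO, 5, {}))         # non-scheduled run: 5
```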

🟢 int(os.environ.get(...) or create_num_records) treats an empty string as unset but passes "0" through

os.environ.get(...) or default is the standard "treat empty as unset" idiom. Note that "0" is a non-empty (truthy) string, so it is not replaced by the default and parses to num_records = 0; that is harmless here because 0 for num_records is nonsensical anyway and CI only ever sets 2. Just calling out the pattern in case a future maintainer wants stricter handling (for example rejecting values below 1): they'd need to validate the parsed integer explicitly rather than rely on the or fallback.

🟢 In-place mutation of config_builder.model_configs for skip_health_check

In 4-providing-images-as-context.py:78-82:

for model_config in config_builder.model_configs:
    if model_config.alias == "nvidia-vision":
        model_config.skip_health_check = True
        break

Verified that skip_health_check is a real public field on ModelConfig (packages/data-designer-config/src/data_designer/config/models.py:653) and that fingerprint.py excludes it from the cache key (good — flipping this won't invalidate cached runs). The early break is fine because aliases are unique. Nothing to fix; pattern is reasonable for tutorial code that intentionally avoids reaching into builder internals via setter helpers (which don't exist for this field).

🟢 Workflow changes (build-notebooks, agentic-ci-issue-triage)

  • build-notebooks.yml: if [ "$GITHUB_EVENT_NAME" = "schedule" ] correctly scopes the FLUX cap to the weekly run; manual workflow_dispatch still hits the default 5 records. Good.
  • agentic-ci-issue-triage.yml: retry loop is correct. It exits on 2xx via break, fails after the third non-2xx, and uses a linear sleep of ATTEMPT * 10 (10s/20s). The exit/error wiring is right.

🟢 max_turns: 30 → 50 in docs-and-references/recipe.md

Consistent with the PR description (recipe was hitting the cap). No way to assess "is 50 the right number" without trace data — assuming this is informed by a recent run.

Test coverage

  • Unit test updated for the new embedding model ID (test_default_model_settings.py:68). Good.
  • No tests for the new retry logic in health_checks.py. The script is itself a CI tool, so the coverage gap is small, but a quick unit test injecting a mock that raises ModelTimeoutError twice then succeeds would lock in the contract. Optional.
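
The optional test mentioned above might look something like this sketch. It does not import scripts/health_checks.py: `run_with_retries` is a simplified, hypothetical stand-in for the retry loop in `_check_model`, and `ModelTimeoutError` is a local placeholder for the engine's class.

```python
from unittest.mock import Mock


class ModelTimeoutError(Exception):
    """Hypothetical stand-in for the engine's ModelTimeoutError."""


def run_with_retries(check, max_attempts=3, sleep=lambda seconds: None):
    """Simplified stand-in for the retry loop described in this PR."""
    for attempt in range(1, max_attempts + 1):
        try:
            return check()
        except (ModelTimeoutError, TimeoutError):
            if attempt == max_attempts:
                raise
            sleep(attempt * 5)  # linear backoff: 5s, then 10s


def test_retries_twice_then_succeeds():
    # The mock raises on the first two calls and returns None on the third.
    check = Mock(side_effect=[ModelTimeoutError("t1"), ModelTimeoutError("t2"), None])
    sleep = Mock()
    run_with_retries(check, sleep=sleep)
    assert check.call_count == 3
    assert [call.args[0] for call in sleep.call_args_list] == [5, 10]


def test_gives_up_after_max_attempts():
    # A mock that always raises should be called exactly max_attempts times.
    check = Mock(side_effect=ModelTimeoutError("always"))
    try:
        run_with_retries(check)
    except ModelTimeoutError:
        pass
    else:
        raise AssertionError("expected ModelTimeoutError")
    assert check.call_count == 3
```

Injecting `sleep` keeps the test fast and lets it assert on the backoff schedule without patching the time module.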

Security & secrets

  • No secret-handling changes. The retry loop in the workflow logs only HTTP_CODE, never the body or headers. Good.
  • health_checks.py retry path does not log raw exception messages (currently), so no risk of an upstream provider error leaking creds into logs.

Verdict

Approve with two suggestions worth addressing in this PR:

  1. Update the stale embedding model reference in fern/versions/latest/pages/concepts/models/model-configs.mdx:104 to match the rest of the doc/tests changes.
  2. Switch RETRYABLE_ERROR_NAMES to isinstance(exc, RETRYABLE_MODEL_ERRORS) importing the engine's tuple — eliminates the string-name fragility and forces an explicit decision when the broadening (adding ModelAPIError, TimeoutError) needs to be reconciled with engine policy.

The other minor items (retry backoff, logging exception type, notebook-5 no-op) are non-blocking polish.

@andreatgretel

Contributor Author

Addressed bot feedback in 889bce9d: updated the remaining embedding model docs examples, switched health-check retries to the engine retryable error tuple plus the observed readiness TimeoutError, added 5s linear backoff, and included retry log details. Resolved the outdated Greptile retry-delay thread.
