fix: provider CI health checks by andreatgretel · Pull Request #750 · NVIDIA-NeMo/DataDesigner

andreatgretel · 2026-06-15T16:21:35Z

Summary

This PR hardens the scheduled provider and notebook CI paths that failed on June 15, 2026. The failures were spread across notebook execution, provider health checks, and Agentic CI jobs, so this PR fixes the deterministic breakages and adds bounded retries where the failure mode was transient provider/API behavior.

Problem Context

Four scheduled CI jobs failed on June 15, 2026:

Build notebooks failed in tutorial notebook 4 before the useful notebook work completed because the startup health check for nvidia-vision timed out against nvidia/nemotron-3-nano-omni-30b-a3b-reasoning. The notebook should still run weekly, so this skips only that startup probe and still executes the notebook body.
Health Checks failed for nvidia/embedding because the default nvidia/llama-3.2-nv-embedqa-1b-v2 model is no longer available, and also hit a transient openrouter/vision timeout. This PR replaces the stale embedding model and retries only retryable provider health-check failures.
Agentic CI: Repository Triage failed during the custom API pre-flight with HTTP 504. The rerun passed the pre-flight, which indicates a transient gateway/provider issue, so this PR adds a small retry loop around that probe.
Agentic CI: Daily Audit hit error_max_turns in the docs-and-references recipe at max_turns: 30. This PR raises that recipe budget to 50 while keeping the workflow timeout unchanged.

The weekly notebook run intentionally remains uncached because we still want to verify notebooks run even when they did not change. To reduce OpenRouter FLUX.2 Pro cost without weakening that signal, scheduled runs now lower num_records to 2 only for black-forest-labs/flux.2-pro in notebooks 5 and 6.

Changes

Changed

Updated the scheduled notebook workflow to set DATA_DESIGNER_FLUX_2_PRO_CREATE_NUM_RECORDS=2 only for scheduled runs, with model-specific handling in notebooks 5 and 6.
Switched the default NVIDIA embedding model from nvidia/llama-3.2-nv-embedqa-1b-v2 to nvidia/llama-nemotron-embed-1b-v2 and synced docs/tests, including model-config examples.
Raised the docs-and-references Agentic CI recipe budget from max_turns: 30 to max_turns: 50.
Synced generated Colab notebooks for tutorials 4, 5, and 6 after updating their source notebooks.

Fixed

Skipped the startup health check for notebook 4's nvidia-vision config so the notebook still executes without failing on the flaky vision model probe.
Added up to 3 attempts for retryable provider health-check failures using the engine's canonical retryable model errors plus the observed readiness TimeoutError, with a short linear backoff and retry log details.
Added up to 3 attempts for the Repository Triage custom API pre-flight so transient HTTP 504s do not fail the job immediately.

Attention Areas

Reviewers: Please pay special attention to the following:

scripts/health_checks.py - Reuses the engine's canonical retryable model errors, adds the observed readiness TimeoutError, and backs off before retrying.
.github/workflows/build-notebooks.yml - Weekly runs remain uncached; only FLUX.2 Pro create records are reduced through the env var.
docs/notebook_source/4-providing-images-as-context.py - Skips only the startup health check for nvidia-vision, not notebook execution.
.agents/recipes/docs-and-references/recipe.md - Increases the turn budget for the daily docs audit while leaving the workflow timeout unchanged.

Description updated with AI

github-actions · 2026-06-15T16:24:28Z

MkDocs preview: https://7f185b3f.dd-docs-preview.pages.dev

Fern preview: https://nvidia-preview-pr-750.docs.buildwithfern.com/nemo/datadesigner

Fern previews include the docs-website version archive with PR changes synced into latest. Notebook tutorials are rendered without execution outputs in previews.

greptile-apps · 2026-06-15T16:24:44Z

Greptile Summary

This PR hardens scheduled CI by replacing an unavailable NVIDIA embedding model, reducing FLUX.2 Pro record counts on scheduled runs, skipping the flaky nvidia-vision startup health check in notebook 4, and adding linear-backoff retries to both the provider health-check script and the agentic CI pre-flight call.

Model swap: nvidia/llama-3.2-nv-embedqa-1b-v2 → nvidia/llama-nemotron-embed-1b-v2 synchronized across constants.py, docs (.md and .mdx), and the unit test.
Retry logic in health_checks.py: wraps check_models in a 3-attempt loop that sleeps attempt × 5s between retries, restricted to RETRYABLE_MODEL_ERRORS (ModelRateLimitError, ModelTimeoutError, ModelInternalServerError, ModelAPIConnectionError) plus Python's built-in TimeoutError; non-retryable errors propagate immediately.
Notebook CI load reduction: DATA_DESIGNER_FLUX_2_PRO_CREATE_NUM_RECORDS=2 is exported only on schedule events; notebooks 5 and 6 read this env var and apply it only when MODEL_ID == "black-forest-labs/flux.2-pro", leaving other models and non-scheduled runs at their defaults.

Confidence Score: 5/5

Safe to merge — all changes are targeted CI fixes with no impact on production data generation logic.

The embedding model swap is consistent across every reference point (constants, docs, tests). The retry logic correctly separates retryable from non-retryable exceptions and sleeps between attempts. The skip_health_check assignment targets a real, defined field on ModelConfig. The env-var gate for FLUX.2 Pro record counts only activates on scheduled runs and leaves manual/push-triggered executions unaffected. No logic errors or correctness issues found.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| scripts/health_checks.py | Adds retry loop (up to 3 attempts) with linear backoff around model health checks; correctly restricts retries to the RETRYABLE_MODEL_ERRORS tuple plus Python's built-in TimeoutError. |
| .github/workflows/build-notebooks.yml | Sets DATA_DESIGNER_FLUX_2_PRO_CREATE_NUM_RECORDS=2 only for scheduled runs, reducing FLUX.2 Pro load without affecting manual or push-triggered runs. |
| .github/workflows/agentic-ci-issue-triage.yml | Wraps the API pre-flight curl in a 3-attempt retry loop with back-off (10s, 20s); correctly exits on the 3rd failure and breaks on first success. |
| docs/notebook_source/4-providing-images-as-context.py | Sets skip_health_check=True on the nvidia-vision model config after construction; ModelConfig defines this field so the assignment is valid. |
| docs/notebook_source/5-generating-images.py | Adds env-var-driven num_records override for FLUX.2 Pro create step; MODEL_ID is hardcoded as 'black-forest-labs/flux.2-pro' so the guard is always true, future-proofing against model changes. |
| docs/notebook_source/6-editing-images-with-image-context.py | Same FLUX.2 Pro env-var override pattern as notebook 5; default of 5 records is reduced to 2 on scheduled CI runs. |
| packages/data-designer-config/src/data_designer/config/utils/constants.py | Switches nvidia embedding model to nvidia/llama-nemotron-embed-1b-v2; change is synchronized across constants, docs, and tests. |
| packages/data-designer-config/tests/config/test_default_model_settings.py | Test assertion updated to match new embedding model name; no logic changes. |
| .agents/recipes/docs-and-references/recipe.md | Increases max_turns from 30 to 50 for the docs-and-references agentic recipe; workflow timeout is unchanged. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[_check_model called] --> B[attempt = 1..3]
    B --> C[Create TemporaryDirectory]
    C --> D[DataDesigner.check_models]
    D --> E{Success?}
    E -- yes --> F[return]
    E -- no --> G{Exception in HEALTH_CHECK_RETRYABLE_ERRORS?}
    G -- no --> H[Propagate immediately]
    G -- yes --> I{attempt == MAX_ATTEMPTS?}
    I -- yes --> J[Re-raise]
    I -- no --> K[sleep attempt x 5s]
    K --> B
    style F fill:#90EE90
    style H fill:#FFB6C1
    style J fill:#FFB6C1
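
The flow above can be sketched in Python roughly as follows. This is a minimal illustration, not the script's actual code: the exception classes are hypothetical stand-ins for the engine's error types, `check_with_retries` and the injectable `sleep` parameter are names invented for the example, and only `HEALTH_CHECK_RETRYABLE_ERRORS`, `MAX_ATTEMPTS`, and the `attempt x 5s` delay come from the flowchart.

```python
import tempfile
import time
from typing import Callable


# Hypothetical stand-ins; the real script reuses the engine's error classes.
class ModelTimeoutError(Exception):
    pass


class ModelRateLimitError(Exception):
    pass


HEALTH_CHECK_RETRYABLE_ERRORS = (ModelTimeoutError, ModelRateLimitError, TimeoutError)
MAX_ATTEMPTS = 3
BACKOFF_SECONDS = 5


def check_with_retries(
    check: Callable[[str], None],
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run check(workdir) up to MAX_ATTEMPTS times; return the attempt that succeeded."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        # A fresh temporary directory per attempt, as in the flowchart.
        with tempfile.TemporaryDirectory() as workdir:
            try:
                check(workdir)
                return attempt
            except HEALTH_CHECK_RETRYABLE_ERRORS as exc:
                if attempt == MAX_ATTEMPTS:
                    raise  # retries exhausted: re-raise the last error
                print(f"RETRY after {type(exc).__name__}: {exc} (attempt {attempt + 1}/{MAX_ATTEMPTS})")
        # Linear backoff: 5s after the first failure, 10s after the second.
        sleep(attempt * BACKOFF_SECONDS)
    raise AssertionError("unreachable")
```

Exceptions outside the tuple are never caught, so they propagate on the first attempt without any delay. Passing `sleep` as a parameter is only a convenience so the delays can be observed without waiting.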

Reviews (3): Last reviewed commit: "Address bot review feedback"

github-actions · 2026-06-15T16:27:04Z

PR #750 — fix: provider CI health checks

Summary

This PR addresses four flaky/broken CI signals around provider health checks and notebook builds:

Notebook 4: skip the NVIDIA vision startup health check (still executes the notebook) by mutating the matching ModelConfig in the builder.
NVIDIA embedding default: replace the EOL nvidia/llama-3.2-nv-embedqa-1b-v2 with nvidia/llama-nemotron-embed-1b-v2 in constants.py, the unit test, and the docs (docs/concepts/... and fern/versions/latest/pages/concepts/...).
Retries: scripts/health_checks.py now makes up to 3 attempts when it sees one of a hardcoded set of transient Model*Error names; the issue-triage workflow's curl API pre-flight makes up to 3 attempts with sleep ATTEMPT*10 backoff.
FLUX.2 Pro: on scheduled build-notebooks runs, export DATA_DESIGNER_FLUX_2_PRO_CREATE_NUM_RECORDS=2, which notebooks 5 and 6 read to clamp num_records for that specific model. The PR also raises the docs-and-references recipe max_turns from 30 to 50.

Total: 11 files, +63 / -22. Mostly low-risk plumbing; correctness is dominated by the small set of code-path changes in scripts/health_checks.py and the two notebook source files.

Findings

🟡 Stale embedding-model reference not updated (docs)

fern/versions/latest/pages/concepts/models/model-configs.mdx:104 still hardcodes the EOL model:

model="nvidia/llama-3.2-nv-embedqa-1b-v2",

This is an example snippet showing a hand-built ModelConfig, parallel to the table you did update at default-model-settings.mdx:52. Since the PR's stated intent is to retire the EOL embedding ID across docs/tests, this one slipped through. Recommend updating to nvidia/llama-nemotron-embed-1b-v2 for consistency.

(There are also data-designer-got-skills.mdx and a trace HTML asset under fern/assets/data-designer-got-skills/ that reference the old model, but those are dev-note posts/captured traces and may legitimately reflect the model-of-the-day at authoring time; flagging only model-configs.mdx as a substantive doc.)

🟡 `RETRYABLE_ERROR_NAMES` uses string matching instead of importing the canonical set

scripts/health_checks.py:40-48 enumerates retryable errors by class name (string):

RETRYABLE_ERROR_NAMES = {
    "ModelAPIConnectionError",
    "ModelAPIError",
    "ModelInternalServerError",
    "ModelRequestAdmissionTimeoutError",
    "ModelRateLimitError",
    "ModelTimeoutError",
    "TimeoutError",
}
...
if type(exc).__name__ not in RETRYABLE_ERROR_NAMES or attempt == MAX_ATTEMPTS:

Two issues:

Fragility: the engine already exports RETRYABLE_MODEL_ERRORS (a tuple of types) at packages/data-designer-engine/src/data_designer/engine/models/errors.py:140. String-name matching silently breaks if any of these classes are renamed or relocated, with no test coverage to catch it. Prefer:
```
from data_designer.engine.models.errors import RETRYABLE_MODEL_ERRORS
...
if not isinstance(exc, RETRYABLE_MODEL_ERRORS) or attempt == MAX_ATTEMPTS:
    raise
```
ModelRequestAdmissionTimeoutError already subclasses ModelTimeoutError, so isinstance covers it.
Divergence from engine policy: this script's set is broader than the engine's canonical RETRYABLE_MODEL_ERRORS (ModelRateLimitError, ModelTimeoutError, ModelInternalServerError, ModelAPIConnectionError). The PR adds ModelAPIError and the builtin TimeoutError. If that broadening is intentional (e.g., the script saw a real ModelAPIError in CI that should be retried), please call it out — otherwise the engine's set is the safer default. If it is intentional, consider whether the engine's set should grow too.
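
To make the fragility point concrete, here is a small sketch comparing the two matching strategies. The classes below are hypothetical stand-ins defined only for this example (the real ones live in the engine's errors module); the subclass relationship mirrors the one stated above.

```python
# Hypothetical stand-ins for the engine's error classes.
class ModelTimeoutError(Exception):
    pass


class ModelRequestAdmissionTimeoutError(ModelTimeoutError):
    pass


class ModelRateLimitError(Exception):
    pass


RETRYABLE_MODEL_ERRORS = (ModelRateLimitError, ModelTimeoutError)
RETRYABLE_ERROR_NAMES = {"ModelRateLimitError", "ModelTimeoutError"}


def retryable_by_name(exc: BaseException) -> bool:
    # String matching only sees the exact class name.
    return type(exc).__name__ in RETRYABLE_ERROR_NAMES


def retryable_by_type(exc: BaseException) -> bool:
    # isinstance follows inheritance, so subclasses are covered.
    return isinstance(exc, RETRYABLE_MODEL_ERRORS)


exc = ModelRequestAdmissionTimeoutError("admission queue timed out")
print(retryable_by_name(exc))  # False: the subclass name is not in the set
print(retryable_by_type(exc))  # True: it is a ModelTimeoutError
```

With name matching, every subclass has to be listed by hand and a rename fails silently; with the tuple, a rename or relocation surfaces as an import error instead.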

🟢 No backoff between health-check retries

_check_model retries immediately with no sleep:

for attempt in range(1, MAX_ATTEMPTS + 1):
    try:
        ...
        return
    except Exception as exc:
        if type(exc).__name__ not in RETRYABLE_ERROR_NAMES or attempt == MAX_ATTEMPTS:
            raise
        print(f"RETRY ...")

Compare to the workflow's API pre-flight which uses sleep $((ATTEMPT * 10)). For rate-limit (ModelRateLimitError) and 5xx upstream-overload conditions, an immediate retry is unlikely to succeed and may aggravate the limit. A small linear backoff (e.g., time.sleep(attempt * 5)) would mirror the workflow and reduce noise. Minor — not a blocker.

🟢 Retry log message off-by-one is fine, but the "attempt 1/3" attempt is silent

print(f"RETRY ... attempt {attempt + 1}/{MAX_ATTEMPTS})") correctly announces the next attempt about to be made (e.g., after the first failure: "attempt 2/3"). Suggest also logging the exception type/message — currently a flake is observable only when the run ultimately fails (full traceback at main() level). One line print(f"RETRY {label} after {type(exc).__name__}: {exc}") would make CI logs much easier to triage.

🟢 Notebook 5's env-var override is effectively a no-op

In docs/notebook_source/5-generating-images.py:277-280:

create_num_records = 2
if MODEL_ID == "black-forest-labs/flux.2-pro":
    create_num_records = int(os.environ.get("DATA_DESIGNER_FLUX_2_PRO_CREATE_NUM_RECORDS") or create_num_records)

MODEL_ID is hardcoded to "black-forest-labs/flux.2-pro" at line 67, and the default is already 2 — the env var only matters in notebook 6 (default 5 → 2 on schedule). Symmetric application is fine for future-proofing if the default ever changes, but worth noting it's a no-op today. No action needed.
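
The gating described here can be illustrated with a small sketch. `resolve_create_num_records` is a hypothetical helper written for this example (the notebooks inline the logic), and the environment is passed in as a plain mapping so the cases can be compared side by side.

```python
from typing import Mapping

FLUX_2_PRO = "black-forest-labs/flux.2-pro"
ENV_VAR = "DATA_DESIGNER_FLUX_2_PRO_CREATE_NUM_RECORDS"


def resolve_create_num_records(model_id: str, default: int, env: Mapping[str, str]) -> int:
    """Apply the env override only for FLUX.2 Pro; otherwise keep the default."""
    if model_id == FLUX_2_PRO:
        return int(env.get(ENV_VAR) or default)
    return default


scheduled = {ENV_VAR: "2"}  # what the scheduled workflow exports
print(resolve_create_num_records(FLUX_2_PRO, 2, scheduled))     # notebook 5: 2 (default already 2)
print(resolve_create_num_records(FLUX_2_PRO, 5, scheduled))     # notebook 6: 2 (reduced from 5)
print(resolve_create_num_records(FLUX_2_PRO, 5, {}))            # non-scheduled run: 5
print(resolve_create_num_records("other/model", 5, scheduled))  # other model (hypothetical ID): 5
```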

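🟢 `int(os.environ.get(...) or create_num_records)` treats an empty string as unset but passes `"0"` through

os.environ.get(...) or default is the standard "treat empty as unset" idiom: an unset or empty variable falls back to the default. The string "0" is non-empty and therefore truthy, so it is not replaced by the default and parses to 0, which is nonsensical for num_records. That is harmless for the scheduled run, which sets the value to 2. Just calling out the pattern in case a future maintainer wants to reject 0 or treat it as an explicit "skip create" sentinel: they'd need to validate the parsed value explicitly.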
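
A quick way to see how the idiom treats each input is to run it against the four interesting values. The two helper names below are made up for this sketch; `with_or_idiom` mirrors the notebook's expression, and `with_none_check` is the `is None` alternative.

```python
from typing import Optional


def with_or_idiom(raw: Optional[str], default: int) -> int:
    """The notebook's pattern: any falsy value (None or "") falls back to the default."""
    return int(raw or default)


def with_none_check(raw: Optional[str], default: int) -> int:
    """Alternative: only a missing variable (None) falls back to the default."""
    return default if raw is None else int(raw)


for raw in (None, "", "2", "0"):
    try:
        strict = with_none_check(raw, 5)
    except ValueError:
        strict = "ValueError"  # int("") is not a valid integer literal
    print(repr(raw), with_or_idiom(raw, 5), strict)
```

Both variants return 0 for `"0"`, because a non-empty string is truthy in Python. They differ only for the empty string, which the `or` idiom maps to the default and the `is None` variant fails to parse.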

🟢 In-place mutation of `config_builder.model_configs` for `skip_health_check`

In 4-providing-images-as-context.py:78-82:

for model_config in config_builder.model_configs:
    if model_config.alias == "nvidia-vision":
        model_config.skip_health_check = True
        break

Verified that skip_health_check is a real public field on ModelConfig (packages/data-designer-config/src/data_designer/config/models.py:653) and that fingerprint.py excludes it from the cache key (good — flipping this won't invalidate cached runs). The early break is fine because aliases are unique. Nothing to fix; pattern is reasonable for tutorial code that intentionally avoids reaching into builder internals via setter helpers (which don't exist for this field).
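
For readers who want to try the pattern in isolation, here is a self-contained sketch. The `ModelConfig` dataclass is a hypothetical stand-in with just the fields the loop touches (it is not the real data_designer class), and `skip_startup_probe`, the `nvidia-text` alias, and the model names are invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ModelConfig:
    """Hypothetical stand-in exposing only the fields used by the loop."""

    alias: str
    model: str
    skip_health_check: bool = False


def skip_startup_probe(model_configs: List[ModelConfig], alias: str) -> bool:
    """Set skip_health_check on the config with this alias; report whether one matched."""
    for model_config in model_configs:
        if model_config.alias == alias:
            model_config.skip_health_check = True
            return True  # aliases are unique, so stop at the first match
    return False


configs = [
    ModelConfig(alias="nvidia-text", model="example/text-model"),
    ModelConfig(alias="nvidia-vision", model="example/vision-model"),
]
print(skip_startup_probe(configs, "nvidia-vision"))  # True
print([c.skip_health_check for c in configs])        # [False, True]
```

Returning a boolean is one possible design choice: it lets the caller notice when the alias no longer matches anything, which the bare loop would silently ignore.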

🟢 Workflow changes (build-notebooks, agentic-ci-issue-triage)

build-notebooks.yml: if [ "$GITHUB_EVENT_NAME" = "schedule" ] correctly scopes the FLUX cap to the weekly run; manual workflow_dispatch still hits the default 5 records. Good.
agentic-ci-issue-triage.yml: retry loop is correct. It exits on 2xx via break, fails after the third non-2xx, and backs off linearly with sleep $((ATTEMPT * 10)) (10s, then 20s). The exit/error wiring is right.

🟢 `max_turns: 30 → 50` in `docs-and-references/recipe.md`

Consistent with the PR description (recipe was hitting the cap). No way to assess "is 50 the right number" without trace data — assuming this is informed by a recent run.

Test coverage

Unit test updated for the new embedding model ID (test_default_model_settings.py:68). Good.
No tests for the new retry logic in health_checks.py. The script is itself a CI tool, so the coverage gap is small, but a quick unit test injecting a mock that raises ModelTimeoutError twice then succeeds would lock in the contract. Optional.
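
The optional test could look something like the sketch below. It is written against a minimal stand-in retry helper so that it runs on its own: `run_with_retries` and the local `ModelTimeoutError` class are assumptions for illustration, and a real test would import the script's function and the engine's error type instead.

```python
from unittest import mock


class ModelTimeoutError(Exception):
    """Hypothetical stand-in for the engine's error class."""


MAX_ATTEMPTS = 3


def run_with_retries(check, sleep):
    """Minimal stand-in for the script's retry loop (linear 5s backoff)."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return check()
        except (ModelTimeoutError, TimeoutError):
            if attempt == MAX_ATTEMPTS:
                raise
            sleep(attempt * 5)


def test_retries_twice_then_succeeds():
    # side_effect raises the exception instances in order, then returns "ok".
    check = mock.Mock(side_effect=[ModelTimeoutError("t1"), ModelTimeoutError("t2"), "ok"])
    sleep = mock.Mock()
    assert run_with_retries(check, sleep) == "ok"
    assert check.call_count == 3
    assert [call.args[0] for call in sleep.call_args_list] == [5, 10]


def test_non_retryable_error_propagates():
    check = mock.Mock(side_effect=ValueError("bad config"))
    sleep = mock.Mock()
    try:
        run_with_retries(check, sleep)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
    assert check.call_count == 1
    sleep.assert_not_called()
```

Injecting `sleep` as a mock keeps the test fast and lets it assert the backoff schedule directly.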

Security & secrets

No secret-handling changes. The retry loop in the workflow logs only HTTP_CODE, never the body or headers. Good.
health_checks.py retry path does not log raw exception messages (currently), so no risk of an upstream provider error leaking creds into logs.

Verdict

Approve with two suggestions worth addressing in this PR:

Update the stale embedding model reference in fern/versions/latest/pages/concepts/models/model-configs.mdx:104 to match the rest of the doc/tests changes.
Switch RETRYABLE_ERROR_NAMES to isinstance(exc, RETRYABLE_MODEL_ERRORS) importing the engine's tuple — eliminates the string-name fragility and forces an explicit decision when the broadening (adding ModelAPIError, TimeoutError) needs to be reconciled with engine policy.

The remaining minor items (retry backoff, logging exception type, notebook-5 no-op) are non-blocking polish.

andreatgretel · 2026-06-15T16:43:00Z

Addressed bot feedback in 889bce9d: updated the remaining embedding model docs examples, switched health-check retries to the engine retryable error tuple plus the observed readiness TimeoutError, added 5s linear backoff, and included retry log details. Resolved the outdated Greptile retry-delay thread.

Fix provider CI health checks

f780d0b

andreatgretel requested a review from a team as a code owner June 15, 2026 16:21

andreatgretel changed the title from "Fix provider CI health checks" to "fix: provider CI health checks" on Jun 15, 2026

andreatgretel temporarily deployed to agentic-ci with GitHub Actions on June 15, 2026 16:22 (status: Inactive)

greptile-apps Bot reviewed Jun 15, 2026

Comment thread scripts/health_checks.py Outdated

Sync generated Colab notebooks

4f9bcaf

Address bot review feedback

889bce9

andreatgretel wants to merge 3 commits into main from andreatgretel/fix/provider-ci-health-checks
