perf: Production readiness with Workload purging, and AuthN injection by mdrakiburrahman · Pull Request #32 · microsoft/dbt-scope

mdrakiburrahman · 2026-06-04T23:34:55Z

Why this change is needed

Closes #33.

The mon_database_metadata incremental model in dbt-ingest-geneva failed
during a long stress run with:

Compilation Error in model mon_database_metadata (models/mon_database_metadata.sql)
  Object of type datetime is not JSON serializable

Inside the adapter:

scope adapter: write_batch_sources failed for batch 20

The crash always happens on the second parquet compaction
(batch_id > 0 and batch_id % source_compaction_interval == 0),
i.e. once a previous .parquet snapshot already exists under
_checkpoint/sources/. With the default source_compaction_interval=10,
that's batch 20 (as observed); with source_compaction_interval=1 (the
aggressive case), it's batch 2.
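
The trigger condition above can be checked with a few lines of Python. This is only a sketch: the helper names `is_compaction_batch` and `second_compaction_batch` are hypothetical and are not taken from the adapter.

```python
def is_compaction_batch(batch_id: int, source_compaction_interval: int) -> bool:
    """True when a batch triggers a parquet compaction, per the condition above."""
    return batch_id > 0 and batch_id % source_compaction_interval == 0


def second_compaction_batch(source_compaction_interval: int) -> int:
    """First batch at which a previous .parquet snapshot already exists."""
    # Scan far enough to be sure at least two compactions are seen.
    hits = [
        batch_id
        for batch_id in range(1, 3 * source_compaction_interval + 1)
        if is_compaction_batch(batch_id, source_compaction_interval)
    ]
    return hits[1]
```

With the default interval of 10 the second compaction lands on batch 20, and with an interval of 1 it lands on batch 2, matching the observed failure.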

Root cause

CheckpointManager._write_snapshot_parquet builds the next snapshot by:

1. Reading the latest existing .parquet snapshot via DuckDB
   (SELECT * FROM read_parquet(...)).
2. Reading JSONL diffs written since that snapshot.
3. Materialising everything into a list[dict] in Python.
4. Writing the merged list as NDJSON via
   nf.write(json.dumps(r) + "\n"), then loading the NDJSON back via
   read_json_auto and COPY ... TO ... (FORMAT PARQUET).

batchProcessingTime is created as an ISO 8601 string in
_build_source_records, but DuckDB's read_json_auto infers it as
TIMESTAMP once the NDJSON has enough rows of consistent ISO text
(reproducible at ~20 rows with duckdb 1.5.3; the in-test fast-loop case
with ≤6 records per batch stays under the threshold, which is why our
existing multi-compaction tests didn't catch it).

So the first snapshot (batch 10) stores the column as TIMESTAMP. On the
next compaction (batch 20), DuckDB returns those values as Python
datetime.datetime, and json.dumps(r) then raises
TypeError: Object of type datetime is not JSON serializable. The outer
try/except in write_batch_sources logs the failure and re-raises,
which dbt surfaces as a "Compilation Error" on the model after ~7h43m of
wasted SCOPE compute.
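
The Python half of the failure can be reproduced without DuckDB. In this minimal sketch the record is hand-built with illustrative values to mimic what a TIMESTAMP column looks like once it is read back into Python; it is not real checkpoint data.

```python
import json
from datetime import datetime

# Illustrative record: batchProcessingTime arrives as a datetime, as it
# would after a TIMESTAMP-typed snapshot is read back into Python.
record = {
    "path": "/example/source/file",
    "modificationTime": 0,
    "batchId": 10,
    "batchProcessingTime": datetime(2026, 6, 4, 12, 0, 0),
}

try:
    json.dumps(record)
    error_message = None
except TypeError as exc:
    # json has no built-in encoding for datetime objects.
    error_message = str(exc)
```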

How

Two small changes in dbt/adapters/scope/checkpoint.py:

1. Module-level _json_default(o) helper that maps
   datetime → o.isoformat() and raises TypeError for anything else.
   Naive datetime instances (which is what DuckDB hands back after a
   TIMESTAMP round-trip) are re-attached to UTC before formatting so
   the on-disk ISO string keeps its +00:00 suffix, preserving the
   timezone-aware contract that _build_source_records originally
   established. Wired into both json.dumps call sites: _write_jsonl
   (defensive; today's inputs are pure str/int) and the NDJSON write in
   _write_snapshot_parquet (where the crash actually happens). This
   covers any existing customer parquet snapshots whose
   batchProcessingTime column is already TIMESTAMP.
2. Stabilise the snapshot parquet schema by replacing the snapshot's
   SELECT * with an explicit, fully-typed projection: path VARCHAR,
   modificationTime BIGINT, batchId INTEGER,
   batchProcessingTime TIMESTAMP. This stops the schema from drifting
   between VARCHAR and TIMESTAMP depending on DuckDB's sample-size
   heuristic, and also defends against future drift on the other three
   columns. Every subsequent read then returns datetime for
   batchProcessingTime, and _json_default normalises it back to an
   ISO string on the next round-trip.

Considerations

- No data migration required. Existing parquet snapshots already on
  customer ADLS accounts (where the column was previously inferred as
  TIMESTAMP) are now handled correctly by _json_default.
- Tiny diff, lowest blast radius. A larger refactor (doing the
  union entirely in DuckDB via read_parquet UNION ALL read_json_auto
  to skip the Python round-trip) would be cleaner but is out of scope:
  the compaction code path is I/O-bound on ADLS, not on the Python
  round-trip, so the perf win at current scale (at most a few hundred
  records per snapshot) is negligible.
- Alternative considered: instead of pinning to TIMESTAMP, we
  could pin to VARCHAR; both fix the bug. We picked TIMESTAMP
  because it's the better long-term schema (and matches what DuckDB
  picks at production scale anyway).

Test

- New regression unit test:
  tests/unit/test_checkpoint_lifecycle.py::TestCheckpointLifecycle::test_compaction_handles_timestamp_typed_prior_snapshot
  pre-seeds a TIMESTAMP-typed parquet snapshot (22 rows) into the
  in-memory ADLS mock, writes an intermediate JSONL diff at batch 15
  (3 rows), then triggers compaction at batch 20 (2 rows) and asserts:
  - write_batch_sources does not raise.
  - The new snapshot contains records from all three legs of the
    union: prior snapshot + JSONL diff + current batch
    (22 + 3 + 2 = 27 rows, with batchId values for every leg).
  - The new snapshot's batchProcessingTime is deterministically
    TIMESTAMP.

  Verified to fail without the fix with the exact production error
  (TypeError: Object of type datetime is not JSON serializable) and
  pass with it.
- .scripts/run.sh unit-test: 336/336 passing.
- .scripts/run.sh lint: clean.
- .scripts/run.sh integration-test: full suite re-run end-to-end
  against ADLA (no regressions observed).

Addendum: credential retry resilience (commit `9fc39e8`)

Why a second fix in this PR

After the datetime fix landed, the same dbt-ingest-geneva stress run
surfaced a second, independent failure mode in a CI workflow run:
https://github.com/microsoft/dbt-scope/actions/runs/27011393672/job/79716601894

        except Exception as ex:
            # could be a timeout, for example
            error = CredentialUnavailableError(message="Failed to invoke the Azure CLI")
>           raise error from ex
E           azure.identity._exceptions.CredentialUnavailableError: Failed to invoke the Azure CLI

Under high concurrency (the production model runs with
Concurrency: 32 threads), the az subprocess that
AzureCliCredential.get_token shells out to can time out. The adapter
previously raised on the first occurrence, failing the dbt run after
multi-hour ingestion jobs even though the underlying issue is
transient.

What changed

Centralised retry handling in LockedTokenCredential:

- New RetryPolicy dataclass: linear backoff, capped delay, no
  jitter. Total attempts == max_retries + 1.
- RetryPolicy.from_http_retries(http_retries) reuses the existing
  http_retries profile field so users can tune retries via
  profiles.yml (default bumped 3 → 10 per user spec).
- LockedTokenCredential.get_token catches only
  CredentialUnavailableError (transient subprocess failures);
  permanent ClientAuthenticationError is intentionally not retried.
  It releases the FileLock between attempts so other workers continue
  to make progress, and sleeps via an injectable callable for
  deterministic tests.
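
The retry behaviour described above can be sketched as follows. This is not the adapter's code: `TransientCredentialError` stands in for azure.identity's CredentialUnavailableError, a `threading.Lock` stands in for the FileLock, and the field names `step_seconds` and `max_delay_seconds` (1 s step, 10 s cap) are assumptions chosen to reproduce the delay sequence listed under Tests.

```python
import threading
from dataclasses import dataclass


class TransientCredentialError(Exception):
    """Stand-in for azure.identity.CredentialUnavailableError."""


@dataclass(frozen=True)
class RetryPolicy:
    max_retries: int = 10            # default stated in the PR
    step_seconds: float = 1.0        # assumed linear step
    max_delay_seconds: float = 10.0  # assumed cap

    def delay(self, retry_number: int) -> float:
        """Linear backoff, capped, no jitter. retry_number starts at 1."""
        return min(self.step_seconds * retry_number, self.max_delay_seconds)


def get_token_with_retry(acquire, policy, lock, sleep):
    """Call acquire() under the lock, retrying transient failures only."""
    for attempt in range(policy.max_retries + 1):  # attempts == max_retries + 1
        try:
            with lock:  # held only while acquiring, never while sleeping
                return acquire()
        except TransientCredentialError:
            if attempt == policy.max_retries:
                raise  # retries exhausted: propagate the last error
        sleep(policy.delay(attempt + 1))  # the lock is already released here
```

Injecting `sleep` lets a test record the requested delays instead of waiting. Any exception other than the transient one propagates on the first attempt.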

All credential acquisition sites now flow through
LockedTokenCredential and honour the policy:

- ScopeConnectionHandle (ADLA query path)
- AdlsGen1Client (source-file discovery)
- CheckpointManager (watermark + snapshot read/write)
- ScopeAdapter.list_relations_without_caching

Defensive: no more silent swallowing of exhausted credentials

A rubber-duck review surfaced several broad except Exception blocks
that would silently convert exhausted credentials into apparently
benign outcomes — masking auth failures and corrupting incremental
state. Each now re-raises CredentialUnavailableError before the
broad handler:

- CheckpointManager.read_watermark → would have returned None,
  silently flipping incremental → full refresh.
- AdlsGen1Client._list_one_dir / _walk / list_files (non-recursive)
  / _directory_exists / _list_directory_files / enrich_with_estimates
  → would have returned empty / partial file lists, advancing the
  watermark past unseen files.
- DuckDbDeltaLakeClient.table_exists / get_max_partition /
  get_columns / list_table_paths → would have looked like
  "Delta table missing".
- ScopeAdapter.list_relations_without_caching (inner + outer except)
  → would have returned an empty Delta-table list.
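
The pattern applied at each of these sites can be sketched as below. The exception class is a local stand-in for azure.identity's CredentialUnavailableError, and `read_watermark` with its `fetch` callable is a simplified, hypothetical shape rather than the real method signature.

```python
class CredentialUnavailableError(Exception):
    """Local stand-in for azure.identity.CredentialUnavailableError."""


def read_watermark(fetch):
    """Return the stored watermark, or None if it cannot be read.

    Credential exhaustion is deliberately NOT folded into the None case,
    because None would be interpreted as "no watermark, do a full refresh".
    """
    try:
        return fetch()
    except CredentialUnavailableError:
        raise  # auth failures must surface to the caller
    except Exception:
        return None  # e.g. the watermark file does not exist yet
```

The narrow handler has to come first: Python evaluates except clauses in order, so a leading `except Exception` would still swallow the credential error.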

Tests

- New unit test file tests/unit/test_locked_token_credential.py
  (14 tests): defaults, from_http_retries for None/-1/0/25,
  succeed-on-first-attempt (no sleep), succeed-after-N-failures,
  exhaust-and-re-raise, linear-capped delay sequence
  [1,2,3,...,10,10,10], zero-retries == single-attempt,
  non-CredentialUnavailableError exceptions are NOT retried,
  claims= kwarg passthrough, lock-released-between-attempts (via
  parallel acquisition probe), default policy used when None.
- New regression tests:
  - tests/unit/test_adls_gen1_client.py::TestAdlsGen1Client::test_recursive_subdir_credential_exhaustion_propagates
  - tests/unit/test_adls_gen1_client.py::TestAdlsGen1Client::test_non_recursive_credential_exhaustion_propagates
  - tests/unit/test_checkpoint.py::TestCheckpointManagerWatermark::test_read_watermark_propagates_credential_exhaustion
- .scripts/run.sh unit-test: 353/353 passing.
- .scripts/run.sh lint: clean.
- .scripts/run.sh integration-test: 3/3 passing end-to-end
  against ADLA (9:58).

… JSON

`CheckpointManager._write_snapshot_parquet` round-trips records through DuckDB → Python dict → NDJSON → parquet to produce each compaction snapshot. When the prior parquet snapshot already exists, DuckDB infers the `batchProcessingTime` column as `TIMESTAMP` (auto-inferred from ISO strings once enough rows exist), so a subsequent compaction round-trips the column back to Python as `datetime` objects. The next `json.dumps(record)` then raises:

    TypeError: Object of type datetime is not JSON serializable

…which surfaces in dbt as a compilation error and aborts the model run.

The bug only triggers from the second compaction onwards and only when the NDJSON sample is large enough for DuckDB to infer `TIMESTAMP` (reproducible at ~20 rows). The production stress run hit this at `write_batch_sources failed for batch 20` after 7h43m.

Fix:

1. Add a module-level `_json_default` encoder that converts `datetime` instances to ISO-8601 strings (re-attaching UTC for naive datetimes so the existing timezone-aware contract is preserved on round-trip) and wire it into both `_write_jsonl` and the NDJSON write in `_write_snapshot_parquet`.
2. Pin the snapshot parquet schema by explicitly casting each column in the `CREATE TABLE` so the schema no longer depends on DuckDB's JSON type inference. `batchProcessingTime` is now deterministically `TIMESTAMP`, and the round-trip back to datetime is handled by (1).

Regression coverage added to `tests/unit/test_checkpoint_lifecycle.py::test_compaction_handles_timestamp_typed_prior_snapshot`: seeds a `TIMESTAMP`-typed prior snapshot with 22 rows, writes a JSONL diff at batch 15, then triggers compaction at batch 20 and asserts the new snapshot contains records from all three legs (prior snapshot + JSONL diff + current batch) and that its `batchProcessingTime` schema is `TIMESTAMP`. Fails on `main` with the exact production `TypeError`, passes after the fix.

Closes #33

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Under high concurrency the ``az`` subprocess that ``AzureCliCredential`` shells out to can time out, surfacing as ``CredentialUnavailableError: Failed to invoke the Azure CLI``. The adapter previously raised on the first occurrence, failing dbt runs after multi-hour stress jobs (PR #32).

This change centralises retry handling in ``LockedTokenCredential``:

* New ``RetryPolicy`` dataclass: linear backoff, capped delay, no jitter; total attempts == ``max_retries + 1``.
* ``RetryPolicy.from_http_retries`` reuses the existing ``http_retries`` profile field so users can tune retries via ``profiles.yml`` (default bumped 3 -> 10 per stress-test spec).
* ``LockedTokenCredential.get_token`` catches only ``CredentialUnavailableError`` (transient), releases the ``FileLock`` between attempts so other workers make progress, and sleeps via an injectable ``sleep`` callable for deterministic tests.

All credential touchpoints now flow through ``LockedTokenCredential``:

* ``ScopeConnectionHandle`` (ADLA query path)
* ``AdlsGen1Client`` (source-file discovery)
* ``CheckpointManager`` (watermark + snapshot RW)
* ``ScopeAdapter.list_relations_without_caching``

Adds defensive ``except CredentialUnavailableError: raise`` ahead of broad ``except Exception`` blocks that previously silently converted exhausted credentials into 'no watermark' / 'no files' / 'no delta table' / 'no columns', any of which can corrupt incremental state.

Tests: 17 new unit tests (353 total), all integration tests pass (3/3, 9:58).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

mdrakiburrahman · 2026-06-05T13:09:31Z

Pushed second commit 9fc39e8 adding credential-retry resilience for the CredentialUnavailableError: Failed to invoke the Azure CLI failure mode seen in CI run 27011393672. See updated PR body for details.

Verification: 353/353 unit tests, 3/3 integration tests (9:58), lint clean.

Emit SET @@MaxFileCountPerOutputFileSet = N; in every generated SCOPE script so the cap on distinct OutputFileSet partition files is deterministic across cluster defaults.

Configurable end-to-end via:

- profiles.yml target.max_file_count_per_output_file_set
- dbt_project.yml models config
- per-model {{ config(max_file_count_per_output_file_set=...) }}

Adapter default is 5000 (matches the conservative cluster ceiling on Fabric/OneLake clusters seen today). Project consumers needing the SCOPE compiler upstream default of 100000 set it in profiles.yml.

ScriptBuilder validates the value falls in the compiler's [1, 1_000_000] range and raises DbtRuntimeError on misconfiguration so failures surface at compile time rather than after job submission.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
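
A rough sketch of the validation and rendering described in this commit. The function name `render_max_file_count` and the `ScriptConfigError` class are hypothetical; the real ScriptBuilder raises dbt's DbtRuntimeError, which is replaced by a local stand-in here so the example runs without dbt installed.

```python
class ScriptConfigError(ValueError):
    """Stand-in for dbt's DbtRuntimeError."""


DEFAULT_MAX_FILE_COUNT = 5000  # adapter default stated in the commit message


def render_max_file_count(value: int = DEFAULT_MAX_FILE_COUNT) -> str:
    """Validate the setting and render the SCOPE SET statement."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise ScriptConfigError(
            f"max_file_count_per_output_file_set must be an integer, got {value!r}"
        )
    if not 1 <= value <= 1_000_000:
        raise ScriptConfigError(
            f"max_file_count_per_output_file_set must be in [1, 1000000], got {value}"
        )
    return f"SET @@MaxFileCountPerOutputFileSet = {value};"
```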

…IGTERM

When the dbt process receives SIGINT (Ctrl+C) or SIGTERM, the adapter now:

1. Sets a shared shutdown flag so every in-flight submit_and_wait loop self-cancels its own SCOPE job.
2. Snapshots the process-wide active-jobs registry and fans out parallel CancelJob REST calls (one thread per job, bounded at 32).
3. Waits for each cancelled job to reach a terminal Ended state, bounded by wait_on_cancel_seconds (default 30s, exposed in profiles.yml). Since cancels run in parallel, total wall-clock is ~wait_on_cancel_seconds regardless of job count.

Default-on; opt out via 'cancel_jobs_on_shutdown: false' in profiles.yml.

Also flips is_cancelable() to True and wires ScopeConnectionManager.cancel / cancel_open to delegate to cancel_all_active_jobs for belt-and-suspenders coverage of dbt's own cancellation pathways. atexit fallback covers paths that unwind via unhandled exception rather than our signal handler.

Tests: 31 new tests in test_shutdown_cancellation.py + 2 in test_credentials.py. Total: 399 unit tests pass (368 baseline + 31).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
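
Step 2 (snapshot the registry, then fan out cancels in parallel) might look roughly like this. It is a sketch under assumptions: `cancel_job` is a hypothetical callable standing in for the CancelJob REST call, the signature is invented for illustration, and the wait for the terminal Ended state in step 3 is left out.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional


def cancel_all_active_jobs(
    active_jobs: Iterable[str],
    cancel_job: Callable[[str], None],
    max_workers: int = 32,
) -> Dict[str, Optional[Exception]]:
    """Cancel every job in parallel; map job id to None or the error raised."""
    jobs = list(active_jobs)  # snapshot, so later registry changes don't matter
    if not jobs:
        return {}

    def _cancel(job_id: str) -> Optional[Exception]:
        try:
            cancel_job(job_id)
            return None
        except Exception as exc:  # one failed cancel must not block the rest
            return exc

    # One thread per job, bounded at max_workers.
    with ThreadPoolExecutor(max_workers=min(max_workers, len(jobs))) as pool:
        outcomes = list(pool.map(_cancel, jobs))
    return dict(zip(jobs, outcomes))
```

Catching the error per job is a design choice of this sketch: during shutdown, a single unreachable job should not prevent the remaining cancels from being sent.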

Mirrors dbt-fabricspark PR #177.

Adds three new profiles.yml fields to ScopeCredentials so the adapter can route every credential acquisition through a user-supplied azure.core.credentials.TokenCredential:

    authentication: token_credential
    credential_class: 'my_pkg.MyCredential'
    credential_kwargs: { ... arbitrary kwargs ... }

Default remains authentication='cli' (AzureCliCredential wrapped in the existing LockedTokenCredential file-lock + retry). The lock is now ONLY applied on the CLI path; non-CLI credentials (SNI / managed identity / notebookutils) skip the az-subprocess serialization which is meaningless for them.

New module dbt/adapters/scope/custom_credential.py provides:

- load_custom_credential(dotted, kwargs): importlib + isinstance check
- Process-wide instance cache keyed by (dotted_path, sorted kwargs)
- Regex validation of the dotted path

A shared build_credential(creds) helper in delta_lake.py is now the single entry point used by every runtime credential acquisition:

- connections.py: ScopeConnectionHandle ADLA token
- impl.py: list_relations_without_caching ADLS Gen2 listing
- impl.py: _get_gen1_client → AdlsGen1Client (Gen1 token)
- impl.py: _get_checkpoint_manager → CheckpointManager (Gen2 r/w)

AdlsGen1Client and CheckpointManager now accept an optional `credential` kwarg; production callers always supply build_credential output. The CLI-default fallback in __init__ is kept only for backward compatibility with existing test fixtures that don't supply one.

DuckDbDeltaLakeClient no longer auto-wraps in LockedTokenCredential; callers (specifically get_default_delta_client for integration tests) now pre-wrap themselves. This preserves test behaviour while letting a future caller pass a non-CLI credential without an unwanted lock.

Tests: 419/419 passing. Added test_custom_credential.py (11 cases) and authentication-field validation in test_credentials.py (7 cases).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
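
The loader described above could be sketched as follows. Several details are assumptions made so the example runs without azure-core: the exact regex, the JSON-based cache key, and a duck-typed `get_token` check in place of the isinstance check against TokenCredential mentioned in the commit.

```python
import importlib
import json
import re
from typing import Any, Dict, Optional, Tuple

# Assumed pattern: at least "module.Class", identifiers separated by dots.
_DOTTED_PATH = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)+")
_CACHE: Dict[Tuple[str, str], Any] = {}


def load_custom_credential(dotted: str, kwargs: Optional[dict] = None) -> Any:
    """Import, instantiate, and cache a user-supplied credential class."""
    kwargs = kwargs or {}
    if not _DOTTED_PATH.fullmatch(dotted):
        raise ValueError(f"Invalid credential_class path: {dotted!r}")
    # Sorting keys makes the cache key independent of kwargs ordering.
    key = (dotted, json.dumps(kwargs, sort_keys=True, default=str))
    if key not in _CACHE:
        module_name, _, class_name = dotted.rpartition(".")
        cls = getattr(importlib.import_module(module_name), class_name)
        instance = cls(**kwargs)
        # Duck-typed stand-in for the isinstance check in the commit message.
        if not callable(getattr(instance, "get_token", None)):
            raise TypeError(f"{dotted} does not provide get_token()")
        _CACHE[key] = instance
    return _CACHE[key]
```

Caching one instance per (path, kwargs) pair means every worker thread shares the credential's internal token cache instead of building its own.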

…concurrent device-code prompts

When dbt-scope runs with multiple threads (e.g. dbt run --threads 4) and 'authentication: token_credential', each worker was independently calling the inner credential (e.g. EntraTokenCredential), which would each walk their own fallback chain. On headless Fabric notebooks this lands on interactive device-code auth, one prompt per thread.

Wrap the custom credential in LockedTokenCredential so the cross-process file lock serializes token acquisition. The first thread populates the inner credential's token cache; subsequent threads reuse the cached token without re-entering the fallback chain.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…on code

AzureCliCredential is now only constructed inside build_credential() when the user explicitly opts in via authentication='cli'. Every other production code path (AdlsGen1Client._get_fs, CheckpointManager) now requires the caller to pass an explicit credential and raises a clear error if none is provided.

Eliminates the silent fallback that was masking misconfigured callers on environments without az login (Fabric notebooks, managed identity hosts) and made device-code prompts surprising to diagnose.

- Removed get_default_delta_client() helper; integration conftest constructs the dev-only AzureCliCredential inline.
- Tests updated to pass credential=MagicMock() instead of patching AzureCliCredential.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ential

Integration conftest helpers (read_watermark, list_source_files, read_batch_source) were still constructing CheckpointManager() bare, which now raises after the production fallback removal. Wire them through the same dev-only AzureCliCredential as _delta_client().

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

When authentication=token_credential (Fabric notebook + SNI etc), use a separate file lock from the AzureCliCredential one so the log lines and contention surface are unambiguous.

- New constant FABRIC_TOKEN_LOCK = /tmp/dbt-scope-fabric-token
- build_credential() dispatches: cli -> AZ_CLI_TOKEN_LOCK, token_credential -> FABRIC_TOKEN_LOCK
- New unit tests lock the dispatch in

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The Fabric notebook runtime ships azure-datalake-store==0.0.5x preinstalled at /home/trusted-service-user/jupyter-env/.... When dbt-scope passes token_credential=<modern TokenCredential> to AzureDLFileSystem, the legacy SDK signature is def __init__(self, store_name, token=None, **kwargs), so the token_credential kwarg is silently dropped into **kwargs, token stays None, and the SDK falls back to msal.PublicClientApplication.initiate_device_flow emitting 'To sign in, use a web browser to open https://login.microsoft.com/device' once per dbt worker thread.

Detect the legacy SDK by inspecting AzureDLFileSystem.__init__ signature. When detected, wrap the modern TokenCredential in _LegacyDataLakeCredentialAdapter which exposes the legacy DataLakeCredential API (signed_session() returning a requests.Session and refresh_token()), then pass it via token= instead of token_credential=.

The adapter is thread-safe (legacy SDK's _walk uses 8 worker threads), caches tokens with a 300s refresh-lead window, and uses the canonical legacy scope 'https://datalake.azure.net//.default' (double slash, matches the modern SDK's own default for parity).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
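
The signature-based detection can be illustrated with `inspect.signature`. Both filesystem classes below are stand-ins written for this sketch: `LegacyFileSystem` copies the legacy `__init__` signature quoted above, while `ModernFileSystem` is an assumed shape for an SDK version that accepts `token_credential`. The helper names are hypothetical.

```python
import inspect


class LegacyFileSystem:
    """Stand-in with the legacy signature quoted in the commit message."""

    def __init__(self, store_name, token=None, **kwargs):
        self.store_name, self.token, self.extra = store_name, token, kwargs


class ModernFileSystem:
    """Stand-in for an SDK version that accepts token_credential (assumed)."""

    def __init__(self, store_name, token_credential=None, **kwargs):
        self.store_name, self.token_credential = store_name, token_credential


def accepts_token_credential(fs_cls) -> bool:
    """True if the constructor names token_credential as a parameter."""
    params = inspect.signature(fs_cls.__init__).parameters
    # A bare **kwargs does not count: the value would be silently ignored.
    return "token_credential" in params


def credential_kwarg_name(fs_cls) -> str:
    """Pick which keyword the credential should be passed under."""
    return "token_credential" if accepts_token_credential(fs_cls) else "token"
```

Checking named parameters rather than attempting the call matters here, because the legacy constructor accepts the unknown keyword without raising.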

mdrakiburrahman and others added 5 commits June 4, 2026 23:33


bcac5a2


bf18c16

Alright GitHub lets go

8d1c4c9

Thank you GitHub

d3146c8

mdrakiburrahman changed the title ~~TODO~~ fix: handle datetime when round-tripping checkpoint snapshots through JSON Jun 5, 2026

mdrakiburrahman changed the title ~~fix: handle datetime when round-tripping checkpoint snapshots through JSON~~ fix: handle datetime in checkpoint snapshots + retry CredentialUnavailableError on Azure CLI Jun 5, 2026

mdrakiburrahman and others added 2 commits June 6, 2026 15:20

mdrakiburrahman changed the title ~~fix: handle datetime in checkpoint snapshots + retry CredentialUnavailableError on Azure CLI~~ todo: production readiness Jun 8, 2026

mdrakiburrahman and others added 9 commits June 8, 2026 00:43

Proofread

8b9635b

Add retries for weird stuff

75fc1d8

Kill the old jobs

afaad96

mdrakiburrahman linked an issue Jun 9, 2026 that may be closed by this pull request

feat: Add Fabric Notebook scheduling support #5

Closed

mdrakiburrahman changed the title ~~todo: production readiness~~ perf: Production readiness with Workload purging, and AuthN injection Jun 9, 2026

mdrakiburrahman marked this pull request as ready for review June 9, 2026 17:24

mdrakiburrahman merged commit 283da1f into main Jun 9, 2026
2 checks passed

mdrakiburrahman deleted the dev/mdrrahman/production branch June 9, 2026 17:24

mdrakiburrahman merged 17 commits into main from dev/mdrrahman/production
