perf: Production readiness with Workload purging, and AuthN injection #32

Merged: mdrakiburrahman merged 17 commits into main from dev/mdrrahman/production on Jun 9, 2026

@mdrakiburrahman mdrakiburrahman commented Jun 4, 2026

Contributor

Why this change is needed

Closes #33.

The mon_database_metadata incremental model in dbt-ingest-geneva failed
during a long stress run with:

Compilation Error in model mon_database_metadata (models/mon_database_metadata.sql)
  Object of type datetime is not JSON serializable

Inside the adapter:

scope adapter: write_batch_sources failed for batch 20

The crash always happens on the second parquet compaction
(batch_id > 0 and batch_id % source_compaction_interval == 0),
i.e. once a previous .parquet snapshot already exists under
_checkpoint/sources/. With the default source_compaction_interval=10,
that's batch 20 (as observed); with source_compaction_interval=1 (the
aggressive case), it's batch 2.
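
The trigger condition quoted above can be sketched as a small predicate. This is an illustrative sketch only: the helper names `is_compaction_batch` and `compaction_batches` are hypothetical and are not taken from the adapter.

```python
def is_compaction_batch(batch_id: int, source_compaction_interval: int) -> bool:
    """Return True when this batch should trigger a parquet compaction."""
    return batch_id > 0 and batch_id % source_compaction_interval == 0


def compaction_batches(last_batch_id: int, source_compaction_interval: int) -> list[int]:
    """List every batch id in [0, last_batch_id] that triggers a compaction."""
    return [
        batch_id
        for batch_id in range(last_batch_id + 1)
        if is_compaction_batch(batch_id, source_compaction_interval)
    ]
```

Under these assumptions, `compaction_batches(20, 10)` gives `[10, 20]` and `compaction_batches(2, 1)` gives `[1, 2]`, so the second compaction lands on batch 20 and batch 2 respectively.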

Root cause

CheckpointManager._write_snapshot_parquet builds the next snapshot by:

  1. Reading the latest existing .parquet snapshot via DuckDB
    (SELECT * FROM read_parquet(...)).
  2. Reading JSONL diffs written since that snapshot.
  3. Materialising everything into a list[dict] in Python.
  4. Writing the merged list as NDJSON via
    nf.write(json.dumps(r) + "\n"), then loading the NDJSON back via
    read_json_auto and COPY ... TO ... (FORMAT PARQUET).

batchProcessingTime is created as an ISO 8601 string in
_build_source_records, but DuckDB's read_json_auto infers it as
TIMESTAMP once the NDJSON has enough rows of consistent ISO text
(reproducible at ~20 rows with duckdb 1.5.3; the in-test fast-loop case
with ≤6 records per batch stays under the threshold, which is why our
existing multi-compaction tests didn't catch it).

So the first snapshot (batch 10) stores the column as TIMESTAMP. On the
next compaction (batch 20), DuckDB returns those values as Python
datetime.datetime, and json.dumps(r) then raises
TypeError: Object of type datetime is not JSON serializable. The outer
try/except in write_batch_sources logs the failure and re-raises,
which dbt surfaces as a "Compilation Error" on the model after ~7h43m of
wasted SCOPE compute.
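
The failure can be reproduced without DuckDB or ADLS, because it only depends on what `json.dumps` receives. The sketch below uses made-up record values (the path and timestamps are hypothetical) to contrast a record whose `batchProcessingTime` is still a string with one that came back as `datetime`.

```python
import json
from datetime import datetime

# First compaction: the value is still the ISO 8601 string that was written.
record_from_jsonl = {
    "path": "/hypothetical/source.ss",
    "batchId": 10,
    "batchProcessingTime": "2026-06-04T23:33:00+00:00",
}
json.dumps(record_from_jsonl)  # serialises without error

# Second compaction: a TIMESTAMP column is read back as a naive datetime.
record_from_parquet = {
    "path": "/hypothetical/source.ss",
    "batchId": 10,
    "batchProcessingTime": datetime(2026, 6, 4, 23, 33),
}

try:
    json.dumps(record_from_parquet)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
```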

How

Two small changes in dbt/adapters/scope/checkpoint.py:

  1. Module-level _json_default(o) helper that maps
    datetime → o.isoformat() and raises TypeError for anything else.
    Naive datetime instances (which is what DuckDB hands back after a
    TIMESTAMP round-trip) are re-attached to UTC before formatting so
    the on-disk ISO string keeps its +00:00 suffix — preserving the
    timezone-aware contract that _build_source_records originally
    established. Wired into both json.dumps call sites — _write_jsonl
    (defensive; today's inputs are pure str/int) and the NDJSON write in
    _write_snapshot_parquet (where the crash actually happens). This
    covers any existing customer parquet snapshots whose
    batchProcessingTime column is already TIMESTAMP.
  2. Stabilise the snapshot parquet schema by replacing the snapshot's
    SELECT * with an explicit, fully-typed projection — path VARCHAR,
    modificationTime BIGINT, batchId INTEGER,
    batchProcessingTime TIMESTAMP. This stops the schema from drifting
    between VARCHAR and TIMESTAMP depending on DuckDB's sample-size
    heuristic, and also defends against future drift on the other three
    columns. Every subsequent read then returns datetime for
    batchProcessingTime, and _json_default normalises it back to an
    ISO string on the next round-trip.

Considerations

  • No data migration required. Existing parquet snapshots already on
    customer ADLS accounts (where the column was previously inferred as
    TIMESTAMP) are now handled correctly by _json_default.
  • Tiny diff, lowest blast radius. A larger refactor (doing the
    union entirely in DuckDB via read_parquet UNION ALL read_json_auto
    to skip the Python round-trip) would be cleaner but is out of scope —
    the compaction code path is I/O-bound on ADLS, not on the Python
    round-trip, so the perf win at current scale (≤a few hundred records
    per snapshot) is negligible.
  • Alternative considered: instead of pinning to TIMESTAMP, we
    could pin to VARCHAR — both fix the bug. We picked TIMESTAMP
    because it's the better long-term schema (and matches what DuckDB
    picks at production scale anyway).

Test

  • New regression unit test:
    tests/unit/test_checkpoint_lifecycle.py::TestCheckpointLifecycle::test_compaction_handles_timestamp_typed_prior_snapshot
    pre-seeds a TIMESTAMP-typed parquet snapshot (22 rows) into the
    in-memory ADLS mock, writes an intermediate JSONL diff at batch 15
    (3 rows), then triggers compaction at batch 20 (2 rows) and asserts:
    • write_batch_sources does not raise.
    • The new snapshot contains records from all three legs of the
      union — prior snapshot + JSONL diff + current batch
      (22 + 3 + 2 = 27 rows, with batchId values for every leg).
    • The new snapshot's batchProcessingTime is deterministically
      TIMESTAMP.
      Verified to fail without the fix with the exact production error
      (TypeError: Object of type datetime is not JSON serializable) and
      pass with it.
  • .scripts/run.sh unit-test: 336/336 passing.
  • .scripts/run.sh lint — clean.
  • .scripts/run.sh integration-test — full suite re-run end-to-end
    against ADLA (no regressions observed).

Addendum: credential retry resilience (commit 9fc39e8)

Why a second fix in this PR

After the datetime fix landed, the same dbt-ingest-geneva stress run
surfaced a second, independent failure mode in a CI workflow run:
https://github.com/microsoft/dbt-scope/actions/runs/27011393672/job/79716601894

        except Exception as ex:
            # could be a timeout, for example
            error = CredentialUnavailableError(message="Failed to invoke the Azure CLI")
>           raise error from ex
E           azure.identity._exceptions.CredentialUnavailableError: Failed to invoke the Azure CLI

Under high concurrency (the production model runs with
Concurrency: 32 threads), the az subprocess that
AzureCliCredential.get_token shells out to can time out. The adapter
previously raised on the first occurrence, failing the dbt run after
multi-hour ingestion jobs even though the underlying issue is
transient.

What changed

Centralised retry handling in LockedTokenCredential:

  • New RetryPolicy dataclass — linear backoff, capped delay, no
    jitter. Total attempts == max_retries + 1.
  • RetryPolicy.from_http_retries(http_retries) reuses the existing
    http_retries profile field so users can tune retries via
    profiles.yml (default bumped 3 → 10 per user spec).
  • LockedTokenCredential.get_token catches only
    CredentialUnavailableError (transient subprocess failures);
    permanent ClientAuthenticationError is intentionally not retried.
  • Releases the FileLock between attempts so other workers continue
    to make progress.
  • Sleeps via an injectable callable for deterministic tests.
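
The retry behaviour listed above can be sketched as follows. Only the name `RetryPolicy`, the "linear, capped, no jitter" shape, the `max_retries + 1` attempt count, and the injectable sleep come from this PR; the field names `base_delay_seconds` and `max_delay_seconds`, the helper `call_with_retries`, and the 1 s / 10 s values are assumptions chosen to match the delay sequence named in the tests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    max_retries: int = 10
    base_delay_seconds: float = 1.0   # hypothetical field name and value
    max_delay_seconds: float = 10.0   # hypothetical field name and value

    def delay_for(self, retry_number: int) -> float:
        """Linear backoff (1x, 2x, 3x, ...) capped at max_delay_seconds, no jitter."""
        return min(self.base_delay_seconds * retry_number, self.max_delay_seconds)


def call_with_retries(fn, policy, retry_on, sleep):
    """Call fn up to max_retries + 1 times, retrying only on the retry_on exception type."""
    for attempt in range(policy.max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == policy.max_retries:
                raise  # retries exhausted: surface the last failure
            sleep(policy.delay_for(attempt + 1))
```

In the adapter the lock is described as being released between attempts. In this sketch that would correspond to acquiring the lock inside `fn` rather than around the whole loop, so a sleeping worker never holds it.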

All credential acquisition sites now flow through
LockedTokenCredential and honour the policy:

  • ScopeConnectionHandle (ADLA query path)
  • AdlsGen1Client (source-file discovery)
  • CheckpointManager (watermark + snapshot read/write)
  • ScopeAdapter.list_relations_without_caching

Defensive: no more silent swallowing of exhausted credentials

A rubber-duck review surfaced several broad except Exception blocks
that would silently convert exhausted credentials into apparently
benign outcomes — masking auth failures and corrupting incremental
state. Each now re-raises CredentialUnavailableError before the
broad handler:

  • CheckpointManager.read_watermark → would have returned None,
    silently flipping incremental → full refresh.
  • AdlsGen1Client._list_one_dir / _walk / list_files (non-recursive)
    / _directory_exists / _list_directory_files / enrich_with_estimates
    → would have returned empty / partial file lists, advancing the
    watermark past unseen files.
  • DuckDbDeltaLakeClient.table_exists / get_max_partition /
    get_columns / list_table_paths → would have looked like
    "Delta table missing".
  • ScopeAdapter.list_relations_without_caching (inner + outer except)
    → would have returned an empty Delta-table list.
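
The defensive pattern relies on Python evaluating `except` clauses in order, so the narrow clause has to come first. Below is a minimal sketch under stated assumptions: `CredentialUnavailableError` is redefined locally as a stand-in for the `azure.identity` exception so the example runs without Azure packages, and `read_watermark` with its `fetch` argument is a simplified, hypothetical shape rather than the adapter's method.

```python
import logging

logger = logging.getLogger(__name__)


class CredentialUnavailableError(Exception):
    """Local stand-in for azure.identity.CredentialUnavailableError."""


def read_watermark(fetch):
    """Return the watermark, or None when it cannot be read, without hiding auth failures."""
    try:
        return fetch()
    except CredentialUnavailableError:
        # Exhausted credentials must fail the run instead of looking like "no watermark".
        raise
    except Exception:
        logger.warning("watermark unreadable; treating as absent", exc_info=True)
        return None
```

Without the first clause, an exhausted credential would fall into the broad handler and return `None`, which is the silent incremental-to-full-refresh flip described above.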

Tests

  • New unit test file tests/unit/test_locked_token_credential.py
    (14 tests): defaults, from_http_retries for None/-1/0/25,
    succeed-on-first-attempt (no sleep), succeed-after-N-failures,
    exhaust-and-re-raise, linear-capped delay sequence
    [1,2,3,...,10,10,10], zero-retries == single-attempt,
    non-CredentialUnavailableError exceptions are NOT retried,
    claims= kwarg passthrough, lock-released-between-attempts (via
    parallel acquisition probe), default policy used when None.
  • New regression tests:
    • tests/unit/test_adls_gen1_client.py::TestAdlsGen1Client::test_recursive_subdir_credential_exhaustion_propagates
    • tests/unit/test_adls_gen1_client.py::TestAdlsGen1Client::test_non_recursive_credential_exhaustion_propagates
    • tests/unit/test_checkpoint.py::TestCheckpointManagerWatermark::test_read_watermark_propagates_credential_exhaustion
  • .scripts/run.sh unit-test: 353/353 passing.
  • .scripts/run.sh lint — clean.
  • .scripts/run.sh integration-test: 3/3 passing end-to-end
    against ADLA (9:58).

mdrakiburrahman and others added 5 commits June 4, 2026 23:33
… JSON

`CheckpointManager._write_snapshot_parquet` round-trips records through
DuckDB → Python dict → NDJSON → parquet to produce each compaction
snapshot. When the prior parquet snapshot already exists, DuckDB infers
the `batchProcessingTime` column as `TIMESTAMP` (auto-inferred from ISO
strings once enough rows exist), so a subsequent compaction round-trips
the column back to Python as `datetime` objects. The next
`json.dumps(record)` then raises:

  TypeError: Object of type datetime is not JSON serializable

…which surfaces in dbt as a compilation error and aborts the model run.

The bug only triggers from the second compaction onwards and only when
the NDJSON sample is large enough for DuckDB to infer `TIMESTAMP`
(reproducible at ~20 rows). The production stress run hit this at
`write_batch_sources failed for batch 20` after 7h43m.

Fix:

1. Add a module-level `_json_default` encoder that converts `datetime`
   instances to ISO-8601 strings (re-attaching UTC for naive datetimes
   so the existing timezone-aware contract is preserved on round-trip)
   and wire it into both `_write_jsonl` and the NDJSON write in
   `_write_snapshot_parquet`.
2. Pin the snapshot parquet schema by explicitly casting each column in
   the `CREATE TABLE` so the schema no longer depends on DuckDB's JSON
   type inference. `batchProcessingTime` is now deterministically
   `TIMESTAMP`, and the round-trip back to datetime is handled by (1).

Regression coverage added to
`tests/unit/test_checkpoint_lifecycle.py::test_compaction_handles_timestamp_typed_prior_snapshot`:
seeds a `TIMESTAMP`-typed prior snapshot with 22 rows, writes a JSONL
diff at batch 15, then triggers compaction at batch 20 and asserts the
new snapshot contains records from all three legs (prior snapshot +
JSONL diff + current batch) and that its `batchProcessingTime` schema
is `TIMESTAMP`. Fails on `main` with the exact production `TypeError`,
passes after the fix.

Closes #33

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
mdrakiburrahman changed the title from "TODO" to "fix: handle datetime when round-tripping checkpoint snapshots through JSON" on Jun 5, 2026
Under high concurrency the ``az`` subprocess that ``AzureCliCredential``
shells out to can time out, surfacing as
``CredentialUnavailableError: Failed to invoke the Azure CLI``. The
adapter previously raised on the first occurrence, failing dbt runs
after multi-hour stress jobs (PR #32).

This change centralises retry handling in ``LockedTokenCredential``:

* New ``RetryPolicy`` dataclass — linear backoff, capped delay, no
  jitter; total attempts == ``max_retries + 1``.
* ``RetryPolicy.from_http_retries`` reuses the existing
  ``http_retries`` profile field so users can tune retries via
  ``profiles.yml`` (default bumped 3 -> 10 per stress-test spec).
* ``LockedTokenCredential.get_token`` catches only
  ``CredentialUnavailableError`` (transient), releases the
  ``FileLock`` between attempts so other workers make progress, and
  sleeps via an injectable ``sleep`` callable for deterministic tests.

All credential touchpoints now flow through ``LockedTokenCredential``:

* ``ScopeConnectionHandle`` (ADLA query path)
* ``AdlsGen1Client`` (source-file discovery)
* ``CheckpointManager`` (watermark + snapshot RW)
* ``ScopeAdapter.list_relations_without_caching``

Adds defensive ``except CredentialUnavailableError: raise`` ahead of
broad ``except Exception`` blocks that previously silently converted
exhausted credentials into 'no watermark' / 'no files' / 'no delta
table' / 'no columns', any of which can corrupt incremental state.

Tests: 17 new unit tests (353 total), all integration tests pass
(3/3, 9:58).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
mdrakiburrahman changed the title from "fix: handle datetime when round-tripping checkpoint snapshots through JSON" to "fix: handle datetime in checkpoint snapshots + retry CredentialUnavailableError on Azure CLI" on Jun 5, 2026
@mdrakiburrahman

Contributor Author

Pushed second commit 9fc39e8 adding credential-retry resilience for the CredentialUnavailableError: Failed to invoke the Azure CLI failure mode seen in CI run 27011393672. See updated PR body for details.

Verification: 353/353 unit tests, 3/3 integration tests (9:58), lint clean.

mdrakiburrahman and others added 2 commits June 6, 2026 15:20
Emit SET @@MaxFileCountPerOutputFileSet = N; in every generated SCOPE
script so the cap on distinct OutputFileSet partition files is
deterministic across cluster defaults. Configurable end-to-end via:

  - profiles.yml target.max_file_count_per_output_file_set
  - dbt_project.yml models config
  - per-model {{ config(max_file_count_per_output_file_set=...) }}

Adapter default is 5000 (matches the conservative cluster ceiling on
Fabric/OneLake clusters seen today). Project consumers needing the
SCOPE compiler upstream default of 100000 set it in profiles.yml.

ScriptBuilder validates the value falls in the compiler's
[1, 1_000_000] range and raises DbtRuntimeError on misconfiguration so
failures surface at compile time rather than after job submission.
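
The compile-time check can be sketched as below. The function name `render_max_file_count_statement` is hypothetical, and `ValueError` stands in for `DbtRuntimeError` so the example runs without dbt installed; the statement text, the 5000 default, and the [1, 1_000_000] range are taken from the description above.

```python
MIN_FILE_COUNT = 1
MAX_FILE_COUNT = 1_000_000
DEFAULT_FILE_COUNT = 5000


def render_max_file_count_statement(value: int = DEFAULT_FILE_COUNT) -> str:
    """Validate the configured cap and render the SCOPE SET statement for it."""
    # bool is a subclass of int in Python, so reject it explicitly.
    is_int = isinstance(value, int) and not isinstance(value, bool)
    if not is_int or not (MIN_FILE_COUNT <= value <= MAX_FILE_COUNT):
        # Stand-in for DbtRuntimeError.
        raise ValueError(
            "max_file_count_per_output_file_set must be an integer in "
            f"[{MIN_FILE_COUNT}, {MAX_FILE_COUNT}], got {value!r}"
        )
    return f"SET @@MaxFileCountPerOutputFileSet = {value};"
```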

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…IGTERM

When the dbt process receives SIGINT (Ctrl+C) or SIGTERM, the adapter now:

  1. Sets a shared shutdown flag so every in-flight submit_and_wait loop
     self-cancels its own SCOPE job.
  2. Snapshots the process-wide active-jobs registry and fans out parallel
     CancelJob REST calls (one thread per job, bounded at 32).
  3. Waits for each cancelled job to reach a terminal Ended state, bounded
     by wait_on_cancel_seconds (default 30s, exposed in profiles.yml). Since
     cancels run in parallel, total wall-clock is ~wait_on_cancel_seconds
     regardless of job count.

Default-on; opt out via 'cancel_jobs_on_shutdown: false' in profiles.yml.

Also flips is_cancelable() to True and wires ScopeConnectionManager.cancel /
cancel_open to delegate to cancel_all_active_jobs for belt-and-suspenders
coverage of dbt's own cancellation pathways. atexit fallback covers paths
that unwind via unhandled exception rather than our signal handler.

Tests: 31 new tests in test_shutdown_cancellation.py + 2 in test_credentials.py.
Total: 399 unit tests pass (368 baseline + 31).
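
A rough sketch of the parallel fan-out in step 2 and the bounded wait in step 3, using only the standard library. The signature below is an assumption: `cancel_and_wait` is a hypothetical callable that issues the cancel request for one job and blocks until that job is terminal, and the registry is reduced to a plain iterable of job ids. It needs Python 3.9+ for `cancel_futures`.

```python
from concurrent.futures import ThreadPoolExecutor, wait

MAX_CANCEL_THREADS = 32


def cancel_all_active_jobs(active_job_ids, cancel_and_wait, wait_on_cancel_seconds=30.0):
    """Cancel every active job in parallel; return {job_id: finished cleanly in time}."""
    job_ids = list(active_job_ids)  # snapshot, so later registry changes are ignored
    if not job_ids:
        return {}
    pool = ThreadPoolExecutor(max_workers=min(len(job_ids), MAX_CANCEL_THREADS))
    futures = {pool.submit(cancel_and_wait, job_id): job_id for job_id in job_ids}
    # One shared deadline for all jobs, not one deadline per job.
    done, _not_done = wait(futures, timeout=wait_on_cancel_seconds)
    pool.shutdown(wait=False, cancel_futures=True)
    return {
        job_id: (future in done and future.exception() is None)
        for future, job_id in futures.items()
    }
```

Waiting on all futures against a single timeout is what keeps the total wall-clock close to `wait_on_cancel_seconds` regardless of job count, as long as the number of jobs stays within the thread bound.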

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
mdrakiburrahman changed the title from "fix: handle datetime in checkpoint snapshots + retry CredentialUnavailableError on Azure CLI" to "todo: production readiness" on Jun 8, 2026
mdrakiburrahman and others added 9 commits June 8, 2026 00:43
Mirrors dbt-fabricspark PR #177. Adds three new profiles.yml fields to
ScopeCredentials so the adapter can route every credential acquisition
through a user-supplied azure.core.credentials.TokenCredential:

    authentication: token_credential
    credential_class: 'my_pkg.MyCredential'
    credential_kwargs: { ... arbitrary kwargs ... }

Default remains authentication='cli' (AzureCliCredential wrapped in the
existing LockedTokenCredential file-lock + retry). The lock is now ONLY
applied on the CLI path — non-CLI credentials (SNI / managed identity /
notebookutils) skip the az-subprocess serialization which is meaningless
for them.

New module dbt/adapters/scope/custom_credential.py provides:
  - load_custom_credential(dotted, kwargs) — importlib + isinstance check
  - Process-wide instance cache keyed by (dotted_path, sorted kwargs)
  - Regex validation of the dotted path
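
The three responsibilities listed above can be sketched together. This is an approximation under stated assumptions: the regex, the cache-key encoding, and the error types are guesses, and the `isinstance` check against `azure.core.credentials.TokenCredential` is replaced by a duck-typed `get_token` check so the example runs without `azure-core`.

```python
import importlib
import json
import re

# Assumed shape: at least "module.ClassName", each part a Python identifier.
_DOTTED_PATH = re.compile(r"^[A-Za-z_]\w*(\.[A-Za-z_]\w*)+$")
_INSTANCE_CACHE = {}


def load_custom_credential(dotted, kwargs=None):
    """Import dotted 'pkg.module.Class', instantiate it with kwargs, and cache the instance."""
    kwargs = kwargs or {}
    if not _DOTTED_PATH.match(dotted):
        raise ValueError(f"invalid credential_class: {dotted!r}")
    # Sorting keys makes the cache key independent of kwarg order.
    cache_key = (dotted, json.dumps(kwargs, sort_keys=True, default=repr))
    if cache_key in _INSTANCE_CACHE:
        return _INSTANCE_CACHE[cache_key]
    module_name, _, class_name = dotted.rpartition(".")
    credential_class = getattr(importlib.import_module(module_name), class_name)
    instance = credential_class(**kwargs)
    # Stand-in for the isinstance(TokenCredential) check described above.
    if not callable(getattr(instance, "get_token", None)):
        raise TypeError(f"{dotted} does not provide get_token()")
    _INSTANCE_CACHE[cache_key] = instance
    return instance
```

Caching per (path, kwargs) means every adapter component that asks for the same configured credential shares one instance, and with it one token cache.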

A shared build_credential(creds) helper in delta_lake.py is now the
single entry point used by every runtime credential acquisition:
  - connections.py: ScopeConnectionHandle ADLA token
  - impl.py: list_relations_without_caching ADLS Gen2 listing
  - impl.py: _get_gen1_client → AdlsGen1Client (Gen1 token)
  - impl.py: _get_checkpoint_manager → CheckpointManager (Gen2 r/w)

AdlsGen1Client and CheckpointManager now accept an optional
`credential` kwarg; production callers always supply build_credential
output. The CLI-default fallback in __init__ is kept only for backward
compatibility with existing test fixtures that don't supply one.

DuckDbDeltaLakeClient no longer auto-wraps in LockedTokenCredential —
callers (specifically get_default_delta_client for integration tests)
now pre-wrap themselves. This preserves test behaviour while letting
a future caller pass a non-CLI credential without an unwanted lock.

Tests: 419/419 passing. Added test_custom_credential.py (11 cases) and
authentication-field validation in test_credentials.py (7 cases).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…concurrent device-code prompts

When dbt-scope runs with multiple threads (e.g. dbt run --threads 4) and
'authentication: token_credential', each worker independently called
the inner credential (e.g. EntraTokenCredential), and each call walked
its own fallback chain. On headless Fabric notebooks this lands on
interactive device-code auth, with one prompt per thread.

Wrap the custom credential in LockedTokenCredential so the cross-process
file lock serializes token acquisition. The first thread populates the
inner credential's token cache; subsequent threads reuse the cached token
without re-entering the fallback chain.
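
The effect of serialising token acquisition can be illustrated with an in-process stand-in. In the sketch below a `threading.Lock` replaces the cross-process file lock, and both class names are hypothetical; `CachingCredential` imitates an inner credential whose first call is expensive (an interactive prompt) and whose later calls hit its cache.

```python
import threading


class CachingCredential:
    """Imitates an inner credential: prompts once, then serves the cached token."""

    def __init__(self):
        self.prompts = 0
        self._token = None

    def get_token(self, *scopes, **kwargs):
        if self._token is None:
            self.prompts += 1  # stands in for one device-code prompt
            self._token = "cached-token"
        return self._token


class SerializedCredential:
    """Wraps a credential so only one caller at a time reaches the inner get_token."""

    def __init__(self, inner, lock=None):
        self._inner = inner
        self._lock = lock or threading.Lock()

    def get_token(self, *scopes, **kwargs):
        with self._lock:
            return self._inner.get_token(*scopes, **kwargs)
```

With the wrapper in place, whichever thread enters first fills the cache while the others wait on the lock, so the check-then-populate step in the inner credential can no longer interleave across threads.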

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…on code

AzureCliCredential is now only constructed inside build_credential()
when the user explicitly opts in via authentication='cli'. Every other
production code path (AdlsGen1Client._get_fs, CheckpointManager) now
requires the caller to pass an explicit credential and raises a clear
error if none is provided.

Eliminates the silent fallback that masked misconfigured callers
on environments without az login (Fabric notebooks, managed identity
hosts) and made the resulting device-code prompts hard to diagnose.

- Removed get_default_delta_client() helper; integration conftest
  constructs the dev-only AzureCliCredential inline.
- Tests updated to pass credential=MagicMock() instead of patching
  AzureCliCredential.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ential

Integration conftest helpers (read_watermark, list_source_files,
read_batch_source) were still constructing CheckpointManager() bare,
which now raises after the production fallback removal. Wire them
through the same dev-only AzureCliCredential as _delta_client().

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When authentication=token_credential (Fabric notebook + SNI etc), use a
separate file lock from the AzureCliCredential one so the log lines and
contention surface are unambiguous.

- New constant FABRIC_TOKEN_LOCK = /tmp/dbt-scope-fabric-token
- build_credential() dispatches: cli -> AZ_CLI_TOKEN_LOCK,
  token_credential -> FABRIC_TOKEN_LOCK
- New unit tests lock in the dispatch

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Fabric notebook runtime ships azure-datalake-store==0.0.5x preinstalled
at /home/trusted-service-user/jupyter-env/.... When dbt-scope passes
token_credential=<modern TokenCredential> to AzureDLFileSystem, the legacy
SDK signature is def __init__(self, store_name, token=None, **kwargs), so
the token_credential kwarg is silently dropped into **kwargs, token stays
None, and the SDK falls back to msal.PublicClientApplication.initiate_device_flow
emitting 'To sign in, use a web browser to open https://login.microsoft.com/device'
once per dbt worker thread.

Detect the legacy SDK by inspecting AzureDLFileSystem.__init__ signature.
When detected, wrap the modern TokenCredential in _LegacyDataLakeCredentialAdapter
which exposes the legacy DataLakeCredential API (signed_session() returning
a requests.Session and refresh_token()), then pass it via token= instead of
token_credential=.

The adapter is thread-safe (legacy SDK's _walk uses 8 worker threads), caches
tokens with a 300s refresh-lead window, and uses the canonical legacy scope
'https://datalake.azure.net//.default' (double slash, matches the modern
SDK's own default for parity).
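
Signature-based detection of this kind can be sketched with `inspect.signature`. The two filesystem classes below are stand-ins, not the real SDK: `LegacyFS` copies the legacy `__init__` signature quoted above, while the `ModernFS` signature, the helper names, and the `wrap_legacy` callable are assumptions made for illustration.

```python
import inspect


class LegacyFS:
    """Stand-in with the legacy signature: token_credential vanishes into **kwargs."""

    def __init__(self, store_name, token=None, **kwargs):
        self.token = token


class ModernFS:
    """Stand-in for an SDK version that names token_credential explicitly (assumed)."""

    def __init__(self, store_name, token=None, token_credential=None, **kwargs):
        self.token_credential = token_credential


def accepts_token_credential(fs_class) -> bool:
    """True when __init__ declares token_credential as a named parameter."""
    return "token_credential" in inspect.signature(fs_class.__init__).parameters


def build_fs_kwargs(fs_class, credential, wrap_legacy):
    """Choose the keyword the installed SDK actually understands."""
    if accepts_token_credential(fs_class):
        return {"token_credential": credential}
    # Legacy SDK: adapt the credential and pass it via token= instead.
    return {"token": wrap_legacy(credential)}
```

Checking for the named parameter, rather than checking whether the call raises, matters here because the legacy constructor accepts the unknown keyword without complaint.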

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@mdrakiburrahman mdrakiburrahman linked an issue Jun 9, 2026 that may be closed by this pull request
mdrakiburrahman changed the title from "todo: production readiness" to "perf: Production readiness with Workload purging, and AuthN injection" on Jun 9, 2026
@mdrakiburrahman mdrakiburrahman marked this pull request as ready for review June 9, 2026 17:24
@mdrakiburrahman mdrakiburrahman merged commit 283da1f into main Jun 9, 2026
2 checks passed
@mdrakiburrahman mdrakiburrahman deleted the dev/mdrrahman/production branch June 9, 2026 17:24