feat/3525-update-dataset-schema-api#3585

Closed
munish7771 wants to merge 126 commits into dlt-hub:devel from munish7771:feat/3525-update-dataset-schema-api

Conversation

@munish7771

Description

This PR adds a sync_schema method to dlt.Dataset, allowing us to refresh a dataset’s schema when it becomes stale.
It supports:

  1. Fetching the latest schema by dataset name (default behavior)
  2. Updating from an explicit dlt.Schema or schema name
  3. Resetting to a local-only schema state
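A minimal sketch of how the three modes might dispatch. This is illustrative only: the `Schema` and `Dataset` classes here are stand-ins, and the argument names (`schema`, `local_only`) and fetch helper are assumptions based on this description, not the final dlt API.

```python
from typing import Optional, Union


class Schema:
    """Stand-in for dlt.Schema, holding only a name for illustration."""

    def __init__(self, name: str) -> None:
        self.name = name


class Dataset:
    """Illustrative sketch of the sync_schema dispatch described above."""

    def __init__(self, dataset_name: str) -> None:
        self.dataset_name = dataset_name
        self.schema: Optional[Schema] = None

    def _fetch_schema(self, name: str) -> Schema:
        # Placeholder: in dlt this would read the stored schema
        # from the destination.
        return Schema(name)

    def sync_schema(
        self,
        schema: Union[Schema, str, None] = None,
        local_only: bool = False,
    ) -> None:
        if local_only:
            # 3. reset to a local-only schema state
            self.schema = Schema(self.dataset_name)
        elif isinstance(schema, Schema):
            # 2a. update from an explicit Schema instance
            self.schema = schema
        elif isinstance(schema, str):
            # 2b. update from a schema name
            self.schema = self._fetch_schema(schema)
        else:
            # 1. default: fetch the latest schema by dataset name
            self.schema = self._fetch_schema(self.dataset_name)
```

With this sketch, `ds.sync_schema()` refreshes by dataset name, while `ds.sync_schema(schema="other")` fetches by schema name.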

Related Issues

Additional Context

  • Added a new function sync_schema to dlt/dataset/dataset.py
  • Added unit tests to tests/dataset/test_dataset.py

rudolfix and others added 25 commits November 3, 2025 16:09
master merge for 1.18.2 patch release
* release highlights 1.16

* release highlights 1.17

* add to sidebar

* remove 1.17.1

* fix links

* apply comments, rearrange releases, add what's new
master merge for `1.19.0` release
* fixes historic builds

* fix broken link

* constrain docs build env to python 3.10

* switch snippets testing to python 3.10

* allows python up to py3.12 in docs project

---------

Co-authored-by: dave <shrps@posteo.net>
master merge for patch 1.19.1 release
master merge for 1.20.0 release
add data quality lifecycle docs, dashboard prescriptive workflow

---------

Co-authored-by: Adrian <Adrian>
master merge for 1.21.0 release
* fix: skip adding partition clause to BQ ALTER TABLE query

* fixes partition generation condition and adds tests

---------

Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
* use scaffold api client v1

* remove vibes repo code

* rename vibe -> scaffold, basic client test

* remove comments

* set base url via config

* refactor

* cleanup

* move docs-api-url config to runtime.workspace section

* request zip format

* scaffold api url

* updated url

* updated again

* reordered imports

* test for nonexisting vibe source

* duckdb->bigquery

* mini-change to trigger ci-tests again

* review fixes

---------

Co-authored-by: rudolfix <rudolfix@rudolfix.org>
…#1883) (dlt-hub#3574)

* feat: add parallelized flag for rest_api dependent resources (dlt-hub#1883)

* updates tests and docs

---------

Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
* adds flag to prevent writing dictionary columns to parquet

* disables writing dict _dlt_load_id on mssql + adbc

* adds option to skip dict when adding constant column

* uses arrow repeat for constant columns

* fixes sqlite sqlalchemy local file location

* releases buffers in buffered.py early

* sets separate duckdb extensions folder per xdist worker + motherduck xdist 2

* disables ADBC for sqlite/sqlalchemy

* pre-installs duckdb extensions, other fixes

* fixes clickhouse wrong timestamp serialization

* review fixes

* fixes motherduck leaking database in tests

* serializes motherduck tests again

* fixes dummy client test race condition
* map decimals without precision and scale to DECFLOAT on snowflake destination

* Make unbound decimal to decfloat conversion configurable

* changes option name, adds tests and docs

---------

Co-authored-by: ivasio <ivan@dlthub.com>
Co-authored-by: travior <gh@benjaminhoffmann.dev>
Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
…3414)

* Add engine arg argument to sql_table and sql_database with tests

* Add test relevance clarification

* Add docs for engine_kwargs usage

* Add engine args documentation for sqlalchemy destination.
Fix documentation for inmemory sqlite destination and add test for it.

* Add destination test for sqlite with engine args for both inmemory and
file based database.

* Fix code block name python -> py and remove postgres from sqlalchemy tests. There will be only sqlite and mysql

* Remove unnecessary import

* Change pyarrow backend test for a pandas backend test. Since pipeline
files for windows and linux (passes for macos)

* Change back to pyarrow test

* Add check of engine not being disposed after use if not owned by dlt

* Add may_dispose_after_use keyword in sql_database

* Add uniq id for pipeline names

* Remove generator from sqlite file test to use normal json and use
storage fixture

* tests(sqlalchemy): add ref counting tests for destination when owned vs external engines

* Fix reflink

* rename to engine kwargs for destination too

* format lint

* Trigger CI pipeline

* Adjust engine kwargs test

* Remove brittle test

* Fix tests

* Format lint

* Remove brittle test

* fixes sqlalchemy refcount leak

* missing tests

* disposes reflection engine after use in sql_database

* small fixes

* handles in-memory sqlite correctly

---------

Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
…b#2888)

* Add extra_credentials to clickhouse destination configuration

* Rename extra credentials and add tests

* review fixes

* adds end to end test

---------

Co-authored-by: Jeff <ubuntu@ip-172-31-8-3>
Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
…#3570)

* fix: Use correct view to obtain existing schemas and use if not exists to create a schema.

* Create schema without existence check

* Use try except logic instead of checking for table

* Smaller try block for better readability

* fixes exception handling

---------

Co-authored-by: Tim Hable <thable@varengold.de>
Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
* making query_result_bucket optional

* adds tests and docs

---------

Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
* fixes snippet and brings READMEs up to date with tooling

* adds help to Makefiles
…t-hub#3595)

* refines base table loader class, extracts connector-x, allows for backend registration + tests

* fixes tests
prevostc and others added 4 commits March 13, 2026 14:08
* implements replace merge tree engine on top of existing hints

* makes dedup_sort optional, docs refinement

---------

Co-authored-by: rudolfix <rudolfix@rudolfix.org>
* supports insert-only merge strategy

* enables upsert on fabric

* fixes other tests

* skips delete orphan job when insert-only
…t-hub#3652)

* feat: parallelize all sources in Airflow, including the first one

Resolves dlt-hub#2196. Replaces the first source task with an EmptyOperator
start node so all decomposed source components fan out concurrently
in both parallel and parallel-isolated modes.

* adds serialize_first_task to control airflow parallelism

* adds smoke tests of airflow helper for Airflow 3

* improves airflow test setup, v3 failed due to Airflow3 bugs

* corrects airflow version

* fixes __getattr__ dunders

* fixes airflow tests

---------

Co-authored-by: justins <ksobayo0126@gmail.com>
Co-authored-by: rudolfix <rudolfix@rudolfix.org>
@munish7771
Author

@rudolfix The refresh breaks when called via pipeline.dataset().refresh().
It looks like pipeline.dataset() resolves the default schema to a full schema object before passing it to the dataset, so in refresh() it can't tell whether the schema came from the user or from the pipeline default. What do you suggest?

@rudolfix
Collaborator

@munish7771 feel free to store the original schema argument so you can pass it to the new dataset instance in refresh(), i.e. self._schema_origin = schema in __init__
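A minimal sketch of that suggestion, using a simplified stand-in for the real Dataset constructor (only `_schema_origin` comes from the comment above; everything else is illustrative):

```python
from typing import Union


class Schema:
    """Stand-in for dlt.Schema, holding only a name for illustration."""

    def __init__(self, name: str) -> None:
        self.name = name


class Dataset:
    """Sketch: keep the raw schema argument so refresh() can reuse it."""

    def __init__(
        self,
        dataset_name: str,
        schema: Union[Schema, str, None] = None,
    ) -> None:
        self.dataset_name = dataset_name
        # Remember the schema argument exactly as the caller passed it,
        # so refresh() can tell a user-provided schema from a resolved default.
        self._schema_origin = schema
        # The working schema may still be resolved eagerly, as before.
        self.schema = schema if isinstance(schema, Schema) else Schema(dataset_name)

    def refresh(self) -> "Dataset":
        # Rebuild from the original argument, not the resolved schema object,
        # so a default-constructed dataset stays refreshable.
        return Dataset(self.dataset_name, schema=self._schema_origin)
```

The key design point is that `refresh()` reconstructs the dataset from the caller's original argument rather than from the eagerly resolved `self.schema`.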

anuunchin and others added 5 commits March 16, 2026 17:45
* Initial commit

* Test

* Fix in test

* Better warning message

* Minor improvements

* string ref handled

* Tests adjusted

* Docs

* Sync with devel

* Simpler fix
* docs: add realistic closure-based data masking example

* Fix mypy lint errors for Python 3.10 compatibility

- Replace enum.StrEnum (Python 3.11+) with str, Enum base classes
- Replace lowercase callable with typing.Callable[..., Any]
- Replace MaskingMethod | None with Optional[MaskingMethod]
- Use resolved_method variable to avoid type narrowing issue

* moves production pseudonymize to example with a test

---------

Co-authored-by: rudolfix <rudolfix@rudolfix.org>
* bumps to version 1.24.0

* test fixes
* expanded handover to other toolkits section

* removed extra whitespace

* Refine data-exploration wording per PR dlt-hub#3737 review

* chore: retrigger CI

* chore: retrigger CI

* chore: retrigger CI @njaltran

* docs: regenerate CLI reference to remove stale dlt pipeline mcp entry

* docs page updated with make update-cli-docs

* cli doc update

---------

Co-authored-by: anuunchin <88698977+anuunchin@users.noreply.github.com>
…3765)

* creates all eligible tables on staging dataset, truncates only those with jobs

* bumps pyathena client

* normalizes path endings in athena configuration (ibis client fix)

* fixes review and adds unit tests for init_client for load step

* removes locking fork test

* fixes athena partition test - new cursor row format for paritions
@munish7771
Author

@rudolfix what I meant was: if we create the dataset via the pipeline API using dataset(), the function always provides the dataset constructor with a schema, even if we didn't explicitly pass one, so refresh() will always hit this error path.
For example, with the current code changes, if I run the following:

```python
import dlt

pipeline = dlt.pipeline(
    pipeline_name="test_pipeline",
    destination="duckdb",
    dataset_name="test_dataset",
)
pipeline.run([{"id": 1, "name": "Sharma"}], table_name="users")
dataset = pipeline.dataset()
dataset.refresh()
```

it raises: TypeError: refresh() is not supported when the Dataset was created with a Schema instance.

Is this intended? One way I can think of is to change the behaviour of the pipeline.dataset() function, but I'm not sure if it's that way for a reason.

@zilto
Collaborator

zilto commented Mar 23, 2026

@munish7771 Can you rebase your branch on devel?

Some tests are failing on CI, but they are unrelated to your changes; they were recently fixed and merged to devel.

@munish7771
Copy link
Copy Markdown
Author

Replaced by a clean PR due to rebase issues → #3802
FYI: @zilto @rudolfix

@munish7771 munish7771 closed this Mar 29, 2026

Development

Successfully merging this pull request may close these issues.

feat: API to update dlt.Dataset.schema