feat/3525-update-dataset-schema-api#3585

Closed
munish7771 wants to merge 126 commits into dlt-hub:devel from munish7771:feat/3525-update-dataset-schema-api

Conversation

@munish7771

Description

This PR adds a sync_schema method to dlt.Dataset, allowing us to refresh a dataset’s schema when it becomes stale.
It supports:

  1. Fetching the latest schema by dataset name (default behavior)
  2. Updating from an explicit dlt.Schema or schema name
  3. Resetting to a local-only schema state
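A minimal sketch of how the three modes might dispatch. This is illustrative only: the `Schema` and `Dataset` classes here are stand-ins, and the argument names (`schema`, `local_only`) and fetch helper are assumptions based on this description, not the final dlt API.

```python
from typing import Optional, Union


class Schema:
    """Stand-in for dlt.Schema, holding only a name for illustration."""

    def __init__(self, name: str) -> None:
        self.name = name


class Dataset:
    """Illustrative sketch of the sync_schema dispatch described above."""

    def __init__(self, dataset_name: str) -> None:
        self.dataset_name = dataset_name
        self.schema: Optional[Schema] = None

    def _fetch_schema(self, name: str) -> Schema:
        # Placeholder: in dlt this would read the stored schema
        # from the destination.
        return Schema(name)

    def sync_schema(
        self,
        schema: Union[Schema, str, None] = None,
        local_only: bool = False,
    ) -> None:
        if local_only:
            # 3. reset to a local-only schema state
            self.schema = Schema(self.dataset_name)
        elif isinstance(schema, Schema):
            # 2a. update from an explicit Schema instance
            self.schema = schema
        elif isinstance(schema, str):
            # 2b. update from a schema name
            self.schema = self._fetch_schema(schema)
        else:
            # 1. default: fetch the latest schema by dataset name
            self.schema = self._fetch_schema(self.dataset_name)
```

With this sketch, `ds.sync_schema()` refreshes by dataset name, while `ds.sync_schema(schema="other")` fetches by schema name.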

Related Issues

Additional Context

  • Added a new function sync_schema to dlt/dataset/dataset.py
  • Added unit tests to tests/dataset/test_dataset.py

rudolfix and others added 25 commits November 3, 2025 16:09
master merge for 1.18.2 patch release
* release highlights 1.16

* release highlights 1.17

* add to sidebar

* remove 1.17.1

* fix links

* apply comments, rearrange releases, add what's new
master merge for `1.19.0` release
* fixes historic builds

* fix broken link

* constrain docs build env to python 3.10

* switch snippets testing to python 3.10

* allows python up to py3.12 in docs project

---------

Co-authored-by: dave <shrps@posteo.net>
master merge for patch 1.19.1 release
master merge for 1.20.0 release
add data quality lifecycle docs, dashboard prescriptive workflow

---------

Co-authored-by: Adrian <Adrian>
master merge for 1.21.0 release
* fix: skip adding partition clause to BQ ALTER TABLE query

* fixes partition generation condition and adds tests

---------

Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
* use scaffold api client v1

* remove vibes repo code

* rename vibe -> scaffold, basic client test

* remove comments

* set base url via config

* refactor

* cleanup

* move docs-api-url config to runtime.workspace section

* request zip format

* scaffold api url

* updated url

* updated again

* reordered imports

* test for nonexisting vibe source

* duckdb->bigquery

* mini-change to trigger ci-tests again

* review fixes

---------

Co-authored-by: rudolfix <rudolfix@rudolfix.org>
…#1883) (dlt-hub#3574)

* feat: add parallelized flag for rest_api dependent resources (dlt-hub#1883)

* updates tests and docs

---------

Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
* adds flag to prevent writing dictionary columns to parquet

* disables writing dict _dlt_load_id on mssql + adbc

* adds option to skip dict when adding constant column

* uses arrow repeat for constant columns

* fixes sqlite sqlalchemy local file location

* releases buffers in buffered.py early

* sets separate duckdb extensions folder per xdist worker + motherduck xdist 2

* disables ADBC for sqlite/sqlalchemy

* pre-installs duckdb extensions, other fixes

* fixes clickhouse wrong timestamp serialization

* review fixes

* fixes motherduck leaking database in tests

* serializes motherduck tests again

* fixes dummy client test race condition
* map decimals without precision and scale to DECFLOAT on snowflake destination

* Make unbound decimal to decfloat conversion configurable

* changes option name, adds tests and docs

---------

Co-authored-by: ivasio <ivan@dlthub.com>
Co-authored-by: travior <gh@benjaminhoffmann.dev>
Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
…3414)

* Add engine arg argument to sql_table and sql_database with tests

* Add test relevance clarification

* Add docs for engine_kwargs usage

* Add engine args documentation for sqlalchemy destination.
Fix documentation for inmemory sqlite destination and add test for it.

* Add destination test for sqlite with engine args for both inmemory and
file based database.

* Fix code block name python -> py and remove postgres from sqlalchemy tests. There will be only sqlite and mysql

* Remove unnecessary import

* Change pyarrow backend test for a pandas backend test. Since pipeline
files for windows and linux (passes for macos)

* Change back to pyarrow test

* Add check of engine not being disposed after use if not owned by dlt

* Add may_dispose_after_use keyword in sql_database

* Add uniq id for pipeline names

* Remove generator from sqlite file test to use normal json and use
storage fixture

* tests(sqlalchemy): add ref counting tests for destination when owned vs external engines

* Fix reflink

* rename to engine kwargs for destination too

* format lint

* Trigger CI pipeline

* Adjust engine kwargs test

* Remove brittle test

* Fix tests

* Format lint

* Remove brittle test

* fixes sqlalchemy refcount leak

* missing tests

* disposes reflection engine after use in sql_database

* small fixes

* handles in-memory sqlite correctly

---------

Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
…b#2888)

* Add extra_credentials to clickhouse destination configuration

* Rename extra credentials and add tests

* review fixes

* adds end to end test

---------

Co-authored-by: Jeff <ubuntu@ip-172-31-8-3>
Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
…#3570)

* fix: Use correct view to obtain existing schemas and use if not exists to create a schema.

* Create schema without existence check

* Use try except logic instead of checking for table

* Smaller try block for better readability

* fixes exception handling

---------

Co-authored-by: Tim Hable <thable@varengold.de>
Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
* making query_result_bucket optional

* adds tests and docs

---------

Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
* fixes snippet and brings READMEs up to date with tooling

* adds help to Makefiles
…t-hub#3595)

* refines base table loader class, extracts connector-x, allows for backend registration + tests

* fixes tests
prevostc and others added 4 commits March 13, 2026 14:08
* implements replace merge tree engine on top of existing hints

* makes dedup_sort optional, docs refinement

---------

Co-authored-by: rudolfix <rudolfix@rudolfix.org>
* supports insert-only merge strategy

* enables upsert on fabric

* fixes other tests

* skips delete orphan job when insert-only
…t-hub#3652)

* feat: parallelize all sources in Airflow, including the first one

Resolves dlt-hub#2196. Replaces the first source task with an EmptyOperator
start node so all decomposed source components fan out concurrently
in both parallel and parallel-isolated modes.

* adds serialize_first_task to control airflow parallelism

* adds smoke tests of airflow helper for Airflow 3

* improves airflow test setup, v3 failed due to Airflow3 bugs

* corrects airflow version

* fixes __getattr__ dunders

* fixes airflow tests

---------

Co-authored-by: justins <ksobayo0126@gmail.com>
Co-authored-by: rudolfix <rudolfix@rudolfix.org>
@munish7771
Author

@rudolfix The refresh breaks when called via pipeline.dataset().refresh().
It looks like pipeline.dataset() resolves the default schema to a full schema object before passing it to the dataset, so in refresh() it can't tell whether the schema came from the user or from the pipeline default. What do you suggest?

@rudolfix
Collaborator

@munish7771 feel free to store the original schema argument so you can pass it to the new dataset instance in refresh(), i.e. self._schema_origin = schema in __init__
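A minimal sketch of that suggestion, using a simplified stand-in for the real Dataset constructor (only `_schema_origin` comes from the comment above; everything else is illustrative):

```python
from typing import Union


class Schema:
    """Stand-in for dlt.Schema, holding only a name for illustration."""

    def __init__(self, name: str) -> None:
        self.name = name


class Dataset:
    """Sketch: keep the raw schema argument so refresh() can reuse it."""

    def __init__(
        self,
        dataset_name: str,
        schema: Union[Schema, str, None] = None,
    ) -> None:
        self.dataset_name = dataset_name
        # Remember the schema argument exactly as the caller passed it,
        # so refresh() can tell a user-provided schema from a resolved default.
        self._schema_origin = schema
        # The working schema may still be resolved eagerly, as before.
        self.schema = schema if isinstance(schema, Schema) else Schema(dataset_name)

    def refresh(self) -> "Dataset":
        # Rebuild from the original argument, not the resolved schema object,
        # so a default-constructed dataset stays refreshable.
        return Dataset(self.dataset_name, schema=self._schema_origin)
```

The key design point is that `refresh()` reconstructs the dataset from the caller's original argument rather than from the eagerly resolved `self.schema`.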

anuunchin and others added 5 commits March 16, 2026 17:45
* Initial commit

* Test

* Fix in test

* Better warning message

* Minor improvements

* string ref handled

* Tests adjusted

* Docs

* Sync with devel

* Simpler fix
* docs: add realistic closure-based data masking example

* Fix mypy lint errors for Python 3.10 compatibility

- Replace enum.StrEnum (Python 3.11+) with str, Enum base classes
- Replace lowercase callable with typing.Callable[..., Any]
- Replace MaskingMethod | None with Optional[MaskingMethod]
- Use resolved_method variable to avoid type narrowing issue

* moves production pseudonymize to example with a test

---------

Co-authored-by: rudolfix <rudolfix@rudolfix.org>
* bumps to version 1.24.0

* test fixes
* expanded handover to other toolkits section

* removed extra whitespace

* Refine data-exploration wording per PR dlt-hub#3737 review

* chore: retrigger CI

* chore: retrigger CI

* chore: retrigger CI @njaltran

* docs: regenerate CLI reference to remove stale dlt pipeline mcp entry

* docs page updated with make update-cli-docs

* cli doc update

---------

Co-authored-by: anuunchin <88698977+anuunchin@users.noreply.github.com>
…3765)

* creates all eligible tables on staging dataset, truncates only those with jobs

* bumps pyathena client

* normalizes path endings in athena configuration (ibis client fix)

* fixes review and adds unit tests for init_client for load step

* removes locking fork test

* fixes athena partition test - new cursor row format for paritions
@munish7771
Author

@rudolfix what I meant was: if we create the dataset via the pipeline API using dataset(), the function always provides the dataset constructor with a schema, even if we didn't explicitly pass one, so refresh() will always hit this error path.
For example, with the current code changes, if I run the following:

```python
import dlt

pipeline = dlt.pipeline(
    pipeline_name="test_pipeline",
    destination="duckdb",
    dataset_name="test_dataset",
)
pipeline.run([{"id": 1, "name": "Sharma"}], table_name="users")
dataset = pipeline.dataset()
dataset.refresh()
```

it raises: TypeError: refresh() is not supported when the Dataset was created with a Schema instance.

Is this intended? One way I can think of is to change the behaviour of the pipeline.dataset() function, but I'm not sure if it's that way for a reason.

@zilto
Collaborator

zilto commented Mar 23, 2026

@munish7771 Can you rebase your branch on devel?

Some tests are failing on CI, but they are unrelated to your changes; they were recently fixed and merged to devel.

@munish7771
Copy link
Copy Markdown
Author

Replaced by a clean PR due to rebase issues → #3802
FYI: @zilto @rudolfix

@munish7771 munish7771 closed this Mar 29, 2026

Development

Successfully merging this pull request may close these issues.

feat: API to update dlt.Dataset.schema