feat/3525-update-dataset-schema-api#3585
Conversation
master merge for 1.18.2 patch release
master merge for docs update
* release highlights 1.16
* release highlights 1.17
* add to sidebar
* remove 1.17.1
* fix links
* apply comments, rearrange releases, add what's new
master merge for `1.19.0` release
* fixes historic builds
* fix broken link
* constrain docs build env to python 3.10
* switch snippets testing to python 3.10
* allows python up to py3.12 in docs project
---------
Co-authored-by: dave <shrps@posteo.net>
master merge for patch 1.19.1 release
master merge for 1.20.0 release
add data quality lifecycle docs, dashboard prescriptive workflow
---------
Co-authored-by: Adrian <Adrian>
master merge for 1.21.0 release
* fix: skip adding partition clause to BQ ALTER TABLE query
* fixes partition generation condition and adds tests
---------
Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
* use scaffold api client v1
* remove vibes repo code
* rename vibe -> scaffold, basic client test
* remove comments
* set base url via config
* refactor
* cleanup
* move docs-api-url config to runtime.workspace section
* request zip format
* scaffold api url
* updated url
* updated again
* reordered imports
* test for nonexisting vibe source
* duckdb->bigquery
* mini-change to trigger ci-tests again
* review fixes
---------
Co-authored-by: rudolfix <rudolfix@rudolfix.org>
…#1883) (dlt-hub#3574)
* feat: add parallelized flag for rest_api dependent resources (dlt-hub#1883)
* updates tests and docs
---------
Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
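As a hedged illustration of the flag this commit names: a dependent `rest_api` resource could opt into parallel extraction like this. Only the `parallelized` key comes from the commit message; the resource name and endpoint shape are assumptions for the sketch.

```python
# Assumed shape of a rest_api dependent resource config; only the
# `parallelized` flag is taken from the commit message above.
issue_comments = {
    "name": "issue_comments",
    "endpoint": {
        "path": "issues/{issue_id}/comments",  # hypothetical dependent path
    },
    "parallelized": True,  # let this dependent resource extract in parallel
}
```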
* adds flag to prevent writing dictionary columns to parquet
* disables writing dict _dlt_load_id on mssql + adbc
* adds option to skip dict when adding constant column
* uses arrow repeat for constant columns
* fixes sqlite sqlalchemy local file location
* releases buffers in buffered.py early
* sets separate duckdb extensions folder per xdist worker + motherduck xdist 2
* disables ADBC for sqlite/sqlalchemy
* pre-installs duckdb extensions, other fixes
* fixes clickhouse wrong timestamp serialization
* review fixes
* fixes motherduck leaking database in tests
* serializes motherduck tests again
* fixes dummy client test race condition
* map decimals without precision and scale to DECFLOAT on snowflake destination
* Make unbound decimal to decfloat conversion configurable
* changes option name, adds tests and docs
---------
Co-authored-by: ivasio <ivan@dlthub.com>
Co-authored-by: travior <gh@benjaminhoffmann.dev>
Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
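The mapping this commit describes can be sketched as a small helper. The function name and the NUMBER fallback are assumptions; only the unbound-decimal-to-DECFLOAT rule and its configurability come from the commit message.

```python
def snowflake_decimal_type(precision=None, scale=None, use_decfloat=True):
    """Sketch of mapping a decimal column to a Snowflake type: decimals
    without precision/scale ("unbound") go to DECFLOAT when the option
    is enabled; bounded decimals keep an exact NUMBER(p, s)."""
    if precision is None and scale is None:
        return "DECFLOAT" if use_decfloat else "NUMBER"
    return f"NUMBER({precision},{scale or 0})"
```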
…does not block it from usage for our tests. (dlt-hub#3501)
…3414)
* Add engine args argument to sql_table and sql_database with tests
* Add test relevance clarification
* Add docs for engine_kwargs usage
* Add engine args documentation for sqlalchemy destination; fix documentation for in-memory sqlite destination and add test for it
* Add destination test for sqlite with engine args for both in-memory and file-based database
* Fix code block name python -> py and remove postgres from sqlalchemy tests (there will be only sqlite and mysql)
* Remove unnecessary import
* Change pyarrow backend test to a pandas backend test, since pipeline files fail for windows and linux (passes for macos)
* Change back to pyarrow test
* Add check that engine is not disposed after use if not owned by dlt
* Add may_dispose_after_use keyword in sql_database
* Add unique id for pipeline names
* Remove generator from sqlite file test to use normal json and use storage fixture
* tests(sqlalchemy): add ref counting tests for destination when owned vs external engines
* Fix reflink
* rename to engine kwargs for destination too
* format lint
* Trigger CI pipeline
* Adjust engine kwargs test
* Remove brittle test
* Fix tests
* Format lint
* Remove brittle test
* fixes sqlalchemy refcount leak
* missing tests
* disposes reflection engine after use in sql_database
* small fixes
* handles in-memory sqlite correctly
---------
Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
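The core idea here, forwarding user-supplied engine kwargs to the driver, can be sketched with the stdlib. This is not dlt's implementation; `open_connection` and its parameters are invented for illustration, standing in for passing `engine_kwargs` through to SQLAlchemy's `create_engine()`.

```python
import sqlite3

def open_connection(database, connect_kwargs=None):
    """Illustrative pass-through: the caller's kwargs reach the driver
    unchanged, just as engine_kwargs would reach create_engine()."""
    return sqlite3.connect(database, **(connect_kwargs or {}))

# the user controls driver-level options without dlt knowing about them
conn = open_connection(":memory:", {"timeout": 5.0})
```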
…b#2888)
* Add extra_credentials to clickhouse destination configuration
* Rename extra credentials and add tests
* review fixes
* adds end to end test
---------
Co-authored-by: Jeff <ubuntu@ip-172-31-8-3>
Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
…#3570)
* fix: use correct view to obtain existing schemas and use IF NOT EXISTS to create a schema
* Create schema without existence check
* Use try/except logic instead of checking for table
* Smaller try block for better readability
* fixes exception handling
---------
Co-authored-by: Tim Hable <thable@varengold.de>
Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
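The "create without an existence check, catch the error" pattern this commit applies can be sketched against sqlite. Sqlite has no schemas, so a table stands in, and the helper name is invented; the point is the shape of the try/except.

```python
import sqlite3

def ensure_table(conn, name):
    """Create without checking first; treat 'already exists' as success.
    This avoids the race between the existence check and the CREATE."""
    try:
        conn.execute(f"CREATE TABLE {name} (id INTEGER)")
    except sqlite3.OperationalError as exc:
        if "already exists" not in str(exc):
            raise  # anything else is a real error

conn = sqlite3.connect(":memory:")
ensure_table(conn, "t")
ensure_table(conn, "t")  # second call is a no-op instead of an error
```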
* making query_result_bucket optional
* adds tests and docs
---------
Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
* fixes snippet and brings READMEs up to date with tooling
* adds help to Makefiles
…t-hub#3595)
* refines base table loader class, extracts connector-x, allows for backend registration + tests
* fixes tests
…#3599)
* implements consistent handling of UUID as strings
* fix UUID merge tests to actually reproduce dlt-hub#3299 user case
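A minimal sketch of what "consistent handling of UUID as strings" could look like; the helper name is invented and this is not dlt's actual normalizer, just the canonicalization idea.

```python
import uuid

def normalize_uuid(value):
    """Coerce uuid.UUID values and UUID-like strings to one canonical
    lowercase string form, so merge keys compare consistently no matter
    how the backend returned them."""
    if isinstance(value, uuid.UUID):
        return str(value)
    # round-trip through uuid.UUID to validate and canonicalize casing
    return str(uuid.UUID(str(value)))
```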
* implements replace merge tree engine on top of existing hints
* makes dedup_sort optional, docs refinement
---------
Co-authored-by: rudolfix <rudolfix@rudolfix.org>
* supports insert-only merge strategy
* enables upsert on fabric
* fixes other tests
* skips delete orphan job when insert-only
…t-hub#3652)
* feat: parallelize all sources in Airflow, including the first one

Resolves dlt-hub#2196. Replaces the first source task with an EmptyOperator start node so all decomposed source components fan out concurrently in both parallel and parallel-isolated modes.

* adds serialize_first_task to control airflow parallelism
* adds smoke tests of airflow helper for Airflow 3
* improves airflow test setup, v3 failed due to Airflow 3 bugs
* corrects airflow version
* fixes __getattr__ dunders
* fixes airflow tests
---------
Co-authored-by: justins <ksobayo0126@gmail.com>
Co-authored-by: rudolfix <rudolfix@rudolfix.org>
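A toy sketch of the fan-out decision described above. All names are invented; the real helper builds Airflow task graphs, while this only models which sources run concurrently depending on the new flag.

```python
def decompose(sources, serialize_first_task=True):
    """Return (first_node, parallel_group). With serialize_first_task=True
    the first source runs alone before the rest fan out; with False an
    empty start node (standing in for EmptyOperator) lets every source
    run concurrently."""
    start = "start"  # placeholder for an EmptyOperator node
    if serialize_first_task and sources:
        return sources[0], list(sources[1:])
    return start, list(sources)
```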
@rudolfix The refresh breaks when called via …
@munish7771 feel free to store the original schema argument so it can be passed to the new dataset instance in refresh(), i.e. …
* Initial commit
* Test
* Fix in test
* Better warning message
* Minor improvements
* string ref handled
* Tests adjusted
* Docs
* Sync with devel
* Simpler fix
* docs: add realistic closure-based data masking example
* Fix mypy lint errors for Python 3.10 compatibility
  - Replace enum.StrEnum (Python 3.11+) with str, Enum base classes
  - Replace lowercase callable with typing.Callable[..., Any]
  - Replace MaskingMethod | None with Optional[MaskingMethod]
  - Use resolved_method variable to avoid type narrowing issue
* moves production pseudonymize to example with a test
---------
Co-authored-by: rudolfix <rudolfix@rudolfix.org>
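A closure-based masking helper in the spirit of the docs example this commit adds. This is a generic sketch, not the example actually committed; the closure captures the masking policy once and returns a plain function suitable for a dlt map step.

```python
def make_masker(keep_last=4, mask_char="*"):
    """Return a masking function that hides all but the last `keep_last`
    characters (e.g. card or account numbers). The policy lives in the
    closure, so the returned function itself is a simple str -> str map."""
    def mask(value):
        s = str(value)
        if len(s) <= keep_last:
            return mask_char * len(s)  # too short to reveal anything
        return mask_char * (len(s) - keep_last) + s[-keep_last:]
    return mask

mask_card = make_masker()
```

The same factory can produce differently configured maskers for different columns without repeating the masking logic.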
* bumps to version 1.24.0
* test fixes
* expanded handover to other toolkits section
* removed extra whitespace
* Refine data-exploration wording per PR dlt-hub#3737 review
* chore: retrigger CI
* chore: retrigger CI
* chore: retrigger CI @njaltran
* docs: regenerate CLI reference to remove stale dlt pipeline mcp entry
* docs page updated with make update-cli-docs
* cli doc update
---------
Co-authored-by: anuunchin <88698977+anuunchin@users.noreply.github.com>
…3765)
* creates all eligible tables on staging dataset, truncates only those with jobs
* bumps pyathena client
* normalizes path endings in athena configuration (ibis client fix)
* fixes review and adds unit tests for init_client for load step
* removes locking fork test
* fixes athena partition test - new cursor row format for partitions
@rudolfix What I meant was: if we create the dataset through the pipeline API using dataset(), the function always passes a schema to the dataset constructor, even if we didn't explicitly provide one. In that case it always ends up as a caught exception and gives the error `TypeError: refresh() is not supported when the Dataset was created with a Schema instance`. Is this intended? One way I can think of is to change the behaviour of the pipeline.dataset() function, but I am not sure whether it is that way for a reason.
@munish7771 Can you rebase your branch on …? Some tests are failing on CI, but they are unrelated to your changes; they were recently fixed and merged to devel.
* persists load job metrics and follow up jobs graph
* adds dataset name per load package in load step metrics and trace
* passes normalize progress via files in load package
* prevents dynamic resource names from leaking into extract pipe
* initializes table counters in normalize steps, fixes tests
…com/munish7771/dlt into feat/3525-update-dataset-schema-api
Description
This PR adds a `sync_schema` method to `dlt.Dataset`, allowing us to refresh a dataset's schema when it becomes stale. It supports:

* `dlt.Schema` or schema name

Related Issues

* `dlt.Dataset.schema` #3525

Additional Context

* adds `sync_schema` to `dlt\dataset\dataset.py`
* tests in `tests\dataset\test_dataset.py`
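A self-contained sketch of the `sync_schema` idea described above. This mimics the pattern only; it is not dlt's actual `Dataset` class, and the `fetch_schema` callable stands in for reading the stored schema from the destination.

```python
class Dataset:
    """Toy model: the dataset caches a schema at construction time and
    can re-fetch it on demand when the cached copy goes stale."""

    def __init__(self, fetch_schema, schema=None):
        self._fetch_schema = fetch_schema
        # accept an explicit schema, else fetch one up front
        self.schema = schema or fetch_schema()

    def sync_schema(self):
        """Replace the cached schema with a freshly fetched one."""
        self.schema = self._fetch_schema()
        return self.schema
```

In this sketch, anything that mutates the stored schema after construction is invisible until `sync_schema()` is called, which is the staleness problem the PR addresses.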