This is a major release (0.4 -> 0.5) in our versioning scheme, so please review the breaking changes below. Most of them are relevant only for platform builders that use dlt internals. Some long-deprecated components were removed as well.
Breaking Changes
- PageNumberPaginator takes base_page and page arguments instead of initial_page. This makes it possible to paginate APIs that number pages from 0 as well as from 1. #1509
- the deprecated credentials argument was removed from dlt.pipeline. #1537 Please use destination factories to instantiate destinations with explicit credentials. (https://dlthub.com/devel/general-usage/destination#pass-explicit-credentials)
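
To illustrate how base_page and page interact, here is a minimal stand-alone sketch of page-number iteration. It is not dlt's implementation: the helper iter_pages and its total_pages parameter are hypothetical, and the stop rule (stop once total_pages pages, counted from base_page, have been covered) is an assumption made for the example.

```python
from typing import Iterator, Optional


def iter_pages(base_page: int, total_pages: int, page: Optional[int] = None) -> Iterator[int]:
    """Yield the page numbers to request (hypothetical helper, not part of dlt).

    base_page: number the API assigns to its first page (usually 0 or 1).
    total_pages: number of pages the API reports (assumed to be known up front).
    page: page to start from; falls back to base_page when not given.
    """
    current = base_page if page is None else page
    # The last valid page number is base_page + total_pages - 1.
    while current < base_page + total_pages:
        yield current
        current += 1


# An API that numbers pages from 0 and one that numbers them from 1.
zero_based = list(iter_pages(base_page=0, total_pages=3))
one_based = list(iter_pages(base_page=1, total_pages=3))
# Starting from page 2 of the 1-based API.
resumed = list(iter_pages(base_page=1, total_pages=3, page=2))
print(zero_based, one_based, resumed)
```

Keeping the API's numbering origin (base_page) separate from the starting position (page) is what lets the same stop rule work for both 0-based and 1-based APIs.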
Breaking Changes (internals)
- if a dlt.source or dlt.resource decorated function is passed a None in a default argument during a function call, it will be handled exactly like in a regular Python function call. Previously such a None would request argument injection from configuration. Please read more here: (#1430)
- dlt.config.value and dlt.secrets.value were evaluating to None at runtime. Now they will evaluate to a sentinel value. All the existing code should be backward compatible. (#1430)
- full_refresh flag of dlt.pipeline will be deprecated and replaced with dev_mode. (#1063) and (https://dlthub.com/devel/general-usage/pipeline#do-experiments-with-dev-mode)
- the default resource extraction sequence has changed from fifo to round_robin. You can switch back to the previous behavior and learn more about what this means here: (https://dlthub.com/docs/reference/performance#resources-extraction-fifo-vs-round-robin)
- if you create an instance of a SPEC (e.g. SnowflakeCredentials) it will not be marked as resolved even if all required fields are provided. Previously some were resolving and some were not. #1489
- parse_native_representation never marks config as resolved. Previously some were resolving and some were not. #1489
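
The first two items above follow the common sentinel-default pattern: only a dedicated marker object requests injection, while an explicit None is passed through untouched. The sketch below is a generic illustration of that pattern, not dlt internals; INJECT, CONFIG and fetch are hypothetical names introduced for the example.

```python
from typing import Any, Dict


class _Inject:
    """Hypothetical marker type meaning "take this value from configuration"."""

    def __repr__(self) -> str:
        return "<inject from config>"


# Single shared sentinel instance (stand-in for a value like dlt.config.value).
INJECT = _Inject()

# Hypothetical configuration store used only by this sketch.
CONFIG: Dict[str, Any] = {"api_key": "key-from-config"}


def fetch(api_key: Any = INJECT) -> Any:
    # Only the sentinel triggers a configuration lookup.
    if api_key is INJECT:
        api_key = CONFIG["api_key"]
    # An explicit None (or any other value) is returned as passed.
    return api_key


injected = fetch()            # argument omitted: value comes from CONFIG
explicit_none = fetch(None)   # explicit None stays None, no injection
explicit = fetch("abc")       # explicit value is used as-is
print(injected, explicit_none, explicit)
```

Because the default is a distinct object rather than None, the function can tell "argument omitted" apart from "caller deliberately passed None", which matches regular Python call semantics.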
Core Library
- support delta tables with delta-rs on top of filesystem destination. (#1382)
- LanceDB destination and examples (#1375)
- external files may be imported and loaded without extraction and normalization (https://dlthub.com/devel/general-usage/resource#import-external-files) - includes jsonl, csv, and parquet
- pick the loader file format for particular resource (https://dlthub.com/devel/general-usage/resource#pick-loader-file-format-for-a-particular-resource)
- extended support for various csv formats (https://dlthub.com/devel/dlt-ecosystem/file-formats/csv#change-settings)
- csv support for snowflake (#1470 https://dlthub.com/devel/dlt-ecosystem/destinations/snowflake#custom-csv-formats)
- support case sensitive and insensitive modes for our destinations, e.g. snowflake, redshift, bigquery, mssql etc. may work in both modes (#998 https://dlthub.com/devel/general-usage/naming-convention)
- you'll be able to fully change the naming convention, e.g. to use the LATIN-1 character set or to create collision-free names (https://dlthub.com/devel/general-usage/naming-convention#write-your-own-naming-convention)
- two new naming conventions: sql_cs_v1 (case sensitive) and sql_ci_v1 (case insensitive) to create SQL safe identifiers without snake case transformation (https://dlthub.com/devel/general-usage/naming-convention#available-naming-conventions)
- you'll be able to modify destination capabilities via destination factories (https://dlthub.com/devel/general-usage/destination#inspect-destination-capabilities)
- schemas will be reflected with a single SQL statement which will make schema migrations faster
- loader can handle many more jobs (files) than before. we tested with 30k jobs and it looks fine
- we are adding refresh modes to pipeline.run that allow dropping and recreating tables with different granularity. (https://dlthub.com/devel/general-usage/pipeline#refresh-pipeline-data-and-state)
- when generating fingerprint for filesystem destination only the bucket component is taken into account #1516
- 1272 Support ClickHouse GCS S3 compatibility mode in filesystem destination by @Pipboyguy in #1423
- Ensure arrow field's nullable flag matches the schema column by @steinitzu in #1429
- Fix streamlit bug on chess example by @sh-rp in #1425
- Fix databricks pandas error by @steinitzu in #1443
- Extend orjson dependency allowed range with excluded versions by @steinitzu in #1501
- Fix/1465 fixes snowflake auth credentials by @rudolfix in #1489
- skips non resolvable fields from appearing in sample secrets.toml by @rudolfix in #1432
- RESTClient: pass environment settings to requests.Session.send by @burnash in #1452
- fix: service principal auth support for synapse copy job by @jorritsandbrink in #1472
- docs: Fixed markdown issue in duckdb.md by @PabloCastellano in #1528
- Loader parallelism strategies (destination can request the loading strategy ie. sequential or parallel) by @sh-rp in #1457
- Migrate to sentry sdk 2.0 by @sh-rp in #1477
- fix: allow loggeradapter in addition to logger in logcollector by @matsmhans1 in #1483
- Add load_id to arrow tables in extract step instead of normalize by @steinitzu in #1449
- #1356 implements OAuth2 Client Credentials flow by @willi-mueller in #1357
- Add LanceDB custom destination example code by @Pipboyguy in #1323
- fix(incremental): don't filter Arrow tables with empty filters by @IlyaFaer in #1480
- fix: Pipeline.sql_client credentials forwarding by @jorritsandbrink in #1499
- RESTClient: fix duplicate params in URL in JSONResponsePaginator by @burnash in #1515
- Update default log output to not have padding on log level by @sh-rp in #1517
- fix: remove obsolete dremio destination capabilities by @jorritsandbrink in #1527
- feat(filesystem): use only netloc and scheme for fingerprint by @IlyaFaer in #1516
- removes deprecated credentials argument from Pipeline by @rudolfix in #1537
- improves collision detection when naming convention changes by @rudolfix in #1536
- Fix/1542 rest client: makes request parameters optional by @willi-mueller in #1544
- RESTClient: add integrations tests for paginators by @burnash in #1509
- selects all tables from info schema if number of tables > threshold by @rudolfix in #1547
- configurable staging dataset name by @rudolfix in #1555
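
The filesystem fingerprint change listed above (#1516) can be pictured with a small sketch: only the scheme and netloc (the bucket) of the bucket URL feed the fingerprint, so different paths inside the same bucket give the same value. The helper name bucket_fingerprint and the choice of sha256 are assumptions for illustration; dlt's actual digest may differ.

```python
import hashlib
from urllib.parse import urlparse


def bucket_fingerprint(bucket_url: str) -> str:
    """Hypothetical helper: fingerprint only the scheme and netloc of a bucket URL."""
    parsed = urlparse(bucket_url)
    # Drop the path so that only the bucket itself is taken into account.
    bucket_root = f"{parsed.scheme}://{parsed.netloc}"
    # sha256 is an assumption for this sketch, not necessarily what dlt uses.
    return hashlib.sha256(bucket_root.encode("utf-8")).hexdigest()[:16]


# Two locations in the same bucket and one in a different bucket.
same_a = bucket_fingerprint("s3://my-bucket/raw/2024")
same_b = bucket_fingerprint("s3://my-bucket/staging")
other = bucket_fingerprint("s3://other-bucket/raw/2024")
print(same_a == same_b, same_a == other)
```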
Docs
- naming conventions documentation (https://dlthub.com/docs/general-usage/naming-convention)
- methods to manipulate schema settings (https://dlthub.com/docs/general-usage/schema#schema-settings)
- rest_api: add troubleshooting section by @burnash in #1371
- RESTClient: add docs for init_request by @burnash in #1442
- Example: fast postgres to postgres by @AstrakhantsevaAA in #1428
- Docs: Updated filesystem docs with explanations for bucket URLs by @dat-a-man in #1435
- docs for loading with contracts to existing tables by @sh-rp in #1441
- Add troubleshooting to incremental docs by @burnash in #1458
- Docs: cover custom authentication, rework paginators section by @burnash in #1493
- rest_api: add an example to the incremental load section by @burnash in #1502
- rest_api: add a quick example to rest_api docs by @burnash in #1531
- Update grouping-resources.md docs by @axellpadilla in #1538
- adds examples and step by step explanation for refresh modes by @rudolfix in #1560
Verified Sources
We worked intensively on rest_api and sql_database:
- Add fallback value for tz in row_tuples_to_arrow (sql_database helpers) @khoadaniel dlt-hub/verified-sources#493
- allows SqlAlchemy engine to be passed to sql_table by @rudolfix dlt-hub/verified-sources#498
- Feat/505 rest api hooks in response actions @willi-mueller dlt-hub/verified-sources#512
- Feat/507 transformation function for incremental cursor @willi-mueller dlt-hub/verified-sources#515
- Allows incremental loading to be configured per resource in sql_database @rudolfix dlt-hub/verified-sources#478
- Allows setting the reflection level for tables: minimal (names/nullability), full (data types) and full_with_precision (with e.g. varchar length). @steinitzu dlt-hub/verified-sources#478
- Enables data type discovery from arrow data. Fixes a bug that was preventing the pyarrow/pandas backend from being used if the data type could not be inferred @steinitzu dlt-hub/verified-sources#478
- Allows defining type adapters: callback functions that allow handling custom database types @steinitzu dlt-hub/verified-sources#478
- stargazers graphQL query for github dlt-hub/verified-sources#483 by @cybermaxs
New Contributors
- @matsmhans1 made their first contribution in #1483
- @PabloCastellano made their first contribution in #1528
- @axellpadilla made their first contribution in #1538
Full Changelog: 0.4.12...0.5.1