0. tl;dr
- Raiders of the Lost Archive -- version your raw data to make historical queries more accurate
- Column type resolution for incremental models (no more
Value too long for character type
errors) - Postgres support
- Top-level configs applied to your project + all dependencies
- --threads CLI option + better multithreaded output
1. Source table archival #183
Commonly, analysts need to "look back in time" at some previous state of data in their mutable tables. Imagine a users
table which is synced to your data warehouse from a production database. This users
table is a representation of what your users look like now. Consider what happens if you need to look at revenue by city for each of your users trended over time. Specifically, what happens if a user moved from, say, Philadelphia to New York? To do this correctly, you need to archive snapshots of the users
table on a recurring basis. With this release, dbt now provides an easy mechanism to store such snapshots.
To use this new feature, declare the tables you want to archive in your dbt_project.yml
file:
archive:
- source_schema: synced_production_data # schema to look for tables in (declared below)
target_schema: dbt_archive # where to archive the data to
tables: # list of tables to archive
- source_table: users # table to archive
target_table: users_archived # table to insert archived data into
updated_at: updated_at # used to determine when data has changed
unique_key: id # used to generate archival query
- source_table: some_other_table
target_table: some_other_table_archive
updated_at: "updatedAt"
unique_key: "expressions || work || LOWER(too)"
- source_schema: some_other_schema
....
The archived tables will mirror the schema of the source tables they're generated from. In addition, three fields are added to the archive table:
valid_from
: The timestamp when this archived row was inserted (and first considered valid)valid_to
: The timestamp when this archived row became invalidated. The first archived record for a givenunique_key
hasvalid_to = NULL
. When newer data is archived for thatunique_key
, thevalid_to
field of the old record is set to thevalid_from
field of the new record!scd_id
: A unique key generated for each archive record. Scd = Slowly Changing Dimension.
dbt models can be built on top of these archived tables. The most recent record for a given unique_key
is the one where valid_to
is null
.
To run this archive process, use the command dbt archive
. After testing and confirming that the archival works, you should schedule this process through cron (or similar).
2. Incremental column expansion #175
Incremental tables are a powerful dbt feature, but there was at least one edge case which makes working with them difficult. During the first run of an incremental model, Redshift will infer a type for every column in the table. Subsequent runs can insert new data which does not conform to the expected type. One example is a varchar(16)
field which is inserted into a varchar(8)
field.
In practice, this error looks like:
Value too long for character type
DETAIL:
-----------------------------------------------
error: Value too long for character type
code: 8001
context: Value too long for type character varying(8)
query: 3743263
location: funcs_string.hpp:392
process: query4_35 [pid=18194]
-----------------------------------------------
With this release, dbt will detect when column types are incongruent and will attempt to reconcile these different types if possible. Specifically, dbt will alter the incremental model table schema from character varying(x)
to character varying(y)
for some y > x
. This should drastically reduce the occurrence of this class of error.
3. First-class Postgres support #183
With this release, Postgres became a first-class dbt target. You can configure a postgres database target in your ~/.dbt/profiles.yml
file:
warehouse:
outputs:
dev:
type: postgres # configure a target for Postgres
host: localhost
user: Drew
....
run-target: dev
While Redshift is built on top of Postgres, the two are subtly different. For instance, Redshift supports sort and dist keys, while Postgres does not! dbt will use the database target type
parameter to generate the appropriate SQL for the target database.
4. Root-level configs #161
Configurations in dbt_project.yml
can now be declared at the models:
level. These configurations will apply to the primary project, as well as any dependency projects. This feature is particularly useful for setting pre- or post- hooks that run for every model. In practice, this looks like:
name: 'My DBT Project'
models:
post-hook:
- "grant select on {{this}} to looker_user" # Applied to 'My DBT Project' and 'Snowplow' dependency
'My DBT Project':
enabled: true
'Snowplow':
enabled: true
5. --threads CLI option #143
The number of threads that DBT uses can now be overridden with a CLI argument. The number of threads used must be between 1 and 8.
dbt run --threads 1 # fine
# or
dbt run --threads 4 # great
# or
dbt run --threads 42 # too many!
In addition to this new CLI argument, the output from multi-threaded dbt runs should be a little more orderly now. Models won't show as START
ed until they're actually queued to run. Previously, the output here was a little confusing. Happy threading!
Upgrading
To upgrade to version 0.5.1 of dbt, run:
pip install --upgrade dbt
And another thing
- Join us on slack with questions or comments
Made with ♥️ by 🐟🏙 📈