artifacthub helm/dagster/dagster 0.14.0

latest releases: 1.7.6, 1.7.5, 1.7.4...
2 years ago

“Never Felt Like This Before”

Major Changes

  • Software-defined assets, which offer a declarative approach to data orchestration on top of Dagster’s core job/op/graph APIs, have matured significantly. Improvements include partitioned assets, a revamped asset details page in Dagit, a cross-repository asset graph view in Dagit, Dagster types on assets, structured metadata on assets, and the ability to materialize ad-hoc selections of assets without defining jobs. Users can expect the APIs to only undergo minor changes before being declared fully stable in Dagster’s next major release. For more information, view the software-defined assets concepts page here.
  • We’ve made it easier to define a set of software-defined assets where each Dagster asset maps to a dbt model. All of the dependency information between the dbt models will be reflected in the Dagster asset graph, while still running your dbt project in a single step.
  • Dagit has a new homepage, dubbed the “factory floor” view, that provides an overview of recent runs of all the jobs. From it, you can monitor the status of each job’s latest run or quickly re-execute a job. The new timeline view reports the status of all recent runs in a convenient gantt chart.
  • You can now write schedules and sensors that default to running as soon as they are loaded in your workspace, without needing to be started manually in Dagit. For example, you can create a sensor like this:
    from dagster import sensor, DefaultSensorStatus
    
    @sensor(job=my_job, default_status=DefaultSensorStatus.RUNNING)
    def my_running_sensor():
        ...
    or a schedule like this:
    from dagster import schedule, DefaultScheduleStatus, ScheduleEvaluationContext
    
    @schedule(job=my_job, cron_schedule="0 0 * * *", default_status=DefaultScheduleStatus.RUNNING)
    def my_running_schedule(context: ScheduleEvaluationContext):
        ...
    As soon as schedules or sensors with the default_status field set to RUNNING are included in the workspace loaded by your Dagster Daemon, they will begin creating ticks and submitting runs.
  • Op selection now supports selecting ops inside subgraphs. For example, to select an op my_op inside a subgraph my_graph, you can now specify the query as my_graph.my_op. This is supported in both Dagit and Python APIs.
  • Dagster Types can now have attached metadata. This allows TableSchema objects to be attached to Dagster Types via TableSchemaMetadata. A Dagster Type with a TableSchema will have the schema rendered in Dagit.
  • A new Pandera integration (dagster-pandera) allows you to use Pandera’s dataframe validation library to wrap dataframe schemas in Dagster types. This provides two main benefits: (1) Pandera’s rich schema validation can be used for runtime data validation of Pandas dataframes in Dagster ops/assets; (2) Pandera schema information is displayed in Dagit using a new TableSchema API for representing arbitrary table schemas.
  • The new AssetObservation event enables recording metadata about an asset without indicating that the asset has been updated.
  • AssetMaterializations, ExpectationResults, and AssetObservations can be logged via the context of an op using the OpExecutionContext.log_event method. Output metadata can also be logged using the OpExecutionContext.add_output_metadata method. Previously, Dagster expected these events to be yielded within the body of an op, which caused lint errors for many users, made it difficult to add mypy types to ops, and also forced usage of the verbose Output API. Here’s an example of the new invocations:
    from dagster import op, AssetMaterialization
    @op
    def the_op(context):
        context.log_event(AssetMaterialization(...))
        context.add_output_metadata({"foo": "bar"})
        ...
  • A new Airbyte integration (dagster-airbyte) allows you to kick off and monitor Airbyte syncs from within Dagster. The original contribution from @airbytehq’s own @marcosmarxm includes a resource implementation as well as a pre-built op for this purpose, and we’ve extended this library to support software-defined asset use cases as well. Regardless of which interface you use, Dagster will automatically capture the Airbyte log output (in the compute logs for the relevant steps) and track the created tables over time (via AssetMaterializations).
  • The ECSRunLauncher (introduced in Dagster 0.11.15) is no longer considered experimental. You can bootstrap your own Dagster deployment on ECS using our docker compose example or you can use it in conjunction with a managed Dagster Cloud deployment. Since its introduction, we’ve added the ability to customize Fargate container memory and CPU, mount secrets from AWS SecretsManager, and run with a variety of AWS networking configurations. Join us in #dagster-ecs in Slack!
  • [Helm] The default liveness and startup probes for Dagit and user deployments have been replaced with readiness probes. The liveness and startup probe for the Daemon has been removed. We observed and heard from users that under load, Dagit could fail the liveness probe which would result in the pod restarting. With the new readiness probe, the pod will not restart but will stop serving new traffic until it recovers. If you experience issues with any of the probe changes, you can revert to the old behavior by specifying liveness and startup probes in your Helm values (and reach out via an issue or Slack).

Breaking Changes and Deprecations

  • The Dagster Daemon now uses the same workspace.yaml file as Dagit to locate your Dagster code. You should ensure that if you make any changes to your workspace.yaml file, they are included in both Dagit’s copy and the Dagster Daemon’s copy. When you make changes to the workspace.yaml file, you don’t need to restart either Dagit or the Dagster Daemon - in Dagit, you can reload the workspace from the Workspace tab, and the Dagster Daemon will periodically check the workspace.yaml file for changes every 60 seconds. If you are using the Dagster Helm chart, no changes are required to include the workspace in the Dagster Daemon.
  • Dagster’s metadata API has undergone a signficant overhaul. Changes include:
    • To reflect the fact that metadata can be specified on definitions in addition to events, the following names are changing. The old names are deprecated, and will function as aliases for the new names until 0.15.0:
      • EventMetadata > MetadataValue
      • EventMetadataEntry > MetadataEntry
      • XMetadataEntryData > XMetadataValue (e.g. TextMetadataEntryData > TextMetadataValue)
    • The metadata_entries keyword argument to events and Dagster types is deprecated. Instead, users should use the metadata keyword argument, which takes a dictionary mapping string labels to MetadataValues.
    • Arbitrary metadata on In/InputDefinition and Out/OutputDefinition is deprecated. In 0.15.0, metadata passed for these classes will need to be resolvable to MetadataValue (i.e. function like metadata everywhere else in Dagster).
    • The description attribute of EventMetadataEntry is deprecated.
    • The static API of EventMetadataEntry (e.g. EventMetadataEntry.text) is deprecated. In 0.15.0, users should avoid constructing EventMetadataEntry objects directly, instead utilizing the metadata dictionary keyword argument, which maps string labels to MetadataValues.
  • In previous releases, it was possible to supply either an AssetKey, or a function that produced an AssetKey from an OutputContext as the asset_key argument to an Out/OutputDefinition. The latter behavior makes it impossible to gain information about these relationships without running a job, and has been deprecated. However, we still support supplying a static AssetKey as an argument.
  • We have renamed many of the core APIs that interact with ScheduleStorage, which keeps track of sensor/schedule state and ticks. The old term for the generic schedule/sensor “job” has been replaced by the term “instigator” in order to avoid confusion with the execution API introduced in 0.12.0. If you have implemented your own schedule storage, you may need to change your method signatures appropriately.
  • Dagit is now powered by Starlette instead of Flask. If you have implemented a custom run coordinator, you may need to make the following change:
    from flask import has_request_context, request
    def submit_run(self, context: SubmitRunContext) -> PipelineRun:
        jwt_claims_header = (
            request.headers.get("X-Amzn-Oidc-Data", None) if has_request_context() else None
        )
    Should be replaced by:
    def submit_run(self, context: SubmitRunContext) -> PipelineRun:
        jwt_claims_header = context.get_request_header("X-Amzn-Oidc-Data")
  • Dagit
    • Dagit no longer allows non-software-defined asset materializations to be be graphed or grouped by partition. This feature could render in incorrect / incomplete ways because no partition space was defined for the asset.
    • Dagit’s “Jobs” sidebar now collapses by default on Instance, Job, and Asset pages. To show the left sidebar, click the “hamburger” icon in the upper left.
    • “Step Execution Time” is no longer graphed on the asset details page in Dagit, which significantly improves page load time. To view this graph, go to the asset graph for the job, uncheck “View as Asset Graph” and click the step to view its details.
    • The “experimental asset UI” feature flag has been removed from Dagit, this feature is shipped in 0.14.0!
  • The Dagster Daemon now requires a workspace.yaml file, much like Dagit.
  • Ellipsis (“...”) is now an invalid substring of a partition key. This is because Dagit accepts an ellipsis to specify partition ranges.
  • [Helm] The Dagster Helm chart now only supported Kubernetes clusters above version 1.18.

New since 0.13.19

  • Software Defined Assets:
    • In Dagit, the Asset Catalog now offers a third display mode - a global graph of your software-defined assets.
    • The Asset Catalog now allows you to filter by repository to see a subset of your assets, and offers a “View in Asset Graph” button for quickly seeing software-defined assets in context.
    • The Asset page in Dagit has been split into two tabs, “Activity” and “Definition”.
    • Dagit now displays a warning on the Asset page if the most recent run including the asset’s step key failed without yielding a materialization, making it easier to jump to error logs.
    • Dagit now gives you the option to view jobs with software-defined assets as an Asset Graph (default) or as an Op Graph, and displays asset<>op relationships more prominently when a single op yields multiple assets.
    • You can now include your assets in a repository with the use of an AssetGroup. Each repository can only have one AssetGroup, and it can provide a jumping off point for creating the jobs you plan on using from your assets.
      from dagster import AssetGroup, repository, asset
      @asset(required_resource_keys={"foo"})
      def asset1():
          ...
      @asset
      def asset2():
          ...
      @repository
      def the_repo():
          asset_group = AssetGroup(assets=[asset1, asset2], resource_defs={"foo": ...})
          return [asset_group, asset_group.build_job(selection="asset1-")]
    • AssetGroup.build_job supports a selection syntax similar to that found in op selection.
  • Asset Observations:
    • You can now yield AssetObservations to log metadata about a particular asset from beyond its materialization site. AssetObservations appear on the asset details page alongside materializations and numerical metadata is graphed. For assets with software-defined partitions, materialized and observed metadata about each partition is rolled up and presented together. For more information, view the docs page here.
    • Added an asset_observations_for_node method to ExecuteInProcessResult for fetching the AssetObservations from an in-process execution.
  • Dagster Types with an attached TableSchemaMetadataValue now render the schema in Dagit UI.
  • [dagster-pandera] New integration library dagster-pandera provides runtime validation from the Pandera dataframe validation library and renders table schema information in Dagit.
  • OpExecutionContext.log_event provides a way to log AssetMaterializations, ExpectationResults, and AssetObservations from the body of an op without having to yield anything. Likewise, you can use OpExecutionContext.add_output_metadata to attach metadata to an output without having to explicitly use the Output object.
  • OutputContext.log_event provides a way to log AssetMaterializations from within the handle_output method of an IO manager without yielding. Likewise, output metadata can be added using OutputContext.add_output_metadata.
  • [dagster-dbt] The load_assets_from_dbt_project function now returns a set of assets that map to a single dbt run command (rather than compiling each dbt model into a separate step). It also supports a new node_info_to_asset_key argument which allows you to customize the asset key that will be used for each dbt node.
  • [dagster-airbyte] The dagster-airbyte integration now collects the Airbyte log output for each run as compute logs, and generates AssetMaterializations for each table that Airbyte updates or creates.
  • [dagster-airbyte] The dagster-airbyte integration now supports the creation of software-defined assets, with the build_airbyte_assets function.
  • [dagster-fivetran] The dagster-fivetran integration now supports the creation of software-defined assets with the build_fivetran_assets function.
  • The multiprocess executor now supports choosing between spawn or forkserver for how its subprocesses are created. When using forkserver we attempt to intelligently preload modules to reduce the per-op overhead.
  • [Helm] Labels can now be set on the Dagit and daemon deployments.
  • [Helm] The default liveness and startup probes for Dagit and user deployments have been replaced with readiness probes. The liveness and startup probe for the Daemon has been removed. We observed and heard from users that under load, Dagit could fail the liveness probe which would result in the pod restarting. With the new readiness probe, the pod will not restart but will stop serving new traffic until it recovers. If you experience issues with any of the probe changes, you can revert to the old behavior by specifying liveness and startup probes in your Helm values (and reach out via an issue or Slack).
  • [Helm] The Ingress v1 is now supported.

Community Contributions

  • Typo fix from @Jiafi, thank you!

Bugfixes

  • Fixed an issue where long job names were truncated prematurely in the Jobs page in Dagit.
  • Fixed an issue where loading the sensor timeline would sometimes load slowly or fail with a timeout error.
  • Fixed an issue where the first time a run_status_sensor executed, it would sometimes run very slowly or time out.
  • Fixed an issue where Launchpad mistakenly defaulted with invalid subset error in Dagit.
  • Multi-component asset keys can now be used in the asset graph filter bar.
  • Increased the storage query statement timeout to better handle more complex batch queries.
  • Added fallback support for older versions of sqlite to service top-level repository views in Dagit (e.g. the top-level jobs, schedule, and sensor pages).

Documentation

  • Images in the documentation now enlarge when clicked.
  • New example in examples/bollinger demonstrates dagster-pandera and TableSchema , and software-defined assets in the context of analyzing stock price data.

Don't miss a new dagster release

NewReleases is sending notifications on new releases.