github datahub-project/datahub v0.12.0

12 months ago

v0.12.0 Release Highlights

User Experience

Nested Domains

Nested Domains are here! This provides flexibility in organizing your entities within Domains to match the unique organizational structure of your company.

DataHub Chrome Extension Improvements

The Acryl DataHub Chrome extension now supports PowerBI! This is a super powerful way for your business users to gain DataHub-specific insights directly in the BI tools they use most. Additionally, we now support making edits back to DataHub Entities directly from the Chrome extension.

Access Management Tab for Datasets

Shoutout to @Ramendra761 from the PayPal Team for contributing a new Access Management tab in Dataset Entity pages! The aim of this feature is to enable users to view the required roles for accessing the Dataset, as defined by Roles and/or Policies in the organization’s Access Management System. It also introduces the ability to request access directly from the page.

Metadata Ingestion

Miscellaneous Improvements

  • Sampling-Based Profiling: You can now configure sampling-based profiling to address query performance concerns in Snowflake and BigQuery
  • Kafka Connect > Snowflake: We now support automatically defining lineage between the two platforms
  • Athena: Support for complex and nested schemas

Column-Level Lineage

We are incubating CLL support for the following:

  • Airflow plugin v2 now supports automatic extraction of CLL for certain operators, removing the need to annotate DAGs
  • dbt
  • Redshift
  • PowerBI (support for Column-Level Lineage for M-Query)

Incubating Sources

  • MLflow
  • Teradata
  • Unity Catalog Notebooks
  • DynamoDB

Developer Experience

  • Data Contracts: v0.12.0 introduces underlying models and CLI; UI support to follow
  • We now support creating custom models without requiring a fork of the main DataHub project
  • Updates to support OpenSearch 2.x and alternate Postgres db in postgres-setup

Other Notable Changes

  • Session token configuration has changed. All previously created session tokens will be invalid, and users will be prompted to log in again. The expiration time has also been shortened, which may result in more frequent login prompts with the default settings.
    There should be no other interruption due to this change.

Breaking Changes

Find full details here

  • #9044 - GraphQL APIs for adding ownership now expect either an ownershipTypeUrn referencing a custom ownership type or a (deprecated) type. Adding an ownership without a concrete type was previously allowed; this is no longer the case. For simplicity, you can use the type parameter, which is translated internally to a custom ownership type if one exists for the type being added.
  • #9010 - In the Redshift source's config, incremental_lineage now defaults to off.
  • #8810 - Removed support for SQLAlchemy 1.3.x. Only SQLAlchemy 1.4.x is supported now.
  • #8942 - Removed urn:li:corpuser:datahub owner for the Measure, Dimension and Temporal tags emitted
    by Looker and LookML source connectors.
  • #8853 - The Airflow plugin no longer supports Airflow 2.0.x or Python 3.7. See the docs for more details.
  • #8853 - Introduced the Airflow plugin v2. If you're using Airflow 2.3+, the v2 plugin will be enabled by default, and so you'll need to switch your requirements to include pip install 'acryl-datahub-airflow-plugin[plugin-v2]'. To continue using the v1 plugin, set the DATAHUB_AIRFLOW_PLUGIN_USE_V1_PLUGIN environment variable to true.
  • #8943 - The Unity Catalog ingestion source has a new option include_metastore, which will cause all urns to be changed when disabled.
    This is currently enabled by default to preserve compatibility, but will be disabled by default and then removed in the future.
    If stateful ingestion is enabled, simply setting include_metastore: false will perform all required cleanup.
    Otherwise, we recommend soft deleting all databricks data via the DataHub CLI:
    datahub delete --platform databricks --soft and then reingesting with include_metastore: false.
  • #8846 - Changed enum values in resource filters used by policies. RESOURCE_TYPE became TYPE and RESOURCE_URN became URN.
    Any existing policies using these filters (i.e. defined for particular urns or types such as dataset) need to be upgraded
    manually, for example by retrieving their respective dataHubPolicyInfo aspect and changing the filter section, i.e.
   "resources": {
     "filter": {
       "criteria": [
         {
           "field": "RESOURCE_TYPE",
           "condition": "EQUALS",
           "values": [
             "dataset"
           ]
         }
       ]
     }
   }

into

   "resources": {
     "filter": {
       "criteria": [
         {
           "field": "TYPE",
           "condition": "EQUALS",
           "values": [
             "dataset"
           ]
         }
       ]
     }
   }

This can be done, for example, with the datahub put command. Policies can also be removed and re-created via the UI.
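
The manual rewrite for #8846 can be scripted once a policy's dataHubPolicyInfo aspect is available as JSON. The following is a minimal Python sketch, not part of DataHub: the helper name upgrade_policy_filter and the FIELD_RENAMES table are hypothetical, and it assumes the aspect has the shape shown in the fragments above. Retrieving the aspect and writing it back are left out.

```python
import copy

# Old enum value -> new enum value, as listed in #8846.
FIELD_RENAMES = {"RESOURCE_TYPE": "TYPE", "RESOURCE_URN": "URN"}


def upgrade_policy_filter(policy_info: dict) -> dict:
    """Return a copy of a dataHubPolicyInfo dict with renamed filter fields.

    Hypothetical helper: assumes the layout
    resources -> filter -> criteria -> [{"field": ...}, ...].
    """
    upgraded = copy.deepcopy(policy_info)  # leave the caller's dict untouched
    resources = upgraded.get("resources") or {}
    resource_filter = resources.get("filter") or {}
    for criterion in resource_filter.get("criteria") or []:
        # Only rename the two fields that changed; everything else is kept.
        if criterion.get("field") in FIELD_RENAMES:
            criterion["field"] = FIELD_RENAMES[criterion["field"]]
    return upgraded


old_policy = {
    "resources": {
        "filter": {
            "criteria": [
                {"field": "RESOURCE_TYPE", "condition": "EQUALS", "values": ["dataset"]}
            ]
        }
    }
}
new_policy = upgrade_policy_filter(old_policy)
```

Working on a deep copy keeps the original aspect available for comparison before the upgraded version is written back.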

  • #9077 - The BigQuery ingestion source by default sets match_fully_qualified_names: true. This means that any dataset_pattern or schema_pattern specified will be matched on the fully qualified dataset name, i.e. <project_name>.<dataset_name>. We attempt to support the old pattern format by prepending .*\\. to dataset patterns lacking a period, so in most cases this should not cause any issues. However, if you have a complex dataset pattern, we recommend you manually convert it to the fully qualified format to avoid any potential issues.
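
To make the compatibility rule in #9077 concrete, here is a small Python sketch of the behavior as described: a pattern with no period gets the regex prefix `.*\.` (shown in the note as .*\\. with the backslash escaped). The function name to_fully_qualified and the sample project and dataset names are hypothetical; this illustrates the description above and is not the connector's actual code.

```python
import re


def to_fully_qualified(pattern: str) -> str:
    """Prefix a dataset-only pattern so it can match <project_name>.<dataset_name>."""
    if "." in pattern:
        # The pattern already contains a period, so under the rule described
        # above it is left unchanged.
        return pattern
    # Regex prefix: any project name followed by a literal dot.
    return r".*\." + pattern


converted = to_fully_qualified("analytics")
matches_qualified = re.match(converted, "my-project.analytics") is not None

# A pattern that contains a period is returned as-is.
unchanged = to_fully_qualified(r"my-project\.analytics")
```

Under this literal reading, a pattern such as `sales_.*` already contains a period and would not be prefixed, which is the kind of case where converting the pattern to the fully qualified format by hand is the safer choice.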

Full Changelog: v0.11.0...v0.12.0
