Release Highlights

DataHub v1.4.0 is packed with exciting updates, including:

AI & Context: Introducing Context Documents for bringing organizational knowledge to DataHub. Create context documents directly on DataHub, or import them from Notion & Confluence. Curate, refine, and semantically search across your documents using DataHub MCP Server & Agent Context Kit. Requires admin configuration.
Major UI Improvements: Redesigned ingestion source creation workflow with guided step-by-step experience, modernized login/signup pages, support for Service Accounts, new asset “Summary” profile tab with modular layout, and ability to upload files to asset documentation. Read more below!
New Ingestion Connectors: New connectors for Google Dataplex, Azure Data Factory, IBM Db2, Notion, & Confluence. Major enhancements include Airflow 3.x support, Snowflake Streamlit apps and Semantic Views ingestion, Databricks OAuth authentication, and Kafka Connect Confluent Cloud integration.
SDK Features: New Java SDK V2 with fluent builder API, Python SDK Tag entity support, parametrized assertion runs, and full Pydantic v2 migration.
Platform Improvements: Elasticsearch 8 support with multi-client shim and semantic search infrastructure.

User Experience

This release includes significant improvements to the user interface and user experience:

Improved Experiences: Home Page, Lineage Explorer, Entity Profiles, & More

Simplified Home Page

DataHub’s simplified, modular home page experience is now enabled by default for all DataHub instances.

Learn more about the new Home Page here.

Support for the old home page will be dropped in an upcoming release. Until then, you may revert to the previous home page by setting the following environment variable in the datahub-gms container:

SHOW_HOME_PAGE_REDESIGN to false
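As a rough illustration, an override like this is a key/value pair in the container's environment. The sketch below renders boolean flags as `KEY=value` lines suitable for an env file. `render_env_lines` is a hypothetical helper, not part of DataHub, and how the resulting lines reach the datahub-gms container (Docker Compose, Helm values, and so on) depends on your deployment.

```python
def render_env_lines(overrides: dict[str, bool]) -> list[str]:
    """Render boolean feature flags as KEY=value lines for an env file."""
    # Lowercase 'true'/'false' mirrors the values used in these notes.
    return [
        f"{name}={'true' if enabled else 'false'}"
        for name, enabled in sorted(overrides.items())
    ]


# Flag name taken from the release notes; the helper itself is illustrative.
lines = render_env_lines({"SHOW_HOME_PAGE_REDESIGN": False})
print("\n".join(lines))
```

Keeping overrides in one mapping makes it easier to review which defaults a deployment has changed before the old home page is removed.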

Entity Profile Summary Tabs

Check out the new summary tabs available on Domain, Glossary Term, & Data Product profile pages. Summaries provide an overview of the key details about each entity at a glance.

Streamlined Data Lineage Explorer

Navigate data dependencies with the redesigned lineage navigator, our most seamless lineage experience yet. Behavior should largely remain the same as with the old lineage UI. The old UI will be removed in a future release; for now, you can revert to it by setting the following environment variable in the datahub-gms container:

LINEAGE_GRAPH_V3 to false

Other UX Improvements

We’ve also included modernized Ingestion, Login, Sign Up, and Analytics pages in this release. Check them out and let us know what you think!

Important: Note that we’ve disabled the legacy UI for DataHub by default as of this release. You’ll no longer be able to toggle between the legacy UI & new UI in settings - the new UI will be visible by default. In future releases, the legacy UI will be removed from the UI codebase completely.

Context Documents & Semantic Search

Introducing Context Documents V1, a new feature that allows adding AI-related context and documentation to assets, & optional configurability for semantic search (beta).

Added models and APIs for Context Documents. [#15280]
Introduced UI flows for Context Documents. [#15279]
Various UI improvements for Context Documents. [#15413]
Support viewing and adding related context to all asset types. [#15453]
Support for ingesting external context documents from Notion & Confluence (see Ingestion Updates below).
Support for configuring semantic indexing of document contents, and semantic search via the semanticSearchAcrossEntities GraphQL resolver, through DataHub MCP Server, and via the Agent Context Kit document search tools.
- Note that semantic indexing must happen via Ingestion Recipes (Notion, Confluence, or DataHub Documents ingestion source). For more details, see Semantic Search Configuration.
- Semantic Search is only supported if you are using OpenSearch version 2.19.3+. It is NOT currently supported for Elasticsearch deployments.

This feature is enabled by default, but can be disabled by setting the following environment variable in the datahub-gms container:

CONTEXT_DOCUMENTS_ENABLED to false

Read more about Context Documents here. And read about configuring platform capabilities required for semantic search here.
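The search backend requirement for semantic search noted above can be expressed as a small pre-flight check. This is a minimal sketch, not DataHub code: `parse_version` and `supports_semantic_search` are hypothetical helpers, and the only facts taken from these notes are the minimum OpenSearch version (2.19.3) and the lack of Elasticsearch support.

```python
import re

MIN_OPENSEARCH = (2, 19, 3)  # minimum version stated in these release notes


def parse_version(text: str) -> tuple[int, ...]:
    """Extract the leading numeric components, e.g. '2.19.3' -> (2, 19, 3)."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"not a version string: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def supports_semantic_search(engine: str, version: str) -> bool:
    """Return True only for OpenSearch at or above the minimum version."""
    if engine.strip().lower() != "opensearch":
        return False  # Elasticsearch is not currently supported
    return parse_version(version) >= MIN_OPENSEARCH


print(supports_semantic_search("opensearch", "2.19.3"))
print(supports_semantic_search("opensearch", "2.11.0"))
print(supports_semantic_search("elasticsearch", "8.17.0"))
```

Comparing versions as integer tuples avoids the string-comparison pitfall where "2.9" would sort above "2.19".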

Agent Context Kit: Snowflake, LangChain, MCP Server

As of v1.4.0, DataHub is publishing a new Agent Context Kit Python library.

Shipped in this release:

Snowflake: Providing a new datahub agent CLI command that enables you to provision a Snowflake Cortex Agent that automatically has access to various DataHub tools for searching assets, documents, retrieving lineage, sample queries, and more. Learn more here.
LangChain: Providing a Python tools library that enables you to easily build LangChain Agents with access to DataHub assets & metadata. Learn more here.

In addition, we’ve also made some important additions to the DataHub MCP Server to add a host of new tools:

Mutation Tools: Edit tags, terms, owners, descriptions, structured properties, domains, & more.
Document Tools: Search (keyword OR semantic) across context documents, create new documents in the “Shared” space.

These tools will be available in v0.5.0 of DataHub MCP Server.

Service Accounts

Support for creating named service accounts, generating API access tokens, and granting permissions via DataHub’s Access Policies system. Useful for creating dedicated

Add support for service accounts in DataHub [#52765]

This feature is enabled by default. Read more about service accounts here.
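A token generated for a service account is used like any other DataHub access token: it is sent as a bearer token on API requests. The sketch below builds, but does not send, an authenticated GraphQL request using only the Python standard library. The host, the token placeholder, and the `build_graphql_request` helper are assumptions for illustration; the `/api/graphql` path and the trivial `__typename` query should be checked against your own deployment.

```python
import json
import urllib.request


def build_graphql_request(base_url: str, token: str, query: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated GraphQL POST request."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/graphql",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_graphql_request(
    "http://localhost:8080",      # assumed address of the DataHub backend
    "<service-account-token>",    # placeholder, replace with a real token
    "query { __typename }",       # minimal query valid on any GraphQL server
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Because permissions are granted through Access Policies, a request that authenticates successfully may still be denied for specific operations until the service account is given the relevant policy.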

Upload Files to Asset Documentation

New capability to upload and download files when documenting any type of asset in DataHub, using a configurable S3 storage backend. Requires configuring DataHub’s backend server to be able to read from and write to a particular S3 bucket.

File upload to S3 extension in UI. [#15061]
Presigned upload URL endpoint. [#14943]
Inline previews for text, PDF, and video files. [#15182]
Support for schema field and asset documentation. [#15055]
Permission checks for file downloads. [#15059]

This feature is disabled by default, and can be enabled by setting the following environment variables in the datahub-gms container:

DOCUMENTATION_FILE_UPLOAD_V1 to true

And S3 configs:

DATAHUB_BUCKET_NAME: # The S3 bucket name to use for storing data
DATAHUB_ROLE_ARN: # The AWS IAM role ARN to assume for S3 reads and writes

Note that this assumes AWS credentials with permission to read & write to the specified bucket are available & mounted in the environment where DataHub is running.
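Since the feature flag and the S3 settings must be set together, a deployment script might verify them before starting the container. The following is a minimal sketch under stated assumptions: `missing_upload_settings` is a hypothetical helper, and treating both S3 variables as mandatory whenever the flag is on is an assumption rather than something these notes state.

```python
from typing import Mapping

# Variable names come from the release notes; requiring both is an assumption.
REQUIRED_WHEN_ENABLED = ("DATAHUB_BUCKET_NAME", "DATAHUB_ROLE_ARN")


def missing_upload_settings(env: Mapping[str, str]) -> list[str]:
    """Return the S3 settings that are unset while file upload is enabled."""
    flag = env.get("DOCUMENTATION_FILE_UPLOAD_V1", "false").strip().lower()
    if flag != "true":
        return []  # feature is off, so nothing is required
    return [name for name in REQUIRED_WHEN_ENABLED if not env.get(name, "").strip()]


example = {
    "DOCUMENTATION_FILE_UPLOAD_V1": "true",
    "DATAHUB_BUCKET_NAME": "my-bucket",  # illustrative bucket name
}
print(missing_upload_settings(example))
```

In a real check, `os.environ` can be passed in place of the example mapping. The check only covers configuration presence; it does not confirm that the AWS credentials can actually read from or write to the bucket.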

Other Improvements

Support linking multiple Applications to entities. [#15160]
Structured properties infinite scroll with backend search. [#14991]
Option to hide structured properties with empty values. [#14872]
Model signature table for MLModel summary tab. [#15205]
Improved More Filters UX. [#15794]
Role selector with pagination and search. [#15858]
Tag editing updates with new menu. [#14884]
Show all views in settings. [#14971]
Runs tab for DataFlow entities. [#15775]

Metadata Ingestion

We're continuously improving our integrations to add new capabilities and squash bugs.

New Sources

Google Dataplex: New connector for Google Dataplex metadata ingestion. In incubation. [#15379]
Azure Data Factory: New connector for Azure Data Factory pipelines and datasets. In incubation. [#15499]
Microsoft Fabric OneLake: New connector to ingest from Fabric workspaces, lakehouses, warehouses, schema, and tables.
IBM Db2: New source for IBM Db2 databases. In incubation. [#14968]
Notion: Added as an ingestion source for Context Documents. In incubation. [#15970]
Confluence: Added as an ingestion source for Context Documents. In incubation. [#15970]

Existing Sources

Airflow:

Full Airflow 3.x support. [#13790]
Teradata operator support for Airflow plugin. [#15418]
DataFlow emission from task handler in distributed deployments. [#15875]

Snowflake:

Streamlit apps ingestion support. [#15272]
Semantic View ingestion support. [#15395]
Stateful time window ingestion for queries v2 with bucket alignment. [#15040]
Classification library added to dependencies. [#15407]

Databricks/Unity Catalog:

Azure OAuth support. [#15117]
OAuth and unified auth support. [#15824]
ML model signature and run details support. [#15177]
SQL-based query history extraction for usage. [#14953]
Migration from deprecated SqlParsingBuilder to SqlParsingAggregator. [#15005]

Kafka Connect:

Confluent Cloud connector and transform pipeline support. [#14575]
Lineage inference from DataHub. [#15234]

Fivetran:

Introducing support for Databricks. [#14897]
Introducing support for Google Sheets and API client integration. [#15007]
Improved REST API error handling. [#15323]

dbt:

Semantic view support. [#15411]
Bulk job ingestion for dbt Cloud. [#15264]
Freshness tests now ingested as Freshness Assertions. [#15885]

Hive Metastore:

Thrift connection mode with Kerberos support. [#15691]
Upstream lineage support. [#15435]

Oracle:

Support for materialized views, stored procedures, and usage. [#15118]

PostgreSQL/MySQL:

IAM auth support for MySQL and PostgreSQL sources. [#14899]

PowerBI:

Amazon Athena lineage support. [#15728]
ODBC upstream lineage mapping and SQL parsing fixes. [#15756]
DirectLake lineage extraction from PowerBI tables to Fabric OneLake tables. [#15927]

LookML:

Use Looker API to get fields of a View. [#15060]
Updated recipe and UI enhancements. [#15086]

Tableau:

Exponential backoff retry logic for InternalServerError. [#15828]

Redshift:

Query tagging identifying DataHub workload in the AWS Service Ready Program. [#15676]
Fix lineage extraction ignoring disabled flags. [#15545]

BigQuery:

Performance optimizations to minimize unnecessary API calls and improve performance for date-sharded tables. [#15945, #15978]
Pushdown deny/allow usernames for server-side user filtering. [#15699]
Case normalization for temp table inference. [#15252]

Dremio:

Custom schema resolver for non-standard URI length. [#15514]
OOM error handling for large metadata ingestion. [#14883]

Grafana:

Option to pass user email as dashboard owner. [#15489]
Fix for text panels causing ingestion failures. [#15291]

Iceberg:

Role assumption support. [#15288]

MSSQL:

Auto-enable use_odbc for mssql-odbc source type. [#15702]
Stored procedure lineage extraction fix. [#15340]
Statement splitting fix for expressions ending with parentheses. [#15730]

Metabase:

Legacy-mbql parameter for Metabase 0.57+ compatibility. [#15709]

MongoDB:

Fix handling of arrays containing complex structures. [#15026]

Qlik Sense:

Scoped ingestion to eliminate full environment scans. [#15837]

Misc. Ingestion Improvements

Postgres: Automated lineage extraction from query history. [#15924]
Secret masking framework for sensitive data. [#15188]
Recording and replay system for debugging ingestion runs. [#15480]
S3 performance improvements for get_dir_to_process and get_folder_info. [#14709]
Tags to structured properties transformer. [#15423]
Configurable SQL parse cache size via environment variable. [#15977]
OAuth callback support for Kafka producers/sinks. [#15420, #15673]
Lightweight Kafka connectivity validation. [#15472]
Convert to lowercase option for S3. [#15475]
Upper bounds added to dependency versions. [#15813]
Regex pattern compilation for ingestion filtering hot path. [#15463]

DataHub Python SDK

Improvements and new features for the DataHub SDK:

Full Pydantic v2 migration with v1 legacy code removed. [#15261]
Tag entity support in SDK v2. [#14791]
Parametrized Assertion Run support. [#15447]
DataProduct output ports support. [#15000]
DataJob environment defaults to PROD when using flow_urn. [#15388]
GraphQL CLI command. [#14781]
User add CLI command. [#15011]
Agent-friendly datahub init with non-interactive mode, environment variable support, and configurable token duration. [#16038]
-env option for ingest deploy command. [#15518]
Support for extra-pip and extra-env options in config file. [#15800]
MCE Topic optional in Kafka sink. [#14150]
Environment variables extracted into single file. [#15021]

DataHub Java SDK

New Java SDK V2 with modern API design:

Fluent builder API with entity support. [#15307]
Ensure full aspect writes complete before patches. [#15522]

Platform & Backend

Platform improvements and backend enhancements:

Elasticsearch 8 Support

Multi-client search engine shim for ES8 support. [#14904]

Search Enhancements

Semantic search infrastructure. [#15743]
Semantic search configuration GraphQL API. [#15959]

System Update Improvements

Consolidated setup jobs
- SQL setup replacement for MySQL/Postgres. [#15044]
- Elasticsearch setup replacement. [#15058]
Improved IAM setup support. [#15143]
Consistency checks API and upgrade validation. [#15766]

Kafka Improvements

Consumer lag monitoring. [#15769]
Automatic topic partition upsizing (opt-in). [#15714]
Split consumer/producer configuration. [#15751]
Events Kafka pool and client retry improvements. [#15429]

OpenAPI Enhancements

Improved scroll API with advanced pagination and facets. [#14877]
Sorting customization on missing value handling. [#15383]

GraphQL Enhancements

Comprehensive entity patching functionality. [#14823]

Other Platform Changes

HTTPS support with option to disable HTTP in frontend. [#15757]
OpenSearch upgrade to 2.19.3. [#15047]
Dependency locking for Gradle. [#15303]

AI

New capabilities for organizational knowledge and AI-powered features:

Context Documents

Context Base V1 models and APIs. [#15280]
UI flows for viewing and adding context to assets. [#15279]
Support for all asset types. [#15453]
Python APIs and documentation. [#15319]

Semantic Search

Backend infrastructure for semantic search on Context Documents. [#15743]
GraphQL configuration API to enable ingestion to share configs. [#15959]

Document Ingestion

New datahub-documents source with chunking and embedding support. [#15975]

Documentation

Documentation updates and improvements:

Context Documents and Ask DataHub user docs. [#15695]
Smart SQL assertions documentation. [#15187]
Custom SQL in column value assertions. [#15192]
Query attribution documentation. [#15879]
Logical datasets and bulk relationship removal. [#15029]
Metadata model entity documentation with field tables and SDK examples. [#15095]
MCP server setup instructions. [#15484]
Guidelines to avoid AI test anti-patterns. [#15073]
Ingestion security guidelines. [#15729]
Microfrontend usage and build instructions. [#15713]
CDC mode configuration improvements. [#15496]
Search access controls for DataHub Cloud. [#15513]

Breaking Changes

Pydantic v2 Migration: Pydantic v1 legacy code has been removed. All custom code using the SDK must migrate to Pydantic v2. [#15261]
Fivetran: Database and schema names handling changed to use quoted identifiers. [#15321]
LookML/Looker: See documented breaking changes. [#14947]
Python 3.9 EOL: Python 3.9 support has been removed as this version is EOL. [#15984]
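Two of the breaking changes above concern the runtime environment, so they can be checked before upgrading. The sketch below is illustrative rather than an official upgrade tool: `upgrade_blockers` and `installed_pydantic_version` are hypothetical helpers, and the minimum Python version of 3.10 is an assumption derived from 3.9 being removed.

```python
import sys
from importlib import metadata
from typing import Optional

MIN_PYTHON = (3, 10)      # assumption: first version after the removed 3.9
MIN_PYDANTIC_MAJOR = 2    # Pydantic v1 legacy code has been removed


def installed_pydantic_version() -> Optional[str]:
    """Return the installed Pydantic version, or None if it is not installed."""
    try:
        return metadata.version("pydantic")
    except metadata.PackageNotFoundError:
        return None


def upgrade_blockers(python_version: tuple, pydantic_version: Optional[str]) -> list[str]:
    """List environment problems that would conflict with the breaking changes."""
    problems = []
    if tuple(python_version[:2]) < MIN_PYTHON:
        problems.append("Python version is below 3.10")
    if pydantic_version is not None:
        major = int(pydantic_version.split(".")[0])
        if major < MIN_PYDANTIC_MAJOR:
            problems.append("Pydantic v1 is installed; v2 is required")
    return problems


print(upgrade_blockers(sys.version_info, installed_pydantic_version()))
```

An empty list only means the interpreter and the installed Pydantic version look compatible. Custom code that still uses Pydantic v1 APIs needs to be migrated separately.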

Security Notes

DataHub is aware of urllib3 DoS vulnerabilities CVE-2025-66418,
CVE-2025-66471, and CVE-2026-21441. These vulnerabilities slightly
increase DoS risk but do not change DataHub's threat model. Users
should configure ingestion from trusted sources only. urllib3 will
be upgraded when botocore supports versions >2.5.x.

datahub-project/datahub v1.4.0 on GitHub