Release Highlights
DataHub v1.4.0 is packed with exciting updates, including:
- AI & Context: Introducing Context Documents for bringing organizational knowledge to DataHub. Create context documents directly on DataHub, or import them from Notion & Confluence. Curate, refine, and semantically search across your documents using DataHub MCP Server & Agent Context Kit. Requires admin configuration.
- Major UI Improvements: Redesigned ingestion source creation workflow with guided step-by-step experience, modernized login/signup pages, support for Service Accounts, new asset “Summary” profile tab with modular layout, and ability to upload files to asset documentation. Read more below!
- New Ingestion Connectors: New connectors for Google Dataplex, Azure Data Factory, IBM Db2, Notion, & Confluence. Major enhancements include Airflow 3.x support, Snowflake Streamlit apps and Semantic Views ingestion, Databricks OAuth authentication, and Kafka Connect Confluent Cloud integration.
- SDK Features: New Java SDK V2 with fluent builder API, Python SDK Tag entity support, parametrized assertion runs, and full Pydantic v2 migration.
- Platform Improvements: Elasticsearch 8 support with multi-client shim and semantic search infrastructure.
User Experience
This release includes significant improvements to the user interface and user experience:
Improved Experiences: Home Page, Lineage Explorer, Entity Profiles, & More
Simplified Home Page
DataHub’s simplified, modular home page experience is now enabled by default for all DataHub instances.
Learn more about the new Home Page here.
Support for the old home page will be dropped in an upcoming release. Until then, you can revert to the previous home page by setting the following environment variable in the datahub-gms container:
SHOW_HOME_PAGE_REDESIGN to false
Entity Profile Summary Tabs
Check out the new summary tabs available on Domain, Glossary Term, & Data Product profile pages. Summaries provide an overview of the key details about each entity at a glance.
Streamlined Data Lineage Explorer
Experience the most seamless version of data lineage yet. Navigate across data dependencies with the redesigned lineage navigator. Behavior should largely remain the same as in the old lineage UI, which will be removed in a future release. For now, you can revert to it by setting the following environment variable in the datahub-gms container:
LINEAGE_GRAPH_V3 to false
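As a rough illustration of the two rollback flags described above, the sketch below renders them as `KEY=value` lines that could be pasted into an env file for the datahub-gms container. Only the variable names come from these notes. The helper name `render_env_lines` and the env-file approach are assumptions, and how the variables are actually set depends on your deployment tooling.

```python
# Hypothetical helper, not part of DataHub: renders flag overrides as env-file lines.
# Only the variable names are taken from the release notes.

LEGACY_UI_OVERRIDES = {
    "SHOW_HOME_PAGE_REDESIGN": False,  # revert to the previous home page
    "LINEAGE_GRAPH_V3": False,         # revert to the old lineage UI
}


def render_env_lines(flags: dict[str, bool]) -> list[str]:
    """Turn {NAME: bool} into 'NAME=true|false' lines, sorted by name."""
    return [
        f"{name}={'true' if value else 'false'}"
        for name, value in sorted(flags.items())
    ]


if __name__ == "__main__":
    print("\n".join(render_env_lines(LEGACY_UI_OVERRIDES)))
```

Keeping the overrides in one mapping makes it easier to remove them again once the old home page and lineage UI are dropped.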
Other UX Improvements
We’ve also included modernized Ingestion, Login, Sign Up, and Analytics pages in this release. Check them out and let us know what you think!
Important: Note that we’ve disabled the legacy UI for DataHub by default as of this release. You’ll no longer be able to toggle between the legacy UI & new UI in settings - the new UI will be visible by default. In future releases, the legacy UI will be removed from the UI codebase completely.
Context Documents & Semantic Search
Introducing Context Documents V1, a new feature that allows adding AI-related context and documentation to assets, & optional configurability for semantic search (beta).
- Added models and APIs for Context Documents [#15280]
- Introduced UI flows for Context Documents. [#15279]
- Various UI improvements for Context Documents. [#15413]
- Support viewing and adding related context to all asset types. [#15453]
- Support for ingesting external context documents from Notion & Confluence (see Ingestion Updates below).
- Support for configuring semantic indexing of document contents, and semantic search via the semanticSearchAcrossEntities GraphQL resolver, through DataHub MCP Server, and via the Agent Context Kit document search tools.
- Note that semantic search must happen via Ingestion Recipes (Notion, Confluence, or DataHub Documents ingestion source). For more details, see Semantic Search Configuration.
- Semantic Search is only supported if you are using OpenSearch version 2.19.3+. It is NOT currently supported for Elasticsearch deployments.
This feature is enabled by default, but can be disabled by setting the following environment variable in the datahub-gms container:
CONTEXT_DOCUMENTS_ENABLED to false
Read more about Context Documents here. And read about configuring platform capabilities required for semantic search here.
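The backend requirement for semantic search noted above (OpenSearch 2.19.3 or later, no Elasticsearch support) can be expressed as a small pre-flight check. This is a minimal sketch: the function names and the idea of passing the distribution and version as strings are assumptions for illustration, not a DataHub API.

```python
# Hypothetical pre-flight check; not part of DataHub.
MIN_OPENSEARCH = (2, 19, 3)  # minimum version stated in the release notes


def parse_version(version: str) -> tuple[int, ...]:
    """Parse '2.19.3' into (2, 19, 3), ignoring any '-suffix'."""
    core = version.split("-", 1)[0]
    return tuple(int(part) for part in core.split("."))


def supports_semantic_search(distribution: str, version: str) -> bool:
    """True only for OpenSearch at or above the minimum version."""
    if distribution.lower() != "opensearch":
        return False  # Elasticsearch deployments are not currently supported
    return parse_version(version) >= MIN_OPENSEARCH
```

Versions are compared as integer tuples because plain string comparison would rank "2.9.0" above "2.19.3".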
Agent Context Kit: Snowflake, LangChain, MCP Server
As of v1.4.0, DataHub is publishing a new Agent Context Kit Python library.
Shipped in this release:
- Snowflake: Providing a new datahub agent CLI command that enables you to provision a Snowflake Cortex Agent that automatically has access to various DataHub tools for searching assets, documents, retrieving lineage, sample queries, and more. Learn more here.
- LangChain: Providing a Python tools library that enables you to easily build LangChain Agents with access to DataHub assets & metadata. Learn more here.
In addition, we’ve also made some important additions to the DataHub MCP Server to add a host of new tools:
- Mutation Tools: Edit tags, terms, owners, descriptions, structured properties, domains, & more.
- Document Tools: Search (keyword OR semantic) across context documents, create new documents in the “Shared” space.
These tools will be available in v0.5.0 of the DataHub MCP Server.
Service Accounts
Support for creating named service accounts, generating API access tokens, and granting permissions via DataHub’s Access Policies system. Useful for creating dedicated
- Add support for service accounts in DataHub [#52765]
This feature is enabled by default. Read more about service accounts here.
Upload Files to Asset Documentation
New capability to upload and download files when documenting any type of asset in DataHub, using a configurable S3 storage backend. Requires configuring DataHub’s backend server to be able to read from and write to a particular S3 bucket.
- File upload to S3 extension in UI. [#15061]
- Presigned upload URL endpoint. [#14943]
- Inline previews for text, PDF, and video files. [#15182]
- Support for schema field and asset documentation. [#15055]
- Permission checks for file downloads. [#15059]
This feature is disabled by default, and can be enabled by setting the following environment variables in the datahub-gms container:
- DOCUMENTATION_FILE_UPLOAD_V1 to true
- And S3 configs:
DATAHUB_BUCKET_NAME: # The S3 bucket name to use for storing data
DATAHUB_ROLE_ARN: # The AWS IAM role ARN to assume for S3 reads and writes
Note that this assumes AWS credentials with permission to read & write to the specified bucket are available & mounted in the environment where DataHub is running.
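Because file upload depends on several settings being present together, a quick check before deployment can catch a missing one. The sketch below is illustrative only: the function name `check_file_upload_env` is hypothetical, and treating the flag value case-insensitively is an assumption rather than documented DataHub behavior.

```python
import os
from typing import Mapping, Optional

# Variable names are taken from the release notes; everything else is illustrative.
REQUIRED_S3_VARS = ("DATAHUB_BUCKET_NAME", "DATAHUB_ROLE_ARN")


def check_file_upload_env(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return a list of problems; an empty list means the settings look complete."""
    env = os.environ if env is None else env
    problems = []
    # Assumption: the flag is considered enabled when its value is "true".
    if env.get("DOCUMENTATION_FILE_UPLOAD_V1", "").strip().lower() != "true":
        problems.append("DOCUMENTATION_FILE_UPLOAD_V1 is not set to true")
    for name in REQUIRED_S3_VARS:
        if not env.get(name, "").strip():
            problems.append(f"{name} is missing or empty")
    return problems
```

Calling `check_file_upload_env()` with no arguments inspects the current process environment. It does not verify that AWS credentials are available or that the role can actually access the bucket.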
Other Improvements
- Support linking multiple Applications to entities. [#15160]
- Structured properties infinite scroll with backend search. [#14991]
- Option to hide structured properties with empty values. [#14872]
- Model signature table for MLModel summary tab. [#15205]
- Improved More Filters UX. [#15794]
- Role selector with pagination and search. [#15858]
- Tag editing updates with new menu. [#14884]
- Show all views in settings. [#14971]
- Runs tab for DataFlow entities. [#15775]
Metadata Ingestion
We're continuously improving our integrations to add new capabilities and squash bugs.
New Sources
- Google Dataplex: New connector for Google Dataplex metadata ingestion. In incubation. [#15379]
- Azure Data Factory: New connector for Azure Data Factory pipelines and datasets. In incubation. [#15499]
- Microsoft Fabric OneLake: New connector to ingest from Fabric workspaces, lakehouses, warehouses, schema, and tables.
- IBM Db2: New source for IBM Db2 databases. In incubation. [#14968]
- Notion: Added as ingestion source for Context Documents. In incubation. [#15970]
- Confluence: Added as an ingestion source for Context Documents. In incubation. [#15970]
Existing Sources
Airflow:
- Full Airflow 3.x support. [#13790]
- Teradata operator support for Airflow plugin. [#15418]
- DataFlow emission from task handler in distributed deployments. [#15875]
Snowflake:
- Streamlit apps ingestion support. [#15272]
- Semantic View ingestion support. [#15395]
- Stateful time window ingestion for queries v2 with bucket alignment. [#15040]
- Classification library added to dependencies. [#15407]
Databricks/Unity Catalog:
- Azure OAuth support. [#15117]
- OAuth and unified auth support. [#15824]
- ML model signature and run details support. [#15177]
- SQL-based query history extraction for usage. [#14953]
- Migration from deprecated SqlParsingBuilder to SqlParsingAggregator. [#15005]
Kafka Connect:
- Confluent Cloud connector and transform pipeline support. [#14575]
- Lineage inference from DataHub. [#15234]
Fivetran:
- Introducing support for Databricks. [#14897]
- Introducing support for Google Sheets and API client integration. [#15007]
- Improved REST API error handling. [#15323]
dbt:
- Semantic view support. [#15411]
- Bulk job ingestion for DBT Cloud. [#15264]
- Freshness tests now ingested as Freshness Assertions. [#15885]
Hive Metastore:
Oracle:
- Support for materialized views, stored procedures, and usage. [#15118]
PostgreSQL/MySQL:
- IAM auth support for MySQL and PostgreSQL sources. [#14899]
PowerBI:
- Amazon Athena lineage support. [#15728]
- ODBC upstream lineage mapping and SQL parsing fixes. [#15756]
- DirectLake lineage extraction from PowerBI tables to Fabric OneLake tables. [#15927]
LookML:
Tableau:
- Exponential backoff retry logic for InternalServerError. [#15828]
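For readers unfamiliar with the retry technique named in the Tableau item, here is a generic sketch of exponential backoff. It is not the connector's actual implementation: the `InternalServerError` class is a local stand-in, and the function name, parameters, and default values are all assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class InternalServerError(Exception):
    """Local stand-in for a transient server-side error (hypothetical)."""


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` on InternalServerError, doubling the delay after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except InternalServerError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Add up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter exists so the delay can be replaced in tests. In normal use the default `time.sleep` applies.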
Redshift:
- Query tagging identifying DataHub workload in the AWS Service Ready Program. [#15676]
- Fix lineage extraction ignoring disabled flags. [#15545]
BigQuery:
- Performance optimizations to minimize unnecessary API calls and improved performance for date-sharded tables [#15945, #15978]
- Pushdown deny/allow usernames for server-side user filtering. [#15699]
- Case normalization for temp table inference. [#15252]
Dremio:
- Custom schema resolver for non-standard URI length. [#15514]
- OOM error handling for large metadata ingestion. [#14883]
Grafana:
- Option to pass user email as dashboard owner. [#15489]
- Fix for text panels causing ingestion failures. [#15291]
Iceberg:
- Role assumption support. [#15288]
MSSQL:
- Auto-enable use_odbc for mssql-odbc source type. [#15702]
- Stored procedure lineage extraction fix. [#15340]
- Statement splitting fix for expressions ending with parentheses. [#15730]
Metabase:
- Legacy-mbql parameter for Metabase 0.57+ compatibility. [#15709]
MongoDB:
- Fix handling of arrays containing complex structures. [#15026]
Qlik Sense:
- Scoped ingestion to eliminate full environment scans. [#15837]
Misc. Ingestion Improvements
- Postgres: now supports automated lineage extraction from query history [#15924]
- Secret masking framework for sensitive data. [#15188]
- Recording and replay system for debugging ingestion runs. [#15480]
- S3 performance improvements for get_dir_to_process and get_folder_info. [#14709]
- Tags to structured properties transformer. [#15423]
- Configurable SQL parse cache size via environment variable. [#15977]
- OAuth callback support for Kafka producers/sinks. [#15420, #15673]
- Lightweight Kafka connectivity validation. [#15472]
- Convert to lowercase option for S3. [#15475]
- Upper bounds added to dependency versions. [#15813]
- Regex pattern compilation for ingestion filtering hot path. [#15463]
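The last item in the list above refers to compiling regex patterns ahead of time on a frequently executed filtering path. The general idea can be sketched as follows. This is a simplified illustration, not DataHub's filtering code: the `CompiledFilter` class and its allow/deny semantics are assumptions.

```python
import re
from typing import Iterable


class CompiledFilter:
    """Allow/deny name filter that compiles its regexes once, up front."""

    def __init__(self, allow: Iterable[str] = (".*",), deny: Iterable[str] = ()) -> None:
        # Compile once here instead of on every allowed() call.
        self._allow = [re.compile(p, re.IGNORECASE) for p in allow]
        self._deny = [re.compile(p, re.IGNORECASE) for p in deny]

    def allowed(self, name: str) -> bool:
        """Deny patterns win; otherwise the name must match an allow pattern."""
        if any(p.match(name) for p in self._deny):
            return False
        return any(p.match(name) for p in self._allow)
```

For example, `CompiledFilter(allow=[r"sales\..*"], deny=[r".*\.tmp_.*"])` would accept `sales.orders` and reject `sales.tmp_orders`. The saving comes from skipping the pattern lookup and compilation on each of the potentially millions of names checked during an ingestion run.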
DataHub Python SDK
Improvements and new features for the DataHub SDK:
- Full Pydantic v2 migration with v1 legacy code removed. [#15261]
- Tag entity support in SDK v2. [#14791]
- Parametrized Assertion Run support. [#15447]
- DataProduct output ports support. [#15000]
- DataJob environment defaults to PROD when using flow_urn. [#15388]
- GraphQL CLI command. [#14781]
- User add CLI command. [#15011]
- Agent-friendly datahub init with non-interactive mode, environment variable support, and configurable token duration. [#16038]
- env option for ingest deploy command. [#15518]
- Support for extra-pip and extra-env options in config file. [#15800]
- MCE Topic optional in Kafka sink. [#14150]
- Environment variables extracted into single file. [#15021]
DataHub Java SDK
New Java SDK V2 with modern API design:
- Fluent builder API with entity support. [#15307]
- Ensure full aspect writes complete before patches. [#15522]
Platform & Backend
Platform improvements and backend enhancements:
Elasticsearch 8 Support
- Multi-client search engine shim for ES8 support. [#14904]
Search Enhancements
System Update Improvements
- Consolidated setup jobs
- Improved IAM setup support. [#15143]
- Consistency checks API and upgrade validation. [#15766]
Kafka Improvements
- Consumer lag monitoring. [#15769]
- Automatic topic partition upsizing (opt-in). [#15714]
- Split consumer/producer configuration. [#15751]
- Events Kafka pool and client retry improvements. [#15429]
OpenAPI Enhancements
- Improved scroll API with advanced pagination and facets. [#14877]
- Sorting customization on missing value handling. [#15383]
GraphQL Enhancements
- Comprehensive entity patching functionality. [#14823]
Other Platform Changes
- HTTPS support with option to disable HTTP in frontend. [#15757]
- OpenSearch upgrade to 2.19.3. [#15047]
- Dependency locking for Gradle. [#15303]
AI
New capabilities for organizational knowledge and AI-powered features:
Context Documents
- Context Base V1 models and APIs. [#15280]
- UI flows for viewing and adding context to assets. [#15279]
- Support for all asset types. [#15453]
- Python APIs and documentation. [#15319]
Semantic Search
- Backend infrastructure for semantic search on Context Documents. [#15743]
- GraphQL configuration API to enable ingestion to share configs. [#15959]
Document Ingestion
- New datahub-documents source with chunking and embedding support. [#15975]
Documentation
Documentation updates and improvements:
- Context Documents and Ask DataHub user docs. [#15695]
- Smart SQL assertions documentation. [#15187]
- Custom SQL in column value assertions. [#15192]
- Query attribution documentation. [#15879]
- Logical datasets and bulk relationship removal. [#15029]
- Metadata model entity documentation with field tables and SDK examples. [#15095]
- MCP server setup instructions. [#15484]
- Guidelines to avoid AI test anti-patterns. [#15073]
- Ingestion security guidelines. [#15729]
- Microfrontend usage and build instructions. [#15713]
- CDC mode configuration improvements. [#15496]
- Search access controls for DataHub Cloud. [#15513]
Breaking Changes
- Pydantic v2 Migration: Pydantic v1 legacy code has been removed. All custom code using the SDK must migrate to Pydantic v2. [#15261]
- Fivetran: Database and schema names handling changed to use quoted identifiers. [#15321]
- LookML/Looker: See documented breaking changes. [#14947]
- Python 3.9 EOL: Python 3.9 support has been removed as this version is EOL [#15984]
Security Notes
- DataHub is aware of urllib3 DoS vulnerabilities CVE-2025-66418, CVE-2025-66471, and CVE-2026-21441. These vulnerabilities slightly increase DoS risk but do not change DataHub's threat model. Users should configure ingestion from trusted sources only. urllib3 will be upgraded when botocore supports versions >2.5.x.