Table of Contents
- Netdata Growth
- Summary
- Highlights
- Acknowledgments
- Contributions
- Collectors
- Packaging/Installation
- Documentation
- Other Notable Changes
- Deprecation notice
- Support options
Netdata Growth
- 1.5 million downloads per day
- 73.7k GitHub stars!
- 656.1M Docker Hub pulls!
Netdata continues to experience phenomenal growth, with over 1.5 million downloads daily through Cloudflare and Docker Hub, fueling user observability worldwide.
Thanks to your unwavering support ❤️, Netdata leads the observability category of the CNCF landscape in GitHub stars, ahead of all other solutions, including Elasticsearch, Grafana, and Prometheus. This demonstrates the trust and admiration of our community.
This success drives rapid adoption among enterprises, reflecting the growing recognition of Netdata as the go-to observability solution for both cloud-native and on-premises environments. Our commitment remains steadfast: to deliver cutting-edge, AI-powered observability with unmatched performance and simplicity—all while being significantly more affordable.
We are also proud to see our users and customers run high-scale setups with Netdata, reliably handling multi-million samples/s workloads while streamlining their operations.
As we evolve, our focus on empowering businesses with higher-fidelity AI insights ensures Netdata remains the easiest and fastest way to optimize infrastructure and applications at any scale. 🚀
Do you like Netdata? Give Netdata a ⭐ too, on GitHub!
Release Summary
Netdata 2.3 delivers significant enhancements to monitoring reliability and scalability:
- Crash Handling & Reporting: A zero-sampling system that captures and analyzes agent crashes with complete diagnostic information, significantly improving reliability across diverse environments.
- Extreme Cardinality Protection: Automatic safeguards that maintain performance in high-scale environments with millions of time series while intelligently managing metadata retention.
- Nodes Ephemerality & Streaming Alerts: A sophisticated approach to handling node connections in distributed environments, reducing alert noise by distinguishing between permanent and ephemeral nodes.
- SNMP Service Discovery: A new system automatically finds and monitors SNMP-enabled devices on configured networks, eliminating manual configuration.
Release Highlights
Nodes Ephemerality & Streaming Alerts
Netdata 2.3 implements a more sophisticated approach to handling node connections in distributed environments. We now define ephemeral nodes as "nodes that are expected to disconnect without raising alerts", enabling smarter monitoring of dynamic infrastructure.
Feature | Description |
---|---|
Smart Node Classification | Distinguish between permanent infrastructure (servers) and ephemeral resources (containers, auto-scaling instances) |
Targeted Alerting | Disconnection alerts trigger only for permanent nodes, reducing alert noise and focusing attention on genuine issues |
Dynamic Infrastructure Support | Configure auto-scaling cloud instances, containers, and test environments as ephemeral to prevent unnecessary alerts |
Simple Configuration | Mark nodes as ephemeral with a single setting in netdata.conf: is ephemeral node = yes |
Automated Cleanup | Configurable retention periods to automatically remove disconnected ephemeral nodes from dashboards |
Selective Cloud Notifications | Netdata Cloud now sends node-unreachable notifications exclusively for permanent nodes |
Node Management CLI | Use netdatacli mark-stale-nodes-ephemeral to clear alerts for permanently offline nodes |
Learn more about managing ephemeral nodes.
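The rules in the table above can be modeled in a few lines of Python. This is an illustrative sketch only, not Netdata's implementation: the `Node` class, its fields, and the `cleanup_after` parameter are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Node:
    """Hypothetical model of a child node as seen by a parent."""
    name: str
    ephemeral: bool  # mirrors the `is ephemeral node = yes` setting
    disconnected_at: Optional[datetime] = None  # None while connected


def should_alert_on_disconnect(node: Node) -> bool:
    """Only permanent nodes that are currently disconnected raise an alert."""
    return node.disconnected_at is not None and not node.ephemeral


def should_remove(node: Node, now: datetime, cleanup_after: timedelta) -> bool:
    """Disconnected ephemeral nodes are dropped once the retention period passes."""
    if not node.ephemeral or node.disconnected_at is None:
        return False
    return now - node.disconnected_at >= cleanup_after
```

In this model a disconnected container never triggers an alert and eventually disappears from the dashboard, while a disconnected server alerts and is never removed automatically.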
Extreme Cardinality Protection
Netdata 2.3 introduces automatic protection against extreme cardinality issues when combining high-dimensional metrics with long retention periods. This system provides:
Feature | Description |
---|---|
Intelligent Detection | Automatically identifies contexts with excessive ephemeral metrics (≥1000 instances with >50% ephemerality) |
Balanced Protection | Preserves all actively collected metrics while selectively clearing retention for ephemeral ones |
Resource Optimization | Prevents memory bloat and performance degradation from abandoned time-series metadata |
Configurable Thresholds | Adjustable settings for instance count and ephemerality percentage to match your environment |
Transparent Operation | Detailed logging of all protection activities for easy monitoring and verification |
This protection maintains Netdata's performance even in high-scale environments with millions of time series, while still allowing unlimited cardinality for high-resolution data. Learn more about configuring this feature.
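As a rough illustration of the detection rule quoted above (at least 1000 instances, more than 50% of them ephemeral), the check could look like the following sketch. The function and parameter names are hypothetical, and the release notes do not define how an instance is classified as ephemeral, so the counts are simply taken as inputs.

```python
def is_extreme_cardinality(total_instances: int,
                           ephemeral_instances: int,
                           min_instances: int = 1000,
                           min_ephemeral_pct: float = 50.0) -> bool:
    """Return True when a context crosses both thresholds.

    Defaults follow the values quoted in the release notes;
    both are described there as configurable.
    """
    if total_instances <= 0 or total_instances < min_instances:
        return False
    ephemeral_pct = 100.0 * ephemeral_instances / total_instances
    return ephemeral_pct > min_ephemeral_pct
```

Both conditions must hold: a small context is never flagged no matter how ephemeral it is, and a large context is left alone while most of its instances are still being collected.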
Crash Handling & Reporting
We've implemented a powerful, zero-sampling crash monitoring system that captures and analyzes agent restarts and crashes with complete diagnostic information. This solution leverages systemd's journal for flexible, scalable event tracking without additional licensing costs. With anonymous telemetry enabled, this system helps us identify critical issues across diverse environments, significantly improving Netdata's reliability for all users. Read more about our approach in this blog post.
Feature | Description |
---|---|
Zero-Sampling Collection | Captures every single crash event without sampling, providing complete visibility into system behavior |
Comprehensive Diagnostics | Records detailed stack traces, error messages, and system context for accurate root cause analysis |
Efficient Deduplication | Intelligent system that prevents redundant reporting (only one crash type per agent per day) |
Privacy-Focused | No IP addresses collected, only anonymous telemetry with user opt-out option |
Lightweight Implementation | Minimal performance impact, only activates when Agent starts, stops, or crashes |
Cost-Effective Architecture | Leverages existing systemd journal infrastructure instead of expensive third-party solutions |
High Scalability | Processes up to 20,000 events per second per instance with horizontal scaling capability |
Flexible Analysis | Transforms complex JSON data into flattened journal entries for powerful filtering and correlation |
Proven Results | Already identified and resolved dozens of critical issues across diverse environments |
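The deduplication rule in the table can be read as "report each distinct crash type at most once per agent per day". A minimal sketch under that reading follows; the class name, the anonymous `agent_id` string, and the `crash_type` label are assumptions made for the example, not details of Netdata's actual reporting code.

```python
from datetime import date


class CrashDeduplicator:
    """Hypothetical sketch: one report per (agent, crash type) per day."""

    def __init__(self) -> None:
        # Keys are (agent_id, crash_type, day) tuples that were already reported.
        self._seen = set()

    def should_report(self, agent_id: str, crash_type: str, day: date) -> bool:
        key = (agent_id, crash_type, day)
        if key in self._seen:
            return False  # same crash already reported today
        self._seen.add(key)
        return True
```

A different crash type, a different agent, or a new day each produce a new key, so none of those events are suppressed.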
SNMP Discovery
Netdata 2.3 adds an SNMP service discovery system that automatically finds and monitors SNMP-enabled devices on your networks.
Feature | Description |
---|---|
Automated Device Detection | Scans configured networks to discover SNMP-enabled devices without manual configuration |
Flexible Network Configuration | Supports various IP range formats including single IPs, ranges, and CIDR notation (up to 512 IPs per subnet) |
Customizable Credentials | Configure multiple credential sets with support for SNMPv2c and SNMPv3 with various security levels |
Performance Optimization | Controls network impact through concurrent scan limits and configurable caching of discovery results |
Seamless Integration | Automatically makes discovered devices available to the SNMP collector for immediate monitoring |
Scheduled Rescanning | Periodically rechecks networks for new devices with configurable intervals |
This feature is disabled by default and requires explicit configuration of networks and credentials to activate. See example configuration for details.
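To make the supported address formats concrete, here is a small sketch that expands a single IP, a range, or a CIDR block into individual targets using Python's standard `ipaddress` module. The `a-b` range syntax, the function name, and the choice to reject oversized inputs with an error are assumptions for illustration; the release notes only state that up to 512 IPs per subnet are supported.

```python
import ipaddress
from typing import List

MAX_IPS = 512  # per-subnet limit quoted in the table above


def expand_targets(spec: str) -> List[str]:
    """Expand a single IP, an 'a-b' range, or a CIDR block into addresses."""
    spec = spec.strip()
    if "/" in spec:
        network = ipaddress.ip_network(spec, strict=False)
        if network.num_addresses > MAX_IPS:
            raise ValueError(f"{spec}: more than {MAX_IPS} addresses")
        # hosts() skips the network and broadcast addresses of the block
        return [str(ip) for ip in network.hosts()]
    if "-" in spec:
        first, last = (ipaddress.ip_address(p.strip()) for p in spec.split("-", 1))
        if first.version != last.version:
            raise ValueError(f"{spec}: mixed IPv4/IPv6 range")
        count = int(last) - int(first) + 1
        if count < 1 or count > MAX_IPS:
            raise ValueError(f"{spec}: invalid or oversized range")
        return [str(first + i) for i in range(count)]
    return [str(ipaddress.ip_address(spec))]
```

Checking the size before expanding keeps a mistyped prefix such as a /8 from generating millions of scan targets.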
Acknowledgments
- @Passific for adding metadata labels to Docker images to support version-based automated update scripts.
- @L-U-C-K-Y for fixing incorrect command examples in the service management documentation.
- @arkamar for fixing MySQL collector to properly handle changes in InnoDB log metrics.
- @intelfx for fixing cgroup path validation to properly handle special characters and escape sequences in cgroups.plugin.
- @marcin-smseagle for adding SMSEagle integration to health notifications.
Contributions
Collectors
Improvements
- Added filtering options for network interfaces and mountpoints in macOS collector (macos.plugin) (#19865, @ilyam8)
- Added CronJob metrics collection to KubeState collector (go.d/k8s_state) (#19793, #19796, #19801, #19829 @ilyam8)
- Added automatic device discovery feature to go.d.plugin (#19720, #19755, #19756, #19760 @ilyam8)
- Added Active Directory Federation Service monitoring (AD FS) (windows.plugin) (#19699, @thiagoftsm)
- Added additional monitoring metrics to the ZooKeeper collector (go.d/zookeeper) (#19584, #19587 @ilyam8)
- Added Active Directory Certification Service monitoring (windows.plugin) (#19492, @thiagoftsm)
- Added Active Directory monitoring (windows.plugin) (#19461, @thiagoftsm)
- Added .NET Framework monitoring (windows.plugin) (#19008, @thiagoftsm)
Bug fixes
- Fixed FreeBSD collector to properly label disk and network charts without instance-specific information in titles (freebsd.plugin) (#19866, @ilyam8)
- Fixed macOS collector to properly label disk and network charts without instance-specific information in titles (macos.plugin) (#19859, @ilyam8)
- Fixed file path validation in filecheck collector to support Windows paths (go.d/filecheck) (#19815, @ilyam8)
- Fixed Pi-hole collector to support the completely redesigned API in version 6 (go.d/pihole) (#19807, @ilyam8)
- Fixed NVIDIA SMI collector to handle XML structure changes in GPU power readings (go.d/nvidia_smi) (#19759, @ilyam8)
- Fixed lm_sensors to properly load all configuration files when using standard paths (debugfs/sensors) (#19744, @ktsaou)
- Fixed go.d.plugin to gracefully handle SIGPIPE signals instead of terminating (go.d.plugin) (#19739, @ilyam8)
- Fixed JSON formatting of processes function information (apps.plugin) (#19732, @ktsaou)
- Fixed MySQL collector to properly handle changes in InnoDB log metrics available in MariaDB 10.8+ (go.d/mysql) (#19687, @arkamar)
- Fixed cgroup path validation to properly handle special characters and escape sequences (cgroups.plugin) (#19490, @intelfx)
Other
- Added initial support for loading and parsing SNMP device profiles (#19813, @Ancairon)
- Added documentation for SNMP discovery feature in the SNMP collector (go.d/snmp) (#19790, @ilyam8)
- Added Datadog profiles for SNMP collector (go.d.plugin) (#19785, #19786 @Ancairon)
- Fixed service discovery component to correctly log configuration sources when pipelines are disabled (go.d.plugin) (#19777, @ilyam8)
- Added file path information to Kubernetes and SNMP discovered job sources for better tracking (go.d.plugin) (#19776, @ilyam8)
- Added facet support for most fields in systemd-journal collector (systemd-journal.plugin) (#19713, @ktsaou)
- Added IP range iterator functionality to iprange package (go.d.plugin) (#19688, @ilyam8)
- Changed Apache collector to issue a warning instead of failing when URL lacks the "?auto" parameter (go.d/apache) (#19580, @ilyam8)
- Added missing metadata for AD and ADCS collectors (windows.plugin) (#19557, @thiagoftsm)
- Improved code formatting in eBPF plugin for better readability (#19553, #19582, @thiagoftsm)
- Improved code formatting in Windows plugin for better readability (#19544, #19554 @ilyam8 @thiagoftsm)
- Enabled virtual node creation by default in SNMP collector (go.d/snmp) (#19529, @ilyam8)
- Added missing alert information to HTTP check collector metadata (go.d/httpcheck) (#19516, @ilyam8)
- Fixed code style issues in NVMe collector to resolve static analysis warnings (go.d/nvme) (#19510, @ilyam8)
- Added configuration example for PostgreSQL collector using Unix sockets with custom ports (go.d/postgres) (#19501, @ilyam8)
- Added initial implementation of eBPF plugin redesign to use inter-plugin communication pipes instead of direct chart creation (#19219, #19572 @thiagoftsm)
- Added missing metadata for .NET Framework collector (windows.plugin) (#19203, @thiagoftsm)
Packaging/Installation
All changes
- Fixed stack trace generation on static builds by disabling local-only unwinding restriction (#19858, @ktsaou)
- Added metadata labels to Docker images to support version-based automated update scripts (#19839, @Passific)
- Integrated custom OpenTelemetry collector distribution build process into the main build system (#19702, #19832 @Ferroin)
- Added libucontext support to static builds for proper libunwind functionality on POWER architectures (#19817, @Ferroin)
- Improved update process reliability by enhancing Agent shutdown logic with better service manager interaction, increased timeouts, and flexible configuration paths (#19781, @Ferroin)
- Changed kickstart script to install native packages instead of static builds on Raspberry Pi 2+ (#19773, #19850 @ilyam8)
- Added libunwind support to static builds for improved stack trace handling (#19764, @ktsaou)
- Added initial framework for a custom OpenTelemetry Collector distribution (#19678, @ilyam8)
- Fixed RHEL native package installation by correctly implementing EPEL repository setup and enabling CodeReady Builder on RHEL 9+ systems (#19643, @Ferroin)
- Added netdata group assignment in systemd service configuration to improve IPC socket file security (#19638, @ilyam8)
- Removed FluentBit log file redirection from Dockerfile (#19613, @ilyam8)
- Fixed compiler exception handling configuration in build system to properly apply required flags (#19534, @Ferroin)
- Improved cross-platform emulation for static builds with more targeted platform support and better detection of existing emulation setups (#19470, @Ferroin)
- Added libunwind support to native packages and Docker images for improved stack trace handling (#19452, @Ferroin)
- Fixed libsensors integration by removing improper module detection code and aligning CMake version requirements (#19369, @Ferroin)
- Added Link Time Optimization support to CMake build system for improved performance (#17027, @Ferroin)
Documentation
All changes
- Improved user guidance in the uninstaller script with clearer input prompt text (#19855, @ilyam8)
- Fixed typos in Nodes Ephemerality documentation (#19840, @ilyam8)
- Added documentation for using custom CA certificates with Netdata (#19754, @Ferroin)
- Updated documentation to clarify Windows Agent limitations on free tier plans (#19727, @ilyam8)
- Added documentation for the journal v2 index file format (#19701, @vkalintiris)
- Updated Netdata's tagline to "X-Ray Vision for your infrastructure!" throughout documentation and packaging code (#19696, @ktsaou)
- Reorganized Docker documentation to include dbus mount instructions in the recommended installation section (#19645, @ilyam8)
- Clarified documentation to explain that Graphite exporter should be used for destinations supporting the Graphite protocol (#19635, #19637 @ilyam8)
- Removed InfluxDB (via Graphite) exporter documentation as InfluxDB 2.0 no longer supports the Graphite protocol (#19633, @ilyam8)
- Improved clarity of Nodes Ephemerality documentation (#19604, @ilyam8)
- Fixed formatting and corrected typos in the Notification Methods documentation (#19601, @Ancairon)
- Fixed documentation to provide clearer explanation of condition operator usage in alerts (#19589, #19590 @ilyam8)
- Fixed incorrect command examples in the restart section of service management documentation (#19555, @L-U-C-K-Y)
- Added new documentation about Netdata's resource impact when running on cloud virtual machines (#19499, @ktsaou)
Other Notable Changes
Improvements
- Added a comprehensive Agent status tracking system that intelligently reports restart and crash events while respecting user privacy and telemetry preferences (#19617, #19704, #19706, #19707, #19708, #19709, #19712, #19718, #19719, #19721, #19722, #19723, #19725, #19726, #19729, #19730, #19731, #19737, #19740, #19741, #19743, #19751, #19753, #19758, #19761, #19767, #19768, #19769, #19770, #19771, #19778, #19787, #19789, #19792, #19857, #19863, #19869, #19872, #19873, #19874, #19875, #19878, #19879, #19880 @ktsaou)
- Changed default behavior to preserve system scheduling policy instead of forcing batch mode (#19808, @ktsaou)
- Added support for the https_proxy environment variable (#19733, @ktsaou)
- Added streaming alerts that notify when child nodes fail to connect after parent restart or disconnect unexpectedly, along with a new chart to track child node connection states (#19586, #19610 @ktsaou)
- Added SMSEagle integration to health notifications (#19520, @marcin-smseagle)
- Added protection against extreme cardinality by automatically cleaning obsolete metric metadata from long-term storage tiers (#19486, #19594 @ktsaou)
Bug Fixes
- Fixed various crashes identified by Sentry (#19856, @ktsaou)
- Fixed potential race conditions when modifying alerts through dynamic configuration (#19854, @ktsaou)
- Fixed several minor issues including compilation of extended database statistics, proper handling of memory mapping failures, and correct cache size evaluation (#19849, @ktsaou)
- Fixed potential crash during cleanup when processing chart names (#19838, @ktsaou)
- Fixed null pointer check when destroying page cache objects (#19837, @ktsaou)
- Fixed several memory and reference counting issues in data storage engine, UUID mapping, and streaming components (#19830, @ktsaou)
- Fixed memory leak detection by properly releasing string arrays during cleanup (#19827, @ktsaou)
- Fixed memory corruption issue by properly copying host identifiers instead of using references (#19826, @ktsaou)
- Fixed memory leaks by removing redundant host indexing, correcting dictionary garbage collection, and preventing unnecessary page flushes during shutdown (#19823, @ktsaou)
- Fixed memory leaks in host system information and context alert matching, while adding immediate exit functionality for leak detection (#19819, @ktsaou)
- Fixed database file management by ensuring proper locking before calculating retention periods (#19812, @stelfrag)
- Fixed memory corruption in streaming by properly handling structures during stale replication disconnections (#19805, @ktsaou)
- Fixed crash in static builds caused by logging issues in the database engine (#19774, @ktsaou)
- Fixed insecure connection option in cloud configuration to work properly for on-premises deployments (#19736, @ktsaou)
- Fixed memory management configuration by correctly assigning malloc arena counts for parent node and IoT profiles (#19711, @ktsaou)
Other
- Fixed Sentry to properly identify and deduplicate events (#19867, #19868, @ktsaou)
- Improved shutdown process by properly releasing memory resources and added validation for database extent indexes (#19861, @stelfrag)
- Fixed memory management in context labels to prevent freeing in-use label pointers during queries (#19853, @ktsaou)
- Fixed compatibility with Protobuf 30.0 libraries (#19835, @vkalintiris)
- Changed agent status file location from cache directory to persistent storage in /var/lib/netdata (#19831, @ktsaou)
- Improved memory cleanup during exit by fully releasing resources including hosts, database caches, metrics registry, and other components (#19821, @ktsaou)
- Resolved multiple memory leak issues and improved sanitizer compatibility (#19811, @ktsaou)
- Removed zero timeout in libuv timers to mitigate potential high CPU usage (#19810, @stelfrag)
- Fixed preprocessor conditional directives for filesystem sanitization (#19809, @ktsaou)
- Fixed build configuration to support leak detection and debugging tools by improving memory management code and conditionally disabling features that interfere with diagnostics (#19806, @ktsaou)
- Improved crash handling with async-signal-safe stack traces and enhanced platform support for macOS (#19802, @ktsaou)
- Improved agent shutdown process by prioritizing the flushing of dirty memory pages (#19775, @stelfrag)
- Fixed invalid memory deallocation for system information fields (#19763, @ktsaou)
- Fixed parsing of system-info.sh output to properly handle empty field values (#19745, @ktsaou)
- Improved claiming error messages to provide more detailed failure information (#19735, @ktsaou)
- Added configurable unmount timing for journal v2 while disabling it for parent nodes (#19724, @ktsaou)
- Added unified out-of-memory error handling (#19717, @ktsaou)
- Fixed fatal error handling to avoid calling cleanup functions during startup (#19715, @ktsaou)
- Fixed database engine to avoid memory-mapped files when system limits are too low (#19714, @ktsaou)
- Fixed filesystem access on MSYS2 platforms by using the stat function (#19703, @ktsaou)
- Fixed health configuration loading to occur before localhost initialization (#19689, @ktsaou)
- Changed anonymous access message in dyncfg to provide clearer instructions for accessing configuration details (#19684, @ilyam8)
- Fixed FreeBSD compilation (#19677, @stelfrag)
- Removed unnecessary lock directory previously used by Python and Go plugins (#19668, #19669 @ilyam8)
- Optimized ARAL performance for repeated single-item allocation and deallocation (#19660, @ktsaou)
- Fixed parent nodes to avoid re-registering child nodes that were previously removed (#19609, @stelfrag)
- Fixed netdatacli to remove both permanent and ephemeral stale nodes (#19602, @ktsaou)
- Cleaned up database code related to extent writing and reduced WAL structure size (#19596, @stelfrag)
- Fixed resource management when writing database file extents to properly release allocated memory (#19593, @stelfrag)
- Included agent version in ACLK handshake challenge response (#19583, @stelfrag)
- Updated protobuf source file generation to use build directory (#19576, @vkalintiris)
- Optimized ACLK node registration and messaging workflow (#19566, @stelfrag)
- Renamed appconfig to inicfg and removed config-related function-like macros (#19552, @vkalintiris)
- Fixed cloud connection timing issues during agent claiming (#19547, @stelfrag)
- Updated virtual host hop count to 1 (#19546, @ktsaou)
- Updated header inclusion to use database/rrd.h instead of daemon/common.h (#19540, @vkalintiris)
- Improved database query performance by inlining critical code paths (#19537, @ktsaou)
- Fixed Coverity issues (#19535, @stelfrag)
- Improved database performance by consolidating datafile and journal writing into a single worker thread (#19525, @stelfrag)
- Removed legacy dashboard UI code from repository (#19523, #19531, #19545 @ilyam8)
- Added fatal error formatting for PGDs (#19521, @vkalintiris)
- Optimized ACLK query processing with memory management improvements (#19518, @stelfrag)
- Improved point allocation checks in pgd_append_point() to detect page overflow (#19515, @vkalintiris)
- Bundled cmake cache (#19509, @vkalintiris)
- Added dedicated worker for alert queue processing and resource monitoring (#19498, @stelfrag)
- Updated max data file size to 4GiB with preference for 100 files (#19495, @ktsaou)
- Added agent name and version to "Netdata-streaming" function (#19485, @ktsaou)
- Enhanced metadata management with improved cleanup, monitoring, and saving processes (#19479, @stelfrag)
Deprecation notice
Changed in this release
All previously announced deprecations have been implemented in this release, except for the v1/v2 APIs and v0/v1 Dashboard versions, which remain available for now and will be removed in a future release.
Important Changes in Next Major Release
Deprecated Components
Component Type | Versions Being Deprecated |
---|---|
APIs | v1, v2 |
What This Means
Only the v3 API and v3 Dashboard will be supported starting with the next major release. These newer versions offer improved performance, enhanced features, and better security.
Important Changes in Next Minor Release
No changes are expected.
Support options
As we grow, we stay committed to providing the best support ever seen from an open-source solution. Should you encounter an issue with any of the changes made in this release or any feature in the Netdata Agent, feel free to contact us through one of the following channels:
- Premium Support: Customers who wish to have a direct channel with Netdata and prioritized support with defined SLAs can contact us.
- Netdata Learn: Find documentation, guides, and reference material for monitoring and troubleshooting your systems with Netdata.
- GitHub Issues: Use the Netdata repository to report bugs or open a new feature request.
- GitHub Discussions: Join the conversation around the Netdata development process and be a part of it.
- Community Forums: Visit the Community Forums and contribute to the collaborative knowledge base.
- Discord Server: Jump into the Netdata Discord and hang out with like-minded sysadmins, DevOps, SREs, and other troubleshooters. More than 2000 engineers are already using it!