The full set of commits compared to the last release (v2021.08.13) is listed here.
What's new
- We added an Integration for monitoring an external CockroachDB instance (#1321).
- We started instrumenting the Opstrace controller software with Prometheus metrics (#1322) and added a corresponding dashboard (#1356).
Component version bumps
- Loki received a version bump from b3d7740 to a4b8974.
- Cortex was updated from e658571 to 74055d8.
- Grafana was updated from v8.1.1 to v8.1.3.
Fixed and improved
Core:
- We fixed a regression introduced in the last release that made the Loki WebSocket endpoint for live-tailing logs unavailable (#1329). We also added a corresponding regression test.
- We fixed a Loki query response latency regression on AWS by setting the EC2 instance option HttpPutResponseHopLimit from 1 (default) to 2. This is now done as part of the initial create, but also as part of upgrade. Note that the change applied during an upgrade does not persist when an EC2 instance is lost unexpectedly. At the heart of the issue was the aws-go-sdk taking around one minute to obtain security credentials from the EC2 instance metadata service. It spent most of that time hopelessly retrying against a firewalling technique introduced in version 2 of the EC2 Instance Metadata Service. The full story is exciting and can be read in #1382. The debugging effort was motivated by an unstable test.
- The custom Auth0 integration is now (hopefully) functional, via introduction of a new custom_auth0_domain install-time parameter (#1380, #1175).
- A generic Kubernetes StatefulSet readiness check was improved to address an upgrade issue (#1296, #1294).
- We tweaked the gRPC config used for Loki and Cortex components to reduce the likelihood of an "ENHANCE_YOUR_CALM, debug data: too_many_pings" error (#1362).
- A number of system-internal alerts were tweaked (#1311, #1366, #1374, and others).
- We started changing the approach for issuing per-tenant TLS certificates to allow for easier state changes after the initial creation (#1371, #923).
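As an illustration of the HttpPutResponseHopLimit change described in the list above, here is a minimal sketch of how such an adjustment could be made on a running instance through boto3's `modify_instance_metadata_options` call. This is not the Opstrace implementation; the helper names `build_hop_limit_request` and `apply_hop_limit` and the instance ID in the usage note are hypothetical.

```python
"""Sketch: raise the IMDS PUT response hop limit of an EC2 instance."""


def build_hop_limit_request(instance_id: str, hop_limit: int = 2) -> dict:
    """Build keyword arguments for EC2's ModifyInstanceMetadataOptions."""
    # The EC2 API accepts hop limits from 1 to 64.
    if not 1 <= hop_limit <= 64:
        raise ValueError(f"hop limit out of range: {hop_limit}")
    return {
        "InstanceId": instance_id,
        "HttpPutResponseHopLimit": hop_limit,
    }


def apply_hop_limit(instance_id: str, hop_limit: int = 2) -> None:
    """Apply the setting to a running instance.

    Requires boto3 and AWS credentials, so it is not exercised here.
    """
    import boto3  # imported lazily so the builder works without boto3

    ec2 = boto3.client("ec2")
    ec2.modify_instance_metadata_options(
        **build_hop_limit_request(instance_id, hop_limit)
    )
```

Calling `apply_hop_limit("i-0123456789abcdef0")` would modify only that one running instance. That matches the caveat above: a change applied this way is tied to the instance and is gone when the instance is replaced.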
CLI:
opstrace create
- GCP: the set of service connections is now logged before and after creation for enhanced debuggability (#1287).
opstrace destroy
UI:
- Error handling improvements landed for:
  - login: better auto-healing of short transient issues around the flow-concluding HTTP POST request via the introduction of purpose-optimized retrying parameters (#1280).
  - authentication state inspection: the interaction with /_/auth/status is now more robust via a change in tooling and retrying parameters (#1282).
  - installing and uninstalling an Integration (#1295, #1209).
  - managing users and tenants (#1325, #1333, #1364, #1365, #1375, and others).
  - the YAML document download feature (#1324).
  - dark mode configuration (#1384).
- The Cortex ingest URL on the Getting Started page was fixed (#1397).
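The login and authentication state items above both rely on tuned retrying parameters. The general idea can be sketched as a bounded retry loop with a fixed pause between attempts. This is an illustration only: the actual UI code is not written in Python, and the function name `retry`, its parameters, and the default values are assumptions, not the values chosen in #1280 or #1282.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    operation: Callable[[], T],
    attempts: int = 3,
    delay_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds or the attempts are exhausted."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            # On the final attempt, re-raise so the caller sees the real error.
            if attempt == attempts:
                raise
            # Otherwise pause briefly to ride out a short transient issue.
            sleep(delay_seconds)
    raise AssertionError("unreachable")
```

Keeping the attempt count small and the delay short suits an interactive flow such as a login, where a quick failure with a clear error is preferable to a long silent wait.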
Developer experience and QA
This section does not aim for completeness. Yet, we'd like to point out some significant changes around developer experience and testing.
- We observed, debugged, and fixed a bunch of non-trivial CI instabilities, and also addressed build job duration regressions (#1286, #1285).
- test-remote: headless browser interaction now collects browser console contents for enhanced debuggability (#1282).
- We moved to using Golang 1.17 (#1305). We also transitioned to TypeScript 4.4.x and bumped a number of dev tools (#1303).
- Looker, our Loki / Cortex testing and benchmarking tool, saw significant development (#1310, #1315, #1353, #1361, #1363, #1370, #1377, #1389).