Opstrace v2021.09.17

The full set of commits compared to the last release (v2021.08.13) is listed here.

What's new

  • We added an Integration for monitoring an external CockroachDB instance (#1321).
  • We started instrumenting the Opstrace controller software with Prometheus metrics (#1322) and added a corresponding dashboard (#1356).

Component version bumps

Fixed and improved

Core:

  • We fixed a regression introduced in the last release that made the Loki WebSocket endpoint for live-tailing logs unavailable (#1329). We also added a corresponding regression test.
  • We fixed a Loki query response latency regression on AWS by changing the EC2 instance option HttpPutResponseHopLimit from 1 (default) to 2. This is now done as part of the initial create and also as part of upgrade. Note that the change applied during an upgrade does not persist when an EC2 instance is lost unexpectedly. At the heart of the issue was the AWS SDK for Go (aws-sdk-go) taking around one minute to obtain security credentials from the EC2 instance metadata service. It spent most of that time hopelessly retrying against a firewalling technique introduced in version 2 of the EC2 Instance Metadata Service. The full story is exciting and can be read in #1382. The debugging effort was motivated by an unstable test.
  • The custom Auth0 integration is now (hopefully) functional, via the introduction of a new custom_auth0_domain install-time parameter (#1380, #1175).
  • A generic Kubernetes StatefulSet readiness check was improved to address an upgrade issue (#1296, #1294).
  • We tweaked the gRPC config used for Loki and Cortex components to reduce the likelihood of an ENHANCE_YOUR_CALM, debug data: too_many_pings error (#1362).
  • A number of system-internal alerts were tweaked (#1311, #1366, #1374, and others).
  • We started changing the approach for issuing per-tenant TLS certificates to allow for easier state changes after the initial creation (#1371, #923).
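
To illustrate the HttpPutResponseHopLimit change mentioned above, here is a minimal sketch of the parameters one might pass to the EC2 ModifyInstanceMetadataOptions API. The field names follow the EC2 API, but the helper buildMetadataOptionsRequest and the instance ID are hypothetical. This is not the code Opstrace uses, and the actual SDK call is left out to keep the sketch self-contained.

```typescript
// Subset of the parameters accepted by EC2 ModifyInstanceMetadataOptions.
interface MetadataOptionsRequest {
  InstanceId: string;
  HttpPutResponseHopLimit: number;
}

// Hypothetical helper (not part of Opstrace or any AWS SDK).
// With a hop limit of 1, the metadata service's response to the IMDSv2
// token request (an HTTP PUT) cannot travel past one network hop, which
// typically prevents a process running inside a container from receiving it.
function buildMetadataOptionsRequest(
  instanceId: string,
  hopLimit: number = 2
): MetadataOptionsRequest {
  // The EC2 API accepts integer hop limits from 1 to 64.
  if (!Number.isInteger(hopLimit) || hopLimit < 1 || hopLimit > 64) {
    throw new RangeError("hop limit out of range: " + hopLimit);
  }
  return { InstanceId: instanceId, HttpPutResponseHopLimit: hopLimit };
}

// Example with a made-up instance ID.
const exampleRequest = buildMetadataOptionsRequest("i-0123456789abcdef0");
console.log(JSON.stringify(exampleRequest));
```

The resulting object could then be handed to whichever AWS SDK client is in use. As noted above, a setting applied to an existing instance does not carry over to a replacement instance.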

CLI:

  • opstrace create
    • GCP: the set of service connections is now logged before and after creation for enhanced debuggability (#1287).
  • opstrace destroy
    • GCP: the region to destroy in can now be specified via --region (#1291). Note that multi-region support for GCP is not yet properly tested.
    • GCP: global address teardown has been consolidated (#1320, #976).

UI:

  • Error handling improvements landed for:
    • login: better auto-healing of short transient issues around the flow-concluding HTTP POST request via the introduction of purpose-optimized retrying parameters (#1280).
    • authentication state inspection: the interaction with /_/auth/status is now more robust via a change in tooling and retrying parameters (#1282).
    • installing and uninstalling an Integration (#1295, #1209).
    • managing users and tenants (#1325, #1333, #1364, #1365, #1375, and others).
    • the YAML document download feature (#1324).
    • dark mode configuration (#1384).
  • The Cortex ingest URL on the Getting Started page was fixed (#1397).
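
The login and authentication state inspection items above mention retrying parameters. As a rough illustration of what such parameters can look like, here is a minimal retry-with-backoff sketch. All names (RetryParams, backoffDelayMs, withRetries) and the example values are assumptions for illustration; the notes do not state which parameters or tooling the Opstrace UI uses.

```typescript
// Hypothetical retry parameters; not the values used by Opstrace.
interface RetryParams {
  maxAttempts: number; // total attempts, including the first one
  baseDelayMs: number; // wait time before the first retry
  maxDelayMs: number;  // upper bound for any single wait
}

// Wait time before retry number `retry` (1-based): doubles each time, capped.
function backoffDelayMs(retry: number, p: RetryParams): number {
  return Math.min(p.baseDelayMs * Math.pow(2, retry - 1), p.maxDelayMs);
}

// Runs `fn` until it succeeds or the attempt budget is used up,
// then rethrows the last error.
async function withRetries<T>(fn: () => Promise<T>, p: RetryParams): Promise<T> {
  let lastError: unknown = new Error("no attempts made");
  for (let attempt = 1; attempt <= p.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < p.maxAttempts) {
        const delay = backoffDelayMs(attempt, p);
        await new Promise<void>((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}
```

A short-lived request such as a status check might use a few quick attempts with a small maxDelayMs, while a flow-concluding request could tolerate a longer budget; picking these per use case is presumably what "purpose-optimized" refers to.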

Developer experience and QA

This section does not aim for completeness. Yet, we'd like to point out some significant changes around developer experience and testing.

  • We observed, debugged, and fixed a bunch of non-trivial CI instabilities, and also addressed build job duration regressions (#1286, #1285).
  • test-remote: headless browser interaction now collects browser console contents for enhanced debuggability (#1282).
  • We moved to using Golang 1.17 (#1305). We also transitioned to TypeScript 4.4.x and bumped a number of dev tools (#1303).
  • Looker, our Loki / Cortex testing and benchmarking tool, saw significant development (#1310, #1315, #1353, #1361, #1363, #1370, #1377, #1389).
