github robintra/perf-sentinel v0.4.7

one month ago

What's new in v0.4.7

Operator-experience release focused on Tempo ingestion. Search-then-fetch runs that previously took two and a half minutes sequentially now complete in ten to twenty seconds thanks to bounded-concurrency parallel fetching. The single Tempo HTTP timeout was split into SEARCH_TIMEOUT (5 s) and FETCH_TRACE_TIMEOUT (30 s), so long lookback windows no longer silently drop traces. A degraded Tempo now surfaces as a single classified summary line instead of a wall of ERROR lines. Ctrl-C preserves partial results, with a new TempoError::Interrupted variant for the case where nothing was collected. HTTP status errors such as 404s carry the failing URL, which makes the tempo-query-frontend vs tempo-querier microservices gotcha diagnosable at a glance. Also included: an auto-rerun workflow that absorbs transient GitHub Actions infrastructure hiccups on CI and Release, and a log-level fix for the Unix socket bind path in minimal (FROM scratch) images. Design, integration, limitations and runbook docs were updated in lockstep.
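
The headline speedup can be sanity-checked with simple wave arithmetic. The sketch below is illustrative only: the function name estimated_fetch_time is hypothetical, and the inputs (about 1.5 s per call, 100 traces, a cap of 16 in-flight requests) are the figures quoted in the Changed notes. It ignores the search call and latency variance, so it gives an idealized lower bound rather than a measurement.

```rust
use std::time::Duration;

/// Idealized wall-clock time to fetch `traces` traces when every call takes
/// `per_call` and at most `cap` calls are in flight at once.
/// Hypothetical helper for illustration; not part of perf-sentinel.
fn estimated_fetch_time(traces: u32, per_call: Duration, cap: u32) -> Duration {
    let cap = cap.max(1);
    // Number of full "waves" of concurrent calls (ceiling division).
    let waves = (traces + cap - 1) / cap;
    per_call * waves
}

fn main() {
    let per_call = Duration::from_millis(1500);
    // cap = 1 models the previous sequential loop.
    let sequential = estimated_fetch_time(100, per_call, 1);
    // cap = 16 models the bounded-concurrency fetch.
    let parallel = estimated_fetch_time(100, per_call, 16);
    println!("sequential: {sequential:?}, cap 16: {parallel:?}");
}
```

Under these assumptions the sequential case comes to 150 s (2 m 30 s) and the capped case to 7 waves of 1.5 s, or 10.5 s, which is consistent with the quoted 10-20 s range.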

Added

  • New auto-rerun workflow (.github/workflows/auto-rerun.yml) that reruns failed jobs of the CI and Release workflows exactly once, to absorb transient GitHub Actions infrastructure hiccups (action tarball 5xx, runner DNS flap, apt mirror blip, ephemeral container registry timeout). Capped at one retry via github.event.workflow_run.run_attempt < 2, so a second consecutive failure stays red and requires human triage. Scoped to CI and Release only; Security Audit and code-scanning are fast enough that a manual rerun is simpler than automating one. Permissions are kept minimal: the workflow-level floor is contents: read, and only the rerun job gets actions: write (required by gh run rerun). Logs a GitHub Actions ::notice:: annotation on every trigger so reruns are visible in the Actions UI.

Changed

  • ingest_from_tempo parallelizes fetch_trace calls via tokio::task::JoinSet with a concurrency cap of 16 in-flight requests (internal semaphore, not user-configurable). The previous sequential loop paid the full Tempo round-trip latency per trace: at ~1.5 s per call over a WAN link, a 100-trace search-then-fetch took ~2 m 30 s end-to-end. Parallelism collapses that to 10-20 s for the same workload. Mirrors the pattern already used by score::cloud_energy::scraper for per-service Prometheus CPU queries. crates/sentinel-core/src/ingest/tempo.rs.
  • Tempo fetch loop aggregates per-trace failures into a single categorized summary instead of emitting one ERROR line per failed trace. Per-trace failures now log at debug (trace_id + error still captured for operators who enable RUST_LOG=sentinel_core::ingest::tempo=debug); the loop finishes with one summary line whose severity matches the worst class seen: warn if only TraceNotFound skips occurred, error otherwise. Counts are bucketed by error kind (timeout, transport, http_status, protobuf_decode, body_read, json_parse, task_panic) so downstream tooling (Loki, CloudWatch) can alert on the right signal. Previously a degraded Tempo with 50 timeouts out of 100 produced a 50-line wall of ERROR in under a second, indistinguishable from a genuine incident.
  • Tempo fetch loop handles Ctrl-C cleanly via tokio::signal::ctrl_c() polled alongside JoinSet::join_next in a tokio::select!. On interrupt, set.abort_all() flags every in-flight task; the drain loop then completes rapidly, because each aborted task resolves to a JoinError for which is_cancelled() returns true and is silently skipped. Traces that had already completed are preserved, and an explicit warn line surfaces the partial-result state. A new TempoError::Interrupted variant distinguishes an interrupted run that collected zero events from the generic NoTracesFound, so CI quality-gate paths can treat an operator abort and an empty result at different severities.
  • Tempo HTTP timeout split between search and single-trace fetch. The single 5 s REQUEST_TIMEOUT that covered every Tempo API call is now split into SEARCH_TIMEOUT = 5 s for /api/search (tight timeout fails fast on a broken endpoint) and FETCH_TRACE_TIMEOUT = 30 s for /api/traces/{id} (trace bodies can legitimately be many MiB on a wide fan-out request and the query-frontend has to gather spans from ingesters + long-term storage). On a production-scale run with 24 h lookback, the old 5 s cap was dropping tens of traces per 100-trace batch with request timed out; the new split keeps the fail-fast behavior on search while giving slow-but-legitimate trace bodies room to complete. Thirty seconds matches the Grafana Tempo datasource default.
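
How the fetch loop finishes, as described in the summary and Ctrl-C bullets above, can be sketched roughly as follows. This is a minimal std-only illustration, not the code in crates/sentinel-core/src/ingest/tempo.rs: the FetchFailure and Severity types, the summarize and finish functions, and the "trace_not_found" bucket name are all assumptions made for the example, and TempoError is reduced to the two variants under discussion. Returning Ok for an interrupted run that still collected events is inferred from the notes.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the per-trace failure kinds named in the notes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FetchFailure {
    TraceNotFound,
    Timeout,
    Transport,
    HttpStatus,
    ProtobufDecode,
    BodyRead,
    JsonParse,
    TaskPanic,
}

#[derive(Debug, PartialEq, Eq)]
enum Severity {
    Warn,
    Error,
}

/// Simplified two-variant error used only for this sketch.
#[derive(Debug, PartialEq, Eq)]
enum TempoError {
    NoTracesFound,
    Interrupted,
}

/// Bucket key for the summary line. "trace_not_found" is an assumed name;
/// the other keys are the ones quoted in the release notes.
fn bucket(failure: FetchFailure) -> &'static str {
    match failure {
        FetchFailure::TraceNotFound => "trace_not_found",
        FetchFailure::Timeout => "timeout",
        FetchFailure::Transport => "transport",
        FetchFailure::HttpStatus => "http_status",
        FetchFailure::ProtobufDecode => "protobuf_decode",
        FetchFailure::BodyRead => "body_read",
        FetchFailure::JsonParse => "json_parse",
        FetchFailure::TaskPanic => "task_panic",
    }
}

/// One summary per run: counts per bucket plus the worst severity seen.
/// Returns None when nothing failed.
fn summarize(failures: &[FetchFailure]) -> Option<(Severity, BTreeMap<&'static str, usize>)> {
    if failures.is_empty() {
        return None;
    }
    let mut counts = BTreeMap::new();
    for &failure in failures {
        *counts.entry(bucket(failure)).or_insert(0usize) += 1;
    }
    // Warn only if every failure was a TraceNotFound skip; error otherwise.
    let only_skips = failures.iter().all(|f| *f == FetchFailure::TraceNotFound);
    let severity = if only_skips { Severity::Warn } else { Severity::Error };
    Some((severity, counts))
}

/// Final result of the drain loop: collected events are always kept, and an
/// empty result is reported differently depending on whether Ctrl-C fired.
fn finish(events: Vec<String>, interrupted: bool) -> Result<Vec<String>, TempoError> {
    match (events.is_empty(), interrupted) {
        (false, _) => Ok(events),
        (true, true) => Err(TempoError::Interrupted),
        (true, false) => Err(TempoError::NoTracesFound),
    }
}

fn main() {
    // 50 timeouts collapse into a single summary instead of 50 ERROR lines.
    let failures = vec![FetchFailure::Timeout; 50];
    if let Some((severity, counts)) = summarize(&failures) {
        println!("{severity:?}: {counts:?}");
    }
    println!("{:?}", finish(Vec::new(), true));
}
```

Keeping the bucket keys as fixed strings is what lets log tooling alert on a specific failure class, and the separate Interrupted outcome is what allows a CI gate to treat an operator abort differently from an empty search.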

Fixed

  • run_json_socket demotes "parent directory missing" to info level, instead of logging it at error level. In minimal container images (FROM scratch, distroless static) the parent directory of the Unix NDJSON socket may not exist, and the Unix socket is not the canonical ingestion route in those deployments anyway (OTLP gRPC/HTTP is). The daemon already handled this gracefully (return, no panic) but the scary ERROR Failed to bind Unix socket /tmp/perf-sentinel.sock: No such file or directory line in container logs was misleading operators. The specific ErrorKind::NotFound path is now an actionable info log explaining how to enable the feature if needed; all other bind failures (permission denied, stale socket, address-in-use) keep surfacing as errors. crates/sentinel-core/src/daemon/json_socket.rs.
  • TempoError::HttpStatus now includes the failing URL (redacted via http_client::redact_endpoint to strip any embedded credentials). Previously a 404 on /api/search surfaced as the opaque Tempo returned HTTP 404, leaving the operator to guess which endpoint had been queried. The message is now Tempo returned HTTP 404 for https://.../api/search?tags=..., which immediately reveals common misconfigurations: a URL pointing at Grafana instead of Tempo, a missing reverse-proxy path prefix or, on a Tempo microservices deployment, the daemon pointing at tempo-querier when only tempo-query-frontend exposes the HTTP query API. Variant signature changed from tuple to struct form (HttpStatus { status: u16, url: String }) for readability at the call site.
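
The struct-form variant and its message can be sketched as below. Only the HttpStatus { status: u16, url: String } shape and the message wording come from the notes. The redact_userinfo function is a simplified, hypothetical stand-in for http_client::redact_endpoint (whose actual behavior is not shown here), and the tempo.example.com URL is a placeholder.

```rust
use std::fmt;

/// Reduced to the single variant discussed above.
#[derive(Debug)]
enum TempoError {
    HttpStatus { status: u16, url: String },
}

impl fmt::Display for TempoError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            TempoError::HttpStatus { status, url } => {
                write!(f, "Tempo returned HTTP {status} for {url}")
            }
        }
    }
}

/// Hypothetical, simplified redaction: drops any `user:password@` userinfo
/// that sits between the scheme and the host. Not the project's function.
fn redact_userinfo(url: &str) -> String {
    let Some(scheme_end) = url.find("://") else {
        return url.to_string();
    };
    let rest = &url[scheme_end + 3..];
    // The authority component ends at the first '/', '?' or '#'.
    let authority_end = rest
        .find(|c: char| matches!(c, '/' | '?' | '#'))
        .unwrap_or(rest.len());
    match rest[..authority_end].rfind('@') {
        Some(at) => format!("{}{}", &url[..scheme_end + 3], &rest[at + 1..]),
        None => url.to_string(),
    }
}

fn main() {
    let err = TempoError::HttpStatus {
        status: 404,
        url: redact_userinfo("https://user:secret@tempo.example.com/api/search?tags=x"),
    };
    // The message now names the endpoint that was actually queried.
    println!("{err}");
}
```

Named fields make the call site self-describing, which is the readability benefit cited for the move from tuple to struct form.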

Tests

  • New unit test classify_fetch_error_buckets_every_hard_failure_variant in crates/sentinel-core/src/ingest/tempo.rs asserts every hard-failure variant of TempoError (Timeout, Transport, HttpStatus, ProtobufDecode, BodyRead, JsonParse) maps to its dedicated bucket key, and that upstream variants that should never reach the classifier (InvalidEndpoint, NoTracesFound) fall through to the intentional "other" catch-all. Drift guard against a future variant silently ending up in "other" without visibility.
  • New integration test ingest_from_tempo_drains_mixed_per_trace_outcomes exercises the drain loop end-to-end with three concurrent per-trace outcomes (HTTP 500, HTTP 404, 200-with-empty-protobuf). Asserts that the run completes without panic and aggregates the failures through the classification path. Complements the happy-path test ingest_from_tempo_search_then_fetch_aggregates_events which covered a single-trace flow.

Docs

  • docs/INTEGRATION.md + FR counterpart: new subsection "Tempo in microservices mode (tempo-distributed)" documenting the tempo-query-frontend vs tempo-querier distinction that causes /api/search to 404 when --endpoint points at the wrong component.
  • docs/LIMITATIONS.md + FR counterpart: Tempo ingestion limitations rewritten to reflect parallel fetch (cap 16, not user-configurable), split timeouts (5 s search / 30 s fetch) and Ctrl-C partial-result behavior. The obsolete "sequential fetching" bullet was factually wrong after the parallelization and has been removed.
  • docs/design/06-INGESTION-AND-DAEMON.md + FR counterpart: new ## Tempo ingestion design-level section covering the parallel-fetch rationale, the timeout split and the select-based Ctrl-C handling. Fills a pre-existing gap: Tempo had no design-doc coverage despite being a supported ingest source since v0.3.1.
  • docs/RUNBOOK.md + FR counterpart: two new troubleshooting entries. "Daemon running but not reachable from clients" covers the recurring bind-address-in-container / --network host / port-mapping / firewall triage path. "perf-sentinel tempo returns 404 or times out" covers the Tempo endpoint misconfiguration and the capacity-driven timeout scenario.

Install

Prebuilt binaries (Linux amd64 / arm64, macOS arm64, Windows amd64):

curl -LO https://github.com/robintra/perf-sentinel/releases/download/v0.4.7/perf-sentinel-linux-amd64
chmod +x perf-sentinel-linux-amd64
sudo mv perf-sentinel-linux-amd64 /usr/local/bin/perf-sentinel

Linux binaries are statically linked against musl, so they run on any distribution (Alpine, Debian, RHEL, Ubuntu, any version) regardless of the installed glibc version. They also run inside FROM scratch images.

From crates.io:

cargo install perf-sentinel

Docker:

docker run --rm -p 4317:4317 -p 4318:4318 \
  ghcr.io/robintra/perf-sentinel:0.4.7 watch --listen-address 0.0.0.0

Also available on Docker Hub: robintrassard/perf-sentinel:0.4.7.

Verify the binary against SHA256SUMS.txt:

curl -LO https://github.com/robintra/perf-sentinel/releases/download/v0.4.7/SHA256SUMS.txt
sha256sum -c SHA256SUMS.txt --ignore-missing

Full diff: v0.4.6...v0.4.7
