What's new in v0.4.7
Operator-experience release focused on Tempo ingestion. A 100-trace search-then-fetch run that previously took about two and a half minutes sequentially now completes in ten to twenty seconds thanks to bounded-concurrency parallel fetching. The single Tempo HTTP timeout was split into SEARCH_TIMEOUT (5 s) and FETCH_TRACE_TIMEOUT (30 s), so long lookback windows no longer silently drop traces. A degraded Tempo now surfaces as a single classified summary line instead of a wall of ERROR. Ctrl-C now preserves partial results, with a new TempoError::Interrupted variant for the case where nothing was collected. 404s carry the failing URL, which makes the tempo-query-frontend vs tempo-querier microservices gotcha diagnosable at a glance. Plus: an auto-rerun workflow that absorbs transient GitHub Actions infrastructure hiccups on CI and Release, and a log-level fix for the Unix socket bind path in FROM scratch images. Design, integration, limitations and runbook docs were updated in lockstep.
Added
- New auto-rerun workflow (.github/workflows/auto-rerun.yml) that automatically reruns failed jobs of the CI and Release workflows exactly once, to absorb transient GitHub Actions infrastructure hiccups (action tarball 5xx, runner DNS flap, apt mirror blip, ephemeral container registry timeout). Capped at one retry via github.event.workflow_run.run_attempt < 2, so a second consecutive failure still stays red and requires human triage. Scoped to CI and Release only; Security Audit and code-scanning are fast enough that manual rerun is easier than automated. Permissions kept minimal: workflow-level floor is contents: read, the rerun job alone gets actions: write (required by gh run rerun). Logs a GitHub Actions ::notice:: annotation on every trigger so reruns are visible in the Actions UI.
Changed
- ingest_from_tempo parallelizes fetch_trace calls via tokio::task::JoinSet with a concurrency cap of 16 in-flight requests (internal semaphore, not user-configurable). The previous sequential loop paid the full Tempo round-trip latency per trace: at ~1.5 s per call over a WAN link, a 100-trace search-then-fetch took ~2 m 30 s end-to-end. Parallelism collapses that to 10-20 s for the same workload. Mirrors the pattern already used by score::cloud_energy::scraper for per-service Prometheus CPU queries. crates/sentinel-core/src/ingest/tempo.rs.
- Tempo fetch loop aggregates per-trace failures into a single categorized summary instead of emitting one ERROR line per failed trace. Per-trace failures now log at debug (trace_id + error still captured for operators who enable RUST_LOG=sentinel_core::ingest::tempo=debug); the loop finishes with one summary line whose severity matches the worst class seen: warn if only TraceNotFound skips occurred, error otherwise. Counts are bucketed by error kind (timeout, transport, http_status, protobuf_decode, body_read, json_parse, task_panic) so downstream tooling (Loki, CloudWatch) can alert on the right signal. Previously a degraded Tempo with 50 timeouts out of 100 produced a 50-line wall of ERROR in under a second, indistinguishable from a genuine incident.
- Tempo fetch loop handles Ctrl-C cleanly via tokio::signal::ctrl_c() polled alongside JoinSet::join_next in a tokio::select!. On interrupt, set.abort_all() flags every in-flight task; the drain loop then completes rapidly (aborted tasks resolve to a JoinError for which is_cancelled() is true and are silently skipped). Traces that had already completed are preserved. An explicit warn line surfaces the partial-result state. A new TempoError::Interrupted variant disambiguates the zero-event-collected case from the generic NoTracesFound, so CI quality-gate paths can treat operator abort and empty result at different severities.
- Tempo HTTP timeout split between search and single-trace fetch. The single 5 s REQUEST_TIMEOUT that covered every Tempo API call is now split into SEARCH_TIMEOUT = 5 s for /api/search (tight timeout fails fast on a broken endpoint) and FETCH_TRACE_TIMEOUT = 30 s for /api/traces/{id} (trace bodies can legitimately be many MiB on a wide fan-out request and the query-frontend has to gather spans from ingesters + long-term storage). On a production-scale run with 24 h lookback, the old 5 s cap was dropping tens of traces per 100-trace batch with "request timed out"; the new split keeps the fail-fast behavior on search while giving slow-but-legitimate trace bodies room to complete. Thirty seconds matches the Grafana Tempo datasource default.
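To illustrate the bounded-concurrency idea behind the parallel fetch, here is a minimal sketch. It is not the project's implementation: the release uses tokio::task::JoinSet with an internal semaphore, whereas this sketch swaps in standard-library scoped threads so that it compiles without external crates. The function name fetch_bounded, the u32 trace ids and the String payloads are all hypothetical.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Hypothetical helper: runs `fetch` over `ids` with at most `cap` calls in flight.
/// Returns every per-id outcome (in completion order) plus the peak number of
/// concurrent calls that was observed, so the cap can be checked.
fn fetch_bounded<F>(
    ids: Vec<u32>,
    cap: usize,
    fetch: F,
) -> (Vec<(u32, Result<String, String>)>, usize)
where
    F: Fn(u32) -> Result<String, String> + Sync,
{
    let queue = Mutex::new(VecDeque::from(ids));
    let results = Mutex::new(Vec::new());
    let in_flight = AtomicUsize::new(0);
    let peak = AtomicUsize::new(0);

    thread::scope(|s| {
        // `cap` workers pull from a shared queue, so no more than `cap`
        // fetches can ever run at the same time.
        for _ in 0..cap.max(1) {
            s.spawn(|| loop {
                // Take the next id in its own statement so the queue lock is
                // released before the (slow) fetch starts.
                let next = queue.lock().unwrap().pop_front();
                let Some(id) = next else { break };

                let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                let outcome = fetch(id);
                in_flight.fetch_sub(1, Ordering::SeqCst);

                // A failed fetch is recorded, not propagated: one bad trace
                // must not abort the rest of the batch.
                results.lock().unwrap().push((id, outcome));
            });
        }
    });

    (results.into_inner().unwrap(), peak.into_inner())
}
```

Calling fetch_bounded((0..100).collect(), 16, fetch) with a closure that takes roughly 1.5 s per call would need about seven waves of 16, on the order of 10 s, which is consistent with the 10-20 s figure quoted above. This back-of-envelope estimate assumes uniform latency.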
Fixed
- run_json_socket demotes "parent directory missing" to info level, instead of logging it at error level. In minimal container images (FROM scratch, distroless static) the parent directory of the Unix NDJSON socket may not exist, and the Unix socket is not the canonical ingestion route in those deployments anyway (OTLP gRPC/HTTP is). The daemon already handled this gracefully (return, no panic) but the scary "ERROR Failed to bind Unix socket /tmp/perf-sentinel.sock: No such file or directory" line in container logs was misleading operators. The specific ErrorKind::NotFound path is now an actionable info log explaining how to enable the feature if needed; all other bind failures (permission denied, stale socket, address-in-use) keep surfacing as errors. crates/sentinel-core/src/daemon/json_socket.rs.
- TempoError::HttpStatus now includes the failing URL (redacted via http_client::redact_endpoint to strip any embedded credentials). Previously a 404 on /api/search surfaced as the opaque "Tempo returned HTTP 404", leaving the operator to guess which endpoint had been queried. The message is now "Tempo returned HTTP 404 for https://.../api/search?tags=...", which immediately reveals common misconfigurations: a URL pointing at Grafana instead of Tempo, a missing reverse-proxy path prefix or, on a Tempo microservices deployment, the daemon pointing at tempo-querier when only tempo-query-frontend exposes the HTTP query API. Variant signature changed from tuple to struct form (HttpStatus { status: u16, url: String }) for readability at the call site.
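The struct-form variant and its message can be sketched as follows. Only the variant shape (HttpStatus { status: u16, url: String }) and the message text come from these notes. The Display implementation is an assumption about how the message is produced, and redact_userinfo is a hypothetical stand-in for http_client::redact_endpoint, whose actual behavior is not described here.

```rust
use std::fmt;

// Reduced to the one variant discussed here; the real enum has more.
#[derive(Debug)]
enum TempoError {
    HttpStatus { status: u16, url: String },
}

impl fmt::Display for TempoError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // Named fields keep the call site and the format string readable.
            TempoError::HttpStatus { status, url } => {
                write!(f, "Tempo returned HTTP {status} for {url}")
            }
        }
    }
}

/// Hypothetical redaction helper (NOT the project's redact_endpoint):
/// drops a `user:password@` prefix from the authority part of a URL.
fn redact_userinfo(url: &str) -> String {
    let Some(scheme_end) = url.find("://") else {
        return url.to_string();
    };
    let rest = &url[scheme_end + 3..];
    // The authority ends at the first '/', '?' or '#'.
    let authority_end = rest
        .find(|c: char| matches!(c, '/' | '?' | '#'))
        .unwrap_or(rest.len());
    match rest[..authority_end].rfind('@') {
        Some(at) => format!("{}{}", &url[..scheme_end + 3], &rest[at + 1..]),
        None => url.to_string(),
    }
}
```

Building the error with a redacted URL, for example TempoError::HttpStatus { status: 404, url: redact_userinfo(raw) }, means credentials never reach the log line while the queried path stays visible.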
Tests
- New unit test classify_fetch_error_buckets_every_hard_failure_variant in crates/sentinel-core/src/ingest/tempo.rs asserts every hard-failure variant of TempoError (Timeout, Transport, HttpStatus, ProtobufDecode, BodyRead, JsonParse) maps to its dedicated bucket key, and that upstream variants that should never reach the classifier (InvalidEndpoint, NoTracesFound) fall through to the intentional "other" catch-all. Drift guard against a future variant silently ending up in "other" without visibility.
- New integration test ingest_from_tempo_drains_mixed_per_trace_outcomes exercises the drain loop end-to-end with three concurrent per-trace outcomes (HTTP 500, HTTP 404, 200-with-empty-protobuf). Asserts that the run completes without panic and aggregates the failures through the classification path. Complements the happy-path test ingest_from_tempo_search_then_fetch_aggregates_events which covered a single-trace flow.
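A rough sketch of the classification path these tests guard, combined with the summary-severity rule from the Changed section. The variant names and bucket keys are taken from these notes; everything else is an assumption made for illustration: the unit-variant shapes (the real variants presumably carry payloads), the function signatures, the summarize helper, and counting TraceNotFound as a skip outside the classifier. The task_panic bucket originates from task join failures rather than TempoError and is not modeled here.

```rust
use std::collections::BTreeMap;

// Simplified stand-in for the real error enum (payloads omitted).
#[allow(dead_code)]
#[derive(Debug)]
enum TempoError {
    Timeout,
    Transport,
    HttpStatus,
    ProtobufDecode,
    BodyRead,
    JsonParse,
    TraceNotFound,
    InvalidEndpoint,
    NoTracesFound,
}

/// Maps each hard-failure variant to its dedicated bucket key.
/// Variants that should never reach this point land in "other".
fn classify_fetch_error(e: &TempoError) -> &'static str {
    match e {
        TempoError::Timeout => "timeout",
        TempoError::Transport => "transport",
        TempoError::HttpStatus => "http_status",
        TempoError::ProtobufDecode => "protobuf_decode",
        TempoError::BodyRead => "body_read",
        TempoError::JsonParse => "json_parse",
        _ => "other",
    }
}

#[derive(Debug, PartialEq)]
enum Severity {
    Warn,
    Error,
}

/// Hypothetical summary: severity of the single summary line, number of
/// TraceNotFound skips, and hard-failure counts per bucket.
/// Returns None when there is nothing to report.
fn summarize(
    errors: &[TempoError],
) -> Option<(Severity, usize, BTreeMap<&'static str, usize>)> {
    if errors.is_empty() {
        return None;
    }
    let mut skipped = 0;
    let mut buckets = BTreeMap::new();
    for e in errors {
        match e {
            TempoError::TraceNotFound => skipped += 1,
            other => *buckets.entry(classify_fetch_error(other)).or_insert(0) += 1,
        }
    }
    // warn if only TraceNotFound skips occurred, error otherwise.
    let severity = if buckets.is_empty() { Severity::Warn } else { Severity::Error };
    Some((severity, skipped, buckets))
}
```

Because the match ends in a wildcard arm, the compiler will not flag a newly added variant; a unit test that enumerates every hard-failure variant, like the one described above, is what catches that drift.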
Docs
- docs/INTEGRATION.md + FR counterpart: new subsection "Tempo in microservices mode (tempo-distributed)" documenting the tempo-query-frontend vs tempo-querier distinction that causes /api/search to 404 when --endpoint points at the wrong component.
- docs/LIMITATIONS.md + FR counterpart: Tempo ingestion limitations rewritten to reflect parallel fetch (cap 16, not user-configurable), split timeouts (5 s search / 30 s fetch) and Ctrl-C partial-result behavior. The obsolete "sequential fetching" bullet was factually wrong after the parallelization and has been removed.
- docs/design/06-INGESTION-AND-DAEMON.md + FR counterpart: new "## Tempo ingestion" design-level section covering the parallel-fetch rationale, the timeout split and the select-based Ctrl-C handling. Fills a pre-existing gap: Tempo had no design-doc coverage despite being a supported ingest source since v0.3.1.
- docs/RUNBOOK.md + FR counterpart: two new troubleshooting entries. "Daemon running but not reachable from clients" covers the recurring bind-address-in-container / --network host / port-mapping / firewall triage path. "perf-sentinel tempo returns 404 or times out" covers the Tempo endpoint misconfiguration and the capacity-driven timeout scenario.
Install
Prebuilt binaries (Linux amd64 / arm64, macOS arm64, Windows amd64):
curl -LO https://github.com/robintra/perf-sentinel/releases/download/v0.4.7/perf-sentinel-linux-amd64
chmod +x perf-sentinel-linux-amd64
sudo mv perf-sentinel-linux-amd64 /usr/local/bin/perf-sentinel
Linux binaries are statically linked against musl and run on any distribution (Alpine, Debian, RHEL, Ubuntu, any version) regardless of glibc version, and inside FROM scratch images.
From crates.io:
cargo install perf-sentinel
Docker:
docker run --rm -p 4317:4317 -p 4318:4318 \
  ghcr.io/robintra/perf-sentinel:0.4.7 watch --listen-address 0.0.0.0
Also available on Docker Hub: robintrassard/perf-sentinel:0.4.7.
Verify the binary against SHA256SUMS.txt:
curl -LO https://github.com/robintra/perf-sentinel/releases/download/v0.4.7/SHA256SUMS.txt
sha256sum -c SHA256SUMS.txt --ignore-missing
Full diff: v0.4.6...v0.4.7