robintra/perf-sentinel v0.5.19 on GitHub

What's new in v0.5.19

v0.5.19 closes 3 observability gaps in the daemon surfaced by downstream validation work on the simulation lab. Standard process collector metrics (process_resident_memory_bytes, process_open_fds, process_start_time_seconds, process_cpu_seconds_total, ...) are now exposed on /metrics on Linux, so operators get RSS and FD pressure visibility without depending on an external metrics-server. A new perf_sentinel_otlp_rejected_total{reason} counter quantifies OTLP backpressure with 3 values of the reason label (unsupported_media_type, parse_error, channel_full), each pre-warmed to 0 at startup so dashboards plot the zero-line before the first rejection. Finally, Report.warning_details: Vec<Warning> adds a structured {kind, message} channel alongside the legacy Report.warnings: Vec<String> field, populated by the daemon cold-start path (kind="cold_start") and dynamically by /api/export/report from the rejected counter (kind="ingestion_drops").

The 3 fixes are purely additive on the observability layer. Ingestion behavior does not change: requests that were accepted before are still accepted, and requests that were rejected before are still rejected with the same status codes. The only difference is that rejections are now visible in /metrics and surfaced in the report payload. The legacy Report.warnings field is preserved byte-for-byte; renderers prefer warning_details when non-empty and fall back to warnings otherwise. Pre-0.5.19 baselines parse fine thanks to serde(default, skip_serializing_if = "Vec::is_empty") on the new field.
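The renderer fallback described above can be sketched as follows. This is a minimal illustration, not the project's renderer: the Report struct is trimmed to the two warning fields, and render_warnings and its "[kind] message" output format are hypothetical names and choices made for this example.

```rust
/// Structured warning, mirroring the `{kind, message}` shape from these notes.
#[derive(Debug, Clone, PartialEq)]
pub struct Warning {
    pub kind: String,
    pub message: String,
}

/// Trimmed-down stand-in for the real `Report` (assumption: only the
/// two warning fields are modeled here).
#[derive(Debug, Default)]
pub struct Report {
    pub warnings: Vec<String>,
    pub warning_details: Vec<Warning>,
}

/// Hypothetical renderer helper: prefer the structured field when it is
/// non-empty, otherwise fall back to the legacy strings unchanged.
pub fn render_warnings(report: &Report) -> Vec<String> {
    if !report.warning_details.is_empty() {
        report
            .warning_details
            .iter()
            .map(|w| format!("[{}] {}", w.kind, w.message))
            .collect()
    } else {
        report.warnings.clone()
    }
}
```

A report that carries only the legacy field renders exactly as before, which is what keeps pre-0.5.19 baselines readable.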

The release also lands the supply-chain pinning policy documentation as docs/SUPPLY-CHAIN.md (and its FR mirror), which formalizes the project's stance: GitHub Actions are pinned by SHA in .github/workflows/, Helm chart CLI invocations stay on latest (lower-risk surface), and the docs/ci-templates SHA drift versus upstream is accepted by design. The doc is the reference for contributors wondering why some pins look frozen and others do not.

Added

perf_sentinel_otlp_rejected_total{reason} counter on /metrics (crates/sentinel-core/src/report/metrics.rs). 3 reason labels: unsupported_media_type (HTTP only, Content-Type is not application/x-protobuf), parse_error (HTTP only, prost decode failed), channel_full (HTTP and gRPC, event channel saturated). All pre-warmed to 0 at startup. payload_too_large is intentionally absent: tower-http and tonic enforce the cap upstream and reject before the application handler runs.
Process collector metrics on /metrics (Linux only): process_resident_memory_bytes, process_virtual_memory_bytes, process_open_fds, process_max_fds, process_start_time_seconds, process_cpu_seconds_total. Registered via prometheus::process_collector::ProcessCollector::for_self() behind #[cfg(target_os = "linux")] so the macOS and Windows builds do not pay for failed /proc/self/* reads on every scrape.
Report.warning_details: Vec<Warning> field on the report payload, with Warning { kind: String, message: String } defined in the new crates/sentinel-core/src/report/warnings.rs module. Two kind values ship in 0.5.19: cold_start (returned by /api/export/report until the first batch lands) and ingestion_drops (computed dynamically from perf_sentinel_otlp_rejected_total{reason="channel_full"} when positive). Renderers prefer the structured field when non-empty and fall back to the legacy Report.warnings: Vec<String> (0.5.16+) otherwise.
Warning::from_untrusted(kind, message) constructor that strips Unicode BiDi-override and invisible-format characters via report::sarif::strip_bidi_and_invisible. Trojan Source defense (CVE-2021-42574) for future contributors wiring a Warning sourced from an OTLP attribute or any other attacker-influenced channel. Documented as the required entry point for untrusted bytes in the module-level doc comment.
docs/METRICS.md and docs/FR/METRICS-FR.md: exhaustive reference for every metric exposed on /metrics, including a per-scrape cost note for the new process collector (FD walk dominates at thousands of long-lived connections) and an exposure scope note recommending Kubernetes NetworkPolicy plus Prometheus mTLS when the daemon binds to 0.0.0.0.
docs/SUPPLY-CHAIN.md and docs/FR/SUPPLY-CHAIN-FR.md (#10): the pinning policy reference. Documents what gets SHA-pinned (.github/workflows/ actions), what stays on latest (Helm CLI lints, lower-risk because no repo perms or secrets access), and the docs/ci-templates drift acceptance.
"Diagnosing OTLP drops" and "Reading Report warnings" sections in docs/RUNBOOK.md and docs/FR/RUNBOOK-FR.md: operator recipes for cross-checking the new counter against process metrics and the warning_details payload.
14 new tests across report::warnings, report::mod, report::metrics, ingest::otlp, daemon::query_api, plus 1 e2e test in crates/sentinel-cli/tests/e2e.rs that pins the JSON shape of Report.warning_details. Includes a #[cfg(not(target_os = "linux"))] symmetric test that locks the platform gating of the process collector.
crate::test_helpers::empty_report() factory for unit tests that need a default Report shape, replacing the long boilerplate at every call site.
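The Warning::from_untrusted entry point listed above can be illustrated with a small sketch. Both the parameter types and the body of strip_bidi_and_invisible are assumptions made for this example: the real helper lives in report::sarif and may strip a different set of code points.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Warning {
    pub kind: String,
    pub message: String,
}

/// Illustrative stand-in for `report::sarif::strip_bidi_and_invisible`.
/// The exact character set stripped by the project is an assumption here.
fn strip_bidi_and_invisible(input: &str) -> String {
    input
        .chars()
        .filter(|c| {
            !matches!(
                *c,
                '\u{202A}'..='\u{202E}' // BiDi embeddings and overrides
                | '\u{2066}'..='\u{2069}' // BiDi isolates
                | '\u{200E}' | '\u{200F}' | '\u{061C}' // directional marks
                | '\u{200B}'..='\u{200D}' // zero-width space, non-joiner, joiner
                | '\u{2060}' | '\u{FEFF}' // word joiner, zero-width no-break space
            )
        })
        .collect()
}

impl Warning {
    /// Sanitizing constructor for attacker-influenced input
    /// (signature assumed: two string slices).
    pub fn from_untrusted(kind: &str, message: &str) -> Self {
        Self {
            kind: strip_bidi_and_invisible(kind),
            message: strip_bidi_and_invisible(message),
        }
    }
}
```

Routing every untrusted string through one constructor keeps the sanitization in a single place, so a new call site cannot forget it.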

Changed

MetricsState caches the 3 OTLP rejection counters as IntCounter fields (otlp_rejected_unsupported_media_type, otlp_rejected_parse_error, otlp_rejected_channel_full). record_otlp_reject(reason) becomes a branchless match plus an atomic inc(), with no per-rejection HashMap label lookup. This avoids amplifying a daemon slowdown with metric overhead during a backpressure storm. The IntCounterVec is kept on the struct for /metrics rendering and tests; only the hot path uses the cached children.
otlp_http_router and OtlpGrpcService::new accept Option<Arc<MetricsState>> as a new parameter (crates/sentinel-core/src/ingest/otlp.rs). Some(metrics) in daemon mode (passed through daemon/listeners.rs), None for batch CLI and tests so the existing call sites stay zero-cost. Each rejection site (HTTP unsupported_media_type, HTTP parse_error, HTTP channel_full, gRPC channel_full) calls m.record_otlp_reject(reason) when the metrics handle is present.
docs/ci-templates/ PERF_SENTINEL_VERSION pin bumped from 0.5.17 to 0.5.18 across gitlab-ci.yml, github-actions.yml, github-actions-baseline.yml, and jenkinsfile.groovy. Materializes in this release so users curling the templates pull the recent binary by default.
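The cached-counter hot path and the optional metrics handle described above can be sketched with the standard library alone. This is a hedged illustration: AtomicU64 stands in for the prometheus IntCounter children, and the RejectReason enum, channel_full_total, and on_channel_full are hypothetical names introduced for the example.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical typed reason; the real `record_otlp_reject` argument type
/// is not shown in these notes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RejectReason {
    UnsupportedMediaType,
    ParseError,
    ChannelFull,
}

/// Stand-in for `MetricsState`: one cached counter per reason.
/// `Default` starts every counter at 0, which mirrors the pre-warming.
#[derive(Debug, Default)]
pub struct MetricsState {
    otlp_rejected_unsupported_media_type: AtomicU64,
    otlp_rejected_parse_error: AtomicU64,
    otlp_rejected_channel_full: AtomicU64,
}

impl MetricsState {
    /// Hot path: a match selects the cached counter, then one atomic add.
    /// No label lookup happens per rejection.
    pub fn record_otlp_reject(&self, reason: RejectReason) {
        let counter = match reason {
            RejectReason::UnsupportedMediaType => &self.otlp_rejected_unsupported_media_type,
            RejectReason::ParseError => &self.otlp_rejected_parse_error,
            RejectReason::ChannelFull => &self.otlp_rejected_channel_full,
        };
        counter.fetch_add(1, Ordering::Relaxed);
    }

    /// Hypothetical read accessor used by this example.
    pub fn channel_full_total(&self) -> u64 {
        self.otlp_rejected_channel_full.load(Ordering::Relaxed)
    }
}

/// Hypothetical rejection site: records only when a handle is present,
/// so batch CLI and test callers can pass `None`.
pub fn on_channel_full(metrics: Option<&Arc<MetricsState>>) {
    if let Some(m) = metrics {
        m.record_otlp_reject(RejectReason::ChannelFull);
    }
}
```

Passing None makes the rejection site a single branch with no allocation or atomic traffic, which is the property the Option<Arc<MetricsState>> parameter is meant to preserve.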

Behavior

No change to ingestion behavior. Requests that were accepted before are still accepted, and rejected requests still return the same status codes (HTTP 415, 400, 503; gRPC INTERNAL). The difference is that rejections are now visible in /metrics and Report.warning_details.
Backward compatibility on Report JSON. The new warning_details field is additive via serde(default, skip_serializing_if = "Vec::is_empty"). Pre-0.5.19 baselines saved with report --before <baseline.json> parse without modification. The legacy warnings: Vec<String> field (0.5.16+) is preserved byte-for-byte, populated as before by the daemon cold-start path. Renderers prefer warning_details when non-empty and fall back to warnings otherwise.
Process metrics are Linux only. Operators on macOS and Windows hosts continue to see the perf_sentinel_* metrics and nothing else under process_*. The prometheus crate's process feature is now activated, but the registration site is gated by #[cfg(target_os = "linux")] so non-Linux scrapes do not pay for failed /proc/self/* reads.
Built artifacts are slightly larger. Activating the process feature pulls procfs as a transitive dependency on Linux. This adds a few KB to the binary and has no runtime cost outside the scrape path.
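The platform gating of the process metrics can be sketched with compile-time cfg attributes. This is an illustration only: process_metric_names is a hypothetical function that returns metric names instead of registering a real collector, and the name list is copied from the Added section.

```rust
/// Metric names exposed by the process collector, as listed in these notes.
pub const PROCESS_METRICS: [&str; 6] = [
    "process_resident_memory_bytes",
    "process_virtual_memory_bytes",
    "process_open_fds",
    "process_max_fds",
    "process_start_time_seconds",
    "process_cpu_seconds_total",
];

/// Linux build: the process metrics are part of the exposed set.
#[cfg(target_os = "linux")]
pub fn process_metric_names() -> Vec<&'static str> {
    PROCESS_METRICS.to_vec()
}

/// Non-Linux build: this variant is compiled instead, so no /proc reads
/// are attempted at scrape time.
#[cfg(not(target_os = "linux"))]
pub fn process_metric_names() -> Vec<&'static str> {
    Vec::new()
}
```

Because the choice is made at compile time, only one of the two functions exists in a given binary and there is no runtime platform check on the scrape path.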

Documentation

New docs/METRICS.md and docs/FR/METRICS-FR.md: full per-metric reference grouped by category (process, OTLP ingestion, analysis and findings, GreenOps), with cardinality, label catalog, and the per-scrape cost note for the process collector.
New docs/SUPPLY-CHAIN.md and docs/FR/SUPPLY-CHAIN-FR.md (from #10): pinning policy reference for contributors and reviewers.
docs/RUNBOOK.md and docs/FR/RUNBOOK-FR.md extended with two diagnostic recipes ("Diagnosing OTLP drops" and "Reading Report warnings"), including the rationale for why payload_too_large is not counted by the new counter.
README.md and README-FR.md mention warning_details and the new /metrics surfaces in the daemon section, with cross-links to the new docs.

Install

Prebuilt binaries (Linux amd64 / arm64, macOS arm64, Windows amd64):

curl -LO https://github.com/robintra/perf-sentinel/releases/download/v0.5.19/perf-sentinel-linux-amd64
chmod +x perf-sentinel-linux-amd64
sudo mv perf-sentinel-linux-amd64 /usr/local/bin/perf-sentinel

Linux binaries are statically linked against musl, so they run on any distribution (Alpine, Debian, RHEL, Ubuntu, any version) regardless of glibc version, and inside FROM scratch images.

From crates.io:

cargo install perf-sentinel --version 0.5.19

Docker:

docker run --rm -p 4317:4317 -p 4318:4318 \
  ghcr.io/robintra/perf-sentinel:0.5.19 watch --listen-address 0.0.0.0

Also available on Docker Hub: robintrassard/perf-sentinel:0.5.19.

Helm (chart 0.2.22 ships 0.5.19 as its appVersion default):

helm install perf-sentinel oci://ghcr.io/robintra/charts/perf-sentinel \
  --version 0.2.22 \
  --namespace observability --create-namespace

Verify the binary against SHA256SUMS.txt:

curl -LO https://github.com/robintra/perf-sentinel/releases/download/v0.5.19/SHA256SUMS.txt
sha256sum -c SHA256SUMS.txt --ignore-missing

Full diff: v0.5.18...v0.5.19