Added
- Agent: resilient BPF attach and sampler status. The BPF builder now
attaches each program individually and tolerates per-program failures
(load/verify failures stay fatal), so a single dead probe no longer
takes down its sibling programs. A new `GET /samplers` endpoint
reports each sampler as `active`/`disabled`/`failed` with per-program
attach detail, and recordings capture this status under
`per_source_metadata.<source>.sampler_status`. (#954)
- GPU (NVIDIA):
`gpu_tensor_utilization` now breaks out per-tensor-pipe
activity via a `pipe` label (`hmma` for FP16/BF16 and FP32 matmul that
runs as TF32, `imma` for integer, and `dfma` for FP64) alongside the
existing aggregate (`pipe=any`). Collected from NVML GPM, so it
requires Hopper+ and is reported only where the corresponding pipe is
supported. (#946)
Fixed
- BPF samplers that rely on in-kernel BTF (`cpu_usage`, `cpu_migrations`,
`cpu_perf`, `scheduler_runqueue`, `syscall_counts`) now work on kernels
built without `/sys/kernel/btf/vmlinux` (e.g. NVIDIA Tegra/L4T). Each
`tp_btf` hook gains a `raw_tp` twin selected at runtime via
`kernel_has_btf()`, and `syscall_counts` uses `bpf_get_current_task()`
instead of `bpf_get_current_task_btf()`. CO-RE still uses the external
BTF file (`btf_path`). Stock BTF kernels are unaffected. (#948)
- BPF sampler correctness fixes from a full review against
`docs/principles.md`: histogram bucketing used 32-bit shifts,
mis-bucketing values ≥ 2³¹ (long-tail latencies ≥ ~2.15 s were
misreported); blockio latency tracking silently dropped all requests
on kernels < 5.11 due to a tracepoint argument layout difference;
scheduler/runqueue could charge runqueue-wait and off-cpu time to the
wrong cgroup; a full ringbuf no longer permanently suppresses a
cgroup's name; `tcp_retransmit` now counts segments instead of calls
(it undercounted with TSO/GSO); plus smaller metadata, histogram, and
defensive-check fixes. (#956)
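
The 2³¹ threshold in the histogram fix above can be illustrated with a small model. This is a sketch only, not the project's BPF code: the helper names are hypothetical, and it assumes latencies recorded in nanoseconds and power-of-two bucket boundaries. It shows how a boundary computed with a signed 32-bit shift goes wrong at index 31, while the 64-bit form stays correct.

```python
def as_int32(x: int) -> int:
    """Interpret the low 32 bits of x as a signed 32-bit integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def bucket_floor_32(index: int) -> int:
    """Lower bound of bucket `index` via a 32-bit shift (hypothetical buggy form)."""
    return as_int32(1 << index)


def bucket_floor_64(index: int) -> int:
    """Lower bound of bucket `index` via a 64-bit shift (hypothetical fixed form)."""
    return (1 << index) & 0xFFFFFFFFFFFFFFFF


# Compare the two forms around the 2**31 boundary.
for index in (30, 31, 32):
    print(index, bucket_floor_32(index), bucket_floor_64(index))

# Assuming nanosecond units, 2**31 ns is where the misreporting would begin.
threshold_seconds = (1 << 31) / 1e9
print(f"threshold: {threshold_seconds:.3f} s")
```

Under the nanosecond assumption, the threshold works out to about 2.147 s, which matches the "≥ ~2.15 s" figure quoted in the entry.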