github mattrobinsonsre/terrapod v0.31.0

latest release: v0.31.1
4 hours ago

Terrapod is an open-source platform replacement for Terraform Enterprise. This release adds end-to-end visibility into runner memory pressure and OOM events so operators can right-size workspaces with real data instead of guessing after a "Job failed" message.

Highlights

  • Per-run resource profile — every run records its peak container memory (cgroup v2 memory.peak) and the workspace's snapshotted CPU/memory request, surfaced on the Run detail page as a Resource usage panel. The memory bar turns amber at ≥80% of limit and red at ≥95%, so high-water marks are visible before they push a run over the cliff. CPU is captured into the schema for future use but intentionally not yet rendered — instantaneous sampling lands in a follow-up (cumulative core-time without wall-clock anchoring would be misleading).
  • OOM-aware error messages — when a runner Job is OOM-killed, the listener captures the K8s container.state.terminated reason and exit code; the reconciler maps it to a typed runner_exit_status (oom / killed / error / clean) and emits an actionable error message naming the current resource_memory and the action to take ("Increase resource_memory + retry"). An OOM-killed badge appears on the Run detail page for unambiguous identification.
  • AI plan summary skipped on abnormal exit — when a run dies via OOM or SIGKILL, the plan JSON is incomplete or absent; the summariser now detects abnormal exit status and marks the summary as skipped instead of hallucinating a "no changes" narrative from a truncated plan.
  • New POST /api/terrapod/v1/runs/{run_id}/resource-profile endpoint and matching Run response attributes (peak-memory-bytes, peak-cpu-usec, runner-exit-code, runner-exit-reason, runner-exit-status) — usable by external tooling for capacity dashboards.
  • New operator runbook: the "Runner OOM-Killed" entry in docs/runbooks.md walks through diagnosis, the bump-and-retry resolution workflow, and how to distinguish a real OOM from a node eviction masquerading as an OOM.
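
For external tooling that consumes peak-memory-bytes, the amber/red thresholds described above can be approximated with a small helper. This is a minimal sketch, not Terrapod's implementation: the function names, the "normal" label for the default state, and the default cgroup mount path are assumptions made for illustration. Only the 80% and 95% thresholds and the cgroup v2 memory.peak file come from the notes above.

```python
from pathlib import Path

AMBER_PERCENT = 80  # from the release notes: amber at >=80% of limit
RED_PERCENT = 95    # from the release notes: red at >=95% of limit


def read_peak_memory(cgroup_dir: str = "/sys/fs/cgroup") -> int:
    """Read the cgroup v2 high-water mark in bytes.

    memory.peak holds a single integer. The default directory is an
    assumption; inside a container the cgroup root is usually mounted here.
    """
    return int((Path(cgroup_dir) / "memory.peak").read_text().strip())


def memory_bar_level(peak_bytes: int, limit_bytes: int) -> str:
    """Return 'red', 'amber', or 'normal' (hypothetical label) for a run."""
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    # Integer arithmetic avoids floating-point surprises at the boundaries.
    if peak_bytes * 100 >= RED_PERCENT * limit_bytes:
        return "red"
    if peak_bytes * 100 >= AMBER_PERCENT * limit_bytes:
        return "amber"
    return "normal"
```

A capacity dashboard could call memory_bar_level with a run's peak-memory-bytes and the workspace memory limit converted to bytes, then flag workspaces that are repeatedly amber as candidates for a larger resource_memory.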

Security

  • Trivy allowlist reduced from 6 entries to 1: libcap2 (CVE-2026-4878) and krb5 (CVE-2026-40356) are now patched in Debian trixie, and the apt-get upgrade step in our Dockerfiles picks them up automatically.
  • pip-audit floor tightened on litellm to >=1.83.7 — patches three proxy-mode CVEs (42203/42208/42271), all proxy-only and unreachable from how Terrapod uses litellm.

Status

Stable — resource profile + OOM detection verified end-to-end. The cgroup-limit code path surfaces as reason: Error rather than OOMKilled under Rancher Desktop's local k3s runtime (a known quirk of that environment); on production EKS/AKS/GKE Docker runtimes the OOMKilled signal flows correctly and the oom bucket fires.
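
To illustrate why the reported reason matters, here is one way a terminated container's reason and exit code could be mapped to the four runner_exit_status buckets. This is a hedged sketch only: the function name and the precedence of the checks are assumptions, and the actual reconciler logic may differ. OOMKilled, Error, and Completed are standard Kubernetes terminated reasons, and 137 is the conventional exit code for a process ended by SIGKILL (128 + 9).

```python
from typing import Optional

SIGKILL_EXIT_CODE = 137  # 128 + signal 9


def classify_runner_exit(reason: Optional[str], exit_code: Optional[int]) -> str:
    """Map a terminated container state to oom / killed / error / clean.

    Hypothetical helper: the ordering of checks below is an assumption,
    not the reconciler's documented behaviour.
    """
    if reason == "OOMKilled":
        return "oom"
    if exit_code == SIGKILL_EXIT_CODE:
        return "killed"
    if exit_code == 0:
        return "clean"
    # Anything else, including a missing exit code, is treated as an error.
    return "error"
```

Under this sketch, a runtime that reports reason: Error never reaches the oom bucket, whatever its exit code, which is consistent with the Rancher Desktop quirk noted above.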

Full Changelog: v0.30.10...v0.31.0
