Terrapod is an open-source platform replacement for Terraform Enterprise. This release adds end-to-end visibility into runner memory pressure and OOM events so operators can right-size workspaces with real data instead of guessing after a "Job failed" message.
Highlights
- Per-run resource profile: every run records its peak container memory (cgroup v2 `memory.peak`) and the workspace's snapshotted CPU/memory request, surfaced on the Run detail page as a Resource usage panel. The memory bar turns amber at ≥80% of limit and red at ≥95%, so high-water marks are visible before they push a run over the cliff. CPU is captured into the schema for future use but intentionally not yet rendered; instantaneous sampling lands in a follow-up (cumulative core-time without wall-clock anchoring would be misleading).
- OOM-aware error messages: when a runner Job is OOM-killed, the listener captures the K8s `container.state.terminated` reason and exit code; the reconciler maps it to a typed `runner_exit_status` (oom/killed/error/clean) and emits an actionable error message naming the current `resource_memory` and the action to take ("Increase resource_memory + retry"). An OOM-killed badge appears on the Run detail page for unambiguous identification.
- AI plan summary skipped on abnormal exit: when a run dies via OOM or SIGKILL, the plan JSON is incomplete or absent; the summariser now detects abnormal exit status and marks the summary as `skipped` instead of hallucinating a "no changes" narrative from a truncated plan.
- New `POST /api/terrapod/v1/runs/{run_id}/resource-profile` endpoint and matching Run response attributes (`peak-memory-bytes`, `peak-cpu-usec`, `runner-exit-code`, `runner-exit-reason`, `runner-exit-status`), usable by external tooling for capacity dashboards.
- New operator runbook: `docs/runbooks.md` "Runner OOM-Killed" walks through diagnosis, the bump-and-retry resolution workflow, and how to distinguish a real OOM from a node-eviction-masquerading-as-OOM.
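
For external tooling that wants to mirror the behaviour described above, here is a minimal Python sketch. It is not Terrapod's actual implementation. The 80%/95% thresholds and the four status names come from these notes; everything else is an assumption made for illustration: the function names, the `"normal"` colour name, the rule that exit code 137 (128 + SIGKILL) means `killed`, and the choice to treat only `oom` and `killed` as abnormal for summary purposes.

```python
from typing import Optional

# Thresholds from the release notes: amber at >=80% of the limit, red at >=95%.
AMBER_RATIO = 0.80
RED_RATIO = 0.95


def memory_bar_colour(peak_bytes: int, limit_bytes: int) -> str:
    """Return the bar colour for a run's peak memory against its limit."""
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    ratio = peak_bytes / limit_bytes
    if ratio >= RED_RATIO:
        return "red"
    if ratio >= AMBER_RATIO:
        return "amber"
    return "normal"  # hypothetical name for the default colour


def classify_exit(reason: Optional[str], exit_code: Optional[int]) -> str:
    """Map a terminated reason and exit code to oom/killed/error/clean.

    These rules are assumptions, not the reconciler's actual logic.
    """
    if reason == "OOMKilled":
        return "oom"
    if exit_code == 137:  # 128 + 9 (SIGKILL); assumed to mean "killed"
        return "killed"
    if exit_code == 0:
        return "clean"
    return "error"


def should_skip_summary(exit_status: str) -> bool:
    """Assumed rule: skip the AI plan summary after an OOM or SIGKILL."""
    return exit_status in {"oom", "killed"}
```

With these definitions, a run that peaked at 960 MiB against a 1000 MiB limit would be shown in red, and a container terminated with reason `OOMKilled` would be classified as `oom` and have its summary skipped.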
Security
- Trivy allowlist reduced from 6 to 1 entry: libcap2 (CVE-2026-4878) and krb5 (CVE-2026-40356) are now patched in Debian trixie, and `apt-get upgrade` in our Dockerfiles picks them up automatically.
- pip-audit floor tightened on litellm to `>=1.83.7`: patches three proxy-mode CVEs (42203/42208/42271), all proxy-only and unreachable from how Terrapod uses litellm.
Status
Stable: resource profile + OOM detection verified end-to-end. The cgroup-limit code path surfaces as `reason: Error` rather than `OOMKilled` under Rancher Desktop's local k3s runtime (a known quirk of that environment); on production EKS/AKS/GKE Docker runtimes the `OOMKilled` signal flows correctly and the `oom` bucket fires.
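
To check which signal a given cluster reports, an operator can read the terminated state straight from the pod's JSON (for example, the output of `kubectl get pod <name> -o json`). The sketch below assumes that JSON has already been parsed into a dictionary. The helper name and the container name `runner` are hypothetical; the `status.containerStatuses[].state.terminated` fields are the standard Kubernetes ones.

```python
from typing import Optional, Tuple


def terminated_state(
    pod: dict, container: str
) -> Optional[Tuple[Optional[str], Optional[int]]]:
    """Return (reason, exit_code) for a terminated container, else None."""
    statuses = pod.get("status", {}).get("containerStatuses") or []
    for cs in statuses:
        if cs.get("name") != container:
            continue
        term = cs.get("state", {}).get("terminated")
        if term is None:
            return None  # container is still running or waiting
        return term.get("reason"), term.get("exitCode")
    return None  # no container with that name


# Hand-written sample for illustration, not captured cluster output.
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "runner",  # hypothetical container name
                "state": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            }
        ]
    }
}

print(terminated_state(sample_pod, "runner"))
```

If the reason comes back as `Error` for a run that is known to have hit its memory limit, the environment is probably showing the quirk described above.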
Full Changelog: v0.30.10...v0.31.0