mattrobinsonsre/terrapod v0.31.1 on GitHub

Patch release for two related runner-failure paths surfaced by an internal plan run that hit a node taint after the scheduler bound the pod.

Bug Fixes

Plan-phase runs now retry on disruption. When a runner Pod is killed by an eviction (taint-eviction-controller, node drain, preemption), the Job's PodFailurePolicy previously failed the run immediately. Plan runs are read-only on AWS, so partial execution is safe to retry; the policy now ignores DisruptionTarget=True conditions on plan Jobs, letting Kubernetes back off and retry up to the normal backoffLimit=3. Apply Jobs keep the never-retry behaviour: once an apply container has run, Terraform may have mutated state, and Kubernetes PodFailurePolicy can't reliably distinguish "container ran" from "container never started" across cluster versions and distributions.
runner_exit_status is now populated from the Job when the Pod has been garbage-collected. A complement to #430. When the listener observes Job=failed but the Pod is already gone (eviction GC, TTL controller race), the pod-side terminated state can't be read. The Job itself still records the exit code in status.conditions[?].reason="PodFailurePolicy"; Terrapod now parses that as a fallback so the typed runner_exit_status bucket (oom / killed / error) is set instead of staying empty. For these cases the Run UI now surfaces an actionable error (e.g. "Killed (likely OOM)" with peak memory + limit) rather than a generic "Job failed".
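
As a rough illustration of the two fixes above, the sketch below builds a disruption rule for plan and apply Jobs and derives a status bucket from a Job's conditions. It is not Terrapod's actual code: the function names, the apply-side FailJob rule, the condition message format, and the mapping of exit code 137 to "killed" are all assumptions made for the example. Only the `rules`, `action`, and `onPodConditions` fields come from the Kubernetes Job API.

```python
import re
from typing import Optional


def pod_failure_policy(phase: str) -> dict:
    """Return a podFailurePolicy fragment for a 'plan' or 'apply' Job.

    Hypothetical helper; only the disruption rule is shown, other rules omitted.
    """
    disruption = {"type": "DisruptionTarget", "status": "True"}
    # Plan: Ignore the disruption so Kubernetes creates a replacement Pod.
    # Apply: fail the Job outright (assumed shape of the never-retry rule).
    action = "Ignore" if phase == "plan" else "FailJob"
    return {"rules": [{"action": action, "onPodConditions": [disruption]}]}


# Assumed message shape, e.g. "Container runner for pod ns/name failed with
# exit code 137 matching FailJob rule at index 0".
_EXIT_CODE_RE = re.compile(r"exit code (\d+)")


def exit_status_from_job(job_status: dict) -> Optional[str]:
    """Fallback bucket from Job conditions when the Pod is already gone."""
    for cond in job_status.get("conditions", []):
        if cond.get("reason") != "PodFailurePolicy":
            continue
        match = _EXIT_CODE_RE.search(cond.get("message") or "")
        if match is None:
            continue
        code = int(match.group(1))
        # 137 = 128 + SIGKILL(9); the bucket mapping here is an assumption.
        return "killed" if code == 137 else "error"
    return None  # nothing usable on the Job; leave the bucket empty


if __name__ == "__main__":
    print(pod_failure_policy("plan")["rules"][0]["action"])
    print(exit_status_from_job({"conditions": [{
        "type": "Failed",
        "reason": "PodFailurePolicy",
        "message": "Container runner for pod ns/p failed with exit code 137 "
                   "matching FailJob rule at index 0",
    }]}))
```

The Job alone does not say why a container was killed, so this sketch never returns "oom" from the fallback path; a real implementation would need another signal to make that distinction.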

Stable.

Full Changelog: v0.31.0...v0.31.1