Patch release for two related runner-failure paths surfaced by an internal plan run that hit a node taint after the scheduler bound the pod.
Bug Fixes
- Plan-phase runs now retry on disruption. When a runner Pod is killed by an eviction (taint-eviction-controller, node drain, preemption), the Job's PodFailurePolicy previously failed the run immediately. Plan runs are read-only on AWS, so partial execution is safe to retry; the policy now ignores DisruptionTarget=True conditions on plan Jobs, letting K8s back off and retry up to the normal backoffLimit=3. Apply Jobs keep the never-retry behaviour: once an apply container has run, terraform may have mutated state, and K8s PodFailurePolicy can't reliably distinguish "container ran" from "container never started" across cluster versions/distributions.
- runner_exit_status populated from the Job when the pod has been GC'd. A complement to #430. When the listener observes Job=failed but the pod is already gone (eviction GC, TTL controller race), the pod-side terminated state can't be read. The Job itself still records the exit code in status.conditions[?].reason="PodFailurePolicy"; Terrapod now parses that as a fallback so the typed runner_exit_status bucket (oom/killed/error) gets set instead of staying empty. The Run UI now surfaces an actionable error (e.g. "Killed (likely OOM)" with peak memory + limit) for these cases, rather than a generic "Job failed".
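As a rough illustration of the two fixes above, the Python sketch below builds a per-phase podFailurePolicy and derives an exit bucket from a Job's conditions. It is not Terrapod's actual code. The helper names (pod_failure_policy, exit_bucket_from_job), the use of a FailJob rule for apply Jobs, the "exit code N" wording of the condition message, and the code-to-bucket mapping are all assumptions made for the example.

```python
import re
from typing import Optional

# Assumption: the Job condition message contains the text "exit code <N>".
_EXIT_CODE_RE = re.compile(r"exit code (\d+)")


def pod_failure_policy(phase: str) -> dict:
    """Return a podFailurePolicy dict for a 'plan' or 'apply' Job (hypothetical helper)."""
    if phase == "plan":
        # Plan runs are read-only, so a disrupted pod can be replaced and retried.
        action = "Ignore"
    elif phase == "apply":
        # Assumption: apply Jobs fail outright on disruption (never-retry behaviour).
        action = "FailJob"
    else:
        raise ValueError(f"unknown phase: {phase!r}")
    return {
        "rules": [
            {
                "action": action,
                "onPodConditions": [{"type": "DisruptionTarget", "status": "True"}],
            }
        ]
    }


def exit_bucket_from_job(job_status: dict) -> Optional[str]:
    """Derive an exit bucket from Job conditions when the pod is already gone."""
    for cond in job_status.get("conditions", []):
        if cond.get("reason") != "PodFailurePolicy":
            continue
        match = _EXIT_CODE_RE.search(cond.get("message", ""))
        if match is None:
            continue
        code = int(match.group(1))
        # Assumed mapping: 137 (128 + SIGKILL) is "killed", anything else is "error".
        # An exit code alone cannot separate an OOM kill from other SIGKILLs,
        # so this sketch never returns "oom".
        return "killed" if code == 137 else "error"
    return None  # no usable condition: the bucket stays unset
```

A listener could call exit_bucket_from_job(job["status"]) only after a pod lookup comes back empty, so the pod-side terminated state stays the primary source whenever it is still readable.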
Status
Stable.
Full Changelog: v0.31.0...v0.31.1