mattrobinsonsre/terrapod v0.33.0 on GitHub

Pure-Python runner orchestrator, apply-phase AI failure analysis, and a cleaner provider-mirror story for single-arch deployments.

Highlights

Pure-Python runner orchestrator — the bash entrypoint (docker/runner-entrypoint.sh, ~1100 lines) is gone. Every phase the runner Job drives — binary cache download, configuration tarball, mirror config, state, init, plan/apply, OPA, h1 lock-file splice, artifact uploads, resource profile, state-diverged signalling — now lives in services/terrapod/runner/job_entrypoint.py and per-phase modules under terrapod.runner.phases. The runner image dropped curl, jq, tar, unzip, and openssl; only python:3.13-slim + git/openssh-client + the pinned opa binary remain.
AI failure analysis for apply-phase errors (#419) — previously only plan-phase errors got AI failure analysis. Apply-phase failures now do too. The summariser reads the apply log when apply_started_at is set, and the prompt has new guidance: identify the failed resource, identify resources that already completed before the failure, surface partial-state explicitly, and rank fixes by recovery type (re-run safe, refresh first, manual cleanup, -target). Workspaces flagged state_diverged get a dedicated STATE_DIVERGED block in the prompt.
Config-aware provider-mirror lock extender — /v1/providers/{ns}/{type}/{version}.json now advertises the operator's configured provider_cache.platforms via a new cached_platforms field. The runner's lock-extender uses it to distinguish "h1 missing because the operator deliberately doesn't cache this arch" (silent skip, no fallback) from "h1 missing because compute failed" (warn + fall back to tofu providers lock). Single-arch deployments stop emitting spurious warnings and stop running the expensive fallback for arches they don't care about.
Lazy h1 backfill — cached provider rows from before h1 tracking (or whose ingest-time h1 compute failed) now compute h1 lazily from the cached archive on the next request and persist it. Eliminates a class of "no h1 from mirror" warnings without operator action.
Streaming-log fix — runner pod logs are read via the kubernetes Python client's raw response path (_preload_content=False) and decoded to UTF-8 ourselves. Defensive UTF-8 decode also added at the API/Redis write boundary. Eliminates the b'...'-prefixed gibberish that previously rendered in the live log view.
Stdio line-buffering in the runner orchestrator — every phase boundary flushes sys.stdout/sys.stderr so kubectl logs and the combined-log artifact pick up output promptly. Subprocess output is teed via a 4KB chunk loop with explicit flush.
OPA install resilience — Anthropic/GitHub-Releases 5xx transients no longer fail image builds. The OPA binary download in Dockerfile.api, Dockerfile.test, and Dockerfile.runner (now in the builder stage) all wrap curl with --retry 6 --retry-delay 4 --retry-all-errors.
Strict header validation in _request_base_url — X-Forwarded-Host, Host, and X-Forwarded-Proto now go through RFC 1123 / scheme allow-lists before being spliced into response URLs. CRLF injection and other separator-character attacks fall back to callback_base_url.

Security

Closes Semgrep code-scanning alerts #174 and #175 (directly-returned-format-string) via the request-header validation above.
All 4 active Trivy ignores + 1 active pip-audit ignore re-checked against current upstream state; none droppable this cycle; inline rationales remain accurate.

Status

Stable — pure-Python runner verified end-to-end against the local Tilt stack (plan-only run, apply-phase failure_analysis, streaming logs, config-aware h1 skip).

Full Changelog: v0.32.0...v0.33.0