Pure-Python runner orchestrator, apply-phase AI failure analysis, and a cleaner provider-mirror story for single-arch deployments.
Highlights
- Pure-Python runner orchestrator — the bash entrypoint (
docker/runner-entrypoint.sh, ~1100 lines) is gone. Every phase the runner Job drives — binary cache download, configuration tarball, mirror config, state, init, plan/apply, OPA, h1 lock-file splice, artifact uploads, resource profile, state-diverged signalling — now lives inservices/terrapod/runner/job_entrypoint.pyand per-phase modules underterrapod.runner.phases. The runner image dropped curl, jq, tar, unzip, and openssl; onlypython:3.13-slim+git/openssh-client+ the pinnedopabinary remain. - AI failure analysis for apply-phase errors (#419) — previously only plan-phase errors got AI failure analysis. Apply-phase failures now do too. The summariser reads the apply log when
apply_started_atis set, and the prompt has new guidance: identify the failed resource, identify resources that already completed before the failure, surface partial-state explicitly, and rank fixes by recovery type (re-run safe, refresh first, manual cleanup,-target). Workspaces flaggedstate_divergedget a dedicatedSTATE_DIVERGEDblock in the prompt. - Config-aware provider-mirror lock extender —
/v1/providers/{ns}/{type}/{version}.jsonnow advertises the operator's configuredprovider_cache.platformsvia a newcached_platformsfield. The runner's lock-extender uses it to distinguish "h1 missing because the operator deliberately doesn't cache this arch" (silent skip, no fallback) from "h1 missing because compute failed" (warn + fall back totofu providers lock). Single-arch deployments stop emitting spurious warnings and stop running the expensive fallback for arches they don't care about. - Lazy h1 backfill — cached provider rows from before h1 tracking (or whose ingest-time h1 compute failed) now compute h1 lazily from the cached archive on the next request and persist it. Eliminates a class of "no h1 from mirror" warnings without operator action.
- Streaming-log fix — runner pod logs are read via the kubernetes Python client's raw response path (
_preload_content=False) and decoded to UTF-8 ourselves. Defensive UTF-8 decode also added at the API/Redis write boundary. Eliminates theb'...'-prefixed gibberish that previously rendered in the live log view. - Stdio line-buffering in the runner orchestrator — every phase boundary flushes
sys.stdout/sys.stderrsokubectl logsand the combined-log artifact pick up output promptly. Subprocess output is teed via a 4KB chunk loop with explicit flush. - OPA install resilience — Anthropic/GitHub-Releases 5xx transients no longer fail image builds. The OPA binary download in
Dockerfile.api,Dockerfile.test, andDockerfile.runner(now in the builder stage) all wrapcurlwith--retry 6 --retry-delay 4 --retry-all-errors. - Strict header validation in
_request_base_url—X-Forwarded-Host,Host, andX-Forwarded-Protonow go through RFC 1123 / scheme allow-lists before being spliced into response URLs. CRLF injection and other separator-character attacks fall back tocallback_base_url.
Security
- Closes Semgrep code-scanning alerts #174 and #175 (
directly-returned-format-string) via the request-header validation above. - All 4 active Trivy ignores + 1 active pip-audit ignore re-checked against current upstream state; none droppable this cycle; inline rationales remain accurate.
Status
Stable — pure-Python runner verified end-to-end against the local Tilt stack (plan-only run, apply-phase failure_analysis, streaming logs, config-aware h1 skip).
Full Changelog: v0.32.0...v0.33.0