feat(hooks): add hook kinds, scanner integrations, SARIF summaries, and skip controls @osterman (#2482)
## what
- Adds a `kind` discriminator to the hook system with built-in kinds for `store`, `command`, `infracost`, `checkov`, `trivy`, and `kics`.
- Adds the generic command hook engine, including toolchain-aware binary resolution, live stdout/stderr passthrough, templated args/env support, `ATMOS_*` runtime env vars, output-file/output-dir side channels, and configurable `on_failure` behavior.
- Adds scanner and cost integrations:
  - `infracost` parses JSON breakdown output into a markdown cost summary.
  - `checkov`, `trivy`, and `kics` emit SARIF and share one parser/markdown renderer.
- Adds normalized SARIF handling for severity counts, linked rule IDs via `helpUri`, short descriptions, file/line locations, and empty-result handling.
- Adds hook dependency preflight: component `dependencies.tools` are installed before hooks run, toolchain paths take precedence over operator PATH, and missing hook binaries fail before Terraform starts.
- Adds a curated embedded Atmos tool registry with a KICS override so `dependencies.tools.kics` can install from release tarballs.
- Adds `pkg/cacerts` and wires Checkov to `SSL_CERT_FILE`/`REQUESTS_CA_BUNDLE` so PyInstaller-bundled Checkov can use the host CA bundle.
- Adds the `--skip-hooks` global flag and `ATMOS_SKIP_HOOKS`, supporting skip-all and comma-separated named-hook skipping.
- Preserves backward compatibility by accepting legacy `command: store` hook configs as `kind: store`.
- Makes hooks work with resolved component workdirs so scanners inspect the same directory Terraform uses.
- Adds runnable examples for `infracost`, `checkov`, `trivy`, `kics`, and custom `kind: command` hooks.
- Updates hook docs, global flag docs, PRDs, roadmap data, and adds the custom hooks blog post.
- Refreshes CLI help snapshots and updates CI workflow actions/shell handling needed by the branch.
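For readers unfamiliar with SARIF, the severity-count part of the normalized handling can be sketched roughly as follows. This is a minimal illustration, not the parser shipped in this PR: the `sarifLog` type and `countSeverities` function are hypothetical names, and treating a missing `level` as `warning` is a simplifying assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sarifLog models only the SARIF 2.1.0 fields needed for severity counts.
// Hypothetical type for illustration; not the type used by Atmos.
type sarifLog struct {
	Runs []struct {
		Results []struct {
			RuleID string `json:"ruleId"`
			Level  string `json:"level"`
		} `json:"results"`
	} `json:"runs"`
}

// countSeverities tallies results per SARIF level across all runs.
// Assumption: a result without an explicit level is counted as "warning";
// a fuller implementation would consult the rule's default configuration first.
func countSeverities(data []byte) (map[string]int, error) {
	var log sarifLog
	if err := json.Unmarshal(data, &log); err != nil {
		return nil, fmt.Errorf("parse SARIF: %w", err)
	}
	counts := map[string]int{}
	for _, run := range log.Runs {
		for _, res := range run.Results {
			level := res.Level
			if level == "" {
				level = "warning"
			}
			counts[level]++
		}
	}
	return counts, nil
}

func main() {
	doc := []byte(`{"version":"2.1.0","runs":[{"results":[
		{"ruleId":"R1","level":"error"},
		{"ruleId":"R2"},
		{"ruleId":"R3","level":"note"}]}]}`)
	counts, err := countSeverities(doc)
	if err != nil {
		panic(err)
	}
	fmt.Println(counts["error"], counts["warning"], counts["note"])
}
```

A run with an empty `results` array yields an empty map, which is the case the "empty-result handling" bullet refers to.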
why
- Hooks were already the right lifecycle surface for component automation, but only `store` had first-class dispatch behavior.
- Security scanners, cost estimators, and custom tools should run from stack config without wrapper scripts or GitHub Actions glue.
- Named kinds provide zero-config defaults for common tools, while `kind: command` keeps the system open for arbitrary binaries.
- Tool auto-install and preflight failures make examples and CI usage reproducible instead of relying on whatever happens to be on PATH.
- SARIF and infracost summaries create a common typed output path for terminal rendering now and Atmos Pro upload later.
notes
- This PR intentionally does not add a built-in `tfsec` kind or `hooks-tfsec` example. Trivy is the maintained Aqua-backed scanner path; legacy tfsec users can still wire it with `kind: command`.
- Atmos Pro upload, cross-run SARIF aggregation, Terraform component dependency auto-install outside hooks, and planfile threading remain follow-up work.
- The CodeRabbit-generated release notes below are preserved as-is.
Summary by CodeRabbit
- New Features
  - Pluggable hook "kinds" (infracost, trivy, checkov, kics) plus a generic command kind; structured side‑channel outputs, markdown rendering, preflight tool resolution/auto‑install, CA‑bundle propagation, and per‑invocation hook skipping via `--skip-hooks` / `ATMOS_SKIP_HOOKS`.
- Documentation
  - Detailed PRDs, reference docs, CLI help, blog post, and runnable examples for each hook kind.
- Tests
  - Expanded unit and integration tests covering hooks, result handlers, SARIF parsing, toolchain registry, and examples.
feat(ci): auto-detect log level from GitHub Actions debug mode @osterman (#2495)
## what
- Atmos now auto-detects when a workflow is running with GitHub Actions debug logging enabled and switches its own log level to `Debug` for the run.
- Triggered when `ci.enabled: true` is set in `atmos.yaml` and the active CI provider reports debug mode is on. For GitHub Actions, that means `ACTIONS_RUNNER_DEBUG=true` or `ACTIONS_STEP_DEBUG=true`, exactly what the built-in "Re-run with debug logging" button sets.
- Emits a single Info-level log line when it fires so users see why their output got louder: `CI provider debug mode detected — using Debug log level for this run provider=github-actions from=Info`.
- Built on a provider-agnostic optional interface: `provider.DebugModeDetector { IsDebugMode() bool }` in `pkg/ci/internal/provider`, plus a generic registry helper `ci.DetectDebugMode() DebugModeInfo`. The GHA provider implements the interface; `cmd/root.go` imports only `pkg/ci` and names no GHA-specific env vars.
- Auto-detection overrides `--logs-level`, `ATMOS_LOGS_LEVEL`, and `logs.level` in `atmos.yaml`: the CI-side debug toggle is set at the repo/workflow level by the runner itself and is treated as the higher-priority signal (including over an explicit `Trace` or `Off`).
- Ships with a new framework PRD (`docs/prd/native-ci/framework/debug-mode-promotion.md`), a changelog blog post, a roadmap milestone under the Native CI initiative, and unit tests covering the GHA `IsDebugMode()` env-var matrix, the generic `DetectDebugMode()` type-assertion path, and the cmd-side helper's gates and override semantics.
why
- Debugging Atmos in CI is usually just as important as debugging the workflow around it. GitHub provides a single "Re-run with debug logging" button to make every tool in the run verbose; today Atmos ignores it, so users get a noisier runner but the same quiet Atmos output, and have to remember a per-tool dance (`ATMOS_LOGS_LEVEL=Debug` somewhere in workflow YAML).
- The interface-based design keeps the startup path provider-agnostic, so adding the same behavior to a future CI provider is one method on the provider, with no changes needed in `cmd/` or `pkg/ci`.
- Overriding explicit `--logs-level` / `ATMOS_LOGS_LEVEL` is intentional: the CI-side toggle is an explicit, repo-/workflow-level "make everything noisy" signal that should beat per-invocation flags in the same run.
- This is the same gap other GitHub-published tools have hit, e.g. pypa/gh-action-pypi-publish#322, which validates the pattern.
references
- GitHub docs: Enable debug logging
- GitHub changelog: Re-run jobs with debug logging
- Prior art in another ecosystem: pypa/gh-action-pypi-publish#322
- New PRD: `docs/prd/native-ci/framework/debug-mode-promotion.md`
Summary by CodeRabbit
- New Features
  - Atmos now auto-promotes its log level to Debug when running on GitHub Actions with per-run debug logging enabled and `ci.enabled: true`; an informational startup log notes the promotion and it overrides other log-level settings.
- Documentation
  - Added product doc and blog post explaining debug-mode promotion and usage.
- Tests
  - Added unit tests covering debug-mode detection and promotion behavior.
- Refactor
  - Minor CI hook wiring cleanup in multi-component Terraform runs.
fix(ci): restore checks: write on lint job for reviewdog annotations @osterman (#2500)
## what
- Add a job-scoped `permissions:` block on the `lint` (`[lint] <demo-folder>`) job in `.github/workflows/test.yml` granting `contents: read` + `checks: write` so `reviewdog/action-tflint@v1` can post inline tflint findings on PRs via the GitHub Checks API.

Companion to #2499, which restored `security-events: write` on the `docker` (`[lint] Dockerfile`) job. Same root cause, second affected job.
why
- PR #2487 introduced the first workflow-level `permissions:` block on `test.yml` to grant `packages: read` for ghcr.io OCI pulls. A workflow-level `permissions:` block replaces (not extends) the default `GITHUB_TOKEN` scope for every job in the file, which silently stripped the inherited `checks: write` that the `lint` job relied on.
- Effect on contributors: since #2487 merged, tflint findings on PRs touching `examples/<demo-folder>/components/terraform` have stopped appearing as inline check annotations. The job itself still exits with the right code (`fail_level: error` controls that), but reviewers lost the per-line context. This restores that behavior.
- Job-scoped (least privilege) over widening the workflow-level block: only this one job uses reviewdog. Matches the convention used in `.github/workflows/codeql.yml` and the `docker`-job fix already landed in #2499.
- Not adding `pull-requests: write`: reviewdog's default `github-pr-check` reporter posts check runs (which need `checks: write`), not review comments. `checks: write` alone is sufficient.
references
fix(ci): restore security-events: write for Dockerfile lint SARIF upload @aknysh (#2499)
## what
Restores `security-events: write` permission for the `[lint] Dockerfile` job in `.github/workflows/test.yml` so its hadolint SARIF results can be uploaded to GitHub Code Scanning.
- Adds a job-level `permissions:` block to the `docker` job:

      permissions:
        contents: read          # actions/checkout
        security-events: write  # github/codeql-action/upload-sarif

- `contents: read` is re-listed because a job-level `permissions:` block fully overrides the workflow-level set (rather than merging).
why
PR #2487 (2437e13bf) added a top-level `permissions:` block to the workflow to grant `packages: read` for ghcr.io pulls:

    permissions:
      contents: read
      packages: read

In GitHub Actions, a workflow-level `permissions:` block replaces (not extends) the default `GITHUB_TOKEN` scope for every job in the file. That replacement inadvertently stripped the implicit `security-events: write` that the `[lint] Dockerfile` job relied on to upload hadolint SARIF results via `github/codeql-action/upload-sarif@v4`. Every post-merge run on `main` has been failing the `Upload SARIF file` step since #2487:

    ##[warning]This run of the CodeQL Action does not have permission to
    access the CodeQL Action API endpoints. ... please ensure the workflow
    has at least the 'security-events: read' permission.
    ##[error]Resource not accessible by integration -
    https://docs.github.com/rest
Failing run for reference:
https://github.com/cloudposse/atmos/actions/runs/26339160817/job/77562841194
Note: hadolint itself ran successfully in the failing run — the SARIF
output contained zero findings. Only the upload step failed.
A job-level fix (this PR) is preferred over expanding the workflow-level
permissions block, because it follows least-privilege: only the one job
that actually needs to write security events gets the elevated scope.
references
- Failing CI run: https://github.com/cloudposse/atmos/actions/runs/26339160817/job/77562841194
- Regressing PR: #2487
- GitHub Actions permissions docs: https://docs.github.com/en/actions/security-for-github-actions/security-guides/automatic-token-authentication#permissions-for-the-github_token
- `github/codeql-action/upload-sarif` permission requirement: https://docs.github.com/en/code-security/code-scanning/integrating-with-code-scanning/uploading-a-sarif-file-to-github#uploading-the-sarif-file-to-github
Summary by CodeRabbit
- Chores
- Fixed permissions configuration in the CI/CD pipeline to restore security scanning capabilities in automated testing workflows.
🚀 Enhancements
fix(toolchain): retry cosign on transient Sigstore Rekor failures @osterman (#2506)
## what
- Wrap the `cosign verify-blob` exec in `pkg/toolchain/verification/signature.go` with bounded exponential backoff, retrying only on a narrow allowlist of transient Sigstore Rekor failures.
- Add a new `errUtils.ErrSignatureRetryable` sentinel next to the existing `ErrDownloadRetryable`, and a `classifyCosignError` helper that joins the sentinel into cosign errors when the combined output matches a Rekor-flake marker.
- New `runCosignWithRetry` uses the same retry budget as the existing downloader (5 attempts, 1s → 10s exponential). Logs a `WARN` before each retry so CI logs surface the upstream-service context.
why
- `cosign verify-blob` sometimes fails not because of a real signature problem but because Sigstore's Rekor transparency-log API returns a short-window upstream error. The most common signature is:

      Error: searching log query: [POST /api/v1/log/entries/retrieve][400] searchLogQueryBadRequest {"code":400,"message":"verifying signature: ecdsa: Invalid IEEE_P1363 encoded bytes"}

  The same artifact verifies cleanly seconds later. Without retry, this turns a transient Sigstore outage into a hard `tool not found` install failure for every Atmos user pulling toolchain assets during the outage window. We hit this twice in 48h on the Windows mock jobs alone.
- The retry allowlist is intentionally narrow: `searchLogQueryBadRequest`, `Invalid IEEE_P1363 encoded bytes`, and Rekor's `/api/v1/log/entries/retrieve` endpoint paired with a 5xx status. Anything outside the allowlist (tampering, expired cert, identity mismatch, missing signature) surfaces immediately on the first attempt. Blanket-retrying signature verification would mask real tampering events and is the canonical anti-pattern; we do not do that here.
- Mirrors the existing pattern in `pkg/toolchain/installer/download.go` (sentinel + `retry.WithPredicate` + classifier) so the toolchain's resilience story stays consistent across download and verification.
references
- Failing run that prompted this: https://github.com/cloudposse/atmos/actions/runs/26368136271/job/77615786736 (Windows `[mock-windows] demo-component-versions`, Rekor returned 400 on a valid OpenTofu 1.9.1 release signature).
- Same flake also hit `[mock-windows] demo-vendoring` in the same run.
- Companion CI permissions fix: #2499 (merged) and #2500 (open).
Summary by CodeRabbit
- Improvements
- Signature verification now automatically retries on transient Sigstore Rekor service issues with bounded exponential backoff for greater resilience.
- Error handling improved to distinguish retryable transient failures from permanent signature verification errors, reducing false failures.
- Tests
- Added unit tests covering classification of transient Rekor failures and retry behavior through the public verification path.
fix(terraform): skip non-terraform and deleted components in `atmos terraform plan/apply --affected` @thejrose1984 (#2484)
## what
Filters the affected component list in `atmos terraform plan/apply --affected` so the command no longer runs against helmfile components, packer components, or components deleted in HEAD.
why
Reported in #2361. `getAffectedComponents` returns every affected component regardless of type (helmfile, packer, and BASE-only deletions are all included), and `ExecuteTerraformAffected` was iterating that full list and calling `ExecuteTerraform` on each, producing output like:

    INFO Executing command="atmos terraform apply example-terraform -s example"
    INFO Executing command="atmos terraform apply example-helmfile -s example"
    INFO Executing command="atmos terraform apply example-packer -s example"

Documentation (`website/docs/cli/commands/terraform/usage.mdx`) describes `--affected` as executing the command "on all the directly affected components," with the implicit constraint that those components belong to the `atmos terraform` subcommand. Helmfile and Packer subcommands would orchestrate their own components, and deleted components have no on-disk module, so `terraform plan/apply` against them either errors or no-ops.
how
- New package-private helper `filterTerraformAffected` keeps only items where `ComponentType == cfg.TerraformComponentType` and `!Deleted`. Called once in `ExecuteTerraformAffected` after `getAffectedComponents`, before `addDependentsToAffected` (which is expensive and shouldn't run for items we will drop).
- Defense-in-depth filter in `executeTerraformAffectedComponentInDepOrder`: when `--include-dependents` is set, any dependent with `ComponentType` explicitly set to a non-terraform value is skipped during the recursion.
- `atmos describe affected` is unchanged. It still reports the full affected set (terraform + helmfile + packer + deleted) as the canonical introspection view. The filter is scoped to the execution path of `atmos terraform <cmd> --affected`.
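The top-level filter can be pictured with a short sketch. The `affected` struct here is a trimmed-down, hypothetical stand-in for the real affected-component record, and the `"terraform"` constant is an assumed value for `cfg.TerraformComponentType`; only the predicate itself comes from the description above.

```go
package main

import "fmt"

// affected is a hypothetical, trimmed-down stand-in for the real record.
type affected struct {
	Component     string
	ComponentType string
	Deleted       bool
}

// terraformComponentType is an assumed value for cfg.TerraformComponentType.
const terraformComponentType = "terraform"

// filterTerraform keeps only live Terraform components. It compacts the
// slice in place, so the input's backing array is reused.
func filterTerraform(items []affected) []affected {
	kept := items[:0]
	for _, it := range items {
		if it.ComponentType == terraformComponentType && !it.Deleted {
			kept = append(kept, it)
		}
	}
	return kept
}

func main() {
	items := []affected{
		{Component: "example-terraform", ComponentType: "terraform"},
		{Component: "example-helmfile", ComponentType: "helmfile"},
		{Component: "example-packer", ComponentType: "packer"},
		{Component: "removed", ComponentType: "terraform", Deleted: true},
	}
	for _, it := range filterTerraform(items) {
		fmt.Println(it.Component)
	}
}
```

Running the filter before dependents are added means the expensive dependent lookup never runs for items that would be dropped anyway.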
tests
- New file `internal/exec/terraform_affected_filter_test.go`: 8 portable table cases covering the filter (no gomonkey, runs on every CI matrix entry). Includes a case mirroring the exact #2361 reproducer fixture. 100% line coverage of the new helper.
- Three new cases added to the `TestExecuteTerraformAffectedComponentInDepOrder` table in `internal/exec/terraform_utils_test.go`: helmfile dependent skipped, packer dependent skipped, mixed-type dependents (only the terraform one runs).
compatibility
Bug fix only. The previous behavior produced incorrect commands that would fail mid-execution (terraform-apply against a helmfile component errors out). No public Go API or CLI surface changes. The most visible user-facing shift is an exit-code flip from non-zero to zero when a changeset contains only non-terraform or deleted components — which now correctly reports "No components affected" instead of erroring partway through.
Closes #2361.
Summary by CodeRabbit
- Bug Fixes
  - Terraform execution now excludes non-Terraform dependents (e.g., Helmfile, Packer), skips deleted components, and treats empty component types as Terraform to preserve compatibility; Terraform-only ordering is preserved to avoid unnecessary processing.
- Tests
  - Added tests for filtering behavior, in-place compaction semantics, non-Terraform exclusion, and a regression ensuring only Terraform entries are executed.
perf(exec/merge/utils): optimize describe affected for large-stack workloads (~50% local wall-clock, projected ~10× on 2-core CI) @aknysh (#2496)
## what
Performance optimization arc for `atmos describe affected` on large-stack workloads. Targets the inheritance + merge + YAML-parse hot paths that have grown since the October 2025 perf instrumentation arc (PRs #1576/#1611/#1622/#1639) added the auth, profiles, locals, and per-file position-tracking subsystems.

Twelve shipped phases (with two attempted-and-reverted optimizations documented in-code so they aren't re-tried without addressing the underlying contract violations):
- Phase 2: `cacheBaseComponentConfig` switched from `RWMutex` + map to `sync.Map`; deep-copy moved outside the critical section. Eliminates write-lock contention that serialized every cache write across goroutines and padded apparent CPU time with lock-wait.
- Phase 3: `WalkAndDeferYAMLFunctions` short-circuits when the subtree contains no Atmos YAML functions (`!template`, `!terraform.*`, `!store*`, `!exec`, `!env`). Returns the input map as-is instead of allocating a deep copy at every recursion level.
- Phase 4: `extractLocalsFromRawYAML` cached via `sync.Map` keyed by `filePath + FNV-1a(yamlContent)`. The content-hash component prevents test pollution when the same logical file path is reused with different content. Also fixes a pre-existing data race in `extractAndAddLocalsToContext` (shallow-clones the input context map before file-scoped `delete + assign`).
- Phase 5: `MergeWithDeferred` 0-input fast path: when every layer is empty, return an empty map immediately without walking or merging. The 1-input shortcut originally shipped alongside was REVERTED on 2026-05-24 after CI surfaced a regression; see "What was tried and reverted" below.
- Phase 6: `parsedYAMLCache` switched to `sync.Map`; deep-copy of `yaml.Node` + `PositionMap` moved outside the critical section. Same lock contention pattern as Phase 2 applied to the YAML parser cache.
- Phase 7: `processCustomTags` split into outer + inner functions: the `hasCustomTags` pre-check runs once at the entry point instead of re-walking the subtree at every recursive call (O(N×depth) → O(N)).
- Phase 8: new `decodedYAMLCache` stores the post-Decode + post-Intern result of `UnmarshalYAMLFromFileWithPositions[map[string]any]`. Skips `yaml.Node.Decode` + `InternStringsInMap` on every repeat call for the same (file, content hash) pair.
- Phase 10: `processYAMLNode` split into outer + inner (Phase 7 pattern). Removes per-recursion `perf.Track` overhead from the recursive YAML walker used by yq evaluation. Same pattern was tried for `WalkAndDeferYAMLFunctions` and reverted: the inner-only walker had to allocate unconditionally on every recursion, regressing function-sparse subtrees more than the `perf.Track` savings recovered.
- Phase 11: `processTerraformRemoteStateBackend` extracts the backend-type-specific map from each input first (via the new `extractBackendTypeMap` helper), then merges just those two scoped maps. Avoids deep-copying unrelated backend-type entries (s3/gcs/azurerm/etc.) just to extract one key from the merged result.
- Phase 12 + 13: `deepCopyBaseComponentConfigMaps` guards every `m.DeepCopyMap` call with `len(src.Field) > 0`. Skips function-call and allocation overhead for the empty-field case that dominates real workloads (most components leave several of the 10 fields empty).
- Coverage: new tests close the gaps the arc introduced. Public `Clear*Cache` wrappers, the `extractBackendTypeMap` type-mismatch path, and the Phase 12/13 empty-field contract are all now covered.
Phase 1 was the auth credential-store fix that shipped separately as
PR #2471 (in v1.220.0-rc.1).
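As an illustration of the Phase 4 idea, a content-hashed cache key on top of `sync.Map` could be sketched like this. The function names, the key layout, and the choice of the 64-bit FNV-1a variant are assumptions made for the example; the PR only states that the key combines the file path with an FNV-1a hash of the YAML content.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// localsCache maps cache keys to parsed results (hypothetical name).
var localsCache sync.Map

// cacheKey combines the file path with an FNV-1a hash of the content, so the
// same path with different content never hits a stale entry.
// Assumption: 64-bit FNV-1a; the PR does not state the hash width.
func cacheKey(filePath, content string) string {
	h := fnv.New64a()
	h.Write([]byte(content)) // hash.Hash writes never return an error
	return fmt.Sprintf("%s:%016x", filePath, h.Sum64())
}

// copyMap makes a shallow copy; nested values would need a deep copy in real code.
func copyMap(src map[string]any) map[string]any {
	dst := make(map[string]any, len(src))
	for k, v := range src {
		dst[k] = v
	}
	return dst
}

// cachedLocals returns a copy of the cached value, or parses and stores it.
func cachedLocals(filePath, content string, parse func(string) map[string]any) map[string]any {
	key := cacheKey(filePath, content)
	if v, ok := localsCache.Load(key); ok {
		return copyMap(v.(map[string]any))
	}
	parsed := parse(content)
	localsCache.Store(key, copyMap(parsed))
	return parsed
}

func main() {
	parses := 0
	parse := func(string) map[string]any {
		parses++
		return map[string]any{"region": "us-east-1"}
	}
	cachedLocals("stacks/dev.yaml", "locals: {region: us-east-1}", parse)
	cachedLocals("stacks/dev.yaml", "locals: {region: us-east-1}", parse)
	cachedLocals("stacks/dev.yaml", "locals: {region: eu-west-1}", parse)
	fmt.Println("parse calls:", parses)
}
```

Copies are made on both store and load so that callers never share a map with the cache, which is the same mutation-isolation concern discussed in the reverted phases below.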
What was tried and reverted
- Phase 5 1-input shortcut. Initially returned the walked single input directly without going through `Merge`. CI surfaced `TestSpaceliftStackProcessor` losing 7 stacks (47→40) because `WalkAndDeferYAMLFunctions`'s Phase 3 short-circuit returned the input map as-is, and the 1-input shortcut handed that shared reference back to the caller. Downstream `mergeComponentConfigurations` mutated the result while building the per-component output, which corrupted the upstream cached `BaseComponentSettings`/`GlobalSettings` for sibling components. Fix: keep only the 0-input fast path; let 1-input fall through to the regular merge pipeline, which deep-copies via `MergeWithOptions` → `DeepCopyMap`. Regression test (`TestMergeWithDeferred_TrivialInputShortCircuits/mutating the result does not mutate the input`) added to prevent re-attempts.
- Phase 9 asymmetric clone. Same class of failure (`fatal error: concurrent map iteration and map write` at scale): share `settings`/`vars`/`env` references from the locals cache, deep-copy only `locals`. `processTemplatesInSection` returns the input map as-is when the section has no `{{`, so the cached references ended up in the shared template context and got mutated by sibling goroutines. Reverted in `b11f3cd9b`; documented in-code.
Both failures share the same lesson: any optimization that hands a
shared reference back to a caller has to be matched against the
end-to-end mutation surface, not just the immediate caller. The Merge
contract — "result is a fresh, caller-mutable map" — must be upheld.
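The failure mode is easy to reproduce in miniature. The sketch below uses hypothetical `mergeShared` and `mergeFresh` functions (neither is the real Atmos merge, and the merge itself is simplified to a top-level override) to show how a single-input shortcut leaks a cached map to a caller that then mutates it.

```go
package main

import "fmt"

// deepCopy recursively copies nested map[string]any values.
// Assumption: values are maps or scalars; slices are omitted for brevity.
func deepCopy(src map[string]any) map[string]any {
	dst := make(map[string]any, len(src))
	for k, v := range src {
		if m, ok := v.(map[string]any); ok {
			dst[k] = deepCopy(m)
		} else {
			dst[k] = v
		}
	}
	return dst
}

// mergeFresh always builds a new map, so callers may mutate the result.
// Simplified: later inputs override earlier ones at the top level only.
func mergeFresh(inputs ...map[string]any) map[string]any {
	out := map[string]any{}
	for _, in := range inputs {
		for k, v := range deepCopy(in) {
			out[k] = v
		}
	}
	return out
}

// mergeShared is the unsafe shortcut: a single input is returned as-is.
func mergeShared(inputs ...map[string]any) map[string]any {
	if len(inputs) == 1 {
		return inputs[0]
	}
	return mergeFresh(inputs...)
}

func main() {
	cached := map[string]any{"vars": map[string]any{"region": "us-east-1"}}

	safe := mergeFresh(cached)
	safe["vars"].(map[string]any)["region"] = "mutated"
	fmt.Println("after mergeFresh: ", cached["vars"])

	leaked := mergeShared(cached)
	leaked["vars"].(map[string]any)["region"] = "mutated"
	fmt.Println("after mergeShared:", cached["vars"])
}
```

With `mergeFresh` the cached map is untouched; with `mergeShared` the caller's write lands in the cache, which is how sibling components ended up seeing corrupted settings.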
why
A real-world large stack configuration (≈836 YAML files, ~195 final stacks
across three namespaces, ~9.3k component instances) reported atmos describe affected taking around 11 minutes in CI on a 2-core runner. Local
reproduction with --identity=false and fake AWS credentials took ~4
minutes and showed two clear cost centers in the heatmap:
- The credential store was being created per-component even with `--identity=false` (≈3.5 min of cumulative CPU). Fixed in PR #2471.
- The component-inheritance + merge + YAML-parse pipeline (≈6 min of cumulative CPU) was bottlenecked by lock contention on shared caches, redundant deep-copies on every cache read/write, and per-call work that could be cached or skipped for the common case.
This PR addresses the second bucket. Confirmed impact on the same workload
(mean of 3 local runs against current main, post-Phase-5-revert):
| Function | Pre-Phase-2 | Post-Phase-13 | Reduction |
|---|---|---|---|
| `cacheBaseComponentConfig` | 5m50s | ~990ms | −99.7% |
| `mergeComponentConfigurations` | 2m22s | ~95s | −33% |
| `MergeWithDeferred` | 1m35s | ~51s | −46% |
| `WalkAndDeferYAMLFunctions` | 1m26s | ~11s | −87% |
| `extractLocalsFromRawYAML` | 13s | ~6s | −54% |
| `UnmarshalYAMLFromFileWithPositions` | 18.7s | ~3.2s | −83% |
| `processCustomTags` | 31.5s | ~7.5s | −76% |
| `getCachedBaseComponentConfig` | 6.5s | ~360ms | −94% |
(mergeComponentConfigurations, MergeWithDeferred, and Merge
numbers reflect the post-Phase-5-revert state — the 1-input shortcut
that was originally counted toward Phase 5's headline numbers has been
removed for correctness. The remaining wins are still substantial.)
Local wall-clock on the same workload: 4.1s → ~2.2s (−47%) on a
many-core Mac. The wall-clock floor on Mac is set by stack-level
parallelism that already saturates; the cumulative CPU savings (several
minutes summed across all hot functions) translate to materially more
wall-clock improvement on 2-4 core CI runners where lock-wait padding,
allocation pressure, and serialized work cannot be hidden behind cores.
Projection for a 2-core CI runner starting from the v1.219.0 baseline:
~11 minutes → ~60-105 seconds end-to-end (combining PR #2471 + this
PR's shipped phases, including the ~15s GHA wall-clock cost of the
Phase 5 1-input revert). Awaiting end-to-end CI validation on the
reference workload.
Each phase is independently revertible — they live in separate
commits with self-contained tests. The two reverted optimizations
(Phase 5's 1-input shortcut, Phase 9's asymmetric clone) have their
failure modes documented in-code and in
docs/fixes/2026-05-23-describe-affected-component-inheritance-perf.md
so future passes don't re-attempt the same approach without addressing
the underlying contract violations.
references
- Investigation doc with per-phase root cause, metric tables, and decision lessons: `docs/fixes/2026-05-23-describe-affected-component-inheritance-perf.md`
- Predecessor work that built the perf-instrumentation infrastructure this PR builds on: #1576 (heatmap visualization), #1611 (self-time vs total-time), #1622 (Docker perf fix + CPU Time / Parallelism), #1639 (5.2× faster execution + 92% memory reduction).
- Phase 1 (`--identity=false` gate) shipped separately in PR #2471 (v1.220.0-rc.1).
Summary by CodeRabbit
- Documentation
  - Added a detailed troubleshooting/performance guide for diagnosing slow "describe affected" runs with reproducible steps and optimization plan.
- Performance
  - Significant speedups via new caching, contention reduction, and short-circuit fast-paths for common/empty cases.
- Reliability
  - Improved cache correctness, mutation isolation, and error propagation to avoid races and unexpected panics.
- Tests / Chores
  - Expanded test coverage and added public cache-clear helpers for reliable isolation and regression verification.
fix(terraform): wire `--all` to `ExecuteTerraformAll` for dependency-ordered execution @thejrose1984 (#2486)
## What
Routes `atmos terraform plan --all` and `apply --all` through `ExecuteTerraformAll` so components actually execute in dependency (topological) order, as originally documented in the PR #1516 changelog and the DAG concurrency PRD.

Until this change, the dispatcher in `cmd/terraform/utils.go` routed all multi-component flags (`--all`, `--components`, `--query`) through `ExecuteTerraformQuery`, which walks components via Go map iteration: randomized order, with `settings.depends_on` ignored entirely. `ExecuteTerraformAll`, the function that builds the dependency graph and runs `TopologicalSort`, was reachable only from unit tests.
Why
Fixes #2485.
Users who configured settings.depends_on and ran atmos terraform apply --all were relying on a feature that didn't exist at the dispatch layer. Failures looked like Terraform errors (a component applied before its prereqs), not a missing-feature bug. The DAG concurrency PRD was authored on the assumption that this path already worked.
Changes
Dispatch
- `cmd/terraform/utils.go`: `info.All` now routes to `e.ExecuteTerraformAll(&info)`. `--components` / `--query` / bare `-s stack` continue to route to `ExecuteTerraformQuery` (no change).

ExecuteTerraformAll parity with ExecuteTerraformQuery
- `internal/exec/terraform_all.go`: ports `createQueryAuthManager` so YAML functions (e.g. `!terraform.state`) resolve credentials under `--all`. Mirrors the #2081 fix that already exists for `--query`.
- Drops the `info.Stack == ""` validation. The terraform-apply docs explicitly state `--all` without `-s` processes every stack, and that's the behavior users see today via `ExecuteTerraformQuery`. Keeping this PR non-breaking required matching that contract.
- Removes the now-unused `ErrStackRequiredWithAllFlag` from `errors/errors.go`.

Filter scope
- `applyFiltersToGraph` previously set `IncludeDependencies: true`, which would pull cross-stack prereqs into `--all -s <stack>`. Switched to `false` so the scope of `--all -s <stack>` is identical to today's behavior: components in the requested stack only, but now in topological order. A future opt-in flag can re-enable cross-stack execution.

Dry-run UX
- `executeNodeCommand` now emits `Would <subcmd> <component> in <stack> (dry run)` via `ui.Successf`, matching `processTerraformComponent`. Both multi-component paths produce the same user-facing dry-run output. (This also affects the `--affected` path, which had no integration tests asserting dry-run output; verified manually.)

Tests
- New integration test in `tests/test-cases/terraform-multi-component-flags.yaml` asserts the partial topological order (`vpc` before `eks/cluster`, `eks/karpenter` before `eks/karpenter-node-pool`, `eks/istio/base` before `eks/istio/istiod` before `eks/istio/test-app`) using the existing `terraform-apply-affected` fixture and regex with `(?s)`. The exact total order is an implementation detail of Kahn's-algorithm tie-breaking; this test only asserts the correctness invariant.
- `internal/exec/terraform_all_test.go` and `terraform_all_simple_test.go`: removed the "no stack specified" cases (the validation is gone) and updated `TestApplyFiltersToGraph_*` to match the new scope contract.
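For context on why only a partial order is asserted, here is a minimal sketch of Kahn's algorithm. It is not the `TopologicalSort` in Atmos: the `topoSort` name, the map-based graph, and the alphabetical tie-breaking are all assumptions for illustration, and the cycle error omits the cycle path that the real implementation reports.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// topoSort orders nodes so every dependency precedes its dependents.
// deps maps a component to the components it depends on.
func topoSort(deps map[string][]string) ([]string, error) {
	indegree := map[string]int{}
	dependents := map[string][]string{}
	for node, ds := range deps {
		if _, ok := indegree[node]; !ok {
			indegree[node] = 0
		}
		for _, d := range ds {
			if _, ok := indegree[d]; !ok {
				indegree[d] = 0
			}
			indegree[node]++
			dependents[d] = append(dependents[d], node)
		}
	}
	var ready []string
	for n, deg := range indegree {
		if deg == 0 {
			ready = append(ready, n)
		}
	}
	var order []string
	for len(ready) > 0 {
		sort.Strings(ready) // assumed tie-break: alphabetical, for determinism
		n := ready[0]
		ready = ready[1:]
		order = append(order, n)
		for _, m := range dependents[n] {
			indegree[m]--
			if indegree[m] == 0 {
				ready = append(ready, m)
			}
		}
	}
	// Nodes left with a non-zero indegree are part of a cycle.
	if len(order) != len(indegree) {
		return nil, errors.New("dependency cycle detected")
	}
	return order, nil
}

func main() {
	order, err := topoSort(map[string][]string{
		"eks/cluster":             {"vpc"},
		"eks/karpenter":           {"eks/cluster"},
		"eks/karpenter-node-pool": {"eks/karpenter"},
		"eks/external-dns":        {"eks/cluster"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(order)
}
```

Whenever several nodes are ready at once, any of them may legally run next, so a different tie-breaking rule produces a different but equally valid total order. Reversing the result gives the order a `destroy --all` run would use.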
Compatibility matrix
| Scenario | Before | After |
|---|---|---|
| `apply --all -s dev`, `depends_on` defined | Random order (bug) | Topological order |
| `apply --all -s dev`, no `depends_on` | Random order | Deterministic order |
| `apply --all` (no stack) | All stacks, random order | All stacks, topological order |
| `apply --all -s dev` with cross-stack `depends_on` | Cross-stack components ignored | In-stack topological order; cross-stack still out of scope (opt-in TBD) |
| `destroy --all -s dev` | Random order | Reverse topological order |
| `--all` with circular `depends_on` | Silently random | Hard error with cycle path |
| `apply --components vpc -s dev` | Unchanged | Unchanged |
| `apply --query '...' -s dev` | Unchanged | Unchanged |
| `apply -s dev` (no component, no flag) | Unchanged | Unchanged |
| `--all` with `!terraform.state` YAML function | Worked (via #2081) | Works (auth manager ported) |
| `--all` with per-component CI hooks | Worked (via #2475/#2397) | Works (hook flows through `executeNodeCommand`) |
Known follow-ups (not in this PR)
These are tracked in #2485 and intentionally out of scope for the dispatch fix:
- Parser: `DependencyParser` only reads the deprecated `settings.depends_on`. Should also read `dependencies.components` like `describe_affected_components.go` already does.
- Parser: only `component` + `stack` keys are recognized. `namespace` / `tenant` / `environment` / `stage` are documented but ignored.
- Errors: missing-target dependency errors are silently logged at `Warn` in `parseDependencyArray`. Should be surfaced.
- Concurrency: still sequential. The DAG concurrency PRD describes the planned ready-queue scheduler.
- Cross-stack scope opt-in: a `--include-cross-stack-dependencies` flag (or `settings.terraform.dependencies.cross_stack`) to re-enable the original `IncludeDependencies: true` behavior.
Test plan
- `go build ./...`
- `go vet ./internal/exec/... ./cmd/terraform/... ./errors/...`
- `go test ./internal/exec/ ./pkg/dependency/... ./cmd/terraform/... ./errors/... ./pkg/ui/... -short`: all green
- `go test ./tests -run 'TestCLICommands/terraform_plan_--all|TestCLICommands/terraform_plan_--query|TestCLICommands/terraform_plan_--components' -count=1`: all green
- New ordering test: `go test ./tests -run 'TestCLICommands/terraform_plan_--all_executes_in_dependency_order' -count=1`: green; output confirms `vpc → eks/cluster → eks/external-dns → eks/istio/base → eks/karpenter → eks/istio/istiod → eks/karpenter-node-pool → eks/istio/test-app`
- CI lint (`make lint`)
- Manual verification with `dependencies.components` (new format) once parser is extended in a follow-up
- Cross-stack dependency behavior with the deferred opt-in flag
References
- Closes #2485
- Original feature request: #1242 (closed as COMPLETED in #1516, but the routing was never wired up)
- Implementation PR: #1516
- DAG concurrency PRD: `docs/prd/dag-concurrent-execution.md`
🤖 Generated with Claude Code
Summary by CodeRabbit
- New Features
  - `terraform plan --all` and `apply --all` run components in dependency (topological) order; `destroy --all` runs in reverse.
  - `--all` may be used without a stack; `--all -s <stack>` scopes to that stack without pulling cross‑stack prerequisites.
  - Per‑component hooks and auth-aware YAML resolution are active during `--all` runs; dry‑run shows clear per‑component success messages.
- Tests
  - Expanded tests for `--all` ordering, scoping/filtering, auth wiring, dry‑run flows, and per‑component hook wiring.
- Documentation
  - New blog post describing `--all` behavior, caveats, and follow-ups.
fix(vendor): recover OCI pulls on auth rejection and surface rich errors @osterman (#2487)
## what
- Vendoring an OCI image (e.g. `oci://ghcr.io/...`) now auto-recovers when configured credentials are rejected (401 / 403 / `DENIED`) by retrying once with anonymous authentication.
- On successful recovery, emits `WARN OCI auth rejected, succeeded with anonymous fallback` and proceeds.
- Terminal pull failures now surface a rich error built with `errUtils.Build(errUtils.ErrPullImage)`, which preserves the original cause and attaches structured context (`image`, `registry`, `auth_attempted`, `status`) plus three self-contained remediation hints (Actions `packages: read`, `ATMOS_GITHUB_USERNAME` override, stale `~/.docker/config.json`).
- Bumps the chosen-auth log line from `Debug` → `Info` and the "GHCR token without username" branch from `Debug` → `Warn` so CI logs reveal misconfiguration without `--debug`.
- Non-auth errors (DNS, TLS, deadlines, 5xx) bypass retry because they need different remediation.
why
- Public test images on `ghcr.io` (e.g. `ghcr.io/cloudposse/atmos/tests/fixtures/components/terraform/mock:v0`) failed hard on Windows CI runners whose `GITHUB_TOKEN` lacked `packages: read` scope, because `pullImage` used the rejecting credentials unconditionally instead of falling back to anonymous.
- The previous error surface was a bare `DENIED: denied` with no auth source, no HTTP status, and no actionable hint. The vendor reporter collapsed it into an opaque tally, making the root cause untraceable.
- Part A of this work, granting `packages: read` in `.github/workflows/test.yml`, already landed; this is Part B, the code change so future users, different workflows, and private registries get a clean diagnosis and an automatic recovery for public images.
references
- Builds on PR #1647 (3-tier auth precedence).
- Mirrors the error-builder idiom from `pkg/provisioner/source/source.go:88-97`.
- Reuses the existing `errUtils.ErrPullImage` sentinel; no new sentinel introduced.
Summary by CodeRabbit
- Bug Fixes
  - Automatic fallback to anonymous OCI image pulls when authenticated requests are rejected (401/403 or “DENIED”), preserving original error causes
  - Richer diagnostic errors with contextual hints for troubleshooting image-pull failures
  - Warn when a GHCR token is present but no GitHub username is configured
- Tests
  - Expanded test coverage for image-pull auth fallback and varied failure scenarios
fix(terraform): preserve explicit identity and auth context for local runs @shirkevich (#2348)
## Problem
Local commands could still fall back to the default CI identity (`terraform-ci` / `gcp-wif`) even when the user explicitly selected a local identity such as `terraform` / `gcp-adc`.

This showed up in several related paths:
- `atmos terraform apply ... --identity terraform` and `-i terraform` could lose the explicit identity during Terraform argument reconstruction and then authenticate with the default identity.
- Local Terraform commands could run the CI hook path first, producing noisy `terraform-ci` WIF authentication errors even when the command later succeeded with the intended local identity.
- `atmos terraform output ... --format json -i terraform` bypassed the normal Terraform auth setup and called formatted output resolution without the active `AuthManager` / `AuthContext`.
- `!terraform.state` evaluations could receive stack info without the active auth manager, so nested state reads could still resolve backend credentials from the default identity instead of the explicit command identity.
- `atmos ansible playbook ... --identity terraform` did not propagate the selected identity into stack processing before YAML functions ran, so Ansible-driven components using `!terraform.state` could still try the default Terraform CI WIF identity.
Fixed Issues
- Preserves explicit Terraform `--identity` / `-i` values through both Cobra/flag-registry parsing and the legacy raw-arg parsing path.
- Normalizes Terraform identity edge cases consistently, including `--identity=`, `-i=`, and `--identity=false`.
- Keeps CLI-provided identity values ahead of default/profile-selected identities when Cobra reports the optional-value sentinel.
- Skips CI hook execution for normal local non-CI runs unless CI mode is explicitly forced, removing local `terraform-ci` preflight auth noise.
- Initializes Terraform auth for formatted `terraform output` and passes both `AuthContext` and `AuthManager` into output resolution.
- Propagates the active `AuthManager` from `ProcessComponentConfig` into `ConfigAndStacksInfo`, allowing `!terraform.state` to inherit the selected identity context.
- Adds shared component auth setup for non-Terraform command paths that need authenticated YAML functions.
- Adds Ansible-specific identity handling that supports long-form `--identity` / `ATMOS_IDENTITY` while deliberately leaving Ansible's `-i` shorthand reserved for inventory.
- Runs Ansible stack processing with the selected auth manager before YAML function evaluation, so Ansible playbooks using `!terraform.state` can use the requested identity.
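
The identity edge cases listed above can be pictured with a small raw-argument scanner. This is a sketch under stated assumptions, not the Atmos parser: `parseIdentity`, `classify`, and the `identityKind` values are hypothetical, an explicitly empty value is taken to mean "select interactively", and treating `--identity=false` as "disabled" is an assumption about its meaning.

```go
package main

import (
	"fmt"
	"strings"
)

type identityKind int

const (
	identityAbsent   identityKind = iota // flag not present
	identitySelect                       // flag present with no value: prompt the user
	identityDisabled                     // --identity=false (assumed meaning)
	identityNamed                        // an explicit identity name
)

type identityChoice struct {
	Kind identityKind
	Name string
}

// classify maps a raw flag value to a choice.
func classify(value string) identityChoice {
	switch value {
	case "":
		return identityChoice{Kind: identitySelect}
	case "false":
		return identityChoice{Kind: identityDisabled}
	default:
		return identityChoice{Kind: identityNamed, Name: value}
	}
}

// parseIdentity scans raw args for --identity / -i in both "flag value" and
// "flag=value" forms. A following token that starts with "-" is treated as
// another flag and is never consumed as the identity value.
func parseIdentity(args []string) identityChoice {
	for i, arg := range args {
		for _, flag := range []string{"--identity", "-i"} {
			if strings.HasPrefix(arg, flag+"=") {
				return classify(strings.TrimPrefix(arg, flag+"="))
			}
			if arg != flag {
				continue
			}
			if i+1 < len(args) && !strings.HasPrefix(args[i+1], "-") {
				return classify(args[i+1])
			}
			return identityChoice{Kind: identitySelect}
		}
	}
	return identityChoice{Kind: identityAbsent}
}

func main() {
	// -lock=false is a native Terraform flag, not an identity name.
	choice := parseIdentity([]string{"plan", "vpc", "-i", "-lock=false"})
	fmt.Println(choice.Kind == identitySelect) // true
}
```

An Ansible variant of this scanner would match only the long form, since `-i` stays reserved for inventory there.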
Verification
- Rebuilt the binary with `rtk proxy go build -o build/atmos .`
- Focused regression suite passed: `rtk go test ./cmd/ansible ./pkg/component/ansible ./pkg/flags ./internal/exec ./pkg/hooks -run 'TestAnsible|TestBuildConfigAndStacksInfo|TestGetLongIdentityFromArgs|TestProcessStacksWithAuth|TestParseGlobalFlags|TestSetupTerraformAuth|TestProcessComponentConfig_PropagatesAuthManager|TestProcessComponentConfig_AuthManagerGuardBranches|TestProcessCommandLineArgs_EmptyIdentityFlagIsExplicitSelect|TestProcessCommandLineArgs_TerraformIdentityFlag_Issue2392|TestProcessArgsAndFlags_IdentityFlag|TestRunCIHooks_LocalRunSkipsExperimentalGate|TestRunCIHooks_ForwardsErrorAndExitCode|TestRunCIHooks_NilAtmosConfig|TestRunCIHooks_ExperimentalDisableReturnsError'`
- Downstream local Terraform apply path was verified with explicit `--identity terraform`.
- Downstream formatted Terraform output path was verified with explicit `-i terraform --format json`.
- Downstream Ansible dry-run path was verified to select `identity "terraform"` / `provider=gcp-adc`; remaining OAuth access was environment/network dependent, not a fallback to `terraform-ci`.
Summary by CodeRabbit
- New Features
  - Added `-i` shorthand for `--identity` (supports `-i value`, `-i=value`, and explicit-empty `-i=` to trigger interactive selection).
- Improvements
  - More robust identity resolution across commands and env vars; explicit-empty identity is preserved.
  - Auth manager is now propagated into output and component command flows for consistent auth behavior.
  - Hook discovery avoids rendering templates; CI hooks skip when no CI provider detected.
- Bug Fixes
  - Fixed parsing so a following native flag (e.g., `-lock=false`) is not mis-consumed as an identity value.
- Tests
  - Expanded test coverage for identity parsing, auth propagation, hooks, and CI registry behavior.
fix(ci): fire CI hooks per-component in deploy --all mode @thejrose1984 (#2478)
## what
- Fixes `atmos terraform deploy --all` (and `--query`, `--components`, stack-without-component) producing only a single CI summary entry for the last component instead of one entry per component
- Adds `runCIHooksForDeployComponent` as the per-component hook for the `deploy` subcommand so `$GITHUB_STEP_SUMMARY` receives one entry per component with the correct component/stack context
- Wires `wasMultiComponentExecution` reset, error-defer guard, and `PostRunE` guard in `deploy.go`, the same three-site pattern applied to `plan` in #2430 and `apply` in #2475
why
In multi-component mode, `terraformRunWithOptions` routes to `ExecuteTerraformQuery` and sets `wasMultiComponentExecution = true`, but `deploy.go` had no guard on its `PostRunE` or error-path defer. This caused:
- `PostRunE` to fire once after all components completed, calling `RunCIHooks` with an empty output buffer and the last component's `info.Component` / `info.Stack`
- The error-path defer to double-fire when `--all` failed mid-walk (per-component hook already ran for the failed component)
- For stacks with N components, only 1 summary entry appeared instead of N
references
Closes #2476
Related: #2397 (plan fix), #2475 (apply fix)
Summary by CodeRabbit
-
Bug Fixes
- Prevent duplicate error-hook execution during multi-component deployments.
- Ensure per-run state is reset before early exits so deferred error hooks and post-run logic behave consistently.
- Run CI hooks per-component for deploys to preserve component output and forward correct exit codes.
-
Tests
- Added tests for per-component CI hook behavior, suppression of post-run logic in multi-component deploys, defer-guard behavior, and exit-code forwarding.
[codex] Fix verifier auto-install and cosign bundles @osterman (#2481)
## what
- Resolve verifier auto-installs to concrete registry versions before bootstrapping, instead of falling back to literal `latest`.
- Add platform-aware installer helpers and regression coverage for Windows verifier asset URLs across cosign, slsa-verifier, gh, and minisign.
- Combine cosign `opts` with downloaded sidecars like `--bundle` so Trivy checksum signature verification works with Aqua metadata.
why
- Windows CI was failing because cosign release assets exist under `v...` tags and the bootstrap path could construct invalid release URLs.
- Trivy verification failed on macOS because Atmos dropped the sigstore bundle whenever cosign options were present, producing an incomplete `cosign verify-blob` command.
- The added tests cover the failing Windows URL rendering path and the Trivy-shaped checksum signature command.
references
- Fixes the verifier install failure introduced by package verification.
- Validated with `go test ./pkg/toolchain/installer ./pkg/toolchain/registry/aqua ./pkg/toolchain/verification`, `go test ./pkg/toolchain`, pre-commit hooks, and a live `go run . toolchain install aquasecurity/trivy@v0.70.0`.
Summary by CodeRabbit
- New Features
  - Signature verification now supports bundle sidecars.
  - Enhanced cross-platform asset resolution, including better Windows ARM and Rosetta2 handling and per-platform overrides.
  - Improved verifier bootstrap resolution with additional fallback behavior.
- Bug Fixes
  - Corrected Windows executable extension handling across target platforms.
- Tests
  - Added tests for Windows asset URL generation, verifier version resolution failures, and cosign bundle sidecar integration.