cloudposse/atmos v1.220.0-rc.2 on GitHub

feat(hooks): add hook kinds, scanner integrations, SARIF summaries, and skip controls @osterman (#2482)

## what

Adds a kind discriminator to the hook system with built-in kinds for store, command, infracost, checkov, trivy, and kics.
Adds the generic command hook engine, including toolchain-aware binary resolution, live stdout/stderr passthrough, templated args/env support, ATMOS_* runtime env vars, output-file/output-dir side channels, and configurable on_failure behavior.
Adds scanner and cost integrations:
- infracost parses JSON breakdown output into a markdown cost summary.
- checkov, trivy, and kics emit SARIF and share one parser/markdown renderer.
Adds normalized SARIF handling for severity counts, linked rule IDs via helpUri, short descriptions, file/line locations, and empty-result handling.
Adds hook dependency preflight: component dependencies.tools are installed before hooks run, toolchain paths take precedence over operator PATH, and missing hook binaries fail before Terraform starts.
Adds a curated embedded Atmos tool registry with a KICS override so dependencies.tools.kics can install from release tarballs.
Adds pkg/cacerts and wires Checkov to SSL_CERT_FILE / REQUESTS_CA_BUNDLE so PyInstaller-bundled Checkov can use the host CA bundle.
Adds the --skip-hooks global flag and ATMOS_SKIP_HOOKS, supporting skip-all and comma-separated named-hook skipping.
Preserves backward compatibility by accepting legacy command: store hook configs as kind: store.
Makes hooks work with resolved component workdirs so scanners inspect the same directory Terraform uses.
Adds runnable examples for infracost, checkov, trivy, kics, and custom kind: command hooks.
Updates hook docs, global flag docs, PRDs, roadmap data, and adds the custom hooks blog post.
Refreshes CLI help snapshots and updates CI workflow actions/shell handling needed by the branch.
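
The severity counting described above for SARIF output can be sketched as follows. This is a minimal illustration, not the Atmos parser: the `sarifLog` type, `countByLevel`, and the sample document are hypothetical, and the sketch assumes counts are keyed on the standard SARIF `level` field of each result, with a missing level treated as `warning`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sarifLog models only the SARIF 2.1.0 fields needed to count results.
type sarifLog struct {
	Runs []struct {
		Results []struct {
			Level string `json:"level"`
		} `json:"results"`
	} `json:"runs"`
}

// countByLevel tallies results per SARIF level ("error", "warning", "note", "none").
// A result without a level is counted as "warning" in this sketch.
func countByLevel(data []byte) (map[string]int, error) {
	var log sarifLog
	if err := json.Unmarshal(data, &log); err != nil {
		return nil, fmt.Errorf("parse SARIF: %w", err)
	}
	counts := map[string]int{}
	for _, run := range log.Runs {
		for _, res := range run.Results {
			level := res.Level
			if level == "" {
				level = "warning"
			}
			counts[level]++
		}
	}
	return counts, nil
}

// sampleSARIF is made-up input for illustration, not real scanner output.
const sampleSARIF = `{
  "version": "2.1.0",
  "runs": [{"results": [
    {"ruleId": "EXAMPLE-1", "level": "error"},
    {"ruleId": "EXAMPLE-2", "level": "warning"},
    {"ruleId": "EXAMPLE-3"}
  ]}]
}`

func main() {
	counts, err := countByLevel([]byte(sampleSARIF))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("errors=%d warnings=%d\n", counts["error"], counts["warning"])
}
```

An empty `results` array yields an empty count map, which is the case a renderer would need to handle when a scan finds nothing.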

why

Hooks were already the right lifecycle surface for component automation, but only store had first-class dispatch behavior.
Security scanners, cost estimators, and custom tools should run from stack config without wrapper scripts or GitHub Actions glue.
Named kinds provide zero-config defaults for common tools, while kind: command keeps the system open for arbitrary binaries.
Tool auto-install and preflight failures make examples and CI usage reproducible instead of relying on whatever happens to be on PATH.
SARIF and infracost summaries create a common typed output path for terminal rendering now and Atmos Pro upload later.

notes

This PR intentionally does not add a built-in tfsec kind or hooks-tfsec example. Trivy is the maintained Aqua-backed scanner path; legacy tfsec users can still wire it with kind: command.
Atmos Pro upload, cross-run SARIF aggregation, Terraform component dependency auto-install outside hooks, and planfile threading remain follow-up work.
The CodeRabbit-generated release notes below are preserved as-is.

Summary by CodeRabbit

New Features
- Pluggable hook "kinds" (infracost, trivy, checkov, kics) plus a generic command kind; structured side‑channel outputs, markdown rendering, preflight tool resolution/auto‑install, CA‑bundle propagation, and per‑invocation hook skipping via --skip-hooks / ATMOS_SKIP_HOOKS.
Documentation
- Detailed PRDs, reference docs, CLI help, blog post, and runnable examples for each hook kind.
Tests
- Expanded unit and integration tests covering hooks, result handlers, SARIF parsing, toolchain registry, and examples.

feat(ci): auto-detect log level from GitHub Actions debug mode @osterman (#2495)

## what

Atmos now auto-detects when a workflow is running with GitHub Actions debug logging enabled and switches its own log level to Debug for the run.
Triggered when ci.enabled: true is set in atmos.yaml and the active CI provider reports debug mode is on. For GitHub Actions, that means ACTIONS_RUNNER_DEBUG=true or ACTIONS_STEP_DEBUG=true — exactly what the built-in "Re-run with debug logging" button sets.
Emits a single Info-level log line when it fires so users see why their output got louder: CI provider debug mode detected — using Debug log level for this run provider=github-actions from=Info.
Built on a provider-agnostic optional interface — provider.DebugModeDetector { IsDebugMode() bool } in pkg/ci/internal/provider, plus a generic registry helper ci.DetectDebugMode() DebugModeInfo. The GHA provider implements the interface; cmd/root.go imports only pkg/ci and names no GHA-specific env vars.
Auto-detection overrides --logs-level, ATMOS_LOGS_LEVEL, and logs.level in atmos.yaml — the CI-side debug toggle is set at the repo/workflow level by the runner itself and is treated as the higher-priority signal (including over an explicit Trace or Off).
Ships with a new framework PRD (docs/prd/native-ci/framework/debug-mode-promotion.md), a changelog blog post, a roadmap milestone under the Native CI initiative, and unit tests covering: the GHA IsDebugMode() env-var matrix, the generic DetectDebugMode() type-assertion path, and the cmd-side helper's gates and override semantics.
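
The optional-interface pattern described above can be sketched as follows. The `DebugModeDetector` interface and the two environment variables come from the description; everything else (`Provider`, `githubActions`, `plainProvider`, `detectDebugMode`, `effectiveLogLevel`, and the injected `getenv`) is hypothetical and only illustrates the type-assertion path and the override semantics. Exact string matching on `"true"` is an assumption.

```go
package main

import "fmt"

// DebugModeDetector is the optional interface named in the PR description.
type DebugModeDetector interface {
	IsDebugMode() bool
}

// Provider is a hypothetical minimal CI provider interface for this sketch.
type Provider interface {
	Name() string
}

// githubActions is a hypothetical provider. getenv is injected so the sketch
// can be exercised without touching the real process environment.
type githubActions struct {
	getenv func(string) string
}

func (g githubActions) Name() string { return "github-actions" }

func (g githubActions) IsDebugMode() bool {
	return g.getenv("ACTIONS_RUNNER_DEBUG") == "true" ||
		g.getenv("ACTIONS_STEP_DEBUG") == "true"
}

// plainProvider does not implement DebugModeDetector at all.
type plainProvider struct{}

func (plainProvider) Name() string { return "plain" }

// detectDebugMode uses a type assertion, so callers never name a
// provider-specific environment variable.
func detectDebugMode(p Provider) bool {
	d, ok := p.(DebugModeDetector)
	return ok && d.IsDebugMode()
}

// effectiveLogLevel promotes to Debug when CI integration is enabled and the
// provider reports debug mode, overriding whatever level was configured.
func effectiveLogLevel(ciEnabled bool, p Provider, configured string) string {
	if ciEnabled && detectDebugMode(p) {
		return "Debug"
	}
	return configured
}

// envFrom builds a getenv function backed by a map.
func envFrom(m map[string]string) func(string) string {
	return func(key string) string { return m[key] }
}

func main() {
	gha := githubActions{getenv: envFrom(map[string]string{"ACTIONS_STEP_DEBUG": "true"})}
	fmt.Println(effectiveLogLevel(true, gha, "Info"))              // promoted
	fmt.Println(effectiveLogLevel(false, gha, "Info"))             // ci.enabled is off
	fmt.Println(effectiveLogLevel(true, plainProvider{}, "Trace")) // no detector
}
```

A provider that does not implement the interface is simply never promoted, which is why adding the behavior to another provider is a single method.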

why

Debugging Atmos in CI is usually just as important as debugging the workflow around it. GitHub provides a single "Re-run with debug logging" button to make every tool in the run verbose; today Atmos ignores it, so users get a noisier runner but the same quiet Atmos output — and have to remember a per-tool dance (ATMOS_LOGS_LEVEL=Debug somewhere in workflow YAML).
The interface-based design keeps the startup path provider-agnostic, so adding the same behavior to a future CI provider is one method on the provider — no changes in cmd/ or pkg/ci needed.
Overriding explicit --logs-level / ATMOS_LOGS_LEVEL is intentional: the CI-side toggle is an explicit, repo-/workflow-level "make everything noisy" signal that should beat per-invocation flags in the same run.
This is the same gap other GitHub-published tools have hit, e.g. pypa/gh-action-pypi-publish#322, which validates the pattern.

references

GitHub docs: Enable debug logging
GitHub changelog: Re-run jobs with debug logging
Prior art in another ecosystem: pypa/gh-action-pypi-publish#322
New PRD: docs/prd/native-ci/framework/debug-mode-promotion.md

Summary by CodeRabbit

New Features
- Atmos now auto-promotes its log level to Debug when running on GitHub Actions with per-run debug logging enabled and ci.enabled: true; an informational startup log notes the promotion and it overrides other log-level settings.
Documentation
- Added product doc and blog post explaining debug-mode promotion and usage.
Tests
- Added unit tests covering debug-mode detection and promotion behavior.
Refactor
- Minor CI hook wiring cleanup in multi-component Terraform runs.

fix(ci): restore checks: write on lint job for reviewdog annotations @osterman (#2500)

## what

Add a job-scoped permissions: block on the lint ([lint] <demo-folder>) job in .github/workflows/test.yml granting contents: read + checks: write so reviewdog/action-tflint@v1 can post inline tflint findings on PRs via the GitHub Checks API.

Companion to #2499, which restored security-events: write on the docker ([lint] Dockerfile) job. Same root cause, second affected job.

why

PR #2487 introduced the first workflow-level permissions: block on test.yml to grant packages: read for ghcr.io OCI pulls. A workflow-level permissions: block replaces (not extends) the default GITHUB_TOKEN scope for every job in the file, which silently stripped the inherited checks: write that the lint job relied on.
Effect on contributors: since #2487 merged, tflint findings on PRs touching examples/<demo-folder>/components/terraform have stopped appearing as inline check annotations. The job itself still exits with the right code (fail_level: error controls that), but reviewers lost the per-line context. This restores that behavior.
Job-scoped (least privilege) over widening the workflow-level block — only this one job uses reviewdog. Matches the convention used in .github/workflows/codeql.yml and the docker-job fix already landed in #2499.
Not adding pull-requests: write: reviewdog's default github-pr-check reporter posts check runs (which need checks: write), not review comments. checks: write alone is sufficient.

references

Regression introduced by #2487 (2437e13bf, "fix(vendor): recover OCI pulls on auth rejection and surface rich errors").
Companion fix already merged: #2499 (fix(ci): restore security-events: write for Dockerfile lint SARIF upload).

fix(ci): restore security-events: write for Dockerfile lint SARIF upload @aknysh (#2499)

## what

Restores security-events: write permission for the [lint] Dockerfile
job in .github/workflows/test.yml so its hadolint SARIF results can
be uploaded to GitHub Code Scanning.

Adds a job-level permissions: block to the docker job:

permissions:
  contents: read           # actions/checkout
  security-events: write   # github/codeql-action/upload-sarif

contents: read is re-listed because a job-level permissions: block
fully overrides the workflow-level set (rather than merging).

why

PR #2487 (2437e13bf) added a top-level permissions: block to the
workflow to grant packages: read for ghcr.io pulls:

permissions:
  contents: read
  packages: read

In GitHub Actions, a workflow-level permissions: block replaces
(not extends) the default GITHUB_TOKEN scope for every job in the
file. That replacement inadvertently stripped the implicit
security-events: write that the [lint] Dockerfile job relied on to
upload hadolint SARIF results via github/codeql-action/upload-sarif@v4.
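
The replace-not-merge behavior can be modeled in a few lines. This is a simplified sketch of the resolution rule, not GitHub's implementation: `Permissions` and `effective` are hypothetical names, and the `defaults` map used below is illustrative only, since real default token scopes depend on repository and organization settings.

```go
package main

import "fmt"

// Permissions maps a scope (e.g. "contents") to an access level ("read" or
// "write"). A scope missing from the map means no access.
type Permissions map[string]string

// effective returns the token permissions for one job. A nil map means the
// block was not declared at that level. The most specific declared block
// wins outright; blocks are never merged.
func effective(defaults, workflow, job Permissions) Permissions {
	switch {
	case job != nil:
		return job
	case workflow != nil:
		return workflow
	default:
		return defaults
	}
}

func main() {
	// Illustrative defaults only.
	defaults := Permissions{"contents": "write", "checks": "write", "security-events": "write"}
	workflow := Permissions{"contents": "read", "packages": "read"}
	job := Permissions{"contents": "read", "security-events": "write"}

	fmt.Printf("no blocks:      %q\n", effective(defaults, nil, nil)["security-events"])
	fmt.Printf("workflow block: %q\n", effective(defaults, workflow, nil)["security-events"])
	fmt.Printf("job block:      %q\n", effective(defaults, workflow, job)["security-events"])
}
```

The middle case is the regression: once the workflow-level block exists, `security-events` resolves to no access for every job that does not declare its own block.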

Every post-merge run on main has been failing the Upload SARIF
file step since #2487:

##[warning]This run of the CodeQL Action does not have permission to
access the CodeQL Action API endpoints. ... please ensure the workflow
has at least the 'security-events: read' permission.
##[error]Resource not accessible by integration -
https://docs.github.com/rest

Failing run for reference:
https://github.com/cloudposse/atmos/actions/runs/26339160817/job/77562841194

Note: hadolint itself ran successfully in the failing run — the SARIF
output contained zero findings. Only the upload step failed.

A job-level fix (this PR) is preferred over expanding the workflow-level
permissions block, because it follows least-privilege: only the one job
that actually needs to write security events gets the elevated scope.

references

Failing CI run: https://github.com/cloudposse/atmos/actions/runs/26339160817/job/77562841194
Regressing PR: #2487
GitHub Actions permissions docs:
https://docs.github.com/en/actions/security-for-github-actions/security-guides/automatic-token-authentication#permissions-for-the-github_token
github/codeql-action/upload-sarif permission requirement:
https://docs.github.com/en/code-security/code-scanning/integrating-with-code-scanning/uploading-a-sarif-file-to-github#uploading-the-sarif-file-to-github

Summary by CodeRabbit

Chores
- Fixed permissions configuration in the CI/CD pipeline to restore security scanning capabilities in automated testing workflows.

🚀 Enhancements

fix(toolchain): retry cosign on transient Sigstore Rekor failures @osterman (#2506)

## what

Wrap the cosign verify-blob exec in pkg/toolchain/verification/signature.go with bounded exponential backoff, retrying only on a narrow allowlist of transient Sigstore Rekor failures.
Add a new errUtils.ErrSignatureRetryable sentinel next to the existing ErrDownloadRetryable, and a classifyCosignError helper that joins the sentinel into cosign errors when the combined output matches a Rekor-flake marker.
New runCosignWithRetry uses the same retry budget as the existing downloader (5 attempts, 1s → 10s exponential). Logs a WARN before each retry so CI logs surface the upstream-service context.

why

cosign verify-blob sometimes fails not because of a real signature problem but because Sigstore's Rekor transparency-log API returns a short-window upstream error. The most common signature is:
```
Error: searching log query: [POST /api/v1/log/entries/retrieve][400] searchLogQueryBadRequest
  {"code":400,"message":"verifying signature: ecdsa: Invalid IEEE_P1363 encoded bytes"}
```
The same artifact verifies cleanly seconds later. Without retry, this turns a transient Sigstore outage into a hard "tool not found" install failure for every Atmos user pulling toolchain assets during the outage window. We hit this twice in 48h on the Windows mock jobs alone.
The retry allowlist is intentionally narrow — searchLogQueryBadRequest, Invalid IEEE_P1363 encoded bytes, and Rekor's /api/v1/log/entries/retrieve endpoint paired with a 5xx status. Anything outside the allowlist (tampering, expired cert, identity mismatch, missing signature) surfaces immediately on the first attempt. Blanket-retrying signature verification would mask real tampering events and is the canonical anti-pattern; we do not do that here.
Mirrors the existing pattern in pkg/toolchain/installer/download.go (sentinel + retry.WithPredicate + classifier) so the toolchain's resilience story stays consistent across download and verification.
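
A minimal sketch of the classifier-plus-bounded-retry shape described above. It is not the code in pkg/toolchain/verification/signature.go: `errSignatureRetryable`, `classifyCosignError`, and `runWithRetry` are stand-ins that borrow the names from the description, the `rekor5xx` pattern assumes the status appears as `][5xx]` right after the endpoint path, and `verify` and `sleep` are injected so the sketch runs without cosign.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
	"time"
)

// errSignatureRetryable stands in for the ErrSignatureRetryable sentinel.
var errSignatureRetryable = errors.New("signature verification retryable")

// errCosignExit simulates a non-zero cosign exit for the examples.
var errCosignExit = errors.New("exit status 1")

// rekor5xx matches the Rekor retrieve endpoint paired with a 5xx status
// (assumed output format).
var rekor5xx = regexp.MustCompile(`/api/v1/log/entries/retrieve\]\[5\d\d\]`)

// classifyCosignError joins the retryable sentinel into err only when the
// combined output matches the narrow allowlist of transient Rekor markers.
func classifyCosignError(err error, output string) error {
	if err == nil {
		return nil
	}
	transient := strings.Contains(output, "searchLogQueryBadRequest") ||
		strings.Contains(output, "Invalid IEEE_P1363 encoded bytes") ||
		rekor5xx.MatchString(output)
	if transient {
		return errors.Join(errSignatureRetryable, err)
	}
	return err
}

func isRetryable(err error) bool { return errors.Is(err, errSignatureRetryable) }

func noSleep(time.Duration) {}

// runWithRetry calls verify up to 5 times. The delay doubles from 1s and is
// capped at 10s. Non-retryable errors are returned on the first attempt.
func runWithRetry(verify func() (string, error), sleep func(time.Duration)) (int, error) {
	const maxAttempts = 5
	delay, maxDelay := time.Second, 10*time.Second
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		output, runErr := verify()
		err = classifyCosignError(runErr, output)
		if err == nil || !isRetryable(err) || attempt == maxAttempts {
			return attempt, err
		}
		sleep(delay)
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	return maxAttempts, err
}

func main() {
	calls := 0
	flaky := func() (string, error) {
		calls++
		if calls < 3 {
			return "searchLogQueryBadRequest", errCosignExit
		}
		return "Verified OK", nil
	}
	attempts, err := runWithRetry(flaky, noSleep)
	fmt.Println("attempts:", attempts, "err:", err)
}
```

Keeping the allowlist inside the classifier means a tampered artifact or identity mismatch never reaches the retry loop's sleep path.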

references

Failing run that prompted this: https://github.com/cloudposse/atmos/actions/runs/26368136271/job/77615786736 (Windows [mock-windows] demo-component-versions, Rekor returned 400 on a valid OpenTofu 1.9.1 release signature).
Same flake also hit [mock-windows] demo-vendoring in the same run.
Companion CI permissions fix: #2499 (merged) and #2500 (open).

Summary by CodeRabbit

Improvements
- Signature verification now automatically retries on transient Sigstore Rekor service issues with bounded exponential backoff for greater resilience.
- Error handling improved to distinguish retryable transient failures from permanent signature verification errors, reducing false failures.
Tests
- Added unit tests covering classification of transient Rekor failures and retry behavior through the public verification path.

fix(terraform): skip non-terraform and deleted components in `atmos terraform plan/apply --affected` @thejrose1984 (#2484)

## what

Filters the affected component list in atmos terraform plan/apply --affected so the command no longer runs against helmfile components, packer components, or components deleted in HEAD.

why

Reported in #2361. getAffectedComponents returns every affected component regardless of type — helmfile, packer, and BASE-only deletions are all included — and ExecuteTerraformAffected was iterating that full list and calling ExecuteTerraform on each, producing output like:

INFO  Executing command="atmos terraform apply example-terraform -s example"
INFO  Executing command="atmos terraform apply example-helmfile -s example"
INFO  Executing command="atmos terraform apply example-packer -s example"

Documentation (website/docs/cli/commands/terraform/usage.mdx) describes --affected as executing the command "on all the directly affected components," with the implicit constraint that those components belong to the atmos terraform subcommand. Helmfile and Packer subcommands would orchestrate their own components, and deleted components have no on-disk module so terraform plan/apply against them either errors or no-ops.

how

New package-private helper filterTerraformAffected keeps only items where ComponentType == cfg.TerraformComponentType and !Deleted. Called once in ExecuteTerraformAffected after getAffectedComponents, before addDependentsToAffected (which is expensive and shouldn't run for items we will drop).
Defense-in-depth filter in executeTerraformAffectedComponentInDepOrder: when --include-dependents is set, any dependent with ComponentType explicitly set to a non-terraform value is skipped during the recursion.
atmos describe affected is unchanged. It still reports the full affected set (terraform + helmfile + packer + deleted) as the canonical introspection view. The filter is scoped to the execution path of atmos terraform <cmd> --affected.
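
The filter predicate described above can be sketched as follows. The `affected` struct is a hypothetical, trimmed-down stand-in for the real affected-component record, and `terraformComponentType` stands in for cfg.TerraformComponentType; the sketch implements only the stated rule (terraform type and not deleted) and does not model any special handling of empty component types.

```go
package main

import "fmt"

// terraformComponentType stands in for cfg.TerraformComponentType.
const terraformComponentType = "terraform"

// affected is a hypothetical stand-in for the affected-component record.
type affected struct {
	Component     string
	ComponentType string
	Deleted       bool
}

// filterTerraformAffected keeps only terraform components that still exist.
func filterTerraformAffected(items []affected) []affected {
	kept := make([]affected, 0, len(items))
	for _, item := range items {
		if item.ComponentType == terraformComponentType && !item.Deleted {
			kept = append(kept, item)
		}
	}
	return kept
}

func main() {
	items := []affected{
		{Component: "example-terraform", ComponentType: "terraform"},
		{Component: "example-helmfile", ComponentType: "helmfile"},
		{Component: "example-packer", ComponentType: "packer"},
		{Component: "removed-component", ComponentType: "terraform", Deleted: true},
	}
	for _, item := range filterTerraformAffected(items) {
		fmt.Println(item.Component)
	}
}
```

Running the filter before any dependents are added keeps the expensive dependent expansion away from items that would be dropped anyway.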

tests

New file internal/exec/terraform_affected_filter_test.go: 8 portable table cases covering the filter (no gomonkey, runs on every CI matrix entry). Includes a case mirroring the exact #2361 reproducer fixture. 100% line coverage of the new helper.
Three new cases added to TestExecuteTerraformAffectedComponentInDepOrder table in internal/exec/terraform_utils_test.go: helmfile dependent skipped, packer dependent skipped, mixed-type dependents (only the terraform one runs).

compatibility

Bug fix only. The previous behavior produced incorrect commands that would fail mid-execution (terraform-apply against a helmfile component errors out). No public Go API or CLI surface changes. The most visible user-facing shift is an exit-code flip from non-zero to zero when a changeset contains only non-terraform or deleted components — which now correctly reports "No components affected" instead of erroring partway through.

Closes #2361.

Summary by CodeRabbit

Bug Fixes
- Terraform execution now excludes non-Terraform dependents (e.g., Helmfile, Packer), skips deleted components, and treats empty component types as Terraform to preserve compatibility; Terraform-only ordering is preserved to avoid unnecessary processing.
Tests
- Added tests for filtering behavior, in-place compaction semantics, non-Terraform exclusion, and a regression ensuring only Terraform entries are executed.

perf(exec/merge/utils): optimize describe affected for large-stack workloads (~50% local wall-clock, projected ~10× on 2-core CI) @aknysh (#2496)

## what

Performance optimization arc for atmos describe affected on large-stack workloads.
Targets the inheritance + merge + YAML-parse hot paths that have grown since the
October 2025 perf instrumentation arc (PRs #1576/#1611/#1622/#1639) added the
auth, profiles, locals, and per-file position-tracking subsystems.

Twelve shipped phases (with two attempted-and-reverted optimizations documented
in-code so they aren't re-tried without addressing the underlying contract
violations):

Phase 2 — cacheBaseComponentConfig switched from RWMutex + map to
sync.Map; deep-copy moved outside the critical section. Eliminates write-lock
contention that serialized every cache write across goroutines and padded
apparent CPU time with lock-wait.
Phase 3 — WalkAndDeferYAMLFunctions short-circuits when the subtree
contains no Atmos YAML functions (!template, !terraform.*, !store*,
!exec, !env). Returns the input map as-is instead of allocating a deep
copy at every recursion level.
Phase 4 — extractLocalsFromRawYAML cached via sync.Map keyed by
filePath + FNV-1a(yamlContent). The content-hash component prevents
test pollution when the same logical file path is reused with different
content. Also fixes a pre-existing data race in
extractAndAddLocalsToContext (shallow-clones the input context map
before file-scoped delete + assign).
Phase 5 — MergeWithDeferred 0-input fast path: when every layer is
empty, return an empty map immediately without walking or merging. The
1-input shortcut originally shipped alongside was REVERTED on
2026-05-24 after CI surfaced a regression — see "What was tried and
reverted" below.
Phase 6 — parsedYAMLCache switched to sync.Map; deep-copy of
yaml.Node + PositionMap moved outside the critical section. Same lock
contention pattern as Phase 2 applied to the YAML parser cache.
Phase 7 — processCustomTags split into outer + inner functions: the
hasCustomTags pre-check runs once at the entry point instead of
re-walking the subtree at every recursive call (O(N×depth) → O(N)).
Phase 8 — new decodedYAMLCache stores the post-Decode + post-Intern
result of UnmarshalYAMLFromFileWithPositions[map[string]any]. Skips
yaml.Node.Decode + InternStringsInMap on every repeat call for the
same (file, content hash) pair.
Phase 10 — processYAMLNode split into outer + inner (Phase 7
pattern). Removes per-recursion perf.Track overhead from the recursive
YAML walker used by yq evaluation. Same pattern was tried for
WalkAndDeferYAMLFunctions and reverted: the inner-only walker had to
allocate unconditionally on every recursion, regressing function-sparse
subtrees more than the perf.Track savings recovered.
Phase 11 — processTerraformRemoteStateBackend extracts the
backend-type-specific map from each input first (via the new
extractBackendTypeMap helper), then merges just those two scoped maps.
Avoids deep-copying unrelated backend-type entries
(s3/gcs/azurerm/etc.) just to extract one key from the merged result.
Phase 12 + 13 — deepCopyBaseComponentConfigMaps guards every
m.DeepCopyMap call with len(src.Field) > 0. Skips function-call and
allocation overhead for the empty-field case that dominates real
workloads (most components leave several of the 10 fields empty).
Coverage — new tests close the gaps the arc introduced. Public
Clear*Cache wrappers, extractBackendTypeMap type-mismatch path, and
the Phase 12/13 empty-field contract are all now covered.
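
The caching pattern shared by Phases 2, 4, and 6 can be sketched as follows: a sync.Map keyed by file path plus an FNV-1a content hash, with deep copies made outside any lock. All names here (`localsCache`, `cacheKey`, `deepCopy`) are hypothetical, and the 64-bit hash width and key format are assumptions, not details taken from the PR.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
	"sync/atomic"
)

// cacheKey combines the path with an FNV-1a hash of the content, so the same
// path with different content maps to a different entry.
func cacheKey(filePath, content string) string {
	h := fnv.New64a()
	h.Write([]byte(content)) // Write on a hash.Hash never returns an error
	return fmt.Sprintf("%s#%016x", filePath, h.Sum64())
}

// deepCopy recursively copies nested maps and slices; scalars are returned as-is.
func deepCopy(v any) any {
	switch t := v.(type) {
	case map[string]any:
		out := make(map[string]any, len(t))
		for k, val := range t {
			out[k] = deepCopy(val)
		}
		return out
	case []any:
		out := make([]any, len(t))
		for i, val := range t {
			out[i] = deepCopy(val)
		}
		return out
	default:
		return v
	}
}

// localsCache stores parsed results. sync.Map avoids a single write lock
// that would serialize every cache write across goroutines.
type localsCache struct {
	entries sync.Map     // cache key -> map[string]any
	parses  atomic.Int64 // number of cache misses, for illustration
}

// get returns a caller-owned copy of the parsed content, parsing on a miss.
func (c *localsCache) get(filePath, content string, parse func(string) map[string]any) map[string]any {
	key := cacheKey(filePath, content)
	if cached, ok := c.entries.Load(key); ok {
		return deepCopy(cached).(map[string]any)
	}
	c.parses.Add(1)
	parsed := parse(content)
	c.entries.Store(key, deepCopy(parsed))
	return parsed
}

func main() {
	cache := &localsCache{}
	parse := func(content string) map[string]any {
		return map[string]any{"raw": content}
	}
	cache.get("stacks/dev.yaml", "locals: {a: 1}", parse)
	cache.get("stacks/dev.yaml", "locals: {a: 1}", parse) // same content: hit
	cache.get("stacks/dev.yaml", "locals: {a: 2}", parse) // new content: miss
	fmt.Println("parses:", cache.parses.Load())
}
```

Two goroutines that miss at the same moment may both parse; the sketch accepts that duplicated work in exchange for never holding a lock during the copy.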

Phase 1 was the auth credential-store fix that shipped separately as
PR #2471 (in v1.220.0-rc.1).

What was tried and reverted

Phase 5 1-input shortcut. Initially returned the walked single input
directly without going through Merge. CI surfaced
TestSpaceliftStackProcessor losing 7 stacks (47→40) because
WalkAndDeferYAMLFunctions's Phase 3 short-circuit returned the input
map as-is, and the 1-input shortcut handed that shared reference back
to the caller. Downstream mergeComponentConfigurations mutated the
result while building the per-component output, which corrupted the
upstream cached BaseComponentSettings / GlobalSettings for sibling
components. Fix: keep only the 0-input fast path; let 1-input fall
through to the regular merge pipeline which deep-copies via
MergeWithOptions → DeepCopyMap. Regression test
(TestMergeWithDeferred_TrivialInputShortCircuits/mutating the result does not mutate the input) added to prevent re-attempts.
Phase 9 asymmetric clone. The attempt shared settings/vars/env
references from the locals cache and deep-copied only locals. It
failed for the same class of reason, surfacing at scale as
fatal error: concurrent map iteration and map write.
processTemplatesInSection returns the input map as-is when the
section has no {{, so the cached references ended up in the shared
template context and got mutated by sibling goroutines.
Reverted in b11f3cd9b; documented in-code.

Both failures share the same lesson: any optimization that hands a
shared reference back to a caller has to be matched against the
end-to-end mutation surface, not just the immediate caller. The Merge
contract — "result is a fresh, caller-mutable map" — must be upheld.
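
The hazard behind both reverts can be reproduced in a few lines. This is a deliberately simplified sketch: `mergeShared` and `mergeFresh` are hypothetical names, and the merge shown is a top-level key overwrite rather than a real deep merge. It only demonstrates why a fast path that returns its input breaks the fresh-map contract.

```go
package main

import "fmt"

// deepCopy recursively copies nested maps and slices.
func deepCopy(v any) any {
	switch t := v.(type) {
	case map[string]any:
		out := make(map[string]any, len(t))
		for k, val := range t {
			out[k] = deepCopy(val)
		}
		return out
	case []any:
		out := make([]any, len(t))
		for i, val := range t {
			out[i] = deepCopy(val)
		}
		return out
	default:
		return v
	}
}

// mergeFresh always builds a new map, so the result is safe to mutate.
func mergeFresh(inputs ...map[string]any) map[string]any {
	out := map[string]any{}
	for _, in := range inputs {
		for k, v := range in {
			out[k] = deepCopy(v)
		}
	}
	return out
}

// mergeShared adds the unsafe 1-input shortcut: it hands the input itself
// back to the caller instead of a copy.
func mergeShared(inputs ...map[string]any) map[string]any {
	if len(inputs) == 1 {
		return inputs[0]
	}
	return mergeFresh(inputs...)
}

func region(m map[string]any) any {
	return m["vars"].(map[string]any)["region"]
}

func main() {
	cached := map[string]any{"vars": map[string]any{"region": "us-east-1"}}

	result := mergeShared(cached)
	result["vars"].(map[string]any)["region"] = "changed"
	fmt.Println("after shortcut, cache holds:", region(cached))

	cached = map[string]any{"vars": map[string]any{"region": "us-east-1"}}
	result = mergeFresh(cached)
	result["vars"].(map[string]any)["region"] = "changed"
	fmt.Println("after fresh merge, cache holds:", region(cached))
}
```

With the shortcut, the downstream mutation lands in the cached value that sibling components later read; with the fresh merge, the cache is untouched.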

why

A real-world large stack configuration (≈836 YAML files, ~195 final stacks
across three namespaces, ~9.3k component instances) reported atmos describe affected taking around 11 minutes in CI on a 2-core runner. Local
reproduction with --identity=false and fake AWS credentials took ~4
minutes and showed two clear cost centers in the heatmap:

- The credential store was being created per-component even with
  --identity=false (≈3.5 min of cumulative CPU). Fixed in PR #2471.
- The component-inheritance + merge + YAML-parse pipeline (≈6 min of
  cumulative CPU) was bottlenecked by lock contention on shared caches,
  redundant deep-copies on every cache read/write, and per-call work that
  could be cached or skipped for the common case.

This PR addresses the second bucket. Confirmed impact on the same workload
(mean of 3 local runs against current main, post-Phase-5-revert):

| Function | Pre-Phase-2 | Post-Phase-13 | Reduction |
| --- | --- | --- | --- |
| `cacheBaseComponentConfig` | 5m50s | ~990ms | −99.7% |
| `mergeComponentConfigurations` | 2m22s | ~95s | −33% |
| `MergeWithDeferred` | 1m35s | ~51s | −46% |
| `WalkAndDeferYAMLFunctions` | 1m26s | ~11s | −87% |
| `extractLocalsFromRawYAML` | 13s | ~6s | −54% |
| `UnmarshalYAMLFromFileWithPositions` | 18.7s | ~3.2s | −83% |
| `processCustomTags` | 31.5s | ~7.5s | −76% |
| `getCachedBaseComponentConfig` | 6.5s | ~360ms | −94% |

(mergeComponentConfigurations, MergeWithDeferred, and Merge
numbers reflect the post-Phase-5-revert state — the 1-input shortcut
that was originally counted toward Phase 5's headline numbers has been
removed for correctness. The remaining wins are still substantial.)

Local wall-clock on the same workload: 4.1s → ~2.2s (−47%) on a
many-core Mac. The wall-clock floor on Mac is set by stack-level
parallelism that already saturates; the cumulative CPU savings (several
minutes summed across all hot functions) translate to materially more
wall-clock improvement on 2-4 core CI runners where lock-wait padding,
allocation pressure, and serialized work cannot be hidden behind cores.

Projection for a 2-core CI runner starting from the v1.219.0 baseline:
~11 minutes → ~60-105 seconds end-to-end (combining PR #2471 + this
PR's shipped phases, including the ~15s GHA wall-clock cost of the
Phase 5 1-input revert). Awaiting end-to-end CI validation on the
reference workload.

Each phase is independently revertible — they live in separate
commits with self-contained tests. The two reverted optimizations
(Phase 5's 1-input shortcut, Phase 9's asymmetric clone) have their
failure modes documented in-code and in
docs/fixes/2026-05-23-describe-affected-component-inheritance-perf.md
so future passes don't re-attempt the same approach without addressing
the underlying contract violations.

references

Investigation doc with per-phase root cause, metric tables, and
decision lessons:
docs/fixes/2026-05-23-describe-affected-component-inheritance-perf.md
Predecessor work that built the perf-instrumentation infrastructure
this PR builds on: #1576 (heatmap visualization), #1611 (self-time vs
total-time), #1622 (Docker perf fix + CPU Time / Parallelism), #1639
(5.2× faster execution + 92% memory reduction).
Phase 1 (--identity=false gate) shipped separately in PR #2471
(v1.220.0-rc.1).

Summary by CodeRabbit

Documentation
- Added a detailed troubleshooting/performance guide for diagnosing slow "describe affected" runs with reproducible steps and optimization plan.
Performance
- Significant speedups via new caching, contention reduction, and short-circuit fast-paths for common/empty cases.
Reliability
- Improved cache correctness, mutation isolation, and error propagation to avoid races and unexpected panics.
Tests / Chores
- Expanded test coverage and added public cache-clear helpers for reliable isolation and regression verification.

fix(terraform): wire `--all` to `ExecuteTerraformAll` for dependency-ordered execution @thejrose1984 (#2486)

## What

Routes atmos terraform plan --all and apply --all through ExecuteTerraformAll so components actually execute in dependency (topological) order — as originally documented in the PR #1516 changelog and the DAG concurrency PRD.

Until this change, the dispatcher in cmd/terraform/utils.go routed all multi-component flags (--all, --components, --query) through ExecuteTerraformQuery, which walks components via Go map iteration — randomized order, with settings.depends_on ignored entirely. ExecuteTerraformAll, the function that builds the dependency graph and runs TopologicalSort, was reachable only from unit tests.
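
For contrast with map iteration, dependency-ordered execution can be sketched with Kahn's algorithm. This is an illustration, not the code behind ExecuteTerraformAll: `topoOrder`, `reversed`, and `indexOf` are hypothetical, ties are broken alphabetically as an assumption, and the edges in `main` are taken from the ordering invariants the PR's integration test asserts rather than from the fixture's full depends_on graph.

```go
package main

import (
	"fmt"
	"sort"
)

// topoOrder returns components so that each one appears after everything it
// depends on. deps maps a component to the components it depends on.
func topoOrder(deps map[string][]string) ([]string, error) {
	inDegree := map[string]int{}
	dependents := map[string][]string{}
	for node, prereqs := range deps {
		if _, ok := inDegree[node]; !ok {
			inDegree[node] = 0
		}
		for _, p := range prereqs {
			if _, ok := inDegree[p]; !ok {
				inDegree[p] = 0
			}
			inDegree[node]++
			dependents[p] = append(dependents[p], node)
		}
	}
	var ready []string
	for node, degree := range inDegree {
		if degree == 0 {
			ready = append(ready, node)
		}
	}
	sort.Strings(ready) // deterministic tie-breaking
	var order []string
	for len(ready) > 0 {
		node := ready[0]
		ready = ready[1:]
		order = append(order, node)
		for _, d := range dependents[node] {
			inDegree[d]--
			if inDegree[d] == 0 {
				ready = append(ready, d)
			}
		}
		sort.Strings(ready)
	}
	if len(order) != len(inDegree) {
		return nil, fmt.Errorf("dependency cycle among %d components", len(inDegree)-len(order))
	}
	return order, nil
}

// reversed gives the order a destroy run would use.
func reversed(order []string) []string {
	out := make([]string, len(order))
	for i, name := range order {
		out[len(order)-1-i] = name
	}
	return out
}

// indexOf returns the position of name in order, or -1.
func indexOf(order []string, name string) int {
	for i, n := range order {
		if n == name {
			return i
		}
	}
	return -1
}

func main() {
	deps := map[string][]string{
		"eks/cluster":             {"vpc"},
		"eks/karpenter-node-pool": {"eks/karpenter"},
		"eks/istio/istiod":        {"eks/istio/base"},
		"eks/istio/test-app":      {"eks/istio/istiod"},
	}
	order, err := topoOrder(deps)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("apply:  ", order)
	fmt.Println("destroy:", reversed(order))
}
```

Only the relative positions of dependent pairs are guaranteed; the full sequence depends on the tie-breaking rule, which is why a test should assert the partial order rather than an exact list.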

Why

Fixes #2485.

Users who configured settings.depends_on and ran atmos terraform apply --all were relying on a feature that didn't exist at the dispatch layer. Failures looked like Terraform errors (a component applied before its prereqs), not a missing-feature bug. The DAG concurrency PRD was authored on the assumption that this path already worked.

Changes

Dispatch

cmd/terraform/utils.go — info.All now routes to e.ExecuteTerraformAll(&info). --components / --query / bare -s stack continue to route to ExecuteTerraformQuery (no change).

ExecuteTerraformAll parity with ExecuteTerraformQuery

internal/exec/terraform_all.go — ports createQueryAuthManager so YAML functions (e.g. !terraform.state) resolve credentials under --all. Mirrors the #2081 fix that already exists for --query.
Drops the info.Stack == "" validation. The terraform-apply docs explicitly state --all without -s processes every stack, and that's the behavior users see today via ExecuteTerraformQuery. Keeping this PR non-breaking required matching that contract.
Removes the now-unused ErrStackRequiredWithAllFlag from errors/errors.go.

Filter scope

applyFiltersToGraph previously set IncludeDependencies: true, which would pull cross-stack prereqs into --all -s <stack>. Switched to false so the scope of --all -s <stack> is identical to today's behavior — components in the requested stack only, but now in topological order. A future opt-in flag can re-enable cross-stack execution.

Dry-run UX

executeNodeCommand now emits Would <subcmd> <component> in <stack> (dry run) via ui.Successf, matching processTerraformComponent. Both multi-component paths produce the same user-facing dry-run output. (This also affects the --affected path, which had no integration tests asserting dry-run output — verified manually.)

Tests

New integration test in tests/test-cases/terraform-multi-component-flags.yaml asserts the partial topological order (vpc before eks/cluster, eks/karpenter before eks/karpenter-node-pool, eks/istio/base before eks/istio/istiod before eks/istio/test-app) using the existing terraform-apply-affected fixture and regex with (?s). The exact total order is an implementation detail of Kahn's-algorithm tie-breaking; this test only asserts the correctness invariant.
internal/exec/terraform_all_test.go and terraform_all_simple_test.go — removed the "no stack specified" cases (the validation is gone) and updated TestApplyFiltersToGraph_* to match the new scope contract.

Compatibility matrix

| Scenario | Before | After |
| --- | --- | --- |
| `apply --all -s dev`, `depends_on` defined | Random order (bug) | Topological order |
| `apply --all -s dev`, no `depends_on` | Random order | Deterministic order |
| `apply --all` (no stack) | All stacks, random order | All stacks, topological order |
| `apply --all -s dev` with cross-stack `depends_on` | Cross-stack components ignored | In-stack topological order; cross-stack still out of scope (opt-in TBD) |
| `destroy --all -s dev` | Random order | Reverse topological order |
| `--all` with circular `depends_on` | Silently random | Hard error with cycle path |
| `apply --components vpc -s dev` | Unchanged | Unchanged |
| `apply --query '...' -s dev` | Unchanged | Unchanged |
| `apply -s dev` (no component, no flag) | Unchanged | Unchanged |
| `--all` with `!terraform.state` YAML function | Worked (via #2081) | Works (auth manager ported) |
| `--all` with per-component CI hooks | Worked (via #2475/#2397) | Works (hook flows through `executeNodeCommand`) |

Known follow-ups (not in this PR)

These are tracked in #2485 and intentionally out of scope for the dispatch fix:

Parser: DependencyParser only reads the deprecated settings.depends_on. Should also read dependencies.components like describe_affected_components.go already does.
Parser: only component + stack keys are recognized. namespace/tenant/environment/stage are documented but ignored.
Errors: missing-target dependency errors are silently logged at Warn in parseDependencyArray. Should be surfaced.
Concurrency: still sequential. The DAG concurrency PRD describes the planned ready-queue scheduler.
Cross-stack scope opt-in: a --include-cross-stack-dependencies flag (or settings.terraform.dependencies.cross_stack) to re-enable the original IncludeDependencies: true behavior.

Test plan

go build ./...
go vet ./internal/exec/... ./cmd/terraform/... ./errors/...
go test ./internal/exec/ ./pkg/dependency/... ./cmd/terraform/... ./errors/... ./pkg/ui/... -short — all green
go test ./tests -run 'TestCLICommands/terraform_plan_--all|TestCLICommands/terraform_plan_--query|TestCLICommands/terraform_plan_--components' -count=1 — all green
New ordering test: go test ./tests -run 'TestCLICommands/terraform_plan_--all_executes_in_dependency_order' -count=1 — green; output confirms vpc → eks/cluster → eks/external-dns → eks/istio/base → eks/karpenter → eks/istio/istiod → eks/karpenter-node-pool → eks/istio/test-app
CI lint (make lint)
Manual verification with dependencies.components (new format) once parser is extended in a follow-up
Cross-stack dependency behavior with the deferred opt-in flag

References

Closes #2485
Original feature request: #1242 (closed as COMPLETED in #1516, but the routing was never wired up)
Implementation PR: #1516
DAG concurrency PRD: docs/prd/dag-concurrent-execution.md


Summary by CodeRabbit

New Features
- terraform plan --all and apply --all run components in dependency (topological) order; destroy --all runs in reverse. --all may be used without a stack; --all -s <stack> scopes to that stack without pulling cross‑stack prerequisites. Per‑component hooks and auth-aware YAML resolution are active during --all runs; dry‑run shows clear per‑component success messages.
Tests
- Expanded tests for --all ordering, scoping/filtering, auth wiring, dry‑run flows, and per‑component hook wiring.
Documentation
- New blog post describing --all behavior, caveats, and follow-ups.
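
The ordering behavior described above (dependencies first, reverse order for destroy, a hard error naming the cycle path) can be illustrated with a small depth-first topological sort. This is a minimal sketch under stated assumptions, not the Atmos implementation: `Graph`, `TopoOrder`, and `Reverse` are hypothetical names, and the real code in pkg/dependency may be structured differently.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Graph maps a component name to the components it depends on.
// Hypothetical type for illustration only.
type Graph map[string][]string

const (
	unvisited = iota
	visiting
	done
)

// TopoOrder returns components so that every dependency precedes its dependents.
// If the graph contains a cycle, it returns an error that spells out the cycle path.
func TopoOrder(g Graph) ([]string, error) {
	names := make([]string, 0, len(g))
	for n := range g {
		names = append(names, n)
	}
	sort.Strings(names) // deterministic order among independent components

	state := make(map[string]int, len(g))
	var order, stack []string

	var visit func(n string) error
	visit = func(n string) error {
		switch state[n] {
		case done:
			return nil
		case visiting:
			// n is already on the DFS stack: slice out the cycle and close it.
			for i, s := range stack {
				if s == n {
					cycle := append(append([]string{}, stack[i:]...), n)
					return fmt.Errorf("dependency cycle: %s", strings.Join(cycle, " -> "))
				}
			}
		}
		state[n] = visiting
		stack = append(stack, n)
		deps := append([]string{}, g[n]...)
		sort.Strings(deps)
		for _, d := range deps {
			if err := visit(d); err != nil {
				return err
			}
		}
		stack = stack[:len(stack)-1]
		state[n] = done
		order = append(order, n)
		return nil
	}

	for _, n := range names {
		if err := visit(n); err != nil {
			return nil, err
		}
	}
	return order, nil
}

// Reverse returns a reversed copy, which is the order a destroy run would use.
func Reverse(in []string) []string {
	out := make([]string, len(in))
	for i, s := range in {
		out[len(in)-1-i] = s
	}
	return out
}
```

Calling `TopoOrder` on a graph where eks/cluster depends on vpc and eks/karpenter depends on eks/cluster yields vpc, eks/cluster, eks/karpenter; passing that result to `Reverse` gives the destroy order. Sorting names before the walk is a design choice in this sketch so that unrelated components come out in a stable order instead of Go's randomized map iteration order.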

fix(vendor): recover OCI pulls on auth rejection and surface rich errors @osterman (#2487)

## what

Vendoring an OCI image (e.g. oci://ghcr.io/...) now auto-recovers when configured credentials are rejected (401 / 403 / DENIED) by retrying once with anonymous authentication.
On successful recovery, emits WARN OCI auth rejected, succeeded with anonymous fallback and proceeds.
Terminal pull failures now surface a rich error built with errUtils.Build(errUtils.ErrPullImage) — preserves the original cause and attaches structured context (image, registry, auth_attempted, status) plus three self-contained remediation hints (Actions packages: read, ATMOS_GITHUB_USERNAME override, stale ~/.docker/config.json).
Bumps the chosen-auth log line from Debug → Info and the "GHCR token without username" branch from Debug → Warn so CI logs reveal misconfiguration without --debug.
Non-auth errors (DNS, TLS, deadlines, 5xx) bypass retry — they need different remediation.

why

Public test images on ghcr.io (e.g. ghcr.io/cloudposse/atmos/tests/fixtures/components/terraform/mock:v0) failed hard on Windows CI runners whose GITHUB_TOKEN lacked packages: read scope, because pullImage used the rejecting credentials unconditionally instead of falling back to anonymous.
The previous error surface was a bare DENIED: denied with no auth source, no HTTP status, and no actionable hint — the vendor reporter collapsed it into an opaque tally, making the root cause untraceable.
Part A of this work — granting packages: read in .github/workflows/test.yml — already landed; this is Part B, the code change so future users, different workflows, and private registries get a clean diagnosis and an automatic recovery for public images.

references

Builds on PR #1647 (3-tier auth precedence).
Mirrors the error-builder idiom from pkg/provisioner/source/source.go:88-97.
Reuses the existing errUtils.ErrPullImage sentinel — no new sentinel introduced.

Summary by CodeRabbit

Bug Fixes
- Automatic fallback to anonymous OCI image pulls when authenticated requests are rejected (401/403 or “DENIED”), preserving original error causes
- Richer diagnostic errors with contextual hints for troubleshooting image-pull failures
- Warn when a GHCR token is present but no GitHub username is configured
Tests
- Expanded test coverage for image-pull auth fallback and varied failure scenarios

fix(terraform): preserve explicit identity and auth context for local runs @shirkevich (#2348)

## Problem

Local commands could still fall back to the default CI identity (`terraform-ci` / `gcp-wif`) even when the user explicitly selected a local identity such as `terraform` / `gcp-adc`.

This showed up in several related paths:

atmos terraform apply ... --identity terraform and -i terraform could lose the explicit identity during Terraform argument reconstruction and then authenticate with the default identity.
Local Terraform commands could run the CI hook path first, producing noisy terraform-ci WIF authentication errors even when the command later succeeded with the intended local identity.
atmos terraform output ... --format json -i terraform bypassed the normal Terraform auth setup and called formatted output resolution without the active AuthManager / AuthContext.
!terraform.state evaluations could receive stack info without the active auth manager, so nested state reads could still resolve backend credentials from the default identity instead of the explicit command identity.
atmos ansible playbook ... --identity terraform did not propagate the selected identity into stack processing before YAML functions ran, so Ansible-driven components using !terraform.state could still try the default Terraform CI WIF identity.

Fixed Issues

Preserves explicit Terraform --identity / -i values through both Cobra/flag-registry parsing and the legacy raw-arg parsing path.
Normalizes Terraform identity edge cases consistently, including --identity=, -i=, and --identity=false.
Keeps CLI-provided identity values ahead of default/profile-selected identities when Cobra reports the optional-value sentinel.
Skips CI hook execution for normal local non-CI runs unless CI mode is explicitly forced, removing local terraform-ci preflight auth noise.
Initializes Terraform auth for formatted terraform output and passes both AuthContext and AuthManager into output resolution.
Propagates the active AuthManager from ProcessComponentConfig into ConfigAndStacksInfo, allowing !terraform.state to inherit the selected identity context.
Adds shared component auth setup for non-Terraform command paths that need authenticated YAML functions.
Adds Ansible-specific identity handling that supports long-form --identity / ATMOS_IDENTITY while deliberately leaving Ansible's -i shorthand reserved for inventory.
Runs Ansible stack processing with the selected auth manager before YAML function evaluation, so Ansible playbooks using !terraform.state can use the requested identity.
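
To make the flag forms concrete, here is a minimal sketch of identity parsing over raw args. It is not the Atmos parser: `Identity` and `ParseIdentity` are hypothetical names, and the sketch deliberately leaves out `--identity=false`, ATMOS_IDENTITY, and the Ansible rule that reserves -i for inventory. It only shows three ideas from the list above: both spellings are accepted, an explicit-empty value is remembered as explicit, and a following flag-like token is never consumed as the value.

```go
package main

import "strings"

// Identity is the hypothetical result of scanning args for an identity flag.
type Identity struct {
	Value    string // selected identity name; empty when none was given
	Explicit bool   // true when the flag appeared at all, even with an empty value
}

// ParseIdentity scans args for --identity / -i in both the space and "=" forms.
func ParseIdentity(args []string) Identity {
	for i, a := range args {
		for _, name := range []string{"--identity", "-i"} {
			// "=" form: --identity=value, -i=value, or the explicit-empty --identity= / -i=.
			if v, ok := strings.CutPrefix(a, name+"="); ok {
				return Identity{Value: v, Explicit: true}
			}
			if a != name {
				continue
			}
			// Space form: take the next token only if it is not itself a flag,
			// so a native flag such as -lock=false is left alone.
			if i+1 < len(args) && !strings.HasPrefix(args[i+1], "-") {
				return Identity{Value: args[i+1], Explicit: true}
			}
			return Identity{Explicit: true}
		}
	}
	return Identity{}
}
```

Keeping `Explicit` separate from `Value` is what lets a caller distinguish "no flag, use the default identity" from "flag given with no value, ask the user".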

Verification

Rebuilt the binary with rtk proxy go build -o build/atmos .
Focused regression suite passed: rtk go test ./cmd/ansible ./pkg/component/ansible ./pkg/flags ./internal/exec ./pkg/hooks -run 'TestAnsible|TestBuildConfigAndStacksInfo|TestGetLongIdentityFromArgs|TestProcessStacksWithAuth|TestParseGlobalFlags|TestSetupTerraformAuth|TestProcessComponentConfig_PropagatesAuthManager|TestProcessComponentConfig_AuthManagerGuardBranches|TestProcessCommandLineArgs_EmptyIdentityFlagIsExplicitSelect|TestProcessCommandLineArgs_TerraformIdentityFlag_Issue2392|TestProcessArgsAndFlags_IdentityFlag|TestRunCIHooks_LocalRunSkipsExperimentalGate|TestRunCIHooks_ForwardsErrorAndExitCode|TestRunCIHooks_NilAtmosConfig|TestRunCIHooks_ExperimentalDisableReturnsError'
Downstream local Terraform apply path was verified with explicit --identity terraform.
Downstream formatted Terraform output path was verified with explicit -i terraform --format json.
Downstream Ansible dry-run path was verified to select identity "terraform" / provider=gcp-adc; remaining OAuth access was environment/network dependent, not a fallback to terraform-ci.

Summary by CodeRabbit

New Features
- Added -i shorthand for --identity (supports -i value, -i=value, and explicit-empty -i= to trigger interactive selection).
Improvements
- More robust identity resolution across commands and env vars; explicit-empty identity is preserved.
- Auth manager is now propagated into output and component command flows for consistent auth behavior.
- Hook discovery avoids rendering templates; CI hooks skip when no CI provider detected.
Bug Fixes
- Fixed parsing so a following native flag (e.g., -lock=false) is not mis-consumed as an identity value.
Tests
- Expanded test coverage for identity parsing, auth propagation, hooks, and CI registry behavior.

fix(ci): fire CI hooks per-component in deploy --all mode @thejrose1984 (#2478)

## what

Fixes atmos terraform deploy --all (and --query, --components, stack-without-component) producing only a single CI summary entry for the last component instead of one entry per component
Adds runCIHooksForDeployComponent as the per-component hook for the deploy subcommand so $GITHUB_STEP_SUMMARY receives one entry per component with the correct component/stack context
Wires wasMultiComponentExecution reset, error-defer guard, and PostRunE guard in deploy.go — the same three-site pattern applied to plan in #2430 and apply in #2475

why

In multi-component mode, terraformRunWithOptions routes to ExecuteTerraformQuery and sets wasMultiComponentExecution = true, but deploy.go had no guard on its PostRunE or error-path defer. This caused:

PostRunE to fire once after all components completed, calling RunCIHooks with an empty output buffer and the last component's info.Component/info.Stack
The error-path defer to double-fire when --all failed mid-walk (per-component hook already ran for the failed component)
For stacks with N components, only 1 summary entry appeared instead of N
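
The three guard sites can be modeled in a few lines. This is a simplified sketch with hypothetical names (`deployRun`, `hook`, `Run`, `PostRun`); it does not reproduce deploy.go or runCIHooksForDeployComponent. It only illustrates why each guard is needed to get exactly one summary entry per component.

```go
package main

import "errors"

// deployRun is a hypothetical stand-in for the command state. summary stands in
// for the entries appended to the CI step summary.
type deployRun struct {
	wasMulti bool
	summary  []string
}

// hook records one summary entry for a component.
func (d *deployRun) hook(component, status string) {
	d.summary = append(d.summary, component+": "+status)
}

// Run walks the components. failAt, if non-empty, names a component that fails.
func (d *deployRun) Run(components []string, failAt string) (err error) {
	d.wasMulti = false // site 1: reset per-run state before any early exit
	if len(components) == 0 {
		return errors.New("no components selected")
	}
	defer func() {
		// site 2: the error-path hook stays quiet when per-component hooks already ran.
		if err != nil && !d.wasMulti {
			d.hook(components[0], "failed")
		}
	}()
	if len(components) == 1 {
		if components[0] == failAt {
			return errors.New(failAt + " failed")
		}
		return nil // single-component success is reported by PostRun
	}
	d.wasMulti = true
	for _, c := range components {
		if c == failAt {
			d.hook(c, "failed")
			return errors.New(c + " failed")
		}
		d.hook(c, "ok")
	}
	return nil
}

// PostRun is the command-level hook; site 3 skips it after a multi-component run.
func (d *deployRun) PostRun(component string) {
	if d.wasMulti {
		return
	}
	d.hook(component, "ok")
}
```

With three components and no failure, `Run` followed by `PostRun` leaves three entries in `summary`. Without the site 3 guard there would be a fourth entry, and without the site 2 guard a mid-walk failure would be reported twice.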

references

Closes #2476
Related: #2397 (plan fix), #2475 (apply fix)

Summary by CodeRabbit

Bug Fixes
- Prevent duplicate error-hook execution during multi-component deployments.
- Ensure per-run state is reset before early exits so deferred error hooks and post-run logic behave consistently.
- Run CI hooks per-component for deploys to preserve component output and forward correct exit codes.
Tests
- Added tests for per-component CI hook behavior, suppression of post-run logic in multi-component deploys, defer-guard behavior, and exit-code forwarding.

[codex] Fix verifier auto-install and cosign bundles @osterman (#2481)

## what

Resolve verifier auto-installs to concrete registry versions before bootstrapping, instead of falling back to literal latest.
Add platform-aware installer helpers and regression coverage for Windows verifier asset URLs across cosign, slsa-verifier, gh, and minisign.
Combine cosign opts with downloaded sidecars like --bundle so Trivy checksum signature verification works with Aqua metadata.

why

Windows CI was failing because cosign release assets exist under v... tags and the bootstrap path could construct invalid release URLs.
Trivy verification failed on macOS because Atmos dropped the sigstore bundle whenever cosign options were present, producing an incomplete cosign verify-blob command.
The added tests cover the failing Windows URL rendering path and the Trivy-shaped checksum signature command.
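
The cosign fix amounts to appending the downloaded bundle to the configured options instead of replacing one with the other. A minimal sketch of that argument assembly follows; `BuildVerifyBlobArgs` and `hasFlag` are hypothetical helpers, not functions from pkg/toolchain/verification, and the option values in the usage note are placeholders.

```go
package main

import "strings"

// hasFlag reports whether opts already contains the named flag,
// in either "--flag value" or "--flag=value" form.
func hasFlag(opts []string, name string) bool {
	for _, o := range opts {
		if o == name || strings.HasPrefix(o, name+"=") {
			return true
		}
	}
	return false
}

// BuildVerifyBlobArgs assembles arguments for `cosign verify-blob`.
// opts are options taken from registry metadata; bundlePath is a downloaded
// sidecar file (empty when there is none); blobPath is the file being verified.
func BuildVerifyBlobArgs(opts []string, bundlePath, blobPath string) []string {
	args := []string{"verify-blob"}
	args = append(args, opts...)
	// Add the sidecar alongside the options rather than dropping it when options exist.
	if bundlePath != "" && !hasFlag(opts, "--bundle") {
		args = append(args, "--bundle", bundlePath)
	}
	return append(args, blobPath)
}
```

For example, passing options such as `--certificate-identity-regexp` and `--certificate-oidc-issuer` together with a bundle path produces a single command line that carries both the identity constraints and `--bundle`, followed by the blob to verify.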

references

Fixes the verifier install failure introduced by package verification.
Validated with go test ./pkg/toolchain/installer ./pkg/toolchain/registry/aqua ./pkg/toolchain/verification, go test ./pkg/toolchain, pre-commit hooks, and a live go run . toolchain install aquasecurity/trivy@v0.70.0.

Summary by CodeRabbit

New Features
- Signature verification now supports bundle sidecars.
- Enhanced cross-platform asset resolution, including better Windows ARM and Rosetta2 handling and per-platform overrides.
- Improved verifier bootstrap resolution with additional fallback behavior.
Bug Fixes
- Corrected Windows executable extension handling across target platforms.
Tests
- Added tests for Windows asset URL generation, verifier version resolution failures, and cosign bundle sidecar integration.