cloudposse/atmos v1.220.0 on GitHub

Refresh Terragrunt migration guide mappings @osterman (#2527)

what

Update the Terragrunt migration guide comparison table and concept mappings for current Atmos capabilities.
Split Terraform source guidance into source provisioning and vendoring, including TTL, auto-provisioning, and workdir guidance.
Add or refresh mappings for explicit component dependencies, hooks, file generation, backend provisioning, locals, and AWS YAML functions.

why

The migration guide had stale guidance that understated current Atmos parity with Terragrunt.
The revised examples give Terragrunt users more accurate one-to-one migration paths.

references

https://atmos.tools/migration/terragrunt

Summary by CodeRabbit

Documentation
- Updated migration guide with an expanded "Key Differences" mapping for Terragrunt → Atmos (dependencies, sourcing, hooks, backend).
- Clarified dependency ordering vs output lookup and added stack YAML examples showing dependencies + output wiring.
- Split module sourcing into “Atmos Source Provisioning” (with CLI examples) and “Atmos Vendoring” (new vendor format).
- Renamed “Generate Blocks” to “Code Generation” and rewrote hooks, remote-backend, locals, and YAML function examples (including inline exec and env/aws mappings).

docs(dependencies): document dependencies.components for describe affected @osterman (#2391)

what

Update atmos describe affected docs to lead with the new dependencies.components (kind: file|folder + path:) format for path-based dependencies; keep a short backward-compat note pointing to legacy settings.depends_on.
Convert the dependents example on the describe affected page from settings.depends_on to dependencies.components.
Correct the Merge Behavior section in stacks/dependencies/components.mdx: the default is replace, not append; append requires opting in via settings.list_merge_strategy: append. Add an "Opt-in append" subsection and link to the settings reference.
Add a migration callout under the schema in stacks/dependencies/components.mdx clarifying that namespace/tenant/environment/stage are not supported in the new format — use a templated stack: instead.
Rebalance stacks/dependencies/index.mdx so dependencies.tools and dependencies.components get equal billing in the intro, use cases, and component-dependencies subsection. Remove the duplicate Related Documentation entry and relabel the legacy link as "Legacy settings.depends_on". Add a link to atmos describe affected.
Extend the Atmos manifest JSON Schema (website/static/schemas/... and the matching test fixture) so dependencies.components is allowed and validates component, stack, kind, path. Previously rejected by additionalProperties: false.
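The corrected merge semantics can be pictured with a small sketch. It is illustrative only: `mergeLists` is a hypothetical helper rather than Atmos's merge code, and placing inherited entries before overriding ones under `append` is an assumption. Only the default (replace) and the opt-in strategy name (`append`) come from the notes above.

```go
package main

import "fmt"

// mergeLists is a hypothetical helper, not Atmos code. It models how an
// inherited list and an overriding list combine under the two strategies
// described above: replace (the default) and append (opt-in).
func mergeLists(base, override []string, strategy string) []string {
	if strategy == "append" {
		// Assumed ordering: inherited entries first, then the override's.
		merged := make([]string, 0, len(base)+len(override))
		merged = append(merged, base...)
		return append(merged, override...)
	}
	// Default: the overriding list wins outright.
	out := make([]string, len(override))
	copy(out, override)
	return out
}

func main() {
	base := []string{"vpc"}
	override := []string{"eks"}
	fmt.Println(mergeLists(base, override, ""))       // default replace
	fmt.Println(mergeLists(base, override, "append")) // opt-in append
}
```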

why

The dedicated dependencies.components page existed, but the highest-traffic surface (describe affected) and the JSON Schema still only documented/allowed the legacy settings.depends_on map — driving users to the deprecated format and making the new format fail IDE/SchemaStore validation.
The merge-behavior description in stacks/dependencies/components.mdx contradicted the announcement blog and the actual code (internal/exec/describe_dependents_test.go:1093–1154 confirms default = replace, append is opt-in via settings.list_merge_strategy). Users following the docs would have built a wrong mental model of inheritance.
The migration story from settings.depends_on (with namespace/tenant/environment/stage) to dependencies.components (with templated stack:) was only discoverable via the migration table at the bottom of the page; surfacing it as a callout reduces confusion for users porting existing configs.

references

pkg/schema/dependencies.go — canonical field set for dependencies.components
internal/exec/describe_dependents_test.go:1093–1154 — confirms default merge is replace, append requires settings.list_merge_strategy: append
website/blog/2026-03-14-dependencies-components.mdx — original announcement
Verified: cd website && npm run build succeeds (no broken-link errors)

Summary by CodeRabbit

New Features
- Dependencies now support four top-level kinds: tools, components, files, and folders. Component-to-component relationships and explicit file/folder watch paths are first-class; describe-affected, describe-dependents, and CI workflows use the expanded surface while legacy formats remain supported.
Documentation
- Guides updated with examples, migration notes, canonical forms, merge semantics, and quick examples for cross-stack and path-based dependencies.
Tests
- Added coverage for the new surfaces, aliasing, deduplication, normalization, and v1/v2 equivalence.

fix(ci): repair Docker build and Homebrew formula bump in release workflow @aknysh (#2525)

what

Replace the flaky upstream install_kustomize.sh script in the Dockerfile with a direct download from GitHub Releases, pinned to kustomize v5.8.1
Replace mislav/bump-homebrew-formula-action@v3 with dawidd6/action-homebrew-bump-formula@v7 (SHA-pinned) for the Homebrew formula bump step

why

The kustomize install script has known bugs (kubernetes-sigs/kustomize#5562) causing tar extraction failures (tar: ./kustomize_v*_linux_amd64.tar.gz: Cannot open) during Docker image builds
The mislav/bump-homebrew-formula-action is broken because GitHub now returns HTTP 303 instead of 302 for tarball redirects, and the action hardcodes statusCode == 302 (mislav/bump-homebrew-formula-action#340, open/unfixed)
Both failures blocked the v1.219.0 release workflow (run #26131090357)

Summary by CodeRabbit

Chores
- Updated build and deployment infrastructure, including CI/CD workflow configuration and Docker build process improvements for enhanced reliability and maintainability.

feat(components): add retry block for transient terraform errors @osterman (#2431)

what

Add a per-component retry: block under components.terraform.<name> that wraps each terraform subprocess invocation (init, workspace select, workspace new, plan/apply/etc.) in an independent retry loop with configurable backoff.
Introduce retry.conditions: — a list of regex patterns matched against captured stdout/stderr; only errors whose output matches at least one condition retry, everything else fails fast. Patterns may be wrapped in /.../ for readability.
Extend schema.RetryConfig with Conditions []string (backwards-compatible — the existing struct is also used by workflows / vendor / task retry configs).
Plumb the new block through stack inheritance: abstract components define a default policy, concrete components and overrides.retry deep-merge on top.
Add JSON schema for retry referenced from terraform, terraform_component_manifest, and overrides; ship a docs page, blog post, and roadmap milestone under the CI/CD initiative.
Add pkg/retry/conditions.go (regex compile + match) and pkg/schema/retry_decode.go (mapstructure decoder with the duration hook) so logic stays out of internal/exec/.
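A minimal sketch of the condition matching described above, assuming plain Go regular expressions. The helper names `compileConditions` and `shouldRetry` are hypothetical and are not the functions in pkg/retry/conditions.go; the sketch only shows the optional /.../ wrapper being stripped and the fail-fast behavior when nothing matches.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// compileConditions is a hypothetical helper. It strips an optional /.../
// wrapper from each pattern and compiles the remainder as a Go regexp.
func compileConditions(patterns []string) ([]*regexp.Regexp, error) {
	compiled := make([]*regexp.Regexp, 0, len(patterns))
	for _, p := range patterns {
		if len(p) >= 2 && strings.HasPrefix(p, "/") && strings.HasSuffix(p, "/") {
			p = p[1 : len(p)-1]
		}
		re, err := regexp.Compile(p)
		if err != nil {
			return nil, fmt.Errorf("invalid retry condition %q: %w", p, err)
		}
		compiled = append(compiled, re)
	}
	return compiled, nil
}

// shouldRetry reports whether the captured output matches at least one
// condition. With no conditions nothing matches, so nothing is retried.
func shouldRetry(conditions []*regexp.Regexp, output string) bool {
	for _, re := range conditions {
		if re.MatchString(output) {
			return true
		}
	}
	return false
}

func main() {
	conds, err := compileConditions([]string{`/502 Bad Gateway/`, `connection reset`})
	if err != nil {
		panic(err)
	}
	fmt.Println(shouldRetry(conds, "Error: 502 Bad Gateway while downloading provider"))
	fmt.Println(shouldRetry(conds, "Error: Unsupported argument"))
}
```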

why

Unattended atmos terraform plan/apply runs in CI repeatedly fail with transient infrastructure errors that have nothing to do with the Terraform code — most commonly 502 Bad Gateway during provider downloads, but also connection reset, TLS handshake timeout, and state-backend timeouts.
Today the only recovery is a manual re-run, which is painful for fleet operations and unattended pipelines.
Workflows, vendoring, and source extraction already use pkg/retry. This PR exposes the same robust primitive to components without duplicating logic.
The design is intentionally pattern-driven (opt-in per regex) so real terraform plan failures (exit-code 2, schema errors, etc.) are never silently retried — the foot-gun of "retry everything" is avoided by requiring conditions: to opt in.
Each subprocess invocation is wrapped independently (wrap(exec), wrap(exec), wrap(exec)), not as one outer retry around the whole pipeline, so apply doesn't lose its budget to init.
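The per-invocation budget can be sketched as follows. `withRetry` and `flaky` are hypothetical names used for illustration; backoff, jitter, and condition matching are left out so the sketch shows only that each step gets its own attempt counter.

```go
package main

import (
	"errors"
	"fmt"
)

// withRetry is a hypothetical wrapper: it runs step up to maxAttempts times
// and returns the number of attempts used along with the last error.
func withRetry(maxAttempts int, step func() error) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = step(); err == nil {
			return attempt, nil
		}
	}
	return maxAttempts, err
}

// flaky returns a step that fails the given number of times, then succeeds.
func flaky(failures int) func() error {
	return func() error {
		if failures > 0 {
			failures--
			return errors.New("502 Bad Gateway")
		}
		return nil
	}
}

func main() {
	// Each step is wrapped on its own, so a step that uses all three
	// attempts does not shrink the budget of the step that follows it.
	initAttempts, initErr := withRetry(3, flaky(2))
	applyAttempts, applyErr := withRetry(3, flaky(2))
	fmt.Println(initAttempts, initErr, applyAttempts, applyErr)
}
```

With a single outer retry around the whole pipeline, the two failures during the first step would have left only one attempt for the second.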

references

New docs page: /stacks/components/terraform/retry
Blog post: website/blog/2026-05-18-terraform-component-retry.mdx (tag: feature)
Roadmap milestone added to the CI/CD Simplification initiative
Reuses pkg/retry.WithPredicate and the existing WithStdoutCapture / WithStderrCapture shell options — no duplication

Summary by CodeRabbit

New Features
- Per-component retry for Terraform subprocesses: opt-in retry: block with regex-driven conditions, max attempts, backoff, delays, jitter; wraps init, workspace and main commands; inherited from abstract components and overridable; no retry occurs by default without conditions.
Documentation
- New docs and blog post explain configuration, inheritance, and safety defaults.
Tests
- Extensive unit tests added covering decoding, merge/precedence, execution retry logic, and condition matching.

feat(website): per-page raw .md routes + Copy Markdown button on atmos.tools @osterman (#2503)

what

Adds per-page .md routes to atmos.tools — every doc URL is mirrored at <url>.md with Content-Type: text/markdown.
Adds a "Copy Markdown" / "View Markdown" split button above each doc page.
Adds <link rel="alternate" type="text/markdown" href="<url>.md"> to each doc page's <head> so crawlers and LLM tooling can discover the alternate.
Introduces an AST-based MDX→Markdown normalizer (website/plugins/docusaurus-plugin-llms-txt/src/mdx-normalize.mjs) with a per-component handler table covering <Intro>, <Tabs>/<TabItem>, <Terminal>, <File>, <Note>, <Step>, <dl><dt><dd>, marketing cards, and more. Unknown components unwrap to their children. 19 unit tests.
Reuses the same normalizer for llms-full.txt, replacing a lossy regex-strip that dropped JSX content wholesale — the LLM corpus file is now meaningfully richer.
Fixes a pre-existing bug in the llms-txt plugin: pages with frontmatter id:/slug: overrides were silently dropped from llms.txt / llms-full.txt because the resolver searched by filename. Switched to Docusaurus's .docusaurus/<plugin>/<id>/*.json cache for the authoritative permalink → source map. Pages processed went from 554 → 735.

why

LLM-driven workflows (Claude Code, ChatGPT, custom agents) are now first-class consumers of our docs. A raw Markdown alternate makes our docs trivially feedable to any LLM without HTML scraping.
A rel="alternate" Markdown link is a standard discovery pattern for agentic crawlers — no special-case scraping needed.
The MDX wholesale-strip approach was corrupting llms-full.txt (tab content, flag tables, intros all silently dropped). The AST normalizer preserves structure correctly.
The id:/slug: override bug meant ~25% of CLI command pages were missing from the LLM corpus entirely.

references

Inspiration: FlyNumber/markdown_docusaurus_plugin (UX reference; we extended our existing plugin rather than adopting it).
Parity with atmos-pro's recently-shipped Copy Markdown affordance.
Blog post: website/blog/2026-05-24-copy-markdown-button.mdx

Summary by CodeRabbit

New Features
- Docs pages available as raw .md URLs; "Copy Markdown" and "View Markdown" controls added to doc UI
- Site generates per-page Markdown files and a synthesized index for discovery
Documentation
- Pages include rel="alternate" Markdown links
- MDX content normalized into portable Markdown while preserving tabs, code/terminal/file blocks, notes, and definition lists
Tests
- Added coverage for HTML-comment and truncate-marker behavior
Chores
- Added Markdown parsing dependency; improved deploy script/workflow to ensure text files use UTF-8 charset

feat: implement remote stack imports @osterman (#2037)

what

Add support for importing stack configurations from remote URLs (HTTP, Git, S3, GCS) using go-getter
Stack imports now behave consistently with remote imports in atmos.yaml
New pkg/stack/imports package handles URL detection and remote downloading
Updated stack_processor_utils.go to detect remote URLs and download them automatically

why

This feature was documented but not yet implemented (fixes #2036)
Teams need to share stack configurations across multiple repositories without vendoring
Enables central catalogs, version-pinned imports, and cross-team config sharing
Provides consistency between atmos.yaml imports and stack file imports

references

closes #2036
Blog post: website/blog/2026-01-29-remote-stack-imports.mdx
Example: examples/remote-stack-imports/
Documentation: Stack Imports

Summary by CodeRabbit

New Features
- Remote stack imports with local caching for HTTP(S), Git, S3, and GCS; skip-if-missing and version-pinning support.
Documentation
- New example project and README demonstrating local+remote import composition; blog post and roadmap entry announcing the feature.
Chores
- Improved on-disk and in-memory caching, atomic cache writes, and cross-platform file-locking.
Tests
- Expanded unit and integration tests covering URI classification, downloading, caching, locking, and CLI scenarios.

feat(aws/security): add SARIF/OCSF exports and harden CI @osterman (#2483)

What

This PR adds machine-readable security export formats to atmos aws security analyze and includes the CI hardening needed to keep the branch green.

AWS security exports

Adds --format=sarif for SARIF 2.1.0 output compatible with GitHub code scanning, Azure DevOps, and SARIF viewers.
Adds --format=ocsf for OCSF 1.4.0 Detection Finding output for SIEM and security data lake ingestion.
Preserves Atmos context in exported findings, including stack, component, component path, remediation steps, deploy command, mapped physical locations, and logical fallback locations for unmapped resources.
Produces deterministic output ordering for stable diffs and deduplication.
Maps Atmos severities into SARIF/GHAS levels and OCSF severity/status fields.
Adds schema-backed and structural test coverage for SARIF and OCSF renderers, determinism, empty/nil inputs, mapped/unmapped findings, compliance reports, and malformed SARIF rejection.
Updates CLI docs, blog content, roadmap data, and PRD notes. The experimental Atmos Pro upload surface was removed before merge; the design is preserved in docs/prd/atmos-pro-security-findings-upload.md for later revival.
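As a rough illustration of the severity mapping and deterministic ordering, the sketch below assumes severity labels of CRITICAL, HIGH, MEDIUM, LOW, and INFORMATIONAL and a sort key of stack, component, then rule ID. Both are assumptions, as are the `finding` type and function names; only the four SARIF 2.1.0 level values (`error`, `warning`, `note`, `none`) are taken from the specification.

```go
package main

import (
	"fmt"
	"sort"
)

// finding is a hypothetical, trimmed-down finding record.
type finding struct {
	Stack, Component, RuleID, Severity string
}

// sarifLevel maps an assumed severity label to one of the four SARIF 2.1.0
// result levels: "error", "warning", "note", or "none".
func sarifLevel(severity string) string {
	switch severity {
	case "CRITICAL", "HIGH":
		return "error"
	case "MEDIUM":
		return "warning"
	case "LOW", "INFORMATIONAL":
		return "note"
	default:
		return "none"
	}
}

// sortFindings orders findings by stack, component, then rule ID so that
// repeated runs over the same input render in the same order.
func sortFindings(fs []finding) {
	sort.SliceStable(fs, func(i, j int) bool {
		a, b := fs[i], fs[j]
		if a.Stack != b.Stack {
			return a.Stack < b.Stack
		}
		if a.Component != b.Component {
			return a.Component < b.Component
		}
		return a.RuleID < b.RuleID
	})
}

func main() {
	fs := []finding{
		{"prod", "vpc", "rule-b", "LOW"},
		{"dev", "s3", "rule-a", "HIGH"},
	}
	sortFindings(fs)
	for _, f := range fs {
		fmt.Println(f.Stack, f.Component, f.RuleID, sarifLevel(f.Severity))
	}
}
```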

CI and workflow hardening

Upgrades actions/checkout usage from v4 to v6 across workflows and docs/examples; updates the SHA-pinned Atmos Pro checkout to v6.0.2.
Grants packages: read so CI jobs pulling from GHCR can authenticate with the workflow token.
Grants reviewdog the PR/check permissions it needs for tflint annotations and uses github.token explicitly.
Uses opentofu/setup-opentofu@v1 on Windows instead of installing OpenTofu through the Atmos toolchain path that was failing signature verification.
Fixes demo-stack wttr.in URLs, changes Swedish language code from se to sv, and adds HTTP retries so screengrab generation is less brittle.

Why

SARIF and OCSF let Atmos security findings flow directly into standard security workflows instead of requiring users to translate markdown/json output themselves. The CI changes address failures observed while validating this branch: checkout/auth instability, Windows OpenTofu setup failures, reviewdog token scope failures, and live wttr.in request failures in screengrab generation.

Validation

GITHUB_TOKEN=$(gh auth token) node .github/actions/verify-sha-pinning/test.mjs
make -C demo/screengrabs build-all
terraform fmt -check examples/demo-stacks/components/terraform/myapp/main.tf
pre-commit run check-yaml --files .github/workflows/test.yml examples/demo-stacks/stacks/deploy/dev.yaml examples/demo-stacks/component.yaml
git diff --check

References

SARIF 2.1.0 spec: https://docs.oasis-open.org/sarif/sarif/v2.1.0/sarif-v2.1.0.html
GitHub SARIF support: https://docs.github.com/en/code-security/code-scanning/integrating-with-code-scanning/sarif-support-for-code-scanning
OCSF schema: https://schema.ocsf.io/

Summary by CodeRabbit

New Features
- Added SARIF 2.1.0 and OCSF 1.4.0 export options; CLI accepts --format=sarif/ocsf, emits deterministic, Atmos‑enriched outputs and records UTC invocation/audit metadata.
- Integrated Amazon Inspector2 native findings into security analysis with normalization, deduplication and preferred native results.
Tests
- Extensive unit and JSON‑schema test suites for SARIF and OCSF ensuring spec conformance and byte‑stable output.
Documentation
- Docs, blog post, PRDs, and roadmap updated for SARIF/OCSF support and usage examples.
Chores
- CI checkout action pinned; NOTICE dependency list updated.

feat(hooks): add hook kinds, scanner integrations, SARIF summaries, and skip controls @osterman (#2482)

what

Adds a kind discriminator to the hook system with built-in kinds for store, command, infracost, checkov, trivy, and kics.
Adds the generic command hook engine, including toolchain-aware binary resolution, live stdout/stderr passthrough, templated args/env support, ATMOS_* runtime env vars, output-file/output-dir side channels, and configurable on_failure behavior.
Adds scanner and cost integrations:
- infracost parses JSON breakdown output into a markdown cost summary.
- checkov, trivy, and kics emit SARIF and share one parser/markdown renderer.
Adds normalized SARIF handling for severity counts, linked rule IDs via helpUri, short descriptions, file/line locations, and empty-result handling.
Adds hook dependency preflight: component dependencies.tools are installed before hooks run, toolchain paths take precedence over operator PATH, and missing hook binaries fail before Terraform starts.
Adds a curated embedded Atmos tool registry with a KICS override so dependencies.tools.kics can install from release tarballs.
Adds pkg/cacerts and wires Checkov to SSL_CERT_FILE / REQUESTS_CA_BUNDLE so PyInstaller-bundled Checkov can use the host CA bundle.
Adds the --skip-hooks global flag and ATMOS_SKIP_HOOKS, supporting skip-all and comma-separated named-hook skipping.
Preserves backward compatibility by accepting legacy command: store hook configs as kind: store.
Makes hooks work with resolved component workdirs so scanners inspect the same directory Terraform uses.
Adds runnable examples for infracost, checkov, trivy, kics, and custom kind: command hooks.
Updates hook docs, global flag docs, PRDs, roadmap data, and adds the custom hooks blog post.
Refreshes CLI help snapshots and updates CI workflow actions/shell handling needed by the branch.

why

Hooks were already the right lifecycle surface for component automation, but only store had first-class dispatch behavior.
Security scanners, cost estimators, and custom tools should run from stack config without wrapper scripts or GitHub Actions glue.
Named kinds provide zero-config defaults for common tools, while kind: command keeps the system open for arbitrary binaries.
Tool auto-install and preflight failures make examples and CI usage reproducible instead of relying on whatever happens to be on PATH.
SARIF and infracost summaries create a common typed output path for terminal rendering now and Atmos Pro upload later.

notes

This PR intentionally does not add a built-in tfsec kind or hooks-tfsec example. Trivy is the maintained Aqua-backed scanner path; legacy tfsec users can still wire it with kind: command.
Atmos Pro upload, cross-run SARIF aggregation, Terraform component dependency auto-install outside hooks, and planfile threading remain follow-up work.
The CodeRabbit-generated release notes below are preserved as-is.

Summary by CodeRabbit

New Features
- Pluggable hook "kinds" (infracost, trivy, checkov, kics) plus a generic command kind; structured side‑channel outputs, markdown rendering, preflight tool resolution/auto‑install, CA‑bundle propagation, and per‑invocation hook skipping via --skip-hooks / ATMOS_SKIP_HOOKS.
Documentation
- Detailed PRDs, reference docs, CLI help, blog post, and runnable examples for each hook kind.
Tests
- Expanded unit and integration tests covering hooks, result handlers, SARIF parsing, toolchain registry, and examples.

feat(ci): auto-detect log level from GitHub Actions debug mode @osterman (#2495)

what

Atmos now auto-detects when a workflow is running with GitHub Actions debug logging enabled and switches its own log level to Debug for the run.
Triggered when ci.enabled: true is set in atmos.yaml and the active CI provider reports debug mode is on. For GitHub Actions, that means ACTIONS_RUNNER_DEBUG=true or ACTIONS_STEP_DEBUG=true — exactly what the built-in "Re-run with debug logging" button sets.
Emits a single Info-level log line when it fires so users see why their output got louder: CI provider debug mode detected — using Debug log level for this run provider=github-actions from=Info.
Built on a provider-agnostic optional interface — provider.DebugModeDetector { IsDebugMode() bool } in pkg/ci/internal/provider, plus a generic registry helper ci.DetectDebugMode() DebugModeInfo. The GHA provider implements the interface; cmd/root.go imports only pkg/ci and names no GHA-specific env vars.
Auto-detection overrides --logs-level, ATMOS_LOGS_LEVEL, and logs.level in atmos.yaml — the CI-side debug toggle is set at the repo/workflow level by the runner itself and is treated as the higher-priority signal (including over an explicit Trace or Off).
Ships with a new framework PRD (docs/prd/native-ci/framework/debug-mode-promotion.md), a changelog blog post, a roadmap milestone under the Native CI initiative, and unit tests covering: the GHA IsDebugMode() env-var matrix, the generic DetectDebugMode() type-assertion path, and the cmd-side helper's gates and override semantics.
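The optional-interface pattern described above can be sketched like this. The `DebugModeDetector` interface and the two environment variable names come from the notes; the `Provider` interface, the `githubActions` and `plainProvider` types, the injected `getenv` function, and `detectDebugMode` are hypothetical stand-ins, not the code in pkg/ci.

```go
package main

import (
	"fmt"
	"os"
)

// Provider is a hypothetical minimal CI provider interface.
type Provider interface {
	Name() string
}

// DebugModeDetector mirrors the optional interface named above. Providers
// that cannot report debug mode simply do not implement it.
type DebugModeDetector interface {
	IsDebugMode() bool
}

// githubActions is an illustrative provider; getenv is injected so the
// check can be exercised without touching the real environment.
type githubActions struct {
	getenv func(string) string
}

func (g githubActions) Name() string { return "github-actions" }

func (g githubActions) IsDebugMode() bool {
	return g.getenv("ACTIONS_RUNNER_DEBUG") == "true" ||
		g.getenv("ACTIONS_STEP_DEBUG") == "true"
}

// plainProvider does not implement DebugModeDetector.
type plainProvider struct{}

func (plainProvider) Name() string { return "plain" }

// detectDebugMode type-asserts for the optional interface, so the caller
// never names provider-specific environment variables.
func detectDebugMode(p Provider) bool {
	d, ok := p.(DebugModeDetector)
	return ok && d.IsDebugMode()
}

func main() {
	p := githubActions{getenv: os.Getenv}
	fmt.Println(p.Name(), detectDebugMode(p))
}
```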

why

Debugging Atmos in CI is usually just as important as debugging the workflow around it. GitHub provides a single "Re-run with debug logging" button to make every tool in the run verbose; today Atmos ignores it, so users get a noisier runner but the same quiet Atmos output — and have to remember a per-tool dance (ATMOS_LOGS_LEVEL=Debug somewhere in workflow YAML).
The interface-based design keeps the startup path provider-agnostic, so adding the same behavior to a future CI provider is one method on the provider — no changes in cmd/ or pkg/ci needed.
Overriding explicit --logs-level / ATMOS_LOGS_LEVEL is intentional: the CI-side toggle is an explicit, repo-/workflow-level "make everything noisy" signal that should beat per-invocation flags in the same run.
This is the same gap other GitHub-published tools have hit, e.g. pypa/gh-action-pypi-publish#322, which validates the pattern.

references

GitHub docs: Enable debug logging
GitHub changelog: Re-run jobs with debug logging
Prior art in another ecosystem: pypa/gh-action-pypi-publish#322
New PRD: docs/prd/native-ci/framework/debug-mode-promotion.md

Summary by CodeRabbit

New Features
- Atmos now auto-promotes its log level to Debug when running on GitHub Actions with per-run debug logging enabled and ci.enabled: true; an informational startup log notes the promotion and it overrides other log-level settings.
Documentation
- Added product doc and blog post explaining debug-mode promotion and usage.
Tests
- Added unit tests covering debug-mode detection and promotion behavior.
Refactor
- Minor CI hook wiring cleanup in multi-component Terraform runs.

fix(ci): restore checks: write on lint job for reviewdog annotations @osterman (#2500)

what

Add a job-scoped permissions: block on the lint ([lint] <demo-folder>) job in .github/workflows/test.yml granting contents: read + checks: write so reviewdog/action-tflint@v1 can post inline tflint findings on PRs via the GitHub Checks API.

Companion to #2499, which restored security-events: write on the docker ([lint] Dockerfile) job. Same root cause, second affected job.

why

PR #2487 introduced the first workflow-level permissions: block on test.yml to grant packages: read for ghcr.io OCI pulls. A workflow-level permissions: block replaces (not extends) the default GITHUB_TOKEN scope for every job in the file, which silently stripped the inherited checks: write that the lint job relied on.
Effect on contributors: since #2487 merged, tflint findings on PRs touching examples/<demo-folder>/components/terraform have stopped appearing as inline check annotations. The job itself still exits with the right code (fail_level: error controls that), but reviewers lost the per-line context. This restores that behavior.
Job-scoped (least privilege) over widening the workflow-level block — only this one job uses reviewdog. Matches the convention used in .github/workflows/codeql.yml and the docker-job fix already landed in #2499.
Not adding pull-requests: write: reviewdog's default github-pr-check reporter posts check runs (which need checks: write), not review comments. checks: write alone is sufficient.

references

Regression introduced by #2487 (2437e13bf, "fix(vendor): recover OCI pulls on auth rejection and surface rich errors").
Companion fix already merged: #2499 (fix(ci): restore security-events: write for Dockerfile lint SARIF upload).

fix(ci): restore security-events: write for Dockerfile lint SARIF upload @aknysh (#2499)

what

Restores security-events: write permission for the [lint] Dockerfile job in .github/workflows/test.yml so its hadolint SARIF results can be uploaded to GitHub Code Scanning.

Adds a job-level permissions: block to the docker job:

permissions:
  contents: read           # actions/checkout
  security-events: write   # github/codeql-action/upload-sarif

contents: read is re-listed because a job-level permissions: block fully overrides the workflow-level set (rather than merging).

why

PR #2487 (2437e13bf) added a top-level permissions: block to the workflow to grant packages: read for ghcr.io pulls:

permissions:
  contents: read
  packages: read

In GitHub Actions, a workflow-level permissions: block replaces (not extends) the default GITHUB_TOKEN scope for every job in the file. That replacement inadvertently stripped the implicit security-events: write that the [lint] Dockerfile job relied on to upload hadolint SARIF results via github/codeql-action/upload-sarif@v4.

Every post-merge run on main has been failing the Upload SARIF file step since #2487:

##[warning]This run of the CodeQL Action does not have permission to
access the CodeQL Action API endpoints. ... please ensure the workflow
has at least the 'security-events: read' permission.
##[error]Resource not accessible by integration -
https://docs.github.com/rest

Failing run for reference:
https://github.com/cloudposse/atmos/actions/runs/26339160817/job/77562841194

Note: hadolint itself ran successfully in the failing run; the SARIF output contained zero findings. Only the upload step failed.

A job-level fix (this PR) is preferred over expanding the workflow-level permissions block, because it follows least-privilege: only the one job that actually needs to write security events gets the elevated scope.

references

Failing CI run: https://github.com/cloudposse/atmos/actions/runs/26339160817/job/77562841194
Regressing PR: #2487
GitHub Actions permissions docs: https://docs.github.com/en/actions/security-for-github-actions/security-guides/automatic-token-authentication#permissions-for-the-github_token
github/codeql-action/upload-sarif permission requirement: https://docs.github.com/en/code-security/code-scanning/integrating-with-code-scanning/uploading-a-sarif-file-to-github#uploading-the-sarif-file-to-github

Summary by CodeRabbit

Chores
- Fixed permissions configuration in the CI/CD pipeline to restore security scanning capabilities in automated testing workflows.

Add toolchain package verification @osterman (#2415)

what

Adds pkg/toolchain/verification for Aqua-compatible checksum, signature, and attestation verification before tool extraction.
Preserves Aqua verification metadata across registry parsing, overrides, version overrides, installer flow, and lockfile metadata.
Adds toolchain verification policy config, docs, roadmap entry, and changelog post.
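The checksum part of this flow can be illustrated with a short sketch, assuming a SHA-256 digest supplied by registry metadata. `sha256Hex` and `verifySHA256` are hypothetical helpers, not the API of pkg/toolchain/verification; signature and attestation checks are out of scope here.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// sha256Hex returns the lowercase hex SHA-256 digest of data.
func sha256Hex(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// verifySHA256 is a hypothetical pre-extraction check: it compares the
// digest of a downloaded asset with the digest the registry recorded.
func verifySHA256(asset []byte, expectedHex string) error {
	actual := sha256Hex(asset)
	if !strings.EqualFold(actual, strings.TrimSpace(expectedHex)) {
		return fmt.Errorf("checksum mismatch: expected %s, got %s", expectedHex, actual)
	}
	return nil
}

func main() {
	asset := []byte("pretend this is a release tarball")
	expected := sha256Hex(asset) // stands in for registry metadata

	fmt.Println(verifySHA256(asset, expected))                       // matching asset
	fmt.Println(verifySHA256([]byte("tampered"), expected) != nil) // mismatch is rejected
}
```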

why

Prevents tampered or mismatched toolchain package assets from being installed when registry metadata provides verification data.
Keeps the default behavior non-breaking while allowing stricter checksum and signature requirements for CI and regulated environments.

references

Tested with go test ./pkg/toolchain/installer ./pkg/toolchain/... ./cmd/toolchain/...
Linted with scripts/run-custom-golangci-lint.sh

Summary by CodeRabbit

New Features
- Toolchain now verifies downloaded packages (checksums and signatures/attestations) before extraction when registry metadata is present.
- Multiple verification methods supported (checksums, cosign, SLSA provenance, minisign, GitHub attestations); verifier install mode configurable (auto or path-only).
- Verification results and metadata are recorded in the toolchain lockfile; lockfile path is configurable.
Bug Fixes
- Cached assets validated against recorded source URL; mismatched or tampered cached files are re-downloaded or removed; lockfile not updated when extraction fails.
Documentation
- Added docs, examples, and a blog post explaining package verification and configuration.

🚀 Enhancements

fix(yaml-functions): detect cross-component !terraform.state cycles instead of stack-overflowing @thejrose1984 (#2533)

what

Fix the goroutine stack overflow reported in #2457: two components that reference each other via !terraform.state (A → B, B → A) drove atmos describe affected / describe component / terraform plan into infinite recursion until the Go runtime stack overflowed.

The YAML-function cycle detector already existed and worked within a single ProcessCustomYamlTags walk, but it didn't survive the recursive describe path that !terraform.state triggers.

why

When a component is being processed and the resolver encounters !terraform.state, it does:

processTagTerraformStateWithContext
  → GetTerraformState
    → ExecuteDescribeComponent (ProcessYamlFunctions: true)
      → ProcessStacks
        → ProcessCustomYamlTags   ← re-entry

ProcessCustomYamlTags was wrapping every entry with scopedResolutionContext(), which saved the parent's context and installed a fresh, empty one. So when the inner walk found B's !terraform.state a ..., the cycle detector's Visited map had no record that A was already in progress, and it pushed A → B → A → B forever until the goroutine stack hit its 1 GB cap.

The cycle detector unit tests pass because they exercise Push/Pop on a single context; the only integration tests that would have caught this were t.Skip()-ed placeholders in internal/exec/yaml_func_circular_deps_test.go referencing fixtures that don't exist.
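To make the failure mode concrete, here is a minimal sketch of a visited-set cycle detector that shares one context across the whole recursive walk. All names (`resolutionContext`, `push`, `pop`, `resolve`, `errCycle`, `isCycle`) are hypothetical and simplified; this is not the Atmos ResolutionContext.

```go
package main

import (
	"errors"
	"fmt"
)

var errCycle = errors.New("circular dependency")

// resolutionContext is a hypothetical stand-in for the resolver's context:
// it records which components are currently being resolved.
type resolutionContext struct {
	visited map[string]bool
}

func newContext() *resolutionContext {
	return &resolutionContext{visited: map[string]bool{}}
}

// push marks a component as in progress, or reports a cycle if it already is.
func (c *resolutionContext) push(name string) error {
	if c.visited[name] {
		return fmt.Errorf("%w: %s", errCycle, name)
	}
	c.visited[name] = true
	return nil
}

// pop clears the in-progress mark once a component has been resolved.
func (c *resolutionContext) pop(name string) { delete(c.visited, name) }

// resolve walks a component's references using one shared context for the
// whole walk, pairing every successful push with a deferred pop.
func resolve(ctx *resolutionContext, deps map[string][]string, name string) error {
	if err := ctx.push(name); err != nil {
		return err
	}
	defer ctx.pop(name)
	for _, dep := range deps[name] {
		if err := resolve(ctx, deps, dep); err != nil {
			return err
		}
	}
	return nil
}

// isCycle reports whether err carries the cycle sentinel.
func isCycle(err error) bool { return errors.Is(err, errCycle) }

func main() {
	deps := map[string][]string{"a": {"b"}, "b": {"a"}}
	fmt.Println(isCycle(resolve(newContext(), deps, "a")))
}
```

If `resolve` created a new context on every call instead of receiving the shared one, `push` would never see `a` as already in progress and the recursion would not terminate, which is the behavior described above.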

how

Three coordinated changes:

internal/exec/yaml_func_utils.go — ProcessCustomYamlTags now reuses the goroutine-local ResolutionContext via GetOrCreateResolutionContext() and drops the scopedResolutionContext() wrap. The Push/Pop discipline in processTagTerraformStateWithContext / trackOutputDependency already pairs every successful Push with a deferred Pop, so the context is empty when the top-level walk returns. Removed the now-unused scopedResolutionContext helper.
internal/exec/yaml_func_resolution_context.go — Added MaxResolutionDepth = 64 and a depth check in Push that returns ErrYamlFuncMaxResolutionDepth if any future re-entry path slips past the cycle detector. This is belt-and-suspenders: real cycles are caught by the Visited check; the depth bound exists so atmos surfaces a clean error instead of stack-overflowing if the detector regresses.
internal/exec/terraform_state_utils.go — GetTerraformState's describe-error wrap now uses double %w so errors.Is can match a propagated sentinel like ErrCircularDependency through the descriptive wrapper. Without this, the cycle error message is human-readable but errors.Is(err, ErrCircularDependency) returns false, breaking callers that try to handle the error programmatically.
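The effect of the double %w change can be shown in isolation. The sketch assumes Go 1.20 or later, where fmt.Errorf accepts more than one %w verb. The sentinel names and the `wrapBoth`, `wrapOne`, and `matchesCircular` helpers are hypothetical and stand in for the real errors and wrapping in terraform_state_utils.go.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinels standing in for the real describe and cycle errors.
var (
	errDescribe = errors.New("failed to describe component")
	errCircular = errors.New("circular dependency detected")
)

// wrapBoth wraps the describe sentinel and the cause with %w, so errors.Is
// can match either one through the descriptive wrapper.
func wrapBoth(cause error) error {
	return fmt.Errorf("%w: component %q: %w", errDescribe, "vpc", cause)
}

// wrapOne formats the cause with %v, which flattens it to text and drops
// it from the error chain.
func wrapOne(cause error) error {
	return fmt.Errorf("%w: component %q: %v", errDescribe, "vpc", cause)
}

// matchesCircular reports whether err carries the cycle sentinel.
func matchesCircular(err error) bool { return errors.Is(err, errCircular) }

func main() {
	fmt.Println(matchesCircular(wrapBoth(errCircular))) // true
	fmt.Println(matchesCircular(wrapOne(errCircular)))  // false
}
```

Both wrappers produce the same human-readable message; only the first keeps the cause available to callers that branch on it.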

tests

New tests/yaml_functions_circular_deps_integration_test.go plus fixture at tests/fixtures/scenarios/yaml-functions-circular-deps/ — exercises the full ExecuteDescribeComponent path on an A↔B cycle and asserts ErrCircularDependency comes back (and not the depth safety net, which would indicate the cycle detector regressed). Test completes in ~20 ms instead of running forever.
Removed internal/exec/yaml_func_circular_deps_test.go — all four tests in it were t.Skip()-ed placeholders referencing fixtures that don't exist. The new integration test replaces them with one that actually runs.
All existing TestResolutionContext* unit tests still pass unchanged.

references

Closes #2457
Original cycle-detection PR this gap slipped past: #1708

Summary by CodeRabbit

New Features
- Improved YAML-function cycle detection with clearer, surfaced errors.
- Added a maximum recursion-depth safeguard to prevent stack overflow during YAML-function resolution.
Bug Fixes
- Enhanced error wrapping so root causes are preserved and easier to identify.
Tests
- Added an integration regression test for cross-component cycles and removed obsolete skipped tests.
Fixtures
- Added scenario fixtures to reproduce and validate circular dependency behavior.

[codex] Fix remote Git stack imports @osterman (#2528)

what

Add a remote stack import resolver that can return multiple local import matches while preserving the existing single-file download path.
Handle Git go-getter //subdir imports by cloning the repository as a directory, resolving files, no-extension YAML variants, explicit globs, and recursive YAML directory imports.
Cache expanded remote files with stable <original-uri>#<relative-file> keys and update stack processing/tests to consume those keys for imports and provenance.
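
As a rough illustration of the `<original-uri>#<relative-file>` key shape, the sketch below derives a key from a clone directory and a matched file. The helper name remoteImportKey, the example URI, and the cache paths are all hypothetical; only the key format is taken from the text above.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// remoteImportKey (hypothetical helper) builds a stable key from the URI the
// user wrote and the file's path relative to the cloned directory, so the
// local cache location never appears in the key.
func remoteImportKey(originalURI, cloneDir, localFile string) (string, error) {
	rel, err := filepath.Rel(cloneDir, localFile)
	if err != nil {
		return "", err
	}
	// Use forward slashes so the key is the same on every OS.
	return originalURI + "#" + filepath.ToSlash(rel), nil
}

func main() {
	key, err := remoteImportKey(
		"git::https://github.com/acme/infra.git//stacks?ref=v1", // example URI, not from the source
		"/tmp/cache/abc",
		"/tmp/cache/abc/catalog/vpc.yaml",
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(key)
}
```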

why

Fixes the regression where remote Git subdir imports were forced through file mode, causing git clone to drop the repository name and target https://github.com/<owner>/.
Supports remote folder imports for stack manifests without leaking local cache paths into import metadata.

references

Remote imports docs: https://atmos.tools/stacks/imports
Validated with go test ./pkg/stack/imports ./pkg/downloader, go test ./tests -run RemoteStackImports, and go test ./internal/exec -run '^$'.
Commit hooks passed after building the required local custom-gcl lint binary.

Summary by CodeRabbit

New Features
- Remote imports now resolve Git subdirectories, wildcards, and nested-remote imports with deterministic matching and improved per-session and persistent caching.
- Per-import control for nested import resolution (local vs remote) with validation and defaults.
Bug Fixes
- More consistent handling of missing imports and skip-if-missing behavior; clearer errors for unresolved imports.
Documentation
- Docs, examples, and PRD updated to explain nested-imports behavior and best practices.
Tests
- Expanded unit and integration tests for remote/Git resolution and CLI scenarios.

fix(hooks): persist interactive stack selection and auth context for PostRunE hooks @aknysh (#2520)

what

Store hooks now fire when the stack is selected via the interactive prompt (not just when passed with -s).
Store hooks can now read terraform outputs from S3 backends that require role assumption via assume_role.
Auth context and auth manager from the main terraform execution are persisted and injected into PostRunE hook info so the hook's terraform output subprocess has the correct credentials.

why

Two bugs in the store hooks execution path caused hooks to silently fail in common scenarios:

Interactive stack selection lost for hooks (#2432): When a user runs atmos terraform apply component and selects the stack from the interactive prompt, the selected stack value was stored in the local info.Stack but never persisted to the Cobra flag set. PostRunE hooks re-parse args via ProcessCommandLineArgs, which reads from cmd.Flags().GetString("stack") — still empty. The hook silently skipped because it saw no stack. Fix: after the interactive prompt fills info.Stack, also set the Cobra flag via f.Value.Set(stack) so downstream consumers can read it.
Missing backend credentials for hook's terraform output (#2433): The store hook's terraform output subprocess needs credentials to access the S3 state backend, which often requires a chained role assumption (e.g., dev-role → tfstate-access-role). The main ExecuteTerraform sets up the full credential chain via setupTerraformAuth and prepareComponentExecution, but it takes info by value — the populated AuthContext and AuthManager don't flow back to the caller. The PostRunE hook creates a fresh info with nil auth fields, so the output subprocess fails with "No valid credential sources found". Fix: persist the auth context via SetLastAuthContext after prepareComponentExecution, and inject it into the hook's fresh info in runHooksWithOutput.

references

Closes #2432
Closes #2433
Related: #2428 (original consolidated issue, closed in favor of #2432 and #2433)
Related: #2357 (auth resolver injection for hooks)

Summary by CodeRabbit

Bug Fixes
- Post-run hooks now preserve Terraform auth so store hooks and hooks triggered after interactive stack selection work when role-assumption is required.
Chores
- Bumped Go to 1.26.3, upgraded many dependencies, updated NOTICE license listings, and advanced app version to 1.220.0.
Tests
- Added tests for interactive component/stack prompts and auth-context persistence.
Documentation
- Added two docs describing the store-hook role-assumption issue and interactive-prompt hook behavior.

fix(auth): don't fall through to webflow when aws/user keyring read fails @osterman (#2470)

what

Stop atmos auth login from silently falling back to browser-based OAuth2 webflow when an aws/user identity HAS configured credentials but the keyring read fails (corrupted entry, missing fields, deserialization error, permission denied).
Distinguish two failure modes in credentialsFromStore: ErrAwsUserNotConfigured (keyring miss — webflow remains an appropriate fallback) vs new ErrAwsUserKeyringReadFailed (keyring reachable but unreadable — webflow is now skipped so the real error surfaces).
Promote the "starting browser-based authentication" message from Debug to Info and include a hint pointing at atmos auth user configure and webflow_enabled: false, so users see why the browser opened.
Fix a latent realm bug in cmd/auth/user/configure.go: it was hardcoding store.Store(alias, creds, "") while the login resolver reads Retrieve(i.name, i.realm). With auth.realm or ATMOS_AUTH_REALM set, configure wrote one slot and login read another. Configure now computes the realm the same way pkg/auth/manager.go does.
Add regression test TestUserIdentity_Authenticate_KeyringReadFailureSkipsWebflow that primes the keyring with an incomplete entry and asserts the webflow callback server is never reached. Tighten TestUser_credentialsFromStore to assert the specific sentinel returned in each path.
Document the resolver order in website/docs/cli/configuration/auth/identities.mdx: clarify that Atmos does not consult ambient AWS credentials (env vars, ~/.aws/credentials, instance profiles) for aws/user, and call out the new keyring-read-failure diagnostic.

why

Reported by Dan Miller in the Atmos community Slack: after upgrading past v1.214 he got redirected to the AWS sign-in browser flow even though he ran atmos auth user configure and "had access keys." The only working escape was credentials.webflow_enabled: false, which masked the real problem rather than fixing it.
The webflow path introduced in #2148 (v1.215) was the right design — webflow is a legitimate authentication tier for IAM users — but the resolver collapsed every keyring failure into a single "not configured" error, so a corrupted/unreadable keyring entry looked identical to a fresh install and the browser flow fired. The browser flow then 400'd against the AWS sign-in token endpoint with no indication that the real failure was reading the keyring.
Restoring the design invariant ("if creds are provided, use them; never silently bypass them with webflow") requires distinguishing the two failure modes. Once distinguished, webflow is correctly gated and the Info-level diagnostic tells the user what Atmos saw.
The realm hardcode was not Dan's trigger (his realm resolved to "" on both sides) but is a real latent foot-gun for anyone using auth.realm for credential isolation — fixing it here closes the gap before another report arrives.

references

Slack thread: original report from Dan Miller in cloudposse community Slack (no public issue).
PR #2148 — original OAuth2 PKCE webflow introduction (commit 4e32b532f, shipped in v1.215).

Summary by CodeRabbit

Bug Fixes
- Credential configuration now resolves the auth realm so saved credentials go to the correct slot.
- Keyring read failures (corrupted/unreadable entries) are reported and no longer silently fall back to browser auth; webflow is only used when credentials are genuinely absent.
- Better distinction between missing credentials and unreadable keyring data.
Documentation
- Clarified browser-based fallback conditions and what counts as static AWS user credentials.
Tests
- Added regression tests to ensure keyring read failures skip webflow.

fix(aws/security): allow unlimited findings and record invocation in OCSF @osterman (#2517)

what

atmos aws security analyze --max-findings 0 (or any non-positive value) now fetches all matching findings from Security Hub / Inspector instead of silently capping at 500.
The fetcher emits a log.Warn whenever pagination halts at the limit while NextToken != nil, so truncation is never silent again.
Every OCSF event now carries the literal command line, arguments, timing, exit code, working directory, and scanned scope under unmapped["atmos.invocation"] — the OCSF analogue of SARIF's run.invocations[] (which Atmos already emits).
Default behavior is unchanged (--max-findings still defaults to 500); only the previously-broken 0 semantics now work, and the help text + docs document it.
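
The limit semantics can be sketched as follows, assuming a generic page type with a NextToken pointer. The fetchFindings and fakePager helpers are hypothetical; the real fetcher calls Security Hub / Inspector and logs a warning where this sketch returns a truncated flag.

```go
package main

import (
	"fmt"
	"strconv"
)

type page struct {
	Findings  []string
	NextToken *string
}

// fetchFindings pages through results. maxFindings <= 0 means unlimited.
// truncated reports that the limit stopped the fetch while more findings
// were still available.
func fetchFindings(getPage func(token *string) page, maxFindings int) (out []string, truncated bool) {
	var token *string
	for {
		p := getPage(token)
		for _, f := range p.Findings {
			if maxFindings > 0 && len(out) == maxFindings {
				return out, true // limit hit mid-page
			}
			out = append(out, f)
		}
		if p.NextToken == nil {
			return out, false // nothing left to fetch
		}
		if maxFindings > 0 && len(out) >= maxFindings {
			return out, true // limit hit while NextToken != nil
		}
		token = p.NextToken
	}
}

// fakePager serves fixed pages; the token is the next page index.
func fakePager(pages [][]string) func(*string) page {
	return func(token *string) page {
		i := 0
		if token != nil {
			i, _ = strconv.Atoi(*token)
		}
		p := page{Findings: pages[i]}
		if i+1 < len(pages) {
			next := strconv.Itoa(i + 1)
			p.NextToken = &next
		}
		return p
	}
}

func main() {
	pager := fakePager([][]string{{"f1", "f2"}, {"f3", "f4"}, {"f5"}})
	all, truncated := fetchFindings(pager, 0)
	fmt.Println(len(all), truncated) // 5 false
	some, truncated := fetchFindings(pager, 4)
	fmt.Println(len(some), truncated) // 4 true
}
```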

why

The 500 cap is a CLI-layer default, not an AWS pagination limit. For multi-account orgs, real finding counts routinely exceed 500, so --format json/sarif/ocsf exports were silently incomplete and downstream tooling (SIEM ingestion, ticketing, dashboards) was missing data with no error or warning.
AI analysis users get cost protection by keeping 500 as the default; export users get correctness by opting in to --max-findings 0. The log.Warn covers the case where users forget — they'll see in the output that more findings exist.
SARIF already records the invocation, but OCSF Detection Finding 2004 has no native invocation slot. Auditors and SIEM analysts asking "what command produced this batch?" can now answer it from either format.

references

Closes the silent-clamp issue surfaced during use of atmos aws security analyze for SIEM export pipelines.
SARIF 2.1.0 invocation spec: https://docs.oasis-open.org/sarif/sarif/v2.1.0/sarif-v2.1.0.html (already implemented in pkg/aws/security/sarif.go).
OCSF 1.4.0 Detection Finding: https://schema.ocsf.io/1.4.0/classes/detection_finding (no native invocation field; landed in unmapped extension).
Related shipped feature: #2483 (initial SARIF/OCSF export support).

Summary by CodeRabbit

New Features
- --max-findings now distinguishes "unset" from an explicit 0; 0 means unlimited and effective default remains 500.
- CLI prints a clear info message when fetching all findings vs a limited fetch.
- OCSF exports now include report invocation metadata in each event when available.
Bug Fixes
- Warning logged when a positive limit truncates results to avoid silent loss; pagination now respects the limit.
Documentation
- CLI, config, and PRD docs updated to reflect the new semantics.
Tests
- Added tests covering max-findings precedence, pagination behaviors, and OCSF invocation attachment.

fix(terraform-state): honor target component's env section for AWS credentials @arcaven (#2502)

what

Makes !terraform.state's in-process S3 backend reader honor a whitelisted subset of the target component's env section — AWS_PROFILE, AWS_REGION, AWS_DEFAULT_REGION, AWS_CONFIG_FILE, AWS_SHARED_CREDENTIALS_FILE, AWS_ENDPOINT_URL_S3, AWS_ENDPOINT_URL_STS, AWS_USE_FIPS_ENDPOINT — matching the behavior !terraform.output already exhibits via its subprocess env overlay. AWS_STS_REGIONAL_ENDPOINTS is intentionally excluded because it's an SDK v1 toggle and a no-op in SDK v2.
Adds internal/terraform_backend.ExtractComponentEnvOverlay (with a nil-pointer guard) and internal/terraform_backend.ComponentEnvKeysAWS. Threads the overlay through getCachedS3Client and ReadTerraformBackendS3. The S3 client cache key now includes every whitelisted key that affects client behavior (profile, both regions, both endpoint URLs, FIPS, config + credentials files) so two components with distinct settings never alias each other.
Extends pkg/aws/identity with LoadConfigWithAuthAndEnv. LoadConfigWithAuth becomes a thin nil-overlay wrapper, so every existing call site behaves identically. Within the new variant:
- AWS_USE_FIPS_ENDPOINT (truthy "true"/"1") is applied via config.WithUseFIPSEndpoint(aws.FIPSEndpointStateEnabled) — a global config setting.
- AWS_ENDPOINT_URL_STS is applied at sts.NewFromConfig in the assume-role flow — a per-service option in SDK v2.
- AWS_ENDPOINT_URL_S3 is applied at s3.NewFromConfig in getCachedS3Client for the same SDK v2 per-service-option reason.
Adds focused unit tests covering the overlay extraction (9 subcases including the nil-pointer guard), the whitelist surface stability, the credential-resolution precedence (5 cases including a sentinel that asserts LoadConfigWithAuth ≡ LoadConfigWithAuthAndEnv(..., nil)), and the FIPS application (4 subcases including an authContext-suppresses-overlay assertion).
Adds an advanced docs section to functions/yaml/terraform.state.mdx ("Switching AWS credentials per component via the env section") and a cross-reference note in functions/yaml/terraform.output.mdx. Placed alongside the existing specialized sections (SSE-C, GCS, static); primary examples untouched.
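
A rough sketch of the overlay extraction and the cache-key idea, assuming component sections arrive as a map[string]any. The key list is the whitelist named above; extractEnvOverlay and clientCacheKey are hypothetical simplifications and do not reproduce the real ExtractComponentEnvOverlay signature.

```go
package main

import (
	"fmt"
	"strings"
)

// Whitelisted keys, in a fixed order, as listed in the PR text.
var componentEnvKeysAWS = []string{
	"AWS_PROFILE", "AWS_REGION", "AWS_DEFAULT_REGION",
	"AWS_CONFIG_FILE", "AWS_SHARED_CREDENTIALS_FILE",
	"AWS_ENDPOINT_URL_S3", "AWS_ENDPOINT_URL_STS", "AWS_USE_FIPS_ENDPOINT",
}

// extractEnvOverlay returns only whitelisted, non-empty string values from
// the component's env section. It returns nil when nothing applies, so the
// caller can keep the prior code path unchanged.
func extractEnvOverlay(sections map[string]any) map[string]string {
	env, ok := sections["env"].(map[string]any)
	if !ok {
		return nil
	}
	var overlay map[string]string
	for _, key := range componentEnvKeysAWS {
		v, ok := env[key].(string)
		if !ok || v == "" {
			continue
		}
		if overlay == nil {
			overlay = map[string]string{}
		}
		overlay[key] = v
	}
	return overlay
}

// clientCacheKey folds every whitelisted key into the cache key so two
// components with different settings never share a client.
func clientCacheKey(bucket string, overlay map[string]string) string {
	parts := []string{bucket}
	for _, key := range componentEnvKeysAWS {
		parts = append(parts, key+"="+overlay[key]) // reading a nil map yields ""
	}
	return strings.Join(parts, "|")
}

func main() {
	sections := map[string]any{"env": map[string]any{
		"AWS_PROFILE":                "org-b",
		"AWS_STS_REGIONAL_ENDPOINTS": "regional", // not whitelisted, ignored
	}}
	overlay := extractEnvOverlay(sections)
	fmt.Println(overlay)
	fmt.Println(clientCacheKey("tfstate", overlay) != clientCacheKey("tfstate", nil))
}
```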

why

!terraform.output and !terraform.state are documented as interchangeable readers of the same state. They aren't, in setups that distribute Terraform state across AWS organizations.
!terraform.output shells out to tofu/terraform. pkg/terraform/output.defaultEnvironmentSetup.SetupEnvironment overlays the target component's env section onto the subprocess environment as its final step, so env.AWS_PROFILE reaches the backend's credential resolution.
!terraform.state reads in-process via internal/terraform_backend.ReadTerraformBackendS3 → getCachedS3Client → pkg/aws/identity.LoadConfigWithAuth → config.LoadDefaultConfig. When no Atmos AWSAuthContext is provided, the SDK uses the calling process's AWS_PROFILE. componentSections["env"] is in scope on the same map but no function in this chain reads it.
In practice this means a stack in one AWS org calling !terraform.state against a stack in another org fails with AccessDenied on sts:AssumeRole (or silently reads the wrong account when buckets happen to share names), while the equivalent !terraform.output call works. The two functions are supposed to produce identical results; this is a correctness gap.
Backward compatibility was the design constraint. A component without any whitelisted env key produces a nil overlay and the resolved code path is byte-identical to the prior behavior. Users who don't use this pattern see no change. A sentinel test asserts this directly.
Atmos auth (AWSAuthContext) layers above the overlay and still wins outright. The env overlay path is intended for setups not yet on Atmos auth, which is the common case for the SweetOps community at the moment.
GCS and AzureRM backends share the bug structurally. We don't have GCS or Azure tfstate to validate those readers locally and would rather defer than ship code we haven't run. The fix shape generalises trivially (one new whitelist slice + one ExtractComponentEnvOverlay call per backend). Calling that out in the issue and the docs so a contributor with those backends can pick it up.

references

Closes #2501
Working reference implementation we're mirroring in-process: pkg/terraform/output/environment.go::SetupEnvironment (the final for k, v := range config.Env loop).
Function this PR extends: pkg/aws/identity.LoadConfigWithAuth.
In-process reader being fixed: internal/terraform_backend/terraform_backend_s3.go::ReadTerraformBackendS3 / getCachedS3Client.
Adjacent context for S3 path construction in !terraform.state: #1920.
Auth-chain inheritance for !terraform.output (referenced from terraform_output_utils.go): #1921.

Summary by CodeRabbit

New Features
- Per-component AWS env overlay for Terraform S3 remote-state reads; explicit auth/context still takes precedence. Identity loading now accepts optional env overlays and respects overlay vs. auth precedence.
Documentation
- Guidance added for cross-account remote-state reads and env-overlay behavior for terraform.output and terraform.state.
Tests
- Added tests for overlay extraction, precedence rules, FIPS behavior, and a stable ordered whitelist of AWS env keys.
Chores
- Minor log formatting adjustments.

fix(toolchain): retry cosign on transient Sigstore Rekor failures @osterman (#2506)

what

Wrap the cosign verify-blob exec in pkg/toolchain/verification/signature.go with bounded exponential backoff, retrying only on a narrow allowlist of transient Sigstore Rekor failures.
Add a new errUtils.ErrSignatureRetryable sentinel next to the existing ErrDownloadRetryable, and a classifyCosignError helper that joins the sentinel into cosign errors when the combined output matches a Rekor-flake marker.
New runCosignWithRetry uses the same retry budget as the existing downloader (5 attempts, 1s → 10s exponential). Logs a WARN before each retry so CI logs surface the upstream-service context.

why

cosign verify-blob sometimes fails not because of a real signature problem but because Sigstore's Rekor transparency-log API returns a short-window upstream error. The most common signature is:
```
Error: searching log query: [POST /api/v1/log/entries/retrieve][400] searchLogQueryBadRequest
  {"code":400,"message":"verifying signature: ecdsa: Invalid IEEE_P1363 encoded bytes"}
```
The same artifact verifies cleanly seconds later. Without retry, this turns a transient Sigstore outage into a hard "tool not found" install failure for every Atmos user pulling toolchain assets during the outage window. We hit this twice in 48h on the Windows mock jobs alone.
The retry allowlist is intentionally narrow — searchLogQueryBadRequest, Invalid IEEE_P1363 encoded bytes, and Rekor's /api/v1/log/entries/retrieve endpoint paired with a 5xx status. Anything outside the allowlist (tampering, expired cert, identity mismatch, missing signature) surfaces immediately on the first attempt. Blanket-retrying signature verification would mask real tampering events and is the canonical anti-pattern; we do not do that here.
Mirrors the existing pattern in pkg/toolchain/installer/download.go (sentinel + retry.WithPredicate + classifier) so the toolchain's resilience story stays consistent across download and verification.

references

Failing run that prompted this: https://github.com/cloudposse/atmos/actions/runs/26368136271/job/77615786736 (Windows [mock-windows] demo-component-versions, Rekor returned 400 on a valid OpenTofu 1.9.1 release signature).
Same flake also hit [mock-windows] demo-vendoring in the same run.
Companion CI permissions fix: #2499 (merged) and #2500 (open).

Summary by CodeRabbit

Improvements
- Signature verification now automatically retries on transient Sigstore Rekor service issues with bounded exponential backoff for greater resilience.
- Error handling improved to distinguish retryable transient failures from permanent signature verification errors, reducing false failures.
Tests
- Added unit tests covering classification of transient Rekor failures and retry behavior through the public verification path.

fix(terraform): skip non-terraform and deleted components in `atmos terraform plan/apply --affected` @thejrose1984 (#2484)

what

Filters the affected component list in atmos terraform plan/apply --affected so the command no longer runs against helmfile components, packer components, or components deleted in HEAD.

why

Reported in #2361. getAffectedComponents returns every affected component regardless of type — helmfile, packer, and BASE-only deletions are all included — and ExecuteTerraformAffected was iterating that full list and calling ExecuteTerraform on each, producing output like:

INFO  Executing command="atmos terraform apply example-terraform -s example"
INFO  Executing command="atmos terraform apply example-helmfile -s example"
INFO  Executing command="atmos terraform apply example-packer -s example"

Documentation (website/docs/cli/commands/terraform/usage.mdx) describes --affected as executing the command "on all the directly affected components," with the implicit constraint that those components belong to the atmos terraform subcommand. Helmfile and Packer subcommands would orchestrate their own components, and deleted components have no on-disk module so terraform plan/apply against them either errors or no-ops.

how

New package-private helper filterTerraformAffected keeps only items where ComponentType == cfg.TerraformComponentType and !Deleted. Called once in ExecuteTerraformAffected after getAffectedComponents, before addDependentsToAffected (which is expensive and shouldn't run for items we will drop).
Defense-in-depth filter in executeTerraformAffectedComponentInDepOrder: when --include-dependents is set, any dependent with ComponentType explicitly set to a non-terraform value is skipped during the recursion.
atmos describe affected is unchanged. It still reports the full affected set (terraform + helmfile + packer + deleted) as the canonical introspection view. The filter is scoped to the execution path of atmos terraform <cmd> --affected.
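
The filter rule can be sketched like this, assuming a trimmed-down Affected struct with only the fields the rule needs. The helper name matches the text above; the struct and the constant are illustrative stand-ins for the real schema type and cfg.TerraformComponentType.

```go
package main

import "fmt"

const terraformComponentType = "terraform" // stand-in for cfg.TerraformComponentType

// Affected is a trimmed-down, hypothetical version of the affected item.
type Affected struct {
	Component     string
	ComponentType string
	Deleted       bool
}

// filterTerraformAffected keeps terraform components that still exist in
// HEAD. It compacts in place, reusing the input slice's backing array.
func filterTerraformAffected(items []Affected) []Affected {
	kept := items[:0]
	for _, item := range items {
		if item.ComponentType == terraformComponentType && !item.Deleted {
			kept = append(kept, item)
		}
	}
	return kept
}

func main() {
	affected := []Affected{
		{Component: "example-terraform", ComponentType: "terraform"},
		{Component: "example-helmfile", ComponentType: "helmfile"},
		{Component: "example-packer", ComponentType: "packer"},
		{Component: "removed-vpc", ComponentType: "terraform", Deleted: true},
	}
	for _, item := range filterTerraformAffected(affected) {
		fmt.Println(item.Component)
	}
}
```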

tests

New file internal/exec/terraform_affected_filter_test.go: 8 portable table cases covering the filter (no gomonkey, runs on every CI matrix entry). Includes a case mirroring the exact #2361 reproducer fixture. 100% line coverage of the new helper.
Three new cases added to TestExecuteTerraformAffectedComponentInDepOrder table in internal/exec/terraform_utils_test.go: helmfile dependent skipped, packer dependent skipped, mixed-type dependents (only the terraform one runs).

compatibility

Bug fix only. The previous behavior produced incorrect commands that would fail mid-execution (terraform-apply against a helmfile component errors out). No public Go API or CLI surface changes. The most visible user-facing shift is an exit-code flip from non-zero to zero when a changeset contains only non-terraform or deleted components — which now correctly reports "No components affected" instead of erroring partway through.

Closes #2361.

Summary by CodeRabbit

Bug Fixes
- Terraform execution now excludes non-Terraform dependents (e.g., Helmfile, Packer), skips deleted components, and treats empty component types as Terraform to preserve compatibility; Terraform-only ordering is preserved to avoid unnecessary processing.
Tests
- Added tests for filtering behavior, in-place compaction semantics, non-Terraform exclusion, and a regression ensuring only Terraform entries are executed.

perf(exec/merge/utils): optimize describe affected for large-stack workloads (~50% local wall-clock, projected ~10× on 2-core CI) @aknysh (#2496)

what

Performance optimization arc for atmos describe affected on large-stack workloads.
Targets the inheritance + merge + YAML-parse hot paths that have grown since the
October 2025 perf instrumentation arc (PRs #1576/#1611/#1622/#1639) added the
auth, profiles, locals, and per-file position-tracking subsystems.

Twelve shipped phases (with two attempted-and-reverted optimizations documented
in-code so they aren't re-tried without addressing the underlying contract
violations):

Phase 2 — cacheBaseComponentConfig switched from RWMutex + map to
sync.Map; deep-copy moved outside the critical section. Eliminates write-lock
contention that serialized every cache write across goroutines and padded
apparent CPU time with lock-wait.
Phase 3 — WalkAndDeferYAMLFunctions short-circuits when the subtree
contains no Atmos YAML functions (!template, !terraform.*, !store*,
!exec, !env). Returns the input map as-is instead of allocating a deep
copy at every recursion level.
Phase 4 — extractLocalsFromRawYAML cached via sync.Map keyed by
filePath + FNV-1a(yamlContent). The content-hash component prevents
test pollution when the same logical file path is reused with different
content. Also fixes a pre-existing data race in
extractAndAddLocalsToContext (shallow-clones the input context map
before file-scoped delete + assign).
Phase 5 — MergeWithDeferred 0-input fast path: when every layer is
empty, return an empty map immediately without walking or merging. The
1-input shortcut originally shipped alongside was REVERTED on
2026-05-24 after CI surfaced a regression — see "What was tried and
reverted" below.
Phase 6 — parsedYAMLCache switched to sync.Map; deep-copy of
yaml.Node + PositionMap moved outside the critical section. Same lock
contention pattern as Phase 2 applied to the YAML parser cache.
Phase 7 — processCustomTags split into outer + inner functions: the
hasCustomTags pre-check runs once at the entry point instead of
re-walking the subtree at every recursive call (O(N×depth) → O(N)).
Phase 8 — new decodedYAMLCache stores the post-Decode + post-Intern
result of UnmarshalYAMLFromFileWithPositions[map[string]any]. Skips
yaml.Node.Decode + InternStringsInMap on every repeat call for the
same (file, content hash) pair.
Phase 10 — processYAMLNode split into outer + inner (Phase 7
pattern). Removes per-recursion perf.Track overhead from the recursive
YAML walker used by yq evaluation. Same pattern was tried for
WalkAndDeferYAMLFunctions and reverted: the inner-only walker had to
allocate unconditionally on every recursion, regressing function-sparse
subtrees more than the perf.Track savings recovered.
Phase 11 — processTerraformRemoteStateBackend extracts the
backend-type-specific map from each input first (via the new
extractBackendTypeMap helper), then merges just those two scoped maps.
Avoids deep-copying unrelated backend-type entries
(s3/gcs/azurerm/etc.) just to extract one key from the merged result.
Phase 12 + 13 — deepCopyBaseComponentConfigMaps guards every
m.DeepCopyMap call with len(src.Field) > 0. Skips function-call and
allocation overhead for the empty-field case that dominates real
workloads (most components leave several of the 10 fields empty).
Coverage — new tests close the gaps the arc introduced. Public
Clear*Cache wrappers, extractBackendTypeMap type-mismatch path, and
the Phase 12/13 empty-field contract are all now covered.
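
The caching pattern shared by Phases 2, 4, 6, and 8 can be sketched as below: a sync.Map keyed by file path plus a content hash, with copies made outside any lock. This is an illustration only. The 64-bit FNV-1a variant, the key layout, and the helper names are assumptions, and deepCopyMap copies nested maps only.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

var localsCache sync.Map // key: string, value: map[string]any

// cacheKey combines the path with an FNV-1a hash of the content, so the same
// path with different content never hits a stale entry.
func cacheKey(filePath, content string) string {
	h := fnv.New64a()
	h.Write([]byte(content)) // hash.Hash.Write never returns an error
	return fmt.Sprintf("%s#%016x", filePath, h.Sum64())
}

// deepCopyMap copies nested maps so callers cannot mutate cached data.
func deepCopyMap(in map[string]any) map[string]any {
	out := make(map[string]any, len(in))
	for k, v := range in {
		if nested, ok := v.(map[string]any); ok {
			out[k] = deepCopyMap(nested)
		} else {
			out[k] = v
		}
	}
	return out
}

// extractLocalsCached parses once per (path, content) pair. sync.Map needs
// no explicit lock, so the copies never run inside a critical section.
func extractLocalsCached(filePath, content string, parse func(string) map[string]any) map[string]any {
	key := cacheKey(filePath, content)
	if cached, ok := localsCache.Load(key); ok {
		return deepCopyMap(cached.(map[string]any))
	}
	parsed := parse(content)
	localsCache.Store(key, deepCopyMap(parsed))
	return parsed
}

func main() {
	parses := 0
	parse := func(string) map[string]any {
		parses++
		return map[string]any{"locals": map[string]any{"stage": "dev"}}
	}
	extractLocalsCached("stacks/dev.yaml", "locals: {stage: dev}", parse)
	extractLocalsCached("stacks/dev.yaml", "locals: {stage: dev}", parse)
	extractLocalsCached("stacks/dev.yaml", "locals: {stage: prod}", parse)
	fmt.Println(parses) // 2: the second call was a cache hit
}
```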

Phase 1 was the auth credential-store fix that shipped separately as
PR #2471 (in v1.220.0-rc.1).

What was tried and reverted

Phase 5 1-input shortcut. Initially returned the walked single input
directly without going through Merge. CI surfaced
TestSpaceliftStackProcessor losing 7 stacks (47→40) because
WalkAndDeferYAMLFunctions's Phase 3 short-circuit returned the input
map as-is, and the 1-input shortcut handed that shared reference back
to the caller. Downstream mergeComponentConfigurations mutated the
result while building the per-component output, which corrupted the
upstream cached BaseComponentSettings / GlobalSettings for sibling
components. Fix: keep only the 0-input fast path; let 1-input fall
through to the regular merge pipeline which deep-copies via
MergeWithOptions → DeepCopyMap. Regression test
(TestMergeWithDeferred_TrivialInputShortCircuits/mutating the result does not mutate the input) added to prevent re-attempts.
Phase 9 asymmetric clone. The idea was to share settings/vars/env
references from the locals cache and deep-copy only locals. It failed with
the same class of bug (fatal error: concurrent map iteration and map write,
at scale): processTemplatesInSection returns the input map as-is
when the section has no {{, so the cached references ended up in
the shared template context and got mutated by sibling goroutines.
Reverted in b11f3cd9b; documented in-code.

Both failures share the same lesson: any optimization that hands a
shared reference back to a caller has to be matched against the
end-to-end mutation surface, not just the immediate caller. The Merge
contract — "result is a fresh, caller-mutable map" — must be upheld.
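
To make the contract concrete, here is a small sketch of a merge that keeps the 0-input fast path but always returns a fresh map otherwise. mergeLayers is a hypothetical, last-layer-wins simplification and not the Atmos Merge implementation.

```go
package main

import "fmt"

// deepCopyMap copies nested maps so the result shares nothing with its input.
func deepCopyMap(in map[string]any) map[string]any {
	out := make(map[string]any, len(in))
	for k, v := range in {
		if nested, ok := v.(map[string]any); ok {
			out[k] = deepCopyMap(nested)
		} else {
			out[k] = v
		}
	}
	return out
}

// mergeLayers returns a fresh, caller-mutable map. Only the all-empty case
// is short-circuited; a single non-empty layer still goes through the copy.
func mergeLayers(layers ...map[string]any) map[string]any {
	allEmpty := true
	for _, layer := range layers {
		if len(layer) > 0 {
			allEmpty = false
			break
		}
	}
	if allEmpty {
		return map[string]any{} // 0-input fast path: nothing to share
	}
	out := map[string]any{}
	for _, layer := range layers {
		for k, v := range deepCopyMap(layer) {
			out[k] = v // later layers win (top-level keys only in this sketch)
		}
	}
	return out
}

func main() {
	cached := map[string]any{"settings": map[string]any{"region": "us-east-1"}}
	result := mergeLayers(cached)
	result["settings"].(map[string]any)["region"] = "changed by caller"
	fmt.Println(cached["settings"].(map[string]any)["region"]) // us-east-1
}
```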

why

A real-world large stack configuration (≈836 YAML files, ~195 final stacks
across three namespaces, ~9.3k component instances) reported atmos describe affected taking around 11 minutes in CI on a 2-core runner. Local
reproduction with --identity=false and fake AWS credentials took ~4
minutes and showed two clear cost centers in the heatmap:

The credential store was being created per-component even with
--identity=false (≈3.5 min of cumulative CPU). Fixed in PR #2471.
The component-inheritance + merge + YAML-parse pipeline (≈6 min of
cumulative CPU) was bottlenecked by lock contention on shared caches,
redundant deep-copies on every cache read/write, and per-call work that
could be cached or skipped for the common case.

This PR addresses the second bucket. Confirmed impact on the same workload
(mean of 3 local runs against current main, post-Phase-5-revert):

| Function | Pre-Phase-2 | Post-Phase-13 | Reduction |
|---|---|---|---|
| `cacheBaseComponentConfig` | 5m50s | ~990ms | −99.7% |
| `mergeComponentConfigurations` | 2m22s | ~95s | −33% |
| `MergeWithDeferred` | 1m35s | ~51s | −46% |
| `WalkAndDeferYAMLFunctions` | 1m26s | ~11s | −87% |
| `extractLocalsFromRawYAML` | 13s | ~6s | −54% |
| `UnmarshalYAMLFromFileWithPositions` | 18.7s | ~3.2s | −83% |
| `processCustomTags` | 31.5s | ~7.5s | −76% |
| `getCachedBaseComponentConfig` | 6.5s | ~360ms | −94% |

(mergeComponentConfigurations, MergeWithDeferred, and Merge
numbers reflect the post-Phase-5-revert state — the 1-input shortcut
that was originally counted toward Phase 5's headline numbers has been
removed for correctness. The remaining wins are still substantial.)

Local wall-clock on the same workload: 4.1s → ~2.2s (−47%) on a
many-core Mac. The wall-clock floor on Mac is set by stack-level
parallelism that already saturates; the cumulative CPU savings (several
minutes summed across all hot functions) translate to materially more
wall-clock improvement on 2-4 core CI runners where lock-wait padding,
allocation pressure, and serialized work cannot be hidden behind cores.

Projection for a 2-core CI runner starting from the v1.219.0 baseline:
~11 minutes → ~60-105 seconds end-to-end (combining PR #2471 + this
PR's shipped phases, including the ~15s GHA wall-clock cost of the
Phase 5 1-input revert). Awaiting end-to-end CI validation on the
reference workload.

Each phase is independently revertible — they live in separate
commits with self-contained tests. The two reverted optimizations
(Phase 5's 1-input shortcut, Phase 9's asymmetric clone) have their
failure modes documented in-code and in
docs/fixes/2026-05-23-describe-affected-component-inheritance-perf.md
so future passes don't re-attempt the same approach without addressing
the underlying contract violations.

references

Investigation doc with per-phase root cause, metric tables, and
decision lessons:
docs/fixes/2026-05-23-describe-affected-component-inheritance-perf.md
Predecessor work that built the perf-instrumentation infrastructure
this PR builds on: #1576 (heatmap visualization), #1611 (self-time vs
total-time), #1622 (Docker perf fix + CPU Time / Parallelism), #1639
(5.2× faster execution + 92% memory reduction).
Phase 1 (--identity=false gate) shipped separately in PR #2471
(v1.220.0-rc.1).

Summary by CodeRabbit

Documentation
- Added a detailed troubleshooting/performance guide for diagnosing slow "describe affected" runs with reproducible steps and optimization plan.
Performance
- Significant speedups via new caching, contention reduction, and short-circuit fast-paths for common/empty cases.
Reliability
- Improved cache correctness, mutation isolation, and error propagation to avoid races and unexpected panics.
Tests / Chores
- Expanded test coverage and added public cache-clear helpers for reliable isolation and regression verification.

fix(terraform): wire `--all` to `ExecuteTerraformAll` for dependency-ordered execution @thejrose1984 (#2486)

What

Routes atmos terraform plan --all and apply --all through ExecuteTerraformAll so components actually execute in dependency (topological) order — as originally documented in the PR #1516 changelog and the DAG concurrency PRD.

Until this change, the dispatcher in cmd/terraform/utils.go routed all multi-component flags (--all, --components, --query) through ExecuteTerraformQuery, which walks components via Go map iteration — randomized order, with settings.depends_on ignored entirely. ExecuteTerraformAll, the function that builds the dependency graph and runs TopologicalSort, was reachable only from unit tests.
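
The difference between map-order iteration and a dependency-ordered walk can be sketched with Kahn's algorithm. This is a minimal illustration and not the code in ExecuteTerraformAll: `topoOrder`, `reversed`, `orderString`, the shape of the `deps` map, and the sample edges are assumptions made for the example. Ready components are sorted alphabetically only so that the output is reproducible.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// topoOrder returns components in dependency order using Kahn's algorithm.
// deps maps a component to the components it depends on (hypothetical shape).
func topoOrder(deps map[string][]string) ([]string, error) {
	indegree := make(map[string]int)
	dependents := make(map[string][]string)
	for node, prereqs := range deps {
		if _, ok := indegree[node]; !ok {
			indegree[node] = 0
		}
		for _, p := range prereqs {
			if _, ok := indegree[p]; !ok {
				indegree[p] = 0
			}
			indegree[node]++
			dependents[p] = append(dependents[p], node)
		}
	}

	var ready []string
	for node, d := range indegree {
		if d == 0 {
			ready = append(ready, node)
		}
	}

	var order []string
	for len(ready) > 0 {
		sort.Strings(ready) // deterministic tie-breaking for the sketch
		node := ready[0]
		ready = ready[1:]
		order = append(order, node)
		for _, dep := range dependents[node] {
			indegree[dep]--
			if indegree[dep] == 0 {
				ready = append(ready, dep)
			}
		}
	}
	if len(order) != len(indegree) {
		// Nodes that never reached indegree 0 are part of a cycle.
		return nil, fmt.Errorf("dependency cycle detected: %d component(s) unresolved", len(indegree)-len(order))
	}
	return order, nil
}

// reversed returns a reversed copy, as a destroy-style walk would need.
func reversed(in []string) []string {
	out := make([]string, len(in))
	for i, v := range in {
		out[len(in)-1-i] = v
	}
	return out
}

// orderString renders the order (or the error) as a single line.
func orderString(deps map[string][]string, reverse bool) string {
	order, err := topoOrder(deps)
	if err != nil {
		return "error: " + err.Error()
	}
	if reverse {
		order = reversed(order)
	}
	return strings.Join(order, " -> ")
}

func main() {
	// Illustrative edges only; the real fixture graph may differ.
	deps := map[string][]string{
		"vpc":                     nil,
		"eks/cluster":             {"vpc"},
		"eks/karpenter":           {"eks/cluster"},
		"eks/karpenter-node-pool": {"eks/karpenter"},
	}
	fmt.Println("apply:  ", orderString(deps, false))
	fmt.Println("destroy:", orderString(deps, true))
	fmt.Println(orderString(map[string][]string{"a": {"b"}, "b": {"a"}}, false))
}
```

A plain `for k := range someMap` loop, by contrast, gives no ordering guarantee in Go, which is why a walk based on map iteration cannot honor prerequisites.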

Why

Fixes #2485.

Users who configured settings.depends_on and ran atmos terraform apply --all were relying on a feature that didn't exist at the dispatch layer. Failures looked like Terraform errors (a component applied before its prereqs), not a missing-feature bug. The DAG concurrency PRD was authored on the assumption that this path already worked.

Changes

Dispatch

cmd/terraform/utils.go — info.All now routes to e.ExecuteTerraformAll(&info). --components / --query / bare -s stack continue to route to ExecuteTerraformQuery (no change).

ExecuteTerraformAll parity with ExecuteTerraformQuery

internal/exec/terraform_all.go — ports createQueryAuthManager so YAML functions (e.g. !terraform.state) resolve credentials under --all. Mirrors the #2081 fix that already exists for --query.
Drops the info.Stack == "" validation. The terraform-apply docs explicitly state --all without -s processes every stack, and that's the behavior users see today via ExecuteTerraformQuery. Keeping this PR non-breaking required matching that contract.
Removes the now-unused ErrStackRequiredWithAllFlag from errors/errors.go.

Filter scope

applyFiltersToGraph previously set IncludeDependencies: true, which would pull cross-stack prereqs into --all -s <stack>. Switched to false so the scope of --all -s <stack> is identical to today's behavior — components in the requested stack only, but now in topological order. A future opt-in flag can re-enable cross-stack execution.

Dry-run UX

executeNodeCommand now emits Would <subcmd> <component> in <stack> (dry run) via ui.Successf, matching processTerraformComponent. Both multi-component paths produce the same user-facing dry-run output. (This also affects the --affected path, which had no integration tests asserting dry-run output — verified manually.)

Tests

New integration test in tests/test-cases/terraform-multi-component-flags.yaml asserts the partial topological order (vpc before eks/cluster, eks/karpenter before eks/karpenter-node-pool, eks/istio/base before eks/istio/istiod before eks/istio/test-app) using the existing terraform-apply-affected fixture and regex with (?s). The exact total order is an implementation detail of Kahn's-algorithm tie-breaking; this test only asserts the correctness invariant.
internal/exec/terraform_all_test.go and terraform_all_simple_test.go — removed the "no stack specified" cases (the validation is gone) and updated TestApplyFiltersToGraph_* to match the new scope contract.

Compatibility matrix

Scenario	Before	After
`apply --all -s dev`, `depends_on` defined	Random order (bug)	Topological order
`apply --all -s dev`, no `depends_on`	Random order	Deterministic order
`apply --all` (no stack)	All stacks, random order	All stacks, topological order
`apply --all -s dev` with cross-stack `depends_on`	Cross-stack components ignored	In-stack topological order; cross-stack still out of scope (opt-in TBD)
`destroy --all -s dev`	Random order	Reverse topological order
`--all` with circular `depends_on`	Silently random	Hard error with cycle path
`apply --components vpc -s dev`	Unchanged	Unchanged
`apply --query '...' -s dev`	Unchanged	Unchanged
`apply -s dev` (no component, no flag)	Unchanged	Unchanged
`--all` with `!terraform.state` YAML function	Worked (via #2081)	Works (auth manager ported)
`--all` with per-component CI hooks	Worked (via #2475/#2397)	Works (hook flows through `executeNodeCommand`)

Known follow-ups (not in this PR)

These are tracked in #2485 and intentionally out of scope for the dispatch fix:

Parser: DependencyParser only reads the deprecated settings.depends_on. Should also read dependencies.components like describe_affected_components.go already does.
Parser: only component + stack keys are recognized. namespace/tenant/environment/stage are documented but ignored.
Errors: missing-target dependency errors are silently logged at Warn in parseDependencyArray. Should be surfaced.
Concurrency: still sequential. The DAG concurrency PRD describes the planned ready-queue scheduler.
Cross-stack scope opt-in: a --include-cross-stack-dependencies flag (or settings.terraform.dependencies.cross_stack) to re-enable the original IncludeDependencies: true behavior.

Test plan

go build ./...
go vet ./internal/exec/... ./cmd/terraform/... ./errors/...
go test ./internal/exec/ ./pkg/dependency/... ./cmd/terraform/... ./errors/... ./pkg/ui/... -short — all green
go test ./tests -run 'TestCLICommands/terraform_plan_--all|TestCLICommands/terraform_plan_--query|TestCLICommands/terraform_plan_--components' -count=1 — all green
New ordering test: go test ./tests -run 'TestCLICommands/terraform_plan_--all_executes_in_dependency_order' -count=1 — green; output confirms vpc → eks/cluster → eks/external-dns → eks/istio/base → eks/karpenter → eks/istio/istiod → eks/karpenter-node-pool → eks/istio/test-app
CI lint (make lint)
Manual verification with dependencies.components (new format) once parser is extended in a follow-up
Cross-stack dependency behavior with the deferred opt-in flag

References

Closes #2485
Original feature request: #1242 (closed as COMPLETED in #1516, but the routing was never wired up)
Implementation PR: #1516
DAG concurrency PRD: docs/prd/dag-concurrent-execution.md

Summary by CodeRabbit

New Features
- terraform plan --all and apply --all run components in dependency (topological) order; destroy --all runs in reverse. --all may be used without a stack; --all -s <stack> scopes to that stack without pulling cross‑stack prerequisites. Per‑component hooks and auth-aware YAML resolution are active during --all runs; dry‑run shows clear per‑component success messages.
Tests
- Expanded tests for --all ordering, scoping/filtering, auth wiring, dry‑run flows, and per‑component hook wiring.
Documentation
- New blog post describing --all behavior, caveats, and follow-ups.

fix(vendor): recover OCI pulls on auth rejection and surface rich errors @osterman (#2487)

what

Vendoring an OCI image (e.g. oci://ghcr.io/...) now auto-recovers when configured credentials are rejected (401 / 403 / DENIED) by retrying once with anonymous authentication.
On successful recovery, emits `WARN OCI auth rejected, succeeded with anonymous fallback` and proceeds.
Terminal pull failures now surface a rich error built with errUtils.Build(errUtils.ErrPullImage) — preserves the original cause and attaches structured context (image, registry, auth_attempted, status) plus three self-contained remediation hints (Actions packages: read, ATMOS_GITHUB_USERNAME override, stale ~/.docker/config.json).
Bumps the chosen-auth log line from Debug → Info and the "GHCR token without username" branch from Debug → Warn so CI logs reveal misconfiguration without --debug.
Non-auth errors (DNS, TLS, deadlines, 5xx) bypass retry — they need different remediation.
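
The retry rule described above (one anonymous retry, and only for credential rejections) can be sketched as follows. This is a hedged illustration rather than the actual pullImage code: `registryError`, `pullFunc`, `isAuthRejection`, `pullWithFallback`, and `simulate` are hypothetical names, and the image reference is a placeholder.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"strings"
)

// registryError is a hypothetical error carrying the HTTP status from a registry.
type registryError struct {
	Status int
	Msg    string
}

func (e *registryError) Error() string { return fmt.Sprintf("%d: %s", e.Status, e.Msg) }

// isAuthRejection reports whether err looks like a credential rejection
// (401, 403, or a DENIED message) rather than a network or server failure.
func isAuthRejection(err error) bool {
	var re *registryError
	if errors.As(err, &re) {
		return re.Status == http.StatusUnauthorized || re.Status == http.StatusForbidden
	}
	return err != nil && strings.Contains(err.Error(), "DENIED")
}

// pullFunc pulls an image; anonymous selects unauthenticated access.
type pullFunc func(image string, anonymous bool) error

// pullWithFallback tries configured credentials first and retries once
// anonymously only when the first failure is an auth rejection.
func pullWithFallback(image string, pull pullFunc) (usedFallback bool, err error) {
	firstErr := pull(image, false)
	if firstErr == nil {
		return false, nil
	}
	if !isAuthRejection(firstErr) {
		// DNS, TLS, deadline, and 5xx failures are returned without a retry.
		return false, fmt.Errorf("pull %s: %w", image, firstErr)
	}
	if retryErr := pull(image, true); retryErr != nil {
		return false, fmt.Errorf("pull %s (anonymous retry failed): %w", image, errors.Join(firstErr, retryErr))
	}
	return true, nil
}

// simulate runs the fallback against a fake public registry that rejects
// credentials with the given status (0 means credentials are accepted).
func simulate(status int) string {
	calls := 0
	pull := func(image string, anonymous bool) error {
		calls++
		if anonymous || status == 0 {
			return nil
		}
		return &registryError{Status: status, Msg: "DENIED: denied"}
	}
	fallback, err := pullWithFallback("oci://registry.example/mock:v0", pull)
	switch {
	case err != nil:
		return fmt.Sprintf("error after %d call(s)", calls)
	case fallback:
		return fmt.Sprintf("anonymous fallback after %d call(s)", calls)
	default:
		return fmt.Sprintf("authenticated after %d call(s)", calls)
	}
}

func main() {
	for _, status := range []int{0, 401, 403, 503} {
		fmt.Printf("status %d: %s\n", status, simulate(status))
	}
}
```

Wrapping both the original and the retry error keeps the first cause inspectable with `errors.Is` / `errors.As`, in the spirit of the "preserves the original cause" note above.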

why

Public test images on ghcr.io (e.g. ghcr.io/cloudposse/atmos/tests/fixtures/components/terraform/mock:v0) failed hard on Windows CI runners whose GITHUB_TOKEN lacked packages: read scope, because pullImage used the rejecting credentials unconditionally instead of falling back to anonymous.
The previous error surface was a bare DENIED: denied with no auth source, no HTTP status, and no actionable hint — the vendor reporter collapsed it into an opaque tally, making the root cause untraceable.
Part A of this work — granting packages: read in .github/workflows/test.yml — already landed; this is Part B, the code change so future users, different workflows, and private registries get a clean diagnosis and an automatic recovery for public images.

references

Builds on PR #1647 (3-tier auth precedence).
Mirrors the error-builder idiom from pkg/provisioner/source/source.go:88-97.
Reuses the existing errUtils.ErrPullImage sentinel — no new sentinel introduced.

Summary by CodeRabbit

Bug Fixes
- Automatic fallback to anonymous OCI image pulls when authenticated requests are rejected (401/403 or “DENIED”), preserving original error causes
- Richer diagnostic errors with contextual hints for troubleshooting image-pull failures
- Warn when a GHCR token is present but no GitHub username is configured
Tests
- Expanded test coverage for image-pull auth fallback and varied failure scenarios

fix(terraform): preserve explicit identity and auth context for local runs @shirkevich (#2348)

Problem

Local commands could still fall back to the default CI identity (`terraform-ci` / `gcp-wif`) even when the user explicitly selected a local identity such as `terraform` / `gcp-adc`.

This showed up in several related paths:

atmos terraform apply ... --identity terraform and -i terraform could lose the explicit identity during Terraform argument reconstruction and then authenticate with the default identity.
Local Terraform commands could run the CI hook path first, producing noisy terraform-ci WIF authentication errors even when the command later succeeded with the intended local identity.
atmos terraform output ... --format json -i terraform bypassed the normal Terraform auth setup and called formatted output resolution without the active AuthManager / AuthContext.
!terraform.state evaluations could receive stack info without the active auth manager, so nested state reads could still resolve backend credentials from the default identity instead of the explicit command identity.
atmos ansible playbook ... --identity terraform did not propagate the selected identity into stack processing before YAML functions ran, so Ansible-driven components using !terraform.state could still try the default Terraform CI WIF identity.

Fixed Issues

Preserves explicit Terraform --identity / -i values through both Cobra/flag-registry parsing and the legacy raw-arg parsing path.
Normalizes Terraform identity edge cases consistently, including --identity=, -i=, and --identity=false.
Keeps CLI-provided identity values ahead of default/profile-selected identities when Cobra reports the optional-value sentinel.
Skips CI hook execution for normal local non-CI runs unless CI mode is explicitly forced, removing local terraform-ci preflight auth noise.
Initializes Terraform auth for formatted terraform output and passes both AuthContext and AuthManager into output resolution.
Propagates the active AuthManager from ProcessComponentConfig into ConfigAndStacksInfo, allowing !terraform.state to inherit the selected identity context.
Adds shared component auth setup for non-Terraform command paths that need authenticated YAML functions.
Adds Ansible-specific identity handling that supports long-form --identity / ATMOS_IDENTITY while deliberately leaving Ansible's -i shorthand reserved for inventory.
Runs Ansible stack processing with the selected auth manager before YAML function evaluation, so Ansible playbooks using !terraform.state can use the requested identity.
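
The Ansible-specific rule (only the long-form flag selects an identity, while `-i` stays reserved for inventory) can be sketched with a small argument scanner. The helper names `longIdentityFromArgs` and `describe` are hypothetical, and the rule of not consuming a following argument that starts with `-` is an assumption borrowed from the flag-parsing fix described in this PR, not a copy of the real parser.

```go
package main

import (
	"fmt"
	"strings"
)

// longIdentityFromArgs extracts only the long-form --identity flag.
// It returns the value and whether the flag was present at all, so an
// explicit empty value (--identity=) can be told apart from "not given".
func longIdentityFromArgs(args []string) (value string, found bool) {
	for i, arg := range args {
		if v, ok := strings.CutPrefix(arg, "--identity="); ok {
			return v, true
		}
		if arg == "--identity" {
			// Consume the next argument only if it is not another flag.
			if i+1 < len(args) && !strings.HasPrefix(args[i+1], "-") {
				return args[i+1], true
			}
			return "", true
		}
	}
	// -i is deliberately ignored here: it belongs to the wrapped tool.
	return "", false
}

// describe renders the result for display.
func describe(args ...string) string {
	value, found := longIdentityFromArgs(args)
	return fmt.Sprintf("%q %t", value, found)
}

func main() {
	fmt.Println(describe("playbook", "--identity", "terraform"))
	fmt.Println(describe("playbook", "-i", "hosts.ini"))
	fmt.Println(describe("playbook", "--identity="))
	fmt.Println(describe("playbook", "--identity", "-lock=false"))
}
```

Returning a separate `found` flag is one way to keep "explicitly empty" distinct from "absent", which matters when an empty value is meant to trigger interactive selection.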

Verification

Rebuilt the binary with rtk proxy go build -o build/atmos .
Focused regression suite passed: rtk go test ./cmd/ansible ./pkg/component/ansible ./pkg/flags ./internal/exec ./pkg/hooks -run 'TestAnsible|TestBuildConfigAndStacksInfo|TestGetLongIdentityFromArgs|TestProcessStacksWithAuth|TestParseGlobalFlags|TestSetupTerraformAuth|TestProcessComponentConfig_PropagatesAuthManager|TestProcessComponentConfig_AuthManagerGuardBranches|TestProcessCommandLineArgs_EmptyIdentityFlagIsExplicitSelect|TestProcessCommandLineArgs_TerraformIdentityFlag_Issue2392|TestProcessArgsAndFlags_IdentityFlag|TestRunCIHooks_LocalRunSkipsExperimentalGate|TestRunCIHooks_ForwardsErrorAndExitCode|TestRunCIHooks_NilAtmosConfig|TestRunCIHooks_ExperimentalDisableReturnsError'
Downstream local Terraform apply path was verified with explicit --identity terraform.
Downstream formatted Terraform output path was verified with explicit -i terraform --format json.
Downstream Ansible dry-run path was verified to select identity "terraform" / provider=gcp-adc; remaining OAuth access was environment/network dependent, not a fallback to terraform-ci.

Summary by CodeRabbit

New Features
- Added -i shorthand for --identity (supports -i value, -i=value, and explicit-empty -i= to trigger interactive selection).
Improvements
- More robust identity resolution across commands and env vars; explicit-empty identity is preserved.
- Auth manager is now propagated into output and component command flows for consistent auth behavior.
- Hook discovery avoids rendering templates; CI hooks skip when no CI provider detected.
Bug Fixes
- Fixed parsing so a following native flag (e.g., -lock=false) is not mis-consumed as an identity value.
Tests
- Expanded test coverage for identity parsing, auth propagation, hooks, and CI registry behavior.

fix(ci): fire CI hooks per-component in deploy --all mode @thejrose1984 (#2478)

what

Fixes atmos terraform deploy --all (and --query, --components, stack-without-component) producing only a single CI summary entry for the last component instead of one entry per component
Adds runCIHooksForDeployComponent as the per-component hook for the deploy subcommand so $GITHUB_STEP_SUMMARY receives one entry per component with the correct component/stack context
Wires wasMultiComponentExecution reset, error-defer guard, and PostRunE guard in deploy.go — the same three-site pattern applied to plan in #2430 and apply in #2475

why

In multi-component mode, terraformRunWithOptions routes to ExecuteTerraformQuery and sets wasMultiComponentExecution = true, but deploy.go had no guard on its PostRunE or error-path defer. This caused:

PostRunE to fire once after all components completed, calling RunCIHooks with an empty output buffer and the last component's info.Component/info.Stack
The error-path defer to double-fire when --all failed mid-walk (per-component hook already ran for the failed component)
For stacks with N components, only 1 summary entry appeared instead of N
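
The guard pattern can be illustrated with a toy model: reset the per-run flag, fire the hook per component during the walk, and keep the post-run hook silent when the walk already reported. Everything here (`run`, `fireHook`, `execute`, `postRun`, `summaryCount`) is a hypothetical stand-in and not the code in deploy.go.

```go
package main

import "fmt"

// run is a hypothetical stand-in for one command invocation.
type run struct {
	wasMultiComponentExecution bool
	summary                    []string // one entry per CI hook firing
}

// fireHook records a CI summary entry for a component.
func (r *run) fireHook(component string) {
	r.summary = append(r.summary, component)
}

// execute walks the components. In multi-component mode the hook fires
// once per component, with that component's own context.
func (r *run) execute(components []string) {
	r.wasMultiComponentExecution = false // reset per-run state first
	if len(components) > 1 {
		r.wasMultiComponentExecution = true
		for _, c := range components {
			r.fireHook(c)
		}
	}
}

// postRun mimics a post-run hook: it stays silent when per-component
// hooks already fired, otherwise it reports the single component.
func (r *run) postRun(lastComponent string) {
	if r.wasMultiComponentExecution {
		return
	}
	r.fireHook(lastComponent)
}

// summaryCount returns how many summary entries one invocation produces.
func summaryCount(components ...string) int {
	if len(components) == 0 {
		return 0
	}
	r := &run{}
	r.execute(components)
	r.postRun(components[len(components)-1])
	return len(r.summary)
}

func main() {
	fmt.Println(summaryCount("vpc"))               // single component
	fmt.Println(summaryCount("vpc", "eks", "rds")) // multi-component walk
}
```

Without the check in `postRun`, the multi-component case would record an extra entry attributed to the last component, which is the duplicate firing the guards are meant to prevent.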

references

Closes #2476
Related: #2397 (plan fix), #2475 (apply fix)

Summary by CodeRabbit

Bug Fixes
- Prevent duplicate error-hook execution during multi-component deployments.
- Ensure per-run state is reset before early exits so deferred error hooks and post-run logic behave consistently.
- Run CI hooks per-component for deploys to preserve component output and forward correct exit codes.
Tests
- Added tests for per-component CI hook behavior, suppression of post-run logic in multi-component deploys, defer-guard behavior, and exit-code forwarding.

[codex] Fix verifier auto-install and cosign bundles @osterman (#2481)

what

Resolve verifier auto-installs to concrete registry versions before bootstrapping, instead of falling back to literal latest.
Add platform-aware installer helpers and regression coverage for Windows verifier asset URLs across cosign, slsa-verifier, gh, and minisign.
Combine cosign opts with downloaded sidecars like --bundle so Trivy checksum signature verification works with Aqua metadata.
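
As a rough sketch of the third item, the argument list for `cosign verify-blob` can keep a downloaded bundle sidecar alongside any extra options instead of dropping one when the other is present. `buildVerifyBlobArgs` and `commandLine` are hypothetical helpers, the file names are placeholders, and the option shown is only an example of a cosign flag; this is not the Atmos implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// buildVerifyBlobArgs assembles arguments for `cosign verify-blob`.
// The bundle sidecar (if any) and the extra options are both kept.
func buildVerifyBlobArgs(blob, bundlePath string, opts []string) []string {
	args := []string{"verify-blob"}
	if bundlePath != "" {
		args = append(args, "--bundle", bundlePath)
	}
	args = append(args, opts...)
	return append(args, blob)
}

// commandLine renders the full command for display or logging.
func commandLine(blob, bundlePath string, opts []string) string {
	return "cosign " + strings.Join(buildVerifyBlobArgs(blob, bundlePath, opts), " ")
}

func main() {
	opts := []string{"--certificate-oidc-issuer", "https://token.actions.githubusercontent.com"}
	fmt.Println(commandLine("checksums.txt", "checksums.txt.bundle", opts))
	fmt.Println(commandLine("checksums.txt", "", nil))
}
```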

why

Windows CI was failing because cosign release assets exist under v... tags and the bootstrap path could construct invalid release URLs.
Trivy verification failed on macOS because Atmos dropped the sigstore bundle whenever cosign options were present, producing an incomplete cosign verify-blob command.
The added tests cover the failing Windows URL rendering path and the Trivy-shaped checksum signature command.

references

Fixes the verifier install failure introduced by package verification.
Validated with go test ./pkg/toolchain/installer ./pkg/toolchain/registry/aqua ./pkg/toolchain/verification, go test ./pkg/toolchain, pre-commit hooks, and a live go run . toolchain install aquasecurity/trivy@v0.70.0.

Summary by CodeRabbit

New Features
- Signature verification now supports bundle sidecars.
- Enhanced cross-platform asset resolution, including better Windows ARM and Rosetta2 handling and per-platform overrides.
- Improved verifier bootstrap resolution with additional fallback behavior.
Bug Fixes
- Corrected Windows executable extension handling across target platforms.
Tests
- Added tests for Windows asset URL generation, verifier version resolution failures, and cosign bundle sidecar integration.

fix(stacks): honour component-level list_merge_strategy in settings @thejrose1984 (#2480)

what

Fixes settings.list_merge_strategy set at the component level being silently ignored during stack processing
Adds effectiveAtmosConfig() helper that scans the component's settings layers (GlobalSettings → BaseComponentSettings → ComponentSettings → ComponentOverridesSettings) before any merge and returns a shallow config copy with the winning strategy
mergeComponentConfigurations now uses this resolved config for all m.Merge / m.MergeWithDeferred / m.ApplyDeferredMerges calls — covering vars, settings, env, auth, providers, hooks, generate, dependencies, locals, source, and provision

why

mergeComponentConfigurations passed the global atmosConfig to every merge call. pkg/merge reads atmosConfig.Settings.ListMergeStrategy on every call. The component's settings.list_merge_strategy lived inside the data being merged, not the config doing the merging — so it was always ignored. The value appeared correctly in atmos describe component output (giving false confidence), but the actual list merging behavior was always governed by the global atmos.yaml setting or ATMOS_SETTINGS_LIST_MERGE_STRATEGY env var.
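
The idea of resolving the strategy before merging can be sketched like this. It is a simplified illustration under stated assumptions: `config`, `effectiveConfig`, `mergeLists`, and `demo` are hypothetical, settings layers are modeled as plain maps, and only the replace and append strategies are represented.

```go
package main

import "fmt"

// config is a trimmed-down, hypothetical stand-in for the global config.
type config struct {
	ListMergeStrategy string
}

// effectiveConfig scans settings layers from lowest to highest precedence
// and returns a shallow copy of cfg carrying the winning strategy.
// cfg itself is returned when no layer overrides it.
func effectiveConfig(cfg *config, layers ...map[string]any) *config {
	winner := ""
	for _, layer := range layers {
		if s, ok := layer["list_merge_strategy"].(string); ok && s != "" {
			winner = s // later layers win
		}
	}
	if winner == "" || winner == cfg.ListMergeStrategy {
		return cfg
	}
	copied := *cfg // shallow copy keeps the global config untouched
	copied.ListMergeStrategy = winner
	return &copied
}

// mergeLists applies the strategy: "append" concatenates, anything else replaces.
func mergeLists(strategy string, base, override []string) []string {
	if strategy == "append" {
		return append(append([]string{}, base...), override...)
	}
	return override
}

// demo merges two lists for a component that may set its own strategy.
func demo(componentStrategy string) string {
	global := &config{ListMergeStrategy: "replace"}
	componentSettings := map[string]any{}
	if componentStrategy != "" {
		componentSettings["list_merge_strategy"] = componentStrategy
	}
	// Layer order: global, base component, component, overrides.
	eff := effectiveConfig(global, nil, nil, componentSettings, nil)
	merged := mergeLists(eff.ListMergeStrategy, []string{"a"}, []string{"b"})
	return fmt.Sprintf("%v global=%s", merged, global.ListMergeStrategy)
}

func main() {
	fmt.Println(demo(""))       // falls back to the global strategy
	fmt.Println(demo("append")) // component-level override wins
}
```

The point of the copy is that the merge call reads the strategy from the config it is handed, so the override has to live in that config rather than only in the data being merged.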

references

Closes #2396

Summary by CodeRabbit

New Features
- Component-level list merge strategy overrides are now computed and applied consistently across configuration assembly, honoring inheritance and isolating unchanged configs.
Tests
- Added integration tests and fixtures covering precedence, inheritance, copy isolation, prevention of empty overrides, and error handling for invalid strategy values.

fix(ci): fire CI hooks per-component in apply --all mode (#2475) @thejrose1984 (#2477)

Extends the per-component CI hook pattern from PR #2430 (plan --all) to apply --all, so each component produces its own CI summary entry instead of a single misattributed entry for the last component.

what

Update apply --all to fire per-component CI hooks.
Preserve per-component CI reporting semantics used by plan --all.

why

Prevent CI summaries from being misattributed to only the last component.
Ensure each component has its own hook and status entry in CI pipelines.

references

Extends the per-component hook behavior introduced in PR #2430 for plan --all.
Addresses CI reporting bug for apply --all mode.

Summary by CodeRabbit

Bug Fixes
- Prevented duplicate CI hook firing during multi-component Terraform apply runs.
- Reset per-run state at apply start so deferred and post-run hooks observe consistent values.
- Suppressed post-run hook execution for multi-component apply to avoid double execution.
Tests
- Added tests covering CI hook handling and post-run suppression in multi-component apply scenarios.

fix(auth): honor --identity=false in describe affected and dependents @osterman (#2471)

what

Honor --identity=false (and aliases off/0/no) in atmos describe affected so per-component auth resolution is skipped, not just the top-level AuthManager creation.
Thread a new DescribeAffectedCmdArgs.AuthDisabled / DescribeDependentsArgs.AuthDisabled flag from the cmd layer through executeDescribeAffectedWith{TargetRepoPath,TargetRefClone,TargetRefCheckout}, executeDescribeAffected, addDependentsToAffected, and ExecuteDescribeDependents, routing inner stack resolution through ExecuteDescribeStacksWithAuthDisabled.
Also wired through terraform_affected.go, terraform_affected_graph.go, pkg/list/list_affected.go, pkg/ai/tools/atmos/describe_affected.go, and atlantis_generate_repo_config.go so every caller of the public helpers passes the signal.
Extracted pkg/list/list_affected.go::executeAffectedLogic into three per-mode helpers to stay under the 60-line function-length limit after the extra parameter.

why

A user disabled all auth on a describe affected --upload --process-functions=false --identity=false run in cloudposse/infra-live CI (failing run) and still got STS AssumeRoleWithWebIdentity 403 AccessDenied for component tfstate-plat.
The 1.219 fix (#2412) normalized --identity=false → __DISABLED__ at the parser layer and made CreateAuthManagerFromIdentity* short-circuit to nil, but it only wired the disabled signal all the way down through list instances. In describe affected, the top-level AuthManager correctly became nil, but a nil AuthManager was indistinguishable from "no identity specified" downstream. With --process-templates=true (the default), shouldResolvePerComponentAuth(processTemplates, processYamlFunctions) still returned true, so the per-component resolver called createComponentAuthManager, which built a fresh AuthManager from atmosConfig.Auth and tried the assume-role call the user thought they had disabled.
This change makes --identity=false actually mean "no auth, anywhere" in describe affected, matching the contract that already works for list instances.
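
The core of the fix is that "auth disabled" becomes an explicit signal rather than something inferred from a nil AuthManager. A minimal sketch follows, with assumptions labeled: the three-argument form of `shouldResolvePerComponentAuth` and the OR of the two processing flags are guesses made for illustration, and `isIdentityDisabled` with its case-insensitive matching is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// isIdentityDisabled reports whether an --identity value asks for auth to
// be turned off. Case-insensitive matching is an assumption of this sketch.
func isIdentityDisabled(value string) bool {
	switch strings.ToLower(strings.TrimSpace(value)) {
	case "false", "off", "0", "no":
		return true
	}
	return false
}

// shouldResolvePerComponentAuth sketches the gate: the explicit authDisabled
// signal wins before the processing flags are consulted.
func shouldResolvePerComponentAuth(processTemplates, processYamlFunctions, authDisabled bool) bool {
	if authDisabled {
		return false
	}
	return processTemplates || processYamlFunctions
}

func main() {
	for _, identity := range []string{"false", "off", "0", "no", "terraform", ""} {
		disabled := isIdentityDisabled(identity)
		// Mirrors the regression case: templates on, YAML functions off.
		resolve := shouldResolvePerComponentAuth(true, false, disabled)
		fmt.Printf("identity=%q disabled=%t resolvePerComponent=%t\n", identity, disabled, resolve)
	}
}
```

Passing a boolean down the call chain avoids overloading nil with two meanings ("disabled" and "not specified"), which is the ambiguity described above.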

Tests:

cmd/describe_affected_test.go::TestDescribeAffectedSetsAuthDisabled covers false/off/0/no env-var spellings and asserts AuthDisabled=true and AuthManager=nil.
internal/exec/describe_affected_authdisabled_test.go verifies Execute() forwards AuthDisabled to all three helper paths and to addDependentsToAffected.
internal/exec/describe_stacks_component_processor_auth_test.go adds the exact (processTemplates=true, processYamlFunctions=false, authDisabled=true) regression case from the infra-live CI failure to the existing table.

references

Follow-up to #2412 (fix(auth): normalize --identity=false to disable authentication) which only wired the disabled signal through list instances.
Failing CI run that motivated this fix: https://github.com/cloudposse/infra-live/actions/runs/26247527093/job/77249654102?pr=1686

Summary by CodeRabbit

Bug Fixes
- describe affected and describe dependents now explicitly record when authentication is disabled (e.g., --identity=false, off, 0, no), ensuring downstream discovery and dependency resolution skip per-component auth and avoid unintended auth attempts.
Tests
- Added unit and integration tests verifying the auth-disabled signal is propagated throughout affected-component discovery and dependent-resolution paths.

fix(auth): nil-check process-cached credentials for standalone `ambient` identity @aknysh (#2479)

what

Fix a hard SIGSEGV triggered the second time a standalone generic ambient identity (kind: ambient) is authenticated in the same process. The first authentication succeeded and silently cached nil credentials in the process-level credential cache; the next lookup invoked isCredentialValid("process-cache", nil), which dereferenced a nil types.ICredentials interface in GetExpiration().
The crash is latent in atmos auth login / atmos auth whoami (one authentication per process) but fatal in commands that resolve per-component auth many times — most notably atmos describe affected --upload, where internal/exec/describe_stacks_component_processor.processComponentEntry walks every component and calls resolveComponentAuthManager → createComponentAuthManager → Authenticate → authenticateChain per component.

Fix (two layers in `pkg/auth/manager_chain.go`)

authenticateChain — don't cache nil credentials.
```
if creds != nil {
    processCredentialCache.Store(cacheKey, &processCachedCreds{
        credentials: creds,
    })
}
```
The generic ambient kind is a cloud-agnostic passthrough whose Authenticate() returns (nil, nil) by design — credentials are resolved by the cloud SDK at subprocess runtime, not by Atmos. Storing nil violates the cache invariant that every entry is a usable credential object. Skipping costs nothing because ambient re-authentication is itself a no-op.
isCredentialValid — short-circuit on nil input.
```
if cachedCreds == nil {
    log.Debug("Cached credentials are nil; treating as invalid", logKeyIdentity, identityName)
    return false, nil
}
```
Defense-in-depth mirror of the same nil-check pattern adopted by buildWhoamiInfo in the predecessor 2026-04-17 ambient fix. If any future caller stores nil in the cache (or another path passes nil into the validator), the worst case is a redundant re-authentication, not a panic.

Either guard alone closes the panic; both together make the contract explicit at both the read and write sites.
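
The two guards can be combined into a small runnable sketch. This is a simplified model, not the code in pkg/auth/manager_chain.go: the `credentials` interface, `staticCreds`, `cachedValid`, and `runTwice` are hypothetical, and credentials are stored in the cache directly instead of inside a wrapper struct.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// credentials is a minimal stand-in for the credentials interface.
type credentials interface {
	GetExpiration() (*time.Time, error)
}

// staticCreds is a toy credential with a fixed expiry.
type staticCreds struct{ expires time.Time }

func (s staticCreds) GetExpiration() (*time.Time, error) { return &s.expires, nil }

var processCredentialCache sync.Map

// cachedValid is the read-side guard: nil is treated as invalid
// instead of being dereferenced.
func cachedValid(c credentials) bool {
	if c == nil {
		return false
	}
	exp, err := c.GetExpiration()
	if err != nil {
		return false
	}
	return exp == nil || exp.After(time.Now())
}

// authenticateChain consults the cache, then authenticates. The write-side
// guard skips caching when the provider returns nil credentials.
func authenticateChain(key string, auth func() (credentials, error)) (credentials, error) {
	if v, ok := processCredentialCache.Load(key); ok {
		if c, _ := v.(credentials); cachedValid(c) {
			return c, nil
		}
	}
	creds, err := auth()
	if err != nil {
		return nil, err
	}
	if creds != nil {
		processCredentialCache.Store(key, creds)
	}
	return creds, nil
}

// runTwice authenticates twice under one key and reports how many times
// the provider was called and whether anything ended up in the cache.
func runTwice(key string, ambient bool) string {
	calls := 0
	auth := func() (credentials, error) {
		calls++
		if ambient {
			return nil, nil // ambient passthrough: no credentials by design
		}
		return staticCreds{expires: time.Now().Add(time.Hour)}, nil
	}
	for i := 0; i < 2; i++ {
		if _, err := authenticateChain(key, auth); err != nil {
			return "error: " + err.Error()
		}
	}
	_, cached := processCredentialCache.Load(key)
	return fmt.Sprintf("calls=%d cached=%t", calls, cached)
}

func main() {
	fmt.Println("ambient:", runTwice("ambient-demo", true))
	fmt.Println("static: ", runTwice("static-demo", false))
}
```

In the ambient case the provider is simply called again on the second pass, which matches the note above that re-authentication for this kind is a no-op.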

Tests (new `pkg/auth/manager_chain_ambient_test.go`)

TestManager_isCredentialValid_NilCreds — direct unit reproducer for the panic site. Before the fix this test panicked at manager_chain.go:164 with runtime error: invalid memory address or nil pointer dereference while calling cachedCreds.GetExpiration(). Asserts (false, nil) on nil credentials.
TestManager_Authenticate_AmbientStandalone_RepeatedCallsNoPanic — end-to-end via real NewAuthManager + two back-to-back Authenticate() calls on a standalone kind: ambient identity. Before the fix the second call panicked on the process-cache hit. Asserts both calls return cleanly with WhoamiInfo.Credentials == nil.
TestAuthenticateChain_AmbientStandalone_DoesNotCacheNil — locks in the authenticateChain-side fix by direct cache inspection: the cache key must be absent after a standalone ambient authentication. Prevents a regression where caching nil silently returns.

All three new tests pass alongside the existing ambient regression tests (TestManager_buildWhoamiInfo_NilCredentials, TestManager_Authenticate_Ambient_Standalone) and the existing TestProcessCredentialCache_* suite.

Coverage

Both patched functions remain at 100% statement coverage; both branches of each new guard are exercised:

Function	File:Line	Coverage
`authenticateChain`	`pkg/auth/manager_chain.go:51`	100.0%
`isCredentialValid`	`pkg/auth/manager_chain.go:173`	100.0%

isCredentialValid nil-true branch: TestManager_isCredentialValid_NilCreds. Nil-false branch: existing TestProcessCredentialCache_* tests.
authenticateChain skip-cache branch: the two ambient tests above. Cache-write branch: existing TestProcessCredentialCache_AvoidsDuplicateAuth and friends.

Validation

go test ./pkg/auth/... -count=1 — all 28 subpackages green.
go vet ./pkg/auth/... — clean.
go build ./... — succeeds.

why

The (nil, nil) return from the generic ambient kind is the documented contract (docs/prd/ambient-identity.md) — credentials are resolved by the cloud SDK at subprocess runtime, not by Atmos. The cache code on the other side of that boundary failed to honor the contract, and a recent change that made per-component auth resolver failures fatal turned this latent panic into a hard command termination.
The predecessor 2026-04-17 ambient fix (#2334) addressed the buildWhoamiInfo path but did not touch the process credential cache path in authenticateChain / isCredentialValid. That cache is dormant during single-authentication commands like atmos auth login / atmos auth whoami (where #2334's reproducer lived) but hot during multi-component flows like atmos describe affected, so the bug only surfaced after both #2334 shipped and per-component auth resolution became fatal. This PR extends the same nil-credential contract to the credential-cache layer.
Without this fix, any consumer of a standalone kind: ambient identity who exercises atmos describe affected --upload (the canonical Atmos Pro flow) hits a hard crash on every run, with no workaround short of avoiding the identity kind entirely — which defeats the reason the kind exists.

references

docs/fixes/2026-05-21-ambient-identity-process-cache-panic.md — fix write-up: root cause, code path, two-layer fix, test matrix, coverage notes, and the interaction with the predecessor #2334 fix that made this surface now.
docs/fixes/2026-04-17-ambient-identity-nil-credentials.md — predecessor fix. Same (nil, nil) ambient contract, different layer (buildWhoamiInfo). This PR extends the same defense to the process credential cache.
docs/prd/ambient-identity.md — feature PRD. Specifies that ambient.Authenticate() returns (nil, nil) and ambient identities do not store credentials.
pkg/auth/identities/ambient/ambient.go:66-71 — the intentional return nil, nil in ambientIdentity.Authenticate().
pkg/auth/identities/ambient/ambient.go:144-162 — AuthenticateStandaloneAmbient documents and propagates the nil-credentials contract.
pkg/auth/identities/aws/ambient.go:Authenticate — AWS-specific counterpart that returns real *AWSCredentials and therefore never triggers this bug.
internal/exec/describe_stacks_component_processor.go:150-174 — per-component auth resolver whose recent change made this latent panic fatal in the atmos describe affected --upload flow.

Summary by CodeRabbit

Bug Fixes
- Fixed a crash that occurred when authenticating a standalone ambient identity multiple times within the same process.
Tests
- Added regression tests to prevent this issue from reoccurring.
Documentation
- Added documentation describing the fix and root cause analysis.