Refresh Terragrunt migration guide mappings @osterman (#2527)
what
- Update the Terragrunt migration guide comparison table and concept mappings for current Atmos capabilities.
- Split Terraform source guidance into source provisioning and vendoring, including TTL, auto-provisioning, and workdir guidance.
- Add or refresh mappings for explicit component dependencies, hooks, file generation, backend provisioning, locals, and AWS YAML functions.
why
- The migration guide had stale guidance that understated current Atmos parity with Terragrunt.
- The revised examples give Terragrunt users more accurate one-to-one migration paths.
references
Summary by CodeRabbit
- Documentation
- Updated migration guide with an expanded "Key Differences" mapping for Terragrunt → Atmos (dependencies, sourcing, hooks, backend).
- Clarified dependency ordering vs output lookup and added stack YAML examples showing dependencies + output wiring.
- Split module sourcing into “Atmos Source Provisioning” (with CLI examples) and “Atmos Vendoring” (new vendor format).
- Renamed “Generate Blocks” to “Code Generation” and rewrote hooks, remote-backend, locals, and YAML function examples (including inline exec and env/aws mappings).
docs(dependencies): document dependencies.components for describe affected @osterman (#2391)
what
- Update atmos describe affected docs to lead with the new dependencies.components (kind: file|folder + path:) format for path-based dependencies; keep a short backward-compat note pointing to legacy settings.depends_on.
- Convert the dependents example on the describe affected page from settings.depends_on to dependencies.components.
- Correct the Merge Behavior section in stacks/dependencies/components.mdx: the default is replace, not append; append requires opting in via settings.list_merge_strategy: append. Add an "Opt-in append" subsection and link to the settings reference.
- Add a migration callout under the schema in stacks/dependencies/components.mdx clarifying that namespace/tenant/environment/stage are not supported in the new format; use a templated stack: instead.
- Rebalance stacks/dependencies/index.mdx so dependencies.tools and dependencies.components get equal billing in the intro, use cases, and component-dependencies subsection. Remove the duplicate Related Documentation entry and relabel the legacy link as "Legacy settings.depends_on". Add a link to atmos describe affected.
- Extend the Atmos manifest JSON Schema (website/static/schemas/... and the matching test fixture) so dependencies.components is allowed and validates component, stack, kind, path. Previously rejected by additionalProperties: false.
why
- The dedicated dependencies.components page existed, but the highest-traffic surface (describe affected) and the JSON Schema still only documented/allowed the legacy settings.depends_on map, driving users to the deprecated format and making the new format fail IDE/SchemaStore validation.
- The merge-behavior description in stacks/dependencies/components.mdx contradicted the announcement blog and the actual code (internal/exec/describe_dependents_test.go:1093–1154 confirms default = replace, append is opt-in via settings.list_merge_strategy). Users following the docs would have built a wrong mental model of inheritance.
- The migration story from settings.depends_on (with namespace/tenant/environment/stage) to dependencies.components (with templated stack:) was only discoverable via the migration table at the bottom of the page; surfacing it as a callout reduces confusion for users porting existing configs.
references
- pkg/schema/dependencies.go (canonical field set for dependencies.components)
- internal/exec/describe_dependents_test.go:1093–1154 (confirms default merge is replace, append requires settings.list_merge_strategy: append)
- website/blog/2026-03-14-dependencies-components.mdx (original announcement)
- Verified: cd website && npm run build succeeds (no broken-link errors)
Summary by CodeRabbit
- New Features
- Dependencies now support four top-level kinds: tools, components, files, and folders. Component-to-component relationships and explicit file/folder watch paths are first-class; describe-affected, describe-dependents, and CI workflows use the expanded surface while legacy formats remain supported.
- Documentation
- Guides updated with examples, migration notes, canonical forms, merge semantics, and quick examples for cross-stack and path-based dependencies.
- Tests
- Added coverage for the new surfaces, aliasing, deduplication, normalization, and v1/v2 equivalence.
fix(ci): repair Docker build and Homebrew formula bump in release workflow @aknysh (#2525)
what
- Replace the flaky upstream install_kustomize.sh script in the Dockerfile with a direct download from GitHub Releases, pinned to kustomize v5.8.1
- Replace mislav/bump-homebrew-formula-action@v3 with dawidd6/action-homebrew-bump-formula@v7 (SHA-pinned) for the Homebrew formula bump step
why
- The kustomize install script has known bugs (kubernetes-sigs/kustomize#5562) causing tar extraction failures (tar: ./kustomize_v*_linux_amd64.tar.gz: Cannot open) during Docker image builds
- The mislav/bump-homebrew-formula-action is broken because GitHub now returns HTTP 303 instead of 302 for tarball redirects, and the action hardcodes statusCode == 302 (mislav/bump-homebrew-formula-action#340, open/unfixed)
- Both failures blocked the v1.219.0 release workflow (run #26131090357)
references
- https://github.com/cloudposse/atmos/actions/runs/26131090357/job/77908154273
- mislav/bump-homebrew-formula-action#340
- kubernetes-sigs/kustomize#5562
Summary by CodeRabbit
- Chores
- Updated build and deployment infrastructure, including CI/CD workflow configuration and Docker build process improvements for enhanced reliability and maintainability.
feat(components): add retry block for transient terraform errors @osterman (#2431)
what
- Add a per-component retry: block under components.terraform.<name> that wraps each terraform subprocess invocation (init, workspace select, workspace new, plan/apply/etc.) in an independent retry loop with configurable backoff.
- Introduce retry.conditions: (a list of regex patterns matched against captured stdout/stderr); only errors whose output matches at least one condition retry, everything else fails fast. Patterns may be wrapped in /.../ for readability.
- Extend schema.RetryConfig with Conditions []string (backwards-compatible; the existing struct is also used by workflows / vendor / task retry configs).
- Plumb the new block through stack inheritance: abstract components define a default policy, concrete components and overrides.retry deep-merge on top.
- Add JSON schema for retry referenced from terraform, terraform_component_manifest, and overrides; ship a docs page, blog post, and roadmap milestone under the CI/CD initiative.
- Add pkg/retry/conditions.go (regex compile + match) and pkg/schema/retry_decode.go (mapstructure decoder with the duration hook) so logic stays out of internal/exec/.
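The condition matching described above can be sketched roughly as follows. This is a minimal illustration, not the code in pkg/retry/conditions.go: the helper names compileConditions and shouldRetry are hypothetical, and the /.../ unwrapping rule is an assumption based on the description.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// compileConditions turns pattern strings into regexes. A pattern wrapped in
// /.../ has the slashes stripped first. Hypothetical helper, not the Atmos API.
func compileConditions(patterns []string) ([]*regexp.Regexp, error) {
	compiled := make([]*regexp.Regexp, 0, len(patterns))
	for _, p := range patterns {
		if len(p) >= 2 && strings.HasPrefix(p, "/") && strings.HasSuffix(p, "/") {
			p = p[1 : len(p)-1]
		}
		re, err := regexp.Compile(p)
		if err != nil {
			return nil, fmt.Errorf("invalid retry condition %q: %w", p, err)
		}
		compiled = append(compiled, re)
	}
	return compiled, nil
}

// shouldRetry reports whether the captured output matches at least one
// condition. With no conditions configured, nothing is retried.
func shouldRetry(conditions []*regexp.Regexp, output string) bool {
	for _, re := range conditions {
		if re.MatchString(output) {
			return true
		}
	}
	return false
}

func main() {
	conds, err := compileConditions([]string{`/502 Bad Gateway/`, `connection reset`})
	if err != nil {
		panic(err)
	}
	// A transient download error matches; a genuine Terraform error does not.
	fmt.Println(shouldRetry(conds, "Error: 502 Bad Gateway while downloading provider"))
	fmt.Println(shouldRetry(conds, "Error: Unsupported argument"))
}
```

Returning false when no conditions are configured mirrors the opt-in behavior described above: everything that does not match fails fast.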
why
- Unattended atmos terraform plan/apply runs in CI repeatedly fail with transient infrastructure errors that have nothing to do with the Terraform code: most commonly 502 Bad Gateway during provider downloads, but also connection reset, TLS handshake timeout, and state-backend timeouts.
- Today the only recovery is a manual re-run, which is painful for fleet operations and unattended pipelines.
- Workflows, vendoring, and source extraction already use pkg/retry. This PR exposes the same robust primitive to components without duplicating logic.
- The design is intentionally pattern-driven (opt-in per regex) so real terraform plan failures (exit-code 2, schema errors, etc.) are never silently retried; the foot-gun of "retry everything" is avoided by requiring conditions: to opt in.
- Each subprocess invocation is wrapped independently (wrap(exec), wrap(exec), wrap(exec)), not as one outer retry around the whole pipeline, so apply doesn't lose its budget to init.
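To illustrate why independent wrapping matters, here is a small sketch in which every step gets its own attempt budget. The names withRetry and flaky are hypothetical, backoff is omitted, and this is not how pkg/retry is implemented.

```go
package main

import (
	"errors"
	"fmt"
)

// withRetry runs step up to maxAttempts times and returns the attempts used.
// Backoff is omitted to keep the sketch short.
func withRetry(maxAttempts int, step func() error) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = step(); err == nil {
			return attempt, nil
		}
	}
	return maxAttempts, err
}

// flaky returns a step that fails its first n calls, then succeeds.
func flaky(n int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return errors.New("502 Bad Gateway")
		}
		return nil
	}
}

func main() {
	// Each step is wrapped separately, so each has a fresh budget of 3 attempts.
	steps := []struct {
		name string
		run  func() error
	}{
		{"init", flaky(2)},
		{"workspace", flaky(0)},
		{"apply", flaky(2)},
	}
	for _, s := range steps {
		attempts, err := withRetry(3, s.run)
		fmt.Printf("%s: attempts=%d err=%v\n", s.name, attempts, err)
	}
}
```

With a single outer budget of three attempts around the whole pipeline, the two failures in init would leave apply with only one try; wrapping each step avoids that.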
references
- New docs page: /stacks/components/terraform/retry
- Blog post: website/blog/2026-05-18-terraform-component-retry.mdx (tag: feature)
- Roadmap milestone added to the CI/CD Simplification initiative
- Reuses pkg/retry.WithPredicate and the existing WithStdoutCapture/WithStderrCapture shell options (no duplication)
Summary by CodeRabbit
- New Features
- Per-component retry for Terraform subprocesses: opt-in retry: block with regex-driven conditions, max attempts, backoff, delays, jitter; wraps init, workspace and main commands; inherited from abstract components and overridable; no retry occurs by default without conditions.
- Documentation
- New docs and blog post explain configuration, inheritance, and safety defaults.
- Tests
- Extensive unit tests added covering decoding, merge/precedence, execution retry logic, and condition matching.
feat(website): per-page raw .md routes + Copy Markdown button on atmos.tools @osterman (#2503)
what
- Adds per-page .md routes to atmos.tools: every doc URL is mirrored at <url>.md with Content-Type: text/markdown.
- Adds a "Copy Markdown" / "View Markdown" split button above each doc page.
- Adds <link rel="alternate" type="text/markdown" href="<url>.md"> to each doc page's <head> so crawlers and LLM tooling can discover the alternate.
- Introduces an AST-based MDX→Markdown normalizer (website/plugins/docusaurus-plugin-llms-txt/src/mdx-normalize.mjs) with a per-component handler table covering <Intro>, <Tabs>/<TabItem>, <Terminal>, <File>, <Note>, <Step>, <dl><dt><dd>, marketing cards, and more. Unknown components unwrap to their children. 19 unit tests.
- Reuses the same normalizer for llms-full.txt, replacing a lossy regex-strip that dropped JSX content wholesale; the LLM corpus file is now meaningfully richer.
- Fixes a pre-existing bug in the llms-txt plugin: pages with frontmatter id:/slug: overrides were silently dropped from llms.txt/llms-full.txt because the resolver searched by filename. Switched to Docusaurus's .docusaurus/<plugin>/<id>/*.json cache for the authoritative permalink → source map. Pages processed went from 554 → 735.
why
- LLM-driven workflows (Claude Code, ChatGPT, custom agents) are now first-class consumers of our docs. A raw Markdown alternate makes our docs trivially feedable to any LLM without HTML scraping.
- A rel="alternate" Markdown link is a standard discovery pattern for agentic crawlers; no special-case scraping needed.
- The MDX wholesale-strip approach was corrupting llms-full.txt (tab content, flag tables, intros all silently dropped). The AST normalizer preserves structure correctly.
- The id:/slug: override bug meant ~25% of CLI command pages were missing from the LLM corpus entirely.
references
- Inspiration: FlyNumber/markdown_docusaurus_plugin (UX reference; we extended our existing plugin rather than adopting it).
- Parity with atmos-pro's recently-shipped Copy Markdown affordance.
- Blog post: website/blog/2026-05-24-copy-markdown-button.mdx
Summary by CodeRabbit
- New Features
- Docs pages available as raw .md URLs; "Copy Markdown" and "View Markdown" controls added to doc UI
- Site generates per-page Markdown files and a synthesized index for discovery
- Documentation
- Pages include rel="alternate" Markdown links
- MDX content normalized into portable Markdown while preserving tabs, code/terminal/file blocks, notes, and definition lists
- Tests
- Added coverage for HTML-comment and truncate-marker behavior
- Chores
- Added Markdown parsing dependency; improved deploy script/workflow to ensure text files use UTF-8 charset
feat: implement remote stack imports @osterman (#2037)
what
- Add support for importing stack configurations from remote URLs (HTTP, Git, S3, GCS) using go-getter
- Stack imports now work consistently with remote imports for atmos.yaml
- New pkg/stack/imports package handles URL detection and remote downloading
- Updated stack_processor_utils.go to detect remote URLs and download them automatically
why
- This feature was documented but not yet implemented (fixes #2036)
- Teams need to share stack configurations across multiple repositories without vendoring
- Enables central catalogs, version-pinned imports, and cross-team config sharing
- Provides consistency between atmos.yaml imports and stack file imports
references
- closes #2036
- Blog post: website/blog/2026-01-29-remote-stack-imports.mdx
- Example: examples/remote-stack-imports/
- Documentation: Stack Imports
Summary by CodeRabbit
- New Features
- Remote stack imports with local caching for HTTP(S), Git, S3, and GCS; skip-if-missing and version-pinning support.
- Documentation
- New example project and README demonstrating local+remote import composition; blog post and roadmap entry announcing the feature.
- Chores
- Improved on-disk and in-memory caching, atomic cache writes, and cross-platform file-locking.
- Tests
- Expanded unit and integration tests covering URI classification, downloading, caching, locking, and CLI scenarios.
feat(aws/security): add SARIF/OCSF exports and harden CI @osterman (#2483)
What
This PR adds machine-readable security export formats to atmos aws security analyze and includes the CI hardening needed to keep the branch green.
AWS security exports
- Adds --format=sarif for SARIF 2.1.0 output compatible with GitHub code scanning, Azure DevOps, and SARIF viewers.
- Adds --format=ocsf for OCSF 1.4.0 Detection Finding output for SIEM and security data lake ingestion.
- Preserves Atmos context in exported findings, including stack, component, component path, remediation steps, deploy command, mapped physical locations, and logical fallback locations for unmapped resources.
- Produces deterministic output ordering for stable diffs and deduplication.
- Maps Atmos severities into SARIF/GHAS levels and OCSF severity/status fields.
- Adds schema-backed and structural test coverage for SARIF and OCSF renderers, determinism, empty/nil inputs, mapped/unmapped findings, compliance reports, and malformed SARIF rejection.
- Updates CLI docs, blog content, roadmap data, and PRD notes. The experimental Atmos Pro upload surface was removed before merge; the design is preserved in docs/prd/atmos-pro-security-findings-upload.md for later revival.
CI and workflow hardening
- Upgrades actions/checkout usage from v4 to v6 across workflows and docs/examples; updates the SHA-pinned Atmos Pro checkout to v6.0.2.
- Grants packages: read so CI jobs pulling from GHCR can authenticate with the workflow token.
- Grants reviewdog the PR/check permissions it needs for tflint annotations and uses github.token explicitly.
- Uses opentofu/setup-opentofu@v1 on Windows instead of installing OpenTofu through the Atmos toolchain path that was failing signature verification.
- Fixes demo-stack wttr.in URLs, changes Swedish language code from se to sv, and adds HTTP retries so screengrab generation is less brittle.
Why
SARIF and OCSF let Atmos security findings flow directly into standard security workflows instead of requiring users to translate markdown/json output themselves. The CI changes address failures observed while validating this branch: checkout/auth instability, Windows OpenTofu setup failures, reviewdog token scope failures, and live wttr.in request failures in screengrab generation.
Validation
- GITHUB_TOKEN=$(gh auth token) node .github/actions/verify-sha-pinning/test.mjs
- make -C demo/screengrabs build-all
- terraform fmt -check examples/demo-stacks/components/terraform/myapp/main.tf
- pre-commit run check-yaml --files .github/workflows/test.yml examples/demo-stacks/stacks/deploy/dev.yaml examples/demo-stacks/component.yaml
- git diff --check
References
- SARIF 2.1.0 spec: https://docs.oasis-open.org/sarif/sarif/v2.1.0/sarif-v2.1.0.html
- GitHub SARIF support: https://docs.github.com/en/code-security/code-scanning/integrating-with-code-scanning/sarif-support-for-code-scanning
- OCSF schema: https://schema.ocsf.io/
Summary by CodeRabbit
- New Features
- Added SARIF 2.1.0 and OCSF 1.4.0 export options; CLI accepts --format=sarif/ocsf, emits deterministic, Atmos‑enriched outputs and records UTC invocation/audit metadata.
- Integrated Amazon Inspector2 native findings into security analysis with normalization, deduplication and preferred native results.
- Tests
- Extensive unit and JSON‑schema test suites for SARIF and OCSF ensuring spec conformance and byte‑stable output.
- Documentation
- Docs, blog post, PRDs, and roadmap updated for SARIF/OCSF support and usage examples.
- Chores
- CI checkout action pinned; NOTICE dependency list updated.
feat(hooks): add hook kinds, scanner integrations, SARIF summaries, and skip controls @osterman (#2482)
what
- Adds a kind discriminator to the hook system with built-in kinds for store, command, infracost, checkov, trivy, and kics.
- Adds the generic command hook engine, including toolchain-aware binary resolution, live stdout/stderr passthrough, templated args/env support, ATMOS_* runtime env vars, output-file/output-dir side channels, and configurable on_failure behavior.
- Adds scanner and cost integrations:
  - infracost parses JSON breakdown output into a markdown cost summary.
  - checkov, trivy, and kics emit SARIF and share one parser/markdown renderer.
- Adds normalized SARIF handling for severity counts, linked rule IDs via helpUri, short descriptions, file/line locations, and empty-result handling.
- Adds hook dependency preflight: component dependencies.tools are installed before hooks run, toolchain paths take precedence over operator PATH, and missing hook binaries fail before Terraform starts.
- Adds a curated embedded Atmos tool registry with a KICS override so dependencies.tools.kics can install from release tarballs.
- Adds pkg/cacerts and wires Checkov to SSL_CERT_FILE/REQUESTS_CA_BUNDLE so PyInstaller-bundled Checkov can use the host CA bundle.
- Adds the --skip-hooks global flag and ATMOS_SKIP_HOOKS, supporting skip-all and comma-separated named-hook skipping.
- Preserves backward compatibility by accepting legacy command: store hook configs as kind: store.
- Makes hooks work with resolved component workdirs so scanners inspect the same directory Terraform uses.
- Adds runnable examples for infracost, checkov, trivy, kics, and custom kind: command hooks.
infracost,checkov,trivy,kics, and customkind: commandhooks. - Updates hook docs, global flag docs, PRDs, roadmap data, and adds the custom hooks blog post.
- Refreshes CLI help snapshots and updates CI workflow actions/shell handling needed by the branch.
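As a rough illustration of the severity-count part of the SARIF handling listed above, the sketch below tallies result levels from a SARIF 2.1.0 document. The type sarifLog and the function countLevels are hypothetical names, not the parser shipped in this PR, and treating a missing level as warning is a simplification of the spec's defaulting procedure.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sarifLog models only the fields this sketch needs from a SARIF 2.1.0 log.
type sarifLog struct {
	Runs []struct {
		Results []struct {
			RuleID string `json:"ruleId"`
			Level  string `json:"level"`
		} `json:"results"`
	} `json:"runs"`
}

// countLevels returns how many results were reported at each level.
// An empty result set yields an empty map rather than an error.
func countLevels(data []byte) (map[string]int, error) {
	var doc sarifLog
	if err := json.Unmarshal(data, &doc); err != nil {
		return nil, fmt.Errorf("parse SARIF: %w", err)
	}
	counts := map[string]int{}
	for _, run := range doc.Runs {
		for _, res := range run.Results {
			level := res.Level
			if level == "" {
				level = "warning" // simplified default when level is omitted
			}
			counts[level]++
		}
	}
	return counts, nil
}

func main() {
	sample := []byte(`{"version":"2.1.0","runs":[{"results":[
		{"ruleId":"EXAMPLE_1","level":"error"},
		{"ruleId":"EXAMPLE_2","level":"note"},
		{"ruleId":"EXAMPLE_3"}]}]}`)
	counts, err := countLevels(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(counts)
}
```

Because checkov, trivy, and kics all emit SARIF, one tally like this can feed a shared summary renderer regardless of which scanner produced the file.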
why
- Hooks were already the right lifecycle surface for component automation, but only store had first-class dispatch behavior.
- Security scanners, cost estimators, and custom tools should run from stack config without wrapper scripts or GitHub Actions glue.
- Named kinds provide zero-config defaults for common tools, while kind: command keeps the system open for arbitrary binaries.
- Tool auto-install and preflight failures make examples and CI usage reproducible instead of relying on whatever happens to be on PATH.
- SARIF and infracost summaries create a common typed output path for terminal rendering now and Atmos Pro upload later.
notes
- This PR intentionally does not add a built-in tfsec kind or hooks-tfsec example. Trivy is the maintained Aqua-backed scanner path; legacy tfsec users can still wire it with kind: command.
- Atmos Pro upload, cross-run SARIF aggregation, Terraform component dependency auto-install outside hooks, and planfile threading remain follow-up work.
- The CodeRabbit-generated release notes below are preserved as-is.
Summary by CodeRabbit
- New Features
- Pluggable hook "kinds" (infracost, trivy, checkov, kics) plus a generic command kind; structured side‑channel outputs, markdown rendering, preflight tool resolution/auto‑install, CA‑bundle propagation, and per‑invocation hook skipping via --skip-hooks / ATMOS_SKIP_HOOKS.
- Documentation
- Detailed PRDs, reference docs, CLI help, blog post, and runnable examples for each hook kind.
- Tests
- Expanded unit and integration tests covering hooks, result handlers, SARIF parsing, toolchain registry, and examples.
feat(ci): auto-detect log level from GitHub Actions debug mode @osterman (#2495)
what
- Atmos now auto-detects when a workflow is running with GitHub Actions debug logging enabled and switches its own log level to Debug for the run.
- Triggered when ci.enabled: true is set in atmos.yaml and the active CI provider reports debug mode is on. For GitHub Actions, that means ACTIONS_RUNNER_DEBUG=true or ACTIONS_STEP_DEBUG=true, exactly what the built-in "Re-run with debug logging" button sets.
- Emits a single Info-level log line when it fires so users see why their output got louder: CI provider debug mode detected — using Debug log level for this run provider=github-actions from=Info.
- Built on a provider-agnostic optional interface: provider.DebugModeDetector { IsDebugMode() bool } in pkg/ci/internal/provider, plus a generic registry helper ci.DetectDebugMode() DebugModeInfo. The GHA provider implements the interface; cmd/root.go imports only pkg/ci and names no GHA-specific env vars.
- Auto-detection overrides --logs-level, ATMOS_LOGS_LEVEL, and logs.level in atmos.yaml: the CI-side debug toggle is set at the repo/workflow level by the runner itself and is treated as the higher-priority signal (including over an explicit Trace or Off).
- Ships with a new framework PRD (docs/prd/native-ci/framework/debug-mode-promotion.md), a changelog blog post, a roadmap milestone under the Native CI initiative, and unit tests covering: the GHA IsDebugMode() env-var matrix, the generic DetectDebugMode() type-assertion path, and the cmd-side helper's gates and override semantics.
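The optional-interface pattern described above might look roughly like this. Only the DebugModeDetector interface shape and the two environment variable names come from the notes; the Provider interface, the githubActions and plainProvider types, the injected getenv field, and detectDebugMode are assumptions made for the sketch.

```go
package main

import (
	"fmt"
	"os"
)

// DebugModeDetector mirrors the optional interface named in the notes above.
type DebugModeDetector interface {
	IsDebugMode() bool
}

// Provider is a stand-in for the CI provider type (hypothetical).
type Provider interface {
	Name() string
}

// githubActions is a hypothetical provider; getenv is injected so the
// sketch can be exercised without touching the real environment.
type githubActions struct {
	getenv func(string) string
}

func (githubActions) Name() string { return "github-actions" }

func (g githubActions) IsDebugMode() bool {
	return g.getenv("ACTIONS_RUNNER_DEBUG") == "true" ||
		g.getenv("ACTIONS_STEP_DEBUG") == "true"
}

// plainProvider does not implement the optional interface.
type plainProvider struct{}

func (plainProvider) Name() string { return "plain" }

// detectDebugMode uses a type assertion, so providers without the optional
// method are simply treated as "not in debug mode".
func detectDebugMode(p Provider) bool {
	d, ok := p.(DebugModeDetector)
	return ok && d.IsDebugMode()
}

func main() {
	fakeEnv := map[string]string{"ACTIONS_STEP_DEBUG": "true"}
	lookup := func(k string) string { return fakeEnv[k] }

	fmt.Println(detectDebugMode(githubActions{getenv: lookup}))    // fake env
	fmt.Println(detectDebugMode(githubActions{getenv: os.Getenv})) // real env
	fmt.Println(detectDebugMode(plainProvider{}))
}
```

The caller never names a GitHub-specific variable, which is the property the notes attribute to cmd/root.go.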
why
- Debugging Atmos in CI is usually just as important as debugging the workflow around it. GitHub provides a single "Re-run with debug logging" button to make every tool in the run verbose; today Atmos ignores it, so users get a noisier runner but the same quiet Atmos output, and have to remember a per-tool dance (ATMOS_LOGS_LEVEL=Debug somewhere in workflow YAML).
- The interface-based design keeps the startup path provider-agnostic, so adding the same behavior to a future CI provider is one method on the provider; no changes in cmd/ or pkg/ci needed.
- Overriding explicit --logs-level / ATMOS_LOGS_LEVEL is intentional: the CI-side toggle is an explicit, repo-/workflow-level "make everything noisy" signal that should beat per-invocation flags in the same run.
- This is the same gap other GitHub-published tools have hit, e.g. pypa/gh-action-pypi-publish#322, which validates the pattern.
references
- GitHub docs: Enable debug logging
- GitHub changelog: Re-run jobs with debug logging
- Prior art in another ecosystem: pypa/gh-action-pypi-publish#322
- New PRD: docs/prd/native-ci/framework/debug-mode-promotion.md
Summary by CodeRabbit
- New Features
- Atmos now auto-promotes its log level to Debug when running on GitHub Actions with per-run debug logging enabled and ci.enabled: true; an informational startup log notes the promotion and it overrides other log-level settings.
- Documentation
- Added product doc and blog post explaining debug-mode promotion and usage.
- Tests
- Added unit tests covering debug-mode detection and promotion behavior.
- Refactor
- Minor CI hook wiring cleanup in multi-component Terraform runs.
fix(ci): restore checks: write on lint job for reviewdog annotations @osterman (#2500)
what
- Add a job-scoped permissions: block on the lint ([lint] <demo-folder>) job in .github/workflows/test.yml granting contents: read + checks: write so reviewdog/action-tflint@v1 can post inline tflint findings on PRs via the GitHub Checks API.
Companion to #2499, which restored security-events: write on the docker ([lint] Dockerfile) job. Same root cause, second affected job.
why
- PR #2487 introduced the first workflow-level permissions: block on test.yml to grant packages: read for ghcr.io OCI pulls. A workflow-level permissions: block replaces (not extends) the default GITHUB_TOKEN scope for every job in the file, which silently stripped the inherited checks: write that the lint job relied on.
- Effect on contributors: since #2487 merged, tflint findings on PRs touching examples/<demo-folder>/components/terraform have stopped appearing as inline check annotations. The job itself still exits with the right code (fail_level: error controls that), but reviewers lost the per-line context. This restores that behavior.
- Job-scoped (least privilege) over widening the workflow-level block: only this one job uses reviewdog. Matches the convention used in .github/workflows/codeql.yml and the docker-job fix already landed in #2499.
- Not adding pull-requests: write: reviewdog's default github-pr-check reporter posts check runs (which need checks: write), not review comments. checks: write alone is sufficient.
references
fix(ci): restore security-events: write for Dockerfile lint SARIF upload @aknysh (#2499)
what
Restores security-events: write permission for the [lint] Dockerfile
job in .github/workflows/test.yml so its hadolint SARIF results can
be uploaded to GitHub Code Scanning.
- Adds a job-level permissions: block to the docker job:

      permissions:
        contents: read # actions/checkout
        security-events: write # github/codeql-action/upload-sarif

  contents: read is re-listed because a job-level permissions: block
  fully overrides the workflow-level set (rather than merging).
why
PR #2487 (2437e13bf) added a top-level permissions: block to the
workflow to grant packages: read for ghcr.io pulls:
    permissions:
      contents: read
      packages: read

In GitHub Actions, a workflow-level permissions: block replaces
(not extends) the default GITHUB_TOKEN scope for every job in the
file. That replacement inadvertently stripped the implicit
security-events: write that the [lint] Dockerfile job relied on to
upload hadolint SARIF results via github/codeql-action/upload-sarif@v4.
Every post-merge run on main has been failing the Upload SARIF
file step since #2487:
##[warning]This run of the CodeQL Action does not have permission to
access the CodeQL Action API endpoints. ... please ensure the workflow
has at least the 'security-events: read' permission.
##[error]Resource not accessible by integration -
https://docs.github.com/rest
Failing run for reference:
https://github.com/cloudposse/atmos/actions/runs/26339160817/job/77562841194
Note: hadolint itself ran successfully in the failing run — the SARIF
output contained zero findings. Only the upload step failed.
A job-level fix (this PR) is preferred over expanding the workflow-level
permissions block, because it follows least-privilege: only the one job
that actually needs to write security events gets the elevated scope.
references
- Failing CI run: https://github.com/cloudposse/atmos/actions/runs/26339160817/job/77562841194
- Regressing PR: #2487
- GitHub Actions permissions docs: https://docs.github.com/en/actions/security-for-github-actions/security-guides/automatic-token-authentication#permissions-for-the-github_token
- github/codeql-action/upload-sarif permission requirement: https://docs.github.com/en/code-security/code-scanning/integrating-with-code-scanning/uploading-a-sarif-file-to-github#uploading-the-sarif-file-to-github
Summary by CodeRabbit
- Chores
- Fixed permissions configuration in the CI/CD pipeline to restore security scanning capabilities in automated testing workflows.
Add toolchain package verification @osterman (#2415)
what
- Adds pkg/toolchain/verification for Aqua-compatible checksum, signature, and attestation verification before tool extraction.
- Preserves Aqua verification metadata across registry parsing, overrides, version overrides, installer flow, and lockfile metadata.
- Adds toolchain verification policy config, docs, roadmap entry, and changelog post.
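The checksum part of "verify before extraction" can be sketched as below. This is only an illustration of the general idea using SHA-256; verifySHA256 and errChecksumMismatch are hypothetical names, and the actual pkg/toolchain/verification package may structure this differently and also covers signatures and attestations.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// errChecksumMismatch is a hypothetical sentinel for this sketch.
var errChecksumMismatch = errors.New("checksum mismatch")

// verifySHA256 compares the SHA-256 digest of a downloaded asset with the
// expected hex digest from registry metadata. Callers would extract the
// archive only when this returns nil.
func verifySHA256(asset []byte, expectedHex string) error {
	sum := sha256.Sum256(asset)
	actual := hex.EncodeToString(sum[:])
	expected := strings.TrimSpace(expectedHex)
	if !strings.EqualFold(actual, expected) {
		return fmt.Errorf("%w: expected %s, got %s", errChecksumMismatch, expected, actual)
	}
	return nil
}

func main() {
	asset := []byte("fake tool archive")
	sum := sha256.Sum256(asset)
	expected := hex.EncodeToString(sum[:])

	fmt.Println(verifySHA256(asset, expected))
	fmt.Println(errors.Is(verifySHA256([]byte("tampered"), expected), errChecksumMismatch))
}
```

Failing before extraction means a tampered or mismatched archive never gets unpacked onto the machine.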
why
- Prevents tampered or mismatched toolchain package assets from being installed when registry metadata provides verification data.
- Keeps the default behavior non-breaking while allowing stricter checksum and signature requirements for CI and regulated environments.
references
- Tested with go test ./pkg/toolchain/installer ./pkg/toolchain/... ./cmd/toolchain/...
- Linted with scripts/run-custom-golangci-lint.sh
Summary by CodeRabbit
- New Features
- Toolchain now verifies downloaded packages (checksums and signatures/attestations) before extraction when registry metadata is present.
- Multiple verification methods supported (checksums, cosign, SLSA provenance, minisign, GitHub attestations); verifier install mode configurable (auto or path-only).
- Verification results and metadata are recorded in the toolchain lockfile; lockfile path is configurable.
- Bug Fixes
- Cached assets validated against recorded source URL; mismatched or tampered cached files are re-downloaded or removed; lockfile not updated when extraction fails.
- Documentation
- Added docs, examples, and a blog post explaining package verification and configuration.
🚀 Enhancements
fix(yaml-functions): detect cross-component !terraform.state cycles instead of stack-overflowing @thejrose1984 (#2533)
what
Fix the goroutine stack overflow reported in #2457: two components that reference each other via !terraform.state (A → B, B → A) drove atmos describe affected / describe component / terraform plan into infinite recursion until the Go runtime stack overflowed.
The YAML-function cycle detector already existed and worked within a single ProcessCustomYamlTags walk, but it didn't survive the recursive describe path that !terraform.state triggers.
why
When a component is being processed and the resolver encounters !terraform.state, it does:
processTagTerraformStateWithContext
→ GetTerraformState
→ ExecuteDescribeComponent (ProcessYamlFunctions: true)
→ ProcessStacks
→ ProcessCustomYamlTags ← re-entry
ProcessCustomYamlTags was wrapping every entry with scopedResolutionContext(), which saved the parent's context and installed a fresh, empty one. So when the inner walk found B's !terraform.state a ..., the cycle detector's Visited map had no record that A was already in progress, and it pushed A → B → A → B forever until the goroutine stack hit its 1 GB cap.
The cycle detector unit tests pass because they exercise Push/Pop on a single context; the only integration tests that would have caught this were t.Skip()-ed placeholders in internal/exec/yaml_func_circular_deps_test.go referencing fixtures that don't exist.
how
Three coordinated changes:
- internal/exec/yaml_func_utils.go: ProcessCustomYamlTags now reuses the goroutine-local ResolutionContext via GetOrCreateResolutionContext() and drops the scopedResolutionContext() wrap. The Push/Pop discipline in processTagTerraformStateWithContext / trackOutputDependency already pairs every successful Push with a deferred Pop, so the context is empty when the top-level walk returns. Removed the now-unused scopedResolutionContext helper.
- internal/exec/yaml_func_resolution_context.go: Added MaxResolutionDepth = 64 and a depth check in Push that returns ErrYamlFuncMaxResolutionDepth if any future re-entry path slips past the cycle detector. This is belt-and-suspenders: real cycles are caught by the Visited check; the depth bound exists so atmos surfaces a clean error instead of stack-overflowing if the detector regresses.
- internal/exec/terraform_state_utils.go: GetTerraformState's describe-error wrap now uses double %w so errors.Is can match a propagated sentinel like ErrCircularDependency through the descriptive wrapper. Without this, the cycle error message is human-readable but errors.Is(err, ErrCircularDependency) returns false, breaking callers that try to handle the error programmatically.
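The three changes above fit together roughly as in the following sketch: one context shared across nested walks, a visited check plus a depth bound in Push, and a wrapper with two %w verbs (Go 1.20+) so the sentinel stays matchable. The identifiers ErrCircularDependency and the value 64 come from the notes; the resolutionContext type, ErrMaxResolutionDepth, ErrDescribeComponent, resolve, and isCycle are simplified stand-ins, not the Atmos implementation.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	ErrCircularDependency = errors.New("circular dependency detected")
	ErrMaxResolutionDepth = errors.New("maximum resolution depth exceeded")
	ErrDescribeComponent  = errors.New("failed to describe component")
)

const maxResolutionDepth = 64

// resolutionContext is shared by every nested walk, so an inner walk can
// see which components are already in progress.
type resolutionContext struct {
	visited map[string]bool
	stack   []string
}

func newResolutionContext() *resolutionContext {
	return &resolutionContext{visited: map[string]bool{}}
}

func (c *resolutionContext) Push(node string) error {
	if c.visited[node] {
		return fmt.Errorf("%w: %v -> %s", ErrCircularDependency, c.stack, node)
	}
	if len(c.stack) >= maxResolutionDepth {
		return ErrMaxResolutionDepth // safety net if the visited check regresses
	}
	c.visited[node] = true
	c.stack = append(c.stack, node)
	return nil
}

func (c *resolutionContext) Pop() {
	last := c.stack[len(c.stack)-1]
	c.stack = c.stack[:len(c.stack)-1]
	delete(c.visited, last)
}

// resolve walks deps depth-first, reusing one context across nested calls.
func resolve(ctx *resolutionContext, deps map[string][]string, node string) error {
	if err := ctx.Push(node); err != nil {
		return err
	}
	defer ctx.Pop() // every successful Push is paired with a deferred Pop
	for _, dep := range deps[node] {
		if err := resolve(ctx, deps, dep); err != nil {
			// Two %w verbs keep both sentinels matchable via errors.Is.
			return fmt.Errorf("%w: %s: %w", ErrDescribeComponent, node, err)
		}
	}
	return nil
}

// isCycle shows the programmatic check that a single %w would break.
func isCycle(err error) bool { return errors.Is(err, ErrCircularDependency) }

func main() {
	deps := map[string][]string{"A": {"B"}, "B": {"A"}}
	err := resolve(newResolutionContext(), deps, "A")
	fmt.Println(isCycle(err))
	fmt.Println(err)
}
```

If each nested call created a fresh context instead of reusing ctx, the visited map would always be empty on re-entry and the A → B → A loop would recurse until the stack limit, which is the failure mode the fix removes.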
tests
- New tests/yaml_functions_circular_deps_integration_test.go plus fixture at tests/fixtures/scenarios/yaml-functions-circular-deps/: exercises the full ExecuteDescribeComponent path on an A↔B cycle and asserts ErrCircularDependency comes back (and not the depth safety net, which would indicate the cycle detector regressed). Test completes in ~20 ms instead of running forever.
- Removed internal/exec/yaml_func_circular_deps_test.go: all four tests in it were t.Skip()-ed placeholders referencing fixtures that don't exist. The new integration test replaces them with one that actually runs.
- All existing TestResolutionContext* unit tests still pass unchanged.
references
Summary by CodeRabbit
- New Features
  - Improved YAML-function cycle detection with clearer, surfaced errors.
  - Added a maximum recursion-depth safeguard to prevent stack overflow during YAML-function resolution.
- Bug Fixes
  - Enhanced error wrapping so root causes are preserved and easier to identify.
- Tests
  - Added an integration regression test for cross-component cycles and removed obsolete skipped tests.
- Fixtures
  - Added scenario fixtures to reproduce and validate circular dependency behavior.
[codex] Fix remote Git stack imports @osterman (#2528)
what
- Add a remote stack import resolver that can return multiple local import matches while preserving the existing single-file download path.
- Handle Git go-getter `//subdir` imports by cloning the repository as a directory, resolving files, no-extension YAML variants, explicit globs, and recursive YAML directory imports.
- Cache expanded remote files with stable `<original-uri>#<relative-file>` keys and update stack processing/tests to consume those keys for imports and provenance.
why
- Fixes the regression where remote Git subdir imports were forced through file mode, causing `git clone` to drop the repository name and target `https://github.com/<owner>/`.
- Supports remote folder imports for stack manifests without leaking local cache paths into import metadata.
references
- Remote imports docs: https://atmos.tools/stacks/imports
- Validated with `go test ./pkg/stack/imports ./pkg/downloader`, `go test ./tests -run RemoteStackImports`, and `go test ./internal/exec -run '^$'`.
- Commit hooks passed after building the required local `custom-gcl` lint binary.
Summary by CodeRabbit
- New Features
  - Remote imports now resolve Git subdirectories, wildcards, and nested-remote imports with deterministic matching and improved per-session and persistent caching.
  - Per-import control for nested import resolution (local vs remote) with validation and defaults.
- Bug Fixes
  - More consistent handling of missing imports and skip-if-missing behavior; clearer errors for unresolved imports.
- Documentation
  - Docs, examples, and PRD updated to explain nested-imports behavior and best practices.
- Tests
  - Expanded unit and integration tests for remote/Git resolution and CLI scenarios.
fix(hooks): persist interactive stack selection and auth context for PostRunE hooks @aknysh (#2520)
what
- Store hooks now fire when the stack is selected via the interactive prompt (not just when passed with `-s`).
- Store hooks can now read terraform outputs from S3 backends that require role assumption via `assume_role`.
- Auth context and auth manager from the main terraform execution are persisted and injected into PostRunE hook info so the hook's `terraform output` subprocess has the correct credentials.
why
Two bugs in the store hooks execution path caused hooks to silently fail in common scenarios:
- Interactive stack selection lost for hooks (#2432): When a user runs `atmos terraform apply component` and selects the stack from the interactive prompt, the selected stack value was stored in the local `info.Stack` but never persisted to the Cobra flag set. PostRunE hooks re-parse args via `ProcessCommandLineArgs`, which reads from `cmd.Flags().GetString("stack")`, and that was still empty. The hook silently skipped because it saw no stack. Fix: after the interactive prompt fills `info.Stack`, also set the Cobra flag via `f.Value.Set(stack)` so downstream consumers can read it.
- Missing backend credentials for hook's terraform output (#2433): The store hook's `terraform output` subprocess needs credentials to access the S3 state backend, which often requires a chained role assumption (e.g., `dev-role → tfstate-access-role`). The main `ExecuteTerraform` sets up the full credential chain via `setupTerraformAuth` and `prepareComponentExecution`, but it takes `info` by value, so the populated `AuthContext` and `AuthManager` don't flow back to the caller. The PostRunE hook creates a fresh `info` with nil auth fields, so the output subprocess fails with "No valid credential sources found". Fix: persist the auth context via `SetLastAuthContext` after `prepareComponentExecution`, and inject it into the hook's fresh `info` in `runHooksWithOutput`.
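The by-value pitfall behind the second bug can be reproduced in a few lines. The sketch below is illustrative only: the `info` struct, `prepareByValue`, `prepareAndPersist`, `lastAuthContext`, and `newHookInfo` are hypothetical stand-ins and do not reflect the real Atmos types or the signature of `SetLastAuthContext`.

```go
package main

import "fmt"

// info is a hypothetical, trimmed-down stand-in for the struct described above.
type info struct {
	Stack       string
	AuthContext string
}

// prepareByValue receives a copy, so the populated field never reaches the caller.
func prepareByValue(i info) {
	i.AuthContext = "assumed-role-credentials"
}

// lastAuthContext is a package-level slot that survives the by-value call.
var lastAuthContext string

// prepareAndPersist populates the copy and also records the result for later consumers.
func prepareAndPersist(i info) {
	i.AuthContext = "assumed-role-credentials"
	lastAuthContext = i.AuthContext
}

// newHookInfo builds the fresh struct a hook would see, injecting whatever was persisted.
func newHookInfo(stack string) info {
	return info{Stack: stack, AuthContext: lastAuthContext}
}

func main() {
	i := info{Stack: "dev"}
	prepareByValue(i)
	fmt.Printf("after by-value call: %q\n", i.AuthContext) // still empty

	prepareAndPersist(i)
	fmt.Printf("hook sees: %q\n", newHookInfo("dev").AuthContext)
}
```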
references
- Closes #2432
- Closes #2433
- Related: #2428 (original consolidated issue, closed in favor of #2432 and #2433)
- Related: #2357 (auth resolver injection for hooks)
Summary by CodeRabbit
- Bug Fixes
  - Post-run hooks now preserve Terraform auth so store hooks and hooks triggered after interactive stack selection work when role-assumption is required.
- Chores
  - Bumped Go to 1.26.3, upgraded many dependencies, updated NOTICE license listings, and advanced app version to 1.220.0.
- Tests
  - Added tests for interactive component/stack prompts and auth-context persistence.
- Documentation
  - Added two docs describing the store-hook role-assumption issue and interactive-prompt hook behavior.
fix(auth): don't fall through to webflow when aws/user keyring read fails @osterman (#2470)
what
- Stop `atmos auth login` from silently falling back to browser-based OAuth2 webflow when an `aws/user` identity HAS configured credentials but the keyring read fails (corrupted entry, missing fields, deserialization error, permission denied).
- Distinguish two failure modes in `credentialsFromStore`: `ErrAwsUserNotConfigured` (keyring miss, where webflow remains an appropriate fallback) vs new `ErrAwsUserKeyringReadFailed` (keyring reachable but unreadable, where webflow is now skipped so the real error surfaces).
- Promote the "starting browser-based authentication" message from `Debug` to `Info` and include a hint pointing at `atmos auth user configure` and `webflow_enabled: false`, so users see why the browser opened.
- Fix a latent realm bug in `cmd/auth/user/configure.go`: it was hardcoding `store.Store(alias, creds, "")` while the login resolver reads `Retrieve(i.name, i.realm)`. With `auth.realm` or `ATMOS_AUTH_REALM` set, configure wrote one slot and login read another. Configure now computes the realm the same way `pkg/auth/manager.go` does.
- Add regression test `TestUserIdentity_Authenticate_KeyringReadFailureSkipsWebflow` that primes the keyring with an incomplete entry and asserts the webflow callback server is never reached. Tighten `TestUser_credentialsFromStore` to assert the specific sentinel returned in each path.
- Document the resolver order in `website/docs/cli/configuration/auth/identities.mdx`: clarify that Atmos does not consult ambient AWS credentials (env vars, `~/.aws/credentials`, instance profiles) for `aws/user`, and call out the new keyring-read-failure diagnostic.
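The gating described above comes down to branching on which sentinel the credential lookup returned. A minimal sketch, assuming the two sentinel names from the notes; `shouldFallBackToWebflow` and its `webflowEnabled` parameter are hypothetical and only illustrate the decision, not the actual resolver.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	// Keyring miss: nothing was ever configured for this identity.
	ErrAwsUserNotConfigured = errors.New("aws/user credentials not configured")
	// Keyring reachable but the entry could not be read or decoded.
	ErrAwsUserKeyringReadFailed = errors.New("aws/user keyring read failed")
)

// shouldFallBackToWebflow allows the browser flow only when credentials are
// genuinely absent and the (hypothetical) webflowEnabled setting permits it.
func shouldFallBackToWebflow(err error, webflowEnabled bool) bool {
	return webflowEnabled && errors.Is(err, ErrAwsUserNotConfigured)
}

func main() {
	miss := fmt.Errorf("identity dev: %w", ErrAwsUserNotConfigured)
	corrupt := fmt.Errorf("identity dev: %w", ErrAwsUserKeyringReadFailed)

	fmt.Println(shouldFallBackToWebflow(miss, true))    // true
	fmt.Println(shouldFallBackToWebflow(corrupt, true)) // false
	fmt.Println(shouldFallBackToWebflow(miss, false))   // false
}
```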
why
- Reported by Dan Miller in the Atmos community Slack: after upgrading past v1.214 he got redirected to the AWS sign-in browser flow even though he ran `atmos auth user configure` and "had access keys." The only working escape was `credentials.webflow_enabled: false`, which masked the real problem rather than fixing it.
- The webflow path introduced in #2148 (v1.215) was the right design, since webflow is a legitimate authentication tier for IAM users, but the resolver collapsed every keyring failure into a single "not configured" error, so a corrupted/unreadable keyring entry looked identical to a fresh install and the browser flow fired. The browser flow then 400'd against the AWS sign-in token endpoint with no indication that the real failure was reading the keyring.
- Restoring the design invariant ("if creds are provided, use them; never silently bypass them with webflow") requires distinguishing the two failure modes. Once distinguished, webflow is correctly gated and the Info-level diagnostic tells the user what Atmos saw.
- The realm hardcode was not Dan's trigger (his realm resolved to `""` on both sides) but is a real latent foot-gun for anyone using `auth.realm` for credential isolation. Fixing it here closes the gap before another report arrives.
references
- Slack thread: original report from Dan Miller in cloudposse community Slack (no public issue).
- PR #2148: original OAuth2 PKCE webflow introduction (commit `4e32b532f`, shipped in v1.215).
Summary by CodeRabbit
- Bug Fixes
  - Credential configuration now resolves the auth realm so saved credentials go to the correct slot.
  - Keyring read failures (corrupted/unreadable entries) are reported and no longer silently fall back to browser auth; webflow is only used when credentials are genuinely absent.
  - Better distinction between missing credentials and unreadable keyring data.
- Documentation
  - Clarified browser-based fallback conditions and what counts as static AWS user credentials.
- Tests
  - Added regression tests to ensure keyring read failures skip webflow.
fix(aws/security): allow unlimited findings and record invocation in OCSF @osterman (#2517)
what
- `atmos aws security analyze --max-findings 0` (or any non-positive value) now fetches all matching findings from Security Hub / Inspector instead of silently capping at 500.
- The fetcher emits a `log.Warn` whenever pagination halts at the limit while `NextToken != nil`, so truncation is never silent again.
- Every OCSF event now carries the literal command line, arguments, timing, exit code, working directory, and scanned scope under `unmapped["atmos.invocation"]`, the OCSF analogue of SARIF's `run.invocations[]` (which Atmos already emits).
- Default behavior is unchanged (`--max-findings` still defaults to 500); only the previously-broken `0` semantics now work, and the help text + docs document it.
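The limit semantics above (non-positive means unlimited, and truncation is reported rather than silent) can be sketched against an in-memory pager. This is an assumption-laden illustration: `fetchFindings` is a hypothetical name, the pages are plain string slices rather than AWS SDK responses, and "more data remained" stands in for `NextToken != nil`.

```go
package main

import "fmt"

// fetchFindings walks pages in order. maxFindings <= 0 means "no limit".
// The boolean result reports whether the walk stopped early while more
// findings were still available, which is the condition worth warning about.
func fetchFindings(pages [][]string, maxFindings int) ([]string, bool) {
	var findings []string
	for _, page := range pages {
		for _, item := range page {
			if maxFindings > 0 && len(findings) >= maxFindings {
				return findings, true
			}
			findings = append(findings, item)
		}
	}
	return findings, false
}

func main() {
	pages := [][]string{{"f1", "f2"}, {"f3"}}

	all, truncated := fetchFindings(pages, 0)
	fmt.Println(len(all), truncated) // 3 false

	some, truncated := fetchFindings(pages, 2)
	if truncated {
		fmt.Printf("warning: stopped at %d findings; more are available\n", len(some))
	}
}
```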
why
- The 500 cap is a CLI-layer default, not an AWS pagination limit. For multi-account orgs, real finding counts routinely exceed 500, so `--format json/sarif/ocsf` exports were silently incomplete and downstream tooling (SIEM ingestion, ticketing, dashboards) was missing data with no error or warning.
- AI analysis users get cost protection by keeping 500 as the default; export users get correctness by opting in to `--max-findings 0`. The `log.Warn` covers the case where users forget: they'll see in the output that more findings exist.
- SARIF already records the invocation, but OCSF Detection Finding 2004 has no native invocation slot. Auditors and SIEM analysts asking "what command produced this batch?" can now answer it from either format.
references
- Closes the silent-clamp issue surfaced during use of `atmos aws security analyze` for SIEM export pipelines.
- SARIF 2.1.0 invocation spec: https://docs.oasis-open.org/sarif/sarif/v2.1.0/sarif-v2.1.0.html (already implemented in `pkg/aws/security/sarif.go`).
- OCSF 1.4.0 Detection Finding: https://schema.ocsf.io/1.4.0/classes/detection_finding (no native invocation field; landed in `unmapped` extension).
- Related shipped feature: #2483 (initial SARIF/OCSF export support).
Summary by CodeRabbit
- New Features
  - `--max-findings` now distinguishes "unset" from an explicit 0; 0 means unlimited and the effective default remains 500.
  - CLI prints a clear info message when fetching all findings vs a limited fetch.
  - OCSF exports now include report invocation metadata in each event when available.
- Bug Fixes
  - Warning logged when a positive limit truncates results to avoid silent loss; pagination now respects the limit.
- Documentation
  - CLI, config, and PRD docs updated to reflect the new semantics.
- Tests
  - Added tests covering max-findings precedence, pagination behaviors, and OCSF invocation attachment.
fix(terraform-state): honor target component's env section for AWS credentials @arcaven (#2502)
what
- Makes `!terraform.state`'s in-process S3 backend reader honor a whitelisted subset of the target component's `env` section (`AWS_PROFILE`, `AWS_REGION`, `AWS_DEFAULT_REGION`, `AWS_CONFIG_FILE`, `AWS_SHARED_CREDENTIALS_FILE`, `AWS_ENDPOINT_URL_S3`, `AWS_ENDPOINT_URL_STS`, `AWS_USE_FIPS_ENDPOINT`), matching the behavior `!terraform.output` already exhibits via its subprocess env overlay. `AWS_STS_REGIONAL_ENDPOINTS` is intentionally excluded because it's an SDK v1 toggle and a no-op in SDK v2.
- Adds `internal/terraform_backend.ExtractComponentEnvOverlay` (with a nil-pointer guard) and `internal/terraform_backend.ComponentEnvKeysAWS`. Threads the overlay through `getCachedS3Client` and `ReadTerraformBackendS3`. The S3 client cache key now includes every whitelisted key that affects client behavior (profile, both regions, both endpoint URLs, FIPS, config + credentials files) so two components with distinct settings never alias each other.
- Extends `pkg/aws/identity` with `LoadConfigWithAuthAndEnv`. `LoadConfigWithAuth` becomes a thin nil-overlay wrapper, so every existing call site behaves identically. Within the new variant:
  - `AWS_USE_FIPS_ENDPOINT` (truthy `"true"`/`"1"`) is applied via `config.WithUseFIPSEndpoint(aws.FIPSEndpointStateEnabled)`, a global config setting.
  - `AWS_ENDPOINT_URL_STS` is applied at `sts.NewFromConfig` in the assume-role flow, a per-service option in SDK v2.
  - `AWS_ENDPOINT_URL_S3` is applied at `s3.NewFromConfig` in `getCachedS3Client` for the same SDK v2 per-service-option reason.
- Adds focused unit tests covering the overlay extraction (9 subcases including the nil-pointer guard), the whitelist surface stability, the credential-resolution precedence (5 cases including a sentinel that asserts `LoadConfigWithAuth ≡ LoadConfigWithAuthAndEnv(..., nil)`), and the FIPS application (4 subcases including an authContext-suppresses-overlay assertion).
- Adds an advanced docs section to `functions/yaml/terraform.state.mdx` ("Switching AWS credentials per component via the `env` section") and a cross-reference note in `functions/yaml/terraform.output.mdx`. Placed alongside the existing specialized sections (SSE-C, GCS, static); primary examples untouched.
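The whitelist extraction described above can be approximated as follows. This is a sketch under stated assumptions: `extractOverlay` and `allowedKeys` are hypothetical names, the input is modeled as a plain `map[string]any`, and the real `ExtractComponentEnvOverlay` signature may differ. It does preserve the property the notes emphasize: a component with no whitelisted key yields a nil overlay.

```go
package main

import "fmt"

// allowedKeys lists the env keys named in the notes above.
var allowedKeys = []string{
	"AWS_PROFILE", "AWS_REGION", "AWS_DEFAULT_REGION",
	"AWS_CONFIG_FILE", "AWS_SHARED_CREDENTIALS_FILE",
	"AWS_ENDPOINT_URL_S3", "AWS_ENDPOINT_URL_STS", "AWS_USE_FIPS_ENDPOINT",
}

// extractOverlay (hypothetical) copies only whitelisted, non-empty string
// values out of the component's "env" section. It returns nil when nothing
// qualifies, so callers can keep the pre-existing code path unchanged.
func extractOverlay(sections map[string]any) map[string]string {
	env, ok := sections["env"].(map[string]any)
	if !ok {
		return nil
	}
	var overlay map[string]string
	for _, key := range allowedKeys {
		value, ok := env[key].(string)
		if !ok || value == "" {
			continue
		}
		if overlay == nil {
			overlay = make(map[string]string)
		}
		overlay[key] = value
	}
	return overlay
}

func main() {
	sections := map[string]any{
		"env": map[string]any{
			"AWS_PROFILE":                "org-b-readonly",
			"AWS_STS_REGIONAL_ENDPOINTS": "regional", // not whitelisted, ignored
			"TF_LOG":                     "debug",    // not whitelisted, ignored
		},
	}
	fmt.Println(extractOverlay(sections))
	fmt.Println(extractOverlay(nil) == nil)
}
```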
why
- `!terraform.output` and `!terraform.state` are documented as interchangeable readers of the same state. They aren't, in setups that distribute Terraform state across AWS organizations.
  - `!terraform.output` shells out to `tofu`/`terraform`. `pkg/terraform/output.defaultEnvironmentSetup.SetupEnvironment` overlays the target component's `env` section onto the subprocess environment as its final step, so `env.AWS_PROFILE` reaches the backend's credential resolution.
  - `!terraform.state` reads in-process via `internal/terraform_backend.ReadTerraformBackendS3` → `getCachedS3Client` → `pkg/aws/identity.LoadConfigWithAuth` → `config.LoadDefaultConfig`. When no Atmos `AWSAuthContext` is provided, the SDK uses the calling process's `AWS_PROFILE`. `componentSections["env"]` is in scope on the same map but no function in this chain reads it.
- In practice this means a stack in one AWS org calling `!terraform.state` against a stack in another org fails with `AccessDenied` on `sts:AssumeRole` (or silently reads the wrong account when buckets happen to share names), while the equivalent `!terraform.output` call works. The two functions are supposed to produce identical results; this is a correctness gap.
- Backward compatibility was the design constraint. A component without any whitelisted env key produces a nil overlay and the resolved code path is byte-identical to the prior behavior. Users who don't use this pattern see no change. A sentinel test asserts this directly.
- Atmos auth (`AWSAuthContext`) layers above the overlay and still wins outright. The env overlay path is intended for setups not yet on Atmos auth, which is the common case for the SweetOps community at the moment.
- GCS and AzureRM backends share the bug structurally. We don't have GCS or Azure tfstate to validate those readers locally and would rather defer than ship code we haven't run. The fix shape generalises trivially (one new whitelist slice + one `ExtractComponentEnvOverlay` call per backend). Calling that out in the issue and the docs so a contributor with those backends can pick it up.
references
- Closes #2501
- Working reference implementation we're mirroring in-process: `pkg/terraform/output/environment.go::SetupEnvironment` (the final `for k, v := range config.Env` loop).
- Function this PR extends: `pkg/aws/identity.LoadConfigWithAuth`.
- In-process reader being fixed: `internal/terraform_backend/terraform_backend_s3.go::ReadTerraformBackendS3` / `getCachedS3Client`.
- Adjacent context for S3 path construction in `!terraform.state`: #1920.
- Auth-chain inheritance for `!terraform.output` (referenced from `terraform_output_utils.go`): #1921.
Summary by CodeRabbit
- New Features
  - Per-component AWS env overlay for Terraform S3 remote-state reads; explicit auth/context still takes precedence. Identity loading now accepts optional env overlays and respects overlay vs. auth precedence.
- Documentation
  - Guidance added for cross-account remote-state reads and env-overlay behavior for terraform.output and terraform.state.
- Tests
  - Added tests for overlay extraction, precedence rules, FIPS behavior, and a stable ordered whitelist of AWS env keys.
- Chores
  - Minor log formatting adjustments.
fix(toolchain): retry cosign on transient Sigstore Rekor failures @osterman (#2506)
what
- Wrap the `cosign verify-blob` exec in `pkg/toolchain/verification/signature.go` with bounded exponential backoff, retrying only on a narrow allowlist of transient Sigstore Rekor failures.
- Add a new `errUtils.ErrSignatureRetryable` sentinel next to the existing `ErrDownloadRetryable`, and a `classifyCosignError` helper that joins the sentinel into cosign errors when the combined output matches a Rekor-flake marker.
- New `runCosignWithRetry` uses the same retry budget as the existing downloader (5 attempts, 1s → 10s exponential). Logs a `WARN` before each retry so CI logs surface the upstream-service context.
why
- `cosign verify-blob` sometimes fails not because of a real signature problem but because Sigstore's Rekor transparency-log API returns a short-window upstream error. The most common signature is `Error: searching log query: [POST /api/v1/log/entries/retrieve][400] searchLogQueryBadRequest {"code":400,"message":"verifying signature: ecdsa: Invalid IEEE_P1363 encoded bytes"}`. The same artifact verifies cleanly seconds later. Without retry, this turns a transient Sigstore outage into a hard `tool not found` install failure for every Atmos user pulling toolchain assets during the outage window. We hit this twice in 48h on the Windows mock jobs alone.
- The retry allowlist is intentionally narrow: `searchLogQueryBadRequest`, `Invalid IEEE_P1363 encoded bytes`, and Rekor's `/api/v1/log/entries/retrieve` endpoint paired with a 5xx status. Anything outside the allowlist (tampering, expired cert, identity mismatch, missing signature) surfaces immediately on the first attempt. Blanket-retrying signature verification would mask real tampering events and is the canonical anti-pattern; we do not do that here.
- Mirrors the existing pattern in `pkg/toolchain/installer/download.go` (sentinel + `retry.WithPredicate` + classifier) so the toolchain's resilience story stays consistent across download and verification.
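The sentinel-plus-classifier pattern with a bounded backoff can be sketched with the standard library alone. Everything here is an assumption made for illustration: `errRetryable`, `classify`, `runWithRetry`, and `fakeVerifier` are hypothetical names, only two of the allowlist markers are checked (the endpoint-plus-5xx rule is omitted), and the real code reportedly uses `retry.WithPredicate` instead of a hand-written loop.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// errRetryable marks failures that match the transient allowlist.
var errRetryable = errors.New("retryable signature verification failure")

// classify joins the sentinel into err only when the tool output matches a known transient marker.
func classify(output string, err error) error {
	if err == nil {
		return nil
	}
	if strings.Contains(output, "searchLogQueryBadRequest") ||
		strings.Contains(output, "Invalid IEEE_P1363 encoded bytes") {
		return errors.Join(errRetryable, err)
	}
	return err
}

// runWithRetry calls verify up to attempts times, doubling the delay (capped at
// maxDelay) between tries, and stops at once on success or a non-retryable error.
// It returns the number of attempts made and the final error.
func runWithRetry(attempts int, initial, maxDelay time.Duration, verify func() (string, error)) (int, error) {
	delay := initial
	var err error
	for n := 1; n <= attempts; n++ {
		out, runErr := verify()
		err = classify(out, runErr)
		if err == nil || !errors.Is(err, errRetryable) || n == attempts {
			return n, err
		}
		time.Sleep(delay)
		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}
	return attempts, err
}

// fakeVerifier fails the first `failures` calls with the given output, then succeeds.
func fakeVerifier(failures int, output string) func() (string, error) {
	calls := 0
	return func() (string, error) {
		calls++
		if calls <= failures {
			return output, errors.New("verification command failed")
		}
		return "Verified OK", nil
	}
}

func main() {
	n, err := runWithRetry(5, time.Millisecond, 10*time.Millisecond, fakeVerifier(2, "searchLogQueryBadRequest"))
	fmt.Println(n, err) // 3 <nil>

	n, err = runWithRetry(5, time.Millisecond, 10*time.Millisecond, fakeVerifier(2, "certificate identity mismatch"))
	fmt.Println(n, err != nil) // 1 true
}
```

Keeping classification separate from the loop means the allowlist can be unit-tested without sleeping.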
references
- Failing run that prompted this: https://github.com/cloudposse/atmos/actions/runs/26368136271/job/77615786736 (Windows `[mock-windows] demo-component-versions`, Rekor returned 400 on a valid OpenTofu 1.9.1 release signature).
- Same flake also hit `[mock-windows] demo-vendoring` in the same run.
- Companion CI permissions fix: #2499 (merged) and #2500 (open).
Summary by CodeRabbit
- Improvements
- Signature verification now automatically retries on transient Sigstore Rekor service issues with bounded exponential backoff for greater resilience.
- Error handling improved to distinguish retryable transient failures from permanent signature verification errors, reducing false failures.
- Tests
- Added unit tests covering classification of transient Rekor failures and retry behavior through the public verification path.
fix(terraform): skip non-terraform and deleted components in `atmos terraform plan/apply --affected` @thejrose1984 (#2484)
what
Filters the affected component list in atmos terraform plan/apply --affected so the command no longer runs against helmfile components, packer components, or components deleted in HEAD.
why
Reported in #2361. getAffectedComponents returns every affected component regardless of type — helmfile, packer, and BASE-only deletions are all included — and ExecuteTerraformAffected was iterating that full list and calling ExecuteTerraform on each, producing output like:
INFO Executing command="atmos terraform apply example-terraform -s example"
INFO Executing command="atmos terraform apply example-helmfile -s example"
INFO Executing command="atmos terraform apply example-packer -s example"
Documentation (website/docs/cli/commands/terraform/usage.mdx) describes --affected as executing the command "on all the directly affected components," with the implicit constraint that those components belong to the atmos terraform subcommand. Helmfile and Packer subcommands would orchestrate their own components, and deleted components have no on-disk module so terraform plan/apply against them either errors or no-ops.
how
- New package-private helper `filterTerraformAffected` keeps only items where `ComponentType == cfg.TerraformComponentType` and `!Deleted`. Called once in `ExecuteTerraformAffected` after `getAffectedComponents`, before `addDependentsToAffected` (which is expensive and shouldn't run for items we will drop).
- Defense-in-depth filter in `executeTerraformAffectedComponentInDepOrder`: when `--include-dependents` is set, any dependent with `ComponentType` explicitly set to a non-terraform value is skipped during the recursion.
- `atmos describe affected` is unchanged. It still reports the full affected set (terraform + helmfile + packer + deleted) as the canonical introspection view. The filter is scoped to the execution path of `atmos terraform <cmd> --affected`.
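The first filter above is a straightforward predicate over the affected list. A minimal sketch, assuming a simplified `affected` struct and the literal string `"terraform"` in place of `cfg.TerraformComponentType`; `filterTerraform` is a hypothetical name. It compacts in place, so the input slice's backing array is reused.

```go
package main

import "fmt"

// affected is a hypothetical, trimmed-down stand-in for an affected-component record.
type affected struct {
	Component     string
	ComponentType string
	Deleted       bool
}

// filterTerraform keeps only terraform components that still exist.
// It reuses the input's backing array, so the input must not be read afterwards.
func filterTerraform(items []affected) []affected {
	kept := items[:0]
	for _, item := range items {
		if item.ComponentType == "terraform" && !item.Deleted {
			kept = append(kept, item)
		}
	}
	return kept
}

func main() {
	items := []affected{
		{Component: "example-terraform", ComponentType: "terraform"},
		{Component: "example-helmfile", ComponentType: "helmfile"},
		{Component: "example-packer", ComponentType: "packer"},
		{Component: "removed-vpc", ComponentType: "terraform", Deleted: true},
	}
	for _, item := range filterTerraform(items) {
		fmt.Println(item.Component) // example-terraform
	}
}
```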
tests
- New file internal/exec/terraform_affected_filter_test.go: 8 portable table cases covering the filter (no gomonkey, runs on every CI matrix entry). Includes a case mirroring the exact #2361 reproducer fixture. 100% line coverage of the new helper.
- Three new cases added to the `TestExecuteTerraformAffectedComponentInDepOrder` table in internal/exec/terraform_utils_test.go: helmfile dependent skipped, packer dependent skipped, mixed-type dependents (only the terraform one runs).
compatibility
Bug fix only. The previous behavior produced incorrect commands that would fail mid-execution (terraform-apply against a helmfile component errors out). No public Go API or CLI surface changes. The most visible user-facing shift is an exit-code flip from non-zero to zero when a changeset contains only non-terraform or deleted components — which now correctly reports "No components affected" instead of erroring partway through.
Closes #2361.
Summary by CodeRabbit
- Bug Fixes
  - Terraform execution now excludes non-Terraform dependents (e.g., Helmfile, Packer), skips deleted components, and treats empty component types as Terraform to preserve compatibility; Terraform-only ordering is preserved to avoid unnecessary processing.
- Tests
  - Added tests for filtering behavior, in-place compaction semantics, non-Terraform exclusion, and a regression ensuring only Terraform entries are executed.
perf(exec/merge/utils): optimize describe affected for large-stack workloads (~50% local wall-clock, projected ~10× on 2-core CI) @aknysh (#2496)
what
Performance optimization arc for atmos describe affected on large-stack workloads.
Targets the inheritance + merge + YAML-parse hot paths that have grown since the
October 2025 perf instrumentation arc (PRs #1576/#1611/#1622/#1639) added the
auth, profiles, locals, and per-file position-tracking subsystems.
Twelve shipped phases (with two attempted-and-reverted optimizations documented
in-code so they aren't re-tried without addressing the underlying contract
violations):
- Phase 2: `cacheBaseComponentConfig` switched from `RWMutex` + map to `sync.Map`; deep-copy moved outside the critical section. Eliminates write-lock contention that serialized every cache write across goroutines and padded apparent CPU time with lock-wait.
- Phase 3: `WalkAndDeferYAMLFunctions` short-circuits when the subtree contains no Atmos YAML functions (`!template`, `!terraform.*`, `!store*`, `!exec`, `!env`). Returns the input map as-is instead of allocating a deep copy at every recursion level.
- Phase 4: `extractLocalsFromRawYAML` cached via `sync.Map` keyed by `filePath + FNV-1a(yamlContent)`. The content-hash component prevents test pollution when the same logical file path is reused with different content. Also fixes a pre-existing data race in `extractAndAddLocalsToContext` (shallow-clones the input context map before file-scoped `delete + assign`).
- Phase 5: `MergeWithDeferred` 0-input fast path: when every layer is empty, return an empty map immediately without walking or merging. The 1-input shortcut originally shipped alongside was REVERTED on 2026-05-24 after CI surfaced a regression; see "What was tried and reverted" below.
- Phase 6: `parsedYAMLCache` switched to `sync.Map`; deep-copy of `yaml.Node` + `PositionMap` moved outside the critical section. Same lock contention pattern as Phase 2 applied to the YAML parser cache.
- Phase 7: `processCustomTags` split into outer + inner functions: the `hasCustomTags` pre-check runs once at the entry point instead of re-walking the subtree at every recursive call (O(N×depth) → O(N)).
- Phase 8: new `decodedYAMLCache` stores the post-Decode + post-Intern result of `UnmarshalYAMLFromFileWithPositions[map[string]any]`. Skips `yaml.Node.Decode` + `InternStringsInMap` on every repeat call for the same (file, content hash) pair.
- Phase 10: `processYAMLNode` split into outer + inner (Phase 7 pattern). Removes per-recursion `perf.Track` overhead from the recursive YAML walker used by yq evaluation. Same pattern was tried for `WalkAndDeferYAMLFunctions` and reverted: the inner-only walker had to allocate unconditionally on every recursion, regressing function-sparse subtrees more than the `perf.Track` savings recovered.
- Phase 11: `processTerraformRemoteStateBackend` extracts the backend-type-specific map from each input first (via the new `extractBackendTypeMap` helper), then merges just those two scoped maps. Avoids deep-copying unrelated backend-type entries (s3/gcs/azurerm/etc.) just to extract one key from the merged result.
- Phase 12 + 13: `deepCopyBaseComponentConfigMaps` guards every `m.DeepCopyMap` call with `len(src.Field) > 0`. Skips function-call and allocation overhead for the empty-field case that dominates real workloads (most components leave several of the 10 fields empty).
- Coverage: new tests close the gaps the arc introduced. Public `Clear*Cache` wrappers, `extractBackendTypeMap` type-mismatch path, and the Phase 12/13 empty-field contract are all now covered.
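The caching shape shared by Phases 2, 4, 6, and 8 (a `sync.Map` keyed by path plus content hash, with copies made outside any lock) might look roughly like this. It is a sketch, not the Atmos implementation: `cacheKey`, `cloneMap`, and `getOrCompute` are hypothetical names, the 64-bit FNV-1a variant is an assumption, and a flat `map[string]string` stands in for the real parsed structures.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// cache maps "path:contentHash" to a privately owned result map.
var cache sync.Map

// cacheKey combines the file path with an FNV-1a hash of the content, so the
// same path with different content never collides.
func cacheKey(path, content string) string {
	h := fnv.New64a()
	h.Write([]byte(content))
	return fmt.Sprintf("%s:%x", path, h.Sum64())
}

// cloneMap returns a copy so callers never share the cached map.
func cloneMap(m map[string]string) map[string]string {
	out := make(map[string]string, len(m))
	for k, v := range m {
		out[k] = v
	}
	return out
}

// getOrCompute returns a caller-owned result, computing it at most once per
// (path, content) pair in the single-goroutine case. No lock is held while
// computing or copying.
func getOrCompute(path, content string, compute func(string) map[string]string) map[string]string {
	key := cacheKey(path, content)
	if cached, ok := cache.Load(key); ok {
		return cloneMap(cached.(map[string]string))
	}
	result := compute(content)
	cache.Store(key, cloneMap(result))
	return result
}

func main() {
	parse := func(content string) map[string]string {
		fmt.Println("parsing", content)
		return map[string]string{"raw": content}
	}
	getOrCompute("demo.yaml", "a: 1", parse) // parses
	getOrCompute("demo.yaml", "a: 1", parse) // cache hit
	getOrCompute("demo.yaml", "a: 2", parse) // same path, new content: parses again
}
```

Under concurrency two goroutines may compute the same entry once each before either stores it; that trades a little duplicate work for the absence of a write lock.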
Phase 1 was the auth credential-store fix that shipped separately as
PR #2471 (in v1.220.0-rc.1).
What was tried and reverted
- Phase 5 1-input shortcut. Initially returned the walked single input directly without going through `Merge`. CI surfaced `TestSpaceliftStackProcessor` losing 7 stacks (47→40) because `WalkAndDeferYAMLFunctions`'s Phase 3 short-circuit returned the input map as-is, and the 1-input shortcut handed that shared reference back to the caller. Downstream `mergeComponentConfigurations` mutated the result while building the per-component output, which corrupted the upstream cached `BaseComponentSettings`/`GlobalSettings` for sibling components. Fix: keep only the 0-input fast path; let 1-input fall through to the regular merge pipeline which deep-copies via `MergeWithOptions` → `DeepCopyMap`. Regression test (`TestMergeWithDeferred_TrivialInputShortCircuits/mutating the result does not mutate the input`) added to prevent re-attempts.
- Phase 9 asymmetric clone. Same class of failure (`fatal error: concurrent map iteration and map write` at scale): share `settings`/`vars`/`env` references from the locals cache, deep-copy only `locals`. `processTemplatesInSection` returns the input map as-is when the section has no `{{`, so the cached references ended up in the shared template context and got mutated by sibling goroutines. Reverted in `b11f3cd9b`; documented in-code.
Both failures share the same lesson: any optimization that hands a
shared reference back to a caller has to be matched against the
end-to-end mutation surface, not just the immediate caller. The Merge
contract — "result is a fresh, caller-mutable map" — must be upheld.
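The failure class behind both reverts can be shown with a toy pass-through. This sketch is not the Atmos merge code: `passThrough` and `freshCopy` are hypothetical helpers, and the copy is shallow because the example map is flat, whereas the real pipeline reportedly deep-copies.

```go
package main

import "fmt"

// passThrough models a short-circuit that returns its input unchanged.
// The caller now holds the same map as whoever owns the input.
func passThrough(in map[string]any) map[string]any {
	return in
}

// freshCopy models the contract "result is a fresh, caller-mutable map".
// A shallow copy suffices here only because the example values are scalars.
func freshCopy(in map[string]any) map[string]any {
	out := make(map[string]any, len(in))
	for k, v := range in {
		out[k] = v
	}
	return out
}

func main() {
	cached := map[string]any{"region": "us-east-1"}

	shared := passThrough(cached)
	shared["component"] = "vpc" // the caller decorates its "own" result
	fmt.Println(len(cached))    // 2: the cached entry was corrupted

	cached = map[string]any{"region": "us-east-1"}
	owned := freshCopy(cached)
	owned["component"] = "vpc"
	fmt.Println(len(cached)) // 1: the cache is untouched
}
```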
why
A real-world large stack configuration (≈836 YAML files, ~195 final stacks
across three namespaces, ~9.3k component instances) reported atmos describe affected taking around 11 minutes in CI on a 2-core runner. Local
reproduction with --identity=false and fake AWS credentials took ~4
minutes and showed two clear cost centers in the heatmap:
- The credential store was being created per-component even with `--identity=false` (≈3.5 min of cumulative CPU). Fixed in PR #2471.
- The component-inheritance + merge + YAML-parse pipeline (≈6 min of cumulative CPU) was bottlenecked by lock contention on shared caches, redundant deep-copies on every cache read/write, and per-call work that could be cached or skipped for the common case.
This PR addresses the second bucket. Confirmed impact on the same workload
(mean of 3 local runs against current main, post-Phase-5-revert):
| Function | Pre-Phase-2 | Post-Phase-13 | Reduction |
|---|---|---|---|
| `cacheBaseComponentConfig` | 5m50s | ~990ms | −99.7% |
| `mergeComponentConfigurations` | 2m22s | ~95s | −33% |
| `MergeWithDeferred` | 1m35s | ~51s | −46% |
| `WalkAndDeferYAMLFunctions` | 1m26s | ~11s | −87% |
| `extractLocalsFromRawYAML` | 13s | ~6s | −54% |
| `UnmarshalYAMLFromFileWithPositions` | 18.7s | ~3.2s | −83% |
| `processCustomTags` | 31.5s | ~7.5s | −76% |
| `getCachedBaseComponentConfig` | 6.5s | ~360ms | −94% |
(mergeComponentConfigurations, MergeWithDeferred, and Merge
numbers reflect the post-Phase-5-revert state — the 1-input shortcut
that was originally counted toward Phase 5's headline numbers has been
removed for correctness. The remaining wins are still substantial.)
Local wall-clock on the same workload: 4.1s → ~2.2s (−47%) on a
many-core Mac. The wall-clock floor on Mac is set by stack-level
parallelism that already saturates; the cumulative CPU savings (several
minutes summed across all hot functions) translate to materially more
wall-clock improvement on 2-4 core CI runners where lock-wait padding,
allocation pressure, and serialized work cannot be hidden behind cores.
Projection for a 2-core CI runner starting from the v1.219.0 baseline:
~11 minutes → ~60-105 seconds end-to-end (combining PR #2471 + this
PR's shipped phases, including the ~15s GHA wall-clock cost of the
Phase 5 1-input revert). Awaiting end-to-end CI validation on the
reference workload.
Each phase is independently revertible — they live in separate
commits with self-contained tests. The two reverted optimizations
(Phase 5's 1-input shortcut, Phase 9's asymmetric clone) have their
failure modes documented in-code and in
docs/fixes/2026-05-23-describe-affected-component-inheritance-perf.md
so future passes don't re-attempt the same approach without addressing
the underlying contract violations.
references
- Investigation doc with per-phase root cause, metric tables, and decision lessons: `docs/fixes/2026-05-23-describe-affected-component-inheritance-perf.md`
- Predecessor work that built the perf-instrumentation infrastructure this PR builds on: #1576 (heatmap visualization), #1611 (self-time vs total-time), #1622 (Docker perf fix + CPU Time / Parallelism), #1639 (5.2× faster execution + 92% memory reduction).
- Phase 1 (`--identity=false` gate) shipped separately in PR #2471 (v1.220.0-rc.1).
Summary by CodeRabbit
- Documentation
  - Added a detailed troubleshooting/performance guide for diagnosing slow "describe affected" runs with reproducible steps and optimization plan.
- Performance
  - Significant speedups via new caching, contention reduction, and short-circuit fast-paths for common/empty cases.
- Reliability
  - Improved cache correctness, mutation isolation, and error propagation to avoid races and unexpected panics.
- Tests / Chores
  - Expanded test coverage and added public cache-clear helpers for reliable isolation and regression verification.
fix(terraform): wire `--all` to `ExecuteTerraformAll` for dependency-ordered execution @thejrose1984 (#2486)
What
Routes atmos terraform plan --all and apply --all through ExecuteTerraformAll so components actually execute in dependency (topological) order — as originally documented in the PR #1516 changelog and the DAG concurrency PRD.
Until this change, the dispatcher in cmd/terraform/utils.go routed all multi-component flags (--all, --components, --query) through ExecuteTerraformQuery, which walks components via Go map iteration — randomized order, with settings.depends_on ignored entirely. ExecuteTerraformAll, the function that builds the dependency graph and runs TopologicalSort, was reachable only from unit tests.
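To make the difference between map iteration and dependency order concrete, here is a minimal sketch of Kahn's algorithm with a sorted ready list as the tie-break. It is not the code in `ExecuteTerraformAll`; the function name `topoSort`, the map-based input shape, and the alphabetical tie-break are assumptions chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// topoSort returns component names ordered so that every component appears
// after all of its prerequisites. deps maps a component to the components it
// depends on. Hypothetical helper, not the Atmos implementation.
func topoSort(deps map[string][]string) ([]string, error) {
	inDegree := make(map[string]int)        // unmet prerequisites per node
	dependents := make(map[string][]string) // prerequisite -> nodes waiting on it
	for node, prereqs := range deps {
		if _, ok := inDegree[node]; !ok {
			inDegree[node] = 0
		}
		for _, p := range prereqs {
			if _, ok := inDegree[p]; !ok {
				inDegree[p] = 0
			}
			inDegree[node]++
			dependents[p] = append(dependents[p], node)
		}
	}

	var ready []string
	for node, d := range inDegree {
		if d == 0 {
			ready = append(ready, node)
		}
	}

	var order []string
	for len(ready) > 0 {
		// Sorting the ready list is the (assumed) tie-break that makes the
		// result deterministic despite randomized map iteration.
		sort.Strings(ready)
		next := ready[0]
		ready = ready[1:]
		order = append(order, next)
		for _, d := range dependents[next] {
			inDegree[d]--
			if inDegree[d] == 0 {
				ready = append(ready, d)
			}
		}
	}

	// Nodes left with unmet prerequisites can only be part of a cycle.
	if len(order) != len(inDegree) {
		return nil, errors.New("dependency cycle detected")
	}
	return order, nil
}

func main() {
	order, err := topoSort(map[string][]string{
		"eks/cluster":   {"vpc"},
		"eks/karpenter": {"eks/cluster"},
		"vpc":           nil,
	})
	fmt.Println(order, err)
}
```

Reversing the returned slice gives the order a `destroy`-style walk would need. A real implementation would also report which components form the cycle rather than a bare error.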
Why
Fixes #2485.
Users who configured settings.depends_on and ran atmos terraform apply --all were relying on a feature that didn't exist at the dispatch layer. Failures looked like Terraform errors (a component applied before its prereqs), not a missing-feature bug. The DAG concurrency PRD was authored on the assumption that this path already worked.
Changes
Dispatch
- `cmd/terraform/utils.go`: `info.All` now routes to `e.ExecuteTerraformAll(&info)`. `--components` / `--query` / bare `-s stack` continue to route to `ExecuteTerraformQuery` (no change).
ExecuteTerraformAll parity with ExecuteTerraformQuery
- `internal/exec/terraform_all.go`: ports `createQueryAuthManager` so YAML functions (e.g. `!terraform.state`) resolve credentials under `--all`. Mirrors the #2081 fix that already exists for `--query`.
- Drops the `info.Stack == ""` validation. The terraform-apply docs explicitly state `--all` without `-s` processes every stack, and that's the behavior users see today via `ExecuteTerraformQuery`. Keeping this PR non-breaking required matching that contract.
- Removes the now-unused `ErrStackRequiredWithAllFlag` from `errors/errors.go`.
Filter scope
- `applyFiltersToGraph` previously set `IncludeDependencies: true`, which would pull cross-stack prereqs into `--all -s <stack>`. Switched to `false` so the scope of `--all -s <stack>` is identical to today's behavior: components in the requested stack only, but now in topological order. A future opt-in flag can re-enable cross-stack execution.
Dry-run UX
- `executeNodeCommand` now emits `Would <subcmd> <component> in <stack> (dry run)` via `ui.Successf`, matching `processTerraformComponent`. Both multi-component paths produce the same user-facing dry-run output. (This also affects the `--affected` path, which had no integration tests asserting dry-run output; verified manually.)
Tests
- New integration test in `tests/test-cases/terraform-multi-component-flags.yaml` asserts the partial topological order (`vpc` before `eks/cluster`, `eks/karpenter` before `eks/karpenter-node-pool`, `eks/istio/base` before `eks/istio/istiod` before `eks/istio/test-app`) using the existing `terraform-apply-affected` fixture and regex with `(?s)`. The exact total order is an implementation detail of Kahn's-algorithm tie-breaking; this test only asserts the correctness invariant.
- `internal/exec/terraform_all_test.go` and `terraform_all_simple_test.go`: removed the "no stack specified" cases (the validation is gone) and updated `TestApplyFiltersToGraph_*` to match the new scope contract.
Compatibility matrix
| Scenario | Before | After |
|---|---|---|
| `apply --all -s dev`, `depends_on` defined | Random order (bug) | Topological order |
| `apply --all -s dev`, no `depends_on` | Random order | Deterministic order |
| `apply --all` (no stack) | All stacks, random order | All stacks, topological order |
| `apply --all -s dev` with cross-stack `depends_on` | Cross-stack components ignored | In-stack topological order; cross-stack still out of scope (opt-in TBD) |
| `destroy --all -s dev` | Random order | Reverse topological order |
| `--all` with circular `depends_on` | Silently random | Hard error with cycle path |
| `apply --components vpc -s dev` | Unchanged | Unchanged |
| `apply --query '...' -s dev` | Unchanged | Unchanged |
| `apply -s dev` (no component, no flag) | Unchanged | Unchanged |
| `--all` with `!terraform.state` YAML function | Worked (via #2081) | Works (auth manager ported) |
| `--all` with per-component CI hooks | Worked (via #2475/#2397) | Works (hook flows through `executeNodeCommand`) |
Known follow-ups (not in this PR)
These are tracked in #2485 and intentionally out of scope for the dispatch fix:
- Parser: `DependencyParser` only reads the deprecated `settings.depends_on`. Should also read `dependencies.components` like `describe_affected_components.go` already does.
- Parser: only `component` + `stack` keys are recognized. `namespace` / `tenant` / `environment` / `stage` are documented but ignored.
- Errors: missing-target dependency errors are silently logged at `Warn` in `parseDependencyArray`. Should be surfaced.
- Concurrency: still sequential. The DAG concurrency PRD describes the planned ready-queue scheduler.
- Cross-stack scope opt-in: a `--include-cross-stack-dependencies` flag (or `settings.terraform.dependencies.cross_stack`) to re-enable the original `IncludeDependencies: true` behavior.
Test plan
- `go build ./...`
- `go vet ./internal/exec/... ./cmd/terraform/... ./errors/...`
- `go test ./internal/exec/ ./pkg/dependency/... ./cmd/terraform/... ./errors/... ./pkg/ui/... -short`: all green
- `go test ./tests -run 'TestCLICommands/terraform_plan_--all|TestCLICommands/terraform_plan_--query|TestCLICommands/terraform_plan_--components' -count=1`: all green
- New ordering test: `go test ./tests -run 'TestCLICommands/terraform_plan_--all_executes_in_dependency_order' -count=1`: green; output confirms `vpc → eks/cluster → eks/external-dns → eks/istio/base → eks/karpenter → eks/istio/istiod → eks/karpenter-node-pool → eks/istio/test-app`
- CI lint (`make lint`)
- Manual verification with `dependencies.components` (new format) once parser is extended in a follow-up
- Cross-stack dependency behavior with the deferred opt-in flag
References
- Closes #2485
- Original feature request: #1242 (closed as COMPLETED in #1516, but the routing was never wired up)
- Implementation PR: #1516
- DAG concurrency PRD: `docs/prd/dag-concurrent-execution.md`
Summary by CodeRabbit
- New Features
  - `terraform plan --all` and `apply --all` run components in dependency (topological) order; `destroy --all` runs in reverse. `--all` may be used without a stack; `--all -s <stack>` scopes to that stack without pulling cross‑stack prerequisites. Per‑component hooks and auth-aware YAML resolution are active during `--all` runs; dry‑run shows clear per‑component success messages.
- Tests
  - Expanded tests for `--all` ordering, scoping/filtering, auth wiring, dry‑run flows, and per‑component hook wiring.
- Documentation
  - New blog post describing `--all` behavior, caveats, and follow-ups.
fix(vendor): recover OCI pulls on auth rejection and surface rich errors @osterman (#2487)
what
- Vendoring an OCI image (e.g. `oci://ghcr.io/...`) now auto-recovers when configured credentials are rejected (401 / 403 / `DENIED`) by retrying once with anonymous authentication.
- On successful recovery, emits `WARN OCI auth rejected, succeeded with anonymous fallback` and proceeds.
- Terminal pull failures now surface a rich error built with `errUtils.Build(errUtils.ErrPullImage)`, which preserves the original cause and attaches structured context (`image`, `registry`, `auth_attempted`, `status`) plus three self-contained remediation hints (Actions `packages: read`, `ATMOS_GITHUB_USERNAME` override, stale `~/.docker/config.json`).
- Bumps the chosen-auth log line from `Debug` → `Info` and the "GHCR token without username" branch from `Debug` → `Warn` so CI logs reveal misconfiguration without `--debug`.
- Non-auth errors (DNS, TLS, deadlines, 5xx) bypass retry because they need different remediation.
why
- Public test images on `ghcr.io` (e.g. `ghcr.io/cloudposse/atmos/tests/fixtures/components/terraform/mock:v0`) failed hard on Windows CI runners whose `GITHUB_TOKEN` lacked `packages: read` scope, because `pullImage` used the rejecting credentials unconditionally instead of falling back to anonymous.
- The previous error surface was a bare `DENIED: denied` with no auth source, no HTTP status, and no actionable hint. The vendor reporter collapsed it into an opaque tally, making the root cause untraceable.
- Part A of this work, granting `packages: read` in `.github/workflows/test.yml`, already landed. This is Part B: the code change so future users, different workflows, and private registries get a clean diagnosis and an automatic recovery for public images.
references
- Builds on PR #1647 (3-tier auth precedence).
- Mirrors the error-builder idiom from `pkg/provisioner/source/source.go:88-97`.
- Reuses the existing `errUtils.ErrPullImage` sentinel; no new sentinel introduced.
Summary by CodeRabbit
- Bug Fixes
  - Automatic fallback to anonymous OCI image pulls when authenticated requests are rejected (401/403 or “DENIED”), preserving original error causes
  - Richer diagnostic errors with contextual hints for troubleshooting image-pull failures
  - Warn when a GHCR token is present but no GitHub username is configured
- Tests
  - Expanded test coverage for image-pull auth fallback and varied failure scenarios
fix(terraform): preserve explicit identity and auth context for local runs @shirkevich (#2348)
Problem
Local commands could still fall back to the default CI identity (`terraform-ci` / `gcp-wif`) even when the user explicitly selected a local identity such as `terraform` / `gcp-adc`.
This showed up in several related paths:
- `atmos terraform apply ... --identity terraform` and `-i terraform` could lose the explicit identity during Terraform argument reconstruction and then authenticate with the default identity.
- Local Terraform commands could run the CI hook path first, producing noisy `terraform-ci` WIF authentication errors even when the command later succeeded with the intended local identity.
- `atmos terraform output ... --format json -i terraform` bypassed the normal Terraform auth setup and called formatted output resolution without the active `AuthManager` / `AuthContext`.
- `!terraform.state` evaluations could receive stack info without the active auth manager, so nested state reads could still resolve backend credentials from the default identity instead of the explicit command identity.
- `atmos ansible playbook ... --identity terraform` did not propagate the selected identity into stack processing before YAML functions ran, so Ansible-driven components using `!terraform.state` could still try the default Terraform CI WIF identity.
Fixed Issues
- Preserves explicit Terraform `--identity` / `-i` values through both Cobra/flag-registry parsing and the legacy raw-arg parsing path.
- Normalizes Terraform identity edge cases consistently, including `--identity=`, `-i=`, and `--identity=false`.
- Keeps CLI-provided identity values ahead of default/profile-selected identities when Cobra reports the optional-value sentinel.
- Skips CI hook execution for normal local non-CI runs unless CI mode is explicitly forced, removing local `terraform-ci` preflight auth noise.
- Initializes Terraform auth for formatted `terraform output` and passes both `AuthContext` and `AuthManager` into output resolution.
- Propagates the active `AuthManager` from `ProcessComponentConfig` into `ConfigAndStacksInfo`, allowing `!terraform.state` to inherit the selected identity context.
- Adds shared component auth setup for non-Terraform command paths that need authenticated YAML functions.
- Adds Ansible-specific identity handling that supports long-form `--identity` / `ATMOS_IDENTITY` while deliberately leaving Ansible's `-i` shorthand reserved for inventory.
- Runs Ansible stack processing with the selected auth manager before YAML function evaluation, so Ansible playbooks using `!terraform.state` can use the requested identity.
Verification
- Rebuilt the binary with `rtk proxy go build -o build/atmos .`
- Focused regression suite passed: `rtk go test ./cmd/ansible ./pkg/component/ansible ./pkg/flags ./internal/exec ./pkg/hooks -run 'TestAnsible|TestBuildConfigAndStacksInfo|TestGetLongIdentityFromArgs|TestProcessStacksWithAuth|TestParseGlobalFlags|TestSetupTerraformAuth|TestProcessComponentConfig_PropagatesAuthManager|TestProcessComponentConfig_AuthManagerGuardBranches|TestProcessCommandLineArgs_EmptyIdentityFlagIsExplicitSelect|TestProcessCommandLineArgs_TerraformIdentityFlag_Issue2392|TestProcessArgsAndFlags_IdentityFlag|TestRunCIHooks_LocalRunSkipsExperimentalGate|TestRunCIHooks_ForwardsErrorAndExitCode|TestRunCIHooks_NilAtmosConfig|TestRunCIHooks_ExperimentalDisableReturnsError'`
- Downstream local Terraform apply path was verified with explicit `--identity terraform`.
- Downstream formatted Terraform output path was verified with explicit `-i terraform --format json`.
- Downstream Ansible dry-run path was verified to select `identity "terraform"` / `provider=gcp-adc`; remaining OAuth access was environment/network dependent, not a fallback to `terraform-ci`.
Summary by CodeRabbit
- New Features
  - Added `-i` shorthand for `--identity` (supports `-i value`, `-i=value`, and explicit-empty `-i=` to trigger interactive selection).
- Improvements
  - More robust identity resolution across commands and env vars; explicit-empty identity is preserved.
  - Auth manager is now propagated into output and component command flows for consistent auth behavior.
  - Hook discovery avoids rendering templates; CI hooks skip when no CI provider detected.
- Bug Fixes
  - Fixed parsing so a following native flag (e.g., `-lock=false`) is not mis-consumed as an identity value.
- Tests
  - Expanded test coverage for identity parsing, auth propagation, hooks, and CI registry behavior.
fix(ci): fire CI hooks per-component in deploy --all mode @thejrose1984 (#2478)
what
- Fixes `atmos terraform deploy --all` (and `--query`, `--components`, stack-without-component) producing only a single CI summary entry for the last component instead of one entry per component
- Adds `runCIHooksForDeployComponent` as the per-component hook for the `deploy` subcommand so `$GITHUB_STEP_SUMMARY` receives one entry per component with the correct component/stack context
- Wires `wasMultiComponentExecution` reset, error-defer guard, and `PostRunE` guard in `deploy.go`, the same three-site pattern applied to `plan` in #2430 and `apply` in #2475
why
In multi-component mode, terraformRunWithOptions routes to ExecuteTerraformQuery and sets wasMultiComponentExecution = true, but deploy.go had no guard on its PostRunE or error-path defer. This caused:
- `PostRunE` to fire once after all components completed, calling `RunCIHooks` with an empty output buffer and the last component's `info.Component` / `info.Stack`
- The error-path defer to double-fire when `--all` failed mid-walk (per-component hook already ran for the failed component)
- For stacks with N components, only 1 summary entry appeared instead of N
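The guard pattern described above (reset the flag at the start of each run, fire hooks per component in multi-component mode, and suppress the post-run hook when the flag is set) can be sketched like this. The `runner` type and its methods are hypothetical and only model the control flow; they are not the code in `deploy.go`.

```go
package main

import "fmt"

// runner models the per-run state. All names here are illustrative.
type runner struct {
	wasMultiComponent bool
	summary           []string // one entry per fired CI hook
}

// hook records a CI summary entry for a single component.
func (r *runner) hook(component string) {
	r.summary = append(r.summary, component)
}

// run executes the given components. In multi-component mode the hook fires
// once per component and the flag tells postRun to stay quiet.
func (r *runner) run(components []string) {
	r.wasMultiComponent = false // reset before any early exit
	if len(components) > 1 {
		r.wasMultiComponent = true
		for _, c := range components {
			r.hook(c)
		}
	}
	// Single-component runs leave the hook to postRun.
}

// postRun is the guarded post-run step: without the guard it would add one
// extra, misattributed entry after a multi-component run.
func (r *runner) postRun(lastComponent string) {
	if r.wasMultiComponent {
		return
	}
	r.hook(lastComponent)
}

func main() {
	r := &runner{}
	r.run([]string{"vpc", "eks/cluster", "eks/karpenter"})
	r.postRun("eks/karpenter")
	fmt.Println(len(r.summary), r.summary)
}
```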
references
Closes #2476
Related: #2397 (plan fix), #2475 (apply fix)
Summary by CodeRabbit
- Bug Fixes
  - Prevent duplicate error-hook execution during multi-component deployments.
  - Ensure per-run state is reset before early exits so deferred error hooks and post-run logic behave consistently.
  - Run CI hooks per-component for deploys to preserve component output and forward correct exit codes.
- Tests
  - Added tests for per-component CI hook behavior, suppression of post-run logic in multi-component deploys, defer-guard behavior, and exit-code forwarding.
[codex] Fix verifier auto-install and cosign bundles @osterman (#2481)
what
- Resolve verifier auto-installs to concrete registry versions before bootstrapping, instead of falling back to literal `latest`.
- Add platform-aware installer helpers and regression coverage for Windows verifier asset URLs across cosign, slsa-verifier, gh, and minisign.
- Combine cosign `opts` with downloaded sidecars like `--bundle` so Trivy checksum signature verification works with Aqua metadata.
why
- Windows CI was failing because cosign release assets exist under `v...` tags and the bootstrap path could construct invalid release URLs.
- Trivy verification failed on macOS because Atmos dropped the sigstore bundle whenever cosign options were present, producing an incomplete `cosign verify-blob` command.
- The added tests cover the failing Windows URL rendering path and the Trivy-shaped checksum signature command.
references
- Fixes the verifier install failure introduced by package verification.
- Validated with `go test ./pkg/toolchain/installer ./pkg/toolchain/registry/aqua ./pkg/toolchain/verification`, `go test ./pkg/toolchain`, pre-commit hooks, and a live `go run . toolchain install aquasecurity/trivy@v0.70.0`.
Summary by CodeRabbit
- New Features
  - Signature verification now supports bundle sidecars.
  - Enhanced cross-platform asset resolution, including better Windows ARM and Rosetta2 handling and per-platform overrides.
  - Improved verifier bootstrap resolution with additional fallback behavior.
- Bug Fixes
  - Corrected Windows executable extension handling across target platforms.
- Tests
  - Added tests for Windows asset URL generation, verifier version resolution failures, and cosign bundle sidecar integration.
fix(stacks): honour component-level list_merge_strategy in settings @thejrose1984 (#2480)
what
- Fixes `settings.list_merge_strategy` set at the component level being silently ignored during stack processing
- Adds `effectiveAtmosConfig()` helper that scans the component's settings layers (GlobalSettings → BaseComponentSettings → ComponentSettings → ComponentOverridesSettings) before any merge and returns a shallow config copy with the winning strategy
- `mergeComponentConfigurations` now uses this resolved config for all `m.Merge` / `m.MergeWithDeferred` / `m.ApplyDeferredMerges` calls, covering vars, settings, env, auth, providers, hooks, generate, dependencies, locals, source, and provision
why
mergeComponentConfigurations passed the global atmosConfig to every merge call. pkg/merge reads atmosConfig.Settings.ListMergeStrategy on every call. The component's settings.list_merge_strategy lived inside the data being merged, not the config doing the merging — so it was always ignored. The value appeared correctly in atmos describe component output (giving false confidence), but the actual list merging behavior was always governed by the global atmos.yaml setting or ATMOS_SETTINGS_LIST_MERGE_STRATEGY env var.
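The idea of resolving the strategy from the data before merging, and handing the merge a copy of the config instead of mutating the global one, can be sketched as follows. This is not the `effectiveAtmosConfig()` implementation: the `Config` and `Settings` types, the `effectiveConfig` name, and the "last non-empty layer wins" rule are simplifying assumptions, and validation of strategy values is left out.

```go
package main

import "fmt"

// Settings and Config are hypothetical, trimmed-down stand-ins for the
// global configuration that the merge code reads its strategy from.
type Settings struct {
	ListMergeStrategy string
}

type Config struct {
	Settings Settings
	BasePath string
}

// effectiveConfig scans settings layers from lowest to highest precedence and
// returns a config carrying the winning list_merge_strategy. The global config
// is never mutated: an override yields a shallow copy, and no override returns
// the original pointer unchanged.
func effectiveConfig(global *Config, layers ...map[string]any) *Config {
	strategy := ""
	for _, layer := range layers {
		// Reading from a nil map is safe; empty strings are not overrides.
		if s, ok := layer["list_merge_strategy"].(string); ok && s != "" {
			strategy = s
		}
	}
	if strategy == "" || strategy == global.Settings.ListMergeStrategy {
		return global
	}
	resolved := *global // shallow copy
	resolved.Settings.ListMergeStrategy = strategy
	return &resolved
}

func main() {
	global := &Config{Settings: Settings{ListMergeStrategy: "replace"}}
	componentSettings := map[string]any{"list_merge_strategy": "append"}

	cfg := effectiveConfig(global, nil, nil, componentSettings, nil)
	fmt.Println(cfg.Settings.ListMergeStrategy, global.Settings.ListMergeStrategy)
}
```

Returning the original pointer when nothing changes keeps the common case allocation-free, while the copy in the override case prevents one component's strategy from leaking into the next.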
references
Closes #2396
Summary by CodeRabbit
- New Features
  - Component-level list merge strategy overrides are now computed and applied consistently across configuration assembly, honoring inheritance and isolating unchanged configs.
- Tests
  - Added integration tests and fixtures covering precedence, inheritance, copy isolation, prevention of empty overrides, and error handling for invalid strategy values.
fix(ci): fire CI hooks per-component in apply --all mode (#2475) @thejrose1984 (#2477)
Extends the per-component CI hook pattern from PR #2430 (plan --all) to apply --all, so each component produces its own CI summary entry instead of a single misattributed entry for the last component.
what
- Update apply --all to fire per-component CI hooks.
- Preserve per-component CI reporting semantics used by plan --all.
why
- Prevent CI summaries from being misattributed to only the last component.
- Ensure each component has its own hook and status entry in CI pipelines.
references
- Fixes behavior introduced in PR #2430 for plan --all.
- Addresses CI reporting bug for apply --all mode.
Summary by CodeRabbit
- Bug Fixes
  - Prevented duplicate CI hook firing during multi-component Terraform apply runs.
  - Reset per-run state at apply start so deferred and post-run hooks observe consistent values.
  - Suppressed post-run hook execution for multi-component apply to avoid double execution.
- Tests
  - Added tests covering CI hook handling and post-run suppression in multi-component apply scenarios.
fix(auth): honor --identity=false in describe affected and dependents @osterman (#2471)
what
- Honor `--identity=false` (and aliases `off` / `0` / `no`) in `atmos describe affected` so per-component auth resolution is skipped, not just the top-level AuthManager creation.
- Thread a new `DescribeAffectedCmdArgs.AuthDisabled` / `DescribeDependentsArgs.AuthDisabled` flag from the cmd layer through `executeDescribeAffectedWith{TargetRepoPath,TargetRefClone,TargetRefCheckout}`, `executeDescribeAffected`, `addDependentsToAffected`, and `ExecuteDescribeDependents`, routing inner stack resolution through `ExecuteDescribeStacksWithAuthDisabled`.
- Also wired through `terraform_affected.go`, `terraform_affected_graph.go`, `pkg/list/list_affected.go`, `pkg/ai/tools/atmos/describe_affected.go`, and `atlantis_generate_repo_config.go` so every caller of the public helpers passes the signal.
- Extracted `pkg/list/list_affected.go::executeAffectedLogic` into three per-mode helpers to stay under the 60-line function-length limit after the extra parameter.
why
- A user disabled all auth on a `describe affected --upload --process-functions=false --identity=false` run in `cloudposse/infra-live` CI (failing run) and still got `STS AssumeRoleWithWebIdentity 403 AccessDenied` for component `tfstate-plat`.
- The 1.219 fix (#2412) normalized `--identity=false` → `__DISABLED__` at the parser layer and made `CreateAuthManagerFromIdentity*` short-circuit to `nil`, but it only wired the disabled signal all the way down through `list instances`. In `describe affected`, the top-level AuthManager correctly became `nil`, but a `nil` AuthManager was indistinguishable from "no identity specified" downstream. With `--process-templates=true` (the default), `shouldResolvePerComponentAuth(processTemplates, processYamlFunctions)` still returned `true`, so the per-component resolver called `createComponentAuthManager`, which built a fresh AuthManager from `atmosConfig.Auth` and tried the assume-role call the user thought they had disabled.
- This change makes `--identity=false` actually mean "no auth, anywhere" in `describe affected`, matching the contract that already works for `list instances`.
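The underlying problem is that a `nil` manager cannot distinguish "auth disabled" from "no identity given", so the disabled state has to travel as an explicit value. A minimal sketch under stated assumptions: `normalizeIdentity` and `resolvePerComponentAuth` are hypothetical helpers, the case-insensitive matching is an assumption, and only the `__DISABLED__` sentinel and the `false` / `off` / `0` / `no` spellings come from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// disabledIdentity is the sentinel that the parser layer maps disabling
// values to, per the description above.
const disabledIdentity = "__DISABLED__"

// normalizeIdentity maps the disabling spellings to the sentinel and leaves
// every other value untouched. Case-insensitive matching and whitespace
// trimming are assumptions of this sketch.
func normalizeIdentity(value string) string {
	switch strings.ToLower(strings.TrimSpace(value)) {
	case "false", "off", "0", "no":
		return disabledIdentity
	}
	return value
}

// resolvePerComponentAuth decides whether a per-component auth manager
// should be built. The explicit authDisabled argument is what lets
// "disabled" win even when template processing would otherwise need auth.
func resolvePerComponentAuth(processTemplates, processYamlFunctions, authDisabled bool) bool {
	if authDisabled {
		return false
	}
	return processTemplates || processYamlFunctions
}

func main() {
	identity := normalizeIdentity("false")
	authDisabled := identity == disabledIdentity

	// The regression case from the description: templates on, functions off.
	fmt.Println(identity, resolvePerComponentAuth(true, false, authDisabled))
}
```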
Tests:
- `cmd/describe_affected_test.go::TestDescribeAffectedSetsAuthDisabled` covers `false` / `off` / `0` / `no` env-var spellings and asserts `AuthDisabled=true` and `AuthManager=nil`.
- `internal/exec/describe_affected_authdisabled_test.go` verifies `Execute()` forwards `AuthDisabled` to all three helper paths and to `addDependentsToAffected`.
- `internal/exec/describe_stacks_component_processor_auth_test.go` adds the exact `(processTemplates=true, processYamlFunctions=false, authDisabled=true)` regression case from the infra-live CI failure to the existing table.
references
- Follow-up to #2412 (`fix(auth): normalize --identity=false to disable authentication`) which only wired the disabled signal through `list instances`.
- Failing CI run that motivated this fix: https://github.com/cloudposse/infra-live/actions/runs/26247527093/job/77249654102?pr=1686
Summary by CodeRabbit
- Bug Fixes
  - `describe affected` and `describe dependents` now explicitly record when authentication is disabled (e.g., `--identity=false`, `off`, `0`, `no`), ensuring downstream discovery and dependency resolution skip per-component auth and avoid unintended auth attempts.
- Tests
  - Added unit and integration tests verifying the auth-disabled signal is propagated throughout affected-component discovery and dependent-resolution paths.
fix(auth): nil-check process-cached credentials for standalone `ambient` identity @aknysh (#2479)
what
- Fix a hard `SIGSEGV` triggered the second time a standalone generic `ambient` identity (`kind: ambient`) is authenticated in the same process. The first authentication succeeded and silently cached `nil` credentials in the process-level credential cache; the next lookup invoked `isCredentialValid("process-cache", nil)`, which dereferenced a nil `types.ICredentials` interface in `GetExpiration()`.
- The crash is latent in `atmos auth login` / `atmos auth whoami` (one authentication per process) but fatal in commands that resolve per-component auth many times, most notably `atmos describe affected --upload`, where `internal/exec/describe_stacks_component_processor.processComponentEntry` walks every component and calls `resolveComponentAuthManager → createComponentAuthManager → Authenticate → authenticateChain` per component.
Fix (two layers in pkg/auth/manager_chain.go)
- `authenticateChain`: don't cache nil credentials. `if creds != nil { processCredentialCache.Store(cacheKey, &processCachedCreds{ credentials: creds, }) }`
  The generic `ambient` kind is a cloud-agnostic passthrough whose `Authenticate()` returns `(nil, nil)` by design: credentials are resolved by the cloud SDK at subprocess runtime, not by Atmos. Storing nil violates the cache invariant that every entry is a usable credential object. Skipping costs nothing because ambient re-authentication is itself a no-op.
- `isCredentialValid`: short-circuit on nil input. `if cachedCreds == nil { log.Debug("Cached credentials are nil; treating as invalid", logKeyIdentity, identityName); return false, nil }`
  Defense-in-depth mirror of the same nil-check pattern adopted by `buildWhoamiInfo` in the predecessor 2026-04-17 ambient fix. If any future caller stores nil in the cache (or another path passes nil into the validator), the worst case is a redundant re-authentication, not a panic.
Either guard alone closes the panic; both together make the contract explicit at both the read and write sites.
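The two guards can be reproduced in isolation. The sketch below is a simplified model, not the code in `pkg/auth/manager_chain.go`: the `Credentials` interface, `credentialCache`, `staticCreds`, and `isCredentialUsable` are hypothetical names, and the validity rule (no expiration, or expiration in the future) is an assumption.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Credentials is a hypothetical, minimal credential interface.
type Credentials interface {
	GetExpiration() (*time.Time, error)
}

// staticCreds is a test double; a nil expires field means "never expires".
type staticCreds struct {
	expires *time.Time
}

func (s staticCreds) GetExpiration() (*time.Time, error) { return s.expires, nil }

// credentialCache wraps a sync.Map so the write-side guard lives in one place.
type credentialCache struct {
	entries sync.Map
}

// store is the write-side guard: nil credentials are never cached, which
// keeps the invariant that every cache entry is a usable credential object.
func (c *credentialCache) store(key string, creds Credentials) {
	if creds == nil {
		return
	}
	c.entries.Store(key, creds)
}

// isCredentialUsable is the read-side guard: a nil interface value is
// reported as invalid instead of being dereferenced.
func isCredentialUsable(creds Credentials) bool {
	if creds == nil {
		return false
	}
	expiration, err := creds.GetExpiration()
	if err != nil {
		return false
	}
	return expiration == nil || expiration.After(time.Now())
}

func main() {
	cache := &credentialCache{}
	cache.store("ambient", nil) // passthrough identity: nothing to cache

	_, cached := cache.entries.Load("ambient")
	fmt.Println(cached, isCredentialUsable(nil), isCredentialUsable(staticCreds{}))
}
```

With either guard removed the other still prevents the crash, which is the defense-in-depth point made above.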
Tests (new pkg/auth/manager_chain_ambient_test.go)
- `TestManager_isCredentialValid_NilCreds`: direct unit reproducer for the panic site. Before the fix this test panicked at `manager_chain.go:164` with `runtime error: invalid memory address or nil pointer dereference` while calling `cachedCreds.GetExpiration()`. Asserts `(false, nil)` on nil credentials.
- `TestManager_Authenticate_AmbientStandalone_RepeatedCallsNoPanic`: end-to-end via real `NewAuthManager` + two back-to-back `Authenticate()` calls on a standalone `kind: ambient` identity. Before the fix the second call panicked on the process-cache hit. Asserts both calls return cleanly with `WhoamiInfo.Credentials == nil`.
- `TestAuthenticateChain_AmbientStandalone_DoesNotCacheNil`: locks in the `authenticateChain`-side fix by direct cache inspection: the cache key must be absent after a standalone ambient authentication. Prevents a regression where caching nil silently returns.
All three new tests pass alongside the existing ambient regression tests (TestManager_buildWhoamiInfo_NilCredentials, TestManager_Authenticate_Ambient_Standalone) and the existing TestProcessCredentialCache_* suite.
Coverage
Both patched functions remain at 100% statement coverage; both branches of each new guard are exercised:
| Function | File:Line | Coverage |
|---|---|---|
| `authenticateChain` | `pkg/auth/manager_chain.go:51` | 100.0% |
| `isCredentialValid` | `pkg/auth/manager_chain.go:173` | 100.0% |
- `isCredentialValid` nil-true branch: `TestManager_isCredentialValid_NilCreds`. Nil-false branch: existing `TestProcessCredentialCache_*` tests.
- `authenticateChain` skip-cache branch: the two ambient tests above. Cache-write branch: existing `TestProcessCredentialCache_AvoidsDuplicateAuth` and friends.
Validation
- `go test ./pkg/auth/... -count=1`: all 28 subpackages green.
- `go vet ./pkg/auth/...`: clean.
- `go build ./...`: succeeds.
why
- The `(nil, nil)` return from the generic `ambient` kind is the documented contract (`docs/prd/ambient-identity.md`): credentials are resolved by the cloud SDK at subprocess runtime, not by Atmos. The cache code on the other side of that boundary failed to honor the contract, and a recent change that made per-component auth resolver failures fatal turned this latent panic into a hard command termination.
- The predecessor 2026-04-17 ambient fix (#2334) addressed the `buildWhoamiInfo` path but did not touch the process credential cache path in `authenticateChain` / `isCredentialValid`. That cache is dormant during single-authentication commands like `atmos auth login` / `atmos auth whoami` (where #2334's reproducer lived) but hot during multi-component flows like `atmos describe affected`, so the bug only surfaced after both #2334 shipped and per-component auth resolution became fatal. This PR extends the same nil-credential contract to the credential-cache layer.
- Without this fix, any consumer of a standalone `kind: ambient` identity who exercises `atmos describe affected --upload` (the canonical Atmos Pro flow) hits a hard crash on every run, with no workaround short of avoiding the identity kind entirely, which defeats the reason the kind exists.
references
- `docs/fixes/2026-05-21-ambient-identity-process-cache-panic.md`: fix write-up with root cause, code path, two-layer fix, test matrix, coverage notes, and the interaction with the predecessor #2334 fix that made this surface now.
- `docs/fixes/2026-04-17-ambient-identity-nil-credentials.md`: predecessor fix. Same `(nil, nil)` ambient contract, different layer (`buildWhoamiInfo`). This PR extends the same defense to the process credential cache.
- `docs/prd/ambient-identity.md`: feature PRD. Specifies that `ambient.Authenticate()` returns `(nil, nil)` and `ambient` identities do not store credentials.
- `pkg/auth/identities/ambient/ambient.go:66-71`: the intentional `return nil, nil` in `ambientIdentity.Authenticate()`.
- `pkg/auth/identities/ambient/ambient.go:144-162`: `AuthenticateStandaloneAmbient` documents and propagates the nil-credentials contract.
- `pkg/auth/identities/aws/ambient.go:Authenticate`: AWS-specific counterpart that returns real `*AWSCredentials` and therefore never triggers this bug.
- `internal/exec/describe_stacks_component_processor.go:150-174`: per-component auth resolver whose recent change made this latent panic fatal in the `atmos describe affected --upload` flow.
Summary by CodeRabbit
- Bug Fixes
  - Fixed a crash that occurred when authenticating a standalone ambient identity multiple times within the same process.
- Tests
  - Added regression tests to prevent this issue from reoccurring.
- Documentation
  - Added documentation describing the fix and root cause analysis.