docs: add PRD for cyclomatic complexity reduction @[copilot-swe-agent[bot]](https://github.com/apps/copilot-swe-agent) (#2230)
High cyclomatic complexity is the primary barrier to unit-test coverage in Atmos — every branch requires a dedicated test case, so complexity directly caps achievable coverage. This PRD standardises the reduction strategy already proven on ExecuteTerraform (160→26), ExecuteDescribeStacks (247→10), and processArgsAndFlags (67→15).
What's in the PRD (docs/prd/cyclomatic-complexity-reduction.md)
Refactoring techniques (with before/after examples)
- Extract coordinator + focused helpers into co-located `*_helpers.go` files
- Replace N-case `switch` statements with `map[string]func(...)` dispatch tables — cyclomatic complexity stays at 2 regardless of table size (see the sketch below)
- Guard-clause / early-return flattening to eliminate else-after-return nesting
- Options struct to replace boolean-flag parameter explosions
- Predicate extraction to name and isolate long `if` conditions
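As an illustration of the dispatch-table technique, here is a minimal Go sketch (not code from the PRD; the subcommand names and handlers are hypothetical):

```go
package cmd

import "fmt"

// handlers replaces an N-case switch: adding a command grows the table,
// not the cyclomatic complexity of dispatch().
var handlers = map[string]func(args []string) error{
	"plan":  func(args []string) error { fmt.Println("plan", args); return nil },
	"apply": func(args []string) error { fmt.Println("apply", args); return nil },
	"init":  func(args []string) error { fmt.Println("init", args); return nil },
}

// dispatch has a cyclomatic complexity of 2 (one branch for the missing key)
// no matter how many entries the table holds.
func dispatch(name string, args []string) error {
	h, ok := handlers[name]
	if !ok {
		return fmt.Errorf("unknown subcommand %q", name)
	}
	return h(args)
}
```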
Enforcement: new code vs. old code
- Phased threshold tightening over ~6 months: `cyclop` 15→12→10→8; `revive` `cyclomatic` promoted from warning → error
- `docs/complexity-budget.yml` exemption registry for functions that already exceed the new threshold
- CI script (`check-complexity-budget.sh`) that rejects any unregistered `//nolint:cyclop,revive,gocognit` annotation, preventing silent accumulation
Progress tracking
- Baseline snapshot via `gocyclo -over 9 -avg .` committed to `docs/complexity-baseline.txt`
- Nightly GitHub Actions workflow (`complexity-trend.yml`) posting per-threshold counts to `$GITHUB_STEP_SUMMARY`
- Coverage-correlation recipe: high complexity + low branch coverage = highest refactor priority
- Sprint checklist template for tracking per-sprint reductions
feat: add chunked uploads for large stack payloads @milldr (#2251)
What
Add automatic chunking for large stack/instance upload payloads to Atmos Pro. When payloads exceed the configurable threshold (default 4MB), the CLI splits the array into chunks and sends them sequentially with batch metadata (batch_id, batch_index, batch_total).
Why
Large infrastructure repos generate affected stack and instance payloads that exceed Vercel's ~4.5MB serverless body size limit, producing HTTP 413 Request Entity Too Large errors. The existing StripAffectedForUpload() reduces payloads by 70-75% but is insufficient for repos with hundreds of stacks.
Changes
- New `pkg/pro/chunked_upload.go` — generic chunking logic (`sendChunked`, `splitSlice`, `metadataOverhead`); a simplified sketch follows this list
- Updated `UploadAffectedStacks()` and `UploadInstances()` to use chunked upload
- Added `batch_id`, `batch_index`, `batch_total` fields to upload DTOs
- Switched from indented to compact JSON for upload payloads (~30% smaller)
- Added `max_payload_bytes` config to `settings.pro` in atmos.yaml
- Backward compatible: small payloads send without batch fields, old servers ignore unknown fields
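A minimal sketch of the chunking approach, reusing the PR's `splitSlice`/`sendChunked` names with illustrative bodies — the shipped code sizes chunks against the configured `max_payload_bytes` and metadata overhead, whereas this sketch chunks by element count for brevity:

```go
package pro

// BatchMeta carries the metadata attached to each chunked request so the
// server can reassemble the full upload.
type BatchMeta struct {
	BatchID    string `json:"batch_id"`
	BatchIndex int    `json:"batch_index"`
	BatchTotal int    `json:"batch_total"`
}

// splitSlice divides items into chunks of at most chunkSize elements.
func splitSlice[T any](items []T, chunkSize int) [][]T {
	if chunkSize <= 0 || len(items) <= chunkSize {
		return [][]T{items}
	}
	var chunks [][]T
	for start := 0; start < len(items); start += chunkSize {
		end := start + chunkSize
		if end > len(items) {
			end = len(items)
		}
		chunks = append(chunks, items[start:end])
	}
	return chunks
}

// sendChunked sends each chunk sequentially with its batch metadata via a
// caller-supplied send function (a hypothetical stand-in for the HTTP call).
func sendChunked[T any](items []T, chunkSize int, batchID string, send func([]T, BatchMeta) error) error {
	chunks := splitSlice(items, chunkSize)
	for i, chunk := range chunks {
		meta := BatchMeta{BatchID: batchID, BatchIndex: i, BatchTotal: len(chunks)}
		if err := send(chunk, meta); err != nil {
			return err
		}
	}
	return nil
}
```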
Ref
Companion server-side PR: cloudposse-corp/apps (feat/chunked-stack-uploads → staging)
Summary by CodeRabbit
- New Features
- Large stack and instance uploads now auto-split into multiple requests when exceeding a configurable threshold (default 4MB).
- Added configurable upload limit via atmos.yaml (settings.pro.max_payload_bytes).
- Chunked uploads include batch metadata (batch_id, batch_index, batch_total) for reliable reassembly; small payloads remain single-request and backward compatible.
- Upload payloads use compact JSON serialization to reduce size.
- Documentation
- New blog post and roadmap entry describing chunked upload behavior and configuration.
- Tests
- Added unit and integration tests validating chunking, batching, and error handling.
feat: introduce Gists as community-contributed recipes @osterman (#2238)
what
- Introduced Gists — a new content type for community-contributed recipes that demonstrate creative combinations of Atmos features (Custom Commands, Auth, Toolchain, etc.)
- Added a `GistDisclaimer` React component (purple/violet pill) that displays on all gist pages: "Gists are examples that demonstrate a concept, but are not actively maintained and may not work in your environment or current versions of Atmos without adaptations."
- Extended the file-browser plugin with a `disclaimer` option, enabling a second plugin instance at `/gists` alongside the existing `/examples`
- Added "Gists" to the top navbar between Examples and Community
- Created the first gist: MCP with AWS — a masterclass in combining Custom Commands + Auth + Toolchain to run 21 AWS MCP servers with automatic credential management (sourced from `cloudposse/infra-live` PR #1662)
- Added a blog post announcing the Gists feature
- Added a `gist-creator` Claude agent for standardizing future gist creation
why
- Community members share creative Atmos patterns that don't fit the maintained examples model — they need a home that sets the right expectations
- The MCP with AWS recipe demonstrates the composability of Atmos features (the key insight: `atmos auth exec` wraps MCP server processes with authenticated AWS credentials)
- Having a standardized gist structure and agent makes it easy to add more recipes over time
references
- Source material for first gist: https://github.com/cloudposse/infra-live/pull/1662
Summary by CodeRabbit
- New Features
- Introduced Gists — a community-contributed recipe space for Atmos.
- Added an AWS MCP gist with install/start/test commands and many preconfigured AWS services and startup presets.
- Added toolchain alias and minimal Atmos config to enable gists.
- Exposed Gists in the site file browser at /gists with a configurable disclaimer and navbar link.
- Added Mermaid diagram support and a reusable Gist disclaimer UI component and styles.
- Documentation
- Published blog post introducing Gists and contribution guidelines.
- Added gist README templates, registration guidance, required README structure, and a verification checklist.
feat: add ambient credential support for IRSA, IMDS, and ECS task roles @osterman (#2254)
what
- Adds two new auth identity kinds: `ambient` (cloud-agnostic passthrough) and `aws/ambient` (AWS SDK default credential chain)
- `ambient` is a pure do-nothing passthrough that preserves the environment unchanged
- `aws/ambient` resolves credentials via the default AWS SDK chain (env vars → shared config → IRSA → IMDS → ECS task role) and supports chaining with `aws/assume-role`; a sketch of default-chain resolution follows this list
- Unlike other AWS identities, `aws/ambient` does not clear credential env vars or disable IMDS
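For context, resolving credentials through the AWS SDK default chain looks roughly like the sketch below (standard `aws-sdk-go-v2` usage, not the actual Atmos wiring):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/config"
)

func main() {
	ctx := context.Background()

	// LoadDefaultConfig walks the standard AWS credential chain:
	// env vars -> shared config/credentials -> IRSA (web identity) -> ECS task role -> IMDS.
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatalf("loading ambient AWS config: %v", err)
	}

	creds, err := cfg.Credentials.Retrieve(ctx)
	if err != nil {
		log.Fatalf("resolving ambient credentials: %v", err)
	}
	fmt.Printf("resolved credentials from source: %s\n", creds.Source)
}
```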
why
- Atmos currently explicitly disables IMDS (`AWS_EC2_METADATA_DISABLED=true`) and clears IRSA env vars in `PrepareEnvironment()`, blocking use of infrastructure-provided credentials
- Running Atmos in EKS pods (IRSA), EC2 instances (instance profiles), ECS tasks, or CI runners with pre-configured roles required workarounds
- This makes ambient/infrastructure-provided credentials a first-class auth path, including support for chaining `aws/ambient` → `aws/assume-role` for cross-account access
references
- PRD: `docs/prd/ambient-identity.md`
- Blog: `website/blog/2026-03-25-ambient-credential-support.mdx`
- Example config: `examples/config-profiles/profiles/eks/auth.yaml`
- Docs: Updated `website/docs/stacks/auth.mdx` with ambient identity examples
- Roadmap: Updated `website/src/data/roadmap.js` with shipped milestone
Summary by CodeRabbit
- New Features
- Ambient credential support: `ambient` (cloud-agnostic passthrough that preserves the environment) and `aws/ambient` (resolves AWS credentials via the SDK default provider chain; can be used standalone or chained for cross-account assume-role).
- Documentation
- Added PRD, expanded docs, examples, blog post, and roadmap entry with EKS IRSA, EC2 instance profile, ECS task role, and chaining examples.
- Tests
- Added comprehensive unit and integration tests covering ambient behaviors, region handling, credential flows, and chain construction.
fix: prevent JIT source TTL from wiping varfiles/backend mid-execution @[copilot-swe-agent[bot]](https://github.com/apps/copilot-swe-agent) (#2253)
AutoProvisionSource is called twice per command invocation — once directly from resolveAndProvisionComponentPath, and again via the before.terraform.init hook in prepareInitExecution. With ttl: "0s", the second call treats the workdir as always-expired, invokes os.RemoveAll(targetDir), and wipes the varfiles and backend configs written between the two calls. The subprocess then fails with "file does not exist".
Changes
- `pkg/provisioner/source/provision_hook.go` — adds an in-memory idempotency guard (`invocationDoneKey = "_atmos_source_provisioned"`) to `AutoProvisionSource`. A named-return `defer` sets the marker in `componentConfig` on successful return. Any subsequent call with the same map (same in-memory invocation) short-circuits immediately. The guard is scoped to the per-invocation `componentConfig`; separate `atmos` runs are unaffected. A simplified sketch of the pattern follows this list.
- `pkg/provisioner/source/provision_hook_test.go` — two regression tests:
  - `TestAutoProvisionSource_InvocationGuard_PreventsDoubleProvisioning`: asserts the guard short-circuits a second call even with `ttl: "0s"`
  - `TestAutoProvisionSource_InvocationGuard_SetAfterProvisioning`: asserts the marker is written to `componentConfig` after a skipped provision (TTL not expired), ensuring the hook path is a no-op
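A simplified sketch of the idempotency-guard pattern described above, assuming a map-typed `componentConfig`; the `provision` callback is a hypothetical stand-in for the real provisioning logic:

```go
package source

// invocationDoneKey marks a componentConfig map as already provisioned for
// the current in-memory invocation, so a second call (e.g. from the
// before.terraform.init hook) becomes a no-op.
const invocationDoneKey = "_atmos_source_provisioned"

// autoProvisionSource illustrates the guard: check the marker first, and set
// it via a named-return defer only when provisioning succeeded.
func autoProvisionSource(componentConfig map[string]any, provision func() error) (err error) {
	// Short-circuit if this componentConfig was already provisioned in this invocation.
	if done, ok := componentConfig[invocationDoneKey].(bool); ok && done {
		return nil
	}

	// Named-return defer: mark the map only on successful return.
	defer func() {
		if err == nil {
			componentConfig[invocationDoneKey] = true
		}
	}()

	return provision()
}
```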
Original prompt
This PR resolves the following issue:
Bug: JIT source provisioning TTL expiry deletes varfiles/backend, then runs tofu, causing an error
Describe the Bug
When using Just-In-Time (JIT) source provisioning, the `source.ttl` cleanup runs concurrently with — or before — the tofu subprocess, not after it completes. If the TTL expires at any point while `tofu init`, `tofu plan`, or any other tofu command is executing, Atmos deletes the varfiles and backend configuration out from under the running process.
The most reliable way to trigger this is `ttl: "0s"`, which expires immediately and causes a deterministic failure every time. However, any positive TTL short enough to expire before the tofu subprocess finishes (e.g. `"30s"` on a slow network or large module download) will produce the same failure.
The result is a hard failure from tofu because the generated varfile (and/or backend file) no longer exists on disk:
```
│ Error: Failed to read variables file
│
│ Given variables file /tmp/atmos-workdir-*/component.tfvars.json does not exist.
```
Expected Behavior
The TTL cleanup should be scoped to between invocations, not during one. Provisioned files should never be deleted while the subprocess that depends on them is still running. Specifically:
- TTL expiry should only be evaluated before provisioning (stale cache check), not during or after subprocess execution.
- The provisioned workdir should be treated as a lock for the duration of the current command — held open until the subprocess exits, then subject to TTL-based cleanup on the next invocation.
A `source.ttl: "0s"` is the degenerate case that makes this deterministic, but the fix must cover all TTL values.
Actual Behavior
Atmos generates the varfiles and backend, the TTL of `0s` immediately expires them, Atmos wipes them, and tofu fails:
```
│ Error: Failed to read variables file
│
│ Given variables file demo-null-label.terraform.tfvars.json does not exist.
```
Steps to Reproduce
The script below is fully self-contained. It requires only `atmos` and `tofu` on `PATH` and network access to GitHub. Save it as `repro.sh` and run it.
```bash
#!/usr/bin/env bash
# ============================================================
# REPRO: JIT ttl:"0s" deletes varfiles before tofu can read them
# ============================================================
set -euo pipefail

WORKDIR="$(mktemp -d -t atmos-repro-XXXXXX)"
echo "Working in: ${WORKDIR}"
cd "${WORKDIR}"

# --- 1) atmos.yaml ---
cat <<'EOF' > atmos.yaml
base_path: "."

components:
  terraform:
    base_path: "components/terraform"
    command: "tofu"
    workspaces_enabled: true
    apply_auto_approve: false
    deploy_run_init: true
    init_run_reconfigure: true
    auto_generate_backend_file: true

stacks:
  name_template: "{{ .vars.name }}"
  base_path: "stacks"
  included_paths:
    - "**/*"
EOF

# --- 2) Stack with ttl: "0s" on the JIT source ---
mkdir -p stacks
cat <<'EOF' > stacks/demo.yaml
vars:
  name: demo

terraform:
  backend_type: local

components:
  terraform:
    null-label:
      vars:
        # terraform-null-label variables
        namespace: "eg"
        stage: "test"
        name: "demo"
        enabled: true
      source:
        uri: "git::https://github.com/cloudposse/terraform-null-label.git"
        version: "0.25.0"
        ttl: "0s"  # <-- triggers the bug: files are wiped before tofu reads them
        provision:
          workdir:
            enabled: true
EOF

echo
echo "== tree =="
find . -maxdepth 4 -type f -print | sed 's|^\./||'

echo
echo "== discovered stacks =="
atmos describe stacks

echo
echo "== describe component =="
atmos describe component null-label -s demo

echo
echo "== init (this is where the failure occurs with ttl:0s) =="
atmos terraform init null-label -s demo

echo
echo "== plan =="
atmos terraform plan null-label -s demo

echo "Done. Workspace preserved at: ${WORKDIR}"
```
Run:
```bash
bash repro.sh 2>&1 | tee repro.log
```
Screenshots
No response
Environment
Atmos 1.212.0 on darwin/arm64
Additional Context
No response
Comments on the Issue
- Fixes #2252
fix: retry Atmos Pro uploads on transient 401/5xx with exponential backoff @osterman (#2255)
what
- Add retry logic with exponential backoff (1s, 2s, 4s) to all three Atmos Pro upload methods: `UploadInstanceStatus`, `UploadAffectedStacks`, and `UploadInstances` (a simplified sketch of the retry policy follows this list)
- On 401 errors, re-exchange the OIDC token before retrying (handles JWT secret mismatches across deployment instances)
- On 5xx or network errors, retry with backoff without re-auth
- On 400/403/404, fail immediately without retrying
- Upload failures no longer cause non-zero exit codes — the terraform plan/apply result drives the command exit code, not telemetry
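A simplified sketch of the retry policy described above; `doUpload`, `refreshToken`, and `apiError` are hypothetical stand-ins for the real upload client:

```go
package pro

import (
	"errors"
	"fmt"
	"net/http"
	"time"
)

// apiError is an illustrative error type carrying the HTTP status of a failed upload.
type apiError struct{ status int }

func (e *apiError) Error() string { return fmt.Sprintf("atmos pro API returned %d", e.status) }

// uploadWithRetry retries up to three times with 1s/2s/4s backoff,
// re-exchanging the token on 401, retrying 5xx and network errors as-is,
// and failing fast on other client errors such as 400/403/404.
func uploadWithRetry(doUpload func() error, refreshToken func() error) error {
	backoff := time.Second
	var err error
	for attempt := 0; attempt < 4; attempt++ {
		if attempt > 0 {
			time.Sleep(backoff)
			backoff *= 2 // 1s -> 2s -> 4s
		}
		if err = doUpload(); err == nil {
			return nil
		}
		var apiErr *apiError
		if errors.As(err, &apiErr) {
			switch {
			case apiErr.status == http.StatusUnauthorized:
				// Transient 401: re-exchange the OIDC token, then retry.
				if rerr := refreshToken(); rerr != nil {
					return rerr
				}
			case apiErr.status >= 500:
				// 5xx: retry with backoff, no re-auth needed.
			default:
				// 400/403/404 and other client errors: fail immediately.
				return err
			}
		}
		// Network errors (non-apiError) fall through to another retry.
	}
	return err
}
```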
why
- Intermittent 401 errors occurred on upload PATCH calls even though the OIDC exchange had succeeded seconds earlier, likely due to a JWT signed by a different deployment instance with a mismatched secret
- Happens most often when multiple Atmos commands run in parallel (e.g., 14 concurrent stack operations in a matrix workflow)
- Retrying the same job usually works, indicating transient failures
- Upload telemetry should never block or fail the primary terraform workflow
references
- #2216 (original upload implementation)
Summary by CodeRabbit
- New Features
- Automatic upload retries with exponential backoff and automatic API token refresh on auth failures.
- More structured API error classification to improve retry and auth decisioning.
- Bug Fixes
- Uploads and status-reporting failures now log warnings and no longer cause commands to fail.
- Tests
- Comprehensive tests for retry logic, token refresh behavior, and API error handling.
fix: prevent IRSA credentials from overriding Atmos-managed credentials on EKS pods @osterman (#2143)
what
- Prevent IRSA/pod-injected AWS env vars from overriding Atmos-managed credentials in subprocess execution
- Pass `os.Environ()` through `PrepareShellEnvironment` to sanitize it (delete problematic vars), then pass the sanitized env to the subprocess via `WithBaseEnv` — avoiding re-reading `os.Environ()`, which would reintroduce IRSA vars
- Add `SanitizedBaseEnv` field to `ConfigAndStacksInfo` to carry the sanitized environment through the hooks→terraform/helmfile/packer pipeline
- Add `WithBaseEnv` variadic option to `ExecuteShellCommand` for backward-compatible sanitized env injection
- Fix `auth exec` and `auth shell` to use the sanitized env directly instead of re-reading `os.Environ()`
why
On EKS pods with IRSA (IAM Roles for Service Accounts), the pod identity webhook injects AWS_WEB_IDENTITY_TOKEN_FILE, AWS_ROLE_ARN, and AWS_ROLE_SESSION_NAME into the pod environment. When using Atmos auth on ARC (Actions Runner Controller), these IRSA vars leaked into terraform subprocesses because three code paths re-read os.Environ() after auth sanitization:
- Hooks path (terraform/helmfile/packer): `authenticateAndWriteEnv` only passed `ComponentEnvSection` (stack YAML vars) to `PrepareShellEnvironment` — IRSA vars weren't in the input, so `delete()` was a no-op. Then `ExecuteShellCommand` re-read `os.Environ()` as the base.
- `auth exec`: `executeCommandWithEnv` re-read `os.Environ()` to build the subprocess env.
- `auth shell`: `ExecAuthShellCommand` → `MergeSystemEnvSimpleWithGlobal` re-read `os.Environ()`.
AWS SDK credential chain gives web identity tokens higher precedence than shared credential files, so the pod's runner role was used instead of the Atmos-managed tfplan role, causing AccessDenied errors.
Approach
Instead of setting cleared vars to empty string (which pollutes the subprocess env), we pass a clean, sanitized environment:
- `authenticateAndWriteEnv` now passes `os.Environ()` + `ComponentEnvSection` to `PrepareShellEnvironment`, which deletes problematic keys
- The sanitized result is stored as `SanitizedBaseEnv` on `ConfigAndStacksInfo`
- `ExecuteShellCommand` accepts `WithBaseEnv(info.SanitizedBaseEnv)` to use the sanitized env instead of re-reading `os.Environ()`
- `auth exec` and `auth shell` pass the sanitized env directly to the subprocess, bypassing the re-read (a simplified sketch of the sanitize-and-exec pattern follows this list)
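A simplified sketch of the sanitize-and-exec pattern: drop the IRSA vars from the base environment and hand the sanitized slice to the subprocess instead of re-reading `os.Environ()`. Names here are illustrative; the real list of cleared variables lives in `PrepareShellEnvironment`:

```go
package exec

import (
	"os"
	osexec "os/exec"
	"strings"
)

// irsaVars is an illustrative subset of the pod-injected variables that must
// not leak into subprocesses.
var irsaVars = map[string]bool{
	"AWS_WEB_IDENTITY_TOKEN_FILE": true,
	"AWS_ROLE_ARN":                true,
	"AWS_ROLE_SESSION_NAME":       true,
}

// sanitizeBaseEnv drops IRSA vars from a KEY=VALUE environment slice instead
// of blanking them, so the subprocess env stays clean.
func sanitizeBaseEnv(env []string) []string {
	out := make([]string, 0, len(env))
	for _, kv := range env {
		key, _, _ := strings.Cut(kv, "=")
		if irsaVars[key] {
			continue
		}
		out = append(out, kv)
	}
	return out
}

// runWithSanitizedEnv passes the sanitized slice to the subprocess directly,
// never re-reading os.Environ() (which would reintroduce the IRSA vars).
func runWithSanitizedEnv(name string, args ...string) error {
	cmd := osexec.Command(name, args...)
	cmd.Env = sanitizeBaseEnv(os.Environ())
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}
```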
references
Fixes credential precedence conflict where IRSA vars override Atmos-managed credentials on EKS pods running ARC (DEV-4216)
Summary by CodeRabbit
- Bug Fixes
- Prevented AWS IRSA env vars from leaking into subprocesses by removing auth-related variables from the sanitized base environment so spawned commands use Atmos credentials.
- Ensured credential-chain caching no longer skips the final role, forcing proper re-authentication when needed.
- Refactor
- Preserve and propagate a sanitized environment end-to-end for shell/exec paths so child processes receive the corrected env list.
- Tests
- Updated and added tests to validate env sanitization and subprocess propagation.
- Documentation
- Added guidance describing the credential-chain caching fix and expected behavior.
fix: thread auth identity through describe/list affected for S3 state reads @osterman (#2250)
what
- Thread `AuthManager` through the entire describe affected call chain so `ExecuteDescribeStacks` receives the identity credentials instead of `nil`
- Fix `GetTerraformState` to use the resolved component-specific `AuthContext` for S3 backend reads instead of the (potentially nil) passed-in `authContext` (a simplified sketch follows this list)
- Add per-component identity resolution in `ExecuteDescribeStacks` gated behind `processYamlFunctions`, so each component can use its own identity for `!terraform.state` reads
- Wire the `--identity`/`-i` flag through the `list affected` command, which had the flag registered (inherited from `listCmd`) but never read it or created an `AuthManager`
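A simplified sketch of the core idea behind the `GetTerraformState` fix: the S3 read is built from the resolved per-component credentials rather than the ambient default chain. Function and parameter names here are hypothetical, not the actual Atmos wiring:

```go
package exec

import (
	"context"
	"io"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// readStateWithIdentity reads a Terraform state object using the caller's
// resolved aws.Config (the per-component identity handed down the call chain),
// instead of falling back to the default credential chain when the auth
// context is nil.
func readStateWithIdentity(ctx context.Context, cfg aws.Config, bucket, key string) ([]byte, error) {
	client := s3.NewFromConfig(cfg) // built from the resolved identity, not process defaults
	out, err := client.GetObject(ctx, &s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		return nil, err
	}
	defer out.Body.Close()
	return io.ReadAll(out.Body)
}
```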
why
- Customer reported `atmos list affected --ref refs/heads/main` failing with S3 auth errors despite a valid `atmos auth` identity
- Debug logs showed `resolveAuthManagerForNestedComponent` correctly created per-component AuthManagers, but the credentials were never used for the actual S3 `GetObject` call
- Four independent bugs: (1) AuthManager dropped in the describe affected call chain, (2) `GetTerraformState` ignored the resolved AuthContext for backend reads, (3) no per-component identity resolution in `ExecuteDescribeStacks`, (4) `list affected` never read the `--identity` flag
- Running inside `atmos auth shell` worked because it sets the `ATMOS_IDENTITY` env var (viper fallback), but an explicit `-i admin-account` was silently ignored by `list affected`
references
- `docs/fixes/2026-03-25-describe-affected-auth-identity-not-used.md` — detailed fix documentation
- `docs/fixes/nested-terraform-state-auth-context-propagation.md` — original nested auth fix
- `docs/fixes/2026-03-03-yaml-functions-auth-multi-component.md` — multi-component auth fix
Summary by CodeRabbit
- New Features
- Added `--identity` flag to `list affected` for explicit identity selection.
- Bug Fixes
- Ensure authentication context is propagated into affected/describe flows.
- Terraform backend state reads now use the resolved identity/auth for S3.
- Per-component identity resolution applied during stack processing.
- Documentation
- Added end-to-end fix description for affected/describe identity handling.
- Tests
- Added and updated tests covering identity parsing and auth-manager propagation.
fix: preserve deleted and deletion_type fields in upload strip @milldr (#2249)
What
Preserve deleted and deletion_type fields in StripAffectedForUpload so they reach Atmos Pro when using --upload.
Why
StripAffectedForUpload constructs a new schema.Affected with only the fields needed by Atmos Pro, but it was missing Deleted and DeletionType. This caused deleted components to arrive at Atmos Pro without their deletion metadata, making them appear as "disabled" instead of "deleted".
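A minimal sketch of the fix, with an illustrative stand-in for `schema.Affected` (only the relevant fields shown):

```go
package pro

// Affected is a trimmed, illustrative stand-in for schema.Affected.
type Affected struct {
	Component    string
	Stack        string
	Deleted      bool
	DeletionType string
}

// stripAffectedForUpload builds the reduced copy sent with --upload. The
// stripped copy must carry the deletion metadata, otherwise deleted
// components show up in Atmos Pro as merely "disabled".
func stripAffectedForUpload(in Affected) Affected {
	return Affected{
		Component:    in.Component,
		Stack:        in.Stack,
		Deleted:      in.Deleted,      // previously dropped
		DeletionType: in.DeletionType, // previously dropped
	}
}
```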
References
- Previous fix (dependents crash): #2237
- Atmos Pro PR: cloudposse-corp/apps#933
- Linear: AP-161
Summary by CodeRabbit
- Bug Fixes
- Fixed an issue where deletion-related information was not being properly preserved during the data upload process.