github cloudposse/atmos v1.213.0

docs: add PRD for cyclomatic complexity reduction @[copilot-swe-agent[bot]](https://github.com/apps/copilot-swe-agent) (#2230)

High cyclomatic complexity is the primary barrier to unit-test coverage in Atmos — every branch requires a dedicated test case, so complexity directly caps achievable coverage. This PRD standardises the reduction strategy already proven on ExecuteTerraform (160→26), ExecuteDescribeStacks (247→10), and processArgsAndFlags (67→15).

What's in the PRD (docs/prd/cyclomatic-complexity-reduction.md)

Refactoring techniques (with before/after examples)

  • Extract coordinator + focused helpers into co-located *_helpers.go files
  • Replace N-case switch with map[string]func(...) dispatch tables (cyclomatic stays at 2 regardless of table size)
  • Guard-clause / early-return flattening to eliminate else-after-return nesting
  • Options struct to replace boolean-flag parameter explosions
  • Predicate extraction to name and isolate long if conditions
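Two of these techniques can be sketched together in a few lines of Go. This is an illustrative example of the dispatch-table and guard-clause patterns the PRD describes, not code from Atmos itself; the `handlers` map and command names are hypothetical.

```go
package main

import "fmt"

// Hypothetical command handlers. A switch over subcommands adds one branch
// per case; a dispatch table keeps cyclomatic complexity constant no matter
// how many entries are registered.
var handlers = map[string]func(args []string) error{
	"plan":  func(args []string) error { fmt.Println("planning", args); return nil },
	"apply": func(args []string) error { fmt.Println("applying", args); return nil },
}

// dispatch has two branches (lookup hit or miss) regardless of table size.
func dispatch(cmd string, args []string) error {
	h, ok := handlers[cmd]
	if !ok { // guard clause: fail fast instead of nesting an else
		return fmt.Errorf("unknown command %q", cmd)
	}
	return h(args)
}

func main() {
	if err := dispatch("plan", []string{"vpc"}); err != nil {
		fmt.Println("error:", err)
	}
	if err := dispatch("destroy", nil); err != nil {
		fmt.Println("error:", err)
	}
}
```

Adding a third handler leaves `dispatch` untouched, which is why the cyclomatic score stays at 2 as the table grows.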

Enforcement: new code vs. old code

  • Phased threshold tightening over ~6 months: cyclop 15→12→10→8; revive cyclomatic promoted from warning → error
  • docs/complexity-budget.yml exemption registry for functions that already exceed the new threshold
  • CI script (check-complexity-budget.sh) that rejects any unregistered //nolint:cyclop,revive,gocognit annotation, preventing silent accumulation

Progress tracking

  • Baseline snapshot via gocyclo -over 9 -avg . committed to docs/complexity-baseline.txt
  • Nightly GitHub Actions workflow (complexity-trend.yml) posting per-threshold counts to $GITHUB_STEP_SUMMARY
  • Coverage-correlation recipe: high complexity + low branch coverage = highest refactor priority
  • Sprint checklist template for tracking per-sprint reductions


feat: add chunked uploads for large stack payloads @milldr (#2251)

What

Add automatic chunking for large stack/instance upload payloads to Atmos Pro. When payloads exceed the configurable threshold (default 4MB), the CLI splits the array into chunks and sends them sequentially with batch metadata (batch_id, batch_index, batch_total).

Why

Large infrastructure repos generate affected stack and instance payloads that exceed Vercel's ~4.5MB serverless body size limit, producing HTTP 413 Request Entity Too Large errors. The existing StripAffectedForUpload() reduces payloads by 70-75% but is insufficient for repos with hundreds of stacks.

Changes

  • New pkg/pro/chunked_upload.go — generic chunking logic (sendChunked, splitSlice, metadataOverhead)
  • Updated UploadAffectedStacks() and UploadInstances() to use chunked upload
  • Added batch_id, batch_index, batch_total fields to upload DTOs
  • Switched from indented to compact JSON for upload payloads (~30% smaller)
  • Added max_payload_bytes config to settings.pro in atmos.yaml
  • Backward compatible: small payloads send without batch fields, old servers ignore unknown fields

Ref

Companion server-side PR: cloudposse-corp/apps (feat/chunked-stack-uploads → staging)

Summary by CodeRabbit

  • New Features

    • Large stack and instance uploads now auto-split into multiple requests when exceeding a configurable threshold (default 4MB).
    • Added configurable upload limit via atmos.yaml (settings.pro.max_payload_bytes).
    • Chunked uploads include batch metadata (batch_id, batch_index, batch_total) for reliable reassembly; small payloads remain single-request and backward compatible.
    • Upload payloads use compact JSON serialization to reduce size.
  • Documentation

    • New blog post and roadmap entry describing chunked upload behavior and configuration.
  • Tests

    • Added unit and integration tests validating chunking, batching, and error handling.
feat: introduce Gists as community-contributed recipes @osterman (#2238)

what

  • Introduced Gists — a new content type for community-contributed recipes that demonstrate creative combinations of Atmos features (Custom Commands, Auth, Toolchain, etc.)
  • Added a GistDisclaimer React component (purple/violet pill) that displays on all gist pages: "Gists are examples that demonstrate a concept, but are not actively maintained and may not work in your environment or current versions of Atmos without adaptations."
  • Extended the file-browser plugin with a disclaimer option, enabling a second plugin instance at /gists alongside the existing /examples
  • Added "Gists" to the top navbar between Examples and Community
  • Created the first gist: MCP with AWS — a masterclass in combining Custom Commands + Auth + Toolchain to run 21 AWS MCP servers with automatic credential management (sourced from cloudposse/infra-live PR #1662)
  • Added a blog post announcing the Gists feature
  • Added a gist-creator Claude agent for standardizing future gist creation

why

  • Community members share creative Atmos patterns that don't fit the maintained examples model — they need a home that sets the right expectations
  • The MCP with AWS recipe demonstrates the composability of Atmos features (the key insight: atmos auth exec wraps MCP server processes with authenticated AWS credentials)
  • Having a standardized gist structure and agent makes it easy to add more recipes over time

references

Summary by CodeRabbit

  • New Features

    • Introduced Gists — a community-contributed recipe space for Atmos.
    • Added an AWS MCP gist with install/start/test commands and many preconfigured AWS services and startup presets.
    • Added toolchain alias and minimal Atmos config to enable gists.
    • Exposed Gists in the site file browser at /gists with a configurable disclaimer and navbar link.
    • Added Mermaid diagram support and a reusable Gist disclaimer UI component and styles.
  • Documentation

    • Published blog post introducing Gists and contribution guidelines.
    • Added gist README templates, registration guidance, required README structure, and a verification checklist.
feat: add ambient credential support for IRSA, IMDS, and ECS task roles @osterman (#2254)

what

  • Adds two new auth identity kinds: ambient (cloud-agnostic passthrough) and aws/ambient (AWS SDK default credential chain)
  • ambient is a pure do-nothing passthrough that preserves the environment unchanged
  • aws/ambient resolves credentials via the default AWS SDK chain (env vars → shared config → IRSA → IMDS → ECS task role) and supports chaining with aws/assume-role
  • Unlike other AWS identities, aws/ambient does not clear credential env vars or disable IMDS

why

  • Atmos currently explicitly disables IMDS (AWS_EC2_METADATA_DISABLED=true) and clears IRSA env vars in PrepareEnvironment(), blocking use of infrastructure-provided credentials
  • Running Atmos in EKS pods (IRSA), EC2 instances (instance profiles), ECS tasks, or CI runners with pre-configured roles required workarounds
  • This makes ambient/infrastructure-provided credentials a first-class auth path, including support for chaining aws/ambient → aws/assume-role for cross-account access

references

  • PRD: docs/prd/ambient-identity.md
  • Blog: website/blog/2026-03-25-ambient-credential-support.mdx
  • Example config: examples/config-profiles/profiles/eks/auth.yaml
  • Docs: Updated website/docs/stacks/auth.mdx with ambient identity examples
  • Roadmap: Updated website/src/data/roadmap.js with shipped milestone

Summary by CodeRabbit

  • New Features

    • Ambient credential support: ambient (cloud-agnostic passthrough preserves environment) and aws/ambient (resolves AWS credentials via the SDK default provider chain; can be used standalone or chained for cross-account assume-role).
  • Documentation

    • Added PRD, expanded docs, examples, blog post, and roadmap entry with EKS IRSA, EC2 instance profile, ECS task role, and chaining examples.
  • Tests

    • Added comprehensive unit and integration tests covering ambient behaviors, region handling, credential flows, and chain construction.
fix: prevent JIT source TTL from wiping varfiles/backend mid-execution @[copilot-swe-agent[bot]](https://github.com/apps/copilot-swe-agent) (#2253)

AutoProvisionSource is called twice per command invocation — once directly from resolveAndProvisionComponentPath, and again via the before.terraform.init hook in prepareInitExecution. With ttl: "0s", the second call treats the workdir as always-expired, invokes os.RemoveAll(targetDir), and wipes the varfiles and backend configs written between the two calls. The subprocess then fails with "file does not exist".

Changes

  • pkg/provisioner/source/provision_hook.go — adds an in-memory idempotency guard (invocationDoneKey = "_atmos_source_provisioned") to AutoProvisionSource. A named-return defer sets the marker in componentConfig on successful return. Any subsequent call with the same map (same in-memory invocation) short-circuits immediately. The guard is scoped to the per-invocation componentConfig; separate atmos runs are unaffected.

  • pkg/provisioner/source/provision_hook_test.go — two regression tests:

    • TestAutoProvisionSource_InvocationGuard_PreventsDoubleProvisioning: asserts the guard short-circuits a second call even with ttl: "0s"
    • TestAutoProvisionSource_InvocationGuard_SetAfterProvisioning: asserts the marker is written to componentConfig after a skipped provision (TTL not expired), ensuring the hook path is a no-op
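The guard mechanism can be sketched in a few lines. This is a simplified illustration of the pattern described above, not the PR's actual AutoProvisionSource; only the `invocationDoneKey` name comes from the change description.

```go
package main

import "fmt"

const invocationDoneKey = "_atmos_source_provisioned"

// provisionOnce sketches the idempotency guard: the first call on a given
// componentConfig map performs the (potentially destructive) provisioning
// and marks the map; repeat calls in the same invocation short-circuit, so
// an expired TTL can never wipe the workdir mid-command.
func provisionOnce(componentConfig map[string]any) (provisioned bool) {
	if done, _ := componentConfig[invocationDoneKey].(bool); done {
		return false // already provisioned in this invocation
	}
	defer func() {
		// In the real fix this is a named-return defer that only marks
		// the map on successful return; simplified here.
		componentConfig[invocationDoneKey] = true
	}()
	// ... TTL check, os.RemoveAll, and re-download would happen here ...
	return true
}

func main() {
	cfg := map[string]any{"source": map[string]any{"ttl": "0s"}}
	fmt.Println("first call provisions:", provisionOnce(cfg))
	fmt.Println("second call provisions:", provisionOnce(cfg))
}
```

Because the marker lives in the in-memory map rather than on disk, a fresh atmos run starts with a clean config and the TTL check applies again, which is exactly the "between invocations, not during one" scoping the issue asked for.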
Original issue

Bug: JIT source provisioning TTL expiry deletes varfiles/backend, then runs tofu, causing error

Describe the Bug

When using Just-In-Time (JIT) source provisioning, the source.ttl cleanup runs concurrently with — or before — the tofu subprocess, not after it completes. If the TTL expires at any point while tofu init, tofu plan, or any other tofu command is executing, Atmos deletes the varfiles and backend configuration out from under the running process.

The most reliable way to trigger this is ttl: "0s", which expires immediately and causes a deterministic failure every time. However, any positive TTL short enough to expire before the tofu subprocess finishes (e.g. "30s" on a slow network or large module download) will produce the same failure.

The result is a hard failure from tofu because the generated varfile (and/or backend file) no longer exists on disk:

Error: Failed to read variables file
│
│ Given variables file /tmp/atmos-workdir-*/component.tfvars.json does not exist.

Expected Behavior

The TTL cleanup should be scoped to between invocations, not during one. Provisioned files should never be deleted while the subprocess that depends on them is still running. Specifically:

  • TTL expiry should only be evaluated before provisioning (stale cache check), not during or after subprocess execution.
  • The provisioned workdir should be treated as a lock for the duration of the current command — held open until the subprocess exits, then subject to TTL-based cleanup on the next invocation.

A source.ttl: "0s" is the degenerate case that makes this deterministic, but the fix must cover all TTL values.


Actual Behavior

Atmos generates the varfiles and backend, the TTL of 0s immediately expires them, Atmos wipes them, and tofu fails:

│ Error: Failed to read variables file
│ 
│ Given variables file demo-null-label.terraform.tfvars.json does not exist.

Steps to Reproduce

The script below is fully self-contained. It requires only atmos and tofu on PATH and network access to GitHub. Save it as repro.sh and run it.

#!/usr/bin/env bash
# ============================================================
# REPRO: JIT ttl:"0s" deletes varfiles before tofu can read them
# ============================================================

set -euo pipefail

WORKDIR="$(mktemp -d -t atmos-repro-XXXXXX)"
echo "Working in: ${WORKDIR}"
cd "${WORKDIR}"

# --- 1) atmos.yaml ---
cat <<'EOF' > atmos.yaml
base_path: "."

components:
  terraform:
    base_path: "components/terraform"
    command: "tofu"
    workspaces_enabled: true
    apply_auto_approve: false
    deploy_run_init: true
    init_run_reconfigure: true
    auto_generate_backend_file: true

stacks:
  name_template: "{{ .vars.name }}"
  base_path: "stacks"
  included_paths:
    - "**/*"
EOF

# --- 2) Stack with ttl: "0s" on the JIT source ---
mkdir -p stacks
cat <<'EOF' > stacks/demo.yaml
vars:
  name: demo

terraform:
  backend_type: local

components:
  terraform:
    null-label:
      vars:
        # terraform-null-label variables
        namespace: "eg"
        stage: "test"
        name: "demo"
        enabled: true
      source:
        uri: "git::https://github.com/cloudposse/terraform-null-label.git"
        version: "0.25.0"
        ttl: "0s"    # <-- triggers the bug: files are wiped before tofu reads them
      provision:
        workdir:
          enabled: true
EOF

echo
echo "== tree =="
find . -maxdepth 4 -type f -print | sed 's|^\./||'

echo
echo "== discovered stacks =="
atmos describe stacks

echo
echo "== describe component =="
atmos describe component null-label -s demo

echo
echo "== init (this is where the failure occurs with ttl:0s) =="
atmos terraform init null-label -s demo

echo
echo "== plan =="
atmos terraform plan null-label -s demo

echo "Done. Workspace preserved at: ${WORKDIR}"

Run:

bash repro.sh 2>&1 | tee repro.log

Screenshots

No response

Environment

Atmos 1.212.0 on darwin/arm64

Additional Context

No response


fix: retry Atmos Pro uploads on transient 401/5xx with exponential backoff @osterman (#2255)

what

  • Add retry logic with exponential backoff (1s, 2s, 4s) to all three Atmos Pro upload methods: UploadInstanceStatus, UploadAffectedStacks, and UploadInstances
  • On 401 errors, re-exchange the OIDC token before retrying (handles JWT secret mismatches across deployment instances)
  • On 5xx or network errors, retry with backoff without re-auth
  • On 400/403/404, fail immediately without retrying
  • Upload failures no longer cause non-zero exit codes — the terraform plan/apply result drives the command exit code, not telemetry

why

  • Intermittent 401 errors occurred on upload PATCH calls even though the OIDC exchange succeeded seconds earlier, likely because the JWT was signed by a different deployment instance with a mismatched secret
  • Happens most often when multiple Atmos commands run in parallel (e.g., 14 concurrent stack operations in a matrix workflow)
  • Retrying the same job usually works, indicating transient failures
  • Upload telemetry should never block or fail the primary terraform workflow

references

  • #2216 (original upload implementation)

Summary by CodeRabbit

  • New Features

    • Automatic upload retries with exponential backoff and automatic API token refresh on auth failures.
    • More structured API error classification to improve retry and auth decisioning.
  • Bug Fixes

    • Uploads and status-reporting failures now log warnings and no longer cause commands to fail.
  • Tests

    • Comprehensive tests for retry logic, token refresh behavior, and API error handling.
fix: prevent IRSA credentials from overriding Atmos-managed credentials on EKS pods @osterman (#2143)

what

  • Prevent IRSA/pod-injected AWS env vars from overriding Atmos-managed credentials in subprocess execution
  • Pass os.Environ() through PrepareShellEnvironment to sanitize it (delete problematic vars), then pass the sanitized env to subprocess via WithBaseEnv — avoiding re-reading os.Environ() which would reintroduce IRSA vars
  • Add SanitizedBaseEnv field to ConfigAndStacksInfo to carry sanitized environment through the hooks→terraform/helmfile/packer pipeline
  • Add WithBaseEnv variadic option to ExecuteShellCommand for backward-compatible sanitized env injection
  • Fix auth exec and auth shell to use sanitized env directly instead of re-reading os.Environ()

why

On EKS pods with IRSA (IAM Roles for Service Accounts), the pod identity webhook injects AWS_WEB_IDENTITY_TOKEN_FILE, AWS_ROLE_ARN, and AWS_ROLE_SESSION_NAME into the pod environment. When using Atmos auth on ARC (Actions Runner Controller), these IRSA vars leaked into terraform subprocesses because three code paths re-read os.Environ() after auth sanitization:

  1. Hooks path (terraform/helmfile/packer): authenticateAndWriteEnv only passed ComponentEnvSection (stack YAML vars) to PrepareShellEnvironment — IRSA vars weren't in the input so delete() was a no-op. Then ExecuteShellCommand re-read os.Environ() as the base.
  2. auth exec: executeCommandWithEnv re-read os.Environ() to build subprocess env.
  3. auth shell: ExecAuthShellCommandMergeSystemEnvSimpleWithGlobal re-read os.Environ().

AWS SDK credential chain gives web identity tokens higher precedence than shared credential files, so the pod's runner role was used instead of the Atmos-managed tfplan role, causing AccessDenied errors.

Approach

Instead of setting cleared vars to empty string (which pollutes the subprocess env), we pass a clean, sanitized environment:

  1. authenticateAndWriteEnv now passes os.Environ() + ComponentEnvSection to PrepareShellEnvironment, which deletes problematic keys
  2. The sanitized result is stored as SanitizedBaseEnv on ConfigAndStacksInfo
  3. ExecuteShellCommand accepts WithBaseEnv(info.SanitizedBaseEnv) to use the sanitized env instead of re-reading os.Environ()
  4. auth exec and auth shell pass sanitized env directly to subprocess, bypassing the re-read
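The sanitization step can be sketched as plain env-list filtering. This is a minimal illustration of the "delete, don't blank" approach described above, assuming an env in the usual `KEY=value` form; the `sanitizeEnv` name is this sketch's own, not the actual PrepareShellEnvironment code.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeEnv returns a copy of base with the named variables removed
// entirely, instead of overwriting them with empty strings, so the
// subprocess environment stays clean.
func sanitizeEnv(base []string, drop ...string) []string {
	blocked := make(map[string]bool, len(drop))
	for _, k := range drop {
		blocked[k] = true
	}
	out := make([]string, 0, len(base))
	for _, kv := range base {
		key, _, _ := strings.Cut(kv, "=")
		if !blocked[key] {
			out = append(out, kv)
		}
	}
	return out
}

func main() {
	base := []string{
		"PATH=/usr/bin",
		"AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/token",
		"AWS_ROLE_ARN=arn:aws:iam::123456789012:role/runner",
		"AWS_ACCESS_KEY_ID=atmos-managed",
	}
	clean := sanitizeEnv(base,
		"AWS_WEB_IDENTITY_TOKEN_FILE", "AWS_ROLE_ARN", "AWS_ROLE_SESSION_NAME")
	fmt.Println(clean)
}
```

The key invariant is that the filtered slice, not a fresh os.Environ() read, is what reaches the subprocess, since any later re-read would reintroduce the IRSA variables the webhook injected.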

references

Fixes credential precedence conflict where IRSA vars override Atmos-managed credentials on EKS pods running ARC (DEV-4216)

Summary by CodeRabbit

  • Bug Fixes

    • Prevented AWS IRSA env vars from leaking into subprocesses by sanitizing auth-related variables (deleting them from the environment rather than blanking them) so spawned commands use Atmos credentials.
    • Ensured credential-chain caching no longer skips the final role, forcing proper re-authentication when needed.
  • Refactor

    • Preserve and propagate a sanitized environment end-to-end for shell/exec paths so child processes receive the corrected env list.
  • Tests

    • Updated and added tests to validate env sanitization and subprocess propagation.
  • Documentation

    • Added guidance describing the credential-chain caching fix and expected behavior.
fix: thread auth identity through describe/list affected for S3 state reads @osterman (#2250)

what

  • Thread AuthManager through the entire describe affected call chain so ExecuteDescribeStacks receives the identity credentials instead of nil
  • Fix GetTerraformState to use the resolved component-specific AuthContext for S3 backend reads instead of the (potentially nil) passed-in authContext
  • Add per-component identity resolution in ExecuteDescribeStacks gated behind processYamlFunctions, so each component can use its own identity for !terraform.state reads
  • Wire the --identity / -i flag through the list affected command, which had the flag registered (inherited from listCmd) but never read it or created an AuthManager

why

  • Customer reported atmos list affected --ref refs/heads/main failing with S3 auth errors despite valid atmos auth identity
  • Debug logs showed resolveAuthManagerForNestedComponent correctly created per-component AuthManagers, but the credentials were never used for the actual S3 GetObject call
  • Four independent bugs: (1) AuthManager dropped in describe affected call chain, (2) GetTerraformState ignored resolved AuthContext for backend reads, (3) no per-component identity resolution in ExecuteDescribeStacks, (4) list affected never read the --identity flag
  • Running inside atmos auth shell worked because it sets ATMOS_IDENTITY env var (viper fallback), but explicit -i admin-account was silently ignored by list affected

references

  • docs/fixes/2026-03-25-describe-affected-auth-identity-not-used.md — detailed fix documentation
  • docs/fixes/nested-terraform-state-auth-context-propagation.md — original nested auth fix
  • docs/fixes/2026-03-03-yaml-functions-auth-multi-component.md — multi-component auth fix

Summary by CodeRabbit

  • New Features

    • Added --identity flag to list affected for explicit identity selection.
  • Bug Fixes

    • Ensure authentication context is propagated into affected/describe flows.
    • Terraform backend state reads now use the resolved identity/auth for S3.
    • Per-component identity resolution applied during stack processing.
  • Documentation

    • Added end-to-end fix description for affected/describe identity handling.
  • Tests

    • Added and updated tests covering identity parsing and auth-manager propagation.
fix: preserve deleted and deletion_type fields in upload strip @milldr (#2249)

What

Preserve deleted and deletion_type fields in StripAffectedForUpload so they reach Atmos Pro when using --upload.

Why

StripAffectedForUpload constructs a new schema.Affected with only the fields needed by Atmos Pro, but it was missing Deleted and DeletionType. This caused deleted components to arrive at Atmos Pro without their deletion metadata, making them appear as "disabled" instead of "deleted".
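Constructing a fresh struct this way silently zero-values any field not copied, which is how the metadata was lost. A minimal sketch of the failure mode, using a hypothetical cut-down `affected` type rather than the real schema.Affected, and a made-up `DeletionType` value:

```go
package main

import "fmt"

// affected mirrors a few fields of the upload DTO; the real type has many more.
type affected struct {
	Component    string
	Deleted      bool
	DeletionType string
}

// stripForUpload copies only the fields Atmos Pro needs. Before the fix,
// the Deleted and DeletionType lines were missing, so deleted components
// arrived with the zero values false and "".
func stripForUpload(a affected) affected {
	return affected{
		Component:    a.Component,
		Deleted:      a.Deleted,
		DeletionType: a.DeletionType,
	}
}

func main() {
	in := affected{Component: "vpc", Deleted: true, DeletionType: "removed-from-stack"}
	fmt.Printf("%+v\n", stripForUpload(in))
}
```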

References

Summary by CodeRabbit

  • Bug Fixes
    • Fixed an issue where deletion-related information was not being properly preserved during the data upload process.
