docs: add PRD for cyclomatic complexity reduction @[copilot-swe-agent[bot]](https://github.com/apps/copilot-swe-agent) (#2230)
High cyclomatic complexity is the primary barrier to unit-test coverage in Atmos — every branch requires a dedicated test case, so complexity directly caps achievable coverage. This PRD standardises the reduction strategy already proven on ExecuteTerraform (160→26), ExecuteDescribeStacks (247→10), and processArgsAndFlags (67→15).
What's in the PRD (docs/prd/cyclomatic-complexity-reduction.md)
Refactoring techniques (with before/after examples)
- Extract coordinator + focused helpers into co-located `*_helpers.go` files
- Replace N-case `switch` statements with `map[string]func(...)` dispatch tables — cyclomatic complexity stays at 2 regardless of table size (see the sketch below)
- Guard-clause / early-return flattening to eliminate else-after-return nesting
- Options struct to replace boolean-flag parameter explosions
- Predicate extraction to name and isolate long `if` conditions
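As an illustration of the dispatch-table technique, here is a minimal Go sketch (not code from the PRD; the subcommand names and handlers are hypothetical):

```go
package cmd

import "fmt"

// handlers replaces an N-case switch: adding a command grows the table,
// not the cyclomatic complexity of dispatch().
var handlers = map[string]func(args []string) error{
	"plan":  func(args []string) error { fmt.Println("plan", args); return nil },
	"apply": func(args []string) error { fmt.Println("apply", args); return nil },
	"init":  func(args []string) error { fmt.Println("init", args); return nil },
}

// dispatch has a cyclomatic complexity of 2 (one branch for the missing key)
// no matter how many entries the table holds.
func dispatch(name string, args []string) error {
	h, ok := handlers[name]
	if !ok {
		return fmt.Errorf("unknown subcommand %q", name)
	}
	return h(args)
}
```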
Enforcement: new code vs. old code
- Phased threshold tightening over ~6 months: `cyclop` 15→12→10→8; `revive` `cyclomatic` promoted from warning → error
- `docs/complexity-budget.yml` exemption registry for functions that already exceed the new threshold
- CI script (`check-complexity-budget.sh`) that rejects any unregistered `//nolint:cyclop,revive,gocognit` annotation, preventing silent accumulation
Progress tracking
- Baseline snapshot via `gocyclo -over 9 -avg .` committed to `docs/complexity-baseline.txt`
- Nightly GitHub Actions workflow (`complexity-trend.yml`) posting per-threshold counts to `$GITHUB_STEP_SUMMARY`
- Coverage-correlation recipe: high complexity + low branch coverage = highest refactor priority
- Sprint checklist template for tracking per-sprint reductions
feat: add chunked uploads for large stack payloads @milldr (#2251)
What
Add automatic chunking for large stack/instance upload payloads to Atmos Pro. When payloads exceed the configurable threshold (default 4MB), the CLI splits the array into chunks and sends them sequentially with batch metadata (batch_id, batch_index, batch_total).
Why
Large infrastructure repos generate affected stack and instance payloads that exceed Vercel's ~4.5MB serverless body size limit, producing HTTP 413 Request Entity Too Large errors. The existing StripAffectedForUpload() reduces payloads by 70-75% but is insufficient for repos with hundreds of stacks.
Changes
- New `pkg/pro/chunked_upload.go` — generic chunking logic (`sendChunked`, `splitSlice`, `metadataOverhead`); a simplified sketch follows this list
- Updated `UploadAffectedStacks()` and `UploadInstances()` to use chunked upload
- Added `batch_id`, `batch_index`, `batch_total` fields to upload DTOs
- Switched from indented to compact JSON for upload payloads (~30% smaller)
- Added `max_payload_bytes` config to `settings.pro` in atmos.yaml
- Backward compatible: small payloads send without batch fields, old servers ignore unknown fields
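A minimal sketch of the chunking approach, reusing the PR's `splitSlice`/`sendChunked` names with illustrative bodies — the shipped code sizes chunks against the configured `max_payload_bytes` and metadata overhead, whereas this sketch chunks by element count for brevity:

```go
package pro

// BatchMeta carries the metadata attached to each chunked request so the
// server can reassemble the full upload.
type BatchMeta struct {
	BatchID    string `json:"batch_id"`
	BatchIndex int    `json:"batch_index"`
	BatchTotal int    `json:"batch_total"`
}

// splitSlice divides items into chunks of at most chunkSize elements.
func splitSlice[T any](items []T, chunkSize int) [][]T {
	if chunkSize <= 0 || len(items) <= chunkSize {
		return [][]T{items}
	}
	var chunks [][]T
	for start := 0; start < len(items); start += chunkSize {
		end := start + chunkSize
		if end > len(items) {
			end = len(items)
		}
		chunks = append(chunks, items[start:end])
	}
	return chunks
}

// sendChunked sends each chunk sequentially with its batch metadata via a
// caller-supplied send function (a hypothetical stand-in for the HTTP call).
func sendChunked[T any](items []T, chunkSize int, batchID string, send func([]T, BatchMeta) error) error {
	chunks := splitSlice(items, chunkSize)
	for i, chunk := range chunks {
		meta := BatchMeta{BatchID: batchID, BatchIndex: i, BatchTotal: len(chunks)}
		if err := send(chunk, meta); err != nil {
			return err
		}
	}
	return nil
}
```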
Ref
Companion server-side PR: cloudposse-corp/apps (feat/chunked-stack-uploads → staging)
Summary by CodeRabbit
- New Features
- Large stack and instance uploads now auto-split into multiple requests when exceeding a configurable threshold (default 4MB).
- Added configurable upload limit via atmos.yaml (settings.pro.max_payload_bytes).
- Chunked uploads include batch metadata (batch_id, batch_index, batch_total) for reliable reassembly; small payloads remain single-request and backward compatible.
- Upload payloads use compact JSON serialization to reduce size.
- Documentation
- New blog post and roadmap entry describing chunked upload behavior and configuration.
- Tests
- Added unit and integration tests validating chunking, batching, and error handling.
feat: introduce Gists as community-contributed recipes @osterman (#2238)
what
- Introduced Gists — a new content type for community-contributed recipes that demonstrate creative combinations of Atmos features (Custom Commands, Auth, Toolchain, etc.)
- Added a `GistDisclaimer` React component (purple/violet pill) that displays on all gist pages: "Gists are examples that demonstrate a concept, but are not actively maintained and may not work in your environment or current versions of Atmos without adaptations."
- Extended the file-browser plugin with a `disclaimer` option, enabling a second plugin instance at `/gists` alongside the existing `/examples`
- Added "Gists" to the top navbar between Examples and Community
- Created the first gist: MCP with AWS — a masterclass in combining Custom Commands + Auth + Toolchain to run 21 AWS MCP servers with automatic credential management (sourced from `cloudposse/infra-live` PR #1662)
- Added a blog post announcing the Gists feature
- Added a `gist-creator` Claude agent for standardizing future gist creation
why
- Community members share creative Atmos patterns that don't fit the maintained examples model — they need a home that sets the right expectations
- The MCP with AWS recipe demonstrates the composability of Atmos features (the key insight: `atmos auth exec` wraps MCP server processes with authenticated AWS credentials)
- Having a standardized gist structure and agent makes it easy to add more recipes over time
references
- Source material for first gist: https://github.com/cloudposse/infra-live/pull/1662
Summary by CodeRabbit
- New Features
- Introduced Gists — a community-contributed recipe space for Atmos.
- Added an AWS MCP gist with install/start/test commands and many preconfigured AWS services and startup presets.
- Added toolchain alias and minimal Atmos config to enable gists.
- Exposed Gists in the site file browser at /gists with a configurable disclaimer and navbar link.
- Added Mermaid diagram support and a reusable Gist disclaimer UI component and styles.
- Documentation
- Published blog post introducing Gists and contribution guidelines.
- Added gist README templates, registration guidance, required README structure, and a verification checklist.
feat: add ambient credential support for IRSA, IMDS, and ECS task roles @osterman (#2254)
what
- Adds two new auth identity kinds: `ambient` (cloud-agnostic passthrough) and `aws/ambient` (AWS SDK default credential chain)
- `ambient` is a pure do-nothing passthrough that preserves the environment unchanged
- `aws/ambient` resolves credentials via the default AWS SDK chain (env vars → shared config → IRSA → IMDS → ECS task role) and supports chaining with `aws/assume-role`; a sketch of default-chain resolution follows this list
- Unlike other AWS identities, `aws/ambient` does not clear credential env vars or disable IMDS
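For context, resolving credentials through the AWS SDK default chain looks roughly like the sketch below (standard `aws-sdk-go-v2` usage, not the actual Atmos wiring):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/config"
)

func main() {
	ctx := context.Background()

	// LoadDefaultConfig walks the standard AWS credential chain:
	// env vars -> shared config/credentials -> IRSA (web identity) -> ECS task role -> IMDS.
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatalf("loading ambient AWS config: %v", err)
	}

	creds, err := cfg.Credentials.Retrieve(ctx)
	if err != nil {
		log.Fatalf("resolving ambient credentials: %v", err)
	}
	fmt.Printf("resolved credentials from source: %s\n", creds.Source)
}
```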
why
- Atmos currently explicitly disables IMDS (`AWS_EC2_METADATA_DISABLED=true`) and clears IRSA env vars in `PrepareEnvironment()`, blocking use of infrastructure-provided credentials
- Running Atmos in EKS pods (IRSA), EC2 instances (instance profiles), ECS tasks, or CI runners with pre-configured roles required workarounds
- This makes ambient/infrastructure-provided credentials a first-class auth path, including support for chaining `aws/ambient` → `aws/assume-role` for cross-account access
references
- PRD: `docs/prd/ambient-identity.md`
- Blog: `website/blog/2026-03-25-ambient-credential-support.mdx`
- Example config: `examples/config-profiles/profiles/eks/auth.yaml`
- Docs: Updated `website/docs/stacks/auth.mdx` with ambient identity examples
- Roadmap: Updated `website/src/data/roadmap.js` with shipped milestone
Summary by CodeRabbit
- New Features
- Ambient credential support: `ambient` (cloud-agnostic passthrough that preserves the environment) and `aws/ambient` (resolves AWS credentials via the SDK default provider chain; can be used standalone or chained for cross-account assume-role).
- Documentation
- Added PRD, expanded docs, examples, blog post, and roadmap entry with EKS IRSA, EC2 instance profile, ECS task role, and chaining examples.
- Tests
- Added comprehensive unit and integration tests covering ambient behaviors, region handling, credential flows, and chain construction.
fix: prevent JIT source TTL from wiping varfiles/backend mid-execution @[copilot-swe-agent[bot]](https://github.com/apps/copilot-swe-agent) (#2253)
AutoProvisionSource is called twice per command invocation — once directly from resolveAndProvisionComponentPath, and again via the before.terraform.init hook in prepareInitExecution. With ttl: "0s", the second call treats the workdir as always-expired, invokes os.RemoveAll(targetDir), and wipes the varfiles and backend configs written between the two calls. The subprocess then fails with "file does not exist".
Changes
- `pkg/provisioner/source/provision_hook.go` — adds an in-memory idempotency guard (`invocationDoneKey = "_atmos_source_provisioned"`) to `AutoProvisionSource`. A named-return `defer` sets the marker in `componentConfig` on successful return. Any subsequent call with the same map (same in-memory invocation) short-circuits immediately. The guard is scoped to the per-invocation `componentConfig`; separate `atmos` runs are unaffected. A simplified sketch of the pattern follows this list.
- `pkg/provisioner/source/provision_hook_test.go` — two regression tests:
  - `TestAutoProvisionSource_InvocationGuard_PreventsDoubleProvisioning`: asserts the guard short-circuits a second call even with `ttl: "0s"`
  - `TestAutoProvisionSource_InvocationGuard_SetAfterProvisioning`: asserts the marker is written to `componentConfig` after a skipped provision (TTL not expired), ensuring the hook path is a no-op
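A simplified sketch of the idempotency-guard pattern described above, assuming a map-typed `componentConfig`; the `provision` callback is a hypothetical stand-in for the real provisioning logic:

```go
package source

// invocationDoneKey marks a componentConfig map as already provisioned for
// the current in-memory invocation, so a second call (e.g. from the
// before.terraform.init hook) becomes a no-op.
const invocationDoneKey = "_atmos_source_provisioned"

// autoProvisionSource illustrates the guard: check the marker first, and set
// it via a named-return defer only when provisioning succeeded.
func autoProvisionSource(componentConfig map[string]any, provision func() error) (err error) {
	// Short-circuit if this componentConfig was already provisioned in this invocation.
	if done, ok := componentConfig[invocationDoneKey].(bool); ok && done {
		return nil
	}

	// Named-return defer: mark the map only on successful return.
	defer func() {
		if err == nil {
			componentConfig[invocationDoneKey] = true
		}
	}()

	return provision()
}
```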
Original prompt
This PR resolves the following issue:
Bug: JIT source provisioning TTL expiry deletes varfiles/backend, then runs tofu, causing an error
Describe the Bug
When using Just-In-Time (JIT) source provisioning, the `source.ttl` cleanup runs concurrently with — or before — the tofu subprocess, not after it completes. If the TTL expires at any point while `tofu init`, `tofu plan`, or any other tofu command is executing, Atmos deletes the varfiles and backend configuration out from under the running process.
The most reliable way to trigger this is `ttl: "0s"`, which expires immediately and causes a deterministic failure every time. However, any positive TTL short enough to expire before the tofu subprocess finishes (e.g. `"30s"` on a slow network or large module download) will produce the same failure.
The result is a hard failure from tofu because the generated varfile (and/or backend file) no longer exists on disk:
```
│ Error: Failed to read variables file
│
│ Given variables file /tmp/atmos-workdir-*/component.tfvars.json does not exist.
```
Expected Behavior
The TTL cleanup should be scoped to between invocations, not during one. Provisioned files should never be deleted while the subprocess that depends on them is still running. Specifically:
- TTL expiry should only be evaluated before provisioning (stale cache check), not during or after subprocess execution.
- The provisioned workdir should be treated as a lock for the duration of the current command — held open until the subprocess exits, then subject to TTL-based cleanup on the next invocation.
A `source.ttl: "0s"` is the degenerate case that makes this deterministic, but the fix must cover all TTL values.
Actual Behavior
Atmos generates the varfiles and backend, the TTL of `0s` immediately expires them, Atmos wipes them, and tofu fails:
```
│ Error: Failed to read variables file
│
│ Given variables file demo-null-label.terraform.tfvars.json does not exist.
```
Steps to Reproduce
The script below is fully self-contained. It requires only `atmos` and `tofu` on `PATH` and network access to GitHub. Save it as `repro.sh` and run it.
```bash
#!/usr/bin/env bash
# ============================================================
# REPRO: JIT ttl:"0s" deletes varfiles before tofu can read them
# ============================================================
set -euo pipefail

WORKDIR="$(mktemp -d -t atmos-repro-XXXXXX)"
echo "Working in: ${WORKDIR}"
cd "${WORKDIR}"

# --- 1) atmos.yaml ---
cat <<'EOF' > atmos.yaml
base_path: "."

components:
  terraform:
    base_path: "components/terraform"
    command: "tofu"
    workspaces_enabled: true
    apply_auto_approve: false
    deploy_run_init: true
    init_run_reconfigure: true
    auto_generate_backend_file: true

stacks:
  name_template: "{{ .vars.name }}"
  base_path: "stacks"
  included_paths:
    - "**/*"
EOF

# --- 2) Stack with ttl: "0s" on the JIT source ---
mkdir -p stacks
cat <<'EOF' > stacks/demo.yaml
vars:
  name: demo

terraform:
  backend_type: local

components:
  terraform:
    null-label:
      vars:
        # terraform-null-label variables
        namespace: "eg"
        stage: "test"
        name: "demo"
        enabled: true
      source:
        uri: "git::https://github.com/cloudposse/terraform-null-label.git"
        version: "0.25.0"
        ttl: "0s"  # <-- triggers the bug: files are wiped before tofu reads them
        provision:
          workdir:
            enabled: true
EOF

echo
echo "== tree =="
find . -maxdepth 4 -type f -print | sed 's|^\./||'

echo
echo "== discovered stacks =="
atmos describe stacks

echo
echo "== describe component =="
atmos describe component null-label -s demo

echo
echo "== init (this is where the failure occurs with ttl:0s) =="
atmos terraform init null-label -s demo

echo
echo "== plan =="
atmos terraform plan null-label -s demo

echo "Done. Workspace preserved at: ${WORKDIR}"
```
Run:
```bash
bash repro.sh 2>&1 | tee repro.log
```
Screenshots
No response
Environment
Atmos 1.212.0 on darwin/arm64
Additional Context
No response
Comments on the Issue
- Fixes #2252
fix: retry Atmos Pro uploads on transient 401/5xx with exponential backoff @osterman (#2255)
what
- Add retry logic with exponential backoff (1s, 2s, 4s) to all three Atmos Pro upload methods: `UploadInstanceStatus`, `UploadAffectedStacks`, and `UploadInstances` (a simplified sketch of the retry policy follows this list)
- On 401 errors, re-exchange the OIDC token before retrying (handles JWT secret mismatches across deployment instances)
- On 5xx or network errors, retry with backoff without re-auth
- On 400/403/404, fail immediately without retrying
- Upload failures no longer cause non-zero exit codes — the terraform plan/apply result drives the command exit code, not telemetry
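A simplified sketch of the retry policy described above; `doUpload`, `refreshToken`, and `apiError` are hypothetical stand-ins for the real upload client:

```go
package pro

import (
	"errors"
	"fmt"
	"net/http"
	"time"
)

// apiError is an illustrative error type carrying the HTTP status of a failed upload.
type apiError struct{ status int }

func (e *apiError) Error() string { return fmt.Sprintf("atmos pro API returned %d", e.status) }

// uploadWithRetry retries up to three times with 1s/2s/4s backoff,
// re-exchanging the token on 401, retrying 5xx and network errors as-is,
// and failing fast on other client errors such as 400/403/404.
func uploadWithRetry(doUpload func() error, refreshToken func() error) error {
	backoff := time.Second
	var err error
	for attempt := 0; attempt < 4; attempt++ {
		if attempt > 0 {
			time.Sleep(backoff)
			backoff *= 2 // 1s -> 2s -> 4s
		}
		if err = doUpload(); err == nil {
			return nil
		}
		var apiErr *apiError
		if errors.As(err, &apiErr) {
			switch {
			case apiErr.status == http.StatusUnauthorized:
				// Transient 401: re-exchange the OIDC token, then retry.
				if rerr := refreshToken(); rerr != nil {
					return rerr
				}
			case apiErr.status >= 500:
				// 5xx: retry with backoff, no re-auth needed.
			default:
				// 400/403/404 and other client errors: fail immediately.
				return err
			}
		}
		// Network errors (non-apiError) fall through to another retry.
	}
	return err
}
```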
why
- Intermittent 401 errors occurred on upload PATCH calls even though the OIDC exchange had succeeded seconds earlier, likely due to a JWT signed by a different deployment instance with a mismatched secret
- Happens most often when multiple Atmos commands run in parallel (e.g., 14 concurrent stack operations in a matrix workflow)
- Retrying the same job usually works, indicating transient failures
- Upload telemetry should never block or fail the primary terraform workflow
references
- #2216 (original upload implementation)
Summary by CodeRabbit
- New Features
- Automatic upload retries with exponential backoff and automatic API token refresh on auth failures.
- More structured API error classification to improve retry and auth decisioning.
- Bug Fixes
- Uploads and status-reporting failures now log warnings and no longer cause commands to fail.
- Tests
- Comprehensive tests for retry logic, token refresh behavior, and API error handling.
fix: prevent IRSA credentials from overriding Atmos-managed credentials on EKS pods @osterman (#2143)
what
- Prevent IRSA/pod-injected AWS env vars from overriding Atmos-managed credentials in subprocess execution
- Pass `os.Environ()` through `PrepareShellEnvironment` to sanitize it (delete problematic vars), then pass the sanitized env to the subprocess via `WithBaseEnv` — avoiding re-reading `os.Environ()`, which would reintroduce IRSA vars
- Add `SanitizedBaseEnv` field to `ConfigAndStacksInfo` to carry the sanitized environment through the hooks→terraform/helmfile/packer pipeline
- Add `WithBaseEnv` variadic option to `ExecuteShellCommand` for backward-compatible sanitized env injection
- Fix `auth exec` and `auth shell` to use the sanitized env directly instead of re-reading `os.Environ()`
why
On EKS pods with IRSA (IAM Roles for Service Accounts), the pod identity webhook injects AWS_WEB_IDENTITY_TOKEN_FILE, AWS_ROLE_ARN, and AWS_ROLE_SESSION_NAME into the pod environment. When using Atmos auth on ARC (Actions Runner Controller), these IRSA vars leaked into terraform subprocesses because three code paths re-read os.Environ() after auth sanitization:
- Hooks path (terraform/helmfile/packer): `authenticateAndWriteEnv` only passed `ComponentEnvSection` (stack YAML vars) to `PrepareShellEnvironment` — IRSA vars weren't in the input, so `delete()` was a no-op. Then `ExecuteShellCommand` re-read `os.Environ()` as the base.
- `auth exec`: `executeCommandWithEnv` re-read `os.Environ()` to build the subprocess env.
- `auth shell`: `ExecAuthShellCommand` → `MergeSystemEnvSimpleWithGlobal` re-read `os.Environ()`.
AWS SDK credential chain gives web identity tokens higher precedence than shared credential files, so the pod's runner role was used instead of the Atmos-managed tfplan role, causing AccessDenied errors.
Approach
Instead of setting cleared vars to empty string (which pollutes the subprocess env), we pass a clean, sanitized environment:
- `authenticateAndWriteEnv` now passes `os.Environ()` + `ComponentEnvSection` to `PrepareShellEnvironment`, which deletes problematic keys
- The sanitized result is stored as `SanitizedBaseEnv` on `ConfigAndStacksInfo`
- `ExecuteShellCommand` accepts `WithBaseEnv(info.SanitizedBaseEnv)` to use the sanitized env instead of re-reading `os.Environ()`
- `auth exec` and `auth shell` pass the sanitized env directly to the subprocess, bypassing the re-read (a simplified sketch of the sanitize-and-exec pattern follows this list)
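A simplified sketch of the sanitize-and-exec pattern: drop the IRSA vars from the base environment and hand the sanitized slice to the subprocess instead of re-reading `os.Environ()`. Names here are illustrative; the real list of cleared variables lives in `PrepareShellEnvironment`:

```go
package exec

import (
	"os"
	osexec "os/exec"
	"strings"
)

// irsaVars is an illustrative subset of the pod-injected variables that must
// not leak into subprocesses.
var irsaVars = map[string]bool{
	"AWS_WEB_IDENTITY_TOKEN_FILE": true,
	"AWS_ROLE_ARN":                true,
	"AWS_ROLE_SESSION_NAME":       true,
}

// sanitizeBaseEnv drops IRSA vars from a KEY=VALUE environment slice instead
// of blanking them, so the subprocess env stays clean.
func sanitizeBaseEnv(env []string) []string {
	out := make([]string, 0, len(env))
	for _, kv := range env {
		key, _, _ := strings.Cut(kv, "=")
		if irsaVars[key] {
			continue
		}
		out = append(out, kv)
	}
	return out
}

// runWithSanitizedEnv passes the sanitized slice to the subprocess directly,
// never re-reading os.Environ() (which would reintroduce the IRSA vars).
func runWithSanitizedEnv(name string, args ...string) error {
	cmd := osexec.Command(name, args...)
	cmd.Env = sanitizeBaseEnv(os.Environ())
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}
```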
references
Fixes credential precedence conflict where IRSA vars override Atmos-managed credentials on EKS pods running ARC (DEV-4216)
Summary by CodeRabbit
- Bug Fixes
- Prevented AWS IRSA env vars from leaking into subprocesses by removing auth-related variables from the sanitized base environment so spawned commands use Atmos credentials.
- Ensured credential-chain caching no longer skips the final role, forcing proper re-authentication when needed.
- Refactor
- Preserve and propagate a sanitized environment end-to-end for shell/exec paths so child processes receive the corrected env list.
- Tests
- Updated and added tests to validate env sanitization and subprocess propagation.
- Documentation
- Added guidance describing the credential-chain caching fix and expected behavior.
fix: thread auth identity through describe/list affected for S3 state reads @osterman (#2250)
what
- Thread `AuthManager` through the entire describe affected call chain so `ExecuteDescribeStacks` receives the identity credentials instead of `nil`
- Fix `GetTerraformState` to use the resolved component-specific `AuthContext` for S3 backend reads instead of the (potentially nil) passed-in `authContext` (a simplified sketch follows this list)
- Add per-component identity resolution in `ExecuteDescribeStacks` gated behind `processYamlFunctions`, so each component can use its own identity for `!terraform.state` reads
- Wire the `--identity`/`-i` flag through the `list affected` command, which had the flag registered (inherited from `listCmd`) but never read it or created an `AuthManager`
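A simplified sketch of the core idea behind the `GetTerraformState` fix: the S3 read is built from the resolved per-component credentials rather than the ambient default chain. Function and parameter names here are hypothetical, not the actual Atmos wiring:

```go
package exec

import (
	"context"
	"io"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// readStateWithIdentity reads a Terraform state object using the caller's
// resolved aws.Config (the per-component identity handed down the call chain),
// instead of falling back to the default credential chain when the auth
// context is nil.
func readStateWithIdentity(ctx context.Context, cfg aws.Config, bucket, key string) ([]byte, error) {
	client := s3.NewFromConfig(cfg) // built from the resolved identity, not process defaults
	out, err := client.GetObject(ctx, &s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		return nil, err
	}
	defer out.Body.Close()
	return io.ReadAll(out.Body)
}
```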
why
- Customer reported `atmos list affected --ref refs/heads/main` failing with S3 auth errors despite a valid `atmos auth` identity
- Debug logs showed `resolveAuthManagerForNestedComponent` correctly created per-component AuthManagers, but the credentials were never used for the actual S3 `GetObject` call
- Four independent bugs: (1) AuthManager dropped in the describe affected call chain, (2) `GetTerraformState` ignored the resolved AuthContext for backend reads, (3) no per-component identity resolution in `ExecuteDescribeStacks`, (4) `list affected` never read the `--identity` flag
- Running inside `atmos auth shell` worked because it sets the `ATMOS_IDENTITY` env var (viper fallback), but an explicit `-i admin-account` was silently ignored by `list affected`
references
- `docs/fixes/2026-03-25-describe-affected-auth-identity-not-used.md` — detailed fix documentation
- `docs/fixes/nested-terraform-state-auth-context-propagation.md` — original nested auth fix
- `docs/fixes/2026-03-03-yaml-functions-auth-multi-component.md` — multi-component auth fix
Summary by CodeRabbit
- New Features
- Added `--identity` flag to `list affected` for explicit identity selection.
- Bug Fixes
- Ensure authentication context is propagated into affected/describe flows.
- Terraform backend state reads now use the resolved identity/auth for S3.
- Per-component identity resolution applied during stack processing.
- Documentation
- Added end-to-end fix description for affected/describe identity handling.
- Tests
- Added and updated tests covering identity parsing and auth-manager propagation.
fix: preserve deleted and deletion_type fields in upload strip @milldr (#2249)
What
Preserve deleted and deletion_type fields in StripAffectedForUpload so they reach Atmos Pro when using --upload.
Why
StripAffectedForUpload constructs a new schema.Affected with only the fields needed by Atmos Pro, but it was missing Deleted and DeletionType. This caused deleted components to arrive at Atmos Pro without their deletion metadata, making them appear as "disabled" instead of "deleted".
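A minimal sketch of the fix, with an illustrative stand-in for `schema.Affected` (only the relevant fields shown):

```go
package pro

// Affected is a trimmed, illustrative stand-in for schema.Affected.
type Affected struct {
	Component    string
	Stack        string
	Deleted      bool
	DeletionType string
}

// stripAffectedForUpload builds the reduced copy sent with --upload. The
// stripped copy must carry the deletion metadata, otherwise deleted
// components show up in Atmos Pro as merely "disabled".
func stripAffectedForUpload(in Affected) Affected {
	return Affected{
		Component:    in.Component,
		Stack:        in.Stack,
		Deleted:      in.Deleted,      // previously dropped
		DeletionType: in.DeletionType, // previously dropped
	}
}
```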
References
- Previous fix (dependents crash): #2237
- Atmos Pro PR: cloudposse-corp/apps#933
- Linear: AP-161
Summary by CodeRabbit
- Bug Fixes
- Fixed an issue where deletion-related information was not being properly preserved during the data upload process.