cloudposse/atmos v1.220.0-rc.3 on GitHub

feat(components): add retry block for transient terraform errors @osterman (#2431)

## what

Add a per-component retry: block under components.terraform.<name> that wraps each terraform subprocess invocation (init, workspace select, workspace new, plan/apply/etc.) in an independent retry loop with configurable backoff.
Introduce retry.conditions: — a list of regex patterns matched against captured stdout/stderr; only errors whose output matches at least one condition retry, everything else fails fast. Patterns may be wrapped in /.../ for readability.
Extend schema.RetryConfig with Conditions []string (backwards-compatible — the existing struct is also used by workflows / vendor / task retry configs).
Plumb the new block through stack inheritance: abstract components define a default policy, concrete components and overrides.retry deep-merge on top.
Add JSON schema for retry referenced from terraform, terraform_component_manifest, and overrides; ship a docs page, blog post, and roadmap milestone under the CI/CD initiative.
Add pkg/retry/conditions.go (regex compile + match) and pkg/schema/retry_decode.go (mapstructure decoder with the duration hook) so logic stays out of internal/exec/.
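
As an illustration of the condition matching described above, here is a minimal Go sketch. It is not the code in pkg/retry/conditions.go; the function names `compileConditions` and `shouldRetry` are hypothetical, and the only behavior taken from the description is that patterns may be wrapped in /.../ and that output must match at least one condition to be retried.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// compileConditions turns user-supplied patterns into compiled regexes.
// A pattern wrapped in /.../ has the surrounding slashes stripped first.
// (Hypothetical helper; the real package may differ.)
func compileConditions(patterns []string) ([]*regexp.Regexp, error) {
	compiled := make([]*regexp.Regexp, 0, len(patterns))
	for _, p := range patterns {
		if len(p) > 2 && strings.HasPrefix(p, "/") && strings.HasSuffix(p, "/") {
			p = p[1 : len(p)-1]
		}
		re, err := regexp.Compile(p)
		if err != nil {
			return nil, fmt.Errorf("invalid retry condition %q: %w", p, err)
		}
		compiled = append(compiled, re)
	}
	return compiled, nil
}

// shouldRetry reports whether the captured output matches any condition.
// With no conditions configured, nothing is retried.
func shouldRetry(conditions []*regexp.Regexp, output string) bool {
	for _, re := range conditions {
		if re.MatchString(output) {
			return true
		}
	}
	return false
}

func main() {
	conds, err := compileConditions([]string{"/502 Bad Gateway/", "connection reset"})
	if err != nil {
		panic(err)
	}
	// A transient download error matches; a schema error does not.
	fmt.Println(shouldRetry(conds, "Error: 502 Bad Gateway while downloading provider")) // true
	fmt.Println(shouldRetry(conds, "Error: Unsupported argument"))                       // false
}
```

Returning false for an empty condition list is what keeps the feature opt-in: a component with a retry: block but no conditions still fails fast.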

## why

Unattended atmos terraform plan/apply runs in CI repeatedly fail with transient infrastructure errors that have nothing to do with the Terraform code — most commonly 502 Bad Gateway during provider downloads, but also connection reset, TLS handshake timeout, and state-backend timeouts.
Today the only recovery is a manual re-run, which is painful for fleet operations and unattended pipelines.
Workflows, vendoring, and source extraction already use pkg/retry. This PR exposes the same robust primitive to components without duplicating logic.
The design is intentionally pattern-driven (opt-in per regex) so real terraform plan failures (exit-code 2, schema errors, etc.) are never silently retried — the foot-gun of "retry everything" is avoided by requiring conditions: to opt in.
Each subprocess invocation is wrapped independently (wrap(exec), wrap(exec), wrap(exec)), not as one outer retry around the whole pipeline, so apply doesn't lose its budget to init.

## references

New docs page: /stacks/components/terraform/retry
Blog post: website/blog/2026-05-18-terraform-component-retry.mdx (tag: feature)
Roadmap milestone added to the CI/CD Simplification initiative
Reuses pkg/retry.WithPredicate and the existing WithStdoutCapture / WithStderrCapture shell options — no duplication

Summary by CodeRabbit

New Features
- Per-component retry for Terraform subprocesses: opt-in retry: block with regex-driven conditions, max attempts, backoff, delays, jitter; wraps init, workspace and main commands; inherited from abstract components and overridable; no retry occurs by default without conditions.
Documentation
- New docs and blog post explain configuration, inheritance, and safety defaults.
Tests
- Extensive unit tests added covering decoding, merge/precedence, execution retry logic, and condition matching.

feat(website): per-page raw .md routes + Copy Markdown button on atmos.tools @osterman (#2503)

## what

Adds per-page .md routes to atmos.tools — every doc URL is mirrored at <url>.md with Content-Type: text/markdown.
Adds a "Copy Markdown" / "View Markdown" split button above each doc page.
Adds <link rel="alternate" type="text/markdown" href="<url>.md"> to each doc page's <head> so crawlers and LLM tooling can discover the alternate.
Introduces an AST-based MDX→Markdown normalizer (website/plugins/docusaurus-plugin-llms-txt/src/mdx-normalize.mjs) with a per-component handler table covering <Intro>, <Tabs>/<TabItem>, <Terminal>, <File>, <Note>, <Step>, <dl><dt><dd>, marketing cards, and more. Unknown components unwrap to their children. Covered by 19 unit tests.
Reuses the same normalizer for llms-full.txt, replacing a lossy regex-strip that dropped JSX content wholesale — the LLM corpus file is now meaningfully richer.
Fixes a pre-existing bug in the llms-txt plugin: pages with frontmatter id:/slug: overrides were silently dropped from llms.txt / llms-full.txt because the resolver searched by filename. Switched to Docusaurus's .docusaurus/<plugin>/<id>/*.json cache for the authoritative permalink → source map. Pages processed went from 554 → 735.

## why

LLM-driven workflows (Claude Code, ChatGPT, custom agents) are now first-class consumers of our docs. A raw Markdown alternate makes our docs trivially feedable to any LLM without HTML scraping.
A rel="alternate" Markdown link is a standard discovery pattern for agentic crawlers — no special-case scraping needed.
The MDX wholesale-strip approach was corrupting llms-full.txt (tab content, flag tables, intros all silently dropped). The AST normalizer preserves structure correctly.
The id:/slug: override bug meant ~25% of CLI command pages were missing from the LLM corpus entirely.

## references

Inspiration: FlyNumber/markdown_docusaurus_plugin (UX reference; we extended our existing plugin rather than adopting it).
Parity with atmos-pro's recently-shipped Copy Markdown affordance.
Blog post: website/blog/2026-05-24-copy-markdown-button.mdx

Summary by CodeRabbit

New Features
- Docs pages available as raw .md URLs; "Copy Markdown" and "View Markdown" controls added to doc UI
- Site generates per-page Markdown files and a synthesized index for discovery
Documentation
- Pages include rel="alternate" Markdown links
- MDX content normalized into portable Markdown while preserving tabs, code/terminal/file blocks, notes, and definition lists
Tests
- Added coverage for HTML-comment and truncate-marker behavior
Chores
- Added Markdown parsing dependency; improved deploy script/workflow to ensure text files use UTF-8 charset

feat: implement remote stack imports @osterman (#2037)

## what

Add support for importing stack configurations from remote URLs (HTTP, Git, S3, GCS) using go-getter
Stack imports now work consistently with remote imports for atmos.yaml
New pkg/stack/imports package handles URL detection and remote downloading
Updated stack_processor_utils.go to detect remote URLs and download them automatically
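
To make the URL detection step concrete, here is a small Go sketch. It is an assumption, not the logic in pkg/stack/imports: `isRemoteImport` and the prefix list are hypothetical, chosen from the schemes named above plus go-getter's forced-getter prefixes (`git::`, `s3::`, `gcs::`). go-getter also recognizes shorthand such as bare github.com paths, which this sketch does not cover.

```go
package main

import (
	"fmt"
	"strings"
)

// remotePrefixes lists the URL schemes and go-getter forced-getter prefixes
// treated as remote in this sketch. (Assumed list, not taken from Atmos.)
var remotePrefixes = []string{"http://", "https://", "git::", "s3::", "gcs::"}

// isRemoteImport reports whether an import entry looks like a remote source
// rather than a path relative to the stacks directory.
func isRemoteImport(path string) bool {
	for _, p := range remotePrefixes {
		if strings.HasPrefix(path, p) {
			return true
		}
	}
	return false
}

func main() {
	imports := []string{
		"catalog/vpc", // local import
		"https://example.com/stacks/vpc.yaml",
		// go-getter syntax: "//" selects a subpath, "?ref=" pins a version.
		"git::https://example.com/catalog.git//stacks/vpc.yaml?ref=v1.2.0",
	}
	for _, imp := range imports {
		fmt.Printf("%-70s remote=%v\n", imp, isRemoteImport(imp))
	}
}
```

The example.com URLs are placeholders. In a design like this, only entries classified as remote would be handed to go-getter for download and caching; everything else stays on the existing local import path.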

## why

This feature was documented but not yet implemented (fixes #2036)
Teams need to share stack configurations across multiple repositories without vendoring
Enables central catalogs, version-pinned imports, and cross-team config sharing
Provides consistency between atmos.yaml imports and stack file imports

## references

closes #2036
Blog post: website/blog/2026-01-29-remote-stack-imports.mdx
Example: examples/remote-stack-imports/
Documentation: Stack Imports

Summary by CodeRabbit

New Features
- Remote stack imports with local caching for HTTP(S), Git, S3, and GCS; skip-if-missing and version-pinning support.
Documentation
- New example project and README demonstrating local+remote import composition; blog post and roadmap entry announcing the feature.
Chores
- Improved on-disk and in-memory caching, atomic cache writes, and cross-platform file-locking.
Tests
- Expanded unit and integration tests covering URI classification, downloading, caching, locking, and CLI scenarios.

feat(aws/security): add SARIF/OCSF exports and harden CI @osterman (#2483)

## What

This PR adds machine-readable security export formats to atmos aws security analyze and includes the CI hardening needed to keep the branch green.

AWS security exports

Adds --format=sarif for SARIF 2.1.0 output compatible with GitHub code scanning, Azure DevOps, and SARIF viewers.
Adds --format=ocsf for OCSF 1.4.0 Detection Finding output for SIEM and security data lake ingestion.
Preserves Atmos context in exported findings, including stack, component, component path, remediation steps, deploy command, mapped physical locations, and logical fallback locations for unmapped resources.
Produces deterministic output ordering for stable diffs and deduplication.
Maps Atmos severities into SARIF/GHAS levels and OCSF severity/status fields.
Adds schema-backed and structural test coverage for SARIF and OCSF renderers, determinism, empty/nil inputs, mapped/unmapped findings, compliance reports, and malformed SARIF rejection.
Updates CLI docs, blog content, roadmap data, and PRD notes. The experimental Atmos Pro upload surface was removed before merge; the design is preserved in docs/prd/atmos-pro-security-findings-upload.md for later revival.
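
The severity mapping and deterministic ordering can be sketched in Go. The mapping below is an assumption for illustration, not the table Atmos ships: the severity labels and the helper names are hypothetical. The target values are the standard ones: SARIF 2.1.0 result levels are "none", "note", "warning", and "error", and OCSF severity_id uses 0 for Unknown through 5 for Critical.

```go
package main

import (
	"fmt"
	"sort"
)

// finding is a pared-down stand-in for an Atmos security finding.
type finding struct {
	ID       string
	Severity string // assumed labels: CRITICAL, HIGH, MEDIUM, LOW, INFORMATIONAL
}

// sarifLevel maps a severity label to a SARIF 2.1.0 result level.
// (Assumed mapping.)
func sarifLevel(sev string) string {
	switch sev {
	case "CRITICAL", "HIGH":
		return "error"
	case "MEDIUM":
		return "warning"
	case "LOW", "INFORMATIONAL":
		return "note"
	default:
		return "none"
	}
}

// ocsfSeverityID maps a severity label to an OCSF severity_id.
func ocsfSeverityID(sev string) int {
	switch sev {
	case "INFORMATIONAL":
		return 1
	case "LOW":
		return 2
	case "MEDIUM":
		return 3
	case "HIGH":
		return 4
	case "CRITICAL":
		return 5
	default:
		return 0 // Unknown
	}
}

// sortFindings orders findings by ID so repeated runs over the same input
// render byte-identical output.
func sortFindings(fs []finding) {
	sort.SliceStable(fs, func(i, j int) bool { return fs[i].ID < fs[j].ID })
}

func main() {
	fs := []finding{{"b-2", "MEDIUM"}, {"a-1", "CRITICAL"}}
	sortFindings(fs)
	for _, f := range fs {
		fmt.Println(f.ID, sarifLevel(f.Severity), ocsfSeverityID(f.Severity))
	}
}
```

Sorting before rendering is what makes diffs between two exports meaningful: a change in the file reflects a change in findings rather than in API response order.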

CI and workflow hardening

Upgrades actions/checkout usage from v4 to v6 across workflows and docs/examples; updates the SHA-pinned Atmos Pro checkout to v6.0.2.
Grants packages: read so CI jobs pulling from GHCR can authenticate with the workflow token.
Grants reviewdog the PR/check permissions it needs for tflint annotations and uses github.token explicitly.
Uses opentofu/setup-opentofu@v1 on Windows instead of installing OpenTofu through the Atmos toolchain path that was failing signature verification.
Fixes demo-stack wttr.in URLs, changes Swedish language code from se to sv, and adds HTTP retries so screengrab generation is less brittle.

## Why

SARIF and OCSF let Atmos security findings flow directly into standard security workflows instead of requiring users to translate markdown/json output themselves. The CI changes address failures observed while validating this branch: checkout/auth instability, Windows OpenTofu setup failures, reviewdog token scope failures, and live wttr.in request failures in screengrab generation.

## Validation

GITHUB_TOKEN=$(gh auth token) node .github/actions/verify-sha-pinning/test.mjs
make -C demo/screengrabs build-all
terraform fmt -check examples/demo-stacks/components/terraform/myapp/main.tf
pre-commit run check-yaml --files .github/workflows/test.yml examples/demo-stacks/stacks/deploy/dev.yaml examples/demo-stacks/component.yaml
git diff --check

## References

SARIF 2.1.0 spec: https://docs.oasis-open.org/sarif/sarif/v2.1.0/sarif-v2.1.0.html
GitHub SARIF support: https://docs.github.com/en/code-security/code-scanning/integrating-with-code-scanning/sarif-support-for-code-scanning
OCSF schema: https://schema.ocsf.io/

Summary by CodeRabbit

New Features
- Added SARIF 2.1.0 and OCSF 1.4.0 export options; CLI accepts --format=sarif/ocsf, emits deterministic, Atmos‑enriched outputs and records UTC invocation/audit metadata.
- Integrated Amazon Inspector2 native findings into security analysis with normalization, deduplication and preferred native results.
Tests
- Extensive unit and JSON‑schema test suites for SARIF and OCSF ensuring spec conformance and byte‑stable output.
Documentation
- Docs, blog post, PRDs, and roadmap updated for SARIF/OCSF support and usage examples.
Chores
- CI checkout action pinned; NOTICE dependency list updated.

🚀 Enhancements

fix(aws/security): allow unlimited findings and record invocation in OCSF @osterman (#2517)

## what

atmos aws security analyze --max-findings 0 (or any non-positive value) now fetches all matching findings from Security Hub / Inspector instead of silently capping at 500.
The fetcher emits a log.Warn whenever pagination halts at the limit while NextToken != nil, so truncation is never silent again.
Every OCSF event now carries the literal command line, arguments, timing, exit code, working directory, and scanned scope under unmapped["atmos.invocation"] — the OCSF analogue of SARIF's run.invocations[] (which Atmos already emits).
Default behavior is unchanged (--max-findings still defaults to 500); only the previously-broken 0 semantics now work, and the help text + docs document it.
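
The limit semantics above can be sketched in Go. This is a simplified model under stated assumptions, not the Atmos fetcher: `effectiveLimit`, `fetchAll`, `page`, and `fakePager` are hypothetical names, and the pager is an in-memory stand-in for Security Hub / Inspector pagination.

```go
package main

import (
	"fmt"
	"strconv"
)

const defaultMaxFindings = 500

// effectiveLimit resolves the flag: nil means it was not set (use the
// default); a non-positive value means unlimited, represented here as 0.
func effectiveLimit(flag *int) int {
	if flag == nil {
		return defaultMaxFindings
	}
	if *flag <= 0 {
		return 0
	}
	return *flag
}

// page is a hypothetical page of results; NextToken is nil on the last page.
type page struct {
	Items     []string
	NextToken *string
}

// fetchAll pages through results until they are exhausted or the limit is
// reached. The bool is true when the limit cut off remaining results.
func fetchAll(next func(token *string) page, limit int) ([]string, bool) {
	var items []string
	var token *string
	for {
		p := next(token)
		for _, it := range p.Items {
			if limit > 0 && len(items) == limit {
				return items, true // limit hit with items still unread
			}
			items = append(items, it)
		}
		if p.NextToken == nil {
			return items, false
		}
		if limit > 0 && len(items) == limit {
			return items, true // limit hit while NextToken != nil
		}
		token = p.NextToken
	}
}

// fakePager serves `total` items in pages of `pageSize`; the token is the
// index of the next item.
func fakePager(total, pageSize int) func(*string) page {
	return func(token *string) page {
		start := 0
		if token != nil {
			start, _ = strconv.Atoi(*token)
		}
		end := start + pageSize
		if end > total {
			end = total
		}
		var p page
		for i := start; i < end; i++ {
			p.Items = append(p.Items, "finding-"+strconv.Itoa(i))
		}
		if end < total {
			t := strconv.Itoa(end)
			p.NextToken = &t
		}
		return p
	}
}

func main() {
	zero := 0
	items, truncated := fetchAll(fakePager(7, 2), effectiveLimit(&zero))
	fmt.Println(len(items), truncated) // 7 false
	items, truncated = fetchAll(fakePager(7, 2), 5)
	fmt.Println(len(items), truncated) // 5 true
}
```

A caller would log a warning whenever the second return value is true, which is the behavior described above for pagination that halts at the limit while more results exist.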

## why

The 500 cap is a CLI-layer default, not an AWS pagination limit. For multi-account orgs, real finding counts routinely exceed 500, so --format json/sarif/ocsf exports were silently incomplete and downstream tooling (SIEM ingestion, ticketing, dashboards) was missing data with no error or warning.
AI analysis users get cost protection by keeping 500 as the default; export users get correctness by opting in to --max-findings 0. The log.Warn covers the case where users forget — they'll see in the output that more findings exist.
SARIF already records the invocation, but OCSF Detection Finding 2004 has no native invocation slot. Auditors and SIEM analysts asking "what command produced this batch?" can now answer it from either format.

## references

Closes the silent-clamp issue surfaced during use of atmos aws security analyze for SIEM export pipelines.
SARIF 2.1.0 invocation spec: https://docs.oasis-open.org/sarif/sarif/v2.1.0/sarif-v2.1.0.html (already implemented in pkg/aws/security/sarif.go).
OCSF 1.4.0 Detection Finding: https://schema.ocsf.io/1.4.0/classes/detection_finding (no native invocation field; landed in unmapped extension).
Related shipped feature: #2483 (initial SARIF/OCSF export support).

Summary by CodeRabbit

New Features
- --max-findings now distinguishes "unset" from an explicit 0; 0 means unlimited and the effective default remains 500.
- CLI prints a clear info message when fetching all findings vs a limited fetch.
- OCSF exports now include report invocation metadata in each event when available.
Bug Fixes
- Warning logged when a positive limit truncates results to avoid silent loss; pagination now respects the limit.
Documentation
- CLI, config, and PRD docs updated to reflect the new semantics.
Tests
- Added tests covering max-findings precedence, pagination behaviors, and OCSF invocation attachment.

fix(terraform-state): honor target component's env section for AWS credentials @arcaven (#2502)

## what

Makes !terraform.state's in-process S3 backend reader honor a whitelisted subset of the target component's env section (AWS_PROFILE, AWS_REGION, AWS_DEFAULT_REGION, AWS_CONFIG_FILE, AWS_SHARED_CREDENTIALS_FILE, AWS_ENDPOINT_URL_S3, AWS_ENDPOINT_URL_STS, AWS_USE_FIPS_ENDPOINT), matching the behavior !terraform.output already exhibits via its subprocess env overlay. AWS_STS_REGIONAL_ENDPOINTS is intentionally excluded because it is an SDK v1 toggle and a no-op in SDK v2.
Adds internal/terraform_backend.ExtractComponentEnvOverlay (with a nil-pointer guard) and internal/terraform_backend.ComponentEnvKeysAWS. Threads the overlay through getCachedS3Client and ReadTerraformBackendS3. The S3 client cache key now includes every whitelisted key that affects client behavior (profile, both regions, both endpoint URLs, FIPS, config + credentials files) so two components with distinct settings never alias each other.
Extends pkg/aws/identity with LoadConfigWithAuthAndEnv. LoadConfigWithAuth becomes a thin nil-overlay wrapper, so every existing call site behaves identically. Within the new variant:
- AWS_USE_FIPS_ENDPOINT (truthy "true"/"1") is applied via config.WithUseFIPSEndpoint(aws.FIPSEndpointStateEnabled) — a global config setting.
- AWS_ENDPOINT_URL_STS is applied at sts.NewFromConfig in the assume-role flow — a per-service option in SDK v2.
- AWS_ENDPOINT_URL_S3 is applied at s3.NewFromConfig in getCachedS3Client for the same SDK v2 per-service-option reason.
Adds focused unit tests covering the overlay extraction (9 subcases including the nil-pointer guard), the whitelist surface stability, the credential-resolution precedence (5 cases including a sentinel that asserts LoadConfigWithAuth ≡ LoadConfigWithAuthAndEnv(..., nil)), and the FIPS application (4 subcases including an authContext-suppresses-overlay assertion).
Adds an advanced docs section to functions/yaml/terraform.state.mdx ("Switching AWS credentials per component via the env section") and a cross-reference note in functions/yaml/terraform.output.mdx. Placed alongside the existing specialized sections (SSE-C, GCS, static); primary examples untouched.
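
The overlay extraction and cache-key construction can be sketched in Go as follows. This is a simplified illustration, not the code in internal/terraform_backend: the signatures of `extractOverlay` and `cacheKey` are assumptions, and only the whitelist itself comes from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// awsEnvKeys mirrors the whitelist named in the PR description.
var awsEnvKeys = []string{
	"AWS_PROFILE", "AWS_REGION", "AWS_DEFAULT_REGION", "AWS_CONFIG_FILE",
	"AWS_SHARED_CREDENTIALS_FILE", "AWS_ENDPOINT_URL_S3", "AWS_ENDPOINT_URL_STS",
	"AWS_USE_FIPS_ENDPOINT",
}

// extractOverlay returns only whitelisted, non-empty string values from a
// component's env section. It returns nil when there is nothing to overlay,
// including when sections itself is nil.
func extractOverlay(sections map[string]any) map[string]string {
	env, ok := sections["env"].(map[string]any)
	if !ok {
		return nil
	}
	var overlay map[string]string
	for _, k := range awsEnvKeys {
		if v, ok := env[k].(string); ok && v != "" {
			if overlay == nil {
				overlay = make(map[string]string)
			}
			overlay[k] = v
		}
	}
	return overlay
}

// cacheKey includes every whitelisted key in a fixed order, so two
// components with different settings never share a cached client.
func cacheKey(bucket string, overlay map[string]string) string {
	parts := []string{bucket}
	for _, k := range awsEnvKeys {
		parts = append(parts, k+"="+overlay[k])
	}
	return strings.Join(parts, "|")
}

func main() {
	sections := map[string]any{
		"env": map[string]any{
			"AWS_PROFILE": "org-b-readonly", // hypothetical profile name
			"TF_LOG":      "debug",          // not whitelisted, ignored
		},
	}
	overlay := extractOverlay(sections)
	fmt.Println(overlay)
	fmt.Println(cacheKey("tfstate-bucket", overlay) != cacheKey("tfstate-bucket", nil))
}
```

Returning nil rather than an empty map when no whitelisted key is present keeps the "no overlay" case distinguishable, which is what lets the existing code path stay unchanged for components that do not use this pattern.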

## why

!terraform.output and !terraform.state are documented as interchangeable readers of the same state. In setups that distribute Terraform state across AWS organizations, they are not.
!terraform.output shells out to tofu/terraform. pkg/terraform/output.defaultEnvironmentSetup.SetupEnvironment overlays the target component's env section onto the subprocess environment as its final step, so env.AWS_PROFILE reaches the backend's credential resolution.
!terraform.state reads in-process via internal/terraform_backend.ReadTerraformBackendS3 → getCachedS3Client → pkg/aws/identity.LoadConfigWithAuth → config.LoadDefaultConfig. When no Atmos AWSAuthContext is provided, the SDK uses the calling process's AWS_PROFILE. componentSections["env"] is in scope on the same map but no function in this chain reads it.
In practice this means a stack in one AWS org calling !terraform.state against a stack in another org fails with AccessDenied on sts:AssumeRole (or silently reads the wrong account when buckets happen to share names), while the equivalent !terraform.output call works. The two functions are supposed to produce identical results; this is a correctness gap.
Backward compatibility was the design constraint. A component without any whitelisted env key produces a nil overlay and the resolved code path is byte-identical to the prior behavior. Users who don't use this pattern see no change. A sentinel test asserts this directly.
Atmos auth (AWSAuthContext) layers above the overlay and still wins outright. The env overlay path is intended for setups not yet on Atmos auth, which is the common case for the SweetOps community at the moment.
GCS and AzureRM backends share the bug structurally. We don't have GCS or Azure tfstate to validate those readers locally and would rather defer than ship code we haven't run. The fix shape generalises trivially (one new whitelist slice + one ExtractComponentEnvOverlay call per backend). We are calling that out in the issue and the docs so a contributor with those backends can pick it up.
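
The precedence described above (Atmos auth first, then the component env overlay, then the calling process's environment) can be sketched in Go. This is an assumption for illustration only: `authContext`, `resolveProfile`, and `fipsEnabled` are hypothetical stand-ins, and the real resolution happens inside pkg/aws/identity via the AWS SDK config loaders.

```go
package main

import (
	"fmt"
	"os"
)

// authContext is a hypothetical stand-in for the Atmos AWSAuthContext.
type authContext struct {
	Profile string
}

// resolveProfile applies the precedence: an auth context wins outright,
// then the overlay's AWS_PROFILE, then the process environment.
func resolveProfile(auth *authContext, overlay map[string]string, getenv func(string) string) string {
	if auth != nil {
		return auth.Profile
	}
	if p := overlay["AWS_PROFILE"]; p != "" {
		return p
	}
	return getenv("AWS_PROFILE")
}

// fipsEnabled treats "true" and "1" as truthy and ignores the overlay
// entirely when an auth context is present.
func fipsEnabled(auth *authContext, overlay map[string]string) bool {
	if auth != nil {
		return false
	}
	v := overlay["AWS_USE_FIPS_ENDPOINT"]
	return v == "true" || v == "1"
}

func main() {
	overlay := map[string]string{"AWS_PROFILE": "org-b-readonly", "AWS_USE_FIPS_ENDPOINT": "1"}

	// No auth context: the overlay is used.
	fmt.Println(resolveProfile(nil, overlay, os.Getenv), fipsEnabled(nil, overlay))

	// Auth context present: the overlay is suppressed.
	auth := &authContext{Profile: "atmos-sso"}
	fmt.Println(resolveProfile(auth, overlay, os.Getenv), fipsEnabled(auth, overlay))

	// Nil overlay: falls back to the process environment, as before.
	fmt.Println(resolveProfile(nil, nil, os.Getenv) == os.Getenv("AWS_PROFILE"))
}
```

The last case is the backward-compatibility property: with a nil overlay and no auth context, the result is whatever the calling process already had.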

## references

Closes #2501
Working reference implementation we're mirroring in-process: pkg/terraform/output/environment.go::SetupEnvironment (the final for k, v := range config.Env loop).
Function this PR extends: pkg/aws/identity.LoadConfigWithAuth.
In-process reader being fixed: internal/terraform_backend/terraform_backend_s3.go::ReadTerraformBackendS3 / getCachedS3Client.
Adjacent context for S3 path construction in !terraform.state: #1920.
Auth-chain inheritance for !terraform.output (referenced from terraform_output_utils.go): #1921.

Summary by CodeRabbit

New Features
- Per-component AWS env overlay for Terraform S3 remote-state reads; explicit auth/context still takes precedence. Identity loading now accepts optional env overlays and respects overlay vs. auth precedence.
Documentation
- Guidance added for cross-account remote-state reads and env-overlay behavior for terraform.output and terraform.state.
Tests
- Added tests for overlay extraction, precedence rules, FIPS behavior, and a stable ordered whitelist of AWS env keys.
Chores
- Minor log formatting adjustments.