feat(components): add retry block for transient terraform errors @osterman (#2431)
## what
- Add a per-component `retry:` block under `components.terraform.<name>` that wraps each terraform subprocess invocation (init, workspace select, workspace new, plan/apply/etc.) in an independent retry loop with configurable backoff.
- Introduce `retry.conditions:`, a list of regex patterns matched against captured stdout/stderr; only errors whose output matches at least one condition retry, everything else fails fast. Patterns may be wrapped in `/.../` for readability.
- Extend `schema.RetryConfig` with `Conditions []string` (backwards-compatible; the existing struct is also used by workflows / vendor / task retry configs).
- Plumb the new block through stack inheritance: abstract components define a default policy, concrete components and `overrides.retry` deep-merge on top.
- Add JSON schema for `retry` referenced from `terraform`, `terraform_component_manifest`, and `overrides`; ship a docs page, blog post, and roadmap milestone under the CI/CD initiative.
- Add `pkg/retry/conditions.go` (regex compile + match) and `pkg/schema/retry_decode.go` (mapstructure decoder with the duration hook) so logic stays out of `internal/exec/`.
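The condition matching described above can be sketched roughly as follows. This is a minimal illustration, not the code in `pkg/retry/conditions.go`: the function names `compileConditions` and `shouldRetry` are hypothetical, and the only behavior assumed is what the list states (optional `/.../` wrapping, match against captured output, no match means no retry).

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// compileConditions is a hypothetical helper: it strips an optional /.../
// wrapper from each pattern and compiles it as a Go regular expression.
func compileConditions(patterns []string) ([]*regexp.Regexp, error) {
	compiled := make([]*regexp.Regexp, 0, len(patterns))
	for _, p := range patterns {
		if len(p) >= 2 && strings.HasPrefix(p, "/") && strings.HasSuffix(p, "/") {
			p = p[1 : len(p)-1]
		}
		re, err := regexp.Compile(p)
		if err != nil {
			return nil, fmt.Errorf("invalid retry condition %q: %w", p, err)
		}
		compiled = append(compiled, re)
	}
	return compiled, nil
}

// shouldRetry reports whether the captured output matches at least one
// condition. With no conditions, nothing is retried (fail fast).
func shouldRetry(conds []*regexp.Regexp, output string) bool {
	for _, re := range conds {
		if re.MatchString(output) {
			return true
		}
	}
	return false
}

func main() {
	conds, err := compileConditions([]string{"/502 Bad Gateway/", "connection reset"})
	if err != nil {
		panic(err)
	}
	fmt.Println(shouldRetry(conds, "Error: 502 Bad Gateway while downloading provider"))
	fmt.Println(shouldRetry(conds, "Error: Unsupported argument"))
}
```

An empty condition list never matches, which mirrors the opt-in default described in this PR.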
## why
- Unattended `atmos terraform plan/apply` runs in CI repeatedly fail with transient infrastructure errors that have nothing to do with the Terraform code: most commonly `502 Bad Gateway` during provider downloads, but also `connection reset`, `TLS handshake timeout`, and state-backend timeouts.
- Today the only recovery is a manual re-run, which is painful for fleet operations and unattended pipelines.
- Workflows, vendoring, and source extraction already use `pkg/retry`. This PR exposes the same robust primitive to components without duplicating logic.
- The design is intentionally pattern-driven (opt-in per regex) so real `terraform plan` failures (exit-code 2, schema errors, etc.) are never silently retried; the foot-gun of "retry everything" is avoided by requiring `conditions:` to opt in.
- Each subprocess invocation is wrapped independently (`wrap(exec), wrap(exec), wrap(exec)`), not as one outer retry around the whole pipeline, so `apply` doesn't lose its budget to `init`.
## references
- New docs page: `/stacks/components/terraform/retry`
- Blog post: `website/blog/2026-05-18-terraform-component-retry.mdx` (tag: `feature`)
- Roadmap milestone added to the CI/CD Simplification initiative
- Reuses `pkg/retry.WithPredicate` and the existing `WithStdoutCapture`/`WithStderrCapture` shell options; no duplication
## Summary by CodeRabbit
- New Features
  - Per-component retry for Terraform subprocesses: opt-in `retry:` block with regex-driven conditions, max attempts, backoff, delays, jitter; wraps init, workspace and main commands; inherited from abstract components and overridable; no retry occurs by default without conditions.
- Documentation
  - New docs and blog post explain configuration, inheritance, and safety defaults.
- Tests
  - Extensive unit tests added covering decoding, merge/precedence, execution retry logic, and condition matching.
feat(website): per-page raw .md routes + Copy Markdown button on atmos.tools @osterman (#2503)
## what
- Adds per-page `.md` routes to atmos.tools: every doc URL is mirrored at `<url>.md` with `Content-Type: text/markdown`.
- Adds a "Copy Markdown" / "View Markdown" split button above each doc page.
- Adds `<link rel="alternate" type="text/markdown" href="<url>.md">` to each doc page's `<head>` so crawlers and LLM tooling can discover the alternate.
- Introduces an AST-based MDX→Markdown normalizer (`website/plugins/docusaurus-plugin-llms-txt/src/mdx-normalize.mjs`) with a per-component handler table covering `<Intro>`, `<Tabs>`/`<TabItem>`, `<Terminal>`, `<File>`, `<Note>`, `<Step>`, `<dl><dt><dd>`, marketing cards, and more. Unknown components unwrap to their children. 19 unit tests.
- Reuses the same normalizer for `llms-full.txt`, replacing a lossy regex-strip that dropped JSX content wholesale; the LLM corpus file is now meaningfully richer.
- Fixes a pre-existing bug in the llms-txt plugin: pages with frontmatter `id:`/`slug:` overrides were silently dropped from `llms.txt`/`llms-full.txt` because the resolver searched by filename. Switched to Docusaurus's `.docusaurus/<plugin>/<id>/*.json` cache for the authoritative permalink → source map. Pages processed went from 554 → 735.
## why
- LLM-driven workflows (Claude Code, ChatGPT, custom agents) are now first-class consumers of our docs. A raw Markdown alternate makes our docs trivially feedable to any LLM without HTML scraping.
- A `rel="alternate"` Markdown link is a standard discovery pattern for agentic crawlers; no special-case scraping needed.
- The MDX wholesale-strip approach was corrupting `llms-full.txt` (tab content, flag tables, intros all silently dropped). The AST normalizer preserves structure correctly.
- The `id:`/`slug:` override bug meant ~25% of CLI command pages were missing from the LLM corpus entirely.
## references
- Inspiration: FlyNumber/markdown_docusaurus_plugin (UX reference; we extended our existing plugin rather than adopting it).
- Parity with atmos-pro's recently-shipped Copy Markdown affordance.
- Blog post: `website/blog/2026-05-24-copy-markdown-button.mdx`
## Summary by CodeRabbit
- New Features
  - Docs pages available as raw .md URLs; "Copy Markdown" and "View Markdown" controls added to doc UI
  - Site generates per-page Markdown files and a synthesized index for discovery
- Documentation
  - Pages include rel="alternate" Markdown links
  - MDX content normalized into portable Markdown while preserving tabs, code/terminal/file blocks, notes, and definition lists
- Tests
  - Added coverage for HTML-comment and truncate-marker behavior
- Chores
  - Added Markdown parsing dependency; improved deploy script/workflow to ensure text files use UTF-8 charset
feat: implement remote stack imports @osterman (#2037)
## what
- Add support for importing stack configurations from remote URLs (HTTP, Git, S3, GCS) using go-getter
- Stack imports now work consistently with remote imports for atmos.yaml
- New `pkg/stack/imports` package handles URL detection and remote downloading
- Updated `stack_processor_utils.go` to detect remote URLs and download them automatically
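As a rough picture of the URL detection step, the sketch below classifies an import string as local or remote by prefix. The helper name `isRemoteImport` and the prefix list are assumptions for illustration only; the actual rules live in `pkg/stack/imports` and may recognize more forms. The `git::`, `s3::`, and `gcs::` prefixes follow go-getter's forced-protocol syntax.

```go
package main

import (
	"fmt"
	"strings"
)

// remotePrefixes is an assumed, non-exhaustive list of prefixes treated as
// remote in this sketch. It is not the list used by pkg/stack/imports.
var remotePrefixes = []string{"http://", "https://", "git::", "s3::", "gcs::"}

// isRemoteImport is a hypothetical helper that reports whether an import
// path should be downloaded rather than resolved on the local filesystem.
func isRemoteImport(path string) bool {
	for _, prefix := range remotePrefixes {
		if strings.HasPrefix(path, prefix) {
			return true
		}
	}
	return false
}

func main() {
	imports := []string{
		"catalog/vpc/defaults",
		"https://example.com/stacks/catalog/vpc.yaml",
		"git::https://example.com/org/stacks.git//catalog/vpc.yaml?ref=v1.0.0",
	}
	for _, imp := range imports {
		fmt.Printf("remote=%-5v %s\n", isRemoteImport(imp), imp)
	}
}
```

The `?ref=` query in the last example shows how a version-pinned Git import is typically written with go-getter; the host and repository are placeholders.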
## why
- This feature was documented but not yet implemented (fixes #2036)
- Teams need to share stack configurations across multiple repositories without vendoring
- Enables central catalogs, version-pinned imports, and cross-team config sharing
- Provides consistency between atmos.yaml imports and stack file imports
## references
- closes #2036
- Blog post: `website/blog/2026-01-29-remote-stack-imports.mdx`
- Example: `examples/remote-stack-imports/`
- Documentation: Stack Imports
## Summary by CodeRabbit
- New Features
  - Remote stack imports with local caching for HTTP(S), Git, S3, and GCS; skip-if-missing and version-pinning support.
- Documentation
  - New example project and README demonstrating local+remote import composition; blog post and roadmap entry announcing the feature.
- Chores
  - Improved on-disk and in-memory caching, atomic cache writes, and cross-platform file-locking.
- Tests
  - Expanded unit and integration tests covering URI classification, downloading, caching, locking, and CLI scenarios.
feat(aws/security): add SARIF/OCSF exports and harden CI @osterman (#2483)
## What
This PR adds machine-readable security export formats to `atmos aws security analyze` and includes the CI hardening needed to keep the branch green.
### AWS security exports
- Adds `--format=sarif` for SARIF 2.1.0 output compatible with GitHub code scanning, Azure DevOps, and SARIF viewers.
- Adds `--format=ocsf` for OCSF 1.4.0 Detection Finding output for SIEM and security data lake ingestion.
- Preserves Atmos context in exported findings, including stack, component, component path, remediation steps, deploy command, mapped physical locations, and logical fallback locations for unmapped resources.
- Produces deterministic output ordering for stable diffs and deduplication.
- Maps Atmos severities into SARIF/GHAS levels and OCSF severity/status fields.
- Adds schema-backed and structural test coverage for SARIF and OCSF renderers, determinism, empty/nil inputs, mapped/unmapped findings, compliance reports, and malformed SARIF rejection.
- Updates CLI docs, blog content, roadmap data, and PRD notes. The experimental Atmos Pro upload surface was removed before merge; the design is preserved in `docs/prd/atmos-pro-security-findings-upload.md` for later revival.
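The severity mapping mentioned above could look something like the sketch below. The input labels and the specific pairings are assumptions chosen for illustration, not the table Atmos ships; only the target vocabularies are taken from the specs (SARIF `level` is one of `none`, `note`, `warning`, `error`; OCSF `severity_id` uses 0 Unknown, 1 Informational, 2 Low, 3 Medium, 4 High, 5 Critical).

```go
package main

import (
	"fmt"
	"strings"
)

// mapSeverity is a hypothetical mapping from a finding severity label to a
// SARIF level and an OCSF severity_id. The pairings are assumptions.
func mapSeverity(severity string) (sarifLevel string, ocsfSeverityID int) {
	switch strings.ToUpper(severity) {
	case "CRITICAL":
		return "error", 5
	case "HIGH":
		return "error", 4
	case "MEDIUM":
		return "warning", 3
	case "LOW":
		return "note", 2
	case "INFORMATIONAL":
		return "note", 1
	default:
		// Unrecognized labels fall back to the "unknown" values.
		return "none", 0
	}
}

func main() {
	for _, s := range []string{"CRITICAL", "HIGH", "MEDIUM", "LOW", "INFORMATIONAL", "bogus"} {
		level, id := mapSeverity(s)
		fmt.Printf("%-13s sarif=%-7s ocsf.severity_id=%d\n", s, level, id)
	}
}
```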
### CI and workflow hardening
- Upgrades `actions/checkout` usage from v4 to v6 across workflows and docs/examples; updates the SHA-pinned Atmos Pro checkout to `v6.0.2`.
- Grants `packages: read` so CI jobs pulling from GHCR can authenticate with the workflow token.
- Grants reviewdog the PR/check permissions it needs for tflint annotations and uses `github.token` explicitly.
- Uses `opentofu/setup-opentofu@v1` on Windows instead of installing OpenTofu through the Atmos toolchain path that was failing signature verification.
- Fixes demo-stack wttr.in URLs, changes Swedish language code from `se` to `sv`, and adds HTTP retries so screengrab generation is less brittle.
## Why
SARIF and OCSF let Atmos security findings flow directly into standard security workflows instead of requiring users to translate markdown/json output themselves. The CI changes address failures observed while validating this branch: checkout/auth instability, Windows OpenTofu setup failures, reviewdog token scope failures, and live wttr.in request failures in screengrab generation.
## Validation
- `GITHUB_TOKEN=$(gh auth token) node .github/actions/verify-sha-pinning/test.mjs`
- `make -C demo/screengrabs build-all`
- `terraform fmt -check examples/demo-stacks/components/terraform/myapp/main.tf`
- `pre-commit run check-yaml --files .github/workflows/test.yml examples/demo-stacks/stacks/deploy/dev.yaml examples/demo-stacks/component.yaml`
- `git diff --check`
## References
- SARIF 2.1.0 spec: https://docs.oasis-open.org/sarif/sarif/v2.1.0/sarif-v2.1.0.html
- GitHub SARIF support: https://docs.github.com/en/code-security/code-scanning/integrating-with-code-scanning/sarif-support-for-code-scanning
- OCSF schema: https://schema.ocsf.io/
## Summary by CodeRabbit
- New Features
  - Added SARIF 2.1.0 and OCSF 1.4.0 export options; CLI accepts --format=sarif/ocsf, emits deterministic, Atmos‑enriched outputs and records UTC invocation/audit metadata.
  - Integrated Amazon Inspector2 native findings into security analysis with normalization, deduplication and preferred native results.
- Tests
  - Extensive unit and JSON‑schema test suites for SARIF and OCSF ensuring spec conformance and byte‑stable output.
- Documentation
  - Docs, blog post, PRDs, and roadmap updated for SARIF/OCSF support and usage examples.
- Chores
  - CI checkout action pinned; NOTICE dependency list updated.
🚀 Enhancements
fix(aws/security): allow unlimited findings and record invocation in OCSF @osterman (#2517)
## what
- `atmos aws security analyze --max-findings 0` (or any non-positive value) now fetches all matching findings from Security Hub / Inspector instead of silently capping at 500.
- The fetcher emits a `log.Warn` whenever pagination halts at the limit while `NextToken != nil`, so truncation is never silent again.
- Every OCSF event now carries the literal command line, arguments, timing, exit code, working directory, and scanned scope under `unmapped["atmos.invocation"]`, the OCSF analogue of SARIF's `run.invocations[]` (which Atmos already emits).
- Default behavior is unchanged (`--max-findings` still defaults to 500); only the previously-broken `0` semantics now work, and the help text + docs document it.
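The limit semantics above can be sketched with a fake paginator. Everything here is hypothetical scaffolding (`page`, `fetchPage`, `newPager`, `fetchAll`) rather than the Atmos fetcher or the AWS SDK; it only demonstrates that a non-positive limit means "no cap" and that stopping while a next token remains triggers a warning. As an extra assumption, the sketch also warns when items from the final page are dropped.

```go
package main

import (
	"fmt"
	"strconv"
)

// page is one hypothetical API response: some findings plus an optional token.
type page struct {
	findings  []string
	nextToken *string
}

type fetchPage func(token *string) page

// newPager builds a fake paginated source over items for demonstration.
func newPager(items []string, pageSize int) fetchPage {
	return func(token *string) page {
		start := 0
		if token != nil {
			start, _ = strconv.Atoi(*token)
		}
		end := start + pageSize
		if end >= len(items) {
			return page{findings: items[start:]}
		}
		next := strconv.Itoa(end)
		return page{findings: items[start:end], nextToken: &next}
	}
}

// fetchAll collects findings. maxFindings <= 0 means unlimited; a positive
// limit stops early and calls warn if results were left behind.
func fetchAll(fetch fetchPage, maxFindings int, warn func(string)) []string {
	var all []string
	var token *string
	for {
		p := fetch(token)
		all = append(all, p.findings...)
		if maxFindings > 0 && len(all) >= maxFindings {
			if p.nextToken != nil || len(all) > maxFindings {
				warn("findings truncated at limit " + strconv.Itoa(maxFindings))
			}
			return all[:maxFindings]
		}
		if p.nextToken == nil {
			return all
		}
		token = p.nextToken
	}
}

func main() {
	items := []string{"f1", "f2", "f3", "f4", "f5"}
	warn := func(msg string) { fmt.Println("WARN:", msg) }
	fmt.Println(len(fetchAll(newPager(items, 2), 0, warn))) // unlimited
	fmt.Println(len(fetchAll(newPager(items, 2), 3, warn))) // capped, warns
}
```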
## why
- The 500 cap is a CLI-layer default, not an AWS pagination limit. For multi-account orgs, real finding counts routinely exceed 500, so `--format json/sarif/ocsf` exports were silently incomplete and downstream tooling (SIEM ingestion, ticketing, dashboards) was missing data with no error or warning.
- AI analysis users get cost protection by keeping 500 as the default; export users get correctness by opting in to `--max-findings 0`. The `log.Warn` covers the case where users forget: they'll see in the output that more findings exist.
- SARIF already records the invocation, but OCSF Detection Finding 2004 has no native invocation slot. Auditors and SIEM analysts asking "what command produced this batch?" can now answer it from either format.
## references
- Closes the silent-clamp issue surfaced during use of `atmos aws security analyze` for SIEM export pipelines.
- SARIF 2.1.0 invocation spec: https://docs.oasis-open.org/sarif/sarif/v2.1.0/sarif-v2.1.0.html (already implemented in `pkg/aws/security/sarif.go`).
- OCSF 1.4.0 Detection Finding: https://schema.ocsf.io/1.4.0/classes/detection_finding (no native invocation field; landed in `unmapped` extension).
- Related shipped feature: #2483 (initial SARIF/OCSF export support).
## Summary by CodeRabbit
- New Features
  - `--max-findings` now distinguishes "unset" from an explicit 0; 0 means unlimited and effective default remains 500.
  - CLI prints a clear info message when fetching all findings vs a limited fetch.
  - OCSF exports now include report invocation metadata in each event when available.
- Bug Fixes
  - Warning logged when a positive limit truncates results to avoid silent loss; pagination now respects the limit.
- Documentation
  - CLI, config, and PRD docs updated to reflect the new semantics.
- Tests
  - Added tests covering max-findings precedence, pagination behaviors, and OCSF invocation attachment.
fix(terraform-state): honor target component's env section for AWS credentials @arcaven (#2502)
## what
- Makes `!terraform.state`'s in-process S3 backend reader honor a whitelisted subset of the target component's `env` section (`AWS_PROFILE`, `AWS_REGION`, `AWS_DEFAULT_REGION`, `AWS_CONFIG_FILE`, `AWS_SHARED_CREDENTIALS_FILE`, `AWS_ENDPOINT_URL_S3`, `AWS_ENDPOINT_URL_STS`, `AWS_USE_FIPS_ENDPOINT`), matching the behavior `!terraform.output` already exhibits via its subprocess env overlay. `AWS_STS_REGIONAL_ENDPOINTS` is intentionally excluded because it's an SDK v1 toggle and a no-op in SDK v2.
- Adds `internal/terraform_backend.ExtractComponentEnvOverlay` (with a nil-pointer guard) and `internal/terraform_backend.ComponentEnvKeysAWS`. Threads the overlay through `getCachedS3Client` and `ReadTerraformBackendS3`. The S3 client cache key now includes every whitelisted key that affects client behavior (profile, both regions, both endpoint URLs, FIPS, config + credentials files) so two components with distinct settings never alias each other.
- Extends `pkg/aws/identity` with `LoadConfigWithAuthAndEnv`. `LoadConfigWithAuth` becomes a thin nil-overlay wrapper, so every existing call site behaves identically. Within the new variant:
  - `AWS_USE_FIPS_ENDPOINT` (truthy `"true"`/`"1"`) is applied via `config.WithUseFIPSEndpoint(aws.FIPSEndpointStateEnabled)`, a global config setting.
  - `AWS_ENDPOINT_URL_STS` is applied at `sts.NewFromConfig` in the assume-role flow, a per-service option in SDK v2.
  - `AWS_ENDPOINT_URL_S3` is applied at `s3.NewFromConfig` in `getCachedS3Client` for the same SDK v2 per-service-option reason.
- Adds focused unit tests covering the overlay extraction (9 subcases including the nil-pointer guard), the whitelist surface stability, the credential-resolution precedence (5 cases including a sentinel that asserts `LoadConfigWithAuth ≡ LoadConfigWithAuthAndEnv(..., nil)`), and the FIPS application (4 subcases including an authContext-suppresses-overlay assertion).
- Adds an advanced docs section to `functions/yaml/terraform.state.mdx` ("Switching AWS credentials per component via the `env` section") and a cross-reference note in `functions/yaml/terraform.output.mdx`. Placed alongside the existing specialized sections (SSE-C, GCS, static); primary examples untouched.
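The overlay extraction and cache-key idea can be sketched as below. This is an illustration under assumptions, not the code in `internal/terraform_backend`: `extractOverlay` and `cacheKey` are hypothetical names, the component sections are modeled as a plain `map[string]any`, and the key format is invented. The whitelist itself is the one listed in this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// componentEnvKeysAWS mirrors the whitelist named in this PR, in a fixed order.
var componentEnvKeysAWS = []string{
	"AWS_PROFILE", "AWS_REGION", "AWS_DEFAULT_REGION",
	"AWS_CONFIG_FILE", "AWS_SHARED_CREDENTIALS_FILE",
	"AWS_ENDPOINT_URL_S3", "AWS_ENDPOINT_URL_STS", "AWS_USE_FIPS_ENDPOINT",
}

// extractOverlay (hypothetical) copies whitelisted, non-empty string values
// from the component's env section. It returns nil when nothing applies, so
// callers can keep the pre-existing code path unchanged.
func extractOverlay(sections map[string]any) map[string]string {
	env, ok := sections["env"].(map[string]any)
	if !ok {
		return nil
	}
	var overlay map[string]string
	for _, key := range componentEnvKeysAWS {
		value, ok := env[key].(string)
		if !ok || value == "" {
			continue
		}
		if overlay == nil {
			overlay = map[string]string{}
		}
		overlay[key] = value
	}
	return overlay
}

// cacheKey (hypothetical format) includes every whitelisted key so clients
// built for components with different settings never share a cache entry.
func cacheKey(bucket string, overlay map[string]string) string {
	parts := []string{bucket}
	for _, key := range componentEnvKeysAWS {
		parts = append(parts, key+"="+overlay[key])
	}
	return strings.Join(parts, "|")
}

func main() {
	sections := map[string]any{
		"env": map[string]any{"AWS_PROFILE": "org-b", "TF_LOG": "debug"},
	}
	overlay := extractOverlay(sections)
	fmt.Println(overlay)
	fmt.Println(cacheKey("tfstate-bucket", overlay) != cacheKey("tfstate-bucket", nil))
}
```

Non-whitelisted keys such as `TF_LOG` are ignored, and a component with no whitelisted keys yields a nil overlay.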
## why
- `!terraform.output` and `!terraform.state` are documented as interchangeable readers of the same state. They aren't, in setups that distribute Terraform state across AWS organizations.
  - `!terraform.output` shells out to `tofu`/`terraform`. `pkg/terraform/output.defaultEnvironmentSetup.SetupEnvironment` overlays the target component's `env` section onto the subprocess environment as its final step, so `env.AWS_PROFILE` reaches the backend's credential resolution.
  - `!terraform.state` reads in-process via `internal/terraform_backend.ReadTerraformBackendS3` → `getCachedS3Client` → `pkg/aws/identity.LoadConfigWithAuth` → `config.LoadDefaultConfig`. When no Atmos `AWSAuthContext` is provided, the SDK uses the calling process's `AWS_PROFILE`. `componentSections["env"]` is in scope on the same map but no function in this chain reads it.
- In practice this means a stack in one AWS org calling `!terraform.state` against a stack in another org fails with `AccessDenied` on `sts:AssumeRole` (or silently reads the wrong account when buckets happen to share names), while the equivalent `!terraform.output` call works. The two functions are supposed to produce identical results; this is a correctness gap.
- Backward compatibility was the design constraint. A component without any whitelisted env key produces a nil overlay and the resolved code path is byte-identical to the prior behavior. Users who don't use this pattern see no change. A sentinel test asserts this directly.
- Atmos auth (`AWSAuthContext`) layers above the overlay and still wins outright. The env overlay path is intended for setups not yet on Atmos auth, which is the common case for the SweetOps community at the moment.
- GCS and AzureRM backends share the bug structurally. We don't have GCS or Azure tfstate to validate those readers locally and would rather defer than ship code we haven't run. The fix shape generalises trivially (one new whitelist slice + one `ExtractComponentEnvOverlay` call per backend). Calling that out in the issue and the docs so a contributor with those backends can pick it up.
## references
- Closes #2501
- Working reference implementation we're mirroring in-process: `pkg/terraform/output/environment.go::SetupEnvironment` (the final `for k, v := range config.Env` loop).
- Function this PR extends: `pkg/aws/identity.LoadConfigWithAuth`.
- In-process reader being fixed: `internal/terraform_backend/terraform_backend_s3.go::ReadTerraformBackendS3` / `getCachedS3Client`.
- Adjacent context for S3 path construction in `!terraform.state`: #1920.
- Auth-chain inheritance for `!terraform.output` (referenced from `terraform_output_utils.go`): #1921.
## Summary by CodeRabbit
- New Features
  - Per-component AWS env overlay for Terraform S3 remote-state reads; explicit auth/context still takes precedence. Identity loading now accepts optional env overlays and respects overlay vs. auth precedence.
- Documentation
  - Guidance added for cross-account remote-state reads and env-overlay behavior for terraform.output and terraform.state.
- Tests
  - Added tests for overlay extraction, precedence rules, FIPS behavior, and a stable ordered whitelist of AWS env keys.
- Chores
  - Minor log formatting adjustments.