cloudposse/atmos v1.217.0-rc.4 on GitHub

docs(roadmap): curate featured; drop internal-refactor changelog posts @osterman (#2384)

## what

Cap featured[] in website/src/data/roadmap.js at 6 curated strategic initiatives. Drop devcontainer, workflows, instance-status-upload, and chunked-stack-uploads. Final 6: atmos-ai, cloud-auth, native-ci, pro-commit, source-provisioning, toolchain.
Add equivalent milestones to the ci-cd initiative for the two demoted Atmos Pro items so their changelogs stay reachable from the roadmap. Recalc ci-cd.progress 89 → 92.
Delete three internal-only refactor blog posts and their corresponding quality initiative milestones: process-args-flags-refactor, refactoring-executeterraform-for-testability, describe-stacks-complexity-reduction. Recalc quality.progress 86 → 75.
Update .claude/agents/roadmap.md with two new rules: (1) featured[] is manually curated, max 6, edited only when the user explicitly asks; (2) internal-only refactors with no user-visible change do not get changelog posts. Also add matching schema docs and quality-check items.

## why

The featured section had drifted into a per-release announcement feed — every minor Atmos Pro plumbing improvement (chunked uploads, instance status, etc.) was rendering at the top of /roadmap next to transformative initiatives like Atmos AI and Cloud Auth. That diluted its meaning.
The roadmap maintainer agent had no documented rule for featured[], so it was being modified on every release. Codifying "max 6, opt-in only" stops the drift at the source.
Internal refactor posts (cyclomatic complexity reductions, function decomposition) are engineering wins but produce zero user-visible change. They belong in PR descriptions and git log, not the user-facing changelog.

## references

No issue tracker reference.
no-release: content/data only; no Go code, no user-visible CLI behavior change. This PR removes already-published changelog entries that should not have been published.

Summary by CodeRabbit

Documentation
- Removed three technical blog posts documenting internal refactors
- Clarified roadmap maintenance guidance: changelog should omit internal-only refactors; featured entries are curated with a hard cap of 6 and should not be modified unless explicitly requested
Chores
- Reorganized featured initiatives and adjusted roadmap milestone tracking
- Updated CI/CD progress to 92% and Quality progress to 75%
- Updated NOTICE with concrete upstream license URLs
Quality
- Added checks to prevent improper featured changes and to omit internal refactors from the changelog

🚀 Enhancements

fix(ci): use terraform exit code as the source of truth for CI status @osterman (#2382)

## what

Make the terraform exit code the authoritative signal for success/failure (and, for terraform plan with -detailed-exitcode, for change detection) in the CI summary path. Text parsing of stdout/stderr is downgraded to enrichment only — it still extracts resource counts, output values, and error message bodies, but no longer drives the binary HasErrors / HasChanges decisions.
Plumb the exit code through cmd/terraform/utils.go → pkg/hooks RunCIHooks → pkg/ci ExecuteOptions → plugin.HookContext so the plugin handler has a clean signal independent of output format.
Rewrite parseOutputWithError (pkg/ci/plugins/terraform/handlers.go) so that:
- apply/deploy: HasErrors = (exitCode != 0)
- plan: HasErrors = (exitCode == 1); exitCode == 2 also implies HasChanges
- other commands: HasErrors = (exitCode != 0)
- exit-code success discards spurious "Error:" matches from text; exit-code failure still falls back to CommandError.Error() for the body if text parsing didn't find one.
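The exit-code rules listed above can be sketched as a small pure function. This is a minimal illustration, not the actual parseOutputWithError: the Result type and the classify name are hypothetical, and only the two flags discussed here are modeled.

```go
package main

// Result models only the two flags discussed above. The real
// plugin.OutputResult carries more fields; this type is hypothetical.
type Result struct {
	HasErrors  bool
	HasChanges bool
}

// classify derives the binary status from the exit code alone,
// following the per-command rules in the list above.
func classify(command string, exitCode int) Result {
	switch command {
	case "plan":
		// With -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present.
		return Result{HasErrors: exitCode == 1, HasChanges: exitCode == 2}
	default:
		// apply, deploy, and all other commands: any non-zero exit is a failure.
		// HasChanges is left false here; in this sketch it would come from
		// text enrichment, which is out of scope.
		return Result{HasErrors: exitCode != 0}
	}
}
```

For example, classify("plan", 2) reports changes without an error, while classify("deploy", 1) reports an error regardless of what the output text contains.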
Wire the enriched *plugin.OutputResult from parseOutputWithError through writeSummary and buildTemplateContext (it had been silently dropped — writeSummary had _ *plugin.OutputResult as its second arg, and buildTemplateContext re-parsed ctx.Output from scratch). buildTemplateContext keeps a nil-fallback so legacy callers continue to work.
Refactor RunCIHooks to take a *RunCIHooksOptions struct (per the repo's options pattern) since the parameter list grew past the linter's max-args limit.
Add tests covering all the new branches: exit-code-only failure rendering, exit-code 2 → HasChanges for plan, apply exit 0 with stray Error: in output → no error, plus the original failure-summary tests for plan/apply/deploy.

## why

Reported regression: atmos terraform deploy <component> -s <stack> --upload-status failing at the authentication step (before terraform itself ran, exit code 1) still produced a job summary that read ## No Changes Applied for eks/karpenter-node-pool in e98d-gov-use1-dss with a NO CHANGE badge. The check run was correctly marked failed, but the summary contradicted it.
Root cause was architectural: the CI summary path used text parsing as the primary source of truth for failure/change state. The auth-failure stderr did not match ExtractErrors's ^Error: regex (it's emitted as **Error:** in markdown form), and writeSummary silently dropped the already-enriched OutputResult, so the apply template fell through to the no-changes branch. Anything that fails before terraform runs — auth, OOM, signal kill, network — would have hit the same bug.
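The regex mismatch described above is easy to reproduce in isolation. The sketch below assumes a line-anchored pattern equivalent to ^Error:; the actual expression inside ExtractErrors may differ, and the textSaysError helper is hypothetical.

```go
package main

import "regexp"

// errorLine approximates a line-anchored "^Error:" pattern.
// Assumption: the real ExtractErrors regex may be written differently.
var errorLine = regexp.MustCompile(`(?m)^Error:`)

// textSaysError reports whether text parsing alone would flag a failure.
func textSaysError(output string) bool {
	return errorLine.MatchString(output)
}
```

A markdown-formatted line such as "**Error:** authentication failed" starts with asterisks, so the anchored pattern never matches it, and a text-only check concludes that nothing went wrong.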
Terraform exit codes are well-defined and stable (apply: 0 = success / non-zero = error; plan -detailed-exitcode: 0/1/2). Using them as the authoritative signal makes the hook robust against output-format drift between Terraform and OpenTofu, and against any pre-terraform failure that produces no parseable output. errUtils.GetExitCode already unwraps exec.ExitError, ExecError, exitCoder, and WorkflowStepError, so the existing error chains carry it through without further plumbing.
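As a rough illustration of how an exit code can travel through a wrapped error chain, the sketch below uses errors.As against a small interface. It is not the implementation of errUtils.GetExitCode; the exitCoder interface shape, stepError, exitCodeOf, and wrap are all hypothetical names chosen for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// exitCoder is a hypothetical interface for errors that carry an exit code.
// The standard library's *exec.ExitError also has an ExitCode() int method.
type exitCoder interface{ ExitCode() int }

// stepError is a hypothetical error type standing in for a failed step.
type stepError struct{ code int }

func (e *stepError) Error() string { return fmt.Sprintf("step failed with exit code %d", e.code) }
func (e *stepError) ExitCode() int { return e.code }

// errPlain models a failure that carries no exit code of its own.
var errPlain = errors.New("authentication failed before terraform ran")

// wrap adds context with %w so the original error stays in the chain.
func wrap(err error) error { return fmt.Errorf("terraform plan: %w", err) }

// exitCodeOf walks the chain looking for an exit code.
// A nil error maps to 0; an error without a code maps to 1.
func exitCodeOf(err error) int {
	if err == nil {
		return 0
	}
	var ec exitCoder
	if errors.As(err, &ec) {
		return ec.ExitCode()
	}
	return 1
}
```

Because the lookup uses errors.As, any number of wrapping layers can sit between the caller and the error that knows the exit code.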

## references

Affected handlers: pkg/ci/plugins/terraform/handlers.go (parseOutputWithError, writeSummary).
Affected helper: pkg/ci/plugins/terraform/plugin.go (buildTemplateContext).
Plumbing: pkg/ci/internal/plugin/types.go, pkg/ci/executor.go, pkg/hooks/hooks.go, cmd/terraform/utils.go.
Templates (unchanged): pkg/ci/plugins/terraform/templates/{apply,plan}.md already had {{ if .Result.HasErrors }} branches; they just weren't being reached.


Summary by CodeRabbit

Release Notes

Bug Fixes
- Improved error detection and failure reporting by treating command exit codes as the authoritative indicator of success/failure, fixing edge cases where errors occur before terraform produces output.
- Enhanced CI/check-run status accuracy for plan and apply operations, properly handling plan changes and command execution failures.
Tests
- Added comprehensive test coverage for exit code handling, error state reconciliation, and CI hook execution workflows.

fix(auth): preserve AWS SDK error in assume-role / web-identity / assume-root failures @aknysh (#2385)

## what

Adds WithCause(err) at the three STS error sites in
pkg/auth/identities/aws/:
- assume_role.go — standard AssumeRole path.
- assume_role.go — AssumeRoleWithWebIdentity (OIDC) path.
- assume_root.go — AssumeRoot (centralized root access) path.
Adds regression tests in pkg/auth/identities/aws/assume_sdk_error_test.go
that point STS at a local httptest.Server returning AWS-style XML
error envelopes (via the existing aws.resolver.url mechanism). Each
test asserts the sentinel is preserved (errors.Is(err, ErrAuthenticationFailed)), the AWS error code and message are
reachable in err.Error(), and the SDK error is also reachable
through errors.As(err, &smithy.APIError).
Adds docs/fixes/2026-05-01-assume-role-error-swallows-aws-cause.md
documenting the issue and fix.
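The testing approach described above, a local httptest.Server that answers with an AWS-style XML error envelope, can be sketched with the standard library alone. This is not the code in assume_sdk_error_test.go: it does not use the AWS SDK or the aws.resolver.url mechanism, the envelope shape is an assumption, and newFakeSTS and fetchError are hypothetical helpers.

```go
package main

import (
	"encoding/xml"
	"io"
	"net/http"
	"net/http/httptest"
)

// stsErrorBody is an AWS-style XML error envelope (shape assumed for illustration).
const stsErrorBody = `<ErrorResponse>
  <Error>
    <Type>Sender</Type>
    <Code>AccessDenied</Code>
    <Message>Not authorized to perform sts:AssumeRole</Message>
  </Error>
  <RequestId>test-request-id</RequestId>
</ErrorResponse>`

// errorResponse captures only the fields this sketch inspects.
type errorResponse struct {
	Error struct {
		Code    string `xml:"Code"`
		Message string `xml:"Message"`
	} `xml:"Error"`
}

// newFakeSTS starts a local server that answers every request
// with HTTP 403 and the error envelope above.
func newFakeSTS() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/xml")
		w.WriteHeader(http.StatusForbidden)
		_, _ = io.WriteString(w, stsErrorBody)
	}))
}

// fetchError posts to url and returns the HTTP status plus the
// error code and message decoded from the XML body.
func fetchError(url string) (int, string, string, error) {
	resp, err := http.Post(url, "application/x-www-form-urlencoded", nil)
	if err != nil {
		return 0, "", "", err
	}
	defer resp.Body.Close()

	var parsed errorResponse
	if err := xml.NewDecoder(resp.Body).Decode(&parsed); err != nil {
		return resp.StatusCode, "", "", err
	}
	return resp.StatusCode, parsed.Error.Code, parsed.Error.Message, nil
}
```

In a real test the SDK client would be pointed at the server's URL instead of calling fetchError, and the assertions would run against the error the identity code returns.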

## why

The three error sites built an enriched error with
errUtils.Build(ErrAuthenticationFailed).WithExplanation(...).WithHint(...).Err()
but never threaded the underlying SDK err into the chain. Operators
saw only "authentication failed: identity=<name> step=<n>: authentication failed", with no AWS context.
That made it impossible to tell, without re-running under
ATMOS_LOGS_LEVEL=Debug, whether the failure was AccessDenied,
NoSuchEntity, InvalidIdentityToken, ExpiredTokenException,
MalformedPolicyDocumentException, throttling, etc. Each has a
different remediation; the hint list ("verify the role ARN", "check
the trust policy", ...) effectively enumerated every plausible cause
because the actual one had been dropped.
The error builder already exposes WithCause(err) for exactly this
case (errors/builder.go:104-167). It chains the cause via
fmt.Errorf("%w: %w", sentinel, cause), preserves the sentinel for
errors.Is checks, and merges any hints/safe details the cause
already carried. The canonical pattern is already used at
pkg/auth/identities/aws/webflow_token.go:88-97. The three assume
sites just hadn't adopted it yet.
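The effect of chaining with two %w verbs can be shown with the standard library alone (multiple %w verbs require Go 1.20 or later). The sketch below is not the errUtils builder: apiError is a hypothetical stand-in for a typed SDK error such as smithy.APIError, and the wrap and helper functions are illustrative names.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrAuthenticationFailed stands in for the sentinel named above.
var ErrAuthenticationFailed = errors.New("authentication failed")

// apiError is a hypothetical stand-in for a typed SDK error.
type apiError struct {
	Code    string
	Message string
}

func (e *apiError) Error() string {
	return fmt.Sprintf("api error %s: %s", e.Code, e.Message)
}

// wrapWithCause chains sentinel and cause so both stay reachable.
func wrapWithCause(cause error) error {
	return fmt.Errorf("%w: %w", ErrAuthenticationFailed, cause)
}

// wrapWithoutCause models the pre-fix behavior: the cause is dropped.
func wrapWithoutCause(_ error) error {
	return fmt.Errorf("%w", ErrAuthenticationFailed)
}

// isAuthFailure reports whether the sentinel is still in the chain.
func isAuthFailure(err error) bool {
	return errors.Is(err, ErrAuthenticationFailed)
}

// apiCode extracts the typed cause's code, if the cause was preserved.
func apiCode(err error) (string, bool) {
	var ae *apiError
	if errors.As(err, &ae) {
		return ae.Code, true
	}
	return "", false
}
```

With wrapWithCause, isAuthFailure still returns true and apiCode recovers the AWS-side code; with wrapWithoutCause, the sentinel check passes but the code is gone, which matches the behavior operators saw before the fix.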
After the fix, the same failure renders with the AWS-side reason
inline:
authentication failed: identity=<name> step=<n>: authentication failed: operation error STS: AssumeRoleWithWebIdentity, https response error StatusCode: 403, RequestID: ..., api error AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity
This makes trust-policy, token, and role-ARN problems
diagnosable from the first run.
Verified by reverting just the three WithCause(err) lines and
confirming the new tests fail; restoring the fix turns them green
again. Full pkg/auth/... test suite (~25 packages) passes.

## references

docs/fixes/2026-05-01-assume-role-error-swallows-aws-cause.md —
full root-cause writeup, code paths, and rationale (added in this
PR).
errors/builder.go:104-167 — WithCause / WithCausef helpers
used by the fix.
pkg/auth/identities/aws/webflow_token.go:88-97 — canonical
pattern referenced as the model for these three sites.
pkg/auth/manager_chain.go:570 — chain wrapper that already
expected the leaf to thread the cause via the trailing %w; this
PR makes the leaf actually do so.

Summary by CodeRabbit

Bug Fixes
- Preserve and surface underlying AWS STS error details in authentication failures while retaining existing sentinel behavior.
Tests
- Added regression tests that verify sentinel preservation, inclusion of AWS error text, and access to typed SDK errors across multiple STS error scenarios.
Documentation
- Added a doc with before/after examples and end-to-end test descriptions for the error-handling change.