docs(roadmap): curate featured; drop internal-refactor changelog posts @osterman (#2384)
## what
- Cap `featured[]` in `website/src/data/roadmap.js` at 6 curated strategic initiatives. Drop `devcontainer`, `workflows`, `instance-status-upload`, and `chunked-stack-uploads`. Final 6: `atmos-ai`, `cloud-auth`, `native-ci`, `pro-commit`, `source-provisioning`, `toolchain`.
- Add equivalent milestones to the `ci-cd` initiative for the two demoted Atmos Pro items so their changelogs stay reachable from the roadmap. Recalc `ci-cd.progress` 89 → 92.
- Delete three internal-only refactor blog posts and their corresponding `quality` initiative milestones: `process-args-flags-refactor`, `refactoring-executeterraform-for-testability`, `describe-stacks-complexity-reduction`. Recalc `quality.progress` 86 → 75.
- Update `.claude/agents/roadmap.md` with two new rules: (1) `featured[]` is manually curated, max 6, edited only when the user explicitly asks; (2) internal-only refactors with no user-visible change do not get changelog posts. Adds matching schema docs and quality-check items.
## why
- The featured section had drifted into a per-release announcement feed: every minor Atmos Pro plumbing improvement (chunked uploads, instance status, etc.) was rendering at the top of `/roadmap` next to transformative initiatives like Atmos AI and Cloud Auth. That diluted its meaning.
- The roadmap maintainer agent had no documented rule for `featured[]`, so it was being modified on every release. Codifying "max 6, opt-in only" stops the drift at the source.
- Internal refactor posts (cyclomatic complexity reductions, function decomposition) are engineering wins but produce zero user-visible change. They belong in PR descriptions and `git log`, not the user-facing changelog.
## references
- No issue tracker reference.

`no-release`: content/data only; no Go code, no user-visible CLI behavior change. Removes already-published changelog entries that should not have been published.
Summary by CodeRabbit
- Documentation
  - Removed three technical blog posts documenting internal refactors
  - Clarified roadmap maintenance guidance: changelog should omit internal-only refactors; featured entries are curated with a hard cap of 6 and should not be modified unless explicitly requested
- Chores
  - Reorganized featured initiatives and adjusted roadmap milestone tracking
  - Updated CI/CD progress to 92% and Quality progress to 75%
  - Updated NOTICE with concrete upstream license URLs
- Quality
  - Added checks to prevent improper featured changes and to omit internal refactors from the changelog
🚀 Enhancements
fix(ci): use terraform exit code as the source of truth for CI status @osterman (#2382)
## what
- Make the terraform exit code the authoritative signal for success/failure (and, for `terraform plan` with `-detailed-exitcode`, for change detection) in the CI summary path. Text parsing of stdout/stderr is downgraded to enrichment only: it still extracts resource counts, output values, and error message bodies, but no longer drives the binary `HasErrors`/`HasChanges` decisions.
- Plumb the exit code through `cmd/terraform/utils.go` → `pkg/hooks` `RunCIHooks` → `pkg/ci` `ExecuteOptions` → `plugin.HookContext` so the plugin handler has a clean signal independent of output format.
- Rewrite `parseOutputWithError` (`pkg/ci/plugins/terraform/handlers.go`) so that:
  - `apply`/`deploy`: `HasErrors = (exitCode != 0)`
  - `plan`: `HasErrors = (exitCode == 1)`; `exitCode == 2` also implies `HasChanges`
  - other commands: `HasErrors = (exitCode != 0)`
  - exit-code success discards spurious "Error:" matches from text; exit-code failure still falls back to `CommandError.Error()` for the body if text parsing didn't find one.
- Wire the enriched `*plugin.OutputResult` from `parseOutputWithError` through `writeSummary` and `buildTemplateContext`. It had been silently dropped: `writeSummary` had `_ *plugin.OutputResult` as its second arg, and `buildTemplateContext` re-parsed `ctx.Output` from scratch. `buildTemplateContext` keeps a `nil` fallback so legacy callers continue to work.
- Refactor `RunCIHooks` to take a `*RunCIHooksOptions` struct (per the repo's options pattern) since the parameter list grew past the linter's max-args limit.
- Add tests covering all the new branches: exit-code-only failure rendering, exit-code 2 → `HasChanges` for plan, apply exit 0 with stray `Error:` in output → no error, plus the original failure-summary tests for plan/apply/deploy.
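
The exit-code rules listed above can be sketched as a small pure function. This is a minimal illustration, not the actual `parseOutputWithError` implementation: the `outcome` type and `classify` function are hypothetical names, and the `plan` branch assumes the command ran with `-detailed-exitcode`.

```go
package main

import "fmt"

// outcome holds the two binary decisions named in the description above.
// The type and its name are hypothetical, not the Atmos API.
type outcome struct {
	HasErrors  bool
	HasChanges bool
}

// classify derives the outcome from the subcommand and exit code alone,
// without looking at stdout/stderr.
func classify(command string, exitCode int) outcome {
	switch command {
	case "plan":
		// Assumes -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes.
		return outcome{HasErrors: exitCode == 1, HasChanges: exitCode == 2}
	default:
		// apply, deploy, and every other command: non-zero exit means failure.
		return outcome{HasErrors: exitCode != 0}
	}
}

func main() {
	fmt.Printf("plan/2:   %+v\n", classify("plan", 2))
	fmt.Printf("deploy/1: %+v\n", classify("deploy", 1))
	fmt.Printf("apply/0:  %+v\n", classify("apply", 0))
}
```

Because the function never inspects command output, a stray `Error:` string in a successful apply cannot flip `HasErrors`, and a failure that prints nothing parseable is still reported.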
## why
- Reported regression: `atmos terraform deploy <component> -s <stack> --upload-status` failing at the authentication step (before terraform itself ran, exit code 1) still produced a job summary that read `## No Changes Applied for eks/karpenter-node-pool in e98d-gov-use1-dss` with a `NO CHANGE` badge. The check run was correctly marked failed, but the summary contradicted it.
- Root cause was architectural: the CI summary path used text parsing as the primary source of truth for failure/change state. The auth-failure stderr did not match `ExtractErrors`'s `^Error:` regex (it's emitted as `**Error:**` in markdown form), and `writeSummary` silently dropped the already-enriched `OutputResult`, so the apply template fell through to the no-changes branch. Anything that fails before terraform runs (auth, OOM, signal kill, network) would have hit the same bug.
- Terraform exit codes are well-defined and stable (`apply`: 0 = success / non-zero = error; `plan -detailed-exitcode`: 0/1/2). Using them as the authoritative signal makes the hook robust against output-format drift between Terraform and OpenTofu, and against any pre-terraform failure that produces no parseable output. `errUtils.GetExitCode` already unwraps `exec.ExitError`, `ExecError`, `exitCoder`, and `WorkflowStepError`, so the existing error chains carry it through without further plumbing.
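
The text-parsing miss described above is easy to reproduce in isolation. The sketch below assumes a line-anchored pattern equivalent to `^Error:`; the exact expression inside `ExtractErrors` is not shown in this PR, and `hasErrorByText` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
)

// errorLine approximates the line-anchored pattern described above.
// The real expression used by ExtractErrors is an assumption here.
var errorLine = regexp.MustCompile(`(?m)^Error:`)

// hasErrorByText reports whether text parsing alone would flag a failure.
func hasErrorByText(output string) bool {
	return errorLine.MatchString(output)
}

func main() {
	plain := "Error: Invalid provider configuration\n"
	markdown := "**Error:** failed to authenticate\n"

	fmt.Println(hasErrorByText(plain))    // true: line starts with "Error:"
	fmt.Println(hasErrorByText(markdown)) // false: line starts with "**"
}
```

The markdown-formatted message contains the word `Error:` but not at the start of a line, so a text-only check sees a clean run even though the process exited non-zero.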
## references
- Affected handlers: `pkg/ci/plugins/terraform/handlers.go` (`parseOutputWithError`, `writeSummary`).
- Affected helper: `pkg/ci/plugins/terraform/plugin.go` (`buildTemplateContext`).
- Plumbing: `pkg/ci/internal/plugin/types.go`, `pkg/ci/executor.go`, `pkg/hooks/hooks.go`, `cmd/terraform/utils.go`.
- Templates (unchanged): `pkg/ci/plugins/terraform/templates/{apply,plan}.md` already had `{{ if .Result.HasErrors }}` branches; they just weren't being reached.
Summary by CodeRabbit
Release Notes
- Bug Fixes
  - Improved error detection and failure reporting by treating command exit codes as the authoritative indicator of success/failure, fixing edge cases where errors occur before terraform produces output.
  - Enhanced CI/check-run status accuracy for `plan` and `apply` operations, properly handling plan changes and command execution failures.
- Tests
  - Added comprehensive test coverage for exit code handling, error state reconciliation, and CI hook execution workflows.
fix(auth): preserve AWS SDK error in assume-role / web-identity / assume-root failures @aknysh (#2385)
## what
- Adds `WithCause(err)` at the three STS error sites in `pkg/auth/identities/aws/`:
  - `assume_role.go`: standard `AssumeRole` path.
  - `assume_role.go`: `AssumeRoleWithWebIdentity` (OIDC) path.
  - `assume_root.go`: `AssumeRoot` (centralized root access) path.
- Adds regression tests in `pkg/auth/identities/aws/assume_sdk_error_test.go` that point STS at a local `httptest.Server` returning AWS-style XML error envelopes (via the existing `aws.resolver.url` mechanism). Each test asserts the sentinel is preserved (`errors.Is(err, ErrAuthenticationFailed)`), the AWS error code and message are reachable in `err.Error()`, and the SDK error is also reachable through `errors.As(err, &smithy.APIError)`.
- Adds `docs/fixes/2026-05-01-assume-role-error-swallows-aws-cause.md` documenting the issue and fix.
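
The fake-endpoint technique used by the regression tests can be sketched with the standard library alone. This is not the test code from the PR: `newFakeSTS`, `fetchError`, and `errorEnvelope` are hypothetical names, the XML body is an assumed example of the AWS query-protocol error shape, and a plain HTTP client stands in for the AWS SDK so the sketch stays self-contained.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// errorEnvelope models the assumed shape of an AWS-style XML error body.
type errorEnvelope struct {
	XMLName xml.Name `xml:"ErrorResponse"`
	Code    string   `xml:"Error>Code"`
	Message string   `xml:"Error>Message"`
}

// accessDeniedBody is an illustrative payload, not a captured AWS response.
const accessDeniedBody = `<ErrorResponse xmlns="https://sts.amazonaws.com/doc/2011-06-15/">` +
	`<Error><Type>Sender</Type><Code>AccessDenied</Code>` +
	`<Message>Not authorized to perform sts:AssumeRole</Message></Error>` +
	`<RequestId>test-request-id</RequestId></ErrorResponse>`

// newFakeSTS starts a local server that always answers 403 with the XML body.
func newFakeSTS() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "text/xml")
		w.WriteHeader(http.StatusForbidden)
		_, _ = io.WriteString(w, accessDeniedBody)
	}))
}

// fetchError posts to the endpoint and decodes the error envelope it returns.
func fetchError(url string) (int, errorEnvelope, error) {
	var env errorEnvelope
	resp, err := http.Post(url, "application/x-www-form-urlencoded", nil)
	if err != nil {
		return 0, env, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return resp.StatusCode, env, err
	}
	err = xml.Unmarshal(body, &env)
	return resp.StatusCode, env, err
}

func main() {
	srv := newFakeSTS()
	defer srv.Close()

	status, env, err := fetchError(srv.URL)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println(status, env.Code, env.Message)
}
```

In the real tests the SDK client is pointed at the server URL instead of a plain HTTP client, which exercises the SDK's own error deserialization and lets the assertions check the resulting error chain.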
## why
- The three error sites built an enriched error with `errUtils.Build(ErrAuthenticationFailed).WithExplanation(...).WithHint(...).Err()` but never threaded the underlying SDK `err` into the chain. Operators saw only `authentication failed: identity=<name> step=<n>: authentication failed` with no AWS context.
- That made it impossible to tell, without re-running under `ATMOS_LOGS_LEVEL=Debug`, whether the failure was `AccessDenied`, `NoSuchEntity`, `InvalidIdentityToken`, `ExpiredTokenException`, `MalformedPolicyDocumentException`, throttling, etc. Each has a different remediation; the hint list ("verify the role ARN", "check the trust policy", ...) effectively enumerated every plausible cause because the actual one had been dropped.
- The error builder already exposes `WithCause(err)` for exactly this case (`errors/builder.go:104-167`). It chains the cause via `fmt.Errorf("%w: %w", sentinel, cause)`, preserves the sentinel for `errors.Is` checks, and merges any hints/safe details the cause already carried. The canonical pattern is already used at `pkg/auth/identities/aws/webflow_token.go:88-97`. The three assume sites just hadn't adopted it yet.
- After the fix, the same failure renders with the AWS-side reason inline: `authentication failed: identity=<name> step=<n>: authentication failed: operation error STS: AssumeRoleWithWebIdentity, https response error StatusCode: 403, RequestID: ..., api error AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity`, which makes the trust-policy / token / role-ARN problems diagnosable from the first run.
- Verified by reverting just the three `WithCause(err)` lines and confirming the new tests fail; restoring the fix turns them green again. Full `pkg/auth/...` test suite (~25 packages) passes.
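
The double-`%w` wrapping described above can be demonstrated with the standard library (Go 1.20 or later). This sketch does not use the Atmos error builder: `errAuthenticationFailed`, `apiError`, `withCause`, and `withoutCause` are hypothetical stand-ins for the sentinel, the SDK error type, and the post-fix and pre-fix behavior.

```go
package main

import (
	"errors"
	"fmt"
)

// errAuthenticationFailed stands in for the sentinel named in the text.
var errAuthenticationFailed = errors.New("authentication failed")

// apiError is a hypothetical stand-in for a typed SDK error.
type apiError struct {
	Code    string
	Message string
}

func (e *apiError) Error() string {
	return fmt.Sprintf("api error %s: %s", e.Code, e.Message)
}

// withoutCause mimics the pre-fix behavior: the SDK error is dropped.
func withoutCause(_ error) error {
	return errAuthenticationFailed
}

// withCause wraps both the sentinel and the cause in one error.
func withCause(cause error) error {
	return fmt.Errorf("%w: %w", errAuthenticationFailed, cause)
}

// isAuthFailure checks that the sentinel is still reachable in the chain.
func isAuthFailure(err error) bool {
	return errors.Is(err, errAuthenticationFailed)
}

// causeCode returns the API error code if a typed cause is in the chain.
func causeCode(err error) (string, bool) {
	var api *apiError
	if errors.As(err, &api) {
		return api.Code, true
	}
	return "", false
}

func main() {
	sdkErr := &apiError{Code: "AccessDenied", Message: "Not authorized to perform sts:AssumeRole"}

	fmt.Println(withoutCause(sdkErr))
	fmt.Println(withCause(sdkErr))

	code, ok := causeCode(withCause(sdkErr))
	fmt.Println(isAuthFailure(withCause(sdkErr)), code, ok)
}
```

With both verbs set to `%w`, callers that only check the sentinel keep working, while callers that need the AWS-side reason can reach the typed cause through `errors.As`.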
## references
- `docs/fixes/2026-05-01-assume-role-error-swallows-aws-cause.md`: full root-cause writeup, code paths, and rationale (added in this PR).
- `errors/builder.go:104-167`: `WithCause` / `WithCausef` helpers used by the fix.
- `pkg/auth/identities/aws/webflow_token.go:88-97`: canonical pattern referenced as the model for these three sites.
- `pkg/auth/manager_chain.go:570`: chain wrapper that already expected the leaf to thread the cause via the trailing `%w`; this PR makes the leaf actually do so.
Summary by CodeRabbit
- Bug Fixes
  - Preserve and surface underlying AWS STS error details in authentication failures while retaining existing sentinel behavior.
- Tests
  - Added regression tests that verify sentinel preservation, inclusion of AWS error text, and access to typed SDK errors across multiple STS error scenarios.
- Documentation
  - Added a doc with before/after examples and end-to-end test descriptions for the error-handling change.