feat: enhance error messages in atmos validate stacks with file context @osterman (#1494)
## Summary
This PR addresses the issue where `atmos validate stacks` shows cryptic merge errors without any indication of which file contains the problem. Users were seeing duplicate, unformatted error messages with no context about where the configuration conflicts occurred.
Problem
When users encountered merge errors (like type mismatches between arrays and strings), they would see:
- Duplicate error messages - The same error printed multiple times to stderr
- No file context - No indication of which file or import chain caused the issue
- Difficult debugging - Impossible to identify the source of conflicts in complex configurations
Example of the problematic output:
cannot override two slices with different type ([]interface {}, string)
cannot override two slices with different type ([]interface {}, string)
cannot override two slices with different type ([]interface {}, string)
...
Solution
This PR implements two key improvements:
1. Fixed Duplicate Error Printing
- Root Cause: `merge.go` was printing errors directly to stderr using `theme.Colors.Error.Fprintln()` before returning them
- Fix: Removed all direct printing statements (3 locations)
- Result: Errors now flow through proper logging channels and appear once
2. Added MergeContext for Enhanced Error Messages
- File Tracking: Shows exactly which file is being processed when an error occurs
- Import Chain: Displays the complete chain of imports leading to the error
- Helpful Hints: Provides specific guidance for common merge errors
- Debug Logging: Logs merge operations and failures at Debug level with full context
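The file tracking and import chain described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the field names, the `WithFile` and `FormatError` methods, and the exact message layout are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// MergeContext is a simplified, hypothetical stand-in for the struct the PR
// describes: it records the file being processed and the chain of imports.
type MergeContext struct {
	CurrentFile string
	ImportChain []string
}

// WithFile (hypothetical) returns a copy of the context with file appended to
// the chain, so sibling imports never see each other's entries.
func (c MergeContext) WithFile(file string) MergeContext {
	chain := make([]string, len(c.ImportChain), len(c.ImportChain)+1)
	copy(chain, c.ImportChain)
	return MergeContext{CurrentFile: file, ImportChain: append(chain, file)}
}

// FormatError (hypothetical) wraps err with the file and import chain.
// Because it uses %w, the original error stays reachable via errors.Is.
func (c MergeContext) FormatError(err error) error {
	if err == nil {
		return nil
	}
	var b strings.Builder
	fmt.Fprintf(&b, "File being processed: %s\nImport chain:\n", c.CurrentFile)
	for _, f := range c.ImportChain {
		fmt.Fprintf(&b, "  → %s\n", f)
	}
	return fmt.Errorf("%w\n%s", err, b.String())
}

// errTypeMismatch reuses the message text from the example output above.
var errTypeMismatch = errors.New("cannot override two slices with different type ([]interface {}, string)")

func main() {
	ctx := MergeContext{}.
		WithFile("stacks/catalog/base.yaml").
		WithFile("stacks/deploy/prod/us-east-1.yaml")
	fmt.Println(ctx.FormatError(errTypeMismatch))
}
```

Returning a wrapped error instead of printing it is what lets the caller decide where and how often the message appears.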
Example Output
Before:
cannot override two slices with different type ([]interface {}, string)
cannot override two slices with different type ([]interface {}, string)
After:
Error: cannot override two slices with different type ([]interface {}, string)
File being processed: stacks/deploy/prod/us-east-1.yaml
Import chain:
→ stacks/catalog/base.yaml
→ stacks/mixins/region/us-east-1.yaml
→ stacks/deploy/prod/us-east-1.yaml
Likely cause: A key is defined as an array in one file and as a string in another.
Debug hint: Check the files above for keys that have different types.
Common issues:
- vars defined as both array and string
- settings with inconsistent types across imports
- overrides attempting to change field types
Changes Made
- Created `MergeContext` struct to track file paths and import chains
- Added `MergeWithContext` and `ProcessYAMLConfigFileWithContext` functions
- Removed direct stderr printing from `merge.go`
- Added debug logging for merge operations
- Comprehensive test coverage for all changes
- Full backward compatibility maintained
Testing
- Unit tests for MergeContext functionality
- Tests verifying no duplicate error printing
- Integration tests for validate command
- Test fixtures demonstrating type mismatch scenarios
- All existing tests continue to pass
Benefits
- Clear Error Location: Users immediately know which file contains the problem
- Import Chain Visibility: Shows the complete path of imports leading to the error
- No Duplicates: Errors appear once with proper formatting
- Actionable Hints: Specific guidance on resolving type mismatches
- Debug Support: Debug logging provides additional context when needed
- Non-Breaking: Fully backward compatible with existing code
Test Plan
- Run unit tests: `go test ./pkg/merge -v`
- Run integration tests: `go test ./internal/exec -run TestValidateStacks`
- Manual testing with type mismatch scenarios
- Verify no stderr printing occurs
- Verify debug logging works correctly
- Ensure backward compatibility
🤖 Generated with Claude Code
Summary by CodeRabbit
- New Features
- Context-rich error messages during validation and merges showing the file being processed, import chain, and actionable hints.
- Bug Fixes
- Suppressed duplicate stderr printing; errors are now returned and wrapped with contextual information instead of printed.
- Documentation
- Added user guide documenting enhanced error formatting, debugging tips, examples, and backward-compatibility notes.
- Tests
- Expanded test suite and fixtures validating context-aware error formatting, duplicate-suppression, import chains, and type-mismatch scenarios.
feat: implement precondition-based test skipping for better developer experience @osterman (#1450)
## Summary
- Implements a comprehensive precondition-based test skipping strategy to improve developer experience
- Adds SDK-based helper functions for checking AWS, Git, GitHub, and OCI authentication preconditions
- Updates all failing tests to gracefully skip when environmental requirements aren't met
Problem
Previously, ~18 tests were failing locally due to missing environmental setup (AWS profiles, Git configuration, network access, GitHub tokens), making it difficult for developers to contribute to Atmos without extensive environment configuration.
Solution
This PR introduces intelligent precondition checking that:
- Detects missing requirements using official SDKs (AWS SDK, go-git) rather than manual checks
- Skips tests gracefully with clear, actionable messages explaining what's missing and how to fix it
- Provides an escape hatch via `ATMOS_TEST_SKIP_PRECONDITION_CHECKS=true` for CI/CD environments
Changes
New Files
- `tests/test_preconditions.go` - Centralized helper functions for precondition checking
- `docs/prd/testing-strategy.md` - Product Requirements Document for the testing strategy
- Updated `tests/README.md` - Practical developer guidance
Helper Functions Added
- `RequireAWSProfile(t, profile)` - Validates AWS profile configuration
- `RequireGitRepository(t)` - Checks for Git repository
- `RequireGitRemoteWithValidURL(t)` - Validates Git remote URLs
- `RequireGitHubAccess(t)` - Checks GitHub connectivity and rate limits
- `RequireOCIAuthentication(t)` - Validates GitHub token for OCI registry access
- `RequireNetworkAccess(t, url)` - General network connectivity check
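The general shape of such a helper can be sketched as below. This is a hedged illustration rather than the code in `tests/test_preconditions.go`: `requireEnvToken`, the `skipper` interface, and the `recorder` fake are hypothetical names introduced here; only the `ATMOS_TEST_SKIP_PRECONDITION_CHECKS` variable and the use of `Skipf` come from the PR description.

```go
package main

import (
	"fmt"
	"os"
)

// skipChecksEnv is the escape-hatch variable named in the PR description.
const skipChecksEnv = "ATMOS_TEST_SKIP_PRECONDITION_CHECKS"

// skipper is the subset of *testing.T this sketch needs; *testing.T satisfies it.
type skipper interface {
	Helper()
	Skipf(format string, args ...any)
}

// requireEnvToken is a hypothetical helper: it skips the test unless one of
// the named environment variables is set or the escape hatch is enabled.
func requireEnvToken(t skipper, getenv func(string) string, names ...string) {
	t.Helper()
	if getenv(skipChecksEnv) == "true" {
		return // escape hatch: run the test regardless
	}
	for _, n := range names {
		if getenv(n) != "" {
			return // precondition met
		}
	}
	t.Skipf("none of %v is set; set one of them, or set %s=true", names, skipChecksEnv)
}

// recorder is a fake skipper so the sketch can run outside "go test".
type recorder struct {
	skipped bool
	msg     string
}

func (r *recorder) Helper() {}

func (r *recorder) Skipf(format string, args ...any) {
	r.skipped = true
	r.msg = fmt.Sprintf(format, args...)
}

func main() {
	r := &recorder{}
	requireEnvToken(r, os.Getenv, "GITHUB_TOKEN", "ATMOS_GITHUB_TOKEN")
	fmt.Println("skipped:", r.skipped, r.msg)
}
```

In a real test the call would be `requireEnvToken(t, os.Getenv, ...)` at the top of the test function. Passing `getenv` as a parameter keeps the helper itself testable without touching the process environment.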
Tests Updated
- AWS tests: `internal/aws_utils`, `internal/terraform_backend`, `pkg/store`
- Git tests: `internal/exec/describe_affected`, `internal/exec/terraform_utils`, `pkg/atlantis`, `pkg/describe`
- Vendor tests: `internal/exec/vendor_utils` (GitHub + OCI auth)
Testing
All previously failing packages now pass:
ok github.com/cloudposse/atmos/internal/aws_utils
ok github.com/cloudposse/atmos/internal/terraform_backend
ok github.com/cloudposse/atmos/internal/exec
ok github.com/cloudposse/atmos/pkg/store
ok github.com/cloudposse/atmos/pkg/atlantis
ok github.com/cloudposse/atmos/pkg/describe
Tests skip with helpful messages when preconditions aren't met:
TestExecuteVendorPull: GitHub token not configured: required for pulling OCI images from ghcr.io.
Set GITHUB_TOKEN or ATMOS_GITHUB_TOKEN environment variable, or set ATMOS_TEST_SKIP_PRECONDITION_CHECKS=true
Benefits
✅ Developers can run tests locally without extensive environment setup
✅ Clear skip messages guide developers to fix specific issues
✅ CI/CD can bypass checks when using mocked dependencies
✅ Improves contributor onboarding experience
✅ Distinguishes between environmental issues and actual code failures
References
- Fixes issues with local test execution
- Implements testing best practices for open-source projects
- Follows Go testing conventions with `t.Skipf()`
🤖 Generated with Claude Code
feat: add pre-commit hooks and development workflow @osterman (#1469)
## what
- Add comprehensive pre-commit configuration with Go-specific hooks
- Create development workflow commands via `.atmos.d/dev.yaml`
- Add GitHub Action for running pre-commit checks in CI
- Document development setup and workflow
why
- Prevent broken commits: The `go-build-mod` hook ensures code compiles before allowing commits
- Maintain code quality: Consistent formatting with `gofumpt` and linting with `golangci-lint`
- Improve developer experience: Simple `atmos dev` commands for setup and validation
- Automate CI checks: Pull requests automatically run pre-commit checks via CloudPosse's GitHub Action
Changes
Pre-commit Configuration
- `.pre-commit-config.yaml`: Hooks for gofumpt, go-build-mod, golangci-lint, go-mod-tidy, and file hygiene
Development Workflow
- `.atmos.d/dev.yaml`: Auto-loaded development commands
- `atmos dev setup` - One-time setup for contributors
- `atmos dev check` - Run pre-commit on staged files
- `atmos dev validate` - Comprehensive validation (build, lint, test)
- `atmos dev fix` - Auto-fix formatting issues
CI Integration
- `.github/workflows/pre-commit.yml`: GitHub Action using `cloudposse/github-action-pre-commit@v4.0.0`
- Runs on all pull requests
- Checks entire codebase (not just changed files)
- Includes compilation and dependency verification
Documentation
- `docs/development.md`: Comprehensive developer guide
- `.atmos.d/README.md`: Explains auto-loaded configurations
- Updated main `README.md` with development section
Testing
- Verified `atmos dev` commands work correctly
- Tested pre-commit configuration is valid
- Confirmed Go compilation check prevents broken commits
- Verified `.atmos.d/` auto-loading functionality
references
- Uses CloudPosse's own `github-action-pre-commit` for CI consistency
- Follows Go best practices for formatting and linting
- Leverages Atmos's auto-loading from `.atmos.d/` directory
- Fixes flaky build introduced in due to missing `GITHUB_TOKEN` and GitHub API rate limits
🤖 Generated with Claude Code
Summary by CodeRabbit
- Documentation
- Added developer-facing guides (docs/development.md, .atmos.d/README.md) describing local dev workflow, custom commands, setup, tooling, CI notes, and troubleshooting.
- Chores
- Added a standardized dev command suite (.atmos.d/dev.yaml), pre-commit configuration, and a GitHub workflow to run pre-commit checks on PRs; updated .gitignore to track only the new dev files.
- Tests
- Improved test isolation with per-test mock repo roots and an opt-out for loading repository dev configs; added unit tests for the new exclusion behavior.
fix(telemetry): prevent PostHog errors from leaking to user output @osterman (#1491)
## what
- Add custom PostHog logger adapter that integrates with Atmos's structured logging system
- Route PostHog error messages to debug level instead of letting them print directly to stderr
- Add configuration option (`telemetry.posthog_logging`) to control PostHog internal logging
- Implement SilentLogger that completely suppresses PostHog messages when logging is disabled
- Add comprehensive unit tests and documentation for the new telemetry logging features
why
- PostHog errors (like "502 Bad Gateway") were appearing in user output and breaking CLI tests
- PostHog was using its own internal logger that wrote directly to stderr, bypassing Atmos's logging controls
- Telemetry failures should be transparent to users and not impact their experience
- Users need ability to debug telemetry issues when setting up their own PostHog instances
- Error messages need to follow Atmos's structured logging conventions for consistency
It was failing builds due to snapshots changing.
"/Users/runner/work/atmos/atmos/tests/snapshots/TestCLICommands_atmos_--help.stdout.golden":
--- expected
+++ actual
@@ -1 +1,7 @@
+posthog 2025/09/21 23:37:14 INFO: response 502 502 Bad Gateway – <html>
+<head><title>502 Bad Gateway</title></head>
+<body>
+<center><h1>502 Bad Gateway</h1></center>
+</body>
+</html>
+posthog 2025/09/21 23:37:14 ERROR: 1 messages dropped because they failed to be sent and the client was closed
Solution Details
Custom Logger System
Created two logger implementations:
- PosthogLogger: Routes PostHog messages through Atmos's structured logging at DEBUG level
- SilentLogger: Completely suppresses all PostHog internal messages (default behavior)
New Configuration Option
Added `telemetry.posthog_logging` setting:
- false (default): Uses SilentLogger to suppress all PostHog internal messages
- true: Uses PosthogLogger to route messages through Atmos logging at DEBUG level
- Can be set via `atmos.yaml` or `ATMOS_TELEMETRY_POSTHOG_LOGGING` environment variable
Behavior at Different Log Levels
When `posthog_logging: true`:
- Info Level (default): PostHog messages are hidden from users
- Debug Level: PostHog messages appear with proper structured logging format
When `posthog_logging: false` (default):
- All PostHog internal messages are completely suppressed regardless of log level
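The selection logic described above can be sketched as follows. This is an illustration under stated assumptions: `telemetryLogger`, `silentLogger`, `debugLogger`, and `selectLogger` are hypothetical names, the interface shown is not the PostHog client's actual logger interface, and the output prefix is invented for the example.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// telemetryLogger is a hypothetical interface used only for this sketch;
// the real PostHog client defines its own logger interface.
type telemetryLogger interface {
	Logf(format string, args ...any)
	Errorf(format string, args ...any)
}

// silentLogger discards every message (the default behavior described above).
type silentLogger struct{}

func (silentLogger) Logf(string, ...any)   {}
func (silentLogger) Errorf(string, ...any) {}

// debugLogger treats every message, including errors, as debug output and
// writes only when debug logging is enabled.
type debugLogger struct {
	out          io.Writer
	debugEnabled bool
}

func (l debugLogger) Logf(format string, args ...any)   { l.write(format, args...) }
func (l debugLogger) Errorf(format string, args ...any) { l.write(format, args...) }

func (l debugLogger) write(format string, args ...any) {
	if !l.debugEnabled {
		return
	}
	fmt.Fprintf(l.out, "DEBU posthog: "+format+"\n", args...)
}

// selectLogger picks a logger from the posthog_logging setting.
func selectLogger(posthogLogging, debugEnabled bool, out io.Writer) telemetryLogger {
	if !posthogLogging {
		return silentLogger{}
	}
	return debugLogger{out: out, debugEnabled: debugEnabled}
}

func main() {
	selectLogger(false, true, os.Stderr).Errorf("502 Bad Gateway") // prints nothing
	selectLogger(true, false, os.Stderr).Errorf("502 Bad Gateway") // prints nothing at Info level
	selectLogger(true, true, os.Stderr).Errorf("response %d", 502) // prints one debug line
}
```

Demoting errors to debug level in the adapter is what keeps a failed telemetry request from changing command output or golden snapshots.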
Test Coverage
Added comprehensive unit tests that verify:
- Logger selection based on configuration
- Logger methods work correctly with structured output
- Errors only appear at debug level when logging is enabled
- PostHog errors don't leak to stderr in production
- SilentLogger suppresses all messages
- Configuration loading from both file and environment variables
Documentation
Updated telemetry documentation to explain:
- New `posthog_logging` configuration option
- When to enable PostHog internal logging
- How to configure via both `atmos.yaml` and environment variables
references
- Fixes DEV-3633
- Linear Issue: https://linear.app/cloudposse/issue/DEV-3633/handle-telemetry-issues-with-posthog-to-avoid-user-impact
- example failed build https://github.com/cloudposse/atmos/actions/runs/17900384206/job/50892638861
🤖 Generated with Claude Code
Summary by CodeRabbit
- New Features
- Configurable internal telemetry logging with a PostHog logger and a silent/no-op mode; new telemetry Options include logging.
- Bug Fixes
- Prevented internal telemetry errors from leaking to stdout/stderr and reduced noise by adjusting log levels.
- Improvements
- Telemetry accepts structured options, enqueues events (sent on Close), uses safer client cleanup, and updates messaging to "enqueued".
- Tests
- Expanded coverage for logger selection, log-level behavior, env/config flag handling, enqueue/close paths, and no-stderr pollution.
- Documentation
- Added telemetry logging configuration docs and fixed a typo.
feat: change !include to use file extension-based parsing and add !include.raw @osterman (#1493)
## what
- Changed `!include` function from content-based to extension-based file type detection
- Added new `!include.raw` function that always returns content as raw string
- Proper handling of URLs with query strings and fragments
why
- Predictable behavior: Users reported that auto-detecting file content was unpredictable. With extension-based detection, users know exactly how their files will be parsed
- Control over parsing: The previous content-based detection made it impossible to include structured data (like JSON) as a raw string. Now users can either use a `.txt` extension or the new `!include.raw` function
- URL compatibility: URLs with query strings (e.g., `config.json?v=2`) now work correctly - the query string doesn't affect extension detection
Key Changes
Extension-Based Parsing
- `.json` files are parsed as JSON
- `.yaml` and `.yml` files are parsed as YAML
- `.hcl`, `.tf`, and `.tfvars` files are parsed as HCL
- All other extensions return raw strings
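One way to implement this mapping is sketched below. `detectFormat` is a hypothetical helper, not the function in Atmos. It assumes that only references with both a scheme and a host are treated as URLs, and that extensions are matched case-sensitively; neither detail is confirmed by the PR description.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
)

// detectFormat (hypothetical) maps a file path or URL to a parser name based
// solely on the file extension.
func detectFormat(ref string) string {
	p := ref
	// For URLs, keep only the path so "?query" and "#fragment" are ignored.
	// Assumption: a reference counts as a URL only if it has a scheme and a host.
	if u, err := url.Parse(ref); err == nil && u.Scheme != "" && u.Host != "" {
		p = u.Path
	}
	switch path.Ext(p) {
	case ".json":
		return "json"
	case ".yaml", ".yml":
		return "yaml"
	case ".hcl", ".tf", ".tfvars":
		return "hcl"
	default:
		return "raw" // every other extension is returned as a raw string
	}
}

func main() {
	fmt.Println(detectFormat("https://api.example.com/config.json?version=2&format=raw")) // json
	fmt.Println(detectFormat("stacks/vars.yml"))                                           // yaml
	fmt.Println(detectFormat("deploy.sh"))                                                 // raw
}
```

Parsing the reference with `net/url` before taking the extension is the step that stops a query string from being read as part of the extension.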
New !include.raw Function
# Always returns content as string, regardless of extension
template: !include.raw template.json
script: !include.raw deploy.sh
URL Query String Support
# Extension detected correctly despite query strings
config: !include https://api.example.com/config.json?version=2&format=raw
Testing
- Comprehensive unit tests for extension detection
- Tests for URLs with query strings and fragments
- Integration tests for both `!include` and `!include.raw`
- All existing tests pass - backward compatible
Test plan
[x] Unit tests pass
[x] Integration tests pass
[x] Documentation updated
[x] Backward compatibility verified
🤖 Generated with Claude Code
Summary by CodeRabbit
- New Features
- !include now parses based on file extension (JSON, YAML, HCL → structured; others → raw).
- Added !include.raw to force raw text inclusion.
- Remote includes support extension-based parsing for URLs with extensions.
- Improvements
- Better handling of URLs (ignores query strings/fragments), hidden/multi-dot filenames, and mixed local/remote resolution.
- Remote/local include behavior and parsing made more robust.
- Documentation
- Updated include docs and added !include.raw page with examples.
- Tests
- Extensive tests covering include behaviors and extension parsing.
fix: restore config import override behavior while maintaining Windows fix @osterman (#1489)
## what
- Fix config import override behavior while maintaining Windows compatibility
- Add Windows file locking resilience for Terraform state operations
- Ensure main config file settings take precedence over imported configs
- Remove redundant `tempViper.MergeInConfig()` call that was breaking Windows tests
- Refactor `mergeConfig` into smaller, testable functions for better code coverage
- Improve error wrapping in config load functions for better debugging
why
- The initial fix for Windows broke the config import override behavior
- Windows CI tests were failing with "The process cannot access the file because another process has locked a portion of the file" errors
- Test `TestInitCliConfig/valid_import_custom_override` was failing because imported configs were overriding the main config
- The bug was causing YAML template functions (like `atmos.Component`) to return `null` values on Windows
- Code coverage was below the required threshold (45.45% vs 55.96% target)
- Error messages lacked context, making debugging difficult
references
- Fixes regression introduced in #1447 ("Make inline atmos config override config from imports")
- Addresses test failures in both Windows CI and import override tests
- Resolves Windows file locking issues in Terraform output operations
Technical Details
1. Import Override Fix
After processing imports, we now re-merge the original config content to ensure the main config takes precedence:
// Read original content before processing imports
content, err := os.ReadFile(configFilePath)
if processImports {
	// Process imports...
	// Re-merge original content to ensure it overrides imports
	err = tempViper.MergeConfig(bytes.NewReader(content))
}
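The effect of that final re-merge can be shown with plain maps. The sketch below does not use Viper and is not the code from `pkg/config`; `mergeInto` and `loadWithImports` are hypothetical helpers that only demonstrate the ordering: main config, then imports, then main config again.

```go
package main

import "fmt"

// mergeInto copies src into dst. On conflicting keys src wins; nested maps
// are merged recursively and copied, so dst never aliases src.
func mergeInto(dst, src map[string]any) {
	for k, v := range src {
		if sm, ok := v.(map[string]any); ok {
			dm, ok := dst[k].(map[string]any)
			if !ok {
				dm = map[string]any{}
				dst[k] = dm
			}
			mergeInto(dm, sm)
			continue
		}
		dst[k] = v
	}
}

// loadWithImports (hypothetical) mirrors the ordering described above.
func loadWithImports(mainCfg map[string]any, imports ...map[string]any) map[string]any {
	out := map[string]any{}
	mergeInto(out, mainCfg)
	for _, imp := range imports {
		mergeInto(out, imp) // imports may overwrite main settings here
	}
	mergeInto(out, mainCfg) // re-merge so the main config takes precedence
	return out
}

func main() {
	mainCfg := map[string]any{"logs": map[string]any{"level": "Debug"}}
	imported := map[string]any{"logs": map[string]any{"level": "Info", "file": "/dev/stderr"}}
	fmt.Println(loadWithImports(mainCfg, imported)) // map[logs:map[file:/dev/stderr level:Debug]]
}
```

Keys that appear only in an import survive, while any key set in the main file ends up with the main file's value.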
2. Windows File Locking Resilience
Added platform-specific handling for Windows file locking issues:
Error: Failed to read state file
The state file could not be read: read terraform.tfstate.d\nonprod-component-3\terraform.tfstate: The process cannot access the file because another process has locked a portion of the file.
This error occurred during the test `TestCLICommands/terraform_output_function_(no_tty)` on Windows CI.
Retry Logic with Exponential Backoff
- Retry file operations up to 3 times with delays (100ms, 200ms, 500ms)
- Wraps critical operations like `terraform output` and file deletion
- Only applies on Windows via build tags
Strategic Delays
- Small delays after workspace operations to allow file handles to release
- Prevents "file locked by another process" errors during rapid operations
Implementation
// Windows-specific retry wrapper
func retryOnWindows(fn func() error) error {
	// 3 attempts with exponential backoff
}
// Usage in terraform operations
err = retryOnWindows(func() error {
	outputMeta, err = tf.Output(ctx)
	return err
})
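A possible body for such a wrapper is sketched below. It is not the implementation from the PR: the name `retry`, the injected `sleep` function, and the choice to sleep after every failed attempt are assumptions, and the Windows-only build tag is omitted so the example runs on any platform. The delays are the ones listed above.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryDelays holds the delays listed above, one per attempt.
var retryDelays = []time.Duration{
	100 * time.Millisecond,
	200 * time.Millisecond,
	500 * time.Millisecond,
}

// errLocked stands in for a Windows file-locking error in this sketch.
var errLocked = errors.New("file locked by another process")

// noSleep lets tests run the retry loop without waiting.
func noSleep(time.Duration) {}

// retry (hypothetical) calls fn up to len(retryDelays) times. After each
// failed attempt it sleeps for the matching delay; it returns nil on the
// first success, or the last error if every attempt fails.
func retry(fn func() error, sleep func(time.Duration)) error {
	var err error
	for _, d := range retryDelays {
		if err = fn(); err == nil {
			return nil
		}
		sleep(d)
	}
	return err
}

func main() {
	calls := 0
	err := retry(func() error {
		calls++
		if calls < 3 {
			return errLocked // fail twice, then succeed
		}
		return nil
	}, time.Sleep)
	fmt.Println(calls, err) // 3 <nil>
}
```

Passing the sleep function as a parameter keeps unit tests fast, and on non-Windows builds the wrapper can simply call `fn` once.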
3. Refactoring for Better Testability
Broke the monolithic `mergeConfig` function into smaller, focused functions:
- `loadConfigFile()` - Loads config files into Viper (100% coverage)
- `readConfigFileContent()` - Reads file contents with error context (100% coverage)
- `processConfigImportsAndReapply()` - Handles import processing and reapplication (85.7% coverage)
- `marshalViperToYAML()` - Marshals Viper settings to YAML (80% coverage)
- `mergeYAMLIntoViper()` - Merges YAML into Viper instance (100% coverage)
4. Improved Error Handling
- Wrapped errors with descriptive context throughout config loading
- Preserves sentinel errors (like `ConfigFileNotFoundError`) for compatibility
- Provides clear error messages with file paths and operation context
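The pattern behind these points is standard Go error wrapping with `%w`. The sketch below is illustrative only: `readConfig`, `isNotFound`, and the `errConfigNotFound` sentinel are hypothetical and do not reproduce the functions in `pkg/config`.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// errConfigNotFound is a hypothetical sentinel used only in this sketch.
var errConfigNotFound = errors.New("config file not found")

// readConfig (hypothetical) adds the operation and path to any error while
// keeping the underlying cause reachable through errors.Is.
func readConfig(path string) ([]byte, error) {
	data, err := os.ReadFile(path)
	if errors.Is(err, os.ErrNotExist) {
		return nil, fmt.Errorf("read config %q: %w", path, errConfigNotFound)
	}
	if err != nil {
		return nil, fmt.Errorf("read config %q: %w", path, err)
	}
	return data, nil
}

// isNotFound reports whether err wraps the sentinel at any depth.
func isNotFound(err error) bool {
	return errors.Is(err, errConfigNotFound)
}

func main() {
	_, err := readConfig("does-not-exist/atmos.yaml")
	fmt.Println(err)             // read config "does-not-exist/atmos.yaml": config file not found
	fmt.Println(isNotFound(err)) // true
}
```

Callers that previously compared against the sentinel keep working as long as they use `errors.Is`, which is why wrapping can add context without breaking compatibility.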
Testing
- ✅ All existing tests pass (including Windows CI)
- ✅ Added comprehensive tests for Windows retry logic
- ✅ Added platform-specific tests (Windows vs non-Windows behavior)
- ✅ Code coverage improved from 45.45% to 81.25% (exceeding target)
- ✅ Integration tests for file locking scenarios
- ✅ Tests verify zero performance impact on non-Windows platforms
Files Changed
Core Fixes
- `pkg/config/load.go` - Config loading and import handling
- `pkg/config/merge.go` - Refactored merge logic
- `internal/exec/terraform_output_utils.go` - Windows retry logic
- `internal/exec/terraform_utils.go` - File operation resilience
Platform-Specific Code (Build Tags)
- `internal/exec/terraform_output_utils_windows.go` - Windows implementation
- `internal/exec/terraform_output_utils_other.go` - Unix implementation
Tests
- `pkg/config/merge_test.go` - Config merge tests
- `internal/exec/terraform_output_utils_*_test.go` - Platform-specific tests
- `internal/exec/terraform_output_utils_integration_test.go` - Integration tests
Summary by CodeRabbit
- New Features
- More reliable Terraform operations on Windows with automatic retries and short delays.
- Bug Fixes
- Configuration import precedence corrected so local settings override imported ones, including nested imports.
- Improved handling of invalid imported YAML to avoid full failures and surface clear errors.
- Style
- Standardized trimming of trailing whitespace with targeted exceptions; Go files use tab indentation.
- Tests
- Expanded coverage for config loading, import/merge semantics, Windows behavior, and CLI parsing.