github cloudposse/atmos v1.194.1-rc.0

latest releases: v1.195.0-rc.0, v1.194.1
pre-releaseone day ago
Add trace logging for all squelched errors @osterman (#1615) ## what - Add `log.Trace()` calls for all squelched errors throughout the codebase - Update logging documentation to include squelched error handling guidance - Update error handling strategy PRD to document squelched error patterns

why

  • Ensures no errors are silently lost, even when intentionally ignored
  • Provides complete error visibility during debugging with --logs-level=Trace
  • Establishes clear patterns for handling non-critical errors

details

Code Changes (19 files)

Added trace logging for squelched errors in:

  • Configuration binding: Environment variables and flags (viper.BindEnv(), viper.BindPFlag())
  • File cleanup: Temporary file and directory removal (os.Remove(), os.RemoveAll())
  • Resource closing: File handles, clients, connections (Close())
  • Lock operations: File locks in defer statements (Unlock())
  • UI operations: Terminal output, command help (fmt.Fprint(), cmd.Help())
  • Performance tracking: Histogram value recording
  • Cache operations: Non-critical cache file operations on Windows

Patterns Applied

// ❌ WRONG: Silent error squelching
_ = os.Remove(tempFile)

// ✅ CORRECT: Log squelched errors at Trace level
if err := os.Remove(tempFile); err != nil && !os.IsNotExist(err) {
    log.Trace("Failed to remove temporary file during cleanup", "error", err, "file", tempFile)
}

Special Cases

  • Defer statements: Capture errors in closures for logging
  • File existence checks: Use os.IsNotExist() to avoid logging expected conditions
  • Log file cleanup: Use fmt.Fprintf(os.Stderr, ...) to avoid logger recursion

Documentation Updates

  • docs/logging.md: Added comprehensive "Squelched Errors" section with patterns and examples
  • docs/prd/error-handling-strategy.md: Added "Squelched Error Handling" section with guidelines and code examples

references

  • Related to overall error handling and logging strategy
  • All changes compile and pass pre-commit hooks

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Prevented silent failures by adding error handling and trace-level logging across command flags, env bindings, config parsing, file cleanup, network response closing, metrics recording, and TUI rendering.
    • Improved cleanup reliability by logging non-fatal errors during temporary file/dir removal and resource closure.
  • Refactor

    • Standardized logging for squelched (non-fatal) errors, replacing ignored returns with guarded paths without changing core behavior.
  • Documentation

    • Expanded logging guidance with a new “Squelched Errors” section, examples, patterns, and best practices.
    • Updated error-handling strategy to include when and how to log squelched errors.
Improve performance heatmap @aknysh (#1611) ## what - Fix `--heatmap` flag not working with `terraform`, `helmfile`, and `packer` commands - Implement advanced performance tracking with self-time vs total-time separation - Fix recursive function performance tracking to show accurate counts AND accurate timing - Improve heatmap display with consistent metrics and informative legend - Add comprehensive tests for heatmap functionality and recursive tracking
atmos describe stacks --heatmap
image
atmos terraform plan vpc -s uw2-prod --heatmap
image

why

Heatmap Flag Fix

The --heatmap flag was not working for terraform, helmfile, and packer commands because:

  • These commands use DisableFlagParsing = true to pass native flags through to underlying tools
  • When flag parsing is disabled, Cobra doesn't parse the --heatmap flag
  • The PersistentPreRun hook couldn't detect the flag via cmd.Flags().GetBool("heatmap")
  • Performance tracking was never enabled, so no data collected

Performance Tracking Enhancement

--heatmap showed inconsistent timing metrics for long-running commands:

  • Elapsed time was correct
  • Individual function totals were massively inflated (- approximately 1,890x inflation)
  • Root cause: Unable to show accurate call counts for recursive functions without timing inflation

Requirement: Show true call volume (e.g., 1,890 calls) with accurate timing (no inflation) and consistent metrics.

changes

Advanced Performance Tracking (Self-Time vs Total-Time)

Implemented professional-grade profiling metrics that separate:

  • Total time (wall-clock): Includes time spent in child function calls
  • Self-time: Actual work done in the function, excluding children
  • Accurate recursive tracking: Shows ALL calls including recursive ones with correct timing

Key Features:

  1. Goroutine-local call stack tracking - Each goroutine maintains its own call stack
  2. Child time accumulation - Each stack frame tracks time spent in children
  3. Self-time calculation - selfTime = totalTime - childTime
  4. HDR Histogram on self-time - P95 based on actual work, not wall-clock
  5. Direct recursive tracking - No wrapper pattern needed, tracks every call accurately

Benefits:

  • Accurate recursive tracking: Shows true call counts (e.g., 1,890 calls) with correct timing
  • No inflation: Self-time excludes child execution, total-time includes it
  • Better insights: Identify where time is spent (total) vs where work is done (self-time)
  • Professional profiling: Same metrics as pprof, but function-level and easier to use

Example Output:

Function                                 Count    Total      Avg        Max        P95
utils.processCustomTags                  1024     4.27ms     4µs        146µs      15µs
  • Count: 1024 - ALL calls including recursive ones
  • Total: 4.27ms - wall-clock time including all children
  • Avg: 4µs - average self-time per call (actual work only)
  • Max: 146µs - maximum self-time for a single call (excludes children)
  • P95: 15µs - 95th percentile of self-time

Metric Consistency Fix

Changed Max to track self-time instead of total-time:

  • All three metrics (Avg, Max, P95) consistently track self-time
  • Rationale: Enables accurate comparison between metrics for identifying performance outliers

TUI Legend

Added informative legend to heatmap TUI:

Count: # calls (incl. recursion) | Total: wall-clock (incl. children & recursion) |
Avg: avg self-time | Max: max self-time | P95: 95th percentile self-time
  • Appears after the header in all visualization modes
  • Explains what each metric means
  • Clarifies that Count and Total include recursion
  • Helps users understand self-time vs total-time

Recursive Function Updates

Benefits:

  • Shows true call volume (e.g., 1,024 calls instead of 1)
  • Timing remains accurate via self-time calculation
  • No wrapper pattern complexity needed

Documentation

Updated profiling documentation:

  • "Understanding Total vs Self-Time" tip reflects that all three metrics now track self-time
  • Updated metric descriptions with accurate explanations
  • Added examples showing self-time context for outlier detection

Heatmap Flag Improvements

  • Added helper function to manually parse --heatmap from os.Args for commands with DisableFlagParsing = true
  • Integrated heatmap flag detection in terraform, helmfile, and packer commands
  • Added test coverage for heatmap flag detection

Enhanced Performance Tracking Coverage

  • Added perf.Track() to UnmarshalYAMLFromFile to track YAML parsing performance
  • Ensures complete visibility into YAML processing call chains during merge operations

testing

Performance Tracking Tests

New self-time tracking tests:

  1. TestSelfTimeVsTotalTime - Verifies self-time excludes child time
  2. TestNestedFunctionSelfTime - Tests multi-level nesting (grandparent → parent → child)
  3. TestDirectRecursionWithSelfTime - Tests direct recursion with accurate counts AND timing

Updated existing tests:

  • All 20 perf tests passing ✅
  • Tests verify self-time calculation accuracy for all three metrics (Avg, Max, P95)

Recursive function tests:

  1. TestRecursiveFunctionTracking - Wrapper pattern verification
  2. TestRecursiveFunctionWrongPattern - Demonstrates inflation with wrong pattern
  3. TestMultipleRecursiveFunctionsIndependent - Independent function tracking
  4. TestYAMLConfigProcessingRecursion - YAML import hierarchy (50 levels deep)
  5. TestYAMLConfigProcessingMultipleImports - Fan-out import pattern
  6. TestProcessBaseComponentConfigRecursion - Component inheritance (15 levels deep)

Heatmap UI Tests

Updated heatmap tests:

  • Updated legend verification tests
  • All 36 heatmap tests passing ✅

Heatmap Flag Tests

TestTerraformHeatmapFlag passes successfully
✅ Manual testing confirms --heatmap works with:

  • atmos terraform plan <component> -s <stack> --heatmap
  • atmos helmfile <subcommand> <component> -s <stack> --heatmap
  • atmos packer <subcommand> <component> -s <stack> --heatmap

Build Verification

✅ All linter checks passing (golangci-lint)
✅ Full project builds successfully
✅ Website documentation builds without errors
✅ All 56 tests passing (20 perf + 36 heatmap)

expected impact

For Recursive Functions

  • Before: Count = 1, accurate timing (using wrapper pattern, recursion hidden)
  • After: Count = 1,890 (shows true call volume), accurate timing (via self-time tracking)

Performance Insights

  • Total time: Identify functions contributing most to overall execution time
  • Self-time: Identify where actual work is being done (optimization targets)
  • Consistent metrics: All three metrics (Avg, Max, P95) track self-time for accurate comparison
  • Accurate P95: Based on self-time, more meaningful for performance analysis

Better User Experience

  • TUI legend: Users immediately understand what each metric means
  • Accurate Max: Identifies performance outliers in function's own work, not including children

Summary by CodeRabbit

  • New Features

    • --heatmap CLI flag available for Terraform, Helmfile, and Packer; heatmap UI adds a legend and wider duration columns.
    • New Terraform "affected" and query execution workflows.
  • Improvements

    • New profiler-related CLI flags and a performance-tracking toggle.
    • Performance metrics now emphasize self-time (Avg, P95, Max) for clearer bottleneck identification.
    • Clearer, more actionable user-facing error messages; telemetry notices suppressed by default.
  • Documentation

    • Profiling guide updated to explain Total vs Self-Time.
  • Chores

    • Go toolchain and example image versions bumped.

🚀 Enhancements

Fix component inheritance: restore default behavior for metadata.component @aknysh (#1614) ## what - Fix critical component inheritance regression where `metadata.component` was being overwritten during configuration inheritance - Restore correct default behavior where `metadata.component` defaults to the Atmos component name when not explicitly set - **Prevent state file loss**: Fix backend configuration to use correct `workspace_key_prefix` instead of inherited component names

why

  • PR #1602 introduced a regression in component processing that broke metadata.component handling
  • Critical Impact: Components using abstract component patterns were initializing new empty state files instead of finding existing state
    • When inheriting configuration via metadata.inherits, the system was inadvertently overwriting metadata.component with values from inherited components
    • Backend workspace_key_prefix ended up using the wrong component name (e.g., eks-service-defaults instead of eks-service)
    • This created new state at wrong S3 path, causing terraform plan to show full re-create of all resources
    • Could lead to resource duplication or apply failures due to conflicts
  • This caused errors like: invalid Terraform component: 'eks/n8n' points to the Terraform component 'helm-teleport-helper', but it does not exist

references

  • Fixes #1613 (Backend state file regression - workspace_key_prefix uses wrong component name)
  • Fixes #1609 (Component resolution fails when metadata.component omitted)
  • Related to PR #1602 (the refactoring that introduced the regression)

testing

  • Added comprehensive regression test in internal/exec/component_inheritance_test.go
    • Tests component that inherits configuration but doesn't set metadata.component
    • Verifies configuration inheritance (vars, settings) works correctly
    • Verifies metadata section is not inherited
    • Verifies component path defaults correctly to the Atmos component name
  • Added backend configuration test in internal/exec/abstract_component_backend_test.go
    • Critical test for backend state issue: Tests component with explicit metadata.component inheriting configuration from abstract component
    • Verifies metadata.component is preserved (not overwritten by inherited component's metadata.component)
    • Verifies backend workspace_key_prefix uses correct component path (e.g., eks-service not eks-service-defaults)
    • Verifies vars inheritance from abstract component works correctly
  • Created test fixtures in tests/fixtures/scenarios/:
    • component-inheritance-without-metadata-component/
    • abstract-component-backend/
  • All existing tests pass
  • Manual testing confirms the fix works correctly

impact

Before (1.194.0 - Broken)

# Abstract component
eks/service/defaults:
  metadata:
    type: abstract
    component: eks-service

# Concrete instance
eks/service/app1:
  metadata:
    component: eks-service
    inherits: [eks/service/defaults]  # Inherit configuration (vars, settings, etc.)

Problem: Processing metadata.inherits overwrote metadata.component

  • Backend ended up using workspace_key_prefix: eks-service-defaults (from inherited component)
  • Created new state at: s3://bucket/eks-service-defaults/...
  • Existing state at: s3://bucket/eks-service/... (not found)
  • Result: terraform plan showed full re-create of all resources ❌

After (This Fix)

Solution: metadata.component is preserved during configuration inheritance

  • Backend correctly uses workspace_key_prefix: eks-service (from explicit metadata.component)
  • Finds existing state at: s3://bucket/eks-service/...
  • Result: terraform plan shows correct incremental changes ✅

details

Key Concepts (Independent)

  1. metadata.component: Pointer to the physical terraform component directory in components/terraform/

    • Defaults to the Atmos component name if not specified
    • Determines the physical directory path for the component
    • Critical: Used to generate backend configuration (e.g., workspace_key_prefix)
    • Completely independent of configuration inheritance
  2. metadata.inherits: List of Atmos components to inherit configuration from (vars, settings, env, backend, etc.)

    • Purely for configuration merging - has nothing to do with component paths
    • The metadata section itself is NOT inherited from base components
    • Allows multiple inheritance for configuration merging
    • Completely independent of metadata.component and Atmos component names
  3. Atmos component name: The logical name of the component in the stack configuration

    • Used as default for metadata.component if not specified
    • Completely independent of configuration inheritance

The Bug

The regression in PR #1602 caused metadata.component to be inadvertently overwritten during configuration inheritance processing:

  1. Component eks/service/app1 has metadata.component: eks-service and metadata.inherits: [eks/service/defaults]
  2. System processes metadata.inherits to merge configuration (vars, settings, etc.) from eks/service/defaults
  3. BUG: During this processing, metadata.component: eks-service gets overwritten with the inherited component's metadata.component value
  4. Result: Component path becomes eks-service-defaults instead of eks-service

The Fix

Modified processMetadataInheritance() in stack_processor_process_stacks_helpers_inheritance.go to:

  • Track whether metadata.component was explicitly set on the current component
  • Process metadata.inherits to inherit vars, settings, backend, etc. (but NOT metadata)
  • Restore the explicitly-set metadata.component after processing inherits (prevents overwriting)
  • Default metadata.component to the Atmos component name when not explicitly set
  • This happens regardless of whether metadata.inherits exists or not

Code Changes

The fix ensures proper handling in three scenarios:

Scenario 1: Component without explicit metadata.component (Fixes #1609)

components:
  terraform:
    eks/n8n:  # Atmos component name
      metadata:
        inherits: [eks/helm-teleport-helper]  # Inherit configuration only
      vars:
        name: "n8n"

✅ Now works: metadata.component defaults to eks/n8n (Atmos component name)
❌ Before: metadata.component was overwritten during configuration inheritance, became helm-teleport-helper (component not found error)

Scenario 2: Component with explicit metadata.component + configuration inheritance (Fixes #1613)

components:
  terraform:
    eks/service/app1:
      metadata:
        component: eks-service  # Explicit - determines physical path
        inherits: [eks/service/defaults]  # Inherit configuration (vars, backend, etc.)

✅ Now works: metadata.component stays eks-service, backend uses correct workspace_key_prefix
❌ Before: metadata.component overwritten to eks-service-defaults during configuration inheritance, new empty state created

Scenario 3: Top-level component: attribute (Also affected by regression)

components:
  terraform:
    test/test-component-override:
      component: test/test-component  # Top-level attribute (not metadata.component)
      # ... configuration ...

✅ Now works: Top-level component: sets the component path, not overwritten by defaults
❌ Before: Default logic was overwriting top-level component inheritance

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Component inheritance now relies solely on metadata.inherits and preserves top-level component attributes.
    • Default component path/name correctly resolves when metadata.component is absent.
    • Abstract-component backends inherit workspace_key_prefix and backend settings as expected.
    • Improved validation and error handling for metadata.inherits entries.
  • Tests

    • Added scenarios and tests covering abstract-component backends, inheritance without metadata.component, and a guard for certain mock failures.
  • Chores

    • Updated dependency versions.

Don't miss a new atmos release

NewReleases is sending notifications on new releases.