github cloudposse/atmos v1.194.1

latest release: v1.195.0-rc.0
13 hours ago
Fix and Improve Performance Heatmap @aknysh (#1622)

what

  • Improved heatmap performance in Docker, fixing critical startup delays caused by the --heatmap flag
  • Renamed the "Total" column to "CPU Time" throughout the performance heatmap display to clarify that it represents the sum of self-times, not wall-clock time
  • Added parallelism metric to both console and TUI displays showing CPU Time ÷ Elapsed ratio
  • Improved TUI visualizations (bar chart and sparkline modes) to display average time per call instead of total CPU time for better user comprehension
  • Enhanced TUI legend with live performance metrics (Parallelism, Elapsed time, CPU Time) displayed at the top
  • Split TUI legend into three lines for improved readability
  • Updated the documentation in website/docs/troubleshoot/profiling.mdx with explanations of all new metrics and display formats

why

Docker performance context

This PR improves heatmap performance in Docker, fixing critical performance issues with the --heatmap flag.

The Docker Problem:

  • Commands with --heatmap took ~60 seconds to start in Docker containers (vs instantly on macOS)
  • Root cause: runtime.Stack() was called on every tracked function to get goroutine IDs
  • In Docker, this syscall is significantly slower than on native macOS
  • With thousands of tracked function calls during stack processing, overhead accumulated to ~1 minute

The Solution:

  • Introduced "simple tracking mode" using a single global call stack instead of per-goroutine tracking
  • Avoids expensive runtime.Stack() calls for single-goroutine execution (most Atmos commands)
  • Result: ~19x faster (119µs vs 3ms for 1000 calls)
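The two tracking modes can be sketched as follows. This is a simplified illustration, not Atmos's actual implementation: `goroutineID` shows the well-known `runtime.Stack` parsing trick whose syscall cost dominated Docker startup, and `trackSimple` shows the idea of a single global call stack that avoids it entirely for single-goroutine commands.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"strconv"
)

// goroutineID parses the current goroutine ID out of runtime.Stack output.
// This is the expensive per-call operation that simple mode avoids.
func goroutineID() uint64 {
	buf := make([]byte, 64)
	n := runtime.Stack(buf, false)
	// Stack output begins with "goroutine <id> [...]".
	fields := bytes.Fields(buf[:n])
	id, _ := strconv.ParseUint(string(fields[1]), 10, 64)
	return id
}

// Simple tracking mode: one global call stack, no syscalls per tracked call.
var globalStack []string

// trackSimple pushes a function name and returns a func that pops it.
func trackSimple(name string) func() {
	globalStack = append(globalStack, name)
	return func() { globalStack = globalStack[:len(globalStack)-1] }
}

func main() {
	fmt.Println("goroutine id:", goroutineID())
	stop := trackSimple("describeStacks")
	fmt.Println("depth while tracked:", len(globalStack))
	stop()
	fmt.Println("depth after:", len(globalStack))
}
```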

TUI improvements

Problem 1: Confusing Column Name
The "Total" column name was ambiguous: users couldn't tell whether it meant wall-clock time or CPU time. This caused confusion when interpreting performance data.

Problem 2: Missing Parallelism Context
Users had no way to understand the relationship between CPU time and elapsed time. When CPU time exceeded elapsed time (e.g., 5 minutes of CPU time in 22 seconds), it looked wrong but was actually correct for parallel execution.

Problem 3: TUI Display Showed Confusing Values
The bar chart displayed large total times (e.g., "2m47s") for functions called hundreds of thousands of times, making it appear slow when the average time per call was actually fast (e.g., 0.37ms). This was mathematically correct but user-unfriendly.

Problem 4: Cramped Legend
The single-line legend was too wide and difficult to read in the TUI.

Solutions

Solution 1: Clear Terminology
Renamed "Total" to "CPU Time" everywhere with consistent explanations that it represents "sum of self-time (excludes children)" to avoid confusion with wall-clock time.

Solution 2: Parallelism Metric
Added parallelism calculation and display in both console output and TUI legend:

  • Format: Parallelism: ~0.9x (single-threaded) or ~58.2x (highly parallel)
  • Helps users immediately understand execution characteristics
  • Values less than 1.0 indicate single-threaded execution
  • Values greater than 1.0 indicate parallel execution across multiple cores

Solution 3: User-Friendly TUI Visualizations
Changed bar chart and sparkline displays from showing total CPU time to showing average time per call with call counts:

  • Old format: 2m47.818679s (×447586) - confusing
  • New format: avg: 0.37ms | calls: 447586 - intuitive
  • Bar length still represents total CPU time (overall impact)
  • Average time shows typical function performance

Solution 4: Multi-Line Legend
Split the TUI legend into three informative lines:

  • Line 1: Live metrics (Parallelism, Elapsed, CPU Time)
  • Line 2: Explanation of Count and CPU Time columns
  • Line 3: Explanation of statistical timing columns (Avg, Max, P95)

example

(screenshots in the PR show the updated heatmap display)

testing

All existing tests pass:

  • ✅ 26 tests in pkg/perf package
  • ✅ 36 tests in pkg/ui/heatmap package
  • ✅ 87 tests in cmd package
  • ✅ Total: 149 tests across all modified packages

Manual verification:

  • Compiled binary successfully on macOS
  • Tested in Docker
  • Verified console output shows new format with parallelism
  • Confirmed interactive TUI displays updated legend
  • Website documentation builds without errors
  • Docker performance remains fast (1-2 second startup with --heatmap)

documentation

Updated website/docs/troubleshoot/profiling.mdx with:

  • New "Interactive TUI Legend" section explaining the 3-line legend format
  • Updated Performance Summary section with Parallelism explanation
  • Updated all column descriptions (CPU Time instead of Total)
  • Enhanced Bar Chart and Sparkline mode descriptions with new display format
  • Added concrete examples showing the new avg: Xms | calls: N format

benefits

  1. Clearer Metrics: "CPU Time" is unambiguous compared to "Total"
  2. Execution Context: Parallelism metric immediately shows if execution was single-threaded or parallel
  3. Intuitive Display: Average times per call are much easier to understand than large totals
  4. Better UX: Multi-line legend is more readable and informative
  5. Complete Documentation: Users have clear explanations of all metrics
  6. Backward Compatible: All existing tests pass without changes
  7. Docker Ready: Works fast in Docker environments (builds on PR #1611 improvements)

Summary by CodeRabbit

  • New Features

    • Heatmap and summaries now show CPU Time and Parallelism; charts and sparklines show average time per call with counts; interactive UI supports cancellation and a new toggle to control tracking mode.
  • Bug Fixes

    • More accurate CPU Time and parallelism calculations; per-function CPU Time shown; durations truncated to microseconds; headers and columns reorganized for clarity.
  • Tests

    • Added tests for tracking modes, performance comparisons, and legend/render output; Windows skips and more robust environment/IO handling.
  • Documentation

    • Profiling docs updated to explain CPU Time vs Self‑Time with revised examples.
  • Chores

    • CI tool version bumped and test cleanup/synchronization improvements.
Add trace logging for all squelched errors @osterman (#1615)

what

  • Add log.Trace() calls for all squelched errors throughout the codebase
  • Update logging documentation to include squelched error handling guidance
  • Update error handling strategy PRD to document squelched error patterns

why

  • Ensures no errors are silently lost, even when intentionally ignored
  • Provides complete error visibility during debugging with --logs-level=Trace
  • Establishes clear patterns for handling non-critical errors

details

Code Changes (19 files)

Added trace logging for squelched errors in:

  • Configuration binding: Environment variables and flags (viper.BindEnv(), viper.BindPFlag())
  • File cleanup: Temporary file and directory removal (os.Remove(), os.RemoveAll())
  • Resource closing: File handles, clients, connections (Close())
  • Lock operations: File locks in defer statements (Unlock())
  • UI operations: Terminal output, command help (fmt.Fprint(), cmd.Help())
  • Performance tracking: Histogram value recording
  • Cache operations: Non-critical cache file operations on Windows

Patterns Applied

// ❌ WRONG: Silent error squelching
_ = os.Remove(tempFile)

// ✅ CORRECT: Log squelched errors at Trace level
if err := os.Remove(tempFile); err != nil && !os.IsNotExist(err) {
    log.Trace("Failed to remove temporary file during cleanup", "error", err, "file", tempFile)
}

Special Cases

  • Defer statements: Capture errors in closures for logging
  • File existence checks: Use os.IsNotExist() to avoid logging expected conditions
  • Log file cleanup: Use fmt.Fprintf(os.Stderr, ...) to avoid logger recursion
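The defer-closure special case can be sketched as follows. `logTrace` stands in for the project's `log.Trace`; its structured key/value signature is an assumption based on the pattern shown above.

```go
package main

import (
	"fmt"
	"os"
)

// logTrace is a stand-in for the project's log.Trace; the real
// signature is assumed from the key/value style in the pattern above.
func logTrace(msg string, kv ...any) {
	fmt.Println(append([]any{"TRACE:", msg}, kv...)...)
}

func writeTemp() error {
	f, err := os.CreateTemp("", "atmos-*")
	if err != nil {
		return err
	}
	// Defer a closure so Close/Remove errors are captured and logged,
	// rather than silently discarded as with a bare `defer f.Close()`.
	defer func() {
		if cerr := f.Close(); cerr != nil {
			logTrace("Failed to close temp file", "error", cerr, "file", f.Name())
		}
		// os.IsNotExist filters out the expected already-gone case.
		if rerr := os.Remove(f.Name()); rerr != nil && !os.IsNotExist(rerr) {
			logTrace("Failed to remove temp file", "error", rerr, "file", f.Name())
		}
	}()
	_, err = f.WriteString("hello")
	return err
}

func main() {
	if err := writeTemp(); err != nil {
		fmt.Println("error:", err)
	}
}
```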

Documentation Updates

  • docs/logging.md: Added comprehensive "Squelched Errors" section with patterns and examples
  • docs/prd/error-handling-strategy.md: Added "Squelched Error Handling" section with guidelines and code examples

references

  • Related to overall error handling and logging strategy
  • All changes compile and pass pre-commit hooks

Summary by CodeRabbit

  • Bug Fixes

    • Prevented silent failures by adding error handling and trace-level logging across command flags, env bindings, config parsing, file cleanup, network response closing, metrics recording, and TUI rendering.
    • Improved cleanup reliability by logging non-fatal errors during temporary file/dir removal and resource closure.
  • Refactor

    • Standardized logging for squelched (non-fatal) errors, replacing ignored returns with guarded paths without changing core behavior.
  • Documentation

    • Expanded logging guidance with a new “Squelched Errors” section, examples, patterns, and best practices.
    • Updated error-handling strategy to include when and how to log squelched errors.
Improve performance heatmap @aknysh (#1611)

what

  • Fix --heatmap flag not working with terraform, helmfile, and packer commands
  • Implement advanced performance tracking with self-time vs total-time separation
  • Fix recursive function performance tracking to show accurate counts AND accurate timing
  • Improve heatmap display with consistent metrics and informative legend
  • Add comprehensive tests for heatmap functionality and recursive tracking

Screenshots in the PR show the output of:

  • atmos describe stacks --heatmap
  • atmos terraform plan vpc -s uw2-prod --heatmap

why

Heatmap Flag Fix

The --heatmap flag was not working for terraform, helmfile, and packer commands because:

  • These commands use DisableFlagParsing = true to pass native flags through to underlying tools
  • When flag parsing is disabled, Cobra doesn't parse the --heatmap flag
  • The PersistentPreRun hook couldn't detect the flag via cmd.Flags().GetBool("heatmap")
  • Performance tracking was never enabled, so no data collected

Performance Tracking Enhancement

--heatmap showed inconsistent timing metrics for long-running commands:

  • Elapsed time was correct
  • Individual function totals were massively inflated (approximately 1,890x inflation)
  • Root cause: Unable to show accurate call counts for recursive functions without timing inflation

Requirement: Show true call volume (e.g., 1,890 calls) with accurate timing (no inflation) and consistent metrics.

changes

Advanced Performance Tracking (Self-Time vs Total-Time)

Implemented professional-grade profiling metrics that separate:

  • Total time (wall-clock): Includes time spent in child function calls
  • Self-time: Actual work done in the function, excluding children
  • Accurate recursive tracking: Shows ALL calls including recursive ones with correct timing

Key Features:

  1. Goroutine-local call stack tracking - Each goroutine maintains its own call stack
  2. Child time accumulation - Each stack frame tracks time spent in children
  3. Self-time calculation - selfTime = totalTime - childTime
  4. HDR Histogram on self-time - P95 based on actual work, not wall-clock
  5. Direct recursive tracking - No wrapper pattern needed, tracks every call accurately

Benefits:

  • Accurate recursive tracking: Shows true call counts (e.g., 1,890 calls) with correct timing
  • No inflation: Self-time excludes child execution, total-time includes it
  • Better insights: Identify where time is spent (total) vs where work is done (self-time)
  • Professional profiling: Same metrics as pprof, but function-level and easier to use

Example Output:

Function                                 Count    Total      Avg        Max        P95
utils.processCustomTags                  1024     4.27ms     4µs        146µs      15µs
  • Count: 1024 - ALL calls including recursive ones
  • Total: 4.27ms - wall-clock time including all children
  • Avg: 4µs - average self-time per call (actual work only)
  • Max: 146µs - maximum self-time for a single call (excludes children)
  • P95: 15µs - 95th percentile of self-time

Metric Consistency Fix

Changed Max to track self-time instead of total-time:

  • All three metrics (Avg, Max, P95) consistently track self-time
  • Rationale: Enables accurate comparison between metrics for identifying performance outliers

TUI Legend

Added informative legend to heatmap TUI:

Count: # calls (incl. recursion) | Total: wall-clock (incl. children & recursion) |
Avg: avg self-time | Max: max self-time | P95: 95th percentile self-time
  • Appears after the header in all visualization modes
  • Explains what each metric means
  • Clarifies that Count and Total include recursion
  • Helps users understand self-time vs total-time

Recursive Function Updates

Benefits:

  • Shows true call volume (e.g., 1,024 calls instead of 1)
  • Timing remains accurate via self-time calculation
  • No wrapper pattern complexity needed

Documentation

Updated profiling documentation:

  • "Understanding Total vs Self-Time" tip reflects that all three metrics now track self-time
  • Updated metric descriptions with accurate explanations
  • Added examples showing self-time context for outlier detection

Heatmap Flag Improvements

  • Added helper function to manually parse --heatmap from os.Args for commands with DisableFlagParsing = true
  • Integrated heatmap flag detection in terraform, helmfile, and packer commands
  • Added test coverage for heatmap flag detection
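With DisableFlagParsing = true, the flag has to be found by scanning the raw arguments. A sketch of that idea (the helper name, the `--heatmap=true` variant, and the `--` cutoff are assumptions; the actual helper in the PR may differ):

```go
package main

import "fmt"

// hasHeatmapFlag scans raw args for --heatmap, since Cobra never parses
// flags for commands with DisableFlagParsing = true.
func hasHeatmapFlag(args []string) bool {
	for _, a := range args {
		if a == "--" {
			break // assume args after "--" belong to the underlying tool
		}
		if a == "--heatmap" || a == "--heatmap=true" {
			return true
		}
	}
	return false
}

func main() {
	args := []string{"terraform", "plan", "vpc", "-s", "uw2-prod", "--heatmap"}
	fmt.Println(hasHeatmapFlag(args))
}
```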

Enhanced Performance Tracking Coverage

  • Added perf.Track() to UnmarshalYAMLFromFile to track YAML parsing performance
  • Ensures complete visibility into YAML processing call chains during merge operations

testing

Performance Tracking Tests

New self-time tracking tests:

  1. TestSelfTimeVsTotalTime - Verifies self-time excludes child time
  2. TestNestedFunctionSelfTime - Tests multi-level nesting (grandparent → parent → child)
  3. TestDirectRecursionWithSelfTime - Tests direct recursion with accurate counts AND timing

Updated existing tests:

  • All 20 perf tests passing ✅
  • Tests verify self-time calculation accuracy for all three metrics (Avg, Max, P95)

Recursive function tests:

  1. TestRecursiveFunctionTracking - Wrapper pattern verification
  2. TestRecursiveFunctionWrongPattern - Demonstrates inflation with wrong pattern
  3. TestMultipleRecursiveFunctionsIndependent - Independent function tracking
  4. TestYAMLConfigProcessingRecursion - YAML import hierarchy (50 levels deep)
  5. TestYAMLConfigProcessingMultipleImports - Fan-out import pattern
  6. TestProcessBaseComponentConfigRecursion - Component inheritance (15 levels deep)

Heatmap UI Tests

Updated heatmap tests:

  • Updated legend verification tests
  • All 36 heatmap tests passing ✅

Heatmap Flag Tests

✅ TestTerraformHeatmapFlag passes successfully
✅ Manual testing confirms --heatmap works with:

  • atmos terraform plan <component> -s <stack> --heatmap
  • atmos helmfile <subcommand> <component> -s <stack> --heatmap
  • atmos packer <subcommand> <component> -s <stack> --heatmap

Build Verification

✅ All linter checks passing (golangci-lint)
✅ Full project builds successfully
✅ Website documentation builds without errors
✅ All 56 tests passing (20 perf + 36 heatmap)

expected impact

For Recursive Functions

  • Before: Count = 1, accurate timing (using wrapper pattern, recursion hidden)
  • After: Count = 1,890 (shows true call volume), accurate timing (via self-time tracking)

Performance Insights

  • Total time: Identify functions contributing most to overall execution time
  • Self-time: Identify where actual work is being done (optimization targets)
  • Consistent metrics: All three metrics (Avg, Max, P95) track self-time for accurate comparison
  • Accurate P95: Based on self-time, more meaningful for performance analysis

Better User Experience

  • TUI legend: Users immediately understand what each metric means
  • Accurate Max: Identifies performance outliers in function's own work, not including children

Summary by CodeRabbit

  • New Features

    • --heatmap CLI flag available for Terraform, Helmfile, and Packer; heatmap UI adds a legend and wider duration columns.
    • New Terraform "affected" and query execution workflows.
  • Improvements

    • New profiler-related CLI flags and a performance-tracking toggle.
    • Performance metrics now emphasize self-time (Avg, P95, Max) for clearer bottleneck identification.
    • Clearer, more actionable user-facing error messages; telemetry notices suppressed by default.
  • Documentation

    • Profiling guide updated to explain Total vs Self-Time.
  • Chores

    • Go toolchain and example image versions bumped.
Bugfix: `auth` identities should default load the AWS Environment vars (if AWS based identities) @Benbentwo (#1623)

what

  • atmos auth exec and atmos auth env require the default AWS environment variables to function properly. The original implementation had this feature, but it was unintentionally removed during the refactoring of how the Environment() block was used.

why

  • Allows atmos auth exec -- aws s3 ls and selecting an identity to work as expected.

Summary by CodeRabbit

  • New Features

    • Automatically merges AWS file-based environment variables (e.g., profile and config/credentials settings) for Assume Role and Permission Set identities, applied before app-level environment overrides. This improves compatibility with existing AWS CLI configurations.
  • Tests

    • Updated tests to verify inclusion of AWS file-based variables alongside custom configuration variables in the final environment output.
Fix component inheritance: restore default behavior for metadata.component @aknysh (#1614)

what

  • Fix critical component inheritance regression where metadata.component was being overwritten during configuration inheritance
  • Restore correct default behavior where metadata.component defaults to the Atmos component name when not explicitly set
  • Prevent state file loss: Fix backend configuration to use correct workspace_key_prefix instead of inherited component names

why

  • PR #1602 introduced a regression in component processing that broke metadata.component handling
  • Critical Impact: Components using abstract component patterns were initializing new empty state files instead of finding existing state
    • When inheriting configuration via metadata.inherits, the system was inadvertently overwriting metadata.component with values from inherited components
    • Backend workspace_key_prefix ended up using the wrong component name (e.g., eks-service-defaults instead of eks-service)
    • This created new state at wrong S3 path, causing terraform plan to show full re-create of all resources
    • Could lead to resource duplication or apply failures due to conflicts
  • This caused errors like: invalid Terraform component: 'eks/n8n' points to the Terraform component 'helm-teleport-helper', but it does not exist

references

  • Fixes #1613 (Backend state file regression - workspace_key_prefix uses wrong component name)
  • Fixes #1609 (Component resolution fails when metadata.component omitted)
  • Related to PR #1602 (the refactoring that introduced the regression)

testing

  • Added comprehensive regression test in internal/exec/component_inheritance_test.go
    • Tests component that inherits configuration but doesn't set metadata.component
    • Verifies configuration inheritance (vars, settings) works correctly
    • Verifies metadata section is not inherited
    • Verifies component path defaults correctly to the Atmos component name
  • Added backend configuration test in internal/exec/abstract_component_backend_test.go
    • Critical test for backend state issue: Tests component with explicit metadata.component inheriting configuration from abstract component
    • Verifies metadata.component is preserved (not overwritten by inherited component's metadata.component)
    • Verifies backend workspace_key_prefix uses correct component path (e.g., eks-service not eks-service-defaults)
    • Verifies vars inheritance from abstract component works correctly
  • Created test fixtures in tests/fixtures/scenarios/:
    • component-inheritance-without-metadata-component/
    • abstract-component-backend/
  • All existing tests pass
  • Manual testing confirms the fix works correctly

impact

Before (1.194.0 - Broken)

# Abstract component
eks/service/defaults:
  metadata:
    type: abstract
    component: eks-service

# Concrete instance
eks/service/app1:
  metadata:
    component: eks-service
    inherits: [eks/service/defaults]  # Inherit configuration (vars, settings, etc.)

Problem: Processing metadata.inherits overwrote metadata.component

  • Backend ended up using workspace_key_prefix: eks-service-defaults (from inherited component)
  • Created new state at: s3://bucket/eks-service-defaults/...
  • Existing state at: s3://bucket/eks-service/... (not found)
  • Result: terraform plan showed full re-create of all resources ❌

After (This Fix)

Solution: metadata.component is preserved during configuration inheritance

  • Backend correctly uses workspace_key_prefix: eks-service (from explicit metadata.component)
  • Finds existing state at: s3://bucket/eks-service/...
  • Result: terraform plan shows correct incremental changes ✅

details

Key Concepts (Independent)

  1. metadata.component: Pointer to the physical terraform component directory in components/terraform/

    • Defaults to the Atmos component name if not specified
    • Determines the physical directory path for the component
    • Critical: Used to generate backend configuration (e.g., workspace_key_prefix)
    • Completely independent of configuration inheritance
  2. metadata.inherits: List of Atmos components to inherit configuration from (vars, settings, env, backend, etc.)

    • Purely for configuration merging - has nothing to do with component paths
    • The metadata section itself is NOT inherited from base components
    • Allows multiple inheritance for configuration merging
    • Completely independent of metadata.component and Atmos component names
  3. Atmos component name: The logical name of the component in the stack configuration

    • Used as default for metadata.component if not specified
    • Completely independent of configuration inheritance

The Bug

The regression in PR #1602 caused metadata.component to be inadvertently overwritten during configuration inheritance processing:

  1. Component eks/service/app1 has metadata.component: eks-service and metadata.inherits: [eks/service/defaults]
  2. System processes metadata.inherits to merge configuration (vars, settings, etc.) from eks/service/defaults
  3. BUG: During this processing, metadata.component: eks-service gets overwritten with the inherited component's metadata.component value
  4. Result: Component path becomes eks-service-defaults instead of eks-service

The Fix

Modified processMetadataInheritance() in stack_processor_process_stacks_helpers_inheritance.go to:

  • Track whether metadata.component was explicitly set on the current component
  • Process metadata.inherits to inherit vars, settings, backend, etc. (but NOT metadata)
  • Restore the explicitly-set metadata.component after processing inherits (prevents overwriting)
  • Default metadata.component to the Atmos component name when not explicitly set
  • This happens regardless of whether metadata.inherits exists or not
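The fix described above can be sketched with plain maps. This is illustrative only: the real processMetadataInheritance operates on full component configs, and the metadata section is never merged at all, but the remember/merge/restore-or-default shape is the same:

```go
package main

import "fmt"

// resolveComponentPath sketches the fix: remember an explicitly set
// metadata.component, merge inherited values, then restore the explicit
// value, or default it to the Atmos component name when absent.
func resolveComponentPath(atmosName string, metadata, inherited map[string]any) string {
	explicit, hadExplicit := metadata["component"].(string)

	// The regression: inherited values leaked into metadata during merging.
	for k, v := range inherited {
		metadata[k] = v
	}

	if hadExplicit {
		metadata["component"] = explicit // restore the explicit setting
	} else {
		metadata["component"] = atmosName // default to the Atmos component name
	}
	return metadata["component"].(string)
}

func main() {
	// Scenario 2: an explicit metadata.component survives inheritance.
	fmt.Println(resolveComponentPath("eks/service/app1",
		map[string]any{"component": "eks-service"},
		map[string]any{"component": "eks-service-defaults"}))
	// Scenario 1: no explicit value defaults to the Atmos component name.
	fmt.Println(resolveComponentPath("eks/n8n",
		map[string]any{},
		map[string]any{"component": "helm-teleport-helper"}))
}
```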

Code Changes

The fix ensures proper handling in three scenarios:

Scenario 1: Component without explicit metadata.component (Fixes #1609)

components:
  terraform:
    eks/n8n:  # Atmos component name
      metadata:
        inherits: [eks/helm-teleport-helper]  # Inherit configuration only
      vars:
        name: "n8n"

✅ Now works: metadata.component defaults to eks/n8n (Atmos component name)
❌ Before: metadata.component was overwritten during configuration inheritance, became helm-teleport-helper (component not found error)

Scenario 2: Component with explicit metadata.component + configuration inheritance (Fixes #1613)

components:
  terraform:
    eks/service/app1:
      metadata:
        component: eks-service  # Explicit - determines physical path
        inherits: [eks/service/defaults]  # Inherit configuration (vars, backend, etc.)

✅ Now works: metadata.component stays eks-service, backend uses correct workspace_key_prefix
❌ Before: metadata.component overwritten to eks-service-defaults during configuration inheritance, new empty state created

Scenario 3: Top-level component: attribute (Also affected by regression)

components:
  terraform:
    test/test-component-override:
      component: test/test-component  # Top-level attribute (not metadata.component)
      # ... configuration ...

✅ Now works: Top-level component: sets the component path, not overwritten by defaults
❌ Before: Default logic was overwriting top-level component inheritance

Summary by CodeRabbit

  • Bug Fixes

    • Component inheritance now relies solely on metadata.inherits and preserves top-level component attributes.
    • Default component path/name correctly resolves when metadata.component is absent.
    • Abstract-component backends inherit workspace_key_prefix and backend settings as expected.
    • Improved validation and error handling for metadata.inherits entries.
  • Tests

    • Added scenarios and tests covering abstract-component backends, inheritance without metadata.component, and a guard for certain mock failures.
  • Chores

    • Updated dependency versions.
