github cloudposse/atmos v1.194.1

latest release: v1.195.0-rc.0
13 hours ago
Fix and Improve Performance Heatmap @aknysh (#1622)

what

  • Improved heatmap performance in Docker, fixing critical startup delays caused by the --heatmap flag
  • Renamed the "Total" column to "CPU Time" throughout the performance heatmap display to clarify that it represents the sum of self-times, not wall-clock time
  • Added parallelism metric to both console and TUI displays showing CPU Time ÷ Elapsed ratio
  • Improved TUI visualizations (bar chart and sparkline modes) to display average time per call instead of total CPU time for better user comprehension
  • Enhanced TUI legend with live performance metrics (Parallelism, Elapsed time, CPU Time) displayed at the top
  • Split TUI legend into three lines for improved readability
  • Updated the documentation in website/docs/troubleshoot/profiling.mdx with explanations of all new metrics and display formats

why

Docker performance context

This PR improves heatmap performance in Docker, fixing critical performance issues with the --heatmap flag.

The Docker Problem:

  • Commands with --heatmap took ~60 seconds to start in Docker containers (vs instantly on macOS)
  • Root cause: runtime.Stack() was called on every tracked function to get goroutine IDs
  • In Docker, this syscall is significantly slower than on native macOS
  • With thousands of tracked function calls during stack processing, overhead accumulated to ~1 minute

The Solution:

  • Introduced "simple tracking mode" using a single global call stack instead of per-goroutine tracking
  • Avoids expensive runtime.Stack() calls for single-goroutine execution (most Atmos commands)
  • Result: ~19x faster (119µs vs 3ms for 1000 calls)
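The two tracking modes can be sketched as follows. This is a simplified illustration, not Atmos's actual implementation: `goroutineID` shows the well-known `runtime.Stack` parsing trick whose syscall cost dominated Docker startup, and `trackSimple` shows the idea of a single global call stack that avoids it entirely for single-goroutine commands.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"strconv"
)

// goroutineID parses the current goroutine ID out of runtime.Stack output.
// This is the expensive per-call operation that simple mode avoids.
func goroutineID() uint64 {
	buf := make([]byte, 64)
	n := runtime.Stack(buf, false)
	// Stack output begins with "goroutine <id> [...]".
	fields := bytes.Fields(buf[:n])
	id, _ := strconv.ParseUint(string(fields[1]), 10, 64)
	return id
}

// Simple tracking mode: one global call stack, no syscalls per tracked call.
var globalStack []string

// trackSimple pushes a function name and returns a func that pops it.
func trackSimple(name string) func() {
	globalStack = append(globalStack, name)
	return func() { globalStack = globalStack[:len(globalStack)-1] }
}

func main() {
	fmt.Println("goroutine id:", goroutineID())
	stop := trackSimple("describeStacks")
	fmt.Println("depth while tracked:", len(globalStack))
	stop()
	fmt.Println("depth after:", len(globalStack))
}
```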

TUI improvements

Problem 1: Confusing Column Name
The "Total" column name was ambiguous: users couldn't tell whether it meant wall-clock time or CPU time. This caused confusion when interpreting performance data.

Problem 2: Missing Parallelism Context
Users had no way to understand the relationship between CPU time and elapsed time. When CPU time exceeded elapsed time (e.g., 5 minutes of CPU time in 22 seconds), it looked wrong but was actually correct for parallel execution.

Problem 3: TUI Display Showed Confusing Values
The bar chart displayed large total times (e.g., "2m47s") for functions called hundreds of thousands of times, making it appear slow when the average time per call was actually fast (e.g., 0.37ms). This was mathematically correct but user-unfriendly.

Problem 4: Cramped Legend
The single-line legend was too wide and difficult to read in the TUI.

Solutions

Solution 1: Clear Terminology
Renamed "Total" to "CPU Time" everywhere with consistent explanations that it represents "sum of self-time (excludes children)" to avoid confusion with wall-clock time.

Solution 2: Parallelism Metric
Added parallelism calculation and display in both console output and TUI legend:

  • Format: Parallelism: ~0.9x (single-threaded) or ~58.2x (highly parallel)
  • Helps users immediately understand execution characteristics
  • Values less than 1.0 indicate single-threaded execution
  • Values greater than 1.0 indicate parallel execution across multiple cores

Solution 3: User-Friendly TUI Visualizations
Changed bar chart and sparkline displays from showing total CPU time to showing average time per call with call counts:

  • Old format: 2m47.818679s (×447586) - confusing
  • New format: avg: 0.37ms | calls: 447586 - intuitive
  • Bar length still represents total CPU time (overall impact)
  • Average time shows typical function performance

Solution 4: Multi-Line Legend
Split the TUI legend into three informative lines:

  • Line 1: Live metrics (Parallelism, Elapsed, CPU Time)
  • Line 2: Explanation of Count and CPU Time columns
  • Line 3: Explanation of statistical timing columns (Avg, Max, P95)

example

(screenshots in the PR show the updated heatmap display)

testing

All existing tests pass:

  • ✅ 26 tests in pkg/perf package
  • ✅ 36 tests in pkg/ui/heatmap package
  • ✅ 87 tests in cmd package
  • ✅ Total: 149 tests across all modified packages

Manual verification:

  • Compiled binary successfully on macOS
  • Tested in Docker
  • Verified console output shows new format with parallelism
  • Confirmed interactive TUI displays updated legend
  • Website documentation builds without errors
  • Docker performance remains fast (1-2 second startup with --heatmap)

documentation

Updated website/docs/troubleshoot/profiling.mdx with:

  • New "Interactive TUI Legend" section explaining the 3-line legend format
  • Updated Performance Summary section with Parallelism explanation
  • Updated all column descriptions (CPU Time instead of Total)
  • Enhanced Bar Chart and Sparkline mode descriptions with new display format
  • Added concrete examples showing the new avg: Xms | calls: N format

benefits

  1. Clearer Metrics: "CPU Time" is unambiguous compared to "Total"
  2. Execution Context: Parallelism metric immediately shows if execution was single-threaded or parallel
  3. Intuitive Display: Average times per call are much easier to understand than large totals
  4. Better UX: Multi-line legend is more readable and informative
  5. Complete Documentation: Users have clear explanations of all metrics
  6. Backward Compatible: All existing tests pass without changes
  7. Docker Ready: Works fast in Docker environments (builds on PR #1611 improvements)

Summary by CodeRabbit

  • New Features

    • Heatmap and summaries now show CPU Time and Parallelism; charts and sparklines show average time per call with counts; interactive UI supports cancellation and a new toggle to control tracking mode.
  • Bug Fixes

    • More accurate CPU Time and parallelism calculations; per-function CPU Time shown; durations truncated to microseconds; headers and columns reorganized for clarity.
  • Tests

    • Added tests for tracking modes, performance comparisons, and legend/render output; Windows skips and more robust environment/IO handling.
  • Documentation

    • Profiling docs updated to explain CPU Time vs Self‑Time with revised examples.
  • Chores

    • CI tool version bumped and test cleanup/synchronization improvements.
Add trace logging for all squelched errors @osterman (#1615)

what

  • Add log.Trace() calls for all squelched errors throughout the codebase
  • Update logging documentation to include squelched error handling guidance
  • Update error handling strategy PRD to document squelched error patterns

why

  • Ensures no errors are silently lost, even when intentionally ignored
  • Provides complete error visibility during debugging with --logs-level=Trace
  • Establishes clear patterns for handling non-critical errors

details

Code Changes (19 files)

Added trace logging for squelched errors in:

  • Configuration binding: Environment variables and flags (viper.BindEnv(), viper.BindPFlag())
  • File cleanup: Temporary file and directory removal (os.Remove(), os.RemoveAll())
  • Resource closing: File handles, clients, connections (Close())
  • Lock operations: File locks in defer statements (Unlock())
  • UI operations: Terminal output, command help (fmt.Fprint(), cmd.Help())
  • Performance tracking: Histogram value recording
  • Cache operations: Non-critical cache file operations on Windows

Patterns Applied

// ❌ WRONG: Silent error squelching
_ = os.Remove(tempFile)

// ✅ CORRECT: Log squelched errors at Trace level
if err := os.Remove(tempFile); err != nil && !os.IsNotExist(err) {
    log.Trace("Failed to remove temporary file during cleanup", "error", err, "file", tempFile)
}

Special Cases

  • Defer statements: Capture errors in closures for logging
  • File existence checks: Use os.IsNotExist() to avoid logging expected conditions
  • Log file cleanup: Use fmt.Fprintf(os.Stderr, ...) to avoid logger recursion
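The defer-closure special case can be sketched as follows. `logTrace` stands in for the project's `log.Trace`; its structured key/value signature is an assumption based on the pattern shown above.

```go
package main

import (
	"fmt"
	"os"
)

// logTrace is a stand-in for the project's log.Trace; the real
// signature is assumed from the key/value style in the pattern above.
func logTrace(msg string, kv ...any) {
	fmt.Println(append([]any{"TRACE:", msg}, kv...)...)
}

func writeTemp() error {
	f, err := os.CreateTemp("", "atmos-*")
	if err != nil {
		return err
	}
	// Defer a closure so Close/Remove errors are captured and logged,
	// rather than silently discarded as with a bare `defer f.Close()`.
	defer func() {
		if cerr := f.Close(); cerr != nil {
			logTrace("Failed to close temp file", "error", cerr, "file", f.Name())
		}
		// os.IsNotExist filters out the expected already-gone case.
		if rerr := os.Remove(f.Name()); rerr != nil && !os.IsNotExist(rerr) {
			logTrace("Failed to remove temp file", "error", rerr, "file", f.Name())
		}
	}()
	_, err = f.WriteString("hello")
	return err
}

func main() {
	if err := writeTemp(); err != nil {
		fmt.Println("error:", err)
	}
}
```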

Documentation Updates

  • docs/logging.md: Added comprehensive "Squelched Errors" section with patterns and examples
  • docs/prd/error-handling-strategy.md: Added "Squelched Error Handling" section with guidelines and code examples

references

  • Related to overall error handling and logging strategy
  • All changes compile and pass pre-commit hooks

Summary by CodeRabbit

  • Bug Fixes

    • Prevented silent failures by adding error handling and trace-level logging across command flags, env bindings, config parsing, file cleanup, network response closing, metrics recording, and TUI rendering.
    • Improved cleanup reliability by logging non-fatal errors during temporary file/dir removal and resource closure.
  • Refactor

    • Standardized logging for squelched (non-fatal) errors, replacing ignored returns with guarded paths without changing core behavior.
  • Documentation

    • Expanded logging guidance with a new “Squelched Errors” section, examples, patterns, and best practices.
    • Updated error-handling strategy to include when and how to log squelched errors.
Improve performance heatmap @aknysh (#1611)

what

  • Fix --heatmap flag not working with terraform, helmfile, and packer commands
  • Implement advanced performance tracking with self-time vs total-time separation
  • Fix recursive function performance tracking to show accurate counts AND accurate timing
  • Improve heatmap display with consistent metrics and informative legend
  • Add comprehensive tests for heatmap functionality and recursive tracking

Screenshots in the PR show the output of:

  • atmos describe stacks --heatmap
  • atmos terraform plan vpc -s uw2-prod --heatmap

why

Heatmap Flag Fix

The --heatmap flag was not working for terraform, helmfile, and packer commands because:

  • These commands use DisableFlagParsing = true to pass native flags through to underlying tools
  • When flag parsing is disabled, Cobra doesn't parse the --heatmap flag
  • The PersistentPreRun hook couldn't detect the flag via cmd.Flags().GetBool("heatmap")
  • Performance tracking was never enabled, so no data collected

Performance Tracking Enhancement

--heatmap showed inconsistent timing metrics for long-running commands:

  • Elapsed time was correct
  • Individual function totals were massively inflated (approximately 1,890x inflation)
  • Root cause: Unable to show accurate call counts for recursive functions without timing inflation

Requirement: Show true call volume (e.g., 1,890 calls) with accurate timing (no inflation) and consistent metrics.

changes

Advanced Performance Tracking (Self-Time vs Total-Time)

Implemented professional-grade profiling metrics that separate:

  • Total time (wall-clock): Includes time spent in child function calls
  • Self-time: Actual work done in the function, excluding children
  • Accurate recursive tracking: Shows ALL calls including recursive ones with correct timing

Key Features:

  1. Goroutine-local call stack tracking - Each goroutine maintains its own call stack
  2. Child time accumulation - Each stack frame tracks time spent in children
  3. Self-time calculation - selfTime = totalTime - childTime
  4. HDR Histogram on self-time - P95 based on actual work, not wall-clock
  5. Direct recursive tracking - No wrapper pattern needed, tracks every call accurately

Benefits:

  • Accurate recursive tracking: Shows true call counts (e.g., 1,890 calls) with correct timing
  • No inflation: Self-time excludes child execution, total-time includes it
  • Better insights: Identify where time is spent (total) vs where work is done (self-time)
  • Professional profiling: Same metrics as pprof, but function-level and easier to use

Example Output:

Function                                 Count    Total      Avg        Max        P95
utils.processCustomTags                  1024     4.27ms     4µs        146µs      15µs
  • Count: 1024 - ALL calls including recursive ones
  • Total: 4.27ms - wall-clock time including all children
  • Avg: 4µs - average self-time per call (actual work only)
  • Max: 146µs - maximum self-time for a single call (excludes children)
  • P95: 15µs - 95th percentile of self-time

Metric Consistency Fix

Changed Max to track self-time instead of total-time:

  • All three metrics (Avg, Max, P95) consistently track self-time
  • Rationale: Enables accurate comparison between metrics for identifying performance outliers

TUI Legend

Added informative legend to heatmap TUI:

Count: # calls (incl. recursion) | Total: wall-clock (incl. children & recursion) |
Avg: avg self-time | Max: max self-time | P95: 95th percentile self-time
  • Appears after the header in all visualization modes
  • Explains what each metric means
  • Clarifies that Count and Total include recursion
  • Helps users understand self-time vs total-time

Recursive Function Updates

Benefits:

  • Shows true call volume (e.g., 1,024 calls instead of 1)
  • Timing remains accurate via self-time calculation
  • No wrapper pattern complexity needed

Documentation

Updated profiling documentation:

  • "Understanding Total vs Self-Time" tip reflects that all three metrics now track self-time
  • Updated metric descriptions with accurate explanations
  • Added examples showing self-time context for outlier detection

Heatmap Flag Improvements

  • Added helper function to manually parse --heatmap from os.Args for commands with DisableFlagParsing = true
  • Integrated heatmap flag detection in terraform, helmfile, and packer commands
  • Added test coverage for heatmap flag detection
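With DisableFlagParsing = true, the flag has to be found by scanning the raw arguments. A sketch of that idea (the helper name, the `--heatmap=true` variant, and the `--` cutoff are assumptions; the actual helper in the PR may differ):

```go
package main

import "fmt"

// hasHeatmapFlag scans raw args for --heatmap, since Cobra never parses
// flags for commands with DisableFlagParsing = true.
func hasHeatmapFlag(args []string) bool {
	for _, a := range args {
		if a == "--" {
			break // assume args after "--" belong to the underlying tool
		}
		if a == "--heatmap" || a == "--heatmap=true" {
			return true
		}
	}
	return false
}

func main() {
	args := []string{"terraform", "plan", "vpc", "-s", "uw2-prod", "--heatmap"}
	fmt.Println(hasHeatmapFlag(args))
}
```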

Enhanced Performance Tracking Coverage

  • Added perf.Track() to UnmarshalYAMLFromFile to track YAML parsing performance
  • Ensures complete visibility into YAML processing call chains during merge operations

testing

Performance Tracking Tests

New self-time tracking tests:

  1. TestSelfTimeVsTotalTime - Verifies self-time excludes child time
  2. TestNestedFunctionSelfTime - Tests multi-level nesting (grandparent → parent → child)
  3. TestDirectRecursionWithSelfTime - Tests direct recursion with accurate counts AND timing

Updated existing tests:

  • All 20 perf tests passing ✅
  • Tests verify self-time calculation accuracy for all three metrics (Avg, Max, P95)

Recursive function tests:

  1. TestRecursiveFunctionTracking - Wrapper pattern verification
  2. TestRecursiveFunctionWrongPattern - Demonstrates inflation with wrong pattern
  3. TestMultipleRecursiveFunctionsIndependent - Independent function tracking
  4. TestYAMLConfigProcessingRecursion - YAML import hierarchy (50 levels deep)
  5. TestYAMLConfigProcessingMultipleImports - Fan-out import pattern
  6. TestProcessBaseComponentConfigRecursion - Component inheritance (15 levels deep)

Heatmap UI Tests

Updated heatmap tests:

  • Updated legend verification tests
  • All 36 heatmap tests passing ✅

Heatmap Flag Tests

✅ TestTerraformHeatmapFlag passes successfully
✅ Manual testing confirms --heatmap works with:

  • atmos terraform plan <component> -s <stack> --heatmap
  • atmos helmfile <subcommand> <component> -s <stack> --heatmap
  • atmos packer <subcommand> <component> -s <stack> --heatmap

Build Verification

✅ All linter checks passing (golangci-lint)
✅ Full project builds successfully
✅ Website documentation builds without errors
✅ All 56 tests passing (20 perf + 36 heatmap)

expected impact

For Recursive Functions

  • Before: Count = 1, accurate timing (using wrapper pattern, recursion hidden)
  • After: Count = 1,890 (shows true call volume), accurate timing (via self-time tracking)

Performance Insights

  • Total time: Identify functions contributing most to overall execution time
  • Self-time: Identify where actual work is being done (optimization targets)
  • Consistent metrics: All three metrics (Avg, Max, P95) track self-time for accurate comparison
  • Accurate P95: Based on self-time, more meaningful for performance analysis

Better User Experience

  • TUI legend: Users immediately understand what each metric means
  • Accurate Max: Identifies performance outliers in function's own work, not including children

Summary by CodeRabbit

  • New Features

    • --heatmap CLI flag available for Terraform, Helmfile, and Packer; heatmap UI adds a legend and wider duration columns.
    • New Terraform "affected" and query execution workflows.
  • Improvements

    • New profiler-related CLI flags and a performance-tracking toggle.
    • Performance metrics now emphasize self-time (Avg, P95, Max) for clearer bottleneck identification.
    • Clearer, more actionable user-facing error messages; telemetry notices suppressed by default.
  • Documentation

    • Profiling guide updated to explain Total vs Self-Time.
  • Chores

    • Go toolchain and example image versions bumped.
Bugfix: `auth` identities should default load the AWS Environment vars (if AWS based identities) @Benbentwo (#1623)

what

  • atmos auth exec and atmos auth env require the default AWS environment variables to function properly. The original implementation had this feature, but it was unintentionally removed during the refactoring of how the Environment() block was used.

why

  • Allows atmos auth exec -- aws s3 ls and selecting an identity to work as expected.

Summary by CodeRabbit

  • New Features

    • Automatically merges AWS file-based environment variables (e.g., profile and config/credentials settings) for Assume Role and Permission Set identities, applied before app-level environment overrides. This improves compatibility with existing AWS CLI configurations.
  • Tests

    • Updated tests to verify inclusion of AWS file-based variables alongside custom configuration variables in the final environment output.
Fix component inheritance: restore default behavior for metadata.component @aknysh (#1614)

what

  • Fix critical component inheritance regression where metadata.component was being overwritten during configuration inheritance
  • Restore correct default behavior where metadata.component defaults to the Atmos component name when not explicitly set
  • Prevent state file loss: Fix backend configuration to use correct workspace_key_prefix instead of inherited component names

why

  • PR #1602 introduced a regression in component processing that broke metadata.component handling
  • Critical Impact: Components using abstract component patterns were initializing new empty state files instead of finding existing state
    • When inheriting configuration via metadata.inherits, the system was inadvertently overwriting metadata.component with values from inherited components
    • Backend workspace_key_prefix ended up using the wrong component name (e.g., eks-service-defaults instead of eks-service)
    • This created new state at wrong S3 path, causing terraform plan to show full re-create of all resources
    • Could lead to resource duplication or apply failures due to conflicts
  • This caused errors like: invalid Terraform component: 'eks/n8n' points to the Terraform component 'helm-teleport-helper', but it does not exist

references

  • Fixes #1613 (Backend state file regression - workspace_key_prefix uses wrong component name)
  • Fixes #1609 (Component resolution fails when metadata.component omitted)
  • Related to PR #1602 (the refactoring that introduced the regression)

testing

  • Added comprehensive regression test in internal/exec/component_inheritance_test.go
    • Tests component that inherits configuration but doesn't set metadata.component
    • Verifies configuration inheritance (vars, settings) works correctly
    • Verifies metadata section is not inherited
    • Verifies component path defaults correctly to the Atmos component name
  • Added backend configuration test in internal/exec/abstract_component_backend_test.go
    • Critical test for backend state issue: Tests component with explicit metadata.component inheriting configuration from abstract component
    • Verifies metadata.component is preserved (not overwritten by inherited component's metadata.component)
    • Verifies backend workspace_key_prefix uses correct component path (e.g., eks-service not eks-service-defaults)
    • Verifies vars inheritance from abstract component works correctly
  • Created test fixtures in tests/fixtures/scenarios/:
    • component-inheritance-without-metadata-component/
    • abstract-component-backend/
  • All existing tests pass
  • Manual testing confirms the fix works correctly

impact

Before (1.194.0 - Broken)

# Abstract component
eks/service/defaults:
  metadata:
    type: abstract
    component: eks-service

# Concrete instance
eks/service/app1:
  metadata:
    component: eks-service
    inherits: [eks/service/defaults]  # Inherit configuration (vars, settings, etc.)

Problem: Processing metadata.inherits overwrote metadata.component

  • Backend ended up using workspace_key_prefix: eks-service-defaults (from inherited component)
  • Created new state at: s3://bucket/eks-service-defaults/...
  • Existing state at: s3://bucket/eks-service/... (not found)
  • Result: terraform plan showed full re-create of all resources ❌

After (This Fix)

Solution: metadata.component is preserved during configuration inheritance

  • Backend correctly uses workspace_key_prefix: eks-service (from explicit metadata.component)
  • Finds existing state at: s3://bucket/eks-service/...
  • Result: terraform plan shows correct incremental changes ✅

details

Key Concepts (Independent)

  1. metadata.component: Pointer to the physical terraform component directory in components/terraform/

    • Defaults to the Atmos component name if not specified
    • Determines the physical directory path for the component
    • Critical: Used to generate backend configuration (e.g., workspace_key_prefix)
    • Completely independent of configuration inheritance
  2. metadata.inherits: List of Atmos components to inherit configuration from (vars, settings, env, backend, etc.)

    • Purely for configuration merging - has nothing to do with component paths
    • The metadata section itself is NOT inherited from base components
    • Allows multiple inheritance for configuration merging
    • Completely independent of metadata.component and Atmos component names
  3. Atmos component name: The logical name of the component in the stack configuration

    • Used as default for metadata.component if not specified
    • Completely independent of configuration inheritance

The Bug

The regression in PR #1602 caused metadata.component to be inadvertently overwritten during configuration inheritance processing:

  1. Component eks/service/app1 has metadata.component: eks-service and metadata.inherits: [eks/service/defaults]
  2. System processes metadata.inherits to merge configuration (vars, settings, etc.) from eks/service/defaults
  3. BUG: During this processing, metadata.component: eks-service gets overwritten with the inherited component's metadata.component value
  4. Result: Component path becomes eks-service-defaults instead of eks-service

The Fix

Modified processMetadataInheritance() in stack_processor_process_stacks_helpers_inheritance.go to:

  • Track whether metadata.component was explicitly set on the current component
  • Process metadata.inherits to inherit vars, settings, backend, etc. (but NOT metadata)
  • Restore the explicitly-set metadata.component after processing inherits (prevents overwriting)
  • Default metadata.component to the Atmos component name when not explicitly set
  • This happens regardless of whether metadata.inherits exists or not
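The fix described above can be sketched with plain maps. This is illustrative only: the real processMetadataInheritance operates on full component configs, and the metadata section is never merged at all, but the remember/merge/restore-or-default shape is the same:

```go
package main

import "fmt"

// resolveComponentPath sketches the fix: remember an explicitly set
// metadata.component, merge inherited values, then restore the explicit
// value, or default it to the Atmos component name when absent.
func resolveComponentPath(atmosName string, metadata, inherited map[string]any) string {
	explicit, hadExplicit := metadata["component"].(string)

	// The regression: inherited values leaked into metadata during merging.
	for k, v := range inherited {
		metadata[k] = v
	}

	if hadExplicit {
		metadata["component"] = explicit // restore the explicit setting
	} else {
		metadata["component"] = atmosName // default to the Atmos component name
	}
	return metadata["component"].(string)
}

func main() {
	// Scenario 2: an explicit metadata.component survives inheritance.
	fmt.Println(resolveComponentPath("eks/service/app1",
		map[string]any{"component": "eks-service"},
		map[string]any{"component": "eks-service-defaults"}))
	// Scenario 1: no explicit value defaults to the Atmos component name.
	fmt.Println(resolveComponentPath("eks/n8n",
		map[string]any{},
		map[string]any{"component": "helm-teleport-helper"}))
}
```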

Code Changes

The fix ensures proper handling in three scenarios:

Scenario 1: Component without explicit metadata.component (Fixes #1609)

components:
  terraform:
    eks/n8n:  # Atmos component name
      metadata:
        inherits: [eks/helm-teleport-helper]  # Inherit configuration only
      vars:
        name: "n8n"

✅ Now works: metadata.component defaults to eks/n8n (Atmos component name)
❌ Before: metadata.component was overwritten during configuration inheritance, became helm-teleport-helper (component not found error)

Scenario 2: Component with explicit metadata.component + configuration inheritance (Fixes #1613)

components:
  terraform:
    eks/service/app1:
      metadata:
        component: eks-service  # Explicit - determines physical path
        inherits: [eks/service/defaults]  # Inherit configuration (vars, backend, etc.)

✅ Now works: metadata.component stays eks-service, backend uses correct workspace_key_prefix
❌ Before: metadata.component overwritten to eks-service-defaults during configuration inheritance, new empty state created

Scenario 3: Top-level component: attribute (Also affected by regression)

components:
  terraform:
    test/test-component-override:
      component: test/test-component  # Top-level attribute (not metadata.component)
      # ... configuration ...

✅ Now works: Top-level component: sets the component path, not overwritten by defaults
❌ Before: Default logic was overwriting top-level component inheritance

Summary by CodeRabbit

  • Bug Fixes

    • Component inheritance now relies solely on metadata.inherits and preserves top-level component attributes.
    • Default component path/name correctly resolves when metadata.component is absent.
    • Abstract-component backends inherit workspace_key_prefix and backend settings as expected.
    • Improved validation and error handling for metadata.inherits entries.
  • Tests

    • Added scenarios and tests covering abstract-component backends, inheritance without metadata.component, and a guard for certain mock failures.
  • Chores

    • Updated dependency versions.
