Fix and Improve Performance Heatmap @aknysh (#1622)
what
- Fixed critical performance issues with the `--heatmap` flag in Docker environments, improving heatmap performance on Docker
- Renamed the "Total" column to "CPU Time" throughout the performance heatmap display to clarify that it represents the sum of self-times, not wall-clock time
- Added a parallelism metric to both console and TUI displays, showing the CPU Time ÷ Elapsed ratio
- Improved TUI visualizations (bar chart and sparkline modes) to display average time per call instead of total CPU time for better user comprehension
- Enhanced TUI legend with live performance metrics (Parallelism, Elapsed time, CPU Time) displayed at the top
- Split TUI legend into three lines for improved readability
- Updated comprehensive documentation in `website/docs/troubleshoot/profiling.mdx` with explanations of all new metrics and display formats
why
Docker performance context
This PR fixes critical performance issues with the `--heatmap` flag in Docker environments, improving heatmap performance on Docker.
The Docker Problem:
- Commands with `--heatmap` took ~60 seconds to start in Docker containers (vs. starting instantly on macOS)
- Root cause: `runtime.Stack()` was called on every tracked function to get goroutine IDs
- In Docker, this syscall is significantly slower than on native macOS
- With thousands of tracked function calls during stack processing, the overhead accumulated to ~1 minute
The Solution:
- Introduced "simple tracking mode", which uses a single global call stack instead of per-goroutine tracking
- Avoids expensive `runtime.Stack()` calls for single-goroutine execution (most Atmos commands)
- Result: ~19x faster (119µs vs 3ms for 1000 calls)
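The contrast between the two modes can be sketched in a few lines of Go. This is an illustrative sketch, not the actual `pkg/perf` code: `goroutineID` shows the per-call `runtime.Stack()` parsing that made Docker startup slow, and `simpleTracker` shows a single mutex-guarded global stack in the spirit of simple tracking mode (both names are assumptions).

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"strconv"
	"sync"
)

// goroutineID parses the current goroutine ID out of runtime.Stack output.
// This is the expensive per-call operation the PR removed from the hot path:
// runtime.Stack captures a stack trace on every invocation.
func goroutineID() uint64 {
	buf := make([]byte, 64)
	n := runtime.Stack(buf, false)
	// The dump starts with "goroutine <id> [..." so the ID is the second field.
	fields := bytes.Fields(buf[:n])
	id, _ := strconv.ParseUint(string(fields[1]), 10, 64)
	return id
}

// simpleTracker sketches "simple tracking mode": one global call stack guarded
// by a mutex, with no goroutine identification at all.
type simpleTracker struct {
	mu    sync.Mutex
	stack []string
}

func (t *simpleTracker) enter(fn string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.stack = append(t.stack, fn)
}

func (t *simpleTracker) exit() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.stack = t.stack[:len(t.stack)-1]
}

func main() {
	fmt.Println("goroutine ID:", goroutineID())
	var t simpleTracker
	t.enter("DescribeStacks")
	t.enter("ProcessYAML")
	fmt.Println("depth:", len(t.stack))
	t.exit()
	t.exit()
}
```

For single-goroutine commands the simple mode never touches `runtime.Stack()`, which is where the ~19x speedup comes from.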
TUI improvements
Problem 1: Confusing Column Name
The "Total" column name was ambiguous - users couldn't tell if it meant wall-clock time or CPU time. This caused confusion when trying to interpret performance data.
Problem 2: Missing Parallelism Context
Users had no way to understand the relationship between CPU time and elapsed time. When CPU time exceeded elapsed time (e.g., 5 minutes of CPU time in 22 seconds), it looked wrong but was actually correct for parallel execution.
Problem 3: TUI Display Showed Confusing Values
The bar chart displayed large total times (e.g., "2m47s") for functions called hundreds of thousands of times, making it appear slow when the average time per call was actually fast (e.g., 0.37ms). This was mathematically correct but user-unfriendly.
Problem 4: Cramped Legend
The single-line legend was too wide and difficult to read in the TUI.
Solutions
Solution 1: Clear Terminology
Renamed "Total" to "CPU Time" everywhere with consistent explanations that it represents "sum of self-time (excludes children)" to avoid confusion with wall-clock time.
Solution 2: Parallelism Metric
Added parallelism calculation and display in both console output and TUI legend:
- Format: `Parallelism: ~0.9x` (single-threaded) or `~58.2x` (highly parallel)
- Helps users immediately understand execution characteristics
- Values less than 1.0 indicate single-threaded execution
- Values greater than 1.0 indicate parallel execution across multiple cores
Solution 3: User-Friendly TUI Visualizations
Changed bar chart and sparkline displays from showing total CPU time to showing average time per call with call counts:
- Old format: `2m47.818679s (×447586)` - confusing
- New format: `avg: 0.37ms | calls: 447586` - intuitive
- Bar length still represents total CPU time (overall impact)
- Average time shows typical function performance
Solution 4: Multi-Line Legend
Split the TUI legend into three informative lines:
- Line 1: Live metrics (Parallelism, Elapsed, CPU Time)
- Line 2: Explanation of Count and CPU Time columns
- Line 3: Explanation of statistical timing columns (Avg, Max, P95)
testing
All existing tests pass:
- ✅ 26 tests in the `pkg/perf` package
- ✅ 36 tests in the `pkg/ui/heatmap` package
- ✅ 87 tests in the `cmd` package
- ✅ Total: 149 tests across all modified packages
Manual verification:
- Compiled binary successfully on macOS
- Tested in Docker
- Verified console output shows new format with parallelism
- Confirmed interactive TUI displays updated legend
- Website documentation builds without errors
- Docker performance remains fast (1-2 second startup with `--heatmap`)
documentation
Updated `website/docs/troubleshoot/profiling.mdx` with:
- New "Interactive TUI Legend" section explaining the 3-line legend format
- Updated Performance Summary section with Parallelism explanation
- Updated all column descriptions (CPU Time instead of Total)
- Enhanced Bar Chart and Sparkline mode descriptions with new display format
- Added concrete examples showing the new `avg: Xms | calls: N` format
benefits
- Clearer Metrics: "CPU Time" is unambiguous compared to "Total"
- Execution Context: Parallelism metric immediately shows if execution was single-threaded or parallel
- Intuitive Display: Average times per call are much easier to understand than large totals
- Better UX: Multi-line legend is more readable and informative
- Complete Documentation: Users have clear explanations of all metrics
- Backward Compatible: All existing tests pass without changes
- Docker Ready: Works fast in Docker environments (builds on PR #1611 improvements)
Summary by CodeRabbit
- New Features
  - Heatmap and summaries now show CPU Time and Parallelism; charts and sparklines show average time per call with counts; the interactive UI supports cancellation and a new toggle to control tracking mode.
- Bug Fixes
  - More accurate CPU Time and parallelism calculations; per-function CPU Time shown; durations truncated to microseconds; headers and columns reorganized for clarity.
- Tests
  - Added tests for tracking modes, performance comparisons, and legend/render output; Windows skips and more robust environment/IO handling.
- Documentation
  - Profiling docs updated to explain CPU Time vs Self-Time with revised examples.
- Chores
  - CI tool version bumped; test cleanup/synchronization improvements.
Add trace logging for all squelched errors @osterman (#1615)
what
- Add `log.Trace()` calls for all squelched errors throughout the codebase
- Update the logging documentation to include squelched error handling guidance
- Update the error handling strategy PRD to document squelched error patterns
why
- Ensures no errors are silently lost, even when intentionally ignored
- Provides complete error visibility during debugging with `--logs-level=Trace`
- Establishes clear patterns for handling non-critical errors
details
Code Changes (19 files)
Added trace logging for squelched errors in:
- Configuration binding: environment variables and flags (`viper.BindEnv()`, `viper.BindPFlag()`)
- File cleanup: temporary file and directory removal (`os.Remove()`, `os.RemoveAll()`)
- Resource closing: file handles, clients, connections (`Close()`)
- Lock operations: file locks in defer statements (`Unlock()`)
- UI operations: terminal output, command help (`fmt.Fprint()`, `cmd.Help()`)
- Performance tracking: histogram value recording
- Cache operations: non-critical cache file operations on Windows
Patterns Applied
// ❌ WRONG: Silent error squelching
_ = os.Remove(tempFile)
// ✅ CORRECT: Log squelched errors at Trace level
if err := os.Remove(tempFile); err != nil && !os.IsNotExist(err) {
log.Trace("Failed to remove temporary file during cleanup", "error", err, "file", tempFile)
}
Special Cases
- Defer statements: capture errors in closures for logging
- File existence checks: use `os.IsNotExist()` to avoid logging expected conditions
- Log file cleanup: use `fmt.Fprintf(os.Stderr, ...)` to avoid logger recursion
Documentation Updates
- docs/logging.md: Added comprehensive "Squelched Errors" section with patterns and examples
- docs/prd/error-handling-strategy.md: Added "Squelched Error Handling" section with guidelines and code examples
references
- Related to overall error handling and logging strategy
- All changes compile and pass pre-commit hooks
Summary by CodeRabbit
- Bug Fixes
  - Prevented silent failures by adding error handling and trace-level logging across command flags, env bindings, config parsing, file cleanup, network response closing, metrics recording, and TUI rendering.
  - Improved cleanup reliability by logging non-fatal errors during temporary file/dir removal and resource closure.
- Refactor
  - Standardized logging for squelched (non-fatal) errors, replacing ignored returns with guarded paths without changing core behavior.
- Documentation
  - Expanded logging guidance with a new “Squelched Errors” section, examples, patterns, and best practices.
  - Updated the error-handling strategy to include when and how to log squelched errors.
Improve performance heatmap @aknysh (#1611)
what
- Fix the `--heatmap` flag not working with the `terraform`, `helmfile`, and `packer` commands
- Implement advanced performance tracking with self-time vs total-time separation
- Fix recursive function performance tracking to show accurate counts AND accurate timing
- Improve the heatmap display with consistent metrics and an informative legend
- Add comprehensive tests for heatmap functionality and recursive tracking
atmos describe stacks --heatmap
atmos terraform plan vpc -s uw2-prod --heatmap
why
Heatmap Flag Fix
The `--heatmap` flag was not working for the `terraform`, `helmfile`, and `packer` commands because:
- These commands use `DisableFlagParsing = true` to pass native flags through to the underlying tools
- When flag parsing is disabled, Cobra doesn't parse the `--heatmap` flag
- The `PersistentPreRun` hook couldn't detect the flag via `cmd.Flags().GetBool("heatmap")`
- Performance tracking was never enabled, so no data was collected
Performance Tracking Enhancement
`--heatmap` showed inconsistent timing metrics for long-running commands:
- Elapsed time was correct
- Individual function totals were massively inflated (approximately 1,890x inflation)
- Root cause: it was impossible to show accurate call counts for recursive functions without timing inflation

Requirement: show the true call volume (e.g., 1,890 calls) with accurate timing (no inflation) and consistent metrics.
changes
Advanced Performance Tracking (Self-Time vs Total-Time)
Implemented professional-grade profiling metrics that separate:
- Total time (wall-clock): Includes time spent in child function calls
- Self-time: Actual work done in the function, excluding children
- Accurate recursive tracking: Shows ALL calls including recursive ones with correct timing
Key Features:
- Goroutine-local call stack tracking - each goroutine maintains its own call stack
- Child time accumulation - each stack frame tracks time spent in children
- Self-time calculation - `selfTime = totalTime - childTime`
- HDR Histogram on self-time - P95 based on actual work, not wall-clock
- Direct recursive tracking - no wrapper pattern needed; tracks every call accurately
Benefits:
- Accurate recursive tracking: Shows true call counts (e.g., 1,890 calls) with correct timing
- No inflation: Self-time excludes child execution, total-time includes it
- Better insights: Identify where time is spent (total) vs where work is done (self-time)
- Professional profiling: Same metrics as pprof, but function-level and easier to use
Example Output:
Function                 Count  Total   Avg  Max    P95
utils.processCustomTags  1024   4.27ms  4µs  146µs  15µs

- Count: 1024 - ALL calls, including recursive ones
- Total: 4.27ms - wall-clock time including all children
- Avg: 4µs - average self-time per call (actual work only)
- Max: 146µs - maximum self-time for a single call (excludes children)
- P95: 15µs - 95th percentile of self-time
Metric Consistency Fix
Changed Max to track self-time instead of total-time:
- All three metrics (`Avg`, `Max`, `P95`) consistently track self-time
- Rationale: enables accurate comparison between metrics when identifying performance outliers
TUI Legend
Added informative legend to heatmap TUI:
Count: # calls (incl. recursion) | Total: wall-clock (incl. children & recursion) |
Avg: avg self-time | Max: max self-time | P95: 95th percentile self-time
- Appears after the header in all visualization modes
- Explains what each metric means
- Clarifies that Count and Total include recursion
- Helps users understand self-time vs total-time
Recursive Function Updates
Benefits:
- Shows true call volume (e.g., 1,024 calls instead of 1)
- Timing remains accurate via self-time calculation
- No wrapper pattern complexity needed
Documentation
Updated profiling documentation:
- "Understanding Total vs Self-Time" tip reflects that all three metrics now track self-time
- Updated metric descriptions with accurate explanations
- Added examples showing self-time context for outlier detection
Heatmap Flag Improvements
- Added a helper function to manually parse `--heatmap` from `os.Args` for commands with `DisableFlagParsing = true`
- Integrated heatmap flag detection into the `terraform`, `helmfile`, and `packer` commands
- Added test coverage for heatmap flag detection
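The manual flag scan can be as small as this. A sketch of the kind of helper the PR describes; the function name and exact matching rules are assumptions, not the actual Atmos code:

```go
package main

import (
	"fmt"
	"os"
)

// hasHeatmapFlag scans raw arguments for --heatmap. This is needed because
// commands with DisableFlagParsing = true never let Cobra see the flag,
// so cmd.Flags().GetBool("heatmap") cannot work.
func hasHeatmapFlag(args []string) bool {
	for _, a := range args {
		if a == "--heatmap" || a == "--heatmap=true" {
			return true
		}
	}
	return false
}

func main() {
	// In the real hook this would be os.Args[1:].
	_ = os.Args
	args := []string{"terraform", "plan", "vpc", "-s", "uw2-prod", "--heatmap"}
	fmt.Println(hasHeatmapFlag(args)) // → true
}
```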
Enhanced Performance Tracking Coverage
- Added `perf.Track()` to `UnmarshalYAMLFromFile` to track YAML parsing performance
- Ensures complete visibility into YAML processing call chains during merge operations
testing
Performance Tracking Tests
New self-time tracking tests:
- `TestSelfTimeVsTotalTime` - verifies self-time excludes child time
- `TestNestedFunctionSelfTime` - tests multi-level nesting (grandparent → parent → child)
- `TestDirectRecursionWithSelfTime` - tests direct recursion with accurate counts AND timing
Updated existing tests:
- All 20 perf tests passing ✅
- Tests verify self-time calculation accuracy for all three metrics (Avg, Max, P95)
Recursive function tests:
- `TestRecursiveFunctionTracking` - wrapper pattern verification
- `TestRecursiveFunctionWrongPattern` - demonstrates inflation with the wrong pattern
- `TestMultipleRecursiveFunctionsIndependent` - independent function tracking
- `TestYAMLConfigProcessingRecursion` - YAML import hierarchy (50 levels deep)
- `TestYAMLConfigProcessingMultipleImports` - fan-out import pattern
- `TestProcessBaseComponentConfigRecursion` - component inheritance (15 levels deep)
Heatmap UI Tests
Updated heatmap tests:
- Updated legend verification tests
- All 36 heatmap tests passing ✅
Heatmap Flag Tests
✅ `TestTerraformHeatmapFlag` passes successfully
✅ Manual testing confirms `--heatmap` works with:
atmos terraform plan <component> -s <stack> --heatmap
atmos helmfile <subcommand> <component> -s <stack> --heatmap
atmos packer <subcommand> <component> -s <stack> --heatmap
Build Verification
✅ All linter checks passing (golangci-lint)
✅ Full project builds successfully
✅ Website documentation builds without errors
✅ All 56 tests passing (20 perf + 36 heatmap)
expected impact
For Recursive Functions
- Before: Count = 1, accurate timing (using wrapper pattern, recursion hidden)
- After: Count = 1,890 (shows true call volume), accurate timing (via self-time tracking)
Performance Insights
- Total time: Identify functions contributing most to overall execution time
- Self-time: Identify where actual work is being done (optimization targets)
- Consistent metrics: All three metrics (Avg, Max, P95) track self-time for accurate comparison
- Accurate P95: Based on self-time, more meaningful for performance analysis
Better User Experience
- TUI legend: Users immediately understand what each metric means
- Accurate Max: Identifies performance outliers in function's own work, not including children
Summary by CodeRabbit
- New Features
  - `--heatmap` CLI flag available for Terraform, Helmfile, and Packer; the heatmap UI adds a legend and wider duration columns.
  - New Terraform "affected" and query execution workflows.
- Improvements
  - New profiler-related CLI flags and a performance-tracking toggle.
  - Performance metrics now emphasize self-time (Avg, P95, Max) for clearer bottleneck identification.
  - Clearer, more actionable user-facing error messages; telemetry notices suppressed by default.
- Documentation
  - Profiling guide updated to explain Total vs Self-Time.
- Chores
  - Go toolchain and example image versions bumped.
Bugfix: `auth` identities should default load the AWS Environment vars (if AWS based identities) @Benbentwo (#1623)
what
`atmos auth exec` and `atmos auth env` need the default AWS environment variables to function properly. The main implementation had this feature, but it was unintentionally removed while refactoring how the Environment() block was used.
why
- Allows `atmos auth exec -- aws s3 ls` and selecting an identity to work as expected.
Summary by CodeRabbit
- New Features
  - Automatically merges AWS file-based environment variables (e.g., profile and config/credentials settings) for Assume Role and Permission Set identities, applied before app-level environment overrides. This improves compatibility with existing AWS CLI configurations.
- Tests
  - Updated tests to verify inclusion of AWS file-based variables alongside custom configuration variables in the final environment output.
Fix component inheritance: restore default behavior for metadata.component @aknysh (#1614)
what
- Fix a critical component inheritance regression where `metadata.component` was being overwritten during configuration inheritance
- Restore the correct default behavior, where `metadata.component` defaults to the Atmos component name when not explicitly set
- Prevent state file loss: fix backend configuration to use the correct `workspace_key_prefix` instead of inherited component names
why
- PR #1602 introduced a regression in component processing that broke `metadata.component` handling
- Critical Impact: components using abstract component patterns were initializing new empty state files instead of finding existing state
- When inheriting configuration via `metadata.inherits`, the system was inadvertently overwriting `metadata.component` with values from inherited components
  - The backend `workspace_key_prefix` ended up using the wrong component name (e.g., `eks-service-defaults` instead of `eks-service`)
  - This created new state at the wrong S3 path, causing `terraform plan` to show a full re-create of all resources
  - Could lead to resource duplication or apply failures due to conflicts
- This caused errors like: `invalid Terraform component: 'eks/n8n' points to the Terraform component 'helm-teleport-helper', but it does not exist`
references
- Fixes #1613 (Backend state file regression - workspace_key_prefix uses wrong component name)
- Fixes #1609 (Component resolution fails when metadata.component omitted)
- Related to PR #1602 (the refactoring that introduced the regression)
testing
- Added a comprehensive regression test in `internal/exec/component_inheritance_test.go`
  - Tests a component that inherits configuration but doesn't set `metadata.component`
  - Verifies configuration inheritance (vars, settings) works correctly
  - Verifies the metadata section is not inherited
  - Verifies the component path defaults correctly to the Atmos component name
- Added a backend configuration test in `internal/exec/abstract_component_backend_test.go`
  - Critical test for the backend state issue: tests a component with an explicit `metadata.component` inheriting configuration from an abstract component
  - Verifies `metadata.component` is preserved (not overwritten by the inherited component's `metadata.component`)
  - Verifies the backend `workspace_key_prefix` uses the correct component path (e.g., `eks-service`, not `eks-service-defaults`)
  - Verifies vars inheritance from the abstract component works correctly
- Created test fixtures in `tests/fixtures/scenarios/`:
  - `component-inheritance-without-metadata-component/`
  - `abstract-component-backend/`
- All existing tests pass
- Manual testing confirms the fix works correctly
impact
Before (1.194.0 - Broken)
# Abstract component
eks/service/defaults:
  metadata:
    type: abstract
    component: eks-service

# Concrete instance
eks/service/app1:
  metadata:
    component: eks-service
    inherits: [eks/service/defaults] # Inherit configuration (vars, settings, etc.)
Problem: processing `metadata.inherits` overwrote `metadata.component`
- The backend ended up using `workspace_key_prefix: eks-service-defaults` (from the inherited component)
- Created new state at: `s3://bucket/eks-service-defaults/...`
- Existing state at: `s3://bucket/eks-service/...` (not found)
- Result: `terraform plan` showed a full re-create of all resources ❌
After (This Fix)
Solution: `metadata.component` is preserved during configuration inheritance
- The backend correctly uses `workspace_key_prefix: eks-service` (from the explicit `metadata.component`)
- Finds existing state at: `s3://bucket/eks-service/...`
- Result: `terraform plan` shows correct incremental changes ✅
details
Key Concepts (Independent)
- `metadata.component`: pointer to the physical Terraform component directory in `components/terraform/`
  - Defaults to the Atmos component name if not specified
  - Determines the physical directory path for the component
  - Critical: used to generate backend configuration (e.g., `workspace_key_prefix`)
  - Completely independent of configuration inheritance
- `metadata.inherits`: list of Atmos components to inherit configuration from (vars, settings, env, backend, etc.)
  - Purely for configuration merging - has nothing to do with component paths
  - The `metadata` section itself is NOT inherited from base components
  - Allows multiple inheritance for configuration merging
  - Completely independent of `metadata.component` and Atmos component names
- Atmos component name: the logical name of the component in the stack configuration
  - Used as the default for `metadata.component` if not specified
  - Completely independent of configuration inheritance
The Bug
The regression in PR #1602 caused `metadata.component` to be inadvertently overwritten during configuration inheritance processing:
- Component `eks/service/app1` has `metadata.component: eks-service` and `metadata.inherits: [eks/service/defaults]`
- The system processes `metadata.inherits` to merge configuration (vars, settings, etc.) from `eks/service/defaults`
- BUG: during this processing, `metadata.component: eks-service` gets overwritten with the inherited component's `metadata.component` value
- Result: the component path becomes `eks-service-defaults` instead of `eks-service`
The Fix
Modified `processMetadataInheritance()` in `stack_processor_process_stacks_helpers_inheritance.go` to:
- Track whether `metadata.component` was explicitly set on the current component
- Process `metadata.inherits` to inherit vars, settings, backend, etc. (but NOT metadata)
- Restore the explicitly set `metadata.component` after processing inherits (prevents overwriting)
- Default `metadata.component` to the Atmos component name when not explicitly set; this happens regardless of whether `metadata.inherits` exists
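The track/merge/restore sequence can be illustrated with plain maps. A deliberately simplified sketch, not the actual `processMetadataInheritance()` code: it reproduces the clobbering step the regression effectively performed, then applies the restore/default logic the fix adds. All names and the map-based shape are assumptions for illustration.

```go
package main

import "fmt"

// resolveComponent sketches the fix: remember whether metadata.component was
// explicitly set, merge inherited configuration, then restore the explicit
// value or default it to the Atmos component name.
func resolveComponent(atmosName string, metadata, inherited map[string]any) string {
	explicit, hasExplicit := metadata["component"].(string)

	// Merging inherited configuration could clobber metadata.component
	// (this line models what the regression effectively did)...
	if c, ok := inherited["component"].(string); ok {
		metadata["component"] = c
	}

	// ...so restore the explicit value after processing inherits.
	if hasExplicit {
		metadata["component"] = explicit
	} else {
		// The metadata section is not inherited: default to the Atmos
		// component name when metadata.component was never set.
		metadata["component"] = atmosName
	}
	return metadata["component"].(string)
}

func main() {
	// Scenario 2: explicit metadata.component survives inheritance.
	fmt.Println(resolveComponent("eks/service/app1",
		map[string]any{"component": "eks-service"},
		map[string]any{"component": "eks-service-defaults"})) // → eks-service

	// Scenario 1: no explicit value, defaults to the Atmos component name.
	fmt.Println(resolveComponent("eks/n8n",
		map[string]any{},
		map[string]any{"component": "helm-teleport-helper"})) // → eks/n8n
}
```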
Code Changes
The fix ensures proper handling in three scenarios:
Scenario 1: Component without explicit `metadata.component` (Fixes #1609)
components:
  terraform:
    eks/n8n: # Atmos component name
      metadata:
        inherits: [eks/helm-teleport-helper] # Inherit configuration only
      vars:
        name: "n8n"
✅ Now works: `metadata.component` defaults to `eks/n8n` (the Atmos component name)
❌ Before: `metadata.component` was overwritten during configuration inheritance and became `helm-teleport-helper` (component not found error)
Scenario 2: Component with explicit `metadata.component` + configuration inheritance (Fixes #1613)
components:
  terraform:
    eks/service/app1:
      metadata:
        component: eks-service # Explicit - determines the physical path
        inherits: [eks/service/defaults] # Inherit configuration (vars, backend, etc.)
✅ Now works: `metadata.component` stays `eks-service`, and the backend uses the correct `workspace_key_prefix`
❌ Before: `metadata.component` was overwritten to `eks-service-defaults` during configuration inheritance, and new empty state was created
Scenario 3: Top-level `component:` attribute (also affected by the regression)
components:
  terraform:
    test/test-component-override:
      component: test/test-component # Top-level attribute (not metadata.component)
      # ... configuration ...
✅ Now works: the top-level `component:` attribute sets the component path and is not overwritten by defaults
❌ Before: the default logic was overwriting top-level component inheritance
Summary by CodeRabbit
- Bug Fixes
  - Component inheritance now relies solely on metadata.inherits and preserves top-level component attributes.
  - Default component path/name correctly resolves when metadata.component is absent.
  - Abstract-component backends inherit workspace_key_prefix and backend settings as expected.
  - Improved validation and error handling for metadata.inherits entries.
- Tests
  - Added scenarios and tests covering abstract-component backends, inheritance without metadata.component, and a guard for certain mock failures.
- Chores
  - Updated dependency versions.