Add trace logging for all squelched errors @osterman (#1615)
## what - Add `log.Trace()` calls for all squelched errors throughout the codebase - Update logging documentation to include squelched error handling guidance - Update error handling strategy PRD to document squelched error patternswhy
- Ensures no errors are silently lost, even when intentionally ignored
- Provides complete error visibility during debugging with
--logs-level=Trace
- Establishes clear patterns for handling non-critical errors
details
Code Changes (19 files)
Added trace logging for squelched errors in:
- Configuration binding: Environment variables and flags (
viper.BindEnv()
,viper.BindPFlag()
) - File cleanup: Temporary file and directory removal (
os.Remove()
,os.RemoveAll()
) - Resource closing: File handles, clients, connections (
Close()
) - Lock operations: File locks in defer statements (
Unlock()
) - UI operations: Terminal output, command help (
fmt.Fprint()
,cmd.Help()
) - Performance tracking: Histogram value recording
- Cache operations: Non-critical cache file operations on Windows
Patterns Applied
// ❌ WRONG: Silent error squelching
_ = os.Remove(tempFile)
// ✅ CORRECT: Log squelched errors at Trace level
if err := os.Remove(tempFile); err != nil && !os.IsNotExist(err) {
log.Trace("Failed to remove temporary file during cleanup", "error", err, "file", tempFile)
}
Special Cases
- Defer statements: Capture errors in closures for logging
- File existence checks: Use
os.IsNotExist()
to avoid logging expected conditions - Log file cleanup: Use
fmt.Fprintf(os.Stderr, ...)
to avoid logger recursion
Documentation Updates
- docs/logging.md: Added comprehensive "Squelched Errors" section with patterns and examples
- docs/prd/error-handling-strategy.md: Added "Squelched Error Handling" section with guidelines and code examples
references
- Related to overall error handling and logging strategy
- All changes compile and pass pre-commit hooks
🤖 Generated with Claude Code
Summary by CodeRabbit
-
Bug Fixes
- Prevented silent failures by adding error handling and trace-level logging across command flags, env bindings, config parsing, file cleanup, network response closing, metrics recording, and TUI rendering.
- Improved cleanup reliability by logging non-fatal errors during temporary file/dir removal and resource closure.
-
Refactor
- Standardized logging for squelched (non-fatal) errors, replacing ignored returns with guarded paths without changing core behavior.
-
Documentation
- Expanded logging guidance with a new “Squelched Errors” section, examples, patterns, and best practices.
- Updated error-handling strategy to include when and how to log squelched errors.
Improve performance heatmap @aknysh (#1611)
## what - Fix `--heatmap` flag not working with `terraform`, `helmfile`, and `packer` commands - Implement advanced performance tracking with self-time vs total-time separation - Fix recursive function performance tracking to show accurate counts AND accurate timing - Improve heatmap display with consistent metrics and informative legend - Add comprehensive tests for heatmap functionality and recursive trackingatmos describe stacks --heatmap

atmos terraform plan vpc -s uw2-prod --heatmap

why
Heatmap Flag Fix
The --heatmap
flag was not working for terraform
, helmfile
, and packer
commands because:
- These commands use
DisableFlagParsing = true
to pass native flags through to underlying tools - When flag parsing is disabled, Cobra doesn't parse the
--heatmap
flag - The
PersistentPreRun
hook couldn't detect the flag viacmd.Flags().GetBool("heatmap")
- Performance tracking was never enabled, so no data collected
Performance Tracking Enhancement
--heatmap
showed inconsistent timing metrics for long-running commands:
- Elapsed time was correct
- Individual function totals were massively inflated (- approximately 1,890x inflation)
- Root cause: Unable to show accurate call counts for recursive functions without timing inflation
Requirement: Show true call volume (e.g., 1,890 calls) with accurate timing (no inflation) and consistent metrics.
changes
Advanced Performance Tracking (Self-Time vs Total-Time)
Implemented professional-grade profiling metrics that separate:
- Total time (wall-clock): Includes time spent in child function calls
- Self-time: Actual work done in the function, excluding children
- Accurate recursive tracking: Shows ALL calls including recursive ones with correct timing
Key Features:
- Goroutine-local call stack tracking - Each goroutine maintains its own call stack
- Child time accumulation - Each stack frame tracks time spent in children
- Self-time calculation -
selfTime = totalTime - childTime
- HDR Histogram on self-time - P95 based on actual work, not wall-clock
- Direct recursive tracking - No wrapper pattern needed, tracks every call accurately
Benefits:
- Accurate recursive tracking: Shows true call counts (e.g., 1,890 calls) with correct timing
- No inflation: Self-time excludes child execution, total-time includes it
- Better insights: Identify where time is spent (total) vs where work is done (self-time)
- Professional profiling: Same metrics as pprof, but function-level and easier to use
Example Output:
Function Count Total Avg Max P95
utils.processCustomTags 1024 4.27ms 4µs 146µs 15µs
- Count: 1024 - ALL calls including recursive ones
- Total: 4.27ms - wall-clock time including all children
- Avg: 4µs - average self-time per call (actual work only)
- Max: 146µs - maximum self-time for a single call (excludes children)
- P95: 15µs - 95th percentile of self-time
Metric Consistency Fix
Changed Max to track self-time instead of total-time:
- All three metrics (
Avg
,Max
,P95
) consistently track self-time - Rationale: Enables accurate comparison between metrics for identifying performance outliers
TUI Legend
Added informative legend to heatmap TUI:
Count: # calls (incl. recursion) | Total: wall-clock (incl. children & recursion) |
Avg: avg self-time | Max: max self-time | P95: 95th percentile self-time
- Appears after the header in all visualization modes
- Explains what each metric means
- Clarifies that Count and Total include recursion
- Helps users understand self-time vs total-time
Recursive Function Updates
Benefits:
- Shows true call volume (e.g., 1,024 calls instead of 1)
- Timing remains accurate via self-time calculation
- No wrapper pattern complexity needed
Documentation
Updated profiling documentation:
- "Understanding Total vs Self-Time" tip reflects that all three metrics now track self-time
- Updated metric descriptions with accurate explanations
- Added examples showing self-time context for outlier detection
Heatmap Flag Improvements
- Added helper function to manually parse
--heatmap
fromos.Args
for commands withDisableFlagParsing = true
- Integrated heatmap flag detection in
terraform
,helmfile
, andpacker
commands - Added test coverage for heatmap flag detection
Enhanced Performance Tracking Coverage
- Added
perf.Track()
toUnmarshalYAMLFromFile
to track YAML parsing performance - Ensures complete visibility into YAML processing call chains during merge operations
testing
Performance Tracking Tests
New self-time tracking tests:
TestSelfTimeVsTotalTime
- Verifies self-time excludes child timeTestNestedFunctionSelfTime
- Tests multi-level nesting (grandparent → parent → child)TestDirectRecursionWithSelfTime
- Tests direct recursion with accurate counts AND timing
Updated existing tests:
- All 20 perf tests passing ✅
- Tests verify self-time calculation accuracy for all three metrics (Avg, Max, P95)
Recursive function tests:
TestRecursiveFunctionTracking
- Wrapper pattern verificationTestRecursiveFunctionWrongPattern
- Demonstrates inflation with wrong patternTestMultipleRecursiveFunctionsIndependent
- Independent function trackingTestYAMLConfigProcessingRecursion
- YAML import hierarchy (50 levels deep)TestYAMLConfigProcessingMultipleImports
- Fan-out import patternTestProcessBaseComponentConfigRecursion
- Component inheritance (15 levels deep)
Heatmap UI Tests
Updated heatmap tests:
- Updated legend verification tests
- All 36 heatmap tests passing ✅
Heatmap Flag Tests
✅ TestTerraformHeatmapFlag
passes successfully
✅ Manual testing confirms --heatmap
works with:
atmos terraform plan <component> -s <stack> --heatmap
atmos helmfile <subcommand> <component> -s <stack> --heatmap
atmos packer <subcommand> <component> -s <stack> --heatmap
Build Verification
✅ All linter checks passing (golangci-lint)
✅ Full project builds successfully
✅ Website documentation builds without errors
✅ All 56 tests passing (20 perf + 36 heatmap)
expected impact
For Recursive Functions
- Before: Count = 1, accurate timing (using wrapper pattern, recursion hidden)
- After: Count = 1,890 (shows true call volume), accurate timing (via self-time tracking)
Performance Insights
- Total time: Identify functions contributing most to overall execution time
- Self-time: Identify where actual work is being done (optimization targets)
- Consistent metrics: All three metrics (Avg, Max, P95) track self-time for accurate comparison
- Accurate P95: Based on self-time, more meaningful for performance analysis
Better User Experience
- TUI legend: Users immediately understand what each metric means
- Accurate Max: Identifies performance outliers in function's own work, not including children
Summary by CodeRabbit
-
New Features
- --heatmap CLI flag available for Terraform, Helmfile, and Packer; heatmap UI adds a legend and wider duration columns.
- New Terraform "affected" and query execution workflows.
-
Improvements
- New profiler-related CLI flags and a performance-tracking toggle.
- Performance metrics now emphasize self-time (Avg, P95, Max) for clearer bottleneck identification.
- Clearer, more actionable user-facing error messages; telemetry notices suppressed by default.
-
Documentation
- Profiling guide updated to explain Total vs Self-Time.
-
Chores
- Go toolchain and example image versions bumped.
🚀 Enhancements
Fix component inheritance: restore default behavior for metadata.component @aknysh (#1614)
## what - Fix critical component inheritance regression where `metadata.component` was being overwritten during configuration inheritance - Restore correct default behavior where `metadata.component` defaults to the Atmos component name when not explicitly set - **Prevent state file loss**: Fix backend configuration to use correct `workspace_key_prefix` instead of inherited component nameswhy
- PR #1602 introduced a regression in component processing that broke
metadata.component
handling - Critical Impact: Components using abstract component patterns were initializing new empty state files instead of finding existing state
- When inheriting configuration via
metadata.inherits
, the system was inadvertently overwritingmetadata.component
with values from inherited components - Backend
workspace_key_prefix
ended up using the wrong component name (e.g.,eks-service-defaults
instead ofeks-service
) - This created new state at wrong S3 path, causing
terraform plan
to show full re-create of all resources - Could lead to resource duplication or apply failures due to conflicts
- When inheriting configuration via
- This caused errors like:
invalid Terraform component: 'eks/n8n' points to the Terraform component 'helm-teleport-helper', but it does not exist
references
- Fixes #1613 (Backend state file regression - workspace_key_prefix uses wrong component name)
- Fixes #1609 (Component resolution fails when metadata.component omitted)
- Related to PR #1602 (the refactoring that introduced the regression)
testing
- Added comprehensive regression test in
internal/exec/component_inheritance_test.go
- Tests component that inherits configuration but doesn't set
metadata.component
- Verifies configuration inheritance (vars, settings) works correctly
- Verifies metadata section is not inherited
- Verifies component path defaults correctly to the Atmos component name
- Tests component that inherits configuration but doesn't set
- Added backend configuration test in
internal/exec/abstract_component_backend_test.go
- Critical test for backend state issue: Tests component with explicit
metadata.component
inheriting configuration from abstract component - Verifies
metadata.component
is preserved (not overwritten by inherited component's metadata.component) - Verifies backend
workspace_key_prefix
uses correct component path (e.g.,eks-service
noteks-service-defaults
) - Verifies vars inheritance from abstract component works correctly
- Critical test for backend state issue: Tests component with explicit
- Created test fixtures in
tests/fixtures/scenarios/
:component-inheritance-without-metadata-component/
abstract-component-backend/
- All existing tests pass
- Manual testing confirms the fix works correctly
impact
Before (1.194.0 - Broken)
# Abstract component
eks/service/defaults:
metadata:
type: abstract
component: eks-service
# Concrete instance
eks/service/app1:
metadata:
component: eks-service
inherits: [eks/service/defaults] # Inherit configuration (vars, settings, etc.)
Problem: Processing metadata.inherits
overwrote metadata.component
- Backend ended up using
workspace_key_prefix: eks-service-defaults
(from inherited component) - Created new state at:
s3://bucket/eks-service-defaults/...
- Existing state at:
s3://bucket/eks-service/...
(not found) - Result:
terraform plan
showed full re-create of all resources ❌
After (This Fix)
Solution: metadata.component
is preserved during configuration inheritance
- Backend correctly uses
workspace_key_prefix: eks-service
(from explicit metadata.component) - Finds existing state at:
s3://bucket/eks-service/...
- Result:
terraform plan
shows correct incremental changes ✅
details
Key Concepts (Independent)
-
metadata.component
: Pointer to the physical terraform component directory incomponents/terraform/
- Defaults to the Atmos component name if not specified
- Determines the physical directory path for the component
- Critical: Used to generate backend configuration (e.g.,
workspace_key_prefix
) - Completely independent of configuration inheritance
-
metadata.inherits
: List of Atmos components to inherit configuration from (vars, settings, env, backend, etc.)- Purely for configuration merging - has nothing to do with component paths
- The
metadata
section itself is NOT inherited from base components - Allows multiple inheritance for configuration merging
- Completely independent of
metadata.component
and Atmos component names
-
Atmos component name: The logical name of the component in the stack configuration
- Used as default for
metadata.component
if not specified - Completely independent of configuration inheritance
- Used as default for
The Bug
The regression in PR #1602 caused metadata.component
to be inadvertently overwritten during configuration inheritance processing:
- Component
eks/service/app1
hasmetadata.component: eks-service
andmetadata.inherits: [eks/service/defaults]
- System processes
metadata.inherits
to merge configuration (vars, settings, etc.) fromeks/service/defaults
- BUG: During this processing,
metadata.component: eks-service
gets overwritten with the inherited component'smetadata.component
value - Result: Component path becomes
eks-service-defaults
instead ofeks-service
The Fix
Modified processMetadataInheritance()
in stack_processor_process_stacks_helpers_inheritance.go
to:
- Track whether
metadata.component
was explicitly set on the current component - Process
metadata.inherits
to inherit vars, settings, backend, etc. (but NOT metadata) - Restore the explicitly-set
metadata.component
after processing inherits (prevents overwriting) - Default
metadata.component
to the Atmos component name when not explicitly set - This happens regardless of whether
metadata.inherits
exists or not
Code Changes
The fix ensures proper handling in three scenarios:
Scenario 1: Component without explicit metadata.component
(Fixes #1609)
components:
terraform:
eks/n8n: # Atmos component name
metadata:
inherits: [eks/helm-teleport-helper] # Inherit configuration only
vars:
name: "n8n"
✅ Now works: metadata.component
defaults to eks/n8n
(Atmos component name)
❌ Before: metadata.component
was overwritten during configuration inheritance, became helm-teleport-helper
(component not found error)
Scenario 2: Component with explicit metadata.component
+ configuration inheritance (Fixes #1613)
components:
terraform:
eks/service/app1:
metadata:
component: eks-service # Explicit - determines physical path
inherits: [eks/service/defaults] # Inherit configuration (vars, backend, etc.)
✅ Now works: metadata.component
stays eks-service
, backend uses correct workspace_key_prefix
❌ Before: metadata.component
overwritten to eks-service-defaults
during configuration inheritance, new empty state created
Scenario 3: Top-level component:
attribute (Also affected by regression)
components:
terraform:
test/test-component-override:
component: test/test-component # Top-level attribute (not metadata.component)
# ... configuration ...
✅ Now works: Top-level component:
sets the component path, not overwritten by defaults
❌ Before: Default logic was overwriting top-level component inheritance
🤖 Generated with Claude Code
Summary by CodeRabbit
-
Bug Fixes
- Component inheritance now relies solely on metadata.inherits and preserves top-level component attributes.
- Default component path/name correctly resolves when metadata.component is absent.
- Abstract-component backends inherit workspace_key_prefix and backend settings as expected.
- Improved validation and error handling for metadata.inherits entries.
-
Tests
- Added scenarios and tests covering abstract-component backends, inheritance without metadata.component, and a guard for certain mock failures.
-
Chores
- Updated dependency versions.