Bot lifecycle management refactor
This release includes architectural changes to bot lifecycle management and internal messaging, along with general improvements that address recent issues. The supervisor service is now the actor that detects assignment changes and drives the resulting state changes in the other services.
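As a rough illustration of what detecting assignment changes can look like, the sketch below diffs the set of running bots against the latest assignment. It is not the supervisor's actual code: `BotConfig`, `Changes` and `diffAssignments` are hypothetical names, and comparing only the image reference is a simplifying assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// BotConfig is a hypothetical, simplified description of an assigned bot.
type BotConfig struct {
	ID    string
	Image string
}

// Changes lists the bot IDs that differ between two assignment snapshots.
type Changes struct {
	Added   []string // assigned now, not running yet
	Updated []string // still assigned, but with a different image
	Removed []string // running, but no longer assigned
}

// diffAssignments compares the running bots with the latest assignment.
func diffAssignments(running, latest map[string]BotConfig) Changes {
	var c Changes
	for id, next := range latest {
		prev, ok := running[id]
		switch {
		case !ok:
			c.Added = append(c.Added, id)
		case prev.Image != next.Image:
			c.Updated = append(c.Updated, id)
		}
	}
	for id := range running {
		if _, ok := latest[id]; !ok {
			c.Removed = append(c.Removed, id)
		}
	}
	// Map iteration order is random, so sort for deterministic results.
	sort.Strings(c.Added)
	sort.Strings(c.Updated)
	sort.Strings(c.Removed)
	return c
}

func main() {
	running := map[string]BotConfig{
		"bot-a": {ID: "bot-a", Image: "example/bot-a:1"},
		"bot-b": {ID: "bot-b", Image: "example/bot-b:1"},
	}
	latest := map[string]BotConfig{
		"bot-b": {ID: "bot-b", Image: "example/bot-b:2"},
		"bot-c": {ID: "bot-c", Image: "example/bot-c:1"},
	}
	c := diffAssignments(running, latest)
	fmt.Println("start:", c.Added, "restart:", c.Updated, "stop:", c.Removed)
}
```

A single component that computes this diff can then tell the other services which bots to start, restart or stop, instead of each service working it out on its own.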
More bot metrics and new system errors
To improve observability of internal issues, we reused the existing bot metric definitions and added bot lifecycle metrics. Some of these metrics mark checkpoints in the lifecycle, while others report bot lifecycle errors and system errors.
Metrics now carry a new details field, which makes it possible to send error messages along with them.
Other improvements
- Added back handling of initialize response errors (originally added in #724)
- Added a cleanup step to the bot lifecycle to remove unused bot resources
- Added a cooldown between bot image pull retries and removed the infinite retry loop
- Kept the infinite retry loop for bot release image downloads
What's Changed
- Bot lifecycle management refactor by @canercidam in #726
- Add metrics to refactor by @stevenlanders in #738
- Clean up unused bots by @canercidam in #735
- Handle error response from initialize and detect nil responses by @canercidam in #736
- Close bot clients and replace in pool by @canercidam in #739
- Add cooldown to image pulls by @canercidam in #737
- Set the supervisor strategy version as RFC3339 timestamp by @canercidam in #741
- Copy details into metrics summary by @stevenlanders in #740
- Make release image download retry by @canercidam in #742
Full Changelog: v0.7.16...v0.8.0