Improved resiliency during severe performance degradation
This release includes improvements to the cluster membership algorithm. These changes are opt-in in this initial release and are aimed at improving the accuracy of cluster membership when some or all nodes are in a degraded state. Details follow.
Perform self-health checks before suspecting other nodes (#6745)
This PR implements some of the ideas from Lifeguard (paper, talk, blog), which can help during times of catastrophe, where a large portion of a cluster is in a state of partial failure. One cause of these kinds of partial failures is large-scale thread pool starvation, which can cause a node to run slowly enough that it cannot process messages in a timely manner. Slow nodes can therefore suspect healthy nodes simply because the slow node is not able to process the healthy node's timely response. If a sufficiently large proportion of nodes in a cluster are slow (e.g., due to an application bug), then healthy nodes may have trouble joining and remaining in the cluster, since the slow nodes can evict them. In this scenario, slow nodes will also be evicting each other. The intention of this change is to improve cluster stability in these scenarios.
This PR introduces `LocalSiloHealthMonitor`, which uses heuristics to score the local silo's health. A low score (0) represents a healthy node and a high score (1 to 8) represents an unhealthy node. `LocalSiloHealthMonitor` implements the following heuristics:
- Check that this silo is marked as `Active` in membership
- Check that no other silo suspects this silo
- Check for recently received successful ping responses
- Check for recently received ping requests
- Check that the .NET Thread Pool is able to execute work items within 1 second from enqueue time
- Check that local async timers have been firing on-time (within 3 seconds of their due time)
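A minimal sketch of how such a score could be aggregated, assuming each heuristic is modeled as a check that returns true when healthy (the delegate-based shape below is illustrative, not the actual `LocalSiloHealthMonitor` API):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class LocalHealthScoreSketch
{
    // Illustrative only: the score is the number of failing heuristics,
    // so 0 means all checks passed (healthy) and each of the up-to-8
    // failing checks adds one point of degradation.
    public static int ComputeLocalHealthScore(IEnumerable<Func<bool>> healthChecks)
    {
        return healthChecks.Count(check => !check());
    }
}
```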
Failing heuristics contribute to increased probe timeouts, which has two effects:
- Improves the chance of a successful probe to a healthy node
- Increases the time taken for an unhealthy node to vote a healthy node dead, giving the cluster a larger chance of voting the unhealthy node dead first (Nodes marked as dead are pacified and cannot vote others)
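As an illustration of the first effect, a failing-heuristic count could lengthen the probe timeout along these lines (the multiplier scheme here is an assumption for illustration, not the shipped formula):

```csharp
using System;

static class ProbeTimeoutSketch
{
    // Hypothetical: each failing heuristic (local health score 0-8) adds
    // one base timeout's worth of slack, so a fully degraded silo waits
    // longer before it can suspect a peer of being dead.
    public static TimeSpan ExtendProbeTimeout(TimeSpan baseTimeout, int localHealthScore)
    {
        return baseTimeout * (1 + localHealthScore);
    }
}
```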
The effects of this feature are disabled by default in this release, with only passive background monitoring being enabled. The extended probe timeouts feature can be enabled by setting `ClusterMembershipOptions.ExtendProbeTimeoutDuringDegradation` to `true`. The passive background monitoring period can be configured by changing `ClusterMembershipOptions.LocalHealthDegradationMonitoringPeriod` from its default value of 10 seconds.
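For example, both settings can be applied through the standard options configuration on the silo builder (a sketch, assuming a `siloBuilder` from the usual Orleans hosting setup; only the `ClusterMembershipOptions` property names come from this release, and the 5 second period is an arbitrary example value):

```csharp
siloBuilder.Configure<ClusterMembershipOptions>(options =>
{
    // Opt in to extending probe timeouts while the local silo is degraded.
    options.ExtendProbeTimeoutDuringDegradation = true;

    // Optionally run the passive background monitor more often than the
    // 10 second default.
    options.LocalHealthDegradationMonitoringPeriod = TimeSpan.FromSeconds(5);
});
```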
Probe silos indirectly before submitting a vote (#6800)
This PR adds support for indirectly pinging silos before suspecting/declaring them dead.
When a silo is one missed probe away from being voted dead, the monitoring silo switches to indirect pings. In this mode, the silo picks another silo at random and sends it a request to probe the target silo. If the intermediary silo responds promptly with a negative acknowledgement (i.e., it could not reach the target within a specified timeout), then the target silo will be suspected/declared dead.
Additionally, when the vote limit to declare a silo dead is 2, a negative acknowledgement counts for both required votes and the silo is unilaterally declared dead.
The feature is disabled by default in this release - only direct probes are used by default - but it could be enabled in a later release, or enabled now by setting `ClusterMembershipOptions.EnableIndirectProbes` to `true`.
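As with the other membership settings, this can be switched on via options configuration (a sketch, assuming a `siloBuilder` from the usual Orleans hosting setup; only the option name comes from this release):

```csharp
siloBuilder.Configure<ClusterMembershipOptions>(options =>
{
    // Opt in to indirect probing before a silo is suspected/declared dead.
    options.EnableIndirectProbes = true;
});
```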
Improvements and bug fixes since 3.3.0
- Non-breaking improvements
- Probe silos indirectly before submitting a vote (#6800) (#6839)
- Perform self-health checks before suspecting other nodes (#6745) (#6836)
- Add IManagementGrain.GetActivationAddress() (#6816) (#6827)
- In GrainId.ToString(), display the grain type name and format the key properly (#6774)
- Add ADO.NET provider support for MySqlConnector 0.x and 1.x. (#6831)
- Non-breaking bug fixes