Overview
This patch release fixes a deadlock in the discovery Table that can stall a long-running op-geth node and prevent graceful shutdown.
The bug was introduced upstream in go-ethereum PR #32518 (which added waitForNodes) and applies equally to op-geth/celo. It manifests after some hours of uptime, after which:
- Peer discovery silently degrades (the discovery table mutex is held indefinitely by a wedged refresh goroutine)
- On SIGINT/SIGTERM, graceful shutdown stalls inside FairMix.Close → bufferIter.Close, waiting for a producer goroutine that can never exit
- If the process is force-killed after the termination grace period, the leveldb/pebble store may be left in a torn state, causing a missing trie node error on the next startup
This was observed on clusters running celo-v2.2.1. Operators of long-running nodes should upgrade.
What's fixed
nodeFeed.Send() was being called from Table.nodeAdded() while tab.mutex was held, and the only subscriber to nodeFeed (Table.waitForNodes) needs the same mutex to read the next event. Under contention this forms an AB-BA deadlock that holds the table mutex permanently.
The fix moves every nodeFeed.Send call out from under tab.mutex (six call sites in Table.loop's addNodeCh case, loadSeedNodes, handleTrackRequest, deleteNode, and tableRevalidation.handleResponse). It is a cherry-pick of upstream go-ethereum PR #33665 plus a regression test (which was missing from the upstream PR) and a small panic-safety improvement.
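The lock ordering described above can be illustrated with a minimal, self-contained sketch. It is not the op-geth code: the table type, its methods, and the unbuffered events channel (standing in for nodeFeed, on the assumption that a feed send blocks until the subscriber receives) are all hypothetical names chosen for illustration. The subscriber takes the mutex before reading each event, so a sender that holds the mutex across several sends can deliver at most one of them; the timeout in the first variant exists only so the sketch reports the stall instead of hanging.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// table is a hypothetical stand-in for the discovery Table, not the real type.
// The unbuffered events channel plays the role of nodeFeed: a send blocks
// until the subscriber receives.
type table struct {
	mu     sync.Mutex
	nodes  []string
	events chan string
}

func newTable() *table { return &table{events: make(chan string)} }

// waitForNodes mimics the subscriber: it takes mu before reading each event.
func (t *table) waitForNodes(got chan<- string) {
	for {
		t.mu.Lock()
		_ = len(t.nodes) // inspect table state under the lock
		t.mu.Unlock()
		got <- <-t.events
	}
}

// addSendingUnderLock shows the pre-fix ordering: events are sent while mu is
// held. The timeout exists only so the sketch reports a stall instead of
// hanging. It returns how many events were delivered.
func (t *table) addSendingUnderLock(ids []string, timeout time.Duration) int {
	t.mu.Lock()
	defer t.mu.Unlock()
	delivered := 0
	for _, id := range ids {
		t.nodes = append(t.nodes, id)
		select {
		case t.events <- id:
			delivered++
		case <-time.After(timeout):
			// The subscriber is blocked on mu, so nobody is receiving.
		}
	}
	return delivered
}

// addSendingAfterUnlock shows the post-fix ordering: mutate state under mu,
// release it, then send.
func (t *table) addSendingAfterUnlock(ids []string) {
	t.mu.Lock()
	t.nodes = append(t.nodes, ids...)
	t.mu.Unlock()
	for _, id := range ids {
		t.events <- id
	}
}

func main() {
	ids := []string{"a", "b", "c"}

	buggy := newTable()
	go buggy.waitForNodes(make(chan string, len(ids)))
	n := buggy.addSendingUnderLock(ids, 50*time.Millisecond)
	fmt.Printf("send under lock: delivered %d of %d\n", n, len(ids))

	fixed := newTable()
	got := make(chan string, len(ids))
	go fixed.waitForNodes(got)
	fixed.addSendingAfterUnlock(ids)
	count := 0
	for range ids {
		<-got
		count++
	}
	fmt.Printf("send after unlock: delivered %d of %d\n", count, len(ids))
}
```

In the first variant the subscriber cannot take the mutex between two receives while the sender holds it, so at most one event gets through. Without the illustrative timeout the sender would block forever with the mutex held, which is the wedged state described above. In the second variant neither side blocks while holding the mutex, so every event is delivered.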
Upgrade notes
Node operators who wish to avoid re-syncing, and whose infrastructure supports disk snapshots, should repeatedly snapshot the disk containing the datadir until they obtain a snapshot from which op-geth starts successfully. Then proceed with the upgrade, and if datadir corruption is encountered, fall back to the snapshot.
Node operators who have already shut down and encountered corruption will need to re-sync.
This upgrade requires no changes to flags or config.
Docker image
🐳 us-west1-docker.pkg.dev/devopsre/celo-blockchain-public/op-geth:celo-v2.2.3
What's Changed
Full Changelog: celo-v2.2.2...celo-v2.2.3