op-geth celo-v2.2.3

6 hours ago

Overview

This patch release fixes a deadlock in the discovery Table that can stall a long-running op-geth node and prevent graceful shutdown.

The bug was introduced upstream in go-ethereum PR #32518 (which added waitForNodes) and applies equally to op-geth/celo. It manifests after some hours of uptime, with the following symptoms:

  • Peer discovery silently degrades (the discovery table mutex is held indefinitely by a wedged refresh goroutine)
  • On SIGINT/SIGTERM, graceful shutdown stalls inside FairMix.Close → bufferIter.Close waiting for a producer goroutine that can never exit
  • If the process is force-killed after the termination grace period, the leveldb/pebble store may be left in a torn state, causing a missing trie node error on the next startup

This was observed on clusters running celo-v2.2.1. We recommend that operators of long-running nodes upgrade.

What's fixed

nodeFeed.Send() was being called from Table.nodeAdded() while tab.mutex was held, and the only subscriber to nodeFeed (Table.waitForNodes) needs the same mutex to read the next event. Under contention this forms an AB-BA deadlock that holds the table mutex permanently.

The fix moves every nodeFeed.Send call out from under tab.mutex (six call sites in Table.loop's addNodeCh case, loadSeedNodes, handleTrackRequest, deleteNode, and tableRevalidation.handleResponse). It is a cherry-pick of upstream go-ethereum PR #33665 plus a regression test (which was missing from the upstream PR) and a small panic-safety improvement.

Upgrade notes

Node operators who wish to avoid re-syncing and whose infrastructure supports disk snapshots are advised to repeatedly snapshot the disk containing the datadir until they obtain a snapshot from which op-geth starts successfully. Then proceed with the upgrade and, if datadir corruption is encountered, fall back to the snapshot.

Node operators who have already shut down and encountered corruption will need to re-sync.

This upgrade requires no changes to flags or config.

Docker image

🐳 us-west1-docker.pkg.dev/devopsre/celo-blockchain-public/op-geth:celo-v2.2.3

What's Changed

  • Cherry-pick p2p/discover deadlock fix by @piersy in #496

Full Changelog: celo-v2.2.2...celo-v2.2.3
