RC-0.5.2.9
After nine Beta releases, two months of continuous internal and external testing, and extensive investigation and QA, Pocket Network's Engineering team believes the resource problems of RC-0.5.0 are fixed in the upcoming RC-0.5.2 (see below for known issues).
Official upgrade guide here
Important Release Notes
- Delete Session.DB before upgrading from RC-0.5.1
rm -rf <datadir>/session.db
- Run this release with the following environment variable:
export GODEBUG="madvdontneed=1"
Link to Golang Issue
- Use the default config for all options (except unique configurations like moniker, external addr, etc). You have two options:
  - Remove the `/config/config.json` file, execute a CLI command, and update the custom configurations
  - Run the `pocket util update-configs` command (creates a new config file and backs up the old config file)
GoLevelDB is the only supported database from RC-0.5.2 onward
- Users who previously ran CLevelDB might experience issues because of known incompatibilities between the two databases
- PNI will temporarily provide a backup datadir for download, so nodes can avoid syncing from scratch:
13K .zip
13K .tar.gz
- After uncompressing these files, place the contents in the `<datadir>/data` folder
Context And Original Issues
After a series of related issues with Pocket Core's RC-0.5.0 were opened (#1115 #1094 #1116 #1117 ++) in October 2020, PNI opened a formal investigation into the resource consumption of RC-0.5.0 (and subsequently the more stable RC-0.5.1). The main metric of concern with RC-0.5.0 resources is 'Memory' (virtual, real, RSS, you name it), with a very tangible 'Memory Leak'. 'Relay Stability', though a primary concern for any release, is a secondary concern for RC-0.5.2, as RC-0.5.1 seemed to solve the immediate, emergency-level Code 66 errors that plagued blocks 6K-7.5K. Speed is a tertiary concern: RC-0.5.0 took 10+ hours to sync to Mainnet block 7000.
Tooling
To debug the issues above, several tools were used to determine their root causes.
Listed in no particular order:
- Grafana (Observability/Visibility of resources and consensus issues)
- Google's PProf (CPU and Memory visibility and profile snapshot differences)
- GCVIS (Golang garbage collector monitoring)
- Docker/Docker-Compose (Clean room simulations)
- GCP (Load testing)
- Golang Runtime Pkg (Memstats Testing)
- Golang Debug Pkg (FreeOsMemory Testing)
- GoLand+Debugger (IDE and Debugger)
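As a rough illustration of the "Memstats Testing" and "FreeOsMemory Testing" entries above, the sketch below reads heap counters with `runtime.ReadMemStats` before and after calling `debug.FreeOSMemory`. It is not taken from Pocket Core: the `heapSnapshot` type and the `takeSnapshot` and `measureRelease` helpers are hypothetical names chosen for this example, and the 64 MiB buffer is an arbitrary stand-in for real node workload.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// heapSnapshot (hypothetical) keeps the few MemStats fields compared here.
type heapSnapshot struct {
	HeapAlloc    uint64 // bytes of allocated heap objects
	HeapIdle     uint64 // bytes in idle (unused) spans
	HeapReleased uint64 // bytes of physical memory returned to the OS
}

// takeSnapshot reads the runtime's memory statistics.
func takeSnapshot() heapSnapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return heapSnapshot{
		HeapAlloc:    m.HeapAlloc,
		HeapIdle:     m.HeapIdle,
		HeapReleased: m.HeapReleased,
	}
}

// measureRelease allocates size bytes, snapshots the heap while the buffer
// is still live, then forces a GC plus scavenge and snapshots again.
func measureRelease(size int) (before, after heapSnapshot) {
	buf := make([]byte, size)
	for i := range buf {
		buf[i] = 1 // touch every page so the memory is really resident
	}
	before = takeSnapshot()
	runtime.KeepAlive(buf) // buf is garbage after this point

	// Forces a garbage collection and asks the runtime to return as much
	// memory as possible to the operating system.
	debug.FreeOSMemory()
	after = takeSnapshot()
	return before, after
}

func main() {
	const mib = 1 << 20
	before, after := measureRelease(64 * mib)
	fmt.Printf("HeapAlloc:    %d MiB -> %d MiB\n", before.HeapAlloc/mib, after.HeapAlloc/mib)
	fmt.Printf("HeapIdle:     %d MiB -> %d MiB\n", before.HeapIdle/mib, after.HeapIdle/mib)
	fmt.Printf("HeapReleased: %d MiB -> %d MiB\n", before.HeapReleased/mib, after.HeapReleased/mib)
}
```

If `HeapAlloc` drops after the forced collection but the process RSS reported by the operating system does not, the growth is more likely a matter of how memory is returned to the OS than of objects being retained by the program.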
Debugging and Changelog
Immediately, PNI's team recognized many optimizations that could be made within Pocket Core's own source code. These include the following:
- Delete local Relay/Challenge Evidence on Code 66 failures
- Log relay errors to nodes (don't just return to clients)
- Added configuration to pre-validate auto transactions
- Sending proofs/claims moved to EndBlock
- Load only Blockmeta for PrevCtx
- Added configurable cache PrevCtx, Validators, and Applications
- Don't broadcast claims/proofs if syncing
- Spread out claims/proofs between non-session blocks
- Added max claim age configuration for proof submission
- Reorganized non-consensus breaking code in Relay/Merkle Verify for efficiency before reads from state
- Configuration to remove ABCILogs
- Fixed (pseudo) memory leak in Tendermint's RecvPacketMsg()
- Sessions only store addresses and not entire structs
- Only load bare minimum for relay processing
- Add order to AccountTxs query & blockTxsQuery RPC
- Reduce AccountTxsQuery & blockTxsQuery memory footprint
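The "configurable cache" item above can be pictured with a small size-bounded cache. The sketch below is only an assumption about the general shape of such a cache: the source says the PrevCtx, Validators, and Applications caches became configurable, but does not describe their eviction policy, so the least-recently-used policy and the `boundedCache` type here are hypothetical. The sketch is also not safe for concurrent use.

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is what each list element stores.
type entry struct {
	key   string
	value interface{}
}

// boundedCache (hypothetical) is a least-recently-used cache whose
// capacity would come from node configuration.
type boundedCache struct {
	capacity int
	order    *list.List // front = most recently used
	items    map[string]*list.Element
}

func newBoundedCache(capacity int) *boundedCache {
	return &boundedCache{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[string]*list.Element),
	}
}

// Get returns the cached value and marks it as recently used.
func (c *boundedCache) Get(key string) (interface{}, bool) {
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).value, true
}

// Add stores a value and evicts the least recently used entry once the
// configured capacity is exceeded. A capacity of 0 disables caching.
func (c *boundedCache) Add(key string, value interface{}) {
	if c.capacity <= 0 {
		return
	}
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).value = value
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key: key, value: value})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

// Len reports how many entries are currently cached.
func (c *boundedCache) Len() int { return c.order.Len() }

func main() {
	cache := newBoundedCache(2) // e.g. keep state for two block heights
	cache.Add("height-100", "ctx@100")
	cache.Add("height-101", "ctx@101")
	cache.Get("height-100")            // height-100 is now most recent
	cache.Add("height-102", "ctx@102") // evicts height-101

	_, has101 := cache.Get("height-101")
	fmt.Println("entries:", cache.Len(), "height-101 cached:", has101)
}
```

Bounding the cache trades repeated state reads for a predictable memory ceiling, which is the kind of trade-off a memory investigation would want to make tunable.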
The results were quite significant in both speed and initial resource usage. The subsequent Beta releases targeted bug fixes and small improvements needed after the drastic breaking changes of the original Beta.
- Nondeterministic hash fix
- Code 89 Fix
- Evidence Seal Fix
- Fixes header.TotalTxs !=
- Fixes header.NumTxs !=
- Updating TM version and Version Number to BETA-0.5.2.3
- Upgraded AccountTxs and BlockTxs to use ReducedTxSearch
- Implemented Reduced TxSearch in Tendermint
With all of this, the speed and 'Relay Stability' concerns seemed to be solved. However, the 'Memory Leak' was not fixed. Transparently, the team was surprised and unsure how to proceed. One thing was clear: more visibility was needed to solve the issue. With the addition of some much-needed tooling (see above), the hunt was on for the leak culprit. Here's a taste of the testing the team did to hunt down this issue:
- 72 hour simulations in Docker
- Clean Room Relay Stress Tests in GCP
- Mainnet `Validator` and `Full Node` Simulations
- Snapshot comparisons between different versions
- Upgrade Path (0.5.1-0.5.2) simulations
- And Much Much More XD
With the help of some close partners and community members, memory offenders were checked off the list:
- Moved IAVL from Tendermint to Pocket Core
- Call LazyLoadVersion/Store for all queries and PrevCtx()
- Reduced Tendermint P2P EnsurePeers actions to prevent leak
- Lowered P2P config to far more conservative numbers
- Updated FastSync to default to V1
- Exposed default leveldb options
- Switched to only go-leveldb for leak benchmarking/performance reasons
- Child process to run madvdontneed if not set
- Updated P2P configs
- Fixed nil txIndexer bug (Tendermint now sets txindexer and blockstore)
- Removed event type and used Tendermint's abci.Event
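The "child process to run madvdontneed if not set" item can be sketched as a process that re-executes itself with the `GODEBUG` setting added. This is an illustration under assumptions, not Pocket Core's implementation: the `needsReexec` and `withMadvDontNeed` helpers are hypothetical, and a real node would also need to forward signals and exit codes to the child.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

const madvSetting = "madvdontneed=1"

// needsReexec reports whether the GODEBUG value lacks madvdontneed=1.
func needsReexec(godebug string) bool {
	for _, kv := range strings.Split(godebug, ",") {
		if strings.TrimSpace(kv) == madvSetting {
			return false
		}
	}
	return true
}

// withMadvDontNeed returns godebug with madvdontneed=1 appended,
// preserving any settings that were already present.
func withMadvDontNeed(godebug string) string {
	if !needsReexec(godebug) {
		return godebug
	}
	if godebug == "" {
		return madvSetting
	}
	return godebug + "," + madvSetting
}

func main() {
	current := os.Getenv("GODEBUG")
	if !needsReexec(current) {
		fmt.Println("GODEBUG already contains", madvSetting)
		return // the real program would continue normal startup here
	}

	self, err := os.Executable()
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot locate own binary:", err)
		os.Exit(1)
	}

	// Re-run the same binary with the same arguments. When Env holds
	// duplicate keys, os/exec uses the last value, so appending is enough.
	cmd := exec.Command(self, os.Args[1:]...)
	cmd.Env = append(os.Environ(), "GODEBUG="+withMadvDontNeed(current))
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "child exited with error:", err)
		os.Exit(1)
	}
}
```

A child process is needed because the Go runtime reads `GODEBUG` at startup, so changing the variable inside an already running process would not affect how that process releases memory.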
Finally, in Beta-0.5.2.8, memory usage appeared to hold at a constant level.
Evidence
IAVL ISSUES
Memory Bump during a block
IAVL NODE CLONE
Append Events
Tendermint True Bit Indices
Multiple GCVIS heap stability at Beta 5.2.7
Evidence of cache growth from mempool
External Reports from community members
Disclaimer
Though memory usage appears to be both significantly decreased and stabilized, the team is not yet convinced that the memory growth issue is fully fixed (this doubt is not currently supported by evidence). The team expects to dive deeper and provide even more visibility into Tendermint and Pocket Core in future releases.