github doobidoo/mcp-memory-service v10.9.0

v10.9.0 - Batched Inference Performance and Stability Fixes

Release Date: February 8, 2026

🚀 Major Performance Improvements

Batched Inference (PR #432) - @rodboev

4-16x GPU speedup for the consolidation pipeline via intelligent batching:

  • GPU Performance:

    • 16x faster at batch=32 (0.7ms/item vs 5.2ms/item sequential)
    • 4x faster at batch=16 (1.4ms/item)
    • Validated on RTX 5050 Blackwell GPU
  • CPU Performance:

    • 2.3-2.5x speedup with batched ONNX inference
    • Efficient parallel processing
  • Adaptive GPU Dispatch:

    • Automatically falls back to sequential for small batches (<16 items)
    • Avoids padding overhead when batch size doesn't justify GPU transfer costs
    • Configurable via MCP_QUALITY_MIN_GPU_BATCH (default: 16)
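The dispatch rule can be sketched in a few lines. This is illustrative only: the function name and return values are hypothetical; the `MCP_QUALITY_MIN_GPU_BATCH` variable and its default of 16 come from the release.

```python
import os

def choose_inference_path(n_items: int) -> str:
    """Pick batched GPU inference or sequential CPU scoring by batch size.

    Hypothetical sketch; only MCP_QUALITY_MIN_GPU_BATCH and its
    default value are taken from the release notes.
    """
    min_gpu_batch = int(os.environ.get("MCP_QUALITY_MIN_GPU_BATCH", "16"))
    # Below the threshold, padding and host-to-device transfer would cost
    # more than GPU parallelism saves, so fall back to sequential CPU.
    return "gpu-batched" if n_items >= min_gpu_batch else "cpu-sequential"
```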

Configuration

# Control batch size for quality scoring
export MCP_QUALITY_BATCH_SIZE=32  # default

# Minimum batch size to use GPU (below threshold uses CPU sequential)
export MCP_QUALITY_MIN_GPU_BATCH=16  # default

# Instant rollback to previous behavior
export MCP_QUALITY_BATCH_SIZE=1

New Batched Methods

All additive - no breaking changes:

  • ONNXRankerModel.score_quality_batch() - batched DeBERTa + MS-MARCO scoring
  • QualityEvaluator.evaluate_quality_batch() - two-pass batch evaluation
  • SqliteVecMemoryStorage.store_batch() - batched embedding generation with atomicity
  • SemanticCompressionEngine - parallel cluster compression via asyncio.gather
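The asyncio.gather pattern behind the parallel cluster compression looks roughly like this; `compress_cluster` and `compress_all` are hypothetical stand-ins for the engine's per-cluster work, not the project's actual methods.

```python
import asyncio

async def compress_cluster(cluster: list[str]) -> str:
    # Hypothetical stand-in for per-cluster semantic compression;
    # a real implementation would run model inference here.
    await asyncio.sleep(0)
    return " ".join(cluster)

async def compress_all(clusters: list[list[str]]) -> list[str]:
    # asyncio.gather schedules every cluster's compression concurrently
    # and returns results in the same order as the inputs.
    return await asyncio.gather(*(compress_cluster(c) for c in clusters))

results = asyncio.run(compress_all([["alpha", "beta"], ["gamma"]]))
```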

🐛 Critical Fixes

Token Truncation Fix (PR #432)

Root Cause: Character-based truncation at 512 characters discarded ~75% of DeBERTa's 512-token context window (≈2,000 characters)

Fix: Enable proper tokenizer truncation at initialization

  • All paths now use full 512-token window
  • Better quality scores from utilizing complete model context
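The ~75% figure follows from the rough ratio quoted above (512 tokens ≈ 2,000 characters, i.e. about 4 characters per token); the constants below restate that back-of-envelope arithmetic.

```python
# Back-of-envelope check on the ~75% loss, assuming the release's rough
# ratio of 512 tokens ~ 2000 characters (about 4 characters per token).
CHARS_PER_TOKEN = 4
MODEL_WINDOW_TOKENS = 512
OLD_CHAR_LIMIT = 512  # old character-based cutoff

tokens_kept = OLD_CHAR_LIMIT // CHARS_PER_TOKEN            # only ~128 tokens
fraction_discarded = 1 - tokens_kept / MODEL_WINDOW_TOKENS  # ~0.75 of the window
```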

Embedding Orphan Prevention (PR #432)

Root Cause: Failed embedding INSERTs fell back to no-rowid insertion, breaking memories.id <-> memory_embeddings.rowid JOIN

Fix: Wrap memory + embedding INSERTs in SAVEPOINT

  • Both operations succeed or both roll back atomically
  • Guaranteed searchability for all memories
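The SAVEPOINT pattern can be sketched with Python's sqlite3 module. The table names mirror the release notes, but the schema, savepoint name, and function are simplified illustrations, not the project's actual storage layer.

```python
import sqlite3

# Illustrative sketch of the SAVEPOINT fix with a simplified schema.
conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transactions
conn.executescript("""
    CREATE TABLE memories (id INTEGER PRIMARY KEY, content TEXT);
    CREATE TABLE memory_embeddings (rowid INTEGER PRIMARY KEY, vec BLOB);
""")

def store_memory(content: str, vec: bytes) -> None:
    """Insert a memory row and its embedding atomically."""
    conn.execute("SAVEPOINT store_mem")
    try:
        cur = conn.execute("INSERT INTO memories (content) VALUES (?)", (content,))
        # The embedding's rowid must equal memories.id, or the
        # memories.id <-> memory_embeddings.rowid JOIN breaks.
        conn.execute(
            "INSERT INTO memory_embeddings (rowid, vec) VALUES (?, ?)",
            (cur.lastrowid, vec),
        )
        conn.execute("RELEASE store_mem")
    except sqlite3.Error:
        # Both INSERTs roll back together: no orphaned, unsearchable memory.
        conn.execute("ROLLBACK TO store_mem")
        conn.execute("RELEASE store_mem")
        raise
```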

ONNX Float32 GPU Compatibility (PR #437) - @rodboev

Root Cause: DeBERTa v3 stores some weights in float16, producing mixed-precision ONNX graph that ONNX Runtime rejects

Error: "Type parameter (T) of Optype (MatMul) bound to different types (tensor(float16) and tensor(float))"

Fix: Cast model to float32 before ONNX export with model.float()

  • Validated on RTX 5050 Blackwell (Ampere+)
  • PyTorch 2.10.0+cu128, ONNX Runtime 1.24.1 (CUDAExecutionProvider)
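The shape of the fix, abstracted away from PyTorch: the real change is a single `model.float()` call on the module before `torch.onnx.export`. The dictionary of dtype strings below is a pure-Python stand-in for a module's parameters, and the parameter names are invented for illustration.

```python
# Pure-Python stand-in for the fix: the real code calls model.float()
# before torch.onnx.export. Parameter names here are invented.
mixed_weights = {
    "encoder.layer.0.attention": "float16",  # DeBERTa v3 ships some fp16 weights
    "classifier.dense": "float32",
}

def cast_all_to_float32(params: dict[str, str]) -> dict[str, str]:
    # ONNX Runtime rejects a graph whose MatMul inputs mix float16 and
    # float32, so every parameter is normalized to float32 before export.
    return {name: "float32" for name in params}

exported = cast_all_to_float32(mixed_weights)
```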

Concurrent Write Stability (PR #435) - @rodboev

Root Cause: 3 retries with a 0.1s initial delay (~0.7s total backoff) were insufficient for SQLite RESERVED write-lock contention under WAL mode

Fix: Increased retry budget

  • 5 retries (was 3)
  • 0.2s initial delay (was 0.1s)
  • ~6.2s total backoff: 0.2 + 0.4 + 0.8 + 1.6 + 3.2
  • test_two_clients_concurrent_write now passes consistently
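The retry schedule above is plain exponential doubling from the initial delay; this sketch reproduces only the arithmetic (a real caller would sleep for each delay between attempts).

```python
# Reproduces the retry schedule from the fix: 5 retries, 0.2s initial
# delay, doubling each attempt. Only the arithmetic is shown; a real
# retry loop would time.sleep(delay) between attempts.
RETRIES = 5
INITIAL_DELAY_S = 0.2

delays = [INITIAL_DELAY_S * 2**attempt for attempt in range(RETRIES)]
total_backoff = sum(delays)  # ~6.2 seconds worst case
```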

📊 Test Coverage

  • 442 lines of new test coverage
  • 17 tests in test_batch_inference.py
  • All 1093+ tests passing

🔄 Backward Compatibility

100% backward compatible - all changes are additive:

  • Existing code continues to work unchanged
  • New batched methods available for performance optimization
  • Instant rollback: export MCP_QUALITY_BATCH_SIZE=1

🙏 Contributors

Huge thanks to @rodboev for the incredible performance work and GPU optimizations across all three PRs!

📚 Full Changelog

See CHANGELOG.md for complete details.


Installation:

pip install --upgrade mcp-memory-service

Docker:

docker pull ghcr.io/doobidoo/mcp-memory-service:v10.9.0
