v10.9.0 - Batched Inference Performance and Stability Fixes
Release Date: February 8, 2026
🚀 Major Performance Improvements
Batched Inference (PR #432) - @rodboev
4-16x GPU speedup for consolidation pipeline with intelligent batching:
GPU Performance:
- 16x faster at batch=32 (0.7ms/item vs 5.2ms/item sequential)
- 4x faster at batch=16 (1.4ms/item)
- Validated on RTX 5050 Blackwell GPU
CPU Performance:
- 2.3-2.5x speedup with batched ONNX inference
- Efficient parallel processing
Adaptive GPU Dispatch:
- Automatically falls back to sequential for small batches (<16 items)
- Avoids padding overhead when batch size doesn't justify GPU transfer costs
- Configurable via `MCP_QUALITY_MIN_GPU_BATCH` (default: 16)
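The dispatch heuristic above can be sketched as follows (a minimal illustration; the function names and the CPU/GPU scoring callables are hypothetical stand-ins, only the threshold behavior mirrors the release):

```python
import os

# Threshold below which batched GPU inference is not worth the
# padding and host-to-device transfer overhead (default: 16).
MIN_GPU_BATCH = int(os.environ.get("MCP_QUALITY_MIN_GPU_BATCH", "16"))

def score_batch(items, gpu_fn, cpu_fn):
    """Dispatch to batched GPU scoring only when the batch is large enough."""
    if len(items) >= MIN_GPU_BATCH:
        return gpu_fn(items)                      # one padded batch on the GPU
    return [cpu_fn(item) for item in items]       # sequential CPU fallback
```

Small batches take the sequential path, so the GPU is only engaged when amortizing the transfer cost actually pays off.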
Configuration

```bash
# Control batch size for quality scoring
export MCP_QUALITY_BATCH_SIZE=32  # default

# Minimum batch size to use GPU (below threshold, CPU sequential is used)
export MCP_QUALITY_MIN_GPU_BATCH=16  # default

# Instant rollback to previous behavior
export MCP_QUALITY_BATCH_SIZE=1
```

New Batched Methods
All additive - no breaking changes:
- `ONNXRankerModel.score_quality_batch()` - batched DeBERTa + MS-MARCO scoring
- `QualityEvaluator.evaluate_quality_batch()` - two-pass batch evaluation
- `SqliteVecMemoryStorage.store_batch()` - batched embedding generation with atomicity
- `SemanticCompressionEngine` - parallel cluster compression via `asyncio.gather`
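The parallel cluster compression uses the standard `asyncio.gather` fan-out pattern. A minimal sketch of that pattern (`compress_cluster` here is a placeholder, not the actual engine method):

```python
import asyncio

async def compress_cluster(cluster):
    # Placeholder for per-cluster compression work; the real engine
    # runs model inference here. A no-op await simulates async work.
    await asyncio.sleep(0)
    return {"size": len(cluster), "head": cluster[0] if cluster else None}

async def compress_all(clusters):
    # Launch all cluster compressions concurrently and await them together.
    return await asyncio.gather(*(compress_cluster(c) for c in clusters))

results = asyncio.run(compress_all([["a", "b"], ["c"], []]))
```

`gather` preserves input order in its results, so cluster outputs line up with the clusters that produced them.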
🐛 Critical Fixes
Token Truncation Fix (PR #432)
Root Cause: Character-based truncation at 512 characters discarded ~75% of DeBERTa's 512-token context window (~2,000 characters)
Fix: Enable proper tokenizer truncation at initialization
- All paths now use full 512-token window
- Better quality scores from utilizing complete model context
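The ~75% figure follows from a rough 4-characters-per-token average for English text (an assumption for illustration, not a measured value); a quick back-of-envelope check:

```python
# Back-of-envelope check of the truncation loss described above.
# DeBERTa's context window is 512 tokens; at roughly 4 chars/token
# that is about 2048 characters of usable input.
CHARS_PER_TOKEN = 4                      # rough English-text average (assumed)
context_chars = 512 * CHARS_PER_TOKEN    # ~2048 chars fit in the token window
kept = 512 / context_chars               # fraction kept by the old 512-char cut
discarded = 1 - kept
print(f"kept {kept:.0%}, discarded {discarded:.0%}")  # kept 25%, discarded 75%
```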
Embedding Orphan Prevention (PR #432)
Root Cause: Failed embedding INSERTs fell back to no-rowid insertion, breaking the `memories.id` <-> `memory_embeddings.rowid` JOIN
Fix: Wrap memory + embedding INSERTs in SAVEPOINT
- Both operations succeed or both roll back atomically
- Guaranteed searchability for all memories
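The SAVEPOINT pattern can be illustrated with the stdlib `sqlite3` module (table names follow the release notes; the schema is simplified and the embedding failure is simulated with a NOT NULL constraint):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE memories (id INTEGER PRIMARY KEY, content TEXT)")
# NOT NULL lets us simulate a failing embedding INSERT below.
conn.execute("CREATE TABLE memory_embeddings (vec BLOB NOT NULL)")

def store(conn, content, embedding):
    """Insert memory + embedding atomically: both commit or both roll back."""
    cur = conn.cursor()
    cur.execute("SAVEPOINT store_memory")
    try:
        cur.execute("INSERT INTO memories (content) VALUES (?)", (content,))
        mem_id = cur.lastrowid
        # Pin the embedding rowid to memories.id so the JOIN cannot orphan.
        cur.execute(
            "INSERT INTO memory_embeddings (rowid, vec) VALUES (?, ?)",
            (mem_id, embedding),
        )
        cur.execute("RELEASE store_memory")
        return mem_id
    except sqlite3.Error:
        cur.execute("ROLLBACK TO store_memory")
        cur.execute("RELEASE store_memory")
        return None
```

If the embedding INSERT raises, the memory row is rolled back with it, so every surviving `memories.id` has a matching `memory_embeddings.rowid`.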
ONNX Float32 GPU Compatibility (PR #437) - @rodboev
Root Cause: DeBERTa v3 stores some weights in float16, producing mixed-precision ONNX graph that ONNX Runtime rejects
Error: "Type parameter (T) of Optype (MatMul) bound to different types (tensor(float16) and tensor(float))"
Fix: Cast the model to float32 with `model.float()` before ONNX export
- Validated on RTX 5050 Blackwell (Ampere+)
- PyTorch 2.10.0+cu128, ONNX Runtime 1.24.1 (CUDAExecutionProvider)
Concurrent Write Stability (PR #435) - @rodboev
Root Cause: 3 retries with 0.1s delay (~0.7s backoff) insufficient for SQLite RESERVED write lock contention under WAL mode
Fix: Increased retry budget
- 5 retries (was 3)
- 0.2s initial delay (was 0.1s)
- ~6.2s total backoff: 0.2 + 0.4 + 0.8 + 1.6 + 3.2
- `test_two_clients_concurrent_write` now passes consistently
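The new retry budget is plain exponential doubling; the delay schedule above can be reproduced as follows (a sketch of the schedule only, not the actual retry loop):

```python
def backoff_schedule(retries=5, initial_delay=0.2):
    """Exponential backoff: the delay doubles on each successive retry."""
    return [initial_delay * 2 ** i for i in range(retries)]

delays = backoff_schedule()  # [0.2, 0.4, 0.8, 1.6, 3.2]
total = sum(delays)          # ~6.2 seconds of total backoff
```

Five doublings from 0.2s give the ~6.2s budget quoted above, versus ~0.7s under the old 3-retry/0.1s settings.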
📊 Test Coverage
- 442 lines of new test coverage
- 17 tests in `test_batch_inference.py`
- All 1093+ tests passing
🔄 Backward Compatibility
100% backward compatible - all changes are additive:
- Existing code continues to work unchanged
- New batched methods available for performance optimization
- Instant rollback: `export MCP_QUALITY_BATCH_SIZE=1`
🙏 Contributors
Huge thanks to @rodboev for the incredible performance work and GPU optimizations across all three PRs!
📚 Full Changelog
See CHANGELOG.md for complete details.
Installation:

```bash
pip install --upgrade mcp-memory-service
```

Docker:

```bash
docker pull ghcr.io/doobidoo/mcp-memory-service:v10.9.0
```