v10.9.0 - Batched Inference Performance and Stability Fixes
Release Date: February 8, 2026
🚀 Major Performance Improvements
Batched Inference (PR #432) - @rodboev
4-16x GPU speedup for consolidation pipeline with intelligent batching:
GPU Performance:
- 16x faster at batch=32 (0.7ms/item vs 5.2ms/item sequential)
- 4x faster at batch=16 (1.4ms/item)
- Validated on RTX 5050 Blackwell GPU
CPU Performance:
- 2.3-2.5x speedup with batched ONNX inference
- Efficient parallel processing
Adaptive GPU Dispatch:
- Automatically falls back to sequential for small batches (<16 items)
- Avoids padding overhead when batch size doesn't justify GPU transfer costs
- Configurable via `MCP_QUALITY_MIN_GPU_BATCH` (default: 16)
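The dispatch heuristic above can be sketched as follows (a minimal illustration; the function names and the CPU/GPU scoring callables are hypothetical stand-ins, only the threshold behavior mirrors the release):

```python
import os

# Threshold below which batched GPU inference is not worth the
# padding and host-to-device transfer overhead (default: 16).
MIN_GPU_BATCH = int(os.environ.get("MCP_QUALITY_MIN_GPU_BATCH", "16"))

def score_batch(items, gpu_fn, cpu_fn):
    """Dispatch to batched GPU scoring only when the batch is large enough."""
    if len(items) >= MIN_GPU_BATCH:
        return gpu_fn(items)                      # one padded batch on the GPU
    return [cpu_fn(item) for item in items]       # sequential CPU fallback
```

Small batches take the sequential path, so the GPU is only engaged when amortizing the transfer cost actually pays off.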
Configuration

```bash
# Control batch size for quality scoring
export MCP_QUALITY_BATCH_SIZE=32  # default

# Minimum batch size to use GPU (below threshold, CPU sequential is used)
export MCP_QUALITY_MIN_GPU_BATCH=16  # default

# Instant rollback to previous behavior
export MCP_QUALITY_BATCH_SIZE=1
```

New Batched Methods
All additive - no breaking changes:
- `ONNXRankerModel.score_quality_batch()` - batched DeBERTa + MS-MARCO scoring
- `QualityEvaluator.evaluate_quality_batch()` - two-pass batch evaluation
- `SqliteVecMemoryStorage.store_batch()` - batched embedding generation with atomicity
- `SemanticCompressionEngine` - parallel cluster compression via `asyncio.gather`
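The parallel cluster compression uses the standard `asyncio.gather` fan-out pattern. A minimal sketch of that pattern (`compress_cluster` here is a placeholder, not the actual engine method):

```python
import asyncio

async def compress_cluster(cluster):
    # Placeholder for per-cluster compression work; the real engine
    # runs model inference here. A no-op await simulates async work.
    await asyncio.sleep(0)
    return {"size": len(cluster), "head": cluster[0] if cluster else None}

async def compress_all(clusters):
    # Launch all cluster compressions concurrently and await them together.
    return await asyncio.gather(*(compress_cluster(c) for c in clusters))

results = asyncio.run(compress_all([["a", "b"], ["c"], []]))
```

`gather` preserves input order in its results, so cluster outputs line up with the clusters that produced them.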
🐛 Critical Fixes
Token Truncation Fix (PR #432)
Root Cause: Character-based truncation at 512 characters discarded ~75% of DeBERTa's 512-token context window (~2,000 characters)
Fix: Enable proper tokenizer truncation at initialization
- All paths now use full 512-token window
- Better quality scores from utilizing complete model context
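The ~75% figure follows from a rough 4-characters-per-token average for English text (an assumption for illustration, not a measured value); a quick back-of-envelope check:

```python
# Back-of-envelope check of the truncation loss described above.
# DeBERTa's context window is 512 tokens; at roughly 4 chars/token
# that is about 2048 characters of usable input.
CHARS_PER_TOKEN = 4                      # rough English-text average (assumed)
context_chars = 512 * CHARS_PER_TOKEN    # ~2048 chars fit in the token window
kept = 512 / context_chars               # fraction kept by the old 512-char cut
discarded = 1 - kept
print(f"kept {kept:.0%}, discarded {discarded:.0%}")  # kept 25%, discarded 75%
```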
Embedding Orphan Prevention (PR #432)
Root Cause: Failed embedding INSERTs fell back to no-rowid insertion, breaking the `memories.id` <-> `memory_embeddings.rowid` JOIN
Fix: Wrap memory + embedding INSERTs in SAVEPOINT
- Both operations succeed or both roll back atomically
- Guaranteed searchability for all memories
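The SAVEPOINT pattern can be illustrated with the stdlib `sqlite3` module (table names follow the release notes; the schema is simplified and the embedding failure is simulated with a NOT NULL constraint):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE memories (id INTEGER PRIMARY KEY, content TEXT)")
# NOT NULL lets us simulate a failing embedding INSERT below.
conn.execute("CREATE TABLE memory_embeddings (vec BLOB NOT NULL)")

def store(conn, content, embedding):
    """Insert memory + embedding atomically: both commit or both roll back."""
    cur = conn.cursor()
    cur.execute("SAVEPOINT store_memory")
    try:
        cur.execute("INSERT INTO memories (content) VALUES (?)", (content,))
        mem_id = cur.lastrowid
        # Pin the embedding rowid to memories.id so the JOIN cannot orphan.
        cur.execute(
            "INSERT INTO memory_embeddings (rowid, vec) VALUES (?, ?)",
            (mem_id, embedding),
        )
        cur.execute("RELEASE store_memory")
        return mem_id
    except sqlite3.Error:
        cur.execute("ROLLBACK TO store_memory")
        cur.execute("RELEASE store_memory")
        return None
```

If the embedding INSERT raises, the memory row is rolled back with it, so every surviving `memories.id` has a matching `memory_embeddings.rowid`.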
ONNX Float32 GPU Compatibility (PR #437) - @rodboev
Root Cause: DeBERTa v3 stores some weights in float16, producing mixed-precision ONNX graph that ONNX Runtime rejects
Error: "Type parameter (T) of Optype (MatMul) bound to different types (tensor(float16) and tensor(float))"
Fix: Cast the model to float32 with `model.float()` before ONNX export
- Validated on RTX 5050 Blackwell (Ampere+)
- PyTorch 2.10.0+cu128, ONNX Runtime 1.24.1 (CUDAExecutionProvider)
Concurrent Write Stability (PR #435) - @rodboev
Root Cause: 3 retries with 0.1s delay (~0.7s backoff) insufficient for SQLite RESERVED write lock contention under WAL mode
Fix: Increased retry budget
- 5 retries (was 3)
- 0.2s initial delay (was 0.1s)
- ~6.2s total backoff: 0.2 + 0.4 + 0.8 + 1.6 + 3.2
- `test_two_clients_concurrent_write` now passes consistently
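The new retry budget is plain exponential doubling; the delay schedule above can be reproduced as follows (a sketch of the schedule only, not the actual retry loop):

```python
def backoff_schedule(retries=5, initial_delay=0.2):
    """Exponential backoff: the delay doubles on each successive retry."""
    return [initial_delay * 2 ** i for i in range(retries)]

delays = backoff_schedule()  # [0.2, 0.4, 0.8, 1.6, 3.2]
total = sum(delays)          # ~6.2 seconds of total backoff
```

Five doublings from 0.2s give the ~6.2s budget quoted above, versus ~0.7s under the old 3-retry/0.1s settings.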
📊 Test Coverage
- 442 lines of new test coverage
- 17 tests in `test_batch_inference.py`
- All 1093+ tests passing
🔄 Backward Compatibility
100% backward compatible - all changes are additive:
- Existing code continues to work unchanged
- New batched methods available for performance optimization
- Instant rollback: `export MCP_QUALITY_BATCH_SIZE=1`
🙏 Contributors
Huge thanks to @rodboev for the incredible performance work and GPU optimizations across all three PRs!
📚 Full Changelog
See CHANGELOG.md for complete details.
Installation:

```bash
pip install --upgrade mcp-memory-service
```

Docker:

```bash
docker pull ghcr.io/doobidoo/mcp-memory-service:v10.9.0
```