LongMemEval Benchmark: R@5 80.4%, R@10 90.4%, NDCG@10 82.2%, MRR 89.1% (zero LLM)
This release adds a complete LongMemEval benchmark implementation, providing verified retrieval quality metrics against a real-world 500-question conversational memory dataset — with zero LLM API calls required.
What is LongMemEval?
LongMemEval is a benchmark dataset of 500 single-session questions designed to evaluate memory retrieval systems on realistic conversational data. It tests a system's ability to recall relevant memories for user questions — the same task this service performs for Claude and other AI agents.
Benchmark Results
| Metric | Score |
|---|---|
| R@5 | 80.4% |
| R@10 | 90.4% |
| NDCG@10 | 82.2% |
| MRR | 89.1% |
Setup: Local ONNX embeddings (`sentence-transformers/all-MiniLM-L6-v2`), SQLite-Vec backend, zero LLM API calls.
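For context on how the table's scores are defined, recall@k and MRR under binary relevance can be sketched as below. This is an illustrative sketch only; the function names and signatures are assumptions for this example, not the service's actual API.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant memories that appear in the top-k results."""
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0

def mrr(retrieved_ids, relevant_ids):
    """Reciprocal rank of the first relevant result (0.0 if none is retrieved)."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```

The benchmark scores are the mean of these per-question values across the 500 questions.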
What's Included
- `scripts/benchmarks/longmemeval_dataset.py` — HuggingFace dataset loader for `xiaowu0162/longmemeval`, with streaming support to avoid memory overflow on large datasets
- `scripts/benchmarks/benchmark_longmemeval.py` — CLI orchestrator with ablation mode (multiple configuration variants), configurable k values, and JSON/markdown output
- `ndcg_at_k` metric — normalized discounted cumulative gain added to `src/mcp_memory_service/evaluation/locomo_evaluator.py` for ranking-quality assessment alongside the recall metrics
- `docs/BENCHMARKS.md` — new benchmark results documentation with methodology and result context
- `README.md` — updated with the LongMemEval results table
- Tests: `tests/benchmarks/test_longmemeval_dataset.py`, `test_benchmark_longmemeval.py`, `test_ndcg.py`
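For readers unfamiliar with the ranking metric, NDCG@k under binary relevance can be sketched as follows. This is a minimal illustrative version, not the exact code in `locomo_evaluator.py`:

```python
import math

def ndcg_at_k(retrieved_ids, relevant_ids, k):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    relevant = set(relevant_ids)
    # DCG: each relevant hit contributes 1 / log2(rank + 2), so earlier hits count more.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(retrieved_ids[:k])
        if doc_id in relevant
    )
    # Ideal DCG: all relevant items packed into the top ranks.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(r + 2) for r in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

Unlike recall, NDCG rewards placing relevant memories higher in the ranking, which is why it is reported alongside R@5 and R@10.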
Running the Benchmark
# Basic run (downloads dataset automatically)
python scripts/benchmarks/benchmark_longmemeval.py
# With ablation (multiple k values)
python scripts/benchmarks/benchmark_longmemeval.py --ablation --output-dir results/
# Markdown output
python scripts/benchmarks/benchmark_longmemeval.py --markdown
Full Changelog
See CHANGELOG.md for the complete v10.34.0 entry.
1,520 tests passing.