github doobidoo/mcp-memory-service v10.34.0
v10.34.0 — LongMemEval Benchmark: R@5 80.4%, R@10 90.4% (zero LLM)


LongMemEval Benchmark: R@5 80.4%, R@10 90.4%, NDCG@10 82.2%, MRR 89.1% (zero LLM)

This release adds a complete LongMemEval benchmark implementation, providing verified retrieval quality metrics against a real-world 500-question conversational memory dataset — with zero LLM API calls required.

What is LongMemEval?

LongMemEval is a benchmark dataset of 500 single-session questions designed to evaluate memory retrieval systems on realistic conversational data. It tests a system's ability to recall relevant memories for user questions — the same task this service performs for Claude and other AI agents.
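The task being scored can be pictured as nearest-neighbor search over embedded memories. The sketch below is a toy illustration with hand-made vectors and pure-Python cosine similarity; the names (`retrieve_top_k`, the sample memories) are hypothetical, and the real service uses ONNX embeddings with a SQLite-Vec backend rather than this brute-force loop:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve_top_k(query_vec, memory_vecs, k):
    """Rank stored memory embeddings by similarity to the query embedding."""
    ranked = sorted(memory_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [mem_id for mem_id, _ in ranked[:k]]

# Toy 3-dimensional "embeddings" of stored memories.
memories = {
    "m1": [0.9, 0.1, 0.0],   # e.g. "user's cat is named Miso"
    "m2": [0.1, 0.8, 0.2],   # e.g. "user prefers dark mode"
    "m3": [0.0, 0.2, 0.9],   # e.g. "user lives in Berlin"
}
query = [0.85, 0.15, 0.05]   # e.g. "what is my cat's name?"
print(retrieve_top_k(query, memories, k=2))  # → ['m1', 'm2']
```

A benchmark question counts as answered at rank k when one of its gold memories appears in this top-k list.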

Benchmark Results

Metric    Score
R@5       80.4%
R@10      90.4%
NDCG@10   82.2%
MRR       89.1%

Setup: Local ONNX embeddings (sentence-transformers/all-MiniLM-L6-v2), SQLite-Vec backend, zero LLM API calls.
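For readers unfamiliar with the metrics above, they reduce to simple functions over a ranked result list with binary relevance labels. This is an illustrative sketch, not the repository's evaluator code, and the function names are assumptions:

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """R@k: fraction of relevant memories that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant result (0.0 if none is retrieved)."""
    for rank, mem_id in enumerate(ranked_ids, start=1):
        if mem_id in relevant_ids:
            return 1.0 / rank
    return 0.0

ranked = ["m7", "m2", "m9", "m4", "m1"]  # retrieval order for one question
relevant = {"m2", "m4"}                  # gold memories for that question
print(recall_at_k(ranked, relevant, k=5))  # → 1.0
print(mrr(ranked, relevant))               # → 0.5
```

The reported scores are these per-question values averaged over all 500 questions.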

What's Included

  • scripts/benchmarks/longmemeval_dataset.py — HuggingFace dataset loader for xiaowu0162/longmemeval, with streaming support so large datasets need not be loaded into memory at once
  • scripts/benchmarks/benchmark_longmemeval.py — CLI orchestrator with ablation mode support (multiple configuration variants), configurable k values, and JSON/markdown output
  • ndcg_at_k metric — Added normalized discounted cumulative gain to src/mcp_memory_service/evaluation/locomo_evaluator.py for ranking-quality assessment alongside recall metrics
  • docs/BENCHMARKS.md — New benchmark results documentation with methodology and result context
  • README.md — Updated with LongMemEval results table
  • Tests: tests/benchmarks/test_longmemeval_dataset.py, test_benchmark_longmemeval.py, test_ndcg.py
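The new ndcg_at_k metric rewards rankings that place relevant memories higher, not just anywhere in the top k. Below is a minimal sketch with binary relevance; the actual implementation lives in src/mcp_memory_service/evaluation/locomo_evaluator.py and may differ in detail:

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Normalized DCG: the ranking's discounted gain divided by the
    gain of an ideal ranking that puts all relevant items first."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, mem_id in enumerate(ranked_ids[:k], start=1)
              if mem_id in relevant_ids)
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

ranked = ["m7", "m2", "m9", "m4", "m1"]  # relevant items at ranks 2 and 4
relevant = {"m2", "m4"}
print(round(ndcg_at_k(ranked, relevant, k=5), 3))  # → 0.651
```

Note that R@5 is 1.0 for this example while NDCG@5 is only 0.651: both relevant memories were retrieved, but not at the top, which is exactly the distinction NDCG adds alongside recall.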

Running the Benchmark

# Basic run (downloads dataset automatically)
python scripts/benchmarks/benchmark_longmemeval.py

# With ablation (multiple k values)
python scripts/benchmarks/benchmark_longmemeval.py --ablation --output-dir results/

# Markdown output
python scripts/benchmarks/benchmark_longmemeval.py --markdown

Full Changelog

See CHANGELOG.md for the complete v10.34.0 entry.

1,520 tests passing.
