LongMemEval Benchmark: R@5 80.4%, R@10 90.4%, NDCG@10 82.2%, MRR 89.1% (zero LLM)
This release adds a complete LongMemEval benchmark implementation, providing verified retrieval quality metrics against a real-world 500-question conversational memory dataset — with zero LLM API calls required.
What is LongMemEval?
LongMemEval is a benchmark dataset of 500 single-session questions designed to evaluate memory retrieval systems on realistic conversational data. It tests a system's ability to recall relevant memories for user questions — the same task this service performs for Claude and other AI agents.
Benchmark Results
| Metric | Score |
|---|---|
| R@5 | 80.4% |
| R@10 | 90.4% |
| NDCG@10 | 82.2% |
| MRR | 89.1% |
Setup: Local ONNX embeddings (`sentence-transformers/all-MiniLM-L6-v2`), SQLite-Vec backend, zero LLM API calls.
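For context on how the table's scores are defined, recall@k and MRR under binary relevance can be sketched as below. This is an illustrative sketch only; the function names and signatures are assumptions for this example, not the service's actual API.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant memories that appear in the top-k results."""
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0

def mrr(retrieved_ids, relevant_ids):
    """Reciprocal rank of the first relevant result (0.0 if none is retrieved)."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```

The benchmark scores are the mean of these per-question values across the 500 questions.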
What's Included
- `scripts/benchmarks/longmemeval_dataset.py` — HuggingFace dataset loader for `xiaowu0162/longmemeval`, with streaming support to avoid memory overflow on large datasets
- `scripts/benchmarks/benchmark_longmemeval.py` — CLI orchestrator with ablation mode (multiple configuration variants), configurable k values, and JSON/markdown output
- `ndcg_at_k` metric — normalized discounted cumulative gain added to `src/mcp_memory_service/evaluation/locomo_evaluator.py` for ranking-quality assessment alongside the recall metrics
- `docs/BENCHMARKS.md` — new benchmark results documentation with methodology and result context
- `README.md` — updated with the LongMemEval results table
- Tests: `tests/benchmarks/test_longmemeval_dataset.py`, `test_benchmark_longmemeval.py`, `test_ndcg.py`
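For readers unfamiliar with the ranking metric, NDCG@k under binary relevance can be sketched as follows. This is a minimal illustrative version, not the exact code in `locomo_evaluator.py`:

```python
import math

def ndcg_at_k(retrieved_ids, relevant_ids, k):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    relevant = set(relevant_ids)
    # DCG: each relevant hit contributes 1 / log2(rank + 2), so earlier hits count more.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(retrieved_ids[:k])
        if doc_id in relevant
    )
    # Ideal DCG: all relevant items packed into the top ranks.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(r + 2) for r in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

Unlike recall, NDCG rewards placing relevant memories higher in the ranking, which is why it is reported alongside R@5 and R@10.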
Running the Benchmark
# Basic run (downloads dataset automatically)
python scripts/benchmarks/benchmark_longmemeval.py
# With ablation (multiple k values)
python scripts/benchmarks/benchmark_longmemeval.py --ablation --output-dir results/
# Markdown output
python scripts/benchmarks/benchmark_longmemeval.py --markdown
Full Changelog
See CHANGELOG.md for the complete v10.34.0 entry.
1,520 tests passing.