Research | Jun 30, 2026 | 17 min read

AI Memory Benchmarks in 2026: What LoCoMo and LongMemEval Actually Measure

A researched breakdown of the LoCoMo, LongMemEval, and BEAM benchmarks driving the AI memory space in 2026, why Mem0, ByteRover, and other vendors report conflicting scores on the same tests, and what these benchmarks do and do not actually measure.

Databaset Team

Databaset engineering

AI memory benchmarks in 2026: what LoCoMo and LongMemEval actually measure, and why the numbers don't agree

If you've spent any time researching AI memory systems lately, you've run into the same wall everyone else has. Mem0 claims 92.5% on LoCoMo. ByteRover claims 92.2% and says it beats Mem0 head to head on the same test. An independent audit found methodological issues in how LoCoMo gets scored in the first place. Meanwhile a separate independent evaluation put Mem0 at 49% on LongMemEval, nowhere close to the 94.4% Mem0 itself reports.

None of these numbers is necessarily wrong. They measure different things, on different test subsets, scored by different judges. That gap between vendor-reported and independently reported scores is the most important thing to understand before trusting any benchmark claim in this space, including the ones in this article.

What a memory benchmark actually tests

Before getting into specific scores, it's worth being precise about what these benchmarks measure, because it's easy to confuse memory benchmarks with long-context benchmarks, and they test fundamentally different things.

A long-context benchmark like needle-in-a-haystack (NIAH), RULER, or InfiniteBench gives a model one large input, sometimes a million tokens long, and asks it to find or summarize something inside that single input. There's no write step, no separate session, no accumulation over time. It's a test of attention over a fixed block of text.

A memory benchmark is structurally different. It gives a system a sequence of inputs spread across multiple sessions, requires the system to write something to a persistent store after each one, and then tests retrieval on a later turn where the system has to pull the right fact out of everything it's written so far. The state of the system at turn fifty depends on what it captured and how it organized turns one through forty-nine. That write-then-recall loop across sessions is the actual thing being tested, and it's a meaningfully harder and more realistic simulation of how a chatbot or agent behaves with a real user over weeks or months.
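The write-then-recall loop can be sketched in a few lines. This is a toy harness, not the evaluation code of LoCoMo or any other benchmark: `KeywordMemory`, `run_memory_benchmark`, the sample sessions, and the substring-match scoring are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class KeywordMemory:
    """Toy memory system (hypothetical): stores every turn, recalls by word overlap."""
    store: list = field(default_factory=list)

    def write(self, turn: str) -> None:
        self.store.append(turn)

    def recall(self, question: str) -> str:
        q_words = set(question.lower().split())
        # Return the stored turn that shares the most words with the question.
        return max(
            self.store,
            key=lambda t: len(q_words & set(t.lower().split())),
            default="",
        )


def run_memory_benchmark(memory, sessions, probes):
    """Write every turn of every session, then score recall on later probes.

    sessions: list of sessions, each a list of turn strings.
    probes: list of (question, expected_substring) pairs asked after all writes.
    """
    for session in sessions:
        for turn in session:
            memory.write(turn)  # the write step a long-context test does not have
    hits = sum(expected.lower() in memory.recall(q).lower() for q, expected in probes)
    return hits / len(probes)


# Made-up conversation data, spread across two sessions.
sessions = [
    ["my dog is named Biscuit", "I work as a nurse"],
    ["I moved to Lisbon last month"],
]
probes = [
    ("what is my dog named", "Biscuit"),
    ("which city did I move to", "Lisbon"),
]
accuracy = run_memory_benchmark(KeywordMemory(), sessions, probes)
print(f"recall accuracy: {accuracy:.2f}")
```

The point of the sketch is structural: the score depends on what `write` kept and how `recall` searches it, not on how much text the underlying model can attend to at once.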

The three benchmarks that actually matter right now

By 2026 the field has converged on three benchmarks that specifically test this multi-session memory loop, rather than long-context attention.

LoCoMo, built by researchers at UNC Chapel Hill, USC, and Snap Research and published at ACL 2024, has become the most widely cited evaluation in this space. It runs long multi-session conversations and tests question categories including single-hop recall, multi-hop reasoning across multiple facts, open-domain questions, and temporal reasoning about when something became true or stopped being true.

LongMemEval, with around 500 questions across multiple categories including knowledge updates and multi-session recall, tests a similar capability with a different question structure and is increasingly cited alongside LoCoMo rather than instead of it.

BEAM evaluates at much larger scale, with conversation histories of 1 million and 10 million tokens, which stress-tests whether a memory system's retrieval quality degrades as the amount of accumulated history grows far beyond what fits in any context window.

Why the published scores don't agree with each other

Here's where it gets genuinely confusing if you're trying to make a decision based on benchmark marketing pages. Mem0's own published numbers, citing the original ECAI 2025 paper, put the base Mem0 system at 66.9% accuracy on LoCoMo with 0.71-second median latency, and a graph-enhanced variant called Mem0g at 68.4% accuracy with slightly higher latency. A full-context baseline (stuffing the entire conversation history into the prompt with no memory system at all) scored 72.9%, higher than either Mem0 variant, but at roughly fourteen times the token cost and nearly fourteen times the latency.

Separately, Mem0's own 2026 state-of-the-field report cites a newer number, 92.5% on LoCoMo, using an updated token-efficient algorithm released in April 2026 with improvements specifically concentrated in temporal queries and multi-hop reasoning, the two hardest categories.

Then a competing vendor, ByteRover, published a head-to-head claiming 92.2% overall on the same benchmark, run using what they describe as their toughest competitor's own evaluation rules, beating Mem0 on several subcategories including multi-hop reasoning by a notable margin.

And separately from either vendor's numbers, an independent evaluation cited in academic preprints put Mem0 at 49% on LongMemEval, a dramatically different result from the 94.4% Mem0's own materials report for the same benchmark family.

None of these numbers are fabricated. The gap comes from real, defensible methodological differences: which subset of questions gets used, which model judges the correctness of an answer, what counts as a successful retrieval, and whether the evaluation uses the full benchmark or a sampled portion of it. But the practical lesson is simple: a benchmark score on a vendor's own marketing page tells you almost nothing in isolation. It only becomes useful when you know the evaluation methodology behind it, and ideally when there's an independent, third-party run of the same test.
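To make the effect of those choices concrete, here is a minimal sketch in which one fixed set of answers yields three different headline scores. The answer rows, the category names, and both judge functions are hypothetical; real evaluations typically use an LLM judge rather than string matching.

```python
# Hypothetical graded answers: (category, predicted, gold). Not real benchmark data.
answers = [
    ("single_hop", "Biscuit", "Biscuit"),
    ("single_hop", "a nurse", "nurse"),
    ("multi_hop", "Lisbon, Portugal", "Lisbon"),
    ("temporal", "March 2024", "last March"),
    ("temporal", "I don't know", "two years ago"),
]


def strict_judge(pred: str, gold: str) -> bool:
    """Counts an answer only if it matches the gold string exactly."""
    return pred.strip().lower() == gold.strip().lower()


def lenient_judge(pred: str, gold: str) -> bool:
    """Counts an answer if the gold string appears anywhere inside it."""
    return gold.strip().lower() in pred.strip().lower()


def score(rows, judge, categories=None):
    """Accuracy over the rows whose category is in `categories` (all rows if None)."""
    rows = [r for r in rows if categories is None or r[0] in categories]
    return sum(judge(pred, gold) for _, pred, gold in rows) / len(rows)


strict_all = score(answers, strict_judge)
lenient_all = score(answers, lenient_judge)
lenient_single_hop = score(answers, lenient_judge, categories={"single_hop"})

print(f"strict judge, all questions:  {strict_all:.0%}")
print(f"lenient judge, all questions: {lenient_all:.0%}")
print(f"lenient judge, single-hop:    {lenient_single_hop:.0%}")
```

Nothing about the system changed between the three lines of output; only the judge and the question subset did.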

The audit problem nobody talks about

One detail that rarely makes it into comparison articles is that LoCoMo itself has been formally audited and found to have methodological issues affecting how its scores should be interpreted. A public audit, referenced by at least one competing memory vendor, raised concerns specific enough that the vendor cited it directly rather than just gesturing at "benchmark limitations" in the abstract.

This matters because LoCoMo is the single most cited benchmark in nearly every comparison article in this space, including this one. When the benchmark itself has documented scoring issues, every vendor's reported score against it inherits some of that uncertainty, regardless of how the vendor itself performed.

What benchmarks don't measure at all

Even setting aside the scoring disputes, there's a more fundamental gap that several researchers in the space have pointed out directly. LoCoMo and LongMemEval both test conversational recall: can the system retrieve the right fact from a long chat history? Neither benchmark tests whether accumulated memory actually improves an agent's performance on a real task. Does remembering a user's past preferences improve the accuracy of a procurement recommendation? Does institutional memory across many sessions improve the quality of a code review? Does retrieval quality hold up once the memory store has grown past a hundred thousand stored facts, or does it quietly degrade?

As of mid-2026, no widely adopted benchmark tests these task-outcome questions directly. The field has standardized on testing recall accuracy because it's measurable and comparable, not necessarily because it's the most important property of a memory system for every use case.

What this means if you're choosing a memory system

A few practical takeaways follow from all of this.

Treat any single benchmark number, from any vendor, as a starting point rather than a conclusion. Look specifically for whether the number comes from the vendor's own report or an independent third party, and weight independent numbers more heavily when they're available, even though they're rarer.
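One way to apply this in practice is to record where each number came from next to the number itself. The sketch below uses the four scores quoted earlier in this article; the `Claim` type, the `summarize` helper, and the report field names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    system: str
    benchmark: str
    score: float
    source: str  # "vendor" or "independent"


# Scores as quoted in this article; verify against the original sources before relying on them.
claims = [
    Claim("Mem0", "LoCoMo", 92.5, "vendor"),
    Claim("ByteRover", "LoCoMo", 92.2, "vendor"),
    Claim("Mem0", "LongMemEval", 94.4, "vendor"),
    Claim("Mem0", "LongMemEval", 49.0, "independent"),
]


def summarize(claims):
    """Group claims by (system, benchmark) and prefer independent scores when present."""
    grouped = {}
    for c in claims:
        grouped.setdefault((c.system, c.benchmark), {})[c.source] = c.score

    report = {}
    for key, by_source in grouped.items():
        vendor = by_source.get("vendor")
        independent = by_source.get("independent")
        both = vendor is not None and independent is not None
        report[key] = {
            "preferred": independent if independent is not None else vendor,
            "independently_verified": independent is not None,
            "vendor_minus_independent": round(vendor - independent, 1) if both else None,
        }
    return report


report = summarize(claims)
for (system, benchmark), row in report.items():
    print(system, benchmark, row)
```

A table like this makes the vendor-only rows obvious at a glance, which is usually the first thing worth knowing about a score.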

Pay attention to what's being optimized alongside accuracy. A system that scores six points higher on LoCoMo but takes roughly fourteen times longer and costs roughly fourteen times more in tokens (the gap between the full-context baseline and the base Mem0 system in Mem0's own paper) is not obviously the better choice for a production application where latency and cost compound across every single request.
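The arithmetic behind that tradeoff is short enough to write down. This sketch uses the accuracy figures quoted earlier from Mem0's ECAI 2025 paper (66.9% for base Mem0, 72.9% for the full-context baseline) and treats "roughly fourteen times" as exactly 14 with Mem0 normalized to 1.0, which is a simplifying assumption. The "tokens per correct answer" metric is also an illustrative choice, not something the paper reports.

```python
# Accuracy in percent; relative_tokens is normalized so base Mem0 = 1.0 (assumption).
systems = {
    "mem0": {"accuracy": 66.9, "relative_tokens": 1.0},
    "full_context": {"accuracy": 72.9, "relative_tokens": 14.0},
}


def tokens_per_correct_answer(system: dict) -> float:
    """Relative tokens spent for each correctly answered question."""
    return system["relative_tokens"] / (system["accuracy"] / 100)


gap = systems["full_context"]["accuracy"] - systems["mem0"]["accuracy"]
ratio = (
    tokens_per_correct_answer(systems["full_context"])
    / tokens_per_correct_answer(systems["mem0"])
)

print(f"accuracy gap: {gap:.1f} points")
print(f"tokens per correct answer, full context vs Mem0: {ratio:.1f}x")
```

Under these assumptions the full-context baseline still spends close to thirteen times as many tokens per correct answer, even after crediting it for the higher accuracy.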

Consider what your actual workload looks like relative to what these benchmarks test. If your product needs fast, direct recall of specific facts about a specific user (store this, recall it later), the multi-hop and temporal reasoning categories that differentiate the leaderboard matter less than raw latency and cost per operation. If you're building something that genuinely needs to reason about how facts relate and change over time, the differences in those specific categories are more directly relevant to you.

And recognize that the field is moving fast enough that any specific number in this article, or any comparison article, has a real chance of being outdated within months. The gap between Mem0's 66.9% ECAI paper result and its own 92.5% result roughly a year later is itself evidence of how quickly the underlying techniques are improving.

Why Databaset doesn't lead with a benchmark score

We built Databaset around a narrower bet than most of the systems competing on these leaderboards. Most of the benchmark race is optimizing for multi-hop reasoning and temporal accuracy on conversational recall, genuinely useful properties for certain products, but not the property most applications actually need first.

The majority of products that need memory need something simpler and more directly measurable: store a fact about a user, recall it fast and cheaply the next time it's relevant, and keep that loop correct as facts get updated or contradicted over time. That's a real engineering problem, but it's not the same problem LoCoMo's multi-hop and temporal categories are built to stress test, and we'd rather be transparent about that scope than publish a benchmark number that implies we're optimizing for something we're not.
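As a rough illustration of that loop, here is a minimal in-process sketch. It is not Databaset's API or implementation: the `FactStore` class, its method names, and the last-write-wins update rule are all hypothetical, chosen only to show what "keep the loop correct as facts get updated or contradicted" means in the simplest case.

```python
from dataclasses import dataclass, field


@dataclass
class FactStore:
    """Hypothetical last-write-wins fact store, for illustration only."""
    current: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

    def store(self, user_id: str, attribute: str, value: str) -> None:
        key = (user_id, attribute)
        if key in self.current and self.current[key] != value:
            # Keep the superseded value so contradictions stay auditable.
            self.history.append((key, self.current[key]))
        self.current[key] = value

    def recall(self, user_id: str, attribute: str):
        # A single dictionary lookup: no multi-hop reasoning involved.
        return self.current.get((user_id, attribute))


facts = FactStore()
facts.store("user-1", "city", "Berlin")
facts.store("user-1", "city", "Lisbon")  # contradicts the earlier fact
print(facts.recall("user-1", "city"))
print(facts.history)
```

Last-write-wins is the simplest possible policy for contradictions; a real system would also need to decide which attribute a free-text statement updates, which is where most of the engineering difficulty sits.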

If your application genuinely needs graph-aware, multi-hop conversational reasoning at the level these benchmarks test, that's a legitimate and growing category, and it's worth evaluating the systems that lead specifically on those numbers, with the caveat above about checking whether the number is vendor-reported or independently verified.

Common questions

Which AI memory benchmark should I trust most? None of them in isolation. Look for independently run evaluations over vendor-published numbers where possible, and treat any single benchmark score as one data point rather than a final verdict, given how much published scores vary across sources for the same systems.

Why do Mem0's own benchmark numbers differ so much between reports? The underlying algorithm changed significantly between the original ECAI 2025 paper and the April 2026 token-efficient update, with specific improvements in temporal and multi-hop reasoning. Different evaluation methodology and question subsets between reports also contribute to the gap.

Do these benchmarks test whether memory actually helps an AI agent perform better at real tasks? Not directly. LoCoMo, LongMemEval, and BEAM all test conversational recall accuracy (can the system retrieve the right fact from history?), not whether that retrieval measurably improves outcomes on a downstream task like a recommendation, a code review, or a procurement decision. This is an open gap in the field as of 2026.

Is a higher LoCoMo score always the better choice for my product? Not necessarily. The full-context baseline in Mem0's own published comparison scored higher than Mem0 itself, at roughly fourteen times the token cost and latency. For most production applications, the right tradeoff between accuracy, cost, and speed depends on your specific workload, not just the leaderboard position.
