Benchmarks & research foundations

MetaMemory is grounded in cognitive science and validated on production benchmarks. Here are the numbers, the methodology, and the research that informs our architecture.

Benchmark results

67.95%

LoCoMo F1

F1 score on the LoCoMo long-conversation benchmark, vs. 43.24% best published baseline.

92%

HotpotQA F1

F1 score on the HotpotQA multi-hop question answering benchmark.

67.40%

LongMemEval

Overall accuracy on LongMemEval benchmark, vs. GPT-4o baseline of 60.6%.

77%

LongMemEval-S

Overall LongMemEval-S score: 100% single-session accuracy, 72% multi-session accuracy.

70%

Memory Compression

LLM-powered consolidation reduces storage while preserving recall quality.

<100ms

Retrieval Latency

P95 latency for multi-channel retrieval with RRF fusion at production scale.

Evaluation methodology

LoCoMo (Long-Context Conversation Memory) evaluates memory systems on their ability to accurately recall information from extended multi-session dialogues. MetaMemory achieves 67.95% F1, compared to the best published baseline of 43.24%.

HotpotQA is a multi-hop question answering benchmark requiring reasoning across multiple documents. MetaMemory achieves 92% F1 on this benchmark.
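For readers unfamiliar with how F1 is typically scored on question answering benchmarks, here is a minimal sketch of token-overlap F1 in the SQuAD style. It assumes lowercasing, punctuation stripping, and English article removal as the normalization. The exact normalization used for the figures above is not stated here, so treat `normalize` and `token_f1` as illustrative names rather than the evaluation harness itself.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    Assumption: SQuAD-style normalization; a given benchmark may differ.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between two answers."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A benchmark-level score is then usually the mean of `token_f1` over all question and answer pairs.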

LongMemEval evaluates long-term memory capabilities across single and multi-session conversations. MetaMemory scores 67.40% overall (vs. GPT-4o at 60.6%). On the LongMemEval-S variant, MetaMemory achieves 77% overall with 100% single-session and 72% multi-session accuracy.

Compression is measured as the ratio of consolidated memory store size to raw memory store size. The 70% compression figure represents the average across diverse conversation types.
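The ratio defined above can be sketched as follows, assuming store sizes are measured in bytes. The function also reports the reduction (one minus the ratio), since a headline compression figure can be quoted either way. `compression_stats` and `mean_ratio` are hypothetical helper names, and averaging per-conversation-type ratios with equal weight is an assumption.

```python
def compression_stats(raw_bytes: int, consolidated_bytes: int) -> dict[str, float]:
    """Return the consolidated/raw size ratio and the corresponding reduction."""
    if raw_bytes <= 0:
        raise ValueError("raw_bytes must be positive")
    ratio = consolidated_bytes / raw_bytes
    return {"ratio": ratio, "reduction": 1.0 - ratio}


def mean_ratio(sizes: list[tuple[int, int]]) -> float:
    """Unweighted mean of per-sample ratios; sizes are (raw, consolidated) pairs."""
    if not sizes:
        raise ValueError("sizes must not be empty")
    ratios = [compression_stats(raw, cons)["ratio"] for raw, cons in sizes]
    return sum(ratios) / len(ratios)
```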

Retrieval latency is measured at P95 under production load (1,000+ concurrent sessions) with multi-channel retrieval and RRF fusion enabled. Hardware: standard cloud instances (4 vCPU, 16 GB RAM) with PostgreSQL + pgvector.
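To reproduce a P95 figure from your own latency samples, a nearest-rank percentile is one simple option. This is a sketch only: the percentile method behind the number above is not specified, and `percentile_nearest_rank` is an illustrative name.

```python
import math


def percentile_nearest_rank(samples: list[float], pct: int) -> float:
    """Smallest sample such that at least pct percent of samples are <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # Multiply before dividing to keep the arithmetic exact for integer pct.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]
```

For example, calling `percentile_nearest_rank(latencies_ms, 95)` on per-request latencies in milliseconds gives a value that 95% of requests did not exceed.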

Cognitive science foundations

Tulving's Memory Taxonomy

MetaMemory's four-vector architecture draws inspiration from Endel Tulving's taxonomy of long-term memory — semantic (facts), episodic (experiences), and procedural (skills). Our implementation maps these concepts to four engineered embedding types — semantic, emotional, process, and context — optimized for AI agent retrieval rather than being a 1:1 replica of the cognitive model.
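One way to picture a four-vector record is a structure holding one embedding per type, compared type by type. The sketch below is an assumption for illustration only: `MemoryRecord`, `cosine`, and `similarity_by_type` are hypothetical names, not MetaMemory's API, and the embeddings are plain Python lists.

```python
import math
from dataclasses import dataclass

EMBEDDING_TYPES = ("semantic", "emotional", "process", "context")


@dataclass
class MemoryRecord:
    """Hypothetical record: one text plus one embedding per type."""
    text: str
    semantic: list[float]
    emotional: list[float]
    process: list[float]
    context: list[float]


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_by_type(query: MemoryRecord, memory: MemoryRecord) -> dict[str, float]:
    """Compare two records separately on each of the four embedding types."""
    return {
        name: cosine(getattr(query, name), getattr(memory, name))
        for name in EMBEDDING_TYPES
    }
```

Keeping the types separate lets a retriever weight or rank each one independently instead of collapsing everything into a single vector.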

Multi-Armed Bandits for Retrieval

Adaptive strategy selection uses Thompson Sampling and Upper Confidence Bound (UCB) algorithms from the multi-armed bandit literature. Both are well-studied methods for balancing exploration and exploitation: Thompson Sampling is Bayesian, while UCB is a frequentist rule based on optimistic confidence bounds. Here they are applied to learn which retrieval channel works best for each query type.

Memory Consolidation Theory

Inspired by the complementary learning systems (CLS) theory of hippocampal-neocortical memory transfer, MetaMemory's consolidation process mirrors what happens during sleep: related memories are merged, redundant information is compressed, and important connections are strengthened.

Reciprocal Rank Fusion

RRF is a well-established information retrieval technique for combining ranked results from heterogeneous sources. MetaMemory uses RRF to fuse results from its five specialized retrieval channels into a single ranking. Because RRF depends only on each item's rank within a channel, it avoids the score calibration problems of merging raw scores.
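A minimal sketch of RRF follows. Each item scores the sum of 1 / (k + rank) over the rankings it appears in, with k = 60 as suggested by Cormack et al. The function name is illustrative, ties are broken alphabetically by identifier as an arbitrary choice, and each input ranking is assumed to contain no duplicate identifiers.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of item ids into one list, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            # Only the rank matters; raw channel scores are never compared.
            scores[item_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))


fused = reciprocal_rank_fusion([
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
])
print(fused)  # ['b', 'a', 'c', 'd']
```

An item ranked highly by several channels rises to the top even when no single channel puts it first, which is the behavior you want when channels disagree.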

References

  1. Tulving, E. (1972). Episodic and semantic memory. In Organization of Memory.
  2. Tulving, E. (1985). Memory and consciousness. Canadian Psychology, 26(1).
  3. McClelland, J.L., McNaughton, B.L., & O'Reilly, R.C. (1995). Why there are complementary learning systems in the hippocampus and neocortex. Psychological Review, 102(3).
  4. Thompson, W.R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4).
  5. Cormack, G.V., Clarke, C.L.A., & Buettcher, S. (2009). Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In Proceedings of SIGIR 2009.
  6. Maharana, A., Lee, D.-H., Tulyakov, S., Bansal, M., Barbieri, F., & Fang, Y. (2024). Evaluating very long-term conversational memory of LLM agents. In Proceedings of ACL 2024. (Introduces the LoCoMo benchmark.)

Want to run your own benchmarks?

Start with the free tier and test MetaMemory against your own evaluation suite. No credit card required.