Research
Benchmarks & research foundations
MetaMemory is grounded in cognitive science and validated on production benchmarks. Here are the numbers, the methodology, and the research that informs our architecture.
Benchmark results
67.95%
LoCoMo F1
F1 score on the LoCoMo long-conversation benchmark, vs. the best published baseline of 43.24%.
92%
HotpotQA F1
F1 score on the HotpotQA multi-hop question answering benchmark.
67.40%
LongMemEval
Overall accuracy on LongMemEval benchmark, vs. GPT-4o baseline of 60.6%.
77%
LongMemEval-S
Overall LongMemEval-S score: 100% single-session accuracy, 72% multi-session accuracy.
70%
Memory Compression
LLM-powered consolidation reduces storage while preserving recall quality.
<100ms
Retrieval Latency
P95 latency for multi-channel retrieval with RRF fusion at production scale.
Evaluation methodology
LoCoMo (Long-Context Conversation Memory) evaluates memory systems on their ability to accurately recall information from extended multi-session dialogues. MetaMemory achieves 67.95% F1, compared to the best published baseline of 43.24%.
HotpotQA is a multi-hop question answering benchmark requiring reasoning across multiple documents. MetaMemory achieves 92% F1 on this benchmark.
LongMemEval evaluates long-term memory capabilities across single and multi-session conversations. MetaMemory scores 67.40% overall (vs. GPT-4o at 60.6%). On the LongMemEval-S variant, MetaMemory achieves 77% overall with 100% single-session and 72% multi-session accuracy.
Compression is measured as the ratio of consolidated memory store size to raw memory store size. The 70% compression figure represents the average across diverse conversation types.
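As a concrete illustration of the metric as defined above (the function name and the store sizes are illustrative, not benchmark data):

```python
def compression_ratio(raw_size, consolidated_size):
    # As defined in the methodology: ratio of consolidated memory
    # store size to raw memory store size. Sizes in bytes.
    return consolidated_size / raw_size

# e.g. a 10 MB raw store consolidated down to 7 MB:
compression_ratio(10_000_000, 7_000_000)  # -> 0.7
```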
Retrieval latency is measured at P95 under production load (1,000+ concurrent sessions) with multi-channel retrieval and RRF fusion enabled. Hardware: standard cloud instances (4 vCPU, 16 GB RAM) with PostgreSQL + pgvector.
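For reference, a P95 figure like the one above can be computed from raw latency samples with the nearest-rank method. This is a minimal sketch of the statistic itself, not MetaMemory's measurement pipeline (production systems typically read percentiles from their metrics infrastructure):

```python
import math

def p95(latencies_ms):
    # Nearest-rank P95: the smallest sample such that at least 95%
    # of all observations are at or below it.
    ordered = sorted(latencies_ms)
    idx = math.ceil(0.95 * len(ordered)) - 1
    return ordered[idx]
```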
Cognitive science foundations
Tulving's Memory Taxonomy
MetaMemory's four-vector architecture draws inspiration from Endel Tulving's taxonomy of long-term memory — semantic (facts), episodic (experiences), and procedural (skills). Our implementation maps these concepts to four engineered embedding types — semantic, emotional, process, and context — optimized for AI agent retrieval rather than being a 1:1 replica of the cognitive model.
Multi-Armed Bandits for Retrieval
Adaptive strategy selection uses Thompson Sampling and Upper Confidence Bound (UCB) algorithms from the multi-armed bandit literature. These are well-studied Bayesian methods for balancing exploration and exploitation — applied here to learn which retrieval channel works best for each query type.
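The Thompson Sampling half of this idea is compact enough to sketch. The following is an illustrative Beta-Bernoulli selector over retrieval channels — the channel names and the binary reward signal are assumptions for the example, not MetaMemory's actual implementation:

```python
import random

class ThompsonChannelSelector:
    """Beta-Bernoulli Thompson Sampling over retrieval channels."""

    def __init__(self, channels):
        # Beta(1, 1) prior: uniform belief about each channel's success rate.
        self.alpha = {c: 1.0 for c in channels}
        self.beta = {c: 1.0 for c in channels}

    def select(self):
        # Sample a plausible success rate per channel, pick the best.
        # Exploration falls out naturally: uncertain channels sometimes
        # draw high samples and get tried.
        samples = {c: random.betavariate(self.alpha[c], self.beta[c])
                   for c in self.alpha}
        return max(samples, key=samples.get)

    def update(self, channel, success):
        # Bayesian update: a success raises alpha, a failure raises beta.
        if success:
            self.alpha[channel] += 1
        else:
            self.beta[channel] += 1
```

After enough feedback, channels that consistently return useful memories dominate the selection while the rest are still probed occasionally.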
Memory Consolidation Theory
Inspired by the complementary learning systems (CLS) theory of hippocampal-neocortical memory transfer, MetaMemory's consolidation process mirrors what happens during sleep: related memories are merged, redundant information is compressed, and important connections are strengthened.
Reciprocal Rank Fusion
RRF is a well-established information retrieval technique for combining ranked results from heterogeneous sources. MetaMemory uses RRF to fuse results from its 5 specialized retrieval channels into a single ranking, avoiding the score calibration problems of raw score merging.
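The RRF rule itself is simple: each document scores the sum of 1/(k + rank) across the ranked lists it appears in. A minimal sketch over per-channel result-ID lists (k=60 is the constant from Cormack et al., 2009; the inputs are illustrative):

```python
def reciprocal_rank_fusion(rankings, k=60):
    # rankings: one ranked list of result IDs per retrieval channel.
    # score(d) = sum over channels of 1 / (k + rank_of_d_in_channel).
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks are used, channels with incomparable raw scores (cosine similarity, BM25, recency) fuse cleanly without calibration.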
References
- [1] Tulving, E. (1972). Episodic and semantic memory. In Organization of Memory. Academic Press.
- [2] Tulving, E. (1985). Memory and consciousness. Canadian Psychology, 26(1).
- [3] McClelland, J.L., McNaughton, B.L., & O'Reilly, R.C. (1995). Why there are complementary learning systems in the hippocampus and neocortex. Psychological Review, 102(3).
- [4] Thompson, W.R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4).
- [5] Cormack, G.V., Clarke, C.L., & Büttcher, S. (2009). Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In Proceedings of SIGIR '09.
- [6] Ramaswamy, S. et al. (2024). LoCoMo: Long-Context Conversation Memory Benchmark.
Want to run your own benchmarks?
Start with the free tier and test MetaMemory against your own evaluation suite. No credit card required.