In a nutshell: No existing memory-agent system simultaneously meets the requirements for utility, access control, and reliable deletion in multi-user environments.
Researchers present GateMem, a benchmark for evaluating memory governance in LLM agents used by multiple users simultaneously. Testing shows that current systems fail to reliably implement access controls and data deletion.
The GateMem benchmark framework addresses a central gap in the evaluation of LLM agents: most existing memory benchmarks assume single-user scenarios and do not account for the requirements of real multi-principal deployments in hospitals, offices, educational institutions, and private households. In these environments, multiple users write to shared memory pools and query them under different roles, access scopes, and relationships, so governance becomes as critical a requirement as memory quality itself.
GateMem evaluates three central aspects simultaneously: utility for legitimate long-term queries with memory updates, robustness of access control across authorization boundaries, and reliability of forgetting after explicit deletion requests. The framework encompasses domains from healthcare, office environments, education, and private households with multi-part, longer-term scenarios, incremental memory injection, hidden checkpoints, and structured judging including leak-target annotations.
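To make the three aspects concrete, here is a minimal Python sketch of a shared memory with role-based reads, explicit forgetting, and a naive leak check against annotated targets. It is an illustration under stated assumptions, not GateMem's actual harness or API: the names `MemoryEntry`, `GovernedMemory`, and `leaked_targets`, the role labels, and the substring-based leak test are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    # Hypothetical record shape; the benchmark's real schema is not given in the source.
    entry_id: str
    owner: str
    text: str
    allowed_roles: frozenset[str]


@dataclass
class GovernedMemory:
    entries: dict[str, MemoryEntry] = field(default_factory=dict)

    def write(self, entry: MemoryEntry) -> None:
        # Incremental memory injection: entries arrive one at a time.
        self.entries[entry.entry_id] = entry

    def forget(self, entry_id: str) -> None:
        # Explicit deletion request: drop the entry so it can never be read again.
        self.entries.pop(entry_id, None)

    def read(self, role: str) -> list[str]:
        # Access control: return only entries whose scope includes the caller's role.
        return [e.text for e in self.entries.values() if role in e.allowed_roles]


def leaked_targets(answer: str, leak_targets: list[str]) -> list[str]:
    # Naive judge (assumption): a target counts as leaked if it appears verbatim,
    # ignoring case, in the agent's answer.
    lowered = answer.lower()
    return [t for t in leak_targets if t.lower() in lowered]


memory = GovernedMemory()
memory.write(MemoryEntry("m1", "dr_lee", "Patient A is allergic to penicillin",
                         frozenset({"doctor", "nurse"})))
memory.write(MemoryEntry("m2", "dr_lee", "Patient A billing dispute is pending",
                         frozenset({"admin"})))

nurse_view = memory.read("nurse")      # utility: the nurse sees the allergy note only
memory.forget("m1")
after_forget = memory.read("nurse")    # forgetting: the note is gone after deletion
leaks = leaked_targets("The billing dispute is pending.",
                       ["billing dispute", "penicillin"])
```

In this toy setup, a checkpoint would pass only if the legitimate query is answered, the out-of-scope entry stays hidden, and no deleted or unauthorized target shows up in the answer. A real judge would need more than substring matching, since paraphrased leaks would slip through.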
Evaluation across diverse baselines and backbone models reveals a persistent problem: no method achieves high utility, robust access control, and reliable forgetting at the same time. Long-context prompting achieves the best governance results but incurs significantly higher token costs. Retrieval-based and external-memory methods reduce costs but still leak unauthorized or already-deleted information. These results indicate that current memory agents are not yet ready for reliable deployment in shared institutional settings.
Source: arxiv.org · Published 16 June 2026
Lumi AI News — AI-assisted curation pursuant to Article 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.7.1.