LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues
via arxiv.org
Benchmark releases and evaluation results
METR conducted a risk assessment of an early version of Anthropic's Claude Mythos Preview in March 2026, estimating significant capabilities.
Gemini 3.1 Pro demonstrates strong retrieval capabilities and cost efficiency compared with GPT-5.2 and Opus 4.6, though users report tooling and UI issues. The SWE-bench Verified evaluation methodology is under scrutiny…
Nimbus builds production AI systems — internal tools, customer agents, retrieval pipelines — combining humans and AI end-to-end. From scoped pilot to production in 4–8 weeks.