LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues
via arxiv.org
Benchmark releases and evaluation results
METR conducted a risk assessment of an early version of Anthropic's Claude Mythos Preview in March 2026, estimating significant capabilities.
Gemini 3.1 Pro demonstrates strong retrieval capabilities and cost efficiency compared with GPT-5.2 and Opus 4.6, though users report tooling and UI issues. The SWE-bench Verified evaluation methodology is under scrutiny…
Nimbus builds production AI systems — internal tools, customer agents, retrieval pipelines — combining humans and AI end-to-end. From scoped pilot to production in 4–8 weeks.