08:25 CET · Wednesday, May 13, 2026

shipfeed

§ topic · evals

evals

34 this week · 34 this month · 35 all-time

Benchmark releases and evaluation results


clusters this week · 16 active

Monday, May 11, 2026’s edition
Saturday, May 9, 2026’s edition
Saturday, February 21, 2026’s edition
N° 001 · evals

not much happened today

Gemini 3.1 Pro demonstrates strong retrieval capabilities and cost efficiency compared to GPT-5.2 and Opus 4.6, though users report tooling and UI issues. The SWE-bench Verified evaluation methodology is under scrutiny…

via news.smol.ai