08:25 CET · Wednesday, May 13, 2026

shipfeed

on the wire
§ topic · local-llm

local-llm

18 this week · 34 this month · 76 all-time

Open-weight model releases and local inference tooling


clusters this week · 30 active

N° 001·ai·

not much happened today

Google launched Gemma 4 under an Apache 2.0 license, a significant open-model release focused on reasoning, agentic workflows, multimodality, and on-device use. It outperforms models 10x larger and has…

via news.smol.ai
Wednesday, March 11, 2026’s edition
N° 001·ai·

not much happened today

NVIDIA’s Nemotron 3 Super is a 120B parameter / ~12B active open model featuring a hybrid Mamba-Transformer / SSM Latent MoE architecture and 1M context window, delivering up to 2.2x faster inference than GPT-OSS-120B…

via news.smol.ai
Tuesday, May 5, 2026’s edition
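A quick sketch of the MoE arithmetic behind the entry above: only the active parameters are read per token, which is where the throughput edge over a dense model of similar size comes from. The numbers below are the entry's own rounded figures (120B total / ~12B active), not exact model-card values.

```python
# MoE active-parameter fraction for a 120B-total / ~12B-active model:
# each token only touches the active subset of the weights.
total_params = 120e9
active_params = 12e9

active_fraction = active_params / total_params
print(active_fraction)  # 0.1 -> roughly 10% of weights read per token
```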
N° 001·ai·

Transformers v5.8.0

Release v5.8.0 New Model additions DeepSeek-V4 DeepSeek-V4 is the next-generation MoE (Mixture of Experts) language model from DeepSeek that introduces several architectural innovations over DeepSeek-V3. The…

via github.com
Monday, April 27, 2026’s edition
N° 001·ai·

vLLM v0.20.0

vLLM v0.20.0 Highlights This release features 752 commits from 320 contributors (123 new)! DeepSeek V4: Initial DeepSeek V4 support landed (#40860), with DSML token-leakage fix in DSV4/3.2 (#40806), DSA + MTP IMA fix…

via github.com
Wednesday, April 22, 2026’s edition
N° 001·ai·

not much happened today

Alibaba released Qwen3.6-27B, a dense, Apache 2.0 open coding model with thinking and non-thinking modes, outperforming the larger Qwen3.5-397B-A17B on multiple coding benchmarks including SWE-bench and Terminal-Bench…

via news.smol.ai
Monday, April 20, 2026’s edition
N° 001·ai·

not much happened today

Moonshot's Kimi K2.6 is a major open-weight 1T-parameter MoE model featuring 32B active parameters, 384 experts, MLA attention, 256K context window, native multimodality, and INT4 quantization. It supports day-0…

via news.smol.ai
Friday, April 3, 2026’s edition
N° 001·ai·

vLLM v0.19.0

vLLM v0.19.0 Highlights This release features 448 commits from 197 contributors (54 new)! Gemma 4 support: Full Google Gemma 4 architecture support including MoE, multimodal, reasoning, and tool-use capabilities…

via github.com
Thursday, April 2, 2026’s edition
N° 001·ollama·

Ollama v0.20.0

Gemma 4

- Effective 2B (E2B): `ollama run gemma4:e2b`
- Effective 4B (E4B): `ollama run gemma4:e4b`
- 26B (Mixture of Experts model with 4B active parameters): `ollama run gemma4:26b`
- 31B (Dense): `ollama…`

via github.com
Wednesday, February 25, 2026’s edition
N° 001·ai·

vLLM v0.16.0

vLLM v0.16.0 Please note that this release was branch-cut on Feb 8, so any features added to vLLM after that date are not included. Highlights This release features 440 commits from 203 contributors (7 new)! Async…

via github.com
Tuesday, January 20, 2026’s edition
N° 001·ai·

vLLM v0.14.0

Highlights This release features approximately 660 commits from 251 contributors (86 new contributors). Breaking Changes: Async scheduling is now enabled by default - Users who experience issues can disable with…

via github.com
Friday, December 26, 2025’s edition
N° 001·agents·

not much happened today

MiniMax M2.1 launches as an open-source agent and coding Mixture-of-Experts (MoE) model with ~10B active / ~230B total parameters, claiming to outperform Gemini 3 Pro and Claude Sonnet 4.5, and supports local inference…

via news.smol.ai
Tuesday, December 23, 2025’s edition
N° 001·ai·

not much happened today

GLM-4.7 and MiniMax M2.1 open-weight model releases highlight day-0 ecosystem support, coding throughput, and agent workflows, with GLM-4.7 achieving a +9.5% improvement over GLM-4.6 and MiniMax M2.1 positioned as an…

via news.smol.ai
Wednesday, December 3, 2025’s edition
N° 001·ai·

vLLM v0.12.0

vLLM v0.12.0 Release Notes Highlights This release features 474 commits from 213 contributors (57 new)! Breaking Changes: This release includes PyTorch 2.9.0 upgrade (CUDA 12.9), V0 deprecations including…

via github.com
Wednesday, November 19, 2025’s edition
N° 001·ai·

vLLM v0.11.1

Highlights This release includes 1456 commits from 449 contributors (184 new contributors)! Key changes include: PyTorch 2.9.0 + CUDA 12.9.1: Updated the default CUDA build to `torch==2.9.0+cu129`, enabling Inductor…

via github.com
Thursday, October 2, 2025’s edition
N° 001·ai·

vLLM v0.11.0

Highlights This release features 538 commits, 207 contributors (65 new contributors)! This release completes the removal of V0 engine. V0 engine code including AsyncLLMEngine, LLMEngine, MQLLMEngine, all attention…

via github.com
Friday, May 8, 2026’s edition
N° 001·ai·

vLLM v0.20.1

vLLM v0.20.1 This is a patch release on top of `v0.20.0` primarily focused on DeepSeek V4 stabilization and performance improvements, along with several important bug fixes. DeepSeek V4 Base model support (#41006)…

via github.com
Tuesday, April 28, 2026’s edition
N° 001·ai·

Transformers v5.7.0

Release v5.7.0 New Model additions Laguna Laguna is Poolside's mixture-of-experts language model family that extends standard SwiGLU MoE transformers with two key innovations. It features per-layer head counts allowing…

via github.com
Friday, March 27, 2026’s edition
N° 001·ollama·

Ollama v0.19.0

Ollama is now powered by MLX on Apple Silicon in preview Ollama on Apple silicon is now built on top of Apple’s machine learning framework, MLX, to take advantage of its unified memory architecture…

via github.com
Friday, March 20, 2026’s edition
N° 001·ai·

vLLM v0.18.0

vLLM v0.18.0 Known issues Degraded accuracy when serving Qwen3.5 with FP8 KV cache on B200 (#37618) If you previously ran into `CUBLAS_STATUS_INVALID_VALUE` and had to use a workaround in `v0.17.0`, you can reinstall…

via github.com
Saturday, March 7, 2026’s edition
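The FP8 KV-cache note above is a good excuse for a back-of-envelope sketch of why KV-cache precision matters at long context. The model shape below (48 layers, 8 KV heads, head dim 128) is a hypothetical example, not Qwen3.5's actual configuration.

```python
# KV-cache sizing: K and V tensors are stored per layer, per token.
# FP8 (1 byte/element) halves the footprint vs FP16 (2 bytes/element).
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem):
    # factor of 2 = one K tensor and one V tensor per layer
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical shape: 48 layers, 8 KV heads, head dim 128, 128K context.
fp16 = kv_cache_bytes(48, 8, 128, 131_072, 2)
fp8 = kv_cache_bytes(48, 8, 128, 131_072, 1)
print(fp16 / 2**30, fp8 / 2**30)  # 24.0 GiB vs 12.0 GiB for one sequence
```

The halving is exact since only the element width changes; the accuracy caveat in the release note is the trade-off being paid for it.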
N° 001·ai·

vLLM v0.17.0

vLLM v0.17.0 Known Issue: If you are on CUDA 12.9+ and encounter a `CUBLAS_STATUS_INVALID_VALUE` error, this is caused by a CUDA library mismatch. To resolve, try one of the following: 1. Remove the path to system CUDA…

via github.com
Thursday, January 29, 2026’s edition
N° 001·ai·

vLLM v0.15.0

Highlights This release features 335 commits from 158 contributors (39 new)! Model Support New architectures: Kimi-K2.5 (#33131), Molmo2 (#30997), Step3vl 10B (#32329), Step1 (#32511), GLM-Lite (#31386), Eagle2.5-8B…

via github.com
Wednesday, December 31, 2025’s edition
N° 001·ai·

not much happened today

South Korea's Ministry of Science launched a coordinated program with 5 companies to develop sovereign foundation models from scratch, featuring large-scale MoE architectures like SK Telecom A.X-K1 (519B total / 33B…

via news.smol.ai
Friday, December 19, 2025’s edition
N° 001·ai·

vLLM v0.13.0

vLLM v0.13.0 Release Notes Highlights This release features 442 commits from 207 contributors (61 new contributors)! Breaking Changes: This release includes deprecation removals, PassConfig flag renames, and…

via github.com
Wednesday, December 10, 2025’s edition
N° 001·agents·

not much happened today

NousResearch's Nomos 1 is a 30B open math model achieving a top Putnam score with only ~3B active parameters, enabling consumer Mac inference. AxiomProver also posts top Putnam results using ThinkyMachines' RL stack…

via news.smol.ai
Saturday, September 13, 2025’s edition
N° 001·ai·

vLLM v0.10.2

Highlights This release contains 740 commits from 266 contributors (97 new)! Breaking Changes: This release includes PyTorch 2.8.0 upgrade, V0 deprecations, and API changes - please review the changelog carefully…

via github.com
Yesterday’s edition
N° 001·llama.cpp·

llama.cpp b9112

CUDA: handle OW > 65535 in im2col (2D and 3D) (#22944) `im2col_cuda` and `im2col_3d_cuda` both dispatch with `block_nums.y = OW`. CUDA caps grid Y at 65535. Conv1d encoders on raw 16 kHz audio with T > 65535 (~ 4 s)…

via github.com
Monday, May 11, 2026’s edition
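The "~4 s" in the entry above falls straight out of the numbers given: the kernel dispatched with `block_nums.y = OW`, CUDA caps `gridDim.y` at 65535, and at a 16 kHz sample rate that many output columns is hit after roughly four seconds of raw audio.

```python
# Why the im2col fix above matters: grid Y dimension is capped at 65535,
# so a conv1d encoder over raw 16 kHz audio overflows it after ~4 s.
CUDA_MAX_GRID_Y = 65_535
SAMPLE_RATE_HZ = 16_000

seconds_until_cap = CUDA_MAX_GRID_Y / SAMPLE_RATE_HZ
print(seconds_until_cap)  # ~4.1 seconds of audio before the cap is hit
```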
N° 001·llama.cpp·

llama.cpp b9109

spec : parallel drafting support (#22838) spec : refactor spec : drop support for incompatible vocabs spec : update common_speculative_init() cont : pass seq_id cont : dedup ctx_seq_rm_type server : sketch the ctx_dft…

via github.com
Tuesday, May 5, 2026’s edition
N° 001·ollama·

Ollama v0.23.1

Gemma 4 MTP (Multi-token Processing) for the MLX runner Gemma 4 MTP speculative decoding is now supported on Macs. This can give over a 2x speed increase for the Gemma 4 31B model on coding tasks. ``` ollama run…

via github.com
Tuesday, April 28, 2026’s edition
N° 001·ollama·

Ollama v0.22.0

New models

- NVIDIA's Nemotron 3 Omni
- Poolside's first open-weight coding model, Laguna XS.2

Full Changelog: https://github.com/ollama/ollama/compare/v0.21.2...v0.22.0

via github.com