vLLM v0.15.0
## Highlights

This release features 335 commits from 158 contributors (39 new)!

### Model Support

- New architectures: Kimi-K2.5 (#33131), Molmo2 (#30997), Step3vl 10B (#32329), Step1 (#32511), GLM-Lite (#31386), Eagle2.5-8B VLM (#32456).
- Expanded LoRA support: Nemotron-H (#30802), InternVL2 (#32397), MiniMax M2 (#32763).
- Speculative decoding: EAGLE3 for Pixtral/LlavaForConditionalGeneration (#32542), Qwen3 VL MoE (#32048), and draft model support (#24322); a configuration sketch follows this list.
- Embeddings: BGE-M3 sparse embeddings and ColBERT embeddings (#14526).
- Model enhancements: Voxtral streaming architecture (#32861), SharedFusedMoE for Qwen3MoE (#32082), dynamic resolution for Nemotron Nano VL (#32121), Molmo2 vision backbone quantization (#32385).
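As a quick illustration of draft-model speculative decoding, here is a minimal offline sketch. Both model names are placeholders, and the `speculative_config` keys shown are assumptions based on vLLM's documented speculative decoding interface; consult the v0.15.0 docs for the exact schema.

```python
from vllm import LLM, SamplingParams

# Minimal sketch: pair a large target model with a small draft model.
# Both model names are placeholders; any compatible pair works.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",         # target model (placeholder)
    speculative_config={
        "model": "meta-llama/Llama-3.2-1B-Instruct",  # draft model (assumed key)
        "num_speculative_tokens": 4,                  # tokens drafted per step (assumed key)
    },
)
outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```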
### Engine Core

- Async scheduling + pipeline parallelism: `--async-scheduling` now works with pipeline parallelism (#32359); a launch sketch follows this list.
- Mamba prefix caching: block-aligned prefix caching for Mamba/hybrid models, enabled with `--enable-prefix-caching --mamba-cache-mode align`. Achieves a ~2x speedup by caching Mamba states directly (#30877); see the second sketch below.
- Session-based streaming input: new incremental input support for interactive workloads such as ASR (pattern sketch below). Accepts async generators producing `StreamingInput` objects while maintaining…
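A minimal launch sketch for the async scheduling + pipeline parallelism combination, assuming a node with two GPUs. Only `--async-scheduling` is taken from the notes above; the model name is a placeholder, and whether the offline `LLM` entry point accepts the keyword form directly is an assumption.

```python
from vllm import LLM

# Sketch: split the model's layers across two GPUs and overlap CPU-side
# scheduling with GPU execution. Keyword forms mirror the CLI flags; that
# `LLM` accepts `async_scheduling` directly is an assumption here.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    pipeline_parallel_size=2,                  # two pipeline stages
    async_scheduling=True,                     # mirrors --async-scheduling
)
```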
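Next, a sketch of enabling block-aligned Mamba prefix caching. The two flags are exactly those cited above, written in keyword form; the keyword spelling of `--mamba-cache-mode` is an assumption, and the hybrid model name is only an example.

```python
from vllm import LLM

# Sketch: turn on prefix caching for a Mamba/attention hybrid model so that
# repeated prefixes reuse cached Mamba states instead of being recomputed.
llm = LLM(
    model="ibm-ai-platform/Bamba-9B",  # example hybrid (Mamba + attention) model
    enable_prefix_caching=True,        # mirrors --enable-prefix-caching
    mamba_cache_mode="align",          # mirrors --mamba-cache-mode align (assumed kwarg)
)
```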
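Finally, a self-contained illustration of the session-based streaming pattern: an async generator yields incremental `StreamingInput` chunks to a consumer. The `StreamingInput` dataclass and its field below are stand-ins, not vLLM's actual types; only the async-generator shape comes from the notes above.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class StreamingInput:
    text_delta: str  # stand-in field; vLLM's real StreamingInput differs

async def transcript_chunks():
    # Incrementally produce input, as an ASR frontend might mid-utterance.
    for piece in ["The quick", " brown fox", " jumps over the lazy dog."]:
        await asyncio.sleep(0.05)  # simulate real-time arrival
        yield StreamingInput(text_delta=piece)

async def consume(session):
    # In vLLM the engine consumes the generator; here we just echo each chunk
    # to show the incremental, session-style flow of inputs.
    async for chunk in session:
        print("received:", chunk.text_delta)

asyncio.run(consume(transcript_chunks()))
```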