09:10 CET · Wednesday, May 13, 2026

shipfeed

§ source

llama.cpp — Releases

https://github.com/ggerganov/llama.cpp/releases · tool · 65 items · last fetched


items · 50 latest

llama.cpp b9128

hexagon: eliminate scalar VTCM loads via HVX splat helpers (#22993) hexagon: add hvx_vec_repl helpers and use those for splat-from-vtcm usecase hmx-mm: optimize per-group scale handling hmx-fa: optimize slope load from…

llama.cpp b9127

opencl: add opt-in Adreno xmem F16xF32 GEMM for prefill (#22755) ggml-opencl: add Adreno xmem F16xF32 GEMM for prefill ggml-opencl: address Adreno xmem review comments ggml-opencl: align xmem gemm kernel naming…

llama.cpp b9124

mtmd, server, common: expose modalities to /v1/models (#22952) mtmd, server, common: expose modalities to /v1/models fix build rename to mtmd_caps macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64…

llama.cpp b9123

ggml-webgpu: Enables running gpt-oss-20b (#22906) Enable to run gpt-oss-20b and refactor mulmat-q disable test-backend-ops in ubuntu-24-webgpu macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI…

llama.cpp b9119

vulkan: Fix Windows performance regression on Intel GPU BF16 workloads for Xe2 and newer (#22461) refactor Use l_warptile only when coopamt is available for BF16 macOS/iOS: macOS Apple Silicon (arm64) macOS Apple…

llama.cpp b9118

vulkan: Check shared memory size for mmq shaders (#22693) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64…

llama.cpp b9116

mtmd: add MiMo v2.5 vision (#22883) mimo-v2.5: vision support mimo-v2.5: use fused qkv for vision mimi-v2.5: fix f16 vision overflow mimo-v2.5: comment cleanups mimo-v2.5: Flash doesn't have mmproj more cleanup…

llama.cpp b9114

metal : promote mul_mv/mul_mm batch divisors to function constants (#22711) metal : promote mul_mv/mul_mm batch divisors to function constants metal : take op directly in get_pipeline_mul_mv_ext macOS/iOS: macOS Apple…

llama.cpp b9113

opencl: add q4_1 MoE for Adreno (#22856) Q4_1 MoE CLC pass sanity check remove unnecessary code opencl: remove unnecessary asserts and reformat opencl: fix supports_op for q4_1 moe q4_1 moe is supported by Adreno with…

llama.cpp b9112

CUDA: handle OW > 65535 in im2col (2D and 3D) (#22944) `im2col_cuda` and `im2col_3d_cuda` both dispatch with `block_nums.y = OW`. CUDA caps grid Y at 65535. Conv1d encoders on raw 16 kHz audio with T > 65535 (~ 4 s)…
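The arithmetic behind this fix is easy to sanity-check: with `block_nums.y = OW` and CUDA capping `gridDim.y` at 65535, any 1-D output width past 65535 overflows the grid, and at 16 kHz that is only about 4 seconds of audio. A quick illustrative sketch (plain Python; `grid_y_blocks` is a hypothetical helper showing one common way to chunk an oversized dimension, not the actual kernel code):

```python
CUDA_MAX_GRID_Y = 65535   # hardware cap on gridDim.y
SAMPLE_RATE_HZ = 16_000   # raw audio rate from the release note

# Longest clip whose per-sample output width still fits in one grid Y dim
max_seconds = CUDA_MAX_GRID_Y / SAMPLE_RATE_HZ
print(round(max_seconds, 2))  # ~4.1 s, matching the "~4 s" in the note

def grid_y_blocks(ow, max_y=CUDA_MAX_GRID_Y):
    # A dispatch exceeding the cap must split OW into <= max_y chunks;
    # this returns how many Y-passes a given output width would need.
    return (ow + max_y - 1) // max_y
```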

llama.cpp b9110

docs: fix metrics endpoint description in server README (#22879) docs: fix metrics endpoint description in server README Required model query parameter for router mode described. Removed metrics…

llama.cpp b9109

spec : parallel drafting support (#22838) spec : refactor spec : drop support for incompatible vocabs spec : update common_speculative_init() cont : pass seq_id cont : dedup ctx_seq_rm_type server : sketch the ctx_dft…

llama.cpp b9106

vulkan: Support asymmetric FA in scalar/mmq/coopmat1 paths (#22589) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu…

llama.cpp b9105

CUDA: directly include cuda/iterator (#22936) Before, we relied on a transient import from `cub/cub.cuh`, which is bad practice to do as cub may not always expose cuda/iterator macOS/iOS: macOS Apple Silicon (arm64)…

llama.cpp b9103

vendor : update cpp-httplib to 0.44.0 (#22919) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64 (CPU) Ubuntu…

llama.cpp b9102

[SYCL] Add OP im2col_3d (#22903) add im2col_3d format code update the ops.md macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64…

llama.cpp b9101

server : print warning when HTTP timeout exceeded (#22907) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64…

llama.cpp b9100

backend sampling: support returning post-sampling probs (#22622) server: Never return 0.0 post-sampling probabilities backend sampling: support returning post-sampling probs macOS/iOS: macOS Apple Silicon (arm64) macOS…

llama.cpp b9099

vendor : update cpp-httplib to 0.43.4 (#22888) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64 (CPU) Ubuntu…

llama.cpp b9097

sync : ggml macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64 (CPU) Ubuntu s390x (CPU) Ubuntu x64 (Vulkan)…

llama.cpp b9095

internal AllReduce kernel for CUDA provider (#22299) ggml-cuda: add internal AllReduce provider for tensor parallelism Introduces a NCCL-free AllReduce implementation for LLAMA_SPLIT_MODE_TENSOR using a single-phase…
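For readers less familiar with the collective: AllReduce means every participant ends up with the elementwise sum of all participants' partial results, which is what tensor parallelism needs after each split matmul. A minimal pure-Python sketch of the semantics (illustrative only; the release implements this as a CUDA kernel, not like this):

```python
def all_reduce_sum(buffers):
    # Elementwise sum across all participants' buffers;
    # every participant receives the same reduced result.
    reduced = [sum(col) for col in zip(*buffers)]
    return [list(reduced) for _ in buffers]
```

For example, two "devices" holding partial sums [1, 2] and [3, 4] both end up with [4, 6].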

llama.cpp b9094

model : fix model type check for granite/llama3 and deepseek2/glm4.7 lite (#22870) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu…

llama.cpp b9093

model : add sarvam_moe architecture support (#20275) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64 (CPU)…

llama.cpp b9090

cmake : update BoringSSL to 0.20260508.0 (#22839) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64 (CPU) Ubuntu…

llama.cpp b9089

SYCL: reduce allocation overhead during flash attention (#22732) SYCL: reduce allocation overhead during flash attention tidy up whitespace add a note about the flag move ggml_sycl_fattn_ into fattn-buffers.hpp…

llama.cpp b9088

[SYCL] Add BF16 support to GET_ROWS operation (#21391) Add GGML_TYPE_BF16 to the SYCL backend's GET_ROWS operation, both in supports_op and in the kernel dispatch. This fixes a performance regression where models using…

llama.cpp b9087

sycl: Q5_K reorder MMVQ/dequant + Q8_0 reorder MMVQ path (#22152) sycl: Q5_K reorder MMVQ/dequant + Q8_0 reorder MMVQ path Signed-off-by: Chun Tao Remove duplicate definitions --------- Signed-off-by: Chun Tao…

llama.cpp b9085

Add flash attention MMA / Tiles to support MiMo-V2.5 (#22812) mimo-v2.5: add flash attention mma/tiles for for d_kq=192 d_v=128 mimo-v2.5: follow (256, 256) fattn templates mimo-v2.5: cleanup comments mimo-v2.5…

llama.cpp b9084

hexagon: add HTP kernel for GGML_OP_GATED_DELTA_NET (#22837) Implement the Gated Delta Net recurrence on HVX with: 4-row fused kernels for PP (prompt processing) path 8-row fused kernels for TG (token generation) path…

llama.cpp b9082

Feature hexagon l2 norm (#22816) L2_NORM Updates Addressed PR Comments ggml-hexagon: add L2_NORM HVX kernel for Hexagon backend hex-unary: remove supported_unary_nc since the outer loop is the same for all unary ops…

llama.cpp b9081

common : do not wrap raw strings in schema parser for tagged parsers (#22827) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64…

llama.cpp b9080

model : support Gemma4_26B_A4B_NVFP4 (#22804) Gemma4_26B_A4B_NvFp4 hf checkpoint convert to gguf format fixes Signed-off-by: ynankani Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret Address review…

llama.cpp b9079

common : revert reasoning budget +inf logit bias (#22740) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64…

llama.cpp b9077

server: support Vertex AI compatible API (#22545) server: support Vertex AI compatible API a bit safer support other AIP_ env var various fixes if AIP_MODE is unset, do nothing fix test case fix windows build…

llama.cpp b9076

server: (router) expose child model info from router's /v1/models (#22683) server: (router) expose child model info from router's /v1/models update docs macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon…

llama.cpp b9075

cuda: fuse snake activation (mul, sin, sqr, mul, add) (#22667) cuda: fuse snake activation (mul, sin, sqr, mul, add) Add ggml_cuda_op_snake_fused with F32 / F16 / BF16 templates. The matcher recognizes the naive 5 op…
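The five ops listed here line up with the standard snake activation, snake(x) = x + sin²(αx)/α (assuming the usual formulation from the literature; the release note only names the op sequence). A naive reference in plain Python, one line per fused op:

```python
import math

def snake(x, alpha=1.0):
    # Naive 5-op snake activation that the fused CUDA kernel replaces:
    ax = x * alpha               # mul
    s = math.sin(ax)             # sin
    s2 = s * s                   # sqr
    scaled = s2 * (1.0 / alpha)  # mul
    return x + scaled            # add
```

Fusing these into one kernel avoids four intermediate tensor round-trips through global memory.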

llama.cpp b9073

CUDA: lower-case PCI bus id, standardize for ggml (#22820) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64…

llama.cpp b9072

vulkan: fix spv shadowing (#22760) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64 (CPU) Ubuntu s390x (CPU)…

llama.cpp b9071

ggml: update SCHED_DEBUG output to use ggml_op_desc() (#22825) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64…

llama.cpp b9070

opencl: add q4_0 MoE GEMM for Adreno (#22731) Q4_0 MoE CLC pass sanity check release program opencl: fix whitespace opencl: remove unused cl_program opencl: break #if block to make it more clear opencl: adjust format…

llama.cpp b9066

CUDA: batch out_prod inner loop with cublasSgemmStridedBatched (#22651) CUDA: batch out_prod inner loop with cublasSgemmStridedBatched CUDA: batch out_prod inner loop with cublasSgemmStridedBatched CUDA: add…

llama.cpp b9064

llama : fix device state save/load (#22805) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64 (CPU) Ubuntu s390x…

llama.cpp b9063

opencl: add opfilter regex for debugging (#22782) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64 (CPU) Ubuntu…

llama.cpp b9062

common/chat : preserve media markers for typed-content templates (#22634) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU)…

llama.cpp b9061

tests: add long-sequence cases and fix inputs for gated_delta_net (#22794) tests : add long-seq + tail cases for gated_delta_net tests : realistic input ranges for gated_delta_net macOS/iOS: macOS Apple Silicon (arm64)…

llama.cpp b9060

sycl: add FILL, CUMSUM, DIAG, SOLVE_TRI, SSM_SCAN, GATED_DELTA_NET (#22149) sycl: add FILL, CUMSUM, DIAG, SOLVE_TRI, SSM_SCAN, GATED_DELTA_NET Signed-off-by: Chun Tao Fix abort during test-backend-ops Signed-off-by…

llama.cpp b9058

llama : remove unnecessary seq_id check during state restore (#22797) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU)…

llama.cpp b9057

ggml-cpu: Optimized risc-v cpu q1_0 dot macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64 (CPU) Ubuntu s390x…

llama.cpp b9056

mtmd: fix whisper audio tail truncation by exposing padded buffer to FFT (#22770) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64…

llama.cpp b9055

model: Add Mimo v2.5 model support (#22493) add mimo-v2.5 support mimo-v2.5: fix modify_tensors row split mimi-v2.5: forgot `add_attn_value_scale` plumbing mimi-v2.5: fix tp dequant to detect tp rows mimo-v2.5: fix TP…
