llama.cpp b9109
spec : parallel drafting support (#22838)

* spec : refactor
* spec : drop support for incompatible vocabs
* spec : update common_speculative_init()
* cont : pass seq_id
* cont : dedup ctx_seq_rm_type
* server : sketch the ctx_dft decode loop
* server : draft prompt cache and checkpoints
* server : improve ctx names
* server, spec : transition to unified spec context
* cont : sync main and drft contexts
* cont : async drft eval when possible
* cont : handle non-ckpt models
* cont : pass correct n_past for drafting
* cont : process images through the draft context
* spec : handle draft running out of context
* server : fix mtmd draft processing
* server : fix URL for draft model
* server : add comment
* server : clean-up + dry
* speculative-simple : update
* spec : fix n_past type
* server : fix slot ctx_drft ptr
* tools : update readme
* naming : improve consistency
* spec : refactor for multi-sequence speculative context
* cont : prepare params
* cont : prepare params
* spec : support parallel drafts
* server : support parallel drafting
* llama : reuse device buffers when possible
* server, spec : clean-up
* cont : clean-up
* cont : minor
* spec : reset `drafting` flag at the end
* spec : introduce `common_speculative_process()`
* spec : allow for…