github ggml-org/llama.cpp b8338


ggml : add OpenVINO backend (#15307)

  • Update build doc

  • Add cgraph tensor output name to OV op name

  • Update openvino build instructions

  • Add initial NPU support

  • draft NPU support version 2: prefill + kvcache

  • NPU support version 2: prefill + kvcache

  • Change due to ggml cgraph changes, not correct yet

  • Change due to ggml cgraph changes, llama-3.2 CPU work

  • Add AMD64 to CMakeLists

  • Change due to ggml cgraph changes, all device work

  • Refactor: clean, fix warning

  • Update clang-format

  • Stateful transformation for CPU and GPU

  • Add SwiGLU

  • Fuse to SDPA

  • Replace Concat with Broadcast in MulMat for GQA

  • Pull out indices creation for kv cache update

  • Refactor: remove past_token_len from extra_inputs

  • Fix Phi3 SwiGLU and SoftMax

  • Pull out sin cos from rope

  • Reduce memory: free ov weights node after graph conversion

  • Fix CPY due to cgraph change

  • Added OpenVINO CI/CD. Updated docs

  • Fix llama-cli

  • Fix Phi3 ROPE; Add test-backend-ops

  • Fix NPU

  • Fix llama-bench; Clang-format

  • Fix llama-perplexity

  • temp. changes for mark decomp

  • matmul in fp32

  • mulmat input conversion fix

  • mulmat type conversion update

  • add mark decomp pass

  • Revert changes in fuse_to_sdpa

  • Update build.md

  • Fix test-backend-ops

  • Skip test-thread-safety; Run ctest only in ci/run.sh

  • Use CiD for NPU

  • Optimize tensor conversion, improve TTFT

  • Support op SET_ROWS

  • Fix NPU

  • Remove CPY

  • Fix test-backend-ops

  • Minor updates for raising PR

  • Perf: RMS fused to OV internal RMS op

  • Fix after rebasing

  • Layout of cache k and cache v are unified: [seq, n_head, head_size]
  • Add CPY and FLASH_ATTN_EXT, flash attn is not used yet
  • Skip test-backend-ops due to flash attn test crash
  • Add mutex around graph conversion to avoid test-thread-safety failures in the future
  • Update NPU config
  • Update GPU config to disable SDPA opt to make phi-3 run
  • Change openvino device_type to GPU; Enable flash_attn

  • Update supports_buft and supports_op for quantized models

  • Add quant weight conversion functions from genai gguf reader

  • Quant models run with accuracy issue

  • Fix accuracy: disable cpu_repack

  • Fix CI; Disable test-backend-ops

  • Fix Q4_1

  • Fix test-backend-ops: Treat quantized tensors as weights

  • Add NPU Q4_0 support

  • NPU perf: eliminate zp

  • Dequantize q4_1 q4_k q6_k for NPU

  • Add custom quant type: q8_1_c, q4_0_128

  • Set m_is_static=false as default in decoder

  • Simplify translation of get_rows

  • Fix after rebasing

  • Improve debug util; Eliminate nop ReshapeReshape

  • STYLE: make get_types_to_requant a function

  • Support BF16 model

  • Fix NPU compile

  • Workaround for NPU first-token accuracy issue

  • Apply EliminateZP only for npu

  • Add GeGLU

  • Fix Hunyuan

  • Support iSWA

  • Fix NPU accuracy

  • Fix ROPE accuracy when freq_scale != 1

  • Minor: not add attention_size_swa for non-swa model

  • Minor refactor

  • Add Q5_K to support phi-3-q4_k_m

  • Requantize Q6_K (gs16) to gs32 on GPU

  • Fix after rebasing

  • Always apply Eliminate_ZP to fix GPU compile issue on some platforms

  • kvcachefusion support

  • env variable GGML_OPENVINO_DISABLE_SDPA_OPTIMIZATION added

  • Fix for Phi3

  • Fix llama-cli (need to run with --no-warmup)

  • Fix add_sliced_mask; Revert mulmat, softmax; Remove input attention_size, iSWA model not working

  • fix after rebasing

  • Fix llama-3-8b and phi3-mini q4_0 NPU

  • Update to OV-2025.3 and CMakeLists.txt

  • Add OV CI cache

  • Apply CISC review and update CI to OV2025.3

  • Update CI to run OV dep install before build

  • Update OV dockerfile to use OV2025.3 and update build docs

  • Style: use switch in supports_ops

  • Style: middle ptr and ref align, omit optional struct keyword

  • NPU Unify PD (#14)

  • Stateless. Fix llama-cli llama-server

  • Simplify broadcast op in attention

  • Replace get_output_tensor+memcpy with set_output_tensor

  • NPU unify PD. Unify dynamic and static dims

  • Clean placeholders in ggml-openvino.cpp

  • NPU unify PD (handled internally)

  • change graph to 4d, support multi sequences

  • Fix llama-bench

  • Fix NPU

  • Update ggml-decoder.cpp

Hitting an error while compiling on Windows:

error C3861: 'unsetenv': identifier not found

Reason: unsetenv() is a POSIX function and does not exist on Windows, so Visual Studio (MSVC) won't recognize it.

Proposed fix: use _putenv_s() (the Windows equivalent). It is supported by MSVC and, when called with an empty value string, achieves the same effect: it removes the environment variable from the process environment.

This keeps cross-platform compatibility.

  • Update ggml-decoder.cpp

  • Update ggml-decoder.cpp

  • Update ggml-decoder.cpp

  • Update ggml-decoder.cpp

  • Update ggml-decoder.cpp

  • Remove the second decoder for node. Moving the function into the model decoder

  • Fix error for naive

  • NPU prefill chunking

  • NPU fix llama-bench

  • fallback naive run with accuracy issue

  • NPU support llama-perplexity -b 512 --no-warmup

  • Refactor: split ov_graph_compute for dynamic and static

  • remove unused API GgmlOvDecoder::get_output_stride(const std::string & name)

  • minor update due to ov 2025.4

  • remove unused API GgmlOvDecoder::get_output_names()

  • remove unused API get_output_shape(const std::string & name)

  • Modified API GgmlOvDecoder::get_output_type(const std::string & name)

  • Removed API GgmlOvDecoder::get_output_op_params(const std::string & name)

  • Removed API get_output_ggml_tensor(const std::string & name)

  • Removed API m_outputs

  • Removed m_output_names

  • Removed API GgmlOvDecoder::get_input_names()

  • Removed API GgmlOvDecoder::get_input_stride(const std::string& name)

  • Removed API get_input_type

  • Removed API get_input_type

  • Removed API GgmlOvDecoder::get_input_shape(const std::string & name)

  • Removed API GgmlOvDecoder::get_input_op_params(const std::string & name)

  • Fix error for decoder cache

  • Reuse cached decoder

  • GPU remove Q6_K requantization

  • NPU fix wrong model output shape

  • NPU fix q4 perf regression

  • Remove unused variable nodes

  • Fix decoder can_reuse for llama-bench

  • Update build.md for Windows

  • backend buffer: allocate on host

  • Use shared_buffer for GPU NPU; Refactor

  • Add ov_backend_host_buffer; Use cached remote context

  • Put kvcache on GPU

  • Use ggml_aligned_malloc

  • only use remote tensor for kvcache

  • only use remote tensor for kvcache for GPU

  • FIX: use remote tensor from singleton

  • Update build.md to include OpenCL

  • NPU always requant to q4_0_128

  • Optimize symmetric quant weight extraction: use single zp

  • Use Q8_0_C in token embd, lm_head, and for 5 and 6 bits quant

  • Update build.md

  • Support -ctk f32

  • Initial stateful graph support

  • Update ggml/src/ggml-openvino/ggml-decoder.cpp

Co-authored-by: Yamini Nimmagadda yamini.nimmagadda@intel.com

  • code cleanup

  • npu perf fix

  • requant to f16 for Q6 embed on NPU

  • Update ggml/src/ggml-openvino/ggml-decoder.cpp

  • Update ggml/src/ggml-openvino/ggml-openvino-extra.cpp

  • Create OPENVINO.md in llama.cpp backend docs

  • Update OPENVINO.md

  • Update OPENVINO.md

  • Update OPENVINO.md

  • Update build.md

  • Update OPENVINO.md

  • Update OPENVINO.md

  • Update OPENVINO.md

  • kq_mask naming fix

  • Syntax correction for workflows build file

  • Change ov backend buffer is_host to false

  • Fix llama-bench -p -n where p<=256

  • Fix --direct-io 0

  • Don't put kvcache on GPU in stateful mode

  • Remove hardcode names

  • Fix stateful shapes

  • Simplification for stateful and update output shape processing

  • Remove hardcode names

  • Avoid re-compilation in llama-bench

  • Extract zp directly instead of bias

  • Refactor weight tensor processing

  • create_weight_node accept non-ov backend buffer

  • remove changes in llama-graph.cpp

  • stateful masking fix (#38)

Fix for stateful accuracy issues and cl_out_of_resources error in stateful GPU with larger context sizes.

  • Fix test-backend-ops crash glu, get_rows, scale, rms_norm, add

  • hardcoded name handling for rope_freqs.weight

  • Suppress logging and add error handling to allow test-backend-ops to complete

  • Fix MUL_MAT with broadcast; Add unsupported MUL_MAT FLASH_ATTN cases

  • Use bias instead of zp in test-backend-ops

  • Update OV in CI, Add OV CI Tests in GH Actions

  • Temp fix for multithreading bug

  • Update OV CI, address review suggestions

  • fix editorconfig-checker, update docs

  • Fix tabs to spaces for editorconfig-checker

  • fix editorconfig-checker

  • Update docs

  • updated model link to be GGUF model links

  • Remove GGML_CPU_REPACK=OFF

  • Skip permuted ADD and MUL

  • Removed static variables from utils.cpp

  • Removed initialization of a non-existent variable

  • Remove unused structs

  • Fix test-backend-ops for OV GPU

  • unify api calling

  • Update utils.cpp

  • When the dim is dynamic, throw an error; it needs to be static first

  • Add interface compute_model_outputs(), which gets the model outputs by computing node use counts and status in the cgraph, avoiding the use of a flag

  • No need to return

  • Fix test-backend-ops for OV GPU LNL

  • Fix test-thread-safety

  • Use the shape from the infer request when creating the output tensor to avoid issues

  • fix dynamic output shape issue

  • fix issue for the unused node in tests

  • Remove unused lock

  • Add comment

  • Update openvino docs

  • update to OV release version 2026.0

  • add ci ov-gpu self hosted runner

  • fix editorconfig

  • Fix perplexity

  • Rewrite the model inputs finding mechanism (#54)

  • Rewrite the model inputs finding logic

  • Put stateful shape handle in get input shape

  • Put the iteration logic in a function

  • Added ggml-ci-intel-openvino-gpu and doc update

  • .hpp files converted to .h

  • fix ggml-ci-x64-intel-openvino-gpu

  • Fix for stateful execution bug in llama-bench

  • Minor updates after stateful llama-bench fix

  • Update ggml/src/ggml-openvino/utils.cpp

Co-authored-by: Yamini Nimmagadda yamini.nimmagadda@intel.com

  • Remove multiple get_shape calls

  • Bring back mutex into compute

  • Fix VIEW op, which slices the input node

  • Added token_len_per_seq existence check before slicing masks and moved node retrieval inside guarded block to prevent missing-key access

  • Temp. fix for test requant errors

  • Update to OV ggml-ci to low-perf

  • ci : temporary disable "test-llama-archs"

  • ci : cache v4 -> v5, checkout v4 -> v6, fix runner tag

  • docs : update url

  • Fix OV link in docker and Update docs


Co-authored-by: Ravi Panchumarthy ravi.panchumarthy@intel.com
Co-authored-by: Cavus Mustafa mustafa.cavus@intel.com
Co-authored-by: Arshath arshath.ramzan@intel.com
Co-authored-by: XuejunZhai Xuejun.Zhai@intel.com
Co-authored-by: Yamini Nimmagadda yamini.nimmagadda@intel.com
Co-authored-by: Xuejun Zhai Xuejun.Zhai@intel
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
