github ggml-org/llama.cpp b8338


ggml : add OpenVINO backend (#15307)

  • Update build doc

  • Add cgraph tensor output name to OV op name

  • Update openvino build instructions

  • Add initial NPU support

  • draft NPU support version 2: prefill + kvcache

  • NPU support version 2: prefill + kvcache

  • Change due to ggml cgraph changes, not correct yet

  • Change due to ggml cgraph changes, llama-3.2 CPU work

  • Add AMD64 to CMakeLists

  • Change due to ggml cgraph changes, all device work

  • Refactor: clean, fix warning

  • Update clang-format

  • Stateful transformation for CPU and GPU

  • Add SwiGLU

  • Fuse to SDPA

  • Replace Concat with Broadcast in MulMat for GQA

  • Pull out indices creation for kv cache update

  • Refactor: remove past_token_len from extra_inputs

  • Fix Phi3 SwiGLU and SoftMax

  • Pull out sin cos from rope

  • Reduce memory: free ov weights node after graph conversion

  • Fix CPY due to cgraph change

  • Added OpenVINO CI/CD. Updated docs

  • Fix llama-cli

  • Fix Phi3 ROPE; Add test-backend-ops

  • Fix NPU

  • Fix llama-bench; Clang-format

  • Fix llama-perplexity

  • temp. changes for mark decomp

  • matmul in fp32

  • mulmat input conversion fix

  • mulmat type conversion update

  • add mark decomp pass

  • Revert changes in fuse_to_sdpa

  • Update build.md

  • Fix test-backend-ops

  • Skip test-thread-safety; Run ctest only in ci/run.sh

  • Use CiD for NPU

  • Optimize tensor conversion, improve TTFT

  • Support op SET_ROWS

  • Fix NPU

  • Remove CPY

  • Fix test-backend-ops

  • Minor updates for raising PR

  • Perf: RMS fused to OV internal RMS op

  • Fix after rebasing

  • Layout of cache k and cache v are unified: [seq, n_head, head_size]
  • Add CPY and FLASH_ATTN_EXT, flash attn is not used yet
  • Skip test-backend-ops due to flash attn test crash
  • Add mutex around graph conversion to avoid test-thread-safety failures in the future
  • Update NPU config
  • Update GPU config to disable SDPA opt to make phi-3 run
  • Change openvino device_type to GPU; Enable flash_attn

  • Update supports_buft and supports_op for quantized models

  • Add quant weight conversion functions from genai gguf reader

  • Quant models run with accuracy issue

  • Fix accuracy: disable cpu_repack

  • Fix CI; Disable test-backend-ops

  • Fix Q4_1

  • Fix test-backend-ops: Treat quantized tensors as weights

  • Add NPU Q4_0 support

  • NPU perf: eliminate zp

  • Dequantize q4_1 q4_k q6_k for NPU

  • Add custom quant type: q8_1_c, q4_0_128

  • Set m_is_static=false as default in decoder

  • Simplify translation of get_rows

  • Fix after rebasing

  • Improve debug util; Eliminate nop ReshapeReshape

  • STYLE: make get_types_to_requant a function

  • Support BF16 model

  • Fix NPU compile

  • Workaround for NPU first-token accuracy issue

  • Apply EliminateZP only for npu

  • Add GeGLU

  • Fix Hunyuan

  • Support iSWA

  • Fix NPU accuracy

  • Fix ROPE accuracy when freq_scale != 1

  • Minor: not add attention_size_swa for non-swa model

  • Minor refactor

  • Add Q5_K to support phi-3-q4_k_m

  • Requantize Q6_K (gs16) to gs32 on GPU

  • Fix after rebasing

  • Always apply Eliminate_ZP to fix GPU compile issue on some platforms

  • kvcachefusion support

  • env variable GGML_OPENVINO_DISABLE_SDPA_OPTIMIZATION added

  • Fix for Phi3

  • Fix llama-cli (need to run with --no-warmup)

  • Fix add_sliced_mask; Revert mulmat, softmax; Remove input attention_size, iSWA model not working

  • fix after rebasing

  • Fix llama-3-8b and phi3-mini q4_0 NPU

  • Update to OV-2025.3 and CMakeLists.txt

  • Add OV CI cache

  • Apply CISC review and update CI to OV2025.3

  • Update CI to run OV dep install before build

  • Update OV dockerfile to use OV2025.3 and update build docs

  • Style: use switch in supports_ops

  • Style: middle ptr and ref align, omit optional struct keyword

  • NPU Unify PD (#14)

  • Stateless. Fix llama-cli llama-server

  • Simplify broadcast op in attention

  • Replace get_output_tensor+memcpy with set_output_tensor

  • NPU unify PD. Unify dynamic and static dims

  • Clean placeholders in ggml-openvino.cpp

  • NPU unify PD (handled internally)

  • change graph to 4d, support multi sequences

  • Fix llama-bench

  • Fix NPU

  • Update ggml-decoder.cpp

Hitting an error while compiling on Windows:

error C3861: 'unsetenv': identifier not found

Reason: unsetenv() is a POSIX function and does not exist on Windows, so Visual Studio (MSVC) won't recognize it.

Proposed fix: use _putenv_s() (the Windows equivalent). It is supported by MSVC and, when called with an empty value string, achieves the same effect: it removes the environment variable from the process environment.

This keeps cross-platform compatibility.

  • Update ggml-decoder.cpp

  • Update ggml-decoder.cpp

  • Update ggml-decoder.cpp

  • Update ggml-decoder.cpp

  • Update ggml-decoder.cpp

  • Remove the second decoder for node. Moving the function into the model decoder

  • Fix error for naive

  • NPU prefill chunking

  • NPU fix llama-bench

  • fallback naive run with accuracy issue

  • NPU support llama-perplexity -b 512 --no-warmup

  • Refactor: split ov_graph_compute for dynamic and static

  • remove unused API GgmlOvDecoder::get_output_stride(const std::string & name)

  • minor update due to ov 2025.4

  • remove unused API GgmlOvDecoder::get_output_names()

  • remove unused API get_output_shape(const std::string & name)

  • Modified API GgmlOvDecoder::get_output_type(const std::string & name)

  • Removed API GgmlOvDecoder::get_output_op_params(const std::string & name)

  • Removed API get_output_ggml_tensor(const std::string & name)

  • Removed API m_outputs

  • Removed m_output_names

  • Removed API GgmlOvDecoder::get_input_names()

  • Removed API GgmlOvDecoder::get_input_stride(const std::string& name)

  • Removed API get_input_type

  • Removed API get_input_type

  • Removed API GgmlOvDecoder::get_input_shape(const std::string & name)

  • Removed API GgmlOvDecoder::get_input_op_params(const std::string & name)

  • Fix error for decoder cache

  • Reuse cached decoder

  • GPU remove Q6_K requantization

  • NPU fix wrong model output shape

  • NPU fix q4 perf regression

  • Remove unused variable nodes

  • Fix decoder can_reuse for llama-bench

  • Update build.md for Windows

  • backend buffer: allocate on host

  • Use shared_buffer for GPU NPU; Refactor

  • Add ov_backend_host_buffer; Use cached remote context

  • Put kvcache on GPU

  • Use ggml_aligned_malloc

  • only use remote tensor for kvcache

  • only use remote tensor for kvcache for GPU

  • FIX: use remote tensor from singleton

  • Update build.md to include OpenCL

  • NPU always requant to q4_0_128

  • Optimize symmetric quant weight extraction: use single zp

  • Use Q8_0_C in token embd, lm_head, and for 5 and 6 bits quant

  • Update build.md

  • Support -ctk f32

  • Initial stateful graph support

  • Update ggml/src/ggml-openvino/ggml-decoder.cpp

Co-authored-by: Yamini Nimmagadda yamini.nimmagadda@intel.com

  • code cleanup

  • npu perf fix

  • requant to f16 for Q6 embed on NPU

  • Update ggml/src/ggml-openvino/ggml-decoder.cpp

  • Update ggml/src/ggml-openvino/ggml-openvino-extra.cpp

  • Create OPENVINO.md in llama.cpp backend docs

  • Update OPENVINO.md

  • Update OPENVINO.md

  • Update OPENVINO.md

  • Update build.md

  • Update OPENVINO.md

  • Update OPENVINO.md

  • Update OPENVINO.md

  • kq_mask naming fix

  • Syntax correction for workflows build file

  • Change ov backend buffer is_host to false

  • Fix llama-bench -p -n where p<=256

  • Fix --direct-io 0

  • Don't put kvcache on GPU in stateful mode

  • Remove hardcode names

  • Fix stateful shapes

  • Simplification for stateful and update output shape processing

  • Remove hardcode names

  • Avoid re-compilation in llama-bench

  • Extract zp directly instead of bias

  • Refactor weight tensor processing

  • create_weight_node accept non-ov backend buffer

  • remove changes in llama-graph.cpp

  • stateful masking fix (#38)

Fix for stateful accuracy issues and cl_out_of_resources error in stateful GPU with larger context sizes.

  • Fix test-backend-ops crash glu, get_rows, scale, rms_norm, add

  • hardcoded name handling for rope_freqs.weight

  • Suppress logging and add error handling to allow test-backend-ops to complete

  • Fix MUL_MAT with broadcast; Add unsupported MUL_MAT FLASH_ATTN cases

  • Use bias instead of zp in test-backend-ops

  • Update OV in CI, Add OV CI Tests in GH Actions

  • Temp fix for multithreading bug

  • Update OV CI, address review suggestions

  • fix editorconfig-checker, update docs

  • Fix tabs to spaces for editorconfig-checker

  • fix editorconfig-checker

  • Update docs

  • updated model link to be GGUF model links

  • Remove GGML_CPU_REPACK=OFF

  • Skip permuted ADD and MUL

  • Removed static variables from utils.cpp

  • Removed initialization of a non-existent variable

  • Remove unused structs

  • Fix test-backend-ops for OV GPU

  • unify api calling

  • Update utils.cpp

  • When the dim is dynamic, throw an error; it needs to be static first

  • Add interface compute_model_outputs(), which gets the model outputs by computing node use counts and status in the cgraph, avoiding the use of a flag

  • No need to return

  • Fix test-backend-ops for OV GPU LNL

  • Fix test-thread-safety

  • Use the shape from the infer request when creating the output tensor to avoid issues

  • fix dynamic output shape issue

  • fix issue for the unused node in tests

  • Remove unused lock

  • Add comment

  • Update openvino docs

  • update to OV release version 2026.0

  • add ci ov-gpu self hosted runner

  • fix editorconfig

  • Fix perplexity

  • Rewrite the model inputs finding mechanism (#54)

  • Rewrite the model inputs finding logic

  • Put stateful shape handle in get input shape

  • Put the iteration logic in a function

  • Added ggml-ci-intel-openvino-gpu and doc update

  • .hpp files converted to .h

  • fix ggml-ci-x64-intel-openvino-gpu

  • Fix for stateful execution bug in llama-bench

  • Minor updates after stateful llama-bench fix

  • Update ggml/src/ggml-openvino/utils.cpp

Co-authored-by: Yamini Nimmagadda yamini.nimmagadda@intel.com

  • Remove multiple get_shape calls

  • Bring back mutex into compute

  • Fix VIEW op, which slices the input node

  • Added token_len_per_seq existence check before slicing masks and moved node retrieval inside guarded block to prevent missing-key access

  • Temp. fix for test requant errors

  • Update to OV ggml-ci to low-perf

  • ci : temporary disable "test-llama-archs"

  • ci : cache v4 -> v5, checkout v4 -> v6, fix runner tag

  • docs : update url

  • Fix OV link in docker and Update docs


Co-authored-by: Ravi Panchumarthy ravi.panchumarthy@intel.com
Co-authored-by: Cavus Mustafa mustafa.cavus@intel.com
Co-authored-by: Arshath arshath.ramzan@intel.com
Co-authored-by: XuejunZhai Xuejun.Zhai@intel.com
Co-authored-by: Yamini Nimmagadda yamini.nimmagadda@intel.com
Co-authored-by: Xuejun Zhai Xuejun.Zhai@intel
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
