ggml : add OpenVINO backend (#15307)
-
Update build doc
-
Add cgraph tensor output name to OV op name
-
Update openvino build instructions
-
Add initial NPU support
-
draft NPU support version 2: prefill + kvcache
-
NPU support version 2: prefill + kvcache
-
Change due to ggml cgraph changes, not correct yet
-
Change due to ggml cgraph changes, llama-3.2 CPU work
-
Add AMD64 to CMakeLists
-
Change due to ggml cgraph changes, all device work
-
Refactor: clean, fix warning
-
Update clang-format
-
Stateful transformation for CPU GPU
-
Add SwiGLU
-
Fuse to SDPA
-
Replace Concat with Broadcast in MulMat for GQA
-
Pull out indices creation for kv cache update
-
Refactor: remove past_token_len from extra_inputs
-
Fix Phi3 SwiGLU and SoftMax
-
Pull out sin cos from rope
-
Reduce memory: free ov weights node after graph conversion
-
Fix CPY due to cgraph change
-
Added OpenVINO CI/CD. Updated docs
-
Fix llama-cli
-
Fix Phi3 ROPE; Add test-backend-ops
-
Fix NPU
-
Fix llama-bench; Clang-format
-
Fix llama-perplexity
-
temp. changes for mark decomp
-
matmul in fp32
-
mulmat input conversion fix
-
mulmat type conversion update
-
add mark decomp pass
-
Revert changes in fuse_to_sdpa
-
Update build.md
-
Fix test-backend-ops
-
Skip test-thread-safety; Run ctest only in ci/run.sh
-
Use CI/CD for NPU
-
Optimize tensor conversion, improve TTFT
-
Support op SET_ROWS
-
Fix NPU
-
Remove CPY
-
Fix test-backend-ops
-
Minor updates for raising PR
-
Perf: RMS fused to OV internal RMS op
-
Fix after rebasing
- Layout of cache k and cache v are unified: [seq, n_head, head_size]
- Add CPY and FLASH_ATTN_EXT, flash attn is not used yet
- Skip test-backend-ops due to flash attn test crash
- Add mutex around graph conversion to avoid test-thread-safety failures in the future
- Update NPU config
- Update GPU config to disable SDPA opt to make phi-3 run
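
The unified cache layout above can be sketched as an indexing helper. This is a minimal sketch only: the helper name and the assumption that head_size is the fastest-varying dimension are mine, not taken from the commits.

```cpp
#include <cstddef>

// Sketch (hypothetical helper): flat offset into a k/v cache tensor stored
// with the unified [seq, n_head, head_size] layout, assuming head_size
// varies fastest, then head, then sequence position.
static size_t cache_offset(size_t seq, size_t head, size_t dim,
                           size_t n_head, size_t head_size) {
    return (seq * n_head + head) * head_size + dim;
}
```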
-
Change openvino device_type to GPU; Enable flash_attn
-
Update supports_buft and supports_op for quantized models
-
Add quant weight conversion functions from genai gguf reader
-
Quant models run with accuracy issue
-
Fix accuracy: disable cpu_repack
-
Fix CI; Disable test-backend-ops
-
Fix Q4_1
-
Fix test-backend-ops: Treat quantized tensors as weights
-
Add NPU Q4_0 support
-
NPU perf: eliminate zp
-
Dequantize q4_1 q4_k q6_k for NPU
-
Add custom quant type: q8_1_c, q4_0_128
-
Set m_is_static=false as default in decoder
-
Simplify translation of get_rows
-
Fix after rebasing
-
Improve debug util; Eliminate nop ReshapeReshape
-
STYLE: make get_types_to_requant a function
-
Support BF16 model
-
Fix NPU compile
-
Workaround for NPU 1st-token accuracy issue
-
Apply EliminateZP only for npu
-
Add GeGLU
-
Fix Hunyuan
-
Support iSWA
-
Fix NPU accuracy
-
Fix ROPE accuracy when freq_scale != 1
-
Minor: don't add attention_size_swa for non-swa model
-
Minor refactor
-
Add Q5_K to support phi-3-q4_k_m
-
Requantize Q6_K (gs16) to gs32 on GPU
-
Fix after rebasing
-
Always apply Eliminate_ZP to fix GPU compile issue on some platforms
-
kvcachefusion support
-
env variable GGML_OPENVINO_DISABLE_SDPA_OPTIMIZATION added
-
Fix for Phi3
-
Fix llama-cli (need to run with --no-warmup)
-
Fix add_sliced_mask; Revert mulmat, softmax; Remove input attention_size, iSWA model not working
-
fix after rebasing
-
Fix llama-3-8b and phi3-mini q4_0 NPU
-
Update to OV-2025.3 and CMakeLists.txt
-
Add OV CI cache
-
Apply CISC review and update CI to OV2025.3
-
Update CI to run OV dep install before build
-
Update OV dockerfile to use OV2025.3 and update build docs
-
Style: use switch in supports_ops
-
Style: middle ptr and ref align, omit optional struct keyword
-
NPU Unify PD (#14)
-
Stateless. Fix llama-cli llama-server
-
Simplify broadcast op in attention
-
Replace get_output_tensor+memcpy with set_output_tensor
-
NPU unify PD. Unify dynamic and static dims
-
Clean placeholders in ggml-openvino.cpp
-
NPU unify PD (handled internally)
-
change graph to 4d, support multiple sequences
-
Fix llama-bench
-
Fix NPU
-
Update ggml-decoder.cpp
Hitting an error while compiling on Windows:
error C3861: 'unsetenv': identifier not found
Reason: unsetenv() is a POSIX function; it doesn't exist on Windows, so Visual Studio (MSVC) won't recognize it.
Proposed fix: use _putenv_s() (the Windows equivalent).
This is supported by MSVC and achieves the same effect: it removes the environment variable from the process environment.
This keeps cross-platform compatibility.
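
The proposed fix can be sketched as a small portability wrapper. This is a sketch only; the helper name `portable_unsetenv` is hypothetical and not from the PR.

```cpp
#include <cstdlib>

// Sketch of a portable unsetenv wrapper (hypothetical name):
// unsetenv() is POSIX-only, so on Windows/MSVC we fall back to
// _putenv_s() with an empty value, which removes the variable
// from the process environment.
static int portable_unsetenv(const char *name) {
#ifdef _WIN32
    return _putenv_s(name, "");  // empty value deletes the variable (MSVC CRT)
#else
    return unsetenv(name);
#endif
}
```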
-
Update ggml-decoder.cpp
-
Update ggml-decoder.cpp
-
Update ggml-decoder.cpp
-
Update ggml-decoder.cpp
-
Update ggml-decoder.cpp
-
Remove the second decoder for node; move the function into the model decoder
-
Fix error for naive
-
NPU prefill chunking
-
NPU fix llama-bench
-
fallback naive run with accuracy issue
-
NPU support llama-perplexity -b 512 --no-warmup
-
Refactor: split ov_graph_compute for dynamic and static
-
remove unused API GgmlOvDecoder::get_output_stride(const std::string & name)
-
minor update due to ov 2025.4
-
remove unused API GgmlOvDecoder::get_output_names()
-
remove unused API get_output_shape(const std::string & name)
-
Modified API GgmlOvDecoder::get_output_type(const std::string & name)
-
Removed API GgmlOvDecoder::get_output_op_params(const std::string & name)
-
Removed API get_output_ggml_tensor(const std::string & name)
-
Removed API m_outputs
-
Removed m_output_names
-
Removed API GgmlOvDecoder::get_input_names()
-
Removed API GgmlOvDecoder::get_input_stride(const std::string& name)
-
Removed API get_input_type
-
Removed API get_input_type
-
Removed API GgmlOvDecoder::get_input_shape(const std::string & name)
-
Removed API GgmlOvDecoder::get_input_op_params(const std::string & name)
-
Fix error for decoder cache
-
Reuse cached decoder
-
GPU remove Q6_K requantization
-
NPU fix wrong model output shape
-
NPU fix q4 perf regression
-
Remove unused variable nodes
-
Fix decoder can_reuse for llama-bench
-
Update build.md for Windows
-
backend buffer: allocate on host
-
Use shared_buffer for GPU NPU; Refactor
-
Add ov_backend_host_buffer; Use cached remote context
-
Put kvcache on GPU
-
Use ggml_aligned_malloc
-
only use remote tensor for kvcache
-
only use remote tensor for kvcache for GPU
-
FIX: use remote tensor from singleton
-
Update build.md to include OpenCL
-
NPU always requant to q4_0_128
-
Optimize symmetric quant weight extraction: use single zp
-
Use Q8_0_C in token embd, lm_head, and for 5 and 6 bits quant
-
Update build.md
-
Support -ctk f32
-
Initial stateful graph support
-
Update ggml/src/ggml-openvino/ggml-decoder.cpp
Co-authored-by: Yamini Nimmagadda yamini.nimmagadda@intel.com
-
code cleanup
-
npu perf fix
-
requant to f16 for Q6 embed on NPU
-
Update ggml/src/ggml-openvino/ggml-decoder.cpp
-
Update ggml/src/ggml-openvino/ggml-openvino-extra.cpp
-
Create OPENVINO.md in llama.cpp backend docs
-
Update OPENVINO.md
-
Update OPENVINO.md
-
Update OPENVINO.md
-
Update build.md
-
Update OPENVINO.md
-
Update OPENVINO.md
-
Update OPENVINO.md
-
kq_mask naming fix
-
Syntax correction for workflows build file
-
Change ov backend buffer is_host to false
-
Fix llama-bench -p -n where p<=256
-
Fix --direct-io 0
-
Don't put kvcache on GPU in stateful mode
-
Remove hardcode names
-
Fix stateful shapes
-
Simplification for stateful and update output shape processing
-
Remove hardcode names
-
Avoid re-compilation in llama-bench
-
Extract zp directly instead of bias
-
Refactor weight tensor processing
-
create_weight_node accept non-ov backend buffer
-
remove changes in llama-graph.cpp
-
stateful masking fix (#38)
Fix for stateful accuracy issues and cl_out_of_resources error in stateful GPU with larger context sizes.
-
Fix test-backend-ops crash glu, get_rows, scale, rms_norm, add
-
hardcoded name handling for rope_freqs.weight
-
Suppress logging and add error handling to allow test-backend-ops to complete
-
Fix MUL_MAT with broadcast; Add unsupported MUL_MAT FLASH_ATTN cases
-
Use bias instead of zp in test-backend-ops
-
Update OV in CI, Add OV CI Tests in GH Actions
-
Temp fix for multithreading bug
-
Update OV CI, fix review suggestions.
-
fix editorconfig-checker, update docs
-
Fix tabs to spaces for editorconfig-checker
-
fix editorconfig-checker
-
Update docs
-
updated model link to be GGUF model links
-
Remove GGML_CPU_REPACK=OFF
-
Skip permuted ADD and MUL
-
Removed static variables from utils.cpp
-
Removed initializing non-existing variable
-
Remove unused structs
-
Fix test-backend-ops for OV GPU
-
unify api calling
-
Update utils.cpp
-
When the dim is dynamic, throw an error; it needs to be static first
-
Add interface compute_model_outputs(), which gets the model outputs by computing node use counts & statuses in the cgraph, avoiding the flag usage
-
No need to return
-
Fix test-backend-ops for OV GPU LNL
-
Fix test-thread-safety
-
Use the shape from the infer request when creating the output tensor to avoid issues
-
fix dynamic output shape issue
-
fix issue for the unused node in tests
-
Remove unused lock
-
Add comment
-
Update openvino docs
-
update to OV release version 2026.0
-
add ci ov-gpu self hosted runner
-
fix editorconfig
-
Fix perplexity
-
Rewrite the model inputs finding mechanism (#54)
-
Rewrite the model inputs finding logic
-
Put stateful shape handling in get input shape
-
Put the iteration logic in a function
-
Added ggml-ci-intel-openvino-gpu and doc update
-
.hpp files converted to .h
-
fix ggml-ci-x64-intel-openvino-gpu
-
Fix for stateful execution bug in llama-bench
-
Minor updates after stateful llama-bench fix
-
Update ggml/src/ggml-openvino/utils.cpp
Co-authored-by: Yamini Nimmagadda yamini.nimmagadda@intel.com
-
Remove multiple get_shape calls
-
Bring back mutex into compute
-
Fix VIEW op, which slices the input node
-
Added token_len_per_seq existence check before slicing masks and moved node retrieval inside guarded block to prevent missing-key access
-
Temp. fix for test requant errors
-
Update to OV ggml-ci to low-perf
-
ci : temporary disable "test-llama-archs"
-
ci : cache v4 -> v5, checkout v4 -> v6, fix runner tag
-
docs : update url
-
Fix OV link in docker and Update docs
Co-authored-by: Ravi Panchumarthy ravi.panchumarthy@intel.com
Co-authored-by: Cavus Mustafa mustafa.cavus@intel.com
Co-authored-by: Arshath arshath.ramzan@intel.com
Co-authored-by: XuejunZhai Xuejun.Zhai@intel.com
Co-authored-by: Yamini Nimmagadda yamini.nimmagadda@intel.com
Co-authored-by: Xuejun Zhai Xuejun.Zhai@intel
Co-authored-by: Georgi Gerganov ggerganov@gmail.com