Overview
This is a new major release adding integer quantization and partial GPU (NVIDIA) support.
Integer quantization
This allows the ggml Whisper models to be converted from the default 16-bit floating point weights to 4-, 5- or 8-bit integer weights.
The resulting quantized models are smaller in disk size and memory usage and can be processed faster on some architectures. The transcription quality is degraded to some extent - this has not been quantified yet.
- Supported quantization modes: `Q4_0`, `Q4_1`, `Q4_2`, `Q5_0`, `Q5_1`, `Q8_0`
- Implementation details: #540
- Usage instructions: README
- All WASM examples now support `Q5` quantized models: https://whisper.ggerganov.com
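For illustration, a quantized model is used through the same C API as an F16 model - only the model path changes. Below is a minimal sketch (the model file name is hypothetical and the audio buffer is a silent placeholder; a real application would load 16 kHz mono PCM):

```c
// Minimal sketch (not from the release): load a quantized Whisper model and run
// transcription through the unchanged C API. The model path is hypothetical and
// the audio is a silent placeholder buffer.
#include <stdio.h>
#include <stdlib.h>

#include "whisper.h"

int main(void) {
    // a model previously converted with one of the new quantization modes (e.g. Q5_0)
    struct whisper_context * ctx = whisper_init_from_file("models/ggml-base.en-q5_0.bin");
    if (!ctx) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    // placeholder: 1 second of 16 kHz mono PCM (all zeros) - replace with real audio
    const int n_samples = WHISPER_SAMPLE_RATE;
    float * pcm = calloc(n_samples, sizeof(float));

    struct whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);

    if (whisper_full(ctx, params, pcm, n_samples) == 0) {
        for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
            printf("%s\n", whisper_full_get_segment_text(ctx, i));
        }
    }

    free(pcm);
    whisper_free(ctx);
    return 0;
}
```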
Here is a quantitative evaluation of the different quantization modes applied to the LLaMA and RWKV large language models. These results can give an impression of the expected quality, size and speed of quantized Whisper models:
LLaMA quantization (measured on M1 Pro)
Model | Measure | F16 | Q4_0 | Q4_1 | Q4_2 | Q5_0 | Q5_1 | Q8_0 |
---|---|---|---|---|---|---|---|---|
7B | perplexity | 5.9565 | 6.2103 | 6.1286 | 6.1698 | 6.0139 | 5.9934 | 5.9571 |
7B | file size | 13.0G | 4.0G | 4.8G | 4.0G | 4.4G | 4.8G | 7.1G |
7B | ms/tok @ 4 threads | 128 | 56 | 61 | 84 | 91 | 95 | 75 |
7B | ms/tok @ 8 threads | 128 | 47 | 55 | 48 | 53 | 59 | 75 |
7B | bits/weight | 16.0 | 5.0 | 6.0 | 5.0 | 5.5 | 6.0 | 9.0 |
13B | perplexity | 5.2455 | 5.3748 | 5.3471 | 5.3433 | 5.2768 | 5.2582 | 5.2458 |
13B | file size | 25.0G | 7.6G | 9.1G | 7.6G | 8.4G | 9.1G | 14G |
13B | ms/tok @ 4 threads | 239 | 104 | 113 | 160 | 176 | 185 | 141 |
13B | ms/tok @ 8 threads | 240 | 85 | 99 | 97 | 108 | 117 | 147 |
13B | bits/weight | 16.0 | 5.0 | 6.0 | 5.0 | 5.5 | 6.0 | 9.0 |
ref: https://github.com/ggerganov/llama.cpp#quantization
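The bits/weight numbers follow directly from the block layout of each format. As a rough sanity check (assuming the ggml block layouts used at the time - e.g. `Q4_0` packing 32 weights at 4 bits each plus one 32-bit scale per block, and `Q8_0` packing 32 weights at 8 bits plus a 32-bit scale):

$$
\text{Q4\_0}: \frac{32 \cdot 4 + 32}{32} = 5.0 \ \text{bits/weight}, \qquad \text{Q8\_0}: \frac{32 \cdot 8 + 32}{32} = 9.0 \ \text{bits/weight}
$$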
RWKV quantization
Format | Perplexity (169M) | Latency, ms (1.5B) | File size, GB (1.5B) |
---|---|---|---|
Q4_0 | 17.507 | 76 | 1.53 |
Q4_1 | 17.187 | 72 | 1.68 |
Q4_2 | 17.060 | 85 | 1.53 |
Q5_0 | 16.194 | 78 | 1.60 |
Q5_1 | 15.851 | 81 | 1.68 |
Q8_0 | 15.652 | 89 | 2.13 |
FP16 | 15.623 | 117 | 2.82 |
FP32 | 15.623 | 198 | 5.64 |
ref: ggerganov/ggml#89 (comment)
This feature was made possible thanks to the many contributions to the llama.cpp project: https://github.com/users/ggerganov/projects/2
GPU support via cuBLAS
Using cuBLAS mainly improves Encoder inference speed. I haven't done proper timings, but one can expect at least 2-3 times faster Encoder evaluation on modern NVIDIA GPUs compared to CPU-only processing. Feel free to post your Encoder benchmarks in issue #89.
This is another feature made possible by the llama.cpp project. Special recognition to @slaren for putting almost all of this work together.
This release remains in "beta" stage as I haven't verified that everything works as expected.
What's Changed
- Updated escape_double_quotes() Function by @tauseefmohammed2 in #776
- examples : add missing #include by @pH5 in #798
- Flush upon finishing inference by @tarasglek in #811
- Escape quotes in csv output by @laytan in #815
- C++11style by @wuyudi in #768
- Optionally allow a Core ML build of Whisper to work with or without Core ML models by @Canis-UK in #812
- add some tips about in the readme of the android project folder by @Zolliner in #816
- whisper: Use correct seek_end when offset is used by @ThijsRay in #833
- ggml : fix 32-bit ARM NEON by @ggerganov in #836
- Add CUDA support via cuBLAS by @ggerganov in #834
- Integer quantisation support by @ggerganov in #540
New Contributors
- @tauseefmohammed2 made their first contribution in #776
- @pH5 made their first contribution in #798
- @tarasglek made their first contribution in #811
- @laytan made their first contribution in #815
- @wuyudi made their first contribution in #768
- @Canis-UK made their first contribution in #812
- @Zolliner made their first contribution in #816
- @ThijsRay made their first contribution in #833
Full Changelog: v1.3.0...v1.4.0