Overview
This is a new major release adding integer quantization and partial GPU (NVIDIA) support.
Integer quantization
This allows the ggml Whisper models to be converted from the default 16-bit floating point weights to 4-, 5- or 8-bit integer weights.
The resulting quantized models are smaller in disk size and memory usage and can be processed faster on some architectures. The transcription quality is degraded to some extent - this has not been quantified yet.
- Supported quantization modes: `Q4_0`, `Q4_1`, `Q4_2`, `Q5_0`, `Q5_1`, `Q8_0`
- Implementation details: #540
- Usage instructions: README
- All WASM examples now support `Q5` quantized models: https://whisper.ggerganov.com
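For illustration, a quantized model is used through the same C API as an F16 model - only the model path changes. Below is a minimal sketch (the model file name is hypothetical and the audio buffer is a silent placeholder; a real application would load 16 kHz mono PCM):

```c
// Minimal sketch (not from the release): load a quantized Whisper model and run
// transcription through the unchanged C API. The model path is hypothetical and
// the audio is a silent placeholder buffer.
#include <stdio.h>
#include <stdlib.h>

#include "whisper.h"

int main(void) {
    // a model previously converted with one of the new quantization modes (e.g. Q5_0)
    struct whisper_context * ctx = whisper_init_from_file("models/ggml-base.en-q5_0.bin");
    if (!ctx) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    // placeholder: 1 second of 16 kHz mono PCM (all zeros) - replace with real audio
    const int n_samples = WHISPER_SAMPLE_RATE;
    float * pcm = calloc(n_samples, sizeof(float));

    struct whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);

    if (whisper_full(ctx, params, pcm, n_samples) == 0) {
        for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
            printf("%s\n", whisper_full_get_segment_text(ctx, i));
        }
    }

    free(pcm);
    whisper_free(ctx);
    return 0;
}
```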
Here is a quantitative evaluation of the different quantization modes applied to the LLaMA and RWKV large language models. These results can give an impression of the expected quality, size and speed of quantized Whisper models:
LLaMA quantization (measured on M1 Pro)
Model | Measure | F16 | Q4_0 | Q4_1 | Q4_2 | Q5_0 | Q5_1 | Q8_0 |
---|---|---|---|---|---|---|---|---|
7B | perplexity | 5.9565 | 6.2103 | 6.1286 | 6.1698 | 6.0139 | 5.9934 | 5.9571 |
7B | file size | 13.0G | 4.0G | 4.8G | 4.0G | 4.4G | 4.8G | 7.1G |
7B | ms/tok @ 4 threads | 128 | 56 | 61 | 84 | 91 | 95 | 75 |
7B | ms/tok @ 8 threads | 128 | 47 | 55 | 48 | 53 | 59 | 75 |
7B | bits/weight | 16.0 | 5.0 | 6.0 | 5.0 | 5.5 | 6.0 | 9.0 |
13B | perplexity | 5.2455 | 5.3748 | 5.3471 | 5.3433 | 5.2768 | 5.2582 | 5.2458 |
13B | file size | 25.0G | 7.6G | 9.1G | 7.6G | 8.4G | 9.1G | 14G |
13B | ms/tok @ 4 threads | 239 | 104 | 113 | 160 | 176 | 185 | 141 |
13B | ms/tok @ 8 threads | 240 | 85 | 99 | 97 | 108 | 117 | 147 |
13B | bits/weight | 16.0 | 5.0 | 6.0 | 5.0 | 5.5 | 6.0 | 9.0 |
ref: https://github.com/ggerganov/llama.cpp#quantization
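The bits/weight numbers follow directly from the block layout of each format. As a rough sanity check (assuming the ggml block layouts used at the time - e.g. `Q4_0` packing 32 weights at 4 bits each plus one 32-bit scale per block, and `Q8_0` packing 32 weights at 8 bits plus a 32-bit scale):

$$
\text{Q4\_0}: \frac{32 \cdot 4 + 32}{32} = 5.0 \ \text{bits/weight}, \qquad \text{Q8\_0}: \frac{32 \cdot 8 + 32}{32} = 9.0 \ \text{bits/weight}
$$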
RWKV quantization
Format | Perplexity (169M) | Latency, ms (1.5B) | File size, GB (1.5B) |
---|---|---|---|
Q4_0 | 17.507 | 76 | 1.53 |
Q4_1 | 17.187 | 72 | 1.68 |
Q4_2 | 17.060 | 85 | 1.53 |
Q5_0 | 16.194 | 78 | 1.60 |
Q5_1 | 15.851 | 81 | 1.68 |
Q8_0 | 15.652 | 89 | 2.13 |
FP16 | 15.623 | 117 | 2.82 |
FP32 | 15.623 | 198 | 5.64 |
ref: ggerganov/ggml#89 (comment)
This feature was made possible thanks to the many contributions to the llama.cpp project: https://github.com/users/ggerganov/projects/2
GPU support via cuBLAS
Using cuBLAS mainly improves Encoder inference speed. I haven't done proper timings, but one can expect at least 2-3 times faster Encoder evaluation on modern NVIDIA GPUs compared to CPU-only processing. Feel free to post your Encoder benchmarks in issue #89.
This is another feature made possible by the llama.cpp project. Special recognition to @slaren for putting almost all of this work together.
This release remains in "beta" stage as I haven't verified that everything works as expected.
What's Changed
- Updated escape_double_quotes() Function by @tauseefmohammed2 in #776
- examples : add missing #include by @pH5 in #798
- Flush upon finishing inference by @tarasglek in #811
- Escape quotes in csv output by @laytan in #815
- C++11style by @wuyudi in #768
- Optionally allow a Core ML build of Whisper to work with or without Core ML models by @Canis-UK in #812
- add some tips about in the readme of the android project folder by @Zolliner in #816
- whisper: Use correct seek_end when offset is used by @ThijsRay in #833
- ggml : fix 32-bit ARM NEON by @ggerganov in #836
- Add CUDA support via cuBLAS by @ggerganov in #834
- Integer quantisation support by @ggerganov in #540
New Contributors
- @tauseefmohammed2 made their first contribution in #776
- @pH5 made their first contribution in #798
- @tarasglek made their first contribution in #811
- @laytan made their first contribution in #815
- @wuyudi made their first contribution in #768
- @Canis-UK made their first contribution in #812
- @Zolliner made their first contribution in #816
- @ThijsRay made their first contribution in #833
Full Changelog: v1.3.0...v1.4.0