github ggerganov/whisper.cpp v1.5.0


Overview

This major release includes the following changes:

  • Full GPU processing of the Encoder and the Decoder with CUDA and Metal is now supported
  • Efficient beam-search implementation via batched decoding and unified KV cache
  • Full quantization support of all available ggml quantization types
  • Support for grammar constrained sampling
  • Support for Distil Whisper models
  • Support for Whisper Large-v3

and more

Full GPU support

On Apple Silicon, GPU support has been available to a large extent since 15 Sep. However, part of the Encoder was still being executed on the CPU due to the lack of MSL kernels for the convolution operations. These kernels are now available, resulting in an additional speed-up of the Encoder in this release:

[image: Encoder performance on Apple M1 Max - before and after (plot by @dreness)]

For NVIDIA hardware, the entire computation can now be offloaded to the GPU, which results in a significant performance boost. For a detailed performance breakdown, check out the Benchmarks section below.

The GPU processing on Apple Silicon is enabled by default, while for NVIDIA you need to build with WHISPER_CUBLAS=1:

# Apple Silicon
make

# NVIDIA
WHISPER_CUBLAS=1 make

Implementation: #1472

Special credits to: @FSSRepo, @slaren

Batched decoding + efficient Beam Search

At last, whisper.cpp supports efficient Beam Search decoding. The missing piece was an implementation of batched decoding, which now closely follows the unified KV cache idea from llama.cpp. On modern NVIDIA hardware, decoding with 5 beams is as fast as with 1 beam thanks to the large amount of computing power available. With Metal, 5 beams are somewhat slower than 1 beam, but still significantly faster than the old naive implementation, where 5 beams took 5x the time of a single batch.

Beam Search is now enabled by default in whisper.cpp to match the OG implementation of OpenAI Whisper. For more performance details, check out the Benchmarks section below.

Implementation: #1486

Quantization support

All ggml quantization types are now supported, and quantization mixtures for the Whisper model can be implemented. It is still unclear how quality is affected by quantization - this is an interesting area to explore in the future.

Grammar sampling

The decoder output can now be constrained with a GBNF grammar. This can be a useful technique for further improving transcription quality in situations where the set of possible phrases is limited.

whisper-chess.mp4

Implementation: #1229

Special credits to @ejones

Distil Whisper

Recently, Distil Whisper models have been released: https://huggingface.co/distil-whisper

whisper.cpp offers support for these models, although it still lacks a full implementation of the proposed chunking strategy. Performance details for distilled models are included in the Benchmarks section below.

Implementation: #1424

Whisper Large-v3

Recently, OpenAI released a new version 3 of the Large model: openai/whisper#1761

Implementation: #1444

Benchmarks

Below is a breakdown of the performance of whisper.cpp on Apple Silicon, NVIDIA GPUs and CPU. The tables show the Encoder and Decoder speed in ms/tok. The Dec. column corresponds to batch size 1. The Bch5 column corresponds to batch size 5. The PP column corresponds to batch size 128.

For optimal Beam Search performance, the Bch5 number should be 5 times smaller than Dec.

| Hw | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M2 Ultra | METAL | tiny | 1 | 11.14 | 1.40 | 0.49 | 0.01 | ccc85b4 |
| M2 Ultra | METAL | tiny-q5_0 | 1 | 11.51 | 1.41 | 0.52 | 0.01 | ccc85b4 |
| M2 Ultra | METAL | tiny-q5_1 | 1 | 12.21 | 1.41 | 0.52 | 0.01 | ccc85b4 |
| M2 Ultra | METAL | base | 1 | 20.21 | 2.05 | 0.77 | 0.02 | ccc85b4 |
| M2 Ultra | METAL | base-q5_0 | 1 | 19.89 | 1.96 | 0.81 | 0.02 | ccc85b4 |
| M2 Ultra | METAL | base-q5_1 | 1 | 20.14 | 2.02 | 0.81 | 0.02 | ccc85b4 |
| M2 Ultra | METAL | small | 1 | 51.01 | 3.97 | 1.74 | 0.05 | ccc85b4 |
| M2 Ultra | METAL | small-q5_0 | 1 | 56.86 | 4.09 | 1.85 | 0.06 | ccc85b4 |
| M2 Ultra | METAL | small-q5_1 | 1 | 56.81 | 4.14 | 1.85 | 0.06 | ccc85b4 |
| M2 Ultra | METAL | medium | 1 | 141.21 | 8.47 | 3.98 | 0.13 | ccc85b4 |
| M2 Ultra | METAL | medium-q5_0 | 1 | 160.56 | 8.27 | 4.18 | 0.14 | ccc85b4 |
| M2 Ultra | METAL | medium-q5_1 | 1 | 160.52 | 8.40 | 4.15 | 0.14 | ccc85b4 |
| M2 Ultra | METAL | medium-dis | 1 | 128.14 | 1.13 | 0.43 | 0.02 | ccc85b4 |
| M2 Ultra | METAL | large-v2 | 1 | 248.73 | 11.96 | 6.08 | 0.22 | ccc85b4 |
| M2 Ultra | METAL | large-v2-q5_0 | 1 | 286.31 | 11.99 | 6.60 | 0.26 | ccc85b4 |
| M2 Ultra | METAL | large-v2-q5_1 | 1 | 284.56 | 12.42 | 6.47 | 0.26 | ccc85b4 |
| M2 Ultra | METAL | large-v2-dis | 1 | 224.31 | 1.26 | 0.49 | 0.02 | ccc85b4 |

| Hw | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M2 Ultra | COREML METAL | tiny | 1 | 7.60 | 1.41 | 0.50 | 0.01 | ccc85b4 |
| M2 Ultra | COREML METAL | base | 1 | 11.90 | 2.07 | 0.78 | 0.02 | ccc85b4 |
| M2 Ultra | COREML METAL | small | 1 | 32.19 | 4.10 | 1.78 | 0.05 | ccc85b4 |
| M2 Ultra | COREML METAL | medium | 1 | 94.43 | 8.40 | 3.89 | 0.12 | ccc85b4 |
| M2 Ultra | COREML METAL | large-v2 | 1 | 179.78 | 12.12 | 6.07 | 0.22 | ccc85b4 |

| Hw | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NVIDIA V100 | BLAS CUDA | tiny | 1 | 8.84 | 1.62 | 0.33 | 0.02 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | tiny-q5_0 | 1 | 8.43 | 1.19 | 0.31 | 0.02 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | tiny-q5_1 | 1 | 8.41 | 1.19 | 0.29 | 0.02 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | base | 1 | 14.79 | 2.31 | 0.46 | 0.03 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | base-q5_0 | 1 | 15.05 | 1.66 | 0.44 | 0.03 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | base-q5_1 | 1 | 15.01 | 1.68 | 0.46 | 0.03 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | small | 1 | 40.30 | 4.37 | 0.88 | 0.05 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | small-q5_0 | 1 | 41.17 | 3.11 | 0.94 | 0.05 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | small-q5_1 | 1 | 41.12 | 3.11 | 0.82 | 0.05 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | medium | 1 | 104.93 | 10.06 | 1.77 | 0.11 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | medium-q5_0 | 1 | 107.11 | 6.13 | 2.07 | 0.12 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | medium-q5_1 | 1 | 107.91 | 6.21 | 1.77 | 0.12 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | medium-dis | 1 | 103.45 | 1.11 | 0.24 | 0.02 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | large-v2 | 1 | 171.55 | 15.76 | 2.62 | 0.17 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | large-v2-q5_0 | 1 | 176.27 | 8.61 | 3.17 | 0.19 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | large-v2-q5_1 | 1 | 176.23 | 8.67 | 2.59 | 0.19 | ccc85b4 |

| Hw | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AMD Ryzen 9 5950X | AVX2 | tiny | 8 | 197.47 | 1.22 | 0.44 | 0.25 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | tiny-q5_0 | 8 | 222.92 | 0.87 | 0.45 | 0.30 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | tiny-q5_1 | 8 | 221.25 | 0.89 | 0.45 | 0.30 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | base | 8 | 427.14 | 3.11 | 0.88 | 0.43 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | base-q5_0 | 8 | 474.96 | 1.41 | 0.72 | 0.51 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | base-q5_1 | 8 | 485.05 | 1.48 | 0.73 | 0.52 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | small | 8 | 1470.51 | 11.70 | 2.89 | 1.21 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | small-q5_0 | 8 | 1700.43 | 5.48 | 1.98 | 1.41 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | small-q5_1 | 8 | 1719.03 | 5.79 | 2.02 | 1.42 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | medium | 8 | 4417.70 | 35.13 | 8.14 | 3.24 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | medium-q5_0 | 8 | 5335.77 | 17.44 | 5.35 | 3.92 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | medium-q5_1 | 8 | 5372.26 | 18.36 | 5.42 | 3.88 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | medium-dis | 8 | 4070.25 | 4.86 | 1.16 | 0.53 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | large-v2 | 8 | 8179.09 | 66.89 | 15.45 | 5.88 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | large-v2-dis | 8 | 7490.45 | 7.06 | 1.63 | 0.70 | ccc85b4 |

API Changes

  • Add struct whisper_context_params

  • Add whisper_log_set

  • Deprecate:

    • whisper_init_from_file
    • whisper_init_from_buffer
    • whisper_init
    • whisper_init_from_file_no_state
    • whisper_init_from_buffer_no_state
    • whisper_init_no_state
  • Add:

    • whisper_init_from_file_with_params
    • whisper_init_from_buffer_with_params
    • whisper_init_with_params
    • whisper_init_from_file_with_params_no_state
    • whisper_init_from_buffer_with_params_no_state
    • whisper_init_with_params_no_state
  • Diff of struct whisper_full_params

     struct whisper_full_params {
         enum whisper_sampling_strategy strategy;
@@ -338,6 +435,7 @@ extern "C" {
 
         bool translate;
         bool no_context;        // do not use past transcription (if any) as initial prompt for the decoder
+        bool no_timestamps;     // do not generate timestamps
         bool single_segment;    // force single segment output (useful for streaming)
         bool print_special;     // print special tokens (e.g. <SOT>, <EOT>, <BEG>, etc.)
         bool print_progress;    // print progress information
@@ -355,8 +453,12 @@ extern "C" {
         // [EXPERIMENTAL] speed-up techniques
         // note: these can significantly reduce the quality of the output
         bool speed_up;          // speed-up the audio by 2x using Phase Vocoder
+        bool debug_mode;        // enable debug_mode provides extra info (eg. Dump log_mel)
         int  audio_ctx;         // overwrite the audio context size (0 = use default)
 
+        // [EXPERIMENTAL] [TDRZ] tinydiarize
+        bool tdrz_enable;       // enable tinydiarize speaker turn detection
+
         // tokens to provide to the whisper decoder as initial prompt
         // these are prepended to any existing text context from a previous call
         const char * initial_prompt;
@@ -365,6 +467,7 @@ extern "C" {
 
         // for auto-detection, set to nullptr, "" or "auto"
         const char * language;
+        bool detect_language;
 
         // common decoding parameters:
         bool suppress_blank;    // ref: https://github.com/openai/whisper/blob/f82bc59f5ea234d4b97fb2860842ed38519f7e65/whisper/decoding.py#L89
@@ -403,11 +506,24 @@ extern "C" {
         whisper_encoder_begin_callback encoder_begin_callback;
         void * encoder_begin_callback_user_data;
 
+        // called each time before ggml computation starts
+        whisper_abort_callback abort_callback;
+        void * abort_callback_user_data;
+
         // called by each decoder to filter obtained logits
         whisper_logits_filter_callback logits_filter_callback;
         void * logits_filter_callback_user_data;
+
+        const whisper_grammar_element ** grammar_rules;
+        size_t                           n_grammar_rules;
+        size_t                           i_start_rule;
+        float                            grammar_penalty;
     };
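
Migrating to the new `_with_params` initializers looks roughly like this (a sketch against the v1.5.0 header; the model path is illustrative):

```c
#include "whisper.h"

int main(void) {
    // the new context params control, among other things, GPU offloading
    struct whisper_context_params cparams = whisper_context_default_params();
    cparams.use_gpu = true; // GPU is enabled by default on Apple Silicon

    // replaces the deprecated whisper_init_from_file()
    struct whisper_context * ctx =
        whisper_init_from_file_with_params("models/ggml-base.en.bin", cparams);
    if (ctx == NULL) {
        return 1;
    }

    // ... run whisper_full() as before ...

    whisper_free(ctx);
    return 0;
}
```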
 

There might be some instability around the API, especially with the existing language bindings. I wasn't able to test everything, so expect some issues and feel free to submit PRs with any kind of fixes that you find.

Highlights and what's next

A lot of the updates in this release are possible thanks to the many contributions to llama.cpp - a huge shoutout to all the contributors and collaborators there!

Regarding future updates to whisper.cpp, I'm looking forward to the following things:

  • Add server example similar to the one in llama.cpp
  • Try to improve Metal's batched decoding performance
  • Look for some interesting applications of the grammar sampling functionality

  • Latest performance of the talk-llama example

    talk-llama-1.mp4

What's Changed

  • Fix quantize bug by @ggerganov in #842
  • whisper.wasm : fix typo in readme by @BaffinLee in #832
  • Adding --session support in examples/talk-llama by @herrera-luis in #845
  • --detect-language mode by @CRD716 in #853
  • talk-llama: updating session prompts load by @herrera-luis in #854
  • CMake/Makefile : CLBlast support as in llama.cpp by @trholding in #862
  • Instruction: Partial OpenCL GPU support via CLBlast by @trholding in #863
  • Add cuBLAS build workflow and fix error causing lines in CMakeLists by @RelatedTitle in #867
  • cmake : fix options disabling AVX and AVX2 flags by @blazingzephyr in #885
  • Added large-v2. Added instructions on converting to GGML. Added --no-… by @cjheath in #874
  • talk-llama: only copy used KV cache in get / set state by @herrera-luis in #890
  • Fix define used for COREML_ALLOW_FALLBACK by @jcsoo in #893
  • coreml : fix memory leak by @ggerganov in #899
  • whisper.objc : enable Core ML in example & fix segmentation fault by @jhen0409 in #910
  • Align --no-timestamps in help to actual behavior by @Miserlou in #908
  • readme : improve Core ML model conversion guidance by @jhen0409 in #915
  • Added support of large-v1 model into CoreML by @abCods in #926
  • Update of Hebrew Language Code: 'iw' to 'he' by @ttv20 in #935
  • java bindings by @nalbion in #931
  • ci: Build with any BLAS compatible library by @akharlamov in #927
  • [DOCS] highlight openblas support in #956
  • Update elevenlabs example to use official python API by @DGdev91 in #837
  • Update README.md by @genevera in #964
  • Feature/java bindings2 by @nalbion in #944
  • Support decode wav file has 2 channels. by @geniusnut in #972
  • README.md: Corrected syntax for markdown link by @LarryBattle in #995
  • Make convert-pt-to-ggml.py backwards compatible with older vocab.json tokenizer files by @akashmjn in #1001
  • Fixing Accidental 'exit(0)' and Ensuring Proper 'return 1' in examples/main/main.cpp whisper_params_parse by @faker2048 in #1002
  • Fix for issue #876 by @burningion in #1012
  • Make cuBLAS compilation compatible with x86 as well as aarch64 by @byte-6174 in #1015
  • feat(golang): improve progress reporting and callback handling by @appleboy in #1024
  • Add support for whisper_full_lang_id() to go bindings by @jaybinks in #1010
  • Add alternative java binding to readme by @GiviMAD in #1029
  • diarization: add diarization support for all current output types by @colinc in #1031
  • Fix cd statements to allow spaces in model path by @roddurd in #1041
  • adding ggml_to_pt script by @simonMoisselin in #1042
  • whisper: Fix build with -Werror=undef by @philn in #1045
  • Fix talk-llama build after ggml sync (commit 5feb0df). by @przemoc in #1049
  • Do not use _GNU_SOURCE gratuitously. by @przemoc in #1027
  • whisper : split_on_word no longer trims by @ggerganov in #1046
  • Updated 'quantize-all.sh' to quantize all downloaded models by @thefinaldegree in #1054
  • Fix talk-llama build on macOS. by @przemoc in #1062
  • whisper : support speaker segmentation (local diarization) of mono audio via tinydiarize by @akashmjn in #1058
  • Minor: updated readme by @mwarnaar in #1064
  • OpenVINO support by @RyanMetcalfeInt8 in #1037
  • go bindings: fix context.Process call in examples by @mvrilo in #1067
  • go: Call SetDuration appropriately by @tmc in #1077
  • Multi platforms CI by @alonfaraj in #1101
  • Add Vim plugin by @AustinMroz in #1131
  • chore: move progress calculation out of whisper.cpp by @geekodour in #1081
  • expose api to let user control log output by @evmar in #1060
  • Add a larger (30min) sample by @vadi2 in #1092
  • Sync opencl compilation fix in ggml by @goncha in #1111
  • README.md: Add OpenVINO support details by @RyanMetcalfeInt8 in #1112
  • Fix MSVC compile error C3688 on non-unicode Windows by @goncha in #1110
  • Now make tests can be called as make tests base.en by @Jerry-Master in #1113
  • Go binding: Implement SetSplitOnWord by @xdrudis in #1114
  • set NVCC -arch flag by cuda version by @alonfaraj in #1115
  • Fix CLBlast build on MacOS by @iceychris in #1120
  • Fixed the issue of OpenBLAS not being enabled on Windows. by @bobqianic in #1128
  • whisper : fix visibility warning of struct whisper_full_params by declaring in advance by @IronBlood in #1124
  • Fix MSVC compile error C3688 by @bobqianic in #1136
  • Add tinydiarization support for streaming by @DMcConnell in #1137
  • quantize : fix load vocab crash when len is 128 by @jhen0409 in #1160
  • Fix AVX etc. under GCC/CMake by @marmistrz in #1174
  • Fix PowerPC build failures introduced in #1174 by @marmistrz in #1196
  • Simplify Makefile by @alonfaraj in #1147
  • Add precalculated values of sin/cos for speeding up FFT by @AlexandrGraschenkov in #1142
  • Make build work on Linux machines supporting AVX1 not AVX2 by @lachesis in #1162
  • Fix OpenBLAS detection under Arch Linux by @marmistrz in #1173
  • Minor fixes by @csukuangfj in #1154
  • New command line option by @jbyunes in #1205
  • whisper.android : migrate from ndk-build to CMake by @JunkFood02 in #1204
  • Significantly improve whisper.cpp inference quality by @bobqianic in #1148
  • whisper : allow whisper_full from mel spectrogram - no audio by @ggerganov in #1214
  • ROCm Port by @ardfork in #1209
  • Improvements to vim plugin and LSP server by @AustinMroz in #1144
  • Detect SSSE3 by @przemoc in #1211
  • ggml : fix compiling when SSE3 is available but not SSSE3 by @przemoc in #1210
  • make : add support for building on DragonFlyBSD/NetBSD/OpenBSD by @przemoc in #1212
  • make : use cpuinfo in MSYS2 to enable x86 ISA extensions on the host by @przemoc in #1216
  • Fix CoreML memleak (fixes #1202) by @denersc in #1218
  • whisper.android : fix cmake multiple libraries build by @jhen0409 in #1224
  • Fix compilation errors incurred by -Werror by @shivamidow in #1227
  • ci : enable java package publishing by @ggerganov in #1228
  • fix cmake commands in README #1225 by @wizardforcel in #1231
  • ggml : sync (ggml-alloc, GPU, eps, etc.) by @ggerganov in #1220
  • make : improve cpuinfo handling on x86 hosts by @przemoc in #1238
  • ggml : sync latest llama.cpp (view_src + alloc improvements) by @ggerganov in #1247
  • Posixify pagesize. by @przemoc in #1251
  • Fix detection of AVX2 on macOS by @didzis in #1250
  • Address ARM's big.LITTLE arch by checking cpu info. by @Digipom in #1254
  • Bump gradle plugin and dependencies + a lint pass by @Digipom in #1255
  • Add quantized models to download-ggml-model.sh by @nchudleigh in #1235
  • Do not use _GNU_SOURCE gratuitously. by @przemoc in #1129
  • ci : upgrade gradle to 2.4.2 by @ggerganov in #1263
  • sync : ggml (HBM + Metal + style) by @ggerganov in #1264
  • ci : try to fix gradle action by @ggerganov in #1265
  • Fixed signing of java artifact using gradle by @nalbion in #1267
  • Faster beam_search sampling by @bobqianic in #1243
  • whisper : fix bench regression by @ggerganov in #1275
  • whisper : Metal and ggml-alloc support by @ggerganov in #1270
  • bench: fix missing include by @nekr0z in #1303
  • ruby : fix build by add missing ggml-alloc by @jhen0409 in #1305
  • Update README.md. Adding missing options, remove --speed-up. by @Sogl in #1306
  • Update README.md by @computerscienceiscool in #1290
  • save the recorded audio to a file by @litongjava in #1310
  • Python benchmark script by @nchudleigh in #1298
  • Minor: fix example talk readme gpt-2 github url by @brunofaustino in #1334
  • Missing speaker turn function in API by @didzis in #1330
  • examples: Move wav_writer from stream.cpp to common.h by @bobqianic in #1317
  • Better abort callback by @mkiol in #1335
  • Add conversion scripts from HuggingFace models to CoreML by @AlienKevin in #1304
  • Prefer pkg-config while looking for BLAS by @marmistrz in #1349
  • Abort build if a feature was requested and could not be configured by @marmistrz in #1350
  • Abort callback improvements by @mkiol in #1345
  • Dockerfile for cublas by @joecryptotoo in #1286
  • docs: fix typo by @jorismertz in #1362
  • Expose the audio_ctx param through the Go binding by @JohanRaffin in #1368
  • Clarify doc about where to compile from by @ai-at-home in #1400
  • Faster download for models on windows using BitTransfer by @WhiteOlivierus in #1404
  • JSON: allow outputting per-token data too by @akx in #1358
  • Move up-to-date demo to top by @asadm in #1417
  • Use absolute paths for the converted OpenVINO model by @bobqianic in #1356
  • sync : ggml (backend v2, k-quants, CUDA opts, Metal opts, etc.) by @ggerganov in #1422
  • whisper : add support for new distilled Whisper models by @ggerganov in #1424
  • whisper : add context param for disable gpu by @jhen0409 in #1293
  • talk-llama : fix n_gpu_layers usage by @jhen0409 in #1441
  • talk-llama : fix n_gpu_layers usage again by @jhen0409 in #1442
  • Fix variable names in GitHub actions config by @iamthad in #1440
  • Reset ctx->t_start_us when calling whisper_reset_timings() by @bjnortier in #1434
  • Decouple Android example into a library and app module by @tobrun in #1445
  • whisper : add support for large v3 by @ggerganov in #1444
  • Add support for Swift Package Manager by @sindresorhus in #1370
  • Reset mel time when resetting timings by @bjnortier in #1452
  • coreml: use the correct n_mel by @jxy in #1458
  • models : Fix n_mel mismatch in convert-whisper-to-openvino.py by @bobqianic in #1459
  • Add '-l auto' to talk-llama example by @kubaracek in #1467
  • Return with error from whisper_encode_internal and whisper_decode_int… by @bjnortier in #1456
  • whisper : add full CUDA and Metal offloading by @ggerganov in #1472
  • examples : Enhanced compatibility with older Android versions using Java by @litongjava in #1382
  • Add n_gpu_layers option to talk-llama example by @rlapray in #1475
  • whisper : add grammar-based sampling by @ejones in #1229
  • java : use tiny.en for tests by @ggerganov in #1484
  • whisper : add batched decoding by @ggerganov in #1486
  • java : fix test by @ggerganov in #1492
  • whisper : make large version explicit + fix data size units by @ggerganov in #1493

New Contributors

Full Changelog: v1.4.0...v1.5.0
