koboldcpp-1.98.1
Kokobold edition
- NEW: TTS.cpp model support has been integrated into KoboldCpp, providing access to new Text-To-Speech models. The TTS.cpp project (repo here) was developed by @mmwillet, and a modified version has now been added into KoboldCpp, bringing support for 3 new Text-To-Speech models: Kokoro, Parler and Dia.
  - Of the above models, Kokoro is the most recommended for general use.
  - These models use the GGML library in KoboldCpp, although the new ops are CPU-only, so Kokoro provides the best speed relative to its size. You can expect speeds of roughly 2x realtime for Kokoro (fastest), 0.5x realtime for Parler, and 0.1x realtime for Dia (slowest).
  - To use, simply download the GGUF model and load it in the 'Audio' tab as a TTS model (see the sketch after this list). Note: WavTokenizer is not required for these models. Please use the `no_espeak` versions; KoboldCpp has custom IPA mappings for English, and espeak is not supported.
  - KoboldAI Lite provides automatic mapping for the speaker voices. If you wish to use a custom voice for Kokoro, the supported voices are `af_alloy`, `af_aoede`, `af_bella`, `af_heart`, `af_jessica`, `af_kore`, `af_nicole`, `af_nova`, `af_river`, `af_sarah`, `af_sky`, `am_adam`, `am_echo`, `am_eric`, `am_fenrir`, `am_liam`, `am_michael`, `am_onyx`, `am_puck`, `am_santa`, `bf_alice`, `bf_emma`, `bf_isabella`, `bf_lily`, `bm_daniel`, `bm_fable`, `bm_george`, `bm_lewis`. Only English speech is properly supported.
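As a rough usage sketch: the `--ttsmodel` flag and the OpenAI-compatible `/v1/audio/speech` route shown here are assumptions based on KoboldCpp's existing TTS support, not confirmed by these notes; check `--help` and the wiki for the exact interface.

```bash
# Sketch: load a no_espeak Kokoro GGUF as the TTS model alongside a text model.
# --ttsmodel and the /v1/audio/speech route are assumptions; verify via --help.
./koboldcpp --model MyTextModel.gguf --ttsmodel Kokoro_no_espeak.gguf

# Request speech using one of the supported Kokoro voices listed above
curl http://localhost:5001/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello from Kokoro!", "voice": "af_heart"}' \
  --output hello.wav
```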
- Thanks to @wbruna, image generation has received multiple improvements:
  - Added separate flash attention and conv2d toggles for image generation: `--sdflashattention` and `--sdconvdirect`.
  - Added the ability to use q8 for Image Generation model quantization, in addition to the existing q4. `--sdquant` now accepts a parameter `[0/1/2]` that specifies the quantization level, similar to `--quantkv` (example below).
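A hypothetical launch line combining the new image-generation options (the flag spellings come from the notes above; `--sdmodel` and the exact argument forms are assumptions, so verify with `--help`):

```bash
# Hypothetical: q8 image model quantization (--sdquant 2) with the new
# flash attention and direct conv2d toggles enabled; syntax unverified
./koboldcpp --model MyTextModel.gguf \
  --sdmodel sd35_large.safetensors \
  --sdquant 2 --sdflashattention --sdconvdirect
```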
- Added `--overridenativecontext` flag, which allows you to easily override the expected trained context of a model when determining automatic RoPE scaling (see the sketch below). If you didn't get that, you don't need this feature.
- Seed-OSS support is merged, including instruct templates for thinking and non-thinking modes.
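For the `--overridenativecontext` flag above, a minimal sketch; the numeric argument form and the `--contextsize` pairing are assumptions, so verify with `--help`:

```bash
# Sketch: treat the model as natively trained on 8192 tokens of context
# when computing automatic RoPE scaling, while running at 32k context
# (argument forms assumed; check --help for the exact syntax)
./koboldcpp --model MyModel.gguf --contextsize 32768 --overridenativecontext 8192
```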
- Further improvements to tool calling and audio transcription handling
- Fixed Stable Diffusion 3.5 loading issue
- Embedding models now default to the lower of the current model max context and the trained context, which should help with Qwen3 embedding models. This can be adjusted with the `--embeddingsmaxctx` override (see the sketch below).
- Improved server identifier header for better compatibility with some libraries.
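For the embeddings override above, a sketch; `--embeddingsmodel` is assumed from KoboldCpp's existing embeddings support and is not confirmed by these notes:

```bash
# Sketch: load a Qwen3 embedding model and cap its context at 2048 tokens
# (--embeddingsmodel assumed; --embeddingsmaxctx is named in the notes)
./koboldcpp --model MyTextModel.gguf \
  --embeddingsmodel qwen3-embedding.gguf --embeddingsmaxctx 2048
```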
- Termux `android_install.sh` script can now launch existing downloaded models.
- Minor chat adapter fixes, including Kimi.
- Added an alias for `--tensorsplit`.
- Benchmark CSV formatting fix.
- Updated Kobold Lite, multiple fixes and improvements
  - Scenario picker can now load any adventure or chat scenario in Instruct mode.
  - Slightly increased the default amount to generate.
  - Improved file saving behavior; tries to remember the previously used filename.
  - Improved KaTeX rendering and handling of additional cases.
  - Improved streaming UI for code block streaming at the start of any turn.
  - Added a setting to embed generated TTS audio into the context as part of the AI's turn.
  - Minor formatting fixes.
  - Added Vision đī¸ and Auditory đĻģ support indicators for inline multimodal media content.
  - Added Seed-OSS instruct templates. Note that the Thinking regex must be set manually for this model by changing the think tag.
  - Overhauled the narration and media adding system; TTS audio can now be manually added with `Add File`.
- Merged new model support, fixes and improvements from upstream
Hotfix 1.98.1 - Fixed Kokoro for better accuracy and quality, added 4096 as a `--blasbatchsize` option, fixed Windows 7 functionality, fixed flash attention issues, and synced some new updates from upstream.
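If you want to try the larger batch size from the hotfix, a sketch (the flag and value are taken from the note above):

```bash
# Sketch: use the newly added 4096 BLAS batch size
./koboldcpp --model MyModel.gguf --blasbatchsize 4096
```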
Download and run the koboldcpp.exe (Windows) or koboldcpp-linux-x64 (Linux), which is a one-file pyinstaller for NVIDIA GPU users.
If you have an older CPU or older NVIDIA GPU and koboldcpp does not work, try the oldpc version instead (Cuda11 + AVX1).
If you don't have an NVIDIA GPU, or do not need CUDA, you can use the nocuda version which is smaller.
If you're using AMD, we recommend trying the Vulkan option in the nocuda build first, for best support. Alternatively, you can try koboldcpp_rocm at YellowRoseCx's fork here if you are a Windows user or download our rolling ROCm binary here if you use Linux.
If you're on a modern MacOS (M-Series) you can use the koboldcpp-mac-arm64 MacOS binary.
Click here for .gguf conversion and quantization tools
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI.
Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001
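Once connected, a minimal generation request against KoboldCpp's KoboldAI-compatible API might look like this (parameter values are illustrative):

```bash
# Minimal KoboldAI-compatible generation request (values illustrative)
curl http://localhost:5001/api/v1/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Once upon a time,", "max_length": 64, "temperature": 0.7}'
```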
For more information, be sure to run the program from the command line with the `--help` flag. You can also refer to the readme and the wiki.