v0.7.9
The Local Upgrade
Warning: This update attempts to load files from app assets, which may fail, as I have not yet tested this on multiple devices. Please report if you get stuck in a boot crash! This update is generally very experimental, with a few changes to the core C++ code of llama.rn, so it may be unstable.
Features:
- Local generation has migrated to cui-llama.rn, a fork of the fantastic llama.rn project, but with custom features tailored for ChatterUI:
- Added the ability to stop prompt processing between batches - this is more effective when used with a low batch size.
- vocab_only mode, which allows tokenizer-only usage - this also removes the need for onnx-runtime and the old transformers.js adaptation for tokenizers, cutting down app size significantly!
- Synchronous tokenization for ease of development
- Context Shifting adapted from kobold.cpp (Thanks @LostRuins) - this allows you to use high-context chats without needing to reprocess the entire context upon hitting the context limit
- Added support for i8mm compatible devices (Snapdragon 8 Gen 1 or newer / Exynos 2200 or newer) - this feature allows the use of the Q4_0_4_8 quantization level, which is optimized for ARM devices.
- It is recommended to requantize your models to this quantization level using the llama.cpp quantize tool:
.\llama-quantize.exe --allow-requantize model.gguf Q4_0_4_8
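To illustrate the Context Shifting idea from the feature list above, here is a minimal TypeScript sketch. It models the cache as a plain array of token ids and is not the actual cui-llama.rn or kobold.cpp code, which works on the KV cache in C++. The function name shiftContext, the parameters nCtx (context size) and nKeep (tokens pinned at the start, such as a system prompt), and the "drop half of the removable span" heuristic are all assumptions made for this example.

```typescript
// Hypothetical helper, not part of cui-llama.rn: the cache is modelled as a token array.
interface ShiftResult {
    tokens: number[] // cache contents after the shift
    discarded: number // how many old tokens were dropped
}

function shiftContext(
    cache: number[],
    incoming: number, // number of new tokens about to be appended
    nCtx: number, // assumed context size
    nKeep: number // assumed count of leading tokens that must never be dropped
): ShiftResult {
    // Nothing to do while the new tokens still fit.
    if (cache.length + incoming <= nCtx) {
        return { tokens: cache.slice(), discarded: 0 }
    }
    const overflow = cache.length + incoming - nCtx
    const removable = cache.length - nKeep
    if (removable < overflow) {
        throw new Error('not enough removable tokens to make room')
    }
    // Drop at least the overflow; dropping half of the removable span
    // (a heuristic assumed here) makes shifts less frequent.
    const discarded = Math.max(overflow, Math.floor(removable / 2))
    // Keep the pinned prefix, skip the discarded chunk, keep the recent tail.
    const tokens = [...cache.slice(0, nKeep), ...cache.slice(nKeep + discarded)]
    return { tokens, discarded }
}

// Usage: a full 8-token context receiving 2 new tokens, with 2 pinned tokens.
const example = shiftContext([1, 2, 3, 4, 5, 6, 7, 8], 2, 8, 2)
console.log(example.tokens, example.discarded) // [ 1, 2, 6, 7, 8 ] 3
```

The point of the technique is that the surviving tokens stay in the cache, so only the newly appended tokens need processing instead of the whole context.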
Changes:
- Local inferencing is now done as a background task! This should mean that tabbing out of the app no longer stops inferencing.
- Buttons in the Local API menu are now properly disabled based on model state
- The internal tokenizer now relies entirely on the much faster implementation in cui-llama.rn. As such, the previous JS tokenizer has been removed alongside onnx-runtime, leading to a much smaller APK size.
Fixes:
- Continuing with the local API now properly respects the regenCache
- Removed the BOS token from the default Llama 3 instruct preset
Dev:
- Moved constants and components under app/, as this seems to affect react-native's Fast Refresh functionality significantly
- Moved local api state to zustand - this helps a lot with fast refresh bugginess in development and prevents the model state from being unloaded upon a refresh