Kalosm 0.3
Kalosm 0.3 makes structured generation significantly easier to use, improves transcription, and lets you track model download progress. It also includes performance improvements for text generation and transcription models, along with parser improvements.
Performance Improvements
The new version of Kalosm includes significant performance improvements for Llama, Mistral, and Phi models. We have also developed sampler-aware structured generation, which lets us skip parsing most tokens in loose structures. Generation should be between 2x and 4x as fast as in 0.2, depending on your use case:
| Demo | Kalosm 0.2 | Kalosm 0.3 |
|---|---|---|
| Text generation | kalosm-0.2.mp4 | kalosm-0.3.mp4 |
| Structured Generation | kalosm-0.2-structured.mp4 | kalosm-0.3-structured.mp4 |
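
To give a rough intuition for why sampler-aware decoding skips most parsing work, here is a minimal, self-contained sketch. It is not Kalosm's implementation: the `is_valid` check, the toy vocabulary, and the greedy sampler are all assumptions made for illustration. The naive version validates every token in the vocabulary before choosing one, while the sampler-aware version only validates tokens in the order the sampler would pick them and stops at the first one the parser accepts.

```rust
/// Hypothetical stand-in for a parser over a "loose" structure: almost every
/// token is allowed, only tokens containing a line break are rejected.
fn is_valid(token: &str) -> bool {
    !token.contains('\n')
}

/// Naive constrained decoding: validate every token in the vocabulary, then
/// keep the most likely valid one. Returns (chosen index, parser calls).
fn pick_naive(vocab: &[&str], logits: &[f32]) -> (Option<usize>, usize) {
    let mut checks = 0;
    let mut best: Option<usize> = None;
    for (i, token) in vocab.iter().enumerate() {
        checks += 1;
        if is_valid(token) && best.map_or(true, |b| logits[i] > logits[b]) {
            best = Some(i);
        }
    }
    (best, checks)
}

/// Sampler-aware decoding: visit tokens in the order a greedy sampler prefers
/// them and stop at the first one the parser accepts.
fn pick_sampler_aware(vocab: &[&str], logits: &[f32]) -> (Option<usize>, usize) {
    let mut order: Vec<usize> = (0..vocab.len()).collect();
    // Highest logit first.
    order.sort_by(|&a, &b| logits[b].total_cmp(&logits[a]));
    let mut checks = 0;
    for i in order {
        checks += 1;
        if is_valid(vocab[i]) {
            return (Some(i), checks);
        }
    }
    (None, checks)
}

fn main() {
    // Toy vocabulary and logits, invented for this example.
    let vocab = ["\n", "Hello", " world", "!", "\n\n", " there"];
    let logits = [2.5, 2.0, 0.1, -1.0, 0.3, 1.2];

    let (naive, naive_checks) = pick_naive(&vocab, &logits);
    let (aware, aware_checks) = pick_sampler_aware(&vocab, &logits);

    println!("naive: {:?} after {naive_checks} parser calls", naive.map(|i| vocab[i]));
    println!("sampler-aware: {:?} after {aware_checks} parser calls", aware.map(|i| vocab[i]));
}
```

Both functions choose the same token here, but the sampler-aware version calls the parser twice instead of once per vocabulary entry. With a real vocabulary of tens of thousands of tokens and a loose structure where the first candidate is usually valid, that difference is where the savings would come from.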
Structured Generation Improvements
Structured generation is both easier and faster in 0.3. Many structured generation tasks can use JSON. If you just need a JSON parser, Kalosm 0.3 lets you derive the parser from your data:
use kalosm::language::*;

/// A fictional character
#[derive(Parse, Schema, Clone, Debug)]
struct Character {
    /// The name of the character
    #[parse(pattern = "[A-Z][a-z]{2,10} [A-Z][a-z]{2,10}")]
    name: String,
    /// The age of the character
    #[parse(range = 1..=100)]
    age: u8,
    /// A description of the character
    #[parse(pattern = "[A-Za-z ]{40,200}")]
    description: String,
}

Then you can build a task that generates the character:
#[tokio::main]
async fn main() {
    // First create a model. Chat models tend to work best with structured generation
    let model = Llama::new_chat().await.unwrap();
    // Then create a task with the parser as constraints
    let task = Task::builder_for::<Character>("You generate realistic JSON placeholders for characters")
        .build();
    // Finally, run the task
    let mut stream = task.run("Create a random character", &model);
    stream.to_std_out().await.unwrap();
    let character = stream.await.unwrap();
    println!("{character:?}");
}

Along with the parser, you can also derive a JSON schema that matches the parser, which is useful for function-calling models.
You can read more about how structured generation works in Kalosm in our last blog post.
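
To make the constraints in the `Character` example concrete, the sketch below checks by hand what the three `#[parse(...)]` attributes would accept. It does not use Kalosm at all: the plain `Character` struct and the `is_name_part` and `satisfies_constraints` helpers are hypothetical, and it assumes each pattern must match the whole field rather than a substring.

```rust
/// Hypothetical mirror of the `Character` struct, without the Kalosm derives.
#[derive(Debug)]
struct Character {
    name: String,
    age: u8,
    description: String,
}

/// Checks one part of the name against `[A-Z][a-z]{2,10}`.
fn is_name_part(part: &str) -> bool {
    let mut chars = part.chars();
    let first_ok = chars.next().map_or(false, |c| c.is_ascii_uppercase());
    let rest: Vec<char> = chars.collect();
    first_ok
        && (2..=10).contains(&rest.len())
        && rest.iter().all(|c| c.is_ascii_lowercase())
}

/// Checks a value against the three constraints from the example.
fn satisfies_constraints(c: &Character) -> bool {
    // name: `[A-Z][a-z]{2,10} [A-Z][a-z]{2,10}` (two parts, one space)
    let name_ok = match c.name.split_once(' ') {
        Some((first, last)) => is_name_part(first) && is_name_part(last),
        None => false,
    };
    // age: `1..=100`
    let age_ok = (1..=100).contains(&c.age);
    // description: `[A-Za-z ]{40,200}`
    let len = c.description.chars().count();
    let description_ok = (40..=200).contains(&len)
        && c.description.chars().all(|ch| ch.is_ascii_alphabetic() || ch == ' ');
    name_ok && age_ok && description_ok
}

fn main() {
    let character = Character {
        name: "Ada Lovelace".to_string(),
        age: 36,
        description: "A curious mathematician who writes programs for early machines"
            .to_string(),
    };
    println!("{character:?} valid: {}", satisfies_constraints(&character));
}
```

With structured generation, the model can only produce values that pass checks like these, so a validation step of this kind is not needed after parsing.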
Streaming Voice Transcription
Kalosm 0.3 adds support for transcribing audio streams, such as microphone input, in chunks based on voice activity. You can now read the audio stream directly from the microphone and transcribe it as voices are detected:
// Create a new whisper model.
let model = Whisper::new().await.unwrap();
// Stream audio from the microphone
let mic = MicInput::default();
let stream = mic.stream().unwrap();
// Transcribe the audio into text in chunks based on voice activity.
let mut text_stream = stream.transcribe(model);
// Finally, print the text to the console
text_stream.to_std_out().await.unwrap();

Model Progress
Loading models is now async with a callback for loading progress:
let model = Bert::builder()
    // build with loading handler lets you track the progress of the model loading
    .build_with_loading_handler(|loading| match loading {
        ModelLoadingProgress::Downloading {
            source,
            start_time,
            progress,
        } => {
            let elapsed = start_time.elapsed();
            println!("Downloading model from {source}...{progress}% (elapsed {elapsed:?})");
        }
        ModelLoadingProgress::Loading { progress } => {
            println!("Loading model into memory...{progress}%");
        }
    })
    .await
    .unwrap();

Whisper transcriptions and wuerstchen generations are also async with progress info thanks to @newfla:
// Create a new whisper model
let model = WhisperBuilder::default()
    .with_source(WhisperSource::QuantizedDistilLargeV3)
    .build()
    .await.unwrap();
let mic = MicInput::default();
let audio = mic.stream().unwrap();
// Transcribe the source audio into text
let mut text = audio.transcribe(model);
// As the model transcribes the audio, print the text to the console
while let Some(chunk) = text.next().await {
    let text = chunk.as_ref();
    println!("{text}");
    println!(
        "estimated time left to decode chunk: {}s",
        chunk.remaining_time().as_secs()
    );
}

Documentation improvements
The inline documentation has been significantly improved in 0.3. Common items now include inline guides to help you get started, like the language page, and concept explanations, like embeddings.
New models!
Along with the new release, Kalosm supports a few new models:
- Quantized Whisper models are now supported, with presets for distilled versions of Whisper that run even faster
- The Phi-3 series of models is supported by kalosm-llama. The Phi series punches above its weight for structured JSON generation tasks
Full changelog
- Implement token healing by @ealmloff in #149
- Decouple models from tasks by @ealmloff in #150
- Update candle and add metal support by @ealmloff in #153
- Improve sidebar UI and add categories by @ealmloff in #155
- Bump mio from 0.8.10 to 0.8.11 by @dependabot in #156
- Improve model loading API by @ealmloff in #157
- Bump actions/checkout from 3 to 4 by @dependabot in #160
- Bump actions/upload-artifact from 3 to 4 by @dependabot in #159
- Fix linux support by @ealmloff in #161
- Pin wasmtime rev by @ealmloff in #163
- Improve support for mkl by @newfla in #165
- Plugin calculate by @LafCorentin in #166
- Support starling beta and speed up token generation by @ealmloff in #168
- Whisper & Wuerstchen download progress by @newfla in #169
- wuerstchen resolution warnings and accelerator support by @ealmloff in #173
- rwhisper: progress, elapsed time. estimate remaining time by @newfla in #174
- Fix loading chat sessions on accelerators by @ealmloff in #175
- Add support for quantized whisper models by @ealmloff in #176
- Add distil whisper v3 large quantized by @ealmloff in #178
- rwuerstchen: async api by @newfla in #177
- Reference count language models by @ealmloff in #180
- Make structured generation faster by @ealmloff in #181
- Add wizard lm 2 by @ealmloff in #182
- Simplify Parsers by @ealmloff in #183
- Improve Floneum UI by @ealmloff in #158
- Add phi-3 by @ealmloff in #185
- Add a menu item to clear the current workflow by @ealmloff in #187
- fix the call to unsafe function error by @haoxins in #188
- fix linking cuda kernels on windows by @newfla in #189
- Clean up kalosm examples by @ealmloff in #190
- Add snowflake embedding models by @ealmloff in #191
- Add extra context methods to simplify adding documents with database integration by @ealmloff in #192
- Fix Rwuerstchen example link in Readme by @newfla in #193
- Improve Bert model by @ealmloff in #194
- Implement smarter rule based sentence chunking by @ealmloff in #196
- Semantic chunking by @ealmloff in #197
- Use the in place kv cache for faster long context token generation by @ealmloff in #198
- Slowly expand the llama cache as we need to by @ealmloff in #199
- Save/load classifier heads, add dropout layer and expose the learning rate by @ealmloff in #201
- Optimize large bert batch sizes by @ealmloff in #202
- Fix memory usage and add batch sizes to classifier training by @ealmloff in #203
- Expose classifier probabilities by @ealmloff in #205
- HTML chunking and simplification by @ealmloff in #200
- Fix windows CI by @ealmloff in #207
- Improve node interface by @ealmloff in #208
- Cache embeddings by @ealmloff in #209
- Add a separate method for embedding queries by @ealmloff in #210
- Reorganize and simplify examples by @ealmloff in #211
- Improve kalosm-learning docs and lazily find the input size by @ealmloff in #212
- Add more docs for embeddings by @ealmloff in #213
- Read huggingface token by @ealmloff in #216
- Expose a way to manually set the device for llama by @ealmloff in #215
- Improve chat API and adding more examples for Chat and ChatBuilder by @ealmloff in #218
- Add a voice activity and denoising helpers to kalosm audio by @ealmloff in #222
- Simplify parse and add a derive macro by @ealmloff in #223
- Bump the cargo group with 2 updates by @dependabot in #225
- Improve kalosm feature flags by @newfla in #226
- Fix regex constraints by @ealmloff in #227
- Fix structured generation with non-prefix encodable tokenizers like phi by @ealmloff in #228
- Faster structured generation with sampler aware token decoding by @ealmloff in #229
- Derive parse for enums with data by @ealmloff in #230
- Add attributes to modify unit, enum and struct parsing by @ealmloff in #231
- Improve the ergonomics of the TextStream trait and remove async from a few model methods by @ealmloff in #232
- Remove a bunch of unused dependencies by @ealmloff in #233
- Add overviews for each core module by @ealmloff in #234
- Create a compile time state machine for enum parsers by @ealmloff in #235
- Add a llama 3.1 instruct preset by @ealmloff in #237
- Bump openssl from 0.10.64 to 0.10.66 in the cargo group by @dependabot in #236
- Fix the required next tokens for repeat parsers by @ealmloff in #239
- Make cloning repeat partial state very cheap with an immutable Arc Linked List by @ealmloff in #240
- Implement phi-3.1 support by @ealmloff in #241
- Fix parsing signs and optimize separated parser by @ealmloff in #242
- Fix constrained rust type performance by @ealmloff in #243
- Derive a JSON schema by @ealmloff in #245
- Implement prompt healing by @ealmloff in #246
- Split floneum and kalosm in the workspace by @ealmloff in #247
- Fix structured generation with the phi tokenizer by @ealmloff in #250
- chore: update lib.rs by @eltociear in #249
- Fix remaining doc tests by @ealmloff in #251
- Fix CI checks by @ealmloff in #252
- Add a tiny helper for tasks that implement parse and schema by @ealmloff in #253
- Bump version by @ealmloff in #254
- Improve the documentation for the entry point of each crate by @ealmloff in #255
- Bump docs by @ealmloff in #256
New Contributors
- @KerfuffleV2 made their first contribution in #77
- @haoxins made their first contribution in #86
- @Yevgnen made their first contribution in #93
- @dependabot made their first contribution in #156
- @newfla made their first contribution in #165
- @LafCorentin made their first contribution in #166
- @eltociear made their first contribution in #249
Full Git Diff: v0.2.0...kalosm-0.3.0