Kalosm 0.3

Kalosm 0.3 makes structured generation significantly easier to use, improves transcription, and makes it possible to track model download progress. It also includes performance improvements for text generation and transcription models, along with parser improvements.

Performance Improvements

The new version of Kalosm includes significant performance improvements for llama, mistral, and phi models. We have also developed sampler-aware structured generation, which lets us skip parsing most tokens in loose structures. Generation should be between 2x and 4x as fast as in 0.2, depending on your use case:

Demo	Kalosm 0.2	Kalosm 0.3
Text generation	kalosm-0.2.mp4	kalosm-0.3.mp4
Structured Generation	kalosm-0.2-structured.mp4	kalosm-0.3-structured.mp4
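
One way to picture the sampler-aware idea is to let the sampler propose a token first and only then ask the parser whether that single token is valid, instead of validating the whole vocabulary at every step. The sketch below is not Kalosm's implementation: `pick_eager`, `pick_lazy`, and the `is_valid` closure are hypothetical names, and greedy selection stands in for a real sampler.

```rust
/// Eager approach: run the parser check on every token in the vocabulary,
/// then keep the highest-scoring valid one.
/// Returns (chosen token id, number of parser checks performed).
fn pick_eager(logits: &[f32], is_valid: impl Fn(usize) -> bool) -> (Option<usize>, usize) {
    let mut checks = 0;
    let mut best: Option<usize> = None;
    for (id, &logit) in logits.iter().enumerate() {
        checks += 1;
        if is_valid(id) && best.map_or(true, |b| logit > logits[b]) {
            best = Some(id);
        }
    }
    (best, checks)
}

/// Lazy approach: visit tokens in the order a greedy sampler prefers them
/// and stop at the first one the parser accepts.
/// Returns (chosen token id, number of parser checks performed).
fn pick_lazy(logits: &[f32], is_valid: impl Fn(usize) -> bool) -> (Option<usize>, usize) {
    let mut order: Vec<usize> = (0..logits.len()).collect();
    // Sort token ids by descending logit.
    order.sort_by(|&a, &b| logits[b].total_cmp(&logits[a]));
    let mut checks = 0;
    for id in order {
        checks += 1;
        if is_valid(id) {
            return (Some(id), checks);
        }
    }
    (None, checks)
}
```

Both functions pick the same token. When the structure is loose and most tokens are valid, the lazy version usually stops after one or two parser checks, while the eager version always pays for the full vocabulary.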

Structured Generation Improvements

Structured generation is both easier and faster in 0.3. Many structured generation tasks can use JSON. If you just need a JSON parser, Kalosm 0.3 lets you derive the parser from your data:

use kalosm::language::*;

/// A fictional character
#[derive(Parse, Schema, Clone, Debug)]
struct Character {
    /// The name of the character
    #[parse(pattern = "[A-Z][a-z]{2,10} [A-Z][a-z]{2,10}")]
    name: String,
    /// The age of the character
    #[parse(range = 1..=100)]
    age: u8,
    /// A description of the character
    #[parse(pattern = "[A-Za-z ]{40,200}")]
    description: String,
}
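
To make the three constraints concrete, the following std-only sketch checks a value against the same rules by hand. It does not use Kalosm's derive macro: `PlainCharacter`, `is_name_part`, and `satisfies_constraints` are hypothetical names, and the sketch assumes each pattern must match the entire field.

```rust
/// Plain stand-in for the derived `Character` struct (hypothetical).
struct PlainCharacter {
    name: String,
    age: u8,
    description: String,
}

/// Matches `[A-Z][a-z]{2,10}`: one uppercase letter followed by
/// between 2 and 10 lowercase letters.
fn is_name_part(part: &str) -> bool {
    let mut chars = part.chars();
    let first_ok = chars.next().map_or(false, |c| c.is_ascii_uppercase());
    let rest: Vec<char> = chars.collect();
    first_ok && (2..=10).contains(&rest.len()) && rest.iter().all(|c| c.is_ascii_lowercase())
}

/// Returns true when every field satisfies its constraint.
fn satisfies_constraints(c: &PlainCharacter) -> bool {
    // name: two name parts separated by exactly one space.
    let parts: Vec<&str> = c.name.split(' ').collect();
    let name_ok = parts.len() == 2 && parts.iter().all(|p| is_name_part(p));
    // age: range = 1..=100
    let age_ok = (1..=100).contains(&c.age);
    // description: `[A-Za-z ]{40,200}`
    let len = c.description.chars().count();
    let description_ok = (40..=200).contains(&len)
        && c.description.chars().all(|ch| ch.is_ascii_alphabetic() || ch == ' ');
    name_ok && age_ok && description_ok
}
```

With constrained generation the model can only emit values that pass checks like these, so no validation step is needed after the fact.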

Then you can build a task that generates the character:

#[tokio::main]
async fn main() {
    // First create a model. Chat models tend to work best with structured generation
    let model = Llama::new_chat().await.unwrap();
    // Then create a task with the parser as constraints
    let task = Task::builder_for::<Character>("You generate realistic JSON placeholders for characters")
        .build();
    // Finally, run the task
    let mut stream = task.run("Create a random character", &model);
    stream.to_std_out().await.unwrap();
    let character = stream.await.unwrap();
    println!("{character:?}");
}

Along with the parser, you can also derive a JSON schema that matches the parser, which is useful for function-calling models.

You can read more about how structured generation works in Kalosm in our last blog post.

Streaming Voice Transcription

Kalosm 0.3 adds support for transcribing audio streams, such as microphone input, in chunks based on voice activity. You can now read the audio stream directly from the microphone and transcribe it as voices are detected:

// Create a new whisper model.
let model = Whisper::new().await.unwrap();

// Stream audio from the microphone
let mic = MicInput::default();
let stream = mic.stream().unwrap();

// Transcribe the audio into text in chunks based on voice activity.
let mut text_stream = stream.transcribe(model);

// Finally, print the text to the console
text_stream.to_std_out().await.unwrap();
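
Chunking by voice activity can be pictured with a much simpler stand-in than whatever detector Kalosm uses internally: split the samples into fixed-size frames, treat a frame as speech when its mean energy crosses a threshold, and group consecutive speech frames together. In this sketch `split_on_voice`, the frame length, and the threshold are all assumptions made for illustration.

```rust
/// Split `samples` into chunks of consecutive "voiced" frames.
/// A frame counts as voiced when its mean squared amplitude exceeds `threshold`.
/// `frame_len` must be greater than zero (`chunks` panics on zero).
fn split_on_voice(samples: &[f32], frame_len: usize, threshold: f32) -> Vec<Vec<f32>> {
    let mut chunks = Vec::new();
    let mut current: Vec<f32> = Vec::new();
    for frame in samples.chunks(frame_len) {
        // Mean energy of this frame.
        let energy = frame.iter().map(|s| s * s).sum::<f32>() / frame.len() as f32;
        if energy > threshold {
            // Speech continues: grow the current chunk.
            current.extend_from_slice(frame);
        } else if !current.is_empty() {
            // Silence after speech: close the current chunk.
            chunks.push(std::mem::take(&mut current));
        }
    }
    // Flush a chunk that runs to the end of the input.
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

Each returned chunk could then be handed to a transcription model on its own, which is what lets text appear as soon as a speaker pauses rather than after the whole recording ends.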

Model Progress

Loading models is now async with a callback for loading progress:

let model = Bert::builder()
    // build with loading handler lets you track the progress of the model loading
    .build_with_loading_handler(|loading| match loading {
        ModelLoadingProgress::Downloading {
            source,
            start_time,
            progress,
        } => {
            let elapsed = start_time.elapsed();
            println!("Downloading model from {source}...{progress}% (elapsed {elapsed:?})");
        }
        ModelLoadingProgress::Loading { progress } => {
            println!("Loading model into memory...{progress}%");
        }
    })
    .await
    .unwrap();
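
The handler pattern itself is easy to try without downloading anything. The sketch below mimics it with std only. `FakeProgress`, `fake_load`, and `describe` are hypothetical stand-ins rather than Kalosm types, and the sketch assumes progress is reported as a fraction from 0.0 to 1.0.

```rust
/// Hypothetical progress events, loosely shaped like the ones handled above.
#[derive(Debug, Clone, PartialEq)]
enum FakeProgress {
    Downloading { source: String, progress: f32 },
    Loading { progress: f32 },
}

/// Pretend to load a model, reporting each step through `handler`.
fn fake_load(source: &str, steps: u32, mut handler: impl FnMut(FakeProgress)) {
    for step in 1..=steps {
        handler(FakeProgress::Downloading {
            source: source.to_string(),
            progress: step as f32 / steps as f32,
        });
    }
    for step in 1..=steps {
        handler(FakeProgress::Loading {
            progress: step as f32 / steps as f32,
        });
    }
}

/// Render an event as a human-readable line, converting the fraction to a percentage.
fn describe(event: &FakeProgress) -> String {
    match event {
        FakeProgress::Downloading { source, progress } => {
            format!("Downloading model from {source}...{:.0}%", progress * 100.0)
        }
        FakeProgress::Loading { progress } => {
            format!("Loading model into memory...{:.0}%", progress * 100.0)
        }
    }
}
```

Calling `fake_load("example-model", 2, |event| println!("{}", describe(&event)))` prints two download lines followed by two loading lines. Taking the handler as a closure lets the caller decide whether progress goes to the console, a progress bar, or a UI.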

Whisper transcriptions and Wuerstchen generations are also async and report progress info, thanks to @newfla:

// Create a new whisper model
let model = WhisperBuilder::default()
    .with_source(WhisperSource::QuantizedDistilLargeV3)
    .build()
    .await.unwrap();

let mic = MicInput::default();
let audio = mic.stream().unwrap();

// Transcribe the source audio into text
let mut text = audio.transcribe(model);

// As the model transcribes the audio, print the text to the console
while let Some(chunk) = text.next().await {
    let text = chunk.as_ref();
    println!("{text}");
    println!(
        "estimated time left to decode chunk: {}s",
        chunk.remaining_time().as_secs()
    );
}
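
A remaining-time figure like the one printed above can be derived from the elapsed time and the completed fraction by linear extrapolation. This is only a sketch of that arithmetic: `estimate_remaining` is a hypothetical helper, and the source does not say how rwhisper computes its estimate.

```rust
use std::time::Duration;

/// Linearly extrapolate the time left from the time spent so far.
/// `progress` is the completed fraction in (0.0, 1.0]. Returns `None` when
/// the fraction is outside that range, since there is nothing to extrapolate from.
fn estimate_remaining(elapsed: Duration, progress: f64) -> Option<Duration> {
    if !(progress > 0.0 && progress <= 1.0) {
        return None;
    }
    // If `progress` of the work took `elapsed`, the whole job takes elapsed / progress.
    let total = elapsed.as_secs_f64() / progress;
    let remaining = (total - elapsed.as_secs_f64()).max(0.0);
    Some(Duration::from_secs_f64(remaining))
}
```

For example, 10 seconds spent on a quarter of the work extrapolates to 30 seconds remaining. The estimate assumes a constant decoding rate, so it tends to be noisy early in a chunk.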

Documentation improvements

The inline documentation has been significantly improved in 0.3. Common items now include inline guides to help you get started, like the language page, and concept explanations, like embeddings.

New models!

Along with the new release, Kalosm supports a few new models:

Quantized whisper models are now supported, with presets for distilled versions of whisper that run even faster.
The Phi-3 series of models is supported by kalosm-llama. The Phi series punches above its weight for structured JSON generation tasks.

Full changelog

Implement token healing by @ealmloff in #149
Decouple models from tasks by @ealmloff in #150
Update candle and add metal support by @ealmloff in #153
Improve sidebar UI and add categories by @ealmloff in #155
Bump mio from 0.8.10 to 0.8.11 by @dependabot in #156
Improve model loading API by @ealmloff in #157
Bump actions/checkout from 3 to 4 by @dependabot in #160
Bump actions/upload-artifact from 3 to 4 by @dependabot in #159
Fix linux support by @ealmloff in #161
Pin wasmtime rev by @ealmloff in #163
Improve support for mkl by @newfla in #165
Plugin calculate by @LafCorentin in #166
Support starling beta and speed up token generation by @ealmloff in #168
Whisper & Wuerstchen download progress by @newfla in #169
wuerstchen resolution warnings and accelerator support by @ealmloff in #173
rwhisper: progress, elapsed time. estimate remaining time by @newfla in #174
Fix loading chat sessions on accelerators by @ealmloff in #175
Add support for quantized whisper models by @ealmloff in #176
Add distil whisper v3 large quantized by @ealmloff in #178
rwuerstchen: async api by @newfla in #177
Reference count language models by @ealmloff in #180
Make structured generation faster by @ealmloff in #181
Add wizard lm 2 by @ealmloff in #182
Simplify Parsers by @ealmloff in #183
Improve Floneum UI by @ealmloff in #158
Add phi-3 by @ealmloff in #185
Add a menu item to clear the current workflow by @ealmloff in #187
fix the call to unsafe function error by @haoxins in #188
fix linking cuda kernels on windows by @newfla in #189
Clean up kalosm examples by @ealmloff in #190
Add snowflake embedding models by @ealmloff in #191
Add extra context methods to simplify adding documents with database integration by @ealmloff in #192
Fix Rwuerstchen example link in Readme by @newfla in #193
Improve Bert model by @ealmloff in #194
Implement smarter rule based sentence chunking by @ealmloff in #196
Semantic chunking by @ealmloff in #197
Use the in place kv cache for faster long context token generation by @ealmloff in #198
Slowly expand the llama cache as we need to by @ealmloff in #199
Save/load classifier heads, add dropout layer and expose the learning rate by @ealmloff in #201
Optimize large bert batch sizes by @ealmloff in #202
Fix memory usage and add batch sizes to classifier training by @ealmloff in #203
Expose classifier probabilities by @ealmloff in #205
HTML chunking and simplification by @ealmloff in #200
Fix windows CI by @ealmloff in #207
Improve node interface by @ealmloff in #208
Cache embeddings by @ealmloff in #209
Add a separate method for embedding queries by @ealmloff in #210
Reorganize and simplify examples by @ealmloff in #211
Improve kalosm-learning docs and lazily find the input size by @ealmloff in #212
Add more docs for embeddings by @ealmloff in #213
Read huggingface token by @ealmloff in #216
Expose a way to manually set the device for llama by @ealmloff in #215
Improve chat API and adding more examples for Chat and ChatBuilder by @ealmloff in #218
Add a voice activity and denoising helpers to kalosm audio by @ealmloff in #222
Simplify parse and add a derive macro by @ealmloff in #223
Bump the cargo group with 2 updates by @dependabot in #225
Improve kalosm feature flags by @newfla in #226
Fix regex constraints by @ealmloff in #227
Fix structured generation with non-prefix encodable tokenizers like phi by @ealmloff in #228
Faster structured generation with sampler aware token decoding by @ealmloff in #229
Derive parse for enums with data by @ealmloff in #230
Add attributes to modify unit, enum and struct parsing by @ealmloff in #231
Improve the ergonomics of the TextStream trait and remove async from a few model methods by @ealmloff in #232
Remove a bunch of unused dependencies by @ealmloff in #233
Add overviews for each core module by @ealmloff in #234
Create a compile time state machine for enum parsers by @ealmloff in #235
Add a llama 3.1 instruct preset by @ealmloff in #237
Bump openssl from 0.10.64 to 0.10.66 in the cargo group by @dependabot in #236
Fix the required next tokens for repeat parsers by @ealmloff in #239
Make cloning repeat partial state very cheap with an immutable Arc Linked List by @ealmloff in #240
Implement phi-3.1 support by @ealmloff in #241
Fix parsing signs and optimize separated parser by @ealmloff in #242
Fix constrained rust type performance by @ealmloff in #243
Derive a JSON schema by @ealmloff in #245
Implement prompt healing by @ealmloff in #246
Split floneum and kalosm in the workspace by @ealmloff in #247
Fix structured generation with the phi tokenizer by @ealmloff in #250
chore: update lib.rs by @eltociear in #249
Fix remaining doc tests by @ealmloff in #251
Fix CI checks by @ealmloff in #252
Add a tiny helper for tasks that implement parse and schema by @ealmloff in #253
Bump version by @ealmloff in #254
Improve the documentation for the entry point of each crate by @ealmloff in #255
Bump docs by @ealmloff in #256

New Contributors

@KerfuffleV2 made their first contribution in #77
@haoxins made their first contribution in #86
@Yevgnen made their first contribution in #93
@dependabot made their first contribution in #156
@newfla made their first contribution in #165
@LafCorentin made their first contribution in #166
@eltociear made their first contribution in #249

Full Git Diff: v0.2.0...kalosm-0.3.0

floneum/kalosm kalosm-0.3.0 on GitHub