mostlygeek/llama-swap v219 on GitHub

Notes

Including details for v218 (broken) and PR #790.

llama-swap has a new routing backend. What started as a small experiment to improve concurrency handling exploded into a full refactor of the backend. For users, the biggest change is that swapping is more efficient. Requests are collated so that requests for models that are already loaded take precedence over those still waiting for a model to load.

It looks like:

new router: A B A B A B -> A A A B B B
old router: A B A B A B -> A B A B A B
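
To make the reordering concrete, here is a minimal sketch of the collation idea in Go. It is not the router code from the PR: the `collate` function, its signature, and the use of plain model-name strings as requests are all assumptions made for illustration. It only shows how a queue could be regrouped so that requests for the currently loaded model go first.

```go
package main

import "fmt"

// collate is a hypothetical helper, not part of llama-swap's API.
// It takes pending requests (represented here only by model name) and
// returns them regrouped: requests for the already-loaded model first,
// then the remaining models, grouped in first-seen order.
func collate(pending []string, loaded string) []string {
	groups := make(map[string][]string)
	var order []string // model names in the order they first appear

	for _, model := range pending {
		if _, seen := groups[model]; !seen {
			order = append(order, model)
		}
		groups[model] = append(groups[model], model)
	}

	out := make([]string, 0, len(pending))
	// Requests for the loaded model take precedence: no swap is needed.
	out = append(out, groups[loaded]...)
	// Every other model is served as one batch, so each swap happens once.
	for _, model := range order {
		if model != loaded {
			out = append(out, groups[model]...)
		}
	}
	return out
}

func main() {
	queue := []string{"A", "B", "A", "B", "A", "B"}
	fmt.Println(collate(queue, "A")) // prints: [A A A B B B]
}
```

With the interleaved queue above, the uncollated order would switch models on nearly every request, while the collated order needs a single swap from A to B.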

However, just doing that wouldn't require a 12,009-line PR. There were a lot of architectural changes that make developer quality of life a bit easier. Redundant code was removed, repo organization is centralized around the internal/ packages, new funny loading remarks were added, etc.

Also, a new concurrency tester sneaked in under cmd/concurrency-tester.

Changelog

4ca9c47 Makefile,internal/server: various release tweaks
146a9ea ui-svelte: update build directory (#801)
