Notes
Including details for v218 (broken) and PR #790.
llama-swap has a new routing backend. What started as a small experiment to improve concurrency handling exploded into a full refactor of the backend. For users, the biggest change is that swapping is more efficient. Requests are collated, so requests for models that are already loaded take precedence over those that are awaiting loading.
It looks like:
new router: A B A B A B -> A A A B B B
old router: A B A B A B -> A B A B A B
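To make the reordering concrete, here is a minimal Go sketch of the collation idea. It is not the router's actual code: the `collate` function, its signature, and the use of plain model-name strings to stand in for requests are all assumptions made for illustration. It simply moves requests for the currently loaded model to the front, then groups the rest by model in first-arrival order.

```go
package main

import "fmt"

// collate is a hypothetical helper, NOT llama-swap's real router.
// pending holds the model name of each queued request in arrival order;
// loaded is the model assumed to be currently loaded.
func collate(pending []string, loaded string) []string {
	out := make([]string, 0, len(pending))

	// Requests for the already-loaded model need no swap, so they go first.
	for _, m := range pending {
		if m == loaded {
			out = append(out, m)
		}
	}

	// Group the remaining requests by model, keeping the order in which
	// each model first appeared, so each model is loaded only once.
	var order []string
	groups := map[string][]string{}
	for _, m := range pending {
		if m == loaded {
			continue
		}
		if _, seen := groups[m]; !seen {
			order = append(order, m)
		}
		groups[m] = append(groups[m], m)
	}
	for _, m := range order {
		out = append(out, groups[m]...)
	}
	return out
}

func main() {
	// Mirrors the example above, assuming model A is the one loaded.
	fmt.Println(collate([]string{"A", "B", "A", "B", "A", "B"}, "A"))
	// prints: [A A A B B B]
}
```

With the old ordering, this queue would trigger a swap on nearly every request. Grouped this way, it needs only one swap from A to B.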
However, just doing that wouldn't require a 12,009-line PR. There were a lot of architectural changes that make developer quality of life a bit easier. Redundant code was removed, repo organization is centralized around the internal/ packages, new funny loading remarks were added, etc.
Also, a new concurrency tester sneaked in under cmd/concurrency-tester.