This release changes how tokens/second is calculated on the activities page. The previous method divided the number of generated tokens by the total request time. Because the total request time also includes prompt processing, the resulting number was too misleading to be useful.
This release changes the logic to:
- use llama-server's `timings` record, if it exists, for tokens/second
- send a `-1` when `timings` is not available. The UI will render this as "unknown".
Support for timing information from other inference engines will come in future PRs.
Tokens/second and duration now match llama-server's output precisely.
