New Features:
- DeepSparse Pipelines v2 was introduced, enabling more complex pipelines to be represented. Text Generation (compatible with Hugging Face Transformers) and Image Classification pipelines have been refactored to the v2 format. (#1324, #1385, #1460, #1596, #1502, #1460, #1626)
- OpenAI Server compatibility added on top of Pipelines v2. (#1445, #1477)
deepsparse.evaluate
APIs and CLIs added with plugins for perplexity and lm-eval-harness for LLM evaluations. (#1596)- An example was added demonstrating how to use LLMPerf for benchmarking DeepSparse LLM servers. (#1502)
- Continuous batching support has been added for text generation pipelines and inference server pathways, enabling inference over multiple text streams at once. (#1569, #1571)
Changes:
- Exposed
sequence_length
for greater control over text generation pipelines. (#1518) deepsparse.analyze
functionality has been updated to work properly with LLMs. (#1324)- The logging and timing infrastructure for Pipelines expanded to enable more thorough tracking and logging, in addition to furthering support for integrations with Prometheus and other standard logging platforms. (#1614)
- UX improved for text generation pipelines to more closely match Hugging Face Transformers pipelines. (#1583, #1584, #1590, #1592, #1598)
Resolved Issues:
- Compile time for dense LLMs is no longer very slow.
- Text generation pipeline bug fixes: corrected sampling logic errors and inappropriate in-place logits mutation resulting in incorrect answers for LLMs when using sampling. (#1406, #1414)
- KV cache was fixed for improper handling of the
kv_cache
input while using external KV cache management, which resulted in inaccurate model inference for ONNX Runtime comparison pathways. (#1337) - Benchmarking runs for LLMs with internal KV cache no longer crash or report inaccurate numbers. (#1512, #1514)
- SciPy dependencies were removed to address issues for CV pipelines where they would fail on import of
scipy
and crash. (#1604, #1602)
Known Issues:
- OPT models produce incorrect outputs and are no longer supported.
- Streaming support is limited within the DeepSparse Pipeline v2 framework for tasks other than text generation.