neuralmagic/deepsparse v1.7.0 on GitHub

New Features:

DeepSparse Pipelines v2 was introduced, enabling more complex pipelines to be represented. Text Generation (compatible with Hugging Face Transformers) and Image Classification pipelines have been refactored to the v2 format. (#1324, #1385, #1460, #1596, #1502, #1626)
OpenAI Server compatibility has been added on top of Pipelines v2. (#1445, #1477)
deepsparse.evaluate APIs and CLIs have been added, with plugins for perplexity and lm-eval-harness for LLM evaluations. (#1596)
An example was added demonstrating how to use LLMPerf for benchmarking DeepSparse LLM servers. (#1502)
Continuous batching support has been added for text generation pipelines and inference server pathways, enabling inference over multiple text streams at once. (#1569, #1571)
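
To illustrate why continuous batching helps when serving multiple text streams, here is a minimal, self-contained sketch. It is a toy scheduler model, not DeepSparse's implementation: the function names, the fixed number of batch slots, and the "one token per stream per step" cost model are all assumptions made for illustration.

```python
from collections import deque


def continuous_batching_steps(lengths, batch_size):
    """Toy model: count decode steps when free slots are refilled immediately.

    lengths[i] is the number of tokens request i must generate (hypothetical input).
    Returns (total_steps, {request_id: step_at_which_it_finished}).
    """
    if any(n < 1 for n in lengths):
        raise ValueError("every request must generate at least one token")
    queue = deque(enumerate(lengths))
    active = {}        # request id -> tokens still to generate
    finished_at = {}
    step = 0
    while queue or active:
        # Refill free slots before each decode step.
        while queue and len(active) < batch_size:
            rid, n = queue.popleft()
            active[rid] = n
        step += 1
        # One decode step yields one token for every active stream.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]
                finished_at[rid] = step
    return step, finished_at


def static_batching_steps(lengths, batch_size):
    """Toy baseline: each batch runs until its longest request finishes."""
    return sum(
        max(lengths[i:i + batch_size])
        for i in range(0, len(lengths), batch_size)
    )


lengths = [5, 1, 1, 1]
total, finished = continuous_batching_steps(lengths, batch_size=2)
print(total, static_batching_steps(lengths, batch_size=2))
```

In this toy model the short requests no longer wait behind the long one: a slot that frees up is reused on the very next step instead of idling until the whole batch completes.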

Changes:

Exposed sequence_length for greater control over text generation pipelines. (#1518)
deepsparse.analyze functionality has been updated to work properly with LLMs. (#1324)
The logging and timing infrastructure for Pipelines has been expanded to enable more thorough tracking and logging, and to further support integrations with Prometheus and other standard logging platforms. (#1614)
UX has been improved for text generation pipelines to more closely match Hugging Face Transformers pipelines. (#1583, #1584, #1590, #1592, #1598)

Resolved Issues:

Compile time for dense LLMs is no longer very slow.
Text generation pipeline bug fixes: corrected sampling logic errors and inappropriate in-place logits mutation resulting in incorrect answers for LLMs when using sampling. (#1406, #1414)
Improper handling of the kv_cache input when using external KV cache management has been fixed; it had resulted in inaccurate model inference for ONNX Runtime comparison pathways. (#1337)
Benchmarking runs for LLMs with internal KV cache no longer crash or report inaccurate numbers. (#1512, #1514)
SciPy dependencies were removed to fix CV pipelines that would fail on import of scipy and crash. (#1604, #1602)
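
The in-place logits mutation mentioned in the text generation fixes above belongs to a general class of bug that is easy to reproduce. The sketch below is a generic illustration and not the DeepSparse code that was patched: the helper names and the temperature-scaling step are assumptions chosen to show how modifying the caller's logits during sampling corrupts any later use of them.

```python
def apply_temperature_in_place(logits, temperature):
    """Buggy pattern (hypothetical helper): rescales the caller's list."""
    for i in range(len(logits)):
        logits[i] /= temperature
    return logits


def apply_temperature(logits, temperature):
    """Safer pattern (hypothetical helper): returns a new list."""
    return [x / temperature for x in logits]


logits = [2.0, 1.0, 0.0]
original = list(logits)

# The copy-based version leaves the model output untouched.
scaled = apply_temperature(logits, 0.5)
assert logits == original

# The in-place version silently changes the model output, so anything that
# reads `logits` afterwards (scores, a second sampling pass) sees wrong values.
apply_temperature_in_place(logits, 0.5)
assert logits != original
```

Working on a copy costs one extra allocation per step, which is usually a reasonable price for keeping the raw model outputs trustworthy.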

Known Issues:

OPT models produce incorrect outputs and are no longer supported.
Streaming support is limited within the DeepSparse Pipeline v2 framework for tasks other than text generation.