Optimum INC CLI
Integration of Intel Neural Compressor dynamic quantization into the Optimum command-line interface. Example commands:
```bash
optimum-cli inc --help
optimum-cli inc quantize --help
optimum-cli inc quantize --model distilbert-base-cased-distilled-squad --output int8_distilbert/
```
- Add Optimum INC CLI to apply dynamic quantization by @echarlaix in #280
Leverage past key values for OpenVINO decoder models
Enable the use of pre-computed key/values to speed up inference. This is enabled by default when exporting the model:
```python
model = OVModelForCausalLM.from_pretrained(model_id, export=True)
```
To disable it, `use_cache` can be set to `False` when loading the model:

```python
model = OVModelForCausalLM.from_pretrained(model_id, export=True, use_cache=False)
```
- Enable the possibility to use the pre-computed key / values for OpenVINO decoder models by @echarlaix in #274
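To illustrate why reusing past key/values speeds up decoding, here is a minimal conceptual sketch (not the OpenVINO implementation; the `project` helper is hypothetical). Without a cache, every decoding step recomputes the key/value projections for the whole prefix; with a cache, each step only projects the newest token.

```python
def project(token):
    """Stand-in for the key/value projection of one token (hypothetical)."""
    return (token * 2, token * 3)  # (key, value)

def count_projections(tokens, use_cache=True):
    """Count how many key/value projections a greedy decode performs."""
    cache = []
    projections = 0
    for step in range(1, len(tokens) + 1):
        if use_cache:
            # Only the newest token is projected; past ones come from the cache.
            cache.append(project(tokens[step - 1]))
            projections += 1
        else:
            # Recompute key/values for the entire prefix at every step.
            cache = [project(t) for t in tokens[:step]]
            projections += step
    return projections

# For a 10-token sequence: 10 projections with the cache,
# versus 1 + 2 + ... + 10 = 55 without it.
```

The speedup grows with sequence length: cached decoding does linear work per token, while uncached decoding is quadratic overall.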
INC config summarizing optimization details
- Add `INCConfig` by @echarlaix in #263
Fixes
- Remove dynamic shapes restriction for GPU devices by @helena-intel in #262
- Enable OpenVINO model caching for CPU devices by @helena-intel in #281
- Fix the `.to()` method for causal language models by @helena-intel in #284
- Fix PyTorch model saving for `transformers>=4.28.0` when optimized with `OVTrainer` by @echarlaix in #285
- Update task name for ONNX and OpenVINO export for `optimum>=1.8.0` by @echarlaix in #286