webnn: Support block-wise quantization for DirectML backend
Block-wise quantization divides input tensors into smaller blocks that
are quantized independently, resulting in faster optimization and
higher-precision quantization [1]. It is used by popular language
models, such as the phi-3 mini int4 quantized model [2]. A related WG
issue [3] has been opened for discussion.
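For reference, the per-element math is
output[i] = (input[i] - zeroPoint) * scale, with one scale and zero
point shared by each block. A minimal sketch of that idea for a 1-D
tensor (illustrative only; names are hypothetical and this is not the
Chromium code):

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // Dequantizes a 1-D tensor where each block of `block_size` elements
  // shares one scale and zero point.
  std::vector<float> BlockwiseDequantize(
      const std::vector<int8_t>& input,
      const std::vector<float>& scales,
      const std::vector<int8_t>& zero_points,
      size_t block_size) {
    std::vector<float> output(input.size());
    for (size_t i = 0; i < input.size(); ++i) {
      // Elements i = b * block_size .. (b + 1) * block_size - 1 all use
      // scales[b] and zero_points[b].
      const size_t block = i / block_size;
      output[i] = (static_cast<float>(input[i]) -
                   static_cast<float>(zero_points[block])) *
                  scales[block];
    }
    return output;
  }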
First, this CL validates the scale and zero point tensors for
block-wise quantization. It then implements block-wise quantization in
the DirectML backend using DML_OPERATOR_QUANTIZE and
DML_OPERATOR_DEQUANTIZE, which are available in feature level (FL)
>= 6.3.
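The shape validation follows from the block-wise semantics: the zero
point tensor must have the same shape as the scale tensor, and each
input dimension must be evenly divisible by the corresponding scale
dimension (the quotient is the block size along that axis). A hedged
sketch of that check (hypothetical helper; the actual validation in
this CL may differ in detail):

  #include <cstddef>
  #include <vector>

  // Returns true if `scale_shape` and `zero_point_shape` describe a
  // valid block-wise layout for a tensor of shape `input_shape`.
  bool IsValidBlockwiseScaleShape(
      const std::vector<size_t>& input_shape,
      const std::vector<size_t>& scale_shape,
      const std::vector<size_t>& zero_point_shape) {
    if (zero_point_shape != scale_shape ||
        scale_shape.size() != input_shape.size()) {
      return false;
    }
    for (size_t d = 0; d < input_shape.size(); ++d) {
      // Block size along axis d is input_shape[d] / scale_shape[d];
      // the division must be exact.
      if (scale_shape[d] == 0 || input_shape[d] % scale_shape[d] != 0) {
        return false;
      }
    }
    return true;
  }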
Additional validation and conformance tests are added to verify the
implementation.
Bug: 40206287
Change-Id: I977b0be57deebd7afcae216edc3ddc3818b8c09f
Cq-Include-Trybots: luci.chromium.try:mac14.arm64-blink-rel, mac14-blink-rel, mac15.arm64-blink-rel, mac15-blink-rel, linux-blink-rel
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/5964816
Reviewed-by: Rafael Cintron <rafael.cintron@microsoft.com>
Reviewed-by: ningxin hu <ningxin.hu@intel.com>
Commit-Queue: ningxin hu <ningxin.hu@intel.com>
Cr-Commit-Position: refs/heads/main@{#1380767}