2.3.1 Release Note
1. Important Updates
- Version 2.3.1 builds on 2.3 by fixing known issues and releasing precompiled binaries that support CUDA 11.6.
2. Training Framework (distributed included)
(1) Function Optimization
API
- Modify the `paddle.nn.initializer.KaimingUniform` and `paddle.nn.initializer.KaimingNormal` initializers to support multiple types of activation functions. (#43721, #43827)
- Optimize the data prefetching of `paddle.io.DataLoader` so that `prefetch_factor` sets the cache size of prefetched data, which avoids IO blocking when reading large blocks of data. (#43674)
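To make the first item concrete, here is a minimal pure-Python sketch of how Kaiming (He) initialization is commonly adapted to the activation function through a gain factor. It is not Paddle's implementation: the helper names (`calculate_gain`, `kaiming_uniform_bound`, `kaiming_normal_std`) and the gain table are assumptions that follow the widely used convention, and the release note does not list which activations are supported.

```python
import math


def calculate_gain(nonlinearity, negative_slope=0.0):
    """Return the conventional gain for an activation (assumed table, not Paddle's)."""
    if nonlinearity in ("linear", "sigmoid"):
        return 1.0
    if nonlinearity == "tanh":
        return 5.0 / 3.0
    if nonlinearity == "relu":
        return math.sqrt(2.0)
    if nonlinearity == "leaky_relu":
        return math.sqrt(2.0 / (1.0 + negative_slope ** 2))
    raise ValueError(f"unsupported nonlinearity: {nonlinearity}")


def kaiming_uniform_bound(fan_in, nonlinearity="relu", negative_slope=0.0):
    """Weights are drawn from U(-bound, bound) with bound = gain * sqrt(3 / fan_in)."""
    return calculate_gain(nonlinearity, negative_slope) * math.sqrt(3.0 / fan_in)


def kaiming_normal_std(fan_in, nonlinearity="relu", negative_slope=0.0):
    """Weights are drawn from N(0, std^2) with std = gain / sqrt(fan_in)."""
    return calculate_gain(nonlinearity, negative_slope) / math.sqrt(fan_in)
```

With the ReLU gain, the uniform bound reduces to the familiar `sqrt(6 / fan_in)`; other activations only change the gain.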
New dynamic graph execution mechanism
- Modify the initialization method of optional type Tensor in the new dynamic graph API logic to prevent data exceptions caused by early destruction. (#42561)
New static graph executor
- Defer initialization of the thread pools in the executor, to avoid creating thread pools for `program`s that run only once (such as `save`, `load`, and `startup_program`). (#43768)
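The idea of deferring thread pool creation can be sketched in plain Python as follows. `LazyPoolExecutor` is a hypothetical class written for illustration, and the policy of running the first round serially is an assumption; the release note only states that pool creation is deferred so that one-shot programs do not pay for it.

```python
from concurrent.futures import ThreadPoolExecutor


class LazyPoolExecutor:
    """Hypothetical executor that creates its thread pool only when first needed."""

    def __init__(self, max_workers=4):
        self._max_workers = max_workers
        self._pool = None  # nothing is allocated up front
        self._runs = 0

    @property
    def pool_created(self):
        return self._pool is not None

    def run(self, ops):
        """Run a list of zero-argument callables and return their results in order."""
        self._runs += 1
        if self._runs == 1:
            # First round: run serially, so a program that executes once never builds a pool.
            return [op() for op in ops]
        if self._pool is None:
            # Second round onwards: create the pool on demand and reuse it afterwards.
            self._pool = ThreadPoolExecutor(max_workers=self._max_workers)
        return list(self._pool.map(lambda op: op(), ops))

    def close(self):
        """Shut the pool down if it was ever created."""
        if self._pool is not None:
            self._pool.shutdown()
            self._pool = None
```

An executor used once (for example, for a startup-style program) finishes with `pool_created` still `False`; only repeated runs trigger pool construction.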
Mixed precision training
- Disable the `state_dict` hook in `set_state_dict` of `paddle.nn.Layer`. (#43407)
Distributed training
- Enable tensor model parallelism in `paddle.incubate.nn.functional.fused_attention` and `paddle.incubate.nn.functional.fused_feedforward`. (#43505)
Others
- Adjust the format of the strings printed for framework operator kernels to facilitate automated splitting and parsing. (#42931)
- Update the model quantization API to support the `rounding to nearest ties to even` rounding mode and the quantization range [-128, 127]. (#43829)
- Support AMP mixed precision training in quantization-aware training. (#43689)
- Add a `progress bar` at the start of quantization-aware training to make the progress of quantization initialization easy to check, and skip the scale op when collecting out_threshold statistics to speed up initialization. (#43454)
- Support `conv` and `bn` fusion in dynamic graph quantization training, and support setting `skip_tensor_list` in static graph offline quantization to skip quantization of certain layers. (#43301)
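The rounding mode and value range mentioned for the quantization API can be illustrated with a small pure-Python sketch. The functions `quantize_int8` and `dequantize_int8` are hypothetical helpers, not Paddle APIs; the sketch relies on Python's built-in `round()`, which rounds exact ties to the nearest even integer.

```python
def quantize_int8(x, scale):
    """Quantize a float with round-half-to-even, then clamp to [-128, 127]."""
    q = round(x / scale)  # built-in round() sends exact ties to the even neighbour
    return max(-128, min(127, q))


def dequantize_int8(q, scale):
    """Map a quantized integer back to the float domain."""
    return q * scale
```

Ties such as 0.5 and 2.5 go to 0 and 2 rather than always rounding up, which avoids a systematic upward bias, and the clamp keeps the full signed 8-bit range including -128.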
(2) Performance Optimization
- Optimize the `paddle.incubate.nn.functional.fused_attention` and `paddle.incubate.nn.functional.fused_feedforward` operators by adding an `add_residual` attribute that controls whether the final step adds the `residual`. This improves CAE model performance by 7.7%. (#43719)
- Optimize the `linspace` operator by initializing its three input Tensors, `start`, `stop`, and `num`, on the CPU to avoid a GPU->CPU copy inside the operator. This improves SOLOv2 model performance by 6%. (#43746)
(3) Bug Fix
API
- Fix an error that `paddle.io.DataLoader` reports with low probability when `return_list=True`, caused by a multi-thread conflict. (#43691)
- Fix the error that the `to` method reports that NoneType has no device attribute when `paddle.nn.Layer` has parameters of `None` type. (#43597)
- Fix the bug that the cumsum op computes wrong results for some `shape`s. (#42500, #43777)
- Fix the bug that, in static graph mode, the output of `Tensor.__getitem__` has dimension 0 at the network-building stage when a `bool` index is used. (#43246)
- Fix exceptions in `paddle.slice` and `paddle.strided_slice` when handling negative parameters. (#43432)
- Fix the bug that the set_value op assigns abnormal results when the slice `step` is negative. (#43694)
- Fix the bug that the C++ `copy` interface cannot copy between devices in multi-card setups. (#43728)
- Fix an inference-time problem caused by attribute naming in `paddle.incubate.nn.functional.fused_attention` and `paddle.incubate.nn.functional.fused_feedforward`. (#43505)
- Fix an exception in the ConditionalBlockGrad op when processing Tensors that do not require `grad`. (#43034)
- Fix the increase in device memory caused by the backward speed optimization of the C++ einsum op, and enable this backward optimization by default. (#43397)
- Fix the bug that data is not reproducible with a fixed random seed when `paddle.io.DataLoader` reads data with multiple processes on a single card. (#43702)
- Fix the bug that the softmax op triggers CUDNN_STATUS_NOT_SUPPORT when the Tensor has more than 2G elements. (#43719)
- Fix the bug that the trace op `Event` strings are not distinguished between operators, which makes performance analysis inconvenient. (#42789)
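Several of the fixes above concern negative slice parameters and negative steps. As a reference for the expected semantics only, the following sketch uses Python's own slicing rules; `normalize_slice` and `set_value` are hypothetical helpers that operate on plain lists, and treating Python list semantics as the target behavior is an assumption rather than something the release note states.

```python
def normalize_slice(start, stop, step, dim_size):
    """Resolve possibly negative or missing slice parameters for a dimension of dim_size."""
    return slice(start, stop, step).indices(dim_size)


def set_value(data, start, stop, step, values):
    """Assign values into data[start:stop:step] in place, visiting indices in slice order."""
    indices = range(*normalize_slice(start, stop, step, len(data)))
    if len(indices) != len(values):
        raise ValueError("number of values does not match the slice length")
    for i, v in zip(indices, values):
        data[i] = v
    return data
```

For a dimension of size 5, `normalize_slice(-3, -1, 1, 5)` resolves to `(2, 4, 1)`, and a step of -2 visits indices 4, 2, 0 in that order, so the first value lands on the last element.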
Others
- Fix the device memory overflow caused by repeated deepcopy and saving in dynamic-to-static conversion. (#43141)
- Fix the bug that the device id is wrong in multi-card scenarios, introduced by the upgrade of the PlaceType type used in custom operators. (#43830)
- Optimize the timeline visualization logic of `paddle.profiler.Profiler` by moving events customized in Python scripts from the C++ folding layer to the Python folding layer. (#42790)
3. Deployment Direction (Paddle Inference)
(1) New Features
New functions
- Add support for PaddleSlim quantized models in the ONNX Runtime backend on CPU. (#43774, #43796)
(2) Underlying Optimization
CPU performance optimization
- Remove `gpu_cpu_reshape2_matmul_fuse_pass` from the EnableMkldnn configuration to fix a ResNet50 performance degradation. (#43750)
GPU performance optimization
- Add TensorRT convert support for `bilinear_interp_v2`. (#43618)
- Add `matmul_scale_fuse_pass` and `multihead_matmul_fuse_pass_v3` to the GPU passes, with unit tests. (#43765)
- Add support for deferred initialization of the GPU handle. (#43661)
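The release note does not describe what `matmul_scale_fuse_pass` does internally. Going by its name, a plausible reading is that a matmul followed by a constant scale is folded into a single matmul with a scaling factor. The sketch below shows that equivalence on plain nested lists; the `alpha` parameter and both helper functions are assumptions made for illustration.

```python
def matmul(a, b, alpha=1.0):
    """Multiply a (m x k) by b (k x n), scaling every output element by alpha."""
    return [
        [alpha * sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


def scale(m, factor):
    """Multiply every element of a matrix by a constant factor."""
    return [[factor * v for v in row] for row in m]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

unfused = scale(matmul(a, b), 0.5)  # two passes over the output
fused = matmul(a, b, alpha=0.5)     # one pass, scale folded into the matmul
```

Both paths produce the same matrix, but the fused form avoids launching a second operator and writing an intermediate result.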
(3) Bug Fix
Framework and API fixes
- Fix a compilation error when building together with Paddle-Lite XPU. (#43178)
- Fix the bug that the ERNIE 3.0 pass is triggered by mistake. (#43948)
- Fix the bug that the int8 quantization attribute in the multihead op cannot be read. (#43020)
Backend capability fixes
- Fix the bug that the elementwise_mul and matmul ops in MKLDNN crash during quantized inference. (#43725)
- Fix the bug that TensorRT subgraph serialization files are repeatedly generated for the same model during inference. (#42945, #42633)
- Fix a conflict between the ONNX Runtime backend and externally used protobuf. (#43159, #43742)
- Fix an inference error in the Python prediction library with the ONNX Runtime backend when the model has multiple inputs. (#43621)
4. Environment Adaptation
Compile and install
- Complete verification and adaptation of CUDA 11.6, and release CUDA 11.6 precompiled binaries on the official website. (#43935, #44005)
- Fix a cub error when compiling with CUDA 11.6 on Windows. (#43935, #44005)
- Fix the long compilation time of elementwise and reduce ops. (#43202, #42779, #43205)