PaddlePaddle 2.5.0 Release Note
1. 重要更新
- 动静统一新架构:实现基础算子组合的动转静加编译器执行新模式,在ResNet50&Bert模型上完成动转静、组合算子、神经网络编译器优化加速全流程。动转静完成整图fallback核心功能开发,支持动转静失败时回退到动态图训练执行;组合算子设计一套包含150多个基础算子的基础算子体系,实现python层前向算子拆分机制和支持动、静态图的反向算子拆分机制,实现70多个常用前、反向算子的拆分;CINN编译器修复正确性问题,开发关键Pass,添加手工Schedule规则,实现内核代码自动生成,ResNet50模型性能提升12%,Bert模型性能提升10%。
- PHI算子库算子架构统一:将原算子体系下剩余的350+算子内核全部统一到PHI算子库中,以及原算子体系中的算子定义方式也都统一为PHI算子库的算子定义形式(基于YAML配置定义算子),提升了架构统一性,降低了框架开发的理解成本;将PHI算子库依赖的Fluid头文件全部解耦,并独立编译为动态链接库,为框架的二次开发提供更轻量的算子库复用方式;继续对飞桨框架中不规范的算子以及算子内核进行规范化调整,便于开发者理解,降低了硬件的接入成本。
- 静态图新执行器全面上线:静态图新执行器实现多项功能和性能优化,完成对原有多套旧执行器的统一和替换,成为静态图单卡和分布式训练python端入口以及动转静、控制流、CINN等后端默认使用的执行引擎,大幅提升框架调度性能,功能架构更加清晰,二次开发能力显著增强。
- Python API 支持0维tensor:为形状为 `[1,]` 及形状为 `[]` 的张量定义了清晰的语义。
- 新的环境适配:适配了CUDA 12,并支持使用gcc12进行编译。
2. 不兼容升级
- 飞桨API支持0维tensor。飞桨之前用shape为[1]的1维tensor来替代0维tensor,这种替代方式和当前主流习惯有差异,增加模型的开发调试成本,有时还会导致非预期错误。本版本对需支持0维tensor的376个API进行了修正,并与社区广泛使用的工具(如EinOps)的行为保持一致。例如,在之前的情况下,模型训练中输出的loss为1维tensor,如果要取出或打印loss,往往需要使用 `loss.numpy()[0]` 这样的代码。经过本次修改后,模型训练中输出的loss为0维tensor,使用 `loss.numpy()` 即可取出或打印loss,代码简短、易懂且符合业界使用习惯。
- `paddle.fluid` API全面退场。按照上个版本已预告的计划,本次退场了1116个 `paddle.fluid` API及相关内部接口,剩余少量相关内部接口会在下个版本全部清理完成。fluid API属于飞桨2.0本计划移除但考虑到兼容性等因素延缓清理的历史API,本次退场清理不会影响基于飞桨2.0开发的程序,飞桨API体系也会更加简洁易懂。
- 旧版动态图Python端代码完成清理。至此,Python端仅使用新版动态图调用C++核心逻辑。
- 为统一静态图模型数据并行的训练方式,废弃原有的单进程多卡训练方式,包括 `paddle.static.ParallelExecutor` 和 `paddle.static.CompiledProgram().with_data_parallel()` 两个接口,原因是这套接口只支持单机多卡,不支持多机多卡,且底层执行性能较差。推荐统一使用多进程多卡训练方式,即 `paddle.distributed.launch` 接口来进行数据并行的分布式训练。该升级只影响静态图,不影响动态图和动转静训练,如果使用了废弃接口,请参考 数据并行 的文档修改模型代码。#50351,#50501,#51240,#51701,#51616,#51369,#52671
- 移除框架中原有的昇腾NPU和寒武纪MLU的适配代码,全部升级为CustomDevice插件式适配方式,并将昇腾NPU和寒武纪MLU的适配代码迁移至PaddleCustomDevice仓库。
3. 训练框架(含分布式)
Python API
API 支持0维tensor
- API输入支持0维tensor,涉及 `paddle.reshape`、`paddle.trace`、`paddle.linalg.norm` 等286个API。#53208, #53592, #47074, #53186, #47677, #49357, #50237, #46555, #47219, #47501, #47858, #47961, #48058, #48007, #49755, #51024, #51566, #51899, #49813, #47812, #47849, #47251, #53125, #53828, #51265, #47689, #48452, #49072, #48638, #49175, #49279, #50857, #49805, #47734, #45992, #49616, #49959, #50536, #49544, #49842, #46909, #49361, #50169, #48314, #48735, #49122, #49122, #49177, #49501, #49562, #49340, #49550, #49596, #49730, #49667, #49692, #49854, #49845, #49803, #49889, #49904, #49518, #49884, #49880, #49862, #49921, #49260, #49929, #49570, #49882, #50213, #49780, #50271, #50289, #50293, #49735, #50433, #49847, #50635, #50950, #50947, #49460, #53087, #51687, #52185, #54649
- API输出支持0维tensor,涉及 `paddle.sum`、`paddle.min/max`、`paddle.any/all` 等90个API。#52891, #52861, #52775, #52850, #52843, #52857, #51721, #53051, #53192, #52739, #52741, #53175, #51889, #53199, #53242, #53421
- 支持0维tensor后,修正原有不规范的代码,及对模型代码中的非规范用法进行提示和兼容。#51562, #51586, #51757, #52197, #54117。
new API
- 新增 jacobian 和 hessian API,用于科学计算。#53331
- 新增稀疏计算API。例如 `paddle.sparse.reshape`、`paddle.sparse.sum` 和 `paddle.sparse.slice` 等。#46694, #51513, #53794, #51406
- 新增其它API。例如 `paddle.optimizer.LBFGS`、`paddle.index_put` 和 `paddle.logaddexp` 等。#53314, #51912, #52886, #50843, #47282, #52284
动态图
新功能
- 新增了paddle.nn.utils.clip_grad_norm_用于梯度裁剪,以及paddle.Tensor.data_ptr用于获取Tensor数据的内存/显存地址 PR49935, PR48235, PR49173
- 新增了saved_tensors_hooks机制,用于临时存放和取回用于反向计算使用的前向Tensor。 PR45763, PR46215, PR48124
- Tensor支持了pickle协议,用于支持Tensor的序列化。 PR47025, PR48179
- 新增了调试日志,反向出现nan/inf时打印前向Python堆栈 PR53217 PR52639 PR52729
- 新增了对expand_v2, tile, concat, assign, slice高阶微分的支持。PR45941, PR45942, PR45940, PR45879, PR45960
功能优化
- 优化了动态图的日志打印,包括日志内容优化、VLog级别优化、报错内容优化等。PR45783, PR46349, PR46934, PR47724
- 新增了FLAGS_auto_growth_chunk_size_in_mb用于auto_growth_allocator最小chunk size的设置 PR52204
bug fix
- 修复了一些算子的bug,包括:batch_norm, slice, set_value, scale, multinomial, adam, conv, transpose2_grad, conv2d_transpose_double_grad。PR47802, PR47634, PR47349, PR46124, PR46147, PR50388, PR48626, PR48519, PR50386, PR48432, PR51851
- 修复了PyLayer的一些错误问题。PR51740, PR47154, PR47323, PR54041, PR48533
- 确保sync_batch_norm在反向有序,防止错序导致hang或精度错误。PR52268, PR52860, PR52779
- 修复了linspace在AMP下的bug。PR46088
- 修复了Python C API错误调用导致Windows崩溃的问题。PR46833
- 修复了DataLoader可能遗漏删除/dev/shm的问题。PR48511
- 修复了paddle.grad的一些问题。PR47151
- 为不支持高阶微分的算子添加报错信息。PR47231
- 为Python运算符添加numpy array的支持。PR48229
- element_size接口存在两处重复定义,删除其中之一。PR49631
- 修复老动态图开VLOG崩溃问题。PR47115
- XPU下的D2D拷贝改为D2H+H2D,以规避多线程问题。PR48373
性能优化
- Python运算符下沉到C++实现,以提升API性能, 下沉后该类API有3~6倍性能提升。PR45811, PR46326, PR46329, PR46520, PR46542, PR46565, PR47060, PR47077, PR47174, PR47315
- 优化了Optimizer CPU调度性能,可减少Optimizer阶段导致的GPU Gap。 PR49787, PR50188, PR51340, PR49864, PR50158, PR50335
- API中可下沉到C++的逻辑,下沉到C++,以提升API性能。PR46412, PR46190
- 优化动态图下Python端不必要的调用逻辑,以提升API性能。PR46221, PR49473, PR49574, PR49589, PR49612, PR49717, PR49733, PR49823, PR49508, PR46840
- 优化了Allocator的使用,以提升动态图API调度性能。PR47125, PR48548, PR50995, PR47731
- 优化了fused_attention算子性能。PR48902
- optimizer的_add_accumulator,如果device是CPU,且在动态图下,直接使用full初始化var。PR48189
- 对反向图不必要执行的subgraph进行剪枝以提升性能。PR47827
- 优化了initializers的性能。PR46033
- 新增fused dropout add算子,提升dropout 和 add 一起计算的性能。#52903
静态图
静态图新执行器全面上线
静态图新执行器实现多项功能和性能优化,完成对原有多套旧执行器的统一和替换,成为静态图单卡和分布式训练python端入口以及动转静、控制流、CINN等后端默认使用的执行引擎,大幅提升框架调度性能,功能架构更加清晰,二次开发能力显著增强。#45913,#46025,#48911,#50239,#45696,#46092,#48158,#51389,#49708,#49275,#48789,#49939,#51149,#52652
算子库
自定义算子等功能增强
包括:全新支持了自定义扩展机制,实现将 C++ 扩展的运算函数绑定至Python端使用,进一步提升了框架的二次开发能力;扩展支持自定义硬件上使用自定义算子机制,以满足硬件厂商实现非Paddle已有算子的需求;扩展支持了在自定义算子中实现inplace、`vector<Tensor>` 输出、`optional<Tensor>` 输入等高阶机制;优化了自定义算子在动态图模式下的调度性能,多输入参数的算子性能提升 25.4%;为自定义算子Tensor扩展新增了常用运算符及API,支持链式调用,简化代码写法。对算子内核选择机制进行了优化;对部分算子内核进行了逻辑完善、支持数据类型增强以及性能优化;新增以及完善 XPU 内核 100+;修复各项 Bug 累计 170+。
#49222, #51773, #51923, #53080, #50731, #50563, #50840, #50983, #51713, #48733, #50558, #50764, #51973, #52216, #51027, #50745, #50756, #50886, #50813, #50869, #51085, #51646, #51620, #51844, #52421, #52872, #52597, #50582, #52114, #52915, #50928, #48272, #48702, #52191, #52191, #47374, #47375, #47378, #54126, #47638, #47661, #50606, #53528, #50599, #51727, #50825, #50773, #50979, #53336, #53555, #53716, #53753, #53981, #53977, #53980, #54043, #54066, #52866, #53043, #53325, #54323, #54367, #51353, #53749, #50013, #47570, #50997, #51241, #49537
算子体系架构统一
具体包括:将原算子体系下剩余的350+算子内核全部统一到PHI算子库中,以及原算子体系中的算子定义方式也都统一为PHI算子库的算子定义形式(基于YAML配置定义算子),提升了架构统一性,降低了框架开发的理解成本;将PHI算子库依赖的Fluid头文件全部解耦,并独立编译为动态链接库,为框架的二次开发提供更轻量的算子库复用方式;继续对飞桨框架中不规范的算子以及算子内核进行规范化调整,便于开发者理解,降低了硬件的接入成本。
#47856, #49328, #49138, #52014, #52044, #52116, #52486, #52101, #52882, #53003, #53034, #51914, #49116, #52626, #52878, #52879, #52880, #52875, #51600, #51601, #51590, #51887, #51891, #52036, #52130, #52134, #51951, #51886, #52274, #52263, #51913, #52145, #52347, #52370, #52437, #52424, #52231, #52522, #52529, #52802, #52799, #52855, #52711, #52940, #53309, #47817, #48001, #48063, #48049, #48168, #48415, #48696, #48970, #50183, #50407, #50498, #50419, #50282, #50870, #50911, #50865, #51288, #53735, #47248, #47787, #52202,
#47579, #49444, #45772, #51264, #51634, #51631, #47385, #46342, #47510, #47532, #47702, #47860, #49470, #50358, #49121, #50190, #52374, #52372, #52375, #52371
动转静加组合算子
新功能
- 组合算子添加dropout, silu, stack, relu, expand, unsqueeze, pow, squeeze, meshgrid, batch_norm, layer_norm, group_norm, instance_norm, full_like, split, split_with_num, gelu, mean, flatten, rsqrt, hardswish算子的组合规则 #50497, #50838, #50861, #50819, #50810, #51527, #51070, #51539, #51061, #49894, #50422, #51874, #51341, #50295, #50298, #50672, #51432, #51003
- 组合算子添加gather_nd, reduce_max, group_norm, relu, reduce_max, gather, topk, sqrt, elementwise_pow, softmax, batch_norm, prod, multiply, expand, div, relu, slice, cumsum, sigmoid, layer_norm, sin, cos, roll, instance_norm, abs, assign, tile, scatter_nd_add, erf, floor, log, silu, leaky_relu, pad算子的vjp规则 #50966, #51653, #52663, #51742, #52203, #50794, #50305, #50786, #50679, #51045, #51230, #51474, #51283, #51238, #49831, #51838, #50771, #50565, #51768, #51750, #51748, #52532, #52935, #50963, #51430, #53141, #52469, #50436, #51059, #51296, #52533, #53374
- 组合算子添加matmul, tanh, elementwise二阶微分规则 #50452, #52192, #53014
- 组合算子添加exp, reduce_mean, softmax, divide, cast, layer_norm, prod, meshgrid, expand_as, dropout, concat, gather_nd, elementwise_max, elementwise_pow, reduce_max组合算子bf16数据类型支持 #54263, #54236, #53865, #54175, #54399
- 动转静新增控制流中的容器添加赋值语义支持 #51248
- 动转静新增全图回退功能,当动转静转换失败时,可全图回退到动态图方式执行; 回退机制增加set_eval_frame接口 #50111, #52006
- 动转静to_static支持算子组合机制;支持被to_static装饰下使用register_hook的场景; #49836, #52948, #53572
- 动转静to_static接口增加backend参数,可以指定为 `CINN` 或者 None,当该参数指定为 `CINN` 时,将会使用 CINN 编译器来加速训练和推理 #52596
- 新增primitive接口代码自动生成功能,根据ops.yaml和legacy_ops.yaml中的算子定义,自动生成primitive接口的代码,自动生成Tensor运算接口 #50315, #49654, #50642
- 新增算子前向组合功能,通过注册前向算子的组合规则,实现将前向算子拆分成基础算子 #49605
- 新增组合算子开关,可以在shell中通过设置环境变量,实现算子按照不同方式进行拆分 #50309
- 添加 `OpTest` 新增组合测试功能,对算子精度进行保障;添加elementwise类基础算子单测;添加batch_norm的CINN单测 #50509, #50807, #52815
功能优化
- 添加组合算子支持FP16运算和AMP O1运算;添加softmax和layer_norm算子AMP逻辑 #52397, #52598, #51473
- 简化组合算子batch_norm的组合规则和vjp规则 #54012, #51827, #51933,
- 组合算子优化组合规则,提升含scalar组合规则的性能;优化组合算子日志打印 #51960, #50160
- 组合算子支持jit.save接口;新增自定义VJP规则接口 #52344, #50885
- 组合算子gather_grad删除overwrite参数。 #52707
- 动转静代码风格清理,报错信息优化,规范日志 #48637, #46128, #52527, #46800,#46415
- 动转静通过调用append backward的方式获取 `grad var name` 以修复高阶梯度计算时的错误 #53250
- 动转静功能升级,清理to_static的临时目录以加速代码转换;增强to_static自动略过内部接口;支持在程序中使用to_static装饰器 #47102, #50596, #45768
- 动转静优化 `print` 函数转换以支持在组网阶段打印 Tensor 参数;升级参数收集机制 #48672, #50336
bug fix
- 组合算子修复cmake编译错误;修复cuda 12测试错误;修复若干算子如meshgrid, expand_as, concat, conv, arange等错误 #49643, #54622, #53951, #53951, #53350, #51486, #52764
- 组合算子修复若干rank=1, shape=-1, amp, 多进程等场景下的bug;#51413, #51435, #50518, #47301,
- 组合算子修复composite grad maker和static prim api自动代码生成bug; 修复op创建属性丢失和部分组合规则不生效的bug #50854, #51445, #50780, #52120
- 组合算子修复一些其他bug #50086, #51208, #51577, #53598, #47500, #52119, #50397, #50527, #50788, #51014, #52154, #52752
- 动转静修复dataloader, cond输入dict, transformer导入, T5模型内存泄露, grad var name解析错误等bug #49821, #47299, #50776, #50883, #51100, #51464, #51966, #52110, #52821
- 动转静修复Lazy初始化,Windows训练,is_paddle_func失效,recurrent op删除pass失败等错误 #50785, #52580, #51585, #51763, #51763
性能优化
- 动转静调用run_program_op的执行过程中,增加scope缓存和复用机制,避免每个step都会传入新的scope #45813
分布式训练
动态图分布式
- 去除旧动态图分布式sharding功能API #49334
- fleet升级到distributed目录 #50834
- 优化分布式策略的日志打印。#47761
- 重计算支持hook模式、inplace功能、stop_gradient模式,支持更灵活的使用。 #48471, #47985
- 数据并行
- 流水线并行
- 分组切分并行
- 张量模型并行
- Launch启动
- 通信库
- 增加自定义混合并行通信组,拓扑结构信息打印,自定义通信拓扑顺序。#47021,#54000,#51781
- 去除通信库对Place信息依赖 #47857
- 增加通信库对GLOO算子支持,支持send/recv/gather。 #52221, #52334,#49084
- 禁止通信算子的反向计算。#47636
- 新增通信库静态shape check,帮助判别通信量是否匹配。#48256,#48915,#48646
- 支持通信python object类型,BF16类型,alltoall,reduce,allgather,group call,global gather,broadcast,scatter通信方式,XPU设备通信支持。#51765,#45844,#48059,#48115, #48339,#49252,#49451,#50085,#50701,#48208,#48736,#51762,#52495,#53514,#48232,#49896,#49941,#45584
- 新增对计算流通信功能。#46182,#46023,#46295,#46761,#47481,#47740,#47976,#48163,#48396,#48308,#47110,#53089
- 优化通信库TCP建联时间。#49810,#47184
自动并行
- 静态图半自动并行功能完善:
- 新增多个算子的FLOPs计算函数,并新增基于FLOPs的计算Cost建模 #48083,#47978,#47595,#48083,#48084,#47816
- 接口易用性提升,完善 DistAttr, Process Mesh, Engine API、信息打印、输入输出等模块;执行Engine新增cost接口,可用于理论分析模型运行的时间和显存开销 #47503,#46416,#46554, #46633,#49214,#53848,#46552, #47043, #49665, #52912, #45776, #47263
- 优化Pass的通用性和易用性升级,支持更多场景、减少Pass预分析耗时 #46519,#47358,#46391, #51035
- 调试能力增强,添加分布式随机性控制机制和混合并行精度对齐工具 #52903,#49865
- 支持推理生成任务组网的自动切分, 适配生成模型中的控制流、conditional block等特殊用法 #46771, #54067
- 完善grad_clip,支持了数据并行场景下的负载均衡。#49510, #49249
- 静态图半自动并行性能提升:
- 新增 Sharding Pass 自动化通信Fuse 和 多流通信功能,GPT 6.7B 模型两机上吞吐性能提升 26% #48604, #47180,#46180
- 新增 Recompute 优化策略调优功能,支持根据显存和模型大小选择最优 recompute checkpoint 设置 #48608,#47846,#49010
- 流水线并行新增 1F1B 调度优化 Pass #54260, #45915
- 数据并行优化,支持融合通信和通信计算Overlap 等优化, GPT 1.3B模型内性能提升 5% #48092,#45643,#49744, #47578
- 优化 Reshard模块concat 性能,减少部分场景下concat 次数。#47809
- 混合精度优化Pass性能升级, 支持 BF16 低精度, 适配 while 循环控制流的自动混合并行等 #51285,#51147, #49219, #49079
- 静态图全自动并行功能完善:
参数服务器
- 清空ps目录下的 __all__ 列表,不再对外暴露其中的API #51289
- 清理cvm算子 #48989
- GPUPS新增对AFS支持。#46611
- PGLBOX2.0 日志降级、修复dense参数卡住问题、修复barrier不生效的问题、增加 get_epoch_finish python端接口#49946,#50166,#50349
- GPUPS运行切换到指定模式。#51115
- GPUPS加入benchmark。#49587,#49649
- GPUPS优化器选择问题修复,修复reader读取问题,修复RPC编译问题。 #47026,#47192,#49878, #46356,#46575,#49389,#46258,#50136
- 增加rocksdb编译方式。#46074
CUDA
新功能
- 新增对CUDA 12.0的编译支持,并修复相关单测 (#49539, #54542)
- 新增CUDNN Frontend API的编译支持及相关单测,可以使用 `WITH_CUDNN_FRONTEND=ON` 的编译选项进行开启。(#47524, #47612)
功能优化
- 混合精度策略及精度优化:
- 新增及优化了框架200余个算子的FP16、BF16数据类型支持,包括logsumexp,reduce_max,cumprod,sync_batch_norm,compare类OP等,并对所有FP16、BF16算子进行了精度优化及单测覆盖,针对低精度算子完善单测框架功能,确保在大模型训推过程中精度无损。(#51193, #51114, #45817, #52862, #52919, #52921, #46413, #48205, #54193, #48041, #48121, #46364, #51153, #53023, #53079, #53137, #46212, #50908, #52555, #51582, #47897, #45601, #53522, #52666, #50101, #48315, #50847, #50905, #50906, #50909, #50916, #50917, #50920, #50919, #50904, #50918, #50938, #50858, #50933, #50945, #50936, #51168, #51493, #50924, #50923, #50926, #50925, #50930, #53284, #53286, #53285, #50976, #50915, #50915, #48192, #50993, #50998, #51380, #51137, #51106, #51197, #51159, #51552, #51151, #51005, #51565, #51036, #51185, #51791, #51083, #51694, #51689, #51009, #51051, #51532, #51978, #51903, #51888, #52016, #52035, #52184, #52018, #51787, #51640, #52172, #52193, #51160, #51809, #51678, #52158, #51015, #52240, #52276, #52233, #52220, #52107, #52282, #52311, #52315, #52357, #52256, #51649, #52413, #52369, #51837, #52112, #51819, #52388, #52411, #52521, #51300, #51117, #52380, #52317, #51263, #52668, #52259, #50999, #52407, #52288, #52845, #50953, #52667, #52582, #52426, #51884, #52630, #52136, #52604, #51615, #51275, #52898, #52918, #52572, #52683, #52956, #52963, #52954, #52444, #52314, #52887, #52195, #53100, #52961, #52953, #53111, #53549, #53736, #52920, #53195, #53535, #53876, #53785, #53722, #54285, #54232, #53922, #47277, #50811, #54571, #50129, #50340, #50848, #50849, #50868, #50878, #50929, #50939, #50973, #50913, #51145, #51090, #51098, #51094, #51216, #51736, #51684, #51925, #54030, #50700, #52264, #51069, #51101, #51286, #53582,#49869))
- 混合精度策略(AMP)优化:在混合精度训练的易用性、精度稳定性及可调试性方面进行了全面的升级和优化,能够更好的支持大模型训练加速。易用性方面统一了动静态图API,并新增model.float()、model.float16()、model.bfloat16()等转换接口;精度稳定性方面增强了针对BF16类型的策略自动调整,优化了黑名单设置,增强了优化器算子Adagrad、Adamax、Adadelta、RMSProp等对 multi_precision 功能的支持,在O2模式下,完善了master grad机制,并新增类型提升机制,以及新增参数对特定模块使用float32计算以保障精度;在可调式性方面,新增paddle.amp.debugging 模块,提供算子统计、异常值检测、精度对比等功能。( #50132, #50078, #50131, #49705, #52936, #52871, #53289, #53362, #54240, #53768, #48041, #47672, #48843, #49391, #51635, #45541, #53742, #51020, #51063, #52514, #50940, #52936, #53439, #53712, #48238, #52215, #53012, #52918, #54571)
- GroupNorm算子新增对NHWC数据格式的支持 (#47533)
- index_put算子新增对bool和int的混合数据类型支持 (#54195)
- 新增sparse.is_nan API 用于判断sparse tensor中是否含有NaN元素。 (#51513)
bug fix
- 修复trace、roll、dropout_nd、log_softmax等多个算子计算出错、栈溢出,以及部分单测问题。(#50243, #52012, #53795, #53149, #53654, #51054, #49373, #53038)
- 修复conv算子穷举搜索在部分场景不生效的问题。(#47065)
- 修复collective_reduce_scatter等算子在A100上出现timeout的问题。(#54513)
- 修复FusedLinear单测中属性错误的问题。 (#50359)
- 修复在使用Profiler时可能出现的OOM等问题 (#46089)
性能提升
- 进一步优化框架大量算子的GPU Kernel以及eigen实现方式,包括max_pool3d, dropout, adaptive_pooling, depthwise_conv2d、transpose, eigh, broadcast类计算,reduce类计算,prelu,logsumexp,以及sparse类算子等,在更多配置场景下达到更优性能。(#45820, #45959, #45934, #46332, #46287, #47233, #48855, #48560, #49419, #49748, #50348, #52401, #51131, #51141, #51479, #51835, #52509, #52482, #52700, #53112, #53659, #53658, #53154, #54071, #53622, #52952, #46046, #46119, #45946, #47212, #47791, #47454, #45230, #48899, #33051, #49040, #48992, #49086, #50808, #46431, #50931, #48056, #46071, #49231, #38660, #50287, #46111, #46997, #45854, #47738, #48635, #50353, #50362, #51934, #54045, #46679, #52093, #52969)
- 提供更多融合算子实现,以及相关融合Pass,如fused_feed_forward,gather-gemm-scatter,matmul + bias,layernorm_shift_partition + element_add,elementwise类融合等模式,进一步提升使用该模式的模型性能。( #50423, #50091, #50364, #53017, #50755, #50050, #47099, #48848, #49383, #50809, #52361, #52028, #48439, #49009, #51427, #52731, #51805)
文档
- 修复index_put文档中的错误 (#53727)
Intermediate Representation
为了解决飞桨IR体系存在的稳定性问题、降低研发成本,孵化了飞桨新的IR体系,完成了基础的数据结构定义、算子定义生成和执行体系适配。为了更好地支持科学计算场景的高阶需求,完成了silu、cast等算子的高阶适配。
- 完成了IR数据结构定义,包含类型系统、算子定义;打通了和phi kernel的执行适配。#51112, #51992, #50412, #53557, #53953, #50959, #54250, #54197, #54289, #51636, #52846, #53988, #54143, #54035, #54052, #54340, #54356, #54068, #53894, #53707, #54185, #54031, #54220, #54275, #54281, #54186, #54259, #54124, #54292, #48068, #53978
- 完善pass基础设施,包含基础的pass定义、pass注册管理等。 #54023,#54170, #54170, #54308, #54348, #54385
- 完善高阶算子的适配,主要包含基础模块改造和silu、cast算子适配等。 #52005, #53425, #53417, #53417, #53498, #53171, #53632, #53605, #53746, #53874, #54164, #45888, #46024, #46446, #46960
CINN编译器
新功能
- 新增CINN对0D-Tensor的支持,目前为配合主框架升级,暂时采用增加pass的临时方案进行支持,后续会对该方案进行替换升级。 (#53382, #53955, #54064, #54118, #54216, #53454)
- 新增CINN对int8/uint8/int16/uint16/bf16等数据类型的支持 (#50566, #53637)
- 新增CINN expand算子的支持 (#46776)
- 新增CINN对PaddleInference的支持. (#45009)
功能优化
- CINN编译器,传递skip_gc_vars属性到CINN子图;CINN为skip_gc_vars添加fetch算子 #49471, #49553
- CINN编译器,conv2d和conv2d_grad默认不使用cinn算子 #51645
- 将 build_cinn_pass 添加到 BuildStrategy,以便于在动转静中使用 (#49496)
- 增加reshape算子在组合算子机制下的单测 (#51276)
- 主框架联编CINN的版本从固定commit改为develop (#49775)
- 为CINN设置默认Target参数 (#50182)
bug fix
- 修复CINN符号化过程中拓扑排序后出现的算子顺序不一致的问题。 (#52556)
- 修复一些算子计算错误、精度下降,以及单测相关问题 (#53859, #54261, #46801, #53676, #53772)
- 修复CINN对float16类型支持的问题。(#48249)
- 修复build_cinn_pass中的问题。 (#46843)
- 修复了组合算子+动转静 在开启CINN时,出现反向因误被GC而导致的无数据区的问题 (#50116)
- 修复编译器dropout amp出错,组合算子跑resnet出错,inplace变量未找到等问题 #51688, #52813, #51769
性能提升
硬件接入
CustomDevice
- 训练侧新增分布式策略 MP/Sharding/PP/MoE 以及 recompute 重计算功能的支持,推理侧新增分布式策略MP的支持,支持通过CustomDevice接入的硬件昇腾NPU和寒武纪MLU无需修改任何代码即可自动继承CustomDevice新增的所有分布式策略。 #52872, #54384, #53220, #54572, #54573, #54676, #53044, #53719, #53701, #53702, #53703
- 新增API paddle.device.is_compiled_with_custom_device,方便用户判断当前环境是否支持某硬件的插件式设备后端 #49271
- 增加环境变量 CUSTOM_DEVICE_BLACK_LIST 设置,支持黑名单内的算子自动异构到CPU上运行 #50409, #50666
- 优化 CustomDevice 性能,减少对runtime中get_device_count接口的调用次数 #46963
昆仑芯XPU
- 训练侧使用了新版动态图,并新增分布式策略 MP/Sharding/PP、recompute 重计算功能以及通信库的支持;推理侧新增分布式策略MP的支持,并增加对XPU FasterTransformer 算子加速库的支持;#49531, #49815, #48897, #50717, #51082, #49757, #51399, #50329, #48369, #47838,#48076,#47882,#48961,#49043,#49749,#49806,#53427,#48470,#49207,#52296,#51785,#47168,#47445,#50200,#49934,#50792,#52228,#53337,#53389,#53496,#53609,#53697,#53496,#53720,#53734,#54172,PR46227
4. 部署方向(Paddle Inference)
新功能
- 支持Paddle TensorRT多个子图的TensorRT engine或者不同Predictor之间的TensorRT engine共享显存,以便节约显存。#45842 #47631
- C++ API增加获取输入Tensor的Shape和数据类型接口,增加获取输出Tensor的Shape和数据类型接口。C API增加SetExecStream、EnableMkldnnInt8等C++已有接口,用于服务化部署。 #49758
- 新增paddle.inference.Predictor.register_output_hook()接口,可支持调试时打印GPU推理下每层的输出,同时也支持在While等控制流模型中使用。注意此接口不支持Paddle-TensorRT。#54433 ,#47050 , #54254 。
- Paddle Inference推理的Predictor接口支持paddle::Tensor作为输入和输出,以便用户直接复用飞桨动态图做推理前、后处理。 (#50445)
- 增强Paddle TensorRT动态shape运行能力:config.enable_tuned_tensorrt_dynamic_shape()接口在不传任何参数时,将在运行时构建TensorRT Engine,不再需要先收集shape信息再运行;但为了避免运行时的重新构建,需要在前几次运行时覆盖最小及最大Shape的情况, #52162 。
- Paddle-TensorRT支持NHWC格式的模型输入,#49633 。
- 扩展config.Exp_DisableTensorRtOPs接口通过指定Tensor变量的名字来禁止进入TensorRT,#49497 。
功能优化
- GPU混合精度推理(非Paddle TensorRT场景)功能增强,Config.enable_use_gpu增强可设置精度类型。 #47993
- 支持double类型输入进行推理, #51786 。
- 由于TensorRT算子不支持INT64类型,模型中存在INT64数据类型时会运行失败。Paddle-TensorRT对此做了增强:当模型中包含INT64数据类型时,自动将其转换为INT32类型运行。 #45547
- Paddle-TensorRT支持更多算子进入TensorRT推理,包含:
- expand_v2, gather_nd, rsqrt, sign, not, onehot, arg_min, temporal_shift, expand_as_v2, set_value, index_select, round, acosh, square, reduce_max, not_equal, reduce_min, reduce_prod, grid_sampler, elementwise_mod, pad3d, greater_equal, bitwise, cumsum, matmul_v2, reciprocal, where, bmm, take_along_axis, less_than, greater_than, logical_or, logical_xor, logical_and, less_equal, range, reduce_all, reduce_any, fill_any_like, pow
- #47002 , #47589 ,#48223 ,#48557 , #48655 , #49113 , #51207 ,#51028 ,#50341 ,#51498 ,#48534 ,#48684 , #49393 , #49615 ,#50934 ,#50974,#50986 , #52000 ,#51971 , #52518 ,#44918 ,#48230 ,#47820 , #46877 , #48358 , #48592 ,#48697 , #53088 , #47974 , #53462
- 增强Paddle-TensorRT映射算子strided_slice,instance_norm,prelu,argmax,cast,nearest_interp_v2,elementwise,bilinear实现,#46819 ,#47998 ,#48043 ,#48998 , #49675 , #47495
- Paddle-TensorRT部分算子(scale, square, sum, swish, expand_as_v2, prelu, gelu, hard_swish, hard_sigmoid, leaky_relu,softmax, stack, clip, cast, flatten_contiguous_range,unary,equal, elementwise_op) 支持0维Tensor,#53660 ,#53627 , #53634 , #53714 , #53729 ,#53769 ,#53506 ,#53704
- 支持GCC12 + CUDA 12.0以下版本编译, #50106
- Paddle-TensorRT的DeformableConv插件支持动态Shape输入,#50698
- Paddle-TensorRT增加lookup_table算子的插件支持, #46613
- 新增config.enable_low_precision_io()接口支持Paddle-TensorRT场景下低精度类型输入, #52485
- Paddle-TensorRT的LayerNorm插件支持FP16计算, #45043
- Predictor的输入数据paddle_infer::Tensor支持bool类型,#49388
- Paddle-TensorRT增强Convolution实现采用ConvolutionNd,#47653
- conv2d_fusion融合算子支持NHWC格式,#49047
- 调整C++推理库下Phi算子相关目录结构,#53091
- 当TensorRT序列化和加载版本不匹配时,支持重新构建TensorRT Engine,而不是报错,#50775 。
- 优化Paddle-TensorRT运行时打印日志信息,#50181
- 基于oneDNN的CPU推理支持elementwise的0维Tensor输入,#51656
- 清理和规范化Paddle-TensorRT的FC、matmul、matmul_v2算子的支持,统一升级到使用TensorRT的IMatrixMultiplyLayer进行支持,#52222
性能提升
- 支持多个lookup_tables进入Paddle-TensorRT的Embedding+Eltwise+LayerNorm的融合 #46243 ,#46230
- 增加MoE融合Phi算子,提升MoE模型推理性能, #48703
- 在INT8量化推理的场景下,Paddle-TensorRT 插件fallback到FP16计算而不是FP32计算,#50554
- 优化推理时内存、显存, #49051 , #49046 ,#53930
- Layout排布优化Pass增强, #52997
- 支持对算子Shape推断进行缓存,提升模型推理性能, #48312
- 使用half2指令优化bias+add+relu融合,#49048
- 使用向量化操作优化多个输入的Concat Kernel,#49540
- 基于CUTLASS实现Convolution、Depthwise Convolution及相关融合算子,提升推理速度。 #47989 ,#50603 ,#51792 ,#50603
- Paddle-TensorRT支持FlashAttention的插件,提升StableDiffusion等模型的推理速度,#49438 。
- 增加Transpose+LayerNorm的融合PASS,提升StableDiffusion等模型的推理速度,#50082 。
- 增加Elementwise+Transpose的融合,#50081
- 优化Paddle-TensorRT Group Norm插件实现 ,#49160
- Config.EnableTensorRtEngine()接口增加use_cuda_graph参数,可以支持开启CUDA Graph,注意在使用时,需要保证模型输入shape不变,可以降低运行时耗时,#53406
- 支持对Reshape的inplace操作减少模型运行时的拷贝耗时, #49146
- 基于oneDNN优化LayerNorm kernel实现,#47782
- 基于oneDNN支持quantize+transpose 以及 transpose+dequantize融合,#49509
- CPU推理下当开启MKLDNN时,默认开启FC相关的融合Pass,提升性能,#45704
- CPU的OneDNN推理支持squeeze2 + transpose2融合,#47592
XPU推理提升和性能优化
- 新增 ExpRunWithRuntimeConfig 接口与 XpuRuntimeConfig 允许推理期间设置外部流、L3 cache 等参数;GetExecStream 接口支持获得昆仑外部流对象;输入、输出支持昆仑设备内存减少 D2H 和 H2D 开销,#53334、 #52466、 #53240
- 新增 multi-encoder, fused_multi_transformer 算子和融合 pass,提升 ERNIE 和 Transformer 类模型性能,#50570、#51346、 #50499、#53982、#50759、#51571、 #53144、#53306
- 优化BeamSearch性能,当beam_size=1 时对 write_read_array, gather 等细粒度算子进行变换、去除和融合提升模型性能,#53130
- 多个相同输入的 stack 算子变换为支持 broadcast 的 unsqueeze 算子,unsqueeze/squeeze 支持 inplace 计算, #52099
- 新增支持导出适用于昆仑芯的多卡推理模型, #50490
- 新增 embedding_with_eltwise_add 融合 pass 及算子 phi kernel,减小显存占用并提升推理性能, #50590
- interpolate 类算子 phi kernel 支持 FP16, #52358
- argmax 算子支持 INT32 类型输出, #51303
- 修复开启混合精度推理模式后,保存序列化模型时只有model文件的报错, #52994
- 修复 instance_norm 在 scale 和 bias 为空时出现的段错误, #52627
- conv_transpose 算子支持 FP16,#53626
- 添加 yolo_box_xpu 融合 pass 及算子 phi kernel,优化 YOLO 模型通用子结构, #54163
- 添加 conv2d_xpu 融合 pass 以及算子 phi kernel,并支持FP16推理,优化卷积操作推理耗时,#52247 ,#53626
- 添加 sigmoid_elementmul 通用融合 pass,融合为 swish 算子以匹配 conv2d_fusion pass 提升 YOLO 模型推理性能, #53580
- 添加 act_add 融合 pass 及算子 phi kernel 提升推理性能,#53965
- 添加 fold_interp_outsize 融合 pass 提升推理性能, #54245
- 解决当FC存在共享 weight 时因重复融合导致结果错误的问题。 #51108、#51039
- 删除算子仅用于训练的 op_device 属性,防止在推理期间错误的选择训练时的 place, #51029
- 支持优化后模型的保存,允许再次推理时跳过 PASS优化减少第一次推理时间, #53696
- 解决算子 Kernel 的 CPUPlace 输入被强制拷贝到 XPU 而导致的计算错误问题, #51306
- subblock 支持参数 H2D 提前拷贝以提升推理性能。#51876
- 修复昆仑芯 2 代芯片输出激活的 scale 存储空间大小。 #53505
- 新执行器昆仑芯 D2D 拷贝支持异步执行, #51876
- 删除只有一个输入的 concat 算子,#52304
- lookup_table_v2 支持 FP16 删除冗余 cast 算子, #52888
- 控制流While算子支持缓存scope,降低每次新建scope 的开销, #52628
- scatter 新增支持 FP16,删除冗余 cast 算子以及某一个输入为 1 的 elementwise_mul 算子。#52831
模型量化
- 动态图量化功能全面升级
- 支持量化训练模型加载离线量化模型的参数,支持更多算子量化,包含matmul, scale,conv1d,#47892, #45911,#48912
- 支持静态图量化训练的混合并行训练,#52219
- 修复动态图量化过程中的问题:
5. 环境适配
为提升源码编译效率,完善和推广setuptools + ninja编译方式,提升开发效率:CPU场景下,全量编译耗时减少20min,编译速度提升24.52%;GPU场景下,全量编译耗时减少22min,编译速度提升29.31%。为了适配较为主流的开发环境,飞桨在源码编译中支持了gcc12编译和C++17标准,并适配了最新的CUDA12。代码质量方面,完成了编译warning的清理,提升编译体验。第三方依赖层面,为减少依赖冲突,升级了底层的protobuf版本,清理了一些低版本依赖库的废弃属性和老旧的代码格式,并移除了对python2.x的支持。
- ninja编译适配,提升编译速度。#52433,#48932,#49420,#48435,#49303,#49448,#49838,#50067,#52796,#50431,#49181,#48867,#48490,#48211,#49499,#53076
- setuptools编译打包一体化适配。#48770,#46957,#49583,#47602,#48301,#50800,#42575,#49826,#49002,#51443,#51528,#52621,#52465
- gcc12 支持。#52960,#52265,#46546,#52318,#46808,#47466,#52083,#48176,#49423,#49452,#51037,#52007,#52441,#52085,#50817,#52646,#50777,#53288,#54009
- c++17标准支持。#53345,#53892,#54282,#49017,#47635,#54258
- cuda12支持。#52285,#49592,#52232,#52654,#54641
- CodeStyle。#45909,#47772,#48538,#49522,#47264,#49558
- 编译Warning消除。#47163,#47216,#47309,#47252,#47341,#47399,#47513,#47558,#47706,#52717,#51203,#51336,#51608,#51633,#46644,#53092,#53185,#53246,#53650,#53683,#53687,#53886,#53689,#53679,#53681,#53532,#47137,#47045,#52186,#52490,#53924,#53938,#53945,#53851,#53847,#53818,#53931
- 支持protobuf升级。#49875,#48495,#49673,#52499,#51161,#49168
- 支持第三方库离线编译。#54326,#54370,#54335,#54346,#53744,#54319,#53915
- phi独立编译头文件依赖解耦。#50456,#47088,#52573,#52651
- Python2.x 退场。#48685
6. 安全
- 修复了诸如空指针使用、非法地址访问、内存越界、除0、Python IndexError等问题。PR49976, PR49993, PR49942, PR49965, PR50000, PR50005, PR49953, PR49995, PR49974, PR50015, PR50010, PR49979, PR49994, PR49977, PR49968, PR49984, PR49958, PR50008, PR51714, PR51847, PR51034, PR51088, PR51091, PR51092, PR49966, PR49656, PR52161, PR49548, PR49546, PR49547, PR49549, PR51850
Thanks to our Contributors
This release contains contributions from:
1want2sleep, 201716010711, 404988613, 5u13, 6clc, Ackeraa, Aganlengzi, ahahahahahaha, Ainavo, Allen Guo, andyj, Asthestarsfalll, Aurelius84, Ayuan, BellaZYL, Bjmw3, Bo Zhang, bukejiyu, caozhou, carryyu, Ccc, ccrrong, ceci3, chalsliu, Chang Xu, CHANGer, Charles-hit, Chen Weihang, chenjian, Chenxiao Niu, chenxiao120660, chenxujun, Chitsing KUI, cifar10, co63oc, CollaborativeFiltering, csy0225, cxxly, cyber-pioneer, cyberslack_lee, czr-gc, Dandelight, danleifeng, Danyang Zhang, dasen, denglianbin, Difer, dongfangshenzhu, DrowFish19, duanboqiang, duanyanhui, engineer, engineer1109, Epsilon Luoo, feifei-111, Feiyu Chan, Feng Ni, feng_shuai, Fisher, FlyingQianMM, Frank Lin, Galaxy1458, GaoYuYang, gaoziyuan, gem5, GGBond8488, Ghost Screaming, gongenlei, gouzil, Guanghua Yu, Guo Sheng, Guoxia Wang, Hamid Zare, Hanchiao, handiz, Haohongxiang, haosicheng, haozi, Happyd99, heliqi, hellockx, hellolllw, heyanru, hg-1099255210, hh-qiao, hjyp, hong, HongyuJia, houj04, hua-zi, Huang Jiyi, Huang Zhengjie, huangjiyi, huangjun12, Hui Zhang, Huihuang Zheng, Hulek, hwa, HydrogenSulfate, Ikko Eltociear Ashimine, iLeGend, Infinity_lee, Infrared1029, Jacek Czaja, jakpiase, james, jameszhang, Jiabin Yang, jiahongyu, jiangcheng, jiangfan06, Jianghai, jiaqianjing, jingsongliu, JingZhuangzhuang, jjyaoao, joanna.wozna.intel, junxiu777, Jx-qi, JYChen, JZ-LIANG, jzhang533, Kai Song, Kai Xing, Kaipeng Deng, Kang Zhao, kangguangli, Kevin吴嘉文, Kim, Kim Yann, knamg, kuizhiqing, lanxianghit, Leding Li, Leo Chen, Leo Guo, levi131, Li Min, Li-fAngyU, Ligoml, lijialin03, lijin23, limingshu, Lin Manhui, LinearTemporalLogic, Linjie Chen, lishicheng1996, Little-chick, littleforest, liu zhengxi, liulinduo, liuruyan, liuzhenhai93, LiYuRio, lj970926, LokeZhou, LoneRanger, lubiu, Lucas, lugimzzz, Lux et Veritas, lxsbupt, LyndonKong, lzy, lzydev, Mahmoud Ashraf, Manan Goel, Maple Xie, Matsumoto Ruko, mayang002, MayYouBeProsperous, megemini, mengziheng, Meteor Liu, mhy, mhy-666, Ming-Xu Huang, ming1753, 
minghaoBD, mjxs, Moqim, Mountagha, Mr.Juice, mrcangye, NetPunk, Netpunk, nihao, niuliling123, Nyakku Shigure, OccupyMars2025, Ouyang Chao, pangengzheng, pangyoki, parap1uie-s, Paulina Gacek, Piotr Paturej, PommesPeter, PPGitub, PPPPzhang, PuQing, Qi Li, Qi Shao, QingshuChen, qipengh, qizhaoaoe, Rayman, RedContritio, RichardWooSJTU, risemeup1, Roc, ronnywang, Ruibiao Chen, Ruibin Cheung, RuohengMa, Ryan, SaltFish11, Sanbu, Scotty, scotty, seemingwang, Shaojie WANG, ShenLiang, shentanyue, Shijie, Shuangchi He, Siming Dai, Sing_chan, sneaxiy, Sonder, sprouteer, Sqhttwl, sunli, superwinner1, supplyout, SylarTiaNII, Sylwester Fraczek, Sławomir Siwek, taixiurong, Tao Luo, Taylor-Layrose, TeFeng Chen, Thomas Young, thunder95, Thunderbrook, Tian, Tian Zheng, tiancaishaonvjituizi, tianshuo78520a, tifa, Tinson Lai, Tomasz Socha, Tony Cao, ucsk, umiswing, ustiniankw, Vegetable dog, Vigi Zhang, Vvsmile, Wang Bojun, Wang Xin, Wang Xinyu, wangfengsheng1999, wangguanqun, wangguanzhong, wanghuancoder, wangna11BD, wangshengxiang, wangxiaoning, wangxinxin08, Wangzheee, WangZhen, wangzhen38, wasupandceacar, wawltor, Wei Shengyu, Weilong Wu, weishengying, Wen Sun, wenbin, wentao yu, wenzhe.wang, westfish, whisky-12, whs, Wilber, will-jl944, winter-wang, Winters Montagne, WJJ1995, wuhuachaocoding, wuyefeilin, wz1qqx, XiangGao, xiaoguoguo626807, xiaohemaikoo, xiaoluomi, xiaoting, xiaoxiaohehe001, Xiaoxu Chen, xiaoyuanzi914, Xinger, Xinyu Chen, xiongkun, xjmxyt, xu98bin, xysheng-baidu, yangguohao, yangjianfengo1, YangQun, YangZhou, yeliang2258, YepKong, Yichen Zhang, yikaikkk, Yiqun Liu, yjphhw, ykkk2333, Young-Flash, yu wentao, Yuang Liu, Yuanle Liu, YuanRisheng, yuchen202, yuehuayingxueluo, YuhangLi, Yulong Ao, YUNSHEN XIE, yunyaoXYY, YuRonan, zachary sun, ZeKai Zhou, Zenghui Yuan, zengshao0622, Zero Rains, Zhan Rongrui, Zhang Jun, Zhang Na, Zhang Ting, Zhang Zheng, zhangbo9674, ZhangDY-6483, zhangkaihuo, zhangxin81, zhangyikun02, zhangyingying520, zhangyuqin1998, zhaocaibei123, 
zhaoyingli, Zhen Wang, Zheng-Bicheng, Zhenghai Zhang, Zheng_Bicheng, zhenyun, Zhibao Li, zhiboniu, Zhong Hui, Zhou Wei, ZhouMengLei1999, zhoutianzi666, zhouzj, zhupengyang, zhurou603, zhuyipin, zhwesky2010, ziyoujiyi, zlsh80826, Zman, zmxdream, zqw_1997, Zuza Gawrysiak, zxcd, zyfncg, ZZK, zzk0, 丁一, 傅剑寒, 六个骨头, 卢林, 周周周, 姜永久, 学渣戊, 张春乔, 张正海, 柠檬味~, 王明冬, 石晓伟, 超级码牛, 陈沧夜, 骑马小猫
PaddlePaddle 2.5.0 Release Note
1. Highlights
- New dynamic-static unification architecture: Implement a new dynamic-to-static plus compiler execution model in combination with the basic operator, and complete the whole dynamic-to-static, combinator and neural network compiler optimization and acceleration process on the ResNet50&Bert model. For the dynamic-to-static, complete the whole graph fallback core function development, and support the fallback to dynamic graph training execution in case of dynamic-to-static failure. For the combinator, design a set of basic operator systems containing more than 150 basic operators, to achieve the python layer forward operator splitting mechanism and the reverse operator splitting mechanism of static graphs, to realize splitting of more than 70 commonly used forward and reverse operators. For the CINN compiler, fix the correctness bug, develop the key Pass, add manual schedule rules, achieve automatic generation of kernel codes, and improve performance of ResNet50 model by 12% and Bert model by 10%.
- Operator architecture unification of PHI operator library: Unify all remaining 350+ operator kernels under the original operator system into the PHI operator library. Unify the way of defining operators in the original operator system into the operator definition form of the PHI operator library (operator definition based on YAML configuration), enhancing unity of the architecture and reducing the comprehension cost of framework development. Decouple all the Fluid header files the PHI operator library depends on and compile them independently as dynamic link libraries to provide a lighter reuse of the operator library for secondary development of the framework. Continue to standardize and adjust non-standard operators and operator kernels in the PaddlePaddle framework, making them easier for developers to understand and reducing the cost of hardware adaptation.
- Full go-live of new executor for static graph: The new executor for static graph implements a number of functions and performance optimizations, and completes unification and replacement of the original multiple sets of old executors. The new executor becomes the default back-end execution engine for the static graph single-card and distributed training python side entrance, as well as dynamic-to-static, control flow, CINN, etc. This significantly improves scheduling performance of the framework, the functional architecture is clearer, and secondary development capability is significantly enhanced.
- Python API supporting 0-dimensional tensor: clear semantics are defined for tensors of shape [1,] and of shape [], and many API behaviors are fixed to support tensors of shape [], such as `paddle.sum` etc.
- New environment adaptation: Adapt to CUDA 12. Compilation with gcc12 is supported.
2. Incompatibility Upgrade
- PaddlePaddle API supports 0-dimensional tensor. PaddlePaddle previously used a 1-dimensional tensor with a shape of [1] instead of a 0-dimensional tensor, which is different from current mainstream habits. It increases development and debugging cost of the model, and sometimes leads to unintended errors. This release fixes 376 APIs that need to support 0-dimensional tensor, aligning with tools widely used by the community such as EinOps. For example, in previous cases, output loss in model training was a 1-dimensional tensor. To take out or print the loss, it was often necessary to use codes like `loss.numpy()[0]`. After this modification, output loss in model training is a 0-dimensional tensor. When using `loss.numpy()`, users can take out or print the loss. The codes are short, easy to understand, and in line with the industry's habit.
- `paddle.fluid` API is fully decommissioned. According to the plan previewed in the last version, 1116 `paddle.fluid` APIs and related internal interfaces have been decommissioned, and the remaining few related internal interfaces will be cleaned up in the next version. The fluid APIs are historical APIs that PaddlePaddle 2.0 had planned to remove, but delayed the cleanup in consideration of compatibility and other factors. This decommissioning cleanup will not affect programs developed based on PaddlePaddle 2.0, and the PaddlePaddle API system will be more concise and easier to understand.
- Complete code cleanup of the old-version dynamic graph on the Python side. So far, the Python side only uses the new version of dynamic graph to call the C++ core logic.
- In order to unify the training method of data parallel for static graph models, the original single-process multi-card training method is deprecated, including the `paddle.static.ParallelExecutor` and `paddle.static.CompiledProgram().with_data_parallel()` APIs, because this set of APIs only supports single-machine multi-card training, does not support multi-machine multi-card training, and the underlying execution performance is poor. It is recommended to use the multi-process multi-card training method uniformly, i.e., the `paddle.distributed.launch` API, for distributed training with data parallel. This upgrade affects only static graphs, and does not affect dynamic graphs and dynamic-to-static training. If you use the deprecated APIs, please refer to the documentation on data parallel to modify model code. #50351,#50501,#51240,#51701,#51616,#51369,#52671
- Remove the original adaptation code of Ascend NPU and Cambricon MLU in the framework, upgrade all of it to the CustomDevice plug-in adaptation, and migrate the adaptation code of Ascend NPU and Cambricon MLU to the PaddleCustomDevice repository.