github PaddlePaddle/Paddle v2.0.0-rc1
PaddlePaddle 2.0.0-rc1


Release Note

The Paddle framework 2.0-RC1 version has the following updates:

  • Installation environment Official release of binary packages supporting CUDA 11 (experimental); official release of binary packages supporting the Baidu Kunlun chip (experimental)
  • API function Support numpy-compatible paddle.Tensor indexing and slicing operations (basic indexing); remove the axis parameter from some APIs in favor of numpy-compatible broadcast semantics; add some new APIs, improve the functionality of some APIs, and fix some API bugs
  • Dynamic to static conversion Support more Python syntax for dynamic-to-static graph conversion, and support marking functions that should not be converted with paddle.jit.not_to_static
  • Framework function Support calling paddle.Tensor.backward() multiple times to accumulate gradients, which is equivalent to the gradient computed with a larger batch size; hide the C++ error stack by default and optimize the error reporting format; distributed training supports heterbox heterogeneous training
  • Framework performance Mixed precision training supports a pure FP16 mode, with ResNet50 single-card training on V100 reaching 1400+ samples/sec; distributed training performance is optimized

Forward-looking preview

  • The Paddle Framework plans to drop support for Python 2 and Python 3.5 in a future version. It is recommended that you upgrade to Python 3.8 to use Paddle
  • The Paddle Framework plans to drop support for CUDA 9.0 in a future version. It is recommended that you upgrade your CUDA version to use Paddle

Training framework

Basic API (including the distributed)

New APIs

  • Add the paddle.log2
  • Add the paddle.log10
  • Add the paddle.nn.initializer.set_global_initializer
  • Add the paddle.median
  • Add the paddle.broadcast_shape, which computes the shape resulting from broadcasting two tensor shapes together (see the sketch after this list)
  • Add the paddle.vision.ops.deform_conv2d, paddle.vision.ops.DeformConv2d
  • Add the paddle.subtract
  • Add the paddle.optimizer.lamb
  • Add the Tensor related APIs, Tensor.cpu, Tensor.cuda(idx), Tensor.pin_memory, Tensor.is_leaf, Tensor.clone
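
A minimal sketch exercising a few of the new APIs listed above (a 2.0-rc1 install is assumed; argument forms follow the descriptions in this list):

```python
import paddle

x = paddle.to_tensor([1.0, 2.0, 8.0])
print(paddle.log2(x))      # element-wise log base 2
print(paddle.log10(x))     # element-wise log base 10
print(paddle.median(x))    # median of all elements
print(paddle.subtract(x, paddle.to_tensor([1.0, 1.0, 1.0])))

# Shape obtained by broadcasting two tensor shapes together
print(paddle.broadcast_shape([2, 1, 3], [1, 4, 3]))   # expected: [2, 4, 3]

z = x.clone()              # copy of the Tensor
```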

Fix and improve APIs

  • In the paddle.multiply, remove the axis
  • In the paddle.pow, remove the type promotion
  • The paddle.add, paddle.subtract, paddle.multiply, paddle.divide, paddle.matmul, paddle.reshape, paddle.transpose, paddle.kron, paddle.trace, and paddle.sum support complex64 and complex128 data types
  • Remove the axis parameter from the paddle.maximum and paddle.minimum
  • In the multiplex, support the dynamic graphs
  • In the CrossEntropyLoss, add the soft_label and axis parameters, adjust the expected input shape, and improve performance
  • The paddle.nn.functional.interpolate size parameter supports the input in the Tensor format
  • In the paddle.nn.functional.pad, add the padding for N and C dimensions in constant mode
  • In the paddle.optimizer.momentum, support the resume training
  • Fix the error raised when a BatchNorm whose weight_param name was specified is converted to SyncBatchNorm with paddle.nn.SyncBatchNorm.convert_sync_batchnorm
  • paddle.to_tensor supports direct input of other Tensor's place when selecting devices
  • Optimize the performance of Tensor.detach: the detached Tensor shares memory with the original Tensor (saving one memory copy) and is not kept in the original computational graph (see the sketch after this list)
  • In static graph mode, add support for retrieving the learning rate via paddle.optimizer.get_lr()
  • Fix the error reported when paddle.Embedding is given out-of-range IDs on GPU
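
A small, hedged sketch of two of the items above: creating a Tensor on another Tensor's place and the zero-copy behaviour of Tensor.detach (a default CPU or single-GPU setup is assumed):

```python
import paddle

x = paddle.to_tensor([1.0, 2.0, 3.0])

# Reuse another Tensor's place directly when creating a new Tensor
y = paddle.to_tensor([4.0, 5.0, 6.0], place=x.place)

# detach() shares memory with x and is excluded from the computational graph
d = x.detach()
print(d.stop_gradient)   # the detached copy does not require gradients
```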

Remove API (including aliases)

  • Remove the api under complex module: paddle.complex.matmul, paddle.complex.reshape, paddle.complex.transpose, paddle.complex.kron, paddle.complex.trace, paddle.complex.sum, paddle.complex.elementwise_add, paddle.complex.elementwise_sub, paddle.complex.elementwise_mul, paddle.complex.elementwise_div
  • Remove the sigmoid_cross_entropy_with_logits in the paddle.nn.functional

High-level API

  • Add api paddle.callbacks.ReduceLROnPlateau
  • Add api paddle.callbacks.LRScheduler
  • Add api paddle.vision.datasets.FashionMnist
  • In the paddle.io.DataLoader, change the places parameter to an optional parameter. When it is left at the default value None, paddle.CPUPlace() or paddle.CUDAPlace(0) is selected automatically; the places parameter will be removed in a later version
  • paddle.io.DataLoader supports disabling automatic batching by setting batch_size=None (see the sketch after this list)
  • Add the api paddle.io.ComposeDataset for stitching multiple datasets into one dataset by field
  • Add the api paddle.io.ChainDataset for integrating multiple datasets into one dataset by sample
  • Add the api paddle.io.WeightedRandomSampler for random sampling with specified weights
  • Add the api paddle.vision.ops.yolo_loss and paddle.vision.ops.yolo_box
  • Add the api paddle.flops
  • Add the api paddle.callbacks.EarlyStopping
  • Update the api model.save so that the saved file format is consistent with the underlying framework
  • Fix the bug that saving an inference model fails when the dynamic graph input dtype is not float32 and inputs is not provided when initializing Model
  • paddle.metric.Accuracy supports multi-dimensional Tensor inputs, and accepts both labels of rank 1 and one-hot labels
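
A hedged sketch of the DataLoader behaviour above: places left at its default of None, and auto-batching disabled with batch_size=None. The RandomDataset used here is a made-up toy dataset.

```python
import numpy as np
import paddle
from paddle.io import Dataset, DataLoader

class RandomDataset(Dataset):                  # hypothetical toy dataset
    def __len__(self):
        return 8
    def __getitem__(self, idx):
        return np.random.random([3]).astype('float32'), np.array([idx])

ds = RandomDataset()

# places defaults to None: CPUPlace() or CUDAPlace(0) is picked automatically
batched = DataLoader(ds, batch_size=4)

# batch_size=None disables automatic batching; samples are returned one by one
unbatched = DataLoader(ds, batch_size=None)

for feats, label in batched:
    print(feats.shape)                         # [4, 3]
    break
```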

Function optimization (including distributed)

Dynamic graph basic functions

  • Support correct type promotion when operators are applied between a Tensor and a Scalar
  • Fix the bug that switching multiple models between train/eval modes interfered with each other. The dynamic graph Layer.eval() is now decoupled from no_grad: previously the Tracer stopped recording the backward graph after Layer.eval() was called, whereas now it still records the backward graph after Layer.eval(); to disable gradient recording, use paddle.no_grad
  • Support modifying Tensor data by index or slice
  • Add an inplace backward-detection module that detects whether a forward inplace operation affects the correctness of the gradient calculation
  • In Tensor.backward() automatic differentiation, new gradients are accumulated onto the previous gradients, which effectively enlarges the batch size (see the sketch after this list)
  • Enabled SE-ResNext oneDNN dygraph training
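
A sketch of the gradient accumulation and index/slice mutation described above (dynamic graph mode is assumed):

```python
import paddle

x = paddle.to_tensor([1.0, 2.0, 3.0], stop_gradient=False)

# Repeated backward() calls accumulate gradients, which behaves like
# training with a larger effective batch size
for _ in range(2):
    loss = (x * x).sum()
    loss.backward()
print(x.grad)            # 4 * x after two passes (2 * x per pass)

x.clear_gradient()       # reset before the next accumulation round

# Tensor data can be modified in place through indexing or slicing
y = paddle.zeros([4])
y[1:3] = 7.0
print(y.numpy())         # [0. 7. 7. 0.]
```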

Dynamic graph to static graph

New syntax

  • Add the support for using the isinstance syntax in the dynamic to static loop
  • Add dynamic-to-static support for unpacking a shape into a tuple, such as a, b, c, d = tensor.shape
  • Python's and/or operators evaluate their operands from left to right and short-circuit: if the left operand determines the logical result, the right operand is not executed. Previously, the logical_and/logical_or conversion in dynamic-to-static graphs mishandled this case; it is now handled correctly
  • Add support for the case where the function signature contains **kwargs
  • Support decorating a function with jit.not_to_static so that it is not converted during the dynamic-to-static process (see the sketch after this list)
  • Support python dictionary syntax dict.pop()
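
A minimal sketch of the jit.not_to_static decorator; the helper function here is made up for illustration:

```python
import paddle

@paddle.jit.not_to_static
def python_helper(x):
    # Left untouched by the dynamic-to-static converter and run as plain Python
    return x * 2

@paddle.jit.to_static
def forward(x):
    x = paddle.nn.functional.relu(x)
    return python_helper(x)

print(forward(paddle.rand([2, 3])).shape)
```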

Bug fixing

  • Fix the bug of model storage failure when a variable representing drop_state is not initialized in the dynamic to static storage lstm interface
  • Fix the bug of nested loops in the variable analysis
  • Fix the bug of return in some special cases
  • Fix the bug of if-else in the handling of list generation and variable analysis
  • Fix the bug of iterative variables in some special cases
  • Fix the bug of inconsistent behavior of transpose API in dynamic and static graphs, and make it support dynamic to static
  • Fix the bug of inconsistent behavior of concat API in dynamic and static graphs, and make it support dynamic to static
  • Optimize some dynamic to static error messages, so that the error location is more accurate
  • Fix the bug that convert_call will be repeatedly called recursively under special circumstances
  • Fix the dynamic to static bug caused by different judgments of out.dtype in 2.0 API
  • Fix the bug that x.shape == y.shape compares lists and returns True/False in the dynamic graph, but is overloaded to an elementwise comparison in the static graph; after conversion to static graph, the elementwise result is now reduced
  • Fix the bug that param_guard does not cover hook
  • Fix the bug that some parameter variables created in __init__ under the dynamic graph could not be assigned in the static graph because their types are not static graph variables
  • Fix the bug that the values of non-parameter variables defined by users in the __init__ function could not be modified and updated correctly
  • Fix the bug of wrongly converting third-party library logging in the dynamic to static process
  • Fix the bug of incorrect transcription of AST in the for-enumerate syntax
  • Fix the bug that some warning information is displayed multiple times in a loop

Mixed precision training

  • Support a more aggressive FP16 training mode (i.e., pure FP16 training). To ensure model convergence, add the new multi_precision and rescale_grad attributes to the Momentum optimizer; multi_precision mainly indicates that the optimizer needs to maintain a copy of FP32 master weights (see the sketch after this list)
  • With pure FP16 training, the ResNet50 model reaches 1400+ samples/sec on a single V100 card with 16 GB of memory
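
A hedged sketch of the new Momentum attributes named above; whether the keywords are exposed exactly like this in 2.0-rc1 is not verified here, and the Linear layer is only a placeholder model:

```python
import paddle

model = paddle.nn.Linear(10, 10)

# multi_precision asks the optimizer to keep FP32 master weights while the
# model itself runs in FP16; rescale_grad rescales gradients before the update
opt = paddle.optimizer.Momentum(
    learning_rate=0.01,
    momentum=0.9,
    parameters=model.parameters(),
    multi_precision=True,
    rescale_grad=1.0)
```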

Model quantization

  • Dynamic graph quantization supports skipping specified Layers
  • Dynamic graph quantization supports 2.0 API Conv and Linear

Distributed training optimization

  • Support launching distributed low-level APIs such as all_gather via the paddle.distributed.spawn interface (see the sketch after this list)
  • Support the heterbox heterogeneous training
  • Pipeline parallelism supports the Executor.run interface, improving usability
  • Upgrade the launch interface to support specifying the number of processes on a single node
  • Sharding supports multi-card training for 10 billion parameter models
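
A sketch of launching a low-level collective through paddle.distributed.spawn, assuming two visible GPUs on a single node:

```python
import paddle
import paddle.distributed as dist

def worker():
    dist.init_parallel_env()
    rank = dist.get_rank()
    data = paddle.to_tensor([float(rank)])
    gathered = []
    dist.all_gather(gathered, data)          # low-level collective API
    print(rank, [t.numpy() for t in gathered])

if __name__ == '__main__':
    dist.spawn(worker, nprocs=2)             # start two worker processes
```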

Model saving and loading

  • A Layer with multiple methods transcribed by paddle.jit.to_static can still be loaded with paddle.jit.load after being stored with paddle.jit.save, and all of the transcribed methods remain usable (see the sketch after this list)
  • Support that Layers loaded by paddle.jit.load can still be stored correctly by paddle.jit.save after fine-tune or used as sub-Layers of other Layers
  • Expand paddle.jit.save to support storing the paddle.DataParallel model
  • Optimize the paddle.static.load_program_state interface: when var_list is not specified and the load directory contains interfering files, only a warning is issued instead of an error
  • Support paddle.jit.save to handle InputSpec of dict type
  • Support paddle.onnx.export for exporting dynamic graph models to the ONNX file format
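
A sketch of the paddle.jit.save / paddle.jit.load round trip described in this list (the layer and paths are illustrative):

```python
import paddle
from paddle.static import InputSpec

class Net(paddle.nn.Layer):
    def __init__(self):
        super(Net, self).__init__()
        self.fc = paddle.nn.Linear(8, 2)

    @paddle.jit.to_static
    def forward(self, x):
        return self.fc(x)

net = Net()
# Store the transcribed program together with the parameters
paddle.jit.save(net, 'saved_model/net',
                input_spec=[InputSpec(shape=[None, 8], dtype='float32', name='x')])

# The loaded Layer can be fine-tuned, used as a sub-Layer, or saved again
loaded = paddle.jit.load('saved_model/net')
out = loaded(paddle.rand([4, 8]))
```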

Performance optimization (including the distributed)

  • Improve the performance of RNN class OP on CPU (LSTM, GRU, SimpleRNN). Compared with version 2.0-rc, the forward performance and backward performance of the LSTM, GRU, SimpleRNN have been significantly improved
  • Optimize FastThreadedSSAGraphExecutor scheduling and fix the case where communication and computation did not overlap in the synchronous communication scenario; 4-node 32-card resnet50 improves by about 0.3%
  • Optimize paddle.fleet AMP distributed performance and fix the case where the last communication and computation did not overlap; 4-node 32-card fp16 performance improves by about 0.5%
  • Optimize the performance of the distributed communication component Communicator. In the GEO-400 mode, the W2V model throughput rate, Simnet-Bow model performance have been significantly improved. In the Async mode, compared to the Paddle Framework 1.8, the throughput rate of W2V model is improved by 11% and the performance of CTR-DNN model is improved by 14%
  • Optimize the performance when the Worker is a GPU device in parameter server mode, reduce the copy time of Embedding table query. Significantly improve the training throughput rate in the CTR-DNN model
  • The distributed GPU dynamic graph implements computation/communication overlap and supports fine-grained user configuration of options such as the gradient fuse group size. On the ResNet152 and Bert models, multi-node performance improves by more than 5%; ResNet50 also improves by more than 3%
  • Improve the performance of cumsum on GPU
  • Improve the performance of Resnet50 oneDNN dynamic graph training. Currently, Resnet50 oneDNN dynamic graph training is 6.4x faster than CPU training
  • Add the support of cudnn on the GRU and SimpleRNN

Debug analysis

  • Optimize the alignment of the error exception type on the Paddle Python side with Python native error type
  • Hide the C++ error stack by default, optimize the error reporting format after hiding the C++ stack, remove the demarcation flag Error Message Summary, and align with the native Python error reporting format
  • Optimize the error messages when some paddle.static module APIs are used outside static graph mode, covering 9 APIs: static.append_backward, static.gradients, static.scope_guard, static.Print, static.nn.embedding, static.nn.data_norm, static.nn.multi_box_head, static.nn.nce, and static.nn.py_func
  • Optimize the error message when the passed-in Tensor is None under dynamic graph mode
  • Further optimize the print tensor format of the dynamic graph

Compile and install

New support

  • (experimental) Release the binary package supporting cuda11
  • Upgrade NCCL to version 2.7.8 in the Paddle images for CUDA 10.1 and above and in the CI system images
  • Release the binary package supporting xpu
  • Release the binary package supporting jetpack and C++ prediction library supporting nv_jetson

Experience optimization

  • Fix the packaging strategy and release the GPU package containing TensorRT separately, to avoid missing-TensorRT errors when users install other GPU versions of the package
  • Remove installation dependencies: scipy, rarfile, prettytable, pathlib
  • Installation documentation optimization

Bug fixing

  • Fix the bug that GPU card 0 occupies more video memory than other cards during multi-card training
  • Fix the bug of wrong shape derivation in the tile op calculation
  • Fix the bug of the large number of warning messages of invalid escape sequence in the use of paddle
  • Fix the bug when paddle.full is set to INF, NAN, NINF, etc.
  • Fix the bug that multiple-nccl comm settings of paddle.fleet do not take effect, and add a warning that multi-nccl comm communication does not overlap in synchronous mode
  • Fix the bug that paddle.framework.seed does not behave as expected in TruncatedNormal initialization
  • Fix the inconsistent dynamic-to-static behavior of the exclusive parameter of AvgPool-related APIs; fix the ceil_mode argument-passing problem of MaxPool-related APIs
  • Fix the bug that the paddle.topk result is incorrect on GPU
  • Fix the bug that the fluid.layers.nn.gather dynamic graph API was missing the overwrite option
  • Fix the bug that the terminal on Windows does not recognize CUDA_VISIBLE_DEVICES set to an empty string; setting the empty string allows the framework to run in CPU mode
  • Fix the bug that optimizer.state_dict/set_dict cannot recursively save and load when LinearLrWarmup recursively contains a Learning Rate Scheduler
  • Fix the single-machine training performance regression of the ptb lm model
  • Fix the bug of gradient calculation when softmax_with_cross_entropy uses ignore_index
  • Fix the bug that the parameter to be decayed is empty in the second acquisition after the first execution of AdamW

Inference

Paddle Inference

Function upgrade

  • Paddle 2.0 adds or upgrades some operators. Starting from this version, forward operator versioning rules are defined with compatibility constraints. By aligning operator versions across frameworks, the definition and behavior of the same operator version are kept consistent between frameworks, enhancing the overall robustness of the framework
  • Add the TryShrinkMemory interface to reduce the application's GPU/host memory consumption by releasing temporary tensors. For a demo, refer to Paddle-Inference-Demo
  • Paddle-TRT supports clip op. Support the classification model GhostNet running under Paddle-TRT
  • Paddle-TRT int8 prediction supports models containing channelwise-quantized mul ops, allowing the PaddleOCR detection and recognition PaddleSlim quantization models to run under Paddle-TRT int8
  • The load_inference_model and save_inference_model APIs are migrated to paddle.static, improving ease of use while remaining compatible with the old interfaces (see the sketch after this list)
  • Add six APIs: serialize_program, deserialize_program, serialize_persistables, deserialize_persistables, save_to_file, and load_from_file, for serializing/deserializing programs and params and for saving models/parameters to files or loading them from files
  • Enabled BF16 inference for models: resnet50, googlenet, mobilenetv1 and mobilenetv2
  • Added oneDNN operators version compatibility support
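
A sketch of the migrated paddle.static inference-model APIs mentioned above (the tiny fc network is illustrative only):

```python
import numpy as np
import paddle

paddle.enable_static()

x = paddle.static.data(name='x', shape=[None, 4], dtype='float32')
y = paddle.static.nn.fc(x, size=2)

exe = paddle.static.Executor(paddle.CPUPlace())
exe.run(paddle.static.default_startup_program())

# save_inference_model / load_inference_model now live under paddle.static
paddle.static.save_inference_model('inference/model', [x], [y], exe)

program, feed_names, fetch_targets = paddle.static.load_inference_model(
    'inference/model', exe)
out, = exe.run(program,
               feed={feed_names[0]: np.random.rand(1, 4).astype('float32')},
               fetch_list=fetch_targets)
```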

Performance optimization

  • Add support for variable-length inputs for ERNIE models when TensorRT is enabled, improving performance by 147%. With cuda10.1, cudnn 7.6, tensorrt 6.0, OSS 7.2.1, model ernie-base-2.0, dataset QNLI, and input BatchSize = 32, the performance on an Nvidia Tesla T4 improves from 905 sentences/s to 2237 sentences/s. Example code: Paddle-Inference-Demo/c++
  • Improved oneDNN INT8 GRU performance. The GRU INT8 model has 1.65X speed-up compared with NativeConfig inference. (with thread=1, batch_size=50)
  • Added oneDNN batchnorm + activation fuse, hence improved pvanet_ocr model performance by 2.8%

Bug fixing

  • Fix the bug that models containing avg pooling or global pooling produce wrong results, raise errors, or hang on jetson devices
  • Fix the bug that a trailing x1 in the shape of a TensorRT subgraph output Tensor is incorrectly removed when using TensorRT dynamic shape inference
  • Fix the bug that config.pass_builder()->DeletePass() is not effective when the TensorRT inference is used
  • Fix the issue that some models performance depends on the matmul ops' weights
  • Fix the slowdown when CPU oneDNN loads multiple models for prediction

Model upgrade

PaddleDetection

  • Upgrade dynamic graph models:
    • Faster RCNN, Faster FPN, Mask RCNN, Mask FPN, Cascade RCNN, Cascade Mask, and YOLOv3 models reach accuracy on par with their static graph counterparts
      • Support the dynamic-to-static function and deployment through Paddle Inference, with accuracy and speed on par with static graphs
  • Release SOLOv2, a real-time instance segmentation model; compared with competing models, accuracy improves by 2.4 points and prediction speed by 31.2%, and training speed is 2.4 times that of the competing models
  • Add Android mobile detection demos, including SSD and YOLO series models
  • Add the new PACT quantization strategy; YOLOv3-Mobilenetv3 on the COCO dataset improves by 0.7% compared with ordinary quantization

PaddleSlim

  • Support the dynamic graph compression function
    • Add dynamic graph pruning and quantization-aware training functions
    • Add channel-number alignment to pruning so that produced models are more easily accelerated by the inference library
    • The PACT quantization training method becomes a built-in method, making it convenient for users to call directly
  • Add the OFA model compression technology. The TinyERNIE is accelerated by 40% after compression, with no loss of accuracy

PaddleSeg

  • Newly release the 2.0-rc version, fully upgraded to dynamic graph. It supports 15+ segmentation models, 4 backbone networks, 3 datasets, and 4 types of loss:
    • Segmentation models: ANN, BiSeNetV2, DANet, DeeplabV3, DeeplabV3+, FCN, FastSCNN, Gated-scnn, GCNet, HarDNet, OCRNet, PSPNet, UNet, UNet++, U^2Net, and Attention UNet
    • Backbone networks: ResNet, HRNet, MobileNetV3, and Xception
    • Datasets: Cityscapes, ADE20K, and Pascal VOC
    • Loss: CrossEntropy Loss, BootstrappedCrossEntropy Loss, Dice Loss, and BCE Loss
  • Provide 40+ high-quality pre-trained models based on the Cityscapes and Pascal VOC datasets
  • Support multi-card GPU parallel evaluation with efficient metric computation; support multiple evaluation methods such as multi-scale evaluation, flip evaluation, and sliding window evaluation

PaddleClas

  • Newly released 2.0-rc1, fully upgraded to dynamic graph. It supports 23 series of classification network structures and 135 image classification pre-training models. Among them, 14 practical SSLD distillation models are included, and the effect is generally improved by more than 3% compared with the benchmark model. Three new series of ResNeSt, RegNet and GhostNet models are added
  • Based on dynamic graph, provide the mixed precision training method and DALI-based training method
  • Based on dynamic graph, provide three deployment methods: offline inference deployment, serving deployment, and on-device deployment

PaddleOCR

  • Newly released 2.0-rc1: PP-OCR series models are upgraded to dynamic graph. Provide an 8.1M ultra-lightweight Chinese and English OCR model, general Chinese and English OCR models, and multilingual recognition models with better accuracy (English and digits only, French, German, Japanese, Korean). Support both offline inference deployment and serving deployment
  • Release the Style-Text universal text data synthesis tool
  • Release the PPOCRLabel text data annotation tool

PaddleRec

  • Release models with dynamic graph support: gru4rec, deepfm, mmoe, dnn, LR

PaddleGAN

  • Release models: Pixel2Pixel, CycleGAN, PSGAN, UGATIT, ESRGAN, CGAN, DCGAN
  • Provide 10 pre-trained models for style transfer, makeup transfer, colorization, super resolution, anime-style conversion of people and scenes, etc.

PaddleNLP

  • Release the 2.0-beta version: fully support dynamic graph mode; provide the PaddleNLP core library, deeply integrated with the high-level API; support pip installation; provide developers with best practices for the text domain on PaddlePaddle 2.0.
  • Add the text graph learning model ERNIESage, generative pre-training model ERNIE-Gen, open domain dialogue generation model PLATO-2, semantic matching model SentenceTransformer, time sequence prediction model TCN, and so on.
  • Enrich the pre-training language models further, including a total of 22 pre-training models such as ERNIE, BERT, RoBERTa, and ELECTRA (containing 11 Chinese pre-training models).
  • Add 8 common text task evaluation metrics such as Perplexity, BLEU, Rouge-L, and so on, adapted to the PaddlePaddle 2.0 Metrics API system to improve ease of use.
  • Add 25 new datasets for text classification, sequence labeling, machine translation, reading comprehension, and so on, adapted to the PaddlePaddle 2.0 Dataset API system, with one-line fast loading.
  • Add the Embedding API function, including 38 Chinese word vectors, supporting fast loading and word granularity semantic distance calculation.

Parakeet

  • Release 2.0-alpha version: provide Parakeet core library; improve Chinese documentation; support pip installation.
  • Upgrade the text-to-speech model framework to unify the text front-end interface. The model is fully upgraded to Paddle 2.0 API, including TransformerTTS, Waveflow, Wavenet model, and new Tacotron2 model.
  • Provide more reusable networking modules, making it easy to build models flexibly. Optimize the data processing and loading flow to improve training speed.
  • Add the experiment module to standardize the experiment process. This facilitates the experiment management and secondary development. The sample codes for experiments are provided for existing models.

Utility Component

PaddleHub

  • Release 2.0-rc version: fully migrate the dynamic graph programming mode. It is more convenient for model development and debugging. The finetune interface is more flexible and easy to use.
  • Fully upgrade the transfer learning capability for vision tasks, supporting image classification, image colorization, style transfer, and other tasks.
  • Upgrade Transformer class models such as BERT, ERNIE and RoBERTa to dynamic graph. Support the Fine-Tune capability for text classification.
  • Optimize the Serving capability for service-oriented deployment, supporting multi-card prediction and automatic load balancing. The performance is improved greatly.
  • Add Auto Augment, an automatic data augmentation capability that efficiently searches for suitable combinations of data augmentation policies for a dataset.

X2Paddle

  • Release version 1.0.0-rc0: It fully supports PaddlePaddle dynamic graph API.
  • Add PyTorch model conversion, supporting both Tracing and Scripting conversion approaches.
  • Add the support of conversion from Caffe/ONNX/Tensorflow to Paddle2.0 dynamic graph.
  • Add the Optimizer module, mainly including op fusions and op elimination functions, to improve the readability of the converted model code and the prediction performance of the model.

Kunlun hardware

Models adapted to Kunlun hardware

  • Resnet50, mobilenetv3, deeplabv3, bertbase, and DQN static graph models are adapted to Kunlun hardware
