PaddlePaddle v2.0.0-rc0 Release Notes

Release Date: 2020-10-30
  • 🚀 2.0-rc0 Release Note

    Important Updates

    Compared with the 2.0-beta version, this release makes further improvements in the following areas:

    • Default mode: starting from paddle 2.0-rc, dynamic graph mode is enabled by default; to use the static graph programming mode, switch to it with paddle.enable_static(). (A short sketch follows this list.)
    • Framework APIs: renamed 50 commonly used APIs, added 8 basic APIs, removed 220 APIs (including alias removal), added second-order derivative computation to 8 APIs, extended Kunlun chip support to more APIs, formalized the distributed Fleet API, and functionally enhanced the high-level APIs.
    • Framework features: optimized dynamic-to-static conversion usage, model reading and loading, mixed precision training and quantization strategies, and distributed training strategies. Removed 6 compilation dependencies such as nltk; the installation package adds support for Python 3.8 and CUDA 10.1/10.2.
    • Inference engine: enhanced the int8 quantization capability, added operator version information, and strengthened oneDNN-related functionality and performance.
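
    A minimal sketch of the new default behavior (assuming a 2.0-rc install; paddle.disable_static() switches back to dynamic mode):

      import paddle

      # Dynamic graph (imperative) mode is the default in 2.0-rc:
      # operations execute eagerly and return concrete Tensors.
      x = paddle.to_tensor([1.0, 2.0, 3.0])
      print(x * 2)

      # Switch to the static graph (declarative) programming mode.
      paddle.enable_static()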

    Training Framework

    Basic APIs (Including Distributed)

    Added APIs

    • Added the paddle.empty API, which returns a tensor of uninitialized memory
    • Added the paddle.empty_like API, which returns a tensor of uninitialized memory
    • Added the paddle.mv API, which returns the matrix-vector product
    • Added the paddle.multinomial API for the multinomial distribution
    • Added paddle.nn.LocalResponseNorm and paddle.nn.functional.local_response_norm
    • Added the paddle.nn.Pad1D/Pad2D/Pad3D APIs, supporting constant, reflect, replicate, and circular modes
    • Added paddle.add_n
    • Added the dynamic graph mixed precision training APIs paddle.amp.auto_cast and paddle.amp.GradScaler (see the sketch after this list)
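
    A minimal sketch of the new mixed precision APIs (a CUDA device is assumed; the tiny network and data are placeholders):

      import paddle

      model = paddle.nn.Linear(10, 10)
      opt = paddle.optimizer.SGD(learning_rate=0.01, parameters=model.parameters())
      scaler = paddle.amp.GradScaler(init_loss_scaling=1024)

      data = paddle.rand([4, 10])
      with paddle.amp.auto_cast():        # run eligible ops in float16
          loss = model(data).mean()

      scaled = scaler.scale(loss)         # scale the loss to avoid fp16 underflow
      scaled.backward()
      scaler.minimize(opt, scaled)        # unscale gradients and apply the update
      opt.clear_grad()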

    Fixed and Improved APIs

    • The paddle.reshape API supports bool type input
    • Added the sample and log_prob methods to the paddle.distribution.Categorical API
    • Added channel-last data layout support to BatchNorm1D, BatchNorm2D, and BatchNorm3D
    • ⚡️ Adjusted the parameter order of paddle.optimizer.Adam and paddle.optimizer.AdamW
    • yolo_box supports input feature maps whose H and W are not equal, enabling prediction on images with unequal width and height
    • paddle.nn.functional.interpolate supports scale_factor inputs of type list
    • Added oneDNN support for the adaptive pool2d operator @intel
    • Added oneDNN support for dilated conv and dilated conv_transpose @intel
    • unique supports computation on GPU devices
    • paddle.multiply supports inputs of non-Variable and Tensor data types
    • Added paddle.nn.AdaptiveMaxPool1D/2D/3D and paddle.nn.functional.adaptive_max_pool1d/2d/3d, and refactored the Python-side implementation of the pooling APIs
    • 🖨 paddle.set_printoptions supports configuring the display options of dynamic graph Tensors
    • The paddle.assign API supports assigning arrays/tensors to a tensor
    • Removed the beta parameter from paddle.nn.functional.swish/paddle.nn.Swish
    • The default value of the threshold parameter of paddle.nn.functional.thresholded_relu/paddle.nn.ThresholdedReLU is 1.0
    • paddle.norm now supports fro, inf, -inf, 0, 1, 2, and the p-norm for any positive real p
    • Optimized the parameter order of the RNN classes (SimpleRNN, LSTM, GRU) and the implementation of the base class RNNBase, and integrated cuDNN LSTM
    • Fixed abnormal GPU gradients of the adaptive_pool op in special output cases
    • 🌲 Added second-order derivative support for batch_norm, abs, log, expand, tile, squeeze, unsqueeze, and matmul (see the sketch after this list)
    • Added Kunlun (XPU) training support for more than 50 operators
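
    A small sketch of the newly supported second-order derivatives, using paddle.grad with create_graph=True (matmul is one of the listed operators):

      import paddle

      x = paddle.rand([3, 3])
      x.stop_gradient = False
      y = paddle.matmul(x, x)

      # First-order gradient, keeping the graph so it can be differentiated again.
      (dy_dx,) = paddle.grad(outputs=y.sum(), inputs=x, create_graph=True)
      # Second-order gradient of the same expression.
      (d2y_dx2,) = paddle.grad(outputs=dy_dx.sum(), inputs=x)
      print(d2y_dx2.shape)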

    API Name Changes

    • Renamed 50 APIs from 2.0-beta; for details, see the link

    Removed APIs (Including Aliases)

    • 🚚 Removed 220 APIs (including aliases); for details, see the link

    Multi-device/Distributed Training APIs

    • The Fleet API is formalized and unified under paddle.distributed.fleet as Paddle's general entry point for distributed training
    • paddle.distributed.fleet.DistributedStrategy is exposed as Paddle's unified entry point for defining parallel strategies
    • ⚡️ Added the paddle.distributed.fleet.meta_optimizer.RecomputeOptimizer API to support recomputation in distributed training
    • ⚡️ Added the paddle.distributed.fleet.meta_optimizer.GradientMergeOptimizer API to support gradient accumulation in distributed training
    • ⚡️ Added the paddle.distributed.fleet.meta_optimizer.PipelineOptimizer API to support pipeline parallelism in distributed training
    • Added the amp optimization strategy to paddle.distributed.fleet.DistributedStrategy to enable automatic mixed precision in distributed training
    • Added the dgc optimization strategy to paddle.distributed.fleet.DistributedStrategy to enable deep gradient compression in distributed training
    • Added the fp16_allreduce optimization strategy to paddle.distributed.fleet.DistributedStrategy to enable fp16 allreduce communication in distributed training
    • Added the lars optimization strategy to paddle.distributed.fleet.DistributedStrategy to use the LARS optimizer for large-batch-size distributed training
    • Added the lamb optimization strategy to paddle.distributed.fleet.DistributedStrategy to use the LAMB optimizer for large-batch-size distributed training
    • paddle.distributed.fleet supports combining multiple optimization strategies, including more than ten combinations such as amp+recompute, dgc+recompute, and amp+recompute+lars (see the sketch after this list)
    • Added the a_sync optimization strategy to paddle.distributed.fleet.DistributedStrategy to support synchronous, asynchronous, GeoSGD, and heterogeneous parameter server training in distributed mode
    • Added the experimental auto optimization strategy to paddle.distributed.fleet.DistributedStrategy to support automatic multi-strategy optimization in distributed training
    • Added fleetrun for launching distributed training jobs; Collective mode can be launched on a single machine with one or multiple cards and on multiple machines with multiple cards; parameter server mode can be launched on CPU clusters, GPU clusters, and heterogeneous clusters; jobs can also be submitted directly to PaddleCloud clusters
    • paddle.distributed.fleet supports dynamic graph execution, including single-machine single-card, single-machine multi-card, and multi-machine multi-card dynamic graph training in GPU mode
    • ⬇️ Added collective communication functions to paddle.distributed.fleet, supporting all_reduce, all_gather, and barrier
    • Added distributed metric computation to paddle.distributed.fleet, including auc, rmse, mae, and acc
    • Deprecated fleet.main_program and fleet.startup_program under paddle.distributed.fleet, replaced by paddle.static.default_main_program() and paddle.static.default_startup_program()
    • paddle.distributed.fleet supports the heterogeneous parameter server mode; combined with user networking via the Fleet API, training can run across heterogeneous computing devices in a collaborative distributed fashion
    • The distributed collective communication APIs support CPU devices
    • Added the localsgd optimization strategy to paddle.distributed.fleet.DistributedStrategy
    • Added the adaptivelocalsgd optimization strategy to paddle.distributed.fleet.DistributedStrategy to support a localsgd strategy that automatically computes the step size in distributed training
    • Added InMemoryDataset and QueueDataset under paddle.distributed to support distributed training with Dataset
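
    A hedged sketch of the formalized Fleet entry point combining the amp and recompute strategies (static graph mode; launching with fleetrun and the user-defined loss are assumed):

      import paddle
      import paddle.distributed.fleet as fleet

      paddle.enable_static()
      fleet.init(is_collective=True)          # collective (GPU) mode

      strategy = fleet.DistributedStrategy()
      strategy.amp = True                     # automatic mixed precision
      strategy.recompute = True               # recomputation; normally also needs
                                              # strategy.recompute_configs = {"checkpoints": [...]}

      optimizer = paddle.optimizer.SGD(learning_rate=0.01)
      optimizer = fleet.distributed_optimizer(optimizer, strategy)
      # optimizer.minimize(loss) is then called on the user-defined loss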

    High-level APIs

    • 👷 Added the IterableDataset base class to support streaming datasets; DataLoader supports multi-process acceleration for IterableDataset, and paddle.io.get_worker_info() can be used to get the worker state and split data among workers (see the sketch after this list)
    • The places parameter of paddle.io.DataLoader is now optional; when places is not specified, the default places are used
    • 💅 Added 10 map-style datasets such as CIFAR10, CIFAR100, and Conll05st, supporting automatic download and map-style data access
    • Added the num_replicas and rank parameters to the DistributedBatchSampler interface for specifying the number of cards and the logical index of the current card
    • Added paddle.io.TensorDataset to support reading tensor datasets
    • Added the paddle.io.Sampler base class, plus SequenceSampler and RandomSampler for fetching data sequentially or randomly in BatchSampler
    • paddle.io.BatchSampler supports a Sampler as input; the original input parameter indices is removed
    • Retired the original APIs under paddle.reader
    • The image transform operators in paddle.vision.transforms now support a PIL backend
    • paddle.summary supports Layers with multiple inputs and multiple outputs
    • Upgraded model.save: when saving an inference model from a dynamic graph, the user no longer needs to call paddle.jit.to_static or decorate the layer function (the dynamic-to-static feature); if inputs is passed when the Model is initialized, the correct input shape is saved, otherwise the model's input shape is saved according to the inputs passed at run time
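
    A minimal sketch of a streaming dataset with worker-aware sharding via paddle.io.get_worker_info() (the data source is synthetic):

      import numpy as np
      import paddle
      from paddle.io import IterableDataset, DataLoader, get_worker_info

      class RangeDataset(IterableDataset):
          def __init__(self, start, end):
              self.start, self.end = start, end

          def __iter__(self):
              info = get_worker_info()          # None when loading in a single process
              start, end = self.start, self.end
              if info is not None:              # split the range across worker processes
                  per_worker = int(np.ceil((end - start) / info.num_workers))
                  start = start + info.id * per_worker
                  end = min(start + per_worker, end)
              for i in range(start, end):
                  yield np.array([i], dtype='float32')

      loader = DataLoader(RangeDataset(0, 16), batch_size=4, num_workers=2)
      for batch in loader:
          print(batch)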

    Function Optimization (Including Distributed)

    Basic Functions of the Dynamic Graph

    • 👯 Added the clone interface for Tensor, which copies an identical Tensor; the cloned Tensor stays in the computation graph and supports gradient backpropagation
    • Supported in-place modification of a Tensor through indexing or slicing
    • Optimized printing and display of dynamic graph Tensors: the display of high-dimensional tensor data is aligned with numpy, and abbreviated display is supported
    • Optimized the __call__ method of the initializer classes so that passing a block is no longer required, keeping the static graph block concept out of dynamic graph code
    • Hid the scale_loss and apply_collective_grads methods of the dynamic graph multi-card API DataParallel; they no longer need to be called when writing multi-card model code, simplifying usage and improving usability
    • Added oneDNN dynamic graph support, covering ResNet50 model training and inference @intel

    Dynamic Graph to Static Graph

    • Migrated the dynamic-to-static APIs to 2.0 and simplified the import path
    • The dynamic-to-static decorator to_static now supports decorating model instances directly, e.g. to_static(model, input_spec) (see the sketch after this list)
    • Added a default-value parsing mechanism for the name parameter of InputSpec: if name is not specified, the decorated function's parameter name is used
    • Renamed StaticLayer to StaticFunction
    • 🌲 Optimized the dynamic-to-static debug log
    • Fixed dynamic-to-static bugs in some scenarios
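
    A small sketch of decorating a model instance directly and exporting it (the toy network is a placeholder; parameter names follow the 2.0-rc0 notes, where model_path is a file prefix):

      import paddle
      from paddle.static import InputSpec

      class Net(paddle.nn.Layer):
          def __init__(self):
              super(Net, self).__init__()
              self.fc = paddle.nn.Linear(8, 2)

          def forward(self, x):
              return self.fc(x)

      net = Net()
      # to_static can wrap a Layer instance directly; since the InputSpec has no
      # name, the decorated forward's argument name ("x") is used by default.
      net = paddle.jit.to_static(net, input_spec=[InputSpec(shape=[None, 8], dtype='float32')])
      print(net(paddle.rand([4, 8])).shape)

      paddle.jit.save(net, model_path='./infer/net')   # './infer/net' is a prefix, not a directory
      loaded = paddle.jit.load('./infer/net')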

    Mixed Precision Training

    • Refactored the gradient validity check and dynamic loss scaling logic in static graph mixed precision training, removing some condition block logic

    Model Quantization

    • Added per-channel quantization for dynamic graphs, supporting per-channel computation of quantization parameters for the weights of layers such as Conv2D and Linear
    • Added the ability to compute the output scale parameter of model layers during dynamic graph quantization training, for use in server-side quantized inference deployment

    Distributed Training Optimization

    • Supported pipeline parallel training
    • Supported heterogeneous distributed training in parameter server mode, covering PS+GPU, PS+Kunlun, PS+CPU, PS+CPU+GPU (Kunlun), and other device combinations; with a single GPU/Kunlun machine plus 10 CPU machines, a click-through-rate model with hundreds of billions of parameters can be trained on tens of millions of samples in minutes
    • Upgraded the large-scale sparse feature: supports sparse IDs in the int64 range, sparse table auto-growth, configurable admission conditions, and incremental model saving
    • Distributed training supports control-flow-based multitasking, with performance improved by more than 50% over instag-based multitasking

    Model Saving and Loading

    • The paddle.jit.save interface supports storing Layer objects that have not been transcribed by paddle.jit.to_static, expanding the interface's usage scenarios
    • Standardized the set_dict method name of APIs such as Layer and Optimizer, renaming it uniformly to set_state_dict
    • paddle.load supports loading a Layer's state_dict from results stored by the fluid.io.save_inference_model interface, connecting the interface systems and improving usability
    • paddle.load supports loading a Layer's state_dict from the default results stored by the fluid.io.save_params/persistables interfaces, connecting the interface systems and improving usability
    • Modified the behavior of paddle.save/load: paddle.save no longer adds a suffix to the stored result, and paddle.load returns exactly one result per call, standardizing the interface semantics (see the sketch after this list)
    • Added the program method to paddle.jit.TranslatedLayer for getting the program of a model loaded by paddle.jit.load, making it easier to understand the model structure
    • Removed paddle.SaveLoadConfig; for compatible-loading scenarios of interfaces such as paddle.jit.save, paddle.jit.load, and paddle.load, extra configuration is passed via **kwargs, simplifying usage
    • Updated the meaning of the model_path parameter of paddle.jit.save and paddle.jit.load: the user-supplied string is used as a prefix of the stored files rather than as a directory
    • Moved the original static graph APIs paddle.io.save, paddle.io.load, paddle.io.save_inference_model, and paddle.io.load_inference_model to the paddle.static module
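
    A short sketch of the reworked paddle.save/paddle.load and set_state_dict semantics (file names are illustrative):

      import paddle

      layer = paddle.nn.Linear(4, 4)
      opt = paddle.optimizer.Adam(learning_rate=0.001, parameters=layer.parameters())

      # paddle.save no longer appends a suffix; the given path is used as-is.
      paddle.save(layer.state_dict(), 'linear.pdparams')
      paddle.save(opt.state_dict(), 'linear.pdopt')

      # paddle.load returns exactly one object per call.
      layer.set_state_dict(paddle.load('linear.pdparams'))   # renamed from set_dict
      opt.set_state_dict(paddle.load('linear.pdopt'))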

    Performance Optimization (Including Distributed)

    • Improved the performance of the Argsort OP when the number of elements of the input Tensor equals the length of its axis dimension: forward speed improved 34x and backward speed improved 10x
    • Optimized the LARS strategy; the time2train of ResNet50 distributed multi-card training with a 16k batch size is under 10 minutes
    • Added the fused_bn_add_act OP, which fuses the batch_norm, elementwise_add, and activation OPs
    • Added the inplace addto strategy for gradient aggregation, supporting in-place gradient accumulation; performance in ResNet-50 mixed precision training improved by 6.3%

    Debugging and Analysis

    • Continued to improve about 1,500 error-check messages in Paddle, improving the framework's debugging usability

    Compilation and Installation

    • Added Python 3.8 support to the installation package
    • Removed the installation dependency on matplotlib
    • Removed the installation dependency on graphviz
    • Removed the installation dependency on objgraph
    • Removed the installation dependency on netifaces
    • Removed the installation dependency on nltk
    • Removed the installation dependency on opencv
    • Added CUDA 10.1 and CUDA 10.2 support to the installation package
    • The inference library supports a cuda10.2-cudnn8-trt7.1 build

    🐛 Bug Fixes

    • Fixed the error reported by gradient clipping with GradientClipByGlobalNorm in networks whose default Paddle dtype is float64
    • 🏁 Fixed the failure to load CUDA-related DLLs in the Windows CUDA 10.1/10.2 builds
    • 📌 Fixed the bug in copying Tensors between CUDAPinnedPlace and other Places
    • Fixed the error when paddle.jit.load loads a Layer without parameters
    • 🏁 Fixed the calculation error of paddle.diag for large inputs, and fixed its abnormal memory usage on Windows with Python 3.8
    • Fixed the unreasonable output shape of paddle.topk when building static graph networks
    • Fixed the bug that paddle.io.DataLoader in multi-process mode exited with an error when launched through paddle.distributed.spawn
    • Fixed the problem that the runtime device set by the paddle.set_device interface did not take effect in some scenarios
    • Fixed the gradient calculation error caused by using forward-pass variables in the backward computation of paddle.static.nn.while_loop
    • ⚡️ Fixed the bug that fleet did not support paddle.optimizer
    • Fixed the discrepancy between the Adam optimizer's computation formula and the paper
    • 🔊 Fixed the problem that logsumexp caused very slow compilation on some machines
    • Fixed the missing type check in ParamAttr
    • Fixed the average pooling kernel computation of the AvgPool API on CPU when ceil_mode=true
    • Fixed the dimension mismatch when paddle.distributed.fleet.init_server() loads a model
    • Fixed the problem that training nodes did not support GPU in paddle.distributed.fleet parameter server mode
    • Fixed the precision diff of paddle.allclose with the float64 data type
    • Fixed the error in the backward pass of grouped conv operators (conv2d grad op with groups) @intel
    • Fixed the bug that a model decorated with dynamic-to-static to_static could not be saved after switching directly to eval mode
    • Fixed the bug that matmul did not support fp16
    • Fixed the poor backward performance and high memory usage of matmul
    • Fixed the error when the bias_attr and weight_attr parameters of paddle.nn.Transformer are specified as bool or list/tuple
    • Fixed the problem that dynamic_decode could not correctly finish decoding early during prediction
    • Fixed the wrong result of paddle.unsqueeze when axis is a Tensor
    • Fixed problems caused by zero_copy in paddle.to_tensor in some scenarios; zero_copy behavior is temporarily disabled

    Inference

    Paddle Inference

    • The default name of the inference library is changed from fluid_inference to paddle_inference

    Function Upgrades

    • Paddle-TRT dynamic shape supports PaddleSlim quantized Int8 models
    • Paddle Inference GPU Int8 supports conv2d_transpose quantization
    • Added operator version information to inference models
    • Added support for (de/re)quantization with shifted scales in the oneDNN INT8 quantization strategy @intel
    • Added oneDNN BF16 support: the conv2d bf16 and gru bf16 operators are supported, enabling resnet50 bf16 model inference @intel

    Performance Optimization

    • The inference performance of the ERNIE model with Paddle-TRT FP16 on T4 is improved by 15% @NVIDIA
    • Added support for oneDNN FP32 GRU and oneDNN INT8 GRU; the GRU INT8 model is about 1.49x faster than NativeConfig inference (thread=1, batch_size=50) @intel
    • With oneDNN upgraded to 1.6, Ernie Large oneDNN inference on Skylake (Intel Xeon Gold 6148) is about 2.7x faster (unit test test_analyzer_ernie_large) @intel

    🐛 Bug Fixes

    • Fixed the memory leak with variable-length input when the Paddle Inference ZeroCopyRun interface is used with MKLDNN enabled
    • Fixed the prediction error when the ERNIE model contains shared parameters
    • Fixed the initialization error of the inference library built with Paddle-TensorRT in environments without TensorRT installed
    • Fixed the dimension calculation errors of the softmax op and layer_norm op when using Paddle-TRT prediction
    • Fixed the issue that increasing cpu_math_library_num_threads_ did not improve inference performance (PaddleOCR repository) @intel
    • Fixed the oneDNN concat data overwriting error @intel
    • Fixed the error when running oneDNN inference with NHWC models @intel
    • Fixed the oneDNN prediction failure of the rec_r34_vd_tps_bilstm_attn model @intel
    • Fixed the deeplabv3p_xception oneDNN (MKLDNN) inference failure by adding support for conv with dilations @intel

    🚀 2.0-rc0 Release Note

    ⚡️ Important Updates

    • 0️⃣ Default mode: Starting from paddle 2.0-rc, dynamic graph mode is enabled by default. To use the static graph programming mode, switch to it with paddle.enable_static().
    • 🚚 Framework APIs: Modify 58 commonly used API names, add 95 APIs (including migration from the earlier V1.8), remove 220 APIs (including alias removal), add the support of the Kunlun chips in 50 APIs, add the second-order derivative calculation in 8 APIs, and functionally enhance the distributed APIs and high-level APIs.
    • ⚡️ Framework features: Optimize the dynamic-to-static conversion usage, optimize model reading and loading, optimize mixed-precision training and quantization strategies, optimize distributed training strategies, and streamline compilation and installation package dependencies.
    • 🐎 Inference engine: Enhance the int8 quantization capability, optimize the oneDNN performance, and fix a number of bugs.

    Training Framework

    Basic API (Including Distributed)

    Name Change of Commonly Used APIs

    • 👀 Modified 58 API names. For details, see link

    ➕ Added APIs

    • ➕ Added paddle.empty API to return uninitialized memory
    • ➕ Added paddle.empty_like API to return uninitialized memory
    • ➕ Added paddle.mv API to return the matrix-vector multiplication result
    • ➕ Added paddle.multinomial multinomial distribution API
    • Added paddle.nn.LocalResponseNorm and paddle.nn.functional.local_response_norm
    • ➕ Added paddle.nn.Pad1D/Pad2D/Pad3D api, and supported constant, reflect, replicate and circular modes
    • ➕ Added paddle.add_n
    • ➕ Added the dynamic graph mixed precision training APIs, paddle.amp.auto_cast and paddle.amp.GradScaler

    🛠 Fixed and Improved APIs

    • 👍 paddle.reshape API supports bool type input
    • 🌲 paddle.distribution.Categorical API is added with sample and log_prob methods
    • 👍 BatchNorm1D, BatchNorm2D, and BatchNorm3D are added with the support of the channel last data layout
    • ⚡️ Adjusted the parameter order of paddle.optimizer.Adam and paddle.optimizer.AdamW
    • 👍 yolo_box supports input feature maps whose H and W are not equal, enabling prediction on images with unequal width and height
    • 👍 paddle.nn.functional.interpolate supports scale_factor inputs of type list
    • ➕ Added oneDNN support for the adaptive pool2d operator @intel
    • ➕ Added oneDNN support for dilated conv and dilated conv_transpose @intel
    • 👍 unique supports the GPU device computing
    • 👍 paddle.multiply supports the input of non-variable and tensor data types
    • ⚡️ Optimized the parameter order of the RNN classes (SimpleRNN, LSTM, GRU) and the implementation of the base class RNNBase, and integrated cuDNN LSTM
    • 🛠 Fixed the GPU gradient anomaly of adaptive_pool op in special output cases

    ✂ Removed APIs (Including Aliases)

    • ✂ Removed 220 APIs (including aliases), see link

    ➕ Added the Second-order Derivation Function

    • 👍 batch_norm supports second-order derivation
    • 👍 abs supports second-order derivation
    • 🌲 log supports second-order derivation
    • 👍 expand supports second-order derivation
    • 👍 tile supports second-order derivation
    • 👍 squeeze supports second-order derivation
    • 👍 unsqueeze supports second-order derivation
    • 👍 matmul supports second-order derivation

    👌 Support of Kunlun (XPU) Devices

    • uniform_random, gaussian_random and truncated_gaussian_random support XPU devices
    • 👍 paddle.concat, paddle.assign and paddle.cast APIs support XPU devices
    • 👍 paddle.reshape and paddle.shape APIs support XPU devices
    • 👍 stack, pool2d, and roi_align support XPU devices
    • 🌲 conv2d, dropout, and log_loss support XPU devices
    • 👍 softmax supports XPU devices
    • mean and softmax_with_cross_entropy support XPU devices
    • 👍 sgd and momentum support XPU devices
    • sum, sign, scale, accuracy, elementwise_mul, elementwise_div, elementwise_sub, and elementwise_max support XPU devices
    • 👍 slice supports XPU devices
    • ➕ mul, pow, relu, sigmoid, sqrt, square, tanh, log, abs, elementwise_add, gelu, and matmul_v2 support xpu devices
    • 👍 transpose supports XPU devices
    • reduce_sum and reduce_mean support XPU devices
    • batch_norm and layer_norm support XPU devices
    • 👍 fill_constant supports XPU devices
    • 👍 load supports XPU devices
    • lookup_table_v2_xpu and adam support XPU devices
    • 👍 gather supports XPU devices
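
    If an XPU-enabled build of Paddle is installed, device selection is expected to follow the usual set_device pattern; a hedged sketch (the 'xpu' device string assumes a Kunlun-enabled package):

      import paddle

      paddle.set_device('xpu')     # raises an error on builds without XPU support
      x = paddle.rand([2, 3])
      y = paddle.matmul(x, paddle.rand([3, 2]))   # matmul is in the XPU-supported op list
      print(y)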

    Multi-device/Distributed Training APIs

    • The Fleet API is formalized and unified under paddle.distributed.fleet as the Paddle universal distributed training entry point
    • paddle.distributed.fleet.DistributedStrategy is exposed as Paddle unified parallel strategy definition entry
    • ➕ Added paddle.distributed.fleet.meta_optimizer.RecomputeOptimizer API to support the distributed re-computing mechanism
    • ➕ Added paddle.distributed.fleet.meta_optimizer.GradientMergeOptimizer API to support the distributed gradient accumulation mechanism
    • ➕ Added paddle.distributed.fleet.meta_optimizer.PipelineOptimizer API to support the distributed pipeline parallel mechanism
    • 👍 paddle.distributed.fleet.DistributedStrategy is added with the amp optimization strategy to support the enabling of the automatic mixed precision mechanism in the distributed environment
    • 👍 paddle.distributed.fleet.DistributedStrategy is added with the dgc optimization strategy to support the enabling of deep gradient compression mechanism in the distributed environment
    • 👍 paddle.distributed.fleet.DistributedStrategy is added with the fp16_allreduce optimization strategy to support the enabling of fp16 allreduce communication mechanism in the distributed environment
    • ⚡️ paddle.distributed.fleet.DistributedStrategy is added with the lars optimization strategy to support the use of lars optimizer for large batch size training in the distributed environment
    • ⚡️ paddle.distributed.fleet.DistributedStrategy is added with the lamb optimization strategy to support the use of lamb optimizer for large batch size training in the distributed environment
    • 👍 paddle.distributed.fleet supports multi-optimization strategy combinations, including combinations of more than ten kinds of strategies such as amp+recompute, dgc+recompute, amp+recompute+lars, and so on
    • 🔀 paddle.distributed.fleet.DistributedStrategy is added with the a_sync optimization strategy to support synchronous, asynchronous, GeoSGD, and heterogeneous parameter server optimization training by using the parameter servers in the distributed environment
    • 👍 paddle.distributed.fleet.DistributedStrategy is added with the auto experimental optimization strategy to support auto parallel for multi-strategy optimization in the distributed environment
    • ➕ Added fleetrun to start the distributed training task, to support Collective mode to start in the single-machine single-card, single-machine multi-card and multi-machine multi-card, support the parameter server mode to start under CPU cluster, GPU cluster, and heterogeneous cluster, and support the direct submission of the PaddleCloud cluster
    • 👍 paddle.distributed.fleet supports dynamic graph execution and supports the single-machine single-card, single-machine multi-card and multi-machine multi-card training of a dynamic graph in GPU mode
    • ⬇️ paddle.distributed.fleet is added with the communication collection function, to support all_reduce, all_gather and barrier functions
    • paddle.distributed.fleet is added with the distributed metric calculation function, including auc, rmse, mae, and acc
    • In paddle.distributed.fleet, fleet.main_program and fleet.startup_program are removed to be replaced with paddle.static.default_main_program() and paddle.static.default_startup_program()
    • 👍 paddle.distributed.fleet supports heterogeneous parameter server mode, to implement the heterogeneous computing device training and cross-device collaborative distributed training through fleetAPI and user networking
    • 👍 Distributed collective communication API supports CPU devices
    • paddle.distributed.fleet.DistributedStrategy is added with the localsgd optimization strategy
    • 👍 paddle.distributed.fleet.DistributedStrategy is added with the adaptivelocalsgd optimization strategy to support the localsgd strategy to automatically calculate step in the distributed environment
    • ➕ Added InMemoryDataset and QueueDataset under paddle.distributed to support distributed training by using Dataset

    High-level APIs

    • 👷 Added the IterableDataset base class to support streaming datasets. DataLoader supports multi-process acceleration of IterableDataset, and supports getting the child process state through paddle.io.get_worker_info() and dividing data between processes
    • ⚡️ The places parameter of paddle.io.DataLoader is updated to be optional; when places is not specified, the default places are used
    • ➕ Added 10 map-style datasets such as CIFAR10, CIFAR100, Conll05st, and so on, to support automatic download of dataset and get data in map-style mode
    • ➕ Added the num_replicas and rank parameters of the DistributedBatchSampler interface, for specifying the number of cards and the logical serial number of the current card
    • ➕ Added paddle.io.TensorDataset to support reading tensor datasets
    • ➕ Added the paddle.io.Sampler base class, plus SequenceSampler and RandomSampler for getting data in BatchSampler in sequential or random order
    • 👍 paddle.io.BatchSampler supports Sampler as input, and the original input parameter indices is deleted
    • ✂ Removed the original API in paddle.reader
    • The image transform operators in paddle.vision.transforms are added with a PIL processing backend
    • 👍 paddle.summary supports multi-input multi-output Layers
    • model.save is upgraded. When a dynamic graph saves an inference model, the user does not need to call paddle.jit.to_static or add a decorator to the layer function (the dynamic-to-static feature). If inputs is passed when the Model is initialized, the correct input shape is saved; otherwise, the input shape of the model is saved according to the input shape passed when running the model

    Function Optimization (Including Distributed)

    Dynamic Graph

    • ➕ Added oneDNN dynamic graph (dygraph) support, covering ResNet50 model training and inference; it is faster than CPU NativeConfig training @intel

    Dynamic Graph to Static Graph

    • For the dynamic graph to static graph conversion, the related APIs are migrated to 2.0, simplifying the import path
    • 👍 Dynamic-to-static decorator to_static is added with the support of the direct decoration model instances, for example, to_static(model, input_spec)
    • ➕ Added the parsing mechanism for the default value of the name parameter in InputSpec. If no name is specified, the decorated function parameter name is used as name
    • StaticLayer is renamed to StaticFunction
    • ⚡️ Optimized the dynamic to static Debug log
    • 🛠 Fixed the dynamic to static bug in some scenes

    Mixed Precision Training

    • Added fused_bn_add_act OP, with the integration of batch_norm, elementwise_add and activation OP
    • ➕ Added inplace addto strategy for gradient aggregation, to support in-situ gradient summation. The performance is improved by 6.3% in ResNet-50 mixed precision training
    • 🚚 Re-constructed the gradient validity check and dynamic loss scaling logic in static graph mixed precision training, and removed some condition block logics

    Distributed Training Optimization

    • ⚡️ Optimized lars strategy, ResNet50 distributed multi-card training 16k batch size with the time2train index smaller than 10 minutes
    • 👌 Supported the pipeline training in parallel
    • 👌 Supported the heterogeneous distributed training in parameter server mode, covering PS+GPU, PS+Kunlun, PS+CPU, PS+CPU+GPU (Kunlun) and other device combinations; with a single GPU/Kunlun machine plus 10 CPU machines, a click-through-rate model with hundreds of billions of parameters can be trained on tens of millions of samples in minutes
    • ⬆️ Upgraded the massive sparse function, to support for the sparse ID in int64 range, and support for sparse table self-growth, configuration access conditions and incremental model preservation function
    • 🐎 Distributed support for control flow multitasking. The performance is improved by over 50% than that in instag multitasking

    Model Quantization

    • ➕ Added the per-channel quantization function for dynamic graphs, supporting per-channel computation of quantization parameters for the weights of layers such as Conv2D and Linear
    • ➕ Added the function of getting the output scale parameter on model layer during dynamic graph quantization training for Server-side quantization inference deployment
    • ⚡️ Optimized offline quantization of static graphs to avoid saving temporary data to disk

    Model Saving and Loading

    • 👌 Supported paddle.jit.save interface to store the Layer object without paddle.jit.to_static transcription, to expand the interface usage scenarios
    • Standardized the set_dict method name of APIs such as Layer and Optimizer, renaming it uniformly to set_state_dict to standardize the interface name
    • Supported the loading of state_dict of Layer from the result stored in the fluid.io.save_inference_model interface by paddle.load
    • 0️⃣ Supported the loading of state_dict of Layer from the default result stored in the fluid.io.save_params/persistables interface by paddle.load, to enable the interface system and improve the usability
    • Modified the paddle.save/load interface behavior. paddle.save does not add the suffix for the stored results. paddle.load returns only one result in each loading to standardize the interface semantics
    • paddle.jit.TranslatedLayer is added with the program method, to get the program of the paddle.jit.load loading model to facilitate the understanding of the model structure
    • Removed paddle.SaveLoadConfig. For paddle.jit.save, paddle.jit.load, paddle.load and other interface-compatible loading scenarios, use **kwargs to pass in additional configuration to simplify the use of the interface
    • ⚡️ Updated the meaning of model_path of the paddle.jit.save and paddle.jit.load interface parameter. The user input string is used as a prefix to the stored file, instead of a directory
    • Original static graph APIs such as paddle.io.save, paddle.io.load, paddle.io.save_inference_model, and paddle.io.load_inference_model are moved to the paddle.static module

    🐎 Performance Optimization (Including Distributed)

    • 👌 Improved the performance of Argsort OP when the number of input Tensor elements is equal to its axis dimensional length. The forward speed is improved by 34 times and the reverse speed is improved by 10 times

    Basic Functions for Dynamic Graph

    • ➕ Added the clone interface of Tensor, which copies an identical Tensor; the cloned Tensor stays in the computation graph and supports gradient backpropagation
    • Hided the scale_loss and apply_collective_grads methods of multi-card API DataParallel of the dynamic graphs. The two methods need not to be called when multi-card model codes are prepared. This can simplify the writing method and improve the usability
    • 👌 Supported the modification of Tensor through index or slice (inplace)
    • ⚡️ Optimized dynamic graph Tensor printing and display; the display of high-dimensional tensor data is aligned with numpy, and abbreviated display is supported
    • Optimized the __call__ method of the initializer class. The pass-in of block is not required. This can prevent the user from perceiving the static graph block concept in the dynamic graph

    Debugging Analysis

    • Continued to improve about 1500 pieces of error checking hint texts in paddle, to improve the framework debugging and usability

    Compiling and Installation

    • ➕ Added the support for python3.8 in the installation package
    • ✂ Removed the installation dependency on matplotlib
    • ✂ Remove the installation dependency on graphviz
    • ✂ Removed the installation dependency on objgraph
    • ✂ Removed the installation dependency on netifaces
    • ✂ Remove the installation dependency on nltk
    • ✂ Removed the installation dependency on opencv
    • ➕ Added the installer support for cuda10.1 and cuda 10.2
    • 👍 The prediction library supports cuda10.2-cudnn8-trt7.1 version

    🐛 Bug Fixing

    • 🛠 Fixed the bug of error reported by gradient clipping GradientClipByGlobalNorm used in network where Paddle default dtype is float64
    • 🛠 Fixed the bug of Windows-based CUDA version 10.1/10.2 failed to load CUDA related dll
    • 🛠 Fixed the bug of Tensor copy each other between CUDAPinnedPlace and other Place
    • 🛠 Fixed the bug of error in paddle.jit.load loading Layer without parameter
    • 🛠 Fixed the bug of calculation error in the large size input of paddle.diag, and fixed the bug of memory usage exception of paddle.diag in Windows Python 3.8 environment
    • 🛠 Fixed the unreasonable shape problem of paddle.topk in static graph networking
    • 🛠 Fixed the bug that paddle.io.DataLoader in multi-process mode exited with an error when started through paddle.distributed.spawn
    • 🛠 Fixed the problem that the runtime device set by the paddle.set_device interface did not take effect in some scenarios
    • 🛠 Fixed the bug of the gradient calculation error caused by using the variable of forward calculation in paddle.static.nn.while_loop backward calculation
    • 🛠 Fixed the bug of fleet not supporting paddle.optimizer
    • 🛠 Fixed the bug that the Adam optimizer's computation formula differed from the paper
    • 🛠 Fixed the problem of logsumexp causing too slow compilation on some machines
    • 🛠 Fixed the ParamAttr missing type check problem
    • 🛠 Fixed the calculation problem of average pooling core on CPU when AvgPool API ceil_mode=true
    • 🛠 Fixed the dimension mismatch problem when paddle.distributed.fleet.init_server() is loaded with a model
    • 🛠 Fixed the problem that the training node does not support GPU in paddle.distributed.fleet parameter server mode
    • 🛠 Fixed the precision diff problem of paddle.allclose in float64 data type
    • 🛠 Fixed the error in the backward pass of grouped conv operators (conv2d grad op with groups) @intel
    • 🛠 Fixed the bug of failure to save the model when a dynamic-to-static to_static decorated model is directly switched to eval mode
    • 🛠 Fixed the bug that matmul did not support fp16
    • 🛠 Fixed the problem of poor performance of matmul reverse calculation and high memory consumption
    • Fixed the error when the bias_attr and weight_attr parameters of paddle.nn.Transformer are specified as bool, list/tuple
    • 🛠 Fixed the problem that dynamic_decode prediction decoding doesn't end early correctly
    • 🛠 Fixed the result error of paddle.unsqueeze when axis is Tensor
    • Fixed the problem of paddle.to_tensor caused by zero_copy in some scenarios, to temporarily disable the zero_copy behavior

    Inference

    Paddle Inference

    • Changed the default name of prediction library from fluid_inference to paddle_inference

    API

    ⬆️ Function Upgrading

    • 👍 Paddle-TRT dynamic shape supports PaddleSlim quantization of Int8 models
    • 👍 Paddle Inference GPU Int8 supports conv2d_transpose quantization
    • ➕ Added operator version information for the prediction model
    • ➕ Added support for (de/re)quantization with shifted scales in the oneDNN INT8 quantization strategy @intel
    • ➕ Added oneDNN BF16 support: the conv2d bf16 and gru bf16 operators are supported, enabling resnet50 bf16 model inference @intel

    🐎 Performance Optimization

    • 🐎 The inference performance of ERNIE model using Paddle-TRT FP16 on T4 is improved by 15%.@NVIDIA
    • 👍 Added support for oneDNN FP32 GRU and oneDNN INT8 GRU; the GRU INT8 model is about 1.49x faster than NativeConfig inference (thread=1, batch_size=50) @intel
    • With oneDNN upgraded to 1.6, Ernie Large oneDNN inference on Skylake (Intel Xeon Gold 6148) is about 2.7x faster (unit test test_analyzer_ernie_large) @intel

    🐛 Bug Fixing

    • 🛠 Fixed the bug of memory leak under the variable length input when a user uses the Paddle Inference ZeroCopyRun interface to enable MKLDNN
    • 🛠 Fixed the bug of prediction error when ERNIE model contains shared parameters
    • 🛠 Fixed the bug of initialization error for the prediction library with the Paddle-TensorRT function in the environment when TensorRT is not installed
    • 🛠 Fixed the bug of dimension calculation error when softmax op and layer_norm op use the Paddle-TRT prediction
    • Fixed the issue that increasing cpu_math_library_num_threads_ did not improve inference performance (PaddleOCR repository) @intel
    • Fixed the oneDNN concat data overwriting error @intel
    • Fixed the error when running oneDNN inference with NHWC models @intel
    • Fixed the oneDNN prediction failure of the rec_r34_vd_tps_bilstm_attn model @intel
    • Fixed the deeplabv3p_xception MKLDNN inference failure by adding support for conv with dilations @intel

Previous changes from v2.0.0-beta0

  • 🚀 2.0-beta Release Note

    Important Updates

    This version is the beta release of PaddlePaddle Framework v2.0. The most important changes are a full upgrade of the API system and a comprehensive improvement of the imperative programming (dynamic graph) capability. This version systematically optimizes the directory structure of the PaddlePaddle basic APIs, comprehensively fixes related legacy issues, and substantially supplements the APIs, in particular providing more complete high-level API functionality. It also adds support for quantization training and mixed precision training under dynamic graphs; dynamic-to-static conversion now has complete syntax support and is much easier to use; dynamic graph functionality is approaching maturity, and dynamic graph mode is recommended. In addition, the C++ APIs of the inference library are upgraded and optimized, and both the inference library's support for quantized models and its inference performance are fully enhanced.

    Training Framework

    Basic APIs

    Compatibility Notes

    • Paddle 2.x recommends using the APIs located in the paddle root directory, while all Paddle 1.x APIs are kept under the paddle.fluid directory. By design, code written for Paddle 1.x training runs on Paddle 2.x without any modification, and models saved by Paddle 1.x training can be used for inference with Paddle 2.x.

    Directory Structure Adjustment

    • Based on the 2.0-alpha version, this version makes some adjustments to the directory structure. The latest structure is as follows:
    Directory Functions and Included APIs
    paddle.* The paddle root directory keeps aliases of commonly used APIs, currently including all the APIs in the paddle.tensor and paddle.framework directories
    paddle.tensor APIs related to tensor operations, such as creation (zeros), matrix operations (matmul), transformations (concat), computation (add), and search (argmax)
    paddle.nn Networking-related APIs, such as Linear, Conv2d, loss functions, convolution, LSTM, and activation functions
    paddle.static.nn Networking APIs specific to static graphs, such as the input placeholder data, the fully connected layer fc, and the control flow ops while_loop/cond
    paddle.static Basic framework APIs for static graphs, such as Variable, Program, and Executor
    paddle.framework Universal framework APIs and imperative-mode APIs, such as to_tensor
    ⚡️ paddle.optimizer APIs related to optimization algorithms, such as SGD, Adagrad, and Adam
    ⚡️ paddle.optimizer.lr_scheduler APIs related to learning rate scheduling
    paddle.metric APIs related to evaluation metric computation, such as accuracy and auc
    paddle.io APIs related to data input and output, such as Dataset and DataLoader
    paddle.device APIs related to device management, such as CPUPlace and CUDAPlace
    paddle.distributed Distributed basic APIs
    paddle.distributed.fleet Distributed high-level APIs
    paddle.vision Vision domain APIs, such as datasets, data processing, and commonly used basic network structures like resnet
    paddle.text NLP domain APIs, such as datasets, data processing, and commonly used network structures like transformer

    API Alias Rules

    • For user convenience, APIs have aliases in different paths, for example paddle.add -> paddle.tensor.add; users are recommended to use the shorter path paddle.add
    • All APIs under the framework and tensor directories have aliases in the paddle root directory; except for a few special APIs, other APIs have no aliases in the paddle root directory.
    • All APIs in the paddle.nn directory except those under the functional directory have aliases in paddle.nn; APIs in the functional directory have no aliases in paddle.nn.
    • The following are some special alias relations; the names on the left are recommended:
      • paddle.sigmoid -> paddle.tensor.sigmoid -> paddle.nn.functional.sigmoid
      • paddle.tanh -> paddle.tensor.tanh -> paddle.nn.functional.tanh
      • paddle.remainder -> paddle.mod -> paddle.floor_mod
      • paddle.divide -> paddle.true_divide
      • paddle.rand -> paddle.uniform
      • paddle.randn -> paddle.standard_normal
      • Optimizer.clear_grad -> Optimizer.clear_gradients
      • Optimizer.set_state_dict -> Optimizer.set_dict
      • Optimizer.get_lr -> Optimizer.current_step_lr
      • Layer.clear_grad -> Layer.clear_gradients
      • Layer.set_state_dict -> Layer.set_dict

    Name Changes of Commonly Used APIs

    • This version uses Tensor to represent data; the tensor creation API paddle.fluid.dygraph.to_variable is renamed paddle.to_tensor
    • Addition, subtraction, multiplication, and division use full names instead of abbreviations
    • Element-wise operations no longer carry the elementwise prefix
    • Operations along an axis no longer carry the reduce prefix
    • The Conv, Pool, Dropout, BatchNorm, and Pad networking APIs gain 1d, 2d, and 3d suffixes according to the input data type
    Paddle 1.8 Paddle 2.0-beta
    paddle.fluid.layers.elementwise_add paddle.add
    paddle.fluid.layers.elementwise_sub paddle.subtract
    paddle.fluid.layers.elementwise_mul paddle.multiply
    paddle.fluid.layers.elementwise_div paddle.divide
    paddle.fluid.layers.elementwise_max paddle.maximum
    paddle.fluid.layers.elementwise_min paddle.minimum
    paddle.fluid.layers.reduce_sum paddle.sum
    paddle.fluid.layers.reduce_prod paddle.prod
    paddle.fluid.layers.reduce_max paddle.max
    paddle.fluid.layers.reduce_min paddle.min
    paddle.fluid.layers.reduce_all paddle.all
    paddle.fluid.layers.reduce_any paddle.any
    paddle.fluid.dygraph.Conv2D paddle.nn.Conv2d
    paddle.fluid.dygraph.Conv2DTranspose paddle.nn.ConvTranspose2d
    paddle.fluid.dygraph.Pool2D paddle.nn.MaxPool2d, paddle.nn.AvgPool2d

    Added APIs

    • Added a total of 140 APIs; see the link and the API documentation for details
      • Added environment setting APIs: paddle.set_default_dtype, paddle.get_default_dtype, paddle.set_device, paddle.get_device, paddle.manual_seed
      • Added tensor operation APIs: numel, chunk, masked_select, isfinite, isinf, isnan, sort, topk, Flatten, dim, tile
      • Added networking APIs: Linear, Bilinear, Embedding, linear, bilinear, embedding
      • Added vision networking APIs: Conv1d, ConvTranspose1d, MaxPool1d, MaxPool2d, MaxPool3d, AvgPool1d, AvgPool2d, AvgPool3d, AdaptiveMaxPool1d, AdaptiveMaxPool2d, AdaptiveMaxPool3d, ReflectionPad1d, ReflectionPad2d, ReflectionPad3d, ReplicationPad1d, ReplicationPad2d, ReplicationPad3d, ZeroPad2d, ConstantPad1d, ConstantPad2d, ConstantPad3d, PixelShuffle, Upsample, UpsamplingNearest2d, UpsamplingBilinear2d, conv1d, conv_transpose1d, avg_pool1d, avg_pool2d, avg_pool3d, max_pool1d, max_pool2d, max_pool3d, adaptive_max_pool1d, adaptive_max_pool2d, adaptive_max_pool3d, adaptive_avg_pool1d, adaptive_avg_pool3d
      • Added text processing networking APIs: SimpleRNN, LSTM, GRU, MultiHeadAttention, Transformer, TransformerEncoder, TransformerEncoderLayer, TransformerDecoder, TransformerDecoderLayer
      • Added activation APIs: ELU, Hardshrink, Hardtanh, PReLU, ReLU6, Tanh, Tanhshrink, Softmax
      • Added normalization APIs: BatchNorm1d, BatchNorm2d, BatchNorm3d, SyncBatchNorm, InstanceNorm1d, InstanceNorm2d, InstanceNorm3d, weight_norm, remove_weight_norm, batch_norm, instance_norm, layer_norm, normalize
      • Added dropout APIs: Dropout2d, Dropout3d, AlphaDropout, dropout, dropout2d, dropout3d
      • Added similarity and loss function APIs: CosineSimilarity, PairwiseDistance, CTCLoss, KLDivLoss, BCEWithLogitsLoss, MarginRankingLoss, SmoothL1Loss, cosine_similarity, binary_cross_entropy, binary_cross_entropy_with_logits, cross_entropy, ctc_loss, l1_loss, mse_loss, margin_ranking_loss, nll_loss, smooth_l1_loss
      • Added distributed communication APIs: broadcast, all_reduce, reduce, all_gather, scatter, barrier
      • Added probability distribution APIs: Distribution, normal, bernoulli
      • Added Optimizer-related APIs: step, AdamW
      • Added dataset-related APIs: Dataset, IterableDataset, TensorDataset, Sampler, RandomSampler, BatchSampler, DistributedBatchSampler

    Fixed and Improved APIs

    • ⬆️ Modified and improved a total of 155 APIs; see the link and the API documentation for details
    • Fixed APIs related to random number generation, including seed setting: paddle.rand, randn, randint, randperm, dropout, Uniform, and Normal
    • Upgraded the underlying C++ OPs of the following APIs; they are theoretically compatible, but a small number of incompatibilities cannot be ruled out: linspace, concat, gather, gather_nd, split, squeeze, unsqueeze, clip, argmax, argmin, mean, norm, unique, cumsum, LeakyReLU, leaky_relu, hardshrink, embedding, margin_ranking_loss, grid_sample, affine_grid
    • Added oneDNN support for the relu6 and Sigmoid activation functions

    Multi-device/Distributed Training APIs

    Single-Machine Multi-Card Training Under a Dynamic Graph

    • Added paddle.distributed.spawn(func, args=(), nprocs=-1, join=True, daemon=False, **options) for launching multi-card training under a dynamic graph.
    • Added paddle.distributed.init_parallel_env() for initializing the environment of multi-card training under a dynamic graph.
    • Added paddle.distributed.get_rank() for getting the rank of the current process during multi-card training.
    • Added paddle.distributed.get_world_size() for getting the total number of processes participating in multi-card training.

    Distributed Collective Communication

    • Added paddle.distributed.broadcast(tensor, src, group=0), which broadcasts a tensor from the specified process to all processes.
    • Added paddle.distributed.all_reduce(tensor, op=ReduceOp.SUM, group=0), which reduces the specified Tensor across all processes and returns the result to all processes.
    • Added paddle.distributed.reduce(tensor, dst, op=ReduceOp.SUM, group=0), which reduces the specified Tensor across all processes and returns the result to the specified process.
    • Added paddle.distributed.all_gather(tensor_list, tensor, group=0), which gathers the specified Tensor from all processes and returns the results to all processes.
    • Added paddle.distributed.scatter(tensor, tensor_list=None, src=0, group=0), which distributes the Tensors in the specified process's tensor list to all processes.
    • Added paddle.distributed.barrier(group=0), which synchronizes all processes.
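
    A minimal sketch of the collective APIs above, launched with paddle.distributed.spawn (two visible GPUs are assumed):

      import paddle
      import paddle.distributed as dist

      def train():
          # Initialize the parallel environment for the current process.
          dist.init_parallel_env()
          data = paddle.to_tensor([float(dist.get_rank() + 1)])
          dist.all_reduce(data)                  # sums across all processes by default
          print(dist.get_rank(), data.numpy())   # every rank prints the same reduced value

      if __name__ == '__main__':
          dist.spawn(train, nprocs=2)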

    High-level APIs

    • Added the PaddlePaddle high-level APIs, which encapsulate common operations in model development such as networking, training, evaluation, prediction, and saving/loading, enabling low-code development; for the MNIST handwritten digit recognition task, compared with the imperative programming implementation, the high-level APIs reduce executable code by 80%.
    • Data management
      • Unified data loading usage
      • Dataset definition, implemented by inheriting paddle.io.Dataset.
      • Multi-process data loading, using paddle.io.DataLoader
      • Added paddle.io.IterableDataset for streaming datasets, with concurrent acceleration supported in paddle.io.DataLoader.
      • Added paddle.io.get_worker_info for splitting data among worker processes in paddle.io.IterableDataset.
    • Model networking
      • Added encapsulation of the common loss APIs paddle.nn.loss.* and metric APIs paddle.metric.*
      • Released 12 models implemented with the high-level APIs
      • Transformer, Seq2seq, LAC, BMN, ResNet, YOLOv3, VGG, MobileNet, TSM, CycleGAN, Bert, OCR
      • Published in the examples directory of the PaddlePaddle/hapi repository
    • Model execution
      • Added the Model class paddle.Model, which encapsulates the basic functionality commonly used in model development, including the following (see the sketch after this list):
      • Model.summary to view the network structure and parameter count of a dynamic graph network.
      • Model.prepare to specify the loss function and optimization algorithm.
      • Model.fit to implement training and evaluation; custom behaviors such as model saving can be executed during training via callbacks.
      • Model.evaluate to run prediction and metric computation on the evaluation dataset.
      • Model.predict to run inference on specific test data.
      • Model.train_batch to train on a single batch of data.
      • Model.eval_batch to evaluate a single batch of data.
      • Model.test_batch to test a single batch of data.
      • Model.save/Model.load, supporting saving an inference model in dynamic graph training mode.
      • Added the Callback APIs paddle.callbacks.* for the model execution interfaces, performing logging, checkpoint saving, and so on; users can customize behavior by inheriting paddle.callbacks.Callback.
    • Domain APIs
      • Added the vision (CV) APIs paddle.vision
      • Added the Dataset APIs paddle.vision.datasets.*, encapsulating commonly used datasets and supporting random access to data
      • Added 24 common data preprocessing APIs such as Resize and Normalize under paddle.vision.transforms.*
      • Added image classification backbone networks and pretrained parameters
        • paddle.vision.models.lenet or paddle.vision.lenet
        • paddle.vision.models.vgg or paddle.vision.vgg
        • paddle.vision.models.resnet or paddle.vision.resnet
        • paddle.vision.models.mobilenetv1 or paddle.vision.mobilenetv1
        • paddle.vision.models.mobilenetv2 or paddle.vision.mobilenetv2
      • Added the natural language processing (NLP) APIs paddle.text
      • Added the Dataset APIs paddle.text.datasets.*, encapsulating commonly used datasets and supporting random access to data
      • Added the domain networking APIs paddle.text.*
    • Automatic breakpoint restart
      • Added the train_epoch_range interface, which implements epoch-level automatic checkpoint saving and loading on static graphs and supports automatic restart from a breakpoint.
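
    A compact sketch of the high-level API flow described above (a deliberately tiny network; the ToTensor transform and automatic MNIST download are assumed to be available in this release):

      import paddle
      from paddle.metric import Accuracy
      from paddle.vision.datasets import MNIST
      from paddle.vision.transforms import ToTensor

      # Wrap a plain Layer with the high-level Model class.
      net = paddle.nn.Sequential(paddle.nn.Flatten(), paddle.nn.Linear(784, 10))
      model = paddle.Model(net)

      model.prepare(optimizer=paddle.optimizer.Adam(parameters=model.parameters()),
                    loss=paddle.nn.CrossEntropyLoss(),
                    metrics=Accuracy())

      train_data = MNIST(mode='train', transform=ToTensor())   # downloads on first use
      model.fit(train_data, epochs=1, batch_size=64, verbose=1)
      model.evaluate(MNIST(mode='test', transform=ToTensor()), verbose=1)
      model.save('output/mnist')                               # training-format checkpoint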

    Function Optimization (Including Distributed)

    Dynamic Graph to Static Graph

    • Added syntax support to ProgramTranslator
      • Added dynamic-to-static support for the return syntax, allowing early return inside if-elif-else or loop conditions and returning tensors of different types or None.
      • Added dynamic-to-static support for the print syntax, so print(tensor) also prints the tensor in the converted program.
      • Added dynamic-to-static support for 'for' iterating over a Tensor, 'for enumerate' over a Tensor, 'for' over a TensorList, and 'for enumerate' over a TensorList, so loops that process Tensors can be used flexibly.
      • Added dynamic-to-static support for the assert syntax, so assert tensor can ensure the tensor is True (bool type) or non-zero (other data types).
      • Added transcription support for data type casts, so dynamic graph type conversion statements such as float(tensor) and int(tensor) also perform type conversion in the static graph.
    • Usability improvements of ProgramTranslator
      • Changed the return type of dynamic-to-static conversion from a callable function to class StaticLayer; interfaces such as .code and .main_program on this class make it easier to obtain the converted static graph information.
      • Added the set_verbosity and set_code_level interfaces, letting users set the log level to inspect the dynamic-to-static conversion process or the code of intermediate conversion states.
      • Added InputSpec to specify the shape and data type of input Tensor variables for dynamic-to-static conversion.
      • Improved the error message when dynamic-to-static execution fails: errors in the converted static graph code are reported against the corresponding line of the original dynamic graph code, and the dynamic-to-static frames are removed from the Python stack so that the message relates more closely to user code.
      • Dynamic-to-static conversion supports breakpoint debugging with pdb.set_trace().
    • Optimized the deployment model saving and loading interfaces
      • Added the paddle.jit.save interface for saving dynamic-to-static models, making the interface easier to use; removed the old interface ProgramTranslator.save_inference_model.
      • Added the paddle.jit.load interface for loading prediction models stored in static graph format, including models saved by paddle.jit.save and paddle.io.save_inference_model; after loading, the models can be used for inference or for training fine-tuning under dynamic graphs.

    Mixed Precision Training

    • Added mixed precision support for dynamic graphs; with mixed precision, ResNet-50 on V100 achieves a 2.6x speedup over fp32 training.

    Quantization Training

    • Added the ImperativeQuantAware class to provide dynamic graph quantization-aware training; it currently supports quantization of layers such as Conv2D and Linear, and the supported model types include MobileNetV1/MobileNetV2/ResNet50.
    • After dynamic graph quantization training, quantized models saved with the ImperativeQuantAware.save_quantized_model interface can be deployed for inference with the Paddle-Lite inference library.
    • Static graph quantization supports Conv2d_transpose quantization and per-channel quantization for Linear.

    Performance Optimization (Including Distributed)

    • Simplified the underlying DataLoader implementation in dynamic graph mode, reducing reader thread overhead and further improving data reading efficiency and overall model training speed. In tests, MobileNetV1 training on a single V100 card with BatchSize=128 is 34% faster overall.
    • Upgraded and optimized the dynamic graph networking APIs; a large number of dynamic graph APIs now call auto-generated Pybind interfaces directly, improving performance.

    Basic Functions of the Dynamic Graph

    • Supported configuring APIs such as Embedding to use sparse parameter gradient updates in multi-card training.
    • Added more than 120 Tensor member functions, including Tensor().abs(), Tensor().add(), and Tensor().cos().
    • Added the dir() interface for Layer to conveniently view the attributes and functions of a Layer.
    • ⚡️ Added the optimizer.set_lr() interface so users can flexibly adjust the learning rate in dynamic graph mode.
    • Added the set_global_initializer interface to define the global parameter initialization method.
    • Added oneDNN (formerly MKL-DNN) support for dynamic training and inference; ResNet50 oneDNN dynamic training can be used (MNIST dataset).

    Debugging and Analysis

    • Unified the roughly 100 places in the framework that threw exceptions with LOG(FATAL) to use PADDLE_THROW instead, improving the format and content of errors caused by behaviors the framework does not support.
    • 🚦 Improved the Signal Handler implementation in the framework, optimizing the error format and content when a system signal error is encountered during execution.
    • Optimized the framework's error stack format, moving the Python error stack generated at compile time below the native error stack for a better error-reading experience.
    • Further improved the error types and messages of about 1,300 internal checks, improving the framework's overall debugging usability.
    • Enhanced dynamic graph error messages; the error messages of the Pybind layer under dynamic graphs are systematically improved, enhancing the user experience.

    🐛 Bug Fixes

    • Fixed the issue that using the add_parameter interface of a dynamic graph Layer could unexpectedly raise AttributeError, and strengthened input checking.
    • Fixed the issue that Tensors of type int8 and uint8 could not be printed properly, so the data can now be output normally.

    Dependency Library Upgrades

    • Upgraded oneDNN (formerly MKL-DNN) from version 1.3 to 1.5.

    Inference

    Paddle Inference

    API

    • ⚠ Fully upgraded the inference C++ APIs; the new APIs are recommended. The old APIs are kept for now but emit a warning when used and are planned to be removed in the future. The new APIs mainly upgrade naming conventions and simplify usage; the important changes include:
      • The C++ interfaces add the paddle_infer namespace, which contains the inference-related interfaces;
      • ZeroCopyTensor is renamed to Tensor and becomes the default input/output representation of the inference interfaces;
      • CreatePaddlePredictor is simplified to CreatePredictor, keeping only support for AnalysisConfig and no longer supporting other Config types;
      • Added service-related utility classes such as PredictorPool, which is convenient when creating multiple predictors.

    Function Upgrades

    • Upgraded the operator version compatibility information registry to support more precise op version information, improving inference compatibility.
    • Added adaptation support for TRT 7.1.
    • Paddle-TensorRT strengthens support for PaddleSlim quantized models, covering detection, classification, segmentation, and other CV tasks.
    • Python-side inference adds support for user-defined OPs.
    • ➕ Added elementwise_add and elementwise_mul INT8 oneDNN (formerly MKL-DNN) kernel support on the CPU side.
    • Improved the usability of testing quantized models on CPU, supporting simultaneous comparative testing of the original model and the quantized model.
    • Added adaptation support for Jetson NX hardware.

    Performance Optimization

    • Added the conv + affine_op pass; on a 6248 machine, MASK-RCNN fp32 single-thread performance improved by 26%.
    • Added the fc + gru pass and the oneDNN (formerly MKL-DNN) GRU fp32 kernel, improving the 4-thread inference speed of the GRU fp32 model by 20% on an Intel Xeon 6248 machine.
    • Added oneDNN inplace support for many OPs (a face-feature fp32 model speeds up by 2%).
    • Optimized the oneDNN LRN op, speeding up the GoogleNet fp32 model by 1%.
    • Upgraded the transformation and optimization of quantized models.
    • Optimized the CUDA ArgMin and ArgMax OPs, reducing the binary size of these OPs from 60 MB to 1.3 MB.

    🐛 Bug Fixes

    • Fixed the mask-rcnn inference error on CPU.
    • Fixed errors in CPU multi-threaded quantized model inference.

    🚀 2.0-beta Release Note

    ⚡️ Important Update

    🐎 This version is the beta version of PaddlePaddle Framework v2.0. The most important change is the full upgrade of the API system and the comprehensive improvement on the imperative programming (dynamic graph) capability. This version systematically optimizes the directory structure of PaddlePaddle basic APIs, comprehensively fixes relevant issues left over from the past, fully supplements APIs, and especially provides the better high-level API functions. It also provides support for the quantization training and mixed precision training under a dynamic graph. Perfect syntax support is implemented in the dynamic-to-static conversion. The usability is improved substantially. Dynamic graph-related functions tend to be perfect. In addition, the C++ APIs for the inference library are upgraded and optimized. Both the support of the inference library for quantized models and the inference performance are fully enhanced.

    Training Framework

    Basic APIs

    Compatibility Description

    For Paddle 2.x, users are recommended to use the APIs in the paddle root directory, while all the APIs of Paddle 1.x are kept in the paddle.fluid directory. By design, code written for Paddle 1.x training runs on Paddle 2.x without any modification, and models saved by Paddle 1.x training can be used for inference with Paddle 2.x.

    Directory Structure Adjustment

    • ✅ Based on the 2.0-alpha version, this version has made some adjustments to the directory structure. The latest adjusted directory structure is as follows:
    Directory Functions and Included APIs
    paddle.* The aliases of commonly used APIs are reserved in the paddle root directory, which currently include all the APIs in the paddle.tensor and paddle.framework directories
    paddle.tensor APIs related to tensor operations such as creating zeros, matrix operation matmul, transforming concat, computing add, and finding argmax
    paddle.nn Networking-related APIs such as Linear, Conv2d, loss function, convolution, LSTM,and activation function
    paddle.static.nn Special APIs for networking under a static graph such as input placeholder data, fully connection fc and control flow while_loop/cond
    paddle.static APIs related to the basic framework under a static graph such as Variable, Program, and Executor
    paddle.framework Universal APIs and imperative mode APIs such as to_tensor
    ⚡️ paddle.optimizer APIs related to optimization algorithms such as SGD, Adagrad, and Adam
    ⚡️ paddle.optimizer.lr_scheduler APIs related to learning rate scheduling
    paddle.metric APIs related to evaluation index computation such as accuracy and auc
    paddle.io APIs related to data input and output such as Dataset, and DataLoader
    paddle.device APIs related to device management such as CPUPlace and CUDAPlace
    paddle.distributed Distributed related basic APIs
    paddle.distributed.fleet Distributed related high-level APIs
    paddle.vision Vision domain APIs such as datasets, data processing, and commonly used basic network structures like resnet
    paddle.text NLP domain APIs such as datasets, data processing, and commonly used basic network structures like transformer

    API Alias Rules

    • For the convenience of users, APIs will create aliases in different paths, such as paddle.add -> paddle.tensor.add. Users are recommended to use the shorter path paddle.add.
    • All the APIs in the framework and tensor directories are aliased in the paddle root directory. Except for a few special APIs, all other APIs have no aliases in the paddle root directory.
    • All the APIs in the paddle.nn directory, except those in the functional directory, have aliases in the paddle.nn directory. All the APIs in the functional directory have no aliases in the paddle.nn directory.
    • The following are some special alias relations. It is recommended to use the names on the left.
      • paddle.sigmoid -> paddle.tensor.sigmoid -> paddle.nn.functional.sigmoid
      • paddle.tanh -> paddle.tensor.tanh -> paddle.nn.functional.tanh
      • paddle.remainder -> paddle.mod -> paddle.floor_mod
      • paddle.divide -> paddle.true_divide
      • paddle.rand -> paddle.uniform
      • paddle.randn -> paddle.standard_normal
      • Optimizer.clear_grad -> Optimizer.clear_gradients
      • Optimizer.set_state_dict -> Optimizer.set_dict
      • Optimizer.get_lr -> Optimizer.current_step_lr
      • Layer.clear_grad -> Layer.clear_gradients
      • Layer.set_state_dict -> Layer.set_dict

    Name Change of Commonly Used APIs

    • This version uses Tensor to represent data; the tensor creation API paddle.fluid.dygraph.to_variable is renamed paddle.to_tensor
    • ➕ Addition, subtraction, multiplication, and division use full names only
    • For the current element-by-element operation, no elementwise prefix is added
    • For operating by a certain axis, no reduce prefix is added
    • 🛠 For Conv, Pool, Dropout, BatchNorm and Pad networking APIs, 1d, 2d, and 3d suffixes are added according to the input data type
    Paddle 1.8 Paddle 2.0-beta
    paddle.fluid.layers.elementwise_add paddle.add
    paddle.fluid.layers.elementwise_sub paddle.subtract
    paddle.fluid.layers.elementwise_mul paddle.multiply
    paddle.fluid.layers.elementwise_div paddle.divide
    paddle.fluid.layers.elementwise_max paddle.maximum
    paddle.fluid.layers.elementwise_min paddle.minimum
    paddle.fluid.layers.reduce_sum paddle.sum
    paddle.fluid.layers.reduce_prod paddle.prod
    paddle.fluid.layers.reduce_max paddle.max
    paddle.fluid.layers.reduce_min paddle.min
    paddle.fluid.layers.reduce_all paddle.all
    paddle.fluid.layers.reduce_any paddle.any
    paddle.fluid.dygraph.Conv2D paddle.nn.Conv2d
    paddle.fluid.dygraph.Conv2DTranspose paddle.nn.ConvTranspose2d
    paddle.fluid.dygraph.Pool2D paddle.nn.MaxPool2d, paddle.nn.AvgPool2d

    ➕ Added APIs

    • ➕ Added a total of 140 APIs. See Link and the API document
      • Added environment setting APIs: paddle.set_default_dtype, paddle.get_default_dtype, paddle.set_device, paddle.get_device, paddle.manual_seed
      • Added tensor operation APIs: numel, chunk, masked_select, isfinite, isinf, isnan, sort, topk, Flatten, dim, tile
      • Added networking APIs: Linear, Bilinear, Embedding, linear, bilinear, embedding
      • Added vision networking APIs: Conv1d, ConvTranspose1d, MaxPool1d, MaxPool2d, MaxPool3d, AvgPool1d, AvgPool2d, AvgPool3d, AdaptiveMaxPool1d, AdaptiveMaxPool2d, AdaptiveMaxPool3d, ReflectionPad1d, ReflectionPad2d, ReflectionPad3d, ReplicationPad1d, ReplicationPad2d, ReplicationPad3d, ZeroPad2d, ConstantPad1d, ConstantPad2d, ConstantPad3d, PixelShuffle, Upsample, UpsamplingNearest2d, UpsamplingBilinear2d, conv1d, conv_transpose1d, avg_pool1d, avg_pool2d, avg_pool3d, max_pool1d, max_pool2d, max_pool3d, adaptive_max_pool1d, adaptive_max_pool2d, adaptive_max_pool3d, adaptive_avg_pool1d, adaptive_avg_pool3d
      • Added text processing networking APIs: SimpleRNN, LSTM, GRU, MultiHeadAttention, Transformer, TransformerEncoder, TransformerEncoderLayer, TransformerDecoder, TransformerDecoderLayer
      • Added activation APIs: ELU, Hardshrink, Hardtanh, PReLU, ReLU6, Tanh, Tanhshrink, Softmax
      • Added normalization APIs: BatchNorm1d, BatchNorm2d, BatchNorm3d, SyncBatchNorm, InstanceNorm1d, InstanceNorm2d, InstanceNorm3d, weight_norm, remove_weight_norm, batch_norm, instance_norm, layer_norm, normalize
      • Added dropout APIs: Dropout2d, Dropout3d, AlphaDropout, dropout, dropout2d, dropout3d
      • Added similarity and loss function APIs: CosineSimilarity, PairwiseDistance, CTCLoss, KLDivLoss, BCEWithLogitsLoss, MarginRankingLoss, SmoothL1Loss, cosine_similarity, binary_cross_entropy, binary_cross_entropy_with_logits, cross_entropy, ctc_loss, l1_loss, mse_loss, margin_ranking_loss, nll_loss, smooth_l1_loss
      • Added distributed communication APIs: broadcast, all_reduce, reduce, all_gather, scatter, barrier
      • Added probability distribution APIs: Distribution, normal, bernoulli
      • Added optimizer-related APIs: step, AdamW
      • Added dataset-related APIs: Dataset, IterableDataset, TensorDataset, Sampler, RandomSampler, BatchSampler, DistributedBatchSampler

    🛠 Fixing and Improving APIs

    • ⬆️ Modified and improved a total of 155 APIs. See Link and the API document
    • 🛠 Fixed APIs related to random number generation including: seed setting paddle.rand, randn, randint, randperm, dropout, Uniform, and Normal
    • Upgraded the codes of the underlying C++ operators corresponding to the following APIs to theoretically achieve compatibility without excluding slight incompatibility: linspace, concat, gather, gather_nd, split, squeeze, unsqueeze, clip, argmax, argmin, mean, norm, unique, cumsum, LeakyReLU, leaky_relu, hardshrink, embedding, margin_ranking_loss, grid_sample, affine_grid
    • ➕ Added oneDNN support for the relu6 and Sigmoid activation functions

    Multi-device/Distributed Training APIs

    Single-Machine Multi-Card Training Under a Dynamic Graph

    • Added paddle.distributed.spawn(func, args=(), nprocs=-1, join=True, daemon=False, **options),which is used to start multi-card training under a dynamic graph.
    • Added paddle.distributed.init_parallel_env(), which is used to initialize the environment of multi-card training under a dynamic graph.
    • Added paddle.distributed.get_rank(), which is used to get the rank of the current process during the multi-card training.
    • Added paddle.distributed.get_world_size(), which is used to get the total number of processes participating in training during the multi-card training.

    Distributed Collective Communication

    • Added paddle.distributed.broadcast(tensor, src, group=0), which broadcasts a tensor of a specified process to all the processes.
    • Added paddle.distributed.all_reduce(tensor, op=ReduceOp.SUM, group=0), which performs the reduce operation on specified tensors of all the processes and returns results to all the processes.
    • Added paddle.distributed.reduce(tensor, dst, op=ReduceOp.SUM, group=0), which performs the reduce operation on specified tensors of all the processes and returns results to specified processes.
    • Added paddle.distributed.all_gather(tensor_list, tensor, group=0), which gathers specified tensors of all the processes and returns results to all the processes.
    • Added paddle.distributed.scatter(tensor, tensor_list=None, src=0, group=0), which distributes tensors in a specified tensor list to all the processes.
    • Added paddle.distributed.barrier(group=0),which synchronizes all the processes.

    High-level APIs

    • ➕ Added PaddlePaddle high-level APIs to encapsulate common operations such as networking, training, evaluation, inference, and access so as to implement low code development. In the MNIST handwritten digit recognition task versus the imperative programming implementation mode, high-level APIs can reduce 80% of executable codes.
    • Data Management
      • Unified data loading and usage method
      • Dataset definition, which is implemented by inheriting paddle.io.Dataset.
      • Multi-process data loading using paddle.io.DataLoader.
      • Added paddle.io.IterableDataset, which is used for a streaming dataset and supports its concurrent acceleration in paddle.io.DataLoader.
      • Added paddle.io.get_worker_info for dividing child process data in paddle.io.IterableDataset.
    • Model Networking
      • Added the encapsulation of the common loss API paddle.nn.loss.* and metric API paddle.metric.*
      • Released 12 models based on high-level API implementations, including Transformer, Seq2seq, LAC, BMN, ResNet, YOLOv3, VGG, MobileNet, TSM, CycleGAN, Bert, OCR. The code can be found in PaddlePaddle/hapi examples.
    • Model Execution
      • Added class API paddle.Model, which encapsulates the common model development methods (a usage sketch follows this section):
      • API Model.summary to view the network structure and the number of parameters of the dynamic graph networking.
      • API Model.prepare to specify a loss function and an optimization algorithm.
      • API Model.fit to implement training and evaluation, which can implement the execution of user-defined functions such as model storage by callback.
      • API Model.evaluate to implement the computation of inference and evaluation indexes on the evaluation set.
      • API Model.predict to implement specific test data inference.
      • API Model.train_batch to implement training on a single batch of data.
      • API Model.eval_batch to implement evaluation on a single batch of data.
      • API Model.test_batch to implement testing on a single batch of data.
      • API Model.save/Model.load, which support storing an inference model in dynamic graph training mode.
      • Added callback API paddle.callbacks.* as a model execution API, which performs logging and Checkpoint model saving, etc. Users can customize a callback by inheriting paddle.callbacks.Callback.
    • Domain APIs
      • Added computer vision (CV) APIs paddle.vision
      • Added dataset API paddle.vision.datasets.*, which encapsulates common public datasets and supports random access to data.
      • Added 24 common data preprocessing APIs paddle.vision.transforms.* such as Resize, Normalize, etc.
      • Added image classification backbone network and pre-training parameters:
        • paddle.vision.models.lenet or paddle.vision.lenet
        • paddle.vision.models.vgg or paddle.vision.vgg
        • paddle.vision.models.resnet or paddle.vision.resnet
        • paddle.vision.models.mobilenetv1 or paddle.vision.mobilenetv1
        • paddle.vision.models.mobilenetv2 or paddle.vision.mobilenetv2
      • Added natural language processing (NLP) APIs paddle.text.
      • Added dataset API paddle.text.datasets.*, which encapsulates commonly-used datasets and supports random access to data.
      • Added networking API paddle.text.*.
    • Automatic Breakpoint Restart
      • Added API train_epoch_range, which implements the epoch-level checkpoint autosave and autoloading functions on a static graph and supports automatic breakpoint restart.
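    A minimal sketch of the high-level API workflow described above (a custom paddle.io.Dataset plus Model.prepare/fit/evaluate/save). The random data, toy network, and output path are placeholders, and argument names follow the 2.0 documentation and may differ slightly in this rc:

    ```python
    import numpy as np
    import paddle
    from paddle.io import Dataset

    class RandomDataset(Dataset):
        """Toy dataset implemented by inheriting paddle.io.Dataset."""
        def __init__(self, num_samples=256):
            self.num_samples = num_samples

        def __getitem__(self, idx):
            image = np.random.random([784]).astype("float32")
            label = np.random.randint(0, 10, [1]).astype("int64")
            return image, label

        def __len__(self):
            return self.num_samples

    net = paddle.nn.Sequential(
        paddle.nn.Linear(784, 128), paddle.nn.ReLU(), paddle.nn.Linear(128, 10))

    model = paddle.Model(net)
    model.prepare(optimizer=paddle.optimizer.Adam(parameters=model.parameters()),
                  loss=paddle.nn.CrossEntropyLoss(),
                  metrics=paddle.metric.Accuracy())

    train_data, eval_data = RandomDataset(256), RandomDataset(64)
    model.fit(train_data, eval_data, batch_size=32, epochs=1, verbose=1)
    model.evaluate(eval_data, batch_size=32)
    model.save("output/toy_model")   # pass training=False to save in inference format
    ```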

    Function Optimization (Including Distributed)

    Dynamic Graph to Static Graph

    • Added Syntax Support for ProgramTranslator
      • Added dynamic-to-static support for the return syntax, so that return statements can appear early, return tensors of different types, or return None inside if-elif-else branches and loop bodies after conversion.
      • Added dynamic-to-static support for the print syntax, so that print(tensor) also prints the tensor contents after conversion.
      • Added dynamic-to-static support for the “for traversing a tensor”, “for traversing a tensor using enumeration”, “for traversing a TensorList”, and “for traversing a TensorList using enumeration” syntaxes, so that loop-based processing of tensors can be used flexibly after conversion.
      • Added dynamic-to-static support for the assert syntax, so that asserting a tensor works when it is True (bool type) or non-zero (other data types).
      • Added support for data type casts, so that dynamic graph type conversion statements such as float(tensor) and int(tensor) are also converted correctly in the static graph.
    • ProgramTranslator Usability Optimizations
      • Changed the dynamic-to-static return type from callable to the class StaticLayer, which makes it easier to obtain the converted static graph information by calling .code, .main_program, and other APIs.
      • Added the set_verbosity and set_code_level APIs, so that users can set the log level to view logs of the dynamic-to-static process or the intermediate converted code.
      • Added InputSpec to specify the shape and data type of an input tensor variable.
      • Improved error messages for dynamic-to-static execution: runtime errors raised in the converted static graph are now mapped back to the original source line in the dynamic graph code, and some dynamic-to-static frames are removed from the Python stack so that error messages focus on user code.
      • Supported setting breakpoints with pdb.set_trace() during the dynamic-to-static conversion for debugging.
    • 🚀 Optimized Deployment of Model Storage and Loading APIs
      • Added the paddle.jit.save API for saving dynamic-to-static models, which is easier to use; removed the old API ProgramTranslator.save_inference_model.
      • Added the paddle.jit.load API for loading inference models, including models saved by paddle.jit.save and paddle.io.save_inference_model. After loading, the model can be used for inference or for further training optimization in a dynamic graph (see the sketch below).
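    A minimal sketch of the dynamic-to-static save/load workflow (paddle.jit.to_static with InputSpec, then paddle.jit.save/paddle.jit.load). The toy network and paths are placeholders, and exact argument names may differ slightly in this rc:

    ```python
    import paddle
    from paddle.static import InputSpec

    class Net(paddle.nn.Layer):
        def __init__(self):
            super(Net, self).__init__()
            self.fc = paddle.nn.Linear(10, 3)

        # Declare the input shape/dtype so the converted static graph can be saved
        @paddle.jit.to_static(input_spec=[InputSpec(shape=[None, 10], dtype="float32")])
        def forward(self, x):
            return self.fc(x)

    net = Net()
    paddle.jit.save(net, "saved/net")        # saves the converted static graph model

    loaded = paddle.jit.load("saved/net")    # reload for inference or further training
    out = loaded(paddle.rand([4, 10]))
    print(out.shape)
    ```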

    Mixed Precision Training

    • ➕ Added support for mixed precision training of dynamic graphs. Training the ResNet-50 model on a V100 with mixed precision is about 2.6x faster than with fp32 (a usage sketch follows).
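    A minimal sketch of dynamic graph mixed precision training with paddle.amp.auto_cast and paddle.amp.GradScaler; the toy network and data are placeholders, and a GPU is required to actually benefit from fp16:

    ```python
    import paddle

    net = paddle.nn.Linear(10, 10)
    opt = paddle.optimizer.SGD(learning_rate=0.01, parameters=net.parameters())
    scaler = paddle.amp.GradScaler(init_loss_scaling=1024)

    for _ in range(10):
        x = paddle.rand([4, 10])
        with paddle.amp.auto_cast():      # run the forward pass in mixed precision
            loss = net(x).mean()
        scaled = scaler.scale(loss)       # scale the loss to avoid fp16 underflow
        scaled.backward()
        scaler.minimize(opt, scaled)      # unscale gradients and apply the update
        opt.clear_grad()
    ```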

    Quantization Training

    • ➕ Added the ImperativeQuantAware class, which provides quantization-aware training for dynamic graphs. Quantization of Conv2D, Linear, and other layers is currently supported; supported model types include MobileNetV1/MobileNetV2/ResNet50.
    • After quantization-aware training, a quantized model saved with the ImperativeQuantAware.save_quantized_model API can be deployed for inference with the Paddle-Lite inference library (a rough sketch follows).
    • 👍 For static graph quantization, quantization of Conv2d_transpose and per-channel quantization of Linear are supported.
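    A rough sketch of the dynamic graph quantization-aware training flow described above. The import path, the quantize() helper, and the model constructor are assumptions based on the 2.0-rc code layout and should be checked against the installed version; only save_quantized_model is named in these notes:

    ```python
    import paddle
    # NOTE: import path is an assumption for this rc; it may differ in later releases
    from paddle.fluid.contrib.slim.quantization import ImperativeQuantAware

    net = paddle.vision.models.mobilenet_v1(num_classes=10)   # placeholder model

    quanter = ImperativeQuantAware()
    quanter.quantize(net)    # assumed helper: inserts fake-quant logic into Conv2D/Linear layers

    # ... run normal dynamic graph training on `net` here ...

    # Save a quantized inference model that can be deployed with Paddle-Lite
    quanter.save_quantized_model(
        net, "quant/mobilenet",
        input_spec=[paddle.static.InputSpec(shape=[None, 3, 224, 224], dtype="float32")])
    ```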

    🐎 Performance Optimization (Including Distributed)

    • Simplified the underlying DataLoader implementation logic in dynamic graph mode, reducing thread reading overhead and further improving data reading efficiency and the overall model training speed. The overall training speed of MobileNetV1 on a single V100 card with BatchSize = 128 is increased by 34%.
    • 🐎 Upgraded and optimized dynamic graph network construction: a large number of dynamic graph APIs now directly call automatically generated Pybind APIs, improving performance.

    Basic Functions for Dynamic Graph

    • 👌 Supported sparse parameter gradient updates, which can be enabled by configuring embedding and other APIs.
    • ➕ Added over 120 member functions of Tensor type, including Tensor().abs(), Tensor().add(), and Tensor().cos().
    • ➕ Added dir() API for a layer to facilitate viewing the attributes and functions in the layer.
    • ➕ Added an optimizer.set_lr() API so that users can flexibly adjust a learning rate in dynamic diagram mode.
    • Added a global parameter initialization method API set_global_initializer to define a global parameter initialization method.
    • ➕ Added oneDNN (former MKL-DNN) support for dynamic graph training and inference. ResNet50 oneDNN dynamic training with the MNIST dataset is enabled.
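    A minimal sketch of the dynamic graph conveniences listed above (Tensor member functions, dir() on a layer, optimizer.set_lr()); the tensors and layer are placeholders:

    ```python
    import paddle

    x = paddle.to_tensor([-1.0, 2.0, -3.0])
    print(x.abs().numpy(), x.cos().numpy())   # Tensor member functions added in this release

    layer = paddle.nn.Linear(4, 2)
    # dir() on a layer lists its attributes and functions
    print([name for name in dir(layer) if not name.startswith("_")][:10])

    opt = paddle.optimizer.Adam(parameters=layer.parameters())
    opt.set_lr(0.0005)                        # adjust the learning rate on the fly
    print(opt.get_lr())
    ```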

    Debugging Analysis

    • ⚡️ Uniformly changed roughly 100 places that threw exceptions via LOG(FATAL) to use PADDLE_THROW; improved the format and content of the errors reported when the framework does not support a behavior.
    • 👌 Improved the Signal Handler implementation within the framework; improved the format and content of the errors reported when a system signal error occurs during execution.
    • ⚡️ Optimized the framework error stack format: the Python error stack produced during compilation is now placed below the native (C++) error stack to improve readability.
    • ✨ Further improved about 1,300 error messages (error types and wording) for checks within the framework, enhancing the overall debugging usability of the framework.
    • ✨ Enhanced dynamic graph error messages: error messages on the Pybind layer under a dynamic graph are systematically improved to enhance user experience.

    🐛 Bug Fixing

    • 🛠 Fixed the problem that an AttributeError could unexpectedly occur when the add_parameter API is used on a layer under a dynamic graph; enhanced the input check.
    • Fixed the problem that tensors of int8 and uint8 types could not be printed normally, so that the data is now output correctly.

    ⬆️ Dependency Library Upgrading

    • ⬆️ Upgraded oneDNN (former MKL-DNN) to Version 1.5 from Version 1.3.

    Inference

    Paddle Inference

    API

    • ⬆️ Fully upgraded the inference C++ APIs. The new version of the APIs is recommended; the original APIs are tentatively retained but emit a warning when used and are planned to be removed in a future release. The upgrade mainly involves naming standardization and usage simplification. The important changes include (a Python-side usage sketch follows this list):
      • adding the paddle_infer namespace for the C++ APIs, containing the inference-related APIs.
      • renaming ZeroCopyTensor to Tensor as the default input/output representation for the inference APIs.
      • simplifying CreatePaddlePredictor to CreatePredictor, retaining support only for AnalysisConfig and dropping support for the other Config types.
      • adding service-related utility classes such as PredictorPool, which can be used when multiple predictors are created.
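    The Python inference API mirrors the renamed C++ interface. A rough sketch assuming a model exported by paddle.jit.save; the file names are placeholders, and the exact paddle.inference module layout should be checked against the 2.0-rc documentation:

    ```python
    import numpy as np
    import paddle.inference as paddle_infer

    # Model/params files are placeholders for a previously saved inference model
    config = paddle_infer.Config("saved/net.pdmodel", "saved/net.pdiparams")
    predictor = paddle_infer.create_predictor(config)   # replaces CreatePaddlePredictor

    # Tensor (formerly ZeroCopyTensor) is the input/output representation
    input_name = predictor.get_input_names()[0]
    input_handle = predictor.get_input_handle(input_name)
    input_handle.copy_from_cpu(np.random.rand(1, 10).astype("float32"))

    predictor.run()

    output_name = predictor.get_output_names()[0]
    output = predictor.get_output_handle(output_name).copy_to_cpu()
    print(output.shape)
    ```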

    ⬆️ Functional Upgrading

    • ⬆️ Upgraded the operator version compatibility information registry to support more accurate Op version information and improve inferential compatibility.
    • ➕ Added adaptation support for TensorRT 7.1.
    • 👍 Enhanced Paddle-TensorRT support for PaddleSlim quantized models, covering multiple CV tasks such as detection, classification, and segmentation.
    • ➕ Added the support for user-defined operators for Python-side inference.
    • ➕ Added the kernel support for elementwise_add and elementwise_mul INT8 oneDNN (former MKL-DNN) on the CPU side.
    • 👌 Improved the usability of testing quantized models on the CPU side; a side-by-side comparison test of the original model and the quantized model is supported.
    • ➕ Added adaptation support for Jetson NX hardware.

    🐎 Performance optimization

    • ➕ Added the conv + affine_op fuse pass; Mask-RCNN single-thread performance is improved by 26% (1.26x) on an Intel Xeon 6248 machine
    • ➕ Added the fc + gru fuse pass and enabled the oneDNN GRU fp32 kernel, speeding up GRU fp32 model inference on 4 CPU threads by 20% (1.2x) on an Intel Xeon 6248 machine
    • ➕ Added oneDNN in-place execution support for many operators (2% speedup for the Feature model)
    • ⚡️ Optimized the LRN operator (1% speedup for GoogLeNet)
    • 👌 Improved the transformation and optimization of quantized models
    • ⚡️ Optimized the CUDA ArgMin and ArgMax operators so that the operator binary size is reduced from 60 MB to 1.3 MB.

    🐛 Bug Fixing

    • 🛠 Fixed a Mask-RCNN inference error under CPU inference
    • 🛠 Fixed CPU multi-thread inference for oneDNN quantized INT8 models