Changelog History

  • v1.4.1 Changes

    April 23, 2019

    🚀 Release Notes

    🐛 Bug fixes

    🛠 Fixed incorrect checks on variable-length shapes in the InferShape of some ops

    🛠 Added support in some ops for LoD sequence inputs of length zero

    🛠 Fixed bugs in the recurrent op used to implement StaticRNN

    🛠 Fixed bugs when saving and loading dygraph (dynamic graph) model checkpoints

  • v1.4.0 Changes

    April 23, 2019

    🚀 Release Notes

    Table of Contents

    • Highlights
    • ⚡️ Fundamental framework updates
      • Installation
      • Optimization on Intermediate Representation IR and Pass
      • IO optimization
      • Execution optimization
      • GPU memory optimization
      • Refine CPU JITKernel
      • Low-level Intel CPU computing optimization
      • Intel nGraph graph compiling engine integration
      • Adjustments to basic framework functionality
      • Basic functions completed in the preview version of dynamic graph
    • Inference engine
      • Server-side Inference Engine
      • Mobile Inference Engine
      • Deployment tools
    • Distributed training
    • Model construction
      • PaddleCV Intelligent Vision
      • PaddleNLP intelligent text processing
      • PaddleRec intelligent recommendation
    • Tools and Components
    • 🐛 Bug fixes

    Highlights

    • 👍 The fundamental framework has been comprehensively optimized for training speed and GPU memory usage. Quantization training is now fully supported, Intel nGraph is preliminarily integrated, and the basic single-machine, single-card functions of the dynamic graph preview are largely complete.
    • 🚀 The model compression toolkit PaddleSlim and the model inference service Paddle Serving are officially released, broadly strengthening PaddlePaddle's deployment capabilities.
    • 🔀 Distributed IO is optimized, and streaming reads from remote file systems are added. Synchronous multi-machine, multi-card GPU training gains sparse communication, which makes training less sensitive to bandwidth: on low-bandwidth networks such as 10G networks, synchronous training can be 10 times faster.
    • 👌 The K8S ecosystem is better supported: Paddle-K8S-Operator is provided for industrial production environments, and Kubeflow supports paddle-job.
    • 🚀 The video classification toolkit is officially released. It covers mainstream video classification models, including Non-Local, TSM, Attention Cluster, NeXtVLAD, Attention LSTM, StNet, and TSN.
    • 👍 ERNIE, a Chinese semantic representation model, is added. Its accuracy is 1 to 2 percentage points (absolute) higher than BERT's on multiple Chinese tasks. DGU, a general dialogue understanding model, is also added; it supports 5 types of dialogue tasks and reaches SOTA results on 3 public datasets.
    • A recommendation model based on Graph Neural Networks (GNN) is added, with benchmark results provided on public datasets.
    • 🚀 PaddleHub, a management tool for pre-trained models, is officially released. It offers three functions: pre-trained model management, one-command use from the command line, and transfer learning. It aims to help users manage models and carry out transfer learning more efficiently.
    • 🚀 AutoDL Design, for automated network architecture design, is officially open-sourced.
    • ⬆️ PARL is upgraded to 1.1 with a focus on parallelism: a single decorator is enough to implement a parallel reinforcement learning algorithm.
    • The model conversion tool X2Paddle is officially released. It lets users migrate inference models from other deep learning frameworks to PaddlePaddle without loss.

    ⚡️ Fundamental Framework Updates

    • Installation
      • The install_check.run_check() interface is added to check more thoroughly whether the installation succeeded.
    • Optimization on Intermediate Representation IR and Pass
      • The encapsulation of IrGraph, IrNode, IrVarNode, and IrOpNode is complete, and IR Passes can now be written in Python.
    • IO optimization
      • PyReader optimization: the new interface reader = fluid.io.PyReader(..., iterable=True, ...) creates a reader that can be iterated over with a for loop, and the data is passed to the network through feed.
    • Execution optimization
      • The places parameter of with_data_parallel can be set to run on specific GPU cards, which allows a single process to execute multiple training tasks.
      • The scheduling strategy of the multi-card executor is optimized; measured speed-ups on the ResNet50 and Transformer models are 8%~19%.
      • In multi-card setups, AllReduce operations can be fused by group. Multi-card training of the ResNet model is 8%~30% faster (the gain varies with the number of cards), and multi-card training of the Transformer model is about 4% faster.
    • GPU memory optimization
      • GC strategy optimization: the Eager Deletion strategy now deletes variables inside while_op promptly. Partial Eager Deletion is also supported: users can set FLAGS_memory_fraction_of_eager_deletion=0.xx to control the fraction of memory/GPU memory that is deleted immediately.
      • Op optimization: the backward registration mechanism of ops such as cross entropy, expand, layer_norm, and dropout is optimized to remove irrelevant variable dependencies, which improves GPU memory performance.
      • Two new FLAGS, FLAGS_initial_gpu_memory_in_mb and FLAGS_reallocate_gpu_memory_in_mb, let users specify the initial capacity of the GPU memory pool and the capacity of each reallocation.
      • The inplace_op_pass strategy is adjusted to increase the coverage of inplace optimization.
      • The logic that performed inplace optimization of activation ops on the Python side is removed and unified into inplace_op_pass.
      • A Memory Profile function is added.
    • Refine CPU JITKernel
      • The way JITKernel is called is optimized: a cache mechanism and an interface for getting all functions of the same type are added, so developers can choose what to call according to the situation.
      • SGD is optimized with JITKernel: in the PyramidDNN model the corresponding OP is 44% faster and overall training is 12% faster. fused_embedding_seq_pool is also optimized with JITKernel: in the PyramidDNN model the backward operator of the corresponding op is 18% faster and overall training is 6% faster.
    • Low-level Intel CPU computing optimization
      • MKLDNN is upgraded to v0.18, which includes several performance improvements (e.g. GEMM-based convolution and INT8 convolution).
      • The GELU OP is optimized with MKL; its performance is 3 times the previous level.
      • Unit tests for MKLDNN-related kernels are strengthened.
    • 👍 The Intel nGraph graph compiling engine is integrated, making it easier for PaddlePaddle to support more hardware backends
      • Subgraphs are handed to the nGraph core through the ngraph_engine OP, optimized at the graph level, and then scheduled to run on the CPU. nGraph can be enabled at runtime by setting the environment variable FLAGS_use_ngraph=true.
      • Training and inference of the ResNet50 model on CPU are supported. Compared with direct optimization based on MKLDNN, both inference and training performance of ResNet50 on CPU improve significantly.
    • Adjustments to basic framework functionality
      • Synchronized Batch Norm is supported; softmax accepts an axis setting; new operations are added: spectral norm, range, acos, asin, and atanh; Npair Loss is added for feature learning.
      • The cosine_decay learning rate schedule is added.
      • sampled_softmax_with_cross_entropy is added to improve training efficiency with large vocabularies.
      • The SGD and Adam optimization algorithms support fusing, which speeds up the Transformer model by 2% and the Cycle GAN model by 6%.
      • lstmp is enhanced to support clipping inside the cell and initializing the cell state and hidden state.
      • adagrad is enhanced to support initializing the cumulative momentum.
      • Tensors can be operated on through __getitem__.
      • QuantizationFreezePass, ConvertToInt8Pass, and TransformForMobilePass are added. Both dynamic and static quantization training, together with saving the corresponding models, are fully supported.
    • Basic functions completed in the preview version of dynamic graph
      • Basic functions: LRDecay is supported, along with model training and evaluation on a single GPU card and on a single-machine CPU.
      • API: the basic dynamic graph interfaces are made public, the existing Layers are refactored, and Layers such as GRU, LayerNorm, NCE, and PRelu are added.
      • Performance: verified on the ResNet and MNIST models to be essentially on par with the static graph.
      • Dynamic graph implementations of models such as Transformer, MNIST, and SE-ResNeXt are added.
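
The FLAGS named above can be prepared in one place before the framework is imported. The following is a minimal sketch, assuming all four flags are read from the process environment at start-up; the values and the apply_flags helper are illustrative assumptions, not recommended settings or part of PaddlePaddle.

```python
import os

# Illustrative values only (assumptions); tune them for your own hardware.
paddle_flags = {
    "FLAGS_memory_fraction_of_eager_deletion": "0.5",  # delete half of the eligible memory eagerly
    "FLAGS_initial_gpu_memory_in_mb": "2048",          # initial GPU memory pool size
    "FLAGS_reallocate_gpu_memory_in_mb": "1024",       # size of each later reallocation
    "FLAGS_use_ngraph": "true",                        # enable the nGraph engine at runtime
}


def apply_flags(flags, environ=os.environ):
    """Copy flag settings into `environ` without overwriting values already set.

    Hypothetical helper: returns the values that are effective afterwards.
    """
    effective = {}
    for name, value in flags.items():
        if name not in environ:
            environ[name] = value
        effective[name] = environ[name]
    return effective


effective_flags = apply_flags(paddle_flags)
# The framework import (for example `import paddle.fluid as fluid`) would come
# after this point, so that the environment is already prepared.
```

Values that are already exported in the shell take precedence in this sketch, which keeps per-job overrides possible.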

    Inference Engine

    Server-side Inference Engine

    • The inference library integrates PaddlePaddle/Anakin and provides efficient inference through a unified interface
      • Anakin GPU subgraphs and CPU subgraphs are supported.
      • The Python inference interface supports Anakin subgraphs.
      • Inference is significantly accelerated on ResNet, VGG, GoogleNet, MobileNet, ShuffleNet, Faster R-CNN, YOLO, SSD, and other models.
    • Inference framework optimization, with a noticeable speed-up for small models
      • runtime_context_cache_pass is added, speeding up key models by 17%.
      • The infershape of 5 OPs is optimized, speeding up key models by 13%.
      • The ZeroCopy interface is improved to avoid redundant CPU copies when using AnalysisPredictor.
    • INT8 quantized inference continues to be strengthened
      • INT8 quantization through TensorRT is further improved and supports AlexNet, GoogleNet, VGG, MobileNet, ShuffleNet, and other models. Serialization and deserialization of information when calling TensorRT are optimized, which speeds up model initialization.
      • An INT8 quantization framework based on C++ Passes is implemented. Several INT8 OP kernels are added: Transpose, Concat, Requantize. By fine-tuning the quantization strategy in MkldnnQuantizerConfig, users can quickly obtain an INT8 quantized model that meets their accuracy requirements. The INT8 quantized ResNet-50/MobileNet v1 models run at 3.7 times/3.0 times the performance of the original FP32 models (on a Xeon 6271 server supporting the AVX512-DL Boost instruction set).
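
To make the idea of INT8 quantization concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is a generic illustration of the technique, not the scheme used by the TensorRT or MKLDNN paths described above; the function names and sample weights are assumptions.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of a non-empty list of floats.

    Maps each float to an integer in [-127, 127] using one shared scale.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.02, -1.27, 0.635, 0.0]  # made-up sample values
q_weights, scale = quantize_int8(weights)
restored = dequantize_int8(q_weights, scale)

# The round-trip error per element is bounded by about half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The accuracy loss comes from this rounding error, which is why a quantization strategy usually needs tuning per model before the INT8 result meets an accuracy target.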

    Mobile Inference Engine

    • ARM CPU
      • Paddle Mobile has refactored and optimized the matrix operation libraries sgemm and sgemv, giving a performance gain of 10% to over 100% on most models.
      • 19 new operators are added, including while, sequence_expand, sequence_pool, sequence_softmax, gru_unit, beam_search, and beam_search_decode, along with extensive optimization work, to support inference with attention-based end-to-end models.
      • An ARM v8 implementation of winograd is added, which achieves higher inference performance on v8 hardware on iOS. winograd also supports operator fusion, so efficiency is higher after fusion.
      • A direct sliding-window convolution for 3x3 kernels is added; it is more efficient than winograd and gemm when the number of channels is small.
      • Depthwise convolution with 3x3 kernels is refactored and optimized: compared with previous versions, it supports arbitrary padding, performs better, and produces more reliable results.
      • An armv8 implementation of depthwise convolution with 5x5 kernels is completed, improving inference efficiency of NAS models by more than 30%.
      • The efficiency of the deconvolution conv2d_transpose is optimized.
      • A streamlined memory reuse strategy based on graph optimization is added, which reduces memory usage by nearly 50% for most models. It is enabled automatically for ARM CPU (FPGA and GPU are not yet supported).
    • ARM GPU
      • Paddle Mobile has optimized convolution with 1x1 kernels; MobileNet v1 is on average 35% faster at inference on Qualcomm Adreno GPUs.
    • Paddle Inference has preliminarily unified its interfaces with Paddle Mobile and Anakin; deeper integration is still to come.

    🚀 Deployment Tools

    • Model compression toolkit PaddleSlim
      • Pruning strategy: sensitivity-based and uniform pruning are supported, for many network types such as VGG, ResNet, and MobileNet, with user-defined pruning ranges.
      • Quantization training strategy: dynamic and static quantization training are both supported. Parameters can be quantized per channel or as a whole. Models can be saved with float values simulating the int8 value range, with int8 type, or in a format compatible with Paddle Mobile.
      • Distillation strategy: combined losses can be added at any layer of the teacher and student networks. FSP Loss, L2 Loss, and Softmax with Cross-entropy Loss are supported.
      • Other functions: hyperparameters of compression tasks can be managed through configuration files, multiple compression strategies can be combined, and checkpoints are supported during distillation and pruning.
    • Paddle Serving
      • Remote deployment of inference is supported.
      • On the server side, users can add their own data processing Operators and customize the inference logic; hot-loading of models is supported.
      • The client side provides a C++ SDK to be called from business logic. Custom protobuf definitions for the network data transfer protocol and A/B testing are supported.
      • Sample templates are provided for classic tasks with Paddle Serving, including text classification and image classification.
      • Latency and throughput benchmarks are given for the text classification task.

    Distributed training

    • Distributed IO optimization
      • Pipe Reader interface optimization: efficient IO is provided while keeping data preprocessing flexible. Users of enterprise-grade Linux systems can customize it to build high-performance IO components that are maintained in one place as part of offline data preprocessing. Streaming reads from remote file systems are enhanced, with support for loading data into memory and for distributed shuffling.
    • Integration of Executor and distributed IO
      • AsyncExecutor is integrated into Executor, and the train_from_dataset/infer_from_dataset interfaces are added. Training based on Pipe Reader is supported; while keeping multi-queue IO, users can define their own PipeLine programs and process data flexibly on the Python side.
    • 🔀 Bandwidth-insensitive training for synchronous multi-machine, multi-card GPU training
      • Synchronous GPU training gains sparse communication and supports sparse all reduce.
      • Model convergence is guaranteed at the algorithm level by controlling communication sparsity, and DGCOptimizer is added.
      • Experiments with ResNet50 on ImageNet show that: the convergence of ResNet50 over 90 epochs is unchanged; in high-speed interconnect environments, sparse communication does not reduce training speed; and in low-bandwidth environments (such as 10G networks), sparse communication has a clear speed advantage, with synchronous training 10 times faster than with dense communication.
    • Collective Operator mode
      • Collective Operator mode is supported, adding multiple all reduce operations on GPU. Adding collective ops to a Program through the Python API makes developing distributed optimization algorithms much more flexible.
    • Convergence speed optimization for ResNet50 on ImageNet
      • Dynamic BatchSize, dynamic ImageSize, and rectangular crop are supported. With FP32 precision on a single V100 machine with 8 cards, convergence is 68% faster (acc1>=75.9%, acc5=93.0%).
    • 👍 K8S ecosystem support
      • Kubeflow supports paddle-job, which has been contributed to the Kubeflow community.
      • Paddle-K8S-Operator is supported for industrial production environments and can be used together with Kubeflow.
      • Job submission scripts suited to beginners are provided for K8S environments, with a reproducible tutorial on Baidu Cloud.
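
The sparse communication idea can be illustrated with top-k gradient sparsification: each worker sends only the largest-magnitude gradient entries and keeps the rest locally until they grow large enough to be sent. The sketch below is a simplified, single-worker illustration in plain Python. It is not the DGCOptimizer implementation, and the function name and sample gradients are assumptions.

```python
def sparsify_top_k(gradient, residual, k):
    """Select the k largest-magnitude entries of (gradient + residual).

    Returns (sparse_update, new_residual). sparse_update maps an index to the
    value that would be communicated; every other entry stays in new_residual
    and is added to the next step's gradient.
    """
    accumulated = [g + r for g, r in zip(gradient, residual)]
    by_magnitude = sorted(
        range(len(accumulated)), key=lambda i: abs(accumulated[i]), reverse=True
    )
    sparse_update = {i: accumulated[i] for i in by_magnitude[:k]}
    new_residual = [
        0.0 if i in sparse_update else value for i, value in enumerate(accumulated)
    ]
    return sparse_update, new_residual


residual = [0.0] * 5
# Step 1: only the two largest entries (indices 1 and 4) are communicated.
step1, residual = sparsify_top_k([0.1, -2.0, 0.3, 0.05, 1.5], residual, k=2)
# Step 2: the small entries skipped in step 1 have accumulated and are sent now.
step2, residual = sparsify_top_k([0.1, 0.0, 0.3, 0.05, 0.0], residual, k=2)
```

With k much smaller than the gradient size, the volume exchanged per step shrinks accordingly, which is where the advantage on low-bandwidth networks comes from.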

    Model Construction

    • PaddleCV Intelligent Vision
      • The video classification toolkit is officially released. It covers mainstream video classification models, including Non-Local, TSM, Attention Cluster, NeXtVLAD, Attention LSTM, StNet, and TSN, with results on par with mainstream implementations.
      • New ImageNet pre-trained models: GoogleNet, ShuffleNetV2, ResNet18, ResNet34.
      • The object detection model YOLOv3 is added, with results on par with the best public implementation (mAP is 4.7 absolute percentage points higher than the original author's).
      • The Simple Baselines human pose estimation model based on COCO and MPII data is released, with results on par with mainstream implementations.
      • npair loss is added to the feature learning models; starting from the pre-trained model (arcmargin loss), it raises recall@1 to 79.03% (+0.78%).
    • PaddleNLP intelligent text processing
      • A Chinese semantic representation ELMo model is added. It supports multi-card training and trains twice as fast as mainstream implementations. It improves the F1 score by an absolute 1.1% on Chinese lexical analysis and the Rouge-L score by 1% on Chinese reading comprehension.
      • The Chinese semantic representation model ERNIE is added. Compared with the BERT Chinese model, it improves accuracy by an absolute 1% ~ 2% on Chinese tasks such as natural language inference, semantic similarity, named entity recognition, sentiment analysis, and question-answer matching.
      • The reading comprehension model is upgraded with optimized data preprocessing and document selection, raising Rouge-L on the DuReader validation set to 47.65 (baseline 39.29).
      • A knowledge-aware dialogue model is added. Compared with the baseline generative dialogue model, it improves F1, BLEU1, and BLEU2 by 1 percentage point on average.
      • The dialogue model toolkit is released. It includes Deep Attention Matching Net, a new automatic dialogue evaluation tool, and the BERT-based general dialogue understanding model DGU (Dialogue General Understanding), which supports five dialogue tasks (dialogue semantic matching, DA, DST, slot parsing, and intent recognition) and reaches SOTA results on 3 public datasets.
      • The PaddleNLP toolkit is released. It unifies the modeling of NLP tasks such as text classification, text matching, sequence labeling, reading comprehension, and intelligent dialogue, and the corresponding industrial-grade pre-trained models are made available.
    • PaddleRec intelligent recommendation
      • Deep Interest Network (DIN): the DIN model is added, with results reproduced on public data. Single-machine single-card and multi-card training are supported in both CPU and GPU modes. DIN suits ranking scenarios in recommendation (such as CTR prediction); its main feature is that it incorporates information about the prediction target while modeling the historical sequence.
      • Graph Neural Network (GNN): a session-based graph neural network recommendation model is added, with results reproduced on public data. Single-machine single-card training is supported in both CPU and GPU modes. The model suits recall (candidate retrieval) scenarios in recommendation; modeling the user's history with a GNN can capture more complex transition relationships between item sequences.
      • Word2vec: the word2vec sampling strategy is tuned, results are reproduced on public data, and multi-machine training is supported.

    Tools and Components

    • 🚀 Open source AutoDL Design is officially released to enable automatic network design
      • A series of neural networks generated with the AutoDL Design, and a total of six models trained on CIFAR10 data have saved the network structures and involved weights. Therefore, any developer or researcher interested in deep learning can easily work on PaddlePaddle and public CIFAR10 data to perform inference and model fusion on these six models, which have attained an accuracy over 98%.
      • The source code for the encoder and the critic is made open source. The source code is based on the PaddlePaddle platform and the PARL framework developed entirely by Baidu. The code also comes with Chinese documentation and some brief demos that make it easier for users to run effortlessly. (for example, with "How many 1s is generated by RNN" as a standard, you can quickly verify the correctness of the entire framework). Moreover, users can download, install, run, and try to generate your own original neural network structure.
    • ⬆️ PARL1.1 is upgraded with a focus on parallelization. Users can implement parallel reinforcement learning algorithms with a decorator
      • Parallelization takes a single decorator (@parl.remote_class). Compute-intensive tasks marked with this decorator, such as data preprocessing and simulation, are automatically deployed to the specified computing resources and no longer occupy the computing resources of the main thread.
      • Parallel algorithms such as IMPALA, A2C, and GA3C are supported.
    • 🚀 PaddleHub, a pre-trained model management tool, is released to help users manage models and conduct transfer learning more efficiently
      • Pre-trained model management: downloading, searching, and version management of pre-trained models in the PaddlePaddle ecosystem can be done through the hub command line.
      • One-click command line: without writing code, you can run inference with a pre-trained model straight from the command line and quickly examine its results. The current version supports the following models: lexical analysis LAC; sentiment analysis Senta; object detection SSD; image classification ResNet, MobileNet.
      • Transfer learning: a Finetune API based on pre-trained models is provided, so users can complete transfer learning with a small amount of code. The API mainly covers BERT/ERNIE text classification, sequence labeling, and image classification transfer.
    • 🚀 The X2Paddle model conversion tool is officially released to migrate inference models from other deep learning frameworks to PaddlePaddle without loss. The tool also ships with detailed API comparison documents for TensorFlow and Caffe to help users convert models to PaddlePaddle more easily
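
    The decorator idea in the PARL1.1 item above can be illustrated with a small stand-in. This is a minimal sketch, not PARL's implementation: `remote_class_sketch`, `Simulator`, and the local thread pool are hypothetical names chosen for illustration, and a real setup would apply `@parl.remote_class` against configured computing resources rather than local threads.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the idea behind a "remote class" decorator.
# It only moves work onto local worker threads; it is NOT PARL's API.
_POOL = ThreadPoolExecutor(max_workers=4)


def remote_class_sketch(cls):
    """Wrap cls so construction and method calls run on the worker pool."""

    class Proxy:
        def __init__(self, *args, **kwargs):
            # Build the real object on a worker, not on the caller's thread.
            self._obj = _POOL.submit(cls, *args, **kwargs).result()

        def __getattr__(self, name):
            target = getattr(self._obj, name)
            if not callable(target):
                return target

            def call(*args, **kwargs):
                # Run the method on the pool and wait for its result.
                return _POOL.submit(target, *args, **kwargs).result()

            return call

    Proxy.__name__ = cls.__name__
    return Proxy


@remote_class_sketch
class Simulator:
    def __init__(self, step_size):
        self.step_size = step_size

    def rollout(self, steps):
        # Placeholder for a compute-heavy simulation loop.
        return sum(self.step_size * i for i in range(steps))


sim = Simulator(2)
result = sim.rollout(4)  # 2*0 + 2*1 + 2*2 + 2*3
```

    The calling code looks the same as for an ordinary class, which is the point of the decorator approach: the class body stays unchanged and only the decorator decides where the work runs.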

    🐛 Bug fixes

    • 🛠 Fixed a precision inconsistency in BFS that occurred in backward computation.
    • 🛠 Fixed redundant backward inputs created by optimizer minimize.
    • 🛠 Fixed Paddle-TRT occupying too much video memory.
    • 🛠 Fixed bugs in AllReduceDepPass.
    • 🛠 Fixed bugs in FastThreadedExecutor.
    • Fixed bugs in ops such as Reshape, cross_entropy, arg_min_max, and recurrent.
    • 🛠 Fixed problems with VarBase construction.
    • ⚡️ Fixed a number of problems and bugs in memory_optimize_pass: changed the reuse condition from >= to =, which reduces fragmentation caused by Variable reuse; removed the dependency of memory_optimize_pass on BlockDesc; and fixed a bug where Variables of different types could reuse each other.
    • 🛠 Fixed an issue with util.plot in python3.
    • 👌 Improved the stability of the Profiler and introduced a Memory Profile function.
    • 🛠 Fixed the problem that multithreading took effect only when C++ inference had been cloned within the thread.
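
    The memory_optimize_pass item above changes the reuse condition from >= to =. The effect on fragmentation can be seen in a toy model. This is a minimal sketch under stated assumptions: `pick_reusable` and the block sizes are hypothetical and do not reflect the actual pass, which works on Variables in the program graph.

```python
def pick_reusable(free_blocks, needed, exact=True):
    """Return the index of a freed block to reuse, or None if none qualifies.

    exact=False models the old rule (block size >= needed).
    exact=True models the new rule (block size == needed).
    """
    for index, size in enumerate(free_blocks):
        fits = (size == needed) if exact else (size >= needed)
        if fits:
            return index
    return None


free_blocks = [1024, 256]

# Old rule: a 256-byte variable takes the 1024-byte block and wastes 768 bytes.
loose = pick_reusable(free_blocks, 256, exact=False)

# New rule: the 256-byte variable takes the block that matches exactly.
strict = pick_reusable(free_blocks, 256, exact=True)
```

    With the exact rule, a request that has no equal-sized free block returns `None` and gets a fresh allocation instead of splitting a larger block.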
  • v1.3.2 Changes

    March 26, 2019

    Add elementwise_floordiv and elementwise_mod ops
    🛠 Fix the shape checks of concat, matmul, topk, and squeeze to distinguish compile time from run time
    🛠 Fix bugs in recurrent op and lod_reset op
    ➕ Add an optimizer interface to retrieve the list of optimizer variables
    ⚡️ Optimize the memory consumption of recurrent op and cross_entropy op
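
    As a rough illustration of what the two new elementwise ops compute, the sketch below applies floor division and modulo position by position. It uses plain Python lists rather than Paddle tensors, the function names simply mirror the op names, and only non-negative integers are shown because the ops' behavior for negative operands is not stated in these notes.

```python
def elementwise_floordiv(x, y):
    """Floor-divide each element of x by the element of y at the same position."""
    return [a // b for a, b in zip(x, y)]


def elementwise_mod(x, y):
    """Take each element of x modulo the element of y at the same position."""
    return [a % b for a, b in zip(x, y)]


x = [7, 9, 10]
y = [2, 4, 5]

quotients = elementwise_floordiv(x, y)
remainders = elementwise_mod(x, y)

# For non-negative integers the two results recombine into the input.
rebuilt = [q * b + r for q, b, r in zip(quotients, y, remainders)]
```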

  • v1.3.1 Changes

    March 19, 2019

    🛠 Fix bugs

  • v1.3.0 Changes

    February 21, 2019

    🚀 Release Notes

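    Highlights

    • The Executor and ParallelExecutor interfaces are unified. Users only need to convert a single-card model into a multi-card model through CompiledProgram and then use Executor for training or inference.
    • The AnalysisConfig inference interface is officially released. It supports optimizations such as computational graph analysis and operator fusion, and supports acceleration through third-party libraries such as Intel MKLDNN and the Nvidia TensorRT sub-graph engine.
    • The model zoo adds the PaddlePaddle video model library, which provides 5 classic video classification models and general skeleton code suited to video classification tasks, so users can configure models and complete training and evaluation efficiently in one step.
    • BERT, an NLP semantic representation model, is newly supported, with multi-machine multi-card training and mixed-precision training. Training is 50%+ faster than mainstream implementations, and a complete deployment example is provided.
    • A large-scale sparse parameter server benchmark is released. CPU multi-machine asynchronous training gains a built-in reader that significantly improves the IO throughput of click-through rate prediction tasks, and multi-machine multi-card training performance is improved in several respects.
    • Intel Deep Learning Boost (VNNI instruction set) is newly supported. On the new generation of Intel Xeon Scalable Processors, INT8 inference on some models that use this feature can reach 2 times the performance of FP32.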

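    Basic framework

    • Installation
      • Chinese-language installation helper scripts are added for Linux and MacOS. They offer interactive installation to help users complete PaddlePaddle installation quickly in complex environments.
      • Windows support is improved: GPU support for cuda8 and cudnn7 is added, along with support for the AVX instruction set, MKLDNN, and the mnist dataset. An issue with loading, on Windows, models trained with the same paddle version on Linux/Mac is fixed.
    • Basic dynamic graph features are added
      • Dynamic graph tracer, autograd, and python Layer/PyLayer are added. Dynamic graphs support the MLP, GAN, ptbRNN, and Resnet models, as well as Optimizer and GPU training.
    • Executor and ParallelExecutor interface optimization
      • The Executor and ParallelExecutor interfaces are unified. Users only need to convert a single-card model into a multi-card model through CompiledProgram and then use Executor for training or inference.
      • ParallelExecutor optimization
        MultiDevSSAGraphBuilder is refactored to make it easier to extend.
        Device locks in ParallelExecutor are removed, improving multi-card scheduling performance.
    • Optimization of the intermediate representation (IR) and Passes
      • The Python interfaces of the C++ IR graph and of C++ IR passes are improved.
      • An IRGraph class is added in framework.py in preparation for writing IR Passes in Python.
      • A Pass supporting lock-free network updates is added.
      • QuantizationTransformPass is added. It performs the graph modifications required before training in Quantization Aware Training mode.
    • Memory and video memory optimization
      • Jemalloc can now be added as a dynamic link library at compile time, improving memory management performance and reducing the framework's memory management overhead.
      • New video memory optimization strategies are added, including memory optimize, inplace pass, and memory pool early deletion.
    • Operator-level optimization
      • Each op performs only one scope query before execution, reducing read-write lock operations (previously 1~5 scope queries were needed).
      • Temporary Allocator is added to reduce synchronization inside ops.
      • The py_func operator is added to support plugging in python ops. Users can use it to quickly implement the custom operations they need.
    • DDim, Variable Type, and other components are refactored to reduce framework scheduling overhead.
    • Intel FP32 computation optimization
      • The density_prior_box operator is optimized; a single op runs 3 times faster on four threads.
      • The Stack operator is optimized; a single op runs 16 times faster.
      • Three MKLDNN-based kernels are developed: Transpose, Concat, and Conv3d.
      • A precision bug in the MKLDNN kernel of the lrn operator is fixed, and a single op runs 1.3 times faster.
      • MKLDNN initialization no longer occupies 5G of memory; it now takes 500MB.
      • Unnecessary reorders when going from an MKLDNN op kernel to a non-MKLDNN op kernel are reduced.
    • CPU JitKernel improvements
      • jitkernel for sequence pooling: the op alone is 2 times faster.
      • jitkernel for softmax: the op alone is 2 times faster, and CPU inference for the Bert model improves by 26%.
      • Common basic logic: element-wise square of a vector (kVSquare), matrix multiplication (kMatMul), maximum of a vector (kHMax), and sum of all elements of a vector (kHSum).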

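    Inference engine

    Server-side inference

    • The AnalysisConfig inference interface is officially released. It supports optimizations such as computational graph analysis and operator fusion, and supports acceleration through third-party libraries such as Intel MKLDNN and the Nvidia TensorRT sub-graph engine.
    • Pre-release of an offline INT8 quantization scheme for inference on Intel CPUs
      • Four MKL-DNN-based INT8 kernels are developed: Conv2D, Pool2D, Quantize, and Dequantize.
      • The 3 core Python APIs for Calibration are pre-released (paddle.fluid.contrib.Calibrator).
      • A Calibration tool is developed that keeps the accuracy difference between FP32 and INT8 within 1% for ResNet-50 and MobileNet-V1 on the ImageNet validation dataset.
      • Intel Xeon CascadeLake Server (VNNI instructions) and Intel Xeon SkyLake Server are supported, with a performance improvement of about 1.33 times.
    • Faster CPU inference
      • Fuse sequence pooling and concat ops: N (<200) sequence_pooling ops followed by a concat are fused into a single new op, improving CPU inference of the seqpool model by 56% overall.
      • Fuse consecutive repeated fc ops into one large op, improving CPU inference speed of the seqpool model by 15%.
      • Fuse the op combination that computes ((X * Y).^2 - (X.^2 * Y.^2)) .* scalar, improving CPU inference speed of the seqpool model by 8.2%.
      • Optimize the CPU kernel of compare_op for input tensors with a single element.
    • Paddle-TRT adds support for Calibration INT8, speeding up GPU inference
      • On the VGG and Resnet50 models, inference speed reaches twice that of Paddle-TRT float32.
      • On the VGG and Resnet50 models tested on the imagenet dataset, accuracy drops by less than 0.3%.
    • Operator fusion
      • Two fuses related to fc and con are added, applied to the conv_op CUDNN kernel.
      • A Conv+Affine Channel fusion pass is added, improving Faster RCNN performance by 26.8%.
      • A Transpose+Flatten+Concat fusion pass is added, improving MobilenetSSD performance by 15%.
      • A CUDA kernel for the beam_search operator is implemented, and the related top-k, elementwise_add, reshape, and log computations are fused into the beam_search operator.
    • Functionality and usability improvements
      • A Python interface for the C++ IR graph is added.
      • A Python interface for the inference library is added.
      • Server-side inference supports loading models from memory.
    • Miscellaneous
      • Legacy V2 code is removed. Starting from version 1.3, the old V1 and V2 features are no longer supported.
      • Fixed a bug that caused problems when running Paddle-TRT elementwise-mul models.
      • Fixed a bug where model output was abnormal when the Paddle-TRT trt_engine stream received multiple consecutive inputs.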

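    Mobile inference

    • Efficiency optimization: inference is faster for common models
      • int8 inference supports automatic kernel fusion of dequantize with other ops (batch normalization/relu/elementwise add).
      • The transpose2 operator is optimized for shuffle channel operations.
      • The gru operator is optimized with neon instructions, with additional optimization for batch size 1.
      • Pooling is optimized and implemented to support arbitrary padding.
      • Batch normalization, softmax, and elementwise add are optimized and implemented.
    • Inference for models with multiple inputs and multiple outputs is newly supported.
    • The prelu6 operator, cast operator, and top_k operator are newly implemented.
    • Fixed incorrect results caused by overflow in int8 offline quantization.
    • Fixed a bug where the winograd implementation could return 0 when the height and width of the input feature map were not equal.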

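    Models

    • PaddleCV Intelligent Vision
      • The PaddlePaddle video model library is released, including five video classification models: Attention Cluster, NeXtVLAD, LSTM, stNet, and TSN. It provides general skeleton code suited to video classification tasks, with modules for data reading and preprocessing, training and inference, network models, and metric calculation. Users can add their own network models as needed, reuse the code of the other modules directly, and deploy models quickly.
      • The Mask R-CNN object detection model is newly supported, with results on par with mainstream implementations.
      • For the semantic segmentation DeepLabV3+ model, depthwise_conv op fusion and video memory optimization reduce video memory usage by 40% compared with the previous version.
    • PaddleNLP Intelligent Text Processing
      • BERT, an NLP semantic representation model, is newly supported, with multi-machine multi-card training and mixed-precision training. Training is 50%+ faster than mainstream implementations, and a complete deployment example is provided.
      • The machine translation Transformer model optimizes decoding: the decoder caches the results computed from the encoder output, doubling inference speed.
    • PaddleRec Intelligent Recommendation
      • Sequence Semantic Retrieval adds single-machine multi-thread and single-machine multi-card examples, adds inference, optimizes data preprocessing, and improves the deployment example.
      • GRU4Rec adds negative sampling; results with bpr loss and cross entropy loss are on par with the original work.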

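    Distributed training

    • Large-scale sparse parameter server benchmark released
      • Tested on a real business scenario: a click-through rate prediction task with a feature scale of 10 billion and an average of 1k features per sample. With batch=512, 100 workers achieve a speedup ratio of 90.5 and a throughput of 1.36M/s.
    • CPU multi-machine asynchronous training
      • A built-in reader for click-through rate prediction tasks is released, raising total IO throughput on the Criteo dataset by 1300%.
    • Horizontal scaling improvements for GPU multi-machine multi-card training
      • New parallel modes: PG (ParallelGraph) and MP (Multi-Process). Computation runs independently on each GPU card, improving performance without affecting model accuracy.
      • On ResNet50 with 8 V100 cards on a single machine, PG and MP modes improve training performance by more than 30%; with 32 cards on 4 machines, PG mode is 46% faster and MP mode is 60% faster.
      • On BERT with 8 V100 cards, PG and MP modes improve training performance by 26%.
      • Multi-Process mode is less sensitive to Reader speed than Parallel-Graph mode.
    • Vertical scaling improvements for GPU multi-machine multi-card training
      • New features: fp16 and mixed-precision training.
      • Fp16 speedup on a single machine with a single card: ResNet50 is about 87% faster and BERT about 70% faster.
      • BERT with PG and mixed precision both enabled: throughput per unit time rises by 120% on a single machine with 8 cards.
      • ResNet50 with mixed-precision training and MP mode both enabled: throughput per unit time rises by 100% on V100 with 8 cards on a single machine and with 32 cards on 4 machines.
    • Convergence speed optimization for typical models
      • New features: dynamic Batch Size and dynamic Image Resize.
      • Resnet50 on the Imagenet dataset: the number of training epochs to convergence drops to about 1/3 of that of the standard training method.

    VisualDL

    • VisualDL graph supports visualizing models saved by Paddle fluid.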

    🚀 Release Notes

    Highlights

    • The Executor and ParallelExecutor interfaces are unified, so users only need to convert a single-card model into a multi-card model through CompiledProgram and use Executor for training or inference.
    • 🚀 The AnalysisConfig inference interface is officially released. It supports optimizations such as computational graph analysis and operator fusion, and supports acceleration through third-party libraries such as Intel MKLDNN and the Nvidia TensorRT sub-graph engine.
    • 🚀 The model library adds the PaddlePaddle video model library, which provides 5 classic video classification models and generic skeleton code suitable for video classification tasks. Users can configure a model, train it, and evaluate it efficiently in one step.
    • 🚀 Support is added for BERT, an NLP semantic representation model, including multi-machine multi-card training and mixed-precision training. Training is 50%+ faster than mainstream implementations, and a complete deployment example is available.
    • 🚀 A large-scale sparse parameter server benchmark is released. Asynchronous multi-machine training on CPU gains a built-in reader that significantly improves the IO throughput of click-through rate prediction tasks. Multi-machine multi-card training performance is improved in several respects.
    • 🐎 Support is added for Intel Deep Learning Boost (VNNI) on the next generation of Intel Xeon Scalable Processors. With it, INT8 inference performance on some models can reach 2 times that of FP32.

    ⚡️ Updates on Basic framework

    • Installation
      • A Chinese-language installation helper script is available for Linux and MacOS. It offers interactive installation to help users complete PaddlePaddle installation quickly in complex environments.
      • Better support for Windows: GPU support for cuda8 and cudnn7 is added, along with support for the AVX instruction set, MKLDNN, and the mnist dataset. An issue is fixed that occurred when Windows loaded a model trained with the same paddle version on Linux/Mac.
    • 🆕 New basic functions for Dynamic Computational Graphs
      • Tracer, autograd, and python Layer/PyLayer are available for Dynamic Computational Graphs. Dynamic Computational Graphs can run the MLP, GAN, ptbRNN, and Resnet models, train through Optimizer, and train on GPU.
    • Reformed interfaces of Executor and ParallelExecutor
      • The Executor and ParallelExecutor interfaces are unified. Users only need to convert the single card model into a multi-card model through CompiledProgram and use Executor for training or inference.
      • Improved ParallelExecutor
        • MultiDevSSAGraphBuilder is refactored to make it easier to extend.
        • Device locks in ParallelExecutor are removed to improve multi-card scheduling performance.
    • Optimization of the intermediate representation (IR) and Passes
      • The Python interfaces of the C++ IR graph and of C++ IR passes are improved.
      • An IRGraph class is added in framework.py to prepare for writing IR Passes in the Python layer.
      • A new Pass is added that supports lock-free network updates.
      • QuantizationTransformPass is introduced. It performs the graph transformations required before training in Quantization Aware Training mode.
    • Optimization of memory and video memory
      • Jemalloc can be integrated as a dynamic link library at compile time to improve memory management performance and reduce the framework's memory management overhead.
      • New video memory optimization strategies are added, such as memory optimize, inplace pass, and memory pool early deletion.
    • Overall optimization for Operator
      • Each op only does a single scope query before execution, reducing the read-write lock operations (originally it needs to do 1~5 scope queries)
      • Temporary Allocator is integrated to reduce synchronization in op.
      • The py_func operator is added to plug in python ops. Users can quickly implement custom operations with the py_func operator.
    • ⏱ Refactor DDim, Variable Type, and more to reduce the framework's scheduling overhead.
    • Optimization of Intel FP32 computation
      • Optimize the density_prior_box operator; a single op runs 3 times faster on four threads.
      • Optimize the Stack operator; a single op runs 16 times faster.
      • Develop three MKLDNN-based kernels: Transpose, Concat, and Conv3d.
      • Fix a precision bug in the MKLDNN kernel of the lrn operator; a single op also runs 1.3 times faster.
      • Fix the problem that MKLDNN initialization took up 5G of memory; initialization now takes up 500MB.
      • Reduce unnecessary reorders from MKLDNN op kernels to non-MKLDNN op kernels.
    • 👌 Improve CPU JitKernel
      • jitkernel for sequence pooling: the op alone is 2 times faster.
      • jitkernel for softmax: the op alone is 2 times faster, and CPU inference for the Bert model improves by 26%.
      • Common basic logic: element-wise square of a vector (kVSquare), matrix multiplication (kMatMul), vector maximum (kHMax), and the sum of all elements in a vector (kHSum).

    Inference Engine

    Server-side Inference

    • 🚀 The AnalysisConfig inference interface is officially released, with support for optimizations such as computational graph analysis and operator fusion, and for acceleration through third-party libraries such as Intel MKLDNN and the Nvidia TensorRT sub-graph engine.
    • 🚀 Pre-release of an INT8 offline quantization scheme for inference on Intel Xeon Scalable Processors
      • Four INT8 kernels based on Intel MKL-DNN have been developed: Conv2D, Pool2D, Quantize, and Dequantize.
      • 🚀 The 3 core Python APIs for Calibration are pre-released (paddle.fluid.contrib.Calibrator).
      • The Calibration tool is developed to keep the accuracy difference between FP32 and INT8 within 1% for ResNet-50 and MobileNet-V1 on the ImageNet validation dataset.
      • 🐎 Intel Xeon Scalable Processors with Intel Deep Learning Boost (VNNI) are supported. INT8 inference performance can improve by 2 times on some models.
    • Accelerated CPU inference
      • Fuse sequence pooling and concat ops: N (<200) sequence_pooling ops followed by a concat are fused into a new op, which improves CPU inference of the seqpool model by 56% overall.
      • Fuse consecutive repeated fc ops into one large op, which speeds up CPU inference of the seqpool model by 15%.
      • Fuse the op combination with logic ((X * Y).^2 - (X.^2 * Y.^2)) .* scalar, which speeds up CPU inference of the seqpool model by 8.2%.
      • Optimize the CPU Kernel of compare_op for the case where the input tensor has a single element.
    • 🆕 Paddle-TRT adds support for Calibration INT8, giving faster GPU inference
      • Inference speed on the VGG and Resnet50 models reaches twice that of Paddle-TRT float32.
      • Accuracy of the VGG and Resnet50 models tested on the imagenet dataset drops by less than 0.3%.
    • Operator fusion
      • Two fusions related to fc and con are added, applied to the conv_op CUDNN kernel.
      • A fusion pass for Conv+Affine Channel is added, and Faster RCNN performance increases by 26.8%.
      • A fusion pass for Transpose+Flatten+Concat is added, and MobilenetSSD performance increases by 15%.
      • The CUDA Kernel of the beam_search operator is implemented, and the corresponding top-k, elementwise_add, reshape, and log calculations are fused into the beam_search operator.
    • 👌 Improved functionality and ease of use
      • New Python interfaces for the C++ IR graph.
      • New Python interfaces for the inference library.
      • Server-side inference supports loading models from memory.
    • ⚡️ Miscellaneous Updates
      • Remove the legacy V2 code. From version 1.3, functions in the V1 and V2 legacy version are no longer supported.
      • Fixed a bug in the Paddle-TRT elementwise-mul model.
      • Fixed a bug where the model output was abnormal when Paddle-TRT trt_engine stream accepts multiple consecutive inputs.
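
    The fused expression listed under accelerated CPU inference can be checked by hand on a tiny input. The sketch below is a plain-Python reference, not the Paddle kernel. It assumes that `*` in the formula is matrix multiplication, that the squares are element-wise, and that `.* scalar` scales every element; the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def square(m):
    """Square every element of a matrix."""
    return [[v * v for v in row] for row in m]


def squared_mat_sub(x, y, scalar):
    """Compute ((X * Y).^2 - (X.^2 * Y.^2)) .* scalar in one function."""
    xy_squared = square(matmul(x, y))
    x2_y2 = matmul(square(x), square(y))
    return [[(p - q) * scalar for p, q in zip(row_p, row_q)]
            for row_p, row_q in zip(xy_squared, x2_y2)]


x = [[1, 2]]      # shape 1x2
y = [[3], [4]]    # shape 2x1

# X*Y = 11, squared = 121; X.^2 * Y.^2 = 1*9 + 4*16 = 73; (121 - 73) * 0.5
fused = squared_mat_sub(x, y, 0.5)
```

    Running the three steps as separate ops would materialize two intermediate matrices, which is the overhead a fused op avoids.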

    Mobile inference

    • ✨ Improved efficiency and faster inference for common models
      • int8 inference supports automatic kernel fusion of dequantize with other ops (batch normalization/relu/elementwise add).
      • The transpose2 operator is optimized for shuffle channel operations.
      • The gru operator is optimized with neon instructions and further optimized for a batch size of 1.
      • Pooling is optimized and implemented to support arbitrary padding.
      • Batch normalization, softmax, and elementwise add are optimized and implemented.
    • 🆕 Inference is newly supported for models with multiple inputs and multiple outputs.
    • The prelu6 operator, cast operator, and top_k operator are implemented.
    • 🛠 Fixed an issue where overflow in int8 off-line quantization produced incorrect results.
    • 🛠 Fixed a bug where winograd could return 0 when the height and width of the input feature map were not equal.

    Models

    • PaddleCV Intelligent Vision
      • Release the PaddlePaddle video model library, including five video classification models: Attention Cluster, NeXtVLAD, LSTM, stNet, and TSN. It provides generic skeleton code for video classification tasks, including data reading and preprocessing, training and inference, network models, and metric calculation. Users can add their own network models as needed, directly reuse the code of the other modules, and deploy models quickly.
      • Support the Mask R-CNN object detection model, with results on par with mainstream implementations.
      • Semantic segmentation DeepLabV3+ model: with depthwise_conv op fusion and video memory optimization, video memory consumption is 40% lower than in the previous version.
    • PaddleNLP Intelligent Text Processing
      • Add the BERT model for NLP semantic representation, with support for multi-machine multi-card training and mixed-precision training. Training is 50%+ faster than mainstream implementations, and a complete deployment example is available.
      • The machine translation Transformer model optimizes decoding: the decoder caches the results computed from the encoder output, which doubles inference speed.
    • PaddleRec Intelligent Recommendation
      • Sequence Semantic Retrieval adds single-node multi-thread and single-node multi-card examples, adds inference, optimizes data preprocessing, and improves the deployment example.
      • GRU4Rec adds negative sampling, and results with bpr loss and cross entropy loss are on par with the original work.

    Distributed Training

    • 🚀 Large-scale sparse parameter server benchmark released
      • In a real business scenario, on a click-through rate prediction task with a feature scale of 10 billion and an average of 1k features per sample, with batch=512, 100 workers reach a speedup ratio of 90.5 and a throughput of 1.36M/s.
    • Asynchronous multi-node training on CPU
      • A built-in reader for click-through rate prediction tasks is released, which increases total IO throughput by 1300% on the Criteo dataset.
    • ✨ Improved horizontal scaling of multi-machine multi-card GPU training
      • New parallel modes: PG (ParallelGraph) and MP (Multi-Process). Computation runs independently on each GPU card, improving performance without affecting model accuracy.
      • On ResNet50 with 8 V100 cards on a single node, PG and MP modes improve training performance by more than 30%; with 32 cards on 4 machines, PG mode is 46% faster and MP mode is 60% faster.
      • On BERT with 8 V100 cards, PG and MP modes improve training performance by 26%.
      • Multi-Process mode is less sensitive to Reader speed than Parallel-Graph mode.
    • ✨ Improved vertical scaling of multi-machine multi-card GPU training
      • New features: fp16 and mixed-precision training.
      • Fp16 speedup on a single node with a single card: ResNet50 is about 87% faster and BERT is about 70% faster.
      • BERT with PG and mixed precision both enabled: throughput per unit time increases by 120% on a single node with 8 cards.
      • ResNet50 with mixed-precision training and MP mode both enabled: throughput per unit time increases by 100% on V100 with 8 cards on a single node and with 32 cards on 4 nodes.
    • Faster convergence for typical models
      • New features: Dynamic Batch Size and Dynamic Image Resize.
      • Resnet50 on the Imagenet dataset: the number of training epochs before convergence drops to about 1/3 of that of the standard training method.
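
    The benchmark figures above mix speedup ratios and percentage increases, and a little arithmetic puts them on a common scale. The helpers below are a hypothetical sketch for reading the numbers, with inputs taken from the notes (a speedup ratio of 90.5 on 100 workers, and a 1300% IO throughput increase).

```python
def parallel_efficiency(speedup, workers):
    """Fraction of ideal linear scaling achieved by a given speedup."""
    return speedup / workers


def percent_increase_to_factor(percent):
    """Convert 'increased by N%' into a multiplier over the baseline."""
    return 1 + percent / 100


# 100 workers with a speedup ratio of 90.5 run at about 90.5% efficiency.
efficiency = parallel_efficiency(90.5, 100)

# A 1300% increase means the new throughput is 14 times the baseline.
io_factor = percent_increase_to_factor(1300)
```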

    VisualDL

    • 👍 VisualDL graph supports visual demonstration for models saved by Paddle fluid.
  • v1.2.1 Changes

    January 16, 2019

    🚀 Release Notes

    Framework

    1. Added the huber_loss loss function (huber loss op).
    2. 🐎 Improved training performance on small models, that is, models where GPU computation takes a small share of the run time.
    3. Changed the tensor type implementation from typeindex to enum to improve execution efficiency.
    4. 🛠 Fixed an issue where a failed import cv2 under python3 broke numpy calculations.

    Inference

    1. 👍 Paddle-TRT supports multi-thread inference.
    2. The TRT lib is added to third_party automatically when generating the inference lib.
    3. 🛠 Fixed a Paddle-TRT pool2d converter bug in ceil mode.
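
    For readers unfamiliar with the loss added in this release, the sketch below shows the standard Huber loss definition for a single prediction. It is a plain-Python illustration of the textbook formula, not the Paddle op; the function signature and the default `delta` are assumptions made for the example.

```python
def huber_loss(pred, label, delta=1.0):
    """Standard Huber loss for one prediction.

    Quadratic for residuals up to delta, linear beyond it, which makes
    the loss less sensitive to outliers than squared error.
    """
    residual = abs(label - pred)
    if residual <= delta:
        return 0.5 * residual * residual
    return delta * (residual - 0.5 * delta)


small = huber_loss(0.5, 0.0)   # quadratic branch: 0.5 * 0.5**2
large = huber_loss(3.0, 0.0)   # linear branch: 1.0 * (3.0 - 0.5)
```

    The two branches meet at `residual == delta`, so the loss is continuous there.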

  • v1.2.0 Changes

    December 06, 2018

    🚀 Release Notes

    Framework

    • 🆕 A new pip installation package is available, which runs in the Windows CPU environment.
    • 👌 Support for python3.6 and python3.7.
    • The memory allocation module Allocator is refactored. The memory allocation strategy in the CPU environment is improved, and video memory utilization is increased (disabled by default; use FLAGS_allocator_strategy to enable it).
    • 📜 The usage of SelectedRows is restricted, and bugs in sparse regularization and the sparse optimizer are fixed.
    • 👍 Tensor supports DLPack, which makes integration with other frameworks easier in both directions.
    • OP
      • The shape inference bug in expand op is fixed.
      • The Selu activation function is added.

    Inference Engine

    • Server-side inference
      • GPU supports graph fusion and can work together with the TensorRT engine for mixed graph rewriting. On common image models such as Resnet50 and Googlenet, performance at bs=1 improves by 50~100%.
      • GPU supports DDPG Deep Explore inference.
      • Paddle-TRT supports more models, including Resnet, SE-Resnet, DPN, and GoogleNet.
      • CPU, GPU, TensorRT, and other acceleration engines are merged into AnalysisPredictor and controlled uniformly through AnalysisConfig.
      • Interfaces for calling multi-threaded math libraries are added.
      • TensorRT plugins are supported, including the split operator, prelu operator, avg_pool operator, and elementwise_mul operator.
      • JIT CPU Kernel is added. It supports basic vector operations and partial implementations of common algorithms including ReLU, LSTM, and GRU, and switches automatically at runtime between the AVX and AVX2 instruction sets.
      • The implementations of CRF decoding and LayerNorm on the AVX and AVX2 instruction sets are optimized.
      • Issue fixed: AnalysisPredictor did not delete transfer data on GPU or in the transition from CPU to GPU.
      • Issue fixed: memory kept growing when a Variable contained a container.
      • Issue fixed: fc_op could not process a 3-D Tensor.
      • Issue fixed: Analysis predictor had problems when running passes on GPU.
      • Issue fixed: problems running GoogleNet on TensorRT.
      • Inference performance improvements
      • Max Sequence pool optimization, with single op performance 10% higher.
      • Softmax operator optimization, with single op performance 14% higher.
      • Layer Norm operator optimization, including AVX2 instruction set support, with single op performance 5 times higher.
      • Stack operator optimization, with single op performance 3.6 times higher.
      • depthwise_conv_mkldnn_pass is added to accelerate MobileNet inference.
      • Graph analysis time in analysis mode is reduced, making it 70 times faster.
      • DAM open-source model: reached 118.8% of the previous version.
    • Mobile inference
      • The winograd algorithm is implemented, improving GoogleNet v1 performance by 35%.
      • GoogleNet 8bit is optimized and runs 14% faster than float.
      • MobileNet v1 8bit is supported and runs 20% faster than float.
      • MobileNet v2 8bit is supported and runs 19% faster than float.
      • The Deconv operator is developed for FPGA V1.
      • Android gpu supports mainstream network models such as MobileNet, MobileNetSSD, GoogleNet, SqueezeNet, YOLO, and ResNet.

    Model

    • Pre-trained models are published for CV image classification tasks: MobileNet V1, ResNet101, ResNet152, and VGG11.
    • The CV Metric Learning models add the arcmargin loss function, and the training method is changed: element-wise training produces the pre-trained model, and pair-wise training then fine-tunes it to improve precision.
    • The NLP language model task adds a cudnn-based LSTM implementation. Compared with the PaddingRNN-based implementation, it is 3~5 times faster under various parameter settings.
    • A distributed word2vec model is added, including the new tree-based softmax operator and negative sampling, in line with the classic word2vec algorithms.
    • Distributed settings for the GRU4Rec and Tag-Space algorithms are added.
    • ⚡️ The Multi-view Simnet model is improved, and an inference setting is added.
    • 👍 The reinforcement learning algorithm DQN is supported.
    • 🌐 Models currently compatible with python3.x: semantic matching DAM, reading comprehension BiDAF, machine translation Transformer, language model, reinforcement learning DQN, DoubleDQN, and DuelingDQN, video classification TSN, Metric Learning, scene text recognition CRNN-CTC and OCR Attention, Generative Adversarial Networks ConditionalGAN, DCGAN, and CycleGAN, semantic segmentation ICNET and DeepLab v3+, object detection Faster-RCNN, MobileNet-SSD, and PyramidBox, image classification SE-ResNeXt and ResNet, personalized recommendation TagSpace, GRU4Rec, SequenceSemanticRetrieval, DeepCTR, and Multiview-Simnet.

    Distributed training

    • Multi-machine asynchronous training on CPU
      • Asynchronous concurrent workers: AsyncExecutor is added. With a single training file as its execution granularity, it supports lock-free asynchronous worker-side computation in distributed training, as well as single-machine training. Taking the CTR task as an example, overall throughput in single-machine training is 14 times higher.
      • IO optimization: a DataFeed that supports AsyncExecutor is added, with customizable general classification task formats. CTRReader is added for CTR tasks so that data reading speed scales linearly. In PaddleRec/ctr tasks, overall throughput increases by 2 times.
      • Communication optimization: for Dense parameters with sparse access, such as Embedding, a sparse communication mechanism is adopted. Taking the semantic matching task as an example, the amount of fetched parameters can be compressed to 1% or less. On data from real search scenarios, overall training throughput is 15 times higher.
    • 🔀 Multi-GPU synchronous training
      • Issue fixed: P2P training mode could hang with the Transformer and Bert models.
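
    The sparse communication idea described above, fetching only the parameter rows a batch actually touches, can be shown with a toy embedding table. This is a generic sketch and not Paddle's communication code: `pull_sparse`, the table layout, and the sizes are hypothetical.

```python
def pull_sparse(table, batch_ids):
    """Fetch only the embedding rows referenced by this batch.

    Duplicate ids are requested once, so the transferred volume depends
    on the number of distinct ids rather than on the table size.
    """
    unique_ids = sorted(set(batch_ids))
    return {row_id: table[row_id] for row_id in unique_ids}


# A toy "server side" embedding table with 1000 rows of width 4.
table = {row_id: [float(row_id)] * 4 for row_id in range(1000)}

# The batch references rows 3, 7, and 42 (row 3 appears twice).
rows = pull_sparse(table, [3, 7, 3, 42])

# Share of the table that had to be transferred for this batch.
fraction = len(rows) / len(table)
```

    Here 3 of 1000 rows are transferred, which mirrors how the fetched volume can fall to a small percentage of the full parameter when access is sparse.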

    📚 Documentation

    • API
      • Add 13 API guides.
      • Add 300 entries to the Chinese API Reference.
      • Improve 77 entries of the English API Reference, including code examples, argument explanations, and other sections.
    • 📚 Installation documentation
      • Add an installation guide for python3.6 and python3.7.
      • Add an installation guide for pip install on Windows.
    • 📚 Book documentation
      • Code examples in the Book documentation are changed to use the Low level API.

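    Basic framework

    🏁 A new pip installation package is provided that supports CPU execution on Windows.

    Support for python3.6 and python3.7 is added.

    The memory allocation module Allocator is refactored, improving the memory allocation strategy on CPU and raising video memory utilization (disabled by default; requires FLAGS_allocator_strategy).

    The use of SelectedRows is restricted. Bugs in sparse regularization and the sparse optimizer are fixed.

    Tensor supports DLPack, making it easier for other frameworks to integrate Paddle and for Paddle to integrate other training frameworks.

    The shape inference bug in expand op is fixed.

    The Selu activation function is supported.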

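    Inference engine

    Server-side inference

    GPU supports graph fusion and mixed graph rewriting together with the TensorRT engine. On common image models such as Resnet50 and Googlenet, performance at bs=1 improves by 50%~100%.

    GPU supports DDPG Deep Explore inference.

    Paddle-TRT supports more models, including Resnet, SE-Resnet, DPN, and GoogleNet.

    Acceleration engines such as CPU, GPU, and TensorRT are merged into AnalysisPredictor and controlled uniformly by AnalysisConfig.

    Interfaces for calling multi-threaded math libraries are added.

    TensorRT plugin support is added, including the split operator, prelu operator, avg_pool operator, and elementwise_mul operator.

    JIT CPU Kernel is added. It supports basic vector operations and partial implementations of common algorithms including ReLU, LSTM, and GRU, and can switch automatically at runtime between the AVX and AVX2 instruction sets.

    The implementations of CRF decoding and LayerNorm on the AVX and AVX2 instruction sets are optimized.

    Fixed: AnalysisPredictor did not delete transfer data on GPU or when transferring from CPU to GPU.

    Fixed: memory kept growing when a Variable contained a container.

    Fixed: fc_op did not support 3-D Tensors.

    Fixed: Analysis predictor problems when executing passes on GPU.

    Fixed: problems running GoogleNet under TensorRT.

    Inference performance improvements

    Max Sequence pool optimization: single op 10% faster.

    Softmax operator optimization: single op 14% faster.

    Layer Norm operator optimization with avx2 instruction set support: single op 5 times faster.

    Stack operator optimization: single op 3.6 times faster.

    depthwise_conv_mkldnn_pass is added to accelerate MobileNet inference.

    Graph analysis time in analysis mode is reduced, a 70 times improvement.

    DAM open-source model: improved by 118.8%.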

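    Mobile inference

    The winograd algorithm is implemented, improving GoogleNet v1 performance by 35%.

    GoogleNet 8bit is optimized and is 14% faster than float.

    MobileNet v1 8bit is supported and is 20% faster than float.

    MobileNet v2 8bit is supported and is 19% faster than float.

    The Deconv operator is developed for FPGA V1.

    android gpu supports mainstream network models such as MobileNet, MobileNetSSD, GoogleNet, SqueezeNet, YOLO, and ResNet.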

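    Models

    Pre-trained models are published for CV image classification tasks: MobileNet V1, ResNet101, ResNet152, and VGG11.

    The CV Metric Learning models add the arcmargin loss and adjust the training method: element-wise training produces the pre-trained model and pair-wise training continues with fine-tuning, which improves precision.

    The NLP language model task adds a cudnn-based LSTM implementation, which is 3~5 times faster than the PaddingRNN implementation under different parameter settings.

    A distributed word2vec model is added, including the new tree-based softmax operator and negative sampling, aligned with the classic word2vec algorithms.

    Distributed settings for the GRU4Rec and Tag-Space algorithms are added.

    The Multi-view Simnet model is improved, and an inference setting is added.

    The reinforcement learning algorithm DQN is supported.

    Models that now support python3.5 and above: semantic matching DAM, reading comprehension BiDAF, machine translation Transformer, language model, reinforcement learning DQN, DoubleDQN, and DuelingDQN, video classification TSN, Metric Learning, scene text recognition CRNN-CTC and OCR Attention, generative adversarial networks ConditionalGAN, DCGAN, and CycleGAN, semantic segmentation ICNET and DeepLab v3+, object detection Faster-RCNN, MobileNet-SSD, and PyramidBox, image classification SE-ResNeXt and ResNet, personalized recommendation TagSpace, GRU4Rec, SequenceSemanticRetrieval, DeepCTR, and Multiview-Simnet.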

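    Distributed training

    CPU multi-machine asynchronous training

    👷 Asynchronous concurrent workers: AsyncExecutor is added. It uses a training file as its execution granularity and supports lock-free asynchronous computation on the worker side in distributed training, as well as single-machine training. Taking the CTR task as an example, with single-machine threads fully utilized, overall single-machine training throughput improves 14 times.

    IO optimization: a DataFeed supporting AsyncExecutor is added, with support for customizable general classification task formats. For CTR tasks, CTRReader is added so that data reading speed scales linearly; in the PaddleRec/ctr task, overall throughput doubles.

    Communication optimization: for Dense parameters with sparse access, such as Embedding, a sparse communication mechanism is added. Taking the semantic matching task as an example, the total amount of fetched parameters can be compressed to below 1%, and on data from real search scenarios, overall training throughput can improve 50 times.

    GPU multi-machine synchronous training

    Fixed: P2P training mode could hang with the Transformer and Bert models.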

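    Documentation

    API

    13 API usage guides are added.

    300 Chinese API Reference documents are added.

    77 English API Reference documents are improved, including code examples and argument descriptions.

    Installation documentation

    Installation instructions for python3.6 and python3.7 are added.

    🏁 Installation instructions for pip install on windows are added.

    Book documentation

    Code examples in the Book documentation are changed to use the Low level API.

    Usage documentation

    "Notes on Operators" is added, and several usage documents are updated, including "Saving and Loading Model Variables", "Introduction to the C++ Inference API", "Inference with the TensorRT Library", and "How to Contribute Code".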

  • v0.1

    July 01, 2019