PaddlePaddle v1.2.0 Release Notes

Release Date: 2018-12-06
  • 🚀 Release Notes


    • 🆕 A new pip installation package is available that runs on Windows in CPU-only environments.
    • 👌 Support for python3.6 and python3.7.
    • Refactored the memory allocation module, Allocator. Improved the memory allocation strategy on CPU and
      increased GPU memory utilization (disabled by default; use FLAGS_allocator_strategy to enable it).
    • 📜 Restricted the usage of SelectedRows, and fixed bugs in sparse regularization and sparse optimizers.
    • 👍 Tensor supports DLPack, which makes it easier to integrate Paddle with other frameworks.
    • OP
      • Fixed a shape inference bug in the expand op.
      • Added the Selu activation function.
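
As a reminder of what the Selu activation computes, here is a minimal scalar sketch in plain Python. The constants are the ones published with the SELU activation; the release notes do not state which defaults Paddle's op uses, so treat `ALPHA` and `SCALE` as assumptions, and note that `selu` here is an illustrative helper, not Paddle's API.

```python
import math

# Constants from the SELU definition (assumed; not taken from the release notes).
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805


def selu(x, alpha=ALPHA, scale=SCALE):
    """Scaled exponential linear unit for a single float.

    Positive inputs are scaled linearly; non-positive inputs follow a
    scaled exponential curve that saturates at -scale * alpha.
    """
    if x > 0:
        return scale * x
    return scale * alpha * (math.exp(x) - 1.0)


if __name__ == "__main__":
    for value in (-2.0, 0.0, 2.0):
        print(value, selu(value))
```
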

    Inference Engine

    • Server Prediction
      • GPU supports graph fusion, and graph rewriting in cooperation with the TensorRT engine. On common image models such as Resnet50 and Googlenet, performance at bs=1 improves by 50% to 100%.
      • GPU supports DDPG Deep Explore prediction.
      • Paddle-TRT supports more models, including Resnet, SE-Resnet, DPN and GoogleNet.
      • CPU, GPU, TensorRT and other acceleration engines are merged into AnalysisPredictor and controlled uniformly through AnalysisConfig.
      • Added interfaces for calling multi-threaded math libraries.
      • Added support for TensorRT plugins, including the split, prelu, avg_pool and elementwise_mul operators.
      • Added JIT CPU kernels that support basic vector operations and partial implementations of common algorithms including ReLU, LSTM and GRU, with automatic runtime switching between the AVX and AVX2 instruction sets.
      • Optimized the CRF decoding and LayerNorm implementations on the AVX and AVX2 instruction sets.
      • Issue fixed: AnalysisPredictor did not delete transfer data on GPU or when moving from CPU to GPU.
      • Issue fixed: memory kept growing when a Variable contained a container.
      • Issue fixed: fc_op could not process 3-D Tensors.
      • Issue fixed: Analysis predictor failed when running passes on GPU.
      • Issue fixed: GoogleNet problems on TensorRT.
      • Prediction performance improvements
      • Max Sequence pool optimization: single-op performance is 10% higher.
      • Softmax operator optimization: single-op performance is 14% higher.
      • Layer Norm operator optimization, including AVX2 instruction set support: single-op performance is 5 times higher.
      • Stack operator optimization: single-op performance is 3.6 times higher.
      • Added depthwise_conv_mkldnn_pass to accelerate MobileNet prediction.
      • Reduced graph analysis time in analysis mode; it is now 70 times faster.
      • The open-source DAM model reaches 118.8% of the previous version's performance.
    • Mobile Endpoint Prediction
      • Implemented the winograd algorithm, which improves GoogleNet v1 performance by 35%.
      • Optimized GoogleNet 8bit: 14% faster than float.
      • Added support for MobileNet v1 8bit: 20% faster than float.
      • Added support for MobileNet v2 8bit: 19% faster than float.
      • FPGA V1 adds the Deconv operator.
      • Android GPU supports mainstream network models such as MobileNet, MobileNetSSD, GoogleNet, SqueezeNet, YOLO and ResNet.
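
To illustrate why the winograd algorithm speeds up convolution, the sketch below shows the smallest textbook case, F(2,3): two outputs of a 3-tap 1-D filter computed with 4 multiplications instead of 6. The release notes do not say which Winograd variant or tile size Paddle's mobile engine implements, so this is only an illustration of the idea; `direct_f23` and `winograd_f23` are hypothetical helper names.

```python
def direct_f23(d, g):
    """Two outputs of a 3-tap filter g over 4 inputs d, using 6 multiplications."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def winograd_f23(d, g):
    """Same two outputs using the Winograd F(2,3) minimal filtering scheme."""
    # Filter transform: depends only on g, so it can be precomputed once.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2.0
    u2 = (g[0] - g[1] + g[2]) / 2.0
    u3 = g[2]
    # Only 4 multiplications involve the input data.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform: additions and subtractions only.
    return [m1 + m2 + m3, m2 - m3 - m4]


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0]
    taps = [0.5, -1.0, 2.0]
    print(direct_f23(data, taps), winograd_f23(data, taps))
```

In 2-D convolution the same trick is applied along both axes, which is where most of the savings for 3x3 kernels come from.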


    • CV image classification tasks: published pre-trained models for MobileNet V1, ResNet101, ResNet152 and VGG11.
    • CV Metric Learning models: added the arcmargin loss and changed the training procedure. The model is now pre-trained with element-wise training and then fine-tuned with pair-wise training to improve precision.
    • NLP model tasks: added a cudnn-based LSTM implementation. Compared with the PaddingRNN-based implementation, the cudnn version is 3 to 5 times faster across different parameter settings.
    • Added a distributed word2vec model, including the new tree-based softmax operator and negative sampling, aligned with the classic word2vec algorithms.
    • Added distributed configurations for the GRU4Rec and Tag-Space algorithms.
    • ⚡️ Improved the Multi-view Simnet model and added an inference configuration.
    • 👍 Reinforcement learning algorithm DQN is supported.
    • 🌐 Models currently compatible with python3.x: semantic matching DAM; reading comprehension BiDAF; machine translation Transformer; language model; reinforcement learning DQN, DoubleDQN and DuelingDQN; video classification TSN; Metric Learning; scene text recognition CRNN-CTC and OCR Attention; generative adversarial networks ConditionalGAN, DCGAN and CycleGAN; semantic segmentation ICNET and DeepLab v3+; object detection Faster-RCNN, MobileNet-SSD and PyramidBox; image classification SE-ResNeXt and ResNet; personalized recommendation TagSpace, GRU4Rec, SequenceSemanticRetrieval, DeepCTR and Multiview-Simnet.

    Distributed training

    • multi-CPU asynchronous training
      • Asynchronous concurrent workers: added AsyncExecutor. Using a single training file as its unit of execution, it supports lock-free asynchronous worker-side computation in distributed training, and also supports single-machine training. Taking the CTR task as an example, overall single-machine training throughput improves 14-fold when the machine's threads are fully utilized.
      • IO optimization: made DataFeed compatible with AsyncExecutor; added support for user-defined general classification task formats; added CTRReader for CTR tasks so that data reading speed scales linearly. In PaddleRec/ctr tasks, overall throughput increases by 2 times.
      • Better data communication: for dense parameters that are accessed sparsely, such as Embedding, a sparse data communication mechanism is used. Taking semantic matching tasks as an example, the volume of parameters fetched can be reduced to 1% or less. On real search data, overall throughput increases 15-fold.
    • 🔀 multi-GPU synchronous training
      • Issue fixed: in Transformer and Bert models, training in P2P mode could hang.
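
The sparse communication idea for Embedding-style parameters can be sketched as follows: a worker requests only the rows that the current batch actually touches, rather than the whole table. This is a plain-Python illustration under stated assumptions, not Paddle's implementation; `table` stands in for a remote parameter store, and all function names are hypothetical.

```python
def rows_to_fetch(batch_ids):
    """Sorted unique embedding row ids referenced by a batch."""
    return sorted(set(batch_ids))


def fetch_sparse(table, batch_ids):
    """Fetch only the touched rows as {row_id: row}.

    `table` is a list of rows standing in for the remote parameter store.
    """
    return {row_id: table[row_id] for row_id in rows_to_fetch(batch_ids)}


def fetch_ratio(vocab_size, batch_ids):
    """Fraction of the table that has to be transferred for this batch."""
    return len(set(batch_ids)) / vocab_size


if __name__ == "__main__":
    vocab_size = 1000
    table = [[float(i), float(i)] for i in range(vocab_size)]
    batch = [3, 7, 3, 42]
    print(fetch_sparse(table, batch))
    print(fetch_ratio(vocab_size, batch))
```

With a large vocabulary and small batches, the fetched fraction is tiny, which is consistent with the reduction to 1% or less reported above for semantic matching tasks.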

    📚 Documentation

    • API
      • Added 13 API guides.
      • Added 300 entries to the Chinese API Reference.
      • Improved 77 entries of the English API Reference, including code examples, argument explanations and other sections.
    • 📚 Documentation about installation
      • Added an installation guide for python3.6 and python3.7.
      • Added an installation guide for pip install on Windows.
    • 📚 Book Documentation
      • Code examples in the Book documentation now use the Low level API.

