Commit dbe08e9b authored by yuguo960516yuguo

2.4.2

parent b5499578
@@ -155,8 +155,14 @@ if(WIN32)
   endforeach()
 endif()
-# NOTE(zhouwei): msvc max/min macro conflict with std::min/max, define NOMINMAX globally
+# msvc max/min macro conflict with std::min/max, define NOMINMAX globally
 add_definitions("-DNOMINMAX")
+# 1. windows.h define 'small' cause CUDA11.6/11.7/11.8 's cub compile error,
+#    see https://github.com/microsoft/onnxruntime/issues/11227
+# 2. WIN32_LEAN_AND_MEAN minimize the windows include files, avoid define 'small'
+add_definitions(-DWIN32_LEAN_AND_MEAN)
 # windows build turn off warnings, use parallel compiling.
 foreach(
   flag_var
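To make the effect of these two defines concrete, here is a minimal, hypothetical Windows-only sketch; it is not part of the commit, and the file name and demo are invented. Without NOMINMAX, `<windows.h>` defines `min`/`max` as macros and the `std::max` call below fails to compile:

```cpp
// min_max_demo.cc -- hypothetical demo of the conflict the flags above avoid.
#define NOMINMAX             // keep std::min/std::max usable (build adds -DNOMINMAX)
#define WIN32_LEAN_AND_MEAN  // trim windows.h (also avoids its 'small' define)
#include <windows.h>

#include <algorithm>
#include <iostream>

int main() {
  // With the macros suppressed, the standard algorithms resolve normally.
  std::cout << std::min(1, 2) << " " << std::max(3, 4) << std::endl;
  return 0;
}
```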
@@ -4,152 +4,93 @@
 --------------------------------------------------------------------------------
-# PaddlePaddle ROCm Installation Guide
-
-The ROCm build of PaddlePaddle supports training and inference on Hygon CPUs and Hygon DCUs. Besides AMD ROCm, it also supports the Hygon DCU Toolkit (DTK). The currently supported ROCm version is 4.0.1, and several DTK versions are supported. Two installation methods are provided:
-
-- install from a pre-built wheel package
-- build and install from source
-
-**Note**: PaddlePaddle wheel packages built against the corresponding DTK version can be downloaded from the AI ecosystem packages of the [HPC developer community](https://developer.hpccube.com/tool/#sdk).
-
-## Method 1: install from a wheel package
-
-**Note**: a docker image based on CentOS 7.8 & ROCm 4.0.1 and a Python 3.7 wheel package are currently provided, as well as a docker image based on CentOS 7.6 & DTK 22.10.1 that already contains the PaddlePaddle 2.3.2 wheel package for Python 3.7 (image.sourcefind.cn:5000/dcu/admin/base/paddlepaddle:2.3.2-centos7.6-dtk-22.10.1-py37-latest).
-
-**Step 1**: prepare a CentOS 7.6 & DTK 22.10.1 runtime environment (the Paddle image is recommended).
-You can pull a docker image with CentOS 7.6 & DTK 22.10.1 pre-installed directly from Paddle's official image repository:
-
-```bash
-# Pull the image
-docker pull image.sourcefind.cn:5000/dcu/admin/base/paddlepaddle:2.3.2-centos7.6-dtk-22.10.1-py37-latest
-# Start the container; parameters such as shm-size and device need to be configured
-docker run -it --network=host --name=oneflow_compile --privileged --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size=16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root --ulimit stack=-1:-1 --ulimit memlock=-1:-1 -v /public/home/xxx:/home image.sourcefind.cn:5000/dcu/admin/base/paddlepaddle:2.3.2-centos7.6-dtk-22.10.1-py37-latest /bin/bash
-# Check that the container correctly recognizes the Hygon DCU devices
-rocm-smi
-# Expected output:
-======================= ROCm System Management Interface =======================
-================================= Concise Info =================================
-GPU  Temp   AvgPwr  SCLK     MCLK    Fan   Perf  PwrCap  VRAM%  GPU%
-0    50.0c  23.0W   1319Mhz  800Mhz  0.0%  auto  300.0W    0%    0%
-1    48.0c  25.0W   1319Mhz  800Mhz  0.0%  auto  300.0W    0%    0%
-2    48.0c  24.0W   1319Mhz  800Mhz  0.0%  auto  300.0W    0%    0%
-3    49.0c  27.0W   1319Mhz  800Mhz  0.0%  auto  300.0W    0%    0%
-================================================================================
-============================= End of ROCm SMI Log ==============================
-```
-
-**Step 2**: PaddlePaddle 2.3.2 for Python 3.7 is already integrated into this image; to reinstall it:
-
-```bash
-pip3 uninstall paddlepaddle-rocm
-pip3 install paddlepaddle-2.3.2_dtk2210_git0195561-cp37-cp37m-manylinux2014_x86_64.whl
-```
-
-**Step 3**: verify the installation.
-After installation, run the following command; if it prints PaddlePaddle is installed successfully!, the installation succeeded:
-
-```bash
-python -c "import paddle; paddle.utils.run_check()"
-```
-
-## Method 2: build and install from source
-
-**Note**: you can use the CentOS 7.8 & ROCm 4.0.1 build image supported by Paddle; per the requirements of ROCm 4.0.1, the supported compiler is devtoolset-7.
-
-**Step 1**: prepare a ROCm 4.0.1 build environment (the Paddle image is recommended).
-You can pull a docker image with ROCm 4.0.1 pre-installed directly from Paddle's official image repository. Download DTK-22.10.1 from the DCU Toolkit section of the [developer community](https://developer.hpccube.com/tool/#sdk), extract it under /opt/, and replace the original ROCm 4.0.1 folder under /opt.
-
-```bash
-# Pull the image
-docker pull paddlepaddle/paddle:latest-dev-rocm4.0-miopen2.11
-# Start the container; parameters such as shm-size and device need to be configured
-docker run -it --name paddle-rocm-dev --shm-size=128G \
-     --device=/dev/kfd --device=/dev/dri --group-add video \
-     --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
-     paddlepaddle/paddle:latest-dev-rocm4.0-miopen2.11 /bin/bash
-# Replace the DTK
-# Check that the container correctly recognizes the Hygon DCU devices
-rocm-smi
-# Expected output:
-======================= ROCm System Management Interface =======================
-================================= Concise Info =================================
-GPU  Temp   AvgPwr  SCLK     MCLK    Fan   Perf  PwrCap  VRAM%  GPU%
-0    50.0c  23.0W   1319Mhz  800Mhz  0.0%  auto  300.0W    0%    0%
-1    48.0c  25.0W   1319Mhz  800Mhz  0.0%  auto  300.0W    0%    0%
-2    48.0c  24.0W   1319Mhz  800Mhz  0.0%  auto  300.0W    0%    0%
-3    49.0c  27.0W   1319Mhz  800Mhz  0.0%  auto  300.0W    0%    0%
-================================================================================
-============================= End of ROCm SMI Log ==============================
-```
-
-Before building, check that the following environment variables are set correctly; if they are not, install the corresponding dependencies and export the variables. Taking the official Paddle image as an example:
-
-```bash
-# devtoolset-7 should be on PATH and LD_LIBRARY_PATH; if it is not, run the following command
-source /opt/rh/devtoolset-7/enable
-# cmake 3.16.0 should be on PATH
-export PATH=/opt/cmake-3.16/bin:${PATH}
-# rocm 4.0.1 should be on PATH and LD_LIBRARY_PATH
-export PATH=/opt/rocm/opencl/bin:/opt/rocm/bin:${PATH}
-export LD_LIBRARY_PATH=/opt/rocm/lib:${LD_LIBRARY_PATH}
-# Python 3.7 should be on PATH
-# Note: Python 3.7 in the image is installed via miniconda; load it with conda activate base
-export PATH=/opt/conda/bin:${PATH}
-```
-
-**Step 2**: download the Paddle source code and build it. For the meaning of the CMake build options, see the [build options table](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/install/Tables.html#Compile). To pin a Paddle version, set the PADDLE_VERSION environment variable before building.
-
-```bash
-# Download the source code (develop branch by default)
-git clone -b 2.3.2-dtk-22.10.1 http://developer.hpccube.com/codes/aicomponent/paddle.git
-cd Paddle
-# Create the build directory
-mkdir build && cd build
-# Pin the Paddle version
-export PADDLE_VERSION=2.3.2
-# Run cmake
-export ROCM_PATH=/opt/rocm
-cmake .. -DPY_VERSION=3.7 -DWITH_GPU=OFF -DWITH_ROCM=ON -DWITH_RCCL=ON -DWITH_NCCL=OFF -DWITH_TESTING=ON -DWITH_DISTRIBUTE=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DCMAKE_VERBOSE_MAKEFILE=OFF -DWITH_TP_CACHE=ON -DROCM_PATH=${ROCM_PATH} -DWITH_MKLDNN=OFF
-# Build
-make -j$(nproc)
-```
-
-**Step 3**: install and verify the built wheel package.
-After the build finishes, the generated .whl package can be found under `Paddle/build/python/dist`. Install and verify it as follows:
-
-```bash
-# Install
-python -m pip install -U paddlepaddle_rocm-2.3.2-cp37-cp37m-linux_x86_64.whl
-# Verify
-python -c "import paddle; paddle.utils.run_check()"
-```
-
-## Uninstalling
-
-Uninstall Paddle with the following command:
-
-```
-pip3 uninstall paddlepaddle-rocm
-```
+English | [简体中文](./README_cn.md)
+
+[![Build Status](https://travis-ci.org/PaddlePaddle/Paddle.svg?branch=develop)](https://travis-ci.org/PaddlePaddle/Paddle)
+[![Documentation Status](https://img.shields.io/badge/docs-latest-brightgreen.svg?style=flat)](https://paddlepaddle.org.cn/documentation/docs/en/guides/index_en.html)
+[![Documentation Status](https://img.shields.io/badge/中文文档-最新-brightgreen.svg)](https://paddlepaddle.org.cn/documentation/docs/zh/guides/index_cn.html)
+[![Release](https://img.shields.io/github/release/PaddlePaddle/Paddle.svg)](https://github.com/PaddlePaddle/Paddle/releases)
+[![License](https://img.shields.io/badge/license-Apache%202-blue.svg)](LICENSE)
+
+Welcome to the PaddlePaddle GitHub.
+
+PaddlePaddle, as the first independent R&D deep learning platform in China, has been officially open-sourced to professional communities since 2016. It is an industrial platform with advanced technologies and rich features that cover core deep learning frameworks, basic model libraries, end-to-end development kits, tools & components, and service platforms.
+PaddlePaddle originated from industrial practices with dedication and commitment to industrialization. It has been widely adopted across sectors including manufacturing, agriculture, and enterprise service, serving more than 5.35 million developers and 200,000 companies and generating 670,000 models. With such advantages, PaddlePaddle has helped an increasing number of partners commercialize AI.
+
+## Installation
+
+### Latest PaddlePaddle Release: [v2.4](https://github.com/PaddlePaddle/Paddle/tree/release/2.4)
+
+Our vision is to enable deep learning for everyone via PaddlePaddle.
+Please refer to our [release announcement](https://github.com/PaddlePaddle/Paddle/releases) to track the latest features of PaddlePaddle.
+
+### Install Latest Stable Release:
+
+```
+# CPU
+pip install paddlepaddle
+# GPU
+pip install paddlepaddle-gpu
+```
+
+For more information about installation, please view [Quick Install](https://www.paddlepaddle.org.cn/install/quick).
+
+Now our developers can acquire Tesla V100 online computing resources for free. If you create a program on AI Studio, you will obtain 8 hours per day to train models online. [Click here to start](https://aistudio.baidu.com/aistudio/index).
+
+## FOUR LEADING TECHNOLOGIES
+
+- **Agile Framework for Industrial Development of Deep Neural Networks**
+
+  The PaddlePaddle deep learning framework facilitates development while lowering the technical burden, using a programmable scheme to architect neural networks. It supports both declarative and imperative programming, preserving development flexibility and high runtime performance. Neural architectures can be designed automatically by algorithms, often outperforming those designed by human experts.
+
+- **Support for Ultra-Large-Scale Training of Deep Neural Networks**
+
+  PaddlePaddle has made breakthroughs in ultra-large-scale deep neural network training. It launched the world's first large-scale open-source training platform that supports training deep networks with 100 billion features and trillions of parameters using data sources distributed over hundreds of nodes. PaddlePaddle overcomes the online deep learning challenges of ultra-large-scale models and has achieved real-time model updating with more than 1 trillion parameters. [Click here to learn more](https://github.com/PaddlePaddle/Fleet)
+
+- **High-Performance Inference Engines for Comprehensive Deployment Environments**
+
+  PaddlePaddle is not only compatible with models trained in third-party open-source frameworks, but also offers complete inference products for various production scenarios. Our inference product line includes [Paddle Inference](https://paddle-inference.readthedocs.io/en/master/guides/introduction/index_intro.html): a native inference library for high-performance server and cloud inference; [Paddle Serving](https://github.com/PaddlePaddle/Serving): a service-oriented framework suitable for distributed and pipeline production; [Paddle Lite](https://github.com/PaddlePaddle/Paddle-Lite): an ultra-lightweight inference engine for mobile and IoT environments; and [Paddle.js](https://www.paddlepaddle.org.cn/paddle/paddlejs): a frontend inference engine for browsers and mini-apps. Thanks to extensive optimization for the leading hardware in each scenario, the Paddle inference engines outperform most other mainstream frameworks.
+
+- **Industry-Oriented Models and Libraries with Open Source Repositories**
+
+  PaddlePaddle includes and maintains more than 100 mainstream models that have been practiced and polished in industry for a long time. Some of these models have won major prizes in key international competitions. Meanwhile, PaddlePaddle provides more than 200 pre-training models (some with source code) to facilitate the rapid development of industrial applications. [Click here to learn more](https://github.com/PaddlePaddle/models)
+
+## Documentation
+
+We provide [English](https://www.paddlepaddle.org.cn/documentation/docs/en/guides/index_en.html) and
+[Chinese](https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/index_cn.html) documentation.
+
+- [Guides](https://www.paddlepaddle.org.cn/documentation/docs/en/guides/index_en.html): you might want to start with how to implement deep learning basics with PaddlePaddle.
+- [Practice](https://www.paddlepaddle.org.cn/documentation/docs/zh/tutorial/index_cn.html): once you are familiar with Fluid, the next step is building a more efficient model or inventing your own Operator.
+- [API Reference](https://www.paddlepaddle.org.cn/documentation/docs/en/api/index_en.html): our new API enables much shorter programs.
+- [How to Contribute](https://www.paddlepaddle.org.cn/documentation/docs/en/guides/08_contribution/index_en.html): we appreciate your contributions!
+
+## Communication
+
+- [Github Issues](https://github.com/PaddlePaddle/Paddle/issues): bug reports, feature requests, install issues, usage issues, etc.
+- QQ discussion group: 441226485 (PaddlePaddle).
+- [Forums](https://aistudio.baidu.com/paddle/forum): discuss implementations, research, etc.
+
+## Courses
+
+- [Server Deployments](https://aistudio.baidu.com/aistudio/course/introduce/19084): courses introducing high-performance server deployment via local and remote services.
+- [Edge Deployments](https://aistudio.baidu.com/aistudio/course/introduce/22690): courses introducing edge deployment for mobile, IoT, web, and applets.
+
+## Copyright and License
+
+PaddlePaddle is provided under the [Apache-2.0 license](LICENSE).
<p align="center">
<img align="center" src="doc/imgs/logo.png", width=1600>
</p>
--------------------------------------------------------------------------------
English | [简体中文](./README_cn.md)
[![Build Status](https://travis-ci.org/PaddlePaddle/Paddle.svg?branch=develop)](https://travis-ci.org/PaddlePaddle/Paddle)
[![Documentation Status](https://img.shields.io/badge/docs-latest-brightgreen.svg?style=flat)](https://paddlepaddle.org.cn/documentation/docs/en/guides/index_en.html)
[![Documentation Status](https://img.shields.io/badge/中文文档-最新-brightgreen.svg)](https://paddlepaddle.org.cn/documentation/docs/zh/guides/index_cn.html)
[![Release](https://img.shields.io/github/release/PaddlePaddle/Paddle.svg)](https://github.com/PaddlePaddle/Paddle/releases)
[![License](https://img.shields.io/badge/license-Apache%202-blue.svg)](LICENSE)
Welcome to the PaddlePaddle GitHub.
PaddlePaddle, as the first independent R&D deep learning platform in China, has been officially open-sourced to professional communities since 2016. It is an industrial platform with advanced technologies and rich features that cover core deep learning frameworks, basic model libraries, end-to-end development kits, tools & components as well as service platforms.
PaddlePaddle originated from industrial practices with dedication and commitment to industrialization. It has been widely adopted across sectors including manufacturing, agriculture, and enterprise service, serving more than 4.7 million developers and 180,000 companies and generating 560,000 models. With such advantages, PaddlePaddle has helped an increasing number of partners commercialize AI.
## Installation
### Latest PaddlePaddle Release: [v2.3](https://github.com/PaddlePaddle/Paddle/tree/release/2.3)
Our vision is to enable deep learning for everyone via PaddlePaddle.
Please refer to our [release announcement](https://github.com/PaddlePaddle/Paddle/releases) to track the latest features of PaddlePaddle.
### Install Latest Stable Release:
```
# CPU
pip install paddlepaddle
# GPU
pip install paddlepaddle-gpu
```
For more information about installation, please view [Quick Install](https://www.paddlepaddle.org.cn/install/quick).
Now our developers can acquire Tesla V100 online computing resources for free. If you create a program on AI Studio, you will obtain 8 hours per day to train models online. [Click here to start](https://aistudio.baidu.com/aistudio/index).
## FOUR LEADING TECHNOLOGIES
- **Agile Framework for Industrial Development of Deep Neural Networks**
The PaddlePaddle deep learning framework facilitates development while lowering the technical burden, using a programmable scheme to architect neural networks. It supports both declarative and imperative programming, preserving development flexibility and high runtime performance. Neural architectures can be designed automatically by algorithms, often outperforming those designed by human experts.
- **Support Ultra-Large-Scale Training of Deep Neural Networks**
PaddlePaddle has made breakthroughs in ultra-large-scale deep neural network training. It launched the world's first large-scale open-source training platform that supports training deep networks with 100 billion features and trillions of parameters using data sources distributed over hundreds of nodes. PaddlePaddle overcomes the online deep learning challenges of ultra-large-scale models and has achieved real-time model updating with more than 1 trillion parameters.
[Click here to learn more](https://github.com/PaddlePaddle/Fleet)
- **High-Performance Inference Engines for Comprehensive Deployment Environments**
PaddlePaddle is not only compatible with models trained in third-party open-source frameworks, but also offers complete inference products for various production scenarios. Our inference product line includes [Paddle Inference](https://paddle-inference.readthedocs.io/en/master/guides/introduction/index_intro.html): a native inference library for high-performance server and cloud inference; [Paddle Serving](https://github.com/PaddlePaddle/Serving): a service-oriented framework suitable for distributed and pipeline production; [Paddle Lite](https://github.com/PaddlePaddle/Paddle-Lite): an ultra-lightweight inference engine for mobile and IoT environments; and [Paddle.js](https://www.paddlepaddle.org.cn/paddle/paddlejs): a frontend inference engine for browsers and mini-apps. Thanks to extensive optimization for the leading hardware in each scenario, the Paddle inference engines outperform most other mainstream frameworks.
- **Industry-Oriented Models and Libraries with Open Source Repositories**
PaddlePaddle includes and maintains more than 100 mainstream models that have been practiced and polished in industry for a long time. Some of these models have won major prizes in key international competitions. Meanwhile, PaddlePaddle provides more than 200 pre-training models (some with source code) to facilitate the rapid development of industrial applications.
[Click here to learn more](https://github.com/PaddlePaddle/models)
## Documentation
We provide [English](https://www.paddlepaddle.org.cn/documentation/docs/en/guides/index_en.html) and
[Chinese](https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/index_cn.html) documentation.
- [Guides](https://www.paddlepaddle.org.cn/documentation/docs/en/guides/index_en.html)
You might want to start from how to implement deep learning basics with PaddlePaddle.
- [Practice](https://www.paddlepaddle.org.cn/documentation/docs/zh/tutorial/index_cn.html)
By now you are familiar with Fluid; the next step is building a more efficient model or inventing your own Operator.
- [API Reference](https://www.paddlepaddle.org.cn/documentation/docs/en/api/index_en.html)
Our new API enables much shorter programs.
- [How to Contribute](https://www.paddlepaddle.org.cn/documentation/docs/en/guides/08_contribution/index_en.html)
We appreciate your contributions!
## Communication
- [Github Issues](https://github.com/PaddlePaddle/Paddle/issues): bug reports, feature requests, install issues, usage issues, etc.
- QQ discussion group: 441226485 (PaddlePaddle).
- [Forums](https://aistudio.baidu.com/paddle/forum): discuss implementations, research, etc.
## Courses
- [Server Deployments](https://aistudio.baidu.com/aistudio/course/introduce/19084): courses introducing high-performance server deployment via local and remote services.
- [Edge Deployments](https://aistudio.baidu.com/aistudio/course/introduce/22690): courses introducing edge deployment for mobile, IoT, web, and applets.
## Copyright and License
PaddlePaddle is provided under the [Apache-2.0 license](LICENSE).
@@ -15,11 +15,11 @@
 Welcome to the PaddlePaddle GitHub.

-PaddlePaddle, built on Baidu's years of deep learning research and business applications, is China's first self-developed, feature-complete, open-source industrial-grade deep learning platform, integrating core deep learning training and inference frameworks, basic model libraries, end-to-end development kits, and rich tool components. To date, PaddlePaddle has 4.77 million developers, serves 180,000 companies, and 560,000 models have been created on the open-source platform. PaddlePaddle helps developers quickly realize AI ideas and bring AI services online, empowering more and more industries to upgrade with AI.
+PaddlePaddle, built on Baidu's years of deep learning research and business applications, is China's first self-developed, feature-complete, open-source industrial-grade deep learning platform, integrating core deep learning training and inference frameworks, basic model libraries, end-to-end development kits, and rich tool components. To date, PaddlePaddle has 5.35 million developers, serves 200,000 companies, and 670,000 models have been created on the open-source platform. PaddlePaddle helps developers quickly realize AI ideas and bring AI services online, empowering more and more industries to upgrade with AI.

 ## Installation

-### Latest PaddlePaddle version: [v2.3](https://github.com/PaddlePaddle/Paddle/tree/release/2.3)
+### Latest PaddlePaddle version: [v2.4](https://github.com/PaddlePaddle/Paddle/tree/release/2.4)

 To follow the latest PaddlePaddle features, see our [release notes](https://github.com/PaddlePaddle/Paddle/releases).

@@ -63,27 +63,20 @@ PaddlePaddle users can claim **free Tesla V100 online compute resources** to train models...
 We provide [English](https://www.paddlepaddle.org.cn/documentation/docs/en/guides/index_en.html) and
 [Chinese](https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/index_cn.html) documentation.

-- [Guides](https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/index_cn.html)
-  Perhaps you would like to start learning PaddlePaddle from deep learning basics
-- [Practice](https://www.paddlepaddle.org.cn/documentation/docs/zh/tutorial/index_cn.html)
-- [API Reference](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/index_cn.html)
-  The new API enables shorter and cleaner programs
-- [How to Contribute](https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/08_contribution/index_cn.html)
-  We welcome your contributions!
+- [Guides](https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/index_cn.html): perhaps you would like to start learning PaddlePaddle from deep learning basics
+- [Practice](https://www.paddlepaddle.org.cn/documentation/docs/zh/tutorial/index_cn.html): build your model with PaddlePaddle and complete deep learning tasks more efficiently
+- [API Reference](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/index_cn.html): the new API enables shorter and cleaner programs
+- [How to Contribute](https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/08_contribution/index_cn.html): we welcome your contributions!

 ## Communication and Feedback

 - You are welcome to submit questions, reports, and suggestions via [Github Issues](https://github.com/PaddlePaddle/Paddle/issues)
 - QQ group: 441226485 (PaddlePaddle)
 - [Forum](https://aistudio.baidu.com/paddle/forum): share the problems and experience you encounter with PaddlePaddle and help build a healthy forum atmosphere

 ## Courses
@@ -6,7 +6,7 @@ if(WITH_NV_JETSON)
   add_definitions(-DWITH_NV_JETSON)
   set(paddle_known_gpu_archs "53 62 72")
   set(paddle_known_gpu_archs10 "53 62 72")
-  set(paddle_known_gpu_archs11 "53 62 72")
+  set(paddle_known_gpu_archs11 "53 62 72 87")
 elseif(NEW_RELEASE_ALL)
   message("Using New Release Strategy - All Arches Packge")
   add_definitions(-DNEW_RELEASE_ALL)
@@ -166,11 +166,15 @@ function(select_nvcc_arch_flags out_variable)
   elseif(${CUDA_ARCH_NAME} STREQUAL "Turing")
     set(cuda_arch_bin "75")
   elseif(${CUDA_ARCH_NAME} STREQUAL "Ampere")
+    if(WITH_NV_JETSON)
+      set(cuda_arch_bin "87")
+    else()
       if(${CMAKE_CUDA_COMPILER_VERSION} LESS 11.1) # CUDA 11.0
         set(cuda_arch_bin "80")
       elseif(${CMAKE_CUDA_COMPILER_VERSION} LESS 12.0) # CUDA 11.1+
         set(cuda_arch_bin "80 86")
       endif()
+    endif()
   elseif(${CUDA_ARCH_NAME} STREQUAL "All")
     set(cuda_arch_bin ${paddle_known_gpu_archs})
   elseif(${CUDA_ARCH_NAME} STREQUAL "Auto")
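For context, compute capability 8.7 is the Jetson Orin family, which is what the new `WITH_NV_JETSON` branch targets with `set(cuda_arch_bin "87")`. A small host-side sketch, assumed rather than taken from this commit, that reports the capability the selected arch must match:

```cpp
// query_cc.cc -- hypothetical host-side check, compiled/linked against the
// CUDA runtime (e.g. `nvcc query_cc.cc`). A Jetson Orin reports 8.7,
// matching the "87" arch selected above.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int count = 0;
  if (cudaGetDeviceCount(&count) != cudaSuccess) return 1;
  for (int i = 0; i < count; ++i) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    std::printf("device %d: %s, compute capability %d.%d\n",
                i, prop.name, prop.major, prop.minor);
  }
  return 0;
}
```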
@@ -31,6 +31,11 @@ if(LINUX)
     message("cuda 11.7+ already support lazy module loading")
     return()
   endif()
+  if(${CUDA_VERSION} VERSION_LESS "12.0" AND ${CMAKE_CXX_COMPILER_VERSION}
+                                             VERSION_GREATER_EQUAL 12.0)
+    message("cuda less than 12.0 doesn't support gcc12")
+    return()
+  endif()
   message(
     "for cuda before 11.7, libcudart.so must be used for the lazy module loading trick to work, instead of libcudart_static.a"
@@ -14,7 +14,7 @@
 include(ExternalProject)

-# Note(zhouwei): extern_cub has code __FILE_, If the path of extern_cub is changed,
+# extern_cub has code __FILE_, If the path of extern_cub is changed,
 # it will effect about 30+ cu files sccache hit and slow compile speed on windows.
 # Therefore, a fixed CUB_PATH will be input to increase the sccache hit rate.
 set(CUB_PATH
@@ -25,7 +25,7 @@ set(CUB_PREFIX_DIR ${CUB_PATH})
 set(CUB_REPOSITORY ${GIT_URL}/NVlabs/cub.git)
 if(${CMAKE_CUDA_COMPILER_VERSION} GREATER_EQUAL 11.6)
-  # cuda_11.6.2_511.65's own cub is 1.15.0, which will cause compiling error in windows.
+  # cuda_11.6/11.7/11.8's own cub is 1.15.0, which will cause compiling error in windows.
   set(CUB_TAG 1.16.0)
   # cub 1.16.0 is not compitable with current thrust version
   add_definitions(-DTHRUST_IGNORE_CUB_VERSION_CHECK)
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
include(ExternalProject)
set(CUTLASS_PREFIX_DIR ${THIRD_PARTY_PATH}/cutlass)
set(CUTLASS_REPOSITORY https://github.com/NVIDIA/cutlass.git)
set(CUTLASS_TAG v2.9.1)
include_directories("${THIRD_PARTY_PATH}/cutlass/src/extern_cutlass/")
include_directories("${THIRD_PARTY_PATH}/cutlass/src/extern_cutlass/include/")
include_directories(
"${THIRD_PARTY_PATH}/cutlass/src/extern_cutlass/tools/util/include/")
add_definitions("-DPADDLE_WITH_CUTLASS")
ExternalProject_Add(
extern_cutlass
${EXTERNAL_PROJECT_LOG_ARGS} ${SHALLOW_CLONE}
GIT_REPOSITORY ${CUTLASS_REPOSITORY}
GIT_TAG "${CUTLASS_TAG}"
PREFIX ${CUTLASS_PREFIX_DIR}
UPDATE_COMMAND ""
CONFIGURE_COMMAND ""
BUILD_COMMAND ""
INSTALL_COMMAND ""
TEST_COMMAND "")
add_library(cutlass INTERFACE)
add_dependencies(cutlass extern_cutlass)
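Since CUTLASS is header-only and the ExternalProject above only clones it and registers include directories, a translation unit that includes a CUTLASS header and uses one of its types is enough to smoke-test the wiring. A hypothetical sketch (file name and test invented):

```cpp
// cutlass_smoke.cc -- hypothetical smoke test compiled with the include
// directories registered above. Exercises CUTLASS's FP16 scalar type.
#include <iostream>

#include "cutlass/numeric_types.h"

int main() {
  cutlass::half_t x(1.5f);         // half-precision scalar from CUTLASS
  x = x + cutlass::half_t(0.25f);  // arithmetic defined by the library
  std::cout << "half_t: " << static_cast<float>(x) << std::endl;  // 1.75
  return 0;
}
```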
@@ -31,6 +31,17 @@ set(GLOO_LIBRARIES
     "${GLOO_INSTALL_DIR}/lib/libgloo.a"
     CACHE FILEPATH "gloo library." FORCE)
+set(GLOO_PATCH_COMMAND "")
+if(WITH_GPU)
+  if(${CMAKE_CUDA_COMPILER_VERSION} LESS 12.0 AND ${CMAKE_CXX_COMPILER_VERSION}
+                                                  VERSION_GREATER 12.0)
+    file(TO_NATIVE_PATH ${PADDLE_SOURCE_DIR}/patches/gloo/device.cc.patch
+         native_dst)
+    set(GLOO_PATCH_COMMAND patch -d ${GLOO_SOURCE_DIR}/gloo/transport/tcp <
+                           ${native_dst})
+  endif()
+endif()
 include_directories(${GLOO_INCLUDE_DIR})

 if(WITH_ASCEND OR WITH_ASCEND_CL)
@@ -59,6 +70,7 @@ else()
     GIT_TAG ${GLOO_TAG}
     PREFIX "${GLOO_PREFIX_DIR}"
     UPDATE_COMMAND ""
+    PATCH_COMMAND ${GLOO_PATCH_COMMAND}
     CONFIGURE_COMMAND ""
     BUILD_COMMAND
       mkdir -p ${GLOO_SOURCE_DIR}/build && cd ${GLOO_SOURCE_DIR}/build && cmake
@@ -250,6 +250,12 @@ function(build_protobuf TARGET_NAME BUILD_FOR_HOST)
   else()
     set(PROTOBUF_REPOSITORY ${GIT_URL}/protocolbuffers/protobuf.git)
     set(PROTOBUF_TAG 9f75c5aa851cd877fb0d93ccc31b8567a6706546)
+    if(WITH_GPU)
+      if(${CMAKE_CUDA_COMPILER_VERSION} LESS 12.0
+         AND ${CMAKE_CXX_COMPILER_VERSION} VERSION_GREATER 12.0)
+        set(PROTOBUF_TAG 2dc747c574b68a808ea4699d26942c8132fe2b09)
+      endif()
+    endif()
   endif()

 if(WITH_ARM_BRPC)
   set(ARM_PROTOBUF_URL
@@ -322,6 +328,12 @@ elseif(WITH_ARM_BRPC)
   set(PROTOBUF_VERSION 3.7.1-baidu-ee-common)
 else()
   set(PROTOBUF_VERSION 3.1.0)
+  if(WITH_GPU)
+    if(${CMAKE_CUDA_COMPILER_VERSION} LESS 12.0
+       AND ${CMAKE_CXX_COMPILER_VERSION} VERSION_GREATER 12.0)
+      set(PROTOBUF_VERSION 3.16.0)
+    endif()
+  endif()
 endif()

 if(NOT PROTOBUF_FOUND)
@@ -25,6 +25,19 @@ set(WARPCTC_INSTALL_DIR ${THIRD_PARTY_PATH}/install/warpctc)
 set(WARPCTC_REPOSITORY ${GIT_URL}/baidu-research/warp-ctc.git)
 set(WARPCTC_TAG 37ece0e1bbe8a0019a63ac7e6462c36591c66a5b)
+set(WARPCTC_SOURCE_DIR ${THIRD_PARTY_PATH}/warpctc/src/extern_warpctc)
+set(WARPCTC_PATCH_COMMAND "")
+set(WARPCTC_CCBIN_OPTION "")
+if(NOT WIN32 AND WITH_GPU)
+  if(${CMAKE_CUDA_COMPILER_VERSION} LESS 12.0 AND ${CMAKE_CXX_COMPILER_VERSION}
+                                                  VERSION_GREATER 12.0)
+    file(TO_NATIVE_PATH
+         ${PADDLE_SOURCE_DIR}/patches/warpctc/CMakeLists.txt.patch native_src)
+    set(WARPCTC_PATCH_COMMAND patch -d ${WARPCTC_SOURCE_DIR} < ${native_src})
+    set(WARPCTC_CCBIN_OPTION -DCCBIN_COMPILER=${CCBIN_COMPILER})
+  endif()
+endif()
 set(WARPCTC_INCLUDE_DIR
     "${WARPCTC_INSTALL_DIR}/include"
     CACHE PATH "Warp-ctc Directory" FORCE)
@@ -112,7 +125,7 @@ else()
     GIT_TAG ${WARPCTC_TAG}
     PREFIX ${WARPCTC_PREFIX_DIR}
     UPDATE_COMMAND ""
-    PATCH_COMMAND ""
+    PATCH_COMMAND ${WARPCTC_PATCH_COMMAND}
     #BUILD_ALWAYS 1
     CMAKE_ARGS -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER}
                -DCMAKE_C_COMPILER=${CMAKE_C_COMPILER}
@@ -132,7 +145,9 @@ else()
                -DBUILD_TESTS=OFF
                -DCMAKE_POSITION_INDEPENDENT_CODE=ON
                -DCMAKE_BUILD_TYPE=${THIRD_PARTY_BUILD_TYPE}
+               -DCUDA_TOOLKIT_ROOT_DIR=${CUDA_TOOLKIT_ROOT_DIR}
                ${EXTERNAL_OPTIONAL_ARGS}
+               ${WARPCTC_CCBIN_OPTION}
     CMAKE_CACHE_ARGS
       -DCMAKE_BUILD_TYPE:STRING=${THIRD_PARTY_BUILD_TYPE}
       -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON
@@ -15,12 +15,14 @@ set(NEUWARE_LIB_DIR ${NEUWARE_HOME}/lib64)
 include_directories(${NEUWARE_INCLUDE_DIR})

 set(CNNL_LIB ${NEUWARE_LIB_DIR}/libcnnl.so)
+set(MLUOP_LIB ${NEUWARE_LIB_DIR}/libmluops.so)
 set(CNRT_LIB ${NEUWARE_LIB_DIR}/libcnrt.so)
 set(CNDRV_LIB ${NEUWARE_LIB_DIR}/libcndrv.so)
 set(CNPAPI_LIB ${NEUWARE_LIB_DIR}/libcnpapi.so)

 generate_dummy_static_lib(LIB_NAME "neuware_lib" GENERATOR "neuware.cmake")
-set(NEUWARE_LIB_DEPS ${CNNL_LIB} ${CNRT_LIB} ${CNDRV_LIB} ${CNPAPI_LIB})
+set(NEUWARE_LIB_DEPS ${CNNL_LIB} ${MLUOP_LIB} ${CNRT_LIB} ${CNDRV_LIB}
+                     ${CNPAPI_LIB})

 if(WITH_CNCL)
   message(STATUS "Compile with CNCL!")
@@ -317,8 +317,7 @@ if(WITH_ONNXRUNTIME)
 endif()

 if(WITH_GPU)
-  if(${CMAKE_CUDA_COMPILER_VERSION} LESS 11.0 OR ${CMAKE_CUDA_COMPILER_VERSION}
-     GREATER_EQUAL 11.6)
+  if(${CMAKE_CUDA_COMPILER_VERSION} LESS 11.0)
     include(external/cub) # download cub
     list(APPEND third_party_deps extern_cub)
   endif()
@@ -492,4 +491,14 @@ if(WITH_CUSPARSELT)
   list(APPEND third_party_deps extern_cusparselt)
 endif()

+if(WITH_GPU
+   AND NOT WITH_ARM
+   AND NOT WIN32
+   AND NOT APPLE)
+  if(${CMAKE_CUDA_COMPILER_VERSION} GREATER_EQUAL 11.0)
+    include(external/cutlass) # download cutlass
+    list(APPEND third_party_deps extern_cutlass)
+  endif()
+endif()

 add_custom_target(third_party ALL DEPENDS ${third_party_deps})
.timestamp
*.o
*.a
.svn
GPATH
GRTAGS
GTAGS
.idl*
*~
*.pyc
*.pb.cc
*.pb.h
*_pb2.py
output/
google/
Makefile
log/
.pptool_config
hf/
build
issue.info
ar
g++
gcc
ld
ld-linux-x86-64.so.2
x86_64-scm-linux-gnu/
.lint.*.md5
.idea/
.test_env
Paddle_wrap.cxx
Paddle_wrap.h
paddle.py
py_paddle-*.whl
py_paddle/paddle.py
.py_paddle_extra_link_flags
HPPL_ERROR_LOG
unittest.list
proto
dist
setup.py
generated/**
autocodegen/generated_example/
\ No newline at end of file
fluid_generated/**
eager_generated/**
\ No newline at end of file
@@ -217,18 +217,20 @@ RunCustomOpNode::operator()(
   VLOG(6) << "Prepare Grad outputs for size: " << grad_outputs_names.size();
   for (size_t i = 0; i < OutputMeta().size(); i++) {
     if (map[0][0].find(i) != map[0][0].end()) {
+      int grad_output_idx = map[0][0][i];
       VLOG(7) << "Insert grad outputs: " << i
-              << " with size: " << OutputMeta()[i].size()
-              << " to tmp_outputs: " << map[0][0][i];
-      for (size_t j = 0; j < OutputMeta()[i].size(); j++) {
-        outs[i].emplace_back(/* init it incase of copy nullptr of shared_ptr */
+              << " with size: " << OutputMeta()[grad_output_idx].size()
+              << " to tmp_outputs: " << grad_output_idx;
+      for (size_t j = 0; j < OutputMeta()[grad_output_idx].size(); j++) {
+        outs[grad_output_idx]
+            .emplace_back(/* init it incase of copy nullptr of shared_ptr */
                 std::make_shared<phi::DenseTensor>(
                     phi::DataType::UNDEFINED),
                 egr::Controller::Instance().GenerateUniqueName(
                     "custom_tmp_grad"));
-        egr::EagerUtils::autograd_meta(&(outs[i][j]));
+        egr::EagerUtils::autograd_meta(&(outs[grad_output_idx][j]));
       }
-      tmp_outs[map[0][0][i]] = outs[i];
+      tmp_outs[grad_output_idx] = outs[grad_output_idx];
     }
   }
   for (size_t i = 0; i < tmp_outs.size(); i++) {
@@ -408,17 +410,19 @@ RunCustomOpDoubleGradNode::operator()(
   for (size_t i = 0; i < OutputMeta().size(); i++) {
     if (map[1][0].find(i) != map[1][0].end()) {
+      int grad_output_idx = map[1][0][i];
       VLOG(7) << "Insert grad outputs: " << i
-              << " with size: " << OutputMeta()[i].size()
-              << " to tmp_outputs: " << map[1][0][i];
-      for (size_t j = 0; j < OutputMeta()[i].size(); j++) {
-        outs[i].emplace_back(/* init it incase of copy nullptr of shared_ptr */
+              << " with size: " << OutputMeta()[grad_output_idx].size()
+              << " to tmp_outputs: " << grad_output_idx;
+      for (size_t j = 0; j < OutputMeta()[grad_output_idx].size(); j++) {
+        outs[grad_output_idx]
+            .emplace_back(/* init it incase of copy nullptr of shared_ptr */
                 std::make_shared<phi::DenseTensor>(
                     phi::DataType::UNDEFINED),
                 egr::Controller::Instance().GenerateUniqueName(
                     "custom_tmp_grad"));
       }
-      tmp_outs[map[1][0][i]] = outs[i];
+      tmp_outs[grad_output_idx] = outs[grad_output_idx];
     }
   }
   for (size_t i = 0; i < tmp_outs.size(); i++) {
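The fix in both hunks is the same index-remapping correction: `i` walks the forward outputs while `map[...][0][i]` names the gradient slot, so the sizes, the destinations, and the autograd metadata must all use the mapped index. A standalone sketch of the pattern with hypothetical names, not Paddle API:

```cpp
// remap_demo.cc -- hypothetical illustration of the bug class fixed above:
// sizing or filling by `i` instead of the mapped slot scrambles the slots
// whenever the map is not the identity.
#include <cstdio>
#include <map>
#include <vector>

int main() {
  std::vector<int> slot_sizes = {2, 3, 1};                 // tensors per gradient slot
  std::map<int, int> grad_map = {{0, 2}, {1, 0}, {2, 1}};  // forward idx -> slot
  std::vector<std::vector<int>> outs(slot_sizes.size());
  for (int i = 0; i < static_cast<int>(slot_sizes.size()); ++i) {
    int grad_output_idx = grad_map[i];
    // Correct: both the loop bound and the destination use the mapped slot.
    for (int j = 0; j < slot_sizes[grad_output_idx]; ++j) {
      outs[grad_output_idx].push_back(j);
    }
  }
  for (size_t k = 0; k < outs.size(); ++k) {
    std::printf("slot %zu holds %zu entries\n", k, outs[k].size());
  }
  return 0;
}
```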
.tensor_util.cu
.data_type_transform.cu
\ No newline at end of file
@@ -123,6 +123,7 @@ message BuildStrategy {
   optional bool allow_cuda_graph_capture = 14 [ default = false ];
   optional int32 reduce_strategy = 15 [ default = 0 ];
   optional bool fuse_gemm_epilogue = 16 [ default = false ];
+  optional string debug_graphviz_path = 17;
 }

 message ExecutionStrategy {
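For a proto2 `optional string` field like the one added here, protoc generates set/get/has accessors on the message. A hypothetical consumer-side sketch; the generated header name and package namespace are assumptions, not verified against the Paddle build:

```cpp
// Hypothetical use of the generated accessors for the new field #17.
#include <iostream>

#include "distributed_strategy.pb.h"  // assumed generated header name

int main() {
  paddle::distributed::BuildStrategy strategy;  // assumed package/namespace
  strategy.set_debug_graphviz_path("./viz");    // write the optional string
  if (strategy.has_debug_graphviz_path()) {     // proto2 presence check
    std::cout << strategy.debug_graphviz_path() << std::endl;
  }
  return 0;
}
```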
@@ -134,7 +134,59 @@ struct DLDeviceVisitor
 };
 }  // namespace internal

-DLPackTensor::DLPackTensor(const Tensor &tensor, LaneType lanes) {
+struct PaddleDLMTensor {
+  phi::DenseTensor handle;
+  DLManagedTensor tensor;
+};
+
+void deleter(DLManagedTensor *arg) {
+  delete[] arg->dl_tensor.shape;
+  delete[] arg->dl_tensor.strides;
+  delete static_cast<PaddleDLMTensor *>(arg->manager_ctx);
+}
+
+DLManagedTensor *toDLPack(const phi::DenseTensor &src) {
+  PaddleDLMTensor *pdDLMTensor(new PaddleDLMTensor);
+  pdDLMTensor->handle = const_cast<phi::DenseTensor &>(src);
+  pdDLMTensor->tensor.manager_ctx = pdDLMTensor;
+  pdDLMTensor->tensor.deleter = &deleter;
+  pdDLMTensor->tensor.dl_tensor.data = const_cast<void *>(src.data());
+
+  // init ndim
+  using DimType = decltype(pdDLMTensor->tensor.dl_tensor.ndim);  // int
+  pdDLMTensor->tensor.dl_tensor.ndim = static_cast<DimType>(src.dims().size());
+  DimType ndim = pdDLMTensor->tensor.dl_tensor.ndim;
+
+  // init shape
+  auto shape = new int64_t[ndim];
+  for (DimType i = 0; i < ndim; ++i) {
+    shape[i] = src.dims()[i];
+  }
+  pdDLMTensor->tensor.dl_tensor.shape = shape;
+
+  // init stride
+  auto strides = new int64_t[ndim];
+  for (DimType i = 0; i < ndim; ++i) {
+    strides[i] = 1;
+  }
+  for (DimType i = ndim - 2; i >= 0; --i) {
+    strides[i] = shape[i + 1] * strides[i + 1];
+  }
+  pdDLMTensor->tensor.dl_tensor.strides = strides;
+
+  // init device, DLDevice type with device_type and device_id
+  auto place = src.place();
+  pdDLMTensor->tensor.dl_tensor.device =
+      paddle::platform::VisitPlace(place, internal::DLDeviceVisitor());
+
+  pdDLMTensor->tensor.dl_tensor.dtype = internal::GetDLDataTypeFromTypeIndex(
+      framework::TransToProtoVarType(src.dtype()));
+
+  pdDLMTensor->tensor.dl_tensor.byte_offset = 0;
+  return &(pdDLMTensor->tensor);
+}
+
+DLPackTensor::DLPackTensor(const phi::DenseTensor &tensor, LaneType lanes) {
   // init data, data buffer
   t_.data = const_cast<void *>(tensor.data());
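The new `toDLPack` follows the standard DLPack handoff: the producer heap-allocates shape/strides plus a manager struct and packs a deleter, and the consumer calls that deleter exactly once when done. A self-contained sketch of the same contract using only `dlpack.h`, with all names hypothetical:

```cpp
// dlpack_contract.cc -- hypothetical standalone demo of the ownership
// protocol that PaddleDLMTensor/deleter/toDLPack implement above.
#include <cstdint>
#include <cstdio>

#include <dlpack/dlpack.h>

struct Holder {  // producer-side bookkeeping, analogous to PaddleDLMTensor
  std::int64_t shape[1];
  float data[4];
  DLManagedTensor managed;
};

static void Deleter(DLManagedTensor *arg) {
  delete static_cast<Holder *>(arg->manager_ctx);  // frees shape and data too
}

DLManagedTensor *Produce() {  // producer side, like toDLPack
  auto *h = new Holder{{4}, {1.f, 2.f, 3.f, 4.f}, {}};
  h->managed.dl_tensor.data = h->data;
  h->managed.dl_tensor.device = {kDLCPU, 0};
  h->managed.dl_tensor.ndim = 1;
  h->managed.dl_tensor.dtype = {kDLFloat, 32, 1};
  h->managed.dl_tensor.shape = h->shape;
  h->managed.dl_tensor.strides = nullptr;  // nullptr means compact row-major
  h->managed.dl_tensor.byte_offset = 0;
  h->managed.manager_ctx = h;
  h->managed.deleter = &Deleter;
  return &h->managed;
}

int main() {
  DLManagedTensor *t = Produce();
  std::printf("ndim=%d first=%g\n", t->dl_tensor.ndim,
              static_cast<float *>(t->dl_tensor.data)[0]);
  t->deleter(t);  // consumer releases exactly once
  return 0;
}
```

Note that the Paddle version keeps a `phi::DenseTensor` handle alive inside the manager struct for the same reason the sketch keeps `data` inside `Holder`: the buffer must outlive the handoff until the deleter runs.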