diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
index c72a5749c52ac97bca71c672ef5295d303d22b05..f9ba8cf65f3e3104dd061c178066ec8247811f33 100644
--- a/CODE_OF_CONDUCT.md
+++ b/CODE_OF_CONDUCT.md
@@ -1,9 +1,9 @@
-# Microsoft Open Source Code of Conduct
-
-This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
-
-Resources:
-
-- [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/)
-- [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/)
-- Contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with questions or concerns
+# Microsoft Open Source Code of Conduct
+
+This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
+
+Resources:
+
+- [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/)
+- [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/)
+- Contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with questions or concerns
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 70f39bd7f81f6b8fcedbb68778cbc3caf6c2e6c4..f6e5f39869ebfa495b02328c168f894a5f1620cd 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -65,3 +65,31 @@ Conduct](https://opensource.microsoft.com/codeofconduct/). For more information
[Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact
[opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or
comments.
+
+## New Feature Contribution Guidelines
+Unlike a bug fix or an improvement to an existing feature (where you usually submit a PR directly and we review it), adding a new feature to DeepSpeed requires several steps: (1) proposal and discussion, (2) implementation and verification, (3) release and maintenance. This guideline applies to all new feature contributions; core DeepSpeed team members may complete step 1 internally for their own contributions.
+
+### Step 1: proposal and discussion
+We ask that you first post your intended feature in an issue. This issue needs to include:
+
+* A description of the proposed feature.
+* An explanation of why it will be useful to DeepSpeed users.
+* A rough design of how you plan to implement the feature inside DeepSpeed.
+* (Important) Results or planned experiments to demonstrate the effectiveness and correctness of the feature.
+ * If this is a general feature applicable to different tasks, we require testing it on at least one CV task (e.g., [CIFAR](https://www.deepspeed.ai/tutorials/cifar-10/)) and one NLP task (e.g., [SQuAD](https://www.deepspeed.ai/tutorials/bert-finetuning/)). If this is a feature for one kind of task only, it is fine to just test on the specific task.
+  * If the feature only affects performance and does not affect training convergence, we require testing on a fraction of training to demonstrate that the training/validation losses are consistent with the baseline, and that the performance is better than the baseline.
+  * If the feature does affect training convergence, we require testing the whole training run to demonstrate that the feature achieves better or on-par final model quality and training performance compared to the baseline.
+
+Based on the issue discussion, we will evaluate the merit of the new feature and decide whether to accept or decline the proposal. Once the proposal is accepted and the design and implementation plan are confirmed, we are ready for step 2.
+
+### Step 2: implementation and verification
+The contributor then implements the feature, and the DeepSpeed team provides guidance and help as needed. The required deliverables include:
+
+* A PR to [microsoft/DeepSpeed](https://github.com/microsoft/DeepSpeed) including (1) the feature implementation, (2) unit tests, (3) documentation, and (4) a tutorial.
+* A PR to [microsoft/DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples) or [microsoft/Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed) with examples of how to use the feature (these correspond to the experiments planned in the proposal).
+* In the implementation (code, documentation, tutorial), we require the feature author to record their GitHub username as a contact method for future questions/maintenance.
+
+After receiving the PRs, we will review them and merge them after necessary tests/fixes.
+
+### Step 3: release and maintenance
+After the PRs are merged, we will announce the feature on our website (with credit to the feature author). We ask the feature author to commit to the maintenance of the feature.
diff --git a/LICENSE b/LICENSE
index 3d8b93bc7987d14c848448c089e2ae15311380d7..9e841e7a26e4eb057b24511e7b92d42b257a80e5 100644
--- a/LICENSE
+++ b/LICENSE
@@ -1,21 +1,21 @@
- MIT License
-
- Copyright (c) Microsoft Corporation.
-
- Permission is hereby granted, free of charge, to any person obtaining a copy
- of this software and associated documentation files (the "Software"), to deal
- in the Software without restriction, including without limitation the rights
- to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
- copies of the Software, and to permit persons to whom the Software is
- furnished to do so, subject to the following conditions:
-
- The above copyright notice and this permission notice shall be included in all
- copies or substantial portions of the Software.
-
- THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
- IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
- FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
- AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
- LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
- OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- SOFTWARE
+ MIT License
+
+ Copyright (c) Microsoft Corporation.
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE
diff --git a/MANIFEST.in b/MANIFEST.in
index 53fcc885090ede17210ec40060c5c8aea8759e98..a918b9286d530f57c67f5d3f80b9245bab541a3e 100644
--- a/MANIFEST.in
+++ b/MANIFEST.in
@@ -1,4 +1,4 @@
include *.txt README.md
recursive-include requirements *.txt
-recursive-include deepspeed *.cpp *.h *.cu *.tr *.cuh *.cc
+recursive-include deepspeed *.cpp *.h *.cu *.hip *.tr *.cuh *.cc *.json
recursive-include csrc *.cpp *.h *.cu *.tr *.cuh *.cc
diff --git a/MANIFEST_win.in b/MANIFEST_win.in
new file mode 100644
index 0000000000000000000000000000000000000000..ddfe73e0b4185fc2814d758d15d576113297b684
--- /dev/null
+++ b/MANIFEST_win.in
@@ -0,0 +1,8 @@
+include *.txt README.md
+recursive-include requirements *.txt
+
+# this is for Windows only
+recursive-include deepspeed *.tr
+recursive-exclude deepspeed/ops/csrc *.cpp *.h *.cu *.cuh *.cc
+prune csrc
+prune op_builder
diff --git a/README.md b/README.md
old mode 100755
new mode 100644
index c7bde12dd0ea86d008a387e8b7ca810355fff232..aafbbe5e79b470b12edc7d97e8c1c85ca7caf050
--- a/README.md
+++ b/README.md
@@ -2,9 +2,28 @@
[](https://pypi.org/project/deepspeed/)
[](https://deepspeed.readthedocs.io/en/latest/?badge=latest)
[](https://github.com/Microsoft/DeepSpeed/blob/master/LICENSE)
-[](https://hub.docker.com/r/deepspeed/deepspeed)
-### 03/2021: DeepSpeed is hiring! Come join us: [SDE 2](https://careers.microsoft.com/us/en/job/1013160/Software-Engineer-2), [Sr. SDE](https://careers.microsoft.com/us/en/job/1017151/Senior-Software-Engineer), [Sr. Researcher](https://careers.microsoft.com/us/en/job/1016440/Senior-Researcher)
+
+

+

+
+
+
+## Latest News
+* [2022/03/21] [Supporting efficient large model training on AMD Instinct GPUs with DeepSpeed](https://cloudblogs.microsoft.com/opensource/2022/03/21/supporting-efficient-large-model-training-on-amd-instinct-gpus-with-deepspeed/)
+* [2022/03/07] [Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam](https://www.deepspeed.ai/tutorials/zero-one-adam/)
+* [2022/01/19] [DeepSpeed: Advancing MoE inference and training to power next-generation AI scale](https://www.microsoft.com/en-us/research/blog/deepspeed-advancing-moe-inference-and-training-to-power-next-generation-ai-scale/)
+ * [Mixture of Experts (MoE) for NLG tutorial](https://www.deepspeed.ai/tutorials/mixture-of-experts-nlg/).
+ * [Mixture of Experts (MoE) Inference tutorial](https://www.deepspeed.ai/tutorials/moe-inference-tutorial).
+* [2021/11/15] [Autotuning: Automatically discover the optimal DeepSpeed configuration that delivers good training speed](https://www.deepspeed.ai/news/2021/11/15/autotuning.html)
+* [2021/10/11] [Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World’s Largest and Most Powerful Generative Language Model](https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/)
+ * Read more on how to [train large models with DeepSpeed](https://www.deepspeed.ai/tutorials/large-models-w-deepspeed/)
+
+### DeepSpeed is hiring, [come join us!](https://careers.microsoft.com/us/en/search-results?keywords=http:%2F%2Fdeepspeed.ai)
+---
[DeepSpeed](https://www.deepspeed.ai/) is a deep learning optimization
library that makes distributed training easy, efficient, and effective.
@@ -14,10 +33,10 @@ library that makes distributed training easy, efficient, and effective.
Minimal Code Change
DeepSpeed delivers extreme-scale model training for everyone, from data scientists training on massive supercomputers to those training on low-end clusters or even on a single GPU:
-* Extreme scale: Using current generation of GPU clusters with hundreds of devices, 3D parallelism of DeepSpeed can efficiently train deep learning models with trillions of parameters.
+* Extreme scale: Using the current generation of GPU clusters with hundreds of devices, DeepSpeed's 3D parallelism can efficiently train deep learning models with trillions of parameters.
* Extremely memory efficient: With just a single GPU, ZeRO-Offload of DeepSpeed can train models with over 10B parameters, 10x bigger than the state of arts, democratizing multi-billion-parameter model training such that many deep learning scientists can explore bigger and better models.
-* Extremely long sequence length: Sparse attention of DeepSpeed powers an order-of-magnitude longer input sequence and obtains up to 6x faster execution comparing with dense transformers.
-* Extremely communication efficient: 3D parallelism improves communication efficiency allows users to train multi-billion-parameter models 2–7x faster on clusters with limited network bandwidth. 1-bit Adam reduces communication volume by up to 5x while achieving similar convergence efficiency to Adam, allowing for scaling to different types of GPU clusters and networks.
+* Extremely long sequence length: DeepSpeed's sparse attention powers input sequences an order of magnitude longer and delivers up to 6x faster execution compared with dense transformers.
+* Extremely communication efficient: 3D parallelism improves communication efficiency, allowing users to train multi-billion-parameter models 2–7x faster on clusters with limited network bandwidth. 1-bit Adam, 0/1 Adam and 1-bit LAMB reduce communication volume by up to 26x while achieving convergence efficiency similar to Adam/LAMB, enabling scaling to different types of GPU clusters and networks.
Early adopters of DeepSpeed have already produced
a language model (LM) with over 17B parameters called
@@ -31,22 +50,6 @@ information [here](https://innovation.microsoft.com/en-us/exploring-ai-at-scale)
**_For further documentation, tutorials, and technical deep-dives please see [deepspeed.ai](https://www.deepspeed.ai/)!_**
-
-# News
-* [2021/04/01] [[DeepSpeed on AzureML] Transformers and CIFAR examples are now available on AzureML GitHub](https://github.com/Azure/azureml-examples/tree/main/workflows/train/deepspeed)
-* [2021/03/30] [[PyTorch Lightning Blog] Accessible Multi-Billion Parameter Model Training with PyTorch Lightning + DeepSpeed](https://medium.com/pytorch-lightning/accessible-multi-billion-parameter-model-training-with-pytorch-lightning-deepspeed-c9333ac3bb59)
-* [2021/03/16] [1-bit Adam v2: NCCL-based implementation and more](https://www.deepspeed.ai/tutorials/onebit-adam/)
-* [2021/03/08] [ZeRO-3 Offload: Scale your models to trillion parameters without code changes while leveraging both CPUs & GPUs](https://www.deepspeed.ai/news/2021/03/07/zero3-offload.html)
-* [2021/01/19] [[🤗Hugging Face Blog] Fit More and Train Faster With ZeRO via DeepSpeed and FairScale](https://huggingface.co/blog/zero-deepspeed-fairscale)
-* [2020/11/12] [Simplified install, JIT compiled ops, PyPI releases, and reduced dependencies](#installation)
-* [2020/11/10] [Efficient and robust compressed training through progressive layer dropping](https://www.deepspeed.ai/news/2020/10/28/progressive-layer-dropping-news.html)
-* [2020/09/10] [DeepSpeed v0.3: Extreme-scale model training for everyone](https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/)
- * [Powering 10x longer sequences and 6x faster execution through DeepSpeed Sparse Attention](https://www.deepspeed.ai/news/2020/09/08/sparse-attention-news.html)
- * [Training a trillion parameters with pipeline parallelism](https://www.deepspeed.ai/news/2020/09/08/pipeline-parallelism.html)
- * [Up to 5x less communication and 3.4x faster training through 1-bit Adam](https://www.deepspeed.ai/news/2020/09/08/onebit-adam-news.html)
- * [10x bigger model training on a single GPU with ZeRO-Offload](https://www.deepspeed.ai/news/2020/09/08/ZeRO-Offload.html)
-
-
# Table of Contents
| Section | Description |
| --------------------------------------- | ------------------------------------------- |
@@ -96,6 +99,12 @@ If you would like to pre-install any of the DeepSpeed extensions/ops (instead
of JIT compiling) or install pre-compiled ops via PyPI please see our [advanced
installation instructions](https://www.deepspeed.ai/tutorials/advanced-install/).
+On Windows you can build a wheel with the following steps; currently only inference mode is supported.
+1. Install PyTorch, such as PyTorch 1.8 + CUDA 11.1.
+2. Install the Visual C++ build tools, such as the VS2019 C++ x64/x86 build tools.
+3. Launch a cmd console with Administrator privileges so that the required symlink folders can be created.
+4. Run `python setup.py bdist_wheel` to build the wheel in the `dist` folder (see the sketch below).
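+
+A minimal sketch of steps 1 and 4 from that console; the PyTorch/CUDA versions shown are illustrative assumptions, not pinned requirements:
+
+```bash
+# install a CUDA-enabled PyTorch build first (example version)
+pip install torch==1.8.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
+
+# build the DeepSpeed wheel; the result is written to the dist folder
+python setup.py bdist_wheel
+```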
+
# Features
Below we provide a brief feature list, see our detailed [feature
overview](https://www.deepspeed.ai/features/) for descriptions and usage.
@@ -116,14 +125,14 @@ overview](https://www.deepspeed.ai/features/) for descriptions and usage.
* [ZeRO-Offload](https://www.deepspeed.ai/tutorials/zero-offload/)
* Leverage both CPU/GPU memory for model training
* Support 10B model training on a single GPU
-* [Ultra-fast dense transformer kernels](https://www.deepspeed.ai/news/2020/05/18/bert-record.html)
-* [Sparse attention](https://www.deepspeed.ai/news/2020/09/08/sparse-attention.html)
+* [Ultra-fast dense transformer kernels](https://www.deepspeed.ai/2020/05/18/bert-record.html)
+* [Sparse attention](https://www.deepspeed.ai/2020/09/08/sparse-attention-news.html)
* Memory- and compute-efficient sparse kernels
* Support 10x longer sequences than dense
* Flexible support to different sparse structures
-* [1-bit Adam](https://www.deepspeed.ai/news/2020/09/08/onebit-adam-blog-post.html)
+* [1-bit Adam](https://www.deepspeed.ai/2020/09/08/onebit-adam-blog-post.html), [0/1 Adam](https://www.deepspeed.ai/tutorials/zero-one-adam/) and [1-bit LAMB](https://www.deepspeed.ai/tutorials/onebit-lamb/)
* Custom communication collective
- * Up to 5x communication volume saving
+ * Up to 26x communication volume saving
* [Additional Memory and Bandwidth Optimizations](https://www.deepspeed.ai/features/#additional-memory-and-bandwidth-optimizations)
* Smart Gradient Accumulation
* Communication/Computation Overlap
@@ -142,8 +151,12 @@ overview](https://www.deepspeed.ai/features/) for descriptions and usage.
* Learning Rate Range Test
* 1Cycle Learning Rate Schedule
* [Simplified Data Loader](https://www.deepspeed.ai/features/#simplified-data-loader)
+* [Curriculum Learning](https://www.deepspeed.ai/tutorials/curriculum-learning/)
+ * A curriculum learning-based data pipeline that presents easier or simpler examples earlier during training
+ * Stable and 3.3x faster GPT-2 pre-training with 8x/4x larger batch size/learning rate while maintaining token-wise convergence speed
+ * Complementary to many other DeepSpeed features
* [Performance Analysis and Debugging](https://www.deepspeed.ai/features/#performance-analysis-and-debugging)
-
+* [Mixture of Experts (MoE)](https://www.deepspeed.ai/tutorials/mixture-of-experts/)
# Further Reading
@@ -154,14 +167,14 @@ All DeepSpeed documentation can be found on our website: [deepspeed.ai](https://
| Article | Description |
| ---------------------------------------------------------------------------------------------- | -------------------------------------------- |
| [DeepSpeed Features](https://www.deepspeed.ai/features/) | DeepSpeed features |
-| [Getting Started](https://www.deepspeed.ai/getting-started/) | First steps with DeepSpeed |
+| [Getting Started](https://www.deepspeed.ai/getting-started/) | First steps with DeepSpeed |
| [DeepSpeed JSON Configuration](https://www.deepspeed.ai/docs/config-json/) | Configuring DeepSpeed |
| [API Documentation](https://deepspeed.readthedocs.io/en/latest/) | Generated DeepSpeed API documentation |
| [CIFAR-10 Tutorial](https://www.deepspeed.ai/tutorials/cifar-10) | Getting started with CIFAR-10 and DeepSpeed |
| [Megatron-LM Tutorial](https://www.deepspeed.ai/tutorials/megatron/) | Train GPT2 with DeepSpeed and Megatron-LM |
-| [BERT Pre-training Tutorial](https://www.deepspeed.ai/tutorials/bert-pretraining/) | Pre-train BERT with DeepSpeed |
+| [BERT Pre-training Tutorial](https://www.deepspeed.ai/tutorials/bert-pretraining/) | Pre-train BERT with DeepSpeed |
| [Learning Rate Range Test Tutorial](https://www.deepspeed.ai/tutorials/lrrt/) | Faster training with large learning rates |
-| [1Cycle Tutorial](https://www.deepspeed.ai/tutorials/1Cycle/) | SOTA learning schedule in DeepSpeed |
+| [1Cycle Tutorial](https://www.deepspeed.ai/tutorials/one-cycle/) | SOTA learning schedule in DeepSpeed |
@@ -192,7 +205,12 @@ Conduct](https://opensource.microsoft.com/codeofconduct/). For more information
2. Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. (2020) DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. [In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '20, Tutorial)](https://dl.acm.org/doi/10.1145/3394486.3406703).
3. Minjia Zhang, Yuxiong He. (2020) Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping. [arXiv:2010.13369](https://arxiv.org/abs/2010.13369) and [NeurIPS 2020](https://proceedings.neurips.cc/paper/2020/hash/a1140a3d0df1c81e24ae954d935e8926-Abstract.html).
4. Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, Yuxiong He. (2021) ZeRO-Offload: Democratizing Billion-Scale Model Training. [arXiv:2101.06840](https://arxiv.org/abs/2101.06840).
-5. Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He. (2021) 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed. [arXiv:2102.02888](https://arxiv.org/abs/2102.02888).
+5. Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He. (2021) 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed. [arXiv:2102.02888](https://arxiv.org/abs/2102.02888) and [ICML 2021](http://proceedings.mlr.press/v139/tang21a.html).
+6. Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, Yuxiong He. (2021) ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning. [arXiv:2104.07857](https://arxiv.org/abs/2104.07857).
+7. Conglong Li, Ammar Ahmad Awan, Hanlin Tang, Samyam Rajbhandari, Yuxiong He. (2021) 1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed. [arXiv:2104.06069](https://arxiv.org/abs/2104.06069).
+8. Conglong Li, Minjia Zhang, Yuxiong He. (2021) Curriculum Learning: A Regularization Method for Efficient and Stable Billion-Scale GPT Model Pre-Training. [arXiv:2108.06084](https://arxiv.org/abs/2108.06084).
+9. Yucheng Lu, Conglong Li, Minjia Zhang, Christopher De Sa, Yuxiong He. (2022) Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam. [arXiv:2202.06009](https://arxiv.org/abs/2202.06009).
+10. Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He. (2022) DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale. [arXiv:2201.05596](https://arxiv.org/abs/2201.05596).
# Videos
1. DeepSpeed KDD 2020 Tutorial
@@ -206,3 +224,6 @@ Conduct](https://opensource.microsoft.com/codeofconduct/). For more information
* Registration is free and all videos are available on-demand.
* [ZeRO & Fastest BERT: Increasing the scale and speed of deep learning training in DeepSpeed](https://note.microsoft.com/MSR-Webinar-DeepSpeed-Registration-On-Demand.html).
3. [DeepSpeed on AzureML](https://youtu.be/yBVXR8G8Bg8)
+4. Community Tutorials
+ * [DeepSpeed: All the tricks to scale to gigantic models](https://www.youtube.com/watch?v=pDGI668pNg0)
+ * [Turing-NLG, DeepSpeed and the ZeRO optimizer](https://www.youtube.com/watch?v=tC01FRB0M7w)
diff --git a/SECURITY.md b/SECURITY.md
index 7ab49eb8296428b7e97282be73aad19117ff34c2..e0dfff56a9569fee0ec4628bb42319f81731b250 100644
--- a/SECURITY.md
+++ b/SECURITY.md
@@ -1,41 +1,41 @@
-
-
-## Security
-
-Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [our GitHub organizations](https://opensource.microsoft.com/).
-
-If you believe you have found a security vulnerability in any Microsoft-owned repository that meets Microsoft's [Microsoft's definition of a security vulnerability](https://docs.microsoft.com/en-us/previous-versions/tn-archive/cc751383(v=technet.10)) of a security vulnerability, please report it to us as described below.
-
-## Reporting Security Issues
-
-**Please do not report security vulnerabilities through public GitHub issues.**
-
-Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://msrc.microsoft.com/create-report).
-
-If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the the [Microsoft Security Response Center PGP Key page](https://www.microsoft.com/en-us/msrc/pgp-key-msrc).
-
-You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc).
-
-Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:
-
- * Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
- * Full paths of source file(s) related to the manifestation of the issue
- * The location of the affected source code (tag/branch/commit or direct URL)
- * Any special configuration required to reproduce the issue
- * Step-by-step instructions to reproduce the issue
- * Proof-of-concept or exploit code (if possible)
- * Impact of the issue, including how an attacker might exploit the issue
-
-This information will help us triage your report more quickly.
-
-If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://microsoft.com/msrc/bounty) page for more details about our active programs.
-
-## Preferred Languages
-
-We prefer all communications to be in English.
-
-## Policy
-
-Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://www.microsoft.com/en-us/msrc/cvd).
-
-
+
+
+## Security
+
+Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [our GitHub organizations](https://opensource.microsoft.com/).
+
+If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://docs.microsoft.com/en-us/previous-versions/tn-archive/cc751383(v=technet.10)), please report it to us as described below.
+
+## Reporting Security Issues
+
+**Please do not report security vulnerabilities through public GitHub issues.**
+
+Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://msrc.microsoft.com/create-report).
+
+If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://www.microsoft.com/en-us/msrc/pgp-key-msrc).
+
+You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc).
+
+Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:
+
+ * Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
+ * Full paths of source file(s) related to the manifestation of the issue
+ * The location of the affected source code (tag/branch/commit or direct URL)
+ * Any special configuration required to reproduce the issue
+ * Step-by-step instructions to reproduce the issue
+ * Proof-of-concept or exploit code (if possible)
+ * Impact of the issue, including how an attacker might exploit the issue
+
+This information will help us triage your report more quickly.
+
+If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://microsoft.com/msrc/bounty) page for more details about our active programs.
+
+## Preferred Languages
+
+We prefer all communications to be in English.
+
+## Policy
+
+Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://www.microsoft.com/en-us/msrc/cvd).
+
+
diff --git a/azure/attach.sh b/azure/attach.sh
old mode 100755
new mode 100644
diff --git a/azure/azure_ssh.sh b/azure/azure_ssh.sh
old mode 100755
new mode 100644
diff --git a/azure/build_docker_image.sh b/azure/build_docker_image.sh
old mode 100755
new mode 100644
diff --git a/azure/create_vms.sh b/azure/create_vms.sh
old mode 100755
new mode 100644
diff --git a/azure/setup_docker.sh b/azure/setup_docker.sh
old mode 100755
new mode 100644
diff --git a/azure/setup_vms.sh b/azure/setup_vms.sh
old mode 100755
new mode 100644
diff --git a/azure/shutdown_vms.sh b/azure/shutdown_vms.sh
old mode 100755
new mode 100644
diff --git a/azure/start_container.sh b/azure/start_container.sh
old mode 100755
new mode 100644
diff --git a/bin/deepspeed b/bin/deepspeed
deleted file mode 120000
index 6b768564101983015fd56c8d604e439c2374ad06..0000000000000000000000000000000000000000
--- a/bin/deepspeed
+++ /dev/null
@@ -1 +0,0 @@
-ds
\ No newline at end of file
diff --git a/bin/deepspeed b/bin/deepspeed
new file mode 100644
index 0000000000000000000000000000000000000000..5ec8820db922fcdb284ff18cbe7f21c3b2e4d38b
--- /dev/null
+++ b/bin/deepspeed
@@ -0,0 +1,6 @@
+#!/usr/bin/env python3
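+# Console entry point that delegates to the DeepSpeed launcher
+# (deepspeed.launcher.runner.main); replaces the previous symlink to `ds`.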
+
+from deepspeed.launcher.runner import main
+
+if __name__ == '__main__':
+ main()
diff --git a/bin/deepspeed.pt b/bin/deepspeed.pt
deleted file mode 120000
index 6b768564101983015fd56c8d604e439c2374ad06..0000000000000000000000000000000000000000
--- a/bin/deepspeed.pt
+++ /dev/null
@@ -1 +0,0 @@
-ds
\ No newline at end of file
diff --git a/bin/deepspeed.pt b/bin/deepspeed.pt
new file mode 100644
index 0000000000000000000000000000000000000000..5ec8820db922fcdb284ff18cbe7f21c3b2e4d38b
--- /dev/null
+++ b/bin/deepspeed.pt
@@ -0,0 +1,6 @@
+#!/usr/bin/env python3
+
+from deepspeed.launcher.runner import main
+
+if __name__ == '__main__':
+ main()
diff --git a/bin/ds b/bin/ds
old mode 100755
new mode 100644
index 6bb47da8ce7cc99dee05ada3931989e0bc2dce4a..5ec8820db922fcdb284ff18cbe7f21c3b2e4d38b
--- a/bin/ds
+++ b/bin/ds
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/env python3
from deepspeed.launcher.runner import main
diff --git a/bin/ds_elastic b/bin/ds_elastic
old mode 100755
new mode 100644
index f55ebf106e058990e6e39464b2f6ea3cc211cd14..c9987d4565da3cb4c7e32b8342c201ba0165e030
--- a/bin/ds_elastic
+++ b/bin/ds_elastic
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/env python3
import argparse
import json
diff --git a/bin/ds_report b/bin/ds_report
old mode 100755
new mode 100644
index c03a95645eae8e110261155e0892a0e78eae1178..e6f7b50a78b2368c93192e6ef25357f546815037
--- a/bin/ds_report
+++ b/bin/ds_report
@@ -1,6 +1,6 @@
-#!/usr/bin/env python
+#!/usr/bin/env python3
-from deepspeed.env_report import main
+from deepspeed.env_report import cli_main
if __name__ == '__main__':
- main()
+ cli_main()
diff --git a/bin/ds_ssh b/bin/ds_ssh
old mode 100755
new mode 100644
index c2330e31ee12def026ea6ffbcf15c6aa5a3bd200..d89fc0b44e176c8acf422302308e6a865350ab49
--- a/bin/ds_ssh
+++ b/bin/ds_ssh
@@ -10,11 +10,25 @@ fi
hostfile=/job/hostfile
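+# Parse options: -f points at a hostfile other than the default /job/hostfile;
+# -h prints usage and exits.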
+while getopts "h?f:" opt; do
+ case "$opt" in
+ h|\?)
+ echo "-f : specify a hostfile, defaults to /job/hostfile"
+ exit 0
+ ;;
+ f)
+ hostfile=$OPTARG
+ shift $((OPTIND-1))
+ ;;
+ esac
+done
+
+echo "hostfile=$hostfile"
+
if [ -f $hostfile ]; then
hosts=`cat $hostfile | awk '{print $1}' | paste -sd "," -`
export PDSH_RCMD_TYPE=ssh
pdsh -w ${hosts} $@
else
- echo "Missing hostfile at ${hostfile}, executing command locally"
- $@
+ echo "Missing hostfile at ${hostfile}, unable to proceed"
fi
diff --git a/csrc/adagrad/cpu_adagrad.cpp b/csrc/adagrad/cpu_adagrad.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..4f2a9b69ef966599d1bd6664f79e312c9240671b
--- /dev/null
+++ b/csrc/adagrad/cpu_adagrad.cpp
@@ -0,0 +1,227 @@
+#include "cpu_adagrad.h"
+#include <cuda_runtime_api.h>
+#include <math.h>
+#include <omp.h>
+#include <torch/extension.h>
+#include <iostream>
+#include <memory>
+#include <type_traits>
+#include <unordered_map>
+#include "cublas_v2.h"
+#include "cuda.h"
+#include "curand.h"
+#include "custom_cuda_layers.h"
+
+static std::unordered_map<int, std::shared_ptr<void>> s_optimizers;
+
+// C++ interface
+
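+// Step_1 handles the scalar tail that the vectorized Step_AVX path (compiled under
+// __AVX256__/__AVX512__) did not cover: it accumulates the squared (optionally
+// L2-regularized) gradient into _exp_avg_sq and applies the Adagrad parameter update
+// with step size -_alpha and denominator sqrt(variance) + _eps. When dev_params is
+// non-null, updated values are staged in _doubled_buffer and copied to the GPU
+// asynchronously via launch_param_update.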
+void Adagrad_Optimizer::Step_1(float* _params,
+ float* grads,
+ float* _exp_avg_sq,
+ size_t _param_size,
+ __half* dev_params,
+ bool half_precision)
+{
+ size_t rounded_size = 0;
+#if defined(__AVX512__) or defined(__AVX256__)
+ Step_AVX<1>(
+ &rounded_size, _params, grads, _exp_avg_sq, _param_size, dev_params, half_precision);
+#endif
+ if (_param_size > rounded_size) {
+ float step_size = -1 * _alpha;
+ __half* grads_cast_h;
+ __half* params_cast_h;
+ if (half_precision) {
+ grads_cast_h = reinterpret_cast<__half*>(grads);
+ params_cast_h = reinterpret_cast<__half*>(_params);
+ }
+ for (size_t t = rounded_size; t < _param_size; t += TILE) {
+ size_t copy_size = TILE;
+ if ((t + TILE) > _param_size) copy_size = _param_size - t;
+ size_t offset = copy_size + t;
+ if ((t / TILE) >= 2) { cudaStreamSynchronize(_streams[_buf_index]); }
+#pragma omp parallel for
+ for (size_t k = t; k < offset; k++) {
+ float grad = half_precision ? (float)grads_cast_h[k] : grads[k];
+ float param = half_precision ? (float)params_cast_h[k] : _params[k];
+ float momentum = grads[k];
+ float variance = _exp_avg_sq[k];
+ if (_weight_decay > 0) { grad = param * _weight_decay + grad; }
+
+ variance += grad * grad;
+
+ grad = sqrt(variance);
+ grad += _eps;
+ grad = momentum / grad;
+ param = grad * step_size + param;
+ if (dev_params) _doubled_buffer[_buf_index][k - t] = param;
+
+ if (half_precision)
+ params_cast_h[k] = (__half)param;
+ else
+ _params[k] = param;
+ // STORE UPDATE TERM TO GRAD'S MEMORY
+ grads[k] = grad * step_size;
+ _exp_avg_sq[k] = variance;
+ }
+ if (dev_params) {
+ launch_param_update(
+ _doubled_buffer[_buf_index], dev_params + t, (copy_size), _streams[_buf_index]);
+ _buf_index = !_buf_index;
+ }
+ }
+ }
+}
+
+void Adagrad_Optimizer::Step_4(float* _params,
+ float* grads,
+ float* _exp_avg_sq,
+ size_t _param_size,
+ __half* dev_params,
+ bool half_precision)
+{
+ size_t rounded_size = 0;
+#if defined(__AVX512__) or defined(__AVX256__)
+ Step_AVX<4>(
+ &rounded_size, _params, grads, _exp_avg_sq, _param_size, dev_params, half_precision);
+#endif
+ if (_param_size > rounded_size)
+ Step_1((_params + rounded_size),
+ (grads + rounded_size),
+ (_exp_avg_sq + rounded_size),
+ (_param_size - rounded_size),
+ (dev_params != nullptr ? (dev_params + rounded_size) : dev_params),
+ half_precision);
+}
+
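+// Constructs an Adagrad_Optimizer and registers it in s_optimizers under optimizer_id
+// so that the step/destroy entry points below can look the instance up from Python.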
+int create_adagrad_optimizer(int optimizer_id,
+ float alpha = 1e-2,
+ float eps = 1e-8,
+ float weight_decay = 0,
+ bool should_log = false)
+{
+    auto opt = std::make_shared<Adagrad_Optimizer>(alpha, eps, weight_decay);
+
+ s_optimizers[optimizer_id] = opt;
+
+ if (should_log) {
+ std::string avx_type = "";
+#if defined(__AVX512__)
+ avx_type = "AVX512";
+#else
+#if defined(__AVX256__)
+ avx_type = "AVX2";
+#else
+ avx_type = "scalar";
+#endif
+#endif
+
+ printf("Adagrad Optimizer #%d is created with %s arithmetic capability.\n",
+ optimizer_id,
+ avx_type.c_str());
+ printf("Config: alpha=%f, weight_decay=%f\n", alpha, weight_decay);
+ }
+
+ return 0;
+}
+
+void Adagrad_Optimizer::Step_8(float* _params,
+ float* grads,
+ float* _exp_avg_sq,
+ size_t _param_size,
+ __half* dev_params,
+ bool half_precision)
+{
+ size_t rounded_size = 0;
+#if defined(__AVX512__) or defined(__AVX256__)
+ Step_AVX<8>(
+ &rounded_size, _params, grads, _exp_avg_sq, _param_size, dev_params, half_precision);
+#endif
+ if (_param_size > rounded_size)
+ Step_4((_params + rounded_size),
+ (grads + rounded_size),
+ (_exp_avg_sq + rounded_size),
+ (_param_size - rounded_size),
+ (dev_params != nullptr ? (dev_params + rounded_size) : dev_params),
+ half_precision);
+}
+
+int ds_adagrad_step(int optimizer_id,
+ size_t step,
+ float lr,
+ float epsilon,
+ float weight_decay,
+ torch::Tensor& params,
+ torch::Tensor& grads,
+ torch::Tensor& exp_avg_sq)
+{
+ auto params_c = params.contiguous();
+ auto grads_c = grads.contiguous();
+ auto exp_avg_sq_c = exp_avg_sq.contiguous();
+
+ float* params_ptr = (float*)params_c.data_ptr();
+ float* grads_ptr = (float*)grads_c.data_ptr();
+ float* exp_avg_sq_ptr = (float*)exp_avg_sq_c.data_ptr();
+
+    std::shared_ptr<Adagrad_Optimizer> opt =
+        std::static_pointer_cast<Adagrad_Optimizer>(s_optimizers[optimizer_id]);
+ opt->IncrementStep(step);
+ opt->update_state(lr, epsilon, weight_decay);
+ opt->Step_8(params_ptr, grads_ptr, exp_avg_sq_ptr, params_c.size(0));
+
+ opt->SynchronizeStreams();
+ return 0;
+}
+
+int ds_adagrad_step_plus_copy(int optimizer_id,
+ size_t step,
+ float lr,
+ float epsilon,
+ float weight_decay,
+ torch::Tensor& params,
+ torch::Tensor& grads,
+ torch::Tensor& exp_avg_sq,
+ torch::Tensor& gpu_params)
+{
+ auto params_c = params.contiguous();
+ auto gpu_params_c = gpu_params.contiguous();
+ auto exp_avg_sq_c = exp_avg_sq.contiguous();
+ auto grads_c = grads.contiguous();
+
+ float* params_ptr = (float*)params_c.data_ptr();
+ float* grads_ptr = (float*)grads_c.data_ptr();
+ __half* gpu_params_ptr = (__half*)gpu_params_c.data_ptr();
+ float* exp_avg_sq_ptr = (float*)exp_avg_sq_c.data_ptr();
+
+    std::shared_ptr<Adagrad_Optimizer> opt =
+        std::static_pointer_cast<Adagrad_Optimizer>(s_optimizers[optimizer_id]);
+ opt->IncrementStep(step);
+ opt->update_state(lr, epsilon, weight_decay);
+ opt->Step_8(params_ptr,
+ grads_ptr,
+ exp_avg_sq_ptr,
+ params_c.size(0),
+ gpu_params_ptr,
+ (params.options().dtype() == at::kHalf));
+
+ opt->SynchronizeStreams();
+ return 0;
+}
+
+int destroy_adagrad_optimizer(int optimizer_id)
+{
+ s_optimizers.erase(optimizer_id);
+
+ return 0;
+}
+
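+// pybind11 bindings: expose the entry points above to Python under the extension
+// module name supplied at build time via TORCH_EXTENSION_NAME.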
+PYBIND11_MODULE(TORCH_EXTENSION_NAME, m)
+{
+ m.def("adagrad_update", &ds_adagrad_step, "DeepSpeed CPU Adagrad update (C++)");
+ m.def("adagrad_update_copy",
+ &ds_adagrad_step_plus_copy,
+ "DeepSpeed CPU Adagrad update and param copy (C++)");
+ m.def("create_adagrad", &create_adagrad_optimizer, "DeepSpeed CPU Adagrad (C++)");
+ m.def("destroy_adagrad", &destroy_adagrad_optimizer, "DeepSpeed CPU Adagrad destroy (C++)");
+}
diff --git a/csrc/adagrad/cpu_adagrad_hip.cpp b/csrc/adagrad/cpu_adagrad_hip.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..6bbe9a9ee564c9e8f081c083202326ad279eddd1
--- /dev/null
+++ b/csrc/adagrad/cpu_adagrad_hip.cpp
@@ -0,0 +1,228 @@
+// !!! This is a file automatically generated by hipify!!!
+#include "cpu_adagrad_hip.h"
+#include <hip/hip_runtime_api.h>
+#include <math.h>
+#include <omp.h>
+#include <torch/extension.h>
+#include <iostream>
+#include <memory>
+#include <type_traits>
+#include <unordered_map>
+#include "rocblas.h"
+#include "hip/hip_runtime.h"
+#include "hiprand/hiprand.h"
+#include "custom_hip_layers.h"
+
+static std::unordered_map<int, std::shared_ptr<void>> s_optimizers;
+
+// C++ interface
+
+void Adagrad_Optimizer::Step_1(float* _params,
+ float* grads,
+ float* _exp_avg_sq,
+ size_t _param_size,
+ __half* dev_params,
+ bool half_precision)
+{
+ size_t rounded_size = 0;
+#if defined(__AVX512__) or defined(__AVX256__)
+ Step_AVX<1>(
+ &rounded_size, _params, grads, _exp_avg_sq, _param_size, dev_params, half_precision);
+#endif
+ if (_param_size > rounded_size) {
+ float step_size = -1 * _alpha;
+ __half* grads_cast_h;
+ __half* params_cast_h;
+ if (half_precision) {
+ grads_cast_h = reinterpret_cast<__half*>(grads);
+ params_cast_h = reinterpret_cast<__half*>(_params);
+ }
+ for (size_t t = rounded_size; t < _param_size; t += TILE) {
+ size_t copy_size = TILE;
+ if ((t + TILE) > _param_size) copy_size = _param_size - t;
+ size_t offset = copy_size + t;
+ if ((t / TILE) >= 2) { hipStreamSynchronize(_streams[_buf_index]); }
+#pragma omp parallel for
+ for (size_t k = t; k < offset; k++) {
+ float grad = half_precision ? (float)grads_cast_h[k] : grads[k];
+ float param = half_precision ? (float)params_cast_h[k] : _params[k];
+ float momentum = grads[k];
+ float variance = _exp_avg_sq[k];
+ if (_weight_decay > 0) { grad = param * _weight_decay + grad; }
+
+ variance += grad * grad;
+
+ grad = sqrt(variance);
+ grad += _eps;
+ grad = momentum / grad;
+ param = grad * step_size + param;
+ if (dev_params) _doubled_buffer[_buf_index][k - t] = param;
+
+ if (half_precision)
+ params_cast_h[k] = (__half)param;
+ else
+ _params[k] = param;
+ // STORE UPDATE TERM TO GRAD'S MEMORY
+ grads[k] = grad * step_size;
+ _exp_avg_sq[k] = variance;
+ }
+ if (dev_params) {
+ launch_param_update(
+ _doubled_buffer[_buf_index], dev_params + t, (copy_size), _streams[_buf_index]);
+ _buf_index = !_buf_index;
+ }
+ }
+ }
+}
+
+void Adagrad_Optimizer::Step_4(float* _params,
+ float* grads,
+ float* _exp_avg_sq,
+ size_t _param_size,
+ __half* dev_params,
+ bool half_precision)
+{
+ size_t rounded_size = 0;
+#if defined(__AVX512__) or defined(__AVX256__)
+ Step_AVX<4>(
+ &rounded_size, _params, grads, _exp_avg_sq, _param_size, dev_params, half_precision);
+#endif
+ if (_param_size > rounded_size)
+ Step_1((_params + rounded_size),
+ (grads + rounded_size),
+ (_exp_avg_sq + rounded_size),
+ (_param_size - rounded_size),
+ (dev_params != nullptr ? (dev_params + rounded_size) : dev_params),
+ half_precision);
+}
+
+int create_adagrad_optimizer(int optimizer_id,
+ float alpha = 1e-2,
+ float eps = 1e-8,
+ float weight_decay = 0,
+ bool should_log = false)
+{
+    auto opt = std::make_shared<Adagrad_Optimizer>(alpha, eps, weight_decay);
+
+ s_optimizers[optimizer_id] = opt;
+
+ if (should_log) {
+ std::string avx_type = "";
+#if defined(__AVX512__)
+ avx_type = "AVX512";
+#else
+#if defined(__AVX256__)
+ avx_type = "AVX2";
+#else
+ avx_type = "scalar";
+#endif
+#endif
+
+ printf("Adagrad Optimizer #%d is created with %s arithmetic capability.\n",
+ optimizer_id,
+ avx_type.c_str());
+ printf("Config: alpha=%f, weight_decay=%f\n", alpha, weight_decay);
+ }
+
+ return 0;
+}
+
+void Adagrad_Optimizer::Step_8(float* _params,
+ float* grads,
+ float* _exp_avg_sq,
+ size_t _param_size,
+ __half* dev_params,
+ bool half_precision)
+{
+ size_t rounded_size = 0;
+#if defined(__AVX512__) or defined(__AVX256__)
+ Step_AVX<8>(
+ &rounded_size, _params, grads, _exp_avg_sq, _param_size, dev_params, half_precision);
+#endif
+ if (_param_size > rounded_size)
+ Step_4((_params + rounded_size),
+ (grads + rounded_size),
+ (_exp_avg_sq + rounded_size),
+ (_param_size - rounded_size),
+ (dev_params != nullptr ? (dev_params + rounded_size) : dev_params),
+ half_precision);
+}
+
+int ds_adagrad_step(int optimizer_id,
+ size_t step,
+ float lr,
+ float epsilon,
+ float weight_decay,
+ torch::Tensor& params,
+ torch::Tensor& grads,
+ torch::Tensor& exp_avg_sq)
+{
+ auto params_c = params.contiguous();
+ auto grads_c = grads.contiguous();
+ auto exp_avg_sq_c = exp_avg_sq.contiguous();
+
+ float* params_ptr = (float*)params_c.data_ptr();
+ float* grads_ptr = (float*)grads_c.data_ptr();
+ float* exp_avg_sq_ptr = (float*)exp_avg_sq_c.data_ptr();
+
+    std::shared_ptr<Adagrad_Optimizer> opt =
+        std::static_pointer_cast<Adagrad_Optimizer>(s_optimizers[optimizer_id]);
+ opt->IncrementStep(step);
+ opt->update_state(lr, epsilon, weight_decay);
+ opt->Step_8(params_ptr, grads_ptr, exp_avg_sq_ptr, params_c.size(0));
+
+ opt->SynchronizeStreams();
+ return 0;
+}
+
+int ds_adagrad_step_plus_copy(int optimizer_id,
+ size_t step,
+ float lr,
+ float epsilon,
+ float weight_decay,
+ torch::Tensor& params,
+ torch::Tensor& grads,
+ torch::Tensor& exp_avg_sq,
+ torch::Tensor& gpu_params)
+{
+ auto params_c = params.contiguous();
+ auto gpu_params_c = gpu_params.contiguous();
+ auto exp_avg_sq_c = exp_avg_sq.contiguous();
+ auto grads_c = grads.contiguous();
+
+ float* params_ptr = (float*)params_c.data_ptr();
+ float* grads_ptr = (float*)grads_c.data_ptr();
+ __half* gpu_params_ptr = (__half*)gpu_params_c.data_ptr();
+ float* exp_avg_sq_ptr = (float*)exp_avg_sq_c.data_ptr();
+
+    std::shared_ptr<Adagrad_Optimizer> opt =
+        std::static_pointer_cast<Adagrad_Optimizer>(s_optimizers[optimizer_id]);
+ opt->IncrementStep(step);
+ opt->update_state(lr, epsilon, weight_decay);
+ opt->Step_8(params_ptr,
+ grads_ptr,
+ exp_avg_sq_ptr,
+ params_c.size(0),
+ gpu_params_ptr,
+ (params.options().dtype() == at::kHalf));
+
+ opt->SynchronizeStreams();
+ return 0;
+}
+
+int destroy_adagrad_optimizer(int optimizer_id)
+{
+ s_optimizers.erase(optimizer_id);
+
+ return 0;
+}
+
+PYBIND11_MODULE(TORCH_EXTENSION_NAME, m)
+{
+ m.def("adagrad_update", &ds_adagrad_step, "DeepSpeed CPU Adagrad update (C++)");
+ m.def("adagrad_update_copy",
+ &ds_adagrad_step_plus_copy,
+ "DeepSpeed CPU Adagrad update and param copy (C++)");
+ m.def("create_adagrad", &create_adagrad_optimizer, "DeepSpeed CPU Adagrad (C++)");
+ m.def("destroy_adagrad", &destroy_adagrad_optimizer, "DeepSpeed CPU Adagrad destroy (C++)");
+}
diff --git a/csrc/adam/cpu_adam.cpp b/csrc/adam/cpu_adam.cpp
index d425dc3169efa35d08469ee91cd56a7c7feffda7..727eec8182c12cdcc0dcb0df53a38918bdeae4b8 100644
--- a/csrc/adam/cpu_adam.cpp
+++ b/csrc/adam/cpu_adam.cpp
@@ -1,682 +1,292 @@
-#include "cpu_adam.h"
-#include <cuda_runtime_api.h>
-#include <math.h>
-#include <omp.h>
-#include <torch/extension.h>
-#include <iostream>
-#include <memory>
-#include <type_traits>
-#include <unordered_map>
-#include "cublas_v2.h"
-#include "cuda.h"
-#include "curand.h"
-#include "custom_cuda_layers.h"
-
-static std::unordered_map<int, std::shared_ptr<void>> s_optimizers;
-
-#define ROUND_DOWN(size, step) ((size) & ~((step)-1))
-
-// C++ interface
-
-void Adam_Optimizer::Step(float* _params,
- float* grads,
- float* _exp_avg,
- float* _exp_avg_sq,
- size_t _param_size,
- __half* dev_params)
-{
- float betta1_minus1 = 1 - _betta1;
- float betta2_minus1 = 1 - _betta2;
-
- float step_size = -1 * _alpha / _bias_correction1;
- float w_decay = -1 * _alpha * _weight_decay;
- size_t rounded_size = 0;
-
-#if defined(__AVX512__) or defined(__AVX256__)
-
- AVX_Data betta1_4;
- betta1_4.data = SIMD_SET(_betta1);
- AVX_Data betta2_4;
- betta2_4.data = SIMD_SET(_betta2);
-
- AVX_Data betta1_minus1_4;
- betta1_minus1_4.data = SIMD_SET(betta1_minus1);
- AVX_Data betta2_minus1_4;
- betta2_minus1_4.data = SIMD_SET(betta2_minus1);
-
- AVX_Data bias2_sqrt;
- bias2_sqrt.data = SIMD_SET(_bias_correction2);
-
- AVX_Data eps_4;
- eps_4.data = SIMD_SET(_eps);
-
- AVX_Data step_size_4;
- step_size_4.data = SIMD_SET(step_size);
-
- AVX_Data weight_decay4;
- if (_weight_decay > 0)
- weight_decay4.data = (_adamw_mode ? SIMD_SET(w_decay) : SIMD_SET(_weight_decay));
- rounded_size = ROUND_DOWN(_param_size, SIMD_WIDTH);
-
- for (size_t t = 0; t < rounded_size; t += TILE) {
- size_t copy_size = TILE;
- if ((t + TILE) > rounded_size) copy_size = rounded_size - t;
- size_t offset = copy_size + t;
- if ((t / TILE) >= 2) { cudaStreamSynchronize(_streams[_buf_index]); }
-
-#pragma omp parallel for
- for (size_t i = t; i < offset; i += SIMD_WIDTH) {
- AVX_Data grad_4;
- grad_4.data = SIMD_LOAD(grads + i);
-
- AVX_Data momentum_4;
- momentum_4.data = SIMD_LOAD(_exp_avg + i);
- AVX_Data variance_4;
- variance_4.data = SIMD_LOAD(_exp_avg_sq + i);
-
- AVX_Data param_4;
- param_4.data = SIMD_LOAD(_params + i);
-
- if (_weight_decay > 0 && !_adamw_mode) {
- grad_4.data = SIMD_FMA(param_4.data, weight_decay4.data, grad_4.data);
- }
- momentum_4.data = SIMD_MUL(momentum_4.data, betta1_4.data);
- momentum_4.data = SIMD_FMA(grad_4.data, betta1_minus1_4.data, momentum_4.data);
-
- variance_4.data = SIMD_MUL(variance_4.data, betta2_4.data);
- grad_4.data = SIMD_MUL(grad_4.data, grad_4.data);
- variance_4.data = SIMD_FMA(grad_4.data, betta2_minus1_4.data, variance_4.data);
-
- grad_4.data = SIMD_SQRT(variance_4.data);
- grad_4.data = SIMD_FMA(grad_4.data, bias2_sqrt.data, eps_4.data);
- grad_4.data = SIMD_DIV(momentum_4.data, grad_4.data);
- if (_weight_decay > 0 && _adamw_mode) {
- param_4.data = SIMD_FMA(param_4.data, weight_decay4.data, param_4.data);
- }
- param_4.data = SIMD_FMA(grad_4.data, step_size_4.data, param_4.data);
-
- SIMD_STORE(_params + i, param_4.data);
-
- if (dev_params) SIMD_STORE(_doubled_buffer[_buf_index] + (i - t), param_4.data);
-
- SIMD_STORE(_exp_avg + i, momentum_4.data);
- SIMD_STORE(_exp_avg_sq + i, variance_4.data);
- }
- if (dev_params) {
- launch_param_update(
- _doubled_buffer[_buf_index], dev_params + t, copy_size, _streams[_buf_index]);
- _buf_index = !_buf_index;
- }
- }
-
-#endif
-
- if (_param_size > rounded_size) {
- for (size_t t = rounded_size; t < _param_size; t += TILE) {
- size_t copy_size = TILE;
- if ((t + TILE) > _param_size) copy_size = _param_size - t;
- size_t offset = copy_size + t;
- if ((t / TILE) >= 2) { cudaStreamSynchronize(_streams[_buf_index]); }
-#pragma omp parallel for
- for (size_t k = t; k < offset; k++) {
- float grad = grads[k];
- float param = _params[k];
- float momentum = _exp_avg[k];
- float variance = _exp_avg_sq[k];
- if (_weight_decay > 0 && !_adamw_mode) { grad = param * _weight_decay + grad; }
- momentum = momentum * _betta1;
- momentum = grad * betta1_minus1 + momentum;
-
- variance = variance * _betta2;
- grad = grad * grad;
- variance = grad * betta2_minus1 + variance;
-
- grad = sqrt(variance);
- grad = grad * _bias_correction2 + _eps;
- grad = momentum / grad;
- if (_weight_decay > 0 && _adamw_mode) { param += w_decay * param; }
- param = grad * step_size + param;
- if (dev_params) _doubled_buffer[_buf_index][k - t] = param;
-
- _params[k] = param;
- _exp_avg[k] = momentum;
- _exp_avg_sq[k] = variance;
- }
- if (dev_params) {
- launch_param_update(
- _doubled_buffer[_buf_index], dev_params + t, (copy_size), _streams[_buf_index]);
- _buf_index = !_buf_index;
- }
- }
- }
-}
-
-void Adam_Optimizer::Step_4(float* _params,
- float* grads,
- float* _exp_avg,
- float* _exp_avg_sq,
- size_t _param_size,
- __half* dev_params)
-{
- size_t rounded_size = 0;
-
-#if defined(__AVX512__) or defined(__AVX256__)
-
- AVX_Data betta1_4;
- betta1_4.data = SIMD_SET(_betta1);
- AVX_Data betta2_4;
- betta2_4.data = SIMD_SET(_betta2);
-
- float betta1_minus1 = 1 - _betta1;
- float betta2_minus1 = 1 - _betta2;
- AVX_Data betta1_minus1_4;
- betta1_minus1_4.data = SIMD_SET(betta1_minus1);
- AVX_Data betta2_minus1_4;
- betta2_minus1_4.data = SIMD_SET(betta2_minus1);
-
- AVX_Data bias2_sqrt;
- bias2_sqrt.data = SIMD_SET(_bias_correction2);
-
- AVX_Data eps_4;
- eps_4.data = SIMD_SET(_eps);
-
- float step_size = -1 * _alpha / _bias_correction1;
- AVX_Data step_size_4;
- step_size_4.data = SIMD_SET(step_size);
-
- float w_decay = -1 * _alpha * _weight_decay;
- AVX_Data weight_decay4;
- if (_weight_decay > 0)
- weight_decay4.data = (_adamw_mode ? SIMD_SET(w_decay) : SIMD_SET(_weight_decay));
- rounded_size = ROUND_DOWN(_param_size, (SIMD_WIDTH << 2));
-
- for (size_t t = 0; t < rounded_size; t += TILE) {
- size_t copy_size = TILE;
- if ((t + TILE) > rounded_size) copy_size = rounded_size - t;
- size_t offset = copy_size + t;
- if ((t / TILE) >= 2) { cudaStreamSynchronize(_streams[_buf_index]); }
-#pragma omp parallel for
- for (size_t i = t; i < offset; i += (SIMD_WIDTH << 2)) {
- AVX_Data grad_4[4];
- grad_4[0].data = SIMD_LOAD(grads + i);
- grad_4[1].data = SIMD_LOAD(grads + i + SIMD_WIDTH);
- grad_4[2].data = SIMD_LOAD(grads + i + (SIMD_WIDTH << 1));
- grad_4[3].data = SIMD_LOAD(grads + i + SIMD_WIDTH * 3);
-
- AVX_Data momentum_4[4];
- momentum_4[0].data = SIMD_LOAD(_exp_avg + i);
- momentum_4[1].data = SIMD_LOAD(_exp_avg + i + SIMD_WIDTH);
- momentum_4[2].data = SIMD_LOAD(_exp_avg + i + (SIMD_WIDTH << 1));
- momentum_4[3].data = SIMD_LOAD(_exp_avg + i + SIMD_WIDTH * 3);
-
- AVX_Data variance_4[4];
- variance_4[0].data = SIMD_LOAD(_exp_avg_sq + i);
- variance_4[1].data = SIMD_LOAD(_exp_avg_sq + i + SIMD_WIDTH);
- variance_4[2].data = SIMD_LOAD(_exp_avg_sq + i + (SIMD_WIDTH << 1));
- variance_4[3].data = SIMD_LOAD(_exp_avg_sq + i + SIMD_WIDTH * 3);
-
- AVX_Data param_4[4];
- param_4[0].data = SIMD_LOAD(_params + i);
- param_4[1].data = SIMD_LOAD(_params + i + SIMD_WIDTH);
- param_4[2].data = SIMD_LOAD(_params + i + (SIMD_WIDTH << 1));
- param_4[3].data = SIMD_LOAD(_params + i + SIMD_WIDTH * 3);
-
- if (_weight_decay > 0 && !_adamw_mode) {
- grad_4[0].data = SIMD_FMA(param_4[0].data, weight_decay4.data, grad_4[0].data);
- grad_4[1].data = SIMD_FMA(param_4[1].data, weight_decay4.data, grad_4[1].data);
- grad_4[2].data = SIMD_FMA(param_4[2].data, weight_decay4.data, grad_4[2].data);
- grad_4[3].data = SIMD_FMA(param_4[3].data, weight_decay4.data, grad_4[3].data);
- }
-
- momentum_4[0].data = SIMD_MUL(momentum_4[0].data, betta1_4.data);
- momentum_4[0].data = SIMD_FMA(grad_4[0].data, betta1_minus1_4.data, momentum_4[0].data);
- momentum_4[1].data = SIMD_MUL(momentum_4[1].data, betta1_4.data);
- momentum_4[1].data = SIMD_FMA(grad_4[1].data, betta1_minus1_4.data, momentum_4[1].data);
- momentum_4[2].data = SIMD_MUL(momentum_4[2].data, betta1_4.data);
- momentum_4[2].data = SIMD_FMA(grad_4[2].data, betta1_minus1_4.data, momentum_4[2].data);
- momentum_4[3].data = SIMD_MUL(momentum_4[3].data, betta1_4.data);
- momentum_4[3].data = SIMD_FMA(grad_4[3].data, betta1_minus1_4.data, momentum_4[3].data);
-
- variance_4[0].data = SIMD_MUL(variance_4[0].data, betta2_4.data);
- variance_4[1].data = SIMD_MUL(variance_4[1].data, betta2_4.data);
- variance_4[2].data = SIMD_MUL(variance_4[2].data, betta2_4.data);
- variance_4[3].data = SIMD_MUL(variance_4[3].data, betta2_4.data);
- grad_4[0].data = SIMD_MUL(grad_4[0].data, grad_4[0].data);
- grad_4[1].data = SIMD_MUL(grad_4[1].data, grad_4[1].data);
- grad_4[2].data = SIMD_MUL(grad_4[2].data, grad_4[2].data);
- grad_4[3].data = SIMD_MUL(grad_4[3].data, grad_4[3].data);
- variance_4[0].data = SIMD_FMA(grad_4[0].data, betta2_minus1_4.data, variance_4[0].data);
- variance_4[1].data = SIMD_FMA(grad_4[1].data, betta2_minus1_4.data, variance_4[1].data);
- variance_4[2].data = SIMD_FMA(grad_4[2].data, betta2_minus1_4.data, variance_4[2].data);
- variance_4[3].data = SIMD_FMA(grad_4[3].data, betta2_minus1_4.data, variance_4[3].data);
-
- grad_4[0].data = SIMD_SQRT(variance_4[0].data);
- grad_4[1].data = SIMD_SQRT(variance_4[1].data);
- grad_4[2].data = SIMD_SQRT(variance_4[2].data);
- grad_4[3].data = SIMD_SQRT(variance_4[3].data);
-
- grad_4[0].data = SIMD_FMA(grad_4[0].data, bias2_sqrt.data, eps_4.data);
- grad_4[1].data = SIMD_FMA(grad_4[1].data, bias2_sqrt.data, eps_4.data);
- grad_4[2].data = SIMD_FMA(grad_4[2].data, bias2_sqrt.data, eps_4.data);
- grad_4[3].data = SIMD_FMA(grad_4[3].data, bias2_sqrt.data, eps_4.data);
- grad_4[0].data = SIMD_DIV(momentum_4[0].data, grad_4[0].data);
- grad_4[1].data = SIMD_DIV(momentum_4[1].data, grad_4[1].data);
- grad_4[2].data = SIMD_DIV(momentum_4[2].data, grad_4[2].data);
- grad_4[3].data = SIMD_DIV(momentum_4[3].data, grad_4[3].data);
-
- if (_weight_decay > 0 && _adamw_mode) {
- param_4[0].data = SIMD_FMA(param_4[0].data, weight_decay4.data, param_4[0].data);
- param_4[1].data = SIMD_FMA(param_4[1].data, weight_decay4.data, param_4[1].data);
- param_4[2].data = SIMD_FMA(param_4[2].data, weight_decay4.data, param_4[2].data);
- param_4[3].data = SIMD_FMA(param_4[3].data, weight_decay4.data, param_4[3].data);
- }
-
- param_4[0].data = SIMD_FMA(grad_4[0].data, step_size_4.data, param_4[0].data);
- param_4[1].data = SIMD_FMA(grad_4[1].data, step_size_4.data, param_4[1].data);
- param_4[2].data = SIMD_FMA(grad_4[2].data, step_size_4.data, param_4[2].data);
- param_4[3].data = SIMD_FMA(grad_4[3].data, step_size_4.data, param_4[3].data);
-
- SIMD_STORE(_params + i, param_4[0].data);
- SIMD_STORE(_params + i + SIMD_WIDTH, param_4[1].data);
- SIMD_STORE(_params + i + (SIMD_WIDTH << 1), param_4[2].data);
- SIMD_STORE(_params + i + SIMD_WIDTH * 3, param_4[3].data);
-
- if (dev_params) {
- SIMD_STORE(_doubled_buffer[_buf_index] + (i - t), param_4[0].data);
- SIMD_STORE(_doubled_buffer[_buf_index] + (i - t) + SIMD_WIDTH, param_4[1].data);
- SIMD_STORE(_doubled_buffer[_buf_index] + (i - t) + (SIMD_WIDTH << 1),
- param_4[2].data);
- SIMD_STORE(_doubled_buffer[_buf_index] + (i - t) + SIMD_WIDTH * 3, param_4[3].data);
- }
-
- SIMD_STORE(_exp_avg + i, momentum_4[0].data);
- SIMD_STORE(_exp_avg + i + SIMD_WIDTH, momentum_4[1].data);
- SIMD_STORE(_exp_avg + i + (SIMD_WIDTH << 1), momentum_4[2].data);
- SIMD_STORE(_exp_avg + i + SIMD_WIDTH * 3, momentum_4[3].data);
-
- SIMD_STORE(_exp_avg_sq + i, variance_4[0].data);
- SIMD_STORE(_exp_avg_sq + i + SIMD_WIDTH, variance_4[1].data);
- SIMD_STORE(_exp_avg_sq + i + (SIMD_WIDTH << 1), variance_4[2].data);
- SIMD_STORE(_exp_avg_sq + i + SIMD_WIDTH * 3, variance_4[3].data);
- }
-
- if (dev_params) {
- launch_param_update(
- _doubled_buffer[_buf_index], dev_params + t, copy_size, _streams[_buf_index]);
- _buf_index = !_buf_index;
- }
- }
-#endif
- if (_param_size > rounded_size)
- Step((_params + rounded_size),
- (grads + rounded_size),
- (_exp_avg + rounded_size),
- (_exp_avg_sq + rounded_size),
- (_param_size - rounded_size),
- (dev_params != nullptr ? (dev_params + rounded_size) : dev_params));
-}
-
-int create_adam_optimizer(int optimizer_id,
- float alpha = 1e-3,
- float betta1 = 0.9,
- float betta2 = 0.999,
- float eps = 1e-8,
- float weight_decay = 0,
- bool adamw_mode = true)
-{
- auto opt =
-        std::make_shared<Adam_Optimizer>(alpha, betta1, betta2, eps, weight_decay, adamw_mode);
-
- s_optimizers[optimizer_id] = opt;
-#if defined(__AVX512__)
- std::cout << "Adam Optimizer #" << optimizer_id
- << " is created with AVX512 arithmetic capability." << std::endl;
- printf("Config: alpha=%f, betas=(%f, %f), weight_decay=%f, adam_w=%d\n",
- alpha,
- betta1,
- betta2,
- weight_decay,
- (int)adamw_mode);
-#else
-#if defined(__AVX256__)
- std::cout << "Adam Optimizer #" << optimizer_id
- << " is created with AVX2 arithmetic capability." << std::endl;
- printf("Config: alpha=%f, betas=(%f, %f), weight_decay=%f, adam_w=%d\n",
- alpha,
- betta1,
- betta2,
- weight_decay,
- (int)adamw_mode);
-#else
- std::cout << "Adam Optimizer #" << optimizer_id
- << " is created with scalar arithmetic capability." << std::endl;
- printf("Config: alpha=%f, betas=(%f, %f), weight_decay=%f, adam_w=%d\n",
- alpha,
- betta1,
- betta2,
- weight_decay,
- (int)adamw_mode);
-#endif
-#endif
- return 0;
-}
-
-void Adam_Optimizer::Step_8(float* _params,
- float* grads,
- float* _exp_avg,
- float* _exp_avg_sq,
- size_t _param_size,
- __half* dev_params)
-{
- size_t rounded_size = 0;
-
-#if defined(__AVX512__) or defined(__AVX256__)
-
- AVX_Data betta1_4;
- betta1_4.data = SIMD_SET(_betta1);
- AVX_Data betta2_4;
- betta2_4.data = SIMD_SET(_betta2);
-
- float betta1_minus1 = 1 - _betta1;
- float betta2_minus1 = 1 - _betta2;
- AVX_Data betta1_minus1_4;
- betta1_minus1_4.data = SIMD_SET(betta1_minus1);
- AVX_Data betta2_minus1_4;
- betta2_minus1_4.data = SIMD_SET(betta2_minus1);
-
- AVX_Data bias2_sqrt;
- bias2_sqrt.data = SIMD_SET(_bias_correction2);
-
- AVX_Data eps_4;
- eps_4.data = SIMD_SET(_eps);
-
- float step_size = -1 * _alpha / _bias_correction1;
- AVX_Data step_size_4;
- step_size_4.data = SIMD_SET(step_size);
-
- float w_decay = -1 * _alpha * _weight_decay;
- AVX_Data weight_decay4;
- if (_weight_decay > 0)
- weight_decay4.data = (_adamw_mode ? SIMD_SET(w_decay) : SIMD_SET(_weight_decay));
- rounded_size = ROUND_DOWN(_param_size, (SIMD_WIDTH << 3));
-
- for (size_t t = 0; t < rounded_size; t += TILE) {
- size_t copy_size = TILE;
- if ((t + TILE) > rounded_size) copy_size = rounded_size - t;
- size_t offset = copy_size + t;
- if ((t / TILE) >= 2) { cudaStreamSynchronize(_streams[_buf_index]); }
-#pragma omp parallel for
- for (size_t i = t; i < offset; i += (SIMD_WIDTH << 3)) {
- AVX_Data grad_4[8];
- grad_4[0].data = SIMD_LOAD(grads + i);
- grad_4[1].data = SIMD_LOAD(grads + i + SIMD_WIDTH);
- grad_4[2].data = SIMD_LOAD(grads + i + (SIMD_WIDTH << 1));
- grad_4[3].data = SIMD_LOAD(grads + i + SIMD_WIDTH * 3);
- grad_4[4].data = SIMD_LOAD(grads + i + (SIMD_WIDTH << 2));
- grad_4[5].data = SIMD_LOAD(grads + i + SIMD_WIDTH * 5);
- grad_4[6].data = SIMD_LOAD(grads + i + SIMD_WIDTH * 6);
- grad_4[7].data = SIMD_LOAD(grads + i + SIMD_WIDTH * 7);
-
- AVX_Data momentum_4[8];
- momentum_4[0].data = SIMD_LOAD(_exp_avg + i);
- momentum_4[1].data = SIMD_LOAD(_exp_avg + i + SIMD_WIDTH);
- momentum_4[2].data = SIMD_LOAD(_exp_avg + i + (SIMD_WIDTH << 1));
- momentum_4[3].data = SIMD_LOAD(_exp_avg + i + SIMD_WIDTH * 3);
- momentum_4[4].data = SIMD_LOAD(_exp_avg + i + (SIMD_WIDTH << 2));
- momentum_4[5].data = SIMD_LOAD(_exp_avg + i + SIMD_WIDTH * 5);
- momentum_4[6].data = SIMD_LOAD(_exp_avg + i + SIMD_WIDTH * 6);
- momentum_4[7].data = SIMD_LOAD(_exp_avg + i + SIMD_WIDTH * 7);
-
- AVX_Data variance_4[8];
- variance_4[0].data = SIMD_LOAD(_exp_avg_sq + i);
- variance_4[1].data = SIMD_LOAD(_exp_avg_sq + i + SIMD_WIDTH);
- variance_4[2].data = SIMD_LOAD(_exp_avg_sq + i + (SIMD_WIDTH << 1));
- variance_4[3].data = SIMD_LOAD(_exp_avg_sq + i + SIMD_WIDTH * 3);
- variance_4[4].data = SIMD_LOAD(_exp_avg_sq + i + (SIMD_WIDTH << 2));
- variance_4[5].data = SIMD_LOAD(_exp_avg_sq + i + SIMD_WIDTH * 5);
- variance_4[6].data = SIMD_LOAD(_exp_avg_sq + i + SIMD_WIDTH * 6);
- variance_4[7].data = SIMD_LOAD(_exp_avg_sq + i + SIMD_WIDTH * 7);
-
- AVX_Data param_4[8];
- param_4[0].data = SIMD_LOAD(_params + i);
- param_4[1].data = SIMD_LOAD(_params + i + SIMD_WIDTH);
- param_4[2].data = SIMD_LOAD(_params + i + (SIMD_WIDTH << 1));
- param_4[3].data = SIMD_LOAD(_params + i + SIMD_WIDTH * 3);
- param_4[4].data = SIMD_LOAD(_params + i + (SIMD_WIDTH << 2));
- param_4[5].data = SIMD_LOAD(_params + i + SIMD_WIDTH * 5);
- param_4[6].data = SIMD_LOAD(_params + i + SIMD_WIDTH * 6);
- param_4[7].data = SIMD_LOAD(_params + i + SIMD_WIDTH * 7);
-
- if (_weight_decay > 0 && !_adamw_mode) {
- grad_4[0].data = SIMD_FMA(param_4[0].data, weight_decay4.data, grad_4[0].data);
- grad_4[1].data = SIMD_FMA(param_4[1].data, weight_decay4.data, grad_4[1].data);
- grad_4[2].data = SIMD_FMA(param_4[2].data, weight_decay4.data, grad_4[2].data);
- grad_4[3].data = SIMD_FMA(param_4[3].data, weight_decay4.data, grad_4[3].data);
- grad_4[4].data = SIMD_FMA(param_4[4].data, weight_decay4.data, grad_4[4].data);
- grad_4[5].data = SIMD_FMA(param_4[5].data, weight_decay4.data, grad_4[5].data);
- grad_4[6].data = SIMD_FMA(param_4[6].data, weight_decay4.data, grad_4[6].data);
- grad_4[7].data = SIMD_FMA(param_4[7].data, weight_decay4.data, grad_4[7].data);
- }
-
- momentum_4[0].data = SIMD_MUL(momentum_4[0].data, betta1_4.data);
- momentum_4[0].data = SIMD_FMA(grad_4[0].data, betta1_minus1_4.data, momentum_4[0].data);
- momentum_4[1].data = SIMD_MUL(momentum_4[1].data, betta1_4.data);
- momentum_4[1].data = SIMD_FMA(grad_4[1].data, betta1_minus1_4.data, momentum_4[1].data);
- momentum_4[2].data = SIMD_MUL(momentum_4[2].data, betta1_4.data);
- momentum_4[2].data = SIMD_FMA(grad_4[2].data, betta1_minus1_4.data, momentum_4[2].data);
- momentum_4[3].data = SIMD_MUL(momentum_4[3].data, betta1_4.data);
- momentum_4[3].data = SIMD_FMA(grad_4[3].data, betta1_minus1_4.data, momentum_4[3].data);
- momentum_4[4].data = SIMD_MUL(momentum_4[4].data, betta1_4.data);
- momentum_4[4].data = SIMD_FMA(grad_4[4].data, betta1_minus1_4.data, momentum_4[4].data);
- momentum_4[5].data = SIMD_MUL(momentum_4[5].data, betta1_4.data);
- momentum_4[5].data = SIMD_FMA(grad_4[5].data, betta1_minus1_4.data, momentum_4[5].data);
- momentum_4[6].data = SIMD_MUL(momentum_4[6].data, betta1_4.data);
- momentum_4[6].data = SIMD_FMA(grad_4[6].data, betta1_minus1_4.data, momentum_4[6].data);
- momentum_4[7].data = SIMD_MUL(momentum_4[7].data, betta1_4.data);
- momentum_4[7].data = SIMD_FMA(grad_4[7].data, betta1_minus1_4.data, momentum_4[7].data);
-
- variance_4[0].data = SIMD_MUL(variance_4[0].data, betta2_4.data);
- variance_4[1].data = SIMD_MUL(variance_4[1].data, betta2_4.data);
- variance_4[2].data = SIMD_MUL(variance_4[2].data, betta2_4.data);
- variance_4[3].data = SIMD_MUL(variance_4[3].data, betta2_4.data);
- variance_4[4].data = SIMD_MUL(variance_4[4].data, betta2_4.data);
- variance_4[5].data = SIMD_MUL(variance_4[5].data, betta2_4.data);
- variance_4[6].data = SIMD_MUL(variance_4[6].data, betta2_4.data);
- variance_4[7].data = SIMD_MUL(variance_4[7].data, betta2_4.data);
- grad_4[0].data = SIMD_MUL(grad_4[0].data, grad_4[0].data);
- grad_4[1].data = SIMD_MUL(grad_4[1].data, grad_4[1].data);
- grad_4[2].data = SIMD_MUL(grad_4[2].data, grad_4[2].data);
- grad_4[3].data = SIMD_MUL(grad_4[3].data, grad_4[3].data);
- grad_4[4].data = SIMD_MUL(grad_4[4].data, grad_4[4].data);
- grad_4[5].data = SIMD_MUL(grad_4[5].data, grad_4[5].data);
- grad_4[6].data = SIMD_MUL(grad_4[6].data, grad_4[6].data);
- grad_4[7].data = SIMD_MUL(grad_4[7].data, grad_4[7].data);
- variance_4[0].data = SIMD_FMA(grad_4[0].data, betta2_minus1_4.data, variance_4[0].data);
- variance_4[1].data = SIMD_FMA(grad_4[1].data, betta2_minus1_4.data, variance_4[1].data);
- variance_4[2].data = SIMD_FMA(grad_4[2].data, betta2_minus1_4.data, variance_4[2].data);
- variance_4[3].data = SIMD_FMA(grad_4[3].data, betta2_minus1_4.data, variance_4[3].data);
- variance_4[4].data = SIMD_FMA(grad_4[4].data, betta2_minus1_4.data, variance_4[4].data);
- variance_4[5].data = SIMD_FMA(grad_4[5].data, betta2_minus1_4.data, variance_4[5].data);
- variance_4[6].data = SIMD_FMA(grad_4[6].data, betta2_minus1_4.data, variance_4[6].data);
- variance_4[7].data = SIMD_FMA(grad_4[7].data, betta2_minus1_4.data, variance_4[7].data);
-
- grad_4[0].data = SIMD_SQRT(variance_4[0].data);
- grad_4[1].data = SIMD_SQRT(variance_4[1].data);
- grad_4[2].data = SIMD_SQRT(variance_4[2].data);
- grad_4[3].data = SIMD_SQRT(variance_4[3].data);
- grad_4[4].data = SIMD_SQRT(variance_4[4].data);
- grad_4[5].data = SIMD_SQRT(variance_4[5].data);
- grad_4[6].data = SIMD_SQRT(variance_4[6].data);
- grad_4[7].data = SIMD_SQRT(variance_4[7].data);
-
- grad_4[0].data = SIMD_FMA(grad_4[0].data, bias2_sqrt.data, eps_4.data);
- grad_4[1].data = SIMD_FMA(grad_4[1].data, bias2_sqrt.data, eps_4.data);
- grad_4[2].data = SIMD_FMA(grad_4[2].data, bias2_sqrt.data, eps_4.data);
- grad_4[3].data = SIMD_FMA(grad_4[3].data, bias2_sqrt.data, eps_4.data);
- grad_4[4].data = SIMD_FMA(grad_4[4].data, bias2_sqrt.data, eps_4.data);
- grad_4[5].data = SIMD_FMA(grad_4[5].data, bias2_sqrt.data, eps_4.data);
- grad_4[6].data = SIMD_FMA(grad_4[6].data, bias2_sqrt.data, eps_4.data);
- grad_4[7].data = SIMD_FMA(grad_4[7].data, bias2_sqrt.data, eps_4.data);
- grad_4[0].data = SIMD_DIV(momentum_4[0].data, grad_4[0].data);
- grad_4[1].data = SIMD_DIV(momentum_4[1].data, grad_4[1].data);
- grad_4[2].data = SIMD_DIV(momentum_4[2].data, grad_4[2].data);
- grad_4[3].data = SIMD_DIV(momentum_4[3].data, grad_4[3].data);
- grad_4[4].data = SIMD_DIV(momentum_4[4].data, grad_4[4].data);
- grad_4[5].data = SIMD_DIV(momentum_4[5].data, grad_4[5].data);
- grad_4[6].data = SIMD_DIV(momentum_4[6].data, grad_4[6].data);
- grad_4[7].data = SIMD_DIV(momentum_4[7].data, grad_4[7].data);
-
- if (_weight_decay > 0 && _adamw_mode) {
- param_4[0].data = SIMD_FMA(param_4[0].data, weight_decay4.data, param_4[0].data);
- param_4[1].data = SIMD_FMA(param_4[1].data, weight_decay4.data, param_4[1].data);
- param_4[2].data = SIMD_FMA(param_4[2].data, weight_decay4.data, param_4[2].data);
- param_4[3].data = SIMD_FMA(param_4[3].data, weight_decay4.data, param_4[3].data);
- param_4[4].data = SIMD_FMA(param_4[4].data, weight_decay4.data, param_4[4].data);
- param_4[5].data = SIMD_FMA(param_4[5].data, weight_decay4.data, param_4[5].data);
- param_4[6].data = SIMD_FMA(param_4[6].data, weight_decay4.data, param_4[6].data);
- param_4[7].data = SIMD_FMA(param_4[7].data, weight_decay4.data, param_4[7].data);
- }
-
- param_4[0].data = SIMD_FMA(grad_4[0].data, step_size_4.data, param_4[0].data);
- param_4[1].data = SIMD_FMA(grad_4[1].data, step_size_4.data, param_4[1].data);
- param_4[2].data = SIMD_FMA(grad_4[2].data, step_size_4.data, param_4[2].data);
- param_4[3].data = SIMD_FMA(grad_4[3].data, step_size_4.data, param_4[3].data);
- param_4[4].data = SIMD_FMA(grad_4[4].data, step_size_4.data, param_4[4].data);
- param_4[5].data = SIMD_FMA(grad_4[5].data, step_size_4.data, param_4[5].data);
- param_4[6].data = SIMD_FMA(grad_4[6].data, step_size_4.data, param_4[6].data);
- param_4[7].data = SIMD_FMA(grad_4[7].data, step_size_4.data, param_4[7].data);
-
- SIMD_STORE(_params + i, param_4[0].data);
- SIMD_STORE(_params + i + SIMD_WIDTH, param_4[1].data);
- SIMD_STORE(_params + i + (SIMD_WIDTH << 1), param_4[2].data);
- SIMD_STORE(_params + i + SIMD_WIDTH * 3, param_4[3].data);
- SIMD_STORE(_params + i + (SIMD_WIDTH << 2), param_4[4].data);
- SIMD_STORE(_params + i + SIMD_WIDTH * 5, param_4[5].data);
- SIMD_STORE(_params + i + SIMD_WIDTH * 6, param_4[6].data);
- SIMD_STORE(_params + i + SIMD_WIDTH * 7, param_4[7].data);
-
- if (dev_params) {
- SIMD_STORE(_doubled_buffer[_buf_index] + (i - t), param_4[0].data);
- SIMD_STORE(_doubled_buffer[_buf_index] + (i - t) + SIMD_WIDTH, param_4[1].data);
- SIMD_STORE(_doubled_buffer[_buf_index] + (i - t) + (SIMD_WIDTH << 1),
- param_4[2].data);
- SIMD_STORE(_doubled_buffer[_buf_index] + (i - t) + SIMD_WIDTH * 3, param_4[3].data);
- SIMD_STORE(_doubled_buffer[_buf_index] + (i - t) + (SIMD_WIDTH << 2),
- param_4[4].data);
- SIMD_STORE(_doubled_buffer[_buf_index] + (i - t) + SIMD_WIDTH * 5, param_4[5].data);
- SIMD_STORE(_doubled_buffer[_buf_index] + (i - t) + SIMD_WIDTH * 6, param_4[6].data);
- SIMD_STORE(_doubled_buffer[_buf_index] + (i - t) + SIMD_WIDTH * 7, param_4[7].data);
- }
-
- SIMD_STORE(_exp_avg + i, momentum_4[0].data);
- SIMD_STORE(_exp_avg + i + SIMD_WIDTH, momentum_4[1].data);
- SIMD_STORE(_exp_avg + i + (SIMD_WIDTH << 1), momentum_4[2].data);
- SIMD_STORE(_exp_avg + i + SIMD_WIDTH * 3, momentum_4[3].data);
- SIMD_STORE(_exp_avg + i + (SIMD_WIDTH << 2), momentum_4[4].data);
- SIMD_STORE(_exp_avg + i + SIMD_WIDTH * 5, momentum_4[5].data);
- SIMD_STORE(_exp_avg + i + SIMD_WIDTH * 6, momentum_4[6].data);
- SIMD_STORE(_exp_avg + i + SIMD_WIDTH * 7, momentum_4[7].data);
-
- SIMD_STORE(_exp_avg_sq + i, variance_4[0].data);
- SIMD_STORE(_exp_avg_sq + i + SIMD_WIDTH, variance_4[1].data);
- SIMD_STORE(_exp_avg_sq + i + (SIMD_WIDTH << 1), variance_4[2].data);
- SIMD_STORE(_exp_avg_sq + i + SIMD_WIDTH * 3, variance_4[3].data);
- SIMD_STORE(_exp_avg_sq + i + (SIMD_WIDTH << 2), variance_4[4].data);
- SIMD_STORE(_exp_avg_sq + i + SIMD_WIDTH * 5, variance_4[5].data);
- SIMD_STORE(_exp_avg_sq + i + SIMD_WIDTH * 6, variance_4[6].data);
- SIMD_STORE(_exp_avg_sq + i + SIMD_WIDTH * 7, variance_4[7].data);
- }
- if (dev_params) {
- launch_param_update(
- _doubled_buffer[_buf_index], dev_params + t, copy_size, _streams[_buf_index]);
- _buf_index = !_buf_index;
- }
- }
-#endif
- if (_param_size > rounded_size)
- Step_4((_params + rounded_size),
- (grads + rounded_size),
- (_exp_avg + rounded_size),
- (_exp_avg_sq + rounded_size),
- (_param_size - rounded_size),
- (dev_params != nullptr ? (dev_params + rounded_size) : dev_params));
-}
-
-int ds_adam_step(int optimizer_id,
- size_t step,
- float lr,
- float beta1,
- float beta2,
- float epsilon,
- float weight_decay,
- bool bias_correction,
- torch::Tensor& params,
- torch::Tensor& grads,
- torch::Tensor& exp_avg,
- torch::Tensor& exp_avg_sq)
-{
- auto params_c = params.contiguous();
- auto grads_c = grads.contiguous();
- auto exp_avg_c = exp_avg.contiguous();
- auto exp_avg_sq_c = exp_avg_sq.contiguous();
-
- float* params_ptr = (float*)params_c.data_ptr();
- float* grads_ptr = (float*)grads_c.data_ptr();
- float* exp_avg_ptr = (float*)exp_avg_c.data_ptr();
- float* exp_avg_sq_ptr = (float*)exp_avg_sq_c.data_ptr();
-
-    std::shared_ptr<Adam_Optimizer> opt =
-        std::static_pointer_cast<Adam_Optimizer>(s_optimizers[optimizer_id]);
- opt->IncrementStep(step, beta1, beta2);
- opt->update_state(lr, epsilon, weight_decay, bias_correction);
- opt->Step_8(params_ptr, grads_ptr, exp_avg_ptr, exp_avg_sq_ptr, params_c.size(0));
-
- opt->SynchronizeStreams();
- return 0;
-}
-
-int ds_adam_step_plus_copy(int optimizer_id,
- size_t step,
- float lr,
- float beta1,
- float beta2,
- float epsilon,
- float weight_decay,
- bool bias_correction,
- torch::Tensor& params,
- torch::Tensor& grads,
- torch::Tensor& exp_avg,
- torch::Tensor& exp_avg_sq,
- torch::Tensor& gpu_params)
-{
- auto params_c = params.contiguous();
- auto gpu_params_c = gpu_params.contiguous();
- auto exp_avg_c = exp_avg.contiguous();
- auto exp_avg_sq_c = exp_avg_sq.contiguous();
- auto grads_c = grads.contiguous();
-
- float* params_ptr = (float*)params_c.data_ptr();
- float* grads_ptr = (float*)grads_c.data_ptr();
- __half* gpu_params_ptr = (__half*)gpu_params_c.data_ptr();
- float* exp_avg_ptr = (float*)exp_avg_c.data_ptr();
- float* exp_avg_sq_ptr = (float*)exp_avg_sq_c.data_ptr();
-
-    std::shared_ptr<Adam_Optimizer> opt =
-        std::static_pointer_cast<Adam_Optimizer>(s_optimizers[optimizer_id]);
- opt->IncrementStep(step, beta1, beta2);
- opt->update_state(lr, epsilon, weight_decay, bias_correction);
- opt->Step_8(
- params_ptr, grads_ptr, exp_avg_ptr, exp_avg_sq_ptr, params_c.size(0), gpu_params_ptr);
-
- opt->SynchronizeStreams();
- return 0;
-}
-
-PYBIND11_MODULE(TORCH_EXTENSION_NAME, m)
-{
- m.def("adam_update", &ds_adam_step, "DeepSpeed CPU Adam update (C++)");
- m.def("adam_update_copy",
- &ds_adam_step_plus_copy,
- "DeepSpeed CPU Adam update and param copy (C++)");
- m.def("create_adam", &create_adam_optimizer, "DeepSpeed CPU Adam (C++)");
-}
+#include "cpu_adam.h"
+#include <cuda_runtime_api.h>
+#include <math.h>
+#include <omp.h>
+#include <torch/extension.h>
+#include <iostream>
+#include <memory>
+#include <type_traits>
+#include <unordered_map>
+#include "cublas_v2.h"
+#include "cuda.h"
+#include "curand.h"
+#include "custom_cuda_layers.h"
+
+static std::unordered_map<int, std::shared_ptr<void>> s_optimizers;
+
+// C++ interface
+
+void Adam_Optimizer::Step_1(float* _params,
+ float* grads,
+ float* _exp_avg,
+ float* _exp_avg_sq,
+ size_t _param_size,
+ __half* dev_params,
+ bool half_precision)
+{
+ size_t rounded_size = 0;
+#if defined(__AVX512__) or defined(__AVX256__)
+ Step_AVX<1>(&rounded_size,
+ _params,
+ grads,
+ _exp_avg,
+ _exp_avg_sq,
+ _param_size,
+ dev_params,
+ half_precision);
+#endif
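+    // Anything Step_AVX<1> above did not cover (the tail that is not a multiple of
+    // SIMD_WIDTH, or the whole range when neither AVX512 nor AVX256 is available)
+    // is handled by the scalar loop below.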
+ if (_param_size > rounded_size) {
+ float betta1_minus1 = 1 - _betta1;
+ float betta2_minus1 = 1 - _betta2;
+
+ float step_size = -1 * _alpha / _bias_correction1;
+ float w_decay = -1 * _alpha * _weight_decay;
+ __half* grads_cast_h;
+ __half* params_cast_h;
+ if (half_precision) {
+ grads_cast_h = reinterpret_cast<__half*>(grads);
+ params_cast_h = reinterpret_cast<__half*>(_params);
+ }
+
+ for (size_t t = rounded_size; t < _param_size; t += TILE) {
+ size_t copy_size = TILE;
+ if ((t + TILE) > _param_size) copy_size = _param_size - t;
+ size_t offset = copy_size + t;
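+            // The two host staging buffers are used round-robin; before overwriting one,
+            // wait for the asynchronous device copy launched from it two tiles earlier.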
+ if ((t / TILE) >= 2) { cudaStreamSynchronize(_streams[_buf_index]); }
+
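+            // Per-element Adam/AdamW update: m = b1*m + (1-b1)*g, v = b2*v + (1-b2)*g^2,
+            // p += step_size * m / (sqrt(v) * _bias_correction2 + eps); the bias
+            // corrections are pre-folded into step_size and _bias_correction2.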
+#pragma omp parallel for
+ for (size_t k = t; k < offset; k++) {
+ float grad = half_precision ? (float)grads_cast_h[k] : grads[k];
+ float param = half_precision ? (float)params_cast_h[k] : _params[k];
+ float momentum = _exp_avg[k];
+ float variance = _exp_avg_sq[k];
+ if (_weight_decay > 0 && !_adamw_mode) { grad = param * _weight_decay + grad; }
+ momentum = momentum * _betta1;
+ momentum = grad * betta1_minus1 + momentum;
+
+ variance = variance * _betta2;
+ grad = grad * grad;
+ variance = grad * betta2_minus1 + variance;
+
+ grad = sqrt(variance);
+ grad = grad * _bias_correction2 + _eps;
+ grad = momentum / grad;
+ if (_weight_decay > 0 && _adamw_mode) { param += w_decay * param; }
+ param = grad * step_size + param;
+ if (dev_params) _doubled_buffer[_buf_index][k - t] = param;
+
+ if (half_precision)
+ params_cast_h[k] = (__half)param;
+ else
+ _params[k] = param;
+ _exp_avg[k] = momentum;
+ _exp_avg_sq[k] = variance;
+ }
+ if (dev_params) {
+ launch_param_update(
+ _doubled_buffer[_buf_index], dev_params + t, (copy_size), _streams[_buf_index]);
+
+ _buf_index = !_buf_index;
+ }
+ }
+ }
+}
+
+void Adam_Optimizer::Step_4(float* _params,
+ float* grads,
+ float* _exp_avg,
+ float* _exp_avg_sq,
+ size_t _param_size,
+ __half* dev_params,
+ bool half_precision)
+{
+ size_t rounded_size = 0;
+#if defined(__AVX512__) or defined(__AVX256__)
+ Step_AVX<4>(&rounded_size,
+ _params,
+ grads,
+ _exp_avg,
+ _exp_avg_sq,
+ _param_size,
+ dev_params,
+ half_precision);
+#endif
+ if (_param_size > rounded_size)
+ Step_1((_params + rounded_size),
+ (grads + rounded_size),
+ (_exp_avg + rounded_size),
+ (_exp_avg_sq + rounded_size),
+ (_param_size - rounded_size),
+ (dev_params != nullptr ? (dev_params + rounded_size) : dev_params),
+ half_precision);
+}
+
+int create_adam_optimizer(int optimizer_id,
+ float alpha = 1e-3,
+ float betta1 = 0.9,
+ float betta2 = 0.999,
+ float eps = 1e-8,
+ float weight_decay = 0,
+ bool adamw_mode = true,
+ bool should_log = false)
+{
+ auto opt =
+        std::make_shared<Adam_Optimizer>(alpha, betta1, betta2, eps, weight_decay, adamw_mode);
+
+ s_optimizers[optimizer_id] = opt;
+
+ if (should_log) {
+ std::string avx_type = "";
+#if defined(__AVX512__)
+ avx_type = "AVX512";
+#else
+#if defined(__AVX256__)
+ avx_type = "AVX2";
+#else
+ avx_type = "scalar";
+#endif
+#endif
+
+ printf("Adam Optimizer #%d is created with %s arithmetic capability.\n",
+ optimizer_id,
+ avx_type.c_str());
+ printf("Config: alpha=%f, betas=(%f, %f), weight_decay=%f, adam_w=%d\n",
+ alpha,
+ betta1,
+ betta2,
+ weight_decay,
+ (int)adamw_mode);
+ }
+
+ return 0;
+}
+
+void Adam_Optimizer::Step_8(float* _params,
+ float* grads,
+ float* _exp_avg,
+ float* _exp_avg_sq,
+ size_t _param_size,
+ __half* dev_params,
+ bool half_precision)
+{
+ size_t rounded_size = 0;
+#if defined(__AVX512__) or defined(__AVX256__)
+ Step_AVX<8>(&rounded_size,
+ _params,
+ grads,
+ _exp_avg,
+ _exp_avg_sq,
+ _param_size,
+ dev_params,
+ half_precision);
+#endif
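+    // Widest vector path first: whatever Step_AVX<8> could not cover cascades through
+    // Step_4 and finally the scalar Step_1.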
+ if (_param_size > rounded_size)
+ Step_4((_params + rounded_size),
+ (grads + rounded_size),
+ (_exp_avg + rounded_size),
+ (_exp_avg_sq + rounded_size),
+ (_param_size - rounded_size),
+ (dev_params != nullptr ? (dev_params + rounded_size) : dev_params),
+ half_precision);
+}
+
+int ds_adam_step(int optimizer_id,
+ size_t step,
+ float lr,
+ float beta1,
+ float beta2,
+ float epsilon,
+ float weight_decay,
+ bool bias_correction,
+ torch::Tensor& params,
+ torch::Tensor& grads,
+ torch::Tensor& exp_avg,
+ torch::Tensor& exp_avg_sq)
+{
+ auto params_c = params.contiguous();
+ auto grads_c = grads.contiguous();
+ auto exp_avg_c = exp_avg.contiguous();
+ auto exp_avg_sq_c = exp_avg_sq.contiguous();
+
+ // assert(params.options().dtype() == grads.options().dtype());
+
+ float* params_ptr = (float*)params_c.data_ptr();
+ float* grads_ptr = (float*)grads_c.data_ptr();
+ float* exp_avg_ptr = (float*)exp_avg_c.data_ptr();
+ float* exp_avg_sq_ptr = (float*)exp_avg_sq_c.data_ptr();
+
+    std::shared_ptr<Adam_Optimizer> opt =
+        std::static_pointer_cast<Adam_Optimizer>(s_optimizers[optimizer_id]);
+ opt->IncrementStep(step, beta1, beta2);
+ opt->update_state(lr, epsilon, weight_decay, bias_correction);
+
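+    // No device copy in this entry point (dev_params == nullptr); the last argument tells
+    // Step_8 whether params/grads are stored as fp16 and must be reinterpreted.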
+ opt->Step_8(params_ptr,
+ grads_ptr,
+ exp_avg_ptr,
+ exp_avg_sq_ptr,
+ params_c.size(0),
+ nullptr,
+ (params.options().dtype() == at::kHalf));
+
+ opt->SynchronizeStreams();
+ return 0;
+}
+
+int ds_adam_step_plus_copy(int optimizer_id,
+ size_t step,
+ float lr,
+ float beta1,
+ float beta2,
+ float epsilon,
+ float weight_decay,
+ bool bias_correction,
+ torch::Tensor& params,
+ torch::Tensor& grads,
+ torch::Tensor& exp_avg,
+ torch::Tensor& exp_avg_sq,
+ torch::Tensor& gpu_params)
+{
+ auto params_c = params.contiguous();
+ auto gpu_params_c = gpu_params.contiguous();
+ auto exp_avg_c = exp_avg.contiguous();
+ auto exp_avg_sq_c = exp_avg_sq.contiguous();
+ auto grads_c = grads.contiguous();
+
+ float* params_ptr = (float*)params_c.data_ptr();
+ float* grads_ptr = (float*)grads_c.data_ptr();
+ __half* gpu_params_ptr = (__half*)gpu_params_c.data_ptr();
+ float* exp_avg_ptr = (float*)exp_avg_c.data_ptr();
+ float* exp_avg_sq_ptr = (float*)exp_avg_sq_c.data_ptr();
+
+    std::shared_ptr<Adam_Optimizer> opt =
+        std::static_pointer_cast<Adam_Optimizer>(s_optimizers[optimizer_id]);
+ opt->IncrementStep(step, beta1, beta2);
+ opt->update_state(lr, epsilon, weight_decay, bias_correction);
+ opt->Step_8(params_ptr,
+ grads_ptr,
+ exp_avg_ptr,
+ exp_avg_sq_ptr,
+ params_c.size(0),
+ gpu_params_ptr,
+ (params.options().dtype() == at::kHalf));
+
+ opt->SynchronizeStreams();
+ return 0;
+}
+
+int destroy_adam_optimizer(int optimizer_id)
+{
+ s_optimizers.erase(optimizer_id);
+
+ return 0;
+}
+
+PYBIND11_MODULE(TORCH_EXTENSION_NAME, m)
+{
+ m.def("adam_update", &ds_adam_step, "DeepSpeed CPU Adam update (C++)");
+ m.def("adam_update_copy",
+ &ds_adam_step_plus_copy,
+ "DeepSpeed CPU Adam update and param copy (C++)");
+ m.def("create_adam", &create_adam_optimizer, "DeepSpeed CPU Adam (C++)");
+ m.def("destroy_adam", &destroy_adam_optimizer, "DeepSpeed CPU Adam destroy (C++)");
+}
diff --git a/csrc/adam/cpu_adam_hip.cpp b/csrc/adam/cpu_adam_hip.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..67163979fe3311b85e6b3be3d587bdc1c498485f
--- /dev/null
+++ b/csrc/adam/cpu_adam_hip.cpp
@@ -0,0 +1,293 @@
+// !!! This is a file automatically generated by hipify!!!
+#include "cpu_adam_hip.h"
+#include <hip/hip_runtime_api.h>
+#include <math.h>
+#include <omp.h>
+#include <torch/extension.h>
+#include <iostream>
+#include <memory>
+#include <type_traits>
+#include <unordered_map>
+#include "rocblas.h"
+#include "hip/hip_runtime.h"
+#include "hiprand/hiprand.h"
+#include "custom_hip_layers.h"
+
+static std::unordered_map<int, std::shared_ptr<void>> s_optimizers;
+
+// C++ interface
+
+void Adam_Optimizer::Step_1(float* _params,
+ float* grads,
+ float* _exp_avg,
+ float* _exp_avg_sq,
+ size_t _param_size,
+ __half* dev_params,
+ bool half_precision)
+{
+ size_t rounded_size = 0;
+#if defined(__AVX512__) or defined(__AVX256__)
+ Step_AVX<1>(&rounded_size,
+ _params,
+ grads,
+ _exp_avg,
+ _exp_avg_sq,
+ _param_size,
+ dev_params,
+ half_precision);
+#endif
+ if (_param_size > rounded_size) {
+ float betta1_minus1 = 1 - _betta1;
+ float betta2_minus1 = 1 - _betta2;
+
+ float step_size = -1 * _alpha / _bias_correction1;
+ float w_decay = -1 * _alpha * _weight_decay;
+ __half* grads_cast_h;
+ __half* params_cast_h;
+ if (half_precision) {
+ grads_cast_h = reinterpret_cast<__half*>(grads);
+ params_cast_h = reinterpret_cast<__half*>(_params);
+ }
+
+ for (size_t t = rounded_size; t < _param_size; t += TILE) {
+ size_t copy_size = TILE;
+ if ((t + TILE) > _param_size) copy_size = _param_size - t;
+ size_t offset = copy_size + t;
+ if ((t / TILE) >= 2) { hipStreamSynchronize(_streams[_buf_index]); }
+
+#pragma omp parallel for
+ for (size_t k = t; k < offset; k++) {
+ float grad = half_precision ? (float)grads_cast_h[k] : grads[k];
+ float param = half_precision ? (float)params_cast_h[k] : _params[k];
+ float momentum = _exp_avg[k];
+ float variance = _exp_avg_sq[k];
+ if (_weight_decay > 0 && !_adamw_mode) { grad = param * _weight_decay + grad; }
+ momentum = momentum * _betta1;
+ momentum = grad * betta1_minus1 + momentum;
+
+ variance = variance * _betta2;
+ grad = grad * grad;
+ variance = grad * betta2_minus1 + variance;
+
+ grad = sqrt(variance);
+ grad = grad * _bias_correction2 + _eps;
+ grad = momentum / grad;
+ if (_weight_decay > 0 && _adamw_mode) { param += w_decay * param; }
+ param = grad * step_size + param;
+ if (dev_params) _doubled_buffer[_buf_index][k - t] = param;
+
+ if (half_precision)
+ params_cast_h[k] = (__half)param;
+ else
+ _params[k] = param;
+ _exp_avg[k] = momentum;
+ _exp_avg_sq[k] = variance;
+ }
+ if (dev_params) {
+ launch_param_update(
+ _doubled_buffer[_buf_index], dev_params + t, (copy_size), _streams[_buf_index]);
+
+ _buf_index = !_buf_index;
+ }
+ }
+ }
+}
+
+void Adam_Optimizer::Step_4(float* _params,
+ float* grads,
+ float* _exp_avg,
+ float* _exp_avg_sq,
+ size_t _param_size,
+ __half* dev_params,
+ bool half_precision)
+{
+ size_t rounded_size = 0;
+#if defined(__AVX512__) or defined(__AVX256__)
+ Step_AVX<4>(&rounded_size,
+ _params,
+ grads,
+ _exp_avg,
+ _exp_avg_sq,
+ _param_size,
+ dev_params,
+ half_precision);
+#endif
+ if (_param_size > rounded_size)
+ Step_1((_params + rounded_size),
+ (grads + rounded_size),
+ (_exp_avg + rounded_size),
+ (_exp_avg_sq + rounded_size),
+ (_param_size - rounded_size),
+ (dev_params != nullptr ? (dev_params + rounded_size) : dev_params),
+ half_precision);
+}
+
+int create_adam_optimizer(int optimizer_id,
+ float alpha = 1e-3,
+ float betta1 = 0.9,
+ float betta2 = 0.999,
+ float eps = 1e-8,
+ float weight_decay = 0,
+ bool adamw_mode = true,
+ bool should_log = false)
+{
+ auto opt =
+        std::make_shared<Adam_Optimizer>(alpha, betta1, betta2, eps, weight_decay, adamw_mode);
+
+ s_optimizers[optimizer_id] = opt;
+
+ if (should_log) {
+ std::string avx_type = "";
+#if defined(__AVX512__)
+ avx_type = "AVX512";
+#else
+#if defined(__AVX256__)
+ avx_type = "AVX2";
+#else
+ avx_type = "scalar";
+#endif
+#endif
+
+ printf("Adam Optimizer #%d is created with %s arithmetic capability.\n",
+ optimizer_id,
+ avx_type.c_str());
+ printf("Config: alpha=%f, betas=(%f, %f), weight_decay=%f, adam_w=%d\n",
+ alpha,
+ betta1,
+ betta2,
+ weight_decay,
+ (int)adamw_mode);
+ }
+
+ return 0;
+}
+
+void Adam_Optimizer::Step_8(float* _params,
+ float* grads,
+ float* _exp_avg,
+ float* _exp_avg_sq,
+ size_t _param_size,
+ __half* dev_params,
+ bool half_precision)
+{
+ size_t rounded_size = 0;
+#if defined(__AVX512__) or defined(__AVX256__)
+ Step_AVX<8>(&rounded_size,
+ _params,
+ grads,
+ _exp_avg,
+ _exp_avg_sq,
+ _param_size,
+ dev_params,
+ half_precision);
+#endif
+ if (_param_size > rounded_size)
+ Step_4((_params + rounded_size),
+ (grads + rounded_size),
+ (_exp_avg + rounded_size),
+ (_exp_avg_sq + rounded_size),
+ (_param_size - rounded_size),
+ (dev_params != nullptr ? (dev_params + rounded_size) : dev_params),
+ half_precision);
+}
+
+int ds_adam_step(int optimizer_id,
+ size_t step,
+ float lr,
+ float beta1,
+ float beta2,
+ float epsilon,
+ float weight_decay,
+ bool bias_correction,
+ torch::Tensor& params,
+ torch::Tensor& grads,
+ torch::Tensor& exp_avg,
+ torch::Tensor& exp_avg_sq)
+{
+ auto params_c = params.contiguous();
+ auto grads_c = grads.contiguous();
+ auto exp_avg_c = exp_avg.contiguous();
+ auto exp_avg_sq_c = exp_avg_sq.contiguous();
+
+ // assert(params.options().dtype() == grads.options().dtype());
+
+ float* params_ptr = (float*)params_c.data_ptr();
+ float* grads_ptr = (float*)grads_c.data_ptr();
+ float* exp_avg_ptr = (float*)exp_avg_c.data_ptr();
+ float* exp_avg_sq_ptr = (float*)exp_avg_sq_c.data_ptr();
+
+    std::shared_ptr<Adam_Optimizer> opt =
+        std::static_pointer_cast<Adam_Optimizer>(s_optimizers[optimizer_id]);
+ opt->IncrementStep(step, beta1, beta2);
+ opt->update_state(lr, epsilon, weight_decay, bias_correction);
+
+ opt->Step_8(params_ptr,
+ grads_ptr,
+ exp_avg_ptr,
+ exp_avg_sq_ptr,
+ params_c.size(0),
+ nullptr,
+ (params.options().dtype() == at::kHalf));
+
+ opt->SynchronizeStreams();
+ return 0;
+}
+
+int ds_adam_step_plus_copy(int optimizer_id,
+ size_t step,
+ float lr,
+ float beta1,
+ float beta2,
+ float epsilon,
+ float weight_decay,
+ bool bias_correction,
+ torch::Tensor& params,
+ torch::Tensor& grads,
+ torch::Tensor& exp_avg,
+ torch::Tensor& exp_avg_sq,
+ torch::Tensor& gpu_params)
+{
+ auto params_c = params.contiguous();
+ auto gpu_params_c = gpu_params.contiguous();
+ auto exp_avg_c = exp_avg.contiguous();
+ auto exp_avg_sq_c = exp_avg_sq.contiguous();
+ auto grads_c = grads.contiguous();
+
+ float* params_ptr = (float*)params_c.data_ptr();
+ float* grads_ptr = (float*)grads_c.data_ptr();
+ __half* gpu_params_ptr = (__half*)gpu_params_c.data_ptr();
+ float* exp_avg_ptr = (float*)exp_avg_c.data_ptr();
+ float* exp_avg_sq_ptr = (float*)exp_avg_sq_c.data_ptr();
+
+    std::shared_ptr<Adam_Optimizer> opt =
+        std::static_pointer_cast<Adam_Optimizer>(s_optimizers[optimizer_id]);
+ opt->IncrementStep(step, beta1, beta2);
+ opt->update_state(lr, epsilon, weight_decay, bias_correction);
+ opt->Step_8(params_ptr,
+ grads_ptr,
+ exp_avg_ptr,
+ exp_avg_sq_ptr,
+ params_c.size(0),
+ gpu_params_ptr,
+ (params.options().dtype() == at::kHalf));
+
+ opt->SynchronizeStreams();
+ return 0;
+}
+
+int destroy_adam_optimizer(int optimizer_id)
+{
+ s_optimizers.erase(optimizer_id);
+
+ return 0;
+}
+
+PYBIND11_MODULE(TORCH_EXTENSION_NAME, m)
+{
+ m.def("adam_update", &ds_adam_step, "DeepSpeed CPU Adam update (C++)");
+ m.def("adam_update_copy",
+ &ds_adam_step_plus_copy,
+ "DeepSpeed CPU Adam update and param copy (C++)");
+ m.def("create_adam", &create_adam_optimizer, "DeepSpeed CPU Adam (C++)");
+ m.def("destroy_adam", &destroy_adam_optimizer, "DeepSpeed CPU Adam destroy (C++)");
+}
diff --git a/csrc/adam/multi_tensor_adam.hip b/csrc/adam/multi_tensor_adam.hip
new file mode 100644
index 0000000000000000000000000000000000000000..f0b7ced5c29646b793f8fa904768c091fd9d749e
--- /dev/null
+++ b/csrc/adam/multi_tensor_adam.hip
@@ -0,0 +1,164 @@
+// !!! This is a file automatically generated by hipify!!!
+/* Copyright 2020 The Microsoft DeepSpeed Team
+ Copyright NVIDIA/apex
+ This file is adapted from fused adam in NVIDIA/apex, commit a109f85
+*/
+
+#include <ATen/ATen.h>
+#include <ATen/AccumulateType.h>
+#include <ATen/hip/HIPContext.h>
+#include <ATen/hip/Exceptions.h>
+// Another possibility:
+// #include <torch/all.h>
+
+#include <assert.h>
+
+#include "multi_tensor_apply_hip.cuh"
+#include "type_shim_hip.h"
+
+#define BLOCK_SIZE 512
+#define ILP 4
+
+typedef enum {
+ ADAM_MODE_0 = 0, // L2 regularization mode
+ ADAM_MODE_1 = 1 // Decoupled weight decay mode(AdamW)
+} adamMode_t;
+
+using MATH_T = float;
+
+template <typename T>
+struct AdamFunctor {
+ __device__ __forceinline__ void operator()(int chunk_size,
+ volatile int* noop_gmem,
+ TensorListMetadata<4>& tl,
+ const float beta1,
+ const float beta2,
+ const float beta1_correction,
+ const float beta2_correction,
+ const float epsilon,
+ const float lr,
+ adamMode_t mode,
+ const float decay)
+ {
+ // I'd like this kernel to propagate infs/nans.
+ // if(*noop_gmem == 1)
+ // return;
+
+ int tensor_loc = tl.block_to_tensor[blockIdx.x];
+
+ // potentially use to pass in list of scalar
+ // int tensor_num = tl.start_tensor_this_launch + tensor_loc;
+
+ int chunk_idx = tl.block_to_chunk[blockIdx.x];
+ int n = tl.sizes[tensor_loc];
+
+ T* g = (T*)tl.addresses[0][tensor_loc];
+ g += chunk_idx * chunk_size;
+
+ T* p = (T*)tl.addresses[1][tensor_loc];
+ p += chunk_idx * chunk_size;
+
+ T* m = (T*)tl.addresses[2][tensor_loc];
+ m += chunk_idx * chunk_size;
+
+ T* v = (T*)tl.addresses[3][tensor_loc];
+ v += chunk_idx * chunk_size;
+
+ n -= chunk_idx * chunk_size;
+
+ // see note in multi_tensor_scale_kernel.cu
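+        // Each thread stages ILP elements (strided by blockDim.x) into registers,
+        // applies the Adam update, and writes the results back at the end of the tile.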
+ for (int i_start = 0; i_start < n && i_start < chunk_size; i_start += blockDim.x * ILP) {
+ MATH_T r_g[ILP];
+ MATH_T r_p[ILP];
+ MATH_T r_m[ILP];
+ MATH_T r_v[ILP];
+#pragma unroll
+ for (int ii = 0; ii < ILP; ii++) {
+ int i = i_start + threadIdx.x + ii * blockDim.x;
+ if (i < n && i < chunk_size) {
+ r_g[ii] = g[i];
+ r_p[ii] = p[i];
+ r_m[ii] = m[i];
+ r_v[ii] = v[i];
+ } else {
+ r_g[ii] = MATH_T(0);
+ r_p[ii] = MATH_T(0);
+ r_m[ii] = MATH_T(0);
+ r_v[ii] = MATH_T(0);
+ }
+ }
+#pragma unroll
+ for (int ii = 0; ii < ILP; ii++) {
+ if (mode == ADAM_MODE_0) { // L2
+ r_g[ii] = r_g[ii] + (decay * r_p[ii]);
+ r_m[ii] = beta1 * r_m[ii] + (1 - beta1) * r_g[ii];
+ r_v[ii] = beta2 * r_v[ii] + (1 - beta2) * r_g[ii] * r_g[ii];
+ MATH_T next_m_unbiased = r_m[ii] / beta1_correction;
+ MATH_T next_v_unbiased = r_v[ii] / beta2_correction;
+ MATH_T denom = sqrtf(next_v_unbiased) + epsilon;
+ MATH_T update = next_m_unbiased / denom;
+ r_p[ii] = r_p[ii] - (lr * update);
+ } else { // weight decay
+ r_m[ii] = beta1 * r_m[ii] + (1 - beta1) * r_g[ii];
+ r_v[ii] = beta2 * r_v[ii] + (1 - beta2) * r_g[ii] * r_g[ii];
+ MATH_T next_m_unbiased = r_m[ii] / beta1_correction;
+ MATH_T next_v_unbiased = r_v[ii] / beta2_correction;
+ MATH_T denom = sqrtf(next_v_unbiased) + epsilon;
+ MATH_T update = (next_m_unbiased / denom) + (decay * r_p[ii]);
+ r_p[ii] = r_p[ii] - (lr * update);
+ }
+ }
+#pragma unroll
+ for (int ii = 0; ii < ILP; ii++) {
+ int i = i_start + threadIdx.x + ii * blockDim.x;
+ if (i < n && i < chunk_size) {
+ p[i] = r_p[ii];
+ m[i] = r_m[ii];
+ v[i] = r_v[ii];
+ }
+ }
+ }
+ }
+};
+
+void multi_tensor_adam_cuda(int chunk_size,
+ at::Tensor noop_flag,
+                            std::vector<std::vector<at::Tensor>> tensor_lists,
+ const float lr,
+ const float beta1,
+ const float beta2,
+ const float epsilon,
+ const int step,
+ const int mode,
+ const int bias_correction,
+ const float weight_decay)
+{
+ using namespace at;
+
+ // Handle bias correction mode
+ float bias_correction1 = 1.0f, bias_correction2 = 1.0f;
+ if (bias_correction == 1) {
+ bias_correction1 = 1 - ::pow(beta1, step);
+ bias_correction2 = 1 - ::pow(beta2, step);
+ }
+
+ // Assume single type across p,g,m1,m2 now
+ DISPATCH_DOUBLE_FLOAT_AND_HALF(tensor_lists[0][0].scalar_type(),
+ 0,
+ "adam",
+ multi_tensor_apply<4>(BLOCK_SIZE,
+ chunk_size,
+ noop_flag,
+ tensor_lists,
+                                                 AdamFunctor<scalar_t_0>(),
+ beta1,
+ beta2,
+ bias_correction1,
+ bias_correction2,
+ epsilon,
+ lr,
+ (adamMode_t)mode,
+ weight_decay);)
+
+ AT_CUDA_CHECK(hipGetLastError());
+}
diff --git a/csrc/adam/multi_tensor_apply_hip.cuh b/csrc/adam/multi_tensor_apply_hip.cuh
new file mode 100644
index 0000000000000000000000000000000000000000..09bc9971f216f73d7e33a1b75c52d2e975115743
--- /dev/null
+++ b/csrc/adam/multi_tensor_apply_hip.cuh
@@ -0,0 +1,129 @@
+// !!! This is a file automatically generated by hipify!!!
+#include "hip/hip_runtime.h"
+/* Copyright 2020 The Microsoft DeepSpeed Team
+ Copyright NVIDIA/apex
+ This file is adapted from fused adam in NVIDIA/apex, commit a109f85
+*/
+
+#include <ATen/ATen.h>
+#include <ATen/AccumulateType.h>
+#include <ATen/hip/HIPContext.h>
+#include <ATen/hip/Exceptions.h>
+#include <ATen/hip/impl/HIPGuardImplMasqueradingAsCUDA.h>
+#include "compat.h"
+
+#include <assert.h>
+
+// #include <iostream>
+
+// This header is the one-stop shop for all your multi-tensor apply needs.
+
+// TODO: Kernel arg size limit may be <4KB for some other cards (ie Jetson)
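+// These limits bound how many tensor addresses and block-to-chunk mappings fit in the
+// TensorListMetadata struct that is passed by value as a kernel argument.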
+constexpr int depth_to_max_tensors[5] = {110, 64, 48, 36, 30};
+constexpr int depth_to_max_blocks[5] = {320, 320, 320, 320, 320};
+
+template <int n>
+struct TensorListMetadata {
+ void* addresses[n][depth_to_max_tensors[n - 1]];
+ int sizes[depth_to_max_tensors[n - 1]];
+ unsigned char block_to_tensor[depth_to_max_blocks[n - 1]];
+ int block_to_chunk[depth_to_max_blocks[n - 1]]; // I fear this needs to be a full int.
+ int start_tensor_this_launch;
+};
+
+template <typename T, typename U, typename... ArgTypes>
+__global__ void multi_tensor_apply_kernel(int chunk_size,
+ volatile int* noop_flag,
+ T tl,
+ U callable,
+ ArgTypes... args)
+{
+ // Hand the chunk information to the user-supplied functor to process however it likes.
+ callable(chunk_size, noop_flag, tl, args...);
+}
+
+template <int depth, typename T, typename... ArgTypes>
+void multi_tensor_apply(int block_size,
+ int chunk_size,
+ const at::Tensor& noop_flag,
+                        const std::vector<std::vector<at::Tensor>>& tensor_lists,
+ T callable,
+ ArgTypes... args)
+{
+ TORCH_CHECK(tensor_lists.size() == depth, "tensor_lists.size() != depth");
+ int len0 = tensor_lists[0].size();
+ TORCH_CHECK(len0 > 0, "tensor_lists[0].size() is not > 0");
+ auto ref_device = tensor_lists[0][0].device();
+ TORCH_CHECK(ref_device.type() == at::kCUDA, "expected input to be on cuda");
+ for (int l = 0; l < tensor_lists.size(); l++) // No range-based for because I need indices
+ {
+ TORCH_CHECK(tensor_lists[l].size() == len0, "Size mismatch among tensor lists");
+ for (int t = 0; t < tensor_lists[l].size(); t++) {
+ // TODO: Print which tensor fails.
+ bool contiguous_memory = tensor_lists[l][t].is_contiguous();
+#ifdef VERSION_GE_1_5
+ contiguous_memory = (contiguous_memory ||
+ tensor_lists[l][t].is_contiguous(at::MemoryFormat::ChannelsLast));
+#endif
+ TORCH_CHECK(contiguous_memory, "A tensor was not contiguous.");
+ TORCH_CHECK(tensor_lists[l][t].device() == ref_device,
+ "A tensor was not on the same device as the first tensor");
+ TORCH_CHECK(tensor_lists[l][t].numel() == tensor_lists[0][t].numel(), "Size mismatch");
+ }
+ }
+
+ int ntensors = tensor_lists[0].size();
+
+    TensorListMetadata<depth> tl;
+
+ const at::hip::OptionalHIPGuardMasqueradingAsCUDA device_guard(device_of(tensor_lists[0][0]));
+ auto stream = at::hip::getCurrentHIPStreamMasqueradingAsCUDA();
+
+ tl.start_tensor_this_launch = 0;
+ int loc_block_info = 0;
+ int loc_tensor_info = 0;
+ for (int t = 0; t < ntensors; t++) {
+ tl.sizes[loc_tensor_info] = tensor_lists[0][t].numel();
+ for (int d = 0; d < depth; d++)
+ tl.addresses[d][loc_tensor_info] = tensor_lists[d][t].data_ptr();
+ loc_tensor_info++;
+
+ int chunks_this_tensor = (tensor_lists[0][t].numel() + chunk_size - 1) / chunk_size;
+
+ for (int chunk = 0; chunk < chunks_this_tensor; chunk++) {
+ // std::cout << chunks_this_tensor << std::endl;
+ tl.block_to_tensor[loc_block_info] = loc_tensor_info - 1;
+ tl.block_to_chunk[loc_block_info] = chunk;
+ loc_block_info++;
+
+ bool tensors_full = (loc_tensor_info == depth_to_max_tensors[depth - 1] &&
+ chunk == chunks_this_tensor - 1);
+ bool blocks_full = (loc_block_info == depth_to_max_blocks[depth - 1]);
+ bool last_chunk = (t == ntensors - 1 && chunk == chunks_this_tensor - 1);
+ if (tensors_full || blocks_full || last_chunk) {
+                // using accscalar_t = acc_type<scalar_t, false>;
+ hipLaunchKernelGGL(( multi_tensor_apply_kernel), dim3(loc_block_info), dim3(block_size), 0, stream,
+                    chunk_size, noop_flag.DATA_PTR<int>(), tl, callable, args...);
+
+ AT_CUDA_CHECK(hipGetLastError());
+
+ // Reset. The control flow possibilities here make my brain hurt.
+ loc_block_info = 0;
+ if (chunk == chunks_this_tensor - 1) {
+ // std::cout << "Hit case 1 " << cond1 << " " << cond2 << " " << cond3 <<
+ // std::endl;
+ loc_tensor_info = 0;
+ tl.start_tensor_this_launch = t + 1;
+ } else {
+ // std::cout << "Hit case 2 " << cond1 << " " << cond2 << " " << cond3 <<
+ // std::endl;
+ tl.sizes[0] = tl.sizes[loc_tensor_info - 1];
+ for (int d = 0; d < depth; d++)
+ tl.addresses[d][0] = tl.addresses[d][loc_tensor_info - 1];
+ loc_tensor_info = 1;
+ tl.start_tensor_this_launch = t;
+ }
+ }
+ }
+ }
+}
diff --git a/csrc/aio/common/deepspeed_aio_common.cpp b/csrc/aio/common/deepspeed_aio_common.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..9e405d8e704c3c260bd8276f648b770471bcd4b7
--- /dev/null
+++ b/csrc/aio/common/deepspeed_aio_common.cpp
@@ -0,0 +1,333 @@
+/*
+Copyright 2020 The Microsoft DeepSpeed Team
+Licensed under the MIT license.
+
+Functionality for swapping optimizer tensors to/from (NVMe) storage devices.
+*/
+
+#include <assert.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <fcntl.h>
+#include <libaio.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+#include <cassert>
+
+#include <algorithm>
+#include <chrono>
+#include <cmath>
+#include <cstring>
+#include <fstream>
+#include <iostream>
+#include <memory>
+#include <numeric>
+#include <string>
+#include <vector>
+
+#include "deepspeed_aio_common.h"
+
+using namespace std;
+using namespace std::chrono;
+
+#define DEBUG_DS_AIO_PERF 0
+#define DEBUG_DS_AIO_SUBMIT_PERF 0
+
+static const std::string c_library_name = "deepspeed_aio";
+
+static void _report_aio_statistics(const char* tag,
+                                   const std::vector<std::chrono::duration<double>>& latencies)
+ __attribute__((unused));
+
+static void _report_aio_statistics(const char* tag,
+                                   const std::vector<std::chrono::duration<double>>& latencies)
+{
+    std::vector<double> lat_usec;
+ for (auto& lat : latencies) { lat_usec.push_back(lat.count() * 1e6); }
+ const auto min_lat = *(std::min_element(lat_usec.begin(), lat_usec.end()));
+ const auto max_lat = *(std::max_element(lat_usec.begin(), lat_usec.end()));
+ const auto avg_lat = std::accumulate(lat_usec.begin(), lat_usec.end(), 0) / lat_usec.size();
+
+ std::cout << c_library_name << ": latency statistics(usec) " << tag
+ << " min/max/avg = " << min_lat << " " << max_lat << " " << avg_lat << std::endl;
+}
+
+static void _get_aio_latencies(std::vector<std::chrono::duration<double>>& raw_latencies,
+ struct deepspeed_aio_latency_t& summary_latencies)
+{
+    std::vector<double> lat_usec;
+ for (auto& lat : raw_latencies) { lat_usec.push_back(lat.count() * 1e6); }
+ summary_latencies._min_usec = *(std::min_element(lat_usec.begin(), lat_usec.end()));
+ summary_latencies._max_usec = *(std::max_element(lat_usec.begin(), lat_usec.end()));
+ summary_latencies._avg_usec =
+ std::accumulate(lat_usec.begin(), lat_usec.end(), 0) / lat_usec.size();
+}
+
+static void _do_io_submit_singles(const long long int n_iocbs,
+ const long long int iocb_index,
+                                  std::unique_ptr<aio_context>& aio_ctxt,
+                                  std::vector<std::chrono::duration<double>>& submit_times)
+{
+ for (auto i = 0; i < n_iocbs; ++i) {
+ const auto st = std::chrono::high_resolution_clock::now();
+ const auto submit_ret = io_submit(aio_ctxt->_io_ctxt, 1, aio_ctxt->_iocbs.data() + i);
+ submit_times.push_back(std::chrono::high_resolution_clock::now() - st);
+#if DEBUG_DS_AIO_SUBMIT_PERF
+ printf("submit(usec) %f io_index=%lld buf=%p len=%lu off=%llu \n",
+ submit_times.back().count() * 1e6,
+ iocb_index,
+ aio_ctxt->_iocbs[i]->u.c.buf,
+ aio_ctxt->_iocbs[i]->u.c.nbytes,
+ aio_ctxt->_iocbs[i]->u.c.offset);
+#endif
+ assert(submit_ret > 0);
+ }
+}
+
+static void _do_io_submit_block(const long long int n_iocbs,
+ const long long int iocb_index,
+                                std::unique_ptr<aio_context>& aio_ctxt,
+                                std::vector<std::chrono::duration<double>>& submit_times)
+{
+ const auto st = std::chrono::high_resolution_clock::now();
+ const auto submit_ret = io_submit(aio_ctxt->_io_ctxt, n_iocbs, aio_ctxt->_iocbs.data());
+ submit_times.push_back(std::chrono::high_resolution_clock::now() - st);
+#if DEBUG_DS_AIO_SUBMIT_PERF
+ printf("submit(usec) %f io_index=%lld nr=%lld buf=%p len=%lu off=%llu \n",
+ submit_times.back().count() * 1e6,
+ iocb_index,
+ n_iocbs,
+ aio_ctxt->_iocbs[0]->u.c.buf,
+ aio_ctxt->_iocbs[0]->u.c.nbytes,
+ aio_ctxt->_iocbs[0]->u.c.offset);
+#endif
+ assert(submit_ret > 0);
+}
+
+static int _do_io_complete(const long long int min_completes,
+ const long long int max_completes,
+                           std::unique_ptr<aio_context>& aio_ctxt,
+                           std::vector<std::chrono::duration<double>>& reap_times)
+{
+ const auto start_time = std::chrono::high_resolution_clock::now();
+ const auto n_completes = io_getevents(
+ aio_ctxt->_io_ctxt, min_completes, max_completes, aio_ctxt->_io_events.data(), nullptr);
+ reap_times.push_back(std::chrono::high_resolution_clock::now() - start_time);
+
+ assert(n_completes >= min_completes);
+ return n_completes;
+}
+
+void do_aio_operation_sequential(const bool read_op,
+                                 std::unique_ptr<aio_context>& aio_ctxt,
+                                 std::unique_ptr<io_xfer_ctxt>& xfer_ctxt,
+ deepspeed_aio_config_t* config,
+ deepspeed_aio_perf_t* perf)
+{
+ struct io_prep_context prep_ctxt(read_op, xfer_ctxt, aio_ctxt->_block_size, &aio_ctxt->_iocbs);
+
+    const auto num_io_blocks = static_cast<long long int>(
+        ceil(static_cast<double>(xfer_ctxt->_num_bytes) / aio_ctxt->_block_size));
+#if DEBUG_DS_AIO_PERF
+ const auto io_op_name = std::string(read_op ? "read" : "write");
+ std::cout << c_library_name << ": start " << io_op_name << " " << xfer_ctxt->_num_bytes
+ << " bytes with " << num_io_blocks << " io blocks" << std::endl;
+#endif
+
+    std::vector<std::chrono::duration<double>> submit_times;
+    std::vector<std::chrono::duration<double>> reap_times;
+    const auto max_queue_bytes =
+        static_cast<long long int>(aio_ctxt->_queue_depth * aio_ctxt->_block_size);
+
+ auto start = std::chrono::high_resolution_clock::now();
+ for (long long iocb_index = 0; iocb_index < num_io_blocks;
+ iocb_index += aio_ctxt->_queue_depth) {
+ const auto start_offset = iocb_index * aio_ctxt->_block_size;
+ const auto start_buffer = (char*)xfer_ctxt->_mem_buffer + start_offset;
+ const auto n_iocbs =
+            min(static_cast<long long int>(aio_ctxt->_queue_depth), (num_io_blocks - iocb_index));
+ const auto num_bytes = min(max_queue_bytes, (xfer_ctxt->_num_bytes - start_offset));
+ prep_ctxt.prep_iocbs(n_iocbs, num_bytes, start_buffer, start_offset);
+
+ if (config->_single_submit) {
+ _do_io_submit_singles(n_iocbs, iocb_index, aio_ctxt, submit_times);
+ } else {
+ _do_io_submit_block(n_iocbs, iocb_index, aio_ctxt, submit_times);
+ }
+
+ _do_io_complete(n_iocbs, n_iocbs, aio_ctxt, reap_times);
+ }
+    const std::chrono::duration<double> elapsed = std::chrono::high_resolution_clock::now() - start;
+
+ if (perf) {
+ _get_aio_latencies(submit_times, perf->_submit);
+ _get_aio_latencies(reap_times, perf->_complete);
+ perf->_e2e_usec = elapsed.count() * 1e6;
+ perf->_e2e_rate_GB = (xfer_ctxt->_num_bytes / elapsed.count() / 1e9);
+ }
+
+#if DEBUG_DS_AIO_PERF
+ _report_aio_statistics("submit", submit_times);
+ _report_aio_statistics("complete", reap_times);
+#endif
+
+#if DEBUG_DS_AIO_PERF
+ std::cout << c_library_name << ": runtime(usec) " << elapsed.count() * 1e6
+ << " rate(GB/sec) = " << (xfer_ctxt->_num_bytes / elapsed.count() / 1e9) << std::endl;
+#endif
+
+#if DEBUG_DS_AIO_PERF
+ std::cout << c_library_name << ": finish " << io_op_name << " " << xfer_ctxt->_num_bytes
+ << " bytes " << std::endl;
+#endif
+}
+
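+// Overlapped strategy: keep the queue topped up by submitting fresh iocbs as completions
+// free slots, reaping at least min_completes events per iteration until nothing is pending.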
+void do_aio_operation_overlap(const bool read_op,
+                              std::unique_ptr<aio_context>& aio_ctxt,
+                              std::unique_ptr<io_xfer_ctxt>& xfer_ctxt,
+ deepspeed_aio_config_t* config,
+ deepspeed_aio_perf_t* perf)
+{
+ struct io_prep_generator io_gen(read_op, xfer_ctxt, aio_ctxt->_block_size);
+
+#if DEBUG_DS_AIO_PERF
+ const auto io_op_name = std::string(read_op ? "read" : "write");
+ std::cout << c_library_name << ": start " << io_op_name << " " << xfer_ctxt->_num_bytes
+ << " bytes with " << io_gen._num_io_blocks << " io blocks" << std::endl;
+#endif
+
+    std::vector<std::chrono::duration<double>> submit_times;
+    std::vector<std::chrono::duration<double>> reap_times;
+
+ auto request_iocbs = aio_ctxt->_queue_depth;
+ auto n_pending_iocbs = 0;
+ const auto min_completes = 1;
+ auto start = std::chrono::high_resolution_clock::now();
+ while (true) {
+ const auto n_iocbs = io_gen.prep_iocbs(request_iocbs - n_pending_iocbs, &aio_ctxt->_iocbs);
+ if (n_iocbs > 0) {
+ if (config->_single_submit) {
+ _do_io_submit_singles(
+ n_iocbs, (io_gen._next_iocb_index - n_iocbs), aio_ctxt, submit_times);
+ } else {
+ _do_io_submit_block(
+ n_iocbs, (io_gen._next_iocb_index - n_iocbs), aio_ctxt, submit_times);
+ }
+ }
+
+ n_pending_iocbs += n_iocbs;
+ assert(n_pending_iocbs <= aio_ctxt->_queue_depth);
+
+ if (n_pending_iocbs == 0) { break; }
+
+ const auto n_complete =
+ _do_io_complete(min_completes, n_pending_iocbs, aio_ctxt, reap_times);
+ n_pending_iocbs -= n_complete;
+ }
+
+    const std::chrono::duration<double> elapsed = std::chrono::high_resolution_clock::now() - start;
+
+ if (perf) {
+ _get_aio_latencies(submit_times, perf->_submit);
+ _get_aio_latencies(reap_times, perf->_complete);
+ perf->_e2e_usec = elapsed.count() * 1e6;
+ perf->_e2e_rate_GB = (xfer_ctxt->_num_bytes / elapsed.count() / 1e9);
+ }
+
+#if DEBUG_DS_AIO_PERF
+ _report_aio_statistics("submit", submit_times);
+ _report_aio_statistics("complete", reap_times);
+#endif
+
+#if DEBUG_DS_AIO_PERF
+ std::cout << c_library_name << ": runtime(usec) " << elapsed.count() * 1e6
+ << " rate(GB/sec) = " << (xfer_ctxt->_num_bytes / elapsed.count() / 1e9) << std::endl;
+#endif
+
+#if DEBUG_DS_AIO_PERF
+ std::cout << c_library_name << ": finish " << io_op_name << " " << xfer_ctxt->_num_bytes
+ << " bytes " << std::endl;
+#endif
+}
+
+void report_file_error(const char* filename, const std::string file_op, const int error_code)
+{
+ std::string err_msg = file_op + std::string(" failed on ") + std::string(filename) +
+ " error = " + std::to_string(error_code);
+ std::cerr << c_library_name << ": " << err_msg << std::endl;
+}
+
+int open_file(const char* filename, const bool read_op)
+{
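+    // O_DIRECT bypasses the page cache, so callers must supply page-aligned buffers
+    // (see ds_page_aligned_alloc).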
+ const int flags = read_op ? (O_RDONLY | __O_DIRECT) : (O_WRONLY | O_CREAT | __O_DIRECT);
+ const int mode = 0600;
+ const auto fd = open(filename, flags, mode);
+ if (fd == -1) {
+ const auto error_code = errno;
+ const auto error_msg = read_op ? " open for read " : " open for write ";
+ report_file_error(filename, error_msg, error_code);
+ return -1;
+ }
+ return fd;
+}
+
+int regular_read(const char* filename, std::vector<char>& buffer)
+{
+ long long int num_bytes;
+ const auto f_size = get_file_size(filename, num_bytes);
+ assert(f_size != -1);
+ buffer.resize(num_bytes);
+ const auto fd = open(filename, O_RDONLY, 0600);
+ assert(fd != -1);
+ long long int read_bytes = 0;
+ auto r = 0;
+ do {
+ const auto buffer_ptr = buffer.data() + read_bytes;
+ const auto bytes_to_read = num_bytes - read_bytes;
+ r = read(fd, buffer_ptr, bytes_to_read);
+ read_bytes += r;
+ } while (r > 0);
+
+ if (read_bytes != num_bytes) {
+ std::cerr << "read error "
+ << " read_bytes (read) = " << read_bytes << " num_bytes (fstat) = " << num_bytes
+ << std::endl;
+ }
+ assert(read_bytes == num_bytes);
+ close(fd);
+ return 0;
+}
+
+static bool _validate_buffer(const char* filename, void* aio_buffer, const long long int num_bytes)
+{
+    std::vector<char> regular_buffer;
+ const auto reg_ret = regular_read(filename, regular_buffer);
+ assert(0 == reg_ret);
+ std::cout << "regular read of " << filename << " returned " << regular_buffer.size() << " bytes"
+ << std::endl;
+
+    if (static_cast<long long int>(regular_buffer.size()) != num_bytes) { return false; }
+
+ return (0 == memcmp(aio_buffer, regular_buffer.data(), regular_buffer.size()));
+}
+
+bool validate_aio_operation(const bool read_op,
+ const char* filename,
+ void* aio_buffer,
+ const long long int num_bytes)
+{
+ const auto msg_suffix = std::string("deepspeed_aio_") +
+ std::string(read_op ? "read()" : "write()") +
+ std::string("using read()");
+
+ if (false == _validate_buffer(filename, aio_buffer, num_bytes)) {
+ std::cout << "Fail: correctness of " << msg_suffix << std::endl;
+ return false;
+ }
+
+ std::cout << "Pass: correctness of " << msg_suffix << std::endl;
+ return true;
+}
diff --git a/csrc/aio/common/deepspeed_aio_common.h b/csrc/aio/common/deepspeed_aio_common.h
new file mode 100644
index 0000000000000000000000000000000000000000..cc62d33765c804e88816791c72a3477278738e76
--- /dev/null
+++ b/csrc/aio/common/deepspeed_aio_common.h
@@ -0,0 +1,36 @@
+/*
+Copyright 2020 The Microsoft DeepSpeed Team
+Licensed under the MIT license.
+
+Functionality for swapping optimizer tensors to/from (NVMe) storage devices.
+*/
+
+#include <deepspeed_aio_utils.h>
+#include <memory>
+#include <string>
+#include <vector>
+
+using namespace std;
+
+void do_aio_operation_sequential(const bool read_op,
+                                 std::unique_ptr<aio_context>& aio_ctxt,
+                                 std::unique_ptr<io_xfer_ctxt>& xfer_ctxt,
+ deepspeed_aio_config_t* config,
+ deepspeed_aio_perf_t* perf);
+
+void do_aio_operation_overlap(const bool read_op,
+                              std::unique_ptr<aio_context>& aio_ctxt,
+                              std::unique_ptr<io_xfer_ctxt>& xfer_ctxt,
+ deepspeed_aio_config_t* config,
+ deepspeed_aio_perf_t* perf);
+
+int open_file(const char* filename, const bool read_op);
+
+void report_file_error(const char* filename, const std::string file_op, const int error_code);
+
+int regular_read(const char* filename, std::vector<char>& buffer);
+
+bool validate_aio_operation(const bool read_op,
+ const char* filename,
+ void* aio_buffer,
+ const long long int num_bytes);
diff --git a/csrc/aio/common/deepspeed_aio_types.cpp b/csrc/aio/common/deepspeed_aio_types.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..e5811bb91149fad40422692ac7cde6f9348e0029
--- /dev/null
+++ b/csrc/aio/common/deepspeed_aio_types.cpp
@@ -0,0 +1,74 @@
+/*
+Copyright 2020 The Microsoft DeepSpeed Team
+Licensed under the MIT license.
+
+Functionality for swapping optimizer tensors to/from (NVMe) storage devices.
+*/
+
+#include
+
+#include "deepspeed_aio_utils.h"
+
+using namespace std;
+
+const int c_block_size = 128 * 1024;
+const int c_io_queue_depth = 8;
+
+deepspeed_aio_config_t::deepspeed_aio_config_t()
+ : _block_size(c_block_size),
+ _queue_depth(c_io_queue_depth),
+ _single_submit(false),
+ _overlap_events(false),
+ _lock_memory(false)
+{
+}
+
+deepspeed_aio_config_t::deepspeed_aio_config_t(const int block_size,
+ const int queue_depth,
+ const bool single_submit,
+ const bool overlap_events,
+ const bool lock_memory)
+ : _block_size(block_size),
+ _queue_depth(queue_depth),
+ _single_submit(single_submit),
+ _overlap_events(overlap_events),
+ _lock_memory(lock_memory)
+{
+}
+
+void deepspeed_aio_latency_t::dump(const std::string tag)
+{
+ std::cout << tag << _min_usec << " " << _max_usec << " " << _avg_usec << " " << std::endl;
+}
+
+void deepspeed_aio_latency_t::accumulate(const struct deepspeed_aio_latency_t& other)
+{
+ _min_usec += other._min_usec;
+ _max_usec += other._max_usec;
+ _avg_usec += other._avg_usec;
+}
+
+void deepspeed_aio_latency_t::scale(const float scaler)
+{
+ _min_usec *= scaler;
+ _max_usec *= scaler;
+ _avg_usec *= scaler;
+}
+
+aio_context::aio_context(const int block_size, const int queue_depth)
+{
+ _block_size = block_size;
+ _queue_depth = queue_depth;
+ for (auto i = 0; i < queue_depth; ++i) {
+ _iocbs.push_back((struct iocb*)calloc(1, sizeof(struct iocb)));
+ }
+ _io_events.resize(queue_depth);
+ io_queue_init(queue_depth, &_io_ctxt);
+}
+
+aio_context::~aio_context()
+{
+ for (auto& iocb : _iocbs) { free(iocb); }
+ _io_events.resize(0);
+ io_queue_release(_io_ctxt);
+}
diff --git a/csrc/aio/common/deepspeed_aio_types.h b/csrc/aio/common/deepspeed_aio_types.h
new file mode 100644
index 0000000000000000000000000000000000000000..be3b352d6be20733f7e03a821378a648384be0b5
--- /dev/null
+++ b/csrc/aio/common/deepspeed_aio_types.h
@@ -0,0 +1,57 @@
+/*
+Copyright 2020 The Microsoft DeepSpeed Team
+Licensed under the MIT license.
+
+Functionality for swapping optimizer tensors to/from (NVMe) storage devices.
+*/
+
+#include <libaio.h>
+#include <stdlib.h>
+
+#include <string>
+#include <vector>
+
+using namespace std;
+
+struct deepspeed_aio_latency_t {
+ double _min_usec;
+ double _max_usec;
+ double _avg_usec;
+
+ void dump(const std::string tag);
+ void accumulate(const deepspeed_aio_latency_t&);
+ void scale(const float value);
+};
+
+struct deepspeed_aio_perf_t {
+ deepspeed_aio_latency_t _submit;
+ deepspeed_aio_latency_t _complete;
+ double _e2e_usec;
+ double _e2e_rate_GB;
+};
+
+struct deepspeed_aio_config_t {
+ const int _block_size;
+ const int _queue_depth;
+ const bool _single_submit;
+ const bool _overlap_events;
+ const bool _lock_memory;
+
+ deepspeed_aio_config_t();
+ deepspeed_aio_config_t(const int block_size,
+ const int queue_depth,
+ const bool single_submit,
+ const bool overlap_events,
+ const bool lock_memory);
+};
+
+struct aio_context {
+ io_context_t _io_ctxt;
+    std::vector<struct io_event> _io_events;
+    std::vector<struct iocb*> _iocbs;
+ int _block_size;
+ int _queue_depth;
+
+ aio_context(const int block_size, const int queue_depth);
+ ~aio_context();
+};
diff --git a/csrc/aio/common/deepspeed_aio_utils.cpp b/csrc/aio/common/deepspeed_aio_utils.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..200c7030f120366c2e2a45cb6cc20785ec4518fd
--- /dev/null
+++ b/csrc/aio/common/deepspeed_aio_utils.cpp
@@ -0,0 +1,123 @@
+/*
+Copyright 2020 The Microsoft DeepSpeed Team
+Licensed under the MIT license.
+
+Functionality for swapping optimizer tensors to/from (NVMe) storage devices.
+*/
+
+#include <cmath>
+
+#include "deepspeed_aio_utils.h"
+
+using namespace std;
+
+const int c_block_size = 128 * 1024;
+const int c_io_queue_depth = 8;
+
+io_xfer_ctxt::io_xfer_ctxt(const int fd,
+ const long long int file_offset,
+ const long long int num_bytes,
+ const void* buffer)
+ : _fd(fd), _base_offset(file_offset), _mem_buffer(buffer), _num_bytes(num_bytes)
+{
+}
+
+io_prep_context::io_prep_context(const bool read_op,
+ const std::unique_ptr<io_xfer_ctxt>& xfer_ctxt,
+ const size_t block_size,
+ const std::vector<struct iocb*>* iocbs)
+ : _read_op(read_op), _xfer_ctxt(xfer_ctxt), _block_size(block_size), _iocbs(iocbs)
+{
+}
+
+void io_prep_context::prep_iocbs(const int n_iocbs,
+ const size_t num_bytes,
+ const void* start_buffer,
+ const long long int start_offset)
+{
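+ // Split the transfer into _block_size chunks: iocb i covers block i, and the final block is trimmed to the remaining bytes.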
+ assert(static_cast<size_t>(n_iocbs) <= _iocbs->size());
+ for (auto i = 0; i < n_iocbs; ++i) {
+ const auto shift = i * _block_size;
+ const auto xfer_buffer = (char*)start_buffer + _xfer_ctxt->_base_offset + shift;
+ const auto xfer_offset = _xfer_ctxt->_base_offset + start_offset + shift;
+ auto byte_count = _block_size;
+ if ((shift + _block_size) > num_bytes) { byte_count = num_bytes - shift; }
+
+ if (_read_op) {
+ io_prep_pread(_iocbs->at(i), _xfer_ctxt->_fd, xfer_buffer, byte_count, xfer_offset);
+ } else {
+ io_prep_pwrite(_iocbs->at(i), _xfer_ctxt->_fd, xfer_buffer, byte_count, xfer_offset);
+ }
+ }
+}
+
+io_prep_generator::io_prep_generator(const bool read_op,
+ const std::unique_ptr<io_xfer_ctxt>& xfer_ctxt,
+ const size_t block_size)
+ : _read_op(read_op),
+ _xfer_ctxt(xfer_ctxt),
+ _block_size(block_size),
+ _remaining_bytes(xfer_ctxt->_num_bytes),
+ _next_iocb_index(0)
+{
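+ // Total block count is the ceiling of num_bytes / block_size; prep_iocbs() then hands out up to n_iocbs blocks per call.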
+ _num_io_blocks =
+ static_cast<long long int>(ceil(static_cast<double>(xfer_ctxt->_num_bytes) / block_size));
+ _remaining_io_blocks = _num_io_blocks;
+}
+
+int io_prep_generator::prep_iocbs(const int n_iocbs, std::vector<struct iocb*>* iocbs)
+{
+ if ((_remaining_bytes) == 0 || (_remaining_io_blocks == 0)) {
+ assert(static_cast<long long int>(_remaining_bytes) == _remaining_io_blocks);
+ return 0;
+ }
+
+ assert(static_cast<size_t>(n_iocbs) <= iocbs->size());
+
+ auto actual_n_iocbs = min(static_cast<long long int>(n_iocbs), _remaining_io_blocks);
+ for (auto i = 0; i < actual_n_iocbs; ++i, ++_next_iocb_index) {
+ const auto xfer_offset = _xfer_ctxt->_base_offset + (_next_iocb_index * _block_size);
+ const auto xfer_buffer = (char*)_xfer_ctxt->_mem_buffer + xfer_offset;
+ const auto num_bytes = min(static_cast<long long int>(_block_size), _remaining_bytes);
+
+ if (_read_op) {
+ io_prep_pread(iocbs->at(i), _xfer_ctxt->_fd, xfer_buffer, num_bytes, xfer_offset);
+ } else {
+ io_prep_pwrite(iocbs->at(i), _xfer_ctxt->_fd, xfer_buffer, num_bytes, xfer_offset);
+ }
+ _remaining_bytes -= num_bytes;
+ }
+ _remaining_io_blocks -= actual_n_iocbs;
+
+ return actual_n_iocbs;
+}
+
+int get_file_size(const char* filename, long long int& size)
+{
+ struct stat st;
+ if (stat(filename, &st) == -1) { return -1; }
+ size = st.st_size;
+ return 0;
+}
+
+void* ds_page_aligned_alloc(const size_t size, const bool lock)
+{
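+ // Page-aligned allocation for AIO transfers; optionally mlock() the pages so they stay resident in memory.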
+ void* ptr;
+ int retval;
+
+ retval = posix_memalign(&ptr, (size_t)sysconf(_SC_PAGESIZE), size);
+ if (retval) { return nullptr; }
+
+ if (lock == false) { return ptr; }
+
+ auto mlock_ret = mlock(ptr, size);
+ if (mlock_ret != 0) {
+ auto mlock_error = errno;
+ printf("mlock failed with %d %s\n", mlock_error, strerror(mlock_error));
+
+ free(ptr);
+ return nullptr;
+ }
+
+ return ptr;
+}
diff --git a/csrc/aio/common/deepspeed_aio_utils.h b/csrc/aio/common/deepspeed_aio_utils.h
new file mode 100644
index 0000000000000000000000000000000000000000..6c5952749dd33d5e0059c209dc14ea755424da23
--- /dev/null
+++ b/csrc/aio/common/deepspeed_aio_utils.h
@@ -0,0 +1,77 @@
+/*
+Copyright 2020 The Microsoft DeepSpeed Team
+Licensed under the MIT license.
+
+Functionality for swapping optimizer tensors to/from (NVMe) storage devices.
+*/
+
+#pragma once
+
+#include <assert.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <fcntl.h>
+#include <libaio.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include <algorithm>
+#include <cerrno>
+#include <cstdio>
+#include <cstring>
+#include <iostream>
+#include <memory>
+#include <vector>
+
+struct io_xfer_ctxt {
+ const int _fd;
+ const long long int _base_offset;
+ const void* _mem_buffer;
+ const long long int _num_bytes;
+
+ io_xfer_ctxt(const int fd,
+ const long long int file_offset,
+ const long long int num_bytes,
+ const void* buffer);
+};
+
+struct io_prep_context {
+ const bool _read_op;
+ const std::unique_ptr<io_xfer_ctxt>& _xfer_ctxt;
+ const size_t _block_size;
+ const std::vector<struct iocb*>* _iocbs;
+
+ io_prep_context(const bool read_op,
+ const std::unique_ptr<io_xfer_ctxt>& xfer_ctxt,
+ const size_t block_size,
+ const std::vector<struct iocb*>* iocbs);
+
+ void prep_iocbs(const int n_iocbs,
+ const size_t num_bytes,
+ const void* start_buffer,
+ const long long int start_offset);
+};
+
+struct io_prep_generator {
+ const bool _read_op;
+ const std::unique_ptr<io_xfer_ctxt>& _xfer_ctxt;
+ const size_t _block_size;
+
+ long long int _remaining_bytes;
+ long long int _num_io_blocks;
+ long long int _remaining_io_blocks;
+ long long int _next_iocb_index;
+
+ io_prep_generator(const bool read_op,
+ const std::unique_ptr<io_xfer_ctxt>& xfer_ctxt,
+ const size_t block_size);
+
+ int prep_iocbs(const int n_iocbs, std::vector<struct iocb*>* iocbs);
+};
+
+void* ds_page_aligned_alloc(const size_t size, const bool lock = false);
+
+int get_file_size(const char* filename, long long int& size);
diff --git a/csrc/aio/py_lib/deepspeed_aio_thread.cpp b/csrc/aio/py_lib/deepspeed_aio_thread.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..a2670fb7b4cbc6635c25d71e9b5f9bfd265f59c9
--- /dev/null
+++ b/csrc/aio/py_lib/deepspeed_aio_thread.cpp
@@ -0,0 +1,84 @@
+/*
+Copyright 2020 The Microsoft DeepSpeed Team
+Licensed under the MIT license.
+
+Functionality for swapping optimizer tensors to/from (NVMe) storage devices.
+*/
+
+#include "deepspeed_aio_thread.h"
+
+using namespace std;
+
+io_op_desc_t::io_op_desc_t(const bool read_op,
+ const torch::Tensor& buffer,
+ const int fd,
+ const char* filename,
+ const long long int num_bytes,
+ const bool validate)
+ : _read_op(read_op),
+ _buffer(buffer),
+ _fd(fd),
+ _filename(filename),
+ _num_bytes(num_bytes),
+ _validate(validate)
+{
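+ // CUDA tensors are staged through a pinned CPU copy so the AIO path always operates on host memory; fini() copies read data back to the device.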
+ _cpu_buffer = _buffer.is_cuda() ? _buffer.to(torch::kCPU).pin_memory() : _buffer;
+ _contiguous_buffer = _cpu_buffer.contiguous();
+}
+
+char* io_op_desc_t::data_ptr() const { return (char*)_contiguous_buffer.data_ptr(); }
+
+void io_op_desc_t::fini()
+{
+ if (_read_op && _buffer.is_cuda()) { _buffer.copy_(_cpu_buffer.to(torch::kCUDA)); }
+}
+
+deepspeed_aio_thread_t::deepspeed_aio_thread_t(const int tid, deepspeed_aio_config_t& aio_config)
+ : _tid(tid),
+ _aio_config(aio_config),
+ _aio_ctxt(new aio_context(aio_config._block_size, aio_config._queue_depth)),
+ _time_to_exit(false)
+{
+}
+
+deepspeed_aio_thread_t::~deepspeed_aio_thread_t() {}
+
+void deepspeed_aio_thread_t::run()
+{
+ while (true) {
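+ // Worker loop: sleep until work is queued (or exit is requested), service one op, then publish its completion.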
+ std::shared_ptr<struct io_op_desc_t> next_io_op = nullptr;
+
+ {
+ std::unique_lock<std::mutex> lock(_work_sync._mutex);
+ _work_sync._cond_var.wait(lock,
+ [this] { return (!_work_queue.empty() || _time_to_exit); });
+ if (!_work_queue.empty()) {
+ next_io_op = _work_queue.front();
+ _work_queue.pop();
+ }
+ }
+
+ if (next_io_op) {
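+ // Each worker transfers its own contiguous slice of the file; _num_bytes is the per-thread byte count, so the slice starts at tid * _num_bytes.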
+ const auto base_offset = next_io_op->_num_bytes * _tid;
+
+ std::unique_ptr<io_xfer_ctxt> xfer_ctxt(new io_xfer_ctxt(
+ next_io_op->_fd, base_offset, next_io_op->_num_bytes, next_io_op->data_ptr()));
+
+ if (_aio_config._overlap_events) {
+ do_aio_operation_overlap(
+ next_io_op->_read_op, _aio_ctxt, xfer_ctxt, &_aio_config, nullptr);
+ } else {
+ do_aio_operation_sequential(
+ next_io_op->_read_op, _aio_ctxt, xfer_ctxt, &_aio_config, nullptr);
+ }
+
+ {
+ std::lock_guard<std::mutex> lock(_complete_sync._mutex);
+ _complete_queue.push(next_io_op);
+ }
+ _complete_sync._cond_var.notify_one();
+ }
+
+ if (_time_to_exit) { break; }
+ }
+}
diff --git a/csrc/aio/py_lib/deepspeed_aio_thread.h b/csrc/aio/py_lib/deepspeed_aio_thread.h
new file mode 100644
index 0000000000000000000000000000000000000000..d1cfcab8bfc2446921422b83efa100444ce0dd31
--- /dev/null
+++ b/csrc/aio/py_lib/deepspeed_aio_thread.h
@@ -0,0 +1,57 @@
+/*
+Copyright 2020 The Microsoft DeepSpeed Team
+Licensed under the MIT license.
+
+Functionality for swapping optimizer tensors to/from (NVMe) storage devices.
+*/
+
+#include <condition_variable>
+#include <memory>
+#include <queue>
+#include "deepspeed_py_aio.h"
+
+struct io_op_desc_t {
+ const bool _read_op;
+ torch::Tensor _buffer;
+ int _fd;
+ const std::string _filename;
+ const long long int _num_bytes;
+ torch::Tensor _cpu_buffer;
+ torch::Tensor _contiguous_buffer;
+ const bool _validate;
+
+ io_op_desc_t(const bool read_op,
+ const torch::Tensor& buffer,
+ const int fd,
+ const char* filename,
+ const long long int num_bytes,
+ const bool validate);
+
+ char* data_ptr() const;
+ void fini();
+};
+
+struct thread_sync_t {
+ std::mutex _mutex;
+ std::condition_variable _cond_var;
+};
+
+struct deepspeed_aio_thread_t {
+ const int _tid;
+ deepspeed_aio_config_t& _aio_config;
+
+ std::unique_ptr<struct aio_context> _aio_ctxt;
+ std::queue<std::shared_ptr<struct io_op_desc_t>> _work_queue;
+ std::queue<std::shared_ptr<struct io_op_desc_t>> _complete_queue;
+
+ bool _time_to_exit;
+
+ struct thread_sync_t _work_sync;
+ struct thread_sync_t _complete_sync;
+
+ deepspeed_aio_thread_t(const int tid, deepspeed_aio_config_t& aio_config);
+
+ ~deepspeed_aio_thread_t();
+
+ void run();
+};
diff --git a/csrc/aio/py_lib/deepspeed_py_aio.cpp b/csrc/aio/py_lib/deepspeed_py_aio.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..49ff1f240c433288a0e12c64389887c65926ad83
--- /dev/null
+++ b/csrc/aio/py_lib/deepspeed_py_aio.cpp
@@ -0,0 +1,121 @@
+
+/*
+Copyright 2020 The Microsoft DeepSpeed Team
+Licensed under the MIT license.
+
+Functionality for swapping optimizer tensors to/from (NVMe) storage devices.
+*/
+
+#include <assert.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <fcntl.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include <cassert>
+#include <cerrno>
+#include <chrono>
+#include <cstring>
+#include <iostream>
+#include <memory>
+#include <string>
+#include <vector>
+
+#include "deepspeed_py_aio.h"
+
+using namespace std;
+using namespace std::chrono;
+
+#define DEBUG_DS_AIO_READ 0
+#define DEBUG_DS_AIO_WRITE 0
+
+static const std::string c_library_name = "deepspeed_aio";
+
+int deepspeed_py_aio_write(const torch::Tensor& buffer,
+ const char* filename,
+ const int block_size,
+ const int queue_depth,
+ const bool single_submit,
+ const bool overlap_events,
+ const bool validate)
+{
+ const auto start_time = std::chrono::high_resolution_clock::now();
+ deepspeed_aio_config_t config(block_size, queue_depth, single_submit, overlap_events, false);
+
+ const auto fd = open_file(filename, false);
+ if (fd == -1) { return -1; }
+
+ auto write_buffer = (char*)buffer.data_ptr();
+ const auto num_write_bytes = static_cast<long long int>(buffer.nbytes());
+ std::unique_ptr<io_xfer_ctxt> xfer_ctxt(new io_xfer_ctxt(fd, 0, num_write_bytes, write_buffer));
+ std::unique_ptr<aio_context> aio_ctxt(new aio_context(config._block_size, config._queue_depth));
+
+ if (config._overlap_events) {
+ do_aio_operation_overlap(false, aio_ctxt, xfer_ctxt, &config, nullptr);
+ } else {
+ do_aio_operation_sequential(false, aio_ctxt, xfer_ctxt, &config, nullptr);
+ }
+ const std::chrono::duration<double> aio_time =
+ std::chrono::high_resolution_clock::now() - start_time;
+
+ close(fd);
+
+ if (validate) { validate_aio_operation(false, filename, write_buffer, num_write_bytes); }
+
+ const std::chrono::duration<double> fn_time =
+ std::chrono::high_resolution_clock::now() - start_time;
+ std::cout << "Elapsed time(usec): "
+ << "aio = " << aio_time.count() * 1e6 << " call = " << fn_time.count() * 1e6
+ << std::endl;
+ return 0;
+}
+
+int deepspeed_py_aio_read(torch::Tensor& buffer,
+ const char* filename,
+ const int block_size,
+ const int queue_depth,
+ const bool single_submit,
+ const bool overlap_events,
+ const bool validate)
+{
+ const auto start_time = std::chrono::high_resolution_clock::now();
+ long long num_file_bytes;
+ if (-1 == get_file_size(filename, num_file_bytes)) {
+ const auto error_code = errno;
+ report_file_error(filename, " fstat for read", error_code);
+ return -1;
+ }
+
+ deepspeed_aio_config_t config(block_size, queue_depth, single_submit, overlap_events, false);
+ const auto fd = open_file(filename, true);
+ if (fd == -1) { return -1; }
+
+ auto read_buffer = (char*)buffer.data_ptr();
+ assert(static_cast<long long int>(buffer.nbytes()) == num_file_bytes);
+
+ std::unique_ptr<io_xfer_ctxt> xfer_ctxt(new io_xfer_ctxt(fd, 0, num_file_bytes, read_buffer));
+ std::unique_ptr<aio_context> aio_ctxt(new aio_context(config._block_size, config._queue_depth));
+
+ if (config._overlap_events) {
+ do_aio_operation_overlap(true, aio_ctxt, xfer_ctxt, &config, nullptr);
+ } else {
+ do_aio_operation_sequential(true, aio_ctxt, xfer_ctxt, &config, nullptr);
+ }
+ const std::chrono::duration<double> aio_time =
+ std::chrono::high_resolution_clock::now() - start_time;
+
+ close(fd);
+
+ if (validate) { validate_aio_operation(true, filename, read_buffer, num_file_bytes); }
+
+ const std::chrono::duration<double> fn_time =
+ std::chrono::high_resolution_clock::now() - start_time;
+ std::cout << "Elapsed time(usec): "
+ << "aio = " << aio_time.count() * 1e6 << " call = " << fn_time.count() * 1e6
+ << std::endl;
+ return 0;
+}
diff --git a/csrc/aio/py_lib/deepspeed_py_aio.h b/csrc/aio/py_lib/deepspeed_py_aio.h
new file mode 100644
index 0000000000000000000000000000000000000000..230d88da9763a0130554ca83c5e3b1a5d914116f
--- /dev/null
+++ b/csrc/aio/py_lib/deepspeed_py_aio.h
@@ -0,0 +1,27 @@
+
+/*
+Copyright 2020 The Microsoft DeepSpeed Team
+Licensed under the MIT license.
+
+Functionality for swapping optimizer tensors to/from (NVMe) storage devices.
+*/
+
+#include <deepspeed_aio_common.h>
+#include <stdlib.h>
+#include <torch/extension.h>
+
+int deepspeed_py_aio_write(const torch::Tensor& buffer,
+ const char* filename,
+ const int block_size,
+ const int queue_depth,
+ const bool single_submit,
+ const bool overlap_events,
+ const bool validate);
+
+int deepspeed_py_aio_read(torch::Tensor& buffer,
+ const char* filename,
+ const int block_size,
+ const int queue_depth,
+ const bool single_submit,
+ const bool overlap_events,
+ const bool validate);
diff --git a/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp b/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..417319f8ae5ce3bead644c80c094d9df1061879a
--- /dev/null
+++ b/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp
@@ -0,0 +1,282 @@
+
+/*
+Copyright 2020 The Microsoft DeepSpeed Team
+Licensed under the MIT license.
+
+Functionality for swapping optimizer tensors to/from (NVMe) storage devices.
+*/
+
+#include "deepspeed_py_aio_handle.h"
+
+using namespace std;
+
+static void _start_aio_thread(std::shared_ptr<struct deepspeed_aio_thread_t> ctxt) { ctxt->run(); }
+
+deepspeed_aio_handle_t::deepspeed_aio_handle_t(const int block_size,
+ const int queue_depth,
+ const bool single_submit,
+ const bool overlap_events,
+ const int num_threads)
+ : _aio_ctxt(new aio_context(block_size, queue_depth)),
+ _single_submit(single_submit),
+ _overlap_events(overlap_events),
+ _num_threads(num_threads),
+ _aio_config(block_size, queue_depth, single_submit, overlap_events, false),
+ _num_pending_ops(0)
+{
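+ // Create one worker context per thread (each owns its own aio_context) and launch the worker threads.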
+ for (auto i = 0; i < num_threads; ++i) {
+ _thread_contexts.push_back(std::make_shared<struct deepspeed_aio_thread_t>(i, _aio_config));
+ }
+
+ for (auto& ctxt : _thread_contexts) {
+ _threads.push_back(std::thread(_start_aio_thread, ctxt));
+ }
+}
+
+deepspeed_aio_handle_t::~deepspeed_aio_handle_t()
+{
+ _stop_threads();
+ for (auto& thr : _threads) { thr.join(); }
+}
+
+const int deepspeed_aio_handle_t::get_block_size() const
+{
+ return _aio_ctxt ? _aio_ctxt->_block_size : -1;
+}
+
+const int deepspeed_aio_handle_t::get_queue_depth() const
+{
+ return _aio_ctxt ? _aio_ctxt->_queue_depth : -1;
+}
+
+const bool deepspeed_aio_handle_t::get_single_submit() const { return _single_submit; }
+
+const bool deepspeed_aio_handle_t::get_overlap_events() const { return _overlap_events; }
+
+const int deepspeed_aio_handle_t::get_thread_count() const { return _num_threads; }
+
+int deepspeed_aio_handle_t::read(torch::Tensor& buffer, const char* filename, const bool validate)
+{
+ const auto start_time = std::chrono::high_resolution_clock::now();
+
+ assert(_aio_ctxt);
+
+ long long num_file_bytes;
+ if (-1 == get_file_size(filename, num_file_bytes)) {
+ const auto error_code = errno;
+ report_file_error(filename, " fstat for read", error_code);
+ return -1;
+ }
+ assert(static_cast<long long int>(buffer.nbytes()) == num_file_bytes);
+
+ const auto fd = open_file(filename, true);
+ if (fd == -1) { return -1; }
+
+ auto read_buffer = (char*)buffer.data_ptr();
+ std::unique_ptr<io_xfer_ctxt> xfer_ctxt(new io_xfer_ctxt(fd, 0, num_file_bytes, read_buffer));
+
+ if (_aio_config._overlap_events) {
+ do_aio_operation_overlap(true, _aio_ctxt, xfer_ctxt, &_aio_config, nullptr);
+ } else {
+ do_aio_operation_sequential(true, _aio_ctxt, xfer_ctxt, &_aio_config, nullptr);
+ }
+
+ close(fd);
+ const std::chrono::duration<double> aio_time =
+ std::chrono::high_resolution_clock::now() - start_time;
+
+ if (validate) { validate_aio_operation(true, filename, read_buffer, num_file_bytes); }
+ const std::chrono::duration<double> fn_time =
+ std::chrono::high_resolution_clock::now() - start_time;
+ std::cout << "Elapsed time(usec): "
+ << "aio = " << aio_time.count() * 1e6 << " call = " << fn_time.count() * 1e6
+ << std::endl;
+ return 0;
+}
+
+int deepspeed_aio_handle_t::write(const torch::Tensor& buffer,
+ const char* filename,
+ const bool validate)
+{
+ assert(_aio_ctxt);
+
+ const auto start_time = std::chrono::high_resolution_clock::now();
+
+ const auto fd = open_file(filename, false);
+ if (fd == -1) { return -1; }
+
+ auto write_buffer = (char*)buffer.data_ptr();
+ const auto num_write_bytes = static_cast<long long int>(buffer.nbytes());
+ std::unique_ptr<io_xfer_ctxt> xfer_ctxt(new io_xfer_ctxt(fd, 0, num_write_bytes, write_buffer));
+
+ if (_aio_config._overlap_events) {
+ do_aio_operation_overlap(false, _aio_ctxt, xfer_ctxt, &_aio_config, nullptr);
+ } else {
+ do_aio_operation_sequential(false, _aio_ctxt, xfer_ctxt, &_aio_config, nullptr);
+ }
+ const std::chrono::duration<double> aio_time =
+ std::chrono::high_resolution_clock::now() - start_time;
+
+ close(fd);
+
+ if (validate) { validate_aio_operation(false, filename, write_buffer, num_write_bytes); }
+
+ const std::chrono::duration<double> fn_time =
+ std::chrono::high_resolution_clock::now() - start_time;
+ std::cout << "Elapsed time(usec): "
+ << "aio = " << aio_time.count() * 1e6 << " call = " << fn_time.count() * 1e6
+ << std::endl;
+ return 0;
+}
+
+void deepspeed_aio_handle_t::_schedule_aio_work(std::shared_ptr<struct io_op_desc_t> scheduled_op)
+{
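+ // Push the same op descriptor to every worker; each worker uses its thread id to pick the slice it is responsible for.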
+ for (auto& ctxt : _thread_contexts) {
+ {
+ std::lock_guard<std::mutex> lock(ctxt->_work_sync._mutex);
+ ctxt->_work_queue.push(scheduled_op);
+ }
+ ctxt->_work_sync._cond_var.notify_one();
+ }
+ _num_pending_ops++;
+}
+
+std::shared_ptr<struct io_op_desc_t> deepspeed_aio_handle_t::_wait_for_aio_work()
+{
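+ // Block until every worker has reported completion; the per-thread queues all hold the same shared descriptor for the current op.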
+ std::shared_ptr<struct io_op_desc_t> completed_op = nullptr;
+ for (auto& ctxt : _thread_contexts) {
+ std::unique_lock<std::mutex> lock(ctxt->_complete_sync._mutex);
+ ctxt->_complete_sync._cond_var.wait(lock,
+ [ctxt] { return !ctxt->_complete_queue.empty(); });
+ completed_op = ctxt->_complete_queue.front();
+ ctxt->_complete_queue.pop();
+ }
+ return completed_op;
+}
+
+void deepspeed_aio_handle_t::_stop_threads()
+{
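+ // Signal all workers to exit their run() loops; callers must have drained pending ops first (asserted below).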
+ assert(0 == _num_pending_ops);
+ for (auto& ctxt : _thread_contexts) {
+ {
+ std::lock_guard<std::mutex> lock(ctxt->_work_sync._mutex);
+ ctxt->_time_to_exit = true;
+ }
+ ctxt->_work_sync._cond_var.notify_one();
+ }
+}
+
+int deepspeed_aio_handle_t::wait()
+{
+ assert(_num_pending_ops > 0);
+ auto num_completed_ops = 0;
+
+ while (_num_pending_ops > 0) {
+ auto completed_op = _wait_for_aio_work();
+
+ completed_op->fini();
+
+ close(completed_op->_fd);
+
+ if (completed_op->_validate) {
+ validate_aio_operation(completed_op->_read_op,
+ completed_op->_filename.c_str(),
+ completed_op->data_ptr(),
+ _num_threads * completed_op->_num_bytes);
+ }
+ --_num_pending_ops;
+ ++num_completed_ops;
+ }
+
+ return num_completed_ops;
+}
+
+bool deepspeed_aio_handle_t::_is_valid_parallel_aio_op(const bool read_op,
+ const long long int num_bytes)
+{
+ const auto op_string = read_op ? "Read" : "Write";
+ if (num_bytes % get_thread_count()) {
+ std::cout << "deepspeed_aio failure: parallel " << op_string << " num_bytes = " << num_bytes
+ << " not divisible by thread count = " << get_thread_count() << std::endl;
+ return false;
+ }
+
+ return true;
+}
+
+int deepspeed_aio_handle_t::pread(const torch::Tensor& buffer,
+ const char* filename,
+ const bool validate,
+ const bool async)
+{
+ long long num_file_bytes;
+ if (-1 == get_file_size(filename, num_file_bytes)) {
+ const auto error_code = errno;
+ report_file_error(filename, " fstat for read", error_code);
+ return -1;
+ }
+ const auto buffer_bytes = static_cast<long long int>(buffer.nbytes());
+ if (buffer_bytes != num_file_bytes) {
+ std::cout << filename << ": buffer nbytes != file bytes " << buffer_bytes
+ << " != " << num_file_bytes << std::endl;
+ }
+ assert(static_cast<long long int>(buffer.nbytes()) == num_file_bytes);
+ assert((num_file_bytes % _num_threads) == 0);
+
+ if (!_is_valid_parallel_aio_op(true, num_file_bytes)) { return -1; }
+
+ const auto fd = open_file(filename, true);
+ if (fd == -1) { return -1; }
+
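+ // Each worker handles an equal share of the file, i.e. num_file_bytes / _num_threads bytes.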
+ auto scheduled_op = std::make_shared<struct io_op_desc_t>(
+ true, buffer, fd, filename, (num_file_bytes / _num_threads), validate);
+
+ _schedule_aio_work(scheduled_op);
+
+ if (async) { return 0; }
+
+ return wait();
+}
+
+int deepspeed_aio_handle_t::pwrite(const torch::Tensor& buffer,
+ const char* filename,
+ const bool validate,
+ const bool async)
+{
+ const auto num_write_bytes = static_cast<long long int>(buffer.nbytes());
+ assert((num_write_bytes % _num_threads) == 0);
+
+ if (!_is_valid_parallel_aio_op(false, num_write_bytes)) { return -1; }
+
+ const auto fd = open_file(filename, false);
+ if (fd == -1) { return -1; }
+
+ auto scheduled_op = std::make_shared<struct io_op_desc_t>(
+ false, buffer, fd, filename, (num_write_bytes / _num_threads), validate);
+
+ _schedule_aio_work(scheduled_op);
+
+ if (async) { return 0; }
+
+ return wait();
+}
+
+int deepspeed_aio_handle_t::sync_pread(torch::Tensor& buffer, const char* filename)
+{
+ return pread(buffer, filename, false, false);
+}
+
+int deepspeed_aio_handle_t::sync_pwrite(const torch::Tensor& buffer, const char* filename)
+{
+ return pwrite(buffer, filename, false, false);
+}
+
+int deepspeed_aio_handle_t::async_pread(torch::Tensor& buffer, const char* filename)
+{
+ return pread(buffer, filename, false, true);
+}
+
+int deepspeed_aio_handle_t::async_pwrite(const torch::Tensor& buffer, const char* filename)
+{
+ return pwrite(buffer, filename, false, true);
+}
diff --git a/csrc/aio/py_lib/deepspeed_py_aio_handle.h b/csrc/aio/py_lib/deepspeed_py_aio_handle.h
new file mode 100644
index 0000000000000000000000000000000000000000..22de4c3961d29abc94517b81ff38b7224822589c
--- /dev/null
+++ b/csrc/aio/py_lib/deepspeed_py_aio_handle.h
@@ -0,0 +1,68 @@
+/*
+Copyright 2020 The Microsoft DeepSpeed Team
+Licensed under the MIT license.
+
+Functionality for swapping optimizer tensors to/from (NVMe) storage devices.
+*/
+
+#include <condition_variable>