v1.0

Commit e4575be9 authored Aug 04, 2023 by huaerkl
20 changed files
--- a/.github/workflows/ci.md
+++ b/.github/workflows/ci.md
+# CI setup
+The CI is set up with GitHub Actions using the on-demand EC2 backend.
+This setup currently uses a 4-GPU instance (p3.8xlarge) to test tp=2, pp=2.
+**Unfortunately this only works for PRs created from non-forked branches.**
+## The workflow file
+The workflow file is at `.github/workflows/main.yml`
+```
+      - name: Configure AWS credentials
+        uses: aws-actions/configure-aws-credentials@v1
+        with:
+          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
+          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
+          aws-region: us-east-1
+      - name: Start EC2 runner
+        id: start-ec2-runner
+        uses: machulav/ec2-github-runner@v2
+        with:
+          mode: start
+          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
+          ec2-image-id: ami-0dfaabfa78a779fbc
+          ec2-instance-type: p3.8xlarge
+          subnet-id: subnet-3502b45e
+          security-group-id: sg-e8f46d9d
+```
+- `ec2-image-id` is the AMI, which has to be created in, or copied to, the region given by `aws-region`.
+- `subnet-id` comes from: https://console.aws.amazon.com/vpc/home?region=us-east-1#subnets:
+- `security-group-id` comes from: https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#SecurityGroups:
+The workflow was later made fault-tolerant: it tries to start the EC2 instance in 3 different sub-regions (availability zones) to cope with situations where EC2 reports that it doesn't have the resources to start the desired instance.
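The fallback idea can be sketched in Python as a loop over candidate subnets. This only illustrates the control flow and is not how the workflow itself is implemented: `launch_instance` and `CapacityError` are hypothetical placeholders for whatever actually starts the instance and reports a capacity failure.

```python
from typing import Callable, Optional, Sequence


class CapacityError(Exception):
    """Hypothetical error: the zone has no capacity for the instance type."""


def start_with_fallback(
    subnets: Sequence[str],
    launch_instance: Callable[[str], str],
) -> Optional[str]:
    """Try each subnet in order and return the first instance id obtained.

    Returns None if every subnet reports insufficient capacity.
    """
    for subnet in subnets:
        try:
            return launch_instance(subnet)
        except CapacityError:
            # This zone is out of capacity; move on to the next one.
            continue
    return None
```

Ordering the candidates from most to least preferred keeps the common case cheap, since the loop stops at the first zone that has capacity.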
+## Connect to instance
+To pre-install things, connect to the instance manually and install what's desired:
+1. choose and start an EC2 instance
+2. connect to it as `ubuntu`, then `sudo su`, because the runner runs as `root`. I couldn't find a way around it.
+```
+ssh -l ubuntu -i "~/.ssh/bigscience-aim.pem" ubuntu@ec2-3-14-127-35.us-east-2.compute.amazonaws.com
+```
+Once installed, stop the instance.
+Then create a new AMI (see below) and update the script using the new AMI.
+## Prepare the machine
+Steps used to set up the fixed software (it won't be installed at test time):
+- install cuda:
+https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_local
+https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#ubuntu-installation
+### install fixed packages
+- `torch 1.9.0/cu-11.1`
+```
+pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
+```
+- all kinds of prerequisites
+```
+pip install transformers
+wget https://raw.githubusercontent.com/microsoft/DeepSpeed/master/requirements/requirements.txt -O requirements-ds.txt
+pip install -r requirements-ds.txt
+wget https://raw.githubusercontent.com/bigscience-workshop/Megatron-DeepSpeed/main/requirements.txt -O requirements-ms.txt
+pip install -r requirements-ms.txt
+```
+- apex - needs a hack to deal with mismatching minor cuda versions (and it takes forever to build), so using this patch:
+XXX: this no longer works - had to manually patch pytorch to avoid mismatch failure
+```
+--- a/setup.py
++++ b/setup.py
+@@ -99,6 +99,7 @@ def check_cuda_torch_binary_vs_bare_metal(cuda_dir):
+     print(raw_output + "from " + cuda_dir + "/bin\n")
+     if (bare_metal_major != torch_binary_major) or (bare_metal_minor != torch_binary_minor):
++        return
+         raise RuntimeError("Cuda extensions are being compiled with a version of Cuda that does " +
+                            "not match the version used to compile Pytorch binaries.  " +
+                            "Pytorch binaries were compiled with Cuda {}.\n".format(torch.version.cuda) +
+```
+install it: (it was cloned from `git clone https://github.com/NVIDIA/apex`)
+```
+cd code/apex
+# I copied this script from my setup
+./build.sh
+```
+## make a new AMI image
+Once the needed things are installed (and every time anything new is installed), a new AMI must be created (this is like an .iso image snapshot).
+1. go to https://us-east-1.console.aws.amazon.com/ec2/v2/home?region=us-east-1#Instances:
+2. choose the instance to create a new image from
+3. Actions -> Image and Templates -> Create Image
+Make sure it's created in the correct region (the same as in the script), or copy it to the right region.
+The image can be created while the updated instance is still running.
+Just don't forget to turn the instance off once you have validated that it works.
+Finally, once the image is created, update the script to use the new AMI id (key `ec2-image-id`) in `.github/workflows/main.yml`.
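Since the workflow file repeats the `ec2-image-id` key in several steps, the update can be done in one pass. A minimal sketch, assuming the key always appears on its own line as `ec2-image-id: <ami>`; the helper name `replace_ami_id` is made up for illustration.

```python
import re


def replace_ami_id(workflow_text: str, new_ami_id: str) -> str:
    """Return workflow_text with every `ec2-image-id` value set to new_ami_id."""
    pattern = re.compile(r"^([ \t]*ec2-image-id:[ \t]*)\S+", flags=re.MULTILINE)
    # Keep the original indentation and key, swap only the AMI id itself.
    return pattern.sub(lambda m: m.group(1) + new_ami_id, workflow_text)
```

To apply it, read `.github/workflows/main.yml`, pass its text through `replace_ami_id`, and write the result back; review the diff before committing.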
+## Stop instance alarm
+It looks like occasionally the instance doesn't stop and continues running.
+I added a stop alarm to automatically kill the instance after 1h if util < 10% following the exact instructions from:
+https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/UsingAlarmActions.html
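The alarm itself is configured in AWS following the instructions linked above. The rule it encodes can be sketched as plain Python; this is only an illustration of the condition, and both the sampling of utilization over the last hour and the "every sample below the threshold" reading are assumptions.

```python
from typing import Sequence


def should_stop(util_percent: Sequence[float], threshold: float = 10.0) -> bool:
    """Return True if every sample in the window is below the threshold.

    util_percent holds the utilization samples covering the last hour.
    An empty window never triggers a stop.
    """
    return bool(util_percent) and all(u < threshold for u in util_percent)
```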
+## Guides
+Setup guide: https://github.com/machulav/ec2-github-runner
+Launching an EC2 instance:
+https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html?icmpid=docs_ec2_console
+https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html
+- All available instances: https://aws.amazon.com/ec2/instance-types/
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
+name: Run all tests
+on:
+  # enable to manually trigger the tests
+  workflow_dispatch:
+# re-enable if we want automatic CI again
+#  pull_request:
+#    paths:
+#      - "**.py"
+jobs:
+# GPU sizes and types that we could use:
+# g4dn.12xlarge  4x 16GB T4 (CC 7.5) (low availability)
+# p3.8xlarge     4x 16GB V100 (CC 7.0) (very low availability)
+# Unfit:
+# g3.16xlarge    4x 8GB Tesla M60 (CC 5.2) (not supported by cuda-11)
+# p2.8xlarge     8x 12GB K80 (CC 3.7 not supported by cuda-11)
+  start-runner:
+    name: Start self-hosted EC2 runner
+    runs-on: ubuntu-latest
+    outputs:
+      label: ${{ steps.start-ec2-runner.outputs.label }}
+      ec2-instance-id: ${{ steps.start-ec2-runner.outputs.ec2-instance-id }}
+    steps:
+      - name: Configure AWS credentials
+        uses: aws-actions/configure-aws-credentials@v1
+        with:
+          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
+          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
+          aws-region: us-east-1
+      # don't use the following subnets as p3.8xlarge is not supported there:
+      # - subnet-06576a4b # us-east-1d
+      # - subnet-859322b4 # us-east-1e
+      # - subnet-47cfad21 # us-east-1b
+      - name: Try to start EC2 runner (a)
+        id: try-us-east-1a
+        uses: machulav/ec2-github-runner@v2
+        continue-on-error: true
+        with:
+          mode: start
+          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
+          ec2-image-id: ami-0ad997818d90480f2
+          ec2-instance-type: g4dn.12xlarge
+          security-group-id: sg-f2a4e2fc
+          subnet-id: subnet-b7533b96 # us-east-1c
+          aws-resource-tags: > # optional, requires additional permissions
+            [
+              {"Key": "Name", "Value": "ec2-github-runner"},
+              {"Key": "GitHubRepository", "Value": "${{ github.repository }}"}
+            ]
+      - name: Try to start EC2 runner (b)
+        id: try-us-east-1b
+        if: steps.try-us-east-1a.outcome == 'failure'
+        uses: machulav/ec2-github-runner@v2
+        continue-on-error: true
+        with:
+          mode: start
+          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
+          ec2-image-id: ami-0ad997818d90480f2
+          ec2-instance-type: g4dn.12xlarge
+          security-group-id: sg-f2a4e2fc
+          subnet-id: subnet-a396b2ad # us-east-1f
+          aws-resource-tags: > # optional, requires additional permissions
+            [
+              {"Key": "Name", "Value": "ec2-github-runner"},
+              {"Key": "GitHubRepository", "Value": "${{ github.repository }}"}
+            ]
+      - name: Try to start EC2 runner (c)
+        id: try-us-east-1c
+        if: steps.try-us-east-1b.outcome == 'failure'
+        uses: machulav/ec2-github-runner@v2
+        continue-on-error: true
+        with:
+          mode: start
+          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
+          ec2-image-id: ami-0ad997818d90480f2
+          ec2-instance-type: g4dn.12xlarge
+          security-group-id: sg-f2a4e2fc
+          subnet-id: subnet-df0f6180 # us-east-1a
+          aws-resource-tags: > # optional, requires additional permissions
+            [
+              {"Key": "Name", "Value": "ec2-github-runner"},
+              {"Key": "GitHubRepository", "Value": "${{ github.repository }}"}
+            ]
+      - name: Try to start EC2 runner (a-2)
+        id: try-us-east-1a-2
+        if: steps.try-us-east-1c.outcome == 'failure'
+        uses: machulav/ec2-github-runner@v2
+        continue-on-error: true
+        with:
+          mode: start
+          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
+          ec2-image-id: ami-0ad997818d90480f2
+          ec2-instance-type: p3.8xlarge
+          security-group-id: sg-f2a4e2fc
+          subnet-id: subnet-b7533b96 # us-east-1c
+          aws-resource-tags: > # optional, requires additional permissions
+            [
+              {"Key": "Name", "Value": "ec2-github-runner"},
+              {"Key": "GitHubRepository", "Value": "${{ github.repository }}"}
+            ]
+      - name: Try to start EC2 runner (b-2)
+        id: try-us-east-1b-2
+        if: steps.try-us-east-1a-2.outcome == 'failure'
+        uses: machulav/ec2-github-runner@v2
+        continue-on-error: true
+        with:
+          mode: start
+          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
+          ec2-image-id: ami-0ad997818d90480f2
+          ec2-instance-type: p3.8xlarge
+          security-group-id: sg-f2a4e2fc
+          subnet-id: subnet-a396b2ad # us-east-1f
+          aws-resource-tags: > # optional, requires additional permissions
+            [
+              {"Key": "Name", "Value": "ec2-github-runner"},
+              {"Key": "GitHubRepository", "Value": "${{ github.repository }}"}
+            ]
+      - name: Try to start EC2 runner (c-2)
+        id: try-us-east-1c-2
+        if: steps.try-us-east-1b-2.outcome == 'failure'
+        uses: machulav/ec2-github-runner@v2
+        with:
+          mode: start
+          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
+          ec2-image-id: ami-0ad997818d90480f2
+          ec2-instance-type: p3.8xlarge
+          security-group-id: sg-f2a4e2fc
+          subnet-id: subnet-df0f6180 # us-east-1a
+          aws-resource-tags: > # optional, requires additional permissions
+            [
+              {"Key": "Name", "Value": "ec2-github-runner"},
+              {"Key": "GitHubRepository", "Value": "${{ github.repository }}"}
+            ]
+      - name: See if any of 3 sub-regions had the resource
+        id: start-ec2-runner
+        run: |
+          if [ "${{ steps.try-us-east-1a.outcome }}" = "success" ]; then
+            echo "::set-output name=label::${{ steps.try-us-east-1a.outputs.label }}"
+            echo "::set-output name=ec2-instance-id::${{ steps.try-us-east-1a.outputs.ec2-instance-id }}"
+          fi
+          if [ "${{ steps.try-us-east-1b.outcome }}" = "success" ]; then
+            echo "::set-output name=label::${{ steps.try-us-east-1b.outputs.label }}"
+            echo "::set-output name=ec2-instance-id::${{ steps.try-us-east-1b.outputs.ec2-instance-id }}"
+          fi
+          if [ "${{ steps.try-us-east-1c.outcome }}" = "success" ]; then
+            echo "::set-output name=label::${{ steps.try-us-east-1c.outputs.label }}"
+            echo "::set-output name=ec2-instance-id::${{ steps.try-us-east-1c.outputs.ec2-instance-id }}"
+          fi
+          if [ "${{ steps.try-us-east-1a-2.outcome }}" = "success" ]; then
+            echo "::set-output name=label::${{ steps.try-us-east-1a-2.outputs.label }}"
+            echo "::set-output name=ec2-instance-id::${{ steps.try-us-east-1a-2.outputs.ec2-instance-id }}"
+          fi
+          if [ "${{ steps.try-us-east-1b-2.outcome }}" = "success" ]; then
+            echo "::set-output name=label::${{ steps.try-us-east-1b-2.outputs.label }}"
+            echo "::set-output name=ec2-instance-id::${{ steps.try-us-east-1b-2.outputs.ec2-instance-id }}"
+          fi
+          if [ "${{ steps.try-us-east-1c-2.outcome }}" = "success" ]; then
+            echo "::set-output name=label::${{ steps.try-us-east-1c-2.outputs.label }}"
+            echo "::set-output name=ec2-instance-id::${{ steps.try-us-east-1c-2.outputs.ec2-instance-id }}"
+          fi
+  do-the-job:
+    name: Do the job on the runner
+    needs: start-runner # required to start the main job when the runner is ready
+    # need to figure out how to cancel the previous build if a new push was made while the old test is still running
+    # concurrency: # cancel previous build on a new push
+    #   group: ${{ github.ref }} # https://docs.github.com/en/actions/reference/context-and-expression-syntax-for-github-actions#github-context
+    #   cancel-in-progress: true
+    runs-on: ${{ needs.start-runner.outputs.label }} # run the job on the newly created runner
+    steps:
+      - name: NVIDIA-SMI
+        run: nvidia-smi
+      - name: Checkout
+        uses: actions/checkout@v2
+      - name: Install Dependencies
+        run: |
+          pip install --upgrade pip
+          pip install -r requirements.txt
+          pip install pytest-timeout
+      - name: Run tests
+        run: pytest --timeout=600 tests
+  stop-runner:
+    name: Stop self-hosted EC2 runner
+    needs:
+      - start-runner # required to get output from the start-runner job
+      - do-the-job # required to wait when the main job is done
+    runs-on: ubuntu-latest
+    if: ${{ always() }} # required to stop the runner even if the error happened in the previous jobs
+    steps:
+      - name: Configure AWS credentials
+        uses: aws-actions/configure-aws-credentials@v1
+        with:
+          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
+          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
+          aws-region: us-east-1
+      - name: Stop EC2 runner
+        uses: machulav/ec2-github-runner@v2
+        with:
+          mode: stop
+          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
+          label: ${{ needs.start-runner.outputs.label }}
+          ec2-instance-id: ${{ needs.start-runner.outputs.ec2-instance-id }}
--- a/.gitignore
+++ b/.gitignore
+# tests
+# megatron autogenerated indices
+tests/data/*/*npy
+tests/tools/openwebtext-1000.jsonl
+tmp/
+# macOS
+.DS_Store
+# Byte-compiled / optimized / DLL files
+*/__pycache__/
+*.py[cod]
+*.class
+# C extensions
+*.so
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+pip-wheel-metadata/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+# Translations
+*.mo
+*.pot
+# Django:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+# Flask:
+instance/
+.webassets-cache
+# Sphinx documentation
+docs/_build/
+# PyBuilder
+.pybuilder/
+target/
+# pyenv
+#   For a library or package, you might want to ignore these files since the code is
+#   intended to run in multiple environments; otherwise, check them in:
+# .python-version
+# pipenv
+#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+#   However, in case of collaboration, if having platform-specific dependencies or dependencies
+#   having no cross-platform support, pipenv may install dependencies that don't work, or not
+#   install all needed dependencies.
+Pipfile
+Pipfile.lock
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow
+__pypackages__/
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+# Spyder project settings
+.spyderproject
+.spyproject
+# Intellij project settings
+.idea/
+.iml
+# VSCode
+.vscode/
+# Rope project settings
+.ropeproject
+# mkdocs documentation
+/site
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+# Pyre type checker
+.pyre/
+# pytype static type analyzer
+.pytype/
+# Cython debug symbols
+cython_debug/
+# static files generated from Django application
+media
+staticfiles
+/tags
+# tmp files
+*.swp
--- a/.gitlab-ci.yml
+++ b/.gitlab-ci.yml
+image: gitlab-master.nvidia.com/dl/dgx/pytorch:20.12-py3-devel
+test:
+  script:
+    - pytest --junitxml=report.xml tests
+  artifacts:
+    when: always
+    reports:
+      junit: report.xml
\ No newline at end of file
--- a/CODEOWNERS
+++ b/CODEOWNERS
+* @bigscience-workshop/megatron-deepspeed-codeowners
--- a/LICENSE
+++ b/LICENSE
+The following applies to all files unless otherwise noted:
+# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+--
+This repository also contains code from Hugging Face Inc., Google Research,
+Facebook (from their Fairseq project), and Philip Popien. Files from these
+organizations have notices at the top of each file. Below are licenses
+used in those files, as indicated.
+------------- LICENSE FOR huggingface and Google Research code  --------------
+                                 Apache License
+                           Version 2.0, January 2004
+                        http://www.apache.org/licenses/
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+   1. Definitions.
+      "License" shall mean the terms and conditions for use, reproduction,
+      and distribution as defined by Sections 1 through 9 of this document.
+      "Licensor" shall mean the copyright owner or entity authorized by
+      the copyright owner that is granting the License.
+      "Legal Entity" shall mean the union of the acting entity and all
+      other entities that control, are controlled by, or are under common
+      control with that entity. For the purposes of this definition,
+      "control" means (i) the power, direct or indirect, to cause the
+      direction or management of such entity, whether by contract or
+      otherwise, or (ii) ownership of fifty percent (50%) or more of the
+      outstanding shares, or (iii) beneficial ownership of such entity.
+      "You" (or "Your") shall mean an individual or Legal Entity
+      exercising permissions granted by this License.
+      "Source" form shall mean the preferred form for making modifications,
+      including but not limited to software source code, documentation
+      source, and configuration files.
+      "Object" form shall mean any form resulting from mechanical
+      transformation or translation of a Source form, including but
+      not limited to compiled object code, generated documentation,
+      and conversions to other media types.
+      "Work" shall mean the work of authorship, whether in Source or
+      Object form, made available under the License, as indicated by a
+      copyright notice that is included in or attached to the work
+      (an example is provided in the Appendix below).
+      "Derivative Works" shall mean any work, whether in Source or Object
+      form, that is based on (or derived from) the Work and for which the
+      editorial revisions, annotations, elaborations, or other modifications
+      represent, as a whole, an original work of authorship. For the purposes
+      of this License, Derivative Works shall not include works that remain
+      separable from, or merely link (or bind by name) to the interfaces of,
+      the Work and Derivative Works thereof.
+      "Contribution" shall mean any work of authorship, including
+      the original version of the Work and any modifications or additions
+      to that Work or Derivative Works thereof, that is intentionally
+      submitted to Licensor for inclusion in the Work by the copyright owner
+      or by an individual or Legal Entity authorized to submit on behalf of
+      the copyright owner. For the purposes of this definition, "submitted"
+      means any form of electronic, verbal, or written communication sent
+      to the Licensor or its representatives, including but not limited to
+      communication on electronic mailing lists, source code control systems,
+      and issue tracking systems that are managed by, or on behalf of, the
+      Licensor for the purpose of discussing and improving the Work, but
+      excluding communication that is conspicuously marked or otherwise
+      designated in writing by the copyright owner as "Not a Contribution."
+      "Contributor" shall mean Licensor and any individual or Legal Entity
+      on behalf of whom a Contribution has been received by Licensor and
+      subsequently incorporated within the Work.
+   2. Grant of Copyright License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      copyright license to reproduce, prepare Derivative Works of,
+      publicly display, publicly perform, sublicense, and distribute the
+      Work and such Derivative Works in Source or Object form.
+   3. Grant of Patent License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      (except as stated in this section) patent license to make, have made,
+      use, offer to sell, sell, import, and otherwise transfer the Work,
+      where such license applies only to those patent claims licensable
+      by such Contributor that are necessarily infringed by their
+      Contribution(s) alone or by combination of their Contribution(s)
+      with the Work to which such Contribution(s) was submitted. If You
+      institute patent litigation against any entity (including a
+      cross-claim or counterclaim in a lawsuit) alleging that the Work
+      or a Contribution incorporated within the Work constitutes direct
+      or contributory patent infringement, then any patent licenses
+      granted to You under this License for that Work shall terminate
+      as of the date such litigation is filed.
+   4. Redistribution. You may reproduce and distribute copies of the
+      Work or Derivative Works thereof in any medium, with or without
+      modifications, and in Source or Object form, provided that You
+      meet the following conditions:
+      (a) You must give any other recipients of the Work or
+          Derivative Works a copy of this License; and
+      (b) You must cause any modified files to carry prominent notices
+          stating that You changed the files; and
+      (c) You must retain, in the Source form of any Derivative Works
+          that You distribute, all copyright, patent, trademark, and
+          attribution notices from the Source form of the Work,
+          excluding those notices that do not pertain to any part of
+          the Derivative Works; and
+      (d) If the Work includes a "NOTICE" text file as part of its
+          distribution, then any Derivative Works that You distribute must
+          include a readable copy of the attribution notices contained
+          within such NOTICE file, excluding those notices that do not
+          pertain to any part of the Derivative Works, in at least one
+          of the following places: within a NOTICE text file distributed
+          as part of the Derivative Works; within the Source form or
+          documentation, if provided along with the Derivative Works; or,
+          within a display generated by the Derivative Works, if and
+          wherever such third-party notices normally appear. The contents
+          of the NOTICE file are for informational purposes only and
+          do not modify the License. You may add Your own attribution
+          notices within Derivative Works that You distribute, alongside
+          or as an addendum to the NOTICE text from the Work, provided
+          that such additional attribution notices cannot be construed
+          as modifying the License.
+      You may add Your own copyright statement to Your modifications and
+      may provide additional or different license terms and conditions
+      for use, reproduction, or distribution of Your modifications, or
+      for any such Derivative Works as a whole, provided Your use,
+      reproduction, and distribution of the Work otherwise complies with
+      the conditions stated in this License.
+   5. Submission of Contributions. Unless You explicitly state otherwise,
+      any Contribution intentionally submitted for inclusion in the Work
+      by You to the Licensor shall be under the terms and conditions of
+      this License, without any additional terms or conditions.
+      Notwithstanding the above, nothing herein shall supersede or modify
+      the terms of any separate license agreement you may have executed
+      with Licensor regarding such Contributions.
+   6. Trademarks. This License does not grant permission to use the trade
+      names, trademarks, service marks, or product names of the Licensor,
+      except as required for reasonable and customary use in describing the
+      origin of the Work and reproducing the content of the NOTICE file.
+   7. Disclaimer of Warranty. Unless required by applicable law or
+      agreed to in writing, Licensor provides the Work (and each
+      Contributor provides its Contributions) on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied, including, without limitation, any warranties or conditions
+      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+      PARTICULAR PURPOSE. You are solely responsible for determining the
+      appropriateness of using or redistributing the Work and assume any
+      risks associated with Your exercise of permissions under this License.
+   8. Limitation of Liability. In no event and under no legal theory,
+      whether in tort (including negligence), contract, or otherwise,
+      unless required by applicable law (such as deliberate and grossly
+      negligent acts) or agreed to in writing, shall any Contributor be
+      liable to You for damages, including any direct, indirect, special,
+      incidental, or consequential damages of any character arising as a
+      result of this License or out of the use or inability to use the
+      Work (including but not limited to damages for loss of goodwill,
+      work stoppage, computer failure or malfunction, or any and all
+      other commercial damages or losses), even if such Contributor
+      has been advised of the possibility of such damages.
+   9. Accepting Warranty or Additional Liability. While redistributing
+      the Work or Derivative Works thereof, You may choose to offer,
+      and charge a fee for, acceptance of support, warranty, indemnity,
+      or other liability obligations and/or rights consistent with this
+      License. However, in accepting such obligations, You may act only
+      on Your own behalf and on Your sole responsibility, not on behalf
+      of any other Contributor, and only if You agree to indemnify,
+      defend, and hold each Contributor harmless for any liability
+      incurred by, or claims asserted against, such Contributor by reason
+      of your accepting any such warranty or additional liability.
+   END OF TERMS AND CONDITIONS
+   APPENDIX: How to apply the Apache License to your work.
+      To apply the Apache License to your work, attach the following
+      boilerplate notice, with the fields enclosed by brackets "[]"
+      replaced with your own identifying information. (Don't include
+      the brackets!)  The text should be enclosed in the appropriate
+      comment syntax for the file format. We also recommend that a
+      file or class name and description of purpose be included on the
+      same "printed page" as the copyright notice for easier
+      identification within third-party archives.
+   Copyright [yyyy] [name of copyright owner]
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+       http://www.apache.org/licenses/LICENSE-2.0
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+------------- LICENSE FOR Facebook Fairseq code --------------
+MIT License
+Copyright (c) Facebook, Inc. and its affiliates.
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
--- a/MANIFEST.in
+++ b/MANIFEST.in
+include megatron/data/Makefile
+include megatron/data/helpers.cpp
--- a/Makefile
+++ b/Makefile
+.PHONY: test style
+check_dirs := tests tools/convert_checkpoint
+help: ## this help
+	@awk 'BEGIN {FS = ":.*##"; printf "\nUsage:\n  make \033[36m<target>\033[0m\n"} /^[a-zA-Z_-]+:.*?##/ { printf "  \033[36m%-22s\033[0m %s\n", $$1, $$2 } /^##@/ { printf "\n\033[1m%s\033[0m\n", substr($$0, 5) } ' $(MAKEFILE_LIST)
+test: ## run tests
+	pytest tests
+style: ## checks for code style and applies formatting
+	black $(check_dirs)
+	isort $(check_dirs)
--- a/README.md
+++ b/README.md
+# ViT 
+## Paper
+https://arxiv.org/abs/2010.11929
+## Model Architecture
+![img](./images/vit.png)
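+## Algorithm
+Vision Transformer first splits the image into patches with a convolution to reduce computation, then flattens each patch into a sequence element, adds position encodings and a cls token to the sequence, and feeds it through a multi-layer Transformer to extract features. Finally, the cls token is taken out and passed through an MLP (multi-layer perceptron) for classification.
+The core idea of the Transformer is to extract features with the attention module: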
+![img](./images/attention.png)
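As a minimal sketch of the two ideas above, the following pure-Python snippet computes the sequence length the Transformer sees (one token per patch plus the cls token) and applies scaled dot-product attention. The 224x224 image size, the 16x16 patch size, and the helper names `num_tokens` and `attention` are illustrative assumptions and are not taken from this repository; the attention formula is assumed to be the standard softmax(QK^T / sqrt(d))V.

```python
import math


def num_tokens(image_size: int, patch_size: int) -> int:
    """Sequence length: one token per patch plus one cls token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention on lists of row vectors."""
    d = len(q[0])
    out = []
    for qi in q:
        # Similarity of this query with every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


if __name__ == "__main__":
    print(num_tokens(224, 16))  # 197
```

When all keys are identical the weights are uniform, so the output is simply the mean of the value vectors, which is a quick sanity check for the function.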
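+## Environment Setup
+1. The special deep learning libraries that this project needs for DCU cards can be downloaded and installed from the 光合 developer community:
+https://developer.hpccube.com/tool/
+```
+DTK driver: dtk23.04
+python: python3.8
+torch:1.10.0
+torchvision:0.10.0
+torchaudio:0.10.0
+deepspeed:0.9.2
+apex:0.1
+```
+`Tips: the versions of the DCU-related tools above (dtk driver, python, torch, etc.) must correspond to each other exactly.`
+2. Install the other, non-special libraries from requirements.txt: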
+```
+pip install -r requirements.txt
+```
+## Dataset
+ILSVRC 2012:
+https://image-net.org/challenges/LSVRC/index.php
+For how to extract and organize `imagenet 2012`, see:
+https://www.jianshu.com/p/a42b7d863825
+After organizing, the data directory structure is as follows:
+```
+data
+    |
+    train
+        |
+        n01440764
+        n01806143
+        ...
+    val
+        |
+        n04286575
+        n04596742
+        ...
+    test
+        |
+        images
+            |
+            test_x.JPEG
+            test_xxx.JPEG
+            ...
+```
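+## Training
+Enter the main directory:
+```
+cd megatron-deepspeed-vit && mkdir logs
+```
+### 1. Training with deepspeed
+**Multi-node, multi-card:**
+```
+sbatch examples/vit_dsp.sh
+```
+**Note**: deepspeed currently has problems setting up the environment from the shell script. This can be worked around as follows: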
+```
+1. vim ~/.bashrc
+2. Append the following configuration at the end:
+# Load dtk
+module purge
+module load compiler/devtoolset/7.3.1
+module load mpi/hpcx/gcc-7.3.1
+module load compiler/dtk/23.04
+# source /opt/dtk-23.04/env.sh
+source /public/home/xxx/dtk-23.04/env.sh
+# Load python
+source /public/home/xxx/anaconda3/bin/activate megatron
+# or: conda activate megatron
+export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/public/home/xxx/anaconda3/envs/megatron/lib
+3. Save .bashrc and run source ~/.bashrc to apply the configuration.
+```
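+**Single-node, multi-card** (first request an online node separately):
+```
+cd examples
+sh dspvit_1node.sh
+```
+**Single-node, single-card** (first request an online node separately):
+```
+cd examples
+sh dspvit_1dcu.sh
+```
+### 2. Training with mpirun
+Comment out rank and world_size in [`arguments.py`](./megatron/arguments.py):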
+```
+# args.rank = int(os.getenv('RANK', '0'))
+# args.world_size = int(os.getenv("WORLD_SIZE", '1'))
+```
+**Multi-node, multi-card:**
+```
+sbatch examples/vit_mpi.sh
+```
+## Inference
+The procedure is similar to the training steps above; just pass the following two extra arguments:
+```
+--eval-only True \
+--do_test True \
+```
+### 1. Testing with deepspeed
+**Multi-node, multi-card:**
+```
+sbatch examples/vit_dsp.sh
+```
+### 2. Testing with mpirun
+**Multi-node, multi-card:**
+```
+sbatch examples/vit_mpi.sh
+```
+## result
+![img](./images/classify.png)
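+## Application Scenarios
+### Algorithm Category
+`Image classification`
+### Application Industries
+`Manufacturing, environment, healthcare, meteorology`
+### Framework
+`pytorch`
+## References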
+- https://github.com/bigscience-workshop/Megatron-DeepSpeed
+- https://www.deepspeed.ai/getting-started/
+- https://deepspeed.readthedocs.io/en/latest/index.html
--- a/README_NLP.md
+++ b/README_NLP.md
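+# Contents
+   * [Contents](#contents)
+   * [Environment Setup](#environment-setup)
+   * [Download Vocabulary Files](#download-vocabulary-files)
+   * [Download Training Data](#download-training-data)
+   * [Training](#training)
+      * [Data Preprocessing](#data-preprocessing)
+      * [GPT Pretraining](#gpt-pretraining)
+         * [Single-Card Training](#single-card-training)
+         * [Deepspeed-PP and ZeRO-DP](#deepspeed-pp-and-zero-dp)
+         * [Distributed Multi-Card Training](#distributed-multi-card-training)
+   * [Inference](#inference)
+      * [Model Conversion](#model-conversion)
+      * [GPT Text Generation](#gpt-text-generation)
+   * [References](#references)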
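+# Environment Setup
+1. Install the base dependencies
+pip install -r requirements.txt
+2. Install the DCU-related whl packages
+Download directory for DCU-related packages: [https://cancon.hpccube.com:65024/4/main](https://cancon.hpccube.com:65024/4/main)
+pytorch whl package: pytorch ---> dtk-23.04
+Download the pytorch whl package that matches your python version
+<pre>
+pip install torch* (the downloaded torch whl package)
+</pre>
+torchvision whl package: vision ---> dtk-23.04
+Download the torchvision whl package that matches your python version
+<pre>
+pip install torchvision* (the downloaded torchvision whl package)
+</pre>
+apex whl package: apex ---> dtk-23.04
+Download the apex whl package that matches your python version
+<pre>
+pip install apex* (the downloaded apex whl package)
+</pre>
+deepspeed whl package: deepspeed ---> dtk-23.04
+Download the deepspeed whl package that matches your python version
+<pre>
+pip install deepspeed* (the downloaded deepspeed whl package)
+</pre>
+If `pip install` downloads too slowly, you can add a mirror: -i https://pypi.tuna.tsinghua.edu.cn/simple/
+# Download Vocabulary Files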
+<pre>
+wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
+wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
+</pre>
+# Download Training Data
+Use the 1GB, 79K-record jsonl dataset
+<pre>
+wget https://huggingface.co/bigscience/misc-test-data/resolve/main/stas/oscar-1GB.jsonl.xz
+xz -d oscar-1GB.jsonl.xz
+</pre>
+# Training
+## Data Preprocessing
+<pre>
+python tools/preprocess_data.py \
+    --input oscar-1GB.jsonl \
+    --output-prefix my-gpt2 \
+    --vocab gpt2-vocab.json \
+    --dataset-impl mmap \
+    --tokenizer-type GPT2BPETokenizer \
+    --merge-file gpt2-merges.txt \
+    --append-eod \
+    --workers 8
+</pre>
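+## GPT Pretraining
+### Single-Card Training
+1. `examples/pretrain_gpt.sh`: runs single-GPU pretraining of a 345M-parameter GPT (single-GPU training is mainly for debugging, since the code is optimized for distributed training).
+Edit the DATA_PATH and CHECKPOINT_PATH paths, then run it.
+Parameters: `--micro-batch-size` is the batch size of a single forward-backward pass; `--global-batch-size` is the batch size per iteration; `--lr` is the learning rate; the data is split into training/validation/test sets at a ratio of 949:50:1 (the default is 969:30:1); `--train-iters` is the number of training iterations, or alternatively use `--train-samples` (the total number of training samples). When using `--train-samples`, specify `--lr-decay-samples` instead of `--lr-decay-iters`. `--lr-decay-iters` is the number of iterations over which the learning rate decays; `--fp16` sets the training data type, and fp32 is used by default if it is not set.
+2. [pretrain_gpt_single_node.sh](examples/pretrain_gpt_single_node.sh)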
+```
+N_GPUS=1
+CHECKPOINT_PATH=checkpoints/gpt2
+VOCAB_FILE=gpt2-vocab.json
+MERGE_FILE=gpt2-merges.txt
+DATA_PATH=my-gpt2_text_document
+RANK=0
+WORLD_SIZE=$N_GPUS
+GPT_ARGS=" \
+    --num-layers 24 \
+    --hidden-size 1024 \
+    --num-attention-heads 16 \
+    --seq-length 1024 \
+    --max-position-embeddings 1024 \
+    --micro-batch-size 4 \
+    --global-batch-size 8 \
+    --lr 0.00015 \
+    --train-iters 500000 \
+    --lr-decay-iters 320000 \
+    --lr-decay-style cosine \
+    --vocab-file $VOCAB_FILE \
+    --merge-file $MERGE_FILE \
+    --lr-warmup-fraction .01 \
+    --fp16 \
+    --rank ${RANK} \
+    --world_size ${WORLD_SIZE} \
+    --local_rank $RANK
+    "
+OUTPUT_ARGS=" \
+    --log-interval 10 \
+    --save-interval 500 \
+    --eval-interval 100 \
+    --eval-iters 10 \
+    --checkpoint-activations \
+    "
+DATA_ARGS=" \
+    --save $CHECKPOINT_PATH \
+    --load $CHECKPOINT_PATH \
+    --data-path $DATA_PATH \
+    "
+CMD="pretrain_gpt.py $GPT_ARGS $OUTPUT_ARGS $DATA_ARGS"
+N_GPUS=1
+LAUNCHER="deepspeed --num_gpus $N_GPUS"
+$LAUNCHER $CMD
+```
+For multi-GPU training, change `--num_gpus` to the number of GPUs to use.
+3. Simulating `distributed`
+```
+MASTER_ADDR=localhost MASTER_PORT=9994 RANK=0 LOCAL_RANK=0 python pretrain_gpt.py ...
+```
+See [`arguments.py`](./megatron/arguments.py) for more command-line arguments.
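The batch-size arguments are tied together by a simple relation: the global batch size equals the micro batch size times the data-parallel size times the number of gradient accumulation steps (the same arithmetic as `GRAD_ACC_STEPS` in `examples/curriculum_learning/pretrain_gpt_cl.sh`). The sketch below works this out for the single-node script, and also turns a `--split` string into fractions. The helper names `grad_accum_steps` and `split_fractions` are hypothetical and exist only for illustration.

```python
def grad_accum_steps(global_batch, micro_batch, n_gpus, tp=1, pp=1):
    """Micro-batches accumulated per optimizer step.

    global_batch = micro_batch * data_parallel_size * grad_accum_steps,
    where data_parallel_size = n_gpus / (tp * pp).
    """
    dp, rem = divmod(n_gpus, tp * pp)
    if rem or dp == 0:
        raise ValueError("n_gpus must be a positive multiple of tp * pp")
    per_step = micro_batch * dp
    if global_batch % per_step:
        raise ValueError("global_batch must be divisible by micro_batch * dp")
    return global_batch // per_step


def split_fractions(split):
    """Turn a split string such as '949,50,1' into train/valid/test fractions."""
    weights = [float(w) for w in split.split(",")]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Single-node script: micro batch 4, global batch 8, one GPU.
    print(grad_accum_steps(global_batch=8, micro_batch=4, n_gpus=1))  # 2
    print(split_fractions("949,50,1"))
```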
+### Deepspeed-PP and ZeRO-DP
+Use Deepspeed's PP in place of Megatron's PP, and ZeRO-DP for DP. Launching is similar to Megatron-LM, but a deepspeed config file and a few extra arguments are also required:
+```
+CHECKPOINT_PATH=checkpoints/gpt2
+VOCAB_FILE=gpt2-vocab.json
+MERGE_FILE=gpt2-merges.txt
+DATA_PATH=my-gpt2_text_document
+TENSORBOARD_PATH=output_dir/tensorboard
+CODECARBON_PATH=output_dir/codecarbon
+MICRO_BATCH_SIZE=1
+GLOBAL_BATCH_SIZE=16
+TP_SIZE=1
+PP_SIZE=1
+N_GPUS=2
+SAVE_INTERVAL=100
+RANK=0
+WORLD_SIZE=$N_GPUS
+GPT_ARGS=" \
+    --num-layers 2 \
+    --hidden-size 64 \
+    --num-attention-heads 2 \
+    --seq-length 1024 \
+    --max-position-embeddings 1024 \
+    --micro-batch-size $MICRO_BATCH_SIZE \
+    --rampup-batch-size 2 2 1_000 \
+    --global-batch-size $GLOBAL_BATCH_SIZE \
+    --train-samples 100 \
+    --optimizer adam \
+    --adam-beta1 0.9 \
+    --adam-beta2 0.95 \
+    --adam-eps 1e-8 \
+    --lr 1e-4 \
+    --lr-warmup-samples 5 \
+    --clip-grad 1.0 \
+    --weight-decay 1e-1 \
+    --vocab-file $VOCAB_FILE \
+    --merge-file $MERGE_FILE \
+    --fp16 \
+    --rank ${RANK} \
+    --world_size ${WORLD_SIZE} \
+    --local_rank $RANK
+    "
+OUTPUT_ARGS=" \
+    --log-interval 10 \
+    --save-interval $SAVE_INTERVAL \
+    --eval-interval 100 \
+    --eval-iters 10 \
+    --checkpoint-activations \
+    "
+DATA_ARGS=" \
+    --save $CHECKPOINT_PATH \
+    --load $CHECKPOINT_PATH \
+    --data-path $DATA_PATH \
+    --tensorboard-dir $TENSORBOARD_PATH \
+    --tensorboard-queue-size 5 \
+    --log-timers-to-tensorboard \
+    --log-batch-size-to-tensorboard \
+    --log-validation-ppl-to-tensorboard \
+    "
+ZERO_STAGE=1
+config_json="./ds_config.json"
+# Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size()
+cat <<EOT > $config_json
+{
+  "train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE,
+  "train_batch_size": $GLOBAL_BATCH_SIZE,
+  "gradient_clipping": 1.0,
+  "zero_optimization": {
+    "stage": $ZERO_STAGE
+  },
+  "fp16": {
+    "enabled": true,
+    "loss_scale": 0,
+    "loss_scale_window": 500,
+    "hysteresis": 2,
+    "min_loss_scale": 1,
+    "initial_scale_power": 12
+  },
+  "steps_per_print": 2000,
+  "wall_clock_breakdown": false
+}
+EOT
+DEEPSPEED_ARGS=" \
+    --deepspeed \
+    --deepspeed_config ${config_json} \
+    --zero-stage ${ZERO_STAGE} \
+    --deepspeed-activation-checkpointing \
+    "
+ALL_ARGS="$GPT_ARGS $OUTPUT_ARGS $DATA_ARGS $DEEPSPEED_ARGS"
+# if you can't stand pt-1.9 launcher noise
+export LOGLEVEL=WARNING
+LAUNCHER="deepspeed --num_gpus $N_GPUS"
+export CMD=" \
+    $LAUNCHER pretrain_gpt.py \
+    --tensor-model-parallel-size $TP_SIZE \
+    --pipeline-model-parallel-size $PP_SIZE \
+    --distributed-backend nccl \
+    $ALL_ARGS \
+    "
+echo $CMD
+$CMD
+```
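+### Distributed Multi-Card Training
+`examples/pretrain_gpt_distributed.sh`: launches distributed training with the PyTorch distributed launcher.
+Edit the DATA_PATH and CHECKPOINT_PATH paths, then run it.
+Two kinds of parallelism are used: data parallelism and model parallelism. `--DDP-impl` selects the distributed data parallel implementation: `local` performs the gradient all-reduce at the end of the back-propagation step, while `torch` overlaps the gradient reduction with the back-propagation computation. torch distributed data parallel is more efficient at larger model sizes.
+A simple and efficient two-dimensional model-parallel approach has been developed. To use tensor model parallelism (splitting the execution of a single transformer module over multiple GPUs), add `--tensor-model-parallel-size` to specify the number of GPUs to split the model across, along with the arguments passed to the distributed launcher as above. To use pipeline parallelism (sharding the transformer modules into stages with an equal number of transformer modules on each stage, and then breaking the batch into smaller microbatches), add `--pipeline-model-parallel-size` to specify the number of stages to split the model into (for example, splitting a model with 24 transformer layers across 4 stages gives each stage 6 transformer layers, i.e. `--pipeline-model-parallel-size 4`).
+For training with model parallelism, see `examples/pretrain_gpt_distributed_with_mp.sh`. The T5 model does not currently support pipeline parallelism.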
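The layer and GPU arithmetic behind the two model-parallel sizes can be sketched as follows. This is an illustrative helper only: the function name `parallel_layout` and the 8-GPU example are assumptions, not part of the repository.

```python
def parallel_layout(num_layers, n_gpus, tp_size, pp_size):
    """Derive per-stage layer count and data-parallel size.

    tp_size GPUs share each transformer layer (tensor parallelism),
    pp_size stages each hold num_layers / pp_size layers (pipeline
    parallelism), and whatever GPUs remain form data-parallel replicas.
    """
    if num_layers % pp_size:
        raise ValueError("num_layers must be divisible by pp_size")
    model_parallel_size = tp_size * pp_size
    if n_gpus % model_parallel_size:
        raise ValueError("n_gpus must be divisible by tp_size * pp_size")
    return {
        "layers_per_stage": num_layers // pp_size,
        "data_parallel_size": n_gpus // model_parallel_size,
    }


if __name__ == "__main__":
    # 24 layers over 4 pipeline stages gives 6 layers per stage.
    print(parallel_layout(num_layers=24, n_gpus=8, tp_size=2, pp_size=4))
```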
+### GPT-15B Pretraining
+# Inference
+## Model Conversion
+## GPT Text Generation
+`bash examples/generate_text.sh`
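+`--tensor-model-parallel-size` is the TP size, `--out-seq-length` is the length of the generated samples, `--load` is the path of the pretrained checkpoint to load, `--num-samples` is the number of samples to generate, `--sample-input-file <filename>` uses filename as the conditional text, and `--genfile` is the file where unconditionally generated text is saved.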
+```
+CHECKPOINT_PATH=checkpoints/gpt2
+VOCAB_FILE=gpt2-vocab.json
+MERGE_FILE=gpt2-merges.txt
+GPT_ARGS=" \
+    --tensor-model-parallel-size 1 \
+    --pipeline-model-parallel-size 1 \
+    --num-layers 24 \
+    --hidden-size 1024 \
+    --num-attention-heads 16 \
+    --seq-length 1024 \
+    --max-position-embeddings 1024 \
+    --micro-batch-size 4 \
+    --global-batch-size 8 \
+    --fp16 \
+    "
+MAX_OUTPUT_SEQUENCE_LENGTH=1024
+TEMPERATURE=1.0
+TOP_P=0.9
+NUMBER_OF_SAMPLES=2
+OUTPUT_FILE=samples.json
+RANK=0
+WORLD_SIZE=1
+python tools/generate_samples_gpt.py \
+    $GPT_ARGS \
+    --load $CHECKPOINT_PATH \
+    --out-seq-length $MAX_OUTPUT_SEQUENCE_LENGTH \
+    --temperature $TEMPERATURE \
+    --vocab-file $VOCAB_FILE \
+    --merge-file $MERGE_FILE \
+    --genfile $OUTPUT_FILE \
+    --num-samples $NUMBER_OF_SAMPLES \
+    --top_p $TOP_P \
+    --recompute \
+    --rank ${RANK} \
+    --world_size ${WORLD_SIZE}
+```
+# References
+- [README_ORIGIN](README_ORIGIN.md)
--- a/README_ORIGIN.md
+++ b/README_ORIGIN.md
--- a/examples/create_embeddings.sh
+++ b/examples/create_embeddings.sh
+#!/bin/bash
+# Compute embeddings for each entry of a given dataset (e.g. Wikipedia)
+RANK=0
+WORLD_SIZE=1
+# Wikipedia data can be downloaded from the following link:
+# https://github.com/facebookresearch/DPR/blob/master/data/download_data.py
+EVIDENCE_DATA_DIR=<Specify path of Wikipedia dataset>
+EMBEDDING_PATH=<Specify path to store embeddings>
+CHECKPOINT_PATH=<Specify path of pretrained ICT model>
+python tools/create_doc_index.py \
+    --num-layers 12 \
+    --hidden-size 768 \
+    --num-attention-heads 12 \
+    --tensor-model-parallel-size 1 \
+    --micro-batch-size 128 \
+    --checkpoint-activations \
+    --seq-length 512 \
+    --retriever-seq-length 256 \
+    --max-position-embeddings 512 \
+    --load ${CHECKPOINT_PATH} \
+    --evidence-data-path ${EVIDENCE_DATA_DIR} \
+    --embedding-path ${EMBEDDING_PATH} \
+    --indexer-log-interval 1000 \
+    --indexer-batch-size 128 \
+    --vocab-file bert-vocab.txt \
+    --num-workers 2 \
+    --fp16
--- a/examples/curriculum_learning/README.md
+++ b/examples/curriculum_learning/README.md
+This is a short tutorial on how to use and tune the curriculum learning (CL) integration. Currently it is only integrated for GPT pre-training. For technical details please refer to our [paper](https://arxiv.org/abs/2108.06084).
+# Disable batch size warmup (--rampup-batch-size)
+In our [paper](https://arxiv.org/abs/2108.06084) section 5.4 we demonstrate that curriculum learning (seqlen-based) provides much better training stability than the batch size warmup technique. So when using CL you need to remove the `--rampup-batch-size` config in your training script. It's not recommended to use both CL and batch size warmup, because both of them will reduce the number of tokens in a batch. Another related change you might want is to increase your micro batch size, since without batch size warmup your batch size will be fixed now.
+# Token-based training termination
+Because CL changes the length of each sequence/sample during training, it is very hard, if not impossible, to use the number of steps/samples to terminate training exactly at the desired number of tokens. Thus we add a `--train-tokens` config as an accurate token-based alternative for termination. We recommend increasing your original `--train-samples` or `--train-iters` to a large enough number (e.g., 2X of what you used for baseline), and setting `--train-tokens` to the exact desired number of training tokens (e.g., 300B for GPT-3 like training).
+# Token-based LR decay
+Again because CL changes the number of tokens per batch, in our [paper](https://arxiv.org/abs/2108.06084) Appendix A.2 we show that it is also necessary to change the LR decay to token-based (to avoid decaying LR too fast). Thus we add a `--lr-decay-tokens` which will be the number of LR decay tokens. If previously you were using `--lr-decay-samples`, you can calculate your `--lr-decay-tokens` simply by multiplying the former by full seqlen (e.g. 2K for GPT-3). Then you need to replace `--lr-decay-samples` with `--lr-decay-tokens` in your script.
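As a worked example of the conversion described above, the snippet below multiplies a sample-based decay budget by the full sequence length. The numbers are the ones used in `pretrain_gpt_cl.sh` in this directory (`LR_DECAY_SAMPLES=126_953_125`, `SEQLEN=1024`); the function name is hypothetical.

```python
def lr_decay_tokens(lr_decay_samples: int, seq_length: int) -> int:
    """Convert a sample-based LR decay budget into a token-based one."""
    return lr_decay_samples * seq_length


if __name__ == "__main__":
    print(lr_decay_tokens(126_953_125, 1024))  # 130000000000
```

The result is the value to pass as `--lr-decay-tokens` in place of `--lr-decay-samples`.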
+# LR warmup adjustment
+For LR warmup we don't change it to token-based, because doing so for CL means slowing down the LR warmup, which is both unnecessary and harmful. However, you may need to adjust your `--lr-warmup-samples` or `--lr-warmup-iters` from non-CL cases for various reasons (e.g., if you used `--rampup-batch-size` in non-CL case, for CL we don't use it so the number of samples per batch will be different at the beginning). Assuming you want to use `X` tokens to warm up the LR (for OpenAI GPT-3 this was 375M tokens), then for CL case you may set `--lr-warmup-samples` as `X` divided by the `min_difficulty` below, or set `--lr-warmup-iters` as `X` divided by `min_difficulty * --global-batch-size`. This is a rough estimate, based on the fact that CL starts from seqlen `min_difficulty` and the seqlen won't increase too much during LR warmup.
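The warmup estimate above can be sketched as simple integer arithmetic. The example assumes `X` = 375M tokens, a `min_difficulty` of 64, and a global batch size of 512; these values and the function names are illustrative choices, and the results are rounded down.

```python
def lr_warmup_samples(warmup_tokens: int, min_difficulty: int) -> int:
    """Rough --lr-warmup-samples: warmup tokens / starting seqlen."""
    return warmup_tokens // min_difficulty


def lr_warmup_iters(warmup_tokens: int, min_difficulty: int,
                    global_batch_size: int) -> int:
    """Rough --lr-warmup-iters: warmup tokens / (starting seqlen * batch size)."""
    return warmup_tokens // (min_difficulty * global_batch_size)


if __name__ == "__main__":
    print(lr_warmup_samples(375_000_000, 64))     # 5859375
    print(lr_warmup_iters(375_000_000, 64, 512))  # 11444
```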
+# Token-based tensorboard
+Because of the above changes, we also add token-based tensorboard scalars. We also add scalars that plot the seqlen at each step.
+# Curriculum learning hyperparameters tuning strategy
+The curriculum learning hyperparameters are all located in the deepspeed config json file (see the example `ds_config_cl.json` in this dir). There are a few config entries that you may need to adjust to your circumstances, two of which require some tuning. In our [paper](https://arxiv.org/abs/2108.06084) Appendix A.1 we have a more detailed tuning strategy description.
+1. `max_difficulty` should be set as the full seqlen (i.e., your `--seq-length`). No need to tune this.
+2. `min_difficulty` is the beginning seqlen used by CL. In general a smaller `min_difficulty` could provide a better stability/convergence speed benefit. However, we observe that for a larger model or for different training data, starting from a very small seqlen could lead to significant validation PPL fluctuation (or even divergence) at the very beginning. We recommend starting with `min_difficulty` at 64, and then increasing it if you observe problems at the very beginning. Note that to enable Tensor Core acceleration you should always use a multiple of 8.
+3. `total_curriculum_step` is the total number of steps used by CL. In general a larger `total_curriculum_step` could provide a better stability/convergence speed benefit. However, we observe that too large a `total_curriculum_step` could lead to overfitting and significant validation PPL fluctuation (or even divergence) during the first few multiples of the LR warmup steps. In our paper we have a detailed tuning strategy based on binary search. However, if you want to reduce the tuning effort we recommend directly setting `total_curriculum_step` as half of baseline's total number of steps. This may not provide the highest convergence speed benefit, but should provide enough training stability gains.
+4. `difficulty_step` is the change in seq length per CL step. A smaller value is preferable since it gives a smoother CL and better stability. Like `min_difficulty`, it needs to be a multiple of 8 for Tensor Core acceleration, thus 8 is a good default.
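To make the interaction of the four hyperparameters concrete, here is a minimal sketch of a linear seqlen schedule. It is an assumption about how a `fixed_linear` schedule could behave (linear interpolation from `min_difficulty` to `max_difficulty`, rounded down to a multiple of `difficulty_step`), not necessarily DeepSpeed's exact implementation; the function name `seqlen_at_step` is hypothetical. The example values come from `ds_config_cl.json`.

```python
def seqlen_at_step(step, min_difficulty, max_difficulty,
                   total_curriculum_step, difficulty_step):
    """Sequence length used at a given training step (illustrative only)."""
    if total_curriculum_step <= 0:
        raise ValueError("total_curriculum_step must be positive")
    # Fraction of the curriculum completed, capped at 1.0 once CL is over.
    progress = min(step / total_curriculum_step, 1.0)
    raw = int(min_difficulty + progress * (max_difficulty - min_difficulty))
    # Round down to a multiple of difficulty_step (e.g. 8 for Tensor Cores).
    seqlen = raw - raw % difficulty_step
    return max(min_difficulty, min(seqlen, max_difficulty))


if __name__ == "__main__":
    for step in (0, 30_000, 60_000, 100_000):
        print(step, seqlen_at_step(step, 8, 1024, 60_000, 8))
```

Under these assumptions the seqlen starts at 8, reaches 512 halfway through the curriculum, and stays at the full 1024 after step 60000.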
--- a/examples/curriculum_learning/ds_config_cl.json
+++ b/examples/curriculum_learning/ds_config_cl.json
+{
+  "train_batch_size": 512,
+  "gradient_accumulation_steps": 1,
+  "steps_per_print": 1,
+  "zero_optimization": {
+    "stage": 0
+  },
+  "optimizer": {
+    "type": "Adam",
+    "params": {
+      "lr": 0.00015,
+      "max_grad_norm": 1.0,
+      "betas": [0.9, 0.95]
+    }
+  },
+  "gradient_clipping": 1.0,
+  "fp16": {
+    "enabled": true,
+    "loss_scale": 0,
+    "loss_scale_window": 1000,
+    "hysteresis": 2,
+    "min_loss_scale": 1
+  },
+  "wall_clock_breakdown": false,
+  "zero_allow_untested_optimizer": false,
+  "curriculum_learning": {
+    "enabled": true,
+    "curriculum_type": "seqlen",
+    "min_difficulty": 8,
+    "max_difficulty": 1024,
+    "schedule_type": "fixed_linear",
+    "schedule_config": {
+      "total_curriculum_step": 60000,
+      "difficulty_step": 8
+    }
+  }
+}
--- a/examples/curriculum_learning/pretrain_gpt_cl.sh
+++ b/examples/curriculum_learning/pretrain_gpt_cl.sh
+#!/bin/bash
+# This is a dummy train script to show how to use curriculum
+# learning, some parameters are not for actual GPT pretraining.
+TARGET_GLOBAL_BATCH_SIZE=512
+TRAIN_SAMPLES=146_484_375
+LR=1.0e-4
+MIN_LR=1.0e-5
+LR_DECAY_SAMPLES=126_953_125
+LR_WARMUP_SAMPLES=183_105
+SEQLEN=1024
+############################################################
+# New configs for curriculum learning, see README.md
+TRAIN_TOKENS=10_000_000_000
+LR_DECAY_TOKENS=$(($LR_DECAY_SAMPLES*$SEQLEN))
+############################################################
+LOG_INTERVAL=100
+EVAL_ITERS=10
+EVAL_INTERVAL=100
+SAVE_INTERVAL=1000
+VOCAB_PATH=/data/Megatron-LM/data/gpt2-vocab.json
+MERGE_PATH=/data/Megatron-LM/data/gpt2-merges.txt
+DATA_PATH=/data/Megatron-LM/data/indexed_datasets/megatron
+MICRO_BATCH_SIZE=1
+MP_SIZE=1
+PP_SIZE=1
+NUM_GPUS=128
+echo ${NUM_GPUS}
+if [[ $PP_SIZE -gt 0 ]]; then
+    DP_SIZE=$(( ${NUM_GPUS} / (${PP_SIZE} * ${MP_SIZE}) ))
+else
+    DP_SIZE=$(( ${NUM_GPUS} / ${MP_SIZE} ))
+fi
+GRAD_ACC_STEPS=$(( ${TARGET_GLOBAL_BATCH_SIZE} / (${MICRO_BATCH_SIZE} * ${DP_SIZE}) ))
+NAME="gpt-117M-pp${PP_SIZE}-mp${MP_SIZE}-bsz${TARGET_GLOBAL_BATCH_SIZE}-mbsz${MICRO_BATCH_SIZE}-cl"
+current_time=$(date "+%Y.%m.%d-%H.%M.%S")
+host="${HOSTNAME}"
+TENSORBOARD_DIR="tensorboard/${NAME}_${host}_${current_time}"
+mkdir -p ${TENSORBOARD_DIR}
+CHECKPOINT_PATH="checkpoints/${NAME}"
+megatron_options=" \
+        --data-path ${DATA_PATH} \
+        --vocab-file ${VOCAB_PATH} \
+        --merge-file ${MERGE_PATH} \
+        --data-impl mmap \
+        --override-lr-scheduler \
+        --adam-beta1 0.9 \
+        --adam-beta2 0.95 \
+        --tensor-model-parallel-size ${MP_SIZE} \
+        --init-method-std 0.014 \
+        --lr-decay-tokens ${LR_DECAY_TOKENS} \
+        --lr-warmup-samples ${LR_WARMUP_SAMPLES} \
+        --micro-batch-size ${MICRO_BATCH_SIZE} \
+        --global-batch-size ${TARGET_GLOBAL_BATCH_SIZE} \
+        --num-layers 12 \
+        --hidden-size 768 \
+        --num-attention-heads 16 \
+        --seq-length ${SEQLEN} \
+        --max-position-embeddings ${SEQLEN} \
+        --train-samples ${TRAIN_SAMPLES} \
+        --train-tokens ${TRAIN_TOKENS} \
+        --lr ${LR} \
+        --min-lr ${MIN_LR} \
+        --lr-decay-style cosine \
+        --split 98,2,0 \
+        --log-interval ${LOG_INTERVAL} \
+        --eval-interval ${EVAL_INTERVAL} \
+        --eval-iters ${EVAL_ITERS} \
+        --save-interval ${SAVE_INTERVAL} \
+        --weight-decay 0.1 \
+        --clip-grad 1.0 \
+        --hysteresis 2 \
+        --num-workers 0 \
+        --checkpoint-activations \
+        --fp16 \
+        --load ${CHECKPOINT_PATH} \
+        --save ${CHECKPOINT_PATH} \
+        --tensorboard-queue-size 1 \
+        --log-timers-to-tensorboard \
+        --log-batch-size-to-tensorboard \
+        --log-validation-ppl-to-tensorboard \
+        --tensorboard-dir ${TENSORBOARD_DIR}"
+config_json="ds_config_cl.json"
+deepspeed_options=" \
+		    --deepspeed \
+		    --deepspeed_config ${config_json} \
+		    --pipeline-model-parallel-size ${PP_SIZE} \
+		    --partition-activations"
+run_cmd="deepspeed ../../pretrain_gpt.py ${megatron_options} ${deepspeed_options} &>> ${NAME}.log"
+echo ${run_cmd}
+eval ${run_cmd}
+set +x
--- a/examples/ds_config.json
+++ b/examples/ds_config.json
+{
+  "activation_checkpointing": {
+    "partition_activations": true,
+    "cpu_checkpointing": false,
+    "contiguous_memory_optimization": true,
+    "number_checkpoints": null,
+    "synchronize_checkpoint_boundary": false,
+    "profile": true
+  },
+  "flops_profiler": {
+    "enabled": true,
+    "profile_step": 3,
+    "module_depth": -1,
+    "top_modules": 1,
+    "detailed": true,
+    "output_file": null 
+  },
+  "train_batch_size": 8,
+  "train_micro_batch_size_per_gpu": 1,
+  "gradient_clipping": 1.0,
+  "zero_optimization": {
+    "stage": 1
+  },
+  "curriculum_learning":{
+    "enabled": false
+  },
+  "fp16": {
+    "enabled": true,
+    "loss_scale": 0,
+    "loss_scale_window": 500,
+    "hysteresis": 2,
+    "min_loss_scale": 1,
+    "initial_scale_power": 12
+  },
+  "steps_per_print": 1,
+  "memory_breakdown": false,
+  "wall_clock_breakdown": true
+}
--- a/examples/dspvit_1dcu.sh
+++ b/examples/dspvit_1dcu.sh
+#! /bin/bash
+# Runs the "345M" parameter model
+DATA_PATH="./data"
+CHECKPOINT_PATH="./checkpoint"
+DS_CONFIG="./examples/ds_config.json"
+RANK=0
+WORLD_SIZE=1
+deepspeed --num_gpus 1 pretrain_vit.py \
+       --num-layers 24 \
+       --hidden-size 1024 \
+       --num-attention-heads 16 \
+       --micro-batch-size 1 \
+       --global-batch-size 8 \
+       --seq-length 1024 \
+       --max-position-embeddings 1024 \
+       --train-iters 500000 \
+       --lr-decay-iters 320000 \
+       --save $CHECKPOINT_PATH \
+       --load $CHECKPOINT_PATH \
+       --data-path $DATA_PATH \
+       --data-impl mmap \
+       --split 949,50,1 \
+       --distributed-backend nccl \
+       --lr 0.00015 \
+       --min-lr 1.0e-5 \
+       --lr-decay-style cosine \
+       --weight-decay 1e-2 \
+       --clip-grad 1.0 \
+       --lr-warmup-fraction .01 \
+       --checkpoint-activations \
+       --log-interval 100 \
+       --save-interval 10000 \
+       --eval-interval 1000 \
+       --eval-iters 10 \
+       --fp16 \
+       --padded_vocab_size 224\
+       --rank ${RANK} \
+       --world_size ${WORLD_SIZE} \
+       --deepspeed \
+       --deepspeed_config $DS_CONFIG \
--- a/examples/dspvit_1node.sh
+++ b/examples/dspvit_1node.sh
+#! /bin/bash
+# Runs the "345M" parameter model
+DATA_PATH="./data"
+CHECKPOINT_PATH="./checkpoint"
+DS_CONFIG="./examples/ds_config.json"
+MICRO_BATCH_SIZE=1
+GLOBAL_BATCH_SIZE=8
+deepspeed --num_gpus 4 pretrain_vit.py \
+       --num-layers 24 \
+       --hidden-size 1024 \
+       --num-attention-heads 16 \
+       --micro-batch-size ${MICRO_BATCH_SIZE} \
+       --global-batch-size ${GLOBAL_BATCH_SIZE} \
+       --seq-length 1024 \
+       --max-position-embeddings 1024 \
+       --train-iters 500000 \
+       --lr-decay-iters 320000 \
+       --save $CHECKPOINT_PATH \
+       --load $CHECKPOINT_PATH \
+       --data-path $DATA_PATH \
+       --data-impl mmap \
+       --split 949,50,1 \
+       --distributed-backend nccl \
+       --lr 0.00015 \
+       --min-lr 1.0e-5 \
+       --lr-decay-style cosine \
+       --weight-decay 1e-2 \
+       --clip-grad 1.0 \
+       --lr-warmup-fraction .01 \
+       --checkpoint-activations \
+       --log-interval 100 \
+       --save-interval 10000 \
+       --eval-interval 1000 \
+       --eval-iters 10 \
+       --fp16 \
+       --padded_vocab_size 224\
+       --deepspeed \
+       --deepspeed_config $DS_CONFIG \
+# --eval-only True \
+# --do_test True \
--- a/examples/evaluate_ict_zeroshot_nq.sh
+++ b/examples/evaluate_ict_zeroshot_nq.sh
+#!/bin/bash
+# Evaluate natural question test data given Wikipedia embeddings and pretrained
+# ICT model
+# Datasets can be downloaded from the following link:
+# https://github.com/facebookresearch/DPR/blob/master/data/download_data.py
+EVIDENCE_DATA_DIR=<Specify path of Wikipedia dataset>
+EMBEDDING_PATH=<Specify path of the embeddings>
+CHECKPOINT_PATH=<Specify path of pretrained ICT model>
+QA_FILE=<Path of the natural question test dataset>
+python tasks/main.py \
+    --task ICT-ZEROSHOT-NQ \
+    --tokenizer-type BertWordPieceLowerCase \
+    --num-layers 12 \
+    --hidden-size 768 \
+    --num-attention-heads 12 \
+    --tensor-model-parallel-size 1 \
+    --micro-batch-size 128 \
+    --checkpoint-activations \
+    --seq-length 512 \
+    --max-position-embeddings 512 \
+    --load ${CHECKPOINT_PATH} \
+    --evidence-data-path ${EVIDENCE_DATA_DIR} \
+    --embedding-path ${EMBEDDING_PATH} \
+    --retriever-seq-length 256 \
+    --vocab-file  bert-vocab.txt\
+    --qa-data-test ${QA_FILE} \
+    --num-workers 2 \
+    --faiss-use-gpu \
+    --retriever-report-topk-accuracies 1 5 20 100 \
+    --fp16
--- a/examples/evaluate_zeroshot_gpt.sh
+++ b/examples/evaluate_zeroshot_gpt.sh
+#!/bin/bash
+WORLD_SIZE=8
+DISTRIBUTED_ARGS="--nproc_per_node $WORLD_SIZE \
+                  --nnodes 1 \
+                  --node_rank 0 \
+                  --master_addr localhost \
+                  --master_port 6000"
+TASK="LAMBADA"
+VALID_DATA=<lambada path>
+VOCAB_FILE=gpt2-vocab.json
+MERGE_FILE=gpt2-merges.txt
+CHECKPOINT=checkpoints/gpt2_345m
+python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/main.py \
+               --task $TASK \
+               --valid-data $VALID_DATA \
+               --tokenizer-type GPT2BPETokenizer \
+               --strict-lambada \
+               --vocab-file $VOCAB_FILE \
+               --merge-file $MERGE_FILE \
+               --load $CHECKPOINT \
+               --tensor-model-parallel-size 1 \
+               --num-layers 24 \
+               --hidden-size 1024 \
+               --num-attention-heads 16 \
+               --batch-size 8 \
+               --checkpoint-activations \
+               --seq-length 1024 \
+               --max-position-embeddings 1024 \
+               --log-interval 10 \
+               --fp16 \
+               --no-load-optim \
+               --no-load-rng