- `ec2-image-id` is the AMI, which has to be created in, or copied to, the `aws-region` the script requests.
- `subnet-id` comes from: https://console.aws.amazon.com/vpc/home?region=us-east-1#subnets:
- `security-group-id` comes from: https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#SecurityGroups:
The script was later updated with a fault-tolerant approach: it tries to start the EC2 instance in 3 different sub-regions, to cope with situations where EC2 reports it doesn't have the resources to start the desired instance.
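As an illustration of that fallback behaviour (not the actual workflow code), here is a minimal bash sketch with hypothetical IDs that retries the same launch across three subnets until one succeeds:

```bash
#!/bin/bash
# Hypothetical IDs for illustration only; the real values come from the consoles linked above.
AMI_ID="ami-0123456789abcdef0"
SECURITY_GROUP_ID="sg-0123456789abcdef0"
# One subnet per sub-region to fall back on.
SUBNET_IDS=("subnet-aaaa1111" "subnet-bbbb2222" "subnet-cccc3333")

for subnet in "${SUBNET_IDS[@]}"; do
    if aws ec2 run-instances \
        --region us-east-1 \
        --image-id "$AMI_ID" \
        --instance-type p3.8xlarge \
        --subnet-id "$subnet" \
        --security-group-ids "$SECURITY_GROUP_ID" \
        --count 1; then
        echo "Started an instance in $subnet"
        break
    fi
    echo "No capacity in $subnet, trying the next subnet"
done
```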
## Connect to instance
To pre-install things, connect to the instance manually and install whatever is needed:
1. choose and start an EC2 instance
2. connect to it as `ubuntu`, then `sudo su`, since the runner runs as `root` (I couldn't find a way around that).
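For example (hypothetical key path and hostname; use your instance's public DNS from the EC2 console):

```bash
# Connect as the default ubuntu user, then become root, since the runner runs as root.
ssh -i ~/.ssh/my-runner-key.pem ubuntu@ec2-12-34-56-78.compute-1.amazonaws.com
sudo su
# Now install whatever the CI jobs need, e.g.:
apt-get update && apt-get install -y git-lfs
```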
# Curriculum learning for GPT pre-training

This is a short tutorial on how to use and tune the curriculum learning (CL) integration. Currently it is only integrated for GPT pre-training. For technical details please refer to our [paper](https://arxiv.org/abs/2108.06084).
# Disable batch size warmup (--rampup-batch-size)
In our [paper](https://arxiv.org/abs/2108.06084) section 5.4 we demonstrate that curriculum learning (seqlen-based) provides much better training stability than the batch size warmup technique. So when using CL you need to remove the `--rampup-batch-size` config from your training script. It's not recommended to use both CL and batch size warmup, because both of them reduce the number of tokens in a batch. Another related change you might want to make is increasing your micro batch size, since without batch size warmup your batch size is now fixed from the start.
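For example, a sketch of the argument change (flag values are illustrative):

```bash
# A non-CL baseline script might contain something like:
#   --rampup-batch-size 32 32 6000000 --micro-batch-size 1
# For a CL run, drop --rampup-batch-size entirely and, if memory allows,
# raise the micro batch size since the batch size is now fixed from step 0:
GPT_ARGS="$GPT_ARGS --micro-batch-size 4 --global-batch-size 512"
```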
# Token-based training termination
Because CL changes the length of each sequence/sample during training, it is very hard (if not impossible) to use the number of steps/samples to terminate training at exactly the desired number of tokens. Thus we add a `--train-tokens` config as an accurate token-based termination alternative. We recommend increasing your original `--train-samples` or `--train-iters` to a large enough number (e.g., 2x of what you used for the baseline), and setting `--train-tokens` to the exact desired number of training tokens (e.g., 300B for GPT-3-like training).
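A sketch with GPT-3-like numbers (the sample budget is illustrative):

```bash
# Termination is driven by --train-tokens; --train-samples just needs to be
# large enough (e.g., ~2x the baseline) that it never triggers first.
TRAIN_TOKENS=300000000000   # 300B tokens
TRAIN_SAMPLES=300000000     # generous upper bound, roughly 2x a 2K-seqlen baseline

GPT_ARGS="$GPT_ARGS --train-tokens $TRAIN_TOKENS --train-samples $TRAIN_SAMPLES"
```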
# Token-based LR decay
Again, because CL changes the number of tokens per batch, in our [paper](https://arxiv.org/abs/2108.06084) Appendix A.2 we show that it is also necessary to change the LR decay to token-based (to avoid decaying the LR too fast). Thus we add a `--lr-decay-tokens` config, which is the number of tokens over which the LR decays. If previously you were using `--lr-decay-samples`, you can calculate `--lr-decay-tokens` simply by multiplying the former by the full seqlen (e.g., 2K for GPT-3). Then replace `--lr-decay-samples` with `--lr-decay-tokens` in your script.
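For example, assuming a full seqlen of 2048 and a baseline that decayed the LR over 126,953,125 samples (i.e., 260B tokens; the numbers are illustrative):

```bash
# Convert a sample-based LR decay horizon into a token-based one: samples * full seqlen.
SEQ_LEN=2048
LR_DECAY_SAMPLES=126953125                          # the old --lr-decay-samples value
LR_DECAY_TOKENS=$(( LR_DECAY_SAMPLES * SEQ_LEN ))   # = 260000000000 (260B tokens)

GPT_ARGS="$GPT_ARGS --lr-decay-tokens $LR_DECAY_TOKENS"   # replaces --lr-decay-samples
```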
# LR warmup adjustment
We don't change the LR warmup to token-based, because doing so for CL means slowing down the LR warmup, which is both unnecessary and harmful. However, you may need to adjust your `--lr-warmup-samples` or `--lr-warmup-iters` from the non-CL case for various reasons (e.g., if you used `--rampup-batch-size` in the non-CL case, CL does not use it, so the number of samples per batch will be different at the beginning). Assuming you want to use `X` tokens to warm up the LR (for OpenAI GPT-3 this was 375M tokens), then for the CL case you may set `--lr-warmup-samples` to `X` divided by the `min_difficulty` below, or set `--lr-warmup-iters` to `X` divided by `min_difficulty * --global-batch-size`. This is a rough estimate based on the fact that CL starts from seqlen `min_difficulty` and the seqlen won't increase much during LR warmup.
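A worked example with OpenAI GPT-3's 375M warmup tokens, a CL starting seqlen (`min_difficulty`) of 64, and an assumed global batch size of 512:

```bash
WARMUP_TOKENS=375000000     # X: number of tokens to warm up the LR over
MIN_DIFFICULTY=64           # CL starting seqlen
GLOBAL_BATCH_SIZE=512       # illustrative

# Either express the warmup in samples...
LR_WARMUP_SAMPLES=$(( WARMUP_TOKENS / MIN_DIFFICULTY ))                       # ~5.9M samples
# ...or in iterations:
LR_WARMUP_ITERS=$(( WARMUP_TOKENS / (MIN_DIFFICULTY * GLOBAL_BATCH_SIZE) ))   # ~11.4K iters
```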
# Token-based tensorboard
Because of the above changes, we also add token-based tensorboard scalars, as well as scalars that plot the seqlen at each step.
# Curriculum learning hyperparameter tuning

The curriculum learning hyperparameters are all located in the DeepSpeed config JSON file (see the example `ds_config_cl.json` in this dir). There are a few config entries that you may need to adjust to your circumstances, two of which require some tuning (a sketch of the resulting config block follows the list below). In our [paper](https://arxiv.org/abs/2108.06084) Appendix A.1 we describe a more detailed tuning strategy.
1. `max_difficulty` should be set to the full seqlen (i.e., your `--seq-length`). No need to tune this.
2. `min_difficulty` is the beginning seqlen used by CL. In general, a smaller `min_difficulty` could provide better stability/convergence speed benefits. However, we observe that for a larger model or for different training data, starting from a very small seqlen could lead to significant validation PPL fluctuations (or even divergence) at the very beginning. We recommend starting with `min_difficulty` at 64, and increasing it if you observe problems at the very beginning. Note that to enable Tensor Core acceleration you should always use a multiple of 8.
3. `total_curriculum_step` is the total number of steps used by CL. In general, a larger `total_curriculum_step` could provide better stability/convergence speed benefits. However, we observe that a too-large `total_curriculum_step` could lead to overfitting and significant validation PPL fluctuations (or even divergence) in the first few multiples of the LR warmup steps. In our paper we describe a detailed tuning strategy based on binary search. However, if you want to reduce the tuning effort, we recommend directly setting `total_curriculum_step` to half of the baseline's total number of steps. This may not provide the highest convergence speed benefit, but should provide enough training stability gains.
4. `difficulty_step` is the change in seqlen per CL step. A smaller value is preferable since it gives a smoother curriculum and better stability. Like `min_difficulty`, it also needs to be a multiple of 8 for Tensor Core acceleration, so 8 is a good default.
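For reference, here is a minimal sketch of how the entries above might be written into the DeepSpeed config from a bash training script, with illustrative values assuming `--seq-length 2048` and a baseline of roughly 72K total steps. The exact key names can differ across DeepSpeed versions, so treat the `ds_config_cl.json` shipped in this dir as the authoritative example:

```bash
# Write a DeepSpeed config containing the curriculum learning block
# (all other DeepSpeed entries are omitted here for brevity).
config_json="ds_config_cl.json"
cat <<EOT > $config_json
{
  "curriculum_learning": {
    "enabled": true,
    "curriculum_type": "seqlen",
    "min_difficulty": 64,
    "max_difficulty": 2048,
    "schedule_type": "fixed_linear",
    "schedule_config": {
      "total_curriculum_step": 36000,
      "difficulty_step": 8
    }
  }
}
EOT
```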