“13.0”

1106877d · jerrrrry · 1106877d · 1106877d · 1106877d · 1106877d
Commit 1106877d authored Sep 23, 2025 by jerrrrry
20 changed files
--- a/Megatron-LM/docs/source/api-guide/pipeline_parallel_layout.md
+++ b/Megatron-LM/docs/source/api-guide/pipeline_parallel_layout.md
+# Custom Pipeline Model Parallel Layout
+
+*This is an experimental feature and may be changed.*
+
+`--pipeline-model-parallel-layout` is a flexible API for defining the pipeline parallel partitioning, which is essential for balanced partitioning for an imbalanced model. For example, to partition DeepSeek-V3 (61 decoder layers + 1 mtp layer) with PP16VPP2, we can include the arguments as follows:
+
+```bash
+--pipeline-model-parallel-size 16
+--pipeline-model-parallel-layout "Et*3|(tt|)*29,m|L"
+```
+
+| PP \ VPP rank |            0            |       1       |
+|---------------|-------------------------|---------------|
+|       0       | embedding + 3 × decoder |  2 × decoder  |
+|      1~13     |        2 × decoder      |  2 × decoder  |
+|       14      |        2 × decoder      |      mtp      |
+|       15      |        2 × decoder      |      loss     |
+
+In the layout string, stages are split by '|'. Replicated stages or layers can be described with multiplication. Commas can be used cosmetically. Symbol choices:
+
+* `E` = embedding layer
+* `t` = transformer decoder layer
+* `m` = MTP layer
+* `L` = loss calculation layer
+
+Note that it is legal to have empty stages, e.g., `E||t|L` (the second stage is empty).
--- a/Megatron-LM/docs/source/api-guide/tensor_parallel.rst
+++ b/Megatron-LM/docs/source/api-guide/tensor_parallel.rst
+tensor\_parallel package
+========================
+
+This package contains an implementation for tensor parallelism in transformer
+models (see `Megatron-LM: Training Multi-Billion Parameter Language Models
+Using Model Parallelism <https://arxiv.org/abs/1909.08053>`_ and `Reducing
+Activation Recomputation in Large Transformer Models <https://arxiv.org/abs/2205.05198>`_
+for details).
+
+Submodules
+----------
+
+tensor\_parallel.cross\_entropy module
+--------------------------------------
+
+.. automodule:: core.tensor_parallel.cross_entropy
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+tensor\_parallel.data module
+----------------------------
+
+.. automodule:: core.tensor_parallel.data
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+tensor\_parallel.layers module
+------------------------------
+
+.. automodule:: core.tensor_parallel.layers
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+tensor\_parallel.mappings module
+--------------------------------
+
+.. automodule:: core.tensor_parallel.mappings
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+tensor\_parallel.random module
+------------------------------
+
+.. automodule:: core.tensor_parallel.random
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+tensor\_parallel.utils module
+-----------------------------
+
+.. automodule:: core.tensor_parallel.utils
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+Module contents
+---------------
+
+.. automodule:: core.tensor_parallel
+   :members:
+   :undoc-members:
+   :show-inheritance:
--- a/Megatron-LM/docs/source/api-guide/transformer.rst
+++ b/Megatron-LM/docs/source/api-guide/transformer.rst
+transformer package
+===================
+
+The `transformer` package provides a customizable and configurable
+implementation of the transformer model architecture. Each component
+of a transformer stack, from entire layers down to individual linear
+layers, can be customized by swapping in different PyTorch modules
+using the "spec" parameters (see `here
+<https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/nlp/nemo_megatron/mcore_customization.html>`_). The
+configuration of the transformer (hidden size, number of layers,
+number of attention heads, etc.) is provided via a `TransformerConfig`
+object.
+
+Submodules
+----------
+
+transformer.attention module
+----------------------------
+
+This is the entire attention portion, either self or cross attention,
+of a transformer layer including the query, key, and value
+projections, a "core" attention calculation (e.g. dot product
+attention), and final output linear projection.
+
+.. automodule:: core.transformer.attention
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+transformer.dot\_product\_attention module
+------------------------------------------
+
+This is a PyTorch-only implementation of dot product attention. A more
+efficient implementation, like those provided by FlashAttention or
+CUDNN's FusedAttention, are typically used when training speed is
+important.
+
+.. automodule:: core.transformer.dot_product_attention
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+transformer.enums module
+------------------------
+
+.. automodule:: core.transformer.enums
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+transformer.identity\_op module
+-------------------------------
+
+This provides a pass-through module that can be used in specs to
+indicate that the operation should not be performed. For example, when
+using LayerNorm with the subsequent linear layer, an IdentityOp can be
+passed in as the LayerNorm module to use.
+
+.. automodule:: core.transformer.identity_op
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+transformer.mlp module
+----------------------
+
+This is the entire MLP portion of the transformer layer with an input
+projection, non-linearity, and output projection.
+
+.. automodule:: core.transformer.mlp
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+transformer.module module
+-------------------------
+
+This provides a common base class for all modules used in the
+transformer that contains some common functionality.
+
+.. automodule:: core.transformer.module
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+transformer.transformer\_block module
+-------------------------------------
+
+A block, or stack, of several transformer layers. The layers can all
+be the same or each can be unique.
+
+.. automodule:: core.transformer.transformer_block
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+transformer.transformer\_config module
+--------------------------------------
+
+This contains all of the configuration options for the
+transformer. Using a dataclass reduces code bloat by keeping all
+arguments together in a dataclass instead of passing several arguments
+through multiple layers of function calls.
+
+.. automodule:: core.transformer.transformer_config
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+transformer.transformer\_layer module
+-------------------------------------
+
+A single standard transformer layer including attention and MLP blocks.
+
+.. automodule:: core.transformer.transformer_layer
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+transformer.utils module
+------------------------
+
+Various utilities used in the transformer implementation.
+
+.. automodule:: core.transformer.utils
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+Module contents
+---------------
+
+.. automodule:: core.transformer
+   :members:
+   :undoc-members:
+   :show-inheritance:
--- a/Megatron-LM/docs/source/images/context_parallel/CP_overview.png
+++ b/Megatron-LM/docs/source/images/context_parallel/CP_overview.png
--- a/Megatron-LM/docs/source/images/context_parallel/CP_results.png
+++ b/Megatron-LM/docs/source/images/context_parallel/CP_results.png
--- a/Megatron-LM/docs/source/images/custom_fsdp/FSDP_Allreduce.png
+++ b/Megatron-LM/docs/source/images/custom_fsdp/FSDP_Allreduce.png
--- a/Megatron-LM/docs/source/images/custom_fsdp/FSDP_workflow.png
+++ b/Megatron-LM/docs/source/images/custom_fsdp/FSDP_workflow.png
--- a/Megatron-LM/docs/source/images/custom_fsdp/MCore_Custom_FSDP_Class_Diagram.png
+++ b/Megatron-LM/docs/source/images/custom_fsdp/MCore_Custom_FSDP_Class_Diagram.png
--- a/Megatron-LM/docs/source/images/distrib_optimizer/data_flow.png
+++ b/Megatron-LM/docs/source/images/distrib_optimizer/data_flow.png
--- a/Megatron-LM/docs/source/images/distrib_optimizer/sharding_scheme.png
+++ b/Megatron-LM/docs/source/images/distrib_optimizer/sharding_scheme.png
--- a/Megatron-LM/docs/source/images/moe/token_drop.png
+++ b/Megatron-LM/docs/source/images/moe/token_drop.png
--- a/Megatron-LM/docs/source/images/multi_token_prediction/MTP_implementation.png
+++ b/Megatron-LM/docs/source/images/multi_token_prediction/MTP_implementation.png
--- a/Megatron-LM/docs/source/index.rst
+++ b/Megatron-LM/docs/source/index.rst
+.. Lumache documentation master file, created by
+   sphinx-quickstart on Tue Aug 15 13:44:10 2023.
+   You can adapt this file completely to your liking, but it should at least
+   contain the root `toctree` directive.
+
+Megatron Core User Guide
+===================================
+
+**Megatron Core** is a Python library that has the core components required to build your language models. 
+A reference implementation of Megatron Core can be found in  `NeMo <https://github.com/NVIDIA/NeMo/tree/main>`_ It offers a *simple* and
+*intuitive* API.
+
+.. toctree::
+   :maxdepth: 2
+   :caption: User Guide
+
+   user-guide/index
+
+.. toctree::
+   :maxdepth: 3
+   :caption: API Guide
+   
+   api-guide/index
--- a/Megatron-LM/docs/source/user-guide/index.rst
+++ b/Megatron-LM/docs/source/user-guide/index.rst
+User Guide 
+============
+
+.. mdinclude:: ../../../megatron/core/QuickStart.md
+.. mdinclude:: ../../../megatron/core/MSC_Integration.md
\ No newline at end of file
--- a/Megatron-LM/examples/__init__.py
+++ b/Megatron-LM/examples/__init__.py
--- a/Megatron-LM/examples/academic_paper_scripts/detxoify_lm/README.md
+++ b/Megatron-LM/examples/academic_paper_scripts/detxoify_lm/README.md
+# SGEAT: Detoxify Larger-scale Language Models
+
+This is the official code base for our NeurIPS 2022 paper:
+
+[Exploring the Limits of Domain-Adaptive Training for Detoxifying Large-Scale Language Models](https://arxiv.org/abs/2202.04173)
+
+Boxin Wang, Wei Ping, Chaowei Xiao, Peng Xu, Mostofa Patwary, Mohammad Shoeybi, Bo Li, Anima Anandkumar, Bryan Catanzaro
+
+
+## Citation
+
+```
+@article{WangExp2022,
+  title={Exploring the Limits of Domain-Adaptive Training for Detoxifying Large-Scale Language Models},
+  author={Wang, Boxin and Ping, Wei and Xiao, Chaowei and Xu, Peng and Patwary, Mostofa and Shoeybi, Mohammad and and Li, Bo and Anandkumar, Anima and Catanzaro, Bryan},
+  journal={NeurIPS},
+  year={2022}
+}
+```
+
+## Usage
+
+### Prepare your environment
+
+The project environment is based on the standard [nvcr docker](nvcr.io/nvidia/pytorch:21.12-py3) of version `nvcr.io/nvidia/pytorch:21.12-py3`.
+
+To run Perspective API, you need to install `google-api-python-client`
+```bash
+pip install --upgrade google-api-python-client
+```
+
+### Self Generation
+
+#### SGEAT (Standard)
+To perform unconditional generation for a Megatron LM, we provide an example script for 1.3B LM.
+
+```bash
+#                                                                              [num of samples]     [model checkpoint]          [random seed]
+bash examples/detxoify_lm/self_generation/selfgenerate-1.3b-unconditional.sh       1000          checkpoints/gpt3/gpt3-1.3b/      2333
+```
+This will generate a jsonl file of  1000 generated text (as a toy example) at `selfgeneration/unconditional_generation_gpt3-1.3b/2333.out`. 
+
+Note that you may want to set your own gpt2 vocab and merge file dir, as well as your output data dir in `selfgenerate-1.3b-unconditional.sh`.
+
+### Annotation
+
+We then use Perspective API to annotate the self generated corpus. Note that you need to fill in your own Perspective API key in the `examples/detoxify_lm/perspective_api_annotate.py`. 
+
+```bash
+python examples/detxoify_lm/perspective_api_annotate.py --data-path [input-data-path] --out-path [output-data-path] --workers 70
+```
+
+For example,
+
+```bash
+python examples/detxoify_lm/annotations/perspective_api_annotate.py --data-path  selfgeneration/unconditional_generation_gpt3-1.3b/2333.out --out-path  selfgeneration/unconditional_generation_gpt3-1.3b/2333.annotated.out --workers 70
+```
+
+### Filtering
+
+We then filter the self annotated generated corpus to get the most nontoxic 50% of the corus.
+
+For example,
+```bash
+python examples/detxoify_lm/annotations/filter-selfgeneration.py --data-path  selfgeneration/unconditional_generation_gpt3-1.3b/2333.annotated.out --out-path  selfgeneration/unconditional_generation_gpt3-1.3b/2333.annotated.nontoxic.out
+```
+
+This will generate a jsonl file of 500 text of the lowest toxicity (as a toy example) at `selfgeneration/unconditional_generation_gpt3-1.3b/2333.annotated.nontoxic.out`. 
+
+
+### Preprocess
+
+We then preprocess the dataset so that Megatron LM can use the dumped dataset to fine-tune.
+
+```
+bash examples/detxoify_lm/annotations/preprocess.sh selfgeneration/unconditional_generation_gpt3-1.3b/2333.annotated.nontoxic.out selfgeneration/unconditional_generation_gpt3-1.3b/2333.annotated.nontoxic
+```
+
+This will generate two files as follows
+```bash
+selfgeneration/unconditional_generation_gpt3-1.3b/2333.annotated.nontoxic_text_document.idx
+selfgeneration/unconditional_generation_gpt3-1.3b/2333.annotated.nontoxic_text_document.bin
+```
+which will be used in the following domain-adative training step.
+
+### Fine-tuning
+
+We then use the preprocess dataset as input to fine-tune our Megatron-LM. 
+```bash
+#                                                                          [fine-tuning dataset]                                                                      [output-dir]                             [lr]    [bs]      [train-iters]                       [load checkpoint]
+bash examples/detxoify_lm/finetune_gpt_distributed-1.3b.sh    selfgeneration/unconditional_generation_gpt3-1.3b/2333.annotated.nontoxic_text_document         gpt3-1.3b-toy-example-lr-2e-5-bs-512             2e-5     512            78                          checkpoints/gpt3/gpt3-1.3b
+```
+
+This will dump the final checkpoint in `$SHARE_DATA/gpt3-1.3b-toy-example-lr-2e-5-bs-512`. (`$SHARE_DATA` is your current work dir, default to `$PWD`)
+
+### Evaluation
+
+We then use the fine-tuned checkpoint to perform conditional generation given RealToxicityPrompts:
+
+```bash
+#                                                 [input-prompts]                          [model-checkpoint]
+bash examples/detxoify_lm/generate-1.3b.sh     augmented_prompts.jsonl      $SHARE_DATA/gpt3-1.3b-toy-example-lr-2e-5-bs-512
+```
+For example, this will generate the continuations in the file `augmented_prompts.jsonl_output_gpt3-1.3b-toy-example-lr-2e-5-bs-512_seed_31846.jsonl` (seed is a random generated number).
+
+Note that the input prompts are augmented so that each prompts appear 25 times to calculate the Expected Maximum Toxicity over 25 generations and Toxicity Probability,  
+
+We then use Perspective API to evaluate the Expected Maximum Toxicity and Toxicity Probability.   
+
+```bash
+python examples/detxoify_lm/perspective_api.py --data-path "augmented_prompts.jsonl_output_gpt3-1.3b-toy-example-lr-2e-5-bs-512_seed_31846.jsonl" --prompt-path prompts.jsonl --workers 30
+```
\ No newline at end of file
--- a/Megatron-LM/examples/academic_paper_scripts/detxoify_lm/annotations/filter-selfgeneration.py
+++ b/Megatron-LM/examples/academic_paper_scripts/detxoify_lm/annotations/filter-selfgeneration.py
+import json
+import time
+from typing import Dict, Optional, List
+
+import joblib
+from googleapiclient import discovery
+from googleapiclient.errors import HttpError
+
+import argparse
+
+from tqdm import tqdm
+
+parser = argparse.ArgumentParser(description='Process some integers.')
+parser.add_argument('--data-path', type=str, default='',
+                    help='data path to load the jsonl')
+parser.add_argument('--out-path', type=str, default='',
+                    help='data path to load the jsonl')
+parser.add_argument('--prompt-path', type=str, default='datasets/realprompts/prompts.jsonl',
+                    help='data path to load the prompt jsonl')
+parser.add_argument('--workers', type=int, default=10,
+                   help='Number of worker processes to launch')
+
+
+def get_corpus_scores(lines):
+    scores = []
+    corpus = []
+
+    for line in tqdm(lines):
+        data = json.loads(line)
+        score = data['score']
+        if score is not None:
+            scores.append(score['toxicity'])
+            corpus.append(data)
+    return scores, corpus
+
+
+def main():
+    args = parser.parse_args()
+
+    with open(args.data_path) as f:
+        lines = f.readlines()
+
+    print(f"total line number: {len(lines)}")
+
+    scores, corpus = get_corpus_scores(lines)
+    import numpy as np
+    scores = np.array(scores)
+    indices = np.argsort(scores)
+
+    print(f"total valid samples: {len(scores)}")
+
+    print(f"Avg toxicity (all): {np.mean(scores)} +- {np.std(scores)}")
+    print(f"Avg toxicity (toxic): {np.mean(scores[scores > 0.5])} +- {np.std(scores[scores > 0.5])}")
+    print(f"Toxic Percentage {sum(scores > 0.5) / len(scores)}")
+    print(f"Avg toxicity (nontoxic): {np.mean(scores[scores <= 0.5])} +- {np.std(scores[scores <= 0.5])}")
+    print(f"Nontoxic Percentage {sum(scores <= 0.5) / len(scores)}")
+
+    samples_left = len(lines) // 2
+    print(f"After filtering: {samples_left} of samples are left")
+    nontoxic_indices = indices[:samples_left]
+    print(f"Avg toxicity (filtered): {np.mean(scores[nontoxic_indices])} +- {np.std(scores[nontoxic_indices])}")
+    print(f"Toxicity Range (filtered): {np.min(scores[nontoxic_indices])} ~ {np.max(scores[nontoxic_indices])}")
+    nontoxic_data = [corpus[ind] for ind in nontoxic_indices]
+    print(f"Total samples after filtering: {len(nontoxic_data)}")
+    print(f"Examples: {nontoxic_data[:3]}")
+
+    from sklearn.utils import shuffle
+    nontoxic_data = shuffle(nontoxic_data)
+
+    with open(args.out_path, 'w') as f:
+        for x in nontoxic_data:
+            f.write(json.dumps(x) + '\n')
+
+
+main()
\ No newline at end of file
--- a/Megatron-LM/examples/academic_paper_scripts/detxoify_lm/annotations/perspective_api_annotate.py
+++ b/Megatron-LM/examples/academic_paper_scripts/detxoify_lm/annotations/perspective_api_annotate.py
+import json
+import time
+from typing import Dict, Optional, List
+
+import joblib
+from googleapiclient import discovery
+from googleapiclient.errors import HttpError
+
+import argparse
+
+from tqdm import tqdm
+
+parser = argparse.ArgumentParser(description='Process some integers.')
+parser.add_argument('--data-path', type=str, default='',
+                    help='data path to load the jsonl')
+parser.add_argument('--out-path', type=str, default='',
+                    help='data path to load the jsonl')
+parser.add_argument('--total', type=int, default=-1,
+                    help='Total number of data')
+parser.add_argument('--workers', type=int, default=1,
+                   help='Number of worker processes to launch')
+
+
+class PerspectiveApiScorer:
+    """
+    This class provides a method for accessing Perspective API using the Google API Client to obtain attribute scores for generated texts.
+    """
+
+    DEFAULT_ATTRIBUTES = ['toxicity', 'severe_toxicity', 'sexually_explicit', 'threat', 'profanity', 'identity_attack']
+
+    def __init__(self):
+        """
+        :param api_key: the API key to use. For details, see https://support.perspectiveapi.com/s/docs-get-started
+        """
+        api_key = ''
+        self._service = discovery.build(
+            "commentanalyzer",
+            "v1alpha1",
+            developerKey=api_key,
+            discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
+            static_discovery=False,
+        )
+
+    def get_scores(self, input_text: str, requested_attributes: Optional[List[str]] = None) -> Dict[str, float]:
+        """
+        Get attribute scores for a given text via Perspective API.
+        :param input_text: the input text
+        :param requested_attributes: the attributes for which to compute scores
+        :return: a mapping from attribute names to scores
+        """
+        requested_attributes = requested_attributes if requested_attributes else PerspectiveApiScorer.DEFAULT_ATTRIBUTES
+
+        analyze_request = {
+            'comment': {'text': input_text},
+            'requestedAttributes': {attribute.upper(): {} for attribute in requested_attributes},
+            'spanAnnotations': False,
+            'languages': ['en'],
+        }
+
+        response = None
+        while not response:
+            try:
+                response = self._service.comments().analyze(body=analyze_request).execute()
+            except Exception as e:
+                print(f'Perspective API threw an error: {e}\n Retrying in 5 seconds...')
+                print(input_text)
+                time.sleep(1)
+
+        return {attribute: response['attributeScores'][attribute.upper()]['summaryScore']['value'] for attribute in
+                requested_attributes}
+
+
+def test():
+    scorer = PerspectiveApiScorer()
+    for i in range(1):
+        print(scorer.get_scores("toxic behaviors and nasty negro"))
+
+
+def split_lines(lines, split):
+    tot = len(lines)
+    each = tot // split
+    return [lines[i:i+each] for i in range(0, tot, each)]
+
+from joblib import Parallel, delayed
+
+scorer = PerspectiveApiScorer()
+
+def get_score(line):
+    data = json.loads(line)
+    text = data['text']
+    text = text.replace("<|endoftext|>", "")
+    data['text'] = text
+    if not text.strip():
+        data['score'] = None
+        return json.dumps(data)
+
+    encoded_text = text.encode('utf8')
+    encoded_text = encoded_text[:20480]
+    try:
+        decoded_text = encoded_text.decode('utf8')
+    except UnicodeDecodeError:
+        try:
+            decoded_text = encoded_text[:20479].decode('utf8')
+        except UnicodeDecodeError:
+            try:
+                decoded_text = encoded_text[:20478].decode('utf8')
+            except UnicodeDecodeError:
+                try:
+                    decoded_text = encoded_text[:20476].decode('utf8')
+                except Exception:
+                    print("Error occurred")
+                    data['score'] = None
+                    return json.dumps(data)
+    data['score'] = scorer.get_scores(decoded_text)
+    return json.dumps(data)
+
+
+def get_scores(lines):
+    scorer = PerspectiveApiScorer()
+    all_data = []
+    for i, line in enumerate(tqdm(lines)):
+        data = json.loads(line)
+        text = data['text']
+        if not text.strip():
+            data['score'] = None
+            all_data.append(json.dumps(data))
+            continue
+        encoded_text = text.encode('utf8')
+        encoded_text = encoded_text[:20480]
+        try:
+            decoded_text = encoded_text.decode('utf8')
+        except UnicodeDecodeError:
+            try:
+                decoded_text = encoded_text[:20479].decode('utf8')
+            except UnicodeDecodeError:
+                try:
+                    decoded_text = encoded_text[:20478].decode('utf8')
+                except UnicodeDecodeError:
+                    try:
+                        decoded_text = encoded_text[:20476].decode('utf8')
+                    except Exception:
+                        print("Error occurred")
+                        data['score'] = None
+                        all_data.append(json.dumps(data))
+                        continue
+        data['score'] = scorer.get_scores(decoded_text)
+        all_data.append(json.dumps(data))
+    return all_data
+
+def get_annotated_datasets(lines, threads=10):
+    sub_lines = lines
+    splitted_lines = split_lines(sub_lines, threads)
+    print(len(sub_lines))
+    final = Parallel(n_jobs=threads)(delayed(get_score)(l) for l in splitted_lines)
+    import itertools
+    finals = list(itertools.chain.from_iterable(final))
+    return finals
+
+
+def main():
+    args = parser.parse_args()
+
+    path = args.data_path
+    out = args.out_path if args.out_path else path + '-annotated.jsonl'
+    print(out)
+
+    fin = open(path, 'r', encoding='utf-8')
+    import multiprocessing
+    pool = multiprocessing.Pool(args.workers)
+    annotated = pool.imap(get_score, fin, 25)
+    with open(out, "w") as f:
+        if args.total > 0:
+            for x in tqdm(annotated, total=args.total):
+                f.write(x + '\n')
+        else:
+            for x in tqdm(annotated):
+                f.write(x + '\n')
+
+
+if __name__ == '__main__':
+    main()
+
--- a/Megatron-LM/examples/academic_paper_scripts/detxoify_lm/annotations/preprocess.sh
+++ b/Megatron-LM/examples/academic_paper_scripts/detxoify_lm/annotations/preprocess.sh
+VOCAB_FILE=pt2-vocab.json
+MERGE_FILE=gpt2-merges.txt
+
+python3 tools/preprocess_data.py \
+    --input $1 \
+    --output-prefix $2 \
+    --vocab-file $VOCAB_FILE \
+    --merge-file $MERGE_FILE \
+    --tokenizer-type GPT2BPETokenizer \
+    --append-eod  --workers 20 --chunk-size 25
+
+
+
+
--- a/Megatron-LM/examples/academic_paper_scripts/detxoify_lm/finetune_gpt.py
+++ b/Megatron-LM/examples/academic_paper_scripts/detxoify_lm/finetune_gpt.py
+# coding=utf-8
+# Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.
+
+
+"""Fine-tune GPT"""
+
+import torch
+from functools import partial
+import os
+import sys
+sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__),
+                                             os.path.pardir, os.path.pardir)))
+from megatron.training import get_args
+from megatron.training import get_timers
+from megatron.training import get_tokenizer
+from megatron.training import print_rank_0
+from megatron.core import mpu
+from megatron.core.datasets.blended_megatron_dataset_builder import BlendedMegatronDatasetBuilder
+from megatron.core.datasets.blended_megatron_dataset_config import GPTDatasetConfig
+from megatron.core.datasets.gpt_dataset import GPTDataset
+from megatron.core.datasets.utils import get_blend_from_list
+from megatron.legacy.model import GPTModel
+from megatron.core.enums import ModelType
+from megatron.training import pretrain
+from megatron.training.utils import get_ltor_masks_and_position_ids
+from megatron.training.utils import average_losses_across_data_parallel_group
+
+def model_provider(pre_process=True, post_process=True):
+    """Build the model."""
+
+    print_rank_0('building GPT model ...')
+    model = GPTModel(
+        num_tokentypes=0,
+        parallel_output=True,
+        pre_process=pre_process,
+        post_process=post_process
+    )
+    return model
+
+
+def get_batch(data_iterator):
+    """Generate a batch"""
+    args = get_args()
+    tokenizer = get_tokenizer()
+
+    # Items and their type.
+    keys = ['text']
+    datatype = torch.int64
+
+    # Broadcast data.
+    if data_iterator is not None:
+        data = next(data_iterator)
+    else:
+        data = None
+    data_b = mpu.broadcast_data(keys, data, datatype)
+
+    # Unpack.
+    tokens_ = data_b['text'].long()
+    labels = tokens_[:, 1:].contiguous()
+    tokens = tokens_[:, :-1].contiguous()
+
+    # Get the masks and postition ids.
+    attention_mask, loss_mask, position_ids = get_ltor_masks_and_position_ids(
+        tokens,
+        tokenizer.eod,
+        args.reset_position_ids,
+        args.reset_attention_mask,
+        args.eod_mask_loss)
+
+    return tokens, labels, loss_mask, attention_mask, position_ids
+
+def loss_func(loss_mask, output_tensor):
+    losses = output_tensor.float()
+    loss_mask = loss_mask.view(-1).float()
+    loss = torch.sum(losses.view(-1) * loss_mask) / loss_mask.sum()
+
+    # Reduce loss for logging.
+    averaged_loss = average_losses_across_data_parallel_group([loss])
+
+    return loss, {'lm loss': averaged_loss[0]}
+
+
+def forward_step(data_iterator, model):
+    """Forward step."""
+    args = get_args()
+    timers = get_timers()
+
+    # Get the batch.
+    timers('batch-generator').start()
+    tokens, labels, loss_mask, attention_mask, position_ids = get_batch(
+        data_iterator)
+    timers('batch-generator').stop()
+
+    output_tensor = model(tokens, position_ids, attention_mask,
+                          labels=labels)
+
+    return output_tensor, partial(loss_func, loss_mask)
+
+
+def train_valid_test_datasets_provider(train_val_test_num_samples):
+    """Build train, valid, and test datasets."""
+    args = get_args()
+
+    print_rank_0('> building train, validation, and test datasets '
+                 'for GPT ...')
+    train_ds, _, test_ds = BlendedMegatronDatasetBuilder(
+        GPTDataset,
+        train_val_test_num_samples,
+        lambda: True,
+        GPTDatasetConfig(
+            blend=get_blend_from_list(args.data_path),
+            split=args.split,
+            random_seed=args.seed,
+            sequence_length=args.seq_length,
+            path_to_cache=args.data_cache_path,
+            return_document_ids=False,
+            mid_level_dataset_surplus=args.mid_level_dataset_surplus,
+        )
+    ).build()
+    print_rank_0("> finished creating finetuning GPT datasets ...")
+
+    _, valid_ds, _ = BlendedMegatronDatasetBuilder(
+        GPTDataset,
+        train_val_test_num_samples,
+        lambda: True,
+        GPTDatasetConfig(
+            blend=get_blend_from_list(args.data_path2),
+            split="98,2,0",
+            random_seed=1234,
+            sequence_length=2048,
+            path_to_cache=args.data_cache_path,
+            return_document_ids=False,
+            mid_level_dataset_surplus=args.mid_level_dataset_surplus,
+        )
+    ).build()
+    print_rank_0("> finished creating pretrained GPT datasets ...")
+
+    return train_ds, valid_ds, test_ds
+
+
+def add_validation_args(parser):
+    """Text generation arguments."""
+    group = parser.add_argument_group(title='validation set')
+    group.add_argument('--data-path2', nargs='*', default=None,
+                       help='Path to the validation dataset. Accepted format:'
+                       '1) a single data path, 2) multiple datasets in the'
+                       'form: dataset1-weight dataset1-path dataset2-weight '
+                       'dataset2-path ...')
+    group.add_argument('--eval-ppl', action='store_true', default=False)
+    group.add_argument('--stored_params', type=dict, default=dict())
+    return parser
+
+
+if __name__ == "__main__":
+
+    pretrain(train_valid_test_datasets_provider, model_provider,
+             ModelType.encoder_or_decoder,
+             forward_step, args_defaults={'tokenizer_type': 'GPT2BPETokenizer'},
+             extra_args_provider=add_validation_args,)