Optimizer Parameters Scheduler
==============================
This API is used to calculate the learning rate and weight decay for the optimizer.
Module contents
---------------
.. automodule:: core.optimizer_param_scheduler
   :members:
   :undoc-members:
   :show-inheritance:
pipeline\_parallel package
==========================
This package contains implementations for two different pipeline parallelism
schedules (one without interleaving and one with interleaving, see `Efficient
Large-Scale Language Model Training on GPU Clusters Using Megatron-LM <https://arxiv.org/abs/2104.04473>`_
for details), and a default no-pipelining schedule. It also contains methods
for the point-to-point communication that is needed between pipeline stages.
Submodules
----------
.. mdinclude:: pipeline_parallel_layout.md
pipeline\_parallel.p2p\_communication module
--------------------------------------------
Contains implementations of the various point-to-point communication primitives
(e.g., `recv_forward` and `recv_backward`) needed by the different pipeline
parallelism schedules.
.. automodule:: core.pipeline_parallel.p2p_communication
   :members:
   :undoc-members:
   :show-inheritance:
pipeline\_parallel.schedules module
-----------------------------------
Contains implementations for two pipeline parallelism schedules
(`forward_backward_pipelining_with_interleaving` for pipeline parallelism with
interleaving, `forward_backward_pipelining_without_interleaving` for pipeline
parallelism without interleaving) and a default no-pipelining schedule
(`forward_backward_no_pipelining`). `get_forward_backward_func` returns the right
scheduling function to use based on the configuration being trained
(e.g., if pipeline-parallel size is 1, use `forward_backward_no_pipelining`).
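A minimal usage sketch (``forward_step``, ``data_iterator``, and ``model`` are
assumed to be provided by the training loop; the argument values are illustrative):

.. code-block:: python

    from megatron.core.pipeline_parallel import get_forward_backward_func

    # Returns the schedule matching the current parallel configuration.
    forward_backward_func = get_forward_backward_func()
    losses_reduced = forward_backward_func(
        forward_step_func=forward_step,  # user-defined per-microbatch step
        data_iterator=data_iterator,
        model=model,
        num_microbatches=8,
        seq_length=2048,
        micro_batch_size=1,
        forward_only=False,
    )
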
.. automodule:: core.pipeline_parallel.schedules
   :members:
   :undoc-members:
   :show-inheritance:
Module contents
---------------
.. automodule:: core.pipeline_parallel
   :members:
   :undoc-members:
   :show-inheritance:
# Custom Pipeline Model Parallel Layout
*This is an experimental feature and may be changed.*
`--pipeline-model-parallel-layout` is a flexible API for defining the pipeline parallel partitioning, which is essential for balanced partitioning of an imbalanced model. For example, to partition DeepSeek-V3 (61 decoder layers + 1 MTP layer) with PP16VPP2, we can pass the arguments as follows:
```bash
--pipeline-model-parallel-size 16
--pipeline-model-parallel-layout "Et*3|(tt|)*29,m|L"
```
| PP \ VPP rank | 0 | 1 |
|---------------|-------------------------|---------------|
| 0 | embedding + 3 × decoder | 2 × decoder |
| 1~13 | 2 × decoder | 2 × decoder |
| 14 | 2 × decoder | mtp |
| 15 | 2 × decoder | loss |
In the layout string, stages are separated by '|'. Replicated stages or layers can be expressed with multiplication. Commas are cosmetic and may be inserted for readability. Symbol choices:
* `E` = embedding layer
* `t` = transformer decoder layer
* `m` = MTP layer
* `L` = loss calculation layer
Note that it is legal to have empty stages, e.g., `E||t|L` (the second stage is empty).
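To make the grammar concrete, below is a small illustrative parser (a sketch only, not the actual Megatron implementation; it assumes no nested parentheses) that expands a layout string into one list of layer symbols per stage:

```python
import re

def parse_pipeline_layout(layout: str) -> list[list[str]]:
    """Hypothetical sketch: expand a layout string such as "Et*3|(tt|)*29,m|L"
    into one list of layer symbols per (virtual) pipeline stage."""
    layout = layout.replace(",", "")  # commas are cosmetic
    prev = None
    while prev != layout:  # expand multiplications until none remain
        prev = layout
        # (sub|layout)*N replicates a parenthesized sub-layout N times
        layout = re.sub(r"\(([^()]*)\)\*(\d+)",
                        lambda m: m.group(1) * int(m.group(2)), layout)
        # X*N replicates a single layer symbol N times
        layout = re.sub(r"([EtmL])\*(\d+)",
                        lambda m: m.group(1) * int(m.group(2)), layout)
    return [list(stage) for stage in layout.split("|")]

stages = parse_pipeline_layout("Et*3|(tt|)*29,m|L")
assert len(stages) == 32                    # 16 PP ranks x 2 VPP ranks
assert stages[0] == ["E", "t", "t", "t"]    # embedding + 3 decoder layers
assert stages[-2:] == [["m"], ["L"]]        # MTP layer, then loss
```

In the interleaved schedule, stage `i` maps to pipeline rank `i % 16` and virtual pipeline rank `i // 16`, which reproduces the table above.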
tensor\_parallel package
========================
This package contains an implementation for tensor parallelism in transformer
models (see `Megatron-LM: Training Multi-Billion Parameter Language Models
Using Model Parallelism <https://arxiv.org/abs/1909.08053>`_ and `Reducing
Activation Recomputation in Large Transformer Models <https://arxiv.org/abs/2205.05198>`_
for details).
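As a minimal sketch (this assumes ``torch.distributed`` and Megatron's
model-parallel state have already been initialized, e.g. via
``parallel_state.initialize_model_parallel(tensor_model_parallel_size=2)``):

.. code-block:: python

    import torch
    from megatron.core import ModelParallelConfig
    from megatron.core.tensor_parallel import ColumnParallelLinear

    config = ModelParallelConfig(tensor_model_parallel_size=2)

    # Splits the output dimension across tensor-parallel ranks; with
    # gather_output=False each rank keeps only its local output shard.
    layer = ColumnParallelLinear(
        input_size=1024,
        output_size=4096,
        config=config,
        init_method=torch.nn.init.xavier_normal_,
        gather_output=False,
    )
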
Submodules
----------
tensor\_parallel.cross\_entropy module
--------------------------------------
.. automodule:: core.tensor_parallel.cross_entropy
   :members:
   :undoc-members:
   :show-inheritance:
tensor\_parallel.data module
----------------------------
.. automodule:: core.tensor_parallel.data
   :members:
   :undoc-members:
   :show-inheritance:
tensor\_parallel.layers module
------------------------------
.. automodule:: core.tensor_parallel.layers
   :members:
   :undoc-members:
   :show-inheritance:
tensor\_parallel.mappings module
--------------------------------
.. automodule:: core.tensor_parallel.mappings
   :members:
   :undoc-members:
   :show-inheritance:
tensor\_parallel.random module
------------------------------
.. automodule:: core.tensor_parallel.random
   :members:
   :undoc-members:
   :show-inheritance:
tensor\_parallel.utils module
-----------------------------
.. automodule:: core.tensor_parallel.utils
   :members:
   :undoc-members:
   :show-inheritance:
Module contents
---------------
.. automodule:: core.tensor_parallel
   :members:
   :undoc-members:
   :show-inheritance:
# New Tokenizer System
## Key Differences from the Old Tokenizer System
### 1. Hugging Face–style API
We now have a `MegatronTokenizer` class that provides a familiar, simple API similar to Hugging Face’s:
- `.from_pretrained()` – Load a tokenizer from a directory or file, automatically detecting the type and settings.
- `.write_metadata()` – Save the tokenizer configuration (metadata) so that it can be reused without re-specifying parameters.
This eliminates the need for long initialization arguments and hard-coded settings in training scripts.
### 2. Tokenizer Metadata
A metadata file (JSON) now stores all essential tokenizer configuration in one place:
- Tokenizer library (e.g., HuggingFace, SentencePiece, TikToken)
- Chat templates
- Tokenizer class
Benefits:
- You only need to set these parameters once.
- No more passing multiple CLI arguments for tokenizer settings.
- Easy sharing — just copy the tokenizer directory with its metadata file.
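As a rough illustration, the stored metadata might look like the following (only the `library` key appears in this document's examples; the other fields are illustrative assumptions — `write_metadata` below is the canonical way to produce the file):

```python
# Hypothetical contents of tokenizer_metadata.json, shown as a Python dict.
metadata = {
    "library": "sentencepiece",             # tokenizer library
    "class": "MegatronTokenizerText",       # tokenizer class (assumed field)
    "chat_template": "{{ bos_token }}...",  # chat template in Jinja format
}
```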
### 3. Library Classes Are Now Internal
In the old system, you had to know which tokenizer library to use (`SentencePieceTokenizer`, `HuggingFaceTokenizer`, etc.) and instantiate it manually.
In the new system:
- The library is automatically detected from the metadata.
- The correct tokenizer implementation is chosen under the hood.
- Users don’t need to manually manage tokenizer classes.
### 4. Support for Model-specific Tokenizer Classes
The system now supports:
- Built-in LLM-specific tokenizers.
- Custom tokenizers: you can create your own tokenizer class by inheriting from `MegatronTokenizerText` and specifying it in the `tokenizer_class` field of the metadata file.

This allows advanced customization while keeping the defaults simple for most users.
### 5. Usage
**Creating and Saving Metadata**
```python
from megatron.core.tokenizers import MegatronTokenizer
# The metadata will be stored as a file named tokenizer_metadata.json inside the tokenizer’s directory.
MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    chat_template="chat template in jinja format",
)

# To use a custom tokenizer class
from megatron.core.tokenizers.text import MegatronTokenizerText

class CustomTokenizer(MegatronTokenizerText):
    ...

MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    chat_template="chat template in jinja format",
    tokenizer_class=CustomTokenizer,
)

# To save the metadata to another dir
MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    metadata_path="/path/to/save/metadata.json",
)
```
**Restoring the Tokenizer**
```python
from megatron.core.tokenizers import MegatronTokenizer
MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer.model",
)

# If the metadata is not in the tokenizer’s dir
MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer.model",
    metadata_path="/path/to/metadata.json",
)

# Pass the metadata as a dict
MegatronTokenizer.from_pretrained(
    tokenizer_path="GPT2BPETokenizer",
    metadata_path={"library": "megatron"},
    vocab_file="/path/to/vocab.txt",
)

# Pass additional params
MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer/model.json",
    metadata_path={"library": "tiktoken"},
    pattern="v2",
    num_special_tokens=1000,
)

# Null tokenizer
MegatronTokenizer.from_pretrained(
    metadata_path={"library": "null"},
    vocab_size=131072,
)
```
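Once restored, the tokenizer can be used directly. A minimal sketch, assuming the `tokenize`/`detokenize` methods of Megatron's standard tokenizer interface:

```python
from megatron.core.tokenizers import MegatronTokenizer

tokenizer = MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer.model",
)

# Round-trip a string (method names assume Megatron's tokenizer interface).
ids = tokenizer.tokenize("Hello, Megatron!")
text = tokenizer.detokenize(ids)
```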
### 6. Megatron-LM Pretraining Compatibility
The new tokenizer system is compatible with the Megatron-LM pretraining script. If `--tokenizer-metadata` is not specified, a default metadata file is generated automatically.
```bash
# Null tokenizer
torchrun --nproc_per_node=1 pretrain_gpt.py \
    ... \
    --tokenizer-type NullTokenizer \
    --vocab-size 131072

# HuggingFace tokenizer with specified metadata
torchrun --nproc_per_node=1 pretrain_gpt.py \
    ... \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model meta-llama/Meta-Llama-3-8B \
    --tokenizer-metadata /path/to/metadata.json
```
The Megatron-LM pretraining script still supports the legacy tokenizer system. To enable it, simply add the `--legacy-tokenizer` flag.
transformer package
===================
The `transformer` package provides a customizable and configurable
implementation of the transformer model architecture. Each component
of a transformer stack, from entire layers down to individual linear
layers, can be customized by swapping in different PyTorch modules
using the "spec" parameters (see `here
<https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/nlp/nemo_megatron/mcore_customization.html>`_). The
configuration of the transformer (hidden size, number of layers,
number of attention heads, etc.) is provided via a `TransformerConfig`
object.
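A minimal sketch of constructing a configuration (the values are illustrative):

.. code-block:: python

    from megatron.core.transformer.transformer_config import TransformerConfig

    config = TransformerConfig(
        num_layers=2,
        hidden_size=128,
        num_attention_heads=4,
        use_cpu_initialization=True,
    )
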
Submodules
----------
transformer.attention module
----------------------------
This is the entire attention portion of a transformer layer, either
self- or cross-attention, including the query, key, and value
projections, a "core" attention calculation (e.g., dot product
attention), and the final output linear projection.
.. automodule:: core.transformer.attention
   :members:
   :undoc-members:
   :show-inheritance:
transformer.dot\_product\_attention module
------------------------------------------
This is a PyTorch-only implementation of dot product attention. More
efficient implementations, such as those provided by FlashAttention or
cuDNN's FusedAttention, are typically used when training speed is
important.
.. automodule:: core.transformer.dot_product_attention
   :members:
   :undoc-members:
   :show-inheritance:
transformer.enums module
------------------------
.. automodule:: core.transformer.enums
   :members:
   :undoc-members:
   :show-inheritance:
transformer.identity\_op module
-------------------------------
This provides a pass-through module that can be used in specs to
indicate that the operation should not be performed. For example, when
LayerNorm is fused with the subsequent linear layer, an IdentityOp can
be passed in as the LayerNorm module so that no separate normalization
is applied.
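A minimal sketch of the pass-through behavior:

.. code-block:: python

    import torch
    from megatron.core.transformer.identity_op import IdentityOp

    op = IdentityOp()
    x = torch.randn(2, 4)
    assert torch.equal(op(x), x)  # IdentityOp returns its input unchanged
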
.. automodule:: core.transformer.identity_op
   :members:
   :undoc-members:
   :show-inheritance:
transformer.mlp module
----------------------
This is the entire MLP portion of the transformer layer with an input
projection, non-linearity, and output projection.
.. automodule:: core.transformer.mlp
   :members:
   :undoc-members:
   :show-inheritance:
transformer.module module
-------------------------
This provides a base class for all modules used in the transformer; it
contains some shared functionality.
.. automodule:: core.transformer.module
   :members:
   :undoc-members:
   :show-inheritance:
transformer.transformer\_block module
-------------------------------------
A block, or stack, of several transformer layers. The layers can all
be the same or each can be unique.
.. automodule:: core.transformer.transformer_block
   :members:
   :undoc-members:
   :show-inheritance:
transformer.transformer\_config module
--------------------------------------
This contains all of the configuration options for the
transformer. Using a dataclass reduces code bloat by keeping all
arguments together instead of passing them individually through
multiple layers of function calls.
.. automodule:: core.transformer.transformer_config
   :members:
   :undoc-members:
   :show-inheritance:
transformer.transformer\_layer module
-------------------------------------
A single standard transformer layer including attention and MLP blocks.
.. automodule:: core.transformer.transformer_layer
   :members:
   :undoc-members:
   :show-inheritance:
transformer.utils module
------------------------
Various utilities used in the transformer implementation.
.. automodule:: core.transformer.utils
   :members:
   :undoc-members:
   :show-inheritance:
Module contents
---------------
.. automodule:: core.transformer
   :members:
   :undoc-members:
   :show-inheritance:
.. Lumache documentation master file, created by
   sphinx-quickstart on Tue Aug 15 13:44:10 2023.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.
Megatron Core User Guide
===================================
**Megatron Core** is a Python library that provides the core components required to build your own language models.
A reference implementation of Megatron Core can be found in `NeMo <https://github.com/NVIDIA/NeMo/tree/main>`_. It offers a *simple* and
*intuitive* API.
.. toctree::
   :maxdepth: 2
   :caption: User Guide

   user-guide/index
.. toctree::
   :maxdepth: 3
   :caption: API Guide

   api-guide/index
User Guide
============
.. mdinclude:: ../../../megatron/core/QuickStart.md
.. mdinclude:: ../../../megatron/core/MSC_Integration.md
# SGEAT: Detoxify Larger-scale Language Models
This is the official code base for our NeurIPS 2022 paper:
[Exploring the Limits of Domain-Adaptive Training for Detoxifying Large-Scale Language Models](https://arxiv.org/abs/2202.04173)
Boxin Wang, Wei Ping, Chaowei Xiao, Peng Xu, Mostofa Patwary, Mohammad Shoeybi, Bo Li, Anima Anandkumar, Bryan Catanzaro
## Citation
```
@article{WangExp2022,
  title={Exploring the Limits of Domain-Adaptive Training for Detoxifying Large-Scale Language Models},
  author={Wang, Boxin and Ping, Wei and Xiao, Chaowei and Xu, Peng and Patwary, Mostofa and Shoeybi, Mohammad and Li, Bo and Anandkumar, Anima and Catanzaro, Bryan},
  journal={NeurIPS},
  year={2022}
}
```
## Usage
### Prepare your environment
The project environment is based on the standard NGC PyTorch container `nvcr.io/nvidia/pytorch:21.12-py3`.
To run the Perspective API, you need to install `google-api-python-client`:
```bash
pip install --upgrade google-api-python-client
```
### Self Generation
#### SGEAT (Standard)
To perform unconditional generation with a Megatron LM, we provide an example script for a 1.3B-parameter LM.
```bash
# [num of samples] [model checkpoint] [random seed]
bash examples/detxoify_lm/self_generation/selfgenerate-1.3b-unconditional.sh 1000 checkpoints/gpt3/gpt3-1.3b/ 2333
```
This will generate a jsonl file with 1,000 generated texts (as a toy example) at `selfgeneration/unconditional_generation_gpt3-1.3b/2333.out`.
Note that you may want to set your own GPT-2 vocab and merge file paths, as well as your output data dir, in `selfgenerate-1.3b-unconditional.sh`.
### Annotation
We then use the Perspective API to annotate the self-generated corpus. Note that you need to fill in your own Perspective API key in `examples/detxoify_lm/annotations/perspective_api_annotate.py`.
```bash
python examples/detxoify_lm/annotations/perspective_api_annotate.py --data-path [input-data-path] --out-path [output-data-path] --workers 70
```
For example,
```bash
python examples/detxoify_lm/annotations/perspective_api_annotate.py --data-path selfgeneration/unconditional_generation_gpt3-1.3b/2333.out --out-path selfgeneration/unconditional_generation_gpt3-1.3b/2333.annotated.out --workers 70
```
### Filtering
We then filter the annotated self-generated corpus to keep the least toxic 50% of the corpus.
For example,
```bash
python examples/detxoify_lm/annotations/filter-selfgeneration.py --data-path selfgeneration/unconditional_generation_gpt3-1.3b/2333.annotated.out --out-path selfgeneration/unconditional_generation_gpt3-1.3b/2333.annotated.nontoxic.out
```
This will generate a jsonl file with the 500 least toxic texts (as a toy example) at `selfgeneration/unconditional_generation_gpt3-1.3b/2333.annotated.nontoxic.out`.
### Preprocess
We then preprocess the dataset so that Megatron-LM can consume the dumped dataset for fine-tuning.
```bash
bash examples/detxoify_lm/annotations/preprocess.sh selfgeneration/unconditional_generation_gpt3-1.3b/2333.annotated.nontoxic.out selfgeneration/unconditional_generation_gpt3-1.3b/2333.annotated.nontoxic
```
This will generate the following two files:
```bash
selfgeneration/unconditional_generation_gpt3-1.3b/2333.annotated.nontoxic_text_document.idx
selfgeneration/unconditional_generation_gpt3-1.3b/2333.annotated.nontoxic_text_document.bin
```
which will be used in the following domain-adaptive training step.
### Fine-tuning
We then use the preprocessed dataset as input to fine-tune our Megatron-LM.
```bash
# [fine-tuning dataset] [output-dir] [lr] [bs] [train-iters] [load checkpoint]
bash examples/detxoify_lm/finetune_gpt_distributed-1.3b.sh selfgeneration/unconditional_generation_gpt3-1.3b/2333.annotated.nontoxic_text_document gpt3-1.3b-toy-example-lr-2e-5-bs-512 2e-5 512 78 checkpoints/gpt3/gpt3-1.3b
```
This will dump the final checkpoint in `$SHARE_DATA/gpt3-1.3b-toy-example-lr-2e-5-bs-512` (`$SHARE_DATA` is your current working dir; it defaults to `$PWD`).
### Evaluation
We then use the fine-tuned checkpoint to perform conditional generation given RealToxicityPrompts:
```bash
# [input-prompts] [model-checkpoint]
bash examples/detxoify_lm/generate-1.3b.sh augmented_prompts.jsonl $SHARE_DATA/gpt3-1.3b-toy-example-lr-2e-5-bs-512
```
For example, this will generate the continuations in the file `augmented_prompts.jsonl_output_gpt3-1.3b-toy-example-lr-2e-5-bs-512_seed_31846.jsonl` (the seed is a randomly generated number).
Note that the input prompts are augmented so that each prompt appears 25 times, which lets us compute the Expected Maximum Toxicity over 25 generations and the Toxicity Probability.
We then use the Perspective API to evaluate the Expected Maximum Toxicity and Toxicity Probability.
```bash
python examples/detxoify_lm/perspective_api.py --data-path "augmented_prompts.jsonl_output_gpt3-1.3b-toy-example-lr-2e-5-bs-512_seed_31846.jsonl" --prompt-path prompts.jsonl --workers 30
```
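For reference, a rough sketch of how these two metrics are defined (the actual computation lives in `perspective_api.py`; the array below is placeholder data):

```python
import numpy as np

# scores[i, j] = Perspective toxicity of the j-th of 25 generations for prompt i.
scores = np.random.rand(100, 25)  # placeholder data

# Expected Maximum Toxicity: mean over prompts of the worst generation's toxicity.
expected_max_toxicity = scores.max(axis=1).mean()

# Toxicity Probability: fraction of prompts with at least one generation
# scoring toxicity >= 0.5.
toxicity_probability = (scores.max(axis=1) >= 0.5).mean()
```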
import argparse
import json

import numpy as np
from sklearn.utils import shuffle
from tqdm import tqdm

parser = argparse.ArgumentParser(description='Filter self-generated samples by toxicity score.')
parser.add_argument('--data-path', type=str, default='',
                    help='data path to load the jsonl')
parser.add_argument('--out-path', type=str, default='',
                    help='data path to write the filtered jsonl')
parser.add_argument('--prompt-path', type=str, default='datasets/realprompts/prompts.jsonl',
                    help='data path to load the prompt jsonl')
parser.add_argument('--workers', type=int, default=10,
                    help='Number of worker processes to launch')


def get_corpus_scores(lines):
    """Collect toxicity scores and the corresponding samples from jsonl lines."""
    scores = []
    corpus = []
    for line in tqdm(lines):
        data = json.loads(line)
        score = data['score']
        if score is not None:
            scores.append(score['toxicity'])
            corpus.append(data)
    return scores, corpus


def main():
    args = parser.parse_args()

    with open(args.data_path) as f:
        lines = f.readlines()
    print(f"total line number: {len(lines)}")

    scores, corpus = get_corpus_scores(lines)
    scores = np.array(scores)
    indices = np.argsort(scores)

    print(f"total valid samples: {len(scores)}")
    print(f"Avg toxicity (all): {np.mean(scores)} +- {np.std(scores)}")
    print(f"Avg toxicity (toxic): {np.mean(scores[scores > 0.5])} +- {np.std(scores[scores > 0.5])}")
    print(f"Toxic Percentage {sum(scores > 0.5) / len(scores)}")
    print(f"Avg toxicity (nontoxic): {np.mean(scores[scores <= 0.5])} +- {np.std(scores[scores <= 0.5])}")
    print(f"Nontoxic Percentage {sum(scores <= 0.5) / len(scores)}")

    # Keep the least toxic half of the corpus.
    samples_left = len(lines) // 2
    print(f"After filtering: {samples_left} of samples are left")
    nontoxic_indices = indices[:samples_left]
    print(f"Avg toxicity (filtered): {np.mean(scores[nontoxic_indices])} +- {np.std(scores[nontoxic_indices])}")
    print(f"Toxicity Range (filtered): {np.min(scores[nontoxic_indices])} ~ {np.max(scores[nontoxic_indices])}")
    nontoxic_data = [corpus[ind] for ind in nontoxic_indices]
    print(f"Total samples after filtering: {len(nontoxic_data)}")
    print(f"Examples: {nontoxic_data[:3]}")

    # Shuffle before writing so the output is not sorted by toxicity.
    nontoxic_data = shuffle(nontoxic_data)

    with open(args.out_path, 'w') as f:
        for x in nontoxic_data:
            f.write(json.dumps(x) + '\n')


if __name__ == "__main__":
    main()