Optimizer Parameters Scheduler
==============================
This API is used to calculate the learning rate and weight decay for the optimizer.
Module contents
---------------
.. automodule:: core.optimizer_param_scheduler
   :members:
   :undoc-members:
   :show-inheritance:
pipeline\_parallel package
==========================
This package contains implementations for two different pipeline parallelism
schedules (one without interleaving and one with interleaving, see `Efficient
Large-Scale Language Model Training on GPU Clusters Using Megatron-LM <https://arxiv.org/abs/2104.04473>`_
for details), and a default no-pipelining schedule. It also contains methods
for the point-to-point communication that is needed between pipeline stages.
Submodules
----------
.. mdinclude:: pipeline_parallel_layout.md
pipeline\_parallel.p2p\_communication module
--------------------------------------------
Contains implementations of the various point-to-point communication primitives
(e.g., `recv_forward` and `recv_backward`) needed by the different pipeline
parallelism schedules.
.. automodule:: core.pipeline_parallel.p2p_communication
   :members:
   :undoc-members:
   :show-inheritance:
pipeline\_parallel.schedules module
-----------------------------------
Contains implementations for two pipeline parallelism schedules
(`forward_backward_pipelining_with_interleaving` for pipeline parallelism with
interleaving, `forward_backward_pipelining_without_interleaving` for pipeline
parallelism without interleaving) and a default no-pipelining schedule
(`forward_backward_no_pipelining`). `get_forward_backward_func` returns the right
scheduling function to use based on the configuration being trained
(e.g., if pipeline-parallel size is 1, use `forward_backward_no_pipelining`).
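A minimal usage sketch (``forward_step``, ``data_iterator``, and ``model`` are
assumed to be provided by the training loop; the argument values are illustrative):

.. code-block:: python

    from megatron.core.pipeline_parallel import get_forward_backward_func

    # Returns the schedule matching the current parallel configuration.
    forward_backward_func = get_forward_backward_func()
    losses_reduced = forward_backward_func(
        forward_step_func=forward_step,  # user-defined per-microbatch step
        data_iterator=data_iterator,
        model=model,
        num_microbatches=8,
        seq_length=2048,
        micro_batch_size=1,
        forward_only=False,
    )
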
.. automodule:: core.pipeline_parallel.schedules
   :members:
   :undoc-members:
   :show-inheritance:
Module contents
---------------
.. automodule:: core.pipeline_parallel
   :members:
   :undoc-members:
   :show-inheritance:
# Custom Pipeline Model Parallel Layout
*This is an experimental feature and may be changed.*
`--pipeline-model-parallel-layout` is a flexible API for defining the pipeline parallel partitioning, which is essential for balanced partitioning of an imbalanced model. For example, to partition DeepSeek-V3 (61 decoder layers + 1 MTP layer) with PP16VPP2, we can pass the arguments as follows:
```bash
--pipeline-model-parallel-size 16
--pipeline-model-parallel-layout "Et*3|(tt|)*29,m|L"
```
| PP \ VPP rank | 0 | 1 |
|---------------|-------------------------|---------------|
| 0 | embedding + 3 × decoder | 2 × decoder |
| 1~13 | 2 × decoder | 2 × decoder |
| 14 | 2 × decoder | mtp |
| 15 | 2 × decoder | loss |
In the layout string, stages are separated by '|'. Replicated stages or layers can be expressed with multiplication. Commas are cosmetic and may be inserted for readability. Symbol choices:
* `E` = embedding layer
* `t` = transformer decoder layer
* `m` = MTP layer
* `L` = loss calculation layer
Note that it is legal to have empty stages, e.g., `E||t|L` (the second stage is empty).
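To make the grammar concrete, below is a small illustrative parser (a sketch only, not the actual Megatron implementation; it assumes no nested parentheses) that expands a layout string into one list of layer symbols per stage:

```python
import re

def parse_pipeline_layout(layout: str) -> list[list[str]]:
    """Hypothetical sketch: expand a layout string such as "Et*3|(tt|)*29,m|L"
    into one list of layer symbols per (virtual) pipeline stage."""
    layout = layout.replace(",", "")  # commas are cosmetic
    prev = None
    while prev != layout:  # expand multiplications until none remain
        prev = layout
        # (sub|layout)*N replicates a parenthesized sub-layout N times
        layout = re.sub(r"\(([^()]*)\)\*(\d+)",
                        lambda m: m.group(1) * int(m.group(2)), layout)
        # X*N replicates a single layer symbol N times
        layout = re.sub(r"([EtmL])\*(\d+)",
                        lambda m: m.group(1) * int(m.group(2)), layout)
    return [list(stage) for stage in layout.split("|")]

stages = parse_pipeline_layout("Et*3|(tt|)*29,m|L")
assert len(stages) == 32                    # 16 PP ranks x 2 VPP ranks
assert stages[0] == ["E", "t", "t", "t"]    # embedding + 3 decoder layers
assert stages[-2:] == [["m"], ["L"]]        # MTP layer, then loss
```

In the interleaved schedule, stage `i` maps to pipeline rank `i % 16` and virtual pipeline rank `i // 16`, which reproduces the table above.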
tensor\_parallel package
========================
This package contains an implementation for tensor parallelism in transformer
models (see `Megatron-LM: Training Multi-Billion Parameter Language Models
Using Model Parallelism <https://arxiv.org/abs/1909.08053>`_ and `Reducing
Activation Recomputation in Large Transformer Models <https://arxiv.org/abs/2205.05198>`_
for details).
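As a minimal sketch (this assumes ``torch.distributed`` and Megatron's
model-parallel state have already been initialized, e.g. via
``parallel_state.initialize_model_parallel(tensor_model_parallel_size=2)``):

.. code-block:: python

    import torch
    from megatron.core import ModelParallelConfig
    from megatron.core.tensor_parallel import ColumnParallelLinear

    config = ModelParallelConfig(tensor_model_parallel_size=2)

    # Splits the output dimension across tensor-parallel ranks; with
    # gather_output=False each rank keeps only its local output shard.
    layer = ColumnParallelLinear(
        input_size=1024,
        output_size=4096,
        config=config,
        init_method=torch.nn.init.xavier_normal_,
        gather_output=False,
    )
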
Submodules
----------
tensor\_parallel.cross\_entropy module
--------------------------------------
.. automodule:: core.tensor_parallel.cross_entropy
   :members:
   :undoc-members:
   :show-inheritance:
tensor\_parallel.data module
----------------------------
.. automodule:: core.tensor_parallel.data
   :members:
   :undoc-members:
   :show-inheritance:
tensor\_parallel.layers module
------------------------------
.. automodule:: core.tensor_parallel.layers
   :members:
   :undoc-members:
   :show-inheritance:
tensor\_parallel.mappings module
--------------------------------
.. automodule:: core.tensor_parallel.mappings
   :members:
   :undoc-members:
   :show-inheritance:
tensor\_parallel.random module
------------------------------
.. automodule:: core.tensor_parallel.random
   :members:
   :undoc-members:
   :show-inheritance:
tensor\_parallel.utils module
-----------------------------
.. automodule:: core.tensor_parallel.utils
   :members:
   :undoc-members:
   :show-inheritance:
Module contents
---------------
.. automodule:: core.tensor_parallel
   :members:
   :undoc-members:
   :show-inheritance:
# New Tokenizer System
## Key Differences from the Old Tokenizer System
### 1. Hugging Face–style API
We now have a `MegatronTokenizer` class that provides a familiar, simple API similar to Hugging Face’s:
- `.from_pretrained()` – Load a tokenizer from a directory or file, automatically detecting the type and settings.
- `.write_metadata()` – Save the tokenizer configuration (metadata) so that it can be reused without re-specifying parameters.
This eliminates the need for long initialization arguments and hard-coded settings in training scripts.
### 2. Tokenizer Metadata
A metadata file (JSON) now stores all essential tokenizer configuration in one place:
- Tokenizer library (e.g., HuggingFace, SentencePiece, TikToken)
- Chat templates
- Tokenizer class
Benefits:
- You only need to set these parameters once.
- No more passing multiple CLI arguments for tokenizer settings.
- Easy sharing — just copy the tokenizer directory with its metadata file.
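As a rough illustration, the stored metadata might look like the following (only the `library` key appears in this document's examples; the other fields are illustrative assumptions — `write_metadata` below is the canonical way to produce the file):

```python
# Hypothetical contents of tokenizer_metadata.json, shown as a Python dict.
metadata = {
    "library": "sentencepiece",             # tokenizer library
    "class": "MegatronTokenizerText",       # tokenizer class (assumed field)
    "chat_template": "{{ bos_token }}...",  # chat template in Jinja format
}
```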
### 3. Library Classes Are Now Internal
In the old system, you had to know which tokenizer library to use (`SentencePieceTokenizer`, `HuggingFaceTokenizer`, etc.) and instantiate it manually.
In the new system:
- The library is automatically detected from the metadata.
- The correct tokenizer implementation is chosen under the hood.
- Users don’t need to manually manage tokenizer classes.
### 4. Support for Model-specific Tokenizer Classes
The system now supports:
- Built-in LLM-specific tokenizers.
- Custom tokenizers: you can create your own tokenizer class by inheriting from `MegatronTokenizerText` and specifying it in the `tokenizer_class` field of the metadata file.

This allows advanced customization while keeping the defaults simple for most users.
### 5. Usage
**Creating and Saving Metadata**
```python
from megatron.core.tokenizers import MegatronTokenizer
# The metadata will be stored as a file named tokenizer_metadata.json inside the tokenizer’s directory.
MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    chat_template="chat template in jinja format",
)

# To use a custom tokenizer class
from megatron.core.tokenizers.text import MegatronTokenizerText

class CustomTokenizer(MegatronTokenizerText):
    ...

MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    chat_template="chat template in jinja format",
    tokenizer_class=CustomTokenizer,
)

# To save the metadata to another dir
MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    metadata_path="/path/to/save/metadata.json",
)
```
**Restoring the Tokenizer**
```python
from megatron.core.tokenizers import MegatronTokenizer
MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer.model",
)

# If the metadata is not in the tokenizer’s dir
MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer.model",
    metadata_path="/path/to/metadata.json",
)

# Pass the metadata as a dict
MegatronTokenizer.from_pretrained(
    tokenizer_path="GPT2BPETokenizer",
    metadata_path={"library": "megatron"},
    vocab_file="/path/to/vocab.txt",
)

# Pass additional params
MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer/model.json",
    metadata_path={"library": "tiktoken"},
    pattern="v2",
    num_special_tokens=1000,
)

# Null tokenizer
MegatronTokenizer.from_pretrained(
    metadata_path={"library": "null"},
    vocab_size=131072,
)
```
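Once restored, the tokenizer can be used directly. A minimal sketch, assuming the `tokenize`/`detokenize` methods of Megatron's standard tokenizer interface:

```python
from megatron.core.tokenizers import MegatronTokenizer

tokenizer = MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer.model",
)

# Round-trip a string (method names assume Megatron's tokenizer interface).
ids = tokenizer.tokenize("Hello, Megatron!")
text = tokenizer.detokenize(ids)
```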
### 6. Megatron-LM Pretraining Compatibility
The new tokenizer system is compatible with the Megatron-LM pretraining script. If `--tokenizer-metadata` is not specified, a default metadata file is generated automatically.
```bash
# Null tokenizer
torchrun --nproc_per_node=1 pretrain_gpt.py \
    ... \
    --tokenizer-type NullTokenizer \
    --vocab-size 131072

# HuggingFace tokenizer with specified metadata
torchrun --nproc_per_node=1 pretrain_gpt.py \
    ... \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model meta-llama/Meta-Llama-3-8B \
    --tokenizer-metadata /path/to/metadata.json
```
The Megatron-LM pretraining script still supports the legacy tokenizer system. To enable it, simply add the `--legacy-tokenizer` flag.
transformer package
===================
The `transformer` package provides a customizable and configurable
implementation of the transformer model architecture. Each component
of a transformer stack, from entire layers down to individual linear
layers, can be customized by swapping in different PyTorch modules
using the "spec" parameters (see `here
<https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/nlp/nemo_megatron/mcore_customization.html>`_). The
configuration of the transformer (hidden size, number of layers,
number of attention heads, etc.) is provided via a `TransformerConfig`
object.
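A minimal sketch of constructing a configuration (the values are illustrative):

.. code-block:: python

    from megatron.core.transformer.transformer_config import TransformerConfig

    config = TransformerConfig(
        num_layers=2,
        hidden_size=128,
        num_attention_heads=4,
        use_cpu_initialization=True,
    )
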
Submodules
----------
transformer.attention module
----------------------------
This is the entire attention portion of a transformer layer, either
self- or cross-attention, including the query, key, and value
projections, a "core" attention calculation (e.g., dot product
attention), and the final output linear projection.
.. automodule:: core.transformer.attention
   :members:
   :undoc-members:
   :show-inheritance:
transformer.dot\_product\_attention module
------------------------------------------
This is a PyTorch-only implementation of dot product attention. More
efficient implementations, such as those provided by FlashAttention or
cuDNN's FusedAttention, are typically used when training speed is
important.
.. automodule:: core.transformer.dot_product_attention
   :members:
   :undoc-members:
   :show-inheritance:
transformer.enums module
------------------------
.. automodule:: core.transformer.enums
   :members:
   :undoc-members:
   :show-inheritance:
transformer.identity\_op module
-------------------------------
This provides a pass-through module that can be used in specs to
indicate that the operation should not be performed. For example, when
LayerNorm is fused with the subsequent linear layer, an IdentityOp can
be passed in as the LayerNorm module so that no separate normalization
is applied.
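A minimal sketch of the pass-through behavior:

.. code-block:: python

    import torch
    from megatron.core.transformer.identity_op import IdentityOp

    op = IdentityOp()
    x = torch.randn(2, 4)
    assert torch.equal(op(x), x)  # IdentityOp returns its input unchanged
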
.. automodule:: core.transformer.identity_op
   :members:
   :undoc-members:
   :show-inheritance:
transformer.mlp module
----------------------
This is the entire MLP portion of the transformer layer with an input
projection, non-linearity, and output projection.
.. automodule:: core.transformer.mlp
   :members:
   :undoc-members:
   :show-inheritance:
transformer.module module
-------------------------
This provides a base class for all modules used in the transformer; it
contains some shared functionality.
.. automodule:: core.transformer.module
   :members:
   :undoc-members:
   :show-inheritance:
transformer.transformer\_block module
-------------------------------------
A block, or stack, of several transformer layers. The layers can all
be the same or each can be unique.
.. automodule:: core.transformer.transformer_block
   :members:
   :undoc-members:
   :show-inheritance:
transformer.transformer\_config module
--------------------------------------
This contains all of the configuration options for the
transformer. Using a dataclass reduces code bloat by keeping all
arguments together instead of passing them individually through
multiple layers of function calls.
.. automodule:: core.transformer.transformer_config
   :members:
   :undoc-members:
   :show-inheritance:
transformer.transformer\_layer module
-------------------------------------
A single standard transformer layer including attention and MLP blocks.
.. automodule:: core.transformer.transformer_layer
   :members:
   :undoc-members:
   :show-inheritance:
transformer.utils module
------------------------
Various utilities used in the transformer implementation.
.. automodule:: core.transformer.utils
   :members:
   :undoc-members:
   :show-inheritance:
Module contents
---------------
.. automodule:: core.transformer
   :members:
   :undoc-members:
   :show-inheritance:
.. Lumache documentation master file, created by
   sphinx-quickstart on Tue Aug 15 13:44:10 2023.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.
Megatron Core User Guide
===================================
**Megatron Core** is a Python library that provides the core components required to build your own language models.
A reference implementation of Megatron Core can be found in `NeMo <https://github.com/NVIDIA/NeMo/tree/main>`_. It offers a *simple* and
*intuitive* API.
.. toctree::
   :maxdepth: 2
   :caption: User Guide

   user-guide/index
.. toctree::
   :maxdepth: 3
   :caption: API Guide

   api-guide/index
User Guide
============
.. mdinclude:: ../../../megatron/core/QuickStart.md
.. mdinclude:: ../../../megatron/core/MSC_Integration.md
# SGEAT: Detoxify Larger-scale Language Models
This is the official code base for our NeurIPS 2022 paper:
[Exploring the Limits of Domain-Adaptive Training for Detoxifying Large-Scale Language Models](https://arxiv.org/abs/2202.04173)
Boxin Wang, Wei Ping, Chaowei Xiao, Peng Xu, Mostofa Patwary, Mohammad Shoeybi, Bo Li, Anima Anandkumar, Bryan Catanzaro
## Citation
```
@article{WangExp2022,
  title={Exploring the Limits of Domain-Adaptive Training for Detoxifying Large-Scale Language Models},
  author={Wang, Boxin and Ping, Wei and Xiao, Chaowei and Xu, Peng and Patwary, Mostofa and Shoeybi, Mohammad and Li, Bo and Anandkumar, Anima and Catanzaro, Bryan},
  journal={NeurIPS},
  year={2022}
}
```
## Usage
### Prepare your environment
The project environment is based on the standard NGC PyTorch container `nvcr.io/nvidia/pytorch:21.12-py3`.
To run the Perspective API, you need to install `google-api-python-client`:
```bash
pip install --upgrade google-api-python-client
```
### Self Generation
#### SGEAT (Standard)
To perform unconditional generation with a Megatron LM, we provide an example script for a 1.3B-parameter LM.
```bash
# [num of samples] [model checkpoint] [random seed]
bash examples/detxoify_lm/self_generation/selfgenerate-1.3b-unconditional.sh 1000 checkpoints/gpt3/gpt3-1.3b/ 2333
```
This will generate a jsonl file with 1,000 generated texts (as a toy example) at `selfgeneration/unconditional_generation_gpt3-1.3b/2333.out`.
Note that you may want to set your own GPT-2 vocab and merge file paths, as well as your output data dir, in `selfgenerate-1.3b-unconditional.sh`.
### Annotation
We then use the Perspective API to annotate the self-generated corpus. Note that you need to fill in your own Perspective API key in `examples/detxoify_lm/annotations/perspective_api_annotate.py`.
```bash
python examples/detxoify_lm/annotations/perspective_api_annotate.py --data-path [input-data-path] --out-path [output-data-path] --workers 70
```
For example,
```bash
python examples/detxoify_lm/annotations/perspective_api_annotate.py --data-path selfgeneration/unconditional_generation_gpt3-1.3b/2333.out --out-path selfgeneration/unconditional_generation_gpt3-1.3b/2333.annotated.out --workers 70
```
### Filtering
We then filter the annotated self-generated corpus to keep the least toxic 50% of the corpus.
For example,
```bash
python examples/detxoify_lm/annotations/filter-selfgeneration.py --data-path selfgeneration/unconditional_generation_gpt3-1.3b/2333.annotated.out --out-path selfgeneration/unconditional_generation_gpt3-1.3b/2333.annotated.nontoxic.out
```
This will generate a jsonl file with the 500 least toxic texts (as a toy example) at `selfgeneration/unconditional_generation_gpt3-1.3b/2333.annotated.nontoxic.out`.
### Preprocess
We then preprocess the dataset so that Megatron-LM can consume the dumped dataset for fine-tuning.
```bash
bash examples/detxoify_lm/annotations/preprocess.sh selfgeneration/unconditional_generation_gpt3-1.3b/2333.annotated.nontoxic.out selfgeneration/unconditional_generation_gpt3-1.3b/2333.annotated.nontoxic
```
This will generate the following two files:
```bash
selfgeneration/unconditional_generation_gpt3-1.3b/2333.annotated.nontoxic_text_document.idx
selfgeneration/unconditional_generation_gpt3-1.3b/2333.annotated.nontoxic_text_document.bin
```
which will be used in the following domain-adaptive training step.
### Fine-tuning
We then use the preprocessed dataset as input to fine-tune our Megatron-LM.
```bash
# [fine-tuning dataset] [output-dir] [lr] [bs] [train-iters] [load checkpoint]
bash examples/detxoify_lm/finetune_gpt_distributed-1.3b.sh selfgeneration/unconditional_generation_gpt3-1.3b/2333.annotated.nontoxic_text_document gpt3-1.3b-toy-example-lr-2e-5-bs-512 2e-5 512 78 checkpoints/gpt3/gpt3-1.3b
```
This will dump the final checkpoint in `$SHARE_DATA/gpt3-1.3b-toy-example-lr-2e-5-bs-512` (`$SHARE_DATA` is your current working dir; it defaults to `$PWD`).
### Evaluation
We then use the fine-tuned checkpoint to perform conditional generation given RealToxicityPrompts:
```bash
# [input-prompts] [model-checkpoint]
bash examples/detxoify_lm/generate-1.3b.sh augmented_prompts.jsonl $SHARE_DATA/gpt3-1.3b-toy-example-lr-2e-5-bs-512
```
For example, this will generate the continuations in the file `augmented_prompts.jsonl_output_gpt3-1.3b-toy-example-lr-2e-5-bs-512_seed_31846.jsonl` (the seed is a randomly generated number).
Note that the input prompts are augmented so that each prompt appears 25 times, which lets us compute the Expected Maximum Toxicity over 25 generations and the Toxicity Probability.
We then use the Perspective API to evaluate the Expected Maximum Toxicity and Toxicity Probability.
```bash
python examples/detxoify_lm/perspective_api.py --data-path "augmented_prompts.jsonl_output_gpt3-1.3b-toy-example-lr-2e-5-bs-512_seed_31846.jsonl" --prompt-path prompts.jsonl --workers 30
```
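For reference, a rough sketch of how these two metrics are defined (the actual computation lives in `perspective_api.py`; the array below is placeholder data):

```python
import numpy as np

# scores[i, j] = Perspective toxicity of the j-th of 25 generations for prompt i.
scores = np.random.rand(100, 25)  # placeholder data

# Expected Maximum Toxicity: mean over prompts of the worst generation's toxicity.
expected_max_toxicity = scores.max(axis=1).mean()

# Toxicity Probability: fraction of prompts with at least one generation
# scoring toxicity >= 0.5.
toxicity_probability = (scores.max(axis=1) >= 0.5).mean()
```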
import argparse
import json

import numpy as np
from sklearn.utils import shuffle
from tqdm import tqdm

parser = argparse.ArgumentParser(description='Filter self-generated samples by toxicity score.')
parser.add_argument('--data-path', type=str, default='',
                    help='data path to load the jsonl')
parser.add_argument('--out-path', type=str, default='',
                    help='data path to write the filtered jsonl')
parser.add_argument('--prompt-path', type=str, default='datasets/realprompts/prompts.jsonl',
                    help='data path to load the prompt jsonl')
parser.add_argument('--workers', type=int, default=10,
                    help='Number of worker processes to launch')


def get_corpus_scores(lines):
    """Collect toxicity scores and the corresponding samples from jsonl lines."""
    scores = []
    corpus = []
    for line in tqdm(lines):
        data = json.loads(line)
        score = data['score']
        if score is not None:
            scores.append(score['toxicity'])
            corpus.append(data)
    return scores, corpus


def main():
    args = parser.parse_args()

    with open(args.data_path) as f:
        lines = f.readlines()
    print(f"total line number: {len(lines)}")

    scores, corpus = get_corpus_scores(lines)
    scores = np.array(scores)
    indices = np.argsort(scores)

    print(f"total valid samples: {len(scores)}")
    print(f"Avg toxicity (all): {np.mean(scores)} +- {np.std(scores)}")
    print(f"Avg toxicity (toxic): {np.mean(scores[scores > 0.5])} +- {np.std(scores[scores > 0.5])}")
    print(f"Toxic Percentage {sum(scores > 0.5) / len(scores)}")
    print(f"Avg toxicity (nontoxic): {np.mean(scores[scores <= 0.5])} +- {np.std(scores[scores <= 0.5])}")
    print(f"Nontoxic Percentage {sum(scores <= 0.5) / len(scores)}")

    # Keep the least toxic half of the corpus.
    samples_left = len(lines) // 2
    print(f"After filtering: {samples_left} of samples are left")
    nontoxic_indices = indices[:samples_left]
    print(f"Avg toxicity (filtered): {np.mean(scores[nontoxic_indices])} +- {np.std(scores[nontoxic_indices])}")
    print(f"Toxicity Range (filtered): {np.min(scores[nontoxic_indices])} ~ {np.max(scores[nontoxic_indices])}")
    nontoxic_data = [corpus[ind] for ind in nontoxic_indices]
    print(f"Total samples after filtering: {len(nontoxic_data)}")
    print(f"Examples: {nontoxic_data[:3]}")

    # Shuffle before writing so the output is not sorted by toxicity.
    nontoxic_data = shuffle(nontoxic_data)

    with open(args.out_path, 'w') as f:
        for x in nontoxic_data:
            f.write(json.dumps(x) + '\n')


if __name__ == "__main__":
    main()