Commit 71e79847 authored by chenzk

v1.0.3
torchrun --nproc-per-node 1 examples/llama/convert_nanotron_to_hf.py --checkpoint_path checkpoints/10 --save_path hf/hf-Llama-3.1-8B --tokenizer_name Meta-Llama-3.1-8B
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.3.0-py3.10-dtk24.04.3-ubuntu20.04
ENV DEBIAN_FRONTEND=noninteractive
# RUN yum update && yum install -y git cmake wget build-essential
# RUN source /opt/dtk-24.04.3/env.sh
# # Install pip dependencies
COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
torch>=1.13.1
pyyaml
numpy
packaging
safetensors
tqdm
datasets
flash-attn
setuptools
dacite==1.8.1
fsspec==2024.9.0
numba
datatrove[all] # 0.3.0
transformers==4.46.3
tokenizers==0.20.3
docker run -it --shm-size=64G -v $PWD/nanotron:/home/nanotron -v /public/DL_DATA/AI:/home/AI -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name llama b272aae8ec72 bash
# python -m torch.utils.collect_env
# Debugging FAQ:
When debugging you may run into errors of the sort:
![](image.png)
Don't get overwhelmed by the amount of information in the error message. The final error message is not very informative, so we have to scroll up a bit to find the actual error. In this case, it is:
![](image-2.png)
which is a `ValueError` saying that the model on PP rank 1 has no gradient. This is a common error when using `torch.nn.parallel.DistributedDataParallel` (DDP): it means that the part of the model on pipeline parallelism (PP) rank 1 has no parameters that received gradients. This can happen when the model is too small for the pipeline split, so no gradient is computed on that rank. The solution is to increase the number of layers of the model or use a smaller PP size. In this case, the model is `LlamaForTraining` and the part on PP rank 1 is `LlamaModel`.
We could also decrease the model's vocab size from 50277 to 256 to allow a better partitioning of the pipeline blocks. This helps avoid the error message above.
# Doc on collective operations
This NVIDIA doc gives a good overview of all collective operations (`all_reduce`, `reduce_scatter`, etc.): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html
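As a quick companion to that doc, here is a minimal sketch of the two collectives we use the most, assuming it runs under `torchrun --nproc-per-node N` on GPUs with a recent PyTorch:
```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank, world_size = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank)

# all_reduce: every rank ends up with the element-wise sum of all contributions
t = torch.full((world_size,), float(rank), device="cuda")
dist.all_reduce(t, op=dist.ReduceOp.SUM)

# reduce_scatter: each rank keeps only its own summed chunk of the input
out = torch.empty(1, device="cuda")
dist.reduce_scatter_tensor(out, t)
```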
# Usage
We showcase usage in the `examples` directory.
# Key concepts
Let's go through some key concepts.
## ParallelContext
`ParallelContext` is the base class referencing all the process groups you might need when running parallel workloads. You can initialize it using the following:
```python
from nanotron.parallel import ParallelContext

# define your topology
parallel_context = ParallelContext(
    tensor_parallel_size=2,
    data_parallel_size=2,
    pipeline_parallel_size=2,
)
```
A `ProcessGroup` is a mechanism for running distributed collectives (`all-reduce`, `all-gather`, ...) on a subgroup of all the ranks. It provides the granularity needed for 3D parallelism.
From this class you can access multiple process groups:
- `dp_pg`/`tp_pg`/`pp_pg`: Your typical process groups linked to 3D parallelism.
- `world_pg`: A ProcessGroup including all the processes.
- `world_rank_matrix`: Lets you compute the world rank knowing the 3D ranks of a given process, or do the inverse via `get_local_ranks`.
- `world_ranks_to_pg`: A more generic pattern that lets you store custom sets of ProcessGroups and query them via a list of world ranks.
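For illustration, here is a hedged sketch of how these attributes fit together; the `(pp, dp, tp)` axis order of `world_rank_matrix` is an assumption here, and the snippet presumes the `ParallelContext` above was created under a `torchrun` launch with 8 GPUs:
```python
import torch
import torch.distributed as dist

# Map 3D ranks <-> world rank (axis order assumed to be pp, dp, tp)
world_rank = parallel_context.world_rank_matrix[1, 0, 1]
local_ranks = parallel_context.get_local_ranks(world_rank=world_rank)

# Run a collective on one subgroup only, e.g. average a tensor across data parallel ranks
grad = torch.ones(4, device="cuda")
dist.all_reduce(grad, op=dist.ReduceOp.AVG, group=parallel_context.dp_pg)
```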
## NanotronParameter
Given a specific computation workload, we can freely choose how to distribute it. For example:
```python
import torch
import torch.distributed
from torch import nn

# Example: let's assume you want to run a Linear without bias
batch_size, hidden_size = 4, 8

# Single-process way of running the computation
module = nn.Linear(hidden_size, hidden_size, bias=False)  # Parameters: [H, H]
input = torch.randn(batch_size, hidden_size)
output = module(input)

# Sharded ways of running the computation across `tp_pg` (a `ProcessGroup`)
# Version 1: shard the output dimension, then all-gather the partial outputs
sharded_module = nn.Linear(hidden_size, hidden_size // tp_pg.size(), bias=False)
input = torch.randn(batch_size, hidden_size)
sharded_output = sharded_module(input)
output_chunks = [torch.empty_like(sharded_output) for _ in range(tp_pg.size())]
torch.distributed.all_gather(output_chunks, sharded_output, group=tp_pg)
output = torch.cat(output_chunks, dim=-1)

# Version 2: shard the input dimension, then all-reduce the partial sums
sharded_module = nn.Linear(hidden_size // tp_pg.size(), hidden_size, bias=False)
sharded_input = torch.randn(batch_size, hidden_size // tp_pg.size())
output = sharded_module(sharded_input)  # Partial result
torch.distributed.all_reduce(output, group=tp_pg)

# Version 3: shard the batch, all-gather the inputs, and duplicate the workload
sharded_module = nn.Linear(hidden_size, hidden_size, bias=False)
sharded_input = torch.randn(batch_size // tp_pg.size(), hidden_size)
input_chunks = [torch.empty_like(sharded_input) for _ in range(tp_pg.size())]
torch.distributed.all_gather(input_chunks, sharded_input, group=tp_pg)
input = torch.cat(input_chunks, dim=0)
output = sharded_module(input)  # Duplicated workload

# Version ...
```
Distributed workloads tend to trade duplicated computation against extra communication. There are multiple ways to run the same computation; what we can optimize is the amount of communication we do, as well as the amount of duplicated work. Sometimes it's worth duplicating work in order to reduce communication significantly.
As seen in the previous example, sometimes parameters are sharded across multiple devices, and sometimes they are duplicated. In `nanotron`, we decided to attach this additional metadata to `nn.Parameter`. We call our new data structure `NanotronParameter`.
## Sharded parameter
A sharded parameter has the following metadata attached:
```python
import dataclasses
from typing import Tuple

@dataclasses.dataclass
class SlicesPair:
    local_slices: Tuple[slice, ...]
    global_slices: Tuple[slice, ...]

@dataclasses.dataclass
class ShardedInfo:
    # All world ranks involved in the sharding.
    global_ranks: Tuple[int, ...]
    # Maps which slice of the unsharded tensor (global_slices) the current sharded tensor (local_slices) corresponds to.
    local_global_slices_pairs: Tuple[SlicesPair, ...]
    # The shape of the unsharded tensor.
    unsharded_shape: Tuple[int, ...]
```
Imagine we shard a tensor `t` of shape `[8, 64]` across two ranks, 0 and 3, where rank 0 holds the first shard `t[:, :32]` and rank 3 holds the second shard `t[:, 32:]`. The `ShardedInfo` for each rank is then:
```python
# world rank 0
shard_info = ShardedInfo(
    global_ranks=(0, 3),
    local_global_slices_pairs=(
        SlicesPair(local_slices=(slice(0, 8), slice(0, 32)), global_slices=(slice(0, 8), slice(0, 32))),
    ),
    unsharded_shape=(8, 64),
)
# world rank 3
shard_info = ShardedInfo(
    global_ranks=(0, 3),
    local_global_slices_pairs=(
        SlicesPair(local_slices=(slice(0, 8), slice(0, 32)), global_slices=(slice(0, 8), slice(32, 64))),
    ),
    unsharded_shape=(8, 64),
)
```
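As a minimal sketch of what this metadata enables, once every rank's `(shard, ShardedInfo)` pair has been collected on a single process (the gathering itself is omitted here), the unsharded tensor can be rebuilt by copying each local slice into its global slice:
```python
import torch

def unshard(shards_with_info):
    # `shards_with_info` is a list of (shard tensor, ShardedInfo) pairs, one per rank.
    _, first_info = shards_with_info[0]
    full = torch.empty(first_info.unsharded_shape)
    for shard, info in shards_with_info:
        for pair in info.local_global_slices_pairs:
            full[pair.global_slices] = shard[pair.local_slices]
    return full
```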
## Tied parameter
This signifies that multiple occurrences of a given parameter are duplicated across multiple devices, so we need a mechanism to keep them in sync at all times. A typical example is the `lm_head` on top of transformers, which is tied to the word embedding parameters. We attach the following metadata to the parameter:
```python
import dataclasses
from typing import Optional, Tuple

import torch.distributed as dist
from torch import nn

@dataclasses.dataclass
class TiedInfo:
    # We usually arbitrarily choose a name for the parameter, e.g. `lm_head.weight` or `wte.weight`.
    name: str
    # This defines the scope in which `name` is valid.
    root_module: nn.Module
    # All world ranks involved in the tying.
    global_ranks: Tuple[int, ...]
    # In order to keep the parameter synced, `reduce_op` defines what kind of reduce operation we apply to the gradient.
    # None signifies that we do not reduce.
    reduce_op: Optional[dist.ReduceOp]
```
The most interesting field in this dataclass is `reduce_op`. Sometimes a duplicated workload removes the need to sync gradients, because by design the gradient computation already produces the correct gradient. A typical example is the classic TP implementation using `all-reduce`/`identity`.
Note: a parameter can be both sharded and tied; the two notions just have to involve different ranks. For example, `lm_head` and the word embeddings can be sharded across TP and tied between the first and last PP ranks.
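For intuition, here is a hedged sketch of the sync step this metadata drives, assuming `tied_pg` is the `ProcessGroup` built from `TiedInfo.global_ranks` and that `param`/`tied_info` come from the surrounding training loop:
```python
import torch.distributed as dist

# After backward: reduce the gradient of a tied parameter across the ranks holding a copy.
if tied_info.reduce_op is not None and param.grad is not None:
    dist.all_reduce(param.grad, op=tied_info.reduce_op, group=tied_pg)
```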
## Tensor parallelism
Tensor parallelism is usually the go-to solution when a model can't fit within a single device. The basic idea is to find patterns where a single workload can be divided into multiple smaller workloads that run in parallel. We mimic the tensor parallelism of Megatron-LM. Currently supported modules:
- ColumnLinear/RowLinear
- ParallelVocabulary
- Cross-Entropy over sharded logits
- Distributed samplers for generation
[Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) introduced this notion when implementing one of the first large-scale transformers:
![Tensor parallelism in transformer model](assets/tensor_parallel_in_transformer.png)
(Source: [link](https://arxiv.org/abs/1909.08053))
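To make the column/row split concrete, here is a minimal sketch of a Megatron-style MLP written with plain tensor ops; nanotron's `ColumnLinear`/`RowLinear` wrap the same idea, and the function below is illustrative rather than nanotron's actual API:
```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def tp_mlp_forward(x, w1_shard, w2_shard, tp_pg):
    # Column parallel: w1 is sharded along its output dim, so each rank produces
    # a disjoint slice of the hidden activation -- no communication needed here.
    h = F.gelu(x @ w1_shard.t())  # [B, 4H / tp]
    # Row parallel: w2 is sharded along its input dim, so each rank produces a
    # partial sum of the output; a single all-reduce completes the whole MLP.
    y = h @ w2_shard.t()  # [B, H], partial result
    dist.all_reduce(y, group=tp_pg)
    return y
```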
## Pipeline parallelism
We can view a neural network as a sequence of operations. Instead of splitting operations into smaller workloads that we distribute, as above, we take contiguous chunks of that sequence and assign them to specific ranks. These chunks are inherently sequential, so in order to run them in parallel we introduce schedulers that process different batches in parallel: rank 0 can be processing batch 1 while rank 1 is processing batch 0.
- Rank 0 starts processing batch 0
- Rank 0 finishes processing batch 0
- Rank 0 sends its outputs to rank 1
- Rank 1 starts processing batch 0
- Rank 0 starts processing batch 1 (ranks 0 and 1 are processing batches 1 and 0 in parallel)
- Rank 1 finishes processing batch 0
- Rank 0 finishes processing batch 1
### PipelineBlock
The core component of our pipeline engine is a `PipelineBlock`.
It acts as the unit of granularity for all our pipeline engines: it lets us define a specific workload that needs to happen on a specific device, i.e. rank.
Other ranks run a dummy `forward` that returns a `TensorPointer`, which holds enough metadata to know where the output of the computation lives.
```python
@dataclass
class TensorPointer:
    group_rank: int
```
Modules defined within a `PipelineBlock` are instantiated directly on their assigned device.
In short, what a `PipelineBlock` does is:
- Receive a set of `torch.Tensor`/`TensorPointer` inputs
  - In the case of a `TensorPointer`, query the tensor from the rank extracted from its state/context
- Run the defined computation if the current rank is responsible for it
- Return a dictionary `Dict[str, Union[torch.Tensor, TensorPointer]]`
  - `TensorPointer` outputs are for ranks that didn't run the computation and need to know where the output lives
```python
class PipelineBlock(nn.Module):
    def __init__(
        self,
        p2p,  # point-to-point communication class
        module_builder,  # module constructor, used to build the module lazily
        module_kwargs,  # module constructor arguments, used to build the module lazily
        module_input_keys,  # lets ranks that are not running the compute know the module input structure; also serves as a validation mechanism
        module_output_keys,  # metadata letting ranks that are not running the compute know the module output structure
    ):
        pass

# Example
# Lazy instantiation of a `nn.Linear`
model = PipelineBlock(
    p2p=p2p,
    module_builder=nn.Linear,
    module_kwargs={"in_features": 3, "out_features": 5},
    module_input_keys={"input"},
    module_output_keys={"output"},
)
model.build_and_set_rank(pp_rank)  # Instantiate model parameters on the device assigned to `pp_rank`
```
To define which rank runs a block, we use the `build_and_set_rank` method. It attaches the rank as metadata and builds the module on that specific rank.
Models have to be defined using a "surface" of `PipelineBlock`s. Typically, above the `PipelineBlock` level everything is about defining the `PipelineBlock` directed acyclic computation graph, while below it is where device-specific computation is defined.
As a non-trivial example:
```python
from typing import Union

import torch
from torch import nn

class DummyModel(nn.Module):
    def __init__(
        self,
        p2p: P2P,
    ):
        super().__init__()
        self.dense1 = PipelineBlock(
            p2p=p2p,
            module_builder=nn.Linear,
            module_kwargs={"in_features": 10, "out_features": 10},
            module_input_keys={"input"},
            module_output_keys={"output"},
        )
        self.dense2 = PipelineBlock(
            p2p=p2p,
            module_builder=nn.Linear,
            module_kwargs={"in_features": 10, "out_features": 10},
            module_input_keys={"input"},
            module_output_keys={"output"},
        )
        # Doesn't hold any parameters, but we still have to specify where the computation happens.
        self.loss = PipelineBlock(
            p2p=p2p,
            module_builder=lambda: lambda x: x.sum(),
            module_kwargs={},
            module_input_keys={"x"},
            module_output_keys={"output"},
        )

    def forward(self, x: Union[torch.Tensor, TensorPointer]):
        # `x` can be a `torch.Tensor` or a `TensorPointer` depending on the current rank
        # and on where the pipeline blocks run their compute.
        x = self.dense1(input=x)["output"]
        x = self.dense2(input=x)["output"]
        x = self.loss(x=x)["output"]
        return x
```
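A hedged sketch of how such a model could then be split across pipeline ranks; the contiguous assignment policy below is illustrative, not necessarily the one nanotron uses:
```python
# Assign contiguous chunks of pipeline blocks to increasing PP ranks.
model = DummyModel(p2p=p2p)
blocks = [module for module in model.modules() if isinstance(module, PipelineBlock)]
pp_size = parallel_context.pp_pg.size()
for i, block in enumerate(blocks):
    block.build_and_set_rank(i * pp_size // len(blocks))
```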
### Pipeline engine
We currently support two kinds of engines: `AllForwardAllBackward` and `OneForwardOneBackward`.
Pipeline engines are different schedules for the same set of workloads. A great illustration of the different schedules we support for training can be found in [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473). We currently support `All forward all backward` and `One forward one backward` (Figure 3 and the top of Figure 4).
![Pipeline engine](assets/pipeline_engine.png)
(Source: [link](https://arxiv.org/abs/2104.04473))
> **_IMPORTANT NOTE:_** When preparing your dataloader, make sure every tensor lives on a single rank and that all other ranks hold a `TensorPointer` to that rank. This is a requirement for the pipeline engine to work.
## ZeRO-1 optimizer
ZeRO stands for "Zero Redundancy Optimizer" (also known as FSDP in PyTorch). The goal of such techniques is to shard tensors across multiple devices instead of duplicating them. Consequently, it allows for significant memory gains at the cost of some communication overhead (with the potential to overlap computation and communication). Sharding is done across the data parallel dimension. There are three stages:
- `Stage 1`: The optimizer states are sharded.
- `Stage 2`: The gradients are sharded.
- `Stage 3`: The model weights are sharded.
As of now, we only support `stage 1`.
![ZeRO](assets/zero.png)
(Source: [link](https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/))
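For intuition, here is a minimal sketch of stage 1, where each data parallel rank owns the optimizer state of one contiguous slice of the flattened parameters; `model` and `dp_pg` are assumed from the surrounding setup, and real implementations also handle padding and mixed precision:
```python
import torch
import torch.distributed as dist

# Flatten all parameters; numel is assumed divisible by the DP world size.
flat = torch.cat([p.detach().reshape(-1) for p in model.parameters()])
shard_size = flat.numel() // dp_pg.size()
start = dist.get_rank(group=dp_pg) * shard_size
my_slice = flat[start:start + shard_size].clone().requires_grad_()

# Each rank holds Adam state for only 1/dp of the parameters.
optimizer = torch.optim.AdamW([my_slice], lr=3e-4)
# After optimizer.step(), ranks all-gather the updated slices back into the full parameters.
```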
# Nice-to-have features
## Recomputation utilities
Activation recomputation, also known as "activation checkpointing", is a memory-saving technique. PyTorch automatically stores, during the forward pass, the set of activations required for the backward computation. With large workloads, however, it can be worth recomputing specific activations instead, in order to save memory. In `nanotron` we provide a decorator to implement this feature:
```python
class MyFancyModule(nn.Module):
    def __init__(self):
        ...
        self.do_checkpoint: bool = True

    @checkpoint_method(attr_name="do_checkpoint")
    def forward(self, x):
        ...
```
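Under the hood the decorator is roughly equivalent to guarding the forward with `torch.utils.checkpoint` when the flag is set; this is a sketch, not nanotron's actual implementation:
```python
import functools

import torch.utils.checkpoint

def checkpoint_method(attr_name: str):
    # Recompute the wrapped forward during backward when `self.<attr_name>` is True.
    def decorator(forward_fn):
        @functools.wraps(forward_fn)
        def wrapper(self, *args):
            if getattr(self, attr_name):
                return torch.utils.checkpoint.checkpoint(forward_fn, self, *args)
            return forward_fn(self, *args)
        return wrapper
    return decorator
```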
## On device initialization
The usual PyTorch module constructors instantiate weights on CPU and then move them to GPU. This can blow up CPU memory and is overall quite slow.
```python
with init_on_device_and_dtype(device=torch.device("cuda"), dtype=torch.bfloat16):
    module = MyFancyModule()  # This directly instantiates the model on your device

# If you want to bypass PyTorch's weight initialization mechanism
with init_on_device_and_dtype(device=torch.device("meta"), dtype=torch.bfloat16):
    module = MyFancyModule()
module.to_empty(device=torch.device("cuda"))  # bfloat16 model on GPU with weights not initialized (only the storage buffers are allocated)
```
## Unified API for logging
We provide a uniform API for logging, whether to TensorBoard, stdout, or the Hugging Face Hub:
```python
@dataclass
class LogItem:
    tag: str
    scalar_value: Union[float, int]
    log_format: Optional[str] = None
```
All loggers need to implement a single method:
```python
class BaseLogger:
    @abstractmethod
    def add_scalars_from_list(self, log_entries: List[LogItem], iteration_step: int):
        ...
```
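For example, a minimal stdout implementation building on the definitions above could look like this (`StdoutLogger` is a hypothetical name, not one of the shipped loggers):
```python
from typing import List

class StdoutLogger(BaseLogger):
    def add_scalars_from_list(self, log_entries: List[LogItem], iteration_step: int):
        # Render each item with its optional format string, one line per step.
        rendered = " | ".join(
            f"{item.tag}: {format(item.scalar_value, item.log_format) if item.log_format else item.scalar_value}"
            for item in log_entries
        )
        print(f"iteration {iteration_step} | {rendered}")
```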
If you want to have tensorboard logger support: `pip install -e ".[tb-logger]"`.
If you want to have huggingface-hub tensorboard logger support: `pip install -e ".[hf-logger]"`.
## Random state handling primitives
We provide a mechanism for holding an arbitrary number of `RandomState` objects in a `RandomStates` container:
```python
class RandomState:
    random
    numpy
    torch
    torch_cuda

class RandomStates(MutableMapping[str, RandomState]):
    pass
```
At any time, we can get/set the current random state in the current context:
```python
def get_current_random_state():
    # This gets the current random state from the current context
    pass

def set_random_state(random_state: RandomState):
    # This sets the random state in the current context
    pass
```
To use a specific `RandomState` for specific operations, typically when you want to synchronize `nn.Dropout` across multiple ranks, you can use the `branch_random_state` context manager:
```python
def branch_random_state(random_states: RandomStates, key: str):
    # Context manager which sets the random state associated with `key` upon entering.
    # Upon exiting, it updates the random state at `key` and restores the previous random state.
    pass

# Usage
random_states = RandomStates({"my_own_random_state": get_current_random_state()})
with branch_random_state(random_states, "my_own_random_state"):
    output = nn.Dropout(0.1)(input)
```
Finally, we provide a quick helper to obtain a random state that is synchronized across a process group.
```python
def get_synced_random_state(random_state: RandomState, pg: ProcessGroup):
    # This returns a random state synchronized with the other ranks of a single process group
    pass

# Usage
random_states = RandomStates(
    {"tp_synced_random_state": get_synced_random_state(random_state=get_current_random_state(), pg=tp_pg)}
)
with branch_random_state(random_states, "tp_synced_random_state"):
    # Assuming the input is synced across TP, all ranks will apply the same random mask.
    output = nn.Dropout(0.1)(input)
```
# Distributed serialization mechanism
We rely on compute nodes having access to a single shared filesystem.
We use `safetensors` to store our checkpoints.
Current format:
```
checkpoint_metadata.json  # Stores version, topology and other metadata that makes the training resumable
optimizer/
    optimizer_config.json  # Stores enough information to re-instantiate the optimizer
    optimizer_tp-0-of-1_dp-0-of-1_pp-0-of-2.pt
    optimizer_tp-0-of-1_dp-0-of-1_pp-1-of-2.pt
lr_scheduler/
    lr_scheduler_tp-0-of-1_dp-0-of-1_pp-0-of-2.pt
    lr_scheduler_tp-0-of-1_dp-0-of-1_pp-1-of-2.pt
random/  # Stores random states from each process in order to resume training from that point on
    tp-0-of-1_dp-0-of-1_pp-0-of-2.pt
    tp-0-of-1_dp-0-of-1_pp-1-of-2.pt
model/
    dense1/
        model_weight.safetensors
        model_bias.safetensors
    dense2/
        model_weight.safetensors
        model_bias.safetensors
```
Some observations:
- Checkpoints are NOT topology-agnostic; this is due to both `random_states` and `sharded` tensors. Instead of trying to reconcile those into a topology-agnostic format, we want to support a `checkpoint_reshape` method. The motivations are the following:
  - When training, one spends a LOT more time saving checkpoints than loading them, so having the fastest possible saving mechanism helps. Consequently, avoiding any distributed communication/locking helps.
  - Random states are not so easily reconcilable: given the random states of two separate processes when TP=2, it's not obvious what the random state should be if we switch to TP=1.
- Optimizer states are aligned with parameters. It's usually the case that you can define an optimizer state for each parameter, but that's a limitation of the current serialization format.
# Current restrictions:
- `nn.Module`s inside `PipelineBlock`s have to return a `Dict[str, torch.Tensor]` or a `torch.Tensor`.
- No conditional control flow on top of the pipeline, or at least one must make sure that all the processes within a data parallel rank perform the same sequence of operations:
  - First, all but one process will see only `TensorPointer`s, which would make input-dependent control flow quite hard.
  - Second, if input-dependent control flow caused two processes within a single data parallel rank to take different paths, you might end up with weird communication issues.
# Nanosets
Nanotron incorporates [`Nanosets`](../src/nanotron/data/nanoset.py), a dataset for processing tokenized documents with [`datatrove`](https://github.com/huggingface/datatrove). They allow reading tokens from one or multiple datasets and even specifying the weight of each dataset when building batches.
## Install
To use `Nanosets`, it's necessary to install Nanotron with the `nanosets` flavor.
```
pip install nanotron[nanosets]
```
This will install the following dependencies:
- `datatrove`: To preprocess the datasets
- `numba`: To compile helper functions in order to speed up the creation of `Nanosets`
- `transformers`: For the tokenizers
## Data pre-processing
To use this dataset, we first need to preprocess the data using `datatrove`'s `DocumentTokenizer` pipeline. We invite you to take a look at `datatrove`, since it contains multiple features that allow, for example, filtering out documents based on specific rules/criteria, extracting text content from raw formats, or scheduling the preprocessing on a Slurm cluster. We have also added a simple script capable of tokenizing datasets.
The preprocessing is done using the [`tools/preprocess_data.py`](../tools/preprocess_data.py) script. The input format can either be a Hugging Face Dataset, a path to a `.jsonl` file, or a path to a folder containing multiple `.jsonl` files. Below we show an example for processing a Hugging Face Dataset from the Hub with the Llama3 tokenizer.
<pre>
python3 tools/preprocess_data.py \
       --tokenizer-name-or-path meta-llama/Meta-Llama-3-8B \
       --output-folder datasets/emotion \
       --n-tasks 16 \
       hf \
       --dataset dair-ai/emotion
</pre>
First, with `--tokenizer-name-or-path` we specify a tokenizer in the same way as we do when using `AutoTokenizer.from_pretrained(...)`. Then we specify the `--output-folder` where we will store the tokenized documents, and the number of workers with `--n-tasks`. Finally, we indicate the type of dataset (whether it's a Hugging Face Dataset ["**hf**"] or in jsonl ["**jsonl**"] format) and the dataset that we want to preprocess. Check the different settings with `python3 tools/preprocess_data.py --help`, `python3 tools/preprocess_data.py hf --help` and `python3 tools/preprocess_data.py jsonl --help`.
Every worker will store 3 different kinds of files in `--output-folder`:
- `*.ds` Containing the tokenized documents
- `*.ds.index` Containing the bounds of each tokenized document
- `*.ds.metadata` Containing the number of tokens and tokenizer used
> [!IMPORTANT]
> Remember to specify the type of dataset to process, e.g. `python3 tools/preprocess_data.py --tokenizer-name-or-path gpt2 --n-tasks 16 jsonl --dataset raw_datasets/c4-es-json-files`
## Working with Nanosets
To work with `Nanosets`, we just need to configure 1 argument:
1. `dataset_folder`: This argument specifies the file or files that will compose the `Nanoset`. There are 3 ways to specify it:
1. If we specify a single path, we will create a `Nanoset` from a single dataset file.
```yaml
data_stages:
  - name: General purpose training (Single dataset)
    start_training_step: 1
    data:
      dataset:
        dataset_folder: datasets/SlimPajama-6B
      num_loading_workers: 0
      seed: 1234
```
2. If we specify a list of paths, we will create a `Nanoset` from all the dataset files. In every epoch we will consume each and every sample from each dataset randomly.
```yaml
data_stages:
  - name: Second purpose training (> 1 dataset)
    start_training_step: 15
    data:
      dataset:
        dataset_folder:
          - datasets/SlimPajama-6B
          - datasets/testing_alpaca_small
      num_loading_workers: 0
      seed: 1234
```
3. If we specify a dictionary with paths and weights, we will create a `Nanoset` from the dataset files where each epoch will have a number of samples from each dataset according to the specified weights.
```yaml
data_stages:
  - name: Third purpose training (Blended dataset)
    start_training_step: 25
    data:
      dataset:
        dataset_folder:
          datasets/SlimPajama-6B: 0.8
          datasets/testing_alpaca_small: 0.2
      num_loading_workers: 0
      seed: 1234
```
> [!IMPORTANT]
> Remember to set the `tokenizer.tokenizer_name_or_path` in the config file to the tokenizer used to preprocess the documents and set the `model.model_config.vocab_size` accordingly.
Finally, to use the `Nanosets`, launch the training with [`run_train.py`](../run_train.py).
```shell
torchrun --nproc-per-node 1 run_train.py --config examples/config_nanoset.yaml
```
## Under the hood
`Nanosets` are responsible for building samples of `sequence length + 1` tokens from the preprocessed dataset files. Although most of the extraction logic lies in `DatatroveFolderDataset`, `Nanosets` take care of the following:
1. Creating dataset mixtures from different dataset folder paths
2. Ensuring that in each epoch we consume each sample only once
3. Ensuring that we never exhaust the `DataLoader`
Based on the `dataset lengths`, the `dataset weights` and the `number of samples per epoch` (defined as the `sum(dataset lengths)`), we build the two indexes we need in order to extract samples from the `Nanoset` ([build_nanoset_index_helper](../src/nanotron/data/nanoset.py)):
- `dataset index`: Contains the index of the dataset from the list of `dataset paths` from which to extract the sample, respecting the established dataset weight.
```
Given:
D = [d0, d1, d2, d3] # datasets
DL = [8, 2, 5, 5] # dataset lengths
W = [0.1, 0.5, 0.3, 0.1] # dataset weights
SPE = 20 # number of samples per epoch
Then, for example:
dataset_index = [1, 2, 0, 1, 3, 1, 2, 1, 2, 1, 0, 1, 2, 1, 3, 1, 2, 1, 2, 1]
```
- `dataset sample index`: Contains the sample index to extract from the `dataset index[index]` dataset, always < `len(dataset)`.
```
dataset_index = [1, 2, 0, 1, 3, 1, 2, 1, 2, 1, 0, 1, 2, 1, 3, 1, 2, 1, 2, 1]
dataset_sample_index = [0, 0, 0, 1, 0, 0, 1, 1, 2, 0, 1, 1, 3, 0, 1, 1, 4, 0, 0, 1]
```
Then, we **shuffle both indexes with the same permutation** and concatenate them `number of epochs` times, where the number of epochs is `train split num samples` / `number of samples per epoch`.
```
Given:
N = 70 # train split num samples
dataset_index = [1, 2, 0, 1, 3, 1, 2, 1, 2, 1, 0, 1, 2, 1, 3, 1, 2, 1, 2, 1]
dataset_sample_index = [0, 0, 0, 1, 0, 0, 1, 1, 2, 0, 1, 1, 3, 0, 1, 1, 4, 0, 0, 1]
Shuffle dataset_index and dataset_sample_index:
dataset_index = [1, 1, 0, 2, 3, 1, 3, 1, 2, 2, 1, 1, 0, 1, 1, 2, 1, 2, 2, 1]
dataset_sample_index = [1, 0, 0, 4, 1, 0, 0, 0, 2, 0, 0, 1, 1, 0, 1, 0, 1, 3, 1, 1]
n_concatenations = (70 // 20) + 1 = 4
dataset_index = dataset_index concatenated 4 times
dataset_sample_index = dataset_sample_index concatenated 4 times
dataset_index = dataset_index[: N]
dataset_sample_index = dataset_sample_index[: N]
```
To query the `Nanoset` for the k-th sample we do the following:
- Use the `dataset_index` to retrieve the corresponding dataset from `D` and the `dataset_sample_index` to retrieve the corresponding sample from that dataset.
```
sample = D[dataset_index[k]][dataset_sample_index[k]]
```
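Putting the walk-through together, here is a compact sketch of the index construction in plain Python; the real `build_nanoset_index_helper` is numba-compiled and may differ in details:
```python
import numpy as np

def build_indexes(dataset_lengths, weights, samples_per_epoch, train_split_num_samples, seed=1234):
    weights = np.asarray(weights, dtype=np.float64)
    # 1) dataset_index: greedily pick the dataset whose realized share lags its weight the most.
    counts = np.zeros(len(weights))
    dataset_index = np.empty(samples_per_epoch, dtype=np.int64)
    for i in range(samples_per_epoch):
        dataset_index[i] = int(np.argmax(weights - counts / max(i, 1)))
        counts[dataset_index[i]] += 1
    # 2) dataset_sample_index: cycle within each dataset so it stays < len(dataset).
    seen = np.zeros(len(weights), dtype=np.int64)
    dataset_sample_index = np.empty_like(dataset_index)
    for k, d in enumerate(dataset_index):
        dataset_sample_index[k] = seen[d] % dataset_lengths[d]
        seen[d] += 1
    # 3) Shuffle both with the same permutation, then tile and truncate to cover all samples.
    perm = np.random.default_rng(seed).permutation(samples_per_epoch)
    dataset_index, dataset_sample_index = dataset_index[perm], dataset_sample_index[perm]
    n_concatenations = train_split_num_samples // samples_per_epoch + 1
    dataset_index = np.tile(dataset_index, n_concatenations)[:train_split_num_samples]
    dataset_sample_index = np.tile(dataset_sample_index, n_concatenations)[:train_split_num_samples]
    return dataset_index, dataset_sample_index
```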
"""
Benchmarking script for the Llama-2-7b model
"""
import os
from nanotron.config import (
CheckpointsArgs,
Config,
DataArgs,
GeneralArgs,
LlamaConfig,
LoggingArgs,
LRSchedulerArgs,
ModelArgs,
OptimizerArgs,
ParallelismArgs,
PretrainDatasetsArgs,
RandomInit,
TokenizerArgs,
TokensArgs,
)
from nanotron.logging import human_format
# Config for a llama model with 6.74M parameters
model_config = LlamaConfig()
num_params = human_format(
model_config.vocab_size * model_config.hidden_size * 2
+ model_config.num_hidden_layers
* (
3 * model_config.hidden_size * model_config.intermediate_size
+ 4 * model_config.hidden_size * model_config.hidden_size
)
).replace(".", "p")
print(f"Model has {num_params} parameters")
seed = 42
learning_rate = LRSchedulerArgs(
learning_rate=3e-4, lr_warmup_steps=2, lr_warmup_style="linear", lr_decay_style="cosine", min_decay_lr=1e-5
)
optimizer = OptimizerArgs(
zero_stage=0,
weight_decay=0.01,
clip_grad=1.0,
accumulate_grad_in_fp32=True,
adam_eps=1e-08,
adam_beta1=0.9,
adam_beta2=0.95,
torch_adam_is_fused=True,
learning_rate_scheduler=learning_rate,
)
parallelism = ParallelismArgs(
dp=2,
pp=1,
tp=4,
pp_engine="1f1b",
tp_mode="REDUCE_SCATTER",
tp_linear_async_communication=True,
)
tokens = TokensArgs(sequence_length=8192, train_steps=5, micro_batch_size=1, batch_accumulation_per_replica=8)
dataset = PretrainDatasetsArgs(hf_dataset_or_datasets="stas/openwebtext-10k", text_column_name="text")
checkpoints_path = os.path.dirname(os.path.dirname(__file__)) + "/checkpoints"
os.makedirs(checkpoints_path, exist_ok=True)
config = Config(
general=GeneralArgs(project="bench", run="llama", seed=seed),
checkpoints=CheckpointsArgs(checkpoints_path=checkpoints_path, checkpoint_interval=1000),
parallelism=parallelism,
model=ModelArgs(init_method=RandomInit(std=0.025), model_config=model_config),
tokenizer=TokenizerArgs("meta-llama/Llama-2-7b-hf"),
optimizer=optimizer,
logging=LoggingArgs(),
tokens=tokens,
data=DataArgs(dataset=dataset, seed=seed),
profiler=None,
)
if __name__ == "__main__":
dir = os.path.dirname(__file__)
# Save config as YAML file
config.save_as_yaml(f"{dir}/config_llama.yaml")
# Launch training
os.system("export CUDA_DEVICE_MAX_CONNECTIONS=1")
gpus = config.parallelism.dp * config.parallelism.pp * config.parallelism.tp
os.system(f"torchrun --nproc_per_node={gpus} run_train.py --config-file {dir}/config_llama.yaml")
checkpoints:
  checkpoint_interval: 10
  checkpoints_path: checkpoints
  checkpoints_path_is_shared_file_system: false
  resume_checkpoint_path: null
  save_final_state: false
  save_initial_state: false
data_stages:
- data:
    dataset:
      dataset_overwrite_cache: false
      dataset_processing_num_proc_per_process: 1
      hf_dataset_config_name: null
      hf_dataset_or_datasets: stas/openwebtext-10k
      hf_dataset_splits: train
      text_column_name: text
    num_loading_workers: 1
    seed: 42
  name: Stable Training Stage
  start_training_step: 1
- data:
    dataset:
      dataset_overwrite_cache: false
      dataset_processing_num_proc_per_process: 1
      hf_dataset_config_name: null
      hf_dataset_or_datasets: stas/openwebtext-10k
      hf_dataset_splits: train
      text_column_name: text
    num_loading_workers: 1
    seed: 42
  name: Annealing Phase
  start_training_step: 10
general:
  benchmark_csv_path: null
  consumed_train_samples: null
  ignore_sanity_checks: true
  project: debug
  run: tiny_llama_%date_%jobid
  seed: 42
  step: null
lighteval: null
logging:
  iteration_step_info_interval: 1
  log_level: info
  log_level_replica: info
model:
  ddp_bucket_cap_mb: 25
  dtype: bfloat16
  init_method:
    std: 0.025
  make_vocab_size_divisible_by: 1
  model_config:
    bos_token_id: 128000
    eos_token_id: 128001
    hidden_act: silu
    hidden_size: 4096
    initializer_range: 0.02
    intermediate_size: 14336
    is_llama_config: true
    max_position_embeddings: 2048
    num_attention_heads: 32
    num_hidden_layers: 32
    num_key_value_heads: 8
    pad_token_id: null
    pretraining_tp: 1
    rms_norm_eps: 1.0e-05
    rope_interleaved: false
    rope_scaling: null
    rope_theta: 500000.0
    tie_word_embeddings: false
    use_cache: true
    vocab_size: 128256
optimizer:
  accumulate_grad_in_fp32: true
  clip_grad: 1.0
  learning_rate_scheduler:
    learning_rate: 0.0003
    lr_decay_starting_step: null
    lr_decay_steps: 13
    lr_decay_style: cosine
    lr_warmup_steps: 2
    lr_warmup_style: linear
    min_decay_lr: 1.0e-05
  optimizer_factory:
    adam_beta1: 0.9
    adam_beta2: 0.95
    adam_eps: 1.0e-08
    name: adamW
    torch_adam_is_fused: true
  weight_decay: 0.01
  zero_stage: 0
parallelism:
  dp: 1
  expert_parallel_size: 1
  pp: 2
  pp_engine: 1f1b
  recompute_layer: false
  tp: 2
  tp_linear_async_communication: true
  tp_mode: REDUCE_SCATTER
  tp_recompute_allgather: true
profiler: null
s3_upload: null
tokenizer:
  tokenizer_max_length: null
  tokenizer_name_or_path: Meta-Llama-3.1-8B
  tokenizer_revision: null
tokens:
  batch_accumulation_per_replica: 1
  limit_test_batches: 0
  limit_val_batches: 0
  micro_batch_size: 2
  sequence_length: 2048
  train_steps: 15
  val_check_interval: -1
checkpoints:
  checkpoint_interval: 10
  checkpoints_path: checkpoints
  checkpoints_path_is_shared_file_system: false
  resume_checkpoint_path: null
  save_final_state: false
  save_initial_state: false
data_stages:
- data:
    dataset:
      dataset_overwrite_cache: false
      dataset_processing_num_proc_per_process: 1
      hf_dataset_config_name: null
      hf_dataset_or_datasets: stas/openwebtext-10k
      hf_dataset_splits: train
      text_column_name: text
    num_loading_workers: 1
    seed: 42
  name: Stable Training Stage
  start_training_step: 1
- data:
    dataset:
      dataset_overwrite_cache: false
      dataset_processing_num_proc_per_process: 1
      hf_dataset_config_name: null
      hf_dataset_or_datasets: stas/openwebtext-10k
      hf_dataset_splits: train
      text_column_name: text
    num_loading_workers: 1
    seed: 42
  name: Annealing Phase
  start_training_step: 10
general:
  benchmark_csv_path: null
  consumed_train_samples: null
  ignore_sanity_checks: true
  project: debug
  run: tiny_llama_%date_%jobid
  seed: 42
  step: null
lighteval: null
logging:
  iteration_step_info_interval: 1
  log_level: info
  log_level_replica: info
model:
  ddp_bucket_cap_mb: 25
  dtype: bfloat16
  init_method:
    std: 0.025
  make_vocab_size_divisible_by: 1
  model_config:
    bos_token_id: 128000
    eos_token_id: 128001
    hidden_act: silu
    hidden_size: 4096
    initializer_range: 0.02
    intermediate_size: 14336
    is_llama_config: true
    max_position_embeddings: 2048
    num_attention_heads: 32
    num_hidden_layers: 32
    num_key_value_heads: 8
    pad_token_id: null
    pretraining_tp: 1
    rms_norm_eps: 1.0e-05
    rope_interleaved: false
    rope_scaling: null
    rope_theta: 500000.0
    tie_word_embeddings: false
    use_cache: true
    vocab_size: 256
optimizer:
  accumulate_grad_in_fp32: true
  clip_grad: 1.0
  learning_rate_scheduler:
    learning_rate: 0.0003
    lr_decay_starting_step: null
    lr_decay_steps: 13
    lr_decay_style: cosine
    lr_warmup_steps: 2
    lr_warmup_style: linear
    min_decay_lr: 1.0e-05
  optimizer_factory:
    adam_beta1: 0.9
    adam_beta2: 0.95
    adam_eps: 1.0e-08
    name: adamW
    torch_adam_is_fused: true
  weight_decay: 0.01
  zero_stage: 0
parallelism:
  dp: 1
  expert_parallel_size: 1
  pp: 2
  pp_engine: 1f1b
  recompute_layer: false
  tp: 2
  tp_linear_async_communication: true
  tp_mode: REDUCE_SCATTER
  tp_recompute_allgather: true
profiler: null
s3_upload: null
tokenizer:
  tokenizer_max_length: null
  tokenizer_name_or_path: robot-test/dummy-tokenizer-wordlevel
  tokenizer_revision: null
tokens:
  batch_accumulation_per_replica: 1
  limit_test_batches: 0
  limit_val_batches: 0
  micro_batch_size: 2
  sequence_length: 2048
  train_steps: 15
  val_check_interval: -1
checkpoints:
  checkpoint_interval: 1000
  checkpoints_path: checkpoints/
  checkpoints_path_is_shared_file_system: false
  resume_checkpoint_path: null
  save_initial_state: false
data_stages:
- data:
    dataset:
      dataset_folder: datasets/c4-es/tokenized
    num_loading_workers: 1
    seed: 42
  name: General purpose training (Single dataset)
  start_training_step: 1
- data:
    dataset:
      dataset_folder:
      - datasets/SlimPajama-6B/tokenized
      - datasets/c4-es/tokenized
    num_loading_workers: 1
    seed: 42
  name: Second purpose training (> 1 dataset)
  start_training_step: 15
- data:
    dataset:
      dataset_folder:
        datasets/SlimPajama-6B/tokenized: 0.8
        datasets/c4-es/tokenized: 0.2
    num_loading_workers: 1
    seed: 42
  name: Third purpose training (Blended dataset)
  start_training_step: 25
general:
  benchmark_csv_path: null
  consumed_train_samples: null
  ignore_sanity_checks: true
  project: Nanoset
  run: llama
  seed: 42
  step: null
lighteval: null
logging:
  iteration_step_info_interval: 1
  log_level: info
  log_level_replica: info
model:
  ddp_bucket_cap_mb: 25
  dtype: bfloat16
  init_method:
    std: 0.025
  make_vocab_size_divisible_by: 1
  model_config:
    bos_token_id: 1
    eos_token_id: 2
    hidden_act: silu
    hidden_size: 16
    initializer_range: 0.02
    intermediate_size: 64
    is_llama_config: true
    max_position_embeddings: 1024
    num_attention_heads: 4
    num_hidden_layers: 2
    num_key_value_heads: 4
    pad_token_id: null
    pretraining_tp: 1
    rms_norm_eps: 1.0e-05
    rope_scaling: null
    tie_word_embeddings: true
    use_cache: true
    vocab_size: 50257
optimizer:
  accumulate_grad_in_fp32: true
  clip_grad: 1.0
  learning_rate_scheduler:
    learning_rate: 0.0003
    lr_decay_starting_step: null
    lr_decay_steps: 98
    lr_decay_style: cosine
    lr_warmup_steps: 2
    lr_warmup_style: linear
    min_decay_lr: 1.0e-05
  optimizer_factory:
    adam_beta1: 0.9
    adam_beta2: 0.95
    adam_eps: 1.0e-08
    name: adamW
    torch_adam_is_fused: true
  weight_decay: 0.01
  zero_stage: 0
parallelism:
  dp: 1
  expert_parallel_size: 1
  pp: 1
  pp_engine: 1f1b
  tp: 1
  tp_linear_async_communication: true
  tp_mode: REDUCE_SCATTER
profiler: null
tokenizer:
  tokenizer_max_length: null
  tokenizer_name_or_path: gpt2
  tokenizer_revision: null
tokens:
  batch_accumulation_per_replica: 1
  limit_test_batches: 0
  limit_val_batches: 0
  micro_batch_size: 2
  sequence_length: 1024
  train_steps: 200
  val_check_interval: -1
""" Example python script to generate a YAML config file which can be used to run a training with nanotron. Refer to "examples" section in the `/README.md` for more information."""
import os
from nanotron.config import (
AdamWOptimizerArgs,
CheckpointsArgs,
Config,
DataArgs,
DatasetStageArgs,
GeneralArgs,
LlamaConfig,
LoggingArgs,
LRSchedulerArgs,
ModelArgs,
OptimizerArgs,
ParallelismArgs,
PretrainDatasetsArgs,
RandomInit,
TokenizerArgs,
TokensArgs,
)
from nanotron.logging import human_format
model_config = LlamaConfig(
# Config for a tiny model model with 1.62M parameters
bos_token_id=1,
eos_token_id=2,
hidden_act="silu",
hidden_size=16,
initializer_range=0.02,
intermediate_size=64,
max_position_embeddings=256,
num_attention_heads=4,
num_hidden_layers=2,
num_key_value_heads=4,
pretraining_tp=1,
rms_norm_eps=1e-05,
rope_scaling=None,
tie_word_embeddings=True,
use_cache=True,
vocab_size=256,
)
num_params = human_format(
model_config.vocab_size * model_config.hidden_size * 2
+ model_config.num_hidden_layers
* (
3 * model_config.hidden_size * model_config.intermediate_size
+ 4 * model_config.hidden_size * model_config.hidden_size
)
).replace(".", "p")
print(f"Model has {num_params} parameters")
seed = 42
learning_rate = LRSchedulerArgs(
learning_rate=3e-4, lr_warmup_steps=2, lr_warmup_style="linear", lr_decay_style="cosine", min_decay_lr=1e-5
)
optimizer = OptimizerArgs(
zero_stage=0,
weight_decay=0.01,
clip_grad=1.0,
accumulate_grad_in_fp32=True,
learning_rate_scheduler=learning_rate,
optimizer_factory=AdamWOptimizerArgs(
adam_eps=1e-08,
adam_beta1=0.9,
adam_beta2=0.95,
torch_adam_is_fused=True,
),
)
parallelism = ParallelismArgs(
dp=2,
pp=2,
tp=2,
pp_engine="1f1b",
tp_mode="REDUCE_SCATTER",
tp_linear_async_communication=True,
)
tokens = TokensArgs(sequence_length=256, train_steps=15, micro_batch_size=2, batch_accumulation_per_replica=1)
data_stages = [
DatasetStageArgs(
name="Stable Training Stage",
start_training_step=1,
data=DataArgs(
dataset=PretrainDatasetsArgs(hf_dataset_or_datasets="stas/openwebtext-10k", text_column_name="text"),
seed=seed,
),
),
DatasetStageArgs(
name="Annealing Phase",
start_training_step=10,
data=DataArgs(
dataset=PretrainDatasetsArgs(hf_dataset_or_datasets="stas/openwebtext-10k", text_column_name="text"),
seed=seed,
),
),
]
checkpoints_path = "./checkpoints"
os.makedirs(checkpoints_path, exist_ok=True)
config = Config(
general=GeneralArgs(project="debug", run="tiny_llama_%date_%jobid", seed=seed),
checkpoints=CheckpointsArgs(checkpoints_path=checkpoints_path, checkpoint_interval=10),
parallelism=parallelism,
model=ModelArgs(init_method=RandomInit(std=0.025), model_config=model_config),
tokenizer=TokenizerArgs("robot-test/dummy-tokenizer-wordlevel"),
optimizer=optimizer,
logging=LoggingArgs(),
tokens=tokens,
data_stages=data_stages,
profiler=None,
)
if __name__ == "__main__":
dir = os.path.dirname(__file__)
# Save config as YAML file
config.save_as_yaml(f"{dir}/config_tiny_llama.yaml")
# You can now train a model with this config using `/run_train.py`
checkpoints:
  checkpoint_interval: 10
  checkpoints_path: checkpoints
  checkpoints_path_is_shared_file_system: false
  resume_checkpoint_path: null
  save_final_state: false
  save_initial_state: false
data_stages:
- data:
    dataset:
      dataset_overwrite_cache: false
      dataset_processing_num_proc_per_process: 1
      hf_dataset_config_name: null
      hf_dataset_or_datasets: stas/openwebtext-10k
      hf_dataset_splits: train
      text_column_name: text
    num_loading_workers: 1
    seed: 42
  name: Stable Training Stage
  start_training_step: 1
- data:
    dataset:
      dataset_overwrite_cache: false
      dataset_processing_num_proc_per_process: 1
      hf_dataset_config_name: null
      hf_dataset_or_datasets: stas/openwebtext-10k
      hf_dataset_splits: train
      text_column_name: text
    num_loading_workers: 1
    seed: 42
  name: Annealing Phase
  start_training_step: 10
general:
  benchmark_csv_path: null
  consumed_train_samples: null
  ignore_sanity_checks: true
  project: debug
  run: tiny_llama_%date_%jobid
  seed: 42
  step: null
lighteval: null
logging:
  iteration_step_info_interval: 1
  log_level: info
  log_level_replica: info
model:
  ddp_bucket_cap_mb: 25
  dtype: bfloat16
  init_method:
    std: 0.025
  make_vocab_size_divisible_by: 1
  model_config:
    bos_token_id: 1
    eos_token_id: 2
    hidden_act: silu
    hidden_size: 16
    initializer_range: 0.02
    intermediate_size: 64
    is_llama_config: true
    max_position_embeddings: 256
    num_attention_heads: 4
    num_hidden_layers: 2
    num_key_value_heads: 4
    pad_token_id: null
    pretraining_tp: 1
    rms_norm_eps: 1.0e-05
    rope_interleaved: false
    rope_scaling: null
    rope_theta: 10000.0
    tie_word_embeddings: true
    use_cache: true
    vocab_size: 256
optimizer:
  accumulate_grad_in_fp32: true
  clip_grad: 1.0
  learning_rate_scheduler:
    learning_rate: 0.0003
    lr_decay_starting_step: null
    lr_decay_steps: 13
    lr_decay_style: cosine
    lr_warmup_steps: 2
    lr_warmup_style: linear
    min_decay_lr: 1.0e-05
  optimizer_factory:
    adam_beta1: 0.9
    adam_beta2: 0.95
    adam_eps: 1.0e-08
    name: adamW
    torch_adam_is_fused: true
  weight_decay: 0.01
  zero_stage: 0
parallelism:
  dp: 1
  expert_parallel_size: 1
  pp: 2
  pp_engine: 1f1b
  recompute_layer: false
  tp: 2
  tp_linear_async_communication: true
  tp_mode: REDUCE_SCATTER
  tp_recompute_allgather: true
profiler: null
s3_upload: null
tokenizer:
  tokenizer_max_length: null
  tokenizer_name_or_path: robot-test/dummy-tokenizer-wordlevel
  tokenizer_revision: null
tokens:
  batch_accumulation_per_replica: 1
  limit_test_batches: 0
  limit_val_batches: 0
  micro_batch_size: 2
  sequence_length: 256
  train_steps: 15
  val_check_interval: -1