"docs/source/api/pipelines/overview.mdx" did not exist on "a4d5b59f132126c06c1a6b1f266ee44c70440cce"
Commit dfcb88ff authored by chenzk's avatar chenzk
Browse files

v1.0.8

parents
```shell
torchrun --nproc-per-node 1 examples/llama/convert_nanotron_to_hf.py --checkpoint_path checkpoints/10 --save_path hf/hf-Llama-3.1-8B --tokenizer_name Meta-Llama-3.1-8B
```
```dockerfile
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.3.0-py3.10-dtk24.04.3-ubuntu20.04
ENV DEBIAN_FRONTEND=noninteractive
# RUN yum update && yum install -y git cmake wget build-essential
# RUN source /opt/dtk-24.04.3/env.sh
# Install pip dependencies
COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
```
```
torch>=1.13.1
pyyaml
numpy
packaging
safetensors
dacite==1.8.1
tqdm
datasets
flash-attn
setuptools
fsspec==2024.9.0
numba
datatrove[all]  # 0.3.0
transformers==4.46.3
tokenizers==0.20.3
```
```shell
docker run -it --shm-size=64G -v $PWD/nanotron:/home/nanotron -v /public/DL_DATA/AI:/home/AI -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name llama b272aae8ec72 bash
# python -m torch.utils.collect_env
```
## The internals of nanotron
### 1. Tensor Parallelism
#### Asynchronous Tensor Parallelism
Q: What are the two different tensor parallel linear modes in nanotron?
A: All-reduce and Reduce-scatter
Q: How does asynchronous column parallel linear work differently than regular column parallel linear?
A: In regular column parallel linear, each rank only computes its portion of the output matrix, then gathers the partial outputs at the end.
In asynchronous column parallel, each rank kicks off an asynchronous all-gather on the input tensor at the start. While that communication is happening, the rank computes the portion of the output corresponding to its local shard of the weights. When the all-gather finishes, each rank can compute the remaining portions of the output matrix it's missing, using the parts of the gathered input from other ranks.
Q: In asynchronous column parallel, what exactly does it gather?
A: In asynchronous column parallel, each rank kicks off an all-gather operation on the input tensor X at the start of the forward pass. This gathers the shards of X from all tensor parallel ranks into one large tensor.
For example with 4 GPUs:
+ Input X is sharded as [X0, X1, X2, X3] across 4 ranks
+ Rank 0 all-gathers: [X0, X1, X2, X3]
So each GPU gathers the complete input X from all GPUs.
Q: In nanotron, what is the core difference between regular and asynchronous tensor parallel linear layers in terms of computation?
A:
- In regular column parallel, each rank only computes the portion of the output corresponding to its shard of weights. It does not compute the full output matrix.
- In asynchronous column parallel, each rank computes the entire output matrix locally using inputs and its shard of weights.
Q: What do before_shard and after_shard represent in asynchronous tensor parallel?
A:
+ before_shard is the portion of the output matrix that a rank can compute using input shards that come before its own input shard.
+ after_shard is the portion of the output matrix that a rank can compute using input shards that come after its own input shard.
For example, on rank 2 with input shards [X0, X1, X2, X3]:
+ before_shard = X0 * W0 + X1 * W1
+ after_shard = X3 * W3
Q: What is the core tradeoff between asynchronous and regular tensor parallelism?
A: Async trades off more floating point operations (FLOPs) for less communication.
It does more FLOPs by having each rank compute the full output matrix instead of just a partial shard. But it reduces communication by doing only a single collective communication. So async can improve performance if the model is communication bound, at the cost of increased FLOP requirements.
Q: Can you give a concrete example illustrating how asynchronous tensor parallelism works? (6 steps)
A:
- Step 1: Let's look at an example with 4 GPU ranks:
+ Input X sharded across ranks as [X0, X1, X2, X3]
+ Weight matrix W sharded as [W0, W1, W2, W3]
- Step 2: Rank 2 kicks off async all-gather to get [X0, X1, X2, X3]
- Step 3: While gathering, rank 2 computes: local_output = X2 * W2
- Step 4: All-gather completes, rank 2 has [X0, X1, X2, X3]
- Step 5: Rank 2 computes: before_local_output = X0 * W0 + X1 * W1, after_local_output = X3 * W3
- Step 6: Rank 2's output = before_local_output + local_output + after_local_output
So each rank computes the full output using the locally gathered X and its shard of W.
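To make the communication/compute overlap in steps 2-4 concrete, here is a minimal, hypothetical `torch.distributed` sketch (not nanotron's actual kernel); it assumes an initialized tensor-parallel group `tp_pg` and an input sharded along its first dimension:
```python
import torch
import torch.distributed as dist

def async_gather_then_matmul(x_shard: torch.Tensor, w_shard: torch.Tensor, tp_pg) -> torch.Tensor:
    world_size = dist.get_world_size(tp_pg)
    rank = dist.get_rank(tp_pg)
    s = x_shard.shape[0]
    gathered = torch.empty(world_size * s, x_shard.shape[1], device=x_shard.device, dtype=x_shard.dtype)
    # Step 2: kick off the all-gather of the input shards asynchronously.
    handle = dist.all_gather_into_tensor(gathered, x_shard, group=tp_pg, async_op=True)
    # Step 3: overlap the communication with the matmul on the local input shard.
    local_output = x_shard @ w_shard
    # Step 4: wait until the gathered input is available.
    handle.wait()
    # Steps 5-6: multiply the gathered input by the local weight shard and splice in the
    # already-computed local rows (a real kernel would only compute the missing before/after rows).
    full_output = gathered @ w_shard
    full_output[rank * s:(rank + 1) * s] = local_output
    return full_output
```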
#### Tied Linear
Q: Why does brrr have only a single rank save tied linear weights instead of all ranks?
A: Tied linear weights are replicated across ranks, meaning all ranks hold the same weight values. Having every rank save the tied weight would result in the same weight being saved multiple times redundantly. So brrr designates only one rank (such as rank 0) to save the weight, avoiding duplicate copies in the checkpoint.
Q: How does Nanotron detect tied parameters?
A: Nanotron has a base model class called NanotronModel. NanotronModel implements a common method for accessing tied parameters (called .get_tied_parameters()). When initializing the model, the Trainer calls this method to get a list of parameter names that should be tied.
For example, for a goose model, it may return ["lm_head.weight", "word_embeddings.weight"] indicating the lm head weight and word embedding weight should be tied.
Q: How does a tied linear layer differ from a regular parallel linear layer in nanotron?
A:
+ In a regular parallel linear layer, the weight matrix is sharded across ranks.
+ In a tied linear layer, the entire weight matrix is replicated on all ranks.
Q: What is the difference between a tied parameter and a regular parameter in nanotron?
A:
+ Tied parameters in nanotron are parameters that need to have their gradients synchronized (typically summed) across a specific set of ranks during training.
+ Regular parameters don't have any special synchronization requirements.
Q: When would you use tied parameters in a transformer model in nanotron?
A: Tied parameters should be used when the same weights are replicated in multiple layers of the transformer. A common example is tying the weights of the embedding layer and the final linear layer in the language modeling head.
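As a plain-PyTorch illustration (not nanotron's tied-parameter API), tying the embedding and LM head weights just means the two modules share a single tensor, which is why their gradient contributions have to be reduced together:
```python
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        # Tie the weights: both modules now share the exact same Parameter object,
        # so in a distributed setting the gradient contributions from both usages
        # (possibly living on different pipeline ranks) must be summed.
        self.lm_head.weight = self.word_embeddings.weight
```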
Q: What are the different types of linear layers in nanotron and how are they different?
A: Tied linear, tensor parallel linear, and async tensor parallel linear
### 2. Pipeline Parallelism
Q: What are the four core components in brrr’s pipeline parallelism?
A:
+ PipelineBlock: Contains model computation split up over devices.
+ PipelineEngine: Orchestrates the overall forward/backward passes across blocks.
+ PipelineBatchState: Stores all P2P operations
+ TensorPointer: Pointer to a tensor produced on a different device.
Q: How does PipelineEngine allow implementing different schedules like 1F1B or GPipe?
A: PipelineEngine has abstract methods like train_batch_iter and validate_batch_iter that are overridden by subclasses to implement different execution orderings.
For example, AllForwardAllBackward does all forwards first, then all backwards. 1F1B interleaves them, doing 1 forward then 1 backward. The specific scheduling logic is handled in these methods.
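As a rough, hypothetical illustration of those two orderings (the `run_forward`/`run_backward` callables stand in for nanotron's real per-microbatch logic; this is not the PipelineEngine API):
```python
from typing import Callable, List

def all_forward_all_backward(microbatches: List, run_forward: Callable, run_backward: Callable) -> None:
    # GPipe-style: run every microbatch's forward pass first, then every backward pass.
    activations = [run_forward(mb) for mb in microbatches]
    for act in reversed(activations):
        run_backward(act)

def one_forward_one_backward(microbatches: List, run_forward: Callable, run_backward: Callable, num_warmup: int) -> None:
    # 1F1B: after a warm-up of forwards, alternate one forward with one backward,
    # so at most `num_warmup + 1` sets of activations are alive at any time.
    in_flight: List = []
    for i, mb in enumerate(microbatches):
        in_flight.append(run_forward(mb))
        if i >= num_warmup:
            run_backward(in_flight.pop(0))
    while in_flight:
        run_backward(in_flight.pop(0))
```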
Q: What is the advantage of TensorPointer compared to directly sending activations after computation?
A: TensorPointers allow pipeline stages to represent tensors produced on other ranks and request them on demand when needed for computation. The key benefit is lazy communication: tensors are only transferred between processes when really needed, not all upfront.
TensorPointers also let us queue up a whole batch of communications to happen later, instead of blocking and communicating each tensor as it is needed.
Q: How do TensorPointers interact with other components in brrr? (4 steps)
A: TensorPointer is used to represent tensors that are not locally available on the current process. It contains metadata about which process rank actually holds the real tensor data.
+ Step 1: Block A runs on rank 0, produces output tensor X
+ Step 2: Block B runs on rank 1, needs X as input
+ Step 3: In Block B's forward, X is represented as a TensorPointer pointing to rank 0. To actually get the X tensor data, Block B uses the TensorPointer to send a request to rank 0 to receive X.
+ Step 4: Rank 0 receives the request, sends X to rank 1, which populates it into Block B's input
Similarly, if Block B produces an output Y that the next Block C on rank 2 needs, it will return Y wrapped in a TensorPointer pointing to rank 1.
Q: In the forward pass, how do the four core components in brrr's pipeline parallelism work together? (5 steps)
A:
- Step 1: PipelineEngine coordinates executing the PipelineBlocks for each microbatch.
- Step 2: PipelineBlockA runs on device A, producing an activation x. It returns {"x": TensorPointer(rank=A)}
- Step 3: PipelineBlockB runs on device B. It sees the TensorPointer for x, telling it to retrieve x from device A. PipelineBlockB tells PipelineBatchState to receive x from device A.
- Step 4: PipelineEngine triggers PipelineBatchState to run communication. PipelineBatchState executes the receive operation, getting x from device A.
- Step 5: PipelineBlockB retrieves x from PipelineBatchState's buffer and continues its computation.
Q: What are the three core components of brrr's P2P communication?
A:
- P2P class: Handles sending and receiving tensors between ranks.
- TensorMetaData: Stores tensor’s metadata like shape, dtype… to interpret raw tensor data.
- Communication buffers: Reusable buffers for sending metadata and tensor data.
Q: What is the difference between PipelineBatchState and BatchTensorSendRecvState?
A: PipelineBatchState orchestrates pipeline communication across microbatches during training or inference. BatchTensorSendRecvState handles sending/receiving generic tensors in a batch.
PipelineBatchState leverages BatchTensorSendRecvState under the hood for lower-level P2P communication but adds pipeline-specific logic on top like managing activations and gradients across stages.
Q: Why does the pipeline engine batch P2P communication? Isn't there only a single send or recv per microbatch at each clock cycle?
A: The pipeline engine batches P2P communication across microbatches, not within a microbatch. Within a microbatch there may be only a single send or receive between stages, but across microbatches the sends/receives can be batched.
For example, say we have a model with two pipeline stages, A and B. In microbatch 1, A sends tensor X to B. In microbatch 2, A sends tensor Y to B. Instead of sending X and Y in separate P2P operations, the pipeline engine will batch them together into one send of [X,Y].
Q: How does PipelineBlock's forward pass work? (4 steps)
A:
- Step 1: It receives inputs, which can be Tensors or TensorPointers from other ranks.
- Step 2: For any TensorPointer inputs, it uses P2P communication to fetch the actual tensor from the rank specified.
- Step 3: It runs the forward pass of the module it encapsulates, passing the tensors as inputs.
- Step 4: It returns a dict containing the outputs of the module. For ranks that didn't run this block, it returns TensorPointers instead of real tensors.
Q: How does a PipelineBlock decide to return a Tensor vs a TensorPointer? Explain
A: A PipelineBlock will return a TensorPointer if the block is running on a different pipeline rank from the one that is meant to output that tensor. Otherwise, it will return the actual Tensor.
For example, say PipelineBlockA produces output X and is assigned to pipeline rank 2.
+ When running on pipeline rank 2, PipelineBlockA will return the actual Tensor X.
+ But when running on rank 1 or 3, PipelineBlockA will return a TensorPointer to rank 2 rather than the actual Tensor X data.
Q: In 3D parallelism, how does Nanotron calculate the overall loss when each microbatch has a different loss value?
A:
- Step 1: Each microbatch has its own loss value
- Step 2: The losses for each microbatch are summed together
- Step 3: The total sum is averaged across data parallelism
This represents the mean loss across all microbatches in the global batch
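A minimal sketch of that reduction, assuming an already-initialized data-parallel group `dp_pg` (illustrative only, not nanotron's code):
```python
import torch
import torch.distributed as dist

def global_loss(microbatch_losses: list, dp_pg) -> torch.Tensor:
    # Step 2: sum the per-microbatch losses on this replica.
    loss = torch.stack(microbatch_losses).sum()
    # Step 3: average across the data-parallel group.
    dist.all_reduce(loss, op=dist.ReduceOp.SUM, group=dp_pg)
    return loss / dist.get_world_size(dp_pg)
```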
Q: What does PipelineBlock.rank represent?
A: PipelineBlock.rank specifies which pipeline parallel rank the block is assigned to. When initializing the model, each PipelineBlock's rank is set to place it on a particular pipeline rank.
For example, setting a block's rank to 2 means it will run on pipeline rank 2. The block's parameters will be instantiated on rank 2's device, and its forward pass will execute on rank 2.
Q: What do target_pp_ranks represent when initializing a nanotron model?
A:
target_pp_ranks specifies which subset of pipeline ranks the model should be built on. By default, the model is built on all pipeline ranks (0 to pp_size-1). But you can pass a custom list like [0, 2, 3] to build the model only on those ranks.
Concrete example: pp_size = 8, target_pp_ranks = [0, 4, 7]. This will build the model only on pipeline ranks 0, 4, and 7 out of the total 8 ranks. The intermediate ranks 1-3 and 5-6 will not have the model built on them.
#### Loading data in 3D parallelism
Q: In 3D parallelism, how does brrr sample training data for model replicas? (2 steps)
A: For example, with 2 devices, 4 microbatch size, and 100 samples:
- Step 1: It first divides the full dataset into equal chunks, one chunk per GPU.
+ Device 0 gets samples [0, 2, 4, .. 98]
+ Device 1 gets samples [1, 3, 5, .. 99]
- Step 2: Then within each GPU, samples are drawn sequentially to create micro-batches. The samples are accumulated into microbatches.
First microbatch:
+ Device 0 samples [0, 2, 4, 6]
+ Device 1 samples [1, 3, 5, 7]
Second microbatch:
+ Device 0 samples [8, 10, 12, 14]
+ Device 1 samples [9, 11, 13, 15]
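A hedged sketch of the round-robin sharding and sequential micro-batching described above (a hypothetical helper, not brrr's actual sampler):
```python
def shard_and_microbatch(num_samples: int, num_replicas: int, rank: int, micro_batch_size: int):
    # Step 1: this replica's shard is every `num_replicas`-th sample, starting at `rank`.
    shard = list(range(rank, num_samples, num_replicas))
    # Step 2: walk the shard sequentially, accumulating samples into microbatches.
    for start in range(0, len(shard), micro_batch_size):
        yield shard[start:start + micro_batch_size]

# e.g. list(shard_and_microbatch(100, 2, 0, 4))[0] == [0, 2, 4, 6]
```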
Q: In the BRRR dataloader, why are some tensor values replaced with TensorPointers?
A: Dataloader is designed to work with BRRR's pipeline parallelism. Certain tensors like the input ids and attention mask are only needed by the first pipeline stage. Other ranks don't need the actual tensors - a TensorPointer is just a placeholder.
For example, say rank 2 is where the model input is located. Dataloader will return:
+ Rank 2: {"input_ids": <actual tensor>}
+ Other ranks: {"input_ids": TensorPointer(group_rank=2)}
Q: Given a dataset with: 100,000 samples, 10 model replicas, Micro-batch size = 16, Consumed samples so far = 10,000
How does the MegatronPretrainingSampler work concretely? (4 steps)
A:
+ Step 1: Available samples = 100,000 - 10,000 = 90,000
+ Step 2: Each model replica gets a shard of 90,000 / 10 = 9,000 samples
+ Step 3: With a micro-batch size of 16, each worker draws indices 0-15, 16-31, etc. from its own shard (e.g. the shard covering samples 9,000-18,000)…
+ Step 4: Update consumed samples after each micro-batch of 16
Q: In 3D parallelism, what's the difference between sequential and random pretraining samplers?
A: For example, with 2 GPUs, 4 microbatch size, and 8 samples:
- Sequential sampler walks through its chunk sequentially.
+ GPU 0: [0, 2, 4, 6]
+ GPU 1: [1, 3, 5, 7]
- Random sampler shuffles its chunk each epoch before sampling.
+ GPU 0: [6, 4, 0, 2] // shuffled shard
+ GPU 1: [5, 7, 1, 3]
### 3. Distributed Serialization
Q: What are the five things saved in a brrr checkpoint?
A: Model weights, optimizer state, learning rate scheduler, random number generator state, and any other misc metadata required for restoring sharded weights
Q: What are the key differences when brrr saves the weights for the 3 types of parameters?
A:
+ Regular parameters: Just directly save the full tensor normally.
+ Sharded parameters: Only save the shard owned by the first model replica, to avoid redundancy across data parallelism.
+ Tied parameters: Only a rank in the tied group saves the weight.
Q: How does brrr reconstruct the full original unsharded tensor from the shards when loading a checkpoint?
A: When saving a sharded weight, brrr stores metadata about how the shards map to the original tensor. This includes:
Slices mapping info - Maps each shard's slice of the tensor to the corresponding slice in the original unsharded tensor. Like shard 1 covers unsharded tensor indices 0-50, etc.
During loading, BRRR uses this mapping to copy each shard into the right location in the unsharded tensor to reconstruct it.
- Step 1: Orig tensor A: [A1][A2][A3]
- Step 2: Checkpoint shards: A1 A2 A3
- Step 3: Loading:
+ A1 -> copy to indices 0-50 of A
+ A2 -> copy to indices 51-100 of A
+ A3 -> copy to indices 101-150 of A
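A hedged sketch of that copy-into-place step with made-up shard data (the real metadata lives in the `ShardedInfo`/`SlicesPair` structures described later in this document):
```python
import torch

def reconstruct(unsharded_shape, shards, global_slices):
    full = torch.empty(unsharded_shape)
    for shard, slices in zip(shards, global_slices):
        full[slices] = shard  # copy each shard into its slot of the unsharded tensor
    return full

# e.g. three shards of a length-150 vector
full = reconstruct(
    (150,),
    [torch.zeros(50), torch.ones(50), 2 * torch.ones(50)],
    [(slice(0, 50),), (slice(50, 100),), (slice(100, 150),)],
)
```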
Q: What are the three types of parameters that BRRR handles when saving checkpoints?
A: Regular parameters, sharded parameters, tied/replicated parameters
Q: How does brrr ensure all ranks start with the same initial random state for determinism? (3 steps)
A:
- Step 1: Rank 0 generates the initial state by seeding the RNG and grabbing the state tensor.
- Step 2: The state tensor is broadcast from rank 0 to all ranks.
- Step 3: Each rank loads the state tensor into its RNG.
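A minimal sketch of those three steps, assuming a CPU-capable backend (e.g. gloo) and that global rank 0 belongs to the group `pg` (illustrative only):
```python
import torch
import torch.distributed as dist

def sync_initial_rng_state(seed: int, pg) -> None:
    # Step 1: rank 0 seeds its RNG and grabs the state tensor.
    if dist.get_rank(pg) == 0:
        torch.manual_seed(seed)
    state = torch.random.get_rng_state()  # CPU RNG state as a ByteTensor
    # Step 2: broadcast rank 0's state to every rank in the group.
    dist.broadcast(state, src=0, group=pg)
    # Step 3: every rank loads the received state into its RNG.
    torch.random.set_rng_state(state)
```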
### 4. Trainer & Model Initialization
#### Trainer
Q: What's the main idea behind brrr’s model initialization?
A: The main idea is to initialize models directly on the device and datatype we want by overriding PyTorch's default initialization. For example, by default PyTorch may initialize weights on CPU and in fp32. brrr overrides this so we can initialize directly in target precision format on GPUs from the start.
Q: How does brrr’s model initialization context manager work? (3 steps)
A:
- Step 1: Enter context: Override nn.Module register methods and tensor creation functions
- Step 2: Inside context: Modules/tensors now use overridden methods, so they initialize directly on target device/dtype
- Step 3: Exit context: Restore original nn.Module methods and tensor creation functions
Q: Which two nn.Module methods does brrr override to implement its model initialization context manager? Explain
A: brrr overrides nn.Module.register_parameter() and nn.Module.register_buffer() which are called when modules register parameters and buffers during initialization.
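A greatly simplified, hypothetical sketch of the hook mechanism (the real context manager also overrides `register_buffer` and tensor creation functions so that the initial allocation itself happens on the target device):
```python
import contextlib
import torch
import torch.nn as nn

@contextlib.contextmanager
def init_on(device: torch.device, dtype: torch.dtype):
    old_register_parameter = nn.Module.register_parameter

    def register_parameter(module, name, param):
        old_register_parameter(module, name, param)
        if param is not None:
            # Re-create the parameter directly on the target device/dtype.
            module._parameters[name] = nn.Parameter(
                param.data.to(device=device, dtype=dtype),
                requires_grad=param.requires_grad,
            )

    nn.Module.register_parameter = register_parameter
    try:
        yield
    finally:
        nn.Module.register_parameter = old_register_parameter

# with init_on(torch.device("cuda"), torch.bfloat16):
#     linear = nn.Linear(1024, 1024)  # parameters end up on the GPU in bf16
```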
Q: What does kill switch do in Nanotron?
A: Kill switch is a file that the trainer periodically checks during training. If the kill switch file is detected, Trainer will:
+ Step 1: Save a checkpoint
+ Step 2: Exit training gracefully
Q: Why does brrr have the custom initialization context manager instead of just using module.to() to move models to the target device?
A: module.to() moves existing tensors to a new device. BRRR's custom initialization context manager initializes tensors directly on the target device to begin with. For example, if we want mixed precision on GPU from the start, the context manager will initialize weights in fp16 on the GPU, instead of initializing in fp32 on CPU then moving.
Q: In FP16 training, how does nanotron update the accumulated FP32 gradients when each parameter has an FP16 gradient? (4 steps)
A:
- Step 1: Each FP16 parameter has an associated FP32 gradient buffer allocated.
- Step 2: During backward, the FP16 gradients are accumulated into the FP32 buffer, instead of directly into the .grad attribute.
- Step 3: Before the optimizer step, nanotron copies the accumulated FP32 gradients into the .grad attribute of the FP32 copy of each parameter that will be updated.
- Step 4: The optimizer performs the update on the FP32 parameters.
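A toy, self-contained sketch of those four steps with stand-in tensors (not nanotron's gradient accumulator):
```python
import torch

# Step 1: an FP16 parameter, its FP32 master copy, and an FP32 gradient buffer.
param_fp16 = torch.nn.Parameter(torch.randn(4, 4, dtype=torch.float16))
param_fp32 = torch.nn.Parameter(param_fp16.detach().float())
fp32_grad_buffer = torch.zeros(4, 4, dtype=torch.float32)
optimizer = torch.optim.AdamW([param_fp32], lr=1e-4)

for _ in range(8):  # accumulation window (stand-in for real backward passes)
    fp16_grad = torch.randn(4, 4, dtype=torch.float16)
    fp32_grad_buffer += fp16_grad.float()        # Step 2: accumulate in FP32

param_fp32.grad = fp32_grad_buffer               # Step 3: hand the FP32 grads to the master copy
optimizer.step()                                 # Step 4: update the FP32 parameters
param_fp16.data.copy_(param_fp32.data)           # copy back to FP16 for the next forward (usual follow-up)
```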
#### Model Initialization
Q: In Nanotron, how does Trainer initialize a model from scratch using 3D parallelism? (5 steps)
A:
- Step 1: Create an instance of the model
- Step 2: Initialize parameters randomly (using model.init_model_randomly())
- Step 3: Mark tied parameters (using tie_parameters())
- Step 4: Sync model parameters across data parallelism with all_reduce
- Step 5: Sync tied parameters across their tied groups with all_reduce
Q: What is the high-level flow of BRRR's training loop? (3 steps) (ignore schedulers, logging…)
A:
- Step 1: Do a training step - run forward/backward pass through the model pipeline.
- Step 2: Check for kill switch file, exit if triggered.
- Step 3: Save checkpoint if current step matches interval.
Q: In 3D parallelism, how does Nanotron calculate the total number of parameters of a replica? (2 steps)
A:
- Step 1: Sum the parameters within each pipeline stage (across tensor parallelism) ⇒ The total params for that stage.
- Step 2: Sum the parameters across pipeline stages ⇒ The total model parameters
For example with 2 pipeline stages, 2 tensor parallel:
+ Stage 1: (TP0): 10 params, (TP1): 15 params. Sum = 25
+ Stage 2: (TP0): 20 params, (TP1): 25 params. Sum = 45
Total params = Stage 1 + Stage 2 = (10+15) + (20+25) = 25 + 45 = 70
Q: Why does BRRR need a kill switch to terminate training? Can't we just Ctrl-C or cancel the job?
A: Kill switch provides a graceful way to terminate training without losing progress:
+ Ctrl-C stops the process immediately, risking corrupted checkpoints.
+ Cancelling the job kills all processes abruptly.
The kill switch lets the trainer save a checkpoint safely before terminating.
Q: Why is there a second all-reduce after the first DP all-reduce during model initialization?
A: The first DP all-reduce syncs weights across data parallelism, but not within each replica. For example, it syncs embedding weights across DP ranks, but not between embeddings and lm_head within each rank. The second all-reduce specifically syncs tied weights like embeddings and lm_head within each replica.
For example, suppose we have: + [Embedding A1, LM Head A1], [Embedding A2, LM Head A2]
The first all-reduce makes
+ Embedding A1 == Embedding A2
+ LM Head A1 == LM Head A2
but not Embedding A1 == LM Head A1. The second all-reduce syncs Embedding A1 with LM Head A1, and Embedding A2 with LM Head A2.
Q: Why does BRRR issue an all-reduce across data parallelism dimension when initializing a model from scratch?
A: When initializing a model randomly, each replica (data parallel rank) can end up with different initial values due to randomness. The all-reduce (or an equivalent operation) syncs up these initial values across data parallelism, so each replica starts with the same initial weights.
For example, with 2 data parallel ranks:
+ Replica 1: Embedding weights initially [0.1, 0.3, 0.2]
+ Replica 2: Embedding weights initially [0.4, 0.1, 0.5]
After all-reduce, both will have the same initialized weights, say [0.25, 0.2, 0.35].
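One hedged way to realize that sync, averaging the randomly initialized values across the data-parallel group `dp_pg` (a broadcast from DP rank 0 would be an equivalent option):
```python
import torch
import torch.distributed as dist

def sync_params_across_dp(model: torch.nn.Module, dp_pg) -> None:
    world_size = dist.get_world_size(dp_pg)
    for p in model.parameters():
        # Average the randomly initialized values so every replica ends up identical.
        dist.all_reduce(p.data, op=dist.ReduceOp.SUM, group=dp_pg)
        p.data /= world_size
```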
Q: What are the 3 pretraining samplers in brrr?
A:
- Sequential sampler: Walks through each GPU's data shard sequentially
- Random sampler: Shuffles each GPU's shard before walking through it
- Cyclic sampler: After one pass through the dataset, loops back to the beginning
# Debugging FAQ:
When debugging you may run into errors of the sort:
![](image.png)
Don't get overwhelmed by the amount of information in the error message. The final error message is not very informative so we have to scroll a little bit up to find the actual error message. In this case, the error message is:
![](image-2.png)
which is a `ValueError` saying that the model on PP rank 1 has no gradient. This is a common error when using `torch.nn.parallel.DistributedDataParallel` (DDP): the part of the model placed on pipeline parallel (PP) rank 1 received no gradient. This can happen if the model is too small and no gradient is computed for that portion of the model. The solution is to increase the number of layers of the model or use a smaller PP size. In this case, the model is `LlamaForTraining` and the part on PP rank 1 is `LlamaModel`.
We could also decrease the model's vocab size from 50277 to 256 to allow a better partitioning of the pipeline blocks, which helps avoid the error message above.
# Doc on collective operations
This NVIDIA doc is nice on all collective operations (all_reduce, reduce_scatter, etc): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html
# Usage
We showcase usage in the `examples` directory.
# Key concepts
Let's go through some key concepts.
## ParallelContext
`ParallelContext` is the base class referencing all the process groups you might need when running parallel workloads. You can initialize it using the following:
```python
from nanotron.parallel import ParallelContext
# define your topology
parallel_context = ParallelContext(
tensor_parallel_size=2,
data_parallel_size=2,
pipeline_parallel_size=2
)
```
A `ProcessGroup` is a mechanism for running distributed collectives (`all-reduce`, `all-gather`, ...) on a subgroup of all the ranks. It provides the granularity needed for 3D parallelism.
From this dataclass you can access multiple process groups:
- `dp_pg`/`tp_pg`/`pp_pg`: This produces your typical process groups linked to 3D parallelism
- `world_pg`: ProcessGroup including all the processes.
- `world_rank_matrix`: This allows one to compute the world rank knowing the 3D ranks of a given process, or inversely when using `get_local_ranks`.
- `world_ranks_to_pg`: This is a more generic pattern that allows you to store custom sets of `ProcessGroup`s and query them via a list of world ranks.
## NanotronParameter
Given a specific computation workload, we can freely define how we distribute workloads. For example:
```python
import torch
from torch import nn
import torch.distributed as dist

# Example: let's assume you want to run a Linear without bias
batch_size = 4
hidden_size = 8

# Single-process way of running the computation
module = nn.Linear(hidden_size, hidden_size, bias=False)  # Parameters: [H, H]
input = torch.randn(batch_size, hidden_size)
output = module(input)

# Sharded ways of running the computation across `tp_pg` (`ProcessGroup`)
# Version 1: shard the output dimension, then all-gather the partial outputs
sharded_module = nn.Linear(hidden_size, hidden_size // tp_pg.size(), bias=False)
input = torch.randn(batch_size, hidden_size)
sharded_output = sharded_module(input)
output_shards = [torch.empty_like(sharded_output) for _ in range(tp_pg.size())]
dist.all_gather(output_shards, sharded_output, group=tp_pg)
output = torch.cat(output_shards, dim=-1)

# Version 2: shard the input dimension, then all-reduce the partial results
sharded_module = nn.Linear(hidden_size // tp_pg.size(), hidden_size, bias=False)
sharded_input = torch.randn(batch_size, hidden_size // tp_pg.size())
output = sharded_module(sharded_input)
dist.all_reduce(output, group=tp_pg)

# Version 3: shard the batch, all-gather the inputs, then run a duplicated workload
sharded_module = nn.Linear(hidden_size, hidden_size, bias=False)
sharded_input = torch.randn(batch_size // tp_pg.size(), hidden_size)
input_shards = [torch.empty_like(sharded_input) for _ in range(tp_pg.size())]
dist.all_gather(input_shards, sharded_input, group=tp_pg)
output = sharded_module(torch.cat(input_shards, dim=0))  # Duplicate workload

# Version ...
```
Distributed workloads tend to involve tradeoffs between duplicated computation and extra communication. There are multiple ways to run the same computation; what we can optimize is the amount of communication we do, as well as the amount of duplicated work. Sometimes it's worth duplicating work in order to reduce communication significantly.
As seen in the previous example, sometimes the parameters are sharded across multiple devices, and sometimes they are duplicated. In `nanotron`, we decided to attach this additional metadata to `nn.Parameter`. We call our new data structure `NanotronParameter`.
## Sharded parameter
A sharded parameter has the following metadata attached:
```python
@dataclasses.dataclass
class SlicesPair:
    local_slices: Tuple[slice, ...]
    global_slices: Tuple[slice, ...]

@dataclasses.dataclass
class ShardedInfo:
    # All world ranks involved in the sharding.
    global_ranks: Tuple[int, ...]
    # Which slices of the unsharded tensor (global_slices) the current sharded tensor's slices (local_slices) correspond to.
    local_global_slices_pairs: Tuple[SlicesPair, ...]
    # The shape of the unsharded tensor.
    unsharded_shape: Tuple[int, ...]
```
Imagine we shard a tensor `t` of shape [8, 64] across 2 ranks, 0 and 3, where rank 0 holds the first shard `t[:, :32]` and rank 3 holds the second shard `t[:, 32:]`. The `ShardedInfo` for each rank is then:
```python
shard_info = ShardedInfo(global_ranks=(0,3), local_global_slices_pairs=(SlicesPair(local_slices=(slice(0,8), slice(0, 32),), global_slices=(slice(0,8), slice(0, 32)),),), unsharded_shape=(8, 64)) # world rank 0
shard_info = ShardedInfo(global_ranks=(0,3), local_global_slices_pairs=(SlicesPair(local_slices=(slice(0,8), slice(0, 32),), global_slices=(slice(0,8), slice(32, 64)),),), unsharded_shape=(8, 64)) # world rank 3
```
## Tied parameter
This signifies that multiple occurrences of a given parameter are duplicated on multiple devices. Therefore we need a mechanism to keep them synced at all times. A typical example is the `lm_head` on top of transformers that's tied to the word embedding parameters. We attach the following metadata to the parameter:
```python
@dataclasses.dataclass
class TiedInfo:
    # We usually arbitrarily choose a name for the parameter, e.g. `lm_head.weight` or `wte.weight`.
    name: str
    # This allows us to define the scope in which `name` is valid.
    root_module: nn.Module
    # All world ranks involved in the tying.
    global_ranks: Tuple[int, ...]
    # In order to keep the parameter synced, we add a `reduce_op` value that defines which reduce operation is applied to the gradient.
    # None signifies that we do not reduce.
    reduce_op: Optional[dist.ReduceOp]
```
The most interesting field in this dataclass is `reduce_op`. Sometimes a duplicated workload removes the need to sync gradients, because the gradient computation already produces the correct gradient by design. A typical example of this is the classic TP implementation using `all-reduce`/`identity`.
Note: a parameter can be both sharded and tied; the two notions just have to involve different ranks. For example, `lm_head` and the word embeddings can be sharded across TP and tied between the first and the last PP ranks.
## Tensor parallelism
Usually the go-to solution when models can't fit within a device. The basic idea is to find patterns where a single workload can be divided into multiple smaller workloads that run in parallel. We mimic the tensor parallelism from Megatron-LM. Currently supported modules:
- ColumnLinear/RowLinear
- ParallelVocabulary
- Cross-Entropy over sharded logits
- Distributed samplers for generation
[Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) introduces that notion upon implementing one of the first large scale transformers:
![Tensor parallelism in transformer model](assets/tensor_parallel_in_transformer.png)
(Source: [link](https://arxiv.org/abs/1909.08053))
## Pipeline parallelism
We can view the neural network as a sequence of operations. Instead of splitting operations into smaller workloads that we can distribute, as before, we take contiguous chunks of the network and assign them to specific ranks. These chunks are inherently sequential. In order to still run them in parallel, we introduce schedulers that process different batches in parallel: rank 0 can be processing batch 1 while rank 1 is processing batch 0.
- Rank 0 starts processing batch 0
- Rank 0 finishes processing batch 0
- Rank 0 sends outputs to rank 1
- Rank 1 starts processing batch 0
- Rank 0 starts processing batch 1 (rank 1 and rank 0 are processing batches 0 and 1 in parallel)
- Rank 1 finishes processing batch 0
- Rank 0 finishes processing batch 1
### PipelineBlock
The core component of our pipeline engine is a `PipelineBlock`.
It acts as the granularity unit for all our pipeline engines: it lets us define a specific workload that needs to happen on a specific device, i.e. rank.
Other ranks run a dummy `forward` that returns `TensorPointer`s, which hold enough metadata to know where the output of the computation lives.
```python
@dataclass
class TensorPointer:
group_rank: int
```
Modules defined within a `PipelineBlock` can be instantiated directly on the specified device.
In short, what `PipelineBlock` does:
- Receives either a set of `torch.Tensor`/`TensorPointer` as input
- In case of a `TensorPointer`, queries the tensor from the rank specified in its state/context
- Runs the defined computation if the current rank is responsible for running it
- Returns a dictionary `Dict[str, Union[torch.Tensor, TensorPointer]]`
`TensorPointer` outputs are for ranks that didn't run the computation and need to know where the output of the computation lives.
```python
class PipelineBlock(nn.Module):
    def __init__(
        self,
        p2p,  # point-to-point communication class
        module_builder,  # module constructor, so that the module can be built lazily
        module_kwargs,  # module constructor arguments, so that the module can be built lazily
        module_input_keys,  # lets ranks that are not running the compute know the module input structure; also serves as a validation mechanism
        module_output_keys,  # metadata for ranks that are not running the compute to know the module output structure
    ):
        pass

# Example
# Lazy instantiation of a `nn.Linear`
model = PipelineBlock(
    p2p=p2p,
    module_builder=nn.Linear,
    module_kwargs={"in_features": 3, "out_features": 5},
    module_input_keys={"input"},
    module_output_keys={"output"},
)
model.build_and_set_rank(pp_rank)  # Instantiate model parameters on the device assigned to `pp_rank`
```
In order to define which rank a block runs on, we use the `build_and_set_rank` method. It attaches the rank as metadata and builds the module on that specific rank.
Models have to be defined using a "surface" of `PipelineBlock`s. Typically, above the `PipelineBlock` level it's all about defining the `PipelineBlock` computational directed acyclic graph; below it is where device-specific computation is defined.
As a non trivial example:
```python
class DummyModel(nn.Module):
def __init__(
self,
p2p: P2P,
):
super().__init__()
self.dense1 = PipelineBlock(
p2p=p2p,
module_builder=nn.Linear,
module_kwargs={"in_features": 10, "out_features": 10},
module_input_keys={"input"},
module_output_keys={"output"},
)
self.dense2 = PipelineBlock(
p2p=p2p,
module_builder=nn.Linear,
module_kwargs={"in_features": 10, "out_features": 10},
module_input_keys={"input"},
module_output_keys={"output"},
)
# Doesn't hold any parameter, but we have to specify where the computation happens.
self.loss = PipelineBlock(
p2p=p2p,
module_builder=lambda: lambda x: x.sum(),
module_kwargs={},
module_input_keys={"x"},
module_output_keys={"output"},
)
def forward(self, x: Union[torch.Tensor, TensorPointer]):
# x can be a `torch.Tensor` or a `TensorPointer` depending on the current rank, and where the pipeline blocks run their compute
x = self.dense1(input=x)["output"]
x = self.dense2(input=x)["output"]
x = self.loss(x=x)["output"]
return x
```
### Pipeline engine
We currently support two kinds of engines: `AllForwardAllBackward` and `OneForwardOneBackward`.
Pipeline engines are different schedules for the same set of workloads. A great illustration of the different schedules we support for training can be found in [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473). We support `All forward all backward` and `One forward one backward` (Figure 3 and the top of Figure 4).
![Pipeline engine](assets/pipeline_engine.png)
(Source: [link](https://arxiv.org/abs/2104.04473))
> **_IMPORTANT NOTE:_** When preparing your dataloader, make sure every tensor lives on a single rank, and other ranks must have `TensorPointer` to that rank. This is a requirement for the pipeline engine to work.
## ZeRO-1 optimizer
ZeRO stands for "Zero Redundancy Optimizer" (the same family of techniques is known as "FSDP" in PyTorch). The goal of such techniques is to shard tensors across multiple devices instead of duplicating them. Consequently it allows for significant memory gains at the cost of some communication overhead (with the potential to overlap computation and communication). Sharding is done across the data parallel dimension. There are three stages:
- `Stage 1`: The optimizer states are sharded.
- `Stage 2`: The gradients are sharded.
- `Stage 3`: The model weights are sharded.
As of now, we only support `stage 1`.
![ZeRO](assets/zero.png)
(Source: [link](https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/))
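A hedged, minimal sketch of the stage-1 idea, using parameter-level round-robin partitioning as one simple strategy (nanotron's actual ZeRO-1 optimizer may partition differently):
```python
from typing import List
import torch

def partition_parameters(params: List[torch.nn.Parameter], dp_rank: int, dp_size: int) -> List[torch.nn.Parameter]:
    # Round-robin assignment: each DP rank only builds optimizer states
    # (e.g. Adam moments) for its own subset of the parameters.
    return [p for i, p in enumerate(params) if i % dp_size == dp_rank]

# Each rank then constructs its optimizer over its own partition only, e.g.
#   torch.optim.AdamW(partition_parameters(list(model.parameters()), dp_rank, dp_size))
# and after `optimizer.step()` the updated shards are broadcast/all-gathered so that
# every rank sees the full set of updated weights again.
```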
# The awesome to have
## Recomputation utilities
Activation recomputation, also known as "activation checkpointing", is a memory-saving technique. PyTorch automatically stores a set of activations during the forward pass that are required for the backward computation. With large workloads, however, it might be worth recomputing specific activations in order to save memory. In `nanotron` we provide a decorator to implement this feature:
```python
class MyFancyModule(nn.Module):
def __init__(self):
...
self.do_checkpoint: bool = True
@checkpoint_method(attr_name="do_checkpoint")
def forward(self, x):
...
```
## On device initialization
The usual PyTorch module constructor instantiates weights on CPU and then moves them to GPU. This can blow up CPU memory, and it is also quite slow overall.
```python
with init_on_device_and_dtype(device=torch.device("cuda"), dtype=torch.bfloat16):
    module = MyFancyModule()  # This directly instantiates the model on your device

# If you want to bypass PyTorch's weight initialization mechanism
with init_on_device_and_dtype(device=torch.device("meta"), dtype=torch.bfloat16):
    module = MyFancyModule()
module.to_empty(device=torch.device("cuda"))  # bfloat16 model on GPU with weights not initialized (only the storage buffers are allocated)
```
## Unified API for logging
We provide a uniform API for logging, whether that's to TensorBoard, to stdout, or to the Hugging Face Hub:
```python
@dataclass
class LogItem:
tag: str
scalar_value: Union[float, int]
log_format: Optional[str] = None
```
All loggers need to implement a single method:
```python
class BaseLogger:
@abstractmethod
def add_scalars_from_list(self, log_entries: List[LogItem], iteration_step: int):
...
```
If you want to have tensorboard logger support: `pip install -e ".[tb-logger]"`.
If you want to have huggingface-hub tensorboard logger support: `pip install -e ".[hf-logger]"`.
## Random state handling primitives
We currently have a mechanism for holding an arbitrary number of `RandomState`s in a `RandomStates` container:
```python
class RandomState:
random
numpy
torch
torch_cuda
class RandomStates(MutableMapping[str, RandomState]):
    pass
```
At any time, we can get/set the current random state in the current context:
```python
def get_current_random_state():
# This gets the current random_state from the current context
pass
def set_random_state(random_state: RandomState):
# This sets random state in the current context
pass
```
In order to use a specific `RandomState` for specific operations, typically when you want to synchronize `nn.Dropout` across multiple ranks, you can use the `branch_random_state` context manager:
```python
def branch_random_state(random_states:RandomStates, key:str):
# Context manager which sets the random state associated with `key` when entering
# When exiting, we update the random state at `key` and restore previous random state.
pass
# Usage
random_states = RandomStates({"my_own_random_state": get_current_random_state()})
with branch_random_state(random_states, "my_own_random_state"):
output = nn.Dropout(0.1)(input)
```
Finally we provide a quick helper in order to get a synchronized random state across a process group.
```python
def get_synced_random_state(random_state: RandomState, pg: ProcessGroup):
    # This allows us to get a random state synchronized with the other ranks within a single group
    pass

# Usage
random_states = RandomStates({"tp_synced_random_state": get_synced_random_state(random_state=get_current_random_state(), pg=tp_pg)})
with branch_random_state(random_states, "tp_synced_random_state"):
    # Assuming that the input is synced across TP, all ranks will apply the same random mask.
    output = nn.Dropout(0.1)(input)
```
# Distributed serialization mechanism
We rely on compute nodes having access to a single shared filesystem.
We use `safetensors` to store our checkpoints.
Current format:
```python
checkpoint_metadata.json  # Stores version, topology, and other metadata that make the training resumable
optimizer
    optimizer_config.json  # Stores enough information to re-instantiate the optimizer this run uses.
    optimizer_tp-0-of-1_dp-0-of-1_pp-0-of-2.pt
    optimizer_tp-0-of-1_dp-0-of-1_pp-1-of-2.pt
lr_scheduler
    lr_scheduler_tp-0-of-1_dp-0-of-1_pp-0-of-2.pt
    lr_scheduler_tp-0-of-1_dp-0-of-1_pp-1-of-2.pt
random  # Stores random states from each process in order to resume training from that point on.
    tp-0-of-1_dp-0-of-1_pp-0-of-2.pt
    tp-0-of-1_dp-0-of-1_pp-1-of-2.pt
model
    dense1
        model_weight.safetensors
        model_bias.safetensors
    dense2
        model_weight.safetensors
        model_bias.safetensors
```
Some observations:
- Checkpoints are NOT topology agnostic; this is due to both `random_states` and `sharded` tensors.
Instead of trying to reconcile those and obtain a topology-agnostic checkpoint, we want to support a `checkpoint_reshape` method.
The motivations are the following:
- When training, one spends a LOT more time `saving` checkpoints than loading them, so having the fastest possible saving mechanism helps. Consequently, avoiding any distributed communication/locking helps here.
- Random states are not so easily reconcilable. Given the random states of two separate processes when TP=2, it's not obvious what the random state should be if we switch to TP=1.
- Optimizer states are aligned with parameters. It's usually the case that you can define an optimizer state for each parameter, but that's a limitation of the current serialization format.
# Current restrictions:
- `nn.Module` inside PipelineBlocks have to return a `Dict[str,torch.Tensor]` or `torch.Tensor`.
- No conditional control flow on top of the pipeline, or at least make sure that all the processes within a data parallel rank perform the same sequence of operations:
  - First, all but one process will be operating on `TensorPointer`s, which makes input-dependent control flow quite hard.
  - Second, if input-dependent control flow caused two processes within a single data parallel rank to diverge, you might end up with weird communication issues.
# Nanosets
Nanotron incorporates [`Nanosets`](../src/nanotron/data/nanoset.py), a dataset for processing tokenized documents with [`datatrove`](https://github.com/huggingface/datatrove). They allow reading tokens from one or multiple datasets and even specifying the weight of each dataset when building batches.
## Install
To use `Nanosets`, it's necessary to install Nanotron with the `nanosets` flavor.
```
pip install nanotron[nanosets]
```
This will install the following dependencies:
- `datatrove`: To preprocess the datasets
- `numba`: To compile helper functions in order to speed up the creation of `Nanosets`
- `transformers`: For the tokenizers
## Data pre-processing
To use this dataset, we first need to preprocess the data using `datatrove`'s `DocumentTokenizer` pipeline. We invite you to take a look at `datatrove`, since it contains multiple features that allow, for example, filtering out documents based on specific rules/criteria, extracting text content from raw formats, or scheduling the preprocessing on a Slurm cluster. We have also added a simple script capable of tokenizing datasets.
The preprocessing is done using the [`tools/preprocess_data.py`](../tools/preprocess_data.py) script. The input format can either be a Hugging Face Dataset, a path to a `.jsonl` or a path to a folder containing multiple `.jsonl` files. Below we show an example for processing a Hugging Face Dataset from the Hub with the Llama3 tokenizer.
<pre>
python3 tools/preprocess_data.py \
--tokenizer-name-or-path meta-llama/Meta-Llama-3-8B \
--output-folder datasets/emotion \
--n-tasks 16 \
hf \
--dataset dair-ai/emotion
</pre>
First, with `--tokenizer-name-or-path` we specify a tokenizer in the same way as we do when using `AutoTokenizer.from_pretrained(...)`. Then we specify the `--output-folder` where we will store the tokenized documents and the number of workers with `--n-tasks`. Finally, we indicate the type of dataset (whether it's a Hugging Face Dataset ["**hf**"] or in jsonl ["**jsonl**"] format) and the dataset that we want to preprocess. Check the different settings with `python3 tools/preprocess_data.py --help`, `python3 tools/preprocess_data.py hf --help` & `python3 tools/preprocess_data.py jsonl --help`.
Every worker will store 3 different kinds of files in `--output-folder`:
- `*.ds` Containing the tokenized documents
- `*.ds.index` Containing the bounds of each tokenized document
- `*.ds.metadata` Containing the number of tokens and tokenizer used
> [!IMPORTANT]
> Remember to specify the type of dataset to process, e.g. `python3 tools/preprocess_data.py --tokenizer-name-or-path gpt2 --n-tasks 16 jsonl --dataset raw_datasets/c4-es-json-files`.
## Working with Nanosets
To work with `Nanosets`, we just need to configure 1 argument:
1. `dataset_folder`: This argument specifies the file or files that will compose the `Nanoset`. There are 3 ways to specify it:
1. If we specify a single path, we will create a `Nanoset` from a single dataset file.
```yaml
data_stages:
- name: General purpose training (Single dataset)
start_training_step: 1
data:
dataset:
dataset_folder: datasets/SlimPajama-6B
num_loading_workers: 0
seed: 1234
```
2. If we specify a list of paths, we will create a `Nanoset` from all the dataset files. In every epoch we will consume each and every sample from each dataset randomly.
```yaml
data_stages:
- name: Second purpose training (> 1 dataset)
start_training_step: 15
data:
dataset:
dataset_folder:
- datasets/SlimPajama-6B
- datasets/testing_alpaca_small
num_loading_workers: 0
seed: 1234
```
3. If we specify a dictionary with paths and weights, we will create a `Nanoset` from the dataset files where each epoch will have a number of samples from each dataset according to the specified weights.
```yaml
data_stages:
- name: Third purpose training (Blended dataset)
start_training_step: 25
data:
dataset:
dataset_folder:
datasets/SlimPajama-6B: 0.8
datasets/testing_alpaca_small: 0.2
num_loading_workers: 0
seed: 1234
```
> [!IMPORTANT]
> Remember to set the `tokenizer.tokenizer_name_or_path` in the config file to the tokenizer used to preprocess the documents and set the `model.model_config.vocab_size` accordingly.
Finally, to use the `Nanosets`, launch the training with [`run_train.py`](../run_train.py).
```shell
torchrun --nproc-per-node 1 run_train.py --config examples/config_nanoset.yaml
```
## Under the hood
`Nanosets` are responsible for building samples of `sequence length + 1` tokens from the preprocessed dataset files. Although most of the extraction logic lies in `DatatroveFolderDataset`, `Nanosets` take care of the following:
1. Creating dataset mixtures from different dataset folder paths
2. Ensuring that in each epoch, we consume each sample only once
3. Ensuring that we never exhaust the `DataLoader`
Based on the `dataset lengths`, the `dataset weights` and the `number of samples per epoch` (defined as the `sum(dataset lengths)`), we build the two indexes we need in order to extract samples from the `Nanoset` ([build_nanoset_index_helper](../src/nanotron/data/nanoset.py)):
- `dataset index`: Contains the index of the dataset from the list of `dataset paths` from which to extract the sample, respecting the established dataset weight.
```
Given:
D = [d0, d1, d2, d3] # datasets
DL = [8, 2, 5, 5] # dataset lengths
W = [0.1, 0.5, 0.3, 0.1] # dataset weights
SPE = 20 # number of samples per epoch
Then, for example:
dataset_index = [1, 2, 0, 1, 3, 1, 2, 1, 2, 1, 0, 1, 2, 1, 3, 1, 2, 1, 2, 1]
```
- `dataset sample index`: Contains the sample index to extract from the `dataset index[index]` dataset, always < `len(dataset)`.
```
dataset_index = [1, 2, 0, 1, 3, 1, 2, 1, 2, 1, 0, 1, 2, 1, 3, 1, 2, 1, 2, 1]
dataset_sample_index = [0, 0, 0, 1, 0, 0, 1, 1, 2, 0, 1, 1, 3, 0, 1, 1, 4, 0, 0, 1]
```
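As a hedged illustration, such an index pair can be built greedily by always picking the dataset that is most under-represented relative to its weight (the real `build_nanoset_index_helper` is numba-compiled and may differ in detail):
```python
import numpy as np

def build_epoch_indexes(dataset_lengths, weights, samples_per_epoch):
    n_datasets = len(dataset_lengths)
    dataset_index = np.empty(samples_per_epoch, dtype=np.int64)
    dataset_sample_index = np.empty(samples_per_epoch, dtype=np.int64)
    counts = np.zeros(n_datasets, dtype=np.int64)
    for k in range(samples_per_epoch):
        # Pick the dataset that is currently most under-represented w.r.t. its weight.
        d = int(np.argmin(counts / max(k, 1) - np.asarray(weights)))
        dataset_index[k] = d
        dataset_sample_index[k] = counts[d] % dataset_lengths[d]  # always < len(dataset)
        counts[d] += 1
    return dataset_index, dataset_sample_index

# e.g. build_epoch_indexes([8, 2, 5, 5], [0.1, 0.5, 0.3, 0.1], 20)
```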
Then, we **shuffle with the same permutation both indexes** and concatenate them `number of epochs` times, which is defined by `train split num samples` / `number of samples per epoch`.
```
Given:
N = 70 # train split num samples
dataset_index = [1, 2, 0, 1, 3, 1, 2, 1, 2, 1, 0, 1, 2, 1, 3, 1, 2, 1, 2, 1]
dataset_sample_index = [0, 0, 0, 1, 0, 0, 1, 1, 2, 0, 1, 1, 3, 0, 1, 1, 4, 0, 0, 1]
Shuffle dataset_index and dataset_sample_index:
dataset_index = [1, 1, 0, 2, 3, 1, 3, 1, 2, 2, 1, 1, 0, 1, 1, 2, 1, 2, 2, 1]
dataset_sample_index = [1, 0, 0, 4, 1, 0, 0, 0, 2, 0, 0, 1, 1, 0, 1, 0, 1, 3, 1, 1]
n_concatenations = (70 // 20) + 1 = 4
dataset_index = dataset_index concatenated 4 times
dataset_sample_index = dataset_sample_index concatenated 4 times
dataset_index = dataset_index[: N]
dataset_sample_index = dataset_sample_index[: N]
```
To query the `Nanoset` for the k-th sample we do the following:
- Use the `dataset_index` to retrieve the corresponding dataset from `D` and the `dataset_sample_index` to retrieve the corresponding sample from that dataset.
```
sample = D[dataset_index[k]][dataset_sample_index[k]]
```
"""
Benchmarking script for the Llama-2-7b model
"""
import os
from nanotron.config import (
CheckpointsArgs,
Config,
DataArgs,
GeneralArgs,
LlamaConfig,
LoggingArgs,
LRSchedulerArgs,
ModelArgs,
OptimizerArgs,
ParallelismArgs,
PretrainDatasetsArgs,
RandomInit,
TokenizerArgs,
TokensArgs,
)
from nanotron.logging import human_format
# Config for a llama model with 6.74M parameters
model_config = LlamaConfig()
num_params = human_format(
model_config.vocab_size * model_config.hidden_size * 2
+ model_config.num_hidden_layers
* (
3 * model_config.hidden_size * model_config.intermediate_size
+ 4 * model_config.hidden_size * model_config.hidden_size
)
).replace(".", "p")
print(f"Model has {num_params} parameters")
seed = 42
learning_rate = LRSchedulerArgs(
learning_rate=3e-4, lr_warmup_steps=2, lr_warmup_style="linear", lr_decay_style="cosine", min_decay_lr=1e-5
)
optimizer = OptimizerArgs(
zero_stage=0,
weight_decay=0.01,
clip_grad=1.0,
accumulate_grad_in_fp32=True,
adam_eps=1e-08,
adam_beta1=0.9,
adam_beta2=0.95,
torch_adam_is_fused=True,
learning_rate_scheduler=learning_rate,
)
parallelism = ParallelismArgs(
dp=2,
pp=1,
tp=4,
pp_engine="1f1b",
tp_mode="REDUCE_SCATTER",
tp_linear_async_communication=True,
)
tokens = TokensArgs(sequence_length=8192, train_steps=5, micro_batch_size=1, batch_accumulation_per_replica=8)
dataset = PretrainDatasetsArgs(hf_dataset_or_datasets="stas/openwebtext-10k", text_column_name="text")
checkpoints_path = os.path.dirname(os.path.dirname(__file__)) + "/checkpoints"
os.makedirs(checkpoints_path, exist_ok=True)
config = Config(
general=GeneralArgs(project="bench", run="llama", seed=seed),
checkpoints=CheckpointsArgs(checkpoints_path=checkpoints_path, checkpoint_interval=1000),
parallelism=parallelism,
model=ModelArgs(init_method=RandomInit(std=0.025), model_config=model_config),
tokenizer=TokenizerArgs("meta-llama/Llama-2-7b-hf"),
optimizer=optimizer,
logging=LoggingArgs(),
tokens=tokens,
data=DataArgs(dataset=dataset, seed=seed),
profiler=None,
)
if __name__ == "__main__":
dir = os.path.dirname(__file__)
# Save config as YAML file
config.save_as_yaml(f"{dir}/config_llama.yaml")
# Launch training
os.system("export CUDA_DEVICE_MAX_CONNECTIONS=1")
gpus = config.parallelism.dp * config.parallelism.pp * config.parallelism.tp
os.system(f"torchrun --nproc_per_node={gpus} run_train.py --config-file {dir}/config_llama.yaml")
checkpoints:
checkpoint_interval: 10
checkpoints_path: checkpoints
checkpoints_path_is_shared_file_system: false
resume_checkpoint_path: null
save_final_state: false
save_initial_state: false
data_stages:
- data:
dataset:
dataset_overwrite_cache: false
dataset_processing_num_proc_per_process: 1
hf_dataset_config_name: null
hf_dataset_or_datasets: stas/openwebtext-10k
hf_dataset_splits: train
text_column_name: text
num_loading_workers: 1
seed: 42
name: Stable Training Stage
start_training_step: 1
- data:
dataset:
dataset_overwrite_cache: false
dataset_processing_num_proc_per_process: 1
hf_dataset_config_name: null
hf_dataset_or_datasets: stas/openwebtext-10k
hf_dataset_splits: train
text_column_name: text
num_loading_workers: 1
seed: 42
name: Annealing Phase
start_training_step: 10
general:
benchmark_csv_path: null
consumed_train_samples: null
ignore_sanity_checks: true
project: debug
run: tiny_llama_%date_%jobid
seed: 42
step: null
lighteval: null
logging:
iteration_step_info_interval: 1
log_level: info
log_level_replica: info
model:
ddp_bucket_cap_mb: 25
dtype: bfloat16
init_method:
std: 0.025
make_vocab_size_divisible_by: 1
model_config:
bos_token_id: 128000
eos_token_id: 128001
hidden_act: silu
hidden_size: 4096
initializer_range: 0.02
intermediate_size: 14336
is_llama_config: true
max_position_embeddings: 2048
num_attention_heads: 32
num_hidden_layers: 32
num_key_value_heads: 8
pad_token_id: null
pretraining_tp: 1
rms_norm_eps: 1.0e-05
rope_interleaved: false
rope_scaling: null
rope_theta: 500000.0
tie_word_embeddings: false
use_cache: true
vocab_size: 128256
optimizer:
accumulate_grad_in_fp32: true
clip_grad: 1.0
learning_rate_scheduler:
learning_rate: 0.0003
lr_decay_starting_step: null
lr_decay_steps: 13
lr_decay_style: cosine
lr_warmup_steps: 2
lr_warmup_style: linear
min_decay_lr: 1.0e-05
optimizer_factory:
adam_beta1: 0.9
adam_beta2: 0.95
adam_eps: 1.0e-08
name: adamW
torch_adam_is_fused: true
weight_decay: 0.01
zero_stage: 0
parallelism:
dp: 1
expert_parallel_size: 1
pp: 2
pp_engine: 1f1b
recompute_layer: false
tp: 2
tp_linear_async_communication: true
tp_mode: REDUCE_SCATTER
tp_recompute_allgather: true
profiler: null
s3_upload: null
tokenizer:
tokenizer_max_length: null
tokenizer_name_or_path: Meta-Llama-3.1-8B
tokenizer_revision: null
tokens:
batch_accumulation_per_replica: 1
limit_test_batches: 0
limit_val_batches: 0
micro_batch_size: 2
sequence_length: 2048
train_steps: 15
val_check_interval: -1
checkpoints:
checkpoint_interval: 10
checkpoints_path: checkpoints
checkpoints_path_is_shared_file_system: false
resume_checkpoint_path: null
save_final_state: false
save_initial_state: false
data_stages:
- data:
dataset:
dataset_overwrite_cache: false
dataset_processing_num_proc_per_process: 1
hf_dataset_config_name: null
hf_dataset_or_datasets: stas/openwebtext-10k
hf_dataset_splits: train
text_column_name: text
num_loading_workers: 1
seed: 42
name: Stable Training Stage
start_training_step: 1
- data:
dataset:
dataset_overwrite_cache: false
dataset_processing_num_proc_per_process: 1
hf_dataset_config_name: null
hf_dataset_or_datasets: stas/openwebtext-10k
hf_dataset_splits: train
text_column_name: text
num_loading_workers: 1
seed: 42
name: Annealing Phase
start_training_step: 10
general:
benchmark_csv_path: null
consumed_train_samples: null
ignore_sanity_checks: true
project: debug
run: tiny_llama_%date_%jobid
seed: 42
step: null
lighteval: null
logging:
iteration_step_info_interval: 1
log_level: info
log_level_replica: info
model:
ddp_bucket_cap_mb: 25
dtype: bfloat16
init_method:
std: 0.025
make_vocab_size_divisible_by: 1
model_config:
bos_token_id: 128000
eos_token_id: 128001
hidden_act: silu
hidden_size: 4096
initializer_range: 0.02
intermediate_size: 14336
is_llama_config: true
max_position_embeddings: 2048
num_attention_heads: 32
num_hidden_layers: 32
num_key_value_heads: 8
pad_token_id: null
pretraining_tp: 1
rms_norm_eps: 1.0e-05
rope_interleaved: false
rope_scaling: null
rope_theta: 500000.0
tie_word_embeddings: false
use_cache: true
vocab_size: 256
optimizer:
accumulate_grad_in_fp32: true
clip_grad: 1.0
learning_rate_scheduler:
learning_rate: 0.0003
lr_decay_starting_step: null
lr_decay_steps: 13
lr_decay_style: cosine
lr_warmup_steps: 2
lr_warmup_style: linear
min_decay_lr: 1.0e-05
optimizer_factory:
adam_beta1: 0.9
adam_beta2: 0.95
adam_eps: 1.0e-08
name: adamW
torch_adam_is_fused: true
weight_decay: 0.01
zero_stage: 0
parallelism:
dp: 1
expert_parallel_size: 1
pp: 2
pp_engine: 1f1b
recompute_layer: false
tp: 2
tp_linear_async_communication: true
tp_mode: REDUCE_SCATTER
tp_recompute_allgather: true
profiler: null
s3_upload: null
tokenizer:
tokenizer_max_length: null
tokenizer_name_or_path: robot-test/dummy-tokenizer-wordlevel
tokenizer_revision: null
tokens:
batch_accumulation_per_replica: 1
limit_test_batches: 0
limit_val_batches: 0
micro_batch_size: 2
sequence_length: 2048
train_steps: 15
val_check_interval: -1
checkpoints:
checkpoint_interval: 1000
checkpoints_path: checkpoints/
checkpoints_path_is_shared_file_system: false
resume_checkpoint_path: null
save_initial_state: false
data_stages:
- data:
dataset:
dataset_folder: datasets/c4-es/tokenized
num_loading_workers: 1
seed: 42
name: General purpose training (Single dataset)
start_training_step: 1
- data:
dataset:
dataset_folder:
- datasets/SlimPajama-6B/tokenized
- datasets/c4-es/tokenized
num_loading_workers: 1
seed: 42
name: Second purpose training (> 1 dataset)
start_training_step: 15
- data:
dataset:
dataset_folder:
datasets/SlimPajama-6B/tokenized: 0.8
datasets/c4-es/tokenized: 0.2
num_loading_workers: 1
seed: 42
name: Third purpose training (Blended dataset)
start_training_step: 25
general:
benchmark_csv_path: null
consumed_train_samples: null
ignore_sanity_checks: true
project: Nanoset
run: llama
seed: 42
step: null
lighteval: null
logging:
iteration_step_info_interval: 1
log_level: info
log_level_replica: info
model:
ddp_bucket_cap_mb: 25
dtype: bfloat16
init_method:
std: 0.025
make_vocab_size_divisible_by: 1
model_config:
bos_token_id: 1
eos_token_id: 2
hidden_act: silu
hidden_size: 16
initializer_range: 0.02
intermediate_size: 64
is_llama_config: true
max_position_embeddings: 1024
num_attention_heads: 4
num_hidden_layers: 2
num_key_value_heads: 4
pad_token_id: null
pretraining_tp: 1
rms_norm_eps: 1.0e-05
rope_scaling: null
tie_word_embeddings: true
use_cache: true
vocab_size: 50257
optimizer:
accumulate_grad_in_fp32: true
clip_grad: 1.0
learning_rate_scheduler:
learning_rate: 0.0003
lr_decay_starting_step: null
lr_decay_steps: 98
lr_decay_style: cosine
lr_warmup_steps: 2
lr_warmup_style: linear
min_decay_lr: 1.0e-05
optimizer_factory:
adam_beta1: 0.9
adam_beta2: 0.95
adam_eps: 1.0e-08
name: adamW
torch_adam_is_fused: true
weight_decay: 0.01
zero_stage: 0
parallelism:
dp: 1
expert_parallel_size: 1
pp: 1
pp_engine: 1f1b
tp: 1
tp_linear_async_communication: true
tp_mode: REDUCE_SCATTER
profiler: null
tokenizer:
tokenizer_max_length: null
tokenizer_name_or_path: gpt2
tokenizer_revision: null
tokens:
batch_accumulation_per_replica: 1
limit_test_batches: 0
limit_val_batches: 0
micro_batch_size: 2
sequence_length: 1024
train_steps: 200
val_check_interval: -1
""" Example python script to generate a YAML config file which can be used to run a training with nanotron. Refer to "examples" section in the `/README.md` for more information."""
import os
from nanotron.config import (
AdamWOptimizerArgs,
CheckpointsArgs,
Config,
DataArgs,
DatasetStageArgs,
GeneralArgs,
LlamaConfig,
LoggingArgs,
LRSchedulerArgs,
ModelArgs,
OptimizerArgs,
ParallelismArgs,
PretrainDatasetsArgs,
RandomInit,
TokenizerArgs,
TokensArgs,
)
from nanotron.logging import human_format
model_config = LlamaConfig(
# Config for a tiny model model with 1.62M parameters
bos_token_id=1,
eos_token_id=2,
hidden_act="silu",
hidden_size=16,
initializer_range=0.02,
intermediate_size=64,
max_position_embeddings=256,
num_attention_heads=4,
num_hidden_layers=2,
num_key_value_heads=4,
pretraining_tp=1,
rms_norm_eps=1e-05,
rope_scaling=None,
tie_word_embeddings=True,
use_cache=True,
vocab_size=256,
)
num_params = human_format(
model_config.vocab_size * model_config.hidden_size * 2
+ model_config.num_hidden_layers
* (
3 * model_config.hidden_size * model_config.intermediate_size
+ 4 * model_config.hidden_size * model_config.hidden_size
)
).replace(".", "p")
print(f"Model has {num_params} parameters")
seed = 42
learning_rate = LRSchedulerArgs(
learning_rate=3e-4, lr_warmup_steps=2, lr_warmup_style="linear", lr_decay_style="cosine", min_decay_lr=1e-5
)
optimizer = OptimizerArgs(
zero_stage=0,
weight_decay=0.01,
clip_grad=1.0,
accumulate_grad_in_fp32=True,
learning_rate_scheduler=learning_rate,
optimizer_factory=AdamWOptimizerArgs(
adam_eps=1e-08,
adam_beta1=0.9,
adam_beta2=0.95,
torch_adam_is_fused=True,
),
)
parallelism = ParallelismArgs(
dp=2,
pp=2,
tp=2,
pp_engine="1f1b",
tp_mode="REDUCE_SCATTER",
tp_linear_async_communication=True,
)
tokens = TokensArgs(sequence_length=256, train_steps=15, micro_batch_size=2, batch_accumulation_per_replica=1)
data_stages = [
DatasetStageArgs(
name="Stable Training Stage",
start_training_step=1,
data=DataArgs(
dataset=PretrainDatasetsArgs(hf_dataset_or_datasets="stas/openwebtext-10k", text_column_name="text"),
seed=seed,
),
),
DatasetStageArgs(
name="Annealing Phase",
start_training_step=10,
data=DataArgs(
dataset=PretrainDatasetsArgs(hf_dataset_or_datasets="stas/openwebtext-10k", text_column_name="text"),
seed=seed,
),
),
]
checkpoints_path = "./checkpoints"
os.makedirs(checkpoints_path, exist_ok=True)
config = Config(
general=GeneralArgs(project="debug", run="tiny_llama_%date_%jobid", seed=seed),
checkpoints=CheckpointsArgs(checkpoints_path=checkpoints_path, checkpoint_interval=10),
parallelism=parallelism,
model=ModelArgs(init_method=RandomInit(std=0.025), model_config=model_config),
tokenizer=TokenizerArgs("robot-test/dummy-tokenizer-wordlevel"),
optimizer=optimizer,
logging=LoggingArgs(),
tokens=tokens,
data_stages=data_stages,
profiler=None,
)
if __name__ == "__main__":
dir = os.path.dirname(__file__)
# Save config as YAML file
config.save_as_yaml(f"{dir}/config_tiny_llama.yaml")
# You can now train a model with this config using `/run_train.py`
checkpoints:
checkpoint_interval: 10
checkpoints_path: checkpoints
checkpoints_path_is_shared_file_system: false
resume_checkpoint_path: null
save_initial_state: false
data_stages:
- data:
dataset:
dataset_overwrite_cache: false
dataset_processing_num_proc_per_process: 1
hf_dataset_config_name: null
hf_dataset_or_datasets: stas/openwebtext-10k
hf_dataset_splits: train
text_column_name: text
num_loading_workers: 1
seed: 42
name: Stable Training Stage
start_training_step: 1
- data:
dataset:
dataset_overwrite_cache: false
dataset_processing_num_proc_per_process: 1
hf_dataset_config_name: null
hf_dataset_or_datasets: stas/openwebtext-10k
hf_dataset_splits: train
text_column_name: text
num_loading_workers: 1
seed: 42
name: Annealing Phase
start_training_step: 10
general:
benchmark_csv_path: null
consumed_train_samples: null
ignore_sanity_checks: true
project: debug
run: tiny_llama_%date_%jobid
seed: 42
step: null
lighteval: null
logging:
iteration_step_info_interval: 1
log_level: info
log_level_replica: info
model:
ddp_bucket_cap_mb: 25
dtype: bfloat16
init_method:
std: 0.025
make_vocab_size_divisible_by: 1
model_config:
bos_token_id: 1
eos_token_id: 2
hidden_act: silu
hidden_size: 16
initializer_range: 0.02
intermediate_size: 64
is_llama_config: true
max_position_embeddings: 256
num_attention_heads: 4
num_hidden_layers: 2
num_key_value_heads: 4
pad_token_id: null
pretraining_tp: 1
rms_norm_eps: 1.0e-05
rope_scaling: null
tie_word_embeddings: true
use_cache: true
vocab_size: 256
optimizer:
accumulate_grad_in_fp32: true
clip_grad: 1.0
learning_rate_scheduler:
learning_rate: 0.0003
lr_decay_starting_step: null
lr_decay_steps: 13
lr_decay_style: cosine
lr_warmup_steps: 2
lr_warmup_style: linear
min_decay_lr: 1.0e-05
optimizer_factory:
adam_beta1: 0.9
adam_beta2: 0.95
adam_eps: 1.0e-08
name: adamW
torch_adam_is_fused: true
weight_decay: 0.01
zero_stage: 0
parallelism:
dp: 1
expert_parallel_size: 1
pp: 2
pp_engine: 1f1b
tp: 2
tp_linear_async_communication: true
tp_mode: REDUCE_SCATTER
profiler: null
tokenizer:
tokenizer_max_length: null
tokenizer_name_or_path: robot-test/dummy-tokenizer-wordlevel
tokenizer_revision: null
tokens:
batch_accumulation_per_replica: 1
limit_test_batches: 0
limit_val_batches: 0
micro_batch_size: 2
sequence_length: 256
train_steps: 15
val_check_interval: -1