# Nanosets

Nanotron incorporates [`Nanosets`](../src/nanotron/data/nanoset.py), a dataset for processing tokenized documents with [`datatrove`](https://github.com/huggingface/datatrove). Nanosets can read tokens from one or multiple datasets and even let you specify the weight of each dataset when building batches.

## Install

To use `Nanosets`, it's necessary to install Nanotron with the `nanosets` flavor.

```
pip install nanotron[nanosets]
```

This will install the following dependencies:
- `datatrove`: To preprocess the datasets
- `numba`: To compile helper functions in order to speed up the creation of `Nanosets`
- `transformers`: For the tokenizers

## Data pre-processing

To use this dataset, we first need to preprocess the data with `datatrove`'s `DocumentTokenizer` pipeline. We invite you to take a look at `datatrove`, since it offers many features that allow, for example, filtering out documents based on specific rules/criteria, extracting text content from raw formats, or scheduling the preprocessing on a Slurm cluster. We have also added a simple script for tokenizing datasets.

The preprocessing is done with the [`tools/preprocess_data.py`](../tools/preprocess_data.py) script. The input can be a Hugging Face Dataset, a path to a `.jsonl` file, or a path to a folder containing multiple `.jsonl` files. Below we show an example of processing a Hugging Face Dataset from the Hub with the Llama 3 tokenizer.
```shell
python3 tools/preprocess_data.py \
       --tokenizer-name-or-path meta-llama/Meta-Llama-3-8B \
       --output-folder datasets/emotion \
       --n-tasks 16 \
       hf \
       --dataset dair-ai/emotion
```
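Under the hood, this script simply assembles a small `datatrove` pipeline (a reader followed by `DocumentTokenizer`). For illustration, a roughly equivalent standalone sketch is shown below; the exact reader and tokenizer parameter names (`dataset_options`, `tokenizer_name_or_path`, `eos_token`) are assumptions that may vary with the installed `datatrove` version, so treat it as a sketch rather than a drop-in replacement for the script.

```python
# Illustrative sketch of the datatrove pipeline behind tools/preprocess_data.py.
# Parameter names below are assumptions and may differ across datatrove versions.
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import HuggingFaceDatasetReader
from datatrove.pipeline.tokens import DocumentTokenizer

pipeline = [
    # Read documents from the Hub dataset; "text" is the column holding the raw text
    HuggingFaceDatasetReader(
        dataset="dair-ai/emotion",
        dataset_options={"split": "train"},
        text_key="text",
    ),
    # Tokenize the documents and write the *.ds / *.ds.index / *.ds.metadata files
    DocumentTokenizer(
        output_folder="datasets/emotion",
        tokenizer_name_or_path="meta-llama/Meta-Llama-3-8B",
        eos_token="<|end_of_text|>",  # Llama 3 end-of-text token
    ),
]

# 16 local workers, mirroring --n-tasks 16
LocalPipelineExecutor(pipeline=pipeline, tasks=16).run()
```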
First, with `--tokenizer-name-or-path` we specify the tokenizer, in the same way as we would for `AutoTokenizer.from_pretrained(...)`. Then we specify the `--output-folder` where the tokenized documents will be stored and the number of workers with `--n-tasks`. Finally, we indicate the type of dataset (whether it's a Hugging Face Dataset ["**hf**"] or in jsonl ["**jsonl**"] format) and the dataset that we want to preprocess. Check the different settings with `python3 tools/preprocess_data.py --help`, `python3 tools/preprocess_data.py hf --help` & `python3 tools/preprocess_data.py jsonl --help`.

Every worker will store 3 different kinds of files in `--output-folder`:
- `*.ds`: Containing the tokenized documents
- `*.ds.index`: Containing the bounds of each tokenized document
- `*.ds.metadata`: Containing the number of tokens and the tokenizer used

> [!IMPORTANT]
> Remember to specify the type of dataset to process, e.g. python3 tools/preprocess_data.py --tokenizer-name-or-path gpt2 --n-tasks 16 **jsonl** --dataset raw_datasets/c4-es-json-files

## Working with Nanosets

To work with `Nanosets`, we just need to configure 1 argument:
1. `dataset_folder`: This argument specifies the file or files that will compose the `Nanoset`. There are 3 ways to specify it:
   1. If we specify a single path, we will create a `Nanoset` from a single dataset file.

      ```yaml
      data_stages:
        - name: General purpose training (Single dataset)
          start_training_step: 1
          data:
            dataset:
              dataset_folder: datasets/SlimPajama-6B
            num_loading_workers: 0
            seed: 1234
      ```
   2. If we specify a list of paths, we will create a `Nanoset` from all the dataset files. In every epoch we will consume every sample from every dataset exactly once, in random order.

      ```yaml
      data_stages:
        - name: Second purpose training (> 1 dataset)
          start_training_step: 15
          data:
            dataset:
              dataset_folder:
              - datasets/SlimPajama-6B
              - datasets/testing_alpaca_small
            num_loading_workers: 0
            seed: 1234
      ```
   3. If we specify a dictionary with paths and weights, we will create a `Nanoset` from the dataset files in which each epoch draws a number of samples from each dataset according to the specified weights.

      ```yaml
      data_stages:
        - name: Third purpose training (Blended dataset)
          start_training_step: 25
          data:
            dataset:
              dataset_folder:
                datasets/SlimPajama-6B: 0.8
                datasets/testing_alpaca_small: 0.2
            num_loading_workers: 0
            seed: 1234
      ```

> [!IMPORTANT]
> Remember to set the `tokenizer.tokenizer_name_or_path` in the config file to the tokenizer used to preprocess the documents, and set `model.model_config.vocab_size` accordingly.

Finally, to use the `Nanosets`, launch the training with [`run_train.py`](../run_train.py).

```shell
torchrun --nproc-per-node 1 run_train.py --config-file examples/config_nanoset.yaml
```

## Under the hood

`Nanosets` are responsible for building samples of `sequence length + 1` tokens from the preprocessed dataset files. Although most of the extraction logic lies in `DatatroveFolderDataset`, `Nanosets` take care of the following:
1. Creating dataset mixtures from different dataset folder paths
2. Ensuring that in each epoch, we consume each sample only once
3. Ensuring that we never exhaust the `DataLoader`
Based on the `dataset lengths`, the `dataset weights` and the `number of samples per epoch` (defined as `sum(dataset lengths)`), we build the two indexes we need in order to extract samples from the `Nanoset` ([`build_nanoset_index_helper`](../src/nanotron/data/nanoset.py)):

- `dataset index`: Contains the index of the dataset, from the list of `dataset paths`, from which to extract each sample, respecting the established dataset weights.
```
Given:

D = [d0, d1, d2, d3]      # datasets
DL = [8, 2, 5, 5]         # dataset lengths
W = [0.1, 0.5, 0.3, 0.1]  # dataset weights
SPE = 20                  # number of samples per epoch

Then, for example:

dataset_index = [1, 2, 0, 1, 3, 1, 2, 1, 2, 1, 0, 1, 2, 1, 3, 1, 2, 1, 2, 1]
```
- `dataset sample index`: Contains the index of the sample to extract from the `dataset index[index]` dataset, always < `len(dataset)`.
```
dataset_index        = [1, 2, 0, 1, 3, 1, 2, 1, 2, 1, 0, 1, 2, 1, 3, 1, 2, 1, 2, 1]
dataset_sample_index = [0, 0, 0, 1, 0, 0, 1, 1, 2, 0, 1, 1, 3, 0, 1, 1, 4, 0, 0, 1]
```
Then, we **shuffle both indexes with the same permutation** and concatenate them `number of epochs` times, where the number of epochs is given by `train split num samples` / `number of samples per epoch`.
```
Given:

N = 70  # train split num samples

dataset_index        = [1, 2, 0, 1, 3, 1, 2, 1, 2, 1, 0, 1, 2, 1, 3, 1, 2, 1, 2, 1]
dataset_sample_index = [0, 0, 0, 1, 0, 0, 1, 1, 2, 0, 1, 1, 3, 0, 1, 1, 4, 0, 0, 1]

Shuffle dataset_index and dataset_sample_index:

dataset_index        = [1, 1, 0, 2, 3, 1, 3, 1, 2, 2, 1, 1, 0, 1, 1, 2, 1, 2, 2, 1]
dataset_sample_index = [1, 0, 0, 4, 1, 0, 0, 0, 2, 0, 0, 1, 1, 0, 1, 0, 1, 3, 1, 1]

n_concatenations = (N // SPE) + 1 = (70 // 20) + 1 = 4
dataset_index = dataset_index concatenated 4 times
dataset_sample_index = dataset_sample_index concatenated 4 times

dataset_index = dataset_index[: N]
dataset_sample_index = dataset_sample_index[: N]
```
To query the `Nanoset` for the k-th sample we do the following:
- Use the `dataset_index` to retrieve the corresponding dataset from `D` and the `dataset_sample_index` to retrieve the corresponding sample from that dataset.
```
sample = D[dataset_index[k]][dataset_sample_index[k]]
```
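To make the procedure above concrete, here is a simplified, NumPy-only sketch of how the two indexes can be built. It is illustrative only: the real `build_nanoset_index_helper` in [`nanoset.py`](../src/nanotron/data/nanoset.py) is numba-compiled and may differ in details such as the exact greedy rule used to honor the weights.

```python
import numpy as np

def build_nanoset_index(dataset_lengths, weights, train_split_num_samples, seed=1234):
    """Simplified sketch of the Nanoset index construction described above."""
    samples_per_epoch = sum(dataset_lengths)           # SPE = sum(dataset lengths)
    weights = np.asarray(weights, dtype=np.float64)
    weights = weights / weights.sum()                  # normalize the dataset weights

    dataset_index = np.empty(samples_per_epoch, dtype=np.int64)
    dataset_sample_index = np.empty(samples_per_epoch, dtype=np.int64)
    consumed = np.zeros(len(dataset_lengths), dtype=np.int64)  # samples drawn so far per dataset

    for i in range(samples_per_epoch):
        # Greedily pick the dataset that is most under-represented w.r.t. its target weight
        d = int(np.argmax(weights * (i + 1) - consumed))
        dataset_index[i] = d
        # Wrap around with modulo so the sample index always stays < len(dataset)
        dataset_sample_index[i] = consumed[d] % dataset_lengths[d]
        consumed[d] += 1

    # Shuffle both indexes with the same permutation so the (dataset, sample) pairs stay aligned
    perm = np.random.default_rng(seed).permutation(samples_per_epoch)
    dataset_index, dataset_sample_index = dataset_index[perm], dataset_sample_index[perm]

    # Concatenate enough epochs to cover the train split, then truncate to exactly N samples
    n_concatenations = train_split_num_samples // samples_per_epoch + 1
    dataset_index = np.tile(dataset_index, n_concatenations)[:train_split_num_samples]
    dataset_sample_index = np.tile(dataset_sample_index, n_concatenations)[:train_split_num_samples]
    return dataset_index, dataset_sample_index

# Example from above: 4 datasets with lengths [8, 2, 5, 5], weights [0.1, 0.5, 0.3, 0.1], N = 70
dataset_index, dataset_sample_index = build_nanoset_index([8, 2, 5, 5], [0.1, 0.5, 0.3, 0.1], 70)
# The k-th sample is then D[dataset_index[k]][dataset_sample_index[k]]
```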