# Data Pipeline
## Data pre-processing
Data preprocessing is built around the following classes:
1. `IndexedDatasetBuilder`
2. `IndexedDataset`
At the moment, an end-to-end data preprocessing implementation is left to the user. See the class docstring(s) for more details.
#### IndexedDatasetBuilder
The `IndexedDatasetBuilder` is capable of building and merging `IndexedDataset` instances.
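As a rough illustration of the build workflow, a new dataset might be written along the lines of the sketch below. The method names (`add_item`, `end_document`, `finalize`) and the constructor arguments shown are assumptions about the builder's interface and should be checked against the class docstring.
```
# Hypothetical usage sketch; method and argument names are assumptions.
import numpy
import torch
from megatron.core.datasets.indexed_dataset import IndexedDatasetBuilder

builder = IndexedDatasetBuilder("my-dataset.bin", dtype=numpy.int32)
for document in [[1, 2, 3], [4, 5, 6, 7]]:    # token ids, one list per document
    builder.add_item(torch.tensor(document))  # append one sequence to the .bin file
    builder.end_document()                    # record a document boundary in the index
builder.finalize("my-dataset.idx")            # write the .idx metadata file
```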
#### IndexedDataset
The `IndexedDataset` class is the lowest-level data interface in Megatron Core. Internally, an `IndexedDataset` instance references two binaries: the data file (`.bin`) contains document/sequence data and the index file (`.idx`) contains document/sequence metadata.
The index file stores dataset-level metadata first:
- The index header, for backward compatibility
- The index version, for backward compatibility
- A numeric code corresponding to the data type used to write data to the data file
- The number of sequences in the dataset
- The number of documents in the dataset
The index file stores document-level and sequence-level metadata second (a parsing sketch of the full layout follows this list):
- In order, the number of elements per sequence
- In order, the byte offset (pointer) per sequence
- In order, the consecutive sequence index range `[...)` per document
- In order, the mode per sequence (in the multimodal case)
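For illustration, the sketch below reads an `.idx` file according to the layout described above. The magic-string length, little-endian field widths, and array dtypes are assumptions and may not match the implementation exactly.
```
# Illustrative only: field widths, packing, and dtypes are assumptions.
import struct
import numpy as np

def read_index_file(idx_path, multimodal=False):
    with open(idx_path, "rb") as f:
        header = f.read(9)                                  # index header (backward compatibility)
        version = struct.unpack("<Q", f.read(8))[0]         # index version (backward compatibility)
        dtype_code = struct.unpack("<B", f.read(1))[0]      # numeric code for the data file's dtype
        sequence_count = struct.unpack("<Q", f.read(8))[0]  # number of sequences
        document_count = struct.unpack("<Q", f.read(8))[0]  # number of documents
        # Document-/sequence-level metadata, stored in order.
        sequence_lengths = np.frombuffer(f.read(4 * sequence_count), dtype=np.int32)
        sequence_pointers = np.frombuffer(f.read(8 * sequence_count), dtype=np.int64)
        document_indices = np.frombuffer(f.read(8 * document_count), dtype=np.int64)
        sequence_modes = (
            np.frombuffer(f.read(sequence_count), dtype=np.int8) if multimodal else None
        )
    return (header, version, dtype_code, sequence_lengths,
            sequence_pointers, document_indices, sequence_modes)
```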
## Data loading: construction
Building the data loaders is a distributed-aware process built around the following classes:
1. `BlendedMegatronDatasetConfig`
2. `BlendedMegatronDatasetBuilder`
3. `IndexedDataset`
4. `MegatronDataset`
5. `BlendedDataset`
See the class docstrings for more details.
#### BlendedMegatronDatasetConfig (extendable)
The `BlendedMegatronDatasetConfig` class parameterizes the `BlendedMegatronDatasetBuilder` and in turn the `MegatronDataset` and `BlendedDataset`.
Different training/inference regimes will require different extensions, e.g. the `GPTDatasetConfig`.
#### BlendedMegatronDatasetBuilder
The `BlendedMegatronDatasetBuilder` class builds the highest-level data interfaces in Megatron Core.
**NB:** All ranks should attempt to build the dataset via the `BlendedMegatronDatasetBuilder` or the program will hang. Which ranks follow through on their attempts can be controlled via the `BlendedMegatronDatasetConfig`.
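For illustration, the calling pattern looks roughly like the sketch below. The names used here (`GPTDatasetConfig`, the `is_built_on_rank` field, the constructor argument order, and `build()`) are assumptions about the interface and should be verified against the class docstrings.
```
# Hypothetical sketch of the builder calling pattern; names and signatures are assumptions.
from megatron.core.datasets.blended_megatron_dataset_builder import BlendedMegatronDatasetBuilder
from megatron.core.datasets.gpt_dataset import GPTDataset, GPTDatasetConfig

def is_dataset_built_on_rank() -> bool:
    # Example policy: build on every rank. A real policy might restrict the build
    # to the ranks that actually consume data, e.g. the first/last pipeline stages.
    return True

config = GPTDatasetConfig(
    is_built_on_rank=is_dataset_built_on_rank,  # assumed config field controlling which ranks build
    random_seed=1234,
    sequence_length=1024,
    blend=["path/to/dataset-prefix"],           # assumed blend specification
    split="969,30,1",
)

# Every rank enters the builder; ranks whose is_built_on_rank() returns False
# receive None instead of a dataset but still participate, so the collective
# construction does not hang.
train_ds, valid_ds, test_ds = BlendedMegatronDatasetBuilder(
    GPTDataset, [1000, 100, 100], config
).build()
```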
#### IndexedDataset
The `IndexedDataset` class is the lowest-level data interface in Megatron Core.
The `IndexedDataset` should already exist on disk before attempting to build any of the high-level data interfaces.
#### MegatronDataset (extendable)
The `MegatronDataset` abstract class is a high-level data interface in Megatron Core. It is an abstraction built upon the `IndexedDataset`.
Different training/inference regimes will require different extensions, e.g. the `GPTDataset`.
#### BlendedDataset
The `BlendedDataset` class is a high-level data interface in Megatron Core. It is an abstraction built upon the `MegatronDataset`.
The `BlendedDataset` is only necessary when a blend of multiple data distributions, i.e. multiple `MegatronDataset` instances, should contribute to a certain dataset split. The blend can be controlled via the `BlendedMegatronDatasetConfig`.
## Data loading: implementation
### GPTDataset
The `GPTDataset` is parameterized by the following variables: the underlying `IndexedDataset` instance `indexed_dataset`, the split indices `indexed_indices` (the contiguous subset of document or sequence indices used for training, validation, or testing), the number of samples `N`, the sequence length `S`, and the random seed `R`.
The `GPTDataset` creates three index mappings to facilitate lookup: (1) the document index, (2) the sample index, and (3) the shuffle index. A toy sketch of building all three follows the list below.
1. The document index _Do_idx_ is a 1-D array of length `E * |indexed_indices|` mapping from _i_ to a document index, where `E` corresponds to the minimum number of epochs such that `E * |indexed_indices| >= N`. The document index is shuffled according to `R`.
```
Given:
N = 15
indexed_indices = [5, 6, 7, 8, 9]
E = 3
Then, for example:
Do_idx = [8, 8, 9, 6, 7, 5, 8, 5, 6, 6, 5, 9, 7, 7, 9]
```
2. The sample index _Sa_idx_ is a 2-D array of shape `[N + 1, 2]` mapping from _j_ to pairs of (_i_, _Do_idx_[ _i_ ] offset). The rows _j_ and _j_ + 1 serve as the left and right bounds for the _j_-th sample.
```
Given:
S = 1024
Then, for example:
Sa_idx[0] = (0, 0)
Sa_idx[1] = (0, 1024) => Do_idx[0] has length greater than S
Sa_idx[2] = (1, 512) => Do_idx[0] has length 1536
Sa_idx[3] = (2, 0) => Do_idx[1] has length 1536
Sa_idx[4] = (5, 300) => Do_idx[2:5] are shorter documents relative to Do_idx[0:2]
Sa_idx[5] = (6, 24) => Do_idx[5] has length 1300
```
3. The shuffle index _Sh_idx_ is a 1-D array of length `N` mapping from _k_ to _j_. The shuffle index is shuffled according to `R`.
```
Given
N = 10
Then, for example:
Sh_idx = [4, 0, 2, 6, 1, 9, 5, 8, 7, 3]
```
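As a rough illustration of how the three indices fit together, the toy sketch below builds them from a hypothetical `sequence_lengths` lookup. It assumes the `E` document epochs provide enough tokens for `N` samples of length `S`; the real implementation builds and caches these indices differently.
```
# Toy sketch of the three GPTDataset lookup indices; simplified relative to the
# actual cached implementation.
import numpy as np

def build_indices(indexed_indices, sequence_lengths, N, S, R):
    rng = np.random.RandomState(R)

    # Document index: the minimum number of epochs E such that
    # E * |indexed_indices| >= N, shuffled according to R.
    E = -(-N // len(indexed_indices))  # ceiling division
    Do_idx = np.tile(indexed_indices, E)
    rng.shuffle(Do_idx)

    # Sample index: [N + 1, 2] array of (document index position, token offset)
    # pairs; rows j and j + 1 bound the j-th sample of length S.
    Sa_idx = np.zeros((N + 1, 2), dtype=np.int64)
    i, offset = 0, 0
    for j in range(1, N + 1):
        remaining = S
        while remaining > 0:
            available = sequence_lengths[Do_idx[i]] - offset
            if available > remaining:
                offset += remaining
                remaining = 0
            else:
                remaining -= available
                i, offset = i + 1, 0
        Sa_idx[j] = (i, offset)

    # Shuffle index: a random permutation of the N sample indices.
    Sh_idx = rng.permutation(N)

    return Do_idx, Sa_idx, Sh_idx
```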
To query the `GPTDataset` for the _k_-th sample, we do the following:
- Use the shuffle index to get the index _j_ into the sample index.
```
j = Sh_idx[k]
```
- Use the sample index to get the left and right sample-bounding indices into the document index and the starting token offset for each document.
```
i, offset = Sa_idx[j]
i_next, offset_next = Sa_idx[j + 1]
```
- Use the document index to retrieve `S` tokens from consecutive (in the document index) documents.
```
sample = []
if i == i_next:
    # The sample lies within a single document
    sample += indexed_dataset[Do_idx[i]][offset:offset_next]
else:
    # The sample spans consecutive documents in the document index
    sample += indexed_dataset[Do_idx[i]][offset:]
    sample += indexed_dataset[Do_idx[i + 1:i_next]]
    sample += indexed_dataset[Do_idx[i_next]][:offset_next]
```
To save time during initialization, each index is built/cached sequentially on one process rank and subsequently loaded in parallel on other process ranks. The cached indices are unique to a hash generated in the `MegatronDataset.__init__` function.
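The pattern is roughly the following, sketched here for a single index with an assumed initialized `torch.distributed` process group and an illustrative cache path; the actual implementation names and builds its caches differently.
```
# Generic sketch of the build-once / load-everywhere pattern for cached indices.
import os
import numpy as np
import torch

def get_shuffle_index(cache_path, N, seed):  # cache_path should end in ".npy"
    if torch.distributed.get_rank() == 0 and not os.path.exists(cache_path):
        # Build the index once and write it to the shared cache directory.
        np.save(cache_path, np.random.RandomState(seed).permutation(N))
    # Every other rank waits here until the cache has been written.
    torch.distributed.barrier()
    # All ranks then load (memory-map) the cached index in parallel.
    return np.load(cache_path, mmap_mode="r")
```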
### BlendedDataset
The `BlendedDataset` is parameterized by the following variables: the underlying `MegatronDataset` instances `D`, the weights `W` (one per dataset), and the size `S`. The `BlendedDataset` will draw samples from contributing datasets in proportion to the weights until achieving a composite dataset of the desired size. During each sampling step, we draw a single sample from the dataset which has the greatest sampling error.
The `BlendedDataset` creates two "blending" indices to facilitate lookup: (1) the dataset index and (2) the dataset sample index. A toy sketch of building both follows the list below.
1. The dataset index _Da_idx_ is a 1-D array of length `S` mapping from _i_ to a dataset index.
```
Given
D = [d0, d1, d2]
W = [1/2, 1/4, 1/4]
S = 4
Then, for example:
Da_idx = [0, 1, 2, 0]
```
2. The dataset sample index _Sa_idx_ is a 1-D array of length `S` mapping from _i_ to the sample index within dataset _Da_idx_[ _i_ ].
```
Given
Da_idx = [0, 1, 2, 0]
Then, for example:
Sa_idx = [0, 0, 0, 1]
```
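The two blending indices can be constructed with a simple greedy loop: at each step, the next sample is drawn from the dataset whose realized sample count most lags its weighted target. The sketch below reproduces the example above; the exact error criterion is a plausible reading of "greatest sampling error" and may differ from the implementation.
```
# Toy sketch of building the blending indices by greedy error minimization.
import numpy as np

def build_blending_indices(W, S):
    """W: per-dataset weights summing to 1; S: total number of samples."""
    W = np.asarray(W, dtype=np.float64)
    Da_idx = np.zeros(S, dtype=np.int64)   # which dataset each blended sample comes from
    Sa_idx = np.zeros(S, dtype=np.int64)   # index of that sample within its dataset
    counts = np.zeros(len(W), dtype=np.int64)
    for k in range(S):
        # Sampling error: how far each dataset's realized count lags its target.
        error = W * (k + 1) - counts
        d = int(np.argmax(error))
        Da_idx[k], Sa_idx[k] = d, counts[d]
        counts[d] += 1
    return Da_idx, Sa_idx

# With W = [1/2, 1/4, 1/4] and S = 4 this reproduces the example above:
# Da_idx = [0, 1, 2, 0], Sa_idx = [0, 0, 0, 1].
```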
To query the `BlendedDataset` for the _k_-th sample, we do the following:
- Use the dataset index to retrieve the corresponding dataset from `D` and the dataset sample index to retrieve the corresponding sample from that dataset.
```
sample = D[Da_idx[k]][Sa_idx[k]]
```
To save time during initialization, each index is built/cached sequentially on one process rank and subsequently loaded in parallel on other process ranks. The cached indices are unique to a hash generated in the `BlendedDataset.__init__` function.
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
from .config import RetroGPTChunkDatasets
from .query.multi_split_gpt_dataset import MultiSplitGPTDataset, MultiSplitGPTDatasetConfig
from .query.retro_dataset import get_retro_datasets
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
"""
Exports:
- Embedder: Base class for all Bert embedders.
- RetroBertEmbedders: Container class for in-memory and on-disk embedders.
- RetroPreprocessingConfig: Configuration class for all of Retro preprocessing.
- RetroGPTChunkDatasets: Container class for train, valid, and test datasets.
- RetroTokenizers: Container class for GPT and Bert tokenizers.
"""
from .bert_embedders import Embedder, RetroBertEmbedders
from .config import RetroPreprocessingConfig
from .gpt_chunk_datasets import RetroGPTChunkDatasets
from .tokenizers import RetroTokenizers
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
"""Container dataclass for holding both in-memory and on-disk Bert embedders."""
import abc
from dataclasses import dataclass
from typing import Any
import numpy as np
import torch
class Embedder(abc.ABC):
"""Base class for all Bert embedders.
All embedders should be able to embed either an entire text dataset (to a 2D
numpy array), or a single text string (to a 1D numpy array).
"""
@abc.abstractmethod
def embed_text_dataset(self, text_dataset: torch.utils.data.Dataset) -> np.ndarray:
"""Embed a text dataset.
Args:
text_dataset (torch.utils.data.Dataset): Text dataset to embed. Each sample of the text dataset should output a dict with a key 'text' and a string value.
Returns:
A 2D ndarray with shape (len(text_dataset), dimension(embedder)).
"""
@abc.abstractmethod
def embed_text(self, text: str) -> np.ndarray:
"""Embed a simple string of text.
Args:
text (str): A single text sample.
Returns:
A 1D ndarray with shape (dimensions(embedder),).
"""
@dataclass
class RetroBertEmbedders:
"""Container dataclass for in-memory and on-disk Bert embedders."""
disk: Embedder
mem: Embedder
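

# Illustrative example (not part of the module): a minimal concrete Embedder.
# The class below is hypothetical and only demonstrates the contract that the
# abstract methods impose; a real embedder would run a Bert model over the
# 'text' field of each dataset sample.
class DummyEmbedder(Embedder):
    """Toy embedder mapping text to fixed-size pseudo-random vectors."""

    def __init__(self, dim: int = 1024) -> None:
        self.dim = dim

    def embed_text(self, text: str) -> np.ndarray:
        # Deterministic per-string vector: seed a generator from the text hash.
        rng = np.random.RandomState(abs(hash(text)) % (2 ** 32))
        return rng.rand(self.dim).astype(np.float32)

    def embed_text_dataset(self, text_dataset: torch.utils.data.Dataset) -> np.ndarray:
        # Each dataset sample is a dict with a 'text' key, per the base class docstring.
        return np.stack(
            [self.embed_text(text_dataset[i]["text"]) for i in range(len(text_dataset))]
        )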
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
"""Retro preprocessing config."""
from dataclasses import dataclass
from megatron.core.transformer import TransformerConfig
from .bert_embedders import RetroBertEmbedders
from .gpt_chunk_datasets import RetroGPTChunkDatasets
from .tokenizers import RetroTokenizers
@dataclass
class RetroPreprocessingConfig(TransformerConfig):
"""Configuration object for Retro preprocessing.
*Note* : Arguments prefixed with '--retro-gpt-*' or '--retro-bert-*' are
included and named as such to more easily handle managing both models
running at the same time. Megatron is not optimized to run two models at
once, so this naming convention makes it clearer.
Args:
retro_project_dir (str): Retro project directory, which contains the preprocessed data for for pretraining. This directory is built during preprocessing (see tools/retro/README.md), and contains subdirectories for the chunk database and pretraining neighbors.
retro_tasks (str): Comma-separated list of tasks to run. Run entire preprocesing pipeline by using '--retro-tasks build'. Alternatively, run individual stages with tasks (in this order) 'db-build', 'index-build', or 'query-pretraining-neighbors'. For example, '--retro-tasks db-build,index-build,query-pretraining-neighbors' is equivalent to '--retro-tasks build'; or the argument can contain a subset of these tasks. Stages must always be run in the correct order (listed above).
retro_task_validate (float): If defined, validate a randomly sampled subset of the existing results of the given task. Each task implements a 'validate' method that is responsible for sampling a `retro_task_validate` fraction of the existing results, and then checking for bitwise equality with the current code base. (E.g., `--retro-task-validate 0.01`.)
retro_block_size (int): Number of chunks to process at a time when generating Bert embeddings and querying the search index. Partial results for each block are generally saved to disk in separate files.
retro_doc_block_size (int): Number of documents to processe at time when processing token datasets into chunk databases. The partial chunk database for each block is saved into a separate file.
retro_gpt_seed (int): Random seed used for python, numpy, pytorch, and cuda.
retro_gpt_data_path (str): Path to the training dataset. Accepted format: 1) a single data path, 2) multiple datasets in the form: dataset1-weight dataset1-path dataset2-weight dataset2-path ... It is used with --split when a single dataset used for all three: train, valid and test. It is exclusive to the other --*-data-path args.
retro_gpt_data_cache_path (str): Path to a directory to hold cached index files.
retro_gpt_split (str): Comma-separated list of proportions for training, validation, and test split. For example the split `90,5,5` will use 90%% of data for training, 5%% for validation and 5%% for test.
retro_gpt_train_samples (int): Total number of samples to train over all training runs.
retro_gpt_eval_interval (int): GPT evaluation interval.
retro_gpt_eval_iters (int): GPT evaluation iterations.
retro_gpt_tokenizer_type (str): GPT tokenizer type.
retro_gpt_tokenizer_model (str): GPT tokenizer model file.
retro_gpt_vocab_file (str): GPT vocab file.
retro_gpt_merge_file (str): GPT merge file.
retro_gpt_seq_length (int): GPT sequence length.
retro_gpt_global_batch_size (int): GPT global batch size.
retro_gpt_chunk_length (int): GPT chunk length.
retro_bert_tokenizer_type (str): Bert tokenizer type (for when using '--bert-embedder-type megatron').
retro_bert_vocab_file (str): Bert vocab file.
retro_bert_batch_size (int): Micro-batch size for processing Bert embeddings.
retro_bert_max_chunk_length (int): Maximum sequence length for Bert embeddings. (Named 'chunk' here in reference to these Bert sequences being converted from GPT chunks.)
retro_index_type (str): A 'faiss-base' index is a simple, un-optimized wrapper around a Faiss index. A 'faiss-par-add' index optimizes the 'add()' method by making it multi-node and multi-process, but with bit-wise equivalent results.
retro_index_str (str): Index string used for calling faiss.index_factory(). For example, 'IVF262144_HNSW32,Flat' or 'OPQ32_256,IVF4194304_HNSW32,PQ32'.
retro_index_ntrain (int): Number of database chunks to use for training the index. This value must be less or equal to the total number of chunks in the database.
retro_index_train_load_fraction (float): Fraction of sampled chunks to use for training the index. Useful when our total sampled embeddings use too much memory; lowering the load fraction is less costly than re-embedding a new sampled dataset from scratch.
retro_index_add_load_fraction (float): Fraction of database chunks to use for adding to the index. Useful when our total index size would use too much memory; lowering the load fraction is less costly than re-designing our token datasets.
retro_index_delete_training_embeddings (bool): Delete training embeddings for the search index. Useful for debugging.
retro_index_delete_added_codes (bool): Delete added codes for the search index. Useful for debugging.
retro_query_ef_search (int): Index ef-search parameter for Hierarchical Navigable Small Worlds (HNSW) during querying.
retro_query_nprobe (int): Index nprobe parameter for Inverted File (IVF) during querying.
retro_query_num_neighbors_query (int): Number of neighbors to retrieve when calling index.search().
retro_query_num_neighbors_save (int): Number of neighbors to save to disk after the index's returned neighbors. If longer than target value, neighbors truncated; and if shorter than target value, neighbors are padded with -1's.
retro_bert_embedders (RetroBertEmbedders): Set of Bert embedders used for embedding chunks. Contains entries: 1) 'mem' for an in-memory embedder, and 2) 'disk' for an embedder that saves results in blocks to disk.
retro_gpt_chunk_datasets (RetroGPTChunkDatasets): GPT datasets for 'train', 'valid', and 'test'.
retro_tokenizers (RetroTokenizers): GPT ('gpt') and Bert ('bert') tokenizers.
"""
    # Basic.
    retro_project_dir: str = None
    retro_tasks: str = 'build'
    retro_task_validate: float = None
    retro_block_size: int = 100000
    retro_doc_block_size: int = 100000

    # GPT.
    retro_gpt_seed: int = 1234
    retro_gpt_data_path: list = None  # basic list here, for parsing purposes
    retro_gpt_data_cache_path: str = None
    retro_gpt_split: str = '969,30,1'
    retro_gpt_train_samples: int = None
    retro_gpt_eval_interval: int = None
    retro_gpt_eval_iters: int = None
    retro_gpt_tokenizer_type: str = None
    retro_gpt_tokenizer_model: str = None
    retro_gpt_vocab_file: str = None
    retro_gpt_merge_file: str = None
    retro_gpt_seq_length: int = None
    retro_gpt_global_batch_size: int = None
    retro_gpt_chunk_length: int = 64

    # Bert.
    retro_bert_tokenizer_type: str = None
    retro_bert_vocab_file: str = None
    retro_bert_batch_size: int = 128
    retro_bert_max_chunk_length: int = 256

    # Index.
    retro_index_type: str = 'faiss-par-add'
    retro_index_str: str = None
    retro_index_ntrain: int = None
    retro_index_train_load_fraction: float = 1.0
    retro_index_add_load_fraction: float = 1.0
    retro_index_delete_training_embeddings: bool = True
    retro_index_delete_added_codes: bool = True

    # Query.
    retro_query_ef_search: int = 256
    retro_query_nprobe: int = 65536
    retro_query_num_neighbors_query: int = 200
    retro_query_num_neighbors_save: int = 20

    # Tools.
    retro_bert_embedders: RetroBertEmbedders = None
    retro_gpt_chunk_datasets: RetroGPTChunkDatasets = None
    retro_tokenizers: RetroTokenizers = None
    def __post_init__(self) -> None:
        """Validate Retro config."""

        # Validate required attributes.
        assert self.retro_project_dir is not None
        assert self.retro_tasks is not None
        assert self.retro_gpt_data_path is not None or self.retro_gpt_data_cache_path is not None
        assert self.retro_gpt_train_samples is not None
        assert self.retro_gpt_eval_interval is not None
        assert self.retro_gpt_eval_iters is not None
        assert self.retro_gpt_tokenizer_type is not None
        assert self.retro_gpt_tokenizer_model is not None or (
            self.retro_gpt_vocab_file is not None and self.retro_gpt_merge_file is not None
        )
        assert self.retro_gpt_seq_length is not None
        assert self.retro_gpt_global_batch_size is not None
        assert self.retro_bert_tokenizer_type is not None
        assert self.retro_bert_vocab_file is not None
        assert self.retro_index_str is not None
        assert self.retro_index_ntrain is not None

        # Split retro tasks.
        self.retro_tasks = self.retro_tasks.split(",")
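

# Illustrative instantiation (not part of the module). The values below are
# placeholders, and the inherited TransformerConfig fields shown are assumptions
# about what that base class expects; only the fields checked in __post_init__
# above are known to be required here.
if __name__ == "__main__":
    example_config = RetroPreprocessingConfig(
        # Inherited TransformerConfig fields (placeholder values).
        num_layers=12,
        hidden_size=768,
        num_attention_heads=12,
        # Fields required by __post_init__.
        retro_project_dir="/path/to/retro/project",
        retro_gpt_data_path=["/path/to/gpt/dataset-prefix"],
        retro_gpt_train_samples=100000,
        retro_gpt_eval_interval=1000,
        retro_gpt_eval_iters=100,
        retro_gpt_tokenizer_type="GPT2BPETokenizer",
        retro_gpt_vocab_file="/path/to/gpt-vocab.json",
        retro_gpt_merge_file="/path/to/gpt-merges.txt",
        retro_gpt_seq_length=2048,
        retro_gpt_global_batch_size=256,
        retro_bert_tokenizer_type="BertWordPieceLowerCase",
        retro_bert_vocab_file="/path/to/bert-vocab.txt",
        retro_index_str="OPQ32_256,IVF4194304_HNSW32,PQ32",
        retro_index_ntrain=1000000,
    )
    # After __post_init__, retro_tasks has been split into a list, e.g. ['build'].
    print(example_config.retro_tasks)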
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
"""Container dataclass for GPT chunk datasets (train, valid, and test)."""
from dataclasses import dataclass
@dataclass
class RetroGPTChunkDatasets:
"""Container dataclass for GPT chunk datasets."""
# Each dict contains 'dataset', 'neighbor_dir', and 'num_active_chunks'.
train: dict = None
valid: dict = None
test: dict = None
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
"""Container class for GPT and Bert tokenizers."""
from dataclasses import dataclass
from megatron.core.datasets.megatron_tokenizer import MegatronTokenizer
@dataclass
class RetroTokenizers:
"""Container class for GPT and Bert tokenizers."""
gpt: MegatronTokenizer = None
bert: MegatronTokenizer = None
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
"""
Exports:
- build_db: Build a chunk database from a list of indexed datasets.
"""
from .build import build_db
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
"""
Exports:
- train_index: Train an index on representative vectors.
- add_to_index: Add vectors to a trained index.
- build_index: Wrapper function that calls above two functions.
"""
from .build import add_to_index, build_index, train_index