The text dataset consisting of tokens converted from the 'train' chunk database. These are the chunks used for retrieval by the pretraining 'train' dataset.
Index type generally read from argument '--retro-index-ty'.
"""
@classmethod
def get_index_class(cls, index_type: str) -> type:
"""Get an index class, given a type string.
Args:
index_type (str): One of 'faiss-base' (naive Faiss index wrapper) or 'faiss-par-add' (Faiss index wrapper with near embarrassingly parallel index.add()).
Returns:
An `Index` sub-type corresponding to the `index_type`.
index_type (str): One of 'faiss-base' (naive Faiss index wrapper) or 'faiss-par-add' (Faiss index wrapper with near embarrassingly parallel index.add()).
Returns:
An `Index` instance corresponding to the `index_type`.
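# A minimal, self-contained sketch of the dispatch described in the docstrings
# above; it is not the actual Megatron implementation. The Index subclasses are
# replaced by empty stand-ins so that only the type-string dispatch is shown.
class Index: ...
class FaissBaseIndex(Index): ...          # naive Faiss index wrapper
class FaissParallelAddIndex(Index): ...   # parallel index.add() wrapper

class IndexFactorySketch:

    @classmethod
    def get_index_class(cls, index_type: str) -> type:
        classes = {
            "faiss-base": FaissBaseIndex,
            "faiss-par-add": FaissParallelAddIndex,
        }
        if index_type not in classes:
            raise ValueError(f"unknown index type '{index_type}'")
        return classes[index_type]

    @classmethod
    def get_index(cls, index_type: str) -> Index:
        # Instantiate the class selected above.
        return cls.get_index_class(index_type)()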
block (dict): Range information specifying the start/end indices within the encoded text dataset. Here, the 'path' item is used for writing the encodings to storage.
codes (np.ndarray): Block of encodings to be saved to storage.
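# Hedged sketch of writing a block of encodings to the block's HDF5 file; the
# dataset name "data" and the helper name are assumptions, not necessarily what
# the real pipeline uses.
import h5py
import numpy as np

def save_block_codes(block: dict, codes: np.ndarray) -> None:
    # block["path"] is the '{start_idx}-{end_idx}.hdf5' file for this range.
    with h5py.File(block["path"], "w") as f:
        f.create_dataset("data", data=codes)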
"""Get sample, including represented document IDs.
Args:
idx (int): Sample index.
Returns:
A sample, which contains both the chunk-length token sample ('text') and all document IDs ('doc_ids') contained within the full `sequence_length` sample.
"""
# Convert global chunk index to global sample index & local chunk index.
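# Sketch of that conversion under the assumption that each sequence_length
# sample is split into sequence_length // chunk_length fixed-length chunks
# (helper name is hypothetical).
def to_sample_and_local_chunk(idx: int, sequence_length: int, chunk_length: int):
    n_chunks_per_sample = sequence_length // chunk_length
    sample_idx, local_chunk_idx = divmod(idx, n_chunks_per_sample)
    return sample_idx, local_chunk_idx

# e.g., with sequence_length=2048 and chunk_length=64, global chunk index 130
# maps to sample 4, local chunk 2.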
"""Configuration object for Megatron Core blended and Retro datasets.
Args:
return_document_ids (bool): Whether to return the document ids when querying the dataset. Turn this option on during preprocessing.
split_preprocessing (str): The Retro preprocessing split string. It follows the same pattern convention as 'split'. Not to be used with 'blend_per_split'.
"""
return_document_ids: bool = None
split_preprocessing: str = None
def __post_init__(self) -> None:
"""Validate config attributes."""
super().__post_init__()
assert self.split is not None, "the Retro data pipeline does not support 'blend_per_split'"
assert self.return_document_ids is not None, "this attribute must be user defined"
assert self.split_preprocessing is not None, "this attribute must be user defined"
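# Hedged usage sketch with hypothetical values (the many fields inherited from
# the GPT dataset config base class are omitted): 'split_preprocessing' follows
# the same "train,valid,test" weight-string convention as 'split'.
retro_config_kwargs = dict(
    split="98,2,0",                # split used at training time
    split_preprocessing="98,2,0",  # split used when the Retro database was preprocessed
    return_document_ids=True,      # must be on during preprocessing
)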
query_dataset (GPTChunkDataset): GPT chunk dataset to be queried.
num_active_chunks (int): The 'active' chunks are the subset of the GPT chunk dataset that aren't being queried. This argument is used when validating the correctness of a subset of the GPT chunk dataset.
prefix (str): Extra string for logging progress.
neighbor_dir (str): File path to directory for saving neighbor IDs.
index (Index): Vector index populated with chunk database indices.
"""
def validate(f: h5py.File) -> None:
"""Validation method for validating saved neighbor IDs.
gpt_datasets (dict): Mapping of data split key ('train', 'valid', or 'test') to the original sequence-length GPT dataset (i.e., not the chunk dataset).
sample_length (int): Alias to `sequence_length`.
eod_token_id (int): GPT EOD token ID.
Returns:
A tuple of 'train', 'valid', and 'test' `RetroDataset`s.
"""Divide range [0, num_samples) to sequence of block ranges.
This is a core method within the concept of block processing. The idea
is to divide a range (size n_samples) into a sequence of blocks. Each
block corresponds to a file within 'dirname' with name
'{start_idx}-{end_idx}.hdf5'. This method checks for the existence of
these files, and returns two lists, one for existing blocks and one for
missing blocks.
Args:
dirname (str): Path to directory containing block files.
n_samples (int): Ideal number of samples. The total number of samples saved across all block files is <= n_samples.
block_size (int): Max number of samples per block file (e.g., 100000).
validate (Callable): Method for validating each block file during load.
Returns:
A namespace consisting of 2 lists: existing blocks, and missing blocks. The total number of samples between the existing and missing blocks should equal n_samples above.
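# Self-contained sketch of the partitioning described above; the real
# get_blocks() also validates existing files and returns a namespace, but the
# existing/missing split by '{start_idx}-{end_idx}.hdf5' filename is the core idea.
import os

def sketch_get_blocks(dirname: str, n_samples: int, block_size: int):
    existing_blocks, missing_blocks = [], []
    for start_idx in range(0, n_samples, block_size):
        end_idx = min(start_idx + block_size, n_samples)
        path = os.path.join(dirname, f"{start_idx}-{end_idx}.hdf5")
        block = {"range": (start_idx, end_idx), "path": path}
        (existing_blocks if os.path.exists(path) else missing_blocks).append(block)
    return existing_blocks, missing_blocks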
"""Divide existing and missing blocks evenly across all ranks.
See 'get_blocks()' above for description. The returned lists of existing and
missing blocks are split evenly across ranks via interleaving. This way,
each rank has a roughly equal number of blocks to process for a
downstream operation.
Args:
dirname (str): Path to directory containing block files.
n_samples (int): Ideal number of samples. The total number of samples saved across all block files is <= n_samples.
block_size (int): Max number of samples per block file (e.g., 100000).
validate (Callable): Method for validating each block file during load.
sample (Optional[float]): If provided, sample a random subset of the blocks. Used for validating preprocessing correctness.
Returns:
A namespace consisting of 2 lists: existing blocks, and missing blocks. Each of these two lists is potentially a sub-sample of the total set of existing and missing blocks, depending on whether sampling is used. Additionally, the attributes n_existing_world and n_missing_world are the total number of existing and missing blocks, independent of samples. Therefore, (n_existing_world + n_missing_world) * block_size == n_samples.
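# Sketch of the interleaved split described above, assuming torch.distributed
# is already initialized: each rank takes every world_size-th block starting at
# its own rank index, so per-rank workloads stay roughly equal.
import torch

def blocks_for_this_rank(blocks: list) -> list:
    rank = torch.distributed.get_rank()
    world_size = torch.distributed.get_world_size()
    return blocks[rank::world_size]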
"""Get the megatron.core.datasets.blended_megatron_dataset_config.BlendedMegatronDatasetConfig blend from the blend list
Args:
blend (Optional[List[str]]): The blend list, which can be either (1) a list of prefixes, e.g. ["path/to/dataset_1_prefix", "path/to/dataset_2_prefix"], or (2) a flattened, zipped list of weights and prefixes, e.g. ["30", "path/to/dataset_1_prefix", "70", "path/to/dataset_2_prefix"]
Returns:
Optional[Tuple[List[str], Optional[List[float]]]]: The blend, consisting of a list of dataset prefixes and optionally a list of dataset weights, e.g. [["path/to/dataset_1_prefix", "path/to/dataset_2_prefix"], [30.0, 70.0]].
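# Self-contained sketch of the parsing described above; the real
# get_blend_from_list may differ in its validation details.
from typing import List, Optional, Tuple

def sketch_get_blend_from_list(
    blend: Optional[List[str]],
) -> Optional[Tuple[List[str], Optional[List[float]]]]:
    if blend is None:
        return None

    def is_weight(item: str) -> bool:
        try:
            float(item)
            return True
        except ValueError:
            return False

    # Zipped form: ["30", "prefix_1", "70", "prefix_2"] -> alternating weights and prefixes.
    if len(blend) % 2 == 0 and all(is_weight(item) for item in blend[::2]):
        return list(blend[1::2]), [float(item) for item in blend[::2]]

    # Prefix-only form: ["prefix_1", "prefix_2"] -> no weights.
    return list(blend), None

# sketch_get_blend_from_list(["30", "path/to/dataset_1_prefix",
#                             "70", "path/to/dataset_2_prefix"])
# -> (["path/to/dataset_1_prefix", "path/to/dataset_2_prefix"], [30.0, 70.0])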