Index type, generally read from argument '--retro-index-type'.
"""
@classmethod
def get_index_class(cls, index_type: str) -> type:
"""Get an index class, given a type string.
Args:
index_type (str): One of 'faiss-base' (naive Faiss index wrapper) or 'faiss-par-add' (Faiss index wrapper with near embarrassingly parallel index.add()).
Returns:
An `Index` sub-type corresponding to the `index_type`.
index_type (str): One of 'faiss-base' (naive Faiss index wrapper) or 'faiss-par-add' (Faiss index wrapper with near embarrassingly parallel index.add()).
Returns:
An `Index` instance corresponding to the `index_type`.
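# Hedged sketch of the type-string-to-class mapping described above; the names
# FaissBaseIndex / FaissParallelAddIndex are assumptions, stood in here by empty
# placeholder classes so the snippet runs on its own.
class FaissBaseIndex: ...          # placeholder for the naive Faiss wrapper
class FaissParallelAddIndex: ...   # placeholder for the parallel-add wrapper

def _get_index_class_sketch(index_type: str) -> type:
    # Map the two supported type strings to their Index sub-types.
    return {
        "faiss-base": FaissBaseIndex,
        "faiss-par-add": FaissParallelAddIndex,
    }[index_type]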
block (dict): Range information specifying the start/end indices within the encoded text dataset. Here, the 'path' item is used for writing the encodings to storage.
codes (np.ndarray): Block of encodings to be saved to storage.
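# Minimal sketch (an assumption, not the library's exact routine) of writing a
# block of encodings to the file given by block['path']; the HDF5 dataset name
# "data" is also an assumption.
import h5py
import numpy as np

def _save_block_sketch(block: dict, codes: np.ndarray) -> None:
    # Write the encodings for this block's [start, end) range to its block file.
    with h5py.File(block["path"], "w") as f:
        f.create_dataset("data", data=codes)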
"""Get sample, including represented document IDs.
Args:
idx (int): Sample index.
Returns:
A sample, which contains both the chunk-length token sample ('text') and all document IDs ('doc_ids') contained within the full `sequence_length` sample.
"""
# Convert global chunk index to global sample index & local chunk index.
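# Sketch of the conversion noted above, assuming each full-length sample is split
# into a fixed number of chunks (`n_chunks_per_sample` is an assumed attribute
# name): integer division gives the global sample index, the remainder gives the
# local chunk index within that sample.
n_chunks_per_sample = 32                   # assumed value for illustration
idx = 1000                                 # global chunk index
sample_idx = idx // n_chunks_per_sample    # global sample index
chunk_idx = idx % n_chunks_per_sample      # local chunk index within the sample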
"""Configuration object for Megatron Core blended and Retro datasets.
Args:
return_document_ids (bool): Whether to return the document ids when querying the dataset. Turn this option on during preprocessing.
split_preprocessing (str): The Retro preprocessing split string. It follows the same pattern convention as 'split'. Not to be used with 'blend_per_split'.
"""
return_document_ids: bool = None
split_preprocessing: str = None

def __post_init__(self) -> None:
    """Validate config attributes."""
    super().__post_init__()
    assert self.split is not None, "the Retro data pipeline does not support 'blend_per_split'"
    assert self.return_document_ids is not None, "this attribute must be user defined"
    assert self.split_preprocessing is not None, "this attribute must be user defined"
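# Self-contained sketch of the validation pattern above: required attributes
# default to None and are enforced in __post_init__. This is a simplified
# stand-in, not the real config class, which also inherits the GPT dataset
# config fields such as 'split'.
from dataclasses import dataclass

@dataclass
class _RetroConfigSketch:
    split: str = None
    return_document_ids: bool = None
    split_preprocessing: str = None

    def __post_init__(self) -> None:
        assert self.split is not None, "the Retro data pipeline does not support 'blend_per_split'"
        assert self.return_document_ids is not None, "this attribute must be user defined"
        assert self.split_preprocessing is not None, "this attribute must be user defined"

# During preprocessing, document IDs are returned and the preprocessing split is
# recorded alongside the training split (values below are illustrative).
_config = _RetroConfigSketch(split="98,2,0", return_document_ids=True, split_preprocessing="98,2,0")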
query_dataset (GPTChunkDataset): GPT chunk dataset to be queried.
num_active_chunks (int): The 'active' chunks are the subset of the GPT chunk dataset that aren't being queried. This argument is used when validating the correctness of a subset of the GPT chunk dataset.
prefix (str): Extra string for logging progress.
neighbor_dir (str): File path to directory for saving neighbor IDs.
index (Index): Vector index populated with chunk database indices.
"""
def validate(f: h5py.File) -> None:
    """Validation method for saved neighbor IDs."""
gpt_datasets (dict): Mapping of data split key ('train', 'valid', or 'test') to the original sequence-length GPT dataset (i.e., not the chunk dataset).
sample_length (int): Alias to `sequence_length`.
eod_token_id (int): GPT EOD token ID.
Returns:
A tuple of 'train', 'valid', and 'test' `RetroDataset`s.
"""Divide range [0, num_samples) to sequence of block ranges.
This is a core method within the concept of block processing. The idea
is to divide a range (size n_samples) into a sequence of blocks. Each
block corresponds to a file within 'dirname' with name
'{start_idx}-{end_idx}.hdf5'. This method checks for the existence of
these files, and returns two lists, one for existing blocks and one for
missing blocks.
Args:
dirname (str): Path to directory containing block files.
n_samples (int): Ideal number of samples. The total number of samples saved across all block files is <= n_samples.
block_size (int): Max number of samples per block file (e.g., 100000).
validate (Callable): Method for validating each block file during load.
Returns:
A namespace consisting of 2 lists: existing blocks, and missing blocks. The total number of samples between the existing and missing blocks should equal n_samples above.
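# Illustrative sketch of the block layout described above (not the library
# function itself): divide [0, n_samples) into ranges of at most block_size
# samples, derive each block's expected '{start_idx}-{end_idx}.hdf5' filename,
# and partition into existing vs. missing based on the files present in dirname.
import os

def _block_ranges_sketch(dirname: str, n_samples: int, block_size: int):
    existing, missing = [], []
    for start in range(0, n_samples, block_size):
        end = min(start + block_size, n_samples)
        block = {"range": (start, end), "path": os.path.join(dirname, f"{start}-{end}.hdf5")}
        (existing if os.path.exists(block["path"]) else missing).append(block)
    return existing, missing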
"""Divide existing and missing blocks evenly across all ranks.
See 'get_blocks()' above for description. The returned lists of existing and
missing blocks are split evenly across ranks via interleaving. This way,
each rank has a roughly equal number of blocks to process for a
downstream operation.
Args:
dirname (str): Path to directory containing block files.
n_samples (int): Ideal number of samples. The total number of samples saved across all block files is <= n_samples.
block_size (int): Max number of samples per block file (e.g., 100000).
validate (Callable): Method for validating each block file during load.
sample (Optional[float]): If provided, sample a random subset of the blocks. Used for validating preprocessing correctness.
Returns:
A namespace consisting of 2 lists: existing blocks, and missing blocks. Each of these two lists is potentially a sub-sample of the total set of existing and missing blocks, depending on whether sampling is used. Additionally, the attributes n_existing_world and n_missing_world are the total number of existing and missing blocks, independent of sampling. Therefore, (n_existing_world + n_missing_world) * block_size == n_samples.
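# Sketch of the interleaved split described above: rank r takes every
# world_size-th block starting at offset r, so each rank gets a roughly equal
# share of both existing and missing blocks. Names are illustrative only.
def _split_blocks_by_rank_sketch(blocks: list, rank: int, world_size: int) -> list:
    return blocks[rank::world_size]

# e.g., with 10 blocks and 4 ranks, rank 1 processes blocks 1, 5, and 9.
assert _split_blocks_by_rank_sketch(list(range(10)), 1, 4) == [1, 5, 9]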
"""Get the megatron.core.datasets.blended_megatron_dataset_config.BlendedMegatronDatasetConfig blend from the blend list
Args:
blend (Optional[List[str]]): The blend list, which can be either (1) a list of prefixes, e.g. ["path/to/dataset_1_prefix", "path/to/dataset_2_prefix"], or (2) a flattened, zipped list of weights and prefixes, e.g. ["30", "path/to/dataset_1_prefix", "70", "path/to/dataset_2_prefix"]
Returns:
Optional[Tuple[List[str], Optional[List[float]]]]: The blend, consisting of a list of dataset prefixes and optionally a list of dataset weights, e.g. [["path/to/dataset_1_prefix", "path/to/dataset_2_prefix"], [30.0, 70.0]].
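# Sketch of the two accepted blend-list forms described above (not the library
# function itself): a plain prefix list yields no weights, while a flattened,
# zipped weight/prefix list is split into parallel prefix and weight lists.
def _parse_blend_sketch(blend):
    if blend is None:
        return None
    try:
        float(blend[0])
    except ValueError:
        # Form (1): prefixes only, no weights.
        return list(blend), None
    # Form (2): flattened, zipped weights and prefixes.
    return list(blend[1::2]), [float(w) for w in blend[0::2]]

assert _parse_blend_sketch(["30", "path/to/dataset_1_prefix", "70", "path/to/dataset_2_prefix"]) == (
    ["path/to/dataset_1_prefix", "path/to/dataset_2_prefix"],
    [30.0, 70.0],
)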
"""Config when the data (.bin) file and the index (.idx) file are in S3
TODO: These parameters are few and can be consolidated with parameters specific to bin reader
classes - @jkamalu
Attributes:
path_to_idx_cache (str): The local directory where we will store the index (.idx) file
bin_chunk_nbytes (int): If the number of bytes is too small, then we send a request to S3 at each call of the `read` method in _S3BinReader, which is slow, because each request has a fixed cost independent of the size of the byte range requested. If the number of bytes is too large, then we only rarely have to send requests to S3, but it takes a lot of time to complete the request when we do, which can block training. We've found that 256 * 1024 * 1024 (i.e., 256 MiB) has worked well (though we have not put that much effort into tuning it), so we default to it.
"""
path_to_idx_cache: str
bin_chunk_nbytes: int = 256 * 1024 * 1024
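# Illustrative sketch of the chunked-read idea behind bin_chunk_nbytes: a read at
# an arbitrary byte offset is served from a chunk-aligned S3 byte range, so many
# small reads map onto one larger, cacheable request. Names are assumptions.
def _chunk_byte_range_sketch(offset: int, chunk_nbytes: int = 256 * 1024 * 1024) -> str:
    chunk_start = (offset // chunk_nbytes) * chunk_nbytes
    chunk_end = chunk_start + chunk_nbytes - 1
    # HTTP Range header value for the enclosing chunk, e.g. "bytes=0-268435455".
    return f"bytes={chunk_start}-{chunk_end}"

assert _chunk_byte_range_sketch(300 * 1024 * 1024) == f"bytes={256 * 1024 * 1024}-{512 * 1024 * 1024 - 1}"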
class S3Client(Protocol):
    """The protocol which all S3 clients should abide by."""