# IndexKits  
  
[TOC]  
  
## Introduction  
  
Index Kits (`index_kits`) for streaming Arrow data.  
  
* Supports creating datasets from configuration files.  
* Supports creating index v2 from arrows.  
* Supports the creation, reading, and usage of index files.  
* Supports the creation, reading, and usage of multi-bucket, multi-resolution indexes.  
  
`index_kits` has the following advantages:  
  
* Optimizes memory usage for streaming reads, supporting billion-scale datasets.
* Optimizes the memory usage and reading speed of index files.
* Supports creating Base/Multireso Index V2 datasets, with data filtering, sample repetition, and deduplication during creation.
* Supports loading various dataset types (Arrow files, Base Index V2, multiple Base Index V2, Multireso Index V2, multiple Multireso Index V2).
* Uses a unified API to access images, text, and other attributes across all dataset types.
* Built-in shuffle and `resize_and_crop`, which also returns the crop coordinates.
  
## Installation  
  
* **Install from pre-compiled whl (recommended)**  
  
* Install from source  
  
  ```shell  
  cd IndexKits  
  pip install -e .  
  ```
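
To verify the installation, a quick import check (this uses only the classes documented below):

```shell
python -c "from index_kits import ArrowIndexV2, IndexV2Builder"
```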

## Usage

### Loading an Index V2 File

```python
from index_kits import ArrowIndexV2  
  
index_manager = ArrowIndexV2('data.json')  
  
# You can shuffle the dataset using a seed. If a seed is provided (i.e., the seed is not None), this shuffle will not affect any random state.  
index_manager.shuffle(1234, fast=True)  
  
for i in range(len(index_manager)):  
    # Get an image (requires the arrow to contain an image/binary column):  
    pil_image = index_manager.get_image(i)  
    # Get text (requires the arrow to contain a text_zh column):  
    text = index_manager.get_attribute(i, column='text_zh')  
    # Get MD5 (requires the arrow to contain an md5 column):  
    md5 = index_manager.get_md5(i)  
    # Get any attribute by specifying the column (must be contained in the arrow):  
    ocr_num = index_manager.get_attribute(i, column='ocr_num')  
    # Get multiple attributes at once  
    item = index_manager.get_data(i, columns=['text_zh', 'md5'])     # i: in-dataset index  
    print(item)  
    # {  
    #      'index': 3,              # in-json index  
    #      'in_arrow_index': 3,     # in-arrow index  
    #      'arrow_name': '/HunYuanDiT/dataset/porcelain/00000.arrow',   
    #      'text_zh': 'Fortune arrives with the auspicious green porcelain tea cup',   
    #      'md5': '1db68f8c0d4e95f97009d65fdf7b441c'  
    # }  
```


### Loading a Set of Arrow Files

If you have a small batch of arrow files, you can use `IndexV2Builder` directly to load them without creating an Index V2 file.

```python
from index_kits import IndexV2Builder  

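# arrow_files: a list of paths to your existing .arrow files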
index_manager = IndexV2Builder(arrow_files).to_index_v2()  
```
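
The returned object exposes the same unified interface as `ArrowIndexV2`, so the access methods from the previous example apply unchanged. A minimal sketch (assuming the arrows contain image and `text_zh` columns):

```python
index_manager.shuffle(1234, fast=True)
for i in range(len(index_manager)):
    pil_image = index_manager.get_image(i)                    # requires an image/binary column
    text = index_manager.get_attribute(i, column='text_zh')   # requires a text_zh column
```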

### Loading a Multi-Bucket (Multi-Resolution) Index V2 File

When using a multi-bucket (multi-resolution) index file, refer to the following example, in particular the definition of `SimpleDataset` and the usage of `MultiResolutionBucketIndexV2`.

```python
from torch.utils.data import DataLoader, Dataset  
import torchvision.transforms as T  
  
from index_kits import MultiResolutionBucketIndexV2  
from index_kits.sampler import BlockDistributedSampler  
  
class SimpleDataset(Dataset):  
    def __init__(self, index_file, batch_size, world_size):  
        # When using multi-bucket (multi-resolution), batch_size and world_size need to be specified for underlying data alignment.  
        self.index_manager = MultiResolutionBucketIndexV2(index_file, batch_size, world_size)  
  
        self.flip_norm = T.Compose(  
            [  
                T.RandomHorizontalFlip(),  
                T.ToTensor(),  
                T.Normalize([0.5], [0.5]),  
            ]  
        )  
  
    def shuffle(self, seed, fast=False):  
        self.index_manager.shuffle(seed, fast=fast)  
  
    def __len__(self):  
        return len(self.index_manager)  
  
    def __getitem__(self, idx):  
        image = self.index_manager.get_image(idx)  
        original_size = image.size    # (w, h)  
        target_size = self.index_manager.get_target_size(idx)     # (w, h)  
        image, crops_coords_left_top = self.index_manager.resize_and_crop(image, target_size)  
        image = self.flip_norm(image)  
        text = self.index_manager.get_attribute(idx, column='text_zh')  
        return image, text, (original_size, target_size, crops_coords_left_top)  
  
batch_size = 8      # batch_size per GPU  
world_size = 8      # total number of GPUs  
rank = 0            # rank of the current process  
num_workers = 4     # adjust to your environment  
shuffle = False     # must be set to False  
drop_last = True    # must be set to True  
  
# The correct batch_size and world_size must be passed in; otherwise the multi-resolution data cannot be aligned.  
dataset = SimpleDataset('data_multireso.json', batch_size=batch_size, world_size=world_size)  
# Must use BlockDistributedSampler to ensure the samples in a batch have the same resolution.  
sampler = BlockDistributedSampler(dataset, num_replicas=world_size, rank=rank,  
                                  shuffle=shuffle, drop_last=drop_last, batch_size=batch_size)  
loader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, sampler=sampler,  
                    num_workers=num_workers, pin_memory=True, drop_last=drop_last)  
  
for epoch in range(10):  
    # Please use the shuffle method provided by the dataset, not the DataLoader's shuffle parameter.  
    dataset.shuffle(epoch, fast=True)  
    for batch in loader:  
        pass  
```
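
In a real multi-GPU run, `rank` and `world_size` come from the launcher rather than being hard-coded. A minimal sketch adapting the example above, assuming the process group has been initialized (e.g. by `torchrun`):

```python
import torch.distributed as dist

# Assumes dist.init_process_group(...) was already called by the launcher;
# falls back to single-process values otherwise.
world_size = dist.get_world_size() if dist.is_initialized() else 1
rank = dist.get_rank() if dist.is_initialized() else 0

dataset = SimpleDataset('data_multireso.json', batch_size=batch_size, world_size=world_size)
sampler = BlockDistributedSampler(dataset, num_replicas=world_size, rank=rank,
                                  shuffle=False, drop_last=True, batch_size=batch_size)
```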


## Fast Shuffle

When your Index V2 file contains hundreds of millions of samples, the default shuffle can be quite slow. In that case, it is recommended to enable `fast=True` mode:

```python
index_manager.shuffle(seed=1234, fast=True)
```

Fast mode performs a global shuffle without keeping indices from the same arrow file together, which may reduce reading speed. Whether the trade-off pays off depends on the model's forward time: if the forward pass is long enough, the extra read latency can be hidden.
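
For example, a hypothetical heuristic (not part of `index_kits`) is to enable fast mode only when the index is large enough for the default shuffle to become the bottleneck:

```python
# Hypothetical threshold: keep the default (locality-preserving) shuffle for
# smaller indexes, where it is cheap and read speed benefits from grouping.
use_fast = len(index_manager) > 100_000_000
index_manager.shuffle(seed=1234, fast=use_fast)
```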