# IndexKits  
  
[TOC]  
  
## Introduction  
  
Index Kits (`index_kits`) for streaming Arrow data.  
  
* Supports creating datasets from configuration files.  
* Supports creating index v2 from arrows.  
* Supports the creation, reading, and usage of index files.  
* Supports the creation, reading, and usage of multi-bucket, multi-resolution indexes.  
  
`index_kits` has the following advantages:  
  
* Optimizes memory usage for streaming reads, supporting billion-scale datasets.
* Optimizes the memory usage and reading speed of index files.
* Supports creating Base/Multireso Index V2 datasets, with data filtering, sample repetition, and deduplication during creation.
* Supports loading various dataset types (Arrow files, Base Index V2, multiple Base Index V2, Multireso Index V2, multiple Multireso Index V2).
* Uses a unified API to access images, text, and other attributes across all dataset types.
* Built-in shuffle and `resize_and_crop`, which also returns the crop coordinates.
  
## Installation  
  
* **Install from pre-compiled whl (recommended)**  
  
* Install from source  
  
  ```shell  
  cd IndexKits  
  pip install -e .  
  ```
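
To verify the installation, a quick import check (this uses only the classes documented below):

```shell
python -c "from index_kits import ArrowIndexV2, IndexV2Builder"
```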

## Usage

### Loading an Index V2 File

```python
from index_kits import ArrowIndexV2  
  
index_manager = ArrowIndexV2('data.json')  
  
# You can shuffle the dataset using a seed. If a seed is provided (i.e., the seed is not None), this shuffle will not affect any random state.  
index_manager.shuffle(1234, fast=True)  
  
for i in range(len(index_manager)):  
    # Get an image (requires the arrow to contain an image/binary column):  
    pil_image = index_manager.get_image(i)  
    # Get text (requires the arrow to contain a text_zh column):  
    text = index_manager.get_attribute(i, column='text_zh')  
    # Get MD5 (requires the arrow to contain an md5 column):  
    md5 = index_manager.get_md5(i)  
    # Get any attribute by specifying the column (must be contained in the arrow):  
    ocr_num = index_manager.get_attribute(i, column='ocr_num')  
    # Get multiple attributes at once  
    item = index_manager.get_data(i, columns=['text_zh', 'md5'])     # i: in-dataset index  
    print(item)  
    # {  
    #      'index': 3,              # in-json index  
    #      'in_arrow_index': 3,     # in-arrow index  
    #      'arrow_name': '/HunYuanDiT/dataset/porcelain/00000.arrow',   
    #      'text_zh': 'Fortune arrives with the auspicious green porcelain tea cup',   
    #      'md5': '1db68f8c0d4e95f97009d65fdf7b441c'  
    # }  
```


### Loading a Set of Arrow Files

If you have a small batch of arrow files, you can use `IndexV2Builder` directly to load them without creating an Index V2 file.

```python
from index_kits import IndexV2Builder  

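# arrow_files: a list of paths to your existing .arrow files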
index_manager = IndexV2Builder(arrow_files).to_index_v2()  
```
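
The returned object exposes the same unified interface as `ArrowIndexV2`, so the access methods from the previous example apply unchanged. A minimal sketch (assuming the arrows contain image and `text_zh` columns):

```python
index_manager.shuffle(1234, fast=True)
for i in range(len(index_manager)):
    pil_image = index_manager.get_image(i)                    # requires an image/binary column
    text = index_manager.get_attribute(i, column='text_zh')   # requires a text_zh column
```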

### Loading a Multi-Bucket (Multi-Resolution) Index V2 File

When using a multi-bucket (multi-resolution) index file, refer to the following example, in particular the definition of `SimpleDataset` and the usage of `MultiResolutionBucketIndexV2`.

```python
from torch.utils.data import DataLoader, Dataset  
import torchvision.transforms as T  
  
from index_kits import MultiResolutionBucketIndexV2  
from index_kits.sampler import BlockDistributedSampler  
  
class SimpleDataset(Dataset):  
    def __init__(self, index_file, batch_size, world_size):  
        # When using multi-bucket (multi-resolution), batch_size and world_size need to be specified for underlying data alignment.  
        self.index_manager = MultiResolutionBucketIndexV2(index_file, batch_size, world_size)  
  
        self.flip_norm = T.Compose(  
            [  
                T.RandomHorizontalFlip(),  
                T.ToTensor(),  
                T.Normalize([0.5], [0.5]),  
            ]  
        )  
  
    def shuffle(self, seed, fast=False):  
        self.index_manager.shuffle(seed, fast=fast)  
  
    def __len__(self):  
        return len(self.index_manager)  
  
    def __getitem__(self, idx):  
        image = self.index_manager.get_image(idx)  
        original_size = image.size    # (w, h)  
        target_size = self.index_manager.get_target_size(idx)     # (w, h)  
        image, crops_coords_left_top = self.index_manager.resize_and_crop(image, target_size)  
        image = self.flip_norm(image)  
        text = self.index_manager.get_attribute(idx, column='text_zh')  
        return image, text, (original_size, target_size, crops_coords_left_top)  
  
batch_size = 8      # batch_size per GPU  
world_size = 8      # total number of GPUs  
rank = 0            # rank of the current process  
num_workers = 4     # adjust to your environment  
shuffle = False     # must be set to False  
drop_last = True    # must be set to True  
  
# The correct batch_size and world_size must be passed in; otherwise the multi-resolution data cannot be aligned.  
dataset = SimpleDataset('data_multireso.json', batch_size=batch_size, world_size=world_size)  
# Must use BlockDistributedSampler to ensure the samples in a batch have the same resolution.  
sampler = BlockDistributedSampler(dataset, num_replicas=world_size, rank=rank,  
                                  shuffle=shuffle, drop_last=drop_last, batch_size=batch_size)  
loader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, sampler=sampler,  
                    num_workers=num_workers, pin_memory=True, drop_last=drop_last)  
  
for epoch in range(10):  
    # Please use the shuffle method provided by the dataset, not the DataLoader's shuffle parameter.  
    dataset.shuffle(epoch, fast=True)  
    for batch in loader:  
        pass  
```
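
In a real multi-GPU run, `rank` and `world_size` come from the launcher rather than being hard-coded. A minimal sketch adapting the example above, assuming the process group has been initialized (e.g. by `torchrun`):

```python
import torch.distributed as dist

# Assumes dist.init_process_group(...) was already called by the launcher;
# falls back to single-process values otherwise.
world_size = dist.get_world_size() if dist.is_initialized() else 1
rank = dist.get_rank() if dist.is_initialized() else 0

dataset = SimpleDataset('data_multireso.json', batch_size=batch_size, world_size=world_size)
sampler = BlockDistributedSampler(dataset, num_replicas=world_size, rank=rank,
                                  shuffle=False, drop_last=True, batch_size=batch_size)
```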


## Fast Shuffle

When your Index V2 file contains hundreds of millions of samples, the default shuffle can be quite slow. In that case, it is recommended to enable `fast=True` mode:

```python
index_manager.shuffle(seed=1234, fast=True)
```

Fast mode performs a global shuffle without keeping indices from the same arrow file together, which may reduce reading speed. Whether the trade-off pays off depends on the model's forward time: if the forward pass is long enough, the extra read latency can be hidden.
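
For example, a hypothetical heuristic (not part of `index_kits`) is to enable fast mode only when the index is large enough for the default shuffle to become the bottleneck:

```python
# Hypothetical threshold: keep the default (locality-preserving) shuffle for
# smaller indexes, where it is cheap and read speed benefits from grouping.
use_fast = len(index_manager) > 100_000_000
index_manager.shuffle(seed=1234, fast=use_fast)
```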