Thanks for your error report and we appreciate it a lot.
**Checklist**
1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.
**Describe the bug**
A clear and concise description of what the bug is.
**Reproduction**
1. What command or script did you run?
```none
A placeholder for the command.
```
2. Did you make any modifications to the code or config? Do you understand what you have modified?
3. What dataset did you use?
**Environment**
1. Please run `python utils/collect_env.py` to collect necessary environment information and paste it here.
2. You may add other information that may be helpful for locating the problem, such as
- How you installed PyTorch [e.g., pip, conda, source]
- Other environment variables that may be related (such as `$PATH`, `$LD_LIBRARY_PATH`, `$PYTHONPATH`, etc.)
**Error traceback**
If applicable, paste the error traceback here.
```none
A placeholder for the traceback.
```
**Bug fix**
If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!
about: Ask questions about model reimplementation
title: ''
labels: reimplementation
assignees: ''
---
**Notice**
There are several common situations in reimplementation issues, as listed below:
1. Reimplement a model in the model zoo using the provided configs
2. Reimplement a model in the model zoo on other dataset (e.g., custom datasets)
3. Reimplement a custom model in which all components are already implemented in HunyuanDiT
4. Reimplement a custom model with new modules implemented by yourself
Different cases require different steps, as described below.
- For cases 1 & 3, please follow the steps in the following sections so that we can quickly identify the issue.
- For cases 2 & 4, please understand that we cannot provide much help here, because we usually do not have access to the full code, and users are responsible for the code they write.
- One suggestion for cases 2 & 4 is to first check whether the bug lies in the self-implemented code or in the original code. For example, first make sure that the same model runs well on the supported datasets. If you still need help, describe in the issue what you have done and what you obtained, follow the steps in the following sections, and be as clear as possible so that we can better help you.
**Checklist**
1. I have searched related issues but cannot get the expected help.
2. The issue has not been fixed in the latest version.
**Describe the issue**
A clear and concise description of the problem you encountered and what you have done.
**Reproduction**
1. What command or script did you run?
```none
A placeholder for the command.
```
2. Which config (or config directory) did you run?
```none
A placeholder for the config.
```
3. Did you make any modifications to the code or config? Do you understand what you have modified?
4. What dataset did you use?
**Environment**
1. Please run `python utils/collect_env.py` to collect necessary environment information and paste it here.
2. You may add other information that may be helpful for locating the problem, such as
1. How you installed PyTorch [e.g., pip, conda, source]
2. Other environment variables that may be related (such as `$PATH`, `$LD_LIBRARY_PATH`, `$PYTHONPATH`, etc.)
**Results**
If applicable, paste the related results here, e.g., what you expect and what you get.
```none
A placeholder for results comparison
```
**Issue fix**
If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!
### Loading a Multi-Bucket (Multi-Resolution) Index V2 File
When using a multi-bucket (multi-resolution) index file, refer to the following example code, especially the definition of `SimpleDataset` and the usage of `MultiResolutionBucketIndexV2`.
```python
# Please use the shuffle method provided by the dataset, not the DataLoader's shuffle parameter.
dataset.shuffle(epoch, fast=True)
for batch in loader:
    pass
```
## Fast Shuffle
When your index v2 file contains hundreds of millions of samples, using the default shuffle can be quite slow. Therefore, it’s recommended to enable `fast=True` mode:
```python
index_manager.shuffle(seed=1234, fast=True)
```
This shuffles globally without keeping the indices from the same Arrow file together. Although this may reduce reading speed, whether that matters is a trade-off against the model's forward time.
The IndexKits library offers a command-line tool, `idk`, for creating datasets and viewing their statistics. You can view the usage instructions with `idk -h`. This section covers creating an Index V2 format dataset from a series of Arrow files.
## 1. Creating a Base Index V2 Dataset
### 1.1 Create a Base Index V2 dataset using `idk`
When creating a Base Index V2 dataset, you need to specify the path to a configuration file using the `-c` parameter and a save path using the `-t` parameter.
```shell
idk base -c base_config.yaml -t base_dataset.json
```
### 1.2 Basic Configuration
Next, let’s discuss how to write the configuration file. The configuration file is in `yaml` format, and below is a basic example:
Filename: `base_config.yaml`
```yaml
source:
  - /HunYuanDiT/dataset/porcelain/arrows/00000.arrow
```
| Field Name | Type | Description |
|:---------:|:---:|:----------:|
| source | Optional | Arrow List |
We provide an example that includes all features and fields in [full_config.yaml](./docs/full_config.yaml).
### 1.3 Filtering Criteria
`idk` offers two types of filtering capabilities during the dataset creation process: (1) based on columns in Arrow, and (2) based on MD5 files.
To enable filtering criteria, add the `filter` field in the configuration file.
#### 1.3.1 Column Filtering
To enable column filtering, add the `column` field under the `filter` section.
Multiple column filtering criteria can be applied simultaneously, with the intersection of multiple conditions being taken.
For example, to select data where both the height and width are greater than or equal to 512, with the default being 1024 if the height or width is invalid:
```yaml
filter:
  column:
    - name: height
      type: int
      action: ge
      target: 512
      default: 1024
    - name: width
      type: int
      action: ge
      target: 512
      default: 1024
```
This filtering condition is equivalent to `table['height'].to_int(default=1024) >= 512 && table['width'].to_int(default=1024) >= 512`.
Each column filtering criterion includes the fields shown in the example above: `name`, `type`, `action`, `target`, and `default`. Among the string-valued actions are:

| Action | Meaning | Target Type |
|:------:|:-------:|:-----------:|
| in | str in, `str.in(target)` | `str` |
| not_in | str not in, `str.not_in(target)` | `str` |
| lower_last_in | lower str last char in, `str.lower()[-1].in(target)` | `str` |
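For illustration only, a string-valued column filter might look like the sketch below; the column name `text_zh`, the `str` value for `type`, and the target characters are assumptions for this example rather than values taken from the documentation:
```yaml
filter:
  column:
    # Assumed example: keep rows whose lowercased last character of `text_zh`
    # is one of the characters listed in `target`.
    - name: text_zh
      type: str
      action: lower_last_in
      target: ".。"
```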
#### 1.3.2 MD5 Filtering
Add an `md5` field under the `filter` section to initiate MD5 filtering. Multiple MD5 filtering criteria can be applied simultaneously, with the intersection of multiple conditions being taken.
For example (see the sketch after the field table below):
* `badcase.txt` is a list of MD5s, used to filter out the entries it lists.
* `badcase.json` is a dictionary whose keys are MD5s and whose values are text-related `tag`s; the goal is to filter out specific `tags`.
Each MD5 filtering criterion includes the following fields:

| Field Name | Type | Description | Value Range |
|:---------:|:---:|:----------:|:----:|
| name | Required | The name of the filtering criterion, which can be customized for ease of statistics. | - |
| path | Required | The path to the filtering file, which can be a single path or multiple paths provided in a list format. Supports `.txt`, `.json`, `.pkl` formats. | - |
| type | Required | The type of records in the filtering file. | `list` or `dict` |
| action | Required | The filtering action. | For `list`: `in`, `not_in`; For `dict`: `eq`, `ne`, `gt`, `lt`, `ge`, `le` |
| target | Optional | The filtering criterion. | Required when type is `dict` |
| is_valid | Required | Whether a hit on action+target is considered valid or invalid. | `true` or `false` |
| arrow_file_keyword | Optional | Keywords in the Arrow file path. | - |
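As a rough sketch (the criterion names, the tag value, and the chosen `action`/`is_valid` combinations are illustrative assumptions, not values from the documentation), an `md5` section covering both files might look like:
```yaml
filter:
  md5:
    # Drop any sample whose MD5 appears in badcase.txt.
    - name: badcase_list       # illustrative name
      path: badcase.txt
      type: list
      action: in
      is_valid: false          # a hit means the sample is filtered out
    # Drop samples whose MD5 maps to the tag "watermark" in badcase.json.
    - name: badcase_tag        # illustrative name
      path: badcase.json
      type: dict
      action: eq
      target: watermark        # illustrative tag value
      is_valid: false
```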
### 1.4 Advanced Filtering
`idk` also supports some more advanced filtering functions.
#### 1.4.1 Filtering Criteria Applied to Part of Arrow Files
Using the `arrow_file_keyword` parameter, filtering criteria can be applied only to part of the Arrow files.
For example:
* The filtering criterion `height >= 512` applies only to arrows whose path includes `human`.
* The filtering criterion “keep samples in goodcase.txt” applies only to arrows whose path includes `human`.
```yaml
filter:
  column:
    - name: height
      type: int
      action: ge
      target: 512
      default: 1024
      arrow_file_keyword:
        - human
  md5:
    - name: goodcase
      path: goodcase.txt
      type: list
      action: in
      is_valid: true
      arrow_file_keyword:
        - human
```
#### 1.4.2 The “or” Logic in Filtering Criteria
By default, filtering criteria follow an “and” logic. If you want two or more filtering criteria to follow an “or” logic, you should use the `logical_or` field. The column filtering criteria listed under this field will be combined using an “or” logic.
Special Note: The `logical_or` field is applicable only to the `column` filtering criteria within `filter`.
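A minimal sketch of what this might look like; the exact nesting of `logical_or` under `filter` and the field values are assumptions based on the description above:
```yaml
filter:
  logical_or:
    # Assumed layout: keep samples where height >= 1024 OR width >= 1024.
    - name: height
      type: int
      action: ge
      target: 1024
      default: 0
    - name: width
      type: int
      action: ge
      target: 1024
      default: 0
```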
#### 1.4.3 Excluding Certain Arrows from the Source
While wildcards can be used to fetch multiple arrows at once, there might be instances where we want to exclude some of them. This can be achieved through the `exclude` field. Keywords listed under `exclude`, if found in the path of the current group of arrows, will result in those arrows being excluded.
```yaml
source:
  - /HunYuanDiT/dataset/porcelain/arrows/*.arrow:
      exclude:
        - arrow1
        - arrow2
```
### 1.5 Repeating Samples
`idk` offers the capability to repeat either all samples or specific samples during dataset creation. There are three types of repeaters:
* Directly repeating the source
* Based on keywords in the Arrow file name (enable repeat conditions by adding the `repeater` field in the configuration file)
* Based on an MD5 file (enable repeat conditions by adding the `repeater` field in the configuration file)
**Special Note:** The above three conditions can be used simultaneously. If a sample meets multiple repeat conditions, the highest number of repeats is taken (see the combined sketch at the end of this section).
#### 1.5.1 Repeating the Source
In the source, for the Arrow(s) you want to repeat (which can include wildcards like `*`, `?`), add `repeat: n` to mark the number of times to repeat.
```yaml
source:
  - /HunYuanDiT/dataset/porcelain/arrows/*.arrow:
      repeat: 10
```
**Special Note: Add a colon at the end of the Arrow path**
#### 1.5.2 Arrow Keywords
Add an `arrow_file_keyword` field under the `repeater` section.
```yaml
repeater:
  arrow_file_keyword:
    - repeat: 8
      keyword:
        - Lolita anime style
        - Minimalist style
    - repeat: 5
      keyword:
        - Magical Barbie style
        - Disney style
```
Each repeat condition includes two fields:
| Field Name | Type | Description | Value Range |
|:-------:|:---:|:---------------:|:----:|
| repeat | Required | The number of times to repeat | Number |
| keyword | Required | Keywords in the Arrow file path | - |
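Since the repeat conditions above can be combined, a sketch mixing a source-level repeat with a keyword-based repeat might look like the following (the path, keyword, and repeat counts are illustrative; when a sample matches several conditions, the highest repeat count is used):
```yaml
source:
  - /HunYuanDiT/dataset/porcelain/arrows/*.arrow:
      repeat: 2
repeater:
  arrow_file_keyword:
    - repeat: 8
      keyword:
        - Lolita anime style
```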