Commit 80069cad authored by zhe chen

Update train.txt and val.txt (#48)

parent 0552aa5e
@@ -54,6 +54,9 @@ pip install timm==0.6.11 mmdet==2.28.1
```bash
pip install opencv-python termcolor yacs pyyaml scipy
# Please use a version of numpy lower than 2.0
pip install numpy==1.26.4
pip install pydantic==1.10.13
```
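The NumPy pin above exists because the pinned dependency stack is not compatible with NumPy 2.x. As a minimal sketch (hypothetical helper, not part of the repo), you can fail fast if the wrong major version slipped into the environment:

```python
# Hypothetical guard: check that a NumPy version string is below 2.0,
# as the install note above requires. Pure string parsing, no imports.
def numpy_version_ok(version: str) -> bool:
    """Return True if a version string like '1.26.4' is below 2.0."""
    major = int(version.split(".")[0])
    return major < 2

print(numpy_version_ok("1.26.4"))  # the pinned version passes
print(numpy_version_ok("2.0.1"))   # NumPy 2.x is rejected
```

In practice you would pass `numpy.__version__` to this check right after import.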
- Compiling CUDA operators
@@ -72,13 +75,13 @@ python test.py
## Data Preparation
We provide the following ways to prepare the data:
<details open>
<summary>Standard ImageNet-1K</summary>
We use the standard ImageNet dataset; you can download it from http://image-net.org/.
- For a standard folder dataset, move the validation images into labeled sub-folders. The file structure should look like:
```bash
@@ -102,7 +105,6 @@ We provide the following two ways to load data:
│   ├── img6.jpeg
│   └── ...
└── ...
```
</details>
@@ -231,24 +233,33 @@ python -m torch.distributed.launch --nproc_per_node <num-of-gpus-to-use> --maste
## Manage Jobs with Slurm
For example, to train or evaluate `InternImage` with 8 GPUs on a single node, run:
`InternImage-T`:
```bash
# Train for 300 epochs
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_t_1k_224.yaml
# Evaluate on ImageNet-1K
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_t_1k_224.yaml --resume pretrained/internimage_t_1k_224.pth --eval
```
`InternImage-S`:
```bash
# Train for 300 epochs
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_s_1k_224.yaml
# Evaluate on ImageNet-1K
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_s_1k_224.yaml --resume pretrained/internimage_s_1k_224.pth --eval
```
`InternImage-XL`:
```bash
# Train for 300 epochs
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_xl_22kto1k_384.yaml
# Evaluate on ImageNet-1K
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_xl_22kto1k_384.yaml --resume pretrained/internimage_xl_22kto1k_384.pth --eval
```
<!--
@@ -275,7 +286,7 @@ python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.
## Training with DeepSpeed
We support [DeepSpeed](https://github.com/microsoft/DeepSpeed) to reduce the memory cost of training large-scale models, e.g., InternImage-H with over 1 billion parameters.
To use it, first install the requirements:
```bash
@@ -286,23 +297,23 @@ Then you could launch the training in a slurm system with 8 GPUs as follows (tin
The default ZeRO stage is 1; it can be configured via the command-line argument `--zero-stage`.
```bash
GPUS=8 GPUS_PER_NODE=8 sh train_in1k_deepspeed.sh <partition> <job-name> configs/internimage_t_1k_224.yaml --batch-size 128 --accumulation-steps 4
GPUS=8 GPUS_PER_NODE=8 sh train_in1k_deepspeed.sh <partition> <job-name> configs/internimage_t_1k_224.yaml --batch-size 128 --accumulation-steps 4 --eval --resume ckpt.pth
GPUS=8 GPUS_PER_NODE=8 sh train_in1k_deepspeed.sh <partition> <job-name> configs/internimage_t_1k_224.yaml --batch-size 128 --accumulation-steps 4 --eval --resume deepspeed_ckpt_dir
GPUS=8 GPUS_PER_NODE=8 sh train_in1k_deepspeed.sh <partition> <job-name> configs/internimage_h_22kto1k_640.yaml --batch-size 16 --accumulation-steps 4 --pretrained pretrained/internimage_h_jointto22k_384.pth
GPUS=8 GPUS_PER_NODE=8 sh train_in1k_deepspeed.sh <partition> <job-name> configs/internimage_h_22kto1k_640.yaml --batch-size 16 --accumulation-steps 4 --pretrained pretrained/internimage_h_jointto22k_384.pth --zero-stage 3
```
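Assuming `--batch-size` is per GPU, as in Swin-style training scripts, gradient accumulation makes the effective global batch size the product of the per-GPU batch, the accumulation steps, and the GPU count. A quick sanity check (hypothetical helper, not part of the repo):

```python
# Hypothetical helper: the global batch size implied by the flags above is
# per-GPU batch x accumulation steps x number of GPUs.
def effective_batch_size(per_gpu_batch: int, accumulation_steps: int, num_gpus: int) -> int:
    return per_gpu_batch * accumulation_steps * num_gpus

print(effective_batch_size(128, 4, 8))  # 4096 for the InternImage-T commands
print(effective_batch_size(16, 4, 8))   # 512 for the InternImage-H commands
```

This is why accumulation lets the huge model keep a reasonable global batch while fitting only 16 samples per GPU in memory.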
🤗 **HuggingFace Accelerate Integration of DeepSpeed**
Optionally, you can use our [HuggingFace Accelerate](https://github.com/huggingface/accelerate) integration to launch DeepSpeed:
```bash
pip install accelerate==0.18.0
```
```bash
accelerate launch --config_file configs/accelerate/dist_8gpus_zero3_wo_loss_scale.yaml main_accelerate.py --cfg configs/internimage_h_22kto1k_640.yaml --data-path data/imagenet --batch-size 16 --pretrained pretrained/internimage_h_jointto22k_384.pth --accumulation-steps 4
accelerate launch --config_file configs/accelerate/dist_8gpus_zero3_offload.yaml main_accelerate.py --cfg configs/internimage_t_1k_224.yaml --data-path data/imagenet --batch-size 128 --accumulation-steps 4 --output output_zero3_offload
accelerate launch --config_file configs/accelerate/dist_8gpus_zero1.yaml main_accelerate.py --cfg configs/internimage_t_1k_224.yaml --data-path data/imagenet --batch-size 128 --accumulation-steps 4
```
@@ -12,6 +12,7 @@ import os
import os.path as osp
import re
import time
import zipfile
from abc import abstractmethod
import mmcv
@@ -382,8 +383,23 @@ class ParserCephImage(Parser):
        else:
            self.io_backend = 'disk'
        self.class_to_idx = None
        txt_file = osp.join(annotation_root, f'{split}.txt')
        zip_file = osp.join(annotation_root, f'{split}.txt.zip')
        if osp.exists(txt_file):
            with open(txt_file, 'r') as f:
                self.samples = f.read().splitlines()
        elif osp.exists(zip_file):
            with zipfile.ZipFile(zip_file, 'r') as zf:
                file_list = zf.namelist()
                if f'{split}.txt' in file_list:
                    with zf.open(f'{split}.txt') as f:
                        self.samples = f.read().decode('utf-8').splitlines()
                else:
                    raise FileNotFoundError(f"'{split}.txt' not found in '{zip_file}'")
        else:
            raise FileNotFoundError(f"Neither '{split}.txt' nor '{split}.txt.zip' found in '{annotation_root}'")
        local_rank = None
        local_size = None
        self._consecutive_errors = 0
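The txt/zip fallback added in this commit can be exercised in isolation. The sketch below is a standalone reproduction of the read path (hypothetical function, not the repo's `ParserCephImage` class), demonstrating that a zipped `val.txt.zip` yields the same sample list as a plain `val.txt`:

```python
import os
import tempfile
import zipfile

def read_annotations(annotation_root: str, split: str) -> list:
    """Read '<split>.txt' from annotation_root, falling back to a
    '<split>.txt.zip' archive that contains the same file."""
    txt_file = os.path.join(annotation_root, f'{split}.txt')
    zip_file = os.path.join(annotation_root, f'{split}.txt.zip')
    if os.path.exists(txt_file):
        with open(txt_file, 'r') as f:
            return f.read().splitlines()
    if os.path.exists(zip_file):
        with zipfile.ZipFile(zip_file, 'r') as zf:
            if f'{split}.txt' in zf.namelist():
                with zf.open(f'{split}.txt') as f:
                    return f.read().decode('utf-8').splitlines()
            raise FileNotFoundError(f"'{split}.txt' not found in '{zip_file}'")
    raise FileNotFoundError(
        f"Neither '{split}.txt' nor '{split}.txt.zip' found in '{annotation_root}'")

# Round-trip demo: write val.txt into a zip, then read it back.
root = tempfile.mkdtemp()
with zipfile.ZipFile(os.path.join(root, 'val.txt.zip'), 'w') as zf:
    zf.writestr('val.txt', 'img1.jpeg 0\nimg2.jpeg 1')
print(read_annotations(root, 'val'))  # ['img1.jpeg 0', 'img2.jpeg 1']
```

Shipping the annotation list as a zip keeps large `train.txt`/`val.txt` files small in the repository while leaving the parser's behavior unchanged.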