Commit 80069cad authored by zhe chen

Update train.txt and val.txt (#48)

parent 0552aa5e
@@ -54,6 +54,9 @@ pip install timm==0.6.11 mmdet==2.28.1
```bash
pip install opencv-python termcolor yacs pyyaml scipy
# Please use a version of numpy lower than 2.0
pip install numpy==1.26.4
pip install pydantic==1.10.13
```
- Compiling CUDA operators
@@ -72,13 +75,13 @@ python test.py
## Data Preparation
We use standard ImageNet dataset, you can download it from http://image-net.org/.
We provide the following two ways to load data:
We provide the following ways to prepare data:
<details open>
<summary>Standard ImageNet-1K</summary>
We use the standard ImageNet dataset, which you can download from http://image-net.org/.
- For the standard folder dataset, move the validation images into labeled sub-folders. The file structure should look like:
```bash
@@ -102,7 +105,6 @@ We provide the following two ways to load data:
│ ├── img6.jpeg
│ └── ...
└── ...
```
</details>
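Once the folder layout above is in place, it can be read by any standard ImageNet-style loader. As a quick sanity check, here is a minimal sketch using torchvision's `ImageFolder` (illustrative only; this is not the loader used by the training scripts, and the `data/imagenet` path is an assumption):

```python
# Illustrative check that the labeled sub-folder layout is readable.
# Assumes torchvision is installed and the dataset lives under data/imagenet.
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

val_dataset = datasets.ImageFolder('data/imagenet/val', transform=transform)
print(f'{len(val_dataset)} images across {len(val_dataset.classes)} classes')
```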
@@ -231,24 +233,33 @@ python -m torch.distributed.launch --nproc_per_node <num-of-gpus-to-use> --maste
## Manage Jobs with Slurm
For example, to train `InternImage` with 8 GPU on a single node for 300 epochs, run:
For example, to train or evaluate `InternImage` with 8 GPUs on a single node, run:
`InternImage-T`:
```bash
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_t_1k_224.yaml --resume internimage_t_1k_224.pth --eval
# Train for 300 epochs
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_t_1k_224.yaml
# Evaluate on ImageNet-1K
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_t_1k_224.yaml --resume pretrained/internimage_t_1k_224.pth --eval
```
`InternImage-S`:
```bash
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_s_1k_224.yaml --resume internimage_s_1k_224.pth --eval
# Train for 300 epochs
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_s_1k_224.yaml
# Evaluate on ImageNet-1K
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_s_1k_224.yaml --resume pretrained/internimage_s_1k_224.pth --eval
```
`InternImage-XL`:
```bash
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_xl_22kto1k_384.pth --resume internimage_xl_22kto1k_384.pth --eval
# Train for 300 epochs
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_xl_22kto1k_384.yaml
# Evaluate on ImageNet-1K
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_xl_22kto1k_384.yaml --resume pretrained/internimage_xl_22kto1k_384.pth --eval
```
<!--
@@ -275,7 +286,7 @@ python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.
## Training with DeepSpeed
We support utilizing [Deepspeed](https://github.com/microsoft/DeepSpeed) to reduce memory costs for training large-scale models, e.g. InternImage-H with over 1 billion parameters.
We support utilizing [DeepSpeed](https://github.com/microsoft/DeepSpeed) to reduce memory costs for training large-scale models, e.g. InternImage-H with over 1 billion parameters.
To use it, first install the requirements as follows:
```bash
@@ -286,23 +297,23 @@ Then you could launch the training in a slurm system with 8 GPUs as follows (tin
The default ZeRO stage is 1, and it can be configured via the command-line argument `--zero-stage`.
```
GPUS=8 GPUS_PER_NODE=8 sh train_in1k_deepspeed.sh INTERN2 train configs/internimage_t_1k_224.yaml --batch-size 128 --accumulation-steps 4
GPUS=8 GPUS_PER_NODE=8 sh train_in1k_deepspeed.sh INTERN2 train configs/internimage_t_1k_224.yaml --batch-size 128 --accumulation-steps 4 --eval --resume ckpt.pth
GPUS=8 GPUS_PER_NODE=8 sh train_in1k_deepspeed.sh INTERN2 train configs/internimage_t_1k_224.yaml --batch-size 128 --accumulation-steps 4 --eval --resume deepspeed_ckpt_dir
GPUS=8 GPUS_PER_NODE=8 sh train_in1k_deepspeed.sh INTERN2 train configs/internimage_h_22kto1k_640.yaml --batch-size 16 --accumulation-steps 4 --pretrained ckpt/internimage_h_jointto22k_384.pth
GPUS=8 GPUS_PER_NODE=8 sh train_in1k_deepspeed.sh INTERN2 train configs/internimage_h_22kto1k_640.yaml --batch-size 16 --accumulation-steps 4 --pretrained ckpt/internimage_h_jointto22k_384.pth --zero-stage 3
GPUS=8 GPUS_PER_NODE=8 sh train_in1k_deepspeed.sh <partition> <job-name> configs/internimage_t_1k_224.yaml --batch-size 128 --accumulation-steps 4
GPUS=8 GPUS_PER_NODE=8 sh train_in1k_deepspeed.sh <partition> <job-name> configs/internimage_t_1k_224.yaml --batch-size 128 --accumulation-steps 4 --eval --resume ckpt.pth
GPUS=8 GPUS_PER_NODE=8 sh train_in1k_deepspeed.sh <partition> <job-name> configs/internimage_t_1k_224.yaml --batch-size 128 --accumulation-steps 4 --eval --resume deepspeed_ckpt_dir
GPUS=8 GPUS_PER_NODE=8 sh train_in1k_deepspeed.sh <partition> <job-name> configs/internimage_h_22kto1k_640.yaml --batch-size 16 --accumulation-steps 4 --pretrained pretrained/internimage_h_jointto22k_384.pth
GPUS=8 GPUS_PER_NODE=8 sh train_in1k_deepspeed.sh <partition> <job-name> configs/internimage_h_22kto1k_640.yaml --batch-size 16 --accumulation-steps 4 --pretrained pretrained/internimage_h_jointto22k_384.pth --zero-stage 3
```
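For reference, the `--zero-stage` flag selects DeepSpeed's ZeRO optimization stage. A rough sketch of the DeepSpeed config fields it corresponds to (illustrative only; the actual config assembled by `train_in1k_deepspeed.sh` may differ):

```python
# Illustrative DeepSpeed config fragment; the launch script may build its own.
ds_config = {
    'train_micro_batch_size_per_gpu': 128,  # matches --batch-size
    'gradient_accumulation_steps': 4,       # matches --accumulation-steps
    'zero_optimization': {
        # Stage 1 partitions optimizer states; stage 3 also partitions
        # gradients and parameters, trading speed for memory.
        'stage': 1,
    },
}
```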
🤗 **Huggingface Accelerate Integration of DeepSpeed**
🤗 **HuggingFace Accelerate Integration of DeepSpeed**
Optionally, you could use our [Huggingface Accelerate](https://github.com/huggingface/accelerate) integration to use DeepSpeed.
Optionally, you can use our [HuggingFace Accelerate](https://github.com/huggingface/accelerate) integration to run DeepSpeed.
```bash
pip install accelerate==0.18.0
```
```bash
accelerate launch --config_file configs/accelerate/dist_8gpus_zero3_wo_loss_scale.yaml main_accelerate.py --cfg configs/internimage_h_22kto1k_640.yaml --data-path data/imagenet --batch-size 16 --pretrained ckpt/internimage_h_jointto22k_384.pth --accumulation-steps 4
accelerate launch --config_file configs/accelerate/dist_8gpus_zero3_wo_loss_scale.yaml main_accelerate.py --cfg configs/internimage_h_22kto1k_640.yaml --data-path data/imagenet --batch-size 16 --pretrained pretrained/internimage_h_jointto22k_384.pth --accumulation-steps 4
accelerate launch --config_file configs/accelerate/dist_8gpus_zero3_offload.yaml main_accelerate.py --cfg configs/internimage_t_1k_224.yaml --data-path data/imagenet --batch-size 128 --accumulation-steps 4 --output output_zero3_offload
accelerate launch --config_file configs/accelerate/dist_8gpus_zero1.yaml main_accelerate.py --cfg configs/internimage_t_1k_224.yaml --data-path data/imagenet --batch-size 128 --accumulation-steps 4
```
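With gradient accumulation, the effective global batch size is the per-GPU batch size × number of GPUs × accumulation steps. For example, assuming `--batch-size` is the per-GPU batch size and the 8-GPU Accelerate config is used, the InternImage-H command above trains with an effective batch size of 16 × 8 × 4 = 512.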
@@ -12,6 +12,7 @@ import os
import os.path as osp
import re
import time
import zipfile
from abc import abstractmethod
import mmcv
@@ -382,8 +383,23 @@ class ParserCephImage(Parser):
        else:
            self.io_backend = 'disk'
        self.class_to_idx = None
        with open(osp.join(annotation_root, f'{split}.txt'), 'r') as f:
            self.samples = f.read().splitlines()
        txt_file = osp.join(annotation_root, f'{split}.txt')
        zip_file = osp.join(annotation_root, f'{split}.txt.zip')
        if osp.exists(txt_file):
            with open(txt_file, 'r') as f:
                self.samples = f.read().splitlines()
        elif osp.exists(zip_file):
            with zipfile.ZipFile(zip_file, 'r') as zf:
                file_list = zf.namelist()
                if f'{split}.txt' in file_list:
                    with zf.open(f'{split}.txt') as f:
                        self.samples = f.read().decode('utf-8').splitlines()
                else:
                    raise FileNotFoundError(f"'{split}.txt' not found in '{zip_file}'")
        else:
            raise FileNotFoundError(f"Neither '{split}.txt' nor '{split}.txt.zip' found in '{annotation_root}'")
        local_rank = None
        local_size = None
        self._consecutive_errors = 0
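If you only have the plain `train.txt`/`val.txt` annotation files, the zipped variant that the fallback above reads can be created with the standard library. A minimal sketch (paths are illustrative):

```python
# Package {split}.txt into {split}.txt.zip so the parser's zip fallback can read it.
import zipfile

for split in ('train', 'val'):
    with zipfile.ZipFile(f'{split}.txt.zip', 'w', zipfile.ZIP_DEFLATED) as zf:
        # The arcname must be exactly '{split}.txt' so it appears in namelist().
        zf.write(f'{split}.txt', arcname=f'{split}.txt')
```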