Commit 01b319fe authored by zhe chen

Release VOC model


parent 9baa7cc4
@@ -115,7 +115,7 @@ Prepare datasets according to the guidelines in [MMDetection v2.28.1](https://gi
 </details>
-<details open>
+<details>
 <summary> Dataset: LVIS </summary>
 <br>
 <div>
@@ -128,7 +128,7 @@ Prepare datasets according to the guidelines in [MMDetection v2.28.1](https://gi
 </details>
-<details open>
+<details>
 <summary> Dataset: OpenImages </summary>
 <br>
 <div>
@@ -141,6 +141,19 @@ Prepare datasets according to the guidelines in [MMDetection v2.28.1](https://gi
 </details>
+<details>
+<summary> Dataset: VOC 2007 & 2012 </summary>
+<br>
+<div>
+| method | backbone | VOC 2007 | VOC 2012 | #param | Config | Download |
+| :----: | :-----------: | :------: | :------: | :----: | :---------------------------------------------------------------------: | :------------------------------------------------------------------------------------------------------------------: |
+| DINO | InternImage-H | 94.0 | 97.2 | 2.18B | [config](./configs/voc/dino_4scale_cbinternimage_h_objects365_voc07.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/dino_4scale_cbinternimage_h_objects365_voc0712.pth) |
+</div>
+</details>
 ## Evaluation
 To evaluate our `InternImage` on COCO val, run:
......
# PASCAL VOC
## Introduction
PASCAL VOC 2007 is a widely used benchmark for object detection, classification, and segmentation. Released in 2007, it contains 9,963 images with 24,640 annotated objects across 20 categories, such as people, animals, and vehicles. The dataset is divided into training (2,501 images), validation (2,510 images), and test (4,952 images) sets. Each object is labeled with a class, a bounding box, and additional flags such as "difficult" or "truncated". Detection results on VOC 2007 are reported as mean Average Precision (mAP), a metric that remains standard for evaluating object detection models.
PASCAL VOC 2012, released in 2012, extends VOC 2007 with more images and annotations over the same 20 categories. Its train/val portion contains 11,540 images and 27,450 object instances, split into training (5,717 images) and validation (5,823 images) sets, while the labels of the test set are hidden. In addition to object detection and classification, VOC 2012 includes more detailed annotations for semantic segmentation. It remains a common benchmark for deep learning detection and segmentation models.
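To make the mAP numbers concrete, the sketch below computes a single-class Average Precision with the 11-point interpolation used by the VOC 2007 protocol (later VOC editions use the area under the precision-recall curve instead). mAP is the mean of this value over the 20 classes. The function name `voc07_ap` and the toy precision/recall values are illustrative assumptions, not code from this repository.

```python
def voc07_ap(recall, precision):
    """11-point interpolated AP for one class (illustrative sketch).

    `recall` and `precision` are parallel sequences obtained by sweeping
    the detection score threshold from high to low.
    """
    ap = 0.0
    for i in range(11):
        t = i / 10.0  # recall thresholds 0.0, 0.1, ..., 1.0
        # Interpolated precision: best precision at any recall >= t.
        candidates = [p for r, p in zip(recall, precision) if r >= t]
        ap += (max(candidates) if candidates else 0.0) / 11.0
    return ap


# Toy curve: 4 ground-truth boxes, detections ranked as TP, TP, FP, TP.
recall = [0.25, 0.5, 0.5, 0.75]
precision = [1.0, 1.0, 2 / 3, 0.75]
print(voc07_ap(recall, precision))
```

For the toy curve, six thresholds see a best precision of 1.0, two see 0.75, and the last three see no detections at all, so the result is 7.5 / 11.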
## Model Zoo
### DINO + CB-InternImage
| backbone | pretrain | VOC 2007 | VOC 2012 | #param | Config | Download |
| :--------------: | :--------: | :------: | :------: | :----: | :---------------------------------------------------------: | :------------------------------------------------------------------------------------------------------------------: |
| CB-InternImage-H | Objects365 | 94.0 | 97.2 | 2.18B | [config](./dino_4scale_cbinternimage_h_objects365_voc07.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/dino_4scale_cbinternimage_h_objects365_voc0712.pth) |
_base_ = [
    '../_base_/datasets/voc0712.py',
    '../_base_/default_runtime.py'
]
load_from = 'https://huggingface.co/OpenGVLab/InternImage/resolve/main/dino_4scale_cbinternimage_h_objects365_80classes.pth'
model = dict(
    type='CBDINO',
    backbone=dict(
        type='CBInternImage',
        core_op='DCNv3',
        channels=320,
        depths=[6, 6, 32, 6],
        groups=[10, 20, 40, 80],
        mlp_ratio=4.,
        drop_path_rate=0.5,
        norm_layer='LN',
        layer_scale=None,
        offset_scale=1.0,
        post_norm=False,
        dw_kernel_size=5,  # for InternImage-H/G
        res_post_norm=True,  # for InternImage-H/G
        level2_post_norm=True,  # for InternImage-H/G
        level2_post_norm_block_ids=[5, 11, 17, 23, 29],  # for InternImage-H/G
        center_feature_scale=True,  # for InternImage-H/G
        with_cp=True,
        out_indices=[(0, 1, 2, 3), (1, 2, 3)],
        init_cfg=None,
    ),
    neck=[dict(
        type='CBChannelMapper',
        in_channels=[640, 1280, 2560],
        kernel_size=1,
        out_channels=256,
        act_cfg=None,
        norm_cfg=dict(type='GN', num_groups=32),
        num_outs=4)],
    bbox_head=dict(
        type='CBDINOHead',
        num_query=900,
        num_classes=20,
        in_channels=2048,  # TODO
        sync_cls_avg_factor=True,
        as_two_stage=True,
        with_box_refine=True,
        dn_cfg=dict(
            type='CdnQueryGenerator',
            noise_scale=dict(label=0.5, box=1.0),  # 0.5, 0.4 for DN-DETR
            group_cfg=dict(dynamic=True, num_groups=None, num_dn_queries=1000)),
        transformer=dict(
            type='DinoTransformer',
            two_stage_num_proposals=900,
            encoder=dict(
                type='DetrTransformerEncoder',
                num_layers=6,
                transformerlayers=dict(
                    type='BaseTransformerLayer',
                    attn_cfgs=dict(
                        type='MultiScaleDeformableAttention',
                        embed_dims=256,
                        dropout=0.0),  # 0.1 for DeformDETR
                    feedforward_channels=2048,  # 1024 for DeformDETR
                    ffn_cfgs=dict(
                        type='FFN',
                        embed_dims=256,
                        feedforward_channels=2048,
                        num_fcs=2,
                        ffn_drop=0.,
                        use_checkpoint=True,
                        act_cfg=dict(type='ReLU', inplace=True),),
                    ffn_dropout=0.0,  # 0.1 for DeformDETR
                    operation_order=('self_attn', 'norm', 'ffn', 'norm'))),
            decoder=dict(
                type='DinoTransformerDecoder',
                num_layers=6,
                return_intermediate=True,
                transformerlayers=dict(
                    type='DetrTransformerDecoderLayer',
                    attn_cfgs=[
                        dict(
                            type='MultiheadAttention',
                            embed_dims=256,
                            num_heads=8,
                            dropout=0.0),  # 0.1 for DeformDETR
                        dict(
                            type='MultiScaleDeformableAttention',
                            num_levels=4,
                            embed_dims=256,
                            dropout=0.0),  # 0.1 for DeformDETR
                    ],
                    feedforward_channels=2048,  # 1024 for DeformDETR
                    ffn_cfgs=dict(
                        type='FFN',
                        embed_dims=256,
                        feedforward_channels=2048,
                        num_fcs=2,
                        ffn_drop=0.,
                        use_checkpoint=True,
                        act_cfg=dict(type='ReLU', inplace=True),),
                    ffn_dropout=0.0,  # 0.1 for DeformDETR
                    operation_order=('self_attn', 'norm', 'cross_attn', 'norm',
                                     'ffn', 'norm')))),
        positional_encoding=dict(
            type='SinePositionalEncoding',
            num_feats=128,
            temperature=20,
            normalize=True),
        loss_cls=dict(
            type='FocalLoss',
            use_sigmoid=True,
            gamma=2.0,
            alpha=0.25,
            loss_weight=1.0),  # 2.0 in DeformDETR
        loss_bbox=dict(type='L1Loss', loss_weight=5.0),
        loss_iou=dict(type='GIoULoss', loss_weight=2.0)),
    # training and testing settings
    train_cfg=dict(
        assigner=dict(
            type='HungarianAssigner',
            cls_cost=dict(type='FocalLossCost', weight=2.0),
            reg_cost=dict(type='BBoxL1Cost', weight=5.0, box_format='xywh'),
            iou_cost=dict(type='IoUCost', iou_mode='giou', weight=2.0)),
        snip_cfg=dict(
            type='v3',
            weight=0.1)),
    test_cfg=dict(max_per_img=300))  # TODO: Originally 100
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
# train_pipeline, NOTE the img_scale and the Pad's size_divisor is different
# from the default setting in mmdet.
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Resize',
         img_scale=[(2000, 600), (2000, 1200)],
         multiscale_mode='range',
         keep_ratio=True),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(2000, 1000),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=1,
    workers_per_gpu=2,
    train=dict(pipeline=train_pipeline),
    val=dict(pipeline=test_pipeline),
    test=dict(pipeline=test_pipeline))
    # test=dict(
    #     type='VOCDataset',
    #     ann_file='./data/VOCdevkit/VOC2012test/ImageSets/Main/test.txt',
    #     img_prefix='./data/VOCdevkit/VOC2012test/',
    #     pipeline=test_pipeline,))
# optimizer
optimizer = dict(
    type='AdamW', lr=0.0001/2, weight_decay=0.0001,
    constructor='CustomLayerDecayOptimizerConstructor',
    paramwise_cfg=dict(num_layers=50, layer_decay_rate=0.94,
                       depths=[6, 6, 32, 6], offset_lr_scale=1e-3))
optimizer_config = dict(grad_clip=dict(max_norm=0.1, norm_type=2))
# learning policy
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.001,
    step=[])
runner = dict(type='IterBasedRunner', max_iters=20000)
checkpoint_config = dict(interval=500, max_keep_ckpts=3)
evaluation = dict(interval=500, save_best='auto')
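The learning-rate settings in this config amount to a short linear warmup followed by a constant rate, because `step=[]` never triggers a decay. A minimal sketch of that schedule follows. It assumes MMCV's linear warmup rule and ignores the per-layer scaling applied by `CustomLayerDecayOptimizerConstructor`; the helper name `lr_at` is hypothetical.

```python
BASE_LR = 0.0001 / 2   # optimizer lr from the config
WARMUP_ITERS = 500     # lr_config.warmup_iters
WARMUP_RATIO = 0.001   # lr_config.warmup_ratio
MAX_ITERS = 20000      # runner.max_iters


def lr_at(iteration):
    """Base learning rate at a given iteration (illustrative sketch)."""
    if iteration < WARMUP_ITERS:
        # Linear warmup: start at BASE_LR * WARMUP_RATIO, reach BASE_LR
        # once `iteration` equals WARMUP_ITERS.
        k = (1 - iteration / WARMUP_ITERS) * (1 - WARMUP_RATIO)
        return BASE_LR * (1 - k)
    # No decay steps are configured, so the rate stays constant afterwards.
    return BASE_LR


for it in (0, 250, 500, MAX_ITERS - 1):
    print(it, lr_at(it))
```

Under these assumptions the rate starts at 5e-8, reaches 5e-5 after 500 iterations, and stays there until iteration 20000.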
_base_ = [
    '../_base_/datasets/voc0712.py',
    '../_base_/default_runtime.py'
]
load_from = 'https://huggingface.co/OpenGVLab/InternImage/resolve/main/dino_4scale_cbinternimage_h_objects365_80classes.pth'
model = dict(
    type='CBDINO',
    backbone=dict(
        type='CBInternImage',
        core_op='DCNv3',
        channels=320,
        depths=[6, 6, 32, 6],
        groups=[10, 20, 40, 80],
        mlp_ratio=4.,
        drop_path_rate=0.5,
        norm_layer='LN',
        layer_scale=None,
        offset_scale=1.0,
        post_norm=False,
        dw_kernel_size=5,  # for InternImage-H/G
        res_post_norm=True,  # for InternImage-H/G
        level2_post_norm=True,  # for InternImage-H/G
        level2_post_norm_block_ids=[5, 11, 17, 23, 29],  # for InternImage-H/G
        center_feature_scale=True,  # for InternImage-H/G
        with_cp=True,
        out_indices=[(0, 1, 2, 3), (1, 2, 3)],
        init_cfg=None,
    ),
    neck=[dict(
        type='CBChannelMapper',
        in_channels=[640, 1280, 2560],
        kernel_size=1,
        out_channels=256,
        act_cfg=None,
        norm_cfg=dict(type='GN', num_groups=32),
        num_outs=4)],
    bbox_head=dict(
        type='CBDINOHead',
        num_query=900,
        num_classes=20,
        in_channels=2048,  # TODO
        sync_cls_avg_factor=True,
        as_two_stage=True,
        with_box_refine=True,
        dn_cfg=dict(
            type='CdnQueryGenerator',
            noise_scale=dict(label=0.5, box=1.0),  # 0.5, 0.4 for DN-DETR
            group_cfg=dict(dynamic=True, num_groups=None, num_dn_queries=1000)),
        transformer=dict(
            type='DinoTransformer',
            two_stage_num_proposals=900,
            encoder=dict(
                type='DetrTransformerEncoder',
                num_layers=6,
                transformerlayers=dict(
                    type='BaseTransformerLayer',
                    attn_cfgs=dict(
                        type='MultiScaleDeformableAttention',
                        embed_dims=256,
                        dropout=0.0),  # 0.1 for DeformDETR
                    feedforward_channels=2048,  # 1024 for DeformDETR
                    ffn_cfgs=dict(
                        type='FFN',
                        embed_dims=256,
                        feedforward_channels=2048,
                        num_fcs=2,
                        ffn_drop=0.,
                        use_checkpoint=True,
                        act_cfg=dict(type='ReLU', inplace=True),),
                    ffn_dropout=0.0,  # 0.1 for DeformDETR
                    operation_order=('self_attn', 'norm', 'ffn', 'norm'))),
            decoder=dict(
                type='DinoTransformerDecoder',
                num_layers=6,
                return_intermediate=True,
                transformerlayers=dict(
                    type='DetrTransformerDecoderLayer',
                    attn_cfgs=[
                        dict(
                            type='MultiheadAttention',
                            embed_dims=256,
                            num_heads=8,
                            dropout=0.0),  # 0.1 for DeformDETR
                        dict(
                            type='MultiScaleDeformableAttention',
                            num_levels=4,
                            embed_dims=256,
                            dropout=0.0),  # 0.1 for DeformDETR
                    ],
                    feedforward_channels=2048,  # 1024 for DeformDETR
                    ffn_cfgs=dict(
                        type='FFN',
                        embed_dims=256,
                        feedforward_channels=2048,
                        num_fcs=2,
                        ffn_drop=0.,
                        use_checkpoint=True,
                        act_cfg=dict(type='ReLU', inplace=True),),
                    ffn_dropout=0.0,  # 0.1 for DeformDETR
                    operation_order=('self_attn', 'norm', 'cross_attn', 'norm',
                                     'ffn', 'norm')))),
        positional_encoding=dict(
            type='SinePositionalEncoding',
            num_feats=128,
            temperature=20,
            normalize=True),
        loss_cls=dict(
            type='FocalLoss',
            use_sigmoid=True,
            gamma=2.0,
            alpha=0.25,
            loss_weight=1.0),  # 2.0 in DeformDETR
        loss_bbox=dict(type='L1Loss', loss_weight=5.0),
        loss_iou=dict(type='GIoULoss', loss_weight=2.0)),
    # training and testing settings
    train_cfg=dict(
        assigner=dict(
            type='HungarianAssigner',
            cls_cost=dict(type='FocalLossCost', weight=2.0),
            reg_cost=dict(type='BBoxL1Cost', weight=5.0, box_format='xywh'),
            iou_cost=dict(type='IoUCost', iou_mode='giou', weight=2.0)),
        snip_cfg=dict(
            type='v3',
            weight=0.1)),
    test_cfg=dict(max_per_img=300))  # TODO: Originally 100
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
# train_pipeline, NOTE the img_scale and the Pad's size_divisor is different
# from the default setting in mmdet.
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Resize',
         img_scale=[(2000, 600), (2000, 1200)],
         multiscale_mode='range',
         keep_ratio=True),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(2000, 1000),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=1,
    workers_per_gpu=2,
    train=dict(pipeline=train_pipeline),
    val=dict(pipeline=test_pipeline),
    # test=dict(pipeline=test_pipeline))
    test=dict(
        type='VOCDataset',
        ann_file='./data/VOCdevkit/VOC2012test/ImageSets/Main/test.txt',
        img_prefix='./data/VOCdevkit/VOC2012test/',
        pipeline=test_pipeline,))
# optimizer
optimizer = dict(
    type='AdamW', lr=0.0001/2, weight_decay=0.0001,
    constructor='CustomLayerDecayOptimizerConstructor',
    paramwise_cfg=dict(num_layers=50, layer_decay_rate=0.94,
                       depths=[6, 6, 32, 6], offset_lr_scale=1e-3))
optimizer_config = dict(grad_clip=dict(max_norm=0.1, norm_type=2))
# learning policy
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.001,
    step=[])
runner = dict(type='IterBasedRunner', max_iters=20000)
checkpoint_config = dict(interval=500, max_keep_ckpts=3)
evaluation = dict(interval=500, save_best='auto')
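Both pipelines resize with `keep_ratio=True`, so `img_scale` bounds the long and short edges rather than fixing an exact output size. The sketch below illustrates that rule. It assumes the MMCV-style rescale, where the scale factor is the smaller of the two edge ratios; `rescaled_size` is a hypothetical helper and the 500x375 input is only an example. During training with `multiscale_mode='range'`, the short-edge bound would additionally be sampled between 600 and 1200 for each image.

```python
def rescaled_size(width, height, img_scale):
    """Output (width, height) for a keep-ratio resize (illustrative sketch)."""
    max_long, max_short = max(img_scale), min(img_scale)
    # Pick the largest factor that keeps both edges within their bounds.
    factor = min(max_long / max(width, height),
                 max_short / min(width, height))
    return int(width * factor + 0.5), int(height * factor + 0.5)


print(rescaled_size(500, 375, (2000, 1000)))  # test-time scale
print(rescaled_size(500, 375, (2000, 600)))   # smallest training scale
print(rescaled_size(500, 375, (2000, 1200)))  # largest training scale
```

For an image of this shape the short-edge bound is the limiting one in all three cases, so the long-edge limit of 2000 is never reached.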
import argparse
import tarfile
from itertools import repeat
from multiprocessing.pool import ThreadPool
from pathlib import Path
from tarfile import TarFile
from zipfile import ZipFile
import torch
from mmcv.utils.path import mkdir_or_exist
def parse_args():
    parser = argparse.ArgumentParser(
        description='Download datasets for training')
    parser.add_argument(
        '--dataset-name', type=str, help='dataset name', default='coco2017')
    parser.add_argument(
        '--save-dir',
        type=str,
        help='the dir to save dataset',
        default='data/coco')
    parser.add_argument(
        '--unzip',
        action='store_true',
        help='whether to unzip the dataset; the zipped files are kept')
    parser.add_argument(
        '--delete',
        action='store_true',
        help='delete the downloaded zipped files')
    parser.add_argument(
        '--threads', type=int, help='number of threads', default=4)
    args = parser.parse_args()
    return args
def download(url, dir, unzip=True, delete=False, threads=1):

    def download_one(url, dir):
        f = dir / Path(url).name
        if Path(url).is_file():
            Path(url).rename(f)
        elif not f.exists():
            print(f'Downloading {url} to {f}')
            torch.hub.download_url_to_file(url, f, progress=True)
        if unzip and f.suffix in ('.zip', '.tar'):
            print(f'Unzipping {f.name}')
            if f.suffix == '.zip':
                ZipFile(f).extractall(path=dir)
            elif f.suffix == '.tar':
                TarFile(f).extractall(path=dir)
            if delete:
                f.unlink()
                print(f'Delete {f}')

    dir = Path(dir)
    if threads > 1:
        pool = ThreadPool(threads)
        pool.imap(lambda x: download_one(*x), zip(url, repeat(dir)))
        pool.close()
        pool.join()
    else:
        for u in [url] if isinstance(url, (str, Path)) else url:
            download_one(u, dir)
def download_objects365v2(url, dir, unzip=True, delete=False, threads=1):

    def download_single(url, dir):
        if 'train' in url:
            saving_dir = dir / Path('train_zip')
            mkdir_or_exist(saving_dir)
            f = saving_dir / Path(url).name
            unzip_dir = dir / Path('train')
            mkdir_or_exist(unzip_dir)
        elif 'val' in url:
            saving_dir = dir / Path('val')
            mkdir_or_exist(saving_dir)
            f = saving_dir / Path(url).name
            unzip_dir = dir / Path('val')
            mkdir_or_exist(unzip_dir)
        else:
            raise NotImplementedError

        if Path(url).is_file():
            Path(url).rename(f)
        elif not f.exists():
            print(f'Downloading {url} to {f}')
            torch.hub.download_url_to_file(url, f, progress=True)

        if unzip and str(f).endswith('.tar.gz'):
            print(f'Unzipping {f.name}')
            tar = tarfile.open(f)
            tar.extractall(path=unzip_dir)
            if delete:
                f.unlink()
                print(f'Delete {f}')

    # process annotations
    full_url = []
    for _url in url:
        if 'zhiyuan_objv2_train.tar.gz' in _url or \
                'zhiyuan_objv2_val.json' in _url:
            full_url.append(_url)
        elif 'train' in _url:
            for i in range(51):
                full_url.append(f'{_url}patch{i}.tar.gz')
        elif 'val/images/v1' in _url:
            for i in range(16):
                full_url.append(f'{_url}patch{i}.tar.gz')
        elif 'val/images/v2' in _url:
            for i in range(16, 44):
                full_url.append(f'{_url}patch{i}.tar.gz')
        else:
            raise NotImplementedError

    dir = Path(dir)
    if threads > 1:
        pool = ThreadPool(threads)
        pool.imap(lambda x: download_single(*x), zip(full_url, repeat(dir)))
        pool.close()
        pool.join()
    else:
        for u in full_url:
            download_single(u, dir)
def main():
    args = parse_args()
    path = Path(args.save_dir)
    if not path.exists():
        path.mkdir(parents=True, exist_ok=True)
    data2url = dict(
        # TODO: Support for downloading Panoptic Segmentation of COCO
        coco2017=[
            'http://images.cocodataset.org/zips/train2017.zip',
            'http://images.cocodataset.org/zips/val2017.zip',
            'http://images.cocodataset.org/zips/test2017.zip',
            'http://images.cocodataset.org/annotations/' +
            'annotations_trainval2017.zip'
        ],
        lvis=[
            'https://s3-us-west-2.amazonaws.com/dl.fbaipublicfiles.com/LVIS/lvis_v1_train.json.zip',  # noqa
            'https://s3-us-west-2.amazonaws.com/dl.fbaipublicfiles.com/LVIS/lvis_v1_val.json.zip',  # noqa
        ],
        voc2007=[
            'http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar',  # noqa
            'http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar',  # noqa
            'http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCdevkit_08-Jun-2007.tar',  # noqa
        ],
        # Note: There is no download link for Objects365-V1 right now. If you
        # would like to download Objects365-V1, please visit
        # http://www.objects365.org/ to contact the author.
        objects365v2=[
            # training annotations
            'https://dorc.ks3-cn-beijing.ksyun.com/data-set/2020Objects365%E6%95%B0%E6%8D%AE%E9%9B%86/train/zhiyuan_objv2_train.tar.gz',  # noqa
            # validation annotations
            'https://dorc.ks3-cn-beijing.ksyun.com/data-set/2020Objects365%E6%95%B0%E6%8D%AE%E9%9B%86/val/zhiyuan_objv2_val.json',  # noqa
            # training url root
            'https://dorc.ks3-cn-beijing.ksyun.com/data-set/2020Objects365%E6%95%B0%E6%8D%AE%E9%9B%86/train/',  # noqa
            # validation url root_1
            'https://dorc.ks3-cn-beijing.ksyun.com/data-set/2020Objects365%E6%95%B0%E6%8D%AE%E9%9B%86/val/images/v1/',  # noqa
            # validation url root_2
            'https://dorc.ks3-cn-beijing.ksyun.com/data-set/2020Objects365%E6%95%B0%E6%8D%AE%E9%9B%86/val/images/v2/'  # noqa
        ])
    url = data2url.get(args.dataset_name, None)
    if url is None:
        print('Only COCO, VOC, LVIS, and Objects365v2 are supported now!')
        return
    if args.dataset_name == 'objects365v2':
        download_objects365v2(
            url,
            dir=path,
            unzip=args.unzip,
            delete=args.delete,
            threads=args.threads)
    else:
        download(
            url,
            dir=path,
            unzip=args.unzip,
            delete=args.delete,
            threads=args.threads)


if __name__ == '__main__':
    main()
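
The threaded branch of `download` pairs every URL with the same target directory via `zip(url, repeat(dir))` and hands the pairs to a `ThreadPool`. The sketch below isolates that fan-out pattern with a stand-in worker that only computes the destination path, so it runs without network access. The worker `plan_target` and the example URLs are hypothetical, and `starmap` is used here instead of `imap` with a lambda purely for brevity.

```python
from itertools import repeat
from multiprocessing.pool import ThreadPool
from pathlib import Path


def plan_target(url, out_dir):
    """Stand-in for the real worker: return where the file would be saved."""
    return out_dir / Path(url).name


urls = [
    'http://example.com/archives/train.tar',
    'http://example.com/archives/test.tar',
]
out_dir = Path('data') / 'VOCdevkit'

# repeat(out_dir) yields the same directory for every URL, so zip() produces
# one (url, out_dir) argument pair per download task.
with ThreadPool(2) as pool:
    targets = pool.starmap(plan_target, zip(urls, repeat(out_dir)))

print(targets)
```

`starmap` blocks until all tasks finish and returns results in input order, which plays the role of the `close()` and `join()` calls in the script.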