Release VOC model

Commit 01b319fe authored Feb 26, 2025 by zhe chen
5 changed files
--- a/detection/README.md
+++ b/detection/README.md
@@ -115,7 +115,7 @@ Prepare datasets according to the guidelines in [MMDetection v2.28.1](https://gi
 </details>
-<details open>
+<details>
 <summary> Dataset: LVIS </summary>
 <br>
 <div>
@@ -128,7 +128,7 @@ Prepare datasets according to the guidelines in [MMDetection v2.28.1](https://gi
 </details>
-<details open>
+<details>
 <summary> Dataset: OpenImages </summary>
 <br>
 <div>
@@ -141,6 +141,19 @@ Prepare datasets according to the guidelines in [MMDetection v2.28.1](https://gi
 </details>
+<details>
+<summary> Dataset: VOC 2007 & 2012 </summary>
+<br>
+<div>
+| method |   backbone    | VOC 2007 | VOC 2012 | #param |                                 Config                                  |                                                       Download                                                       |
+| :----: | :-----------: | :------: | :------: | :----: | :---------------------------------------------------------------------: | :------------------------------------------------------------------------------------------------------------------: |
+|  DINO  | InternImage-H |   94.0   |   97.2   | 2.18B  | [config](./configs/voc/dino_4scale_cbinternimage_h_objects365_voc07.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/dino_4scale_cbinternimage_h_objects365_voc0712.pth) |
+</div>
+</details>
 ## Evaluation
 To evaluate our `InternImage` on COCO val, run:

--- a/detection/configs/voc/README.md
+++ b/detection/configs/voc/README.md
+# PASCAL VOC
+## Introduction
+PASCAL VOC 2007 is a widely used benchmark for object detection, classification, and segmentation. It contains 9,963 images with 24,640 annotated objects across 20 categories, such as people, animals, and vehicles. The images are divided into training (2,501), validation (2,510), and test (4,952) sets. Each object is labeled with a class, a bounding box, and attribute flags such as "difficult" and "truncated". Detection results on VOC 2007 are reported as mean Average Precision (mAP), which remains a standard metric for evaluating object detection models.
+PASCAL VOC 2012 extends VOC 2007 with more images and annotations over the same 20 categories. Its train/val portion contains 11,540 images and 27,450 object instances, split into training (5,717 images) and validation (5,823 images) sets; the labels of the test set are withheld. In addition to object detection and classification, VOC 2012 provides more extensive annotations for semantic segmentation. It remains a common benchmark for deep learning detection and segmentation models.
+## Model Zoo
+### DINO + CB-InternImage
+|     backbone     |  pretrain  | VOC 2007 | VOC 2012 | #param |                           Config                            |                                                       Download                                                       |
+| :--------------: | :--------: | :------: | :------: | :----: | :---------------------------------------------------------: | :------------------------------------------------------------------------------------------------------------------: |
+| CB-InternImage-H | Objects365 |   94.0   |   97.2   | 2.18B  | [config](./dino_4scale_cbinternimage_h_objects365_voc07.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/dino_4scale_cbinternimage_h_objects365_voc0712.pth) |
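The README above refers to mAP without spelling out how a per-class AP is obtained. As a rough illustration, the sketch below computes the 11-point interpolated AP that is conventionally associated with VOC 2007. The function name `voc07_11point_ap` and the toy precision/recall values are assumptions made for this example, and the commit does not state which AP variant produced the numbers in the table.

```python
def voc07_11point_ap(recalls, precisions):
    """11-point interpolated AP.

    Averages, over the recall thresholds 0.0, 0.1, ..., 1.0, the maximum
    precision among all points whose recall is at least the threshold
    (0 if no such point exists).
    """
    ap = 0.0
    for i in range(11):
        threshold = i / 10
        candidates = [p for r, p in zip(recalls, precisions) if r >= threshold]
        ap += max(candidates) if candidates else 0.0
    return ap / 11


# Toy precision/recall points, made up purely for illustration.
recalls = [0.2, 0.4, 0.6, 0.8, 1.0]
precisions = [1.0, 1.0, 0.75, 0.8, 0.5]
print(f"AP = {voc07_11point_ap(recalls, precisions):.4f}")
```

mAP is then the mean of the per-class AP values over the 20 categories.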
--- a/detection/configs/voc/dino_4scale_cbinternimage_h_objects365_voc07.py
+++ b/detection/configs/voc/dino_4scale_cbinternimage_h_objects365_voc07.py
+_base_ = [
+    '../_base_/datasets/voc0712.py',
+    '../_base_/default_runtime.py'
+]
+load_from = 'https://huggingface.co/OpenGVLab/InternImage/resolve/main/dino_4scale_cbinternimage_h_objects365_80classes.pth'
+model = dict(
+    type='CBDINO',
+    backbone=dict(
+        type='CBInternImage',
+        core_op='DCNv3',
+        channels=320,
+        depths=[6, 6, 32, 6],
+        groups=[10, 20, 40, 80],
+        mlp_ratio=4.,
+        drop_path_rate=0.5,
+        norm_layer='LN',
+        layer_scale=None,
+        offset_scale=1.0,
+        post_norm=False,
+        dw_kernel_size=5,  # for InternImage-H/G
+        res_post_norm=True,  # for InternImage-H/G
+        level2_post_norm=True,  # for InternImage-H/G
+        level2_post_norm_block_ids=[5, 11, 17, 23, 29],  # for InternImage-H/G
+        center_feature_scale=True,  # for InternImage-H/G
+        with_cp=True,
+        out_indices=[(0, 1, 2, 3), (1, 2, 3)],
+        init_cfg=None,
+    ),
+    neck=[dict(
+        type='CBChannelMapper',
+        in_channels=[640, 1280, 2560],
+        kernel_size=1,
+        out_channels=256,
+        act_cfg=None,
+        norm_cfg=dict(type='GN', num_groups=32),
+        num_outs=4)],
+    bbox_head=dict(
+        type='CBDINOHead',
+        num_query=900,
+        num_classes=20,
+        in_channels=2048,  # TODO
+        sync_cls_avg_factor=True,
+        as_two_stage=True,
+        with_box_refine=True,
+        dn_cfg=dict(
+            type='CdnQueryGenerator',
+            noise_scale=dict(label=0.5, box=1.0),  # 0.5, 0.4 for DN-DETR
+            group_cfg=dict(dynamic=True, num_groups=None, num_dn_queries=1000)),
+        transformer=dict(
+            type='DinoTransformer',
+            two_stage_num_proposals=900,
+            encoder=dict(
+                type='DetrTransformerEncoder',
+                num_layers=6,
+                transformerlayers=dict(
+                    type='BaseTransformerLayer',
+                    attn_cfgs=dict(
+                        type='MultiScaleDeformableAttention',
+                        embed_dims=256,
+                        dropout=0.0),  # 0.1 for DeformDETR
+                    feedforward_channels=2048,  # 1024 for DeformDETR
+                    ffn_cfgs=dict(
+                        type='FFN',
+                        embed_dims=256,
+                        feedforward_channels=2048,
+                        num_fcs=2,
+                        ffn_drop=0.,
+                        use_checkpoint=True,
+                        act_cfg=dict(type='ReLU', inplace=True),),
+                    ffn_dropout=0.0,  # 0.1 for DeformDETR
+                    operation_order=('self_attn', 'norm', 'ffn', 'norm'))),
+            decoder=dict(
+                type='DinoTransformerDecoder',
+                num_layers=6,
+                return_intermediate=True,
+                transformerlayers=dict(
+                    type='DetrTransformerDecoderLayer',
+                    attn_cfgs=[
+                        dict(
+                            type='MultiheadAttention',
+                            embed_dims=256,
+                            num_heads=8,
+                            dropout=0.0),  # 0.1 for DeformDETR
+                        dict(
+                            type='MultiScaleDeformableAttention',
+                            num_levels=4,
+                            embed_dims=256,
+                            dropout=0.0),  # 0.1 for DeformDETR
+                    ],
+                    feedforward_channels=2048,  # 1024 for DeformDETR
+                    ffn_cfgs=dict(
+                        type='FFN',
+                        embed_dims=256,
+                        feedforward_channels=2048,
+                        num_fcs=2,
+                        ffn_drop=0.,
+                        use_checkpoint=True,
+                        act_cfg=dict(type='ReLU', inplace=True),),
+                    ffn_dropout=0.0,  # 0.1 for DeformDETR
+                    operation_order=('self_attn', 'norm', 'cross_attn', 'norm',
+                                     'ffn', 'norm')))),
+        positional_encoding=dict(
+            type='SinePositionalEncoding',
+            num_feats=128,
+            temperature=20,
+            normalize=True),
+        loss_cls=dict(
+            type='FocalLoss',
+            use_sigmoid=True,
+            gamma=2.0,
+            alpha=0.25,
+            loss_weight=1.0),  # 2.0 in DeformDETR
+        loss_bbox=dict(type='L1Loss', loss_weight=5.0),
+        loss_iou=dict(type='GIoULoss', loss_weight=2.0)),
+    # training and testing settings
+    train_cfg=dict(
+        assigner=dict(
+            type='HungarianAssigner',
+            cls_cost=dict(type='FocalLossCost', weight=2.0),
+            reg_cost=dict(type='BBoxL1Cost', weight=5.0, box_format='xywh'),
+            iou_cost=dict(type='IoUCost', iou_mode='giou', weight=2.0)),
+        snip_cfg=dict(
+            type='v3',
+            weight=0.1)),
+    test_cfg=dict(max_per_img=300))  # TODO: Originally 100
+img_norm_cfg = dict(
+    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
+# train_pipeline, NOTE the img_scale and the Pad's size_divisor are different
+# from the default setting in mmdet.
+train_pipeline = [
+    dict(type='LoadImageFromFile'),
+    dict(type='LoadAnnotations', with_bbox=True),
+    dict(type='RandomFlip', flip_ratio=0.5),
+    dict(type='Resize',
+         img_scale=[(2000, 600), (2000, 1200)],
+         multiscale_mode='range',
+         keep_ratio=True),
+    dict(type='Normalize', **img_norm_cfg),
+    dict(type='Pad', size_divisor=32),
+    dict(type='DefaultFormatBundle'),
+    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
+]
+test_pipeline = [
+    dict(type='LoadImageFromFile'),
+    dict(
+        type='MultiScaleFlipAug',
+        img_scale=(2000, 1000),
+        flip=False,
+        transforms=[
+            dict(type='Resize', keep_ratio=True),
+            dict(type='RandomFlip'),
+            dict(type='Normalize', **img_norm_cfg),
+            dict(type='Pad', size_divisor=32),
+            dict(type='ImageToTensor', keys=['img']),
+            dict(type='Collect', keys=['img'])
+        ])
+]
+data = dict(
+    samples_per_gpu=1,
+    workers_per_gpu=2,
+    train=dict(pipeline=train_pipeline),
+    val=dict(pipeline=test_pipeline),
+    test=dict(pipeline=test_pipeline))
+    # test=dict(
+    #     type='VOCDataset',
+    #     ann_file='./data/VOCdevkit/VOC2012test/ImageSets/Main/test.txt',
+    #     img_prefix='./data/VOCdevkit/VOC2012test/',
+    #     pipeline=test_pipeline,))
+# optimizer
+optimizer = dict(
+    type='AdamW', lr=0.0001/2, weight_decay=0.0001,
+    constructor='CustomLayerDecayOptimizerConstructor',
+    paramwise_cfg=dict(num_layers=50, layer_decay_rate=0.94,
+                       depths=[6, 6, 32, 6], offset_lr_scale=1e-3))
+optimizer_config = dict(grad_clip=dict(max_norm=0.1, norm_type=2))
+# learning policy
+lr_config = dict(
+    policy='step',
+    warmup='linear',
+    warmup_iters=500,
+    warmup_ratio=0.001,
+    step=[])
+runner = dict(type='IterBasedRunner', max_iters=20000)
+checkpoint_config = dict(interval=500, max_keep_ckpts=3)
+evaluation = dict(interval=500, save_best='auto')
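With `policy='step'` and an empty `step` list, the schedule in the config above amounts to a linear warmup followed by a constant learning rate. The sketch below reproduces that shape for the base learning rate, assuming the linear warmup formula commonly used by MMCV 1.x (`lr * (1 - (1 - it / warmup_iters) * (1 - warmup_ratio))`). The helper name `lr_at_iter` is hypothetical, and the per-layer scaling applied by `CustomLayerDecayOptimizerConstructor` is deliberately ignored.

```python
def lr_at_iter(cur_iter, base_lr=0.0001 / 2, warmup_iters=500,
               warmup_ratio=0.001):
    """Base learning rate at a given iteration (layer decay not applied)."""
    if cur_iter < warmup_iters:
        # Linear warmup: starts at base_lr * warmup_ratio, ends at base_lr.
        k = (1 - cur_iter / warmup_iters) * (1 - warmup_ratio)
        return base_lr * (1 - k)
    # step=[] means no decay steps, so the rate stays constant afterwards.
    return base_lr


for it in (0, 250, 500, 19999):
    print(it, lr_at_iter(it))
```

Under these assumptions the rate starts near `5e-8`, reaches `5e-5` at iteration 500, and stays there until `max_iters=20000`.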
--- a/detection/configs/voc/dino_4scale_cbinternimage_h_objects365_voc12.py
+++ b/detection/configs/voc/dino_4scale_cbinternimage_h_objects365_voc12.py
+_base_ = [
+    '../_base_/datasets/voc0712.py',
+    '../_base_/default_runtime.py'
+]
+load_from = 'https://huggingface.co/OpenGVLab/InternImage/resolve/main/dino_4scale_cbinternimage_h_objects365_80classes.pth'
+model = dict(
+    type='CBDINO',
+    backbone=dict(
+        type='CBInternImage',
+        core_op='DCNv3',
+        channels=320,
+        depths=[6, 6, 32, 6],
+        groups=[10, 20, 40, 80],
+        mlp_ratio=4.,
+        drop_path_rate=0.5,
+        norm_layer='LN',
+        layer_scale=None,
+        offset_scale=1.0,
+        post_norm=False,
+        dw_kernel_size=5,  # for InternImage-H/G
+        res_post_norm=True,  # for InternImage-H/G
+        level2_post_norm=True,  # for InternImage-H/G
+        level2_post_norm_block_ids=[5, 11, 17, 23, 29],  # for InternImage-H/G
+        center_feature_scale=True,  # for InternImage-H/G
+        with_cp=True,
+        out_indices=[(0, 1, 2, 3), (1, 2, 3)],
+        init_cfg=None,
+    ),
+    neck=[dict(
+        type='CBChannelMapper',
+        in_channels=[640, 1280, 2560],
+        kernel_size=1,
+        out_channels=256,
+        act_cfg=None,
+        norm_cfg=dict(type='GN', num_groups=32),
+        num_outs=4)],
+    bbox_head=dict(
+        type='CBDINOHead',
+        num_query=900,
+        num_classes=20,
+        in_channels=2048,  # TODO
+        sync_cls_avg_factor=True,
+        as_two_stage=True,
+        with_box_refine=True,
+        dn_cfg=dict(
+            type='CdnQueryGenerator',
+            noise_scale=dict(label=0.5, box=1.0),  # 0.5, 0.4 for DN-DETR
+            group_cfg=dict(dynamic=True, num_groups=None, num_dn_queries=1000)),
+        transformer=dict(
+            type='DinoTransformer',
+            two_stage_num_proposals=900,
+            encoder=dict(
+                type='DetrTransformerEncoder',
+                num_layers=6,
+                transformerlayers=dict(
+                    type='BaseTransformerLayer',
+                    attn_cfgs=dict(
+                        type='MultiScaleDeformableAttention',
+                        embed_dims=256,
+                        dropout=0.0),  # 0.1 for DeformDETR
+                    feedforward_channels=2048,  # 1024 for DeformDETR
+                    ffn_cfgs=dict(
+                        type='FFN',
+                        embed_dims=256,
+                        feedforward_channels=2048,
+                        num_fcs=2,
+                        ffn_drop=0.,
+                        use_checkpoint=True,
+                        act_cfg=dict(type='ReLU', inplace=True),),
+                    ffn_dropout=0.0,  # 0.1 for DeformDETR
+                    operation_order=('self_attn', 'norm', 'ffn', 'norm'))),
+            decoder=dict(
+                type='DinoTransformerDecoder',
+                num_layers=6,
+                return_intermediate=True,
+                transformerlayers=dict(
+                    type='DetrTransformerDecoderLayer',
+                    attn_cfgs=[
+                        dict(
+                            type='MultiheadAttention',
+                            embed_dims=256,
+                            num_heads=8,
+                            dropout=0.0),  # 0.1 for DeformDETR
+                        dict(
+                            type='MultiScaleDeformableAttention',
+                            num_levels=4,
+                            embed_dims=256,
+                            dropout=0.0),  # 0.1 for DeformDETR
+                    ],
+                    feedforward_channels=2048,  # 1024 for DeformDETR
+                    ffn_cfgs=dict(
+                        type='FFN',
+                        embed_dims=256,
+                        feedforward_channels=2048,
+                        num_fcs=2,
+                        ffn_drop=0.,
+                        use_checkpoint=True,
+                        act_cfg=dict(type='ReLU', inplace=True),),
+                    ffn_dropout=0.0,  # 0.1 for DeformDETR
+                    operation_order=('self_attn', 'norm', 'cross_attn', 'norm',
+                                     'ffn', 'norm')))),
+        positional_encoding=dict(
+            type='SinePositionalEncoding',
+            num_feats=128,
+            temperature=20,
+            normalize=True),
+        loss_cls=dict(
+            type='FocalLoss',
+            use_sigmoid=True,
+            gamma=2.0,
+            alpha=0.25,
+            loss_weight=1.0),  # 2.0 in DeformDETR
+        loss_bbox=dict(type='L1Loss', loss_weight=5.0),
+        loss_iou=dict(type='GIoULoss', loss_weight=2.0)),
+    # training and testing settings
+    train_cfg=dict(
+        assigner=dict(
+            type='HungarianAssigner',
+            cls_cost=dict(type='FocalLossCost', weight=2.0),
+            reg_cost=dict(type='BBoxL1Cost', weight=5.0, box_format='xywh'),
+            iou_cost=dict(type='IoUCost', iou_mode='giou', weight=2.0)),
+        snip_cfg=dict(
+            type='v3',
+            weight=0.1)),
+    test_cfg=dict(max_per_img=300))  # TODO: Originally 100
+img_norm_cfg = dict(
+    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
+# train_pipeline, NOTE the img_scale and the Pad's size_divisor are different
+# from the default setting in mmdet.
+train_pipeline = [
+    dict(type='LoadImageFromFile'),
+    dict(type='LoadAnnotations', with_bbox=True),
+    dict(type='RandomFlip', flip_ratio=0.5),
+    dict(type='Resize',
+         img_scale=[(2000, 600), (2000, 1200)],
+         multiscale_mode='range',
+         keep_ratio=True),
+    dict(type='Normalize', **img_norm_cfg),
+    dict(type='Pad', size_divisor=32),
+    dict(type='DefaultFormatBundle'),
+    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
+]
+test_pipeline = [
+    dict(type='LoadImageFromFile'),
+    dict(
+        type='MultiScaleFlipAug',
+        img_scale=(2000, 1000),
+        flip=False,
+        transforms=[
+            dict(type='Resize', keep_ratio=True),
+            dict(type='RandomFlip'),
+            dict(type='Normalize', **img_norm_cfg),
+            dict(type='Pad', size_divisor=32),
+            dict(type='ImageToTensor', keys=['img']),
+            dict(type='Collect', keys=['img'])
+        ])
+]
+data = dict(
+    samples_per_gpu=1,
+    workers_per_gpu=2,
+    train=dict(pipeline=train_pipeline),
+    val=dict(pipeline=test_pipeline),
+    # test=dict(pipeline=test_pipeline))
+    test=dict(
+        type='VOCDataset',
+        ann_file='./data/VOCdevkit/VOC2012test/ImageSets/Main/test.txt',
+        img_prefix='./data/VOCdevkit/VOC2012test/',
+        pipeline=test_pipeline,))
+# optimizer
+optimizer = dict(
+    type='AdamW', lr=0.0001/2, weight_decay=0.0001,
+    constructor='CustomLayerDecayOptimizerConstructor',
+    paramwise_cfg=dict(num_layers=50, layer_decay_rate=0.94,
+                       depths=[6, 6, 32, 6], offset_lr_scale=1e-3))
+optimizer_config = dict(grad_clip=dict(max_norm=0.1, norm_type=2))
+# learning policy
+lr_config = dict(
+    policy='step',
+    warmup='linear',
+    warmup_iters=500,
+    warmup_ratio=0.001,
+    step=[])
+runner = dict(type='IterBasedRunner', max_iters=20000)
+checkpoint_config = dict(interval=500, max_keep_ckpts=3)
+evaluation = dict(interval=500, save_best='auto')
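Both configs train with `img_scale=[(2000, 600), (2000, 1200)]`, `multiscale_mode='range'`, and `keep_ratio=True`. The sketch below shows one way to read that setting, assuming the behaviour usually attributed to the MMDetection 2.x `Resize` transform: the long and short edge limits are sampled independently from the given ranges, then the image is scaled by the largest factor that respects both limits. The helper names `sample_range_scale` and `rescale_keep_ratio` are hypothetical and not part of MMDetection.

```python
import random


def sample_range_scale(img_scales, rng=random):
    """Sample (long_edge, short_edge) limits from a list of (w, h) scales."""
    long_edges = [max(s) for s in img_scales]
    short_edges = [min(s) for s in img_scales]
    # randint is inclusive on both ends.
    long_edge = rng.randint(min(long_edges), max(long_edges))
    short_edge = rng.randint(min(short_edges), max(short_edges))
    return long_edge, short_edge


def rescale_keep_ratio(width, height, scale):
    """Resize (width, height) to fit within `scale`, keeping aspect ratio."""
    max_long, max_short = max(scale), min(scale)
    factor = min(max_long / max(width, height),
                 max_short / min(width, height))
    return int(width * factor + 0.5), int(height * factor + 0.5)


scale = sample_range_scale([(2000, 600), (2000, 1200)])
print(scale, rescale_keep_ratio(500, 375, scale))
```

For a typical 500x375 VOC image the short edge is the binding limit, so the sampled short edge (600 to 1200) effectively sets the training resolution.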
--- a/detection/tools/download_dataset.py
+++ b/detection/tools/download_dataset.py
+import argparse
+import tarfile
+from itertools import repeat
+from multiprocessing.pool import ThreadPool
+from pathlib import Path
+from tarfile import TarFile
+from zipfile import ZipFile
+import torch
+from mmcv.utils.path import mkdir_or_exist
+def parse_args():
+    parser = argparse.ArgumentParser(
+        description='Download datasets for training')
+    parser.add_argument(
+        '--dataset-name', type=str, help='dataset name', default='coco2017')
+    parser.add_argument(
+        '--save-dir',
+        type=str,
+        help='the dir to save dataset',
+        default='data/coco')
+    parser.add_argument(
+        '--unzip',
+        action='store_true',
+        help='whether to unzip the dataset; zipped files are kept unless --delete is set')
+    parser.add_argument(
+        '--delete',
+        action='store_true',
+        help='delete the downloaded zipped files after unzipping')
+    parser.add_argument(
+        '--threads', type=int, help='number of threads', default=4)
+    args = parser.parse_args()
+    return args
+def download(url, dir, unzip=True, delete=False, threads=1):
+    def download_one(url, dir):
+        f = dir / Path(url).name
+        if Path(url).is_file():
+            Path(url).rename(f)
+        elif not f.exists():
+            print(f'Downloading {url} to {f}')
+            torch.hub.download_url_to_file(url, f, progress=True)
+        if unzip and f.suffix in ('.zip', '.tar'):
+            print(f'Unzipping {f.name}')
+            if f.suffix == '.zip':
+                ZipFile(f).extractall(path=dir)
+            elif f.suffix == '.tar':
+                TarFile(f).extractall(path=dir)
+            if delete:
+                f.unlink()
+                print(f'Delete {f}')
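+    dir = Path(dir)
+    # Normalize a single URL to a list so both branches iterate over URLs,
+    # not over the characters of a string.
+    urls = [url] if isinstance(url, (str, Path)) else url
+    if threads > 1:
+        pool = ThreadPool(threads)
+        pool.imap(lambda x: download_one(*x), zip(urls, repeat(dir)))
+        pool.close()
+        pool.join()
+    else:
+        for u in urls:
+            download_one(u, dir)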
+def download_objects365v2(url, dir, unzip=True, delete=False, threads=1):
+    def download_single(url, dir):
+        if 'train' in url:
+            saving_dir = dir / Path('train_zip')
+            mkdir_or_exist(saving_dir)
+            f = saving_dir / Path(url).name
+            unzip_dir = dir / Path('train')
+            mkdir_or_exist(unzip_dir)
+        elif 'val' in url:
+            saving_dir = dir / Path('val')
+            mkdir_or_exist(saving_dir)
+            f = saving_dir / Path(url).name
+            unzip_dir = dir / Path('val')
+            mkdir_or_exist(unzip_dir)
+        else:
+            raise NotImplementedError
+        if Path(url).is_file():
+            Path(url).rename(f)
+        elif not f.exists():
+            print(f'Downloading {url} to {f}')
+            torch.hub.download_url_to_file(url, f, progress=True)
+        if unzip and str(f).endswith('.tar.gz'):
+            print(f'Unzipping {f.name}')
+            tar = tarfile.open(f)
+            tar.extractall(path=unzip_dir)
+            if delete:
+                f.unlink()
+                print(f'Delete {f}')
+    # process annotations
+    full_url = []
+    for _url in url:
+        if 'zhiyuan_objv2_train.tar.gz' in _url or \
+                'zhiyuan_objv2_val.json' in _url:
+            full_url.append(_url)
+        elif 'train' in _url:
+            for i in range(51):
+                full_url.append(f'{_url}patch{i}.tar.gz')
+        elif 'val/images/v1' in _url:
+            for i in range(16):
+                full_url.append(f'{_url}patch{i}.tar.gz')
+        elif 'val/images/v2' in _url:
+            for i in range(16, 44):
+                full_url.append(f'{_url}patch{i}.tar.gz')
+        else:
+            raise NotImplementedError
+    dir = Path(dir)
+    if threads > 1:
+        pool = ThreadPool(threads)
+        pool.imap(lambda x: download_single(*x), zip(full_url, repeat(dir)))
+        pool.close()
+        pool.join()
+    else:
+        for u in full_url:
+            download_single(u, dir)
+def main():
+    args = parse_args()
+    path = Path(args.save_dir)
+    if not path.exists():
+        path.mkdir(parents=True, exist_ok=True)
+    data2url = dict(
+        # TODO: Support for downloading Panoptic Segmentation of COCO
+        coco2017=[
+            'http://images.cocodataset.org/zips/train2017.zip',
+            'http://images.cocodataset.org/zips/val2017.zip',
+            'http://images.cocodataset.org/zips/test2017.zip',
+            'http://images.cocodataset.org/annotations/' +
+            'annotations_trainval2017.zip'
+        ],
+        lvis=[
+            'https://s3-us-west-2.amazonaws.com/dl.fbaipublicfiles.com/LVIS/lvis_v1_train.json.zip',  # noqa
+            'https://s3-us-west-2.amazonaws.com/dl.fbaipublicfiles.com/LVIS/lvis_v1_val.json.zip',  # noqa
+        ],
+        voc2007=[
+            'http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar',  # noqa
+            'http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar',  # noqa
+            'http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCdevkit_08-Jun-2007.tar',  # noqa
+        ],
+        # Note: There is no download link for Objects365-V1 right now. If you
+        # would like to download Objects365-V1, please visit
+        # http://www.objects365.org/ to contact the authors.
+        objects365v2=[
+            # training annotations
+            'https://dorc.ks3-cn-beijing.ksyun.com/data-set/2020Objects365%E6%95%B0%E6%8D%AE%E9%9B%86/train/zhiyuan_objv2_train.tar.gz',  # noqa
+            # validation annotations
+            'https://dorc.ks3-cn-beijing.ksyun.com/data-set/2020Objects365%E6%95%B0%E6%8D%AE%E9%9B%86/val/zhiyuan_objv2_val.json',  # noqa
+            # training url root
+            'https://dorc.ks3-cn-beijing.ksyun.com/data-set/2020Objects365%E6%95%B0%E6%8D%AE%E9%9B%86/train/',  # noqa
+            # validation url root_1
+            'https://dorc.ks3-cn-beijing.ksyun.com/data-set/2020Objects365%E6%95%B0%E6%8D%AE%E9%9B%86/val/images/v1/',  # noqa
+            # validation url root_2
+            'https://dorc.ks3-cn-beijing.ksyun.com/data-set/2020Objects365%E6%95%B0%E6%8D%AE%E9%9B%86/val/images/v2/'  # noqa
+        ])
+    url = data2url.get(args.dataset_name, None)
+    if url is None:
+        print('Only support COCO, VOC, LVIS, and Objects365v2 now!')
+        return
+    if args.dataset_name == 'objects365v2':
+        download_objects365v2(
+            url,
+            dir=path,
+            unzip=args.unzip,
+            delete=args.delete,
+            threads=args.threads)
+    else:
+        download(
+            url,
+            dir=path,
+            unzip=args.unzip,
+            delete=args.delete,
+            threads=args.threads)
+if __name__ == '__main__':
+    main()
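The Objects365-V2 branch of the script expands three URL roots into per-patch archive URLs before downloading. The sketch below isolates that expansion as a pure function so the resulting list can be inspected without any network access. The function name `expand_objects365_urls` and the `example.com` roots are placeholders invented for this illustration, while the patch ranges (51 training patches, validation patches 0 to 15 and 16 to 43) are taken from the script above.

```python
def expand_objects365_urls(urls):
    """Expand annotation URLs and URL roots into a flat download list."""
    full_urls = []
    for url in urls:
        # Annotation files are downloaded as-is. This check must come first,
        # because the training annotation URL also contains 'train'.
        if url.endswith(('zhiyuan_objv2_train.tar.gz',
                         'zhiyuan_objv2_val.json')):
            full_urls.append(url)
        elif 'train' in url:
            full_urls.extend(f'{url}patch{i}.tar.gz' for i in range(51))
        elif 'val/images/v1' in url:
            full_urls.extend(f'{url}patch{i}.tar.gz' for i in range(16))
        elif 'val/images/v2' in url:
            full_urls.extend(f'{url}patch{i}.tar.gz' for i in range(16, 44))
        else:
            raise ValueError(f'Unrecognized Objects365 URL: {url}')
    return full_urls


# Placeholder roots; the real ones are listed in data2url['objects365v2'].
roots = [
    'https://example.com/train/zhiyuan_objv2_train.tar.gz',
    'https://example.com/val/zhiyuan_objv2_val.json',
    'https://example.com/train/',
    'https://example.com/val/images/v1/',
    'https://example.com/val/images/v2/',
]
expanded = expand_objects365_urls(roots)
print(len(expanded), expanded[2], expanded[-1])
```

Keeping the expansion separate from the download loop makes it straightforward to dry-run the list or to resume from a specific patch.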