Commit 9d4b5614 authored by zhe chen

Release coco-stuff-10k model


parent c570a7eb
@@ -115,7 +115,7 @@ Prepare datasets according to the [guidelines](https://github.com/open-mmlab/mms
<div>
| method | backbone | resolution | mIoU (ss/ms) | #params | FLOPs | Config | Download |
| :---------: | :------------: | :--------: | :-----------: | :-----: | :---: | :------: | :------: |
| UperNet | InternImage-T | 512x1024 | 82.58 / 83.40 | 59M | 1889G | [config](./configs/cityscapes/upernet_internimage_t_512x1024_160k_cityscapes.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_t_512x1024_160k_cityscapes.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/upernet_internimage_t_512x1024_160k_cityscapes.log.json) |
| UperNet | InternImage-S | 512x1024 | 82.74 / 83.45 | 80M | 2035G | [config](./configs/cityscapes/upernet_internimage_s_512x1024_160k_cityscapes.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_s_512x1024_160k_cityscapes.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/upernet_internimage_s_512x1024_160k_cityscapes.log.json) |
| UperNet | InternImage-B | 512x1024 | 83.18 / 83.97 | 128M | 2369G | [config](./configs/cityscapes/upernet_internimage_b_512x1024_160k_cityscapes.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_b_512x1024_160k_cityscapes.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/upernet_internimage_b_512x1024_160k_cityscapes.log.json) |
@@ -125,7 +125,7 @@ Prepare datasets according to the [guidelines](https://github.com/open-mmlab/mms
| UperNet\* | InternImage-XL | 512x1024 | 86.20 / 86.42 | 368M | 4022G | [config](./configs/cityscapes/upernet_internimage_xl_512x1024_160k_mapillary2cityscapes.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_xl_512x1024_160k_mapillary2cityscapes.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/upernet_internimage_xl_512x1024_160k_mapillary2cityscapes.log.json) |
| SegFormer\* | InternImage-L | 512x1024 | 85.16 / 85.67 | 220M | 1580G | [config](./configs/cityscapes/segformer_internimage_l_512x1024_160k_mapillary2cityscapes.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/segformer_internimage_l_512x1024_160k_mapillary2cityscapes.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/segformer_internimage_l_512x1024_160k_mapillary2cityscapes.log.json) |
| SegFormer\* | InternImage-XL | 512x1024 | 85.41 / 85.93 | 330M | 2364G | [config](./configs/cityscapes/segformer_internimage_xl_512x1024_160k_mapillary2cityscapes.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/segformer_internimage_xl_512x1024_160k_mapillary2cityscapes.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/segformer_internimage_xl_512x1024_160k_mapillary2cityscapes.log.json) |
| Mask2Former\* | InternImage-H | 1024x1024 | 86.37 / 86.96 | 1094M | 7878G | [config](./configs/cityscapes/mask2former_internimage_h_1024x1024_80k_mapillary2cityscapes.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask2former_internimage_h_1024x1024_80k_mapillary2cityscapes.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/mask2former_internimage_h_1024x1024_80k_mapillary2cityscapes.log.json) |
\* denotes models trained with the extra Mapillary dataset.
@@ -139,8 +139,8 @@ Prepare datasets according to the [guidelines](https://github.com/open-mmlab/mms
<div>
| method | backbone | resolution | mIoU (ss) | #params | FLOPs | Config | Download |
| :---------: | :-----------: | :--------: | :-------: | :-----: | :---: | :------: | :------: |
| Mask2Former | InternImage-H | 896x896 | 52.6 | 1.31B | 4635G | [config](./configs/coco_stuff164k/mask2former_internimage_h_896_80k_cocostuff164k.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask2former_internimage_h_896_80k_cocostuff164k.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/mask2former_internimage_h_896_80k_cocostuff164k.log.json) |
</div>
@@ -152,8 +152,8 @@ Prepare datasets according to the [guidelines](https://github.com/open-mmlab/mms
<div>
| method | backbone | resolution | mIoU (ss/ms) | #params | FLOPs | Config | Download |
| :---------: | :-----------: | :--------: | :----------: | :-----: | :---: | :------: | :------: |
| Mask2Former | InternImage-H | 512x512 | 59.2 / 59.6 | 1.28B | 1528G | [config](./configs/coco_stuff10k/mask2former_internimage_h_512_40k_cocostuff164k_to_10k.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask2former_internimage_h_512_40k_cocostuff164k_to_10k.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/mask2former_internimage_h_512_40k_cocostuff164k_to_10k.log.json) |
</div>
@@ -42,5 +42,5 @@ Mapillary 80k + Cityscapes (w/ coarse data) 160k
Mapillary 80k + Cityscapes (w/ coarse data) 80k
| backbone | resolution | mIoU (ss/ms) | #params | FLOPs | Config | Download |
| :-----------: | :--------: | :-----------: | :-----: | :---: | :------: | :------: |
| InternImage-H | 1024x1024 | 86.37 / 86.96 | 1094M | 7878G | [config](./mask2former_internimage_h_1024x1024_80k_mapillary2cityscapes.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask2former_internimage_h_1024x1024_80k_mapillary2cityscapes.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/mask2former_internimage_h_1024x1024_80k_mapillary2cityscapes.log.json) |
# COCO-Stuff-10K
<!-- [ALGORITHM] -->
## Introduction
COCO-Stuff-10K is a dataset designed to advance scene-understanding tasks in computer vision by providing pixel-level annotations for both "things" (discrete objects with well-defined shapes, like cars and people) and "stuff" (amorphous background regions, such as grass and sky). It re-annotates 10,000 images from the original COCO dataset with detailed labels across 182 classes: 91 "thing" classes and 91 "stuff" classes.
## Model Zoo
### Mask2Former + InternImage
| backbone | resolution | mIoU (ss/ms) | #param | FLOPs | Config | Download |
| :-----------: | :--------: | :----------: | :----: | :---: | :-------------------------------------------------------------------: | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| InternImage-H | 512x512 | 59.2 / 59.6 | 1.28B | 1528G | [config](./mask2former_internimage_h_512_40k_cocostuff164k_to_10k.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask2former_internimage_h_512_40k_cocostuff164k_to_10k.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/mask2former_internimage_h_512_40k_cocostuff164k_to_10k.log.json) |
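The mIoU values reported in these tables are the class-wise mean of intersection-over-union between the predicted and ground-truth label maps. A minimal single-scale sketch (NumPy; the `mean_iou` helper name is ours, not the repo's):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union across classes, skipping classes
    absent from both prediction and ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union:  # skip classes that never appear
            ious.append(inter / union)
    return float(np.mean(ious))

# A perfect prediction scores 1.0 regardless of class layout.
gt = np.array([[0, 0], [1, 2]])
print(mean_iou(gt, gt, num_classes=3))  # → 1.0
```

Multi-scale (ms) numbers are computed the same way after averaging predictions over several test resolutions.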
@@ -4,12 +4,11 @@
# Licensed under The MIT License [see LICENSE for details]
# --------------------------------------------------------
_base_ = [
    '../_base_/models/mask2former_beit.py', '../_base_/datasets/coco-stuff10k.py',
    '../_base_/default_runtime.py', '../_base_/schedules/schedule_40k.py'
]
num_classes = 171
load_from = 'https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask2former_internimage_h_896_80k_cocostuff164k.pth'
model = dict(
    type='EncoderDecoderMask2Former',
    backbone=dict(
@@ -35,10 +34,10 @@ model = dict(
        init_cfg=None),
    decode_head=dict(
        in_channels=[320, 640, 1280, 2560],
        feat_channels=1024,
        out_channels=1024,
        num_classes=num_classes,
        num_queries=200,
        pixel_decoder=dict(
            type='MSDeformAttnPixelDecoder',
            num_outs=3,
@@ -51,8 +50,8 @@ model = dict(
                    type='BaseTransformerLayer',
                    attn_cfgs=dict(
                        type='MultiScaleDeformableAttention',
                        embed_dims=1024,
                        num_heads=32,
                        num_levels=3,
                        num_points=4,
                        im2col_step=64,
@@ -62,8 +61,8 @@ model = dict(
                        init_cfg=None),
                    ffn_cfgs=dict(
                        type='FFN',
                        embed_dims=1024,
                        feedforward_channels=4096,
                        num_fcs=2,
                        ffn_drop=0.0,
                        with_cp=False,  # set with_cp=True to save memory
@@ -71,10 +70,10 @@ model = dict(
                    operation_order=('self_attn', 'norm', 'ffn', 'norm')),
                init_cfg=None),
            positional_encoding=dict(
                type='SinePositionalEncoding', num_feats=512, normalize=True),
            init_cfg=None),
        positional_encoding=dict(
            type='SinePositionalEncoding', num_feats=512, normalize=True),
        transformer_decoder=dict(
            type='DetrTransformerDecoder',
            return_intermediate=True,
@@ -83,22 +82,22 @@ model = dict(
                type='DetrTransformerDecoderLayer',
                attn_cfgs=dict(
                    type='MultiheadAttention',
                    embed_dims=1024,
                    num_heads=32,
                    attn_drop=0.0,
                    proj_drop=0.0,
                    dropout_layer=None,
                    batch_first=False),
                ffn_cfgs=dict(
                    embed_dims=1024,
                    feedforward_channels=4096,
                    num_fcs=2,
                    act_cfg=dict(type='ReLU', inplace=True),
                    ffn_drop=0.0,
                    dropout_layer=None,
                    with_cp=False,  # set with_cp=True to save memory
                    add_identity=True),
                feedforward_channels=4096,
                operation_order=('cross_attn', 'norm', 'self_attn', 'norm',
                                 'ffn', 'norm')),
            init_cfg=None),
@@ -109,13 +108,14 @@ model = dict(
            reduction='mean',
            class_weight=[1.0] * num_classes + [0.1])
    ),
    test_cfg=dict(mode='whole'))
crop_size = (512, 512)
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', reduce_zero_label=True),
    dict(type='Resize', img_scale=(2048, 512), ratio_range=(0.5, 2.0)),
    dict(type='RandomCrop', crop_size=crop_size, cat_max_ratio=0.75),
    dict(type='RandomFlip', prob=0.5),
    dict(type='PhotoMetricDistortion'),
@@ -129,9 +129,9 @@ test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(2048, 512),
        # img_ratios=[0.5, 0.75, 1.0, 1.25, 1.5, 1.75],
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='ResizeToMultiple', size_divisor=32),
@@ -151,13 +151,12 @@ lr_config = dict(_delete_=True, policy='poly',
                 warmup_iters=1500,
                 warmup_ratio=1e-6,
                 power=1.0, min_lr=0.0, by_epoch=False)
# By default, models are trained on 16 GPUs with 1 image per GPU
data = dict(samples_per_gpu=1,
            train=dict(pipeline=train_pipeline),
            val=dict(pipeline=test_pipeline),
            test=dict(pipeline=test_pipeline))
runner = dict(type='IterBasedRunner')
checkpoint_config = dict(by_epoch=False, interval=1000, max_keep_ckpts=1)
evaluation = dict(interval=2000, metric='mIoU', save_best='mIoU')
# fp16 = dict(loss_scale=dict(init_scale=512))
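The `class_weight=[1.0] * num_classes + [0.1]` line in the config above gives every semantic class full weight in the classification loss and down-weights Mask2Former's extra "no object" class. A minimal sketch of the resulting list (values taken from the config; the comments are our reading of them):

```python
# Reproduces the class_weight pattern from the config above.
num_classes = 171  # COCO-Stuff-10K semantic classes after reduce_zero_label
class_weight = [1.0] * num_classes + [0.1]  # last entry: "no object" class

assert len(class_weight) == num_classes + 1
assert class_weight[-1] == 0.1  # no-object logits contribute 10x less to the loss
```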
@@ -4,12 +4,12 @@
## Introduction
The Common Objects in COntext-stuff (COCO-Stuff) dataset is a scene-understanding dataset for tasks such as semantic segmentation, object detection, and image captioning. It was built by re-annotating the original COCO dataset, which labeled things but neglected stuff annotations. The COCO-Stuff-164K dataset contains 164k images spanning 172 categories: 80 thing classes, 91 stuff classes, and 1 unlabeled class.
## Model Zoo
### Mask2Former + InternImage
| backbone | resolution | mIoU (ss) | train speed | train time | #param | FLOPs | Config | Download |
| :-----------: | :--------: | :-------: | :---------: | :--------: | :----: | :---: | :------: | :------: |
| InternImage-H | 896x896 | 52.6 | 1.6s / iter | 1.5d (2n) | 1.31B | 4635G | [config](./mask2former_internimage_h_896_80k_cocostuff164k.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask2former_internimage_h_896_80k_cocostuff164k.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/mask2former_internimage_h_896_80k_cocostuff164k.log.json) |
@@ -157,7 +157,6 @@ data = dict(samples_per_gpu=1,
            val=dict(pipeline=test_pipeline),
            test=dict(pipeline=test_pipeline))
runner = dict(type='IterBasedRunner')
checkpoint_config = dict(by_epoch=False, interval=1000, max_keep_ckpts=1)
evaluation = dict(interval=8000, metric='mIoU', save_best='mIoU')
# fp16 = dict(loss_scale=dict(init_scale=512))
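The configs above switch `test_cfg` between `mode='slide'` (crop-by-crop inference stitched together by averaging overlaps) and `mode='whole'` (one forward pass on the full image). A toy sketch of the sliding-window stitching, with a made-up `predict` callable standing in for the model (it only handles sizes the crop/stride grid fully covers):

```python
import numpy as np

def slide_inference(img, predict, crop=(4, 4), stride=(2, 2)):
    """Run `predict` on overlapping crops and average the overlaps,
    mimicking test_cfg=dict(mode='slide', crop_size=..., stride=...)."""
    h, w = img.shape
    ch, cw = crop
    sh, sw = stride
    scores = np.zeros((h, w))
    counts = np.zeros((h, w))
    for top in range(0, max(h - ch, 0) + 1, sh):
        for left in range(0, max(w - cw, 0) + 1, sw):
            window = img[top:top + ch, left:left + cw]
            scores[top:top + ch, left:left + cw] += predict(window)
            counts[top:top + ch, left:left + cw] += 1
    return scores / np.maximum(counts, 1)

# With a constant predictor the stitched map is constant too.
out = slide_inference(np.zeros((6, 6)), lambda x: np.ones_like(x))
print(out.min(), out.max())  # → 1.0 1.0
```

Sliding inference keeps memory bounded at large test resolutions (e.g. 1024x1024 Cityscapes); `mode='whole'` is simpler and was chosen here for the smaller 512x512 COCO-Stuff-10K setting.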