Commit 9d4b5614 authored by zhe chen

Release coco-stuff-10k model

parent c570a7eb
@@ -115,7 +115,7 @@ Prepare datasets according to the [guidelines](https://github.com/open-mmlab/mms
<div>
| method | backbone | resolution | mIoU (ss/ms) | #params | FLOPs | Config | Download |
-| :---------: | :------------: | :--------: | :-----------: | :-----: | :---: | :------------------------------: | :------------------------------: |
+| :-----------: | :------------: | :--------: | :-----------: | :-----: | :---: | :------------------------------: | :------------------------------: |
| UperNet | InternImage-T | 512x1024 | 82.58 / 83.40 | 59M | 1889G | [config](./configs/cityscapes/upernet_internimage_t_512x1024_160k_cityscapes.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_t_512x1024_160k_cityscapes.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/upernet_internimage_t_512x1024_160k_cityscapes.log.json) |
| UperNet | InternImage-S | 512x1024 | 82.74 / 83.45 | 80M | 2035G | [config](./configs/cityscapes/upernet_internimage_s_512x1024_160k_cityscapes.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_s_512x1024_160k_cityscapes.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/upernet_internimage_s_512x1024_160k_cityscapes.log.json) |
| UperNet | InternImage-B | 512x1024 | 83.18 / 83.97 | 128M | 2369G | [config](./configs/cityscapes/upernet_internimage_b_512x1024_160k_cityscapes.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_b_512x1024_160k_cityscapes.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/upernet_internimage_b_512x1024_160k_cityscapes.log.json) |
@@ -125,7 +125,7 @@ Prepare datasets according to the [guidelines](https://github.com/open-mmlab/mms
| UperNet\* | InternImage-XL | 512x1024 | 86.20 / 86.42 | 368M | 4022G | [config](./configs/cityscapes/upernet_internimage_xl_512x1024_160k_mapillary2cityscapes.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_xl_512x1024_160k_mapillary2cityscapes.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/upernet_internimage_xl_512x1024_160k_mapillary2cityscapes.log.json) |
| SegFormer\* | InternImage-L | 512x1024 | 85.16 / 85.67 | 220M | 1580G | [config](./configs/cityscapes/segformer_internimage_l_512x1024_160k_mapillary2cityscapes.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/segformer_internimage_l_512x1024_160k_mapillary2cityscapes.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/segformer_internimage_l_512x1024_160k_mapillary2cityscapes.log.json) |
| SegFormer\* | InternImage-XL | 512x1024 | 85.41 / 85.93 | 330M | 2364G | [config](./configs/cityscapes/segformer_internimage_xl_512x1024_160k_mapillary2cityscapes.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/segformer_internimage_xl_512x1024_160k_mapillary2cityscapes.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/segformer_internimage_xl_512x1024_160k_mapillary2cityscapes.log.json) |
-| Mask2Former | InternImage-H | 1024x1024 | 86.37 / 86.96 | 1094M | 7878G | [config](./configs/cityscapes/mask2former_internimage_h_1024x1024_80k_mapillary2cityscapes_ss.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask2former_internimage_h_1024x1024_80k_mapillary2cityscapes.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/mask2former_internimage_h_1024x1024_80k_mapillary2cityscapes.log.json) |
+| Mask2Former\* | InternImage-H | 1024x1024 | 86.37 / 86.96 | 1094M | 7878G | [config](./configs/cityscapes/mask2former_internimage_h_1024x1024_80k_mapillary2cityscapes.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask2former_internimage_h_1024x1024_80k_mapillary2cityscapes.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/mask2former_internimage_h_1024x1024_80k_mapillary2cityscapes.log.json) |
\* denotes that the model is trained with the extra Mapillary dataset.
@@ -139,8 +139,8 @@ Prepare datasets according to the [guidelines](https://github.com/open-mmlab/mms
<div>
| method | backbone | resolution | mIoU (ss) | #params | FLOPs | Config | Download |
-| :---------: | :-----------: | :--------: | :-------: | :-----: | :---: | :------------------------------: | :------------------------------: |
-| Mask2Former | InternImage-H | 896x896 | 52.6 | 1.31B | 4635G | [config](./configs/coco_stuff164k/mask2former_internimage_h_896_80k_cocostuff164k_ss.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask2former_internimage_h_896_80k_cocostuff164k.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/mask2former_internimage_h_896_80k_cocostuff164k.log.json) |
+| :---------: | :-----------: | :--------: | :-------: | :-----: | :---: | :------------------------------: | :------------------------------: |
+| Mask2Former | InternImage-H | 896x896 | 52.6 | 1.31B | 4635G | [config](./configs/coco_stuff164k/mask2former_internimage_h_896_80k_cocostuff164k.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask2former_internimage_h_896_80k_cocostuff164k.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/mask2former_internimage_h_896_80k_cocostuff164k.log.json) |
</div>
@@ -152,8 +152,8 @@ Prepare datasets according to the [guidelines](https://github.com/open-mmlab/mms
<div>
| method | backbone | resolution | mIoU (ss) | #params | FLOPs | Config | Download |
-| :---------: | :-----------: | :--------: | :-------: | :-----: | :---: | :------------------------------: | :------------------------------: |
-| Mask2Former | InternImage-H | 896x896 | 52.6 | 1.31B | 4635G | [config](./configs/coco_stuff10k/mask2former_internimage_h_896_80k_cocostuff10k_ss.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask2former_internimage_h_896_80k_cocostuff10k.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/mask2former_internimage_h_896_80k_cocostuff10k.log.json) |
+| :---------: | :-----------: | :--------: | :---------: | :-----: | :---: | :------------------------------: | :------------------------------: |
+| Mask2Former | InternImage-H | 512x512 | 59.2 / 59.6 | 1.28B | 1528G | [config](./configs/coco_stuff10k/mask2former_internimage_h_512_40k_cocostuff164k_to_10k.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask2former_internimage_h_512_40k_cocostuff164k_to_10k.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/mask2former_internimage_h_512_40k_cocostuff164k_to_10k.log.json) |
</div>
@@ -42,5 +42,5 @@ Mapillary 80k + Cityscapes (w/ coarse data) 160k
Mapillary 80k + Cityscapes (w/ coarse data) 80k
| backbone | resolution | mIoU (ss/ms) | #params | FLOPs | Config | Download |
-| :-----------: | :--------: | :-----------: | :-----: | :---: | :------------------------------: | :------------------------------: |
-| InternImage-H | 1024x1024 | 86.37 / 86.96 | 1094M | 7878G | [config](./mask2former_internimage_h_1024x1024_80k_mapillary2cityscapes_ss.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask2former_internimage_h_1024x1024_80k_mapillary2cityscapes.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/mask2former_internimage_h_1024x1024_80k_mapillary2cityscapes.log.json) |
+| :-----------: | :--------: | :-----------: | :-----: | :---: | :------------------------------: | :------------------------------: |
+| InternImage-H | 1024x1024 | 86.37 / 86.96 | 1094M | 7878G | [config](./mask2former_internimage_h_1024x1024_80k_mapillary2cityscapes.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask2former_internimage_h_1024x1024_80k_mapillary2cityscapes.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/mask2former_internimage_h_1024x1024_80k_mapillary2cityscapes.log.json) |
# COCO-Stuff-10K
<!-- [ALGORITHM] -->
## Introduction
COCO-Stuff-10K is a dataset designed to enhance scene understanding tasks in computer vision by providing pixel-level annotations for both "things" (discrete objects with well-defined shapes, like cars and people) and "stuff" (amorphous background regions, such as grass and sky). This dataset augments 10,000 images from the original COCO dataset, offering detailed labels across 182 classes: 91 "thing" classes and 91 "stuff" classes.
## Model Zoo
### Mask2Former + InternImage
| backbone | resolution | mIoU (ss/ms) | #params | FLOPs | Config | Download |
| :-----------: | :--------: | :----------: | :----: | :---: | :-------------------------------------------------------------------: | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| InternImage-H | 512x512 | 59.2 / 59.6 | 1.28B | 1528G | [config](./mask2former_internimage_h_512_40k_cocostuff164k_to_10k.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask2former_internimage_h_512_40k_cocostuff164k_to_10k.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/mask2former_internimage_h_512_40k_cocostuff164k_to_10k.log.json) |
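The mIoU numbers reported above are the mean intersection-over-union averaged over classes. As a minimal, self-contained sketch of the metric (pure Python for illustration; mmsegmentation's own mIoU evaluation is the authoritative implementation):

```python
def mean_iou(pred, gt, num_classes, ignore_index=255):
    """Mean IoU over classes, skipping classes absent from both pred and gt."""
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and g == c for p, g in zip(pred, gt) if g != ignore_index)
        union = sum((p == c or g == c) for p, g in zip(pred, gt) if g != ignore_index)
        if union:  # skip classes with no pixels in either prediction or label
            ious.append(inter / union)
    return sum(ious) / len(ious)

# Toy example on a flattened 6-pixel label map with 3 classes.
pred = [0, 0, 1, 1, 2, 2]
gt   = [0, 1, 1, 1, 2, 0]
print(round(mean_iou(pred, gt, 3), 6))  # (1/3 + 2/3 + 1/2) / 3 -> 0.5
```

The "ss/ms" columns differ only in whether predictions are aggregated over multiple test scales before this metric is computed.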
@@ -4,12 +4,11 @@
# Licensed under The MIT License [see LICENSE for details]
# --------------------------------------------------------
_base_ = [
-'../_base_/models/mask2former_beit.py', '../_base_/datasets/cityscapes_1024x1024.py',
-'../_base_/default_runtime.py', '../_base_/schedules/schedule_80k.py'
+'../_base_/models/mask2former_beit.py', '../_base_/datasets/coco-stuff10k.py',
+'../_base_/default_runtime.py', '../_base_/schedules/schedule_40k.py'
]
-num_classes = 19
-crop_size = (1024, 1024)
-load_from = 'https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask2former_internimage_h_896x896_80k_mapillary.pth'
+num_classes = 171
+load_from = 'https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask2former_internimage_h_896_80k_cocostuff164k.pth'
model = dict(
type='EncoderDecoderMask2Former',
backbone=dict(
@@ -35,10 +34,10 @@ model = dict(
init_cfg=None),
decode_head=dict(
in_channels=[320, 640, 1280, 2560],
-feat_channels=256,
-out_channels=256,
+feat_channels=1024,
+out_channels=1024,
num_classes=num_classes,
-num_queries=100,
+num_queries=200,
pixel_decoder=dict(
type='MSDeformAttnPixelDecoder',
num_outs=3,
@@ -51,8 +50,8 @@ model = dict(
type='BaseTransformerLayer',
attn_cfgs=dict(
type='MultiScaleDeformableAttention',
-embed_dims=256,
-num_heads=8,
+embed_dims=1024,
+num_heads=32,
num_levels=3,
num_points=4,
im2col_step=64,
@@ -62,8 +61,8 @@ model = dict(
init_cfg=None),
ffn_cfgs=dict(
type='FFN',
-embed_dims=256,
-feedforward_channels=1024,
+embed_dims=1024,
+feedforward_channels=4096,
num_fcs=2,
ffn_drop=0.0,
with_cp=False, # set with_cp=True to save memory
@@ -71,10 +70,10 @@ model = dict(
operation_order=('self_attn', 'norm', 'ffn', 'norm')),
init_cfg=None),
positional_encoding=dict(
-type='SinePositionalEncoding', num_feats=128, normalize=True),
+type='SinePositionalEncoding', num_feats=512, normalize=True),
init_cfg=None),
positional_encoding=dict(
-type='SinePositionalEncoding', num_feats=128, normalize=True),
+type='SinePositionalEncoding', num_feats=512, normalize=True),
transformer_decoder=dict(
type='DetrTransformerDecoder',
return_intermediate=True,
@@ -83,22 +82,22 @@ model = dict(
type='DetrTransformerDecoderLayer',
attn_cfgs=dict(
type='MultiheadAttention',
-embed_dims=256,
-num_heads=8,
+embed_dims=1024,
+num_heads=32,
attn_drop=0.0,
proj_drop=0.0,
dropout_layer=None,
batch_first=False),
ffn_cfgs=dict(
-embed_dims=256,
-feedforward_channels=2048,
+embed_dims=1024,
+feedforward_channels=4096,
num_fcs=2,
act_cfg=dict(type='ReLU', inplace=True),
ffn_drop=0.0,
dropout_layer=None,
with_cp=False, # set with_cp=True to save memory
add_identity=True),
-feedforward_channels=2048,
+feedforward_channels=4096,
operation_order=('cross_attn', 'norm', 'self_attn', 'norm',
'ffn', 'norm')),
init_cfg=None),
@@ -109,13 +108,14 @@ model = dict(
reduction='mean',
class_weight=[1.0] * num_classes + [0.1])
),
-test_cfg=dict(mode='slide', crop_size=crop_size, stride=(512, 512)))
+test_cfg=dict(mode='whole'))
+crop_size = (512, 512)
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
dict(type='LoadImageFromFile'),
-dict(type='LoadAnnotations'),
-dict(type='Resize', img_scale=(2048, 1024), ratio_range=(0.5, 2.0)),
+dict(type='LoadAnnotations', reduce_zero_label=True),
+dict(type='Resize', img_scale=(2048, 512), ratio_range=(0.5, 2.0)),
dict(type='RandomCrop', crop_size=crop_size, cat_max_ratio=0.75),
dict(type='RandomFlip', prob=0.5),
dict(type='PhotoMetricDistortion'),
@@ -129,9 +129,9 @@ test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
-img_scale=(2048, 1024),
-img_ratios=[0.5, 0.75, 1.0, 1.25, 1.5, 1.75],
-flip=True,
+img_scale=(2048, 512),
+# img_ratios=[0.5, 0.75, 1.0, 1.25, 1.5, 1.75],
+flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='ResizeToMultiple', size_divisor=32),
@@ -151,13 +151,12 @@ lr_config = dict(_delete_=True, policy='poly',
warmup_iters=1500,
warmup_ratio=1e-6,
power=1.0, min_lr=0.0, by_epoch=False)
-# By default, models are trained on 8 GPUs with 2 images per GPU
-data = dict(samples_per_gpu=2,
+# By default, models are trained on 16 GPUs with 1 image per GPU
+data = dict(samples_per_gpu=1,
train=dict(pipeline=train_pipeline),
val=dict(pipeline=test_pipeline),
test=dict(pipeline=test_pipeline))
runner = dict(type='IterBasedRunner')
optimizer_config = dict(_delete_=True, grad_clip=dict(max_norm=0.1, norm_type=2))
checkpoint_config = dict(by_epoch=False, interval=1000, max_keep_ckpts=1)
evaluation = dict(interval=2000, metric='mIoU', save_best='mIoU')
# fp16 = dict(loss_scale=dict(init_scale=512))
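The `test_cfg` change in this config swaps sliding-window inference (`mode='slide'`, with a `crop_size` and `stride`) for whole-image inference (`mode='whole'`). The sliding scheme it replaces can be sketched as follows: a simplified 1-D illustration with NumPy, not mmsegmentation's actual `slide_inference`:

```python
import numpy as np

def slide_inference(model, img, crop, stride):
    """Average model outputs over overlapping windows (1-D sketch of mode='slide')."""
    n = img.shape[0]
    crop = min(crop, n)
    out = np.zeros(n)
    count = np.zeros(n)
    start = 0
    while True:
        end = min(start + crop, n)
        begin = end - crop  # shift the last window back so it still fits
        out[begin:end] += model(img[begin:end])
        count[begin:end] += 1  # track overlap so we can average
        if end == n:
            break
        start += stride
    return out / count

# With a linear "model", sliding and whole-image inference must agree exactly.
model = lambda x: 2.0 * x
img = np.arange(10, dtype=float)
assert np.allclose(slide_inference(model, img, crop=4, stride=2), model(img))
```

For a real network the two modes differ: sliding keeps the input at the training crop size (useful at 1024x1024 on Cityscapes), while `mode='whole'` feeds the entire resized image at once, which is cheaper at the 512x512 scale used here.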
@@ -4,12 +4,12 @@
## Introduction
-The Common Objects in COntext-stuff (COCO-stuff) dataset is a dataset for scene understanding tasks like semantic segmentation, object detection and image captioning. It is constructed by annotating the original COCO dataset, which originally annotated things while neglecting stuff annotations.  There are 164k images in COCO-Stuff-164K dataset that span over 172 categories including 80 things, 91 stuff, and 1 unlabeled class.
+The Common Objects in COntext-stuff (COCO-Stuff) dataset is a dataset for scene understanding tasks such as semantic segmentation, object detection, and image captioning. It is constructed by annotating the original COCO dataset, which labeled things while neglecting stuff. The COCO-Stuff-164K dataset contains 164k images spanning 172 categories: 80 thing classes, 91 stuff classes, and 1 unlabeled class.
## Model Zoo
### Mask2Former + InternImage
| backbone | resolution | mIoU (ss) | train speed | train time | #param | FLOPs | Config | Download |
-| :-----------: | :--------: | :-------: | :---------: | :--------: | :----: | :---: | :------------------------------: | :------------------------------: |
-| InternImage-H | 896x896 | 52.6 | 1.6s / iter | 1.5d (2n) | 1.31B | 4635G | [config](./mask2former_internimage_h_896_80k_cocostuff164k_ss.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask2former_internimage_h_896_80k_cocostuff164k.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/mask2former_internimage_h_896_80k_cocostuff164k.log.json) |
+| :-----------: | :--------: | :-------: | :---------: | :--------: | :----: | :---: | :------------------------------: | :------------------------------: |
+| InternImage-H | 896x896 | 52.6 | 1.6s / iter | 1.5d (2n) | 1.31B | 4635G | [config](./mask2former_internimage_h_896_80k_cocostuff164k.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask2former_internimage_h_896_80k_cocostuff164k.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/mask2former_internimage_h_896_80k_cocostuff164k.log.json) |
@@ -157,7 +157,6 @@ data = dict(samples_per_gpu=1,
val=dict(pipeline=test_pipeline),
test=dict(pipeline=test_pipeline))
runner = dict(type='IterBasedRunner')
optimizer_config = dict(_delete_=True, grad_clip=dict(max_norm=0.1, norm_type=2))
checkpoint_config = dict(by_epoch=False, interval=1000, max_keep_ckpts=1)
evaluation = dict(interval=8000, metric='mIoU', save_best='mIoU')
# fp16 = dict(loss_scale=dict(init_scale=512))
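Both configs keep the poly learning-rate policy with linear warmup (`power=1.0`, `min_lr=0.0`, `warmup_iters=1500`, `warmup_ratio=1e-6`). As a rough sketch of how such a schedule behaves, assuming a base LR of 1e-5 and the 40k-iteration schedule purely for illustration (mmcv's `PolyLrUpdaterHook` is the real implementation):

```python
def poly_lr(it, max_iters, base_lr, power=1.0, min_lr=0.0,
            warmup_iters=1500, warmup_ratio=1e-6):
    """Poly decay with mmcv-style linear warmup (illustrative sketch)."""
    coeff = (1 - it / max_iters) ** power
    regular = (base_lr - min_lr) * coeff + min_lr
    if it < warmup_iters:
        # linear warmup scales the regular LR from warmup_ratio up to 1
        k = (1 - it / warmup_iters) * (1 - warmup_ratio)
        return regular * (1 - k)
    return regular

base = 1e-5  # assumed base LR, for illustration only
assert abs(poly_lr(0, 40000, base) - base * 1e-6) < 1e-15  # warmup starts near zero
assert poly_lr(20000, 40000, base) == base * 0.5           # halfway through, half the LR
assert poly_lr(40000, 40000, base) == 0.0                  # decays to min_lr=0 at the end
```

With `power=1.0` the decay after warmup is simply linear from the base LR down to `min_lr` over `max_iters`.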