Commit 43e508f4 authored by Zhe Chen and committed by zhe chen

Replace github release with hugging face (#42)



* Update function descriptions

* Update README.md

* Replace GitHub release with hugging face

* Release InternImage-H segmentation model

---------
Co-authored-by: Zhenhang Huang <prc_hzh@163.com>
parent 4bbef509
@@ -160,7 +160,7 @@
## Model Zoo
-- Object Detection and Instance Segmentation: [COCO](detection/configs/mask_rcnn/)
+- Object Detection and Instance Segmentation: [COCO](detection/configs/coco/)
- Semantic Segmentation: [ADE20K](segmentation/configs/ade20k/), [Cityscapes](segmentation/configs/cityscapes/)
- Image-Text Retrieval, Image Captioning, and Visual Question Answering: [Uni-Perceiver](https://github.com/fundamentalvision/Uni-Perceiver)
- 3D Perception: [BEVFormer](https://github.com/fundamentalvision/BEVFormer)
@@ -171,47 +171,46 @@
**ImageNet Image Classification**
| name | pretrain | resolution | acc@1 | #param | FLOPs | 22K model | 1K model |
| :------------: | :----------: | :--------: | :---: | :-----: | :---: | :-----------------: | :-----------------: |
-| InternImage-T | ImageNet-1K | 224x224 | 83.5 | 30M | 5G | - | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/cls_model/internimage_t_1k_224.pth) \| [cfg](classification/configs/internimage_t_1k_224.yaml) |
-| InternImage-S | ImageNet-1K | 224x224 | 84.2 | 50M | 8G | - | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/cls_model/internimage_s_1k_224.pth) \| [cfg](classification/configs/internimage_s_1k_224.yaml) |
-| InternImage-B | ImageNet-1K | 224x224 | 84.9 | 97M | 16G | - | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/cls_model/internimage_b_1k_224.pth) \| [cfg](classification/configs/internimage_b_1k_224.yaml) |
-| InternImage-L | ImageNet-22K | 384x384 | 87.7 | 223M | 108G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/cls_model/internimage_l_22k_192to384.pth) | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/cls_model/internimage_l_22kto1k_384.pth) \| [cfg](classification/configs/internimage_l_22kto1k_384.yaml) |
-| InternImage-XL | ImageNet-22K | 384x384 | 88.0 | 335M | 163G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/cls_model/internimage_xl_22k_192to384.pth) | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/cls_model/internimage_xl_22kto1k_384.pth) \| [cfg](classification/configs/internimage_xl_22kto1k_384.yaml) |
-| InternImage-H | Joint 427M | 640x640 | 89.6 | 1.08B | 1478G | TBD | [ckpt](https://pan.baidu.com/s/1R3niTRjrERUet2xGc6ePPA) \| [cfg](classification/configs/internimage_h_jointto1k_640.yaml) |
-| InternImage-G | Joint 427M | 512x512 | 90.1 | 3B | TBD | TBD | [ckpt](https://pan.baidu.com/s/1R3niTRjrERUet2xGc6ePPA) \| [cfg](classification/configs/internimage_g_jointto1k_512.yaml) |
+| InternImage-T | ImageNet-1K | 224x224 | 83.5 | 30M | 5G | - | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_t_1k_224.pth) \| [cfg](classification/configs/internimage_t_1k_224.yaml) |
+| InternImage-S | ImageNet-1K | 224x224 | 84.2 | 50M | 8G | - | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_s_1k_224.pth) \| [cfg](classification/configs/internimage_s_1k_224.yaml) |
+| InternImage-B | ImageNet-1K | 224x224 | 84.9 | 97M | 16G | - | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_b_1k_224.pth) \| [cfg](classification/configs/internimage_b_1k_224.yaml) |
+| InternImage-L | ImageNet-22K | 384x384 | 87.7 | 223M | 108G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_l_22k_192to384.pth) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_l_22kto1k_384.pth) \| [cfg](classification/configs/internimage_l_22kto1k_384.yaml) |
+| InternImage-XL | ImageNet-22K | 384x384 | 88.0 | 335M | 163G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_xl_22k_192to384.pth) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_xl_22kto1k_384.pth) \| [cfg](classification/configs/internimage_xl_22kto1k_384.yaml) |
+| InternImage-H | Joint 427M | 640x640 | 89.6 | 1.08B | 1478G | TODO | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_h_jointto1k_640.pth) \| [cfg](classification/configs/internimage_h_jointto1k_640.yaml) |
+| InternImage-G | Joint 427M | 512x512 | 90.1 | 3B | TODO | TODO | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_g_jointto1k_512.pth) \| [cfg](classification/configs/internimage_g_jointto1k_512.yaml) |
-- Baidu Netdisk extraction code for downloading InternImage-H/G: 2vwu
**COCO Object Detection and Instance Segmentation**
| backbone | method | schd | box mAP | mask mAP | #param | FLOPs | Download |
| :------------: | :----------------: | :---------: | :-----: | :------: | :-----: | :---: | :---: |
-| InternImage-T | Mask R-CNN | 1x | 47.2 | 42.5 | 49M | 270G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/det_model/mask_rcnn_internimage_t_fpn_1x_coco.pth) \| [cfg](detection/configs/mask_rcnn/mask_rcnn_internimage_t_fpn_1x_coco.py) |
-| InternImage-T | Mask R-CNN | 3x | 49.1 | 43.7 | 49M | 270G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/det_model/mask_rcnn_internimage_t_fpn_3x_coco.pth) \| [cfg](detection/configs/mask_rcnn/mask_rcnn_internimage_t_fpn_3x_coco.py) |
-| InternImage-S | Mask R-CNN | 1x | 47.8 | 43.3 | 69M | 340G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/det_model/mask_rcnn_internimage_s_fpn_1x_coco.pth) \| [cfg](detection/configs/mask_rcnn/mask_rcnn_internimage_s_fpn_1x_coco.py) |
-| InternImage-S | Mask R-CNN | 3x | 49.7 | 44.5 | 69M | 340G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/det_model/mask_rcnn_internimage_s_fpn_3x_coco.pth) \| [cfg](detection/configs/mask_rcnn/mask_rcnn_internimage_s_fpn_3x_coco.py) |
-| InternImage-B | Mask R-CNN | 1x | 48.8 | 44.0 | 115M | 501G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/det_model/mask_rcnn_internimage_b_fpn_1x_coco.pth) \| [cfg](detection/configs/mask_rcnn/mask_rcnn_internimage_b_fpn_1x_coco.py) |
-| InternImage-B | Mask R-CNN | 3x | 50.3 | 44.8 | 115M | 501G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/det_model/mask_rcnn_internimage_b_fpn_3x_coco.pth) \| [cfg](detection/configs/mask_rcnn/mask_rcnn_internimage_b_fpn_3x_coco.py) |
-| InternImage-L | Cascade | 1x | 54.9 | 47.7 | 277M | 1399G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/det_model/cascade_internimage_l_fpn_1x_coco.pth) \| [cfg](detection/configs/cascade_mask_rcnn/cascade_internimage_l_fpn_1x_coco.py) |
-| InternImage-L | Cascade | 3x | 56.1 | 48.5 | 277M | 1399G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/det_model/cascade_internimage_l_fpn_3x_coco.pth) \| [cfg](detection/configs/cascade_mask_rcnn/cascade_internimage_l_fpn_3x_coco.py) |
-| InternImage-XL | Cascade | 1x | 55.3 | 48.1 | 387M | 1782G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/det_model/cascade_internimage_xl_fpn_1x_coco.pth) \| [cfg](detection/configs/cascade_mask_rcnn/cascade_internimage_xl_fpn_1x_coco.py) |
-| InternImage-XL | Cascade | 3x | 56.2 | 48.8 | 387M | 1782G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/det_model/cascade_internimage_xl_fpn_1x_coco.pth) \| [cfg](detection/configs/cascade_mask_rcnn/cascade_internimage_xl_fpn_3x_coco.py) |
+| InternImage-T | Mask R-CNN | 1x | 47.2 | 42.5 | 49M | 270G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_t_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_t_fpn_1x_coco.py) |
+| InternImage-T | Mask R-CNN | 3x | 49.1 | 43.7 | 49M | 270G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_t_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_t_fpn_3x_coco.py) |
+| InternImage-S | Mask R-CNN | 1x | 47.8 | 43.3 | 69M | 340G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_s_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_s_fpn_1x_coco.py) |
+| InternImage-S | Mask R-CNN | 3x | 49.7 | 44.5 | 69M | 340G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_s_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_s_fpn_3x_coco.py) |
+| InternImage-B | Mask R-CNN | 1x | 48.8 | 44.0 | 115M | 501G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_b_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_b_fpn_1x_coco.py) |
+| InternImage-B | Mask R-CNN | 3x | 50.3 | 44.8 | 115M | 501G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_b_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_b_fpn_3x_coco.py) |
+| InternImage-L | Cascade | 1x | 54.9 | 47.7 | 277M | 1399G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_l_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_l_fpn_1x_coco.py) |
+| InternImage-L | Cascade | 3x | 56.1 | 48.5 | 277M | 1399G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_l_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_l_fpn_3x_coco.py) |
+| InternImage-XL | Cascade | 1x | 55.3 | 48.1 | 387M | 1782G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_xl_fpn_1x_coco.py) |
+| InternImage-XL | Cascade | 3x | 56.2 | 48.8 | 387M | 1782G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_xl_fpn_3x_coco.py) |
| backbone | method | box mAP (val/test) | #param | FLOPs | Download |
| :------------: | :----------------: | :---------: | :------: | :-----: | :---: |
-| InternImage-H | DINO (TTA) | 65.0 / 65.4 | 2.18B | TBD | TBD |
-| InternImage-G | DINO (TTA) | 65.3 / 65.5 | 3B | TBD | TBD |
+| InternImage-H | DINO (TTA) | 65.0 / 65.4 | 2.18B | TODO | TODO |
+| InternImage-G | DINO (TTA) | 65.3 / 65.5 | 3B | TODO | TODO |
**ADE20K Semantic Segmentation**
| backbone | method | resolution | mIoU (ss/ms) | #param | FLOPs | Download |
| :------------: | :--------: | :--------: | :----------: | :-----: | :---: | :---: |
-| InternImage-T | UperNet | 512x512 | 47.9 / 48.1 | 59M | 944G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/seg_models/upernet_internimage_t_512_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_t_512_160k_ade20k.py) |
-| InternImage-S | UperNet | 512x512 | 50.1 / 50.9 | 80M | 1017G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/seg_models/upernet_internimage_s_512_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_s_512_160k_ade20k.py) |
-| InternImage-B | UperNet | 512x512 | 50.8 / 51.3 | 128M | 1185G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/seg_models/upernet_internimage_b_512_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_b_512_160k_ade20k.py) |
-| InternImage-L | UperNet | 640x640 | 53.9 / 54.1 | 256M | 2526G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/seg_models/upernet_internimage_l_640_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_l_640_160k_ade20k.py) |
-| InternImage-XL | UperNet | 640x640 | 55.0 / 55.3 | 368M | 3142G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/seg_models/upernet_internimage_xl_640_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_xl_640_160k_ade20k.py) |
-| InternImage-H | UperNet | 896x896 | 59.9 / 60.3 | 1.12B | 3566G | TBD |
-| InternImage-H | Mask2Former | 896x896 | 62.5 / 62.9 | 1.31B | 4635G | TBD |
+| InternImage-T | UperNet | 512x512 | 47.9 / 48.1 | 59M | 944G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_t_512_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_t_512_160k_ade20k.py) |
+| InternImage-S | UperNet | 512x512 | 50.1 / 50.9 | 80M | 1017G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_s_512_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_s_512_160k_ade20k.py) |
+| InternImage-B | UperNet | 512x512 | 50.8 / 51.3 | 128M | 1185G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_b_512_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_b_512_160k_ade20k.py) |
+| InternImage-L | UperNet | 640x640 | 53.9 / 54.1 | 256M | 2526G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_l_640_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_l_640_160k_ade20k.py) |
+| InternImage-XL | UperNet | 640x640 | 55.0 / 55.3 | 368M | 3142G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_xl_640_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_xl_640_160k_ade20k.py) |
+| InternImage-H | UperNet | 896x896 | 59.9 / 60.3 | 1.12B | 3566G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_h_896_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_h_896_160k_ade20k.py) |
+| InternImage-H | Mask2Former | 896x896 | 62.5 / 62.9 | 1.31B | 4635G | ckpt \| cfg |
**Model Inference Speed**
......
@@ -90,7 +90,7 @@ ADE20K, outperforming previous models by a large margin.
**Segmentation Task**
<table border="1" width="90%">
<tr align="center">
-<th colspan="3"> Semantic Segmentation</th><th colspan="1">Panoptic Segmentation</th><th colspan="1">RGBD Segmentation</th>
+<th colspan="3"> Semantic Segmentation</th><th colspan="1">Street Segmentation</th><th colspan="1">RGBD Segmentation</th>
</tr>
<tr align="center">
<th>ADE20K</th><th>COCO Stuff-10K</th><th>Pascal Context</th><th>CityScapes</th><th>NYU Depth V2</th>
@@ -159,7 +159,7 @@ tasks
## Model Zoo
-- Object Detection and Instance Segmentation: [COCO](detection/configs/mask_rcnn/)
+- Object Detection and Instance Segmentation: [COCO](detection/configs/coco/)
- Semantic Segmentation: [ADE20K](segmentation/configs/ade20k/), [Cityscapes](segmentation/configs/cityscapes/)
- Image-Text Retrieval, Image Captioning, and Visual Question Answering: [Uni-Perceiver](https://github.com/fundamentalvision/Uni-Perceiver)
- 3D Perception: [BEVFormer](https://github.com/fundamentalvision/BEVFormer)
@@ -171,48 +171,48 @@ tasks
| name | pretrain | resolution | acc@1 | #param | FLOPs | 22K model | 1K model |
| :------------: | :----------: | :--------: | :---: | :-----: | :---: | :-----------------: | :-----------------: |
-| InternImage-T | ImageNet-1K | 224x224 | 83.5 | 30M | 5G | - | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/cls_model/internimage_t_1k_224.pth) \| [cfg](classification/configs/internimage_t_1k_224.yaml) |
-| InternImage-S | ImageNet-1K | 224x224 | 84.2 | 50M | 8G | - | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/cls_model/internimage_s_1k_224.pth) \| [cfg](classification/configs/internimage_s_1k_224.yaml) |
-| InternImage-B | ImageNet-1K | 224x224 | 84.9 | 97M | 16G | - | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/cls_model/internimage_b_1k_224.pth) \| [cfg](classification/configs/internimage_b_1k_224.yaml) |
-| InternImage-L | ImageNet-22K | 384x384 | 87.7 | 223M | 108G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/cls_model/internimage_l_22k_192to384.pth) | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/cls_model/internimage_l_22kto1k_384.pth) \| [cfg](classification/configs/internimage_l_22kto1k_384.yaml) |
-| InternImage-XL | ImageNet-22K | 384x384 | 88.0 | 335M | 163G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/cls_model/internimage_xl_22k_192to384.pth) | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/cls_model/internimage_xl_22kto1k_384.pth) \| [cfg](classification/configs/internimage_xl_22kto1k_384.yaml) |
-| InternImage-H | Joint 427M | 640x640 | 89.6 | 1.08B | 1478G | TBD | [ckpt](https://pan.baidu.com/s/1R3niTRjrERUet2xGc6ePPA) \| [cfg](classification/configs/internimage_h_jointto1k_640.yaml) |
-| InternImage-G | Joint 427M | 512x512 | 90.1 | 3B | TBD | TBD | [ckpt](https://pan.baidu.com/s/1R3niTRjrERUet2xGc6ePPA) \| [cfg](classification/configs/internimage_g_jointto1k_512.yaml) |
+| InternImage-T | ImageNet-1K | 224x224 | 83.5 | 30M | 5G | - | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_t_1k_224.pth) \| [cfg](classification/configs/internimage_t_1k_224.yaml) |
+| InternImage-S | ImageNet-1K | 224x224 | 84.2 | 50M | 8G | - | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_s_1k_224.pth) \| [cfg](classification/configs/internimage_s_1k_224.yaml) |
+| InternImage-B | ImageNet-1K | 224x224 | 84.9 | 97M | 16G | - | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_b_1k_224.pth) \| [cfg](classification/configs/internimage_b_1k_224.yaml) |
+| InternImage-L | ImageNet-22K | 384x384 | 87.7 | 223M | 108G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_l_22k_192to384.pth) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_l_22kto1k_384.pth) \| [cfg](classification/configs/internimage_l_22kto1k_384.yaml) |
+| InternImage-XL | ImageNet-22K | 384x384 | 88.0 | 335M | 163G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_xl_22k_192to384.pth) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_xl_22kto1k_384.pth) \| [cfg](classification/configs/internimage_xl_22kto1k_384.yaml) |
+| InternImage-H | Joint 427M | 640x640 | 89.6 | 1.08B | 1478G | TODO | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_h_jointto1k_640.pth) \| [cfg](classification/configs/internimage_h_jointto1k_640.yaml) |
+| InternImage-G | Joint 427M | 512x512 | 90.1 | 3B | TODO | TODO | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_g_jointto1k_512.pth) \| [cfg](classification/configs/internimage_g_jointto1k_512.yaml) |
-- Extraction code for downloading InternImage-H/G: 2vwu
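With the checkpoints now hosted on the Hugging Face Hub, they can also be fetched programmatically. A minimal sketch, assuming the `huggingface_hub` package is installed and the filenames in the table above are unchanged:

```python
# Sketch: fetch an InternImage checkpoint from the Hugging Face Hub.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="OpenGVLab/InternImage",
    filename="internimage_t_1k_224.pth",  # any filename listed in the tables above
)
print(ckpt_path)  # local path inside the Hugging Face cache
```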
**COCO Object Detection and Instance Segmentation**
| backbone | method | schd | box mAP | mask mAP | #param | FLOPs | Download |
| :------------: | :----------------: | :---------: | :-----: | :------: | :-----: | :---: | :---: |
-| InternImage-T | Mask R-CNN | 1x | 47.2 | 42.5 | 49M | 270G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/det_model/mask_rcnn_internimage_t_fpn_1x_coco.pth) \| [cfg](detection/configs/mask_rcnn/mask_rcnn_internimage_t_fpn_1x_coco.py) |
-| InternImage-T | Mask R-CNN | 3x | 49.1 | 43.7 | 49M | 270G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/det_model/mask_rcnn_internimage_t_fpn_3x_coco.pth) \| [cfg](detection/configs/mask_rcnn/mask_rcnn_internimage_t_fpn_3x_coco.py) |
-| InternImage-S | Mask R-CNN | 1x | 47.8 | 43.3 | 69M | 340G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/det_model/mask_rcnn_internimage_s_fpn_1x_coco.pth) \| [cfg](detection/configs/mask_rcnn/mask_rcnn_internimage_s_fpn_1x_coco.py) |
-| InternImage-S | Mask R-CNN | 3x | 49.7 | 44.5 | 69M | 340G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/det_model/mask_rcnn_internimage_s_fpn_3x_coco.pth) \| [cfg](detection/configs/mask_rcnn/mask_rcnn_internimage_s_fpn_3x_coco.py) |
-| InternImage-B | Mask R-CNN | 1x | 48.8 | 44.0 | 115M | 501G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/det_model/mask_rcnn_internimage_b_fpn_1x_coco.pth) \| [cfg](detection/configs/mask_rcnn/mask_rcnn_internimage_b_fpn_1x_coco.py) |
-| InternImage-B | Mask R-CNN | 3x | 50.3 | 44.8 | 115M | 501G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/det_model/mask_rcnn_internimage_b_fpn_3x_coco.pth) \| [cfg](detection/configs/mask_rcnn/mask_rcnn_internimage_b_fpn_3x_coco.py) |
-| InternImage-L | Cascade | 1x | 54.9 | 47.7 | 277M | 1399G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/det_model/cascade_internimage_l_fpn_1x_coco.pth) \| [cfg](detection/configs/cascade_mask_rcnn/cascade_internimage_l_fpn_1x_coco.py) |
-| InternImage-L | Cascade | 3x | 56.1 | 48.5 | 277M | 1399G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/det_model/cascade_internimage_l_fpn_3x_coco.pth) \| [cfg](detection/configs/cascade_mask_rcnn/cascade_internimage_l_fpn_3x_coco.py) |
-| InternImage-XL | Cascade | 1x | 55.3 | 48.1 | 387M | 1782G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/det_model/cascade_internimage_xl_fpn_1x_coco.pth) \| [cfg](detection/configs/cascade_mask_rcnn/cascade_internimage_xl_fpn_1x_coco.py) |
-| InternImage-XL | Cascade | 3x | 56.2 | 48.8 | 387M | 1782G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/det_model/cascade_internimage_xl_fpn_1x_coco.pth) \| [cfg](detection/configs/cascade_mask_rcnn/cascade_internimage_xl_fpn_3x_coco.py) |
+| InternImage-T | Mask R-CNN | 1x | 47.2 | 42.5 | 49M | 270G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_t_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_t_fpn_1x_coco.py) |
+| InternImage-T | Mask R-CNN | 3x | 49.1 | 43.7 | 49M | 270G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_t_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_t_fpn_3x_coco.py) |
+| InternImage-S | Mask R-CNN | 1x | 47.8 | 43.3 | 69M | 340G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_s_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_s_fpn_1x_coco.py) |
+| InternImage-S | Mask R-CNN | 3x | 49.7 | 44.5 | 69M | 340G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_s_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_s_fpn_3x_coco.py) |
+| InternImage-B | Mask R-CNN | 1x | 48.8 | 44.0 | 115M | 501G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_b_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_b_fpn_1x_coco.py) |
+| InternImage-B | Mask R-CNN | 3x | 50.3 | 44.8 | 115M | 501G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_b_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_b_fpn_3x_coco.py) |
+| InternImage-L | Cascade | 1x | 54.9 | 47.7 | 277M | 1399G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_l_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_l_fpn_1x_coco.py) |
+| InternImage-L | Cascade | 3x | 56.1 | 48.5 | 277M | 1399G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_l_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_l_fpn_3x_coco.py) |
+| InternImage-XL | Cascade | 1x | 55.3 | 48.1 | 387M | 1782G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_xl_fpn_1x_coco.py) |
+| InternImage-XL | Cascade | 3x | 56.2 | 48.8 | 387M | 1782G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_xl_fpn_3x_coco.py) |
| backbone | method | box mAP (val/test) | #param | FLOPs | Download |
| :------------: | :----------------: | :---------: | :------: | :-----: | :---: |
-| InternImage-H | DINO (TTA) | 65.0 / 65.4 | 2.18B | TBD | TBD |
-| InternImage-G | DINO (TTA) | 65.3 / 65.5 | 3B | TBD | TBD |
+| InternImage-H | DINO (TTA) | 65.0 / 65.4 | 2.18B | TODO | TODO |
+| InternImage-G | DINO (TTA) | 65.3 / 65.5 | 3B | TODO | TODO |
**ADE20K Semantic Segmentation**
| backbone | method | resolution | mIoU (ss/ms) | #param | FLOPs | Download |
| :------------: | :--------: | :--------: | :----------: | :-----: | :---: | :---: |
-| InternImage-T | UperNet | 512x512 | 47.9 / 48.1 | 59M | 944G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/seg_models/upernet_internimage_t_512_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_t_512_160k_ade20k.py) |
-| InternImage-S | UperNet | 512x512 | 50.1 / 50.9 | 80M | 1017G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/seg_models/upernet_internimage_s_512_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_s_512_160k_ade20k.py) |
-| InternImage-B | UperNet | 512x512 | 50.8 / 51.3 | 128M | 1185G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/seg_models/upernet_internimage_b_512_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_b_512_160k_ade20k.py) |
-| InternImage-L | UperNet | 640x640 | 53.9 / 54.1 | 256M | 2526G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/seg_models/upernet_internimage_l_640_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_l_640_160k_ade20k.py) |
-| InternImage-XL | UperNet | 640x640 | 55.0 / 55.3 | 368M | 3142G | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/seg_models/upernet_internimage_xl_640_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_xl_640_160k_ade20k.py) |
-| InternImage-H | UperNet | 896x896 | 59.9 / 60.3 | 1.12B | 3566G | TBD |
-| InternImage-H | Mask2Former | 896x896 | 62.5 / 62.9 | 1.31B | 4635G | TBD |
+| InternImage-T | UperNet | 512x512 | 47.9 / 48.1 | 59M | 944G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_t_512_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_t_512_160k_ade20k.py) |
+| InternImage-S | UperNet | 512x512 | 50.1 / 50.9 | 80M | 1017G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_s_512_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_s_512_160k_ade20k.py) |
+| InternImage-B | UperNet | 512x512 | 50.8 / 51.3 | 128M | 1185G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_b_512_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_b_512_160k_ade20k.py) |
+| InternImage-L | UperNet | 640x640 | 53.9 / 54.1 | 256M | 2526G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_l_640_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_l_640_160k_ade20k.py) |
+| InternImage-XL | UperNet | 640x640 | 55.0 / 55.3 | 368M | 3142G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_xl_640_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_xl_640_160k_ade20k.py) |
+| InternImage-H | UperNet | 896x896 | 59.9 / 60.3 | 1.12B | 3566G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_h_896_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_h_896_160k_ade20k.py) |
+| InternImage-H | Mask2Former | 896x896 | 62.5 / 62.9 | 1.31B | 4635G | TODO | ckpt \| cfg
**Main Results of FPS**
......
@@ -66,7 +66,21 @@ def build_act_layer(act_layer):
class CrossAttention(nn.Module):
r""" Cross Attention Module
Args:
dim (int): Number of input channels.
num_heads (int): Number of attention heads. Default: 8
qkv_bias (bool, optional): If True, add a learnable bias to q, k, v.
Default: False.
qk_scale (float | None, optional): Override default qk scale of
head_dim ** -0.5 if set. Default: None.
attn_drop (float, optional): Dropout ratio of attention weight.
Default: 0.0
proj_drop (float, optional): Dropout ratio of output. Default: 0.0
attn_head_dim (int, optional): Dimension of attention head.
out_dim (int, optional): Dimension of output.
"""
def __init__(self,
dim,
num_heads=8,
@@ -84,7 +98,7 @@ class CrossAttention(nn.Module):
if attn_head_dim is not None:
head_dim = attn_head_dim
all_head_dim = head_dim * self.num_heads
-self.scale = qk_scale or head_dim**-0.5
+self.scale = qk_scale or head_dim ** -0.5
assert all_head_dim == dim
self.q = nn.Linear(dim, all_head_dim, bias=False)
@@ -104,9 +118,7 @@ class CrossAttention(nn.Module):
self.proj = nn.Linear(all_head_dim, out_dim)
self.proj_drop = nn.Dropout(proj_drop)
-def forward(self, x, bool_masked_pos=None, k=None, v=None):
-# import pdb; pdb.set_trace()
-# print("1", x.shape, k.shape, v.shape)
+def forward(self, x, k=None, v=None):
B, N, C = x.shape
N_k = k.shape[1]
N_v = v.shape[1]
@@ -114,7 +126,6 @@ class CrossAttention(nn.Module):
q_bias, k_bias, v_bias = None, None, None
if self.q_bias is not None:
q_bias = self.q_bias
-# k_bias = torch.zeros_like(self.v_bias, requires_grad=False)
k_bias = self.k_bias
v_bias = self.v_bias
@@ -131,7 +142,6 @@ class CrossAttention(nn.Module):
v = v.reshape(B, N_v, 1, self.num_heads, -1).permute(2, 0, 3, 1,
4).squeeze(0)
-# print("2", q.shape, k.shape, v.shape)
q = q * self.scale
attn = (q @ k.transpose(-2, -1))  # (B, N_head, N_q, N_k)
@@ -146,7 +156,23 @@ class CrossAttention(nn.Module):
class AttentiveBlock(nn.Module):
r"""Attentive Block
Args:
dim (int): Number of input channels.
num_heads (int): Number of attention heads. Default: 8
qkv_bias (bool, optional): If True, add a learnable bias to q, k, v.
Default: False.
qk_scale (float | None, optional): Override default qk scale of
head_dim ** -0.5 if set. Default: None.
drop (float, optional): Dropout rate. Default: 0.0.
attn_drop (float, optional): Attention dropout rate. Default: 0.0.
drop_path (float | tuple[float], optional): Stochastic depth rate.
Default: 0.0.
norm_layer (nn.Module, optional): Normalization layer. Default: nn.LayerNorm.
attn_head_dim (int, optional): Dimension of attention head. Default: None.
out_dim (int, optional): Dimension of output. Default: None.
"""
def __init__(self,
dim,
num_heads,
@@ -322,7 +348,6 @@ class InternImageLayer(nn.Module):
core_op,
channels,
groups,
-dw_kernel_size,
mlp_ratio=4.,
drop=0.,
drop_path=0.,
@@ -332,8 +357,9 @@ class InternImageLayer(nn.Module):
layer_scale=None,
offset_scale=1.0,
with_cp=False,
-res_post_norm=False,
-center_feature_scale=False):
+dw_kernel_size=None, # for InternImage-H/G
+res_post_norm=False, # for InternImage-H/G
+center_feature_scale=False): # for InternImage-H/G
super().__init__()
self.channels = channels
self.groups = groups
@@ -345,7 +371,6 @@ class InternImageLayer(nn.Module):
self.dcn = core_op(
channels=channels,
kernel_size=3,
-dw_kernel_size=dw_kernel_size,
stride=1,
pad=1,
dilation=1,
@@ -353,7 +378,8 @@ class InternImageLayer(nn.Module):
offset_scale=offset_scale,
act_layer=act_layer,
norm_layer=norm_layer,
-center_feature_scale=center_feature_scale)
+dw_kernel_size=dw_kernel_size, # for InternImage-H/G
+center_feature_scale=center_feature_scale) # for InternImage-H/G
self.drop_path = DropPath(drop_path) if drop_path > 0. \
else nn.Identity()
self.norm2 = build_norm_layer(channels, 'LN')
@@ -379,9 +405,8 @@ class InternImageLayer(nn.Module):
if self.post_norm:
x = x + self.drop_path(self.norm1(self.dcn(x)))
x = x + self.drop_path(self.norm2(self.mlp(x)))
-elif self.res_post_norm:
-shortcut = x
-x = shortcut + self.drop_path(self.res_post_norm1(self.dcn(self.norm1(x))))
+elif self.res_post_norm: # for InternImage-H/G
+x = x + self.drop_path(self.res_post_norm1(self.dcn(self.norm1(x))))
x = x + self.drop_path(self.res_post_norm2(self.mlp(self.norm2(x))))
else:
x = x + self.drop_path(self.dcn(self.norm1(x)))
@@ -425,7 +450,6 @@ class InternImageBlock(nn.Module):
channels,
depth,
groups,
-dw_kernel_size,
downsample=True,
mlp_ratio=4.,
drop=0.,
@@ -436,6 +460,7 @@ class InternImageBlock(nn.Module):
offset_scale=1.0,
layer_scale=None,
with_cp=False,
dw_kernel_size=None, # for InternImage-H/G
post_norm_block_ids=None, # for InternImage-H/G
res_post_norm=False, # for InternImage-H/G
center_feature_scale=False): # for InternImage-H/G
@@ -444,12 +469,12 @@ class InternImageBlock(nn.Module):
self.depth = depth
self.post_norm = post_norm
self.center_feature_scale = center_feature_scale
self.blocks = nn.ModuleList([
InternImageLayer(
core_op=core_op,
channels=channels,
groups=groups,
-dw_kernel_size=dw_kernel_size,
mlp_ratio=mlp_ratio,
drop=drop,
drop_path=drop_path[i] if isinstance(
@@ -460,20 +485,21 @@ class InternImageBlock(nn.Module):
layer_scale=layer_scale,
offset_scale=offset_scale,
with_cp=with_cp,
-res_post_norm=res_post_norm,
-center_feature_scale=center_feature_scale
+dw_kernel_size=dw_kernel_size, # for InternImage-H/G
+res_post_norm=res_post_norm, # for InternImage-H/G
+center_feature_scale=center_feature_scale # for InternImage-H/G
) for i in range(depth)
])
if not self.post_norm or center_feature_scale:
self.norm = build_norm_layer(channels, 'LN')
self.post_norm_block_ids = post_norm_block_ids
-if post_norm_block_ids is not None:
+if post_norm_block_ids is not None: # for InternImage-H/G
self.post_norms = nn.ModuleList(
[build_norm_layer(channels, 'LN', eps=1e-6) for _ in post_norm_block_ids]
)
self.downsample = DownsampleLayer(
channels=channels, norm_layer=norm_layer) if downsample else None
def forward(self, x, return_wo_downsample=False):
for i, blk in enumerate(self.blocks):
x = blk(x)
@@ -510,6 +536,12 @@ class InternImage(nn.Module):
layer_scale (bool): Whether to use layer scale. Default: False
cls_scale (bool): Whether to use class scale. Default: False
with_cp (bool): Use checkpoint or not. Using checkpoint will save some
dw_kernel_size (int): Size of the dwconv. Default: None
use_clip_projector (bool): Whether to use clip projector. Default: False
level2_post_norm (bool): Whether to use level2 post norm. Default: False
level2_post_norm_block_ids (list): Indexes of post norm blocks. Default: None
res_post_norm (bool): Whether to use res post norm. Default: False
center_feature_scale (bool): Whether to use center feature scale. Default: False
""" """
def __init__(self, def __init__(self,
@@ -554,7 +586,7 @@ class InternImage(nn.Module):
print(f"level2_post_norm: {level2_post_norm}")
print(f"level2_post_norm_block_ids: {level2_post_norm_block_ids}")
print(f"res_post_norm: {res_post_norm}")
in_chans = 3
self.patch_embed = StemLayer(in_chans=in_chans,
out_chans=channels,
@@ -578,7 +610,6 @@ class InternImage(nn.Module):
channels=int(channels * 2**i),
depth=depths[i],
groups=groups[i],
-dw_kernel_size=dw_kernel_size, # for InternImage-H/G
mlp_ratio=self.mlp_ratio,
drop=drop_rate,
drop_path=dpr[sum(depths[:i]):sum(depths[:i + 1])],
@@ -589,6 +620,7 @@ class InternImage(nn.Module):
layer_scale=layer_scale,
offset_scale=offset_scale,
with_cp=with_cp,
dw_kernel_size=dw_kernel_size, # for InternImage-H/G
post_norm_block_ids=post_norm_block_ids, # for InternImage-H/G
res_post_norm=res_post_norm, # for InternImage-H/G
center_feature_scale=center_feature_scale # for InternImage-H/G
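To make the rearranged arguments concrete, the illustrative snippet below collects the extra keyword arguments that this diff threads from `InternImage` through `InternImageBlock` down to `InternImageLayer` for the H/G variants. The option names come from the code above; the values are placeholders, not taken from a released config.

```python
# Illustrative only: the H/G-specific options introduced above, grouped in one place.
# Values are placeholders; consult the released InternImage-H/G configs for the real ones.
internimage_hg_extra = dict(
    dw_kernel_size=5,                 # dwconv size inside the core DCN op (None for T/S/B/L/XL)
    level2_post_norm=True,            # enable the extra post-norm in stage 2
    level2_post_norm_block_ids=[5, 11, 17, 23],  # which blocks get the extra post-norm
    res_post_norm=True,               # post-norm on both residual branches of each layer
    center_feature_scale=True,        # center feature scale branch
)
```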
......
@@ -70,13 +70,13 @@ sh dist_test.sh <config-file> <checkpoint> <gpu-num> --eval bbox segm
For example, to evaluate the `InternImage-T` with a single GPU:
```bash
-python test.py configs/mask_rcnn/mask_rcnn_internimage_t_fpn_1x_coco.py checkpoint_dir/det/mask_rcnn_internimage_t_fpn_1x_coco.pth --eval bbox segm
+python test.py configs/coco/mask_rcnn_internimage_t_fpn_1x_coco.py checkpoint_dir/det/mask_rcnn_internimage_t_fpn_1x_coco.pth --eval bbox segm
```
For example, to evaluate the `InternImage-B` with a single node with 8 GPUs:
```bash
-sh dist_test.sh configs/mask_rcnn/mask_rcnn_internimage_b_fpn_1x_coco.py checkpoint_dir/det/mask_rcnn_internimage_b_fpn_1x_coco.py 8 --eval bbox segm
+sh dist_test.sh configs/coco/mask_rcnn_internimage_b_fpn_1x_coco.py checkpoint_dir/det/mask_rcnn_internimage_b_fpn_1x_coco.py 8 --eval bbox segm
```
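The checkpoint path in the command above can point directly at a file fetched from the Hugging Face repo. A minimal sketch, assuming `huggingface_hub` is installed and the filenames in the model zoo are unchanged (`hf_hub_download` returns the cached local path):

```python
# Sketch: download the Mask R-CNN checkpoint from the Hub and evaluate it with test.py.
from huggingface_hub import hf_hub_download
import subprocess

ckpt = hf_hub_download(repo_id="OpenGVLab/InternImage",
                       filename="mask_rcnn_internimage_t_fpn_1x_coco.pth")
subprocess.run([
    "python", "test.py",
    "configs/coco/mask_rcnn_internimage_t_fpn_1x_coco.py",
    ckpt, "--eval", "bbox", "segm",
], check=True)
```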
### Training on COCO
@@ -90,7 +90,7 @@ sh dist_train.sh <config-file> <gpu-num>
For example, to train `InternImage-T` with 8 GPUs on 1 node, run:
```bash
-sh dist_train.sh configs/mask_rcnn/mask_rcnn_internimage_t_fpn_1x_coco.py 8
+sh dist_train.sh configs/coco/mask_rcnn_internimage_t_fpn_1x_coco.py 8
```
### Manage jobs with Srun
@@ -98,5 +98,5 @@ sh dist_train.sh configs/mask_rcnn/mask_rcnn_internimage_t_fpn_1x_coco.py 8
For example, to train `InternImage-L` with 32 GPUs on 4 nodes, run:
```bash
-GPUS=32 sh slurm_train.sh <partition> <job-name> configs/cascade_mask_rcnn/cascade_internimage_xl_fpn_3x_coco.py work_dirs/cascade_internimage_xl_fpn_3x_coco
+GPUS=32 sh slurm_train.sh <partition> <job-name> configs/coco/cascade_internimage_xl_fpn_3x_coco.py work_dirs/cascade_internimage_xl_fpn_3x_coco
```
# Cascade R-CNN
> [Cascade R-CNN: High Quality Object Detection and Instance Segmentation](https://arxiv.org/abs/1906.09756)
<!-- [ALGORITHM] -->
## Introduction
In object detection, the intersection over union (IoU) threshold is frequently used to define positives/negatives. The threshold used to train a detector defines its quality. While the commonly used threshold of 0.5 leads to noisy (low-quality) detections, detection performance frequently degrades for larger thresholds. This paradox of high-quality detection has two causes: 1) overfitting, due to vanishing positive samples for large thresholds, and 2) inference-time quality mismatch between detector and test hypotheses. A multi-stage object detection architecture, the Cascade R-CNN, composed of a sequence of detectors trained with increasing IoU thresholds, is proposed to address these problems. The detectors are trained sequentially, using the output of a detector as training set for the next. This resampling progressively improves hypotheses quality, guaranteeing a positive training set of equivalent size for all detectors and minimizing overfitting. The same cascade is applied at inference, to eliminate quality mismatches between hypotheses and detectors. An implementation of the Cascade R-CNN without bells or whistles achieves state-of-the-art performance on the COCO dataset, and significantly improves high-quality detection on generic and specific object detection datasets, including VOC, KITTI, CityPerson, and WiderFace. Finally, the Cascade R-CNN is generalized to instance segmentation, with nontrivial improvements over the Mask R-CNN.
<div align=center>
<img src="https://user-images.githubusercontent.com/40661020/143872197-d99b90e4-4f05-4329-80a4-327ac862a051.png"/>
</div>
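The core idea above can be summarized in a toy sketch (stand-in stages, not the mmdetection implementation): each stage is trained against a higher IoU threshold and feeds its refined boxes to the next stage.

```python
# Toy illustration of the cascade: stages with increasing IoU thresholds,
# each refining the proposals produced by the previous stage.
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]

def cascade(proposals: List[Box],
            stages: List[Callable[[List[Box]], List[Box]]],
            iou_thresholds=(0.5, 0.6, 0.7)) -> List[Box]:
    boxes = proposals
    for stage, iou_thr in zip(stages, iou_thresholds):
        # At training time, boxes would be re-labelled positive/negative against
        # the ground truth using iou_thr before training this stage.
        boxes = stage(boxes)  # the stage regresses refined boxes
    return boxes

# Stand-in stages that just shrink boxes slightly, to keep the sketch runnable.
shrink = lambda bs: [(x + 1, y + 1, w - 2, h - 2) for (x, y, w, h) in bs]
print(cascade([(0.0, 0.0, 100.0, 100.0)], [shrink, shrink, shrink]))
```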
## Model Zoo
| backbone | schd | box mAP | mask mAP | train speed | train time | #param | FLOPs | Config | Download |
| :------------: | :---------: | :-----: | :------: | :-----: | :---: | :-----: | :---: | :---: | :---: |
| InternImage-L | 1x | 54.9 | 47.7 | 0.73s / iter | 18h | 277M | 1399G | [config](./cascade_internimage_l_fpn_1x_coco.py) | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/det_model/cascade_internimage_l_fpn_1x_coco.pth) |
| InternImage-L | 3x | 56.1 | 48.5 | 0.79s / iter | 15h (n4) | 277M | 1399G | [config](./cascade_internimage_l_fpn_3x_coco.py) | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/det_model/cascade_internimage_l_fpn_3x_coco.pth) \| [log](https://github.com/OpenGVLab/InternImage/releases/download/det_model/cascade_internimage_l_fpn_3x_coco.log.json) |
| InternImage-XL | 1x | 55.3 | 48.1 | 0.82s / iter | 21h | 387M | 1782G | [config](./cascade_internimage_xl_fpn_1x_coco.py) | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/det_model/cascade_internimage_xl_fpn_1x_coco.pth) \| [log](https://github.com/OpenGVLab/InternImage/releases/download/det_model/cascade_internimage_xl_fpn_1x_coco.log.json) |
| InternImage-XL | 3x | 56.2 | 48.8 | 0.91s / iter | 17h (n4) | 387M | 1782G | [config](./cascade_internimage_xl_fpn_3x_coco.py) | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/det_model/cascade_internimage_xl_fpn_1x_coco.pth) \| [log](https://github.com/OpenGVLab/InternImage/releases/download/det_model/cascade_internimage_xl_fpn_3x_coco.log.json) |
- Training speed is measured with A100 GPUs using the current code and may be faster than the speed reported in the logs.
- Some logs are from recently retrained models, so the results in the logs may differ slightly from those in the paper.
- Please set `with_cp=True` to save memory if you run into out-of-memory issues (a minimal config sketch follows below).
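A minimal sketch of what that override might look like in an mmdetection-style config, assuming the InternImage backbone dict accepts `with_cp` as the configs in this folder do:

```python
# Sketch: enable gradient checkpointing in the backbone to reduce GPU memory.
# Inherit one of the configs in this folder and override with_cp.
_base_ = ['./cascade_internimage_xl_fpn_3x_coco.py']

model = dict(
    backbone=dict(
        with_cp=True,  # trade extra compute for lower memory on OOM errors
    ))
```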
# COCO
## Introduction
Introduced by Lin et al. in [Microsoft COCO: Common Objects in Context](https://arxiv.org/pdf/1405.0312v3.pdf)
The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.
Splits: The first version of the MS COCO dataset was released in 2014. It contains 164K images split into training (83K), validation (41K), and test (41K) sets. In 2015, an additional test set of 81K images was released, including all the previous test images and 40K new images.
Based on community feedback, in 2017 the training/validation split was changed from 83K/41K to 118K/5K. The new split uses the same images and annotations. The 2017 test set is a subset of 41K images of the 2015 test set. Additionally, the 2017 release contains a new unannotated dataset of 123K images.
## Model Zoo
### Mask R-CNN + InternImage
| backbone | schd | box mAP | mask mAP | train speed | train time |#param | FLOPs | Config | Download |
| :------------: | :---------: | :-----: | :------: | :-----: |:------: | :-----: |:------: | :-----: | :---: |
| InternImage-T | 1x | 47.2 | 42.5 | 0.36s / iter | 9h | 49M | 270G | [config](./mask_rcnn_internimage_t_fpn_1x_coco.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_t_fpn_1x_coco.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_t_fpn_1x_coco.log.json) |
| InternImage-T | 3x | 49.1 | 43.7 | 0.34s / iter | 26h | 49M | 270G | [config](./mask_rcnn_internimage_t_fpn_3x_coco.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_t_fpn_3x_coco.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_t_fpn_3x_coco.log.json) |
| InternImage-S | 1x | 47.8 | 43.3 | 0.40s / iter | 10h | 69M | 340G | [config](./mask_rcnn_internimage_s_fpn_1x_coco.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_s_fpn_1x_coco.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_s_fpn_1x_coco.log.json) |
| InternImage-S | 3x | 49.7 | 44.5 | 0.40s / iter | 30h | 69M | 340G | [config](./mask_rcnn_internimage_s_fpn_3x_coco.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_s_fpn_3x_coco.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_s_fpn_3x_coco.log.json) |
| InternImage-B | 1x | 48.8 | 44.0 | 0.45s / iter | 11.5h | 115M | 501G | [config](./mask_rcnn_internimage_b_fpn_1x_coco.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_b_fpn_1x_coco.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_b_fpn_1x_coco.log.json) |
| InternImage-B | 3x | 50.3 | 44.8 | 0.45s / iter | 34h | 115M | 501G | [config](./mask_rcnn_internimage_b_fpn_3x_coco.py)| [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_b_fpn_3x_coco.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_b_fpn_3x_coco.log.json) |
- Training speed is measured with A100 GPUs using the current code and may be faster than the speed reported in the logs.
- Some logs are from recently retrained models, so the results in the logs may differ slightly from those in the paper.
- Please set `with_cp=True` to save memory if you run into out-of-memory issues.
### Cascade Mask R-CNN + InternImage
| backbone | schd | box mAP | mask mAP | train speed | train time | #param | FLOPs | Config | Download |
| :------------: | :---------: | :-----: | :------: | :-----: | :---: | :-----: | :---: | :---: | :---: |
| InternImage-L | 1x | 54.9 | 47.7 | 0.73s / iter | 18h | 277M | 1399G | [config](./cascade_internimage_l_fpn_1x_coco.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_l_fpn_1x_coco.pth) |
| InternImage-L | 3x | 56.1 | 48.5 | 0.79s / iter | 15h (4n) | 277M | 1399G | [config](./cascade_internimage_l_fpn_3x_coco.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_l_fpn_3x_coco.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_l_fpn_3x_coco.log.json) |
| InternImage-XL | 1x | 55.3 | 48.1 | 0.82s / iter | 21h | 387M | 1782G | [config](./cascade_internimage_xl_fpn_1x_coco.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_1x_coco.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_1x_coco.log.json) |
| InternImage-XL | 3x | 56.2 | 48.8 | 0.91s / iter | 17h (4n) | 387M | 1782G | [config](./cascade_internimage_xl_fpn_3x_coco.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_1x_coco.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_3x_coco.log.json) |
- Training speed is measured with A100 GPUs using the current code and may be faster than the speed reported in the logs.
- Some logs are from recently retrained models, so the results in the logs may differ slightly from those in the paper.
- Please set `with_cp=True` to save memory if you run into out-of-memory issues.
@@ -9,7 +9,7 @@ _base_ = [
'../_base_/schedules/schedule_1x.py',
'../_base_/default_runtime.py'
]
-pretrained = 'https://github.com/OpenGVLab/InternImage/releases/download/cls_model/internimage_l_22k_192to384.pth'
+pretrained = 'https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_l_22k_192to384.pth'
model = dict(
backbone=dict(
_delete_=True,
......
@@ -9,7 +9,7 @@ _base_ = [
'../_base_/schedules/schedule_3x.py',
'../_base_/default_runtime.py'
]
-pretrained = 'https://github.com/OpenGVLab/InternImage/releases/download/cls_model/internimage_l_22k_192to384.pth'
+pretrained = 'https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_l_22k_192to384.pth'
model = dict(
backbone=dict(
_delete_=True,
......
@@ -9,7 +9,7 @@ _base_ = [
'../_base_/schedules/schedule_1x.py',
'../_base_/default_runtime.py'
]
-pretrained = 'https://github.com/OpenGVLab/InternImage/releases/download/cls_model/internimage_xl_22k_192to384.pth'
+pretrained = 'https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_xl_22k_192to384.pth'
model = dict(
backbone=dict(
_delete_=True,
......
@@ -9,7 +9,7 @@ _base_ = [
'../_base_/schedules/schedule_3x.py',
'../_base_/default_runtime.py'
]
-pretrained = 'https://github.com/OpenGVLab/InternImage/releases/download/cls_model/internimage_xl_22k_192to384.pth'
+pretrained = 'https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_xl_22k_192to384.pth'
model = dict(
backbone=dict(
_delete_=True,
......
@@ -9,7 +9,7 @@ _base_ = [
'../_base_/schedules/schedule_1x.py',
'../_base_/default_runtime.py'
]
-pretrained = 'https://github.com/OpenGVLab/InternImage/releases/download/cls_model/internimage_b_1k_224.pth'
+pretrained = 'https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_b_1k_224.pth'
model = dict(
backbone=dict(
_delete_=True,
......
@@ -9,7 +9,7 @@ _base_ = [
'../_base_/schedules/schedule_3x.py',
'../_base_/default_runtime.py'
]
-pretrained = 'https://github.com/OpenGVLab/InternImage/releases/download/cls_model/internimage_b_1k_224.pth'
+pretrained = 'https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_b_1k_224.pth'
model = dict(
backbone=dict(
_delete_=True,
......
@@ -9,7 +9,7 @@ _base_ = [
'../_base_/schedules/schedule_1x.py',
'../_base_/default_runtime.py'
]
-pretrained = 'https://github.com/OpenGVLab/InternImage/releases/download/cls_model/internimage_s_1k_224.pth'
+pretrained = 'https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_s_1k_224.pth'
model = dict(
backbone=dict(
_delete_=True,
......
...@@ -9,7 +9,7 @@ _base_ = [ ...@@ -9,7 +9,7 @@ _base_ = [
'../_base_/schedules/schedule_3x.py', '../_base_/schedules/schedule_3x.py',
'../_base_/default_runtime.py' '../_base_/default_runtime.py'
] ]
pretrained = 'https://github.com/OpenGVLab/InternImage/releases/download/cls_model/internimage_s_1k_224.pth' pretrained = 'https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_s_1k_224.pth'
model = dict( model = dict(
backbone=dict( backbone=dict(
_delete_=True, _delete_=True,
......
...@@ -9,7 +9,7 @@ _base_ = [ ...@@ -9,7 +9,7 @@ _base_ = [
'../_base_/schedules/schedule_1x.py', '../_base_/schedules/schedule_1x.py',
'../_base_/default_runtime.py' '../_base_/default_runtime.py'
] ]
pretrained = 'https://github.com/OpenGVLab/InternImage/releases/download/cls_model/internimage_t_1k_224.pth' pretrained = 'https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_t_1k_224.pth'
model = dict( model = dict(
backbone=dict( backbone=dict(
_delete_=True, _delete_=True,
......
...@@ -9,7 +9,7 @@ _base_ = [ ...@@ -9,7 +9,7 @@ _base_ = [
'../_base_/schedules/schedule_3x.py', '../_base_/schedules/schedule_3x.py',
'../_base_/default_runtime.py' '../_base_/default_runtime.py'
] ]
pretrained = 'https://github.com/OpenGVLab/InternImage/releases/download/cls_model/internimage_t_1k_224.pth' pretrained = 'https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_t_1k_224.pth'
model = dict( model = dict(
backbone=dict( backbone=dict(
_delete_=True, _delete_=True,
......
# Mask R-CNN
> [Mask R-CNN](https://arxiv.org/abs/1703.06870)
<!-- [ALGORITHM] -->
## Introduction
Mask R-CNN is a conceptually simple, flexible, and general framework for object instance segmentation. It efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding-box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., estimating human poses in the same framework. Without bells and whistles, Mask R-CNN outperforms all existing single-model entries on every task, including the COCO 2016 challenge winners.
<div align=center>
<img src="https://user-images.githubusercontent.com/40661020/143967081-c2552bed-9af2-46c4-ae44-5b3b74e5679f.png"/>
</div>
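The parallel-branch design described above can be illustrated with a toy PyTorch module: shared RoI features feed both the existing box branch (classification + box regression) and a small fully convolutional mask branch. This is a minimal sketch of the idea, not the implementation used in this repository, and all names below are illustrative.

```python
import torch
import torch.nn as nn

class ToyRoIHeads(nn.Module):
    """Toy sketch: box branch and mask branch run in parallel on shared RoI features."""

    def __init__(self, in_channels=256, num_classes=80, roi_size=7):
        super().__init__()
        # Box branch, as in Faster R-CNN: per-RoI classification and box regression.
        self.box_fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_channels * roi_size * roi_size, 1024),
            nn.ReLU())
        self.cls_score = nn.Linear(1024, num_classes + 1)   # +1 for background
        self.bbox_pred = nn.Linear(1024, num_classes * 4)
        # Mask branch added by Mask R-CNN: a small FCN predicting one mask per class.
        self.mask_fcn = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(256, 256, 2, stride=2), nn.ReLU(),
            nn.Conv2d(256, num_classes, 1))

    def forward(self, roi_feats):
        # roi_feats: (num_rois, in_channels, roi_size, roi_size)
        x = self.box_fc(roi_feats)
        return self.cls_score(x), self.bbox_pred(x), self.mask_fcn(roi_feats)

scores, boxes, masks = ToyRoIHeads()(torch.randn(4, 256, 7, 7))
```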
## Model Zoo
| backbone | schd | box mAP | mask mAP | train speed | train time |#param | FLOPs | Config | Download |
| :------------: | :---------: | :-----: | :------: | :-----: |:------: | :-----: |:------: | :-----: | :---: |
| InternImage-T | 1x | 47.2 | 42.5 | 0.36s / iter | 9h | 49M | 270G | [config](./mask_rcnn_internimage_t_fpn_1x_coco.py) | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/det_model/mask_rcnn_internimage_t_fpn_1x_coco.pth) \| [log](https://github.com/OpenGVLab/InternImage/releases/download/det_model/mask_rcnn_internimage_t_fpn_1x_coco.log.json) |
| InternImage-T | 3x | 49.1 | 43.7 | 0.34s / iter | 26h | 49M | 270G | [config](./mask_rcnn_internimage_t_fpn_3x_coco.py) | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/det_model/mask_rcnn_internimage_t_fpn_3x_coco.pth) \| [log](https://github.com/OpenGVLab/InternImage/releases/download/det_model/mask_rcnn_internimage_t_fpn_3x_coco.log.json) |
| InternImage-S | 1x | 47.8 | 43.3 | 0.40s / iter | 10h | 69M | 340G | [config](./mask_rcnn_internimage_s_fpn_1x_coco.py) | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/det_model/mask_rcnn_internimage_s_fpn_1x_coco.pth) \| [log](https://github.com/OpenGVLab/InternImage/releases/download/det_model/mask_rcnn_internimage_s_fpn_1x_coco.log.json) |
| InternImage-S | 3x | 49.7 | 44.5 | 0.40s / iter | 30h | 69M | 340G | [config](./mask_rcnn_internimage_s_fpn_3x_coco.py) | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/det_model/mask_rcnn_internimage_s_fpn_3x_coco.pth) \| [log](https://github.com/OpenGVLab/InternImage/releases/download/det_model/mask_rcnn_internimage_s_fpn_3x_coco.log.json) |
| InternImage-B | 1x | 48.8 | 44.0 | 0.45s / iter | 11.5h | 115M | 501G | [config](./mask_rcnn_internimage_b_fpn_1x_coco.py) | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/det_model/mask_rcnn_internimage_b_fpn_1x_coco.pth) \| [log](https://github.com/OpenGVLab/InternImage/releases/download/det_model/mask_rcnn_internimage_b_fpn_1x_coco.log.json) |
| InternImage-B | 3x | 50.3 | 44.8 | 0.45s / iter | 34h | 115M | 501G | [config](./mask_rcnn_internimage_b_fpn_3x_coco.py)| [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/det_model/mask_rcnn_internimage_b_fpn_3x_coco.pth) \| [log](https://github.com/OpenGVLab/InternImage/releases/download/det_model/mask_rcnn_internimage_b_fpn_3x_coco.log.json) |
- Training speed is measured on A100 GPUs with the current code and may be faster than the speed recorded in the logs.
- Some logs come from recent re-training runs, so results in the logs may differ slightly from those in our paper.
- Please set `with_cp=True` to save memory if you encounter out-of-memory issues (see the sketch below).
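As an example, the following is a minimal, hypothetical override config that enables gradient checkpointing for the tiny model; the base file name is taken from the table above.

```python
# Hypothetical override config: enable gradient checkpointing in the backbone
# (with_cp=True) to trade extra compute for lower GPU memory usage.
_base_ = ['./mask_rcnn_internimage_t_fpn_1x_coco.py']
model = dict(backbone=dict(with_cp=True))
```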
...@@ -12,11 +12,12 @@ The ADE20K semantic segmentation dataset contains more than 20K scene-centric im ...@@ -12,11 +12,12 @@ The ADE20K semantic segmentation dataset contains more than 20K scene-centric im
| backbone | resolution | mIoU (ss/ms) | train speed | train time | #param | FLOPs | Config | Download | | backbone | resolution | mIoU (ss/ms) | train speed | train time | #param | FLOPs | Config | Download |
|:--------------:|:----------:|:-----------:|:-----------:|:----------:|:-------:|:-----:|:-----:|:-------------------:| |:--------------:|:----------:|:-----------:|:-----------:|:----------:|:-------:|:-----:|:-----:|:-------------------:|
| InternImage-T | 512x512 | 47.9 / 48.1 | 0.23s / iter | 10.5h | 59M | 944G | [config](./upernet_internimage_t_512_160k_ade20k.py) | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/seg_models/upernet_internimage_t_512_160k_ade20k.pth) \| [log](https://github.com/OpenGVLab/InternImage/releases/download/seg_models/upernet_internimage_t_512_160k_ade20k.log.json) | | InternImage-T | 512x512 | 47.9 / 48.1 | 0.23s / iter | 10.5h | 59M | 944G | [config](./upernet_internimage_t_512_160k_ade20k.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_t_512_160k_ade20k.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/upernet_internimage_t_512_160k_ade20k.log.json) |
| InternImage-S | 512x512 | 50.1 / 50.9 | 0.25s / iter | 11.5h | 80M | 1017G | [config](./upernet_internimage_s_512_160k_ade20k.py) | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/seg_models/upernet_internimage_s_512_160k_ade20k.pth) \| [log](https://github.com/OpenGVLab/InternImage/releases/download/seg_models/upernet_internimage_s_512_160k_ade20k.log.json) | | InternImage-S | 512x512 | 50.1 / 50.9 | 0.25s / iter | 11.5h | 80M | 1017G | [config](./upernet_internimage_s_512_160k_ade20k.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_s_512_160k_ade20k.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/upernet_internimage_s_512_160k_ade20k.log.json) |
| InternImage-B | 512x512 | 50.8 / 51.3 | 0.26s / iter | 12h | 128M | 1185G | [config](./upernet_internimage_b_512_160k_ade20k.py) | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/seg_models/upernet_internimage_b_512_160k_ade20k.pth) \| [log](https://github.com/OpenGVLab/InternImage/releases/download/seg_models/upernet_internimage_b_512_160k_ade20k.log.json) | | InternImage-B | 512x512 | 50.8 / 51.3 | 0.26s / iter | 12h | 128M | 1185G | [config](./upernet_internimage_b_512_160k_ade20k.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_b_512_160k_ade20k.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/upernet_internimage_b_512_160k_ade20k.log.json) |
| InternImage-L | 640x640 | 53.9 / 54.1 | 0.42s / iter | 19h | 256M | 2526G | [config](./upernet_internimage_l_640_160k_ade20k.py)| [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/seg_models/upernet_internimage_l_640_160k_ade20k.pth) \| [log](https://github.com/OpenGVLab/InternImage/releases/download/seg_models/upernet_internimage_l_640_160k_ade20k.log.json) | | InternImage-L | 640x640 | 53.9 / 54.1 | 0.42s / iter | 19h | 256M | 2526G | [config](./upernet_internimage_l_640_160k_ade20k.py)| [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_l_640_160k_ade20k.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/upernet_internimage_l_640_160k_ade20k.log.json) |
| InternImage-XL | 640x640 | 55.0 / 55.3 | 0.47s / iter | 22h | 368M | 3142G | [config](./upernet_internimage_xl_640_160k_ade20k.py) | [ckpt](https://github.com/OpenGVLab/InternImage/releases/download/seg_models/upernet_internimage_xl_640_160k_ade20k.pth) \| [log](https://github.com/OpenGVLab/InternImage/releases/download/seg_models/upernet_internimage_xl_640_160k_ade20k.log.json) | | InternImage-XL | 640x640 | 55.0 / 55.3 | 0.47s / iter | 22h | 368M | 3142G | [config](./upernet_internimage_xl_640_160k_ade20k.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_xl_640_160k_ade20k.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/upernet_internimage_xl_640_160k_ade20k.log.json) |
| InternImage-H | 896x896 | 59.9 / 60.3 | 0.94s / iter | 2d (2n) | 1.12B | 3566G | [config](./upernet_internimage_h_896_160k_ade20k.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_h_896_160k_ade20k.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/upernet_internimage_h_896_160k_ade20k.log.json) |
- Training speed is measured with A100 GPU. - Training speed is measured with A100 GPU.
- Please set `with_cp=True` to save memory if you meet `out-of-memory` issues. - Please set `with_cp=True` to save memory if you meet `out-of-memory` issues.
......
...@@ -7,7 +7,7 @@ _base_ = [ ...@@ -7,7 +7,7 @@ _base_ = [
'../_base_/models/upernet_r50.py', '../_base_/datasets/ade20k.py', '../_base_/models/upernet_r50.py', '../_base_/datasets/ade20k.py',
'../_base_/default_runtime.py', '../_base_/schedules/schedule_160k.py' '../_base_/default_runtime.py', '../_base_/schedules/schedule_160k.py'
] ]
pretrained = 'https://github.com/OpenGVLab/InternImage/releases/download/cls_model/internimage_b_1k_224.pth' pretrained = 'https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_b_1k_224.pth'
model = dict( model = dict(
backbone=dict( backbone=dict(
_delete_=True, _delete_=True,
......
# --------------------------------------------------------
# InternImage
# Copyright (c) 2022 OpenGVLab
# Licensed under The MIT License [see LICENSE for details]
# --------------------------------------------------------
_base_ = [
'../_base_/models/upernet_r50.py', '../_base_/datasets/ade20k.py',
'../_base_/default_runtime.py', '../_base_/schedules/schedule_160k.py'
]
# pretrained = 'https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_xl_22k_192to384.pth'
model = dict(
backbone=dict(
_delete_=True,
type='InternImage',
core_op='DCNv3',
channels=320,
depths=[6, 6, 32, 6],
groups=[10, 20, 40, 80],
mlp_ratio=4.,
drop_path_rate=0.5,
norm_layer='LN',
layer_scale=None,
offset_scale=1.0,
post_norm=False,
dw_kernel_size=5, # for InternImage-H/G
res_post_norm=True, # for InternImage-H/G
level2_post_norm=True, # for InternImage-H/G
level2_post_norm_block_ids=[5, 11, 17, 23, 29], # for InternImage-H/G
center_feature_scale=True, # for InternImage-H/G
with_cp=False,
out_indices=(0, 1, 2, 3),
init_cfg=None), #dict(type='Pretrained', checkpoint=pretrained)),
decode_head=dict(num_classes=150, in_channels=[320, 640, 1280, 2560]),
auxiliary_head=dict(num_classes=150, in_channels=1280),
test_cfg=dict(mode='whole'))
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
crop_size = (896, 896)
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations', reduce_zero_label=True),
dict(type='Resize', img_scale=(3584, 896), ratio_range=(0.5, 2.0)),
dict(type='RandomCrop', crop_size=crop_size, cat_max_ratio=0.75),
dict(type='RandomFlip', prob=0.5),
dict(type='PhotoMetricDistortion'),
dict(type='Normalize', **img_norm_cfg),
dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=255),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_semantic_seg']),
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(3584, 896),
# img_ratios=[0.5, 0.75, 1.0, 1.25, 1.5, 1.75],
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='ResizeToMultiple', size_divisor=32),
dict(type='RandomFlip'),
dict(type='Normalize', **img_norm_cfg),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img']),
])
]
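# Note: the custom constructor below applies layer-wise learning-rate decay:
# blocks closer to the output keep a learning rate near the base value, and each
# earlier block is scaled down by a further factor of layer_decay_rate.
# num_layers=50 matches the 6 + 6 + 32 + 6 blocks in `depths` above.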
optimizer = dict(
_delete_=True, type='AdamW', lr=0.00002, betas=(0.9, 0.999), weight_decay=0.05,
constructor='CustomLayerDecayOptimizerConstructor',
paramwise_cfg=dict(num_layers=50, layer_decay_rate=0.95,
depths=[6, 6, 32, 6], offset_lr_scale=1.0))
lr_config = dict(_delete_=True, policy='poly',
warmup='linear',
warmup_iters=1500,
warmup_ratio=1e-6,
power=1.0, min_lr=0.0, by_epoch=False)
# This config trains with 1 image per GPU (see `data` below); the model zoo
# table above reports a 2-node training run.
data = dict(samples_per_gpu=1,
train=dict(pipeline=train_pipeline),
val=dict(pipeline=test_pipeline),
test=dict(pipeline=test_pipeline))
runner = dict(type='IterBasedRunner')
optimizer_config = dict(_delete_=True, grad_clip=dict(max_norm=0.1, norm_type=2))
checkpoint_config = dict(by_epoch=False, interval=1000, max_keep_ckpts=1)
evaluation = dict(interval=16000, metric='mIoU', save_best='mIoU')
# fp16 = dict(loss_scale=dict(init_scale=512))
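After downloading the checkpoint from the Hugging Face link in the table above, inference with this config can be run through the mmsegmentation-0.x API that this codebase builds on. The snippet below is only a sketch: the file paths are placeholders, and the repository's custom modules must be importable so that the `InternImage` backbone and the DCNv3 op are registered.

```python
# Minimal inference sketch (paths are illustrative; assumes this repo's custom
# modules are on PYTHONPATH so the InternImage backbone is registered with mmseg).
from mmseg.apis import init_segmentor, inference_segmentor

config_file = 'upernet_internimage_h_896_160k_ade20k.py'
checkpoint_file = 'upernet_internimage_h_896_160k_ade20k.pth'

model = init_segmentor(config_file, checkpoint_file, device='cuda:0')
result = inference_segmentor(model, 'demo.jpg')  # one H x W array of ADE20K class indices
```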