Commit a8184dc3 authored by Zhe Chen, committed by Zhe Chen

Update README.md and release models (#44)

* Update README.md

* update configs

* clean code

* support InternImage-H/G
parent 6be127ee
.idea/
.DS_Store
classification/convertor/
segmentation/convertor/
@@ -93,7 +93,7 @@
**Segmentation Tasks**
<table border="1" width="90%">
<tr align="center">
<th colspan="3">Semantic Segmentation</th><th colspan="1">Street Segmentation</th><th colspan="1">RGBD Segmentation</th>
</tr>
<tr align="center">
<th>ADE20K</th><th>COCO Stuff-10K</th><th>Pascal Context</th><th>CityScapes</th><th>NYU Depth V2</th>
@@ -125,7 +125,7 @@
**Multimodal Tasks**
<table border="1" width="90%">
<tr align="center">
<th colspan="1">Image Captioning</th><th colspan="2">Fine-tuning Image-Text Retrieval</th><th colspan="1">Zero-shot Image-Text Retrieval</th>
</tr>
<tr align="center">
<th>COCO Caption</th><th>COCO Caption</th><th>Flickr30k</th><th>Flickr30k</th>
@@ -166,31 +166,31 @@
- 3D Perception: [BEVFormer](https://github.com/fundamentalvision/BEVFormer)
## Open-source Visual Pretrained Models
| name | pretrain | pre-training resolution | #param | download |
| :------------: | :--------: | :--------: | :-----: | :-----------------: |
| InternImage-L | ImageNet-22K | 384x384 | 223M | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_l_22k_192to384.pth) |
| InternImage-XL | ImageNet-22K | 384x384 | 335M | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_xl_22k_192to384.pth) |
| InternImage-H | Joint 427M | 384x384 | 1.08B | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_h_jointto22k_384.pth) |
| InternImage-G | - | 384x384 | 3B | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_g_pretrainto22k_384.pth) |
## ImageNet-1K Image Classification
| name | pretrain | resolution | acc@1 | #param | FLOPs | download |
| :------------: | :----------: | :--------: | :---: | :-----: | :---: | :-----------------: |
| InternImage-T | ImageNet-1K | 224x224 | 83.5 | 30M | 5G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_t_1k_224.pth) \| [cfg](classification/configs/internimage_t_1k_224.yaml) |
| InternImage-S | ImageNet-1K | 224x224 | 84.2 | 50M | 8G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_s_1k_224.pth) \| [cfg](classification/configs/internimage_s_1k_224.yaml) |
| InternImage-B | ImageNet-1K | 224x224 | 84.9 | 97M | 16G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_b_1k_224.pth) \| [cfg](classification/configs/internimage_b_1k_224.yaml) |
| InternImage-L | ImageNet-22K | 384x384 | 87.7 | 223M | 108G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_l_22kto1k_384.pth) \| [cfg](classification/configs/internimage_l_22kto1k_384.yaml) |
| InternImage-XL | ImageNet-22K | 384x384 | 88.0 | 335M | 163G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_xl_22kto1k_384.pth) \| [cfg](classification/configs/internimage_xl_22kto1k_384.yaml) |
| InternImage-H | Joint 427M | 640x640 | 89.6 | 1.08B | 1478G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_h_22kto1k_640.pth) \| [cfg](classification/configs/internimage_h_22kto1k_640.yaml) |
| InternImage-G | - | 512x512 | 90.1 | 3B | 2700G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_g_22kto1k_512.pth) \| [cfg](classification/configs/internimage_g_22kto1k_512.yaml) |
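The released checkpoints can be fetched directly from the URLs above; a minimal sketch for downloading one and counting its tensors (the Swin-style `{'model': state_dict, ...}` layout is an assumption, hence the fallback):

```python
# Minimal sketch: download a released checkpoint and inspect it on CPU.
# Assumption: the Swin-style {'model': state_dict, ...} checkpoint format.
import torch

url = ('https://huggingface.co/OpenGVLab/InternImage/resolve/main/'
       'internimage_t_1k_224.pth')
ckpt = torch.hub.load_state_dict_from_url(url, map_location='cpu')
state_dict = ckpt.get('model', ckpt)  # fall back to a bare state_dict
print(len(state_dict), 'parameter tensors')
```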
## COCO Object Detection and Instance Segmentation
| backbone | method | schd | box mAP | mask mAP | #param | FLOPs | download |
| :------------: | :----------------: | :---------: | :-----: | :------: | :-----: | :---: | :---: |
| InternImage-T | Mask R-CNN | 1x | 47.2 | 42.5 | 49M | 270G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_t_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_t_fpn_1x_coco.py) |
| InternImage-T | Mask R-CNN | 3x | 49.1 | 43.7 | 49M | 270G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_t_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_t_fpn_3x_coco.py) |
@@ -201,16 +201,16 @@
| InternImage-L | Cascade | 1x | 54.9 | 47.7 | 277M | 1399G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_l_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_l_fpn_1x_coco.py) |
| InternImage-L | Cascade | 3x | 56.1 | 48.5 | 277M | 1399G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_l_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_l_fpn_3x_coco.py) |
| InternImage-XL | Cascade | 1x | 55.3 | 48.1 | 387M | 1782G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_xl_fpn_1x_coco.py) |
| InternImage-XL | Cascade | 3x | 56.2 | 48.8 | 387M | 1782G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_xl_fpn_3x_coco.py) |
| backbone | method | box mAP (val/test) | #param | FLOPs | download |
| :------------: | :----------------: | :---------: | :------: | :-----: | :-----: |
| InternImage-H | DINO (TTA) | 65.0 / 65.4 | 2.18B | TODO | TODO |
| InternImage-G | DINO (TTA) | 65.3 / 65.5 | 3B | TODO | TODO |
## ADE20K Semantic Segmentation
| backbone | method | resolution | mIoU (ss/ms) | #param | FLOPs | download |
| :------------: | :--------: | :--------: | :----------: | :-----: | :---: | :---: |
| InternImage-T | UperNet | 512x512 | 47.9 / 48.1 | 59M | 944G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_t_512_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_t_512_160k_ade20k.py) |
| InternImage-S | UperNet | 512x512 | 50.1 / 50.9 | 80M | 1017G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_s_512_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_s_512_160k_ade20k.py) |
@@ -225,7 +225,7 @@
[TensorRT](classification/export.py)
| name | resolution | #param | FLOPs | batch 1 FPS (TensorRT) |
| :------------: | :--------: | :-----: | :---: | :-------------------: |
| InternImage-T | 224x224 | 30M | 5G | 156 |
| InternImage-S | 224x224 | 50M | 8G | 129 |
......
@@ -90,7 +90,7 @@ ADE20K, outperforming previous models by a large margin.
**Segmentation Task**
<table border="1" width="90%">
<tr align="center">
<th colspan="3">Semantic Segmentation</th><th colspan="1">Street Segmentation</th><th colspan="1">RGBD Segmentation</th>
</tr>
<tr align="center">
<th>ADE20K</th><th>COCO Stuff-10K</th><th>Pascal Context</th><th>CityScapes</th><th>NYU Depth V2</th>
@@ -122,7 +122,7 @@ ADE20K, outperforming previous models by a large margin.
**Multimodal Tasks**
<table border="1" width="90%">
<tr align="center">
<th colspan="1">Image Captioning</th><th colspan="2">Fine-tuning Image-Text Retrieval</th><th colspan="1">Zero-shot Image-Text Retrieval</th>
</tr>
<tr align="center">
<th>COCO Caption</th><th>COCO Caption</th><th>Flickr30k</th><th>Flickr30k</th>
@@ -147,13 +147,12 @@ InternImage, the visual backbone network of "INTERN-2.5", has a parameter size o
## Project Release
- [ ] Model for other downstream tasks
- [x] InternImage-H(1B)/G(3B)
- [x] TensorRT inference
- [x] Classification code of the InternImage series
- [x] InternImage-T/S/B/L/XL ImageNet-1K pretrained model
- [x] InternImage-L/XL ImageNet-22K pretrained model
- [x] InternImage-T/S/B/L/XL detection and instance segmentation model
- [x] InternImage-T/S/B/L/XL semantic segmentation model
@@ -165,25 +164,31 @@ tasks
- 3D Perception: [BEVFormer](https://github.com/fundamentalvision/BEVFormer)
## Open-source Visual Pretrained Models
| name | pretrain | pre-training resolution | #param | download |
| :------------: | :--------: | :--------: | :-----: | :-----------------: |
| InternImage-L | ImageNet-22K | 384x384 | 223M | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_l_22k_192to384.pth) |
| InternImage-XL | ImageNet-22K | 384x384 | 335M | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_xl_22k_192to384.pth) |
| InternImage-H | Joint 427M | 384x384 | 1.08B | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_h_jointto22k_384.pth) |
| InternImage-G | - | 384x384 | 3B | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_g_pretrainto22k_384.pth) |
## ImageNet-1K Image Classification
| name | pretrain | resolution | acc@1 | #param | FLOPs | download |
| :------------: | :----------: | :--------: | :---: | :-----: | :---: | :-----------------: |
| InternImage-T | ImageNet-1K | 224x224 | 83.5 | 30M | 5G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_t_1k_224.pth) \| [cfg](classification/configs/internimage_t_1k_224.yaml) |
| InternImage-S | ImageNet-1K | 224x224 | 84.2 | 50M | 8G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_s_1k_224.pth) \| [cfg](classification/configs/internimage_s_1k_224.yaml) |
| InternImage-B | ImageNet-1K | 224x224 | 84.9 | 97M | 16G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_b_1k_224.pth) \| [cfg](classification/configs/internimage_b_1k_224.yaml) |
| InternImage-L | ImageNet-22K | 384x384 | 87.7 | 223M | 108G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_l_22kto1k_384.pth) \| [cfg](classification/configs/internimage_l_22kto1k_384.yaml) |
| InternImage-XL | ImageNet-22K | 384x384 | 88.0 | 335M | 163G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_xl_22kto1k_384.pth) \| [cfg](classification/configs/internimage_xl_22kto1k_384.yaml) |
| InternImage-H | Joint 427M | 640x640 | 89.6 | 1.08B | 1478G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_h_22kto1k_640.pth) \| [cfg](classification/configs/internimage_h_22kto1k_640.yaml) |
| InternImage-G | - | 512x512 | 90.1 | 3B | 2700G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_g_22kto1k_512.pth) \| [cfg](classification/configs/internimage_g_22kto1k_512.yaml) |
## COCO Object Detection and Instance Segmentation
| backbone | method | schd | box mAP | mask mAP | #param | FLOPs | download |
| :------------: | :----------------: | :---------: | :-----: | :------: | :-----: | :---: | :---: |
| InternImage-T | Mask R-CNN | 1x | 47.2 | 42.5 | 49M | 270G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_t_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_t_fpn_1x_coco.py) |
| InternImage-T | Mask R-CNN | 3x | 49.1 | 43.7 | 49M | 270G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_t_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_t_fpn_3x_coco.py) |
@@ -194,17 +199,17 @@ tasks
| InternImage-L | Cascade | 1x | 54.9 | 47.7 | 277M | 1399G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_l_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_l_fpn_1x_coco.py) |
| InternImage-L | Cascade | 3x | 56.1 | 48.5 | 277M | 1399G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_l_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_l_fpn_3x_coco.py) |
| InternImage-XL | Cascade | 1x | 55.3 | 48.1 | 387M | 1782G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_xl_fpn_1x_coco.py) |
| InternImage-XL | Cascade | 3x | 56.2 | 48.8 | 387M | 1782G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_xl_fpn_3x_coco.py) |
| backbone | method | box mAP (val/test) | #param | FLOPs | download |
| :------------: | :----------------: | :---------: | :------: | :-----: | :---: |
| InternImage-H | DINO (TTA) | 65.0 / 65.4 | 2.18B | TODO | TODO |
| InternImage-G | DINO (TTA) | 65.3 / 65.5 | 3B | TODO | TODO |
## ADE20K Semantic Segmentation
| backbone | method | resolution | mIoU (ss/ms) | #param | FLOPs | download |
| :------------: | :--------: | :--------: | :----------: | :-----: | :---: | :---: |
| InternImage-T | UperNet | 512x512 | 47.9 / 48.1 | 59M | 944G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_t_512_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_t_512_160k_ade20k.py) |
| InternImage-S | UperNet | 512x512 | 50.1 / 50.9 | 80M | 1017G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_s_512_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_s_512_160k_ade20k.py) |
@@ -212,12 +217,14 @@ tasks
| InternImage-L | UperNet | 640x640 | 53.9 / 54.1 | 256M | 2526G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_l_640_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_l_640_160k_ade20k.py) |
| InternImage-XL | UperNet | 640x640 | 55.0 / 55.3 | 368M | 3142G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_xl_640_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_xl_640_160k_ade20k.py) |
| InternImage-H | UperNet | 896x896 | 59.9 / 60.3 | 1.12B | 3566G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_h_896_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_h_896_160k_ade20k.py) |
| InternImage-H | Mask2Former | 896x896 | 62.5 / 62.9 | 1.31B | 4635G | TODO |
## Main Results of FPS
[TensorRT](classification/export.py)
| name | resolution | #param | FLOPs | batch 1 FPS (TensorRT) |
| :------------: | :--------: | :-----: | :---: | :-------------------: |
| InternImage-T | 224x224 | 30M | 5G | 156 |
| InternImage-S | 224x224 | 50M | 8G | 129 |
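These FPS numbers are measured through the TensorRT path linked above. As a rough sketch of the usual first step of such a pipeline (ONNX export; a torchvision model stands in here, since the exact interface of `classification/export.py` is not shown in this diff and the real model routes DCNv3 through its own export logic):

```python
# Rough sketch of the ONNX-export step that typically precedes building a
# TensorRT engine. torchvision's resnet18 is a stand-in for InternImage;
# the repo's classification/export.py handles the real model itself.
import torch
import torchvision

model = torchvision.models.resnet18().eval()
dummy = torch.randn(1, 3, 224, 224)  # batch 1, matching the table above
torch.onnx.export(model, dummy, 'model.onnx', opset_version=11,
                  input_names=['input'], output_names=['logits'])
```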
......
@@ -51,7 +51,7 @@ sh ./make.sh
python test.py
```
### Data Preparation
We use the standard ImageNet dataset, which you can download from http://image-net.org/. We provide the following two ways to load data:
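For the folder-based option, a minimal sketch assuming the standard torchvision `ImageFolder` layout (`train/<class>/*.JPEG`, `val/<class>/*.JPEG`; paths are placeholders, and the repo's own dataloader adds its augmentation pipeline on top):

```python
# Minimal sketch of the folder-based loader (paths are illustrative).
import torchvision.datasets as datasets
import torchvision.transforms as T

transform = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])
train_set = datasets.ImageFolder('/path/to/imagenet/train', transform=transform)
img, label = train_set[0]
print(img.shape, label)  # torch.Size([3, 224, 224]) 0
```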
@@ -128,7 +128,7 @@ load data:
### Evaluation
To evaluate a pretrained `InternImage` on ImageNet val, run:
```bash
python -m torch.distributed.launch --nproc_per_node <num-of-gpus-to-use> --master_port 12345 main.py --eval \
@@ -142,7 +142,7 @@ python -m torch.distributed.launch --nproc_per_node 1 --master_port 12345 main.p
--cfg configs/internimage_b_1k_224.yaml --resume internimage_b_1k_224.pth --data-path <imagenet-path>
```
### Training from Scratch on ImageNet-1K
To train an `InternImage` on ImageNet from scratch, run:
@@ -151,7 +151,7 @@ python -m torch.distributed.launch --nproc_per_node <num-of-gpus-to-use> --maste
--cfg <config-file> --data-path <imagenet-path> [--batch-size <batch-size-per-gpu> --output <output-directory> --tag <job-tag>]
```
### Manage Jobs with Slurm
For example, to train `InternImage` with 8 GPUs on a single node for 300 epochs, run:
@@ -184,9 +184,9 @@ python -m torch.distributed.launch --nproc_per_node <num-of-gpus-to-use> --maste
--resume internimage_xl_22k_192to384.pth --eval
``` -->
<!-- ### Fine-tuning from an ImageNet-22K pretrained model
For example, to fine-tune an `InternImage-XL-22k` model pretrained on ImageNet-22K:
```bash
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_image_.yaml --pretrained intern_image_b.pth --eval
......
DATA:
  IMG_SIZE: 224
  IMG_ON_MEMORY: True
AUG:
  MIXUP: 0.0
  CUTMIX: 0.0
  REPROB: 0.0
MODEL:
  TYPE: intern_image
  DROP_PATH_RATE: 0.6
  LABEL_SMOOTHING: 0.3
  INTERN_IMAGE:
    CORE_OP: 'DCNv3'
    DEPTHS: [6, 6, 32, 6]
    GROUPS: [10, 20, 40, 80]
    CHANNELS: 320
    DW_KERNEL_SIZE: 5
    LAYER_SCALE: None
    OFFSET_SCALE: 1.0
    MLP_RATIO: 4.0
    POST_NORM: False
    RES_POST_NORM: True
    LEVEL2_POST_NORM: True
    LEVEL2_POST_NORM_BLOCK_IDS: [5, 11, 17, 23, 29]
    CENTER_FEATURE_SCALE: True
    USE_CLIP_PROJECTOR: True
TRAIN:
  EMA:
    ENABLE: true
    DECAY: 0.9998
  EPOCHS: 30
  WARMUP_EPOCHS: 0
  WEIGHT_DECAY: 1e-8
  BASE_LR: 3e-05 # 512
  WARMUP_LR: 3e-08
  MIN_LR: 3e-07
  LR_LAYER_DECAY: true
  LR_LAYER_DECAY_RATIO: 0.8
  RAND_INIT_FT_HEAD: true
  USE_CHECKPOINT: true
AMP_OPT_LEVEL: O0
EVAL_FREQ: 1
\ No newline at end of file
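This new config wires up the H/G-specific options (`DW_KERNEL_SIZE`, `RES_POST_NORM`, `LEVEL2_POST_NORM_BLOCK_IDS`, `CENTER_FEATURE_SCALE`). A minimal sketch of reading it with PyYAML; the path is a placeholder, and the repo itself merges such files into a yacs `CfgNode` through its own `config.py` rather than this snippet:

```python
# Minimal sketch: inspect the H/G options in the config above with PyYAML.
# The path is a placeholder; the repo's config.py does the real parsing.
import yaml

with open('path/to/this_config.yaml') as f:
    cfg = yaml.safe_load(f)

print(cfg['MODEL']['INTERN_IMAGE']['DEPTHS'])                # [6, 6, 32, 6]
print(cfg['MODEL']['INTERN_IMAGE']['DW_KERNEL_SIZE'])        # 5
print(cfg['MODEL']['INTERN_IMAGE']['CENTER_FEATURE_SCALE'])  # True
```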
@@ -74,7 +74,7 @@ def _is_power_of_2(n):
        raise ValueError(
            "invalid input for _is_power_of_2: {} (type: {})".format(n, type(n)))
    return (n & (n - 1) == 0) and n != 0
class CenterFeatureScaleModule(nn.Module):
@@ -86,7 +86,7 @@ class CenterFeatureScaleModule(nn.Module):
            weight=center_feature_scale_proj_weight,
            bias=center_feature_scale_proj_bias).sigmoid()
        return center_feature_scale
class DCNv3_pytorch(nn.Module):
    def __init__(
@@ -104,10 +104,10 @@ class DCNv3_pytorch(nn.Module):
            center_feature_scale=False):
        """
        DCNv3 Module
        :param channels
        :param kernel_size
        :param stride
        :param pad
        :param dilation
        :param group
        :param offset_scale
@@ -231,10 +231,10 @@ class DCNv3(nn.Module):
            center_feature_scale=False):
        """
        DCNv3 Module
        :param channels
        :param kernel_size
        :param stride
        :param pad
        :param dilation
        :param group
        :param offset_scale
......
@@ -54,7 +54,7 @@ sh ./make.sh
python test.py
```
### Data Preparation
Prepare COCO according to the guidelines in [MMDetection v2.28.1](https://github.com/open-mmlab/mmdetection/blob/master/docs/en/1_exist_data_model.md).
@@ -93,7 +93,7 @@ For example, to train `InternImage-T` with 8 GPU on 1 node, run:
sh dist_train.sh configs/coco/mask_rcnn_internimage_t_fpn_1x_coco.py 8
```
### Manage Jobs with Slurm
For example, to train `InternImage-L` with 32 GPUs on 4 nodes, run:
......
@@ -36,7 +36,7 @@ Based on community feedback, in 2017 the training/validation split was changed f
| InternImage-L | 1x | 54.9 | 47.7 | 0.73s / iter | 18h | 277M | 1399G | [config](./cascade_internimage_l_fpn_1x_coco.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_l_fpn_1x_coco.pth) |
| InternImage-L | 3x | 56.1 | 48.5 | 0.79s / iter | 15h (4n) | 277M | 1399G | [config](./cascade_internimage_l_fpn_3x_coco.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_l_fpn_3x_coco.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_l_fpn_3x_coco.log.json) |
| InternImage-XL | 1x | 55.3 | 48.1 | 0.82s / iter | 21h | 387M | 1782G | [config](./cascade_internimage_xl_fpn_1x_coco.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_1x_coco.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_1x_coco.log.json) |
| InternImage-XL | 3x | 56.2 | 48.8 | 0.91s / iter | 17h (4n) | 387M | 1782G | [config](./cascade_internimage_xl_fpn_3x_coco.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_3x_coco.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_3x_coco.log.json) |
- Training speed is measured on A100 GPUs with the current code and may be faster than the speed recorded in the logs.
- Some logs come from recently retrained models, so the results in the logs may differ slightly from those in the paper.
......
@@ -13,9 +13,11 @@ from mmcv.runner import _load_checkpoint
from mmcv.cnn import constant_init, trunc_normal_init
from mmdet.utils import get_root_logger
from mmdet.models.builder import BACKBONES
import torch.nn.functional as F
from ops_dcnv3 import modules as opsm
class to_channels_first(nn.Module):
    def __init__(self):
@@ -69,6 +71,171 @@ def build_act_layer(act_layer):
    raise NotImplementedError(f'build_act_layer does not support {act_layer}')
class CrossAttention(nn.Module):
r""" Cross Attention Module
Args:
dim (int): Number of input channels.
num_heads (int): Number of attention heads. Default: 8
qkv_bias (bool, optional): If True, add a learnable bias to q, k, v.
Default: False.
qk_scale (float | None, optional): Override default qk scale of
head_dim ** -0.5 if set. Default: None.
attn_drop (float, optional): Dropout ratio of attention weight.
Default: 0.0
proj_drop (float, optional): Dropout ratio of output. Default: 0.0
attn_head_dim (int, optional): Dimension of attention head.
out_dim (int, optional): Dimension of output.
"""
def __init__(self,
dim,
num_heads=8,
qkv_bias=False,
qk_scale=None,
attn_drop=0.,
proj_drop=0.,
attn_head_dim=None,
out_dim=None):
super().__init__()
if out_dim is None:
out_dim = dim
self.num_heads = num_heads
head_dim = dim // num_heads
if attn_head_dim is not None:
head_dim = attn_head_dim
all_head_dim = head_dim * self.num_heads
self.scale = qk_scale or head_dim ** -0.5
assert all_head_dim == dim
self.q = nn.Linear(dim, all_head_dim, bias=False)
self.k = nn.Linear(dim, all_head_dim, bias=False)
self.v = nn.Linear(dim, all_head_dim, bias=False)
if qkv_bias:
self.q_bias = nn.Parameter(torch.zeros(all_head_dim))
self.k_bias = nn.Parameter(torch.zeros(all_head_dim))
self.v_bias = nn.Parameter(torch.zeros(all_head_dim))
else:
self.q_bias = None
self.k_bias = None
self.v_bias = None
self.attn_drop = nn.Dropout(attn_drop)
self.proj = nn.Linear(all_head_dim, out_dim)
self.proj_drop = nn.Dropout(proj_drop)
def forward(self, x, k=None, v=None):
B, N, C = x.shape
N_k = k.shape[1]
N_v = v.shape[1]
q_bias, k_bias, v_bias = None, None, None
if self.q_bias is not None:
q_bias = self.q_bias
k_bias = self.k_bias
v_bias = self.v_bias
q = F.linear(input=x, weight=self.q.weight, bias=q_bias)
q = q.reshape(B, N, 1, self.num_heads,
-1).permute(2, 0, 3, 1,
4).squeeze(0) # (B, N_head, N_q, dim)
k = F.linear(input=k, weight=self.k.weight, bias=k_bias)
k = k.reshape(B, N_k, 1, self.num_heads, -1).permute(2, 0, 3, 1,
4).squeeze(0)
v = F.linear(input=v, weight=self.v.weight, bias=v_bias)
v = v.reshape(B, N_v, 1, self.num_heads, -1).permute(2, 0, 3, 1,
4).squeeze(0)
q = q * self.scale
attn = (q @ k.transpose(-2, -1)) # (B, N_head, N_q, N_k)
attn = attn.softmax(dim=-1)
attn = self.attn_drop(attn)
x = (attn @ v).transpose(1, 2).reshape(B, N, -1)
x = self.proj(x)
x = self.proj_drop(x)
return x
class AttentiveBlock(nn.Module):
r"""Attentive Block
Args:
dim (int): Number of input channels.
num_heads (int): Number of attention heads. Default: 8
qkv_bias (bool, optional): If True, add a learnable bias to q, k, v.
Default: False.
qk_scale (float | None, optional): Override default qk scale of
head_dim ** -0.5 if set. Default: None.
drop (float, optional): Dropout rate. Default: 0.0.
attn_drop (float, optional): Attention dropout rate. Default: 0.0.
drop_path (float | tuple[float], optional): Stochastic depth rate.
Default: 0.0.
norm_layer (nn.Module, optional): Normalization layer. Default: nn.LayerNorm.
attn_head_dim (int, optional): Dimension of attention head. Default: None.
out_dim (int, optional): Dimension of output. Default: None.
"""
def __init__(self,
dim,
num_heads,
qkv_bias=False,
qk_scale=None,
drop=0.,
attn_drop=0.,
drop_path=0.,
norm_layer="LN",
attn_head_dim=None,
out_dim=None):
super().__init__()
self.norm1_q = build_norm_layer(dim, norm_layer, eps=1e-6)
self.norm1_k = build_norm_layer(dim, norm_layer, eps=1e-6)
self.norm1_v = build_norm_layer(dim, norm_layer, eps=1e-6)
self.cross_dcn = CrossAttention(dim,
num_heads=num_heads,
qkv_bias=qkv_bias,
qk_scale=qk_scale,
attn_drop=attn_drop,
proj_drop=drop,
attn_head_dim=attn_head_dim,
out_dim=out_dim)
self.drop_path = DropPath(
drop_path) if drop_path > 0. else nn.Identity()
def forward(self,
x_q,
x_kv,
pos_q,
pos_k,
bool_masked_pos,
rel_pos_bias=None):
x_q = self.norm1_q(x_q + pos_q)
x_k = self.norm1_k(x_kv + pos_k)
x_v = self.norm1_v(x_kv)
x = self.cross_dcn(x_q, k=x_k, v=x_v)
return x
class AttentionPoolingBlock(AttentiveBlock):
def forward(self, x):
x_q = x.mean(1, keepdim=True)
x_kv = x
pos_q, pos_k = 0, 0
x = super().forward(x_q, x_kv, pos_q, pos_k,
bool_masked_pos=None,
rel_pos_bias=None)
x = x.squeeze(1)
return x
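def _cross_attention_shape_check():
    # Illustrative sketch, not part of the original diff: a standalone shape
    # check for the new CrossAttention module (torch is imported at the top of
    # this file). AttentionPoolingBlock drives it the same way, with a single
    # mean-pooled query attending over all spatial tokens.
    attn = CrossAttention(dim=64, num_heads=8, qkv_bias=True, out_dim=32)
    q = torch.randn(2, 1, 64)     # one pooled query per image
    kv = torch.randn(2, 196, 64)  # e.g. a flattened 14x14 feature map
    out = attn(q, k=kv, v=kv)
    assert out.shape == (2, 1, 32)  # projected to out_dim channels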
class StemLayer(nn.Module):
    r""" Stem layer of InternImage
    Args:
@@ -195,7 +362,10 @@ class InternImageLayer(nn.Module):
                 post_norm=False,
                 layer_scale=None,
                 offset_scale=1.0,
                 with_cp=False,
dw_kernel_size=None, # for InternImage-H/G
res_post_norm=False, # for InternImage-H/G
center_feature_scale=False): # for InternImage-H/G
        super().__init__()
        self.channels = channels
        self.groups = groups
@@ -204,15 +374,18 @@ class InternImageLayer(nn.Module):
        self.norm1 = build_norm_layer(channels, 'LN')
        self.post_norm = post_norm
        self.dcn = core_op(
            channels=channels,
            kernel_size=3,
            stride=1,
            pad=1,
            dilation=1,
            group=groups,
            offset_scale=offset_scale,
            act_layer=act_layer,
norm_layer=norm_layer,
dw_kernel_size=dw_kernel_size, # for InternImage-H/G
center_feature_scale=center_feature_scale) # for InternImage-H/G
        self.drop_path = DropPath(drop_path) if drop_path > 0. \
            else nn.Identity()
        self.norm2 = build_norm_layer(channels, 'LN')
@@ -226,6 +399,10 @@ class InternImageLayer(nn.Module):
                                       requires_grad=True)
            self.gamma2 = nn.Parameter(layer_scale * torch.ones(channels),
                                       requires_grad=True)
self.res_post_norm = res_post_norm
if res_post_norm:
self.res_post_norm1 = build_norm_layer(channels, 'LN')
self.res_post_norm2 = build_norm_layer(channels, 'LN')
    def forward(self, x):
@@ -234,6 +411,9 @@ class InternImageLayer(nn.Module):
        if self.post_norm:
            x = x + self.drop_path(self.norm1(self.dcn(x)))
            x = x + self.drop_path(self.norm2(self.mlp(x)))
elif self.res_post_norm: # for InternImage-H/G
x = x + self.drop_path(self.res_post_norm1(self.dcn(self.norm1(x))))
x = x + self.drop_path(self.res_post_norm2(self.mlp(self.norm2(x))))
        else:
            x = x + self.drop_path(self.dcn(self.norm1(x)))
            x = x + self.drop_path(self.mlp(self.norm2(x)))
@@ -285,36 +465,54 @@ class InternImageBlock(nn.Module):
                 post_norm=False,
                 offset_scale=1.0,
                 layer_scale=None,
                 with_cp=False,
dw_kernel_size=None, # for InternImage-H/G
post_norm_block_ids=None, # for InternImage-H/G
res_post_norm=False, # for InternImage-H/G
center_feature_scale=False): # for InternImage-H/G
        super().__init__()
        self.channels = channels
        self.depth = depth
        self.post_norm = post_norm
self.center_feature_scale = center_feature_scale
        self.blocks = nn.ModuleList([
            InternImageLayer(
                core_op=core_op,
                channels=channels,
                groups=groups,
                mlp_ratio=mlp_ratio,
                drop=drop,
                drop_path=drop_path[i] if isinstance(
                    drop_path, list) else drop_path,
                act_layer=act_layer,
                norm_layer=norm_layer,
                post_norm=post_norm,
                layer_scale=layer_scale,
                offset_scale=offset_scale,
with_cp=with_cp,
dw_kernel_size=dw_kernel_size, # for InternImage-H/G
res_post_norm=res_post_norm, # for InternImage-H/G
center_feature_scale=center_feature_scale # for InternImage-H/G
) for i in range(depth)
        ])
        if not self.post_norm or center_feature_scale:
            self.norm = build_norm_layer(channels, 'LN')
self.post_norm_block_ids = post_norm_block_ids
if post_norm_block_ids is not None: # for InternImage-H/G
self.post_norms = nn.ModuleList(
[build_norm_layer(channels, 'LN', eps=1e-6) for _ in post_norm_block_ids]
)
        self.downsample = DownsampleLayer(
            channels=channels, norm_layer=norm_layer) if downsample else None
    def forward(self, x, return_wo_downsample=False):
        for i, blk in enumerate(self.blocks):
            x = blk(x)
            if (self.post_norm_block_ids is not None) and (i in self.post_norm_block_ids):
index = self.post_norm_block_ids.index(i)
x = self.post_norms[index](x) # for InternImage-H/G
if not self.post_norm or self.center_feature_scale:
            x = self.norm(x)
        if return_wo_downsample:
            x_ = x
@@ -344,6 +542,11 @@ class InternImage(nn.Module):
        layer_scale (bool): Whether to use layer scale. Default: False
        cls_scale (bool): Whether to use class scale. Default: False
        with_cp (bool): Use checkpoint or not. Using checkpoint will save some
dw_kernel_size (int): Size of the dwconv. Default: None
level2_post_norm (bool): Whether to use level2 post norm. Default: False
level2_post_norm_block_ids (list): Indexes of post norm blocks. Default: None
res_post_norm (bool): Whether to use res post norm. Default: False
center_feature_scale (bool): Whether to use center feature scale. Default: False
""" """
def __init__(self, def __init__(self,
...@@ -361,6 +564,11 @@ class InternImage(nn.Module): ...@@ -361,6 +564,11 @@ class InternImage(nn.Module):
offset_scale=1.0, offset_scale=1.0,
post_norm=False, post_norm=False,
with_cp=False, with_cp=False,
dw_kernel_size=None, # for InternImage-H/G
level2_post_norm=False, # for InternImage-H/G
level2_post_norm_block_ids=None, # for InternImage-H/G
res_post_norm=False, # for InternImage-H/G
center_feature_scale=False, # for InternImage-H/G
                 out_indices=(0, 1, 2, 3),
                 init_cfg=None,
                 **kwargs):
@@ -374,10 +582,15 @@ class InternImage(nn.Module):
        self.mlp_ratio = mlp_ratio
        self.init_cfg = init_cfg
        self.out_indices = out_indices
        self.level2_post_norm_block_ids = level2_post_norm_block_ids
        logger = get_root_logger()
        logger.info(f'using core type: {core_op}')
        logger.info(f'using activation layer: {act_layer}')
logger.info(f'using main norm layer: {norm_layer}')
logger.info(f'using dpr: {drop_path_type}, {drop_path_rate}')
logger.info(f"level2_post_norm: {level2_post_norm}")
logger.info(f"level2_post_norm_block_ids: {level2_post_norm_block_ids}")
logger.info(f"res_post_norm: {res_post_norm}")
        in_chans = 3
        self.patch_embed = StemLayer(in_chans=in_chans,
@@ -395,6 +608,8 @@ class InternImage(nn.Module):
        self.levels = nn.ModuleList()
        for i in range(self.num_levels):
post_norm_block_ids = level2_post_norm_block_ids if level2_post_norm and (
i == 2) else None # for InternImage-H/G
            level = InternImageBlock(
                core_op=getattr(opsm, core_op),
                channels=int(channels * 2**i),
@@ -409,7 +624,12 @@ class InternImage(nn.Module):
                downsample=(i < self.num_levels - 1),
                layer_scale=layer_scale,
                offset_scale=offset_scale,
                with_cp=with_cp,
dw_kernel_size=dw_kernel_size, # for InternImage-H/G
post_norm_block_ids=post_norm_block_ids, # for InternImage-H/G
res_post_norm=res_post_norm, # for InternImage-H/G
center_feature_scale=center_feature_scale # for InternImage-H/G
)
            self.levels.append(level)
        self.num_layers = len(depths)
......
@@ -9,6 +9,7 @@ from __future__ import print_function
from __future__ import division
import warnings
import torch
from torch import nn
import torch.nn.functional as F
from torch.nn.init import xavier_uniform_, constant_
@@ -73,20 +74,40 @@ def _is_power_of_2(n):
        raise ValueError(
            "invalid input for _is_power_of_2: {} (type: {})".format(n, type(n)))
    return (n & (n - 1) == 0) and n != 0
class CenterFeatureScaleModule(nn.Module):
def forward(self,
query,
center_feature_scale_proj_weight,
center_feature_scale_proj_bias):
center_feature_scale = F.linear(query,
weight=center_feature_scale_proj_weight,
bias=center_feature_scale_proj_bias).sigmoid()
return center_feature_scale
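def _center_feature_scale_gate_demo():
    # Illustrative sketch, not part of the original diff: the module returns
    # one sigmoid gate per group. With the zero-initialized projection that
    # DCNv3 creates below, every gate starts at sigmoid(0) = 0.5.
    module = CenterFeatureScaleModule()
    query = torch.randn(2, 14, 14, 64)  # (N, H, W, channels)
    weight = torch.zeros(4, 64)         # (group, channels)
    bias = torch.zeros(4)               # (group,)
    gate = module(query, weight, bias)
    assert gate.shape == (2, 14, 14, 4)  # one scalar gate per group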
class DCNv3_pytorch(nn.Module):
    def __init__(
            self,
            channels=64,
            kernel_size=3,
dw_kernel_size=None,
stride=1,
pad=1,
dilation=1,
group=4,
offset_scale=1.0,
act_layer='GELU',
norm_layer='LN',
center_feature_scale=False):
""" """
DCNv3 Module DCNv3 Module
:param channels :param channels
:param kernel_size :param kernel_size
:param stride :param stride
:param pad :param pad
:param dilation :param dilation
:param group :param group
:param offset_scale :param offset_scale
...@@ -98,6 +119,7 @@ class DCNv3_pytorch(nn.Module): ...@@ -98,6 +119,7 @@ class DCNv3_pytorch(nn.Module):
raise ValueError( raise ValueError(
f'channels must be divisible by group, but got {channels} and {group}') f'channels must be divisible by group, but got {channels} and {group}')
_d_per_group = channels // group _d_per_group = channels // group
dw_kernel_size = dw_kernel_size if dw_kernel_size is not None else kernel_size
        # you'd better set _d_per_group to a power of 2 which is more efficient in our CUDA implementation
        if not _is_power_of_2(_d_per_group):
            warnings.warn(
@@ -107,20 +129,22 @@ class DCNv3_pytorch(nn.Module):
        self.offset_scale = offset_scale
        self.channels = channels
        self.kernel_size = kernel_size
self.dw_kernel_size = dw_kernel_size
        self.stride = stride
        self.dilation = dilation
        self.pad = pad
        self.group = group
        self.group_channels = channels // group
        self.offset_scale = offset_scale
self.center_feature_scale = center_feature_scale
        self.dw_conv = nn.Sequential(
            nn.Conv2d(
                channels,
                channels,
                kernel_size=dw_kernel_size,
                stride=1,
                padding=(dw_kernel_size - 1) // 2,
                groups=channels),
            build_norm_layer(
                channels,
@@ -137,6 +161,13 @@ class DCNv3_pytorch(nn.Module):
        self.input_proj = nn.Linear(channels, channels)
        self.output_proj = nn.Linear(channels, channels)
        self._reset_parameters()
if center_feature_scale:
self.center_feature_scale_proj_weight = nn.Parameter(
torch.zeros((group, channels), dtype=torch.float))
self.center_feature_scale_proj_bias = nn.Parameter(
torch.tensor(0.0, dtype=torch.float).view((1,)).repeat(group, ))
self.center_feature_scale_module = CenterFeatureScaleModule()
    def _reset_parameters(self):
        constant_(self.offset.weight.data, 0.)
@@ -156,6 +187,7 @@ class DCNv3_pytorch(nn.Module):
        N, H, W, _ = input.shape
        x = self.input_proj(input)
x_proj = x
        x1 = input.permute(0, 3, 1, 2)
        x1 = self.dw_conv(x1)
@@ -171,6 +203,13 @@ class DCNv3_pytorch(nn.Module):
            self.dilation, self.dilation,
            self.group, self.group_channels,
            self.offset_scale)
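        # Center-feature-scale path (InternImage-H/G): compute a per-group
        # sigmoid gate from the depthwise features x1 and blend the DCN output
        # with the identity projection x_proj saved above.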
if self.center_feature_scale:
center_feature_scale = self.center_feature_scale_module(
x1, self.center_feature_scale_proj_weight, self.center_feature_scale_proj_bias)
# N, H, W, groups -> N, H, W, groups, 1 -> N, H, W, groups, _d_per_group -> N, H, W, channels
center_feature_scale = center_feature_scale[..., None].repeat(
1, 1, 1, 1, self.channels // self.group).flatten(-2)
x = x * (1 - center_feature_scale) + x_proj * center_feature_scale
        x = self.output_proj(x)
        return x
@@ -178,15 +217,24 @@ class DCNv3_pytorch(nn.Module):
class DCNv3(nn.Module):
    def __init__(
            self,
            channels=64,
            kernel_size=3,
dw_kernel_size=None,
stride=1,
pad=1,
dilation=1,
group=4,
offset_scale=1.0,
act_layer='GELU',
norm_layer='LN',
center_feature_scale=False):
""" """
DCNv3 Module DCNv3 Module
:param channels :param channels
:param kernel_size :param kernel_size
:param stride :param stride
:param pad :param pad
:param dilation :param dilation
:param group :param group
:param offset_scale :param offset_scale
...@@ -198,6 +246,7 @@ class DCNv3(nn.Module): ...@@ -198,6 +246,7 @@ class DCNv3(nn.Module):
raise ValueError( raise ValueError(
f'channels must be divisible by group, but got {channels} and {group}') f'channels must be divisible by group, but got {channels} and {group}')
_d_per_group = channels // group _d_per_group = channels // group
dw_kernel_size = dw_kernel_size if dw_kernel_size is not None else kernel_size
# you'd better set _d_per_group to a power of 2 which is more efficient in our CUDA implementation # you'd better set _d_per_group to a power of 2 which is more efficient in our CUDA implementation
if not _is_power_of_2(_d_per_group): if not _is_power_of_2(_d_per_group):
warnings.warn( warnings.warn(
...@@ -207,20 +256,22 @@ class DCNv3(nn.Module): ...@@ -207,20 +256,22 @@ class DCNv3(nn.Module):
self.offset_scale = offset_scale self.offset_scale = offset_scale
self.channels = channels self.channels = channels
self.kernel_size = kernel_size self.kernel_size = kernel_size
self.dw_kernel_size = dw_kernel_size
self.stride = stride self.stride = stride
self.dilation = 1 self.dilation = dilation
self.pad = pad self.pad = pad
self.group = group self.group = group
self.group_channels = channels // group self.group_channels = channels // group
self.offset_scale = offset_scale self.offset_scale = offset_scale
self.center_feature_scale = center_feature_scale
self.dw_conv = nn.Sequential( self.dw_conv = nn.Sequential(
nn.Conv2d( nn.Conv2d(
channels, channels,
channels, channels,
kernel_size=kernel_size, kernel_size=dw_kernel_size,
stride=1, stride=1,
padding=(kernel_size-1)//2, padding=(dw_kernel_size - 1) // 2,
groups=channels), groups=channels),
build_norm_layer( build_norm_layer(
channels, channels,
...@@ -237,6 +288,13 @@ class DCNv3(nn.Module): ...@@ -237,6 +288,13 @@ class DCNv3(nn.Module):
self.input_proj = nn.Linear(channels, channels) self.input_proj = nn.Linear(channels, channels)
self.output_proj = nn.Linear(channels, channels) self.output_proj = nn.Linear(channels, channels)
self._reset_parameters() self._reset_parameters()
if center_feature_scale:
self.center_feature_scale_proj_weight = nn.Parameter(
torch.zeros((group, channels), dtype=torch.float))
self.center_feature_scale_proj_bias = nn.Parameter(
torch.tensor(0.0, dtype=torch.float).view((1,)).repeat(group, ))
self.center_feature_scale_module = CenterFeatureScaleModule()
def _reset_parameters(self): def _reset_parameters(self):
constant_(self.offset.weight.data, 0.) constant_(self.offset.weight.data, 0.)
...@@ -256,6 +314,7 @@ class DCNv3(nn.Module): ...@@ -256,6 +314,7 @@ class DCNv3(nn.Module):
N, H, W, _ = input.shape N, H, W, _ = input.shape
x = self.input_proj(input) x = self.input_proj(input)
x_proj = x
dtype = x.dtype dtype = x.dtype
x1 = input.permute(0, 3, 1, 2) x1 = input.permute(0, 3, 1, 2)
...@@ -273,6 +332,14 @@ class DCNv3(nn.Module): ...@@ -273,6 +332,14 @@ class DCNv3(nn.Module):
self.group, self.group_channels, self.group, self.group_channels,
self.offset_scale, self.offset_scale,
256) 256)
if self.center_feature_scale:
center_feature_scale = self.center_feature_scale_module(
x1, self.center_feature_scale_proj_weight, self.center_feature_scale_proj_bias)
# N, H, W, groups -> N, H, W, groups, 1 -> N, H, W, groups, _d_per_group -> N, H, W, channels
center_feature_scale = center_feature_scale[..., None].repeat(
1, 1, 1, 1, self.channels // self.group).flatten(-2)
x = x * (1 - center_feature_scale) + x_proj * center_feature_scale
x = self.output_proj(x) x = self.output_proj(x)
return x return x
...@@ -4,15 +4,6 @@ This folder contains the implementation of the InternImage for semantic segmenta
Our segmentation code is developed on top of [MMSegmentation v0.27.0](https://github.com/open-mmlab/mmsegmentation/tree/v0.27.0).
## Model Zoo
- [x] [ADE20K](configs/ade20k/)
- [x] [Cityscapes](configs/cityscapes/)
- [ ] COCO-Stuff-164K
- [ ] COCO-Stuff-10K
- [ ] Pascal Context
- [ ] NYU Depth V2
## Usage
### Install
...
# --------------------------------------------------------
# InternImage
# Copyright (c) 2022 OpenGVLab
# Licensed under The MIT License [see LICENSE for details]
# --------------------------------------------------------
_base_ = [
'../_base_/models/upernet_r50.py', '../_base_/datasets/ade20k.py',
'../_base_/default_runtime.py', '../_base_/schedules/schedule_160k.py'
]
pretrained = 'https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_g_pretrainto22k_384.pth'
model = dict(
backbone=dict(
_delete_=True,
type='InternImage',
core_op='DCNv3',
channels=512,
depths=[2, 2, 48, 4],
groups=[16, 32, 64, 128],
mlp_ratio=4.,
drop_path_rate=0.5,
norm_layer='LN',
layer_scale=None,
offset_scale=1.0,
post_norm=True,
dw_kernel_size=5, # for InternImage-H/G
res_post_norm=False, # for InternImage-H/G
level2_post_norm=True, # for InternImage-H/G
level2_post_norm_block_ids=[5, 11, 17, 23, 29, 35, 41, 47], # for InternImage-H/G
center_feature_scale=True, # for InternImage-H/G
with_cp=True,
out_indices=(0, 1, 2, 3),
init_cfg=dict(type='Pretrained', checkpoint=pretrained)),
decode_head=dict(num_classes=150, in_channels=[512, 1024, 2048, 4096]),
auxiliary_head=dict(num_classes=150, in_channels=2048),
test_cfg=dict(mode='whole'))
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
crop_size = (896, 896)
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations', reduce_zero_label=True),
dict(type='Resize', img_scale=(3584, 896), ratio_range=(0.5, 2.0)),
dict(type='RandomCrop', crop_size=crop_size, cat_max_ratio=0.75),
dict(type='RandomFlip', prob=0.5),
dict(type='PhotoMetricDistortion'),
dict(type='Normalize', **img_norm_cfg),
dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=255),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_semantic_seg']),
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(3584, 896),
# img_ratios=[0.5, 0.75, 1.0, 1.25, 1.5, 1.75],
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='ResizeToMultiple', size_divisor=32),
dict(type='RandomFlip'),
dict(type='Normalize', **img_norm_cfg),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img']),
])
]
optimizer = dict(
_delete_=True, type='AdamW', lr=0.00002, betas=(0.9, 0.999), weight_decay=0.05,
constructor='CustomLayerDecayOptimizerConstructor',
paramwise_cfg=dict(num_layers=56, layer_decay_rate=0.95,
depths=[2, 2, 48, 4], offset_lr_scale=1.0))
lr_config = dict(_delete_=True, policy='poly',
warmup='linear',
warmup_iters=1500,
warmup_ratio=1e-6,
power=1.0, min_lr=0.0, by_epoch=False)
# By default, models are trained on 16 GPUs with 1 image per GPU
data = dict(samples_per_gpu=1,
train=dict(pipeline=train_pipeline),
val=dict(pipeline=test_pipeline),
test=dict(pipeline=test_pipeline))
runner = dict(type='IterBasedRunner')
optimizer_config = dict(_delete_=True, grad_clip=dict(max_norm=0.1, norm_type=2))
checkpoint_config = dict(by_epoch=False, interval=1000, max_keep_ckpts=1)
evaluation = dict(interval=16000, metric='mIoU', save_best='mIoU')
# fp16 = dict(loss_scale=dict(init_scale=512))
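For completeness, a config like the one above is consumed through the standard MMSegmentation v0.x entry points. A minimal usage sketch, assuming MMSegmentation v0.27-style APIs (the config path below is hypothetical and should point at wherever this file is saved):

from mmcv import Config
from mmseg.models import build_segmentor

# Hypothetical path to the config defined above.
cfg = Config.fromfile(
    'segmentation/configs/ade20k/upernet_internimage_g_896_160k_ade20k.py')
model = build_segmentor(cfg.model)
model.init_weights()  # loads the pretrained InternImage-G backbone from init_cfg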
...@@ -7,7 +7,7 @@ _base_ = [
    '../_base_/models/upernet_r50.py', '../_base_/datasets/ade20k.py',
    '../_base_/default_runtime.py', '../_base_/schedules/schedule_160k.py'
]
pretrained = 'https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_h_jointto22k_384.pth'
model = dict(
    backbone=dict(
        _delete_=True,
...@@ -74,7 +74,7 @@ lr_config = dict(_delete_=True, policy='poly',
                 warmup_iters=1500,
                 warmup_ratio=1e-6,
                 power=1.0, min_lr=0.0, by_epoch=False)
# By default, models are trained on 16 GPUs with 1 image per GPU
data = dict(samples_per_gpu=1,
            train=dict(pipeline=train_pipeline),
            val=dict(pipeline=test_pipeline),
...
...@@ -36,11 +36,3 @@ Mapillary 80k + Cityscapes (w/ coarse data) 160k
|:--------------:|:----------:|:------------:|:-----------:|:-----------:|:-------:|:-----:|:-----:|:---------:|
| InternImage-L | 512x1024 | 85.16 / 85.67 | 0.37s / iter | 17h | 220M | 1580G | [config](./segformer_internimage_l_512x1024_160k_mapillary2cityscapes.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/segformer_internimage_l_512x1024_160k_mapillary2cityscapes.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/segformer_internimage_l_512x1024_160k_mapillary2cityscapes.log.json) |
| InternImage-XL | 512x1024 | 85.41 / 85.93 | 0.43s / iter | 19.5h | 330M | 2364G | [config](./segformer_internimage_xl_512x1024_160k_mapillary2cityscapes.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/segformer_internimage_xl_512x1024_160k_mapillary2cityscapes.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/segformer_internimage_xl_512x1024_160k_mapillary2cityscapes.log.json) |
### Mask2Former + InternImage (with additional data)
Mapillary 80k + Cityscapes (w/ coarse data) 80k
| backbone | resolution | mIoU (ss/ms) | train speed | train time | #params | FLOPs | Config | Download |
|:--------------:|:----------:|:------------:|:-----------:|:-----------:|:-------:|:-----:|:-----:|:---------:|
| InternImage-H | 1024x1024 | 86.37 / 86.96 | TODO | TODO | TODO | TODO | [config](./mask2former_internimage_h_1024x1024_80k_mapillary2cityscapes.py) | [ckpt]() \| [log]() |
...@@ -24,8 +24,3 @@ We first pretrain our models on the Mapillary Vistas dataset, then finetune them
| InternImage-L | 512x1024 | 80k | 0.37s / iter | 9h | 220M | 1580G | [config](./segformer_internimage_l_512x1024_80k_mapillary.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/segformer_internimage_l_512x1024_80k_mapillary.pth) |
| InternImage-XL | 512x1024 | 80k | 0.43s / iter | 10h | 330M | 2364G | [config](./segformer_internimage_xl_512x1024_80k_mapillary.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/segformer_internimage_xl_512x1024_80k_mapillary.pth) |
### Mask2Former + InternImage
| backbone | resolution | schd | train speed | train time | #params | FLOPs | Config | Download |
|:--------------:|:----------:|:------------:|:-----------:|:-----------:|:-------:|:-----:|:-----:|:---------:|
| InternImage-H | 1024x1024 | 80k | TODO | TODO | TODO | TODO | [config](./mask2former_internimage_h_1024x1024_80k_mapillary.py) | [ckpt]() |
import argparse
from collections import OrderedDict

import torch

# Convert every floating-point tensor in a checkpoint's state_dict to fp16,
# roughly halving its size on disk; the result is saved as <name>_fp16.pth.
parser = argparse.ArgumentParser(description='Convert a checkpoint to fp16')
parser.add_argument('filename', nargs='?', type=str, default=None)
args = parser.parse_args()

def convert_fp16(m):
    new_sd = OrderedDict()
    for k, v in m.items():
        # Keep integer buffers (e.g. BN's num_batches_tracked) untouched.
        new_sd[k] = v.half() if v.is_floating_point() else v
    return new_sd

model = torch.load(args.filename, map_location=torch.device('cpu'))['state_dict']
new_model = {"state_dict": convert_fp16(model)}
torch.save(new_model, args.filename.replace(".pth", "_fp16.pth"))
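A quick sanity check for the converted file (the checkpoint name here is hypothetical; use whatever the script above produced):

import torch

sd = torch.load('checkpoint_fp16.pth', map_location='cpu')['state_dict']
assert all(v.dtype == torch.float16
           for v in sd.values() if v.is_floating_point())
print('all floating-point tensors are fp16')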
import argparse
import math
from collections import OrderedDict

import torch

# Fold the learned per-head 'alpha_beta' scaling of the DCN sampling locations
# into the bias of 'sampling_offsets', so the converted checkpoint no longer
# needs the alpha_beta parameters; the result is saved as <name>_rmab.pth.
parser = argparse.ArgumentParser(description='Fold alpha_beta into the sampling-offset bias')
parser.add_argument('filename', nargs='?', type=str, default=None)
args = parser.parse_args()

def gen_grid(n_heads):
    # Integer offsets of a 3x3 kernel around its center, repeated for every
    # head: shape (n_heads, n_points, 2).
    n_points = 9
    points_list = []
    kernel_size = int(math.sqrt(n_points))
    y, x = torch.meshgrid(
        torch.linspace(
            (-kernel_size // 2 + 1),
            (kernel_size // 2), kernel_size,
            dtype=torch.float32),
        torch.linspace(
            (-kernel_size // 2 + 1),
            (kernel_size // 2), kernel_size,
            dtype=torch.float32))
    points_list.extend([y, x])
    grid = torch.stack(points_list, -1).reshape(-1, 1, 2).\
        repeat(1, n_heads, 1).permute(1, 0, 2)
    return grid

def remove_ab(m):
    new_sd = OrderedDict()
    n_points = 9
    for k, v in m.items():
        if 'alpha_beta' in k:
            # alpha_beta holds one (alpha, beta) pair per head; broadcast it
            # over all sampling points and shift the offset bias by
            # (alpha_beta - 1) * grid so the scaling is baked in.
            ab = v.repeat(1, n_points)
            h, _ = ab.size()
            offset_b = k.replace('alpha_beta', 'sampling_offsets.bias')
            ob = m[offset_b]
            grid = gen_grid(h).reshape(h, -1)
            delta = ((ab - 1) * grid).reshape(-1)
            new_sd[offset_b] = ob + delta
            continue
        if 'sampling_offsets.bias' in k:
            # Rewritten by the alpha_beta branch above; skip the original bias.
            continue
        new_sd[k] = v
    return new_sd

model = torch.load(args.filename, map_location=torch.device('cpu'))['state_dict']
new_model = {"state_dict": remove_ab(model)}
torch.save(new_model, args.filename.replace(".pth", "_rmab.pth"))
print("finished!")
import argparse
import math
from collections import OrderedDict

import torch

# Rename state_dict keys from the old attention-style naming to the new DCNv3
# operator naming (attn -> dcn, sampling_offsets -> offset, attention_weights
# -> mask, value_proj -> input_proj), insert the '.0.' indices that the new
# Sequential-wrapped norms expect, drop EMA weights, and cast tensors to fp16.
# The result is saved as <name>_rename.pth.
parser = argparse.ArgumentParser(description='Rename checkpoint keys for the new DCNv3 op')
parser.add_argument('filename', nargs='?', type=str, default=None)
args = parser.parse_args()

def gen_grid(n_heads):
    # Kernel-grid helper shared with the alpha_beta convertor; not used below.
    n_points = 9
    points_list = []
    kernel_size = int(math.sqrt(n_points))
    y, x = torch.meshgrid(
        torch.linspace((-kernel_size // 2 + 1), (kernel_size // 2),
                       kernel_size,
                       dtype=torch.float32),
        torch.linspace((-kernel_size // 2 + 1), (kernel_size // 2),
                       kernel_size,
                       dtype=torch.float32))
    points_list.extend([y, x])
    grid = torch.stack(points_list, -1).reshape(-1, 1, 2).\
        repeat(1, n_heads, 1).permute(1, 0, 2)
    return grid

def convert_to_newop(m):
    new_sd = OrderedDict()
    for k, v in m.items():
        new_k = k
        if 'attn' in k:
            new_k = new_k.replace('attn', 'dcn')
        if 'sampling_offsets' in k:
            new_k = new_k.replace('sampling_offsets', 'offset')
        if 'attention_weights' in k:
            new_k = new_k.replace('attention_weights', 'mask')
        if 'value_proj' in k:
            new_k = new_k.replace('value_proj', 'input_proj')
        if 'ema' in k:
            # EMA copies are dropped from the converted checkpoint.
            continue
        if ".norm1_k." in k:
            new_k = new_k.replace('.norm1_k.', '.norm1_k.0.')
        if ".norm1_q." in k:
            new_k = new_k.replace('.norm1_q.', '.norm1_q.0.')
        if ".norm1_v." in k:
            new_k = new_k.replace('.norm1_v.', '.norm1_v.0.')
        if ".post_norms." in k:
            new_k = new_k.replace('.bias', '.0.bias')
            new_k = new_k.replace('.weight', '.0.weight')
        if "fc_norm." in k:
            new_k = new_k.replace('fc_norm.', 'fc_norm.0.')
        # Note: the rename also converts the tensors to fp16.
        new_sd[new_k] = v.half()
    return new_sd

model = torch.load(args.filename, map_location=torch.device('cpu'))['state_dict']
new_model = {"state_dict": convert_to_newop(model)}
torch.save(new_model, args.filename.replace(".pth", "_rename.pth"))