Commit a8184dc3 authored by Zhe Chen, committed by Zhe Chen

Update README.md and release models (#44)

* Update README.md

* update configs

* clean code

* support InternImage-H/G
parent 6be127ee
.idea/
.DS_Store
classification/convertor/
segmentation/convertor/
@@ -93,7 +93,7 @@
**Segmentation Tasks**
<table border="1" width="90%">
<tr align="center">
<th colspan="3">Semantic Segmentation</th><th colspan="1">Street Segmentation</th><th colspan="1">RGBD Segmentation</th>
</tr>
<tr align="center">
<th>ADE20K</th><th>COCO Stuff-10K</th><th>Pascal Context</th><th>CityScapes</th><th>NYU Depth V2</th>
@@ -125,7 +125,7 @@
**Multimodal Tasks**
<table border="1" width="90%">
<tr align="center">
<th colspan="1">Image Captioning</th><th colspan="2">Fine-tuning Image-Text Retrieval</th><th colspan="1">Zero-shot Image-Text Retrieval</th>
</tr>
<tr align="center">
<th>COCO Caption</th><th>COCO Caption</th><th>Flickr30k</th><th>Flickr30k</th>
@@ -166,31 +166,31 @@
- 3D Perception: [BEVFormer](https://github.com/fundamentalvision/BEVFormer)
## Open-source Visual Pretrained Models
| name | pretrain | pre-training resolution | #param | download |
| :------------: | :--------: | :--------: | :-----: | :-----------------: |
| InternImage-L | ImageNet-22K | 384x384 | 223M | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_l_22k_192to384.pth) |
| InternImage-XL | ImageNet-22K | 384x384 | 335M | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_xl_22k_192to384.pth) |
| InternImage-H | Joint 427M | 384x384 | 1.08B | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_h_jointto22k_384.pth) |
| InternImage-G | - | 384x384 | 3B | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_g_pretrainto22k_384.pth) |
## ImageNet-1K Image Classification
| name | pretrain | resolution | acc@1 | #param | FLOPs | download |
| :------------: | :----------: | :--------: | :---: | :-----: | :---: | :-----------------: |
| InternImage-T | ImageNet-1K | 224x224 | 83.5 | 30M | 5G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_t_1k_224.pth) \| [cfg](classification/configs/internimage_t_1k_224.yaml) |
| InternImage-S | ImageNet-1K | 224x224 | 84.2 | 50M | 8G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_s_1k_224.pth) \| [cfg](classification/configs/internimage_s_1k_224.yaml) |
| InternImage-B | ImageNet-1K | 224x224 | 84.9 | 97M | 16G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_b_1k_224.pth) \| [cfg](classification/configs/internimage_b_1k_224.yaml) |
| InternImage-L | ImageNet-22K | 384x384 | 87.7 | 223M | 108G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_l_22kto1k_384.pth) \| [cfg](classification/configs/internimage_l_22kto1k_384.yaml) |
| InternImage-XL | ImageNet-22K | 384x384 | 88.0 | 335M | 163G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_xl_22kto1k_384.pth) \| [cfg](classification/configs/internimage_xl_22kto1k_384.yaml) |
| InternImage-H | Joint 427M | 640x640 | 89.6 | 1.08B | 1478G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_h_22kto1k_640.pth) \| [cfg](classification/configs/internimage_h_22kto1k_640.yaml) |
| InternImage-G | - | 512x512 | 90.1 | 3B | 2700G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_g_22kto1k_512.pth) \| [cfg](classification/configs/internimage_g_22kto1k_512.yaml) |
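The released checkpoints can be fetched directly from the URLs above; a minimal sketch for downloading one and counting its tensors (the Swin-style `{'model': state_dict, ...}` layout is an assumption, hence the fallback):

```python
# Minimal sketch: download a released checkpoint and inspect it on CPU.
# Assumption: the Swin-style {'model': state_dict, ...} checkpoint format.
import torch

url = ('https://huggingface.co/OpenGVLab/InternImage/resolve/main/'
       'internimage_t_1k_224.pth')
ckpt = torch.hub.load_state_dict_from_url(url, map_location='cpu')
state_dict = ckpt.get('model', ckpt)  # fall back to a bare state_dict
print(len(state_dict), 'parameter tensors')
```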
## COCO Object Detection and Instance Segmentation
| backbone | method | schd | box mAP | mask mAP | #param | FLOPs | download |
| :------------: | :----------------: | :---------: | :-----: | :------: | :-----: | :---: | :---: |
| InternImage-T | Mask R-CNN | 1x | 47.2 | 42.5 | 49M | 270G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_t_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_t_fpn_1x_coco.py) |
| InternImage-T | Mask R-CNN | 3x | 49.1 | 43.7 | 49M | 270G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_t_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_t_fpn_3x_coco.py) |
@@ -201,16 +201,16 @@
| InternImage-L | Cascade | 1x | 54.9 | 47.7 | 277M | 1399G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_l_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_l_fpn_1x_coco.py) |
| InternImage-L | Cascade | 3x | 56.1 | 48.5 | 277M | 1399G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_l_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_l_fpn_3x_coco.py) |
| InternImage-XL | Cascade | 1x | 55.3 | 48.1 | 387M | 1782G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_xl_fpn_1x_coco.py) |
| InternImage-XL | Cascade | 3x | 56.2 | 48.8 | 387M | 1782G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_xl_fpn_3x_coco.py) |
| backbone | method | box mAP (val/test) | #param | FLOPs | download |
| :------------: | :----------------: | :---------: | :------: | :-----: | :-----: |
| InternImage-H | DINO (TTA) | 65.0 / 65.4 | 2.18B | TODO | TODO |
| InternImage-G | DINO (TTA) | 65.3 / 65.5 | 3B | TODO | TODO |
## ADE20K Semantic Segmentation
| backbone | method | resolution | mIoU (ss/ms) | #param | FLOPs | download |
| :------------: | :--------: | :--------: | :----------: | :-----: | :---: | :---: |
| InternImage-T | UperNet | 512x512 | 47.9 / 48.1 | 59M | 944G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_t_512_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_t_512_160k_ade20k.py) |
| InternImage-S | UperNet | 512x512 | 50.1 / 50.9 | 80M | 1017G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_s_512_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_s_512_160k_ade20k.py) |
@@ -225,7 +225,7 @@
[TensorRT](classification/export.py)
| name | resolution | #param | FLOPs | batch 1 FPS (TensorRT) |
| :------------: | :--------: | :-----: | :---: | :-------------------: |
| InternImage-T | 224x224 | 30M | 5G | 156 |
| InternImage-S | 224x224 | 50M | 8G | 129 |
......
@@ -90,7 +90,7 @@ ADE20K, outperforming previous models by a large margin.
**Segmentation Task**
<table border="1" width="90%">
<tr align="center">
<th colspan="3">Semantic Segmentation</th><th colspan="1">Street Segmentation</th><th colspan="1">RGBD Segmentation</th>
</tr>
<tr align="center">
<th>ADE20K</th><th>COCO Stuff-10K</th><th>Pascal Context</th><th>CityScapes</th><th>NYU Depth V2</th>
@@ -122,7 +122,7 @@ ADE20K, outperforming previous models by a large margin.
**Multimodal Tasks**
<table border="1" width="90%">
<tr align="center">
<th colspan="1">Image Captioning</th><th colspan="2">Fine-tuning Image-Text Retrieval</th><th colspan="1">Zero-shot Image-Text Retrieval</th>
</tr>
<tr align="center">
<th>COCO Caption</th><th>COCO Caption</th><th>Flickr30k</th><th>Flickr30k</th>
@@ -147,13 +147,12 @@ InternImage, the visual backbone network of "INTERN-2.5", has a parameter size o
## Project Release
- [ ] Model for other downstream tasks
- [x] InternImage-H(1B)/G(3B)
- [x] TensorRT inference
- [x] Classification code of the InternImage series
- [x] InternImage-T/S/B/L/XL ImageNet-1K pretrained model
- [x] InternImage-L/XL ImageNet-22K pretrained model
- [x] InternImage-T/S/B/L/XL detection and instance segmentation model
- [x] InternImage-T/S/B/L/XL semantic segmentation model
@@ -165,25 +164,31 @@ tasks
- 3D Perception: [BEVFormer](https://github.com/fundamentalvision/BEVFormer)
## Open-source Visual Pretrained Models
| name | pretrain | pre-training resolution | #param | download |
| :------------: | :--------: | :--------: | :-----: | :-----------------: |
| InternImage-L | ImageNet-22K | 384x384 | 223M | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_l_22k_192to384.pth) |
| InternImage-XL | ImageNet-22K | 384x384 | 335M | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_xl_22k_192to384.pth) |
| InternImage-H | Joint 427M | 384x384 | 1.08B | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_h_jointto22k_384.pth) |
| InternImage-G | - | 384x384 | 3B | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_g_pretrainto22k_384.pth) |
## ImageNet-1K Image Classification
| name | pretrain | resolution | acc@1 | #param | FLOPs | download |
| :------------: | :----------: | :--------: | :---: | :-----: | :---: | :-----------------: |
| InternImage-T | ImageNet-1K | 224x224 | 83.5 | 30M | 5G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_t_1k_224.pth) \| [cfg](classification/configs/internimage_t_1k_224.yaml) |
| InternImage-S | ImageNet-1K | 224x224 | 84.2 | 50M | 8G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_s_1k_224.pth) \| [cfg](classification/configs/internimage_s_1k_224.yaml) |
| InternImage-B | ImageNet-1K | 224x224 | 84.9 | 97M | 16G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_b_1k_224.pth) \| [cfg](classification/configs/internimage_b_1k_224.yaml) |
| InternImage-L | ImageNet-22K | 384x384 | 87.7 | 223M | 108G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_l_22kto1k_384.pth) \| [cfg](classification/configs/internimage_l_22kto1k_384.yaml) |
| InternImage-XL | ImageNet-22K | 384x384 | 88.0 | 335M | 163G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_xl_22kto1k_384.pth) \| [cfg](classification/configs/internimage_xl_22kto1k_384.yaml) |
| InternImage-H | Joint 427M | 640x640 | 89.6 | 1.08B | 1478G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_h_22kto1k_640.pth) \| [cfg](classification/configs/internimage_h_22kto1k_640.yaml) |
| InternImage-G | - | 512x512 | 90.1 | 3B | 2700G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_g_22kto1k_512.pth) \| [cfg](classification/configs/internimage_g_22kto1k_512.yaml) |
## COCO Object Detection and Instance Segmentation
| backbone | method | schd | box mAP | mask mAP | #param | FLOPs | download |
| :------------: | :----------------: | :---------: | :-----: | :------: | :-----: | :---: | :---: |
| InternImage-T | Mask R-CNN | 1x | 47.2 | 42.5 | 49M | 270G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_t_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_t_fpn_1x_coco.py) |
| InternImage-T | Mask R-CNN | 3x | 49.1 | 43.7 | 49M | 270G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_t_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_t_fpn_3x_coco.py) |
@@ -194,17 +199,17 @@ tasks
| InternImage-L | Cascade | 1x | 54.9 | 47.7 | 277M | 1399G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_l_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_l_fpn_1x_coco.py) |
| InternImage-L | Cascade | 3x | 56.1 | 48.5 | 277M | 1399G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_l_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_l_fpn_3x_coco.py) |
| InternImage-XL | Cascade | 1x | 55.3 | 48.1 | 387M | 1782G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_xl_fpn_1x_coco.py) |
| InternImage-XL | Cascade | 3x | 56.2 | 48.8 | 387M | 1782G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_xl_fpn_3x_coco.py) |
| backbone | method | box mAP (val/test) | #param | FLOPs | download |
| :------------: | :----------------: | :---------: | :------: | :-----: | :---: |
| InternImage-H | DINO (TTA) | 65.0 / 65.4 | 2.18B | TODO | TODO |
| InternImage-G | DINO (TTA) | 65.3 / 65.5 | 3B | TODO | TODO |
## ADE20K Semantic Segmentation
| backbone | method | resolution | mIoU (ss/ms) | #param | FLOPs | download |
| :------------: | :--------: | :--------: | :----------: | :-----: | :---: | :---: |
| InternImage-T | UperNet | 512x512 | 47.9 / 48.1 | 59M | 944G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_t_512_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_t_512_160k_ade20k.py) |
| InternImage-S | UperNet | 512x512 | 50.1 / 50.9 | 80M | 1017G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_s_512_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_s_512_160k_ade20k.py) |
@@ -212,12 +217,14 @@ tasks
| InternImage-L | UperNet | 640x640 | 53.9 / 54.1 | 256M | 2526G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_l_640_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_l_640_160k_ade20k.py) |
| InternImage-XL | UperNet | 640x640 | 55.0 / 55.3 | 368M | 3142G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_xl_640_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_xl_640_160k_ade20k.py) |
| InternImage-H | UperNet | 896x896 | 59.9 / 60.3 | 1.12B | 3566G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_h_896_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_h_896_160k_ade20k.py) |
| InternImage-H | Mask2Former | 896x896 | 62.5 / 62.9 | 1.31B | 4635G | TODO |
## Main Results of FPS
[TensorRT](classification/export.py)
| name | resolution | #param | FLOPs | batch 1 FPS (TensorRT) |
| :------------: | :--------: | :-----: | :---: | :-------------------: |
| InternImage-T | 224x224 | 30M | 5G | 156 |
| InternImage-S | 224x224 | 50M | 8G | 129 |
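These FPS numbers are measured through the TensorRT path linked above. As a rough sketch of the usual first step of such a pipeline (ONNX export; a torchvision model stands in here, since the exact interface of `classification/export.py` is not shown in this diff and the real model routes DCNv3 through its own export logic):

```python
# Rough sketch of the ONNX-export step that typically precedes building a
# TensorRT engine. torchvision's resnet18 is a stand-in for InternImage;
# the repo's classification/export.py handles the real model itself.
import torch
import torchvision

model = torchvision.models.resnet18().eval()
dummy = torch.randn(1, 3, 224, 224)  # batch 1, matching the table above
torch.onnx.export(model, dummy, 'model.onnx', opset_version=11,
                  input_names=['input'], output_names=['logits'])
```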
......
@@ -51,7 +51,7 @@ sh ./make.sh
python test.py
```
### Data Preparation
We use the standard ImageNet dataset, which you can download from http://image-net.org/. We provide the following two ways to load data:
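For the folder-based option, a minimal sketch assuming the standard torchvision `ImageFolder` layout (`train/<class>/*.JPEG`, `val/<class>/*.JPEG`; paths are placeholders, and the repo's own dataloader adds its augmentation pipeline on top):

```python
# Minimal sketch of the folder-based loader (paths are illustrative).
import torchvision.datasets as datasets
import torchvision.transforms as T

transform = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])
train_set = datasets.ImageFolder('/path/to/imagenet/train', transform=transform)
img, label = train_set[0]
print(img.shape, label)  # torch.Size([3, 224, 224]) 0
```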
@@ -128,7 +128,7 @@ load data:
### Evaluation
To evaluate a pretrained `InternImage` on ImageNet val, run:
```bash
python -m torch.distributed.launch --nproc_per_node <num-of-gpus-to-use> --master_port 12345 main.py --eval \
@@ -142,7 +142,7 @@ python -m torch.distributed.launch --nproc_per_node 1 --master_port 12345 main.p
--cfg configs/internimage_b_1k_224.yaml --resume internimage_b_1k_224.pth --data-path <imagenet-path>
```
### Training from Scratch on ImageNet-1K
To train an `InternImage` on ImageNet from scratch, run:
@@ -151,7 +151,7 @@ python -m torch.distributed.launch --nproc_per_node <num-of-gpus-to-use> --maste
--cfg <config-file> --data-path <imagenet-path> [--batch-size <batch-size-per-gpu> --output <output-directory> --tag <job-tag>]
```
### Manage Jobs with Slurm
For example, to train `InternImage` with 8 GPUs on a single node for 300 epochs, run:
@@ -184,9 +184,9 @@ python -m torch.distributed.launch --nproc_per_node <num-of-gpus-to-use> --maste
--resume internimage_xl_22k_192to384.pth --eval
``` -->
<!-- ### Fine-tuning from an ImageNet-22K pretrained model
For example, to fine-tune an `InternImage-XL-22k` model pretrained on ImageNet-22K:
```bash
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_image_.yaml --pretrained intern_image_b.pth --eval
......
DATA:
  IMG_SIZE: 224
  IMG_ON_MEMORY: True
AUG:
  MIXUP: 0.0
  CUTMIX: 0.0
  REPROB: 0.0
MODEL:
  TYPE: intern_image
  DROP_PATH_RATE: 0.6
  LABEL_SMOOTHING: 0.3
  INTERN_IMAGE:
    CORE_OP: 'DCNv3'
    DEPTHS: [6, 6, 32, 6]
    GROUPS: [10, 20, 40, 80]
    CHANNELS: 320
    DW_KERNEL_SIZE: 5
    LAYER_SCALE: None
    OFFSET_SCALE: 1.0
    MLP_RATIO: 4.0
    POST_NORM: False
    RES_POST_NORM: True
    LEVEL2_POST_NORM: True
    LEVEL2_POST_NORM_BLOCK_IDS: [5, 11, 17, 23, 29]
    CENTER_FEATURE_SCALE: True
    USE_CLIP_PROJECTOR: True
TRAIN:
  EMA:
    ENABLE: true
    DECAY: 0.9998
  EPOCHS: 30
  WARMUP_EPOCHS: 0
  WEIGHT_DECAY: 1e-8
  BASE_LR: 3e-05 # 512
  WARMUP_LR: 3e-08
  MIN_LR: 3e-07
  LR_LAYER_DECAY: true
  LR_LAYER_DECAY_RATIO: 0.8
  RAND_INIT_FT_HEAD: true
  USE_CHECKPOINT: true
AMP_OPT_LEVEL: O0
EVAL_FREQ: 1
\ No newline at end of file
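This new config wires up the H/G-specific options (`DW_KERNEL_SIZE`, `RES_POST_NORM`, `LEVEL2_POST_NORM_BLOCK_IDS`, `CENTER_FEATURE_SCALE`). A minimal sketch of reading it with PyYAML; the path is a placeholder, and the repo itself merges such files into a yacs `CfgNode` through its own `config.py` rather than this snippet:

```python
# Minimal sketch: inspect the H/G options in the config above with PyYAML.
# The path is a placeholder; the repo's config.py does the real parsing.
import yaml

with open('path/to/this_config.yaml') as f:
    cfg = yaml.safe_load(f)

print(cfg['MODEL']['INTERN_IMAGE']['DEPTHS'])                # [6, 6, 32, 6]
print(cfg['MODEL']['INTERN_IMAGE']['DW_KERNEL_SIZE'])        # 5
print(cfg['MODEL']['INTERN_IMAGE']['CENTER_FEATURE_SCALE'])  # True
```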
@@ -74,7 +74,7 @@ def _is_power_of_2(n):
        raise ValueError(
            "invalid input for _is_power_of_2: {} (type: {})".format(n, type(n)))
    return (n & (n - 1) == 0) and n != 0
class CenterFeatureScaleModule(nn.Module):
@@ -86,7 +86,7 @@ class CenterFeatureScaleModule(nn.Module):
            weight=center_feature_scale_proj_weight,
            bias=center_feature_scale_proj_bias).sigmoid()
        return center_feature_scale
class DCNv3_pytorch(nn.Module):
    def __init__(
@@ -104,10 +104,10 @@ class DCNv3_pytorch(nn.Module):
            center_feature_scale=False):
        """
        DCNv3 Module
        :param channels
        :param kernel_size
        :param stride
        :param pad
        :param dilation
        :param group
        :param offset_scale
@@ -231,10 +231,10 @@ class DCNv3(nn.Module):
            center_feature_scale=False):
        """
        DCNv3 Module
        :param channels
        :param kernel_size
        :param stride
        :param pad
        :param dilation
        :param group
        :param offset_scale
......
@@ -54,7 +54,7 @@ sh ./make.sh
python test.py
```
### Data Preparation
Prepare COCO according to the guidelines in [MMDetection v2.28.1](https://github.com/open-mmlab/mmdetection/blob/master/docs/en/1_exist_data_model.md).
@@ -93,7 +93,7 @@ For example, to train `InternImage-T` with 8 GPU on 1 node, run:
sh dist_train.sh configs/coco/mask_rcnn_internimage_t_fpn_1x_coco.py 8
```
### Manage Jobs with Slurm
For example, to train `InternImage-L` with 32 GPUs on 4 nodes, run:
......
@@ -36,7 +36,7 @@ Based on community feedback, in 2017 the training/validation split was changed f
| InternImage-L | 1x | 54.9 | 47.7 | 0.73s / iter | 18h | 277M | 1399G | [config](./cascade_internimage_l_fpn_1x_coco.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_l_fpn_1x_coco.pth) |
| InternImage-L | 3x | 56.1 | 48.5 | 0.79s / iter | 15h (4n) | 277M | 1399G | [config](./cascade_internimage_l_fpn_3x_coco.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_l_fpn_3x_coco.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_l_fpn_3x_coco.log.json) |
| InternImage-XL | 1x | 55.3 | 48.1 | 0.82s / iter | 21h | 387M | 1782G | [config](./cascade_internimage_xl_fpn_1x_coco.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_1x_coco.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_1x_coco.log.json) |
| InternImage-XL | 3x | 56.2 | 48.8 | 0.91s / iter | 17h (4n) | 387M | 1782G | [config](./cascade_internimage_xl_fpn_3x_coco.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_3x_coco.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_3x_coco.log.json) |
- Training speed is measured on A100 GPUs with the current code and may be faster than the speed recorded in the logs.
- Some logs come from recently retrained models, so the results in the logs may differ slightly from those in the paper.
......
@@ -13,9 +13,11 @@ from mmcv.runner import _load_checkpoint
from mmcv.cnn import constant_init, trunc_normal_init
from mmdet.utils import get_root_logger
from mmdet.models.builder import BACKBONES
import torch.nn.functional as F
from ops_dcnv3 import modules as opsm
class to_channels_first(nn.Module):
    def __init__(self):
@@ -69,6 +71,171 @@ def build_act_layer(act_layer):
    raise NotImplementedError(f'build_act_layer does not support {act_layer}')
class CrossAttention(nn.Module):
r""" Cross Attention Module
Args:
dim (int): Number of input channels.
num_heads (int): Number of attention heads. Default: 8
qkv_bias (bool, optional): If True, add a learnable bias to q, k, v.
Default: False.
qk_scale (float | None, optional): Override default qk scale of
head_dim ** -0.5 if set. Default: None.
attn_drop (float, optional): Dropout ratio of attention weight.
Default: 0.0
proj_drop (float, optional): Dropout ratio of output. Default: 0.0
attn_head_dim (int, optional): Dimension of attention head.
out_dim (int, optional): Dimension of output.
"""
def __init__(self,
dim,
num_heads=8,
qkv_bias=False,
qk_scale=None,
attn_drop=0.,
proj_drop=0.,
attn_head_dim=None,
out_dim=None):
super().__init__()
if out_dim is None:
out_dim = dim
self.num_heads = num_heads
head_dim = dim // num_heads
if attn_head_dim is not None:
head_dim = attn_head_dim
all_head_dim = head_dim * self.num_heads
self.scale = qk_scale or head_dim ** -0.5
assert all_head_dim == dim
self.q = nn.Linear(dim, all_head_dim, bias=False)
self.k = nn.Linear(dim, all_head_dim, bias=False)
self.v = nn.Linear(dim, all_head_dim, bias=False)
if qkv_bias:
self.q_bias = nn.Parameter(torch.zeros(all_head_dim))
self.k_bias = nn.Parameter(torch.zeros(all_head_dim))
self.v_bias = nn.Parameter(torch.zeros(all_head_dim))
else:
self.q_bias = None
self.k_bias = None
self.v_bias = None
self.attn_drop = nn.Dropout(attn_drop)
self.proj = nn.Linear(all_head_dim, out_dim)
self.proj_drop = nn.Dropout(proj_drop)
def forward(self, x, k=None, v=None):
B, N, C = x.shape
N_k = k.shape[1]
N_v = v.shape[1]
q_bias, k_bias, v_bias = None, None, None
if self.q_bias is not None:
q_bias = self.q_bias
k_bias = self.k_bias
v_bias = self.v_bias
q = F.linear(input=x, weight=self.q.weight, bias=q_bias)
q = q.reshape(B, N, 1, self.num_heads,
-1).permute(2, 0, 3, 1,
4).squeeze(0) # (B, N_head, N_q, dim)
k = F.linear(input=k, weight=self.k.weight, bias=k_bias)
k = k.reshape(B, N_k, 1, self.num_heads, -1).permute(2, 0, 3, 1,
4).squeeze(0)
v = F.linear(input=v, weight=self.v.weight, bias=v_bias)
v = v.reshape(B, N_v, 1, self.num_heads, -1).permute(2, 0, 3, 1,
4).squeeze(0)
q = q * self.scale
attn = (q @ k.transpose(-2, -1)) # (B, N_head, N_q, N_k)
attn = attn.softmax(dim=-1)
attn = self.attn_drop(attn)
x = (attn @ v).transpose(1, 2).reshape(B, N, -1)
x = self.proj(x)
x = self.proj_drop(x)
return x
class AttentiveBlock(nn.Module):
r"""Attentive Block
Args:
dim (int): Number of input channels.
num_heads (int): Number of attention heads. Default: 8
qkv_bias (bool, optional): If True, add a learnable bias to q, k, v.
Default: False.
qk_scale (float | None, optional): Override default qk scale of
head_dim ** -0.5 if set. Default: None.
drop (float, optional): Dropout rate. Default: 0.0.
attn_drop (float, optional): Attention dropout rate. Default: 0.0.
drop_path (float | tuple[float], optional): Stochastic depth rate.
Default: 0.0.
norm_layer (nn.Module, optional): Normalization layer. Default: nn.LayerNorm.
attn_head_dim (int, optional): Dimension of attention head. Default: None.
out_dim (int, optional): Dimension of output. Default: None.
"""
def __init__(self,
dim,
num_heads,
qkv_bias=False,
qk_scale=None,
drop=0.,
attn_drop=0.,
drop_path=0.,
norm_layer="LN",
attn_head_dim=None,
out_dim=None):
super().__init__()
self.norm1_q = build_norm_layer(dim, norm_layer, eps=1e-6)
self.norm1_k = build_norm_layer(dim, norm_layer, eps=1e-6)
self.norm1_v = build_norm_layer(dim, norm_layer, eps=1e-6)
self.cross_dcn = CrossAttention(dim,
num_heads=num_heads,
qkv_bias=qkv_bias,
qk_scale=qk_scale,
attn_drop=attn_drop,
proj_drop=drop,
attn_head_dim=attn_head_dim,
out_dim=out_dim)
self.drop_path = DropPath(
drop_path) if drop_path > 0. else nn.Identity()
def forward(self,
x_q,
x_kv,
pos_q,
pos_k,
bool_masked_pos,
rel_pos_bias=None):
x_q = self.norm1_q(x_q + pos_q)
x_k = self.norm1_k(x_kv + pos_k)
x_v = self.norm1_v(x_kv)
x = self.cross_dcn(x_q, k=x_k, v=x_v)
return x
class AttentionPoolingBlock(AttentiveBlock):
def forward(self, x):
x_q = x.mean(1, keepdim=True)
x_kv = x
pos_q, pos_k = 0, 0
x = super().forward(x_q, x_kv, pos_q, pos_k,
bool_masked_pos=None,
rel_pos_bias=None)
x = x.squeeze(1)
return x
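def _cross_attention_shape_check():
    # Illustrative sketch, not part of the original diff: a standalone shape
    # check for the new CrossAttention module (torch is imported at the top of
    # this file). AttentionPoolingBlock drives it the same way, with a single
    # mean-pooled query attending over all spatial tokens.
    attn = CrossAttention(dim=64, num_heads=8, qkv_bias=True, out_dim=32)
    q = torch.randn(2, 1, 64)     # one pooled query per image
    kv = torch.randn(2, 196, 64)  # e.g. a flattened 14x14 feature map
    out = attn(q, k=kv, v=kv)
    assert out.shape == (2, 1, 32)  # projected to out_dim channels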
class StemLayer(nn.Module):
    r""" Stem layer of InternImage
    Args:
@@ -195,7 +362,10 @@ class InternImageLayer(nn.Module):
                 post_norm=False,
                 layer_scale=None,
                 offset_scale=1.0,
                 with_cp=False,
dw_kernel_size=None, # for InternImage-H/G
res_post_norm=False, # for InternImage-H/G
center_feature_scale=False): # for InternImage-H/G
        super().__init__()
        self.channels = channels
        self.groups = groups
@@ -204,15 +374,18 @@ class InternImageLayer(nn.Module):
        self.norm1 = build_norm_layer(channels, 'LN')
        self.post_norm = post_norm
        self.dcn = core_op(
            channels=channels,
            kernel_size=3,
            stride=1,
            pad=1,
            dilation=1,
            group=groups,
            offset_scale=offset_scale,
            act_layer=act_layer,
norm_layer=norm_layer,
dw_kernel_size=dw_kernel_size, # for InternImage-H/G
center_feature_scale=center_feature_scale) # for InternImage-H/G
        self.drop_path = DropPath(drop_path) if drop_path > 0. \
            else nn.Identity()
        self.norm2 = build_norm_layer(channels, 'LN')
@@ -226,6 +399,10 @@ class InternImageLayer(nn.Module):
                                       requires_grad=True)
            self.gamma2 = nn.Parameter(layer_scale * torch.ones(channels),
                                       requires_grad=True)
self.res_post_norm = res_post_norm
if res_post_norm:
self.res_post_norm1 = build_norm_layer(channels, 'LN')
self.res_post_norm2 = build_norm_layer(channels, 'LN')
    def forward(self, x):
@@ -234,6 +411,9 @@ class InternImageLayer(nn.Module):
        if self.post_norm:
            x = x + self.drop_path(self.norm1(self.dcn(x)))
            x = x + self.drop_path(self.norm2(self.mlp(x)))
elif self.res_post_norm: # for InternImage-H/G
x = x + self.drop_path(self.res_post_norm1(self.dcn(self.norm1(x))))
x = x + self.drop_path(self.res_post_norm2(self.mlp(self.norm2(x))))
        else:
            x = x + self.drop_path(self.dcn(self.norm1(x)))
            x = x + self.drop_path(self.mlp(self.norm2(x)))
@@ -285,36 +465,54 @@ class InternImageBlock(nn.Module):
                 post_norm=False,
                 offset_scale=1.0,
                 layer_scale=None,
                 with_cp=False,
dw_kernel_size=None, # for InternImage-H/G
post_norm_block_ids=None, # for InternImage-H/G
res_post_norm=False, # for InternImage-H/G
center_feature_scale=False): # for InternImage-H/G
        super().__init__()
        self.channels = channels
        self.depth = depth
        self.post_norm = post_norm
self.center_feature_scale = center_feature_scale
        self.blocks = nn.ModuleList([
            InternImageLayer(
                core_op=core_op,
                channels=channels,
                groups=groups,
                mlp_ratio=mlp_ratio,
                drop=drop,
                drop_path=drop_path[i] if isinstance(
                    drop_path, list) else drop_path,
                act_layer=act_layer,
                norm_layer=norm_layer,
                post_norm=post_norm,
                layer_scale=layer_scale,
                offset_scale=offset_scale,
with_cp=with_cp,
dw_kernel_size=dw_kernel_size, # for InternImage-H/G
res_post_norm=res_post_norm, # for InternImage-H/G
center_feature_scale=center_feature_scale # for InternImage-H/G
) for i in range(depth)
        ])
        if not self.post_norm or center_feature_scale:
            self.norm = build_norm_layer(channels, 'LN')
self.post_norm_block_ids = post_norm_block_ids
if post_norm_block_ids is not None: # for InternImage-H/G
self.post_norms = nn.ModuleList(
[build_norm_layer(channels, 'LN', eps=1e-6) for _ in post_norm_block_ids]
)
        self.downsample = DownsampleLayer(
            channels=channels, norm_layer=norm_layer) if downsample else None
    def forward(self, x, return_wo_downsample=False):
        for i, blk in enumerate(self.blocks):
            x = blk(x)
            if (self.post_norm_block_ids is not None) and (i in self.post_norm_block_ids):
index = self.post_norm_block_ids.index(i)
x = self.post_norms[index](x) # for InternImage-H/G
if not self.post_norm or self.center_feature_scale:
            x = self.norm(x)
        if return_wo_downsample:
            x_ = x
@@ -344,6 +542,11 @@ class InternImage(nn.Module):
        layer_scale (bool): Whether to use layer scale. Default: False
        cls_scale (bool): Whether to use class scale. Default: False
        with_cp (bool): Use checkpoint or not. Using checkpoint will save some
dw_kernel_size (int): Size of the dwconv. Default: None
level2_post_norm (bool): Whether to use level2 post norm. Default: False
level2_post_norm_block_ids (list): Indexes of post norm blocks. Default: None
res_post_norm (bool): Whether to use res post norm. Default: False
center_feature_scale (bool): Whether to use center feature scale. Default: False
""" """
def __init__(self, def __init__(self,
...@@ -361,6 +564,11 @@ class InternImage(nn.Module): ...@@ -361,6 +564,11 @@ class InternImage(nn.Module):
offset_scale=1.0, offset_scale=1.0,
post_norm=False, post_norm=False,
with_cp=False, with_cp=False,
dw_kernel_size=None, # for InternImage-H/G
level2_post_norm=False, # for InternImage-H/G
level2_post_norm_block_ids=None, # for InternImage-H/G
res_post_norm=False, # for InternImage-H/G
center_feature_scale=False, # for InternImage-H/G
                 out_indices=(0, 1, 2, 3),
                 init_cfg=None,
                 **kwargs):
@@ -374,10 +582,15 @@ class InternImage(nn.Module):
        self.mlp_ratio = mlp_ratio
        self.init_cfg = init_cfg
        self.out_indices = out_indices
        self.level2_post_norm_block_ids = level2_post_norm_block_ids
        logger = get_root_logger()
        logger.info(f'using core type: {core_op}')
        logger.info(f'using activation layer: {act_layer}')
logger.info(f'using main norm layer: {norm_layer}')
logger.info(f'using dpr: {drop_path_type}, {drop_path_rate}')
logger.info(f"level2_post_norm: {level2_post_norm}")
logger.info(f"level2_post_norm_block_ids: {level2_post_norm_block_ids}")
logger.info(f"res_post_norm: {res_post_norm}")
        in_chans = 3
        self.patch_embed = StemLayer(in_chans=in_chans,
@@ -395,6 +608,8 @@ class InternImage(nn.Module):
        self.levels = nn.ModuleList()
        for i in range(self.num_levels):
post_norm_block_ids = level2_post_norm_block_ids if level2_post_norm and (
i == 2) else None # for InternImage-H/G
            level = InternImageBlock(
                core_op=getattr(opsm, core_op),
                channels=int(channels * 2**i),
@@ -409,7 +624,12 @@ class InternImage(nn.Module):
                downsample=(i < self.num_levels - 1),
                layer_scale=layer_scale,
                offset_scale=offset_scale,
                with_cp=with_cp,
dw_kernel_size=dw_kernel_size, # for InternImage-H/G
post_norm_block_ids=post_norm_block_ids, # for InternImage-H/G
res_post_norm=res_post_norm, # for InternImage-H/G
center_feature_scale=center_feature_scale # for InternImage-H/G
)
            self.levels.append(level)
        self.num_layers = len(depths)
......
@@ -9,6 +9,7 @@ from __future__ import print_function
from __future__ import division
import warnings
import torch
from torch import nn
import torch.nn.functional as F
from torch.nn.init import xavier_uniform_, constant_
@@ -73,20 +74,40 @@ def _is_power_of_2(n):
        raise ValueError(
            "invalid input for _is_power_of_2: {} (type: {})".format(n, type(n)))
    return (n & (n - 1) == 0) and n != 0
class CenterFeatureScaleModule(nn.Module):
def forward(self,
query,
center_feature_scale_proj_weight,
center_feature_scale_proj_bias):
center_feature_scale = F.linear(query,
weight=center_feature_scale_proj_weight,
bias=center_feature_scale_proj_bias).sigmoid()
return center_feature_scale
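def _center_feature_scale_gate_demo():
    # Illustrative sketch, not part of the original diff: the module returns
    # one sigmoid gate per group. With the zero-initialized projection that
    # DCNv3 creates below, every gate starts at sigmoid(0) = 0.5.
    module = CenterFeatureScaleModule()
    query = torch.randn(2, 14, 14, 64)  # (N, H, W, channels)
    weight = torch.zeros(4, 64)         # (group, channels)
    bias = torch.zeros(4)               # (group,)
    gate = module(query, weight, bias)
    assert gate.shape == (2, 14, 14, 4)  # one scalar gate per group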
class DCNv3_pytorch(nn.Module):
    def __init__(
            self,
            channels=64,
            kernel_size=3,
dw_kernel_size=None,
stride=1,
pad=1,
dilation=1,
group=4,
offset_scale=1.0,
act_layer='GELU',
norm_layer='LN',
center_feature_scale=False):
""" """
DCNv3 Module DCNv3 Module
:param channels :param channels
:param kernel_size :param kernel_size
:param stride :param stride
:param pad :param pad
:param dilation :param dilation
:param group :param group
:param offset_scale :param offset_scale
...@@ -98,6 +119,7 @@ class DCNv3_pytorch(nn.Module): ...@@ -98,6 +119,7 @@ class DCNv3_pytorch(nn.Module):
raise ValueError( raise ValueError(
f'channels must be divisible by group, but got {channels} and {group}') f'channels must be divisible by group, but got {channels} and {group}')
_d_per_group = channels // group _d_per_group = channels // group
dw_kernel_size = dw_kernel_size if dw_kernel_size is not None else kernel_size
        # you'd better set _d_per_group to a power of 2 which is more efficient in our CUDA implementation
        if not _is_power_of_2(_d_per_group):
            warnings.warn(
@@ -107,20 +129,22 @@ class DCNv3_pytorch(nn.Module):
        self.offset_scale = offset_scale
        self.channels = channels
        self.kernel_size = kernel_size
self.dw_kernel_size = dw_kernel_size
        self.stride = stride
        self.dilation = dilation
        self.pad = pad
        self.group = group
        self.group_channels = channels // group
        self.offset_scale = offset_scale
self.center_feature_scale = center_feature_scale
        self.dw_conv = nn.Sequential(
            nn.Conv2d(
                channels,
                channels,
                kernel_size=dw_kernel_size,
                stride=1,
                padding=(dw_kernel_size - 1) // 2,
                groups=channels),
            build_norm_layer(
                channels,
@@ -137,6 +161,13 @@ class DCNv3_pytorch(nn.Module):
        self.input_proj = nn.Linear(channels, channels)
        self.output_proj = nn.Linear(channels, channels)
        self._reset_parameters()
if center_feature_scale:
self.center_feature_scale_proj_weight = nn.Parameter(
torch.zeros((group, channels), dtype=torch.float))
self.center_feature_scale_proj_bias = nn.Parameter(
torch.tensor(0.0, dtype=torch.float).view((1,)).repeat(group, ))
self.center_feature_scale_module = CenterFeatureScaleModule()
    def _reset_parameters(self):
        constant_(self.offset.weight.data, 0.)
@@ -156,6 +187,7 @@ class DCNv3_pytorch(nn.Module):
        N, H, W, _ = input.shape
        x = self.input_proj(input)
x_proj = x
        x1 = input.permute(0, 3, 1, 2)
        x1 = self.dw_conv(x1)
@@ -171,6 +203,13 @@ class DCNv3_pytorch(nn.Module):
            self.dilation, self.dilation,
            self.group, self.group_channels,
            self.offset_scale)
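        # Center-feature-scale path (InternImage-H/G): compute a per-group
        # sigmoid gate from the depthwise features x1 and blend the DCN output
        # with the identity projection x_proj saved above.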
if self.center_feature_scale:
center_feature_scale = self.center_feature_scale_module(
x1, self.center_feature_scale_proj_weight, self.center_feature_scale_proj_bias)
# N, H, W, groups -> N, H, W, groups, 1 -> N, H, W, groups, _d_per_group -> N, H, W, channels
center_feature_scale = center_feature_scale[..., None].repeat(
1, 1, 1, 1, self.channels // self.group).flatten(-2)
x = x * (1 - center_feature_scale) + x_proj * center_feature_scale
        x = self.output_proj(x)
        return x
@@ -178,15 +217,24 @@ class DCNv3_pytorch(nn.Module):
class DCNv3(nn.Module):
    def __init__(
            self,
            channels=64,
            kernel_size=3,
dw_kernel_size=None,
stride=1,
pad=1,
dilation=1,
group=4,
offset_scale=1.0,
act_layer='GELU',
norm_layer='LN',
center_feature_scale=False):
""" """
DCNv3 Module DCNv3 Module
:param channels :param channels
:param kernel_size :param kernel_size
:param stride :param stride
:param pad :param pad
:param dilation :param dilation
:param group :param group
:param offset_scale :param offset_scale
...@@ -198,6 +246,7 @@ class DCNv3(nn.Module): ...@@ -198,6 +246,7 @@ class DCNv3(nn.Module):
raise ValueError( raise ValueError(
f'channels must be divisible by group, but got {channels} and {group}') f'channels must be divisible by group, but got {channels} and {group}')
_d_per_group = channels // group _d_per_group = channels // group
dw_kernel_size = dw_kernel_size if dw_kernel_size is not None else kernel_size
# you'd better set _d_per_group to a power of 2 which is more efficient in our CUDA implementation # you'd better set _d_per_group to a power of 2 which is more efficient in our CUDA implementation
if not _is_power_of_2(_d_per_group): if not _is_power_of_2(_d_per_group):
warnings.warn( warnings.warn(
...@@ -207,20 +256,22 @@ class DCNv3(nn.Module): ...@@ -207,20 +256,22 @@ class DCNv3(nn.Module):
self.offset_scale = offset_scale self.offset_scale = offset_scale
self.channels = channels self.channels = channels
self.kernel_size = kernel_size self.kernel_size = kernel_size
self.dw_kernel_size = dw_kernel_size
self.stride = stride self.stride = stride
self.dilation = 1 self.dilation = dilation
self.pad = pad self.pad = pad
self.group = group self.group = group
self.group_channels = channels // group self.group_channels = channels // group
self.offset_scale = offset_scale self.offset_scale = offset_scale
self.center_feature_scale = center_feature_scale
self.dw_conv = nn.Sequential( self.dw_conv = nn.Sequential(
nn.Conv2d( nn.Conv2d(
channels, channels,
channels, channels,
kernel_size=kernel_size, kernel_size=dw_kernel_size,
stride=1, stride=1,
padding=(kernel_size-1)//2, padding=(dw_kernel_size - 1) // 2,
groups=channels), groups=channels),
build_norm_layer( build_norm_layer(
channels, channels,
...@@ -237,6 +288,13 @@ class DCNv3(nn.Module): ...@@ -237,6 +288,13 @@ class DCNv3(nn.Module):
self.input_proj = nn.Linear(channels, channels) self.input_proj = nn.Linear(channels, channels)
self.output_proj = nn.Linear(channels, channels) self.output_proj = nn.Linear(channels, channels)
self._reset_parameters() self._reset_parameters()
if center_feature_scale:
self.center_feature_scale_proj_weight = nn.Parameter(
torch.zeros((group, channels), dtype=torch.float))
self.center_feature_scale_proj_bias = nn.Parameter(
torch.tensor(0.0, dtype=torch.float).view((1,)).repeat(group, ))
self.center_feature_scale_module = CenterFeatureScaleModule()
def _reset_parameters(self): def _reset_parameters(self):
constant_(self.offset.weight.data, 0.) constant_(self.offset.weight.data, 0.)
...@@ -256,6 +314,7 @@ class DCNv3(nn.Module): ...@@ -256,6 +314,7 @@ class DCNv3(nn.Module):
N, H, W, _ = input.shape N, H, W, _ = input.shape
x = self.input_proj(input) x = self.input_proj(input)
x_proj = x
dtype = x.dtype dtype = x.dtype
x1 = input.permute(0, 3, 1, 2) x1 = input.permute(0, 3, 1, 2)
...@@ -273,6 +332,14 @@ class DCNv3(nn.Module): ...@@ -273,6 +332,14 @@ class DCNv3(nn.Module):
self.group, self.group_channels, self.group, self.group_channels,
self.offset_scale, self.offset_scale,
256) 256)
if self.center_feature_scale:
center_feature_scale = self.center_feature_scale_module(
x1, self.center_feature_scale_proj_weight, self.center_feature_scale_proj_bias)
# N, H, W, groups -> N, H, W, groups, 1 -> N, H, W, groups, _d_per_group -> N, H, W, channels
center_feature_scale = center_feature_scale[..., None].repeat(
1, 1, 1, 1, self.channels // self.group).flatten(-2)
x = x * (1 - center_feature_scale) + x_proj * center_feature_scale
x = self.output_proj(x) x = self.output_proj(x)
return x return x
...@@ -4,15 +4,6 @@ This folder contains the implementation of the InternImage for semantic segmenta
Our segmentation code is developed on top of [MMSegmentation v0.27.0](https://github.com/open-mmlab/mmsegmentation/tree/v0.27.0).
## Model Zoo
- [x] [ADE20K](configs/ade20k/)
- [x] [Cityscapes](configs/cityscapes/)
- [ ] COCO-Stuff-164K
- [ ] COCO-Stuff-10K
- [ ] Pascal Context
- [ ] NYU Depth V2
## Usage
### Install
...
# --------------------------------------------------------
# InternImage
# Copyright (c) 2022 OpenGVLab
# Licensed under The MIT License [see LICENSE for details]
# --------------------------------------------------------
_base_ = [
'../_base_/models/upernet_r50.py', '../_base_/datasets/ade20k.py',
'../_base_/default_runtime.py', '../_base_/schedules/schedule_160k.py'
]
pretrained = 'https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_g_pretrainto22k_384.pth'
model = dict(
backbone=dict(
_delete_=True,
type='InternImage',
core_op='DCNv3',
channels=512,
depths=[2, 2, 48, 4],
groups=[16, 32, 64, 128],
mlp_ratio=4.,
drop_path_rate=0.5,
norm_layer='LN',
layer_scale=None,
offset_scale=1.0,
post_norm=True,
dw_kernel_size=5, # for InternImage-H/G
res_post_norm=False, # for InternImage-H/G
level2_post_norm=True, # for InternImage-H/G
level2_post_norm_block_ids=[5, 11, 17, 23, 29, 35, 41, 47], # for InternImage-H/G
center_feature_scale=True, # for InternImage-H/G
with_cp=True,
out_indices=(0, 1, 2, 3),
init_cfg=dict(type='Pretrained', checkpoint=pretrained)),
decode_head=dict(num_classes=150, in_channels=[512, 1024, 2048, 4096]),
auxiliary_head=dict(num_classes=150, in_channels=2048),
test_cfg=dict(mode='whole'))
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
crop_size = (896, 896)
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations', reduce_zero_label=True),
dict(type='Resize', img_scale=(3584, 896), ratio_range=(0.5, 2.0)),
dict(type='RandomCrop', crop_size=crop_size, cat_max_ratio=0.75),
dict(type='RandomFlip', prob=0.5),
dict(type='PhotoMetricDistortion'),
dict(type='Normalize', **img_norm_cfg),
dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=255),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_semantic_seg']),
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(3584, 896),
# img_ratios=[0.5, 0.75, 1.0, 1.25, 1.5, 1.75],
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='ResizeToMultiple', size_divisor=32),
dict(type='RandomFlip'),
dict(type='Normalize', **img_norm_cfg),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img']),
])
]
optimizer = dict(
_delete_=True, type='AdamW', lr=0.00002, betas=(0.9, 0.999), weight_decay=0.05,
constructor='CustomLayerDecayOptimizerConstructor',
paramwise_cfg=dict(num_layers=56, layer_decay_rate=0.95,
depths=[2, 2, 48, 4], offset_lr_scale=1.0))
lr_config = dict(_delete_=True, policy='poly',
warmup='linear',
warmup_iters=1500,
warmup_ratio=1e-6,
power=1.0, min_lr=0.0, by_epoch=False)
# By default, models are trained on 16 GPUs with 1 image per GPU
data = dict(samples_per_gpu=1,
train=dict(pipeline=train_pipeline),
val=dict(pipeline=test_pipeline),
test=dict(pipeline=test_pipeline))
runner = dict(type='IterBasedRunner')
optimizer_config = dict(_delete_=True, grad_clip=dict(max_norm=0.1, norm_type=2))
checkpoint_config = dict(by_epoch=False, interval=1000, max_keep_ckpts=1)
evaluation = dict(interval=16000, metric='mIoU', save_best='mIoU')
# fp16 = dict(loss_scale=dict(init_scale=512))
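For completeness, a config like the one above is consumed through the standard MMSegmentation v0.x entry points. A minimal usage sketch, assuming MMSegmentation v0.27-style APIs (the config path below is hypothetical and should point at wherever this file is saved):

from mmcv import Config
from mmseg.models import build_segmentor

# Hypothetical path to the config defined above.
cfg = Config.fromfile(
    'segmentation/configs/ade20k/upernet_internimage_g_896_160k_ade20k.py')
model = build_segmentor(cfg.model)
model.init_weights()  # loads the pretrained InternImage-G backbone from init_cfg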
...@@ -7,7 +7,7 @@ _base_ = [
    '../_base_/models/upernet_r50.py', '../_base_/datasets/ade20k.py',
    '../_base_/default_runtime.py', '../_base_/schedules/schedule_160k.py'
]
pretrained = 'https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_h_jointto22k_384.pth'
model = dict(
    backbone=dict(
        _delete_=True,
...@@ -74,7 +74,7 @@ lr_config = dict(_delete_=True, policy='poly',
                 warmup_iters=1500,
                 warmup_ratio=1e-6,
                 power=1.0, min_lr=0.0, by_epoch=False)
# By default, models are trained on 16 GPUs with 1 image per GPU
data = dict(samples_per_gpu=1,
            train=dict(pipeline=train_pipeline),
            val=dict(pipeline=test_pipeline),
...
...@@ -36,11 +36,3 @@ Mapillary 80k + Cityscapes (w/ coarse data) 160k
|:--------------:|:----------:|:------------:|:-----------:|:-----------:|:-------:|:-----:|:-----:|:---------:|
| InternImage-L | 512x1024 | 85.16 / 85.67 | 0.37s / iter | 17h | 220M | 1580G | [config](./segformer_internimage_l_512x1024_160k_mapillary2cityscapes.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/segformer_internimage_l_512x1024_160k_mapillary2cityscapes.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/segformer_internimage_l_512x1024_160k_mapillary2cityscapes.log.json) |
| InternImage-XL | 512x1024 | 85.41 / 85.93 | 0.43s / iter | 19.5h | 330M | 2364G | [config](./segformer_internimage_xl_512x1024_160k_mapillary2cityscapes.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/segformer_internimage_xl_512x1024_160k_mapillary2cityscapes.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/segformer_internimage_xl_512x1024_160k_mapillary2cityscapes.log.json) |
### Mask2Former + InternImage (with additional data)
Mapillary 80k + Cityscapes (w/ coarse data) 80k
| backbone | resolution | mIoU (ss/ms) | train speed | train time | #params | FLOPs | Config | Download |
|:--------------:|:----------:|:------------:|:-----------:|:-----------:|:-------:|:-----:|:-----:|:---------:|
| InternImage-H | 1024x1024 | 86.37 / 86.96 | TODO | TODO | TODO | TODO | [config](./mask2former_internimage_h_1024x1024_80k_mapillary2cityscapes.py) | [ckpt]() \| [log]() |
...@@ -24,8 +24,3 @@ We first pretrain our models on the Mapillary Vistas dataset, then finetune them
| InternImage-L | 512x1024 | 80k | 0.37s / iter | 9h | 220M | 1580G | [config](./segformer_internimage_l_512x1024_80k_mapillary.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/segformer_internimage_l_512x1024_80k_mapillary.pth) |
| InternImage-XL | 512x1024 | 80k | 0.43s / iter | 10h | 330M | 2364G | [config](./segformer_internimage_xl_512x1024_80k_mapillary.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/segformer_internimage_xl_512x1024_80k_mapillary.pth) |
### Mask2Former + InternImage
| backbone | resolution | schd | train speed | train time | #params | FLOPs | Config | Download |
|:--------------:|:----------:|:------------:|:-----------:|:-----------:|:-------:|:-----:|:-----:|:---------:|
| InternImage-H | 1024x1024 | 80k | TODO | TODO | TODO | TODO | [config](./mask2former_internimage_h_1024x1024_80k_mapillary.py) | [ckpt]() |
import argparse
from collections import OrderedDict

import torch

# Convert every floating-point tensor in a checkpoint's state_dict to fp16,
# roughly halving its size on disk; the result is saved as <name>_fp16.pth.
parser = argparse.ArgumentParser(description='Convert a checkpoint to fp16')
parser.add_argument('filename', nargs='?', type=str, default=None)
args = parser.parse_args()

def convert_fp16(m):
    new_sd = OrderedDict()
    for k, v in m.items():
        # Keep integer buffers (e.g. BN's num_batches_tracked) untouched.
        new_sd[k] = v.half() if v.is_floating_point() else v
    return new_sd

model = torch.load(args.filename, map_location=torch.device('cpu'))['state_dict']
new_model = {"state_dict": convert_fp16(model)}
torch.save(new_model, args.filename.replace(".pth", "_fp16.pth"))
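A quick sanity check for the converted file (the checkpoint name here is hypothetical; use whatever the script above produced):

import torch

sd = torch.load('checkpoint_fp16.pth', map_location='cpu')['state_dict']
assert all(v.dtype == torch.float16
           for v in sd.values() if v.is_floating_point())
print('all floating-point tensors are fp16')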
import argparse
import math
from collections import OrderedDict

import torch

# Fold the learned per-head 'alpha_beta' scaling of the DCN sampling locations
# into the bias of 'sampling_offsets', so the converted checkpoint no longer
# needs the alpha_beta parameters; the result is saved as <name>_rmab.pth.
parser = argparse.ArgumentParser(description='Fold alpha_beta into the sampling-offset bias')
parser.add_argument('filename', nargs='?', type=str, default=None)
args = parser.parse_args()

def gen_grid(n_heads):
    # Integer offsets of a 3x3 kernel around its center, repeated for every
    # head: shape (n_heads, n_points, 2).
    n_points = 9
    points_list = []
    kernel_size = int(math.sqrt(n_points))
    y, x = torch.meshgrid(
        torch.linspace(
            (-kernel_size // 2 + 1),
            (kernel_size // 2), kernel_size,
            dtype=torch.float32),
        torch.linspace(
            (-kernel_size // 2 + 1),
            (kernel_size // 2), kernel_size,
            dtype=torch.float32))
    points_list.extend([y, x])
    grid = torch.stack(points_list, -1).reshape(-1, 1, 2).\
        repeat(1, n_heads, 1).permute(1, 0, 2)
    return grid

def remove_ab(m):
    new_sd = OrderedDict()
    n_points = 9
    for k, v in m.items():
        if 'alpha_beta' in k:
            # alpha_beta holds one (alpha, beta) pair per head; broadcast it
            # over all sampling points and shift the offset bias by
            # (alpha_beta - 1) * grid so the scaling is baked in.
            ab = v.repeat(1, n_points)
            h, _ = ab.size()
            offset_b = k.replace('alpha_beta', 'sampling_offsets.bias')
            ob = m[offset_b]
            grid = gen_grid(h).reshape(h, -1)
            delta = ((ab - 1) * grid).reshape(-1)
            new_sd[offset_b] = ob + delta
            continue
        if 'sampling_offsets.bias' in k:
            # Rewritten by the alpha_beta branch above; skip the original bias.
            continue
        new_sd[k] = v
    return new_sd

model = torch.load(args.filename, map_location=torch.device('cpu'))['state_dict']
new_model = {"state_dict": remove_ab(model)}
torch.save(new_model, args.filename.replace(".pth", "_rmab.pth"))
print("finished!")
import argparse
import math
from collections import OrderedDict

import torch

# Rename state_dict keys from the old attention-style naming to the new DCNv3
# operator naming (attn -> dcn, sampling_offsets -> offset, attention_weights
# -> mask, value_proj -> input_proj), insert the '.0.' indices that the new
# Sequential-wrapped norms expect, drop EMA weights, and cast tensors to fp16.
# The result is saved as <name>_rename.pth.
parser = argparse.ArgumentParser(description='Rename checkpoint keys for the new DCNv3 op')
parser.add_argument('filename', nargs='?', type=str, default=None)
args = parser.parse_args()

def gen_grid(n_heads):
    # Kernel-grid helper shared with the alpha_beta convertor; not used below.
    n_points = 9
    points_list = []
    kernel_size = int(math.sqrt(n_points))
    y, x = torch.meshgrid(
        torch.linspace((-kernel_size // 2 + 1), (kernel_size // 2),
                       kernel_size,
                       dtype=torch.float32),
        torch.linspace((-kernel_size // 2 + 1), (kernel_size // 2),
                       kernel_size,
                       dtype=torch.float32))
    points_list.extend([y, x])
    grid = torch.stack(points_list, -1).reshape(-1, 1, 2).\
        repeat(1, n_heads, 1).permute(1, 0, 2)
    return grid

def convert_to_newop(m):
    new_sd = OrderedDict()
    for k, v in m.items():
        new_k = k
        if 'attn' in k:
            new_k = new_k.replace('attn', 'dcn')
        if 'sampling_offsets' in k:
            new_k = new_k.replace('sampling_offsets', 'offset')
        if 'attention_weights' in k:
            new_k = new_k.replace('attention_weights', 'mask')
        if 'value_proj' in k:
            new_k = new_k.replace('value_proj', 'input_proj')
        if 'ema' in k:
            # EMA copies are dropped from the converted checkpoint.
            continue
        if ".norm1_k." in k:
            new_k = new_k.replace('.norm1_k.', '.norm1_k.0.')
        if ".norm1_q." in k:
            new_k = new_k.replace('.norm1_q.', '.norm1_q.0.')
        if ".norm1_v." in k:
            new_k = new_k.replace('.norm1_v.', '.norm1_v.0.')
        if ".post_norms." in k:
            new_k = new_k.replace('.bias', '.0.bias')
            new_k = new_k.replace('.weight', '.0.weight')
        if "fc_norm." in k:
            new_k = new_k.replace('fc_norm.', 'fc_norm.0.')
        # Note: the rename also converts the tensors to fp16.
        new_sd[new_k] = v.half()
    return new_sd

model = torch.load(args.filename, map_location=torch.device('cpu'))['state_dict']
new_model = {"state_dict": convert_to_newop(model)}
torch.save(new_model, args.filename.replace(".pth", "_rename.pth"))