Commit 477b5ed3 authored by zhe chen

Update README.md
parent f37f9c2a
.idea/
.DS_Store
__pycache__/
@@ -6,3 +5,4 @@ classification/convertor/
segmentation/convertor/
checkpoint_dir/
demo/
pretrained/
<p>
<a href="./README_CN.md">[中文版本]</a>
</p>
We are currently receiving a large number of issues; our team will review and resolve them one by one, so please stay tuned.
# InternImage: Large-Scale Vision Foundation Model
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/object-detection-on-coco)](https://paperswithcode.com/sota/object-detection-on-coco?p=internimage-exploring-large-scale-vision)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/object-detection-on-coco-minival)](https://paperswithcode.com/sota/object-detection-on-coco-minival?p=internimage-exploring-large-scale-vision)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/object-detection-on-lvis-v1-0-minival)](https://paperswithcode.com/sota/object-detection-on-lvis-v1-0-minival?p=internimage-exploring-large-scale-vision)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/object-detection-on-lvis-v1-0-val)](https://paperswithcode.com/sota/object-detection-on-lvis-v1-0-val?p=internimage-exploring-large-scale-vision)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/object-detection-on-pascal-voc-2007)](https://paperswithcode.com/sota/object-detection-on-pascal-voc-2007?p=internimage-exploring-large-scale-vision)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/object-detection-on-pascal-voc-2012)](https://paperswithcode.com/sota/object-detection-on-pascal-voc-2012?p=internimage-exploring-large-scale-vision)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/object-detection-on-openimages-v6)](https://paperswithcode.com/sota/object-detection-on-openimages-v6?p=internimage-exploring-large-scale-vision)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/object-detection-on-crowdhuman-full-body)](https://paperswithcode.com/sota/object-detection-on-crowdhuman-full-body?p=internimage-exploring-large-scale-vision)
@@ -18,7 +16,6 @@ We currently receive a bunch of issues, our team will check and solve them one b
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/semantic-segmentation-on-cityscapes)](https://paperswithcode.com/sota/semantic-segmentation-on-cityscapes?p=internimage-exploring-large-scale-vision)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/semantic-segmentation-on-cityscapes-val)](https://paperswithcode.com/sota/semantic-segmentation-on-cityscapes-val?p=internimage-exploring-large-scale-vision)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/semantic-segmentation-on-pascal-context)](https://paperswithcode.com/sota/semantic-segmentation-on-pascal-context?p=internimage-exploring-large-scale-vision)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/semantic-segmentation-on-coco-stuff-test)](https://paperswithcode.com/sota/semantic-segmentation-on-coco-stuff-test?p=internimage-exploring-large-scale-vision)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/image-classification-on-inaturalist-2018)](https://paperswithcode.com/sota/image-classification-on-inaturalist-2018?p=internimage-exploring-large-scale-vision)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/image-classification-on-places365)](https://paperswithcode.com/sota/image-classification-on-places365?p=internimage-exploring-large-scale-vision)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/image-classification-on-places205)](https://paperswithcode.com/sota/image-classification-on-places205?p=internimage-exploring-large-scale-vision)
@@ -37,43 +34,20 @@ The official implementation of
- 🏆 **Achieved `90.1% Top-1` accuracy on ImageNet, the highest among open-source models**
- 🏆 **Achieved `65.5 mAP` on the COCO object detection benchmark, the only model to exceed `65.0 mAP`**
## News
- `Jan 22, 2024`: 🚀 Support [DCNv4](https://github.com/OpenGVLab/DCNv4) in InternImage!
- `Feb 28, 2023`: 🚀 InternImage is accepted to CVPR 2023!
- `Nov 18, 2022`: 🚀 Built on an InternImage-XL backbone, [BEVFormer v2](https://arxiv.org/abs/2211.10439) achieves state-of-the-art performance of `63.4 NDS` on nuScenes camera-only 3D detection.
- `Nov 10, 2022`: 🚀 InternImage-H achieves a new record `65.4 mAP` on COCO detection test-dev and `62.9 mIoU` on ADE20K, outperforming previous models by a large margin.
## History
- [ ] Models/APIs for other downstream tasks
- [ ] Support [CVPR 2023 Workshop on End-to-End Autonomous Driving](https://opendrivelab.com/e2ead/cvpr23), see [here](https://github.com/OpenGVLab/InternImage/tree/master/autonomous_driving)
- [ ] Support Segment Anything
- [x] Support extracting intermediate features, see [here](classification/extract_feature.py)
- [x] Low-cost training with [DeepSpeed](https://github.com/microsoft/DeepSpeed), see [here](https://github.com/OpenGVLab/InternImage/tree/master/classification)
- [x] Compiling-free `.whl` package of DCNv3 operator, see [here](https://github.com/OpenGVLab/InternImage/releases/tag/whl_files)
- [x] InternImage-H(1B)/G(3B)
- [x] TensorRT inference for classification/detection/segmentation models
- [x] Classification code of the InternImage series
@@ -84,25 +58,25 @@ The official implementation of
## Introduction
"INTERN-2.5" is a powerful multimodal multitask general model jointly released by SenseTime and Shanghai AI Laboratory. It consists of large-scale vision foundation model "InternImage", pre-training method "M3I-Pretraining", generic decoder "Uni-Perceiver" series, and generic encoder for autonomous driving perception "BEVFormer" series.
InternImage is an advanced vision foundation model developed by researchers from Shanghai AI Laboratory, Tsinghua University, and other institutions. Unlike models based on Transformers, InternImage employs DCNv3 as its core operator. This approach equips the model with dynamic and effective receptive fields required for downstream tasks like object detection and segmentation, while enabling adaptive spatial aggregation.
<div align=center>
<img src='./docs/figs/arch.png' width=400>
</div>
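To make the DCNv3 idea concrete, below is a minimal, hedged sketch using `torchvision.ops.deform_conv2d` (a DCNv2-style operator standing in for the repository's DCNv3 CUDA kernel; the module and names are illustrative, not the actual implementation). Both the sampling offsets and the per-sample modulation weights are predicted from the input itself, which is what yields a dynamic receptive field and adaptive spatial aggregation.

```python
# Illustrative sketch only: torchvision's DCNv2-style operator as a stand-in
# for DCNv3 (requires a reasonably recent torchvision; names are ours).
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class AdaptiveAggregation(nn.Module):
    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(channels, channels, k, k) * 0.01)
        # Offsets (where to sample) and modulation (how much each sample counts)
        # are both predicted from the input, so the receptive field is dynamic.
        self.offset = nn.Conv2d(channels, 2 * k * k, 3, padding=1)
        self.mask = nn.Conv2d(channels, k * k, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offset = self.offset(x)
        mask = torch.sigmoid(self.mask(x))
        return deform_conv2d(x, offset, self.weight, padding=1, mask=mask)

out = AdaptiveAggregation(64)(torch.randn(1, 64, 56, 56))
print(out.shape)  # torch.Size([1, 64, 56, 56])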
Some other projects related to InternImage include the pretraining algorithm "M3I-Pretraining," the general-purpose decoder series "Uni-Perceiver," and the autonomous driving perception encoder series "BEVFormer."
"INTERN-2.5" achieved an impressive Top-1 accuracy of 90.1% on the ImageNet benchmark dataset using only publicly available data for image classification. Apart from two undisclosed models trained with additional datasets by Google and Microsoft, "INTERN-2.5" is the only open-source model that achieves a Top-1 accuracy of over 90.0%, and it is also the largest model in scale worldwide.
"INTERN-2.5" outperformed all other models worldwide on the COCO object detection benchmark dataset with a remarkable mAP of 65.5, making it the only model that surpasses 65 mAP in the world.
<div align=left>
<img src='./docs/figs/intern_pipeline_en.png' width=900>
</div>
"INTERN-2.5" also demonstrated world's best performance on 16 other important visual benchmark datasets, covering a wide range of tasks such as classification, detection, and segmentation, making it the top-performing model across multiple domains.
## Performance
- InternImage achieved an impressive Top-1 accuracy of 90.1% on the ImageNet benchmark using only publicly available data. Apart from two undisclosed models trained with additional datasets by Google and Microsoft, InternImage is the only open-source model that achieves a Top-1 accuracy of over 90.0%, and it is also the largest model in scale worldwide.
- InternImage outperformed all other models on the COCO object detection benchmark with a remarkable 65.5 mAP, making it the only model that surpasses 65 mAP.
- InternImage also demonstrated the world's best performance on 16 other important visual benchmarks, covering a wide range of tasks such as classification, detection, and segmentation, making it the top-performing model across multiple domains.
**Classification**
<table border="1" width="90%">
<tr align="center">
@@ -112,15 +86,15 @@ The official implementation of
<th>ImageNet</th><th>Places365</th><th>Places 205</th><th>iNaturalist 2018</th>
</tr>
<tr align="center">
<th>90.1</th><th>61.2</th><th>71.7</th><th>92.6</th>
</tr>
</table>
**Detection**
<table border="1" width="90%">
<tr align="center">
<th colspan="4"> Conventional Object Detection</th><th colspan="3">Long-Tail Object Detection </th><th colspan="1">Autonomous Driving Object Detection</th><th colspan="1">Dense Object Detection</th>
<th colspan="4"> General Object Detection </th><th colspan="3"> Long-Tail Object Detection </th><th colspan="1"> Autonomous Driving Object Detection </th><th colspan="1"> Dense Object Detection </th>
</tr>
<tr align="center">
<th>COCO</th><th>VOC 2007</th><th>VOC 2012</th><th>OpenImage</th><th>LVIS minival</th><th>LVIS val</th><th>BDD100K</th><th>nuScenes</th><th>CrowdHuman</th>
@@ -130,7 +104,7 @@ The official implementation of
</tr>
</table>
**Segmentation**
<table border="1" width="90%">
<tr align="center">
@@ -140,70 +114,48 @@ The official implementation of
<th>ADE20K</th><th>COCO Stuff-10K</th><th>Pascal Context</th><th>CityScapes</th><th>NYU Depth V2</th>
</tr>
<tr align="center">
<th>62.9</th><th>59.6</th><th>70.3</th><th>87.0</th><th>68.1</th>
</tr>
</table>
<br>
## Released Models
<details open>
<summary> Open-Source Visual Pretrained Models </summary>
<br>
<div>
| name | pretrain | resolution | #param | download |
| :------------: | :----------: | :--------: | :----: | :---------------------------------------------------------------------------------------------------: |
| InternImage-L | ImageNet-22K | 384x384 | 223M | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_l_22k_192to384.pth) |
| InternImage-XL | ImageNet-22K | 384x384 | 335M | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_xl_22k_192to384.pth) |
| InternImage-H | Joint 427M | 384x384 | 1.08B | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_h_jointto22k_384.pth) |
| InternImage-G | Joint 427M | 384x384 | 3B | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_g_pretrainto22k_384.pth) |
</div>
</details>
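The checkpoints above are hosted on Hugging Face. As a convenience, here is a minimal sketch of fetching and inspecting one with `huggingface_hub`; whether the weights are nested under a `"model"` key is an assumption about the checkpoint layout.

```python
# Minimal sketch: download a checkpoint from the tables above and inspect it.
# Assumes `pip install huggingface_hub` and PyTorch are available.
import torch
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="OpenGVLab/InternImage",
    filename="internimage_l_22k_192to384.pth",  # any filename from the tables
)
ckpt = torch.load(path, map_location="cpu")
state = ckpt.get("model", ckpt)  # weights may be nested under "model" (assumption)
n_params = sum(v.numel() for v in state.values() if hasattr(v, "numel"))
print(f"{len(state)} tensors, ~{n_params / 1e6:.0f}M parameters")
```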
<details open>
<summary> ImageNet-1K Image Classification </summary>
<br>
<div>
| name | pretrain | resolution | acc@1 | #param | FLOPs | download |
| :------------: | :----------: | :--------: | :---: | :----: | :---: | :-----------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| InternImage-T | ImageNet-1K | 224x224 | 83.5 | 30M | 5G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_t_1k_224.pth) \| [cfg](configs/without_lr_decay/internimage_t_1k_224.yaml) |
| InternImage-S | ImageNet-1K | 224x224 | 84.2 | 50M | 8G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_s_1k_224.pth) \| [cfg](configs/without_lr_decay/internimage_s_1k_224.yaml) |
| InternImage-B | ImageNet-1K | 224x224 | 84.9 | 97M | 16G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_b_1k_224.pth) \| [cfg](configs/without_lr_decay/internimage_b_1k_224.yaml) |
| InternImage-L | ImageNet-22K | 384x384 | 87.7 | 223M | 108G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_l_22kto1k_384.pth) \| [cfg](configs/without_lr_decay/internimage_l_22kto1k_384.yaml) |
| InternImage-XL | ImageNet-22K | 384x384 | 88.0 | 335M | 163G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_xl_22kto1k_384.pth) \| [cfg](configs/without_lr_decay/internimage_xl_22kto1k_384.yaml) |
| InternImage-H | Joint 427M | 640x640 | 89.6 | 1.08B | 1478G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_h_22kto1k_640.pth) \| [cfg](configs/without_lr_decay/internimage_h_22kto1k_640.yaml) |
| InternImage-G | Joint 427M | 512x512 | 90.1 | 3B | 2700G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_g_22kto1k_512.pth) \| [cfg](configs/without_lr_decay/internimage_g_22kto1k_512.yaml) |
</div>
</details>
<details open>
<summary> COCO Object Detection and Instance Segmentation </summary>
<br>
<div>
@@ -230,7 +182,7 @@ The official implementation of
</details>
<details open>
<summary> ADE20K Semantic Segmentation </summary>
<br>
<div>
@@ -287,13 +239,33 @@ cd ${MMDEPLOY_DIR}
pip install -e .
```
For more details on building custom ops, please refer to [this document](https://github.com/open-mmlab/mmdeploy/blob/master/docs/en/01-how-to-build/linux-x86_64.md).
</div>
</details>
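After building the custom ops, the resulting plugin library has to be loaded into the process before TensorRT can deserialize an engine containing DCNv3 layers. A hedged sketch follows; the plugin library path and engine filename are assumptions to adjust for your build.

```python
# Hedged sketch: load the mmdeploy TensorRT plugin library so the DCNv3 custom
# op can be resolved, then deserialize an engine. Paths below are assumptions.
import ctypes
import tensorrt as trt

ctypes.CDLL("/path/to/MMDeploy/build/lib/libmmdeploy_tensorrt_ops.so")

logger = trt.Logger(trt.Logger.WARNING)
with open("internimage.engine", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
print("engine loaded:", engine is not None)
```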
## Related Projects
### Foundation Models
- [Uni-Perceiver](https://github.com/fundamentalvision/Uni-Perceiver): A unified pre-training architecture for generic perception, covering zero-shot and few-shot tasks
- [Uni-Perceiver v2](https://arxiv.org/abs/2211.09808): A generalist model for large-scale vision and vision-language tasks
- [M3I-Pretraining](https://github.com/OpenGVLab/M3I-Pretraining): One-stage pre-training paradigm via maximizing multi-modal mutual information
- [InternVL](https://github.com/OpenGVLab/InternVL): A leading multimodal large language model excelling in tasks such as OCR, multimodal reasoning, and dialogue
### Autonomous Driving
- [BEVFormer](https://github.com/fundamentalvision/BEVFormer): A cutting-edge baseline for camera-based 3D detection
- [BEVFormer v2](https://arxiv.org/abs/2211.10439): Adapting modern image backbones to Bird's-Eye-View recognition via perspective supervision
## Application in Challenges
- [2022 Waymo 3D Camera-Only Detection Challenge](https://waymo.com/open/challenges/2022/3d-camera-only-detection/): Based on InternImage, BEVFormer++ ranks 1st
- [nuScenes 3D detection](https://www.nuscenes.org/object-detection?externalData=all&mapData=all&modalities=Camera): BEVFormer v2 achieves SOTA performance of 64.8 NDS on nuScenes Camera Only
- [CVPR 2023 Workshop End-to-End Autonomous Driving](https://opendrivelab.com/e2ead/cvpr23): InternImage supports the baseline of the [3D Occupancy Prediction Challenge](https://opendrivelab.com/AD23Challenge.html#Track3) and [OpenLane Topology Challenge](https://opendrivelab.com/AD23Challenge.html#Track1)
## Citation
If this work is helpful for your research, please consider citing the following BibTeX entry.
@@ -304,48 +276,4 @@ If this work is helpful for your research, please consider citing the following
journal={arXiv preprint arXiv:2211.05778},
year={2022}
}
```
<p>
<a href="./README.md">[English Version]</a>
</p>
There are quite a few open issues at the moment; our team will review and resolve them one by one, so please stay tuned.
# InternImage: Large-Scale Vision Foundation Model
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/object-detection-on-coco)](https://paperswithcode.com/sota/object-detection-on-coco?p=internimage-exploring-large-scale-vision)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/object-detection-on-coco-minival)](https://paperswithcode.com/sota/object-detection-on-coco-minival?p=internimage-exploring-large-scale-vision)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/object-detection-on-lvis-v1-0-minival)](https://paperswithcode.com/sota/object-detection-on-lvis-v1-0-minival?p=internimage-exploring-large-scale-vision)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/object-detection-on-lvis-v1-0-val)](https://paperswithcode.com/sota/object-detection-on-lvis-v1-0-val?p=internimage-exploring-large-scale-vision)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/object-detection-on-pascal-voc-2007)](https://paperswithcode.com/sota/object-detection-on-pascal-voc-2007?p=internimage-exploring-large-scale-vision)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/object-detection-on-pascal-voc-2012)](https://paperswithcode.com/sota/object-detection-on-pascal-voc-2012?p=internimage-exploring-large-scale-vision)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/object-detection-on-openimages-v6)](https://paperswithcode.com/sota/object-detection-on-openimages-v6?p=internimage-exploring-large-scale-vision)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/object-detection-on-crowdhuman-full-body)](https://paperswithcode.com/sota/object-detection-on-crowdhuman-full-body?p=internimage-exploring-large-scale-vision)
@@ -18,62 +16,39 @@
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/semantic-segmentation-on-cityscapes)](https://paperswithcode.com/sota/semantic-segmentation-on-cityscapes?p=internimage-exploring-large-scale-vision)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/semantic-segmentation-on-cityscapes-val)](https://paperswithcode.com/sota/semantic-segmentation-on-cityscapes-val?p=internimage-exploring-large-scale-vision)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/semantic-segmentation-on-pascal-context)](https://paperswithcode.com/sota/semantic-segmentation-on-pascal-context?p=internimage-exploring-large-scale-vision)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/semantic-segmentation-on-coco-stuff-test)](https://paperswithcode.com/sota/semantic-segmentation-on-coco-stuff-test?p=internimage-exploring-large-scale-vision)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/image-classification-on-inaturalist-2018)](https://paperswithcode.com/sota/image-classification-on-inaturalist-2018?p=internimage-exploring-large-scale-vision)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/image-classification-on-places365)](https://paperswithcode.com/sota/image-classification-on-places365?p=internimage-exploring-large-scale-vision)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/image-classification-on-places205)](https://paperswithcode.com/sota/image-classification-on-places205?p=internimage-exploring-large-scale-vision)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/bevformer-v2-adapting-modern-image-backbones/3d-object-detection-on-nuscenes-camera-only)](https://paperswithcode.com/sota/3d-object-detection-on-nuscenes-camera-only?p=bevformer-v2-adapting-modern-image-backbones)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/image-classification-on-imagenet)](https://paperswithcode.com/sota/image-classification-on-imagenet?p=internimage-exploring-large-scale-vision)
这个代码仓库是[InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions](https://arxiv.org/abs/2211.05778)的官方实现。
这个代码仓库是 [InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions](https://arxiv.org/abs/2211.05778) 的官方实现。
\[[Paper](https://arxiv.org/abs/2211.05778)\] \[[Zhihu column](https://zhuanlan.zhihu.com/p/610772005)\]
## Highlights
- :thumbsup: **The most powerful general-purpose vision backbone, with up to 3 billion parameters**
- 🏆 **`90.1% Top1` accuracy on the ImageNet classification benchmark, the highest among open-source models**
- 🏆 **`65.5 mAP` on the COCO object detection benchmark, the only model to exceed `65 mAP`**
## News
- Jan 22, 2024: 🚀 [DCNv4](https://github.com/OpenGVLab/DCNv4) is now supported in InternImage!
- Feb 28, 2023: 🚀 InternImage is accepted to CVPR 2023!
- Nov 18, 2022: 🚀 Built on an InternImage-XL backbone, [BEVFormer v2](https://arxiv.org/abs/2211.10439) achieves the best camera-only 3D detection performance of `63.4 NDS` on nuScenes!
- Nov 10, 2022: 🚀 InternImage-H wins first place on COCO object detection with `65.4 mAP`, the only model to break `65.0 mAP`!
- Nov 10, 2022: 🚀 InternImage-H achieves SOTA performance of `62.9 mIoU` on the ADE20K semantic segmentation benchmark!
## Features
- [ ] Models/APIs for various downstream tasks
- [ ] Support the [CVPR 2023 Workshop on End-to-End Autonomous Driving](https://opendrivelab.com/e2ead/cvpr23), see [here](https://github.com/OpenGVLab/InternImage/tree/master/autonomous_driving)
- [ ] Support Segment Anything
- [x] Support extracting intermediate features, see [here](classification/extract_feature.py)
- [x] Low-cost training with [DeepSpeed](https://github.com/microsoft/DeepSpeed), see [here](https://github.com/OpenGVLab/InternImage/tree/master/classification)
- [x] Pre-compiled `.whl` packages of the DCNv3 operator, see [here](https://github.com/OpenGVLab/InternImage/releases/tag/whl_files)
- [x] InternImage-H(1B)/G(3B)
- [x] TensorRT inference for classification/detection/segmentation models
- [x] Classification code of the InternImage series
- [x] InternImage-T/S/B/L/XL ImageNet-1K pretrained models
- [x] InternImage-L/XL ImageNet-22K pretrained models
@@ -82,43 +57,43 @@
## Introduction
"书生2.5"是商汤科技与上海人工智能实验室联合发布的多模态多任务通用大模型。"书生2.5"包括大规模视觉基础模型"InternImage",预训练算法"M3I-Pretraining",通用解码器"Uni-Perceiver"系列,以及自动驾驶感知通用编码器"BEVFormer"系列
InternImage 是一个由上海人工智能实验室、清华大学等机构的研究人员提出的基于卷积神经网络(CNN)的视觉基础模型。与基于 Transformer 的网络不同,InternImage 以可变形卷积 DCNv3 作为核心算子,使模型不仅具有检测和分割等下游任务所需的动态有效感受野,而且能够进行自适应的空间聚合
<div align=center>
<img src='./docs/figs/arch.png' width=400>
</div>
Other projects related to InternImage include the pre-training algorithm M3I-Pretraining, the general-purpose decoder series Uni-Perceiver, and the autonomous driving perception encoder series BEVFormer.
<div align=left>
<img src='./docs/figs/intern_pipeline.png' width=900>
</div>
## Performance
<div align="left">
<br>
- On the image classification benchmark ImageNet, InternImage reaches 90.1% Top-1 accuracy using only publicly available data. Apart from two undisclosed models from Google and Microsoft trained with additional datasets, it is the only model exceeding 90.0% accuracy, and it is also the most accurate and largest open-source model on ImageNet;
- On the object detection benchmark COCO, InternImage achieves 65.5 mAP, the only model in the world to exceed 65 mAP;
- It also achieves the world's best performance on 16 other important vision benchmarks covering classification, detection, and segmentation tasks.
**Classification**
<table border="1" width="90%">
<tr align="center">
<th colspan="1"> 图像分类</th><th colspan="2"> 场景分类 </th><th colspan="1">长尾分类</th>
<th colspan="1"> 图像分类 </th><th colspan="2"> 场景分类 </th><th colspan="1"> 长尾分类 </th>
</tr>
<tr align="center">
<th>ImageNet</th><th>Places365</th><th>Places 205</th><th>iNaturalist 2018</th>
</tr>
<tr align="center">
<th>90.1</th><th>61.2</th><th>71.7</th><th>92.6</th>
</tr>
</table>
<br>
**Detection**
<table border="1" width="90%">
<tr align="center">
<th colspan="4"> 常规物体检测</th><th colspan="2">长尾物体检测 </th><th colspan="2">自动驾驶物体检测</th><th colspan="1">密集物体检测</th>
<th colspan="4"> 常规物体检测 </th><th colspan="2"> 长尾物体检测 </th><th colspan="2"> 自动驾驶物体检测 </th><th colspan="1"> 密集物体检测 </th>
</tr>
<tr align="center">
<th>COCO</th><th>VOC 2007</th><th>VOC 2012</th><th>OpenImage</th><th>LVIS minival</th><th>LVIS val</th><th>BDD100K</th><th>nuScenes</th><th>CrowdHuman</th>
@@ -127,7 +102,6 @@
<th>65.5</th><th>94.0</th><th>97.2</th><th>74.1</th><th>65.8</th><th>63.2</th><th>38.8</th><th>64.8</th><th>97.2</th>
</tr>
</table>
<br>
**Segmentation**
@@ -139,82 +113,49 @@
<th>ADE20K</th><th>COCO Stuff-10K</th><th>Pascal Context</th><th>CityScapes</th><th>NYU Depth V2</th>
</tr>
<tr align="center">
<th>62.9</th><th>59.6</th><th>70.3</th><th>87.0</th><th>68.1</th>
</tr>
</table>
<br>
## Released Models
<details open>
<summary> Open-Source Visual Pretrained Models </summary>
<br>
<div>
| name | pretrain | resolution | #param | download |
| :------------: | :----------: | :--------: | :----: | :---------------------------------------------------------------------------------------------------: |
| InternImage-L | ImageNet-22K | 384x384 | 223M | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_l_22k_192to384.pth) |
| InternImage-XL | ImageNet-22K | 384x384 | 335M | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_xl_22k_192to384.pth) |
| InternImage-H | Joint 427M | 384x384 | 1.08B | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_h_jointto22k_384.pth) |
| InternImage-G | Joint 427M | 384x384 | 3B | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_g_pretrainto22k_384.pth) |
</div>
</details>
<details open>
<summary> ImageNet-1K Image Classification </summary>
<br>
<div>
| name | pretrain | resolution | acc@1 | #param | FLOPs | download |
| :------------: | :----------: | :--------: | :---: | :----: | :---: | :-----------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| InternImage-T | ImageNet-1K | 224x224 | 83.5 | 30M | 5G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_t_1k_224.pth) \| [cfg](configs/without_lr_decay/internimage_t_1k_224.yaml) |
| InternImage-S | ImageNet-1K | 224x224 | 84.2 | 50M | 8G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_s_1k_224.pth) \| [cfg](configs/without_lr_decay/internimage_s_1k_224.yaml) |
| InternImage-B | ImageNet-1K | 224x224 | 84.9 | 97M | 16G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_b_1k_224.pth) \| [cfg](configs/without_lr_decay/internimage_b_1k_224.yaml) |
| InternImage-L | ImageNet-22K | 384x384 | 87.7 | 223M | 108G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_l_22kto1k_384.pth) \| [cfg](configs/without_lr_decay/internimage_l_22kto1k_384.yaml) |
| InternImage-XL | ImageNet-22K | 384x384 | 88.0 | 335M | 163G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_xl_22kto1k_384.pth) \| [cfg](configs/without_lr_decay/internimage_xl_22kto1k_384.yaml) |
| InternImage-H | Joint 427M | 640x640 | 89.6 | 1.08B | 1478G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_h_22kto1k_640.pth) \| [cfg](configs/without_lr_decay/internimage_h_22kto1k_640.yaml) |
| InternImage-G | Joint 427M | 512x512 | 90.1 | 3B | 2700G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_g_22kto1k_512.pth) \| [cfg](configs/without_lr_decay/internimage_g_22kto1k_512.yaml) |
</div>
</details>
<details open>
<summary> COCO Object Detection and Instance Segmentation </summary>
<br>
<div>
@@ -240,8 +181,8 @@
</details>
<details open>
<summary> ADE20K Semantic Segmentation </summary>
<br>
<div>
@@ -257,20 +198,18 @@
</div>
</div>
</details>
<details>
<summary> Model Inference Speed </summary>
<br>
<div>
[Export classification models from PyTorch to TensorRT](classification/README.md#export)

[Export detection models from PyTorch to TensorRT](detection/README.md#export)

[Export segmentation models from PyTorch to TensorRT](segmentation/README.md#export)
| name | resolution | #param | FLOPs | batch 1 FPS (TensorRT) |
| :------------: | :--------: | :----: | :---: | :--------------------: |
@@ -280,7 +219,7 @@
| InternImage-L | 384x384 | 223M | 108G | 56 |
| InternImage-XL | 384x384 | 335M | 163G | 47 |
Before converting PyTorch models to TensorRT with `mmdeploy`, please make sure the custom DCNv3 operator has been compiled correctly. It can be installed as follows:
```shell
export MMDEPLOY_DIR=/the/root/path/of/MMDeploy
@@ -299,15 +238,35 @@ cd ${MMDEPLOY_DIR}
pip install -e .
```
For more details on compiling custom operators with `mmdeploy`, please refer to this [document](https://github.com/open-mmlab/mmdeploy/blob/master/docs/en/01-how-to-build/linux-x86_64.md).
</div>
</details>
## Related Projects
### Multimodal Foundation Models
- [Uni-Perceiver](https://github.com/fundamentalvision/Uni-Perceiver): A unified pre-training framework for generic perception tasks, directly handling zero-shot and few-shot tasks
- [Uni-Perceiver v2](https://arxiv.org/abs/2211.09808): A generalist model for image and image-text tasks
- [M3I-Pretraining](https://github.com/OpenGVLab/M3I-Pretraining): A one-stage pre-training paradigm based on maximizing the mutual information between inputs and targets
- [InternVL](https://github.com/OpenGVLab/InternVL): A leading multimodal large language model excelling at tasks such as OCR, multimodal reasoning, and dialogue
### Autonomous Driving
- [BEVFormer](https://github.com/fundamentalvision/BEVFormer): A new-generation, camera-only BEV perception framework
- [BEVFormer v2](https://arxiv.org/abs/2211.10439): A two-stage detector that fuses BEV perception with perspective-view detection
## Application in Challenges
- [2022 Waymo 3D Camera-Only Detection Challenge](https://waymo.com/open/challenges/2022/3d-camera-only-detection/): Based on InternImage, BEVFormer++ won first place in the track
- [nuScenes 3D detection](https://www.nuscenes.org/object-detection?externalData=all&mapData=all&modalities=Camera): BEVFormer v2 achieves SOTA performance (64.8 NDS) on nuScenes camera-only detection
- [CVPR 2023 Workshop End-to-End Autonomous Driving](https://opendrivelab.com/e2ead/cvpr23): InternImage supports the baselines of the [3D Occupancy Prediction Challenge](https://opendrivelab.com/AD23Challenge.html#Track3) and the [OpenLane Topology Challenge](https://opendrivelab.com/AD23Challenge.html#Track1)
## Citation
If this work is helpful for your research, please cite it using the following BibTeX entry.
```bibtex
@article{wang2022internimage,
@@ -316,52 +275,4 @@ pip install -e .
journal={arXiv preprint arXiv:2211.05778},
year={2022}
}
```
<div align=left>
</div>
@@ -10,7 +10,7 @@ This folder contains the implementation of the InternImage for image classificat
- [Evaluation](#evaluation)
- [Training from Scratch on ImageNet-1K](#training-from-scratch-on-imagenet-1k)
- [Manage Jobs with Slurm](#manage-jobs-with-slurm)
- [Training with DeepSpeed](#training-with-deepspeed)
- [Extracting Intermediate Features](#extracting-intermediate-features)
- [Export](#export)
@@ -47,6 +47,7 @@ pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 -f https://download.p
```bash
pip install -U openmim
mim install mmcv-full==1.5.0
mim install mmsegmentation==0.27.0
pip install timm==0.6.11 mmdet==2.28.1
```
@@ -59,7 +60,7 @@ pip install numpy==1.26.4
pip install pydantic==1.10.13
```
- Compile CUDA operators
Before compiling, please use the `nvcc -V` command to check whether your `nvcc` version matches the CUDA version of PyTorch.
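A small convenience check (not part of this repository) that prints the two versions side by side:

```python
# Convenience sketch: compare the CUDA version PyTorch was built against with
# the nvcc found on PATH before compiling the DCNv3 CUDA operators.
import subprocess
import torch

print("PyTorch built with CUDA:", torch.version.cuda)
print(subprocess.run(["nvcc", "-V"], capture_output=True, text=True).stdout)
```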
@@ -79,8 +80,9 @@ We provide the following ways to prepare data:
<details open>
<summary>Standard ImageNet-1K</summary>
<br>
- We use the standard ImageNet dataset; you can download it from http://image-net.org/.
- For a standard folder dataset, move the validation images into labeled sub-folders. The file structure should look like:
@@ -195,12 +197,12 @@ We use standard ImageNet dataset, you can download it from http://image-net.org/
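The labeled sub-folder layout described above is exactly what `torchvision.datasets.ImageFolder` expects, so a quick sanity check of the prepared data could look like the following sketch (the dataset path is an assumption):

```python
# Hedged sketch: verify a labeled-subfolder ImageNet layout loads correctly.
from torchvision import datasets, transforms

tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
val_set = datasets.ImageFolder("/path/to/imagenet/val", transform=tf)
print(len(val_set), "images across", len(val_set.classes), "classes")
```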
<br>
<div>
| name | pretrain | resolution | #param | download |
| :------------: | :----------: | :--------: | :----: | :---------------------------------------------------------------------------------------------------: |
| InternImage-L | ImageNet-22K | 384x384 | 223M | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_l_22k_192to384.pth) |
| InternImage-XL | ImageNet-22K | 384x384 | 335M | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_xl_22k_192to384.pth) |
| InternImage-H | Joint 427M | 384x384 | 1.08B | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_h_jointto22k_384.pth) |
| InternImage-G | Joint 427M | 384x384 | 3B | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_g_pretrainto22k_384.pth) |
</div>
@@ -211,15 +213,15 @@ We use standard ImageNet dataset, you can download it from http://image-net.org/
<br>
<div>
| name | pretrain | resolution | acc@1 | #param | FLOPs | download |
| :------------: | :----------: | :--------: | :---: | :----: | :---: | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| InternImage-T | ImageNet-1K | 224x224 | 83.5 | 30M | 5G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_t_1k_224.pth) \| [cfg](configs/without_lr_decay/internimage_t_1k_224.yaml) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/internimage_t_1k_224.log) |
| InternImage-S | ImageNet-1K | 224x224 | 84.2 | 50M | 8G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_s_1k_224.pth) \| [cfg](configs/without_lr_decay/internimage_s_1k_224.yaml) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/internimage_s_1k_224.log) |
| InternImage-B | ImageNet-1K | 224x224 | 84.9 | 97M | 16G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_b_1k_224.pth) \| [cfg](configs/without_lr_decay/internimage_b_1k_224.yaml) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/internimage_b_1k_224.log) |
| InternImage-L | ImageNet-22K | 384x384 | 87.7 | 223M | 108G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_l_22kto1k_384.pth) \| [cfg](configs/without_lr_decay/internimage_l_22kto1k_384.yaml) |
| InternImage-XL | ImageNet-22K | 384x384 | 88.0 | 335M | 163G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_xl_22kto1k_384.pth) \| [cfg](configs/without_lr_decay/internimage_xl_22kto1k_384.yaml) |
| InternImage-H | Joint 427M | 640x640 | 89.6 | 1.08B | 1478G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_h_22kto1k_640.pth) \| [cfg](configs/without_lr_decay/internimage_h_22kto1k_640.yaml) |
| InternImage-G | Joint 427M | 512x512 | 90.1 | 3B | 2700G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_g_22kto1k_512.pth) \| [cfg](configs/without_lr_decay/internimage_g_22kto1k_512.yaml) |
</div>
@@ -230,9 +232,9 @@ We use standard ImageNet dataset, you can download it from http://image-net.org/
<br>
<div>
| name | pretrain | resolution | acc@1 | #param | download |
| :-----------: | :--------: | :--------: | :---: | :----: | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| InternImage-H | Joint 427M | 384x384 | 92.6 | 1.1B | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_h_22ktoinat18_384.pth) \| [cfg](configs/inaturalist2018/internimage_h_22ktoinat18_384.yaml) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/internimage_h_22ktoinat18_384.log) |
</div>
@@ -267,56 +269,104 @@ python -m torch.distributed.launch --nproc_per_node <num-of-gpus-to-use> --maste
## Manage Jobs with Slurm
For example, to train or evaluate `InternImage` on a Slurm cluster, run one of the following. The per-GPU `--batch-size` values below are scaled so that the total batch size stays the same for a given config regardless of the GPU count.
<details open>
<summary> InternImage-T (IN-1K) </summary>
<br>
```bash
# Train for 300 epochs with 8 GPUs
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_t_1k_224.yaml --batch-size 512
# Train for 300 epochs with 32 GPUs
GPUS=32 sh train_in1k.sh <partition> <job-name> configs/internimage_t_1k_224.yaml --batch-size 128
# Evaluate on ImageNet-1K with 8 GPUs
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_t_1k_224.yaml --resume pretrained/internimage_t_1k_224.pth --eval
```
</details>
<details>
<summary> InternImage-S (IN-1K) </summary>
<br>
```bash
# Train for 300 epochs with 8 GPUs
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_s_1k_224.yaml --batch-size 512
# Train for 300 epochs with 32 GPUs
GPUS=32 sh train_in1k.sh <partition> <job-name> configs/internimage_s_1k_224.yaml --batch-size 128
# Evaluate on ImageNet-1K with 8 GPUs
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_s_1k_224.yaml --resume pretrained/internimage_s_1k_224.pth --eval
```
</details>
<details>
<summary> InternImage-B (IN-1K) </summary>
<br>
```bash
# Train for 300 epochs with 8 GPUs
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_b_1k_224.yaml --batch-size 512
# Train for 300 epochs with 32 GPUs
GPUS=32 sh train_in1k.sh <partition> <job-name> configs/internimage_b_1k_224.yaml --batch-size 128
# Evaluate on ImageNet-1K with 8 GPUs
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_b_1k_224.yaml --resume pretrained/internimage_b_1k_224.pth --eval
```
</details>
<details>
<summary> InternImage-L (IN-22K to IN-1K) </summary>
<br>
```bash
# Train for 20 epochs with 32 GPUs
GPUS=32 sh train_in1k.sh <partition> <job-name> configs/internimage_l_22kto1k_384.yaml --batch-size 16
# Evaluate on ImageNet-1K with 8 GPUs
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_l_22kto1k_384.yaml --resume pretrained/internimage_l_22kto1k_384.pth --eval
```
</details>
<details>
<summary> InternImage-XL (IN-22K to IN-1K) </summary>
<br>
```bash
# Train for 20 epochs with 32 GPUs
GPUS=32 sh train_in1k.sh <partition> <job-name> configs/internimage_xl_22kto1k_384.yaml --batch-size 16
# Evaluate on ImageNet-1K with 8 GPUs
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_xl_22kto1k_384.yaml --resume pretrained/internimage_xl_22kto1k_384.pth --eval
```
</details>
<details>
<summary> InternImage-H (IN-22K to IN-1K) </summary>
<br>
```bash
# Train for 20 epochs with 32 GPUs
GPUS=32 sh train_in1k.sh <partition> <job-name> configs/internimage_h_22kto1k_640.yaml --batch-size 16
# Evaluate on ImageNet-1K with 8 GPUs
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_h_22kto1k_640.yaml --resume pretrained/internimage_h_22kto1k_640.pth --eval
```
</details>
<details>
<summary> InternImage-G (IN-22K to IN-1K) </summary>
<br>
```bash
# Train for 20 epochs with 64 GPUs
GPUS=64 sh train_in1k.sh <partition> <job-name> configs/internimage_g_22kto1k_512.yaml --batch-size 8
# Evaluate on ImageNet-1K with 8 GPUs
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_g_22kto1k_512.yaml --resume pretrained/internimage_g_22kto1k_512.pth --eval
```
</details>
## Training with DeepSpeed
## Export
First, install `mmdeploy`:
```shell
pip install mmdeploy==0.14.0
```
To export `InternImage-T` from PyTorch to ONNX:
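A minimal sketch of one way to drive the conversion with `mmdeploy`'s generic `tools/deploy.py` entry point; the deploy config name and all paths below are illustrative assumptions, not a command documented by this repository:

```bash
# Hypothetical mmdeploy ONNX export sketch; config and paths are assumptions.
python tools/deploy.py \
    configs/mmcls/classification_onnxruntime_dynamic.py \
    <model-config> <checkpoint> <demo-image> \
    --work-dir onnx_output \
    --device cpu
```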
MODEL:
  PRETRAINED: 'pretrained/internimage_h_jointto22k_384.pth'
TRAIN:
  EMA:
    ENABLE: false
    DECAY: 0.9999
  EPOCHS: 100
  WARMUP_EPOCHS: 0
  BASE_LR: 2e-05 # 512
  WARMUP_LR: .0
  MIN_LR: .0
  OPTIMIZER:
    DCN_LR_MUL: 0.1
AMP_OPT_LEVEL: O0
EVAL_FREQ: 1
This folder contains the implementation of InternImage for semantic segmentation.
Our segmentation code is developed on top of [MMSegmentation v0.27.0](https://github.com/open-mmlab/mmsegmentation/tree/v0.27.0).
## Usage
<!-- TOC -->
- [Installation](#installation)
- [Data Preparation](#data-preparation)
- [Released Models](#released-models)
- [Evaluation](#evaluation)
- [Training](#training)
- [Manage Jobs with Slurm](#manage-jobs-with-slurm)
- [Image Demo](#image-demo)
- [Export](#export)
<!-- TOC -->
## Installation
- Clone this repository:
```bash
git clone https://github.com/OpenGVLab/InternImage.git
```
- Install `CUDA>=10.2` following the [official installation instructions](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html)
- Install `PyTorch>=1.10.0` and `torchvision>=0.9.0` with `CUDA>=10.2`:
For example, to install `torch==1.11` with `CUDA==11.3`:
```bash
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html
```
- Install other requirements:
```bash
conda install -c conda-forge termcolor yacs pyyaml scipy pip -y
pip install opencv-python
```
- Install `timm`, `mmcv-full` and `mmsegmentation`:
```bash
pip install -U openmim
mim install mmsegmentation==0.27.0
pip install timm==0.6.11 mmdet==2.28.1
```
- Install other requirements:
```bash
pip install opencv-python termcolor yacs pyyaml scipy
# Please use a version of numpy lower than 2.0
pip install numpy==1.26.4
pip install pydantic==1.10.13
```
- Compile CUDA operators
Before compiling, please use the `nvcc -V` command to check whether your `nvcc` version matches the CUDA version of PyTorch.
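For example, the two versions can be compared like this (using the standard `torch.version.cuda` attribute):

```bash
# CUDA toolkit version seen by the compiler
nvcc -V
# CUDA version PyTorch was built against; the two should match
python -c "import torch; print(torch.version.cuda)"
```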
```bash
cd ./ops_dcnv3
sh ./make.sh
python test.py  # unit test for the compiled DCNv3 operators
```
- You can also install the operator using precompiled `.whl` files
[DCNv3-1.0-whl](https://github.com/OpenGVLab/InternImage/releases/tag/whl_files)
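For instance (the wheel filename below is a placeholder; choose the file matching your Python, CUDA, and PyTorch versions from the release page):

```bash
pip install DCNv3-1.0-<python-cuda-torch-tag>.whl
```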
## Data Preparation
Prepare datasets according to the [guidelines](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#prepare-datasets) in MMSegmentation.
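For ADE20K, the expected layout after following that guide looks roughly like this (a sketch; see the MMSegmentation documentation for the authoritative structure):

```
data/
└── ade/
    └── ADEChallengeData2016/
        ├── images/
        │   ├── training/
        │   └── validation/
        └── annotations/
            ├── training/
            └── validation/
```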
## Released Models
<details open>
<summary> Dataset: ADE20K </summary>
<br>
<div>
| method | backbone | resolution | mIoU (ss/ms) | #param | FLOPs | Config | Download |
| :---------: | :------------: | :--------: | :----------: | :----: | :---: | :---------------------------------------------------------------------------------: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| UperNet | InternImage-T | 512x512 | 47.9 / 48.1 | 59M | 944G | [config](./configs/ade20k/upernet_internimage_t_512_160k_ade20k.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_t_512_160k_ade20k.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/upernet_internimage_t_512_160k_ade20k.log.json) |
| UperNet | InternImage-S | 512x512 | 50.1 / 50.9 | 80M | 1017G | [config](./configs/ade20k/upernet_internimage_s_512_160k_ade20k.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_s_512_160k_ade20k.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/upernet_internimage_s_512_160k_ade20k.log.json) |
| UperNet | InternImage-B | 512x512 | 50.8 / 51.3 | 128M | 1185G | [config](./configs/ade20k/upernet_internimage_b_512_160k_ade20k.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_b_512_160k_ade20k.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/upernet_internimage_b_512_160k_ade20k.log.json) |
| UperNet | InternImage-L | 640x640 | 53.9 / 54.1 | 256M | 2526G | [config](./configs/ade20k/upernet_internimage_l_640_160k_ade20k.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_l_640_160k_ade20k.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/upernet_internimage_l_640_160k_ade20k.log.json) |
| UperNet | InternImage-XL | 640x640 | 55.0 / 55.3 | 368M | 3142G | [config](./configs/ade20k/upernet_internimage_xl_640_160k_ade20k.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_xl_640_160k_ade20k.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/upernet_internimage_xl_640_160k_ade20k.log.json) |
| UperNet | InternImage-H | 896x896 | 59.9 / 60.3 | 1.12B | 3566G | [config](./configs/ade20k/upernet_internimage_h_896_160k_ade20k.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_h_896_160k_ade20k.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/upernet_internimage_h_896_160k_ade20k.log.json) |
| Mask2Former | InternImage-H | 896x896 | 62.6 / 62.9 | 1.31B | 4635G | [config](./configs/ade20k/mask2former_internimage_h_896_80k_cocostuff2ade20k_ss.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask2former_internimage_h_896_80k_cocostuff2ade20k.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/mask2former_internimage_h_896_80k_cocostuff2ade20k.log.json) |
</div>
</details>
<details>
<summary> Dataset: Cityscapes </summary>
<br>
<div>
| method | backbone | resolution | mIoU (ss/ms) | #params | FLOPs | Config | Download |
| :---------: | :------------: | :--------: | :-----------: | :-----: | :---: | :-------------------------------------------------------------------------------------------: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| UperNet | InternImage-T | 512x1024 | 82.58 / 83.40 | 59M | 1889G | [config](./configs/cityscapes/upernet_internimage_t_512x1024_160k_cityscapes.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_t_512x1024_160k_cityscapes.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/upernet_internimage_t_512x1024_160k_cityscapes.log.json) |
| UperNet | InternImage-S | 512x1024 | 82.74 / 83.45 | 80M | 2035G | [config](./configs/cityscapes/upernet_internimage_s_512x1024_160k_cityscapes.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_s_512x1024_160k_cityscapes.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/upernet_internimage_s_512x1024_160k_cityscapes.log.json) |
| UperNet | InternImage-B | 512x1024 | 83.18 / 83.97 | 128M | 2369G | [config](./configs/cityscapes/upernet_internimage_b_512x1024_160k_cityscapes.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_b_512x1024_160k_cityscapes.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/upernet_internimage_b_512x1024_160k_cityscapes.log.json) |
| UperNet | InternImage-L | 512x1024 | 83.68 / 84.41 | 256M | 3234G | [config](./configs/cityscapes/upernet_internimage_l_512x1024_160k_cityscapes.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_l_512x1024_160k_cityscapes.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/upernet_internimage_l_512x1024_160k_cityscapes.log.json) |
| UperNet\* | InternImage-L | 512x1024 | 85.94 / 86.22 | 256M | 3234G | [config](./configs/cityscapes/upernet_internimage_l_512x1024_160k_mapillary2cityscapes.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_l_512x1024_160k_mapillary2cityscapes.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/upernet_internimage_l_512x1024_160k_mapillary2cityscapes.log.json) |
| UperNet | InternImage-XL | 512x1024 | 83.62 / 84.28 | 368M | 4022G | [config](./configs/cityscapes/upernet_internimage_xl_512x1024_160k_cityscapes.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_xl_512x1024_160k_cityscapes.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/upernet_internimage_xl_512x1024_160k_cityscapes.log.json) |
| UperNet\* | InternImage-XL | 512x1024 | 86.20 / 86.42 | 368M | 4022G | [config](./configs/cityscapes/upernet_internimage_xl_512x1024_160k_mapillary2cityscapes.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_xl_512x1024_160k_mapillary2cityscapes.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/upernet_internimage_xl_512x1024_160k_mapillary2cityscapes.log.json) |
| SegFormer\* | InternImage-L | 512x1024 | 85.16 / 85.67 | 220M | 1580G | [config](./configs/cityscapes/segformer_internimage_l_512x1024_160k_mapillary2cityscapes.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/segformer_internimage_l_512x1024_160k_mapillary2cityscapes.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/segformer_internimage_l_512x1024_160k_mapillary2cityscapes.log.json) |
| SegFormer\* | InternImage-XL | 512x1024 | 85.41 / 85.93 | 330M | 2364G | [config](./configs/cityscapes/segformer_internimage_xl_512x1024_160k_mapillary2cityscapes.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/segformer_internimage_xl_512x1024_160k_mapillary2cityscapes.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/segformer_internimage_xl_512x1024_160k_mapillary2cityscapes.log.json) |
\* denotes models trained with the extra Mapillary dataset.
</div>
</details>
<details>
<summary> Dataset: COCO-Stuff-164K </summary>
<br>
<div>
| method | backbone | resolution | mIoU (ss) | #params | FLOPs | Config | Download |
| :---------: | :-----------: | :--------: | :-------: | :-----: | :---: | :--------------------------------------------------------------------------------------: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| Mask2Former | InternImage-H | 896x896 | 52.6 | 1.31B | 4635G | [config](./configs/coco_stuff164k/mask2former_internimage_h_896_80k_cocostuff164k_ss.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask2former_internimage_h_896_80k_cocostuff164k.pth) \| [log](https://huggingface.co/OpenGVLab/InternImage/raw/main/mask2former_internimage_h_896_80k_cocostuff164k.log.json) |
</div>
</details>
## Evaluation
To evaluate our `InternImage` on ADE20K val, run:
```bash
sh dist_test.sh <config-file> <checkpoint> <gpu-num> --eval mIoU
```
You can download checkpoint files from [here](https://huggingface.co/OpenGVLab/InternImage/tree/fc1e4e7e01c3e7a39a3875bdebb6577a7256ff91), then place them in the `segmentation/pretrained/` directory.
For example, to evaluate `InternImage-T` with a single GPU:
```bash
python test.py configs/ade20k/upernet_internimage_t_512_160k_ade20k.py pretrained/upernet_internimage_t_512_160k_ade20k.pth --eval mIoU
```
For example, to evaluate `InternImage-B` on a single node with 8 GPUs:
```bash
sh dist_test.sh configs/ade20k/upernet_internimage_b_512_160k_ade20k.py pretrained/upernet_internimage_b_512_160k_ade20k.pth 8 --eval mIoU
```
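The tables above report single-scale / multi-scale (ss/ms) mIoU. Multi-scale and flip testing can usually be enabled through MMSegmentation's `--aug-test` flag, assuming the config defines the corresponding test pipeline:

```bash
# multi-scale + flip (ms) evaluation on 8 GPUs
sh dist_test.sh configs/ade20k/upernet_internimage_b_512_160k_ade20k.py pretrained/upernet_internimage_b_512_160k_ade20k.pth 8 --eval mIoU --aug-test
```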
## Training
To train an `InternImage` model on ADE20K, use `dist_train.sh`. For example, to train `InternImage-T` with 8 GPUs on 1 node (total batch size 16), run:

```bash
sh dist_train.sh configs/ade20k/upernet_internimage_t_512_160k_ade20k.py 8
```
## Manage Jobs with Slurm
For example, to train `InternImage-XL` with 8 GPUs on 1 node (total batch size 16), run:
```bash
GPUS=8 sh slurm_train.sh <partition> <job-name> configs/ade20k/upernet_internimage_xl_640_160k_ade20k.py
```
## Image Demo
To run inference on a single image or a directory of images, use the command below. If you pass a directory instead of a single image, every image in the directory will be processed:
```
CUDA_VISIBLE_DEVICES=0 python image_demo.py \
  <image-file-or-dir> <config-file> <checkpoint-file> \
--palette ade20k
```
## Export
First, install `mmdeploy`:
```shell
pip install mmdeploy==0.14.0
```
To export a segmentation model from PyTorch to TensorRT:
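A minimal sketch of the conversion via `mmdeploy`'s generic `tools/deploy.py`; the deploy config and paths below are illustrative assumptions, not a command documented by this repository:

```bash
# Hypothetical mmdeploy TensorRT export sketch; config and paths are assumptions.
python tools/deploy.py \
    configs/mmseg/segmentation_tensorrt_static-512x512.py \
    <model-config> <checkpoint> <demo-image> \
    --work-dir tensorrt_output \
    --device cuda:0
```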
## Introduction
The Common Objects in COntext-stuff (COCO-Stuff) dataset is a scene-understanding dataset for tasks such as semantic segmentation, object detection, and image captioning. It was constructed by adding stuff annotations to the original COCO dataset, which annotated things but neglected stuff. COCO-Stuff-164K contains 164k images spanning 172 categories: 80 things, 91 stuff, and 1 unlabeled class.
## Model Zoo