README.md 5.03 KB
Newer Older
ZhenhangHuang's avatar
ZhenhangHuang committed
1
# InternImage
Zhe Chen's avatar
Zhe Chen committed
2
3
4
5

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/object-detection-on-coco)](https://paperswithcode.com/sota/object-detection-on-coco?p=internimage-exploring-large-scale-vision)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/object-detection-on-coco-minival)](https://paperswithcode.com/sota/object-detection-on-coco-minival?p=internimage-exploring-large-scale-vision)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/semantic-segmentation-on-ade20k)](https://paperswithcode.com/sota/semantic-segmentation-on-ade20k?p=internimage-exploring-large-scale-vision)
ZhenhangHuang's avatar
ZhenhangHuang committed
6
7
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/towards-all-in-one-pre-training-via/object-detection-on-lvis-v1-0-minival)](https://paperswithcode.com/sota/object-detection-on-lvis-v1-0-minival?p=towards-all-in-one-pre-training-via)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/bevformer-v2-adapting-modern-image-backbones/3d-object-detection-on-nuscenes-camera-only)](https://paperswithcode.com/sota/3d-object-detection-on-nuscenes-camera-only?p=bevformer-v2-adapting-modern-image-backbones)
Zhe Chen's avatar
Zhe Chen committed
8
9
10
11
12
13
14
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internimage-exploring-large-scale-vision/image-classification-on-imagenet)](https://paperswithcode.com/sota/image-classification-on-imagenet?p=internimage-exploring-large-scale-vision)

This repository is an official implementation of the [InternImage: Exploring Large-Scale Vision Foundation Models with
Deformable Convolutions](https://arxiv.org/abs/2211.05778).

By Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, Xiaogang Wang, Yu Qiao

ZhenhangHuang's avatar
ZhenhangHuang committed
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
## News

- `Nov 18, 2022`: 🚀 InternImage-XL merged into [BEVFormer v2](https://arxiv.org/abs/2211.10439) achieves stae-of-the-art performance of `63.4 NDS` on nuScenes Camera Only.
- `Nov 10, 2022`: 🚀🚀 InternImage-H achieves a new record `65.4 mAP` on COCO detection test-dev and `62.9 mIoU` on
ADE20K, outperforming previous models by a large margin.

## Coming soon
- [ ] Classification/detection/segmentation code of the InternImage series.
- [ ] InternImage-T/S/B/L/XL ImageNet-1k pretrained model.
- [ ] InternImage-L/XL ImageNet-22k pretrained model.
- [ ] InternImage-T/S/B/L/XL detection and instance segmentation model.
- [ ] InternImage-T/S/B/L/XL semantic segmentation model.

## Introduction

**InternImage**, initially described in [arxiv](https://arxiv.org/abs/2211.05778), can be a general backbone for computer vision.
It takes deformable convolution as the core operator to obtain large effective receptive fields, and introducing adaptive spatial aggregation
to reduces the strict inductive bias. Our model makes it possible to learn more stronger and robust models with large-scale parameters from massive data.

<div align=center>
<img src='./figs/arch.png' width=400>
</div>

## Main Results on ImageNet with Pretrained Models

**ImageNet-1K and ImageNet-22K Pretrained InternImage Models**

| name | pretrain | resolution |acc@1 | #params | FLOPs |
| :---: | :---: | :---: | :---: | :---: | :---: | 
| InternImage-T | ImageNet-1K | 224x224 | 83.5 | 30M | 5G |
| InternImage-S | ImageNet-1K | 224x224 | 84.2 | 50M | 8G |
| InternImage-B | ImageNet-1K | 224x224 | 84.9 | 97M | 16G |
| InternImage-L | ImageNet-22K | 384x384 | 87.7 | 223M | 108G |
| InternImage-XL | ImageNet-22K | 384x384 | 88.0 | 335M | 163G |

## Main Results on Downstream Tasks

**COCO Object Detection**

| backbone | method | lr schedule | box mAP | mask mAP | #params | FLOPs |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| InternImage-T | Mask R-CNN | 1x | 47.2 | 42.5 | 49M | 270G |
| InternImage-S | Mask R-CNN | 1x | 47.8 | 43.3 | 69M | 340G |
| InternImage-B | Mask R-CNN | 1x | 48.8 | 44.0 | 115M | 501G |
| InternImage-L | Cascade Mask R-CNN | 1x | 54.9 | 47.7 | 277M | 1399G |
| InternImage-XL | Cascade Mask R-CNN | 1x | 55.3 | 48.0 | 387M | 1782G |

**ADE20K Semantic Segmentation**
Zhe Chen's avatar
Zhe Chen committed
63

ZhenhangHuang's avatar
ZhenhangHuang committed
64
65
66
67
68
69
70
| backbone | resolution | single scale | multi scale | #params | FLOPs|
| :---: | :---: | :---: | :---: | :---: | :---: | 
| InternImage-T | 512x512 | 47.9 | 48.1 | 59M | 944G | 
| InternImage-S | 512x512 | 50.1 | 50.9 | 80M | 1017G |
| InternImage-B | 512x512 | 50.8 | 51.3 | 128M | 1185G |
| InternImage-L | 640x640 | 53.9 | 54.1 | 256M | 2526G |
| InternImage-XL | 640x640 | 55.0 | 55.3 | 368M | 3142G |
Zhe Chen's avatar
Zhe Chen committed
71
72
73
74
75
76
77
78
79
80
81
82
83

## Citation

If this work is helpful for your research, please consider citing the following BibTeX entry.

```
@article{wang2022internimage,
  title={InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions},
  author={Wang, Wenhai and Dai, Jifeng and Chen, Zhe and Huang, Zhenhang and Li, Zhiqi and Zhu, Xizhou and Hu, Xiaowei and Lu, Tong and Lu, Lewei and Li, Hongsheng and others},
  journal={arXiv preprint arXiv:2211.05778},
  year={2022}
}
```