Commit fe6cdd2e authored by zhe chen

Update huggingface model

Update README.md
parent 3bd2e7b9
@@ -6,3 +6,4 @@ segmentation/convertor/
checkpoint_dir/
demo/
pretrained/
upload.py
\ No newline at end of file
---
license: mit
pipeline_tag: image-classification
library_name: transformers
tags:
- internimage
- custom_code
datasets:
- ILSVRC/imagenet-1k
---
# InternImage Model Card
## Introduction
InternImage is an advanced vision foundation model developed by researchers from Shanghai AI Laboratory, Tsinghua University, and other institutions. Unlike models based on Transformers, InternImage employs DCNv3 as its core operator. This approach equips the model with dynamic and effective receptive fields required for downstream tasks like object detection and segmentation, while enabling adaptive spatial aggregation.
<div style="text-align: center;"> <img src="https://github.com/OpenGVLab/InternImage/raw/master/docs/figs/arch.png" style="width:60%;" /> </div>
## Performance
- InternImage achieves a Top-1 accuracy of 90.1% on the ImageNet classification benchmark using only publicly available training data. Apart from two undisclosed models trained with additional private datasets by Google and Microsoft, it is the only open-source model to exceed 90.0% Top-1 accuracy, and it is also the largest such model in scale.
- InternImage achieves 65.5 mAP on the COCO object detection benchmark, outperforming all other models and remaining the only one to surpass 65 mAP.
- InternImage also delivers the world's best performance on 16 other important visual benchmarks covering classification, detection, and segmentation, making it a top-performing model across multiple domains.
## Released Models
### Open‑Source Visual Pretrained Models
| huggingface name | model name | pretrain | resolution | #param |
| :-------------------------------------------------------------------------------------------: | :------------: | :------------------: | :--------: | :----: |
| [internimage_l_22k_384](https://huggingface.co/OpenGVLab/internimage_l_22k_384) | InternImage-L | IN-22K | 384x384 | 223M |
| [internimage_xl_22k_384](https://huggingface.co/OpenGVLab/internimage_xl_22k_384) | InternImage-XL | IN-22K | 384x384 | 335M |
| [internimage_h_jointto22k_384](https://huggingface.co/OpenGVLab/internimage_h_jointto22k_384) | InternImage-H | Joint 427M -> IN-22K | 384x384 | 1.08B |
| [internimage_g_jointto22k_384](https://huggingface.co/OpenGVLab/internimage_g_jointto22k_384) | InternImage-G | Joint 427M -> IN-22K | 384x384 | 3B |
### ImageNet-1K Image Classification
| huggingface name | model name | pretrain | resolution | acc@1 | #param | FLOPs |
| :---------------------------------------------------------------------------------------: | :------------: | :------------------: | :--------: | :---: | :----: | :---: |
| [internimage_t_1k_224](https://huggingface.co/OpenGVLab/internimage_t_1k_224) | InternImage-T | IN-1K | 224x224 | 83.5 | 30M | 5G |
| [internimage_s_1k_224](https://huggingface.co/OpenGVLab/internimage_s_1k_224) | InternImage-S | IN-1K | 224x224 | 84.2 | 50M | 8G |
| [internimage_b_1k_224](https://huggingface.co/OpenGVLab/internimage_b_1k_224) | InternImage-B | IN-1K | 224x224 | 84.9 | 97M | 16G |
| [internimage_l_22kto1k_384](https://huggingface.co/OpenGVLab/internimage_l_22kto1k_384) | InternImage-L | IN-22K | 384x384 | 87.7 | 223M | 108G |
| [internimage_xl_22kto1k_384](https://huggingface.co/OpenGVLab/internimage_xl_22kto1k_384) | InternImage-XL | IN-22K | 384x384 | 88.0 | 335M | 163G |
| [internimage_h_22kto1k_640](https://huggingface.co/OpenGVLab/internimage_h_22kto1k_640) | InternImage-H | Joint 427M -> IN-22K | 640x640 | 89.6 | 1.08B | 1478G |
| [internimage_g_22kto1k_512](https://huggingface.co/OpenGVLab/internimage_g_22kto1k_512) | InternImage-G | Joint 427M -> IN-22K | 512x512 | 90.1 | 3B | 2700G |
## DCNv3 CUDA Kernel Installation
If you do not install the CUDA version of DCNv3, InternImage will automatically fall back to a PyTorch implementation. However, the CUDA implementation can significantly reduce GPU memory usage and improve inference efficiency.
**Installation Tutorial:**
1. Open your terminal and run:
```bash
git clone https://github.com/OpenGVLab/InternImage.git
cd InternImage/classification/ops_dcnv3
```
2. Make sure you have an available GPU for compilation, then run:
```bash
sh make.sh
```
This will compile the CUDA version of DCNv3. Once installed, InternImage will automatically leverage the optimized CUDA implementation for better performance.
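To double-check which path will be used at runtime, the quick sketch below can help. It is an informal check rather than an official utility, and it assumes the compiled extension is importable under the name `DCNv3` (the package built by `make.sh`); if the import fails, InternImage silently uses the pure-PyTorch fallback.
```python
# Hedged check: see whether the compiled DCNv3 CUDA extension is importable
# in the current environment.
try:
    import DCNv3  # noqa: F401  (extension name assumed from ops_dcnv3/make.sh)
    print("DCNv3 CUDA kernel found: the optimized path will be used.")
except ImportError:
    print("DCNv3 CUDA kernel not found: falling back to the PyTorch implementation.")
```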
## Usage with Transformers
Below are two usage examples for InternImage with the Transformers framework:
### Example 1: Using InternImage as an Image Backbone
```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
# Replace 'model_name' with the appropriate model identifier
model_name = "OpenGVLab/internimage_t_1k_224" # example model
# Prepare the image
image_path = 'img.png'
image_processor = CLIPImageProcessor.from_pretrained(model_name)
image = Image.open(image_path)
image = image_processor(images=image, return_tensors='pt').pixel_values
print('image shape:', image.shape)
# Load the model as a backbone
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
# 'hidden_states' contains the outputs from the 4 stages of the InternImage backbone
hidden_states = model(image).hidden_states
```
### Example 2: Using InternImage for Image Classification
```python
import torch
from PIL import Image
from transformers import AutoModelForImageClassification, CLIPImageProcessor
# Replace 'model_name' with the appropriate model identifier
model_name = "OpenGVLab/internimage_t_1k_224" # example model
# Prepare the image
image_path = 'img.png'
image_processor = CLIPImageProcessor.from_pretrained(model_name)
image = Image.open(image_path)
image = image_processor(images=image, return_tensors='pt').pixel_values
print('image shape:', image.shape)
# Load the model as an image classifier
model = AutoModelForImageClassification.from_pretrained(model_name, trust_remote_code=True)
logits = model(image).logits
label = torch.argmax(logits, dim=1)
print("Predicted label:", label.item())
```
## Citation
If this work is helpful for your research, please consider citing the following BibTeX entry.
```bibtex
@inproceedings{wang2023internimage,
title={Internimage: Exploring large-scale vision foundation models with deformable convolutions},
author={Wang, Wenhai and Dai, Jifeng and Chen, Zhe and Huang, Zhenhang and Li, Zhiqi and Zhu, Xizhou and Hu, Xiaowei and Lu, Tong and Lu, Lewei and Li, Hongsheng and others},
booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition},
pages={14408--14419},
year={2023}
}
```
{
"_name_or_path": "OpenGVLab/internimage_g_jointto22k_384",
"act_layer": "GELU",
"architectures": [
"InternImageModel"
],
"auto_map": {
"AutoConfig": "configuration_internimage.InternImageConfig",
"AutoModel": "modeling_internimage.InternImageModel",
"AutoModelForImageClassification": "modeling_internimage.InternImageModelForImageClassification"
},
"center_feature_scale": true,
"channels": 512,
"cls_scale": 1.5,
"core_op": "DCNv3",
"depths": [
2,
2,
48,
4
],
"drop_path_rate": 0.0,
"drop_path_type": "linear",
"drop_rate": 0.0,
"dw_kernel_size": 5,
"groups": [
16,
32,
64,
128
],
"layer_scale": null,
"level2_post_norm": true,
"level2_post_norm_block_ids": [
5,
11,
17,
23,
29,
35,
41,
47
],
"mlp_ratio": 4.0,
"model_type": "internimage",
"norm_layer": "LN",
"num_classes": 21841,
"offset_scale": 1.0,
"post_norm": true,
"remove_center": false,
"res_post_norm": false,
"torch_dtype": "float32",
"transformers_version": "4.37.2",
"use_clip_projector": true,
"with_cp": false
}
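The JSON above is the serialized `InternImageConfig` for this checkpoint. As a hedged sketch (not part of the repository), the hyperparameters can be inspected without downloading the weights:
```python
from transformers import AutoConfig

# trust_remote_code is required because the config class ships with the repo.
config = AutoConfig.from_pretrained(
    "OpenGVLab/internimage_g_jointto22k_384", trust_remote_code=True)
print(config.depths, config.groups, config.channels)  # [2, 2, 48, 4] [16, 32, 64, 128] 512
```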
# --------------------------------------------------------
# InternImage
# Copyright (c) 2025 OpenGVLab
# Licensed under The MIT License [see LICENSE for details]
# --------------------------------------------------------
from transformers import PretrainedConfig
class InternImageConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`~InternImageModel`].
It is used to instantiate an internimage model according to the specified arguments, defining the model
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
the internimage [OpenGVLab/internimage](https://huggingface.co/OpenGVLab/internimage) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used
to control the model outputs. Read the documentation from [`PretrainedConfig`]
for more information.
Args:
core_op (`str`, *optional*, defaults to `"DCNv3"`):
Core operation used in the InternImageModel.
depths (`tuple`, *optional*, defaults to `(4, 4, 18, 4)`):
Tuple specifying the depth of layers in the InternImageModel.
groups (`tuple`, *optional*, defaults to `(4, 8, 16, 32)`):
Tuple specifying the group of layers in the InternImageModel.
channels (`int`, *optional*, defaults to `64`):
Number of channels in the InternImageModel.
dw_kernel_size (`int`, *optional*, defaults to `None`):
Kernel size for depthwise convolutions.
layer_scale (`float`, *optional*, defaults to `None`):
Scale of the layers in the model.
offset_scale (`float`, *optional*, defaults to `1.0`):
Offset scale in the model.
mlp_ratio (`float`, *optional*, defaults to `4.0`):
Ratio of mlp layers in the InternImageModel.
post_norm (`bool`, *optional*, defaults to `False`):
Whether to use post normalization in the model.
level2_post_norm (`bool`, *optional*, defaults to `False`):
Whether to use level 2 post normalization.
level2_post_norm_block_ids (`list`, *optional*, defaults to `None`):
Specific block IDs for level 2 post normalization.
center_feature_scale (`bool`, *optional*, defaults to `False`):
Whether to apply center feature scaling.
use_clip_projector (`bool`, *optional*, defaults to `False`):
Whether to use CLIP projector.
remove_center (`bool`, *optional*, defaults to `False`):
Whether to remove center pixels in some operations.
num_classes (`int`, *optional*, defaults to `1000`):
Number of classes for the model output.
drop_rate (`float`, *optional*, defaults to `0.0`):
Dropout rate in the model.
drop_path_rate (`float`, *optional*, defaults to `0.0`):
Dropout path rate in the model.
drop_path_type (`str`, *optional*, defaults to `"linear"`):
Type of dropout path used in the model.
act_layer (`str`, *optional*, defaults to `"GELU"`):
Activation function used in the model.
norm_layer (`str`, *optional*, defaults to `"LN"`):
Normalization layer used in the model.
cls_scale (`float`, *optional*, defaults to `1.5`):
Scale of the classification layer in the model.
with_cp (`bool`, *optional*, defaults to `False`):
Whether to use checkpointing in the model.
"""
model_type = 'internimage'
def __init__(
self,
core_op='DCNv3',
depths=(4, 4, 18, 4),
groups=(4, 8, 16, 32),
channels=64,
dw_kernel_size=None,
layer_scale=None,
offset_scale=1.0,
mlp_ratio=4.0,
post_norm=False,
res_post_norm=False,
level2_post_norm=False,
level2_post_norm_block_ids=None,
center_feature_scale=False,
use_clip_projector=False,
remove_center=False,
num_classes=1000,
drop_rate=0.0,
drop_path_rate=0.0,
drop_path_type='linear',
act_layer='GELU',
norm_layer='LN',
cls_scale=1.5,
with_cp=False,
**kwargs,
):
super().__init__(**kwargs)
# Model configuration parameters
self.core_op = core_op
self.depths = depths
self.groups = groups
self.channels = channels
self.dw_kernel_size = dw_kernel_size
self.layer_scale = layer_scale
self.offset_scale = offset_scale
self.mlp_ratio = mlp_ratio
self.post_norm = post_norm
self.res_post_norm = res_post_norm
self.level2_post_norm = level2_post_norm
self.level2_post_norm_block_ids = level2_post_norm_block_ids
self.center_feature_scale = center_feature_scale
self.use_clip_projector = use_clip_projector
self.remove_center = remove_center
self.num_classes = num_classes
self.drop_rate = drop_rate
self.drop_path_rate = drop_path_rate
self.drop_path_type = drop_path_type
self.act_layer = act_layer
self.norm_layer = norm_layer
self.cls_scale = cls_scale
self.with_cp = with_cp
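As a usage sketch (assuming the file above is saved as `configuration_internimage.py`, matching the `auto_map` entry in config.json), a custom configuration can also be built directly:
```python
from configuration_internimage import InternImageConfig

# Hypothetical example values; the defaults correspond to a small InternImage variant.
config = InternImageConfig(
    depths=(4, 4, 18, 4),    # blocks per stage
    groups=(4, 8, 16, 32),   # DCNv3 groups per stage
    channels=64,             # base channel width of the first stage
    num_classes=1000,
)
print(config)
```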
# --------------------------------------------------------
# InternImage
# Copyright (c) 2025 OpenGVLab
# Licensed under The MIT License [see LICENSE for details]
# --------------------------------------------------------
from __future__ import absolute_import, division, print_function
import warnings
import torch
import torch.nn.functional as F
from torch import nn
from torch.nn.init import constant_, xavier_uniform_
from .dcnv3_func import DCNv3Function, dcnv3_core_pytorch, has_cuda_kernel
class to_channels_first(nn.Module):
def __init__(self):
super().__init__()
def forward(self, x):
return x.permute(0, 3, 1, 2)
class to_channels_last(nn.Module):
def __init__(self):
super().__init__()
def forward(self, x):
return x.permute(0, 2, 3, 1)
def build_norm_layer(dim,
norm_layer,
in_format='channels_last',
out_format='channels_last',
eps=1e-6):
layers = []
if norm_layer == 'BN':
if in_format == 'channels_last':
layers.append(to_channels_first())
layers.append(nn.BatchNorm2d(dim))
if out_format == 'channels_last':
layers.append(to_channels_last())
elif norm_layer == 'LN':
if in_format == 'channels_first':
layers.append(to_channels_last())
layers.append(nn.LayerNorm(dim, eps=eps))
if out_format == 'channels_first':
layers.append(to_channels_first())
else:
raise NotImplementedError(
f'build_norm_layer does not support {norm_layer}')
return nn.Sequential(*layers)
def build_act_layer(act_layer):
if act_layer == 'ReLU':
return nn.ReLU(inplace=True)
elif act_layer == 'SiLU':
return nn.SiLU(inplace=True)
elif act_layer == 'GELU':
return nn.GELU()
raise NotImplementedError(f'build_act_layer does not support {act_layer}')
def _is_power_of_2(n):
if (not isinstance(n, int)) or (n < 0):
raise ValueError(
'invalid input for _is_power_of_2: {} (type: {})'.format(n, type(n)))
return (n & (n - 1) == 0) and n != 0
class CenterFeatureScaleModule(nn.Module):
def forward(self,
query,
center_feature_scale_proj_weight,
center_feature_scale_proj_bias):
center_feature_scale = F.linear(query,
weight=center_feature_scale_proj_weight,
bias=center_feature_scale_proj_bias).sigmoid()
return center_feature_scale
class DCNv3_pytorch(nn.Module):
def __init__(
self,
channels=64,
kernel_size=3,
dw_kernel_size=None,
stride=1,
pad=1,
dilation=1,
group=4,
offset_scale=1.0,
act_layer='GELU',
norm_layer='LN',
center_feature_scale=False,
remove_center=False,
):
"""
DCNv3 Module
:param channels
:param kernel_size
:param stride
:param pad
:param dilation
:param group
:param offset_scale
:param act_layer
:param norm_layer
"""
super().__init__()
if channels % group != 0:
raise ValueError(
f'channels must be divisible by group, but got {channels} and {group}')
_d_per_group = channels // group
dw_kernel_size = dw_kernel_size if dw_kernel_size is not None else kernel_size
# you'd better set _d_per_group to a power of 2 which is more efficient in our CUDA implementation
if not _is_power_of_2(_d_per_group):
warnings.warn(
"You'd better set channels in DCNv3 to make the dimension of each attention head a power of 2 "
'which is more efficient in our CUDA implementation.')
self.offset_scale = offset_scale
self.channels = channels
self.kernel_size = kernel_size
self.dw_kernel_size = dw_kernel_size
self.stride = stride
self.dilation = dilation
self.pad = pad
self.group = group
self.group_channels = channels // group
self.offset_scale = offset_scale
self.center_feature_scale = center_feature_scale
self.remove_center = int(remove_center)
self.dw_conv = nn.Sequential(
nn.Conv2d(
channels,
channels,
kernel_size=dw_kernel_size,
stride=1,
padding=(dw_kernel_size - 1) // 2,
groups=channels),
build_norm_layer(
channels,
norm_layer,
'channels_first',
'channels_last'),
build_act_layer(act_layer))
self.offset = nn.Linear(
channels,
group * (kernel_size * kernel_size - remove_center) * 2)
self.mask = nn.Linear(
channels,
group * (kernel_size * kernel_size - remove_center))
self.input_proj = nn.Linear(channels, channels)
self.output_proj = nn.Linear(channels, channels)
self._reset_parameters()
if center_feature_scale:
self.center_feature_scale_proj_weight = nn.Parameter(
torch.zeros((group, channels), dtype=torch.float))
self.center_feature_scale_proj_bias = nn.Parameter(
torch.tensor(0.0, dtype=torch.float).view((1,)).repeat(group, ))
self.center_feature_scale_module = CenterFeatureScaleModule()
def _reset_parameters(self):
constant_(self.offset.weight.data, 0.)
constant_(self.offset.bias.data, 0.)
constant_(self.mask.weight.data, 0.)
constant_(self.mask.bias.data, 0.)
xavier_uniform_(self.input_proj.weight.data)
constant_(self.input_proj.bias.data, 0.)
xavier_uniform_(self.output_proj.weight.data)
constant_(self.output_proj.bias.data, 0.)
def forward(self, input):
"""
:param query (N, H, W, C)
:return output (N, H, W, C)
"""
N, H, W, _ = input.shape
x = self.input_proj(input)
x_proj = x
x1 = input.permute(0, 3, 1, 2)
x1 = self.dw_conv(x1)
offset = self.offset(x1)
mask = self.mask(x1).reshape(N, H, W, self.group, -1)
mask = F.softmax(mask, -1).reshape(N, H, W, -1)
x = dcnv3_core_pytorch(
x, offset, mask,
self.kernel_size, self.kernel_size,
self.stride, self.stride,
self.pad, self.pad,
self.dilation, self.dilation,
self.group, self.group_channels,
self.offset_scale, self.remove_center)
if self.center_feature_scale:
center_feature_scale = self.center_feature_scale_module(
x1, self.center_feature_scale_proj_weight, self.center_feature_scale_proj_bias)
# N, H, W, groups -> N, H, W, groups, 1 -> N, H, W, groups, _d_per_group -> N, H, W, channels
center_feature_scale = center_feature_scale[..., None].repeat(
1, 1, 1, 1, self.channels // self.group).flatten(-2)
x = x * (1 - center_feature_scale) + x_proj * center_feature_scale
x = self.output_proj(x)
return x
class DCNv3(nn.Module):
def __init__(
self,
channels=64,
kernel_size=3,
dw_kernel_size=None,
stride=1,
pad=1,
dilation=1,
group=4,
offset_scale=1.0,
act_layer='GELU',
norm_layer='LN',
center_feature_scale=False,
remove_center=False,
):
"""
DCNv3 Module
:param channels
:param kernel_size
:param stride
:param pad
:param dilation
:param group
:param offset_scale
:param act_layer
:param norm_layer
"""
super().__init__()
if channels % group != 0:
raise ValueError(
f'channels must be divisible by group, but got {channels} and {group}')
_d_per_group = channels // group
dw_kernel_size = dw_kernel_size if dw_kernel_size is not None else kernel_size
# you'd better set _d_per_group to a power of 2 which is more efficient in our CUDA implementation
if not _is_power_of_2(_d_per_group):
warnings.warn(
"You'd better set channels in DCNv3 to make the dimension of each attention head a power of 2 "
'which is more efficient in our CUDA implementation.')
self.offset_scale = offset_scale
self.channels = channels
self.kernel_size = kernel_size
self.dw_kernel_size = dw_kernel_size
self.stride = stride
self.dilation = dilation
self.pad = pad
self.group = group
self.group_channels = channels // group
self.offset_scale = offset_scale
self.center_feature_scale = center_feature_scale
self.remove_center = int(remove_center)
if self.remove_center and self.kernel_size % 2 == 0:
raise ValueError('remove_center is only compatible with odd kernel size.')
self.dw_conv = nn.Sequential(
nn.Conv2d(
channels,
channels,
kernel_size=dw_kernel_size,
stride=1,
padding=(dw_kernel_size - 1) // 2,
groups=channels),
build_norm_layer(
channels,
norm_layer,
'channels_first',
'channels_last'),
build_act_layer(act_layer))
self.offset = nn.Linear(
channels,
group * (kernel_size * kernel_size - remove_center) * 2)
self.mask = nn.Linear(
channels,
group * (kernel_size * kernel_size - remove_center))
self.input_proj = nn.Linear(channels, channels)
self.output_proj = nn.Linear(channels, channels)
self._reset_parameters()
if center_feature_scale:
self.center_feature_scale_proj_weight = nn.Parameter(
torch.zeros((group, channels), dtype=torch.float))
self.center_feature_scale_proj_bias = nn.Parameter(
torch.tensor(0.0, dtype=torch.float).view((1,)).repeat(group, ))
self.center_feature_scale_module = CenterFeatureScaleModule()
def _reset_parameters(self):
constant_(self.offset.weight.data, 0.)
constant_(self.offset.bias.data, 0.)
constant_(self.mask.weight.data, 0.)
constant_(self.mask.bias.data, 0.)
xavier_uniform_(self.input_proj.weight.data)
constant_(self.input_proj.bias.data, 0.)
xavier_uniform_(self.output_proj.weight.data)
constant_(self.output_proj.bias.data, 0.)
def forward(self, input):
"""
:param query (N, H, W, C)
:return output (N, H, W, C)
"""
N, H, W, _ = input.shape
x = self.input_proj(input)
x_proj = x
dtype = x.dtype
x1 = input.permute(0, 3, 1, 2)
x1 = self.dw_conv(x1)
offset = self.offset(x1)
mask = self.mask(x1).reshape(N, H, W, self.group, -1)
mask = F.softmax(mask, -1)
mask = mask.reshape(N, H, W, -1).type(dtype)
x = DCNv3Function.apply(
x, offset, mask,
self.kernel_size, self.kernel_size,
self.stride, self.stride,
self.pad, self.pad,
self.dilation, self.dilation,
self.group, self.group_channels,
self.offset_scale,
256,
self.remove_center)
if self.center_feature_scale:
center_feature_scale = self.center_feature_scale_module(
x1, self.center_feature_scale_proj_weight, self.center_feature_scale_proj_bias)
# N, H, W, groups -> N, H, W, groups, 1 -> N, H, W, groups, _d_per_group -> N, H, W, channels
center_feature_scale = center_feature_scale[..., None].repeat(
1, 1, 1, 1, self.channels // self.group).flatten(-2)
x = x * (1 - center_feature_scale) + x_proj * center_feature_scale
x = self.output_proj(x)
return x
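A hedged smoke test for the pure-PyTorch module above (it assumes `DCNv3_pytorch` is in scope, e.g. because the file is imported as part of the model package via trust_remote_code; only the CPU fallback path is exercised, so the compiled kernel is not required):
```python
import torch

# Smoke test: DCNv3 operates on channels-last tensors of shape (N, H, W, C).
m = DCNv3_pytorch(channels=64, kernel_size=3, group=4)  # group_channels = 16 (power of 2)
x = torch.randn(2, 8, 8, 64)
y = m(x)
print(y.shape)  # expected: torch.Size([2, 8, 8, 64])
```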
# --------------------------------------------------------
# InternImage
# Copyright (c) 2025 OpenGVLab
# Licensed under The MIT License [see LICENSE for details]
# --------------------------------------------------------
from __future__ import absolute_import, division, print_function
import pkg_resources
import torch
import torch.nn.functional as F
from torch.autograd import Function
from torch.autograd.function import once_differentiable
from torch.cuda.amp import custom_bwd, custom_fwd
try:
    # The compiled CUDA extension; if it is missing, the model falls back to
    # the pure-PyTorch implementation.
    import DCNv3
    dcn_version = float(pkg_resources.get_distribution('DCNv3').version)
    has_cuda_kernel = True
except Exception:
    has_cuda_kernel = False
class DCNv3Function(Function):
@staticmethod
@custom_fwd
def forward(
ctx, input, offset, mask,
kernel_h, kernel_w, stride_h, stride_w,
pad_h, pad_w, dilation_h, dilation_w,
group, group_channels, offset_scale, im2col_step, remove_center):
ctx.kernel_h = kernel_h
ctx.kernel_w = kernel_w
ctx.stride_h = stride_h
ctx.stride_w = stride_w
ctx.pad_h = pad_h
ctx.pad_w = pad_w
ctx.dilation_h = dilation_h
ctx.dilation_w = dilation_w
ctx.group = group
ctx.group_channels = group_channels
ctx.offset_scale = offset_scale
ctx.im2col_step = im2col_step
ctx.remove_center = remove_center
args = [
input, offset, mask, kernel_h,
kernel_w, stride_h, stride_w, pad_h,
pad_w, dilation_h, dilation_w, group,
group_channels, offset_scale, ctx.im2col_step
]
if remove_center or dcn_version > 1.0:
args.append(remove_center)
output = DCNv3.dcnv3_forward(*args)
ctx.save_for_backward(input, offset, mask)
return output
@staticmethod
@once_differentiable
@custom_bwd
def backward(ctx, grad_output):
input, offset, mask = ctx.saved_tensors
args = [
input, offset, mask, ctx.kernel_h,
ctx.kernel_w, ctx.stride_h, ctx.stride_w, ctx.pad_h,
ctx.pad_w, ctx.dilation_h, ctx.dilation_w, ctx.group,
ctx.group_channels, ctx.offset_scale, grad_output.contiguous(), ctx.im2col_step
]
if ctx.remove_center or dcn_version > 1.0:
args.append(ctx.remove_center)
grad_input, grad_offset, grad_mask = \
DCNv3.dcnv3_backward(*args)
return grad_input, grad_offset, grad_mask, \
None, None, None, None, None, None, None, None, None, None, None, None, None
@staticmethod
def symbolic(g, input, offset, mask, kernel_h, kernel_w, stride_h,
stride_w, pad_h, pad_w, dilation_h, dilation_w, group,
group_channels, offset_scale, im2col_step, remove_center):
"""Symbolic function for mmdeploy::DCNv3.
Returns:
DCNv3 op for onnx.
"""
return g.op(
'mmdeploy::TRTDCNv3',
input,
offset,
mask,
kernel_h_i=int(kernel_h),
kernel_w_i=int(kernel_w),
stride_h_i=int(stride_h),
stride_w_i=int(stride_w),
pad_h_i=int(pad_h),
pad_w_i=int(pad_w),
dilation_h_i=int(dilation_h),
dilation_w_i=int(dilation_w),
group_i=int(group),
group_channels_i=int(group_channels),
offset_scale_f=float(offset_scale),
im2col_step_i=int(im2col_step),
remove_center_i=int(remove_center),
)
def _get_reference_points(spatial_shapes, device, kernel_h, kernel_w, dilation_h, dilation_w, pad_h=0, pad_w=0, stride_h=1, stride_w=1):
_, H_, W_, _ = spatial_shapes
H_out = (H_ - (dilation_h * (kernel_h - 1) + 1)) // stride_h + 1
W_out = (W_ - (dilation_w * (kernel_w - 1) + 1)) // stride_w + 1
ref_y, ref_x = torch.meshgrid(
torch.linspace(
# pad_h + 0.5,
# H_ - pad_h - 0.5,
(dilation_h * (kernel_h - 1)) // 2 + 0.5,
(dilation_h * (kernel_h - 1)) // 2 + 0.5 + (H_out - 1) * stride_h,
H_out,
dtype=torch.float32,
device=device),
torch.linspace(
# pad_w + 0.5,
# W_ - pad_w - 0.5,
(dilation_w * (kernel_w - 1)) // 2 + 0.5,
(dilation_w * (kernel_w - 1)) // 2 + 0.5 + (W_out - 1) * stride_w,
W_out,
dtype=torch.float32,
device=device))
ref_y = ref_y.reshape(-1)[None] / H_
ref_x = ref_x.reshape(-1)[None] / W_
ref = torch.stack((ref_x, ref_y), -1).reshape(
1, H_out, W_out, 1, 2)
return ref
def _generate_dilation_grids(spatial_shapes, kernel_h, kernel_w, dilation_h, dilation_w, group, device):
_, H_, W_, _ = spatial_shapes
points_list = []
x, y = torch.meshgrid(
torch.linspace(
-((dilation_w * (kernel_w - 1)) // 2),
-((dilation_w * (kernel_w - 1)) // 2) + (kernel_w - 1) * dilation_w,
kernel_w,
dtype=torch.float32,
device=device),
torch.linspace(
-((dilation_h * (kernel_h - 1)) // 2),
-((dilation_h * (kernel_h - 1)) // 2) + (kernel_h - 1) * dilation_h,
kernel_h,
dtype=torch.float32,
device=device))
points_list.extend([x / W_, y / H_])
grid = torch.stack(points_list, -1).reshape(-1, 1, 2).\
repeat(1, group, 1).permute(1, 0, 2)
grid = grid.reshape(1, 1, 1, group * kernel_h * kernel_w, 2)
return grid
def remove_center_sampling_locations(sampling_locations, kernel_w, kernel_h):
idx = list(range(sampling_locations.shape[-2]))
C = (kernel_w * kernel_h - 1)//2
idx = [i for i in idx if i != C and (i-C) % (C*2+1) != 0]
sampling_locations = sampling_locations[:,:,:,idx, :]
return sampling_locations
def dcnv3_core_pytorch(
input, offset, mask, kernel_h,
kernel_w, stride_h, stride_w, pad_h,
pad_w, dilation_h, dilation_w, group,
group_channels, offset_scale, remove_center):
# for debug and test only,
# need to use cuda version instead
if remove_center and (kernel_h % 2 == 0 or kernel_w % 2 == 0 or kernel_w != kernel_h):
raise ValueError('remove_center is only compatible with square odd kernel size.')
input = F.pad(
input,
[0, 0, pad_h, pad_h, pad_w, pad_w])
N_, H_in, W_in, _ = input.shape
_, H_out, W_out, _ = offset.shape
ref = _get_reference_points(
input.shape, input.device, kernel_h, kernel_w, dilation_h, dilation_w, pad_h, pad_w, stride_h, stride_w)
grid = _generate_dilation_grids(
input.shape, kernel_h, kernel_w, dilation_h, dilation_w, group, input.device)
spatial_norm = torch.tensor([W_in, H_in]).reshape(1, 1, 1, 2).\
repeat(1, 1, 1, group*(kernel_h*kernel_w-remove_center)).to(input.device)
sampling_locations = (ref + grid * offset_scale).repeat(N_, 1, 1, 1, 1)
if remove_center:
sampling_locations = remove_center_sampling_locations(sampling_locations, kernel_w=kernel_w, kernel_h=kernel_h)
sampling_locations = sampling_locations.flatten(3, 4)
sampling_locations = sampling_locations + offset * offset_scale / spatial_norm
P_ = kernel_h * kernel_w - remove_center
sampling_grids = 2 * sampling_locations - 1
# N_, H_in, W_in, group*group_channels -> N_, H_in*W_in, group*group_channels -> N_, group*group_channels, H_in*W_in -> N_*group, group_channels, H_in, W_in
input_ = input.view(N_, H_in*W_in, group*group_channels).transpose(1, 2).\
reshape(N_*group, group_channels, H_in, W_in)
# N_, H_out, W_out, group*P_*2 -> N_, H_out*W_out, group, P_, 2 -> N_, group, H_out*W_out, P_, 2 -> N_*group, H_out*W_out, P_, 2
sampling_grid_ = sampling_grids.view(N_, H_out*W_out, group, P_, 2).transpose(1, 2).\
flatten(0, 1)
# N_*group, group_channels, H_out*W_out, P_
sampling_input_ = F.grid_sample(
input_, sampling_grid_, mode='bilinear', padding_mode='zeros', align_corners=False)
# (N_, H_out, W_out, group*P_) -> N_, H_out*W_out, group, P_ -> (N_, group, H_out*W_out, P_) -> (N_*group, 1, H_out*W_out, P_)
mask = mask.view(N_, H_out*W_out, group, P_).transpose(1, 2).\
reshape(N_*group, 1, H_out*W_out, P_)
output = (sampling_input_ * mask).sum(-1).view(N_,
group*group_channels, H_out*W_out)
return output.transpose(1, 2).reshape(N_, H_out, W_out, -1).contiguous()
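To make the tensor shapes of the reference implementation concrete, here is a hedged, CPU-only sketch (it assumes `dcnv3_core_pytorch` from the file above is in scope). With zero offsets the sampling grid is just the regular dilated kernel grid, and a uniform mask spreads the aggregation weight evenly over the K*K points.
```python
import torch

# Shapes follow the conventions above: C = group * group_channels,
# offset has group*K*K*2 channels and mask has group*K*K channels.
N, H, W, G, GC, K = 1, 8, 8, 2, 8, 3
x      = torch.randn(N, H, W, G * GC)
offset = torch.zeros(N, H, W, G * K * K * 2)               # zero offsets: regular grid
mask   = torch.full((N, H, W, G * K * K), 1.0 / (K * K))   # uniform aggregation weights
out = dcnv3_core_pytorch(x, offset, mask,
                         K, K,    # kernel_h, kernel_w
                         1, 1,    # stride_h, stride_w
                         1, 1,    # pad_h, pad_w
                         1, 1,    # dilation_h, dilation_w
                         G, GC, 1.0, 0)
print(out.shape)  # torch.Size([1, 8, 8, 16])
```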
{
"crop_size": 384,
"do_center_crop": true,
"do_normalize": true,
"do_resize": true,
"feature_extractor_type": "CLIPFeatureExtractor",
"image_mean": [
0.485,
0.456,
0.406
],
"image_std": [
0.229,
0.224,
0.225
],
"resample": 3,
"size": 384
}
---
license: mit
pipeline_tag: image-classification
library_name: transformers
tags:
- internimage
- custom_code
datasets:
- ILSVRC/imagenet-1k
---
# InternImage Model Card
## Introduction
InternImage is an advanced vision foundation model developed by researchers from Shanghai AI Laboratory, Tsinghua University, and other institutions. Unlike models based on Transformers, InternImage employs DCNv3 as its core operator. This approach equips the model with dynamic and effective receptive fields required for downstream tasks like object detection and segmentation, while enabling adaptive spatial aggregation.
<div style="text-align: center;"> <img src="https://github.com/OpenGVLab/InternImage/raw/master/docs/figs/arch.png" style="width:60%;" /> </div>
## Performance
- InternImage achieves a Top-1 accuracy of 90.1% on the ImageNet classification benchmark using only publicly available training data. Apart from two undisclosed models trained with additional private datasets by Google and Microsoft, it is the only open-source model to exceed 90.0% Top-1 accuracy, and it is also the largest such model in scale.
- InternImage achieves 65.5 mAP on the COCO object detection benchmark, outperforming all other models and remaining the only one to surpass 65 mAP.
- InternImage also delivers the world's best performance on 16 other important visual benchmarks covering classification, detection, and segmentation, making it a top-performing model across multiple domains.
## Released Models
### Open‑Source Visual Pretrained Models
| huggingface name | model name | pretrain | resolution | #param |
| :-------------------------------------------------------------------------------------------: | :------------: | :------------------: | :--------: | :----: |
| [internimage_l_22k_384](https://huggingface.co/OpenGVLab/internimage_l_22k_384) | InternImage-L | IN-22K | 384x384 | 223M |
| [internimage_xl_22k_384](https://huggingface.co/OpenGVLab/internimage_xl_22k_384) | InternImage-XL | IN-22K | 384x384 | 335M |
| [internimage_h_jointto22k_384](https://huggingface.co/OpenGVLab/internimage_h_jointto22k_384) | InternImage-H | Joint 427M -> IN-22K | 384x384 | 1.08B |
| [internimage_g_jointto22k_384](https://huggingface.co/OpenGVLab/internimage_g_jointto22k_384) | InternImage-G | Joint 427M -> IN-22K | 384x384 | 3B |
### ImageNet-1K Image Classification
| huggingface name | model name | pretrain | resolution | acc@1 | #param | FLOPs |
| :---------------------------------------------------------------------------------------: | :------------: | :------------------: | :--------: | :---: | :----: | :---: |
| [internimage_t_1k_224](https://huggingface.co/OpenGVLab/internimage_t_1k_224) | InternImage-T | IN-1K | 224x224 | 83.5 | 30M | 5G |
| [internimage_s_1k_224](https://huggingface.co/OpenGVLab/internimage_s_1k_224) | InternImage-S | IN-1K | 224x224 | 84.2 | 50M | 8G |
| [internimage_b_1k_224](https://huggingface.co/OpenGVLab/internimage_b_1k_224) | InternImage-B | IN-1K | 224x224 | 84.9 | 97M | 16G |
| [internimage_l_22kto1k_384](https://huggingface.co/OpenGVLab/internimage_l_22kto1k_384) | InternImage-L | IN-22K | 384x384 | 87.7 | 223M | 108G |
| [internimage_xl_22kto1k_384](https://huggingface.co/OpenGVLab/internimage_xl_22kto1k_384) | InternImage-XL | IN-22K | 384x384 | 88.0 | 335M | 163G |
| [internimage_h_22kto1k_640](https://huggingface.co/OpenGVLab/internimage_h_22kto1k_640) | InternImage-H | Joint 427M -> IN-22K | 640x640 | 89.6 | 1.08B | 1478G |
| [internimage_g_22kto1k_512](https://huggingface.co/OpenGVLab/internimage_g_22kto1k_512) | InternImage-G | Joint 427M -> IN-22K | 512x512 | 90.1 | 3B | 2700G |
## DCNv3 CUDA Kernel Installation
If you do not install the CUDA version of DCNv3, InternImage will automatically fall back to a PyTorch implementation. However, the CUDA implementation can significantly reduce GPU memory usage and improve inference efficiency.
**Installation Tutorial:**
1. Open your terminal and run:
```bash
git clone https://github.com/OpenGVLab/InternImage.git
cd InternImage/classification/ops_dcnv3
```
2. Make sure you have an available GPU for compilation, then run:
```bash
sh make.sh
```
This will compile the CUDA version of DCNv3. Once installed, InternImage will automatically leverage the optimized CUDA implementation for better performance.
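The short, informal check below (not an official utility; the extension name `DCNv3` is taken from `ops_dcnv3`) confirms whether the compiled kernel is visible in the current environment:
```python
try:
    import DCNv3  # noqa: F401
    print("Using the DCNv3 CUDA kernel.")
except ImportError:
    print("CUDA kernel unavailable; using the PyTorch fallback.")
```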
## Usage with Transformers
Below are two usage examples for InternImage with the Transformers framework:
### Example 1: Using InternImage as an Image Backbone
```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
# Replace 'model_name' with the appropriate model identifier
model_name = "OpenGVLab/internimage_t_1k_224" # example model
# Prepare the image
image_path = 'img.png'
image_processor = CLIPImageProcessor.from_pretrained(model_name)
image = Image.open(image_path)
image = image_processor(images=image, return_tensors='pt').pixel_values
print('image shape:', image.shape)
# Load the model as a backbone
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
# 'hidden_states' contains the outputs from the 4 stages of the InternImage backbone
hidden_states = model(image).hidden_states
```
### Example 2: Using InternImage for Image Classification
```python
import torch
from PIL import Image
from transformers import AutoModelForImageClassification, CLIPImageProcessor
# Replace 'model_name' with the appropriate model identifier
model_name = "OpenGVLab/internimage_t_1k_224" # example model
# Prepare the image
image_path = 'img.png'
image_processor = CLIPImageProcessor.from_pretrained(model_name)
image = Image.open(image_path)
image = image_processor(images=image, return_tensors='pt').pixel_values
print('image shape:', image.shape)
# Load the model as an image classifier
model = AutoModelForImageClassification.from_pretrained(model_name, trust_remote_code=True)
logits = model(image).logits
label = torch.argmax(logits, dim=1)
print("Predicted label:", label.item())
```
## Citation
If this work is helpful for your research, please consider citing the following BibTeX entry.
```bibtex
@inproceedings{wang2023internimage,
title={Internimage: Exploring large-scale vision foundation models with deformable convolutions},
author={Wang, Wenhai and Dai, Jifeng and Chen, Zhe and Huang, Zhenhang and Li, Zhiqi and Zhu, Xizhou and Hu, Xiaowei and Lu, Tong and Lu, Lewei and Li, Hongsheng and others},
booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition},
pages={14408--14419},
year={2023}
}
```
{
"_name_or_path": "OpenGVLab/internimage_h_jointto22k_384",
"act_layer": "GELU",
"architectures": [
"InternImageModel"
],
"auto_map": {
"AutoConfig": "configuration_internimage.InternImageConfig",
"AutoModel": "modeling_internimage.InternImageModel",
"AutoModelForImageClassification": "modeling_internimage.InternImageModelForImageClassification"
},
"center_feature_scale": true,
"channels": 320,
"cls_scale": 1.5,
"core_op": "DCNv3",
"depths": [
6,
6,
32,
6
],
"drop_path_rate": 0.0,
"drop_path_type": "linear",
"drop_rate": 0.0,
"dw_kernel_size": 5,
"groups": [
10,
20,
40,
80
],
"layer_scale": null,
"level2_post_norm": true,
"level2_post_norm_block_ids": [
5,
11,
17,
23,
29
],
"mlp_ratio": 4.0,
"model_type": "internimage",
"norm_layer": "LN",
"num_classes": 21841,
"offset_scale": 1.0,
"post_norm": false,
"remove_center": false,
"res_post_norm": true,
"torch_dtype": "float32",
"transformers_version": "4.37.2",
"use_clip_projector": true,
"with_cp": false
}
# --------------------------------------------------------
# InternImage
# Copyright (c) 2025 OpenGVLab
# Licensed under The MIT License [see LICENSE for details]
# --------------------------------------------------------
from transformers import PretrainedConfig
class InternImageConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`~InternImageModel`].
It is used to instantiate an internimage model according to the specified arguments, defining the model
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
the internimage [OpenGVLab/internimage](https://huggingface.co/OpenGVLab/internimage) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used
to control the model outputs. Read the documentation from [`PretrainedConfig`]
for more information.
Args:
core_op (`str`, *optional*, defaults to `"DCNv3"`):
Core operation used in the InternImageModel.
depths (`tuple`, *optional*, defaults to `(4, 4, 18, 4)`):
Tuple specifying the depth of layers in the InternImageModel.
groups (`tuple`, *optional*, defaults to `(4, 8, 16, 32)`):
Tuple specifying the group of layers in the InternImageModel.
channels (`int`, *optional*, defaults to `64`):
Number of channels in the InternImageModel.
dw_kernel_size (`int`, *optional*, defaults to `None`):
Kernel size for depthwise convolutions.
layer_scale (`float`, *optional*, defaults to `None`):
Scale of the layers in the model.
offset_scale (`float`, *optional*, defaults to `1.0`):
Offset scale in the model.
mlp_ratio (`float`, *optional*, defaults to `4.0`):
Ratio of mlp layers in the InternImageModel.
post_norm (`bool`, *optional*, defaults to `False`):
Whether to use post normalization in the model.
level2_post_norm (`bool`, *optional*, defaults to `False`):
Whether to use level 2 post normalization.
level2_post_norm_block_ids (`list`, *optional*, defaults to `None`):
Specific block IDs for level 2 post normalization.
center_feature_scale (`bool`, *optional*, defaults to `False`):
Whether to apply center feature scaling.
use_clip_projector (`bool`, *optional*, defaults to `False`):
Whether to use CLIP projector.
remove_center (`bool`, *optional*, defaults to `False`):
Whether to remove center pixels in some operations.
num_classes (`int`, *optional*, defaults to `1000`):
Number of classes for the model output.
drop_rate (`float`, *optional*, defaults to `0.0`):
Dropout rate in the model.
drop_path_rate (`float`, *optional*, defaults to `0.0`):
Dropout path rate in the model.
drop_path_type (`str`, *optional*, defaults to `"linear"`):
Type of dropout path used in the model.
act_layer (`str`, *optional*, defaults to `"GELU"`):
Activation function used in the model.
norm_layer (`str`, *optional*, defaults to `"LN"`):
Normalization layer used in the model.
cls_scale (`float`, *optional*, defaults to `1.5`):
Scale of the classification layer in the model.
with_cp (`bool`, *optional*, defaults to `False`):
Whether to use checkpointing in the model.
"""
model_type = 'internimage'
def __init__(
self,
core_op='DCNv3',
depths=(4, 4, 18, 4),
groups=(4, 8, 16, 32),
channels=64,
dw_kernel_size=None,
layer_scale=None,
offset_scale=1.0,
mlp_ratio=4.0,
post_norm=False,
res_post_norm=False,
level2_post_norm=False,
level2_post_norm_block_ids=None,
center_feature_scale=False,
use_clip_projector=False,
remove_center=False,
num_classes=1000,
drop_rate=0.0,
drop_path_rate=0.0,
drop_path_type='linear',
act_layer='GELU',
norm_layer='LN',
cls_scale=1.5,
with_cp=False,
**kwargs,
):
super().__init__(**kwargs)
# Model configuration parameters
self.core_op = core_op
self.depths = depths
self.groups = groups
self.channels = channels
self.dw_kernel_size = dw_kernel_size
self.layer_scale = layer_scale
self.offset_scale = offset_scale
self.mlp_ratio = mlp_ratio
self.post_norm = post_norm
self.res_post_norm = res_post_norm
self.level2_post_norm = level2_post_norm
self.level2_post_norm_block_ids = level2_post_norm_block_ids
self.center_feature_scale = center_feature_scale
self.use_clip_projector = use_clip_projector
self.remove_center = remove_center
self.num_classes = num_classes
self.drop_rate = drop_rate
self.drop_path_rate = drop_path_rate
self.drop_path_type = drop_path_type
self.act_layer = act_layer
self.norm_layer = norm_layer
self.cls_scale = cls_scale
self.with_cp = with_cp
# --------------------------------------------------------
# InternImage
# Copyright (c) 2025 OpenGVLab
# Licensed under The MIT License [see LICENSE for details]
# --------------------------------------------------------
from __future__ import absolute_import, division, print_function
import warnings
import torch
import torch.nn.functional as F
from torch import nn
from torch.nn.init import constant_, xavier_uniform_
from .dcnv3_func import DCNv3Function, dcnv3_core_pytorch, has_cuda_kernel
class to_channels_first(nn.Module):
def __init__(self):
super().__init__()
def forward(self, x):
return x.permute(0, 3, 1, 2)
class to_channels_last(nn.Module):
def __init__(self):
super().__init__()
def forward(self, x):
return x.permute(0, 2, 3, 1)
def build_norm_layer(dim,
norm_layer,
in_format='channels_last',
out_format='channels_last',
eps=1e-6):
layers = []
if norm_layer == 'BN':
if in_format == 'channels_last':
layers.append(to_channels_first())
layers.append(nn.BatchNorm2d(dim))
if out_format == 'channels_last':
layers.append(to_channels_last())
elif norm_layer == 'LN':
if in_format == 'channels_first':
layers.append(to_channels_last())
layers.append(nn.LayerNorm(dim, eps=eps))
if out_format == 'channels_first':
layers.append(to_channels_first())
else:
raise NotImplementedError(
f'build_norm_layer does not support {norm_layer}')
return nn.Sequential(*layers)
def build_act_layer(act_layer):
if act_layer == 'ReLU':
return nn.ReLU(inplace=True)
elif act_layer == 'SiLU':
return nn.SiLU(inplace=True)
elif act_layer == 'GELU':
return nn.GELU()
raise NotImplementedError(f'build_act_layer does not support {act_layer}')
def _is_power_of_2(n):
if (not isinstance(n, int)) or (n < 0):
raise ValueError(
'invalid input for _is_power_of_2: {} (type: {})'.format(n, type(n)))
return (n & (n - 1) == 0) and n != 0
class CenterFeatureScaleModule(nn.Module):
def forward(self,
query,
center_feature_scale_proj_weight,
center_feature_scale_proj_bias):
center_feature_scale = F.linear(query,
weight=center_feature_scale_proj_weight,
bias=center_feature_scale_proj_bias).sigmoid()
return center_feature_scale
class DCNv3_pytorch(nn.Module):
def __init__(
self,
channels=64,
kernel_size=3,
dw_kernel_size=None,
stride=1,
pad=1,
dilation=1,
group=4,
offset_scale=1.0,
act_layer='GELU',
norm_layer='LN',
center_feature_scale=False,
remove_center=False,
):
"""
DCNv3 Module
:param channels
:param kernel_size
:param stride
:param pad
:param dilation
:param group
:param offset_scale
:param act_layer
:param norm_layer
"""
super().__init__()
if channels % group != 0:
raise ValueError(
f'channels must be divisible by group, but got {channels} and {group}')
_d_per_group = channels // group
dw_kernel_size = dw_kernel_size if dw_kernel_size is not None else kernel_size
# you'd better set _d_per_group to a power of 2 which is more efficient in our CUDA implementation
if not _is_power_of_2(_d_per_group):
warnings.warn(
"You'd better set channels in DCNv3 to make the dimension of each attention head a power of 2 "
'which is more efficient in our CUDA implementation.')
self.offset_scale = offset_scale
self.channels = channels
self.kernel_size = kernel_size
self.dw_kernel_size = dw_kernel_size
self.stride = stride
self.dilation = dilation
self.pad = pad
self.group = group
self.group_channels = channels // group
self.offset_scale = offset_scale
self.center_feature_scale = center_feature_scale
self.remove_center = int(remove_center)
self.dw_conv = nn.Sequential(
nn.Conv2d(
channels,
channels,
kernel_size=dw_kernel_size,
stride=1,
padding=(dw_kernel_size - 1) // 2,
groups=channels),
build_norm_layer(
channels,
norm_layer,
'channels_first',
'channels_last'),
build_act_layer(act_layer))
self.offset = nn.Linear(
channels,
group * (kernel_size * kernel_size - remove_center) * 2)
self.mask = nn.Linear(
channels,
group * (kernel_size * kernel_size - remove_center))
self.input_proj = nn.Linear(channels, channels)
self.output_proj = nn.Linear(channels, channels)
self._reset_parameters()
if center_feature_scale:
self.center_feature_scale_proj_weight = nn.Parameter(
torch.zeros((group, channels), dtype=torch.float))
self.center_feature_scale_proj_bias = nn.Parameter(
torch.tensor(0.0, dtype=torch.float).view((1,)).repeat(group, ))
self.center_feature_scale_module = CenterFeatureScaleModule()
def _reset_parameters(self):
constant_(self.offset.weight.data, 0.)
constant_(self.offset.bias.data, 0.)
constant_(self.mask.weight.data, 0.)
constant_(self.mask.bias.data, 0.)
xavier_uniform_(self.input_proj.weight.data)
constant_(self.input_proj.bias.data, 0.)
xavier_uniform_(self.output_proj.weight.data)
constant_(self.output_proj.bias.data, 0.)
def forward(self, input):
"""
:param query (N, H, W, C)
:return output (N, H, W, C)
"""
N, H, W, _ = input.shape
x = self.input_proj(input)
x_proj = x
x1 = input.permute(0, 3, 1, 2)
x1 = self.dw_conv(x1)
offset = self.offset(x1)
mask = self.mask(x1).reshape(N, H, W, self.group, -1)
mask = F.softmax(mask, -1).reshape(N, H, W, -1)
x = dcnv3_core_pytorch(
x, offset, mask,
self.kernel_size, self.kernel_size,
self.stride, self.stride,
self.pad, self.pad,
self.dilation, self.dilation,
self.group, self.group_channels,
self.offset_scale, self.remove_center)
if self.center_feature_scale:
center_feature_scale = self.center_feature_scale_module(
x1, self.center_feature_scale_proj_weight, self.center_feature_scale_proj_bias)
# N, H, W, groups -> N, H, W, groups, 1 -> N, H, W, groups, _d_per_group -> N, H, W, channels
center_feature_scale = center_feature_scale[..., None].repeat(
1, 1, 1, 1, self.channels // self.group).flatten(-2)
x = x * (1 - center_feature_scale) + x_proj * center_feature_scale
x = self.output_proj(x)
return x
class DCNv3(nn.Module):
def __init__(
self,
channels=64,
kernel_size=3,
dw_kernel_size=None,
stride=1,
pad=1,
dilation=1,
group=4,
offset_scale=1.0,
act_layer='GELU',
norm_layer='LN',
center_feature_scale=False,
remove_center=False,
):
"""
DCNv3 Module
:param channels
:param kernel_size
:param stride
:param pad
:param dilation
:param group
:param offset_scale
:param act_layer
:param norm_layer
"""
super().__init__()
if channels % group != 0:
raise ValueError(
f'channels must be divisible by group, but got {channels} and {group}')
_d_per_group = channels // group
dw_kernel_size = dw_kernel_size if dw_kernel_size is not None else kernel_size
# you'd better set _d_per_group to a power of 2 which is more efficient in our CUDA implementation
if not _is_power_of_2(_d_per_group):
warnings.warn(
"You'd better set channels in DCNv3 to make the dimension of each attention head a power of 2 "
'which is more efficient in our CUDA implementation.')
self.offset_scale = offset_scale
self.channels = channels
self.kernel_size = kernel_size
self.dw_kernel_size = dw_kernel_size
self.stride = stride
self.dilation = dilation
self.pad = pad
self.group = group
self.group_channels = channels // group
self.offset_scale = offset_scale
self.center_feature_scale = center_feature_scale
self.remove_center = int(remove_center)
if self.remove_center and self.kernel_size % 2 == 0:
raise ValueError('remove_center is only compatible with odd kernel size.')
self.dw_conv = nn.Sequential(
nn.Conv2d(
channels,
channels,
kernel_size=dw_kernel_size,
stride=1,
padding=(dw_kernel_size - 1) // 2,
groups=channels),
build_norm_layer(
channels,
norm_layer,
'channels_first',
'channels_last'),
build_act_layer(act_layer))
self.offset = nn.Linear(
channels,
group * (kernel_size * kernel_size - remove_center) * 2)
self.mask = nn.Linear(
channels,
group * (kernel_size * kernel_size - remove_center))
self.input_proj = nn.Linear(channels, channels)
self.output_proj = nn.Linear(channels, channels)
self._reset_parameters()
if center_feature_scale:
self.center_feature_scale_proj_weight = nn.Parameter(
torch.zeros((group, channels), dtype=torch.float))
self.center_feature_scale_proj_bias = nn.Parameter(
torch.tensor(0.0, dtype=torch.float).view((1,)).repeat(group, ))
self.center_feature_scale_module = CenterFeatureScaleModule()
def _reset_parameters(self):
constant_(self.offset.weight.data, 0.)
constant_(self.offset.bias.data, 0.)
constant_(self.mask.weight.data, 0.)
constant_(self.mask.bias.data, 0.)
xavier_uniform_(self.input_proj.weight.data)
constant_(self.input_proj.bias.data, 0.)
xavier_uniform_(self.output_proj.weight.data)
constant_(self.output_proj.bias.data, 0.)
def forward(self, input):
"""
:param query (N, H, W, C)
:return output (N, H, W, C)
"""
N, H, W, _ = input.shape
x = self.input_proj(input)
x_proj = x
dtype = x.dtype
x1 = input.permute(0, 3, 1, 2)
x1 = self.dw_conv(x1)
offset = self.offset(x1)
mask = self.mask(x1).reshape(N, H, W, self.group, -1)
mask = F.softmax(mask, -1)
mask = mask.reshape(N, H, W, -1).type(dtype)
x = DCNv3Function.apply(
x, offset, mask,
self.kernel_size, self.kernel_size,
self.stride, self.stride,
self.pad, self.pad,
self.dilation, self.dilation,
self.group, self.group_channels,
self.offset_scale,
256,
self.remove_center)
if self.center_feature_scale:
center_feature_scale = self.center_feature_scale_module(
x1, self.center_feature_scale_proj_weight, self.center_feature_scale_proj_bias)
# N, H, W, groups -> N, H, W, groups, 1 -> N, H, W, groups, _d_per_group -> N, H, W, channels
center_feature_scale = center_feature_scale[..., None].repeat(
1, 1, 1, 1, self.channels // self.group).flatten(-2)
x = x * (1 - center_feature_scale) + x_proj * center_feature_scale
x = self.output_proj(x)
return x
# --------------------------------------------------------
# InternImage
# Copyright (c) 2025 OpenGVLab
# Licensed under The MIT License [see LICENSE for details]
# --------------------------------------------------------
from __future__ import absolute_import, division, print_function
import pkg_resources
import torch
import torch.nn.functional as F
from torch.autograd import Function
from torch.autograd.function import once_differentiable
from torch.cuda.amp import custom_bwd, custom_fwd
try:
    # The compiled CUDA extension; if it is missing, the model falls back to
    # the pure-PyTorch implementation.
    import DCNv3
    dcn_version = float(pkg_resources.get_distribution('DCNv3').version)
    has_cuda_kernel = True
except Exception:
    has_cuda_kernel = False
class DCNv3Function(Function):
@staticmethod
@custom_fwd
def forward(
ctx, input, offset, mask,
kernel_h, kernel_w, stride_h, stride_w,
pad_h, pad_w, dilation_h, dilation_w,
group, group_channels, offset_scale, im2col_step, remove_center):
ctx.kernel_h = kernel_h
ctx.kernel_w = kernel_w
ctx.stride_h = stride_h
ctx.stride_w = stride_w
ctx.pad_h = pad_h
ctx.pad_w = pad_w
ctx.dilation_h = dilation_h
ctx.dilation_w = dilation_w
ctx.group = group
ctx.group_channels = group_channels
ctx.offset_scale = offset_scale
ctx.im2col_step = im2col_step
ctx.remove_center = remove_center
args = [
input, offset, mask, kernel_h,
kernel_w, stride_h, stride_w, pad_h,
pad_w, dilation_h, dilation_w, group,
group_channels, offset_scale, ctx.im2col_step
]
if remove_center or dcn_version > 1.0:
args.append(remove_center)
output = DCNv3.dcnv3_forward(*args)
ctx.save_for_backward(input, offset, mask)
return output
@staticmethod
@once_differentiable
@custom_bwd
def backward(ctx, grad_output):
input, offset, mask = ctx.saved_tensors
args = [
input, offset, mask, ctx.kernel_h,
ctx.kernel_w, ctx.stride_h, ctx.stride_w, ctx.pad_h,
ctx.pad_w, ctx.dilation_h, ctx.dilation_w, ctx.group,
ctx.group_channels, ctx.offset_scale, grad_output.contiguous(), ctx.im2col_step
]
if ctx.remove_center or dcn_version > 1.0:
args.append(ctx.remove_center)
grad_input, grad_offset, grad_mask = \
DCNv3.dcnv3_backward(*args)
return grad_input, grad_offset, grad_mask, \
None, None, None, None, None, None, None, None, None, None, None, None, None
@staticmethod
def symbolic(g, input, offset, mask, kernel_h, kernel_w, stride_h,
stride_w, pad_h, pad_w, dilation_h, dilation_w, group,
group_channels, offset_scale, im2col_step, remove_center):
"""Symbolic function for mmdeploy::DCNv3.
Returns:
DCNv3 op for onnx.
"""
return g.op(
'mmdeploy::TRTDCNv3',
input,
offset,
mask,
kernel_h_i=int(kernel_h),
kernel_w_i=int(kernel_w),
stride_h_i=int(stride_h),
stride_w_i=int(stride_w),
pad_h_i=int(pad_h),
pad_w_i=int(pad_w),
dilation_h_i=int(dilation_h),
dilation_w_i=int(dilation_w),
group_i=int(group),
group_channels_i=int(group_channels),
offset_scale_f=float(offset_scale),
im2col_step_i=int(im2col_step),
remove_center_i=int(remove_center),
)
def _get_reference_points(spatial_shapes, device, kernel_h, kernel_w, dilation_h, dilation_w, pad_h=0, pad_w=0, stride_h=1, stride_w=1):
_, H_, W_, _ = spatial_shapes
H_out = (H_ - (dilation_h * (kernel_h - 1) + 1)) // stride_h + 1
W_out = (W_ - (dilation_w * (kernel_w - 1) + 1)) // stride_w + 1
ref_y, ref_x = torch.meshgrid(
torch.linspace(
# pad_h + 0.5,
# H_ - pad_h - 0.5,
(dilation_h * (kernel_h - 1)) // 2 + 0.5,
(dilation_h * (kernel_h - 1)) // 2 + 0.5 + (H_out - 1) * stride_h,
H_out,
dtype=torch.float32,
device=device),
torch.linspace(
# pad_w + 0.5,
# W_ - pad_w - 0.5,
(dilation_w * (kernel_w - 1)) // 2 + 0.5,
(dilation_w * (kernel_w - 1)) // 2 + 0.5 + (W_out - 1) * stride_w,
W_out,
dtype=torch.float32,
device=device))
ref_y = ref_y.reshape(-1)[None] / H_
ref_x = ref_x.reshape(-1)[None] / W_
ref = torch.stack((ref_x, ref_y), -1).reshape(
1, H_out, W_out, 1, 2)
return ref
def _generate_dilation_grids(spatial_shapes, kernel_h, kernel_w, dilation_h, dilation_w, group, device):
_, H_, W_, _ = spatial_shapes
points_list = []
x, y = torch.meshgrid(
torch.linspace(
-((dilation_w * (kernel_w - 1)) // 2),
-((dilation_w * (kernel_w - 1)) // 2) + (kernel_w - 1) * dilation_w,
kernel_w,
dtype=torch.float32,
device=device),
torch.linspace(
-((dilation_h * (kernel_h - 1)) // 2),
-((dilation_h * (kernel_h - 1)) // 2) + (kernel_h - 1) * dilation_h,
kernel_h,
dtype=torch.float32,
device=device))
points_list.extend([x / W_, y / H_])
grid = torch.stack(points_list, -1).reshape(-1, 1, 2).\
repeat(1, group, 1).permute(1, 0, 2)
grid = grid.reshape(1, 1, 1, group * kernel_h * kernel_w, 2)
return grid
def remove_center_sampling_locations(sampling_locations, kernel_w, kernel_h):
idx = list(range(sampling_locations.shape[-2]))
C = (kernel_w * kernel_h - 1)//2
idx = [i for i in idx if i != C and (i-C) % (C*2+1) != 0]
sampling_locations = sampling_locations[:,:,:,idx, :]
return sampling_locations
def dcnv3_core_pytorch(
input, offset, mask, kernel_h,
kernel_w, stride_h, stride_w, pad_h,
pad_w, dilation_h, dilation_w, group,
group_channels, offset_scale, remove_center):
# for debug and test only,
# need to use cuda version instead
if remove_center and (kernel_h % 2 == 0 or kernel_w % 2 == 0 or kernel_w != kernel_h):
raise ValueError('remove_center is only compatible with square odd kernel size.')
input = F.pad(
input,
[0, 0, pad_h, pad_h, pad_w, pad_w])
N_, H_in, W_in, _ = input.shape
_, H_out, W_out, _ = offset.shape
ref = _get_reference_points(
input.shape, input.device, kernel_h, kernel_w, dilation_h, dilation_w, pad_h, pad_w, stride_h, stride_w)
grid = _generate_dilation_grids(
input.shape, kernel_h, kernel_w, dilation_h, dilation_w, group, input.device)
spatial_norm = torch.tensor([W_in, H_in]).reshape(1, 1, 1, 2).\
repeat(1, 1, 1, group*(kernel_h*kernel_w-remove_center)).to(input.device)
sampling_locations = (ref + grid * offset_scale).repeat(N_, 1, 1, 1, 1)
if remove_center:
sampling_locations = remove_center_sampling_locations(sampling_locations, kernel_w=kernel_w, kernel_h=kernel_h)
sampling_locations = sampling_locations.flatten(3, 4)
sampling_locations = sampling_locations + offset * offset_scale / spatial_norm
P_ = kernel_h * kernel_w - remove_center
sampling_grids = 2 * sampling_locations - 1
# N_, H_in, W_in, group*group_channels -> N_, H_in*W_in, group*group_channels -> N_, group*group_channels, H_in*W_in -> N_*group, group_channels, H_in, W_in
input_ = input.view(N_, H_in*W_in, group*group_channels).transpose(1, 2).\
reshape(N_*group, group_channels, H_in, W_in)
# N_, H_out, W_out, group*P_*2 -> N_, H_out*W_out, group, P_, 2 -> N_, group, H_out*W_out, P_, 2 -> N_*group, H_out*W_out, P_, 2
sampling_grid_ = sampling_grids.view(N_, H_out*W_out, group, P_, 2).transpose(1, 2).\
flatten(0, 1)
# N_*group, group_channels, H_out*W_out, P_
sampling_input_ = F.grid_sample(
input_, sampling_grid_, mode='bilinear', padding_mode='zeros', align_corners=False)
# (N_, H_out, W_out, group*P_) -> N_, H_out*W_out, group, P_ -> (N_, group, H_out*W_out, P_) -> (N_*group, 1, H_out*W_out, P_)
mask = mask.view(N_, H_out*W_out, group, P_).transpose(1, 2).\
reshape(N_*group, 1, H_out*W_out, P_)
output = (sampling_input_ * mask).sum(-1).view(N_,
group*group_channels, H_out*W_out)
return output.transpose(1, 2).reshape(N_, H_out, W_out, -1).contiguous()
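# Illustrative sanity check (a sketch added for clarity; not part of the original file).
# It calls dcnv3_core_pytorch with dummy channels-last tensors for a 3x3 kernel,
# stride 1, pad 1, dilation 1, group=4, group_channels=16, remove_center=0,
# and verifies that the spatial shape and channel count are preserved.
if __name__ == '__main__':
    _x = torch.randn(1, 8, 8, 64)                 # (N, H, W, group * group_channels)
    _offset = torch.zeros(1, 8, 8, 4 * 9 * 2)     # (N, H_out, W_out, group * K * K * 2)
    _mask = torch.softmax(torch.zeros(1, 8, 8, 4, 9), -1).reshape(1, 8, 8, 36)
    _out = dcnv3_core_pytorch(_x, _offset, _mask, 3, 3, 1, 1, 1, 1, 1, 1, 4, 16, 1.0, 0)
    assert _out.shape == (1, 8, 8, 64)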
{
"crop_size": 384,
"do_center_crop": true,
"do_normalize": true,
"do_resize": true,
"feature_extractor_type": "CLIPFeatureExtractor",
"image_mean": [
0.485,
0.456,
0.406
],
"image_std": [
0.229,
0.224,
0.225
],
"resample": 3,
"size": 384
}
---
license: mit
pipeline_tag: image-classification
library_name: transformers
tags:
- internimage
- custom_code
datasets:
- ILSVRC/imagenet-1k
---
# InternImage Model Card
## Introduction
InternImage is an advanced vision foundation model developed by researchers from Shanghai AI Laboratory, Tsinghua University, and other institutions. Unlike models based on Transformers, InternImage employs DCNv3 as its core operator. This approach equips the model with dynamic and effective receptive fields required for downstream tasks like object detection and segmentation, while enabling adaptive spatial aggregation.
<div style="text-align: center;"> <img src="https://github.com/OpenGVLab/InternImage/raw/master/docs/figs/arch.png" style="width:60%;" /> </div>
## Performance
- InternImage achieved an impressive Top-1 accuracy of 90.1% on the ImageNet benchmark dataset using only publicly available data for image classification. Apart from two undisclosed models trained with additional datasets by Google and Microsoft, InternImage is the only open-source model that achieves a Top-1 accuracy of over 90.0%, and it is also the largest model in scale worldwide.
- InternImage outperformed all other models worldwide on the COCO object detection benchmark dataset with a remarkable mAP of 65.5, making it the only model that surpasses 65 mAP in the world.
- InternImage also demonstrated world's best performance on 16 other important visual benchmark datasets, covering a wide range of tasks such as classification, detection, and segmentation, making it the top-performing model across multiple domains.
## Released Models
### Open‑Source Visual Pretrained Models
| huggingface name | model name | pretrain | resolution | #param |
| :-------------------------------------------------------------------------------------------: | :------------: | :------------------: | :--------: | :----: |
| [internimage_l_22k_384](https://huggingface.co/OpenGVLab/internimage_l_22k_384) | InternImage-L | IN-22K | 384x384 | 223M |
| [internimage_xl_22k_384](https://huggingface.co/OpenGVLab/internimage_xl_22k_384) | InternImage-XL | IN-22K | 384x384 | 335M |
| [internimage_h_jointto22k_384](https://huggingface.co/OpenGVLab/internimage_h_jointto22k_384) | InternImage-H | Joint 427M -> IN-22K | 384x384 | 1.08B |
| [internimage_g_jointto22k_384](https://huggingface.co/OpenGVLab/internimage_g_jointto22k_384) | InternImage-G | Joint 427M -> IN-22K | 384x384 | 3B |
### ImageNet-1K Image Classification
| huggingface name | model name | pretrain | resolution | acc@1 | #param | FLOPs |
| :---------------------------------------------------------------------------------------: | :------------: | :------------------: | :--------: | :---: | :----: | :---: |
| [internimage_t_1k_224](https://huggingface.co/OpenGVLab/internimage_t_1k_224) | InternImage-T | IN-1K | 224x224 | 83.5 | 30M | 5G |
| [internimage_s_1k_224](https://huggingface.co/OpenGVLab/internimage_s_1k_224) | InternImage-S | IN-1K | 224x224 | 84.2 | 50M | 8G |
| [internimage_b_1k_224](https://huggingface.co/OpenGVLab/internimage_b_1k_224) | InternImage-B | IN-1K | 224x224 | 84.9 | 97M | 16G |
| [internimage_l_22kto1k_384](https://huggingface.co/OpenGVLab/internimage_l_22kto1k_384) | InternImage-L | IN-22K | 384x384 | 87.7 | 223M | 108G |
| [internimage_xl_22kto1k_384](https://huggingface.co/OpenGVLab/internimage_xl_22kto1k_384) | InternImage-XL | IN-22K | 384x384 | 88.0 | 335M | 163G |
| [internimage_h_22kto1k_640](https://huggingface.co/OpenGVLab/internimage_h_22kto1k_640) | InternImage-H | Joint 427M -> IN-22K | 640x640 | 89.6 | 1.08B | 1478G |
| [internimage_g_22kto1k_512](https://huggingface.co/OpenGVLab/internimage_g_22kto1k_512) | InternImage-G | Joint 427M -> IN-22K | 512x512 | 90.1 | 3B | 2700G |
## DCNv3 CUDA Kernel Installation
If you do not install the CUDA version of DCNv3, InternImage will automatically fall back to a PyTorch implementation. However, the CUDA implementation can significantly reduce GPU memory usage and improve inference efficiency.
**Installation Tutorial:**
1. Open your terminal and run:
```bash
git clone https://github.com/OpenGVLab/InternImage.git
cd InternImage/classification/ops_dcnv3
```
2. Make sure you have an available GPU for compilation, then run:
```bash
sh make.sh
```
This will compile the CUDA version of DCNv3. Once installed, InternImage will automatically leverage the optimized CUDA implementation for better performance.
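To check whether the compiled kernel is actually picked up at runtime, you can look for the `DCNv3` extension package that `make.sh` builds (a minimal sketch; the package name matches the `import DCNv3` used by the ops code in this repository):
```python
import importlib.util

# InternImage imports the compiled extension as `DCNv3`; when it is absent,
# the model silently falls back to the pure-PyTorch DCNv3 implementation.
has_cuda_kernel = importlib.util.find_spec("DCNv3") is not None
print("DCNv3 CUDA kernel available:", has_cuda_kernel)
```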
## Usage with Transformers
Below are two usage examples for InternImage with the Transformers framework:
### Example 1: Using InternImage as an Image Backbone
```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
# Replace 'model_name' with the appropriate model identifier
model_name = "OpenGVLab/internimage_t_1k_224" # example model
# Prepare the image
image_path = 'img.png'
image_processor = CLIPImageProcessor.from_pretrained(model_name)
image = Image.open(image_path)
image = image_processor(images=image, return_tensors='pt').pixel_values
print('image shape:', image.shape)
# Load the model as a backbone
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
# 'hidden_states' contains the outputs from the 4 stages of the InternImage backbone
hidden_states = model(image).hidden_states
```
### Example 2: Using InternImage for Image Classification
```python
import torch
from PIL import Image
from transformers import AutoModelForImageClassification, CLIPImageProcessor
# Replace 'model_name' with the appropriate model identifier
model_name = "OpenGVLab/internimage_t_1k_224" # example model
# Prepare the image
image_path = 'img.png'
image_processor = CLIPImageProcessor.from_pretrained(model_name)
image = Image.open(image_path)
image = image_processor(images=image, return_tensors='pt').pixel_values
print('image shape:', image.shape)
# Load the model as an image classifier
model = AutoModelForImageClassification.from_pretrained(model_name, trust_remote_code=True)
logits = model(image).logits
label = torch.argmax(logits, dim=1)
print("Predicted label:", label.item())
```
## Citation
If this work is helpful for your research, please consider citing the following BibTeX entry.
```Bibtex
@inproceedings{wang2023internimage,
title={Internimage: Exploring large-scale vision foundation models with deformable convolutions},
author={Wang, Wenhai and Dai, Jifeng and Chen, Zhe and Huang, Zhenhang and Li, Zhiqi and Zhu, Xizhou and Hu, Xiaowei and Lu, Tong and Lu, Lewei and Li, Hongsheng and others},
booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition},
pages={14408--14419},
year={2023}
}
```
{
"_name_or_path": "OpenGVLab/internimage_l_22k_384",
"act_layer": "GELU",
"architectures": [
"InternImageModel"
],
"auto_map": {
"AutoConfig": "configuration_internimage.InternImageConfig",
"AutoModel": "modeling_internimage.InternImageModel",
"AutoModelForImageClassification": "modeling_internimage.InternImageModelForImageClassification"
},
"center_feature_scale": false,
"channels": 160,
"cls_scale": 1.5,
"core_op": "DCNv3",
"depths": [
5,
5,
22,
5
],
"drop_path_rate": 0.0,
"drop_path_type": "linear",
"drop_rate": 0.0,
"dw_kernel_size": null,
"groups": [
10,
20,
40,
80
],
"layer_scale": 1e-05,
"level2_post_norm": false,
"level2_post_norm_block_ids": null,
"mlp_ratio": 4.0,
"model_type": "internimage",
"norm_layer": "LN",
"num_classes": 21841,
"offset_scale": 2.0,
"post_norm": true,
"remove_center": false,
"res_post_norm": false,
"torch_dtype": "float32",
"transformers_version": "4.37.2",
"use_clip_projector": false,
"with_cp": false
}
# --------------------------------------------------------
# InternImage
# Copyright (c) 2025 OpenGVLab
# Licensed under The MIT License [see LICENSE for details]
# --------------------------------------------------------
from transformers import PretrainedConfig
class InternImageConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`~InternImageModel`].
It is used to instantiate an internimage model according to the specified arguments, defining the model
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
the internimage [OpenGVLab/internimage](https://huggingface.co/OpenGVLab/internimage) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used
to control the model outputs. Read the documentation from [`PretrainedConfig`]
for more information.
Args:
core_op (`str`, *optional*, defaults to `"DCNv3"`):
Core operation used in the InternImageModel.
        depths (`tuple`, *optional*, defaults to `(4, 4, 18, 4)`):
            Tuple specifying the number of blocks in each stage of the InternImageModel.
        groups (`tuple`, *optional*, defaults to `(4, 8, 16, 32)`):
            Tuple specifying the number of DCNv3 groups in each stage of the InternImageModel.
channels (`int`, *optional*, defaults to `64`):
Number of channels in the InternImageModel.
dw_kernel_size (`int`, *optional*, defaults to `None`):
Kernel size for depthwise convolutions.
layer_scale (`float`, *optional*, defaults to `None`):
Scale of the layers in the model.
offset_scale (`float`, *optional*, defaults to `1.0`):
Offset scale in the model.
mlp_ratio (`float`, *optional*, defaults to `4.0`):
Ratio of mlp layers in the InternImageModel.
        post_norm (`bool`, *optional*, defaults to `False`):
            Whether to use post normalization in the model.
        res_post_norm (`bool`, *optional*, defaults to `False`):
            Whether to use residual post normalization in the model.
level2_post_norm (`bool`, *optional*, defaults to `False`):
Whether to use level 2 post normalization.
level2_post_norm_block_ids (`list`, *optional*, defaults to `None`):
Specific block IDs for level 2 post normalization.
center_feature_scale (`bool`, *optional*, defaults to `False`):
Whether to apply center feature scaling.
use_clip_projector (`bool`, *optional*, defaults to `False`):
Whether to use CLIP projector.
remove_center (`bool`, *optional*, defaults to `False`):
Whether to remove center pixels in some operations.
num_classes (`int`, *optional*, defaults to `1000`):
Number of classes for the model output.
drop_rate (`float`, *optional*, defaults to `0.0`):
Dropout rate in the model.
drop_path_rate (`float`, *optional*, defaults to `0.0`):
Dropout path rate in the model.
drop_path_type (`str`, *optional*, defaults to `"linear"`):
Type of dropout path used in the model.
act_layer (`str`, *optional*, defaults to `"GELU"`):
Activation function used in the model.
norm_layer (`str`, *optional*, defaults to `"LN"`):
Normalization layer used in the model.
cls_scale (`float`, *optional*, defaults to `1.5`):
Scale of the classification layer in the model.
with_cp (`bool`, *optional*, defaults to `False`):
Whether to use checkpointing in the model.
"""
model_type = 'internimage'
def __init__(
self,
core_op='DCNv3',
depths=(4, 4, 18, 4),
groups=(4, 8, 16, 32),
channels=64,
dw_kernel_size=None,
layer_scale=None,
offset_scale=1.0,
mlp_ratio=4.0,
post_norm=False,
res_post_norm=False,
level2_post_norm=False,
level2_post_norm_block_ids=None,
center_feature_scale=False,
use_clip_projector=False,
remove_center=False,
num_classes=1000,
drop_rate=0.0,
drop_path_rate=0.0,
drop_path_type='linear',
act_layer='GELU',
norm_layer='LN',
cls_scale=1.5,
with_cp=False,
**kwargs,
):
super().__init__(**kwargs)
# Model configuration parameters
self.core_op = core_op
self.depths = depths
self.groups = groups
self.channels = channels
self.dw_kernel_size = dw_kernel_size
self.layer_scale = layer_scale
self.offset_scale = offset_scale
self.mlp_ratio = mlp_ratio
self.post_norm = post_norm
self.res_post_norm = res_post_norm
self.level2_post_norm = level2_post_norm
self.level2_post_norm_block_ids = level2_post_norm_block_ids
self.center_feature_scale = center_feature_scale
self.use_clip_projector = use_clip_projector
self.remove_center = remove_center
self.num_classes = num_classes
self.drop_rate = drop_rate
self.drop_path_rate = drop_path_rate
self.drop_path_type = drop_path_type
self.act_layer = act_layer
self.norm_layer = norm_layer
self.cls_scale = cls_scale
self.with_cp = with_cp
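# Example (an illustrative sketch, not part of the original file): instantiating a config
# with the internimage_l_22k_384 hyper-parameters listed in the JSON config above.
if __name__ == '__main__':
    cfg = InternImageConfig(
        depths=(5, 5, 22, 5), groups=(10, 20, 40, 80), channels=160,
        layer_scale=1e-5, offset_scale=2.0, post_norm=True, num_classes=21841)
    print(cfg.model_type, cfg.channels, cfg.num_classes)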
# --------------------------------------------------------
# InternImage
# Copyright (c) 2025 OpenGVLab
# Licensed under The MIT License [see LICENSE for details]
# --------------------------------------------------------
from __future__ import absolute_import, division, print_function
import warnings
import torch
import torch.nn.functional as F
from torch import nn
from torch.nn.init import constant_, xavier_uniform_
from .dcnv3_func import DCNv3Function, dcnv3_core_pytorch, has_cuda_kernel
class to_channels_first(nn.Module):
def __init__(self):
super().__init__()
def forward(self, x):
return x.permute(0, 3, 1, 2)
class to_channels_last(nn.Module):
def __init__(self):
super().__init__()
def forward(self, x):
return x.permute(0, 2, 3, 1)
def build_norm_layer(dim,
norm_layer,
in_format='channels_last',
out_format='channels_last',
eps=1e-6):
layers = []
if norm_layer == 'BN':
if in_format == 'channels_last':
layers.append(to_channels_first())
layers.append(nn.BatchNorm2d(dim))
if out_format == 'channels_last':
layers.append(to_channels_last())
elif norm_layer == 'LN':
if in_format == 'channels_first':
layers.append(to_channels_last())
layers.append(nn.LayerNorm(dim, eps=eps))
if out_format == 'channels_first':
layers.append(to_channels_first())
else:
raise NotImplementedError(
f'build_norm_layer does not support {norm_layer}')
return nn.Sequential(*layers)
def build_act_layer(act_layer):
if act_layer == 'ReLU':
return nn.ReLU(inplace=True)
elif act_layer == 'SiLU':
return nn.SiLU(inplace=True)
elif act_layer == 'GELU':
return nn.GELU()
raise NotImplementedError(f'build_act_layer does not support {act_layer}')
def _is_power_of_2(n):
if (not isinstance(n, int)) or (n < 0):
raise ValueError(
'invalid input for _is_power_of_2: {} (type: {})'.format(n, type(n)))
return (n & (n - 1) == 0) and n != 0
class CenterFeatureScaleModule(nn.Module):
def forward(self,
query,
center_feature_scale_proj_weight,
center_feature_scale_proj_bias):
center_feature_scale = F.linear(query,
weight=center_feature_scale_proj_weight,
bias=center_feature_scale_proj_bias).sigmoid()
return center_feature_scale
class DCNv3_pytorch(nn.Module):
def __init__(
self,
channels=64,
kernel_size=3,
dw_kernel_size=None,
stride=1,
pad=1,
dilation=1,
group=4,
offset_scale=1.0,
act_layer='GELU',
norm_layer='LN',
center_feature_scale=False,
remove_center=False,
):
"""
DCNv3 Module
:param channels
:param kernel_size
:param stride
:param pad
:param dilation
:param group
:param offset_scale
:param act_layer
:param norm_layer
"""
super().__init__()
if channels % group != 0:
raise ValueError(
f'channels must be divisible by group, but got {channels} and {group}')
_d_per_group = channels // group
dw_kernel_size = dw_kernel_size if dw_kernel_size is not None else kernel_size
        # _d_per_group should be a power of 2 for best efficiency in the CUDA implementation
        if not _is_power_of_2(_d_per_group):
            warnings.warn(
                'Setting `channels` in DCNv3 so that channels // group is a power of 2 '
                'is more efficient in our CUDA implementation.')
self.offset_scale = offset_scale
self.channels = channels
self.kernel_size = kernel_size
self.dw_kernel_size = dw_kernel_size
self.stride = stride
self.dilation = dilation
self.pad = pad
self.group = group
self.group_channels = channels // group
self.offset_scale = offset_scale
self.center_feature_scale = center_feature_scale
self.remove_center = int(remove_center)
self.dw_conv = nn.Sequential(
nn.Conv2d(
channels,
channels,
kernel_size=dw_kernel_size,
stride=1,
padding=(dw_kernel_size - 1) // 2,
groups=channels),
build_norm_layer(
channels,
norm_layer,
'channels_first',
'channels_last'),
build_act_layer(act_layer))
self.offset = nn.Linear(
channels,
group * (kernel_size * kernel_size - remove_center) * 2)
self.mask = nn.Linear(
channels,
group * (kernel_size * kernel_size - remove_center))
self.input_proj = nn.Linear(channels, channels)
self.output_proj = nn.Linear(channels, channels)
self._reset_parameters()
if center_feature_scale:
self.center_feature_scale_proj_weight = nn.Parameter(
torch.zeros((group, channels), dtype=torch.float))
self.center_feature_scale_proj_bias = nn.Parameter(
torch.tensor(0.0, dtype=torch.float).view((1,)).repeat(group, ))
self.center_feature_scale_module = CenterFeatureScaleModule()
def _reset_parameters(self):
constant_(self.offset.weight.data, 0.)
constant_(self.offset.bias.data, 0.)
constant_(self.mask.weight.data, 0.)
constant_(self.mask.bias.data, 0.)
xavier_uniform_(self.input_proj.weight.data)
constant_(self.input_proj.bias.data, 0.)
xavier_uniform_(self.output_proj.weight.data)
constant_(self.output_proj.bias.data, 0.)
def forward(self, input):
"""
        :param input (N, H, W, C)
:return output (N, H, W, C)
"""
N, H, W, _ = input.shape
x = self.input_proj(input)
x_proj = x
x1 = input.permute(0, 3, 1, 2)
x1 = self.dw_conv(x1)
offset = self.offset(x1)
mask = self.mask(x1).reshape(N, H, W, self.group, -1)
mask = F.softmax(mask, -1).reshape(N, H, W, -1)
x = dcnv3_core_pytorch(
x, offset, mask,
self.kernel_size, self.kernel_size,
self.stride, self.stride,
self.pad, self.pad,
self.dilation, self.dilation,
self.group, self.group_channels,
self.offset_scale, self.remove_center)
if self.center_feature_scale:
center_feature_scale = self.center_feature_scale_module(
x1, self.center_feature_scale_proj_weight, self.center_feature_scale_proj_bias)
# N, H, W, groups -> N, H, W, groups, 1 -> N, H, W, groups, _d_per_group -> N, H, W, channels
center_feature_scale = center_feature_scale[..., None].repeat(
1, 1, 1, 1, self.channels // self.group).flatten(-2)
x = x * (1 - center_feature_scale) + x_proj * center_feature_scale
x = self.output_proj(x)
return x
class DCNv3(nn.Module):
def __init__(
self,
channels=64,
kernel_size=3,
dw_kernel_size=None,
stride=1,
pad=1,
dilation=1,
group=4,
offset_scale=1.0,
act_layer='GELU',
norm_layer='LN',
center_feature_scale=False,
remove_center=False,
):
"""
DCNv3 Module
:param channels
:param kernel_size
:param stride
:param pad
:param dilation
:param group
:param offset_scale
:param act_layer
:param norm_layer
"""
super().__init__()
if channels % group != 0:
raise ValueError(
f'channels must be divisible by group, but got {channels} and {group}')
_d_per_group = channels // group
dw_kernel_size = dw_kernel_size if dw_kernel_size is not None else kernel_size
        # _d_per_group should be a power of 2 for best efficiency in the CUDA implementation
        if not _is_power_of_2(_d_per_group):
            warnings.warn(
                'Setting `channels` in DCNv3 so that channels // group is a power of 2 '
                'is more efficient in our CUDA implementation.')
self.offset_scale = offset_scale
self.channels = channels
self.kernel_size = kernel_size
self.dw_kernel_size = dw_kernel_size
self.stride = stride
self.dilation = dilation
self.pad = pad
self.group = group
self.group_channels = channels // group
self.offset_scale = offset_scale
self.center_feature_scale = center_feature_scale
self.remove_center = int(remove_center)
if self.remove_center and self.kernel_size % 2 == 0:
raise ValueError('remove_center is only compatible with odd kernel size.')
self.dw_conv = nn.Sequential(
nn.Conv2d(
channels,
channels,
kernel_size=dw_kernel_size,
stride=1,
padding=(dw_kernel_size - 1) // 2,
groups=channels),
build_norm_layer(
channels,
norm_layer,
'channels_first',
'channels_last'),
build_act_layer(act_layer))
self.offset = nn.Linear(
channels,
group * (kernel_size * kernel_size - remove_center) * 2)
self.mask = nn.Linear(
channels,
group * (kernel_size * kernel_size - remove_center))
self.input_proj = nn.Linear(channels, channels)
self.output_proj = nn.Linear(channels, channels)
self._reset_parameters()
if center_feature_scale:
self.center_feature_scale_proj_weight = nn.Parameter(
torch.zeros((group, channels), dtype=torch.float))
self.center_feature_scale_proj_bias = nn.Parameter(
torch.tensor(0.0, dtype=torch.float).view((1,)).repeat(group, ))
self.center_feature_scale_module = CenterFeatureScaleModule()
def _reset_parameters(self):
constant_(self.offset.weight.data, 0.)
constant_(self.offset.bias.data, 0.)
constant_(self.mask.weight.data, 0.)
constant_(self.mask.bias.data, 0.)
xavier_uniform_(self.input_proj.weight.data)
constant_(self.input_proj.bias.data, 0.)
xavier_uniform_(self.output_proj.weight.data)
constant_(self.output_proj.bias.data, 0.)
def forward(self, input):
"""
        :param input (N, H, W, C)
:return output (N, H, W, C)
"""
N, H, W, _ = input.shape
x = self.input_proj(input)
x_proj = x
dtype = x.dtype
x1 = input.permute(0, 3, 1, 2)
x1 = self.dw_conv(x1)
offset = self.offset(x1)
mask = self.mask(x1).reshape(N, H, W, self.group, -1)
mask = F.softmax(mask, -1)
mask = mask.reshape(N, H, W, -1).type(dtype)
x = DCNv3Function.apply(
x, offset, mask,
self.kernel_size, self.kernel_size,
self.stride, self.stride,
self.pad, self.pad,
self.dilation, self.dilation,
self.group, self.group_channels,
self.offset_scale,
256,
self.remove_center)
if self.center_feature_scale:
center_feature_scale = self.center_feature_scale_module(
x1, self.center_feature_scale_proj_weight, self.center_feature_scale_proj_bias)
# N, H, W, groups -> N, H, W, groups, 1 -> N, H, W, groups, _d_per_group -> N, H, W, channels
center_feature_scale = center_feature_scale[..., None].repeat(
1, 1, 1, 1, self.channels // self.group).flatten(-2)
x = x * (1 - center_feature_scale) + x_proj * center_feature_scale
x = self.output_proj(x)
return x
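# Quick self-test (an illustrative sketch, not part of the original file): the pure-PyTorch
# module maps channels-last (N, H, W, C) inputs to outputs of the same shape.
if __name__ == '__main__':
    _m = DCNv3_pytorch(channels=64, group=4)
    _y = _m(torch.randn(1, 8, 8, 64))
    print('DCNv3_pytorch output shape:', tuple(_y.shape))  # (1, 8, 8, 64)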
# --------------------------------------------------------
# InternImage
# Copyright (c) 2025 OpenGVLab
# Licensed under The MIT License [see LICENSE for details]
# --------------------------------------------------------
from __future__ import absolute_import, division, print_function
import pkg_resources
import torch
import torch.nn.functional as F
from torch.autograd import Function
from torch.autograd.function import once_differentiable
from torch.cuda.amp import custom_bwd, custom_fwd
try:
    import DCNv3
    dcn_version = float(pkg_resources.get_distribution('DCNv3').version)
    has_cuda_kernel = True
except Exception:
    # the compiled CUDA extension is optional; fall back to the pure-PyTorch implementation
    has_cuda_kernel = False
class DCNv3Function(Function):
@staticmethod
@custom_fwd
def forward(
ctx, input, offset, mask,
kernel_h, kernel_w, stride_h, stride_w,
pad_h, pad_w, dilation_h, dilation_w,
group, group_channels, offset_scale, im2col_step, remove_center):
ctx.kernel_h = kernel_h
ctx.kernel_w = kernel_w
ctx.stride_h = stride_h
ctx.stride_w = stride_w
ctx.pad_h = pad_h
ctx.pad_w = pad_w
ctx.dilation_h = dilation_h
ctx.dilation_w = dilation_w
ctx.group = group
ctx.group_channels = group_channels
ctx.offset_scale = offset_scale
ctx.im2col_step = im2col_step
ctx.remove_center = remove_center
args = [
input, offset, mask, kernel_h,
kernel_w, stride_h, stride_w, pad_h,
pad_w, dilation_h, dilation_w, group,
group_channels, offset_scale, ctx.im2col_step
]
if remove_center or dcn_version > 1.0:
args.append(remove_center)
output = DCNv3.dcnv3_forward(*args)
ctx.save_for_backward(input, offset, mask)
return output
@staticmethod
@once_differentiable
@custom_bwd
def backward(ctx, grad_output):
input, offset, mask = ctx.saved_tensors
args = [
input, offset, mask, ctx.kernel_h,
ctx.kernel_w, ctx.stride_h, ctx.stride_w, ctx.pad_h,
ctx.pad_w, ctx.dilation_h, ctx.dilation_w, ctx.group,
ctx.group_channels, ctx.offset_scale, grad_output.contiguous(), ctx.im2col_step
]
if ctx.remove_center or dcn_version > 1.0:
args.append(ctx.remove_center)
grad_input, grad_offset, grad_mask = \
DCNv3.dcnv3_backward(*args)
return grad_input, grad_offset, grad_mask, \
None, None, None, None, None, None, None, None, None, None, None, None, None
@staticmethod
def symbolic(g, input, offset, mask, kernel_h, kernel_w, stride_h,
stride_w, pad_h, pad_w, dilation_h, dilation_w, group,
group_channels, offset_scale, im2col_step, remove_center):
"""Symbolic function for mmdeploy::DCNv3.
Returns:
DCNv3 op for onnx.
"""
return g.op(
'mmdeploy::TRTDCNv3',
input,
offset,
mask,
kernel_h_i=int(kernel_h),
kernel_w_i=int(kernel_w),
stride_h_i=int(stride_h),
stride_w_i=int(stride_w),
pad_h_i=int(pad_h),
pad_w_i=int(pad_w),
dilation_h_i=int(dilation_h),
dilation_w_i=int(dilation_w),
group_i=int(group),
group_channels_i=int(group_channels),
offset_scale_f=float(offset_scale),
im2col_step_i=int(im2col_step),
remove_center_i=int(remove_center),
)
def _get_reference_points(spatial_shapes, device, kernel_h, kernel_w, dilation_h, dilation_w, pad_h=0, pad_w=0, stride_h=1, stride_w=1):
_, H_, W_, _ = spatial_shapes
H_out = (H_ - (dilation_h * (kernel_h - 1) + 1)) // stride_h + 1
W_out = (W_ - (dilation_w * (kernel_w - 1) + 1)) // stride_w + 1
ref_y, ref_x = torch.meshgrid(
torch.linspace(
# pad_h + 0.5,
# H_ - pad_h - 0.5,
(dilation_h * (kernel_h - 1)) // 2 + 0.5,
(dilation_h * (kernel_h - 1)) // 2 + 0.5 + (H_out - 1) * stride_h,
H_out,
dtype=torch.float32,
device=device),
torch.linspace(
# pad_w + 0.5,
# W_ - pad_w - 0.5,
(dilation_w * (kernel_w - 1)) // 2 + 0.5,
(dilation_w * (kernel_w - 1)) // 2 + 0.5 + (W_out - 1) * stride_w,
W_out,
dtype=torch.float32,
device=device))
ref_y = ref_y.reshape(-1)[None] / H_
ref_x = ref_x.reshape(-1)[None] / W_
ref = torch.stack((ref_x, ref_y), -1).reshape(
1, H_out, W_out, 1, 2)
return ref
def _generate_dilation_grids(spatial_shapes, kernel_h, kernel_w, dilation_h, dilation_w, group, device):
_, H_, W_, _ = spatial_shapes
points_list = []
x, y = torch.meshgrid(
torch.linspace(
-((dilation_w * (kernel_w - 1)) // 2),
-((dilation_w * (kernel_w - 1)) // 2) + (kernel_w - 1) * dilation_w,
kernel_w,
dtype=torch.float32,
device=device),
torch.linspace(
-((dilation_h * (kernel_h - 1)) // 2),
-((dilation_h * (kernel_h - 1)) // 2) + (kernel_h - 1) * dilation_h,
kernel_h,
dtype=torch.float32,
device=device))
points_list.extend([x / W_, y / H_])
grid = torch.stack(points_list, -1).reshape(-1, 1, 2).\
repeat(1, group, 1).permute(1, 0, 2)
grid = grid.reshape(1, 1, 1, group * kernel_h * kernel_w, 2)
return grid
def remove_center_sampling_locations(sampling_locations, kernel_w, kernel_h):
idx = list(range(sampling_locations.shape[-2]))
C = (kernel_w * kernel_h - 1)//2
idx = [i for i in idx if i != C and (i-C) % (C*2+1) != 0]
sampling_locations = sampling_locations[:,:,:,idx, :]
return sampling_locations
def dcnv3_core_pytorch(
input, offset, mask, kernel_h,
kernel_w, stride_h, stride_w, pad_h,
pad_w, dilation_h, dilation_w, group,
group_channels, offset_scale, remove_center):
# for debug and test only,
# need to use cuda version instead
if remove_center and (kernel_h % 2 == 0 or kernel_w % 2 == 0 or kernel_w != kernel_h):
raise ValueError('remove_center is only compatible with square odd kernel size.')
input = F.pad(
input,
[0, 0, pad_h, pad_h, pad_w, pad_w])
N_, H_in, W_in, _ = input.shape
_, H_out, W_out, _ = offset.shape
ref = _get_reference_points(
input.shape, input.device, kernel_h, kernel_w, dilation_h, dilation_w, pad_h, pad_w, stride_h, stride_w)
grid = _generate_dilation_grids(
input.shape, kernel_h, kernel_w, dilation_h, dilation_w, group, input.device)
spatial_norm = torch.tensor([W_in, H_in]).reshape(1, 1, 1, 2).\
repeat(1, 1, 1, group*(kernel_h*kernel_w-remove_center)).to(input.device)
sampling_locations = (ref + grid * offset_scale).repeat(N_, 1, 1, 1, 1)
if remove_center:
sampling_locations = remove_center_sampling_locations(sampling_locations, kernel_w=kernel_w, kernel_h=kernel_h)
sampling_locations = sampling_locations.flatten(3, 4)
sampling_locations = sampling_locations + offset * offset_scale / spatial_norm
P_ = kernel_h * kernel_w - remove_center
sampling_grids = 2 * sampling_locations - 1
# N_, H_in, W_in, group*group_channels -> N_, H_in*W_in, group*group_channels -> N_, group*group_channels, H_in*W_in -> N_*group, group_channels, H_in, W_in
input_ = input.view(N_, H_in*W_in, group*group_channels).transpose(1, 2).\
reshape(N_*group, group_channels, H_in, W_in)
# N_, H_out, W_out, group*P_*2 -> N_, H_out*W_out, group, P_, 2 -> N_, group, H_out*W_out, P_, 2 -> N_*group, H_out*W_out, P_, 2
sampling_grid_ = sampling_grids.view(N_, H_out*W_out, group, P_, 2).transpose(1, 2).\
flatten(0, 1)
# N_*group, group_channels, H_out*W_out, P_
sampling_input_ = F.grid_sample(
input_, sampling_grid_, mode='bilinear', padding_mode='zeros', align_corners=False)
# (N_, H_out, W_out, group*P_) -> N_, H_out*W_out, group, P_ -> (N_, group, H_out*W_out, P_) -> (N_*group, 1, H_out*W_out, P_)
mask = mask.view(N_, H_out*W_out, group, P_).transpose(1, 2).\
reshape(N_*group, 1, H_out*W_out, P_)
output = (sampling_input_ * mask).sum(-1).view(N_,
group*group_channels, H_out*W_out)
return output.transpose(1, 2).reshape(N_, H_out, W_out, -1).contiguous()