"llm/vscode:/vscode.git/clone" did not exist on "20b3684387645d0f27895fcbf80e9ead88ba86b5"
Commit b7535e7c authored by luopl's avatar luopl
Browse files

init

parents
Pipeline #1734 canceled with stages
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
Copyright (c) 2024, NVIDIA Corporation. All rights reserved.
Nvidia Source Code License-NC
1. Definitions
“Licensor” means any person or entity that distributes its Work.
“Work” means (a) the original work of authorship made available under this license, which may include software, documentation,
or other files, and (b) any additions to or derivative works thereof that are made available under this license.
The terms “reproduce,” “reproduction,” “derivative works,” and “distribution” have the meaning as provided under U.S.
copyright law; provided, however, that for the purposes of this license, derivative works shall not include works that
remain separable from, or merely link (or bind by name) to the interfaces of, the Work.
Works are “made available” under this license by including in or with the Work either (a) a copyright notice referencing
the applicability of this license to the Work, or (b) a copy of this license.
2. License Grant
2.1 Copyright Grant. Subject to the terms and conditions of this license, each Licensor grants to you a perpetual,
worldwide, non-exclusive, royalty-free, copyright license to use, reproduce, prepare derivative works of, publicly
display, publicly perform, sublicense and distribute its Work and any resulting derivative works in any form.
3. Limitations
3.1 Redistribution. You may reproduce or distribute the Work only if (a) you do so under this license, (b) you include a
complete copy of this license with your distribution, and (c) you retain without modification any copyright, patent,
trademark, or attribution notices that are present in the Work.
3.2 Derivative Works. You may specify that additional or different terms apply to the use, reproduction, and distribution
of your derivative works of the Work (“Your Terms”) only if (a) Your Terms provide that the use limitation in Section 3.3
applies to your derivative works, and (b) you identify the specific derivative works that are subject to Your Terms.
Notwithstanding Your Terms, this license (including the redistribution requirements in Section 3.1) will continue to apply
to the Work itself.
3.3 Use Limitation. The Work and any derivative works thereof only may be used or intended for use non-commercially.
Notwithstanding the foregoing, NVIDIA Corporation and its affiliates may use the Work and any derivative works commercially.
As used herein, “non-commercially” means for research or evaluation purposes only.
3.4 Patent Claims. If you bring or threaten to bring a patent claim against any Licensor (including any claim, cross-claim
or counterclaim in a lawsuit) to enforce any patents that you allege are infringed by any Work, then your rights under
this license from such Licensor (including the grant in Section 2.1) will terminate immediately.
3.5 Trademarks. This license does not grant any rights to use any Licensor’s or its affiliates’ names, logos, or trademarks,
except as necessary to reproduce the notices described in this license.
3.6 Termination. If you violate any term of this license, then your rights under this license (including the grant in Section 2.1)
will terminate immediately.
4. Disclaimer of Warranty.
THE WORK IS PROVIDED “AS IS” WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WARRANTIES
OR CONDITIONS OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE OR NON-INFRINGEMENT. YOU BEAR THE RISK OF UNDERTAKING
ANY ACTIVITIES UNDER THIS LICENSE.
5. Limitation of Liability.
EXCEPT AS PROHIBITED BY APPLICABLE LAW, IN NO EVENT AND UNDER NO LEGAL THEORY, WHETHER IN TORT (INCLUDING NEGLIGENCE), CONTRACT,
OR OTHERWISE SHALL ANY LICENSOR BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL
DAMAGES ARISING OUT OF OR RELATED TO THIS LICENSE, THE USE OR INABILITY TO USE THE WORK (INCLUDING BUT NOT LIMITED TO LOSS OF GOODWILL,
BUSINESS INTERRUPTION, LOST PROFITS OR DATA, COMPUTER FAILURE OR MALFUNCTION, OR ANY OTHER DAMAGES OR LOSSES), EVEN IF THE LICENSOR
HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
# MambaVision
## Paper
`MambaVision: A Hybrid Mamba-Transformer Vision Backbone`
- https://arxiv.org/abs/2407.08083
## Model Architecture
MambaVision proposes a novel hybrid Mamba-Transformer backbone tailored specifically for vision applications.
Its core contribution is a redesigned Mamba formulation that enhances the ability to model visual features efficiently.
In addition, a comprehensive ablation study examines the feasibility of integrating Vision Transformers (ViT) with Mamba.
Equipping the final layers of the Mamba architecture with several self-attention blocks greatly improves its capacity to capture long-range spatial dependencies. Based on these findings, a family of MambaVision models with a hierarchical architecture is introduced to meet various design criteria.
<div align=center>
<img src="./mambavision/assets/arch.png"/>
</div>
## Algorithm
MambaVision introduces a novel mixer block that adds a symmetric path without SSM to enhance the modeling of global context:
<div align=center>
<img src="./mambavision/assets/block.png"/>
</div>
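For intuition, here is a minimal PyTorch sketch of this idea (a simplification for illustration, not the official implementation; `ssm` stands in for a selective-scan module and the layer names are hypothetical):
```python
import torch
import torch.nn as nn

class MambaVisionMixerSketch(nn.Module):
    """Half the channels go through an SSM branch, half through a symmetric
    non-SSM branch (conv + SiLU); the outputs are concatenated and projected."""
    def __init__(self, dim: int, ssm: nn.Module):
        super().__init__()
        half = dim // 2
        self.in_proj = nn.Linear(dim, dim)
        self.ssm = ssm                                   # placeholder SSM module
        self.conv = nn.Conv1d(half, half, kernel_size=3, padding=1)
        self.act = nn.SiLU()
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):                                # x: (batch, tokens, dim)
        x = self.in_proj(x)
        x1, x2 = x.chunk(2, dim=-1)                      # split between the two branches
        y1 = self.ssm(x1)                                # SSM branch
        y2 = self.act(self.conv(x2.transpose(1, 2))).transpose(1, 2)  # symmetric non-SSM branch
        return self.out_proj(torch.cat([y1, y2], dim=-1))

# smoke test with an identity stand-in for the SSM
mixer = MambaVisionMixerSketch(dim=64, ssm=nn.Identity())
print(mixer(torch.rand(1, 196, 64)).shape)               # torch.Size([1, 196, 64])
```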
## Environment Setup
### Docker (Option 1)
This section provides the address and steps for pulling the docker image from [光源](https://www.sourcefind.cn/#/service-details), as well as the download address for deep-learning libraries from the [光合](https://developer.hpccube.com/tool/) developer community.
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
docker run -it --shm-size=1024G -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal:/opt/hyhal --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name MambaVision_pytorch <your IMAGE ID> bash # replace <your IMAGE ID> with the ID of the image pulled above; for this image it is a4dd5be0ca23
cd /path/your_code_data/MambaVision_pytorch/mamba
pip install wheel -i https://mirrors.aliyun.com/pypi/simple/
pip install . --no-build-isolation --no-deps
git clone https://github.com/Dao-AILab/causal-conv1d.git
cd causal-conv1d
pip install -e .
cd /path/your_code_data/MambaVision_pytorch/
pip install . --no-build-isolation --no-deps
pip install timm==0.9.0 tensorboardX
```
### Dockerfile (Option 2)
This section shows how to build and use the Dockerfile:
```
docker build --no-cache -t mambavision:latest .
docker run -it --shm-size=128G -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name MambaVision_pytorch mambavision:latest bash
cd /path/your_code_data/MambaVision_pytorch/mamba
pip install wheel -i https://mirrors.aliyun.com/pypi/simple/
pip install . --no-build-isolation --no-deps
git clone https://github.com/Dao-AILab/causal-conv1d.git
cd causal-conv1d
pip install -e .
cd /path/your_code_data/MambaVision_pytorch/
pip install . --no-build-isolation --no-deps
pip install timm==0.9.0 tensorboardX
```
### Anaconda (Option 3)
This section provides the detailed steps for local setup and compilation.
The special deep-learning libraries required by this project for DCU cards can be downloaded and installed from the [光合](https://developer.hpccube.com/tool/) developer community.
```
# DTK driver: dtk24.04.1
# python: python3.10
# torch: 2.1.0
# torchvision: 0.16.0
```
`Tips: the DTK driver, python, torch and other DCU-related tool versions above must match one-to-one exactly`
Install the other dependencies as follows:
```
cd /path/your_code_data/MambaVision_pytorch/mamba
pip install wheel -i https://mirrors.aliyun.com/pypi/simple/
pip install . --no-build-isolation --no-deps
git clone https://github.com/Dao-AILab/causal-conv1d.git
cd causal-conv1d
pip install -e .
cd /path/your_code_data/MambaVision_pytorch/
pip install . --no-build-isolation --no-deps
pip install timm==0.9.0 tensorboardX
```
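After installation, an optional sanity check (a suggestion, not part of the original scripts) can confirm that the DTK build of PyTorch sees the DCU cards; on DCU the devices surface through the CUDA API:
```python
import torch

print(torch.__version__)           # expected: 2.1.0 (DTK build)
print(torch.cuda.is_available())   # True if the DCU devices are visible
print(torch.cuda.device_count())   # number of available cards
```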
## Dataset
Quick download link for the dataset on SCNet:
[ImageNet-1K](http://113.200.138.88:18080/aidatasets/project-dependency/imagenet-1k)
The dataset directory structure is as follows:
```
── imagenet-1k
│ ├── train
│ │ ├── n13133613
│ │ ├── n15075141
│ │ └── ...
│ ├── val
│ │ ├── n13133613
│ │ ├── n15075141
│ │ └── ...
```
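To sanity-check the layout before training (an optional check; the path below is a placeholder), the train split can be loaded with torchvision's `ImageFolder`:
```python
from torchvision.datasets import ImageFolder

# placeholder path; point this at your own copy of the dataset
train_ds = ImageFolder("/path/your_code_data/imagenet-1k/train")
print(len(train_ds), "images,", len(train_ds.classes), "classes")  # expect 1000 classes
```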
## Training
### Single node, single card
```
sh train.sh
```
### Single node, multiple cards
```
sh multidcu_train.sh
```
Note: change DATA_PATH to your own data path.
## Inference
See the pretrained weights section below for the SCNet quick download link for the model weights.
### Single-card inference
Inference (to save outputs to a directory, use `--output`):
```
python inference.py
```
Evaluate:
```
sh validate.sh
```
### Multi-card inference
```
sh multidcu_validate.sh
```
Note: change DATA_PATH to your own data path; change checkpoint to the path of your own weights and set the --model name to match.
## Results
Inference:
<div align=center>
<img src="./mambavision/assets/results.png"/>
</div>
### Accuracy
Inference with a single K100 AI card:
| Method | Acc@1(%) | Acc@5(%) |
|:------:|:--------:|:--------:|
| MambaVision-T | 82.242 | 96.146 |
| MambaVision-S | 83.256 | 96.464 |
| MambaVision-B | 84.148 | 96.878 |
| MambaVision-L | 84.968 | 97.114 |
## Application Scenarios
### Algorithm Category
`Image Classification`
### Key Application Industries
`Research, Manufacturing, Healthcare, Home, Education`
## Pretrained Weights
SCNet quick download link for the model weights: [mambavision_model](http://113.200.138.88:18080/aimodels/mambavision_model)
## Source Repository and Issue Reporting
- https://developer.hpccube.com/codes/modelzoo/mambavision_pytorch
## References
- https://github.com/NVlabs/MambaVision
# MambaVision: A Hybrid Mamba-Transformer Vision Backbone
Official PyTorch implementation of [**MambaVision: A Hybrid Mamba-Transformer Vision Backbone**](https://arxiv.org/abs/2407.08083).
[![Star on GitHub](https://img.shields.io/github/stars/NVlabs/MambaVision.svg?style=social)](https://github.com/NVlabs/MambaVision/stargazers)
[Ali Hatamizadeh](https://research.nvidia.com/person/ali-hatamizadeh) and
[Jan Kautz](https://jankautz.com/).
For business inquiries, please visit our website and submit the form: [NVIDIA Research Licensing](https://www.nvidia.com/en-us/research/inquiries/)
---
MambaVision demonstrates strong performance, achieving a new SOTA Pareto front in
terms of Top-1 accuracy and throughput.
<p align="center">
<img src="https://github.com/NVlabs/MambaVision/assets/26806394/79dcf841-3966-4b77-883d-76cd5e1d4320" width=62% height=62%
class="center">
</p>
We introduce a novel mixer block by creating a symmetric path without SSM to enhance the modeling of global context:
<p align="center">
<img src="https://github.com/NVlabs/MambaVision/assets/26806394/295c0984-071e-4c84-b2c8-9059e2794182" width=32% height=32%
class="center">
</p>
MambaVision has a hierarchical architecture that employs both self-attention and mixer blocks:
![teaser](./mambavision/assets/arch.png)
## 💥 News 💥
- **[07.24.2024]** MambaVision [Hugging Face](https://huggingface.co/collections/nvidia/mambavision-66943871a6b36c9e78b327d3) models are released!
- **[07.14.2024]** We added support for processing images of any resolution.
- **[07.12.2024]** The [paper](https://arxiv.org/abs/2407.08083) is now available on arXiv!
- **[07.11.2024]** The [MambaVision pip package](https://pypi.org/project/mambavision/) is released!
- **[07.10.2024]** We have released the code and model checkpoints for MambaVision!
## Quick Start
### Hugging Face (Classification + Feature extraction)
Pretrained MambaVision models can be used directly via the [Hugging Face](https://huggingface.co/collections/nvidia/mambavision-66943871a6b36c9e78b327d3) library with **a few lines of code**. First, install the requirements:
```bash
pip install mambavision
```
The model can be simply imported:
```python
>>> from transformers import AutoModelForImageClassification
>>> model = AutoModelForImageClassification.from_pretrained("nvidia/MambaVision-T-1K", trust_remote_code=True)
```
We demonstrate an end-to-end image classification example in the following.
Given the following image from the [COCO dataset](https://cocodataset.org/#home) validation set as input:
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/64414b62603214724ebd2636/4duSnqLf4lrNiAHczSmAN.jpeg" width=70% height=70%
class="center">
</p>
The following snippet can be used:
```python
from transformers import AutoModelForImageClassification
from PIL import Image
from timm.data.transforms_factory import create_transform
import requests
model = AutoModelForImageClassification.from_pretrained("nvidia/MambaVision-T-1K", trust_remote_code=True)
# eval mode for inference
model.cuda().eval()
# prepare image for the model
url = 'http://images.cocodataset.org/val2017/000000020247.jpg'
image = Image.open(requests.get(url, stream=True).raw)
input_resolution = (3, 224, 224)  # MambaVision supports any input resolution
transform = create_transform(input_size=input_resolution,
                             is_training=False,
                             mean=model.config.mean,
                             std=model.config.std,
                             crop_mode=model.config.crop_mode,
                             crop_pct=model.config.crop_pct)
inputs = transform(image).unsqueeze(0).cuda()
# model inference
outputs = model(inputs)
logits = outputs['logits']
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
```
The predicted label is `brown bear, bruin, Ursus arctos`.
You can also use Hugging Face MambaVision models for feature extraction. The model provides the outputs of each stage of the model (hierarchical multi-scale features in 4 stages) as well as the final average-pooled features, which are flattened. The former can be used for downstream tasks such as classification and detection.
The following snippet can be used for feature extraction:
```Python
from transformers import AutoModel
from PIL import Image
from timm.data.transforms_factory import create_transform
import requests
model = AutoModel.from_pretrained("nvidia/MambaVision-T-1K", trust_remote_code=True)
# eval mode for inference
model.cuda().eval()
# prepare image for the model
url = 'http://images.cocodataset.org/val2017/000000020247.jpg'
image = Image.open(requests.get(url, stream=True).raw)
input_resolution = (3, 224, 224)  # MambaVision supports any input resolution
transform = create_transform(input_size=input_resolution,
                             is_training=False,
                             mean=model.config.mean,
                             std=model.config.std,
                             crop_mode=model.config.crop_mode,
                             crop_pct=model.config.crop_pct)
inputs = transform(image).unsqueeze(0).cuda()
# model inference
out_avg_pool, features = model(inputs)
print("Size of the averaged pool features:", out_avg_pool.size()) # torch.Size([1, 640])
print("Number of stages in extracted features:", len(features)) # 4 stages
print("Size of extracted features in stage 1:", features[0].size()) # torch.Size([1, 80, 56, 56])
print("Size of extracted features in stage 4:", features[3].size()) # torch.Size([1, 640, 7, 7])
```
Currently, we offer [MambaVision-T-1K](https://huggingface.co/nvidia/MambaVision-T-1K), [MambaVision-T2-1K](https://huggingface.co/nvidia/MambaVision-T2-1K), [MambaVision-S-1K](https://huggingface.co/nvidia/MambaVision-S-1K), [MambaVision-B-1K](https://huggingface.co/nvidia/MambaVision-B-1K), [MambaVision-L-1K](https://huggingface.co/nvidia/MambaVision-L-1K) and [MambaVision-L2-1K](https://huggingface.co/nvidia/MambaVision-L2-1K) on Hugging Face. All models can also be viewed [here](https://huggingface.co/collections/nvidia/mambavision-66943871a6b36c9e78b327d3).
### Classification (pip package)
We can also import pre-trained MambaVision models from the pip package with **a few lines of code**:
```bash
pip install mambavision
```
A pretrained MambaVision model with default hyper-parameters can be created as follows:
```python
>>> from mambavision import create_model
# Define mamba_vision_T model
>>> model = create_model('mamba_vision_T', pretrained=True, model_path="/tmp/mambavision_tiny_1k.pth.tar")
```
The available pretrained models include `mamba_vision_T`, `mamba_vision_T2`, `mamba_vision_S`, `mamba_vision_B`, `mamba_vision_L` and `mamba_vision_L2`.
We can also simply test the model by passing a dummy image with **any resolution**. The output is the logits:
```python
>>> import torch
>>> image = torch.rand(1, 3, 512, 224).cuda() # place image on cuda
>>> model = model.cuda() # place model on cuda
>>> output = model(image) # output logit size is [1, 1000]
```
Using the pretrained models from our pip package, you can simply run validation:
```
python validate_pip_model.py --model mamba_vision_T --data_dir=$DATA_PATH --batch-size $BS
```
## FAQ
1. Does MambaVision support processing images with any input resolution?
Yes! You can pass images of any arbitrary resolution without needing to change the model.
2. Can I apply MambaVision to downstream tasks like detection and segmentation?
Yes! We are working to release this very soon. Employing MambaVision backbones for these tasks is very similar to other models in the `mmseg` or `mmdet` packages. In addition, MambaVision [Hugging Face](https://huggingface.co/collections/nvidia/mambavision-66943871a6b36c9e78b327d3) models provide feature extraction capability which can be used for downstream tasks; please see the example above and the sketch after this list.
3. I am interested in re-implementing MambaVision in my own repository. Can we use the pretrained weights?
Yes! The pretrained weights are released under [CC-BY-NC-SA-4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Please submit an issue in this repo and we will add your repository to the README of our codebase and properly acknowledge your efforts.
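To make question 2 concrete, here is a hypothetical sketch of feeding the four-stage features into a simple FPN-style projection (not part of this repo; the intermediate stage widths 160 and 320 are assumptions extrapolated from the stage-1 and stage-4 shapes printed in the feature-extraction example above):
```python
import torch
import torch.nn as nn
from transformers import AutoModel

backbone = AutoModel.from_pretrained("nvidia/MambaVision-T-1K", trust_remote_code=True)
backbone.cuda().eval()

# Project each stage to a common width, as a detection/segmentation neck would.
stage_dims = (80, 160, 320, 640)  # stages 1 and 4 match the printed shapes; 160/320 assumed
laterals = nn.ModuleList(nn.Conv2d(c, 256, kernel_size=1) for c in stage_dims).cuda()

with torch.no_grad():
    _, features = backbone(torch.rand(1, 3, 224, 224).cuda())
pyramid = [lat(f) for lat, f in zip(laterals, features)]
for level in pyramid:
    print(level.shape)  # four scales (56, 28, 14, 7), each with 256 channels
```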
## Results + Pretrained Models
### ImageNet-1K
**MambaVision ImageNet-1K Pretrained Models**
<table>
<tr>
<th>Name</th>
<th>Acc@1(%)</th>
<th>Acc@5(%)</th>
<th>Throughput(Img/Sec)</th>
<th>Resolution</th>
<th>#Params(M)</th>
<th>FLOPs(G)</th>
<th>Download</th>
</tr>
<tr>
<td>MambaVision-T</td>
<td>82.3</td>
<td>96.2</td>
<td>6298</td>
<td>224x224</td>
<td>31.8</td>
<td>4.4</td>
<td><a href="https://huggingface.co/nvidia/MambaVision-T-1K/resolve/main/mambavision_tiny_1k.pth.tar">model</a></td>
</tr>
<tr>
<td>MambaVision-T2</td>
<td>82.7</td>
<td>96.3</td>
<td>5990</td>
<td>224x224</td>
<td>35.1</td>
<td>5.1</td>
<td><a href="https://huggingface.co/nvidia/MambaVision-T2-1K/resolve/main/mambavision_tiny2_1k.pth.tar">model</a></td>
</tr>
<tr>
<td>MambaVision-S</td>
<td>83.3</td>
<td>96.5</td>
<td>4700</td>
<td>224x224</td>
<td>50.1</td>
<td>7.5</td>
<td><a href="https://huggingface.co/nvidia/MambaVision-S-1K/resolve/main/mambavision_small_1k.pth.tar">model</a></td>
</tr>
<tr>
<td>MambaVision-B</td>
<td>84.2</td>
<td>96.9</td>
<td>3670</td>
<td>224x224</td>
<td>97.7</td>
<td>15.0</td>
<td><a href="https://huggingface.co/nvidia/MambaVision-B-1K/resolve/main/mambavision_base_1k.pth.tar">model</a></td>
</tr>
<tr>
<td>MambaVision-L</td>
<td>85.0</td>
<td>97.1</td>
<td>2190</td>
<td>224x224</td>
<td>227.9</td>
<td>34.9</td>
<td><a href="https://huggingface.co/nvidia/MambaVision-L-1K/resolve/main/mambavision_large_1k.pth.tar">model</a></td>
</tr>
<tr>
<td>MambaVision-L2</td>
<td>85.3</td>
<td>97.2</td>
<td>1021</td>
<td>224x224</td>
<td>241.5</td>
<td>37.5</td>
<td><a href="https://huggingface.co/nvidia/MambaVision-L2-1K/resolve/main/mambavision_large2_1k.pth.tar">model</a></td>
</tr>
</table>
## Installation
We provide a [docker file](./Dockerfile). In addition, assuming that a recent [PyTorch](https://pytorch.org/get-started/locally/) package is installed, the dependencies can be installed by running:
```bash
pip install -r requirements.txt
```
## Evaluation
The MambaVision models can be evaluated on ImageNet-1K validation set using the following:
```
python validate.py \
--model <model-name> \
--checkpoint <checkpoint-path> \
--data_dir <imagenet-path> \
--batch-size <batch-size-per-gpu>
```
Here `--model` is the MambaVision variant (e.g. `mambavision_tiny_1k`), `--checkpoint` is the path to the pretrained model weights, `--data_dir` is the path to the ImageNet-1K validation set, and `--batch-size` is the per-GPU batch size. We also provide a sample script [here](./mambavision/validate.sh).
## Citation
If you find MambaVision to be useful for your work, please consider citing our paper:
```
@article{hatamizadeh2024mambavision,
title={MambaVision: A Hybrid Mamba-Transformer Vision Backbone},
author={Hatamizadeh, Ali and Kautz, Jan},
journal={arXiv preprint arXiv:2407.08083},
year={2024}
}
```
## Star History
[![Star History Chart](https://api.star-history.com/svg?repos=NVlabs/MambaVision&type=Date)](https://star-history.com/#NVlabs/MambaVision&Date)
## Licenses
Copyright © 2024, NVIDIA Corporation. All rights reserved.
This work is made available under the NVIDIA Source Code License-NC. Click [here](LICENSE) to view a copy of this license.
The pre-trained models are shared under [CC-BY-NC-SA-4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
For license information regarding the timm repository, please refer to its [repository](https://github.com/rwightman/pytorch-image-models).
For license information regarding the ImageNet dataset, please see the [ImageNet official website](https://www.image-net.org/).
## Acknowledgement
This repository is built on top of the [timm](https://github.com/huggingface/pytorch-image-models) repository. We thank [Ross Wightman](https://rwightman.com/) for creating and maintaining this high-quality library.
from .models.registry import create_model
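# Training config for MambaVision-B (model: mamba_vision_B, tag: mambavision_base_1k)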
ThreeAugment: false
aa: rand-m9-mstd0.5-inc1
activation_tracker: false
amp: true
ampere_sparsity: false
aot_autograd: false
apex_amp: false
attn_drop_rate: 0.0
aug_repeats: 0
aug_splits: 0
batch_size: 128
bce_loss: false
bce_target_thresh: null
bn_eps: null
bn_momentum: null
channels_last: true
checkpoint_hist: 1
class_map: ''
clip_grad: 5.0
clip_mode: norm
color_jitter: 0.4
cooldown_epochs: 10
crop_pct: 1.0
cutmix: 1.0
cutmix_minmax: null
data_dir: /datasets/imagenet_lmdb
data_len: 1281167
dataset: ''
dataset_download: false
decay_epochs: 100
decay_milestones:
- 30
- 60
decay_rate: 0.1
dist_bn: reduce
drop_block: null
drop_connect: null
drop_path: null
drop_rate: 0.0
epoch_repeats: 0.0
epochs: 310
eval_metric: top1
experiment: ''
fuser: ''
gp: null
grad_checkpointing: false
hflip: 0.5
img_size: null
initial_checkpoint: ''
input_size:
- 3
- 224
- 224
interpolation: ''
jsd_loss: false
layer_decay: null
loadcheckpoint: ''
local_rank: 0
log_dir: ./log_dir/
log_interval: 50
log_wandb: false
lr: 0.005
lr_cycle_decay: 1.0
lr_cycle_limit: 1
lr_cycle_mul: 1.0
lr_ep: false
lr_k_decay: 1.0
lr_noise: null
lr_noise_pct: 0.67
lr_noise_std: 1.0
mean: null
mesa: 1.0
mesa_start_ratio: 0.3
min_lr: 5.0e-06
mixup: 0.8
mixup_mode: batch
mixup_off_epoch: 0
mixup_prob: 1.0
mixup_switch_prob: 0.5
model: mamba_vision_B
model_ema: true
model_ema_decay: 0.9998
model_ema_force_cpu: false
momentum: 0.9
native_amp: false
no_aug: false
no_ddp_bb: false
no_prefetcher: false
no_resume_opt: false
no_saver: false
num_classes: null
opt: lamb
opt_betas:
- 0.9
- 0.999
opt_eps: 1.0e-08
output: ''
patience_epochs: 10
pin_mem: false
pretrained: false
ratio:
- 0.75
- 1.3333333333333333
recount: 1
recovery_interval: 0
remode: pixel
reprob: 0.25
resplit: false
resume: ''
save_images: false
scale:
- 0.08
- 1.0
sched: cosine
seed: 31
smoothing: 0.1
split_bn: false
start_epoch: null
std: null
sync_bn: false
tag: mambavision_base_1k
torchscript: false
train_interpolation: random
train_split: train
tta: 0
use_multi_epochs_loader: false
val_split: validation
validate_only: false
validation_batch_size: null
vflip: 0.0
warmup_epochs: 35
warmup_lr: 1.0e-06
weight_decay: 0.075
worker_seeding: all
workers: 8
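# Training config for MambaVision-L2 (model: mamba_vision_L2, tag: mambavision_large2_1k)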
ThreeAugment: false
aa: rand-m9-mstd0.5-inc1
activation_tracker: false
amp: true
ampere_sparsity: false
aot_autograd: false
apex_amp: false
attn_drop_rate: 0.0
aug_repeats: 0
aug_splits: 0
batch_size: 128
bce_loss: false
bce_target_thresh: null
bn_eps: null
bn_momentum: null
channels_last: true
checkpoint_hist: 1
class_map: ''
clip_grad: 5.0
clip_mode: norm
color_jitter: 0.4
cooldown_epochs: 10
crop_pct: 1.0
cutmix: 1.0
cutmix_minmax: null
data_dir: /datasets/imagenet_lmdb
data_len: 1281167
dataset: ''
dataset_download: false
decay_epochs: 100
decay_milestones:
- 30
- 60
decay_rate: 0.1
dist_bn: reduce
drop_block: null
drop_connect: null
drop_path: null
drop_rate: 0.0
epoch_repeats: 0.0
epochs: 310
eval_metric: top1
experiment: ''
fuser: ''
gp: null
grad_checkpointing: false
hflip: 0.5
img_size: null
initial_checkpoint: ''
input_size:
- 3
- 224
- 224
interpolation: ''
jsd_loss: false
layer_decay: null
loadcheckpoint: ''
local_rank: 0
log_dir: ./log_dir/
log_interval: 50
log_wandb: false
lr: 0.005
lr_cycle_decay: 1.0
lr_cycle_limit: 1
lr_cycle_mul: 1.0
lr_ep: false
lr_k_decay: 1.0
lr_noise: null
lr_noise_pct: 0.67
lr_noise_std: 1.0
mean: null
mesa: 6.0
mesa_start_ratio: 0.25
min_lr: 5.0e-06
mixup: 0.8
mixup_mode: batch
mixup_off_epoch: 0
mixup_prob: 1.0
mixup_switch_prob: 0.5
model: mamba_vision_L2
model_ema: true
model_ema_decay: 0.9998
model_ema_force_cpu: false
momentum: 0.9
native_amp: false
no_aug: false
no_ddp_bb: false
no_prefetcher: false
no_resume_opt: false
no_saver: false
num_classes: null
opt: lamb
opt_betas:
- 0.9
- 0.999
opt_eps: 1.0e-08
output: ''
patience_epochs: 10
pin_mem: false
pretrained: false
ratio:
- 0.75
- 1.3333333333333333
recount: 1
recovery_interval: 0
remode: pixel
reprob: 0.25
resplit: false
resume: ''
save_images: false
scale:
- 0.08
- 1.0
sched: cosine
seed: 31
smoothing: 0.1
split_bn: false
start_epoch: null
std: null
sync_bn: false
tag: mambavision_large2_1k
torchscript: false
train_interpolation: random
train_split: train
tta: 0
use_multi_epochs_loader: false
val_split: validation
validate_only: false
validation_batch_size: null
vflip: 0.0
warmup_epochs: 20
warmup_lr: 1.0e-06
weight_decay: 0.12
worker_seeding: all
workers: 8
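# Training config for MambaVision-L (model: mamba_vision_L, tag: mambavision_large_1k)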
ThreeAugment: false
aa: rand-m9-mstd0.5-inc1
activation_tracker: false
amp: true
ampere_sparsity: false
aot_autograd: false
apex_amp: false
attn_drop_rate: 0.0
aug_repeats: 0
aug_splits: 0
batch_size: 128
bce_loss: false
bce_target_thresh: null
bn_eps: null
bn_momentum: null
channels_last: true
checkpoint_hist: 1
class_map: ''
clip_grad: 5.0
clip_mode: norm
color_jitter: 0.4
cooldown_epochs: 10
crop_pct: 1.0
cutmix: 1.0
cutmix_minmax: null
data_dir: /datasets/imagenet_lmdb
data_len: 1281167
dataset: ''
dataset_download: false
decay_epochs: 100
decay_milestones:
- 30
- 60
decay_rate: 0.1
dist_bn: reduce
drop_block: null
drop_connect: null
drop_path: null
drop_rate: 0.0
epoch_repeats: 0.0
epochs: 310
eval_metric: top1
experiment: ''
fuser: ''
gp: null
grad_checkpointing: false
hflip: 0.5
img_size: null
initial_checkpoint: ''
input_size:
- 3
- 224
- 224
interpolation: ''
jsd_loss: false
layer_decay: null
loadcheckpoint: ''
local_rank: 0
log_dir: ./log_dir/
log_interval: 50
log_wandb: false
lr: 0.005
lr_cycle_decay: 1.0
lr_cycle_limit: 1
lr_cycle_mul: 1.0
lr_ep: false
lr_k_decay: 1.0
lr_noise: null
lr_noise_pct: 0.67
lr_noise_std: 1.0
mean: null
mesa: 6.0
mesa_start_ratio: 0.25
min_lr: 5.0e-06
mixup: 0.8
mixup_mode: batch
mixup_off_epoch: 0
mixup_prob: 1.0
mixup_switch_prob: 0.5
model: mamba_vision_L
model_ema: true
model_ema_decay: 0.9998
model_ema_force_cpu: false
momentum: 0.9
native_amp: false
no_aug: false
no_ddp_bb: false
no_prefetcher: false
no_resume_opt: false
no_saver: false
num_classes: null
opt: lamb
opt_betas:
- 0.9
- 0.999
opt_eps: 1.0e-08
output: ''
patience_epochs: 10
pin_mem: false
pretrained: false
ratio:
- 0.75
- 1.3333333333333333
recount: 1
recovery_interval: 0
remode: pixel
reprob: 0.25
resplit: false
resume: ''
save_images: false
scale:
- 0.08
- 1.0
sched: cosine
seed: 31
smoothing: 0.1
split_bn: false
start_epoch: null
std: null
sync_bn: false
tag: mambavision_large_1k
torchscript: false
train_interpolation: random
train_split: train
tta: 0
use_multi_epochs_loader: false
val_split: validation
validate_only: false
validation_batch_size: null
vflip: 0.0
warmup_epochs: 20
warmup_lr: 1.0e-06
weight_decay: 0.12
worker_seeding: all
workers: 8
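# Training config for MambaVision-S (model: mamba_vision_S, tag: mambavision_small_1k)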
ThreeAugment: false
aa: rand-m9-mstd0.5-inc1
activation_tracker: false
amp: true
ampere_sparsity: false
aot_autograd: false
apex_amp: false
attn_drop_rate: 0.0
aug_repeats: 0
aug_splits: 0
batch_size: 128
bce_loss: false
bce_target_thresh: null
bn_eps: null
bn_momentum: null
channels_last: true
checkpoint_hist: 1
class_map: ''
clip_grad: 5.0
clip_mode: norm
color_jitter: 0.4
cooldown_epochs: 10
crop_pct: 0.875
cutmix: 1.0
cutmix_minmax: null
data_dir: /datasets/imagenet_lmdb
data_len: 1281167
dataset: ''
dataset_download: false
decay_epochs: 100
decay_milestones:
- 30
- 60
decay_rate: 0.1
dist_bn: reduce
drop_block: null
drop_connect: null
drop_path: null
drop_rate: 0.0
epoch_repeats: 0.0
epochs: 310
eval_metric: top1
experiment: ''
fuser: ''
gp: null
grad_checkpointing: false
hflip: 0.5
img_size: null
initial_checkpoint: ''
input_size:
- 3
- 224
- 224
interpolation: ''
jsd_loss: false
layer_decay: null
loadcheckpoint: ''
local_rank: 0
log_dir: ./log_dir/
log_interval: 50
log_wandb: false
lr: 0.005
lr_cycle_decay: 1.0
lr_cycle_limit: 1
lr_cycle_mul: 1.0
lr_ep: false
lr_k_decay: 1.0
lr_noise: null
lr_noise_pct: 0.67
lr_noise_std: 1.0
mean: null
mesa: 1.0
mesa_start_ratio: 0.25
min_lr: 5.0e-06
mixup: 0.8
mixup_mode: batch
mixup_off_epoch: 0
mixup_prob: 1.0
mixup_switch_prob: 0.5
model: mamba_vision_S
model_ema: true
model_ema_decay: 0.9998
model_ema_force_cpu: false
momentum: 0.9
native_amp: false
no_aug: false
no_ddp_bb: false
no_prefetcher: false
no_resume_opt: false
no_saver: false
num_classes: null
opt: lamb
opt_betas:
- 0.9
- 0.999
opt_eps: 1.0e-08
output: ''
patience_epochs: 10
pin_mem: false
pretrained: false
ratio:
- 0.75
- 1.3333333333333333
recount: 1
recovery_interval: 0
remode: pixel
reprob: 0.25
resplit: false
resume: ''
save_images: false
scale:
- 0.08
- 1.0
sched: cosine
seed: 31
smoothing: 0.1
split_bn: false
start_epoch: null
std: null
sync_bn: false
tag: mambavision_small_1k
torchscript: false
train_interpolation: random
train_split: train
tta: 0
use_multi_epochs_loader: false
val_split: validation
validate_only: false
validation_batch_size: null
vflip: 0.0
warmup_epochs: 20
warmup_lr: 1.0e-06
weight_decay: 0.05
worker_seeding: all
workers: 8
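# Training config for MambaVision-T2 (model: mamba_vision_T2, tag: mambavision_tiny2_1k)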
ThreeAugment: false
aa: rand-m9-mstd0.5-inc1
activation_tracker: false
amp: true
ampere_sparsity: false
aot_autograd: false
apex_amp: false
attn_drop_rate: 0.0
aug_repeats: 0
aug_splits: 0
batch_size: 128
bce_loss: false
bce_target_thresh: null
bn_eps: null
bn_momentum: null
channels_last: true
checkpoint_hist: 1
class_map: ''
clip_grad: 5.0
clip_mode: norm
color_jitter: 0.4
cooldown_epochs: 10
crop_pct: 1.0
cutmix: 1.0
cutmix_minmax: null
data_dir: /datasets/imagenet_lmdb
data_len: 1281167
dataset: ''
dataset_download: false
decay_epochs: 100
decay_milestones:
- 30
- 60
decay_rate: 0.1
dist_bn: reduce
drop_block: null
drop_connect: null
drop_path: null
drop_rate: 0.0
epoch_repeats: 0.0
epochs: 310
eval_metric: top1
experiment: ''
fuser: ''
gp: null
grad_checkpointing: false
hflip: 0.5
img_size: null
initial_checkpoint: ''
input_size:
- 3
- 224
- 224
interpolation: ''
jsd_loss: false
layer_decay: null
loadcheckpoint: ''
local_rank: 0
log_dir: ./log_dir/
log_interval: 50
log_wandb: false
lr: 0.005
lr_cycle_decay: 1.0
lr_cycle_limit: 1
lr_cycle_mul: 1.0
lr_ep: false
lr_k_decay: 1.0
lr_noise: null
lr_noise_pct: 0.67
lr_noise_std: 1.0
mean: null
mesa: 0.75
mesa_start_ratio: 0.25
min_lr: 5.0e-06
mixup: 0.8
mixup_mode: batch
mixup_off_epoch: 0
mixup_prob: 1.0
mixup_switch_prob: 0.5
model: mamba_vision_T2
model_ema: true
model_ema_decay: 0.9998
model_ema_force_cpu: false
momentum: 0.9
native_amp: false
no_aug: false
no_ddp_bb: false
no_prefetcher: false
no_resume_opt: false
no_saver: false
num_classes: null
opt: lamb
opt_betas:
- 0.9
- 0.999
opt_eps: 1.0e-08
output: ''
patience_epochs: 10
pin_mem: false
pretrained: false
ratio:
- 0.75
- 1.3333333333333333
recount: 1
recovery_interval: 0
remode: pixel
reprob: 0.25
resplit: false
resume: ''
save_images: false
scale:
- 0.08
- 1.0
sched: cosine
seed: 31
smoothing: 0.1
split_bn: false
start_epoch: null
std: null
sync_bn: false
tag: mambavision_tiny2_1k
torchscript: false
train_interpolation: random
train_split: train
tta: 0
use_multi_epochs_loader: false
val_split: validation
validate_only: false
validation_batch_size: null
vflip: 0.0
warmup_epochs: 20
warmup_lr: 1.0e-06
weight_decay: 0.05
worker_seeding: all
workers: 8
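# Training config for MambaVision-T (model: mamba_vision_T, tag: mambavision_tiny_1k)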
ThreeAugment: false
aa: rand-m9-mstd0.5-inc1
activation_tracker: false
amp: true
ampere_sparsity: false
aot_autograd: false
apex_amp: false
attn_drop_rate: 0.0
aug_repeats: 0
aug_splits: 0
batch_size: 128
bce_loss: false
bce_target_thresh: null
bn_eps: null
bn_momentum: null
channels_last: true
checkpoint_hist: 1
class_map: ''
clip_grad: 5.0
clip_mode: norm
color_jitter: 0.4
cooldown_epochs: 10
crop_pct: 1.0
cutmix: 1.0
cutmix_minmax: null
data_dir: /datasets/imagenet_lmdb
data_len: 1281167
dataset: ''
dataset_download: false
decay_epochs: 100
decay_milestones:
- 30
- 60
decay_rate: 0.1
dist_bn: reduce
drop_block: null
drop_connect: null
drop_path: null
drop_rate: 0.0
epoch_repeats: 0.0
epochs: 310
eval_metric: top1
experiment: ''
fuser: ''
gp: null
grad_checkpointing: false
hflip: 0.5
img_size: null
initial_checkpoint: ''
input_size:
- 3
- 224
- 224
interpolation: ''
jsd_loss: false
layer_decay: null
loadcheckpoint: ''
local_rank: 0
log_dir: ./log_dir/
log_interval: 50
log_wandb: false
lr: 0.005
lr_cycle_decay: 1.0
lr_cycle_limit: 1
lr_cycle_mul: 1.0
lr_ep: false
lr_k_decay: 1.0
lr_noise: null
lr_noise_pct: 0.67
lr_noise_std: 1.0
mean: null
mesa: 0.5
mesa_start_ratio: 0.25
min_lr: 5.0e-06
mixup: 0.8
mixup_mode: batch
mixup_off_epoch: 0
mixup_prob: 1.0
mixup_switch_prob: 0.5
model: mamba_vision_T
model_ema: true
model_ema_decay: 0.9998
model_ema_force_cpu: false
momentum: 0.9
native_amp: false
no_aug: false
no_ddp_bb: false
no_prefetcher: false
no_resume_opt: false
no_saver: false
num_classes: null
opt: lamb
opt_betas:
- 0.9
- 0.999
opt_eps: 1.0e-08
output: ''
patience_epochs: 10
pin_mem: false
pretrained: false
ratio:
- 0.75
- 1.3333333333333333
recount: 1
recovery_interval: 0
remode: pixel
reprob: 0.25
resplit: false
resume: ''
save_images: false
scale:
- 0.08
- 1.0
sched: cosine
seed: 31
smoothing: 0.1
split_bn: false
start_epoch: null
std: null
sync_bn: false
tag: mambavision_tiny_1k
torchscript: false
train_interpolation: random
train_split: train
tta: 0
use_multi_epochs_loader: false
val_split: validation
validate_only: false
validation_batch_size: null
vflip: 0.0
warmup_epochs: 20
warmup_lr: 1.0e-06
weight_decay: 0.05
worker_seeding: all
workers: 8
import torch
from timm.models import create_model, load_checkpoint
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--model', '-m', metavar='NAME', default='mamba_vision_T', help='model architecture (default: mamba_vision_T)')
parser.add_argument('--checkpoint', default='', type=str, metavar='PATH', help='path to latest checkpoint (default: none)')
parser.add_argument('--use_pip', action='store_true', default=False, help='use the mambavision pip package instead of the local models')
args = parser.parse_args()
# Create the model (the dummy input below uses an arbitrary resolution)
if args.use_pip:
    from mambavision import create_model
    model = create_model(args.model, pretrained=True, model_path="/tmp/mambavision_tiny_1k.pth.tar")
else:
    from models.mamba_vision import *
    model = create_model(args.model)

if args.checkpoint:
    load_checkpoint(model, args.checkpoint, None)
print('{} model successfully created!'.format(args.model))
image = torch.rand(1, 3, 754, 234).cuda() # place image on cuda
model = model.cuda() # place model on cuda
output = model(image) # output logit size is [1, 1000]
print('Inference successfully completed on dummy input!')
from transformers import AutoModelForImageClassification
from PIL import Image
from timm.data.transforms_factory import create_transform
import requests
model = AutoModelForImageClassification.from_pretrained("MambaVision-T-1K", trust_remote_code=True)  # local checkpoint directory
# eval mode for inference
model.cuda().eval()
# prepare image for the model (a local copy of the COCO val image used above)
image_path = '000000020247.jpg'
image = Image.open(image_path)
input_resolution = (3, 224, 224)  # MambaVision supports any input resolution
transform = create_transform(input_size=input_resolution,
                             is_training=False,
                             mean=model.config.mean,
                             std=model.config.std,
                             crop_mode=model.config.crop_mode,
                             crop_pct=model.config.crop_pct)
inputs = transform(image).unsqueeze(0).cuda()
# model inference
outputs = model(inputs)
logits = outputs['logits']
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
from .mamba_vision import *
from .registry import create_model