Commit 1bada89a authored by chenych

Update README and repo

parent 1efa1bca
Dockerfile:
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.3.0-py3.10-dtk24.04.3-ubuntu20.04
MIT License
Copyright (c) 2023 OpenGVLab
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# InternVL2.5

## Paper

[Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling](https://arxiv.org/abs/2412.05271)
## Model Overview

InternVL 2.5 retains the same model architecture as its predecessors InternVL 1.5 and 2.0, following the "ViT-MLP-LLM" paradigm. In this release, a newly incrementally pre-trained InternViT is combined with various pre-trained LLMs (including InternLM 2.5 and Qwen 2.5) through a randomly initialized MLP projector.
As in previous versions, a pixel unshuffle operation is applied to cut the number of visual tokens to a quarter, so each 448×448 image tile yields 256 visual tokens instead of 1024. A dynamic resolution strategy similar to InternVL 1.5 is also used, splitting images into 448×448-pixel tiles. The key addition since InternVL 2.0 is support for multi-image and video data alongside the existing single-image and text-only data.
<div align=center>
<img src="./Pic/arch.png"/>
</div>
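To make the token-reduction step concrete, below is a minimal sketch of a pixel-unshuffle operation, assuming a (batch, height, width, channels) feature layout: it folds each 2×2 neighborhood of ViT features into the channel dimension, so the 32×32 feature grid of a 448×448 tile (1024 tokens) becomes 16×16 (256 tokens). The function name and shapes are illustrative rather than the repository's exact implementation.

```python
import torch

def pixel_unshuffle(x: torch.Tensor, scale_factor: float = 0.5) -> torch.Tensor:
    # x: (batch, height, width, channels) visual features from the ViT
    n, h, w, c = x.size()
    # fold each 2x2 spatial neighborhood into the channel dimension
    x = x.view(n, h, int(w * scale_factor), int(c / scale_factor))
    x = x.permute(0, 2, 1, 3).contiguous()
    x = x.view(n, int(w * scale_factor), int(h * scale_factor), int(c / (scale_factor * scale_factor)))
    x = x.permute(0, 2, 1, 3).contiguous()
    return x

# A 448x448 tile gives a 32x32 grid of ViT features (1024 tokens); after the
# unshuffle it becomes 16x16 (256 tokens) with 4x wider channels.
feats = torch.randn(1, 32, 32, 1024)
print(pixel_unshuffle(feats).shape)  # torch.Size([1, 16, 16, 4096])
```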
## Algorithm Principle

InternVL 2.5 is trained in three stages: Stage 1 (MLP warm-up), an optional Stage 1.5 (ViT incremental learning), and Stage 2 (full-model instruction tuning). This multi-stage design progressively strengthens vision-language alignment, stabilizes training, and prepares the modules for integration with larger LLMs. Under the accompanying progressive scaling strategy, a ViT module trained alongside a smaller LLM in the early stages can later be integrated with a larger LLM, achieving scalable model alignment at an affordable resource cost.
<div align=center>
<img src="./Pic/theory.png"/>
</div>
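As a rough illustration of the staged recipe above, the sketch below shows which parameter groups would be trainable in each stage. The module attribute names (`vision_model`, `mlp1`, `language_model`) are assumed from the public InternVL Hugging Face implementation, and the exact freezing schedule is a simplification of the description above rather than this repository's training configuration.

```python
# Illustrative only: a simplified view of which parameter groups each stage updates.
TRAINABLE_BY_STAGE = {
    "stage1_mlp_warmup":        {"vision_model": False, "mlp1": True, "language_model": False},
    "stage1_5_vit_incremental": {"vision_model": True,  "mlp1": True, "language_model": False},
    "stage2_full_instruction":  {"vision_model": True,  "mlp1": True, "language_model": True},
}

def configure_stage(model, stage: str) -> None:
    """Freeze or unfreeze each module group according to the (simplified) table above."""
    for attr, trainable in TRAINABLE_BY_STAGE[stage].items():
        for param in getattr(model, attr).parameters():
            param.requires_grad = trainable
```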
## Environment Setup

| Software | Version |
| :------: | :------: |
| DTK | 24.04.3 |
| python | 3.10 |
| torch | 2.3.0 |
| transformers | >=4.37.2 |
| flash-attn | 2.6.1 |

Tips: the DTK driver, python, torch and the other DCU-related components above must be used in exactly these matching versions.

### Docker (Method 1)

Running inside Docker is recommended. The image below can be pulled from [光源](https://www.sourcefind.cn/#/service-details), and more images are available at [光源](https://sourcefind.cn/#/service-list). Adjust the mount paths (`-v`), the container name and the image name to your actual setup.

```bash
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.3.0-py3.10-dtk24.04.3-ubuntu20.04
docker run -it --shm-size=1024G -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal:/opt/hyhal --network=host --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name internvl image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.3.0-py3.10-dtk24.04.3-ubuntu20.04 bash
git clone http://developer.sourcefind.cn/codes/modelzoo/internvl2.5_pytorch.git
cd /path/your_code_data/
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install accelerate
```
### Dockerfile (Method 2)

Build and run an image from the Dockerfile provided in this repository:

```bash
docker build -t internvl:latest .
docker run --shm-size 500g --network=host --name=internvl --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v <absolute path to this project>:/home/ -v /opt/hyhal:/opt/hyhal:ro -it internvl:latest bash
git clone http://developer.sourcefind.cn/codes/modelzoo/internvl2.5_pytorch.git
cd /path/your_code_data/
```

The special deep-learning libraries that this project requires on DCU GPUs can be downloaded from the [光合](https://developer.sourcefind.cn/tool/) developer community; install the remaining packages from requirements.txt:

```bash
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install accelerate
```
### Anaconda (Method 3)

For a local (non-Docker) setup, the required base environment is:

```
DTK driver: dtk24.04.3
python: 3.10
torch: 2.3.0
flash-attn: 2.6.1
```

`Tips: the DTK driver, python, torch and other DCU-related components must be used in exactly these matching versions.`

The special deep-learning libraries that this project requires on DCU GPUs can be downloaded from the [光合](https://developer.hpccube.com/tool/) developer community. The other, non deep-learning packages are installed from requirements.txt:

```
git clone http://developer.sourcefind.cn/codes/modelzoo/internvl2.5_pytorch.git
cd /path/your_code_data/
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install accelerate
```
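Whichever method is used, a quick sanity check (a suggestion, not part of the original instructions) can confirm that the installed versions match the table above and that the DCUs are visible to PyTorch:

```python
# Sanity check: compare installed versions with the dependency table and confirm
# that the DCUs are visible (HIP devices surface through PyTorch's CUDA interface).
import torch
import transformers
import flash_attn

print("torch:", torch.__version__)                # expected 2.3.0 (DTK/DCU build)
print("transformers:", transformers.__version__)  # expected >= 4.37.2
print("flash-attn:", flash_attn.__version__)      # expected 2.6.1
print("visible devices:", torch.cuda.device_count())
```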
## Dataset

None at present.

## Training

None at present.
## Inference

### transformers

#### Single-node inference

The example below uses [OpenGVLab/InternVL2_5-26B](https://www.modelscope.cn/models/OpenGVLab/InternVL2_5-26B).

```bash
export HIP_VISIBLE_DEVICES=0,1
python internvl_inference.py
```
## Results

- Multimodal inference
<div align=left>
<img src="./Pic/result.png"/>
</div>

### Accuracy

DCU results match GPU results; inference framework: transformers.
## Application Scenarios

### Algorithm Category

`Dialogue and question answering`

### Key Application Industries

`Research, education, government, finance`
## Pretrained Weights

Models can also be searched and downloaded on [SCNet](https://www.scnet.cn/ui/aihub/models) ([OpenGVLab/InternVL2_5 on SCNet](https://www.scnet.cn/ui/aihub/models/OpenGVLab/InternVL2_5)) and from the [ModelScope collection](https://www.modelscope.cn/collections/InternVL-25-fbde6e47302942).

| Model | Weight size | DCU model | Minimum cards | Download |
|:-----:|:----------:|:----------:|:---------------------:|:----------:|
| InternVL 2.5 | 1B | K100AI | 1 | [ModelScope](https://www.modelscope.cn/models/OpenGVLab/InternVL2_5-1B) |
| InternVL 2.5 | 2B | K100AI | 1 | [ModelScope](https://www.modelscope.cn/models/OpenGVLab/InternVL2_5-2B) |
| InternVL 2.5 | 4B | K100AI | 1 | [ModelScope](https://www.modelscope.cn/models/OpenGVLab/InternVL2_5-4B) |
| InternVL 2.5 | 8B | K100AI | 1 | [ModelScope](https://www.modelscope.cn/models/OpenGVLab/InternVL2_5-8B) |
| InternVL 2.5 | 26B | K100AI | 2 | [ModelScope](https://www.modelscope.cn/models/OpenGVLab/InternVL2_5-26B) |
| InternVL 2.5 | 38B | K100AI | 2 | [ModelScope](https://www.modelscope.cn/models/OpenGVLab/InternVL2_5-38B) |
| InternVL 2.5 | 78B | K100AI | 4 | [ModelScope](https://www.modelscope.cn/models/OpenGVLab/InternVL2_5-78B) |
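As a convenience, the weights can also be pulled programmatically from ModelScope with the `modelscope` Python package (assuming it is installed; the cache directory below is illustrative):

```python
# Download the InternVL2_5-26B weights from ModelScope into a local directory.
from modelscope import snapshot_download

model_dir = snapshot_download("OpenGVLab/InternVL2_5-26B", cache_dir="./weights")
print("weights downloaded to:", model_dir)
```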
## Source Repository and Issue Feedback

- https://developer.sourcefind.cn/codes/modelzoo/internvl2.5_pytorch

## References

- https://modelscope.cn/models/OpenGVLab/InternVL2_5-8B
- https://github.com/OpenGVLab/InternVL
internvl_inference.py:

import math
import torch
import numpy as np
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
...@@ -9,9 +12,6 @@ from transformers import AutoModel, AutoTokenizer
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
def split_model(model_name):
    # builds a per-layer device_map so that a large model can be sharded across multiple DCUs
    device_map = {}
...@@ -40,17 +40,6 @@ def split_model(model_name):
    return device_map
path = "OpenGVLab/InternVL2_5-8B"
device_map = split_model('InternVL2_5-8B')
model = AutoModel.from_pretrained(
path,
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
use_flash_attn=True,
trust_remote_code=True,
device_map=device_map).eval()
def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
...@@ -123,17 +112,20 @@ def load_image(image_file, input_size=448, max_num=12):
    return pixel_values

# If you want to load a model using multiple GPUs, please refer to the `Multiple GPUs` section.
path = "OpenGVLab/InternVL2_5-26B"
device_map = split_model('InternVL2_5-26B')
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)
# pure-text conversation (纯文本对话)
...@@ -160,8 +152,8 @@ response, history = model.chat(tokenizer, pixel_values, question, generation_con
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, combined images (多图多轮对话,拼接图像)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

question = '<image>\nDescribe the two images in detail.'
...@@ -175,8 +167,8 @@ response, history = model.chat(tokenizer, pixel_values, question, generation_con
print(f'User: {question}\nAssistant: {response}')
# multi-image multi-round conversation, separate images (多图多轮对话,独立图像)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

...@@ -193,8 +185,8 @@ response, history = model.chat(tokenizer, pixel_values, question, generation_con
print(f'User: {question}\nAssistant: {response}')
# batch inference, single image per sample (单图批处理)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
...@@ -239,7 +231,7 @@ def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=3
    pixel_values = torch.cat(pixel_values_list)
    return pixel_values, num_patches_list
video_path = './examples/red-panda.mp4'
pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
...@@ -253,4 +245,3 @@ question = 'Describe this video in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')