Commit 1bada89a authored by chenych

Update README and repo

parent 1efa1bca
Dockerfile:
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.3.0-py3.10-dtk24.04.3-ubuntu20.04
MIT License
Copyright (c) 2023 OpenGVLab
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# InternVL2.5

## Paper

[Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling](https://arxiv.org/abs/2412.05271)
## Model Overview

InternVL 2.5 retains the same model architecture as its predecessors InternVL 1.5 and 2.0, following the "ViT-MLP-LLM" paradigm. In this release, a newly incrementally pre-trained InternViT is combined with various pre-trained LLMs (including InternLM 2.5 and Qwen 2.5) through a randomly initialized MLP projector.
As in previous versions, a pixel unshuffle operation is applied to cut the number of visual tokens to a quarter, so each 448×448 image tile yields 256 visual tokens instead of 1024. A dynamic resolution strategy similar to InternVL 1.5 is also used, splitting images into 448×448-pixel tiles. The key addition since InternVL 2.0 is support for multi-image and video data alongside the existing single-image and text-only data.
<div align=center>
<img src="./Pic/arch.png"/>
</div>
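To make the token-reduction step concrete, below is a minimal sketch of a pixel-unshuffle operation, assuming a (batch, height, width, channels) feature layout: it folds each 2×2 neighborhood of ViT features into the channel dimension, so the 32×32 feature grid of a 448×448 tile (1024 tokens) becomes 16×16 (256 tokens). The function name and shapes are illustrative rather than the repository's exact implementation.

```python
import torch

def pixel_unshuffle(x: torch.Tensor, scale_factor: float = 0.5) -> torch.Tensor:
    # x: (batch, height, width, channels) visual features from the ViT
    n, h, w, c = x.size()
    # fold each 2x2 spatial neighborhood into the channel dimension
    x = x.view(n, h, int(w * scale_factor), int(c / scale_factor))
    x = x.permute(0, 2, 1, 3).contiguous()
    x = x.view(n, int(w * scale_factor), int(h * scale_factor), int(c / (scale_factor * scale_factor)))
    x = x.permute(0, 2, 1, 3).contiguous()
    return x

# A 448x448 tile gives a 32x32 grid of ViT features (1024 tokens); after the
# unshuffle it becomes 16x16 (256 tokens) with 4x wider channels.
feats = torch.randn(1, 32, 32, 1024)
print(pixel_unshuffle(feats).shape)  # torch.Size([1, 16, 16, 4096])
```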
## Algorithm Principle

InternVL 2.5 is trained in three stages: Stage 1 (MLP warm-up), an optional Stage 1.5 (ViT incremental learning), and Stage 2 (full-model instruction tuning). This multi-stage design progressively strengthens vision-language alignment, stabilizes training, and prepares the modules for integration with larger LLMs. Under the accompanying progressive scaling strategy, a ViT module trained alongside a smaller LLM in the early stages can later be integrated with a larger LLM, achieving scalable model alignment at an affordable resource cost.
<div align=center>
<img src="./Pic/theory.png"/>
</div>
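As a rough illustration of the staged recipe above, the sketch below shows which parameter groups would be trainable in each stage. The module attribute names (`vision_model`, `mlp1`, `language_model`) are assumed from the public InternVL Hugging Face implementation, and the exact freezing schedule is a simplification of the description above rather than this repository's training configuration.

```python
# Illustrative only: a simplified view of which parameter groups each stage updates.
TRAINABLE_BY_STAGE = {
    "stage1_mlp_warmup":        {"vision_model": False, "mlp1": True, "language_model": False},
    "stage1_5_vit_incremental": {"vision_model": True,  "mlp1": True, "language_model": False},
    "stage2_full_instruction":  {"vision_model": True,  "mlp1": True, "language_model": True},
}

def configure_stage(model, stage: str) -> None:
    """Freeze or unfreeze each module group according to the (simplified) table above."""
    for attr, trainable in TRAINABLE_BY_STAGE[stage].items():
        for param in getattr(model, attr).parameters():
            param.requires_grad = trainable
```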
## Environment Setup

| Software | Version |
| :------: | :------: |
| DTK | 24.04.3 |
| python | 3.10 |
| torch | 2.3.0 |
| transformers | >=4.37.2 |
| flash-attn | 2.6.1 |

Tips: the DTK driver, python, torch and the other DCU-related components above must be used in exactly these matching versions.

### Docker (Method 1)

Running inside Docker is recommended. The image below can be pulled from [光源](https://www.sourcefind.cn/#/service-details), and more images are available at [光源](https://sourcefind.cn/#/service-list). Adjust the mount paths (`-v`), the container name and the image name to your actual setup.

```bash
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.3.0-py3.10-dtk24.04.3-ubuntu20.04
docker run -it --shm-size=1024G -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal:/opt/hyhal --network=host --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name internvl image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.3.0-py3.10-dtk24.04.3-ubuntu20.04 bash
git clone http://developer.sourcefind.cn/codes/modelzoo/internvl2.5_pytorch.git
cd /path/your_code_data/
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install accelerate
```
### Dockerfile (Method 2)

Build and run an image from the Dockerfile provided in this repository:

```bash
docker build -t internvl:latest .
docker run --shm-size 500g --network=host --name=internvl --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v <absolute path to this project>:/home/ -v /opt/hyhal:/opt/hyhal:ro -it internvl:latest bash
git clone http://developer.sourcefind.cn/codes/modelzoo/internvl2.5_pytorch.git
cd /path/your_code_data/
```

The special deep-learning libraries that this project requires on DCU GPUs can be downloaded from the [光合](https://developer.sourcefind.cn/tool/) developer community; install the remaining packages from requirements.txt:

```bash
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install accelerate
```
### Anaconda (Method 3)

For a local (non-Docker) setup, the required base environment is:

```
DTK driver: dtk24.04.3
python: 3.10
torch: 2.3.0
flash-attn: 2.6.1
```

`Tips: the DTK driver, python, torch and other DCU-related components must be used in exactly these matching versions.`

The special deep-learning libraries that this project requires on DCU GPUs can be downloaded from the [光合](https://developer.hpccube.com/tool/) developer community. The other, non deep-learning packages are installed from requirements.txt:

```
git clone http://developer.sourcefind.cn/codes/modelzoo/internvl2.5_pytorch.git
cd /path/your_code_data/
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install accelerate
```
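Whichever method is used, a quick sanity check (a suggestion, not part of the original instructions) can confirm that the installed versions match the table above and that the DCUs are visible to PyTorch:

```python
# Sanity check: compare installed versions with the dependency table and confirm
# that the DCUs are visible (HIP devices surface through PyTorch's CUDA interface).
import torch
import transformers
import flash_attn

print("torch:", torch.__version__)                # expected 2.3.0 (DTK/DCU build)
print("transformers:", transformers.__version__)  # expected >= 4.37.2
print("flash-attn:", flash_attn.__version__)      # expected 2.6.1
print("visible devices:", torch.cuda.device_count())
```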
## Dataset

None at present.

## Training

None at present.
## Inference

### transformers

#### Single-node inference

The example below uses [OpenGVLab/InternVL2_5-26B](https://www.modelscope.cn/models/OpenGVLab/InternVL2_5-26B).

```bash
export HIP_VISIBLE_DEVICES=0,1
python internvl_inference.py
```
## Results

- Multimodal inference
<div align=left>
<img src="./Pic/result.png"/>
</div>

### Accuracy

DCU results match GPU results; inference framework: transformers.
## Application Scenarios

### Algorithm Category

`Dialogue and question answering`

### Key Application Industries

`Research, education, government, finance`
## Pretrained Weights

Models can also be searched and downloaded on [SCNet](https://www.scnet.cn/ui/aihub/models) ([OpenGVLab/InternVL2_5 on SCNet](https://www.scnet.cn/ui/aihub/models/OpenGVLab/InternVL2_5)) and from the [ModelScope collection](https://www.modelscope.cn/collections/InternVL-25-fbde6e47302942).

| Model | Weight size | DCU model | Minimum cards | Download |
|:-----:|:----------:|:----------:|:---------------------:|:----------:|
| InternVL 2.5 | 1B | K100AI | 1 | [ModelScope](https://www.modelscope.cn/models/OpenGVLab/InternVL2_5-1B) |
| InternVL 2.5 | 2B | K100AI | 1 | [ModelScope](https://www.modelscope.cn/models/OpenGVLab/InternVL2_5-2B) |
| InternVL 2.5 | 4B | K100AI | 1 | [ModelScope](https://www.modelscope.cn/models/OpenGVLab/InternVL2_5-4B) |
| InternVL 2.5 | 8B | K100AI | 1 | [ModelScope](https://www.modelscope.cn/models/OpenGVLab/InternVL2_5-8B) |
| InternVL 2.5 | 26B | K100AI | 2 | [ModelScope](https://www.modelscope.cn/models/OpenGVLab/InternVL2_5-26B) |
| InternVL 2.5 | 38B | K100AI | 2 | [ModelScope](https://www.modelscope.cn/models/OpenGVLab/InternVL2_5-38B) |
| InternVL 2.5 | 78B | K100AI | 4 | [ModelScope](https://www.modelscope.cn/models/OpenGVLab/InternVL2_5-78B) |
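As a convenience, the weights can also be pulled programmatically from ModelScope with the `modelscope` Python package (assuming it is installed; the cache directory below is illustrative):

```python
# Download the InternVL2_5-26B weights from ModelScope into a local directory.
from modelscope import snapshot_download

model_dir = snapshot_download("OpenGVLab/InternVL2_5-26B", cache_dir="./weights")
print("weights downloaded to:", model_dir)
```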
## Source Repository and Issue Feedback

- https://developer.sourcefind.cn/codes/modelzoo/internvl2.5_pytorch

## References

- https://modelscope.cn/models/OpenGVLab/InternVL2_5-8B
- https://github.com/OpenGVLab/InternVL
internvl_inference.py:

import math
import torch
import numpy as np
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
...@@ -9,9 +12,6 @@ from transformers import AutoModel, AutoTokenizer
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
def split_model(model_name):
    # builds a per-layer device_map so that a large model can be sharded across multiple DCUs
    device_map = {}
...@@ -40,17 +40,6 @@ def split_model(model_name):
    return device_map
path = "OpenGVLab/InternVL2_5-8B"
device_map = split_model('InternVL2_5-8B')
model = AutoModel.from_pretrained(
path,
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
use_flash_attn=True,
trust_remote_code=True,
device_map=device_map).eval()
def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
...@@ -123,17 +112,20 @@ def load_image(image_file, input_size=448, max_num=12):
    return pixel_values

# If you want to load a model using multiple GPUs, please refer to the `Multiple GPUs` section.
path = "OpenGVLab/InternVL2_5-26B"
device_map = split_model('InternVL2_5-26B')
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)
# pure-text conversation (纯文本对话)
...@@ -160,8 +152,8 @@ response, history = model.chat(tokenizer, pixel_values, question, generation_con
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, combined images (多图多轮对话,拼接图像)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

question = '<image>\nDescribe the two images in detail.'
...@@ -175,8 +167,8 @@ response, history = model.chat(tokenizer, pixel_values, question, generation_con
print(f'User: {question}\nAssistant: {response}')
# multi-image multi-round conversation, separate images (多图多轮对话,独立图像)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

...@@ -193,8 +185,8 @@ response, history = model.chat(tokenizer, pixel_values, question, generation_con
print(f'User: {question}\nAssistant: {response}')
# batch inference, single image per sample (单图批处理)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
...@@ -239,7 +231,7 @@ def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=3
    pixel_values = torch.cat(pixel_values_list)
    return pixel_values, num_patches_list
video_path = './examples/red-panda.mp4'
pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
...@@ -253,4 +245,3 @@ question = 'Describe this video in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')