# GLM-4V

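**GLM-4V-9B** supports bilingual (Chinese and English) multi-turn dialogue at a high resolution of 1120 x 1120. Across multimodal evaluations covering overall Chinese and English ability, perception and reasoning, text recognition, and chart understanding, GLM-4V-9B outperforms GPT-4-turbo-2024-04-09, Gemini 1.0 Pro, Qwen-VL-Max, and Claude 3 Opus.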

## Paper

- [GLM: General Language Model Pretraining with Autoregressive Blank Infilling](https://arxiv.org/abs/2103.10360)

## Model architecture

GLM-4-9B is the open-source version of GLM-4, the latest generation of pretrained models released by Zhipu AI.

<div align="center">
    <img src="./images/GLM.png"/>
</div>

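## Algorithm

Alongside strengthening text capabilities, we are releasing GLM-4V-9B, the first open-source multimodal model built on the GLM base model. It adopts an architecture similar to CogVLM2, accepts inputs at resolutions up to 1120 x 1120, and uses downsampling to reduce the number of visual tokens. To keep deployment and compute costs low, GLM-4V-9B does not add a separate visual expert module. Instead, it is trained directly on mixed text and image data, which improves multimodal ability while preserving text performance.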
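As a rough worked example of what downsampling saves, the sketch below counts visual tokens for a 1120 x 1120 input. The patch size of 14 and the 2 x 2 downsampling factor are assumptions, not values stated in this document; check them against the configuration of your checkpoint.

```python
def visual_token_count(image_size: int = 1120,
                       patch_size: int = 14,   # assumed ViT patch size
                       downsample: int = 2):   # assumed downsampling factor per side
    """Return (patches_before, tokens_after) for a square input image."""
    grid = image_size // patch_size      # patches along one side
    patches_before = grid * grid         # patches produced by the vision encoder
    reduced = grid // downsample         # side length after downsampling
    tokens_after = reduced * reduced     # visual tokens passed to the language model
    return patches_before, tokens_after


before, after = visual_token_count()
print(f"{before} patches -> {after} visual tokens")
```

Under these assumptions the downsampling step cuts the visual sequence length by a factor of four.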

<div align=center>
    <img src="./images/mt.png"/>
</div>

## Environment setup

### Docker (method 1)

The address for pulling the Docker image from [光源](https://www.sourcefind.cn/#/service-details) and the usage steps are as follows:

```
docker pull image.sourcefind.cn:5000/dcu/admin/base/vllm:0.9.2-ubuntu22.04-dtk25.04.1-rc5-rocblas104381-0915-das1.6-py3.10-20250916-rc2
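# Replace <Image ID> with the ID of the Docker image pulled above
# <Host Path>: path on the host
# <Container Path>: mount path inside the container
# To map ports between the host and the container, remove the --network host parameter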
docker run -it --name internlm_vllm --privileged --shm-size=64G  --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal -v <Host Path>:<Container Path> <Image ID> /bin/bash
```

### Dockerfile (method 2)

```
# <Host Path>: path on the host
# <Container Path>: mount path inside the container
docker build -t internlm:latest .
docker run -it --name internlm_vllm --privileged --shm-size=64G  --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal:ro -v <Host Path>:<Container Path> internlm:latest /bin/bash
```

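### Anaconda (method 3)

The special deep learning libraries that this project requires for DCU cards can be downloaded and installed from the [光合](https://developer.sourcefind.cn/tool/) developer community.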

```
* DTK driver: dtk25.04.01
* Pytorch: 2.5.1
* triton: 3.0.0
* lmslim: 0.3.1
* flash_attn: 2.6.1
* flash_mla: 1.0.0
* vllm: 0.9.2
* python: python3.10
```

`Tips: the versions of the DCU-related tools listed above (DTK driver, Python, PyTorch, etc.) must match each other exactly`
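Because the versions have to line up exactly, it can help to check the active environment before launching anything. The following is a minimal sketch using only the standard library; the distribution names (for example `torch` for Pytorch) are assumptions and may differ from the names of the DCU wheels you installed. A version counts as matching when it equals the required one or only adds a build suffix.

```python
from importlib import metadata

# Required versions copied from the list above. The distribution names are
# assumptions; adjust them to match the wheels you actually installed.
REQUIRED = {
    "torch": "2.5.1",
    "triton": "3.0.0",
    "lmslim": "0.3.1",
    "flash_attn": "2.6.1",
    "flash_mla": "1.0.0",
    "vllm": "0.9.2",
}


def find_mismatches(required, installed):
    """Return {name: (wanted, found)} for missing or mismatched packages."""
    problems = {}
    for name, wanted in required.items():
        found = installed.get(name)
        # Accept an exact match or a build suffix such as "+local" or ".post1".
        matches = found is not None and (
            found == wanted or found.startswith((wanted + "+", wanted + "."))
        )
        if not matches:
            problems[name] = (wanted, found)
    return problems


def installed_versions(names):
    """Look up installed versions; packages that are not installed map to None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    report = find_mismatches(REQUIRED, installed_versions(REQUIRED))
    for name, (wanted, found) in report.items():
        print(f"{name}: want {wanted}, found {found}")
```

An empty report means every listed package was found at the expected version.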

```
conda create -n glm-4v python=3.10
```

Environment variables:

    export VLLM_NUMA_BIND=1
    export ALLREDUCE_STREAM_WITH_COMPUTE=1
    export VLLM_RANK0_NUMA=0
    export VLLM_RANK1_NUMA=1
    export VLLM_RANK2_NUMA=2
    export VLLM_RANK3_NUMA=3
    export VLLM_RANK4_NUMA=4
    export VLLM_RANK5_NUMA=5
    export VLLM_RANK6_NUMA=6
    export VLLM_RANK7_NUMA=7
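Judging by their names, the `VLLM_RANK<i>_NUMA` variables above pin each rank to a NUMA node, with rank `i` mapped to node `i`. The sketch below generates the same settings for an arbitrary number of ranks; the one-to-one mapping is an assumption carried over from the listing, and the helper name `numa_binding_env` is hypothetical.

```python
def numa_binding_env(num_ranks: int) -> dict:
    """Build the environment variables listed above for `num_ranks` ranks.

    Assumes rank i should bind to NUMA node i, as in the listing above.
    """
    env = {
        "VLLM_NUMA_BIND": "1",
        "ALLREDUCE_STREAM_WITH_COMPUTE": "1",
    }
    for rank in range(num_ranks):
        env[f"VLLM_RANK{rank}_NUMA"] = str(rank)
    return env


if __name__ == "__main__":
    # Print shell-ready export lines for an 8-rank node.
    for key, value in numa_binding_env(8).items():
        print(f"export {key}={value}")
```

If the machine has fewer NUMA nodes than ranks, the mapping needs to be adjusted by hand.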

## Dataset

None

## Inference

### Model download

| Base model                                          |
| ------------------------------------------------ |
| [glm-4v-9b](https://huggingface.co/THUDM/glm-4v-9b) |

## Model inference

```bash
python examples/offline_inference/vision_language.py
```

## OpenAI-compatible server

Start the server:

`cd examples`

```
vllm serve model_path --trust-remote-code --port 8000 --host 0.0.0.0 --allowed-local-media-path xxxx --hf-overrides '{"architectures": ["GLM4VForCausalLM"]}' --chat-template examples/template_chatml.jinja
```

### Using the OpenAI Chat Completions API with vLLM

```bash
curl -X POST http://localhost:8000/v1/chat/completions   -H "Content-Type: application/json"   -d '{ 
    "model": "model_path", 
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is the content of this image?"},
          {"type": "image_url", "image_url": {"url": "xxx"}} 
        ]
      }
    ]
  }'
```
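The same request can be sent from Python. The sketch below uses only the standard library and mirrors the payload of the curl call above; the helper names are hypothetical, and `image_to_data_url` assumes the server accepts base64 `data:` URLs in the `image_url` field.

```python
import base64
import json
import mimetypes
import urllib.request


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime or 'application/octet-stream'};base64,{encoded}"


def build_chat_request(model: str, question: str, image_url: str) -> dict:
    """Build the same JSON body as the curl example above."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def post_chat(base_url: str, payload: dict, timeout: float = 120.0) -> dict:
    """POST the payload to <base_url>/chat/completions and return the parsed JSON."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the server running, `post_chat("http://localhost:8000/v1", build_chat_request("model_path", "What is the content of this image?", "xxx"))` returns the response as a dictionary, and the answer text is found under `["choices"][0]["message"]["content"]`.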

### Using Gradio with vLLM

1. Install Gradio

```
pip install gradio
```

2. Install the required files and set up port forwarding

2.1 Start the Gradio service and follow the prompts

```
python gradio_openai_vlm_webserver.py --model model_path --model-url http://localhost:8000/v1 --host "0.0.0.0" --port 8001
```

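2.2 Change file permissions

Open the download directory shown in the prompt and run the following command to make the file executable: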

```
chmod +x frpc_linux_amd64_v0.*
```

2.3 Port forwarding

```
ssh -L 8000:<compute node IP>:8000 -L 8001:<compute node IP>:8001 <username>@<login node> -p <login node port>
```

3. Start the OpenAI-compatible server

`cd examples`

```
vllm serve model_path --trust-remote-code --port 8000 --host 0.0.0.0 --allowed-local-media-path xxxx --hf-overrides '{"architectures": ["GLM4VForCausalLM"]}' --chat-template examples/template_chatml.jinja
```

4. Start the Gradio service

```
python  gradio_openai_vlm_webserver.py --model model_path --model-url http://localhost:8000/v1 --host "0.0.0.0" --port 8001
```

5. Use the chat service

Enter the local URL in your browser to use the chat service provided by Gradio.
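If the page does not load, a quick first check is whether the two forwarded ports are reachable from the local machine. The sketch below is a hypothetical helper built on the standard `socket` module; it assumes the ports 8000 (vLLM) and 8001 (Gradio) used in the commands above.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    for name, port in (("vLLM API", 8000), ("Gradio UI", 8001)):
        state = "reachable" if is_port_open("localhost", port) else "not reachable"
        print(f"{name} on port {port}: {state}")
```

A port that is not reachable usually means the corresponding service has not started yet or the SSH tunnel is not running.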

## Results

### Offline inference

Accelerator: single K100_AI card     Model: [glm-4v-9b](https://huggingface.co/THUDM/glm-4v-9b)

Input:

images:

<div align="center">
    <img src="./images/images.png" width="300" height="200"/>
</div>

text: What is the content of this image?

Output:

output: The image features a close-up view of a stop sign on a city street

### Gradio service

Accelerator: single K100_AI card     Model: [glm-4v-9b](https://huggingface.co/THUDM/glm-4v-9b)

<div align=center>
    <img src="./images/glm_gradio.png" width="800" height="500"/>
</div>

### Accuracy

None

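## Application scenarios

### Algorithm category

`ocr`

### Popular application industries

`Finance, education, government, scientific research, manufacturing, energy, transportation`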

## Source repository and issue feedback

- [ModelZoo / GLM-4V_vllm · GitLab](https://developer.sourcefind.cn/codes/modelzoo/glm-4v_vllm)

## References

- [GLM: General Language Model Pretraining with Autoregressive Blank Infilling](https://arxiv.org/abs/2103.10360)
- [GLM4v github](https://github.com/THUDM/GLM-4)
- [swift github](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/glm4v%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5.md)