# VTimeLLM
## Paper
`VTimeLLM: Empower LLM to Grasp Video Moments`
- https://arxiv.org/abs/2311.18445
## Model Architecture
VTimeLLM consists of two parts: (1) a visual encoder plus a visual adapter that process the input video; (2) a purpose-built LLM trained in three stages so that the model gains both grounding and chat abilities.

Stage 1: image-text alignment. Image-text pairs are used to align visual features with the LLM in the semantic space.

Stage 2: single-turn QA tasks built from dense video captions, together with multi-turn QA tasks covering segment description and temporal grounding, give VTimeLLM temporal awareness so it can localize video segments.

Stage 3: a high-quality dialogue dataset is created for instruction fine-tuning, aligning the model with human intent.
<div align=center>
    <img src="./doc/VTimeLLM.PNG"/>
</div>
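As an illustration of the stage-2 training data, a temporal-grounding sample might take the following LLaVA-style conversation shape. All field values here are hypothetical; consult the stage2.json file linked in the Datasets section for the actual schema.

```python
# Hypothetical stage-2 grounding sample (illustrative only; the real schema
# lives in the stage2.json dataset file referenced later in this README).
sample = {
    "id": "v_example",
    "conversations": [
        {"from": "human",
         "value": "<video>\nDuring which frames does the person open the door?"},
        {"from": "gpt",
         "value": "From 12 to 45."},  # indices within the 100 uniformly sampled frames
    ],
}
```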

## Algorithm
Visual Encoder: the CLIP ViT-L/14 model extracts, for every frame, a cls-token feature and per-patch features; the cls-token feature v_cls serves as the frame's representation.

Visual Adapter: a single linear layer that maps each frame's v_cls into the LLM embedding space, so the video is finally represented by an N*d feature matrix Z (N: number of frames, d: LLM hidden size); 100 frames are sampled uniformly.
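The sampling and projection steps above can be sketched as follows. The dimensions are illustrative assumptions (768-d CLIP features, 4096-d hidden size as in Vicuna-7B), and the random matrices merely stand in for the learned adapter weights:

```python
import numpy as np

def sample_frames(total_frames: int, n: int = 100) -> np.ndarray:
    """Uniformly sample n frame indices from a video with total_frames frames."""
    return np.linspace(0, total_frames - 1, n).round().astype(int)

def project_features(v_cls: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Linear adapter: map per-frame cls features (N, d_clip) to the LLM space (N, d_llm)."""
    return v_cls @ W + b

d_clip, d_llm, n_frames = 768, 4096, 100      # illustrative sizes
idx = sample_frames(total_frames=1500, n=n_frames)
v_cls = np.random.randn(n_frames, d_clip)     # stand-in for CLIP cls-token features
W, b = np.random.randn(d_clip, d_llm) * 0.01, np.zeros(d_llm)
Z = project_features(v_cls, W, b)             # (100, 4096) video representation
```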

Vicuna: the LLM itself. A <video> placeholder token denotes the video content, and the visual features Z are spliced into the text embeddings at that position.
<div align=center>
    <img src="./doc/VTimeLLM.PNG"/>
</div>
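The splicing step can be sketched as below: the single placeholder embedding is replaced by the N projected frame features, lengthening the input sequence. Shapes are illustrative (4096-d hidden size as in Vicuna-7B):

```python
import numpy as np

def splice_video_features(text_emb: np.ndarray, video_pos: int, Z: np.ndarray) -> np.ndarray:
    """Replace the single <video> placeholder embedding at video_pos
    with the N projected frame features Z before feeding the LLM."""
    return np.concatenate([text_emb[:video_pos], Z, text_emb[video_pos + 1:]], axis=0)

d = 4096                            # illustrative LLM hidden size
text_emb = np.random.randn(12, d)   # 12 text tokens, one of them the <video> placeholder
Z = np.random.randn(100, d)         # 100 projected frame features
out = splice_video_features(text_emb, video_pos=3, Z=Z)
# resulting sequence length: 12 - 1 + 100 = 111
```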

## Code Changes
deepspeed 0.14.2 requires transformers 4.31.0, and that version does not support bf16/tf32 precision on the k100ai. The transformers code is modified as follows to avoid precision errors:

Note: this repository already contains the modified code; no further changes are needed.
```
pip show pip  # locate the site-packages directory where dependencies are installed

1. site-packages/transformers/utils/import_utils.py, modify def is_torch_bf16_gpu_available():
...
# TODO: if torch.cuda.is_available() and torch.version.cuda is not None:
if torch.cuda.is_available():
    if torch.cuda.get_device_properties(torch.cuda.current_device()).major < 8:
        return False
    # if int(torch.version.cuda.split(".")[0]) < 11:
    #     return False
    if not hasattr(torch.cuda.amp, "autocast"):
        return False
else:
    return False

2. site-packages/transformers/utils/import_utils.py, modify def is_torch_tf32_available():
...
# TODO: if not torch.cuda.is_available() or torch.version.cuda is None:
if not torch.cuda.is_available():
    return False
if torch.cuda.get_device_properties(torch.cuda.current_device()).major < 8:
    return False
# if int(torch.version.cuda.split(".")[0]) < 11:
#     return False
if version.parse(version.parse(torch.__version__).base_version) < version.parse("1.7"):
    return False
return True
```
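The intent of both patches is the same: drop the `torch.version.cuda` check (which yields `None` on DCU/ROCm builds of PyTorch) and gate only on the device's compute-capability major version. A standalone sketch of the patched logic, with the device query stubbed into plain parameters for illustration (not a drop-in replacement):

```python
def bf16_gpu_available(cuda_available: bool, device_major: int,
                       has_autocast: bool = True) -> bool:
    """Patched availability check: no torch.version.cuda test, so builds
    where torch.version.cuda is None (e.g. DCU/ROCm) are not rejected."""
    if not cuda_available:
        return False
    if device_major < 8:        # bf16/tf32 need compute capability >= 8
        return False
    if not has_autocast:
        return False
    return True
```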

## Environment Setup
### Docker (Method 1)
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.2-py3.10
docker run -it --name=VTimeLLM --network=host --privileged=true --device=/dev/kfd --device=/dev/dri --shm-size=16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v /path/your_code_data:/path/VTimeLLM -v /opt/hyhal/:/opt/hyhal/:ro <imageID> bash  # replace <imageID> with the ID of the image pulled above

cd VTimeLLM

# Install dependencies
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install -r requirements.txt

export HF_ENDPOINT=https://hf-mirror.com
# Fix the deepspeed async_io error
apt update
apt install gcc libaio-dev
```
### Dockerfile (Method 2)
```
docker build --no-cache -t vtimellm:latest .
docker run -it --name=VTimeLLM --network=host --privileged=true --device=/dev/kfd --device=/dev/dri --shm-size=16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v /path/your_code_data:/path/VTimeLLM -v /opt/hyhal/:/opt/hyhal/:ro vtimellm /bin/bash

cd VTimeLLM

# Install dependencies
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install -r requirements.txt

export HF_ENDPOINT=https://hf-mirror.com
# Fix the deepspeed async_io error
apt update
apt install gcc libaio-dev
```
### Anaconda (Method 3)
1. The DCU-specific deep-learning libraries required by this project can be downloaded from the developer community: https://developer.hpccube.com/tool/
```
DTK software stack: dtk24.04.2
python: python3.10
pytorch: 2.1.0
torchvision: 0.16.0
deepspeed: 0.14.2
flash-attn: 2.0.4
```
`Tips: the versions of the DTK stack, python, pytorch, and the other DCU-related tools above must match each other exactly.`

2. The remaining, non-special libraries are installed with the steps below
```
cd VTimeLLM

# Install dependencies
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install -r requirements.txt

export HF_ENDPOINT=https://hf-mirror.com
# Fix the deepspeed async_io error
apt update
apt install gcc libaio-dev
```
## Datasets
### Training Datasets
VTimeLLM can be trained in an English version based on Vicuna v1.5 or a Chinese version based on ChatGLM3-6b; when training one version, only the corresponding dataset (the data differs) needs to be downloaded.
The training data consists of two parts: the three-stage dataset files (data) and the pre-extracted features (feat). Both can be downloaded from [scnet](http://113.200.138.88:18080/aidatasets/project-dependency/vtimellm) or via the official links below.

Note: this repository also provides a small dataset for training tests, roughly 。。。。。 of the full dataset; it can be downloaded from scnet.

1. Download data

(1) VTimeLLM-7B:
* [stage1.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain/blob/main/blip_laion_cc_sbu_558k.json) 
* [stage2.json](https://cloud.tsinghua.edu.cn/d/6db5d02883124826aa6f/files/?p=%2Fdata%2Fstage2.json)
* [stage3.json](https://cloud.tsinghua.edu.cn/d/6db5d02883124826aa6f/files/?p=%2Fdata%2Fstage3.json)
(2)ChatGLM3-6b:
* [stage1/2/3.json](https://cloud.tsinghua.edu.cn/d/6db5d02883124826aa6f/files/?p=%2Fdata%2Fdata_Chinese.zip)
2. Download feat
* [feat_list](https://cloud.tsinghua.edu.cn/d/6db5d02883124826aa6f/?p=%2Ffeat&mode=list)
To extract the feat archives:
```
cd VTimeLLM/feat
tar -xzvf stage1.tar.gz
cat stage2_part_* > stage2.tar.gz
tar -xzvf stage2.tar.gz
tar -xzvf stage3.tar.gz
```
Taking VTimeLLM-7B (the Vicuna-based version) as an example, the dataset directory structure is:
```
VTimeLLM
├── data
│   ├── blip_laion_cc_sbu_558k.json
│   ├── stage2.json
│   └── stage3.json
└── feat
    ├── 558k_clip_feat
    ├── intern_clip_feat
    └── stage3_clip_feat
```
### Inference Data
The data used for the inference test is already stored at VTimeLLM/images/demo.mp4.
## Training
VTimeLLM can be trained in an English version based on Vicuna v1.5 or a Chinese version based on ChatGLM3-6b; only the model for the version being trained needs to be downloaded.
Training requires downloading the clip and Vicuna v1.5 (or ChatGLM3-6b) weights and placing them in the 'checkpoints' directory. Download links:

1. Download the clip model
* [scnet](http://113.200.138.88:18080/aimodels/findsource-dependency/vtimellm)
* [Official link](https://cloud.tsinghua.edu.cn/d/6db5d02883124826aa6f/?p=%2Fcheckpoints&mode=list)
2-1. Download the Vicuna v1.5 weights
* [scnet](http://113.200.138.88:18080/aimodels/vicuna-7b-v1.5)
* [Official link](https://huggingface.co/lmsys/vicuna-7b-v1.5/tree/main)
* Download via code (huggingface)
```
cd VTimeLLM
export HF_ENDPOINT=https://hf-mirror.com
export HF_DATASETS_CACHE="./checkpoints/vicuna-7b-v1.5"
huggingface-cli download --resume-download lmsys/vicuna-7b-v1.5 --local-dir checkpoints/vicuna-7b-v1.5 --local-dir-use-symlinks False
```
2-2. Download the ChatGLM3-6b weights
* [scnet](http://113.200.138.88:18080/aimodels/chatglm3-6b)
* [Official link](https://huggingface.co/THUDM/chatglm3-6b/tree/main)
* Download via code (huggingface)
```
cd VTimeLLM
export HF_ENDPOINT=https://hf-mirror.com
export HF_DATASETS_CACHE="./checkpoints/chatglm3-6b"
huggingface-cli download --resume-download THUDM/chatglm3-6b --local-dir checkpoints/chatglm3-6b
```
Taking the Vicuna v1.5-based VTimeLLM as an example, the model directory structure is:
```
VTimeLLM
├── clip
│   └── ViT-L-14.pt
└── vicuna-7b-v1.5
    └── ...
```
Taking the Vicuna v1.5-based VTimeLLM as an example, run training with:
```
cd VTimeLLM
wandb off
sh scripts/stage1.sh
sh scripts/stage2.sh
sh scripts/stage3.sh
```
## Inference
An English version of VTimeLLM was trained on Vicuna v1.5 and stored as vtimellm-vicuna-v1-5-7b.tar.gz; a Chinese version was trained on ChatGLM3-6b and stored as vtimellm-chatglm3-6b.tar.gz. For inference, download only the model for the version you need.
Inference requires the clip, Vicuna v1.5 (or ChatGLM3-6b), and VTimeLLM weights, placed in the 'checkpoints' directory. See the training section for the clip and Vicuna v1.5 (or ChatGLM3-6b) downloads; the VTimeLLM weights are available at:
* [scnet](http://113.200.138.88:18080/aimodels/findsource-dependency/vtimellm) 
* [官网链接](https://cloud.tsinghua.edu.cn/d/6db5d02883124826aa6f/?p=%2Fcheckpoints&mode=list)

To extract the VTimeLLM weights:
```
cd VTimeLLM/checkpoints
tar -xzvf vtimellm-vicuna-v1-5-7b.tar.gz
# tar -xzvf vtimellm-chatglm3-6b.tar.gz
```
Taking the Vicuna v1.5-based VTimeLLM as an example, the model directory structure is:
```
VTimeLLM
├── clip
│   └── ViT-L-14.pt
├── vtimellm-vicuna-v1-5-7b-stage1
│   └── ...
├── vtimellm-vicuna-v1-5-7b-stage2
│   └── ...
├── vtimellm-vicuna-v1-5-7b-stage3
│   └── ...
└── vicuna-7b-v1.5
    └── ...
```

Taking the Vicuna v1.5-based VTimeLLM as an example, run inference with:
```
cd VTimeLLM
HIP_VISIBLE_DEVICES=0 python -m vtimellm.inference \
	--model_base "checkpoints/vicuna-7b-v1.5" \
	--pretrain_mm_mlp_adapter "checkpoints/vtimellm-vicuna-v1-5-7b-stage1/mm_projector.bin" \
	--stage2 "checkpoints/vtimellm-vicuna-v1-5-7b-stage2" \
	--stage3 "checkpoints/vtimellm-vicuna-v1-5-7b-stage3" \
	--video_path "images/demo.mp4"
```
For inference with the VTimeLLM-ChatGLM version, see VTimeLLM/docs/inference_for_glm.ipynb.
## Results
The default result of the inference run is:
<div align=center>
    <img src="./doc/inference_result.png"/>
</div>

### Accuracy
The default training results are:
|                                  | Test parameters                                  | Software stack | final loss |
| -------------------------------- | ------------------------------------------------ | ---------- | ---------- |
| A800 * 2<br/>(80G, 1410 MHz)   | MODEL_VERSION=vicuna-v1-5-7b<br/>bf16=True<br/>tf32=True  | cuda11.8   |  stage1: 2.415712<br/>stage2: 1.046057<br/>stage3: 1.283405  |
| k100ai * 2<br/>(64G, 1500 MHz) | MODEL_VERSION=vicuna-v1-5-7b<br/>bf16=True<br/>tf32=True  | dtk24.04.2 |  stage1: 2.414052<br/>stage2: 1.050350<br/>stage3: 1.265567  |

## Application Scenarios
### Algorithm Category
`Video understanding`
### Key Application Industries
`Furniture, e-commerce, healthcare, broadcast media, education`
## Pretrained Weights
- http://113.200.138.88:18080/aimodels/findsource-dependency/vtimellm (vtimellm, clip)

  http://113.200.138.88:18080/aimodels/vicuna-7b-v1.5.git (vicuna-7b-v1.5)
  
  http://113.200.138.88:18080/aimodels/chatglm3-6b (chatglm3-6b)
- https://cloud.tsinghua.edu.cn/d/6db5d02883124826aa6f/?p=%2Fcheckpoints&mode=list (vtimellm, clip)
  
  https://huggingface.co/lmsys/vicuna-7b-v1.5/tree/main (vicuna-7b-v1.5)
  
  https://huggingface.co/THUDM/chatglm3-6b/tree/main (chatglm3-6b)
## Source Code Repository and Issue Feedback
- https://developer.sourcefind.cn/codes/suily/vtimellm_pytorch
## References
- https://github.com/huangb23/VTimeLLM