# VTimeLLM
## Paper
`VTimeLLM: Empower LLM to Grasp Video Moments`
- https://arxiv.org/abs/2311.18445
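## Model Architecture
VTimeLLM consists of two parts: (1) a visual encoder and a visual adapter that process the input video; (2) a tailored LLM that goes through three training stages so that the model has both grounding and chat capabilities.

Stage 1: image-text alignment. Training on image-text pairs aligns the visual features with the semantic space of the LLM;

Stage 2: a single-turn QA task for dense video captioning and a multi-turn QA task covering segment description and temporal grounding give VTimeLLM temporal awareness, so that it can locate segments within a video;

Stage 3: a high-quality dialogue dataset is built for instruction tuning, to align the model with human intent.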
<div align=center>
    <img src="./doc/VTimeLLM.PNG"/>
</div>
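
Temporal grounding in stage 2 needs a textual form for a time span that the LLM can read and emit. The following is a minimal sketch, assuming (this README does not state it) that a span is written as two-digit indices over the uniformly sampled frames. The function names and the `from 05 to 20` output format are illustrative and are not taken from the repository.

```python
def seconds_to_frame_index(t: float, duration: float, num_frames: int = 100) -> int:
    """Map a timestamp in seconds to an index in [0, num_frames - 1].

    Assumption: frames are sampled uniformly over the whole video.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    t = min(max(t, 0.0), duration)  # clamp to the video range
    return min(int(t * num_frames / duration), num_frames - 1)


def format_segment(start: float, end: float, duration: float, num_frames: int = 100) -> str:
    """Render a time span as text using sampled-frame indices (hypothetical format)."""
    s = seconds_to_frame_index(start, duration, num_frames)
    e = seconds_to_frame_index(end, duration, num_frames)
    return f"from {s:02d} to {e:02d}"
```

For a 60-second video, `format_segment(3.0, 12.0, 60.0)` gives `from 05 to 20`.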

## Algorithm
Visual Encoder: a CLIP ViT-L/14 model extracts, for every frame, the cls-token feature and the per-patch features; the cls-token feature v_cls is used as the feature of the frame.

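Visual Adapter: a linear layer that transforms each frame's v_cls and maps it into the LLM space. The video is finally represented by an N×d feature matrix Z (N is the number of frames, d is the hidden dimension of the LLM). Here, 100 frames are sampled uniformly.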
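
The sampling and projection steps can be sketched in plain Python. This is an illustration only: `uniform_frame_indices` and `linear_adapter` are hypothetical names, the exact index rounding used by the repository may differ, and a real implementation would use a tensor library rather than nested lists.

```python
from typing import List


def uniform_frame_indices(total_frames: int, num_samples: int = 100) -> List[int]:
    """Pick num_samples frame indices spread evenly from the first to the last frame.

    If the video has fewer frames than num_samples, some indices repeat.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    if num_samples == 1:
        return [0]
    step = (total_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


def linear_adapter(v_cls: List[float],
                   weight: List[List[float]],
                   bias: List[float]) -> List[float]:
    """Apply one linear layer: weight has shape d x d_in, the output has length d."""
    return [sum(w * x for w, x in zip(row, v_cls)) + b
            for row, b in zip(weight, bias)]
```

Applying `linear_adapter` to the v_cls of each of the N sampled frames and stacking the results yields the N×d matrix Z.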

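Vicuna: the LLM. The token `<video>` stands for the video content, and the visual features Z are inserted into the text embedding sequence at that position.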
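
A minimal sketch of how the `<video>` placeholder can be replaced by the visual features, assuming the prompt is already tokenized and that a lookup table gives one embedding per text token. `build_input_embeddings` and the toy data in the usage note are hypothetical and only illustrate the splicing step.

```python
from typing import Dict, List

VIDEO_TOKEN = "<video>"


def build_input_embeddings(tokens: List[str],
                           text_embedding: Dict[str, List[float]],
                           video_features: List[List[float]]) -> List[List[float]]:
    """Return the embedding sequence fed to the LLM.

    Text tokens are looked up in text_embedding; the <video> placeholder
    is expanded into the N rows of the video feature matrix Z.
    """
    sequence: List[List[float]] = []
    for tok in tokens:
        if tok == VIDEO_TOKEN:
            sequence.extend(video_features)  # N vectors replace one placeholder
        else:
            sequence.append(text_embedding[tok])
    return sequence
```

With tokens `["Describe", "<video>", "briefly"]` and a Z of N rows, the resulting sequence has N + 2 vectors, with the video features sitting between the two text embeddings.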

<div align=center>
    <img src="./doc/ExpNet.PNG"/>
</div>
<div align=center>
    <img src="./doc/PoseVAE.PNG"/>
</div>
<div align=center>
    <img src="./doc/FaceRender.PNG"/>
</div>

## Environment Setup
### Docker (Method 1)
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/jupyterlab-pytorch:2.1.0-ubuntu20.04-dtk24.04.2-py3.8
docker run -it --name=SadTalker --network=host --privileged=true --device=/dev/kfd --device=/dev/dri --shm-size=16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v /path/your_code_data:/path/SadTalker -v /opt/hyhal/:/opt/hyhal/:ro <imageID> bash  # replace <imageID> with the ID of the docker image pulled above

cd SadTalker
# Install ffmpeg (used for format conversion)
apt update
apt install ffmpeg
# Install dependencies
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install tb-nightly -i https://mirrors.aliyun.com/pypi/simple
pip install -r requirements.txt
```
### Dockerfile (Method 2)
```
docker build --no-cache -t sadtalker:latest .
docker run -it --name=SadTalker --network=host --privileged=true --device=/dev/kfd --device=/dev/dri --shm-size=16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v /path/your_code_data:/path/SadTalker -v /opt/hyhal/:/opt/hyhal/:ro sadtalker /bin/bash

cd SadTalker
# Install ffmpeg (used for format conversion)
apt update
apt install ffmpeg
# Install dependencies
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install tb-nightly -i https://mirrors.aliyun.com/pypi/simple
pip install -r requirements.txt
```
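### Anaconda (Method 3)
1. The special deep learning libraries that this project needs for DCU cards can be downloaded and installed from the 光合 (Guanghe) developer community: https://developer.hpccube.com/tool/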
```
DTK software stack: dtk24.04.2
python: python3.8
pytorch: 2.1.0
torchvision:
torchaudio:
```
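`Tips: the versions of the DCU-related tools above (DTK software stack, python, pytorch, etc.) must match one another exactly`

2. Install the remaining, non-DCU-specific libraries with the steps below: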
```
cd SadTalker
# Install ffmpeg (used for format conversion)
apt update
apt install ffmpeg
# Install dependencies
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install tb-nightly -i https://mirrors.aliyun.com/pypi/simple
pip install -r requirements.txt
```
## Dataset
The data used for the inference test is stored under SadTalker/dataset/, with the following directory structure:
```
 ── dataset
    │   ├── bus_chinese.wav
    │   └── image.png
```
## Training
Not yet released officially.
## Inference
The models can be downloaded from [scnet](http://113.200.138.88:18080/aimodels/findsource-dependency/sadtalker) or in one of the following ways:

1-1. Pre-Trained Models
* [Google Drive](https://drive.google.com/file/d/1gwWh45pF7aelNP_P78uDJL8Sycep-K7j/view?usp=sharing)
* [GitHub Releases](https://github.com/OpenTalker/SadTalker/releases)
* [Baidu Netdisk](https://pan.baidu.com/s/1kb1BCPaLOWX1JJb9Czbn6w?pwd=sadt) (Password: `sadt`)

1-2. GFPGAN Offline Patch
* [Google Drive](https://drive.google.com/file/d/19AIBsmfcHW6BRJmeqSFlG5fL445Xmsyi?usp=sharing)
* [GitHub Releases](https://github.com/OpenTalker/SadTalker/releases)
* [Baidu Netdisk](https://pan.baidu.com/s/1P4fRgk9gaSutZnn8YW034Q?pwd=sadt) (Password: `sadt`)

2. Run the automatic download script (GitHub Releases):
```
cd SadTalker
sh scripts/download_models.sh
```
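The model directory structure is as follows; checkpoints holds the pre-trained models and gfpgan holds the face detection and enhancement models: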
```
 ── checkpoints
    │   └── ...
 ── gfpgan
    │   └── weights
    │          └── ...
```
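
Before running inference it can help to confirm that the directories shown above are in place. A small sketch, assuming the layout listed here and that the check runs from the project root; `missing_model_dirs` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path
from typing import List

# Directories taken from the layout above.
REQUIRED_DIRS = ("checkpoints", "gfpgan/weights")


def missing_model_dirs(root: str = ".") -> List[str]:
    """Return the required model directories that do not exist under root."""
    base = Path(root)
    return [d for d in REQUIRED_DIRS if not (base / d).is_dir()]
```

An empty list means both directories exist; it does not verify that the weight files inside them are complete.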
Inference command:
```
HIP_VISIBLE_DEVICES=0 python inference.py \
	--driven_audio dataset/bus_chinese.wav \
	--source_image dataset/image.png \
	--still \
	--preprocess full \
	--enhancer gfpgan \
	--result_dir result/

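# --driven_audio: path to the audio file
# --source_image: path to the image file
# --still: use the same pose parameters as the original image, giving less head motion
# --preprocess full: image preprocessing mode, one of ['crop', 'extcrop', 'resize', 'full', 'extfull']
# --enhancer: enhance the generated face with a face restoration network [gfpgan, RestoreFormer]
# --result_dir: output path
# For more options, see the parser comments in inference.py and docs/best_practice.md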
```
## Result
The default result of running inference is:
<div align=center>
    <video src="./doc/inference_result.mp4"/>
</div>

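### Accuracy
N/A
## Application Scenarios
### Algorithm Category
`Video generation`
### Key Application Industries
`Furniture, e-commerce, healthcare, broadcast media, education`
## Pre-trained Weights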
- 
- http://113.200.138.88:18080/aimodels/vicuna-7b-v1.5.git (vicuna-7b-v1.5)
- http://113.200.138.88:18080/aimodels/chatglm3-6b (chatglm3-6b)
## Source Repository and Issue Feedback
- 
## References
- https://github.com/huangb23/VTimeLLM