# VTimeLLM \[[Paper](https://arxiv.org/pdf/2311.18445.pdf)\]
Official PyTorch implementation of the paper "VTimeLLM: Empower LLM to Grasp Video Moments".


[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vtimellm-empower-llm-to-grasp-video-moments/video-based-generative-performance-4)](https://paperswithcode.com/sota/video-based-generative-performance-4?p=vtimellm-empower-llm-to-grasp-video-moments)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vtimellm-empower-llm-to-grasp-video-moments/dense-video-captioning-on-activitynet)](https://paperswithcode.com/sota/dense-video-captioning-on-activitynet?p=vtimellm-empower-llm-to-grasp-video-moments)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vtimellm-empower-llm-to-grasp-video-moments/video-based-generative-performance-3)](https://paperswithcode.com/sota/video-based-generative-performance-3?p=vtimellm-empower-llm-to-grasp-video-moments)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vtimellm-empower-llm-to-grasp-video-moments/video-based-generative-performance-1)](https://paperswithcode.com/sota/video-based-generative-performance-1?p=vtimellm-empower-llm-to-grasp-video-moments)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vtimellm-empower-llm-to-grasp-video-moments/video-based-generative-performance-5)](https://paperswithcode.com/sota/video-based-generative-performance-5?p=vtimellm-empower-llm-to-grasp-video-moments)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vtimellm-empower-llm-to-grasp-video-moments/video-based-generative-performance-2)](https://paperswithcode.com/sota/video-based-generative-performance-2?p=vtimellm-empower-llm-to-grasp-video-moments)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vtimellm-empower-llm-to-grasp-video-moments/video-based-generative-performance)](https://paperswithcode.com/sota/video-based-generative-performance?p=vtimellm-empower-llm-to-grasp-video-moments)

---

## :loudspeaker: Latest Updates
- **Jan-2**: Thanks to [Xiao Xia](https://github.com/Rishubi), [Shengbo Tong](https://github.com/tsb-19), and [Beining Wang](https://github.com/Benson0704), the code has been refactored to support both the LLaMA and ChatGLM3 architectures. We also translated the training data into Chinese and fine-tuned a Chinese version based on ChatGLM3-6B.
- **Dec-14**: Released the training code and data. All resources, including models, datasets, and extracted features, are available [here](https://cloud.tsinghua.edu.cn/d/6db5d02883124826aa6f/?p=%2F&mode=list). :fire::fire:
- **Dec-4**: Released the VTimeLLM demo.

---



## VTimeLLM Overview :bulb:

VTimeLLM is a novel Video LLM designed for fine-grained video moment understanding and reasoning with respect to time boundaries.

VTimeLLM adopts a boundary-aware three-stage training strategy: it uses image-text pairs for feature alignment, multi-event videos to increase temporal-boundary awareness, and high-quality video-instruction tuning to further improve temporal understanding and alignment with human intent.

![framework](images/framework.png)


---

## Contributions :trophy:

- We propose VTimeLLM, to the best of our knowledge the first boundary-aware Video LLM.
- We propose a boundary-aware three-stage training strategy, which consecutively leverages i) large-scale image-text data for feature alignment, ii) large-scale multi-event video-text data together with temporally related single-turn and multi-turn QA to enhance awareness of time boundaries, and iii) instruction tuning on a high-quality dialogue dataset for better temporal reasoning ability.
- We conduct extensive experiments to demonstrate that VTimeLLM significantly outperforms existing Video LLMs on a variety of fine-grained temporal video tasks, showing its superior ability for video understanding and reasoning.


---

## Installation :wrench:

We recommend setting up a conda environment for the project:
```shell
conda create --name=vtimellm python=3.10
conda activate vtimellm

git clone https://github.com/huangb23/VTimeLLM.git
cd VTimeLLM
pip install -r requirements.txt
```
For training, install the following additional packages:
```shell
pip install ninja
pip install flash-attn --no-build-isolation
```
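After installation, a quick sanity check can confirm the environment; this is a minimal sketch, assuming a CUDA-capable machine and that PyTorch was installed via `requirements.txt`:
```shell
# Verify that PyTorch is installed and can see a CUDA device.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

# Verify that flash-attn built and imports cleanly (only needed for training).
python -c "import flash_attn; print(flash_attn.__version__)"
```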



## Running Demo Offline :cd:

To run the demo offline, please refer to the instructions in [offline_demo.md](docs/offline_demo.md).

## Training :train:

For training instructions, check out [train.md](docs/train.md).

## Qualitative Analysis :mag:

A comprehensive evaluation of VTimeLLM's performance across multiple tasks.


### Video Understanding and Conversational Tasks :speech_balloon:
![0](images/ex.png) 

---

### Creative Tasks :paintbrush:
![1](images/ex1.png) 

---
### Fine-grained Understanding Tasks :globe_with_meridians:
![2](images/ex2.png) 

---
### Video Reasoning Tasks :question:
![3](images/ex3.png) 

---


## Acknowledgements :pray:

We are grateful to the following awesome projects that VTimeLLM builds upon:

* [LLaVA](https://github.com/haotian-liu/LLaVA): Large Language and Vision Assistant
* [FastChat](https://github.com/lm-sys/FastChat): An Open Platform for Training, Serving, and Evaluating Large Language Model based Chatbots
* [Video-ChatGPT](https://github.com/mbzuai-oryx/Video-ChatGPT): Towards Detailed Video Understanding via Large Vision and Language Models
* [LLaMA](https://github.com/facebookresearch/llama): Open and Efficient Foundation Language Models
* [Vid2seq](https://github.com/google-research/scenic/tree/main/scenic/projects/vid2seq): Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
* [InternVid](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid): A Large-scale Video-Text dataset


If you're using VTimeLLM in your research or applications, please cite it using this BibTeX entry:
```bibtex
@inproceedings{huang2024vtimellm,
  title={{VTimeLLM}: Empower {LLM} to grasp video moments},
  author={Huang, Bin and Wang, Xin and Chen, Hong and Song, Zihan and Zhu, Wenwu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={14271--14280},
  year={2024}
}
```

## License :scroll:
<a rel="license" href="https://creativecommons.org/licenses/by-nc-nd/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-nd/4.0/80x15.png" /></a> 

This work is licensed under a <a rel="license" href="https://creativecommons.org/licenses/by-nc-nd/4.0/">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a>.


Looking forward to your feedback, contributions, and stars! :star2: