__pycache__
ckpts
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.4.1-ubuntu22.04-dtk25.04-py3.10
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
# LLaVA-NeXT: Improved reasoning, OCR, and world knowledge
## Paper
`LLaVA-NeXT: Improved reasoning, OCR, and world knowledge`
* https://llava-vl.github.io/blog/2024-01-30-llava-next/
## Model Architecture
The model consists of a pretrained vision encoder, a projector, and a large language model.
![alt text](readme_imgs/arch.png)
## Algorithm
When the model is given high-resolution images together with representations that preserve their details, its ability to perceive fine-grained details in images improves significantly, and the hallucination that comes from guessing imagined visual content in low-resolution images is reduced. The image is split into smaller patches at the resolution the vision encoder was originally trained on, and each patch is encoded independently. The resulting per-patch feature maps are then combined into one large feature map at the target resolution and fed into the large language model (LLM).
![alt text](readme_imgs/alg.png)
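The patch-splitting step can be illustrated with a short sketch. This is not the repository's implementation; it only mirrors the idea of cropping a high-resolution image into encoder-sized tiles, encoding each tile independently, and stitching the features back together (the tile size, feature-map size, and the `encode` stand-in are assumptions for illustration).
```python
# Minimal sketch of the AnyRes idea (illustrative only, not the repo's code).
import numpy as np
import torch
from PIL import Image

TILE = 336   # resolution the vision encoder was trained on (assumed)
FEAT = 24    # per-tile feature-map side length (assumed)
DIM = 1024   # feature dimension (assumed)

def encode(tile: torch.Tensor) -> torch.Tensor:
    """Stand-in for the vision encoder: (3, TILE, TILE) -> (FEAT, FEAT, DIM)."""
    return torch.randn(FEAT, FEAT, DIM)

def anyres_features(image: Image.Image) -> torch.Tensor:
    """Split an RGB image into encoder-sized tiles, encode each tile
    independently, then stitch the per-tile features into one large feature map."""
    w, h = image.size
    cols, rows = max(1, round(w / TILE)), max(1, round(h / TILE))
    image = image.resize((cols * TILE, rows * TILE))
    pix = torch.from_numpy(np.array(image)).permute(2, 0, 1).float()
    row_feats = []
    for r in range(rows):
        row = [encode(pix[:, r * TILE:(r + 1) * TILE, c * TILE:(c + 1) * TILE])
               for c in range(cols)]
        row_feats.append(torch.cat(row, dim=1))   # stitch tiles along width
    return torch.cat(row_feats, dim=0)            # stitch rows along height
```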
## Environment Setup
Note: the code needs to be modified as follows
`llava/model/llava_arch.py +410`
```python
# before the change
image_feature = torch.cat((image_feature, self.model.image_newline[None]), dim=0)
# after the change
image_feature = torch.cat((image_feature, self.model.image_newline[None].to(image_feature.device)), dim=0)
```
### Docker (Method 1)
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.4.1-ubuntu22.04-dtk25.04-py3.10
docker run --shm-size 100g --network=host --name=llava_next --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v <absolute path to the project>:/home/ -v /opt/hyhal:/opt/hyhal:ro -it <your IMAGE ID> bash
pip install -e ".[train]"
pip install lmms-eval
pip install python-Levenshtein
pip install gradio
pip install rouge
### Dockerfile (Method 2)
docker build -t <IMAGE_NAME>:<TAG> .
docker run --shm-size 100g --network=host --name=llava_next --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v <absolute path to the project>:/home/ -v /opt/hyhal:/opt/hyhal:ro -it <your IMAGE ID> bash
pip install -e ".[train]"
pip install lmms-eval
pip install python-Levenshtein
pip install gradio
pip install rouge
### Anaconda (Method 3)
1. The DCU-specific deep learning libraries required by this project can be downloaded from the 光合 developer community: https://developer.hpccube.com/tool/
```
DTK driver: dtk2504
python:python3.10
torch:2.4.1
torchvision:0.19.1
triton:3.0.0
vllm:0.6.2
flash-attn:2.6.1
deepspeed:0.14.2
apex:1.4.0
```
2. Other, non-special libraries can be installed directly according to requirements.txt
```
pip install -e ".[train]"
pip install lmms-eval
pip install python-Levenshtein
pip install gradio
```
## Dataset
## Training
## Inference
Single-image input
```bash
python inference_single_hf.py
```
Multi-image input
```bash
python inference_multi_hf.py
```
Note: modify the image and model paths in the scripts before running.
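For reference, a minimal single-image inference script built on the Hugging Face `llava-v1.6` checkpoints might look like the sketch below; the repository's `inference_single_hf.py` may differ, and the model and image paths are placeholders you need to adjust.
```python
# Minimal sketch of single-image inference with the HF checkpoints
# (paths are placeholders; the repo's inference_single_hf.py may differ).
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_path = "ckpts/llava-v1.6-mistral-7b-hf"   # adjust to your local model path
image_path = "readme_imgs/arch.png"             # adjust to your test image

processor = LlavaNextProcessor.from_pretrained(model_path)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open(image_path).convert("RGB")
# Mistral-style prompt; other checkpoints use different chat templates.
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```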
### More Models
[interleave](interleave/README_interleave.md) | [image](image/README_image.md) | [onevision](onevision/README_onevision.md) | [video](video/README_video.md)
## Results
![alt text](readme_imgs/result.png)
### Accuracy
## Application Scenarios
### Algorithm Category
`Conversational QA`
### Key Application Industries
`E-commerce, education, transportation, energy`
## Pretrained Weights
|model|url|
|:---:|:---:|
|llava-v1.6-mistral-7b-hf|[hf](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf) \| [SCNet]() |
|llava-v1.6-vicuna-7b-hf|[hf](https://huggingface.co/llava-hf/llava-v1.6-vicuna-7b-hf) \| [SCNet](http://113.200.138.88:18080/aimodels/llava-hf/llava-v1.6-vicuna-7b-hf) |
|llava-v1.6-vicuna-13b-hf|[hf](https://huggingface.co/llava-hf/llava-v1.6-vicuna-13b-hf) \| [SCNet]() |
|llava-v1.6-34b-hf|[hf](https://huggingface.co/llava-hf/llava-v1.6-34b-hf) \| [SCNet](http://113.200.138.88:18080/aimodels/llava-hf/llava-v1.6-34b-hf) |
|llama3-llava-next-8b-hf|[hf](https://huggingface.co/llava-hf/llama3-llava-next-8b-hf) \| [SCNet]() |
|llava-next-72b-hf|[hf](https://huggingface.co/llava-hf/llava-next-72b-hf) \| [SCNet]() |
|llava-next-110b-hf|[hf](https://huggingface.co/llava-hf/llava-next-110b-hf) \| [SCNet]() |
<!-- | Version | LLM | Schedule | Checkpoint |
|----------|----------|-----------|-----------|
| LLaVA-1.6 | Vicuna-7B | full_ft-1e | [liuhaotian/llava-v1.6-vicuna-7b](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b) |
| LLaVA-1.6 | Vicuna-13B | full_ft-1e | [liuhaotian/llava-v1.6-vicuna-13b](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-13b) |
| LLaVA-1.6 | Mistral-7B | full_ft-1e | [liuhaotian/llava-v1.6-mistral-7b](https://huggingface.co/liuhaotian/llava-v1.6-mistral-7b) |
| LLaVA-1.6 | Hermes-Yi-34B | full_ft-1e | [liuhaotian/llava-v1.6-34b](https://huggingface.co/liuhaotian/llava-v1.6-34b) | -->
## Source Repository and Issue Feedback
* https://developer.sourcefind.cn/codes/modelzoo/llava-next_pytorch
## References
* https://llava-vl.github.io/blog/2024-01-30-llava-next/
* https://hugging-face.cn/docs/transformers/model_doc/llava_next
# Configuration for Cog ⚙️
# Reference: https://github.com/replicate/cog/blob/main/docs/yaml.md
build:
gpu: true
python_version: "3.11"
python_packages:
- "torch==2.0.1"
- "accelerate==0.21.0"
- "bitsandbytes==0.41.0"
- "deepspeed==0.9.5"
- "einops-exts==0.0.4"
- "einops==0.6.1"
- "gradio==3.35.2"
- "gradio_client==0.2.9"
- "httpx==0.24.0"
- "markdown2==2.4.10"
- "numpy==1.26.0"
- "peft==0.4.0"
- "scikit-learn==1.2.2"
- "sentencepiece==0.1.99"
- "shortuuid==1.0.11"
- "timm==0.6.13"
- "tokenizers==0.13.3"
- "torch==2.0.1"
- "torchvision==0.15.2"
- "transformers==4.31.0"
- "wandb==0.15.12"
- "wavedrom==2.0.3.post3"
- "Pygments==2.16.1"
run:
- curl -o /usr/local/bin/pget -L "https://github.com/replicate/pget/releases/download/v0.0.3/pget" && chmod +x /usr/local/bin/pget
# predict.py defines how predictions are run on your model
predict: "predict.py:Predictor"
# LLaVA-NeXT: Tackling Multi-image, Video, and 3D in Large Multimodal Models
## Contents
- [Demo](#demo)
- [Evaluation](#evaluation)
## Demo
> Make sure you have installed the LLaVA-NeXT model files following the top-level README.md.
1. **Example model:** `lmms-lab/llava-next-interleave-7b`
To run a demo, execute:
```bash
# If you hit any bug when running the demo, make sure the checkpoint path contains 'qwen'.
# You can try a command like 'mv llava-next-interleave-7b llava-next-interleave-qwen-7b'
python playground/demo/interleave_demo.py --model_path path/to/ckpt
```
## Evaluation
### Preparation
Please download the evaluation data and its metadata from the following links:
1. **llava-interleave-bench:** [here](https://huggingface.co/datasets/lmms-lab/llava-interleave-bench).
Unzip eval_images.zip; it contains Split1 and Split2.
Organize the downloaded data into the following structure:
```
interleave_data
├── Split1
│ ├── ...
│ └── ...
|
├── Split2
| ├── ...
│ └── ...
├── multi_image_in_domain.json
├── multi_image_out_domain.json
└── multi_view_in_domain.json
```
### Inference and Evaluation
Example:
First edit scripts/interleave/eval_all.sh, replacing /path/to/ckpt with your checkpoint path and /path/to/images with the path to "interleave_data", then run
```bash
bash scripts/interleave/eval_all.sh
```
# LLaVA-NeXT: A Strong Zero-shot Video Understanding Model
## Contents
- [Demo](#demo)
- [Evaluation](#evaluation)
## Demo
> Make sure you have installed the LLaVA-NeXT model files following the top-level README.md.
1. **Example model:** `lmms-lab/LLaVA-NeXT-Video-7B-DPO`
2. **Prompt mode:** `vicuna_v1` (use `mistral_direct` for `lmms-lab/LLaVA-NeXT-Video-34B-DPO`)
3. **Sampled frames:** `32` (Defines how many frames to sample from the video.)
4. **Spatial pooling stride:** `2` (With the original tokens for one frame at 24x24, a stride of 2 reduces the tokens for one frame to 12x12; see the pooling sketch after this list.)
5. **Spatial pooling mode:** `average` (Options: `average`, `max`.)
6. **Local video path:** `./data/llava_video/video-chatgpt/evaluation/Test_Videos/v_Lf_7RurLgp0.mp4`
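To make the stride parameter concrete, here is a tiny sketch (not the repository's code; shapes and the hidden dimension are illustrative) showing how average pooling with stride 2 reduces a 24x24 grid of frame tokens to 12x12:
```python
# Toy illustration of spatial pooling over one frame's visual tokens
# (shapes are illustrative; not the repository's implementation).
import torch
import torch.nn.functional as F

dim = 1024
frame_tokens = torch.randn(1, dim, 24, 24)           # 24x24 = 576 tokens per frame
pooled = F.avg_pool2d(frame_tokens, kernel_size=2, stride=2)
print(pooled.shape)                                   # torch.Size([1, 1024, 12, 12]) -> 144 tokens
```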
To run a demo, execute:
```bash
bash scripts/video/demo/video_demo.sh ${Example model} ${Prompt mode} ${Sampled frames} ${Spatial pooling stride} ${Spatial pooling mode} grid True ${Video path at local}
```
Example:
```bash
bash scripts/video/demo/video_demo.sh lmms-lab/LLaVA-NeXT-Video-7B-DPO vicuna_v1 32 2 average no_token True playground/demo/xU25MMA2N4aVtYay.mp4
```
**IMPORTANT** Please refer to [Latest video model](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/inference/docs/LLaVA-NeXT-Video_0716.md) for running the latest model.
## Evaluation
### Preparation
Please download the evaluation data and its metadata from the following links:
1. **video-chatgpt:** [here](https://github.com/mbzuai-oryx/Video-ChatGPT/blob/main/quantitative_evaluation/README.md#video-based-generative-performance-benchmarking).
2. **video_detail_description:** [here](https://mbzuaiac-my.sharepoint.com/personal/hanoona_bangalath_mbzuai_ac_ae/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Fhanoona%5Fbangalath%5Fmbzuai%5Fac%5Fae%2FDocuments%2FVideo%2DChatGPT%2FData%5FCode%5FModel%5FRelease%2FQuantitative%5FEvaluation%2Fbenchamarking%2FTest%5FHuman%5FAnnotated%5FCaptions%2Ezip&parent=%2Fpersonal%2Fhanoona%5Fbangalath%5Fmbzuai%5Fac%5Fae%2FDocuments%2FVideo%2DChatGPT%2FData%5FCode%5FModel%5FRelease%2FQuantitative%5FEvaluation%2Fbenchamarking&ga=1).
3. **activity_qa:** [here](https://mbzuaiac-my.sharepoint.com/personal/hanoona_bangalath_mbzuai_ac_ae/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Fhanoona%5Fbangalath%5Fmbzuai%5Fac%5Fae%2FDocuments%2FVideo%2DChatGPT%2FData%5FCode%5FModel%5FRelease%2FData%2FActivityNet%5FTest%2D1%2D3%5Fvideos%2Ezip&parent=%2Fpersonal%2Fhanoona%5Fbangalath%5Fmbzuai%5Fac%5Fae%2FDocuments%2FVideo%2DChatGPT%2FData%5FCode%5FModel%5FRelease%2FData&ga=1) and [here](https://github.com/MILVLG/activitynet-qa/tree/master/dataset).
Organize the downloaded data into the following structure:
```
LLaVA-NeXT
├── llava
├── scripts
└── data
└── llava_video
├── video-chatgpt
│ ├── Test_Videos
│ ├── consistency_qa.json
│ ├── consistency_qa_test.json
│ ├── consistency_qa_train.json
├── video_detail_description
│ └── Test_Human_Annotated_Captions
└── ActivityNet-QA
├── all_test
├── test_a.json
└── test_b.json
```
### Inference and Evaluation
Example for video detail description evaluation (additional scripts are available in `scripts/eval`):
```bash
bash scripts/video/eval/video_detail_description_eval_shard.sh ${Example model} ${Prompt mode} ${Sampled frames} ${Spatial pooling stride} True 8
```
Example:
```bash
bash scripts/eval/video_detail_description_eval_shard.sh liuhaotian/llava-v1.6-vicuna-7b vicuna_v1 32 2 True 8
```
### GPT Evaluation Example (Optional if the above step is completed)
Assuming you have `pred.json` (model-generated predictions) for model `llava-v1.6-vicuna-7b` at `./work_dirs/eval_video_detail_description/llava-v1.6-vicuna-7b_vicuna_v1_frames_32_stride_2`:
```bash
bash scripts/video/eval/video_description_eval_only.sh llava-v1.6-vicuna-7b_vicuna_v1_frames_32_stride_2
```
## LLaVA-NeXT-Video is upgraded 🚀
In our [LLaVA-Video blog](https://llava-vl.github.io/blog/2024-04-30-llava-next-video/) released this April, we shared two key observations:
- 🎬 AnyRes provides a shared and flexible representation between images and videos, and thus accommodates capability transfer between the two most common vision signals. Therefore, stronger image LMMs can naturally lead to stronger zero-shot video LMMs.
- 🗂️ There is a lack of high-quality language-video data, including video instruction-following data, and thus naive tuning on existing public data at that time results in performance degradation. Therefore, there is an urgent need to build high-quality video captions and QA datasets to train LMMs for improved video performance.
Based on these insights, the new LLaVA-NeXT-Video in this release improves in two aspects:
- 🎬 A stronger image LMM ([LLaVA-NeXT-32B-Qwen](https://huggingface.co/lmms-lab/llava-next-qwen-32b)), built by initializing from the Qwen-1.5 32B LLM. We further initialize our video training from this image checkpoint.
- 🗂️ A new high-quality video dataset with 830k samples. It is combined with LLaVA-1.6 image training data, and applying the same image-video mixed training procedure leads to the new video model.
The new model achieves the best open-source performance in several video benchmarks including [Video-MME](https://video-mme.github.io/home_page.html#leaderboard).
### Resources
- **Model Card**: [LLaVA-NeXT-Video-32B-Qwen on Hugging Face](https://huggingface.co/lmms-lab/LLaVA-NeXT-Video-32B-Qwen)
- **Inference Script**:
```bash
bash scripts/video/demo/video_demo.sh lmms-lab/LLaVA-NeXT-Video-32B-Qwen qwen_1_5 32 2 average grid True playground/demo/xU25MMA2N4aVtYay.mp4
```
### Evaluation Results
| Model | NextQA-MC | Video-MME (w/o subs) | Video-MME (w subs) | EgoSchema | Perception Test (val) |
|-----------------------------|-----------|----------------------|--------------------|-----------|------------------------|
| **Proprietary** | | | | | |
| GPT-4o | - | 71.9 | 77.2 | 72.2 | - |
| Gemini 1.5 Pro | - | 75.0 | 81.3 | 72.2 | - |
| **Open-Source** | | | | | |
| VideoLLaMA 2 (8x7B) | 76.3* | 47.9 | 50.3 | 53.3 | 51.2* |
| VILA-1.5-34B | 67.89* | 60.1 | 61.1 | 58.04* | 54 |
| LLaVA-NeXT-Video (Qwen-32B) | 77.31 | 60.2 | 63.0 | 60.85 | 59.38 |
_*Results are reproduced by [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval); please refer to lmms-eval to reproduce the results._
### Citations
```bibtex
@misc{zhang2024llavanextvideo,
title={LLaVA-NeXT: A Strong Zero-shot Video Understanding Model},
url={https://llava-vl.github.io/blog/2024-04-30-llava-next-video/},
author={Zhang, Yuanhan and Li, Bo and Liu, haotian and Lee, Yong jae and Gui, Liangke and Fu, Di and Feng, Jiashi and Liu, Ziwei and Li, Chunyuan},
month={April},
year={2024}
}
```
# LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild
## Quick Start With HuggingFace
First please install our repo with code and environments: `pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git`
Here is a quick inference example using [`llavanext-llama3-8B`](https://huggingface.co/lmms-lab/llama3-llava-next-8b). You will need to install [`flash-attn`](https://github.com/Dao-AILab/flash-attention) to use this code snippet. If you don't want to install it, you can set `attn_implementation=None` when calling `load_pretrained_model`.
```python
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from PIL import Image
import requests
import copy
import torch
pretrained = "lmms-lab/llama3-llava-next-8b"
model_name = "llava_llama3"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map) # Add any other thing you want to pass in llava_model_args
model.eval()
model.tie_weights()
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]
conv_template = "llava_llama_3" # Make sure you use correct chat template for different models
question = DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image?"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [image.size]
cont = model.generate(
input_ids,
images=image_tensor,
image_sizes=image_sizes,
do_sample=False,
temperature=0,
max_new_tokens=256,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs)
# The image shows a radar chart, also known as a spider chart or a web chart, which is a type of graph used to display multivariate data in the form of a two-dimensional chart of three or more quantitative variables represented on axes starting from the same point. Each axis represents a different variable, and the values are plotted along each axis and connected to form a polygon.\n\nIn this particular radar chart, there are several axes labeled with different variables, such as "MM-Vet," "LLaVA-Bench," "SEED-Bench," "MMBench-CN," "MMBench," "TextVQA," "VizWiz," "GQA," "BLIP-2," "InstructBLIP," "Owen-VL-Chat," and "LLaVA-1.5." These labels suggest that the chart is comparing the performance of different models or systems across various benchmarks or tasks, such as machine translation, visual question answering, and text-based question answering.\n\nThe chart is color-coded, with each color representing a different model or system. The points on the chart are connected to form a polygon, which shows the relative performance of each model across the different benchmarks. The closer the point is to the outer edge of the
```
## Evaluation
**Install the evaluation package:**
```bash
# make sure you have installed the LLaVA-NeXT model files following the top-level README.md
pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git
```
### Check the evaluation results with [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval)
Our models' evaluation results can be fully reproduced by using the [`lmms-eval`](https://github.com/EvolvingLMMs-Lab/lmms-eval) toolkit. After you install lmms-eval and llava, you can run the evaluation using the following commands. To run the following commands, you will have to install [`flash-attn`](https://github.com/Dao-AILab/flash-attention). If you do not want to install it, you can disable flash-attn by specifying it in `--model_args pretrained=lmms-lab/llama3-llava-next-8b,conv_template=llava_llama_3,attn_implementation=None`.
Please note that different torch versions might cause the results to vary.
```shell
# Evaluating Llama-3-LLaVA-NeXT-8B on multiple datasets
accelerate launch --num_processes=8 \
-m lmms_eval \
--model llava \
--model_args pretrained=lmms-lab/llama3-llava-next-8b,conv_template=llava_llama_3 \
--tasks ai2d,chartqa,docvqa_val,mme,mmbench_en_dev \
--batch_size 1 \
--log_samples \
--log_samples_suffix llava_next \
--output_path ./logs/
# Evaluating LLaVA-NeXT-72B on multiple datasets
accelerate launch --num_processes=1 \
-m lmms_eval \
--model llava \
--model_args pretrained=lmms-lab/llava-next-72b,conv_template=qwen_1_5,model_name=llava_qwen,device_map=auto \
--tasks ai2d,chartqa,docvqa_val,mme,mmbench_en_dev \
--batch_size 1 \
--log_samples \
--log_samples_suffix llava_next \
--output_path ./logs/
```
# LLaVA OneVision
## Model Details
LLaVA OneVision is a multi-modal model capable of processing images, text, image-text interleaved inputs, and videos. The model is trained in multiple stages:
1. Stage-1: Initial training on 558K samples from the LCS dataset.
2. Stage-1.5: Training on 4M high-quality samples with detailed captions, OCR and knowledge data.
3. Stage-2:
- Single-Image: Training on 3.2M instruction-following image samples.
- OneVision: Training on 1.6M single-image, multi-image and video samples with instructions.
Key features:
- Supports various input resolutions up to 2304 * 2304 pixels.
- Single image input is represented by 729 * (9+1) tokens at most under `anyres_max_9` mode.
- Supports multi-image and video inputs. Multi-image input is represented by 729 tokens per image, and video input is represented by 196 tokens per frame (see the token-budget sketch after this list).
- Available in three sizes: 0.5B, 7B and 72B parameter versions, to fit different memory and inference-latency requirements.
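For a rough sense of the visual-token budget implied by the numbers above, here is a small back-of-the-envelope calculation; the image and frame counts are arbitrary examples, not prescribed settings.
```python
# Back-of-the-envelope visual-token budget (example counts are arbitrary).
tokens_per_base_image = 729                              # one 27x27 token grid per view
anyres_max_9_image = tokens_per_base_image * (9 + 1)     # base view + up to 9 crops
multi_image_prompt = 729 * 4                             # e.g. 4 images in one prompt
video_32_frames = 196 * 32                               # e.g. 32 sampled frames

print(anyres_max_9_image)    # 7290 tokens for a single high-resolution image
print(multi_image_prompt)    # 2916 tokens
print(video_32_frames)       # 6272 tokens
```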
Some Implementation Details:
- Trained using a combination of vision-specific (AdamW, 2e-6) and language model (AdamW, 1e-5) learning rates.
- Each stage is trained for 1 epoch.
The model uses [SO400M](https://huggingface.co/collections/google/siglip-659d5e62f0ae1a57ae0e83ba) as the vision encoder and [Qwen-2.0](https://huggingface.co/docs/transformers/model_doc/qwen2) as the language model, with trainable components including a projector and the full model in later stages.
We recommend using the scripts in [training](../scripts/) to get the details of the training process.
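The projector connecting the vision encoder to the language model is, in typical LLaVA-style designs, a small two-layer MLP; a sketch of that component is shown below. The dimensions are assumptions for illustration, not the exact released configuration.
```python
# Sketch of a LLaVA-style two-layer MLP projector (dimensions are assumptions).
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 3584):
        super().__init__()
        # mlp2x_gelu-style projector: Linear -> GELU -> Linear
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # (batch, num_visual_tokens, vision_dim) -> (batch, num_visual_tokens, llm_dim)
        return self.proj(vision_feats)

projector = VisionProjector()
print(projector(torch.randn(1, 729, 1152)).shape)   # torch.Size([1, 729, 3584])
```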
## Inference Guidance
We recommend following the [tutorial](./LLaVA_OneVision_Tutorials.ipynb) to get started with our most basic 0.5B model for image, text, image-text interleaved, and video input. We use the 0.5B version as an example; it can run on a GPU with 4GB of memory, and the following examples show that it has surprisingly promising performance on understanding images, interleaved image-text, and video. Tiny but mighty!
## Evaluation Guidance
We use the [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) toolkit to evaluate our models. Ensure you have installed the LLaVA-NeXT model files as per the instructions in the main README.md.
Install lmms-eval:
> pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git
### Reproducing Evaluation Results
Our models' evaluation results can be fully reproduced using the lmms-eval toolkit. After installing lmms-eval and llava, you can run the evaluation using the following commands.
Note: These commands require flash-attn. If you prefer not to install it, disable flash-attn by adding `attn_implementation=None` to the `--model_args` parameter.
Important: Different torch versions may cause slight variations in results. By default, `lmms-eval` requires the latest torch version, while the `llava` repo pins torch to `2.1.2`. Torch `2.1.2` is stable for both `llava` and `lmms-eval`.
### Evaluating LLaVA-OneVision on multiple datasets
We recommend that developers and researchers thoroughly evaluate the models on more datasets to get a comprehensive understanding of their performance in different scenarios. We therefore provide a comprehensive list of datasets for evaluation and welcome contributions that incorporate more evaluation tasks. Please refer to [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) for more details.
Task: single-image tasks.
```bash
# image tasks
accelerate launch --num_processes=8 \
-m lmms_eval \
--model llava_onevision \
--model_args pretrained=lmms-lab/llava-onevision-qwen2-0.5b-si,conv_template=qwen_1_5,model_name=llava_qwen \
--tasks ai2d,chartqa,docvqa_val,infovqa_val,mme,realworldqa,mathvista_testmini,llava_in_the_wild,mmvet,mmbench_en_dev,ocrbench,mmmu,mathverse_testmini_vision_intensive,mathverse_testmini_vision_only,seedbench,scienceqa_img,mmstar \
--batch_size 1 \
--log_samples \
--log_samples_suffix llava_onevision \
--output_path ./logs/
```
Task: video tasks. The video tasks are more computationally expensive. We recommend running them on a machine with a GPU with at least 16GB memory.
```bash
# video tasks
accelerate launch --num_processes=8 \
-m lmms_eval \
--model llava_onevision \
--model_args pretrained=lmms-lab/llava-onevision-qwen2-0.5b-ov,conv_template=qwen_1_5,model_name=llava_qwen \
--tasks activitynetqa,videochatgpt,nextqa_mc_test,egoschema,video_dc499,videmme,videomme_w_subtitle,perceptiontest_val_mc \
--batch_size 1 \
--log_samples \
--log_samples_suffix llava_onevision \
--output_path ./logs/
```
Task: interleave tasks (`llava-interleave-bench` already contains most of the existing image-text tasks). `mmmu_test` contains both single-image and multi-image inputs; we run the model to obtain a submission file, which you need to submit to the [leaderboard](https://eval.ai/web/challenges/challenge-page/1700/overview) to get the accuracy for the MMMU (multi-image) result.
```bash
accelerate launch --num_processes=8 \
-m lmms_eval \
--model llava_onevision \
--model_args pretrained=lmms-lab/llava-onevision-qwen2-0.5b-ov,conv_template=qwen_1_5,model_name=llava_qwen \
--tasks llava-interleave-bench,muirbench,mmmu_test \
--batch_size 1 \
--log_samples \
--log_samples_suffix llava_onevision \
--output_path ./logs/
```
# LLaVA-OneVision-Chat: Improving Chat with Preference Learning
[LLaVA-OneVision](https://arxiv.org/abs/2408.03326) has demonstrated strong multimodal capabilities, showing excellent performance on various benchmarks in single-image, multi-image and video scenarios. However, we see potential for further improvement, particularly in its visual chat abilities. To achieve this, we've focused on enhancing the model through preference alignment, and our early experiments have produced some promising insights.
### Key Observations:
- **Impact of Preference Learning**: By incorporating alignment learning—whether through human feedback or AI-generated feedback—we've observed a notable improvement in LLaVA-OneVision's chat experience. This progress is reflected in the significant performance gains recorded on both the LLaVA-W and WildVision benchmarks.
- **Success of Self-Generated Feedback**: In LLaVA-OneVision's case, leveraging self-generated feedback data has proven to be a highly effective strategy for enhancing its visual chat capabilities. Specifically, [LLaVA-Critic](https://llava-vl.github.io/blog/2024-10-03-llava-critic/) is utilized as a generalist evaluator to generate the scoring feedback for preference learning. This approach allows the model to refine its responses autonomously, leading to more natural and coherent conversations.
----
### Release
- 🤗 Model Checkpoints: [[OV-7b-Chat]](https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov-chat) | [[OV-72b-Chat]](https://huggingface.co/lmms-lab/llava-onevision-qwen2-72b-ov-chat)
- 💬 Demo: [https://llava-onevision.lmms-lab.com](https://llava-onevision.lmms-lab.com/)
----
### Results
The figure below illustrates the performance gain of LLaVA-OV-Chat across 5 benchmarks. The delta numbers shown on top of the bars indicate the improvement of the chat model variant (7B/72B) over its base model LLaVA-OV.
![](ov_chat_images/chat_results.png)
| Model Name | WildVision | LLaVA-W | LLaVA-Wilder | LiveBench | Video Detailed Description |
|---------------------|------------|---------|--------------|-----------|----------------------------|
| LLaVA-OV-7b | 54.0 | 90.7 | 67.8 | 77.1 | 3.75 |
| LLaVA-OV-7b-Chat | 67.3 | 100.3 | 71.6 | 84.5 | 3.87 |
| LLaVA-OV-72b | 51.7 | 93.5 | 72.0 | 81.5 | 3.60 |
| LLaVA-OV-72b-Chat | 70.0 | 104.4 | 75.9 | 88.5 | 3.86 |
| GPT-4V | 79.8 | 98.0 | 81.0 | 73.7 | 4.00 |
LLaVA-OV-Chat consistently showcases exceptional visual chat capabilities across five open-ended multimodal tasks. Notably, compared to its base version, LLaVA-OV, which was trained using next-token prediction, LLaVA-OV-Chat—enhanced through alignment learning—receives significantly higher preference on four image-based tasks. The *[WildVision Bench](https://huggingface.co/datasets/WildVision/wildvision-bench)* simulates real-world multimodal interactions based on human preferences, mirroring scenarios from the online platform [WildVision-Arena (WV-Arena)](https://huggingface.co/spaces/WildVision/vision-arena). Both *[LLaVA-W](https://huggingface.co/datasets/lmms-lab/llava-bench-in-the-wild)* and *[LLaVA-Wilder](https://huggingface.co/datasets/lmms-lab/LLaVA-Bench-Wilder)* tackle daily-life visual tasks, with the former focusing on simpler cases and the latter on more challenging ones. The *[Multimodal Live-Bench](https://huggingface.co/spaces/lmms-lab/LiveBench)* leverages continuously updating news and online forums to test models' ability to generalize unobserved knowledge. Despite the alignment learning being conducted solely with images, LLaVA-OV-Chat also excels in *[Video Detailed Captioning](https://llava-vl.github.io/blog/2024-04-30-llava-next-video/)*, demonstrating that its improved chat capabilities generalize effectively from image to video contexts.
#### Examples
| Example 1 |
| -- |
| <div align="right" style="display: inline;"><img src="ov_chat_images/example1_tree.png" style="height:250px" align="right"></div><div align="left" style="display: inline"><div><b>[Question]</b>: How likely is it to snow after this picture was taken? What would change with this type of tree before it's likely to snow? </div></div> |
| **[LLaVA-OneVision]**: It is highly unlikely that it would snow immediately after this picture was taken, as the lush green leaves suggest it's either late spring or summer. Before snowfall, which typically occurs in late autumn or winter, this type of tree, likely an oak, would shed its leaves, and the branches would appear bare. |
| **[LLaVA-OneVision-Chat]**: Given the lush green leaves and the overall vibrancy of the tree in the image, it is highly unlikely that snow would follow immediately. This type of tree, which appears to be a deciduous species, typically sheds its leaves in autumn as a preparation for winter. Before it's likely to snow, the tree would undergo several changes:<br><br>1. **Color Change**: The leaves would change color from green to shades of yellow, orange, red, or brown as the chlorophyll breaks down, revealing other pigments. <br>2. **Leaf Fall**: As the days shorten and temperatures cool, the tree would gradually shed its leaves. This process is triggered by hormonal changes within the tree that cause the cells at the base of each leaf to weaken and eventually break, allowing the leaves to fall.<br>3. **Dormancy**: The tree would enter a state of dormancy, during which its metabolic processes slow down significantly. This helps the tree conserve energy and survive the cold winter months.<br>4. **Bark Changes**: The bark may become more rugged and textured as the tree ages and prepares for winter. This is a natural defense mechanism against cold temperatures and potential pests.<br>5. **Root Activity**: The roots would continue to absorb water and nutrients until the ground freezes, after which they would go dormant as well.<br>6. **Snow Preparation**: Once the tree has shed its leaves and entered dormancy, it would be better prepared to withstand snow and ice accumulation on its branches without the added weight of leaves.<br><br>In summary, the tree would need to go through the full autumn cycle of leaf color change and fall, followed by entering dormancy, before it would be likely to snow. This process can take several weeks to a few months, depending on the specific species of the tree and the local climate conditions.|
| Example 2 |
| -- |
| <div align="right" style="display: inline;"><img src="ov_chat_images/example2_dog.jpg" style="height:230px" align="right"></div><div align="left" style="display: inline"><div><b>[Question]</b>: Discuss how this creative twist on a classic work of art might be interpreted differently by various audiences. </div></div>|
| **[LLaVA-OneVision]**: Different audiences might interpret this creative twist on the Mona Lisa in various ways. Art enthusiasts might appreciate the skillful execution and the playful reinterpretation of a classic, seeing it as a commentary on the universality of beauty and expression across species. Dog lovers could find joy in seeing their favorite breed portrayed with such dignity and elegance. Some might view it as a humorous take on high art, making it more accessible and relatable. Others might see it as a statement on the anthropomorphism of pets, reflecting how we often attribute human-like qualities to animals. Overall, the varied interpretations highlight the versatility of art in engaging diverse perspectives and emotions. |
| **[LLaVA-OneVision-Chat]**: This creative twist on the Mona Lisa, featuring a dog's face, can be interpreted in various ways by different audiences. For art enthusiasts, it might be seen as a playful commentary on the nature of portraiture and the human tendency to anthropomorphize animals. It challenges traditional notions of beauty and subjectivity in art, inviting viewers to reconsider what makes a portrait compelling.<br><br>For dog lovers and pet owners, the image could evoke a sense of joy and familiarity, as they see their beloved pets reflected in one of the most iconic artworks in history. It might also highlight the deep bond between humans and their pets, suggesting that animals share in our cultural heritage.<br><br>In a broader social context, this image could be interpreted as a reflection of contemporary society's obsession with internet culture and memes. The substitution of the Mona Lisa's face with a dog's could be seen as a satirical take on how viral content often overshadows classical art in modern media.<br><br>Academically, this piece could spark discussions about copyright, artistic originality, and the public domain. It raises questions about the boundaries between homage and appropriation in art, encouraging critical thinking about intellectual property rights and creative freedom.<br><br>Overall, the diverse interpretations of this image underscore the power of art to engage, provoke, and connect people across different backgrounds and perspectives. |
********
### Iterative DPO for Enhanced Performance
#### Feedback Data
To optimize LLaVA-OneVision’s in-the-wild conversational abilities, we've employed an iterative Direct Preference Optimization (DPO) process. Through this method, we found that feedback from two primary sources is particularly effective:
1. **Human Feedback from [LLaVA-RLHF](https://llava-rlhf.github.io/)**: Real-world human input plays a crucial role in guiding the model toward more intuitive and user-friendly responses.
2. **AI Feedback from LLaVA-OV’s Self-Generated Responses**: Additionally, the AI's own self-generated feedback allows it to continuously improve and adapt, making this a valuable source for iterative learning. [LLaVA-Critic](https://llava-vl.github.io/blog/2024-10-03-llava-critic/) is utilized as a generalist evaluator to generate the scoring feedback for preference learning.
By experimenting with either of these two forms of feedback, we've been able to significantly enhance LLaVA-OneVision's conversation capabilities, bringing it closer to achieving seamless visual chat interactions in dynamic, real-world environments.
#### Alignment Learning with Iterative DPO
We provide a breakdown of the process for enhancing LLaVA-OneVision’s visual chat capabilities through iterative DPO.
##### Requirements:
1. **SFT Checkpoint**: We begin with a pretrained LLaVA-OneVision SFT (Supervised Fine-Tuning) model as the initial checkpoint for response generation.
2. **Preference Data**: The dataset used in our experiments consists of (language-image prompt, response, preference) pairs sourced from human feedback or AI feedback, which serves as the training data for the model to align with user preference to improve chat experience.
##### Step 1: Response Generation
For each language-image prompt in the dataset, we randomly generate `k = 5` candidate responses from the starting checkpoint. To ensure diversity in the generated responses, we employ random decoding with the following parameters: `temperature = 0.7`, `top-p (nucleus sampling) = 0.9`. These settings encourage the generation of varied responses by balancing randomness and precision, giving us a broad spectrum of potential answers for further evaluation.
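A sketch of what this sampling could look like with a Hugging Face-style `generate` call, reusing the variable names from the earlier quick-start snippet (`model`, `tokenizer`, `input_ids`, `image_tensor`, `image_sizes`); this is illustrative only, not the repository's exact pipeline.
```python
# Sketch of sampling k diverse candidate responses for one prompt
# (variables come from the quick-start snippet; not the repo's exact pipeline).
import torch

k = 5
with torch.inference_mode():
    outputs = model.generate(
        input_ids,                  # tokenized language-image prompt
        images=image_tensor,
        image_sizes=image_sizes,
        do_sample=True,             # random decoding for diversity
        temperature=0.7,
        top_p=0.9,
        num_return_sequences=k,     # k = 5 candidates per prompt
        max_new_tokens=512,
    )
candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```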
##### Step 2: Scoring and Acquiring Feedback Data
Once the candidate responses are generated, we utilize a feedback source (e.g., the reward signals from LLaVA-RLHF or reward signals from LLaVA-Critic) to score each of them. The reward model is responsible for evaluating the quality of the responses based on relevance, coherence, and appropriateness in relation to the given image-question pair. From the scored responses, we then select:
- The **best** response (highest score)
- The **worst** response (lowest score)
These two responses serve as **pairwise feedback data** for the next phase of the training process.
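Conceptually, building one preference pair from the scored candidates amounts to something like the sketch below, where `score` stands in for the reward source (e.g. LLaVA-RLHF reward signals or LLaVA-Critic); the field names follow the training-data format shown later in this document.
```python
# Sketch: turn scored candidates into one DPO preference pair
# ("score" stands in for the reward source, e.g. LLaVA-Critic or LLaVA-RLHF).
def build_preference_pair(sample_id, image_path, prompt, candidates, score):
    scored = [(score(image_path, prompt, resp), resp) for resp in candidates]
    best = max(scored, key=lambda s: s[0])[1]    # highest-scoring response
    worst = min(scored, key=lambda s: s[0])[1]   # lowest-scoring response
    return {
        "id": sample_id,
        "image": image_path,
        "prompt": prompt,
        "chosen": best,
        "rejected": worst,
    }
```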
##### Step 3: Training with Iterative DPO
Using the feedback data obtained in `Step 2`, we conduct DPO training in an iterative fashion. The process is as follows:
1. In the $i^{th}$ round of training, we start with the model trained in the previous, $(i-1)^{th}$, round.
2. We generate new candidate responses by repeating the response generation process outlined in `Step 1`.
3. The reward source evaluates these new responses, and pairwise feedback data is acquired, as described in `Step 2`.
4. Finally, we apply DPO training to the model using the feedback data. Each round of DPO training lasts for **`1 epoch`**.
This iterative process is repeated for `N=3` rounds in total, with each round refining the model’s ability to generate high-quality visual chat responses by progressively incorporating feedback from both human and AI assessments.
**Training script and data format**
- Example training script: [`/scripts/train/dpo_ov7b.sh`](../scripts/train/dpo_ov7b.sh)
- Format of training data:
~~~json
{
"id": "<image-id>",
"image": "<image path under args.image_folder>",
"prompt": "<input prompt/question>",
"chosen": "<chosen model response>",
"rejected": "<rejected model response>"
}
~~~
------
Check out how we develop AI feedback for self-improving LMMs, using [LLaVA-Critic](https://llava-vl.github.io/blog/2024-10-03-llava-critic/) as a generalist evaluator to generate the scoring feedback for preference learning!
*Contributors to LLaVA-OneVision-Chat: [Tianyi Xiong](https://tyxiong23.github.io/), [Bo Li](https://brianboli.com/), [Dong Guo](https://www.linkedin.com/in/dongguoset/), [Huizhuo Yuan](https://scholar.google.com/citations?user=8foZzX4AAAAJ), [Quanquan Gu](https://web.cs.ucla.edu/~qgu/), [Chunyuan Li](https://scholar.google.com/citations?user=Zd7WmXUAAAAJ)*
### Citation
If you find it useful for your research and applications, please cite related papers/blogs using this BibTeX:
```bibtex
@misc{xiong2024llavaovchat,
title={LLaVA-OneVision-Chat: Improving Chat with Preference Learning},
url={https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/docs/LLaVA_OneVision_Chat.md},
author={Xiong, Tianyi and Li, Bo and Guo, Dong and Yuan, Huizhuo and Gu, Quanquan and Li, Chunyuan},
month={September},
year={2024}
}
@article{xiong2024llavacritic,
title={LLaVA-Critic: Learning to Evaluate Multimodal Models},
author={Xiong, Tianyi and Wang, Xiyao and Guo, Dong and Ye, Qinghao and Fan, Haoqi and Gu, Quanquan and Huang, Heng and Li, Chunyuan},
year={2024},
eprint={2410.02712},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2410.02712},
}
@article{li2024llavaov,
title={Llava-onevision: Easy visual task transfer},
author={Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
journal={arXiv preprint arXiv:2408.03326},
year={2024}
}
@article{sun2023aligning,
title={Aligning large multimodal models with factually augmented rlhf},
author={Sun, Zhiqing and Shen, Sheng and Cao, Shengcao and Liu, Haotian and Li, Chunyuan and Shen, Yikang and Gan, Chuang and Gui, Liang-Yan and Wang, Yu-Xiong and Yang, Yiming and Keutzer, Kurt and Darrell, Trevor},
journal={arXiv preprint arXiv:2309.14525},
year={2023}
}
```
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# (Frustratingly Easy) LLaVA OneVision Tutorial\n",
"\n",
"We know that it's always beneficial to have a unified interface for different tasks. So we are trying to unify the interface for image, text, image-text interleaved, and video input. And in this tutorial, we aim to provide the most straightforward way to use our model. \n",
"\n",
"We use our 0.5B version as an example. This could be running on a GPU with 4GB memory. And with the following examples, you could see it's surprisingly have promising performance on understanding the image, interleaved image-text, and video. Tiny but mighty!\n",
"\n",
"The same code could be used for 7B model as well.\n",
"\n",
"## Inference Guidance\n",
"\n",
"First please install our repo with code and environments: pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git\n",
"\n",
"Here is a quick inference code using [lmms-lab/qwen2-0.5b-si](https://huggingface.co/lmms-lab/llava-onevision-qwen2-0.5b-si) as an example. You will need to install `flash-attn` to use this code snippet. If you don't want to install it, you can set `attn_implementation=None` when load_pretrained_model\n",
"\n",
"### Image Input\n",
"Tackling the single image input with LLaVA OneVision is pretty straightforward."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from llava.model.builder import load_pretrained_model\n",
"from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token\n",
"from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX\n",
"from llava.conversation import conv_templates, SeparatorStyle\n",
"\n",
"from PIL import Image\n",
"import requests\n",
"import copy\n",
"import torch\n",
"\n",
"import sys\n",
"import warnings\n",
"\n",
"warnings.filterwarnings(\"ignore\")\n",
"pretrained = \"lmms-lab/llava-onevision-qwen2-0.5b-si\"\n",
"model_name = \"llava_qwen\"\n",
"device = \"cuda\"\n",
"device_map = \"auto\"\n",
"llava_model_args = {\n",
" \"multimodal\": True,\n",
" \"attn_implementation\": \"sdpa\",\n",
"}\n",
"tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map, **llava_model_args) # Add any other thing you want to pass in llava_model_args\n",
"\n",
"model.eval()\n",
"\n",
"url = \"https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true\"\n",
"image = Image.open(requests.get(url, stream=True).raw)\n",
"image_tensor = process_images([image], image_processor, model.config)\n",
"image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]\n",
"\n",
"conv_template = \"qwen_1_5\" # Make sure you use correct chat template for different models\n",
"question = DEFAULT_IMAGE_TOKEN + \"\\nWhat is shown in this image?\"\n",
"conv = copy.deepcopy(conv_templates[conv_template])\n",
"conv.append_message(conv.roles[0], question)\n",
"conv.append_message(conv.roles[1], None)\n",
"prompt_question = conv.get_prompt()\n",
"\n",
"input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors=\"pt\").unsqueeze(0).to(device)\n",
"image_sizes = [image.size]\n",
"\n",
"\n",
"cont = model.generate(\n",
" input_ids,\n",
" images=image_tensor,\n",
" image_sizes=image_sizes,\n",
" do_sample=False,\n",
" temperature=0,\n",
" max_new_tokens=4096,\n",
")\n",
"text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)\n",
"print(text_outputs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You could use the following code to make it streaming in terminal, this would be pretty useful when creating a chatbot."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from threading import Thread\n",
"from transformers import TextIteratorStreamer\n",
"import json\n",
"\n",
"url = \"https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true\"\n",
"image = Image.open(requests.get(url, stream=True).raw)\n",
"image_tensor = process_images([image], image_processor, model.config)\n",
"image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]\n",
"\n",
"conv_template = \"qwen_1_5\"\n",
"question = DEFAULT_IMAGE_TOKEN + \"\\nWhat is shown in this image?\"\n",
"conv = copy.deepcopy(conv_templates[conv_template])\n",
"conv.append_message(conv.roles[0], question)\n",
"conv.append_message(conv.roles[1], None)\n",
"prompt_question = conv.get_prompt()\n",
"\n",
"input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors=\"pt\").unsqueeze(0).to(device)\n",
"image_sizes = [image.size]\n",
"\n",
"max_context_length = getattr(model.config, \"max_position_embeddings\", 2048)\n",
"num_image_tokens = question.count(DEFAULT_IMAGE_TOKEN) * model.get_vision_tower().num_patches\n",
"\n",
"streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True, timeout=15)\n",
"\n",
"max_new_tokens = min(4096, max_context_length - input_ids.shape[-1] - num_image_tokens)\n",
"\n",
"if max_new_tokens < 1:\n",
" print(\n",
" json.dumps(\n",
" {\n",
" \"text\": question + \"Exceeds max token length. Please start a new conversation, thanks.\",\n",
" \"error_code\": 0,\n",
" }\n",
" )\n",
" )\n",
"else:\n",
" gen_kwargs = {\n",
" \"do_sample\": False,\n",
" \"temperature\": 0,\n",
" \"max_new_tokens\": max_new_tokens,\n",
" \"images\": image_tensor,\n",
" \"image_sizes\": image_sizes,\n",
" }\n",
"\n",
" thread = Thread(\n",
" target=model.generate,\n",
" kwargs=dict(\n",
" inputs=input_ids,\n",
" streamer=streamer,\n",
" **gen_kwargs,\n",
" ),\n",
" )\n",
" thread.start()\n",
"\n",
" generated_text = \"\"\n",
" for new_text in streamer:\n",
" generated_text += new_text\n",
" print(generated_text, flush=True)\n",
" # print(json.dumps({\"text\": generated_text, \"error_code\": 0}), flush=True)\n",
"\n",
" print(\"Final output:\", generated_text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Image-Text Interleaved Input\n",
"\n",
"Now switching to our onevision model for more complex tasks. You should start to use `llava-onevision-qwen2-0.5b-ov` for image-text interleaved input and video input.\n",
"\n",
"Processing image-text interleaved input is a bit more complicated. But following the code below should work."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Load model\n",
"pretrained = \"lmms-lab/llava-onevision-qwen2-0.5b-ov\"\n",
"model_name = \"llava_qwen\"\n",
"device = \"cuda\"\n",
"device_map = \"auto\"\n",
"llava_model_args = {\n",
" \"multimodal\": True,\n",
" }\n",
"overwrite_config = {}\n",
"overwrite_config[\"image_aspect_ratio\"] = \"pad\"\n",
"llava_model_args[\"overwrite_config\"] = overwrite_config\n",
"tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map, **llava_model_args)\n",
"\n",
"model.eval()\n",
"\n",
"# Load two images\n",
"url1 = \"https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true\"\n",
"url2 = \"https://raw.githubusercontent.com/haotian-liu/LLaVA/main/images/llava_logo.png\"\n",
"\n",
"image1 = Image.open(requests.get(url1, stream=True).raw)\n",
"image2 = Image.open(requests.get(url2, stream=True).raw)\n",
"\n",
"images = [image1, image2]\n",
"image_tensors = process_images(images, image_processor, model.config)\n",
"image_tensors = [_image.to(dtype=torch.float16, device=device) for _image in image_tensors]\n",
"\n",
"# Prepare interleaved text-image input\n",
"conv_template = \"qwen_1_5\"\n",
"question = f\"{DEFAULT_IMAGE_TOKEN} This is the first image. Can you describe what you see?\\n\\nNow, let's look at another image: {DEFAULT_IMAGE_TOKEN}\\nWhat's the difference between these two images?\"\n",
"\n",
"conv = copy.deepcopy(conv_templates[conv_template])\n",
"conv.append_message(conv.roles[0], question)\n",
"conv.append_message(conv.roles[1], None)\n",
"prompt_question = conv.get_prompt()\n",
"\n",
"input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors=\"pt\").unsqueeze(0).to(device)\n",
"image_sizes = [image.size for image in images]\n",
"\n",
"# Generate response\n",
"cont = model.generate(\n",
" input_ids,\n",
" images=image_tensors,\n",
" image_sizes=image_sizes,\n",
" do_sample=False,\n",
" temperature=0,\n",
" max_new_tokens=4096,\n",
")\n",
"text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)\n",
"print(text_outputs[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Video Input\n",
"\n",
"Now let's try video input. It's the same as image input, but you need to pass in a list of video frames. And remember to set the `<image>` token only once in the prompt, e.g. \"<image>\\nWhat is shown in this video?\", not \"<image>\\n<image>\\n<image>\\nWhat is shown in this video?\". Since we trained on this format, it's important to keep the format consistent."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/tiger/miniconda3/envs/public_llava/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
" from .autonotebook import tqdm as notebook_tqdm\n",
"/home/tiger/miniconda3/envs/public_llava/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
" warnings.warn(\n",
"Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loaded LLaVA model: lmms-lab/llava-onevision-qwen2-7b-ov\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n",
"You are using a model of type llava to instantiate a model of type llava_qwen. This is not supported for all configurations of models and can yield errors.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loading vision tower: google/siglip-so400m-patch14-384\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00, 2.07s/it]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model Class: LlavaQwenForCausalLM\n",
"(16, 1024, 576, 3)\n",
"The video features a person standing on a stage, dressed in a black shirt and dark pants. A large hand appears from the background, reaching towards the person's pocket. The text 'Source: Joshua AG' is displayed at the top left corner of the frames, and 'EVAN CARMICHAEL' is shown in the top right corner. The text 'Anyone know what this pocket is for?' appears as the hand continues to reach into the pocket. The person then looks down at their pocket, and the text 'I've always wondered that' appears. The hand finally pulls out a small white device labeled 'iPod Nano'. The person holds up the iPod Nano, and the text 'is the new iPod Nano' appears. The video concludes with a close-up of the person holding the iPod Nano, showing it from different angles.\n"
]
}
],
"source": [
"from operator import attrgetter\n",
"from llava.model.builder import load_pretrained_model\n",
"from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token\n",
"from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX\n",
"from llava.conversation import conv_templates, SeparatorStyle\n",
"\n",
"import torch\n",
"import cv2\n",
"import numpy as np\n",
"from PIL import Image\n",
"import requests\n",
"import copy\n",
"import warnings\n",
"from decord import VideoReader, cpu\n",
"\n",
"warnings.filterwarnings(\"ignore\")\n",
"# Load the OneVision model\n",
"pretrained = \"lmms-lab/llava-onevision-qwen2-7b-ov\"\n",
"model_name = \"llava_qwen\"\n",
"device = \"cuda\"\n",
"device_map = \"auto\"\n",
"llava_model_args = {\n",
" \"multimodal\": True,\n",
"}\n",
"tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map, attn_implementation=\"sdpa\", **llava_model_args)\n",
"\n",
"model.eval()\n",
"\n",
"\n",
"# Function to extract frames from video\n",
"def load_video(video_path, max_frames_num):\n",
" if type(video_path) == str:\n",
" vr = VideoReader(video_path, ctx=cpu(0))\n",
" else:\n",
" vr = VideoReader(video_path[0], ctx=cpu(0))\n",
" total_frame_num = len(vr)\n",
" uniform_sampled_frames = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int)\n",
" frame_idx = uniform_sampled_frames.tolist()\n",
" spare_frames = vr.get_batch(frame_idx).asnumpy()\n",
" return spare_frames # (frames, height, width, channels)\n",
"\n",
"\n",
"# Load and process video\n",
"video_path = \"jobs.mp4\"\n",
"video_frames = load_video(video_path, 16)\n",
"print(video_frames.shape) # (16, 1024, 576, 3)\n",
"image_tensors = []\n",
"frames = image_processor.preprocess(video_frames, return_tensors=\"pt\")[\"pixel_values\"].half().cuda()\n",
"image_tensors.append(frames)\n",
"\n",
"# Prepare conversation input\n",
"conv_template = \"qwen_1_5\"\n",
"question = f\"{DEFAULT_IMAGE_TOKEN}\\nDescribe what's happening in this video.\"\n",
"\n",
"conv = copy.deepcopy(conv_templates[conv_template])\n",
"conv.append_message(conv.roles[0], question)\n",
"conv.append_message(conv.roles[1], None)\n",
"prompt_question = conv.get_prompt()\n",
"\n",
"input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors=\"pt\").unsqueeze(0).to(device)\n",
"image_sizes = [frame.size for frame in video_frames]\n",
"\n",
"# Generate response\n",
"cont = model.generate(\n",
" input_ids,\n",
" images=image_tensors,\n",
" image_sizes=image_sizes,\n",
" do_sample=False,\n",
" temperature=0,\n",
" max_new_tokens=4096,\n",
" modalities=[\"video\"],\n",
")\n",
"text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)\n",
"print(text_outputs[0])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.9.2 64-bit",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.14"
},
"vscode": {
"interpreter": {
"hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}
# LLaVA Video
## Table of Contents
1. [Model Summary](#model-summary)
2. [Inference](#inference)
3. [Data Preparation](#data-preparation)
4. [Training](#training)
5. [Evaluation](#evaluation-guidance)
6. [Citation](#citation)
## Model Summary
The LLaVA-Video models are 7B and 72B parameter models trained on [LLaVA-Video-178K](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K) and the [LLaVA-OneVision Dataset](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data), built on the Qwen2 language model with a 32K-token context window.
## Inference
We provide a simple generation example for using our model below. For more details, refer to [GitHub](https://github.com/LLaVA-VL/LLaVA-NeXT).
```python
# pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from PIL import Image
import requests
import copy
import torch
import sys
import warnings
from decord import VideoReader, cpu
import numpy as np
warnings.filterwarnings("ignore")
def load_video(video_path, max_frames_num, fps=1, force_sample=False):
    if max_frames_num == 0:
        return np.zeros((1, 336, 336, 3))
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    total_frame_num = len(vr)
    video_time = total_frame_num / vr.get_avg_fps()
    fps = round(vr.get_avg_fps() / fps)
    frame_idx = [i for i in range(0, len(vr), fps)]
    frame_time = [i / fps for i in frame_idx]
    if len(frame_idx) > max_frames_num or force_sample:
        sample_fps = max_frames_num
        uniform_sampled_frames = np.linspace(0, total_frame_num - 1, sample_fps, dtype=int)
        frame_idx = uniform_sampled_frames.tolist()
        frame_time = [i / vr.get_avg_fps() for i in frame_idx]
    frame_time = ",".join([f"{i:.2f}s" for i in frame_time])
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    return spare_frames, frame_time, video_time
pretrained = "lmms-lab/LLaVA-Video-7B-Qwen2"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map) # Add any other thing you want to pass in llava_model_args
model.eval()
video_path = "XXXX"
max_frames_num = 64
video,frame_time,video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().bfloat16()
video = [video]
conv_template = "qwen_1_5" # Make sure you use correct chat template for different models
time_instruction = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. These frames are located at {frame_time}. Please answer the following questions related to this video."
question = DEFAULT_IMAGE_TOKEN + f"{time_instruction}\nPlease describe this video in detail."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
cont = model.generate(
    input_ids,
    images=video,
    modalities=["video"],
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip()
print(text_outputs)
```
## Data Preparation
1. **Download LLaVA-OneVision**
Refer to the official instructions here: [LLaVA-OneVision Data](https://github.com/LLaVA-VL/LLaVA-NeXT/tree/main/scripts/train#about-the-llava-onevision-data). Make sure to follow the guidelines provided to obtain and organize the data correctly.
2. **Download LLaVA-Video-178K**
The dataset is available on Hugging Face: [LLaVA-Video-178K](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K). After downloading, place it in your desired directory (one way to fetch it with `huggingface-cli` is sketched after this list).
3. **Update `exp.yaml`**
In the [`exp.yaml` file](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/scripts/video/train/exp.yaml), update the file paths to point to the directories where you stored the datasets:
- **Lines 186-263**: Specify the path for the LLaVA-Video-178K dataset.
- For other data references, update them to point to your local LLaVA-OneVision data directory.
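For reference, here is one possible way to fetch LLaVA-Video-178K with the Hugging Face CLI. This is a sketch, assuming `huggingface_hub` is installed; the `./data/...` target directory is a placeholder you should adapt to your own storage layout:
```bash
# Download the dataset repo to a local directory (placeholder path; adjust to your setup)
huggingface-cli download lmms-lab/LLaVA-Video-178K --repo-type dataset --local-dir ./data/LLaVA-Video-178K
# Quick check that the files are where exp.yaml will point
ls ./data/LLaVA-Video-178K
```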
## Training
[[Scripts]](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/yhzhang/video_dev/scripts/video/train/SO400M_Qwen2_72B_ov_to_video_am9_aug6.sh): Start training models on your single-image/multi-image/video data.
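Assuming you have cloned LLaVA-NeXT and checked out the branch containing this script, launching training is a plain bash invocation; update the data and checkpoint paths inside the script to your own setup first:
```bash
# Run from the root of the LLaVA-NeXT checkout (branch and paths may differ from your environment)
bash scripts/video/train/SO400M_Qwen2_72B_ov_to_video_am9_aug6.sh
```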
## Evaluation Guidance
We use the [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) toolkit to evaluate our models. Ensure you have installed the LLaVA-NeXT model files as per the instructions in the main README.md.
Install lmms-eval:
> pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git
### Reproducing Evaluation Results
Our models' evaluation results can be fully reproduced using the lmms-eval toolkit. After installing lmms-eval and llava, you can run the evaluation using the following commands.
Note: These commands require flash-attn. If you prefer not to install it, disable flash-attn by adding `attn_implementation=None` to the `--model_args` parameter (a variant of the command with this setting is shown at the end of this section).
Important: Different torch versions may cause slight variations in results. By default, `lmms-eval` requires the latest torch release, while the `llava` repo pins torch to `2.1.2`; torch `2.1.2` is stable for both `llava` and `lmms-eval`.
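To match that pin, a minimal sketch is below; torchvision `0.16.2` is the release that pairs with torch `2.1.2`, but check the `llava` repo's requirements for any other pinned packages:
```bash
pip install torch==2.1.2 torchvision==0.16.2
```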
### Evaluating LLaVA-Video on multiple datasets
We recommend that developers and researchers evaluate the models on a broad range of datasets to get a comprehensive picture of their performance in different scenarios, so we provide a list of evaluation tasks below and welcome contributions of more. Please refer to [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) for more details.
```bash
# video tasks
accelerate launch --num_processes=8 \
-m lmms_eval \
--model llava_vid \
--model_args pretrained=lmms-lab/LLaVA-Video-7B-Qwen2,conv_template=qwen_1_5,max_frames_num=64,mm_spatial_pool_mode=average \
--tasks activitynetqa,videochatgpt,nextqa_mc_test,egoschema,video_dc499,videomme,videomme_w_subtitle,perceptiontest_val_mc \
--batch_size 1 \
--log_samples \
--log_samples_suffix llava_vid \
--output_path ./logs/
```
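If flash-attn is not installed, the same launch works with the attention implementation overridden, as noted above. A single-task variant is sketched below; only `--model_args` (with `attn_implementation=None` appended) and the shortened `--tasks` list differ from the command above:
```bash
# Same launch without flash-attn; attn_implementation=None is appended to --model_args
accelerate launch --num_processes=8 \
-m lmms_eval \
--model llava_vid \
--model_args pretrained=lmms-lab/LLaVA-Video-7B-Qwen2,conv_template=qwen_1_5,max_frames_num=64,mm_spatial_pool_mode=average,attn_implementation=None \
--tasks videomme \
--batch_size 1 \
--log_samples \
--log_samples_suffix llava_vid \
--output_path ./logs/
```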
# LLaVA-NeXT Documentation
Welcome to the LLaVA-NeXT documentation. This guide provides an overview of the different components and features of LLaVA-NeXT. Please refer to the following documents for detailed information on specific topics:
1. [LLaVA OneVision](LLaVA_OneVision.md): Learn about the most advanced and unified version: LLaVA OneVision.
- [LLaVA OneVision: Inference Tutorials](LLaVA_OneVision_Tutorials.ipynb): Learn how to use LLaVA OneVision for inference.
- [LLaVA OneVision Chat](LLaVA_OneVision_Chat.md): Improving chat with preference learning.
2. [LLaVA-NeXT Interleave](LLaVA-NeXT-Interleave.md): Explore the interleaved training approach used in LLaVA-NeXT.
3. [LLaVA-NeXT Video (0716)](LLaVA-NeXT-Video_0716.md): Discover the video processing capabilities of LLaVA-NeXT (version 0716).
4. [LLaVA-NeXT Video](LLaVA-NeXT-Video.md): Get information about the latest video processing features in LLaVA-NeXT.
5. [LLaVA-NeXT Overview](LLaVA-NeXT.md): Read a comprehensive overview of the LLaVA-NeXT project, including its architecture, features, and capabilities.
These documents provide in-depth information on various aspects of LLaVA-NeXT. Please refer to them for detailed explanations, implementation details, and usage instructions.
\ No newline at end of file
File added
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from PIL import Image
import requests
import copy
import torch
import sys
import warnings
warnings.filterwarnings("ignore")
pretrained = "lmms-lab/llava-onevision-qwen2-0.5b-si"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map) # Add any other thing you want to pass in llava_model_args
model.eval()
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]
conv_template = "qwen_1_5" # Make sure you use correct chat template for different models
question = DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image?"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [image.size]
cont = model.generate(
    input_ids,
    images=image_tensor,
    image_sizes=image_sizes,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs)
from threading import Thread
from transformers import TextIteratorStreamer
import json
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]
conv_template = "qwen_1_5"
question = DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image?"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [image.size]
max_context_length = getattr(model.config, "max_position_embeddings", 2048)
num_image_tokens = question.count(DEFAULT_IMAGE_TOKEN) * model.get_vision_tower().num_patches
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True, timeout=15)
max_new_tokens = min(4096, max_context_length - input_ids.shape[-1] - num_image_tokens)
if max_new_tokens < 1:
    print(
        json.dumps(
            {
                "text": question + "Exceeds max token length. Please start a new conversation, thanks.",
                "error_code": 0,
            }
        )
    )
else:
    gen_kwargs = {
        "do_sample": False,
        "temperature": 0,
        "max_new_tokens": max_new_tokens,
        "images": image_tensor,
        "image_sizes": image_sizes,
    }
    # Run generation in a background thread so tokens can be read from the streamer as they are produced
    thread = Thread(
        target=model.generate,
        kwargs=dict(
            inputs=input_ids,
            streamer=streamer,
            **gen_kwargs,
        ),
    )
    thread.start()
    generated_text = ""
    for new_text in streamer:
        generated_text += new_text
        sys.stdout.write(new_text)
        sys.stdout.flush()
    print("\nFinal output:", generated_text)