Unverified Commit e71f2863 authored by Raushan Turganbay, committed by GitHub

Add LLaVa NeXT Video (#31252)



* squash into single commit

* run diff once more

* docstring

* tests

* minor changes and ready to go

* Update src/transformers/models/llava_next_video/processing_llava_next_video.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update tests/models/vipllava/test_modeling_vipllava.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* [run-slow] llava-next-video

* [run-slow] llava-next-video

* [run-slow] llava_next_video

* fix two tests

* fix slow tests

* remove logit checks due to numeric errors

* run test once more

* [run-slow] llava_next_video

* final try to pass the test

* [run-slow] llava_next_video

* [run-slow] llava_next_video

* [run-slow] llava_next_video

* style

* fix

* style

---------
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
parent b1ec7454
@@ -794,6 +794,8 @@
       title: Llava
     - local: model_doc/llava_next
       title: LLaVA-NeXT
+    - local: model_doc/llava-next-video
+      title: LLaVa-NeXT-Video
     - local: model_doc/lxmert
       title: LXMERT
     - local: model_doc/matcha
...
@@ -182,6 +182,7 @@ Flax), PyTorch, and/or TensorFlow.
 | [Llama3](model_doc/llama3) | ✅ | ❌ | ✅ |
 | [LLaVa](model_doc/llava) | ✅ | ❌ | ❌ |
 | [LLaVA-NeXT](model_doc/llava_next) | ✅ | ❌ | ❌ |
+| [LLaVa-NeXT-Video](model_doc/llava-next-video) | ✅ | ❌ | ❌ |
 | [Longformer](model_doc/longformer) | ✅ | ✅ | ❌ |
 | [LongT5](model_doc/longt5) | ✅ | ❌ | ✅ |
 | [LUKE](model_doc/luke) | ✅ | ❌ | ❌ |
...
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# LLaVa-NeXT-Video

## Overview

The LLaVa-NeXT-Video model was proposed in [LLaVA-NeXT: A Strong Zero-shot Video Understanding Model](https://llava-vl.github.io/blog/2024-04-30-llava-next-video/) by Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, Chunyuan Li. LLaVa-NeXT-Video improves upon [LLaVa-NeXT](llava_next) by fine-tuning on a mix of video and image data, thus increasing the model's performance on videos.

[LLaVA-NeXT](llava_next) surprisingly has strong performance in understanding video content in a zero-shot fashion thanks to the AnyRes technique it uses. AnyRes represents a high-resolution image as a set of multiple smaller images, and this representation generalizes naturally to videos, because a video can be treated as a set of frames (similar to a set of images in LLaVa-NeXT). The current version of LLaVA-NeXT-Video makes use of AnyRes and is trained with supervised fine-tuning (SFT) on top of LLaVA-NeXT on video data to achieve better video understanding capabilities. The model is currently SOTA among open-source models on the [VideoMME benchmark](https://arxiv.org/abs/2405.21075).

The introduction from the blog is the following:
On January 30, 2024, we released LLaVA-NeXT, an open-source Large Multimodal Model (LMM) that has been trained exclusively on text-image data. With the proposed AnyRes technique, it boosts capabilities in reasoning, OCR, and world knowledge, demonstrating remarkable performance across a spectrum of image-based multimodal understanding tasks, and even exceeding Gemini-Pro on several image benchmarks, e.g. MMMU and MathVista.
**In today’s exploration, we delve into the performance of LLaVA-NeXT within the realm of video understanding tasks. We reveal that LLaVA-NeXT surprisingly has strong performance in understanding video content. The current version of LLaVA-NeXT for videos has several improvements:
- Zero-shot video representation capabilities with AnyRes: The AnyRes technique naturally represents a high-resolution image into multiple images that a pre-trained VIT is able to digest, and forms them into a concatenated sequence. This technique is naturally generalizable to represent videos (consisting of multiple frames), allowing the image-only-trained LLaVA-Next model to perform surprisingly well on video tasks. Notably, this is the first time that LMMs show strong zero-shot modality transfer ability.
- Inference with length generalization improves on longer videos. The linear scaling technique enables length generalization, allowing LLaVA-NeXT to effectively handle long-video beyond the limitation of the "max_token_length" of the LLM.
- Strong video understanding ability. (1) LLaVA-Next-Image, which combines the above two techniques, yields superior zero-shot performance than open-source LMMs tuned on videos. (2) LLaVA-Next-Video, further supervised fine-tuning (SFT) LLaVA-Next-Image on video data, achieves better video understanding capabilities compared to LLaVA-Next-Image. (3) LLaVA-Next-Video-DPO, which aligns the model response with AI feedback using direct preference optimization (DPO), showing significant performance boost.
- Efficient deployment and inference with SGLang. It allows 5x faster inference on video tasks, allowing more scalable serving such as million-level video re-captioning. See instructions in our repo.**
This model was contributed by [RaushanTurganbay](https://huggingface.co/RaushanTurganbay).
The original code can be found [here](https://github.com/LLaVA-VL/LLaVA-NeXT/tree/inference).
## Usage tips

- We advise users to use `padding_side="left"` when computing batched generation as it leads to more accurate results. Simply make sure to set `processor.tokenizer.padding_side = "left"` before generating (a batched-generation sketch is shown after the single-media example below).
- Note that each checkpoint has been trained with a specific prompt format, depending on which large language model (LLM) was used. You can use the processor's `apply_chat_template` to format your prompts correctly. Below is an example of how to do that.

We will use [LLaVA-NeXT-Video-7B-hf](https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-hf) and a conversation history of videos and images. Each content field has to be a list of dicts, as follows:

```python
from transformers import LlavaNextVideoProcessor
processor = LlavaNextVideoProcessor.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf")
conversation = [
{
"role": "system",
"content": [
{"type": "text", "text": "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions."},
],
},
{
"role": "user",
"content": [
{"type": "text", "text": "What’s shown in this image?"},
{"type": "image"},
],
},
{
"role": "assistant",
"content": [{"type": "text", "text": "This image shows a red stop sign."},]
},
{
"role": "user",
"content": [
{"type": "text", "text": "Why is this video funny?"},
{"type": "video"},
],
},
]
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Note that the template only formats your prompt; you still need to tokenize it and obtain pixel values for your visuals
print(text_prompt)
```
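To turn the formatted prompt into model inputs, pass it to the processor together with the visuals. A minimal sketch, assuming `image` is a `PIL.Image.Image` and `video` is a NumPy array of sampled frames (both are placeholders here; the usage example below shows how to load them):

```python
# Minimal sketch: tokenize the formatted prompt and compute pixel values in a single call.
# `image` (a PIL image) and `video` (an np.ndarray of frames) are assumed to be loaded already,
# for example as shown in the usage example below.
inputs = processor(text=text_prompt, images=image, videos=video, return_tensors="pt")
print(inputs.keys())  # input_ids, attention_mask, and the image/video pixel values
```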
## Usage example

### Single Media Mode

The model can accept both images and videos as input. Here's example code for inference in half precision (`torch.float16`):

```python
import av
import numpy as np
import torch
from huggingface_hub import hf_hub_download
from transformers import LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor
def read_video_pyav(container, indices):
'''
Decode the video with PyAV decoder.
Args:
container (`av.container.input.InputContainer`): PyAV container.
indices (`List[int]`): List of frame indices to decode.
Returns:
result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
'''
frames = []
container.seek(0)
start_index = indices[0]
end_index = indices[-1]
for i, frame in enumerate(container.decode(video=0)):
if i > end_index:
break
if i >= start_index and i in indices:
frames.append(frame)
return np.stack([x.to_ndarray(format="rgb24") for x in frames])
# Load the model in half-precision
model = LlavaNextVideoForConditionalGeneration.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf", torch_dtype=torch.float16, device_map="auto")
processor = LlavaNextVideoProcessor.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf")
# Load the video as an np.array, sampling uniformly 8 frames (can sample more for longer videos)
video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
container = av.open(video_path)
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
video = read_video_pyav(container, indices)
conversation = [
{
"role": "user",
"content": [
{"type": "text", "text": "Why is this video funny?"},
{"type": "video"},
],
},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=prompt, videos=video, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=60)
processor.batch_decode(out, skip_special_tokens=True, clean_up_tokenization_spaces=True)
```
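For batched generation, apply the `padding_side="left"` tip from the usage tips above. A minimal sketch, reusing the `model`, `processor`, `prompt`, and `video` defined in the snippet above; the second conversation is a hypothetical example:

```python
# Batched generation: pad on the left for more accurate results (see Usage tips)
processor.tokenizer.padding_side = "left"

conversation_2 = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is happening in this video?"},
            {"type": "video"},
        ],
    },
]
prompt_2 = processor.apply_chat_template(conversation_2, add_generation_prompt=True)

# One video per prompt; `padding=True` pads the shorter prompt to the longest one
inputs = processor(text=[prompt, prompt_2], videos=[video, video], padding=True, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=60)
processor.batch_decode(out, skip_special_tokens=True, clean_up_tokenization_spaces=True)
```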
### Mixed Media Mode

The model can also generate from interleaved image-video inputs. Note, however, that it was not trained on interleaved image-video data, which might affect the performance. Below is an example of mixed media usage; add the following lines to the code snippet above:

```python
from PIL import Image
import requests
# Generate from mixed image and video inputs
# Load an image and write a new prompt
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
conversation = [
{
"role": "user",
"content": [
{"type": "text", "text": "How many cats are there in the image?"},
{"type": "image"},
],
},
{
"role": "assistant",
"content": [{"type": "text", "text": "There are two cats"}],
},
{
"role": "user",
"content": [
{"type": "text", "text": "Why is this video funny?"},
{"type": "video"},
],
},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, videos=video, padding=True, return_tensors="pt")
# Generate
generate_ids = model.generate(**inputs, max_length=50)
processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
```
## Model optimization

### Quantization using Bitsandbytes for memory efficiency

The model can be loaded in lower precision (for example 4-bit), significantly reducing the memory burden while maintaining the performance of the original model. This allows for efficient deployment in resource-constrained cases.

First, make sure to install bitsandbytes by running `pip install bitsandbytes` and to have access to a CUDA-compatible GPU device. Load the quantized model by simply adding a [`BitsAndBytesConfig`](../main_classes/quantization#transformers.BitsAndBytesConfig) as shown below:

```python
import torch
from transformers import BitsAndBytesConfig, LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor

# specify how to quantize the model
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16,
)
model = LlavaNextVideoForConditionalGeneration.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf", quantization_config=quantization_config, device_map="auto")
```
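Generation with the quantized model then works exactly as in the half-precision example. A minimal sketch, reusing `processor`, `prompt`, and `video` from the single-media example above (`load_in_8bit=True` is the analogous option if you prefer 8-bit quantization):

```python
# Inference with the 4-bit model uses the same processor call and generate() as before
inputs = processor(text=prompt, videos=video, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=60)
processor.batch_decode(out, skip_special_tokens=True, clean_up_tokenization_spaces=True)
```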
### Flash-Attention 2 to speed up generation

Additionally, we can greatly speed up model inference by using [Flash Attention](../perf_train_gpu_one.md#flash-attention-2), which is a faster implementation of the attention mechanism used inside the model.

First, make sure to install the latest version of Flash Attention 2:
```bash
pip install -U flash-attn --no-build-isolation
```
Also, your hardware should be compatible with Flash Attention 2. Read more about it in the official documentation of the [flash attention repository](https://github.com/Dao-AILab/flash-attention). FlashAttention-2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`.

To load and run a model using Flash Attention 2, simply add `attn_implementation="flash_attention_2"` when loading the model as follows:

```python
import torch
from transformers import LlavaNextVideoForConditionalGeneration

model = LlavaNextVideoForConditionalGeneration.from_pretrained(
"llava-hf/LLaVA-NeXT-Video-7B-hf",
torch_dtype=torch.float16,
attn_implementation="flash_attention_2",
).to(0)
```
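Flash Attention 2 can also be combined with the quantization setup from the previous section, since the 4-bit config above already computes in `torch.float16`. A minimal sketch, assuming `quantization_config` from that section:

```python
# 4-bit weights + Flash Attention 2 for both lower memory use and faster generation
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    "llava-hf/LLaVA-NeXT-Video-7B-hf",
    quantization_config=quantization_config,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```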
## LlavaNextVideoConfig

[[autodoc]] LlavaNextVideoConfig

## LlavaNextVideoProcessor

[[autodoc]] LlavaNextVideoProcessor

## LlavaNextVideoImageProcessor

[[autodoc]] LlavaNextVideoImageProcessor

## LlavaNextVideoForConditionalGeneration

[[autodoc]] LlavaNextVideoForConditionalGeneration
    - forward
@@ -55,6 +55,7 @@ FlashAttention-2 is currently supported for the following architectures:
 * [Llama](https://huggingface.co/docs/transformers/model_doc/llama#transformers.LlamaModel)
 * [Llava](https://huggingface.co/docs/transformers/model_doc/llava)
 * [Llava-NeXT](https://huggingface.co/docs/transformers/model_doc/llava_next)
+* [Llava-NeXT-Video](https://huggingface.co/docs/transformers/model_doc/llava_next_video)
 * [VipLlava](https://huggingface.co/docs/transformers/model_doc/vipllava)
 * [VideoLlava](https://huggingface.co/docs/transformers/model_doc/video_llava)
 * [M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)
...
@@ -516,6 +516,10 @@ _import_structure = {
         "LlavaNextConfig",
         "LlavaNextProcessor",
     ],
+    "models.llava_next_video": [
+        "LlavaNextVideoConfig",
+        "LlavaNextVideoProcessor",
+    ],
     "models.longformer": [
         "LongformerConfig",
         "LongformerTokenizer",
@@ -1148,6 +1152,7 @@ else:
     _import_structure["models.layoutlmv3"].extend(["LayoutLMv3FeatureExtractor", "LayoutLMv3ImageProcessor"])
     _import_structure["models.levit"].extend(["LevitFeatureExtractor", "LevitImageProcessor"])
     _import_structure["models.llava_next"].append("LlavaNextImageProcessor")
+    _import_structure["models.llava_next_video"].append("LlavaNextVideoImageProcessor")
     _import_structure["models.mask2former"].append("Mask2FormerImageProcessor")
     _import_structure["models.maskformer"].extend(["MaskFormerFeatureExtractor", "MaskFormerImageProcessor"])
     _import_structure["models.mobilenet_v1"].extend(["MobileNetV1FeatureExtractor", "MobileNetV1ImageProcessor"])
@@ -2432,6 +2437,12 @@ else:
             "LlavaNextPreTrainedModel",
         ]
     )
+    _import_structure["models.llava_next_video"].extend(
+        [
+            "LlavaNextVideoForConditionalGeneration",
+            "LlavaNextVideoPreTrainedModel",
+        ]
+    )
     _import_structure["models.longformer"].extend(
         [
             "LongformerForMaskedLM",
@@ -5137,6 +5148,10 @@ if TYPE_CHECKING:
         LlavaNextConfig,
         LlavaNextProcessor,
     )
+    from .models.llava_next_video import (
+        LlavaNextVideoConfig,
+        LlavaNextVideoProcessor,
+    )
     from .models.longformer import (
         LongformerConfig,
         LongformerTokenizer,
@@ -5804,6 +5819,7 @@ if TYPE_CHECKING:
     )
     from .models.levit import LevitFeatureExtractor, LevitImageProcessor
     from .models.llava_next import LlavaNextImageProcessor
+    from .models.llava_next_video import LlavaNextVideoImageProcessor
     from .models.mask2former import Mask2FormerImageProcessor
     from .models.maskformer import (
         MaskFormerFeatureExtractor,
@@ -6874,6 +6890,10 @@ if TYPE_CHECKING:
         LlavaNextForConditionalGeneration,
         LlavaNextPreTrainedModel,
     )
+    from .models.llava_next_video import (
+        LlavaNextVideoForConditionalGeneration,
+        LlavaNextVideoPreTrainedModel,
+    )
     from .models.longformer import (
         LongformerForMaskedLM,
         LongformerForMultipleChoice,
...
@@ -125,6 +125,7 @@ from . import (
     llama,
     llava,
     llava_next,
+    llava_next_video,
     longformer,
     longt5,
     luke,
...
@@ -141,6 +141,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
         ("lilt", "LiltConfig"),
         ("llama", "LlamaConfig"),
         ("llava", "LlavaConfig"),
+        ("llava-next-video", "LlavaNextVideoConfig"),
         ("llava_next", "LlavaNextConfig"),
         ("longformer", "LongformerConfig"),
         ("longt5", "LongT5Config"),
@@ -421,6 +422,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
         ("llama2", "Llama2"),
         ("llama3", "Llama3"),
         ("llava", "LLaVa"),
+        ("llava-next-video", "LLaVa-NeXT-Video"),
         ("llava_next", "LLaVA-NeXT"),
         ("longformer", "Longformer"),
         ("longt5", "LongT5"),
...
@@ -95,6 +95,7 @@ else:
         ("layoutlmv3", ("LayoutLMv3ImageProcessor",)),
         ("levit", ("LevitImageProcessor",)),
         ("llava", ("CLIPImageProcessor",)),
+        ("llava-next-video", ("LlavaNextVideoImageProcessor",)),
         ("llava_next", ("LlavaNextImageProcessor",)),
         ("mask2former", ("Mask2FormerImageProcessor",)),
         ("maskformer", ("MaskFormerImageProcessor",)),
...
@@ -299,6 +299,7 @@ MODEL_FOR_PRETRAINING_MAPPING_NAMES = OrderedDict(
         ("idefics2", "Idefics2ForConditionalGeneration"),
         ("layoutlm", "LayoutLMForMaskedLM"),
         ("llava", "LlavaForConditionalGeneration"),
+        ("llava-next-video", "LlavaNextVideoForConditionalGeneration"),
         ("llava_next", "LlavaNextForConditionalGeneration"),
         ("longformer", "LongformerForMaskedLM"),
         ("luke", "LukeForMaskedLM"),
@@ -700,6 +701,7 @@ MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES = OrderedDict(
         ("instructblipvideo", "InstructBlipVideoForConditionalGeneration"),
         ("kosmos-2", "Kosmos2ForConditionalGeneration"),
         ("llava", "LlavaForConditionalGeneration"),
+        ("llava-next-video", "LlavaNextVideoForConditionalGeneration"),
         ("llava_next", "LlavaNextForConditionalGeneration"),
         ("paligemma", "PaliGemmaForConditionalGeneration"),
         ("pix2struct", "Pix2StructForConditionalGeneration"),
...
@@ -69,6 +69,7 @@ PROCESSOR_MAPPING_NAMES = OrderedDict(
         ("layoutlmv2", "LayoutLMv2Processor"),
         ("layoutlmv3", "LayoutLMv3Processor"),
         ("llava", "LlavaProcessor"),
+        ("llava-next-video", "LlavaNextVideoProcessor"),
         ("llava_next", "LlavaNextProcessor"),
         ("markuplm", "MarkupLMProcessor"),
         ("mctct", "MCTCTProcessor"),
...
@@ -242,6 +242,7 @@ else:
             ),
         ),
         ("llava", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
+        ("llava-next-video", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
         ("llava_next", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
         ("longformer", ("LongformerTokenizer", "LongformerTokenizerFast" if is_tokenizers_available() else None)),
         (
...
@@ -545,8 +545,9 @@ class LlavaNextForConditionalGeneration(LlavaNextPreTrainedModel):
         )
         # Compute the maximum embed dimension
         # max_image_feature_lens is max_feature_lens per batch
+        feature_lens = feature_lens.to(input_ids.device)
         feature_lens_batch = feature_lens.split(num_special_image_tokens.tolist(), dim=0)
-        feature_lens_batch_sum = torch.tensor([x.sum() for x in feature_lens_batch], device=feature_lens.device)
+        feature_lens_batch_sum = torch.tensor([x.sum() for x in feature_lens_batch], device=input_ids.device)
         embed_sequence_lengths = (
             (attention_mask == 1).long().sum(-1) - num_special_image_tokens + feature_lens_batch_sum
         )
@@ -577,9 +578,9 @@ class LlavaNextForConditionalGeneration(LlavaNextPreTrainedModel):
         final_attention_mask = torch.zeros(
             batch_size, max_embed_dim, dtype=attention_mask.dtype, device=inputs_embeds.device
         )
-        final_labels = None
-        if labels is not None:
-            final_labels = torch.full_like(final_attention_mask, ignore_index).to(torch.long)
+        final_input_ids = torch.full(
+            (batch_size, max_embed_dim), self.pad_token_id, dtype=input_ids.dtype, device=inputs_embeds.device
+        )
         # In case the Vision model or the Language model has been offloaded to CPU, we need to manually
         # set the corresponding tensors into their correct target device.
         target_device = inputs_embeds.device
@@ -589,12 +590,17 @@ class LlavaNextForConditionalGeneration(LlavaNextPreTrainedModel):
             text_to_overwrite.to(target_device),
         )
         attention_mask = attention_mask.to(target_device)
+        input_ids = input_ids.to(target_device)
         # 4. Fill the embeddings based on the mask. If we have ["hey" "<image>", "how", "are"]
         # we need to index copy on [0, 577, 578, 579] for the text and [1:576] for the image features
         final_embedding[batch_indices, text_to_overwrite] = inputs_embeds[batch_indices, non_image_indices]
         final_attention_mask[batch_indices, text_to_overwrite] = attention_mask[batch_indices, non_image_indices]
+        final_input_ids[batch_indices, text_to_overwrite] = input_ids[batch_indices, non_image_indices]
+        final_labels = None
         if labels is not None:
+            labels = labels.to(target_device)
+            final_labels = torch.full_like(final_attention_mask, ignore_index).to(torch.long)
             final_labels[batch_indices, text_to_overwrite] = labels[batch_indices, non_image_indices]
         # 5. Fill the embeddings corresponding to the images. Anything that is not `text_positions` needs filling (#29835)
@@ -609,6 +615,7 @@ class LlavaNextForConditionalGeneration(LlavaNextPreTrainedModel):
         if left_padding:
             # exclude padding on the left
+            max_embed_dim = max_embed_dim.to(target_device)
             val = (max_embed_dim - embed_indices) <= embed_seq_lens
         else:
             # exclude padding on the right
@@ -626,7 +633,7 @@ class LlavaNextForConditionalGeneration(LlavaNextPreTrainedModel):
         final_attention_mask |= image_to_overwrite
         position_ids = (final_attention_mask.cumsum(-1) - 1).masked_fill_((final_attention_mask == 0), 1)
-        return final_embedding, final_attention_mask, position_ids, final_labels
+        return final_embedding, final_attention_mask, position_ids, final_labels, final_input_ids
     def pack_image_features(self, image_features, image_sizes, image_newline=None):
         """
@@ -796,7 +803,7 @@ class LlavaNextForConditionalGeneration(LlavaNextPreTrainedModel):
         )
         inputs_embeds = inputs_embeds.to(image_features.dtype)
-        inputs_embeds, attention_mask, position_ids, labels = self._merge_input_ids_with_image_features(
+        inputs_embeds, attention_mask, position_ids, labels, _ = self._merge_input_ids_with_image_features(
             image_features,
             feature_lens,
             inputs_embeds,
...
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available, is_vision_available
_import_structure = {
"configuration_llava_next_video": ["LlavaNextVideoConfig"],
"processing_llava_next_video": ["LlavaNextVideoProcessor"],
}
try:
if not is_vision_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["image_processing_llava_next_video"] = ["LlavaNextVideoImageProcessor"]
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["modeling_llava_next_video"] = [
"LlavaNextVideoForConditionalGeneration",
"LlavaNextVideoPreTrainedModel",
]
if TYPE_CHECKING:
from .configuration_llava_next_video import LlavaNextVideoConfig
from .processing_llava_next_video import LlavaNextVideoProcessor
try:
if not is_vision_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .image_processing_llava_next_video import LlavaNextVideoImageProcessor
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .modeling_llava_next_video import (
LlavaNextVideoForConditionalGeneration,
LlavaNextVideoPreTrainedModel,
)
else:
import sys
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure)
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
# This file was automatically generated from <path_to_diff_file.py>.
# Do NOT edit this file manually as any edits will be overwritten by the generation of
# the file from the diff. If any change should be done, please apply the change to the
# diff.py file directly.
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
# coding=utf-8
# Copyright 2024 HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from transformers import PretrainedConfig
from ..auto import CONFIG_MAPPING
class LlavaNextVideoConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`LlavaNextVideoForConditionalGeneration`]. It is used to instantiate an
Llava-NeXT model according to the specified arguments, defining the model architecture. Instantiating a configuration
with the defaults will yield a similar configuration to that of the [llava-hf/LLaVA-NeXT-Video-7B-hf](https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-hf)
model.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
vision_config (`Union[AutoConfig, dict]`, *optional*, defaults to `CLIPVisionConfig`):
The config object or dictionary of the vision backbone.
text_config (`Union[AutoConfig, dict]`, *optional*, defaults to `LlamaConfig`):
The config object or dictionary of the text backbone.
ignore_index (`int`, *optional*, defaults to -100):
The ignore index for the loss function.
image_token_index (`int`, *optional*, defaults to 32001):
The image token index to encode the image prompt.
projector_hidden_act (`str`, *optional*, defaults to `"gelu"`):
The activation function used by the multimodal projector.
vision_feature_select_strategy (`str`, *optional*, defaults to `"default"`):
The feature selection strategy used to select the vision feature from the vision backbone.
Can be one of `"default"` or `"full"`. If `"default"`, the CLS token is removed from the vision features.
If `"full"`, the full vision features are used.
vision_feature_layer (`int`, *optional*, defaults to -2):
The index of the layer to select the vision feature.
image_grid_pinpoints (`List`, *optional*, defaults to `[[336, 672], [672, 336], [672, 672], [1008, 336], [336, 1008]]`):
A list of possible resolutions to use for processing high resolution images. Each item in the list should be a tuple or list
of the form `(height, width)`.
tie_word_embeddings (`bool`, *optional*, defaults to `False`):
Whether the model's input and output word embeddings should be tied.
video_token_index (`int`, *optional*, defaults to 32000):
The video token index to encode the video prompt.
spatial_pool_mode (`str`, *optional*, defaults to `"average"`):
Pooling mode to use for videos. Can be "average", "max" or "conv".
spatial_pool_stride (`int`, *optional*, defaults to 2):
Stride used in the pooling layer for videos.
Example:
```python
>>> from transformers import LlavaNextVideoForConditionalGeneration, LlavaNextVideoConfig, CLIPVisionConfig, LlamaConfig
>>> # Initializing a CLIP-vision config
>>> vision_config = CLIPVisionConfig()
>>> # Initializing a Llama config
>>> text_config = LlamaConfig()
>>> configuration = LlavaNextVideoConfig(vision_config, text_config)
>>> model = LlavaNextVideoForConditionalGeneration(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "llava_next_video"
is_composition = True
def __init__(
self,
vision_config=None,
text_config=None,
ignore_index=-100,
image_token_index=32001,
projector_hidden_act="gelu",
vision_feature_select_strategy="default",
vision_feature_layer=-2,
image_grid_pinpoints=None,
tie_word_embeddings=False,
video_token_index=32000,
spatial_pool_mode="average",
spatial_pool_stride=2,
**kwargs,
):
self.video_token_index = video_token_index
self.spatial_pool_mode = spatial_pool_mode
self.spatial_pool_stride = spatial_pool_stride
self.ignore_index = ignore_index
self.image_token_index = image_token_index
self.projector_hidden_act = projector_hidden_act
if vision_feature_select_strategy not in ["default", "full"]:
raise ValueError(
"vision_feature_select_strategy should be one of 'default', 'full'."
f"Got: {vision_feature_select_strategy}"
)
self.vision_feature_select_strategy = vision_feature_select_strategy
self.vision_feature_layer = vision_feature_layer
image_grid_pinpoints = (
image_grid_pinpoints
if image_grid_pinpoints is not None
else [[336, 672], [672, 336], [672, 672], [1008, 336], [336, 1008]]
)
self.image_grid_pinpoints = image_grid_pinpoints
if isinstance(vision_config, dict):
vision_config["model_type"] = (
vision_config["model_type"] if "model_type" in vision_config else "clip_vision_model"
)
vision_config = CONFIG_MAPPING[vision_config["model_type"]](**vision_config)
elif vision_config is None:
vision_config = CONFIG_MAPPING["clip_vision_model"](
intermediate_size=4096,
hidden_size=1024,
patch_size=14,
image_size=336,
num_hidden_layers=24,
num_attention_heads=16,
vocab_size=32000,
projection_dim=768,
)
self.vision_config = vision_config
if isinstance(text_config, dict):
text_config["model_type"] = text_config["model_type"] if "model_type" in text_config else "llama"
text_config = CONFIG_MAPPING[text_config["model_type"]](**text_config)
elif text_config is None:
text_config = CONFIG_MAPPING["llama"]()
self.text_config = text_config
super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Convert LLaVa-NeXT-Video checkpoints from the original repository.
URL: https://github.com/LLaVA-VL/LLaVA-NeXT/tree/inference
"""
import argparse
import glob
import json
from pathlib import Path
import torch
from accelerate import init_empty_weights
from huggingface_hub import hf_hub_download, snapshot_download
from safetensors import safe_open
from transformers import (
AddedToken,
AutoConfig,
AutoTokenizer,
LlavaNextImageProcessor,
LlavaNextVideoConfig,
LlavaNextVideoForConditionalGeneration,
LlavaNextVideoImageProcessor,
LlavaNextVideoProcessor,
)
KEYS_TO_MODIFY_MAPPING = {
"model.vision_tower.": "",
".vision_resampler": "", # all lmms-lab models do avg pooling, so no vision_resampler
"model.mm_projector": "multi_modal_projector",
"model": "model.model",
"vision_model.model": "vision_model",
"lm_head": "language_model.lm_head",
"model.model": "language_model.model",
"multi_modal_projector.0": "multi_modal_projector.linear_1",
"multi_modal_projector.2": "multi_modal_projector.linear_2",
"language_model.model.image_newline": "image_newline",
}
# {{SYSTEM_PROMPT}} USER: <image>\n{{PROMPT}} ASSISTANT:" assistant end with "</s> "
chat_vicuna = (
"{% for message in messages %}"
"{% if message['role'] == 'system' %}"
"{{ message['content'][0]['text'] }}"
"{% else %}"
"{{ message['role'].upper() + ': '}}"
"{% endif %}"
"{# Render all images first #}"
"{% for content in message['content'] | selectattr('type', 'equalto', 'image') %}"
"{{ '<image>\n' }}"
"{% endfor %}"
"{# Render all text next #}"
"{% for content in message['content'] | selectattr('type', 'equalto', 'text') %}"
"{{ content['text'] + ' '}}"
"{% endfor %}"
"{% endfor %}"
"{% if add_generation_prompt %}"
"{{ 'ASSISTANT:' }}"
"{% endif %}"
)
# "[INST] <image>\nWhat is shown in this image? [/INST]" assistant end with "</s> "
chat_mistral = (
"{% for message in messages %}"
"{% if message['role'] == 'user' %}"
"{{ '[INST] ' }}"
"{# Render all images first #}"
"{% for content in message['content'] | selectattr('type', 'equalto', 'image') %}"
"{{ '<image>\n' }}"
"{% endfor %}"
"{# Render all text next #}"
"{% for content in message['content'] | selectattr('type', 'equalto', 'text') %}"
"{{ content['text'] }}"
"{% endfor %}"
"{{' [/INST]' }}"
"{% elif message['role'] == 'assistant' %}"
r"{{ ' ' + message['content'][0]['text'] + '<\s> '}}"
"{% else %}"
"{{ raise_exception('Only user and assistant roles are supported!') }}"
"{% endif %}"
"{% endfor %}"
)
# "<|im_start|>system\nAnswer the questions.<|im_end|><|im_start|>user\n<image>\nWhat is shown in this image?<|im_end|><|im_start|>assistant\n"
chat_yi = (
"{% for message in messages %}"
"{{'<|im_start|>' + message['role'] + '\n'}}"
"{# Render all images first #}"
"{% for content in message['content'] | selectattr('type', 'equalto', 'image') %}"
"{{ '<image>\n' }}"
"{% endfor %}"
"{# Render all text next #}"
"{% for content in message['content'] | selectattr('type', 'equalto', 'text') %}"
"{{ content['text'] }}"
"{% endfor %}"
"{{'<|im_end|>' + '\n'}}"
"{% endfor %}"
"{% if add_generation_prompt %}"
"{{ '<|im_start|>assistant\n' }}"
"{% endif %}"
)
model2template = {
"lmms-lab/LLaVA-NeXT-Video-7B-32K": chat_mistral,
"lmms-lab/LLaVA-NeXT-Video-7B": chat_vicuna,
"lmms-lab/LLaVA-NeXT-Video-7B-DPO": chat_vicuna,
"lmms-lab/LLaVA-NeXT-Video-34B": chat_yi,
"lmms-lab/LLaVA-NeXT-Video-34B-DPO": chat_yi,
}
def load_original_state_dict(model_id):
directory_path = snapshot_download(repo_id=model_id, allow_patterns=["*.safetensors"])
original_state_dict = {}
for path in glob.glob(f"{directory_path}/*"):
if path.endswith(".safetensors"):
with safe_open(path, framework="pt", device="cpu") as f:
for key in f.keys():
original_state_dict[key] = f.get_tensor(key)
return original_state_dict
def convert_state_dict_to_hf(state_dict):
new_state_dict = {}
for key, value in state_dict.items():
if key.endswith(".inv_freq"):
continue
for key_to_modify, new_key in KEYS_TO_MODIFY_MAPPING.items():
if key_to_modify in key:
key = key.replace(key_to_modify, new_key)
new_state_dict[key] = value.to(torch.bfloat16)
return new_state_dict
def convert_llava_to_hf(model_id, pytorch_dump_folder_path, push_to_hub=False):
# load original config
filepath = hf_hub_download(repo_id=model_id, filename="config.json", repo_type="model")
with open(filepath) as f:
data = json.load(f)
print(data)
if model_id == "lmms-lab/LLaVA-NeXT-Video-7B-32K":
text_model_id = "mistralai/Mistral-7B-Instruct-v0.2"
video_token_index = 32000
image_token_index = 32001
overwrite_text_config = {}
elif model_id in ["lmms-lab/LLaVA-NeXT-Video-7B", "lmms-lab/LLaVA-NeXT-Video-7B-DPO"]:
text_model_id = "lmsys/vicuna-7b-v1.5"
video_token_index = 32000
image_token_index = 32001
overwrite_text_config = {"factor": 2.0, "type": "linear"}
elif model_id in ["lmms-lab/LLaVA-NeXT-Video-34B", "lmms-lab/LLaVA-NeXT-Video-34B-DPO"]:
text_model_id = "NousResearch/Nous-Hermes-2-Yi-34B"
video_token_index = 64000
image_token_index = 64001
overwrite_text_config = {}
else:
raise ValueError("Incorrect checkpoint referenced. Text model-id not identified!")
vision_model_id = data["mm_vision_tower"]
torch.set_default_dtype(torch.bfloat16)
text_config = AutoConfig.from_pretrained(text_model_id)
text_config = text_config.to_dict()
text_config.update(overwrite_text_config)
tokenizer = AutoTokenizer.from_pretrained(text_model_id, use_fast=True, padding_side="left")
tokenizer.add_tokens(AddedToken("<video>", special=True, normalized=False), special_tokens=True)
tokenizer.add_tokens(AddedToken("<image>", special=True, normalized=False), special_tokens=True)
image_processor = LlavaNextImageProcessor.from_pretrained(vision_model_id)
video_processor = LlavaNextVideoImageProcessor.from_pretrained(vision_model_id)
processor = LlavaNextVideoProcessor(
tokenizer=tokenizer,
video_processor=video_processor,
image_processor=image_processor,
chat_template=model2template[model_id],
)
config = LlavaNextVideoConfig(
text_config=text_config,
image_grid_pinpoints=image_processor.image_grid_pinpoints,
use_image_newline_parameter=True,
video_token_index=video_token_index,
image_token_index=image_token_index,
)
with init_empty_weights():
model = LlavaNextVideoForConditionalGeneration(config)
# load original state dict
state_dict = load_original_state_dict(model_id)
state_dict = convert_state_dict_to_hf(state_dict)
model.load_state_dict(state_dict, assign=True, strict=True)
# See https://nlp.stanford.edu/~johnhew/vocab-expansion.html for why we get mean/stdev this way to expand embeddings
pre_expansion_embeddings = model.language_model.model.embed_tokens.weight.data
mu = torch.mean(pre_expansion_embeddings, dim=0).float()
n = pre_expansion_embeddings.size()[0]
sigma = ((pre_expansion_embeddings - mu).T @ (pre_expansion_embeddings - mu)) / n
dist = torch.distributions.multivariate_normal.MultivariateNormal(mu, covariance_matrix=1e-5 * sigma)
# We add an image token so we resize the model
# Pad to 64 for performance reasons
pad_shape = 64
vocab_size = config.text_config.vocab_size
# this one has 3 additional tokens, namely <image>, <video> and <pad>
num_tokens = vocab_size + 3
model.resize_token_embeddings(num_tokens, pad_to_multiple_of=pad_shape)
model.language_model.model.embed_tokens.weight.data[vocab_size:] = torch.stack(
tuple(
(dist.sample() for _ in range(model.language_model.model.embed_tokens.weight.data[vocab_size:].shape[0]))
),
dim=0,
)
model.language_model.lm_head.weight.data[vocab_size:] = torch.stack(
tuple((dist.sample() for _ in range(model.language_model.lm_head.weight.data[vocab_size:].shape[0]))),
dim=0,
)
if pytorch_dump_folder_path is not None:
print(f"Saving model and processor for {model_id} to {pytorch_dump_folder_path}")
Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
model.save_pretrained(pytorch_dump_folder_path)
processor.save_pretrained(pytorch_dump_folder_path)
if push_to_hub:
repo_id = model_id.split("/")[-1]
print(f"Pushing model to hub repo: {repo_id}")
model.push_to_hub(f"llava-hf/{repo_id}-hf")
processor.push_to_hub(f"llava-hf/{repo_id}-hf")
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--model_id",
help="Hub location of the model to convert",
default="lmms-lab/LLaVA-NeXT-Video-7B",
choices=[
"lmms-lab/LLaVA-NeXT-Video-7B",
"lmms-lab/LLaVA-NeXT-Video-7B-DPO",
"lmms-lab/LLaVA-NeXT-Video-7B-32K",
"lmms-lab/LLaVA-NeXT-Video-34B",
"lmms-lab/LLaVA-NeXT-Video-34B-DPO",
],
required=False,
)
parser.add_argument(
"--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model directory."
)
parser.add_argument(
"--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
)
args = parser.parse_args()
convert_llava_to_hf(args.model_id, args.pytorch_dump_folder_path, args.push_to_hub)
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Processor class for LLaVa-NeXT-Video.
"""
from typing import TYPE_CHECKING, List, Optional, Union
from ...feature_extraction_utils import BatchFeature
from ...image_utils import ImageInput, VideoInput
from ...processing_utils import ProcessorMixin
from ...tokenization_utils_base import PaddingStrategy, PreTokenizedInput, TextInput, TruncationStrategy
from ...utils import TensorType, logging
if TYPE_CHECKING:
pass
logger = logging.get_logger(__name__)
class LlavaNextVideoProcessor(ProcessorMixin):
r"""
Constructs a LLaVa-NeXT-Video processor which wraps a LLaVa-NeXT image processor, LLaVa-NeXT-Video video processor and
a LLaMa tokenizer into a single processor.
[`LlavaNextVideoProcessor`] offers all the functionalities of [`LlavaNextImageProcessor`], [`LlavaNextVideoImageProcessor`] and
[`LlamaTokenizerFast`]. See the [`~LlavaNextVideoProcessor.__call__`] and [`~LlavaNextVideoProcessor.decode`] for more information.
Args:
video_processor ([`LlavaNextVideoImageProcessor`], *optional*):
The video processor is a required input.
image_processor ([`LlavaNextImageProcessor`], *optional*):
The image processor is a required input.
tokenizer ([`LlamaTokenizerFast`], *optional*):
The tokenizer is a required input.
chat_template (`str`, *optional*):
Jinja chat template that will be used in tokenizer's `apply_chat_template`
"""
# video and image processor share same args, but have different processing logic
# only image processor config is saved in the hub
attributes = ["video_processor", "image_processor", "tokenizer"]
image_processor_class = "LlavaNextImageProcessor"
video_processor_class = "LlavaNextVideoImageProcessor"
tokenizer_class = ("LlamaTokenizer", "LlamaTokenizerFast")
def __init__(self, video_processor=None, image_processor=None, tokenizer=None, chat_template=None):
super().__init__(video_processor, image_processor, tokenizer, chat_template=chat_template)
def __call__(
self,
text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]],
images: ImageInput = None,
videos: VideoInput = None,
padding: Union[bool, str, PaddingStrategy] = False,
truncation: Union[bool, str, TruncationStrategy] = None,
max_length: int = None,
return_tensors: Optional[Union[str, TensorType]] = TensorType.PYTORCH,
) -> BatchFeature:
"""
Main method to prepare for the model one or several sequence(s) and image(s). This method forwards the `text`
and `kwargs` arguments to LlamaTokenizerFast's [`~LlamaTokenizerFast.__call__`] if `text` is not `None` to encode
the text. To prepare the image(s), this method forwards the `images` and `kwargs` arguments to
LlavaNextImageProcessor's [`~LlavaNextImageProcessor.__call__`] if `images` is not `None`. To prepare the video(s),
this method forwards the `videos` and `kwargs` arguments to LlavaNextVideoImageProcessor's
[`~LlavaNextVideoImageProcessor.__call__`] if `videos` is not `None`. Please refer to the docstring
of the above two methods for more information.
Args:
text (`str`, `List[str]`, `List[List[str]]`):
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
`is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
tensor. Both channels-first and channels-last formats are supported.
videos (`np.ndarray`, `torch.Tensor`, `List[np.ndarray]`, `List[torch.Tensor]`):
The video or batch of videos to be prepared. Each video can be a 4D NumPy array or PyTorch
tensor, or a nested list of 3D frames. Both channels-first and channels-last formats are supported.
padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `False`):
Select a strategy to pad the returned sequences (according to the model's padding side and padding
index) among:
- `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
sequence is provided).
- `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
acceptable input length for the model if that argument is not provided.
- `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
lengths).
max_length (`int`, *optional*):
Maximum length of the returned list and optionally padding length (see above).
truncation (`bool`, *optional*):
Activates truncation to cut input sequences longer than `max_length` to `max_length`.
return_tensors (`str` or [`~utils.TensorType`], *optional*):
If set, will return tensors of a particular framework. Acceptable values are:
- `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return NumPy `np.ndarray` objects.
- `'jax'`: Return JAX `jnp.ndarray` objects.
Returns:
[`BatchFeature`]: A [`BatchFeature`] with the following fields:
- **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
- **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
`return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
`None`).
- **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
"""
if images is not None:
image_inputs = self.image_processor(images, return_tensors=return_tensors)
else:
image_inputs = {}
if videos is not None:
videos_inputs = self.video_processor(videos, return_tensors=return_tensors)
else:
videos_inputs = {}
text_inputs = self.tokenizer(
text, return_tensors=return_tensors, padding=padding, truncation=truncation, max_length=max_length
)
return BatchFeature(data={**text_inputs, **image_inputs, **videos_inputs})
# Copied from transformers.models.clip.processing_clip.CLIPProcessor.batch_decode with CLIP->Llama
def batch_decode(self, *args, **kwargs):
"""
This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
refer to the docstring of this method for more information.
"""
return self.tokenizer.batch_decode(*args, **kwargs)
# Copied from transformers.models.clip.processing_clip.CLIPProcessor.decode with CLIP->Llama
def decode(self, *args, **kwargs):
"""
This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
the docstring of this method for more information.
"""
return self.tokenizer.decode(*args, **kwargs)
@property
# Copied from transformers.models.clip.processing_clip.CLIPProcessor.model_input_names
def model_input_names(self):
tokenizer_input_names = self.tokenizer.model_input_names
image_processor_input_names = self.image_processor.model_input_names
return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
@property
def default_chat_template(self):
"""
This default vicuna template formats inputs in the form of a chat history. For each message in the chat history:
* the template will output the role of the speaker followed by the content of the message.
* content is a list of strings, images and videos.
* If the content element is an image or a video, the template will output a sequence of <image> or <video> tokens
Example:
```python
messages = [{
"role": "user",
"content": [
{"type": "text", "text": "What’s the content of this video?"},
{"type": "video"},
],
},
{
"role": "assistant",
"content": [{"type": "text", "text": "This picture shows a red stop sign."},]
}]
```
Will create outputs like:
```
USER: <video>\nWhat is the content of this video?
ASSISTANT: This picture shows a red stop sign
```
"""
# fmt: off
return (
"{% for message in messages %}"
"{% if message['role'] == 'system' %}"
"{{ message['content'][0]['text'] }}"
"{% else %}"
"{{ message['role'].upper() + ': '}}"
"{% endif %}"
"{# Render all images first #}"
"{% for content in message['content'] | selectattr('type', 'equalto', 'image') %}"
"{{ '<image>\n' }}"
"{% endfor %}"
"{# Render all videos next #}"
"{% for content in message['content'] | selectattr('type', 'equalto', 'video') %}"
"{{ '<video>\n' }}"
"{% endfor %}"
"{# Render all text finally #}"
"{% for content in message['content'] | selectattr('type', 'equalto', 'text') %}"
"{{ content['text'] + ' '}}"
"{% endfor %}"
"{% endfor %}"
"{% if add_generation_prompt %}"
"{{ 'ASSISTANT:' }}"
"{% endif %}"
)
# fmt: on
@@ -5140,6 +5140,20 @@ class LlavaNextPreTrainedModel(metaclass=DummyObject):
         requires_backends(self, ["torch"])
+class LlavaNextVideoForConditionalGeneration(metaclass=DummyObject):
+    _backends = ["torch"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch"])
+
+
+class LlavaNextVideoPreTrainedModel(metaclass=DummyObject):
+    _backends = ["torch"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch"])
+
 class LongformerForMaskedLM(metaclass=DummyObject):
     _backends = ["torch"]
...