"git@developer.sourcefind.cn:OpenDAS/ollama.git" did not exist on "914428351a398c99e463d9acb53995e5f6348c11"
Commit 1d9ad5d4 authored by chenzk's avatar chenzk
Browse files

v1.0

parents
Pipeline #2728 failed with stages
in 0 seconds
---
license: apache-2.0
tags:
- text-to-image
- safetensors
- diffusers
datasets:
- JourneyDB/JourneyDB
library_name: diffusers
pipeline_tag: text-to-image
---
# Lumina-Next-SFT
`Lumina-Next-SFT` is a 2B-parameter Next-DiT model that uses [Gemma-2B](https://huggingface.co/google/gemma-2b) as its text encoder, enhanced through high-quality supervised fine-tuning (SFT).
Our generative model uses `Next-DiT` as the backbone, the `Gemma-2B` model as the text encoder, and a version of the `sdxl` VAE fine-tuned by Stability AI.
- Generation Model: Next-DiT
- Text Encoder: [Gemma-2B](https://huggingface.co/google/gemma-2b)
- VAE: [stabilityai/sdxl-vae](https://huggingface.co/stabilityai/sdxl-vae)
[![Lumina-Next](https://img.shields.io/badge/Paper-Lumina--Next-2b9348.svg?logo=arXiv)](https://github.com/Alpha-VLLM/Lumina-T2X/blob/main/assets/lumina-next.pdf)
[Lumina-T2X paper](https://arxiv.org/abs/2405.05945)
![hero](https://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/9f52eabb-07dc-4881-8257-6d8a5f2a0a5a)
## 📰 News
- **[2024-07-08] 🎉🎉🎉 Lumina-Next is now supported in the [diffusers](https://github.com/huggingface/diffusers)! Thanks to [@yiyixuxu](https://github.com/yiyixuxu) and [@sayakpaul](https://github.com/sayakpaul)!**
- [2024-06-08] 🎉🎉🎉 We have released the `Lumina-Next-SFT` model.
- [2024-05-28] We updated the `Lumina-Next-T2I` model to support 2K Resolution image generation.
- [2024-05-16] We have converted the `.pth` weights to `.safetensors` weights. Please pull the latest code to use `demo.py` for inference.
- [2024-05-12] We released the next version of `Lumina-T2I`, called `Lumina-Next-T2I`, which offers faster image generation with lower memory usage.
## 🎮 Model Zoo
More checkpoints of our model will be released soon~
| Resolution | Next-DiT Parameter| Text Encoder | Prediction | Download URL |
| ---------- | ----------------------- | ------------ | -----------|-------------- |
| 1024 | 2B | [Gemma-2B](https://huggingface.co/google/gemma-2b) | Rectified Flow | [hugging face](https://huggingface.co/Alpha-VLLM/Lumina-Next-SFT-diffusers) |
## Installation
### 1. Create a conda environment and install PyTorch
Note: You may want to adjust the CUDA version [according to your driver version](https://docs.nvidia.com/deploy/cuda-compatibility/#default-to-minor-version).
```bash
conda create -n Lumina_T2X -y
conda activate Lumina_T2X
conda install python=3.11 pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia -y
```
### 2. Install dependencies
```bash
pip install diffusers huggingface_hub
```
### 3. Install ``flash-attn``
```bash
pip install flash-attn --no-build-isolation
```
## Inference
1. Prepare the pre-trained model
⭐⭐ (Recommended) You can use `huggingface-cli` to download our model:
```bash
huggingface-cli download --resume-download Alpha-VLLM/Lumina-Next-SFT-diffusers --local-dir /path/to/ckpt
```
2. Run with demo code:
```python
from diffusers import LuminaText2ImgPipeline
import torch

pipeline = LuminaText2ImgPipeline.from_pretrained(
    "/path/to/ckpt/Lumina-Next-SFT-diffusers", torch_dtype=torch.bfloat16
).to("cuda")
# Alternatively, download the model directly from the Hub by its repo ID:
# pipeline = LuminaText2ImgPipeline.from_pretrained("Alpha-VLLM/Lumina-Next-SFT-diffusers", torch_dtype=torch.bfloat16).to("cuda")

image = pipeline(
    prompt="Upper body of a young woman in a Victorian-era outfit with brass goggles and leather straps. "
    "Background shows an industrial revolution cityscape with smoky skies and tall, metal structures"
).images[0]
```
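The pipeline returns PIL images; a minimal follow-up to save the result to disk (the filename is just an example):
```python
image.save("lumina_next_sft.png")
```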
---
language:
- en
license: apache-2.0
---
This is the BLIP3o-8B checkpoint trained on open-source data.
---
license_name: qwen-research
license_link: https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct/blob/main/LICENSE
language:
- en
pipeline_tag: image-text-to-text
tags:
- multimodal
library_name: transformers
---
# Qwen2.5-VL-3B-Instruct
<a href="https://chat.qwenlm.ai/" target="_blank" style="margin: 2px;">
<img alt="Chat" src="https://img.shields.io/badge/%F0%9F%92%9C%EF%B8%8F%20Qwen%20Chat%20-536af5" style="display: inline-block; vertical-align: middle;"/>
</a>
## Introduction
In the past five months since Qwen2-VL’s release, numerous developers have built new models on the Qwen2-VL vision-language models, providing us with valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL.
#### Key Enhancements:
* **Understand things visually**: Qwen2.5-VL is not only proficient in recognizing common objects such as flowers, birds, fish, and insects, but it is highly capable of analyzing texts, charts, icons, graphics, and layouts within images.
* **Being agentic**: Qwen2.5-VL can act directly as a visual agent that reasons and dynamically directs tools, enabling computer use and phone use.
* **Understanding long videos and capturing events**: Qwen2.5-VL can comprehend videos of over 1 hour, and this time it gains the new ability to capture events by pinpointing the relevant video segments.
* **Capable of visual localization in different formats**: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it can provide stable JSON outputs for coordinates and attributes.
* **Generating structured outputs**: for data such as scans of invoices, forms, and tables, Qwen2.5-VL supports structured outputs of their contents, benefiting applications in finance, commerce, and beyond.
#### Model Architecture Updates:
* **Dynamic Resolution and Frame Rate Training for Video Understanding**:
We extend dynamic resolution to the temporal dimension by adopting dynamic FPS sampling, enabling the model to comprehend videos at various sampling rates. Accordingly, we update mRoPE in the time dimension with IDs and absolute time alignment, enabling the model to learn temporal sequence and speed, and ultimately acquire the ability to pinpoint specific moments.
<p align="center">
<img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-VL/qwen2.5vl_arc.jpeg" width="80%"/>
</p>
* **Streamlined and Efficient Vision Encoder**
We enhance both training and inference speeds by strategically implementing window attention into the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.
We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned 3B Qwen2.5-VL model. For more information, visit our [Blog](https://qwenlm.github.io/blog/qwen2.5-vl/) and [GitHub](https://github.com/QwenLM/Qwen2.5-VL).
## Evaluation
### Image benchmark
| Benchmark | InternVL2.5-4B |Qwen2-VL-7B |Qwen2.5-VL-3B |
| :--- | :---: | :---: | :---: |
| MMMU<sub>val</sub> | 52.3 | 54.1 | 53.1|
| MMMU-Pro<sub>val</sub> | **32.7** | 30.5 | 31.6|
| AI2D<sub>test</sub> | 81.4 | **83.0** | 81.5 |
| DocVQA<sub>test</sub> | 91.6 | 94.5 | **93.9** |
| InfoVQA<sub>test</sub> | 72.1 | 76.5 | **77.1** |
| TextVQA<sub>val</sub> | 76.8 | **84.3** | 79.3|
| MMBench-V1.1<sub>test</sub> | 79.3 | **80.7** | 77.6 |
| MMStar | 58.3 | **60.7** | 55.9 |
| MathVista<sub>testmini</sub> | 60.5 | 58.2 | **62.3** |
| MathVision<sub>full</sub> | 20.9 | 16.3 | **21.2** |
### Video benchmark
| Benchmark | InternVL2.5-4B | Qwen2-VL-7B | Qwen2.5-VL-3B |
| :--- | :---: | :---: | :---: |
| MVBench | 71.6 | 67.0 | 67.0 |
| VideoMME | 63.6/62.3 | 69.0/63.3 | 67.6/61.5 |
| MLVU | 48.3 | - | 68.2 |
| LVBench | - | - | 43.3 |
| MMBench-Video | 1.73 | 1.44 | 1.63 |
| EgoSchema | - | - | 64.8 |
| PerceptionTest | - | - | 66.9 |
| TempCompass | - | - | 64.4 |
| LongVideoBench | 55.2 | 55.6 | 54.2 |
| CharadesSTA/mIoU | - | - | 38.8 |
### Agent benchmark
| Benchmarks | Qwen2.5-VL-3B |
|-------------------------|---------------|
| ScreenSpot | 55.5 |
| ScreenSpot Pro | 23.9 |
| AITZ_EM | 76.9 |
| Android Control High_EM | 63.7 |
| Android Control Low_EM | 22.2 |
| AndroidWorld_SR | 90.8 |
| MobileMiniWob++_SR | 67.9 |
## Requirements
The code of Qwen2.5-VL is available in the latest Hugging Face `transformers`, and we advise you to build from source with the following command:
```
pip install git+https://github.com/huggingface/transformers accelerate
```
Otherwise, you might encounter the following error:
```
KeyError: 'qwen2_5_vl'
```
## Quickstart
Below, we provide simple examples to show how to use Qwen2.5-VL with 🤖 ModelScope and 🤗 Transformers.
The code of Qwen2.5-VL is available in the latest Hugging Face `transformers`, and we advise you to build from source with the following command:
```
pip install git+https://github.com/huggingface/transformers accelerate
```
Otherwise, you might encounter the following error:
```
KeyError: 'qwen2_5_vl'
```
We offer a toolkit to help you handle various types of visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved images and videos. You can install it using the following command:
```bash
# It's highly recommended to use the `[decord]` feature for faster video loading.
pip install "qwen-vl-utils[decord]==0.0.8"
```
If you are not using Linux, you might not be able to install `decord` from PyPI. In that case, you can use `pip install qwen-vl-utils`, which will fall back to torchvision for video processing. However, you can still [install decord from source](https://github.com/dmlc/decord?tab=readme-ov-file#install-from-source) so that decord is used when loading videos.
### Using 🤗 Transformers to Chat
Here is a code snippet showing how to chat with the model using `transformers` and `qwen_vl_utils`:
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
"Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
)
# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
# "Qwen/Qwen2.5-VL-3B-Instruct",
# torch_dtype=torch.bfloat16,
# attn_implementation="flash_attention_2",
# device_map="auto",
# )
# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
},
{"type": "text", "text": "Describe this image."},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
<details>
<summary>Multi image inference</summary>
```python
# Messages containing multiple images and a text query
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "file:///path/to/image1.jpg"},
{"type": "image", "image": "file:///path/to/image2.jpg"},
{"type": "text", "text": "Identify the similarities between these images."},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
</details>
<details>
<summary>Video inference</summary>
```python
# Messages containing an image list as a video and a text query
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": [
"file:///path/to/frame1.jpg",
"file:///path/to/frame2.jpg",
"file:///path/to/frame3.jpg",
"file:///path/to/frame4.jpg",
],
},
{"type": "text", "text": "Describe this video."},
],
}
]
# Messages containing a local video path and a text query
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": "file:///path/to/video1.mp4",
"max_pixels": 360 * 420,
"fps": 1.0,
},
{"type": "text", "text": "Describe this video."},
],
}
]
# Messages containing a video url and a text query
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
},
{"type": "text", "text": "Describe this video."},
],
}
]
# In Qwen 2.5 VL, frame rate information is also input into the model to align with absolute time.
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,  # carries the fps used during sampling so the model can align to absolute time
)
inputs = inputs.to("cuda")
# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
Video URL compatibility largely depends on the third-party library version. The details are in the table below. Change the backend with `FORCE_QWENVL_VIDEO_READER=torchvision` or `FORCE_QWENVL_VIDEO_READER=decord` if you prefer not to use the default one.
| Backend | HTTP | HTTPS |
|-------------|------|-------|
| torchvision >= 0.19.0 | ✅ | ✅ |
| torchvision < 0.19.0 | ❌ | ❌ |
| decord | ✅ | ❌ |
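For example, to force a specific backend you can set the environment variable when launching your script (the script name below is just a placeholder):
```bash
# Force the torchvision video reader instead of the default backend
FORCE_QWENVL_VIDEO_READER=torchvision python run_qwen_vl.py
```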
</details>
<details>
<summary>Batch inference</summary>
```python
# Sample messages for batch inference
messages1 = [
{
"role": "user",
"content": [
{"type": "image", "image": "file:///path/to/image1.jpg"},
{"type": "image", "image": "file:///path/to/image2.jpg"},
{"type": "text", "text": "What are the common elements in these pictures?"},
],
}
]
messages2 = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who are you?"},
]
# Combine messages for batch processing
messages = [messages1, messages2]
# Preparation for batch inference
texts = [
processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=texts,
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Batch Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
```
</details>
### 🤖 ModelScope
We strongly advise users, especially those in mainland China, to use ModelScope. `snapshot_download` can help you resolve issues with downloading checkpoints.
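For example, a minimal sketch of downloading the checkpoint via ModelScope (assuming the `modelscope` package is installed; the returned local path can then be passed to `from_pretrained`):
```python
from modelscope import snapshot_download

# Download the checkpoint from ModelScope and return its local directory
model_dir = snapshot_download("Qwen/Qwen2.5-VL-3B-Instruct")
print(model_dir)
```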
### More Usage Tips
For input images, we support local files, base64, and URLs. For videos, we currently only support local files.
```python
# You can directly insert a local file path, a URL, or a base64-encoded image at the desired position in the text.
## Local file path
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "file:///path/to/your/image.jpg"},
{"type": "text", "text": "Describe this image."},
],
}
]
## Image URL
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "http://path/to/your/image.jpg"},
{"type": "text", "text": "Describe this image."},
],
}
]
## Base64 encoded image
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "data:image;base64,/9j/..."},
{"type": "text", "text": "Describe this image."},
],
}
]
```
#### Image Resolution for performance boost
The model supports a wide range of resolution inputs. By default, it uses the native resolution for input, but higher resolutions can enhance performance at the cost of more computation. Users can set the minimum and maximum number of pixels to achieve an optimal configuration for their needs, such as a token count range of 256-1280, to balance speed and memory usage.
```python
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
"Qwen/Qwen2.5-VL-3B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)
```
In addition, we provide two methods for fine-grained control over the image size input to the model:
1. Define `min_pixels` and `max_pixels`: images will be resized to maintain their aspect ratio within the range of `min_pixels` and `max_pixels`.
2. Specify exact dimensions: directly set `resized_height` and `resized_width`. These values will be rounded to the nearest multiple of 28.
```python
# resized_height and resized_width
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "file:///path/to/your/image.jpg",
"resized_height": 280,
"resized_width": 420,
},
{"type": "text", "text": "Describe this image."},
],
}
]
# min_pixels and max_pixels
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "file:///path/to/your/image.jpg",
"min_pixels": 50176,
"max_pixels": 50176,
},
{"type": "text", "text": "Describe this image."},
],
}
]
```
### Processing Long Texts
The current `config.json` is set for context length up to 32,768 tokens.
To handle extensive inputs exceeding 32,768 tokens, we utilize [YaRN](https://arxiv.org/abs/2309.00071), a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts.
For supported frameworks, you could add the following to `config.json` to enable YaRN:
```
{
...,
"type": "yarn",
"mrope_section": [
16,
24,
24
],
"factor": 4,
"original_max_position_embeddings": 32768
}
```
However, it should be noted that this method has a significant impact on the performance of temporal and spatial localization tasks, and is therefore not recommended for use.
At the same time, for long video inputs, since mRoPE itself is economical with position IDs, `max_position_embeddings` can be directly set to a larger value, such as 64k.
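For instance, a minimal sketch of such a `config.json` change for long-video inputs (the value is illustrative; choose it according to your memory budget):
```
{
    ...,
    "max_position_embeddings": 65536
}
```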
## Citation
If you find our work helpful, feel free to cite us.
```
@misc{qwen2.5-VL,
title = {Qwen2.5-VL},
url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
author = {Qwen Team},
month = {January},
year = {2025}
}
@article{Qwen2VL,
title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
journal={arXiv preprint arXiv:2409.12191},
year={2024}
}
@article{Qwen-VL,
title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
journal={arXiv preprint arXiv:2308.12966},
year={2023}
}
```
---
license: apache-2.0
language:
- en
pipeline_tag: image-text-to-text
tags:
- multimodal
library_name: transformers
---
# Qwen2.5-VL-7B-Instruct
<a href="https://chat.qwenlm.ai/" target="_blank" style="margin: 2px;">
<img alt="Chat" src="https://img.shields.io/badge/%F0%9F%92%9C%EF%B8%8F%20Qwen%20Chat%20-536af5" style="display: inline-block; vertical-align: middle;"/>
</a>
## Introduction
In the past five months since Qwen2-VL’s release, numerous developers have built new models on the Qwen2-VL vision-language models, providing us with valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL.
#### Key Enhancements:
* **Understand things visually**: Qwen2.5-VL is not only proficient in recognizing common objects such as flowers, birds, fish, and insects, but it is highly capable of analyzing texts, charts, icons, graphics, and layouts within images.
* **Being agentic**: Qwen2.5-VL can act directly as a visual agent that reasons and dynamically directs tools, enabling computer use and phone use.
* **Understanding long videos and capturing events**: Qwen2.5-VL can comprehend videos of over 1 hour, and this time it gains the new ability to capture events by pinpointing the relevant video segments.
* **Capable of visual localization in different formats**: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it can provide stable JSON outputs for coordinates and attributes.
* **Generating structured outputs**: for data such as scans of invoices, forms, and tables, Qwen2.5-VL supports structured outputs of their contents, benefiting applications in finance, commerce, and beyond.
#### Model Architecture Updates:
* **Dynamic Resolution and Frame Rate Training for Video Understanding**:
We extend dynamic resolution to the temporal dimension by adopting dynamic FPS sampling, enabling the model to comprehend videos at various sampling rates. Accordingly, we update mRoPE in the time dimension with IDs and absolute time alignment, enabling the model to learn temporal sequence and speed, and ultimately acquire the ability to pinpoint specific moments.
<p align="center">
<img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-VL/qwen2.5vl_arc.jpeg" width="80%"/>
</p>
* **Streamlined and Efficient Vision Encoder**
We enhance both training and inference speeds by strategically implementing window attention into the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.
We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned 7B Qwen2.5-VL model. For more information, visit our [Blog](https://qwenlm.github.io/blog/qwen2.5-vl/) and [GitHub](https://github.com/QwenLM/Qwen2.5-VL).
## Evaluation
### Image benchmark
| Benchmark | InternVL2.5-8B | MiniCPM-o 2.6 | GPT-4o-mini | Qwen2-VL-7B |**Qwen2.5-VL-7B** |
| :--- | :---: | :---: | :---: | :---: | :---: |
| MMMU<sub>val</sub> | 56 | 50.4 | **60**| 54.1 | 58.6|
| MMMU-Pro<sub>val</sub> | 34.3 | - | 37.6| 30.5 | 41.0|
| DocVQA<sub>test</sub> | 93 | 93 | - | 94.5 | **95.7** |
| InfoVQA<sub>test</sub> | 77.6 | - | - |76.5 | **82.6** |
| ChartQA<sub>test</sub> | 84.8 | - |- | 83.0 |**87.3** |
| TextVQA<sub>val</sub> | 79.1 | 80.1 | -| 84.3 | **84.9**|
| OCRBench | 822 | 852 | 785 | 845 | **864** |
| CC_OCR | 57.7 | - | - | 61.6 | **77.8** |
| MMStar | 62.8 | - | - | 60.7 | **63.9** |
| MMBench-V1.1-En<sub>test</sub> | 79.4 | 78.0 | 76.0| 80.7 | **82.6** |
| MMT-Bench<sub>test</sub> | - | - | - |**63.7** |63.6 |
| MMStar | **61.5** | 57.5 | 54.8 | 60.7 |63.9 |
| MMVet<sub>GPT-4-Turbo</sub> | 54.2 | 60.0 | 66.9 | 62.0 | **67.1**|
| HallBench<sub>avg</sub> | 45.2 | 48.1 | 46.1| 50.6 | **52.9**|
| MathVista<sub>testmini</sub> | 58.3 | 60.6 | 52.4 | 58.2 | **68.2**|
| MathVision | - | - | - | 16.3 | **25.07** |
### Video Benchmarks
| Benchmark | Qwen2-VL-7B | **Qwen2.5-VL-7B** |
| :--- | :---: | :---: |
| MVBench | 67.0 | **69.6** |
| PerceptionTest<sub>test</sub> | 66.9 | **70.5** |
| Video-MME<sub>wo/w subs</sub> | 63.3/69.0 | **65.1**/**71.6** |
| LVBench | - | 45.3 |
| LongVideoBench | - | 54.7 |
| MMBench-Video | 1.44 | 1.79 |
| TempCompass | - | 71.7 |
| MLVU | - | 70.2 |
| CharadesSTA/mIoU | - | 43.6 |
### Agent benchmark
| Benchmarks | Qwen2.5-VL-7B |
|-------------------------|---------------|
| ScreenSpot | 84.7 |
| ScreenSpot Pro | 29.0 |
| AITZ_EM | 81.9 |
| Android Control High_EM | 60.1 |
| Android Control Low_EM | 93.7 |
| AndroidWorld_SR | 25.5 |
| MobileMiniWob++_SR | 91.4 |
## Requirements
The code of Qwen2.5-VL is available in the latest Hugging Face `transformers`, and we advise you to build from source with the following command:
```
pip install git+https://github.com/huggingface/transformers accelerate
```
Otherwise, you might encounter the following error:
```
KeyError: 'qwen2_5_vl'
```
## Quickstart
Below, we provide simple examples to show how to use Qwen2.5-VL with 🤖 ModelScope and 🤗 Transformers.
The code of Qwen2.5-VL is available in the latest Hugging Face `transformers`, and we advise you to build from source with the following command:
```
pip install git+https://github.com/huggingface/transformers accelerate
```
Otherwise, you might encounter the following error:
```
KeyError: 'qwen2_5_vl'
```
We offer a toolkit to help you handle various types of visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved images and videos. You can install it using the following command:
```bash
# It's highly recommended to use the `[decord]` feature for faster video loading.
pip install "qwen-vl-utils[decord]==0.0.8"
```
If you are not using Linux, you might not be able to install `decord` from PyPI. In that case, you can use `pip install qwen-vl-utils`, which will fall back to torchvision for video processing. However, you can still [install decord from source](https://github.com/dmlc/decord?tab=readme-ov-file#install-from-source) so that decord is used when loading videos.
### Using 🤗 Transformers to Chat
Here is a code snippet showing how to chat with the model using `transformers` and `qwen_vl_utils`:
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
"Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
# "Qwen/Qwen2.5-VL-7B-Instruct",
# torch_dtype=torch.bfloat16,
# attn_implementation="flash_attention_2",
# device_map="auto",
# )
# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
},
{"type": "text", "text": "Describe this image."},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
<details>
<summary>Multi image inference</summary>
```python
# Messages containing multiple images and a text query
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "file:///path/to/image1.jpg"},
{"type": "image", "image": "file:///path/to/image2.jpg"},
{"type": "text", "text": "Identify the similarities between these images."},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
</details>
<details>
<summary>Video inference</summary>
```python
# Messages containing an image list as a video and a text query
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": [
"file:///path/to/frame1.jpg",
"file:///path/to/frame2.jpg",
"file:///path/to/frame3.jpg",
"file:///path/to/frame4.jpg",
],
},
{"type": "text", "text": "Describe this video."},
],
}
]
# Messages containing a local video path and a text query
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": "file:///path/to/video1.mp4",
"max_pixels": 360 * 420,
"fps": 1.0,
},
{"type": "text", "text": "Describe this video."},
],
}
]
# Messages containing a video url and a text query
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
},
{"type": "text", "text": "Describe this video."},
],
}
]
# In Qwen 2.5 VL, frame rate information is also input into the model to align with absolute time.
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,  # carries the fps used during sampling so the model can align to absolute time
)
inputs = inputs.to("cuda")
# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
Video URL compatibility largely depends on the third-party library version. The details are in the table below. Change the backend with `FORCE_QWENVL_VIDEO_READER=torchvision` or `FORCE_QWENVL_VIDEO_READER=decord` if you prefer not to use the default one.
| Backend | HTTP | HTTPS |
|-------------|------|-------|
| torchvision >= 0.19.0 | ✅ | ✅ |
| torchvision < 0.19.0 | ❌ | ❌ |
| decord | ✅ | ❌ |
</details>
<details>
<summary>Batch inference</summary>
```python
# Sample messages for batch inference
messages1 = [
{
"role": "user",
"content": [
{"type": "image", "image": "file:///path/to/image1.jpg"},
{"type": "image", "image": "file:///path/to/image2.jpg"},
{"type": "text", "text": "What are the common elements in these pictures?"},
],
}
]
messages2 = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who are you?"},
]
# Combine messages for batch processing
messages = [messages1, messages2]
# Preparation for batch inference
texts = [
processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=texts,
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Batch Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
```
</details>
### 🤖 ModelScope
We strongly advise users, especially those in mainland China, to use ModelScope. `snapshot_download` can help you resolve issues with downloading checkpoints.
### More Usage Tips
For input images, we support local files, base64, and URLs. For videos, we currently only support local files.
```python
# You can directly insert a local file path, a URL, or a base64-encoded image at the desired position in the text.
## Local file path
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "file:///path/to/your/image.jpg"},
{"type": "text", "text": "Describe this image."},
],
}
]
## Image URL
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "http://path/to/your/image.jpg"},
{"type": "text", "text": "Describe this image."},
],
}
]
## Base64 encoded image
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "data:image;base64,/9j/..."},
{"type": "text", "text": "Describe this image."},
],
}
]
```
#### Image Resolution for performance boost
The model supports a wide range of resolution inputs. By default, it uses the native resolution for input, but higher resolutions can enhance performance at the cost of more computation. Users can set the minimum and maximum number of pixels to achieve an optimal configuration for their needs, such as a token count range of 256-1280, to balance speed and memory usage.
```python
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
"Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)
```
In addition, we provide two methods for fine-grained control over the image size input to the model:
1. Define `min_pixels` and `max_pixels`: images will be resized to maintain their aspect ratio within the range of `min_pixels` and `max_pixels`.
2. Specify exact dimensions: directly set `resized_height` and `resized_width`. These values will be rounded to the nearest multiple of 28.
```python
# resized_height and resized_width
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "file:///path/to/your/image.jpg",
"resized_height": 280,
"resized_width": 420,
},
{"type": "text", "text": "Describe this image."},
],
}
]
# min_pixels and max_pixels
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "file:///path/to/your/image.jpg",
"min_pixels": 50176,
"max_pixels": 50176,
},
{"type": "text", "text": "Describe this image."},
],
}
]
```
### Processing Long Texts
The current `config.json` is set for context length up to 32,768 tokens.
To handle extensive inputs exceeding 32,768 tokens, we utilize [YaRN](https://arxiv.org/abs/2309.00071), a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts.
For supported frameworks, you could add the following to `config.json` to enable YaRN:
```
{
    ...,
    "type": "yarn",
    "mrope_section": [
        16,
        24,
        24
    ],
    "factor": 4,
    "original_max_position_embeddings": 32768
}
```
However, it should be noted that this method has a significant impact on the performance of temporal and spatial localization tasks, and is therefore not recommended for use.
At the same time, for long video inputs, since mRoPE itself is economical with position IDs, `max_position_embeddings` can be directly set to a larger value, such as 64k.
## Citation
If you find our work helpful, feel free to cite us.
```
@misc{qwen2.5-VL,
title = {Qwen2.5-VL},
url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
author = {Qwen Team},
month = {January},
year = {2025}
}
@article{Qwen2VL,
title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
journal={arXiv preprint arXiv:2409.12191},
year={2024}
}
@article{Qwen-VL,
title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
journal={arXiv preprint arXiv:2308.12966},
year={2023}
}
```
# BLIP3-o
BLIP3-o can be used for multimodal data pre-annotation. Enhanced with the 60k instruction-tuning dataset BLIP3o-60k, it is a fully open-source unified multimodal model that supports a variety of tasks, including text-to-image generation, image captioning, and visual question answering.
## Paper
`BLIP3-o: A Family of Fully Open Unified Multimodal Models—Architecture, Training and Dataset`
- https://arxiv.org/pdf/2505.09568
## Model Architecture
The image generation module is built directly on top of Qwen 2.5 VL. For the 8B model, we freeze the Qwen2.5-VL-7B-Instruct backbone and train a diffusion transformer, giving a total of 1.4B trainable parameters; CLIP + flow matching and sequential training are used to develop the advanced unified multimodal model BLIP3-o.
<div align=center>
<img src="./doc/blip3-o.png"/>
</div>
## Algorithm
CLIP embeddings are combined with a flow-matching loss. CLIP features yield more compact and semantically richer representations than VAE features, which improves training efficiency, while flow matching has proven to be a more effective training objective for modeling the image distribution, producing greater sample diversity and higher visual quality.
<div align=center>
<img src="./doc/algorithm.png"/>
</div>
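To make the training objective concrete, here is a minimal, self-contained flow-matching sketch on CLIP features (illustrative only; `dit`, the tensor shapes, and the conditioning interface are assumptions, not the BLIP3-o source):
```python
import torch
import torch.nn.functional as F

def flow_matching_loss(dit, clip_feats, cond):
    """dit: diffusion transformer predicting a velocity field;
    clip_feats: target CLIP image embeddings of shape [B, N, D];
    cond: conditioning features from the frozen autoregressive backbone."""
    b = clip_feats.size(0)
    t = torch.rand(b, 1, 1, device=clip_feats.device)     # random time in (0, 1)
    noise = torch.randn_like(clip_feats)
    x_t = (1 - t) * noise + t * clip_feats                 # linear (rectified-flow) interpolation path
    target_v = clip_feats - noise                          # constant velocity along that path
    pred_v = dit(x_t, t.squeeze(-1).squeeze(-1), cond)     # model predicts the velocity
    return F.mse_loss(pred_v, target_v)
```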
## Environment Setup
```
mv BLIP3o_pytorch BLIP3o  # drop the framework-name suffix
```
### Docker (Option 1)
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.4.1-ubuntu22.04-dtk25.04-py3.10-fixpy
# Replace <your IMAGE ID> with the ID of the docker image pulled above; for this image it is 6063b673703a
docker run -it --shm-size=64G -v $PWD/BLIP3o:/home/BLIP3o -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name blip3o <your IMAGE ID> bash
cd /home/BLIP3o
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple
pip install whl/bitsandbytes-0.42.0+das.opt1.dtk2504-py3-none-any.whl # bitsandbytes==0.42
pip install whl/torchaudio-2.1.2+das.opt2.dtk24043-cp310-cp310-manylinux_2_28_x86_64.whl # torchaudio==2.1.2
cd diffusers
pip install -e . -i https://mirrors.aliyun.com/pypi/simple # diffusers==0.32.2
cd /home/BLIP3o
pip install -e . -i https://mirrors.aliyun.com/pypi/simple # blip3o==0.1.0
```
### Dockerfile (Option 2)
```
cd /home/BLIP3o/docker
docker build --no-cache -t blip3o:latest .
docker run --shm-size=64G --name blip3o -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v $PWD/../../BLIP3o:/home/BLIP3o -it blip3o bash
# If installing the environment through the Dockerfile takes a long time, comment out the pip installs inside it and install the Python libraries after starting the container: pip install -r requirements.txt
cd /home/BLIP3o
pip install whl/bitsandbytes-0.42.0+das.opt1.dtk2504-py3-none-any.whl # bitsandbytes==0.42
pip install whl/torchaudio-2.1.2+das.opt2.dtk24043-cp310-cp310-manylinux_2_28_x86_64.whl # torchaudio==2.1.2
cd diffusers
pip install -e . -i https://mirrors.aliyun.com/pypi/simple # diffusers==0.32.2
cd /home/BLIP3o
pip install -e . -i https://mirrors.aliyun.com/pypi/simple # blip3o==0.1.0
```
### Anaconda (Option 3)
1. The special deep-learning libraries required by this project for DCU GPUs can be downloaded and installed from the SourceFind developer community (光合开发者社区):
- https://developer.sourcefind.cn/tool/
```
DTK driver: dtk2504
python:python3.10
torch:2.4.1
torchvision:0.19.1
torchaudio:2.1.2
triton:3.0.0
vllm:0.6.2
flash-attn:2.6.1
deepspeed:0.14.2
apex:1.4.0
bitsandbytes:0.42
transformers:4.51.3
```
`Tips: the versions of the DTK driver, python, torch, and the other DCU-related tools listed above must match each other exactly.`
2. Install the remaining, non-special libraries according to requirements.txt
```
cd /home/BLIP3o
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple
pip install whl/bitsandbytes-0.42.0+das.opt1.dtk2504-py3-none-any.whl # bitsandbytes==0.42
pip install whl/torchaudio-2.1.2+das.opt2.dtk24043-cp310-cp310-manylinux_2_28_x86_64.whl # torchaudio==2.1.2
cd diffusers
pip install -e . -i https://mirrors.aliyun.com/pypi/simple # diffusers==0.32.2
cd /home/BLIP3o
pip install -e . -i https://mirrors.aliyun.com/pypi/simple # blip3o==0.1.0
```
## Dataset
`None`
## Training
## Inference
Directory structure of the pre-trained weights:
```
/home/BLIP3o/
|── BLIP3o-Model-8B
|── Alpha-VLLM/Lumina-Next-SFT-diffusers
|── black-forest-labs/FLUX.1-dev
|── jiuhai/eva_clip_vision_tower
|── Qwen/Qwen2.5-VL-7B-Instruct
└── Qwen/Qwen2.5-VL-3B-Instruct
```
### Single Node, Single GPU
```
cd /home/BLIP3o
python inference.py BLIP3o-Model-8B # the authors' source code is limited to single-GPU inference
```
For more details, see [`README_origin`](./README_origin.md) in the upstream project.
## Result
`Input:`
```
prompt = "A photo of cute cat"
```
`Output:`
```
'A photo of cute cat.png'
```
<div align=center>
<img src="./doc/A photo of cute cat.png"/>
</div>
Official examples of generated results:
<div align=center>
<img src="./doc/result.png"/>
</div>
### Accuracy
The accuracy on DCU is consistent with GPU; inference framework: pytorch.
## Application Scenarios
### Algorithm Category
`Multimodal`
### Key Application Industries
`Manufacturing, media, finance, energy, healthcare, home, education`
## Pre-trained Weights
Hugging Face download links:
- [BLIP3o-Model-8B](https://huggingface.co/BLIP3o/BLIP3o-Model-8B)
- [Alpha-VLLM/Lumina-Next-SFT-diffusers](https://huggingface.co/Alpha-VLLM/Lumina-Next-SFT-diffusers)
- [black-forest-labs/FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev)
- [jiuhai/eva_clip_vision_tower](https://huggingface.co/jiuhai/eva_clip_vision_tower)
- [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
- [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)
## Source Repository and Issue Feedback
- http://developer.sourcefind.cn/codes/modelzoo/BLIP3o_pytorch.git
## References
- https://github.com/JiuhaiChen/BLIP3o.git
# 🌌 BLIP3-o
BLIP3-o is a unified multimodal model that combines the reasoning and instruction following strength of autoregressive models with the generative power of diffusion models. Unlike prior works that diffuse VAE features or raw pixels, BLIP3-o diffuses semantically rich **CLIP image features**, enabling a powerful and efficient architecture for both image understanding and generation.
## 📖 [Arxiv](http://arxiv.org/abs/2505.09568)
## Update
- [2025/05/20] 🔥 We have created discussion groups at the end of this page, feel free to join us!
- [2025/05/19] 🔥 We understand this is a large codebase, so we have shared a high-level overview of its [Code Structure](https://github.com/JiuhaiChen/BLIP3o/issues/11#issuecomment-2891930000); feel free to open an issue if you encounter any problems.
- [2025/05/16] 🔥 We’ve published a dataset of 20 million images with detailed captions [BLIP3o Pretrain Long Caption](https://huggingface.co/datasets/BLIP3o/BLIP3o-Pretrain-Long-Caption) and 4 million images with short captions [BLIP3o Pretrain Short Caption](https://huggingface.co/datasets/BLIP3o/BLIP3o-Pretrain-Short-Caption). All images and their captions are compressed into tar archives, **no separate image URL downloads or manual unzipping required**.
- [2025/05/16] 🔥 We’ve reorganized and cleaned up the repository to ensure a clear, well-structured codebase. Please give the training and inference scripts a try, and feel free to leave an issue if you run into any problems. We apologize for any confusion caused by our original codebase release.
## ✨ Highlights
- **Fully Open-Source:** Fully open-source training data (Pretraining and Instruction Tuning), training recipe, model weights, code.
- **Unified Architecture:** for both image understanding and generation.
- **CLIP Feature Diffusion:** Directly diffuses semantic vision features for stronger alignment and performance.
- **State-of-the-art performance:** across a wide range of image understanding and generation benchmarks.
<!-- <p align="center">
<img src="figure/arch.png" alt="BLIP3-U Overview Figure" width="700"/>
</p>
*Figure: Overview of the BLIP3-U architecture. We use Flow Matching Loss to predict the ground truth CLIP embeddings. At inference, the autoregressive model first generates a sequence of visual tokens from the given conditioning, and those visual tokens are then passed to a diffusion transformer that decodes them into the final image.* -->
---
## Demo
You can try out BLIP3-o in your browser using our interactive [Demo](https://blip3o.salesforceresearch.ai/).
## Installation
Install the packages needed for training:
```Shell
conda create -n blip3o python=3.11 -y
conda activate blip3o
pip install --upgrade pip setuptools
pip install -r requirements.txt
```
## Model Checkpoint
- BLIP3o-4B: [4B](https://huggingface.co/BLIP3o/BLIP3o-Model-4B)
- BLIP3o-8B: [8B](https://huggingface.co/BLIP3o/BLIP3o-Model)
## Inference
You can download our checkpoint
```Shell
python -c "from huggingface_hub import snapshot_download; print(snapshot_download(repo_id='BLIP3o/BLIP3o-Model', repo_type='model'))"
```
and run the inference code
```Shell
python inference.py /HF_model/checkpoint/path/
```
## Training
We include two scripts: **slurm.sh** for multi-node training on Slurm clusters, and **run.sh** for debugging.
For both **slurm.sh** and **run.sh**, you need to set the Hugging Face home **HF_HOME**, the training data folder **IMG_FOLDER**, and the output model save folder **OUTPUT_FOLDER**, as shown in the sketch below.
For our open-source model training, we combine the pretraining datasets, including both long and short captions, with images from JourneyDB. You can download [JourneyDB](https://huggingface.co/datasets/JourneyDB/JourneyDB). When training the diffusion transformer from scratch, we recommend using a large number of training steps along with a cosine annealing learning rate schedule that decays from 1×10⁻⁴ down to 1×10⁻⁵.
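A minimal sketch of setting these variables before launching **run.sh** (all paths below are placeholders, not values from this repo):
```Shell
export HF_HOME=/path/to/hf_home            # Hugging Face cache / model home
export IMG_FOLDER=/path/to/pretrain_data   # folder holding the pretraining tar shards
export OUTPUT_FOLDER=/path/to/checkpoints  # where trained models are written
bash run.sh
```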
## CLIP + Diffusion (Encoder + Decoder)
We also provide two CLIP + Diffusion (encoder + decoder) combinations:
[EVA-CLIP + SDXL]: The model checkpoint already includes the diffusion decoder [diffusion-decoder](https://huggingface.co/BLIP3o/BLIP3o-Model/tree/main/diffusion-decoder). The EVA-CLIP vision tower weights can be downloaded here: [EVA-CLIP](https://huggingface.co/jiuhai/eva_clip_vision_tower), and the preprocessing for EVA-CLIP is in the training code: [EVA-CLIP-preprocess](https://github.com/JiuhaiChen/BLIP3o/tree/main/blip3o/model/multimodal_encoder/eva_clip).
[SigLIP + SANA]: [coming soon]
## Supported Tasks
- **Text → Text**
- **Image → Text** (Image Understanding)
- **Text → Image** (Image Generation)
- **Image → Image** (Image Editing)
- **Multitask Training** (mixed training for image generation and understanding)
## Supported Image Generation Methods
- **CLIP + MSE**
- **CLIP + Flow Matching**
- **VAE + Flow Matching**
- **Transfusion, LMFusion**
## Supported Autoregressive Backbones
- **Qwen-2.5-VL**
- **LLaMA 3**
We suggest using Qwen-2.5-VL as the backbone; we are still fixing some tokenizer issues for LLaMA 3.
## Supported Dataset Format
- **Webdataset**
- **Json**
## Data Loading
Most of our training data is loaded as **WebDataset** via the Hugging Face `datasets` library. To download the datasets:
[Pretrain](https://huggingface.co/datasets/BLIP3o/BLIP3o-Pretrain-Long-Caption)
You can download the datasets with:
```Shell
python -c "from huggingface_hub import snapshot_download; print(snapshot_download(repo_id='BLIP3o/BLIP3o-Pretrain', repo_type='dataset'))"
```
Then load them directly with the Hugging Face WebDataset loader:
```python
from datasets import load_dataset

# data_files: a list or glob pattern of the downloaded WebDataset .tar shards
train_dataset = load_dataset("webdataset", data_files=data_files, split="train", num_proc=128)
```
[BLIP3o-60k](https://huggingface.co/datasets/BLIP3o/BLIP3o-60k)
![BLIP3-o Overview Figure](figure/image.png)
*Figure: Qualitative results of BLIP3-o.*
### Join Discussion
Welcome to discuss with us if you have any questions.
Discord: https://discord.gg/SsVYdV84bw
or Wechat
<p align="center">
<img src="figure/wechat_1.jpg" width="256">
</p>
### Citation
To cite the paper and model
```
@article{chen2025blip3,
title={BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset},
author={Chen, Jiuhai and Xu, Zhiyang and Pan, Xichen and Hu, Yushi and Qin, Can and Goldstein, Tom and Huang, Lifu and Zhou, Tianyi and Xie, Saining and Savarese, Silvio and others},
journal={arXiv preprint arXiv:2505.09568},
year={2025}
}
```
---
language:
- en
license: other
license_name: flux-1-dev-non-commercial-license
license_link: LICENSE.md
extra_gated_prompt: By clicking "Agree", you agree to the [FluxDev Non-Commercial License Agreement](https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/LICENSE.md)
and acknowledge the [Acceptable Use Policy](https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/POLICY.md).
tags:
- text-to-image
- image-generation
- flux
---
![FLUX.1 [dev] Grid](./dev_grid.jpg)
`FLUX.1 [dev]` is a 12 billion parameter rectified flow transformer capable of generating images from text descriptions.
For more information, please read our [blog post](https://blackforestlabs.ai/announcing-black-forest-labs/).
# Key Features
1. Cutting-edge output quality, second only to our state-of-the-art model `FLUX.1 [pro]`.
2. Competitive prompt following, matching the performance of closed-source alternatives.
3. Trained using guidance distillation, making `FLUX.1 [dev]` more efficient.
4. Open weights to drive new scientific research, and empower artists to develop innovative workflows.
5. Generated outputs can be used for personal, scientific, and commercial purposes as described in the [`FLUX.1 [dev]` Non-Commercial License](https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/LICENSE.md).
# Usage
We provide a reference implementation of `FLUX.1 [dev]`, as well as sampling code, in a dedicated [github repository](https://github.com/black-forest-labs/flux).
Developers and creatives looking to build on top of `FLUX.1 [dev]` are encouraged to use this as a starting point.
## API Endpoints
The FLUX.1 models are also available via API from the following sources
- [bfl.ml](https://docs.bfl.ml/) (currently `FLUX.1 [pro]`)
- [replicate.com](https://replicate.com/collections/flux)
- [fal.ai](https://fal.ai/models/fal-ai/flux/dev)
- [mystic.ai](https://www.mystic.ai/black-forest-labs/flux1-dev)
## ComfyUI
`FLUX.1 [dev]` is also available in [Comfy UI](https://github.com/comfyanonymous/ComfyUI) for local inference with a node-based workflow.
## Diffusers
To use `FLUX.1 [dev]` with the 🧨 diffusers python library, first install or upgrade diffusers
```shell
pip install -U diffusers
```
Then you can use `FluxPipeline` to run the model
```python
import torch
from diffusers import FluxPipeline
pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # save some VRAM by offloading the model to CPU. Remove this if you have enough GPU memory
prompt = "A cat holding a sign that says hello world"
image = pipe(
prompt,
height=1024,
width=1024,
guidance_scale=3.5,
num_inference_steps=50,
max_sequence_length=512,
generator=torch.Generator("cpu").manual_seed(0)
).images[0]
image.save("flux-dev.png")
```
To learn more, check out the [diffusers](https://huggingface.co/docs/diffusers/main/en/api/pipelines/flux) documentation.
---
# Limitations
- This model is not intended or able to provide factual information.
- As a statistical model this checkpoint might amplify existing societal biases.
- The model may fail to generate output that matches the prompts.
- Prompt following is heavily influenced by the prompting-style.
# Out-of-Scope Use
The model and its derivatives may not be used:
- In any way that violates any applicable national, federal, state, local or international law or regulation.
- For the purpose of exploiting, harming or attempting to exploit or harm minors in any way; including but not limited to the solicitation, creation, acquisition, or dissemination of child exploitative content.
- To generate or disseminate verifiably false information and/or content with the purpose of harming others.
- To generate or disseminate personal identifiable information that can be used to harm an individual.
- To harass, abuse, threaten, stalk, or bully individuals or groups of individuals.
- To create non-consensual nudity or illegal pornographic content.
- For fully automated decision making that adversely impacts an individual's legal rights or otherwise creates or modifies a binding, enforceable obligation.
- Generating or facilitating large-scale disinformation campaigns.
# License
This model falls under the [`FLUX.1 [dev]` Non-Commercial License](https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/LICENSE.md).