---
license: mit
license_link: https://huggingface.co/microsoft/Florence-2-large-ft/resolve/main/LICENSE
pipeline_tag: image-text-to-text
tags:
- vision
---
# Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
## Model Summary
This Hub repository contains a Hugging Face `transformers` implementation of the Florence-2 model from Microsoft.
Florence-2 is an advanced vision foundation model that uses a prompt-based approach to handle a wide range of vision and vision-language tasks. Florence-2 can interpret simple text prompts to perform tasks like captioning, object detection, and segmentation. It leverages our FLD-5B dataset, containing 5.4 billion annotations across 126 million images, to master multi-task learning. The model's sequence-to-sequence architecture enables it to excel in both zero-shot and fine-tuned settings, proving to be a competitive vision foundation model.
Resources and Technical Documentation:
+ [Florence-2 technical report](https://arxiv.org/abs/2311.06242).
+ [Jupyter Notebook for inference and visualization of Florence-2-large model](https://huggingface.co/microsoft/Florence-2-large/blob/main/sample_inference.ipynb)
| Model | Model size | Model Description |
| ------- | ------------- | ------------- |
| Florence-2-base[[HF]](https://huggingface.co/microsoft/Florence-2-base) | 0.23B | Pretrained model with FLD-5B
| Florence-2-large[[HF]](https://huggingface.co/microsoft/Florence-2-large) | 0.77B | Pretrained model with FLD-5B
| Florence-2-base-ft[[HF]](https://huggingface.co/microsoft/Florence-2-base-ft) | 0.23B | Finetuned model on a collection of downstream tasks
| Florence-2-large-ft[[HF]](https://huggingface.co/microsoft/Florence-2-large-ft) | 0.77B | Finetuned model on a collection of downstream tasks
## How to Get Started with the Model
Use the code below to get started with the model. All models are trained with float16.
```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-large-ft", torch_dtype=torch_dtype, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large-ft", trust_remote_code=True)
prompt = "<OD>"
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, torch_dtype)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    do_sample=False,
    num_beams=3,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
parsed_answer = processor.post_process_generation(generated_text, task="<OD>", image_size=(image.width, image.height))
print(parsed_answer)
```
## Tasks
This model can perform different tasks by changing the prompt.
First, let's define a helper function to run a prompt.
<details>
<summary> Click to expand </summary>
```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-large-ft", torch_dtype=torch_dtype, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large-ft", trust_remote_code=True)
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
def run_example(task_prompt, text_input=None):
    if text_input is None:
        prompt = task_prompt
    else:
        prompt = task_prompt + text_input
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, torch_dtype)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(generated_text, task=task_prompt, image_size=(image.width, image.height))
    print(parsed_answer)
    return parsed_answer
```
</details>
Here are the tasks `Florence-2` can perform:
<details>
<summary> Click to expand </summary>
### Caption
```python
prompt = "<CAPTION>"
run_example(prompt)
```
### Detailed Caption
```python
prompt = "<DETAILED_CAPTION>"
run_example(prompt)
```
### More Detailed Caption
```python
prompt = "<MORE_DETAILED_CAPTION>"
run_example(prompt)
```
### Caption to Phrase Grounding
The caption-to-phrase-grounding task requires an additional text input, i.e. the caption.
Caption to phrase grounding results format:
{'\<CAPTION_TO_PHRASE_GROUNDING>': {'bboxes': [[x1, y1, x2, y2], ...], 'labels': ['', '', ...]}}
```python
task_prompt = "<CAPTION_TO_PHRASE_GROUNDING>"
results = run_example(task_prompt, text_input="A green car parked in front of a yellow building.")
```
### Object Detection
OD results format:
{'\<OD>': {'bboxes': [[x1, y1, x2, y2], ...],
'labels': ['label1', 'label2', ...]} }
```python
prompt = "<OD>"
run_example(prompt)
```
### Dense Region Caption
Dense region caption results format:
{'\<DENSE_REGION_CAPTION>' : {'bboxes': [[x1, y1, x2, y2], ...],
'labels': ['label1', 'label2', ...]} }
```python
prompt = "<DENSE_REGION_CAPTION>"
run_example(prompt)
```
### Region proposal
Region proposal results format:
{'\<REGION_PROPOSAL>': {'bboxes': [[x1, y1, x2, y2], ...],
'labels': ['', '', ...]}}
```python
prompt = "<REGION_PROPOSAL>"
run_example(prompt)
```
### OCR
```python
prompt = "<OCR>"
run_example(prompt)
```
### OCR with Region
OCR with region output format:
{'\<OCR_WITH_REGION>': {'quad_boxes': [[x1, y1, x2, y2, x3, y3, x4, y4], ...], 'labels': ['text1', ...]}}
```python
prompt = "<OCR_WITH_REGION>"
run_example(prompt)
```
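The quad boxes above are plain pixel coordinates, so they can be drawn directly with PIL. The snippet below is an illustrative sketch rather than part of the official model card; it assumes `run_example` returns the parsed dictionary (as in the helper defined earlier), and `draw_ocr_regions` is a hypothetical helper name.
```python
from PIL import ImageDraw

def draw_ocr_regions(image, ocr_result):
    """Draw the quad boxes and recognized text returned by <OCR_WITH_REGION>."""
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    data = ocr_result["<OCR_WITH_REGION>"]
    for quad, text in zip(data["quad_boxes"], data["labels"]):
        # quad is [x1, y1, x2, y2, x3, y3, x4, y4]; group into (x, y) pairs
        points = [(quad[i], quad[i + 1]) for i in range(0, len(quad), 2)]
        draw.polygon(points, outline="red")
        draw.text(points[0], text, fill="red")
    return annotated

# annotated = draw_ocr_regions(image, run_example("<OCR_WITH_REGION>"))
```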
For more detailed examples, please refer to the [notebook](https://huggingface.co/microsoft/Florence-2-large/blob/main/sample_inference.ipynb).
</details>
# Benchmarks
## Florence-2 Zero-shot performance
The following table presents the zero-shot performance of generalist vision foundation models on image captioning and object detection evaluation tasks. These models have not been exposed to the training data of the evaluation tasks during their training phase.
| Method | #params | COCO Cap. test CIDEr | NoCaps val CIDEr | TextCaps val CIDEr | COCO Det. val2017 mAP |
|--------|---------|----------------------|------------------|--------------------|-----------------------|
| Flamingo | 80B | 84.3 | - | - | - |
| Florence-2-base| 0.23B | 133.0 | 118.7 | 70.1 | 34.7 |
| Florence-2-large| 0.77B | 135.6 | 120.8 | 72.8 | 37.5 |
The following table continues the comparison with performance on other vision-language evaluation tasks.
| Method | Flickr30k test R@1 | Refcoco val Accuracy | Refcoco test-A Accuracy | Refcoco test-B Accuracy | Refcoco+ val Accuracy | Refcoco+ test-A Accuracy | Refcoco+ test-B Accuracy | Refcocog val Accuracy | Refcocog test Accuracy | Refcoco RES val mIoU |
|--------|----------------------|----------------------|-------------------------|-------------------------|-----------------------|--------------------------|--------------------------|-----------------------|------------------------|----------------------|
| Kosmos-2 | 78.7 | 52.3 | 57.4 | 47.3 | 45.5 | 50.7 | 42.2 | 60.6 | 61.7 | - |
| Florence-2-base | 83.6 | 53.9 | 58.4 | 49.7 | 51.5 | 56.4 | 47.9 | 66.3 | 65.1 | 34.6 |
| Florence-2-large | 84.4 | 56.3 | 61.6 | 51.4 | 53.6 | 57.9 | 49.9 | 68.0 | 67.0 | 35.8 |
## Florence-2 finetuned performance
We finetune Florence-2 models on a collection of downstream tasks, resulting in two generalist models, *Florence-2-base-ft* and *Florence-2-large-ft*, that can perform a wide range of downstream tasks.
The table below compares the performance of specialist and generalist models on various captioning and Visual Question Answering (VQA) tasks. Specialist models are fine-tuned specifically for each task, whereas generalist models are fine-tuned in a task-agnostic manner across all tasks. The symbol "▲" indicates the usage of external OCR as input.
| Method | # Params | COCO Caption Karpathy test CIDEr | NoCaps val CIDEr | TextCaps val CIDEr | VQAv2 test-dev Acc | TextVQA test-dev Acc | VizWiz VQA test-dev Acc |
|----------------|----------|-----------------------------------|------------------|--------------------|--------------------|----------------------|-------------------------|
| **Specialist Models** | | | | | | | |
| CoCa | 2.1B | 143.6 | 122.4 | - | 82.3 | - | - |
| BLIP-2 | 7.8B | 144.5 | 121.6 | - | 82.2 | - | - |
| GIT2 | 5.1B | 145.0 | 126.9 | 148.6 | 81.7 | 67.3 | 71.0 |
| Flamingo | 80B | 138.1 | - | - | 82.0 | 54.1 | 65.7 |
| PaLI | 17B | 149.1 | 127.0 | 160.0▲ | 84.3 | 58.8 / 73.1▲ | 71.6 / 74.4▲ |
| PaLI-X | 55B | 149.2 | 126.3 | 147.0 / 163.7▲ | 86.0 | 71.4 / 80.8▲ | 70.9 / 74.6▲ |
| **Generalist Models** | | | | | | | |
| Unified-IO | 2.9B | - | 100.0 | - | 77.9 | - | 57.4 |
| Florence-2-base-ft | 0.23B | 140.0 | 116.7 | 143.9 | 79.7 | 63.6 | 63.6 |
| Florence-2-large-ft | 0.77B | 143.3 | 124.9 | 151.1 | 81.7 | 73.5 | 72.6 |
| Method | # Params | COCO Det. val2017 mAP | Flickr30k test R@1 | RefCOCO val Accuracy | RefCOCO test-A Accuracy | RefCOCO test-B Accuracy | RefCOCO+ val Accuracy | RefCOCO+ test-A Accuracy | RefCOCO+ test-B Accuracy | RefCOCOg val Accuracy | RefCOCOg test Accuracy | RefCOCO RES val mIoU |
|----------------------|----------|-----------------------|--------------------|----------------------|-------------------------|-------------------------|------------------------|---------------------------|---------------------------|------------------------|-----------------------|------------------------|
| **Specialist Models** | | | | | | | | | | | | |
| SeqTR | - | - | - | 83.7 | 86.5 | 81.2 | 71.5 | 76.3 | 64.9 | 74.9 | 74.2 | - |
| PolyFormer | - | - | - | 90.4 | 92.9 | 87.2 | 85.0 | 89.8 | 78.0 | 85.8 | 85.9 | 76.9 |
| UNINEXT | 0.74B | 60.6 | - | 92.6 | 94.3 | 91.5 | 85.2 | 89.6 | 79.8 | 88.7 | 89.4 | - |
| Ferret | 13B | - | - | 89.5 | 92.4 | 84.4 | 82.8 | 88.1 | 75.2 | 85.8 | 86.3 | - |
| **Generalist Models** | | | | | | | | | | | | |
| UniTAB | - | - | - | 88.6 | 91.1 | 83.8 | 81.0 | 85.4 | 71.6 | 84.6 | 84.7 | - |
| Florence-2-base-ft | 0.23B | 41.4 | 84.0 | 92.6 | 94.8 | 91.5 | 86.8 | 91.7 | 82.2 | 89.8 | 82.2 | 78.0 |
| Florence-2-large-ft| 0.77B | 43.4 | 85.2 | 93.4 | 95.3 | 92.0 | 88.3 | 92.9 | 83.6 | 91.2 | 91.7 | 80.5 |
## BibTex and citation info
```
@article{xiao2023florence,
title={Florence-2: Advancing a unified representation for a variety of vision tasks},
author={Xiao, Bin and Wu, Haiping and Xu, Weijian and Dai, Xiyang and Hu, Houdong and Lu, Yumao and Zeng, Michael and Liu, Ce and Yuan, Lu},
journal={arXiv preprint arXiv:2311.06242},
year={2023}
}
```
# Florence-2
Florence-2 is often used by large-model companies for multimodal data pre-annotation. It supports more than ten tasks, has a small parameter count, and achieves higher accuracy than CLIP.
## Paper
`Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks`
- https://arxiv.org/pdf/2311.06242
## Model Architecture
Florence-2 consists of an image/text encoder and a standard multimodal encoder-decoder.
<div align=center>
<img src="./doc/Florence-2.png"/>
</div>
## Algorithm
Florence-2 uses a DaViT vision encoder to convert the image into visual embeddings, combined with BERT to convert the text prompt into text and location embeddings. These embeddings are processed by a standard encoder-decoder transformer architecture, which generates the final text output.
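The data flow described above can be summarized with a small conceptual sketch. The module names below are hypothetical placeholders (the actual implementation lives in the remote code shipped with the Hugging Face/ModelScope checkpoints), so treat this as an illustration of the idea rather than the real API:
```python
import torch

def florence2_forward(image_tensor, prompt_ids, vision_encoder, prompt_embedder, seq2seq):
    """Conceptual sketch of the Florence-2 data flow, not the real interface."""
    visual_embeds = vision_encoder(image_tensor)    # DaViT: image -> visual token embeddings
    prompt_embeds = prompt_embedder(prompt_ids)     # BERT-style embedding of the text prompt
    fused = torch.cat([visual_embeds, prompt_embeds], dim=1)
    # A standard encoder-decoder transformer autoregressively generates
    # text tokens (and quantized location tokens) from the fused sequence.
    return seq2seq.generate(inputs_embeds=fused, max_new_tokens=1024)
```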
## Environment Setup
```
mv Florence-2-Vision-Language-Model_pytorch Florence-2-Vision-Language-Model # drop the framework suffix from the directory name
```
### Docker (Option 1)
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.4.1-ubuntu22.04-dtk25.04-py3.10-fixpy
# Replace <your IMAGE ID> with the ID of the image pulled above; for this image it is 6063b673703a
docker run -it --shm-size=64G -v $PWD/Florence-2-Vision-Language-Model:/home/Florence-2-Vision-Language-Model -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name florence2 <your IMAGE ID> bash
cd /home/Florence-2-Vision-Language-Model
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple
```
### Dockerfile (Option 2)
```
cd /home/Florence-2-Vision-Language-Model/docker
docker build --no-cache -t florence2:latest .
docker run --shm-size=64G --name florence2 -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v $PWD/../../Florence-2-Vision-Language-Model:/home/Florence-2-Vision-Language-Model -it florence2 bash
# If installing the environment through the Dockerfile takes too long, comment out the pip install inside it and install the Python libraries after the container starts: pip install -r requirements.txt
```
### Anaconda (Option 3)
1. The DCU-specific deep learning libraries required by this project can be downloaded and installed from the Guanghe (Sourcefind) developer community:
- https://developer.sourcefind.cn/tool/
```
DTK driver: dtk2504
python:python3.10
torch:2.4.1
torchvision:0.19.1
triton:3.0.0
vllm:0.6.2
flash-attn:2.6.1
deepspeed:0.14.2
apex:1.4.0
transformers:4.46.3
```
`Tip: the versions of the DTK driver, python, torch, and the other DCU-related tools above must match each other exactly.`
2. Install the remaining (non-DCU-specific) libraries according to requirements.txt
```
cd /home/Florence-2-Vision-Language-Model
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple
```
## Dataset
`None`
## Training
## Inference
Pretrained weight directory layout:
```
/home/Florence-2-Vision-Language-Model/
└── AI-ModelScope/Florence-2-large-ft
```
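One way to fetch the weights into this layout is via the ModelScope hub client (a sketch; it assumes the `modelscope` package from requirements.txt, and the exact cache layout may vary slightly between modelscope versions):
```python
from modelscope import snapshot_download

# Downloads the checkpoint into /home/Florence-2-Vision-Language-Model/AI-ModelScope/Florence-2-large-ft
model_dir = snapshot_download(
    "AI-ModelScope/Florence-2-large-ft",
    cache_dir="/home/Florence-2-Vision-Language-Model",
)
print(model_dir)
```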
### Single machine, single card
```
cd /home/Florence-2-Vision-Language-Model
python infer.py
```
Florence-2 supports many types of tasks:
- **Caption**,
- **Detailed Caption**,
- **More Detailed Caption**,
- **Dense Region Caption**,
- **Object Detection**,
- **OCR**,
- **Caption to Phrase Grounding**,
- **Segmentation**,
- **Region Proposal**,
- **OCR with Region**.
For more details, see [`readme_origin`](./readme_origin.md) from the upstream project.
## Results
`Input:`
```
car.jpg
```
`Output:`
```
{'<OD>': {'bboxes': [[34.23999786376953, 160.0800018310547, 597.4400024414062, 371.7599792480469], [454.7200012207031, 97.19999694824219, 579.5199584960938, 261.3599853515625], [452.79998779296875, 276.7200012207031, 553.9199829101562, 370.79998779296875], [94.4000015258789, 280.55999755859375, 196.1599884033203, 371.2799987792969]], 'labels': ['car', 'door', 'wheel', 'wheel']}}
```
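The bounding boxes in this output are pixel coordinates in the original image, so they can be visualized directly. Below is a minimal sketch (the `draw_od_boxes` helper is illustrative and not part of infer.py):
```python
from PIL import Image, ImageDraw

def draw_od_boxes(image_path, od_result, out_path="car_od.jpg"):
    """Draw the <OD> bounding boxes and labels onto the image and save it."""
    image = Image.open(image_path)
    draw = ImageDraw.Draw(image)
    data = od_result["<OD>"]
    for (x1, y1, x2, y2), label in zip(data["bboxes"], data["labels"]):
        draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
        draw.text((x1, y1), label, fill="red")
    image.save(out_path)

# e.g. draw_od_boxes("car.jpg", parsed_answer)  # parsed_answer from infer.py
```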
An example of the data labels used to train Florence-2; the kind of annotation it can produce is illustrated in the image below:
<div align=center>
<img src="./doc/label.png"/>
</div>
### Accuracy
DCU accuracy is consistent with GPU accuracy; inference framework: PyTorch.
## Application Scenarios
### Algorithm Category
`Multimodal`
### Key Application Industries
`Manufacturing, broadcast media, finance, energy, healthcare, home, education`
## Pretrained Weights
Download from the ModelScope community: [AI-ModelScope/Florence-2-large-ft](https://www.modelscope.cn/models/AI-ModelScope/Florence-2-large-ft)
## Source Repository and Issue Reporting
- http://developer.sourcefind.cn/codes/modelzoo/Florence-2-Vision-Language-Model_pytorch.git
## References
- https://github.com/anyantudre/Florence-2-Vision-Language-Model.git
- https://github.com/andimarafioti/florence2-finetuning.git
<div align="center">
<h1>Florence-2: Microsoft's Cutting-edge Vision Language Models</h1>
<p align="center">
🕸 <a href="https://www.linkedin.com/in/anyantudre">LinkedIn</a>
📙 <a href="https://www.kaggle.com/waalbannyantudre">Kaggle</a>
💻 <a href="https://anyantudre.medium.com/">Medium Blog</a>
🤗 <a href="https://huggingface.co/anyantudre">Hugging Face</a>
</p>
</div>
<br/>
<a href="" style="align-items:center"> <img src="https://github.com/ANYANTUDRE/Florence-2-Vision-Language-Model/blob/main/img/card.png" alt="Open In" class="center"></a>
# 🔗 Short Links
- [Florence-2 technical report](https://arxiv.org/abs/2311.06242)
- [HuggingFace's transformers implementation of Florence-2 model](https://huggingface.co/microsoft/Florence-2-large)
# 📃 Model Description
Florence-2, released by Microsoft in June 2024, is an advanced, lightweight **vision-language foundation model open-sourced** under the MIT license. The model is very attractive because of its small size (0.2B and 0.7B parameters) and strong performance on a variety of computer vision and vision-language tasks.
Despite its small size, it achieves results comparable to those of much larger models, such as Kosmos-2. The model's strength lies not in a complex architecture but in the large-scale **FLD-5B dataset**, consisting of 126 million images and 5.4 billion comprehensive visual annotations.
#### **Florence-2 model series**
| Model | Model size | Model Description |
| ------- | ------------- | ------------- |
| Florence-2-base [[HF]](https://huggingface.co/microsoft/Florence-2-base) | 0.23B | Pretrained model with FLD-5B
| Florence-2-large [[HF]](https://huggingface.co/microsoft/Florence-2-large) | 0.77B | Pretrained model with FLD-5B
| Florence-2-base-ft [[HF]](https://huggingface.co/microsoft/Florence-2-base-ft) | 0.23B | Finetuned model on a collection of downstream tasks
| Florence-2-large-ft [[HF]](https://huggingface.co/microsoft/Florence-2-large-ft) | 0.77B | Finetuned model on a collection of downstream tasks
#### **Tasks**
Florence 2 supports many tasks out of the box:
- **Caption**,
- **Detailed Caption**,
- **More Detailed Caption**,
- **Dense Region Caption**,
- **Object Detection**,
- **OCR**,
- **Caption to Phrase Grounding**,
- **Segmentation**,
- **Region Proposal**,
- **OCR with Region**.
You can try out the model via [HF Space]().
# 🕸 Unified Representation
Vision tasks are diverse and vary in terms of spatial hierarchy and semantic granularity. Instance segmentation provides detailed information about object locations within an image but lacks semantic information. On the other hand, image captioning allows for a deeper understanding of the relationships between objects, but without reference to their actual locations.
<a href=""> <img src="https://github.com/ANYANTUDRE/Florence-2-Vision-Language-Model/blob/main/img/representation.jpeg" alt="Open In "></a>
*Figure 1. Illustration showing the level of spatial hierarchy and semantic granularity expressed by each task. Source: Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks.*
The authors of Florence-2 decided that instead of training a series of separate models capable of executing individual tasks, they would unify their representation and train a single model capable of executing over 10 tasks. However, this requires a new dataset.
# 💎 Dataset
Florence-2's strength doesn't stem from its architecture, but from the massive dataset it was pre-trained on. The authors noted that leading computer vision datasets typically contain limited information - WIT only includes image/caption pairs, SA-1B only contains images and associated segmentation masks. Therefore, they decided to build a new **FLD-5B dataset** containing a wide range of information about each image - boxes, masks, captions, and grounding. The dataset creation process was largely automated. The authors used off-the-shelf task-specific models and a set of heuristics and quality checks to clean the obtained results. The result was a new dataset containing over 5 billion annotations for 126 million images, which was used to pre-train the Florence-2 model.
<a href=""> <img src="https://github.com/ANYANTUDRE/Florence-2-Vision-Language-Model/blob/main/img/annotation.jpeg" alt="Open In "></a>
*An illustrative example of an image and its corresponding annotations in the FLD-5B dataset. Source: Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks.*
FLD-5B is not yet publicly available, but the authors announced its upcoming release during CVPR 2024.
<a href=""> <img src="https://github.com/ANYANTUDRE/Florence-2-Vision-Language-Model/blob/main/img/dataset.jpeg" alt="Open In "></a>
*Summary of size, spatial hierarchy, and semantic granularity of top datasets. Source: Florence-2 CVPR 2024 poster.*
# 🧩 Architecture and Pre-training details
Regardless of the computer vision task being performed, Florence-2 formulates the problem as a sequence-to-sequence task. Florence-2 takes an image and text as inputs, and generates text as output. The model has a simple structure. It uses a DaViT vision encoder to convert images into visual embeddings, and BERT to convert text prompts into text and location embeddings. The resulting embeddings are then processed by a standard encoder-decoder transformer architecture, generating text and location tokens.
<a href=""> <img src="https://github.com/ANYANTUDRE/Florence-2-Vision-Language-Model/blob/main/img/architecture.png" alt="Open In "></a>
*Overview of Florence-2 architecture. Source: Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks.*
For region-specific tasks, location tokens representing quantized coordinates are added to the tokenizer's vocabulary (a toy quantization sketch follows the list below).
- **Box Representation (x0, y0, x1, y1):** Location tokens correspond to the box coordinates, specifically the top-left and bottom-right corners.
- **Polygon Representation (x0, y0, ..., xn, yn):** Location tokens represent the polygon's vertices in clockwise order.
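Here is a toy illustration of how continuous box coordinates can be mapped to discrete location tokens. The 1000-bin granularity and the `<loc_k>` token format follow the convention used by the released Florence-2 processor, but the helper itself is a simplified sketch rather than the actual implementation:
```python
def box_to_location_tokens(box, image_width, image_height, num_bins=1000):
    """Quantize (x0, y0, x1, y1) pixel coordinates into discrete location tokens."""
    def bin_coord(value, size):
        # Map a pixel coordinate to an integer bin in [0, num_bins - 1].
        return min(int(value / size * num_bins), num_bins - 1)
    x0, y0, x1, y1 = box
    bins = [bin_coord(x0, image_width), bin_coord(y0, image_height),
            bin_coord(x1, image_width), bin_coord(y1, image_height)]
    return [f"<loc_{b}>" for b in bins]

# For a 640x480 image:
# box_to_location_tokens([34.2, 160.1, 597.4, 371.8], 640, 480)
# -> ['<loc_53>', '<loc_333>', '<loc_933>', '<loc_774>']
```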
# 🦾 Capabilities
Florence-2 is smaller and more accurate than its predecessors. The Florence-2 series consists of two models: Florence-2-base and Florence-2-large, with 0.23 billion and 0.77 billion parameters, respectively. This size allows for deployment even on mobile devices.
Despite its small size, Florence-2 achieves better zero-shot results than Kosmos-2 across all benchmarks, even though Kosmos-2 has 1.6 billion parameters.
#### Examples
# 🏋🏾‍♂️ Finetuning
Even though Florence-2 supports many tasks, your task or domain might not be covered, or you may want tighter control over the model's output. That's when fine-tuning is needed; a minimal training-loop sketch follows the links below.
- This post shows an example of [fine-tuning Florence-2 on DocVQA](https://huggingface.co/blog/finetune-florence2).
- [Finetuning notebook]()
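For orientation, here is a heavily simplified sketch of the training loop used in the DocVQA fine-tuning post linked above. It assumes the remote-code Florence-2 model accepts a `labels` argument and returns a loss, and that `train_loader` is a DataLoader yielding (prompts, answers, PIL images) batches; dataset preparation, learning-rate scheduling, and evaluation are omitted:
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base-ft", trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-base-ft", trust_remote_code=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

model.train()
for prompts, answers, images in train_loader:  # assumed DataLoader of (prompt, target, image)
    inputs = processor(text=prompts, images=images,
                       return_tensors="pt", padding=True).to(device)
    labels = processor.tokenizer(text=answers, return_tensors="pt", padding=True,
                                 return_token_type_ids=False).input_ids.to(device)
    loss = model(input_ids=inputs["input_ids"],
                 pixel_values=inputs["pixel_values"], labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```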
# 🗂 Resources
| Title | Type | Brief Description | Links |
|---------|--------------------|-------------------------------|----------------------------------------------------------|
| **Florence-2 Demo** | Demo | HF Space | [Link]() |
| **Florence-2 DocVQA Demo** | Demo | HF Space | [Link]() |
| **Florence-2 Finetuned Demo** | Demo | HF Space | [Link]() |
| **Florence-2 Inference Notebook** | Notebook | Notebook | [Link]() |
| **Florence-2 Finetuning Notebook** | Notebook | Notebook | [Link]() |
| **Vision Language Models Explained** | Blog article | article | [Link](https://huggingface.co/blog/vlms) |
| **Florence-2 Finetuning on DocVQA** | Video | Video | [Link]() |
| **Florence-2 Finetuning on** | Video | Video | [Link]() |
# 🔗 Citations and References
- @article{xiao2023florence,
  title={Florence-2: Advancing a unified representation for a variety of vision tasks},
  author={Xiao, Bin and Wu, Haiping and Xu, Weijian and Dai, Xiyang and Hu, Houdong and Lu, Yumao and Zeng, Michael and Liu, Ce and Yuan, Lu},
  journal={arXiv preprint arXiv:2311.06242},
  year={2023}
  }
- Piotr Skalski. (Jun 20, 2024). Florence-2: Open Source Vision Foundation Model by Microsoft. [Roboflow Blog](https://blog.roboflow.com/florence-2/)
- [Fine-tuning Florence-2 - Microsoft's Cutting-edge Vision Language Models](https://huggingface.co/blog/finetune-florence2)
car.jpg (binary image file, 38.2 KB)
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.4.1-ubuntu22.04-dtk25.04-py3.10-fixpy
ENV DEBIAN_FRONTEND=noninteractive
# RUN yum update && yum install -y git cmake wget build-essential
# RUN source /opt/dtk-dtk25.04/env.sh
# Install pip dependencies
COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
pillow
modelscope
timm
icon.png (binary image file, 50.3 KB)
import requests
from PIL import Image
from modelscope import AutoProcessor, AutoModelForCausalLM
import torch
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model = AutoModelForCausalLM.from_pretrained("AI-ModelScope/Florence-2-large-ft", torch_dtype=torch_dtype, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained("AI-ModelScope/Florence-2-large-ft", trust_remote_code=True)
prompt = "<OD>"
'''
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
'''
image = Image.open("car.jpg")
inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, torch_dtype)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    do_sample=False,
    num_beams=3,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
parsed_answer = processor.post_process_generation(generated_text, task="<OD>", image_size=(image.width, image.height))
print(parsed_answer)