Commit ef30d662 authored by bailuo
init
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.3.0-ubuntu22.04-dtk24.04.3-py3.10
ENV DEBIAN_FRONTEND=noninteractive
# COPY requirements.txt requirements.txt
# RUN pip3 install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
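# Build/run sketch (hypothetical, not part of this commit): with a requirements.txt
# placed next to this Dockerfile, the two commented lines above could be restored and
# the image built and smoke-tested with something like:
#   docker build -t sa2va:dev .
#   docker run --rm -it sa2va:dev python3 -c "import torch; print(torch.__version__)"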
Upload Image,Upload mp4 video,Follow up Question,Text Instruction,Output Image,Output Video,output 2,flag,username,timestamp
flagged/Upload Image/81c78635b1580a4ebf31/微信图片_20240522104204.jpg,,false,Could you please give me a detailed description of the image?,flagged/Output Image/12abac9eeec7619282c7/image.webp,,"<link href=""https://fonts.googleapis.com/css2?family=Montserrat:wght@400;700&display=swap"" rel=""stylesheet"">
<style>
.highlighted-text {
font-family: 'Montserrat', sans-serif;
font-weight: 600;
font-size: 14px;
color: rgb(255, 255, 239);
background-color: rgb(225, 231, 254);
border-radius: 7px;
padding: 5px 7px;
display: inline-block;
}
.regular-text {
font-family: 'Montserrat', sans-serif;
font-weight: 400;
font-size: 14px;
}
.highlighted-response {
font-family: 'Montserrat', sans-serif;
font-weight: 600;
font-size: 14px;
border-radius: 6px;
padding: 3px 4px;
display: inline-block;
}
</style>
<span class=""highlighted-text"" style='color:rgb(107, 100, 239)'>Sa2VA</span>
<p><span class='regular-text'>
The image features a desk with a computer setup, including a keyboard, a mouse, and a monitor. The keyboard is placed in the center of the desk, with a mouse to its right. There are two other keyboards on the desk, one on the left side and another on the right side. A laptop is also present on the left side of the desk. In addition to the computer peripherals, there are two pens on the desk, one near the center and the other on the right side. The setup appears to be a typical workspace for someone who uses a computer for various tasks.
",,,2025-03-11 20:52:22.391278
flagged/Upload Image/43f1a49663b0b6b5474a/微信图片_20240522104204.jpg,,false,Could you provide me with a detailed analysis of this photo? Please output with interleaved segmentation masks for the corresponding parts of the answer,flagged/Output Image/47d3fb0fc7839320f5f4/image.webp,,"<link href=""https://fonts.googleapis.com/css2?family=Montserrat:wght@400;700&display=swap"" rel=""stylesheet"">
<style>
.highlighted-text {
font-family: 'Montserrat', sans-serif;
font-weight: 600;
font-size: 14px;
color: rgb(255, 255, 239);
background-color: rgb(225, 231, 254);
border-radius: 7px;
padding: 5px 7px;
display: inline-block;
}
.regular-text {
font-family: 'Montserrat', sans-serif;
font-weight: 400;
font-size: 14px;
}
.highlighted-response {
font-family: 'Montserrat', sans-serif;
font-weight: 600;
font-size: 14px;
border-radius: 6px;
padding: 3px 4px;
display: inline-block;
}
</style>
<span class=""highlighted-text"" style='color:rgb(107, 100, 239)'>Sa2VA</span>
<p><span class='regular-text'>
<span class='highlighted-response' style='background-color:rgb(254, 76, 76)'> a black keyboard with a red logo on the back </span> [SEG], along with <span class='highlighted-response' style='background-color:rgb(76, 254, 76)'> a black keyboard with a black and red logo </span> [SEG], are placed on a desk next to <span class='highlighted-response' style='background-color:rgb(76, 76, 254)'> a black computer mouse with an orange button </span> [SEG]. <span class='highlighted-response' style='background-color:rgb(254, 254, 76)'> a black computer mouse with a black and orange button </span> [SEG] is also present. <span class='highlighted-response' style='background-color:rgb(254, 76, 254)'> a black computer mouse with a black and orange button </span> [SEG] is also present.
[remainder of this logged response omitted: a long degenerate run of repeated empty <span class='highlighted-response'> </span> [SEG] tokens, cut off mid-span at an unfilled rgb[COLOR] placeholder]
",,,2025-03-11 20:55:59.174846
flagged/Upload Image/5107a46a6d81d11a8398/微信图片_20240522104201.jpg,,false,Could you provide me with a detailed analysis of this photo? Please output with interleaved segmentation masks for the corresponding parts of the answer,flagged/Output Image/9efda5cd32bec385c73f/image.webp,,"<link href=""https://fonts.googleapis.com/css2?family=Montserrat:wght@400;700&display=swap"" rel=""stylesheet"">
<style>
.highlighted-text {
font-family: 'Montserrat', sans-serif;
font-weight: 600;
font-size: 14px;
color: rgb(255, 255, 239);
background-color: rgb(225, 231, 254);
border-radius: 7px;
padding: 5px 7px;
display: inline-block;
}
.regular-text {
font-family: 'Montserrat', sans-serif;
font-weight: 400;
font-size: 14px;
}
.highlighted-response {
font-family: 'Montserrat', sans-serif;
font-weight: 600;
font-size: 14px;
border-radius: 6px;
padding: 3px 4px;
display: inline-block;
}
</style>
<span class=""highlighted-text"" style='color:rgb(107, 100, 239)'>Sa2VA</span>
<p><span class='regular-text'>
<span class='highlighted-response' style='background-color:rgb(254, 76, 76)'> A calm lake </span> [SEG] reflects the city skyline with <span class='highlighted-response' style='background-color:rgb(76, 254, 76)'> a bridge </span> [SEG] and <span class='highlighted-response' style='background-color:rgb(76, 76, 254)'> tall buildings </span> [SEG], creating a picturesque scene. <span class='highlighted-response' style='background-color:rgb(254, 254, 76)'> Trees </span> [SEG] and <span class='highlighted-response' style='background-color:rgb(254, 76, 254)'> grass </span> [SEG] are visible in the foreground, and <span class='highlighted-response' style='background-color:rgb(76, 254, 254)'> the sky </span> [SEG] can be seen in the background<span class='highlighted-response' style='background-color:rgb(76, 76, 254)'> </span> [SEG].
",,,2025-03-11 20:57:21.660529
,"{""video"": ""flagged/Upload mp4 video/6106b3f3cbe031638302/\u676f\u5b50.mp4"", ""subtitles"": null}",false,"Instruction: ""Please segment the cup.""",,"{""video"": ""flagged/Output Video/762cf09451fc44db897c/ret_video.mp4"", ""subtitles"": null}","<link href=""https://fonts.googleapis.com/css2?family=Montserrat:wght@400;700&display=swap"" rel=""stylesheet"">
<style>
.highlighted-text {
font-family: 'Montserrat', sans-serif;
font-weight: 600;
font-size: 14px;
color: rgb(255, 255, 239);
background-color: rgb(225, 231, 254);
border-radius: 7px;
padding: 5px 7px;
display: inline-block;
}
.regular-text {
font-family: 'Montserrat', sans-serif;
font-weight: 400;
font-size: 14px;
}
.highlighted-response {
font-family: 'Montserrat', sans-serif;
font-weight: 600;
font-size: 14px;
border-radius: 6px;
padding: 3px 4px;
display: inline-block;
}
</style>
<span class=""highlighted-text"" style='color:rgb(107, 100, 239)'>Sa2VA</span>
<p><span class='regular-text'>
Sure, the segmentation result is [SEG].
",,,2025-03-11 21:00:34.998173
,"{""video"": ""flagged/Upload mp4 video/c3b6bd48ff3ba1b41034/dog-5.mp4"", ""subtitles"": null}",false,"Instruction: ""Please segment the dog.""",,"{""video"": ""flagged/Output Video/f563c4642c8ccb51bd68/ret_video.mp4"", ""subtitles"": null}","<link href=""https://fonts.googleapis.com/css2?family=Montserrat:wght@400;700&display=swap"" rel=""stylesheet"">
<style>
.highlighted-text {
font-family: 'Montserrat', sans-serif;
font-weight: 600;
font-size: 14px;
color: rgb(255, 255, 239);
background-color: rgb(225, 231, 254);
border-radius: 7px;
padding: 5px 7px;
display: inline-block;
}
.regular-text {
font-family: 'Montserrat', sans-serif;
font-weight: 400;
font-size: 14px;
}
.highlighted-response {
font-family: 'Montserrat', sans-serif;
font-weight: 600;
font-size: 14px;
border-radius: 6px;
padding: 3px 4px;
display: inline-block;
}
</style>
<span class=""highlighted-text"" style='color:rgb(107, 100, 239)'>Sa2VA</span>
<p><span class='regular-text'>
Sure, the segmentation result is [SEG].
",,,2025-03-11 21:08:15.217660
,"{""video"": ""flagged/Upload mp4 video/6ab54f3481c586fed40b/Biker.mp4"", ""subtitles"": null}",false,"Instruction: ""Tell me about this video.""",,"{""video"": ""flagged/Output Video/0de5ea168ebdced955f0/ret_video.mp4"", ""subtitles"": null}","<link href=""https://fonts.googleapis.com/css2?family=Montserrat:wght@400;700&display=swap"" rel=""stylesheet"">
<style>
.highlighted-text {
font-family: 'Montserrat', sans-serif;
font-weight: 600;
font-size: 14px;
color: rgb(255, 255, 239);
background-color: rgb(225, 231, 254);
border-radius: 7px;
padding: 5px 7px;
display: inline-block;
}
.regular-text {
font-family: 'Montserrat', sans-serif;
font-weight: 400;
font-size: 14px;
}
.highlighted-response {
font-family: 'Montserrat', sans-serif;
font-weight: 600;
font-size: 14px;
border-radius: 6px;
padding: 3px 4px;
display: inline-block;
}
</style>
<span class=""highlighted-text"" style='color:rgb(107, 100, 239)'>Sa2VA</span>
<p><span class='regular-text'>
Sure, [SEG].
",,,2025-03-12 11:29:11.297654
,"{""video"": ""flagged/Upload mp4 video/7cdd3a3bf82b4b0ad290/GOT-10k_Test_000010.mp4"", ""subtitles"": null}",false,"Instruction: ""Tell me about this video.""",,"{""video"": ""flagged/Output Video/32ec117c318e434a7dd0/GOT-10k_Test_000010.mp4"", ""subtitles"": null}","<link href=""https://fonts.googleapis.com/css2?family=Montserrat:wght@400;700&display=swap"" rel=""stylesheet"">
<style>
.highlighted-text {
font-family: 'Montserrat', sans-serif;
font-weight: 600;
font-size: 14px;
color: rgb(255, 255, 239);
background-color: rgb(225, 231, 254);
border-radius: 7px;
padding: 5px 7px;
display: inline-block;
}
.regular-text {
font-family: 'Montserrat', sans-serif;
font-weight: 400;
font-size: 14px;
}
.highlighted-response {
font-family: 'Montserrat', sans-serif;
font-weight: 600;
font-size: 14px;
border-radius: 6px;
padding: 3px 4px;
display: inline-block;
}
</style>
<span class=""highlighted-text"" style='color:rgb(107, 100, 239)'>Sa2VA</span>
<p><span class='regular-text'>
Sure, the video shows a serene scene of a white swan gracefully gliding across a calm lake. The swan is seen in various positions, sometimes flying low over the water and other times soaring higher. The water is still, reflecting the swan's movements and the surrounding environment. The swan's elegant flight and the tranquil setting create a peaceful and picturesque atmosphere.
",,,2025-03-12 11:30:50.966251
,"{""video"": ""flagged/Upload mp4 video/4580bf6f906338669fb1/\u676f\u5b50.mp4"", ""subtitles"": null}",false,"Instruction: ""Tell me about this video.""",,"{""video"": ""flagged/Output Video/2a18939795b479c5228c/\u676f\u5b50.mp4"", ""subtitles"": null}","<link href=""https://fonts.googleapis.com/css2?family=Montserrat:wght@400;700&display=swap"" rel=""stylesheet"">
<style>
.highlighted-text {
font-family: 'Montserrat', sans-serif;
font-weight: 600;
font-size: 14px;
color: rgb(255, 255, 239);
background-color: rgb(225, 231, 254);
border-radius: 7px;
padding: 5px 7px;
display: inline-block;
}
.regular-text {
font-family: 'Montserrat', sans-serif;
font-weight: 400;
font-size: 14px;
}
.highlighted-response {
font-family: 'Montserrat', sans-serif;
font-weight: 600;
font-size: 14px;
border-radius: 6px;
padding: 3px 4px;
display: inline-block;
}
</style>
<span class=""highlighted-text"" style='color:rgb(107, 100, 239)'>Sa2VA</span>
<p><span class='regular-text'>
Sure, the video shows a close-up of a blue coffee mug with a white interior, placed on a white table. The mug is positioned in the center of the frame, and it appears to be empty. The background features a red wall, which adds a pop of color to the scene. The lighting is bright, highlighting the details of the mug and the table. The overall atmosphere of the video is simple and clean, focusing on the mug as the main subject.
",,,2025-03-12 11:32:21.480367
# Model unique identifier
modelCode=1450
# Model name
modelName=Sa2VA_pytorch
# Model description
modelDescription=Combines SAM2 with LLaVA to enable in-depth understanding of images and videos.
# Application scenarios
appScenario=Image understanding, retail, manufacturing, e-commerce, healthcare, education
# Framework type
frameType=pytorch
from .semantic_seg_dataset import SemanticSegDataset, ADE20kSemanticSegDataset, \
    COCOStuffSemanticSegDataset, PascalPartSemanticSegDataset, PacoSemanticSegDataset
from .gcg_dataset import GCGDataset, GranDfGCGDataset, RefCOCOgGCGDataset, OpenPsgGCGDataset, Flickr30kGCGDataset
from .region_level_dataset import RefCocoGRegionDataset, VisualGenomeRegionDataset
from .refcoco_segm_dataset import ReferSegmDataset
from .utils.utils import *
from .collate_fns.glamm_collate_fn import glamm_collate_fn
from typing import Dict, Sequence

import torch
from torch.nn.utils.rnn import pad_sequence

from xtuner.parallel.sequence import (get_sequence_parallel_world_size,
                                      pad_for_sequence_parallel)
from xtuner.utils import DEFAULT_PAD_TOKEN_INDEX, IGNORE_INDEX


def glamm_collate_fn(instances: Sequence[Dict],
                     pad_index: int = DEFAULT_PAD_TOKEN_INDEX,
                     return_hf_format: bool = False,
                     use_varlen_attn: bool = False):
    """Collate GLaMM/Sa2VA-style samples into a padded multimodal training batch."""
    seq_parallel_world_size = get_sequence_parallel_world_size()

    input_ids, labels = [], []
    # Presence flags: the optional multimodal keys may be missing in some samples.
    has_image = any(inst.get('pixel_values') is not None for inst in instances)
    has_grounding_image = any(inst.get('g_pixel_values') is not None for inst in instances)
    has_mask = any(inst.get('masks') is not None for inst in instances)
    has_bboxes = any(inst.get('bboxes') is not None for inst in instances)
    has_points = any(inst.get('points') is not None for inst in instances)

    if use_varlen_attn:
        position_ids, cumulative_len = [], []
        assert len(instances) == 1, (
            f'If utilizing varlen attention, the batch size should be'
            f' set to 1, but got {len(instances)}')
        assert not has_image, ('Currently, it is not configured to '
                               'accommodate the use of varlen Attention in multimodal training')

    if has_image:
        pixel_values = []
    if has_grounding_image:
        grounding_pixel_values = []
    if has_mask:
        object_masks = []
    if has_bboxes:
        object_bboxes = []
    if has_points:
        prompt_points = []

    for example in instances:
        input_ids.append(torch.LongTensor(example['input_ids']))
        labels.append(torch.LongTensor(example['labels']))
        if use_varlen_attn:
            cumulative_len.append(torch.IntTensor(example['cumulative_len']))
            position_ids.append(torch.LongTensor(example['position_ids']))
        if has_image:
            pixel_values.append(example['pixel_values'])
        if has_grounding_image:
            grounding_pixel_values.append(example['g_pixel_values'])
        if has_mask:
            if 'masks' in example.keys() and example['masks'] is not None:
                object_masks.append(example['masks'])
        if has_bboxes:
            if 'bboxes' in example.keys() and example['bboxes'] is not None:
                object_bboxes.append(example['bboxes'])
        if has_points:
            if 'points' in example.keys() and example['points'] is not None:
                prompt_points.append(example['points'])

    ori_length = [len(ids) for ids in input_ids]
    if len(instances) > 1:
        input_ids = pad_sequence(
            input_ids, batch_first=True, padding_value=pad_index)
        labels = pad_sequence(
            labels, batch_first=True, padding_value=IGNORE_INDEX)
    else:
        input_ids = torch.stack(input_ids)
        labels = torch.stack(labels)

    if use_varlen_attn:
        assert input_ids.size(1) % seq_parallel_world_size == 0
        attention_mask = None
        position_ids = torch.stack(position_ids, dim=0)
    else:
        # Some tokenizers have the same eos token and pad token, so input_ids
        # cannot be masked directly based on the pad token id.
        attention_mask = torch.zeros_like(input_ids).bool()
        for i, length in enumerate(ori_length):
            attention_mask[i, :length] = True

        bs, seq_len = input_ids.shape
        position_ids = torch.arange(seq_len).unsqueeze(0).long().repeat(bs, 1)

    if seq_parallel_world_size > 1:
        input_ids = pad_for_sequence_parallel(input_ids, pad_index)
        labels = pad_for_sequence_parallel(labels, IGNORE_INDEX)
        position_ids = pad_for_sequence_parallel(position_ids, 0)
        if attention_mask is not None:
            attention_mask = pad_for_sequence_parallel(attention_mask, 0)

    if use_varlen_attn:
        max_seqlen = (
            cumulative_len[0][1:] -  # noqa: W504
            cumulative_len[0][:-1]).max().item()
        data_dict = {
            'input_ids': input_ids,
            'cumulative_len': cumulative_len,
            'position_ids': position_ids,
            'labels': labels,
            'max_seqlen': max_seqlen
        }
    else:
        data_dict = {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'position_ids': position_ids,
            'labels': labels
        }

    if has_image:
        if all(x.shape == pixel_values[0].shape for x in pixel_values):
            pixel_values = torch.stack(pixel_values, dim=0)
        data_dict['pixel_values'] = pixel_values

    if has_grounding_image:
        # Grounding pixel values are left as a list (the stacking below is intentionally disabled).
        # if all(x.shape == grounding_pixel_values[0].shape for x in grounding_pixel_values):
        #     grounding_pixel_values = torch.stack(grounding_pixel_values, dim=0)
        data_dict['g_pixel_values'] = grounding_pixel_values

    if has_mask:
        data_dict['masks'] = object_masks
    if has_bboxes:
        data_dict['bboxes'] = object_bboxes
    if has_points:
        data_dict['points'] = prompt_points

    if return_hf_format:
        return data_dict
    else:
        return {'data': data_dict, 'data_samples': None}
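

# Usage sketch (illustrative, not part of the original file): a minimal example of
# plugging glamm_collate_fn into a PyTorch DataLoader. The toy dataset below is a
# hypothetical stand-in; real samples also carry 'pixel_values', 'g_pixel_values',
# 'masks', 'bboxes' or 'points', which the has_* flags above pick up automatically.
if __name__ == '__main__':
    from functools import partial
    from torch.utils.data import DataLoader, Dataset

    class _ToySegDataset(Dataset):
        """Hypothetical dataset emitting only the keys glamm_collate_fn requires."""

        def __len__(self):
            return 8

        def __getitem__(self, idx):
            length = 5 + idx  # variable lengths to exercise padding
            return {'input_ids': list(range(length)), 'labels': list(range(length))}

    loader = DataLoader(
        _ToySegDataset(),
        batch_size=4,
        collate_fn=partial(glamm_collate_fn, return_hf_format=False))

    batch = next(iter(loader))
    data = batch['data']
    # Padded to the longest sample in the batch; attention_mask marks the real tokens.
    print(data['input_ids'].shape, data['attention_mask'].shape, data['labels'].shape)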
import copy
import random
import glob
import json
import logging
import os
import torch
from mmengine import print_log
from mmengine.config import Config, ConfigDict
from PIL import Image
from torch.utils.data import Dataset
import numpy as np
import torch.nn.functional as F
from pycocotools.coco import COCO
from pycocotools import mask as mask_utils
from xtuner.registry import BUILDER
from xtuner.dataset.utils import encode_fn
from xtuner.dataset.map_fns import llava_map_fn
from projects.glamm.datasets.utils.utils import expand2square
from projects.glamm.datasets.utils.utils import GCG_QUESTIONS, ANSWER_LIST
from projects.glamm.utils import DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
class GCGDataset(Dataset):
def __init__(self,
image_folder,
image_processor,
data_path=None,
tokenizer=None,
template_map_fn=None,
max_length=2048,
pad_image_to_square=False,
repeats=1,
num_classes_per_sample=3,
extra_image_processor=None):
super().__init__()
self.question_templates = GCG_QUESTIONS
if extra_image_processor is not None:
self.extra_image_processor = BUILDER.build(extra_image_processor)
self.num_classes_per_sample = num_classes_per_sample
self.tokenizer = BUILDER.build(tokenizer)
self.tokenizer.add_tokens(
[DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True
)
reg_tokens = ['<bbox>', '<point>']
segmentation_tokens = ['[SEG]']
phrase_tokens = ['<p>', '</p>']
special_tokens = reg_tokens + segmentation_tokens + phrase_tokens
self.tokenizer.add_tokens(special_tokens, special_tokens=True)
self.max_length = max_length
self.template_map_fn = BUILDER.build(template_map_fn)
self.text_data = self.json_file_preprocess(data_path, image_folder)
self.image_folder = image_folder
self.image_processor = BUILDER.build(image_processor)
size = self.image_processor.crop_size
if isinstance(size, dict):
self.image_w, self.image_h = size['width'], size['height']
elif isinstance(size, int):
self.image_h, self.image_w = size, size
else:
self.image_w, self.image_h = size
self.pad_image_to_square = pad_image_to_square
self.repeats = repeats
def json_file_preprocess(self, data_path, image_folder=None):
with open(data_path, 'r') as f:
json_data = json.load(f)
return json_data
@property
def modality_length(self):
length_list = []
for data_dict in self.text_data:
cur_len = 100
length_list.append(cur_len)
return length_list * self.repeats
def __len__(self):
return len(self.text_data) * self.repeats
def real_len(self):
return len(self.text_data)
def _parse_annotations(self, ann_info):
image_path = os.path.join(self.image_folder, ann_info['file_name'])
image = Image.open(image_path).convert('RGB')
if hasattr(self, 'extra_image_processor'):
g_image = np.array(image) # for grounding
g_image = self.extra_image_processor.apply_image(g_image)
g_pixel_values = torch.from_numpy(g_image).permute(2, 0, 1).contiguous()
ann_info['g_pixel_values'] = g_pixel_values
width, height = image.size
if self.pad_image_to_square:
image = expand2square(
image, tuple(int(x * 255) for x in self.image_processor.image_mean))
image = self.image_processor.preprocess(image, return_tensors='pt')['pixel_values'][0]
ann_info['pixel_values'] = image
caption = ann_info['caption'].strip('"').strip()
masks, phrases, tokens_positive = [], [], []
for word, grounding in ann_info["groundings"].items():
phrases.append(word)
tokens_positive.append(grounding["token_positives"])
# Convert segmentation to binary mask
binary_mask = np.zeros((height, width), dtype=np.uint8)
for rle in grounding["rle_masks"]:
m = mask_utils.decode(rle).astype(np.uint8)
binary_mask += m.squeeze()
masks.append(binary_mask)
def sort_by_start_index(items, order):
return [items[i] for i in order]
phrase_order = sorted(range(len(tokens_positive)), key=lambda x: tokens_positive[x][0])
masks = sort_by_start_index(masks, phrase_order)
phrases = sort_by_start_index(phrases, phrase_order)
tokens_positive = sort_by_start_index(tokens_positive, phrase_order)
ann_info.update({
'image_path': image_path,
'caption': caption,
'masks': masks,
'phrases': phrases,
'tokens_positive': tokens_positive,
})
return ann_info
def create_conversation(self, caption, tokens_positive):
question = random.choice(self.question_templates).strip()
# Prepare caption with tags
def tag_caption(caption, tokens):
for start, end in sorted(tokens, key=lambda x: x[0], reverse=True):
caption = f"{caption[:start]}<p> {caption[start:end]} </p> [SEG]{caption[end:]}"
return caption
detailed_answer = tag_caption(caption, tokens_positive)
question = 'The <image> provides an overview of the picture.\n' + question
conversation = [{'input': question, 'output': detailed_answer}]
return conversation
def __getitem__(self, index):
index = index % self.real_len()
data_dict = {}
ann_info = copy.deepcopy(self.text_data[index])
ann_info = self._parse_annotations(ann_info)
data_dict['g_pixel_values'] = ann_info.pop('g_pixel_values')
data_dict['pixel_values'] = ann_info.pop('pixel_values')
if len(ann_info['masks']) == 0:
return self.__getitem__(0)
data_dict['masks'] = torch.from_numpy(np.stack(ann_info['masks'], axis=0))
conversation = self.create_conversation(ann_info['caption'], ann_info['tokens_positive'])
data_dict['conversation'] = conversation
result = self.template_map_fn(data_dict)
data_dict.update(result)
result = encode_fn(data_dict, tokenizer=self.tokenizer, max_length=self.max_length, with_image_token=True)
data_dict.update(result)
return data_dict
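# GranD-f and OpenPSG GCG annotations reuse the base GCGDataset loading and
# parsing logic unchanged; only the RefCOCOg and Flickr30k subclasses below
# override it.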
class GranDfGCGDataset(GCGDataset):
pass
class RefCOCOgGCGDataset(GCGDataset):
def json_file_preprocess(self, data_path, image_folder=None):
with open(data_path, 'r') as f:
json_data = json.load(f)
return [list(line.values())[0] for line in json_data]
def _parse_annotations(self, ann_info):
image_path = os.path.join(self.image_folder, ann_info['img_file_name'])
image = Image.open(image_path).convert('RGB')
if hasattr(self, 'extra_image_processor'):
g_image = np.array(image) # for grounding
g_image = self.extra_image_processor.apply_image(g_image)
g_pixel_values = torch.from_numpy(g_image).permute(2, 0, 1).contiguous()
ann_info['g_pixel_values'] = g_pixel_values
width, height = image.size
if self.pad_image_to_square:
image = expand2square(
image, tuple(int(x * 255) for x in self.image_processor.image_mean))
image = self.image_processor.preprocess(image, return_tensors='pt')['pixel_values'][0]
ann_info['pixel_values'] = image
caption = ann_info['caption'].strip('"').strip().lower()
masks, phrases, tokens_positive = [], [], []
for detail in ann_info['refs']:
phrase = detail['sentence']
if phrase.lower() in caption:
phrases.append(phrase)
index = caption.find(phrase)
end_index = index + len(phrase) if index != -1 else -1
tokens_positive.append([index, end_index])
binary_mask = np.zeros((height, width), dtype=np.uint8)
for seg in detail["segmentation"]:
rles = mask_utils.frPyObjects([seg], height, width)
m = mask_utils.decode(rles)
m = m.astype(np.uint8)
binary_mask += m.squeeze()
masks.append(binary_mask)
def sort_by_start_index(items, order):
return [items[i] for i in order]
phrase_order = sorted(range(len(tokens_positive)), key=lambda x: tokens_positive[x][0])
masks = sort_by_start_index(masks, phrase_order)
phrases = sort_by_start_index(phrases, phrase_order)
tokens_positive = sort_by_start_index(tokens_positive, phrase_order)
ann_info.update({
'image_path': image_path,
'caption': caption,
'masks': masks,
'phrases': phrases,
'tokens_positive': tokens_positive,
})
return ann_info
class OpenPsgGCGDataset(GCGDataset):
pass
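# Flickr30k GCG annotations are stored in COCO format: each annotation carries
# a 'tokens_positive' character span into the image caption and a pre-computed
# 'sam_mask' RLE, which _parse_annotations decodes below.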
class Flickr30kGCGDataset(GCGDataset):
def json_file_preprocess(self, data_path, image_folder=None):
def filter_images(data_infos, min_size):
return [i for i, info in enumerate(data_infos) if min(info['width'], info['height']) >= min_size]
self.coco = COCO(data_path)
self.image_ids = self.coco.getImgIds()
data_infos = []
total_ann_ids = []
removed_img_count = 0
for img_id in self.image_ids:
info = self.coco.loadImgs([img_id])[0]
if len(info['caption'].split(' ')) < 3:
removed_img_count += 1
continue
info['filename'] = info['file_name'].split('_')[-1]
info['height'] = int(info['height'])
info['width'] = int(info['width'])
data_infos.append(info)
ann_ids = self.coco.getAnnIds(imgIds=[img_id])
total_ann_ids.extend(ann_ids)
assert len(set(total_ann_ids)) == len(total_ann_ids), f"Non-unique annotation IDs in '{data_path}'!"
print(f'Removed {removed_img_count} images.')
data_infos = [data_infos[i] for i in filter_images(data_infos, min_size=32)]
return data_infos
def _parse_annotations(self, img_info):
ann_ids = self.coco.getAnnIds(imgIds=img_info['id'])
ann_info = self.coco.loadAnns(ann_ids)
annotations = {'phrases': [], 'caption': img_info['caption'], 'masks': [], 'tokens_positive': []}
image_path = os.path.join(self.image_folder, img_info['file_name'])
image = Image.open(image_path).convert('RGB')
if hasattr(self, 'extra_image_processor'):
g_image = np.array(image) # for grounding
g_image = self.extra_image_processor.apply_image(g_image)
g_pixel_values = torch.from_numpy(g_image).permute(2, 0, 1).contiguous()
annotations['g_pixel_values'] = g_pixel_values
width, height = image.size
if self.pad_image_to_square:
image = expand2square(
image, tuple(int(x * 255) for x in self.image_processor.image_mean))
image = self.image_processor.preprocess(image, return_tensors='pt')['pixel_values'][0]
annotations['pixel_values'] = image
for ann in ann_info:
if ann.get('ignore', False):
continue
x1, y1, w, h = ann['bbox']
inter_w = max(0, min(x1 + w, img_info['width']) - max(x1, 0))
inter_h = max(0, min(y1 + h, img_info['height']) - max(y1, 0))
if inter_w * inter_h == 0 or ann['area'] <= 0 or w < 1 or h < 1:
continue
bbox = [x1, y1, x1 + w, y1 + h]
tokens_positive = ann['tokens_positive']
phrase = [img_info['caption'][span[0]:span[1]] for span in tokens_positive]
annotations['phrases'].append(phrase[0])
annotations['tokens_positive'].append(tokens_positive[0])
rle = ann['sam_mask']
mask_decoded = mask_utils.decode(rle).astype(np.uint8)
annotations['masks'].append(mask_decoded)
def sort_by_start_index(items, order):
return [items[i] for i in order]
phrase_order = sorted(range(len(annotations['tokens_positive'])), key=lambda x: annotations['tokens_positive'][x][0])
annotations['masks'] = sort_by_start_index(annotations['masks'], phrase_order)
annotations['phrases'] = sort_by_start_index(annotations['phrases'], phrase_order)
annotations['tokens_positive'] = sort_by_start_index(annotations['tokens_positive'], phrase_order)
return annotations
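# Quick smoke test: build the Flickr30k GCG dataset from plain config dicts
# (resolved through BUILDER) and iterate over a few samples.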
if __name__ == '__main__':
from transformers import CLIPImageProcessor, AutoTokenizer
from third_parts.segment_anything.utils.transforms import ResizeLongestSide
pretrained_model = 'MBZUAI/GLaMM-GranD-Pretrained'
llm_name_or_path = 'lmsys/vicuna-7b-v1.5'
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path)
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path='openai/clip-vit-large-patch14-336')
extra_image_processor = dict(
type=ResizeLongestSide,
target_length=1024,
)
from xtuner.utils.templates import PROMPT_TEMPLATE
prompt_template = PROMPT_TEMPLATE.vicuna
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory, template_map_fn
from projects.glamm.datasets.collate_fns.glamm_collate_fn import glamm_collate_fn
dataset = Flickr30kGCGDataset(
image_folder='data/flickr30k/flickr30k-images/',
image_processor=image_processor,
data_path='./data/GranDf/annotations/train/flickr_mergedGT_GCG_train.json',
tokenizer=tokenizer,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=2048,
pad_image_to_square=True,
repeats=1,
num_classes_per_sample=3,
extra_image_processor=extra_image_processor)
for i in range(1000):
print(dataset[i])
import copy
import random
import glob
import json
import logging
import os
import torch
from mmengine import print_log
from mmengine.config import Config, ConfigDict
from PIL import Image
from torch.utils.data import Dataset
import numpy as np
import torch.nn.functional as F
from pycocotools.coco import COCO
from pycocotools import mask as mask_utils
from xtuner.registry import BUILDER
from xtuner.dataset.utils import encode_fn
from xtuner.dataset.map_fns import llava_map_fn
from projects.glamm.datasets.utils.utils import expand2square
from projects.glamm.datasets.utils.utils import SEG_QUESTIONS, ANSWER_LIST
from projects.glamm.utils import DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from third_parts.mmdet.datasets.refcoco import RefCocoDataset
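# Referring-expression segmentation dataset. It reuses the mmdet RefCOCO loader
# for image/expression pairs and converts each sampled expression into a
# single-turn "segment the <expr>" conversation answered with a [SEG] token.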
class ReferSegmDataset(RefCocoDataset):
def __init__(self,
data_root,
ann_file=None,
split_file=None,
image_processor=None,
extra_image_processor=None,
data_prefix=dict(img_path='train2014/'),
tokenizer=None,
template_map_fn=None,
max_length=2048,
pad_image_to_square=False,
num_classes_per_sample=3):
super().__init__(
data_root=data_root,
data_prefix=data_prefix,
pipeline=None,
ann_file=ann_file,
split_file=split_file,
)
self.begin_str = f"""{DEFAULT_IMAGE_TOKEN} provides an overview of the picture.\n"""
self.question_templates = SEG_QUESTIONS
if extra_image_processor is not None:
self.extra_image_processor = BUILDER.build(extra_image_processor)
self.num_classes_per_sample = num_classes_per_sample
self.tokenizer = BUILDER.build(tokenizer)
self.tokenizer.add_tokens(
[DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True
)
reg_tokens = ['<bbox>', '<point>']
segmentation_tokens = ['[SEG]']
phrase_tokens = ['<p>', '</p>']
special_tokens = reg_tokens + segmentation_tokens + phrase_tokens
self.tokenizer.add_tokens(special_tokens, special_tokens=True)
self.max_length = max_length
self.template_map_fn = BUILDER.build(template_map_fn)
self.image_processor = BUILDER.build(image_processor)
size = self.image_processor.crop_size
if isinstance(size, dict):
self.image_w, self.image_h = size['width'], size['height']
self.pad_image_to_square = pad_image_to_square
@property
def modality_length(self):
import pickle
length_list = []
for idx in range(len(self)):
length_list.append(100)
# for idx in range(len(self)):
# if self.serialize_data:
# start_addr = 0 if idx == 0 else self.data_address[idx - 1].item()
# end_addr = self.data_address[idx].item()
# bytes = memoryview(
# self.data_bytes[start_addr:end_addr]) # type: ignore
# data_dict = pickle.loads(bytes)
# else:
# data_dict = copy.deepcopy(self.data_list[idx])
return length_list
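    # Per sample, up to num_classes_per_sample referring expressions are drawn
    # (np.random.choice with the default replace=True, so duplicates are
    # possible) and each polygon list is rasterised into one binary mask.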
def _parse_annotations(self, ann_info):
image_path = ann_info['img_path']
image = Image.open(image_path).convert('RGB')
if hasattr(self, 'extra_image_processor'):
g_image = np.array(image) # for grounding
g_image = self.extra_image_processor.apply_image(g_image)
g_pixel_values = torch.from_numpy(
g_image).permute(2, 0, 1).contiguous()
ann_info['g_pixel_values'] = g_pixel_values
width, height = image.size
if self.pad_image_to_square:
image = expand2square(
image, tuple(int(x * 255) for x in self.image_processor.image_mean))
image = self.image_processor.preprocess(
image, return_tensors='pt')['pixel_values'][0]
ann_info['pixel_values'] = image
masks, phrases = [], []
instances, text = ann_info['instances'], ann_info['text']
index = np.random.choice(range(len(instances)), min(
len(instances), self.num_classes_per_sample))
for idx in index:
inst = instances[idx]
phrase = text[idx].lower()
phrases.append(phrase)
binary_mask = np.zeros((height, width), dtype=np.uint8)
for seg in inst["mask"]:
rles = mask_utils.frPyObjects([seg], height, width)
m = mask_utils.decode(rles)
m = m.astype(np.uint8)
binary_mask += m.squeeze()
masks.append(binary_mask)
ann_info.update({
'masks': masks,
'phrases': phrases,
})
return ann_info
def __getitem__(self, idx):
data_dict = {}
ann_info = super().__getitem__(idx)
ann_info = self._parse_annotations(ann_info)
data_dict['g_pixel_values'] = ann_info.pop('g_pixel_values')
data_dict['pixel_values'] = ann_info.pop('pixel_values')
if len(ann_info['masks']) == 0:
return self.__getitem__(0)
data_dict['masks'] = torch.from_numpy(
np.stack(ann_info['masks'], axis=0))
conversation = []
for i, phrase in enumerate(ann_info['phrases']):
question = random.choice(SEG_QUESTIONS).format(class_name=phrase)
conversation.append(
{'input': question, 'output': random.choice(ANSWER_LIST)})
data_dict['conversation'] = conversation
result = self.template_map_fn(data_dict)
data_dict.update(result)
result = encode_fn(data_dict, tokenizer=self.tokenizer,
max_length=self.max_length, with_image_token=True)
data_dict.update(result)
return data_dict
if __name__ == '__main__':
from transformers import CLIPImageProcessor, AutoTokenizer
from third_parts.segment_anything.utils.transforms import ResizeLongestSide
pretrained_model = 'MBZUAI/GLaMM-GranD-Pretrained'
llm_name_or_path = 'lmsys/vicuna-7b-v1.5'
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path)
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path='openai/clip-vit-large-patch14-336')
extra_image_processor = dict(
type=ResizeLongestSide,
target_length=1024,
)
from xtuner.utils.templates import PROMPT_TEMPLATE
prompt_template = PROMPT_TEMPLATE.vicuna
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory, template_map_fn
from projects.glamm.datasets.collate_fns.glamm_collate_fn import glamm_collate_fn
dataset = ReferSegmDataset(
tokenizer=tokenizer,
image_processor=image_processor,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
extra_image_processor=extra_image_processor,
data_root='data/coco/',
data_prefix=dict(img_path='train2014/'),
ann_file='refcoco+/instances.json',
split_file='refcoco+/refs(unc).p',
)
for i in range(1000):
dataset[i]
import copy
import random
import glob
import json
import logging
import os
import torch
from mmengine import print_log
from mmengine.config import Config, ConfigDict
from PIL import Image
from torch.utils.data import Dataset
import numpy as np
import torch.nn.functional as F
from pycocotools.coco import COCO
from pycocotools import mask as mask_utils
from xtuner.registry import BUILDER
from xtuner.dataset.utils import encode_fn
from xtuner.dataset.map_fns import llava_map_fn
from projects.glamm.datasets.utils.utils import expand2square
from projects.glamm.datasets.utils.utils import ANSWER_LIST, REGION_QUESTIONS
from projects.glamm.utils import DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
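# Region-captioning dataset: each selected box is referred to in the question
# as 'region{i} <bbox>', and the model is supervised to output the region
# caption. Boxes are rescaled to the preprocessed image and normalised before
# being attached to the sample as 'bboxes'.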
class RegionDataset(Dataset):
def __init__(self,
image_folder,
image_processor,
data_path=None,
tokenizer=None,
template_map_fn=None,
max_length=2048,
pad_image_to_square=False,
repeats=1,
num_classes_per_sample=3,
extra_image_processor=None):
super().__init__()
self.begin_str = f"""{DEFAULT_IMAGE_TOKEN} provides an overview of the picture.\n"""
self.question_templates = REGION_QUESTIONS
if extra_image_processor is not None:
self.extra_image_processor = BUILDER.build(extra_image_processor)
self.num_classes_per_sample = num_classes_per_sample
self.tokenizer = BUILDER.build(tokenizer)
self.tokenizer.add_tokens(
[DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True
)
reg_tokens = ['<bbox>', '<point>']
segmentation_tokens = ['[SEG]']
phrase_tokens = ['<p>', '</p>']
special_tokens = reg_tokens + segmentation_tokens + phrase_tokens
self.tokenizer.add_tokens(special_tokens, special_tokens=True)
self.max_length = max_length
self.template_map_fn = BUILDER.build(template_map_fn)
self.text_data = self._load_annotations(data_path, image_folder)
self.image_folder = image_folder
self.image_processor = BUILDER.build(image_processor)
size = self.image_processor.crop_size
if isinstance(size, dict):
self.image_w, self.image_h = size['width'], size['height']
elif isinstance(size, int):
self.image_h, self.image_w = size, size
else:
self.image_w, self.image_h = size
self.pad_image_to_square = pad_image_to_square
self.repeats = repeats
def _load_annotations(self, data_path, image_folder=None):
self.coco = COCO(data_path)
img_ids = self.coco.getImgIds()
data_infos = []
for img_id in img_ids:
info = self.coco.loadImgs([img_id])[0]
info['filename'] = info['file_name'].split('_')[-1]
info['height'] = int(info['height'])
info['width'] = int(info['width'])
if min(info['height'], info['width']) < 32:
continue
data_infos.append(info)
return data_infos
@property
def modality_length(self):
length_list = []
for data_dict in self.text_data:
cur_len = 100
length_list.append(cur_len)
return length_list * self.repeats
def __len__(self):
return len(self.text_data) * self.repeats
def real_len(self):
return len(self.text_data)
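    # region_processor maps boxes from original-image coordinates to the
    # preprocessed (post-resize/pad) image and then normalises them by post_h.
    # Illustrative example (hypothetical numbers): a 640x480 image resized to
    # 336x336 scales x by 336/640 and y by 336/480, so [64, 48, 320, 240]
    # becomes [33.6, 33.6, 168.0, 168.0] and, after / post_h, [0.1, 0.1, 0.5, 0.5].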
def region_processor(self, orig_size, post_size, bboxes, labels):
orig_h, orig_w = orig_size
post_h, post_w = post_size
y_scale = post_h / orig_h
x_scale = post_w / orig_w
shuffle_ids = torch.randperm(len(labels))[:self.num_classes_per_sample]
selected_bboxes = bboxes[shuffle_ids]
# Ensure selected_bboxes is two-dimensional
if len(selected_bboxes.shape) == 1:
selected_bboxes = np.expand_dims(selected_bboxes, axis=0)
selected_labels = [labels[i] for i in shuffle_ids]
selected_bboxes[:, [0, 2]] *= x_scale
selected_bboxes[:, [1, 3]] *= y_scale
selected_bboxes = torch.tensor(
selected_bboxes, dtype=torch.float32) / post_h
return selected_bboxes, selected_labels
def _parse_annotations(self, img_info):
data_dict = {}
bboxes, captions = [], []
ann_info = self.coco.loadAnns(self.coco.getAnnIds(imgIds=img_info['id']))
image_path = os.path.join(self.image_folder, img_info['file_name'])
image = Image.open(image_path).convert('RGB')
if hasattr(self, 'extra_image_processor'):
g_image = np.array(image) # for grounding
g_image = self.extra_image_processor.apply_image(g_image)
g_pixel_values = torch.from_numpy(
g_image).permute(2, 0, 1).contiguous()
data_dict['g_pixel_values'] = g_pixel_values
orig_w, orig_h = image.size
if self.pad_image_to_square:
image = expand2square(
image, tuple(int(x * 255) for x in self.image_processor.image_mean))
image = self.image_processor.preprocess(
image, return_tensors='pt')['pixel_values'][0]
post_h, post_w = image.shape[1:3]
data_dict['pixel_values'] = image
for ann in ann_info:
if ann.get('ignore', False) or ann['area'] <= 0 or ann['bbox'][2] < 1 or ann['bbox'][3] < 1:
continue
x1, y1, w, h = ann['bbox']
inter_w = max(0, min(x1 + w, orig_w) - max(x1, 0))
inter_h = max(0, min(y1 + h, orig_h) - max(y1, 0))
if inter_w * inter_h == 0:
continue
bbox = [x1, y1, x1 + w, y1 + h]
if bbox:
bboxes.append(bbox)
captions.append(img_info['caption'])
if len(bboxes) == 0:
return self.__getitem__(0)
bboxes = np.array(bboxes, dtype=np.float32)
        seg_map = img_info['file_name'].replace('.jpg', '.png')
bboxes, captions = self.region_processor((orig_h, orig_w), (post_h, post_w), bboxes, captions)
data_dict['bboxes'] = bboxes
data_dict['captions'] = captions
data_dict['seg_map'] = seg_map
return data_dict
def create_conversation(self, captions):
questions = []
answers = []
for i, label in enumerate(captions):
question = random.choice(self.question_templates).strip().replace('<region>', f'region{i + 1} <bbox>')
questions.append(question)
answers.append(label)
conversation = []
for i, (question, answer) in enumerate(zip(questions, answers)):
if i == 0:
question = self.begin_str + question
conversation.append({'input': question, 'output': answer})
return conversation
def __getitem__(self, index):
index = index % self.real_len()
data_dict = {}
ann_info = copy.deepcopy(self.text_data[index])
ann_info = self._parse_annotations(ann_info)
data_dict['g_pixel_values'] = ann_info.pop('g_pixel_values', None)
data_dict['pixel_values'] = ann_info.pop('pixel_values')
data_dict['bboxes'] = ann_info.pop('bboxes', None)
conversation = self.create_conversation(ann_info['captions'])
data_dict['conversation'] = conversation
result = self.template_map_fn(data_dict)
data_dict.update(result)
result = encode_fn(data_dict, tokenizer=self.tokenizer,
max_length=self.max_length, with_image_token=True)
data_dict.update(result)
return data_dict
class RefCocoGRegionDataset(RegionDataset):
pass
class VisualGenomeRegionDataset(RegionDataset):
def _parse_annotations(self, img_info):
data_dict = {}
bboxes, captions = [], []
ann_info = self.coco.loadAnns(self.coco.getAnnIds(imgIds=img_info['id']))
image_path = os.path.join(self.image_folder, img_info['file_name'])
image = Image.open(image_path).convert('RGB')
if hasattr(self, 'extra_image_processor'):
g_image = np.array(image) # for grounding
g_image = self.extra_image_processor.apply_image(g_image)
g_pixel_values = torch.from_numpy(
g_image).permute(2, 0, 1).contiguous()
data_dict['g_pixel_values'] = g_pixel_values
orig_w, orig_h = image.size
if self.pad_image_to_square:
image = expand2square(
image, tuple(int(x * 255) for x in self.image_processor.image_mean))
image = self.image_processor.preprocess(
image, return_tensors='pt')['pixel_values'][0]
post_h, post_w = image.shape[1:3]
data_dict['pixel_values'] = image
for ann in ann_info:
if ann.get('ignore', False) or ann['area'] <= 0 or ann['bbox'][2] < 1 or ann['bbox'][3] < 1:
continue
x1, y1, w, h = ann['bbox']
inter_w = max(0, min(x1 + w, orig_w) - max(x1, 0))
inter_h = max(0, min(y1 + h, orig_h) - max(y1, 0))
if inter_w * inter_h == 0:
continue
bbox = [x1, y1, x1 + w, y1 + h]
if bbox:
bboxes.append(bbox)
captions.append(ann['caption'].strip())
if len(bboxes) == 0:
return self.__getitem__(0)
bboxes = np.array(bboxes, dtype=np.float32)
        seg_map = img_info['file_name'].replace('.jpg', '.png')
bboxes, captions = self.region_processor((orig_h, orig_w), (post_h, post_w), bboxes, captions)
data_dict['bboxes'] = bboxes
data_dict['captions'] = captions
data_dict['seg_map'] = seg_map
return data_dict
if __name__ == '__main__':
from transformers import CLIPImageProcessor, AutoTokenizer
from third_parts.segment_anything.utils.transforms import ResizeLongestSide
pretrained_model = 'MBZUAI/GLaMM-GranD-Pretrained'
llm_name_or_path = 'lmsys/vicuna-7b-v1.5'
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path)
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path='openai/clip-vit-large-patch14-336')
extra_image_processor = dict(
type=ResizeLongestSide,
target_length=1024,
)
from xtuner.utils.templates import PROMPT_TEMPLATE
prompt_template = PROMPT_TEMPLATE.vicuna
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory, template_map_fn
from projects.glamm.datasets.collate_fns.glamm_collate_fn import glamm_collate_fn
dataset = VisualGenomeRegionDataset(
image_folder='./data/visual_genome/images',
image_processor=image_processor,
data_path='data/visual_genome/train.json',
tokenizer=tokenizer,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=2048,
pad_image_to_square=False,
repeats=1,
num_classes_per_sample=3,
extra_image_processor=None)
for i in range(1000):
print(dataset[i])
import copy
import random
import glob
import json
import logging
import os
import torch
from mmengine import print_log
from mmengine.config import Config, ConfigDict
from PIL import Image
from torch.utils.data import Dataset
import numpy as np
import torch.nn.functional as F
from pycocotools.coco import COCO
from xtuner.registry import BUILDER
from xtuner.dataset.utils import encode_fn
from xtuner.dataset.map_fns import llava_map_fn
from projects.glamm.datasets.utils.utils import expand2square
from projects.glamm.datasets.utils.utils import SEG_QUESTIONS, ANSWER_LIST
from projects.glamm.utils import DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
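# Semantic-segmentation dataset (ADE20k layout by default): each sample picks
# up to num_classes_per_sample classes present in the label map and turns each
# one into a "segment the <class>" question with a [SEG]-style answer.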
class SemanticSegDataset(Dataset):
def __init__(self,
image_folder,
image_processor,
data_path=None,
tokenizer=None,
offline_processed_text_folder=None,
max_dataset_length=None,
dataset_map_fn=None,
template_map_fn=None,
max_length=2048,
pad_image_to_square=False,
num_proc=8,
lazy=False,
repeats=1,
gcg_format=False,
num_classes_per_sample=3,
extra_image_processor=None):
super().__init__()
self.gcg_format = gcg_format
if extra_image_processor is not None:
self.extra_image_processor = BUILDER.build(extra_image_processor)
self.num_classes_per_sample = num_classes_per_sample
self.tokenizer = BUILDER.build(tokenizer)
self.tokenizer.add_tokens(
[DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True
)
reg_tokens = ['<bbox>', '<point>']
segmentation_tokens = ['[SEG]']
phrase_tokens = ['<p>', '</p>']
special_tokens = reg_tokens + segmentation_tokens + phrase_tokens
self.tokenizer.add_tokens(special_tokens, special_tokens=True)
assert offline_processed_text_folder or (data_path and tokenizer)
self.lazy = lazy
self.max_length = max_length
self.dataset_map_fn = dataset_map_fn
self.template_map_fn = template_map_fn
if isinstance(self.template_map_fn, dict) and self.lazy:
_type = self.template_map_fn['type']
del self.template_map_fn['type']
self.template_map_fn = _type(**self.template_map_fn)
if offline_processed_text_folder and data_path:
print_log(
'Both `offline_processed_text_folder` and '
'`data_path` are set, and we load dataset from'
'`offline_processed_text_folder` '
f'({offline_processed_text_folder})',
logger='current',
level=logging.WARNING)
if offline_processed_text_folder is not None:
raise NotImplementedError
else:
self.image_label_datas = self.json_file_preprocess(data_path, image_folder)
self.image_folder = image_folder
        if isinstance(image_processor, (dict, Config, ConfigDict)):
self.image_processor = BUILDER.build(image_processor)
else:
self.image_processor = image_processor
size = self.image_processor.crop_size
if isinstance(size, dict):
self.image_w, self.image_h = size['width'], size['height']
elif isinstance(size, int):
self.image_h, self.image_w = size, size
else:
self.image_w, self.image_h = size
self.pad_image_to_square = pad_image_to_square
self.down_ratio = 1
self.repeats = repeats
def json_file_preprocess(self, data_path, image_folder):
# ade20k
with open(data_path, 'r') as file:
ade20k_classes = json.load(file)
ade20k_image_dir = image_folder
ade20k_images = [os.path.join(ade20k_image_dir, img) for img in os.listdir(ade20k_image_dir) if
img.endswith('.jpg')]
ade20k_labels = [img.replace(".jpg", ".png").replace(
"images", "annotations") for img in ade20k_images]
self.classes = np.array(ade20k_classes)
ret = []
for image, label in zip(ade20k_images, ade20k_labels):
ret.append({"image": image, "label": label})
return ret
def __len__(self):
return len(self.image_label_datas) * self.repeats
@property
def modality_length(self):
length_list = []
for data_dict in self.image_label_datas:
length_list.append(100)
length_list = length_list * self.repeats
return length_list
def real_len(self):
return len(self.image_label_datas)
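    # ADE20k convention: pixel value 0 means "unlabeled", so decode_mask maps
    # 0 -> 255 (ignore) and shifts the remaining ids down by one before
    # sampling classes and building per-class boolean masks.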
def decode_mask(self, label_path):
label = np.array(Image.open(label_path))
# ade20k
label = np.where(label == 0, 255, label - 1)
unique_labels = [lbl for lbl in np.unique(label) if lbl != 255]
if not unique_labels:
return None, None
selected_labels = np.random.choice(unique_labels, min(
len(unique_labels), self.num_classes_per_sample), replace=False)
label = torch.from_numpy(label).long()
masks = torch.stack([label == class_id for class_id in selected_labels], dim=0)
return masks, selected_labels
def __getitem__(self, index):
index = index % self.real_len()
data_dict = copy.deepcopy(self.image_label_datas[index])
assert 'image' in data_dict.keys()
if data_dict.get('image', None) is not None:
image_file = data_dict['image']
image = Image.open(image_file).convert('RGB')
if hasattr(self, 'extra_image_processor'):
g_image = np.array(image) # for grounding
g_image = self.extra_image_processor.apply_image(g_image)
g_pixel_values = torch.from_numpy(g_image).permute(2, 0, 1).contiguous()
data_dict['g_pixel_values'] = g_pixel_values
ori_width, ori_height = image.size
if self.pad_image_to_square:
image = expand2square(image, tuple(int(x * 255)
for x in self.image_processor.image_mean))
image = self.image_processor.preprocess(
image, return_tensors='pt')['pixel_values'][0]
data_dict['pixel_values'] = image
# process and get masks
data_dict['masks'], class_id = self.decode_mask(data_dict['label'])
if class_id is None:
return self.__getitem__(0)
if self.gcg_format:
pass
else:
conversation = []
for i, c_id in enumerate(class_id):
question = random.choice(SEG_QUESTIONS).format(
class_name=self.classes[c_id].lower())
if i == 0:
question = f"""The {DEFAULT_IMAGE_TOKEN} provides an overview of the picture.\n""" + question
conversation.append(
{'input': question, 'output': random.choice(ANSWER_LIST)})
data_dict.update({'conversation': conversation})
else:
if hasattr(self.image_processor, 'crop_size'):
crop_size = self.image_processor.crop_size
else:
crop_size = self.image_processor.size
data_dict['pixel_values'] = torch.zeros(3, crop_size['height'],
crop_size['width'])
data_dict['masks'] = None
if self.lazy:
result = self.template_map_fn(data_dict)
data_dict.update(result)
result = encode_fn(data_dict, tokenizer=self.tokenizer,
max_length=self.max_length, with_image_token=True)
data_dict.update(result)
return data_dict
class ADE20kSemanticSegDataset(SemanticSegDataset):
def __init__(self,
image_folder,
image_processor,
data_path=None,
tokenizer=None,
offline_processed_text_folder=None,
max_dataset_length=None,
dataset_map_fn=None,
template_map_fn=None,
max_length=2048,
pad_image_to_square=False,
num_proc=8,
lazy=False,
repeats=1,
gcg_format=False,
num_classes_per_sample=3,
extra_image_processor=None):
super().__init__(
image_folder=image_folder,
image_processor=image_processor,
data_path=data_path,
tokenizer=tokenizer,
offline_processed_text_folder=offline_processed_text_folder,
max_dataset_length=max_dataset_length,
dataset_map_fn=dataset_map_fn,
template_map_fn=template_map_fn,
max_length=max_length,
pad_image_to_square=pad_image_to_square,
num_proc=num_proc,
lazy=lazy,
repeats=repeats,
gcg_format=gcg_format,
num_classes_per_sample=num_classes_per_sample,
extra_image_processor=extra_image_processor,
)
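# COCO-Stuff variant: class names are read from data_path, label maps are found
# under label_path, and decode_mask maps every class whose name contains '-'
# to the 255 ignore index before sampling classes.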
class COCOStuffSemanticSegDataset(SemanticSegDataset):
def __init__(self,
image_folder,
image_processor,
data_path=None,
tokenizer=None,
offline_processed_text_folder=None,
max_dataset_length=None,
dataset_map_fn=None,
template_map_fn=None,
max_length=2048,
pad_image_to_square=False,
num_proc=8,
lazy=False,
repeats=1,
label_path=None,
gcg_format=False,
num_classes_per_sample=3,
extra_image_processor=None):
self.label_path = label_path
super().__init__(
image_folder=image_folder,
image_processor=image_processor,
data_path=data_path,
tokenizer=tokenizer,
offline_processed_text_folder=offline_processed_text_folder,
max_dataset_length=max_dataset_length,
dataset_map_fn=dataset_map_fn,
template_map_fn=template_map_fn,
max_length=max_length,
pad_image_to_square=pad_image_to_square,
num_proc=num_proc,
lazy=lazy,
repeats=repeats,
gcg_format=gcg_format,
num_classes_per_sample=num_classes_per_sample,
extra_image_processor=extra_image_processor,
)
self.cocostuff_class2index = {c: i for i, c in enumerate(self.classes)}
def json_file_preprocess(self, data_path, image_folder):
# coco stuff
assert self.label_path is not None
with open(data_path, 'r') as file:
cocostuff_classes = [line.strip().split(": ")[-1]
for line in file.readlines()[1:]]
coco_stuff_image_dir = image_folder
coco_stuff_label_dir = self.label_path
coco_stuff_labels = glob.glob(
os.path.join(coco_stuff_label_dir, "*.png"))
coco_stuff_images = [label.replace(".png", ".jpg").replace(coco_stuff_label_dir, coco_stuff_image_dir)
for label in coco_stuff_labels]
self.classes = np.array(cocostuff_classes)
ret = []
for image, label in zip(coco_stuff_images, coco_stuff_labels):
ret.append({"image": image, "label": label})
return ret
def decode_mask(self, label_path):
label = np.array(Image.open(label_path))
# coco stuff
ignored_classes = [index for class_name,
index in self.cocostuff_class2index.items() if "-" in class_name]
label = np.where(np.isin(label, ignored_classes), 255, label)
unique_labels = [lbl for lbl in np.unique(label) if lbl != 255]
if not unique_labels:
print("No valid label !!!")
return None, None
# only choose 1
selected_labels = np.random.choice(unique_labels, min(
len(unique_labels), self.num_classes_per_sample), replace=False)
label = torch.from_numpy(label).long()
masks = torch.stack(
[label == class_id for class_id in selected_labels], dim=0)
return masks, selected_labels
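# Pascal-Part annotations name categories as 'object:part'; the mapping built
# below keeps them as (object, part) tuples so __getitem__ can phrase the
# question either as 'object part' or 'the part of the object'.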
class PascalPartSemanticSegDataset(SemanticSegDataset):
def json_file_preprocess(self, data_path, image_folder):
self.coco_api = COCO(data_path)
img_ids = self.coco_api.getImgIds()
all_classes = self.coco_api.loadCats(self.coco_api.getCatIds())
class_map_pascal_part = {}
for cat in all_classes:
cat_main, cat_part = cat["name"].strip().split(":")
name = (cat_main, cat_part)
class_map_pascal_part[cat["id"]] = name
self.classes = class_map_pascal_part
return img_ids
def __getitem__(self, index):
index = index % self.real_len()
img_id = self.image_label_datas[index]
img_info = self.coco_api.loadImgs([img_id])[0]
file_name = img_info["file_name"]
data_dict = {}
image_file = os.path.join(self.image_folder, file_name)
image = Image.open(image_file).convert('RGB')
if hasattr(self, 'extra_image_processor'):
g_image = np.array(image) # for grounding
g_image = self.extra_image_processor.apply_image(g_image)
g_pixel_values = torch.from_numpy(g_image).permute(2, 0, 1).contiguous()
data_dict['g_pixel_values'] = g_pixel_values
if self.pad_image_to_square:
image = expand2square(
image, tuple(int(x * 255) for x in self.image_processor.image_mean))
image = self.image_processor.preprocess(image, return_tensors='pt')['pixel_values'][0]
data_dict['pixel_values'] = image
annotation_ids = self.coco_api.getAnnIds(imgIds=img_info["id"])
annotations = self.coco_api.loadAnns(annotation_ids)
if not annotations:
return self.__getitem__(0)
sampled_anns = np.random.choice(annotations, min(
len(annotations), self.num_classes_per_sample), replace=False)
conversation = []
for i, ann in enumerate(sampled_anns):
cat_id = ann['category_id']
sampled_cls = self.classes[cat_id]
if isinstance(sampled_cls, tuple):
obj, part = sampled_cls
name = f"{obj} {part}" if random.random() < 0.5 else f"the {part} of the {obj}"
else:
name = sampled_cls
question = random.choice(SEG_QUESTIONS).format(class_name=name)
if i == 0:
question = f"""The {DEFAULT_IMAGE_TOKEN} provides an overview of the picture.\n""" + question
conversation.append(
{'input': question, 'output': random.choice(ANSWER_LIST)})
masks = [self.coco_api.annToMask(ann) for ann in sampled_anns]
masks = np.stack(masks, axis=0)
masks = torch.from_numpy(masks)
data_dict['masks'] = masks
data_dict['conversation'] = conversation
if self.lazy:
result = self.template_map_fn(data_dict)
data_dict.update(result)
result = encode_fn(data_dict, tokenizer=self.tokenizer, max_length=self.max_length, with_image_token=True)
data_dict.update(result)
return data_dict
class PacoSemanticSegDataset(PascalPartSemanticSegDataset):
def json_file_preprocess(self, data_path, image_folder):
self.coco_api = COCO(data_path)
all_classes = self.coco_api.loadCats(self.coco_api.getCatIds())
class_map_paco = {}
for cat in all_classes:
cat_split = cat["name"].strip().split(":")
if len(cat_split) == 1:
name = cat_split[0].split("_(")[0]
else:
assert len(cat_split) == 2
obj, part = cat_split
obj = obj.split("_(")[0]
part = part.split("_(")[0]
name = (obj, part)
class_map_paco[cat["id"]] = name
self.classes = class_map_paco
return self.coco_api.getImgIds()
[
"wall", "building", "sky", "floor", "tree", "ceiling", "road",
"bed", "windowpane", "grass", "cabinet", "sidewalk",
"person", "earth", "door", "table", "mountain", "plant",
"curtain", "chair", "car", "water", "painting", "sofa",
"shelf", "house", "sea", "mirror", "rug", "field", "armchair",
"seat", "fence", "desk", "rock", "wardrobe", "lamp",
"bathtub", "railing", "cushion", "base", "box", "column",
"signboard", "chest of drawers", "counter", "sand", "sink",
"skyscraper", "fireplace", "refrigerator", "grandstand",
"path", "stairs", "runway", "case", "pool table", "pillow",
"screen door", "stairway", "river", "bridge", "bookcase",
"blind", "coffee table", "toilet", "flower", "book", "hill",
"bench", "countertop", "stove", "palm", "kitchen island",
"computer", "swivel chair", "boat", "bar", "arcade machine",
"hovel", "bus", "towel", "light", "truck", "tower",
"chandelier", "awning", "streetlight", "booth",
"television receiver", "airplane", "dirt track", "apparel",
"pole", "land", "bannister", "escalator", "ottoman", "bottle",
"buffet", "poster", "stage", "van", "ship", "fountain",
"conveyer belt", "canopy", "washer", "plaything",
"swimming pool", "stool", "barrel", "basket", "waterfall",
"tent", "bag", "minibike", "cradle", "oven", "ball", "food",
"step", "tank", "trade name", "microwave", "pot", "animal",
"bicycle", "lake", "dishwasher", "screen", "blanket",
"sculpture", "hood", "sconce", "vase", "traffic light",
"tray", "ashcan", "fan", "pier", "crt screen", "plate",
"monitor", "bulletin board", "shower", "radiator", "glass",
"clock", "flag"
]
0: unlabeled
1: person
2: bicycle
3: car
4: motorcycle
5: airplane
6: bus
7: train
8: truck
9: boat
10: traffic light
11: fire hydrant
12: street sign
13: stop sign
14: parking meter
15: bench
16: bird
17: cat
18: dog
19: horse
20: sheep
21: cow
22: elephant
23: bear
24: zebra
25: giraffe
26: hat
27: backpack
28: umbrella
29: shoe
30: eye glasses
31: handbag
32: tie
33: suitcase
34: frisbee
35: skis
36: snowboard
37: sports ball
38: kite
39: baseball bat
40: baseball glove
41: skateboard
42: surfboard
43: tennis racket
44: bottle
45: plate
46: wine glass
47: cup
48: fork
49: knife
50: spoon
51: bowl
52: banana
53: apple
54: sandwich
55: orange
56: broccoli
57: carrot
58: hot dog
59: pizza
60: donut
61: cake
62: chair
63: couch
64: potted plant
65: bed
66: mirror
67: dining table
68: window
69: desk
70: toilet
71: door
72: tv
73: laptop
74: mouse
75: remote
76: keyboard
77: cell phone
78: microwave
79: oven
80: toaster
81: sink
82: refrigerator
83: blender
84: book
85: clock
86: vase
87: scissors
88: teddy bear
89: hair drier
90: toothbrush
91: hair brush
92: banner
93: blanket
94: branch
95: bridge
96: building-other
97: bush
98: cabinet
99: cage
100: cardboard
101: carpet
102: ceiling-other
103: ceiling-tile
104: cloth
105: clothes
106: clouds
107: counter
108: cupboard
109: curtain
110: desk-stuff
111: dirt
112: door-stuff
113: fence
114: floor-marble
115: floor-other
116: floor-stone
117: floor-tile
118: floor-wood
119: flower
120: fog
121: food-other
122: fruit
123: furniture-other
124: grass
125: gravel
126: ground-other
127: hill
128: house
129: leaves
130: light
131: mat
132: metal
133: mirror-stuff
134: moss
135: mountain
136: mud
137: napkin
138: net
139: paper
140: pavement
141: pillow
142: plant-other
143: plastic
144: platform
145: playingfield
146: railing
147: railroad
148: river
149: road
150: rock
151: roof
152: rug
153: salad
154: sand
155: sea
156: shelf
157: sky
158: skyscraper
159: snow
160: solid-other
161: stairs
162: stone
163: straw
164: structural-other
165: table
166: tent
167: textile-other
168: towel
169: tree
170: vegetable
171: wall-brick
172: wall-concrete
173: wall-other
174: wall-panel
175: wall-stone
176: wall-tile
177: wall-wood
178: water-other
179: waterdrops
180: window-blind
181: window-other
182: wood
from PIL import Image
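# Pad a PIL image to a square canvas filled with background_color, keeping the
# original content centred along the shorter side.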
def expand2square(pil_img, background_color):
width, height = pil_img.size
if width == height:
return pil_img
elif width > height:
result = Image.new(pil_img.mode, (width, width), background_color)
result.paste(pil_img, (0, (width - height) // 2))
return result
else:
result = Image.new(pil_img.mode, (height, height), background_color)
result.paste(pil_img, ((height - width) // 2, 0))
return result
CAPTION_QUESTIONS = [
'Could you please give me a detailed description of the image?',
'Can you provide a thorough description of the this image?',
'Please provide a thorough description of the this image',
'Please provide a thorough description of the this image.',
'Please describe in detail the contents of the image.',
'Please describe in detail the contents of the image',
'Could you give a comprehensive explanation of what can be found within this picture?',
'Could you give me an elaborate explanation of this picture?',
'Could you provide me with a detailed analysis of this photo?',
'Could you please give me a detailed description of the image?',
'Can you provide a thorough description of the this image?',
'Please describe in detail the contents of the image',
'Please describe in detail the contents of the image.',
'Can you give a comprehensive explanation of this photo',
'Please provide an elaborate explanation of this picture.',
'Please provide an elaborate explanation of this picture',
'Could you provide me with a detailed analysis of this photo',
]
REGION_QUESTIONS = [
'Can you provide me with a detailed description of the region in the picture marked by <region>?',
"I'm curious about the region represented by <region> in the picture. Could you describe it in detail?",
'What can you tell me about the region indicated by <region> in the image?',
"I'd like to know more about the area in the photo labeled <region>. Can you give me a detailed description?",
'Could you describe the region shown as <region> in the picture in great detail?',
'What details can you give me about the region outlined by <region> in the photo?',
'Please provide me with a comprehensive description of the region marked with <region> in the image.',
'Can you give me a detailed account of the region labeled as <region> in the picture?',
"I'm interested in learning more about the region represented by <region> in the photo. Can you describe it in detail?",
'What is the region outlined by <region> in the picture like? Could you give me a detailed description?',
'Can you provide me with a detailed description of the region in the picture marked by <region>, please?',
"I'm curious about the region represented by <region> in the picture. Could you describe it in detail, please?",
'What can you tell me about the region indicated by <region> in the image, exactly?',
"I'd like to know more about the area in the photo labeled <region>, please. Can you give me a detailed description?",
'Could you describe the region shown as <region> in the picture in great detail, please?',
'What details can you give me about the region outlined by <region> in the photo, please?',
'Please provide me with a comprehensive description of the region marked with <region> in the image, please.',
'Can you give me a detailed account of the region labeled as <region> in the picture, please?',
"I'm interested in learning more about the region represented by <region> in the photo. Can you describe it in detail, please?",
'What is the region outlined by <region> in the picture like, please? Could you give me a detailed description?',
]
REGION_GROUP_QUESTIONS = [
'Could you please give me a detailed description of these areas <region>?',
'Can you provide a thorough description of the regions <region> in this image?',
'Please describe in detail the contents of the boxed areas <region>.',
'Could you give a comprehensive explanation of what can be found within <region> in the picture?',
'Could you give me an elaborate explanation of the <region> regions in this picture?',
'Can you provide a comprehensive description of the areas identified by <region> in this photo?',
'Help me understand the specific locations labeled <region> in this picture in detail, please.',
'What is the detailed information about the areas marked by <region> in this image?',
'Could you provide me with a detailed analysis of the regions designated <region> in this photo?',
'What are the specific features of the areas marked <region> in this picture that you can describe in detail?',
'Could you elaborate on the regions identified by <region> in this image?',
'What can you tell me about the areas labeled <region> in this picture?',
'Can you provide a thorough analysis of the specific locations designated <region> in this photo?',
'I am interested in learning more about the regions marked <region> in this image. Can you provide me with more information?',
'Could you please provide a detailed description of the areas identified by <region> in this photo?',
'What is the significance of the regions labeled <region> in this picture?',
'I would like to know more about the specific locations designated <region> in this image. Can you provide me with more information?',
'Can you provide a detailed breakdown of the regions marked <region> in this photo?',
'What specific features can you tell me about the areas identified by <region> in this picture?',
'Could you please provide a comprehensive explanation of the locations labeled <region> in this image?',
'Can you provide a detailed account of the regions designated <region> in this photo?',
'I am curious about the areas marked <region> in this picture. Can you provide me with a detailed analysis?',
'What important details can you tell me about the specific locations identified by <region> in this image?',
'Could you please provide a detailed description of the regions labeled <region> in this photo?',
'What can you tell me about the features of the areas designated <region> in this picture?',
'Can you provide a comprehensive overview of the regions marked <region> in this image?',
'I would like to know more about the specific locations identified by <region> in this photo. Can you provide me with more information?',
'What is the detailed information you have on the areas labeled <region> in this picture?',
'Could you provide me with a thorough analysis of the regions designated <region> in this image?',
'Can you provide a detailed explanation of the specific locations marked by <region> in this photo?'
]
GCG_QUESTIONS = [
'Could you please give me a detailed description of the image? Please respond with interleaved segmentation masks for the corresponding parts of the answer.',
'Can you provide a thorough description of the this image? Please output with interleaved segmentation masks for the corresponding phrases.',
'Please describe in detail the contents of the image. Please respond with interleaved segmentation masks for the corresponding parts of the answer.',
'Could you give a comprehensive explanation of what can be found within this picture? Please output with interleaved segmentation masks for the corresponding phrases.',
'Could you give me an elaborate explanation of this picture? Please respond with interleaved segmentation masks for the corresponding phrases.',
'Could you provide me with a detailed analysis of this photo? Please output with interleaved segmentation masks for the corresponding parts of the answer.',
]
SEG_QUESTIONS = [
"Can you segment the {class_name} in this image?",
"Please segment {class_name} in this image.",
"What is {class_name} in this image? Please respond with segmentation mask.",
"What is {class_name} in this image? Please output segmentation mask.",
"Can you segment the {class_name} in this image",
"Please segment {class_name} in this image",
"What is {class_name} in this image? Please respond with segmentation mask",
"What is {class_name} in this image? Please output segmentation mask",
"Could you provide a segmentation mask for the {class_name} in this image?",
"Please identify and segment the {class_name} in this image.",
"Where is the {class_name} in this picture? Please respond with a segmentation mask.",
"Can you highlight the {class_name} in this image with a segmentation mask?",
"Could you provide a segmentation mask for the {class_name} in this image",
"Please identify and segment the {class_name} in this image",
"Where is the {class_name} in this picture? Please respond with a segmentation mask",
"Can you highlight the {class_name} in this image with a segmentation mask",
]
ANSWER_LIST = [
"It is [SEG].",
"Sure, [SEG].",
"Sure, it is [SEG].",
"Sure, the segmentation result is [SEG].",
"[SEG].",
]
import torch
import torch.nn as nn
import torch.nn.functional as F
from xtuner.registry import BUILDER
from xtuner.model.utils import LoadWoInit, guess_load_checkpoint
from xtuner.model.llava import LLaVAModel
from mmengine.model import BaseModel
from mmengine import print_log
from projects.glamm.utils import prepare_inputs_labels_for_multimodal
from projects.glamm.utils import DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
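# GLaMM model: a LLaVA-style LLM whose hidden states at [SEG] token positions
# are projected (text_hidden_fcs) into prompt embeddings for a SAM-style
# grounding encoder (image encoder + prompt encoder + mask decoder); the
# grounding encoder is frozen except for its mask decoder.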
class GLaMM(LLaVAModel):
def __init__(self,
use_activation_checkpointing=True,
tokenizer=None,
grounding_encoder=None,
region_encoder=None,
loss_mask=None,
loss_dice=None,
*args, **kwargs):
super(GLaMM, self).__init__(
*args, use_activation_checkpointing=use_activation_checkpointing, **kwargs)
self.use_activation_checkpointing = use_activation_checkpointing
self.tokenizer = BUILDER.build(tokenizer)
self._add_special_tokens()
self.grounding_encoder = BUILDER.build(grounding_encoder)
self.grounding_encoder.requires_grad_(False)
self.grounding_encoder.mask_decoder.requires_grad_(True)
if region_encoder is not None:
self.region_encoder = BUILDER.build(region_encoder)
in_dim = self.config.hidden_size
out_dim = self.grounding_encoder.mask_decoder.transformer_dim
self.text_hidden_fcs = nn.Sequential(
nn.Linear(in_dim, in_dim), nn.ReLU(inplace=True),
nn.Linear(in_dim, out_dim), nn.Dropout(0.0)
)
self.loss_mask = BUILDER.build(loss_mask)
self.loss_dice = BUILDER.build(loss_dice)
def _add_special_tokens(self):
reg_tokens = ['<im_start>', '<im_end>', '<bbox>', '<point>']
segmentation_tokens = ['[SEG]']
phrase_tokens = ['<p>', '</p>']
special_tokens = reg_tokens + segmentation_tokens + phrase_tokens
num_new_tokens = self.tokenizer.add_tokens(
special_tokens, special_tokens=True)
if num_new_tokens > 0:
self.llm.resize_token_embeddings(len(self.tokenizer))
input_embeddings = self.llm.get_input_embeddings().weight.data
output_embeddings = self.llm.get_output_embeddings().weight.data
input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(
dim=0, keepdim=True)
output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(
dim=0, keepdim=True)
input_embeddings[-num_new_tokens:] = input_embeddings_avg
output_embeddings[-num_new_tokens:] = output_embeddings_avg
self.seg_token_idx = self.tokenizer("[SEG]", add_special_tokens=False).input_ids[0]
self.bop_token_idx = self.tokenizer("<p>", add_special_tokens=False).input_ids[0]
self.eop_token_idx = self.tokenizer("</p>", add_special_tokens=False).input_ids[0]
self.bbox_token_idx = self.tokenizer("<bbox>", add_special_tokens=False).input_ids[0]
if self.use_activation_checkpointing or self.use_llm_lora or not self.freeze_llm:
self.llm.enable_input_require_grads()
def forward(self, data, data_samples=None, mode='loss'):
if 'pixel_values' in data:
visual_outputs = self.visual_encoder(
data['pixel_values'].to(self.visual_encoder.dtype),
output_hidden_states=True)
pixel_values = self.projector(
visual_outputs.hidden_states[self.visual_select_layer][:, 1:])
data['pixel_values'] = pixel_values
bboxes = data.pop('bboxes', None)
if bboxes is not None:
select_hidden_state_layer = -2
num_level_reg_features = 4
mlvl_reg_features = visual_outputs.hidden_states[select_hidden_state_layer::-3]
mlvl_reg_features = mlvl_reg_features[::-1]
mlvl_reg_features = mlvl_reg_features[-num_level_reg_features:]
mlvl_reg_features = [item[:, 1:] for item in mlvl_reg_features]
mlvl_reg_features = self.region_encoder(mlvl_reg_features, bboxes)
data = prepare_inputs_labels_for_multimodal(llm=self.llm, **data)
if bboxes is not None:
inputs_embeds = data['inputs_embeds']
for i, reg_feat in enumerate(mlvl_reg_features):
reg_mask = data['new_input_ids'][i] == self.bbox_token_idx
inputs_embeds[i][reg_mask] = reg_feat
data['inputs_embeds'] = inputs_embeds
if mode == 'loss':
return self.compute_loss(data, data_samples)
elif mode == 'predict':
return self.predict(data, data_samples)
elif mode == 'tensor':
return self._forward(data, data_samples)
else:
raise NotImplementedError
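    # compute_loss: run the LLM to obtain hidden states, gather those at [SEG]
    # positions, turn them into SAM prompt embeddings, decode one mask per
    # [SEG], and combine per-image mask/dice losses with the LLM loss.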
def compute_loss(self, data, data_samples=None):
g_pixel_values = data.pop('g_pixel_values', None)
gt_masks = data.pop('masks', None)
new_input_ids = data.pop('new_input_ids', None)
output = self.llm(output_hidden_states=True, **data)
if gt_masks is None:
return {'llm_loss': output.loss}
resize_list = [pixel.shape[-2:] for pixel in g_pixel_values]
ori_size_list = [mask.shape[-2:] for mask in gt_masks]
g_pixel_values = torch.stack([
self.grounding_encoder.preprocess(pixel) for pixel in g_pixel_values
])
image_embeddings = self.grounding_encoder.image_encoder(g_pixel_values)
seg_token_mask = new_input_ids == self.seg_token_idx
hidden_states = output.hidden_states
hidden_states = self.text_hidden_fcs(hidden_states[-1])
pred_embeddings = hidden_states[seg_token_mask]
seg_token_counts = seg_token_mask.int().sum(-1)
pred_embeddings_list = torch.split(pred_embeddings, seg_token_counts.tolist(), dim=0)
pred_masks = self._generate_and_postprocess_masks(
pred_embeddings_list, image_embeddings, resize_list, ori_size_list)
bs = len(pred_masks)
loss_mask, loss_dice = 0, 0
for i in range(bs):
pred_mask = pred_masks[i]
gt_mask = gt_masks[i]
sam_loss_mask = self.loss_mask(pred_mask, gt_mask)
sam_loss_dice = self.loss_dice(pred_mask, gt_mask)
accuracy = torch.eq((pred_mask.sigmoid() > 0.5), gt_mask).to(pred_mask).mean()
loss_mask += sam_loss_mask
loss_dice += sam_loss_dice
loss_dict = {
'loss_mask': loss_mask / bs,
'loss_dice': loss_dice / bs,
'accuracy': accuracy,
'llm_loss': output.loss,
}
return loss_dict
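    # For each sample, the [SEG] embeddings act as text prompts to the prompt
    # encoder; the low-resolution masks from the mask decoder are then resized
    # back to the original image size.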
def _generate_and_postprocess_masks(self, pred_embeddings, image_embeddings, resize_list=None, orig_size_list=None, infer=False):
pred_masks = []
for i, pred_embedding in enumerate(pred_embeddings):
sparse_embeddings, dense_embeddings = self.grounding_encoder.prompt_encoder(
points=None, boxes=None, masks=None, text_embeds=pred_embedding.unsqueeze(1)
)
sparse_embeddings = sparse_embeddings.to(pred_embedding.dtype)
low_res_masks, _ = self.grounding_encoder.mask_decoder(
image_embeddings=image_embeddings[i].unsqueeze(0),
image_pe=self.grounding_encoder.prompt_encoder.get_dense_pe(),
sparse_prompt_embeddings=sparse_embeddings, dense_prompt_embeddings=dense_embeddings,
multimask_output=False, )
pred_mask = self.grounding_encoder.postprocess_masks(
low_res_masks, input_size=resize_list[i], original_size=orig_size_list[i], )
pred_masks.append(pred_mask[:, 0])
return pred_masks
    def predict(self, data, data_samples=None):
        pass
    def _forward(self, data, data_samples=None):
        outputs = self.llm(**data)
        return outputs
from abc import ABCMeta, abstractmethod
from typing import List, Optional, Tuple
from torch import Tensor
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from mmcv import ops
from mmcv.cnn import ConvModule, Linear
from mmengine.model import BaseModule
class BaseRoIExtractor(BaseModule, metaclass=ABCMeta):
"""Base class for RoI extractor.
Args:
roi_layer (:obj:`ConfigDict` or dict): Specify RoI layer type and
arguments.
out_channels (int): Output channels of RoI layers.
featmap_strides (list[int]): Strides of input feature maps.
init_cfg (:obj:`ConfigDict` or dict or list[:obj:`ConfigDict` or \
dict], optional): Initialization config dict. Defaults to None.
"""
def __init__(self,
roi_layer,
out_channels: int,
featmap_strides: List[int],
init_cfg=None) -> None:
super().__init__(init_cfg=init_cfg)
self.roi_layers = self.build_roi_layers(roi_layer, featmap_strides)
self.out_channels = out_channels
self.featmap_strides = featmap_strides
@property
def num_inputs(self) -> int:
"""int: Number of input feature maps."""
return len(self.featmap_strides)
def build_roi_layers(self, layer_cfg,
featmap_strides: List[int]) -> nn.ModuleList:
"""Build RoI operator to extract feature from each level feature map.
Args:
layer_cfg (:obj:`ConfigDict` or dict): Dictionary to construct and
config RoI layer operation. Options are modules under
``mmcv/ops`` such as ``RoIAlign``.
featmap_strides (list[int]): The stride of input feature map w.r.t
to the original image size, which would be used to scale RoI
coordinate (original image coordinate system) to feature
coordinate system.
Returns:
:obj:`nn.ModuleList`: The RoI extractor modules for each level
feature map.
"""
cfg = layer_cfg.copy()
layer_type = cfg.pop('type')
if isinstance(layer_type, str):
assert hasattr(ops, layer_type)
layer_cls = getattr(ops, layer_type)
else:
layer_cls = layer_type
roi_layers = nn.ModuleList(
[layer_cls(spatial_scale=1 / s, **cfg) for s in featmap_strides])
return roi_layers
def roi_rescale(self, rois: Tensor, scale_factor: float) -> Tensor:
"""Scale RoI coordinates by scale factor.
Args:
rois (Tensor): RoI (Region of Interest), shape (n, 5)
scale_factor (float): Scale factor that RoI will be multiplied by.
Returns:
Tensor: Scaled RoI.
"""
cx = (rois[:, 1] + rois[:, 3]) * 0.5
cy = (rois[:, 2] + rois[:, 4]) * 0.5
w = rois[:, 3] - rois[:, 1]
h = rois[:, 4] - rois[:, 2]
new_w = w * scale_factor
new_h = h * scale_factor
x1 = cx - new_w * 0.5
x2 = cx + new_w * 0.5
y1 = cy - new_h * 0.5
y2 = cy + new_h * 0.5
new_rois = torch.stack((rois[:, 0], x1, y1, x2, y2), dim=-1)
return new_rois
@abstractmethod
def forward(self,
feats: Tuple[Tensor],
rois: Tensor,
roi_scale_factor: Optional[float] = None) -> Tensor:
"""Extractor ROI feats.
Args:
feats (Tuple[Tensor]): Multi-scale features.
rois (Tensor): RoIs with the shape (n, 5) where the first
column indicates batch id of each RoI.
roi_scale_factor (Optional[float]): RoI scale factor.
Defaults to None.
Returns:
Tensor: RoI feature.
"""
pass
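# MLVLFuseModule: fuses adjacent feature-pyramid levels by keeping most of each
# level's channels and shuffling a fixed slice (shuffle_channles) in from the
# level above and below (bilinearly resized to the target resolution), after
# appending normalised x/y coordinate channels to every level.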
class MLVLFuseModule(nn.Module):
def __init__(self, input_dims=1024, embed_dims=1024, num_levels=3, num_fuse=4):
super(MLVLFuseModule, self).__init__()
self.embed_dims = embed_dims
self.num_levels = num_levels
self.num_fuse = num_fuse
self.input_dims = input_dims
self.shuffle_channles = embed_dims // 4
# tuples of (target, top, down) level indices that interact during fusion
self.fuse_lvl_list = []
num_levels = self.num_levels
for lvl in range(num_levels):
top_lvl = min(lvl + 1, num_levels - 1)
dow_lvl = max(lvl - 1, 0)
tar_lvl = lvl
self.fuse_lvl_list.append((tar_lvl, top_lvl, dow_lvl))
self.remain_chs = self.embed_dims - self.shuffle_channles * 2
self._init_layers()
def generate_coordinate(self, featmap_sizes, device='cuda'):
x_range = torch.linspace(-1, 1, featmap_sizes[-1], device=device)
y_range = torch.linspace(-1, 1, featmap_sizes[-2], device=device)
y, x = torch.meshgrid(y_range, x_range)
y = y.expand([featmap_sizes[0], 1, -1, -1])
x = x.expand([featmap_sizes[0], 1, -1, -1])
coord_feat = torch.cat([x, y], 1)
return coord_feat
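A small sanity check of generate_coordinate, with an assumed feature shape: it builds a CoordConv-style two-channel grid whose x and y values are normalized to [-1, 1].
fuse = MLVLFuseModule(input_dims=1024, embed_dims=1024, num_levels=3)
coord = fuse.generate_coordinate((2, 1024, 16, 16), device='cpu')
assert coord.shape == (2, 2, 16, 16)   # (batch, [x, y], H, W)
assert coord.min().item() == -1.0 and coord.max().item() == 1.0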
def _init_layers(self):
self.input_conv = nn.ModuleList([nn.Conv2d(self.input_dims + 2,
self.embed_dims, 1)
for _ in range(self.num_levels)])
self.fuse_convs = nn.ModuleList()
for i in range(self.num_fuse):
self.fuse_convs.append(
ConvModule(self.embed_dims,
self.embed_dims,
3,
stride=1,
padding=3 // 2,
conv_cfg=None,
norm_cfg=dict(type='GN',
num_groups=64,
requires_grad=True)
))
def init_weights(self):
pass
def _single_shuffle(self, inputs, conv_module):
if not isinstance(conv_module, (nn.ModuleList, list)):
conv_module = [conv_module]
for single_conv_m in conv_module:
fused_inputs = []
for fuse_lvl_tuple in self.fuse_lvl_list:
tar_lvl, top_lvl, dow_lvl = fuse_lvl_tuple
tar_input = inputs[tar_lvl]
top_input = inputs[top_lvl]
down_input = inputs[dow_lvl]
remain = tar_input[:, :self.remain_chs]
from_top = top_input[:, self.remain_chs:][:, self.shuffle_channles:]
from_top = F.interpolate(from_top.to(torch.float32),
size=tar_input.shape[-2:],
mode='bilinear',
align_corners=True)
from_down = down_input[:, self.remain_chs:][:, :self.shuffle_channles]
from_down = F.interpolate(from_down.to(torch.float32),
size=tar_input.shape[-2:],
mode='bilinear',
align_corners=True)
fused_inputs.append(
torch.cat([remain, from_top.to(remain.dtype), from_down.to(remain.dtype)], dim=1))
fused_inputs = [single_conv_m(item) for item in fused_inputs]
inputs = fused_inputs
return inputs
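The channel bookkeeping behind _single_shuffle, checked with the default sizes (arithmetic illustration only):
embed_dims = 1024
shuffle = embed_dims // 4                  # 256 channels borrowed from each neighbouring level
remain = embed_dims - 2 * shuffle          # 512 channels kept from the target level itself
assert remain + 2 * shuffle == embed_dims  # the fused tensor is back to embed_dims channels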
def forward(self, inputs, ):
feat_size = [item.shape for item in inputs]
new_inputs = []
for feat, single_feat_size in zip(inputs, feat_size):
coord_feat = self.generate_coordinate(
single_feat_size, device=inputs[0].device)
# feat = torch.cat([feat, coord_feat], dim=1)
feat = torch.cat([feat, coord_feat.to(feat.dtype)], dim=1)
new_inputs.append(feat)
inputs = new_inputs
inputs = [self.input_conv[lvl](item)
for lvl, item in enumerate(inputs)]
for conv_m in self.fuse_convs:
inputs = self._single_shuffle(inputs, [conv_m])
return inputs
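A minimal usage sketch for MLVLFuseModule, with assumed feature sizes: each level gets the two coordinate channels appended, is projected back to embed_dims by a 1x1 conv, and then exchanges channels with its neighbours num_fuse times.
feats = [torch.randn(1, 1024, 64, 64),
         torch.randn(1, 1024, 32, 32),
         torch.randn(1, 1024, 16, 16)]
fuse = MLVLFuseModule(input_dims=1024, embed_dims=1024, num_levels=3, num_fuse=4)
outs = fuse(feats)   # list of 3 tensors, each (1, 1024, H_l, W_l)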
class MlvlRoIExtractor(BaseRoIExtractor):
def __init__(self,
roi_layer,
out_channels,
featmap_strides,
embed_dims=1024,
stride=1,
norm_init=True,
fuse_level=3,
finest_scale=56,
init_cfg=None):
super(MlvlRoIExtractor, self).__init__(roi_layer, out_channels,
featmap_strides, init_cfg)
self.embed_dims = embed_dims
self.finest_scale = finest_scale
self.fuse_level = fuse_level
self.norm_init = norm_init
self.pconvs = nn.ModuleList(
nn.Conv2d(self.embed_dims, self.embed_dims, 3, stride=1, padding=1)
for _ in range(self.fuse_level))
self.pos_embedd = nn.Sequential(
nn.Linear(4, 256),
nn.ReLU(inplace=True),
nn.LayerNorm(256),
nn.Linear(256, 1024),
nn.ReLU(inplace=True),
nn.LayerNorm(1024),
)
self.updims = nn.Linear(1024, 4096)
self.flatten_linear = nn.Linear(
self.embed_dims * self.roi_layers[0].output_size[0] ** 2, 1024)
self.norm_init_weights()
# self.dtype = torch.float32
def norm_init_weights(self):
pass
def forward(self, feats, rois, roi_scale_factor=None):
"""Forward function."""
num_imgs = len(rois)
# feats = [item for item in feats]
batch_rois = torch.cat(rois, dim=0).to(feats[0].dtype)
pos_embedd = self.pos_embedd(batch_rois)
out_size = self.roi_layers[0].output_size
num_levels = len(feats)
if feats[0].dim() == 3:
h = w = int(math.sqrt(feats[0].shape[1]))
assert h == 16
assert w == 16
b, c = feats[0].shape[0], feats[0].shape[-1]
feats = [item.reshape(b, h, w, c).permute(0, 3, 1, 2)
for item in feats]
new_rois = []
for img_id, single_img_roi in enumerate(rois):
# rescale to original img scale
single_img_roi = single_img_roi * 224
roi_img_id = single_img_roi.new_ones(len(single_img_roi)) * img_id
single_img_roi = torch.cat(
[roi_img_id[:, None], single_img_roi], dim=1)
new_rois.append(single_img_roi)
rois = torch.cat(new_rois)
roi_feats = feats[0].new_zeros(self.fuse_level,
rois.size(0), self.out_channels, *out_size)
for i in range(num_levels):
if len(rois) > 0:
rois_ = rois
ori_dtype = feats[i].dtype
roi_feats_t = self.roi_layers[i](feats[i].to(
torch.float32), rois_.to(torch.float32))
roi_feats[i] = roi_feats_t.to(ori_dtype)
else:
roi_feats += sum(
x.view(-1)[0]
for x in self.parameters()) * 0. + feats[i].sum() * 0.
fuse_roi_feats = []
for i in range(self.fuse_level):
fuse_roi_feats.append(self.pconvs[i](roi_feats[i]))
fuse_roi_feats = sum(fuse_roi_feats)
fuse_roi_feats = F.relu(fuse_roi_feats)
fuse_roi_feats = fuse_roi_feats.flatten(1, -1)
fuse_roi_feats = self.flatten_linear(fuse_roi_feats)
fuse_roi_feats = fuse_roi_feats + pos_embedd
fuse_roi_feats = self.updims(fuse_roi_feats)
query_feats = []
for i in range(num_imgs):
mask = rois[:, 0] == i
query_feats.append(fuse_roi_feats[mask])
return query_feats
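A hedged check of the flattened dimension that flatten_linear expects in the forward pass above; 14 is the RoIAlign output size configured by MLVLROIQueryModule below, and 4096 comes from updims:
embed_dims, pool_size = 1024, 14
flatten_dim = embed_dims * pool_size ** 2   # 200704 features per RoI before the 1024-d projection
# each per-image entry of query_feats has shape (num_boxes_in_image, 4096) after updims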
class MLVLROIQueryModule(nn.Module):
def __init__(self, embed_dims=1024, out_dims=4096,
num_levels=3):
super(MLVLROIQueryModule, self).__init__()
self.mlvl_fuse = MLVLFuseModule(input_dims=embed_dims,
embed_dims=embed_dims,
num_levels=num_levels,
num_fuse=5)
strids = [14 / 8, 14 / 4, 14 / 2, 14]
assert len(strids) == num_levels
bbox_roi_extractor = dict(roi_layer=dict(type='RoIAlign',
output_size=14,
sampling_ratio=2),
out_channels=embed_dims,
embed_dims=embed_dims,
fuse_level=num_levels,
featmap_strides=strids)
self.roi_align = MlvlRoIExtractor(**bbox_roi_extractor)
def forward(self, mlvl_feats, bboxes):
if mlvl_feats[0].dim() == 3:
h = w = int(math.sqrt(mlvl_feats[0].shape[1]))
assert h == 24
assert w == 24
b, c = mlvl_feats[0].shape[0], mlvl_feats[0].shape[-1]
mlvl_feats = [item.reshape(b, h, w, c).permute(0, 3, 1, 2) for item in mlvl_feats]
base_shape = mlvl_feats[0].shape[-2:]
num_level = len(mlvl_feats)
to_shape = [(base_shape[0] * 2 ** level, base_shape[1] * 2 ** level)
for level in range(num_level)]
to_shape = to_shape[::-1]
for level in range(num_level):
feat = mlvl_feats[level]
shape = to_shape[level]
# feat = feat
# mlvl_feats[level] = F.interpolate(feat, size=shape, mode='bilinear', align_corners=True)
# todo: temporary fix for "upsample_bilinear2d_out_frame" not implemented for 'BFloat16'
feat = feat.to(torch.float32)
mlvl_feats[level] = F.interpolate(
feat, size=shape, mode='bilinear', align_corners=True)
mlvl_feats[level] = mlvl_feats[level].to(torch.bfloat16)
mlvl_feats = self.mlvl_fuse(mlvl_feats)
return self.roi_align(mlvl_feats, bboxes)
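An end-to-end usage sketch; the shapes and num_levels=4 are assumptions, and the module is cast to bfloat16 to match the hard-coded cast in forward, so a bfloat16-capable backend is required:
module = MLVLROIQueryModule(embed_dims=1024, out_dims=4096, num_levels=4).to(torch.bfloat16)
mlvl_feats = [torch.randn(1, 576, 1024).to(torch.bfloat16) for _ in range(4)]  # 24*24 = 576 tokens per level
bboxes = [torch.tensor([[0.1, 0.2, 0.6, 0.8]])]   # one box per image, normalized to [0, 1]
query_feats = module(mlvl_feats, bboxes)          # list with one (1, 4096) tensor per image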