Commit ef30d662 authored by bailuo
init
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.3.0-ubuntu22.04-dtk24.04.3-py3.10
ENV DEBIAN_FRONTEND=noninteractive
# COPY requirements.txt requirements.txt
# RUN pip3 install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
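# Build/run sketch (hypothetical, not part of this commit): with a requirements.txt
# placed next to this Dockerfile, the two commented lines above could be restored and
# the image built and smoke-tested with something like:
#   docker build -t sa2va:dev .
#   docker run --rm -it sa2va:dev python3 -c "import torch; print(torch.__version__)"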
Upload Image,Upload mp4 video,Follow up Question,Text Instruction,Output Image,Output Video,output 2,flag,username,timestamp
flagged/Upload Image/81c78635b1580a4ebf31/微信图片_20240522104204.jpg,,false,Could you please give me a detailed description of the image?,flagged/Output Image/12abac9eeec7619282c7/image.webp,,"<link href=""https://fonts.googleapis.com/css2?family=Montserrat:wght@400;700&display=swap"" rel=""stylesheet"">
<style>
.highlighted-text {
font-family: 'Montserrat', sans-serif;
font-weight: 600;
font-size: 14px;
color: rgb(255, 255, 239);
background-color: rgb(225, 231, 254);
border-radius: 7px;
padding: 5px 7px;
display: inline-block;
}
.regular-text {
font-family: 'Montserrat', sans-serif;
font-weight: 400;
font-size: 14px;
}
.highlighted-response {
font-family: 'Montserrat', sans-serif;
font-weight: 600;
font-size: 14px;
border-radius: 6px;
padding: 3px 4px;
display: inline-block;
}
</style>
<span class=""highlighted-text"" style='color:rgb(107, 100, 239)'>Sa2VA</span>
<p><span class='regular-text'>
The image features a desk with a computer setup, including a keyboard, a mouse, and a monitor. The keyboard is placed in the center of the desk, with a mouse to its right. There are two other keyboards on the desk, one on the left side and another on the right side. A laptop is also present on the left side of the desk. In addition to the computer peripherals, there are two pens on the desk, one near the center and the other on the right side. The setup appears to be a typical workspace for someone who uses a computer for various tasks.
",,,2025-03-11 20:52:22.391278
flagged/Upload Image/43f1a49663b0b6b5474a/微信图片_20240522104204.jpg,,false,Could you provide me with a detailed analysis of this photo? Please output with interleaved segmentation masks for the corresponding parts of the answer,flagged/Output Image/47d3fb0fc7839320f5f4/image.webp,,"<link href=""https://fonts.googleapis.com/css2?family=Montserrat:wght@400;700&display=swap"" rel=""stylesheet"">
<style>
.highlighted-text {
font-family: 'Montserrat', sans-serif;
font-weight: 600;
font-size: 14px;
color: rgb(255, 255, 239);
background-color: rgb(225, 231, 254);
border-radius: 7px;
padding: 5px 7px;
display: inline-block;
}
.regular-text {
font-family: 'Montserrat', sans-serif;
font-weight: 400;
font-size: 14px;
}
.highlighted-response {
font-family: 'Montserrat', sans-serif;
font-weight: 600;
font-size: 14px;
border-radius: 6px;
padding: 3px 4px;
display: inline-block;
}
</style>
<span class=""highlighted-text"" style='color:rgb(107, 100, 239)'>Sa2VA</span>
<p><span class='regular-text'>
<span class='highlighted-response' style='background-color:rgb(254, 76, 76)'> a black keyboard with a red logo on the back </span> [SEG], along with <span class='highlighted-response' style='background-color:rgb(76, 254, 76)'> a black keyboard with a black and red logo </span> [SEG], are placed on a desk next to <span class='highlighted-response' style='background-color:rgb(76, 76, 254)'> a black computer mouse with an orange button </span> [SEG]. <span class='highlighted-response' style='background-color:rgb(254, 254, 76)'> a black computer mouse with a black and orange button </span> [SEG] is also present. <span class='highlighted-response' style='background-color:rgb(254, 76, 254)'> a black computer mouse with a black and orange button </span> [SEG] is also present.
[remainder of this logged response omitted: a long degenerate run of repeated empty <span class='highlighted-response'> </span> [SEG] tokens, cut off mid-span at an unfilled rgb[COLOR] placeholder]
",,,2025-03-11 20:55:59.174846
flagged/Upload Image/5107a46a6d81d11a8398/微信图片_20240522104201.jpg,,false,Could you provide me with a detailed analysis of this photo? Please output with interleaved segmentation masks for the corresponding parts of the answer,flagged/Output Image/9efda5cd32bec385c73f/image.webp,,"<link href=""https://fonts.googleapis.com/css2?family=Montserrat:wght@400;700&display=swap"" rel=""stylesheet"">
<style>
.highlighted-text {
font-family: 'Montserrat', sans-serif;
font-weight: 600;
font-size: 14px;
color: rgb(255, 255, 239);
background-color: rgb(225, 231, 254);
border-radius: 7px;
padding: 5px 7px;
display: inline-block;
}
.regular-text {
font-family: 'Montserrat', sans-serif;
font-weight: 400;
font-size: 14px;
}
.highlighted-response {
font-family: 'Montserrat', sans-serif;
font-weight: 600;
font-size: 14px;
border-radius: 6px;
padding: 3px 4px;
display: inline-block;
}
</style>
<span class=""highlighted-text"" style='color:rgb(107, 100, 239)'>Sa2VA</span>
<p><span class='regular-text'>
<span class='highlighted-response' style='background-color:rgb(254, 76, 76)'> A calm lake </span> [SEG] reflects the city skyline with <span class='highlighted-response' style='background-color:rgb(76, 254, 76)'> a bridge </span> [SEG] and <span class='highlighted-response' style='background-color:rgb(76, 76, 254)'> tall buildings </span> [SEG], creating a picturesque scene. <span class='highlighted-response' style='background-color:rgb(254, 254, 76)'> Trees </span> [SEG] and <span class='highlighted-response' style='background-color:rgb(254, 76, 254)'> grass </span> [SEG] are visible in the foreground, and <span class='highlighted-response' style='background-color:rgb(76, 254, 254)'> the sky </span> [SEG] can be seen in the background<span class='highlighted-response' style='background-color:rgb(76, 76, 254)'> </span> [SEG].
",,,2025-03-11 20:57:21.660529
,"{""video"": ""flagged/Upload mp4 video/6106b3f3cbe031638302/\u676f\u5b50.mp4"", ""subtitles"": null}",false,"Instruction: ""Please segment the cup.""",,"{""video"": ""flagged/Output Video/762cf09451fc44db897c/ret_video.mp4"", ""subtitles"": null}","<link href=""https://fonts.googleapis.com/css2?family=Montserrat:wght@400;700&display=swap"" rel=""stylesheet"">
<style>
.highlighted-text {
font-family: 'Montserrat', sans-serif;
font-weight: 600;
font-size: 14px;
color: rgb(255, 255, 239);
background-color: rgb(225, 231, 254);
border-radius: 7px;
padding: 5px 7px;
display: inline-block;
}
.regular-text {
font-family: 'Montserrat', sans-serif;
font-weight: 400;
font-size: 14px;
}
.highlighted-response {
font-family: 'Montserrat', sans-serif;
font-weight: 600;
font-size: 14px;
border-radius: 6px;
padding: 3px 4px;
display: inline-block;
}
</style>
<span class=""highlighted-text"" style='color:rgb(107, 100, 239)'>Sa2VA</span>
<p><span class='regular-text'>
Sure, the segmentation result is [SEG].
",,,2025-03-11 21:00:34.998173
,"{""video"": ""flagged/Upload mp4 video/c3b6bd48ff3ba1b41034/dog-5.mp4"", ""subtitles"": null}",false,"Instruction: ""Please segment the dog.""",,"{""video"": ""flagged/Output Video/f563c4642c8ccb51bd68/ret_video.mp4"", ""subtitles"": null}","<link href=""https://fonts.googleapis.com/css2?family=Montserrat:wght@400;700&display=swap"" rel=""stylesheet"">
<style>
.highlighted-text {
font-family: 'Montserrat', sans-serif;
font-weight: 600;
font-size: 14px;
color: rgb(255, 255, 239);
background-color: rgb(225, 231, 254);
border-radius: 7px;
padding: 5px 7px;
display: inline-block;
}
.regular-text {
font-family: 'Montserrat', sans-serif;
font-weight: 400;
font-size: 14px;
}
.highlighted-response {
font-family: 'Montserrat', sans-serif;
font-weight: 600;
font-size: 14px;
border-radius: 6px;
padding: 3px 4px;
display: inline-block;
}
</style>
<span class=""highlighted-text"" style='color:rgb(107, 100, 239)'>Sa2VA</span>
<p><span class='regular-text'>
Sure, the segmentation result is [SEG].
",,,2025-03-11 21:08:15.217660
,"{""video"": ""flagged/Upload mp4 video/6ab54f3481c586fed40b/Biker.mp4"", ""subtitles"": null}",false,"Instruction: ""Tell me about this video.""",,"{""video"": ""flagged/Output Video/0de5ea168ebdced955f0/ret_video.mp4"", ""subtitles"": null}","<link href=""https://fonts.googleapis.com/css2?family=Montserrat:wght@400;700&display=swap"" rel=""stylesheet"">
<style>
.highlighted-text {
font-family: 'Montserrat', sans-serif;
font-weight: 600;
font-size: 14px;
color: rgb(255, 255, 239);
background-color: rgb(225, 231, 254);
border-radius: 7px;
padding: 5px 7px;
display: inline-block;
}
.regular-text {
font-family: 'Montserrat', sans-serif;
font-weight: 400;
font-size: 14px;
}
.highlighted-response {
font-family: 'Montserrat', sans-serif;
font-weight: 600;
font-size: 14px;
border-radius: 6px;
padding: 3px 4px;
display: inline-block;
}
</style>
<span class=""highlighted-text"" style='color:rgb(107, 100, 239)'>Sa2VA</span>
<p><span class='regular-text'>
Sure, [SEG].
",,,2025-03-12 11:29:11.297654
,"{""video"": ""flagged/Upload mp4 video/7cdd3a3bf82b4b0ad290/GOT-10k_Test_000010.mp4"", ""subtitles"": null}",false,"Instruction: ""Tell me about this video.""",,"{""video"": ""flagged/Output Video/32ec117c318e434a7dd0/GOT-10k_Test_000010.mp4"", ""subtitles"": null}","<link href=""https://fonts.googleapis.com/css2?family=Montserrat:wght@400;700&display=swap"" rel=""stylesheet"">
<style>
.highlighted-text {
font-family: 'Montserrat', sans-serif;
font-weight: 600;
font-size: 14px;
color: rgb(255, 255, 239);
background-color: rgb(225, 231, 254);
border-radius: 7px;
padding: 5px 7px;
display: inline-block;
}
.regular-text {
font-family: 'Montserrat', sans-serif;
font-weight: 400;
font-size: 14px;
}
.highlighted-response {
font-family: 'Montserrat', sans-serif;
font-weight: 600;
font-size: 14px;
border-radius: 6px;
padding: 3px 4px;
display: inline-block;
}
</style>
<span class=""highlighted-text"" style='color:rgb(107, 100, 239)'>Sa2VA</span>
<p><span class='regular-text'>
Sure, the video shows a serene scene of a white swan gracefully gliding across a calm lake. The swan is seen in various positions, sometimes flying low over the water and other times soaring higher. The water is still, reflecting the swan's movements and the surrounding environment. The swan's elegant flight and the tranquil setting create a peaceful and picturesque atmosphere.
",,,2025-03-12 11:30:50.966251
,"{""video"": ""flagged/Upload mp4 video/4580bf6f906338669fb1/\u676f\u5b50.mp4"", ""subtitles"": null}",false,"Instruction: ""Tell me about this video.""",,"{""video"": ""flagged/Output Video/2a18939795b479c5228c/\u676f\u5b50.mp4"", ""subtitles"": null}","<link href=""https://fonts.googleapis.com/css2?family=Montserrat:wght@400;700&display=swap"" rel=""stylesheet"">
<style>
.highlighted-text {
font-family: 'Montserrat', sans-serif;
font-weight: 600;
font-size: 14px;
color: rgb(255, 255, 239);
background-color: rgb(225, 231, 254);
border-radius: 7px;
padding: 5px 7px;
display: inline-block;
}
.regular-text {
font-family: 'Montserrat', sans-serif;
font-weight: 400;
font-size: 14px;
}
.highlighted-response {
font-family: 'Montserrat', sans-serif;
font-weight: 600;
font-size: 14px;
border-radius: 6px;
padding: 3px 4px;
display: inline-block;
}
</style>
<span class=""highlighted-text"" style='color:rgb(107, 100, 239)'>Sa2VA</span>
<p><span class='regular-text'>
Sure, the video shows a close-up of a blue coffee mug with a white interior, placed on a white table. The mug is positioned in the center of the frame, and it appears to be empty. The background features a red wall, which adds a pop of color to the scene. The lighting is bright, highlighting the details of the mug and the table. The overall atmosphere of the video is simple and clean, focusing on the mug as the main subject.
",,,2025-03-12 11:32:21.480367
# Model unique identifier
modelCode=1450
# Model name
modelName=Sa2VA_pytorch
# Model description
modelDescription=Combines SAM2 with LLaVA to enable in-depth understanding of images and videos.
# Application scenarios
appScenario=Image understanding, retail, manufacturing, e-commerce, healthcare, education
# Framework type
frameType=pytorch
from .semantic_seg_dataset import SemanticSegDataset, ADE20kSemanticSegDataset, \
    COCOStuffSemanticSegDataset, PascalPartSemanticSegDataset, PacoSemanticSegDataset
from .gcg_dataset import GCGDataset, GranDfGCGDataset, RefCOCOgGCGDataset, OpenPsgGCGDataset, Flickr30kGCGDataset
from .region_level_dataset import RefCocoGRegionDataset, VisualGenomeRegionDataset
from .refcoco_segm_dataset import ReferSegmDataset
from .utils.utils import *
from .collate_fns.glamm_collate_fn import glamm_collate_fn
from typing import Dict, Sequence

import torch
from torch.nn.utils.rnn import pad_sequence

from xtuner.parallel.sequence import (get_sequence_parallel_world_size,
                                      pad_for_sequence_parallel)
from xtuner.utils import DEFAULT_PAD_TOKEN_INDEX, IGNORE_INDEX


def glamm_collate_fn(instances: Sequence[Dict],
                     pad_index: int = DEFAULT_PAD_TOKEN_INDEX,
                     return_hf_format: bool = False,
                     use_varlen_attn: bool = False):
    """Collate GLaMM/Sa2VA-style samples into a padded multimodal training batch."""
    seq_parallel_world_size = get_sequence_parallel_world_size()

    input_ids, labels = [], []
    # Presence flags: the optional multimodal keys may be missing in some samples.
    has_image = any(inst.get('pixel_values') is not None for inst in instances)
    has_grounding_image = any(inst.get('g_pixel_values') is not None for inst in instances)
    has_mask = any(inst.get('masks') is not None for inst in instances)
    has_bboxes = any(inst.get('bboxes') is not None for inst in instances)
    has_points = any(inst.get('points') is not None for inst in instances)

    if use_varlen_attn:
        position_ids, cumulative_len = [], []
        assert len(instances) == 1, (
            f'If utilizing varlen attention, the batch size should be'
            f' set to 1, but got {len(instances)}')
        assert not has_image, ('Currently, it is not configured to '
                               'accommodate the use of varlen Attention in multimodal training')

    if has_image:
        pixel_values = []
    if has_grounding_image:
        grounding_pixel_values = []
    if has_mask:
        object_masks = []
    if has_bboxes:
        object_bboxes = []
    if has_points:
        prompt_points = []

    for example in instances:
        input_ids.append(torch.LongTensor(example['input_ids']))
        labels.append(torch.LongTensor(example['labels']))
        if use_varlen_attn:
            cumulative_len.append(torch.IntTensor(example['cumulative_len']))
            position_ids.append(torch.LongTensor(example['position_ids']))
        if has_image:
            pixel_values.append(example['pixel_values'])
        if has_grounding_image:
            grounding_pixel_values.append(example['g_pixel_values'])
        if has_mask:
            if 'masks' in example.keys() and example['masks'] is not None:
                object_masks.append(example['masks'])
        if has_bboxes:
            if 'bboxes' in example.keys() and example['bboxes'] is not None:
                object_bboxes.append(example['bboxes'])
        if has_points:
            if 'points' in example.keys() and example['points'] is not None:
                prompt_points.append(example['points'])

    ori_length = [len(ids) for ids in input_ids]
    if len(instances) > 1:
        input_ids = pad_sequence(
            input_ids, batch_first=True, padding_value=pad_index)
        labels = pad_sequence(
            labels, batch_first=True, padding_value=IGNORE_INDEX)
    else:
        input_ids = torch.stack(input_ids)
        labels = torch.stack(labels)

    if use_varlen_attn:
        assert input_ids.size(1) % seq_parallel_world_size == 0
        attention_mask = None
        position_ids = torch.stack(position_ids, dim=0)
    else:
        # Some tokenizers have the same eos token and pad token, so input_ids
        # cannot be masked directly based on the pad token id.
        attention_mask = torch.zeros_like(input_ids).bool()
        for i, length in enumerate(ori_length):
            attention_mask[i, :length] = True

        bs, seq_len = input_ids.shape
        position_ids = torch.arange(seq_len).unsqueeze(0).long().repeat(bs, 1)

    if seq_parallel_world_size > 1:
        input_ids = pad_for_sequence_parallel(input_ids, pad_index)
        labels = pad_for_sequence_parallel(labels, IGNORE_INDEX)
        position_ids = pad_for_sequence_parallel(position_ids, 0)
        if attention_mask is not None:
            attention_mask = pad_for_sequence_parallel(attention_mask, 0)

    if use_varlen_attn:
        max_seqlen = (
            cumulative_len[0][1:] -  # noqa: W504
            cumulative_len[0][:-1]).max().item()
        data_dict = {
            'input_ids': input_ids,
            'cumulative_len': cumulative_len,
            'position_ids': position_ids,
            'labels': labels,
            'max_seqlen': max_seqlen
        }
    else:
        data_dict = {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'position_ids': position_ids,
            'labels': labels
        }

    if has_image:
        if all(x.shape == pixel_values[0].shape for x in pixel_values):
            pixel_values = torch.stack(pixel_values, dim=0)
        data_dict['pixel_values'] = pixel_values

    if has_grounding_image:
        # Grounding pixel values are left as a list (the stacking below is intentionally disabled).
        # if all(x.shape == grounding_pixel_values[0].shape for x in grounding_pixel_values):
        #     grounding_pixel_values = torch.stack(grounding_pixel_values, dim=0)
        data_dict['g_pixel_values'] = grounding_pixel_values

    if has_mask:
        data_dict['masks'] = object_masks
    if has_bboxes:
        data_dict['bboxes'] = object_bboxes
    if has_points:
        data_dict['points'] = prompt_points

    if return_hf_format:
        return data_dict
    else:
        return {'data': data_dict, 'data_samples': None}
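

# Usage sketch (illustrative, not part of the original file): a minimal example of
# plugging glamm_collate_fn into a PyTorch DataLoader. The toy dataset below is a
# hypothetical stand-in; real samples also carry 'pixel_values', 'g_pixel_values',
# 'masks', 'bboxes' or 'points', which the has_* flags above pick up automatically.
if __name__ == '__main__':
    from functools import partial
    from torch.utils.data import DataLoader, Dataset

    class _ToySegDataset(Dataset):
        """Hypothetical dataset emitting only the keys glamm_collate_fn requires."""

        def __len__(self):
            return 8

        def __getitem__(self, idx):
            length = 5 + idx  # variable lengths to exercise padding
            return {'input_ids': list(range(length)), 'labels': list(range(length))}

    loader = DataLoader(
        _ToySegDataset(),
        batch_size=4,
        collate_fn=partial(glamm_collate_fn, return_hf_format=False))

    batch = next(iter(loader))
    data = batch['data']
    # Padded to the longest sample in the batch; attention_mask marks the real tokens.
    print(data['input_ids'].shape, data['attention_mask'].shape, data['labels'].shape)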
import copy
import random
import glob
import json
import logging
import os
import torch
from mmengine import print_log
from mmengine.config import Config, ConfigDict
from PIL import Image
from torch.utils.data import Dataset
import numpy as np
import torch.nn.functional as F
from pycocotools.coco import COCO
from pycocotools import mask as mask_utils
from xtuner.registry import BUILDER
from xtuner.dataset.utils import encode_fn
from xtuner.dataset.map_fns import llava_map_fn
from projects.glamm.datasets.utils.utils import expand2square
from projects.glamm.datasets.utils.utils import GCG_QUESTIONS, ANSWER_LIST
from projects.glamm.utils import DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
class GCGDataset(Dataset):
def __init__(self,
image_folder,
image_processor,
data_path=None,
tokenizer=None,
template_map_fn=None,
max_length=2048,
pad_image_to_square=False,
repeats=1,
num_classes_per_sample=3,
extra_image_processor=None):
super().__init__()
self.question_templates = GCG_QUESTIONS
if extra_image_processor is not None:
self.extra_image_processor = BUILDER.build(extra_image_processor)
self.num_classes_per_sample = num_classes_per_sample
self.tokenizer = BUILDER.build(tokenizer)
self.tokenizer.add_tokens(
[DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True
)
reg_tokens = ['<bbox>', '<point>']
segmentation_tokens = ['[SEG]']
phrase_tokens = ['<p>', '</p>']
special_tokens = reg_tokens + segmentation_tokens + phrase_tokens
self.tokenizer.add_tokens(special_tokens, special_tokens=True)
self.max_length = max_length
self.template_map_fn = BUILDER.build(template_map_fn)
self.text_data = self.json_file_preprocess(data_path, image_folder)
self.image_folder = image_folder
self.image_processor = BUILDER.build(image_processor)
size = self.image_processor.crop_size
if isinstance(size, dict):
self.image_w, self.image_h = size['width'], size['height']
elif isinstance(size, int):
self.image_h, self.image_w = size, size
else:
self.image_w, self.image_h = size
self.pad_image_to_square = pad_image_to_square
self.repeats = repeats
def json_file_preprocess(self, data_path, image_folder=None):
with open(data_path, 'r') as f:
json_data = json.load(f)
return json_data
@property
def modality_length(self):
length_list = []
for data_dict in self.text_data:
cur_len = 100
length_list.append(cur_len)
return length_list * self.repeats
def __len__(self):
return len(self.text_data) * self.repeats
def real_len(self):
return len(self.text_data)
def _parse_annotations(self, ann_info):
image_path = os.path.join(self.image_folder, ann_info['file_name'])
image = Image.open(image_path).convert('RGB')
if hasattr(self, 'extra_image_processor'):
g_image = np.array(image) # for grounding
g_image = self.extra_image_processor.apply_image(g_image)
g_pixel_values = torch.from_numpy(g_image).permute(2, 0, 1).contiguous()
ann_info['g_pixel_values'] = g_pixel_values
width, height = image.size
if self.pad_image_to_square:
image = expand2square(
image, tuple(int(x * 255) for x in self.image_processor.image_mean))
image = self.image_processor.preprocess(image, return_tensors='pt')['pixel_values'][0]
ann_info['pixel_values'] = image
caption = ann_info['caption'].strip('"').strip()
masks, phrases, tokens_positive = [], [], []
for word, grounding in ann_info["groundings"].items():
phrases.append(word)
tokens_positive.append(grounding["token_positives"])
# Convert segmentation to binary mask
binary_mask = np.zeros((height, width), dtype=np.uint8)
for rle in grounding["rle_masks"]:
m = mask_utils.decode(rle).astype(np.uint8)
binary_mask += m.squeeze()
masks.append(binary_mask)
def sort_by_start_index(items, order):
return [items[i] for i in order]
phrase_order = sorted(range(len(tokens_positive)), key=lambda x: tokens_positive[x][0])
masks = sort_by_start_index(masks, phrase_order)
phrases = sort_by_start_index(phrases, phrase_order)
tokens_positive = sort_by_start_index(tokens_positive, phrase_order)
ann_info.update({
'image_path': image_path,
'caption': caption,
'masks': masks,
'phrases': phrases,
'tokens_positive': tokens_positive,
})
return ann_info
def create_conversation(self, caption, tokens_positive):
question = random.choice(self.question_templates).strip()
# Prepare caption with tags
def tag_caption(caption, tokens):
for start, end in sorted(tokens, key=lambda x: x[0], reverse=True):
caption = f"{caption[:start]}<p> {caption[start:end]} </p> [SEG]{caption[end:]}"
return caption
detailed_answer = tag_caption(caption, tokens_positive)
question = 'The <image> provides an overview of the picture.\n' + question
conversation = [{'input': question, 'output': detailed_answer}]
return conversation
def __getitem__(self, index):
index = index % self.real_len()
data_dict = {}
ann_info = copy.deepcopy(self.text_data[index])
ann_info = self._parse_annotations(ann_info)
data_dict['g_pixel_values'] = ann_info.pop('g_pixel_values')
data_dict['pixel_values'] = ann_info.pop('pixel_values')
if len(ann_info['masks']) == 0:
return self.__getitem__(0)
data_dict['masks'] = torch.from_numpy(np.stack(ann_info['masks'], axis=0))
conversation = self.create_conversation(ann_info['caption'], ann_info['tokens_positive'])
data_dict['conversation'] = conversation
result = self.template_map_fn(data_dict)
data_dict.update(result)
result = encode_fn(data_dict, tokenizer=self.tokenizer, max_length=self.max_length, with_image_token=True)
data_dict.update(result)
return data_dict
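# GranD-f and OpenPSG GCG annotations reuse the base GCGDataset loading and
# parsing logic unchanged; only the RefCOCOg and Flickr30k subclasses below
# override it.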
class GranDfGCGDataset(GCGDataset):
pass
class RefCOCOgGCGDataset(GCGDataset):
def json_file_preprocess(self, data_path, image_folder=None):
with open(data_path, 'r') as f:
json_data = json.load(f)
return [list(line.values())[0] for line in json_data]
def _parse_annotations(self, ann_info):
image_path = os.path.join(self.image_folder, ann_info['img_file_name'])
image = Image.open(image_path).convert('RGB')
if hasattr(self, 'extra_image_processor'):
g_image = np.array(image) # for grounding
g_image = self.extra_image_processor.apply_image(g_image)
g_pixel_values = torch.from_numpy(g_image).permute(2, 0, 1).contiguous()
ann_info['g_pixel_values'] = g_pixel_values
width, height = image.size
if self.pad_image_to_square:
image = expand2square(
image, tuple(int(x * 255) for x in self.image_processor.image_mean))
image = self.image_processor.preprocess(image, return_tensors='pt')['pixel_values'][0]
ann_info['pixel_values'] = image
caption = ann_info['caption'].strip('"').strip().lower()
masks, phrases, tokens_positive = [], [], []
for detail in ann_info['refs']:
phrase = detail['sentence']
if phrase.lower() in caption:
phrases.append(phrase)
index = caption.find(phrase)
end_index = index + len(phrase) if index != -1 else -1
tokens_positive.append([index, end_index])
binary_mask = np.zeros((height, width), dtype=np.uint8)
for seg in detail["segmentation"]:
rles = mask_utils.frPyObjects([seg], height, width)
m = mask_utils.decode(rles)
m = m.astype(np.uint8)
binary_mask += m.squeeze()
masks.append(binary_mask)
def sort_by_start_index(items, order):
return [items[i] for i in order]
phrase_order = sorted(range(len(tokens_positive)), key=lambda x: tokens_positive[x][0])
masks = sort_by_start_index(masks, phrase_order)
phrases = sort_by_start_index(phrases, phrase_order)
tokens_positive = sort_by_start_index(tokens_positive, phrase_order)
ann_info.update({
'image_path': image_path,
'caption': caption,
'masks': masks,
'phrases': phrases,
'tokens_positive': tokens_positive,
})
return ann_info
class OpenPsgGCGDataset(GCGDataset):
pass
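# Flickr30k GCG annotations are stored in COCO format: each annotation carries
# a 'tokens_positive' character span into the image caption and a pre-computed
# 'sam_mask' RLE, which _parse_annotations decodes below.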
class Flickr30kGCGDataset(GCGDataset):
def json_file_preprocess(self, data_path, image_folder=None):
def filter_images(data_infos, min_size):
return [i for i, info in enumerate(data_infos) if min(info['width'], info['height']) >= min_size]
self.coco = COCO(data_path)
self.image_ids = self.coco.getImgIds()
data_infos = []
total_ann_ids = []
removed_img_count = 0
for img_id in self.image_ids:
info = self.coco.loadImgs([img_id])[0]
if len(info['caption'].split(' ')) < 3:
removed_img_count += 1
continue
info['filename'] = info['file_name'].split('_')[-1]
info['height'] = int(info['height'])
info['width'] = int(info['width'])
data_infos.append(info)
ann_ids = self.coco.getAnnIds(imgIds=[img_id])
total_ann_ids.extend(ann_ids)
assert len(set(total_ann_ids)) == len(total_ann_ids), f"Non-unique annotation IDs in '{data_path}'!"
print(f'Removed {removed_img_count} images.')
data_infos = [data_infos[i] for i in filter_images(data_infos, min_size=32)]
return data_infos
def _parse_annotations(self, img_info):
ann_ids = self.coco.getAnnIds(imgIds=img_info['id'])
ann_info = self.coco.loadAnns(ann_ids)
annotations = {'phrases': [], 'caption': img_info['caption'], 'masks': [], 'tokens_positive': []}
image_path = os.path.join(self.image_folder, img_info['file_name'])
image = Image.open(image_path).convert('RGB')
if hasattr(self, 'extra_image_processor'):
g_image = np.array(image) # for grounding
g_image = self.extra_image_processor.apply_image(g_image)
g_pixel_values = torch.from_numpy(g_image).permute(2, 0, 1).contiguous()
annotations['g_pixel_values'] = g_pixel_values
width, height = image.size
if self.pad_image_to_square:
image = expand2square(
image, tuple(int(x * 255) for x in self.image_processor.image_mean))
image = self.image_processor.preprocess(image, return_tensors='pt')['pixel_values'][0]
annotations['pixel_values'] = image
for ann in ann_info:
if ann.get('ignore', False):
continue
x1, y1, w, h = ann['bbox']
inter_w = max(0, min(x1 + w, img_info['width']) - max(x1, 0))
inter_h = max(0, min(y1 + h, img_info['height']) - max(y1, 0))
if inter_w * inter_h == 0 or ann['area'] <= 0 or w < 1 or h < 1:
continue
bbox = [x1, y1, x1 + w, y1 + h]
tokens_positive = ann['tokens_positive']
phrase = [img_info['caption'][span[0]:span[1]] for span in tokens_positive]
annotations['phrases'].append(phrase[0])
annotations['tokens_positive'].append(tokens_positive[0])
rle = ann['sam_mask']
mask_decoded = mask_utils.decode(rle).astype(np.uint8)
annotations['masks'].append(mask_decoded)
def sort_by_start_index(items, order):
return [items[i] for i in order]
phrase_order = sorted(range(len(annotations['tokens_positive'])), key=lambda x: annotations['tokens_positive'][x][0])
annotations['masks'] = sort_by_start_index(annotations['masks'], phrase_order)
annotations['phrases'] = sort_by_start_index(annotations['phrases'], phrase_order)
annotations['tokens_positive'] = sort_by_start_index(annotations['tokens_positive'], phrase_order)
return annotations
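# Quick smoke test: build the Flickr30k GCG dataset from plain config dicts
# (resolved through BUILDER) and iterate over a few samples.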
if __name__ == '__main__':
from transformers import CLIPImageProcessor, AutoTokenizer
from third_parts.segment_anything.utils.transforms import ResizeLongestSide
pretrained_model = 'MBZUAI/GLaMM-GranD-Pretrained'
llm_name_or_path = 'lmsys/vicuna-7b-v1.5'
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path)
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path='openai/clip-vit-large-patch14-336')
extra_image_processor = dict(
type=ResizeLongestSide,
target_length=1024,
)
from xtuner.utils.templates import PROMPT_TEMPLATE
prompt_template = PROMPT_TEMPLATE.vicuna
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory, template_map_fn
from projects.glamm.datasets.collate_fns.glamm_collate_fn import glamm_collate_fn
dataset = Flickr30kGCGDataset(
image_folder='data/flickr30k/flickr30k-images/',
image_processor=image_processor,
data_path='./data/GranDf/annotations/train/flickr_mergedGT_GCG_train.json',
tokenizer=tokenizer,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=2048,
pad_image_to_square=True,
repeats=1,
num_classes_per_sample=3,
extra_image_processor=extra_image_processor)
for i in range(1000):
print(dataset[i])
import copy
import random
import glob
import json
import logging
import os
import torch
from mmengine import print_log
from mmengine.config import Config, ConfigDict
from PIL import Image
from torch.utils.data import Dataset
import numpy as np
import torch.nn.functional as F
from pycocotools.coco import COCO
from pycocotools import mask as mask_utils
from xtuner.registry import BUILDER
from xtuner.dataset.utils import encode_fn
from xtuner.dataset.map_fns import llava_map_fn
from projects.glamm.datasets.utils.utils import expand2square
from projects.glamm.datasets.utils.utils import SEG_QUESTIONS, ANSWER_LIST
from projects.glamm.utils import DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from third_parts.mmdet.datasets.refcoco import RefCocoDataset
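# Referring-expression segmentation dataset. It reuses the mmdet RefCOCO loader
# for image/expression pairs and converts each sampled expression into a
# single-turn "segment the <expr>" conversation answered with a [SEG] token.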
class ReferSegmDataset(RefCocoDataset):
def __init__(self,
data_root,
ann_file=None,
split_file=None,
image_processor=None,
extra_image_processor=None,
data_prefix=dict(img_path='train2014/'),
tokenizer=None,
template_map_fn=None,
max_length=2048,
pad_image_to_square=False,
num_classes_per_sample=3):
super().__init__(
data_root=data_root,
data_prefix=data_prefix,
pipeline=None,
ann_file=ann_file,
split_file=split_file,
)
self.begin_str = f"""{DEFAULT_IMAGE_TOKEN} provides an overview of the picture.\n"""
self.question_templates = SEG_QUESTIONS
if extra_image_processor is not None:
self.extra_image_processor = BUILDER.build(extra_image_processor)
self.num_classes_per_sample = num_classes_per_sample
self.tokenizer = BUILDER.build(tokenizer)
self.tokenizer.add_tokens(
[DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True
)
reg_tokens = ['<bbox>', '<point>']
segmentation_tokens = ['[SEG]']
phrase_tokens = ['<p>', '</p>']
special_tokens = reg_tokens + segmentation_tokens + phrase_tokens
self.tokenizer.add_tokens(special_tokens, special_tokens=True)
self.max_length = max_length
self.template_map_fn = BUILDER.build(template_map_fn)
self.image_processor = BUILDER.build(image_processor)
size = self.image_processor.crop_size
if isinstance(size, dict):
self.image_w, self.image_h = size['width'], size['height']
self.pad_image_to_square = pad_image_to_square
@property
def modality_length(self):
import pickle
length_list = []
for idx in range(len(self)):
length_list.append(100)
# for idx in range(len(self)):
# if self.serialize_data:
# start_addr = 0 if idx == 0 else self.data_address[idx - 1].item()
# end_addr = self.data_address[idx].item()
# bytes = memoryview(
# self.data_bytes[start_addr:end_addr]) # type: ignore
# data_dict = pickle.loads(bytes)
# else:
# data_dict = copy.deepcopy(self.data_list[idx])
return length_list
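    # Per sample, up to num_classes_per_sample referring expressions are drawn
    # (np.random.choice with the default replace=True, so duplicates are
    # possible) and each polygon list is rasterised into one binary mask.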
def _parse_annotations(self, ann_info):
image_path = ann_info['img_path']
image = Image.open(image_path).convert('RGB')
if hasattr(self, 'extra_image_processor'):
g_image = np.array(image) # for grounding
g_image = self.extra_image_processor.apply_image(g_image)
g_pixel_values = torch.from_numpy(
g_image).permute(2, 0, 1).contiguous()
ann_info['g_pixel_values'] = g_pixel_values
width, height = image.size
if self.pad_image_to_square:
image = expand2square(
image, tuple(int(x * 255) for x in self.image_processor.image_mean))
image = self.image_processor.preprocess(
image, return_tensors='pt')['pixel_values'][0]
ann_info['pixel_values'] = image
masks, phrases = [], []
instances, text = ann_info['instances'], ann_info['text']
index = np.random.choice(range(len(instances)), min(
len(instances), self.num_classes_per_sample))
for idx in index:
inst = instances[idx]
phrase = text[idx].lower()
phrases.append(phrase)
binary_mask = np.zeros((height, width), dtype=np.uint8)
for seg in inst["mask"]:
rles = mask_utils.frPyObjects([seg], height, width)
m = mask_utils.decode(rles)
m = m.astype(np.uint8)
binary_mask += m.squeeze()
masks.append(binary_mask)
ann_info.update({
'masks': masks,
'phrases': phrases,
})
return ann_info
def __getitem__(self, idx):
data_dict = {}
ann_info = super().__getitem__(idx)
ann_info = self._parse_annotations(ann_info)
data_dict['g_pixel_values'] = ann_info.pop('g_pixel_values')
data_dict['pixel_values'] = ann_info.pop('pixel_values')
if len(ann_info['masks']) == 0:
return self.__getitem__(0)
data_dict['masks'] = torch.from_numpy(
np.stack(ann_info['masks'], axis=0))
conversation = []
for i, phrase in enumerate(ann_info['phrases']):
question = random.choice(SEG_QUESTIONS).format(class_name=phrase)
conversation.append(
{'input': question, 'output': random.choice(ANSWER_LIST)})
data_dict['conversation'] = conversation
result = self.template_map_fn(data_dict)
data_dict.update(result)
result = encode_fn(data_dict, tokenizer=self.tokenizer,
max_length=self.max_length, with_image_token=True)
data_dict.update(result)
return data_dict
if __name__ == '__main__':
from transformers import CLIPImageProcessor, AutoTokenizer
from third_parts.segment_anything.utils.transforms import ResizeLongestSide
pretrained_model = 'MBZUAI/GLaMM-GranD-Pretrained'
llm_name_or_path = 'lmsys/vicuna-7b-v1.5'
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path)
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path='openai/clip-vit-large-patch14-336')
extra_image_processor = dict(
type=ResizeLongestSide,
target_length=1024,
)
from xtuner.utils.templates import PROMPT_TEMPLATE
prompt_template = PROMPT_TEMPLATE.vicuna
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory, template_map_fn
from projects.glamm.datasets.collate_fns.glamm_collate_fn import glamm_collate_fn
dataset = ReferSegmDataset(
tokenizer=tokenizer,
image_processor=image_processor,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
extra_image_processor=extra_image_processor,
data_root='data/coco/',
data_prefix=dict(img_path='train2014/'),
ann_file='refcoco+/instances.json',
split_file='refcoco+/refs(unc).p',
)
for i in range(1000):
dataset[i]
import copy
import random
import glob
import json
import logging
import os
import torch
from mmengine import print_log
from mmengine.config import Config, ConfigDict
from PIL import Image
from torch.utils.data import Dataset
import numpy as np
import torch.nn.functional as F
from pycocotools.coco import COCO
from pycocotools import mask as mask_utils
from xtuner.registry import BUILDER
from xtuner.dataset.utils import encode_fn
from xtuner.dataset.map_fns import llava_map_fn
from projects.glamm.datasets.utils.utils import expand2square
from projects.glamm.datasets.utils.utils import ANSWER_LIST, REGION_QUESTIONS
from projects.glamm.utils import DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
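# Region-captioning dataset: each selected box is referred to in the question
# as 'region{i} <bbox>', and the model is supervised to output the region
# caption. Boxes are rescaled to the preprocessed image and normalised before
# being attached to the sample as 'bboxes'.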
class RegionDataset(Dataset):
def __init__(self,
image_folder,
image_processor,
data_path=None,
tokenizer=None,
template_map_fn=None,
max_length=2048,
pad_image_to_square=False,
repeats=1,
num_classes_per_sample=3,
extra_image_processor=None):
super().__init__()
self.begin_str = f"""{DEFAULT_IMAGE_TOKEN} provides an overview of the picture.\n"""
self.question_templates = REGION_QUESTIONS
if extra_image_processor is not None:
self.extra_image_processor = BUILDER.build(extra_image_processor)
self.num_classes_per_sample = num_classes_per_sample
self.tokenizer = BUILDER.build(tokenizer)
self.tokenizer.add_tokens(
[DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True
)
reg_tokens = ['<bbox>', '<point>']
segmentation_tokens = ['[SEG]']
phrase_tokens = ['<p>', '</p>']
special_tokens = reg_tokens + segmentation_tokens + phrase_tokens
self.tokenizer.add_tokens(special_tokens, special_tokens=True)
self.max_length = max_length
self.template_map_fn = BUILDER.build(template_map_fn)
self.text_data = self._load_annotations(data_path, image_folder)
self.image_folder = image_folder
self.image_processor = BUILDER.build(image_processor)
size = self.image_processor.crop_size
if isinstance(size, dict):
self.image_w, self.image_h = size['width'], size['height']
elif isinstance(size, int):
self.image_h, self.image_w = size, size
else:
self.image_w, self.image_h = size
self.pad_image_to_square = pad_image_to_square
self.repeats = repeats
def _load_annotations(self, data_path, image_folder=None):
self.coco = COCO(data_path)
img_ids = self.coco.getImgIds()
data_infos = []
for img_id in img_ids:
info = self.coco.loadImgs([img_id])[0]
info['filename'] = info['file_name'].split('_')[-1]
info['height'] = int(info['height'])
info['width'] = int(info['width'])
if min(info['height'], info['width']) < 32:
continue
data_infos.append(info)
return data_infos
@property
def modality_length(self):
length_list = []
for data_dict in self.text_data:
cur_len = 100
length_list.append(cur_len)
return length_list * self.repeats
def __len__(self):
return len(self.text_data) * self.repeats
def real_len(self):
return len(self.text_data)
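    # region_processor maps boxes from original-image coordinates to the
    # preprocessed (post-resize/pad) image and then normalises them by post_h.
    # Illustrative example (hypothetical numbers): a 640x480 image resized to
    # 336x336 scales x by 336/640 and y by 336/480, so [64, 48, 320, 240]
    # becomes [33.6, 33.6, 168.0, 168.0] and, after / post_h, [0.1, 0.1, 0.5, 0.5].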
def region_processor(self, orig_size, post_size, bboxes, labels):
orig_h, orig_w = orig_size
post_h, post_w = post_size
y_scale = post_h / orig_h
x_scale = post_w / orig_w
shuffle_ids = torch.randperm(len(labels))[:self.num_classes_per_sample]
selected_bboxes = bboxes[shuffle_ids]
# Ensure selected_bboxes is two-dimensional
if len(selected_bboxes.shape) == 1:
selected_bboxes = np.expand_dims(selected_bboxes, axis=0)
selected_labels = [labels[i] for i in shuffle_ids]
selected_bboxes[:, [0, 2]] *= x_scale
selected_bboxes[:, [1, 3]] *= y_scale
selected_bboxes = torch.tensor(
selected_bboxes, dtype=torch.float32) / post_h
return selected_bboxes, selected_labels
def _parse_annotations(self, img_info):
data_dict = {}
bboxes, captions = [], []
ann_info = self.coco.loadAnns(self.coco.getAnnIds(imgIds=img_info['id']))
image_path = os.path.join(self.image_folder, img_info['file_name'])
image = Image.open(image_path).convert('RGB')
if hasattr(self, 'extra_image_processor'):
g_image = np.array(image) # for grounding
g_image = self.extra_image_processor.apply_image(g_image)
g_pixel_values = torch.from_numpy(
g_image).permute(2, 0, 1).contiguous()
data_dict['g_pixel_values'] = g_pixel_values
orig_w, orig_h = image.size
if self.pad_image_to_square:
image = expand2square(
image, tuple(int(x * 255) for x in self.image_processor.image_mean))
image = self.image_processor.preprocess(
image, return_tensors='pt')['pixel_values'][0]
post_h, post_w = image.shape[1:3]
data_dict['pixel_values'] = image
for ann in ann_info:
if ann.get('ignore', False) or ann['area'] <= 0 or ann['bbox'][2] < 1 or ann['bbox'][3] < 1:
continue
x1, y1, w, h = ann['bbox']
inter_w = max(0, min(x1 + w, orig_w) - max(x1, 0))
inter_h = max(0, min(y1 + h, orig_h) - max(y1, 0))
if inter_w * inter_h == 0:
continue
bbox = [x1, y1, x1 + w, y1 + h]
if bbox:
bboxes.append(bbox)
captions.append(img_info['caption'])
if len(bboxes) == 0:
return self.__getitem__(0)
bboxes = np.array(bboxes, dtype=np.float32)
        seg_map = img_info['file_name'].replace('.jpg', '.png')
bboxes, captions = self.region_processor((orig_h, orig_w), (post_h, post_w), bboxes, captions)
data_dict['bboxes'] = bboxes
data_dict['captions'] = captions
data_dict['seg_map'] = seg_map
return data_dict
def create_conversation(self, captions):
questions = []
answers = []
for i, label in enumerate(captions):
question = random.choice(self.question_templates).strip().replace('<region>', f'region{i + 1} <bbox>')
questions.append(question)
answers.append(label)
conversation = []
for i, (question, answer) in enumerate(zip(questions, answers)):
if i == 0:
question = self.begin_str + question
conversation.append({'input': question, 'output': answer})
return conversation
def __getitem__(self, index):
index = index % self.real_len()
data_dict = {}
ann_info = copy.deepcopy(self.text_data[index])
ann_info = self._parse_annotations(ann_info)
data_dict['g_pixel_values'] = ann_info.pop('g_pixel_values', None)
data_dict['pixel_values'] = ann_info.pop('pixel_values')
data_dict['bboxes'] = ann_info.pop('bboxes', None)
conversation = self.create_conversation(ann_info['captions'])
data_dict['conversation'] = conversation
result = self.template_map_fn(data_dict)
data_dict.update(result)
result = encode_fn(data_dict, tokenizer=self.tokenizer,
max_length=self.max_length, with_image_token=True)
data_dict.update(result)
return data_dict
class RefCocoGRegionDataset(RegionDataset):
pass
class VisualGenomeRegionDataset(RegionDataset):
def _parse_annotations(self, img_info):
data_dict = {}
bboxes, captions = [], []
ann_info = self.coco.loadAnns(self.coco.getAnnIds(imgIds=img_info['id']))
image_path = os.path.join(self.image_folder, img_info['file_name'])
image = Image.open(image_path).convert('RGB')
if hasattr(self, 'extra_image_processor'):
g_image = np.array(image) # for grounding
g_image = self.extra_image_processor.apply_image(g_image)
g_pixel_values = torch.from_numpy(
g_image).permute(2, 0, 1).contiguous()
data_dict['g_pixel_values'] = g_pixel_values
orig_w, orig_h = image.size
if self.pad_image_to_square:
image = expand2square(
image, tuple(int(x * 255) for x in self.image_processor.image_mean))
image = self.image_processor.preprocess(
image, return_tensors='pt')['pixel_values'][0]
post_h, post_w = image.shape[1:3]
data_dict['pixel_values'] = image
for ann in ann_info:
if ann.get('ignore', False) or ann['area'] <= 0 or ann['bbox'][2] < 1 or ann['bbox'][3] < 1:
continue
x1, y1, w, h = ann['bbox']
inter_w = max(0, min(x1 + w, orig_w) - max(x1, 0))
inter_h = max(0, min(y1 + h, orig_h) - max(y1, 0))
if inter_w * inter_h == 0:
continue
bbox = [x1, y1, x1 + w, y1 + h]
if bbox:
bboxes.append(bbox)
captions.append(ann['caption'].strip())
if len(bboxes) == 0:
return self.__getitem__(0)
bboxes = np.array(bboxes, dtype=np.float32)
        seg_map = img_info['file_name'].replace('.jpg', '.png')
bboxes, captions = self.region_processor((orig_h, orig_w), (post_h, post_w), bboxes, captions)
data_dict['bboxes'] = bboxes
data_dict['captions'] = captions
data_dict['seg_map'] = seg_map
return data_dict
if __name__ == '__main__':
from transformers import CLIPImageProcessor, AutoTokenizer
from third_parts.segment_anything.utils.transforms import ResizeLongestSide
pretrained_model = 'MBZUAI/GLaMM-GranD-Pretrained'
llm_name_or_path = 'lmsys/vicuna-7b-v1.5'
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path)
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path='openai/clip-vit-large-patch14-336')
extra_image_processor = dict(
type=ResizeLongestSide,
target_length=1024,
)
from xtuner.utils.templates import PROMPT_TEMPLATE
prompt_template = PROMPT_TEMPLATE.vicuna
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory, template_map_fn
from projects.glamm.datasets.collate_fns.glamm_collate_fn import glamm_collate_fn
dataset = VisualGenomeRegionDataset(
image_folder='./data/visual_genome/images',
image_processor=image_processor,
data_path='data/visual_genome/train.json',
tokenizer=tokenizer,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=2048,
pad_image_to_square=False,
repeats=1,
num_classes_per_sample=3,
extra_image_processor=None)
for i in range(1000):
print(dataset[i])
import copy
import random
import glob
import json
import logging
import os
import torch
from mmengine import print_log
from mmengine.config import Config, ConfigDict
from PIL import Image
from torch.utils.data import Dataset
import numpy as np
import torch.nn.functional as F
from pycocotools.coco import COCO
from xtuner.registry import BUILDER
from xtuner.dataset.utils import encode_fn
from xtuner.dataset.map_fns import llava_map_fn
from projects.glamm.datasets.utils.utils import expand2square
from projects.glamm.datasets.utils.utils import SEG_QUESTIONS, ANSWER_LIST
from projects.glamm.utils import DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
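# Semantic-segmentation dataset (ADE20k layout by default): each sample picks
# up to num_classes_per_sample classes present in the label map and turns each
# one into a "segment the <class>" question with a [SEG]-style answer.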
class SemanticSegDataset(Dataset):
def __init__(self,
image_folder,
image_processor,
data_path=None,
tokenizer=None,
offline_processed_text_folder=None,
max_dataset_length=None,
dataset_map_fn=None,
template_map_fn=None,
max_length=2048,
pad_image_to_square=False,
num_proc=8,
lazy=False,
repeats=1,
gcg_format=False,
num_classes_per_sample=3,
extra_image_processor=None):
super().__init__()
self.gcg_format = gcg_format
if extra_image_processor is not None:
self.extra_image_processor = BUILDER.build(extra_image_processor)
self.num_classes_per_sample = num_classes_per_sample
self.tokenizer = BUILDER.build(tokenizer)
self.tokenizer.add_tokens(
[DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True
)
reg_tokens = ['<bbox>', '<point>']
segmentation_tokens = ['[SEG]']
phrase_tokens = ['<p>', '</p>']
special_tokens = reg_tokens + segmentation_tokens + phrase_tokens
self.tokenizer.add_tokens(special_tokens, special_tokens=True)
assert offline_processed_text_folder or (data_path and tokenizer)
self.lazy = lazy
self.max_length = max_length
self.dataset_map_fn = dataset_map_fn
self.template_map_fn = template_map_fn
if isinstance(self.template_map_fn, dict) and self.lazy:
_type = self.template_map_fn['type']
del self.template_map_fn['type']
self.template_map_fn = _type(**self.template_map_fn)
if offline_processed_text_folder and data_path:
print_log(
'Both `offline_processed_text_folder` and '
'`data_path` are set, and we load dataset from'
'`offline_processed_text_folder` '
f'({offline_processed_text_folder})',
logger='current',
level=logging.WARNING)
if offline_processed_text_folder is not None:
raise NotImplementedError
else:
self.image_label_datas = self.json_file_preprocess(data_path, image_folder)
self.image_folder = image_folder
        if isinstance(image_processor, (dict, Config, ConfigDict)):
self.image_processor = BUILDER.build(image_processor)
else:
self.image_processor = image_processor
size = self.image_processor.crop_size
if isinstance(size, dict):
self.image_w, self.image_h = size['width'], size['height']
elif isinstance(size, int):
self.image_h, self.image_w = size, size
else:
self.image_w, self.image_h = size
self.pad_image_to_square = pad_image_to_square
self.down_ratio = 1
self.repeats = repeats
def json_file_preprocess(self, data_path, image_folder):
# ade20k
with open(data_path, 'r') as file:
ade20k_classes = json.load(file)
ade20k_image_dir = image_folder
ade20k_images = [os.path.join(ade20k_image_dir, img) for img in os.listdir(ade20k_image_dir) if
img.endswith('.jpg')]
ade20k_labels = [img.replace(".jpg", ".png").replace(
"images", "annotations") for img in ade20k_images]
self.classes = np.array(ade20k_classes)
ret = []
for image, label in zip(ade20k_images, ade20k_labels):
ret.append({"image": image, "label": label})
return ret
def __len__(self):
return len(self.image_label_datas) * self.repeats
@property
def modality_length(self):
length_list = []
for data_dict in self.image_label_datas:
length_list.append(100)
length_list = length_list * self.repeats
return length_list
def real_len(self):
return len(self.image_label_datas)
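    # ADE20k convention: pixel value 0 means "unlabeled", so decode_mask maps
    # 0 -> 255 (ignore) and shifts the remaining ids down by one before
    # sampling classes and building per-class boolean masks.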
def decode_mask(self, label_path):
label = np.array(Image.open(label_path))
# ade20k
label = np.where(label == 0, 255, label - 1)
unique_labels = [lbl for lbl in np.unique(label) if lbl != 255]
if not unique_labels:
return None, None
selected_labels = np.random.choice(unique_labels, min(
len(unique_labels), self.num_classes_per_sample), replace=False)
label = torch.from_numpy(label).long()
masks = torch.stack([label == class_id for class_id in selected_labels], dim=0)
return masks, selected_labels
def __getitem__(self, index):
index = index % self.real_len()
data_dict = copy.deepcopy(self.image_label_datas[index])
assert 'image' in data_dict.keys()
if data_dict.get('image', None) is not None:
image_file = data_dict['image']
image = Image.open(image_file).convert('RGB')
if hasattr(self, 'extra_image_processor'):
g_image = np.array(image) # for grounding
g_image = self.extra_image_processor.apply_image(g_image)
g_pixel_values = torch.from_numpy(g_image).permute(2, 0, 1).contiguous()
data_dict['g_pixel_values'] = g_pixel_values
ori_width, ori_height = image.size
if self.pad_image_to_square:
image = expand2square(image, tuple(int(x * 255)
for x in self.image_processor.image_mean))
image = self.image_processor.preprocess(
image, return_tensors='pt')['pixel_values'][0]
data_dict['pixel_values'] = image
# process and get masks
data_dict['masks'], class_id = self.decode_mask(data_dict['label'])
if class_id is None:
return self.__getitem__(0)
if self.gcg_format:
pass
else:
conversation = []
for i, c_id in enumerate(class_id):
question = random.choice(SEG_QUESTIONS).format(
class_name=self.classes[c_id].lower())
if i == 0:
question = f"""The {DEFAULT_IMAGE_TOKEN} provides an overview of the picture.\n""" + question
conversation.append(
{'input': question, 'output': random.choice(ANSWER_LIST)})
data_dict.update({'conversation': conversation})
else:
if hasattr(self.image_processor, 'crop_size'):
crop_size = self.image_processor.crop_size
else:
crop_size = self.image_processor.size
data_dict['pixel_values'] = torch.zeros(3, crop_size['height'],
crop_size['width'])
data_dict['masks'] = None
if self.lazy:
result = self.template_map_fn(data_dict)
data_dict.update(result)
result = encode_fn(data_dict, tokenizer=self.tokenizer,
max_length=self.max_length, with_image_token=True)
data_dict.update(result)
return data_dict
class ADE20kSemanticSegDataset(SemanticSegDataset):
def __init__(self,
image_folder,
image_processor,
data_path=None,
tokenizer=None,
offline_processed_text_folder=None,
max_dataset_length=None,
dataset_map_fn=None,
template_map_fn=None,
max_length=2048,
pad_image_to_square=False,
num_proc=8,
lazy=False,
repeats=1,
gcg_format=False,
num_classes_per_sample=3,
extra_image_processor=None):
super().__init__(
image_folder=image_folder,
image_processor=image_processor,
data_path=data_path,
tokenizer=tokenizer,
offline_processed_text_folder=offline_processed_text_folder,
max_dataset_length=max_dataset_length,
dataset_map_fn=dataset_map_fn,
template_map_fn=template_map_fn,
max_length=max_length,
pad_image_to_square=pad_image_to_square,
num_proc=num_proc,
lazy=lazy,
repeats=repeats,
gcg_format=gcg_format,
num_classes_per_sample=num_classes_per_sample,
extra_image_processor=extra_image_processor,
)
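# COCO-Stuff variant: class names are read from data_path, label maps are found
# under label_path, and decode_mask maps every class whose name contains '-'
# to the 255 ignore index before sampling classes.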
class COCOStuffSemanticSegDataset(SemanticSegDataset):
def __init__(self,
image_folder,
image_processor,
data_path=None,
tokenizer=None,
offline_processed_text_folder=None,
max_dataset_length=None,
dataset_map_fn=None,
template_map_fn=None,
max_length=2048,
pad_image_to_square=False,
num_proc=8,
lazy=False,
repeats=1,
label_path=None,
gcg_format=False,
num_classes_per_sample=3,
extra_image_processor=None):
self.label_path = label_path
super().__init__(
image_folder=image_folder,
image_processor=image_processor,
data_path=data_path,
tokenizer=tokenizer,
offline_processed_text_folder=offline_processed_text_folder,
max_dataset_length=max_dataset_length,
dataset_map_fn=dataset_map_fn,
template_map_fn=template_map_fn,
max_length=max_length,
pad_image_to_square=pad_image_to_square,
num_proc=num_proc,
lazy=lazy,
repeats=repeats,
gcg_format=gcg_format,
num_classes_per_sample=num_classes_per_sample,
extra_image_processor=extra_image_processor,
)
self.cocostuff_class2index = {c: i for i, c in enumerate(self.classes)}
def json_file_preprocess(self, data_path, image_folder):
# coco stuff
assert self.label_path is not None
with open(data_path, 'r') as file:
cocostuff_classes = [line.strip().split(": ")[-1]
for line in file.readlines()[1:]]
coco_stuff_image_dir = image_folder
coco_stuff_label_dir = self.label_path
coco_stuff_labels = glob.glob(
os.path.join(coco_stuff_label_dir, "*.png"))
coco_stuff_images = [label.replace(".png", ".jpg").replace(coco_stuff_label_dir, coco_stuff_image_dir)
for label in coco_stuff_labels]
self.classes = np.array(cocostuff_classes)
ret = []
for image, label in zip(coco_stuff_images, coco_stuff_labels):
ret.append({"image": image, "label": label})
return ret
def decode_mask(self, label_path):
label = np.array(Image.open(label_path))
# coco stuff
ignored_classes = [index for class_name,
index in self.cocostuff_class2index.items() if "-" in class_name]
label = np.where(np.isin(label, ignored_classes), 255, label)
unique_labels = [lbl for lbl in np.unique(label) if lbl != 255]
if not unique_labels:
print("No valid label !!!")
return None, None
# only choose 1
selected_labels = np.random.choice(unique_labels, min(
len(unique_labels), self.num_classes_per_sample), replace=False)
label = torch.from_numpy(label).long()
masks = torch.stack(
[label == class_id for class_id in selected_labels], dim=0)
return masks, selected_labels
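# Pascal-Part annotations name categories as 'object:part'; the mapping built
# below keeps them as (object, part) tuples so __getitem__ can phrase the
# question either as 'object part' or 'the part of the object'.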
class PascalPartSemanticSegDataset(SemanticSegDataset):
def json_file_preprocess(self, data_path, image_folder):
self.coco_api = COCO(data_path)
img_ids = self.coco_api.getImgIds()
all_classes = self.coco_api.loadCats(self.coco_api.getCatIds())
class_map_pascal_part = {}
for cat in all_classes:
cat_main, cat_part = cat["name"].strip().split(":")
name = (cat_main, cat_part)
class_map_pascal_part[cat["id"]] = name
self.classes = class_map_pascal_part
return img_ids
def __getitem__(self, index):
index = index % self.real_len()
img_id = self.image_label_datas[index]
img_info = self.coco_api.loadImgs([img_id])[0]
file_name = img_info["file_name"]
data_dict = {}
image_file = os.path.join(self.image_folder, file_name)
image = Image.open(image_file).convert('RGB')
if hasattr(self, 'extra_image_processor'):
g_image = np.array(image) # for grounding
g_image = self.extra_image_processor.apply_image(g_image)
g_pixel_values = torch.from_numpy(g_image).permute(2, 0, 1).contiguous()
data_dict['g_pixel_values'] = g_pixel_values
if self.pad_image_to_square:
image = expand2square(
image, tuple(int(x * 255) for x in self.image_processor.image_mean))
image = self.image_processor.preprocess(image, return_tensors='pt')['pixel_values'][0]
data_dict['pixel_values'] = image
annotation_ids = self.coco_api.getAnnIds(imgIds=img_info["id"])
annotations = self.coco_api.loadAnns(annotation_ids)
if not annotations:
return self.__getitem__(0)
sampled_anns = np.random.choice(annotations, min(
len(annotations), self.num_classes_per_sample), replace=False)
conversation = []
for i, ann in enumerate(sampled_anns):
cat_id = ann['category_id']
sampled_cls = self.classes[cat_id]
if isinstance(sampled_cls, tuple):
obj, part = sampled_cls
name = f"{obj} {part}" if random.random() < 0.5 else f"the {part} of the {obj}"
else:
name = sampled_cls
question = random.choice(SEG_QUESTIONS).format(class_name=name)
if i == 0:
question = f"""The {DEFAULT_IMAGE_TOKEN} provides an overview of the picture.\n""" + question
conversation.append(
{'input': question, 'output': random.choice(ANSWER_LIST)})
masks = [self.coco_api.annToMask(ann) for ann in sampled_anns]
masks = np.stack(masks, axis=0)
masks = torch.from_numpy(masks)
data_dict['masks'] = masks
data_dict['conversation'] = conversation
if self.lazy:
result = self.template_map_fn(data_dict)
data_dict.update(result)
result = encode_fn(data_dict, tokenizer=self.tokenizer, max_length=self.max_length, with_image_token=True)
data_dict.update(result)
return data_dict
class PacoSemanticSegDataset(PascalPartSemanticSegDataset):
def json_file_preprocess(self, data_path, image_folder):
self.coco_api = COCO(data_path)
all_classes = self.coco_api.loadCats(self.coco_api.getCatIds())
class_map_paco = {}
for cat in all_classes:
cat_split = cat["name"].strip().split(":")
if len(cat_split) == 1:
name = cat_split[0].split("_(")[0]
else:
assert len(cat_split) == 2
obj, part = cat_split
obj = obj.split("_(")[0]
part = part.split("_(")[0]
name = (obj, part)
class_map_paco[cat["id"]] = name
self.classes = class_map_paco
return self.coco_api.getImgIds()
[
"wall", "building", "sky", "floor", "tree", "ceiling", "road",
"bed", "windowpane", "grass", "cabinet", "sidewalk",
"person", "earth", "door", "table", "mountain", "plant",
"curtain", "chair", "car", "water", "painting", "sofa",
"shelf", "house", "sea", "mirror", "rug", "field", "armchair",
"seat", "fence", "desk", "rock", "wardrobe", "lamp",
"bathtub", "railing", "cushion", "base", "box", "column",
"signboard", "chest of drawers", "counter", "sand", "sink",
"skyscraper", "fireplace", "refrigerator", "grandstand",
"path", "stairs", "runway", "case", "pool table", "pillow",
"screen door", "stairway", "river", "bridge", "bookcase",
"blind", "coffee table", "toilet", "flower", "book", "hill",
"bench", "countertop", "stove", "palm", "kitchen island",
"computer", "swivel chair", "boat", "bar", "arcade machine",
"hovel", "bus", "towel", "light", "truck", "tower",
"chandelier", "awning", "streetlight", "booth",
"television receiver", "airplane", "dirt track", "apparel",
"pole", "land", "bannister", "escalator", "ottoman", "bottle",
"buffet", "poster", "stage", "van", "ship", "fountain",
"conveyer belt", "canopy", "washer", "plaything",
"swimming pool", "stool", "barrel", "basket", "waterfall",
"tent", "bag", "minibike", "cradle", "oven", "ball", "food",
"step", "tank", "trade name", "microwave", "pot", "animal",
"bicycle", "lake", "dishwasher", "screen", "blanket",
"sculpture", "hood", "sconce", "vase", "traffic light",
"tray", "ashcan", "fan", "pier", "crt screen", "plate",
"monitor", "bulletin board", "shower", "radiator", "glass",
"clock", "flag"
]
0: unlabeled
1: person
2: bicycle
3: car
4: motorcycle
5: airplane
6: bus
7: train
8: truck
9: boat
10: traffic light
11: fire hydrant
12: street sign
13: stop sign
14: parking meter
15: bench
16: bird
17: cat
18: dog
19: horse
20: sheep
21: cow
22: elephant
23: bear
24: zebra
25: giraffe
26: hat
27: backpack
28: umbrella
29: shoe
30: eye glasses
31: handbag
32: tie
33: suitcase
34: frisbee
35: skis
36: snowboard
37: sports ball
38: kite
39: baseball bat
40: baseball glove
41: skateboard
42: surfboard
43: tennis racket
44: bottle
45: plate
46: wine glass
47: cup
48: fork
49: knife
50: spoon
51: bowl
52: banana
53: apple
54: sandwich
55: orange
56: broccoli
57: carrot
58: hot dog
59: pizza
60: donut
61: cake
62: chair
63: couch
64: potted plant
65: bed
66: mirror
67: dining table
68: window
69: desk
70: toilet
71: door
72: tv
73: laptop
74: mouse
75: remote
76: keyboard
77: cell phone
78: microwave
79: oven
80: toaster
81: sink
82: refrigerator
83: blender
84: book
85: clock
86: vase
87: scissors
88: teddy bear
89: hair drier
90: toothbrush
91: hair brush
92: banner
93: blanket
94: branch
95: bridge
96: building-other
97: bush
98: cabinet
99: cage
100: cardboard
101: carpet
102: ceiling-other
103: ceiling-tile
104: cloth
105: clothes
106: clouds
107: counter
108: cupboard
109: curtain
110: desk-stuff
111: dirt
112: door-stuff
113: fence
114: floor-marble
115: floor-other
116: floor-stone
117: floor-tile
118: floor-wood
119: flower
120: fog
121: food-other
122: fruit
123: furniture-other
124: grass
125: gravel
126: ground-other
127: hill
128: house
129: leaves
130: light
131: mat
132: metal
133: mirror-stuff
134: moss
135: mountain
136: mud
137: napkin
138: net
139: paper
140: pavement
141: pillow
142: plant-other
143: plastic
144: platform
145: playingfield
146: railing
147: railroad
148: river
149: road
150: rock
151: roof
152: rug
153: salad
154: sand
155: sea
156: shelf
157: sky
158: skyscraper
159: snow
160: solid-other
161: stairs
162: stone
163: straw
164: structural-other
165: table
166: tent
167: textile-other
168: towel
169: tree
170: vegetable
171: wall-brick
172: wall-concrete
173: wall-other
174: wall-panel
175: wall-stone
176: wall-tile
177: wall-wood
178: water-other
179: waterdrops
180: window-blind
181: window-other
182: wood
from PIL import Image
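# Pad a PIL image to a square canvas filled with background_color, keeping the
# original content centred along the shorter side.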
def expand2square(pil_img, background_color):
width, height = pil_img.size
if width == height:
return pil_img
elif width > height:
result = Image.new(pil_img.mode, (width, width), background_color)
result.paste(pil_img, (0, (width - height) // 2))
return result
else:
result = Image.new(pil_img.mode, (height, height), background_color)
result.paste(pil_img, ((height - width) // 2, 0))
return result
CAPTION_QUESTIONS = [
'Could you please give me a detailed description of the image?',
'Can you provide a thorough description of the this image?',
'Please provide a thorough description of the this image',
'Please provide a thorough description of the this image.',
'Please describe in detail the contents of the image.',
'Please describe in detail the contents of the image',
'Could you give a comprehensive explanation of what can be found within this picture?',
'Could you give me an elaborate explanation of this picture?',
'Could you provide me with a detailed analysis of this photo?',
'Could you please give me a detailed description of the image?',
'Can you provide a thorough description of the this image?',
'Please describe in detail the contents of the image',
'Please describe in detail the contents of the image.',
'Can you give a comprehensive explanation of this photo',
'Please provide an elaborate explanation of this picture.',
'Please provide an elaborate explanation of this picture',
'Could you provide me with a detailed analysis of this photo',
]
REGION_QUESTIONS = [
'Can you provide me with a detailed description of the region in the picture marked by <region>?',
"I'm curious about the region represented by <region> in the picture. Could you describe it in detail?",
'What can you tell me about the region indicated by <region> in the image?',
"I'd like to know more about the area in the photo labeled <region>. Can you give me a detailed description?",
'Could you describe the region shown as <region> in the picture in great detail?',
'What details can you give me about the region outlined by <region> in the photo?',
'Please provide me with a comprehensive description of the region marked with <region> in the image.',
'Can you give me a detailed account of the region labeled as <region> in the picture?',
"I'm interested in learning more about the region represented by <region> in the photo. Can you describe it in detail?",
'What is the region outlined by <region> in the picture like? Could you give me a detailed description?',
'Can you provide me with a detailed description of the region in the picture marked by <region>, please?',
"I'm curious about the region represented by <region> in the picture. Could you describe it in detail, please?",
'What can you tell me about the region indicated by <region> in the image, exactly?',
"I'd like to know more about the area in the photo labeled <region>, please. Can you give me a detailed description?",
'Could you describe the region shown as <region> in the picture in great detail, please?',
'What details can you give me about the region outlined by <region> in the photo, please?',
'Please provide me with a comprehensive description of the region marked with <region> in the image, please.',
'Can you give me a detailed account of the region labeled as <region> in the picture, please?',
"I'm interested in learning more about the region represented by <region> in the photo. Can you describe it in detail, please?",
'What is the region outlined by <region> in the picture like, please? Could you give me a detailed description?',
]
REGION_GROUP_QUESTIONS = [
'Could you please give me a detailed description of these areas <region>?',
'Can you provide a thorough description of the regions <region> in this image?',
'Please describe in detail the contents of the boxed areas <region>.',
'Could you give a comprehensive explanation of what can be found within <region> in the picture?',
'Could you give me an elaborate explanation of the <region> regions in this picture?',
'Can you provide a comprehensive description of the areas identified by <region> in this photo?',
'Help me understand the specific locations labeled <region> in this picture in detail, please.',
'What is the detailed information about the areas marked by <region> in this image?',
'Could you provide me with a detailed analysis of the regions designated <region> in this photo?',
'What are the specific features of the areas marked <region> in this picture that you can describe in detail?',
'Could you elaborate on the regions identified by <region> in this image?',
'What can you tell me about the areas labeled <region> in this picture?',
'Can you provide a thorough analysis of the specific locations designated <region> in this photo?',
'I am interested in learning more about the regions marked <region> in this image. Can you provide me with more information?',
'Could you please provide a detailed description of the areas identified by <region> in this photo?',
'What is the significance of the regions labeled <region> in this picture?',
'I would like to know more about the specific locations designated <region> in this image. Can you provide me with more information?',
'Can you provide a detailed breakdown of the regions marked <region> in this photo?',
'What specific features can you tell me about the areas identified by <region> in this picture?',
'Could you please provide a comprehensive explanation of the locations labeled <region> in this image?',
'Can you provide a detailed account of the regions designated <region> in this photo?',
'I am curious about the areas marked <region> in this picture. Can you provide me with a detailed analysis?',
'What important details can you tell me about the specific locations identified by <region> in this image?',
'Could you please provide a detailed description of the regions labeled <region> in this photo?',
'What can you tell me about the features of the areas designated <region> in this picture?',
'Can you provide a comprehensive overview of the regions marked <region> in this image?',
'I would like to know more about the specific locations identified by <region> in this photo. Can you provide me with more information?',
'What is the detailed information you have on the areas labeled <region> in this picture?',
'Could you provide me with a thorough analysis of the regions designated <region> in this image?',
'Can you provide a detailed explanation of the specific locations marked by <region> in this photo?'
]
GCG_QUESTIONS = [
'Could you please give me a detailed description of the image? Please respond with interleaved segmentation masks for the corresponding parts of the answer.',
'Can you provide a thorough description of the this image? Please output with interleaved segmentation masks for the corresponding phrases.',
'Please describe in detail the contents of the image. Please respond with interleaved segmentation masks for the corresponding parts of the answer.',
'Could you give a comprehensive explanation of what can be found within this picture? Please output with interleaved segmentation masks for the corresponding phrases.',
'Could you give me an elaborate explanation of this picture? Please respond with interleaved segmentation masks for the corresponding phrases.',
'Could you provide me with a detailed analysis of this photo? Please output with interleaved segmentation masks for the corresponding parts of the answer.',
]
SEG_QUESTIONS = [
"Can you segment the {class_name} in this image?",
"Please segment {class_name} in this image.",
"What is {class_name} in this image? Please respond with segmentation mask.",
"What is {class_name} in this image? Please output segmentation mask.",
"Can you segment the {class_name} in this image",
"Please segment {class_name} in this image",
"What is {class_name} in this image? Please respond with segmentation mask",
"What is {class_name} in this image? Please output segmentation mask",
"Could you provide a segmentation mask for the {class_name} in this image?",
"Please identify and segment the {class_name} in this image.",
"Where is the {class_name} in this picture? Please respond with a segmentation mask.",
"Can you highlight the {class_name} in this image with a segmentation mask?",
"Could you provide a segmentation mask for the {class_name} in this image",
"Please identify and segment the {class_name} in this image",
"Where is the {class_name} in this picture? Please respond with a segmentation mask",
"Can you highlight the {class_name} in this image with a segmentation mask",
]
ANSWER_LIST = [
"It is [SEG].",
"Sure, [SEG].",
"Sure, it is [SEG].",
"Sure, the segmentation result is [SEG].",
"[SEG].",
]
import torch
import torch.nn as nn
import torch.nn.functional as F
from xtuner.registry import BUILDER
from xtuner.model.utils import LoadWoInit, guess_load_checkpoint
from xtuner.model.llava import LLaVAModel
from mmengine.model import BaseModel
from mmengine import print_log
from projects.glamm.utils import prepare_inputs_labels_for_multimodal
from projects.glamm.utils import DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
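# GLaMM model: a LLaVA-style LLM whose hidden states at [SEG] token positions
# are projected (text_hidden_fcs) into prompt embeddings for a SAM-style
# grounding encoder (image encoder + prompt encoder + mask decoder); the
# grounding encoder is frozen except for its mask decoder.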
class GLaMM(LLaVAModel):
def __init__(self,
use_activation_checkpointing=True,
tokenizer=None,
grounding_encoder=None,
region_encoder=None,
loss_mask=None,
loss_dice=None,
*args, **kwargs):
super(GLaMM, self).__init__(
*args, use_activation_checkpointing=use_activation_checkpointing, **kwargs)
self.use_activation_checkpointing = use_activation_checkpointing
self.tokenizer = BUILDER.build(tokenizer)
self._add_special_tokens()
self.grounding_encoder = BUILDER.build(grounding_encoder)
self.grounding_encoder.requires_grad_(False)
self.grounding_encoder.mask_decoder.requires_grad_(True)
if region_encoder is not None:
self.region_encoder = BUILDER.build(region_encoder)
in_dim = self.config.hidden_size
out_dim = self.grounding_encoder.mask_decoder.transformer_dim
self.text_hidden_fcs = nn.Sequential(
nn.Linear(in_dim, in_dim), nn.ReLU(inplace=True),
nn.Linear(in_dim, out_dim), nn.Dropout(0.0)
)
self.loss_mask = BUILDER.build(loss_mask)
self.loss_dice = BUILDER.build(loss_dice)
def _add_special_tokens(self):
reg_tokens = ['<im_start>', '<im_end>', '<bbox>', '<point>']
segmentation_tokens = ['[SEG]']
phrase_tokens = ['<p>', '</p>']
special_tokens = reg_tokens + segmentation_tokens + phrase_tokens
num_new_tokens = self.tokenizer.add_tokens(
special_tokens, special_tokens=True)
if num_new_tokens > 0:
self.llm.resize_token_embeddings(len(self.tokenizer))
input_embeddings = self.llm.get_input_embeddings().weight.data
output_embeddings = self.llm.get_output_embeddings().weight.data
input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(
dim=0, keepdim=True)
output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(
dim=0, keepdim=True)
input_embeddings[-num_new_tokens:] = input_embeddings_avg
output_embeddings[-num_new_tokens:] = output_embeddings_avg
self.seg_token_idx = self.tokenizer("[SEG]", add_special_tokens=False).input_ids[0]
self.bop_token_idx = self.tokenizer("<p>", add_special_tokens=False).input_ids[0]
self.eop_token_idx = self.tokenizer("</p>", add_special_tokens=False).input_ids[0]
self.bbox_token_idx = self.tokenizer("<bbox>", add_special_tokens=False).input_ids[0]
if self.use_activation_checkpointing or self.use_llm_lora or not self.freeze_llm:
self.llm.enable_input_require_grads()
def forward(self, data, data_samples=None, mode='loss'):
if 'pixel_values' in data:
visual_outputs = self.visual_encoder(
data['pixel_values'].to(self.visual_encoder.dtype),
output_hidden_states=True)
pixel_values = self.projector(
visual_outputs.hidden_states[self.visual_select_layer][:, 1:])
data['pixel_values'] = pixel_values
bboxes = data.pop('bboxes', None)
if bboxes is not None:
select_hidden_state_layer = -2
num_level_reg_features = 4
mlvl_reg_features = visual_outputs.hidden_states[select_hidden_state_layer::-3]
mlvl_reg_features = mlvl_reg_features[::-1]
mlvl_reg_features = mlvl_reg_features[-num_level_reg_features:]
mlvl_reg_features = [item[:, 1:] for item in mlvl_reg_features]
mlvl_reg_features = self.region_encoder(mlvl_reg_features, bboxes)
data = prepare_inputs_labels_for_multimodal(llm=self.llm, **data)
if bboxes is not None:
inputs_embeds = data['inputs_embeds']
for i, reg_feat in enumerate(mlvl_reg_features):
reg_mask = data['new_input_ids'][i] == self.bbox_token_idx
inputs_embeds[i][reg_mask] = reg_feat
data['inputs_embeds'] = inputs_embeds
if mode == 'loss':
return self.compute_loss(data, data_samples)
elif mode == 'predict':
return self.predict(data, data_samples)
elif mode == 'tensor':
return self._forward(data, data_samples)
else:
raise NotImplementedError
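    # compute_loss: run the LLM to obtain hidden states, gather those at [SEG]
    # positions, turn them into SAM prompt embeddings, decode one mask per
    # [SEG], and combine per-image mask/dice losses with the LLM loss.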
def compute_loss(self, data, data_samples=None):
g_pixel_values = data.pop('g_pixel_values', None)
gt_masks = data.pop('masks', None)
new_input_ids = data.pop('new_input_ids', None)
output = self.llm(output_hidden_states=True, **data)
if gt_masks is None:
return {'llm_loss': output.loss}
resize_list = [pixel.shape[-2:] for pixel in g_pixel_values]
ori_size_list = [mask.shape[-2:] for mask in gt_masks]
g_pixel_values = torch.stack([
self.grounding_encoder.preprocess(pixel) for pixel in g_pixel_values
])
image_embeddings = self.grounding_encoder.image_encoder(g_pixel_values)
seg_token_mask = new_input_ids == self.seg_token_idx
hidden_states = output.hidden_states
hidden_states = self.text_hidden_fcs(hidden_states[-1])
pred_embeddings = hidden_states[seg_token_mask]
seg_token_counts = seg_token_mask.int().sum(-1)
pred_embeddings_list = torch.split(pred_embeddings, seg_token_counts.tolist(), dim=0)
pred_masks = self._generate_and_postprocess_masks(
pred_embeddings_list, image_embeddings, resize_list, ori_size_list)
bs = len(pred_masks)
loss_mask, loss_dice = 0, 0
for i in range(bs):
pred_mask = pred_masks[i]
gt_mask = gt_masks[i]
sam_loss_mask = self.loss_mask(pred_mask, gt_mask)
sam_loss_dice = self.loss_dice(pred_mask, gt_mask)
accuracy = torch.eq((pred_mask.sigmoid() > 0.5), gt_mask).to(pred_mask).mean()
loss_mask += sam_loss_mask
loss_dice += sam_loss_dice
loss_dict = {
'loss_mask': loss_mask / bs,
'loss_dice': loss_dice / bs,
'accuracy': accuracy,
'llm_loss': output.loss,
}
return loss_dict
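    # For each sample, the [SEG] embeddings act as text prompts to the prompt
    # encoder; the low-resolution masks from the mask decoder are then resized
    # back to the original image size.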
def _generate_and_postprocess_masks(self, pred_embeddings, image_embeddings, resize_list=None, orig_size_list=None, infer=False):
pred_masks = []
for i, pred_embedding in enumerate(pred_embeddings):
sparse_embeddings, dense_embeddings = self.grounding_encoder.prompt_encoder(
points=None, boxes=None, masks=None, text_embeds=pred_embedding.unsqueeze(1)
)
sparse_embeddings = sparse_embeddings.to(pred_embedding.dtype)
low_res_masks, _ = self.grounding_encoder.mask_decoder(
image_embeddings=image_embeddings[i].unsqueeze(0),
image_pe=self.grounding_encoder.prompt_encoder.get_dense_pe(),
sparse_prompt_embeddings=sparse_embeddings, dense_prompt_embeddings=dense_embeddings,
multimask_output=False, )
pred_mask = self.grounding_encoder.postprocess_masks(
low_res_masks, input_size=resize_list[i], original_size=orig_size_list[i], )
pred_masks.append(pred_mask[:, 0])
return pred_masks
    def predict(self, data, data_samples=None):
        pass
    def _forward(self, data, data_samples=None):
        outputs = self.llm(**data)
        return outputs
from abc import ABCMeta, abstractmethod
from typing import List, Optional, Tuple
from torch import Tensor
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from mmcv import ops
from mmcv.cnn import ConvModule, Linear
from mmengine.model import BaseModule
class BaseRoIExtractor(BaseModule, metaclass=ABCMeta):
"""Base class for RoI extractor.
Args:
roi_layer (:obj:`ConfigDict` or dict): Specify RoI layer type and
arguments.
out_channels (int): Output channels of RoI layers.
featmap_strides (list[int]): Strides of input feature maps.
init_cfg (:obj:`ConfigDict` or dict or list[:obj:`ConfigDict` or \
dict], optional): Initialization config dict. Defaults to None.
"""
def __init__(self,
roi_layer,
out_channels: int,
featmap_strides: List[int],
init_cfg=None) -> None:
super().__init__(init_cfg=init_cfg)
self.roi_layers = self.build_roi_layers(roi_layer, featmap_strides)
self.out_channels = out_channels
self.featmap_strides = featmap_strides
@property
def num_inputs(self) -> int:
"""int: Number of input feature maps."""
return len(self.featmap_strides)
def build_roi_layers(self, layer_cfg,
featmap_strides: List[int]) -> nn.ModuleList:
"""Build RoI operator to extract feature from each level feature map.
Args:
layer_cfg (:obj:`ConfigDict` or dict): Dictionary to construct and
config RoI layer operation. Options are modules under
``mmcv/ops`` such as ``RoIAlign``.
featmap_strides (list[int]): The stride of input feature map w.r.t
to the original image size, which would be used to scale RoI
coordinate (original image coordinate system) to feature
coordinate system.
Returns:
:obj:`nn.ModuleList`: The RoI extractor modules for each level
feature map.
"""
cfg = layer_cfg.copy()
layer_type = cfg.pop('type')
if isinstance(layer_type, str):
assert hasattr(ops, layer_type)
layer_cls = getattr(ops, layer_type)
else:
layer_cls = layer_type
roi_layers = nn.ModuleList(
[layer_cls(spatial_scale=1 / s, **cfg) for s in featmap_strides])
return roi_layers
def roi_rescale(self, rois: Tensor, scale_factor: float) -> Tensor:
"""Scale RoI coordinates by scale factor.
Args:
rois (Tensor): RoI (Region of Interest), shape (n, 5)
scale_factor (float): Scale factor that RoI will be multiplied by.
Returns:
Tensor: Scaled RoI.
"""
cx = (rois[:, 1] + rois[:, 3]) * 0.5
cy = (rois[:, 2] + rois[:, 4]) * 0.5
w = rois[:, 3] - rois[:, 1]
h = rois[:, 4] - rois[:, 2]
new_w = w * scale_factor
new_h = h * scale_factor
x1 = cx - new_w * 0.5
x2 = cx + new_w * 0.5
y1 = cy - new_h * 0.5
y2 = cy + new_h * 0.5
new_rois = torch.stack((rois[:, 0], x1, y1, x2, y2), dim=-1)
return new_rois
@abstractmethod
def forward(self,
feats: Tuple[Tensor],
rois: Tensor,
roi_scale_factor: Optional[float] = None) -> Tensor:
"""Extractor ROI feats.
Args:
feats (Tuple[Tensor]): Multi-scale features.
rois (Tensor): RoIs with the shape (n, 5) where the first
column indicates batch id of each RoI.
roi_scale_factor (Optional[float]): RoI scale factor.
Defaults to None.
Returns:
Tensor: RoI feature.
"""
pass
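# MLVLFuseModule: fuses adjacent feature-pyramid levels by keeping most of each
# level's channels and shuffling a fixed slice (shuffle_channles) in from the
# level above and below (bilinearly resized to the target resolution), after
# appending normalised x/y coordinate channels to every level.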
class MLVLFuseModule(nn.Module):
def __init__(self, input_dims=1024, embed_dims=1024, num_levels=3, num_fuse=4):
super(MLVLFuseModule, self).__init__()
self.embed_dims = embed_dims
self.num_levels = num_levels
self.num_fuse = num_fuse
self.input_dims = input_dims
self.shuffle_channles = embed_dims // 4
# tuples of (target, top, down) level indices that interact during fusion
self.fuse_lvl_list = []
num_levels = self.num_levels
for lvl in range(num_levels):
top_lvl = min(lvl + 1, num_levels - 1)
dow_lvl = max(lvl - 1, 0)
tar_lvl = lvl
self.fuse_lvl_list.append((tar_lvl, top_lvl, dow_lvl))
self.remain_chs = self.embed_dims - self.shuffle_channles * 2
self._init_layers()
def generate_coordinate(self, featmap_sizes, device='cuda'):
x_range = torch.linspace(-1, 1, featmap_sizes[-1], device=device)
y_range = torch.linspace(-1, 1, featmap_sizes[-2], device=device)
y, x = torch.meshgrid(y_range, x_range)
y = y.expand([featmap_sizes[0], 1, -1, -1])
x = x.expand([featmap_sizes[0], 1, -1, -1])
coord_feat = torch.cat([x, y], 1)
return coord_feat
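A small sanity check of generate_coordinate, with an assumed feature shape: it builds a CoordConv-style two-channel grid whose x and y values are normalized to [-1, 1].
fuse = MLVLFuseModule(input_dims=1024, embed_dims=1024, num_levels=3)
coord = fuse.generate_coordinate((2, 1024, 16, 16), device='cpu')
assert coord.shape == (2, 2, 16, 16)   # (batch, [x, y], H, W)
assert coord.min().item() == -1.0 and coord.max().item() == 1.0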
def _init_layers(self):
self.input_conv = nn.ModuleList([nn.Conv2d(self.input_dims + 2,
self.embed_dims, 1)
for _ in range(self.num_levels)])
self.fuse_convs = nn.ModuleList()
for i in range(self.num_fuse):
self.fuse_convs.append(
ConvModule(self.embed_dims,
self.embed_dims,
3,
stride=1,
padding=3 // 2,
conv_cfg=None,
norm_cfg=dict(type='GN',
num_groups=64,
requires_grad=True)
))
def init_weights(self):
pass
def _single_shuffle(self, inputs, conv_module):
if not isinstance(conv_module, (nn.ModuleList, list)):
conv_module = [conv_module]
for single_conv_m in conv_module:
fused_inputs = []
for fuse_lvl_tuple in self.fuse_lvl_list:
tar_lvl, top_lvl, dow_lvl = fuse_lvl_tuple
tar_input = inputs[tar_lvl]
top_input = inputs[top_lvl]
down_input = inputs[dow_lvl]
remain = tar_input[:, :self.remain_chs]
from_top = top_input[:, self.remain_chs:][:, self.shuffle_channles:]
from_top = F.interpolate(from_top.to(torch.float32),
size=tar_input.shape[-2:],
mode='bilinear',
align_corners=True)
from_down = down_input[:, self.remain_chs:][:, :self.shuffle_channles]
from_down = F.interpolate(from_down.to(torch.float32),
size=tar_input.shape[-2:],
mode='bilinear',
align_corners=True)
fused_inputs.append(
torch.cat([remain, from_top.to(remain.dtype), from_down.to(remain.dtype)], dim=1))
fused_inputs = [single_conv_m(item) for item in fused_inputs]
inputs = fused_inputs
return inputs
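The channel bookkeeping behind _single_shuffle, checked with the default sizes (arithmetic illustration only):
embed_dims = 1024
shuffle = embed_dims // 4                  # 256 channels borrowed from each neighbouring level
remain = embed_dims - 2 * shuffle          # 512 channels kept from the target level itself
assert remain + 2 * shuffle == embed_dims  # the fused tensor is back to embed_dims channels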
def forward(self, inputs, ):
feat_size = [item.shape for item in inputs]
new_inputs = []
for feat, single_feat_size in zip(inputs, feat_size):
coord_feat = self.generate_coordinate(
single_feat_size, device=inputs[0].device)
# feat = torch.cat([feat, coord_feat], dim=1)
feat = torch.cat([feat, coord_feat.to(feat.dtype)], dim=1)
new_inputs.append(feat)
inputs = new_inputs
inputs = [self.input_conv[lvl](item)
for lvl, item in enumerate(inputs)]
for conv_m in self.fuse_convs:
inputs = self._single_shuffle(inputs, [conv_m])
return inputs
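A minimal usage sketch for MLVLFuseModule, with assumed feature sizes: each level gets the two coordinate channels appended, is projected back to embed_dims by a 1x1 conv, and then exchanges channels with its neighbours num_fuse times.
feats = [torch.randn(1, 1024, 64, 64),
         torch.randn(1, 1024, 32, 32),
         torch.randn(1, 1024, 16, 16)]
fuse = MLVLFuseModule(input_dims=1024, embed_dims=1024, num_levels=3, num_fuse=4)
outs = fuse(feats)   # list of 3 tensors, each (1, 1024, H_l, W_l)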
class MlvlRoIExtractor(BaseRoIExtractor):
def __init__(self,
roi_layer,
out_channels,
featmap_strides,
embed_dims=1024,
stride=1,
norm_init=True,
fuse_level=3,
finest_scale=56,
init_cfg=None):
super(MlvlRoIExtractor, self).__init__(roi_layer, out_channels,
featmap_strides, init_cfg)
self.embed_dims = embed_dims
self.finest_scale = finest_scale
self.fuse_level = fuse_level
self.norm_init = norm_init
self.pconvs = nn.ModuleList(
nn.Conv2d(self.embed_dims, self.embed_dims, 3, stride=1, padding=1)
for _ in range(self.fuse_level))
self.pos_embedd = nn.Sequential(
nn.Linear(4, 256),
nn.ReLU(inplace=True),
nn.LayerNorm(256),
nn.Linear(256, 1024),
nn.ReLU(inplace=True),
nn.LayerNorm(1024),
)
self.updims = nn.Linear(1024, 4096)
self.flatten_linear = nn.Linear(
self.embed_dims * self.roi_layers[0].output_size[0] ** 2, 1024)
self.norm_init_weights()
# self.dtype = torch.float32
def norm_init_weights(self):
pass
def forward(self, feats, rois, roi_scale_factor=None):
"""Forward function."""
num_imgs = len(rois)
# feats = [item for item in feats]
batch_rois = torch.cat(rois, dim=0).to(feats[0].dtype)
pos_embedd = self.pos_embedd(batch_rois)
out_size = self.roi_layers[0].output_size
num_levels = len(feats)
if feats[0].dim() == 3:
h = w = int(math.sqrt(feats[0].shape[1]))
assert h == 16
assert w == 16
b, c = feats[0].shape[0], feats[0].shape[-1]
feats = [item.reshape(b, h, w, c).permute(0, 3, 1, 2)
for item in feats]
new_rois = []
for img_id, single_img_roi in enumerate(rois):
# rescale to original img scale
single_img_roi = single_img_roi * 224
roi_img_id = single_img_roi.new_ones(len(single_img_roi)) * img_id
single_img_roi = torch.cat(
[roi_img_id[:, None], single_img_roi], dim=1)
new_rois.append(single_img_roi)
rois = torch.cat(new_rois)
roi_feats = feats[0].new_zeros(self.fuse_level,
rois.size(0), self.out_channels, *out_size)
for i in range(num_levels):
if len(rois) > 0:
rois_ = rois
ori_dtype = feats[i].dtype
roi_feats_t = self.roi_layers[i](feats[i].to(
torch.float32), rois_.to(torch.float32))
roi_feats[i] = roi_feats_t.to(ori_dtype)
else:
roi_feats += sum(
x.view(-1)[0]
for x in self.parameters()) * 0. + feats[i].sum() * 0.
fuse_roi_feats = []
for i in range(self.fuse_level):
fuse_roi_feats.append(self.pconvs[i](roi_feats[i]))
fuse_roi_feats = sum(fuse_roi_feats)
fuse_roi_feats = F.relu(fuse_roi_feats)
fuse_roi_feats = fuse_roi_feats.flatten(1, -1)
fuse_roi_feats = self.flatten_linear(fuse_roi_feats)
fuse_roi_feats = fuse_roi_feats + pos_embedd
fuse_roi_feats = self.updims(fuse_roi_feats)
query_feats = []
for i in range(num_imgs):
mask = rois[:, 0] == i
query_feats.append(fuse_roi_feats[mask])
return query_feats
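A hedged check of the flattened dimension that flatten_linear expects in the forward pass above; 14 is the RoIAlign output size configured by MLVLROIQueryModule below, and 4096 comes from updims:
embed_dims, pool_size = 1024, 14
flatten_dim = embed_dims * pool_size ** 2   # 200704 features per RoI before the 1024-d projection
# each per-image entry of query_feats has shape (num_boxes_in_image, 4096) after updims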
class MLVLROIQueryModule(nn.Module):
def __init__(self, embed_dims=1024, out_dims=4096,
num_levels=3):
super(MLVLROIQueryModule, self).__init__()
self.mlvl_fuse = MLVLFuseModule(input_dims=embed_dims,
embed_dims=embed_dims,
num_levels=num_levels,
num_fuse=5)
strids = [14 / 8, 14 / 4, 14 / 2, 14]
assert len(strids) == num_levels
bbox_roi_extractor = dict(roi_layer=dict(type='RoIAlign',
output_size=14,
sampling_ratio=2),
out_channels=embed_dims,
embed_dims=embed_dims,
fuse_level=num_levels,
featmap_strides=strids)
self.roi_align = MlvlRoIExtractor(**bbox_roi_extractor)
def forward(self, mlvl_feats, bboxes):
if mlvl_feats[0].dim() == 3:
h = w = int(math.sqrt(mlvl_feats[0].shape[1]))
assert h == 24
assert w == 24
b, c = mlvl_feats[0].shape[0], mlvl_feats[0].shape[-1]
mlvl_feats = [item.reshape(b, h, w, c).permute(0, 3, 1, 2) for item in mlvl_feats]
base_shape = mlvl_feats[0].shape[-2:]
num_level = len(mlvl_feats)
to_shape = [(base_shape[0] * 2 ** level, base_shape[1] * 2 ** level)
for level in range(num_level)]
to_shape = to_shape[::-1]
for level in range(num_level):
feat = mlvl_feats[level]
shape = to_shape[level]
# feat = feat
# mlvl_feats[level] = F.interpolate(feat, size=shape, mode='bilinear', align_corners=True)
# todo: temporary fix for "upsample_bilinear2d_out_frame" not implemented for 'BFloat16'
feat = feat.to(torch.float32)
mlvl_feats[level] = F.interpolate(
feat, size=shape, mode='bilinear', align_corners=True)
mlvl_feats[level] = mlvl_feats[level].to(torch.bfloat16)
mlvl_feats = self.mlvl_fuse(mlvl_feats)
return self.roi_align(mlvl_feats, bboxes)
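An end-to-end usage sketch; the shapes and num_levels=4 are assumptions, and the module is cast to bfloat16 to match the hard-coded cast in forward, so a bfloat16-capable backend is required:
module = MLVLROIQueryModule(embed_dims=1024, out_dims=4096, num_levels=4).to(torch.bfloat16)
mlvl_feats = [torch.randn(1, 576, 1024).to(torch.bfloat16) for _ in range(4)]  # 24*24 = 576 tokens per level
bboxes = [torch.tensor([[0.1, 0.2, 0.6, 0.8]])]   # one box per image, normalized to [0, 1]
query_feats = module(mlvl_feats, bboxes)          # list with one (1, 4096) tensor per image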