"value":"<video>\nCan you give me an overview of the video content?"
},
{
"from":"gpt",
"value":"The video showcases a group of men washing cars in a parking lot using various tools and techniques, such as pressure washing hoses, hand towels, and brushes. There are also text displays of special offers and business details throughout the video. The men work on different cars, including a white car and a black SUV."
}
]
},
{
"video":"v_TpB_zMG3XBA.mp4",
"conversations":[
{
"from":"human",
"value":"<video>\nCan you describe what the woman is doing on the lake?"
},
{
"from":"gpt",
"value":"The woman is pulled forward in the lake and then begins to glide on the water using a surfboard. She sways left and right while sliding, and then slowly squats down, leaving the surfboard and falling into the water and floating on the surface."
}
]
},
{
"video":"v_rBMQFpHspmo.mp4",
"conversations":[
{
"from":"human",
"value":"<video>\nWhat is the main activity happening in the video?"
},
{
"from":"gpt",
"value":"The main activity in the video is a boy attempting to pole vault over a bar in an indoor field while being watched by several people."
"""Different from the original implementation, Qwen3VL use timestamps rather than absolute time position ids."""
# Since we use timestamps to separate videos, like <t1> <vision_start> <frame1> <vision_end> <t2> <vision_start> <frame2> <vision_end>, the video_grid_thw should also be split accordingly
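As a minimal sketch of that split (split_video_grid_thw is a hypothetical helper, not part of the Qwen3VL API): a video grid [t, h, w] becomes t per-frame grids [1, h, w], so each frame can sit between its own <t_i> <vision_start> ... <vision_end> markers.

import torch

def split_video_grid_thw(video_grid_thw: torch.Tensor) -> torch.Tensor:
    # Expand each [t, h, w] row into t rows of [1, h, w], one per frame.
    split_grids = []
    for t, h, w in video_grid_thw.tolist():
        split_grids.extend([[1, h, w]] * t)
    return torch.tensor(split_grids, dtype=video_grid_thw.dtype)

# One video with 3 temporal patches, each a 2x2 spatial grid:
print(split_video_grid_thw(torch.tensor([[3, 2, 2]])))
# tensor([[1, 2, 2],
#         [1, 2, 2],
#         [1, 2, 2]])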
Calculate the 3D RoPE index based on the temporal, height, and width dimensions of images and videos in the LLM.
Explanation:
Each embedding sequence contains vision embeddings and text embeddings, or text embeddings alone.
For a pure-text embedding sequence, the rotary position embedding is no different from that of modern LLMs.
Examples:
input_ids: [T T T T T], here T is for text.
temporal position_ids: [0, 1, 2, 3, 4]
height position_ids: [0, 1, 2, 3, 4]
width position_ids: [0, 1, 2, 3, 4]
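As a minimal sketch of the pure-text case: all three components (temporal, height, width) share the same 1D positions, so they can be built by broadcasting a single arange.

import torch

seq_len = 5
# Shape (3, seq_len): identical rows for temporal, height, and width.
position_ids = torch.arange(seq_len).unsqueeze(0).expand(3, -1)
print(position_ids)
# tensor([[0, 1, 2, 3, 4],
#         [0, 1, 2, 3, 4],
#         [0, 1, 2, 3, 4]])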
For a sequence with both vision and text embeddings, we calculate a 3D rotary position embedding for the vision part
and a 1D rotary position embedding for the text part.
Examples:
Temporal (Time): 3 patches, representing different segments of the video in time.
Height: 2 patches, dividing each frame vertically.
Width: 2 patches, dividing each frame horizontally.
We also have some important parameters:
fps (Frames Per Second): The video's frame rate, set to 1. This means one frame is processed each second.
tokens_per_second: A crucial parameter. It dictates how many "time-steps" or "temporal tokens" are conceptually packed into one second of video. Here it is 25, so each second of video is represented by 25 distinct time points; it defines the temporal granularity.
temporal_patch_size: The number of frames that compose one temporal patch. Here, it's 2 frames.
interval: The step size for the temporal position IDs, calculated as tokens_per_second * temporal_patch_size / fps. In this case, 25 * 2 / 1 = 50, so consecutive temporal patches will have a difference of 50 in their temporal position IDs (see the sketch after this example).
input_ids: [V V V V V V V V V V V V T T T T T], here V is for vision.
vision temporal position_ids: [0, 0, 0, 0, 50, 50, 50, 50, 100, 100, 100, 100]
vision height position_ids: [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
vision width position_ids: [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
text temporal position_ids: [101, 102, 103, 104, 105]
text height position_ids: [101, 102, 103, 104, 105]
text width position_ids: [101, 102, 103, 104, 105]
Here we calculate the text start position_ids as the max vision position_ids plus 1.
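The sketch below reproduces the position IDs of this example; it is an illustrative reimplementation under the stated assumptions (3 temporal patches, a 2x2 spatial grid per patch, interval = 50, 5 trailing text tokens), not the model's actual get_rope_index code.

import torch

fps, tokens_per_second, temporal_patch_size = 1, 25, 2
interval = int(tokens_per_second * temporal_patch_size / fps)  # 50
t, h, w, n_text = 3, 2, 2, 5

# Vision part: temporal IDs step by `interval`; height/width IDs tile the grid.
t_ids = (torch.arange(t) * interval).repeat_interleave(h * w)  # [0]*4 + [50]*4 + [100]*4
h_ids = torch.arange(h).repeat_interleave(w).repeat(t)         # [0, 0, 1, 1] * 3
w_ids = torch.arange(w).repeat(h).repeat(t)                    # [0, 1, 0, 1] * 3
vision_ids = torch.stack([t_ids, h_ids, w_ids])                # shape (3, 12)

# Text part: 1D positions starting at max vision position ID + 1,
# shared across all three components.
text_start = vision_ids.max() + 1                              # 101
text_ids = (text_start + torch.arange(n_text)).expand(3, -1)   # shape (3, 5)

position_ids = torch.cat([vision_ids, text_ids], dim=1)        # shape (3, 17)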