Commit 6aa057c9, authored by storyicon, committed by GitHub

[Multimodal] Support custom video metadata for pre-extracted frame sequences (#40133)


Signed-off-by: storyicon <storyicon@foxmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
parent a2bd09c9
......@@ -780,6 +780,70 @@ vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct \
Works with common video formats like MP4 when using OpenCV backends.
#### Pre-extracted Frame Sequences with `media_io_kwargs`
When you extract video frames on the client side and send them as a `video/jpeg` data URL (comma-separated, base64-encoded JPEG frames), you can pass the original video's metadata through `media_io_kwargs` in your request. This preserves temporal information that would otherwise be lost during client-side frame extraction, which can improve video understanding.
**Supported Parameters:**
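| Parameter | Type | Description |
| --------- | ---- | ----------- |
| `fps` | float | Frame rate of the original video (default: `1`) |
| `frames_indices` | list[int] | Indices in the original video of the frames actually sent; the list length must match the number of frames sent (default: `0..N-1` for `N` frames) |
| `total_num_frames` | int | Total frame count of the original video; must be at least the number of frames sent (default: the number of frames sent) |
| `duration` | float | Duration of the original video in seconds; must be non-negative (default: `total_num_frames / fps`) |
| `do_sample_frames` | bool | Whether to perform frame sampling (default: `False`) |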
??? code
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Client-side frame extraction.
# extract_frames, encode_image, and video_path are placeholders that you supply.
frames = extract_frames(video_path, num_frames=32)
frames_b64 = ",".join([encode_image(f) for f in frames])
video_url = f"data:video/jpeg;base64,{frames_b64}"

# Pass video metadata via media_io_kwargs
response = client.chat.completions.create(
    model="your-multimodal-model",
    messages=[{
        "role": "user",
        "content": [
            {"type": "video_url", "video_url": {"url": video_url}},
            {"type": "text", "text": "Describe what happens in this video."}
        ]
    }],
    extra_body={
        "media_io_kwargs": {
            "video": {
                "fps": 30.0,
                "frames_indices": [0, 10, 20, 30, 40, 50, 60, 70, 80, 90,
                                   100, 110, 120, 130, 140, 150, 160, 170,
                                   180, 190, 200, 210, 220, 230, 240, 250,
                                   260, 270, 280, 290, 300, 310],
                "total_num_frames": 900,
                "duration": 30.0,
            }
        }
    },
)

print(response.choices[0].message.content)
```
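The `encode_image` helper used in the request example is not defined there. A minimal sketch of one way to build the data URL, assuming each frame is already available as raw JPEG bytes; the names `encode_jpeg_bytes` and `build_video_jpeg_url` are hypothetical and not part of vLLM or the OpenAI client:

```python
import base64


def encode_jpeg_bytes(jpeg_bytes: bytes) -> str:
    """Base64-encode one JPEG-encoded frame (hypothetical helper)."""
    return base64.b64encode(jpeg_bytes).decode("ascii")


def build_video_jpeg_url(jpeg_frames: list[bytes]) -> str:
    """Join base64-encoded frames with commas into a video/jpeg data URL."""
    if not jpeg_frames:
        raise ValueError("at least one frame is required")
    frames_b64 = ",".join(encode_jpeg_bytes(f) for f in jpeg_frames)
    return f"data:video/jpeg;base64,{frames_b64}"
```

The standard base64 alphabet contains no commas, so the comma is a safe frame separator inside the payload.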
**Why use `media_io_kwargs`?**
When frames are extracted client-side, the server loses important context about the original video:
- **Temporal information**: Which frames were sampled and their positions in the original timeline
- **Video duration**: How long the original video was
- **Frame rate**: The original playback speed
By passing this metadata, the model can better understand the temporal distribution of the sampled frames and whether important moments might have been skipped.
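To make the temporal information concrete: given `fps` and `frames_indices`, each sampled frame maps back to a timestamp in the original video as `index / fps`. The sketch below is illustrative arithmetic only, using the values from the request example; `frame_timestamps` is a hypothetical helper, and how a particular model consumes this metadata is not specified here.

```python
def frame_timestamps(frames_indices: list[int], fps: float) -> list[float]:
    """Map original-video frame indices to timestamps in seconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return [index / fps for index in frames_indices]


# Values taken from the request example: 32 frames, every 10th frame.
indices = list(range(0, 320, 10))
timestamps = frame_timestamps(indices, fps=30.0)

# Timestamp of the last sampled frame vs. the full video duration.
last_sampled = timestamps[-1]  # 310 / 30.0
duration = 900 / 30.0          # total_num_frames / fps
```

With these values the last sampled frame sits at roughly 10.3 seconds of a 30-second video, which is exactly the kind of gap the metadata exposes.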
#### Custom RGBA Background Color
To use a custom background color for RGBA images, pass the `rgba_background_color` parameter via `--media-io-kwargs`:
......
......@@ -92,14 +92,48 @@ class VideoMediaIO(MediaIO[tuple[npt.NDArray, dict[str, Any]]]):
         )
         total = int(frames.shape[0])
         fps = float(self.kwargs.get("fps", 1))
-        duration = total / fps if fps > 0 else 0.0
+        # validate and extract frames_indices
+        frames_indices = self.kwargs.get("frames_indices")
+        if frames_indices is not None:
+            if not (
+                isinstance(frames_indices, list)
+                and all(isinstance(i, int) for i in frames_indices)
+            ):
+                raise ValueError("frames_indices must be a list of integers")
+            if len(frames_indices) != total:
+                raise ValueError(
+                    f"frames_indices length ({len(frames_indices)}) must "
+                    f"match number of frames sent ({total})"
+                )
+        else:
+            frames_indices = list(range(total))
+        # validate and extract total_num_frames
+        total_num_frames = self.kwargs.get("total_num_frames", total)
+        if not isinstance(total_num_frames, int) or total_num_frames < 1:
+            raise ValueError("total_num_frames must be a positive integer")
+        if total_num_frames < total:
+            raise ValueError(
+                f"total_num_frames ({total_num_frames}) must be >= "
+                f"number of frames sent ({total})"
+            )
+        # validate and extract duration
+        duration = self.kwargs.get("duration")
+        if duration is not None:
+            if not isinstance(duration, (int, float)) or duration < 0:
+                raise ValueError("duration must be a non-negative number")
+        else:
+            duration = total_num_frames / fps if fps > 0 else 0.0
         metadata = {
-            "total_num_frames": total,
+            "total_num_frames": total_num_frames,
             "fps": fps,
             "duration": duration,
             "video_backend": "jpeg_sequence",
-            "frames_indices": list(range(total)),
-            "do_sample_frames": False,
+            "frames_indices": frames_indices,
+            "do_sample_frames": self.kwargs.get("do_sample_frames", False),
         }
         return frames, metadata
......
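The validation added in this hunk can be exercised outside vLLM with a standalone sketch. The function below restates the added logic for illustration only; the name `resolve_jpeg_sequence_metadata` and its signature are assumptions, not vLLM API, and `num_frames_sent` stands in for `frames.shape[0]`.

```python
from typing import Any


def resolve_jpeg_sequence_metadata(
    num_frames_sent: int, kwargs: dict[str, Any]
) -> dict[str, Any]:
    """Hypothetical standalone restatement of the metadata validation."""
    fps = float(kwargs.get("fps", 1))

    # frames_indices: one integer index per frame sent, else 0..N-1.
    frames_indices = kwargs.get("frames_indices")
    if frames_indices is not None:
        if not (
            isinstance(frames_indices, list)
            and all(isinstance(i, int) for i in frames_indices)
        ):
            raise ValueError("frames_indices must be a list of integers")
        if len(frames_indices) != num_frames_sent:
            raise ValueError("frames_indices length must match frames sent")
    else:
        frames_indices = list(range(num_frames_sent))

    # total_num_frames: positive and no smaller than the frames sent.
    total_num_frames = kwargs.get("total_num_frames", num_frames_sent)
    if not isinstance(total_num_frames, int) or total_num_frames < 1:
        raise ValueError("total_num_frames must be a positive integer")
    if total_num_frames < num_frames_sent:
        raise ValueError("total_num_frames must be >= frames sent")

    # duration: explicit value wins, otherwise derived from count and fps.
    duration = kwargs.get("duration")
    if duration is not None:
        if not isinstance(duration, (int, float)) or duration < 0:
            raise ValueError("duration must be a non-negative number")
    else:
        duration = total_num_frames / fps if fps > 0 else 0.0

    return {
        "total_num_frames": total_num_frames,
        "fps": fps,
        "duration": duration,
        "video_backend": "jpeg_sequence",
        "frames_indices": frames_indices,
        "do_sample_frames": kwargs.get("do_sample_frames", False),
    }
```

Called with an empty `kwargs`, the sketch falls back to the pre-change behavior: indices `0..N-1`, `total_num_frames` equal to the number of frames sent, and `do_sample_frames` set to `False`.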