Unverified Commit e3bb7f5a authored by Kevin Xiang Li, committed by GitHub

benchmark: enhance configurable multimodal benchmarking in bench_serving (#9812)


Co-authored-by: Xiang (Kevin) Li <lik@nvidia.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
parent 92473e2e
@@ -59,15 +59,16 @@ Select with `--dataset-name`:

- `sharegpt` (default): loads ShareGPT-style pairs; optionally restrict with `--sharegpt-context-len` and override outputs with `--sharegpt-output-len`
- `random`: random text lengths; sampled from ShareGPT token space
- `random-ids`: random token ids (can lead to gibberish)
- `image`: generates images and wraps them in chat messages; supports custom resolutions, multiple formats, and different content types
- `generated-shared-prefix`: synthetic dataset with shared long system prompts and short questions
- `mmmu`: samples from MMMU (Math split) and includes images

Common dataset flags:

- `--num-prompts N`: number of requests
- `--random-input-len`, `--random-output-len`, `--random-range-ratio`: for random/random-ids/image
- `--image-count`: Number of images per request (for `image` dataset).
- `--apply-chat-template`: apply tokenizer chat template when constructing prompts
- `--dataset-path PATH`: file path for ShareGPT json; if blank and missing, it will be downloaded and cached
@@ -79,14 +80,16 @@ Generated Shared Prefix flags (for `generated-shared-prefix`):

- `--gsp-question-len`
- `--gsp-output-len`

Image dataset flags (for `image`):

- `--image-count`: Number of images per request
- `--image-resolution`: Image resolution; supports presets (4k, 1080p, 720p, 360p) or custom 'heightxwidth' format (e.g., 1080x1920, 512x768)
- `--image-format`: Image format (jpeg or png)
- `--image-content`: Image content type (random or blank)
### Examples

1. To benchmark the `image` dataset with 3 images per request, 500 prompts, 512 input length, and 512 output length, you can run:

```bash
python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-3B-Instruct --disable-radix-cache
```

@@ -95,10 +98,10 @@ python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-3B-Instruct --disabl

```bash
python -m sglang.bench_serving \
--backend sglang-oai-chat \
--dataset-name image \
--num-prompts 500 \
--image-count 3 \
--image-resolution 720p \
--random-input-len 512 \
--random-output-len 512
```
@@ -159,9 +162,10 @@ The script will add `Authorization: Bearer $OPENAI_API_KEY` automatically for Op

Printed after each run:

- Request throughput (req/s)
- Input token throughput (tok/s) - includes both text and vision tokens
- Output token throughput (tok/s)
- Total token throughput (tok/s) - includes both text and vision tokens
- Total input text tokens and Total input vision tokens - per-modality breakdown
- Concurrency: aggregate time of all requests divided by wall time
- End-to-End Latency (ms): mean/median/std/p99 per-request total latency
- Time to First Token (TTFT, ms): mean/median/std/p99 for streaming mode
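The concurrency metric above can be illustrated with a short sketch (the latency and duration numbers are hypothetical, not produced by the script):

```python
# Concurrency = aggregate time of all requests / wall-clock benchmark duration.
# Hypothetical per-request end-to-end latencies, in seconds.
latencies = [2.0, 2.0, 4.0]
wall_time = 4.0  # benchmark duration (s)

concurrency = sum(latencies) / wall_time
print(concurrency)  # 2.0 -> on average two requests were in flight at once
```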
@@ -227,31 +231,48 @@ python3 -m sglang.bench_serving \

```bash
--apply-chat-template
```

4) Images (VLM) with chat template:

```bash
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 --port 30000 \
--model your-vlm-model \
--dataset-name image \
--image-count 2 \
--image-resolution 720p \
--random-input-len 128 --random-output-len 256 \
--num-prompts 200 \
--apply-chat-template
```
4a) Images with custom resolution:
```bash
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 --port 30000 \
--model your-vlm-model \
--dataset-name image \
--image-count 1 \
--image-resolution 512x768 \
--random-input-len 64 --random-output-len 128 \
--num-prompts 100 \
--apply-chat-template
```
4b) 1080p images with PNG format and blank content:
```bash
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 --port 30000 \
--model your-vlm-model \
--dataset-name image \
--image-count 1 \
--image-resolution 1080p \
--image-format png \
--image-content blank \
--random-input-len 64 --random-output-len 128 \
--num-prompts 100 \
--apply-chat-template
```
@@ -325,7 +346,7 @@ python3 -m sglang.bench_serving \

- All requests failed: verify `--backend`, server URL/port, `--model`, and authentication. Check warmup errors printed by the script.
- Throughput seems too low: adjust `--request-rate` and `--max-concurrency`; verify server batch size/scheduling; ensure streaming is enabled if appropriate.
- Token counts look odd: prefer chat/instruct models with proper chat templates; otherwise tokenization of gibberish may be inconsistent.
- Image/MMMU datasets: ensure you installed extra deps (`pillow`, `datasets`, `pybase64`).
- Authentication errors (401/403): set `OPENAI_API_KEY` or disable auth on your server.

### Notes

...
@@ -35,6 +35,7 @@ import numpy as np
import requests
from tqdm.asyncio import tqdm
from transformers import (
    AutoProcessor,
    AutoTokenizer,
    PreTrainedTokenizer,
    PreTrainedTokenizerBase,

@@ -327,8 +328,9 @@ async def async_request_openai_chat_completions(
        "model": request_func_input.model,
        "messages": messages,
        "temperature": 0.0,
        "max_completion_tokens": request_func_input.output_len,
        "stream": not args.disable_stream,
        "ignore_eos": not args.disable_ignore_eos,
        **request_func_input.extra_request_body,
    }
@@ -659,7 +661,30 @@ def get_tokenizer(
    )


def get_processor(
    pretrained_model_name_or_path: str,
) -> Union[PreTrainedTokenizer, PreTrainedTokenizerFast]:
    assert (
        pretrained_model_name_or_path is not None
        and pretrained_model_name_or_path != ""
    )
    if pretrained_model_name_or_path.endswith(
        ".json"
    ) or pretrained_model_name_or_path.endswith(".model"):
        from sglang.srt.hf_transformers_utils import get_processor

        return get_processor(pretrained_model_name_or_path)

    if pretrained_model_name_or_path is not None and not os.path.exists(
        pretrained_model_name_or_path
    ):
        pretrained_model_name_or_path = get_model(pretrained_model_name_or_path)
    return AutoProcessor.from_pretrained(
        pretrained_model_name_or_path, trust_remote_code=True
    )


def get_dataset(args, tokenizer, model_id=None):
    tokenize_prompt = getattr(args, "tokenize_prompt", False)
    if args.dataset_name == "sharegpt":
        assert not tokenize_prompt
@@ -672,7 +697,7 @@ def get_dataset(args, tokenizer):
            prompt_suffix=args.prompt_suffix,
            apply_chat_template=args.apply_chat_template,
        )
    elif args.dataset_name.startswith("random"):
        input_requests = sample_random_requests(
            input_len=args.random_input_len,
            output_len=args.random_output_len,
@@ -683,17 +708,18 @@ def get_dataset(args, tokenizer):
            random_sample=args.dataset_name == "random",
            return_text=not tokenize_prompt,
        )
    elif args.dataset_name == "image":
        processor = get_processor(model_id)
        input_requests = sample_image_requests(
            num_requests=args.num_prompts,
            image_count=args.image_count,
            input_len=args.random_input_len,
            output_len=args.random_output_len,
            range_ratio=args.random_range_ratio,
            processor=processor,
            image_content=args.image_content,
            image_format=args.image_format,
            image_resolution=args.image_resolution,
        )
    elif args.dataset_name == "generated-shared-prefix":
        assert not tokenize_prompt
@@ -707,12 +733,11 @@ def get_dataset(args, tokenizer):
            args=args,
        )
    elif args.dataset_name == "mmmu":
        processor = get_processor(model_id)
        input_requests = sample_mmmu_requests(
            num_requests=args.num_prompts,
            processor=processor,
            fixed_output_len=args.random_output_len,
            random_sample=True,
        )
    elif args.dataset_name == "mooncake":
@@ -757,6 +782,8 @@ ASYNC_REQUEST_FUNCS = {
class BenchmarkMetrics:
    completed: int
    total_input: int
    total_input_text: int
    total_input_vision: int
    total_output: int
    total_output_retokenized: int
    request_throughput: float
@@ -850,9 +877,17 @@ class DatasetRow:
    prompt: str
    prompt_len: int
    output_len: int
    text_prompt_len: Optional[int] = None
    vision_prompt_len: Optional[int] = None
    image_data: Optional[List[str]] = None
    timestamp: Optional[float] = None

    def __post_init__(self):
        if self.text_prompt_len is None:
            self.text_prompt_len = self.prompt_len
        if self.vision_prompt_len is None:
            self.vision_prompt_len = 0
async def get_mooncake_request_over_time(
    input_requests: List[Dict],
@@ -929,9 +964,8 @@ async def get_mooncake_request_over_time(
def sample_mmmu_requests(
    num_requests: int,
    processor: AutoProcessor,
    fixed_output_len: Optional[int] = None,
    random_sample: bool = True,
) -> List[DatasetRow]:
    """
@@ -1010,54 +1044,12 @@ def sample_mmmu_requests(
            question = example.get("question")

            # Construct the prompt
            text_prompt = f"Question: {question}\n\nAnswer: "
            output_len = fixed_output_len if fixed_output_len is not None else 256
            data_row = create_mm_data_row(
                text_prompt, [image], [image_data], output_len, processor
            )
            filtered_dataset.append(data_row)

        except Exception as e:
            print(f"Error processing example {i}: {e}")
@@ -1145,7 +1137,11 @@ def sample_sharegpt_requests(
            continue

        filtered_dataset.append(
            DatasetRow(
                prompt=prompt,
                prompt_len=prompt_len,
                output_len=output_len,
            )
        )

    print(f"#Input tokens: {np.sum([x.prompt_len for x in filtered_dataset])}")
@@ -1256,7 +1252,7 @@ def sample_random_requests(
    return input_requests


def parse_image_resolution(image_resolution: str) -> Tuple[int, int]:
    """Parse image resolution into (width, height).

    Supports presets '1080p', '720p', '360p' and custom 'heightxwidth' format
@@ -1281,24 +1277,79 @@ def parse_random_image_resolution(image_resolution: str) -> Tuple[int, int]:
        return (width, height)

    raise ValueError(
        f"Unsupported image resolution: {image_resolution}. "
        "Choose from 4k, 1080p, 720p, 360p, or provide custom 'heightxwidth' (e.g., 1080x1920)."
    )
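The preset-or-custom parsing above (the middle of the function is elided by the hunk) can be sketched independently; the helper name is illustrative, and the preset dimensions below are taken from the `sample_image_requests` docstring:

```python
from typing import Tuple

# Illustrative re-implementation of the preset/custom resolution parsing.
PRESETS = {
    "4k": (3840, 2160),
    "1080p": (1920, 1080),
    "720p": (1280, 720),
    "360p": (640, 360),
}


def parse_resolution_sketch(spec: str) -> Tuple[int, int]:
    """Return (width, height) for a preset or a custom 'heightxwidth' string."""
    if spec in PRESETS:
        return PRESETS[spec]
    parts = spec.split("x")
    if len(parts) == 2 and all(p.isdigit() for p in parts):
        height, width = int(parts[0]), int(parts[1])
        if height > 0 and width > 0:
            return (width, height)
    raise ValueError(f"Unsupported image resolution: {spec}")


print(parse_resolution_sketch("720p"))     # (1280, 720)
print(parse_resolution_sketch("512x768"))  # (768, 512)
```

Note the custom format is height-first, while the return value is `(width, height)`, matching the docstring.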
def create_mm_data_row(text_prompt, images, images_base64, output_len, processor):
    try:
        content_items = [
            {"type": "image_url", "image_url": {"url": img_url}}
            for img_url in images_base64
        ]
        content_items.append({"type": "text", "text": text_prompt})
        prompt_str = processor.apply_chat_template(
            [{"role": "user", "content": content_items}],
            add_generation_prompt=True,
            tokenize=False,
        )
    except Exception:
        # Some tokenizers do not support list content; fall back to a placeholder in the text
        prompt_str = f"<image>{text_prompt}"

    # Calculate total tokens (text + vision)
    prompt_len = processor(
        text=[prompt_str],
        images=images,
        padding=False,
        return_tensors="pt",
    )["input_ids"].numel()

    # Calculate text-only tokens
    try:
        # Create text-only version of the prompt
        text_only_prompt = processor.apply_chat_template(
            [{"role": "user", "content": text_prompt}],
            add_generation_prompt=True,
            tokenize=False,
        )
        text_prompt_len = processor(
            text=[text_only_prompt],
            padding=False,
            return_tensors="pt",
        )["input_ids"].numel()
    except Exception:
        # Fallback: just tokenize the text prompt directly
        text_prompt_len = len(processor.tokenizer.encode(text_prompt))

    # Vision tokens = total tokens - text tokens
    vision_prompt_len = prompt_len - text_prompt_len

    return DatasetRow(
        prompt=text_prompt,
        prompt_len=prompt_len,
        output_len=output_len,
        text_prompt_len=text_prompt_len,
        vision_prompt_len=vision_prompt_len,
        image_data=images_base64,
    )
def sample_image_requests(
    num_requests: int,
    image_count: int,
    input_len: int,
    output_len: int,
    range_ratio: float,
    processor: AutoProcessor,
    image_content: str,
    image_format: str,
    image_resolution: str,
) -> List[DatasetRow]:
    """Generate requests with images.

    - Each request includes ``image_count`` images.
    - Supported resolutions: 4k (3840x2160), 1080p (1920x1080), 720p (1280x720), 360p (640x360),
      or custom 'heightxwidth' (e.g., 1080x1920).
    - Text lengths follow the 'random' dataset sampling rule. ``prompt_len``
@@ -1313,12 +1364,12 @@ def sample_random_image_requests(
        ) from e

    # Parse resolution (supports presets and 'heightxwidth')
    width, height = parse_image_resolution(image_resolution)

    # Check for potentially problematic combinations and warn user
    if width * height >= 1920 * 1080 and image_count * num_requests >= 100:
        warnings.warn(
            f"High resolution ({width}x{height}) with {image_count * num_requests} total images "
            f"may take a long time. Consider reducing resolution or image count.",
            UserWarning,
            stacklevel=2,
@@ -1332,53 +1383,50 @@ def sample_random_image_requests(
        int(output_len * range_ratio), output_len + 1, size=num_requests
    )
    def _gen_random_image_data_uri(
        width: int = width, height: int = height
    ) -> (Image, str, int):
        if image_content == "blank":
            # Generate blank white image
            arr = np.full((height, width, 3), 255, dtype=np.uint8)
        else:
            # Generate random colored image
            arr = (np.random.rand(height, width, 3) * 255).astype(np.uint8)
        img = Image.fromarray(arr)
        buf = io.BytesIO()
        img.save(buf, format=image_format, quality=85)
        encoded = pybase64.b64encode(buf.getvalue()).decode("utf-8")
        image_data = f"data:image/{image_format};base64,{encoded}"
        image_bytes = len(image_data.encode("utf-8"))
        return img, image_data, image_bytes
    dataset: List[DatasetRow] = []
    total_image_bytes = 0
    for i in range(num_requests):
        # Generate text prompt
        text_prompt = gen_prompt(processor.tokenizer, int(input_lens[i]))

        # Generate image list
        images, images_base64, images_bytes = zip(
            *[_gen_random_image_data_uri() for _ in range(image_count)]
        )
        total_image_bytes += sum(list(images_bytes))

        data_row = create_mm_data_row(
            text_prompt,
            list(images),
            list(images_base64),
            int(output_lens[i]),
            processor,
        )
        dataset.append(data_row)

    print(f"#Input tokens: {np.sum([x.prompt_len for x in dataset])}")
    print(f"#Output tokens: {np.sum([x.output_len for x in dataset])}")
    print(
        f"\nCreated {len(dataset)} {image_content} {image_format} images with average {total_image_bytes//num_requests} bytes per request"
    )
    return dataset
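The data-URI generation inside `_gen_random_image_data_uri` can be sketched in isolation; this uses the stdlib `base64` instead of `pybase64`, only the blank-content branch, and a hypothetical helper name (requires Pillow and NumPy):

```python
import base64
import io

import numpy as np
from PIL import Image


def blank_image_data_uri(width: int, height: int, image_format: str = "png") -> str:
    """Encode a blank white image as a base64 data URI."""
    arr = np.full((height, width, 3), 255, dtype=np.uint8)  # white RGB canvas
    img = Image.fromarray(arr)
    buf = io.BytesIO()
    img.save(buf, format=image_format)
    encoded = base64.b64encode(buf.getvalue()).decode("utf-8")
    return f"data:image/{image_format};base64,{encoded}"


uri = blank_image_data_uri(8, 8)
print(uri[:22])  # data:image/png;base64,
```

Blank images compress far better than random noise, which is why the `--image-content blank` option exists for stressing request size independently of image entropy.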
@@ -1450,7 +1498,9 @@ def sample_generated_shared_prefix_requests(
            input_requests.append(
                DatasetRow(
                    prompt=full_prompt,
                    prompt_len=prompt_len,
                    output_len=output_len,
                )
            )
            total_input_tokens += prompt_len
@@ -1532,6 +1582,8 @@ def calculate_metrics(
    output_lens: List[int] = []
    retokenized_output_lens: List[int] = []
    total_input = 0
    total_input_text = 0
    total_input_vision = 0
    completed = 0
    itls: List[float] = []
    tpots: List[float] = []
@@ -1545,7 +1597,9 @@ def calculate_metrics(
                tokenizer.encode(outputs[i].generated_text, add_special_tokens=False)
            )
            retokenized_output_lens.append(retokenized_output_len)
            total_input += input_requests[i].prompt_len
            total_input_text += input_requests[i].text_prompt_len
            total_input_vision += input_requests[i].vision_prompt_len
            if output_len > 1:
                tpots.append((outputs[i].latency - outputs[i].ttft) / (output_len - 1))
            itls += outputs[i].itl
@@ -1567,6 +1621,8 @@ def calculate_metrics(
    metrics = BenchmarkMetrics(
        completed=completed,
        total_input=total_input,
        total_input_text=total_input_text,
        total_input_vision=total_input_vision,
        total_output=sum(output_lens),
        total_output_retokenized=sum(retokenized_output_lens),
        request_throughput=completed / dur_s,
@@ -1815,6 +1871,10 @@ async def benchmark(
    print("{:<40} {:<10}".format("Successful requests:", metrics.completed))
    print("{:<40} {:<10.2f}".format("Benchmark duration (s):", benchmark_duration))
    print("{:<40} {:<10}".format("Total input tokens:", metrics.total_input))
    print("{:<40} {:<10}".format("Total input text tokens:", metrics.total_input_text))
    print(
        "{:<40} {:<10}".format("Total input vision tokens:", metrics.total_input_vision)
    )
    print("{:<40} {:<10}".format("Total generated tokens:", metrics.total_output))
    print(
        "{:<40} {:<10}".format(
@@ -1884,6 +1944,8 @@ async def benchmark(
        "duration": benchmark_duration,
        "completed": metrics.completed,
        "total_input_tokens": metrics.total_input,
        "total_input_text_tokens": metrics.total_input_text,
        "total_input_vision_tokens": metrics.total_input_vision,
        "total_output_tokens": metrics.total_output,
        "total_output_tokens_retokenized": metrics.total_output_retokenized,
        "request_throughput": metrics.request_throughput,
@@ -1918,11 +1980,11 @@ async def benchmark(
        output_file_name = args.output_file
    else:
        now = datetime.now().strftime("%m%d")
        if args.dataset_name == "image":
            output_file_name = (
                f"{args.backend}_{now}_{args.num_prompts}_{args.random_input_len}_"
                f"{args.random_output_len}_{args.image_count}imgs_"
                f"{args.image_resolution}.jsonl"
            )
        elif args.dataset_name.startswith("random"):
            output_file_name = f"{args.backend}_{now}_{args.num_prompts}_{args.random_input_len}_{args.random_output_len}.jsonl"
@@ -2098,6 +2160,12 @@ def run_benchmark(args_: argparse.Namespace):
            "Because when the tokenizer counts the output tokens, if there is gibberish, it might count incorrectly.\n"
        )

    if args.dataset_name in ["image", "mmmu"]:
        args.apply_chat_template = True
        assert (
            not args.tokenize_prompt
        ), "`--tokenize-prompt` not compatible with image dataset"

    print(f"{args}\n")

    # Read dataset
@@ -2105,7 +2173,7 @@ def run_benchmark(args_: argparse.Namespace):
    model_id = args.model
    tokenizer_id = args.tokenizer if args.tokenizer is not None else args.model
    tokenizer = get_tokenizer(tokenizer_id)
    input_requests = get_dataset(args, tokenizer, model_id)

    # compatible with SimpleNamespace
    if not hasattr(args, "flush_cache"):
@@ -2186,7 +2254,7 @@ if __name__ == "__main__":
            "random-ids",
            "generated-shared-prefix",
            "mmmu",
            "image",
            "mooncake",
        ],
        help="Name of the dataset to benchmark on.",
@@ -2226,37 +2294,49 @@ if __name__ == "__main__":
        "--random-input-len",
        type=int,
        default=1024,
        help="Number of input tokens per request, used only for random and image dataset.",
    )
    parser.add_argument(
        "--random-output-len",
        default=1024,
        type=int,
        help="Number of output tokens per request, used only for random and image dataset.",
    )
    parser.add_argument(
        "--random-range-ratio",
        type=float,
        default=0.0,
        help="Range of sampled ratio of input/output length, "
        "used only for random and image dataset.",
    )
    # image dataset args
    parser.add_argument(
        "--image-count",
        type=int,
        default=1,
        help="Number of images per request (only available with the image dataset)",
    )
    parser.add_argument(
        "--image-resolution",
        type=str,
        default="1080p",
        help=(
            "Resolution of images for image dataset. "
            "Supports presets 4k/1080p/720p/360p or custom 'heightxwidth' (e.g., 1080x1920)."
        ),
    )
    parser.add_argument(
        "--image-format",
        type=str,
        default="jpeg",
        help=("Format of images for image dataset. " "Supports jpeg and png."),
    )
    parser.add_argument(
        "--image-content",
        type=str,
        default="random",
        help=("Content for images for image dataset. " "Supports random and blank."),
    )
    parser.add_argument(
        "--request-rate",
        type=float,
...