Unverified Commit e3bb7f5a authored by Kevin Xiang Li, committed by GitHub

benchmark: enhance configurable multimodal benchmarking in bench_serving (#9812)


Co-authored-by: Xiang (Kevin) Li <lik@nvidia.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
parent 92473e2e
@@ -59,15 +59,16 @@ Select with `--dataset-name`:

- `sharegpt` (default): loads ShareGPT-style pairs; optionally restrict with `--sharegpt-context-len` and override outputs with `--sharegpt-output-len`
- `random`: random text lengths; sampled from ShareGPT token space
- `random-ids`: random token ids (can lead to gibberish)
- `image`: generates images and wraps them in chat messages; supports custom resolutions, multiple formats, and different content types
- `generated-shared-prefix`: synthetic dataset with shared long system prompts and short questions
- `mmmu`: samples from MMMU (Math split) and includes images

Common dataset flags:

- `--num-prompts N`: number of requests
- `--random-input-len`, `--random-output-len`, `--random-range-ratio`: for random/random-ids/image
- `--image-count`: Number of images per request (for `image` dataset).
- `--apply-chat-template`: apply tokenizer chat template when constructing prompts
- `--dataset-path PATH`: file path for ShareGPT json; if blank and missing, it will be downloaded and cached
@@ -79,14 +80,16 @@ Generated Shared Prefix flags (for `generated-shared-prefix`):

- `--gsp-question-len`
- `--gsp-output-len`

Image dataset flags (for `image`):

- `--image-count`: Number of images per request
- `--image-resolution`: Image resolution; supports presets (4k, 1080p, 720p, 360p) or custom 'heightxwidth' format (e.g., 1080x1920, 512x768)
- `--image-format`: Image format (jpeg or png)
- `--image-content`: Image content type (random or blank)
### Examples

1. To benchmark the `image` dataset with 3 images per request, 500 prompts, 512 input length, and 512 output length, you can run:

```bash
python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-3B-Instruct --disable-radix-cache
```

@@ -95,10 +98,10 @@ python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-3B-Instruct --disabl

```bash
python -m sglang.bench_serving \
--backend sglang-oai-chat \
--dataset-name image \
--num-prompts 500 \
--image-count 3 \
--image-resolution 720p \
--random-input-len 512 \
--random-output-len 512
```
@@ -159,9 +162,10 @@ The script will add `Authorization: Bearer $OPENAI_API_KEY` automatically for Op

Printed after each run:

- Request throughput (req/s)
- Input token throughput (tok/s) - includes both text and vision tokens
- Output token throughput (tok/s)
- Total token throughput (tok/s) - includes both text and vision tokens
- Total input text tokens and Total input vision tokens - per-modality breakdown
- Concurrency: aggregate time of all requests divided by wall time
- End-to-End Latency (ms): mean/median/std/p99 per-request total latency
- Time to First Token (TTFT, ms): mean/median/std/p99 for streaming mode
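The concurrency metric above can be illustrated with a short sketch (the latency and duration numbers are hypothetical, not produced by the script):

```python
# Concurrency = aggregate time of all requests / wall-clock benchmark duration.
# Hypothetical per-request end-to-end latencies, in seconds.
latencies = [2.0, 2.0, 4.0]
wall_time = 4.0  # benchmark duration (s)

concurrency = sum(latencies) / wall_time
print(concurrency)  # 2.0 -> on average two requests were in flight at once
```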
@@ -227,31 +231,48 @@ python3 -m sglang.bench_serving \

```bash
--apply-chat-template
```

4) Images (VLM) with chat template:

```bash
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 --port 30000 \
--model your-vlm-model \
--dataset-name image \
--image-count 2 \
--image-resolution 720p \
--random-input-len 128 --random-output-len 256 \
--num-prompts 200 \
--apply-chat-template
```
4a) Images with custom resolution:
```bash
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 --port 30000 \
--model your-vlm-model \
--dataset-name image \
--image-count 1 \
--image-resolution 512x768 \
--random-input-len 64 --random-output-len 128 \
--num-prompts 100 \
--apply-chat-template
```
4b) 1080p images with PNG format and blank content:
```bash
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 --port 30000 \
--model your-vlm-model \
--dataset-name image \
--image-count 1 \
--image-resolution 1080p \
--image-format png \
--image-content blank \
--random-input-len 64 --random-output-len 128 \
--num-prompts 100 \
--apply-chat-template
```
@@ -325,7 +346,7 @@ python3 -m sglang.bench_serving \

- All requests failed: verify `--backend`, server URL/port, `--model`, and authentication. Check warmup errors printed by the script.
- Throughput seems too low: adjust `--request-rate` and `--max-concurrency`; verify server batch size/scheduling; ensure streaming is enabled if appropriate.
- Token counts look odd: prefer chat/instruct models with proper chat templates; otherwise tokenization of gibberish may be inconsistent.
- Image/MMMU datasets: ensure you installed extra deps (`pillow`, `datasets`, `pybase64`).
- Authentication errors (401/403): set `OPENAI_API_KEY` or disable auth on your server.

### Notes

...
@@ -35,6 +35,7 @@ import numpy as np
import requests
from tqdm.asyncio import tqdm
from transformers import (
    AutoProcessor,
    AutoTokenizer,
    PreTrainedTokenizer,
    PreTrainedTokenizerBase,

@@ -327,8 +328,9 @@ async def async_request_openai_chat_completions(
        "model": request_func_input.model,
        "messages": messages,
        "temperature": 0.0,
        "max_completion_tokens": request_func_input.output_len,
        "stream": not args.disable_stream,
        "ignore_eos": not args.disable_ignore_eos,
        **request_func_input.extra_request_body,
    }
@@ -659,7 +661,30 @@ def get_tokenizer(
    )


def get_processor(
    pretrained_model_name_or_path: str,
) -> Union[PreTrainedTokenizer, PreTrainedTokenizerFast]:
    assert (
        pretrained_model_name_or_path is not None
        and pretrained_model_name_or_path != ""
    )
    if pretrained_model_name_or_path.endswith(
        ".json"
    ) or pretrained_model_name_or_path.endswith(".model"):
        from sglang.srt.hf_transformers_utils import get_processor

        return get_processor(pretrained_model_name_or_path)

    if pretrained_model_name_or_path is not None and not os.path.exists(
        pretrained_model_name_or_path
    ):
        pretrained_model_name_or_path = get_model(pretrained_model_name_or_path)
    return AutoProcessor.from_pretrained(
        pretrained_model_name_or_path, trust_remote_code=True
    )


def get_dataset(args, tokenizer, model_id=None):
    tokenize_prompt = getattr(args, "tokenize_prompt", False)
    if args.dataset_name == "sharegpt":
        assert not tokenize_prompt
@@ -672,7 +697,7 @@ def get_dataset(args, tokenizer):
            prompt_suffix=args.prompt_suffix,
            apply_chat_template=args.apply_chat_template,
        )
    elif args.dataset_name.startswith("random"):
        input_requests = sample_random_requests(
            input_len=args.random_input_len,
            output_len=args.random_output_len,
@@ -683,17 +708,18 @@ def get_dataset(args, tokenizer):
            random_sample=args.dataset_name == "random",
            return_text=not tokenize_prompt,
        )
    elif args.dataset_name == "image":
        processor = get_processor(model_id)
        input_requests = sample_image_requests(
            num_requests=args.num_prompts,
            image_count=args.image_count,
            input_len=args.random_input_len,
            output_len=args.random_output_len,
            range_ratio=args.random_range_ratio,
            processor=processor,
            image_content=args.image_content,
            image_format=args.image_format,
            image_resolution=args.image_resolution,
        )
    elif args.dataset_name == "generated-shared-prefix":
        assert not tokenize_prompt
@@ -707,12 +733,11 @@ def get_dataset(args, tokenizer):
            args=args,
        )
    elif args.dataset_name == "mmmu":
        processor = get_processor(model_id)
        input_requests = sample_mmmu_requests(
            num_requests=args.num_prompts,
            processor=processor,
            fixed_output_len=args.random_output_len,
            random_sample=True,
        )
    elif args.dataset_name == "mooncake":
@@ -757,6 +782,8 @@ ASYNC_REQUEST_FUNCS = {
class BenchmarkMetrics:
    completed: int
    total_input: int
    total_input_text: int
    total_input_vision: int
    total_output: int
    total_output_retokenized: int
    request_throughput: float
@@ -850,9 +877,17 @@ class DatasetRow:
    prompt: str
    prompt_len: int
    output_len: int
    text_prompt_len: Optional[int] = None
    vision_prompt_len: Optional[int] = None
    image_data: Optional[List[str]] = None
    timestamp: Optional[float] = None

    def __post_init__(self):
        if self.text_prompt_len is None:
            self.text_prompt_len = self.prompt_len
        if self.vision_prompt_len is None:
            self.vision_prompt_len = 0
async def get_mooncake_request_over_time(
    input_requests: List[Dict],
@@ -929,9 +964,8 @@ async def get_mooncake_request_over_time(
def sample_mmmu_requests(
    num_requests: int,
    processor: AutoProcessor,
    fixed_output_len: Optional[int] = None,
    random_sample: bool = True,
) -> List[DatasetRow]:
    """
@@ -1010,54 +1044,12 @@ def sample_mmmu_requests(
            question = example.get("question")

            # Construct the prompt
            text_prompt = f"Question: {question}\n\nAnswer: "
            output_len = fixed_output_len if fixed_output_len is not None else 256
            data_row = create_mm_data_row(
                text_prompt, [image], [image_data], output_len, processor
            )
            filtered_dataset.append(data_row)

        except Exception as e:
            print(f"Error processing example {i}: {e}")
@@ -1145,7 +1137,11 @@ def sample_sharegpt_requests(
            continue

        filtered_dataset.append(
            DatasetRow(
                prompt=prompt,
                prompt_len=prompt_len,
                output_len=output_len,
            )
        )

    print(f"#Input tokens: {np.sum([x.prompt_len for x in filtered_dataset])}")
@@ -1256,7 +1252,7 @@ def sample_random_requests(
    return input_requests


def parse_image_resolution(image_resolution: str) -> Tuple[int, int]:
    """Parse image resolution into (width, height).

    Supports presets '1080p', '720p', '360p' and custom 'heightxwidth' format
@@ -1281,24 +1277,79 @@ def parse_random_image_resolution(image_resolution: str) -> Tuple[int, int]:
        return (width, height)

    raise ValueError(
        f"Unsupported image resolution: {image_resolution}. "
        "Choose from 4k, 1080p, 720p, 360p, or provide custom 'heightxwidth' (e.g., 1080x1920)."
    )
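The preset-or-custom parsing above (the middle of the function is elided by the hunk) can be sketched independently; the helper name is illustrative, and the preset dimensions below are taken from the `sample_image_requests` docstring:

```python
from typing import Tuple

# Illustrative re-implementation of the preset/custom resolution parsing.
PRESETS = {
    "4k": (3840, 2160),
    "1080p": (1920, 1080),
    "720p": (1280, 720),
    "360p": (640, 360),
}


def parse_resolution_sketch(spec: str) -> Tuple[int, int]:
    """Return (width, height) for a preset or a custom 'heightxwidth' string."""
    if spec in PRESETS:
        return PRESETS[spec]
    parts = spec.split("x")
    if len(parts) == 2 and all(p.isdigit() for p in parts):
        height, width = int(parts[0]), int(parts[1])
        if height > 0 and width > 0:
            return (width, height)
    raise ValueError(f"Unsupported image resolution: {spec}")


print(parse_resolution_sketch("720p"))     # (1280, 720)
print(parse_resolution_sketch("512x768"))  # (768, 512)
```

Note the custom format is height-first, while the return value is `(width, height)`, matching the docstring.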
def create_mm_data_row(text_prompt, images, images_base64, output_len, processor):
    try:
        content_items = [
            {"type": "image_url", "image_url": {"url": img_url}}
            for img_url in images_base64
        ]
        content_items.append({"type": "text", "text": text_prompt})
        prompt_str = processor.apply_chat_template(
            [{"role": "user", "content": content_items}],
            add_generation_prompt=True,
            tokenize=False,
        )
    except Exception:
        # Some tokenizers do not support list content; fall back to a placeholder in the text
        prompt_str = f"<image>{text_prompt}"

    # Calculate total tokens (text + vision)
    prompt_len = processor(
        text=[prompt_str],
        images=images,
        padding=False,
        return_tensors="pt",
    )["input_ids"].numel()

    # Calculate text-only tokens
    try:
        # Create text-only version of the prompt
        text_only_prompt = processor.apply_chat_template(
            [{"role": "user", "content": text_prompt}],
            add_generation_prompt=True,
            tokenize=False,
        )
        text_prompt_len = processor(
            text=[text_only_prompt],
            padding=False,
            return_tensors="pt",
        )["input_ids"].numel()
    except Exception:
        # Fallback: just tokenize the text prompt directly
        text_prompt_len = len(processor.tokenizer.encode(text_prompt))

    # Vision tokens = total tokens - text tokens
    vision_prompt_len = prompt_len - text_prompt_len

    return DatasetRow(
        prompt=text_prompt,
        prompt_len=prompt_len,
        output_len=output_len,
        text_prompt_len=text_prompt_len,
        vision_prompt_len=vision_prompt_len,
        image_data=images_base64,
    )
def sample_image_requests(
    num_requests: int,
    image_count: int,
    input_len: int,
    output_len: int,
    range_ratio: float,
    processor: AutoProcessor,
    image_content: str,
    image_format: str,
    image_resolution: str,
) -> List[DatasetRow]:
    """Generate requests with images.

    - Each request includes ``image_count`` images.
    - Supported resolutions: 4k (3840x2160), 1080p (1920x1080), 720p (1280x720), 360p (640x360),
      or custom 'heightxwidth' (e.g., 1080x1920).
    - Text lengths follow the 'random' dataset sampling rule. ``prompt_len``
@@ -1313,12 +1364,12 @@ def sample_random_image_requests(
        ) from e

    # Parse resolution (supports presets and 'heightxwidth')
    width, height = parse_image_resolution(image_resolution)

    # Check for potentially problematic combinations and warn user
    if width * height >= 1920 * 1080 and image_count * num_requests >= 100:
        warnings.warn(
            f"High resolution ({width}x{height}) with {image_count * num_requests} total images "
            f"may take a long time. Consider reducing resolution or image count.",
            UserWarning,
            stacklevel=2,
@@ -1332,53 +1383,50 @@ def sample_random_image_requests(
        int(output_len * range_ratio), output_len + 1, size=num_requests
    )
    def _gen_random_image_data_uri(
        width: int = width, height: int = height
    ) -> (Image, str, int):
        if image_content == "blank":
            # Generate blank white image
            arr = np.full((height, width, 3), 255, dtype=np.uint8)
        else:
            # Generate random colored image
            arr = (np.random.rand(height, width, 3) * 255).astype(np.uint8)
        img = Image.fromarray(arr)
        buf = io.BytesIO()
        img.save(buf, format=image_format, quality=85)
        encoded = pybase64.b64encode(buf.getvalue()).decode("utf-8")
        image_data = f"data:image/{image_format};base64,{encoded}"
        image_bytes = len(image_data.encode("utf-8"))
        return img, image_data, image_bytes
    dataset: List[DatasetRow] = []
    total_image_bytes = 0
    for i in range(num_requests):
        # Generate text prompt
        text_prompt = gen_prompt(processor.tokenizer, int(input_lens[i]))

        # Generate image list
        images, images_base64, images_bytes = zip(
            *[_gen_random_image_data_uri() for _ in range(image_count)]
        )
        total_image_bytes += sum(list(images_bytes))

        data_row = create_mm_data_row(
            text_prompt,
            list(images),
            list(images_base64),
            int(output_lens[i]),
            processor,
        )
        dataset.append(data_row)

    print(f"#Input tokens: {np.sum([x.prompt_len for x in dataset])}")
    print(f"#Output tokens: {np.sum([x.output_len for x in dataset])}")
    print(
        f"\nCreated {len(dataset)} {image_content} {image_format} images with average {total_image_bytes//num_requests} bytes per request"
    )
    return dataset
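The data-URI generation inside `_gen_random_image_data_uri` can be sketched in isolation; this uses the stdlib `base64` instead of `pybase64`, only the blank-content branch, and a hypothetical helper name (requires Pillow and NumPy):

```python
import base64
import io

import numpy as np
from PIL import Image


def blank_image_data_uri(width: int, height: int, image_format: str = "png") -> str:
    """Encode a blank white image as a base64 data URI."""
    arr = np.full((height, width, 3), 255, dtype=np.uint8)  # white RGB canvas
    img = Image.fromarray(arr)
    buf = io.BytesIO()
    img.save(buf, format=image_format)
    encoded = base64.b64encode(buf.getvalue()).decode("utf-8")
    return f"data:image/{image_format};base64,{encoded}"


uri = blank_image_data_uri(8, 8)
print(uri[:22])  # data:image/png;base64,
```

Blank images compress far better than random noise, which is why the `--image-content blank` option exists for stressing request size independently of image entropy.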
@@ -1450,7 +1498,9 @@ def sample_generated_shared_prefix_requests(
            input_requests.append(
                DatasetRow(
                    prompt=full_prompt,
                    prompt_len=prompt_len,
                    output_len=output_len,
                )
            )
            total_input_tokens += prompt_len
@@ -1532,6 +1582,8 @@ def calculate_metrics(
    output_lens: List[int] = []
    retokenized_output_lens: List[int] = []
    total_input = 0
    total_input_text = 0
    total_input_vision = 0
    completed = 0
    itls: List[float] = []
    tpots: List[float] = []
@@ -1545,7 +1597,9 @@ def calculate_metrics(
                tokenizer.encode(outputs[i].generated_text, add_special_tokens=False)
            )
            retokenized_output_lens.append(retokenized_output_len)
            total_input += input_requests[i].prompt_len
            total_input_text += input_requests[i].text_prompt_len
            total_input_vision += input_requests[i].vision_prompt_len
            if output_len > 1:
                tpots.append((outputs[i].latency - outputs[i].ttft) / (output_len - 1))
            itls += outputs[i].itl
@@ -1567,6 +1621,8 @@ def calculate_metrics(
    metrics = BenchmarkMetrics(
        completed=completed,
        total_input=total_input,
        total_input_text=total_input_text,
        total_input_vision=total_input_vision,
        total_output=sum(output_lens),
        total_output_retokenized=sum(retokenized_output_lens),
        request_throughput=completed / dur_s,
@@ -1815,6 +1871,10 @@ async def benchmark(
    print("{:<40} {:<10}".format("Successful requests:", metrics.completed))
    print("{:<40} {:<10.2f}".format("Benchmark duration (s):", benchmark_duration))
    print("{:<40} {:<10}".format("Total input tokens:", metrics.total_input))
    print("{:<40} {:<10}".format("Total input text tokens:", metrics.total_input_text))
    print(
        "{:<40} {:<10}".format("Total input vision tokens:", metrics.total_input_vision)
    )
    print("{:<40} {:<10}".format("Total generated tokens:", metrics.total_output))
    print(
        "{:<40} {:<10}".format(
@@ -1884,6 +1944,8 @@ async def benchmark(
        "duration": benchmark_duration,
        "completed": metrics.completed,
        "total_input_tokens": metrics.total_input,
        "total_input_text_tokens": metrics.total_input_text,
        "total_input_vision_tokens": metrics.total_input_vision,
        "total_output_tokens": metrics.total_output,
        "total_output_tokens_retokenized": metrics.total_output_retokenized,
        "request_throughput": metrics.request_throughput,
@@ -1918,11 +1980,11 @@ async def benchmark(
        output_file_name = args.output_file
    else:
        now = datetime.now().strftime("%m%d")
        if args.dataset_name == "image":
            output_file_name = (
                f"{args.backend}_{now}_{args.num_prompts}_{args.random_input_len}_"
                f"{args.random_output_len}_{args.image_count}imgs_"
                f"{args.image_resolution}.jsonl"
            )
        elif args.dataset_name.startswith("random"):
            output_file_name = f"{args.backend}_{now}_{args.num_prompts}_{args.random_input_len}_{args.random_output_len}.jsonl"
@@ -2098,6 +2160,12 @@ def run_benchmark(args_: argparse.Namespace):
            "Because when the tokenizer counts the output tokens, if there is gibberish, it might count incorrectly.\n"
        )

    if args.dataset_name in ["image", "mmmu"]:
        args.apply_chat_template = True
        assert (
            not args.tokenize_prompt
        ), "`--tokenize-prompt` not compatible with image dataset"

    print(f"{args}\n")

    # Read dataset
@@ -2105,7 +2173,7 @@ def run_benchmark(args_: argparse.Namespace):
    model_id = args.model
    tokenizer_id = args.tokenizer if args.tokenizer is not None else args.model
    tokenizer = get_tokenizer(tokenizer_id)
    input_requests = get_dataset(args, tokenizer, model_id)

    # compatible with SimpleNamespace
    if not hasattr(args, "flush_cache"):
@@ -2186,7 +2254,7 @@ if __name__ == "__main__":
            "random-ids",
            "generated-shared-prefix",
            "mmmu",
            "image",
            "mooncake",
        ],
        help="Name of the dataset to benchmark on.",
@@ -2226,37 +2294,49 @@ if __name__ == "__main__":
        "--random-input-len",
        type=int,
        default=1024,
        help="Number of input tokens per request, used only for random and image dataset.",
    )
    parser.add_argument(
        "--random-output-len",
        default=1024,
        type=int,
        help="Number of output tokens per request, used only for random and image dataset.",
    )
    parser.add_argument(
        "--random-range-ratio",
        type=float,
        default=0.0,
        help="Range of sampled ratio of input/output length, "
        "used only for random and image dataset.",
    )
    # image dataset args
    parser.add_argument(
        "--image-count",
        type=int,
        default=1,
        help="Number of images per request (only available with the image dataset)",
    )
    parser.add_argument(
        "--image-resolution",
        type=str,
        default="1080p",
        help=(
            "Resolution of images for image dataset. "
            "Supports presets 4k/1080p/720p/360p or custom 'heightxwidth' (e.g., 1080x1920)."
        ),
    )
    parser.add_argument(
        "--image-format",
        type=str,
        default="jpeg",
        help=("Format of images for image dataset. " "Supports jpeg and png."),
    )
    parser.add_argument(
        "--image-content",
        type=str,
        default="random",
        help=("Content for images for image dataset. " "Supports random and blank."),
    )
    parser.add_argument(
        "--request-rate",
        type=float,
...