[SpecDecode][Benchmark] Add SPEED-bench support to benchmarking CLI (#36029)

Signed-off-by: talora <talora@nvidia.com>
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com>

Unverified Commit 3cc328a4 authored Apr 15, 2026 by Talor Abramovich Committed by GitHub Apr 15, 2026

Showing with 152 additions and 0 deletions

docs/benchmarking/cli.md +64 -0

vllm/benchmarks/datasets/datasets.py +88 -0

--- a/docs/benchmarking/cli.md
+++ b/docs/benchmarking/cli.md
@@ -37,6 +37,7 @@ th {
 | HuggingFace-Blazedit | ✅ | ✅ | `vdaita/edit_5k_char`, `vdaita/edit_10k_char` |
 | HuggingFace-ASR | ✅ | ✅ | `openslr/librispeech_asr`, `facebook/voxpopuli`,  `LIUM/tedlium`, `edinburghcstr/ami`,        `speechcolab/gigaspeech`,        `kensho/spgispeech` |
 | Spec Bench | ✅ | ✅ | `wget https://raw.githubusercontent.com/hemingkx/Spec-Bench/refs/heads/main/data/spec_bench/question.jsonl` |
+| SPEED-Bench | ✅ | ✅ | `curl -LsSf https://raw.githubusercontent.com/NVIDIA-NeMo/Skills/refs/heads/main/nemo_skills/dataset/speed-bench/prepare.py \| python3 -` |
 | Custom | ✅ | ✅ | Local file: `data.jsonl` |
 | Custom MM | ✅ | ✅ | Local file: `mm_data.jsonl` |

@@ -239,6 +240,69 @@ vllm bench serve \
    --spec-bench-category "summarization"
 ```

+#### SPEED-Bench Benchmark with Speculative Decoding
+
+[SPEED-Bench](https://huggingface.co/datasets/nvidia/SPEED-Bench) is a unified and diverse dataset for speculative decoding. Its Qualitative split supports acceptance rate and acceptance length measurements, and its Throughput splits support throughput measurements at five input sequence lengths (1k, 2k, 8k, 16k, 32k).
+
+!!! note
+    This dataset is governed by the [NVIDIA Evaluation Dataset License Agreement](https://huggingface.co/datasets/nvidia/SPEED-Bench/blob/main/License.pdf). For each dataset a user elects to use, the user is responsible for checking if the dataset license is fit for the intended purpose. The `prepare.py` script automatically fetches data from all the source datasets.
+
+First, download the dataset to a folder using this one-liner:
+
+```bash
+curl -LsSf https://raw.githubusercontent.com/NVIDIA-NeMo/Skills/refs/heads/main/nemo_skills/dataset/speed-bench/prepare.py | python3 -
+```
+
+The command also supports the following arguments:
+
+- `--config`: download only one subset of the dataset: `qualitative`, `throughput_1k`, `throughput_2k`, `throughput_8k`, `throughput_16k`, or `throughput_32k`. By default, all subsets are downloaded.
+- `--output_dir`: download to the specified folder. By default, the data is downloaded to the current directory.
+
+Start a server with speculative decoding:
+
+```bash
+vllm serve meta-llama/Llama-3.3-70B-Instruct \
+    --speculative-config $'{"method": "eagle3",
+    "num_speculative_tokens": 3,
+    "model": "nvidia/Llama-3.3-70B-Instruct-Eagle3"}'
+```
+
+Run all categories in the Qualitative split:
+
+```bash
+vllm bench serve \
+    --model meta-llama/Llama-3.3-70B-Instruct \
+    --dataset-name speed_bench \
+    --dataset-path "<YOUR_DOWNLOADED_PATH>/data/speed_bench" \
+    --num-prompts -1
+```
+
+Available categories include `[writing, roleplay, reasoning, math, coding, stem, humanities, multilingual, summarization, qa, rag]`.
+
+Run only a specific category like "multilingual":
+
+```bash
+vllm bench serve \
+    --model meta-llama/Llama-3.3-70B-Instruct \
+    --dataset-name speed_bench \
+    --dataset-path "<YOUR_DOWNLOADED_PATH>/data/speed_bench" \
+    --num-prompts -1 \
+    --speed-bench-category "multilingual"
+```
+
+Run all categories in the Throughput split (2k ISL):
+
+```bash
+vllm bench serve \
+    --model meta-llama/Llama-3.3-70B-Instruct \
+    --dataset-name speed_bench \
+    --speed-bench-dataset-subset throughput_2k \
+    --dataset-path "<YOUR_DOWNLOADED_PATH>/data/speed_bench/" \
+    --num-prompts -1
+```
+
+Available categories include `[high_entropy, mixed, low_entropy]`. High-entropy data is unstructured, such as creative writing, while low-entropy data is more structured, such as code. See the dataset card for more details.
+
 #### Other HuggingFaceDataset Examples

 ```bash

--- a/vllm/benchmarks/datasets/datasets.py
+++ b/vllm/benchmarks/datasets/datasets.py
@@ -25,6 +25,7 @@ from contextlib import suppress
 from dataclasses import dataclass, replace
 from functools import cache
 from io import BytesIO
+from pathlib import Path
 from tempfile import NamedTemporaryFile
 from typing import Any, cast

@@ -1422,6 +1423,7 @@ def add_dataset_parser(parser: FlexibleArgumentParser):
            "custom_mm",
            "prefix_repetition",
            "spec_bench",
+            "speed_bench",
        ],
        help="Name of the dataset to benchmark on.",
    )
@@ -1606,6 +1608,34 @@ def add_dataset_parser(parser: FlexibleArgumentParser):
        "repetition dataset.",
    )

+    speed_bench_group = parser.add_argument_group("speed bench dataset options")
+    speed_bench_group.add_argument(
+        "--speed-bench-dataset-subset",
+        type=str,
+        default="qualitative",
+        choices=[
+            "qualitative",
+            "throughput_1k",
+            "throughput_2k",
+            "throughput_8k",
+            "throughput_16k",
+            "throughput_32k",
+        ],
+        help="Subset of the SPEED-Bench dataset.",
+    )
+    speed_bench_group.add_argument(
+        "--speed-bench-output-len",
+        type=int,
+        default=4096,
+        help="Number of output tokens per request, used only for speed bench dataset.",
+    )
+    speed_bench_group.add_argument(
+        "--speed-bench-category",
+        type=str,
+        default=None,
+        help="Category for speed bench dataset. If None, use all categories.",
+    )
+

 def add_random_dataset_base_args(
    parser_or_group: FlexibleArgumentParser | argparse._ArgumentGroup,
@@ -2074,6 +2104,19 @@ def get_samples(args, tokenizer: TokenizerLike) -> list[SampleRequest]:
                request_id_prefix=args.request_id_prefix,
                no_oversample=args.no_oversample,
            ),
+            "speed_bench": lambda: SpeedBench(
+                dataset_path=args.dataset_path,
+                dataset_subset=args.speed_bench_dataset_subset,
+                category=args.speed_bench_category,
+                disable_shuffle=args.disable_shuffle,
+            ).sample(
+                num_requests=args.num_prompts,
+                tokenizer=tokenizer,
+                output_len=args.speed_bench_output_len,
+                enable_multimodal_chat=args.enable_multimodal_chat,
+                request_id_prefix=args.request_id_prefix,
+                no_oversample=args.no_oversample,
+            ),
        }

        try:
@@ -3551,3 +3594,48 @@ class MMStarDataset(HuggingFaceDataset):
            sampled_requests, num_requests, request_id_prefix, no_oversample
        )
        return sampled_requests
+
+
+# -----------------------------------------------------------------------------
+# Speed Bench Dataset Implementation
+# -----------------------------------------------------------------------------
+
+
+class SpeedBench(CustomDataset):
+    """
+    Implements the SPEED-Bench dataset: https://huggingface.co/datasets/nvidia/SPEED-Bench
+    Download the dataset using:
+    curl -LsSf https://raw.githubusercontent.com/NVIDIA-NeMo/Skills/refs/heads/main/nemo_skills/dataset/speed-bench/prepare.py | python3 -
+    """  # noqa: E501
+
+    def __init__(self, **kwargs) -> None:
+        self.dataset_subset = kwargs.pop("dataset_subset", "qualitative")
+        self.category = kwargs.pop("category", None)
+        super().__init__(**kwargs)
+        self.load_data()
+
+    def load_data(self) -> None:
+        if self.dataset_path is None:
+            raise ValueError("dataset_path must be provided for loading data.")
+
+        self.data = []
+
+        # Load the JSONL file
+        jsonl_data = pd.read_json(
+            path_or_buf=Path(self.dataset_path) / f"{self.dataset_subset}.jsonl",
+            lines=True,
+        )
+
+        # Check that the JSONL file has a 'messages' column
+        if "messages" not in jsonl_data.columns:
+            raise ValueError("JSONL file must contain a 'messages' column.")
+
+        for _, row in jsonl_data.iterrows():
+            # sample only from a specific category if specified
+            if (not self.category) or (self.category == row["category"]):
+                prompt = row["messages"][0]["content"]
+                self.data.append({"prompt": prompt})
+
+        random.seed(self.random_seed)
+        if not getattr(self, "disable_shuffle", False):
+            random.shuffle(self.data)