Unverified Commit 3cc328a4 authored by Talor Abramovich's avatar Talor Abramovich Committed by GitHub
Browse files

[SpecDecode][Benchmark] Add SPEED-bench support to benchmarking CLI (#36029)


Signed-off-by: default avatartalora <talora@nvidia.com>
Co-authored-by: default avatarBenjamin Chislett <bchislett@nvidia.com>
parent 3beb57a2
......@@ -37,6 +37,7 @@ th {
| HuggingFace-Blazedit | ✅ | ✅ | `vdaita/edit_5k_char`, `vdaita/edit_10k_char` |
| HuggingFace-ASR | ✅ | ✅ | `openslr/librispeech_asr`, `facebook/voxpopuli`, `LIUM/tedlium`, `edinburghcstr/ami`, `speechcolab/gigaspeech`, `kensho/spgispeech` |
| Spec Bench | ✅ | ✅ | `wget https://raw.githubusercontent.com/hemingkx/Spec-Bench/refs/heads/main/data/spec_bench/question.jsonl` |
| SPEED-Bench | ✅ | ✅ | `curl -LsSf https://raw.githubusercontent.com/NVIDIA-NeMo/Skills/refs/heads/main/nemo_skills/dataset/speed-bench/prepare.py \| python3 -` |
| Custom | ✅ | ✅ | Local file: `data.jsonl` |
| Custom MM | ✅ | ✅ | Local file: `mm_data.jsonl` |
......@@ -239,6 +240,69 @@ vllm bench serve \
--spec-bench-category "summarization"
```
#### SPEED-Bench Benchmark with Speculative Decoding
[SPEED-Bench](https://huggingface.co/datasets/nvidia/SPEED-Bench) is a unified and diverse dataset for speculative decoding, supporting acceptance rate and length measurements using the Qualitative split and throughput measurements using the Throughput splits in 5 configuration of input sequence length (1k, 2k, 8k, 16k, 32k).
!!! note
This dataset is governed by the [NVIDIA Evaluation Dataset License Agreement](https://huggingface.co/datasets/nvidia/SPEED-Bench/blob/main/License.pdf). For each dataset a user elects to use, the user is responsible for checking if the dataset license is fit for the intended purpose. The `prepare.py` script automatically fetches data from all the source datasets.
First, download the dataset to a folder, using this one liner:
```bash
curl -LsSf https://raw.githubusercontent.com/NVIDIA-NeMo/Skills/refs/heads/main/nemo_skills/dataset/speed-bench/prepare.py \| python3 -
```
The command supports also the following arguments:
- `--config`: download only a subset of the dataset: `qualitative`, `throughput_1k`, `throughput_2k`, `throughput_8k`, `throughput_16k` and `throughput_32k`. By default, it will download all subsets.
- `--output_dir`: download to a specified folder. By default, it will download to the current directory.
Start a server with speculative decoding:
```bash
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--speculative-config $'{"method": "eagle3",
"num_speculative_tokens": 3,
"model": "nvidia/Llama-3.3-70B-Instruct-Eagle3"}'
```
Run all categories in the Qualitative split:
```bash
vllm bench serve \
--model meta-llama/Llama-3.3-70B-Instruct \
--dataset-name speed_bench \
--dataset-path "<YOUR_DOWNLOADED_PATH>/data/speed_bench" \
--num-prompts -1
```
Available categories include `[writing, roleplay, reasoning, math, coding, stem, humanities, multilingual, summarization, qa, rag]`.
Run only a specific category like "multilingual":
```bash
vllm bench serve \
--model meta-llama/Llama-3.3-70B-Instruct \
--dataset-name speed_bench \
--dataset-path "<YOUR_DOWNLOADED_PATH>/data/speed_bench" \
--num-prompts -1
--speed-bench-category "multilingual"
```
Run all categories in the Throughput split (2k ISL):
```bash
vllm bench serve \
--model meta-llama/Llama-3.3-70B-Instruct \
--dataset-name speed_bench \
--speed-bench-dataset-subset throughput_2k
--dataset-path "<YOUR_DOWNLOADED_PATH>/data/speed_bench/" \
--num-prompts -1
```
Available categories include `[high_entropy, mixed, low_entropy]`, where high entropy data contains unstructued data such as creative writing while low entropy data contains more structured data such as coding, more details are in the dataset card.
#### Other HuggingFaceDataset Examples
```bash
......
......@@ -25,6 +25,7 @@ from contextlib import suppress
from dataclasses import dataclass, replace
from functools import cache
from io import BytesIO
from pathlib import Path
from tempfile import NamedTemporaryFile
from typing import Any, cast
......@@ -1422,6 +1423,7 @@ def add_dataset_parser(parser: FlexibleArgumentParser):
"custom_mm",
"prefix_repetition",
"spec_bench",
"speed_bench",
],
help="Name of the dataset to benchmark on.",
)
......@@ -1606,6 +1608,34 @@ def add_dataset_parser(parser: FlexibleArgumentParser):
"repetition dataset.",
)
speed_bench_group = parser.add_argument_group("speed bench dataset options")
speed_bench_group.add_argument(
"--speed-bench-dataset-subset",
type=str,
default="qualitative",
choices={
"qualitative",
"throughput_1k",
"throughput_2k",
"throughput_8k",
"throughput_16k",
"throughput_32k",
},
help="Subset of the SPEED-Bench dataset.",
)
speed_bench_group.add_argument(
"--speed-bench-output-len",
type=int,
default=4096,
help="Num of output tokens per request, used only for speed bench dataset.",
)
speed_bench_group.add_argument(
"--speed-bench-category",
type=str,
default=None,
help="Category for speed bench dataset. If None, use all categories.",
)
def add_random_dataset_base_args(
parser_or_group: FlexibleArgumentParser | argparse._ArgumentGroup,
......@@ -2074,6 +2104,19 @@ def get_samples(args, tokenizer: TokenizerLike) -> list[SampleRequest]:
request_id_prefix=args.request_id_prefix,
no_oversample=args.no_oversample,
),
"speed_bench": lambda: SpeedBench(
dataset_path=args.dataset_path,
dataset_subset=args.speed_bench_dataset_subset,
category=args.speed_bench_category,
disable_shuffle=args.disable_shuffle,
).sample(
num_requests=args.num_prompts,
tokenizer=tokenizer,
output_len=args.speed_bench_output_len,
enable_multimodal_chat=args.enable_multimodal_chat,
request_id_prefix=args.request_id_prefix,
no_oversample=args.no_oversample,
),
}
try:
......@@ -3551,3 +3594,48 @@ class MMStarDataset(HuggingFaceDataset):
sampled_requests, num_requests, request_id_prefix, no_oversample
)
return sampled_requests
# -----------------------------------------------------------------------------
# Speed Bench Dataset Implementation
# -----------------------------------------------------------------------------
class SpeedBench(CustomDataset):
"""
Implements the SPEED-Bench dataset: https://huggingface.co/datasets/nvidia/SPEED-Bench
Download the dataset using:
curl -LsSf https://raw.githubusercontent.com/NVIDIA-NeMo/Skills/refs/heads/main/nemo_skills/dataset/speed-bench/prepare.py | python3 -
""" # noqa: E501
def __init__(self, **kwargs) -> None:
self.dataset_subset = kwargs.pop("dataset_subset", "qualitative")
self.category = kwargs.pop("category", None)
super().__init__(**kwargs)
self.load_data()
def load_data(self) -> None:
if self.dataset_path is None:
raise ValueError("dataset_path must be provided for loading data.")
self.data = []
# Load the JSONL file
jsonl_data = pd.read_json(
path_or_buf=Path(self.dataset_path) / f"{self.dataset_subset}.jsonl",
lines=True,
)
# check if the JSONL file has a 'turns' column
if "messages" not in jsonl_data.columns:
raise ValueError("JSONL file must contain a 'messages' column.")
for _, row in jsonl_data.iterrows():
# sample only from a specific category if specified
if (not self.category) or (self.category == row["category"]):
prompt = row["messages"][0]["content"]
self.data.append({"prompt": prompt})
random.seed(self.random_seed)
if not getattr(self, "disable_shuffle", False):
random.shuffle(self.data)
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment