Commit ba5c5e54 authored by Harry Mellor, committed by GitHub

[Docs] Switch to better markdown linting pre-commit hook (#21851)


Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
parent 555e7225
...@@ -28,6 +28,7 @@ See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performanc ...@@ -28,6 +28,7 @@ See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performanc
## Trigger the benchmark ## Trigger the benchmark
Performance benchmark will be triggered when: Performance benchmark will be triggered when:
- A PR being merged into vllm. - A PR being merged into vllm.
- Every commit for those PRs with `perf-benchmarks` label AND `ready` label. - Every commit for those PRs with `perf-benchmarks` label AND `ready` label.
...@@ -38,6 +39,7 @@ bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh ...@@ -38,6 +39,7 @@ bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
``` ```
Runtime environment variables: Runtime environment variables:
- `ON_CPU`: set the value to '1' on Intel® Xeon® Processors. Default value is 0. - `ON_CPU`: set the value to '1' on Intel® Xeon® Processors. Default value is 0.
- `SERVING_JSON`: JSON file to use for the serving tests. Default value is empty string (use default file). - `SERVING_JSON`: JSON file to use for the serving tests. Default value is empty string (use default file).
- `LATENCY_JSON`: JSON file to use for the latency tests. Default value is empty string (use default file). - `LATENCY_JSON`: JSON file to use for the latency tests. Default value is empty string (use default file).
...@@ -46,12 +48,14 @@ Runtime environment variables: ...@@ -46,12 +48,14 @@ Runtime environment variables:
- `REMOTE_PORT`: Port for the remote vLLM service to benchmark. Default value is empty string. - `REMOTE_PORT`: Port for the remote vLLM service to benchmark. Default value is empty string.
Nightly benchmark will be triggered when: Nightly benchmark will be triggered when:
- Every commit for those PRs with `perf-benchmarks` label and `nightly-benchmarks` label. - Every commit for those PRs with `perf-benchmarks` label and `nightly-benchmarks` label.
## Performance benchmark details ## Performance benchmark details
See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases. See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.
> NOTE: For Intel® Xeon® Processors, use `tests/latency-tests-cpu.json`, `tests/throughput-tests-cpu.json`, `tests/serving-tests-cpu.json` instead. > NOTE: For Intel® Xeon® Processors, use `tests/latency-tests-cpu.json`, `tests/throughput-tests-cpu.json`, `tests/serving-tests-cpu.json` instead.
>
### Latency test ### Latency test
Here is an example of one test inside `latency-tests.json`: Here is an example of one test inside `latency-tests.json`:
...@@ -149,6 +153,7 @@ Here is an example using the script to compare result_a and result_b without det ...@@ -149,6 +153,7 @@ Here is an example using the script to compare result_a and result_b without det
Here is an example using the script to compare result_a and result_b with detail test name. Here is an example using the script to compare result_a and result_b with detail test name.
`python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json` `python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json`
| | results_a/benchmark_results.json_name | results_a/benchmark_results.json | results_b/benchmark_results.json_name | results_b/benchmark_results.json | perf_ratio | | | results_a/benchmark_results.json_name | results_a/benchmark_results.json | results_b/benchmark_results.json_name | results_b/benchmark_results.json | perf_ratio |
|---|---------------------------------------------|----------------------------------------|---------------------------------------------|----------------------------------------|----------| |---|---------------------------------------------|----------------------------------------|---------------------------------------------|----------------------------------------|----------|
| 0 | serving_llama8B_tp1_sharegpt_qps_1 | 142.633982 | serving_llama8B_tp1_sharegpt_qps_1 | 156.526018 | 1.097396 | | 0 | serving_llama8B_tp1_sharegpt_qps_1 | 142.633982 | serving_llama8B_tp1_sharegpt_qps_1 | 156.526018 | 1.097396 |
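The sample row is consistent with `perf_ratio` being the second file's value divided by the first file's value (156.526018 / 142.633982 ≈ 1.097396). The following is a minimal Python sketch of that arithmetic. It assumes this reading of the column and does not reproduce `compare-json-results.py`; the function name `perf_ratio` is hypothetical.

```python
def perf_ratio(value_a: float, value_b: float) -> float:
    """Return value_b / value_a, the assumed definition of the perf_ratio column."""
    if value_a == 0:
        raise ValueError("baseline value must be non-zero")
    return value_b / value_a


# Values copied from row 0 of the table above.
ratio = perf_ratio(142.633982, 156.526018)
print(f"perf_ratio = {ratio:.6f}")
```

Whether a ratio above 1 is an improvement depends on the metric being compared: for a throughput-style metric higher is better, for a latency-style metric lower is better.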
......
# Nightly benchmark annotation
## Description ## Description
...@@ -13,15 +14,15 @@ Please download the visualization scripts in the post ...@@ -13,15 +14,15 @@ Please download the visualization scripts in the post
- Find the docker we use in `benchmarking pipeline` - Find the docker we use in `benchmarking pipeline`
- Deploy the docker, and inside the docker: - Deploy the docker, and inside the docker:
- Download `nightly-benchmarks.zip`. - Download `nightly-benchmarks.zip`.
- In the same folder, run the following code: - In the same folder, run the following code:
```bash ```bash
export HF_TOKEN=<your HF token> export HF_TOKEN=<your HF token>
apt update apt update
apt install -y git apt install -y git
unzip nightly-benchmarks.zip unzip nightly-benchmarks.zip
VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
``` ```
And the results will be inside `./benchmarks/results`. And the results will be inside `./benchmarks/results`.
...@@ -13,25 +13,25 @@ Latest reproduction guilde: [github issue link](https://github.com/vllm-project/ ...@@ -13,25 +13,25 @@ Latest reproduction guilde: [github issue link](https://github.com/vllm-project/
## Setup ## Setup
- Docker images: - Docker images:
- vLLM: `vllm/vllm-openai:v0.6.2` - vLLM: `vllm/vllm-openai:v0.6.2`
- SGLang: `lmsysorg/sglang:v0.3.2-cu121` - SGLang: `lmsysorg/sglang:v0.3.2-cu121`
- LMDeploy: `openmmlab/lmdeploy:v0.6.1-cu12` - LMDeploy: `openmmlab/lmdeploy:v0.6.1-cu12`
- TensorRT-LLM: `nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3` - TensorRT-LLM: `nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3`
- *NOTE: we uses r24.07 as the current implementation only works for this version. We are going to bump this up.* - *NOTE: we uses r24.07 as the current implementation only works for this version. We are going to bump this up.*
- Check [nightly-pipeline.yaml](nightly-pipeline.yaml) for the concrete docker images, specs and commands we use for the benchmark. - Check [nightly-pipeline.yaml](nightly-pipeline.yaml) for the concrete docker images, specs and commands we use for the benchmark.
- Hardware - Hardware
- 8x Nvidia A100 GPUs - 8x Nvidia A100 GPUs
- Workload: - Workload:
- Dataset - Dataset
- ShareGPT dataset - ShareGPT dataset
- Prefill-heavy dataset (in average 462 input tokens, 16 tokens as output) - Prefill-heavy dataset (in average 462 input tokens, 16 tokens as output)
- Decode-heavy dataset (in average 462 input tokens, 256 output tokens) - Decode-heavy dataset (in average 462 input tokens, 256 output tokens)
- Check [nightly-tests.json](tests/nightly-tests.json) for the concrete configuration of datasets we use. - Check [nightly-tests.json](tests/nightly-tests.json) for the concrete configuration of datasets we use.
- Models: llama-3 8B, llama-3 70B. - Models: llama-3 8B, llama-3 70B.
- We do not use llama 3.1 as it is incompatible with trt-llm r24.07. ([issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105)). - We do not use llama 3.1 as it is incompatible with trt-llm r24.07. ([issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105)).
- Average QPS (query per second): 2, 4, 8, 16, 32 and inf. - Average QPS (query per second): 2, 4, 8, 16, 32 and inf.
- Queries are randomly sampled, and arrival patterns are determined via Poisson process, but all with fixed random seed. - Queries are randomly sampled, and arrival patterns are determined via Poisson process, but all with fixed random seed.
- Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better). - Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better).
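The workload description above says request arrivals follow a Poisson process with a fixed random seed. A minimal sketch of what that could look like, assuming exponentially distributed gaps between requests (the standard way to simulate a Poisson process) and treating a QPS of `inf` as "send everything at once"; the function name and the seed value are hypothetical and not taken from the benchmark scripts:

```python
import math
import random


def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> list[float]:
    """Return send times in seconds for num_requests requests at an average rate of qps."""
    rng = random.Random(seed)  # fixed seed makes the arrival pattern reproducible
    now = 0.0
    times = []
    for _ in range(num_requests):
        if not math.isinf(qps):
            # Gaps between events of a Poisson process are exponentially distributed.
            now += rng.expovariate(qps)
        times.append(now)
    return times


for qps in (2, 4, 8, 16, 32, float("inf")):
    schedule = poisson_arrival_times(num_requests=5, qps=qps)
    print(qps, [round(t, 3) for t in schedule])
```

Because the generator is seeded, two runs with the same seed and QPS produce the same schedule, which keeps comparisons between engines fair.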
## Known issues ## Known issues
......
# Performance benchmarks descriptions
## Latency tests ## Latency tests
......
## Essential Elements of an Effective PR Description Checklist # Essential Elements of an Effective PR Description Checklist
- [ ] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)". - [ ] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
- [ ] The test plan, such as providing test command. - [ ] The test plan, such as providing test command.
- [ ] The test results, such as pasting the results comparison before and after, or e2e results - [ ] The test results, such as pasting the results comparison before and after, or e2e results
...@@ -14,5 +15,4 @@ PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS ABOVE HAVE B ...@@ -14,5 +15,4 @@ PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS ABOVE HAVE B
## (Optional) Documentation Update ## (Optional) Documentation Update
<!--- pyml disable-next-line no-emphasis-as-heading -->
**BEFORE SUBMITTING, PLEASE READ <https://docs.vllm.ai/en/latest/contributing>** (anything written below this line will be removed by GitHub Actions) **BEFORE SUBMITTING, PLEASE READ <https://docs.vllm.ai/en/latest/contributing>** (anything written below this line will be removed by GitHub Actions)
MD007:
  indent: 4
MD013: false
MD024:
  siblings_only: true
MD033: false
MD042: false
MD045: false
MD046: false
MD051: false
MD052: false
MD053: false
MD059: false
...@@ -35,12 +35,11 @@ repos: ...@@ -35,12 +35,11 @@ repos:
exclude: 'csrc/(moe/topk_softmax_kernels.cu|quantization/gguf/(ggml-common.h|dequantize.cuh|vecdotq.cuh|mmq.cuh|mmvq.cuh))|vllm/third_party/.*' exclude: 'csrc/(moe/topk_softmax_kernels.cu|quantization/gguf/(ggml-common.h|dequantize.cuh|vecdotq.cuh|mmq.cuh|mmvq.cuh))|vllm/third_party/.*'
types_or: [c++, cuda] types_or: [c++, cuda]
args: [--style=file, --verbose] args: [--style=file, --verbose]
- repo: https://github.com/jackdewinter/pymarkdown - repo: https://github.com/igorshubovych/markdownlint-cli
rev: v0.9.29 rev: v0.45.0
hooks: hooks:
- id: pymarkdown - id: markdownlint-fix
exclude: '.*\.inc\.md' exclude: '.*\.inc\.md'
args: [fix]
- repo: https://github.com/rhysd/actionlint - repo: https://github.com/rhysd/actionlint
rev: v1.7.7 rev: v1.7.7
hooks: hooks:
......
<!-- markdownlint-disable MD001 MD041 -->
<p align="center"> <p align="center">
<picture> <picture>
<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-dark.png"> <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-dark.png">
...@@ -16,6 +17,7 @@ Easy, fast, and cheap LLM serving for everyone ...@@ -16,6 +17,7 @@ Easy, fast, and cheap LLM serving for everyone
--- ---
*Latest News* 🔥 *Latest News* 🔥
- [2025/05] We hosted [NYC vLLM Meetup](https://lu.ma/c1rqyf1f)! Please find the meetup slides [here](https://docs.google.com/presentation/d/1_q_aW_ioMJWUImf1s1YM-ZhjXz8cUeL0IJvaquOYBeA/edit?usp=sharing). - [2025/05] We hosted [NYC vLLM Meetup](https://lu.ma/c1rqyf1f)! Please find the meetup slides [here](https://docs.google.com/presentation/d/1_q_aW_ioMJWUImf1s1YM-ZhjXz8cUeL0IJvaquOYBeA/edit?usp=sharing).
- [2025/05] vLLM is now a hosted project under PyTorch Foundation! Please find the announcement [here](https://pytorch.org/blog/pytorch-foundation-welcomes-vllm/). - [2025/05] vLLM is now a hosted project under PyTorch Foundation! Please find the announcement [here](https://pytorch.org/blog/pytorch-foundation-welcomes-vllm/).
- [2025/04] We hosted [Asia Developer Day](https://www.sginnovate.com/event/limited-availability-morning-evening-slots-remaining-inaugural-vllm-asia-developer-day)! Please find the meetup slides from the vLLM team [here](https://docs.google.com/presentation/d/19cp6Qu8u48ihB91A064XfaXruNYiBOUKrBxAmDOllOo/edit?usp=sharing). - [2025/04] We hosted [Asia Developer Day](https://www.sginnovate.com/event/limited-availability-morning-evening-slots-remaining-inaugural-vllm-asia-developer-day)! Please find the meetup slides from the vLLM team [here](https://docs.google.com/presentation/d/19cp6Qu8u48ihB91A064XfaXruNYiBOUKrBxAmDOllOo/edit?usp=sharing).
...@@ -46,6 +48,7 @@ Easy, fast, and cheap LLM serving for everyone ...@@ -46,6 +48,7 @@ Easy, fast, and cheap LLM serving for everyone
</details> </details>
--- ---
## About ## About
vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is a fast and easy-to-use library for LLM inference and serving.
...@@ -75,6 +78,7 @@ vLLM is flexible and easy to use with: ...@@ -75,6 +78,7 @@ vLLM is flexible and easy to use with:
- Multi-LoRA support - Multi-LoRA support
vLLM seamlessly supports most popular open-source models on HuggingFace, including: vLLM seamlessly supports most popular open-source models on HuggingFace, including:
- Transformer-like LLMs (e.g., Llama) - Transformer-like LLMs (e.g., Llama)
- Mixture-of-Expert LLMs (e.g., Mixtral, Deepseek-V2 and V3) - Mixture-of-Expert LLMs (e.g., Mixtral, Deepseek-V2 and V3)
- Embedding Models (e.g., E5-Mistral) - Embedding Models (e.g., E5-Mistral)
...@@ -91,6 +95,7 @@ pip install vllm ...@@ -91,6 +95,7 @@ pip install vllm
``` ```
Visit our [documentation](https://docs.vllm.ai/en/latest/) to learn more. Visit our [documentation](https://docs.vllm.ai/en/latest/) to learn more.
- [Installation](https://docs.vllm.ai/en/latest/getting_started/installation.html) - [Installation](https://docs.vllm.ai/en/latest/getting_started/installation.html)
- [Quickstart](https://docs.vllm.ai/en/latest/getting_started/quickstart.html) - [Quickstart](https://docs.vllm.ai/en/latest/getting_started/quickstart.html)
- [List of Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html) - [List of Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html)
...@@ -107,6 +112,7 @@ vLLM is a community project. Our compute resources for development and testing a ...@@ -107,6 +112,7 @@ vLLM is a community project. Our compute resources for development and testing a
<!-- Note: Please sort them in alphabetical order. --> <!-- Note: Please sort them in alphabetical order. -->
<!-- Note: Please keep these consistent with docs/community/sponsors.md --> <!-- Note: Please keep these consistent with docs/community/sponsors.md -->
Cash Donations: Cash Donations:
- a16z - a16z
- Dropbox - Dropbox
- Sequoia Capital - Sequoia Capital
...@@ -114,6 +120,7 @@ Cash Donations: ...@@ -114,6 +120,7 @@ Cash Donations:
- ZhenFund - ZhenFund
Compute Resources: Compute Resources:
- AMD - AMD
- Anyscale - Anyscale
- AWS - AWS
......
...@@ -60,9 +60,10 @@ Please note: **No feature work allowed for cherry picks**. All PRs that are cons ...@@ -60,9 +60,10 @@ Please note: **No feature work allowed for cherry picks**. All PRs that are cons
Before each release, we perform end-to-end performance validation to ensure no regressions are introduced. This validation uses the [vllm-benchmark workflow](https://github.com/pytorch/pytorch-integration-testing/actions/workflows/vllm-benchmark.yml) on PyTorch CI. Before each release, we perform end-to-end performance validation to ensure no regressions are introduced. This validation uses the [vllm-benchmark workflow](https://github.com/pytorch/pytorch-integration-testing/actions/workflows/vllm-benchmark.yml) on PyTorch CI.
**Current Coverage:** **Current Coverage:**
* Models: Llama3, Llama4, and Mixtral * Models: Llama3, Llama4, and Mixtral
* Hardware: NVIDIA H100 and AMD MI300x * Hardware: NVIDIA H100 and AMD MI300x
* *Note: Coverage may change based on new model releases and hardware availability* * _Note: Coverage may change based on new model releases and hardware availability_
**Performance Validation Process:** **Performance Validation Process:**
...@@ -71,11 +72,13 @@ Request write access to the [pytorch/pytorch-integration-testing](https://github ...@@ -71,11 +72,13 @@ Request write access to the [pytorch/pytorch-integration-testing](https://github
**Step 2: Review Benchmark Setup** **Step 2: Review Benchmark Setup**
Familiarize yourself with the benchmark configurations: Familiarize yourself with the benchmark configurations:
* [CUDA setup](https://github.com/pytorch/pytorch-integration-testing/tree/main/vllm-benchmarks/benchmarks/cuda) * [CUDA setup](https://github.com/pytorch/pytorch-integration-testing/tree/main/vllm-benchmarks/benchmarks/cuda)
* [ROCm setup](https://github.com/pytorch/pytorch-integration-testing/tree/main/vllm-benchmarks/benchmarks/rocm) * [ROCm setup](https://github.com/pytorch/pytorch-integration-testing/tree/main/vllm-benchmarks/benchmarks/rocm)
**Step 3: Run the Benchmark** **Step 3: Run the Benchmark**
Navigate to the [vllm-benchmark workflow](https://github.com/pytorch/pytorch-integration-testing/actions/workflows/vllm-benchmark.yml) and configure: Navigate to the [vllm-benchmark workflow](https://github.com/pytorch/pytorch-integration-testing/actions/workflows/vllm-benchmark.yml) and configure:
* **vLLM branch**: Set to the release branch (e.g., `releases/v0.9.2`) * **vLLM branch**: Set to the release branch (e.g., `releases/v0.9.2`)
* **vLLM commit**: Set to the RC commit hash * **vLLM commit**: Set to the RC commit hash
......
...@@ -4,7 +4,7 @@ This README guides you through running benchmark tests with the extensive ...@@ -4,7 +4,7 @@ This README guides you through running benchmark tests with the extensive
datasets supported on vLLM. It’s a living document, updated as new features and datasets datasets supported on vLLM. It’s a living document, updated as new features and datasets
become available. become available.
**Dataset Overview** ## Dataset Overview
<table style="width:100%; border-collapse: collapse;"> <table style="width:100%; border-collapse: collapse;">
<thead> <thead>
...@@ -81,9 +81,10 @@ become available. ...@@ -81,9 +81,10 @@ become available.
**Note**: HuggingFace dataset's `dataset-name` should be set to `hf` **Note**: HuggingFace dataset's `dataset-name` should be set to `hf`
--- ## 🚀 Example - Online Benchmark
<details> <details>
<summary><b>🚀 Example - Online Benchmark</b></summary> <summary>Show more</summary>
<br/> <br/>
...@@ -109,7 +110,7 @@ vllm bench serve \ ...@@ -109,7 +110,7 @@ vllm bench serve \
If successful, you will see the following output If successful, you will see the following output
``` ```text
============ Serving Benchmark Result ============ ============ Serving Benchmark Result ============
Successful requests: 10 Successful requests: 10
Benchmark duration (s): 5.78 Benchmark duration (s): 5.78
...@@ -133,11 +134,11 @@ P99 ITL (ms): 8.39 ...@@ -133,11 +134,11 @@ P99 ITL (ms): 8.39
================================================== ==================================================
``` ```
**Custom Dataset** ### Custom Dataset
If the dataset you want to benchmark is not supported yet in vLLM, even then you can benchmark on it using `CustomDataset`. Your data needs to be in `.jsonl` format and needs to have "prompt" field per entry, e.g., data.jsonl If the dataset you want to benchmark is not supported yet in vLLM, even then you can benchmark on it using `CustomDataset`. Your data needs to be in `.jsonl` format and needs to have "prompt" field per entry, e.g., data.jsonl
``` ```json
{"prompt": "What is the capital of India?"} {"prompt": "What is the capital of India?"}
{"prompt": "What is the capital of Iran?"} {"prompt": "What is the capital of Iran?"}
{"prompt": "What is the capital of China?"} {"prompt": "What is the capital of China?"}
...@@ -166,7 +167,7 @@ vllm bench serve --port 9001 --save-result --save-detailed \ ...@@ -166,7 +167,7 @@ vllm bench serve --port 9001 --save-result --save-detailed \
You can skip applying chat template if your data already has it by using `--custom-skip-chat-template`. You can skip applying chat template if your data already has it by using `--custom-skip-chat-template`.
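A file in the `.jsonl` shape described above (one JSON object per line, each with a `"prompt"` field) can be produced with a few lines of Python. This is a minimal sketch; the helper name `write_prompts_jsonl` and the temporary output location are illustrative assumptions, not part of vLLM.

```python
import json
import tempfile
from pathlib import Path


def write_prompts_jsonl(prompts, path):
    """Write one {"prompt": ...} JSON object per line and return the output path."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for prompt in prompts:
            f.write(json.dumps({"prompt": prompt}, ensure_ascii=False) + "\n")
    return path


prompts = [
    "What is the capital of India?",
    "What is the capital of Iran?",
    "What is the capital of China?",
]
out_dir = Path(tempfile.mkdtemp())
out_file = write_prompts_jsonl(prompts, out_dir / "data.jsonl")
print(out_file.read_text(encoding="utf-8"), end="")
```

Using `json.dumps` rather than string formatting keeps quotes and newlines inside prompts correctly escaped, so every line stays a valid JSON object.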
**VisionArena Benchmark for Vision Language Models** ### VisionArena Benchmark for Vision Language Models
```bash ```bash
# need a model with vision capability here # need a model with vision capability here
...@@ -184,7 +185,7 @@ vllm bench serve \ ...@@ -184,7 +185,7 @@ vllm bench serve \
--num-prompts 1000 --num-prompts 1000
``` ```
**InstructCoder Benchmark with Speculative Decoding** ### InstructCoder Benchmark with Speculative Decoding
``` bash ``` bash
VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct \ VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
...@@ -201,13 +202,13 @@ vllm bench serve \ ...@@ -201,13 +202,13 @@ vllm bench serve \
--num-prompts 2048 --num-prompts 2048
``` ```
**Other HuggingFaceDataset Examples** ### Other HuggingFaceDataset Examples
```bash ```bash
vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests
``` ```
**`lmms-lab/LLaVA-OneVision-Data`** `lmms-lab/LLaVA-OneVision-Data`:
```bash ```bash
vllm bench serve \ vllm bench serve \
...@@ -221,7 +222,7 @@ vllm bench serve \ ...@@ -221,7 +222,7 @@ vllm bench serve \
--num-prompts 10 --num-prompts 10
``` ```
**`Aeala/ShareGPT_Vicuna_unfiltered`** `Aeala/ShareGPT_Vicuna_unfiltered`:
```bash ```bash
vllm bench serve \ vllm bench serve \
...@@ -234,7 +235,7 @@ vllm bench serve \ ...@@ -234,7 +235,7 @@ vllm bench serve \
--num-prompts 10 --num-prompts 10
``` ```
**`AI-MO/aimo-validation-aime`** `AI-MO/aimo-validation-aime`:
``` bash ``` bash
vllm bench serve \ vllm bench serve \
...@@ -245,7 +246,7 @@ vllm bench serve \ ...@@ -245,7 +246,7 @@ vllm bench serve \
--seed 42 --seed 42
``` ```
**`philschmid/mt-bench`** `philschmid/mt-bench`:
``` bash ``` bash
vllm bench serve \ vllm bench serve \
...@@ -255,7 +256,7 @@ vllm bench serve \ ...@@ -255,7 +256,7 @@ vllm bench serve \
--num-prompts 80 --num-prompts 80
``` ```
**Running With Sampling Parameters** ### Running With Sampling Parameters
When using OpenAI-compatible backends such as `vllm`, optional sampling When using OpenAI-compatible backends such as `vllm`, optional sampling
parameters can be specified. Example client command: parameters can be specified. Example client command:
...@@ -273,25 +274,29 @@ vllm bench serve \ ...@@ -273,25 +274,29 @@ vllm bench serve \
--num-prompts 10 --num-prompts 10
``` ```
**Running With Ramp-Up Request Rate** ### Running With Ramp-Up Request Rate
The benchmark tool also supports ramping up the request rate over the The benchmark tool also supports ramping up the request rate over the
duration of the benchmark run. This can be useful for stress testing the duration of the benchmark run. This can be useful for stress testing the
server or finding the maximum throughput that it can handle, given some latency budget. server or finding the maximum throughput that it can handle, given some latency budget.
Two ramp-up strategies are supported: Two ramp-up strategies are supported:
- `linear`: Increases the request rate linearly from a start value to an end value. - `linear`: Increases the request rate linearly from a start value to an end value.
- `exponential`: Increases the request rate exponentially. - `exponential`: Increases the request rate exponentially.
The following arguments can be used to control the ramp-up: The following arguments can be used to control the ramp-up:
- `--ramp-up-strategy`: The ramp-up strategy to use (`linear` or `exponential`). - `--ramp-up-strategy`: The ramp-up strategy to use (`linear` or `exponential`).
- `--ramp-up-start-rps`: The request rate at the beginning of the benchmark. - `--ramp-up-start-rps`: The request rate at the beginning of the benchmark.
- `--ramp-up-end-rps`: The request rate at the end of the benchmark. - `--ramp-up-end-rps`: The request rate at the end of the benchmark.
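To make the two strategies concrete, here is a minimal Python sketch of how a request rate could be interpolated between the start and end values. The formulas are an assumption chosen to match the descriptions above (a straight line for `linear`, a constant growth factor for `exponential`); the function name and the `progress` parameter are hypothetical and may differ from what the benchmark tool does internally.

```python
def current_request_rate(strategy: str, start_rps: float, end_rps: float,
                         progress: float) -> float:
    """Return the request rate at a given progress through the run (0.0 to 1.0)."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be between 0 and 1")
    if strategy == "linear":
        # Straight line from start_rps to end_rps.
        return start_rps + (end_rps - start_rps) * progress
    if strategy == "exponential":
        # Constant multiplicative growth; requires start_rps > 0.
        return start_rps * (end_rps / start_rps) ** progress
    raise ValueError(f"unknown ramp-up strategy: {strategy}")


for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    linear = current_request_rate("linear", 1.0, 16.0, p)
    exponential = current_request_rate("exponential", 1.0, 16.0, p)
    print(f"progress={p:.2f} linear={linear:.2f} exponential={exponential:.2f}")
```

Under these assumptions the exponential ramp spends more of the run at low rates and climbs steeply near the end, while the linear ramp adds load evenly.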
</details> </details>
## 📈 Example - Offline Throughput Benchmark
<details> <details>
<summary><b>📈 Example - Offline Throughput Benchmark</b></summary> <summary>Show more</summary>
<br/> <br/>
...@@ -305,15 +310,15 @@ vllm bench throughput \ ...@@ -305,15 +310,15 @@ vllm bench throughput \
If successful, you will see the following output If successful, you will see the following output
``` ```text
Throughput: 7.15 requests/s, 4656.00 total tokens/s, 1072.15 output tokens/s Throughput: 7.15 requests/s, 4656.00 total tokens/s, 1072.15 output tokens/s
Total num prompt tokens: 5014 Total num prompt tokens: 5014
Total num output tokens: 1500 Total num output tokens: 1500
``` ```
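The three throughput figures in the sample output are mutually consistent, which is a useful sanity check when reading results. The sketch below assumes that "total tokens/s" means prompt plus output tokens divided by elapsed time; the variable names are illustrative and the numbers are copied from the output above.

```python
# Numbers copied from the sample output above.
prompt_tokens = 5014
output_tokens = 1500
total_tokens_per_s = 4656.00
requests_per_s = 7.15

# Assumed relationship: total tokens/s = (prompt + output tokens) / elapsed seconds.
elapsed_s = (prompt_tokens + output_tokens) / total_tokens_per_s
output_tokens_per_s = output_tokens / elapsed_s
implied_requests = requests_per_s * elapsed_s

print(f"elapsed: {elapsed_s:.2f} s")
print(f"output tokens/s: {output_tokens_per_s:.2f}")
print(f"implied request count: {implied_requests:.1f}")
```

The derived output rate agrees with the reported 1072.15 output tokens/s, and the implied request count comes out at roughly 10.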
**VisionArena Benchmark for Vision Language Models** ### VisionArena Benchmark for Vision Language Models
``` bash ```bash
vllm bench throughput \ vllm bench throughput \
--model Qwen/Qwen2-VL-7B-Instruct \ --model Qwen/Qwen2-VL-7B-Instruct \
--backend vllm-chat \ --backend vllm-chat \
...@@ -325,13 +330,13 @@ vllm bench throughput \ ...@@ -325,13 +330,13 @@ vllm bench throughput \
The `num prompt tokens` now includes image token counts The `num prompt tokens` now includes image token counts
``` ```text
Throughput: 2.55 requests/s, 4036.92 total tokens/s, 326.90 output tokens/s Throughput: 2.55 requests/s, 4036.92 total tokens/s, 326.90 output tokens/s
Total num prompt tokens: 14527 Total num prompt tokens: 14527
Total num output tokens: 1280 Total num output tokens: 1280
``` ```
**InstructCoder Benchmark with Speculative Decoding** ### InstructCoder Benchmark with Speculative Decoding
``` bash ``` bash
VLLM_WORKER_MULTIPROC_METHOD=spawn \ VLLM_WORKER_MULTIPROC_METHOD=spawn \
...@@ -349,15 +354,15 @@ vllm bench throughput \ ...@@ -349,15 +354,15 @@ vllm bench throughput \
"prompt_lookup_min": 2}' "prompt_lookup_min": 2}'
``` ```
``` ```text
Throughput: 104.77 requests/s, 23836.22 total tokens/s, 10477.10 output tokens/s Throughput: 104.77 requests/s, 23836.22 total tokens/s, 10477.10 output tokens/s
Total num prompt tokens: 261136 Total num prompt tokens: 261136
Total num output tokens: 204800 Total num output tokens: 204800
``` ```
**Other HuggingFaceDataset Examples** ### Other HuggingFaceDataset Examples
**`lmms-lab/LLaVA-OneVision-Data`** `lmms-lab/LLaVA-OneVision-Data`:
```bash ```bash
vllm bench throughput \ vllm bench throughput \
...@@ -370,7 +375,7 @@ vllm bench throughput \ ...@@ -370,7 +375,7 @@ vllm bench throughput \
--num-prompts 10 --num-prompts 10
``` ```
**`Aeala/ShareGPT_Vicuna_unfiltered`** `Aeala/ShareGPT_Vicuna_unfiltered`:
```bash ```bash
vllm bench throughput \ vllm bench throughput \
...@@ -382,7 +387,7 @@ vllm bench throughput \ ...@@ -382,7 +387,7 @@ vllm bench throughput \
--num-prompts 10 --num-prompts 10
``` ```
**`AI-MO/aimo-validation-aime`** `AI-MO/aimo-validation-aime`:
```bash ```bash
vllm bench throughput \ vllm bench throughput \
...@@ -394,7 +399,7 @@ vllm bench throughput \ ...@@ -394,7 +399,7 @@ vllm bench throughput \
--num-prompts 10 --num-prompts 10
``` ```
**Benchmark with LoRA Adapters** Benchmark with LoRA adapters:
``` bash ``` bash
# download dataset # download dataset
...@@ -413,20 +418,22 @@ vllm bench throughput \ ...@@ -413,20 +418,22 @@ vllm bench throughput \
</details> </details>
## 🛠️ Example - Structured Output Benchmark
<details> <details>
<summary><b>🛠️ Example - Structured Output Benchmark</b></summary> <summary>Show more</summary>
<br/> <br/>
Benchmark the performance of structured output generation (JSON, grammar, regex). Benchmark the performance of structured output generation (JSON, grammar, regex).
**Server Setup** ### Server Setup
```bash ```bash
vllm serve NousResearch/Hermes-3-Llama-3.1-8B --disable-log-requests vllm serve NousResearch/Hermes-3-Llama-3.1-8B --disable-log-requests
``` ```
**JSON Schema Benchmark** ### JSON Schema Benchmark
```bash ```bash
python3 benchmarks/benchmark_serving_structured_output.py \ python3 benchmarks/benchmark_serving_structured_output.py \
...@@ -438,7 +445,7 @@ python3 benchmarks/benchmark_serving_structured_output.py \ ...@@ -438,7 +445,7 @@ python3 benchmarks/benchmark_serving_structured_output.py \
--num-prompts 1000 --num-prompts 1000
``` ```
**Grammar-based Generation Benchmark** ### Grammar-based Generation Benchmark
```bash ```bash
python3 benchmarks/benchmark_serving_structured_output.py \ python3 benchmarks/benchmark_serving_structured_output.py \
...@@ -450,7 +457,7 @@ python3 benchmarks/benchmark_serving_structured_output.py \ ...@@ -450,7 +457,7 @@ python3 benchmarks/benchmark_serving_structured_output.py \
--num-prompts 1000 --num-prompts 1000
``` ```
**Regex-based Generation Benchmark** ### Regex-based Generation Benchmark
```bash ```bash
python3 benchmarks/benchmark_serving_structured_output.py \ python3 benchmarks/benchmark_serving_structured_output.py \
...@@ -461,7 +468,7 @@ python3 benchmarks/benchmark_serving_structured_output.py \ ...@@ -461,7 +468,7 @@ python3 benchmarks/benchmark_serving_structured_output.py \
--num-prompts 1000 --num-prompts 1000
``` ```
**Choice-based Generation Benchmark** ### Choice-based Generation Benchmark
```bash ```bash
python3 benchmarks/benchmark_serving_structured_output.py \ python3 benchmarks/benchmark_serving_structured_output.py \
...@@ -472,7 +479,7 @@ python3 benchmarks/benchmark_serving_structured_output.py \ ...@@ -472,7 +479,7 @@ python3 benchmarks/benchmark_serving_structured_output.py \
--num-prompts 1000 --num-prompts 1000
``` ```
**XGrammar Benchmark Dataset** ### XGrammar Benchmark Dataset
```bash ```bash
python3 benchmarks/benchmark_serving_structured_output.py \ python3 benchmarks/benchmark_serving_structured_output.py \
...@@ -485,14 +492,16 @@ python3 benchmarks/benchmark_serving_structured_output.py \ ...@@ -485,14 +492,16 @@ python3 benchmarks/benchmark_serving_structured_output.py \
</details> </details>
## 📚 Example - Long Document QA Benchmark
<details> <details>
<summary><b>📚 Example - Long Document QA Benchmark</b></summary> <summary>Show more</summary>
<br/> <br/>
Benchmark the performance of long document question-answering with prefix caching. Benchmark the performance of long document question-answering with prefix caching.
**Basic Long Document QA Test** ### Basic Long Document QA Test
```bash ```bash
python3 benchmarks/benchmark_long_document_qa_throughput.py \ python3 benchmarks/benchmark_long_document_qa_throughput.py \
...@@ -504,7 +513,7 @@ python3 benchmarks/benchmark_long_document_qa_throughput.py \ ...@@ -504,7 +513,7 @@ python3 benchmarks/benchmark_long_document_qa_throughput.py \
--repeat-count 5 --repeat-count 5
``` ```
**Different Repeat Modes** ### Different Repeat Modes
```bash ```bash
# Random mode (default) - shuffle prompts randomly # Random mode (default) - shuffle prompts randomly
...@@ -537,14 +546,16 @@ python3 benchmarks/benchmark_long_document_qa_throughput.py \ ...@@ -537,14 +546,16 @@ python3 benchmarks/benchmark_long_document_qa_throughput.py \
</details> </details>
## 🗂️ Example - Prefix Caching Benchmark
<details> <details>
<summary><b>🗂️ Example - Prefix Caching Benchmark</b></summary> <summary>Show more</summary>
<br/> <br/>
Benchmark the efficiency of automatic prefix caching. Benchmark the efficiency of automatic prefix caching.
**Fixed Prompt with Prefix Caching** ### Fixed Prompt with Prefix Caching
```bash ```bash
python3 benchmarks/benchmark_prefix_caching.py \ python3 benchmarks/benchmark_prefix_caching.py \
...@@ -555,7 +566,7 @@ python3 benchmarks/benchmark_prefix_caching.py \ ...@@ -555,7 +566,7 @@ python3 benchmarks/benchmark_prefix_caching.py \
--input-length-range 128:256 --input-length-range 128:256
``` ```
**ShareGPT Dataset with Prefix Caching** ### ShareGPT Dataset with Prefix Caching
```bash ```bash
# download dataset # download dataset
...@@ -572,14 +583,16 @@ python3 benchmarks/benchmark_prefix_caching.py \ ...@@ -572,14 +583,16 @@ python3 benchmarks/benchmark_prefix_caching.py \
</details> </details>
## ⚡ Example - Request Prioritization Benchmark
<details> <details>
<summary><b>⚡ Example - Request Prioritization Benchmark</b></summary> <summary>Show more</summary>
<br/> <br/>
Benchmark the performance of request prioritization in vLLM. Benchmark the performance of request prioritization in vLLM.
**Basic Prioritization Test** ### Basic Prioritization Test
```bash ```bash
python3 benchmarks/benchmark_prioritization.py \ python3 benchmarks/benchmark_prioritization.py \
...@@ -590,7 +603,7 @@ python3 benchmarks/benchmark_prioritization.py \ ...@@ -590,7 +603,7 @@ python3 benchmarks/benchmark_prioritization.py \
--scheduling-policy priority --scheduling-policy priority
``` ```
**Multiple Sequences per Prompt** ### Multiple Sequences per Prompt
```bash ```bash
python3 benchmarks/benchmark_prioritization.py \ python3 benchmarks/benchmark_prioritization.py \
......
...@@ -3,6 +3,7 @@ ...@@ -3,6 +3,7 @@
This script automates the process of finding the optimal server parameter combination (`max-num-seqs` and `max-num-batched-tokens`) to maximize throughput for a vLLM server. It also supports additional constraints such as E2E latency and prefix cache hit rate. This script automates the process of finding the optimal server parameter combination (`max-num-seqs` and `max-num-batched-tokens`) to maximize throughput for a vLLM server. It also supports additional constraints such as E2E latency and prefix cache hit rate.
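The search described above can be pictured as a constrained grid search. The sketch below is only an illustration of that idea, not the logic of `auto_tune.sh` itself: `pick_best`, `fake_benchmark`, and the candidate values are hypothetical names and numbers introduced here, and a real run would start a vLLM server and measure it instead of calling the stand-in function.

```python
from itertools import product


def pick_best(candidates, run_benchmark, max_latency_ms):
    """Return (throughput, max_num_seqs, max_num_batched_tokens) for the
    fastest candidate that meets the latency limit, or None if none does."""
    best = None
    for num_seqs, num_tokens in candidates:
        throughput, latency_ms = run_benchmark(num_seqs, num_tokens)
        if latency_ms > max_latency_ms:
            continue  # violates the E2E latency constraint
        if best is None or throughput > best[0]:
            best = (throughput, num_seqs, num_tokens)
    return best


def fake_benchmark(num_seqs, num_tokens):
    # Hypothetical stand-in for a real measurement: larger settings give
    # more throughput but also higher latency.
    throughput = num_seqs * num_tokens / 100_000
    latency_ms = num_seqs * 2 + num_tokens / 20
    return throughput, latency_ms


candidates = list(product([64, 128, 256], [1024, 2048, 4096]))
constrained = pick_best(candidates, fake_benchmark, max_latency_ms=500)
unconstrained = pick_best(candidates, fake_benchmark, max_latency_ms=10**11)
print(constrained, unconstrained)
```

With a very large latency limit the search simply returns the highest-throughput pair; a tighter limit filters out the largest settings first.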
## Table of Contents ## Table of Contents
- [Prerequisites](#prerequisites) - [Prerequisites](#prerequisites)
- [Configuration](#configuration) - [Configuration](#configuration)
- [How to Run](#how-to-run) - [How to Run](#how-to-run)
...@@ -52,7 +53,7 @@ You must set the following variables at the top of the script before execution. ...@@ -52,7 +53,7 @@ You must set the following variables at the top of the script before execution.
1. **Configure**: Edit the script and set the variables in the [Configuration](#configuration) section. 1. **Configure**: Edit the script and set the variables in the [Configuration](#configuration) section.
2. **Execute**: Run the script. Since the process can take a long time, it is highly recommended to use a terminal multiplexer like `tmux` or `screen` to prevent the script from stopping if your connection is lost. 2. **Execute**: Run the script. Since the process can take a long time, it is highly recommended to use a terminal multiplexer like `tmux` or `screen` to prevent the script from stopping if your connection is lost.
``` ```bash
cd <FOLDER_OF_THIS_SCRIPT> cd <FOLDER_OF_THIS_SCRIPT>
bash auto_tune.sh bash auto_tune.sh
``` ```
...@@ -64,6 +65,7 @@ bash auto_tune.sh ...@@ -64,6 +65,7 @@ bash auto_tune.sh
Here are a few examples of how to configure the script for different goals: Here are a few examples of how to configure the script for different goals:
### 1. Maximize Throughput (No Latency Constraint) ### 1. Maximize Throughput (No Latency Constraint)
- **Goal**: Find the best `max-num-seqs` and `max-num-batched-tokens` to get the highest possible throughput for 1800 input tokens and 20 output tokens. - **Goal**: Find the best `max-num-seqs` and `max-num-batched-tokens` to get the highest possible throughput for 1800 input tokens and 20 output tokens.
- **Configuration**: - **Configuration**:
...@@ -76,6 +78,7 @@ MAX_LATENCY_ALLOWED_MS=100000000000 # A very large number ...@@ -76,6 +78,7 @@ MAX_LATENCY_ALLOWED_MS=100000000000 # A very large number
``` ```
#### 2. Maximize Throughput with a Latency Requirement #### 2. Maximize Throughput with a Latency Requirement
- **Goal**: Find the best server parameters when P99 end-to-end latency must be below 500ms. - **Goal**: Find the best server parameters when P99 end-to-end latency must be below 500ms.
- **Configuration**: - **Configuration**:
...@@ -88,6 +91,7 @@ MAX_LATENCY_ALLOWED_MS=500 ...@@ -88,6 +91,7 @@ MAX_LATENCY_ALLOWED_MS=500
``` ```
#### 3. Maximize Throughput with Prefix Caching and Latency Requirements #### 3. Maximize Throughput with Prefix Caching and Latency Requirements
- **Goal**: Find the best server parameters assuming a 60% prefix cache hit rate and a latency requirement of 500ms. - **Goal**: Find the best server parameters assuming a 60% prefix cache hit rate and a latency requirement of 500ms.
- **Configuration**: - **Configuration**:
...@@ -109,7 +113,7 @@ After the script finishes, you will find the results in a new, timestamped direc ...@@ -109,7 +113,7 @@ After the script finishes, you will find the results in a new, timestamped direc
- **Final Result Summary**: A file named `result.txt` is created in the log directory. It contains a summary of each tested combination and concludes with the overall best parameters found. - **Final Result Summary**: A file named `result.txt` is created in the log directory. It contains a summary of each tested combination and concludes with the overall best parameters found.
``` ```text
# Example result.txt content # Example result.txt content
hash:a1b2c3d4... hash:a1b2c3d4...
max_num_seqs: 128, max_num_batched_tokens: 2048, request_rate: 10.0, e2el: 450.5, throughput: 9.8, goodput: 9.8 max_num_seqs: 128, max_num_batched_tokens: 2048, request_rate: 10.0, e2el: 450.5, throughput: 9.8, goodput: 9.8
......
...@@ -8,7 +8,7 @@ Currently this just includes dense GEMMs and only works on Hopper GPUs. ...@@ -8,7 +8,7 @@ Currently this just includes dense GEMMs and only works on Hopper GPUs.
You need to install vLLM in your usual fashion, then install DeepGEMM from source in its own directory: You need to install vLLM in your usual fashion, then install DeepGEMM from source in its own directory:
``` ```bash
git clone --recursive https://github.com/deepseek-ai/DeepGEMM git clone --recursive https://github.com/deepseek-ai/DeepGEMM
cd DeepGEMM cd DeepGEMM
python setup.py install python setup.py install
...@@ -17,7 +17,7 @@ uv pip install -e . ...@@ -17,7 +17,7 @@ uv pip install -e .
## Usage ## Usage
``` ```console
python benchmark_fp8_block_dense_gemm.py python benchmark_fp8_block_dense_gemm.py
INFO 02-26 21:55:13 [__init__.py:207] Automatically detected platform cuda. INFO 02-26 21:55:13 [__init__.py:207] Automatically detected platform cuda.
===== STARTING FP8 GEMM BENCHMARK ===== ===== STARTING FP8 GEMM BENCHMARK =====
......
...@@ -86,6 +86,7 @@ D = s_a s_b \widehat A \widehat B ...@@ -86,6 +86,7 @@ D = s_a s_b \widehat A \widehat B
``` ```
Epilogue parameters: Epilogue parameters:
- `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector). - `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector).
- `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector). - `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector).
...@@ -135,7 +136,7 @@ That is precomputed and stored in `azp_with_adj` as a row-vector. ...@@ -135,7 +136,7 @@ That is precomputed and stored in `azp_with_adj` as a row-vector.
Epilogue parameters: Epilogue parameters:
- `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector). - `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector).
- Generally this will be per-tensor as the zero-points are per-tensor. - Generally this will be per-tensor as the zero-points are per-tensor.
- `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector). - `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector).
- `azp_with_adj` is the precomputed zero-point term ($` z_a J_a \widehat B `$), is per-channel (row-vector). - `azp_with_adj` is the precomputed zero-point term ($` z_a J_a \widehat B `$), is per-channel (row-vector).
- `bias` is the bias, is always per-channel (row-vector). - `bias` is the bias, is always per-channel (row-vector).
...@@ -152,7 +153,7 @@ That means the zero-point term $` z_a J_a \widehat B `$ becomes an outer product ...@@ -152,7 +153,7 @@ That means the zero-point term $` z_a J_a \widehat B `$ becomes an outer product
Epilogue parameters: Epilogue parameters:
- `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector). - `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector).
- Generally this will be per-token as the zero-points are per-token. - Generally this will be per-token as the zero-points are per-token.
- `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector). - `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector).
- `azp_adj` is the precomputed zero-point adjustment term ($` \mathbf 1 \widehat B `$), is per-channel (row-vector). - `azp_adj` is the precomputed zero-point adjustment term ($` \mathbf 1 \widehat B `$), is per-channel (row-vector).
- `azp` is the zero-point (`z_a`), is per-token (column-vector). - `azp` is the zero-point (`z_a`), is per-token (column-vector).
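To make the roles of `azp` and `azp_adj` concrete, here is a small pure-Python sketch of the per-token zero-point case. It assumes the dequantization convention `A = s_a * (A_q - z_a)` and `B = s_b * B_q`, omits the bias, and uses made-up helper names (`matmul`, `reference`, `epilogue`) and values; it is not the CUTLASS kernel code.

```python
def matmul(a, b):
    """Plain row-by-column matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def reference(a_q, b_q, scale_a, scale_b, azp):
    # Dequantize first, then multiply (assumed convention, bias omitted).
    a = [[scale_a[i] * (v - azp[i]) for v in row] for i, row in enumerate(a_q)]
    b = [[scale_b[j] * v for j, v in enumerate(row)] for row in b_q]
    return matmul(a, b)


def epilogue(a_q, b_q, scale_a, scale_b, azp):
    acc = matmul(a_q, b_q)                     # quantized accumulator
    azp_adj = [sum(col) for col in zip(*b_q)]  # 1 @ B_q, per-channel row-vector
    return [
        [
            # The outer product azp[i] * azp_adj[j] is the zero-point term.
            scale_a[i] * scale_b[j] * (acc[i][j] - azp[i] * azp_adj[j])
            for j in range(len(azp_adj))
        ]
        for i in range(len(acc))
    ]


a_q = [[1, 2], [3, 4]]
b_q = [[5, 6], [7, 8]]
scale_a = [0.5, 0.25]  # per-token (one per row of a_q)
scale_b = [2.0, 4.0]   # per-channel (one per column of b_q)
azp = [1, -2]          # per-token zero-points

expected = reference(a_q, b_q, scale_a, scale_b, azp)
fused = epilogue(a_q, b_q, scale_a, scale_b, azp)
print(expected == fused)
```

Because `azp_adj` depends only on the weights, it can be computed once ahead of time, which is why the epilogue takes it as a precomputed row-vector.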
......
...@@ -6,13 +6,13 @@ toc_depth: 4 ...@@ -6,13 +6,13 @@ toc_depth: 4
The vllm command-line tool is used to run and manage vLLM models. You can start by viewing the help message with: The vllm command-line tool is used to run and manage vLLM models. You can start by viewing the help message with:
``` ```bash
vllm --help vllm --help
``` ```
Available Commands: Available Commands:
``` ```bash
vllm {chat,complete,serve,bench,collect-env,run-batch} vllm {chat,complete,serve,bench,collect-env,run-batch}
``` ```
......
...@@ -40,6 +40,7 @@ Although the first compilation can take some time, for all subsequent server lau ...@@ -40,6 +40,7 @@ Although the first compilation can take some time, for all subsequent server lau
Use `VLLM_XLA_CACHE_PATH` environment variable to write to shareable storage for future deployed nodes (like when using autoscaling). Use `VLLM_XLA_CACHE_PATH` environment variable to write to shareable storage for future deployed nodes (like when using autoscaling).
#### Reducing compilation time #### Reducing compilation time
 This initial compilation time varies significantly and is impacted by many of the arguments discussed in this optimization doc. Factors that influence the length of time to compile are things like model size and `--max-num-batched-tokens`. Other arguments you can tune are things like `VLLM_TPU_MOST_MODEL_LEN`. This initial compilation time varies significantly and is impacted by many of the arguments discussed in this optimization doc. Factors that influence the length of time to compile are things like model size and `--max-num-batched-tokens`. Other arguments you can tune are things like `VLLM_TPU_MOST_MODEL_LEN`.
### Optimize based on your data ### Optimize based on your data
...@@ -71,12 +72,15 @@ The fewer tokens we pad, the less unnecessary computation TPU does, the better p ...@@ -71,12 +72,15 @@ The fewer tokens we pad, the less unnecessary computation TPU does, the better p
 However, you need to choose the padding gap carefully. If the gap is too small, the number of buckets is large, leading to increased warmup (precompile) time and higher memory to store the compiled graphs. Too many compiled graphs may lead to HBM OOM. Conversely, an overly large gap yields no performance improvement compared to the default exponential padding. However, you need to choose the padding gap carefully. If the gap is too small, the number of buckets is large, leading to increased warmup (precompile) time and higher memory to store the compiled graphs. Too many compiled graphs may lead to HBM OOM. Conversely, an overly large gap yields no performance improvement compared to the default exponential padding.
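The trade-off can be illustrated with a simplified model of bucketing. This is a sketch under stated assumptions, not vLLM's actual TPU bucketing code: the function names, the minimum size of 16, the maximum of 1024, and the gap of 64 are all hypothetical choices made for the example.

```python
import math


def exponential_buckets(min_size, max_size):
    """Bucket sizes that double until they cover max_size."""
    sizes = [min_size]
    while sizes[-1] < max_size:
        sizes.append(sizes[-1] * 2)
    return sizes


def linear_buckets(min_size, max_size, gap):
    """Bucket sizes spaced by a fixed gap until they cover max_size."""
    steps = math.ceil((max_size - min_size) / gap)
    return [min_size + i * gap for i in range(steps + 1)]


def padded_size(num_tokens, buckets):
    """Smallest bucket that fits num_tokens."""
    return next(b for b in buckets if b >= num_tokens)


exp = exponential_buckets(16, 1024)
lin = linear_buckets(16, 1024, gap=64)
print(len(exp), padded_size(600, exp))
print(len(lin), padded_size(600, lin))
```

In this toy model a 600-token batch is padded to 1024 with exponential buckets but only to 656 with a gap of 64, at the cost of 17 graphs to precompile instead of 7.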
**If possible, use the precision that matches the chip’s hardware acceleration** #### Quantization
If possible, use the precision that matches the chip’s hardware acceleration:
- v5e has int4/int8 hardware acceleration in the MXU - v5e has int4/int8 hardware acceleration in the MXU
- v6e has int4/int8 hardware acceleration in the MXU - v6e has int4/int8 hardware acceleration in the MXU
Supported quantized formats and features in vLLM on TPU [Jul '25] Supported quantized formats and features in vLLM on TPU [Jul '25]:
- INT8 W8A8 - INT8 W8A8
- INT8 W8A16 - INT8 W8A16
- FP8 KV cache - FP8 KV cache
...@@ -84,11 +88,13 @@ Supported quantized formats and features in vLLM on TPU [Jul '25] ...@@ -84,11 +88,13 @@ Supported quantized formats and features in vLLM on TPU [Jul '25]
- [WIP] AWQ - [WIP] AWQ
- [WIP] FP4 W4A8 - [WIP] FP4 W4A8
**Don't set TP to be less than the number of chips on a single-host deployment** #### Parallelization
Don't set TP to be less than the number of chips on a single-host deployment.
Although it’s common to do this with GPUs, don't try to fragment 2 or 8 different workloads across 8 chips on a single host. If you need 1 or 4 chips, just create an instance with 1 or 4 chips (these are partial-host machine types). Although it’s common to do this with GPUs, don't try to fragment 2 or 8 different workloads across 8 chips on a single host. If you need 1 or 4 chips, just create an instance with 1 or 4 chips (these are partial-host machine types).
### Tune your workloads! ### Tune your workloads
Although we try to have great default configs, we strongly recommend you check out the [vLLM auto-tuner](../../benchmarks/auto_tune/README.md) to optimize your workloads for your use case. Although we try to have great default configs, we strongly recommend you check out the [vLLM auto-tuner](../../benchmarks/auto_tune/README.md) to optimize your workloads for your use case.
...@@ -99,6 +105,7 @@ Although we try to have great default configs, we strongly recommend you check o ...@@ -99,6 +105,7 @@ Although we try to have great default configs, we strongly recommend you check o
The auto-tuner provides a profile of optimized configurations as its final step. However, interpreting this profile can be challenging for new users. We plan to expand this section in the future with more detailed guidance. In the meantime, you can learn how to collect a TPU profile using vLLM's native profiling tools [here](../examples/offline_inference/profiling_tpu.md). This profile can provide valuable insights into your workload's performance. The auto-tuner provides a profile of optimized configurations as its final step. However, interpreting this profile can be challenging for new users. We plan to expand this section in the future with more detailed guidance. In the meantime, you can learn how to collect a TPU profile using vLLM's native profiling tools [here](../examples/offline_inference/profiling_tpu.md). This profile can provide valuable insights into your workload's performance.
#### SPMD #### SPMD
More details to come. More details to come.
**Want us to cover something that isn't listed here? Open up an issue please and cite this doc. We'd love to hear your questions or tips.** **Want us to cover something that isn't listed here? Open up an issue please and cite this doc. We'd love to hear your questions or tips.**
...@@ -20,19 +20,19 @@ the failure? ...@@ -20,19 +20,19 @@ the failure?
- **Use this title format:** - **Use this title format:**
``` ```text
[CI Failure]: failing-test-job - regex/matching/failing:test [CI Failure]: failing-test-job - regex/matching/failing:test
``` ```
- **For the environment field:** - **For the environment field:**
``` ```text
Still failing on main as of commit abcdef123 Still failing on main as of commit abcdef123
``` ```
- **In the description, include failing tests:** - **In the description, include failing tests:**
``` ```text
FAILED failing/test.py:failing_test1 - Failure description FAILED failing/test.py:failing_test1 - Failure description
FAILED failing/test.py:failing_test2 - Failure description FAILED failing/test.py:failing_test2 - Failure description
https://github.com/orgs/vllm-project/projects/20 https://github.com/orgs/vllm-project/projects/20
......
...@@ -106,6 +106,7 @@ releases (which would take too much time), they can be built from ...@@ -106,6 +106,7 @@ releases (which would take too much time), they can be built from
source to unblock the update process. source to unblock the update process.
### FlashInfer ### FlashInfer
Here is how to build and install it from source with `torch2.7.0+cu128` in vLLM [Dockerfile](https://github.com/vllm-project/vllm/blob/27bebcd89792d5c4b08af7a65095759526f2f9e1/docker/Dockerfile#L259-L271): Here is how to build and install it from source with `torch2.7.0+cu128` in vLLM [Dockerfile](https://github.com/vllm-project/vllm/blob/27bebcd89792d5c4b08af7a65095759526f2f9e1/docker/Dockerfile#L259-L271):
```bash ```bash
...@@ -121,6 +122,7 @@ public location for immediate installation, such as [this FlashInfer wheel link] ...@@ -121,6 +122,7 @@ public location for immediate installation, such as [this FlashInfer wheel link]
team if you want to get the package published there. team if you want to get the package published there.
### xFormers ### xFormers
Similar to FlashInfer, here is how to build and install xFormers from source: Similar to FlashInfer, here is how to build and install xFormers from source:
```bash ```bash
...@@ -138,7 +140,7 @@ uv pip install --system \ ...@@ -138,7 +140,7 @@ uv pip install --system \
### causal-conv1d ### causal-conv1d
``` ```bash
uv pip install 'git+https://github.com/Dao-AILab/causal-conv1d@v1.5.0.post8' uv pip install 'git+https://github.com/Dao-AILab/causal-conv1d@v1.5.0.post8'
``` ```
......
...@@ -31,7 +31,7 @@ Features that fall under this policy include (at a minimum) the following: ...@@ -31,7 +31,7 @@ Features that fall under this policy include (at a minimum) the following:
The deprecation process consists of several clearly defined stages that span The deprecation process consists of several clearly defined stages that span
multiple Y releases: multiple Y releases:
**1. Deprecated (Still On By Default)** ### 1. Deprecated (Still On By Default)
- **Action**: Feature is marked as deprecated. - **Action**: Feature is marked as deprecated.
- **Timeline**: A removal version is explicitly stated in the deprecation - **Timeline**: A removal version is explicitly stated in the deprecation
...@@ -46,7 +46,7 @@ warning (e.g., "This will be removed in v0.10.0"). ...@@ -46,7 +46,7 @@ warning (e.g., "This will be removed in v0.10.0").
- GitHub Issue (RFC) for feedback - GitHub Issue (RFC) for feedback
- Documentation and use of the `@typing_extensions.deprecated` decorator for Python APIs - Documentation and use of the `@typing_extensions.deprecated` decorator for Python APIs
 **2. Deprecated (Off By Default)** ### 2. Deprecated (Off By Default)
- **Action**: Feature is disabled by default, but can still be re-enabled via a - **Action**: Feature is disabled by default, but can still be re-enabled via a
CLI flag or environment variable. Feature throws an error when used without CLI flag or environment variable. Feature throws an error when used without
...@@ -55,7 +55,7 @@ re-enabling. ...@@ -55,7 +55,7 @@ re-enabling.
while signaling imminent removal. Ensures any remaining usage is clearly while signaling imminent removal. Ensures any remaining usage is clearly
surfaced and blocks silent breakage before full removal. surfaced and blocks silent breakage before full removal.
**3. Removed** ### 3. Removed
- **Action**: Feature is completely removed from the codebase. - **Action**: Feature is completely removed from the codebase.
- **Note**: Only features that have passed through the previous deprecation - **Note**: Only features that have passed through the previous deprecation
......
...@@ -112,13 +112,13 @@ vllm bench serve \ ...@@ -112,13 +112,13 @@ vllm bench serve \
In practice, you should set the `--duration` argument to a large value. Whenever you want the server to stop profiling, run: In practice, you should set the `--duration` argument to a large value. Whenever you want the server to stop profiling, run:
``` ```bash
nsys sessions list nsys sessions list
``` ```
to get the session id in the form of `profile-XXXXX`, then run: to get the session id in the form of `profile-XXXXX`, then run:
``` ```bash
nsys stop --session=profile-XXXXX nsys stop --session=profile-XXXXX
``` ```
......
...@@ -32,9 +32,9 @@ We prefer to keep all vulnerability-related communication on the security report ...@@ -32,9 +32,9 @@ We prefer to keep all vulnerability-related communication on the security report
on GitHub. However, if you need to contact the VMT directly for an urgent issue, on GitHub. However, if you need to contact the VMT directly for an urgent issue,
you may contact the following individuals: you may contact the following individuals:
- Simon Mo - simon.mo@hey.com - Simon Mo - <simon.mo@hey.com>
- Russell Bryant - rbryant@redhat.com - Russell Bryant - <rbryant@redhat.com>
- Huzaifa Sidhpurwala - huzaifas@redhat.com - Huzaifa Sidhpurwala - <huzaifas@redhat.com>
## Slack Discussion ## Slack Discussion
......