Two helper scripts live in `benchmarks/` to compare hashing options used by prefix caching and related utilities. They are standalone (no server required) and help choose a hash algorithm before enabling prefix caching in production.
-`benchmarks/benchmark_hash.py`: Micro-benchmark that measures per-call latency of three implementations on a representative `(bytes, tuple[int])` payload.
-`benchmarks/benchmark_prefix_block_hash.py`: End-to-end block hashing benchmark that runs the full prefix-cache hash pipeline (`hash_block_tokens`) across many fake blocks and reports throughput.
vLLM maintains a per-commit wheel repository (commonly referred to as "nightly") at `https://wheels.vllm.ai` that provides pre-built wheels for every commit on the `main` branch since `v0.5.3`. This document explains how the nightly wheel index mechanism works.
## Build and Upload Process on CI
### Wheel Building
Wheels are built in the `Release` pipeline (`.buildkite/release-pipeline.yaml`) after a PR is merged into the main branch, with multiple variants:
-**Backend variants**: `cpu` and `cuXXX` (e.g., `cu129`, `cu130`).
-**Architecture variants**: `x86_64` and `aarch64`.
Each build step:
1. Builds the wheel in a Docker container.
2. Renames the wheel filename to use the correct manylinux tag (currently `manylinux_2_31`) for PEP 600 compliance.
3. Uploads the wheel to S3 bucket `vllm-wheels` under `/{commit_hash}/`.
### Index Generation
After uploading each wheel, the `.buildkite/scripts/upload-wheels.sh` script:
1.**Lists all existing wheels** in the commit directory from S3
2.**Generates indices** using `.buildkite/scripts/generate-nightly-index.py`:
3.**Uploads indices** to multiple locations (overriding existing ones):
-`/{commit_hash}/` - Always uploaded for commit-specific access.
-`/nightly/` - Only for commits on `main` branch (not PRs).
-`/{version}/` - Only for release wheels (no `dev` in its version).
!!! tip "Handling Concurrent Builds"
The index generation script can handle multiple variants being built concurrently by always listing all wheels in the commit directory before generating indices, avoiding race conditions.
## Directory Structure
The S3 bucket structure follows this pattern:
```text
s3://vllm-wheels/
├── {commit_hash}/ # Commit-specific wheels and indices
│ ├── vllm-*.whl # All wheel files
│ ├── index.html # Project list (default variant)
│ ├── vllm/
│ │ ├── index.html # Package index (default variant)
│ │ ├── index.html # Package index (cu129 variant)
│ │ └── metadata.json # Metadata (cu129 variant)
│ ├── cu130/ # Variant subdirectory
│ ├── cpu/ # Variant subdirectory
│ └── .../ # More variant subdirectories
├── nightly/ # Latest main branch wheels (mirror of latest commit)
└── {version}/ # Release version indices (e.g., 0.11.2)
```
All built wheels are stored in `/{commit_hash}/`, while different indices are generated and reference them.
This avoids duplication of wheel files.
For example, you can specify the following URLs to use different indices:
-`https://wheels.vllm.ai/nightly/cu130` for the latest main branch wheels built with CUDA 13.0.
-`https://wheels.vllm.ai/{commit_hash}` for wheels built at a specific commit (default variant).
-`https://wheels.vllm.ai/0.12.0/cpu` for 0.12.0 release wheels built for CPU variant.
Please note that not all variants are present on every commit. The available variants are subject to change over time, e.g., changing cu130 to cu131.
### Variant Organization
Indices are organized by variant:
-**Default variant**: Wheels without variant suffix (i.e., built with the current `VLLM_MAIN_CUDA_VERSION`) are placed in the root.
-**Variant subdirectories**: Wheels with variant suffixes (e.g., `+cu130`, `.cpu`) are organized in subdirectories.
-**Alias to default**: The default variant can have an alias (e.g., `cu129` for now) for consistency and convenience.
The variant is extracted from the wheel filename (as described in the [file name convention](https://packaging.python.org/en/latest/specifications/binary-distribution-format/#file-name-convention)):
- The variant is encoded in the local version identifier (e.g. `+cu129` or `dev<N>+g<hash>.cu130`).
The `generate-nightly-index.py` script performs the following:
1.**Parses wheel filenames** using regex to extract:
- Package name
- Version (with variant extracted)
- Python tag, ABI tag, platform tag
- Build tag (if present)
2.**Groups wheels by variant**, then by package name:
- Currently only `vllm` is built, but the structure supports multiple packages in the future.
3.**Generates HTML indices** (compliant with the [Simple repository API](https://packaging.python.org/en/latest/specifications/simple-repository-api/#simple-repository-api)):
- Top-level `index.html`: Lists all packages and variant subdirectories
- Package-level `index.html`: Lists all wheel files for that package
- Uses relative paths to wheel files for portability
4.**Generates metadata.json**:
- Machine-readable JSON containing all wheel metadata
- Includes `path` field with URL-encoded relative path to wheel file
- Used by `setup.py` to locate compatible pre-compiled wheels during Python-only builds
### Special Handling for AWS Services
The wheels and indices are directly stored on AWS S3, and we use AWS CloudFront as a CDN in front of the S3 bucket.
Since S3 does not provide proper directory listing, to support PyPI-compatible simple repository API behavior, we deploy a CloudFront Function that:
- redirects any URL that does not end with `/` and does not look like a file (i.e., does not contain a dot `.` in the last path segment) to the same URL with a trailing `/`
- appends `/index.html` to any URL that ends with `/`
For example, the following requests would be handled as:
-`/nightly` -> `/nightly/index.html`
-`/nightly/cu130/` -> `/nightly/cu130/index.html`
-`/nightly/index.html` or `/nightly/vllm.whl` -> unchanged
!!! note "AWS S3 Filename Escaping"
S3 will automatically escape filenames upon upload according to its [naming rule](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html). The direct impact on vllm is that `+` in filenames will be converted to `%2B`. We take special care in the index generation script to escape filenames properly when generating the HTML indices and JSON metadata, to ensure the URLs are correct and can be directly used.
## Usage of precompiled wheels in `setup.py` {#precompiled-wheels-usage}
When installing vLLM with `VLLM_USE_PRECOMPILED=1`, the `setup.py` script:
1.**Determines wheel location** via `precompiled_wheel_utils.determine_wheel_url()`:
- Env var `VLLM_PRECOMPILED_WHEEL_LOCATION` (user-specified URL/path) always takes precedence and skips all other steps.
- Determines the variant from `VLLM_MAIN_CUDA_VERSION` (can be overridden with env var `VLLM_PRECOMPILED_WHEEL_VARIANT`); the default variant will also be tried as a fallback.
- Determines the _base commit_ (explained later) of this branch (can be overridden with env var `VLLM_PRECOMPILED_WHEEL_COMMIT`).
2.**Fetches metadata** from `https://wheels.vllm.ai/{commit}/vllm/metadata.json` (for the default variant) or `https://wheels.vllm.ai/{commit}/{variant}/vllm/metadata.json` (for a specific variant).
3.**Selects compatible wheel** based on:
- Package name (`vllm`)
- Platform tag (architecture match)
4.**Downloads and extracts** precompiled binaries from the wheel:
- C++ extension modules (`.so` files)
- Flash Attention Python modules
- Triton kernel Python files
5.**Patches package_data** to include extracted files in the installation
!!! note "What is the base commit?"
The base commit is determined by finding the merge-base
between the current branch and upstream `main`, ensuring
compatibility between source code and precompiled binaries.
_Note: it's users' responsibility to ensure there is no native code (e.g., C++ or CUDA) changes before using precompiled wheels._
## Implementation Files
Key files involved in the nightly wheel mechanism:
-**`.buildkite/release-pipeline.yaml`**: CI pipeline that builds wheels
-**`.buildkite/scripts/upload-wheels.sh`**: Script that uploads wheels and generates indices
-**`.buildkite/scripts/generate-nightly-index.py`**: Python script that generates PyPI-compatible indices
-**`setup.py`**: Contains `precompiled_wheel_utils` class for fetching and using precompiled wheels
We support tracing vLLM workers using the `torch.profiler` module. You can enable tracing by setting the `VLLM_TORCH_PROFILER_DIR` environment variable to the directory where you want to save the traces: `VLLM_TORCH_PROFILER_DIR=/mnt/traces/`. Additionally, you can control the profiling content by specifying the following environment variables:
We support tracing vLLM workers using the `torch.profiler` module. You can enable the torch profiler by setting `--profiler-config`
when launching the server, and setting the entries `profiler` to `'torch'` and `torch_profiler_dir` to the directory where you want to save the traces. Additionally, you can control the profiling content by specifying the following additional arguments in the config:
-`VLLM_TORCH_PROFILER_RECORD_SHAPES=1` to enable recording Tensor Shapes, off by default
-`torch_profiler_record_shapes` to enable recording Tensor Shapes, off by default
-`VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY=1` to record memory, off by default
-`torch_profiler_with_memory` to record memory, off by default
-`VLLM_TORCH_PROFILER_WITH_STACK=1` to enable recording stack information, on by default
-`torch_profiler_with_stack` to enable recording stack information, on by default
-`VLLM_TORCH_PROFILER_WITH_FLOPS=1` to enable recording FLOPs, off by default
-`torch_profiler_with_flops` to enable recording FLOPs, off by default
-`VLLM_TORCH_PROFILER_USE_GZIP=0` to disable gzip-compressing profiling files, on by default
-`torch_profiler_use_gzip` to control gzip-compressing profiling files, on by default
-`VLLM_TORCH_PROFILER_DUMP_CUDA_TIME_TOTAL=0` to disable dumping and printing the aggregated CUDA self time table, on by default
-`torch_profiler_dump_cuda_time_total` to control dumping and printing the aggregated CUDA self time table, on by default
The OpenAI server also needs to be started with the `VLLM_TORCH_PROFILER_DIR` environment variable set.
When using `vllm bench serve`, you can enable profiling by passing the `--profile` flag.
When using `vllm bench serve`, you can enable profiling by passing the `--profile` flag.
...
@@ -40,8 +39,7 @@ Refer to [examples/offline_inference/simple_profiling.py](../../examples/offline
...
@@ -40,8 +39,7 @@ Refer to [examples/offline_inference/simple_profiling.py](../../examples/offline
[**Kthena**](https://github.com/volcano-sh/kthena) is a Kubernetes-native LLM inference platform that transforms how organizations deploy and manage Large Language Models in production. Built with declarative model lifecycle management and intelligent request routing, it provides high performance and enterprise-grade scalability for LLM inference workloads.
This guide shows how to deploy a production-grade, **multi-node vLLM** service on Kubernetes.
We’ll:
- Install the required components (Kthena + Volcano).
- Deploy a multi-node vLLM model via Kthena’s `ModelServing` CR.
- Validate the deployment.
---
## 1. Prerequisites
You need:
- A Kubernetes cluster with **GPU nodes**.
-`kubectl` access with cluster-admin or equivalent permissions.
-**Volcano** installed for gang scheduling.
-**Kthena** installed with the `ModelServing` CRD available.
- A valid **Hugging Face token** if loading models from Hugging Face Hub.
- Kthena controllers and CRDs, including `ModelServing`, are installed and healthy.
Validate:
```bash
kubectl get crd | grep modelserving
```
You should see:
```text
modelservings.workload.serving.volcano.sh ...
```
---
## 2. The Multi-Node vLLM `ModelServing` Example
Kthena provides an example manifest to deploy a **multi-node vLLM cluster running Llama**. Conceptually this is equivalent to the vLLM production stack Helm deployment, but expressed with `ModelServing`.
A simplified version of the example (`llama-multinode`) looks like:
-`spec.replicas: 1` – one `ServingGroup` (one logical model deployment).
-`roles`:
-`entryTemplate` – defines **leader** pods that run:
@@ -79,7 +79,7 @@ The `post_process*` methods take `PoolingRequestOutput` objects as input and gen
...
@@ -79,7 +79,7 @@ The `post_process*` methods take `PoolingRequestOutput` objects as input and gen
The `validate_or_generate_params` method is used for validating with the plugin any `SamplingParameters`/`PoolingParameters` received with the user request, or to generate new ones if none are specified. The function always returns the validated/generated parameters.
The `validate_or_generate_params` method is used for validating with the plugin any `SamplingParameters`/`PoolingParameters` received with the user request, or to generate new ones if none are specified. The function always returns the validated/generated parameters.
The `output_to_response` method is used only for online serving and converts the plugin output to the `IOProcessorResponse` type that is then returned by the API Server. The implementation of the `/pooling` serving endpoint is available here [vllm/entrypoints/openai/serving_pooling.py](../../vllm/entrypoints/pooling/pooling/serving.py).
The `output_to_response` method is used only for online serving and converts the plugin output to the `IOProcessorResponse` type that is then returned by the API Server. The implementation of the `/pooling` serving endpoint is available here [vllm/entrypoints/openai/serving_pooling.py](../../vllm/entrypoints/pooling/pooling/serving.py).
An example implementation of a plugin that enables generating geotiff images with the PrithviGeospatialMAE model is available [here](https://github.com/IBM/terratorch/tree/main/terratorch/vllm/plugins/segmentation). Please, also refer to our online ([examples/online_serving/pooling/prithvi_geospatial_mae.py](../../examples/online_serving/pooling/prithvi_geospatial_mae.py)) and offline ([examples/offline_inference/pooling/prithvi_geospatial_mae_io_processor.py](../../examples/offline_inference/pooling/prithvi_geospatial_mae_io_processor.py)) inference examples.
An example implementation of a plugin that enables generating geotiff images with the PrithviGeospatialMAE model is available [here](https://github.com/IBM/terratorch/tree/main/terratorch/vllm/plugins/segmentation). Please, also refer to our online ([examples/pooling/plugin/prithvi_geospatial_mae_client.py](../../examples/pooling/plugin/prithvi_geospatial_mae_client.py)) and offline ([examples/pooling/plugin/prithvi_geospatial_mae_io_processor.py](../../examples/pooling/plugin/prithvi_geospatial_mae_io_processor.py)) inference examples.
-`vllm:request_success_total` - Number of finished requests by their finish reason: either an EOS token was generated or the max sequence length was reached.
-`vllm:request_success` - Number of finished requests by their finish reason: either an EOS token was generated or the max sequence length was reached.
@@ -90,7 +90,6 @@ To be used with a particular `FusedMoEPrepareAndFinalize` subclass, MoE kernels
...
@@ -90,7 +90,6 @@ To be used with a particular `FusedMoEPrepareAndFinalize` subclass, MoE kernels
| cutlass_fp8 | standard,</br>batched | fp8 | A,T | silu, gelu | Y | Y | [`cutlass_moe_fp8`][vllm.model_executor.layers.fused_moe.cutlass_moe.cutlass_moe_fp8],</br>[`CutlassExpertsFp8`][vllm.model_executor.layers.fused_moe.cutlass_moe.CutlassExpertsFp8],</br>[`CutlasBatchedExpertsFp8`][vllm.model_executor.layers.fused_moe.cutlass_moe.CutlassBatchedExpertsFp8] |
| cutlass_fp8 | standard,</br>batched | fp8 | A,T | silu, gelu | Y | Y | [`cutlass_moe_fp8`][vllm.model_executor.layers.fused_moe.cutlass_moe.cutlass_moe_fp8],</br>[`CutlassExpertsFp8`][vllm.model_executor.layers.fused_moe.cutlass_moe.CutlassExpertsFp8],</br>[`CutlasBatchedExpertsFp8`][vllm.model_executor.layers.fused_moe.cutlass_moe.CutlassBatchedExpertsFp8] |
| flashinfer | standard | nvfp4,</br>fp8 | T | <sup>5</sup> | N | Y | [`flashinfer_cutlass_moe_fp4`][vllm.model_executor.layers.fused_moe.flashinfer_cutlass_moe.flashinfer_cutlass_moe_fp4],</br>[`FlashInferExperts`][vllm.model_executor.layers.fused_moe.flashinfer_cutlass_moe.FlashInferExperts] |
| flashinfer | standard | nvfp4,</br>fp8 | T | <sup>5</sup> | N | Y | [`flashinfer_cutlass_moe_fp4`][vllm.model_executor.layers.fused_moe.flashinfer_cutlass_moe.flashinfer_cutlass_moe_fp4],</br>[`FlashInferExperts`][vllm.model_executor.layers.fused_moe.flashinfer_cutlass_moe.FlashInferExperts] |
| gpt oss triton | standard | N/A | N/A | <sup>5</sup> | Y | Y | [`triton_kernel_fused_experts`][vllm.model_executor.layers.fused_moe.gpt_oss_triton_kernels_moe.triton_kernel_fused_experts],</br>[`OAITritonExperts`][vllm.model_executor.layers.fused_moe.gpt_oss_triton_kernels_moe.OAITritonExperts] |
| gpt oss triton | standard | N/A | N/A | <sup>5</sup> | Y | Y | [`triton_kernel_fused_experts`][vllm.model_executor.layers.fused_moe.gpt_oss_triton_kernels_moe.triton_kernel_fused_experts],</br>[`OAITritonExperts`][vllm.model_executor.layers.fused_moe.gpt_oss_triton_kernels_moe.OAITritonExperts] |
| deep gemm+triton<sup>2</sup> | standard,</br>batched | all<sup>1</sup> | G(128),A,T | silu, gelu | <sup>6</sup> | Y | [`TritonOrDeepGemmExperts`][vllm.model_executor.layers.fused_moe.triton_deep_gemm_moe.TritonOrDeepGemmExperts],</br>[`BatchedTritonOrDeepGemmExperts`][vllm.model_executor.layers.fused_moe.batched_triton_or_deep_gemm_moe.BatchedTritonOrDeepGemmExperts] |
@@ -22,8 +22,8 @@ In the example above, the KV cache in the first block can be uniquely identified
...
@@ -22,8 +22,8 @@ In the example above, the KV cache in the first block can be uniquely identified
We only cache full blocks.
We only cache full blocks.
!!! note "Note 2"
!!! note "Note 2"
The above hash key structure is not 100% collision free. Theoretically it’s still possible for the different prefix tokens to have the same hash value. To avoid any hash collisions **in a multi-tenant setup, we advise to use SHA256** as hash function instead of the default builtin hash.
The above hash key structure is not 100% collision free. Theoretically it’s still possible for the different prefix tokens to have the same hash value. To avoid any hash collisions **in a multi-tenant setup, we use SHA256** as hash function instead of the builtin hash.
SHA256 is supported since vLLM v0.8.3 and must be enabled with a command line argument. It comes with a performance impact of about 100-200ns per token (~6ms for 50k tokens of context).
SHA256 is supported since vLLM v0.8.3 and the default since v0.10.2. It comes with a negligible performance impact of about 75ns per token (<4ms for 50k tokens of context).
**A hashing example with multi-modality inputs**
**A hashing example with multi-modality inputs**
In this example, we illustrate how prefix caching works with multi-modality inputs (e.g., images). Assuming we have a request with the following messages:
In this example, we illustrate how prefix caching works with multi-modality inputs (e.g., images). Assuming we have a request with the following messages: