@@ -9,7 +9,7 @@ Without them, the CI for your PR will fail.
...
@@ -9,7 +9,7 @@ Without them, the CI for your PR will fail.
### Model loading
### Model loading
Include an example HuggingFace repository for your model in <gh-file:tests/models/registry.py>.
Include an example HuggingFace repository for your model in [tests/models/registry.py](../../../tests/models/registry.py).
This enables a unit test that loads dummy weights to ensure that the model can be initialized in vLLM.
This enables a unit test that loads dummy weights to ensure that the model can be initialized in vLLM.
!!! important
!!! important
...
@@ -26,26 +26,24 @@ Passing these tests provides more confidence that your implementation is correct
...
@@ -26,26 +26,24 @@ Passing these tests provides more confidence that your implementation is correct
### Model correctness
### Model correctness
These tests compare the model outputs of vLLM against [HF Transformers](https://github.com/huggingface/transformers). You can add new tests under the subdirectories of <gh-dir:tests/models>.
These tests compare the model outputs of vLLM against [HF Transformers](https://github.com/huggingface/transformers). You can add new tests under the subdirectories of [tests/models](../../../tests/models).
#### Generative models
#### Generative models
For [generative models](../../models/generative_models.md), there are two levels of correctness tests, as defined in <gh-file:tests/models/utils.py>:
For [generative models](../../models/generative_models.md), there are two levels of correctness tests, as defined in [tests/models/utils.py](../../../tests/models/utils.py):
- Exact correctness (`check_outputs_equal`): The text outputted by vLLM should exactly match the text outputted by HF.
- Exact correctness (`check_outputs_equal`): The text outputted by vLLM should exactly match the text outputted by HF.
- Logprobs similarity (`check_logprobs_close`): The logprobs outputted by vLLM should be in the top-k logprobs outputted by HF, and vice versa.
- Logprobs similarity (`check_logprobs_close`): The logprobs outputted by vLLM should be in the top-k logprobs outputted by HF, and vice versa.
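
The logprobs similarity criterion can be illustrated with a minimal sketch. This is not the implementation in `tests/models/utils.py`; the function name `logprobs_close` and the data layout (one token id plus a top-k dict per position) are hypothetical, and stopping at the first divergence is an assumption made for the sketch.

```python
from typing import Dict, Sequence

# Hypothetical layout: for each generated position, a dict mapping the
# top-k candidate token ids to their logprobs.
TopK = Dict[int, float]


def logprobs_close(
    tokens_a: Sequence[int],
    topk_a: Sequence[TopK],
    tokens_b: Sequence[int],
    topk_b: Sequence[TopK],
) -> bool:
    """Return True if two generations agree up to top-k membership."""
    for tok_a, top_a, tok_b, top_b in zip(tokens_a, topk_a, tokens_b, topk_b):
        if tok_a == tok_b:
            continue  # identical so far; keep comparing
        # First divergence: each side's token must be a top-k candidate
        # of the other side. Later positions are no longer comparable
        # (assumption of this sketch), so the decision is made here.
        return tok_a in top_b and tok_b in top_a
    return True
```

Unlike an exact text comparison, this tolerates a different token choice at the point where the two outputs diverge, as long as each implementation still ranks the other's token among its top-k candidates.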
#### Pooling models
#### Pooling models
For [pooling models](../../models/pooling_models.md), we simply check the cosine similarity, as defined in <gh-file:tests/models/utils.py>.
For [pooling models](../../models/pooling_models.md), we simply check the cosine similarity, as defined in [tests/models/utils.py](../../../tests/models/utils.py).
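
As a rough illustration of a cosine similarity check, the sketch below compares two embedding vectors in plain Python. The helper names and the tolerance value are hypothetical and are not taken from `tests/models/utils.py`.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embeddings_match(
    a: Sequence[float], b: Sequence[float], tol: float = 1e-2
) -> bool:
    """True if the vectors point in nearly the same direction.

    `tol` is an illustrative threshold, not a value used by vLLM.
    """
    return 1.0 - cosine_similarity(a, b) <= tol
```

Because cosine similarity ignores vector magnitude, a check like this passes when two embeddings differ only by a scale factor.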
[](){ #mm-processing-tests }
### Multi-modal processing
### Multi-modal processing
#### Common tests
#### Common tests
Adding your model to <gh-file:tests/models/multimodal/processing/test_common.py> verifies that the following input combinations result in the same outputs:
Adding your model to [tests/models/multimodal/processing/test_common.py](../../../tests/models/multimodal/processing/test_common.py) verifies that the following input combinations result in the same outputs:
- Text + multi-modal data
- Text + multi-modal data
- Tokens + multi-modal data
- Tokens + multi-modal data
...
@@ -54,6 +52,6 @@ Adding your model to <gh-file:tests/models/multimodal/processing/test_common.py>
...
@@ -54,6 +52,6 @@ Adding your model to <gh-file:tests/models/multimodal/processing/test_common.py>
#### Model-specific tests
#### Model-specific tests
You can add a new file under <gh-dir:tests/models/multimodal/processing> to run tests that only apply to your model.
You can add a new file under [tests/models/multimodal/processing](../../../tests/models/multimodal/processing) to run tests that only apply to your model.
For example, if the HF processor for your model accepts user-specified keyword arguments, you can verify that the keyword arguments are being applied correctly, such as in <gh-file:tests/models/multimodal/processing/test_phi3v.py>.
For example, if the HF processor for your model accepts user-specified keyword arguments, you can verify that the keyword arguments are being applied correctly, such as in [tests/models/multimodal/processing/test_phi3v.py](../../../tests/models/multimodal/processing/test_phi3v.py).
- Voxtral decoder-only (audio embeddings + LLM): [vllm/model_executor/models/voxtral.py](../../../vllm/model_executor/models/voxtral.py). Make sure `mistral-common[audio]` is installed.
- Gemma3n decoder-only with fixed instruction prompt: <gh-file:vllm/model_executor/models/gemma3n_mm.py>
- Gemma3n decoder-only with fixed instruction prompt: [vllm/model_executor/models/gemma3n_mm.py](../../../vllm/model_executor/models/gemma3n_mm.py)
## Test with the API
## Test with the API
...
@@ -268,7 +278,7 @@ Once your model implements `SupportsTranscription`, you can test the endpoints (
...
@@ -268,7 +278,7 @@ Once your model implements `SupportsTranscription`, you can test the endpoints (
http://localhost:8000/v1/audio/translations
http://localhost:8000/v1/audio/translations
```
```
Or check out more examples in <gh-file:examples/online_serving>.
Or check out more examples in [examples/online_serving](../../../examples/online_serving).
!!! note
!!! note
- If your model handles chunking internally (e.g., via its processor or encoder), set `min_energy_split_window_size=None` in the returned `SpeechToTextConfig` to disable server-side chunking.
- If your model handles chunking internally (e.g., via its processor or encoder), set `min_energy_split_window_size=None` in the returned `SpeechToTextConfig` to disable server-side chunking.
@@ -33,14 +33,13 @@ Traces can be visualized using <https://ui.perfetto.dev/>.
...
@@ -33,14 +33,13 @@ Traces can be visualized using <https://ui.perfetto.dev/>.
#### Offline Inference
#### Offline Inference
Refer to <gh-file:examples/offline_inference/simple_profiling.py> for an example.
Refer to [examples/offline_inference/simple_profiling.py](../../examples/offline_inference/simple_profiling.py) for an example.
#### OpenAI Server
#### OpenAI Server
```bash
```bash
VLLM_TORCH_PROFILER_DIR=./vllm_profile \
VLLM_TORCH_PROFILER_DIR=./vllm_profile \
python -m vllm.entrypoints.openai.api_server \
vllm serve meta-llama/Llama-3.1-8B-Instruct
--model meta-llama/Meta-Llama-3-70B
```
```
vllm bench command:
vllm bench command:
...
@@ -48,7 +47,7 @@ vllm bench command:
...
@@ -48,7 +47,7 @@ vllm bench command:
```bash
```bash
vllm bench serve \
vllm bench serve \
--backend vllm \
--backend vllm \
--model meta-llama/Meta-Llama-3-70B \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name sharegpt \
--dataset-name sharegpt \
--dataset-path sharegpt.json \
--dataset-path sharegpt.json \
--profile \
--profile \
...
@@ -71,18 +70,21 @@ apt update
...
@@ -71,18 +70,21 @@ apt update
apt install nsight-systems-cli
apt install nsight-systems-cli
```
```
### Example commands and usage
!!! tip
When profiling with `nsys`, it is advisable to set the environment variable `VLLM_WORKER_MULTIPROC_METHOD=spawn`. The default is to use the `fork` method instead of `spawn`. More information on the topic can be found in the [Nsight Systems release notes](https://docs.nvidia.com/nsight-systems/ReleaseNotes/index.html#general-issues).
When profiling with `nsys`, it is advisable to set the environment variable `VLLM_WORKER_MULTIPROC_METHOD=spawn`. The default is to use the `fork` method instead of `spawn`. More information on the topic can be found in the [Nsight Systems release notes](https://docs.nvidia.com/nsight-systems/ReleaseNotes/index.html#general-issues).
The Nsight Systems profiler can be launched with `nsys profile ...`, with a few recommended flags for vLLM: `--trace-fork-before-exec=true --cuda-graph-trace=node`.
### Example commands and usage
#### Offline Inference
#### Offline Inference
For basic usage, you can just prepend `nsys profile -o report.nsys-rep --trace-fork-before-exec=true --cuda-graph-trace=node` to any existing script you would run for offline inference.
For basic usage, you can just prepend the profiling command to any existing script you would run for offline inference.
The following is an example using the `vllm bench latency` script:
The following is an example using the `vllm bench latency` script:
```bash
```bash
nsys profile -o report.nsys-rep \
nsys profile \
--trace-fork-before-exec=true \
--trace-fork-before-exec=true \
--cuda-graph-trace=node \
--cuda-graph-trace=node \
vllm bench latency \
vllm bench latency \
...
@@ -96,40 +98,29 @@ vllm bench latency \
...
@@ -96,40 +98,29 @@ vllm bench latency \
#### OpenAI Server
#### OpenAI Server
To profile the server, you will want to prepend your `vllm serve` command with `nsys profile` just like for offline inference, however you must specify `--delay XX --duration YY` parameters according to the needs of your benchmark. After the duration time has been used up, the server will be killed.
To profile the server, you will want to prepend your `vllm serve` command with `nsys profile` just like for offline inference, but you will need to specify a few other arguments to enable dynamic capture similarly to the Torch Profiler:
```bash
```bash
# server
# server
nsys profile -o report.nsys-rep \
VLLM_TORCH_CUDA_PROFILE=1 \
nsys profile \
--trace-fork-before-exec=true \
--trace-fork-before-exec=true \
--cuda-graph-trace=node \
--cuda-graph-trace=node \
--delay 30 \
--capture-range=cudaProfilerApi \
--duration 60 \
--capture-range-end repeat \
vllm serve meta-llama/Llama-3.1-8B-Instruct
vllm serve meta-llama/Llama-3.1-8B-Instruct
# client
# client
vllm bench serve \
vllm bench serve \
--backend vllm \
--backend vllm \
--model meta-llama/Llama-3.1-8B-Instruct \
--model meta-llama/Llama-3.1-8B-Instruct \
--num-prompts 1 \
--dataset-name sharegpt \
--dataset-name random \
--dataset-path sharegpt.json \
--random-input 1024 \
--profile \
--random-output 512
--num-prompts 2
```
In practice, you should set the `--duration` argument to a large value. Whenever you want the server to stop profiling, run:
```bash
nsys sessions list
```
to get the session id in the form of `profile-XXXXX`, then run:
```bash
nsys stop --session=profile-XXXXX
```
```
to manually kill the profiler and generate your `nsys-rep` report.
With `--profile`, vLLM will capture a profile for each run of `vllm bench serve`. Once the server is killed, the profiles will all be saved.
#### Analysis
#### Analysis
...
@@ -160,14 +151,34 @@ GUI example:
...
@@ -160,14 +151,34 @@ GUI example:
<img width="1799" alt="Screenshot 2025-03-05 at 11 48 42 AM" src="https://github.com/user-attachments/assets/c7cff1ae-6d6f-477d-a342-bd13c4fc424c" />
<img width="1799" alt="Screenshot 2025-03-05 at 11 48 42 AM" src="https://github.com/user-attachments/assets/c7cff1ae-6d6f-477d-a342-bd13c4fc424c" />
## Continuous Profiling
There is a [GitHub CI workflow](https://github.com/pytorch/pytorch-integration-testing/actions/workflows/vllm-profiling.yml) in the PyTorch infrastructure repository that provides continuous profiling for different models on vLLM. This automated profiling helps track performance characteristics over time and across different model configurations.
### How It Works
The workflow currently runs weekly profiling sessions for selected models, generating detailed performance traces that can be analyzed using different tools to identify performance regressions or optimization opportunities. It can also be triggered manually through GitHub Actions.
### Adding New Models
To extend the continuous profiling to additional models, you can modify the [profiling-tests.json](https://github.com/pytorch/pytorch-integration-testing/blob/main/vllm-profiling/cuda/profiling-tests.json) configuration file in the PyTorch integration testing repository. Simply add your model specifications to this file to include them in the automated profiling runs.
### Viewing Profiling Results
The profiling traces generated by the continuous profiling workflow are publicly available on the [vLLM Performance Dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm). Look for the **Profiling traces** table to access and download the traces for different models and runs.
## Profiling vLLM Python Code
## Profiling vLLM Python Code
The Python standard library includes
The Python standard library includes
[cProfile](https://docs.python.org/3/library/profile.html) for profiling Python
[cProfile](https://docs.python.org/3/library/profile.html) for profiling Python
code. vLLM includes a couple of helpers that make it easy to apply it to a section of vLLM.
code. vLLM includes a couple of helpers that make it easy to apply it to a section of vLLM.
Both the `vllm.utils.cprofile` and `vllm.utils.cprofile_context` functions can be
Both the `vllm.utils.profiling.cprofile` and `vllm.utils.profiling.cprofile_context` functions can be
used to profile a section of code.
used to profile a section of code.
!!! note
The legacy import paths `vllm.utils.cprofile` and `vllm.utils.cprofile_context` are deprecated.
Please use `vllm.utils.profiling.cprofile` and `vllm.utils.profiling.cprofile_context` instead.
### Example usage - decorator
### Example usage - decorator
The first helper is a Python decorator that can be used to profile a function.
The first helper is a Python decorator that can be used to profile a function.
...
@@ -175,9 +186,9 @@ If a filename is specified, the profile will be saved to that file. If no filena
...
@@ -175,9 +186,9 @@ If a filename is specified, the profile will be saved to that file. If no filena
specified, profile data will be printed to stdout.
specified, profile data will be printed to stdout.
```python
```python
import vllm.utils
from vllm.utils.profiling import cprofile
@vllm.utils.cprofile("expensive_function.prof")
@cprofile("expensive_function.prof")
def expensive_function():
def expensive_function():
    # some expensive code
    # some expensive code
    pass
    pass
...
@@ -189,13 +200,13 @@ The second helper is a context manager that can be used to profile a block of
...
@@ -189,13 +200,13 @@ The second helper is a context manager that can be used to profile a block of
code. Similar to the decorator, the filename is optional.
code. Similar to the decorator, the filename is optional.
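
To make the context-manager pattern concrete, here is a minimal sketch built only on the standard library's `cProfile` and `pstats`. It is not vLLM's `cprofile_context`; the names `profile_block` and `busy_work` are hypothetical, and the behavior (save to a file if a filename is given, otherwise print to stdout) simply mirrors the description above.

```python
import cProfile
import pstats
from contextlib import contextmanager
from typing import Iterator, Optional


@contextmanager
def profile_block(save_file: Optional[str] = None) -> Iterator[None]:
    """Profile the enclosed block; save to a file or print to stdout."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        yield
    finally:
        profiler.disable()
        if save_file:
            # Binary stats file, readable later with pstats.Stats(path).
            profiler.dump_stats(save_file)
        else:
            # Print the ten most expensive entries by cumulative time.
            pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)


def busy_work(n: int) -> int:
    """Stand-in for the code under investigation."""
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    with profile_block("busy_work.prof"):
        busy_work(100_000)
```

A saved `.prof` file can be inspected afterwards with `pstats` or any viewer that reads the `cProfile` stats format.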
[Anyscale](https://www.anyscale.com) is a managed, multi-cloud platform developed by the creators of Ray.
[Anyscale](https://www.anyscale.com) is a managed, multi-cloud platform developed by the creators of Ray.
Anyscale automates the entire lifecycle of Ray clusters in your AWS, GCP, or Azure account, delivering the flexibility of open-source Ray
Anyscale automates the entire lifecycle of Ray clusters in your AWS, GCP, or Azure account, delivering the flexibility of open-source Ray
without the operational overhead of maintaining Kubernetes control planes, configuring autoscalers, managing observability stacks, or manually managing head and worker nodes with helper scripts like <gh-file:examples/online_serving/run_cluster.sh>.
without the operational overhead of maintaining Kubernetes control planes, configuring autoscalers, managing observability stacks, or manually managing head and worker nodes with helper scripts like [examples/online_serving/run_cluster.sh](../../../examples/online_serving/run_cluster.sh).
When serving large language models with vLLM, Anyscale can rapidly provision [production-ready HTTPS endpoints](https://docs.anyscale.com/examples/deploy-ray-serve-llms) or [fault-tolerant batch inference jobs](https://docs.anyscale.com/examples/ray-data-llm).
When serving large language models with vLLM, Anyscale can rapidly provision [production-ready HTTPS endpoints](https://docs.anyscale.com/examples/deploy-ray-serve-llms) or [fault-tolerant batch inference jobs](https://docs.anyscale.com/examples/ray-data-llm).
@@ -13,7 +13,7 @@ Before you begin, ensure that you have the following:
...
@@ -13,7 +13,7 @@ Before you begin, ensure that you have the following:
- A running Kubernetes cluster
- A running Kubernetes cluster
- NVIDIA Kubernetes Device Plugin (`k8s-device-plugin`): This can be found at [https://github.com/NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin)
- NVIDIA Kubernetes Device Plugin (`k8s-device-plugin`): This can be found at [https://github.com/NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin)
- Available GPU resources in your cluster
- Available GPU resources in your cluster
- An S3 with the model which will be deployed
- (Optional) An S3 bucket or other storage with the model weights, if using automatic model download
## Installing the chart
## Installing the chart
...
@@ -61,10 +61,16 @@ The following table describes configurable parameters of the chart in `values.ya
...
@@ -61,10 +61,16 @@ The following table describes configurable parameters of the chart in `values.ya
Models compatible with vLLM can be deployed on Hugging Face Inference Endpoints, either starting from the [Hugging Face Hub](https://huggingface.co) or directly from the [Inference Endpoints](https://endpoints.huggingface.co/) interface. This allows you to serve models in a fully managed environment with GPU acceleration, auto-scaling, and monitoring, without managing the infrastructure manually.
For advanced details on vLLM integration and deployment options, see [Advanced Deployment Details](#advanced-deployment-details).
## Deployment Methods
- [**Method 1: Deploy from the Catalog.**](#method-1-deploy-from-the-catalog) One-click deploy models from the Hugging Face Hub with ready-made optimized configurations.
- [**Method 2: Guided Deployment (Transformers Models).**](#method-2-guided-deployment-transformers-models) Instantly deploy models tagged with `transformers` from the Hub UI using the **Deploy** button.
- [**Method 3: Manual Deployment (Advanced Models).**](#method-3-manual-deployment-advanced-models) For models that either use custom code with the `transformers` tag, or don’t run with standard `transformers` but are supported by vLLM. This method requires manual configuration.
### Method 1: Deploy from the Catalog
This is the easiest way to get started with vLLM on Hugging Face Inference Endpoints. You can browse a catalog of models with verified and optimized deployment configuration at [Inference Endpoints](https://endpoints.huggingface.co/catalog) to maximize performance.
1. Go to [Endpoints Catalog](https://endpoints.huggingface.co/catalog) and in the **Inference Server** options, select `vLLM`. This will display the current list of models with optimized preconfigured options.
1. Once the deployment is ready, you can use the endpoint. Update the `DEPLOYMENT_URL` with the URL provided in the console, remembering to append `/v1` as required.
"text": "Give me a brief explanation of gravity in simple terms.",
}
],
}
],
stream=True,
)
for message in chat_completion:
    print(message.choices[0].delta.content or "", end="")
```
!!! note
The catalog provides models optimized for vLLM, including GPU settings and inference engine configurations. You can monitor the endpoint and update the **container or its configuration** from the Inference Endpoints UI.
This method applies to models with the [`transformers` library tag](https://huggingface.co/models?library=transformers) in their metadata. It allows you to deploy a model directly from the Hub UI without manual configuration.
1. Navigate to a model on [Hugging Face Hub](https://huggingface.co/models).
For this example we will use the [`ibm-granite/granite-docling-258M`](https://huggingface.co/ibm-granite/granite-docling-258M) model. You can verify that the model is compatible by checking the front matter in the [README](https://huggingface.co/ibm-granite/granite-docling-258M/blob/main/README.md), where the library is tagged as `library: transformers`.
2. Locate the **Deploy** button. The button appears for models tagged with `transformers` at the top right of the [model card](https://huggingface.co/ibm-granite/granite-docling-258M).
4. Select the Hardware (we choose AWS>GPU>T4 for the example) and Container Configuration. Choose `vLLM` as the container type and finalize the deployment by pressing **Create Endpoint**.
5. Use the deployed endpoint. Update the `DEPLOYMENT_URL` with the URL provided in the console, remembering to append `/v1` as required. You can then use your endpoint programmatically or via the SDK.
This method uses best-guess defaults. You may need to adjust the configuration to fit your specific requirements.
### Method 3: Manual Deployment (Advanced Models)
Some models require manual deployment because they:
- Use custom code with the `transformers` tag
- Don't run with standard `transformers` but are supported by `vLLM`
These models cannot be deployed using the **Deploy** button on the model card.
In this guide, we demonstrate manual deployment using the [`rednote-hilab/dots.ocr`](https://huggingface.co/rednote-hilab/dots.ocr) model, an OCR model integrated with vLLM (see vLLM [PR](https://github.com/vllm-project/vllm/pull/24645)).
1. Start a new deployment. Go to [Inference Endpoints](https://endpoints.huggingface.co/) and click `New`.
5. Create the endpoint. Click **Create Endpoint** to deploy the model.
Once the endpoint is ready, you can use it with the OpenAI Completion API, cURL, or other SDKs. Remember to append `/v1` to the deployment URL if needed.
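
One way to avoid forgetting or doubling the `/v1` suffix is to normalize the URL before passing it to a client. The helper below is a minimal sketch; the function name `openai_base_url` is hypothetical, and the URL in the usage example is a placeholder, not a real endpoint.

```python
def openai_base_url(deployment_url: str) -> str:
    """Return the deployment URL with exactly one trailing /v1 segment."""
    url = deployment_url.rstrip("/")
    return url if url.endswith("/v1") else url + "/v1"


if __name__ == "__main__":
    # Placeholder URL for illustration only.
    print(openai_base_url("https://my-endpoint.example.com/"))
```

The returned string can then be used as the base URL of an OpenAI-compatible client.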
!!! note
You can adjust the **container settings** (Container URI, Container Arguments) from the Inference Endpoints UI and press **Update Endpoint**. This redeploys the endpoint with the updated container configuration. Changes to the model itself require creating a new endpoint or redeploying with a different model. For example, for this demo, you may need to update the Container URI to the nightly image (`vllm/vllm-openai:nightly`) and add the `--trust-remote-code` flag in the container arguments.
## Advanced Deployment Details
With the [Transformers modeling backend integration](https://blog.vllm.ai/2025/04/11/transformers-backend.html), vLLM now offers Day 0 support for any model compatible with `transformers`. This means you can deploy such models immediately, leveraging vLLM’s optimized inference without additional backend modifications.
Hugging Face Inference Endpoints provides a fully managed environment for serving models via vLLM. You can deploy models without configuring servers, installing dependencies, or managing clusters. Endpoints also support deployment across multiple cloud providers (AWS, Azure, GCP) without the need for separate accounts.
The platform integrates seamlessly with the Hugging Face Hub, allowing you to deploy any vLLM- or `transformers`-compatible model, track usage, and update the inference engine directly. The vLLM engine comes preconfigured, enabling optimized inference and easy switching between models or engines without modifying your code. This setup simplifies production deployment: endpoints are ready in minutes, include monitoring and logging, and let you focus on serving models rather than maintaining infrastructure.
## Next Steps
- Explore the [Inference Endpoints](https://endpoints.huggingface.co/catalog) model catalog
- Read the Inference Endpoints [documentation](https://huggingface.co/docs/inference-endpoints/en/index)
- Learn about [Inference Endpoints engines](https://huggingface.co/docs/inference-endpoints/en/engines/vllm)
- Understand the [Transformers modeling backend integration](https://blog.vllm.ai/2025/04/11/transformers-backend.html)
1. Use the script: <gh-file:examples/online_serving/retrieval_augmented_generation_with_langchain.py>
1. Use the script: [examples/online_serving/retrieval_augmented_generation_with_langchain.py](../../../examples/online_serving/retrieval_augmented_generation_with_langchain.py)
1. Use the script: <gh-file:examples/online_serving/retrieval_augmented_generation_with_llamaindex.py>
1. Use the script: [examples/online_serving/retrieval_augmented_generation_with_llamaindex.py](../../../examples/online_serving/retrieval_augmented_generation_with_llamaindex.py)
1. Use the script: <gh-file:examples/online_serving/streamlit_openai_chatbot_webserver.py>
1. Use the script: [examples/online_serving/streamlit_openai_chatbot_webserver.py](../../../examples/online_serving/streamlit_openai_chatbot_webserver.py)
[KAITO](https://kaito-project.github.io/kaito/docs/) is a Kubernetes operator that supports deploying and serving LLMs with vLLM. It manages large models via container images with built-in OpenAI-compatible inference, auto-provisions GPU nodes, and provides curated model presets.
Please refer to the [quick start](https://kaito-project.github.io/kaito/docs/quick-start) for more details.