We host regular meetups in San Francisco Bay Area every 2 months. We will share the project updates from the vLLM team and have guest speakers from the industry to share their experience and insights. Please find the materials of our previous meetups below:
-[vLLM Shenzhen Meetup](https://mp.weixin.qq.com/s/k8ZBO1u2_2odgiKWH_GVTQ), August 30th 2025. [[Slides]](https://drive.google.com/drive/folders/1Ua2SVKVSu-wp5vou_6ElraDt2bnKhiEA)
-[vLLM Singapore Meetup](https://www.sginnovate.com/event/vllm-sg-meet), August 27th 2025. [[Slides]](https://drive.google.com/drive/folders/1ncf3GyqLdqFaB6IeB834E5TZJPLAOiXZ?usp=sharing)
-[vLLM Shanghai Meetup](https://mp.weixin.qq.com/s/pDmAXHcN7Iqc8sUKgJgGtg), August 23rd 2025. [[Slides]](https://drive.google.com/drive/folders/1OvLx39wnCGy_WKq8SiVKf7YcxxYI3WCH)
-[vLLM Korea Meetup](https://luma.com/cgcgprmh), August 19th 2025. [[Slides]](https://drive.google.com/file/d/1bcrrAE1rxUgx0mjIeOWT6hNe2RefC5Hm/view).
-[vLLM Beijing Meetup](https://mp.weixin.qq.com/s/dgkWg1WFpWGO2jCdTqQHxA), August 2nd 2025. [[Slides]](https://drive.google.com/drive/folders/1Pid6NSFLU43DZRi0EaTcPgXsAzDvbBqF)[[Recording]](https://www.chaspark.com/#/live/1166916873711665152).
@@ -11,9 +11,39 @@ vLLM contains two sets of benchmarks:
The performance benchmarks are used for development to confirm whether new changes improve performance under various workloads. They are triggered on every commit with both the `perf-benchmarks` and `ready` labels, and when a PR is merged into vLLM.
### Manually Trigger the benchmark
Use [vllm-ci-test-repo images](https://gallery.ecr.aws/q9t5s3a7/vllm-ci-test-repo) with vLLM benchmark suite.
For CPU environment, please use the image with "-cpu" postfix.
Here is an example for docker run command for CPU.
```bash
docker run -it--entrypoint /bin/bash -v /data/huggingface:/root/.cache/huggingface -eHF_TOKEN=''--shm-size=16g --name vllm-cpu-ci public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:1da94e673c257373280026f75ceb4effac80e892-cpu
```
Then, run below command inside the docker instance.
When run, benchmark script generates results under **benchmark/results** folder, along with the benchmark_results.md and benchmark_results.json.
#### Runtime environment variables
-`ON_CPU`: set the value to '1' on Intel® Xeon® Processors. Default value is 0.
-`SERVING_JSON`: JSON file to use for the serving tests. Default value is empty string (use default file).
-`LATENCY_JSON`: JSON file to use for the latency tests. Default value is empty string (use default file).
-`THROUGHPUT_JSON`: JSON file to use for the throughout tests. Default value is empty string (use default file).
-`REMOTE_HOST`: IP for the remote vLLM service to benchmark. Default value is empty string.
-`REMOTE_PORT`: Port for the remote vLLM service to benchmark. Default value is empty string.
For more results visualization, check the [visualizing the results](https://github.com/intel-ai-tce/vllm/blob/more_cpu_models/.buildkite/nightly-benchmarks/README.md#visualizing-the-results).
The latest performance results are hosted on the public [vLLM Performance Dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm).
More information on the performance benchmarks and their parameters can be found [here](gh-file:.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md).
More information on the performance benchmarks and their parameters can be found in [Benchmark README](https://github.com/intel-ai-tce/vllm/blob/more_cpu_models/.buildkite/nightly-benchmarks/README.md) and [performance benchmark description](gh-file:.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md).
# Speech-to-Text (Transcription/Translation) Support
This document walks you through the steps to add support for speech-to-text (ASR) models to vLLM’s transcription and translation APIs by implementing [SupportsTranscription][vllm.model_executor.models.interfaces.SupportsTranscription].
Please refer to the [supported models](../../models/supported_models.md#transcription) for further guidance.
## Update the base vLLM model
It is assumed you have already implemented your model in vLLM according to the basic model guide. Extend your model with the [SupportsTranscription][vllm.model_executor.models.interfaces.SupportsTranscription] interface and implement the following class attributes and methods.
### `supported_languages` and `supports_transcription_only`
Declare supported languages and capabilities:
- The `supported_languages` mapping is validated at init time.
- Set `supports_transcription_only=True` if the model should not serve text generation (eg Whisper).
??? code "supported_languages and supports_transcription_only"
# Set to None to disable server-side chunking if your
# model/processor handles it already
min_energy_split_window_size=None,
)
```
See [Audio preprocessing and chunking](#audio-preprocessing-and-chunking) for what each field controls.
Implement the prompt construction via [get_generation_prompt][vllm.model_executor.models.interfaces.SupportsTranscription.get_generation_prompt]. The server passes you the resampled waveform and task parameters; you return a valid [PromptType][vllm.inputs.data.PromptType]. There are two common patterns:
#### Multimodal LLM with audio embeddings (e.g., Voxtral, Gemma3n)
Return a dict containing `multi_modal_data` with the audio, and either a `prompt` string or `prompt_token_ids`:
# Return None if unknown; otherwise return an estimate.
returnint(audio_duration_s*stt_config.sample_rate//320)# example
```
## Audio preprocessing and chunking
The API server takes care of basic audio I/O and optional chunking before building prompts:
- Resampling: Input audio is resampled to `SpeechToTextConfig.sample_rate` using `librosa`.
- Chunking: If `SpeechToTextConfig.allow_audio_chunking` is True and the duration exceeds `max_audio_clip_s`, the server splits the audio into overlapping chunks and generates a prompt per chunk. Overlap is controlled by `overlap_chunk_second`.
- Energy-aware splitting: When `min_energy_split_window_size` is set, the server finds low-energy regions to minimize cutting within words.
- Gemma3n decoder-only with fixed instruction prompt: <gh-file:vllm/model_executor/models/gemma3n_mm.py>
## Test with the API
Once your model implements `SupportsTranscription`, you can test the endpoints (API mimics OpenAI):
- Transcription (ASR):
```bash
curl -s -X POST \
-H "Authorization: Bearer $VLLM_API_KEY" \
-H "Content-Type: multipart/form-data" \
-F "file=@/path/to/audio.wav" \
-F "model=$MODEL_ID" \
http://localhost:8000/v1/audio/transcriptions
```
- Translation (source → English unless otherwise supported):
```bash
curl -s -X POST \
-H "Authorization: Bearer $VLLM_API_KEY" \
-H "Content-Type: multipart/form-data" \
-F "file=@/path/to/audio.wav" \
-F "model=$MODEL_ID" \
http://localhost:8000/v1/audio/translations
```
Or check out more examples in <gh-file:examples/online_serving>.
!!! note
- If your model handles chunking internally (e.g., via its processor or encoder), set `min_energy_split_window_size=None` in the returned `SpeechToTextConfig` to disable server-side chunking.
- Implementing `get_num_audio_tokens` improves accuracy of streaming usage metrics (`prompt_tokens`) without an extra forward pass.
- For multilingual behavior, keep `supported_languages` aligned with actual model capabilities.
@@ -19,7 +19,7 @@ When using `vllm bench serve`, you can enable profiling by passing the `--profil
Traces can be visualized using <https://ui.perfetto.dev/>.
!!! tip
You can directly call bench module without installing vllm using `python -m vllm.entrypoints.cli.main bench`.
You can directly call bench module without installing vLLM using `python -m vllm.entrypoints.cli.main bench`.
!!! tip
Only send a few requests through vLLM when profiling, as the traces can get quite large. Also, no need to untar the traces, they can be viewed directly.
...
...
@@ -73,6 +73,8 @@ apt install nsight-systems-cli
### Example commands and usage
When profiling with `nsys`, it is advisable to set the environment variable `VLLM_WORKER_MULTIPROC_METHOD=spawn`. The default is to use the `fork` method instead of `spawn`. More information on the topic can be found in the [Nsight Systems release notes](https://docs.nvidia.com/nsight-systems/ReleaseNotes/index.html#general-issues).
#### Offline Inference
For basic usage, you can just append `nsys profile -o report.nsys-rep --trace-fork-before-exec=true --cuda-graph-trace=node` before any existing script you would run for offline inference.
[AnythingLLM](https://github.com/Mintplex-Labs/anything-llm) is a full-stack application that enables you to turn any document, resource, or piece of content into context that any LLM can use as references during chatting.
[AnythingLLM](https://github.com/Mintplex-Labs/anything-llm) is a full-stack application that enables you to turn any document, resource, or piece of content into context that any LLM can use as references during chatting.
It allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints.
## Prerequisites
- Setup vLLM environment
Set up the vLLM environment:
```bash
pip install vllm
```
## Deploy
- Start the vLLM server with the supported chatcompletion model, e.g.
1. Start the vLLM server with a supported chat-completion model, for example:
vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-stack).
vLLM is also available via [Llama Stack](https://github.com/llamastack/llama-stack).
To install Llama Stack, run
...
...
@@ -8,9 +8,9 @@ To install Llama Stack, run
pip install llama-stack -q
```
## Inference using OpenAICompatible API
## Inference using OpenAI-Compatible API
Then start Llama Stack server pointing to your vLLM server with the following configuration:
Then start the Llama Stack server and configure it to point to your vLLM server with the following settings:
```yaml
inference:
...
...
@@ -20,15 +20,15 @@ inference:
url:http://127.0.0.1:8000
```
Please refer to [this guide](https://llama-stack.readthedocs.io/en/latest/distributions/self_hosted_distro/remote-vllm.html) for more details on this remote vLLM provider.
Please refer to [this guide](https://llama-stack.readthedocs.io/en/latest/providers/inference/remote_vllm.html) for more details on this remote vLLM provider.
## Inference via Embedded vLLM
## Inference using Embedded vLLM
An [inline vLLM provider](https://github.com/meta-llama/llama-stack/tree/main/llama_stack/providers/inline/inference/vllm)
An [inline provider](https://github.com/llamastack/llama-stack/tree/main/llama_stack/providers/inline/inference)
is also available. This is a sample of configuration using that method: