[Docs] Remove references to `VLLM_ATTENTION_BACKEND` (#30564)

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

[Docs] Remove references to `VLLM_ATTENTION_BACKEND` (#30564)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
f5dfbbd8 · Matthew Bonanni · GitHub · fc011942 · f5dfbbd8
Unverified Commit f5dfbbd8 authored Dec 12, 2025 by Matthew Bonanni Committed by GitHub Dec 13, 2025
Hide whitespace changes
Inline Side-by-side

Showing with 16 additions and 6 deletions

docs/getting_started/quickstart.md docs/getting_started/quickstart.md +16 -6

No files found.
--- a/docs/getting_started/quickstart.md
+++ b/docs/getting_started/quickstart.md
@@ -281,17 +281,27 @@ Alternatively, you can use the `openai` Python package:
 Currently, vLLM supports multiple backends for efficient Attention computation across different platforms and accelerator architectures. It automatically selects the most performant backend compatible with your system and model specifications.
-If desired, you can also manually set the backend of your choice by configuring the environment variable `VLLM_ATTENTION_BACKEND` to one of the following options:
+If desired, you can also manually set the backend of your choice using the `--attention-backend` CLI argument:
+```bash
+# For online serving
+vllm serve Qwen/Qwen2.5-1.5B-Instruct --attention-backend FLASH_ATTN
+# For offline inference
+python script.py --attention-backend FLASHINFER
+```
+Some of the available backend options include:
 - On NVIDIA CUDA: `FLASH_ATTN` or `FLASHINFER`.
 - On AMD ROCm: `TRITON_ATTN`, `ROCM_ATTN`, `ROCM_AITER_FA` or `ROCM_AITER_UNIFIED_ATTN`.
-For AMD ROCm, you can further control the specific Attention implementation using the following variables:
+For AMD ROCm, you can further control the specific Attention implementation using the following options:
- Triton Unified Attention: `VLLM_ROCM_USE_AITER=0 VLLM_V1_USE_PREFILL_DECODE_ATTENTION=0 VLLM_ROCM_USE_AITER_MHA=0`
+- Triton Unified Attention: Set the environment variables `VLLM_ROCM_USE_AITER=0 VLLM_ROCM_USE_AITER_MHA=0` and pass `--attention-config.use_prefill_decode_attention=false` as a CLI argument.
- AITER Unified Attention: `VLLM_ROCM_USE_AITER=1 VLLM_USE_AITER_UNIFIED_ATTENTION=1 VLLM_V1_USE_PREFILL_DECODE_ATTENTION=0 VLLM_ROCM_USE_AITER_MHA=0`
+- AITER Unified Attention: Set the environment variables `VLLM_ROCM_USE_AITER=1 VLLM_USE_AITER_UNIFIED_ATTENTION=1 VLLM_ROCM_USE_AITER_MHA=0` and pass `--attention-config.use_prefill_decode_attention=false` as a CLI argument.
- Triton Prefill-Decode Attention: `VLLM_ROCM_USE_AITER=1 VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 VLLM_ROCM_USE_AITER_MHA=0`
+- Triton Prefill-Decode Attention: Set the environment variables `VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_MHA=0` and pass `--attention-config.use_prefill_decode_attention=true` as a CLI argument.
- AITER Multi-head Attention: `VLLM_ROCM_USE_AITER=1 VLLM_V1_USE_PREFILL_DECODE_ATTENTION=0 VLLM_ROCM_USE_AITER_MHA=1`
+- AITER Multi-head Attention: Set the environment variables `VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_MHA=1` and pass `--attention-config.use_prefill_decode_attention=false` as a CLI argument.
 !!! warning
    There are no pre-built vllm wheels containing Flash Infer, so you must install it in your environment first. Refer to the [Flash Infer official docs](https://docs.flashinfer.ai/) or see [docker/Dockerfile](../../docker/Dockerfile) for instructions on how to install it.