Unverified commit ba5c5e54, authored by Harry Mellor, committed by GitHub

[Docs] Switch to better markdown linting pre-commit hook (#21851)


Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
parent 555e7225
@@ -339,13 +339,13 @@ Each described step is logged by vLLM server, as follows (negative values corres
- `VLLM_{phase}_{dim}_BUCKET_{param}` - collection of 12 environment variables configuring ranges of bucketing mechanism - `VLLM_{phase}_{dim}_BUCKET_{param}` - collection of 12 environment variables configuring ranges of bucketing mechanism
* `{phase}` is either `PROMPT` or `DECODE` - `{phase}` is either `PROMPT` or `DECODE`
* `{dim}` is either `BS`, `SEQ` or `BLOCK` - `{dim}` is either `BS`, `SEQ` or `BLOCK`
* `{param}` is either `MIN`, `STEP` or `MAX` - `{param}` is either `MIN`, `STEP` or `MAX`
* Default values: - Default values:
| `{phase}` | Parameter | Env Variable | Value Expression | | `{phase}` | Parameter | Env Variable | Value Expression |
|-----------|-----------|--------------|------------------| |-----------|-----------|--------------|------------------|
......
# TPU # TPU
# TPU Supported Models ## Supported Models
## Text-only Language Models
### Text-only Language Models
| Model | Architecture | Supported | | Model | Architecture | Supported |
|-----------------------------------------------------|--------------------------------|-----------| |-----------------------------------------------------|--------------------------------|-----------|
......
@@ -45,10 +45,10 @@ If a model is neither supported natively by vLLM or Transformers, it can still b
For a model to be compatible with the Transformers backend for vLLM it must: For a model to be compatible with the Transformers backend for vLLM it must:
- be a Transformers compatible custom model (see [Transformers - Customizing models](https://huggingface.co/docs/transformers/en/custom_models)): - be a Transformers compatible custom model (see [Transformers - Customizing models](https://huggingface.co/docs/transformers/en/custom_models)):
* The model directory must have the correct structure (e.g. `config.json` is present). - The model directory must have the correct structure (e.g. `config.json` is present).
* `config.json` must contain `auto_map.AutoModel`. - `config.json` must contain `auto_map.AutoModel`.
- be a Transformers backend for vLLM compatible model (see [writing-custom-models][writing-custom-models]): - be a Transformers backend for vLLM compatible model (see [writing-custom-models][writing-custom-models]):
* Customisation should be done in the base model (e.g. in `MyModel`, not `MyModelForCausalLM`). - Customisation should be done in the base model (e.g. in `MyModel`, not `MyModelForCausalLM`).
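The first requirement above (a `config.json` that contains `auto_map.AutoModel`) can be checked before attempting to load a model. The following is a minimal sketch that uses only the Python standard library; the helper name `has_auto_map_automodel` is hypothetical and is not part of vLLM or Transformers.

```python
import json
from pathlib import Path


def has_auto_map_automodel(model_dir: str) -> bool:
    """Return True if <model_dir>/config.json exists and defines auto_map.AutoModel.

    Hypothetical helper for illustration only; not a vLLM or Transformers API.
    """
    config_path = Path(model_dir) / "config.json"
    if not config_path.is_file():
        # The model directory does not have the expected structure.
        return False
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    auto_map = config.get("auto_map")
    return isinstance(auto_map, dict) and "AutoModel" in auto_map
```

This only verifies the directory layout and the `auto_map` entry; it says nothing about whether the modeling code itself is compatible with the Transformers backend.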
If the compatible model is: If the compatible model is:
@@ -134,10 +134,10 @@ class MyConfig(PretrainedConfig):
- `base_model_tp_plan` is a `dict` that maps fully qualified layer name patterns to tensor parallel styles (currently only `"colwise"` and `"rowwise"` are supported). - `base_model_tp_plan` is a `dict` that maps fully qualified layer name patterns to tensor parallel styles (currently only `"colwise"` and `"rowwise"` are supported).
- `base_model_pp_plan` is a `dict` that maps direct child layer names to `tuple`s of `list`s of `str`s: - `base_model_pp_plan` is a `dict` that maps direct child layer names to `tuple`s of `list`s of `str`s:
* You only need to do this for layers which are not present on all pipeline stages - You only need to do this for layers which are not present on all pipeline stages
* vLLM assumes that there will be only one `nn.ModuleList`, which is distributed across the pipeline stages - vLLM assumes that there will be only one `nn.ModuleList`, which is distributed across the pipeline stages
* The `list` in the first element of the `tuple` contains the names of the input arguments - The `list` in the first element of the `tuple` contains the names of the input arguments
* The `list` in the last element of the `tuple` contains the names of the variables the layer outputs to in your modeling code - The `list` in the last element of the `tuple` contains the names of the variables the layer outputs to in your modeling code
## Loading a Model ## Loading a Model
......
@@ -99,7 +99,7 @@ From any node, enter a container and run `ray status` and `ray list nodes` to ve
### Running vLLM on a Ray cluster ### Running vLLM on a Ray cluster
!!! tip !!! tip
If Ray is running inside containers, run the commands in the remainder of this guide _inside the containers_, not on the host. To open a shell inside a container, connect to a node and use `docker exec -it <container_name> /bin/bash`. If Ray is running inside containers, run the commands in the remainder of this guide *inside the containers*, not on the host. To open a shell inside a container, connect to a node and use `docker exec -it <container_name> /bin/bash`.
Once a Ray cluster is running, use vLLM as you would in a single-node setting. All resources across the Ray cluster are visible to vLLM, so a single `vllm` command on a single node is sufficient. Once a Ray cluster is running, use vLLM as you would in a single-node setting. All resources across the Ray cluster are visible to vLLM, so a single `vllm` command on a single node is sufficient.
......
@@ -31,11 +31,12 @@ vLLM provides three communication backends for EP:
Enable EP by setting the `--enable-expert-parallel` flag. The EP size is automatically calculated as: Enable EP by setting the `--enable-expert-parallel` flag. The EP size is automatically calculated as:
``` ```text
EP_SIZE = TP_SIZE × DP_SIZE EP_SIZE = TP_SIZE × DP_SIZE
``` ```
Where: Where:
- `TP_SIZE`: Tensor parallel size (always 1 for now) - `TP_SIZE`: Tensor parallel size (always 1 for now)
- `DP_SIZE`: Data parallel size - `DP_SIZE`: Data parallel size
- `EP_SIZE`: Expert parallel size (computed automatically) - `EP_SIZE`: Expert parallel size (computed automatically)
......
@@ -206,6 +206,7 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai
We support both [Vision](https://platform.openai.com/docs/guides/vision)- and We support both [Vision](https://platform.openai.com/docs/guides/vision)- and
[Audio](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in)-related parameters; [Audio](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in)-related parameters;
see our [Multimodal Inputs](../features/multimodal_inputs.md) guide for more information. see our [Multimodal Inputs](../features/multimodal_inputs.md) guide for more information.
- *Note: `image_url.detail` parameter is not supported.* - *Note: `image_url.detail` parameter is not supported.*
Code example: <gh-file:examples/online_serving/openai_chat_completion_client.py> Code example: <gh-file:examples/online_serving/openai_chat_completion_client.py>
......
@@ -13,15 +13,18 @@ All communications between nodes in a multi-node vLLM deployment are **insecure
The following options control inter-node communications in vLLM: The following options control inter-node communications in vLLM:
#### 1. **Environment Variables:** #### 1. **Environment Variables:**
- `VLLM_HOST_IP`: Sets the IP address for vLLM processes to communicate on
- `VLLM_HOST_IP`: Sets the IP address for vLLM processes to communicate on
#### 2. **KV Cache Transfer Configuration:** #### 2. **KV Cache Transfer Configuration:**
- `--kv-ip`: The IP address for KV cache transfer communications (default: 127.0.0.1)
- `--kv-port`: The port for KV cache transfer communications (default: 14579) - `--kv-ip`: The IP address for KV cache transfer communications (default: 127.0.0.1)
- `--kv-port`: The port for KV cache transfer communications (default: 14579)
#### 3. **Data Parallel Configuration:** #### 3. **Data Parallel Configuration:**
- `data_parallel_master_ip`: IP of the data parallel master (default: 127.0.0.1)
- `data_parallel_master_port`: Port of the data parallel master (default: 29500) - `data_parallel_master_ip`: IP of the data parallel master (default: 127.0.0.1)
- `data_parallel_master_port`: Port of the data parallel master (default: 29500)
### Notes on PyTorch Distributed ### Notes on PyTorch Distributed
@@ -41,18 +44,21 @@ Key points from the PyTorch security guide:
### Security Recommendations ### Security Recommendations
#### 1. **Network Isolation:** #### 1. **Network Isolation:**
- Deploy vLLM nodes on a dedicated, isolated network
- Use network segmentation to prevent unauthorized access - Deploy vLLM nodes on a dedicated, isolated network
- Implement appropriate firewall rules - Use network segmentation to prevent unauthorized access
- Implement appropriate firewall rules
#### 2. **Configuration Best Practices:** #### 2. **Configuration Best Practices:**
- Always set `VLLM_HOST_IP` to a specific IP address rather than using defaults
- Configure firewalls to only allow necessary ports between nodes - Always set `VLLM_HOST_IP` to a specific IP address rather than using defaults
- Configure firewalls to only allow necessary ports between nodes
#### 3. **Access Control:** #### 3. **Access Control:**
- Restrict physical and network access to the deployment environment
- Implement proper authentication and authorization for management interfaces - Restrict physical and network access to the deployment environment
- Follow the principle of least privilege for all system components - Implement proper authentication and authorization for management interfaces
- Follow the principle of least privilege for all system components
## Security and Firewalls: Protecting Exposed vLLM Systems ## Security and Firewalls: Protecting Exposed vLLM Systems
......
@@ -148,7 +148,7 @@ are not yet supported.
vLLM V1 supports logprobs and prompt logprobs. However, there are some important semantic vLLM V1 supports logprobs and prompt logprobs. However, there are some important semantic
differences compared to V0: differences compared to V0:
**Logprobs Calculation** ##### Logprobs Calculation
Logprobs in V1 are now returned immediately once computed from the model’s raw output (i.e. Logprobs in V1 are now returned immediately once computed from the model’s raw output (i.e.
before applying any logits post-processing such as temperature scaling or penalty before applying any logits post-processing such as temperature scaling or penalty
@@ -157,7 +157,7 @@ probabilities used during sampling.
Support for logprobs with post-sampling adjustments is in progress and will be added in future updates. Support for logprobs with post-sampling adjustments is in progress and will be added in future updates.
**Prompt Logprobs with Prefix Caching** ##### Prompt Logprobs with Prefix Caching
Currently prompt logprobs are only supported when prefix caching is turned off via `--no-enable-prefix-caching`. In a future release, prompt logprobs will be compatible with prefix caching, but a recomputation will be triggered to recover the full prompt logprobs even upon a prefix cache hit. See details in [RFC #13414](gh-issue:13414). Currently prompt logprobs are only supported when prefix caching is turned off via `--no-enable-prefix-caching`. In a future release, prompt logprobs will be compatible with prefix caching, but a recomputation will be triggered to recover the full prompt logprobs even upon a prefix cache hit. See details in [RFC #13414](gh-issue:13414).
@@ -165,7 +165,7 @@ Currently prompt logprobs are only supported when prefix caching is turned off v
As part of the major architectural rework in vLLM V1, several legacy features have been deprecated. As part of the major architectural rework in vLLM V1, several legacy features have been deprecated.
**Sampling features** ##### Sampling features
- **best_of**: This feature has been deprecated due to limited usage. See details at [RFC #13361](gh-issue:13361). - **best_of**: This feature has been deprecated due to limited usage. See details at [RFC #13361](gh-issue:13361).
- **Per-Request Logits Processors**: In V0, users could pass custom - **Per-Request Logits Processors**: In V0, users could pass custom
@@ -173,11 +173,11 @@ As part of the major architectural rework in vLLM V1, several legacy features ha
feature has been deprecated. Instead, the design is moving toward supporting **global logits feature has been deprecated. Instead, the design is moving toward supporting **global logits
processors**, a feature the team is actively working on for future releases. See details at [RFC #13360](gh-pr:13360). processors**, a feature the team is actively working on for future releases. See details at [RFC #13360](gh-pr:13360).
**KV Cache features** ##### KV Cache features
- **GPU <> CPU KV Cache Swapping**: with the new simplified core architecture, vLLM V1 no longer requires KV cache swapping - **GPU <> CPU KV Cache Swapping**: with the new simplified core architecture, vLLM V1 no longer requires KV cache swapping
to handle request preemptions. to handle request preemptions.
**Structured Output features** ##### Structured Output features
- **Request-level Structured Output Backend**: Deprecated, alternative backends (outlines, guidance) with fallbacks is supported now. - **Request-level Structured Output Backend**: Deprecated, alternative backends (outlines, guidance) with fallbacks is supported now.
@@ -19,9 +19,9 @@ We currently support `/v1/chat/completions`, `/v1/embeddings`, and `/v1/score` e
## Pre-requisites ## Pre-requisites
* The examples in this document use `meta-llama/Meta-Llama-3-8B-Instruct`. * The examples in this document use `meta-llama/Meta-Llama-3-8B-Instruct`.
- Create a [user access token](https://huggingface.co/docs/hub/en/security-tokens) * Create a [user access token](https://huggingface.co/docs/hub/en/security-tokens)
- Install the token on your machine (Run `huggingface-cli login`). * Install the token on your machine (Run `huggingface-cli login`).
- Get access to the gated model by [visiting the model card](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and agreeing to the terms and conditions. * Get access to the gated model by [visiting the model card](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and agreeing to the terms and conditions.
## Example 1: Running with a local file ## Example 1: Running with a local file
@@ -105,7 +105,7 @@ To integrate with cloud blob storage, we recommend using presigned urls.
* [Create an S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html). * [Create an S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html).
* The `awscli` package (Run `pip install awscli`) to configure your credentials and interactively use s3. * The `awscli` package (Run `pip install awscli`) to configure your credentials and interactively use s3.
- [Configure your credentials](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html). * [Configure your credentials](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html).
* The `boto3` python package (Run `pip install boto3`) to generate presigned urls. * The `boto3` python package (Run `pip install boto3`) to generate presigned urls.
### Step 1: Upload your input script ### Step 1: Upload your input script
......
@@ -28,16 +28,20 @@ to run disaggregated prefill and benchmark the performance.
### Components ### Components
#### Server Scripts #### Server Scripts
- `disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh` - Launches individual vLLM servers for prefill/decode, and also launches the proxy server. - `disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh` - Launches individual vLLM servers for prefill/decode, and also launches the proxy server.
- `disagg_prefill_lmcache_v1/disagg_proxy_server.py` - FastAPI proxy server that coordinates between prefiller and decoder - `disagg_prefill_lmcache_v1/disagg_proxy_server.py` - FastAPI proxy server that coordinates between prefiller and decoder
- `disagg_prefill_lmcache_v1/disagg_example_nixl.sh` - Main script to run the example - `disagg_prefill_lmcache_v1/disagg_example_nixl.sh` - Main script to run the example
#### Configuration #### Configuration
- `disagg_prefill_lmcache_v1/configs/lmcache-prefiller-config.yaml` - Configuration for prefiller server - `disagg_prefill_lmcache_v1/configs/lmcache-prefiller-config.yaml` - Configuration for prefiller server
- `disagg_prefill_lmcache_v1/configs/lmcache-decoder-config.yaml` - Configuration for decoder server - `disagg_prefill_lmcache_v1/configs/lmcache-decoder-config.yaml` - Configuration for decoder server
#### Log Files #### Log Files
The main script generates several log files: The main script generates several log files:
- `prefiller.log` - Logs from the prefill server - `prefiller.log` - Logs from the prefill server
- `decoder.log` - Logs from the decode server - `decoder.log` - Logs from the decode server
- `proxy.log` - Logs from the proxy server - `proxy.log` - Logs from the proxy server
......
@@ -156,16 +156,6 @@ markers = [
"optional: optional tests that are automatically skipped, include --optional to run them", "optional: optional tests that are automatically skipped, include --optional to run them",
] ]
[tool.pymarkdown]
plugins.md004.style = "sublist" # ul-style
plugins.md007.indent = 4 # ul-indent
plugins.md007.start_indented = true # ul-indent
plugins.md013.enabled = false # line-length
plugins.md041.enabled = false # first-line-h1
plugins.md033.enabled = false # inline-html
plugins.md046.enabled = false # code-block-style
plugins.md024.allow_different_nesting = true # no-duplicate-headers
[tool.ty.src] [tool.ty.src]
root = "./vllm" root = "./vllm"
respect-ignore-files = true respect-ignore-files = true
......
# Expert parallel kernels
Large-scale cluster-level expert parallel, as described in the [DeepSeek-V3 Technical Report](http://arxiv.org/abs/2412.19437), is an efficient way to deploy sparse MoE models with many experts. However, such deployment requires many components beyond a normal Python package, including system package support and system driver support. It is impossible to bundle all these components into a Python package. Large-scale cluster-level expert parallel, as described in the [DeepSeek-V3 Technical Report](http://arxiv.org/abs/2412.19437), is an efficient way to deploy sparse MoE models with many experts. However, such deployment requires many components beyond a normal Python package, including system package support and system driver support. It is impossible to bundle all these components into a Python package.
Here we break down the requirements in 2 steps: Here we break down the requirements in 2 steps:
1. Build and install the Python libraries (both [pplx-kernels](https://github.com/ppl-ai/pplx-kernels) and [DeepEP](https://github.com/deepseek-ai/DeepEP)), including necessary dependencies like NVSHMEM. This step does not require any privileged access. Any user can do this. 1. Build and install the Python libraries (both [pplx-kernels](https://github.com/ppl-ai/pplx-kernels) and [DeepEP](https://github.com/deepseek-ai/DeepEP)), including necessary dependencies like NVSHMEM. This step does not require any privileged access. Any user can do this.
2. Configure NVIDIA driver to enable IBGDA. This step requires root access, and must be done on the host machine. 2. Configure NVIDIA driver to enable IBGDA. This step requires root access, and must be done on the host machine.
@@ -8,15 +11,15 @@ Here we break down the requirements in 2 steps:
All scripts accept a positional argument as workspace path for staging the build, defaulting to `$(pwd)/ep_kernels_workspace`. All scripts accept a positional argument as workspace path for staging the build, defaulting to `$(pwd)/ep_kernels_workspace`.
# Usage ## Usage
## Single-node ### Single-node
```bash ```bash
bash install_python_libraries.sh bash install_python_libraries.sh
``` ```
## Multi-node ### Multi-node
```bash ```bash
bash install_python_libraries.sh bash install_python_libraries.sh
......
@@ -6,7 +6,8 @@ via the LoRAResolver plugin framework.
Note that `VLLM_ALLOW_RUNTIME_LORA_UPDATING` must be set to true to allow LoRA resolver plugins Note that `VLLM_ALLOW_RUNTIME_LORA_UPDATING` must be set to true to allow LoRA resolver plugins
to work, and `VLLM_PLUGINS` must be set to include the desired resolver plugins. to work, and `VLLM_PLUGINS` must be set to include the desired resolver plugins.
# lora_filesystem_resolver ## lora_filesystem_resolver
This LoRA Resolver is installed with vLLM by default. This LoRA Resolver is installed with vLLM by default.
To use, set `VLLM_PLUGIN_LORA_CACHE_DIR` to a local directory. When vLLM receives a request To use, set `VLLM_PLUGIN_LORA_CACHE_DIR` to a local directory. When vLLM receives a request
for a LoRA adapter `foobar` it doesn't currently recognize, it will look in that local directory for a LoRA adapter `foobar` it doesn't currently recognize, it will look in that local directory
......