Unverified Commit ba5c5e54 authored by Harry Mellor, committed by GitHub

[Docs] Switch to better markdown linting pre-commit hook (#21851)


Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
parent 555e7225
......@@ -339,13 +339,13 @@ Each described step is logged by vLLM server, as follows (negative values corres
- `VLLM_{phase}_{dim}_BUCKET_{param}` - collection of 12 environment variables configuring ranges of bucketing mechanism
* `{phase}` is either `PROMPT` or `DECODE`
- `{phase}` is either `PROMPT` or `DECODE`
* `{dim}` is either `BS`, `SEQ` or `BLOCK`
- `{dim}` is either `BS`, `SEQ` or `BLOCK`
* `{param}` is either `MIN`, `STEP` or `MAX`
- `{param}` is either `MIN`, `STEP` or `MAX`
* Default values:
- Default values:
| `{phase}` | Parameter | Env Variable | Value Expression |
|-----------|-----------|--------------|------------------|
......
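The `VLLM_{phase}_{dim}_BUCKET_{param}` naming pattern can be expanded programmatically. The following is a minimal sketch, assuming that `PROMPT` pairs with `BS` and `SEQ` while `DECODE` pairs with `BS` and `BLOCK`. That pairing is an assumption (it is not stated in the text here) that would account for the stated count of 12 variables. `bucket_env_names` is a hypothetical helper, not part of vLLM.

```python
from itertools import product

# Assumed phase -> dimension pairing; not confirmed by the surrounding text.
PHASE_DIMS = {"PROMPT": ("BS", "SEQ"), "DECODE": ("BS", "BLOCK")}
PARAMS = ("MIN", "STEP", "MAX")


def bucket_env_names():
    """Return every VLLM_{phase}_{dim}_BUCKET_{param} name under the assumed pairing."""
    return [
        f"VLLM_{phase}_{dim}_BUCKET_{param}"
        for phase, dims in PHASE_DIMS.items()
        for dim, param in product(dims, PARAMS)
    ]


names = bucket_env_names()
for name in names:
    print(name)
```

Under this assumption, 2 phases × 2 dimensions × 3 parameters gives the 12 names.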
# TPU
# TPU Supported Models
## Text-only Language Models
## Supported Models
### Text-only Language Models
| Model | Architecture | Supported |
|-----------------------------------------------------|--------------------------------|-----------|
......
......@@ -45,10 +45,10 @@ If a model is neither supported natively by vLLM or Transformers, it can still b
For a model to be compatible with the Transformers backend for vLLM it must:
- be a Transformers compatible custom model (see [Transformers - Customizing models](https://huggingface.co/docs/transformers/en/custom_models)):
* The model directory must have the correct structure (e.g. `config.json` is present).
* `config.json` must contain `auto_map.AutoModel`.
- The model directory must have the correct structure (e.g. `config.json` is present).
- `config.json` must contain `auto_map.AutoModel`.
- be a Transformers backend for vLLM compatible model (see [writing-custom-models][writing-custom-models]):
* Customisation should be done in the base model (e.g. in `MyModel`, not `MyModelForCausalLM`).
- Customisation should be done in the base model (e.g. in `MyModel`, not `MyModelForCausalLM`).
If the compatible model is:
......@@ -134,10 +134,10 @@ class MyConfig(PretrainedConfig):
- `base_model_tp_plan` is a `dict` that maps fully qualified layer name patterns to tensor parallel styles (currently only `"colwise"` and `"rowwise"` are supported).
- `base_model_pp_plan` is a `dict` that maps direct child layer names to `tuple`s of `list`s of `str`s:
* You only need to do this for layers which are not present on all pipeline stages
* vLLM assumes that there will be only one `nn.ModuleList`, which is distributed across the pipeline stages
* The `list` in the first element of the `tuple` contains the names of the input arguments
* The `list` in the last element of the `tuple` contains the names of the variables the layer outputs to in your modeling code
- You only need to do this for layers which are not present on all pipeline stages
- vLLM assumes that there will be only one `nn.ModuleList`, which is distributed across the pipeline stages
- The `list` in the first element of the `tuple` contains the names of the input arguments
- The `list` in the last element of the `tuple` contains the names of the variables the layer outputs to in your modeling code
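The shapes of the two plans described above could look like the following sketch. The layer names and argument names are hypothetical and chosen only for illustration, and `check_plans` is a hypothetical sanity check, not a vLLM or Transformers API. The plans are shown as plain dictionaries; in practice they would be attributes of the config class.

```python
# Hypothetical layer-name patterns mapped to the two supported styles.
base_model_tp_plan = {
    "layers.*.self_attn.q_proj": "colwise",
    "layers.*.self_attn.o_proj": "rowwise",
}

# Hypothetical direct child layers mapped to (input names, output names).
base_model_pp_plan = {
    "embed_tokens": (["input_ids"], ["inputs_embeds"]),
    "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
    "norm": (["hidden_states"], ["hidden_states"]),
}

SUPPORTED_TP_STYLES = {"colwise", "rowwise"}


def check_plans(tp_plan, pp_plan):
    """Validate the structure described in the text (illustrative only)."""
    for pattern, style in tp_plan.items():
        if style not in SUPPORTED_TP_STYLES:
            raise ValueError(f"unsupported style {style!r} for {pattern!r}")
    for layer, io in pp_plan.items():
        if not (isinstance(io, tuple) and io):
            raise ValueError(f"{layer!r} must map to a non-empty tuple")
        # The first element names the inputs, the last element names the outputs.
        for names in (io[0], io[-1]):
            if not (isinstance(names, list) and all(isinstance(n, str) for n in names)):
                raise ValueError(f"{layer!r} must contain lists of str")
    return True


check_plans(base_model_tp_plan, base_model_pp_plan)
```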
## Loading a Model
......
......@@ -99,7 +99,7 @@ From any node, enter a container and run `ray status` and `ray list nodes` to ve
### Running vLLM on a Ray cluster
!!! tip
If Ray is running inside containers, run the commands in the remainder of this guide _inside the containers_, not on the host. To open a shell inside a container, connect to a node and use `docker exec -it <container_name> /bin/bash`.
If Ray is running inside containers, run the commands in the remainder of this guide *inside the containers*, not on the host. To open a shell inside a container, connect to a node and use `docker exec -it <container_name> /bin/bash`.
Once a Ray cluster is running, use vLLM as you would in a single-node setting. All resources across the Ray cluster are visible to vLLM, so a single `vllm` command on a single node is sufficient.
......
......@@ -31,11 +31,12 @@ vLLM provides three communication backends for EP:
Enable EP by setting the `--enable-expert-parallel` flag. The EP size is automatically calculated as:
```
```text
EP_SIZE = TP_SIZE × DP_SIZE
```
Where:
- `TP_SIZE`: Tensor parallel size (always 1 for now)
- `DP_SIZE`: Data parallel size
- `EP_SIZE`: Expert parallel size (computed automatically)
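The relationship above is a plain product. A minimal sketch of the arithmetic follows; `expert_parallel_size` is a hypothetical helper for illustration, since vLLM computes this value automatically.

```python
def expert_parallel_size(tp_size: int, dp_size: int) -> int:
    """Compute EP_SIZE = TP_SIZE x DP_SIZE (illustrative helper, not a vLLM API)."""
    if tp_size < 1 or dp_size < 1:
        raise ValueError("tp_size and dp_size must be positive integers")
    return tp_size * dp_size


# With TP_SIZE fixed at 1, EP_SIZE equals DP_SIZE.
ep_size = expert_parallel_size(tp_size=1, dp_size=8)
print(ep_size)
```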
......
......@@ -206,6 +206,7 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai
We support both [Vision](https://platform.openai.com/docs/guides/vision)- and
[Audio](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in)-related parameters;
see our [Multimodal Inputs](../features/multimodal_inputs.md) guide for more information.
- *Note: `image_url.detail` parameter is not supported.*
Code example: <gh-file:examples/online_serving/openai_chat_completion_client.py>
......
......@@ -13,15 +13,18 @@ All communications between nodes in a multi-node vLLM deployment are **insecure
The following options control inter-node communications in vLLM:
#### 1. **Environment Variables:**
- `VLLM_HOST_IP`: Sets the IP address for vLLM processes to communicate on
- `VLLM_HOST_IP`: Sets the IP address for vLLM processes to communicate on
#### 2. **KV Cache Transfer Configuration:**
- `--kv-ip`: The IP address for KV cache transfer communications (default: 127.0.0.1)
- `--kv-port`: The port for KV cache transfer communications (default: 14579)
- `--kv-ip`: The IP address for KV cache transfer communications (default: 127.0.0.1)
- `--kv-port`: The port for KV cache transfer communications (default: 14579)
#### 3. **Data Parallel Configuration:**
- `data_parallel_master_ip`: IP of the data parallel master (default: 127.0.0.1)
- `data_parallel_master_port`: Port of the data parallel master (default: 29500)
- `data_parallel_master_ip`: IP of the data parallel master (default: 127.0.0.1)
- `data_parallel_master_port`: Port of the data parallel master (default: 29500)
### Notes on PyTorch Distributed
......@@ -41,18 +44,21 @@ Key points from the PyTorch security guide:
### Security Recommendations
#### 1. **Network Isolation:**
- Deploy vLLM nodes on a dedicated, isolated network
- Use network segmentation to prevent unauthorized access
- Implement appropriate firewall rules
- Deploy vLLM nodes on a dedicated, isolated network
- Use network segmentation to prevent unauthorized access
- Implement appropriate firewall rules
#### 2. **Configuration Best Practices:**
- Always set `VLLM_HOST_IP` to a specific IP address rather than using defaults
- Configure firewalls to only allow necessary ports between nodes
- Always set `VLLM_HOST_IP` to a specific IP address rather than using defaults
- Configure firewalls to only allow necessary ports between nodes
#### 3. **Access Control:**
- Restrict physical and network access to the deployment environment
- Implement proper authentication and authorization for management interfaces
- Follow the principle of least privilege for all system components
- Restrict physical and network access to the deployment environment
- Implement proper authentication and authorization for management interfaces
- Follow the principle of least privilege for all system components
## Security and Firewalls: Protecting Exposed vLLM Systems
......
......@@ -148,7 +148,7 @@ are not yet supported.
vLLM V1 supports logprobs and prompt logprobs. However, there are some important semantic
differences compared to V0:
**Logprobs Calculation**
##### Logprobs Calculation
Logprobs in V1 are now returned immediately once computed from the model’s raw output (i.e.
before applying any logits post-processing such as temperature scaling or penalty
......@@ -157,7 +157,7 @@ probabilities used during sampling.
Support for logprobs with post-sampling adjustments is in progress and will be added in future updates.
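To see why logprobs taken from the raw output can differ from the probabilities used during sampling, consider the following sketch. It is not vLLM code: the logit values and temperature are made up, and `log_softmax` is a small helper written here for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


raw_logits = [2.0, 1.0, 0.0]  # made-up model output
temperature = 0.5             # made-up sampling temperature

# Logprobs computed from the raw output, before any post-processing.
raw_logprobs = log_softmax(raw_logits)

# Logprobs of the distribution after temperature scaling.
scaled_logprobs = log_softmax([x / temperature for x in raw_logits])

print(raw_logprobs)
print(scaled_logprobs)
```

The two lists differ whenever the temperature is not 1, which is the kind of gap the text describes.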
**Prompt Logprobs with Prefix Caching**
##### Prompt Logprobs with Prefix Caching
Currently prompt logprobs are only supported when prefix caching is turned off via `--no-enable-prefix-caching`. In a future release, prompt logprobs will be compatible with prefix caching, but a recomputation will be triggered to recover the full prompt logprobs even upon a prefix cache hit. See details in [RFC #13414](gh-issue:13414).
......@@ -165,7 +165,7 @@ Currently prompt logprobs are only supported when prefix caching is turned off v
As part of the major architectural rework in vLLM V1, several legacy features have been deprecated.
**Sampling features**
##### Sampling features
- **best_of**: This feature has been deprecated due to limited usage. See details at [RFC #13361](gh-issue:13361).
- **Per-Request Logits Processors**: In V0, users could pass custom
......@@ -173,11 +173,11 @@ As part of the major architectural rework in vLLM V1, several legacy features ha
feature has been deprecated. Instead, the design is moving toward supporting **global logits
processors**, a feature the team is actively working on for future releases. See details at [RFC #13360](gh-pr:13360).
**KV Cache features**
##### KV Cache features
- **GPU <> CPU KV Cache Swapping**: with the new simplified core architecture, vLLM V1 no longer requires KV cache swapping
to handle request preemptions.
**Structured Output features**
##### Structured Output features
- **Request-level Structured Output Backend**: Deprecated; alternative backends (outlines, guidance) with fallbacks are now supported.
......@@ -19,9 +19,9 @@ We currently support `/v1/chat/completions`, `/v1/embeddings`, and `/v1/score` e
## Pre-requisites
* The examples in this document use `meta-llama/Meta-Llama-3-8B-Instruct`.
- Create a [user access token](https://huggingface.co/docs/hub/en/security-tokens)
- Install the token on your machine (Run `huggingface-cli login`).
- Get access to the gated model by [visiting the model card](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and agreeing to the terms and conditions.
* Create a [user access token](https://huggingface.co/docs/hub/en/security-tokens)
* Install the token on your machine (Run `huggingface-cli login`).
* Get access to the gated model by [visiting the model card](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and agreeing to the terms and conditions.
## Example 1: Running with a local file
......@@ -105,7 +105,7 @@ To integrate with cloud blob storage, we recommend using presigned urls.
* [Create an S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html).
* The `awscli` package (Run `pip install awscli`) to configure your credentials and interactively use s3.
- [Configure your credentials](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html).
* [Configure your credentials](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html).
* The `boto3` python package (Run `pip install boto3`) to generate presigned urls.
### Step 1: Upload your input script
......
......@@ -28,16 +28,20 @@ to run disaggregated prefill and benchmark the performance.
### Components
#### Server Scripts
- `disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh` - Launches individual vLLM servers for prefill/decode, and also launches the proxy server.
- `disagg_prefill_lmcache_v1/disagg_proxy_server.py` - FastAPI proxy server that coordinates between prefiller and decoder
- `disagg_prefill_lmcache_v1/disagg_example_nixl.sh` - Main script to run the example
#### Configuration
- `disagg_prefill_lmcache_v1/configs/lmcache-prefiller-config.yaml` - Configuration for prefiller server
- `disagg_prefill_lmcache_v1/configs/lmcache-decoder-config.yaml` - Configuration for decoder server
#### Log Files
The main script generates several log files:
- `prefiller.log` - Logs from the prefill server
- `decoder.log` - Logs from the decode server
- `proxy.log` - Logs from the proxy server
......
......@@ -156,16 +156,6 @@ markers = [
"optional: optional tests that are automatically skipped, include --optional to run them",
]
[tool.pymarkdown]
plugins.md004.style = "sublist" # ul-style
plugins.md007.indent = 4 # ul-indent
plugins.md007.start_indented = true # ul-indent
plugins.md013.enabled = false # line-length
plugins.md041.enabled = false # first-line-h1
plugins.md033.enabled = false # inline-html
plugins.md046.enabled = false # code-block-style
plugins.md024.allow_different_nesting = true # no-duplicate-headers
[tool.ty.src]
root = "./vllm"
respect-ignore-files = true
......
# Expert parallel kernels
Large-scale cluster-level expert parallel, as described in the [DeepSeek-V3 Technical Report](http://arxiv.org/abs/2412.19437), is an efficient way to deploy sparse MoE models with many experts. However, such deployment requires many components beyond a normal Python package, including system package support and system driver support. It is impossible to bundle all these components into a Python package.
Here we break down the requirements in 2 steps:
1. Build and install the Python libraries (both [pplx-kernels](https://github.com/ppl-ai/pplx-kernels) and [DeepEP](https://github.com/deepseek-ai/DeepEP)), including necessary dependencies like NVSHMEM. This step does not require any privileged access. Any user can do this.
2. Configure NVIDIA driver to enable IBGDA. This step requires root access, and must be done on the host machine.
......@@ -8,15 +11,15 @@ Here we break down the requirements in 2 steps:
All scripts accept a positional argument as workspace path for staging the build, defaulting to `$(pwd)/ep_kernels_workspace`.
# Usage
## Usage
## Single-node
### Single-node
```bash
bash install_python_libraries.sh
```
## Multi-node
### Multi-node
```bash
bash install_python_libraries.sh
......
......@@ -6,7 +6,8 @@ via the LoRAResolver plugin framework.
Note that `VLLM_ALLOW_RUNTIME_LORA_UPDATING` must be set to true to allow LoRA resolver plugins
to work, and `VLLM_PLUGINS` must be set to include the desired resolver plugins.
# lora_filesystem_resolver
## lora_filesystem_resolver
This LoRA Resolver is installed with vLLM by default.
To use, set `VLLM_PLUGIN_LORA_CACHE_DIR` to a local directory. When vLLM receives a request
for a LoRA adapter `foobar` it doesn't currently recognize, it will look in that local directory
......