Commit 402d3783 authored by Cyrus Leung, committed by GitHub

[Doc] [1/N] Reorganize Getting Started section (#11645)


Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
parent 9e764e7b
@@ -77,8 +77,7 @@ python -m vllm.entrypoints.openai.api_server --model <model>
 That code can be found in <gh-file:vllm/entrypoints/openai/api_server.py>.
-More details on the API server can be found in the {doc}`OpenAI Compatible
-Server </serving/openai_compatible_server>` document.
+More details on the API server can be found in the [OpenAI-Compatible Server](#openai-compatible-server) document.
 ## LLM Engine
......
@@ -2,7 +2,7 @@
 ## Debugging
-Please see the [Debugging Tips](#debugging-python-multiprocessing)
+Please see the [Troubleshooting](#troubleshooting-python-multiprocessing)
 page for information on known issues and how to solve them.
 ## Introduction
......
@@ -2,7 +2,7 @@
 # Installation for ARM CPUs
-vLLM has been adapted to work on ARM64 CPUs with NEON support, leveraging the CPU backend initially developed for the x86 platform. This guide provides installation instructions specific to ARM. For additional details on supported features, refer to the x86 platform documentation covering:
+vLLM has been adapted to work on ARM64 CPUs with NEON support, leveraging the CPU backend initially developed for the x86 platform. This guide provides installation instructions specific to ARM. For additional details on supported features, refer to the [x86 CPU documentation](#installation-x86) covering:
 - CPU backend inference capabilities
 - Relevant runtime environment variables
......
-(installation-cpu)=
+(installation-x86)=
-# Installation with CPU
+# Installation for x86 CPUs
 vLLM initially supports basic model inferencing and serving on x86 CPU platform, with data types FP32, FP16 and BF16. vLLM CPU backend supports the following vLLM features:
@@ -151,4 +151,4 @@ $ python examples/offline_inference.py
 $ VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" vllm serve meta-llama/Llama-2-7b-chat-hf -tp=2 --distributed-executor-backend mp
 ```
-- Using Data Parallel for maximum throughput: to launch an LLM serving endpoint on each NUMA node along with one additional load balancer to dispatch the requests to those endpoints. Common solutions like [Nginx](../serving/deploying_with_nginx.md) or HAProxy are recommended. Anyscale Ray project provides the feature on LLM [serving](https://docs.ray.io/en/latest/serve/index.html). Here is the example to setup a scalable LLM serving with [Ray Serve](https://github.com/intel/llm-on-ray/blob/main/docs/setup.md).
+- Using Data Parallel for maximum throughput: to launch an LLM serving endpoint on each NUMA node along with one additional load balancer to dispatch the requests to those endpoints. Common solutions like [Nginx](#nginxloadbalancer) or HAProxy are recommended. Anyscale Ray project provides the feature on LLM [serving](https://docs.ray.io/en/latest/serve/index.html). Here is the example to setup a scalable LLM serving with [Ray Serve](https://github.com/intel/llm-on-ray/blob/main/docs/setup.md).
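The `vllm serve` command in the hunk above pairs `VLLM_CPU_OMP_THREADS_BIND="0-31|32-63"` with `-tp=2`, which reads as one group of cores per tensor-parallel rank. As a rough illustration of that reading, the sketch below expands such a value into explicit core lists. The helper name `parse_omp_threads_bind` is hypothetical, and the parsing rules (`|` between ranks, `,` and `-` inside a group) are an assumption made for illustration, not vLLM's own parser.

```python
def parse_omp_threads_bind(value: str) -> list[list[int]]:
    """Expand a value such as "0-31|32-63" into one list of core IDs per rank.

    Assumed format (not taken from vLLM's source): groups are separated by "|",
    and each group is a comma-separated list of single cores or "start-end" ranges.
    """
    per_rank = []
    for group in value.split("|"):
        cores = []
        for part in group.split(","):
            part = part.strip()
            if "-" in part:
                start, end = part.split("-")
                cores.extend(range(int(start), int(end) + 1))  # ranges are inclusive
            elif part:
                cores.append(int(part))
        per_rank.append(cores)
    return per_rank


bindings = parse_omp_threads_bind("0-31|32-63")
for rank, cores in enumerate(bindings):
    print(f"rank {rank}: {len(cores)} cores, first={cores[0]}, last={cores[-1]}")
```

A check like this can help confirm that the number of groups matches the tensor-parallel size before launching the server.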
-(installation)=
+(installation-cuda)=
-# Installation
+# Installation for CUDA
 vLLM is a Python library that also contains pre-compiled C++ and CUDA (12.1) binaries.
......
 (installation-rocm)=
-# Installation with ROCm
+# Installation for ROCm
 vLLM supports AMD GPUs with ROCm 6.2.
......
-# Installation with Intel® Gaudi® AI Accelerators
+(installation-gaudi)=
+# Installation for Intel® Gaudi®
 This README provides instructions on running vLLM with Intel Gaudi devices.
......
+(installation-index)=
+# Installation
+vLLM supports the following hardware platforms:
+```{toctree}
+:maxdepth: 1
+gpu-cuda
+gpu-rocm
+cpu-x86
+cpu-arm
+hpu-gaudi
+tpu
+xpu
+openvino
+neuron
+```
 (installation-neuron)=
-# Installation with Neuron
+# Installation for Neuron
 vLLM 0.3.3 onwards supports model inferencing and serving on AWS Trainium/Inferentia with Neuron SDK with continuous batching.
 Paged Attention and Chunked Prefill are currently in development and will be available soon.
......
 (installation-openvino)=
-# Installation with OpenVINO
+# Installation for OpenVINO
-vLLM powered by OpenVINO supports all LLM models from {doc}`vLLM supported models list <../models/supported_models>` and can perform optimal model serving on all x86-64 CPUs with, at least, AVX2 support, as well as on both integrated and discrete Intel® GPUs ([the list of supported GPUs](https://docs.openvino.ai/2024/about-openvino/release-notes-openvino/system-requirements.html#gpu)). OpenVINO vLLM backend supports the following advanced vLLM features:
+vLLM powered by OpenVINO supports all LLM models from [vLLM supported models list](#supported-models) and can perform optimal model serving on all x86-64 CPUs with, at least, AVX2 support, as well as on both integrated and discrete Intel® GPUs ([the list of supported GPUs](https://docs.openvino.ai/2024/about-openvino/release-notes-openvino/system-requirements.html#gpu)). OpenVINO vLLM backend supports the following advanced vLLM features:
 - Prefix caching (`--enable-prefix-caching`)
 - Chunked prefill (`--enable-chunked-prefill`)
......
 (installation-tpu)=
-# Installation with TPU
+# Installation for TPUs
 Tensor Processing Units (TPUs) are Google's custom-developed application-specific
 integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs
......
 (installation-xpu)=
-# Installation with XPU
+# Installation for XPUs
 vLLM initially supports basic model inferencing and serving on Intel GPU platform.
......
@@ -23,7 +23,7 @@ $ conda activate myenv
 $ pip install vllm
 ```
-Please refer to the {ref}`installation documentation <installation>` for more details on installing vLLM.
+Please refer to the [installation documentation](#installation-index) for more details on installing vLLM.
 (offline-batched-inference)=
......
-(debugging)=
+(troubleshooting)=
-# Debugging Tips
+# Troubleshooting
-This document outlines some debugging strategies you can consider. If you think you've discovered a bug, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
+This document outlines some troubleshooting strategies you can consider. If you think you've discovered a bug, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
 ```{note}
 Once you've debugged a problem, remember to turn off any debugging environment variables defined, or simply start a new shell to avoid being affected by lingering debugging settings. Otherwise, the system might be slow with debugging functionalities left activated.
@@ -47,6 +47,7 @@ You might also need to set `export NCCL_SOCKET_IFNAME=<your_network_interface>`
 If vLLM crashes and the error trace captures it somewhere around `self.graph.replay()` in `vllm/worker/model_runner.py`, it is a CUDA error inside CUDAGraph.
 To identify the particular CUDA operation that causes the error, you can add `--enforce-eager` to the command line, or `enforce_eager=True` to the {class}`~vllm.LLM` class to disable the CUDAGraph optimization and isolate the exact CUDA operation that causes the error.
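As a minimal sketch of the `enforce_eager=True` option described in the context lines above, the helpers below build the keyword arguments separately from the engine so that the snippet can be loaded on a machine without vLLM installed. The function names `debug_llm_kwargs` and `build_debug_llm` are hypothetical; only `vllm.LLM` and its `enforce_eager` argument come from the text.

```python
def debug_llm_kwargs(model: str) -> dict:
    """Keyword arguments for vllm.LLM with the CUDAGraph optimization disabled."""
    return {"model": model, "enforce_eager": True}


def build_debug_llm(model: str):
    """Construct an LLM in eager mode so a CUDA error surfaces at the failing op.

    vLLM is imported lazily; calling this function requires vLLM to be installed.
    """
    from vllm import LLM

    return LLM(**debug_llm_kwargs(model))


print(debug_llm_kwargs("<model>"))  # "<model>" is a placeholder, as in the docs
```

Calling `build_debug_llm` with a real model name is the Python-side counterpart of adding `--enforce-eager` on the command line.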
+(troubleshooting-incorrect-hardware-driver)=
 ## Incorrect hardware/driver
 If GPU/CPU communication cannot be established, you can use the following Python script and follow the instructions below to confirm whether the GPU/CPU communication is working correctly.
@@ -139,7 +140,7 @@ A multi-node environment is more complicated than a single-node one. If you see
 Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup, being sure to execute different commands (with different `--node-rank`) on different nodes.
 ```
-(debugging-python-multiprocessing)=
+(troubleshooting-python-multiprocessing)=
 ## Python multiprocessing
 ### `RuntimeError` Exception
@@ -150,7 +151,7 @@ If you have seen a warning in your logs like this:
 WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
 initialized. We must use the `spawn` multiprocessing start method. Setting
 VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See
-https://docs.vllm.ai/en/latest/getting_started/debugging.html#python-multiprocessing
+https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#python-multiprocessing
 for more information.
 ```
......
@@ -50,7 +50,7 @@ For more information, check out the following:
 - [vLLM announcing blog post](https://vllm.ai) (intro to PagedAttention)
 - [vLLM paper](https://arxiv.org/abs/2309.06180) (SOSP 2023)
 - [How continuous batching enables 23x throughput in LLM inference while reducing p50 latency](https://www.anyscale.com/blog/continuous-batching-llm-inference) by Cade Daniel et al.
-- {ref}`vLLM Meetups <meetups>`.
+- [vLLM Meetups](#meetups)
 ## Documentation
@@ -58,18 +58,11 @@ For more information, check out the following:
 :caption: Getting Started
 :maxdepth: 1
-getting_started/installation
-getting_started/amd-installation
-getting_started/openvino-installation
-getting_started/cpu-installation
-getting_started/gaudi-installation
-getting_started/arm-installation
-getting_started/neuron-installation
-getting_started/tpu-installation
-getting_started/xpu-installation
+getting_started/installation/index
 getting_started/quickstart
-getting_started/debugging
 getting_started/examples/examples_index
+getting_started/troubleshooting
+getting_started/faq
 ```
```{toctree} ```{toctree}
@@ -110,7 +103,6 @@ usage/structured_outputs
 usage/spec_decode
 usage/compatibility_matrix
 usage/performance
-usage/faq
 usage/engine_args
 usage/env_vars
 usage/usage_stats
......
@@ -120,7 +120,7 @@ outputs = llm.chat(conversation, chat_template=custom_template)
 ## Online Inference
-Our [OpenAI Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:
+Our [OpenAI-Compatible Server](#openai-compatible-server) provides endpoints that correspond to the offline APIs:
 - [Completions API](#completions-api) is similar to `LLM.generate` but only accepts text.
 - [Chat API](#chat-api) is similar to `LLM.chat`, accepting both text and [multi-modal inputs](#multimodal-inputs) for models with a chat template.
@@ -106,7 +106,7 @@ A code example can be found here: <gh-file:examples/offline_inference_scoring.py
 ## Online Inference
-Our [OpenAI Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:
+Our [OpenAI-Compatible Server](#openai-compatible-server) provides endpoints that correspond to the offline APIs:
 - [Pooling API](#pooling-api) is similar to `LLM.encode`, being applicable to all types of pooling models.
 - [Embeddings API](#embeddings-api) is similar to `LLM.embed`, accepting both text and [multi-modal inputs](#multimodal-inputs) for embedding models.
......
@@ -95,7 +95,7 @@ $ --tensor-parallel-size 16
 To make tensor parallel performant, you should make sure the communication between nodes is efficient, e.g. using high-speed network cards like Infiniband. To correctly set up the cluster to use Infiniband, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the `run_cluster.sh` script. Please contact your system administrator for more information on how to set up the flags. One way to confirm if the Infiniband is working is to run vLLM with `NCCL_DEBUG=TRACE` environment variable set, e.g. `NCCL_DEBUG=TRACE vllm serve ...` and check the logs for the NCCL version and the network used. If you find `[send] via NET/Socket` in the logs, it means NCCL uses raw TCP Socket, which is not efficient for cross-node tensor parallel. If you find `[send] via NET/IB/GDRDMA` in the logs, it means NCCL uses Infiniband with GPU-Direct RDMA, which is efficient.
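The log check described in the context line above can be automated. The sketch below scans captured log text for the two markers the paragraph names, `[send] via NET/Socket` and `[send] via NET/IB/GDRDMA`. The function names are hypothetical, and the sample strings are simplified placeholders built from those markers, not real NCCL output.

```python
import re

# Captures the transport named after "[send] via NET/" in a log line.
SEND_VIA = re.compile(r"\[send\] via NET/(\S+)")


def nccl_send_transports(log_text: str) -> set[str]:
    """Collect every transport that appears in a `[send] via NET/...` entry."""
    return set(SEND_VIA.findall(log_text))


def classify(log_text: str) -> str:
    """Summarize which network NCCL appears to be using, based on the log."""
    transports = nccl_send_transports(log_text)
    if not transports:
        return "no [send] entries found"
    if any(t.startswith("Socket") for t in transports):
        return "TCP sockets in use (inefficient for cross-node tensor parallel)"
    if all(t.startswith("IB") for t in transports):
        return "InfiniBand in use"
    return "other transport: " + ", ".join(sorted(transports))


# Placeholder lines containing only the markers quoted in the documentation.
print(classify("placeholder [send] via NET/IB/GDRDMA"))
print(classify("placeholder [send] via NET/Socket"))
```

One way to use it is to redirect the output of a run started with `NCCL_DEBUG=TRACE` to a file and pass the file contents to `classify`.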
 ```{warning}
-After you start the Ray cluster, you'd better also check the GPU-GPU communication between nodes. It can be non-trivial to set up. Please refer to the [sanity check script](../getting_started/debugging.md) for more information. If you need to set some environment variables for the communication configuration, you can append them to the `run_cluster.sh` script, e.g. `-e NCCL_SOCKET_IFNAME=eth0`. Note that setting environment variables in the shell (e.g. `NCCL_SOCKET_IFNAME=eth0 vllm serve ...`) only works for the processes in the same node, not for the processes in the other nodes. Setting environment variables when you create the cluster is the recommended way. See <gh-issue:6803> for more information.
+After you start the Ray cluster, you'd better also check the GPU-GPU communication between nodes. It can be non-trivial to set up. Please refer to the [sanity check script](#troubleshooting-incorrect-hardware-driver) for more information. If you need to set some environment variables for the communication configuration, you can append them to the `run_cluster.sh` script, e.g. `-e NCCL_SOCKET_IFNAME=eth0`. Note that setting environment variables in the shell (e.g. `NCCL_SOCKET_IFNAME=eth0 vllm serve ...`) only works for the processes in the same node, not for the processes in the other nodes. Setting environment variables when you create the cluster is the recommended way. See <gh-issue:6803> for more information.
 ```
 ```{warning}
......
@@ -182,7 +182,7 @@ speculative decoding, breaking down the guarantees into three key areas:
 3. **vLLM Logprob Stability**
 \- vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the
 same request across runs. For more details, see the FAQ section
-titled *Can the output of a prompt vary across runs in vLLM?* in the {ref}`FAQs <faq>`.
+titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](#faq).
 **Conclusion**
@@ -195,7 +195,7 @@ can occur due to following factors:
 **Mitigation Strategies**
-For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the {ref}`FAQs <faq>`.
+For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](#faq).
 ## Resources for vLLM contributors
......