Unverified Commit 6ad909fd authored by Cyrus Leung, committed by GitHub

[Doc] Improve GitHub links (#11491)


Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
parent b689ada9
...@@ -74,6 +74,35 @@ html_theme_options = { ...@@ -74,6 +74,35 @@ html_theme_options = {
html_static_path = ["_static"] html_static_path = ["_static"]
html_js_files = ["custom.js"] html_js_files = ["custom.js"]
myst_url_schemes = {
'http': None,
'https': None,
'mailto': None,
'ftp': None,
"gh-issue": {
"url":
"https://github.com/vllm-project/vllm/issues/{{path}}#{{fragment}}",
"title": "Issue #{{path}}",
"classes": ["github"],
},
"gh-pr": {
"url":
"https://github.com/vllm-project/vllm/pull/{{path}}#{{fragment}}",
"title": "Pull Request #{{path}}",
"classes": ["github"],
},
"gh-dir": {
"url": "https://github.com/vllm-project/vllm/tree/main/{{path}}",
"title": "{{path}}",
"classes": ["github"],
},
"gh-file": {
"url": "https://github.com/vllm-project/vllm/blob/main/{{path}}",
"title": "{{path}}",
"classes": ["github"],
},
}
# see https://docs.readthedocs.io/en/stable/reference/environment-variables.html # noqa # see https://docs.readthedocs.io/en/stable/reference/environment-variables.html # noqa
READTHEDOCS_VERSION_TYPE = os.environ.get('READTHEDOCS_VERSION_TYPE') READTHEDOCS_VERSION_TYPE = os.environ.get('READTHEDOCS_VERSION_TYPE')
if READTHEDOCS_VERSION_TYPE == "tag": if READTHEDOCS_VERSION_TYPE == "tag":
......
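As a rough illustration of what the `myst_url_schemes` entries added above are meant to achieve, the sketch below expands a short link such as `gh-file:Dockerfile` into a full GitHub URL by plain string substitution. The templates are copied from the diff; the `GITHUB_SCHEMES` name and the `expand_link` helper are hypothetical and are not how MyST implements the feature internally.

```python
# Templates copied from the myst_url_schemes entries in the diff above.
GITHUB_SCHEMES = {
    "gh-issue": "https://github.com/vllm-project/vllm/issues/{{path}}#{{fragment}}",
    "gh-pr": "https://github.com/vllm-project/vllm/pull/{{path}}#{{fragment}}",
    "gh-dir": "https://github.com/vllm-project/vllm/tree/main/{{path}}",
    "gh-file": "https://github.com/vllm-project/vllm/blob/main/{{path}}",
}


def expand_link(link: str) -> str:
    """Hypothetical helper: expand 'scheme:path#fragment' into a full URL."""
    # Split off the scheme, then separate the optional fragment from the path.
    scheme, _, rest = link.partition(":")
    path, _, fragment = rest.partition("#")
    template = GITHUB_SCHEMES[scheme]
    # Fill both placeholders; a template without {{fragment}} ignores it.
    return template.replace("{{path}}", path).replace("{{fragment}}", fragment)


print(expand_link("gh-file:Dockerfile"))
print(expand_link("gh-issue:5723#issuecomment-2554389656"))
```

Note that this sketch simply drops the fragment when the template has no `{{fragment}}` placeholder (as with `gh-file` and `gh-dir`); how MyST itself treats that case is not covered here.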
# Dockerfile # Dockerfile
See [here](https://github.com/vllm-project/vllm/blob/main/Dockerfile) for the main Dockerfile to construct We provide a <gh-file:Dockerfile> to construct the image for running an OpenAI compatible server with vLLM.
the image for running an OpenAI compatible server with vLLM. More information about deploying with Docker can be found [here](https://docs.vllm.ai/en/stable/serving/deploying_with_docker.html). More information about deploying with Docker can be found [here](../../serving/deploying_with_docker.md).
Below is a visual representation of the multi-stage Dockerfile. The build graph contains the following nodes: Below is a visual representation of the multi-stage Dockerfile. The build graph contains the following nodes:
......
...@@ -13,11 +13,12 @@ Finally, one of the most impactful ways to support us is by raising awareness ab ...@@ -13,11 +13,12 @@ Finally, one of the most impactful ways to support us is by raising awareness ab
## License ## License
See [LICENSE](https://github.com/vllm-project/vllm/tree/main/LICENSE). See <gh-file:LICENSE>.
## Developing ## Developing
Depending on the kind of development you'd like to do (e.g. Python, CUDA), you can choose to build vLLM with or without compilation. Check out the [building from source](https://docs.vllm.ai/en/latest/getting_started/installation.html#build-from-source) documentation for details. Depending on the kind of development you'd like to do (e.g. Python, CUDA), you can choose to build vLLM with or without compilation.
Check out the [building from source](#build-from-source) documentation for details.
## Testing ## Testing
...@@ -43,7 +44,7 @@ Currently, the repository does not pass the `mypy` tests. ...@@ -43,7 +44,7 @@ Currently, the repository does not pass the `mypy` tests.
If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible. If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
```{important} ```{important}
If you discover a security vulnerability, please follow the instructions [here](https://github.com/vllm-project/vllm/tree/main/SECURITY.md#reporting-a-vulnerability). If you discover a security vulnerability, please follow the instructions [here](gh-file:SECURITY.md#reporting-a-vulnerability).
``` ```
## Pull Requests & Code Reviews ## Pull Requests & Code Reviews
...@@ -54,9 +55,9 @@ code quality and improve the efficiency of the review process. ...@@ -54,9 +55,9 @@ code quality and improve the efficiency of the review process.
### DCO and Signed-off-by ### DCO and Signed-off-by
When contributing changes to this project, you must agree to the [DCO](https://github.com/vllm-project/vllm/tree/main/DCO). When contributing changes to this project, you must agree to the <gh-file:DCO>.
Commits must include a `Signed-off-by:` header which certifies agreement with Commits must include a `Signed-off-by:` header which certifies agreement with
the terms of the [DCO](https://github.com/vllm-project/vllm/tree/main/DCO). the terms of the DCO.
Using `-s` with `git commit` will automatically add this header. Using `-s` with `git commit` will automatically add this header.
...@@ -89,8 +90,7 @@ If the PR spans more than one category, please include all relevant prefixes. ...@@ -89,8 +90,7 @@ If the PR spans more than one category, please include all relevant prefixes.
The PR needs to meet the following code quality standards: The PR needs to meet the following code quality standards:
- We adhere to [Google Python style guide](https://google.github.io/styleguide/pyguide.html) and [Google C++ style guide](https://google.github.io/styleguide/cppguide.html). - We adhere to [Google Python style guide](https://google.github.io/styleguide/pyguide.html) and [Google C++ style guide](https://google.github.io/styleguide/cppguide.html).
- Pass all linter checks. Please use [format.sh](https://github.com/vllm-project/vllm/blob/main/format.sh) to format your - Pass all linter checks. Please use <gh-file:format.sh> to format your code.
code.
- The code needs to be well-documented to ensure future contributors can easily - The code needs to be well-documented to ensure future contributors can easily
understand the code. understand the code.
- Include sufficient tests to ensure the project stays correct and robust. This - Include sufficient tests to ensure the project stays correct and robust. This
......
...@@ -22,13 +22,13 @@ Set the env variable VLLM_RPC_TIMEOUT to a big number before you start the serve ...@@ -22,13 +22,13 @@ Set the env variable VLLM_RPC_TIMEOUT to a big number before you start the serve
`export VLLM_RPC_TIMEOUT=1800000` `export VLLM_RPC_TIMEOUT=1800000`
``` ```
## Example commands and usage: ## Example commands and usage
### Offline Inference: ### Offline Inference
Refer to [examples/offline_inference_with_profiler.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_with_profiler.py) for an example. Refer to <gh-file:examples/offline_inference_with_profiler.py> for an example.
### OpenAI Server: ### OpenAI Server
```bash ```bash
VLLM_TORCH_PROFILER_DIR=./vllm_profile python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B VLLM_TORCH_PROFILER_DIR=./vllm_profile python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B
......
...@@ -55,7 +55,7 @@ for output in outputs: ...@@ -55,7 +55,7 @@ for output in outputs:
More API details can be found in the {doc}`Offline Inference More API details can be found in the {doc}`Offline Inference
</dev/offline_inference/offline_index>` section of the API docs. </dev/offline_inference/offline_index>` section of the API docs.
The code for the `LLM` class can be found in [vllm/entrypoints/llm.py](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/llm.py). The code for the `LLM` class can be found in <gh-file:vllm/entrypoints/llm.py>.
### OpenAI-compatible API server ### OpenAI-compatible API server
...@@ -66,7 +66,7 @@ This server can be started using the `vllm serve` command. ...@@ -66,7 +66,7 @@ This server can be started using the `vllm serve` command.
vllm serve <model> vllm serve <model>
``` ```
The code for the `vllm` CLI can be found in [vllm/scripts.py](https://github.com/vllm-project/vllm/blob/main/vllm/scripts.py). The code for the `vllm` CLI can be found in <gh-file:vllm/scripts.py>.
Sometimes you may see the API server entrypoint used directly instead of via the Sometimes you may see the API server entrypoint used directly instead of via the
`vllm` CLI command. For example: `vllm` CLI command. For example:
...@@ -75,7 +75,7 @@ Sometimes you may see the API server entrypoint used directly instead of via the ...@@ -75,7 +75,7 @@ Sometimes you may see the API server entrypoint used directly instead of via the
python -m vllm.entrypoints.openai.api_server --model <model> python -m vllm.entrypoints.openai.api_server --model <model>
``` ```
That code can be found in [vllm/entrypoints/openai/api_server.py](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/api_server.py). That code can be found in <gh-file:vllm/entrypoints/openai/api_server.py>.
More details on the API server can be found in the {doc}`OpenAI Compatible More details on the API server can be found in the {doc}`OpenAI Compatible
Server </serving/openai_compatible_server>` document. Server </serving/openai_compatible_server>` document.
...@@ -105,7 +105,7 @@ processing. ...@@ -105,7 +105,7 @@ processing.
- **Output Processing**: Processes the outputs generated by the model, decoding the - **Output Processing**: Processes the outputs generated by the model, decoding the
token IDs from a language model into human-readable text. token IDs from a language model into human-readable text.
The code for `LLMEngine` can be found in [vllm/engine/llm_engine.py]. The code for `LLMEngine` can be found in <gh-file:vllm/engine/llm_engine.py>.
### AsyncLLMEngine ### AsyncLLMEngine
...@@ -115,10 +115,9 @@ incoming requests. The `AsyncLLMEngine` is designed for online serving, where it ...@@ -115,10 +115,9 @@ incoming requests. The `AsyncLLMEngine` is designed for online serving, where it
can handle multiple concurrent requests and stream outputs to clients. can handle multiple concurrent requests and stream outputs to clients.
The OpenAI-compatible API server uses the `AsyncLLMEngine`. There is also a demo The OpenAI-compatible API server uses the `AsyncLLMEngine`. There is also a demo
API server that serves as a simpler example in API server that serves as a simpler example in <gh-file:vllm/entrypoints/api_server.py>.
[vllm/entrypoints/api_server.py].
The code for `AsyncLLMEngine` can be found in [vllm/engine/async_llm_engine.py]. The code for `AsyncLLMEngine` can be found in <gh-file:vllm/engine/async_llm_engine.py>.
## Worker ## Worker
...@@ -252,7 +251,3 @@ big problem. ...@@ -252,7 +251,3 @@ big problem.
In summary, the complete config object `VllmConfig` can be treated as an In summary, the complete config object `VllmConfig` can be treated as an
engine-level global state that is shared among all vLLM classes. engine-level global state that is shared among all vLLM classes.
[vllm/engine/async_llm_engine.py]: https://github.com/vllm-project/vllm/tree/main/vllm/engine/async_llm_engine.py
[vllm/engine/llm_engine.py]: https://github.com/vllm-project/vllm/tree/main/vllm/engine/llm_engine.py
[vllm/entrypoints/api_server.py]: https://github.com/vllm-project/vllm/tree/main/vllm/entrypoints/api_server.py
...@@ -2,13 +2,14 @@ ...@@ -2,13 +2,14 @@
## Debugging ## Debugging
Please see the [Debugging Please see the [Debugging Tips](#debugging-python-multiprocessing)
Tips](https://docs.vllm.ai/en/latest/getting_started/debugging.html#python-multiprocessing)
page for information on known issues and how to solve them. page for information on known issues and how to solve them.
## Introduction ## Introduction
*Note that source code references are to the state of the code at the time of writing in December, 2024.* ```{important}
The source code references are to the state of the code at the time of writing in December, 2024.
```
The use of Python multiprocessing in vLLM is complicated by: The use of Python multiprocessing in vLLM is complicated by:
...@@ -20,7 +21,7 @@ This document describes how vLLM deals with these challenges. ...@@ -20,7 +21,7 @@ This document describes how vLLM deals with these challenges.
## Multiprocessing Methods ## Multiprocessing Methods
[Python multiprocessing methods](https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods) include: [Python multiprocessing methods](https://docs.python.org/3/library/multiprocessing.html.md#contexts-and-start-methods) include:
- `spawn` - spawn a new Python process. This will be the default as of Python - `spawn` - spawn a new Python process. This will be the default as of Python
3.14. 3.14.
...@@ -82,7 +83,7 @@ There are other miscellaneous places hard-coding the use of `spawn`: ...@@ -82,7 +83,7 @@ There are other miscellaneous places hard-coding the use of `spawn`:
Related PRs: Related PRs:
- <https://github.com/vllm-project/vllm/pull/8823> - <gh-pr:8823>
## Prior State in v1 ## Prior State in v1
...@@ -96,7 +97,7 @@ engine core. ...@@ -96,7 +97,7 @@ engine core.
- <https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/v1/engine/llm_engine.py#L93-L95> - <https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/v1/engine/llm_engine.py#L93-L95>
- <https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/v1/engine/llm_engine.py#L70-L77> - <https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/v1/engine/llm_engine.py#L70-L77>
- https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/v1/engine/core_client.py#L44-L45 - <https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/v1/engine/core_client.py#L44-L45>
It was off by default for all the reasons mentioned above - compatibility with It was off by default for all the reasons mentioned above - compatibility with
dependencies and code using vLLM as a library. dependencies and code using vLLM as a library.
...@@ -119,17 +120,17 @@ instruct users to either add a `__main__` guard or to disable multiprocessing. ...@@ -119,17 +120,17 @@ instruct users to either add a `__main__` guard or to disable multiprocessing.
If that known-failure case occurs, the user will see two messages that explain If that known-failure case occurs, the user will see two messages that explain
what is happening. First, a log message from vLLM: what is happening. First, a log message from vLLM:
``` ```console
WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
initialized. We must use the `spawn` multiprocessing start method. Setting initialized. We must use the `spawn` multiprocessing start method. Setting
VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See
https://docs.vllm.ai/en/latest/getting_started/debugging.html#python-multiprocessing https://docs.vllm.ai/en/latest/getting_started/debugging.html#python-multiprocessing
for more information. for more information.
``` ```
Second, Python itself will raise an exception with a nice explanation: Second, Python itself will raise an exception with a nice explanation:
``` ```console
RuntimeError: RuntimeError:
An attempt has been made to start a new process before the An attempt has been made to start a new process before the
current process has finished its bootstrapping phase. current process has finished its bootstrapping phase.
......
...@@ -36,11 +36,10 @@ def generate_examples(): ...@@ -36,11 +36,10 @@ def generate_examples():
# Generate the example docs for each example script # Generate the example docs for each example script
for script_path, doc_path in zip(script_paths, doc_paths): for script_path, doc_path in zip(script_paths, doc_paths):
script_url = f"https://github.com/vllm-project/vllm/blob/main/examples/{script_path.name}"
# Make script_path relative to doc_path and call it include_path # Make script_path relative to doc_path and call it include_path
include_path = '../../../..' / script_path.relative_to(root_dir) include_path = '../../../..' / script_path.relative_to(root_dir)
content = (f"{generate_title(doc_path.stem)}\n\n" content = (f"{generate_title(doc_path.stem)}\n\n"
f"Source: <{script_url}>.\n\n" f"Source: <gh-file:examples/{script_path.name}>.\n\n"
f"```{{literalinclude}} {include_path}\n" f"```{{literalinclude}} {include_path}\n"
":language: python\n" ":language: python\n"
":linenos:\n```") ":linenos:\n```")
......
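To make the effect of the `generate_examples` change above concrete, here is a minimal sketch of the page text produced for one script after the change. The `render_example_doc` function, its `title` parameter (standing in for the output of `generate_title`, which is not shown in the diff), and the `/repo` root are assumptions made for illustration only.

```python
from pathlib import PurePosixPath

FENCE = "`" * 3  # built at runtime so this example nests cleanly in Markdown


def render_example_doc(title: str, script_path: PurePosixPath,
                       root_dir: PurePosixPath) -> str:
    """Hypothetical stand-in for the loop body of generate_examples()."""
    # Make script_path relative to the doc location, as in the diff.
    include_path = '../../../..' / script_path.relative_to(root_dir)
    return (f"{title}\n\n"
            f"Source: <gh-file:examples/{script_path.name}>.\n\n"
            f"{FENCE}{{literalinclude}} {include_path}\n"
            ":language: python\n"
            f":linenos:\n{FENCE}")


root = PurePosixPath("/repo")  # assumed repository root
script = root / "examples" / "offline_inference.py"
print(render_example_doc("Offline Inference", script, root))
```

The `Source:` line now relies on the `gh-file` scheme instead of a hard-coded `script_url`, so the repository URL lives only in `conf.py`.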
...@@ -22,7 +22,7 @@ Installation options: ...@@ -22,7 +22,7 @@ Installation options:
You can build and install vLLM from source. You can build and install vLLM from source.
First, build a docker image from [Dockerfile.rocm](https://github.com/vllm-project/vllm/blob/main/Dockerfile.rocm) and launch a docker container from the image. First, build a docker image from <gh-file:Dockerfile.rocm> and launch a docker container from the image.
It is important that the user kicks off the docker build using buildkit. Either the user put DOCKER_BUILDKIT=1 as environment variable when calling docker build command, or the user needs to setup buildkit in the docker daemon configuration /etc/docker/daemon.json as follows and restart the daemon: It is important that the user kicks off the docker build using buildkit. Either the user put DOCKER_BUILDKIT=1 as environment variable when calling docker build command, or the user needs to setup buildkit in the docker daemon configuration /etc/docker/daemon.json as follows and restart the daemon:
```console ```console
...@@ -33,7 +33,7 @@ It is important that the user kicks off the docker build using buildkit. Either ...@@ -33,7 +33,7 @@ It is important that the user kicks off the docker build using buildkit. Either
} }
``` ```
[Dockerfile.rocm](https://github.com/vllm-project/vllm/blob/main/Dockerfile.rocm) uses ROCm 6.2 by default, but also supports ROCm 5.7, 6.0 and 6.1 in older vLLM branches. <gh-file:Dockerfile.rocm> uses ROCm 6.2 by default, but also supports ROCm 5.7, 6.0 and 6.1 in older vLLM branches.
It provides flexibility to customize the build of docker image using the following arguments: It provides flexibility to customize the build of docker image using the following arguments:
- `BASE_IMAGE`: specifies the base image used when running `docker build`, specifically the PyTorch on ROCm base image. - `BASE_IMAGE`: specifies the base image used when running `docker build`, specifically the PyTorch on ROCm base image.
......
...@@ -145,10 +145,10 @@ $ python examples/offline_inference.py ...@@ -145,10 +145,10 @@ $ python examples/offline_inference.py
- On CPU based setup with NUMA enabled, the memory access performance may be largely impacted by the [topology](https://github.com/intel/intel-extension-for-pytorch/blob/main/docs/tutorials/performance_tuning/tuning_guide.md#non-uniform-memory-access-numa). For NUMA architecture, two optimizations are to recommended: Tensor Parallel or Data Parallel. - On CPU based setup with NUMA enabled, the memory access performance may be largely impacted by the [topology](https://github.com/intel/intel-extension-for-pytorch/blob/main/docs/tutorials/performance_tuning/tuning_guide.md#non-uniform-memory-access-numa). For NUMA architecture, two optimizations are to recommended: Tensor Parallel or Data Parallel.
- Using Tensor Parallel for a latency constraints deployment: following GPU backend design, a Megatron-LM's parallel algorithm will be used to shard the model, based on the number of NUMA nodes (e.g. TP = 2 for a two NUMA node system). With [TP feature on CPU](https://github.com/vllm-project/vllm/pull/6125) merged, Tensor Parallel is supported for serving and offline inferencing. In general each NUMA node is treated as one GPU card. Below is the example script to enable Tensor Parallel = 2 for serving: - Using Tensor Parallel for a latency constraints deployment: following GPU backend design, a Megatron-LM's parallel algorithm will be used to shard the model, based on the number of NUMA nodes (e.g. TP = 2 for a two NUMA node system). With [TP feature on CPU](gh-pr:6125) merged, Tensor Parallel is supported for serving and offline inferencing. In general each NUMA node is treated as one GPU card. Below is the example script to enable Tensor Parallel = 2 for serving:
```console ```console
$ VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" vllm serve meta-llama/Llama-2-7b-chat-hf -tp=2 --distributed-executor-backend mp $ VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" vllm serve meta-llama/Llama-2-7b-chat-hf -tp=2 --distributed-executor-backend mp
``` ```
- Using Data Parallel for maximum throughput: to launch an LLM serving endpoint on each NUMA node along with one additional load balancer to dispatch the requests to those endpoints. Common solutions like [Nginx](../serving/deploying_with_nginx) or HAProxy are recommended. Anyscale Ray project provides the feature on LLM [serving](https://docs.ray.io/en/latest/serve/index.html). Here is the example to setup a scalable LLM serving with [Ray Serve](https://github.com/intel/llm-on-ray/blob/main/docs/setup.md). - Using Data Parallel for maximum throughput: to launch an LLM serving endpoint on each NUMA node along with one additional load balancer to dispatch the requests to those endpoints. Common solutions like [Nginx](../serving/deploying_with_nginx.md) or HAProxy are recommended. Anyscale Ray project provides the feature on LLM [serving](https://docs.ray.io/en/latest/serve/index.html). Here is the example to setup a scalable LLM serving with [Ray Serve](https://github.com/intel/llm-on-ray/blob/main/docs/setup.md).
...@@ -24,7 +24,7 @@ To isolate the model downloading and loading issue, you can use the `--load-form ...@@ -24,7 +24,7 @@ To isolate the model downloading and loading issue, you can use the `--load-form
## Model is too large ## Model is too large
If the model is too large to fit in a single GPU, you might want to [consider tensor parallelism](https://docs.vllm.ai/en/latest/serving/distributed_serving.html#distributed-inference-and-serving) to split the model across multiple GPUs. In that case, every process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism). You can convert the model checkpoint to a sharded checkpoint using [this example](https://docs.vllm.ai/en/latest/getting_started/examples/save_sharded_state.html) . The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism. If the model is too large to fit in a single GPU, you might want to [consider tensor parallelism](#distributed-serving) to split the model across multiple GPUs. In that case, every process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism). You can convert the model checkpoint to a sharded checkpoint using <gh-file:examples/save_sharded_state.py>. The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
## Enable more logging ## Enable more logging
...@@ -139,6 +139,7 @@ A multi-node environment is more complicated than a single-node one. If you see ...@@ -139,6 +139,7 @@ A multi-node environment is more complicated than a single-node one. If you see
Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup, being sure to execute different commands (with different `--node-rank`) on different nodes. Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup, being sure to execute different commands (with different `--node-rank`) on different nodes.
``` ```
(debugging-python-multiprocessing)=
## Python multiprocessing ## Python multiprocessing
### `RuntimeError` Exception ### `RuntimeError` Exception
...@@ -195,5 +196,5 @@ if __name__ == '__main__': ...@@ -195,5 +196,5 @@ if __name__ == '__main__':
## Known Issues ## Known Issues
- In `v0.5.2`, `v0.5.3`, and `v0.5.3.post1`, there is a bug caused by [zmq](https://github.com/zeromq/pyzmq/issues/2000) , which can occasionally cause vLLM to hang depending on the machine configuration. The solution is to upgrade to the latest version of `vllm` to include the [fix](https://github.com/vllm-project/vllm/pull/6759). - In `v0.5.2`, `v0.5.3`, and `v0.5.3.post1`, there is a bug caused by [zmq](https://github.com/zeromq/pyzmq/issues/2000) , which can occasionally cause vLLM to hang depending on the machine configuration. The solution is to upgrade to the latest version of `vllm` to include the [fix](gh-pr:6759).
- To circumvent a NCCL [bug](https://github.com/NVIDIA/nccl/issues/1234) , all vLLM processes will set an environment variable ``NCCL_CUMEM_ENABLE=0`` to disable NCCL's ``cuMem`` allocator. It does not affect performance but only gives memory benefits. When external processes want to set up a NCCL connection with vLLM's processes, they should also set this environment variable, otherwise, inconsistent environment setup will cause NCCL to hang or crash, as observed in the [RLHF integration](https://github.com/OpenRLHF/OpenRLHF/pull/604) and the [discussion](https://github.com/vllm-project/vllm/issues/5723#issuecomment-2554389656) . - To circumvent a NCCL [bug](https://github.com/NVIDIA/nccl/issues/1234) , all vLLM processes will set an environment variable ``NCCL_CUMEM_ENABLE=0`` to disable NCCL's ``cuMem`` allocator. It does not affect performance but only gives memory benefits. When external processes want to set up a NCCL connection with vLLM's processes, they should also set this environment variable, otherwise, inconsistent environment setup will cause NCCL to hang or crash, as observed in the [RLHF integration](https://github.com/OpenRLHF/OpenRLHF/pull/604) and the [discussion](gh-issue:5723#issuecomment-2554389656) .
...@@ -80,10 +80,8 @@ $ python setup.py develop ...@@ -80,10 +80,8 @@ $ python setup.py develop
## Supported Features ## Supported Features
- [Offline batched - [Offline batched inference](#offline-batched-inference)
inference](https://docs.vllm.ai/en/latest/getting_started/quickstart.html#offline-batched-inference) - Online inference via [OpenAI-Compatible Server](#openai-compatible-server)
- Online inference via [OpenAI-Compatible
Server](https://docs.vllm.ai/en/latest/getting_started/quickstart.html#openai-compatible-server)
- HPU autodetection - no need to manually select device within vLLM - HPU autodetection - no need to manually select device within vLLM
- Paged KV cache with algorithms enabled for Intel Gaudi accelerators - Paged KV cache with algorithms enabled for Intel Gaudi accelerators
- Custom Intel Gaudi implementations of Paged Attention, KV cache ops, - Custom Intel Gaudi implementations of Paged Attention, KV cache ops,
......
...@@ -24,7 +24,7 @@ $ pip install vllm ...@@ -24,7 +24,7 @@ $ pip install vllm
``` ```
```{note} ```{note}
Although we recommend using `conda` to create and manage Python environments, it is highly recommended to use `pip` to install vLLM. This is because `pip` can install `torch` with separate library packages like `NCCL`, while `conda` installs `torch` with statically linked `NCCL`. This can cause issues when vLLM tries to use `NCCL`. See [this issue](https://github.com/vllm-project/vllm/issues/8420) for more details. Although we recommend using `conda` to create and manage Python environments, it is highly recommended to use `pip` to install vLLM. This is because `pip` can install `torch` with separate library packages like `NCCL`, while `conda` installs `torch` with statically linked `NCCL`. This can cause issues when vLLM tries to use `NCCL`. See <gh-issue:8420> for more details.
``` ```
````{note} ````{note}
......
...@@ -29,7 +29,7 @@ Please refer to the {ref}`installation documentation <installation>` for more de ...@@ -29,7 +29,7 @@ Please refer to the {ref}`installation documentation <installation>` for more de
## Offline Batched Inference ## Offline Batched Inference
With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing). The example script for this section can be found [here](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference.py). With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing). See the example script: <gh-file:examples/offline_inference.py>
The first line of this example imports the classes {class}`~vllm.LLM` and {class}`~vllm.SamplingParams`: The first line of this example imports the classes {class}`~vllm.LLM` and {class}`~vllm.SamplingParams`:
...@@ -87,7 +87,8 @@ $ vllm serve Qwen/Qwen2.5-1.5B-Instruct ...@@ -87,7 +87,8 @@ $ vllm serve Qwen/Qwen2.5-1.5B-Instruct
``` ```
```{note} ```{note}
By default, the server uses a predefined chat template stored in the tokenizer. You can learn about overriding it [here](https://github.com/vllm-project/vllm/blob/main/docs/source/serving/openai_compatible_server.md#chat-template). By default, the server uses a predefined chat template stored in the tokenizer.
You can learn about overriding it [here](#chat-template).
``` ```
This server can be queried in the same format as OpenAI API. For example, to list the models: This server can be queried in the same format as OpenAI API. For example, to list the models:
...@@ -130,7 +131,7 @@ completion = client.completions.create(model="Qwen/Qwen2.5-1.5B-Instruct", ...@@ -130,7 +131,7 @@ completion = client.completions.create(model="Qwen/Qwen2.5-1.5B-Instruct",
print("Completion result:", completion) print("Completion result:", completion)
``` ```
A more detailed client example can be found [here](https://github.com/vllm-project/vllm/blob/main/examples/openai_completion_client.py). A more detailed client example can be found here: <gh-file:examples/openai_completion_client.py>
### OpenAI Chat Completions API with vLLM ### OpenAI Chat Completions API with vLLM
......
...@@ -154,8 +154,7 @@ For more information about using TPUs with GKE, see ...@@ -154,8 +154,7 @@ For more information about using TPUs with GKE, see
## Build a docker image with {code}`Dockerfile.tpu` ## Build a docker image with {code}`Dockerfile.tpu`
You can use [Dockerfile.tpu](https://github.com/vllm-project/vllm/blob/main/Dockerfile.tpu) You can use <gh-file:Dockerfile.tpu> to build a Docker image with TPU support.
to build a Docker image with TPU support.
```console ```console
$ docker build -f Dockerfile.tpu -t vllm-tpu . $ docker build -f Dockerfile.tpu -t vllm-tpu .
......
...@@ -71,4 +71,4 @@ $ --pipeline-parallel-size=2 \ ...@@ -71,4 +71,4 @@ $ --pipeline-parallel-size=2 \
$ -tp=8 $ -tp=8
``` ```
By default, a ray instance will be launched automatically if no existing one is detected in system, with `num-gpus` equals to `parallel_config.world_size`. We recommend properly starting a ray cluster before execution, referring helper [script](https://github.com/vllm-project/vllm/tree/main/examples/run_cluster.sh). By default, a ray instance will be launched automatically if no existing one is detected in system, with `num-gpus` equals to `parallel_config.world_size`. We recommend properly starting a ray cluster before execution, referring to the <gh-file:examples/run_cluster.sh> helper script.
...@@ -31,8 +31,8 @@ If you don't want to fork the repository and modify vLLM's codebase, please refe ...@@ -31,8 +31,8 @@ If you don't want to fork the repository and modify vLLM's codebase, please refe
## 1. Bring your model code ## 1. Bring your model code
Clone the PyTorch model code from the HuggingFace Transformers repository and put it into the [vllm/model_executor/models](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models) directory. Clone the PyTorch model code from the HuggingFace Transformers repository and put it into the <gh-dir:vllm/model_executor/models> directory.
For instance, vLLM's [OPT model](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/opt.py) was adapted from the HuggingFace's [modeling_opt.py](https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py) file. For instance, vLLM's [OPT model](gh-file:vllm/model_executor/models/opt.py) was adapted from the HuggingFace's [modeling_opt.py](https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py) file.
```{warning} ```{warning}
When copying the model code, make sure to review and adhere to the code's copyright and licensing terms. When copying the model code, make sure to review and adhere to the code's copyright and licensing terms.
...@@ -99,7 +99,7 @@ Currently, vLLM supports the basic multi-head attention mechanism and its varian ...@@ -99,7 +99,7 @@ Currently, vLLM supports the basic multi-head attention mechanism and its varian
If your model employs a different attention mechanism, you will need to implement a new attention layer in vLLM. If your model employs a different attention mechanism, you will need to implement a new attention layer in vLLM.
``` ```
For reference, check out the [LLAMA model](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py). vLLM already supports a large number of models. It is recommended to find a model similar to yours and adapt it to your model's architecture. Check out the [vLLM models](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models) directory for more examples. For reference, check out our [Llama implementation](gh-file:vllm/model_executor/models/llama.py). vLLM already supports a large number of models. It is recommended to find a model similar to yours and adapt it to your model's architecture. Check out <gh-dir:vllm/model_executor/models> for more examples.
## 3. (Optional) Implement tensor parallelism and quantization support ## 3. (Optional) Implement tensor parallelism and quantization support
...@@ -123,7 +123,7 @@ This method should load the weights from the HuggingFace's checkpoint file and a ...@@ -123,7 +123,7 @@ This method should load the weights from the HuggingFace's checkpoint file and a
## 5. Register your model ## 5. Register your model
Finally, register your {code}`*ForCausalLM` class to the {code}`_VLLM_MODELS` in [vllm/model_executor/models/registry.py](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/registry.py). Finally, register your {code}`*ForCausalLM` class to the {code}`_VLLM_MODELS` in <gh-file:vllm/model_executor/models/registry.py>.
## 6. Out-of-Tree Model Integration ## 6. Out-of-Tree Model Integration
......
...@@ -78,8 +78,8 @@ and register it via {meth}`INPUT_REGISTRY.register_dummy_data <vllm.inputs.regis ...@@ -78,8 +78,8 @@ and register it via {meth}`INPUT_REGISTRY.register_dummy_data <vllm.inputs.regis
Here are some examples: Here are some examples:
- Image inputs (static feature size): [LLaVA-1.5 Model](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava.py) - Image inputs (static feature size): [LLaVA-1.5 Model](gh-file:vllm/model_executor/models/llava.py)
- Image inputs (dynamic feature size): [LLaVA-NeXT Model](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava_next.py) - Image inputs (dynamic feature size): [LLaVA-NeXT Model](gh-file:vllm/model_executor/models/llava_next.py)
```{seealso} ```{seealso}
[Input Processing Pipeline](#input-processing-pipeline) [Input Processing Pipeline](#input-processing-pipeline)
...@@ -107,8 +107,8 @@ The dummy data should have the maximum possible number of multi-modal tokens, as ...@@ -107,8 +107,8 @@ The dummy data should have the maximum possible number of multi-modal tokens, as
Here are some examples: Here are some examples:
- Image inputs (static feature size): [LLaVA-1.5 Model](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava.py) - Image inputs (static feature size): [LLaVA-1.5 Model](gh-file:vllm/model_executor/models/llava.py)
- Image inputs (dynamic feature size): [LLaVA-NeXT Model](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava_next.py) - Image inputs (dynamic feature size): [LLaVA-NeXT Model](gh-file:vllm/model_executor/models/llava_next.py)
```{seealso} ```{seealso}
[Input Processing Pipeline](#input-processing-pipeline) [Input Processing Pipeline](#input-processing-pipeline)
...@@ -135,8 +135,8 @@ You can register input processors via {meth}`INPUT_REGISTRY.register_input_proce ...@@ -135,8 +135,8 @@ You can register input processors via {meth}`INPUT_REGISTRY.register_input_proce
A common use case of input processors is inserting placeholder tokens to leverage the vLLM framework for attention mask generation. A common use case of input processors is inserting placeholder tokens to leverage the vLLM framework for attention mask generation.
Here are some examples: Here are some examples:
- Insert static number of image tokens: [LLaVA-1.5 Model](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava.py) - Insert static number of image tokens: [LLaVA-1.5 Model](gh-file:vllm/model_executor/models/llava.py)
- Insert dynamic number of image tokens: [LLaVA-NeXT Model](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava_next.py) - Insert dynamic number of image tokens: [LLaVA-NeXT Model](gh-file:vllm/model_executor/models/llava_next.py)
```{seealso} ```{seealso}
[Input Processing Pipeline](#input-processing-pipeline) [Input Processing Pipeline](#input-processing-pipeline)
......
...@@ -46,7 +46,7 @@ for output in outputs: ...@@ -46,7 +46,7 @@ for output in outputs:
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
``` ```
A code example can be found in [examples/offline_inference.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference.py). A code example can be found here: <gh-file:examples/offline_inference.py>
### `LLM.beam_search` ### `LLM.beam_search`
...@@ -103,7 +103,7 @@ for output in outputs: ...@@ -103,7 +103,7 @@ for output in outputs:
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
``` ```
A code example can be found in [examples/offline_inference_chat.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_chat.py). A code example can be found here: <gh-file:examples/offline_inference_chat.py>
If the model doesn't have a chat template or you want to specify another one, If the model doesn't have a chat template or you want to specify another one,
you can explicitly pass a chat template: you can explicitly pass a chat template:
...@@ -120,7 +120,7 @@ outputs = llm.chat(conversation, chat_template=custom_template) ...@@ -120,7 +120,7 @@ outputs = llm.chat(conversation, chat_template=custom_template)
## Online Inference ## Online Inference
Our [OpenAI Compatible Server](../serving/openai_compatible_server) provides endpoints that correspond to the offline APIs: Our [OpenAI Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:
- [Completions API](#completions-api) is similar to `LLM.generate` but only accepts text. - [Completions API](#completions-api) is similar to `LLM.generate` but only accepts text.
- [Chat API](#chat-api) is similar to `LLM.chat`, accepting both text and [multi-modal inputs](#multimodal-inputs) for models with a chat template. - [Chat API](#chat-api) is similar to `LLM.chat`, accepting both text and [multi-modal inputs](#multimodal-inputs) for models with a chat template.
...@@ -65,7 +65,7 @@ embeds = output.outputs.embedding ...@@ -65,7 +65,7 @@ embeds = output.outputs.embedding
print(f"Embeddings: {embeds!r} (size={len(embeds)})") print(f"Embeddings: {embeds!r} (size={len(embeds)})")
``` ```
A code example can be found in [examples/offline_inference_embedding.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_embedding.py). A code example can be found here: <gh-file:examples/offline_inference_embedding.py>
### `LLM.classify` ### `LLM.classify`
...@@ -80,7 +80,7 @@ probs = output.outputs.probs ...@@ -80,7 +80,7 @@ probs = output.outputs.probs
print(f"Class Probabilities: {probs!r} (size={len(probs)})") print(f"Class Probabilities: {probs!r} (size={len(probs)})")
``` ```
A code example can be found in [examples/offline_inference_classification.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_classification.py). A code example can be found here: <gh-file:examples/offline_inference_classification.py>
### `LLM.score` ### `LLM.score`
...@@ -102,7 +102,7 @@ score = output.outputs.score ...@@ -102,7 +102,7 @@ score = output.outputs.score
print(f"Score: {score}") print(f"Score: {score}")
``` ```
A code example can be found in [examples/offline_inference_scoring.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_scoring.py). A code example can be found here: <gh-file:examples/offline_inference_scoring.py>
## Online Inference ## Online Inference
......
...@@ -756,7 +756,7 @@ and pass {code}`--hf_overrides '{"architectures": ["MantisForConditionalGenerati ...@@ -756,7 +756,7 @@ and pass {code}`--hf_overrides '{"architectures": ["MantisForConditionalGenerati
```{note} ```{note}
The official {code}`openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork ({code}`HwwwH/MiniCPM-V-2`) for now. The official {code}`openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork ({code}`HwwwH/MiniCPM-V-2`) for now.
For more details, please see: <https://github.com/vllm-project/vllm/pull/4087#issuecomment-2250397630> For more details, please see: <gh-pr:4087#issuecomment-2250397630>
``` ```
### Pooling Models ### Pooling Models
...@@ -834,5 +834,5 @@ We have the following levels of testing for models: ...@@ -834,5 +834,5 @@ We have the following levels of testing for models:
1. **Strict Consistency**: We compare the output of the model with the output of the model in the HuggingFace Transformers library under greedy decoding. This is the most stringent test. Please refer to [models tests](https://github.com/vllm-project/vllm/blob/main/tests/models) for the models that have passed this test. 1. **Strict Consistency**: We compare the output of the model with the output of the model in the HuggingFace Transformers library under greedy decoding. This is the most stringent test. Please refer to [models tests](https://github.com/vllm-project/vllm/blob/main/tests/models) for the models that have passed this test.
2. **Output Sensibility**: We check if the output of the model is sensible and coherent, by measuring the perplexity of the output and checking for any obvious errors. This is a less stringent test. 2. **Output Sensibility**: We check if the output of the model is sensible and coherent, by measuring the perplexity of the output and checking for any obvious errors. This is a less stringent test.
3. **Runtime Functionality**: We check if the model can be loaded and run without errors. This is the least stringent test. Please refer to [functionality tests](https://github.com/vllm-project/vllm/tree/main/tests) and [examples](https://github.com/vllm-project/vllm/tree/main/examples) for the models that have passed this test. 3. **Runtime Functionality**: We check if the model can be loaded and run without errors. This is the least stringent test. Please refer to [functionality tests](gh-dir:tests) and [examples](gh-dir:examples) for the models that have passed this test.
4. **Community Feedback**: We rely on the community to provide feedback on the models. If a model is broken or not working as expected, we encourage users to raise issues to report it or open pull requests to fix it. The rest of the models fall under this category. 4. **Community Feedback**: We rely on the community to provide feedback on the models. If a model is broken or not working as expected, we encourage users to raise issues to report it or open pull requests to fix it. The rest of the models fall under this category.
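
The `gh-file:`, `gh-dir:`, `gh-pr:` and `gh-issue:` links used throughout this change resolve through the `myst_url_schemes` templates added to `conf.py`. The sketch below illustrates that mapping only; it is not how MyST implements the expansion. The function name `expand_github_link` and the `GITHUB_SCHEMES` dictionary are hypothetical, and the URL templates are copied from the `conf.py` hunk with `{{path}}` and `{{fragment}}` rewritten as Python format fields.

```python
# Illustrative sketch only: mimics the URL templates from conf.py,
# not MyST's actual implementation.
GITHUB_SCHEMES = {
    "gh-issue": "https://github.com/vllm-project/vllm/issues/{path}#{fragment}",
    "gh-pr": "https://github.com/vllm-project/vllm/pull/{path}#{fragment}",
    "gh-dir": "https://github.com/vllm-project/vllm/tree/main/{path}",
    "gh-file": "https://github.com/vllm-project/vllm/blob/main/{path}",
}


def expand_github_link(link: str) -> str:
    """Expand a link such as 'gh-file:Dockerfile' into a full GitHub URL."""
    # Split "scheme:path#fragment" into its three parts.
    scheme, _, rest = link.partition(":")
    path, _, fragment = rest.partition("#")
    # An unknown scheme raises KeyError, which is acceptable for a sketch.
    template = GITHUB_SCHEMES[scheme]
    # Templates without a {fragment} field simply ignore the extra argument.
    return template.format(path=path, fragment=fragment)


if __name__ == "__main__":
    print(expand_github_link("gh-file:examples/offline_inference.py"))
    print(expand_github_link("gh-pr:4087#issuecomment-2250397630"))
    print(expand_github_link("gh-dir:vllm/model_executor/models"))
```

Because the `gh-dir` and `gh-file` templates already end in `tree/main/` and `blob/main/`, the path should be relative to the repository root; a path such as `main/examples` would expand to `tree/main/main/examples`.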