Commit 6ad909fd, authored by Cyrus Leung, committed by GitHub

[Doc] Improve GitHub links (#11491)


Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
parent b689ada9
......@@ -15,7 +15,7 @@ The performance benchmarks are used for development to confirm whether new chang
The latest performance results are hosted on the public [vLLM Performance Dashboard](https://perf.vllm.ai).
More information on the performance benchmarks and their parameters can be found [here](https://github.com/vllm-project/vllm/blob/main/.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md).
More information on the performance benchmarks and their parameters can be found [here](gh-file:.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md).
(nightly-benchmarks)=
......@@ -25,4 +25,4 @@ These compare vLLM's performance against alternatives (`tgi`, `trt-llm`, and `lm
The latest nightly benchmark results are shared in major release blog posts such as [vLLM v0.6.0](https://blog.vllm.ai/2024/09/05/perf-update.html).
More information on the nightly benchmarks and their parameters can be found [here](https://github.com/vllm-project/vllm/blob/main/.buildkite/nightly-benchmarks/nightly-descriptions.md).
More information on the nightly benchmarks and their parameters can be found [here](gh-file:.buildkite/nightly-benchmarks/nightly-descriptions.md).
......@@ -129,4 +129,4 @@ The table below shows the compatibility of various quantization implementations
Please note that this compatibility chart may be subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
For the most up-to-date information on hardware support and quantization methods, please check the [quantization directory](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/layers/quantization) or consult with the vLLM development team.
For the most up-to-date information on hardware support and quantization methods, please refer to <gh-dir:vllm/model_executor/layers/quantization> or consult with the vLLM development team.
......@@ -25,7 +25,7 @@ memory to share data between processes under the hood, particularly for tensor p
## Building vLLM's Docker Image from Source
You can build and run vLLM from source via the provided [Dockerfile](https://github.com/vllm-project/vllm/blob/main/Dockerfile). To build vLLM:
You can build and run vLLM from source via the provided <gh-file:Dockerfile>. To build vLLM:
```console
$ # optionally specifies: --build-arg max_jobs=8 --build-arg nvcc_threads=2
......
......@@ -51,7 +51,7 @@ $ --pipeline-parallel-size 2
If a single node does not have enough GPUs to hold the model, you can run the model using multiple nodes. It is important to make sure the execution environment is the same on all nodes, including the model path, the Python environment. The recommended way is to use docker images to ensure the same environment, and hide the heterogeneity of the host machines via mapping them into the same docker configuration.
The first step, is to start containers and organize them into a cluster. We have provided a helper [script](https://github.com/vllm-project/vllm/tree/main/examples/run_cluster.sh) to start the cluster. Please note, this script launches docker without administrative privileges that would be required to access GPU performance counters when running profiling and tracing tools. For that purpose, the script can have `CAP_SYS_ADMIN` to the docker container by using the `--cap-add` option in the docker run command.
The first step, is to start containers and organize them into a cluster. We have provided the helper script <gh-file:examples/run_cluster.sh> to start the cluster. Please note, this script launches docker without administrative privileges that would be required to access GPU performance counters when running profiling and tracing tools. For that purpose, the script can have `CAP_SYS_ADMIN` to the docker container by using the `--cap-add` option in the docker run command.
Pick a node as the head node, and run the following command:
......@@ -95,7 +95,7 @@ $ --tensor-parallel-size 16
To make tensor parallel performant, you should make sure the communication between nodes is efficient, e.g. using high-speed network cards like Infiniband. To correctly set up the cluster to use Infiniband, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the `run_cluster.sh` script. Please contact your system administrator for more information on how to set up the flags. One way to confirm if the Infiniband is working is to run vLLM with `NCCL_DEBUG=TRACE` environment variable set, e.g. `NCCL_DEBUG=TRACE vllm serve ...` and check the logs for the NCCL version and the network used. If you find `[send] via NET/Socket` in the logs, it means NCCL uses raw TCP Socket, which is not efficient for cross-node tensor parallel. If you find `[send] via NET/IB/GDRDMA` in the logs, it means NCCL uses Infiniband with GPU-Direct RDMA, which is efficient.
```{warning}
After you start the Ray cluster, you'd better also check the GPU-GPU communication between nodes. It can be non-trivial to set up. Please refer to the [sanity check script](https://docs.vllm.ai/en/latest/getting_started/debugging.html) for more information. If you need to set some environment variables for the communication configuration, you can append them to the `run_cluster.sh` script, e.g. `-e NCCL_SOCKET_IFNAME=eth0`. Note that setting environment variables in the shell (e.g. `NCCL_SOCKET_IFNAME=eth0 vllm serve ...`) only works for the processes in the same node, not for the processes in the other nodes. Setting environment variables when you create the cluster is the recommended way. See the [discussion](https://github.com/vllm-project/vllm/issues/6803) for more information.
After you start the Ray cluster, you'd better also check the GPU-GPU communication between nodes. It can be non-trivial to set up. Please refer to the [sanity check script](../getting_started/debugging.md) for more information. If you need to set some environment variables for the communication configuration, you can append them to the `run_cluster.sh` script, e.g. `-e NCCL_SOCKET_IFNAME=eth0`. Note that setting environment variables in the shell (e.g. `NCCL_SOCKET_IFNAME=eth0 vllm serve ...`) only works for the processes in the same node, not for the processes in the other nodes. Setting environment variables when you create the cluster is the recommended way. See <gh-issue:6803> for more information.
```
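The log check described above can be sketched in a few lines of Python. This is a minimal illustration, not part of vLLM: the function name `classify_nccl_transport` and its return labels are hypothetical, and only the two marker strings (`[send] via NET/Socket` and `[send] via NET/IB/GDRDMA`) come from the text.

```python
def classify_nccl_transport(log_text: str) -> str:
    """Classify the NCCL transport seen in logs captured with NCCL_DEBUG=TRACE.

    Hypothetical helper: the return labels are illustrative, only the
    marker substrings are taken from the documentation.
    """
    uses_socket = "[send] via NET/Socket" in log_text
    uses_ib = "[send] via NET/IB/GDRDMA" in log_text

    if uses_socket and uses_ib:
        return "mixed"       # some channels fall back to raw TCP sockets
    if uses_socket:
        return "tcp-socket"  # inefficient for cross-node tensor parallel
    if uses_ib:
        return "infiniband"  # Infiniband with GPU-Direct RDMA
    return "unknown"         # neither marker found in the logs
```

To use it, save the output of `NCCL_DEBUG=TRACE vllm serve ...` to a file and pass the file contents to the function; any result other than `"infiniband"` suggests the cluster flags need another look.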
```{warning}
......
......@@ -65,8 +65,7 @@ and all chat requests will error.
vllm serve <model> --chat-template ./path-to-chat-template.jinja
```
vLLM community provides a set of chat templates for popular models. You can find them in the examples
directory [here](https://github.com/vllm-project/vllm/tree/main/examples/)
vLLM community provides a set of chat templates for popular models. You can find them under the <gh-dir:examples> directory.
With the inclusion of multi-modal chat APIs, the OpenAI spec now accepts chat messages in a new format which specifies
both a `type` and a `text` field. An example is provided below:
......@@ -184,9 +183,7 @@ The order of priorities is `command line > config file values > defaults`.
Our Completions API is compatible with [OpenAI's Completions API](https://platform.openai.com/docs/api-reference/completions);
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
#### Code example
See [examples/openai_completion_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_completion_client.py).
Code example: <gh-file:examples/openai_completion_client.py>
#### Extra parameters
......@@ -217,9 +214,7 @@ We support both [Vision](https://platform.openai.com/docs/guides/vision)- and
see our [Multimodal Inputs](../usage/multimodal_inputs.md) guide for more information.
- *Note: `image_url.detail` parameter is not supported.*
#### Code example
See [examples/openai_chat_completion_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client.py).
Code example: <gh-file:examples/openai_chat_completion_client.py>
#### Extra parameters
......@@ -252,9 +247,7 @@ which will be treated as a single prompt to the model.
This enables multi-modal inputs to be passed to embedding models, see [this page](#multimodal-inputs) for details.
```
#### Code example
See [examples/openai_embedding_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_embedding_client.py).
Code example: <gh-file:examples/openai_embedding_client.py>
#### Extra parameters
......@@ -298,9 +291,7 @@ Our Pooling API encodes input prompts using a [pooling model](../models/pooling_
The input format is the same as [Embeddings API](#embeddings-api), but the output data can contain an arbitrary nested list, not just a 1-D list of floats.
#### Code example
See [examples/openai_pooling_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_pooling_client.py).
Code example: <gh-file:examples/openai_pooling_client.py>
(score-api)=
### Score API
......@@ -310,9 +301,7 @@ Usually, the score for a sentence pair refers to the similarity between two sent
You can find the documentation for these kind of models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).
#### Code example
See [examples/openai_cross_encoder_score.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_cross_encoder_score.py).
Code example: <gh-file:examples/openai_cross_encoder_score.py>
#### Single inference
......
......@@ -82,7 +82,7 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
-
-
* - [LoRA](#lora-adapter)
- [✗](https://github.com/vllm-project/vllm/pull/9057)
- [✗](gh-pr:9057)
- ✅
-
-
......@@ -168,10 +168,10 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
-
* - <abbr title="Encoder-Decoder Models">enc-dec</abbr>
- ✗
- [✗](https://github.com/vllm-project/vllm/issues/7366)
- [✗](gh-issue:7366)
- ✗
- ✗
- [✗](https://github.com/vllm-project/vllm/issues/7366)
- [✗](gh-issue:7366)
- ✅
- ✅
-
......@@ -205,7 +205,7 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
- ✅
- ✅
- ✅
- [✗](https://github.com/vllm-project/vllm/pull/8199)
- [✗](gh-pr:8199)
- ✅
- ✗
- ✅
......@@ -244,7 +244,7 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
- ✗
- ✗
- ✅
- [✗](https://github.com/vllm-project/vllm/issues/8198)
- [✗](gh-issue:8198)
- ✅
-
-
......@@ -253,8 +253,8 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
-
* - <abbr title="Multimodal Inputs">mm</abbr>
- ✅
- [✗](https://github.com/vllm-project/vllm/pull/8348)
- [✗](https://github.com/vllm-project/vllm/pull/7199)
- [✗](gh-pr:8348)
- [✗](gh-pr:7199)
- ?
- ?
- ✅
......@@ -273,14 +273,14 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
- ✅
- ✅
- ✅
- [✗](https://github.com/vllm-project/vllm/issues/6137)
- [✗](gh-issue:6137)
- ✅
- ✗
- ✅
- ✅
- ✅
- ?
- [✗](https://github.com/vllm-project/vllm/issues/7968)
- [✗](gh-issue:7968)
- ✅
-
-
......@@ -290,14 +290,14 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
- ✅
- ✅
- ✅
- [✗](https://github.com/vllm-project/vllm/issues/6137)
- [✗](gh-issue:6137)
- ✅
- ✗
- ✅
- ✅
- ✅
- ?
- [✗](https://github.com/vllm-project/vllm/issues/7968>)
- [✗](gh-issue:7968)
- ?
- ✅
-
......@@ -314,7 +314,7 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
- ✅
- ✅
- ✅
- [✗](https://github.com/vllm-project/vllm/issues/9893)
- [✗](gh-issue:9893)
- ?
- ✅
- ✅
......@@ -338,7 +338,7 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
- CPU
- AMD
* - [CP](#chunked-prefill)
- [✗](https://github.com/vllm-project/vllm/issues/2729)
- [✗](gh-issue:2729)
- ✅
- ✅
- ✅
......@@ -346,7 +346,7 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
- ✅
- ✅
* - [APC](#apc)
- [✗](https://github.com/vllm-project/vllm/issues/3687)
- [✗](gh-issue:3687)
- ✅
- ✅
- ✅
......@@ -359,7 +359,7 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
- ✅
- ✅
- ✅
- [✗](https://github.com/vllm-project/vllm/pull/4830)
- [✗](gh-pr:4830)
- ✅
* - <abbr title="Prompt Adapter">prmpt adptr</abbr>
- ✅
......@@ -367,7 +367,7 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
- ✅
- ✅
- ✅
- [✗](https://github.com/vllm-project/vllm/issues/8475)
- [✗](gh-issue:8475)
- ✅
* - [SD](#spec_decode)
- ✅
......@@ -439,7 +439,7 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
- ✅
- ✅
- ✅
- [✗](https://github.com/vllm-project/vllm/issues/8477)
- [✗](gh-issue:8477)
- ✅
* - best-of
- ✅
......
......@@ -47,8 +47,7 @@ outputs = llm.generate(
)
```
Check out [examples/multilora_inference.py](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py)
for an example of how to use LoRA adapters with the async engine and how to use more advanced configuration options.
Check out <gh-file:examples/multilora_inference.py> for an example of how to use LoRA adapters with the async engine and how to use more advanced configuration options.
## Serving LoRA Adapters
......
......@@ -5,7 +5,7 @@
This page teaches you how to pass multi-modal inputs to [multi-modal models](#supported-mm-models) in vLLM.
```{note}
We are actively iterating on multi-modal support. See [this RFC](https://github.com/vllm-project/vllm/issues/4194) for upcoming changes,
We are actively iterating on multi-modal support. See [this RFC](gh-issue:4194) for upcoming changes,
and [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) if you have any feedback or feature requests.
```
......@@ -60,7 +60,7 @@ for o in outputs:
print(generated_text)
```
A code example can be found in [examples/offline_inference_vision_language.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language.py).
Full example: <gh-file:examples/offline_inference_vision_language.py>
To substitute multiple images inside the same text prompt, you can pass in a list of images instead:
......@@ -91,7 +91,7 @@ for o in outputs:
print(generated_text)
```
A code example can be found in [examples/offline_inference_vision_language_multi_image.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language_multi_image.py).
Full example: <gh-file:examples/offline_inference_vision_language_multi_image.py>
Multi-image input can be extended to perform video captioning. We show this with [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) as it supports videos:
......@@ -125,13 +125,13 @@ for o in outputs:
You can pass a list of NumPy arrays directly to the {code}`'video'` field of the multi-modal dictionary
instead of using multi-image input.
Please refer to [examples/offline_inference_vision_language.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language.py) for more details.
Full example: <gh-file:examples/offline_inference_vision_language.py>
### Audio
You can pass a tuple {code}`(array, sampling_rate)` to the {code}`'audio'` field of the multi-modal dictionary.
Please refer to [examples/offline_inference_audio_language.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_audio_language.py) for more details.
Full example: <gh-file:examples/offline_inference_audio_language.py>
### Embedding
......@@ -208,7 +208,7 @@ A chat template is **required** to use Chat Completions API.
Although most models come with a chat template, for others you have to define one yourself.
The chat template can be inferred based on the documentation on the model's HuggingFace repo.
For example, LLaVA-1.5 (`llava-hf/llava-1.5-7b-hf`) requires a chat template that can be found [here](https://github.com/vllm-project/vllm/blob/main/examples/template_llava.jinja).
For example, LLaVA-1.5 (`llava-hf/llava-1.5-7b-hf`) requires a chat template that can be found here: <gh-file:examples/template_llava.jinja>
```
### Image
......@@ -271,7 +271,7 @@ chat_response = client.chat.completions.create(
print("Chat completion output:", chat_response.choices[0].message.content)
```
A full code example can be found in [examples/openai_chat_completion_client_for_multimodal.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client_for_multimodal.py).
Full example: <gh-file:examples/openai_chat_completion_client_for_multimodal.py>
```{tip}
Loading from local file paths is also supported on vLLM: You can specify the allowed local media path via `--allowed-local-media-path` when launching the API server/engine,
......@@ -296,7 +296,7 @@ $ export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>
Instead of {code}`image_url`, you can pass a video file via {code}`video_url`.
You can use [these tests](https://github.com/vllm-project/vllm/blob/main/tests/entrypoints/openai/test_video.py) as reference.
You can use [these tests](gh-file:tests/entrypoints/openai/test_video.py) as reference.
````{note}
By default, the timeout for fetching videos through HTTP URL url is `30` seconds.
......@@ -399,7 +399,7 @@ result = chat_completion_from_url.choices[0].message.content
print("Chat completion output from audio url:", result)
```
A full code example can be found in [examples/openai_chat_completion_client_for_multimodal.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client_for_multimodal.py).
Full example: <gh-file:examples/openai_chat_completion_client_for_multimodal.py>
````{note}
By default, the timeout for fetching audios through HTTP URL is `10` seconds.
......@@ -435,7 +435,7 @@ Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to expl
to run this model in embedding mode instead of text generation mode.
The custom chat template is completely different from the original one for this model,
and can be found [here](https://github.com/vllm-project/vllm/blob/main/examples/template_vlm2vec.jinja).
and can be found here: <gh-file:examples/template_vlm2vec.jinja>
```
Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level `requests` library:
......@@ -475,7 +475,7 @@ vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
Like with VLM2Vec, we have to explicitly pass `--task embed`.
Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
by [this custom chat template](https://github.com/vllm-project/vllm/blob/main/examples/template_dse_qwen2_vl.jinja).
by a custom chat template: <gh-file:examples/template_dse_qwen2_vl.jinja>
```
```{important}
......@@ -483,4 +483,4 @@ Also important, `MrLight/dse-qwen2-2b-mrl-v1` requires a placeholder image of th
example below for details.
```
A full code example can be found in [examples/openai_chat_embedding_client_for_multimodal.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_embedding_client_for_multimodal.py).
Full example: <gh-file:examples/openai_chat_embedding_client_for_multimodal.py>
......@@ -4,8 +4,8 @@
```{warning}
Please note that speculative decoding in vLLM is not yet optimized and does
not usually yield inter-token latency reductions for all prompt datasets or sampling parameters. The work
to optimize it is ongoing and can be followed in [this issue.](https://github.com/vllm-project/vllm/issues/4630)
not usually yield inter-token latency reductions for all prompt datasets or sampling parameters.
The work to optimize it is ongoing and can be followed here: <gh-issue:4630>
```
```{warning}
......@@ -176,7 +176,7 @@ speculative decoding, breaking down the guarantees into three key areas:
> distribution. [View Test Code](https://github.com/vllm-project/vllm/blob/47b65a550866c7ffbd076ecb74106714838ce7da/tests/samplers/test_rejection_sampler.py#L252)
> - **Greedy Sampling Equality**: Confirms that greedy sampling with speculative decoding matches greedy sampling
> without it. This verifies that vLLM's speculative decoding framework, when integrated with the vLLM forward pass and the vLLM rejection sampler,
> provides a lossless guarantee. Almost all of the tests in [this directory](https://github.com/vllm-project/vllm/tree/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e)
> provides a lossless guarantee. Almost all of the tests in <gh-dir:tests/spec_decode/e2e>
> verify this property using [this assertion implementation](https://github.com/vllm-project/vllm/blob/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e/conftest.py#L291)
3. **vLLM Logprob Stability**
......@@ -202,4 +202,4 @@ For mitigation strategies, please refer to the FAQ entry *Can the output of a pr
- [A Hacker's Guide to Speculative Decoding in vLLM](https://www.youtube.com/watch?v=9wNAgpX6z_4)
- [What is Lookahead Scheduling in vLLM?](https://docs.google.com/document/d/1Z9TvqzzBPnh5WHcRwjvK2UEeFeq5zMZb5mFE8jR0HCs/edit#heading=h.1fjfb0donq5a)
- [Information on batch expansion](https://docs.google.com/document/d/1T-JaS2T1NRfdP51qzqpyakoCXxSXTtORppiwaj5asxA/edit#heading=h.kk7dq05lc6q8)
- [Dynamic speculative decoding](https://github.com/vllm-project/vllm/issues/4565)
- [Dynamic speculative decoding](gh-issue:4565)
......@@ -131,7 +131,7 @@ completion = client.chat.completions.create(
print(completion.choices[0].message.content)
```
The complete code of the examples can be found on [examples/openai_chat_completion_structured_outputs.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_structured_outputs.py).
Full example: <gh-file:examples/openai_chat_completion_structured_outputs.py>
## Experimental Automatic Parsing (OpenAI API)
......@@ -257,4 +257,4 @@ outputs = llm.generate(
print(outputs[0].outputs[0].text)
```
A complete example with all options can be found in [examples/offline_inference_structured_outputs.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_structured_outputs.py).
Full example: <gh-file:examples/offline_inference_structured_outputs.py>
......@@ -4,7 +4,7 @@ vLLM collects anonymous usage data by default to help the engineering team bette
## What data is collected?
You can see the up to date list of data collected by vLLM in the [usage_lib.py](https://github.com/vllm-project/vllm/blob/main/vllm/usage/usage_lib.py).
The list of data collected by the latest version of vLLM can be found here: <gh-file:vllm/usage/usage_lib.py>
Here is an example as of v0.4.0:
......
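The shorthand links introduced throughout this change (`gh-file:`, `gh-dir:`, `gh-issue:`, `gh-pr:`) replace full GitHub URLs. The sketch below shows one way such a shorthand could be expanded back into the URL it replaced. The mapping is inferred from the before/after pairs in this diff, and the names `REPO_URL`, `ROLE_PATHS`, and `expand_gh_link` are hypothetical; the actual documentation build may resolve these links differently.

```python
REPO_URL = "https://github.com/vllm-project/vllm"

# Role prefix -> URL path template, inferred from the link pairs in this diff.
ROLE_PATHS = {
    "gh-file": "blob/main/{target}",
    "gh-dir": "tree/main/{target}",
    "gh-issue": "issues/{target}",
    "gh-pr": "pull/{target}",
}


def expand_gh_link(link: str) -> str:
    """Expand a shorthand such as 'gh-issue:6803' into a full GitHub URL."""
    role, sep, target = link.partition(":")
    if not sep or role not in ROLE_PATHS:
        raise ValueError(f"not a recognized GitHub shorthand: {link!r}")
    return f"{REPO_URL}/{ROLE_PATHS[role].format(target=target)}"


if __name__ == "__main__":
    for link in ("gh-file:Dockerfile", "gh-dir:examples", "gh-issue:6803", "gh-pr:9057"):
        print(link, "->", expand_gh_link(link))
```

Keeping the repository URL in one place means a repository move or branch rename only needs a single change, which appears to be the motivation for the shorthand.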