Unverified commit af107d5a, authored by Harry Mellor, committed by GitHub

Make distinct `code` and `console` admonitions so readers are less likely to miss them (#20585)


Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
parent 31c5d0a1
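
For illustration only: the renames in the diff below all follow one pattern. A bare `??? Title` opener gains an admonition type (`code` or `console`) and keeps its title in quotes, or drops the title when it is just the type name (`??? Code` becomes `??? code`). A minimal Python sketch of that rewrite, assuming the type is chosen by hand for each block; `retype_admonition` is a hypothetical helper and is not part of this commit:

```python
import re

# Matches a collapsible admonition opener with a bare, capitalised title,
# e.g. "??? Commands". Assumption: openers that already carry a type start
# with a lowercase word (e.g. '??? console "Logs"') and are left untouched.
BARE_OPENER = re.compile(r'^(?P<indent>\s*)\?\?\? (?P<title>[A-Z][^"]*)$')


def retype_admonition(line: str, kind: str) -> str:
    """Rewrite '??? Title' as '??? kind "Title"'.

    The title is dropped when it only repeats the kind ("Code" -> "code").
    Expects a single line without its trailing newline.
    """
    match = BARE_OPENER.match(line)
    if match is None:
        return line  # not a bare opener: return the line unchanged
    indent = match["indent"]
    title = match["title"].strip()
    if title.lower() == kind:
        return f"{indent}??? {kind}"
    return f'{indent}??? {kind} "{title}"'
```

For example, `retype_admonition("??? Logs", "console")` returns `??? console "Logs"`, which matches the change made to the warmup log blocks.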
......@@ -76,7 +76,7 @@ Currently, there are no pre-built CPU wheels.
### Build image from source
??? Commands
??? console "Commands"
```bash
docker build -f docker/Dockerfile.cpu \
......@@ -149,7 +149,7 @@ vllm serve facebook/opt-125m
- If using vLLM CPU backend on a machine with hyper-threading, it is recommended to bind only one OpenMP thread on each physical CPU core using `VLLM_CPU_OMP_THREADS_BIND` or using auto thread binding feature by default. On a hyper-threading enabled platform with 16 logical CPU cores / 8 physical CPU cores:
??? Commands
??? console "Commands"
```console
$ lscpu -e # check the mapping between logical CPU cores and physical CPU cores
......
......@@ -95,7 +95,7 @@ Currently, there are no pre-built ROCm wheels.
4. Build vLLM. For example, vLLM on ROCM 6.3 can be built with the following steps:
??? Commands
??? console "Commands"
```bash
pip install --upgrade pip
......@@ -206,7 +206,7 @@ DOCKER_BUILDKIT=1 docker build \
To run the above docker image `vllm-rocm`, use the below command:
??? Command
??? console "Command"
```bash
docker run -it \
......
......@@ -237,7 +237,7 @@ As an example, if a request of 3 sequences, with max sequence length of 412 come
Warmup is an optional, but highly recommended step occurring before vLLM server starts listening. It executes a forward pass for each bucket with dummy data. The goal is to pre-compile all graphs and not incur any graph compilation overheads within bucket boundaries during server runtime. Each warmup step is logged during vLLM startup:
??? Logs
??? console "Logs"
```text
INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][1/24] batch_size:4 seq_len:1024 free_mem:79.16 GiB
......@@ -286,7 +286,7 @@ When there's large amount of requests pending, vLLM scheduler will attempt to fi
Each described step is logged by vLLM server, as follows (negative values correspond to memory being released):
??? Logs
??? console "Logs"
```text
INFO 08-02 17:37:44 hpu_model_runner.py:493] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 4], seq:[128, 128, 1024]
......
......@@ -147,7 +147,7 @@ curl http://localhost:8000/v1/completions \
Since this server is compatible with OpenAI API, you can use it as a drop-in replacement for any applications using OpenAI API. For example, another way to query the server is via the `openai` Python package:
??? Code
??? code
```python
from openai import OpenAI
......@@ -186,7 +186,7 @@ curl http://localhost:8000/v1/chat/completions \
Alternatively, you can use the `openai` Python package:
??? Code
??? code
```python
from openai import OpenAI
......
......@@ -39,6 +39,8 @@ body[data-md-color-scheme="slate"] .md-nav__item--section > label.md-nav__link .
:root {
--md-admonition-icon--announcement: url('data:image/svg+xml;charset=utf-8,<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 16 16" width="16" height="16"><path d="M3.25 9a.75.75 0 0 1 .75.75c0 2.142.456 3.828.733 4.653a.122.122 0 0 0 .05.064.212.212 0 0 0 .117.033h1.31c.085 0 .18-.042.258-.152a.45.45 0 0 0 .075-.366A16.743 16.743 0 0 1 6 9.75a.75.75 0 0 1 1.5 0c0 1.588.25 2.926.494 3.85.293 1.113-.504 2.4-1.783 2.4H4.9c-.686 0-1.35-.41-1.589-1.12A16.4 16.4 0 0 1 2.5 9.75.75.75 0 0 1 3.25 9Z"></path><path d="M0 6a4 4 0 0 1 4-4h2.75a.75.75 0 0 1 .75.75v6.5a.75.75 0 0 1-.75.75H4a4 4 0 0 1-4-4Zm4-2.5a2.5 2.5 0 1 0 0 5h2v-5Z"></path><path d="M15.59.082A.75.75 0 0 1 16 .75v10.5a.75.75 0 0 1-1.189.608l-.002-.001h.001l-.014-.01a5.775 5.775 0 0 0-.422-.25 10.63 10.63 0 0 0-1.469-.64C11.576 10.484 9.536 10 6.75 10a.75.75 0 0 1 0-1.5c2.964 0 5.174.516 6.658 1.043.423.151.787.302 1.092.443V2.014c-.305.14-.669.292-1.092.443C11.924 2.984 9.713 3.5 6.75 3.5a.75.75 0 0 1 0-1.5c2.786 0 4.826-.484 6.155-.957.665-.236 1.154-.47 1.47-.64.144-.077.284-.161.421-.25l.014-.01a.75.75 0 0 1 .78-.061Z"></path></svg>');
--md-admonition-icon--important: url('data:image/svg+xml;charset=utf-8,<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 16 16" width="16" height="16"><path d="M4.47.22A.749.749 0 0 1 5 0h6c.199 0 .389.079.53.22l4.25 4.25c.141.14.22.331.22.53v6a.749.749 0 0 1-.22.53l-4.25 4.25A.749.749 0 0 1 11 16H5a.749.749 0 0 1-.53-.22L.22 11.53A.749.749 0 0 1 0 11V5c0-.199.079-.389.22-.53Zm.84 1.28L1.5 5.31v5.38l3.81 3.81h5.38l3.81-3.81V5.31L10.69 1.5ZM8 4a.75.75 0 0 1 .75.75v3.5a.75.75 0 0 1-1.5 0v-3.5A.75.75 0 0 1 8 4Zm0 8a1 1 0 1 1 0-2 1 1 0 0 1 0 2Z"></path></svg>');
--md-admonition-icon--code: url('data:image/svg+xml;charset=utf-8,<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 16 16"><path d="m11.28 3.22 4.25 4.25a.75.75 0 0 1 0 1.06l-4.25 4.25a.749.749 0 0 1-1.275-.326.75.75 0 0 1 .215-.734L13.94 8l-3.72-3.72a.749.749 0 0 1 .326-1.275.75.75 0 0 1 .734.215m-6.56 0a.75.75 0 0 1 1.042.018.75.75 0 0 1 .018 1.042L2.06 8l3.72 3.72a.749.749 0 0 1-.326 1.275.75.75 0 0 1-.734-.215L.47 8.53a.75.75 0 0 1 0-1.06Z"/></svg>');
--md-admonition-icon--console: url('data:image/svg+xml;charset=utf-8,<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 16 16"><path d="M0 2.75C0 1.784.784 1 1.75 1h12.5c.966 0 1.75.784 1.75 1.75v10.5A1.75 1.75 0 0 1 14.25 15H1.75A1.75 1.75 0 0 1 0 13.25Zm1.75-.25a.25.25 0 0 0-.25.25v10.5c0 .138.112.25.25.25h12.5a.25.25 0 0 0 .25-.25V2.75a.25.25 0 0 0-.25-.25ZM7.25 8a.75.75 0 0 1-.22.53l-2.25 2.25a.749.749 0 0 1-1.275-.326.75.75 0 0 1 .215-.734L5.44 8 3.72 6.28a.749.749 0 0 1 .326-1.275.75.75 0 0 1 .734.215l2.25 2.25c.141.14.22.331.22.53m1.5 1.5h3a.75.75 0 0 1 0 1.5h-3a.75.75 0 0 1 0-1.5"/></svg>');
}
.md-typeset .admonition.announcement,
......@@ -49,6 +51,14 @@ body[data-md-color-scheme="slate"] .md-nav__item--section > label.md-nav__link .
.md-typeset details.important {
border-color: rgb(239, 85, 82);
}
.md-typeset .admonition.code,
.md-typeset details.code {
border-color: #64dd17
}
.md-typeset .admonition.console,
.md-typeset details.console {
border-color: #64dd17
}
.md-typeset .announcement > .admonition-title,
.md-typeset .announcement > summary {
......@@ -58,6 +68,14 @@ body[data-md-color-scheme="slate"] .md-nav__item--section > label.md-nav__link .
.md-typeset .important > summary {
background-color: rgb(239, 85, 82, 0.1);
}
.md-typeset .code > .admonition-title,
.md-typeset .code > summary {
background-color: #64dd171a;
}
.md-typeset .console > .admonition-title,
.md-typeset .console > summary {
background-color: #64dd171a;
}
.md-typeset .announcement > .admonition-title::before,
.md-typeset .announcement > summary::before {
......@@ -71,6 +89,18 @@ body[data-md-color-scheme="slate"] .md-nav__item--section > label.md-nav__link .
-webkit-mask-image: var(--md-admonition-icon--important);
mask-image: var(--md-admonition-icon--important);
}
.md-typeset .code > .admonition-title::before,
.md-typeset .code > summary::before {
background-color: #64dd17;
-webkit-mask-image: var(--md-admonition-icon--code);
mask-image: var(--md-admonition-icon--code);
}
.md-typeset .console > .admonition-title::before,
.md-typeset .console > summary::before {
background-color: #64dd17;
-webkit-mask-image: var(--md-admonition-icon--console);
mask-image: var(--md-admonition-icon--console);
}
/* Make label fully visible on hover */
.md-content__button[href*="edit"]:hover::after {
......
......@@ -85,7 +85,7 @@ and automatically applies the model's [chat template](https://huggingface.co/doc
In general, only instruction-tuned models have a chat template.
Base models may perform poorly as they are not trained to respond to the chat conversation.
??? Code
??? code
```python
from vllm import LLM
......
......@@ -642,7 +642,7 @@ Specified using `--task generate`.
For the best results, we recommend using the following dependency versions (tested on A10 and L40):
??? Dependency versions
??? code "Dependency versions"
```text
# Core vLLM-compatible dependencies with Molmo accuracy setup (tested on L40)
......
......@@ -13,7 +13,7 @@ pip install langchain langchain_community -q
To run inference on a single or multiple GPUs, use `VLLM` class from `langchain`.
??? Code
??? code
```python
from langchain_community.llms import VLLM
......
......@@ -15,7 +15,7 @@ vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
To call the server, in your preferred text editor, create a script that uses an HTTP client. Include any messages that you want to send to the model. Then run that script. Below is an example script using the [official OpenAI Python client](https://github.com/openai/openai-python).
??? Code
??? code
```python
from openai import OpenAI
......@@ -146,7 +146,7 @@ completion = client.chat.completions.create(
Only `X-Request-Id` HTTP request header is supported for now. It can be enabled
with `--enable-request-id-headers`.
??? Code
??? code
```python
completion = client.chat.completions.create(
......@@ -185,7 +185,7 @@ Code example: <gh-file:examples/online_serving/openai_completion_client.py>
The following [sampling parameters][sampling-params] are supported.
??? Code
??? code
```python
--8<-- "vllm/entrypoints/openai/protocol.py:completion-sampling-params"
......@@ -193,7 +193,7 @@ The following [sampling parameters][sampling-params] are supported.
The following extra parameters are supported:
??? Code
??? code
```python
--8<-- "vllm/entrypoints/openai/protocol.py:completion-extra-params"
......@@ -217,7 +217,7 @@ Code example: <gh-file:examples/online_serving/openai_chat_completion_client.py>
The following [sampling parameters][sampling-params] are supported.
??? Code
??? code
```python
--8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-sampling-params"
......@@ -225,7 +225,7 @@ The following [sampling parameters][sampling-params] are supported.
The following extra parameters are supported:
??? Code
??? code
```python
--8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-extra-params"
......@@ -268,7 +268,7 @@ and passing a list of `messages` in the request. Refer to the examples below for
Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level `requests` library:
??? Code
??? code
```python
import requests
......@@ -327,7 +327,7 @@ The following [pooling parameters][pooling-params] are supported.
The following extra parameters are supported by default:
??? Code
??? code
```python
--8<-- "vllm/entrypoints/openai/protocol.py:embedding-extra-params"
......@@ -335,7 +335,7 @@ The following extra parameters are supported by default:
For chat-like input (i.e. if `messages` is passed), these extra parameters are supported instead:
??? Code
??? code
```python
--8<-- "vllm/entrypoints/openai/protocol.py:chat-embedding-extra-params"
......@@ -358,7 +358,7 @@ Code example: <gh-file:examples/online_serving/openai_transcription_client.py>
The following [sampling parameters][sampling-params] are supported.
??? Code
??? code
```python
--8<-- "vllm/entrypoints/openai/protocol.py:transcription-sampling-params"
......@@ -366,7 +366,7 @@ The following [sampling parameters][sampling-params] are supported.
The following extra parameters are supported:
??? Code
??? code
```python
--8<-- "vllm/entrypoints/openai/protocol.py:transcription-extra-params"
......@@ -446,7 +446,7 @@ curl -v "http://127.0.0.1:8000/classify" \
}'
```
??? Response
??? console "Response"
```bash
{
......@@ -494,7 +494,7 @@ curl -v "http://127.0.0.1:8000/classify" \
}'
```
??? Response
??? console "Response"
```bash
{
......@@ -564,7 +564,7 @@ curl -X 'POST' \
}'
```
??? Response
??? console "Response"
```bash
{
......@@ -589,7 +589,7 @@ You can pass a string to `text_1` and a list to `text_2`, forming multiple sente
where each pair is built from `text_1` and a string in `text_2`.
The total number of pairs is `len(text_2)`.
??? Request
??? console "Request"
```bash
curl -X 'POST' \
......@@ -606,7 +606,7 @@ The total number of pairs is `len(text_2)`.
}'
```
??? Response
??? console "Response"
```bash
{
......@@ -634,7 +634,7 @@ You can pass a list to both `text_1` and `text_2`, forming multiple sentence pai
where each pair is built from a string in `text_1` and the corresponding string in `text_2` (similar to `zip()`).
The total number of pairs is `len(text_2)`.
??? Request
??? console "Request"
```bash
curl -X 'POST' \
......@@ -655,7 +655,7 @@ The total number of pairs is `len(text_2)`.
}'
```
??? Response
??? console "Response"
```bash
{
......@@ -716,7 +716,7 @@ Code example: <gh-file:examples/online_serving/jinaai_rerank_client.py>
Note that the `top_n` request parameter is optional and will default to the length of the `documents` field.
Result documents will be sorted by relevance, and the `index` property can be used to determine original order.
??? Request
??? console "Request"
```bash
curl -X 'POST' \
......@@ -734,7 +734,7 @@ Result documents will be sorted by relevance, and the `index` property can be us
}'
```
??? Response
??? console "Response"
```bash
{
......
......@@ -12,7 +12,7 @@ vllm serve unsloth/Llama-3.2-1B-Instruct
Then query the endpoint to get the latest metrics from the server:
??? Output
??? console "Output"
```console
$ curl http://0.0.0.0:8000/metrics
......@@ -33,7 +33,7 @@ Then query the endpoint to get the latest metrics from the server:
The following metrics are exposed:
??? Code
??? code
```python
--8<-- "vllm/engine/metrics.py:metrics-definitions"
......
......@@ -60,7 +60,7 @@ To identify the particular CUDA operation that causes the error, you can add `--
If GPU/CPU communication cannot be established, you can use the following Python script and follow the instructions below to confirm whether the GPU/CPU communication is working correctly.
??? Code
??? code
```python
# Test PyTorch NCCL
......@@ -170,7 +170,7 @@ WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
or an error from Python that looks like this:
??? Logs
??? console "Logs"
```console
RuntimeError:
......@@ -214,7 +214,7 @@ if __name__ == '__main__':
vLLM heavily depends on `torch.compile` to optimize the model for better performance, which introduces the dependency on the `torch.compile` functionality and the `triton` library. By default, we use `torch.compile` to [optimize some functions](gh-pr:10406) in the model. Before running vLLM, you can check if `torch.compile` is working as expected by running the following script:
??? Code
??? code
```python
import torch
......
......@@ -10,7 +10,7 @@ The list of data collected by the latest version of vLLM can be found here: <gh-
Here is an example as of v0.4.0:
??? Output
??? console "Output"
```json
{
......