Unverified commit af107d5a, authored by Harry Mellor, committed by GitHub

Make distinct `code` and `console` admonitions so readers are less likely to miss them (#20585)


Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
parent 31c5d0a1
...@@ -16,7 +16,7 @@ vllm {chat,complete,serve,bench,collect-env,run-batch} ...@@ -16,7 +16,7 @@ vllm {chat,complete,serve,bench,collect-env,run-batch}
Start the vLLM OpenAI Compatible API server. Start the vLLM OpenAI Compatible API server.
??? Examples ??? console "Examples"
```bash ```bash
# Start with a model # Start with a model
......
...@@ -57,7 +57,7 @@ By default, we optimize model inference using CUDA graphs which take up extra me ...@@ -57,7 +57,7 @@ By default, we optimize model inference using CUDA graphs which take up extra me
You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage: You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage:
??? Code ??? code
```python ```python
from vllm import LLM from vllm import LLM
...@@ -129,7 +129,7 @@ reduce the size of the processed multi-modal inputs, which in turn saves memory. ...@@ -129,7 +129,7 @@ reduce the size of the processed multi-modal inputs, which in turn saves memory.
Here are some examples: Here are some examples:
??? Code ??? code
```python ```python
from vllm import LLM from vllm import LLM
......
...@@ -7,7 +7,7 @@ vLLM uses the following environment variables to configure the system: ...@@ -7,7 +7,7 @@ vLLM uses the following environment variables to configure the system:
All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables). All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).
??? Code ??? code
```python ```python
--8<-- "vllm/envs.py:env-vars-definition" --8<-- "vllm/envs.py:env-vars-definition"
......
...@@ -95,7 +95,7 @@ For additional features and advanced configurations, refer to the official [MkDo ...@@ -95,7 +95,7 @@ For additional features and advanced configurations, refer to the official [MkDo
## Testing ## Testing
??? note "Commands" ??? console "Commands"
```bash ```bash
pip install -r requirements/dev.txt pip install -r requirements/dev.txt
......
...@@ -27,7 +27,7 @@ All vLLM modules within the model must include a `prefix` argument in their cons ...@@ -27,7 +27,7 @@ All vLLM modules within the model must include a `prefix` argument in their cons
The initialization code should look like this: The initialization code should look like this:
??? Code ??? code
```python ```python
from torch import nn from torch import nn
......
...@@ -12,7 +12,7 @@ Further update the model as follows: ...@@ -12,7 +12,7 @@ Further update the model as follows:
- Implement [get_placeholder_str][vllm.model_executor.models.interfaces.SupportsMultiModal.get_placeholder_str] to define the placeholder string which is used to represent the multi-modal item in the text prompt. This should be consistent with the chat template of the model. - Implement [get_placeholder_str][vllm.model_executor.models.interfaces.SupportsMultiModal.get_placeholder_str] to define the placeholder string which is used to represent the multi-modal item in the text prompt. This should be consistent with the chat template of the model.
??? Code ??? code
```python ```python
class YourModelForImage2Seq(nn.Module): class YourModelForImage2Seq(nn.Module):
...@@ -41,7 +41,7 @@ Further update the model as follows: ...@@ -41,7 +41,7 @@ Further update the model as follows:
- Implement [get_multimodal_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_multimodal_embeddings] that returns the embeddings from running the multimodal inputs through the multimodal tokenizer of the model. Below we provide a boilerplate of a typical implementation pattern, but feel free to adjust it to your own needs. - Implement [get_multimodal_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_multimodal_embeddings] that returns the embeddings from running the multimodal inputs through the multimodal tokenizer of the model. Below we provide a boilerplate of a typical implementation pattern, but feel free to adjust it to your own needs.
??? Code ??? code
```python ```python
class YourModelForImage2Seq(nn.Module): class YourModelForImage2Seq(nn.Module):
...@@ -71,7 +71,7 @@ Further update the model as follows: ...@@ -71,7 +71,7 @@ Further update the model as follows:
- Implement [get_input_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_input_embeddings] to merge `multimodal_embeddings` with text embeddings from the `input_ids`. If input processing for the model is implemented correctly (see sections below), then you can leverage the utility function we provide to easily merge the embeddings. - Implement [get_input_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_input_embeddings] to merge `multimodal_embeddings` with text embeddings from the `input_ids`. If input processing for the model is implemented correctly (see sections below), then you can leverage the utility function we provide to easily merge the embeddings.
??? Code ??? code
```python ```python
from .utils import merge_multimodal_embeddings from .utils import merge_multimodal_embeddings
...@@ -155,7 +155,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in ...@@ -155,7 +155,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
Looking at the code of HF's `LlavaForConditionalGeneration`: Looking at the code of HF's `LlavaForConditionalGeneration`:
??? Code ??? code
```python ```python
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L530-L544 # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L530-L544
...@@ -179,7 +179,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in ...@@ -179,7 +179,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
The number of placeholder feature tokens per image is `image_features.shape[1]`. The number of placeholder feature tokens per image is `image_features.shape[1]`.
`image_features` is calculated inside the `get_image_features` method: `image_features` is calculated inside the `get_image_features` method:
??? Code ??? code
```python ```python
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L290-L300 # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L290-L300
...@@ -217,7 +217,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in ...@@ -217,7 +217,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
To find the sequence length, we turn to the code of `CLIPVisionEmbeddings`: To find the sequence length, we turn to the code of `CLIPVisionEmbeddings`:
??? Code ??? code
```python ```python
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L247-L257 # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L247-L257
...@@ -244,7 +244,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in ...@@ -244,7 +244,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
Overall, the number of placeholder feature tokens for an image can be calculated as: Overall, the number of placeholder feature tokens for an image can be calculated as:
??? Code ??? code
```python ```python
def get_num_image_tokens( def get_num_image_tokens(
...@@ -269,7 +269,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in ...@@ -269,7 +269,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
Notice that the number of image tokens doesn't depend on the image width and height. Notice that the number of image tokens doesn't depend on the image width and height.
We can simply use a dummy `image_size` to calculate the multimodal profiling data: We can simply use a dummy `image_size` to calculate the multimodal profiling data:
??? Code ??? code
```python ```python
# NOTE: In actuality, this is usually implemented as part of the # NOTE: In actuality, this is usually implemented as part of the
...@@ -314,7 +314,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in ...@@ -314,7 +314,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
Looking at the code of HF's `FuyuForCausalLM`: Looking at the code of HF's `FuyuForCausalLM`:
??? Code ??? code
```python ```python
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/modeling_fuyu.py#L311-L322 # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/modeling_fuyu.py#L311-L322
...@@ -344,7 +344,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in ...@@ -344,7 +344,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
In `FuyuImageProcessor.preprocess`, the images are resized and padded to the target `FuyuImageProcessor.size`, In `FuyuImageProcessor.preprocess`, the images are resized and padded to the target `FuyuImageProcessor.size`,
returning the dimensions after resizing (but before padding) as metadata. returning the dimensions after resizing (but before padding) as metadata.
??? Code ??? code
```python ```python
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L541-L544 # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L541-L544
...@@ -382,7 +382,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in ...@@ -382,7 +382,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
In `FuyuImageProcessor.preprocess_with_tokenizer_info`, the images are split into patches based on this metadata: In `FuyuImageProcessor.preprocess_with_tokenizer_info`, the images are split into patches based on this metadata:
??? Code ??? code
```python ```python
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L425 # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L425
...@@ -420,7 +420,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in ...@@ -420,7 +420,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
The number of patches is in turn defined by `FuyuImageProcessor.get_num_patches`: The number of patches is in turn defined by `FuyuImageProcessor.get_num_patches`:
??? Code ??? code
```python ```python
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L552-L562 # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L552-L562
...@@ -457,7 +457,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in ...@@ -457,7 +457,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
For the multimodal image profiling data, the logic is very similar to LLaVA: For the multimodal image profiling data, the logic is very similar to LLaVA:
??? Code ??? code
```python ```python
def get_dummy_mm_data( def get_dummy_mm_data(
...@@ -546,7 +546,7 @@ return a schema of the tensors outputted by the HF processor that are related to ...@@ -546,7 +546,7 @@ return a schema of the tensors outputted by the HF processor that are related to
In order to support the use of [MultiModalFieldConfig.batched][] like in LLaVA, In order to support the use of [MultiModalFieldConfig.batched][] like in LLaVA,
we remove the extra batch dimension by overriding [BaseMultiModalProcessor._call_hf_processor][]: we remove the extra batch dimension by overriding [BaseMultiModalProcessor._call_hf_processor][]:
??? Code ??? code
```python ```python
def _call_hf_processor( def _call_hf_processor(
...@@ -623,7 +623,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies ...@@ -623,7 +623,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
It simply repeats each input `image_token` a number of times equal to the number of placeholder feature tokens (`num_image_tokens`). It simply repeats each input `image_token` a number of times equal to the number of placeholder feature tokens (`num_image_tokens`).
Based on this, we override [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] as follows: Based on this, we override [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] as follows:
??? Code ??? code
```python ```python
def _get_prompt_updates( def _get_prompt_updates(
...@@ -668,7 +668,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies ...@@ -668,7 +668,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
We define a helper function to return `ncols` and `nrows` directly: We define a helper function to return `ncols` and `nrows` directly:
??? Code ??? code
```python ```python
def get_image_feature_grid_size( def get_image_feature_grid_size(
...@@ -698,7 +698,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies ...@@ -698,7 +698,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
Based on this, we can initially define our replacement tokens as: Based on this, we can initially define our replacement tokens as:
??? Code ??? code
```python ```python
def get_replacement(item_idx: int): def get_replacement(item_idx: int):
...@@ -718,7 +718,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies ...@@ -718,7 +718,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
However, this is not entirely correct. After `FuyuImageProcessor.preprocess_with_tokenizer_info` is called, However, this is not entirely correct. After `FuyuImageProcessor.preprocess_with_tokenizer_info` is called,
a BOS token (`<s>`) is also added to the promopt: a BOS token (`<s>`) is also added to the promopt:
??? Code ??? code
```python ```python
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L435 # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L435
...@@ -745,7 +745,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies ...@@ -745,7 +745,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
To assign the vision embeddings to only the image tokens, instead of a string To assign the vision embeddings to only the image tokens, instead of a string
you can return an instance of [PromptUpdateDetails][vllm.multimodal.processing.PromptUpdateDetails]: you can return an instance of [PromptUpdateDetails][vllm.multimodal.processing.PromptUpdateDetails]:
??? Code ??? code
```python ```python
hf_config = self.info.get_hf_config() hf_config = self.info.get_hf_config()
...@@ -772,7 +772,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies ...@@ -772,7 +772,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
Finally, noticing that the HF processor removes the `|ENDOFTEXT|` token from the tokenized prompt, Finally, noticing that the HF processor removes the `|ENDOFTEXT|` token from the tokenized prompt,
we can search for it to conduct the replacement at the start of the string: we can search for it to conduct the replacement at the start of the string:
??? Code ??? code
```python ```python
def _get_prompt_updates( def _get_prompt_updates(
......
...@@ -125,7 +125,7 @@ to manually kill the profiler and generate your `nsys-rep` report. ...@@ -125,7 +125,7 @@ to manually kill the profiler and generate your `nsys-rep` report.
You can view these profiles either as summaries in the CLI, using `nsys stats [profile-file]`, or in the GUI by installing Nsight [locally following the directions here](https://developer.nvidia.com/nsight-systems/get-started). You can view these profiles either as summaries in the CLI, using `nsys stats [profile-file]`, or in the GUI by installing Nsight [locally following the directions here](https://developer.nvidia.com/nsight-systems/get-started).
??? CLI example ??? console "CLI example"
```bash ```bash
nsys stats report1.nsys-rep nsys stats report1.nsys-rep
......
...@@ -97,7 +97,7 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `-- ...@@ -97,7 +97,7 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `--
flags to speed up build process. However, ensure your `max_jobs` is substantially larger than `nvcc_threads` to get the most benefits. flags to speed up build process. However, ensure your `max_jobs` is substantially larger than `nvcc_threads` to get the most benefits.
Keep an eye on memory usage with parallel jobs as it can be substantial (see example below). Keep an eye on memory usage with parallel jobs as it can be substantial (see example below).
??? Command ??? console "Command"
```bash ```bash
# Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB) # Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
......
...@@ -30,7 +30,7 @@ python -m vllm.entrypoints.openai.api_server \ ...@@ -30,7 +30,7 @@ python -m vllm.entrypoints.openai.api_server \
- Call it with AutoGen: - Call it with AutoGen:
??? Code ??? code
```python ```python
import asyncio import asyncio
......
...@@ -34,7 +34,7 @@ vllm = "latest" ...@@ -34,7 +34,7 @@ vllm = "latest"
Next, let us add our code to handle inference for the LLM of your choice (`mistralai/Mistral-7B-Instruct-v0.1` for this example), add the following code to your `main.py`: Next, let us add our code to handle inference for the LLM of your choice (`mistralai/Mistral-7B-Instruct-v0.1` for this example), add the following code to your `main.py`:
??? Code ??? code
```python ```python
from vllm import LLM, SamplingParams from vllm import LLM, SamplingParams
...@@ -64,7 +64,7 @@ cerebrium deploy ...@@ -64,7 +64,7 @@ cerebrium deploy
If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case`/run`) If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case`/run`)
??? Command ??? console "Command"
```python ```python
curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \ curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
...@@ -82,7 +82,7 @@ If successful, you should be returned a CURL command that you can call inference ...@@ -82,7 +82,7 @@ If successful, you should be returned a CURL command that you can call inference
You should get a response like: You should get a response like:
??? Response ??? console "Response"
```python ```python
{ {
......
...@@ -26,7 +26,7 @@ dstack init ...@@ -26,7 +26,7 @@ dstack init
Next, to provision a VM instance with LLM of your choice (`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`: Next, to provision a VM instance with LLM of your choice (`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`:
??? Config ??? code "Config"
```yaml ```yaml
type: service type: service
...@@ -48,7 +48,7 @@ Next, to provision a VM instance with LLM of your choice (`NousResearch/Llama-2- ...@@ -48,7 +48,7 @@ Next, to provision a VM instance with LLM of your choice (`NousResearch/Llama-2-
Then, run the following CLI for provisioning: Then, run the following CLI for provisioning:
??? Command ??? console "Command"
```console ```console
$ dstack run . -f serve.dstack.yml $ dstack run . -f serve.dstack.yml
...@@ -79,7 +79,7 @@ Then, run the following CLI for provisioning: ...@@ -79,7 +79,7 @@ Then, run the following CLI for provisioning:
After the provisioning, you can interact with the model by using the OpenAI SDK: After the provisioning, you can interact with the model by using the OpenAI SDK:
??? Code ??? code
```python ```python
from openai import OpenAI from openai import OpenAI
......
...@@ -27,7 +27,7 @@ vllm serve mistralai/Mistral-7B-Instruct-v0.1 ...@@ -27,7 +27,7 @@ vllm serve mistralai/Mistral-7B-Instruct-v0.1
- Use the `OpenAIGenerator` and `OpenAIChatGenerator` components in Haystack to query the vLLM server. - Use the `OpenAIGenerator` and `OpenAIChatGenerator` components in Haystack to query the vLLM server.
??? Code ??? code
```python ```python
from haystack.components.generators.chat import OpenAIChatGenerator from haystack.components.generators.chat import OpenAIChatGenerator
......
...@@ -34,7 +34,7 @@ vllm serve qwen/Qwen1.5-0.5B-Chat ...@@ -34,7 +34,7 @@ vllm serve qwen/Qwen1.5-0.5B-Chat
- Call it with litellm: - Call it with litellm:
??? Code ??? code
```python ```python
import litellm import litellm
......
...@@ -17,7 +17,7 @@ vLLM can be deployed with [LWS](https://github.com/kubernetes-sigs/lws) on Kuber ...@@ -17,7 +17,7 @@ vLLM can be deployed with [LWS](https://github.com/kubernetes-sigs/lws) on Kuber
Deploy the following yaml file `lws.yaml` Deploy the following yaml file `lws.yaml`
??? Yaml ??? code "Yaml"
```yaml ```yaml
apiVersion: leaderworkerset.x-k8s.io/v1 apiVersion: leaderworkerset.x-k8s.io/v1
...@@ -177,7 +177,7 @@ curl http://localhost:8080/v1/completions \ ...@@ -177,7 +177,7 @@ curl http://localhost:8080/v1/completions \
The output should be similar to the following The output should be similar to the following
??? Output ??? console "Output"
```text ```text
{ {
......
...@@ -24,7 +24,7 @@ sky check ...@@ -24,7 +24,7 @@ sky check
See the vLLM SkyPilot YAML for serving, [serving.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/vllm/serve.yaml). See the vLLM SkyPilot YAML for serving, [serving.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/vllm/serve.yaml).
??? Yaml ??? code "Yaml"
```yaml ```yaml
resources: resources:
...@@ -95,7 +95,7 @@ HF_TOKEN="your-huggingface-token" \ ...@@ -95,7 +95,7 @@ HF_TOKEN="your-huggingface-token" \
SkyPilot can scale up the service to multiple service replicas with built-in autoscaling, load-balancing and fault-tolerance. You can do it by adding a services section to the YAML file. SkyPilot can scale up the service to multiple service replicas with built-in autoscaling, load-balancing and fault-tolerance. You can do it by adding a services section to the YAML file.
??? Yaml ??? code "Yaml"
```yaml ```yaml
service: service:
...@@ -111,7 +111,7 @@ SkyPilot can scale up the service to multiple service replicas with built-in aut ...@@ -111,7 +111,7 @@ SkyPilot can scale up the service to multiple service replicas with built-in aut
max_completion_tokens: 1 max_completion_tokens: 1
``` ```
??? Yaml ??? code "Yaml"
```yaml ```yaml
service: service:
...@@ -186,7 +186,7 @@ vllm 2 1 xx.yy.zz.245 18 mins ago 1x GCP([Spot]{'L4': 1}) R ...@@ -186,7 +186,7 @@ vllm 2 1 xx.yy.zz.245 18 mins ago 1x GCP([Spot]{'L4': 1}) R
After the service is READY, you can find a single endpoint for the service and access the service with the endpoint: After the service is READY, you can find a single endpoint for the service and access the service with the endpoint:
??? Commands ??? console "Commands"
```bash ```bash
ENDPOINT=$(sky serve status --endpoint 8081 vllm) ENDPOINT=$(sky serve status --endpoint 8081 vllm)
...@@ -220,7 +220,7 @@ service: ...@@ -220,7 +220,7 @@ service:
This will scale the service up to when the QPS exceeds 2 for each replica. This will scale the service up to when the QPS exceeds 2 for each replica.
??? Yaml ??? code "Yaml"
```yaml ```yaml
service: service:
...@@ -285,7 +285,7 @@ sky serve down vllm ...@@ -285,7 +285,7 @@ sky serve down vllm
It is also possible to access the Llama-3 service with a separate GUI frontend, so the user requests send to the GUI will be load-balanced across replicas. It is also possible to access the Llama-3 service with a separate GUI frontend, so the user requests send to the GUI will be load-balanced across replicas.
??? Yaml ??? code "Yaml"
```yaml ```yaml
envs: envs:
......
...@@ -60,7 +60,7 @@ And then you can send out a query to the OpenAI-compatible API to check the avai ...@@ -60,7 +60,7 @@ And then you can send out a query to the OpenAI-compatible API to check the avai
curl -o- http://localhost:30080/models curl -o- http://localhost:30080/models
``` ```
??? Output ??? console "Output"
```json ```json
{ {
...@@ -89,7 +89,7 @@ curl -X POST http://localhost:30080/completions \ ...@@ -89,7 +89,7 @@ curl -X POST http://localhost:30080/completions \
}' }'
``` ```
??? Output ??? console "Output"
```json ```json
{ {
...@@ -121,7 +121,7 @@ sudo helm uninstall vllm ...@@ -121,7 +121,7 @@ sudo helm uninstall vllm
The core vLLM production stack configuration is managed with YAML. Here is the example configuration used in the installation above: The core vLLM production stack configuration is managed with YAML. Here is the example configuration used in the installation above:
??? Yaml ??? code "Yaml"
```yaml ```yaml
servingEngineSpec: servingEngineSpec:
......
...@@ -29,7 +29,7 @@ Alternatively, you can deploy vLLM to Kubernetes using any of the following: ...@@ -29,7 +29,7 @@ Alternatively, you can deploy vLLM to Kubernetes using any of the following:
First, create a Kubernetes PVC and Secret for downloading and storing Hugging Face model: First, create a Kubernetes PVC and Secret for downloading and storing Hugging Face model:
??? Config ??? console "Config"
```bash ```bash
cat <<EOF |kubectl apply -f - cat <<EOF |kubectl apply -f -
...@@ -57,7 +57,7 @@ First, create a Kubernetes PVC and Secret for downloading and storing Hugging Fa ...@@ -57,7 +57,7 @@ First, create a Kubernetes PVC and Secret for downloading and storing Hugging Fa
Next, start the vLLM server as a Kubernetes Deployment and Service: Next, start the vLLM server as a Kubernetes Deployment and Service:
??? Config ??? console "Config"
```bash ```bash
cat <<EOF |kubectl apply -f - cat <<EOF |kubectl apply -f -
......
...@@ -36,7 +36,7 @@ docker build . -f Dockerfile.nginx --tag nginx-lb ...@@ -36,7 +36,7 @@ docker build . -f Dockerfile.nginx --tag nginx-lb
Create a file named `nginx_conf/nginx.conf`. Note that you can add as many servers as you'd like. In the below example we'll start with two. To add more, add another `server vllmN:8000 max_fails=3 fail_timeout=10000s;` entry to `upstream backend`. Create a file named `nginx_conf/nginx.conf`. Note that you can add as many servers as you'd like. In the below example we'll start with two. To add more, add another `server vllmN:8000 max_fails=3 fail_timeout=10000s;` entry to `upstream backend`.
??? Config ??? console "Config"
```console ```console
upstream backend { upstream backend {
...@@ -95,7 +95,7 @@ Notes: ...@@ -95,7 +95,7 @@ Notes:
- The below example assumes GPU backend used. If you are using CPU backend, remove `--gpus device=ID`, add `VLLM_CPU_KVCACHE_SPACE` and `VLLM_CPU_OMP_THREADS_BIND` environment variables to the docker run command. - The below example assumes GPU backend used. If you are using CPU backend, remove `--gpus device=ID`, add `VLLM_CPU_KVCACHE_SPACE` and `VLLM_CPU_OMP_THREADS_BIND` environment variables to the docker run command.
- Adjust the model name that you want to use in your vLLM servers if you don't want to use `Llama-2-7b-chat-hf`. - Adjust the model name that you want to use in your vLLM servers if you don't want to use `Llama-2-7b-chat-hf`.
??? Commands ??? console "Commands"
```console ```console
mkdir -p ~/.cache/huggingface/hub/ mkdir -p ~/.cache/huggingface/hub/
......
...@@ -22,7 +22,7 @@ server. ...@@ -22,7 +22,7 @@ server.
Here is a sample of `LLM` class usage: Here is a sample of `LLM` class usage:
??? Code ??? code
```python ```python
from vllm import LLM, SamplingParams from vllm import LLM, SamplingParams
...@@ -180,7 +180,7 @@ vision-language model. ...@@ -180,7 +180,7 @@ vision-language model.
To avoid accidentally passing incorrect arguments, the constructor is now keyword-only. This ensures that the constructor will raise an error if old configurations are passed. vLLM developers have already made this change for all models within vLLM. For out-of-tree registered models, developers need to update their models, for example by adding shim code to adapt the old constructor signature to the new one: To avoid accidentally passing incorrect arguments, the constructor is now keyword-only. This ensures that the constructor will raise an error if old configurations are passed. vLLM developers have already made this change for all models within vLLM. For out-of-tree registered models, developers need to update their models, for example by adding shim code to adapt the old constructor signature to the new one:
??? Code ??? code
```python ```python
class MyOldModel(nn.Module): class MyOldModel(nn.Module):
......
...@@ -448,7 +448,7 @@ elements of the entire head for all context tokens. However, overall, ...@@ -448,7 +448,7 @@ elements of the entire head for all context tokens. However, overall,
all results for output have been calculated but are just stored in all results for output have been calculated but are just stored in
different thread register memory. different thread register memory.
??? Code ??? code
```cpp ```cpp
float* out_smem = reinterpret_cast<float*>(shared_mem); float* out_smem = reinterpret_cast<float*>(shared_mem);
......