Unverified commit f17aec0d, authored by Reid, committed by GitHub

[doc] Fold long code blocks to improve readability (#19926)


Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
parent 493c2753
...@@ -91,7 +91,7 @@ source to unblock the update process. ...@@ -91,7 +91,7 @@ source to unblock the update process.
### FlashInfer ### FlashInfer
Here is how to build and install it from source with torch2.7.0+cu128 in vLLM [Dockerfile](https://github.com/vllm-project/vllm/blob/27bebcd89792d5c4b08af7a65095759526f2f9e1/docker/Dockerfile#L259-L271): Here is how to build and install it from source with torch2.7.0+cu128 in vLLM [Dockerfile](https://github.com/vllm-project/vllm/blob/27bebcd89792d5c4b08af7a65095759526f2f9e1/docker/Dockerfile#L259-L271):
``` ```bash
export TORCH_CUDA_ARCH_LIST='7.5 8.0 8.9 9.0 10.0+PTX' export TORCH_CUDA_ARCH_LIST='7.5 8.0 8.9 9.0 10.0+PTX'
export FLASHINFER_ENABLE_SM90=1 export FLASHINFER_ENABLE_SM90=1
uv pip install --system --no-build-isolation "git+https://github.com/flashinfer-ai/flashinfer@v0.2.6.post1" uv pip install --system --no-build-isolation "git+https://github.com/flashinfer-ai/flashinfer@v0.2.6.post1"
...@@ -105,14 +105,14 @@ team if you want to get the package published there. ...@@ -105,14 +105,14 @@ team if you want to get the package published there.
### xFormers ### xFormers
Similar to FlashInfer, here is how to build and install xFormers from source: Similar to FlashInfer, here is how to build and install xFormers from source:
``` ```bash
export TORCH_CUDA_ARCH_LIST='7.0 7.5 8.0 8.9 9.0 10.0+PTX' export TORCH_CUDA_ARCH_LIST='7.0 7.5 8.0 8.9 9.0 10.0+PTX'
MAX_JOBS=16 uv pip install --system --no-build-isolation "git+https://github.com/facebookresearch/xformers@v0.0.30" MAX_JOBS=16 uv pip install --system --no-build-isolation "git+https://github.com/facebookresearch/xformers@v0.0.30"
``` ```
### Mamba ### Mamba
``` ```bash
uv pip install --system --no-build-isolation "git+https://github.com/state-spaces/mamba@v2.2.4" uv pip install --system --no-build-isolation "git+https://github.com/state-spaces/mamba@v2.2.4"
``` ```
......
...@@ -16,35 +16,33 @@ vllm {chat,complete,serve,bench,collect-env,run-batch} ...@@ -16,35 +16,33 @@ vllm {chat,complete,serve,bench,collect-env,run-batch}
Start the vLLM OpenAI Compatible API server. Start the vLLM OpenAI Compatible API server.
Examples: ??? Examples
```bash ```bash
# Start with a model # Start with a model
vllm serve meta-llama/Llama-2-7b-hf vllm serve meta-llama/Llama-2-7b-hf
# Specify the port # Specify the port
vllm serve meta-llama/Llama-2-7b-hf --port 8100 vllm serve meta-llama/Llama-2-7b-hf --port 8100
# Check with --help for more options # Check with --help for more options
# To list all groups # To list all groups
vllm serve --help=listgroup vllm serve --help=listgroup
# To view an argument group # To view an argument group
vllm serve --help=ModelConfig vllm serve --help=ModelConfig
# To view a single argument # To view a single argument
vllm serve --help=max-num-seqs vllm serve --help=max-num-seqs
# To search by keyword # To search by keyword
vllm serve --help=max vllm serve --help=max
``` ```
## chat ## chat
Generate chat completions via the running API server. Generate chat completions via the running API server.
Examples:
```bash ```bash
# Directly connect to localhost API without arguments # Directly connect to localhost API without arguments
vllm chat vllm chat
...@@ -60,8 +58,6 @@ vllm chat --quick "hi" ...@@ -60,8 +58,6 @@ vllm chat --quick "hi"
Generate text completions based on the given prompt via the running API server. Generate text completions based on the given prompt via the running API server.
Examples:
```bash ```bash
# Directly connect to localhost API without arguments # Directly connect to localhost API without arguments
vllm complete vllm complete
...@@ -73,6 +69,8 @@ vllm complete --url http://{vllm-serve-host}:{vllm-serve-port}/v1 ...@@ -73,6 +69,8 @@ vllm complete --url http://{vllm-serve-host}:{vllm-serve-port}/v1
vllm complete --quick "The future of AI is" vllm complete --quick "The future of AI is"
``` ```
</details>
## bench ## bench
Run benchmark tests for latency, online serving throughput, and offline inference throughput. Run benchmark tests for latency, online serving throughput, and offline inference throughput.
...@@ -89,8 +87,6 @@ vllm bench {latency, serve, throughput} ...@@ -89,8 +87,6 @@ vllm bench {latency, serve, throughput}
Benchmark the latency of a single batch of requests. Benchmark the latency of a single batch of requests.
Example:
```bash ```bash
vllm bench latency \ vllm bench latency \
--model meta-llama/Llama-3.2-1B-Instruct \ --model meta-llama/Llama-3.2-1B-Instruct \
...@@ -104,8 +100,6 @@ vllm bench latency \ ...@@ -104,8 +100,6 @@ vllm bench latency \
Benchmark the online serving throughput. Benchmark the online serving throughput.
Example:
```bash ```bash
vllm bench serve \ vllm bench serve \
--model meta-llama/Llama-3.2-1B-Instruct \ --model meta-llama/Llama-3.2-1B-Instruct \
...@@ -120,8 +114,6 @@ vllm bench serve \ ...@@ -120,8 +114,6 @@ vllm bench serve \
Benchmark offline inference throughput. Benchmark offline inference throughput.
Example:
```bash ```bash
vllm bench throughput \ vllm bench throughput \
--model meta-llama/Llama-3.2-1B-Instruct \ --model meta-llama/Llama-3.2-1B-Instruct \
...@@ -143,7 +135,8 @@ vllm collect-env ...@@ -143,7 +135,8 @@ vllm collect-env
Run batch prompts and write results to file. Run batch prompts and write results to file.
Examples: <details>
<summary>Examples</summary>
```bash ```bash
# Running with a local file # Running with a local file
...@@ -159,6 +152,8 @@ vllm run-batch \ ...@@ -159,6 +152,8 @@ vllm run-batch \
--model meta-llama/Meta-Llama-3-8B-Instruct --model meta-llama/Meta-Llama-3-8B-Instruct
``` ```
</details>
## More Help ## More Help
For detailed options of any subcommand, use: For detailed options of any subcommand, use:
......
...@@ -57,19 +57,21 @@ By default, we optimize model inference using CUDA graphs which take up extra me ...@@ -57,19 +57,21 @@ By default, we optimize model inference using CUDA graphs which take up extra me
You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage: You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage:
```python ??? Code
from vllm import LLM
from vllm.config import CompilationConfig, CompilationLevel ```python
from vllm import LLM
llm = LLM( from vllm.config import CompilationConfig, CompilationLevel
model="meta-llama/Llama-3.1-8B-Instruct",
compilation_config=CompilationConfig( llm = LLM(
level=CompilationLevel.PIECEWISE, model="meta-llama/Llama-3.1-8B-Instruct",
# By default, it goes up to max_num_seqs compilation_config=CompilationConfig(
cudagraph_capture_sizes=[1, 2, 4, 8, 16], level=CompilationLevel.PIECEWISE,
), # By default, it goes up to max_num_seqs
) cudagraph_capture_sizes=[1, 2, 4, 8, 16],
``` ),
)
```
You can disable graph capturing completely via the `enforce_eager` flag: You can disable graph capturing completely via the `enforce_eager` flag:
...@@ -127,18 +129,20 @@ reduce the size of the processed multi-modal inputs, which in turn saves memory. ...@@ -127,18 +129,20 @@ reduce the size of the processed multi-modal inputs, which in turn saves memory.
Here are some examples: Here are some examples:
```python ??? Code
from vllm import LLM
# Available for Qwen2-VL series models ```python
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct", from vllm import LLM
mm_processor_kwargs={
"max_pixels": 768 * 768, # Default is 1280 * 28 * 28 # Available for Qwen2-VL series models
}) llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
mm_processor_kwargs={
# Available for InternVL series models "max_pixels": 768 * 768, # Default is 1280 * 28 * 28
llm = LLM(model="OpenGVLab/InternVL2-2B", })
mm_processor_kwargs={
"max_dynamic_patch": 4, # Default is 12 # Available for InternVL series models
}) llm = LLM(model="OpenGVLab/InternVL2-2B",
``` mm_processor_kwargs={
"max_dynamic_patch": 4, # Default is 12
})
```
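As a rough illustration of what lowering `max_pixels` buys, the arithmetic below compares the two pixel budgets quoted in the example. The assumption that one visual token covers a 28 x 28 pixel area is ours, made only for illustration; the actual token count depends on the model's processor.

```python
# Assumption (illustrative only): one visual token covers a 28 x 28 pixel area.
PIXELS_PER_TOKEN = 28 * 28


def approx_visual_tokens(max_pixels: int) -> int:
    """Rough upper bound on visual tokens for a given pixel budget."""
    return max_pixels // PIXELS_PER_TOKEN


default_budget = 1280 * 28 * 28  # default quoted in the example
reduced_budget = 768 * 768       # reduced value used in the example

print(approx_visual_tokens(default_budget))  # 1280
print(approx_visual_tokens(reduced_budget))  # 752
print(f"{reduced_budget / default_budget:.0%} of the default pixel budget")  # 59%
```

Under that assumption, the reduced setting caps each image at a little under 60% of the default budget.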
...@@ -7,6 +7,8 @@ vLLM uses the following environment variables to configure the system: ...@@ -7,6 +7,8 @@ vLLM uses the following environment variables to configure the system:
All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables). All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).
```python ??? Code
--8<-- "vllm/envs.py:env-vars-definition"
``` ```python
--8<-- "vllm/envs.py:env-vars-definition"
```
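To see why a Service named `vllm` is risky, the sketch below lists the variable names Kubernetes would inject for a given Service name, following the naming pattern described in the linked Kubernetes documentation. The helper names and port 8000 are hypothetical and chosen for illustration; they are not part of vLLM or Kubernetes.

```python
def k8s_service_env_names(service_name: str, port: int) -> list[str]:
    """Names Kubernetes injects for a Service (upper-cased, dashes -> underscores)."""
    prefix = service_name.upper().replace("-", "_")
    return [
        f"{prefix}_SERVICE_HOST",
        f"{prefix}_SERVICE_PORT",
        f"{prefix}_PORT",
        f"{prefix}_PORT_{port}_TCP",
        f"{prefix}_PORT_{port}_TCP_PROTO",
        f"{prefix}_PORT_{port}_TCP_PORT",
        f"{prefix}_PORT_{port}_TCP_ADDR",
    ]


def names_in_vllm_namespace(service_name: str, port: int = 8000) -> list[str]:
    """Injected names that fall inside the `VLLM_` prefix reserved by vLLM."""
    return [
        name
        for name in k8s_service_env_names(service_name, port)
        if name.startswith("VLLM_")
    ]


print(names_in_vllm_namespace("vllm"))        # every injected name collides
print(names_in_vllm_namespace("llm-server"))  # []
```

Any Service name that does not upper-case to `VLLM` avoids the overlap.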
...@@ -93,25 +93,27 @@ For additional features and advanced configurations, refer to the official [MkDo ...@@ -93,25 +93,27 @@ For additional features and advanced configurations, refer to the official [MkDo
## Testing ## Testing
```bash ??? note "Commands"
pip install -r requirements/dev.txt
# Linting, formatting and static type checking ```bash
pre-commit install --hook-type pre-commit --hook-type commit-msg pip install -r requirements/dev.txt
# You can manually run pre-commit with # Linting, formatting and static type checking
pre-commit run --all-files pre-commit install --hook-type pre-commit --hook-type commit-msg
# To manually run something from CI that does not run # You can manually run pre-commit with
# locally by default, you can run: pre-commit run --all-files
pre-commit run mypy-3.9 --hook-stage manual --all-files
# Unit tests # To manually run something from CI that does not run
pytest tests/ # locally by default, you can run:
pre-commit run mypy-3.9 --hook-stage manual --all-files
# Run tests for a single test file with detailed output # Unit tests
pytest -s -v tests/test_logger.py pytest tests/
```
# Run tests for a single test file with detailed output
pytest -s -v tests/test_logger.py
```
!!! tip !!! tip
Since the <gh-file:docker/Dockerfile> ships with Python 3.12, all tests in CI (except `mypy`) are run with Python 3.12. Since the <gh-file:docker/Dockerfile> ships with Python 3.12, all tests in CI (except `mypy`) are run with Python 3.12.
......
...@@ -27,33 +27,35 @@ All vLLM modules within the model must include a `prefix` argument in their cons ...@@ -27,33 +27,35 @@ All vLLM modules within the model must include a `prefix` argument in their cons
The initialization code should look like this: The initialization code should look like this:
```python ??? Code
from torch import nn
from vllm.config import VllmConfig ```python
from vllm.attention import Attention from torch import nn
from vllm.config import VllmConfig
class MyAttention(nn.Module): from vllm.attention import Attention
def __init__(self, vllm_config: VllmConfig, prefix: str):
super().__init__() class MyAttention(nn.Module):
self.attn = Attention(prefix=f"{prefix}.attn") def __init__(self, vllm_config: VllmConfig, prefix: str):
super().__init__()
class MyDecoderLayer(nn.Module): self.attn = Attention(prefix=f"{prefix}.attn")
def __init__(self, vllm_config: VllmConfig, prefix: str):
super().__init__() class MyDecoderLayer(nn.Module):
self.self_attn = MyAttention(prefix=f"{prefix}.self_attn") def __init__(self, vllm_config: VllmConfig, prefix: str):
super().__init__()
class MyModel(nn.Module): self.self_attn = MyAttention(prefix=f"{prefix}.self_attn")
def __init__(self, vllm_config: VllmConfig, prefix: str):
super().__init__() class MyModel(nn.Module):
self.layers = nn.ModuleList( def __init__(self, vllm_config: VllmConfig, prefix: str):
[MyDecoderLayer(vllm_config, prefix=f"{prefix}.layers.{i}") for i in range(vllm_config.model_config.hf_config.num_hidden_layers)] super().__init__()
) self.layers = nn.ModuleList(
[MyDecoderLayer(vllm_config, prefix=f"{prefix}.layers.{i}") for i in range(vllm_config.model_config.hf_config.num_hidden_layers)]
class MyModelForCausalLM(nn.Module): )
def __init__(self, vllm_config: VllmConfig, prefix: str = ""):
super().__init__() class MyModelForCausalLM(nn.Module):
self.model = MyModel(vllm_config, prefix=f"{prefix}.model") def __init__(self, vllm_config: VllmConfig, prefix: str = ""):
``` super().__init__()
self.model = MyModel(vllm_config, prefix=f"{prefix}.model")
```
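The effect of threading `prefix` through the constructors can be sketched without vLLM or PyTorch. The snippet below only composes the strings. `child_prefix` and `attention_prefixes` are hypothetical helpers, and skipping the dot for an empty parent prefix is an illustrative choice rather than something the snippet above does.

```python
def child_prefix(parent: str, name: str) -> str:
    """Join a parent prefix and a child name, skipping the dot for an empty parent."""
    return f"{parent}.{name}" if parent else name


def attention_prefixes(num_layers: int, root: str = "") -> list[str]:
    """Fully qualified names of the attention modules in a toy model tree."""
    model = child_prefix(root, "model")
    names = []
    for i in range(num_layers):
        layer = child_prefix(model, f"layers.{i}")
        self_attn = child_prefix(layer, "self_attn")
        names.append(child_prefix(self_attn, "attn"))
    return names


print(attention_prefixes(2))
# ['model.layers.0.self_attn.attn', 'model.layers.1.self_attn.attn']
```

Each attention layer ends up with a unique dotted name that mirrors its position in the module tree.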
### Computation Code ### Computation Code
......
...@@ -97,26 +97,26 @@ to manually kill the profiler and generate your `nsys-rep` report. ...@@ -97,26 +97,26 @@ to manually kill the profiler and generate your `nsys-rep` report.
You can view these profiles either as summaries in the CLI, using `nsys stats [profile-file]`, or in the GUI by installing Nsight [locally following the directions here](https://developer.nvidia.com/nsight-systems/get-started). You can view these profiles either as summaries in the CLI, using `nsys stats [profile-file]`, or in the GUI by installing Nsight [locally following the directions here](https://developer.nvidia.com/nsight-systems/get-started).
CLI example: ??? CLI example
```bash ```bash
nsys stats report1.nsys-rep nsys stats report1.nsys-rep
... ...
** CUDA GPU Kernel Summary (cuda_gpu_kern_sum): ** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ----------- ----------- -------- --------- ----------- ---------------------------------------------------------------------------------------------------- -------- --------------- --------- ----------- ----------- -------- --------- ----------- ----------------------------------------------------------------------------------------------------
46.3 10,327,352,338 17,505 589,965.9 144,383.0 27,040 3,126,460 944,263.8 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_of… 46.3 10,327,352,338 17,505 589,965.9 144,383.0 27,040 3,126,460 944,263.8 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_of…
14.8 3,305,114,764 5,152 641,520.7 293,408.0 287,296 2,822,716 867,124.9 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize256x128x64_warpgroupsize2x1x1_execute_segment_k_of… 14.8 3,305,114,764 5,152 641,520.7 293,408.0 287,296 2,822,716 867,124.9 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize256x128x64_warpgroupsize2x1x1_execute_segment_k_of…
12.1 2,692,284,876 14,280 188,535.4 83,904.0 19,328 2,862,237 497,999.9 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x128x64_warpgroupsize1x1x1_execute_segment_k_off… 12.1 2,692,284,876 14,280 188,535.4 83,904.0 19,328 2,862,237 497,999.9 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x128x64_warpgroupsize1x1x1_execute_segment_k_off…
9.5 2,116,600,578 33,920 62,399.8 21,504.0 15,326 2,532,285 290,954.1 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x64x64_warpgroupsize1x1x1_execute_segment_k_off_… 9.5 2,116,600,578 33,920 62,399.8 21,504.0 15,326 2,532,285 290,954.1 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x64x64_warpgroupsize1x1x1_execute_segment_k_off_…
5.0 1,119,749,165 18,912 59,208.4 9,056.0 6,784 2,578,366 271,581.7 void vllm::act_and_mul_kernel<c10::BFloat16, &vllm::silu_kernel<c10::BFloat16>, (bool)1>(T1 *, cons… 5.0 1,119,749,165 18,912 59,208.4 9,056.0 6,784 2,578,366 271,581.7 void vllm::act_and_mul_kernel<c10::BFloat16, &vllm::silu_kernel<c10::BFloat16>, (bool)1>(T1 *, cons…
4.1 916,662,515 21,312 43,011.6 19,776.0 8,928 2,586,205 199,790.1 void cutlass::device_kernel<flash::enable_sm90_or_later<flash::FlashAttnFwdSm90<flash::CollectiveMa… 4.1 916,662,515 21,312 43,011.6 19,776.0 8,928 2,586,205 199,790.1 void cutlass::device_kernel<flash::enable_sm90_or_later<flash::FlashAttnFwdSm90<flash::CollectiveMa…
2.6 587,283,113 37,824 15,526.7 3,008.0 2,719 2,517,756 139,091.1 std::enable_if<T2>(int)0&&vllm::_typeConvert<T1>::exists, void>::type vllm::fused_add_rms_norm_kern… 2.6 587,283,113 37,824 15,526.7 3,008.0 2,719 2,517,756 139,091.1 std::enable_if<T2>(int)0&&vllm::_typeConvert<T1>::exists, void>::type vllm::fused_add_rms_norm_kern…
1.9 418,362,605 18,912 22,121.5 3,871.0 3,328 2,523,870 175,248.2 void vllm::rotary_embedding_kernel<c10::BFloat16, (bool)1>(const long *, T1 *, T1 *, const T1 *, in 1.9 418,362,605 18,912 22,121.5 3,871.0 3,328 2,523,870 175,248.2 void vllm::rotary_embedding_kernel<c10::BFloat16, (bool)1>(const long *, T1 *, T1 *, const T1 *, in…
0.7 167,083,069 18,880 8,849.7 2,240.0 1,471 2,499,996 101,436.1 void vllm::reshape_and_cache_flash_kernel<__nv_bfloat16, __nv_bfloat16, (vllm::Fp8KVCacheDataType)0… 0.7 167,083,069 18,880 8,849.7 2,240.0 1,471 2,499,996 101,436.1 void vllm::reshape_and_cache_flash_kernel<__nv_bfloat16, __nv_bfloat16, (vllm::Fp8KVCacheDataType)0…
... ...
``` ```
GUI example: GUI example:
......
...@@ -97,19 +97,21 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `-- ...@@ -97,19 +97,21 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `--
flags to speed up build process. However, ensure your `max_jobs` is substantially larger than `nvcc_threads` to get the most benefits. flags to speed up build process. However, ensure your `max_jobs` is substantially larger than `nvcc_threads` to get the most benefits.
Keep an eye on memory usage with parallel jobs as it can be substantial (see example below). Keep an eye on memory usage with parallel jobs as it can be substantial (see example below).
```console ??? Command
# Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
python3 use_existing_torch.py ```console
DOCKER_BUILDKIT=1 docker build . \ # Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
--file docker/Dockerfile \ python3 use_existing_torch.py
--target vllm-openai \ DOCKER_BUILDKIT=1 docker build . \
--platform "linux/arm64" \ --file docker/Dockerfile \
-t vllm/vllm-gh200-openai:latest \ --target vllm-openai \
--build-arg max_jobs=66 \ --platform "linux/arm64" \
--build-arg nvcc_threads=2 \ -t vllm/vllm-gh200-openai:latest \
--build-arg torch_cuda_arch_list="9.0 10.0+PTX" \ --build-arg max_jobs=66 \
--build-arg vllm_fa_cmake_gpu_arches="90-real" --build-arg nvcc_threads=2 \
``` --build-arg torch_cuda_arch_list="9.0 10.0+PTX" \
--build-arg vllm_fa_cmake_gpu_arches="90-real"
```
!!! note !!! note
If you are building the `linux/arm64` image on a non-ARM host (e.g., an x86_64 machine), you need to ensure your system is set up for cross-compilation using QEMU. This allows your host machine to emulate ARM64 execution. If you are building the `linux/arm64` image on a non-ARM host (e.g., an x86_64 machine), you need to ensure your system is set up for cross-compilation using QEMU. This allows your host machine to emulate ARM64 execution.
......
...@@ -30,51 +30,53 @@ python -m vllm.entrypoints.openai.api_server \ ...@@ -30,51 +30,53 @@ python -m vllm.entrypoints.openai.api_server \
- Call it with AutoGen: - Call it with AutoGen:
```python ??? Code
import asyncio
from autogen_core.models import UserMessage ```python
from autogen_ext.models.openai import OpenAIChatCompletionClient import asyncio
from autogen_core.models import ModelFamily from autogen_core.models import UserMessage
from autogen_ext.models.openai import OpenAIChatCompletionClient
from autogen_core.models import ModelFamily
async def main() -> None:
# Create a model client
model_client = OpenAIChatCompletionClient( async def main() -> None:
model="mistralai/Mistral-7B-Instruct-v0.2", # Create a model client
base_url="http://{your-vllm-host-ip}:{your-vllm-host-port}/v1", model_client = OpenAIChatCompletionClient(
api_key="EMPTY", model="mistralai/Mistral-7B-Instruct-v0.2",
model_info={ base_url="http://{your-vllm-host-ip}:{your-vllm-host-port}/v1",
"vision": False, api_key="EMPTY",
"function_calling": False, model_info={
"json_output": False, "vision": False,
"family": ModelFamily.MISTRAL, "function_calling": False,
"structured_output": True, "json_output": False,
}, "family": ModelFamily.MISTRAL,
) "structured_output": True,
},
messages = [UserMessage(content="Write a very short story about a dragon.", source="user")] )
# Create a stream. messages = [UserMessage(content="Write a very short story about a dragon.", source="user")]
stream = model_client.create_stream(messages=messages)
# Create a stream.
# Iterate over the stream and print the responses. stream = model_client.create_stream(messages=messages)
print("Streamed responses:")
async for response in stream: # Iterate over the stream and print the responses.
if isinstance(response, str): print("Streamed responses:")
# A partial response is a string. async for response in stream:
print(response, flush=True, end="") if isinstance(response, str):
else: # A partial response is a string.
# The last response is a CreateResult object with the complete message. print(response, flush=True, end="")
print("\n\n------------\n") else:
print("The complete response:", flush=True) # The last response is a CreateResult object with the complete message.
print(response.content, flush=True) print("\n\n------------\n")
print("The complete response:", flush=True)
# Close the client when done. print(response.content, flush=True)
await model_client.close()
# Close the client when done.
await model_client.close()
asyncio.run(main())
```
asyncio.run(main())
```
For details, see the tutorial: For details, see the tutorial:
......
...@@ -34,25 +34,27 @@ vllm = "latest" ...@@ -34,25 +34,27 @@ vllm = "latest"
Next, add the code that handles inference for the LLM of your choice (`mistralai/Mistral-7B-Instruct-v0.1` in this example) to your `main.py`: Next, add the code that handles inference for the LLM of your choice (`mistralai/Mistral-7B-Instruct-v0.1` in this example) to your `main.py`:
```python ??? Code
from vllm import LLM, SamplingParams
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1") ```python
from vllm import LLM, SamplingParams
def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95): llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")
sampling_params = SamplingParams(temperature=temperature, top_p=top_p) def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
outputs = llm.generate(prompts, sampling_params)
# Print the outputs. sampling_params = SamplingParams(temperature=temperature, top_p=top_p)
results = [] outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
results.append({"prompt": prompt, "generated_text": generated_text})
return {"results": results} # Print the outputs.
``` results = []
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
results.append({"prompt": prompt, "generated_text": generated_text})
return {"results": results}
```
Then, run the following code to deploy it to the cloud: Then, run the following code to deploy it to the cloud:
...@@ -62,47 +64,51 @@ cerebrium deploy ...@@ -62,47 +64,51 @@ cerebrium deploy
If successful, you should receive a cURL command that you can use to run inference. Remember to end the URL with the name of the function you are calling (in our case, `/run`). If successful, you should receive a cURL command that you can use to run inference. Remember to end the URL with the name of the function you are calling (in our case, `/run`).
```python ??? Command
curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
-H 'Content-Type: application/json' \ ```python
-H 'Authorization: <JWT TOKEN>' \ curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
--data '{ -H 'Content-Type: application/json' \
"prompts": [ -H 'Authorization: <JWT TOKEN>' \
"Hello, my name is", --data '{
"The president of the United States is", "prompts": [
"The capital of France is", "Hello, my name is",
"The future of AI is" "The president of the United States is",
] "The capital of France is",
}' "The future of AI is"
``` ]
}'
```
You should get a response like: You should get a response like:
```python ??? Response
{
"run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262", ```python
"result": { {
"result": [ "run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262",
{ "result": {
"prompt": "Hello, my name is", "result": [
"generated_text": " Sarah, and I'm a teacher. I teach elementary school students. One of" {
}, "prompt": "Hello, my name is",
{ "generated_text": " Sarah, and I'm a teacher. I teach elementary school students. One of"
"prompt": "The president of the United States is", },
"generated_text": " elected every four years. This is a democratic system.\n\n5. What" {
}, "prompt": "The president of the United States is",
{ "generated_text": " elected every four years. This is a democratic system.\n\n5. What"
"prompt": "The capital of France is", },
"generated_text": " Paris.\n" {
}, "prompt": "The capital of France is",
{ "generated_text": " Paris.\n"
"prompt": "The future of AI is", },
"generated_text": " bright, but it's important to approach it with a balanced and nuanced perspective." {
} "prompt": "The future of AI is",
] "generated_text": " bright, but it's important to approach it with a balanced and nuanced perspective."
}, }
"run_time_ms": 152.53663063049316 ]
} },
``` "run_time_ms": 152.53663063049316
}
```
You now have an autoscaling endpoint where you only pay for the compute you use! You now have an autoscaling endpoint where you only pay for the compute you use!
...@@ -26,75 +26,81 @@ dstack init ...@@ -26,75 +26,81 @@ dstack init
Next, to provision a VM instance with LLM of your choice (`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`: Next, to provision a VM instance with LLM of your choice (`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`:
```yaml ??? Config
type: service
```yaml
python: "3.11" type: service
env:
- MODEL=NousResearch/Llama-2-7b-chat-hf python: "3.11"
port: 8000 env:
resources: - MODEL=NousResearch/Llama-2-7b-chat-hf
gpu: 24GB port: 8000
commands: resources:
- pip install vllm gpu: 24GB
- vllm serve $MODEL --port 8000 commands:
model: - pip install vllm
format: openai - vllm serve $MODEL --port 8000
type: chat model:
name: NousResearch/Llama-2-7b-chat-hf format: openai
``` type: chat
name: NousResearch/Llama-2-7b-chat-hf
```
Then, run the following CLI for provisioning: Then, run the following CLI for provisioning:
```console ??? Command
$ dstack run . -f serve.dstack.yml
```console
⠸ Getting run plan... $ dstack run . -f serve.dstack.yml
Configuration serve.dstack.yml
Project deep-diver-main ⠸ Getting run plan...
User deep-diver Configuration serve.dstack.yml
Min resources 2..xCPU, 8GB.., 1xGPU (24GB) Project deep-diver-main
Max price - User deep-diver
Max duration - Min resources 2..xCPU, 8GB.., 1xGPU (24GB)
Spot policy auto Max price -
Retry policy no Max duration -
Spot policy auto
# BACKEND REGION INSTANCE RESOURCES SPOT PRICE Retry policy no
1 gcp us-central1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804
2 gcp us-east1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804 # BACKEND REGION INSTANCE RESOURCES SPOT PRICE
3 gcp us-west1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804 1 gcp us-central1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804
... 2 gcp us-east1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804
Shown 3 of 193 offers, $5.876 max 3 gcp us-west1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804
...
Continue? [y/n]: y Shown 3 of 193 offers, $5.876 max
⠙ Submitting run...
⠏ Launching spicy-treefrog-1 (pulling) Continue? [y/n]: y
spicy-treefrog-1 provisioning completed (running) ⠙ Submitting run...
Service is published at ... ⠏ Launching spicy-treefrog-1 (pulling)
``` spicy-treefrog-1 provisioning completed (running)
Service is published at ...
```
After the provisioning, you can interact with the model by using the OpenAI SDK: After the provisioning, you can interact with the model by using the OpenAI SDK:
```python ??? Code
from openai import OpenAI
```python
client = OpenAI( from openai import OpenAI
base_url="https://gateway.<gateway domain>",
api_key="<YOUR-DSTACK-SERVER-ACCESS-TOKEN>" client = OpenAI(
) base_url="https://gateway.<gateway domain>",
api_key="<YOUR-DSTACK-SERVER-ACCESS-TOKEN>"
completion = client.chat.completions.create( )
model="NousResearch/Llama-2-7b-chat-hf",
messages=[ completion = client.chat.completions.create(
{ model="NousResearch/Llama-2-7b-chat-hf",
"role": "user", messages=[
"content": "Compose a poem that explains the concept of recursion in programming.", {
} "role": "user",
] "content": "Compose a poem that explains the concept of recursion in programming.",
) }
]
print(completion.choices[0].message.content) )
```
print(completion.choices[0].message.content)
```
!!! note !!! note
dstack automatically handles authentication on the gateway using dstack's tokens. If you don't want to configure a gateway, you can provision a dstack `Task` instead of a `Service`; the `Task` is for development purposes only. For more hands-on material on serving vLLM with dstack, check out [this repository](https://github.com/dstackai/dstack-examples/tree/main/deployment/vllm). dstack automatically handles authentication on the gateway using dstack's tokens. If you don't want to configure a gateway, you can provision a dstack `Task` instead of a `Service`; the `Task` is for development purposes only. For more hands-on material on serving vLLM with dstack, check out [this repository](https://github.com/dstackai/dstack-examples/tree/main/deployment/vllm).
...@@ -27,29 +27,29 @@ vllm serve mistralai/Mistral-7B-Instruct-v0.1 ...@@ -27,29 +27,29 @@ vllm serve mistralai/Mistral-7B-Instruct-v0.1
- Use the `OpenAIGenerator` and `OpenAIChatGenerator` components in Haystack to query the vLLM server. - Use the `OpenAIGenerator` and `OpenAIChatGenerator` components in Haystack to query the vLLM server.
```python ??? Code
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage ```python
from haystack.utils import Secret from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
generator = OpenAIChatGenerator( from haystack.utils import Secret
# for compatibility with the OpenAI API, a placeholder api_key is needed
api_key=Secret.from_token("VLLM-PLACEHOLDER-API-KEY"), generator = OpenAIChatGenerator(
model="mistralai/Mistral-7B-Instruct-v0.1", # for compatibility with the OpenAI API, a placeholder api_key is needed
api_base_url="http://{your-vLLM-host-ip}:{your-vLLM-host-port}/v1", api_key=Secret.from_token("VLLM-PLACEHOLDER-API-KEY"),
generation_kwargs = {"max_tokens": 512} model="mistralai/Mistral-7B-Instruct-v0.1",
) api_base_url="http://{your-vLLM-host-ip}:{your-vLLM-host-port}/v1",
generation_kwargs = {"max_tokens": 512}
response = generator.run( )
messages=[ChatMessage.from_user("Hi. Can you help me plan my next trip to Italy?")]
) response = generator.run(
messages=[ChatMessage.from_user("Hi. Can you help me plan my next trip to Italy?")]
print("-"*30) )
print(response)
print("-"*30) print("-"*30)
``` print(response)
print("-"*30)
Output e.g.: ```
```console ```console
------------------------------ ------------------------------
......
...@@ -34,21 +34,23 @@ vllm serve qwen/Qwen1.5-0.5B-Chat ...@@ -34,21 +34,23 @@ vllm serve qwen/Qwen1.5-0.5B-Chat
- Call it with litellm: - Call it with litellm:
```python ??? Code
import litellm
messages = [{ "content": "Hello, how are you?","role": "user"}] ```python
import litellm
# hosted_vllm is prefix key word and necessary messages = [{ "content": "Hello, how are you?","role": "user"}]
response = litellm.completion(
model="hosted_vllm/qwen/Qwen1.5-0.5B-Chat", # pass the vllm model name # hosted_vllm is prefix key word and necessary
messages=messages, response = litellm.completion(
api_base="http://{your-vllm-server-host}:{your-vllm-server-port}/v1", model="hosted_vllm/qwen/Qwen1.5-0.5B-Chat", # pass the vllm model name
temperature=0.2, messages=messages,
max_tokens=80) api_base="http://{your-vllm-server-host}:{your-vllm-server-port}/v1",
temperature=0.2,
print(response) max_tokens=80)
```
print(response)
```
### Embeddings ### Embeddings
......
...@@ -17,99 +17,101 @@ vLLM can be deployed with [LWS](https://github.com/kubernetes-sigs/lws) on Kuber ...@@ -17,99 +17,101 @@ vLLM can be deployed with [LWS](https://github.com/kubernetes-sigs/lws) on Kuber
Deploy the following yaml file `lws.yaml` Deploy the following yaml file `lws.yaml`
```yaml ??? Yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet ```yaml
metadata: apiVersion: leaderworkerset.x-k8s.io/v1
name: vllm kind: LeaderWorkerSet
spec: metadata:
replicas: 2 name: vllm
leaderWorkerTemplate: spec:
size: 2 replicas: 2
restartPolicy: RecreateGroupOnPodRestart leaderWorkerTemplate:
leaderTemplate: size: 2
metadata: restartPolicy: RecreateGroupOnPodRestart
labels: leaderTemplate:
role: leader metadata:
spec: labels:
containers: role: leader
- name: vllm-leader spec:
image: docker.io/vllm/vllm-openai:latest containers:
env: - name: vllm-leader
- name: HUGGING_FACE_HUB_TOKEN image: docker.io/vllm/vllm-openai:latest
value: <your-hf-token> env:
command: - name: HUGGING_FACE_HUB_TOKEN
- sh value: <your-hf-token>
- -c command:
- "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE); - sh
python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline_parallel_size 2" - -c
resources: - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE);
limits: python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline_parallel_size 2"
nvidia.com/gpu: "8" resources:
memory: 1124Gi limits:
ephemeral-storage: 800Gi nvidia.com/gpu: "8"
requests: memory: 1124Gi
ephemeral-storage: 800Gi ephemeral-storage: 800Gi
cpu: 125 requests:
ports: ephemeral-storage: 800Gi
- containerPort: 8080 cpu: 125
readinessProbe: ports:
tcpSocket: - containerPort: 8080
port: 8080 readinessProbe:
initialDelaySeconds: 15 tcpSocket:
periodSeconds: 10 port: 8080
volumeMounts: initialDelaySeconds: 15
- mountPath: /dev/shm periodSeconds: 10
name: dshm volumeMounts:
volumes: - mountPath: /dev/shm
- name: dshm name: dshm
emptyDir: volumes:
medium: Memory - name: dshm
sizeLimit: 15Gi emptyDir:
workerTemplate: medium: Memory
spec: sizeLimit: 15Gi
containers: workerTemplate:
- name: vllm-worker spec:
image: docker.io/vllm/vllm-openai:latest containers:
command: - name: vllm-worker
- sh image: docker.io/vllm/vllm-openai:latest
- -c command:
- "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=$(LWS_LEADER_ADDRESS)" - sh
resources: - -c
limits: - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
nvidia.com/gpu: "8" resources:
memory: 1124Gi limits:
ephemeral-storage: 800Gi nvidia.com/gpu: "8"
requests: memory: 1124Gi
ephemeral-storage: 800Gi ephemeral-storage: 800Gi
cpu: 125 requests:
env: ephemeral-storage: 800Gi
- name: HUGGING_FACE_HUB_TOKEN cpu: 125
value: <your-hf-token> env:
volumeMounts: - name: HUGGING_FACE_HUB_TOKEN
- mountPath: /dev/shm value: <your-hf-token>
name: dshm volumeMounts:
volumes: - mountPath: /dev/shm
- name: dshm name: dshm
emptyDir: volumes:
medium: Memory - name: dshm
sizeLimit: 15Gi emptyDir:
--- medium: Memory
apiVersion: v1 sizeLimit: 15Gi
kind: Service ---
metadata: apiVersion: v1
name: vllm-leader kind: Service
spec: metadata:
ports: name: vllm-leader
- name: http spec:
port: 8080 ports:
protocol: TCP - name: http
targetPort: 8080 port: 8080
selector: protocol: TCP
leaderworkerset.sigs.k8s.io/name: vllm targetPort: 8080
role: leader selector:
type: ClusterIP leaderworkerset.sigs.k8s.io/name: vllm
``` role: leader
type: ClusterIP
```
```bash ```bash
kubectl apply -f lws.yaml kubectl apply -f lws.yaml
...@@ -175,25 +177,27 @@ curl http://localhost:8080/v1/completions \ ...@@ -175,25 +177,27 @@ curl http://localhost:8080/v1/completions \
The output should be similar to the following The output should be similar to the following
```text ??? Output
{
"id": "cmpl-1bb34faba88b43f9862cfbfb2200949d", ```text
"object": "text_completion",
"created": 1715138766,
"model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
"choices": [
{ {
"index": 0, "id": "cmpl-1bb34faba88b43f9862cfbfb2200949d",
"text": " top destination for foodies, with", "object": "text_completion",
"logprobs": null, "created": 1715138766,
"finish_reason": "length", "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
"stop_reason": null "choices": [
{
"index": 0,
"text": " top destination for foodies, with",
"logprobs": null,
"finish_reason": "length",
"stop_reason": null
}
],
"usage": {
"prompt_tokens": 5,
"total_tokens": 12,
"completion_tokens": 7
}
} }
], ```
"usage": {
"prompt_tokens": 5,
"total_tokens": 12,
"completion_tokens": 7
}
}
```
...@@ -24,48 +24,50 @@ sky check ...@@ -24,48 +24,50 @@ sky check
See the vLLM SkyPilot YAML for serving, [serving.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/vllm/serve.yaml). See the vLLM SkyPilot YAML for serving, [serving.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/vllm/serve.yaml).
```yaml ??? Yaml
resources:
accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model. ```yaml
use_spot: True resources:
disk_size: 512 # Ensure model checkpoints can fit. accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
disk_tier: best use_spot: True
ports: 8081 # Expose to internet traffic. disk_size: 512 # Ensure model checkpoints can fit.
disk_tier: best
envs: ports: 8081 # Expose to internet traffic.
MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
HF_TOKEN: <your-huggingface-token> # Change to your own huggingface token, or use --env to pass. envs:
MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
setup: | HF_TOKEN: <your-huggingface-token> # Change to your own huggingface token, or use --env to pass.
conda create -n vllm python=3.10 -y
conda activate vllm setup: |
conda create -n vllm python=3.10 -y
pip install vllm==0.4.0.post1 conda activate vllm
# Install Gradio for web UI.
pip install gradio openai pip install vllm==0.4.0.post1
pip install flash-attn==2.5.7 # Install Gradio for web UI.
pip install gradio openai
run: | pip install flash-attn==2.5.7
conda activate vllm
echo 'Starting vllm api server...' run: |
python -u -m vllm.entrypoints.openai.api_server \ conda activate vllm
--port 8081 \ echo 'Starting vllm api server...'
--model $MODEL_NAME \ python -u -m vllm.entrypoints.openai.api_server \
--trust-remote-code \ --port 8081 \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \ --model $MODEL_NAME \
2>&1 | tee api_server.log & --trust-remote-code \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
echo 'Waiting for vllm api server to start...' 2>&1 | tee api_server.log &
while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
echo 'Waiting for vllm api server to start...'
echo 'Starting gradio server...' while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
git clone https://github.com/vllm-project/vllm.git || true
python vllm/examples/online_serving/gradio_openai_chatbot_webserver.py \ echo 'Starting gradio server...'
-m $MODEL_NAME \ git clone https://github.com/vllm-project/vllm.git || true
--port 8811 \ python vllm/examples/online_serving/gradio_openai_chatbot_webserver.py \
--model-url http://localhost:8081/v1 \ -m $MODEL_NAME \
--stop-token-ids 128009,128001 --port 8811 \
``` --model-url http://localhost:8081/v1 \
--stop-token-ids 128009,128001
```
Start serving the Llama-3 8B model on any of the candidate GPUs listed (L4, A10g, ...):
...@@ -93,68 +95,67 @@ HF_TOKEN="your-huggingface-token" \ ...@@ -93,68 +95,67 @@ HF_TOKEN="your-huggingface-token" \
SkyPilot can scale the service up to multiple replicas with built-in autoscaling, load balancing, and fault tolerance. To do so, add a `service` section to the YAML file.
```yaml ??? Yaml
service:
replicas: 2 ```yaml
# An actual request for readiness probe. service:
readiness_probe: replicas: 2
path: /v1/chat/completions # An actual request for readiness probe.
post_data: readiness_probe:
model: $MODEL_NAME path: /v1/chat/completions
messages: post_data:
- role: user model: $MODEL_NAME
content: Hello! What is your name? messages:
max_completion_tokens: 1 - role: user
``` content: Hello! What is your name?
<details>
<summary>Click to see the full recipe YAML</summary>
```yaml
service:
replicas: 2
# An actual request for readiness probe.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_completion_tokens: 1 max_completion_tokens: 1
```
resources: ??? Yaml
accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
use_spot: True ```yaml
disk_size: 512 # Ensure model checkpoints can fit. service:
disk_tier: best replicas: 2
ports: 8081 # Expose to internet traffic. # An actual request for readiness probe.
readiness_probe:
envs: path: /v1/chat/completions
MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct post_data:
HF_TOKEN: <your-huggingface-token> # Change to your own huggingface token, or use --env to pass. model: $MODEL_NAME
messages:
setup: | - role: user
conda create -n vllm python=3.10 -y content: Hello! What is your name?
conda activate vllm max_completion_tokens: 1
pip install vllm==0.4.0.post1 resources:
# Install Gradio for web UI. accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
pip install gradio openai use_spot: True
pip install flash-attn==2.5.7 disk_size: 512 # Ensure model checkpoints can fit.
disk_tier: best
run: | ports: 8081 # Expose to internet traffic.
conda activate vllm
echo 'Starting vllm api server...' envs:
python -u -m vllm.entrypoints.openai.api_server \ MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
--port 8081 \ HF_TOKEN: <your-huggingface-token> # Change to your own huggingface token, or use --env to pass.
--model $MODEL_NAME \
--trust-remote-code \ setup: |
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \ conda create -n vllm python=3.10 -y
2>&1 | tee api_server.log conda activate vllm
```
pip install vllm==0.4.0.post1
</details> # Install Gradio for web UI.
pip install gradio openai
pip install flash-attn==2.5.7
run: |
conda activate vllm
echo 'Starting vllm api server...'
python -u -m vllm.entrypoints.openai.api_server \
--port 8081 \
--model $MODEL_NAME \
--trust-remote-code \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
2>&1 | tee api_server.log
```
Start serving the Llama-3 8B model on multiple replicas:
...@@ -170,8 +171,7 @@ Wait until the service is ready: ...@@ -170,8 +171,7 @@ Wait until the service is ready:
watch -n10 sky serve status vllm watch -n10 sky serve status vllm
``` ```
<details> Example outputs:
<summary>Example outputs:</summary>
```console ```console
Services Services
...@@ -184,29 +184,29 @@ vllm 1 1 xx.yy.zz.121 18 mins ago 1x GCP([Spot]{'L4': 1}) R ...@@ -184,29 +184,29 @@ vllm 1 1 xx.yy.zz.121 18 mins ago 1x GCP([Spot]{'L4': 1}) R
vllm 2 1 xx.yy.zz.245 18 mins ago 1x GCP([Spot]{'L4': 1}) READY us-east4 vllm 2 1 xx.yy.zz.245 18 mins ago 1x GCP([Spot]{'L4': 1}) READY us-east4
``` ```
</details>
After the service is READY, you can find a single endpoint for the service and access the service with the endpoint: After the service is READY, you can find a single endpoint for the service and access the service with the endpoint:
```console ??? Commands
ENDPOINT=$(sky serve status --endpoint 8081 vllm)
curl -L http://$ENDPOINT/v1/chat/completions \ ```bash
-H "Content-Type: application/json" \ ENDPOINT=$(sky serve status --endpoint 8081 vllm)
-d '{ curl -L http://$ENDPOINT/v1/chat/completions \
"model": "meta-llama/Meta-Llama-3-8B-Instruct", -H "Content-Type: application/json" \
"messages": [ -d '{
{ "model": "meta-llama/Meta-Llama-3-8B-Instruct",
"role": "system", "messages": [
"content": "You are a helpful assistant." {
}, "role": "system",
{ "content": "You are a helpful assistant."
"role": "user", },
"content": "Who are you?" {
} "role": "user",
], "content": "Who are you?"
"stop_token_ids": [128009, 128001] }
}' ],
``` "stop_token_ids": [128009, 128001]
}'
```
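The same request can be assembled from Python using only the standard library. The sketch below builds the request without sending it; the helper name `build_chat_request` and the `localhost:8081` endpoint are placeholders chosen for illustration, and the real endpoint is whatever `sky serve status --endpoint 8081 vllm` returns.

```python
import json
import urllib.request


def build_chat_request(endpoint: str, model: str, user_message: str) -> urllib.request.Request:
    """Hypothetical helper: build (but do not send) a POST to /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
        # Same stop token IDs as the curl example above.
        "stop_token_ids": [128009, 128001],
    }
    return urllib.request.Request(
        url=f"http://{endpoint}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # "localhost:8081" is a placeholder endpoint, not a value from SkyPilot.
    request = build_chat_request(
        "localhost:8081", "meta-llama/Meta-Llama-3-8B-Instruct", "Who are you?"
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

To actually send the request, pass the returned object to `urllib.request.urlopen` once the service reports READY.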
To enable autoscaling, you could replace the `replicas` with the following configs in `service`: To enable autoscaling, you could replace the `replicas` with the following configs in `service`:
...@@ -220,57 +220,54 @@ service: ...@@ -220,57 +220,54 @@ service:
This will scale the service up when the QPS exceeds 2 for each replica.
<details> ??? Yaml
<summary>Click to see the full recipe YAML</summary>
```yaml
```yaml service:
service: replica_policy:
replica_policy: min_replicas: 2
min_replicas: 2 max_replicas: 4
max_replicas: 4 target_qps_per_replica: 2
target_qps_per_replica: 2 # An actual request for readiness probe.
# An actual request for readiness probe. readiness_probe:
readiness_probe: path: /v1/chat/completions
path: /v1/chat/completions post_data:
post_data: model: $MODEL_NAME
model: $MODEL_NAME messages:
messages: - role: user
- role: user content: Hello! What is your name?
content: Hello! What is your name? max_completion_tokens: 1
max_completion_tokens: 1
resources:
resources: accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model. use_spot: True
use_spot: True disk_size: 512 # Ensure model checkpoints can fit.
disk_size: 512 # Ensure model checkpoints can fit. disk_tier: best
disk_tier: best ports: 8081 # Expose to internet traffic.
ports: 8081 # Expose to internet traffic.
envs:
envs: MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct HF_TOKEN: <your-huggingface-token> # Change to your own huggingface token, or use --env to pass.
HF_TOKEN: <your-huggingface-token> # Change to your own huggingface token, or use --env to pass.
setup: |
setup: | conda create -n vllm python=3.10 -y
conda create -n vllm python=3.10 -y conda activate vllm
conda activate vllm
pip install vllm==0.4.0.post1
pip install vllm==0.4.0.post1 # Install Gradio for web UI.
# Install Gradio for web UI. pip install gradio openai
pip install gradio openai pip install flash-attn==2.5.7
pip install flash-attn==2.5.7
run: |
run: | conda activate vllm
conda activate vllm echo 'Starting vllm api server...'
echo 'Starting vllm api server...' python -u -m vllm.entrypoints.openai.api_server \
python -u -m vllm.entrypoints.openai.api_server \ --port 8081 \
--port 8081 \ --model $MODEL_NAME \
--model $MODEL_NAME \ --trust-remote-code \
--trust-remote-code \ --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \ 2>&1 | tee api_server.log
2>&1 | tee api_server.log ```
```
</details>
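As a rough illustration of what the `replica_policy` values imply, the sketch below derives a replica count from an observed QPS and clamps it between `min_replicas` and `max_replicas`. The function name `desired_replicas` and the simple ceiling rule are assumptions made for illustration; SkyPilot's actual autoscaler may apply smoothing or delays that this sketch ignores.

```python
import math


def desired_replicas(
    qps: float,
    target_qps_per_replica: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Hypothetical helper: replicas needed so each one serves at most the target QPS."""
    if target_qps_per_replica <= 0:
        raise ValueError("target_qps_per_replica must be positive")
    needed = math.ceil(qps / target_qps_per_replica)
    # Clamp to the bounds configured in replica_policy.
    return max(min_replicas, min(max_replicas, needed))


if __name__ == "__main__":
    # Values mirror the replica_policy in the recipe: min 2, max 4, target 2 QPS.
    for qps in (1, 5, 20):
        print(qps, desired_replicas(qps, 2, 2, 4))
```

Under these assumptions the count never drops below 2 or rises above 4, whatever the load.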
To update the service with the new config: To update the service with the new config:
...@@ -288,38 +285,35 @@ sky serve down vllm ...@@ -288,38 +285,35 @@ sky serve down vllm
It is also possible to access the Llama-3 service with a separate GUI frontend, so that user requests sent to the GUI are load-balanced across replicas.
<details> ??? Yaml
<summary>Click to see the full GUI YAML</summary>
```yaml ```yaml
envs: envs:
MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
ENDPOINT: x.x.x.x:3031 # Address of the API server running vllm. ENDPOINT: x.x.x.x:3031 # Address of the API server running vllm.
resources: resources:
cpus: 2 cpus: 2
setup: |
conda create -n vllm python=3.10 -y
conda activate vllm
# Install Gradio for web UI.
pip install gradio openai
run: |
conda activate vllm
export PATH=$PATH:/sbin
echo 'Starting gradio server...'
git clone https://github.com/vllm-project/vllm.git || true
python vllm/examples/online_serving/gradio_openai_chatbot_webserver.py \
-m $MODEL_NAME \
--port 8811 \
--model-url http://$ENDPOINT/v1 \
--stop-token-ids 128009,128001 | tee ~/gradio.log
```
</details> setup: |
conda create -n vllm python=3.10 -y
conda activate vllm
# Install Gradio for web UI.
pip install gradio openai
run: |
conda activate vllm
export PATH=$PATH:/sbin
echo 'Starting gradio server...'
git clone https://github.com/vllm-project/vllm.git || true
python vllm/examples/online_serving/gradio_openai_chatbot_webserver.py \
-m $MODEL_NAME \
--port 8811 \
--model-url http://$ENDPOINT/v1 \
--stop-token-ids 128009,128001 | tee ~/gradio.log
```
1. Start the chat web UI: 1. Start the chat web UI:
......
...@@ -60,22 +60,22 @@ And then you can send out a query to the OpenAI-compatible API to check the avai ...@@ -60,22 +60,22 @@ And then you can send out a query to the OpenAI-compatible API to check the avai
curl -o- http://localhost:30080/models curl -o- http://localhost:30080/models
``` ```
Expected output: ??? Output
```json ```json
{
"object": "list",
"data": [
{ {
"id": "facebook/opt-125m", "object": "list",
"object": "model", "data": [
"created": 1737428424, {
"owned_by": "vllm", "id": "facebook/opt-125m",
"root": null "object": "model",
"created": 1737428424,
"owned_by": "vllm",
"root": null
}
]
} }
] ```
}
```
To send an actual chat request, you can issue a curl request to the OpenAI-compatible `/completions` endpoint:
...@@ -89,23 +89,23 @@ curl -X POST http://localhost:30080/completions \ ...@@ -89,23 +89,23 @@ curl -X POST http://localhost:30080/completions \
}' }'
``` ```
Expected output: ??? Output
```json ```json
{
"id": "completion-id",
"object": "text_completion",
"created": 1737428424,
"model": "facebook/opt-125m",
"choices": [
{ {
"text": " there was a brave knight who...", "id": "completion-id",
"index": 0, "object": "text_completion",
"finish_reason": "length" "created": 1737428424,
"model": "facebook/opt-125m",
"choices": [
{
"text": " there was a brave knight who...",
"index": 0,
"finish_reason": "length"
}
]
} }
] ```
}
```
### Uninstall ### Uninstall
...@@ -121,23 +121,25 @@ sudo helm uninstall vllm ...@@ -121,23 +121,25 @@ sudo helm uninstall vllm
The core vLLM production stack configuration is managed with YAML. Here is the example configuration used in the installation above: The core vLLM production stack configuration is managed with YAML. Here is the example configuration used in the installation above:
```yaml ??? Yaml
servingEngineSpec:
runtimeClassName: ""
modelSpec:
- name: "opt125m"
repository: "vllm/vllm-openai"
tag: "latest"
modelURL: "facebook/opt-125m"
replicaCount: 1 ```yaml
servingEngineSpec:
runtimeClassName: ""
modelSpec:
- name: "opt125m"
repository: "vllm/vllm-openai"
tag: "latest"
modelURL: "facebook/opt-125m"
requestCPU: 6 replicaCount: 1
requestMemory: "16Gi"
requestGPU: 1
pvcStorage: "10Gi" requestCPU: 6
``` requestMemory: "16Gi"
requestGPU: 1
pvcStorage: "10Gi"
```
In this YAML configuration: In this YAML configuration:
* **`modelSpec`** includes: * **`modelSpec`** includes:
......
...@@ -29,85 +29,89 @@ Alternatively, you can deploy vLLM to Kubernetes using any of the following: ...@@ -29,85 +29,89 @@ Alternatively, you can deploy vLLM to Kubernetes using any of the following:
First, create a Kubernetes PVC and Secret for downloading and storing the Hugging Face model:
```bash ??? Config
cat <<EOF |kubectl apply -f -
apiVersion: v1 ```bash
kind: PersistentVolumeClaim cat <<EOF |kubectl apply -f -
metadata: apiVersion: v1
name: vllm-models kind: PersistentVolumeClaim
spec: metadata:
accessModes: name: vllm-models
- ReadWriteOnce spec:
volumeMode: Filesystem accessModes:
resources: - ReadWriteOnce
requests: volumeMode: Filesystem
storage: 50Gi resources:
--- requests:
apiVersion: v1 storage: 50Gi
kind: Secret ---
metadata: apiVersion: v1
name: hf-token-secret kind: Secret
type: Opaque metadata:
data: name: hf-token-secret
token: $(HF_TOKEN) type: Opaque
EOF data:
``` token: $(HF_TOKEN)
EOF
```
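Kubernetes expects values under a Secret's `data` field to be base64-encoded, while plain-text values belong under `stringData`. A minimal sketch of producing the encoded value, assuming the token is available in an `HF_TOKEN` environment variable; the helper name `encode_secret_value` and the fallback token string are illustrative only:

```python
import base64
import os


def encode_secret_value(value: str) -> str:
    """Hypothetical helper: base64-encode a string for a Secret's `data` field."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


if __name__ == "__main__":
    # The fallback value is a placeholder, not a real token.
    token = os.environ.get("HF_TOKEN", "hf_example_token")
    print(encode_secret_value(token))
```

The printed string is what would go after `token:` under `data:` in the manifest.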
Next, start the vLLM server as a Kubernetes Deployment and Service: Next, start the vLLM server as a Kubernetes Deployment and Service:
```bash ??? Config
cat <<EOF |kubectl apply -f -
apiVersion: apps/v1 ```bash
kind: Deployment cat <<EOF |kubectl apply -f -
metadata: apiVersion: apps/v1
name: vllm-server kind: Deployment
spec:
replicas: 1
selector:
matchLabels:
app.kubernetes.io/name: vllm
template:
metadata: metadata:
labels: name: vllm-server
app.kubernetes.io/name: vllm
spec: spec:
containers: replicas: 1
- name: vllm selector:
image: vllm/vllm-openai:latest matchLabels:
command: ["/bin/sh", "-c"] app.kubernetes.io/name: vllm
args: [ template:
"vllm serve meta-llama/Llama-3.2-1B-Instruct" metadata:
] labels:
env: app.kubernetes.io/name: vllm
- name: HUGGING_FACE_HUB_TOKEN spec:
valueFrom: containers:
secretKeyRef: - name: vllm
name: hf-token-secret image: vllm/vllm-openai:latest
key: token command: ["/bin/sh", "-c"]
ports: args: [
- containerPort: 8000 "vllm serve meta-llama/Llama-3.2-1B-Instruct"
volumeMounts: ]
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: token
ports:
- containerPort: 8000
volumeMounts:
- name: llama-storage
mountPath: /root/.cache/huggingface
volumes:
- name: llama-storage - name: llama-storage
mountPath: /root/.cache/huggingface persistentVolumeClaim:
volumes: claimName: vllm-models
- name: llama-storage ---
persistentVolumeClaim: apiVersion: v1
claimName: vllm-models kind: Service
--- metadata:
apiVersion: v1 name: vllm-server
kind: Service spec:
metadata: selector:
name: vllm-server app.kubernetes.io/name: vllm
spec: ports:
selector: - protocol: TCP
app.kubernetes.io/name: vllm port: 8000
ports: targetPort: 8000
- protocol: TCP type: ClusterIP
port: 8000 EOF
targetPort: 8000 ```
type: ClusterIP
EOF
```
We can verify that the vLLM server has started successfully via the logs (this might take a couple of minutes to download the model): We can verify that the vLLM server has started successfully via the logs (this might take a couple of minutes to download the model):
...@@ -128,6 +132,9 @@ INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit) ...@@ -128,6 +132,9 @@ INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
The PVC is used to store the model cache. It is optional: you can use hostPath or other storage options instead.
<details>
<summary>Yaml</summary>
```yaml ```yaml
apiVersion: v1 apiVersion: v1
kind: PersistentVolumeClaim kind: PersistentVolumeClaim
...@@ -144,6 +151,8 @@ INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit) ...@@ -144,6 +151,8 @@ INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
volumeMode: Filesystem volumeMode: Filesystem
``` ```
</details>
The Secret is optional and only required for accessing gated models. You can skip this step if you are not using gated models.
```yaml ```yaml
...@@ -156,13 +165,16 @@ INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit) ...@@ -156,13 +165,16 @@ INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
stringData: stringData:
token: "REPLACE_WITH_TOKEN" token: "REPLACE_WITH_TOKEN"
``` ```
Next, create the deployment file for vLLM to run the model server. The following example deploys the `Mistral-7B-Instruct-v0.3` model.
Here are two examples for using NVIDIA GPU and AMD GPU. Here are two examples for using NVIDIA GPU and AMD GPU.
NVIDIA GPU: NVIDIA GPU:
<details>
<summary>Yaml</summary>
```yaml ```yaml
apiVersion: apps/v1 apiVersion: apps/v1
kind: Deployment kind: Deployment
...@@ -233,10 +245,15 @@ INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit) ...@@ -233,10 +245,15 @@ INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
periodSeconds: 5 periodSeconds: 5
``` ```
</details>
AMD GPU: AMD GPU:
You can refer to the `deployment.yaml` below if you are using an AMD ROCm GPU such as the MI300X.
<details>
<summary>Yaml</summary>
```yaml ```yaml
apiVersion: apps/v1 apiVersion: apps/v1
kind: Deployment kind: Deployment
...@@ -305,12 +322,17 @@ INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit) ...@@ -305,12 +322,17 @@ INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
mountPath: /dev/shm mountPath: /dev/shm
``` ```
</details>
You can get the full example with steps and sample yaml files from <https://github.com/ROCm/k8s-device-plugin/tree/master/example/vllm-serve>. You can get the full example with steps and sample yaml files from <https://github.com/ROCm/k8s-device-plugin/tree/master/example/vllm-serve>.
2. Create a Kubernetes Service for vLLM 2. Create a Kubernetes Service for vLLM
Next, create a Kubernetes Service file to expose the `mistral-7b` deployment: Next, create a Kubernetes Service file to expose the `mistral-7b` deployment:
<details>
<summary>Yaml</summary>
```yaml ```yaml
apiVersion: v1 apiVersion: v1
kind: Service kind: Service
...@@ -330,6 +352,8 @@ INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit) ...@@ -330,6 +352,8 @@ INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
type: ClusterIP type: ClusterIP
``` ```
</details>
3. Deploy and Test 3. Deploy and Test
Apply the deployment and service configurations using `kubectl apply -f <filename>`: Apply the deployment and service configurations using `kubectl apply -f <filename>`:
......
...@@ -36,23 +36,25 @@ docker build . -f Dockerfile.nginx --tag nginx-lb ...@@ -36,23 +36,25 @@ docker build . -f Dockerfile.nginx --tag nginx-lb
Create a file named `nginx_conf/nginx.conf`. Note that you can add as many servers as you'd like. In the below example we'll start with two. To add more, add another `server vllmN:8000 max_fails=3 fail_timeout=10000s;` entry to `upstream backend`. Create a file named `nginx_conf/nginx.conf`. Note that you can add as many servers as you'd like. In the below example we'll start with two. To add more, add another `server vllmN:8000 max_fails=3 fail_timeout=10000s;` entry to `upstream backend`.
```console ??? Config
upstream backend {
least_conn; ```console
server vllm0:8000 max_fails=3 fail_timeout=10000s; upstream backend {
server vllm1:8000 max_fails=3 fail_timeout=10000s; least_conn;
} server vllm0:8000 max_fails=3 fail_timeout=10000s;
server { server vllm1:8000 max_fails=3 fail_timeout=10000s;
listen 80;
location / {
proxy_pass http://backend;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
} }
} server {
``` listen 80;
location / {
proxy_pass http://backend;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
}
```
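The `least_conn` directive in the configuration above tells nginx to pass each new request to the upstream server with the fewest active connections. The selection rule can be sketched as follows; the function name `pick_backend`, the connection counts, and the name-based tie-break are illustrative assumptions, and the sketch ignores nginx details such as server weights and `max_fails` handling.

```python
from typing import Dict


def pick_backend(active_connections: Dict[str, int]) -> str:
    """Hypothetical helper: choose the backend with the fewest active connections."""
    if not active_connections:
        raise ValueError("no backends configured")
    # Ties are broken by server name so the result is deterministic.
    return min(active_connections, key=lambda name: (active_connections[name], name))


if __name__ == "__main__":
    # Connection counts are made up for illustration.
    print(pick_backend({"vllm0:8000": 3, "vllm1:8000": 1}))
```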
[](){ #nginxloadbalancer-nginx-vllm-container } [](){ #nginxloadbalancer-nginx-vllm-container }
...@@ -93,30 +95,32 @@ Notes: ...@@ -93,30 +95,32 @@ Notes:
- The example below assumes a GPU backend. If you are using a CPU backend, remove `--gpus device=ID` and add the `VLLM_CPU_KVCACHE_SPACE` and `VLLM_CPU_OMP_THREADS_BIND` environment variables to the docker run command.
- Adjust the model name that you want to use in your vLLM servers if you don't want to use `Llama-2-7b-chat-hf`. - Adjust the model name that you want to use in your vLLM servers if you don't want to use `Llama-2-7b-chat-hf`.
```console ??? Commands
mkdir -p ~/.cache/huggingface/hub/
hf_cache_dir=~/.cache/huggingface/ ```console
docker run \ mkdir -p ~/.cache/huggingface/hub/
-itd \ hf_cache_dir=~/.cache/huggingface/
--ipc host \ docker run \
--network vllm_nginx \ -itd \
--gpus device=0 \ --ipc host \
--shm-size=10.24gb \ --network vllm_nginx \
-v $hf_cache_dir:/root/.cache/huggingface/ \ --gpus device=0 \
-p 8081:8000 \ --shm-size=10.24gb \
--name vllm0 vllm \ -v $hf_cache_dir:/root/.cache/huggingface/ \
--model meta-llama/Llama-2-7b-chat-hf -p 8081:8000 \
docker run \ --name vllm0 vllm \
-itd \ --model meta-llama/Llama-2-7b-chat-hf
--ipc host \ docker run \
--network vllm_nginx \ -itd \
--gpus device=1 \ --ipc host \
--shm-size=10.24gb \ --network vllm_nginx \
-v $hf_cache_dir:/root/.cache/huggingface/ \ --gpus device=1 \
-p 8082:8000 \ --shm-size=10.24gb \
--name vllm1 vllm \ -v $hf_cache_dir:/root/.cache/huggingface/ \
--model meta-llama/Llama-2-7b-chat-hf -p 8082:8000 \
``` --name vllm1 vllm \
--model meta-llama/Llama-2-7b-chat-hf
```
!!! note !!! note
    If you are behind a proxy, you can pass the proxy settings to the docker run command via `-e http_proxy=$http_proxy -e https_proxy=$https_proxy`.
......
...@@ -22,31 +22,33 @@ server. ...@@ -22,31 +22,33 @@ server.
Here is a sample of `LLM` class usage:

??? Code

    ```python
    from vllm import LLM, SamplingParams

    # Define a list of input prompts
    prompts = [
        "Hello, my name is",
        "The capital of France is",
        "The largest ocean is",
    ]

    # Define sampling parameters
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # Initialize the LLM engine with the OPT-125M model
    llm = LLM(model="facebook/opt-125m")

    # Generate outputs for the input prompts
    outputs = llm.generate(prompts, sampling_params)

    # Print the generated outputs
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
    ```

More API details can be found in the [Offline Inference](#offline-inference-api) section of the API docs.
@@ -178,32 +180,34 @@ vision-language model.
To avoid accidentally passing incorrect arguments, the constructor is now keyword-only. This ensures that the constructor will raise an error if old configurations are passed. vLLM developers have already made this change for all models within vLLM. For out-of-tree registered models, developers need to update their models, for example by adding shim code to adapt the old constructor signature to the new one:

??? Code

    ```python
    class MyOldModel(nn.Module):
        def __init__(
            self,
            config,
            cache_config: Optional[CacheConfig] = None,
            quant_config: Optional[QuantizationConfig] = None,
            lora_config: Optional[LoRAConfig] = None,
            prefix: str = "",
        ) -> None:
            ...

    from vllm.config import VllmConfig
    class MyNewModel(MyOldModel):
        def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
            config = vllm_config.model_config.hf_config
            cache_config = vllm_config.cache_config
            quant_config = vllm_config.quant_config
            lora_config = vllm_config.lora_config
            super().__init__(config, cache_config, quant_config, lora_config, prefix)

    if __version__ >= "0.6.4":
        MyModel = MyNewModel
    else:
        MyModel = MyOldModel
    ```

This way, the model can work with both old and new versions of vLLM.
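
One caveat about the `__version__ >= "0.6.4"` check: it compares strings character by character, so a version such as `"0.10.0"` sorts before `"0.6.4"`. A minimal sketch of a numeric comparison follows, using only the standard library. The helper names `parse_release` and `uses_new_constructor` are hypothetical, and the sketch assumes that pre-release and post-release suffixes (for example `.dev1` or `.post1`) can be ignored when selecting the constructor.

```python
import re


def parse_release(version: str) -> tuple[int, ...]:
    """Return the leading numeric release segment as a tuple of ints.

    For example, "0.10.0.post1" becomes (0, 10, 0). Suffixes such as
    ".post1" or "rc1" are dropped (an assumption of this sketch).
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def uses_new_constructor(version: str, cutoff: str = "0.6.4") -> bool:
    """Report whether `version` is at or past the keyword-only cutoff."""
    return parse_release(version) >= parse_release(cutoff)


# Plain string comparison misorders multi-digit components.
print("0.10.0" >= "0.6.4")             # string comparison
print(uses_new_constructor("0.10.0"))  # numeric comparison
```

In the shim, the condition would then read `if uses_new_constructor(__version__):`, with `__version__` imported from `vllm`. If the `packaging` library is available, its `Version` class handles the full PEP 440 ordering and is likely the more robust choice.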