Unverified Commit ba5c5e54 authored by Harry Mellor's avatar Harry Mellor Committed by GitHub
Browse files

[Docs] Switch to better markdown linting pre-commit hook (#21851)


Signed-off-by: default avatarHarry Mellor <19981378+hmellor@users.noreply.github.com>
parent 555e7225
...@@ -19,9 +19,9 @@ vllm serve Qwen/Qwen1.5-32B-Chat-AWQ --max-model-len 4096 ...@@ -19,9 +19,9 @@ vllm serve Qwen/Qwen1.5-32B-Chat-AWQ --max-model-len 4096
- Download and install [Anything LLM desktop](https://anythingllm.com/desktop). - Download and install [Anything LLM desktop](https://anythingllm.com/desktop).
- On the bottom left of open settings, AI Prooviders --> LLM: - On the bottom left of open settings, AI Prooviders --> LLM:
- LLM Provider: Generic OpenAI - LLM Provider: Generic OpenAI
- Base URL: http://{vllm server host}:{vllm server port}/v1 - Base URL: http://{vllm server host}:{vllm server port}/v1
- Chat Model Name: `Qwen/Qwen1.5-32B-Chat-AWQ` - Chat Model Name: `Qwen/Qwen1.5-32B-Chat-AWQ`
![](../../assets/deployment/anything-llm-provider.png) ![](../../assets/deployment/anything-llm-provider.png)
...@@ -30,9 +30,9 @@ vllm serve Qwen/Qwen1.5-32B-Chat-AWQ --max-model-len 4096 ...@@ -30,9 +30,9 @@ vllm serve Qwen/Qwen1.5-32B-Chat-AWQ --max-model-len 4096
![](../../assets/deployment/anything-llm-chat-without-doc.png) ![](../../assets/deployment/anything-llm-chat-without-doc.png)
- Click the upload button: - Click the upload button:
- upload the doc - upload the doc
- select the doc and move to the workspace - select the doc and move to the workspace
- save and embed - save and embed
![](../../assets/deployment/anything-llm-upload-doc.png) ![](../../assets/deployment/anything-llm-upload-doc.png)
......
...@@ -19,11 +19,11 @@ vllm serve qwen/Qwen1.5-0.5B-Chat ...@@ -19,11 +19,11 @@ vllm serve qwen/Qwen1.5-0.5B-Chat
- Download and install [Chatbox desktop](https://chatboxai.app/en#download). - Download and install [Chatbox desktop](https://chatboxai.app/en#download).
- On the bottom left of settings, Add Custom Provider - On the bottom left of settings, Add Custom Provider
- API Mode: `OpenAI API Compatible` - API Mode: `OpenAI API Compatible`
- Name: vllm - Name: vllm
- API Host: `http://{vllm server host}:{vllm server port}/v1` - API Host: `http://{vllm server host}:{vllm server port}/v1`
- API Path: `/chat/completions` - API Path: `/chat/completions`
- Model: `qwen/Qwen1.5-0.5B-Chat` - Model: `qwen/Qwen1.5-0.5B-Chat`
![](../../assets/deployment/chatbox-settings.png) ![](../../assets/deployment/chatbox-settings.png)
......
...@@ -34,11 +34,11 @@ docker compose up -d ...@@ -34,11 +34,11 @@ docker compose up -d
- In the top-right user menu (under the profile icon), go to Settings, then click `Model Provider`, and locate the `vLLM` provider to install it. - In the top-right user menu (under the profile icon), go to Settings, then click `Model Provider`, and locate the `vLLM` provider to install it.
- Fill in the model provider details as follows: - Fill in the model provider details as follows:
- **Model Type**: `LLM` - **Model Type**: `LLM`
- **Model Name**: `Qwen/Qwen1.5-7B-Chat` - **Model Name**: `Qwen/Qwen1.5-7B-Chat`
- **API Endpoint URL**: `http://{vllm_server_host}:{vllm_server_port}/v1` - **API Endpoint URL**: `http://{vllm_server_host}:{vllm_server_port}/v1`
- **Model Name for API Endpoint**: `Qwen/Qwen1.5-7B-Chat` - **Model Name for API Endpoint**: `Qwen/Qwen1.5-7B-Chat`
- **Completion Mode**: `Completion` - **Completion Mode**: `Completion`
![](../../assets/deployment/dify-settings.png) ![](../../assets/deployment/dify-settings.png)
......
# Haystack # Haystack
# Haystack
[Haystack](https://github.com/deepset-ai/haystack) is an end-to-end LLM framework that allows you to build applications powered by LLMs, Transformer models, vector search and more. Whether you want to perform retrieval-augmented generation (RAG), document search, question answering or answer generation, Haystack can orchestrate state-of-the-art embedding models and LLMs into pipelines to build end-to-end NLP applications and solve your use case. [Haystack](https://github.com/deepset-ai/haystack) is an end-to-end LLM framework that allows you to build applications powered by LLMs, Transformer models, vector search and more. Whether you want to perform retrieval-augmented generation (RAG), document search, question answering or answer generation, Haystack can orchestrate state-of-the-art embedding models and LLMs into pipelines to build end-to-end NLP applications and solve your use case.
It allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints. It allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints.
......
...@@ -3,6 +3,7 @@ ...@@ -3,6 +3,7 @@
[Retrieval-augmented generation (RAG)](https://en.wikipedia.org/wiki/Retrieval-augmented_generation) is a technique that enables generative artificial intelligence (Gen AI) models to retrieve and incorporate new information. It modifies interactions with a large language model (LLM) so that the model responds to user queries with reference to a specified set of documents, using this information to supplement information from its pre-existing training data. This allows LLMs to use domain-specific and/or updated information. Use cases include providing chatbot access to internal company data or generating responses based on authoritative sources. [Retrieval-augmented generation (RAG)](https://en.wikipedia.org/wiki/Retrieval-augmented_generation) is a technique that enables generative artificial intelligence (Gen AI) models to retrieve and incorporate new information. It modifies interactions with a large language model (LLM) so that the model responds to user queries with reference to a specified set of documents, using this information to supplement information from its pre-existing training data. This allows LLMs to use domain-specific and/or updated information. Use cases include providing chatbot access to internal company data or generating responses based on authoritative sources.
Here are the integrations: Here are the integrations:
- vLLM + [langchain](https://github.com/langchain-ai/langchain) + [milvus](https://github.com/milvus-io/milvus) - vLLM + [langchain](https://github.com/langchain-ai/langchain) + [milvus](https://github.com/milvus-io/milvus)
- vLLM + [llamaindex](https://github.com/run-llama/llama_index) + [milvus](https://github.com/milvus-io/milvus) - vLLM + [llamaindex](https://github.com/run-llama/llama_index) + [milvus](https://github.com/milvus-io/milvus)
......
...@@ -140,11 +140,12 @@ The core vLLM production stack configuration is managed with YAML. Here is the e ...@@ -140,11 +140,12 @@ The core vLLM production stack configuration is managed with YAML. Here is the e
``` ```
In this YAML configuration: In this YAML configuration:
* **`modelSpec`** includes: * **`modelSpec`** includes:
* `name`: A nickname that you prefer to call the model. * `name`: A nickname that you prefer to call the model.
* `repository`: Docker repository of vLLM. * `repository`: Docker repository of vLLM.
* `tag`: Docker image tag. * `tag`: Docker image tag.
* `modelURL`: The LLM model that you want to use. * `modelURL`: The LLM model that you want to use.
* **`replicaCount`**: Number of replicas. * **`replicaCount`**: Number of replicas.
* **`requestCPU` and `requestMemory`**: Specifies the CPU and memory resource requests for the pod. * **`requestCPU` and `requestMemory`**: Specifies the CPU and memory resource requests for the pod.
* **`requestGPU`**: Specifies the number of GPUs required. * **`requestGPU`**: Specifies the number of GPUs required.
......
...@@ -5,7 +5,7 @@ Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine le ...@@ -5,7 +5,7 @@ Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine le
- [Deployment with CPUs](#deployment-with-cpus) - [Deployment with CPUs](#deployment-with-cpus)
- [Deployment with GPUs](#deployment-with-gpus) - [Deployment with GPUs](#deployment-with-gpus)
- [Troubleshooting](#troubleshooting) - [Troubleshooting](#troubleshooting)
- [Startup Probe or Readiness Probe Failure, container log contains "KeyboardInterrupt: terminated"](#startup-probe-or-readiness-probe-failure-container-log-contains-keyboardinterrupt-terminated) - [Startup Probe or Readiness Probe Failure, container log contains "KeyboardInterrupt: terminated"](#startup-probe-or-readiness-probe-failure-container-log-contains-keyboardinterrupt-terminated)
- [Conclusion](#conclusion) - [Conclusion](#conclusion)
Alternatively, you can deploy vLLM to Kubernetes using any of the following: Alternatively, you can deploy vLLM to Kubernetes using any of the following:
......
...@@ -361,7 +361,7 @@ instances in Prometheus. ...@@ -361,7 +361,7 @@ instances in Prometheus.
We use this concept for the `vllm:cache_config_info` metric: We use this concept for the `vllm:cache_config_info` metric:
``` ```text
# HELP vllm:cache_config_info Information of the LLMEngine CacheConfig # HELP vllm:cache_config_info Information of the LLMEngine CacheConfig
# TYPE vllm:cache_config_info gauge # TYPE vllm:cache_config_info gauge
vllm:cache_config_info{block_size="16",cache_dtype="auto",calculate_kv_scales="False",cpu_offload_gb="0",enable_prefix_caching="False",gpu_memory_utilization="0.9",...} 1.0 vllm:cache_config_info{block_size="16",cache_dtype="auto",calculate_kv_scales="False",cpu_offload_gb="0",enable_prefix_caching="False",gpu_memory_utilization="0.9",...} 1.0
...@@ -686,7 +686,7 @@ documentation for this option states: ...@@ -686,7 +686,7 @@ documentation for this option states:
The metrics were added by <gh-pr:7089> and who up in an OpenTelemetry trace The metrics were added by <gh-pr:7089> and who up in an OpenTelemetry trace
as: as:
``` ```text
-> gen_ai.latency.time_in_scheduler: Double(0.017550230026245117) -> gen_ai.latency.time_in_scheduler: Double(0.017550230026245117)
-> gen_ai.latency.time_in_model_forward: Double(3.151565277099609) -> gen_ai.latency.time_in_model_forward: Double(3.151565277099609)
-> gen_ai.latency.time_in_model_execute: Double(3.6468167304992676) -> gen_ai.latency.time_in_model_execute: Double(3.6468167304992676)
......
...@@ -5,6 +5,7 @@ An implementation of xPyD with dynamic scaling based on point-to-point communica ...@@ -5,6 +5,7 @@ An implementation of xPyD with dynamic scaling based on point-to-point communica
## Detailed Design ## Detailed Design
### Overall Process ### Overall Process
As shown in Figure 1, the overall process of this **PD disaggregation** solution is described through a request flow: As shown in Figure 1, the overall process of this **PD disaggregation** solution is described through a request flow:
1. The client sends an HTTP request to the Proxy/Router's `/v1/completions` interface. 1. The client sends an HTTP request to the Proxy/Router's `/v1/completions` interface.
...@@ -23,7 +24,7 @@ A simple HTTP service acts as the entry point for client requests and starts a b ...@@ -23,7 +24,7 @@ A simple HTTP service acts as the entry point for client requests and starts a b
The Proxy/Router is responsible for selecting 1P1D based on the characteristics of the client request, such as the prompt, and generating a corresponding `request_id`, for example: The Proxy/Router is responsible for selecting 1P1D based on the characteristics of the client request, such as the prompt, and generating a corresponding `request_id`, for example:
``` ```text
cmpl-___prefill_addr_10.0.1.2:21001___decode_addr_10.0.1.3:22001_93923d63113b4b338973f24d19d4bf11-0 cmpl-___prefill_addr_10.0.1.2:21001___decode_addr_10.0.1.3:22001_93923d63113b4b338973f24d19d4bf11-0
``` ```
...@@ -70,6 +71,7 @@ pip install "vllm>=0.9.2" ...@@ -70,6 +71,7 @@ pip install "vllm>=0.9.2"
## Run xPyD ## Run xPyD
### Instructions ### Instructions
- The following examples are run on an A800 (80GB) device, using the Meta-Llama-3.1-8B-Instruct model. - The following examples are run on an A800 (80GB) device, using the Meta-Llama-3.1-8B-Instruct model.
- Pay attention to the setting of the `kv_buffer_size` (in bytes). The empirical value is 10% of the GPU memory size. This is related to the kvcache size. If it is too small, the GPU memory buffer for temporarily storing the received kvcache will overflow, causing the kvcache to be stored in the tensor memory pool, which increases latency. If it is too large, the kvcache available for inference will be reduced, leading to a smaller batch size and decreased throughput. - Pay attention to the setting of the `kv_buffer_size` (in bytes). The empirical value is 10% of the GPU memory size. This is related to the kvcache size. If it is too small, the GPU memory buffer for temporarily storing the received kvcache will overflow, causing the kvcache to be stored in the tensor memory pool, which increases latency. If it is too large, the kvcache available for inference will be reduced, leading to a smaller batch size and decreased throughput.
- For Prefill instances, when using non-GET mode, the `kv_buffer_size` can be set to 1, as Prefill currently does not need to receive kvcache. However, when using GET mode, a larger `kv_buffer_size` is required because it needs to store the kvcache sent to the D instance. - For Prefill instances, when using non-GET mode, the `kv_buffer_size` can be set to 1, as Prefill currently does not need to receive kvcache. However, when using GET mode, a larger `kv_buffer_size` is required because it needs to store the kvcache sent to the D instance.
......
...@@ -18,10 +18,12 @@ In the example above, the KV cache in the first block can be uniquely identified ...@@ -18,10 +18,12 @@ In the example above, the KV cache in the first block can be uniquely identified
* Block tokens: A tuple of tokens in this block. The reason to include the exact tokens is to reduce potential hash value collision. * Block tokens: A tuple of tokens in this block. The reason to include the exact tokens is to reduce potential hash value collision.
* Extra hashes: Other values required to make this block unique, such as LoRA IDs, multi-modality input hashes (see the example below), and cache salts to isolate caches in multi-tenant environments. * Extra hashes: Other values required to make this block unique, such as LoRA IDs, multi-modality input hashes (see the example below), and cache salts to isolate caches in multi-tenant environments.
> **Note 1:** We only cache full blocks. !!! note "Note 1"
We only cache full blocks.
> **Note 2:** The above hash key structure is not 100% collision free. Theoretically it’s still possible for the different prefix tokens to have the same hash value. To avoid any hash collisions **in a multi-tenant setup, we advise to use SHA256** as hash function instead of the default builtin hash. !!! note "Note 2"
SHA256 is supported since vLLM v0.8.3 and must be enabled with a command line argument. It comes with a performance impact of about 100-200ns per token (~6ms for 50k tokens of context). The above hash key structure is not 100% collision free. Theoretically it’s still possible for the different prefix tokens to have the same hash value. To avoid any hash collisions **in a multi-tenant setup, we advise to use SHA256** as hash function instead of the default builtin hash.
SHA256 is supported since vLLM v0.8.3 and must be enabled with a command line argument. It comes with a performance impact of about 100-200ns per token (~6ms for 50k tokens of context).
**A hashing example with multi-modality inputs** **A hashing example with multi-modality inputs**
In this example, we illustrate how prefix caching works with multi-modality inputs (e.g., images). Assuming we have a request with the following messages: In this example, we illustrate how prefix caching works with multi-modality inputs (e.g., images). Assuming we have a request with the following messages:
...@@ -92,7 +94,8 @@ To improve privacy in shared environments, vLLM supports isolating prefix cache ...@@ -92,7 +94,8 @@ To improve privacy in shared environments, vLLM supports isolating prefix cache
With this setup, cache sharing is limited to users or requests that explicitly agree on a common salt, enabling cache reuse within a trust group while isolating others. With this setup, cache sharing is limited to users or requests that explicitly agree on a common salt, enabling cache reuse within a trust group while isolating others.
> **Note:** Cache isolation is not supported in engine V0. !!! note
Cache isolation is not supported in engine V0.
## Data Structure ## Data Structure
......
...@@ -8,7 +8,7 @@ Throughout the example, we will run a common Llama model using v1, and turn on d ...@@ -8,7 +8,7 @@ Throughout the example, we will run a common Llama model using v1, and turn on d
In the very verbose logs, we can see: In the very verbose logs, we can see:
``` ```console
INFO 03-07 03:06:55 [backends.py:409] Using cache directory: ~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0 for vLLM's torch.compile INFO 03-07 03:06:55 [backends.py:409] Using cache directory: ~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0 for vLLM's torch.compile
``` ```
...@@ -75,7 +75,7 @@ Every submodule can be identified by its index, and will be processed individual ...@@ -75,7 +75,7 @@ Every submodule can be identified by its index, and will be processed individual
In the very verbose logs, we can also see: In the very verbose logs, we can also see:
``` ```console
DEBUG 03-07 03:52:37 [backends.py:134] store the 0-th graph for shape None from inductor via handle ('fpegyiq3v3wzjzphd45wkflpabggdbjpylgr7tta4hj6uplstsiw', '~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0/inductor_cache/iw/ciwzrk3ittdqatuzwonnajywvno3llvjcs2vfdldzwzozn3zi3iy.py') DEBUG 03-07 03:52:37 [backends.py:134] store the 0-th graph for shape None from inductor via handle ('fpegyiq3v3wzjzphd45wkflpabggdbjpylgr7tta4hj6uplstsiw', '~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0/inductor_cache/iw/ciwzrk3ittdqatuzwonnajywvno3llvjcs2vfdldzwzozn3zi3iy.py')
DEBUG 03-07 03:52:39 [backends.py:134] store the 1-th graph for shape None from inductor via handle ('f7fmlodmf3h3by5iiu2c4zarwoxbg4eytwr3ujdd2jphl4pospfd', '~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0/inductor_cache/ly/clyfzxldfsj7ehaluis2mca2omqka4r7mgcedlf6xfjh645nw6k2.py') DEBUG 03-07 03:52:39 [backends.py:134] store the 1-th graph for shape None from inductor via handle ('f7fmlodmf3h3by5iiu2c4zarwoxbg4eytwr3ujdd2jphl4pospfd', '~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0/inductor_cache/ly/clyfzxldfsj7ehaluis2mca2omqka4r7mgcedlf6xfjh645nw6k2.py')
... ...
...@@ -93,7 +93,7 @@ One more detail: you can see that the 1-th graph and the 15-th graph have the sa ...@@ -93,7 +93,7 @@ One more detail: you can see that the 1-th graph and the 15-th graph have the sa
If we already have the cache directory (e.g. run the same code for the second time), we will see the following logs: If we already have the cache directory (e.g. run the same code for the second time), we will see the following logs:
``` ```console
DEBUG 03-07 04:00:45 [backends.py:86] Directly load the 0-th graph for shape None from inductor via handle ('fpegyiq3v3wzjzphd45wkflpabggdbjpylgr7tta4hj6uplstsiw', '~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0/inductor_cache/iw/ciwzrk3ittdqatuzwonnajywvno3llvjcs2vfdldzwzozn3zi3iy.py') DEBUG 03-07 04:00:45 [backends.py:86] Directly load the 0-th graph for shape None from inductor via handle ('fpegyiq3v3wzjzphd45wkflpabggdbjpylgr7tta4hj6uplstsiw', '~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0/inductor_cache/iw/ciwzrk3ittdqatuzwonnajywvno3llvjcs2vfdldzwzozn3zi3iy.py')
``` ```
......
...@@ -36,9 +36,9 @@ th:not(:first-child) { ...@@ -36,9 +36,9 @@ th:not(:first-child) {
| Feature | [CP][chunked-prefill] | [APC](automatic_prefix_caching.md) | [LoRA](lora.md) | [SD](spec_decode.md) | CUDA graph | [pooling](../models/pooling_models.md) | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | <abbr title="Logprobs">logP</abbr> | <abbr title="Prompt Logprobs">prmpt logP</abbr> | <abbr title="Async Output Processing">async output</abbr> | multi-step | <abbr title="Multimodal Inputs">mm</abbr> | best-of | beam-search | | Feature | [CP][chunked-prefill] | [APC](automatic_prefix_caching.md) | [LoRA](lora.md) | [SD](spec_decode.md) | CUDA graph | [pooling](../models/pooling_models.md) | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | <abbr title="Logprobs">logP</abbr> | <abbr title="Prompt Logprobs">prmpt logP</abbr> | <abbr title="Async Output Processing">async output</abbr> | multi-step | <abbr title="Multimodal Inputs">mm</abbr> | best-of | beam-search |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---| |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [CP][chunked-prefill] | ✅ | | | | | | | | | | | | | | | | [CP][chunked-prefill] | ✅ | | | | | | | | | | | | | |
| [APC](automatic_prefix_caching.md) | ✅ | ✅ | | | | | | | | | | | | | | | [APC](automatic_prefix_caching.md) | ✅ | ✅ | | | | | | | | | | | | |
| [LoRA](lora.md) | ✅ | ✅ | ✅ | | | | | | | | | | | | | | [LoRA](lora.md) | ✅ | ✅ | ✅ | | | | | | | | | | | |
| [SD](spec_decode.md) | ✅ | ✅ | ❌ | ✅ | | | | | | | | | | | | [SD](spec_decode.md) | ✅ | ✅ | ❌ | ✅ | | | | | | | | | | |
| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | | | | | | | | | | | CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | | | | | | | | | |
| [pooling](../models/pooling_models.md) | ✅\* | ✅\* | ✅ | ❌ | ✅ | ✅ | | | | | | | | | | [pooling](../models/pooling_models.md) | ✅\* | ✅\* | ✅ | ❌ | ✅ | ✅ | | | | | | | | |
......
...@@ -119,6 +119,7 @@ export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True ...@@ -119,6 +119,7 @@ export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
``` ```
### Using API Endpoints ### Using API Endpoints
Loading a LoRA Adapter: Loading a LoRA Adapter:
To dynamically load a LoRA adapter, send a POST request to the `/v1/load_lora_adapter` endpoint with the necessary To dynamically load a LoRA adapter, send a POST request to the `/v1/load_lora_adapter` endpoint with the necessary
...@@ -156,6 +157,7 @@ curl -X POST http://localhost:8000/v1/unload_lora_adapter \ ...@@ -156,6 +157,7 @@ curl -X POST http://localhost:8000/v1/unload_lora_adapter \
``` ```
### Using Plugins ### Using Plugins
Alternatively, you can use the LoRAResolver plugin to dynamically load LoRA adapters. LoRAResolver plugins enable you to load LoRA adapters from both local and remote sources such as local file system and S3. On every request, when there's a new model name that hasn't been loaded yet, the LoRAResolver will try to resolve and load the corresponding LoRA adapter. Alternatively, you can use the LoRAResolver plugin to dynamically load LoRA adapters. LoRAResolver plugins enable you to load LoRA adapters from both local and remote sources such as local file system and S3. On every request, when there's a new model name that hasn't been loaded yet, the LoRAResolver will try to resolve and load the corresponding LoRA adapter.
You can set up multiple LoRAResolver plugins if you want to load LoRA adapters from different sources. For example, you might have one resolver for local files and another for S3 storage. vLLM will load the first LoRA adapter that it finds. You can set up multiple LoRAResolver plugins if you want to load LoRA adapters from different sources. For example, you might have one resolver for local files and another for S3 storage. vLLM will load the first LoRA adapter that it finds.
......
...@@ -588,7 +588,9 @@ Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for ...@@ -588,7 +588,9 @@ Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for
To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model, To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model,
pass a tensor of shape to the corresponding field of the multi-modal dictionary. pass a tensor of shape to the corresponding field of the multi-modal dictionary.
#### Image Embedding Inputs #### Image Embedding Inputs
For image embeddings, you can pass the base64-encoded tensor to the `image_embeds` field. For image embeddings, you can pass the base64-encoded tensor to the `image_embeds` field.
The following example demonstrates how to pass image embeddings to the OpenAI server: The following example demonstrates how to pass image embeddings to the OpenAI server:
......
...@@ -97,7 +97,7 @@ for output in outputs: ...@@ -97,7 +97,7 @@ for output in outputs:
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
``` ```
# Acknowledgement ## Acknowledgement
Special thanks to open-source low precision libraries such as AutoGPTQ, AutoAWQ, GPTQModel, Triton, Marlin, and Special thanks to open-source low precision libraries such as AutoGPTQ, AutoAWQ, GPTQModel, Triton, Marlin, and
ExLLaMAV2 for providing low-precision CUDA kernels, which are leveraged in AutoRound. ExLLaMAV2 for providing low-precision CUDA kernels, which are leveraged in AutoRound.
...@@ -134,8 +134,8 @@ lm_eval --model vllm \ ...@@ -134,8 +134,8 @@ lm_eval --model vllm \
- Employ the chat template or instruction template that the model was trained with - Employ the chat template or instruction template that the model was trained with
- If you've fine-tuned a model, consider using a sample of your training data for calibration - If you've fine-tuned a model, consider using a sample of your training data for calibration
- Tune key hyperparameters to the quantization algorithm: - Tune key hyperparameters to the quantization algorithm:
- `dampening_frac` sets how much influence the GPTQ algorithm has. Lower values can improve accuracy, but can lead to numerical instabilities that cause the algorithm to fail. - `dampening_frac` sets how much influence the GPTQ algorithm has. Lower values can improve accuracy, but can lead to numerical instabilities that cause the algorithm to fail.
- `actorder` sets the activation ordering. When compressing the weights of a layer weight, the order in which channels are quantized matters. Setting `actorder="weight"` can improve accuracy without added latency. - `actorder` sets the activation ordering. When compressing the weights of a layer weight, the order in which channels are quantized matters. Setting `actorder="weight"` can improve accuracy without added latency.
The following is an example of an expanded quantization recipe you can tune to your own use case: The following is an example of an expanded quantization recipe you can tune to your own use case:
......
...@@ -50,6 +50,7 @@ Here is an example of how to enable FP8 quantization: ...@@ -50,6 +50,7 @@ Here is an example of how to enable FP8 quantization:
``` ```
The `kv_cache_dtype` argument specifies the data type for KV cache storage: The `kv_cache_dtype` argument specifies the data type for KV cache storage:
- `"auto"`: Uses the model's default "unquantized" data type - `"auto"`: Uses the model's default "unquantized" data type
- `"fp8"` or `"fp8_e4m3"`: Supported on CUDA 11.8+ and ROCm (AMD GPU) - `"fp8"` or `"fp8_e4m3"`: Supported on CUDA 11.8+ and ROCm (AMD GPU)
- `"fp8_e5m2"`: Supported on CUDA 11.8+ - `"fp8_e5m2"`: Supported on CUDA 11.8+
......
...@@ -213,6 +213,7 @@ lm_eval --model vllm \ ...@@ -213,6 +213,7 @@ lm_eval --model vllm \
``` ```
## Quark Quantization Script ## Quark Quantization Script
In addition to the example of Python API above, Quark also offers a In addition to the example of Python API above, Quark also offers a
[quantization script](https://quark.docs.amd.com/latest/pytorch/example_quark_torch_llm_ptq.html) [quantization script](https://quark.docs.amd.com/latest/pytorch/example_quark_torch_llm_ptq.html)
to quantize large language models more conveniently. It supports quantizing models with variety to quantize large language models more conveniently. It supports quantizing models with variety
......
...@@ -13,6 +13,7 @@ pip install \ ...@@ -13,6 +13,7 @@ pip install \
``` ```
## Quantizing HuggingFace Models ## Quantizing HuggingFace Models
You can quantize your own huggingface model with torchao, e.g. [transformers](https://huggingface.co/docs/transformers/main/en/quantization/torchao) and [diffusers](https://huggingface.co/docs/diffusers/en/quantization/torchao), and save the checkpoint to huggingface hub like [this](https://huggingface.co/jerryzh168/llama3-8b-int8wo) with the following example code: You can quantize your own huggingface model with torchao, e.g. [transformers](https://huggingface.co/docs/transformers/main/en/quantization/torchao) and [diffusers](https://huggingface.co/docs/diffusers/en/quantization/torchao), and save the checkpoint to huggingface hub like [this](https://huggingface.co/jerryzh168/llama3-8b-int8wo) with the following example code:
??? code ??? code
......
...@@ -164,7 +164,7 @@ Note, it is recommended to manually reserve 1 CPU for vLLM front-end process whe ...@@ -164,7 +164,7 @@ Note, it is recommended to manually reserve 1 CPU for vLLM front-end process whe
### How to decide `VLLM_CPU_KVCACHE_SPACE`? ### How to decide `VLLM_CPU_KVCACHE_SPACE`?
- This value is 4GB by default. Larger space can support more concurrent requests, longer context length. However, users should take care of memory capacity of each NUMA node. The memory usage of each TP rank is the sum of `weight shard size` and `VLLM_CPU_KVCACHE_SPACE`, if it exceeds the capacity of a single NUMA node, the TP worker will be killed with `exitcode 9` due to out-of-memory. This value is 4GB by default. Larger space can support more concurrent requests, longer context length. However, users should take care of memory capacity of each NUMA node. The memory usage of each TP rank is the sum of `weight shard size` and `VLLM_CPU_KVCACHE_SPACE`, if it exceeds the capacity of a single NUMA node, the TP worker will be killed with `exitcode 9` due to out-of-memory.
### How to do performance tuning for vLLM CPU? ### How to do performance tuning for vLLM CPU?
...@@ -183,13 +183,13 @@ vLLM CPU supports tensor parallel (TP) and pipeline parallel (PP) to leverage mu ...@@ -183,13 +183,13 @@ vLLM CPU supports tensor parallel (TP) and pipeline parallel (PP) to leverage mu
### Which quantization configs does vLLM CPU support? ### Which quantization configs does vLLM CPU support?
- vLLM CPU supports quantizations: - vLLM CPU supports quantizations:
- AWQ (x86 only) - AWQ (x86 only)
- GPTQ (x86 only) - GPTQ (x86 only)
- compressed-tensor INT8 W8A8 (x86, s390x) - compressed-tensor INT8 W8A8 (x86, s390x)
### (x86 only) What is the purpose of `VLLM_CPU_MOE_PREPACK` and `VLLM_CPU_SGL_KERNEL`? ### (x86 only) What is the purpose of `VLLM_CPU_MOE_PREPACK` and `VLLM_CPU_SGL_KERNEL`?
- Both of them requires `amx` CPU flag. - Both of them requires `amx` CPU flag.
- `VLLM_CPU_MOE_PREPACK` can provides better performance for MoE models - `VLLM_CPU_MOE_PREPACK` can provides better performance for MoE models
- `VLLM_CPU_SGL_KERNEL` can provides better performance for MoE models and small-batch scenarios. - `VLLM_CPU_SGL_KERNEL` can provides better performance for MoE models and small-batch scenarios.
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment