Commit 7a985548 authored by zhuwenwen's avatar zhuwenwen
Browse files

Merge tag 'v0.9.0' into v0.9.0-ori

parents 45d3785c dc1440cf
(deployment-chatbox)=
# Chatbox
[Chatbox](https://github.com/chatboxai/chatbox) is a desktop client for LLMs, available on Windows, Mac, Linux.
It allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints.
## Prerequisites
- Setup vLLM environment
## Deploy
- Start the vLLM server with the supported chat completion model, e.g.
```console
vllm serve qwen/Qwen1.5-0.5B-Chat
```
- Download and install [Chatbox desktop](https://chatboxai.app/en#download).
- On the bottom left of settings, Add Custom Provider
- API Mode: `OpenAI API Compatible`
- Name: vllm
- API Host: `http://{vllm server host}:{vllm server port}/v1`
- API Path: `/chat/completions`
- Model: `qwen/Qwen1.5-0.5B-Chat`
:::{image} /assets/deployment/chatbox-settings.png
:::
- Go to `Just chat`, and start to chat:
:::{image} /assets/deployment/chatbox-chat.png
:::
(deployment-dify)=
# Dify
[Dify](https://github.com/langgenius/dify) is an open-source LLM app development platform. Its intuitive interface combines agentic AI workflow, RAG pipeline, agent capabilities, model management, observability features, and more, allowing you to quickly move from prototype to production.
It supports vLLM as a model provider to efficiently serve large language models.
This guide walks you through deploying Dify using a vLLM backend.
## Prerequisites
- Setup vLLM environment
- Install [Docker](https://docs.docker.com/engine/install/) and [Docker Compose](https://docs.docker.com/compose/install/)
## Deploy
- Start the vLLM server with the supported chat completion model, e.g.
```console
vllm serve Qwen/Qwen1.5-7B-Chat
```
- Start the Dify server with docker compose ([details](https://github.com/langgenius/dify?tab=readme-ov-file#quick-start)):
```console
git clone https://github.com/langgenius/dify.git
cd dify
cd docker
cp .env.example .env
docker compose up -d
```
- Open the browser to access `http://localhost/install`, config the basic login information and login.
- In the top-right user menu (under the profile icon), go to Settings, then click `Model Provider`, and locate the `vLLM` provider to install it.
- Fill in the model provider details as follows:
- **Model Type**: `LLM`
- **Model Name**: `Qwen/Qwen1.5-7B-Chat`
- **API Endpoint URL**: `http://{vllm_server_host}:{vllm_server_port}/v1`
- **Model Name for API Endpoint**: `Qwen/Qwen1.5-7B-Chat`
- **Completion Mode**: `Completion`
:::{image} /assets/deployment/dify-settings.png
:::
- To create a test chatbot, go to `Studio → Chatbot → Create from Blank`, then select Chatbot as the type:
:::{image} /assets/deployment/dify-create-chatbot.png
:::
- Click the chatbot you just created to open the chat interface and start interacting with the model:
:::{image} /assets/deployment/dify-chat.png
:::
......@@ -6,11 +6,17 @@
anything-llm
bentoml
cerebrium
chatbox
dify
dstack
helm
litellm
lobe-chat
lws
modal
open-webui
retrieval_augmented_generation
skypilot
streamlit
triton
:::
(deployment-litellm)=
# LiteLLM
[LiteLLM](https://github.com/BerriAI/litellm) call all LLM APIs using the OpenAI format [Bedrock, Huggingface, VertexAI, TogetherAI, Azure, OpenAI, Groq etc.]
LiteLLM manages:
- Translate inputs to provider's `completion`, `embedding`, and `image_generation` endpoints
- [Consistent output](https://docs.litellm.ai/docs/completion/output), text responses will always be available at `['choices'][0]['message']['content']`
- Retry/fallback logic across multiple deployments (e.g. Azure/OpenAI) - [Router](https://docs.litellm.ai/docs/routing)
- Set Budgets & Rate limits per project, api key, model [LiteLLM Proxy Server (LLM Gateway)](https://docs.litellm.ai/docs/simple_proxy)
And LiteLLM supports all models on VLLM.
## Prerequisites
- Setup vLLM and litellm environment
```console
pip install vllm litellm
```
## Deploy
### Chat completion
- Start the vLLM server with the supported chat completion model, e.g.
```console
vllm serve qwen/Qwen1.5-0.5B-Chat
```
- Call it with litellm:
```python
import litellm
messages = [{ "content": "Hello, how are you?","role": "user"}]
# hosted_vllm is prefix key word and necessary
response = litellm.completion(
model="hosted_vllm/qwen/Qwen1.5-0.5B-Chat", # pass the vllm model name
messages=messages,
api_base="http://{your-vllm-server-host}:{your-vllm-server-port}/v1",
temperature=0.2,
max_tokens=80)
print(response)
```
### Embeddings
- Start the vLLM server with the supported embedding model, e.g.
```console
vllm serve BAAI/bge-base-en-v1.5
```
- Call it with litellm:
```python
from litellm import embedding
import os
os.environ["HOSTED_VLLM_API_BASE"] = "http://{your-vllm-server-host}:{your-vllm-server-port}/v1"
# hosted_vllm is prefix key word and necessary
# pass the vllm model name
embedding = embedding(model="hosted_vllm/BAAI/bge-base-en-v1.5", input=["Hello world"])
print(embedding)
```
For details, see the tutorial [Using vLLM in LiteLLM](https://docs.litellm.ai/docs/providers/vllm).
(deployment-lobe-chat)=
# Lobe Chat
[Lobe Chat](https://github.com/lobehub/lobe-chat) is an open-source, modern-design ChatGPT/LLMs UI/Framework.
Supports speech-synthesis, multi-modal, and extensible (function call) plugin system.
One-click FREE deployment of your private OpenAI ChatGPT/Claude/Gemini/Groq/Ollama chat application.
It supports vLLM as a AI model provider to efficiently serve large language models.
For details, see the tutorial [Using vLLM in LobeChat](https://lobehub.com/docs/usage/providers/vllm).
(deployment-retrieval-augmented-generation)=
# Retrieval-Augmented Generation
[Retrieval-augmented generation (RAG)](https://en.wikipedia.org/wiki/Retrieval-augmented_generation) is a technique that enables generative artificial intelligence (Gen AI) models to retrieve and incorporate new information. It modifies interactions with a large language model (LLM) so that the model responds to user queries with reference to a specified set of documents, using this information to supplement information from its pre-existing training data. This allows LLMs to use domain-specific and/or updated information. Use cases include providing chatbot access to internal company data or generating responses based on authoritative sources.
Here are the integrations:
- vLLM + [langchain](https://github.com/langchain-ai/langchain) + [milvus](https://github.com/milvus-io/milvus)
- vLLM + [llamaindex](https://github.com/run-llama/llama_index) + [milvus](https://github.com/milvus-io/milvus)
## vLLM + langchain
### Prerequisites
- Setup vLLM and langchain environment
```console
pip install -U vllm \
langchain_milvus langchain_openai \
langchain_community beautifulsoup4 \
langchain-text-splitters
```
### Deploy
- Start the vLLM server with the supported embedding model, e.g.
```console
# Start embedding service (port 8000)
vllm serve ssmits/Qwen2-7B-Instruct-embed-base
```
- Start the vLLM server with the supported chat completion model, e.g.
```console
# Start chat service (port 8001)
vllm serve qwen/Qwen1.5-0.5B-Chat --port 8001
```
- Use the script: <gh-file:examples/online_serving/retrieval_augmented_generation_with_langchain.py>
- Run the script
```python
python retrieval_augmented_generation_with_langchain.py
```
## vLLM + llamaindex
### Prerequisites
- Setup vLLM and llamaindex environment
```console
pip install vllm \
llama-index llama-index-readers-web \
llama-index-llms-openai-like \
llama-index-embeddings-openai-like \
llama-index-vector-stores-milvus \
```
### Deploy
- Start the vLLM server with the supported embedding model, e.g.
```console
# Start embedding service (port 8000)
vllm serve ssmits/Qwen2-7B-Instruct-embed-base
```
- Start the vLLM server with the supported chat completion model, e.g.
```console
# Start chat service (port 8001)
vllm serve qwen/Qwen1.5-0.5B-Chat --port 8001
```
- Use the script: <gh-file:examples/online_serving/retrieval_augmented_generation_with_llamaindex.py>
- Run the script
```python
python retrieval_augmented_generation_with_llamaindex.py
```
(deployment-streamlit)=
# Streamlit
[Streamlit](https://github.com/streamlit/streamlit) lets you transform Python scripts into interactive web apps in minutes, instead of weeks. Build dashboards, generate reports, or create chat apps.
It can be quickly integrated with vLLM as a backend API server, enabling powerful LLM inference via API calls.
## Prerequisites
- Setup vLLM environment
## Deploy
- Start the vLLM server with the supported chat completion model, e.g.
```console
vllm serve qwen/Qwen1.5-0.5B-Chat
```
- Install streamlit and openai:
```console
pip install streamlit openai
```
- Use the script: <gh-file:examples/online_serving/streamlit_openai_chatbot_webserver.py>
- Start the streamlit web UI and start to chat:
```console
streamlit run streamlit_openai_chatbot_webserver.py
# or specify the VLLM_API_BASE or VLLM_API_KEY
VLLM_API_BASE="http://vllm-server-host:vllm-server-port/v1" streamlit run streamlit_openai_chatbot_webserver.py
# start with debug mode to view more details
streamlit run streamlit_openai_chatbot_webserver.py --logger.level=debug
```
:::{image} /assets/deployment/streamlit-chat.png
:::
......@@ -53,6 +53,45 @@ Key points from the PyTorch security guide:
- Implement proper authentication and authorization for management interfaces
- Follow the principle of least privilege for all system components
## Security and Firewalls: Protecting Exposed vLLM Systems
While vLLM is designed to allow unsafe network services to be isolated to
private networks, there are components—such as dependencies and underlying
frameworks—that may open insecure services listening on all network interfaces,
sometimes outside of vLLM's direct control.
A major concern is the use of `torch.distributed`, which vLLM leverages for
distributed communication, including when using vLLM on a single host. When vLLM
uses TCP initialization (see [PyTorch TCP Initialization
documentation](https://docs.pytorch.org/docs/stable/distributed.html#tcp-initialization)),
PyTorch creates a `TCPStore` that, by default, listens on all network
interfaces. This means that unless additional protections are put in place,
these services may be accessible to any host that can reach your machine via any
network interface.
**From a PyTorch perspective, any use of `torch.distributed` should be
considered insecure by default.** This is a known and intentional behavior from
the PyTorch team.
### Firewall Configuration Guidance
The best way to protect your vLLM system is to carefully configure a firewall to
expose only the minimum network surface area necessary. In most cases, this
means:
- **Block all incoming connections except to the TCP port the API server is
listening on.**
- Ensure that ports used for internal communication (such as those for
`torch.distributed` and KV cache transfer) are only accessible from trusted
hosts or networks.
- Never expose these internal ports to the public internet or untrusted
networks.
Consult your operating system or application platform documentation for specific
firewall configuration instructions.
## Reporting Security Vulnerabilities
If you believe you have found a security vulnerability in vLLM, please report it following the project's security policy. For more information on how to report security issues and the project's security policy, please see the [vLLM Security Policy](https://github.com/vllm-project/vllm/blob/main/SECURITY.md).
......@@ -52,8 +52,8 @@ for output in outputs:
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
More API details can be found in the {doc}`Offline Inference
</api/offline_inference/index>` section of the API docs.
More API details can be found in the [Offline Inference]
(#offline-inference-api) section of the API docs.
The code for the `LLM` class can be found in <gh-file:vllm/entrypoints/llm.py>.
......
......@@ -415,8 +415,8 @@ The discussion in <gh-issue:10582> about adding prefix cache metrics yielded
some interesting points which may be relevant to how we approach
future metrics.
Every time the prefix cache is queried, we record the number of blocks
queried and the number of queried blocks present in the cache
Every time the prefix cache is queried, we record the number of tokens
queried and the number of queried tokens present in the cache
(i.e. hits).
However, the metric of interest is the hit rate - i.e. the number of
......@@ -467,6 +467,9 @@ In general:
hatch](https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#show-hidden-metrics)
for some time before deleting them.
See the [deprecation policy](project:../../contributing/deprecation_policy.md) for
the project-wide deprecation policy.
### Unimplemented - `vllm:tokens_total`
Added by <gh-pr:4464>, but apparently never implemented. This can just be
......
......@@ -16,7 +16,7 @@ In the example above, the KV cache in the first block can be uniquely identified
* Parent hash value: The hash value of the parent hash block.
* Block tokens: A tuple of tokens in this block. The reason to include the exact tokens is to reduce potential hash value collision.
* Extra hashes: Other values required to make this block unique, such as LoRA IDs and multi-modality input hashes (see the example below).
* Extra hashes: Other values required to make this block unique, such as LoRA IDs, multi-modality input hashes (see the example below), and cache salts to isolate caches in multi-tenant environments.
> **Note 1:** We only cache full blocks.
......@@ -76,6 +76,24 @@ Block 3
In the rest of this document, we first introduce the data structure used for prefix caching in vLLM v1, followed by the prefix caching workflow of major KV cache operators (e.g., allocate, append, free, eviction). Finally, we use an example to illustrate the end to end prefix caching workflow.
**Cache Isolation for Security**
To improve privacy in shared environments, vLLM supports isolating prefix cache reuse through optional per-request salting. By including a `cache_salt` in the request, this value is injected into the hash of the first block, ensuring that only requests with the same salt can reuse cached KV blocks. This prevents timing-based attacks where an adversary could infer cached content by observing latency differences. This offers protection without compromising performance.
```json
{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Here is a document with details about the world series: ..."},
{"role": "user", "content": "Who won the world series in 2020?"}
],
"cache_salt": "your-cache-salt"
}
```
With this setup, cache sharing is limited to users or requests that explicitly agree on a common salt, enabling cache reuse within a trust group while isolating others.
> **Note:** Cache isolation is not supported in engine V0.
## Data Structure
The prefix caching in vLLM v1 is implemented in the KV cache manager. The basic building block is the “Block” data class (simplified):
......
......@@ -137,3 +137,9 @@ By default, vLLM will try to determine a set of sizes to capture cudagraph. You
`vllm serve meta-llama/Llama-3.2-1B --compilation-config "{'cudagraph_capture_sizes': [1, 2, 4, 8]}"`
Then it will only capture cudagraph for the specified sizes. It can be useful to have fine-grained control over the cudagraph capture.
### Full Cudagraph capture
It is possible to include attention as part of the cudagraph if using an attention backend that is cudagraph compatible. This can improve performance in some cases such as decode speed for smaller models. Enable this using `--compilation-config "{'full_cuda_graph': True}"`
Currently only FlashAttention 3 is compatible, and only when cascade attention is disabled.
......@@ -42,7 +42,7 @@ Check the ❌ or 🟠 with links to see tracking issue for unsupported feature/h
* [APC](#automatic-prefix-caching)
* [LoRA](#lora-adapter)
* <abbr title="Prompt Adapter">prmpt adptr</abbr>
* [SD](#spec_decode)
* [SD](#spec-decode)
* CUDA graph
* <abbr title="Pooling Models">pooling</abbr>
* <abbr title="Encoder-Decoder Models">enc-dec</abbr>
......@@ -122,7 +122,7 @@ Check the ❌ or 🟠 with links to see tracking issue for unsupported feature/h
*
*
*
- * [SD](#spec_decode)
- * [SD](#spec-decode)
*
*
*
......@@ -377,7 +377,7 @@ Check the ❌ or 🟠 with links to see tracking issue for unsupported feature/h
*
* [](gh-issue:8475)
*
- * [SD](#spec_decode)
- * [SD](#spec-decode)
*
*
*
......
......@@ -66,7 +66,7 @@ The commit ID `0dfa347e8877a4d4ed19ee56c140fa518470028c` may change over time. P
The server entrypoint accepts all other LoRA configuration parameters (`max_loras`, `max_lora_rank`, `max_cpu_loras`,
etc.), which will apply to all forthcoming requests. Upon querying the `/models` endpoint, we should see our LoRA along
with its base model:
with its base model (if `jq` is not installed, you can follow [this guide](https://jqlang.org/download/) to install it.):
```bash
curl localhost:8000/v1/models | jq .
......@@ -134,7 +134,7 @@ curl -X POST http://localhost:8000/v1/load_lora_adapter \
}'
```
Upon a successful request, the API will respond with a 200 OK status code. If an error occurs, such as if the adapter
Upon a successful request, the API will respond with a `200 OK` status code from `vllm serve`, and `curl` returns the response body: `Success: LoRA adapter 'sql_adapter' added successfully`. If an error occurs, such as if the adapter
cannot be found or loaded, an appropriate error message will be returned.
Unloading a LoRA Adapter:
......@@ -142,6 +142,8 @@ Unloading a LoRA Adapter:
To unload a LoRA adapter that has been previously loaded, send a POST request to the `/v1/unload_lora_adapter` endpoint
with the name or ID of the adapter to be unloaded.
Upon a successful request, the API responds with a `200 OK` status code from `vllm serve`, and `curl` returns the response body: `Success: LoRA adapter 'sql_adapter' removed successfully`.
Example request to unload a LoRA adapter:
```bash
......@@ -157,9 +159,12 @@ Alternatively, you can use the LoRAResolver plugin to dynamically load LoRA adap
You can set up multiple LoRAResolver plugins if you want to load LoRA adapters from different sources. For example, you might have one resolver for local files and another for S3 storage. vLLM will load the first LoRA adapter that it finds.
You can either install existing plugins or implement your own.
You can either install existing plugins or implement your own. By default, vLLM comes with a [resolver plugin to load LoRA adapters from a local directory.](https://github.com/vllm-project/vllm/tree/main/vllm/plugins/lora_resolvers)
To enable this resolver, set `VLLM_ALLOW_RUNTIME_LORA_UPDATING` to True, set `VLLM_PLUGINS` to include `lora_filesystem_resolver`, and then set `VLLM_LORA_RESOLVER_CACHE_DIR` to a local directory. When vLLM receives a request using a LoRA adapter `foobar`,
it will first look in the local directory for a directory `foobar`, and attempt to load the contents of that directory as a LoRA adapter. If successful, the request will complete as normal and
that adapter will then be available for normal use on the server.
Steps to implement your own LoRAResolver plugin:
Alternatively, follow these example steps to implement your own plugin:
1. Implement the LoRAResolver interface.
Example of a simple S3 LoRAResolver implementation:
......
......@@ -213,10 +213,13 @@ Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions
:::{important}
A chat template is **required** to use Chat Completions API.
For HF format models, the default chat template is defined inside `chat_template.json` or `tokenizer_config.json`.
Although most models come with a chat template, for others you have to define one yourself.
The chat template can be inferred based on the documentation on the model's HuggingFace repo.
For example, LLaVA-1.5 (`llava-hf/llava-1.5-7b-hf`) requires a chat template that can be found here: <gh-file:examples/template_llava.jinja>
If no default chat template is available, we will first look for a built-in fallback in <gh-file:vllm/transformers_utils/chat_templates/registry.py>.
If no fallback is available, an error is raised and you have to provide the chat template manually via the `--chat-template` argument.
For certain models, we provide alternative chat templates inside <gh-dir:vllm/examples>.
For example, VLM2Vec uses <gh-file:examples/template_vlm2vec.jinja> which is different from the default one for Phi-3-Vision.
:::
### Image Inputs
......
# Prompt Embedding Inputs
This page teaches you how to pass prompt embedding inputs to vLLM.
## What are prompt embeddings?
The traditional flow of text data for a Large Language Model goes from text to token ids (via a tokenizer) then from token ids to prompt embeddings. For a traditional decoder-only model (such as meta-llama/Llama-3.1-8B-Instruct), this step of converting token ids to prompt embeddings happens via a look-up from a learned embedding matrix, but the model is not limited to processing only the embeddings corresponding to its token vocabulary.
:::{note}
Prompt embeddings are currently only supported in the v0 engine.
:::
## Offline Inference
To input multi-modal data, follow this schema in {class}`vllm.inputs.EmbedsPrompt`:
- `prompt_embeds`: A torch tensor representing a sequence of prompt/token embeddings. This has the shape (sequence_length, hidden_size), where sequence length is the number of tokens embeddings and hidden_size is the hidden size (embedding size) of the model.
### Hugging Face Transformers Inputs
You can pass prompt embeddings from Hugging Face Transformers models to the `'prompt_embeds'` field of the prompt embedding dictionary, as shown in the following examples:
```python
from vllm import LLM
import transformers
model_name = "meta-llama/Llama-3.2-1B-Instruct"
# Transformers
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
transformers_model = transformers.AutoModelForCausalLM.from_pretrained(model_name)
llm = LLM(model=model_name, enable_prompt_embeds=True)
# Refer to the HuggingFace repo for the correct format to use
chat = [{"role": "user", "content": "Please tell me about the capital of France."}]
token_ids = tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors='pt')
embedding_layer = transformers_model.get_input_embeddings()
prompt_embeds = embedding_layer(token_ids).squeeze(0)
# Single prompt inference
outputs = llm.generate({
"prompt_embeds": prompt_embeds,
})
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
# Batch inference
chats = [
[{"role": "user", "content": "Please tell me about the capital of France."}],
[{"role": "user", "content": "When is the day longest during the year?"}],
[{"role": "user", "content": "Where is bigger, the moon or the sun?"}]
]
token_ids_list = [
tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors='pt') for chat in chats
]
prompt_embeds_list = [embedding_layer(token_ids).squeeze(0) for token_ids in token_ids_list]
outputs = llm.generate(
[
{
"prompt_embeds": prompt_embeds,
} for prompt_embeds in prompt_embeds_list
]
)
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
```
## Online Serving
Our OpenAI-compatible server accepts prompt embeddings inputs via the [Completions API](https://platform.openai.com/docs/api-reference/completions). Prompt embeddings inputs are added via a new `'prompt_embeds'` key in the JSON package.
When a mixture of `'prompt_embeds'` and `'prompt'` inputs are provided in a single request, the prompt embeds are always returned first.
Prompt embeddings are passed in as base64 encoded torch tensors.
### Transformers Inputs via OpenAI Client
First, launch the OpenAI-compatible server:
```bash
vllm serve meta-llama/Llama-3.2-1B-Instruct --task generate \
--max-model-len 4096 --enable-prompt-embeds
```
Then, you can use the OpenAI client as follows:
```python
from openai import OpenAI
import transformers
import torch
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
model_name = "meta-llama/Llama-3.2-1B-Instruct"
# Transformers
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
transformers_model = transformers.AutoModelForCausalLM.from_pretrained(model_name)
# Refer to the HuggingFace repo for the correct format to use
chat = [{"role": "user", "content": "Please tell me about the capital of France."}]
token_ids = tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors='pt')
embedding_layer = transformers_model.get_input_embeddings()
prompt_embeds = embedding_layer(token_ids).squeeze(0)
# Prompt embeddings
buffer = io.BytesIO()
torch.save(prompt_embeds, buffer)
buffer.seek(0)
binary_data = buffer.read()
encoded_embeds = base64.b64encode(binary_data).decode('utf-8')
completion = client_with_prompt_embeds.completions.create(
model=model_name,
# NOTE: The OpenAI client does not allow `None` as an input to
# `prompt`. Use an empty string if you have no text prompts.
prompt="",
max_tokens=5,
temperature=0.0,
# NOTE: The OpenAI client allows passing in extra JSON body via the
# `extra_body` argument.
extra_body={"prompt_embeds": encoded_embeds}
)
print(completion.choices[0].text)
```
......@@ -19,23 +19,6 @@ FP8 computation is supported on NVIDIA GPUs with compute capability > 8.9 (Ada L
FP8 models will run on compute capability > 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin.
:::
## Quick Start with Online Dynamic Quantization
Dynamic quantization of an original precision BF16/FP16 model to FP8 can be achieved with vLLM without any calibration data required. You can enable the feature by specifying `--quantization="fp8"` in the command line or setting `quantization="fp8"` in the LLM constructor.
In this mode, all Linear modules (except for the final `lm_head`) have their weights quantized down to FP8_E4M3 precision with a per-tensor scale. Activations have their minimum and maximum values calculated during each forward pass to provide a dynamic per-tensor scale for high accuracy. As a result, latency improvements are limited in this mode.
```python
from vllm import LLM
model = LLM("facebook/opt-125m", quantization="fp8")
# INFO 06-10 17:55:42 model_runner.py:157] Loading model weights took 0.1550 GB
result = model.generate("Hello, my name is")
```
:::{warning}
Currently, we load the model at original precision before quantizing down to 8-bits, so you need enough memory to load the whole model.
:::
## Installation
To produce performant FP8 quantized models with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
......@@ -86,7 +69,7 @@ recipe = QuantizationModifier(
# Apply the quantization algorithm.
oneshot(model=model, recipe=recipe)
# Save the model.
# Save the model: Meta-Llama-3-8B-Instruct-FP8-Dynamic
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
......@@ -94,7 +77,7 @@ tokenizer.save_pretrained(SAVE_DIR)
### 3. Evaluating Accuracy
Install `vllm` and `lm-evaluation-harness`:
Install `vllm` and `lm-evaluation-harness` for evaluation:
```console
pip install vllm lm-eval==0.4.4
......@@ -105,7 +88,8 @@ Load and run the model in `vllm`:
```python
from vllm import LLM
model = LLM("./Meta-Llama-3-8B-Instruct-FP8-Dynamic")
model.generate("Hello my name is")
result = model.generate("Hello my name is")
print(result[0].outputs[0].text)
```
Evaluate accuracy with `lm_eval` (for example on 250 samples of `gsm8k`):
......@@ -133,59 +117,22 @@ Here's an example of the resulting scores:
## Troubleshooting and Support
If you encounter any issues or have feature requests, please open an issue on the `vllm-project/llm-compressor` GitHub repository.
## Deprecated Flow
:::{note}
The following information is preserved for reference and search purposes.
The quantization method described below is deprecated in favor of the `llmcompressor` method described above.
:::
For static per-tensor offline quantization to FP8, please install the [AutoFP8 library](https://github.com/neuralmagic/autofp8).
```bash
git clone https://github.com/neuralmagic/AutoFP8.git
pip install -e AutoFP8
```
This package introduces the `AutoFP8ForCausalLM` and `BaseQuantizeConfig` objects for managing how your model will be compressed.
## Offline Quantization with Static Activation Scaling Factors
You can use AutoFP8 with calibration data to produce per-tensor static scales for both the weights and activations by enabling the `activation_scheme="static"` argument.
```python
from datasets import load_dataset
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig
pretrained_model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"
quantized_model_dir = "Meta-Llama-3-8B-Instruct-FP8"
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
If you encounter any issues or have feature requests, please open an issue on the [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor/issues) GitHub repository.
# Load and tokenize 512 dataset samples for calibration of activation scales
ds = load_dataset("mgoin/ultrachat_2k", split="train_sft").select(range(512))
examples = [tokenizer.apply_chat_template(batch["messages"], tokenize=False) for batch in ds]
examples = tokenizer(examples, padding=True, truncation=True, return_tensors="pt").to("cuda")
## Online Dynamic Quantization
# Define quantization config with static activation scales
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")
# Load the model, quantize, and save checkpoint
model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
```
Dynamic quantization of an original precision BF16/FP16 model to FP8 can be achieved with vLLM without any calibration data required. You can enable the feature by specifying `--quantization="fp8"` in the command line or setting `quantization="fp8"` in the LLM constructor.
Your model checkpoint with quantized weights and activations should be available at `Meta-Llama-3-8B-Instruct-FP8/`.
Finally, you can load the quantized model checkpoint directly in vLLM.
In this mode, all Linear modules (except for the final `lm_head`) have their weights quantized down to FP8_E4M3 precision with a per-tensor scale. Activations have their minimum and maximum values calculated during each forward pass to provide a dynamic per-tensor scale for high accuracy. As a result, latency improvements are limited in this mode.
```python
from vllm import LLM
model = LLM(model="Meta-Llama-3-8B-Instruct-FP8/")
# INFO 06-10 21:15:41 model_runner.py:159] Loading model weights took 8.4596 GB
model = LLM("facebook/opt-125m", quantization="fp8")
# INFO 06-10 17:55:42 model_runner.py:157] Loading model weights took 0.1550 GB
result = model.generate("Hello, my name is")
print(result[0].outputs[0].text)
```
:::{warning}
Currently, we load the model at original precision before quantizing down to 8-bits, so you need enough memory to load the whole model.
:::
......@@ -17,6 +17,7 @@ gptqmodel
int4
int8
fp8
modelopt
quark
quantized_kvcache
torchao
......
......@@ -18,6 +18,12 @@ To use INT4 quantization with vLLM, you'll need to install the [llm-compressor](
pip install llmcompressor
```
Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:
```console
pip install vllm lm-eval==0.4.4
```
## Quantization Process
The quantization process involves four main steps:
......@@ -87,7 +93,7 @@ oneshot(
num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)
# Save the compressed model
# Save the compressed model: Meta-Llama-3-8B-Instruct-W4A16-G128
SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A16-G128"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
......@@ -163,4 +169,4 @@ recipe = GPTQModifier(
## Troubleshooting and Support
If you encounter any issues or have feature requests, please open an issue on the [`vllm-project/llm-compressor`](https://github.com/vllm-project/llm-compressor) GitHub repository. The full INT4 quantization example in `llm-compressor` is available [here](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16/llama3_example.py).
If you encounter any issues or have feature requests, please open an issue on the [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor/issues) GitHub repository. The full INT4 quantization example in `llm-compressor` is available [here](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16/llama3_example.py).
......@@ -19,6 +19,12 @@ To use INT8 quantization with vLLM, you'll need to install the [llm-compressor](
pip install llmcompressor
```
Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:
```console
pip install vllm lm-eval==0.4.4
```
## Quantization Process
The quantization process involves four main steps:
......@@ -91,7 +97,7 @@ oneshot(
num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)
# Save the compressed model
# Save the compressed model: Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token
SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-Dynamic-Per-Token"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
......@@ -132,4 +138,4 @@ Quantized models can be sensitive to the presence of the `bos` token. Make sure
## Troubleshooting and Support
If you encounter any issues or have feature requests, please open an issue on the [`vllm-project/llm-compressor`](https://github.com/vllm-project/llm-compressor) GitHub repository.
If you encounter any issues or have feature requests, please open an issue on the [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor/issues) GitHub repository.
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment