@@ -20,12 +20,16 @@ You can pass prompt embeddings from Hugging Face Transformers models to the `'p
...
@@ -20,12 +20,16 @@ You can pass prompt embeddings from Hugging Face Transformers models to the `'p
## Online Serving
## Online Serving
Our OpenAI-compatible server accepts prompt embeddings inputs via the [Completions API](https://platform.openai.com/docs/api-reference/completions). Prompt embeddings inputs are added via a new `'prompt_embeds'` key in the JSON package.
Our OpenAI-compatible server accepts prompt embeddings inputs via the [Completions API](https://platform.openai.com/docs/api-reference/completions). Prompt embeddings inputs are added via a new `'prompt_embeds'` key in the JSON package and are enabled by the `--enable-prompt-embeds` flag in `vllm serve`.
When a mixture of `'prompt_embeds'` and `'prompt'` inputs are provided in a single request, the prompt embeds are always returned first.
When a mixture of `'prompt_embeds'` and `'prompt'` inputs are provided in a single request, the prompt embeds are always returned first.
Prompt embeddings are passed in as base64 encoded torch tensors.
Prompt embeddings are passed in as base64 encoded torch tensors.
!!! warning
The vLLM engine may crash if incorrect shape of embeddings is passed.