Our [OpenAI Compatible Server](../serving/openai_compatible_server) can be used for online inference.
Our [OpenAI Compatible Server](../serving/openai_compatible_server) provides endpoints that correspond to the offline APIs:
Please click on the above link for more details on how to launch the server.
### Completions API
-[Completions API](#completions-api) is similar to `LLM.generate` but only accepts text.
-[Chat API](#chat-api) is similar to `LLM.chat`, accepting both text and [multi-modal inputs](#multimodal-inputs) for models with a chat template.
Our Completions API is similar to `LLM.generate` but only accepts text.
It is compatible with [OpenAI Completions API](https://platform.openai.com/docs/api-reference/completions)
so that you can use OpenAI client to interact with it.
A code example can be found in [examples/openai_completion_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_completion_client.py).
### Chat API
Our Chat API is similar to `LLM.chat`, accepting both text and [multi-modal inputs](#multimodal-inputs).
It is compatible with [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat)
so that you can use OpenAI client to interact with it.
A code example can be found in [examples/openai_chat_completion_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client.py).
@@ -106,22 +106,8 @@ A code example can be found in [examples/offline_inference_scoring.py](https://g
...
@@ -106,22 +106,8 @@ A code example can be found in [examples/offline_inference_scoring.py](https://g
## Online Inference
## Online Inference
Our [OpenAI Compatible Server](../serving/openai_compatible_server.md) can be used for online inference.
Our [OpenAI Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:
Please click on the above link for more details on how to launch the server.
### Embeddings API
-[Pooling API](#pooling-api) is similar to `LLM.encode`, being applicable to all types of pooling models.
-[Embeddings API](#embeddings-api) is similar to `LLM.embed`, accepting both text and [multi-modal inputs](#multimodal-inputs) for embedding models.
Our Embeddings API is similar to `LLM.embed`, accepting both text and [multi-modal inputs](#multimodal-inputs).
-[Score API](#score-api) is similar to `LLM.score` for cross-encoder models.
The text-only API is compatible with [OpenAI Embeddings API](https://platform.openai.com/docs/api-reference/embeddings)
so that you can use OpenAI client to interact with it.
A code example can be found in [examples/openai_embedding_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_embedding_client.py).
The multi-modal API is an extension of the [OpenAI Embeddings API](https://platform.openai.com/docs/api-reference/embeddings)
that incorporates [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat),
so it is not part of the OpenAI standard. Please see [](#multimodal-inputs) for more details on how to use it.
### Score API
Our Score API is similar to `LLM.score`.
Please see [this page](#score-api) for more details on how to use it.
see our [Multimodal Inputs](../usage/multimodal_inputs.md) guide for more information.
see our [Multimodal Inputs](../usage/multimodal_inputs.md) guide for more information.
- *Note: `image_url.detail` parameter is not supported.*
- *Note: `image_url.detail` parameter is not supported.*
#### Code example
See [examples/openai_chat_completion_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client.py).
#### Extra parameters
#### Extra parameters
The following [sampling parameters (click through to see documentation)](../dev/sampling_params.md) are supported.
The following [sampling parameters (click through to see documentation)](../dev/sampling_params.md) are supported.
...
@@ -230,15 +242,20 @@ The following extra parameters are supported:
...
@@ -230,15 +242,20 @@ The following extra parameters are supported:
(embeddings-api)=
(embeddings-api)=
### Embeddings API
### Embeddings API
Refer to [OpenAI's API reference](https://platform.openai.com/docs/api-reference/embeddings) for more details.
Our Embeddings API is compatible with [OpenAI's Embeddings API](https://platform.openai.com/docs/api-reference/embeddings);
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
If the model has a [chat template](#chat-template), you can replace `inputs` with a list of `messages` (same schema as [Chat Completions API](#chat-api))
If the model has a [chat template](#chat-template), you can replace `inputs` with a list of `messages` (same schema as [Chat API](#chat-api))
which will be treated as a single prompt to the model.
which will be treated as a single prompt to the model.
```{tip}
```{tip}
This enables multi-modal inputs to be passed to embedding models, see [this page](../usage/multimodal_inputs.md) for details.
This enables multi-modal inputs to be passed to embedding models, see [this page](#multimodal-inputs) for details.
```
```
#### Code example
See [examples/openai_embedding_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_embedding_client.py).
#### Extra parameters
#### Extra parameters
The following [pooling parameters (click through to see documentation)](../dev/pooling_params.md) are supported.
The following [pooling parameters (click through to see documentation)](../dev/pooling_params.md) are supported.
...
@@ -268,20 +285,35 @@ For chat-like input (i.e. if `messages` is passed), these extra parameters are s
...
@@ -268,20 +285,35 @@ For chat-like input (i.e. if `messages` is passed), these extra parameters are s
(tokenizer-api)=
(tokenizer-api)=
### Tokenizer API
### Tokenizer API
The Tokenizer API is a simple wrapper over [HuggingFace-style tokenizers](https://huggingface.co/docs/transformers/en/main_classes/tokenizer).
Our Tokenizer API is a simple wrapper over [HuggingFace-style tokenizers](https://huggingface.co/docs/transformers/en/main_classes/tokenizer).
It consists of two endpoints:
It consists of two endpoints:
- `/tokenize` corresponds to calling `tokenizer.encode()`.
- `/tokenize` corresponds to calling `tokenizer.encode()`.
- `/detokenize` corresponds to calling `tokenizer.decode()`.
- `/detokenize` corresponds to calling `tokenizer.decode()`.
(pooling-api)=
### Pooling API
Our Pooling API encodes input prompts using a [pooling model](../models/pooling_models.md) and returns the corresponding hidden states.
The input format is the same as [Embeddings API](#embeddings-api), but the output data can contain an arbitrary nested list, not just a 1-D list of floats.
#### Code example
See [examples/openai_pooling_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_pooling_client.py).
(score-api)=
(score-api)=
### Score API
### Score API
The Score API applies a cross-encoder model to predict scores for sentence pairs.
Our Score API applies a cross-encoder model to predict scores for sentence pairs.
Usually, the score for a sentence pair refers to the similarity between two sentences, on a scale of 0 to 1.
Usually, the score for a sentence pair refers to the similarity between two sentences, on a scale of 0 to 1.
You can find the documentation for these kind of models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).
You can find the documentation for these kind of models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).
#### Code example
See [examples/openai_cross_encoder_score.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_cross_encoder_score.py).
#### Single inference
#### Single inference
You can pass a string to both `text_1` and `text_2`, forming a single sentence pair.
You can pass a string to both `text_1` and `text_2`, forming a single sentence pair.